Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. They are text-to-text, decoder-only large language models, available in English, with open weights, pre-trained variants, and instruction-tuned variants. Gemma models are well-suited for a variety of text generation tasks, including question answering, summarization, and reasoning. Their relatively small size makes it possible to deploy them in environments with limited resources such as a laptop, desktop, or your own cloud infrastructure, democratizing access to state-of-the-art AI models and helping foster innovation for everyone.
Below we share some code snippets on how to quickly get started with running the model. First make sure to pip install -U transformers, then copy the snippet from the section that is relevant for your use case.
pip install -U transformers
1. Running the model on a CPU
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b")

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt")

outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))
2. Running the model on a single / multi GPU
# pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b", device_map="auto")

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))
3. Running the model on a GPU using different precisions
3.1 Using torch.float16
# pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b", device_map="auto", torch_dtype=torch.float16)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))
3.2 Using torch.bfloat16
# pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b", device_map="auto", torch_dtype=torch.bfloat16)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))
4. Quantized versions through bitsandbytes
4.1 Using 8-bit precision (int8)
# pip install bitsandbytes accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b", quantization_config=quantization_config)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))
4.2 Using 4-bit precision (int4)
# pip install bitsandbytes accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True)

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b", quantization_config=quantization_config)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))
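Optionally, the 4-bit configuration can be tuned further. The following is a sketch of commonly used BitsAndBytesConfig options (NF4 quantization, bfloat16 compute dtype, double quantization); the specific choices here are illustrative assumptions, not settings from the original snippet:

# pip install bitsandbytes accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

# NF4 4-bit weights, bfloat16 compute dtype, and nested (double) quantization to save extra memory.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b", quantization_config=quantization_config)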
5. flash-attn
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
+   attn_implementation="flash_attention_2"
).to(0)
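For reference, a minimal end-to-end sketch of the same loading call with Flash Attention 2 enabled (assuming the flash-attn package is installed and a compatible CUDA GPU is available; model_id and the prompt follow the earlier examples):

# pip install flash-attn  (requires a CUDA GPU with a supported architecture)
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "google/gemma-7b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Same loading call as the GPU examples above, with Flash Attention 2 enabled.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
).to(0)

input_ids = tokenizer("Write me a poem about Machine Learning.", return_tensors="pt").to(model.device)
outputs = model.generate(**input_ids, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))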
6. Chat Template
The instruction-tuned models use a chat template that must be adhered to for conversational use. The easiest way to apply it is using the tokenizer's built-in chat template, as shown in the following snippet.
Let's load the model and apply the chat template to a conversation. In this example, we'll start with a single user interaction:
from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

model_id = "gg-hf/gemma-7b-it"
dtype = torch.bfloat16

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype=dtype,
)

chat = [
    {"role": "user", "content": "Write a hello world program"},
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
At this point, the prompt contains the following text:
<bos><start_of_turn>user
Write a hello world program<end_of_turn>
<start_of_turn>model
As you can see, each turn is preceded by a <start_of_turn> delimiter and then the role of the entity (either user, for content supplied by the user, or model for LLM responses). Turns end with the <end_of_turn> token.
You can follow this format to build the prompt manually, if you need to do it without the tokenizer's chat template; see the sketch below.
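For illustration, a minimal sketch of what building that prompt string by hand could look like, assuming the exact layout shown above (special tokens and newlines included):

# Build the prompt by hand following the format shown above (sketch, not an official API).
user_message = "Write a hello world program"

# <bos> is written explicitly here because the generation step below uses
# add_special_tokens=False, so the tokenizer will not add it again.
prompt = (
    "<bos><start_of_turn>user\n"
    f"{user_message}<end_of_turn>\n"
    "<start_of_turn>model\n"
)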
After the prompt has been prepared, generation can be performed like this:
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=150)
print(tokenizer.decode(outputs[0]))
With 8-bit precision, the model = ... step raises an error: ImportError: Using bitsandbytes 8-bit quantization requires Accelerate: pip install accelerate and the latest version of bitsandbytes: pip install -i https://pypi.org/simple/ bitsandbytes. However, Accelerate is in fact installed and bitsandbytes is the latest version. Quite a few people have asked about this on GitHub as well, but there doesn't seem to be a good solution. Thanks.
pip install -U xxx ?
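As a first diagnostic step (a sketch, not a confirmed fix): this error is sometimes seen when the packages were installed into a different Python environment than the one running the script, or when the installed torch build has no CUDA support, so printing what the running interpreter actually sees can help narrow it down:

# Check what the running interpreter actually sees.
import importlib.metadata as md
import torch

for pkg in ("transformers", "accelerate", "bitsandbytes"):
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "NOT FOUND in this environment")

# bitsandbytes 8-bit quantization needs a CUDA-enabled PyTorch build and a visible GPU.
print("torch", torch.__version__, "CUDA available:", torch.cuda.is_available())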