0. Principle
Layered inference is essentially a divide-and-conquer approach.

It does not use quantization, distillation, pruning, or any other model compression technique.

Large language models are big and memory-hungry mainly because of their architecture: an embedding layer followed by many identical transformer layers (an 8B model has 32 of them), ending with the output projection. During inference, however, each layer is independent and depends only on the output of the previous layer, so once a layer has run, its memory can be released and only its output needs to be kept. AirLLM builds layered inference on this observation: in transformer-based LLM inference the layers execute sequentially, the output of one layer is the input of the next, and only one layer runs at a time. There is therefore no need to keep all layers in GPU memory; each layer can be loaded from disk when it is executed, used for its computation, and then freed completely. The GPU memory needed at any moment is only the parameter size of a single transformer layer, about 1/32 of the full model, roughly 417 MB.
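The idea can be sketched in a few lines of PyTorch. This is only an illustration of the principle, not AirLLM's actual code: the per-layer file names and the build_layer factory below are hypothetical.

import torch
from safetensors.torch import load_file

# Hypothetical per-layer weight files, one safetensors file per transformer layer.
LAYER_FILES = [f"splitted_model/model.layers.{i}.safetensors" for i in range(32)]

@torch.no_grad()
def run_layers(hidden_states: torch.Tensor, build_layer, device: str = "cuda") -> torch.Tensor:
    """Run the transformer stack one layer at a time, keeping only one layer on the GPU."""
    for path in LAYER_FILES:
        layer = build_layer()                   # construct an empty decoder layer (hypothetical factory)
        layer.load_state_dict(load_file(path))  # read only this layer's weights from disk
        layer.to(device)
        hidden_states = layer(hidden_states)    # only the current layer occupies GPU memory
        del layer                               # drop this layer's parameters ...
        torch.cuda.empty_cache()                # ... so the next layer can reuse the ~417 MB
    return hidden_states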
On top of this, AirLLM uses flash attention to deeply optimize CUDA memory access for a multi-fold speedup, and shards the model file layer by layer.
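In the run logs below this optimization surfaces as transformers selecting LlamaSdpaAttention, PyTorch's fused scaled-dot-product-attention kernel. A minimal stand-alone illustration of that call (toy tensors, with shapes chosen to match Llama-3-8B's 32 heads of dimension 128):

import torch
import torch.nn.functional as F

# Toy (batch, heads, seq_len, head_dim) tensors with arbitrary values.
q = torch.randn(1, 32, 128, 128, device="cuda", dtype=torch.bfloat16)
k = torch.randn(1, 32, 128, 128, device="cuda", dtype=torch.bfloat16)
v = torch.randn(1, 32, 128, 128, device="cuda", dtype=torch.bfloat16)

# Fused attention kernel: avoids materializing the full seq_len x seq_len attention matrix.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 32, 128, 128])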
It uses the meta device feature provided by HuggingFace Accelerate: when a model is loaded via the meta device, no model data is actually read, only the code is loaded, and memory usage is zero.
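A minimal sketch of the meta-device behaviour (assuming the model config can be fetched from the Hub):

from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Build the full 8B architecture without allocating any weight memory.
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

print(next(model.parameters()).device)  # meta: no real tensors were created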
It also offers optional quantization through the compression parameter; the supported values are '4bit' and '8bit', for 4-bit or 8-bit block-wise quantization.
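The compression path relies on bitsandbytes (note the ">>>> bitsandbytes installed" check in the logs below). Purely to illustrate what block-wise quantization means, here is a toy absmax 8-bit block quantizer in plain PyTorch; it is not the bitsandbytes implementation, and the block size of 64 is an arbitrary choice:

import torch

def quantize_blockwise_int8(w: torch.Tensor, block_size: int = 64):
    """Toy absmax block-wise 8-bit quantization: one floating-point scale per block."""
    flat = w.flatten().float()
    pad = (-flat.numel()) % block_size
    flat = torch.cat([flat, flat.new_zeros(pad)])  # pad to a whole number of blocks
    blocks = flat.view(-1, block_size)
    scales = blocks.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 127.0
    q = torch.clamp((blocks / scales).round(), -127, 127).to(torch.int8)
    return q, scales

def dequantize_blockwise_int8(q, scales, shape, numel):
    return (q.float() * scales).flatten()[:numel].view(shape)

w = torch.randn(4096, 4096)
q, s = quantize_blockwise_int8(w)
w_hat = dequantize_blockwise_int8(q, s, w.shape, w.numel())
print((w - w_hat).abs().max())  # small per-element reconstruction error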
Reference: https://github.com/lyogavin/Anima
1. Inference with Meta-Llama-3-8B-Instruct
1.1 Inspect config.json
{ "architectures": [ "LlamaForCausalLM" ], "attention_bias": false, "attention_dropout": 0.0, "bos_token_id": 128000, "eos_token_id": 128001, "hidden_act": "silu", "hidden_size": 4096, "initializer_range": 0.02, "intermediate_size": 14336, "max_position_embeddings": 8192, "model_type": "llama", "num_attention_heads": 32, "num_hidden_layers": 32, "num_key_value_heads": 8, "pretraining_tp": 1, "rms_norm_eps": 1e-05, "rope_scaling": null, "rope_theta": 500000.0, "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": "4.40.0.dev0", "use_cache": true, "vocab_size": 128256 } |
- Model type (model_type): llama
- Hidden activation function (hidden_act): silu
- Hidden size (hidden_size): 4096
- Intermediate size (intermediate_size): 14336
- Maximum position embeddings (max_position_embeddings): 8192
- Number of attention heads (num_attention_heads): 32
- Number of hidden layers (num_hidden_layers): 32
- Number of key/value heads (num_key_value_heads): 8
- Vocabulary size (vocab_size): 128256
- Initializer range (initializer_range): 0.02
- Attention dropout (attention_dropout): 0.0
- RMSNorm epsilon (rms_norm_eps): 1e-05
- Tie word embeddings (tie_word_embeddings): false
- Tensor data type (torch_dtype): bfloat16
- Transformers version (transformers_version): 4.40.0.dev0
- Use cache (use_cache): true
- Beginning-of-sequence token ID (bos_token_id): 128000
- End-of-sequence token ID (eos_token_id): 128001
- RoPE theta (rope_theta): 500000.0
The parameter num_hidden_layers in the config is 32, indicating that the model weights contain 32 transformer layers.
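These values also let us sanity-check the per-layer size quoted earlier. A rough count of one decoder layer's parameters (grouped-query attention with 32 query heads and 8 key/value heads, plus the SwiGLU MLP) gives about 218M parameters, roughly 416 MiB in bfloat16, which matches both the "417MB" figure above and the ~436,225,016-byte split files shown later:

# Rough per-layer parameter count for Llama-3-8B, computed from the config.json values.
hidden, inter, heads, kv_heads = 4096, 14336, 32, 8
head_dim = hidden // heads                 # 128

attn = hidden * hidden                     # q_proj
attn += hidden * kv_heads * head_dim       # k_proj (grouped-query attention: only 8 KV heads)
attn += hidden * kv_heads * head_dim       # v_proj
attn += hidden * hidden                    # o_proj

mlp = 3 * hidden * inter                   # gate_proj, up_proj, down_proj (SwiGLU)
norms = 2 * hidden                         # input / post-attention RMSNorm weights

params = attn + mlp + norms
print(params)                              # 218112000 parameters
print(params * 2 / 1024**2)                # ~416 MiB per layer in bfloat16 (2 bytes per parameter)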
1.2 A first loading attempt
Following the airllm example, load Meta-Llama-3-8B-Instruct:
from airllm import AutoModel

MAX_LENGTH = 128

# could use hugging face model repo id:
model = AutoModel.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

input_text = [
    'What is the capital of United States?',
    #'I like',
]

input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
    padding=False)

print(f'input_tokens:{len(input_tokens.input_ids[0])}')

generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=20,
    use_cache=True,
    return_dict_in_generate=True)

print(f'output_tokens:{len(generation_output.sequences[0])}')

output = model.tokenizer.decode(generation_output.sequences[0])

print(output)
Run output:
python test_airllm_8B.py
>>>> bitsandbytes installed
>>>> cache_utils installed
found index file...
  0%|          | 0/35 [00:00<?, ?it/s]Loading shard 1/4
saved as: meta-llama/Meta-Llama-3-8B-Instruct/splitted_model/model.embed_tokens.safetensors
  3%|███▋      | 1/35 [00:39<22:19, 39.39s/it]
saved as: meta-llama/Meta-Llama-3-8B-Instruct/splitted_model/model.layers.0.safetensors
  6%|███████   | 2/35 [00:55<14:09, 25.74s/it]
saved as: meta-llama/Meta-Llama-3-8B-Instruct/splitted_model/model.layers.1.safetensors
  9%|██████████▉ | 3/35 [01:12<11:40, 21.88s/it]
saved as: meta-llama/Meta-Llama-3-8B-Instruct/splitted_model/model.layers.2.safetensors
 11%|██████████████▌ | 4/35 [01:29<10:16, 19.87s/it]
saved as: meta-llama/Meta-Llama-3-8B-Instruct/splitted_model/model.layers.3.safetensors
...
saved as: meta-llama/Meta-Llama-3-8B-Instruct/splitted_model/lm_head.safetensors
100%|███████████████████ | 35/35 [10:02<00:00, 17.23s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
new version of transfomer, no need to use BetterTransformer, try setting attn impl to sdpa...
input_tokens:8
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
new version of transfomer, no need to use BetterTransformer, try setting attn impl to sdpa...
attn imp: <class 'transformers.models.llama.modeling_llama.LlamaSdpaAttention'>
running layers(self.running_device): 100%|█████████████████████████████| 35/35 [02:18<00:00,  3.95s/it]
new version of transfomer, no need to use BetterTransformer, try setting attn impl to sdpa...
attn imp: <class 'transformers.models.llama.modeling_llama.LlamaSdpaAttention'>
running layers(self.running_device): 100%|█████████████████████████████| 35/35 [02:20<00:00,  4.01s/it]
... (the same three lines repeat for each remaining decode round, every pass over 35/35 layers taking between 2m13s and 2m24s) ...
new version of transfomer, no need to use BetterTransformer, try setting attn impl to sdpa...
attn imp: <class 'transformers.models.llama.modeling_llama.LlamaSdpaAttention'>
running layers(self.running_device): 100%|█████████████████████████████| 35/35 [02:17<00:00,  3.94s/it]
output_tokens:28
What is the capital of United States? A) Washington D.C. B) New York C) Los Angeles D) Chicago Answer:
The log shows that each layer is in fact saved to the splitted_model directory.
Let's list the files:
ls -l
total 15684244
-rwxrwxrwx 1 tony tony 1050673248 Apr 28 22:19 lm_head.safetensors
-rwxrwxrwx 1 tony tony          0 Apr 28 22:19 lm_head.safetensors.done
-rwxrwxrwx 1 tony tony 1050673264 Apr 28 22:09 model.embed_tokens.safetensors
-rwxrwxrwx 1 tony tony          0 Apr 28 22:09 model.embed_tokens.safetensors.done
-rwxrwxrwx 1 tony tony  436225016 Apr 28 22:10 model.layers.0.safetensors
-rwxrwxrwx 1 tony tony          0 Apr 28 22:10 model.layers.0.safetensors.done
-rwxrwxrwx 1 tony tony  436225016 Apr 28 22:10 model.layers.1.safetensors
-rwxrwxrwx 1 tony tony          0 Apr 28 22:10 model.layers.1.safetensors.done
... (model.layers.2.safetensors through model.layers.31.safetensors, each roughly 436 MB, with an empty .done marker per file) ...
-rwxrwxrwx 1 tony tony       8280 Apr 28 22:18 model.norm.safetensors
-rwxrwxrwx 1 tony tony          0 Apr 28 22:18 model.norm.safetensors.done
The first run has to create and save all of these files, so it is very slow: it takes a bit over two hours, and each layer file is about 417 MB (the ~436-million-byte files in the listing, i.e. roughly 416 MiB).
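For a quick sanity check of what one of these split files contains, it can be opened directly with the safetensors library (the path is taken from the listing above; this snippet is not part of AirLLM):

import os
from safetensors.torch import load_file

path = "meta-llama/Meta-Llama-3-8B-Instruct/splitted_model/model.layers.0.safetensors"

state_dict = load_file(path)  # all tensors of a single transformer layer
for name, tensor in state_dict.items():
    print(name, tuple(tensor.shape), tensor.dtype)

total_bytes = sum(t.numel() * t.element_size() for t in state_dict.values())
print(f"tensors: {total_bytes / 1024**2:.1f} MiB")            # ~416 MiB of bfloat16 weights
print(f"on disk: {os.path.getsize(path) / 1024**2:.1f} MiB")  # matches the ls output above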
On a second run it only needs to check whether these files already exist, so it is naturally much faster.
output_tokens:28
What is the capital of United States? A) Washington D.C. B) New York C) Los Angeles D) Chicago Answer:

real    47m12.458s
user    18m15.109s
sys     20m54.825s
As before, the input is 8 tokens and the output is 28 tokens, so 28 - 8 = 20 new tokens are generated: 20 decode rounds at a bit over 2 minutes each, which accounts for almost all of the 47-minute wall-clock time.
1.3 Loading with 4-bit compression
Just add compression='4bit' to the from_pretrained call; passing '8bit' selects 8-bit compression instead.
from airllm import AutoModel

MAX_LENGTH = 128

# could use hugging face model repo id:
model = AutoModel.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    compression='4bit',
    prefetching=False)

input_text = [
    'What is the capital of United States?',
    #'I like',
]

input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
    padding=False)

print(f'input_tokens:{len(input_tokens.input_ids[0])}')

generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=20,
    use_cache=True,
    return_dict_in_generate=True)

print(f'output_tokens:{len(generation_output.sequences[0])}')

output = model.tokenizer.decode(generation_output.sequences[0])

print(output)
time python test_airllm_8B.py
Running this creates a new 4-bit split directory; generating the files takes roughly 10 minutes.
ls -la
total 4411404
drwxrwxrwx 1 tony tony       4096 Apr 29 11:32 .
drwxrwxrwx 1 tony tony       4096 Apr 29 11:22 ..
-rwxrwxrwx 1 tony tony  295502380 Apr 29 11:32 lm_head.safetensors
-rwxrwxrwx 1 tony tony          0 Apr 29 11:32 lm_head.safetensors.done
-rwxrwxrwx 1 tony tony  295502420 Apr 29 11:23 model.embed_tokens.safetensors
-rwxrwxrwx 1 tony tony          0 Apr 29 11:23 model.embed_tokens.safetensors.done
-rwxrwxrwx 1 tony tony  122693697 Apr 29 11:23 model.layers.0.safetensors
-rwxrwxrwx 1 tony tony          0 Apr 29 11:23 model.layers.0.safetensors.done
-rwxrwxrwx 1 tony tony  122693697 Apr 29 11:23 model.layers.1.safetensors
-rwxrwxrwx 1 tony tony          0 Apr 29 11:23 model.layers.1.safetensors.done
... (model.layers.2.safetensors through model.layers.31.safetensors, each roughly 122.7 MB, with an empty .done marker per file) ...
-rwxrwxrwx 1 tony tony       2820 Apr 29 11:32 model.norm.safetensors
-rwxrwxrwx 1 tony tony          0 Apr 29 11:32 model.norm.safetensors.done
Each layer file is now about 118 MB, roughly a quarter of the uncompressed ~436 MB; it is slightly more than an exact 1/4, presumably because the block-wise quantization also stores per-block scaling metadata.
Total time for the run:
real    49m44.333s
user    3m19.002s
sys     1m52.240s
1.4 Keeping the 4-bit layers in memory after loading
The original code reloads every layer from its file on every pass. Let's change it so that a layer, once loaded, is kept in memory. First clone the original project (the version at the time of writing is 2.8.3):
git clone https://github.com/lyogavin/Anima/
1.4.1 Bump the version number
cd Anima/air_llm
Edit the version number in setup.py:
...
setuptools.setup(
    name="airllm",
    version="2.8.5",
...
1.4.2 Modify the load_layer_to_cpu function
Two changes are needed in airllm/airllm_base.py. First, add a module-level variable loaded_layers to hold the layers after they have been loaded:
...
loaded_layers = {}

class AirLLMBaseModel(GenerationMixin):
...
Second, the modified load_layer_to_cpu function:
def load_layer_to_cpu(self, layer_name):

    t = time.time()

    # Check whether this layer is already in the global cache
    if layer_name in loaded_layers:
        state_dict = loaded_layers[layer_name]
    else:
        load_layer_output = load_layer(self.checkpoint_path, layer_name, self.profiling_mode)
        elapsed_time = time.time() - t

        if self.profiling_mode:
            state_dict, compression_time = load_layer_output
            disk_loading_time = elapsed_time - compression_time

            self.profiler.add_profiling_time('load_safe_tensor', disk_loading_time)
            self.profiler.add_profiling_time('compression_time', compression_time)
        else:
            state_dict = load_layer_output

        # Store the freshly loaded layer in the global cache
        loaded_layers[layer_name] = state_dict

    # pin memory:
    if self.prefetching:
        t = time.time()
        for k in state_dict.keys():
            state_dict[k].pin_memory()

        elapsed_time = time.time() - t
        if self.profiling_mode:
            self.profiler.add_profiling_time('pin_memory_to_trigger_load', elapsed_time)

    return state_dict
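The trade-off of this cache is CPU RAM: after the first decode round the process holds every split 4-bit tensor in memory, roughly the total size of the splitted_model.4bit directory (about 4.2 GiB according to the listing above). A small helper to estimate that cost in advance; the directory path is taken from the run log:

import glob
import os

# Sum the on-disk size of the 4-bit split files; the in-memory cache grows to roughly this size.
split_dir = "meta-llama/Meta-Llama-3-8B-Instruct/splitted_model.4bit"
files = glob.glob(os.path.join(split_dir, "*.safetensors"))  # excludes the empty .done markers
total = sum(os.path.getsize(p) for p in files)
print(f"{total / 1024**3:.2f} GiB")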
1.4.3 Install it into your environment
pip install -e .
1.4.4 Run the test again
time python test_airllm_8B.py
>>>> bitsandbytes installed
>>>> cache_utils installed
found index file...
found_layers:{'model.embed_tokens.': True, 'model.layers.0.': True, 'model.layers.1.': True, ..., 'model.layers.31.': True, 'model.norm.': True, 'lm_head.': True}
saved layers already found in meta-llama/Meta-Llama-3-8B-Instruct/splitted_model.4bit
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
new version of transfomer, no need to use BetterTransformer, try setting attn impl to sdpa...
attn imp: <class 'transformers.models.llama.modeling_llama.LlamaSdpaAttention'>
not support prefetching for compression for now. loading with no prepetching mode.
input_tokens:8
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
new version of transfomer, no need to use BetterTransformer, try setting attn impl to sdpa...
attn imp: <class 'transformers.models.llama.modeling_llama.LlamaSdpaAttention'>
running layers(self.running_device): 100%|█████████████████| 35/35 [01:59<00:00,  3.41s/it]
new version of transfomer, no need to use BetterTransformer, try setting attn impl to sdpa...
attn imp: <class 'transformers.models.llama.modeling_llama.LlamaSdpaAttention'>
running layers(self.running_device): 100%|█████████████████| 35/35 [00:07<00:00,  4.45it/s]
... (the same three lines repeat for each remaining decode round, every pass over 35/35 layers now taking only 6-7 seconds at roughly 4.4-5.5 it/s) ...
new version of transfomer, no need to use BetterTransformer, try setting attn impl to sdpa...
attn imp: <class 'transformers.models.llama.modeling_llama.LlamaSdpaAttention'>
running layers(self.running_device): 100%|█████████████████| 35/35 [00:06<00:00,  5.10it/s]
output_tokens:28
What is the capital of United States? A) Washington D.C. B) New York C) Los Angeles D) Chicago Answer:

real    4m27.649s
user    2m48.139s
sys     0m10.896s
The first decode round takes about the same time as before (1m59s), but the remaining 19 rounds take only about 7 seconds each, so the overall wall-clock time drops from roughly 50 minutes to 4m28s, about a 10x reduction.