Overview
Llama-3.1-8B-Fusion-9010 is a mixed model that combines the strengths of two powerful Llama-based models: arcee-ai/Llama-3.1-SuperNova-Lite and mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated. The weights are blended in a 9:1 ratio, with 90% of the weights coming from SuperNova-Lite and 10% from the abliterated Meta-Llama-3.1-8B-Instruct model. Although this is a simple blend, the model is fully usable and does not produce garbled output. This is an experiment: I tested the 9:1, 8:2, 7:3, 6:4, and 5:5 ratios separately to see how much impact each has on the model. Evaluation reports for all of these models will be provided later.
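The merging script itself is not published in this card. The following is a minimal sketch of how a 9:1 linear weight blend of the two source checkpoints could be reproduced with transformers; the output directory name and the assumption that every tensor is interpolated uniformly are illustrative, not a statement of the author's exact method.

```python
import torch
from transformers import AutoModelForCausalLM

# Illustrative 9:1 linear blend; the author's actual merging procedure may differ.
source_a = "arcee-ai/Llama-3.1-SuperNova-Lite"                 # 90% of the weights
source_b = "mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated"   # 10% of the weights
ratio = 0.9

model_a = AutoModelForCausalLM.from_pretrained(source_a, torch_dtype=torch.bfloat16)
model_b = AutoModelForCausalLM.from_pretrained(source_b, torch_dtype=torch.bfloat16)

state_b = model_b.state_dict()

# Both checkpoints share the Llama-3.1-8B architecture, so their state dicts
# have identical keys and shapes; interpolate each tensor in place.
with torch.no_grad():
    for name, tensor_a in model_a.state_dict().items():
        tensor_a.copy_(ratio * tensor_a + (1.0 - ratio) * state_b[name])

model_a.save_pretrained("Llama-3.1-8B-Fusion-9010")  # hypothetical output path
```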
Model Details
- Base models: arcee-ai/Llama-3.1-SuperNova-Lite, mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated
- Model size: 8B parameters
- Architecture: Llama 3.1
- Mixing ratio: 9:1 (SuperNova-Lite : Meta-Llama-3.1-8B-Instruct-abliterated)
Key Features
- SuperNova-Lite contribution (90%): Llama-3.1-SuperNova-Lite is an 8B-parameter model developed by Arcee.ai, based on the Llama-3.1-8B-Instruct architecture.
- Meta-Llama-3.1-8B-Instruct-abliterated contribution (10%): This is an uncensored version of Llama 3.1 8B Instruct created with abliteration.
Usage
You can use this fused model in your applications by loading it with Hugging Face's transformers library:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
import time

mixed_model_name = "huihui-ai/Llama-3.1-8B-Fusion-9010"

# Check if CUDA is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model and tokenizer
mixed_model = AutoModelForCausalLM.from_pretrained(mixed_model_name, device_map=device, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(mixed_model_name)

# Ensure the tokenizer has pad_token_id set
tokenizer.pad_token_id = tokenizer.eos_token_id

# Input loop
print("Start inputting text for inference (type 'exit' to quit)")
while True:
    prompt = input("Enter your prompt: ")
    if prompt.lower() == "exit":
        print("Exiting inference loop.")
        break

    # Inference phase: Generate text using the fused model
    chat = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]

    # Prepare input data
    input_ids = tokenizer.apply_chat_template(
        chat, tokenize=True, add_generation_prompt=True, return_tensors="pt"
    ).to(device)

    # Use TextStreamer for streaming output
    streamer = TextStreamer(tokenizer, skip_special_tokens=True)

    # Record the start time
    start_time = time.time()

    # Generate text and stream the output token by token
    outputs = mixed_model.generate(
        input_ids,
        max_new_tokens=8192,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
        streamer=streamer  # Enable streaming output
    )

    # Record the end time
    end_time = time.time()

    # Calculate the number of generated tokens
    generated_tokens = outputs[0][input_ids.shape[-1]:].shape[0]

    # Calculate the total time taken
    total_time = end_time - start_time

    # Calculate tokens generated per second
    tokens_per_second = generated_tokens / total_time

    print(f"\nGenerated {generated_tokens} tokens in total, took {total_time:.2f} seconds, generating {tokens_per_second:.2f} tokens per second.")
```
Evaluation
The following scores have been re-evaluated and are reported as the average across runs for each test.
| Benchmark | SuperNova-Lite | Meta-Llama-3.1-8B-Instruct-abliterated | Llama-3.1-8B-Fusion-9010 | Llama-3.1-8B-Fusion-8020 | Llama-3.1-8B-Fusion-7030 | Llama-3.1-8B-Fusion-6040 | Llama-3.1-8B-Fusion-5050 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| IF_Eval | 82.09 | 76.29 | 82.44 | 82.93 | 83.10 | 82.94 | 82.03 |
| MMLU Pro | 35.87 | 33.10 | 35.65 | 35.32 | 34.91 | 34.50 | 33.96 |
| TruthfulQA | 64.35 | 53.25 | 62.67 | 61.04 | 59.09 | 57.80 | 56.75 |
| BBH | 49.48 | 44.87 | 48.86 | 48.47 | 48.30 | 48.19 | 47.93 |
| GPQA | 31.98 | 29.50 | 32.25 | 32.38 | 32.61 | 31.14 | 30.60 |
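The card does not say which evaluation harness produced these scores. As a rough, non-authoritative sketch, a similar set of benchmarks could be run with EleutherAI's lm-evaluation-harness (`pip install lm-eval`); the task names, few-shot settings, and batch size below are assumptions and may not match the author's setup.

```python
import lm_eval

# Assumed task names from the lm-evaluation-harness registry; the exact tasks,
# harness version, and prompting used for the table above are not stated in the card.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=huihui-ai/Llama-3.1-8B-Fusion-9010,dtype=bfloat16",
    tasks=["ifeval", "mmlu_pro", "truthfulqa", "bbh", "gpqa"],
    batch_size=8,
)

# Print the aggregated metrics for each task.
for task, metrics in results["results"].items():
    print(task, metrics)
```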
Original link: https://huggingface.co/huihui-ai/Llama-3.1-8B-Fusion-9010