Overview
Llama-3.1-8B-Fusion-9010 is a mixed model that combines the strengths of two powerful Llama-based models: arcee-ai/Llama-3.1-SuperNova-Lite and mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated. The weights are blended in a 9:1 ratio, with 90% of the weights coming from SuperNova-Lite and 10% from the abliterated Meta-Llama-3.1-8B-Instruct model. Although this is a simple blend, the model is fully usable and does not produce garbled output. This is an experiment: I tested the 9:1, 8:2, 7:3, 6:4, and 5:5 ratios separately to see how much impact each has on the model. Evaluation reports for all of these models will be provided later.
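The merging script itself is not published in this card. The following is a minimal sketch of how a 9:1 linear weight blend of the two source checkpoints could be reproduced with transformers; the output directory name and the assumption that every tensor is interpolated uniformly are illustrative, not a statement of the author's exact method.

```python
import torch
from transformers import AutoModelForCausalLM

# Illustrative 9:1 linear blend; the author's actual merging procedure may differ.
source_a = "arcee-ai/Llama-3.1-SuperNova-Lite"                 # 90% of the weights
source_b = "mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated"   # 10% of the weights
ratio = 0.9

model_a = AutoModelForCausalLM.from_pretrained(source_a, torch_dtype=torch.bfloat16)
model_b = AutoModelForCausalLM.from_pretrained(source_b, torch_dtype=torch.bfloat16)

state_b = model_b.state_dict()

# Both checkpoints share the Llama-3.1-8B architecture, so their state dicts
# have identical keys and shapes; interpolate each tensor in place.
with torch.no_grad():
    for name, tensor_a in model_a.state_dict().items():
        tensor_a.copy_(ratio * tensor_a + (1.0 - ratio) * state_b[name])

model_a.save_pretrained("Llama-3.1-8B-Fusion-9010")  # hypothetical output path
```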
Model Details
- Base models: arcee-ai/Llama-3.1-SuperNova-Lite, mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated
- Model size: 8B parameters
- Architecture: Llama 3.1
- Mixing ratio: 9:1 (SuperNova-Lite : Meta-Llama-3.1-8B-Instruct-abliterated)
Key Features
- SuperNova-Lite contribution (90%): Llama-3.1-SuperNova-Lite is an 8B-parameter model developed by Arcee.ai, based on the Llama-3.1-8B-Instruct architecture.
- Meta-Llama-3.1-8B-Instruct-abliterated contribution (10%): This is an uncensored version of Llama 3.1 8B Instruct created with abliteration.
Usage
You can use this fused model in your applications by loading it with Hugging Face's transformers library:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
import time

mixed_model_name = "huihui-ai/Llama-3.1-8B-Fusion-9010"

# Check if CUDA is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model and tokenizer
mixed_model = AutoModelForCausalLM.from_pretrained(mixed_model_name, device_map=device, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(mixed_model_name)

# Ensure the tokenizer has pad_token_id set
tokenizer.pad_token_id = tokenizer.eos_token_id

# Input loop
print("Start inputting text for inference (type 'exit' to quit)")
while True:
    prompt = input("Enter your prompt: ")
    if prompt.lower() == "exit":
        print("Exiting inference loop.")
        break

    # Inference phase: Generate text using the fused model
    chat = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]

    # Prepare input data
    input_ids = tokenizer.apply_chat_template(
        chat, tokenize=True, add_generation_prompt=True, return_tensors="pt"
    ).to(device)

    # Use TextStreamer for streaming output
    streamer = TextStreamer(tokenizer, skip_special_tokens=True)

    # Record the start time
    start_time = time.time()

    # Generate text and stream the output token by token
    outputs = mixed_model.generate(
        input_ids,
        max_new_tokens=8192,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
        streamer=streamer  # Enable streaming output
    )

    # Record the end time
    end_time = time.time()

    # Calculate the number of generated tokens
    generated_tokens = outputs[0][input_ids.shape[-1]:].shape[0]

    # Calculate the total time taken
    total_time = end_time - start_time

    # Calculate tokens generated per second
    tokens_per_second = generated_tokens / total_time

    print(f"\nGenerated {generated_tokens} tokens in total, took {total_time:.2f} seconds, generating {tokens_per_second:.2f} tokens per second.")
```
Evaluation
The following scores have been re-evaluated and are reported as the average across runs for each test.
| Benchmark | SuperNova-Lite | Meta-Llama-3.1-8B-Instruct-abliterated | Llama-3.1-8B-Fusion-9010 | Llama-3.1-8B-Fusion-8020 | Llama-3.1-8B-Fusion-7030 | Llama-3.1-8B-Fusion-6040 | Llama-3.1-8B-Fusion-5050 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| IF_Eval | 82.09 | 76.29 | 82.44 | 82.93 | 83.10 | 82.94 | 82.03 |
| MMLU Pro | 35.87 | 33.10 | 35.65 | 35.32 | 34.91 | 34.50 | 33.96 |
| TruthfulQA | 64.35 | 53.25 | 62.67 | 61.04 | 59.09 | 57.80 | 56.75 |
| BBH | 49.48 | 44.87 | 48.86 | 48.47 | 48.30 | 48.19 | 47.93 |
| GPQA | 31.98 | 29.50 | 32.25 | 32.38 | 32.61 | 31.14 | 30.60 |
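The card does not say which evaluation harness produced these scores. As a rough, non-authoritative sketch, a similar set of benchmarks could be run with EleutherAI's lm-evaluation-harness (`pip install lm-eval`); the task names, few-shot settings, and batch size below are assumptions and may not match the author's setup.

```python
import lm_eval

# Assumed task names from the lm-evaluation-harness registry; the exact tasks,
# harness version, and prompting used for the table above are not stated in the card.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=huihui-ai/Llama-3.1-8B-Fusion-9010,dtype=bfloat16",
    tasks=["ifeval", "mmlu_pro", "truthfulqa", "bbh", "gpqa"],
    batch_size=8,
)

# Print the aggregated metrics for each task.
for task, metrics in results["results"].items():
    print(task, metrics)
```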
Original link: https://huggingface.co/huihui-ai/Llama-3.1-8B-Fusion-9010