Uncensor any LLM with abliteration

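The third generation of Llama models provided fine-tuned (Instruct) versions that excel at understanding and following instructions. However, these models are heavily censored: they are designed to refuse requests seen as harmful, with responses such as "As an AI assistant, I cannot help you." While this safety feature is crucial for preventing misuse, it limits the model's flexibility and responsiveness.

In this article, we will explore a technique called "abliteration" that can uncensor any LLM without retraining. This technique effectively removes the model's built-in refusal mechanism, allowing it to respond to all types of prompts.

The code is available on Google Colab and in the LLM Course on GitHub.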

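✂️ What is abliteration?

Modern LLMs are fine-tuned for safety and instruction-following, meaning they are trained to refuse harmful requests. In their blog post, Arditi et al. showed that this refusal behavior is mediated by a specific direction in the model's residual stream. If we prevent the model from representing this direction, it loses its ability to refuse requests. Conversely, adding this direction artificially can cause the model to refuse even harmless requests.

In the traditional decoder-only Llama-like architecture, there are three residual streams we can target: at the start of each block ("pre"), between the attention and MLP layers ("mid"), and after the MLP ("post"). The following figure illustrates the location of each residual stream.

To uncensor an LLM, we first need to identify the "refusal direction" within the model. This process involves a few technical steps: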

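Data collection: Run the model on a set of harmful instructions and a set of harmless instructions, recording the residual stream activations at the last token position for each instruction.
Mean difference: Calculate the mean difference between the activations of harmful and harmless instructions. This gives us one vector per layer of the model, each representing a candidate "refusal direction".
Selection: Normalize these vectors and evaluate them to select the single best "refusal direction".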
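
The mean-difference and normalization steps can be sketched on toy data. This is only an illustration in plain Python with made-up two-dimensional "activations"; the names mean_vector and normalize are hypothetical helpers, and the real implementation later in the article works on cached tensors instead.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# Made-up last-token activations: 3 prompts per set, hidden size 2.
harmful_acts = [[3.0, 5.0], [5.0, 5.0], [4.0, 5.0]]
harmless_acts = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]

# Difference of means between the two sets, then normalization.
diff = [h - b for h, b in zip(mean_vector(harmful_acts), mean_vector(harmless_acts))]
refusal_dir = normalize(diff)
print(refusal_dir)
```

In a real model this computation is repeated for every layer, which yields one candidate direction per layer.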

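Once we have identified the refusal direction, we can "ablate" it, effectively removing the model's ability to represent this feature. This can be done through an inference-time intervention or permanently with weight orthogonalization.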

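Let's talk about inference-time intervention first. For every component that writes to the residual stream (such as an attention head), we calculate the projection of its output onto the refusal direction and subtract this projection. This subtraction is applied at every token and every layer, ensuring that the model never represents the refusal direction.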
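
As a minimal sketch of that projection step, assuming a unit-norm direction and a single activation vector, the subtraction looks like this in plain Python. The function name ablate_direction and the numbers are illustrative only.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def ablate_direction(activation, direction):
    """Remove the component of `activation` along the unit vector `direction`."""
    coeff = dot(activation, direction)  # length of the projection
    return [a - coeff * d for a, d in zip(activation, direction)]

direction = [0.6, 0.8]   # toy unit-norm direction
activation = [2.0, 1.0]  # toy residual stream activation

ablated = ablate_direction(activation, direction)
print(ablated)
print(dot(ablated, direction))  # numerically close to zero
```

Applying the function a second time changes nothing, because the ablated vector no longer has any component along the direction.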

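On the other hand, weight orthogonalization involves modifying the model weights directly. By orthogonalizing the component weights with respect to the refusal direction, it prevents the model from writing to this direction altogether. This is achieved by adjusting the matrices that write to the residual stream, ensuring they do not contribute to the refusal direction.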
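
The effect of orthogonalizing a weight matrix can be checked on a small example. The sketch below assumes a matrix whose rows are written into the residual stream (shape [d_in, d_model]) and a unit-norm direction; orthogonalize_rows and vec_mat are hypothetical helper names, and the values are made up.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def orthogonalize_rows(matrix, direction):
    """Subtract from every row its projection onto the unit vector `direction`."""
    return [
        [w - dot(row, direction) * d for w, d in zip(row, direction)]
        for row in matrix
    ]

def vec_mat(x, matrix):
    """Row vector times matrix, i.e. x @ W."""
    return [
        sum(xi * row[j] for xi, row in zip(x, matrix))
        for j in range(len(matrix[0]))
    ]

direction = [0.6, 0.8]                             # toy unit-norm direction
W = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.5]]          # toy weights: d_in=3, d_model=2
W_orth = orthogonalize_rows(W, direction)

# Whatever the input, the output now has no component along the direction.
for x in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [2.0, -1.0, 3.0]):
    print(dot(vec_mat(x, W_orth), direction))
```

Because the change is baked into the weights, no hook is needed at inference time, which is what makes this variant permanent.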

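In the next section, we will implement abliteration with weight orthogonalization.

💻 Implementation

The following implementation of abliteration is based on FailSpy's notebook, which is itself based on the original authors' notebook. I mostly adapted and simplified it to make it easier to understand. This section is quite code-heavy so you can see what is going on, but if you are less interested in the technical details, you can use FailSpy's abliterator library (also check his collection of abliterated models on Hugging Face).

The code relies on the excellent TransformerLens library (formerly known as EasyTransformer) to do the heavy lifting. It is designed for mechanistic interpretability and is used here to intervene on activations. Thanks to Neel Nanda and Joseph Bloom for creating and maintaining this library.

First, let's install the necessary packages and import them. All these steps are available in this Google Colab notebook.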

!pip install transformers transformers_stream_generator tiktoken transformer_lens einops jaxtyping

import torch
import functools
import einops
import gc

from datasets import load_dataset
from tqdm import tqdm
from torch import Tensor
from typing import List
from transformer_lens import HookedTransformer, utils
from transformer_lens.hook_points import HookPoint
from transformers import AutoModelForCausalLM, AutoTokenizer
from jaxtyping import Float, Int
from collections import defaultdict

# Turn automatic differentiation off to save GPU memory (credit: Undi95)
torch.set_grad_enabled(False)

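We need two datasets: one containing harmless instructions and one containing harmful instructions. We'll use tatsu-lab/alpaca as well as data from llm-attacks. To make things easier, I repackaged them into two Hugging Face datasets: mlabonne/harmless_alpaca and mlabonne/harmful_behaviors. That way, you can easily replace them with your own datasets.

We will load the instructions and reformat them into a list of dictionaries with "role" and "content" keys. This makes them compatible with the apply_chat_template() method, which we will use to follow Llama 3's chat template.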

def reformat_texts(texts):
    return [[{"role": "user", "content": text}] for text in texts]

# Get harmful and harmless datasets
def get_harmful_instructions():
    dataset = load_dataset('mlabonne/harmful_behaviors')
    return reformat_texts(dataset['train']['text']), reformat_texts(dataset['test']['text'])

def get_harmless_instructions():
    dataset = load_dataset('mlabonne/harmless_alpaca')
    return reformat_texts(dataset['train']['text']), reformat_texts(dataset['test']['text'])

harmful_inst_train, harmful_inst_test = get_harmful_instructions()
harmless_inst_train, harmless_inst_test = get_harmless_instructions()
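
To make the role/content format more concrete, here is a hand-written approximation of how a Llama 3 style chat template lays out such a conversation. It is a sketch only: render_llama3_prompt is a hypothetical helper, and the real template ships with the tokenizer and is applied by apply_chat_template().

```python
def render_llama3_prompt(messages, add_generation_prompt=True):
    """Approximate the Llama 3 chat layout for a list of role/content dicts."""
    text = "<|begin_of_text|>"
    for message in messages:
        text += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        text += f"{message['content']}<|eot_id|>"
    if add_generation_prompt:
        # Open an assistant turn so the next predicted token starts the reply.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text

conversation = [{"role": "user", "content": "Write a haiku about autumn."}]
prompt = render_llama3_prompt(conversation)
print(prompt)
```

With the generation prompt enabled, the sequence ends right where the assistant's answer would begin, so the last token position is the point at which the model is about to start its reply.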

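Now that we have our datasets, we can load the model we want to abliterate. Unfortunately, you can't directly load a custom model using HookedTransformer. Here, I use a trick described in FailSpy's notebook: download the custom model and rename it as meta-llama/Meta-Llama-3-8B-Instruct. Load it in torch.float16 format if your GPU is not compatible with BF16.

In this example, we'll use mlabonne/Daredevil-8B, a mega-merge created with DARE TIES (see my article about model merging) that has the highest MMLU score on the Open LLM Leaderboard in the 8B category.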

MODEL_ID = "mlabonne/Daredevil-8B"
MODEL_TYPE = "meta-llama/Meta-Llama-3-8B-Instruct"

# Download and load model
!git clone https://huggingface.co/{MODEL_ID} {MODEL_TYPE}

# Load model and tokenizer
model = HookedTransformer.from_pretrained_no_processing(
    MODEL_TYPE,
    local_files_only=True,
    dtype=torch.bfloat16,
    default_padding_side='left'
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_TYPE)
tokenizer.padding_side = 'left'
tokenizer.pad_token = tokenizer.eos_token

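We can now tokenize our datasets. We use the same number of samples for both harmless and harmful instructions. Note that a high number of samples can use all the RAM/VRAM, which is why I'm limiting it to 256 here.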

def tokenize_instructions(tokenizer, instructions):
    return tokenizer.apply_chat_template(
        instructions,
        padding=True,
        truncation=False,
        return_tensors="pt",
        return_dict=True,
        add_generation_prompt=True,
    ).input_ids

n_inst_train = min(256, len(harmful_inst_train), len(harmless_inst_train))

# Tokenize datasets
harmful_tokens = tokenize_instructions(
    tokenizer,
    instructions=harmful_inst_train[:n_inst_train],
)
harmless_tokens = tokenize_instructions(
    tokenizer,
    instructions=harmless_inst_train[:n_inst_train],
)
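
One detail worth illustrating is why the tokenizer is set to pad on the left. The toy sketch below (plain Python, made-up token ids, hypothetical helper names) shows that with left padding the last position of every row holds a real token, whereas right padding would put a pad token there for shorter prompts.

```python
def left_pad(sequences, pad_id):
    """Pad every sequence on the left to the length of the longest one."""
    width = max(len(s) for s in sequences)
    return [[pad_id] * (width - len(s)) + s for s in sequences]

def right_pad(sequences, pad_id):
    """Pad every sequence on the right to the length of the longest one."""
    width = max(len(s) for s in sequences)
    return [s + [pad_id] * (width - len(s)) for s in sequences]

sequences = [[11, 12, 13], [21, 22]]  # two prompts of different lengths

batch = left_pad(sequences, pad_id=0)
last_tokens = [row[-1] for row in batch]
print(batch, last_tokens)

right_batch = right_pad(sequences, pad_id=0)
print([row[-1] for row in right_batch])  # second row ends on the pad id
```

This matters here because the activations used later are read at position -1 of each prompt.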

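Everything is set up, so we can now implement the first step of abliteration: data collection. We want to process these tokenized datasets and store the residual stream activations in the harmful and harmless dictionaries. This is managed by the transformer_lens library.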

# Define batch size based on available VRAM
batch_size = 32

# Initialize defaultdicts to store activations
harmful = defaultdict(list)
harmless = defaultdict(list)

# Process the training data in batches
num_batches = (n_inst_train + batch_size - 1) // batch_size
for i in tqdm(range(num_batches)):
    print(i)
    start_idx = i * batch_size
    end_idx = min(n_inst_train, start_idx + batch_size)

    # Run models on harmful and harmless prompts, cache activations
    harmful_logits, harmful_cache = model.run_with_cache(
        harmful_tokens[start_idx:end_idx],
        names_filter=lambda hook_name: 'resid' in hook_name,
        device='cpu',
        reset_hooks_end=True
    )
    harmless_logits, harmless_cache = model.run_with_cache(
        harmless_tokens[start_idx:end_idx],
        names_filter=lambda hook_name: 'resid' in hook_name,
        device='cpu',
        reset_hooks_end=True
    )

    # Collect and store the activations
    for key in harmful_cache:
        harmful[key].append(harmful_cache[key])
        harmless[key].append(harmless_cache[key])

    # Flush RAM and VRAM
    del harmful_logits, harmless_logits, harmful_cache, harmless_cache
    gc.collect()
    torch.cuda.empty_cache()

# Concatenate the cached activations
harmful = {k: torch.cat(v) for k, v in harmful.items()}
harmless = {k: torch.cat(v) for k, v in harmless.items()}
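
The batching arithmetic in the loop above can be checked in isolation. The helper below is a hypothetical stand-alone version of the same ceiling division and slice bounds, with no model involved.

```python
def batch_bounds(n_items, batch_size):
    """Return (start, end) slice bounds covering n_items in chunks of batch_size."""
    num_batches = (n_items + batch_size - 1) // batch_size  # ceiling division
    return [
        (i * batch_size, min(n_items, (i + 1) * batch_size))
        for i in range(num_batches)
    ]

print(len(batch_bounds(256, 32)))  # 256 samples split evenly
print(batch_bounds(70, 32))        # last batch is shorter
```

The min() call is what keeps the final batch from running past the end when the sample count is not a multiple of the batch size.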

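We can now compute the refusal direction for each layer. This corresponds to the mean difference between the activations of harmful and harmless instructions, which is then normalized. We sort the resulting directions in descending order and store them in activation_scored.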

# Helper function to get activation index
def get_act_idx(cache_dict, act_name, layer):
    key = (act_name, layer)
    return cache_dict[utils.get_act_name(*key)]

# Compute difference of means between harmful and harmless activations at intermediate layers
activation_layers = ["resid_pre", "resid_mid", "resid_post"]
activation_refusals = defaultdict(list)

for layer_num in range(1, model.cfg.n_layers):
    pos = -1  # Position index

    for layer in activation_layers:
        harmful_mean_act = get_act_idx(harmful, layer, layer_num)[:, pos, :].mean(dim=0)
        harmless_mean_act = get_act_idx(harmless, layer, layer_num)[:, pos, :].mean(
            dim=0
        )

        refusal_dir = harmful_mean_act - harmless_mean_act
        refusal_dir = refusal_dir / refusal_dir.norm()
        activation_refusals[layer].append(refusal_dir)

# Get all calculated potential refusal directions, sort them in descending order based on their mean
# Use a subset of layers if certain activations are not promising
selected_layers = ["resid_pre"]
activation_scored = sorted(
    [
        activation_refusals[layer][l - 1]
        for l in range(1, model.cfg.n_layers)
        for layer in selected_layers
    ],
    key=lambda x: abs(x.mean()),
    reverse=True,
)
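
The sorting key used above, the absolute value of each vector's mean, can be tried on a few made-up candidates. The dictionary and its labels are purely illustrative.

```python
# Made-up candidate directions, labelled by the layer they came from.
candidates = {
    "layer 1": [0.1, -0.1],
    "layer 2": [0.6, 0.8],
    "layer 3": [-0.5, -0.3],
}

def score(vec):
    """Absolute value of the vector's mean, as in the sorting key above."""
    return abs(sum(vec) / len(vec))

ranked = sorted(candidates.items(), key=lambda item: score(item[1]), reverse=True)
ranked_names = [name for name, _ in ranked]
print(ranked_names)
```

Note that after sorting, an index into the list is a rank and no longer a layer number, so the candidate index chosen later refers to a position in this ranking.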

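The final step of the process consists of evaluating the refusal directions we calculated. To do this, we apply each candidate refusal direction to every residual stream and every block during inference. In the following snippet, we get generations for four test harmful instructions and for the 20 top-ranked candidate directions, each computed from a different block (or layer).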

def _generate_with_hooks(
    model: HookedTransformer,
    tokenizer: AutoTokenizer,
    tokens: Int[Tensor, "batch_size seq_len"],
    max_tokens_generated: int = 64,
    fwd_hooks=[],
) -> List[str]:
    all_tokens = torch.zeros(
        (tokens.shape[0], tokens.shape[1] + max_tokens_generated),
        dtype=torch.long,
        device=tokens.device,
    )
    all_tokens[:, : tokens.shape[1]] = tokens
    for i in range(max_tokens_generated):
        with model.hooks(fwd_hooks=fwd_hooks):
            logits = model(all_tokens[:, : -max_tokens_generated + i])
            next_tokens = logits[:, -1, :].argmax(
                dim=-1
            )  # greedy sampling (temperature=0)
            all_tokens[:, -max_tokens_generated + i] = next_tokens
    return tokenizer.batch_decode(
        all_tokens[:, tokens.shape[1] :], skip_special_tokens=True
    )

def get_generations(
    model: HookedTransformer,
    tokenizer: AutoTokenizer,
    instructions: List[str],
    fwd_hooks=[],
    max_tokens_generated: int = 64,
    batch_size: int = 4,
) -> List[str]:
    generations = []
    for i in tqdm(range(0, len(instructions), batch_size)):
        tokens = tokenize_instructions(
            tokenizer, instructions=instructions[i : i + batch_size]
        )
        generation = _generate_with_hooks(
            model,
            tokenizer,
            tokens,
            max_tokens_generated=max_tokens_generated,
            fwd_hooks=fwd_hooks,
        )
        generations.extend(generation)
    return generations

# Inference-time intervention hook
def direction_ablation_hook(
    activation: Float[Tensor, "... d_act"],
    hook: HookPoint,
    direction: Float[Tensor, "d_act"],
):
    if activation.device != direction.device:
        direction = direction.to(activation.device)
    proj = (
        einops.einsum(
            activation, direction.view(-1, 1), "... d_act, d_act single -> ... single"
        )
        * direction
    )
    return activation - proj

# Testing baseline
N_INST_TEST = 4
baseline_generations = get_generations(
    model, tokenizer, harmful_inst_test[:N_INST_TEST], fwd_hooks=[]
)

# Evaluating layers defined earlier (needs human evaluation to determine best layer for refusal inhibition)
EVAL_N = 20  # Evaluate how many of the top N potential directions
evals = []
for refusal_dir in tqdm(activation_scored[:EVAL_N]):
    hook_fn = functools.partial(direction_ablation_hook, direction=refusal_dir)
    fwd_hooks = [
        (utils.get_act_name(act_name, layer), hook_fn)
        for layer in list(range(model.cfg.n_layers))
        for act_name in activation_layers
    ]
    intervention_generations = get_generations(
        model, tokenizer, harmful_inst_test[:N_INST_TEST], fwd_hooks=fwd_hooks
    )
    evals.append(intervention_generations)
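
The generation loop above relies on a preallocated buffer and negative slice indices, which can be hard to read at first. Here is a toy version of the same indexing in plain Python, where toy_model is a made-up stand-in that simply predicts the last token plus one.

```python
def toy_model(prefix):
    """Hypothetical stand-in for a language model: predicts last token + 1."""
    return prefix[-1] + 1

def greedy_generate(prompt, max_new_tokens):
    """Fill a preallocated buffer one token at a time, as in the loop above."""
    buffer = prompt + [0] * max_new_tokens
    for i in range(max_new_tokens):
        # Everything before the slot being filled: prompt plus tokens so far.
        visible = buffer[: -max_new_tokens + i]
        buffer[-max_new_tokens + i] = toy_model(visible)
    return buffer[len(prompt):]

print(greedy_generate([5, 6, 7], 3))
```

As in the original loop, the whole visible prefix is passed to the model at every step, so there is no key-value caching; that is simple but slow, which is acceptable for 64 tokens on a handful of prompts.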

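We stored all the generations in the evals list. We can now print them and manually select the layer (block) that provides an uncensored response for each instruction. I automatically exclude responses containing "I cannot" and "I can't" to filter out unwanted answers.

If you can't find a layer that satisfies these requirements, you might want to test other residual streams in the previous selected_layers list, other instructions, additional blocks, etc.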

# Print generations for human evaluation
blacklist = ["I cannot", "I can't"]
for i in range(N_INST_TEST):
    print(f"\033[1mINSTRUCTION {i}: {harmful_inst_test[i]}")
    print(f"\nBASELINE COMPLETION:\n{baseline_generations[i]}\033[0m")
    for layer_candidate in range(EVAL_N):
        if not any(word in evals[layer_candidate][i] for word in blacklist):
            print(f"\n---\n\nLAYER CANDIDATE #{layer_candidate} INTERVENTION COMPLETION:")
            print(evals[layer_candidate][i])

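In my case, layer candidate 9 managed to provide uncensored answers for all four instructions. This is the one we will select for the refusal direction. In what follows, we implement weight orthogonalization to modify the weights and prevent the model from creating outputs with this direction. You can verify that the model is successfully uncensored by printing the completions.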

def get_orthogonalized_matrix(
    matrix: Float[Tensor, "... d_model"], vec: Float[Tensor, "d_model"]
) -> Float[Tensor, "... d_model"]:
    proj = (
        einops.einsum(
            matrix, vec.view(-1, 1), "... d_model, d_model single -> ... single"
        )
        * vec
    )
    return matrix - proj

# Select the layer with the highest potential refusal direction
LAYER_CANDIDATE = 9
refusal_dir = activation_scored[LAYER_CANDIDATE]

# Orthogonalize the model's weights
if refusal_dir.device != model.W_E.device:
    refusal_dir = refusal_dir.to(model.W_E.device)
model.W_E.data = get_orthogonalized_matrix(model.W_E, refusal_dir)

for block in tqdm(model.blocks):
    if refusal_dir.device != block.attn.W_O.device:
        refusal_dir = refusal_dir.to(block.attn.W_O.device)
    block.attn.W_O.data = get_orthogonalized_matrix(block.attn.W_O, refusal_dir)
    block.mlp.W_out.data = get_orthogonalized_matrix(block.mlp.W_out, refusal_dir)

# Generate text with abliterated model
orthogonalized_generations = get_generations(
    model, tokenizer, harmful_inst_test[:N_INST_TEST], fwd_hooks=[]
)

# Print generations
for i in range(N_INST_TEST):
    if len(baseline_generations) > i:
        print(f"INSTRUCTION {i}: {harmful_inst_test[i]}")
        print(f"\033[92mBASELINE COMPLETION:\n{baseline_generations[i]}")
    print(f"\033[91mINTERVENTION COMPLETION:\n{evals[LAYER_CANDIDATE][i]}")
    print(f"\033[95mORTHOGONALIZED COMPLETION:\n{orthogonalized_generations[i]}\n")

We're now ready to use the model. We convert it back to the Hugging Face format and upload it to the HF hub.

# Convert model back to HF safetensors
hf_model = AutoModelForCausalLM.from_pretrained(MODEL_TYPE, torch_dtype=torch.bfloat16)
lm_model = hf_model.model

state_dict = model.state_dict()
lm_model.embed_tokens.weight = torch.nn.Parameter(state_dict["embed.W_E"].cpu())

for l in range(model.cfg.n_layers):
    lm_model.layers[l].self_attn.o_proj.weight = torch.nn.Parameter(
        einops.rearrange(
            state_dict[f"blocks.{l}.attn.W_O"], "n h m->m (n h)", n=model.cfg.n_heads
        ).contiguous()
    )
    lm_model.layers[l].mlp.down_proj.weight = torch.nn.Parameter(
        torch.transpose(state_dict[f"blocks.{l}.mlp.W_out"], 0, 1).contiguous()
    )

hf_model.push_to_hub(f"{MODEL_ID}-abliterated")
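
The "n h m->m (n h)" pattern in the conversion above moves from a per-head layout [n_heads, d_head, d_model] to a single matrix of shape [d_model, n_heads * d_head]. The sketch below reproduces that index mapping with nested lists on a tiny made-up tensor; to_hf_o_proj is a hypothetical helper, not part of any library.

```python
def to_hf_o_proj(w_o):
    """Rearrange [n_heads][d_head][d_model] into [d_model][n_heads * d_head]."""
    n_heads, d_head, d_model = len(w_o), len(w_o[0]), len(w_o[0][0])
    return [
        [w_o[n][h][m] for n in range(n_heads) for h in range(d_head)]
        for m in range(d_model)
    ]

# Toy tensor whose entries encode their own indices: 100*n + 10*h + m.
w_o = [[[100 * n + 10 * h + m for m in range(3)] for h in range(2)] for n in range(2)]

hf_weight = to_hf_o_proj(w_o)
for row in hf_weight:
    print(row)
```

Each output row corresponds to one residual stream dimension, and the columns list the heads one after another, each with its d_head entries.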

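⚖️ DPO Fine-Tuning

I evaluated the abliterated and source models from the previous section on the Open LLM Leaderboard and on Nous' benchmark suite. Here are the results:

As you can see, the source model significantly outperforms Llama 3 8B Instruct. However, we observe a performance drop in the ablated version across all benchmarks. The ablation process successfully uncensored it but also degraded the model's quality.

To address this issue, one idea is to further train our abliterated model to heal it. Like most fine-tuned models, Llama 3 8B Instruct is quite brittle when it comes to supervised fine-tuning. An additional SFT would likely break the model's performance.

Alternatively, preference alignment is quite light and shouldn't lobotomize our abliterated model. DPO is a good candidate here for its ease of use and good track record. To implement it, I used LazyAxolotl with the mlabonne/orpo-dpo-mix-40k dataset. Here's the configuration I used: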

base_model: mlabonne/Daredevil-8B-abliterated
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: true
strict: false
save_safetensors: true

rl: dpo
chat_template: chatml
datasets:
  - path: mlabonne/orpo-dpo-mix-40k-flat
    split: train
    type: chatml.intel

dataset_prepared_path:
val_set_size: 0.0
output_dir: ./out

adapter: qlora
lora_model_dir:

sequence_len: 2048
sample_packing: false
pad_to_sequence_len: false

lora_r: 64
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:

wandb_project: axolotl
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 1
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 5e-6
train_on_inputs: false
group_by_length: false

bf16: auto
fp16:
tf32:

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
warmup_steps: 100
evals_per_epoch: 0
eval_table_size:
eval_table_max_new_tokens: 128
saves_per_epoch: 1
debug:
deepspeed: deepspeed_configs/zero2.json
weight_decay: 0.0
special_tokens:
  pad_token: <|end_of_text|>
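
As a rough sanity check on the configuration above, the effective batch size and the LoRA scaling factor can be derived from its values. The GPU count of six is taken from the training setup described in this section; the dataset size in the last line is a made-up round number used only to show the arithmetic, not the dataset's actual size.

```python
import math

# Values copied from the configuration above.
micro_batch_size = 1
gradient_accumulation_steps = 8
lora_r = 64
lora_alpha = 32

# Taken from the training setup described in this section.
num_gpus = 6

# Samples contributing to each optimizer step under data parallelism.
effective_batch = micro_batch_size * gradient_accumulation_steps * num_gpus

# Standard LoRA scaling applied to the low-rank update (alpha / r).
lora_scale = lora_alpha / lora_r

def steps_per_epoch(num_examples, batch):
    """Optimizer steps needed to see every example once."""
    return math.ceil(num_examples / batch)

print(effective_batch, lora_scale)
print(steps_per_epoch(40_000, effective_batch))  # illustrative dataset size
```

Under these assumptions, the 100 warmup steps cover a noticeable share of a single epoch, which is worth keeping in mind when changing the batch size or the number of GPUs.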

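I trained it using 6xA6000 GPUs with DeepSpeed ZeRO-2. The training took about 6 hours and 45 minutes. Here are the training curves I got from W&B:

It automatically uploaded the DPO fine-tuned model, called mlabonne/NeuralDaredevil-8B-abliterated. To see if it fixed our abliterated version, I evaluated it on the same benchmarks:

We can see that this additional training allowed us to recover most of the performance drop due to abliteration. One area where the model doesn't improve is GSM8K, a math dataset, which could mean that orpo-dpo-mix-40k would benefit from more math samples.

The final model is an uncensored LLM with state-of-the-art performance in the 8B category. I recommend it as an improved version of Llama 3 8B Instruct when you don't need censorship. You can play with quantized versions such as GGUF in LM Studio.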

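Conclusion

In this article, we introduced the concept of abliteration. This technique uses the model's activations on harmless and harmful prompts to calculate a refusal direction. It then uses this direction to modify the model's weights and ensure that the model stops outputting refusals. This technique also demonstrates the fragility of safety fine-tuning and raises ethical considerations.

We applied abliteration to Daredevil-8B to uncensor it, which also degraded the model's performance. We then healed it using DPO to create the NeuralDaredevil-8B model, a fully uncensored and high-quality 8B LLM. Abliteration is not limited to removing alignment and should be seen as a form of fine-tuning without retraining. Indeed, it can creatively be applied to other goals, like FailSpy's MopeyMule, which adopts a melancholic conversational style.

I hope you liked this article. If you want to see more, follow me on Hugging Face and on Twitter @maximelabonne.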

References

FailSpy, "abliterator library," GitHub, 2024.
Andy Arditi, Oscar Obeso, Aaquib111, wesg, Neel Nanda, "Refusal in LLMs is mediated by a single direction," LessWrong, 2024.

Original article: https://huggingface.co/blog/mlabonne/abliteration
