0. Overview of GPT Models
GPT is short for Generative Pre-trained Transformer, a family of advanced deep learning models designed to generate human-like text. Developed by OpenAI, these models have gone through several iterations: GPT-1, GPT-2, GPT-3, and most recently GPT-4.
GPT is an AI language model built on the transformer architecture. It is pre-trained, generative, and trained without supervision, and it performs well in zero-shot, one-shot, and few-shot multi-task settings. Trained to predict the next token in a sequence of tokens (chunks of characters), it can handle NLP tasks it was never explicitly trained on, and after seeing only a few examples it can reach the expected results on benchmarks such as machine translation, question answering, and cloze tasks. At its core, a GPT model computes the conditional probability that a word appears given the text that precedes it. For example, in the sentence "Margaret is organizing a garage sale... maybe we can sell that old...", the word "chair" is far more plausible than "elephant". In addition, transformer models use multiple units called attention blocks to learn which parts of a text sequence to focus on; a transformer may have several attention blocks, each learning a different aspect of the language.
GPT models can also generate synthetic text of unprecedented quality: prompt the model with an input and it produces a long continuation. Without using domain-specific training data, GPT models outperform other language models trained on domains such as Wikipedia, news, and books. GPT learns language tasks such as reading comprehension, summarization, and question answering from text alone, without task-specific training data. Its scores on these tasks (a "score" being the numerical value the model assigns to indicate how likely a given output is) are not the best, but they show that unsupervised techniques, given enough data and compute, can benefit these tasks.
The test environment used below is Ubuntu 22.04.3 LTS.
1. Preparing the Dataset
1.0 Getting the data
We use the Tiny Shakespeare dataset.
$ mkdir CreateGPT
$ cd CreateGPT
$ mkdir data
$ cd data
$ wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
$ cd ..
1.1 Read the Tiny Shakespeare dataset and print its length in characters
setup1.1.py:
# read it in to inspect it
with open('data/input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

print("length of dataset in characters: ", len(text))
Run python setup1.1.py:
$ python setup1.1.py
length of dataset in characters: 1115394
1.2 Count the distinct characters that appear in the dataset:
setup1.2.py:
# read it in to inspect it
with open('data/input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
# print("length of dataset in characters: ", len(text))

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)
Run python setup1.2.py:
$ python setup1.2.py
!$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65
2. Tokenization
Tokenization means converting raw text into a sequence of numbers, i.e., a sequence of tokens.
There are sophisticated algorithms for turning natural-language text into token sequences, such as google/sentencepiece and openai/tiktoken.
As mentioned above, this article uses a character-level language model, so its tokenizer is extremely simple. The chars list built in the earlier code contains every distinct character in the corpus; given a character, its index in chars (its indexOf) already gives us a numeric encoding.
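As a tiny illustration of this index-based idea (a sketch with a made-up mini-corpus, not one of the numbered setup scripts):

corpus = "hii there"
chars = sorted(list(set(corpus)))
print(chars)                            # [' ', 'e', 'h', 'i', 'r', 't']
print(chars.index('h'))                 # 2 -- a character's position in chars is its numeric id
print([chars.index(c) for c in "hi"])   # [2, 3]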
2.1 Map the vocabulary (chars) to integers
stoi is the character-to-integer mapping and itos is the integer-to-character mapping; encode and decode convert strings to and from lists of integers.
setup2.1.py:
# read it in to inspect it
with open('data/input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
# print("length of dataset in characters: ", len(text))

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# print(''.join(chars))
# print(vocab_size)

# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
# encoder: take a string, output a list of integers
encode = lambda s: [stoi[c] for c in s]
# decoder: take a list of integers, output a string
decode = lambda l: ''.join([itos[i] for i in l])

print(encode("hii there"))
print(decode(encode("hii there")))
Run python setup2.1.py:
$ python setup2.1.py
[46, 47, 47, 1, 58, 46, 43, 56, 43]
hii there
The code above implements an encode/decode pair that turns text into a sequence of "tokens". Token is in quotes because, at character-level granularity, the algorithm is intuitive but overly simple. Also note that any natural-language text used from now on must consist only of characters that appear in chars; nothing outside that set can be encoded.
Whether you use a sophisticated algorithm or the simplified method in this article, the principle is the same: convert text into a sequence of numbers.
2.2 Taking tiktoken (which has 50257 token types) as an example:
setup2.2.py:
import tiktoken

enc = tiktoken.get_encoding('gpt2')
print(enc.n_vocab)                   # 50257
print(enc.encode('hii there'))       # [71, 4178, 612]
print(enc.decode([71, 4178, 612]))   # 'hii there'
Run python setup2.2.py:
$ python setup2.2.py
50257
[71, 4178, 612]
hii there
You may need to install the tiktoken module first:
$ pip install tiktoken
As you can see, it is used in exactly the same way as the tokenizer in this article. However:
Encoding with an advanced tokenizer makes the sequence shorter. Tiny Shakespeare needs one token per character, while a language model's context window, and therefore the number of tokens it can handle, is limited. An advanced tokenizer thus encodes the original information more densely and efficiently.
Note: advanced algorithms split text into subwords (subword units). They encode neither whole words nor single characters; the subword pieces are learned statistically when the tokenizer is trained.
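To make the density difference concrete, here is a small comparison sketch (not one of the numbered setup scripts; it assumes the tiktoken package is installed and data/input.txt is present): the GPT-2 subword tokenizer produces far fewer tokens than the one-character-per-token scheme.

import tiktoken

with open('data/input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

enc = tiktoken.get_encoding('gpt2')
print(len(text))              # character-level token count: one token per character
print(len(enc.encode(text)))  # GPT-2 subword token count: much smaller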
2.3 Encode the whole of Tiny Shakespeare and store it as a PyTorch tensor
setup2.3.py:
# let's now encode the entire text dataset and store it into a torch.Tensor
import torch # we use PyTorch: https://pytorch.org

with open('data/input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# rebuild the character-level encoder (needed because this script runs standalone)
chars = sorted(list(set(text)))
stoi = { ch:i for i,ch in enumerate(chars) }
# encoder: take a string, output a list of integers
encode = lambda s: [stoi[c] for c in s]

data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
# the 1000 characters we looked at earlier will look to the GPT like this
print(data[:1000])
Run python setup2.3.py.
You may need to install the torch module first:
$ pip install torch
3. Training Set, Validation Set, and Chunking the Data
The first 90% of Tiny Shakespeare is used for training and the remaining 10% for validation.
Each chunk of data is called a block, and the chunk size is called block_size.
setup3.1.py:
# let's now encode the entire text dataset and store it into a torch.Tensor
import torch # we use PyTorch: https://pytorch.org

with open('data/input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

chars = sorted(list(set(text)))
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
# encoder: take a string, output a list of integers
encode = lambda s: [stoi[c] for c in s]

data = torch.tensor(encode(text), dtype=torch.long)

# Let's now split up the data into train and validation sets
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

block_size = 8
print(train_data[:block_size+1])
Run python setup3.1.py:
$ python setup3.1.py
tensor([18, 47, 56, 57, 58, 1, 15, 47, 58])
Here we took the first block of the training set. block_size is 8, so why did we take 9 tokens?
To understand this, first look at how training works: each chunk is split into two views, x and y, where x is the input token sequence, grown one token at a time as it is used, and y is the expected output for that input.
setup3.2.py:
# let's now encode the entire text dataset and store it into a torch.Tensor
import torch # we use PyTorch: https://pytorch.org

with open('data/input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

chars = sorted(list(set(text)))
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
# encoder: take a string, output a list of integers
encode = lambda s: [stoi[c] for c in s]

data = torch.tensor(encode(text), dtype=torch.long)

# Let's now split up the data into train and validation sets
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

block_size = 8
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target: {target}")
Run python setup3.2.py:
$ python setup3.2.py
when input is tensor([18]) the target: 47
when input is tensor([18, 47]) the target: 56
when input is tensor([18, 47, 56]) the target: 57
when input is tensor([18, 47, 56, 57]) the target: 58
when input is tensor([18, 47, 56, 57, 58]) the target: 1
when input is tensor([18, 47, 56, 57, 58, 1]) the target: 15
when input is tensor([18, 47, 56, 57, 58, 1, 15]) the target: 47
when input is tensor([18, 47, 56, 57, 58, 1, 15, 47]) the target: 58
Given one block, training proceeds in several rounds. In round one, the first token is used to predict the second; in round two, the first two tokens predict the third; and so on, until round eight, where the first eight tokens predict the ninth.
A block_size of 8 means our maximum training context is 8 tokens. Each chunk of data holds 9 elements; the ninth is never fed in as input and serves only as a prediction target.
Splitting Tiny Shakespeare into a series of blocks gives us, in effect, a series of exam questions: each question is a contiguous sequence of 9 tokens, and the language model is quizzed on it in the way shown above.
4. Splitting into Batches
After splitting the training set into blocks, we could feed them to the GPU one at a time. But what is a GPU best at? Parallel computation, and feeding blocks one by one will never keep it busy. To fully exploit the GPU's parallelism, we pack several blocks into a batch and feed the data batch by batch. In short: never let the GPU sit idle, and training efficiency goes up.
Note that although the blocks within a batch enter the GPU together, they are isolated from one another: they do not know the others exist and do not interfere with each other.
setup4.1.py:
# let's now encode the entire text dataset and store it into a torch.Tensor
import torch # we use PyTorch: https://pytorch.org

with open('data/input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

chars = sorted(list(set(text)))
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
# encoder: take a string, output a list of integers
encode = lambda s: [stoi[c] for c in s]

data = torch.tensor(encode(text), dtype=torch.long)

# Let's now split up the data into train and validation sets
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

torch.manual_seed(1337)
# how many independent sequences will we process in parallel?
batch_size = 4
# what is the maximum context length for predictions?
block_size = 8

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)
print('----')

for b in range(batch_size): # batch dimension
    for t in range(block_size): # time dimension
        context = xb[b, :t+1]
        target = yb[b,t]
        print(f"when input is {context.tolist()} the target: {target}")
Run python setup4.1.py:
$ python setup4.1.py
inputs:
torch.Size([4, 8])
tensor([[24, 43, 58, 5, 57, 1, 46, 43],
        [44, 53, 56, 1, 58, 46, 39, 58],
        [52, 58, 1, 58, 46, 39, 58, 1],
        [25, 17, 27, 10, 0, 21, 1, 54]])
targets:
torch.Size([4, 8])
tensor([[43, 58, 5, 57, 1, 46, 43, 39],
        [53, 56, 1, 58, 46, 39, 58, 1],
        [58, 1, 58, 46, 39, 58, 1, 46],
        [17, 27, 10, 0, 21, 1, 54, 39]])
----
when input is [24] the target: 43
when input is [24, 43] the target: 58
when input is [24, 43, 58] the target: 5
when input is [24, 43, 58, 5] the target: 57
when input is [24, 43, 58, 5, 57] the target: 1
when input is [24, 43, 58, 5, 57, 1] the target: 46
when input is [24, 43, 58, 5, 57, 1, 46] the target: 43
when input is [24, 43, 58, 5, 57, 1, 46, 43] the target: 39
when input is [44] the target: 53
when input is [44, 53] the target: 56
when input is [44, 53, 56] the target: 1
when input is [44, 53, 56, 1] the target: 58
when input is [44, 53, 56, 1, 58] the target: 46
when input is [44, 53, 56, 1, 58, 46] the target: 39
when input is [44, 53, 56, 1, 58, 46, 39] the target: 58
when input is [44, 53, 56, 1, 58, 46, 39, 58] the target: 1
when input is [52] the target: 58
when input is [52, 58] the target: 1
when input is [52, 58, 1] the target: 58
when input is [52, 58, 1, 58] the target: 46
when input is [52, 58, 1, 58, 46] the target: 39
when input is [52, 58, 1, 58, 46, 39] the target: 58
when input is [52, 58, 1, 58, 46, 39, 58] the target: 1
when input is [52, 58, 1, 58, 46, 39, 58, 1] the target: 46
when input is [25] the target: 17
when input is [25, 17] the target: 27
when input is [25, 17, 27] the target: 10
when input is [25, 17, 27, 10] the target: 0
when input is [25, 17, 27, 10, 0] the target: 21
when input is [25, 17, 27, 10, 0, 21] the target: 1
when input is [25, 17, 27, 10, 0, 21, 1] the target: 54
when input is [25, 17, 27, 10, 0, 21, 1, 54] the target: 39
From the code: the block size is 8 and the batch size is 4, i.e., one batch contains 4 blocks. The random seed is also fixed to 1337 so that we can reproduce the same training behaviour as Andrej Karpathy.
The log output of the run is shown above. The input has gone from one 8-element vector to four, and the target vectors likewise: everything is batched. The walkthrough that follows also prints the prediction rounds for every block in the batch.
5. BigramLanguageModel V1
A bigram language model, in a nutshell, predicts the next word from the previous word. For example, the sentence "I love to play football" yields the word pairs "I love", "love to", "to play", and "play football".
Next we implement the first BigramLanguageModel. Unlike the video, I call it BigramLanguageModelV1; every subsequent change will get a new class with a bumped version number.
The model inherits from PyTorch's nn.Module and declares its internal layers in the constructor; this model has only a single layer (nn.Embedding). It also defines forward, the forward pass used during training. Once the model is trained, generate can be called to produce text.
setup5.1.py:
import torch import torch.nn as nn from torch.nn import functional as F torch.manual_seed(1337) # 二元语言模型实现 class BigramLanguageModelV1(nn.Module): def __init__(self, vocab_size): super().__init__() # 每个词直接从一个查找表中获取下一个词的logits值 # logits是模型做出预测前的一组未经归一化的分数,反映了不同结果的相对可能性 self.token_embedding_table = nn.Embedding(vocab_size, vocab_size) # 模型前向传播 # idx:即前面的 x,表示输入数据,词在词汇表中的索引的向量 # targets:训练的目标输出,比如正确的下一个词的索引 def forward(self, idx, targets=None): # idx and targets are both (B,T) tensor of integers logits = self.token_embedding_table(idx) # (B,T,C) if targets is None: loss = None else: B, T, C = logits.shape logits = logits.view(B*T, C) targets = targets.view(B*T) loss = F.cross_entropy(logits, targets) return logits, loss # 在模型已经训练好之后,根据给定的输入生成文本的方法。 def generate(self, idx, max_new_tokens): # idx is (B, T) array of indices in the current context for _ in range(max_new_tokens): # get the predictions logits, loss = self(idx) # focus only on the last time step logits = logits[:, -1, :] # becomes (B, C) # apply softmax to get probabilities probs = F.softmax(logits, dim=-1) # (B, C) # sample from the distribution idx_next = torch.multinomial(probs, num_samples=1) # (B, 1) # append sampled index to the running sequence idx = torch.cat((idx, idx_next), dim=1) # (B, T+1) return idx # let's now encode the entire text dataset and store it into a torch.Tensor with open('data/input.txt', 'r', encoding='utf-8') as f: text = f.read() chars = sorted(list(set(text))) vocab_size = len(chars) # create a mapping from characters to integers stoi = { ch:i for i,ch in enumerate(chars) } itos = { i:ch for i,ch in enumerate(chars) } # encoder: take a string, output a list of integers encode = lambda s: [stoi[c] for c in s] data = torch.tensor(encode(text), dtype=torch.long) # Let's now split up the data into train and validation sets n = int(0.9*len(data)) # first 90% will be train, rest val train_data = data[:n] val_data = data[n:] torch.manual_seed(1337) # how many independent sequences will we process in parallel? batch_size = 4 # what is the maximum context length for predictions? block_size = 8 def get_batch(split): # generate a small batch of data of inputs x and targets y data = train_data if split == 'train' else val_data ix = torch.randint(len(data) - block_size, (batch_size,)) x = torch.stack([data[i:i+block_size] for i in ix]) y = torch.stack([data[i+1:i+block_size+1] for i in ix]) return x, y xb, yb = get_batch('train') print('inputs:') print(xb.shape) print(xb) print('targets:') print(yb.shape) print(yb) # get device device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') # create model m = BigramLanguageModelV1(vocab_size).to(device) logits, loss = m(xb.to(device), yb.to(device)) print(logits.shape) print(loss) |
Run python setup5.1.py:
$ python setup5.1.py inputs: torch.Size([4, 8]) tensor([[24, 43, 58, 5, 57, 1, 46, 43], [44, 53, 56, 1, 58, 46, 39, 58], [52, 58, 1, 58, 46, 39, 58, 1], [25, 17, 27, 10, 0, 21, 1, 54]]) targets: torch.Size([4, 8]) tensor([[43, 58, 5, 57, 1, 46, 43, 39], [53, 56, 1, 58, 46, 39, 58, 1], [58, 1, 58, 46, 39, 58, 1, 46], [17, 27, 10, 0, 21, 1, 54, 39]]) torch.Size([32, 65]) tensor(5.0364, device='cuda:0', grad_fn=<NllLossBackward0>) |
logits are the unnormalized scores the model produces before making a prediction; they reflect the relative likelihood of the different outcomes. How should we read the shape of logits? Each element of xb (4x8), i.e., each token (the index of a character in the vocabulary), is passed through nn.Embedding in forward and yields a vector of size 65 (the vocabulary size). Each element of that vector is the (unnormalized) likelihood that the token it represents follows the current token.
Taking the first token as an example, its length-65 vector of per-character likelihoods can be printed with:
print(logits[0].shape)
print(logits[0])
The full code of setup5.2.py:
import torch import torch.nn as nn from torch.nn import functional as F torch.manual_seed(1337) # 二元语言模型实现 class BigramLanguageModelV1(nn.Module): def __init__(self, vocab_size): super().__init__() # 每个词直接从一个查找表中获取下一个词的logits值 # logits是模型做出预测前的一组未经归一化的分数,反映了不同结果的相对可能性 self.token_embedding_table = nn.Embedding(vocab_size, vocab_size) # 模型前向传播 # idx:即前面的 x,表示输入数据,词在词汇表中的索引的向量 # targets:训练的目标输出,比如正确的下一个词的索引 def forward(self, idx, targets=None): # idx and targets are both (B,T) tensor of integers logits = self.token_embedding_table(idx) # (B,T,C) if targets is None: loss = None else: B, T, C = logits.shape logits = logits.view(B*T, C) targets = targets.view(B*T) loss = F.cross_entropy(logits, targets) return logits, loss # 在模型已经训练好之后,根据给定的输入生成文本的方法。 def generate(self, idx, max_new_tokens): # idx is (B, T) array of indices in the current context for _ in range(max_new_tokens): # get the predictions logits, loss = self(idx) # focus only on the last time step logits = logits[:, -1, :] # becomes (B, C) # apply softmax to get probabilities probs = F.softmax(logits, dim=-1) # (B, C) # sample from the distribution idx_next = torch.multinomial(probs, num_samples=1) # (B, 1) # append sampled index to the running sequence idx = torch.cat((idx, idx_next), dim=1) # (B, T+1) return idx # let's now encode the entire text dataset and store it into a torch.Tensor with open('data/input.txt', 'r', encoding='utf-8') as f: text = f.read() chars = sorted(list(set(text))) vocab_size = len(chars) # create a mapping from characters to integers stoi = { ch:i for i,ch in enumerate(chars) } itos = { i:ch for i,ch in enumerate(chars) } # encoder: take a string, output a list of integers encode = lambda s: [stoi[c] for c in s] data = torch.tensor(encode(text), dtype=torch.long) # Let's now split up the data into train and validation sets n = int(0.9*len(data)) # first 90% will be train, rest val train_data = data[:n] val_data = data[n:] torch.manual_seed(1337) # how many independent sequences will we process in parallel? batch_size = 4 # what is the maximum context length for predictions? block_size = 8 def get_batch(split): # generate a small batch of data of inputs x and targets y data = train_data if split == 'train' else val_data ix = torch.randint(len(data) - block_size, (batch_size,)) x = torch.stack([data[i:i+block_size] for i in ix]) y = torch.stack([data[i+1:i+block_size+1] for i in ix]) return x, y xb, yb = get_batch('train') print('inputs:') print(xb.shape) print(xb) print('targets:') print(yb.shape) print(yb) # get device device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') # create model m = BigramLanguageModelV1(vocab_size).to(device) logits, loss = m(xb.to(device), yb.to(device)) print(logits[0].shape) print(logits[0]) |
Run python setup5.2.py:
$ python setup5.2.py inputs: torch.Size([4, 8]) tensor([[24, 43, 58, 5, 57, 1, 46, 43], [44, 53, 56, 1, 58, 46, 39, 58], [52, 58, 1, 58, 46, 39, 58, 1], [25, 17, 27, 10, 0, 21, 1, 54]]) targets: torch.Size([4, 8]) tensor([[43, 58, 5, 57, 1, 46, 43, 39], [53, 56, 1, 58, 46, 39, 58, 1], [58, 1, 58, 46, 39, 58, 1, 46], [17, 27, 10, 0, 21, 1, 54, 39]]) torch.Size([65]) tensor([ 1.6347, -0.0518, 0.4996, 0.7216, 0.5085, -0.7719, 0.2388, 0.3138, 0.2178, 0.0328, -0.1699, 1.0659, 0.7200, -0.6166, 0.0806, 2.5231, -1.4623, 2.1707, 0.1624, 1.0296, -1.1377, 0.5856, 0.0173, 0.3136, 1.0124, 1.5122, -0.3359, 0.2456, -0.3773, 0.1587, 2.1503, -1.5131, -0.9552, -0.8995, -0.9583, -0.5945, 0.5850, 0.5266, 0.7615, 0.5331, 1.1796, 1.3316, -0.2094, 0.0960, -0.6945, 0.5669, -0.5883, 1.4064, -1.2537, -1.5195, 0.7446, 1.1914, 0.1801, 1.2333, -0.2299, -0.1531, 0.8408, -0.3993, -0.6126, -0.6597, 0.5906, 1.1219, 0.2432, 1.1519, 0.9950], device='cuda:0', grad_fn=<SelectBackward0>) |
The index of the largest element in this vector is the most likely next character. Note that the model has not been trained at all yet, so it is in a completely scrambled state and it is normal for its predictions not to match reality.
With this embedding vector we can now compute the error between the model's prediction and the target (the actual next token). Cross-entropy loss is used here.
Cross-entropy loss (CE) is a widely used loss function, especially for classification problems in machine learning and deep learning. It measures how similar the model's predicted probability distribution is to the true distribution: the smaller the cross-entropy, the closer the two distributions and the better the model performs.
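As a quick sanity check on the 5.0364 loss printed above (a minimal sketch, not one of the numbered setup scripts; it assumes the 65-character vocabulary): an untrained model should spread its probability roughly uniformly over the 65 characters, so the expected cross-entropy is about -ln(1/65) ≈ 4.17, and the initial loss should be in that neighbourhood.

import math
import torch
from torch.nn import functional as F

vocab_size = 65
logits = torch.zeros(1, vocab_size)     # all-equal logits -> uniform distribution
target = torch.tensor([0])              # any target class gives the same loss here
print(F.cross_entropy(logits, target))  # tensor(4.1744)
print(-math.log(1 / vocab_size))        # 4.174387...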
That covers the forward pass. Next, let's call generate to produce some text:
print(
    decode(
        m.generate(
            idx=torch.zeros((1, 1), dtype=torch.long).to(device),
            max_new_tokens=12)[0].tolist()))
The full code of setup5.3.py:
import torch import torch.nn as nn from torch.nn import functional as F torch.manual_seed(1337) # 二元语言模型实现 class BigramLanguageModelV1(nn.Module): def __init__(self, vocab_size): super().__init__() # 每个词直接从一个查找表中获取下一个词的logits值 # logits是模型做出预测前的一组未经归一化的分数,反映了不同结果的相对可能性 self.token_embedding_table = nn.Embedding(vocab_size, vocab_size) # 模型前向传播 # idx:即前面的 x,表示输入数据,词在词汇表中的索引的向量 # targets:训练的目标输出,比如正确的下一个词的索引 def forward(self, idx, targets=None): # idx and targets are both (B,T) tensor of integers logits = self.token_embedding_table(idx) # (B,T,C) if targets is None: loss = None else: B, T, C = logits.shape logits = logits.view(B*T, C) targets = targets.view(B*T) loss = F.cross_entropy(logits, targets) return logits, loss # 在模型已经训练好之后,根据给定的输入生成文本的方法。 def generate(self, idx, max_new_tokens): # idx is (B, T) array of indices in the current context for _ in range(max_new_tokens): # get the predictions logits, loss = self(idx) # focus only on the last time step logits = logits[:, -1, :] # becomes (B, C) # apply softmax to get probabilities probs = F.softmax(logits, dim=-1) # (B, C) # sample from the distribution idx_next = torch.multinomial(probs, num_samples=1) # (B, 1) # append sampled index to the running sequence idx = torch.cat((idx, idx_next), dim=1) # (B, T+1) return idx # let's now encode the entire text dataset and store it into a torch.Tensor with open('data/input.txt', 'r', encoding='utf-8') as f: text = f.read() chars = sorted(list(set(text))) vocab_size = len(chars) # create a mapping from characters to integers stoi = { ch:i for i,ch in enumerate(chars) } itos = { i:ch for i,ch in enumerate(chars) } # encoder: take a string, output a list of integers encode = lambda s: [stoi[c] for c in s] # decoder: take a list of integers, output a string decode = lambda l: ''.join([itos[i] for i in l]) data = torch.tensor(encode(text), dtype=torch.long) # Let's now split up the data into train and validation sets n = int(0.9*len(data)) # first 90% will be train, rest val train_data = data[:n] val_data = data[n:] torch.manual_seed(1337) # how many independent sequences will we process in parallel? batch_size = 4 # what is the maximum context length for predictions? block_size = 8 def get_batch(split): # generate a small batch of data of inputs x and targets y data = train_data if split == 'train' else val_data ix = torch.randint(len(data) - block_size, (batch_size,)) x = torch.stack([data[i:i+block_size] for i in ix]) y = torch.stack([data[i+1:i+block_size+1] for i in ix]) return x, y xb, yb = get_batch('train') print('inputs:') print(xb.shape) print(xb) print('targets:') print(yb.shape) print(yb) # get device device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') # create model m = BigramLanguageModelV1(vocab_size).to(device) logits, loss = m(xb.to(device), yb.to(device)) print(logits.shape) print(loss) #print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist())) print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long).to(device), max_new_tokens=12)[0].tolist())) |
Run python setup5.3.py:
$ python setup5.3.py inputs: torch.Size([4, 8]) tensor([[24, 43, 58, 5, 57, 1, 46, 43], [44, 53, 56, 1, 58, 46, 39, 58], [52, 58, 1, 58, 46, 39, 58, 1], [25, 17, 27, 10, 0, 21, 1, 54]]) targets: torch.Size([4, 8]) tensor([[43, 58, 5, 57, 1, 46, 43, 39], [53, 56, 1, 58, 46, 39, 58, 1], [58, 1, 58, 46, 39, 58, 1, 46], [17, 27, 10, 0, 21, 1, 54, 39]]) torch.Size([32, 65]) tensor(5.0364, device='cuda:0', grad_fn=<NllLossBackward0>) yq$;tfBfROkN |
6. Training BigramLanguageModel V1
Use the following code to train the model for 10,000 steps, then have the trained model generate a sequence of 100 tokens.
The full code of setup6.1.py:
import torch import torch.nn as nn from torch.nn import functional as F torch.manual_seed(1337) # 二元语言模型实现 class BigramLanguageModelV1(nn.Module): def __init__(self, vocab_size): super().__init__() # 每个词直接从一个查找表中获取下一个词的logits值 # logits是模型做出预测前的一组未经归一化的分数,反映了不同结果的相对可能性 self.token_embedding_table = nn.Embedding(vocab_size, vocab_size) # 模型前向传播 # idx:即前面的 x,表示输入数据,词在词汇表中的索引的向量 # targets:训练的目标输出,比如正确的下一个词的索引 def forward(self, idx, targets=None): # idx and targets are both (B,T) tensor of integers logits = self.token_embedding_table(idx) # (B,T,C) if targets is None: loss = None else: B, T, C = logits.shape logits = logits.view(B*T, C) targets = targets.view(B*T) loss = F.cross_entropy(logits, targets) return logits, loss # 在模型已经训练好之后,根据给定的输入生成文本的方法。 def generate(self, idx, max_new_tokens): # idx is (B, T) array of indices in the current context for _ in range(max_new_tokens): # get the predictions logits, loss = self(idx) # focus only on the last time step logits = logits[:, -1, :] # becomes (B, C) # apply softmax to get probabilities probs = F.softmax(logits, dim=-1) # (B, C) # sample from the distribution idx_next = torch.multinomial(probs, num_samples=1) # (B, 1) # append sampled index to the running sequence idx = torch.cat((idx, idx_next), dim=1) # (B, T+1) return idx # let's now encode the entire text dataset and store it into a torch.Tensor with open('data/input.txt', 'r', encoding='utf-8') as f: text = f.read() chars = sorted(list(set(text))) vocab_size = len(chars) # create a mapping from characters to integers stoi = { ch:i for i,ch in enumerate(chars) } itos = { i:ch for i,ch in enumerate(chars) } # encoder: take a string, output a list of integers encode = lambda s: [stoi[c] for c in s] # decoder: take a list of integers, output a string decode = lambda l: ''.join([itos[i] for i in l]) data = torch.tensor(encode(text), dtype=torch.long) # Let's now split up the data into train and validation sets n = int(0.9*len(data)) # first 90% will be train, rest val train_data = data[:n] val_data = data[n:] torch.manual_seed(1337) # how many independent sequences will we process in parallel? batch_size = 4 # what is the maximum context length for predictions? block_size = 8 def get_batch(split): # generate a small batch of data of inputs x and targets y data = train_data if split == 'train' else val_data ix = torch.randint(len(data) - block_size, (batch_size,)) x = torch.stack([data[i:i+block_size] for i in ix]) y = torch.stack([data[i+1:i+block_size+1] for i in ix]) return x, y # get device device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') # create model m = BigramLanguageModelV1(vocab_size).to(device) # create a PyTorch optimizer optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3) from tqdm import tqdm for steps in tqdm(range(10000)): # increase number of steps for good results... # sample a batch of data xb, yb = get_batch('train') # evaluate the loss logits, loss = m(xb.to(device), yb.to(device)) optimizer.zero_grad(set_to_none=True) loss.backward() optimizer.step() print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long).to(device), max_new_tokens=100)[0].tolist())) |
Run python setup6.1.py:
$ python setup6.1.py 100%|████████████████████████████████████████| 10000/10000 [00:06<00:00, 1452.75it/s] 2.527495861053467 CExfikRO: wcowindakOLOLETHAK HAPOFouBayou e. S:gO:33SA: LTauss: WanthafNusqhe, vet?ar dXlasoate |
The output is still gibberish, but it already has a hint of the dialogue in the Tiny Shakespeare script.
Train for another 10,000 steps (20,000 in total).
The modified line in setup6.1.py:
for steps in tqdm(range(20000)): # increase number of steps for good results...
Run python setup6.1.py:
$ python setup6.1.py 100%|████████████████████████████████████████| 20000/20000 [00:13<00:00, 1531.06it/s] 2.4393372535705566 CExfik brid owindakis by bth HAPOFourayou e. S: O:33SA: LUCous: Wanthar u qur, vet? F dilasoate tony@TONYP15GEN2:/mnt/d/OpenAI/CreateGPT$ |
Train for 100,000 steps.
The modified line in setup6.1.py:
for steps in tqdm(range(100000)): # increase number of steps for good results...
Run python setup6.1.py:
$ python setup6.1.py 100%|██████████████████████████████████████| 100000/100000 [01:06<00:00, 1513.27it/s] 2.6740522384643555 CExthy brid owindakis by bth Hiset bube d e. S: O: IS: Falatanss: Wanthar u qur, vet? F dilasoate |
7. Implementing the Incremental Aggregation with Matrix Multiplication
A toy example shows how matrix multiplication can perform a "weighted aggregation".
The full code of setup7.1.py:
# toy example illustrating how matrix multiplication can be used for a "weighted aggregation"
import torch
torch.manual_seed(42)

a = torch.tril(torch.ones(3, 3))
a = a / torch.sum(a, 1, keepdim=True)
b = torch.randint(0,10,(3,2)).float()
c = a @ b
print('a=')
print(a)
print('--')
print('b=')
print(b)
print('--')
print('c=')
print(c)
Run python setup7.1.py:
$ python setup7.1.py
a=
tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
--
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
--
c=
tensor([[2.0000, 7.0000],
        [4.0000, 5.5000],
        [4.6667, 5.3333]])
GPT contains a self-attention mechanism. Put simply, it applies weights within each of the prediction rounds shown above:
How do we implement this incremental aggregation? The obvious approach is a loop, but that is inefficient. "The mathematical trick" is to replace the loop with a single matrix operation, which is far more efficient: everything is computed in one shot.
The matrix used here is triangular: the lower triangle holds the weights and the upper triangle is all zeros:
weight  0       0       0
weight  weight  0       0
weight  weight  weight  0
weight  weight  weight  weight
Multiplying each row of this matrix with the column of inputs multiplies exactly the first few elements of that round by their weights, which is just what we want.
To demonstrate this, we introduce a new hyperparameter, channels:
B,T,C = 4,8,2 # batch, time, channels
x = torch.randn(B,T,C)
x.shape
The "channel" dimension refers, in neural networks and especially in self-attention, to the axis of the data that holds the number of features in the input.
For example, in computer vision a colour image typically has 3 channels, one each for red, green, and blue (RGB). In natural language processing and in the context of Transformer models, "channels" usually means the dimensionality of the embedding vector, i.e., the size of the vector each word or token is represented as. These embeddings are points in a high-dimensional space, and each dimension (or "channel") can be seen as capturing some particular aspect of the input data.
Here we use C=2 channels for illustration.
Two matrix-based formulations are described below. The first:
The full code of setup7.2.py:
import torch

B,T,C = 4,8,2 # batch, time, channels
x = torch.randn(B,T,C)
#x.shape

# version 2: using matrix multiply for a weighted aggregation
# build an 8x8 lower-triangular matrix;
# in a lower-triangular matrix, every element above the main diagonal is 0
wei = torch.tril(torch.ones(T, T))
# divide each row by its sum so that every row sums to 1;
# this turns `wei` into a weight matrix that can take a weighted average of the input `x`
wei = wei / wei.sum(1, keepdim=True)
# apply the weight matrix `wei` to the input `x` with matrix multiplication (the `@` operator);
# this takes a weighted average over the rows of `x`, with the weights given by the matching row of `wei`;
# the result `xbow2` has shape (B, T, C)
xbow2 = wei @ x # (B, T, T) @ (B, T, C) ----> (B, T, C)
print(xbow2)
Run python setup7.2.py:
$ python setup7.2.py tensor([[[-2.2734, 1.9014], [-1.2950, 0.5992], [-0.3070, 0.6050], [-0.1546, 0.5280], [-0.0619, 0.6683], [-0.0036, 0.7324], [ 0.2498, 0.5098], [ 0.1783, 0.4924]], [[-1.4634, -1.2901], [-1.4572, -0.7144], [-0.8114, -0.4710], [-0.2247, -0.5062], [-0.2281, -0.8140], [-0.2089, -0.7699], [-0.3723, -0.6599], [-0.3884, -0.5638]], [[-0.4514, 0.1597], [ 0.4560, -0.0235], [-0.3107, 0.9944], [-0.5348, 0.9982], [-0.5057, 0.9492], [-0.2476, 0.8629], [-0.1491, 0.8491], [-0.1891, 0.5784]], [[ 0.8197, 0.9400], [ 0.7682, 0.0250], [ 0.4424, 0.3952], [ 0.1836, 0.5423], [ 0.1700, 0.1112], [ 0.1238, -0.2225], [-0.0805, -0.2310], [-0.1978, 0.0533]]]) |
The main purpose of this code is to build a lower-triangular matrix and use it to take a weighted average of the input data x.
With that, the per-round incremental processing of the input sequence is complete, with weighting applied in every round. Here is a second, equivalent formulation:
The full code of setup7.3.py:
import torch
from torch.nn import functional as F

B,T,C = 4,8,2 # batch, time, channels
x = torch.randn(B,T,C)
#x.shape

# first create a T x T matrix of ones with `torch.ones(T, T)`,
# then turn it into a lower-triangular matrix with `torch.tril`;
# in a lower-triangular matrix, every element above the main diagonal is 0
tril = torch.tril(torch.ones(T, T))
# create a T x T matrix of zeros to hold the weights
wei = torch.zeros((T,T))
# use `masked_fill` to set the positions of `wei` where `tril` is 0 to negative infinity,
# so that the following softmax turns those weights into 0
wei = wei.masked_fill(tril == 0, float('-inf'))
# softmax turns `wei` into a weight matrix suitable for a weighted average of `x`:
# it maps each row to positive values that sum to 1
wei = F.softmax(wei, dim=-1)
# apply the weight matrix `wei` to the input `x` with matrix multiplication (the `@` operator);
# this is a weighted average over the rows of `x`, with weights from the matching row of `wei`;
# the result `xbow3` has shape (B, T, C)
xbow3 = wei @ x
print(xbow3)
Run python setup7.3.py:
$ python setup7.3.py tensor([[[ 0.4182, 0.7875], [-0.5942, 0.5962], [-0.9075, 0.1195], [-0.4145, -0.1805], [-0.9673, 0.2452], [-0.7444, 0.3172], [-0.6002, 0.3314], [-0.5139, 0.2772]], [[ 1.0123, 0.4578], [ 0.2628, 0.7642], [ 0.5665, 0.6021], [ 0.4955, 0.6483], [ 0.8661, 0.6308], [ 0.7084, 0.4466], [ 0.5995, 0.5258], [ 0.4570, 0.5571]], [[ 0.6116, -0.9643], [ 0.0972, 1.0087], [ 0.5342, 0.8562], [ 0.4989, 0.6730], [ 0.4848, 0.5810], [ 0.3251, 0.3983], [ 0.2218, 0.3912], [ 0.2258, 0.3659]], [[ 0.1742, -0.7519], [ 0.9666, -0.1328], [ 0.6983, -0.2136], [ 0.4448, -0.7202], [ 0.3096, -0.5783], [ 0.2398, -0.6006], [ 0.1999, -0.3986], [ 0.2412, -0.2943]]]) |
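As a sanity check (a minimal sketch, not one of the numbered setup scripts), the explicit double loop that the "mathematical trick" replaces can be written out and compared against the matrix-multiplication version; torch.allclose should report that the two give the same result:

import torch

torch.manual_seed(1337)
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)

# version 1: explicit loops -- average x[b, :t+1] for every position t
xbow = torch.zeros((B, T, C))
for b in range(B):
    for t in range(T):
        xbow[b, t] = x[b, :t+1].mean(0)

# version 2: the same averaging expressed as one matrix multiply
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
xbow2 = wei @ x   # (T, T) @ (B, T, C) -> (B, T, C)

print(torch.allclose(xbow, xbow2))   # expected: True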
8. Implementing Masked Self-Attention
Now let's implement masked self-attention.
The full code of setup8.1.py:
import torch
import torch.nn as nn
from torch.nn import functional as F

# version 4: self-attention!
torch.manual_seed(1337)

# define batch size (B), time steps (T) and channels (C),
# then create a random tensor `x` of shape (B, T, C) as mock input
B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C)

# let's see a single Head perform self-attention
# these lines define the key parts of self-attention: key, query and value;
# each is a linear projection from the input feature dimension (C) to the head size (head_size)
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
# pass the input `x` through the key and query projections
k = key(x)   # (B, T, 16)
q = query(x) # (B, T, 16)
# the dot product of queries and keys gives the weight matrix `wei`
wei = q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) ---> (B, T, T)

# build a lower-triangular matrix
tril = torch.tril(torch.ones(T, T))
#wei = torch.zeros((T,T))
# fill the positions of `wei` where the lower-triangular matrix is 0 with negative infinity
wei = wei.masked_fill(tril == 0, float('-inf'))
# softmax over `wei` so that every row sums to 1
wei = F.softmax(wei, dim=-1)

# pass the input `x` through the value projection
v = value(x)
# multiply the weight matrix `wei` with the values `v` to get the output `out`
out = wei @ v

print(out)
Run python setup8.1.py:
$ python setup8.1.py tensor([[[-1.5713e-01, 8.8009e-01, 1.6152e-01, -7.8239e-01, -1.4289e-01, 7.4676e-01, 1.0068e-01, -5.2395e-01, -8.8726e-01, 1.9068e-01, 1.7616e-01, -5.9426e-01, -4.8124e-01, -4.8599e-01, 2.8623e-01, 5.7099e-01], [ 6.7643e-01, -5.4770e-01, -2.4780e-01, 3.1430e-01, -1.2799e-01, -2.9521e-01, -4.2962e-01, -1.0891e-01, -4.9282e-02, 7.2679e-01, 7.1296e-01, -1.1639e-01, 3.2665e-01, 3.4315e-01, -7.0975e-02, 1.2716e+00], [ 4.8227e-01, -1.0688e-01, -4.0555e-01, 1.7696e-01, 1.5811e-01, -1.6967e-01, 1.6217e-02, 2.1509e-02, -2.4903e-01, -3.7725e-01, 2.7867e-01, 1.6295e-01, -2.8951e-01, -6.7610e-02, -1.4162e-01, 1.2194e+00], [ 1.9708e-01, 2.8561e-01, -1.3028e-01, -2.6552e-01, 6.6781e-02, 1.9535e-01, 2.8073e-02, -2.4511e-01, -4.6466e-01, 6.9287e-02, 1.5284e-01, -2.0324e-01, -2.4789e-01, -1.6213e-01, 1.9474e-01, 7.6778e-01], [ 2.5104e-01, 7.3457e-01, 5.9385e-01, 2.5159e-01, 2.6064e-01, 7.5820e-01, 5.5947e-01, 3.5387e-01, -5.9338e-01, -1.0807e+00, -3.1110e-01, -2.7809e-01, -9.0541e-01, 1.3181e-01, -1.3818e-01, 6.3715e-01], [ 3.4277e-01, 4.9605e-01, 4.7248e-01, 3.0277e-01, 1.8440e-01, 5.8144e-01, 3.8245e-01, 2.9521e-01, -4.8969e-01, -7.7051e-01, -1.1721e-01, -2.5412e-01, -6.8921e-01, 1.9795e-01, -1.5135e-01, 7.6659e-01], [ 1.8658e-01, -9.6351e-02, -1.4300e-01, 3.0587e-01, 8.3441e-02, -6.8646e-03, -2.0472e-01, -1.5350e-01, -7.6250e-02, 3.2689e-01, 3.0896e-01, 7.6626e-02, 9.9243e-02, 1.6560e-01, 1.9745e-01, 7.6248e-01], [ 1.3013e-01, -3.2832e-02, -4.9645e-01, 2.8652e-01, 2.7042e-01, -2.6357e-01, -7.3756e-02, 3.7857e-01, 7.4580e-02, 3.3827e-02, 1.4695e-02, 3.1937e-01, 2.9926e-01, -1.6530e-01, -3.8630e-02, 3.3748e-01]], [[-1.3254e+00, 1.1236e+00, 2.2927e-01, -2.9970e-01, -7.6267e-03, 7.9364e-01, 8.9581e-01, 3.9650e-01, -6.6613e-01, -2.1844e-01, -1.3539e+00, 4.1245e-01, 9.6011e-01, -1.0805e+00, -3.9751e-01, -4.4439e-01], [-3.8338e-01, -1.9659e-01, 8.8455e-02, 1.8560e-01, -8.7010e-02, 1.3239e-01, 3.0841e-01, -2.4350e-01, -1.9396e-01, -1.7634e-02, 4.8439e-01, 5.4210e-01, -2.0407e-02, -4.2467e-01, -2.3463e-01, -4.6465e-01], [-1.1100e+00, 3.2334e-01, 4.7054e-01, -6.3595e-02, 2.5443e-01, 1.5352e-01, 2.5186e-01, 2.6286e-01, 2.7916e-01, -3.1662e-03, -3.2880e-02, 4.8191e-01, 7.4431e-01, -1.9921e-01, 2.7134e-01, -8.5871e-02], [-9.7190e-01, 4.6124e-01, 4.2349e-01, -1.7230e-02, 1.5847e-01, 4.1175e-01, 4.0764e-01, 2.4982e-01, -5.0322e-02, 4.1514e-03, -3.9853e-01, 4.3551e-01, 7.0285e-01, -4.3081e-01, 2.6684e-02, -2.0169e-01], [ 3.3586e-01, -8.5915e-02, 9.3660e-01, 7.7311e-01, 1.8037e-01, 8.2853e-01, -6.9183e-02, 2.8814e-01, 1.1734e-01, 6.8448e-01, -5.8500e-02, 1.2726e-01, 2.9780e-01, 1.9324e-01, 1.5655e-01, -9.3004e-03], [ 1.6984e-01, 3.0993e-02, 8.1557e-01, 6.1679e-01, 1.0429e-01, 7.4573e-01, 2.3072e-02, 3.0572e-01, 5.8163e-02, 5.7122e-01, -4.5275e-02, 1.5051e-01, 3.2901e-01, 5.6984e-02, 1.0311e-01, -9.9174e-02], [ 4.6497e-02, 1.5765e-01, 3.9760e-01, 1.7619e-01, -2.1168e-01, 2.3365e-01, -6.2083e-02, 2.1726e-01, -7.8725e-03, 4.5389e-01, 3.4349e-01, -5.5631e-02, 3.3726e-01, -3.7591e-01, -1.0140e-02, -4.5806e-01], [-5.3896e-01, 7.5555e-01, 3.3034e-01, -1.5849e-01, -2.6740e-01, 4.3495e-01, 3.7772e-01, 5.5794e-01, -1.8369e-01, 1.5938e-01, -2.1042e-01, 5.5790e-02, 6.3184e-01, -6.4884e-01, -9.6084e-02, -5.0751e-01]], [[ 6.8925e-02, 1.2248e+00, -4.1194e-01, -1.7046e-01, -6.9224e-01, -2.9201e-01, 1.2704e+00, -6.8596e-01, 4.3798e-01, -2.6366e-01, 1.1528e-01, 1.1676e+00, -7.2138e-01, -1.2308e+00, 8.3821e-01, -5.5987e-01], [-4.6375e-01, 6.3807e-01, -1.5842e-01, -1.3309e-01, -5.9402e-01, -5.0374e-01, 2.3289e-01, 
-3.2126e-01, 4.5781e-01, -1.8590e-01, 1.9215e-01, 3.7566e-01, -3.5905e-01, -7.7262e-01, 3.5036e-01, 6.9694e-02], [-6.4044e-01, 1.3831e-01, -6.1007e-02, -1.1112e-01, -4.5228e-01, -6.2271e-01, -1.7030e-01, -2.4949e-01, 5.0670e-01, -9.6444e-02, 4.8315e-01, 9.4986e-02, -2.9810e-01, -3.6538e-01, 3.9458e-01, 4.1512e-01], [-6.7193e-01, 1.2516e-01, 7.3386e-02, -1.3198e-01, -1.7880e-01, -5.6740e-01, -6.8226e-01, 5.0844e-02, 3.3051e-01, 7.8242e-02, 6.8022e-02, -2.4041e-01, -6.6864e-02, -1.8411e-01, -5.3514e-02, 4.5113e-01], [-1.4270e-02, 1.0195e+00, -3.4792e-01, -1.6421e-01, -5.5846e-01, -3.2457e-01, 9.9404e-01, -5.6891e-01, 4.0097e-01, -1.8123e-01, 1.1856e-01, 9.8704e-01, -6.4057e-01, -1.0320e+00, 7.3320e-01, -4.3167e-01], [-6.3858e-01, -7.6533e-02, -3.6510e-01, 1.7782e-01, -6.5426e-02, -3.5158e-01, 7.9591e-02, 1.7384e-01, 3.6676e-01, -4.2302e-02, 2.4923e-01, 4.8239e-01, -2.1295e-01, -2.9492e-01, 3.4749e-01, -1.7111e-01], [-2.2366e-01, -5.5317e-02, -1.8296e-01, 2.4258e-01, 2.5357e-01, -1.6154e-01, -2.3908e-01, 3.3243e-01, 1.0304e-01, 2.6067e-01, -5.0670e-02, 3.6947e-01, -4.9856e-02, 1.1197e-01, 1.1752e-01, -2.5078e-01], [-2.4821e-01, 1.4845e-01, -3.5033e-01, 1.7102e-01, 1.6613e-01, -2.0643e-01, 8.6633e-02, 8.8414e-02, 2.1188e-01, 2.5805e-01, 5.5146e-02, 4.2668e-01, -2.0443e-01, -1.7372e-01, 3.8899e-01, 5.1725e-02]], [[ 9.7183e-02, 5.7301e-02, -1.0468e-01, -4.6654e-02, -1.4006e-01, -8.4126e-01, -1.3625e-01, -6.7465e-01, -2.1541e-01, 1.0993e+00, 2.3427e-01, 3.2605e-02, -1.8521e-01, 1.4780e-01, -6.1045e-01, 1.5391e+00], [ 1.9305e-01, -2.1031e-01, -3.4658e-01, 2.0567e-01, -1.7799e-01, -7.4604e-01, -6.4427e-01, -6.9183e-01, -2.0558e-01, 7.0413e-01, 2.3632e-01, 9.8800e-04, -1.7015e-01, 1.1203e-01, -7.1064e-01, 1.2431e+00], [ 2.9114e-01, -4.8343e-01, -5.9254e-01, 4.6477e-01, -2.1832e-01, -6.4460e-01, -1.1627e+00, -7.0993e-01, -1.9703e-01, 2.9262e-01, 2.3669e-01, -3.1050e-02, -1.5471e-01, 7.7153e-02, -8.1137e-01, 9.3578e-01], [ 1.7549e-01, -3.4260e-02, -2.0523e-01, 2.7644e-02, -2.1312e-01, -5.6022e-01, -3.5273e-01, -6.2722e-01, -3.0037e-01, 4.6061e-01, 1.5004e-01, 1.9040e-02, -1.4646e-01, 1.7220e-01, -6.2559e-01, 1.0722e+00], [ 1.7354e-01, -1.7962e-01, -2.7874e-01, -1.0590e-01, -1.2952e-01, -3.5086e-01, -5.5830e-01, -3.8638e-01, -2.9719e-01, 3.3368e-02, 1.7392e-01, 5.5898e-02, -7.2007e-02, 1.3182e-02, -6.6710e-01, 5.4229e-01], [ 2.4678e-01, -4.7274e-01, -5.2827e-01, 3.1212e-01, -1.7528e-01, -4.8636e-01, -1.1223e+00, -5.4196e-01, -2.0142e-01, 4.0103e-02, 2.2231e-01, -2.9380e-02, -9.4353e-02, 2.6374e-02, -7.8726e-01, 6.2836e-01], [-3.9784e-01, 2.5915e-01, 5.0358e-01, -4.6864e-01, -2.2024e-02, -3.2242e-01, -1.2578e-01, 1.0634e-01, 1.3618e-01, 1.7780e-01, 1.0391e-01, -6.2540e-01, 3.8904e-01, 3.3690e-01, -5.5140e-01, 5.2246e-01], [-3.5927e-01, 3.3935e-02, -2.9863e-02, -1.5019e-01, -6.0354e-03, -6.5733e-02, -3.9659e-01, -6.0435e-02, -5.7551e-01, -2.9157e-01, 1.4899e-01, -7.5002e-02, 7.3228e-02, -4.7413e-02, -6.4394e-01, 2.8560e-01]]], grad_fn=<UnsafeViewBackward0>) |
This code implements masked self-attention. Masked self-attention lets the model consider only the information at or before the current position when processing a sequence; it is commonly used in tasks such as text generation to avoid leaking future information.
The three core components of self-attention, query, key, and value, all come from the same input x. Each is produced by its own linear transformation (nn.Linear), mapping the input into a different representation space. This is the standard way to implement attention; it lets the model learn how to represent the data most effectively.
If you remove the line that fills certain positions of the weight matrix wei with negative infinity (wei.masked_fill(tril == 0, float('-inf'))), the implementation is no longer masked self-attention but a standard self-attention mechanism, in which every element of the sequence can "attend" to every other element rather than only to the elements before it.
A key property of self-attention is that the query, key, and value vectors are all derived from the same input x. This is what allows self-attention to find relationships among the elements within a single input.
Note: if the query comes from x while the key and value come from a different input y, you get another attention variant, cross-attention. In a cross-attention setup, the query vectors come from one input (e.g., x) while the key and value vectors come from another input (e.g., y). This mechanism is commonly used when two different sequences must interact, for example in machine translation, where the model relates the source-language sentence (x) to the target-language sentence (y).
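To make the difference concrete, here is a minimal cross-attention sketch (an illustration only, not part of the original scripts): the only change from the self-attention code above is that the key and value projections read from a second input y while the query still reads from x; the causal mask is also dropped, since cross-attention normally attends to the entire source sequence.

import torch
import torch.nn as nn
from torch.nn import functional as F

B, T_x, T_y, C, head_size = 4, 8, 10, 32, 16
x = torch.randn(B, T_x, C)   # "target" sequence: provides the queries
y = torch.randn(B, T_y, C)   # "source" sequence: provides keys and values

query = nn.Linear(C, head_size, bias=False)
key   = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

q = query(x)                                      # (B, T_x, head_size)
k = key(y)                                        # (B, T_y, head_size)
v = value(y)                                      # (B, T_y, head_size)

wei = q @ k.transpose(-2, -1) * head_size**-0.5   # (B, T_x, T_y)
wei = F.softmax(wei, dim=-1)
out = wei @ v                                     # (B, T_x, head_size)
print(out.shape)                                  # torch.Size([4, 8, 16])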
The last few lines of this code (from building the lower-triangular matrix to the softmax normalization) should already look familiar. What is new is the introduction of the query, key, and value projections, which together form a single-head self-attention mechanism. Note that channels is now 32; C always tracks the size of the token embedding.
9. Weight Normalization for Softmax
The formula in the original paper, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, contains a denominator, sqrt(d_k):
The softmax function: softmax converts a vector of real numbers into a probability distribution. For any real-valued vector, it squashes each element into the range [0, 1] and makes all elements sum to 1. This is very useful in multi-class classification, especially at a model's output layer, where it represents a probability distribution.
Softmax in attention: in the attention mechanism, softmax computes the attention weights, i.e., how much "attention" each element of the sequence should receive when producing the output. Through softmax, the model decides which elements to emphasise when aggregating information.
Weight normalization for softmax: weight normalization rescales a weight vector so that it has certain statistical properties (for example, unit variance). In the attention context this is achieved by scaling the dot products of queries and keys, which shapes the input to the softmax function.
Why is this normalization needed?
- Avoiding softmax saturation: without normalization, when head_size (the dimension of each attention head) is large, the query-key dot products can become very large, stretching the range of the softmax input. The softmax output then becomes extreme: most of the attention mass concentrates on a few positions while the rest are essentially ignored.
- Keeping gradients stable: controlling the scale of the softmax input helps keep gradients stable and avoids exploding or vanishing gradients during training.
How is the normalization implemented?
- Divide the scores by sqrt(d_k), where d_k is head_size. This keeps the variance of the dot products around 1 even when head_size is large, shrinking the range of the softmax input.
The corresponding code:
# compute attention scores ("affinities")
wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)
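A small numerical sketch of why the C**-0.5 factor matters (an illustration, not one of the numbered setup scripts): with random unit-variance queries and keys, the variance of the raw dot products grows with head_size, and softmax over large-magnitude scores collapses toward a one-hot vector.

import torch
from torch.nn import functional as F

torch.manual_seed(1337)
head_size = 16
q = torch.randn(1000, head_size)
k = torch.randn(1000, head_size)

wei_raw    = q @ k.transpose(-2, -1)                     # unscaled scores
wei_scaled = q @ k.transpose(-2, -1) * head_size**-0.5   # scaled scores
print(wei_raw.var(), wei_scaled.var())   # variance near head_size vs. near 1

# softmax of large-magnitude inputs becomes very peaked
v = torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])
print(F.softmax(v, dim=-1))       # fairly diffuse weights
print(F.softmax(v * 8, dim=-1))   # sharpened toward the largest value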
10. The Single-Head Self-Attention Module
With the background from the previous sections, a single-head attention module is implemented as follows:
class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,C)
        q = self.query(x) # (B,T,C)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,C)
        out = wei @ v # (B, T, T) @ (B, T, C) -> (B, T, C)
        return out
Note the use of Dropout, which randomly drops part of the weights during training to improve generalization and avoid overfitting.
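A tiny illustration of what nn.Dropout does (a sketch, not part of the setup scripts): in training mode with p=0.5 roughly half of the elements are zeroed and the survivors are scaled by 1/(1-p); in eval mode it is a no-op.

import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(2, 8)

drop.train()
print(drop(x))   # some entries are 0, the remaining ones are scaled to 2.0

drop.eval()
print(drop(x))   # identical to x: dropout is disabled at inference time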
11. The Multi-Head Self-Attention Module
Assembling several single-head self-attention modules gives us the multi-head self-attention module:
class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out
Multi-head self-attention increases the model's expressive power by running several self-attention mechanisms in parallel. Each head attends to different parts of the input and can therefore capture different information and features, and the heads' outputs live in different representation subspaces. Concatenating these outputs gives a representation that combines the information from all heads; its size is num_heads x head_size (which in this code equals n_embd).
The linear projection (self.proj) maps this concatenated representation back to the model dimension. This lets the output of the multi-head self-attention module feed seamlessly into the following layers and, just as importantly, it mixes the information coming from the different heads, strengthening the model's understanding of the input.
The projection also adds extra parameters, which gives the model more flexibility and capacity to fit the data; during training these parameters are adjusted to optimize performance, improving the model's accuracy and efficiency on the task at hand.
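A shape-only sketch of the concatenate-then-project step (plain random tensors stand in for the head outputs; n_embd=64 and n_head=4 are assumed here to match the hyperparameters used later in the article):

import torch
import torch.nn as nn

B, T, n_embd, n_head = 4, 8, 64, 4
head_size = n_embd // n_head   # 16

# pretend these are the outputs of 4 independent attention heads
head_outputs = [torch.randn(B, T, head_size) for _ in range(n_head)]

out = torch.cat(head_outputs, dim=-1)   # (B, T, n_head * head_size) == (B, T, n_embd)
proj = nn.Linear(n_embd, n_embd)        # mixes information across the heads
print(out.shape, proj(out).shape)       # torch.Size([4, 8, 64]) torch.Size([4, 8, 64])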
12. FeedForward Layer
A feed-forward layer follows the multi-head self-attention module and further processes its output:
class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)
13. LayerNorm
For details on LayerNorm (layer normalization), see the separate reading note.
class LayerNorm1d: # (used to be BatchNorm1d)

    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.eps = eps
        self.gamma = torch.ones(dim)
        self.beta = torch.zeros(dim)

    def __call__(self, x):
        # calculate the forward pass
        xmean = x.mean(1, keepdim=True) # batch mean
        xvar = x.var(1, keepdim=True) # batch variance
        xhat = (x - xmean) / torch.sqrt(xvar + self.eps) # normalize to unit variance
        self.out = self.gamma * xhat + self.beta
        return self.out

    def parameters(self):
        return [self.gamma, self.beta]
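A quick check of what the class above does (a sketch that reuses the LayerNorm1d definition just shown): every row, i.e. every example's feature vector, comes out with roughly zero mean and unit standard deviation, while columns are left untouched (unlike BatchNorm).

import torch

torch.manual_seed(1337)
module = LayerNorm1d(100)          # the class defined above
x = torch.randn(32, 100)           # batch of 32 vectors with 100 features each
x = module(x)

print(x.shape)                          # torch.Size([32, 100])
print(x[0, :].mean(), x[0, :].std())    # ~0 and ~1: one row is normalized
print(x[:, 0].mean(), x[:, 0].std())    # a column is not normalized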
14. Positional encoding
The attention mechanism gains its power by attending to other elements of the sequence. However, attention by itself does not take the order of elements in the sequence into account. Positional encoding solves this.
In many natural language processing tasks, the order and position of words are crucial to understanding the meaning. However, because of the properties of its multi-head self-attention layers, a Transformer model is not sensitive to the order of its inputs. To address this, positional encoding is introduced into the Transformer so that the model can understand the order and relative positions of the words in the input.
The video presents one positional-encoding approach, using torch.arange together with nn.Embedding to generate the position vectors:
def forward(self, idx, targets=None):
    tok_emb = self.token_embedding_table(idx) # (B,T,C)
    pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
    x = tok_emb + pos_emb # (B,T,C)
    ......
Here:
- Token embeddings: tok_emb = self.token_embedding_table(idx) looks up the embedding vector tok_emb for each input token index in idx. The embedding table maps every unique token to a vector in a high-dimensional space. The shape (B,T,C) means batch size B, sequence length T, and embedding dimension C.
- Positional embeddings: pos_emb = self.position_embedding_table(torch.arange(T, device=device)) produces the position embeddings. torch.arange(T) generates the indices 0 to T-1, one for each position in the input sequence, and self.position_embedding_table maps each position index to a C-dimensional vector, so every position gets its own embedding that encodes its relative or absolute position in the sequence.
- Combining the embeddings: x = tok_emb + pos_emb adds the token and position embeddings, giving every token a final embedding that also carries position information. The model's input therefore contains both the token's semantic information (from the token embedding) and its position (from the positional embedding), so even when processing a whole sequence the model can recognise word order and better understand the structure and meaning of the data (a small shape sketch follows this list).
15. The GPT Block Component
A GPT is a chain of Block components. (Note: the Block here is not the sequence chunk from earlier; it refers to the building module of GPT.) It is constructed as follows:
- add multi-head self-attention;
- add a feed-forward layer after the multi-head self-attention;
- add residual connections;
- add LayerNorm, specifically pre-LayerNorm, where layer normalization is applied before entering the multi-head self-attention.
The implementation:
class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x
16. Hacking BigramLanguageModel into a GPT
Next, starting from the existing BigramLanguageModel skeleton and adding the pieces from the previous sections, we hack it into a GPT:
# super simple bigram model
class BigramLanguageModelV2(nn.Module):

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx
17. Training the GPT
Below is the complete training code.
This script outlines the process of creating and training a character-level language model with a simplified Transformer architecture. The model is trained on a dataset, most likely the Tiny Shakespeare corpus, to generate text in a similar style. The key components and steps in the script are:
- Hyperparameters: training settings such as batch size, block size, and learning rate, plus architecture details such as the embedding size, number of heads, number of layers, and dropout rate.
- Data preparation:
  - load the text data from a file;
  - build a vocabulary of the unique characters in the text and map characters to integers (and back) for processing;
  - split the data into training and validation sets.
- Batch preparation: a function generates batches of data for training and validation. Each batch consists of input sequences and their corresponding target sequences, which are essentially the inputs shifted right by one character.
- Model components: the key building blocks of the Transformer are defined:
  - Head: a single self-attention head.
  - MultiHeadAttention: aggregates multiple self-attention heads.
  - FeedForward: a simple linear layer followed by a ReLU activation.
  - Block: combines the attention and feed-forward components into a single Transformer block.
- Model architecture: the language model is built from token and position embedding layers, several Transformer blocks, a final layer norm, and a linear layer that predicts the next character.
- Loss estimation: a function evaluates the model's performance on the training and validation sets without updating its weights.
- Training loop: the model is trained iteratively by:
  - sampling a batch of data;
  - computing the loss;
  - running backpropagation;
  - updating the model's weights;
  - periodically evaluating the model on the training and validation sets to monitor progress.
- Text generation: a method generates text from the model given a starting context, iteratively extending the context by sampling new characters from the model's predictions until a sequence of the requested length is produced.
- Execution: finally, the script initializes the model, moves it to the appropriate device (GPU if available), and prints the number of parameters. It then enters the training loop, reporting the loss periodically, and, once finished, generates a text sequence from the trained model.
The script demonstrates how to implement a Transformer-based model for character-level text generation, highlighting the flexibility and effectiveness of the Transformer architecture for sequence-modelling tasks.
Full code of setup17.1:
import torch import torch.nn as nn from torch.nn import functional as F # hyperparameters # how many independent sequences will we process in parallel? batch_size = 16 # what is the maximum context length for predictions? block_size = 32 max_iters = 5000 eval_interval = 100 learning_rate = 1e-3 device = 'cuda' if torch.cuda.is_available() else 'cpu' eval_iters = 200 n_embd = 64 n_head = 4 n_layer = 4 dropout = 0.0 # ------------ torch.manual_seed(1337) # wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt with open('data/input.txt', 'r', encoding='utf-8') as f: text = f.read() # here are all the unique characters that occur in this text chars = sorted(list(set(text))) vocab_size = len(chars) # create a mapping from characters to integers stoi = { ch:i for i,ch in enumerate(chars) } itos = { i:ch for i,ch in enumerate(chars) } # encoder: take a string, output a list of integers encode = lambda s: [stoi[c] for c in s] # decoder: take a list of integers, output a string decode = lambda l: ''.join([itos[i] for i in l]) # Train and test splits data = torch.tensor(encode(text), dtype=torch.long) # first 90% will be train, rest val n = int(0.9*len(data)) train_data = data[:n] val_data = data[n:] # data loading def get_batch(split): # generate a small batch of data of inputs x and targets y data = train_data if split == 'train' else val_data ix = torch.randint(len(data) - block_size, (batch_size,)) x = torch.stack([data[i:i+block_size] for i in ix]) y = torch.stack([data[i+1:i+block_size+1] for i in ix]) x, y = x.to(device), y.to(device) return x, y @torch.no_grad() def estimate_loss(): out = {} model.eval() for split in ['train', 'val']: losses = torch.zeros(eval_iters) for k in range(eval_iters): X, Y = get_batch(split) logits, loss = model(X, Y) losses[k] = loss.item() out[split] = losses.mean() model.train() return out class Head(nn.Module): """ one head of self-attention """ def __init__(self, head_size): super().__init__() self.key = nn.Linear(n_embd, head_size, bias=False) self.query = nn.Linear(n_embd, head_size, bias=False) self.value = nn.Linear(n_embd, head_size, bias=False) self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size))) self.dropout = nn.Dropout(dropout) def forward(self, x): B,T,C = x.shape k = self.key(x) # (B,T,C) q = self.query(x) # (B,T,C) # compute attention scores ("affinities") wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T) wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T) wei = F.softmax(wei, dim=-1) # (B, T, T) wei = self.dropout(wei) # perform the weighted aggregation of the values v = self.value(x) # (B,T,C) out = wei @ v # (B, T, T) @ (B, T, C) -> (B, T, C) return out class MultiHeadAttention(nn.Module): """ multiple heads of self-attention in parallel """ def __init__(self, num_heads, head_size): super().__init__() self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)]) self.proj = nn.Linear(n_embd, n_embd) self.dropout = nn.Dropout(dropout) def forward(self, x): out = torch.cat([h(x) for h in self.heads], dim=-1) out = self.dropout(self.proj(out)) return out class FeedFoward(nn.Module): """ a simple linear layer followed by a non-linearity """ def __init__(self, n_embd): super().__init__() self.net = nn.Sequential( nn.Linear(n_embd, 4 * n_embd), nn.ReLU(), nn.Linear(4 * n_embd, n_embd), nn.Dropout(dropout), ) def forward(self, x): return self.net(x) class Block(nn.Module): """ Transformer block: communication followed by computation 
""" def __init__(self, n_embd, n_head): # n_embd: embedding dimension, n_head: the number of heads we'd like super().__init__() head_size = n_embd // n_head self.sa = MultiHeadAttention(n_head, head_size) self.ffwd = FeedFoward(n_embd) self.ln1 = nn.LayerNorm(n_embd) self.ln2 = nn.LayerNorm(n_embd) def forward(self, x): x = x + self.sa(self.ln1(x)) x = x + self.ffwd(self.ln2(x)) return x # super simple bigram model class BigramLanguageModel(nn.Module): def __init__(self): super().__init__() # each token directly reads off the logits for the next token from a lookup table self.token_embedding_table = nn.Embedding(vocab_size, n_embd) self.position_embedding_table = nn.Embedding(block_size, n_embd) self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)]) self.ln_f = nn.LayerNorm(n_embd) # final layer norm self.lm_head = nn.Linear(n_embd, vocab_size) def forward(self, idx, targets=None): B, T = idx.shape # idx and targets are both (B,T) tensor of integers tok_emb = self.token_embedding_table(idx) # (B,T,C) pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C) x = tok_emb + pos_emb # (B,T,C) x = self.blocks(x) # (B,T,C) x = self.ln_f(x) # (B,T,C) logits = self.lm_head(x) # (B,T,vocab_size) if targets is None: loss = None else: B, T, C = logits.shape logits = logits.view(B*T, C) targets = targets.view(B*T) loss = F.cross_entropy(logits, targets) return logits, loss def generate(self, idx, max_new_tokens): # idx is (B, T) array of indices in the current context for _ in range(max_new_tokens): # crop idx to the last block_size tokens idx_cond = idx[:, -block_size:] # get the predictions logits, loss = self(idx_cond) # focus only on the last time step logits = logits[:, -1, :] # becomes (B, C) # apply softmax to get probabilities probs = F.softmax(logits, dim=-1) # (B, C) # sample from the distribution idx_next = torch.multinomial(probs, num_samples=1) # (B, 1) # append sampled index to the running sequence idx = torch.cat((idx, idx_next), dim=1) # (B, T+1) return idx model = BigramLanguageModel() m = model.to(device) # print the number of parameters in the model print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters') # create a PyTorch optimizer optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate) for iter in range(max_iters): # every once in a while evaluate the loss on train and val sets if iter % eval_interval == 0 or iter == max_iters - 1: losses = estimate_loss() print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}") # sample a batch of data xb, yb = get_batch('train') # evaluate the loss logits, loss = model(xb, yb) optimizer.zero_grad(set_to_none=True) loss.backward() optimizer.step() # generate from the model context = torch.zeros((1, 1), dtype=torch.long, device=device) print(decode(m.generate(context, max_new_tokens=2000)[0].tolist())) |
Run python setup17.1.py
tony@TONYP15GEN2:/mnt/d/OpenAI/CreateGPT$ python setup17.1.py
0.209729 M parameters
step 0: train loss 4.4116, val loss 4.4022
step 100: train loss 2.6568, val loss 2.6670
step 200: train loss 2.5091, val loss 2.5058
....
step 4900: train loss 1.6678, val loss 1.8338
step 4999: train loss 1.6627, val loss 1.8233

And they bride will to lovest made
To bube toest the dest day, and bartht he us his
vetward that a enswing my feanst,
An yentreath, Lot fortth bettly would but
With entends will is that Glost and the now our wabs!
All in you his husberd with at princess, why holvings nor
To this destrittle, demath kneoul---on her pribest, and doth will now;
But poor of his butt known, rupt for to to his shall do allood,
That Prive my of.

HENRY BOLINGS:
You ardsables! Eghts, hois courtear tear rests I command.
O, no to Pome, griving and your mast a cempres-ennom betwer'd madant thou such
But not usinne, will confessy.
Which migh.
......
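The printed "0.209729 M parameters" can be checked by hand from the hyperparameters. The short snippet below is only a sanity-check sketch (it is not part of setup17.1.py, and the variable names are just for illustration); it adds up the parameters of each module and reproduces 209729. With block_size = 64, used in the longer run further down, the same arithmetic gives 211777.

# rough breakdown of the printed parameter count (sketch only, not in setup17.1.py)
vocab_size, n_embd, block_size, n_head, n_layer = 65, 64, 32, 4, 4
head_size = n_embd // n_head

per_head = 3 * n_embd * head_size                           # key, query, value (no bias)
attention = n_head * per_head + (n_embd * n_embd + n_embd)  # all heads + output projection
ffwd = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
layernorms = 2 * 2 * n_embd                                 # ln1 and ln2, weight + bias each
block = attention + ffwd + layernorms

total = (n_layer * block
         + vocab_size * n_embd                  # token embedding table
         + block_size * n_embd                  # position embedding table
         + 2 * n_embd                           # final layer norm
         + n_embd * vocab_size + vocab_size)    # lm_head
print(total)  # 209729, printed by the script as 0.209729 M; 211777 when block_size = 64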
Increase max_iters to 50000 and change the other hyperparameters as follows:
# how many independent sequences will we process in parallel?
batch_size = 64
# what is the maximum context length for predictions?
block_size = 64
max_iters = 50000
eval_interval = 1000
Run python setup17.1.py again with the new settings
$ python setup17.1.py
0.211777 M parameters
step 0: train loss 4.3393, val loss 4.3480
step 1000: train loss 1.9105, val loss 1.9978
step 2000: train loss 1.6546, val loss 1.8139
step 3000: train loss 1.5592, val loss 1.7467
......
step 49999: train loss 1.2681, val loss 1.5861

What thy bridal?

STANLEY:
He madest my best eyes.
One stride, and be that fought thee by their
caues with my face, and zoke he on
the office commandion will beg it ended mine
Stirs in oversumed; the next is waked. Anly die;
For humble comforwater and plaw you:
The sensemoumes are we'll like to thee
To Wicknes do evimes to them, and Tieuted Keep ContemenDusgued.
......
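The drop in validation loss (roughly 1.82 to 1.59) largely reflects how much more data the longer run samples. As a rough back-of-the-envelope check (a sketch, not part of the tutorial scripts), each training step draws batch_size random windows of block_size characters:

# rough estimate of how many characters each configuration trains on (sketch only)
dataset_chars = 1_115_394  # length of Tiny Shakespeare, printed in section 1.1

def chars_seen(batch_size, block_size, max_iters):
    # every step samples batch_size windows of block_size characters
    return batch_size * block_size * max_iters

for name, (b, t, n) in {
    'first run (16 x 32, 5000 iters)':    (16, 32, 5_000),
    'longer run (64 x 64, 50000 iters)':  (64, 64, 50_000),
}.items():
    seen = chars_seen(b, t, n)
    print(f"{name}: {seen:,} characters, about {seen / dataset_chars:.1f}x the dataset")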
18. Saving the model
The full code for setup18.1.py is as follows:
import os
import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
# how many independent sequences will we process in parallel?
batch_size = 64
# what is the maximum context length for predictions?
block_size = 64
max_iters = 50000
eval_interval = 1000
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 64
n_head = 4
n_layer = 4
dropout = 0.0
# ------------

torch.manual_seed(1337)

# wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
with open('data/input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
# encoder: take a string, output a list of integers
encode = lambda s: [stoi[c] for c in s]
# decoder: take a list of integers, output a string
decode = lambda l: ''.join([itos[i] for i in l])

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
# first 90% will be train, rest val
n = int(0.9*len(data))
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,C)
        q = self.query(x) # (B,T,C)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,C)
        out = wei @ v # (B, T, T) @ (B, T, C) -> (B, T, C)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

# super simple bigram model
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = BigramLanguageModel()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

best_val_loss = float('inf')  # initialize the best validation loss to infinity

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
        # keep track of the best validation loss seen so far
        val_loss = losses['val']
        if val_loss < best_val_loss:
            best_val_loss = val_loss

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# directory and filename for the saved model
model_dir = 'model'
model_filename = 'model.pth'
model_path = os.path.join(model_dir, model_filename)

# make sure the directory exists; create it if it does not
if not os.path.exists(model_dir):
    os.makedirs(model_dir)

# save the model state together with other useful information
torch.save({
    'epoch': max_iters,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    #'scheduler_state_dict': scheduler.state_dict() if scheduler else None,
    'best_val_loss': best_val_loss,
    'hyperparameters': {
        'learning_rate': learning_rate,
        'batch_size': batch_size,
        'n_layer': n_layer,
        'n_head': n_head,
        'dropout': dropout,
    },
    #'loss_history': {
    #    'train': train_loss_history,
    #    'val': val_loss_history,
    #},
    # any other information worth saving
}, model_path)

print(f"Model saved to {model_path}")

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))
Run python setup18.1.py
$ python setup18.1.py
step 0: train loss 4.3393, val loss 4.3480
step 1000: train loss 1.9105, val loss 1.9978
step 2000: train loss 1.6546, val loss 1.8139
step 3000: train loss 1.5592, val loss 1.7467
......
step 49000: train loss 1.2699, val loss 1.5956
step 49999: train loss 1.2681, val loss 1.5861
What thy bridal?
STANLEY:
He madest my best eyes.
One stride, and be that fought thee by their
caues with my face, and zoke he on
the office commandion will beg it ended mine
Stirs in oversumed; the next is waked. Anly die;
For humble comforwater and plaw you:
The sensemoumes are we’ll like to thee
To Wicknes do evimes to them, and Tieuted Keep ContemenDusgued.
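Once the checkpoint has been written to model/model.pth, it can be restored for inference without retraining. The sketch below is not one of the original setup files: the script name load_model.py and the module name gpt_model.py are hypothetical, and it assumes the class definitions and the decode helper from setup18.1.py have been moved into that importable module (setup18.1.py itself starts training on import, so it cannot be imported directly).

# load_model.py (hypothetical name): minimal sketch of restoring the saved checkpoint
import torch

# assumed refactor: BigramLanguageModel and decode from setup18.1.py live in gpt_model.py
from gpt_model import BigramLanguageModel, decode

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# load the dictionary written by torch.save in setup18.1.py
checkpoint = torch.load('model/model.pth', map_location=device)

model = BigramLanguageModel()
model.load_state_dict(checkpoint['model_state_dict'])
model.to(device)
model.eval()

print('best val loss at save time:', checkpoint['best_val_loss'])
print('hyperparameters:', checkpoint['hyperparameters'])

# sample 500 new characters starting from an empty context
context = torch.zeros((1, 1), dtype=torch.long, device=device)
with torch.no_grad():
    print(decode(model.generate(context, max_new_tokens=500)[0].tolist()))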
References:
https://medium.com/@fareedkhandev/understanding-transformers-a-step-by-step-math-example-part-1-a7809015150a
https://medium.com/@fareedkhandev/create-gpt-from-scratch-using-python-part-1-bd89ccf6206a
https://www.leewayhertz.com/build-a-gpt-model/
https://garden.maxieewong.com/087.%E8%A7%86%E9%A2%91%E5%BA%93/YouTube/Andrej%20Karpathy/Let’s%20build%20GPT%EF%BC%9Afrom%20scratch,%20in%20code,%20spelled%20out./
https://colab.research.google.com/drive/1JMLa53HDuA-i7ZBmqV7ZnA3c_fvtXnx-?usp=sharing#scrollTo=nql_1ER53oCf