
The Attention Mechanism

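Throughout the attention process, the model learns three sets of weights: query, key, and value. The idea of queries, keys, and values comes from information retrieval systems, so let us first look at how a database query works.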

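Suppose we have a database that holds a number of authors and information about their books. Now I want to read some books written by Rabindranath:

In the database, the author names act as keys and the books act as values. The search term Rabindranath is the query. To answer it, we compute the similarity between the query and every key in the database (all the authors) and then return the values (books) of the most similar author.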
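As a rough sketch of this lookup, the snippet below scores a query against every key and returns the value of the best match. The library entries and the word-overlap `similarity` function are assumptions made purely for illustration; attention itself works with learned vectors and dot products rather than string matching.

```python
# Toy database: author name (key) -> books (value).
# The entries and the similarity measure are illustrative assumptions.
library = {
    "Rabindranath Tagore": ["Gitanjali", "The Home and the World"],
    "Leo Tolstoy": ["War and Peace", "Anna Karenina"],
    "Jane Austen": ["Pride and Prejudice", "Emma"],
}


def similarity(query: str, key: str) -> float:
    """Fraction of the query's words that also appear in the key."""
    query_words = query.lower().split()
    key_words = set(key.lower().split())
    return sum(word in key_words for word in query_words) / len(query_words)


query = "Rabindranath"
# Compare the query with every key in the database.
scores = {author: similarity(query, author) for author in library}
# Return the value that belongs to the most similar key.
best_author = max(scores, key=scores.get)
books = library[best_author]
print(best_author, books)
```

A database lookup like this returns only the single best match. Attention instead blends all the values in proportion to their similarity scores.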

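Attention likewise has three matrices: the query matrix (Q), the key matrix (K), and the value matrix (V). Each of them has the same dimensionality as the input embeddings. The model learns the values of these matrices during training.

We can assume that we create a vector for each word so that the information can be processed. For each word we generate a 512-dimensional vector. All three matrices are 512x512 (because the word embedding dimension is 512). Each token embedding is multiplied by all three matrices (Q, K, V), so every token ends up with three intermediate vectors of length 512.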
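The projection step can be sketched with plain Python lists. This is a minimal illustration under stated assumptions: the dimension is shrunk from 512 to 4 so the result stays readable, the weights are random stand-ins for learned ones, and the helper names `random_matrix` and `project` are hypothetical.

```python
import random

random.seed(0)
d_model = 4  # the article uses 512; a small value keeps the example readable


def random_matrix(rows: int, cols: int) -> list:
    """Stand-in for a learned weight matrix (random here, trained in practice)."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def project(vector: list, matrix: list) -> list:
    """Multiply a row vector of length n by an n x m matrix."""
    return [
        sum(vector[i] * matrix[i][j] for i in range(len(vector)))
        for j in range(len(matrix[0]))
    ]


# One square matrix each for queries, keys, and values.
W_q = random_matrix(d_model, d_model)
W_k = random_matrix(d_model, d_model)
W_v = random_matrix(d_model, d_model)

# A single token embedding yields three intermediate vectors of the same length.
embedding = [random.gauss(0.0, 1.0) for _ in range(d_model)]
q = project(embedding, W_q)
k = project(embedding, W_k)
v = project(embedding, W_v)
print(len(q), len(k), len(v))  # 4 4 4
```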

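Next we compute the score, which is the dot product between a query vector and a key vector. The score determines how much attention to pay to the other parts of the input sentence when encoding the word at a given position.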
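A small worked example may help here. The query and key vectors below are hand-picked 2-dimensional values (an assumption for readability, not values from the article); entry `scores[i][j]` is the dot product of query `i` with key `j`.

```python
def dot(a: list, b: list) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Three tokens, each with a 2-dimensional query and key (hand-picked values).
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 2.0], [0.0, 1.0], [3.0, 0.0]]

# Every query is compared with every key, giving a 3 x 3 score table.
scores = [[dot(q, k) for k in keys] for q in queries]
for row in scores:
    print(row)
# [1.0, 0.0, 3.0]
# [2.0, 1.0, 0.0]
# [3.0, 1.0, 3.0]
```

Row 0 shows that the first token's query matches the third key most strongly, so the third token would receive the most attention when encoding the first one.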

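The dot product is then divided by the square root of the key vector's dimension. This scaling keeps the dot products from growing too large in magnitude (in either the positive or the negative direction), which could make training numerically unstable. The scale factor is chosen so that the variance of the dot product is approximately 1.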
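The variance claim can be checked empirically. The sketch below assumes that the components of the query and key vectors are independent with zero mean and unit variance (random Gaussian stand-ins here, not trained vectors); under that assumption the raw dot product has a variance of about `d_k`, and dividing by `sqrt(d_k)` brings it back to about 1.

```python
import math
import random
import statistics

random.seed(0)
d_k = 64          # key dimension (one head in the article's multi-head setup)
n_samples = 2000  # number of random query/key pairs to draw


def random_vector(dim: int) -> list:
    """Vector with independent standard normal components."""
    return [random.gauss(0.0, 1.0) for _ in range(dim)]


def dot(a: list, b: list) -> float:
    return sum(x * y for x, y in zip(a, b))


raw = [dot(random_vector(d_k), random_vector(d_k)) for _ in range(n_samples)]
scaled = [score / math.sqrt(d_k) for score in raw]

raw_var = statistics.variance(raw)
scaled_var = statistics.variance(scaled)
print(f"variance of raw dot products:    {raw_var:.1f}  (expected near {d_k})")
print(f"variance of scaled dot products: {scaled_var:.2f} (expected near 1)")
```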

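The result is then passed through a softmax operation. This normalizes the scores: they all become positive and sum to 1. The softmax output determines how much information, or how many features (values), we take from each of the other words. In other words, it computes the weights.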
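For reference, a minimal softmax in plain Python is sketched below. The input scores are made-up numbers, and subtracting the maximum before exponentiating is a common numerical-stability trick added here; the article itself does not discuss it.

```python
import math


def softmax(scores: list) -> list:
    """Turn arbitrary scores into positive weights that sum to 1."""
    largest = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.1]  # illustrative scaled scores for one query
weights = softmax(scores)
print([round(w, 3) for w in weights])  # [0.659, 0.242, 0.099]
print(round(sum(weights), 6))          # 1.0
```

The largest score receives the largest weight, but every position keeps a non-zero share.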

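One point worth noting here: why do we need information or features from other words at all? Because language is contextual. The same word can mean different things in different contexts.

The last step is to multiply the softmax weights by the value vectors and sum the results.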
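This final step is a weighted sum. In the sketch below, the weights and the 2-dimensional value vectors are hand-picked for easy arithmetic (assumptions, not numbers from the article).

```python
# Attention weights for one query over three tokens (they sum to 1).
weights = [0.5, 0.25, 0.25]
# One value vector per token (hand-picked 2-dimensional values).
values = [[1.0, 0.0], [0.0, 2.0], [4.0, 4.0]]

# Multiply each value vector by its weight, then add the results per dimension.
output = [
    sum(w * value[j] for w, value in zip(weights, values))
    for j in range(len(values[0]))
]
print(output)  # [1.5, 1.5]
```

The output has the same length as a single value vector, but it now mixes information from all three tokens.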

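Visual Walkthrough

The logic above is all text and may read as a bit dry, so below we illustrate its vectorized implementation, which should make it easier to understand in depth.

The query vectors are computed by multiplying the embeddings with the query matrix.

The key vectors and value vectors are computed in the same way.

Finally, the scores and the attention output are computed.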
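In matrix form, the steps can be written compactly as below. The symbols are assumptions introduced for this summary: $X$ stacks the token embeddings row by row ($4 \times 512$ for the four-word example), $W_Q$, $W_K$, $W_V$ are the three learned $512 \times 512$ matrices, and $d_k$ is the key dimension. Bias terms are left out for brevity.

```latex
\[
\begin{aligned}
Q &= X W_Q, \qquad K = X W_K, \qquad V = X W_V \\
\operatorname{Attention}(Q, K, V) &= \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\end{aligned}
\]
```

Here $Q K^{\top}$ is the $4 \times 4$ score matrix, the softmax is applied to each row, and the result has the same shape as $V$.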

Simple Code Implementation

import torch
import torch.nn as nn
from typing import List


def get_input_embeddings(words: List[str], embeddings_dim: int):
    # Create a random vector of size embeddings_dim for each word.
    # Normally we train a tokenizer to get the embeddings;
    # check the blog on tokenizers to learn about this part.
    embeddings = [torch.randn(embeddings_dim) for word in words]
    return embeddings


text = "I should sleep now"
words = text.split(" ")
len(words)  # 4


embeddings_dim = 512  # 512 because the original paper uses it; other sizes work too
embeddings = get_input_embeddings(words, embeddings_dim=embeddings_dim)
embeddings[0].shape  # torch.Size([512])


# initialize the query, key and value matrices
query_matrix = nn.Linear(embeddings_dim, embeddings_dim)
key_matrix = nn.Linear(embeddings_dim, embeddings_dim)
value_matrix = nn.Linear(embeddings_dim, embeddings_dim)
query_matrix.weight.shape, key_matrix.weight.shape, value_matrix.weight.shape
# torch.Size([512, 512]), torch.Size([512, 512]), torch.Size([512, 512])


# compute the query, key and value vectors for each word embedding
query_vectors = torch.stack([query_matrix(embedding) for embedding in embeddings])
key_vectors = torch.stack([key_matrix(embedding) for embedding in embeddings])
value_vectors = torch.stack([value_matrix(embedding) for embedding in embeddings])
query_vectors.shape, key_vectors.shape, value_vectors.shape
# torch.Size([4, 512]), torch.Size([4, 512]), torch.Size([4, 512])


# compute the score: Q times K transposed, scaled by sqrt of the key dimension
scores = torch.matmul(query_vectors, key_vectors.transpose(-2, -1)) / torch.sqrt(
    torch.tensor(embeddings_dim, dtype=torch.float32)
)
scores.shape  # torch.Size([4, 4])


# compute the attention weights of each word with respect to the other words
softmax = nn.Softmax(dim=-1)
attention_weights = softmax(scores)
attention_weights.shape  # torch.Size([4, 4])


# attention output: weighted sum of the value vectors
output = torch.matmul(attention_weights, value_vectors)
output.shape  # torch.Size([4, 512])

The code above is only meant to demonstrate how the attention mechanism is implemented; it is not optimized.

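Multi-Head Attention

The attention described above is single-head attention; the original paper uses 8 heads. Multi-head attention is computed the same way as single-head attention, except that the intermediate query (q0-q3), key (k0-k3), and value (v0-v3) vectors are handled slightly differently.

The query vectors are then split into equal parts, one per head. In the figure there are 8 heads, and the query, key, and value vectors have 512 dimensions, so each vector becomes 8 vectors of 64 dimensions.

The first 64 dimensions go to the first head, the second group of 64 to the second head, and so on. In the figure, I only show the computation for the first head.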
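The split can be sketched with simple list slicing. The vector below is a stand-in (an assumption for illustration) whose entry `i` simply holds the number `i`, which makes it easy to see which slice ends up in which head.

```python
num_heads = 8
d_model = 512
head_dim = d_model // num_heads  # 64

# Stand-in for one token's 512-dimensional query vector.
query_vector = list(range(d_model))

# Consecutive chunks of 64 dimensions go to consecutive heads.
heads = [
    query_vector[h * head_dim:(h + 1) * head_dim]
    for h in range(num_heads)
]
print(len(heads), len(heads[0]))  # 8 64
print(heads[0][0], heads[0][-1])  # 0 63
print(heads[1][0], heads[1][-1])  # 64 127
```

The same slicing applies to the key and value vectors of every token.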

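Note that different frameworks implement this differently. The official PyTorch implementation works as described above, whereas TF and some third-party code compute each head separately: with 8 heads they use 8 linear layers (dense layers in TF) rather than one large linear layer that is split afterwards. Remember that PyTorch's transformer requires emb_dim to be divisible by num_heads? This is why.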
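To see why the two layouts can express the same computation, consider the sketch below. It uses a tiny hand-picked weight matrix (4 dimensions, 2 heads; all numbers are assumptions for illustration) and shows that projecting once and then splitting gives the same result as projecting with one smaller matrix per head, provided each small matrix is the matching column block of the large one. Separately initialized and trained layers would of course not produce identical numbers.

```python
d_model, num_heads = 4, 2
head_dim = d_model // num_heads  # 2


def project(vector: list, matrix: list) -> list:
    """Multiply a row vector of length n by an n x m matrix."""
    return [
        sum(vector[i] * matrix[i][j] for i in range(len(vector)))
        for j in range(len(matrix[0]))
    ]


# One large d_model x d_model weight matrix (hand-picked values).
W = [
    [1.0, 0.0, 2.0, 1.0],
    [0.0, 1.0, 1.0, 3.0],
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 1.0, 1.0],
]
x = [1.0, 2.0, 3.0, 4.0]  # one token embedding

# Layout 1: a single large projection, split into heads afterwards.
full = project(x, W)
split_heads = [full[h * head_dim:(h + 1) * head_dim] for h in range(num_heads)]

# Layout 2: one small matrix per head, each a column block of W.
per_head_W = [
    [row[h * head_dim:(h + 1) * head_dim] for row in W]
    for h in range(num_heads)
]
separate_heads = [project(x, W_h) for W_h in per_head_W]

print(split_heads)     # [[11.0, 13.0], [8.0, 11.0]]
print(separate_heads)  # [[11.0, 13.0], [8.0, 11.0]]
```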

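Either approach works, because the final results are similar and the difference has little effect.

Once a head has its small query, key, and value vectors (64 dimensions each), the rest of the computation is the same as in single-head attention. Each head ends up producing a 64-dimensional vector.

We concatenate the 64-dimensional outputs of all the heads to obtain the final 512-dimensional output vector.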
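The merge step for a whole sentence can be sketched as follows. The head outputs here are placeholders (an assumption for illustration): every entry produced by head `h` is filled with the number `h`, so the layout of the merged vector is easy to verify.

```python
num_heads, seq_len, head_dim = 8, 4, 64

# head_outputs[h][t] is the 64-dimensional output of head h for token t.
head_outputs = [
    [[float(h)] * head_dim for _ in range(seq_len)]
    for h in range(num_heads)
]

# For every token, place the outputs of head 0, head 1, ... side by side.
merged = []
for t in range(seq_len):
    token_vector = []
    for h in range(num_heads):
        token_vector.extend(head_outputs[h][t])
    merged.append(token_vector)

print(len(merged), len(merged[0]))                  # 4 512
print(merged[0][0], merged[0][64], merged[0][511])  # 0.0 1.0 7.0
```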

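Multi-head attention can represent complex relationships in the data. Each head can learn a different pattern. Multiple heads also make it possible to attend to different subspaces of the input representation at the same time (in this example, 64-dimensional slices of the original 512-dimensional vectors).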

Multi-Head Attention Code Implementation

# Reuses the imports and get_input_embeddings from the single-head example.
num_heads = 8
# batch dim is 1 since we are processing one text
batch_size = 1

text = "I should sleep now"
words = text.split(" ")
len(words)  # 4


embeddings_dim = 512
head_dim = embeddings_dim // num_heads  # 64 features per head
embeddings = get_input_embeddings(words, embeddings_dim=embeddings_dim)
embeddings[0].shape  # torch.Size([512])


# initialize the query, key and value matrices
query_matrix = nn.Linear(embeddings_dim, embeddings_dim)
key_matrix = nn.Linear(embeddings_dim, embeddings_dim)
value_matrix = nn.Linear(embeddings_dim, embeddings_dim)
query_matrix.weight.shape, key_matrix.weight.shape, value_matrix.weight.shape
# torch.Size([512, 512]), torch.Size([512, 512]), torch.Size([512, 512])


# compute the query, key and value vectors for each word embedding
query_vectors = torch.stack([query_matrix(embedding) for embedding in embeddings])
key_vectors = torch.stack([key_matrix(embedding) for embedding in embeddings])
value_vectors = torch.stack([value_matrix(embedding) for embedding in embeddings])
query_vectors.shape, key_vectors.shape, value_vectors.shape
# torch.Size([4, 512]), torch.Size([4, 512]), torch.Size([4, 512])


# reshape to (batch_size, num_heads, seq_len, head_dim)
query_vectors_view = query_vectors.view(batch_size, -1, num_heads, head_dim).transpose(1, 2)
key_vectors_view = key_vectors.view(batch_size, -1, num_heads, head_dim).transpose(1, 2)
value_vectors_view = value_vectors.view(batch_size, -1, num_heads, head_dim).transpose(1, 2)
query_vectors_view.shape, key_vectors_view.shape, value_vectors_view.shape
# torch.Size([1, 8, 4, 64]),
#  torch.Size([1, 8, 4, 64]),
#  torch.Size([1, 8, 4, 64])


# We split each vector into 8 heads.
# With one text (batch size of 1), each embedding vector is
# split into 8 parts and each head takes one part.
# First, do this for one head at a time.
head1_query_vector = query_vectors_view[0, 0, ...]
head1_key_vector = key_vectors_view[0, 0, ...]
head1_value_vector = value_vectors_view[0, 0, ...]
head1_query_vector.shape, head1_key_vector.shape, head1_value_vector.shape
# torch.Size([4, 64]), torch.Size([4, 64]), torch.Size([4, 64])


# The vectors keep the same sequence length; only the feature dim changes from 512 to 64.
# compute the score, scaled by the square root of the per-head dimension
scale = torch.sqrt(torch.tensor(head_dim, dtype=torch.float32))
scores_head1 = torch.matmul(head1_query_vector, head1_key_vector.permute(1, 0)) / scale
scores_head1.shape  # torch.Size([4, 4])


# compute the attention weights of each word with respect to the other words
softmax = nn.Softmax(dim=-1)
attention_weights_head1 = softmax(scores_head1)
attention_weights_head1.shape  # torch.Size([4, 4])

output_head1 = torch.matmul(attention_weights_head1, head1_value_vector)
output_head1.shape  # torch.Size([4, 64])


# we can compute the output for all the heads
outputs = []
for head_idx in range(num_heads):
    head_idx_query_vector = query_vectors_view[0, head_idx, ...]
    head_idx_key_vector = key_vectors_view[0, head_idx, ...]
    head_idx_value_vector = value_vectors_view[0, head_idx, ...]
    scores_head_idx = torch.matmul(head_idx_query_vector, head_idx_key_vector.permute(1, 0)) / scale

    attention_weights_idx = softmax(scores_head_idx)
    head_output = torch.matmul(attention_weights_idx, head_idx_value_vector)
    outputs.append(head_output)

[out.shape for out in outputs]
# [torch.Size([4, 64]),
#  torch.Size([4, 64]),
#  torch.Size([4, 64]),
#  torch.Size([4, 64]),
#  torch.Size([4, 64]),
#  torch.Size([4, 64]),
#  torch.Size([4, 64]),
#  torch.Size([4, 64])]

# concatenate the results from each head for the corresponding word
word0_outputs = torch.cat([out[0] for out in outputs])
word0_outputs.shape  # torch.Size([512])

# let's do it for all the words
attn_outputs = []
for i in range(len(words)):
    attn_output = torch.cat([out[i] for out in outputs])
    attn_outputs.append(attn_output)
[attn_output.shape for attn_output in attn_outputs]
# [torch.Size([512]), torch.Size([512]), torch.Size([512]), torch.Size([512])]


# Now let's do it in a vectorized way.
# We can now permute the last two dimensions of the key vectors.
key_vectors_view.permute(0, 1, 3, 2).shape  # torch.Size([1, 8, 64, 4])


# Q times K transposed over the last two dims, with the same scaling as the loop above
score = torch.matmul(query_vectors_view, key_vectors_view.permute(0, 1, 3, 2)) / scale
score = torch.softmax(score, dim=-1)


# weighted sum of the value vectors, for all heads at once
attention_results = torch.matmul(score, value_vectors_view)
attention_results.shape  # torch.Size([1, 8, 4, 64])

# merge the heads back into one vector per word
attention_results = attention_results.permute(0, 2, 1, 3).contiguous().view(batch_size, -1, embeddings_dim)
attention_results.shape  # torch.Size([1, 4, 512])

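Summary

The attention mechanism is an important component of the Transformer model. The Transformer is a neural network model based on self-attention and is widely used in natural language processing tasks such as machine translation, text generation, and language modeling. The self-attention mechanism described in this article is the foundation of the Transformer, and a variety of more efficient attention mechanisms have been derived from it. A solid understanding of self-attention therefore makes it easier to understand the design principles and inner workings of the Transformer, and how to apply and tune the model for specific tasks. This will help you use Transformer models more effectively and carry out related research and development.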

