A Summary of Positional Encoding Algorithms for LLMs

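Original: https://zhuanlan.zhihu.com/p/717358390

I previously worked on implementing and optimizing the rotary_embedding operator, and along the way went through several extensions built around RoPE in LLMs. These notes are the result.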

1 Sinusoidal Positional Encoding

A non-trained absolute positional encoding. Its position embedding can be computed with the following formula:

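$$
\begin{cases}
\vec{pe}_{2d} = \sin\left(\dfrac{pos\_id}{10000^{2d/head\_dim}}\right) \\
\vec{pe}_{2d+1} = \cos\left(\dfrac{pos\_id}{10000^{2d/head\_dim}}\right)
\end{cases}
$$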

For a position embedding with head_dim = 128, the values look as shown in the figure below:

Figure: Sinusoidal PE, head_dim = 128
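The sinusoidal formula can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name `sinusoidal_pe` and the `max_pos` and `base` parameters are assumptions made here, not names from the original article.

```python
import numpy as np

def sinusoidal_pe(max_pos: int, head_dim: int, base: float = 10000.0) -> np.ndarray:
    """Return a (max_pos, head_dim) table: even columns hold sin, odd columns hold cos."""
    pos_id = np.arange(max_pos)[:, None]           # (max_pos, 1)
    d = np.arange(head_dim // 2)[None, :]          # (1, head_dim / 2), pair index d
    angle = pos_id / base ** (2 * d / head_dim)    # (max_pos, head_dim / 2)
    pe = np.empty((max_pos, head_dim))
    pe[:, 0::2] = np.sin(angle)                    # pe_{2d}
    pe[:, 1::2] = np.cos(angle)                    # pe_{2d+1}
    return pe

pe = sinusoidal_pe(max_pos=512, head_dim=128)
```

Plotting `pe` as a heat map (for example with `plt.imshow(pe, aspect="auto")`) gives the familiar striped picture for head_dim = 128.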

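2 RoPE: Rotary Position Embedding

2.1 Computing RoPE

The starting point of rotary embedding is to achieve relative positional encoding by means of absolute positional encoding. Before introducing the encoding algorithm itself, let us first review the rotation transform.

The rotation matrix in two-dimensional space is: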

$$
R(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}
$$

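Proof:

Figure: derivation of the 2D rotation matrix

Rotation matrices have some special properties. The set of rotation matrices can be defined as follows: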

$$
SO(n) = \{ R \in \mathbb{R}^{n \times n} \mid R R^T = I,\ \det(R) = 1 \}
$$

The computation given in the RoFormer paper is as follows:

Figure: from the RoFormer paper

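$$
\begin{aligned}
a(m, n) &= \operatorname{Re}\langle f(q, m), f(k, n) \rangle \\
&= \operatorname{Re}\left[ \sum_{j=0}^{d/2-1} (q_{2j} + \mathrm{i}\, q_{2j+1})(k_{2j} - \mathrm{i}\, k_{2j+1})\, e^{\mathrm{i}(m-n)\theta_j} \right] \\
&= \sum_{j=0}^{d/2-1} (q_{2j} k_{2j} + q_{2j+1} k_{2j+1}) \cos((m-n)\theta_j) + (q_{2j} k_{2j+1} - q_{2j+1} k_{2j}) \sin((m-n)\theta_j)
\end{aligned}
$$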
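A small numerical check can make the formula for a(m, n) concrete. The sketch below assumes the usual RoPE frequencies $\theta_j = 10000^{-2j/d}$ and pairs adjacent dimensions (2j, 2j+1); the helper names `rope_theta`, `apply_rope` and `score_closed_form` are made up for this illustration.

```python
import numpy as np

def rope_theta(d, base=10000.0):
    """theta_j = base^(-2j/d) for j = 0 .. d/2 - 1."""
    return base ** (-2.0 * np.arange(d // 2) / d)

def apply_rope(x, pos, base=10000.0):
    """f(x, pos): rotate each pair (x[2j], x[2j+1]) by the angle pos * theta_j."""
    z = (x[0::2] + 1j * x[1::2]) * np.exp(1j * pos * rope_theta(x.size, base))
    out = np.empty_like(x, dtype=float)
    out[0::2], out[1::2] = z.real, z.imag
    return out

def score_closed_form(q, k, m, n, base=10000.0):
    """a(m, n) written with cos/sin of (m - n) * theta_j."""
    phi = (m - n) * rope_theta(q.size, base)
    qe, qo, ke, ko = q[0::2], q[1::2], k[0::2], k[1::2]
    return float(np.sum((qe * ke + qo * ko) * np.cos(phi)
                        + (qe * ko - qo * ke) * np.sin(phi)))

rng = np.random.default_rng(0)
q, k = rng.standard_normal(128), rng.standard_normal(128)

# Same relative distance (3) at two different absolute offsets.
s_a = float(apply_rope(q, 5) @ apply_rope(k, 2))
s_b = float(apply_rope(q, 103) @ apply_rope(k, 100))
```

Both `s_a` and `s_b` agree with `score_closed_form(q, k, 5, 2)` up to floating-point error, which is the sense in which the score depends only on m - n.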

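2.2 Discussion of the Long-Term Decay Property

Section 3.4.3 of the RoFormer paper introduces the long-term decay property of RoPE.

Q: What is the long-term decay property?
A: For the same pair of token embeddings, the closer their positions are, the higher the inner-product score.

Before going through the paper's derivation of long-term decay, here is a brief reminder of the Abel transformation: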

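$$
\begin{aligned}
T_n &= a_1 (B_1 - B_0) + a_2 (B_2 - B_1) + \dots + a_n (B_n - B_{n-1}) \\
&= (a_1 - a_2) B_1 + (a_2 - a_3) B_2 + \dots + (a_{n-1} - a_n) B_{n-1} + a_n B_n
\end{aligned}
$$

where $B_0 = 0$ (otherwise an extra term $-a_1 B_0$ remains).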
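The identity is easy to confirm numerically. The sketch below uses random values and takes $B_k$ to be the partial sums of an arbitrary sequence b with $B_0 = 0$; the variable names are chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
a = rng.standard_normal(n)                   # a_1 .. a_n
b = rng.standard_normal(n)                   # b_1 .. b_n
B = np.concatenate([[0.0], np.cumsum(b)])    # B_0 .. B_n with B_0 = 0

# Left-hand side: sum_k a_k * (B_k - B_{k-1})
lhs = np.sum(a * (B[1:] - B[:-1]))
# Right-hand side: sum_{k<n} (a_k - a_{k+1}) * B_k + a_n * B_n
rhs = np.sum((a[:-1] - a[1:]) * B[1:-1]) + a[-1] * B[-1]
```

`lhs` and `rhs` match up to rounding, and `lhs` is simply the dot product of `a` and `b`.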

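The paper's proof is as follows:

Figure: derivation of the long-term decay property in RoFormer Section 3.4.3

To check this, I translated the formula into Python code and plotted it: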

# %matplotlib inline  # uncomment when running in a Jupyter notebook
import numpy as np
import matplotlib.pyplot as plt

def theta(t, d):
    # Rotation frequency of the t-th dimension pair: 10000^(-2t/d)
    return 10000 ** (-2 * t / d)

def f(m, d):
    total_norm = 0
    for j in range(d // 2):
        # Complex partial sum: sum_{i=0}^{j} exp(1j * m * theta_i)
        inner_sum = sum(np.exp(1j * m * theta(i, d)) for i in range(j + 1))
        # Accumulate the modulus of the partial sum
        total_norm += abs(inner_sum)
    # Return the mean modulus
    return total_norm / (d // 2)

# Parameters
d = 128
m_values = np.linspace(0, 256, 1000)  # 1000 evenly spaced relative distances in [0, 256]

# Evaluate f(m) for every relative distance
f_values = [f(m, d) for m in m_values]

# Plot
plt.plot(m_values, f_values)
plt.xlabel('relative_distance')
plt.ylabel('attention_score')
plt.title('f')
plt.grid(True)
plt.show()

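This reproduces the long-term decay plot from the paper.

Figure: RoFormer paper Fig. 2: Long-term decay of RoPE

If m-n is increased to 2048, we get the plot below; at larger distances the curve actually oscillates.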

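3 PI: Position Interpolation

First, rewrite equation 2 of the PI paper (that is, the computation of the self-attention score a(m, n) from Section 2.1 of this article) in matrix-multiplication form, which makes Figure 2 of the paper easier to understand.

Figure: computing the attention score when q and k are at relative distance (m-n)

Figure: PI Fig. 2

Next, we analyze the authors' motivation around Figure 2 of the paper. The code for Figure 2 is as follows:

Figure: PI code for Fig. 2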

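torch.linalg.solve(A, B) is used here to obtain a least-squares estimate for the linear system Ax = B, where A is an L x 128 sin-cos matrix and B is randomly generated data (in effect, some attention scores) of shape L, so the resulting coeffs is a 128-dimensional vector. Since torch.linalg.solve itself requires a square coefficient matrix, the least-squares fit is computed from the normal equations AᵀA x = AᵀB. As shown in the figure below: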
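The fitting setup can be sketched with NumPy alone. Note the substitution: this sketch uses `np.linalg.lstsq` rather than the paper's torch call, because it tolerates the nearly collinear low-frequency columns. The `basis` helper, the seed, and the evaluation grids are assumptions made for illustration.

```python
import numpy as np

head_dim, L = 128, 2048
theta = 10000.0 ** (-np.arange(0, head_dim, 2) / head_dim)   # 64 RoPE frequencies

def basis(positions):
    """Sin-cos design matrix of shape (len(positions), head_dim)."""
    angles = np.outer(positions, theta)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

rng = np.random.default_rng(0)
A = basis(np.arange(L))          # L x 128
B = rng.standard_normal(L)       # random targets standing in for attention scores

# Least-squares fit of B by a linear combination of the 128 basis functions.
coeffs, *_ = np.linalg.lstsq(A, B, rcond=None)

fit_train = A @ coeffs                              # fitted curve on [0, L)
fit_extra = basis(np.arange(L, 2 * L)) @ coeffs     # same coefficients beyond L
fit_inter = basis(np.arange(0, L, 0.5)) @ coeffs    # half-integer positions inside [0, L)
```

Plotting `fit_extra` and `fit_inter` side by side is one way to compare how the fitted function behaves outside and inside the training range.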

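Following the code in the paper's appendix: 1) if we take the extrapolation experiment's result on [0, 2048] and inspect it, it is identical to the first plot in the paper (the first plot is obtained by least squares, and the second plot amounts to appending data for 2048-4096 to each basis function of the linear combination); 2) when the interpolation result is extended to more than 2000, the vertical axis (self-attention score) still stays within a bounded interval.

The authors argue that the extrapolation upper bound given by equation (37) in the RoFormer paper is too loose, and they derive an interpolation upper bound of their own. This bound is much smaller than the extrapolation one, so the interpolated attention score is more stable than the extrapolated attention score.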

Figure: the paper's illustration of interpolation versus extrapolation

The concrete method is very simple:
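As described in the PI paper, Position Interpolation rescales the position index by L / L' (original context length over extended context length) before the usual RoPE angles are computed, i.e. f'(x, m) = f(x, m L / L'). A minimal sketch follows; the helper names `rope_angles` and `pi_positions` and the lengths 2048 and 8192 are chosen here purely for illustration.

```python
import numpy as np

def rope_angles(positions, head_dim, base=10000.0):
    """Angle table m * theta_j with shape (len(positions), head_dim / 2)."""
    theta = base ** (-2.0 * np.arange(head_dim // 2) / head_dim)
    return np.outer(positions, theta)

def pi_positions(positions, train_len, target_len):
    """Linearly squeeze positions from [0, target_len) into [0, train_len)."""
    return np.asarray(positions, dtype=float) * train_len / target_len

L_train, L_target = 2048, 8192
positions = np.arange(L_target)
angles_pi = rope_angles(pi_positions(positions, L_train, L_target), head_dim=128)
```

With these lengths every position is divided by 4, so position 8191 is fed to RoPE as 2047.75 and no angle exceeds what was seen during training.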

4 NTK-aware RoPE

Implementation in Hugging Face: base change

https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/
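The base change can be sketched as follows, assuming the rule base' = base * α^(d / (d - 2)) from the Reddit post linked above. The function names and the example value α = 4 are illustrative only, not taken from any library.

```python
import numpy as np

def ntk_aware_base(base, alpha, head_dim):
    """Scaled RoPE base: base * alpha^(d / (d - 2))."""
    return base * alpha ** (head_dim / (head_dim - 2))

def inv_freq(base, head_dim):
    """theta_j = base^(-2j/d) for j = 0 .. d/2 - 1."""
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

head_dim, alpha = 128, 4.0
old = inv_freq(10000.0, head_dim)
new = inv_freq(ntk_aware_base(10000.0, alpha, head_dim), head_dim)
```

With this exponent the highest frequency is untouched (`new[0] == old[0] == 1`), while the lowest frequency is divided by exactly α (`new[-1] == old[-1] / alpha`), which is the same scaling PI would apply to every dimension.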

5 Dynamic NTK

Figure: Dynamic NTK implementation

For the motivation behind the design of Dynamic NTK, see the corresponding Reddit post:

https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/

if it was possible to pick the correct scale parameter dynamically based on the sequence length rather than having to settle for the fixed tradeoff of maximum sequence length vs. performance on shorter sequences. My idea was to use the exact position values for the first 2k context (after all, why mess with a good thing?) and then re-calculate the position vector for every new sequence length as the model generates token by token. Essentially, set scale to original model context length / current sequence length. This has the effect of slowly increasing scale as the sequence length increases. The main hyperparamter of NTK-Aware is α. Like static linear scaling, it represents a tradeoff between short/long sequence performance. So I thought, why not use the same dynamic scaling method with NTK-Aware? For Dynamic NTK, the scaling of α is set to (α * current sequence length / original model context length) - (α - 1). The idea again is to dynamically scale the hyperparameter as the sequence length increases.
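The rule quoted above can be sketched as a function of the current sequence length. Two assumptions are made here: the base is left unchanged inside the original context, and the exponent d / (d - 2) is carried over from the NTK-aware base change. The function name and default values are hypothetical.

```python
def dynamic_ntk_base(seq_len, train_len, head_dim, alpha=2.0, base=10000.0):
    """RoPE base recomputed for the current sequence length."""
    if seq_len <= train_len:
        return base                       # exact positions inside the original context
    # Scaling from the quote: (alpha * current length / original length) - (alpha - 1)
    scaled_alpha = alpha * seq_len / train_len - (alpha - 1)
    return base * scaled_alpha ** (head_dim / (head_dim - 2))

bases = [dynamic_ntk_base(n, train_len=2048, head_dim=128)
         for n in (1024, 2048, 4096, 8192)]
```

At `seq_len == train_len` the scaled α is exactly 1, so the base changes continuously as generation crosses the original context length and then grows with every further token.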

6 YaRN

YaRN: Yet another RoPE extensioN method. It therefore applies to models that use RoPE, such as the LLaMA, GPT-NeoX, and PaLM families.

YaRN in one sentence: attention scaling + NTK-by-parts interpolation.

YaRN analyzes RoPE, Position Interpolation, NTK-aware, and NTK-by-parts under one general framework:

The authors call interpolation methods such as Position Interpolation, which ignore the wavelength / period of the different dimensions within a head, blind interpolation (interpolation methods independent of wavelength or frequency). In contrast to blind interpolation, NTK-by-parts and YaRN are targeted interpolation methods.
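A rough sketch of the two ingredients follows. It assumes the ramp thresholds α = 1 and β = 32 and the attention factor 0.1 ln(s) + 1 that the YaRN paper reports for LLaMA; the function names and the example lengths are hypothetical.

```python
import numpy as np

def yarn_inv_freq(head_dim, scale, train_len, base=10000.0, alpha=1.0, beta=32.0):
    """NTK-by-parts: interpolate long wavelengths, keep short ones unchanged."""
    theta = base ** (-np.arange(0, head_dim, 2) / head_dim)
    wavelength = 2 * np.pi / theta
    r = train_len / wavelength        # rotations completed within the training context
    # Ramp: 0 -> fully interpolated (theta / scale), 1 -> left untouched
    gamma = np.clip((r - alpha) / (beta - alpha), 0.0, 1.0)
    return (1 - gamma) * theta / scale + gamma * theta

def yarn_attention_factor(scale):
    """Factor applied to both q and k (assumed form: 0.1 * ln(s) + 1)."""
    return 0.1 * np.log(scale) + 1.0 if scale > 1 else 1.0

inv = yarn_inv_freq(head_dim=128, scale=4.0, train_len=2048)
ref = yarn_inv_freq(head_dim=128, scale=1.0, train_len=2048)
```

In this example the highest-frequency dimension keeps its original frequency, the lowest-frequency dimension is divided by the full scale factor 4, and the dimensions in between are blended, which is what makes the interpolation targeted rather than blind.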

7 ALiBi

ALiBi: Attention with Linear Biases. The name itself states how it is computed:

When using ALiBi, we do not add position embeddings at any point in the network. The only modification we apply is after the query-key dot product, where we add a static, non-learned bias
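The static bias can be sketched as below, assuming a causal model and a power-of-two head count, for which the ALiBi paper uses the slopes 2^(-8/n), 2^(-16/n), and so on. The function names are chosen here for illustration.

```python
import numpy as np

def alibi_slopes(n_heads):
    """Geometric slopes 2^(-8/n), 2^(-16/n), ... (n_heads assumed a power of two)."""
    return 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)

def alibi_bias(n_heads, seq_len):
    """Bias of shape (n_heads, seq_len, seq_len) added to the q·k scores before softmax."""
    pos = np.arange(seq_len)
    distance = pos[:, None] - pos[None, :]          # i - j, >= 0 for keys at or before the query
    bias = -alibi_slopes(n_heads)[:, None, None] * distance[None]
    return np.where(distance[None] >= 0, bias, -np.inf)   # causal mask for future keys

bias = alibi_bias(n_heads=8, seq_len=4)
```

For the first head (slope 1/2) a key three positions back receives a bias of -1.5, so the penalty grows linearly with distance and no position embedding is added anywhere else.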

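The post "The Road to Upgrading the Transformer: 7. Length Extrapolation and Local Attention" (https://www.spaces.ac.cn/archives/9431) offers an interesting angle: it does not treat ALiBi as a positional encoding, but as a smoothing of local attention that keeps attention from becoming too diffuse.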

8 References

1 "RoFormer: Enhanced Transformer with Rotary Position Embedding" (https://arxiv.org/pdf/2104.09864)

2 "Extending Context Window of Large Language Models via Position Interpolation" (https://arxiv.org/pdf/2306.15595)

3 "YaRN: Efficient Context Window Extension of Large Language Models" (https://arxiv.org/pdf/2309.00071)

4 "Transformer Architecture: The Positional Encoding" (https://kazemnejad.com/blog/transformer_architecture_positional_encoding/)

5 "Reddit: NTK-aware RoPE" (https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/)

6 "Reddit: Dynamic NTK" (https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/)

7 "NTK-by-parts" (https://github.com/jquesnelle/yarn/pull/1)

8 "The Road to Upgrading the Transformer: 7. Length Extrapolation and Local Attention" (https://www.spaces.ac.cn/archives/9431)

Source: https://mp.weixin.qq.com/s/4DESPRSEZs_QLh5E0L7NFg