PostgreSQL Blink-tree 實現

lehman blink-tree and Vladimir Lanin cocurrent Btree

PosegreSQL blink-tree 實現方式引用了兩個文章

Lehman and Yao’s high-concurrency B-tree management algorithm

V. Lanin and D. Shasha, A Symmetric Concurrent B-Tree Algorithm

MySQL InnoDB 的 btree 實現主要參考的是

R. Bayer & M. Schkolnick Concurrency of operations on B-trees March 1977

lehman blink-tree

Blink-tree 的 2 個核心變化

Adding a single “link” pointer field to each node.

這裏有一個當時時間點的背景, 我們現在見到的大部分的 Btree 實現裏面, 都會有 left/right point 指向 left/right page. 但是當時對標準 Btree 的定義並沒有這個要求. Btree 是非葉子節點也保存數據, B+tree 是隻有葉子節點保存數據, 從而使 btree height 儘可能低. 但是並沒有嚴格的要求把葉子節點連接到一起.

但是總體而言, 對 Btree 來說, 並沒有強制要求有 left/right 指針指向左右 page.

像 InnoDB 裏面的 btree 已經自帶了 leaft page 和 right page 指針了, 同時在不同的 level 包含 leaf/non-leaf node left/right 指針都指向了自己的兄弟節點了.

所以到現在這裏 right page 指針就可以和 link page 指針複用.
在每個節點內增加一個字段 high key, 在查詢時如果目標值超過該節點的 high key, 就需要循着 link pointer 繼續往後繼節點查找

所以目前和 PolarDB 的 blink-tree 比較大的區別是取消了 lock-coupling 的操作, search 操作不加鎖

PolarDB blink-tree

search 操作是通過 lock-coupling 操作, 自上而下進行加鎖放鎖操作.

SMO 操作則沒有 lock-coupling, 是先加子節點 lock, 然後釋放子節點, 再去加父節點. 具體是:

給 leaf-page 加鎖完成操作要插入父節點的時候, 需要把子節點 page lock 釋放, 然後重新 search btree, 找到父節點加 page lock 並且修改. 當然這裏也可以通過把父節點指針保存下來, 從而規避第二次 search 操作, 但這個是一個優化

在標準的 blink-tree 中, 也就是 PostgreSQL Blink-tree

search 操作並沒有 lock coupling. 而是隻需要加當前層的 latch, 如果查找到 child page id 到獲得 child page 之間, 因爲沒有 lock-coupling, 釋放完 parent node latch, 到加上 child nodt latch 這一段時間是完全不持有 latch 的, 因此 child page 發生了 SMO 操作, 要查找的 record 不在 child page 了, 那麼該如何處理?

PolarDB blink-tree 中, 通過 lock-coupling 操作保證了不存在一個時刻, 同時不持有 parent node 和 child node latch, 從而不會發生這樣的情況.

下面這個例子就是這樣的情況:

search 15 操作和觸發 SMO 的 insert 9 操作再併發進行着

15 原本在 y 裏面, find(15) 操作的時候 y 進行了分裂, 分裂成 y 和 y’. 15 到了新的 y’ 裏面.

# This is not how it works in postgres. This demonstrates the problem:"Thread A, searching for 15"   |   "Thread B, inserting 9"
                               |   node2 = read(x);
node = read(x);                |"Examine node, 15 lies in y"   |   "Examine node2, 9 belongs in y"
                               |   node2 = y;
                               |   # 9 does not fit in y
                               |   # Split y into (8,9,10) and (12,15)
                               |   y = (8,9,10); y_prime = (12,15)
                               |   x.add_pointer(y_prime)
                               |   
"y now points to (8,9,10)!"    |
node = read(y)                 |
find(15) "15 not found in y!"  |

對於這個例子, 可以看到 PolarDB blink-tree 通過 lock-coupling 去解決了問題, 在 read(x) 操作之後, 同時去持有 node(y) s lock, 那麼 Thread B SMO 操作的時候需要持有 node(y) x lock, 那麼 SMO 操作就會被阻塞, 從而避免了上述問題的發生.

lehman 介紹的 blink-tree 怎麼解決呢?

在 node(y) 裏面, 增加了 link-page 以及 high key 以後.

上述的 find(15) 操作判斷 15 > node(y)’s high-key, 那麼就去 node(y)’s link-page 去進行查找. 也就是 y’. 那麼在 y’ 上就可以找到 15

那麼 SMO 操作是如何進行的呢?

lehman blink-tree SMO 操作是持有子節點去加父節點的鎖, 並且是自下而上的 latch coupling 操作, 由於 search 操作不需要 lock coupling, 那麼自下而上的操作也就不會有問題. 所以可以持有 child latch 同時去申請 parent node latch.

這裏會同時持有 child, parent 兩個節點的 latch.

如果這個時候 parent 節點也含有 link page, 也就是需要插入到 parent node -> link page. 那麼就需要同時持有 child, parent, parent->link page 這 3 個 page 的 latch.

如果在 parent->link page 依然找不到插入位置, 需要到 parent->link page->link page, 那麼就可以把 parent node 放開, 再去持有 link page -> link page.

因此同一時刻最多持有 3 個節點的 latch.

大部分情況下 link page 只會有一個, 很多操作可以簡化.

這裏在 Vladimir Lanin Concurrent Btree 裏面會有進一步的優化.

按照現在 PG 實現, 如果鎖住子節點再向父節點進行插入, 只會出現一個 link page. 因爲第一個 page 發生分裂的時候, 在分裂沒有結束之前是不會放開 page lock, 那麼新的插入是無法進行的.

只有像 PolarDB blink-tree 做法一樣, 插入 child node 完成以後, 放開 child node latch, 然後再去插入 parent node, 允許插入 parent node 過程中, link page 繼續被插入纔可能出現多個 link page 的情況了.

我理解 PG 這裏也是做了權衡, 爲了避免出現多個 link page 的複雜情況的.

這裏雖然不會出現多個 Link-page, 但是有可能 search/insert 的時候需要走多個 link page 到目標 Page, 比如下面例子

其實這裏也可以使用類似 PolarDB blink-tree 的方式, 也就是插入子節點以後, 就可以把子節點的鎖放開, 重新遍歷 btree 去插入父節點, 從而可以進一步的讓子節點的 latch 儘早放開.

其實 blink-tree 這個文章也講到了 remembered list

We then proceed back up the tree (using our “remembered” list of nodes through which we searched)

Vladimir Lanin Cocurrent Btree

一開始總結了在 Blink Tree 之前 Btree 併發的實現方式.

search 的時候自上而下 lock coupling 加鎖, SMO 的時候 lock subtree 並且自上而下加鎖方式, 由於 Search and SMO 操作都是自上而下, 那麼就可以避免死鎖的發生.

該文章出來之前的併發控制方式, 缺點在哪裏呢?

很難計算清楚 lock subtree 的範圍到底是多少, 這個也是在 MySQL 現有代碼裏面非常繁瑣的一塊.
lock coupling 併發的範圍還是不夠. 這裏強調 lock-coupling 不一定需要配合 blink-tree 使用, 配合標準的 btree 使用也是可以的. 在這個文章裏面就是配合 b+tree 使用的.

這 2 種方法都是犧牲併發去獲得安全性.

當然也有在 lock coupling + lock subtree 的優化方法, 就是通過先樂觀加鎖, 再悲觀加鎖的方法. 樂觀路徑的時候一路都是 S lock, 然後找到 leaf node, 僅僅對 leaf node 加 X lock, 那麼在 (k-1)/k (2k 表示一個 page 裏面 record 個數) 情況下, 都可以走樂觀. 其實 InnoDB 就是先樂觀在悲觀的方式.

其他做法和 lehman blink-tree 類似, 只不過在 SMO 的時候, 實現了 only lock one node at a time, 不過在 PostgreSQL 具體實現的時候並沒有這樣實現, 我理解主要爲了考慮安全性.

文章也提到:

Although it is not apparent in [Lehman, Yao 811 itself, the B-link structure allows inserts and searches to lock only one node at a time.

也就是可以實現 insert and search only one node, 這個也是我的想法.

Each action holds no more than one read lock at a time during its descent, an insertion holds no more than one write lock at a time during its ascent, and a deletion needs no more than two write locks at a time during its ascent.

After the completion of a half-split or a half-merge, all locks are released.

在文章裏面確實是這樣, half-split 之後, 所有的 locks 都釋放了, 那麼插入父節點的時候就會 PolarDB 現有做法類似, 也就是釋放所有的 lock 重新去插入新的一層的數據, 從而保證 SMO 操作統一時刻也僅僅只有 Lock 一層.

Normally, finding the node with the right coverlet for the add-link or remove-link is done as in [Lehman, Yao 811, by reaccessing the last node visited during the locate phase on the level above the current node. Sometimes (e.g. when the locate phase had started at or below the current level) this is not possibie, and the node must-be found by a new descent from the top.

插入父節點的時候可以通過保存的 memory-list 或者重新遍歷了

另外, 用類似 link-page 思路補充了再 lehman 文章中沒有實現的 delete 操作

如果僅僅是和 MySQL 的 InnoDB 對比, PG 的 Blink-tree 實現在加鎖粒度上明顯更加的細緻, 避免的整個 Btree 的 Index lock 的同時, 也同時規避了通過 Lock subtree 的方式進行 Search 操作和 SMO 操作的衝突問題.

本文由 Readfog 進行 AMP 轉碼，版權歸原作者所有。
來源：https://mp.weixin.qq.com/s/1g07w92T0su3OkaRG88Afw

lehman blink-tree and Vladimir Lanin cocurrent Btree

猜你喜歡