紅黑樹在內核的應用——虛擬內存管理
1、linux 內核中利用紅黑樹增刪改查快速、穩定的特性來管理的還有另一個非常重要的功能:虛擬內存管理!前面介紹了 buddy 和 slab 算法是用來管理物理頁面的。由於早期物理頁面遠比虛擬頁面小很多,而且只需要分配和回收合併,所以也沒用樹形結構來組織,簡單粗暴地用鏈表來管理!但是虛擬內存不一樣了:以 32 位的系統爲例,虛擬內存有 4GB,能劃分的虛擬內存塊有很多,劃分後需要快速增刪查虛擬內存塊(需要頻繁地讀取代碼、讀寫數據、加載動態鏈接庫等),此時用紅黑樹就很合適了!老規矩,先上結構體:
task_struct 中嵌套了一個 mm_struct 結構體指針,這個結構體大有乾坤:
struct task_struct {
.......
struct mm_struct *mm;
.......
}
繼續深入:又發現了紅黑樹的根節點!另外兩個 vm_area_struct 結構體又是幹嘛的了?
struct mm_struct {
struct vm_area_struct * mmap; /* list of VMAs */
struct rb_root mm_rb;/*又是紅黑樹的根節點*/
struct vm_area_struct * mmap_cache; /* last find_vma result */
.......
}
繼續深入:
-
發現 rb_node 結構體了麼?這明顯是紅黑樹的節點呀!和上面的 rb_root 不是剛好組成紅黑樹麼?
-
進程的虛擬內存空間會被分成不同的若干區域,每個區域都有其相關的屬性和用途;一個合法的地址總是落在某個區域當中的,這些區域也不會重疊。在 linux 內核中,這樣的區域被稱之爲虛擬內存區域 (virtual memory areas), 簡稱 VMA。一個 vma 就是一塊連續的線性地址空間的抽象,它擁有自身的權限 (可讀,可寫,可執行等等) ;
struct vm_area_struct {
struct mm_struct * vm_mm; /* 所屬的內存描述符 */
unsigned long vm_start; /* vma的起始地址 */
unsigned long vm_end; /* vma的結束地址 */
/* 該vma的在一個進程的vma鏈表中的前驅vma和後驅vma指針,鏈表中的vma都是按地址來排序的*/
struct vm_area_struct *vm_next, *vm_prev;
pgprot_t vm_page_prot; /* vma的訪問權限 */
unsigned long vm_flags; /* 標識集 */
struct rb_node vm_rb; /* 紅黑樹中對應的節點 */
...............
}
爲了直觀展示這些結構體之間的關係,我畫了一張圖供大家參考,要點說明如下:
-
vm_area_struct 有 vm_start 和 vm_end 兩個字段,分別指向虛擬內存區域的開始和結束地址;
-
vm_area_struct 的 vm_rb 又組成了紅黑樹:便於根據特定的條件快速查找目標區域;
-
vm_area_struct 實例之間也用鏈表連接:主要用來遍歷節點(不需要像樹形結構一樣前序、中序、後序等方式遍歷,速度快一些;樹形結構遍歷需要遞歸或藉助棧 / 隊列等結構,空間複雜度是 O(N))
2、(1)結構體定義好了,下一步就是操作這些結構體了。既然用到了紅黑樹管理 VMA,第一件事肯定是建樹、建鏈表啦(所有的操作都在 mm\mmap.c 文件裏),最直接的 api 是__vma_link 函數,如下:
/*將新創建的vm_area_struct掛在mm_struct中管理的紅黑樹上*/
static void
__vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
struct vm_area_struct *prev, struct rb_node **rb_link,
struct rb_node *rb_parent)
{
__vma_link_list(mm, vma, prev, rb_parent);
__vma_link_rb(mm, vma, rb_link, rb_parent);
}
調用了兩個函數,從名稱看就知道一個是建立鏈表,另一個是建立紅黑樹;先看第一個_vma_link_list 函數,如下:就是把 vma 實例加入到 mm 的 mmap 字段,然後讓 vma 的 next 指針指向下一個 vma 實例,完成鏈表的建立!
void __vma_link_list(struct mm_struct *mm, struct vm_area_struct *vma,
struct vm_area_struct *prev, struct rb_node *rb_parent)
{
struct vm_area_struct *next;
vma->vm_prev = prev;
if (prev) {
next = prev->vm_next;
prev->vm_next = vma;
} else {
mm->mmap = vma;
if (rb_parent)
next = rb_entry(rb_parent,
struct vm_area_struct, vm_rb);
else
next = NULL;
}
vma->vm_next = next;
if (next)
next->vm_prev = vma;
}
另一個是__vma_link_rb 函數在紅黑樹中插入節點,如下:
void __vma_link_rb(struct mm_struct *mm, struct vm_area_struct *vma,
struct rb_node **rb_link, struct rb_node *rb_parent)
{
/* Update tracking information for the gap following the new vma. */
if (vma->vm_next)
vma_gap_update(vma->vm_next);
else
mm->highest_vm_end = vma->vm_end;
/*
* vma->vm_prev wasn't known when we followed the rbtree to find the
* correct insertion point for that vma. As a result, we could not
* update the vma vm_rb parents rb_subtree_gap values on the way down.
* So, we first insert the vma with a zero rb_subtree_gap value
* (to be consistent with what we did on the way down), and then
* immediately update the gap to the correct value. Finally we
* rebalance the rbtree after all augmented values have been set.
*/
rb_link_node(&vma->vm_rb, rb_parent, rb_link);
vma->rb_subtree_gap = 0;
vma_gap_update(vma);
/*傳入的vma插入紅黑樹*/
vma_rb_insert(vma, &mm->mm_rb);
}
(2)上述代碼執行完畢,紅黑樹也就建好了!接下來就是查詢了;因爲紅黑樹是用來管理 vma 的,建樹的時候所用的 key 值就是 vma 的線性地址了,所以紅黑樹節點左孩 vma 的地址都比當前節點小,右孩 vma 的地址都比當前節點大;代碼如下:
/* Look up the first VMA which satisfies addr < vm_end, NULL if none. */
struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
{
struct rb_node *rb_node;
struct vm_area_struct *vma;
/* Check the cache first. */
vma = vmacache_find(mm, addr);
if (likely(vma))
return vma;
/*紅黑樹的根節點*/
rb_node = mm->mm_rb.rb_node;
while (rb_node) {
struct vm_area_struct *tmp;
tmp = rb_entry(rb_node, struct vm_area_struct, vm_rb);
if (tmp->vm_end > addr) {
vma = tmp;
/*要查找的addr剛好在這個vma實例的vm_start和vm_end之間,說明找到了*/
if (tmp->vm_start <= addr)
break;
/*addr比vm_start都大,繼續從左子樹查找*/
rb_node = rb_node->rb_left;
} else /*addr比vm_end小,繼續從右子樹查找*/
rb_node = rb_node->rb_right;
}
if (vma)
vmacache_update(addr, vma);
return vma;
}
(3)上面是根據用戶指定的線性地址查找第一個符合的 vma 實例,用戶實際使用時,還需要查找空閒未使用的虛擬內存塊,用來存儲重要的數據,這個又該怎麼實現了?linux 的實現方法爲:arch_get_unmapped_area 函數,如下:最核心的代碼加了中文註釋;思路就是根據 addr 查找 vma;如果 vma 爲空,並且大小、邊界等條件也都滿足,這塊內存就可以拿來用了!
unsigned long
arch_get_unmapped_area(struct file *filp, unsigned long addr,
unsigned long len, unsigned long pgoff, unsigned long flags)
{
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
struct vm_unmapped_area_info info;
if (len > TASK_SIZE - mmap_min_addr)
return -ENOMEM;
if (flags & MAP_FIXED)
return addr;
if (addr) {
addr = PAGE_ALIGN(addr);//將addr調整成頁大小的倍數
/*通過addr查找對應的vma是否爲空;如果是空,說明該區域還未被使用
,如果其他條件也滿足,就直接使用這塊地址了*/
vma = find_vma(mm, addr);
if (TASK_SIZE - len >= addr && addr >= mmap_min_addr &&
(!vma || addr + len <= vma->vm_start))
return addr;
}
info.flags = 0;
info.length = len;
info.low_limit = mm->mmap_base;
info.high_limit = TASK_SIZE;
info.align_mask = 0;
return vm_unmapped_area(&info);
}
(4)內存用完後就要釋放,爲了避免碎片,肯定是要和現有的空閒內存合併的;由於空閒內存沒有用紅黑樹組織,所以此步驟也不涉及紅黑樹的操作,具體的思路爲:先檢查釋放區域之前的 prve 區域終止地址是否與釋放區域起始地址重合,或釋放區域的結束地址是否與其之後的 next 區域起始地址重合;接着再檢查將要合併的區域是否有相同的標誌。如果合併區域均映射了磁盤文件,則還要檢查其映射文件是否相同,以及文件內的偏移量是否連續。思路並不複雜,代碼如下:
/*
* Given a mapping request (addr,end,vm_flags,file,pgoff), figure out
* whether that can be merged with its predecessor or its successor.
* Or both (it neatly fills a hole).
*
* In most cases - when called for mmap, brk or mremap - [addr,end) is
* certain not to be mapped by the time vma_merge is called; but when
* called for mprotect, it is certain to be already mapped (either at
* an offset within prev, or at the start of next), and the flags of
* this area are about to be changed to vm_flags - and the no-change
* case has already been eliminated.
*
* The following mprotect cases have to be considered, where AAAA is
* the area passed down from mprotect_fixup, never extending beyond one
* vma, PPPPPP is the prev vma specified, and NNNNNN the next vma after:
*
* AAAA AAAA AAAA AAAA
* PPPPPPNNNNNN PPPPPPNNNNNN PPPPPPNNNNNN PPPPNNNNXXXX
* cannot merge might become might become might become
* PPNNNNNNNNNN PPPPPPPPPPNN PPPPPPPPPPPP 6 or
* mmap, brk or case 4 below case 5 below PPPPPPPPXXXX 7 or
* mremap move: PPPPXXXXXXXX 8
* AAAA
* PPPP NNNN PPPPPPPPPPPP PPPPPPPPNNNN PPPPNNNNNNNN
* might become case 1 below case 2 below case 3 below
*
* It is important for case 8 that the the vma NNNN overlapping the
* region AAAA is never going to extended over XXXX. Instead XXXX must
* be extended in region AAAA and NNNN must be removed. This way in
* all cases where vma_merge succeeds, the moment vma_adjust drops the
* rmap_locks, the properties of the merged vma will be already
* correct for the whole merged range. Some of those properties like
* vm_page_prot/vm_flags may be accessed by rmap_walks and they must
* be correct for the whole merged range immediately after the
* rmap_locks are released. Otherwise if XXXX would be removed and
* NNNN would be extended over the XXXX range, remove_migration_ptes
* or other rmap walkers (if working on addresses beyond the "end"
* parameter) may establish ptes with the wrong permissions of NNNN
* instead of the right permissions of XXXX.
*/
struct vm_area_struct *vma_merge(struct mm_struct *mm,
struct vm_area_struct *prev, unsigned long addr,
unsigned long end, unsigned long vm_flags,
struct anon_vma *anon_vma, struct file *file,
pgoff_t pgoff, struct mempolicy *policy,
struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
{
pgoff_t pglen = (end - addr) >> PAGE_SHIFT;
struct vm_area_struct *area, *next;
int err;
/*
* We later require that vma->vm_flags == vm_flags,
* so this tests vma->vm_flags & VM_SPECIAL, too.
*/
if (vm_flags & VM_SPECIAL)
return NULL;
if (prev)
next = prev->vm_next;
else
next = mm->mmap;
area = next;
if (area && area->vm_end == end) /* cases 6, 7, 8 */
next = next->vm_next;
/* verify some invariant that must be enforced by the caller */
VM_WARN_ON(prev && addr <= prev->vm_start);
VM_WARN_ON(area && end > area->vm_end);
VM_WARN_ON(addr >= end);
/*
* Can it merge with the predecessor?
*/
if (prev && prev->vm_end == addr &&
mpol_equal(vma_policy(prev), policy) &&
can_vma_merge_after(prev, vm_flags,
anon_vma, file, pgoff,
vm_userfaultfd_ctx)) {
/*
* OK, it can. Can we now merge in the successor as well?
*/
if (next && end == next->vm_start &&
mpol_equal(policy, vma_policy(next)) &&
can_vma_merge_before(next, vm_flags,
anon_vma, file,
pgoff+pglen,
vm_userfaultfd_ctx) &&
is_mergeable_anon_vma(prev->anon_vma,
next->anon_vma, NULL)) {
/* cases 1, 6 */
err = __vma_adjust(prev, prev->vm_start,
next->vm_end, prev->vm_pgoff, NULL,
prev);
} else /* cases 2, 5, 7 */
err = __vma_adjust(prev, prev->vm_start,
end, prev->vm_pgoff, NULL, prev);
if (err)
return NULL;
khugepaged_enter_vma_merge(prev, vm_flags);
return prev;
}
/*
* Can this new request be merged in front of next?
*/
if (next && end == next->vm_start &&
mpol_equal(policy, vma_policy(next)) &&
can_vma_merge_before(next, vm_flags,
anon_vma, file, pgoff+pglen,
vm_userfaultfd_ctx)) {
if (prev && addr < prev->vm_end) /* case 4 */
err = __vma_adjust(prev, prev->vm_start,
addr, prev->vm_pgoff, NULL, next);
else { /* cases 3, 8 */
err = __vma_adjust(area, addr, next->vm_end,
next->vm_pgoff - pglen, NULL, next);
/*
* In case 3 area is already equal to next and
* this is a noop, but in case 8 "area" has
* been removed and next was expanded over it.
*/
area = next;
}
if (err)
return NULL;
khugepaged_enter_vma_merge(area, vm_flags);
return area;
}
return NULL;
}
(5)既然釋放內存,相應的 vma 肯定也要從紅黑樹刪除的,這個功能在 detach_vmas_to_be_unmapped 中實現的:
/*
* Create a list of vma's touched by the unmap, removing them from the mm's
* vma list as we go..
Remove the vma's, and unmap the actual pages
*/
static void
detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
struct vm_area_struct *prev, unsigned long end)
{
struct vm_area_struct **insertion_point;
struct vm_area_struct *tail_vma = NULL;
insertion_point = (prev ? &prev->vm_next : &mm->mmap);
vma->vm_prev = NULL;
do {
vma_rb_erase(vma, &mm->mm_rb);//從紅黑樹刪除vma
mm->map_count--;
tail_vma = vma;
vma = vma->vm_next;
} while (vma && vma->vm_start < end);
*insertion_point = vma;
if (vma) {
vma->vm_prev = prev;
vma_gap_update(vma);
} else
mm->highest_vm_end = prev ? prev->vm_end : 0;
tail_vma->vm_next = NULL;
/* Kill the cache */
vmacache_invalidate(mm);
}
總結:
1、這裏有虛擬內存和物理內存、進程內存和操作系統內存的映射圖示,方便各位理解
2、用 rb_node 關鍵詞搜索,我一共找到了 3 千多個:可見紅黑樹在 linux 內核使用的範圍之廣!
3、AVL 樹和紅黑樹很類似, 但是 AVL 樹的刪除和插入需要多次旋轉操作以及不斷向根節點回溯,所以在大量刪除和插入操作的情況下, AVL 樹的效率較低。紅黑樹是一種查找效率僅次於 AVL 樹的不完全平衡二叉樹, 它摒棄了 AVL 要求的強平衡約束,能夠以 O(logn) 的時間複雜度進行插入、刪除操作;而且其插入和刪除最多需要兩次或者三次旋轉即可保持樹的平衡。雖然二者的算法複雜度相同, 但在最壞情況下, 紅黑樹提供了更加快速刪除和插入一個節點的算法,顯然紅黑樹的高效操作更適合 Linux 內核中大量 VMA 的添加、刪除和查找!
本文由 Readfog 進行 AMP 轉碼,版權歸原作者所有。
來源:https://mp.weixin.qq.com/s/71IliTzW3ERISxj8W1WD_Q