紅黑樹在內核的應用——虛擬內存管理

1、linux 內核中利用紅黑樹增刪改查快速、穩定的特性來管理的還有另一個非常重要的功能：虛擬內存管理！前面介紹了 buddy 和 slab 算法是用來管理物理頁面的。由於早期物理頁面遠比虛擬頁面小很多，而且只需要分配和回收合併，所以也沒用樹形結構來組織，簡單粗暴地用鏈表來管理！但是虛擬內存不一樣了：以 32 位的系統爲例，虛擬內存有 4GB，能劃分的虛擬內存塊有很多，劃分後需要快速增刪查虛擬內存塊（需要頻繁地讀取代碼、讀寫數據、加載動態鏈接庫等），此時用紅黑樹就很合適了！老規矩，先上結構體：

task_struct 中嵌套了一個 mm_struct 結構體指針，這個結構體大有乾坤：

struct task_struct {
.......
    struct mm_struct *mm;
.......
}

繼續深入：又發現了紅黑樹的根節點！另外兩個 vm_area_struct 結構體又是幹嘛的了？

struct mm_struct {
          struct vm_area_struct * mmap;       /* list of VMAs */
          struct rb_root mm_rb;/*又是紅黑樹的根節點*/
          struct vm_area_struct * mmap_cache;      /* last find_vma result */
.......
}

繼續深入：

發現 rb_node 結構體了麼？這明顯是紅黑樹的節點呀！和上面的 rb_root 不是剛好組成紅黑樹麼？
進程的虛擬內存空間會被分成不同的若干區域，每個區域都有其相關的屬性和用途；一個合法的地址總是落在某個區域當中的，這些區域也不會重疊。在 linux 內核中，這樣的區域被稱之爲虛擬內存區域 (virtual memory areas), 簡稱 VMA。一個 vma 就是一塊連續的線性地址空間的抽象，它擁有自身的權限 (可讀，可寫，可執行等等) ；

struct vm_area_struct {
    struct mm_struct * vm_mm;    /* 所屬的內存描述符 */
    unsigned long vm_start;    /* vma的起始地址 */
    unsigned long vm_end;        /* vma的結束地址 */
 
    /* 該vma的在一個進程的vma鏈表中的前驅vma和後驅vma指針，鏈表中的vma都是按地址來排序的*/
    struct vm_area_struct *vm_next, *vm_prev;
 
    pgprot_t vm_page_prot;        /* vma的訪問權限 */
    unsigned long vm_flags;    /* 標識集 */
 
    struct rb_node vm_rb;      /* 紅黑樹中對應的節點 */
        ...............
}

爲了直觀展示這些結構體之間的關係，我畫了一張圖供大家參考，要點說明如下：

vm_area_struct 有 vm_start 和 vm_end 兩個字段，分別指向虛擬內存區域的開始和結束地址；
vm_area_struct 的 vm_rb 又組成了紅黑樹：便於根據特定的條件快速查找目標區域；
vm_area_struct 實例之間也用鏈表連接：主要用來遍歷節點（不需要像樹形結構一樣前序、中序、後序等方式遍歷，速度快一些；樹形結構遍歷需要遞歸或藉助棧 / 隊列等結構，空間複雜度是 O(N)）

2、（1）結構體定義好了，下一步就是操作這些結構體了。既然用到了紅黑樹管理 VMA，第一件事肯定是建樹、建鏈表啦（所有的操作都在 mm\mmap.c 文件裏），最直接的 api 是__vma_link 函數，如下：

/*將新創建的vm_area_struct掛在mm_struct中管理的紅黑樹上*/
static void
__vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
    struct vm_area_struct *prev, struct rb_node **rb_link,
    struct rb_node *rb_parent)
{
    __vma_link_list(mm, vma, prev, rb_parent);
    __vma_link_rb(mm, vma, rb_link, rb_parent);
}

調用了兩個函數，從名稱看就知道一個是建立鏈表，另一個是建立紅黑樹；先看第一個_vma_link_list 函數，如下：就是把 vma 實例加入到 mm 的 mmap 字段，然後讓 vma 的 next 指針指向下一個 vma 實例，完成鏈表的建立！

void __vma_link_list(struct mm_struct *mm, struct vm_area_struct *vma,
        struct vm_area_struct *prev, struct rb_node *rb_parent)
{
    struct vm_area_struct *next;

    vma->vm_prev = prev;
    if (prev) {
        next = prev->vm_next;
        prev->vm_next = vma;
    } else {
        mm->mmap = vma;
        if (rb_parent)
            next = rb_entry(rb_parent,
                    struct vm_area_struct, vm_rb);
        else
            next = NULL;
    }
    vma->vm_next = next;
    if (next)
        next->vm_prev = vma;
}

另一個是__vma_link_rb 函數在紅黑樹中插入節點，如下：

void __vma_link_rb(struct mm_struct *mm, struct vm_area_struct *vma,
        struct rb_node **rb_link, struct rb_node *rb_parent)
{
    /* Update tracking information for the gap following the new vma. */
    if (vma->vm_next)
        vma_gap_update(vma->vm_next);
    else
        mm->highest_vm_end = vma->vm_end;

    /*
     * vma->vm_prev wasn't known when we followed the rbtree to find the
     * correct insertion point for that vma. As a result, we could not
     * update the vma vm_rb parents rb_subtree_gap values on the way down.
     * So, we first insert the vma with a zero rb_subtree_gap value
     * (to be consistent with what we did on the way down), and then
     * immediately update the gap to the correct value. Finally we
     * rebalance the rbtree after all augmented values have been set.
     */
    rb_link_node(&vma->vm_rb, rb_parent, rb_link);
    vma->rb_subtree_gap = 0;
    vma_gap_update(vma);
    /*傳入的vma插入紅黑樹*/
    vma_rb_insert(vma, &mm->mm_rb);
}

（2）上述代碼執行完畢，紅黑樹也就建好了！接下來就是查詢了；因爲紅黑樹是用來管理 vma 的，建樹的時候所用的 key 值就是 vma 的線性地址了，所以紅黑樹節點左孩 vma 的地址都比當前節點小，右孩 vma 的地址都比當前節點大；代碼如下：

/* Look up the first VMA which satisfies  addr < vm_end,  NULL if none. */
struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
{
    struct rb_node *rb_node;
    struct vm_area_struct *vma;

    /* Check the cache first. */
    vma = vmacache_find(mm, addr);
    if (likely(vma))
        return vma;
    /*紅黑樹的根節點*/
    rb_node = mm->mm_rb.rb_node;

    while (rb_node) {
        struct vm_area_struct *tmp;

        tmp = rb_entry(rb_node, struct vm_area_struct, vm_rb);

        if (tmp->vm_end > addr) {
            vma = tmp;
            /*要查找的addr剛好在這個vma實例的vm_start和vm_end之間，說明找到了*/
            if (tmp->vm_start <= addr)
                break;
            /*addr比vm_start都大，繼續從左子樹查找*/
            rb_node = rb_node->rb_left;
        } else /*addr比vm_end小，繼續從右子樹查找*/
            rb_node = rb_node->rb_right;
    }

    if (vma)
        vmacache_update(addr, vma);
    return vma;
}

（3）上面是根據用戶指定的線性地址查找第一個符合的 vma 實例，用戶實際使用時，還需要查找空閒未使用的虛擬內存塊，用來存儲重要的數據，這個又該怎麼實現了？linux 的實現方法爲：arch_get_unmapped_area 函數，如下：最核心的代碼加了中文註釋；思路就是根據 addr 查找 vma；如果 vma 爲空，並且大小、邊界等條件也都滿足，這塊內存就可以拿來用了！

unsigned long
arch_get_unmapped_area(struct file *filp, unsigned long addr,
        unsigned long len, unsigned long pgoff, unsigned long flags)
{
    struct mm_struct *mm = current->mm;
    struct vm_area_struct *vma;
    struct vm_unmapped_area_info info;

    if (len > TASK_SIZE - mmap_min_addr)
        return -ENOMEM;

    if (flags & MAP_FIXED)
        return addr;

    if (addr) {
        addr = PAGE_ALIGN(addr);//將addr調整成頁大小的倍數
        /*通過addr查找對應的vma是否爲空；如果是空，說明該區域還未被使用
          ，如果其他條件也滿足，就直接使用這塊地址了*/
        vma = find_vma(mm, addr);
        if (TASK_SIZE - len >= addr && addr >= mmap_min_addr &&
            (!vma || addr + len <= vma->vm_start))
            return addr;
    }

    info.flags = 0;
    info.length = len;
    info.low_limit = mm->mmap_base;
    info.high_limit = TASK_SIZE;
    info.align_mask = 0;
    return vm_unmapped_area(&info);
}

（4）內存用完後就要釋放，爲了避免碎片，肯定是要和現有的空閒內存合併的；由於空閒內存沒有用紅黑樹組織，所以此步驟也不涉及紅黑樹的操作，具體的思路爲：先檢查釋放區域之前的 prve 區域終止地址是否與釋放區域起始地址重合，或釋放區域的結束地址是否與其之後的 next 區域起始地址重合；接着再檢查將要合併的區域是否有相同的標誌。如果合併區域均映射了磁盤文件，則還要檢查其映射文件是否相同，以及文件內的偏移量是否連續。思路並不複雜，代碼如下：

/*
 * Given a mapping request (addr,end,vm_flags,file,pgoff), figure out
 * whether that can be merged with its predecessor or its successor.
 * Or both (it neatly fills a hole).
 *
 * In most cases - when called for mmap, brk or mremap - [addr,end) is
 * certain not to be mapped by the time vma_merge is called; but when
 * called for mprotect, it is certain to be already mapped (either at
 * an offset within prev, or at the start of next), and the flags of
 * this area are about to be changed to vm_flags - and the no-change
 * case has already been eliminated.
 *
 * The following mprotect cases have to be considered, where AAAA is
 * the area passed down from mprotect_fixup, never extending beyond one
 * vma, PPPPPP is the prev vma specified, and NNNNNN the next vma after:
 *
 *     AAAA             AAAA                AAAA          AAAA
 *    PPPPPPNNNNNN    PPPPPPNNNNNN    PPPPPPNNNNNN    PPPPNNNNXXXX
 *    cannot merge    might become    might become    might become
 *                    PPNNNNNNNNNN    PPPPPPPPPPNN    PPPPPPPPPPPP 6 or
 *    mmap, brk or    case 4 below    case 5 below    PPPPPPPPXXXX 7 or
 *    mremap move:                                    PPPPXXXXXXXX 8
 *        AAAA
 *    PPPP    NNNN    PPPPPPPPPPPP    PPPPPPPPNNNN    PPPPNNNNNNNN
 *    might become    case 1 below    case 2 below    case 3 below
 *
 * It is important for case 8 that the the vma NNNN overlapping the
 * region AAAA is never going to extended over XXXX. Instead XXXX must
 * be extended in region AAAA and NNNN must be removed. This way in
 * all cases where vma_merge succeeds, the moment vma_adjust drops the
 * rmap_locks, the properties of the merged vma will be already
 * correct for the whole merged range. Some of those properties like
 * vm_page_prot/vm_flags may be accessed by rmap_walks and they must
 * be correct for the whole merged range immediately after the
 * rmap_locks are released. Otherwise if XXXX would be removed and
 * NNNN would be extended over the XXXX range, remove_migration_ptes
 * or other rmap walkers (if working on addresses beyond the "end"
 * parameter) may establish ptes with the wrong permissions of NNNN
 * instead of the right permissions of XXXX.
 */
struct vm_area_struct *vma_merge(struct mm_struct *mm,
            struct vm_area_struct *prev, unsigned long addr,
            unsigned long end, unsigned long vm_flags,
            struct anon_vma *anon_vma, struct file *file,
            pgoff_t pgoff, struct mempolicy *policy,
            struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
{
    pgoff_t pglen = (end - addr) >> PAGE_SHIFT;
    struct vm_area_struct *area, *next;
    int err;

    /*
     * We later require that vma->vm_flags == vm_flags,
     * so this tests vma->vm_flags & VM_SPECIAL, too.
     */
    if (vm_flags & VM_SPECIAL)
        return NULL;

    if (prev)
        next = prev->vm_next;
    else
        next = mm->mmap;
    area = next;
    if (area && area->vm_end == end)        /* cases 6, 7, 8 */
        next = next->vm_next;

    /* verify some invariant that must be enforced by the caller */
    VM_WARN_ON(prev && addr <= prev->vm_start);
    VM_WARN_ON(area && end > area->vm_end);
    VM_WARN_ON(addr >= end);

    /*
     * Can it merge with the predecessor?
     */
    if (prev && prev->vm_end == addr &&
            mpol_equal(vma_policy(prev), policy) &&
            can_vma_merge_after(prev, vm_flags,
                        anon_vma, file, pgoff,
                        vm_userfaultfd_ctx)) {
        /*
         * OK, it can.  Can we now merge in the successor as well?
         */
        if (next && end == next->vm_start &&
                mpol_equal(policy, vma_policy(next)) &&
                can_vma_merge_before(next, vm_flags,
                             anon_vma, file,
                             pgoff+pglen,
                             vm_userfaultfd_ctx) &&
                is_mergeable_anon_vma(prev->anon_vma,
                              next->anon_vma, NULL)) {
                            /* cases 1, 6 */
            err = __vma_adjust(prev, prev->vm_start,
                     next->vm_end, prev->vm_pgoff, NULL,
                     prev);
        } else                    /* cases 2, 5, 7 */
            err = __vma_adjust(prev, prev->vm_start,
                     end, prev->vm_pgoff, NULL, prev);
        if (err)
            return NULL;
        khugepaged_enter_vma_merge(prev, vm_flags);
        return prev;
    }

    /*
     * Can this new request be merged in front of next?
     */
    if (next && end == next->vm_start &&
            mpol_equal(policy, vma_policy(next)) &&
            can_vma_merge_before(next, vm_flags,
                         anon_vma, file, pgoff+pglen,
                         vm_userfaultfd_ctx)) {
        if (prev && addr < prev->vm_end)    /* case 4 */
            err = __vma_adjust(prev, prev->vm_start,
                     addr, prev->vm_pgoff, NULL, next);
        else {                    /* cases 3, 8 */
            err = __vma_adjust(area, addr, next->vm_end,
                     next->vm_pgoff - pglen, NULL, next);
            /*
             * In case 3 area is already equal to next and
             * this is a noop, but in case 8 "area" has
             * been removed and next was expanded over it.
             */
            area = next;
        }
        if (err)
            return NULL;
        khugepaged_enter_vma_merge(area, vm_flags);
        return area;
    }

    return NULL;
}

（5）既然釋放內存，相應的 vma 肯定也要從紅黑樹刪除的，這個功能在 detach_vmas_to_be_unmapped 中實現的：

/*
 * Create a list of vma's touched by the unmap, removing them from the mm's
 * vma list as we go..
   Remove the vma's, and unmap the actual pages

 */
static void
detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
    struct vm_area_struct *prev, unsigned long end)
{
    struct vm_area_struct **insertion_point;
    struct vm_area_struct *tail_vma = NULL;

    insertion_point = (prev ? &prev->vm_next : &mm->mmap);
    vma->vm_prev = NULL;
    do {
        vma_rb_erase(vma, &mm->mm_rb);//從紅黑樹刪除vma
        mm->map_count--;
        tail_vma = vma;
        vma = vma->vm_next;
    } while (vma && vma->vm_start < end);
    *insertion_point = vma;
    if (vma) {
        vma->vm_prev = prev;
        vma_gap_update(vma);
    } else
        mm->highest_vm_end = prev ? prev->vm_end : 0;
    tail_vma->vm_next = NULL;

    /* Kill the cache */
    vmacache_invalidate(mm);
}

總結：

1、這裏有虛擬內存和物理內存、進程內存和操作系統內存的映射圖示，方便各位理解

2、用 rb_node 關鍵詞搜索，我一共找到了 3 千多個：可見紅黑樹在 linux 內核使用的範圍之廣！

3、AVL 樹和紅黑樹很類似, 但是 AVL 樹的刪除和插入需要多次旋轉操作以及不斷向根節點回溯，所以在大量刪除和插入操作的情況下, AVL 樹的效率較低。紅黑樹是一種查找效率僅次於 AVL 樹的不完全平衡二叉樹, 它摒棄了 AVL 要求的強平衡約束，能夠以 O(logn) 的時間複雜度進行插入、刪除操作；而且其插入和刪除最多需要兩次或者三次旋轉即可保持樹的平衡。雖然二者的算法複雜度相同, 但在最壞情況下, 紅黑樹提供了更加快速刪除和插入一個節點的算法，顯然紅黑樹的高效操作更適合 Linux 內核中大量 VMA 的添加、刪除和查找！

本文由 Readfog 進行 AMP 轉碼，版權歸原作者所有。
來源：https://mp.weixin.qq.com/s/71IliTzW3ERISxj8W1WD_Q

猜你喜歡