Prometheus 存儲引擎分析

Prometheus 作爲雲原生時代的時序數據庫，是當下最流行的監控平臺之一，儘管其整體架構一直沒怎麼變，但其底層的存儲引擎卻演進了幾個版本，感興趣的讀者可參考 Prometheus 存儲層的演進 (https://tech.ipalfish.com/blog/2020/03/31/the-evolution-of-prometheus-storage-layer/)。本文主要介紹 Prometheus V2（即現在使用的）版本的存儲格式細節，以及查詢是如何定位到符合條件的數據，旨在通過本文的分析，對 Prometheus 的存儲引擎有更深入瞭解。

說明：本文並不會涉及到查詢的解析與函數求值過程。代碼分析基於 v2.25.2 版本。

背景知識

時序特點

時序數據的特點可以用一話概括：垂直寫（最新數據），水平查。

對於雲原生場景來說，另一個特點是數據生命週期短，一次容器的擴縮容會導致時間線膨脹一倍。瞭解這兩個特點後，來看看 Prometheus 是如何存儲數據來迎合上述模式：

├── 01BKGV7JC0RY8A6MACW02A2PJD  // block 的 ULID
│   ├── chunks
│   │   └── 000001
│   ├── tombstones
│   ├── index
│   └── meta.json
├── chunks_head
│   └── 000001
└── wal
    ├── 000000002
    └── checkpoint.00000001
        └── 00000000

可以看到，數據目錄主要有以下幾部分：

block，一個時間段內（默認 2 小時）的所有數據，只讀，用 ULID 命名。每一個 block 內主要包括：
chunks 固定大小（最大 128M）的 chunks 文件
index 索引文件，主要包含倒排索引的信息
meta.json 元信息，主要包括 block 的 minTime/maxTime，方便查詢時過濾
chunks_head，當前在寫入的 block 對應的 chunks 文件，只讀，最多 120 個數據點，時間跨度最大 2 小時。
wal，Prometheus 採用攢批的方式來異步刷盤，因此需要 WAL 來保證數據可靠性

通過上面的目錄結構，不難看出 Prometheus 的設計思路：

通過數據按時間分片的方式來解決數據生命週期短的問題
通過內存攢批的方式來對應只寫最新數據的場景

數據模式

Prometheus 支持的模式比較簡單，只支持單值模式，如下：

cpu_usage{core="1", ip="130.25.175.171"} 14.04 1618137750
metric     labels                        value timesample

倒排索引

索引是支持多維搜索的主要手段，時序中的索引結構和搜索引擎的類似，是個倒排索引，可參考下圖

在一次查詢中，會對涉及到的 label 分別求對應的 postings lists（即時間線集合），然後根據 filter 類型進行集合運算，最後根據運算結果得出的時間線，去查相應數據即可。

磁盤存儲格式

數據格式

┌──────────────────────────────┐
│  magic(0x0130BC91) <4 byte>  │
├──────────────────────────────┤
│    version(1) <1 byte>       │
├──────────────────────────────┤
│    padding(0) <3 byte>       │
├──────────────────────────────┤
│ ┌──────────────────────────┐ │
│ │         Chunk 1          │ │
│ ├──────────────────────────┤ │
│ │          ...             │ │
│ ├──────────────────────────┤ │
│ │         Chunk N          │ │
│ └──────────────────────────┘ │
└──────────────────────────────┘
# 單個 chunk 內的結構
┌─────────────────────┬───────────────────────┬───────────────────────┬───────────────────┬───────────────┬──────────────┬────────────────┐
| series ref <8 byte> | mint <8 byte, uint64> | maxt <8 byte, uint64> | encoding <1 byte> | len <uvarint> | data <bytes> │ CRC32 <4 byte> │
└─────────────────────┴───────────────────────┴───────────────────────┴───────────────────┴───────────────┴──────────────┴────────────────┘

chunk 爲數據在磁盤中的最小組織單元，需要明確以下兩點：

單個 chunk 的時間跨度默認是 2 小時，Prometheus 後臺會有合併操作，把時間相鄰的 block 合到一起
series ref 爲時間線的唯一標示，由 8 個字節組成，前 4 個表示文件 id，後 4 個表示在文件內的 offset，需配合後文的索引結構來實現數據的定位

索引格式

┌────────────────────────────┬─────────────────────┐
│ magic(0xBAAAD700) <4b>     │ version(1) <1 byte> │
├────────────────────────────┴─────────────────────┤
│ ┌──────────────────────────────────────────────┐ │
│ │                 Symbol Table                 │ │
│ ├──────────────────────────────────────────────┤ │
│ │                    Series                    │ │
│ ├──────────────────────────────────────────────┤ │
│ │                 Label Index 1                │ │
│ ├──────────────────────────────────────────────┤ │
│ │                      ...                     │ │
│ ├──────────────────────────────────────────────┤ │
│ │                 Label Index N                │ │
│ ├──────────────────────────────────────────────┤ │
│ │                   Postings 1                 │ │
│ ├──────────────────────────────────────────────┤ │
│ │                      ...                     │ │
│ ├──────────────────────────────────────────────┤ │
│ │                   Postings N                 │ │
│ ├──────────────────────────────────────────────┤ │
│ │               Label Offset Table             │ │
│ ├──────────────────────────────────────────────┤ │
│ │             Postings Offset Table            │ │
│ ├──────────────────────────────────────────────┤ │
│ │                      TOC                     │ │
│ └──────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────┘

在一個索引文件中，最主要的是以下幾部分（從下往上）：

TOC 存儲的是其他部分的 offset
Postings Offset Table，用來存儲倒排索引，Key 爲 label name/value 序對，Value 爲 Postings 在文件中的 offset。
Postings N，存儲的是具體的時間線序列
Series，存儲的是當前時間線，對應的 chunk 文件信息
Label Offset Table 與 Label Index 目前在查詢時沒有使用到，這裏不再講述

每個部分的具體編碼格式，可參考官方文檔 Index Disk Format，這裏重點講述一次查詢是如何找到符合條件的數據的：

首先在 Posting Offset Table 中，找到對應 label 的 Postings 位置

然後再根據 Postings 中的 series 信息，找到對應的 chunk 位置，即上文中的 series ref。

使用方式

Prometheus 在啓動時，會去加載數據元信息到內存中。主要有下面兩部分：

block 的元信息，最主要的是 mint/maxt，用來確定一次查詢是否需要查看當前 block 文件，之後把 chunks 文件以 mmap 方式打開

// open all blocks
bDirs, err := blockDirs(dir)
for _, bDir := range bDirs {
    meta, _, err := readMetaFile(bDir)
    // See if we already have the block in memory or open it otherwise.
    block, open := getBlock(loaded, meta.ULID)
    if !open {
        block, err = OpenBlock(l, bDir, chunkPool)
        if err != nil {
            corrupted[meta.ULID] = err
            continue
        }
    }
    blocks = append(blocks, block)
}
// open chunk files
for _, fn := range files {
    f, err := fileutil.OpenMmapFile(fn)
    if err != nil {
        return nil, tsdb_errors.NewMulti(
            errors.Wrap(err, "mmap files"),
            tsdb_errors.CloseAll(cs),
        ).Err()
    }
    cs = append(cs, f)
    bs = append(bs, realByteSlice(f.Bytes()))
}

block 對應的索引信息，主要是倒排索引。由於單個 label 對應的 Postings 可能會非常大，Prometheus 不是全量加載，而是每隔 32 個加載，來減輕內存壓力。並且保證第一個與最後一個一定被加載，查詢時採用類似跳錶的方式進行 posting 定位。
下面代碼爲 DB 啓動時，讀入 postings 的邏輯：

// For the postings offset table we keep every label name but only every nth
// label value (plus the first and last one), to save memory.
ReadOffsetTable(r.b, r.toc.PostingsTable, func(key []string, _ uint64, off int) error {
    if _, ok := r.postings[key[0]]; !ok {
        // Next label name.
        r.postings[key[0]] = []postingOffset{}
        if lastKey != nil {
            // Always include last value for each label name.
            r.postings[lastKey[0]] = append(r.postings[lastKey[0]], postingOffset{value: lastKey[1], off: lastOff})
        }
        lastKey = nil
        valueCount = 0
    }
    if valueCount%32 == 0 {
        r.postings[key[0]] = append(r.postings[key[0]], postingOffset{value: key[1], off: off})
        lastKey = nil
    } else {
        lastKey = key
        lastOff = off
    }
    valueCount++
}
if lastKey != nil {
    r.postings[lastKey[0]] = append(r.postings[lastKey[0]], postingOffset{value: lastKey[1], off: lastOff})
}

下面代碼爲根據 label 查詢 postings 的邏輯，完整可見 index 的 Postings 方法：

e, ok := r.postings[name] // name 爲 label key
if !ok || len(values) == 0 { // values 爲當前需要查詢的 label values
    return EmptyPostings(), nil
}
res := make([]Postings, 0, len(values))
skip := 0
valueIndex := 0
for valueIndex < len(values) && values[valueIndex] < e[0].value {
    // Discard values before the start.
    valueIndex++
}
for valueIndex < len(values) {
    value := values[valueIndex]
    // 用二分查找，找到當前 value 在 postings 中的位置
    i := sort.Search(len(e), func(i int) bool { return e[i].value >= value })
    if i == len(e) {
        // We're past the end.
        break
    }
    if i > 0 && e[i].value != value {  // postings 中沒有該 value，需要用前面一個來在文件中搜索
        // Need to look from previous entry.
        i--
    }
    // Don't Crc32 the entire postings offset table, this is very slow
    // so hope any issues were caught at startup.
    d := encoding.NewDecbufAt(r.b, int(r.toc.PostingsTable), nil)
    d.Skip(e[i].off)
    // Iterate on the offset table.
    var postingsOff uint64 // The offset into the postings table.
    for d.Err() == nil {
        // ... skip 邏輯省略
        v := d.UvarintBytes()       // Label value.
        postingsOff = d.Uvarint64() // Offset.
        for string(v) >= value {
            if string(v) == value {
                // Read from the postings table.
                d2 := encoding.NewDecbufAt(r.b, int(postingsOff), castagnoliTable)
                _, p, err := r.dec.Postings(d2.Get())
                res = append(res, p)
            }
            valueIndex++
            if valueIndex == len(values) {
                break
            }
            value = values[valueIndex]
        }
        if i+1 == len(e) || value >= e[i+1].value || valueIndex == len(values) {
            // Need to go to a later postings offset entry, if there is one.
            break
        }
    }
}

內存結構

Block 在 Prometheus 實現中，主要分爲兩類：

當前正在寫入的，稱爲 head。當超過 2 小時或超過 120 個點時，head 會將 chunk 寫入到本地磁盤中，並使用 mmap 映射到內存中，保存在下文的 mmappedChunk 中。
歷史只讀的，存放在一數組中

type DB struct {
    blocks []*Block
    head *Head
    // ... 忽略其他字段
}
// Block 內的主要字段是 IndexReader，其內部主要是 postings，即倒排索引
// Map of LabelName to a list of some LabelValues's position in the offset table.
// The first and last values for each name are always present.
postings map[string][]postingOffset
type postingOffset struct {
    value string // label value
    off   int    // posting 在對於文件中的 offset
}

在上文磁盤結構中介紹過，postingOffset 不是全量加載，而是每隔 32 個。

Head

type Head struct {
    postings *index.MemPostings // Postings lists for terms.
    // All series addressable by their ID or hash.
    series *stripeSeries
    // ... 忽略其他字段
}
type MemPostings struct {
    mtx     sync.RWMutex
    m       map[string]map[string][]uint64  // label key -> label value -> posting lists
    ordered bool
}

MemPostings 是 Head 中的索引結構，與 Block 的 postingOffset 不同，posting 是全量加載的，畢竟 Head 保存的數據較小，對內存壓力也小。

type stripeSeries struct {
    size                    int
    series                  []map[uint64]*memSeries
    hashes                  []seriesHashmap
    locks                   []stripeLock
    seriesLifecycleCallback SeriesLifecycleCallback
}
type memSeries struct {
    sync.RWMutex
    mmappedChunks []*mmappedChunk // 只讀
    headChunk     *memChunk // 讀寫
    ...... // 省略其他字段
    }
type mmappedChunk struct {
    // 數據文件在磁盤上的位置，即上文中的 series ref
    ref              uint64
    numSamples       uint16
    minTime, maxTime int64
}

stripeSeries 是比較的核心結構，series 字段的 key 爲時間線，採用自增方式生成；value 爲 memSeries，內部有存儲具體數據的 chunk，採用分段鎖思路來減少鎖競爭。

使用方式

對於一個查詢，大概涉及的步驟：

根據 label 查出所涉及到的時間線，然後根據 filter 類型，進行集合運算，找出符合要求的時間線
根據時間線信息與時間範圍信息，去 block 內查詢符合條件的數據

在第一步主要在 PostingsForMatchers 函數中完成，主要有下面幾個優化點：

對於取反的 filter（ != !~ ），轉化爲等於的形式，這樣因爲等於形式對應的時間線往往會少於取反的效果，最後在合併時，減去這些取反的時間線即可。可參考：Be smarter in how we look at matchers. #572
不同 label 的時間線合併時，利用了時間線有序的特點，採用類似 mergesort 的方式來惰性合併，大致過程如下：

type intersectPostings struct {
    arr []Postings // 需要合併的時間線數組
    cur uint64 // 當前的時間線
}
func (it *intersectPostings) doNext() bool {
Loop:
    for {
        for _, p := range it.arr {
            if !p.Seek(it.cur) {
                return false
            }
            if p.At() > it.cur {
                it.cur = p.At()
                continue Loop
            }
        }
        return true
    }
}
func (it *intersectPostings) Next() bool {
    for _, p := range it.arr {
        if !p.Next() {
            return false
        }
        if p.At() > it.cur {
            it.cur = p.At()
        }
    }
    return it.doNext()
}

在第一步查出符合條件的 chunk 所在文件以及 offset 信息之後，第二步的取數據則相對簡單，直接使用 mmap 讀數據即可，這間接利用操作系統的 page cache 來做緩存，自身不需要再去實現 Buffer Pool 之類的數據結構。

總結

通過上文的分析，大體上把 Prometheus 的存儲結構以及查詢流程分析了一遍，還有些細節沒再展開去介紹，比如爲了節約內存使用，label 使用了字典壓縮，但這並不妨礙讀者理解其原理。

此外，Prometheus 默認 2 小時一個 Block 對大時間範圍查詢不友好，因此其後臺會對定期 chunk 文件進行 compaction，合併後的文件大小爲 min(31d, retention_time * 0.1) ，相關細節後面有機會再單獨介紹吧。

參考

Prometheus 時序數據庫 - 數據的查詢 https://my.oschina.net/alchemystar/blog/4985328
Prometheus 時序數據庫 - 磁盤中的存儲結構 https://my.oschina.net/alchemystar/blog/4965684
Prometheus TSDB (Part 1): The Head Block https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block/
Prometheus TSDB (Part 4): Persistent Block and its Index https://ganeshvernekar.com/blog/prometheus-tsdb-persistent-block-and-its-index/
Prometheus TSDB (Part 5): Queries https://ganeshvernekar.com/blog/prometheus-tsdb-queries/
Prometheus: The Unicorn in Metrics https://www.alibabacloud.com/blog/prometheus-the-unicorn-in-metrics_595168
Writing a Time Series Database from Scratch https://fabxc.org/tsdb/
https://github.com/prometheus/prometheus/blob/main/tsdb/docs/format/index.md

作者：劉家財

來源：https://liujiacai.net/blog/2021/04/11/prometheus-storage-engine/

本文由 Readfog 進行 AMP 轉碼，版權歸原作者所有。
來源：https://mp.weixin.qq.com/s/LU609X_0St_icheFJSdRLQ

背景知識

時序特點

數據模式

倒排索引

磁盤存儲格式

數據格式

索引格式

使用方式

內存結構

Head

使用方式

總結

參考

猜你喜歡