代碼精讀 LevelDB 布隆濾波器

簡介

布隆濾波器 BloomFilter 是一種常用的「使用少量字節數判斷鍵值存在性」的手段。其大致原理爲：使用多個哈希函數，對同一個鍵值做哈希運行，並將其哈希值取模設置比特數值。當給定一個鍵值時，如果計算上述多個哈希函數，存在一個哈希值對應的比特位爲零，則說明該鍵值不存在。

從上可以發現，BloomFilter 常用於判斷鍵的不存在性。由於哈希碰撞的原因，可能存在假陽性 (False Positive)，即：所有哈希函數計算出結果對應的比特位全爲 1，但該鍵仍然不存在。

爲了降低假陽性的概率，存在一個權衡，即使用更少的哈希函數，和更少的比特位，使得判別錯誤率在可接受範圍內。其理論推導的結果如下。其中，k 爲哈希函數個數，m 爲比特數組容量，n 爲鍵的個數，x 爲每個鍵佔用的比特位數，即 m/n。

最佳哈希函數個數爲：(m/n) * ln2 ~= 0.69 x
在最佳哈希個數前提下，使得錯誤率不超過 y，則：m >= 1.44n * log2(1/y)，即 x >= 1.44 * log2(1/y)

實現

在 LevelDB 中，爲使得假陽性概率在 1%，則每鍵佔用比特數：x >= 9.57。因此 LevelDB 推薦每鍵佔用比特位爲 10。

哈希函數

每輪 hash 計算處理 4 字節。可以考慮拓展成 64 位哈希函數。

static uint32_t BloomHash(const Slice& key) {
  return Hash(key.data(), key.size(), 0xbc9f1d34);
}

uint32_t Hash(const char* data, size_t n, uint32_t seed) {
  // Similar to murmur hash
  const uint32_t m = 0xc6a4a793;
  const uint32_t r = 24;
  const char* limit = data + n;
  uint32_t h = seed ^ (n * m);

  // Pick up four bytes at a time
  while (data + 4 <= limit) {
    uint32_t w = DecodeFixed32(data);
    data += 4;
    h += w;
    h *= m;
    h ^= (h >> 16);
  }

  // Pick up remaining bytes
  switch (limit - data) {
    case 3:
     h += static_cast<uint8_t>(data[2]) << 16;
     FALLTHROUGH_INTENDED;
    case 2:
     h += static_cast<uint8_t>(data[1]) << 8;
     FALLTHROUGH_INTENDED;
    case 1:
     h += static_cast<uint8_t>(data[0]);
     h *= m;
     h ^= (h >> r);
     break;
  }
  return h;
}

FilterPolicy

FilterPolicy 是一個純虛基類，聲明瞭 3 個接口。其主要用於判斷 key 的存在性，以減少 disk IO 的次數。在 LevelDB 中，僅實現了 BloomFilterPolicy 一種過濾器，其計算出來的位圖將存儲在 sst FilterBlock 中。

struct FilterPolicy {
 public:
  virtual ~FilterPolicy();

  // Return the name of this policy.  Note that if the filter encoding
  // changes in an incompatible way, the name returned by this method
  // must be changed.  Otherwise, old incompatible filters may be
  // passed to methods of this type.
  virtual const char* Name() const = 0;

  // keys[0,n-1] contains a list of keys (potentially with duplicates)
  // that are ordered according to the user supplied comparator.
  // Append a filter that summarizes keys[0,n-1] to *dst.
  //
  // Warning: do not change the initial contents of *dst.  Instead,
  // append the newly constructed filter to *dst.
  virtual void CreateFilter(const Slice* keys, int n, std::string* dst) const = 0;

  // "filter" contains the data appended by a preceding call to
  // CreateFilter() on this class.  This method must return true if
  // the key was in the list of keys passed to CreateFilter().
  // This method may return true or false if the key was not on the
  // list, but it should aim to return false with a high probability.
  virtual bool KeyMayMatch(const Slice& key, const Slice& filter) const = 0;
};

BloomFilterPolicy 僅包含每鍵佔用比特數和哈希函數個數成員。其中哈希個數，按照上述公式推導，採用 ln(2) * bits_per_key。

struct BloomFilterPolicy : public FilterPolicy {
  size_t bits_per_key_; // 每個 key 所需的比特數
  size_t k_; // 哈希函數個數
};

explicit BloomFilterPolicy(int bits_per_key) : bits_per_key_(bits_per_key) {
  // We intentionally round down to reduce probing cost a little bit
  k_ = static_cast<size_t>(bits_per_key * 0.69);  // 0.69 =~ ln(2)
  if (k_ < 1) k_ = 1;
  if (k_ > 30) k_ = 30;
}

構造 BloomFilter 流程

根據入參 keys 的長度，計算出所需的比特數組長度，並按字節對齊
在 dst 數組末尾，持久化 k_
遍歷每個 keys，按照 double-hashing，爲每個 key 計算 k_ 次哈希，並設置比特數組

void CreateFilter(const Slice* keys, int n, std::string* dst) const override {
  // Compute bloom filter size (in both bits and bytes)
  size_t bits = n * bits_per_key_; // 計算 bloom filter 總比特數

  // For small n, we can see a very high false positive rate.  Fix it
  // by enforcing a minimum bloom filter length.
  if (bits < 64) bits = 64; // 最小佔用比特數

  size_t bytes = (bits + 7) / 8;
  bits = bytes * 8; // 按字節對齊後的比特數

  const size_t init_size = dst->size(); // 初始字節，保持不動
  dst->resize(init_size + bytes, 0);
  dst->push_back(static_cast<char>(k_));  // Remember # of probes in filter，持久化了 k
  char* array = &(*dst)[init_size];
  for (int i = 0; i < n; i++) {
   // Use double-hashing to generate a sequence of hash values.
   // See analysis in [Kirsch,Mitzenmacher 2006].
   uint32_t h = BloomHash(keys[i]); // 僅使用一個哈希函數
   const uint32_t delta = (h >> 17) | (h << 15);  // Rotate right 17 bits
   for (size_t j = 0; j < k_; j++) { // 一個哈希函數，經過 k 次變換
     const uint32_t bitpos = h % bits;
     array[bitpos / 8] |= (1 << (bitpos % 8)); // 通過哈希值，設置比特位
     h += delta;
   }
  }
}

驗證 BloomFilter 流程：

從入參 bloom_filter 中還原出比特數組長度，以及哈希個數 k_
優化：如果哈希個數超過 30，則直接返回
按照 double-hash，計算入參 key 的 k 次哈希值，並根據比特數組，判斷其存在性

bool KeyMayMatch(const Slice& key, const Slice& bloom_filter) const override {
  const size_t len = bloom_filter.size();
  if (len < 2) return false;

  const char* array = bloom_filter.data();
  const size_t bits = (len - 1) * 8; // 比特總數

  // Use the encoded k so that we can read filters generated by
  // bloom filters created using different parameters.
  const size_t k = array[len - 1]; // 還原出 k_ 參數
  if (k > 30) {
   // Reserved for potentially new encodings for short bloom filters.
   // Consider it a match.
   return true;
  }

  uint32_t h = BloomHash(key);
  const uint32_t delta = (h >> 17) | (h << 15);  // Rotate right 17 bits
  for (size_t j = 0; j < k; j++) { // 如果 k 次哈希計算，有一次不命中 bitmap，就認爲 key 不存在
   const uint32_t bitpos = h % bits;
   if ((array[bitpos / 8] & (1 << (bitpos % 8))) == 0) return false;
   h += delta;
  }
  return true;
}

本文由 Readfog 進行 AMP 轉碼，版權歸原作者所有。
來源：https://mp.weixin.qq.com/s/eXutrgBfXjIErdbocNntiA

簡介

實現

哈希函數

FilterPolicy

猜你喜歡