pprof: Principles and Implementation

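wziww is the friend who helps me keep golang-notes up to date, and he wrote this piece on how pprof works and how it is implemented. Any tips this article receives will go to him in full~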

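This chapter does not cover how to use pprof or the tooling around it. Instead it analyzes how pprof is implemented inside the runtime, as a reference for using it. Before diving in, consider three questions. They are probably what puzzles most people when they use pprof, so keep them in mind during the analysis that follows: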

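How much pressure does enabling pprof put on the runtime?
Can pprof be selectively turned on and off at suitable moments for an application running in production?
How does pprof work?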

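Go's built-in pprof API lives in the runtime/pprof package. It lets users interact with the runtime, so that the metrics of a running application can be analyzed for performance tuning and troubleshooting. You can also import _ "net/http/pprof" and use the built-in HTTP endpoints: the pprof in the net module is simply a set of wrappers Go provides around runtime/pprof, and you can just as well call runtime/pprof directly yourself.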
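As a minimal sketch of calling runtime/pprof directly rather than going through the HTTP endpoints, the helper below looks a profile up by name and renders it as text. The helper name dumpProfile is made up for this illustration; pprof.Lookup and Profile.WriteTo are the package's own API.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
)

// dumpProfile (hypothetical helper) renders one named runtime profile.
// debug follows the WriteTo convention: 0 is binary protobuf, 1 is text.
func dumpProfile(name string, debug int) (string, error) {
	p := pprof.Lookup(name)
	if p == nil {
		return "", fmt.Errorf("unknown profile %q", name)
	}
	var buf bytes.Buffer
	if err := p.WriteTo(&buf, debug); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := dumpProfile("goroutine", 1)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```

This is roughly what the HTTP handlers do on your behalf, minus the request parsing.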

// src/runtime/pprof/pprof.go
// available profile categories
profiles.m = map[string]*Profile{
  "goroutine":    goroutineProfile,
  "threadcreate": threadcreateProfile,
  "heap":         heapProfile,
  "allocs":       allocsProfile,
  "block":        blockProfile,
  "mutex":        mutexProfile,
}
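The same set of categories can be observed from user code. The sketch below lists every registered profile with its current record count through pprof.Profiles(); the helper name profileCounts is an assumption made for this example.

```go
package main

import (
	"fmt"
	"runtime/pprof"
	"sort"
)

// profileCounts (hypothetical helper) returns the current record count of
// every registered profile, keyed by profile name.
func profileCounts() map[string]int {
	counts := make(map[string]int)
	for _, p := range pprof.Profiles() {
		counts[p.Name()] = p.Count()
	}
	return counts
}

func main() {
	counts := profileCounts()
	names := make([]string, 0, len(counts))
	for name := range counts {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		fmt.Printf("%-12s %d\n", name, counts[name])
	}
}
```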

allocs

var allocsProfile = &Profile{
  name:  "allocs",
  count: countHeap, // identical to heap profile
  write: writeAlloc,
}

writeAlloc (mainly involves the following APIs)
ReadMemStats(m *MemStats)
MemProfile(p []MemProfileRecord, inuseZero bool)

// ReadMemStats populates m with memory allocator statistics.
//
// The returned memory allocator statistics are up to date as of the
// call to ReadMemStats. This is in contrast with a heap profile,
// which is a snapshot as of the most recently completed garbage
// collection cycle.
func ReadMemStats(m *MemStats) {
  // STW (stop-the-world) operation
  stopTheWorld("read mem stats")
  // switch to the system stack
  systemstack(func() {
    // copy memstats into m
    readmemstats_m(m)
  })

  startTheWorld()
}

// MemProfile returns a profile of memory allocated and freed per allocation
// site.
//
// MemProfile returns n, the number of records in the current memory profile.
// If len(p) >= n, MemProfile copies the profile into p and returns n, true.
// If len(p) < n, MemProfile does not change p and returns n, false.
//
// If inuseZero is true, the profile includes allocation records
// where r.AllocBytes > 0 but r.AllocBytes == r.FreeBytes.
// These are sites where memory was allocated, but it has all
// been released back to the runtime.
//
// The returned profile may be up to two garbage collection cycles old.
// This is to avoid skewing the profile toward allocations; because
// allocations happen in real time but frees are delayed until the garbage
// collector performs sweeping, the profile only accounts for allocations
// that have had a chance to be freed by the garbage collector.
//
// Most clients should use the runtime/pprof package or
// the testing package's -test.memprofile flag instead
// of calling MemProfile directly.
func MemProfile(p []MemProfileRecord, inuseZero bool) (n int, ok bool) {
  lock(&proflock)
  // If we're between mProf_NextCycle and mProf_Flush, take care
  // of flushing to the active profile so we only have to look
  // at the active profile below.
  mProf_FlushLocked()
  clear := true
  /*
   * Keep this mbuckets (memory profile buckets) in mind:
   * all allocs samples are recorded in this global variable; it is analyzed in detail below
   * -------------------------------------------------
   * (gdb) info variables mbuckets
   * All variables matching regular expression "mbuckets":
   *
   * File runtime:
   * runtime.bucket *runtime.mbuckets;
   * (gdb)
   */
  for b := mbuckets; b != nil; b = b.allnext {
    mp := b.mp()
    if inuseZero || mp.active.alloc_bytes != mp.active.free_bytes {
      n++
    }
    if mp.active.allocs != 0 || mp.active.frees != 0 {
      clear = false
    }
  }
  if clear {
    // Absolutely no data, suggesting that a garbage collection
    // has not yet happened. In order to allow profiling when
    // garbage collection is disabled from the beginning of execution,
    // accumulate all of the cycles, and recount buckets.
    n = 0
    for b := mbuckets; b != nil; b = b.allnext {
      mp := b.mp()
      for c := range mp.future {
        mp.active.add(&mp.future[c])
        mp.future[c] = memRecordCycle{}
      }
      if inuseZero || mp.active.alloc_bytes != mp.active.free_bytes {
        n++
      }
    }
  }
  if n <= len(p) {
    ok = true
    idx := 0
    for b := mbuckets; b != nil; b = b.allnext {
      mp := b.mp()
      if inuseZero || mp.active.alloc_bytes != mp.active.free_bytes {
        // copy the mbuckets data
        record(&p[idx], b)
        idx++
      }
    }
  }
  unlock(&proflock)
  return
}

To summarize the operations behind pprof/allocs:

A brief STW plus a systemstack switch to collect runtime information
A copy of the global mbuckets list is returned to the user
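The two steps above can be tried from user code. This is only a sketch: the helper names and the allocation sizes are made up, and MemProfileRate is set to 1 purely so the demo has something to show (normally it is set once, at the start of main).

```go
package main

import (
	"fmt"
	"runtime"
)

var sink [][]byte // keeps the demo allocations reachable

// readMemProfile (hypothetical helper) copies the memory profile records,
// asking for the size first and retrying if the profile grew in between.
func readMemProfile(inuseZero bool) []runtime.MemProfileRecord {
	n, _ := runtime.MemProfile(nil, inuseZero)
	for {
		p := make([]runtime.MemProfileRecord, n+50)
		var ok bool
		n, ok = runtime.MemProfile(p, inuseZero)
		if ok {
			return p[:n]
		}
		// The profile grew between the two calls; try again.
	}
}

// allocateAndProfile (hypothetical helper) samples every allocation,
// allocates count blocks, and returns the resulting profile records.
func allocateAndProfile(count int) []runtime.MemProfileRecord {
	runtime.MemProfileRate = 1 // demo only: record every allocation
	for i := 0; i < count; i++ {
		sink = append(sink, make([]byte, 64<<10))
	}
	// The profile lags behind garbage collection, so force two cycles.
	runtime.GC()
	runtime.GC()
	return readMemProfile(true)
}

func main() {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms) // involves a brief STW
	recs := allocateAndProfile(16)
	fmt.Printf("HeapAlloc=%d bytes, %d profile records\n", ms.HeapAlloc, len(recs))
}
```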

mbuckets

As mentioned above, the core of pprof/allocs is the handling of mbuckets. The diagram below gives a simple picture of the operations on mbuckets.

var mbuckets  *bucket // memory profile buckets
type bucket struct {
  next    *bucket
  allnext *bucket
  typ     bucketType // memBucket or blockBucket (includes mutexProfile)
  hash    uintptr
  size    uintptr
  nstk    uintptr
}

                                                  ---------------
                                                 |  user access  |
                                                  ---------------
                                                         |
 ------------------                                      |
|   mbuckets list  |              copy                   |
|     (global)     | -------------------------------------  
 ------------------
       |
       |
       | create_or_get && insert_or_update bucket into mbuckets
       |
       |
 --------------------------------------
|  func stkbucket & typ == memProfile  |
 --------------------------------------
                |
         ----------------
        |  mProf_Malloc  | // records stack trace and related info
         ----------------
                |
         ----------------
        |  profilealloc  | // computes next_sample
         ----------------
                |      
                |       /*
                |       * if rate := MemProfileRate; rate > 0 {
                |       *   if rate != 1 && size < c.next_sample {
                |       *     c.next_sample -= size
                | sample *   } else {
                | record *     mp := acquirem()
                |       *     profilealloc(mp, x, size)
                |       *     releasem(mp)
                |       *   }
                |       * }
                |       */
                |
           ------------    not sampled
          |  mallocgc  |-----------...
           ------------

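The diagram makes it clear that the runtime samples memory allocations according to a certain policy and records them in mbuckets for the user to analyze. The sampling algorithm has one important dependency: MemProfileRate.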

// MemProfileRate controls the fraction of memory allocations
// that are recorded and reported in the memory profile.
// The profiler aims to sample an average of
// one allocation per MemProfileRate bytes allocated.
//
// To include every allocated block in the profile, set MemProfileRate to 1.
// To turn off profiling entirely, set MemProfileRate to 0.
//
// The tools that process the memory profiles assume that the
// profile rate is constant across the lifetime of the program
// and equal to the current value. Programs that change the
// memory profiling rate should do so just once, as early as
// possible in the execution of the program (for example,
// at the beginning of main).
var MemProfileRate int = 512 * 1024

The default is 512 KB, and users can configure it themselves.
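To make the countdown in the mallocgc excerpt concrete, here is a simplified simulation. It is not the runtime's code: the sampler type is invented for this sketch, and drawing the next distance from an exponential distribution with mean rate is an assumption chosen to match the Poisson-process description in the runtime/pprof comments.

```go
package main

import (
	"fmt"
	"math/rand"
)

// sampler is a simplified stand-in for the per-cache sampling state.
type sampler struct {
	rate       int // plays the role of MemProfileRate
	nextSample int // bytes left until the next sampled allocation
	rng        *rand.Rand
}

func newSampler(rate int, seed int64) *sampler {
	s := &sampler{rate: rate, rng: rand.New(rand.NewSource(seed))}
	s.nextSample = s.draw()
	return s
}

// draw picks the distance to the next sample (assumed exponential, mean rate).
func (s *sampler) draw() int {
	if s.rate <= 1 {
		return 0
	}
	return int(s.rng.ExpFloat64() * float64(s.rate))
}

// shouldSample mirrors the branch structure of the excerpt in the diagram.
func (s *sampler) shouldSample(size int) bool {
	if s.rate <= 0 {
		return false // profiling turned off
	}
	if s.rate != 1 && size < s.nextSample {
		s.nextSample -= size
		return false
	}
	s.nextSample = s.draw()
	return true
}

// countSampled runs n allocations of the given size through a fresh sampler.
func countSampled(rate, n, size int) int {
	s := newSampler(rate, 1)
	sampled := 0
	for i := 0; i < n; i++ {
		if s.shouldSample(size) {
			sampled++
		}
	}
	return sampled
}

func main() {
	for _, rate := range []int{0, 1, 512 * 1024} {
		fmt.Printf("rate=%d sampled=%d of 10000\n", rate, countSampled(rate, 10000, 64))
	}
}
```

With rate 1 every allocation is recorded, with rate 0 none are, and with the default only a small fraction of small allocations is sampled.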

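Note that because enabling pprof adds sampling pressure and overhead, the Go team has made newer toolchains set this variable selectively, changing [1] the situation where it is on by default.

Specifically, if the code does not reference the relevant functionality, the toolchain sets the initial value to 0; otherwise it keeps the default (512 KB).

(This article is based on Go 1.14.3; check your own version if you see differences.)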

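pprof/allocs summary

Once enabled it puts extra pressure on the runtime: when an allocation is sampled, extra information is recorded during the runtime malloc for later analysis
You can choose whether to enable it, and at what sampling rate, by setting runtime.MemProfileRate. Go versions differ on whether it is on by default, depending on whether user code references (as seen by the linker) the relevant packages / variables. The default value is 512 KB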
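One way to make that choice configurable is to read the rate once at startup. The sketch below is an assumption-laden illustration: the environment variable name APP_MEMPROFILE_RATE and the helper parseRate are invented here and are not part of Go or of any project mentioned in this article.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
)

// parseRate (hypothetical helper) parses a sampling rate and falls back to
// the given default when the input is empty, malformed, or negative.
func parseRate(s string, fallback int) int {
	v, err := strconv.Atoi(s)
	if err != nil || v < 0 {
		return fallback
	}
	return v
}

func main() {
	// Set the rate once, as early as possible, as the MemProfileRate
	// documentation recommends. 0 turns memory profiling off.
	// APP_MEMPROFILE_RATE is a made-up variable name for this sketch.
	runtime.MemProfileRate = parseRate(os.Getenv("APP_MEMPROFILE_RATE"), runtime.MemProfileRate)
	fmt.Println("MemProfileRate =", runtime.MemProfileRate)
}
```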

The allocs part also includes the approximate calculation of the heap state, which is covered in the next section.

heap

allocs: A sampling of all past memory allocations

heap: A sampling of memory allocations of live objects. You can specify the gc GET parameter to run GC before taking the heap sample.

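Compare the official descriptions of allocs and heap: one analyzes all past memory allocations, the other the allocations currently live on the heap. heap can also take an extra parameter to run a GC before the analysis.

The two look very different... but at the code level, apart from the GC that can be triggered manually and the type of file generated (when debug == 0), there is essentially no difference.

heap sampling (pseudo)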

// p holds the MemProfileRecord samples mentioned above
for _, r := range p {
    hideRuntime := true
    for tries := 0; tries < 2; tries++ {
      stk := r.Stack()
      // For heap profiles, all stack
      // addresses are return PCs, which is
      // what appendLocsForStack expects.
      if hideRuntime {
        for i, addr := range stk {
          if f := runtime.FuncForPC(addr); f != nil && strings.HasPrefix(f.Name(), "runtime.") {
            continue
          }
          // Found non-runtime. Show any runtime uses above it.
          stk = stk[i:]
          break
        }
      }
      locs = b.appendLocsForStack(locs[:0], stk)
      if len(locs) > 0 {
        break
      }
      hideRuntime = false // try again, and show all frames next time.
    }
    // rate is runtime.MemProfileRate
    values[0], values[1] = scaleHeapSample(r.AllocObjects, r.AllocBytes, rate)
    values[2], values[3] = scaleHeapSample(r.InUseObjects(), r.InUseBytes(), rate)
    var blockSize int64
    if r.AllocObjects > 0 {
      blockSize = r.AllocBytes / r.AllocObjects
    }
    b.pbSample(values, locs, func() {
      if blockSize != 0 {
        b.pbLabel(tagSample_Label, "bytes", "", blockSize)
      }
    })
  }

// scaleHeapSample adjusts the data from a heap Sample to
// account for its probability of appearing in the collected
// data. heap profiles are a sampling of the memory allocations
// requests in a program. We estimate the unsampled value by dividing
// each collected sample by its probability of appearing in the
// profile. heap profiles rely on a poisson process to determine
// which samples to collect, based on the desired average collection
// rate R. The probability of a sample of size S to appear in that
// profile is 1-exp(-S/R).
func scaleHeapSample(count, size, rate int64) (int64, int64) {
  if count == 0 || size == 0 {
    return 0, 0
  }

  if rate <= 1 {
    // if rate==1 all samples were collected so no adjustment is needed.
    // if rate<1 treat as unknown and skip scaling.
    return count, size
  }

  avgSize := float64(size) / float64(count)
  scale := 1 / (1 - math.Exp(-avgSize/float64(rate)))

  return int64(float64(count) * scale), int64(float64(size) * scale)
}

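Why the "pseudo" in the heading? As the snippet above shows, pprof does not actually scan all heap memory when it produces the profile (which would hardly be practical). Instead it takes the previously sampled data (current object counts, sizes, the sampling rate, and so on) and computes an estimate of the heap from it. As a reference for us, that is usually good enough.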
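A small worked example shows what the estimate looks like. The function body is the scaleHeapSample quoted above; the object sizes in main are arbitrary values picked for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// scaleHeapSample is copied from the runtime/pprof excerpt quoted above.
func scaleHeapSample(count, size, rate int64) (int64, int64) {
	if count == 0 || size == 0 {
		return 0, 0
	}
	if rate <= 1 {
		// rate == 1: everything was sampled; rate < 1: unknown, skip scaling.
		return count, size
	}
	avgSize := float64(size) / float64(count)
	scale := 1 / (1 - math.Exp(-avgSize/float64(rate)))
	return int64(float64(count) * scale), int64(float64(size) * scale)
}

func main() {
	const rate = 512 * 1024 // the default MemProfileRate
	// One sampled object of each size: how many real objects does it stand for?
	for _, objSize := range []int64{64, 1024, 512 * 1024, 8 << 20} {
		objects, bytes := scaleHeapSample(1, objSize, rate)
		fmt.Printf("one sampled %d-byte object ~ %d objects, %d bytes\n", objSize, objects, bytes)
	}
}
```

Small objects are rarely sampled, so each sample is scaled up heavily; objects much larger than the rate are almost always sampled, so the scale factor approaches 1.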

goroutine

When debug >= 2, the stacks are dumped directly; see the stack [2] chapter for details.

// fetch == runtime.GoroutineProfile
func writeRuntimeProfile(w io.Writer, debug int, name string, fetch func([]runtime.StackRecord) (int, bool)) error {
  // Find out how many records there are (fetch(nil)),
  // allocate that many records, and get the data.
  // There's a race—more records might be added between
  // the two calls—so allocate a few extra records for safety
  // and also try again if we're very unlucky.
  // The loop should only execute one iteration in the common case.
  var p []runtime.StackRecord
  n, ok := fetch(nil)
  for {
    // Allocate room for a slightly bigger profile,
    // in case a few more entries have been added
    // since the call to ThreadProfile.
    p = make([]runtime.StackRecord, n+10)
    n, ok = fetch(p)
    if ok {
      p = p[0:n]
      break
    }
    // Profile grew; try again.
  }

  return printCountProfile(w, debug, name, runtimeProfile(p))
}

// GoroutineProfile returns n, the number of records in the active goroutine stack profile.
// If len(p) >= n, GoroutineProfile copies the profile into p and returns n, true.
// If len(p) < n, GoroutineProfile does not change p and returns n, false.
//
// Most clients should use the runtime/pprof package instead
// of calling GoroutineProfile directly.
func GoroutineProfile(p []StackRecord) (n int, ok bool) {
  gp := getg()

  isOK := func(gp1 *g) bool {
    // Checking isSystemGoroutine here makes GoroutineProfile
    // consistent with both NumGoroutine and Stack.
    return gp1 != gp && readgstatus(gp1) != _Gdead && !isSystemGoroutine(gp1, false)
  }
  // a familiar sight: STW again
  stopTheWorld("profile")
  // count the goroutines
  n = 1
  for _, gp1 := range allgs {
    if isOK(gp1) {
      n++
    }
  }
  // when p has enough room, collect each goroutine's info; the overall approach is almost identical to the stack API
  if n <= len(p) {
    ok = true
    r := p

    // Save current goroutine.
    sp := getcallersp()
    pc := getcallerpc()
    systemstack(func() {
      saveg(pc, sp, gp, &r[0])
    })
    r = r[1:]

    // Save other goroutines.
    for _, gp1 := range allgs {
      if isOK(gp1) {
        if len(r) == 0 {
          // Should be impossible, but better to return a
          // truncated profile than to crash the entire process.
          break
        }
        saveg(^uintptr(0), ^uintptr(0), gp1, &r[0])
        r = r[1:]
      }
    }
  }

  startTheWorld()

  return n, ok
}

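Summary of pprof/goroutine

It is an STW operation; if you need to inspect the details, be aware of the risk this API brings
The overall flow is basically a stack dump of every goroutine, so there is not much more to say; if it is unfamiliar, see the stack chapter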
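The fetch-and-retry loop in writeRuntimeProfile can be reproduced directly against runtime.GoroutineProfile. This is a sketch with made-up helper names; the parked goroutines exist only to give the snapshot something to capture.

```go
package main

import (
	"fmt"
	"runtime"
)

// goroutineRecords (hypothetical helper) snapshots the goroutine profile,
// sizing the slice first and retrying if more goroutines appeared meanwhile.
func goroutineRecords() []runtime.StackRecord {
	n, _ := runtime.GoroutineProfile(nil)
	for {
		p := make([]runtime.StackRecord, n+10)
		var ok bool
		n, ok = runtime.GoroutineProfile(p)
		if ok {
			return p[:n]
		}
	}
}

// countWithParked starts extra goroutines parked on a channel, takes a
// snapshot, releases them, and returns how many records were captured.
func countWithParked(extra int) int {
	release := make(chan struct{})
	for i := 0; i < extra; i++ {
		go func() { <-release }()
	}
	n := len(goroutineRecords())
	close(release)
	return n
}

func main() {
	fmt.Println("goroutines captured:", countWithParked(3))
}
```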

pprof/threadcreate

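Some may ask: usually watching goroutines is enough, so why track threads as well? Blocking system calls that cannot be preempted [3], cgo-related threads, and so on can all be given a quick analysis with it. That said, most thread problems you will run into (leaks and the like) are caused by how the upper layers use them.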

// a simple experiment, again using the non-preemptible blocking system call from before
package main

import (
  "fmt"
  "net/http"
  _ "net/http/pprof"
  "os"
  "syscall"
  "unsafe"
)

const (
  SYS_futex           = 202
  _FUTEX_PRIVATE_FLAG = 128
  _FUTEX_WAIT         = 0
  _FUTEX_WAKE         = 1
  _FUTEX_WAIT_PRIVATE = _FUTEX_WAIT | _FUTEX_PRIVATE_FLAG
  _FUTEX_WAKE_PRIVATE = _FUTEX_WAKE | _FUTEX_PRIVATE_FLAG
)

func main() {
  fmt.Println(os.Getpid())
  go func() {
    b := make([]byte, 1<<20)
    _ = b
  }()
  for i := 1; i < 13; i++ {
    go func() {
      var futexVar int = 0
      for {
        // Syscall vs. RawSyscall: see the syscall chapter for the differences
        fmt.Println(syscall.Syscall6(
          SYS_futex,                          // trap AX    202
          uintptr(unsafe.Pointer(&futexVar)), // a1 DI      1
          uintptr(_FUTEX_WAIT),               // a2 SI      0
          0,                                  // a3 DX
          0,                                  //uintptr(unsafe.Pointer(&ts)), // a4 R10
          0,                                  // a5 R8
          0))
      }
    }()
  }
  http.ListenAndServe("0.0.0.0:8899", nil)
}

# GET /debug/pprof/threadcreate?debug=1
threadcreate profile: total 18
17 @
#  0x0

1 @ 0x43b818 0x43bfa3 0x43c272 0x43857d 0x467fb1
#  0x43b817  runtime.allocm+0x157      /usr/local/go/src/runtime/proc.go:1414
#  0x43bfa2  runtime.newm+0x42      /usr/local/go/src/runtime/proc.go:1736
#  0x43c271  runtime.startTemplateThread+0xb1  /usr/local/go/src/runtime/proc.go:1805
#  0x43857c  runtime.main+0x18c      /usr/local/go/src/runtime/proc.go:186

# combine this with a tool such as pstack
ps -efT | grep 22298 # pid = 22298
root     22298 22298 13767  0 16:59 pts/4    00:00:00 ./mstest
root     22298 22299 13767  0 16:59 pts/4    00:00:00 ./mstest
root     22298 22300 13767  0 16:59 pts/4    00:00:00 ./mstest
root     22298 22301 13767  0 16:59 pts/4    00:00:00 ./mstest
root     22298 22302 13767  0 16:59 pts/4    00:00:00 ./mstest
root     22298 22303 13767  0 16:59 pts/4    00:00:00 ./mstest
root     22298 22304 13767  0 16:59 pts/4    00:00:00 ./mstest
root     22298 22305 13767  0 16:59 pts/4    00:00:00 ./mstest
root     22298 22306 13767  0 16:59 pts/4    00:00:00 ./mstest
root     22298 22307 13767  0 16:59 pts/4    00:00:00 ./mstest
root     22298 22308 13767  0 16:59 pts/4    00:00:00 ./mstest
root     22298 22309 13767  0 16:59 pts/4    00:00:00 ./mstest
root     22298 22310 13767  0 16:59 pts/4    00:00:00 ./mstest
root     22298 22311 13767  0 16:59 pts/4    00:00:00 ./mstest
root     22298 22312 13767  0 16:59 pts/4    00:00:00 ./mstest
root     22298 22316 13767  0 16:59 pts/4    00:00:00 ./mstest
root     22298 22317 13767  0 16:59 pts/4    00:00:00 ./mstest

pstack 22299
Thread 1 (process 22299):
#0  runtime.futex () at /usr/local/go/src/runtime/sys_linux_amd64.s:568
#1  0x00000000004326f4 in runtime.futexsleep (addr=0xb2fd78 <runtime.sched+280>, val=0, ns=60000000000) at /usr/local/go/src/runtime/os_linux.go:51
#2  0x000000000040cb3e in runtime.notetsleep_internal (n=0xb2fd78 <runtime.sched+280>, ns=60000000000, ~r2=<optimized out>) at /usr/local/go/src/runtime/lock_futex.go:193
#3  0x000000000040cc11 in runtime.notetsleep (n=0xb2fd78 <runtime.sched+280>, ns=60000000000, ~r2=<optimized out>) at /usr/local/go/src/runtime/lock_futex.go:216
#4  0x00000000004433b2 in runtime.sysmon () at /usr/local/go/src/runtime/proc.go:4558
#5  0x000000000043af33 in runtime.mstart1 () at /usr/local/go/src/runtime/proc.go:1112
#6  0x000000000043ae4e in runtime.mstart () at /usr/local/go/src/runtime/proc.go:1077
#7  0x0000000000401893 in runtime/cgo(.text) ()
#8  0x00007fb1e2d53700 in ?? ()
#9  0x0000000000000000 in ?? ()

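Have a closer look at the other threads too if you are interested.

pprof/threadcreate is implemented much like pprof/goroutine, except that the former walks the global allm while the latter walks allgs. The difference is that pprof/threadcreate => ThreadCreateProfile does not perform an STW.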
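For comparison, the thread-creation records can be fetched the same way through runtime.ThreadCreateProfile. The helper name below is an assumption made for this sketch.

```go
package main

import (
	"fmt"
	"runtime"
)

// threadCreateRecords (hypothetical helper) snapshots the thread creation
// profile with the usual size-then-retry pattern.
func threadCreateRecords() []runtime.StackRecord {
	n, _ := runtime.ThreadCreateProfile(nil)
	for {
		p := make([]runtime.StackRecord, n+10)
		var ok bool
		n, ok = runtime.ThreadCreateProfile(p)
		if ok {
			return p[:n]
		}
	}
}

func main() {
	recs := threadCreateRecords()
	withoutStack := 0
	for _, r := range recs {
		if len(r.Stack()) == 0 {
			withoutStack++ // creation stack was not recorded for this thread
		}
	}
	fmt.Printf("%d threads, %d without a recorded creation stack\n", len(recs), withoutStack)
}
```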

pprof/mutex

Mutex sampling is off by default; turn it on or off by configuring the rate with runtime.SetMutexProfileFraction(int).
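A minimal sketch of switching mutex sampling on for a bounded piece of work and restoring the previous setting afterwards. The helper names and the contention workload are invented for illustration; passing a negative value to SetMutexProfileFraction only reads the current rate.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/pprof"
	"sync"
)

// mutexFraction reads the current setting without changing it.
func mutexFraction() int {
	return runtime.SetMutexProfileFraction(-1)
}

// withMutexProfiling (hypothetical helper) enables sampling at the given
// fraction while fn runs, then restores the previous value. It returns the
// fraction that was in effect during fn.
func withMutexProfiling(fraction int, fn func()) int {
	prev := runtime.SetMutexProfileFraction(fraction)
	defer runtime.SetMutexProfileFraction(prev)
	during := mutexFraction()
	fn()
	return during
}

// contend creates some lock contention so there is something to sample.
func contend() {
	var mu sync.Mutex
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				mu.Lock()
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
}

func main() {
	fmt.Println("fraction while running:", withMutexProfiling(1, contend))
	fmt.Println("mutex profile records:", pprof.Lookup("mutex").Count())
	fmt.Println("fraction afterwards:", mutexFraction())
}
```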

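Similar to the mbuckets analyzed above, the samples here are recorded in xbuckets. Each bucket records the stack holding the lock, the (sampled) count, and other information for the user to inspect.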

//go:linkname mutexevent sync.event
func mutexevent(cycles int64, skip int) {
  if cycles < 0 {
    cycles = 0
  }
  rate := int64(atomic.Load64(&mutexprofilerate))
  // TODO(pjw): measure impact of always calling fastrand vs using something
  // like malloc.go:nextSample()
  // sampling is again driven by a rate; here the rate is kept in the mutexprofilerate variable
  if rate > 0 && int64(fastrand())%rate == 0 {
    saveblockevent(cycles, skip+1, mutexProfile)
  }
}

                                                  ---------------
                                                 |  user access  |
                                                  ---------------
                                                         |
 ------------------                                      |
|   xbuckets list  |              copy                   |
|     (global)     | -------------------------------------  
 ------------------
       |
       |
       | create_or_get && insert_or_update bucket into xbuckets
       |
       |
 --------------------------------------
|  func stkbucket & typ == mutexProfile  |
 --------------------------------------
                 |
         ------------------
        |  saveblockevent  | // records stack trace and related info
         ------------------
                 |
                 |      
                 |       /*  
                 |       *   //go:linkname mutexevent sync.event
                 |       *   func mutexevent(cycles int64, skip int) {
                 |       *     if cycles < 0 {
                 |       *       cycles = 0
                 |       *     }
                 | sample *     rate := int64(atomic.Load64(&mutexprofilerate))
                 | record *     // TODO(pjw): measure impact of always calling fastrand vs using something
                 |       *     // like malloc.go:nextSample()
                 |       *     if rate > 0 && int64(fastrand())%rate == 0 {
                 |       *       saveblockevent(cycles, skip+1, mutexProfile)
                 |       *     }
                 |       *   }
                 |       */
                 |
           ------------     not sampled
          | mutexevent | ----------....
           ------------
                 |
                 |
           ------------   
          | semrelease1 |
           ------------
                 |
                 |
       ------------------------  
      |   runtime_Semrelease   |
       ------------------------
                 |
                 |
           ------------   
          | unlockSlow |
           ------------
                 |
                 |
           ------------  
          |   Unlock   |
           ------------

pprof/block

Same as above; the thing to look at here is bbuckets.

                                                  ---------------
                                                 |  user access  |
                                                  ---------------
                                                         |
 ------------------                                      |
|   bbuckets list  |              copy                   |
|     (global)     | -------------------------------------  
 ------------------
       |
       |
       | create_or_get && insert_or_update bucket into bbuckets
       |
       |
 --------------------------------------
|  func stkbucket & typ == blockProfile  |
 --------------------------------------
                 |
         ------------------
        |  saveblockevent  | // records stack trace and related info
         ------------------
                 |
                 |      
                 |       /*  
                 |       *   func blocksampled(cycles int64) bool {
                 |       *     rate := int64(atomic.Load64(&blockprofilerate))
                 |       *     if rate <= 0 || (rate > cycles && int64(fastrand())%rate > cycles) {
                 |       *       return false
                 | sample *     }
                 | record *     return true
                 |       *   }
                 |       */
                 |
           ------------     not sampled
          | blockevent | ----------....
           ------------
                 |----------------------------------------------------------------------------
                 |                                     |                                      |
           ------------          -----------------------------------------------        ------------
          | semrelease1 |       |  chansend / chanrecv &&  mysg.releasetime > 0 |      |  selectgo  |
           ------------          -----------------------------------------------        ------------

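Compared with mutex sampling, block has extra instrumentation points in chan. Each block event records the difference between two CPU tick readings (cycles) taken before and after the wait. Note that cputicks may have problems on some systems [4]; that is not discussed here.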
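The channel instrumentation can be observed with a tiny experiment. This is a sketch: the helper blockOnce is made up, and rate 1 is used so that every blocking event is recorded, which is more aggressive than you would normally want in production.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/pprof"
	"time"
)

// blockOnce (hypothetical helper) turns block profiling on, blocks on a
// channel receive for roughly d, turns profiling off again, and returns the
// number of records in the block profile.
func blockOnce(d time.Duration) int {
	runtime.SetBlockProfileRate(1) // 1 = record every blocking event
	defer runtime.SetBlockProfileRate(0)

	ch := make(chan struct{})
	go func() {
		time.Sleep(d)
		close(ch)
	}()
	<-ch // the blocking event is reported when this receive wakes up

	return pprof.Lookup("block").Count()
}

func main() {
	fmt.Println("block profile records:", blockOnce(20*time.Millisecond))
}
```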

pprof/profile

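Everything analyzed above is data that the runtime samples and saves automatically while running, for the user to inspect afterwards. profile, by contrast, is CPU profiling over a period the user chooses.

Summary

pprof does put extra pressure on the runtime. How much depends on the various *_rate settings you use, and each endpoint adds a different amount, so choose endpoints according to your actual situation when collecting pprof data.
Versions differ in whether profiling is on by default; check your own environment
Data from pprof is only a reference. It depends on the configured sampling rate, and figures such as the heap state are approximations rather than the result of actually scanning the heap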

 -------------------------
|  pprof.StartCPUProfile  |
 -------------------------
            |
            |
            |
 -------------------------
|  sleep(time.Duration)   |
 -------------------------
            |
            |
            |
 -------------------------
|  pprof.StopCPUProfile  |
 -------------------------
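The start / wait / stop flow in the diagram can be sketched as below. The helper profileCPU is made up for this example, and it spins instead of sleeping so that the profile has CPU time to sample; everything else uses the documented runtime/pprof calls.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
	"time"
)

// profileCPU (hypothetical helper) collects a CPU profile for roughly d and
// returns the raw, gzip-compressed profile bytes.
func profileCPU(d time.Duration) ([]byte, error) {
	var buf bytes.Buffer
	if err := pprof.StartCPUProfile(&buf); err != nil {
		return nil, err // for example, profiling is already running
	}
	deadline := time.Now().Add(d)
	x := 0
	for time.Now().Before(deadline) {
		x++ // busy work so the sampler has something to observe
	}
	_ = x
	pprof.StopCPUProfile() // returns once the profile is fully written to buf
	return buf.Bytes(), nil
}

func main() {
	data, err := profileCPU(200 * time.Millisecond)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("collected %d bytes of profile data\n", len(data))
}
```

The resulting bytes can be written to a file and opened with go tool pprof.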

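The core of pprof.StartCPUProfile and pprof.StopCPUProfile is runtime.SetCPUProfileRate(hz int), which controls the CPU profiling frequency. Setting the frequency here differs from the previous cases: it involves not only the rate itself but also the allocation of the log buffer of the global cpuprof object.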

var cpuprof cpuProfile
type cpuProfile struct {
  lock mutex
  on   bool     // profiling is on
  log  *profBuf // profile events written here

  // extra holds extra stacks accumulated in addNonGo
  // corresponding to profiling signals arriving on
  // non-Go-created threads. Those stacks are written
  // to log the next time a normal Go thread gets the
  // signal handler.
  // Assuming the stacks are 2 words each (we don't get
  // a full traceback from those threads), plus one word
  // size for framing, 100 Hz profiling would generate
  // 300 words per second.
  // Hopefully a normal Go thread will get the profiling
  // signal at least once every few seconds.
  extra      [1000]uintptr
  numExtra   int
  lostExtra  uint64 // count of frames lost because extra is full
  lostAtomic uint64 // count of frames lost because of being in atomic64 on mips/arm; updated racily
}

The log buffer is allocated at a fixed size each time and cannot be tuned.

cpuprof.add

Writes the stack trace into cpuprof's log buffer

// add adds the stack trace to the profile.
// It is called from signal handlers and other limited environments
// and cannot allocate memory or acquire locks that might be
// held at the time of the signal, nor can it use substantial amounts
// of stack.
//go:nowritebarrierrec
func (p *cpuProfile) add(gp *g, stk []uintptr) {
  // Simple cas-lock to coordinate with setcpuprofilerate.
  for !atomic.Cas(&prof.signalLock, 0, 1) {
    osyield()
  }

  if prof.hz != 0 { // implies cpuprof.log != nil
    if p.numExtra > 0 || p.lostExtra > 0 || p.lostAtomic > 0 {
      p.addExtra()
    }
    hdr := [1]uint64{1}
    // Note: write "knows" that the argument is &gp.labels,
    // because otherwise its write barrier behavior may not
    // be correct. See the long comment there before
    // changing the argument here.
    cpuprof.log.write(&gp.labels, nanotime(), hdr[:], stk)
  }

  atomic.Store(&prof.signalLock, 0)
}

Now the flow that leads to cpuprof.add:

 ------------------------
|   cpu profile start    |
 ------------------------
            |
            |
            | start timer (setitimer syscall / ITIMER_PROF)
            | send SIGPROF to current P's thread every period (rate)   --
            |                                                           |
            |                                                           |
 ------------------------                   loop                        |
|       sighandler       |----------------------------------------------
 ------------------------                                            |
            |                                                        |
            | /*                                                     |
            |  *  if sig == _SIGPROF {                               |
            |  *    sigprof(c.sigpc(), c.sigsp(), c.siglr(), gp, _g_.m)
            |  *    return                                           |
            |  *  } */                                               |
            |                                                        |
  ----------------------------                                       | stop
 |   sigprof(stack trace)     |                                      |
  ----------------------------                                       |
            |                                                        |
            |                                                        |
            |                                                        |
  ----------------------                                             |
 |     cpuprof.add      |                                            |
  ----------------------                                   ----------------------
           |                                              |   cpu profile stop   |
           |                                               ----------------------                  
           |            
  ----------------------
 |  cpuprof.log buffer  |                                         
  ----------------------
           |                                        ---------------------                  ---------------
           ----------------------------------------|   cpuprof.read      |----------------|  user access  |
                                                    ---------------------                  ---------------

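Thanks to the GMP model, in the vast majority of cases this timer + signal + current thread approach, together with the preemptive scheduling now supported, samples CPU usage across the whole runtime well. Still, some extreme cases may not be covered, since it is only ever based on the current M.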

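Conclusion

Usability:

The pprof that ships with the runtime already strikes a fairly balanced and thorough trade-off between sampling accuracy, coverage, and overhead

In the vast majority of scenarios, the only performance knobs to think about are the few rate settings

Defaults differ between versions, so verify the default values of these parameters yourself; sometimes you think pprof is off when it is actually on

With suitable parameters, pprof is far less "heavy" than you might imagine

Limitations:

The data is only sampled (as determined by the rate) or estimated

It cannot cover every scenario; special or extreme cases call for their own tuning and additional techniques

Safety:

pprof can be used in production, but do not expose the endpoints directly: operations such as STW are involved, so there is potential risk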

pprof in open-source projects, for reference: nsq [5] and etcd [6] use configuration [7] to choose whether to enable it
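A minimal sketch of such a configuration-gated setup, not taken from nsq or etcd: the pprof handlers are mounted on a dedicated mux only when a flag is set. The helper names, the flag, and the listen address in the closing comment are assumptions for illustration. Note that importing net/http/pprof also registers the handlers on http.DefaultServeMux, so the default mux should not be served publicly either.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/pprof"
)

// newDebugMux (hypothetical helper) returns a mux that serves the pprof
// handlers only when enabled is true.
func newDebugMux(enabled bool) *http.ServeMux {
	mux := http.NewServeMux()
	if !enabled {
		return mux
	}
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	return mux
}

// statusFor issues an in-memory GET against the mux and returns the status.
func statusFor(mux *http.ServeMux, path string) int {
	rec := httptest.NewRecorder()
	mux.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	const path = "/debug/pprof/goroutine?debug=1"
	fmt.Println("enabled: ", statusFor(newDebugMux(true), path))
	fmt.Println("disabled:", statusFor(newDebugMux(false), path))
	// In a real service the mux would be bound to a loopback or internal
	// address, for example: http.ListenAndServe("127.0.0.1:6060", mux)
}
```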

References

https://go-review.googlesource.com/c/go/+/299671

[1] change: https://go-review.googlesource.com/c/go/+/299671/8/src/runtime/mprof.go

[2] stack: runtime_stack.md

[3] system calls: syscall.md

[4] problems: https://github.com/golang/go/issues/8976

[5] nsq: https://github.com/nsqio/nsq/blob/v1.2.0/nsqd/http.go#L78-L88

[6] etcd: https://github.com/etcd-io/etcd/blob/release-3.4/pkg/debugutil/pprof.go#L23

[7] configuration: https://github.com/etcd-io/etcd/blob/release-3.4/etcd.conf.yml.sample#L76

This article was transcoded to AMP by Readfog; copyright belongs to the original author.
Source: https://mp.weixin.qq.com/s/1VoZ9dZYk7-yWS3mP0Nc-w