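Debugging a Deadlock Bug in a Go Application

[Editor's note] How do you track down a deadlock in Go? The author walks through the troubleshooting approach and how the problem was resolved.

I spent almost the whole of last week debugging this painful deadlock bug. Deadlocks are hard to reproduce, which also makes them hard to debug. I am writing this post so that I remember the lesson.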

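Analyzing the cause of the deadlock

The analysis showed that the deadlock was mainly caused by mixing Mutex and Channel. In Go, both a Mutex and a Channel can be used for synchronization, making sure that multiple goroutines do not read and write shared data at the same time and that no data race occurs. A Channel is more flexible than a Mutex: it can also be used, for example, to schedule goroutines or to collect the results returned by several goroutines.

An example of a Channel used for synchronization: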

See here: https://medium.com/stupid-gopher-tricks/more-powerful-synchronization-in-go-using-channels-f4a1c3242ed0

go func() {
  for {
    select {
    case value = <-h.setValCh: // set the current value.
    case h.getValCh <- value: // send the current value.
    }
  }
}()

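In the example above, the goroutine uses a single select-case to decide which task to run next. Reads and writes both go through this one goroutine, which guarantees that a read and a write never happen at the same time. This pattern is very powerful, but misusing it can lead to deadlocks, so it has to be used with care.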
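To make the pattern concrete, here is a minimal runnable sketch built around the fragment above. Only the field names setValCh and getValCh come from the snippet; the holder type, newHolder, Set and Get are hypothetical names chosen for illustration.

```go
package main

import "fmt"

// holder is a hypothetical type; only the channel field names come from
// the snippet in the article.
type holder struct {
	setValCh chan int
	getValCh chan int
}

// newHolder starts the goroutine that owns the value.
func newHolder() *holder {
	h := &holder{
		setValCh: make(chan int),
		getValCh: make(chan int),
	}
	go func() {
		var value int // only this goroutine ever touches value
		for {
			select {
			case value = <-h.setValCh: // set the current value
			case h.getValCh <- value: // send the current value
			}
		}
	}()
	return h
}

// Set hands a new value to the owning goroutine.
func (h *holder) Set(v int) { h.setValCh <- v }

// Get asks the owning goroutine for the current value.
func (h *holder) Get() int { return <-h.getValCh }

func main() {
	h := newHolder()
	h.Set(42)
	fmt.Println(h.Get()) // prints 42
}
```

Because both channels are unbuffered, Set returns only after the owning goroutine has received the value, so a later Get observes it.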

The following is an example of such a problem. It is fairly simple, but it is exactly the kind of mistake a newcomer is likely to make:

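func foo() {
 a := make(chan bool)
 b := make(chan bool)
 done := make(chan bool)
 go func() {
  for {
   select {
   case <-a:
    fmt.Println("case A")
    // Blocks this goroutine until someone sends on b;
    // meanwhile a and done are not served.
    <-b
   case <-b:
    fmt.Println("case B")
   case <-done:
    fmt.Println("case done")
    // break only leaves the select, not the enclosing for loop.
    break
   }
  }
 }()
}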
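One plausible reading of the problem in foo is that the nested receive on b stalls the whole loop, and that break leaves only the select. Under that assumption, a sketch of a safer loop could look like the following; run and the events slice are hypothetical names added so the behavior can be checked.

```go
package main

import "fmt"

// run is a hypothetical rewrite of the loop in foo. It records which case
// handled each event and returns the record once done fires.
func run(a, b, done <-chan bool) []string {
	var events []string
	for {
		select {
		case <-a:
			events = append(events, "case A")
			// No nested <-b here: the next loop iteration handles b.
		case <-b:
			events = append(events, "case B")
		case <-done:
			events = append(events, "case done")
			return events // return leaves the loop; a bare break would not
		}
	}
}

func main() {
	a, b, done := make(chan bool), make(chan bool), make(chan bool)
	result := make(chan []string)
	go func() { result <- run(a, b, done) }()
	a <- true
	done <- true
	fmt.Println(<-result) // prints [case A case done]
}
```

Every case returns to the select quickly, so a sender on any of the three channels is never left waiting on work that belongs to another case.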

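If a program synchronizes with only Mutexes or only Channels, it stays simple to understand and easier to debug. The danger lies in different parts of the program using a Mutex and a Channel respectively and then interacting with each other.

Below is an extremely simplified version of the bug. Can you spot the problem?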

type A struct {
    mtx *sync.Mutex
    // other data structures
}

type B struct {
    action chan bool
    clear  chan bool
    // other channels and data structures
}

var (
    a = NewA()
    b = NewB()
)

func NewB() *B {
    b := &B{
        action: make(chan bool),
        clear:  make(chan bool),
    }
    go func() {
        for {
            select {
            case <-b.clear:
                // clear records
            case <-b.action:
                a.Action()
                // ... other cases
            }
        }
    }()
    // other initializations
    return b
}

func (a *A) Action() {
    a.mtx.Lock()
    defer a.mtx.Unlock()

    // do action
}

func (a *A) Foo() {
    a.mtx.Lock()
    defer a.mtx.Unlock()

    // do some other actions
    b.clear <- true // channel send while the mutex is still held
}

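When Action() and Foo() are called from different goroutines at the same time, both try to take the same mutex. Without channels that is normally harmless: one caller simply waits for the other. Here, however, the channels also act as a synchronization mechanism, because B's select loop runs only one case at a time, so handling clear and calling Action() never overlap. Suppose Foo() already holds the mutex when B's goroutine enters the action case: Action() blocks waiting for Foo() to release the mutex, while the b.clear <- true statement in Foo() blocks until B's goroutine finishes Action() and comes back to the select. Each side waits for the other, and the program deadlocks!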

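Using debugging tools from the Go ecosystem

I have not yet found a really good debugging tool for this kind of problem. The Go runtime includes detection for global deadlocks, but deadlocks are usually partial (only some goroutines are stuck), and as far as I know there is currently no tool that can directly and precisely locate a deadlock like this one.

This time gdb and Go's pprof tooling were a great help, pprof in particular. For a Go server program that already serves HTTP, using pprof is very simple: import the pprof package and it registers new debugging paths on the default HTTP mux (http.DefaultServeMux):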

import (
    ...
    _ "net/http/pprof"
)

Once the HTTP service is running, the debug output is available through a browser or curl. The following prints the backtraces of all goroutines in the program:

curl 'localhost:10000/debug/pprof/goroutine?debug=1'

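When reading the pprof output to debug a deadlock, pay particular attention to goroutines that stay blocked: the wait state shown in brackets after the goroutine ID (such as chan send in the sample output below) and the stack frame where each one is stuck.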

pprof is also very useful as a general program analysis library. This time I even used it to discover a resource leak. Further reading:

https://golang.org/pkg/net/http/pprof/
Profiling Go programs with pprof
https://blog.minio.io/debugging-go-routine-leaks-a1220142d32c

Sample pprof output, taken from the blog post (https://blog.minio.io/debugging-go-routine-leaks-a1220142d32c):

goroutine 149 [chan send]:
main.sum(0xc420122e58, 0x3, 0x3, 0xc420112240)
        /home/karthic/gophercon/count-instrument.go:39 +0x6c
created by main.sumConcurrent
        /home/karthic/gophercon/count-instrument.go:51 +0x12b

goroutine 243 [chan send]:
main.sum(0xc42021a0d8, 0x3, 0x3, 0xc4202760c0)
        /home/karthic/gophercon/count-instrument.go:39 +0x6c
created by main.sumConcurrent
        /home/karthic/gophercon/count-instrument.go:51 +0x12b

goroutine 259 [chan send]:
main.sum(0xc4202700d8, 0x3, 0x3, 0xc42029c0c0)
        /home/karthic/gophercon/count-instrument.go:39 +0x6c
created by main.sumConcurrent
        /home/karthic/gophercon/count-instrument.go:51 +0x12b

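Lessons learned

Probably few people pay as much attention to the synchronizing role of a Channel as they do to a Mutex, yet it is a classic idiom in Go, and it needs care when used.

Avoid mixing Mutex and Channel where you can. In particular, do not put a Channel operation inside the region protected by a Mutex, or a deadlock becomes very likely.
Keep the locked region as small as possible; if you can, wrap only the data that needs protection. You can even attach the protection directly to the data, for example through getter and setter functions. This also reduces the chance that a Channel operation ends up inside the locked region.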

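Following these points should reduce the likelihood of deadlocks to some extent. Of course, avoiding deadlocks still depends on programmers designing and planning their code carefully.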
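The getter and setter recommendation can be sketched as follows. The counter type and the report function are hypothetical; the point is only that the mutex lives next to the data and every channel operation happens with no lock held.

```go
package main

import (
	"fmt"
	"sync"
)

// counter is a hypothetical type that keeps its mutex next to the data it
// guards.
type counter struct {
	mu sync.Mutex
	n  int
}

// Add is the setter: lock, update, unlock. Nothing else runs under the lock.
func (c *counter) Add(delta int) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.n += delta
	return c.n
}

// Get is the getter: it holds the lock only long enough to read the value.
func (c *counter) Get() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.n
}

// report updates the counter and then publishes the new value.
func report(c *counter, out chan<- int) {
	v := c.Add(1) // the critical section ends inside Add
	out <- v      // the channel send happens with no lock held
}

func main() {
	c := &counter{}
	out := make(chan int)
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); report(c, out) }()
	}
	sum := 0
	for i := 0; i < 4; i++ {
		sum += <-out
	}
	wg.Wait()
	fmt.Println(c.Get(), sum) // prints 4 10
}
```

Since callers can reach the data only through Add and Get, there is no code path that holds the mutex while blocking on a channel.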

Reposted from:

zhuanlan.zhihu.com/p/56430428

This article was transcoded to AMP by Readfog; copyright belongs to the original author.
Source: https://mp.weixin.qq.com/s/rw_JVcXPXrck_L1gS1iP8w