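Tracking down why a Go program kept restarting

[Introduction] A Go program restarts frequently, yet the application log shows no panic. How do you find the cause? This article records one such investigation.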

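Error log

Monitoring log

On startup the program prints its process ID. It also runs a signal handler that catches every catchable signal that can terminate the program, logs it, and then exits.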

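1.1 Signals that cannot be caught

Signal    Number     Action  Description
SIGKILL   9          Term    Unconditionally terminates the process (cannot be caught, blocked, or ignored)
SIGSTOP   17,19,23   Stop    Stops the process (cannot be caught, blocked, or ignored)

The first one is the familiar kill -9 pid.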

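Troubleshooting steps

1.1 Check the logs

The program includes a signal handler that catches every catchable signal that can terminate it, logs the signal, and then exits.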

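import (
    "os"
    "os/signal"
    "syscall"
    "time"

    // Assumed import path: beego's logs package, which matches the
    // logs.Info / logs.Debug calls used below.
    "github.com/astaxie/beego/logs"
)
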
var (
    // explain for more details: https://colobu.com/2015/10/09/Linux-Signals/
    // Signal selection reference:
    //   1. https://github.com/fvbock/endless/blob/master/endless.go
    //   2. https://blog.csdn.net/chuanglan/article/details/80750119
    hookableSignals = []os.Signal{
        syscall.SIGHUP,
        syscall.SIGUSR1,
        syscall.SIGUSR2,
        syscall.SIGINT,
        syscall.SIGTERM,
        syscall.SIGTSTP,
        syscall.SIGQUIT,
        // SIGSTOP and SIGKILL are omitted on purpose: they cannot be
        // caught, so registering them with signal.Notify has no effect.
    }
    defaultHeartbeatTime = 1 * time.Minute
)

func handleSignal(pid int) {
    // Go signal notification works by sending `os.Signal`
    // values on a channel. We'll create a channel to
    // receive these notifications (we'll also make one to
    // notify us when the program can exit).
    sigs := make(chan os.Signal, 1)
    done := make(chan bool, 1)
    // `signal.Notify` registers the given channel to
    // receive notifications of the specified signals.
    signal.Notify(sigs, hookableSignals...)
    // This goroutine executes a blocking receive for
    // signals. When it gets one it'll print it out
    // and then notify the program that it can finish.
    go func() {
        sig := <-sigs
        logs.Info("pid[%d], signal: [%v]", pid, sig)
        done <- true
    }()
    // The program will wait here until it gets the
    // expected signal (as indicated by the goroutine
    // above sending a value on `done`) and then exit.
    for {
        logs.Debug("pid[%d], awaiting signal", pid)

        select {
        case <-done:
            logs.Info("exiting")
            return
        case <-time.After(defaultHeartbeatTime):
        }
    }
}

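The error log shows that no signal was caught before the program exited. Since only SIGKILL and SIGSTOP bypass the handler, the cause is most likely one of those two.

1.2 Monitoring the process with strace

Command: strace -T -tt -e trace=all -p pid. The pid I was monitoring was 39918. After waiting for a while, strace did capture the signal that terminated the program.

But this only confirms that SIGKILL was responsible. It still does not tell us why the signal was sent.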

1.3 Taming the OOM killer

Since the signal was SIGKILL, the OOM killer was a likely culprit, so I tried the dmesg command: dmesg -T | grep -E -i -B100 pid, where pid is 39918.

dmesg

Sure enough, it really was an Out Of Memory error.

1.4 go tool pprof

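The web framework I deployed is beego, which ships with Go's profiling tooling built in. You only need to set AdminPort = 8098 in app.conf (8098 is the port I chose). It also provides a monitoring page that can be opened directly in a browser, which is quite convenient.

However, the server is in the cloud and this port is not open. Getting it opened involves a lot of process, so I captured the profile on the server and downloaded the file to inspect it locally. The command is curl http://localhost:8098/prof?command=get%20memprof. Afterwards, a file named mem-xxxx.memprof appears in the root directory of your program.

Download that file and inspect it with go tool pprof bdms mem-43964.memprof. This revealed a piece of code that never closed its SQL connections, which caused the heavy memory usage.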

This article was AMP-transcoded by Readfog; copyright belongs to the original author.
Source: https://mp.weixin.qq.com/s/gM9DRkN0cM5c96HsUbqtzA
