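Monitoring Harbor with Prometheus

Hello! I'm Li Dabai, and today I'll share how to monitor a Harbor service with Prometheus.

Earlier articles covered the high-availability design and deployment of Harbor based on the offline installer. So what happens when the Harbor host, the Harbor service, or one of its components misbehaves? How do we find out and respond quickly?

Harbor has supported Prometheus monitoring since v2.2, so your Harbor version must be 2.2 or later.

This article deploys the Prometheus services from binaries in a simple way, to help you quickly set up Prometheus monitoring for Harbor.

Monitoring Harbor with Prometheus (binary edition)

1. Deployment overview

The following are deployed on the Harbor host: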

prometheus
node-exporter
grafana
alertmanager

Harbor version: 2.4.2
Host: 192.168.2.22

2. Enable the metrics service in Harbor

2.1 Stop the Harbor service

$ cd /app/harbor              
$ docker-compose  down

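2.2 Edit harbor.yml

Edit the metric section of the Harbor configuration file to enable the harbor-exporter component.

$ cat harbor.yml
### metric configuration section
metric:
  enabled: true     # must be set to true to enable metrics
  port: 9099       # the default port is 9090, which conflicts with Prometheus, so change it
  path: /metrics

If you are not familiar with Harbor, back up the configuration file before editing it!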

2.3 Inject the configuration into the components

$ ./prepare

2.4 Reinstall Harbor with install.sh

$ ./install.sh  --with-notary  --with-trivy  --with-chartmuseum
$ docker-compose ps
NAME          COMMAND             SERVICE       STATUS            PORTS
chartmuseum     "./docker-entrypoint…"   chartmuseum    running (healthy)   
harbor-core     "/harbor/entrypoint.…"   core         running (healthy)   
harbor-db      "/docker-entrypoint.…"   postgresql     running (healthy)   
harbor-exporter  "/harbor/entrypoint.…"   exporter      running

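A new harbor-exporter component now appears in the list.

3. Harbor metrics

With the harbor-exporter component enabled, you can use curl to see which metrics Harbor exposes.

Harbor exposes metrics for the following four key components.

3.1 harbor-exporter metrics

The exporter metrics relate to the configuration of the Harbor instance, and some of the data is collected from the Harbor database.

They are available at <harbor_instance>:<metrics_port>/<metrics_path>: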

$ curl  http://192.168.2.22:9099/metrics

1) harbor_project_total

Total number of public and private projects.

$ curl  http://192.168.2.22:9099/metrics | grep harbor_project_total
# HELP harbor_project_total Total projects number
# TYPE harbor_project_total gauge
harbor_project_total{public="true"} 1   # number of public projects is 1
harbor_project_total{public="false"} 1     # number of private projects

2) harbor_project_repo_total

Total number of repositories in a project.

$ curl  http://192.168.2.22:9099/metrics | grep harbor_project_repo_total
# HELP harbor_project_repo_total Total project repos number
# TYPE harbor_project_repo_total gauge
harbor_project_repo_total{project_} 0

3) harbor_project_member_total

Total number of members in a project.

$ curl  http://192.168.2.22:9099/metrics | grep harbor_project_member_total
# HELP harbor_project_member_total Total members number of a project
# TYPE harbor_project_member_total gauge
harbor_project_member_total{project_} 1  # the library project has 1 member

4) harbor_project_quota_usage_byte

Total resources used by a project.

$ curl  http://192.168.2.22:9099/metrics | grep harbor_project_quota_usage_byte
# HELP harbor_project_quota_usage_byte The used resource of a project
# TYPE harbor_project_quota_usage_byte gauge
harbor_project_quota_usage_byte{project_} 0

5) harbor_project_quota_byte

The quota configured for a project.

$ curl  http://192.168.2.22:9099/metrics | grep harbor_project_quota_byte
# HELP harbor_project_quota_byte The quota of a project
# TYPE harbor_project_quota_byte gauge
harbor_project_quota_byte{project_} -1   # -1 means unlimited
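
As a small illustration of how the two quota metrics combine, the sketch below computes a project's quota usage as a percentage and treats -1 as unlimited. The function name quota_usage_percent and the sample numbers are hypothetical and not part of Harbor.

```python
def quota_usage_percent(usage_bytes, quota_bytes):
    """Return used quota as a percentage, or None when no limit applies."""
    # harbor_project_quota_byte reports -1 for an unlimited project; 0 is also
    # rejected here (an assumption of this sketch) to avoid dividing by zero.
    if quota_bytes <= 0:
        return None
    return usage_bytes / quota_bytes * 100


# Hypothetical values: 512 MiB used out of a 2 GiB quota.
print(quota_usage_percent(512 * 1024**2, 2 * 1024**3))
# Unlimited project, as in the sample output above.
print(quota_usage_percent(0, -1))
```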

6) harbor_artifact_pulled

Number of image pulls in a project.

$ curl  http://192.168.2.22:9099/metrics | grep harbor_artifact_pulled
# HELP harbor_artifact_pulled The pull number of an artifact
# TYPE harbor_artifact_pulled gauge
harbor_artifact_pulled{project_} 0

7) harbor_project_artifact_total

Total number of artifacts in a project, by type. Labels: artifact_type, project_name, public (true, false).

$ curl  http://192.168.2.22:9099/metrics | grep harbor_project_artifact_total

8) harbor_health

Running status of Harbor.

$ curl  http://192.168.2.22:9099/metrics | grep harbor_health
# HELP harbor_health Running status of Harbor
# TYPE harbor_health gauge
harbor_health 1  # 1 means healthy, 0 means unhealthy

9) harbor_system_info

Information about the Harbor instance. Labels: auth_mode (db_auth, ldap_auth, uaa_auth, http_auth, oidc_auth), harbor_version, self_registration (true, false).

$ curl  http://192.168.2.22:9099/metrics | grep harbor_system_info
# HELP harbor_system_info Information of Harbor system
# TYPE harbor_system_info gauge
harbor_system_info{auth_mode="db_auth",harbor_version="v2.4.2-ef2e2e56",self_registration="false"} 1

10) harbor_up

Running status of each Harbor component. Label: component (chartmuseum, core, database, jobservice, portal, redis, registry, registryctl, trivy).

$ curl  http://192.168.2.22:9099/metrics | grep harbor_up
# HELP harbor_up Running status of harbor component
# TYPE harbor_up gauge
harbor_up{component="chartmuseum"} 1
harbor_up{component="core"} 1
harbor_up{component="database"} 1
harbor_up{component="jobservice"} 1
harbor_up{component="portal"} 1
harbor_up{component="redis"} 1
harbor_up{component="registry"} 1
harbor_up{component="registryctl"} 1
harbor_up{component="trivy"} 1   # running status of the Trivy scanner
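
To show how this output could be consumed by a script, here is a minimal sketch that scans metrics text for harbor_up samples and lists the components that are not reporting 1. The sample text is abridged from the output above, with trivy set to 0 purely for illustration; the function name down_components is an assumption.

```python
import re

# Abridged from the harbor_up output above. The trivy value is changed to 0
# here only to demonstrate a failing component (hypothetical data).
SAMPLE = """\
# HELP harbor_up Running status of harbor component
# TYPE harbor_up gauge
harbor_up{component="core"} 1
harbor_up{component="database"} 1
harbor_up{component="trivy"} 0
"""

# Matches lines such as: harbor_up{component="core"} 1
LINE_RE = re.compile(r'^harbor_up\{component="([^"]+)"\}\s+(\S+)')


def down_components(metrics_text):
    """Return the components whose harbor_up sample is not 1."""
    down = []
    for line in metrics_text.splitlines():
        match = LINE_RE.match(line)
        if match and float(match.group(2)) != 1:
            down.append(match.group(1))
    return down


print(down_components(SAMPLE))
```

In practice the text would come from the exporter endpoint rather than a string literal, and an alert rule in Prometheus is usually a better place for this check than a script.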

11) harbor_task_queue_size

Total number of tasks of each type in the queue.

$ curl  http://192.168.2.22:9099/metrics | grep harbor_task_queue_size
# HELP harbor_task_queue_size Total number of tasks
# TYPE harbor_task_queue_size gauge
harbor_task_queue_size{type="DEMO"} 0
harbor_task_queue_size{type="GARBAGE_COLLECTION"} 0
harbor_task_queue_size{type="IMAGE_GC"} 0
harbor_task_queue_size{type="IMAGE_REPLICATE"} 0
harbor_task_queue_size{type="IMAGE_SCAN"} 0
harbor_task_queue_size{type="IMAGE_SCAN_ALL"} 0
harbor_task_queue_size{type="P2P_PREHEAT"} 0
harbor_task_queue_size{type="REPLICATION"} 0
harbor_task_queue_size{type="RETENTION"} 0
harbor_task_queue_size{type="SCHEDULER"} 0
harbor_task_queue_size{type="SLACK"} 0
harbor_task_queue_size{type="WEBHOOK"} 0

12) harbor_task_queue_latency

How long ago the next job to be processed was enqueued, by task type.

$ curl  http://192.168.2.22:9099/metrics | grep harbor_task_queue_latency
# HELP harbor_task_queue_latency how long ago the next job to be processed was enqueued
# TYPE harbor_task_queue_latency gauge
harbor_task_queue_latency{type="DEMO"} 0
harbor_task_queue_latency{type="GARBAGE_COLLECTION"} 0
harbor_task_queue_latency{type="IMAGE_GC"} 0
harbor_task_queue_latency{type="IMAGE_REPLICATE"} 0
harbor_task_queue_latency{type="IMAGE_SCAN"} 0
harbor_task_queue_latency{type="IMAGE_SCAN_ALL"} 0
harbor_task_queue_latency{type="P2P_PREHEAT"} 0
harbor_task_queue_latency{type="REPLICATION"} 0
harbor_task_queue_latency{type="RETENTION"} 0
harbor_task_queue_latency{type="SCHEDULER"} 0
harbor_task_queue_latency{type="SLACK"} 0
harbor_task_queue_latency{type="WEBHOOK"} 0

13) harbor_task_scheduled_total

Number of scheduled tasks.

$ curl  http://192.168.2.22:9099/metrics | grep harbor_task_scheduled_total
# HELP harbor_task_scheduled_total total number of scheduled job
# TYPE harbor_task_scheduled_total gauge
harbor_task_scheduled_total 0

14) harbor_task_concurrency

Total number of concurrent tasks of each type on a worker pool.

$ curl  http://192.168.2.22:9099/metrics | grep harbor_task_concurrency
harbor_task_concurrency{pool="d4053262b74f0a7b83bc6add",type="GARBAGE_COLLECTION"} 0

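3.2 harbor-core metrics

The following metrics come from the Harbor core component and are available at:

<harbor_instance>:<metrics_port>/<metrics_path>?comp=core

1) harbor_core_http_inflight_requests

Number of requests currently in flight. Label: operation (the operationId value from the Harbor API; some legacy endpoints have no operationId, so the label value is unknown).

Example output from the harbor-core component: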

$ curl  http://192.168.2.22:9099/metrics?comp=core |  grep harbor_core_http_inflight_requests
# HELP harbor_core_http_inflight_requests The total number of requests
# TYPE harbor_core_http_inflight_requests gauge
harbor_core_http_inflight_requests 0

2) harbor_core_http_request_duration_seconds

Duration of requests.

Labels: method (GET, POST, HEAD, PATCH, PUT), operation (the operationId value from the Harbor API; some legacy endpoints have no operationId, so the label value is unknown), quantile.

$ curl  http://192.168.2.22:9099/metrics?comp=core |  grep harbor_core_http_request_duration_seconds
# HELP harbor_core_http_request_duration_seconds The time duration of the requests
# TYPE harbor_core_http_request_duration_seconds summary
harbor_core_http_request_duration_seconds{method="GET",operation="GetHealth",quantile="0.5"} 0.001797115
harbor_core_http_request_duration_seconds{method="GET",operation="GetHealth",quantile="0.9"} 0.010445204
harbor_core_http_request_duration_seconds{method="GET",operation="GetHealth",quantile="0.99"} 0.010445204

3) harbor_core_http_request_total

Total number of requests.

Labels: method (GET, POST, HEAD, PATCH, PUT), operation (the operationId value from the Harbor API; some legacy endpoints have no operationId, so the label value is unknown).

$ curl  http://192.168.2.22:9099/metrics?comp=core |  grep harbor_core_http_request_total
# HELP harbor_core_http_request_total The total number of requests
# TYPE harbor_core_http_request_total counter
harbor_core_http_request_total{code="200",method="GET",operation="GetHealth"} 14
harbor_core_http_request_total{code="200",method="GET",operation="GetInternalconfig"} 1
harbor_core_http_request_total{code="200",method="GET",operation="GetPing"} 176
harbor_core_http_request_total{code="200",method="GET",operation="GetSystemInfo"} 14
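
To make the label structure concrete, the following sketch parses samples like the ones above and sums harbor_core_http_request_total per operation. It is a simplified parser written for this illustration (it does not handle escaped quotes or timestamps), and the helper names parse_samples and requests_by_operation are assumptions.

```python
import re

# Sample lines copied from the output above.
SAMPLE = """\
# TYPE harbor_core_http_request_total counter
harbor_core_http_request_total{code="200",method="GET",operation="GetHealth"} 14
harbor_core_http_request_total{code="200",method="GET",operation="GetInternalconfig"} 1
harbor_core_http_request_total{code="200",method="GET",operation="GetPing"} 176
harbor_core_http_request_total{code="200",method="GET",operation="GetSystemInfo"} 14
"""

SAMPLE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return (name, labels, value) tuples for every labelled sample line."""
    samples = []
    for line in text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if match:  # comment lines starting with '#' do not match
            name, labels, value = match.groups()
            samples.append((name, dict(LABEL_RE.findall(labels)), float(value)))
    return samples


def requests_by_operation(text):
    """Sum harbor_core_http_request_total over all label sets, per operation."""
    totals = {}
    for name, labels, value in parse_samples(text):
        if name == "harbor_core_http_request_total":
            operation = labels.get("operation", "unknown")
            totals[operation] = totals.get(operation, 0) + value
    return totals


totals = requests_by_operation(SAMPLE)
print(max(totals, key=totals.get), totals)
```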

3.3 registry metrics

The following metrics come from the registry (Docker Distribution) and are available at:

<harbor_instance>:<metrics_port>/<metrics_path>?comp=registry

1) registry_http_in_flight_requests

In-flight HTTP requests. Label: handler.

$ curl  http://192.168.2.22:9099/metrics?comp=registry |  grep registry_http_in_flight_requests
# HELP registry_http_in_flight_requests The in-flight HTTP requests
# TYPE registry_http_in_flight_requests gauge
registry_http_in_flight_requests{handler="base"} 0
registry_http_in_flight_requests{handler="blob"} 0
registry_http_in_flight_requests{handler="blob_upload"} 0
registry_http_in_flight_requests{handler="blob_upload_chunk"} 0
registry_http_in_flight_requests{handler="catalog"} 0
registry_http_in_flight_requests{handler="manifest"} 0
registry_http_in_flight_requests{handler="tags"} 0

2) registry_http_request_duration_seconds

HTTP request latency in seconds. Labels: handler, method (GET, POST, HEAD, PATCH, PUT), le.

$ curl  http://192.168.2.22:9099/metrics?comp=registry |  grep registry_http_request_duration_seconds

3) registry_http_request_size_bytes

HTTP request size in bytes.

$ curl  http://192.168.2.22:9099/metrics?comp=registry |  grep registry_http_request_size_bytes

3.4 jobservice metrics

The following metrics come from the Harbor Jobservice and are available at:

<harbor_instance>:<metrics_port>/<metrics_path>?comp=jobservice

1) harbor_jobservice_info

Information about the Jobservice.

$ curl  http://192.168.2.22:9099/metrics?comp=jobservice | grep harbor_jobservice_info
# HELP harbor_jobservice_info the information of jobservice
# TYPE harbor_jobservice_info gauge
harbor_jobservice_info{node="f47de52e23b7:172.18.0.11",pool="35f1301b0e261d18fac7ba41",workers="10"} 1

2) harbor_jobservice_task_total

Number of tasks processed for each job type.

$ curl  http://192.168.2.22:9099/metrics?comp=jobservice | grep harbor_jobservice_task_total

3) harbor_jobservice_task_process_time_seconds

Task processing time, i.e. how long a task took from the start of execution to completion.

$ curl  http://192.168.2.22:9099/metrics?comp=jobservice | grep harbor_jobservice_task_process_time_seconds

4. Deploy Prometheus Server (binary)

4.1 Create the installation directory

$ mkdir  /etc/prometheus

4.2 Download the package

$ wget https://github.com/prometheus/prometheus/releases/download/v2.36.2/prometheus-2.36.2.linux-amd64.tar.gz -c
$ tar zxvf  prometheus-2.36.2.linux-amd64.tar.gz    # extract into the current directory, where the cp commands below expect it
$ cp prometheus-2.36.2.linux-amd64/{prometheus,promtool}   /usr/local/bin/
$ prometheus  --version    # check the version
prometheus, version 2.36.2 (branch: HEAD, revision: d7e7b8e04b5ecdc1dd153534ba376a622b72741b)
  build user:       root@f051ce0d6050
  build date:       20220620-13:21:35
  go version:       go1.18.3
  platform:         linux/amd64

4.3 Edit the configuration file

In the Prometheus configuration file, specify the targets from which to scrape the Harbor metrics.

$ cp  prometheus-2.36.2.linux-amd64/prometheus.yml   /etc/prometheus/
$ cat <<EOF > /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
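## Alertmanager address
alerting:
  alertmanagers:
  - static_configs:
    - targets: ["192.168.2.10:9093"]  # replace with your Alertmanager address (192.168.2.22:9093 if it runs on the Harbor host as in this article)
## Alert rule files
rule_files:   # alert rules to load
  - /etc/prometheus/rules.yml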

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: 'node-exporter'
    static_configs:
    - targets:
      - '192.168.2.22:9100'
  - job_name: "harbor-exporter"
    scrape_interval: 20s
    static_configs:
      - targets: ['192.168.2.22:9099']
  - job_name: 'harbor-core'
    params:
      comp: ['core']
    static_configs:
     - targets: ['192.168.2.22:9099']
  - job_name: 'harbor-registry'
    params:
      comp: ['registry']
    static_configs:
    - targets: ['192.168.2.22:9099']
  - job_name: 'harbor-jobservice'
    params:
      comp: ['jobservice']
    static_configs:
    - targets: ['192.168.2.22:9099']
EOF
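
The four Harbor jobs above share one target and differ only in their params. As a rough sketch of why that works, the code below derives the URL that would be requested for each job, assuming Prometheus's documented defaults of scheme http and metrics_path /metrics. The JOBS list and the scrape_urls helper are illustrative and not part of Prometheus.

```python
from urllib.parse import urlencode

# Harbor jobs mirrored from the scrape_configs above.
JOBS = [
    {"job_name": "harbor-exporter", "targets": ["192.168.2.22:9099"]},
    {"job_name": "harbor-core", "params": {"comp": ["core"]},
     "targets": ["192.168.2.22:9099"]},
    {"job_name": "harbor-registry", "params": {"comp": ["registry"]},
     "targets": ["192.168.2.22:9099"]},
    {"job_name": "harbor-jobservice", "params": {"comp": ["jobservice"]},
     "targets": ["192.168.2.22:9099"]},
]


def scrape_urls(job):
    """Return the URL requested for each target of a job."""
    scheme = job.get("scheme", "http")          # default when not configured
    path = job.get("metrics_path", "/metrics")  # default when not configured
    # params maps each name to a list of values, hence doseq=True.
    query = urlencode(job.get("params", {}), doseq=True)
    suffix = f"?{query}" if query else ""
    return [f"{scheme}://{target}{path}{suffix}" for target in job["targets"]]


for job in JOBS:
    print(job["job_name"], scrape_urls(job))
```

The resulting URLs correspond to the curl commands used earlier to inspect each component by hand.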

4.4 Check the syntax

Verify that the configuration file is syntactically valid.

$ promtool check config  /etc/prometheus/prometheus.yml
Checking /etc/prometheus/prometheus.yml
 SUCCESS: /etc/prometheus/prometheus.yml is valid prometheus config file syntax
 
 Checking /etc/prometheus/rules.yml
  SUCCESS: 6 rules found

4.5 Create the systemd unit file

$ cat <<EOF >  /usr/lib/systemd/system/prometheus.service
[Unit]
Description=Prometheus Service
Documentation=https://prometheus.io/docs/introduction/overview/
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
User=root
Group=root
ExecStart=/usr/local/bin/prometheus  --config.file=/etc/prometheus/prometheus.yml

[Install]
WantedBy=multi-user.target
EOF

4.6 Start the service

$ systemctl daemon-reload
$ systemctl enable --now prometheus.service
$ systemctl status prometheus.service

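4.7 Open the Prometheus UI in a browser

Enter <host IP>:9090 in the browser address bar to open the Prometheus UI.

5. Deploy node-exporter

node-exporter collects host resource metrics such as CPU, memory, and disk.

5.1 Download the package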

$ wget https://github.com/prometheus/node_exporter/releases/download/v1.2.2/node_exporter-1.2.2.linux-amd64.tar.gz
$ tar zxvf node_exporter-1.2.2.linux-amd64.tar.gz
$ cp node_exporter-1.2.2.linux-amd64/node_exporter   /usr/local/bin/
$ node_exporter  --version
node_exporter, version 1.2.2 (branch: HEAD, revision: 26645363b486e12be40af7ce4fc91e731a33104e)
  build user:       root@b9cb4aa2eb17
  build date:       20210806-13:44:18
  go version:       go1.16.7
  platform:         linux/amd64

5.2 Create the systemd unit file

$ cat <<EOF >  /usr/lib/systemd/system/node-exporter.service
[Unit]
Description=Prometheus Node Exporter
After=network.target

[Service]
ExecStart=/usr/local/bin/node_exporter
#User=prometheus

[Install]
WantedBy=multi-user.target
EOF

5.3 Start the service

$ systemctl daemon-reload
$ systemctl enable --now node-exporter.service
$ systemctl status node-exporter.service
$ ss  -ntulp |  grep node_exporter
tcp    LISTEN     0   128   :::9100    :::*    users:(("node_exporter",pid=36218,fd=3))

5.4 View the node metrics

Use curl to fetch the monitoring data collected by node-exporter.

$ curl  http://localhost:9100/metrics

6. Deploy Grafana and set up dashboards

Deploy Grafana v8.4.4 from the binary tarball.

6.1 Download the package

$ wget https://dl.grafana.com/enterprise/release/grafana-enterprise-8.4.4.linux-amd64.tar.gz  -c 
$ tar zxvf grafana-enterprise-8.4.4.linux-amd64.tar.gz  -C  /etc/
$ mv  /etc/grafana-8.4.4   /etc/grafana
$ cp -a  /etc/grafana/bin/{grafana-cli,grafana-server}  /usr/local/bin/
# install dependency packages
$ yum install -y  fontpackages-filesystem.noarch libXfont libfontenc lyx-fonts.noarch  xorg-x11-font-utils

6.2 Install plugins

Install the Grafana clock panel plugin:

$ grafana-cli plugins install grafana-clock-panel

Install the Zabbix plugin:

$ grafana-cli plugins install alexanderzobnin-zabbix-app

Install the dependencies for server-side image rendering:

$ yum install -y fontconfig freetype* urw-fonts

6.3 Create the systemd unit file

$ cat <<EOF >/usr/lib/systemd/system/grafana.service
[Service]
Type=notify
ExecStart=/usr/local/bin/grafana-server -homepath /etc/grafana
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

-homepath: the Grafana home (working) directory

6.4 Start the Grafana service

$ systemctl daemon-reload
$ systemctl enable --now grafana.service
$ systemctl status grafana.service
$ ss  -ntulp |  grep grafana-server
tcp    LISTEN     0   128    :::3000   :::*   users:(("grafana-server",pid=120140,fd=9))

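6.5 Configure the data source

Enter the host IP and the Grafana port (3000) in the browser address bar to open the Grafana UI, then add Prometheus as a data source.

Default username/password: admin/admin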
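
If you would rather script this step than click through the UI, the data source can also be created through Grafana's HTTP API (POST /api/datasources). The sketch below only builds the request and does not send it. The host addresses follow this article's setup, the admin/admin credentials are the defaults mentioned above, and the helper name build_datasource_request is an assumption.

```python
import base64
import json
import urllib.request


def build_datasource_request(grafana_url, prometheus_url,
                             user="admin", password="admin"):
    """Build (but do not send) a request that adds a Prometheus data source."""
    payload = {
        "name": "Prometheus",
        "type": "prometheus",
        "url": prometheus_url,
        "access": "proxy",  # Grafana's backend queries Prometheus
        "isDefault": True,
    }
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{grafana_url}/api/datasources",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


req = build_datasource_request("http://192.168.2.22:3000",
                               "http://192.168.2.22:9090")
print(req.get_method(), req.full_url)
# To actually create the data source, send it: urllib.request.urlopen(req)
```

Change the default password after the first login; the credentials here are only suitable for a test setup.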

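6.6 Import the JSON template

Once the Prometheus server is collecting your Harbor metrics, you can use Grafana to visualize the data. The Harbor repository provides a sample Grafana dashboard as a JSON template to help you get started. Download it from:

https://github.com/goharbor/harbor/blob/main/contrib/grafana-dashborad/metrics-example.json

7. Deploy Alertmanager (optional)

Alertmanager is a standalone alerting module. It receives alerts from clients such as Prometheus, groups and deduplicates them, and routes them to the correct receiver.

7.1 Download the package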

$ wget https://github.com/prometheus/alertmanager/releases/download/v0.23.0/alertmanager-0.23.0.linux-amd64.tar.gz
$ tar zxvf alertmanager-0.23.0.linux-amd64.tar.gz
$ cp  alertmanager-0.23.0.linux-amd64/{alertmanager,amtool}   /usr/local/bin/

7.2 Edit the configuration file

$ mkdir /etc/alertmanager
$ cat /etc/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
- name: 'web.hook'
  webhook_configs:
  - url: 'http://127.0.0.1:5001/'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
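
To illustrate what the inhibit_rules entry does, here is a simplified model of the matching logic: a target alert is muted when some firing alert matches source_match, the target matches target_match, and both carry the same values for every label in equal (a label missing on both sides counts as equal). This is a sketch of the documented behaviour, not Alertmanager's actual code, and the sample alerts are hypothetical.

```python
# Rule mirrored from the inhibit_rules entry above.
RULE = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "dev", "instance"],
}


def matches(labels, matcher):
    """True when every matcher label is present with the expected value."""
    return all(labels.get(key) == value for key, value in matcher.items())


def is_inhibited(target, firing, rule=RULE):
    """True when a firing source alert suppresses the target alert."""
    if not matches(target, rule["target_match"]):
        return False
    for source in firing:
        same_labels = all(
            source.get(key, "") == target.get(key, "") for key in rule["equal"]
        )
        if matches(source, rule["source_match"]) and same_labels:
            return True
    return False


# Hypothetical alerts for illustration.
critical = {"alertname": "NodeDiskUsage", "instance": "192.168.2.22:9100",
            "severity": "critical"}
warning = {"alertname": "NodeDiskUsage", "instance": "192.168.2.22:9100",
           "severity": "warning"}
other = {"alertname": "NodeCpuUsage", "instance": "192.168.2.22:9100",
         "severity": "warning"}

print(is_inhibited(warning, [critical]))  # same alertname and instance
print(is_inhibited(other, [critical]))    # different alertname
```

Note that the rule keys on a severity label, so it only takes effect for alerts that actually carry that label.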

7.3 Create the systemd unit file

$ cat <<'EOF' >/usr/lib/systemd/system/alertmanager.service
[Unit]
Description=alertmanager
After=network.target

[Service]
ExecStart=/usr/local/bin/alertmanager --config.file=/etc/alertmanager/alertmanager.yml
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

7.4 Start the service

$ systemctl daemon-reload
$ systemctl enable --now alertmanager.service
$ systemctl status alertmanager.service
$ ss  -ntulp |  grep alertmanager

7.5 Configure alert rules

The Prometheus server configuration above specifies /etc/prometheus/rules.yml as the alert rule file.

$ cat /etc/prometheus/rules.yml
groups:
  - name: Warning
    rules:
      - alert: NodeMemoryUsage
        expr: 100 - (node_memory_MemFree_bytes + node_memory_Cached_bytes + node_memory_Buffers_bytes) / node_memory_MemTotal_bytes*100 > 80
        for: 1m
        labels:
          status: Warning
        annotations:
          summary: "{{$labels.instance}}: memory usage too high"
          description: "{{$labels.instance}}: memory usage above 80% (current value: {{ $value }})"

      - alert: NodeCpuUsage
        expr: (1-((sum(increase(node_cpu_seconds_total{mode="idle"}[1m])) by (instance)) / (sum(increase(node_cpu_seconds_total[1m])) by (instance)))) * 100 > 70
        for: 1m
        labels:
          status: Warning
        annotations:
          summary: "{{$labels.instance}}: CPU usage too high"
          description: "{{$labels.instance}}: CPU usage above 70% (current value: {{ $value }})"

      - alert: NodeDiskUsage
        expr: 100 - node_filesystem_free_bytes{fstype=~"xfs|ext4"} / node_filesystem_size_bytes{fstype=~"xfs|ext4"} * 100 > 80
        for: 1m
        labels:
          status: Warning
        annotations:
          summary: "{{$labels.instance}}: partition usage too high"
          description: "{{$labels.instance}}: partition usage above 80% (current value: {{ $value }})"

      - alert: Node-UP
        expr: up{job='node-exporter'} == 0
        for: 1m
        labels:
          status: Warning
        annotations:
          summary: "{{$labels.instance}}: service down"
          description: "{{$labels.instance}}: service has been down for more than 1 minute"

      - alert: TCP
        expr: node_netstat_Tcp_CurrEstab > 1000
        for: 1m
        labels:
          status: Warning
        annotations:
          summary: "{{$labels.instance}}: too many TCP connections"
          description: "{{$labels.instance}}: more than 1000 established connections (current value: {{$value}})"

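      - alert: IO
        # Fires when average disk I/O utilization exceeds 60%, matching the description below
        expr: avg(irate(node_disk_io_time_seconds_total[1m])) by (instance) * 100 > 60
        for: 1m
        labels:
          status: Warning
        annotations:
          summary: "{{$labels.instance}}: disk I/O utilization too high"
          description: "{{$labels.instance}}: disk I/O utilization above 60% (current value: {{$value}})"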
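
To make the rule arithmetic concrete, the sketch below reproduces the NodeMemoryUsage expression with made-up numbers and then models how the for: 1m clause delays firing when rules are evaluated every 15 seconds (the evaluation_interval configured earlier). It is a simplified model with hypothetical sample values, not Prometheus code.

```python
def memory_usage_percent(free, cached, buffers, total):
    """Same arithmetic as the NodeMemoryUsage expr (all values in bytes)."""
    return 100 - (free + cached + buffers) / total * 100


def alert_states(values, threshold=80, eval_interval=15, for_seconds=60):
    """Model 'for': an alert is pending until the condition has held
    continuously for for_seconds, then firing; any false evaluation resets it."""
    states = []
    active_since = None
    for index, value in enumerate(values):
        now = index * eval_interval
        if value > threshold:
            if active_since is None:
                active_since = now
            held = now - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None
            states.append("inactive")
    return states


# Hypothetical host: 8000 bytes total, 1000 bytes free + cached + buffers.
print(memory_usage_percent(free=500, cached=400, buffers=100, total=8000))

# One evaluation every 15 s; the dip to 70 resets the pending timer.
print(alert_states([85, 90, 70, 85, 86, 87, 88, 89]))
```

With these numbers the alert only fires on the last evaluation, one minute after the condition became true again.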

Source: https://mp.weixin.qq.com/s/MxMxhclbv16m05Y-Bt36aA