Linkerd Canary Deployments and A/B Testing

Flagger Linkerd Traffic Split

This guide shows you how to use Linkerd and Flagger to automate canary deployments and A/B testing.

Prerequisites

Flagger requires a Kubernetes cluster v1.16 or newer and Linkerd 2.10 or newer.
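If you want to confirm the prerequisites first, the Linkerd CLI ships a pre-installation check (a quick sanity step, not part of the original guide):

# verify cluster and CLI versions
kubectl version
linkerd version

# validate that the cluster can host the Linkerd control plane
linkerd check --pre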

Install Linkerd and Prometheus (part of Linkerd Viz):

linkerd install | kubectl apply -f -
linkerd viz install | kubectl apply -f -
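Before installing Flagger, it is worth waiting for the control plane and the Viz extension to become healthy; both checks below are standard Linkerd CLI commands:

linkerd check
linkerd viz check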

Install Flagger in the linkerd namespace:

kubectl apply -k github.com/fluxcd/flagger//kustomize/linkerd
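The kustomize overlay installs Flagger as a deployment named flagger in the linkerd namespace (the same deployment whose logs are tailed later in this guide); you can wait for it to become ready with:

kubectl -n linkerd rollout status deployment/flagger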

Bootstrap

Flagger takes a Kubernetes deployment and optionally a horizontal pod autoscaler (HPA), then creates a series of objects (Kubernetes deployments, ClusterIP services and SMI traffic splits). These objects expose the application inside the mesh and drive the canary analysis and promotion.

Create a test namespace and enable Linkerd proxy injection:

kubectl create ns test
kubectl annotate namespace test linkerd.io/inject=enabled
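To double-check that the annotation landed (so that pods scheduled in test get the Linkerd proxy injected), you can inspect the namespace:

kubectl get namespace test -o jsonpath='{.metadata.annotations}'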

Install the load testing service to generate traffic during the canary analysis:

kubectl apply -k https://github.com/fluxcd/flagger//kustomize/tester?ref=main

Create a deployment and a horizontal pod autoscaler:

kubectl apply -k https://github.com/fluxcd/flagger//kustomize/podinfo?ref=main

Create a canary custom resource for the podinfo deployment:

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo
  namespace: test
spec:
  # deployment reference
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  # HPA reference (optional)
  autoscalerRef:
    apiVersion: autoscaling/v2beta2
    kind: HorizontalPodAutoscaler
    name: podinfo
  # the maximum time in seconds for the canary deployment
  # to make progress before it is rolled back (default 600s)
  progressDeadlineSeconds: 60
  service:
    # ClusterIP port number
    port: 9898
    # container port number or name (optional)
    targetPort: 9898
  analysis:
    # schedule interval (default 60s)
    interval: 30s
    # max number of failed metric checks before rollback
    threshold: 5
    # max traffic percentage routed to canary
    # percentage (0-100)
    maxWeight: 50
    # canary increment step
    # percentage (0-100)
    stepWeight: 5
    # Linkerd Prometheus checks
    metrics:
    - name: request-success-rate
      # minimum req success rate (non 5xx responses)
      # percentage (0-100)
      thresholdRange:
        min: 99
      interval: 1m
    - name: request-duration
      # maximum req duration P99
      # milliseconds
      thresholdRange:
        max: 500
      interval: 30s
    # testing (optional)
    webhooks:
      - name: acceptance-test
        type: pre-rollout
        url: http://flagger-loadtester.test/
        timeout: 30s
        metadata:
          type: bash
          cmd: "curl -sd 'test' http://podinfo-canary.test:9898/token | grep token"
      - name: load-test
        type: rollout
        url: http://flagger-loadtester.test/
        metadata:
          cmd: "hey -z 2m -q 10 -c 2 http://podinfo-canary.test:9898/"

Save the above resource as podinfo-canary.yaml and then apply it:

kubectl apply -f ./podinfo-canary.yaml

When the canary analysis starts, Flagger will call the pre-rollout webhooks before routing traffic to the canary. The canary analysis will run for five minutes while validating the HTTP metrics and rollout hooks every half a minute.
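The five-minute figure follows directly from the analysis settings above: with stepWeight 5 and maxWeight 50 the canary weight advances in ten steps, one per 30s interval:

# analysis duration = (maxWeight / stepWeight) * interval
#                   = (50 / 5) * 30s = 10 * 30s = 5 minutes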

After a few seconds Flagger will create the canary objects:

# applied
deployment.apps/podinfo
horizontalpodautoscaler.autoscaling/podinfo
ingresses.extensions/podinfo
canary.flagger.app/podinfo

# generated
deployment.apps/podinfo-primary
horizontalpodautoscaler.autoscaling/podinfo-primary
service/podinfo
service/podinfo-canary
service/podinfo-primary
trafficsplits.split.smi-spec.io/podinfo

After the bootstrap, the podinfo deployment will be scaled to zero and the traffic to podinfo.test will be routed to the primary pods. During the canary analysis, the podinfo-canary.test address can be used to target the canary pods directly.
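You can verify the routing from inside the mesh, for example from the load tester pod installed earlier (its image is assumed to include curl, which the acceptance-test webhook below also relies on). Note that podinfo-canary.test only has endpoints while an analysis is running:

kubectl -n test exec deploy/flagger-loadtester -- curl -s http://podinfo.test:9898/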

Automated canary promotion

Flagger implements a control loop that gradually shifts traffic to the canary while measuring key performance indicators like HTTP request success rate, average request duration and pod health. Based on the KPI analysis, the canary is promoted or aborted, and the result of the analysis is published to Slack.

Flagger canary stages

Trigger a canary deployment by updating the container image:

kubectl -n test set image deployment/podinfo \
podinfod=stefanprodan/podinfo:3.1.1

Flagger detects that the deployment revision changed and starts a new rollout:

kubectl -n test describe canary/podinfo

Status:
  Canary Weight:         0
  Failed Checks:         0
  Phase:                 Succeeded
Events:
 New revision detected! Scaling up podinfo.test
 Waiting for podinfo.test rollout to finish: 0 of 1 updated replicas are available
 Pre-rollout check acceptance-test passed
 Advance podinfo.test canary weight 5
 Advance podinfo.test canary weight 10
 Advance podinfo.test canary weight 15
 Advance podinfo.test canary weight 20
 Advance podinfo.test canary weight 25
 Waiting for podinfo.test rollout to finish: 1 of 2 updated replicas are available
 Advance podinfo.test canary weight 30
 Advance podinfo.test canary weight 35
 Advance podinfo.test canary weight 40
 Advance podinfo.test canary weight 45
 Advance podinfo.test canary weight 50
 Copying podinfo.test template spec to podinfo-primary.test
 Waiting for podinfo-primary.test rollout to finish: 1 of 2 updated replicas are available
 Promotion completed! Scaling down podinfo.test

Note that if you apply new changes to the deployment during the canary analysis, Flagger will restart the analysis.

A canary deployment is triggered by changes in any of the following objects:

* Deployment PodSpec (container image, command, ports, env, resources, etc.)
* ConfigMaps mounted as volumes or mapped to environment variables
* Secrets mounted as volumes or mapped to environment variables

You can monitor all canaries with:

watch kubectl get canaries --all-namespaces

NAMESPACE   NAME      STATUS        WEIGHT   LASTTRANSITIONTIME
test        podinfo   Progressing   15       2019-06-30T14:05:07Z
prod        frontend  Succeeded     0        2019-06-30T16:15:07Z
prod        backend   Failed        0        2019-06-30T17:05:07Z

Automated rollback

During the canary analysis you can generate HTTP 500 errors and high latency to test if Flagger pauses and rolls back the faulted version.

Trigger another canary deployment:

kubectl -n test set image deployment/podinfo \
podinfod=stefanprodan/podinfo:3.1.2

Exec into the load tester pod with:

kubectl -n test exec -it flagger-loadtester-xx-xx sh

Generate HTTP 500 errors:

watch -n 1 curl http://podinfo-canary.test:9898/status/500

Generate latency:

watch -n 1 curl http://podinfo-canary.test:9898/delay/1

When the number of failed checks reaches the canary analysis threshold, the traffic is routed back to the primary, the canary is scaled to zero and the rollout is marked as failed.

kubectl -n test describe canary/podinfo

Status:
  Canary Weight:         0
  Failed Checks:         10
  Phase:                 Failed
Events:
 Starting canary analysis for podinfo.test
 Pre-rollout check acceptance-test passed
 Advance podinfo.test canary weight 5
 Advance podinfo.test canary weight 10
 Advance podinfo.test canary weight 15
 Halt podinfo.test advancement success rate 69.17% < 99%
 Halt podinfo.test advancement success rate 61.39% < 99%
 Halt podinfo.test advancement success rate 55.06% < 99%
 Halt podinfo.test advancement request duration 1.20s > 0.5s
 Halt podinfo.test advancement request duration 1.45s > 0.5s
 Rolling back podinfo.test failed checks threshold reached 5
 Canary failed! Scaling down podinfo.test

Custom metrics

The canary analysis can be extended with Prometheus queries.

Let's define a check for not found errors. Edit the canary analysis and add the following metric:

  analysis:
    metrics:
    - name: "404s percentage"
      threshold: 3
      query: |
        100 - sum(
            rate(
                response_total{
                    namespace="test",
                    deployment="podinfo",
                    status_code!="404",
                    direction="inbound"
                }[1m]
            )
        )
        /
        sum(
            rate(
                response_total{
                    namespace="test",
                    deployment="podinfo",
                    direction="inbound"
                }[1m]
            )
        )
        * 100

The above configuration validates the canary version by checking if the percentage of HTTP 404 req/sec is below 3% of the total traffic. If the 404s rate reaches the 3% threshold, the analysis is aborted and the canary is marked as failed.
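If you want to experiment with the query before wiring it into the canary, you can port-forward to the Linkerd Viz Prometheus instance (the service is assumed to be named prometheus in the linkerd-viz namespace, which is where Flagger reads these metrics from) and paste the query into the Prometheus UI:

kubectl -n linkerd-viz port-forward svc/prometheus 9090:9090
# then open http://localhost:9090 and run the query above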

Trigger a canary deployment by updating the container image:

kubectl -n test set image deployment/podinfo \
podinfod=stefanprodan/podinfo:3.1.3

Generate 404s:

watch -n 1 curl http://podinfo-canary:9898/status/404

Watch the Flagger logs:

kubectl -n linkerd logs deployment/flagger -f | jq .msg

Starting canary deployment for podinfo.test
Pre-rollout check acceptance-test passed
Advance podinfo.test canary weight 5
Halt podinfo.test advancement 404s percentage 6.20 > 3
Halt podinfo.test advancement 404s percentage 6.45 > 3
Halt podinfo.test advancement 404s percentage 7.22 > 3
Halt podinfo.test advancement 404s percentage 6.50 > 3
Halt podinfo.test advancement 404s percentage 6.34 > 3
Rolling back podinfo.test failed checks threshold reached 5
Canary failed! Scaling down podinfo.test

If you have Slack configured, Flagger will send a notification with the reason why the canary failed.
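Slack notifications are configured through Flagger's command-line flags; a minimal sketch of the relevant container arguments (replace the webhook URL and channel with your own):

# flagger container args (sketch)
-slack-url=https://hooks.slack.com/services/YOUR/HOOK/URL
-slack-channel=general
-slack-user=flagger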

Linkerd Ingress

There are two ingress controllers that are compatible with both Flagger and Linkerd: NGINX and Gloo.

Install NGINX:

helm upgrade -i nginx-ingress stable/nginx-ingress \
--namespace ingress-nginx

Create an ingress definition for podinfo that rewrites the incoming header to the internal service name (required by Linkerd):

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: podinfo
  namespace: test
  labels:
    app: podinfo
  annotations:
    kubernetes.io/ingress.class: "nginx"
    nginx.ingress.kubernetes.io/configuration-snippet: |
      proxy_set_header l5d-dst-override $service_name.$namespace.svc.cluster.local:9898;
      proxy_hide_header l5d-remote-ip;
      proxy_hide_header l5d-server-id;
spec:
  rules:
    - host: app.example.com
      http:
        paths:
          - backend:
              serviceName: podinfo
              servicePort: 9898

When using an ingress controller, the Linkerd traffic split does not apply to incoming traffic since NGINX runs outside of the mesh. In order to run a canary analysis for a frontend app, Flagger creates a shadow ingress and sets the NGINX-specific annotations.
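Once the canary below is reconciled with the nginx provider, you can list the ingresses in the test namespace; Flagger's generated shadow ingress should show up next to the one you created (its exact name and annotations are managed by Flagger):

kubectl -n test get ingress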

A/B Testing

Besides weighted routing, Flagger can be configured to route traffic to the canary based on HTTP match conditions. In an A/B testing scenario, you'll be using HTTP headers or cookies to target a certain segment of your users. This is particularly useful for frontend apps that require session affinity.

Flagger Linkerd Ingress

Edit the podinfo canary analysis: set the provider to nginx, add the ingress reference, remove the max/step weights, and add the match conditions and iterations:

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo
  namespace: test
spec:
  # ingress reference
  provider: nginx
  ingressRef:
    apiVersion: extensions/v1beta1
    kind: Ingress
    name: podinfo
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  autoscalerRef:
    apiVersion: autoscaling/v2beta2
    kind: HorizontalPodAutoscaler
    name: podinfo
  service:
    # container port
    port: 9898
  analysis:
    interval: 1m
    threshold: 10
    iterations: 10
    match:
      # curl -H 'X-Canary: always' http://app.example.com
      - headers:
          x-canary:
            exact: "always"
      # curl -b 'canary=always' http://app.example.com
      - headers:
          cookie:
            exact: "canary"
    # Linkerd Prometheus checks
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99
      interval: 1m
    - name: request-duration
      thresholdRange:
        max: 500
      interval: 30s
    webhooks:
      - name: acceptance-test
        type: pre-rollout
        url: http://flagger-loadtester.test/
        timeout: 30s
        metadata:
          type: bash
          cmd: "curl -sd 'test' http://podinfo-canary:9898/token | grep token"
      - name: load-test
        type: rollout
        url: http://flagger-loadtester.test/
        metadata:
          cmd: "hey -z 2m -q 10 -c 2 -H 'Cookie: canary=always' http://app.example.com"

The above configuration will run an analysis for ten minutes (ten iterations at a one-minute interval), targeting users that have a canary cookie set to always or that call the service using the X-Canary: always header.

Note that the load test now targets the external address and uses the canary cookie.
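Assuming app.example.com resolves to your NGINX ingress controller's external address, you can reach the canary yourself during the analysis with either match condition (these are the same requests shown as comments in the YAML above):

curl -H 'X-Canary: always' http://app.example.com
curl -b 'canary=always' http://app.example.com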

Trigger a canary deployment by updating the container image:

kubectl -n test set image deployment/podinfo \
podinfod=stefanprodan/podinfo:3.1.4

Flagger detects that the deployment revision changed and starts the A/B test:

kubectl -n test describe canary/podinfo

Events:
 Starting canary deployment for podinfo.test
 Pre-rollout check acceptance-test passed
 Advance podinfo.test canary iteration 1/10
 Advance podinfo.test canary iteration 2/10
 Advance podinfo.test canary iteration 3/10
 Advance podinfo.test canary iteration 4/10
 Advance podinfo.test canary iteration 5/10
 Advance podinfo.test canary iteration 6/10
 Advance podinfo.test canary iteration 7/10
 Advance podinfo.test canary iteration 8/10
 Advance podinfo.test canary iteration 9/10
 Advance podinfo.test canary iteration 10/10
 Copying podinfo.test template spec to podinfo-primary.test
 Waiting for podinfo-primary.test rollout to finish: 1 of 2 updated replicas are available
 Promotion completed! Scaling down podinfo.test