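Introduction to repmgrd

repmgrd is a management and monitoring daemon that runs on every node in the cluster. It can perform automatic failover, maintain the replication relationships, and provide monitoring information about the state of each node.

repmgrd does not depend on any other service.

Its features include:

• A wide range of configuration options

• Custom scripts that can be executed for specific scenarios

• A single command to enter maintenance mode

• In multi-data-centre setups, restricting the candidate primary nodes through location

• Several ways to check that PostgreSQL is alive (PostgreSQL ping, query execution or new connection)

• Optional retention of monitoring data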

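repmgrd failover

Introduction to the witness node

A witness node is a PostgreSQL database that is independent of the primary/standby replication cluster.

Its purpose, in a failover situation, is to help establish whether the primary itself is unavailable or whether there is a network problem between data centres.

A typical scenario

A primary/standby replication setup spans two data centres, and a witness is created in the same data centre as the primary. When the primary becomes unavailable, the standby decides as follows: if neither the primary nor the witness is visible, it assumes a network problem; if the primary is not visible but the witness is, it assumes the primary itself has failed.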
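
The decision rule above can be sketched as a small shell function. This is only an illustration: the function name witness_verdict and its "yes"/"no" arguments are hypothetical, and repmgrd performs this reasoning internally rather than through a script.

```bash
# Hypothetical helper: classify an outage from a standby's point of view.
# $1 and $2 are "yes" or "no": is the primary reachable, is the witness reachable.
witness_verdict() {
  local primary_visible="$1" witness_visible="$2"
  if [ "$primary_visible" = "yes" ]; then
    echo "primary_ok"
  elif [ "$witness_visible" = "yes" ]; then
    # The witness (same data centre as the primary) answers, the primary does not.
    echo "primary_failure"
  else
    # Neither is reachable: most likely the link between the data centres.
    echo "network_problem"
  fi
}
```

For example, witness_verdict no yes prints primary_failure, the only case in which promoting the standby is considered safe.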

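Notes:

Do not install the witness node on the same physical host as the primary or a standby.

In multi-data-centre setups, using location is recommended to prevent split-brain.

The witness node is only useful when repmgrd is enabled.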

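Adding a witness node

Install a separate PostgreSQL instance and configure it like an ordinary repmgr node.

Register the witness node

repmgr -f /etc/repmgr.conf witness register -h node1

Food for thought: does the witness node need to run repmgrd? Yes. Standbys receive the repmgr metadata directly through streaming replication, but the witness is not part of the replication stream, so it needs repmgrd to keep its copy of the metadata up to date.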

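Solving the split-brain problem

Multiple data centres bring the risk of split-brain: when nodes in different data centres cannot see each other, a standby fails over and becomes a new primary, and the cluster then has two primaries at the same time.

A witness can arbitrate in this situation, but it has the following drawbacks.

• The witness node must be kept in the same data centre as the primary node.

• When most of the cluster's nodes are not in the primary's data centre, it adds maintenance cost and does not scale well, especially in large deployments.

repmgr 4 introduced the concept of location: each node's physical location is specified through the location setting.

When a failover is triggered, repmgr first checks whether any node in the same location as the primary is visible. If none is (that is, every node in the primary's data centre is unreachable), it assumes a network problem, does not promote a new primary, and enters degraded mode.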
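
A minimal sketch of the location check, under the assumption that the locations of all currently visible nodes (including the local one) are already known. The function name primary_location_seen is borrowed from a variable in the repmgrd source shown later in this article; the shell function itself is hypothetical.

```bash
# Hypothetical model of the location check: succeeds when at least one
# visible node shares the primary's location.
# $1: the primary's location; remaining arguments: locations of visible nodes.
primary_location_seen() {
  local primary_location="$1" loc
  shift
  for loc in "$@"; do
    if [ "$loc" = "$primary_location" ]; then
      return 0
    fi
  done
  # No visible node is in the primary's location: assume a network split.
  return 1
}
```

With a primary in dc1, primary_location_seen dc1 dc2 dc1 succeeds, while primary_location_seen dc1 dc2 dc2 fails, which corresponds to entering degraded mode instead of promoting.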

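Primary visibility consensus

In more complex cluster topologies, when a standby cannot connect to the primary it asks the other standbys when they last saw the primary. If any node can still connect to the primary, the primary is taken to be alive and able to serve writes, and no new election takes place.

Configuration parameter

 primary_visibility_consensus=true

Note: this only takes effect when it is set to true on every node.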
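
The sketch below models how "recently seen" is judged, following the rule in the do_election() source quoted later in this article: a sibling counts as still seeing the primary if it last saw it fewer than monitor_interval_secs * 2 seconds ago. The function names are hypothetical, and a negative last-seen value is taken to mean "unknown".

```bash
# Hypothetical helper: does one sibling still see the primary?
# $1: seconds since the sibling last saw the primary, $2: monitor_interval_secs.
sibling_sees_primary() {
  local last_seen="$1" monitor_interval_secs="${2:-2}"
  [ "$last_seen" -ge 0 ] && [ "$last_seen" -lt $(( monitor_interval_secs * 2 )) ]
}

# Count the siblings that still see the primary.
# $1: monitor_interval_secs; remaining arguments: last-seen values of the siblings.
count_siblings_seeing_primary() {
  local monitor_interval_secs="$1" count=0 s
  shift
  for s in "$@"; do
    if sibling_sees_primary "$s" "$monitor_interval_secs"; then
      count=$(( count + 1 ))
    fi
  done
  echo "$count"
}
```

With primary_visibility_consensus enabled, a non-zero count cancels the failover.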

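Disconnecting standbys

When standby_disconnect_on_failover is set to true, at failover each standby first disconnects its local WAL receiver and then waits for all other nodes to disconnect theirs.

The purpose is to make sure that the LSN on every node stays still during the failover.

Important:

• Requires PostgreSQL 9.5 or later, and the repmgr database user must have superuser privileges

• standby_disconnect_on_failover must be set on every node to take effect

• Be aware of the delay this setting adds: each node has to confirm that its own WAL receiver has stopped and then wait for confirmation that the WAL receivers on the other nodes have stopped completely

• When enabling this parameter, enabling primary_visibility_consensus as well is recommended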

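Failover validation

During a failover, when a node has been elected as the new primary, a custom script can be configured to decide whether it may actually be promoted (users can add their own checks). It must be configured on every node.

Configuration:

failover_validation_command=/path/to/script.sh %n

Available placeholders:

•%n: node ID

•%a: node name

•%v: number of visible nodes

•%u: number of shared upstream nodes

•%t: total number of nodes

The script's exit code decides whether the node may be promoted: 0 allows the promotion; any non-zero value aborts this failover attempt, and the next attempt starts after election_rerun_interval seconds.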
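
As an illustration, a validation script could refuse promotion unless a strict majority of nodes is visible. The sketch below assumes the script is configured with the %v and %t placeholders; the function name validate_failover and the majority rule are assumptions for this example, not something repmgr prescribes.

```bash
# Sketch of a validation check, assuming a configuration such as:
#   failover_validation_command='/path/to/script.sh %v %t'
# Returns 0 (allow promotion) only when a strict majority of nodes is visible.
validate_failover() {
  local visible_nodes="$1" total_nodes="$2"
  if [ $(( visible_nodes * 2 )) -gt "$total_nodes" ]; then
    echo "visible ${visible_nodes}/${total_nodes}: promotion allowed"
    return 0
  fi
  echo "visible ${visible_nodes}/${total_nodes}: promotion refused"
  return 1
}

# In the real script the last lines would be:
#   validate_failover "$1" "$2"
#   exit $?
```

A non-zero exit makes repmgrd abandon this round and retry after election_rerun_interval seconds.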

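Cascading replication

PostgreSQL has supported cascading replication since version 9.2, and repmgr supports it as well.

When a failover occurs, repmgr keeps the cascading relationships unchanged.

When an upstream node fails, its downstream nodes automatically try to reconnect to the failed node's own upstream (its parent).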

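Primary-side monitoring of standby connections

The situations above mostly rely on standbys monitoring the primary to decide whether to fail over. In repmgr the primary can also monitor the connections of its standbys.

The primary continuously checks its standby connections and can run a custom script when certain conditions are met. For example, after a failover has produced a new primary and the other standbys have started following it, the old primary is left with zero standbys. At that point a dedicated command can be run to stop applications from writing to the old primary.

Detection process and criteria

• Every child_nodes_check_interval (default 5 seconds), repmgrd queries the pg_stat_replication view and compares it with the list of nodes registered in repmgr with this primary as their upstream node

• If a standby is found to be missing from pg_stat_replication, the detection time is recorded and a child_node_disconnect event is generated

• If the standby reappears in pg_stat_replication, the recorded detection time is cleared and a child_node_reconnect event is generated

• If a new standby is detected, it is added to the internal list and a child_node_new_connect event is generated

• If child_nodes_disconnect_command is set in repmgr.conf, repmgrd loops over all child nodes; if the number of connected standbys stays below child_nodes_connected_min_count (default 0) for longer than child_nodes_disconnect_timeout (default 30s), a child_nodes_disconnect_command event is generated

• Note: child nodes that are not connected when repmgrd starts are not considered missing, because repmgrd cannot know why they are not connected.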

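Notes

• A child node configured for archive-based replication may not hold a streaming connection, so it may not appear in pg_stat_replication

• application_name in a child node's primary_conninfo must be the same as the node name defined in that node's repmgr.conf. It must also be unique across the whole replication cluster. If a custom application_name is used, or application_name is not unique across the cluster, repmgr cannot reliably monitor child node connections.

Configuration parameters

•child_nodes_check_interval: check interval, default 5s

•child_nodes_disconnect_command: custom script

•child_nodes_disconnect_timeout: timeout, default 30s

•child_nodes_connected_min_count: minimum number of connected standbys. When fewer standbys than this are connected, a child_nodes_disconnect_command event is generated.

•child_nodes_disconnect_min_count: minimum number of disconnected nodes. When more nodes than this are disconnected, a child_nodes_disconnect_command event is generated.

Note that child_nodes_connected_min_count overrides child_nodes_disconnect_min_count. When neither parameter is set, a child_nodes_disconnect_command event is generated when no standby is connected.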
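
The trigger condition can be modelled roughly as follows. This is a simplified sketch of the behaviour as described in this article, not repmgrd's actual implementation: the function name is hypothetical, and only child_nodes_connected_min_count and the "neither threshold set" case are covered.

```bash
# Hypothetical model of when child_nodes_disconnect_command would run.
# $1: number of connected child nodes
# $2: seconds since the shortfall was first detected
# $3: child_nodes_disconnect_timeout (default 30)
# $4: child_nodes_connected_min_count (optional; empty means "not set")
should_run_disconnect_command() {
  local connected="$1" elapsed="$2" timeout="${3:-30}" min_count="${4:-}"
  if [ -n "$min_count" ]; then
    [ "$connected" -lt "$min_count" ] || return 1
  else
    # Neither threshold set: only fire when no child node is connected.
    [ "$connected" -eq 0 ] || return 1
  fi
  # The shortfall must have lasted longer than the timeout.
  [ "$elapsed" -gt "$timeout" ]
}
```

For instance, should_run_disconnect_command 1 60 30 2 succeeds (one child connected, two required, 60 seconds elapsed), whereas should_run_disconnect_command 0 10 30 does not, because the timeout has not yet passed.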

Event types

•child_node_disconnect

•child_node_reconnect

•child_node_new_connect

•child_nodes_disconnect_command

Viewing events

repmgr cluster event --event=xxxxx (event type)

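repmgrd configuration

PostgreSQL configuration: set the following in postgresql.conf (requires a server restart)

shared_preload_libraries = 'repmgr'

•monitor_interval_secs: interval between checks of the upstream node, default 2 seconds

•connection_check_type: check type, one of ping (PQping(), default), connection, query

•reconnect_attempts: number of reconnection attempts, default 6

•reconnect_interval: interval between reconnection attempts

•degraded_monitoring_timeout: default -1 (disabled). By default repmgrd stays in degraded monitoring mode indefinitely. Set this parameter to a timeout in seconds after which repmgrd terminates.

Required parameters

•failover=automatic
•promote_command='/usr/bin/repmgr standby promote -f /etc/repmgr.conf --log-to-file'
•follow_command='/usr/bin/repmgr standby follow -f /etc/repmgr.conf --log-to-file --upstream-node-id=%n'

Optional parameters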

priority
failover_validation_command
primary_visibility_consensus
always_promote
standby_disconnect_on_failover
election_rerun_interval
sibling_nodes_disconnect_timeout

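Other parameters

Besides the settings above, conninfo also affects how repmgr connects to PostgreSQL over the network, for example through the connect_timeout parameter. The operating system's tcp_syn_retries network setting has an effect as well.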

Automatic log rotation

/var/log/repmgr/repmgrd.log {
        missingok
        compress
        rotate 52
        maxsize 100M
        weekly
        create 0600 postgres postgres
        postrotate
            /usr/bin/killall -HUP repmgrd
        endscript
    }

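repmgrd operations

repmgrd maintenance mode

While in maintenance mode, no automatic failover takes place.

# Enter maintenance mode
repmgr -f /etc/repmgr.conf service pause
# Leave maintenance mode
repmgr -f /etc/repmgr.conf service unpause
# Show the current mode
repmgr -f /etc/repmgr.conf service status

Note: the maintenance-mode state is not changed by restarting repmgrd or PostgreSQL.

When upgrading the PostgreSQL version, shut repmgrd down completely and install the repmgr version that matches the new database version.

A manual switchover with repmgr standby switchover pauses and unpauses repmgrd automatically.

When WAL replay has been paused on a PostgreSQL node with pg_wal_replay_pause, a failover automatically resumes WAL replay.

Note that repmgr standby promote is refused until an administrator resumes WAL replay.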
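
For manual maintenance it can be convenient to wrap a task so that repmgrd is always unpaused afterwards. The wrapper below is a hypothetical sketch built on the pause and unpause commands shown above; the REPMGR and REPMGR_CONF variables are assumptions introduced here so that the binary and configuration path can be overridden (for example REPMGR=echo for a dry run).

```bash
# Hypothetical wrapper: pause repmgrd, run a maintenance command, then unpause.
# Usage: with_repmgrd_paused <command> [args...]
with_repmgrd_paused() {
  local repmgr="${REPMGR:-repmgr}" conf="${REPMGR_CONF:-/etc/repmgr.conf}" status
  "$repmgr" -f "$conf" service pause || return 1
  # Run the maintenance task and remember its exit status.
  "$@"
  status=$?
  # Unpause even if the task failed, so the cluster is not left unprotected.
  "$repmgr" -f "$conf" service unpause || return 1
  return "$status"
}
```

The wrapper returns the exit status of the maintenance command, so a failed task is still reported to the caller.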

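Degraded mode

In certain situations repmgrd enters "degraded monitoring" mode: it remains active but waits for the situation to be resolved.

These situations, which arise when a failure occurs, include:

• No other node in the primary's data centre is visible (based on the location parameter)

• There is no candidate node that can be promoted to primary

• The candidate node could not be promoted to primary

• The node could not follow the new primary

• Automatic failover is not configured on the node

• The primary has failed but no other node was promoted to primary

By default, degraded mode continues indefinitely. If degraded_monitoring_timeout is set, repmgrd stops once that time has elapsed.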

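Monitoring data storage

When monitoring_history=true, monitoring records are written continuously to the repmgr.monitoring_history table. Use repmgr cluster cleanup -k/--keep-history (the number of days of records to keep) to purge old data periodically.

This data is also replicated to the standbys; running ALTER TABLE repmgr.monitoring_history SET UNLOGGED; prevents it from being replicated.

Overview of the election process

Source file: repmgrd-physical.c

The following is the rough flow once the primary has been detected as failed and the election phase begins.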

/*
 * Failover decision for nodes attached to the current primary.
 *
 * NB: this function sets "sibling_nodes"; caller (do_primary_failover)
 * expects to be able to read this list
 */
/* Election process. PostgreSQL supports cascading replication; the new primary can only come from the first-level child nodes. */
/* sibling_nodes: the sibling nodes */
static ElectionResult
do_election(NodeInfoList *sibling_nodes, int *new_primary_id)
{
  int      electoral_term = -1;
  NodeInfoListCell *cell = NULL;
  t_node_info *candidate_node = NULL;
  /* election statistics */
  election_stats stats;
  ReplInfo  local_replication_info;
  /* To collate details of nodes with primary visible for logging purposes */
  PQExpBufferData nodes_with_primary_visible;
  /*
   * Check if at least one server in the primary's location is visible; if
   * not we'll assume a network split between this node and the primary
   * location, and not promote any standby.
   *
   * NOTE: this function is only ever called by standbys attached to the
   * current (unreachable) primary, so "upstream_node_info" will always
   * contain the primary node record.
   */
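  /* Whether a visible node exists in the same location as the failed node. If none is visible, a network problem is assumed. Each node can set the location parameter to describe its physical location, e.g. in multi-data-centre setups. */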
  bool    primary_location_seen = false;
  int      nodes_with_primary_still_visible = 0;
  if (config_file_options.failover_delay > 0)
  {
    log_debug("sleeping %i seconds (\"failover_delay\") before initiating failover",
          config_file_options.failover_delay);
    sleep(config_file_options.failover_delay);
  }
  /* we're visible */ 
  /* number of visible nodes */
  stats.visible_nodes = 1;
  /* nodes sharing this node's upstream: shared_upstream_nodes = sibling_nodes + 1 (this node) */
  stats.shared_upstream_nodes = 0;
  /* total number of nodes in the cluster */
  stats.all_nodes = 0;
  /* SELECT term FROM repmgr.voting_term; starts at 1 and is incremented by "repmgr standby promote" */
  electoral_term = get_current_term(local_conn);
  if (electoral_term == -1)
  {
    log_error(_("unable to determine electoral term"));
    return ELECTION_NOT_CANDIDATE;
  }
  log_debug("do_election(): electoral term is %i", electoral_term);
  /* if "failover" is set to "manual" in repmgr.conf, this node takes no part in the election */
  if (config_file_options.failover == FAILOVER_MANUAL)
  {
    log_notice(_("this node is not configured for automatic failover so will not be considered as promotion candidate, and will not follow the new primary"));
    log_detail(_("\"failover\" is set to \"manual\" in repmgr.conf"));
    log_hint(_("manually execute \"repmgr standby follow\" to have this node follow the new primary"));
    return ELECTION_NOT_CANDIDATE;
  }
  /* priority 0 means the node does not stand for election; a witness has priority 0, so it drops out here */
  /* node priority is set to zero - don't become a candidate, and lose by default */
  if (local_node_info.priority <= 0)
  {
    log_notice(_("this node's priority is %i so will not be considered as an automatic promotion candidate"),
           local_node_info.priority);
    return ELECTION_LOST;
  }
  /*
   * The function below queries repmgr.nodes for this node's siblings: nodes
   * with the same upstream node as this one, excluding itself. Note that a
   * witness also counts as a sibling, because its upstream_node_id is the same.
   * SELECT REPMGR_NODES_COLUMNS FROM repmgr.nodes n WHERE n.upstream_node_id = %i AND n.node_id != %i AND n.active IS TRUE ORDER BY n.node_id
   */
  /* get all active nodes attached to upstream, excluding self */
  get_active_sibling_node_records(local_conn,
                  local_node_info.node_id,
                  upstream_node_info.node_id,
                  sibling_nodes);
  log_info(_("%i active sibling nodes registered"), sibling_nodes->node_count);
  stats.shared_upstream_nodes = sibling_nodes->node_count + 1;
  /* total number of nodes in the cluster: SELECT count(*) FROM repmgr.nodes n */
  get_all_nodes_count(local_conn, &stats.all_nodes);
  log_info(_("%i total nodes registered"), stats.all_nodes);
  /* log whether this node's location matches that of the upstream (primary) node */
  if (strncmp(upstream_node_info.location, local_node_info.location, MAXLEN) != 0)
  {
    log_info(_("primary node \"%s\" (ID: %i) has location \"%s\", this node's location is \"%s\""),
         upstream_node_info.node_name,
         upstream_node_info.node_id,
         upstream_node_info.location,
         local_node_info.location);
  }
  else 
  {
    log_info(_("primary node  \"%s\" (ID: %i) and this node have the same location (\"%s\")"),
         upstream_node_info.node_name,
         upstream_node_info.node_id,
         local_node_info.location);
  }
  local_node_info.last_wal_receive_lsn = InvalidXLogRecPtr;
  /* No siblings and no witness: an only child. If it shares the primary's location it inherits the throne without a fight. */
  /* fast path if no other standbys (or witness) exists - normally win by default */
  if (sibling_nodes->node_count == 0)
  {
    /* compare the locations of the upstream (old primary) node and this node */
    if (strncmp(upstream_node_info.location, local_node_info.location, MAXLEN) == 0)
    {
      /* if a validation script is configured, its exit status decides (0 allows promotion) */
      if (config_file_options.failover_validation_command[0] != '\0')
      { 
        return execute_failover_validation_command(&local_node_info, &stats);
      }
      log_info(_("no other sibling nodes - we win by default"));
      /* this node wins: a new king is born */
      return ELECTION_WON;
    }
    else
    {
      /*
       * If primary and standby have different locations set, the assumption
       * is that no action should be taken as we can't tell whether there's
       * been a network interruption or not.
       *
       * Normally a situation with primary and standby in different physical
       * locations would be handled by leaving the location as "default" and
       * setting up a witness server in the primary's location.
       */
      log_debug("no other nodes, but primary and standby locations differ");
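      /* Only one standby, but its location differs from the primary's, so it cannot be promoted and monitoring is degraded. */
      /* Equivalent to no node in the primary's location being visible: treated as a network partition. */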
      monitoring_state = MS_DEGRADED;
      INSTR_TIME_SET_CURRENT(degraded_monitoring_start);
      return ELECTION_NOT_CANDIDATE;
    }
  }
  /* when there are sibling nodes */
  else 
  {
    /* standby nodes found - check if we're in the primary location before checking theirs */
    if (strncmp(upstream_node_info.location, local_node_info.location, MAXLEN) == 0)
    {
      /* a direct standby in the primary's location exists (this node) */
      primary_location_seen = true;
    }
  }
  /* get our lsn */
  if (get_replication_info(local_conn, STANDBY, &local_replication_info) == false)
  {
    log_error(_("unable to retrieve replication information for local node"));
    return ELECTION_LOST;
  }
  /* if WAL replay is paused, try to resume it; if resuming fails, this node loses the election */
  /* check if WAL replay on local node is paused */
  if (local_replication_info.wal_replay_paused == true)
  {
    log_debug("WAL replay is paused");
    if (local_replication_info.last_wal_receive_lsn > local_replication_info.last_wal_replay_lsn)
    {
      log_warning(_("WAL replay on this node is paused and WAL is pending replay"));
      log_detail(_("replay paused at %X/%X; last WAL received is %X/%X"),
             format_lsn(local_replication_info.last_wal_replay_lsn),
             format_lsn(local_replication_info.last_wal_receive_lsn));
    }
    /* try to resume WAL replay */
    /* attempt to resume WAL replay - unlikely this will fail, but just in case */
    if (resume_wal_replay(local_conn) == false)
    {
      log_error(_("unable to resume WAL replay"));
      log_detail(_("this node cannot be reliably promoted"));
      /* resume failed: this node loses the election */
      return ELECTION_LOST;
    }
    log_notice(_("WAL replay forcibly resumed"));
  }
  local_node_info.last_wal_receive_lsn = local_replication_info.last_wal_receive_lsn;
  log_info(_("local node's last receive lsn: %X/%X"), format_lsn(local_node_info.last_wal_receive_lsn));
  /* pointer to "winning" node, initially self */
  candidate_node = &local_node_info;
  initPQExpBuffer(&nodes_with_primary_visible);
  /* the candidates enter the arena: iterate over the siblings and compare them */
  for (cell = sibling_nodes->head; cell; cell = cell->next)
  {
    ReplInfo  sibling_replication_info;
    log_info(_("checking state of sibling node \"%s\" (ID: %i)"),
         cell->node_info->node_name,
         cell->node_info->node_id);
    /* assume the worst case */
    cell->node_info->node_status = NODE_STATUS_UNKNOWN;
    cell->node_info->conn = establish_db_connection(cell->node_info->conninfo, false);
    if (PQstatus(cell->node_info->conn) != CONNECTION_OK)
    {
      close_connection(&cell->node_info->conn);
      continue;
    }
    cell->node_info->node_status = NODE_STATUS_UP;
    stats.visible_nodes++;
    /*
     * see if the node is in the primary's location (but skip the check if
     * we've seen a node there already)
     */
    if (primary_location_seen == false)
    {
      if (strncmp(cell->node_info->location, upstream_node_info.location, MAXLEN) == 0)
      {
        log_debug("node %i in primary location \"%s\"",
              cell->node_info->node_id,
              cell->node_info->location);
        primary_location_seen = true;
      }
    }
    /*
     * check if repmgrd running - skip if not
     *
     * TODO: include pid query in replication info query?
     *
     * NOTE: from Pg12 we could execute "pg_promote()" from a running repmgrd;
     * here we'll need to find a way of ensuring only one repmgrd does this
     */
    if (repmgrd_get_pid(cell->node_info->conn) == UNKNOWN_PID)
    {
      log_warning(_("repmgrd not running on node \"%s\" (ID: %i), skipping"),
            cell->node_info->node_name,
            cell->node_info->node_id);
      continue;
    }
    if (get_replication_info(cell->node_info->conn, cell->node_info->type, &sibling_replication_info) == false)
    {
      log_warning(_("unable to retrieve replication information for node \"%s\" (ID: %i), skipping"),
            cell->node_info->node_name,
            cell->node_info->node_id);
      continue;
    }
    /*
     * Check if node is not in recovery - it may have been promoted
     * outside of the failover mechanism, in which case we may be able
     * to follow it.
     */
    /* a sibling has been promoted to primary manually */
    if (sibling_replication_info.in_recovery == false && cell->node_info->type != WITNESS)
    {
      bool can_follow;
      log_warning(_("node \"%s\" (ID: %i) is not in recovery"),
            cell->node_info->node_name,
            cell->node_info->node_id);
      /*
       * Node is not in recovery, but still reporting an upstream
       * node ID; possible it was promoted manually (e.g. with "pg_ctl promote"),
       * or (less likely) the node's repmgrd has just switched to primary
       * monitoring node but has not yet unset the upstream node ID in
       * shared memory. Either way, log this.
       */
      if (sibling_replication_info.upstream_node_id != UNKNOWN_NODE_ID)
      {
        log_warning(_("node \"%s\" (ID: %i) still reports its upstream is node %i, last seen %i second(s) ago"),
              cell->node_info->node_name,
              cell->node_info->node_id,
              sibling_replication_info.upstream_node_id,
              sibling_replication_info.upstream_last_seen);
      }
      /* check whether this node can follow the promoted sibling */
      can_follow = check_node_can_follow(local_conn,
                         local_node_info.last_wal_receive_lsn,
                         cell->node_info->conn,
                         cell->node_info);
      /* A node was promoted manually before the election finished and this node can follow it: the election is cancelled. */
      if (can_follow == true)
      {
        *new_primary_id = cell->node_info->node_id;
        termPQExpBuffer(&nodes_with_primary_visible);
        return ELECTION_CANCELLED;
      }
      /* this node cannot follow that primary, which is a tricky situation */
      /*
       * Tricky situation here - we'll assume the node is a rogue primary
       */
      log_warning(_("not possible to attach to node \"%s\" (ID: %i), ignoring"),
            cell->node_info->node_name,
            cell->node_info->node_id);
      continue;
    }
    else
    {
      log_info(_("node \"%s\" (ID: %i) reports its upstream is node %i, last seen %i second(s) ago"),
           cell->node_info->node_name,
           cell->node_info->node_id,
           sibling_replication_info.upstream_node_id,
           sibling_replication_info.upstream_last_seen);
    }
    /* check if WAL replay on node is paused */
    if (sibling_replication_info.wal_replay_paused == true)
    {
      /*
       * Theoretically the repmgrd on the node should have resumed WAL play
       * at this point.
       */
      if (sibling_replication_info.last_wal_receive_lsn > sibling_replication_info.last_wal_replay_lsn)
      {
        log_warning(_("WAL replay on node \"%s\" (ID: %i) is paused and WAL is pending replay"),
              cell->node_info->node_name,
              cell->node_info->node_id);
      }
    }
    /*
     * Check if node has seen primary "recently" - if so, we may have "partial primary visibility".
     * For now we'll assume the primary is visible if it's been seen less than
     * monitor_interval_secs * 2 seconds ago. We may need to adjust this, and/or make the value
     * configurable.
     */
    if (sibling_replication_info.upstream_last_seen >= 0 && sibling_replication_info.upstream_last_seen < (config_file_options.monitor_interval_secs * 2))
    {
      if (sibling_replication_info.upstream_node_id != upstream_node_info.node_id)
      {
        log_warning(_("assumed sibling node \"%s\" (ID: %i) monitoring different upstream node %i"),
              cell->node_info->node_name,
              cell->node_info->node_id,
              sibling_replication_info.upstream_node_id);
      }
      else
      {
        nodes_with_primary_still_visible++;
        log_notice(_("%s node \"%s\" (ID: %i) last saw primary node %i second(s) ago, considering primary still visible"),
               get_node_type_string(cell->node_info->type),
               cell->node_info->node_name,
               cell->node_info->node_id,
               sibling_replication_info.upstream_last_seen);
        appendPQExpBuffer(&nodes_with_primary_visible,
                  " - node \"%s\" (ID: %i): %i second(s) ago\n",
                  cell->node_info->node_name,
                  cell->node_info->node_id,
                  sibling_replication_info.upstream_last_seen);
      }
    }
    else
    {
      log_info(_("%s node \"%s\" (ID: %i) last saw primary node %i second(s) ago"),
           get_node_type_string(cell->node_info->type),
           cell->node_info->node_name,
           cell->node_info->node_id,
           sibling_replication_info.upstream_last_seen);
    }
    /* a witness is not a promotion candidate */
    /* don't interrogate a witness server */
    if (cell->node_info->type == WITNESS)
    {
      log_debug("node %i is witness, not querying state", cell->node_info->node_id);
      continue;
    }
    /* nodes with priority 0 are not promotion candidates */
    /* don't check 0-priority nodes */
    if (cell->node_info->priority <= 0)
    {
      log_info(_("node \"%s\" (ID: %i) has priority of %i, skipping"),
             cell->node_info->node_name,
             cell->node_info->node_id,
             cell->node_info->priority);
      continue;
    }
    /* get node's last receive LSN - if "higher" than current winner, current node is candidate */
    cell->node_info->last_wal_receive_lsn = sibling_replication_info.last_wal_receive_lsn;
    log_info(_("last receive LSN for sibling node \"%s\" (ID: %i) is: %X/%X"),
         cell->node_info->node_name,
         cell->node_info->node_id,
         format_lsn(cell->node_info->last_wal_receive_lsn));
    /* compare LSN */
    if (cell->node_info->last_wal_receive_lsn > candidate_node->last_wal_receive_lsn)
    {
      /* other node is ahead */
      log_info(_("node \"%s\" (ID: %i) is ahead of current candidate \"%s\" (ID: %i)"),
           cell->node_info->node_name,
           cell->node_info->node_id,
           candidate_node->node_name,
           candidate_node->node_id);
      candidate_node = cell->node_info;
    }
    /* LSN is same - tiebreak on priority, then node_id */
    else if (cell->node_info->last_wal_receive_lsn == candidate_node->last_wal_receive_lsn)
    {
      log_info(_("node \"%s\" (ID: %i) has same LSN as current candidate \"%s\" (ID: %i)"),
           cell->node_info->node_name,
           cell->node_info->node_id,
           candidate_node->node_name,
           candidate_node->node_id);
      /* compare priority */
      if (cell->node_info->priority > candidate_node->priority)
      {
        log_info(_("node \"%s\" (ID: %i) has higher priority (%i) than current candidate \"%s\" (ID: %i) (%i)"),
             cell->node_info->node_name,
             cell->node_info->node_id,
             cell->node_info->priority,
             candidate_node->node_name,
             candidate_node->node_id,
             candidate_node->priority);
        candidate_node = cell->node_info;
      }
      else if (cell->node_info->priority == candidate_node->priority)
      {
        if (cell->node_info->node_id < candidate_node->node_id)
        {
          log_info(_("node \"%s\" (ID: %i) has same priority but lower node_id than current candidate \"%s\" (ID: %i)"),
               cell->node_info->node_name,
               cell->node_info->node_id,
               candidate_node->node_name,
               candidate_node->node_id);
          candidate_node = cell->node_info;
        }
      }
      else
      {
        log_info(_("node \"%s\" (ID: %i) has lower priority (%i) than current candidate \"%s\" (ID: %i) (%i)"),
             cell->node_info->node_name,
             cell->node_info->node_id,
             cell->node_info->priority,
             candidate_node->node_name,
             candidate_node->node_id,
             candidate_node->priority);
      }
    }
  }
  /* network partition: enter degraded monitoring */
  if (primary_location_seen == false)
  {
    log_notice(_("no nodes from the primary location \"%s\" visible - assuming network split"),
           upstream_node_info.location);
    log_detail(_("node will enter degraded monitoring state waiting for reconnect"));
    monitoring_state = MS_DEGRADED;
    INSTR_TIME_SET_CURRENT(degraded_monitoring_start);
    reset_node_voting_status();
    termPQExpBuffer(&nodes_with_primary_visible);
    return ELECTION_CANCELLED;
  }
  /* the primary is still visible to some nodes */
  if (nodes_with_primary_still_visible > 0)
  {
    log_info(_("%i nodes can see the primary"),
           nodes_with_primary_still_visible);
    log_detail(_("following nodes can see the primary:\n%s"),
           nodes_with_primary_visible.data);
    /* primary visibility consensus */
    if (config_file_options.primary_visibility_consensus == true)
    {
      log_notice(_("cancelling failover as some nodes can still see the primary"));
      monitoring_state = MS_DEGRADED;
      INSTR_TIME_SET_CURRENT(degraded_monitoring_start);
      reset_node_voting_status();
      termPQExpBuffer(&nodes_with_primary_visible);
      return ELECTION_CANCELLED;
    }
  }
  termPQExpBuffer(&nodes_with_primary_visible);
  log_info(_("visible nodes: %i; total nodes: %i; no nodes have seen the primary within the last %i seconds"),
       stats.visible_nodes,
       stats.shared_upstream_nodes,
       (config_file_options.monitor_interval_secs * 2));
  /* fewer than a majority of the sibling nodes are visible: cancel the election */
  if (stats.visible_nodes <= (stats.shared_upstream_nodes / 2.0))
  {
    log_notice(_("unable to reach a qualified majority of nodes"));
    log_detail(_("node will enter degraded monitoring state waiting for reconnect"));
    monitoring_state = MS_DEGRADED;
    INSTR_TIME_SET_CURRENT(degraded_monitoring_start);
    reset_node_voting_status();
    return ELECTION_CANCELLED;
  }
  log_notice(_("promotion candidate is \"%s\" (ID: %i)"),
         candidate_node->node_name,
         candidate_node->node_id);
  /* the chosen candidate is this node */
  if (candidate_node->node_id == local_node_info.node_id)
  {
    /*
     * If "failover_validation_command" is set, execute that command
     * and decide the result based on the command's output
     */
    /* run failover_validation_command if it is configured */
    if (config_file_options.failover_validation_command[0] != '\0')
    {
      return execute_failover_validation_command(candidate_node, &stats);
    }
    /* this node wins and becomes the primary */
    return ELECTION_WON;
  }
  return ELECTION_LOST;
}

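Summary:

Election result states

ELECTION_NOT_CANDIDATE
ELECTION_WON
ELECTION_LOST
ELECTION_CANCELLED
ELECTION_RERUN

Whether the cluster is in a network partition is decided from location or the witness node; the node then either proceeds with the election or enters degraded mode.

When there is only one first-level standby

If the standby is in the same location as the old primary, it wins the election directly.
If the standby's location differs from the old primary's and there is no witness node, monitoring is degraded.

Nodes that do not take part in the election

Priority=0
failover set to manual in repmgr.conf
WAL replay is paused and the attempt to resume it fails

repmgr picks the promotion candidate by comparing, in order: LSN -> Priority -> Node_ID.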
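
The ordering can be reproduced with a short shell sketch. It assumes each candidate is given as a line of "node_id priority lsn" and mirrors the comparison in do_election(): highest LSN first, then highest priority, then lowest node ID, skipping nodes with priority 0. The function names lsn_to_int and pick_candidate are hypothetical.

```bash
# Convert an LSN such as "0/3000060" to a decimal integer (hi * 2^32 + lo).
lsn_to_int() {
  local hi lo
  hi=$(printf '%d' "0x${1%%/*}")
  lo=$(printf '%d' "0x${1##*/}")
  echo $(( hi * 4294967296 + lo ))
}

# Read "node_id priority lsn" lines on stdin and print the winning node_id.
pick_candidate() {
  local node_id priority lsn
  while read -r node_id priority lsn; do
    [ -n "$node_id" ] || continue
    # Nodes with priority 0 (or lower) are never candidates.
    [ "$priority" -gt 0 ] || continue
    echo "$node_id $priority $(lsn_to_int "$lsn")"
  done | sort -k3,3nr -k2,2nr -k1,1n | head -n 1 | cut -d' ' -f1
}
```

Usage: printf '%s\n' '2 100 0/3000060' '4 50 0/3000100' | pick_candidate selects node 4, because a higher LSN outranks a higher priority.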

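Flowchart

Enhancements

How can repmgrd really be put into production, so that HA requirements are still met in certain edge cases?

Virtual IP

• Bind the VIP when registering the primary node

• On failover and switchover, rebind the VIP through failover_validation_command

• When the whole cluster restarts, identify the primary node and bind the VIP to it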

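Handling recovery after a primary failure

• The primary has a network problem, for example caused by a firewall, while PostgreSQL is still running and writable.

Set the child_nodes_connected_min_count parameter.
In the callback defined by child_nodes_disconnect_command, demote the primary to read-only or shut it down.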
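
One possible shape for such a callback is sketched below. It is an assumption-laden example: the function name, the PSQL and REPMGR_DB variables, and the choice of default_transaction_read_only are all introduced here for illustration. Note that default_transaction_read_only is only a soft fence, since a session can still override it; shutting PostgreSQL down is the stricter option.

```bash
# Hypothetical callback for child_nodes_disconnect_command: switch the
# isolated primary to read-only by default and reload the configuration.
# PSQL can be overridden (e.g. PSQL=echo) to preview the command.
fence_primary_read_only() {
  local psql="${PSQL:-psql}"
  "$psql" -X -v ON_ERROR_STOP=1 -d "${REPMGR_DB:-repmgr}" \
    -c "ALTER SYSTEM SET default_transaction_read_only = on" \
    -c "SELECT pg_reload_conf()"
}
```

The script referenced by child_nodes_disconnect_command would call this function and exit with its status.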

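On recovery, once the network connection is restored:

Find the new primary that the cluster failed over to; one way is to list several hosts in the libpq connection string.

Align the timeline with pg_rewind.

Rejoin the new cluster.

• Power loss: nothing can be done while the node is down.

When power returns and the service restarts, check the number of standbys; if it is 0, do not serve traffic yet. After waiting a while, stop the service and rejoin the cluster.

Standby failure and recovery

• A standby failure does not trigger a failover. Note, however, that after the standby recovers, its WAL will lag behind the primary.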
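
Managing external access

Applications normally do not connect to the database directly; they go through an intermediate service such as DNS or HAProxy for load balancing and decoupling.
When the database fails over, the application should ideally need no change and notice nothing.
To achieve this, add a service that is independent of the existing ones.

Main functions

• Serve a layer-7 HTTP request, including access control
• Response body: how long the service has been running normally. The upstream load balancer can use this value to weight traffic; for example, a node that has just joined the cluster and has not warmed its cache yet should receive traffic gradually.
• Status code: 200, or anything else
• Primary endpoint: ip:port/master ?xxx
  If the status code is 200, this node is the primary and is healthy.
  Any other status code means it cannot serve writes for now.
  Main checks: PostgreSQL connectivity, number of downstream nodes
• Standby endpoint: ip:port/replocation ?xxx
  Main checks: PostgreSQL connectivity, WAL lag behind the primary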
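
The decision logic behind the two endpoints could look like the sketch below. Everything here is hypothetical: the function names, the use of 503 as the "other" status code, and the thresholds (at least one downstream node, a caller-supplied maximum lag) are assumptions, and gathering the inputs (for example with pg_is_in_recovery() and pg_stat_replication) is left out.

```bash
# Hypothetical check for the primary endpoint.
# $1: "yes" if PostgreSQL accepts connections
# $2: "t" or "f" as returned by pg_is_in_recovery()
# $3: number of connected downstream nodes
master_status_code() {
  local connectable="$1" in_recovery="$2" child_count="$3"
  if [ "$connectable" = "yes" ] && [ "$in_recovery" = "f" ] && [ "$child_count" -gt 0 ]; then
    echo 200
  else
    echo 503
  fi
}

# Hypothetical check for the standby endpoint.
# $1: "yes" if PostgreSQL accepts connections
# $2: WAL lag behind the primary in bytes, $3: maximum acceptable lag in bytes
replica_status_code() {
  local connectable="$1" lag_bytes="$2" max_lag_bytes="$3"
  if [ "$connectable" = "yes" ] && [ "$lag_bytes" -le "$max_lag_bytes" ]; then
    echo 200
  else
    echo 503
  fi
}
```

A primary that has lost all of its standbys therefore reports 503, which keeps the load balancer from routing writes to a possibly isolated node.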

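Problems this solves
• A VIP cannot be managed across network segments
• Health is checked while PostgreSQL is serving traffic; if the requirements are not met, the node stops serving applications and resumes once it has recovered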

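Estimating failover time
switchover & failover: an estimate of the total time from the failure to the completed switch
Main factors:
monitor_interval_secs: interval between checks of the upstream node, default 2 seconds
connection_check_type: the check type, and the timeout of each check
reconnect_attempts: number of reconnection attempts, default 6
reconnect_interval: interval between reconnection attempts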
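
As a rough rule of thumb, the time before an election even starts is about one monitoring interval plus all reconnection attempts. The helper below is a back-of-the-envelope sketch under that assumption; it ignores per-check connection timeouts, the election itself and the promotion. The default of 10 seconds for reconnect_interval is assumed here and should be checked against your repmgr.conf.

```bash
# Rough estimate, in seconds, of the delay between a primary failure and
# the start of the election.
# $1: monitor_interval_secs, $2: reconnect_attempts, $3: reconnect_interval
estimate_detection_seconds() {
  local monitor_interval_secs="${1:-2}" reconnect_attempts="${2:-6}" reconnect_interval="${3:-10}"
  echo $(( monitor_interval_secs + reconnect_attempts * reconnect_interval ))
}
```

Lowering reconnect_attempts or reconnect_interval shortens detection but makes a failover more likely during a brief network glitch.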

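Clusters with only two nodes
Main problem: with two nodes, when the failure is in the network rather than in PostgreSQL itself, the standby cannot tell whether the network or the primary has failed. This is the problem a witness node is meant to solve.
Approach: simulate a witness node. When the standby cannot reach the primary, it pings another host, the gateway, or its own IP. If the ping fails, a network problem is assumed.
Implementation: configure failover_validation_command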

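failover_validation_command.sh

#!/bin/bash
# Host to ping; {{ inventory_hostname }} is a template placeholder that is
# filled in when the script is deployed.
ip="{{ inventory_hostname }}"
if ping -c 5 "$ip"; then
  echo "ping $ip success!"
  # exit status 0: promotion may proceed
  exit 0
else
  echo "ping $ip failed!"
  # non-zero exit status: a network problem is assumed, promotion is refused
  exit 1
fi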

This article was AMP-transcoded by Readfog; copyright belongs to the original author.
Source: https://mp.weixin.qq.com/s/NYFqiNitrfs3o_hTUf5t6A
