Patroni Cluster Switchover and Failover
Patroni Switchover
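Yesterday we learned how to enter maintenance mode. Today we look at manually moving the leader role, which is what people usually mean by switchover and failover.
First, let's look at our cluster: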
[postgres@133e0e204e206 ~]$ patronictl -c /etc/patroni.yml list
+ Cluster: patnori-test (6962171552537974697) --+----+-----------+
| Member    | Host          | Role    | State   | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| postgres1 | 133.0.204.206 | Replica | running | 16 |         0 |
| postgres2 | 133.0.204.207 | Replica | running | 16 |         0 |
| postgres3 | 133.0.204.208 | Leader  | running | 16 |           |
+-----------+---------------+---------+---------+----+-----------+
Run switchover directly with patronictl:
[postgres@133e0e204e206 ~]$ patronictl -c /etc/patroni.yml switchover
Master [postgres3]:
Candidate ['postgres1', 'postgres2'] []: postgres1
When should the switchover take place (e.g. 2021-06-08T16:32 ) [now]:
Current cluster topology
+ Cluster: patnori-test (6962171552537974697) --+----+-----------+
| Member    | Host          | Role    | State   | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| postgres1 | 133.0.204.206 | Replica | running | 16 |         0 |
| postgres2 | 133.0.204.207 | Replica | running | 16 |         0 |
| postgres3 | 133.0.204.208 | Leader  | running | 16 |           |
+-----------+---------------+---------+---------+----+-----------+
Are you sure you want to switchover cluster patnori-test, demoting current master postgres3? [y/N]: y
2021-06-08 15:32:41.86549 Successfully switched over to "postgres1"
+ Cluster: patnori-test (6962171552537974697) --+----+-----------+
| Member    | Host          | Role    | State   | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| postgres1 | 133.0.204.206 | Leader  | running | 16 |           |
| postgres2 | 133.0.204.207 | Replica | running | 16 |         0 |
| postgres3 | 133.0.204.208 | Replica | stopped |    |   unknown |
+-----------+---------------+---------+---------+----+-----------+
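There are several prompts here. First you confirm the master, which is currently postgres3. Then you choose the candidate, that is, the leader after the switchover; we choose postgres1. Pressing Enter at the time prompt accepts the default, now. Finally it asks you to confirm once more, and after you enter y the switchover begins. Listing the cluster again afterwards: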
[postgres@133e0e204e206 ~]$ patronictl -c /etc/patroni.yml list
+ Cluster: patnori-test (6962171552537974697) --+----+-----------+
| Member    | Host          | Role    | State   | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| postgres1 | 133.0.204.206 | Leader  | running | 17 |           |
| postgres2 | 133.0.204.207 | Replica | running | 17 |         0 |
| postgres3 | 133.0.204.208 | Replica | running | 17 |         0 |
+-----------+---------------+---------+---------+----+-----------+
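Note that once the switchover has completed, the TL (timeline) has gone from 16 to 17. The Patroni log on the old leader, postgres3, records what happened: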
Jun 8 15:32:40 133e0e204e208 patroni: 2021-06-08 15:32:40,848 INFO: received switchover request with leader=postgres3 candidate=postgres1 scheduled_at=None
Jun 8 15:32:40 133e0e204e208 patroni: 2021-06-08 15:32:40,855 INFO: Got response from postgres1 http://133.0.204.206:8008/patroni: {"state": "running", "postmaster_start_time": "2021-06-07 16:44:00.382 CST", "role": "replica", "server_version": 130002, "cluster_unlocked": false, "xlog": {"received_location": 3909288144, "replayed_location": 3909288144, "replayed_timestamp": null, "paused": false}, "timeline": 16, "database_system_identifier": "6962171552537974697", "patroni": {"version": "2.0.2", "scope": "patnori-test"}}
Jun 8 15:32:40 133e0e204e208 patroni: 2021-06-08 15:32:40,861 INFO: Lock owner: postgres3; I am postgres3
Jun 8 15:32:40 133e0e204e208 patroni: 2021-06-08 15:32:40,867 INFO: Got response from postgres1 http://133.0.204.206:8008/patroni: {"state": "running", "postmaster_start_time": "2021-06-07 16:44:00.382 CST", "role": "replica", "server_version": 130002, "cluster_unlocked": false, "xlog": {"received_location": 3909288144, "replayed_location": 3909288144, "replayed_timestamp": null, "paused": false}, "timeline": 16, "database_system_identifier": "6962171552537974697", "patroni": {"version": "2.0.2", "scope": "patnori-test"}}
Jun 8 15:32:40 133e0e204e208 patroni: 2021-06-08 15:32:40,871 INFO: manual failover: demoting myself
Jun 8 15:32:41 133e0e204e208 systemd-logind: Removed session 2373.
Jun 8 15:32:41 133e0e204e208 patroni: 2021-06-08 15:32:41,201 INFO: Leader key released
Jun 8 15:32:43 133e0e204e208 patroni: 2021-06-08 15:32:43,208 INFO: Local timeline=16 lsn=0/E9030180
Jun 8 15:32:43 133e0e204e208 patroni: 2021-06-08 15:32:43,212 INFO: master_timeline=17
Jun 8 15:32:43 133e0e204e208 patroni: 2021-06-08 15:32:43,216 INFO: master: history=13#0110/E9014430#011no recovery target specified
Jun 8 15:32:43 133e0e204e208 patroni: 14#0110/E9014B30#011no recovery target specified
Jun 8 15:32:43 133e0e204e208 patroni: 15#0110/E902D570#011no recovery target specified
Jun 8 15:32:43 133e0e204e208 patroni: 16#0110/E90301F8#011no recovery target specified
Jun 8 15:32:43 133e0e204e208 patroni: 2021-06-08 15:32:43,217 INFO: closed patroni connection to the postgresql cluster
Jun 8 15:32:43 133e0e204e208 patroni: 2021-06-08 15:32:43.406 CST [17816] LOG: redirecting log output to logging collector process
Jun 8 15:32:43 133e0e204e208 patroni: 2021-06-08 15:32:43.406 CST [17816] HINT: Future log output will appear in directory "log".
Jun 8 15:32:43 133e0e204e208 patroni: 2021-06-08 15:32:43,406 INFO: postmaster pid=17816
Jun 8 15:32:43 133e0e204e208 patroni: localhost:5432 - rejecting connections
Jun 8 15:32:43 133e0e204e208 patroni: localhost:5432 - accepting connections
Jun 8 15:32:50 133e0e204e208 patroni: 2021-06-08 15:32:50,863 INFO: Lock owner: postgres1; I am postgres3
Jun 8 15:32:50 133e0e204e208 patroni: 2021-06-08 15:32:50,863 INFO: does not have lock
Jun 8 15:32:50 133e0e204e208 patroni: 2021-06-08 15:32:50,863 INFO: establishing a new patroni connection to the postgres cluster
Jun 8 15:32:50 133e0e204e208 patroni: 2021-06-08 15:32:50,918 INFO: no action. i am a secondary and i am following a leader
Jun 8 15:32:52 133e0e204e208 patroni: 2021-06-08 15:32:52,403 INFO: Lock owner: postgres1; I am postgres3
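According to the log, postgres3 received the switchover request and then queried the candidate over the REST API, mainly to confirm its xlog position. Once that check passed, it demoted itself and released the leader key. Node 1 (postgres1) then acquired the leader key, became the leader, and promoted itself.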
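The xlog positions in the REST response are plain integers, while the demotion line in the log prints an LSN in PostgreSQL's hex form, so the two are awkward to compare by eye. A minimal sketch of the conversion follows. The helper names format_lsn and parse_lsn are made up for illustration and are not part of Patroni; the two values are copied from the log above.

```python
def format_lsn(position: int) -> str:
    """Render a 64-bit WAL position the way PostgreSQL prints an LSN (high/low halves in hex)."""
    return f"{position >> 32:X}/{position & 0xFFFFFFFF:X}"


def parse_lsn(lsn: str) -> int:
    """Convert a PostgreSQL-style LSN such as '0/E9030180' back to an integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


# Values copied from the log above.
replica_replayed = 3909288144             # postgres1 "replayed_location" before the switchover
old_leader_lsn = parse_lsn("0/E9030180")  # "Local timeline=16 lsn=..." on postgres3 after demotion

print(format_lsn(replica_replayed))       # 0/E90300D0
print(old_leader_lsn - replica_replayed)  # 176 (bytes)
```

The two readings were taken at different moments (before and after postgres3 shut down), so the 176-byte gap only shows how close the positions were, not how far the replica lagged.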
There are also options that let us perform the switchover with a single command instead of going through interactive mode.
[postgres@133e0e204e206 ~]$ patronictl switchover --help
Usage: patronictl switchover [OPTIONS] [CLUSTER_NAME]
Switchover to a replica
Options:
--master TEXT The name of the current master
--candidate TEXT The name of the candidate
--scheduled TEXT Timestamp of a scheduled switchover in unambiguous format
(e.g. ISO 8601)
--force Do not ask for confirmation at any point
--help Show this message and exit.
--scheduled lets us set a time for the switchover, for example two hours from now.
--force skips the repeated confirmation prompts.
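To get a value for --scheduled such as "two hours from now", a timestamp can be computed rather than typed by hand. This is only a sketch: the helper name scheduled_in is hypothetical, and it assumes that a timezone-aware ISO 8601 string is the safest way to satisfy the "unambiguous format" requirement in the help text.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def scheduled_in(hours: float, now: Optional[datetime] = None) -> str:
    """Return a timezone-aware ISO 8601 timestamp `hours` after `now`, truncated to the minute."""
    if now is None:
        now = datetime.now(timezone.utc)
    return (now + timedelta(hours=hours)).isoformat(timespec="minutes")


# Prints something like 2021-06-08T09:32+00:00, depending on the current time.
print(scheduled_in(2))
```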
[postgres@133e0e204e206 ~]$ patronictl -c /etc/patroni.yml switchover --master postgres1 --candidate postgres2 --scheduled now --force
Current cluster topology
+ Cluster: patnori-test (6962171552537974697) --+----+-----------+
| Member    | Host          | Role    | State   | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| postgres1 | 133.0.204.206 | Leader  | running | 17 |           |
| postgres2 | 133.0.204.207 | Replica | running | 17 |         0 |
| postgres3 | 133.0.204.208 | Replica | running | 17 |         0 |
+-----------+---------------+---------+---------+----+-----------+
2021-06-08 16:21:31.79130 Successfully switched over to "postgres2"
+ Cluster: patnori-test (6962171552537974697) --+----+-----------+
| Member    | Host          | Role    | State   | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| postgres1 | 133.0.204.206 | Replica | stopped |    |   unknown |
| postgres2 | 133.0.204.207 | Leader  | running | 17 |           |
| postgres3 | 133.0.204.208 | Replica | running | 17 |         0 |
+-----------+---------------+---------+---------+----+-----------+
The output above shows a single command switching the leader from postgres1 to postgres2.
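If the one-line switchover is going to be driven from a script, it helps to assemble the argument list in one place. The sketch below uses only the flags listed in the help output above. The function name switchover_argv and its defaults are assumptions for illustration, and it only builds and prints the command rather than running it.

```python
import shlex
from typing import List


def switchover_argv(master: str, candidate: str, scheduled: str = "now",
                    force: bool = True, config: str = "/etc/patroni.yml") -> List[str]:
    """Build the argument list for a non-interactive `patronictl switchover`."""
    argv = ["patronictl", "-c", config, "switchover",
            "--master", master,
            "--candidate", candidate,
            "--scheduled", scheduled]
    if force:
        # --force suppresses the confirmation prompts.
        argv.append("--force")
    return argv


argv = switchover_argv("postgres1", "postgres2")
print(shlex.join(argv))
# To actually execute it, one could pass argv to subprocess.run(argv, check=True).
```

Passing a list to subprocess.run instead of a shell string avoids quoting problems if a member name ever contains unusual characters.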
Patroni Failover
failover accepts the same options as switchover, except that it has no --scheduled option.
[postgres@133e0e204e206 ~]$ patronictl failover --help
Usage: patronictl failover [OPTIONS] [CLUSTER_NAME]
Failover to a replica
Options:
--master TEXT The name of the current master
--candidate TEXT The name of the candidate
--force Do not ask for confirmation at any point
--help Show this message and exit.
Failover is normally automatic, of course, but abnormal situations do occur, and in those cases you can trigger a failover by hand.
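A Few Questions
To close, a few questions:
1. In a setup with one primary and two replicas, all replicas lag behind the primary, and then the leader node fails. What should you do?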
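In this case no replica is eligible to become the new primary automatically, because they all lag behind the old one. The only option is to trigger a manual failover: patronictl failover <your-cluster-name>. This promotes one of the replicas to primary, and it loses data: whatever the promoted replica had not yet received from the old primary is gone.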
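Since a forced failover loses whatever the chosen replica has not received, it seems sensible to pick the replica that is furthest ahead. The sketch below assumes you have already fetched each member's /patroni response (the field names match the response shown in the log earlier). The function name most_advanced_replica is hypothetical, and the numbers for postgres2 are invented for the example.

```python
from typing import Dict, Optional


def most_advanced_replica(statuses: Dict[str, dict]) -> Optional[str]:
    """Return the running replica with the highest known WAL position, or None if there is none."""
    best_name, best_position = None, -1
    for name, status in statuses.items():
        if status.get("role") != "replica" or status.get("state") != "running":
            continue
        xlog = status.get("xlog", {})
        # Use whichever of received/replayed is further ahead; missing values count as 0.
        position = max(xlog.get("received_location") or 0,
                       xlog.get("replayed_location") or 0)
        if position > best_position:
            best_name, best_position = name, position
    return best_name


# Hypothetical /patroni responses, trimmed to the fields used here.
statuses = {
    "postgres1": {"role": "replica", "state": "running",
                  "xlog": {"received_location": 3909288144, "replayed_location": 3909288144}},
    "postgres2": {"role": "replica", "state": "running",
                  "xlog": {"received_location": 3909280000, "replayed_location": 3909279000}},
}

candidate = most_advanced_replica(statuses)
print(f"patronictl -c /etc/patroni.yml failover --candidate {candidate}")
```

The printed command uses the --candidate option from the failover help output; whether to add --force is left to the operator.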
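2. What does the maximum_lag_on_failover parameter mean?
In our setup, every 10 seconds (Patroni's loop_wait interval) the leader writes its wal_position to etcd. During a leader race, each replica compares its own wal_position with the last wal_position the leader registered in etcd. If the difference is greater than maximum_lag_on_failover, that replica will not promote itself.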
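The comparison described above can be sketched as a single function. This illustrates the rule as stated here, not Patroni's actual code. The name can_promote is made up, and the 1 MiB limit in the example is an assumed value, so check the setting in your own cluster.

```python
def can_promote(replica_wal_position: int,
                last_leader_wal_position: int,
                maximum_lag_on_failover: int) -> bool:
    """A replica may take part in the leader race only if its lag (in bytes) is within the limit."""
    lag = last_leader_wal_position - replica_wal_position
    return lag <= maximum_lag_on_failover


ONE_MIB = 1024 * 1024                 # assumed limit for this example
leader_position = 3909288320          # last wal_position the leader wrote to etcd (example value)

print(can_promote(3909288144, leader_position, ONE_MIB))                     # True: 176 bytes behind
print(can_promote(leader_position - 2 * ONE_MIB, leader_position, ONE_MIB))  # False: 2 MiB behind
```

Because the leader only publishes its position once per loop, the value in etcd can itself be several seconds stale, so the real lag of a replica may be larger than this check suggests.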
Source: https://mp.weixin.qq.com/s/4CwloH5NdieiKGZkJjDuxg