PostgreSQL

PostgreSQL 12如何配置流复制主备环境以及执行主备切换注意事项

一 不再需要配置recovery.conf文件

从PostgreSQL 12开始,在执行通过流复制来配置主备数据库的时候,不再需要配置额外配置recovery.conf文件了。取而代之的是在备库环境的$PGDATA路径下配置一个standby.signal文件,注意该文件是1个普通的文本文件,内容为空。理解起来就是,该文件是一个标识文件。如果备库通过执行pg_ctl promote提升为主库的话,那么该文件将自动消失。

二 备库执行基础备份时新的命令行选项-R

PostgreSQL 9.4.10版本:

[postgres@ppasdev ~]$ pg_basebackup -V
pg_basebackup (PostgreSQL) 9.4.10
[postgres@ppasdev ~]$ pg_basebackup --help
pg_basebackup takes a base backup of a running PostgreSQL server.

Usage:
pg_basebackup [OPTION]...

Options controlling the output:
 -D, --pgdata=DIRECTORY receive base backup into directory
 -F, --format=p|t       output format (plain (default), tar)
 -r, --max-rate=RATE   maximum transfer rate to transfer data directory
                        (in kB/s, or use suffix "k" or "M")
 -R, --write-recovery-conf
                        write recovery.conf after backup
 -T, --tablespace-mapping=OLDDIR=NEWDIR
                        relocate tablespace in OLDDIR to NEWDIR
 -x, --xlog             include required WAL files in backup (fetch mode)
 -X, --xlog-method=fetch|stream
                        include required WAL files with specified method

PostgreSQL 12.8版本:

[postgres@ecs-db2 ~]$ pg_basebackup -V
pg_basebackup (PostgreSQL) 12.8
[postgres@ecs-db2 ~]$ pg_basebackup --help
pg_basebackup takes a base backup of a running PostgreSQL server.

Usage:
pg_basebackup [OPTION]...

Options controlling the output:
 -D, --pgdata=DIRECTORY receive base backup into directory
 -F, --format=p|t       output format (plain (default), tar)
 -r, --max-rate=RATE   maximum transfer rate to transfer data directory
                        (in kB/s, or use suffix "k" or "M")
 -R, --write-recovery-conf
                        write configuration for replication
 -T, --tablespace-mapping=OLDDIR=NEWDIR
                        relocate tablespace in OLDDIR to NEWDIR
     --waldir=WALDIR   location for the write-ahead log directory
 -X, --wal-method=none|fetch|stream
                        include required WAL files with specified method
 -z, --gzip             compress tar output
 -Z, --compress=0-9     compress tar output with given compression level

从对比中,可以看到在9.4.10版本中,-R选项用于创建recovery.conf文件,而在12.8版本中,-R则是用于创建用于replication的配置文件,其实就是生成$PGDATA/standby.signal文件。

三 如何生成standby.signal文件

两种方式,一种是在备库执行基础备份的时候,加上-R选项用于自动创建$PGDATA/standby.signal文件:

[postgres@ecs-db2 pg12.8]$ pwd
/postgres/pg12.8
[postgres@ecs-db2 pg12.8]$ ll
total 20
drwxrwxr-x  2 postgres postgres 4096 Oct 19 15:01 bin
drwxrwxr-x  6 postgres postgres 4096 Oct 19 15:01 include
drwxrwxr-x  4 postgres postgres 4096 Oct 19 15:01 lib
drwxrwxr-x  8 postgres postgres 4096 Oct 19 15:01 share
[postgres@ecs-db2 pg12.8]$ pg_basebackup -h 172.22.3.4 -p 5432 -U replica --password -X stream -Fp --progress -D $PGDATA -R
Password:
24597/24597 kB (100%), 1/1 tablespace
[postgres@ecs-db2 pg12.8]$ ll data
total 120
-rw------- 1 postgres postgres   224 Oct 20 15:11 backup_label
drwx------ 5 postgres postgres  4096 Oct 20 15:11 base
drwx------ 2 postgres postgres  4096 Oct 20 15:11 global
drwx------ 2 postgres postgres  4096 Oct 20 15:11 pg_commit_ts
drwx------ 2 postgres postgres  4096 Oct 20 15:11 pg_dynshmem
-rw------- 1 postgres postgres  4867 Oct 20 15:11 pg_hba.conf
-rw------- 1 postgres postgres  1636 Oct 20 15:11 pg_ident.conf
drwx------ 4 postgres postgres  4096 Oct 20 15:11 pg_logical
drwx------ 4 postgres postgres  4096 Oct 20 15:11 pg_multixact
drwx------ 2 postgres postgres  4096 Oct 20 15:11 pg_notify
drwx------ 2 postgres postgres  4096 Oct 20 15:11 pg_replslot
drwx------ 2 postgres postgres  4096 Oct 20 15:11 pg_serial
drwx------ 2 postgres postgres  4096 Oct 20 15:11 pg_snapshots
drwx------ 2 postgres postgres  4096 Oct 20 15:11 pg_stat
drwx------ 2 postgres postgres  4096 Oct 20 15:11 pg_stat_tmp
drwx------ 2 postgres postgres  4096 Oct 20 15:11 pg_subtrans
drwx------ 2 postgres postgres  4096 Oct 20 15:11 pg_tblspc
drwx------ 2 postgres postgres  4096 Oct 20 15:11 pg_twophase
-rw------- 1 postgres postgres     3 Oct 20 15:11 PG_VERSION
drwx------ 3 postgres postgres  4096 Oct 20 15:11 pg_wal
drwx------ 2 postgres postgres  4096 Oct 20 15:11 pg_xact
-rw------- 1 postgres postgres   337 Oct 20 15:11 postgresql.auto.conf
-rw------- 1 postgres postgres 26718 Oct 20 15:11 postgresql.conf
-rw------- 1 postgres postgres     0 Oct 20 15:11 standby.signal
[postgres@ecs-db2 pg12.8]$

方式2:如果在备库上执行pg_basebackup对主库进行备份的时候,没有使用-R选项的话,我们可以在备库的$PGDATA路径下,touch standby.signal就好了。

记住:该文件只是一个标识文件,它的存在就是告诉数据库,当我们执行pg_ctl start启动的时候,当前库的角色是standby,不是primary角色。

四执行stream主备配置流程

主体思路跟PostgreSQL 11及以前版本的配置流程大同小异,甚至是更简单一些了。

1 主库创建流复制的用户
postgres=# CREATE ROLE replica login replication encrypted password 'replica';
CREATE ROLE
postgres=#
2 主库修改pg_hba.conf文件,允许备库IP通过复制用户访问数据库
[postgres@ecs-db1 data]$ tail pg_hba.conf 
host   all             all             127.0.0.1/32           trust
host   all             all             0.0.0.0/0               md5
# IPv6 local connections:
host   all             all             ::1/128                 trust
# Allow replication connections from localhost, by a user with the
# replication privilege.
local   replication     all                                     trust
host   replication     all             127.0.0.1/32           trust
host   replication     all             ::1/128                 trust
host   replication     replica         172.22.3.5/32           md5
[postgres@ecs-db1 data]$

最后一行,添加了replica用户可以从备库IP 172.22.3.5访问主库。

3 备库上执行对于主库的基础备份
[postgres@ecs-db2 pg12.8]$ pg_basebackup -h 172.22.3.4 -p 5432 -U replica --password -X stream -Fp --progress -D $PGDATA -R
Password:
24598/24598 kB (100%), 1/1 tablespace
[postgres@ecs-db2 pg12.8]$

注意,备份选项上带有-R选项。

4 备库就可以执行pg_ctl start启动了

这时,就可以看到备库服务器上自动生成了standby.signal文件。同时,也看到在$PGDATA路径下,数据库自动帮我们配置了关于流复制的主库的信息:

[postgres@ecs-db2 pg12.8]$ cat data/postgresql.auto.conf
# Do not edit this file manually!
# It will be overwritten by the ALTER SYSTEM command.
primary_conninfo = 'user=replica password=replica host=172.22.3.4 port=5432 sslmode=disable sslcompression=0 gssencmode=disable krbsrvname=postgres target_session_attrs=any'
[postgres@ecs-db2 pg12.8]$

当然了,如果我们没有使用-R来备份主库的话。我们完全可以在备库上手工创建standby.signal文件,然后手工编辑postgresql.auto.conf,并在其内容中配置主库的信息。

5 备库数据库进程信息
[postgres@ecs-db2 ~]$ ps -ef|grep postgres
postgres  8911     1  0 Oct20 ?        00:00:00 /postgres/pg12.8/bin/postgres
postgres  8912  8911  0 Oct20 ?        00:00:00 postgres: startup   recovering 000000010000000000000003
postgres  8913  8911  0 Oct20 ?        00:00:00 postgres: checkpointer  
postgres  8914  8911  0 Oct20 ?        00:00:00 postgres: background writer  
postgres  8915  8911  0 Oct20 ?        00:00:00 postgres: stats collector  
postgres  8916  8911  0 Oct20 ?        00:00:32 postgres: walreceiver   streaming 0/3000148
root     18015 18014  0 16:13 pts/0    00:00:00 su - postgres
postgres 18016 18015  0 16:13 pts/0    00:00:00 -bash
postgres 18055 18016  0 16:13 pts/0    00:00:00 ps -ef
postgres 18056 18016  0 16:13 pts/0    00:00:00 grep --color=auto postgres
[postgres@ecs-db2 ~]$

备库上,可以看到walreceiver进程,正在读取日志streaming 0/3000148,执行恢复recovering 000000010000000000000003。

6 主库数据库进程信息
[postgres@ecs-db1 ~]$ ps -ef|grep postgres
postgres  3708     1  0 Oct20 ?        00:00:00 /postgres/pg12.8/bin/postgres
postgres  3710  3708  0 Oct20 ?        00:00:00 postgres: checkpointer  
postgres  3711  3708  0 Oct20 ?        00:00:00 postgres: background writer  
postgres  3712  3708  0 Oct20 ?        00:00:00 postgres: walwriter  
postgres  3713  3708  0 Oct20 ?        00:00:00 postgres: autovacuum launcher  
postgres  3714  3708  0 Oct20 ?        00:00:00 postgres: stats collector  
postgres  3715  3708  0 Oct20 ?        00:00:00 postgres: logical replication launcher  
postgres  4174  3708  0 Oct20 ?        00:00:00 postgres: walsender replica 172.22.3.5(57926) streaming 0/3000148
root     30410 30409  0 16:11 pts/1    00:00:00 su - postgres
postgres 30411 30410  0 16:11 pts/1    00:00:00 -bash
postgres 30868 30411  0 16:13 pts/1    00:00:00 ps -ef
postgres 30869 30411  0 16:13 pts/1    00:00:00 grep --color=auto postgres
[postgres@ecs-db1 ~]$

主库上看到,后台进程walsender,正在向replica 172.22.3.5(57926) streaming 0/3000148推送日志信息。

7 主库查看数据库复制信息
[postgres@ecs-db1 ~]$ psql -xc "select * from pg_stat_replication"
-[ RECORD 1 ]----+------------------------------
pid             | 4174
usesysid         | 16384
usename         | replica
application_name | walreceiver
client_addr     | 172.22.3.5
client_hostname |
client_port     | 57926
backend_start   | 2021-10-20 15:45:37.817863+08
backend_xmin     |
state           | streaming
sent_lsn         | 0/3000148
write_lsn       | 0/3000148
flush_lsn       | 0/3000148
replay_lsn       | 0/3000148
write_lag       |
flush_lag       |
replay_lag       |
sync_priority   | 0
sync_state       | async
reply_time       | 2021-10-21 16:11:58.438541+08

[postgres@ecs-db1 ~]$

五 主备切换及注意事项

如果因为意外或故障导致主库不可用的情况下,可以直接将备库提升为主库对外提供服务。然后视具体情况看原来的主库是否需要重建,或者是否待故障恢复之后,可以直接作为新的备库,然后从新的主库(原备库)同步数据。

下面是模拟切换步骤:

1 主库停止,模拟故障
[postgres@ecs-db1 ~]$ ps -ef|grep postgres
postgres  3708     1  0 Oct20 ?        00:00:00 /postgres/pg12.8/bin/postgres
postgres  3710  3708  0 Oct20 ?        00:00:00 postgres: checkpointer  
postgres  3711  3708  0 Oct20 ?        00:00:00 postgres: background writer  
postgres  3712  3708  0 Oct20 ?        00:00:00 postgres: walwriter  
postgres  3713  3708  0 Oct20 ?        00:00:00 postgres: autovacuum launcher  
postgres  3714  3708  0 Oct20 ?        00:00:00 postgres: stats collector  
postgres  3715  3708  0 Oct20 ?        00:00:00 postgres: logical replication launcher  
postgres  4174  3708  0 Oct20 ?        00:00:00 postgres: walsender replica 172.22.3.5(57926) streaming 0/3000148
postgres  4747 30411  0 16:42 pts/1    00:00:00 ps -ef
postgres  4748 30411  0 16:42 pts/1    00:00:00 grep --color=auto postgres
root     30410 30409  0 16:11 pts/1    00:00:00 su - postgres
postgres 30411 30410  0 16:11 pts/1    00:00:00 -bash
[postgres@ecs-db1 ~]$ pg_ctl status
pg_ctl: server is running (PID: 3708)
/postgres/pg12.8/bin/postgres
[postgres@ecs-db1 ~]$ pg_ctl stop -m fast
waiting for server to shut down.... done
server stopped
[postgres@ecs-db1 ~]$ ps -ef|grep postgres
postgres  4843 30411  0 16:43 pts/1    00:00:00 ps -ef
postgres  4844 30411  0 16:43 pts/1    00:00:00 grep --color=auto postgres
root     30410 30409  0 16:11 pts/1    00:00:00 su - postgres
postgres 30411 30410  0 16:11 pts/1    00:00:00 -bash
[postgres@ecs-db1 ~]$ pg_ctl status
pg_ctl: no server running
[postgres@ecs-db1 ~]$

通过pg_ctl stop -m fast停止原来的主库之后,数据库后台进程都没有了。

2 备库提升为新主库,对外提供服务
[postgres@ecs-db2 ~]$ ps -ef|grep postgres
postgres  8911     1  0 Oct20 ?        00:00:00 /postgres/pg12.8/bin/postgres
postgres  8912  8911  0 Oct20 ?        00:00:00 postgres: startup   recovering 000000010000000000000003
postgres  8913  8911  0 Oct20 ?        00:00:00 postgres: checkpointer  
postgres  8914  8911  0 Oct20 ?        00:00:00 postgres: background writer  
postgres  8915  8911  0 Oct20 ?        00:00:00 postgres: stats collector  
root     18015 18014  0 16:13 pts/0    00:00:00 su - postgres
postgres 18016 18015  0 16:13 pts/0    00:00:00 -bash
postgres 25404 18016  0 16:43 pts/0    00:00:00 ps -ef
postgres 25405 18016  0 16:43 pts/0    00:00:00 grep --color=auto postgres
[postgres@ecs-db2 ~]$ pg_ctl status
pg_ctl: server is running (PID: 8911)
/postgres/pg12.8/bin/postgres
[postgres@ecs-db2 ~]$ pg_ctl promote
waiting for server to promote.... done
server promoted
[postgres@ecs-db2 ~]$ ps -ef|grep postgres
postgres  8911     1  0 Oct20 ?        00:00:00 /postgres/pg12.8/bin/postgres
postgres  8913  8911  0 Oct20 ?        00:00:00 postgres: checkpointer  
postgres  8914  8911  0 Oct20 ?        00:00:00 postgres: background writer  
postgres  8915  8911  0 Oct20 ?        00:00:00 postgres: stats collector  
root     18015 18014  0 16:13 pts/0    00:00:00 su - postgres
postgres 18016 18015  0 16:13 pts/0    00:00:00 -bash
postgres 25622  8911  0 16:44 ?        00:00:00 postgres: walwriter  
postgres 25623  8911  0 16:44 ?        00:00:00 postgres: autovacuum launcher  
postgres 25624  8911  0 16:44 ?        00:00:00 postgres: logical replication launcher  
postgres 25628 18016  0 16:44 pts/0    00:00:00 ps -ef
postgres 25629 18016  0 16:44 pts/0    00:00:00 grep --color=auto postgres
[postgres@ecs-db2 ~]$

重要1:启动备库为新主库的命令是pg_ctl promote。

提升备库为主库之后,可以看到,后台进程中不再有startup recovering,以及walreceiver streaming进程了。同时,多了postgres: walwriter 写进程。

重要2:$PGDATA/standby.signal文件自动消失了。这是告诉PostgreSQL,我现在不再是备库了,我的身份是主库了。

3 新主库修改pg_hba.conf文件

修改新主库(原备库172.22.3.5)的$PGDATA/pg_hba.conf文件,在其中添加允许新备库(原主库172.22.3.4)可以通过replica用户访问数据库的条目信息。

host    replication     all             172.22.3.101/32           md5

如果不做这一步配置的话,将来启动原主库为新备库的时候,可能会遇到下述错误。

2021-10-21 17:13:20.464 CST [11394] FATAL:  could not connect to the primary server: FATAL:  no pg_hba.conf entry for replication connection from host "172.22.3.101", user "replica", SSL off
2021-10-21 17:13:20.466 CST [11395] FATAL: could not connect to the primary server: FATAL: no pg_hba.conf entry for replication connection from host "172.22.3.101", user "replica", SSL off

注意:这里的172.22.3.101是原主库上配置的1个浮动IP地址,绑定在eth0:1设备上。如果主从环境的数据库没有配置浮动IP的话,则这里的IP地址,应该直接填原主库的实际IP地址。

4 原主库新建$PGDATA/standby.signal文件
[postgres@ecs-db1 data]$ touch standby.signal
[postgres@ecs-db1 data]$ pwd
/postgres/pg12.8/data
[postgres@ecs-db1 data]$ ll standby.signal
-rw-rw-r-- 1 postgres postgres 0 Oct 21 16:54 standby.signal
[postgres@ecs-db1 data]$

注意:这一步骤非常非常重要,如果不配置该文件的话,那么原来的主库一旦重新启动话,就将成为了1个新的独立主库,脱离了主从数据库环境。

5 原主库修改$PGDATA/postgresql.auto.conf文件

注意,应该用单引号,而不是双引号。否则遇到下述错误。

[postgres@ecs-db1 data]$ cat postgresql.auto.conf 
# Do not edit this file manually!
# It will be overwritten by the ALTER SYSTEM command.
primary_conninfo="user=replica password=replica host=172.22.3.5 port=5432"
[postgres@ecs-db1 data]$
[postgres@ecs-db1 data]$ pg_ctl start -l ~/pg.log
waiting for server to start.... stopped waiting
pg_ctl: could not start server
Examine the log output.
[postgres@ecs-db1 data]$ tailf ~/pg.log
数据库启动日志
2021-10-21 09:07:06.512 GMT [10045] LOG: syntax error in file "/postgres/pg12.8/data/postgresql.auto.conf" line 3, near token """
2021-10-21 09:07:06.512 GMT [10045] FATAL: configuration file "postgresql.auto.conf" contains errors

修改$PGDATA/postgresql.auto.conf配置文件为下述正确的格式:

[postgres@ecs-db1 data]$ cat postgresql.auto.conf 
# Do not edit this file manually!
# It will be overwritten by the ALTER SYSTEM command.
primary_conninfo='user=replica password=replica host=172.22.3.5 port=5432'
[postgres@ecs-db1 data]$
6 启动原主库,变为新备库
[postgres@ecs-db1 data]$ pg_ctl start -l ~/pg.log 
waiting for server to start.... done
server started
[postgres@ecs-db1 data]$ ps -ef|grep postgres
root      8116  8115  0 16:58 pts/0    00:00:00 su - postgres
postgres  8118  8116  0 16:58 pts/0    00:00:00 -bash
root      8598  8597  0 17:00 pts/2    00:00:00 su - postgres
postgres  8600  8598  0 17:00 pts/2    00:00:00 -bash
postgres 11368  8118  0 17:13 pts/0    00:00:00 tailf pg.log
postgres 11389     1  0 17:13 ?        00:00:00 /postgres/pg12.8/bin/postgres
postgres 11390 11389  0 17:13 ?        00:00:00 postgres: startup   recovering 000000020000000000000003
postgres 11391 11389  0 17:13 ?        00:00:00 postgres: checkpointer  
postgres 11392 11389  0 17:13 ?        00:00:00 postgres: background writer  
postgres 11393 11389  0 17:13 ?        00:00:00 postgres: stats collector  
postgres 11440 11389  0 17:13 ?        00:00:00 postgres: walreceiver   streaming 0/3013AC8
postgres 12545 30411  0 17:18 pts/1    00:00:00 ps -ef
postgres 12546 30411  0 17:18 pts/1    00:00:00 grep --color=auto postgres
root     30410 30409  0 16:11 pts/1    00:00:00 su - postgres
postgres 30411 30410  0 16:11 pts/1    00:00:00 -bash
[postgres@ecs-db1 ~]$ tailf pg.log
2021-10-21 17:13:45.488 CST [11440] LOG: fetching timeline history file for timeline 2 from primary server
2021-10-21 17:13:45.493 CST [11440] LOG: started streaming WAL from primary at 0/3000000 on timeline 1
2021-10-21 17:13:45.493 CST [11440] LOG: replication terminated by primary server
2021-10-21 17:13:45.493 CST [11440] DETAIL: End of WAL reached on timeline 1 at 0/30001C0.
2021-10-21 17:13:45.494 CST [11390] LOG: new target timeline is 2
2021-10-21 17:13:45.494 CST [11440] LOG: restarted WAL streaming at 0/3000000 on timeline 2
2021-10-21 17:13:45.539 CST [11390] LOG: redo starts at 0/30001C0

这样,就完成了一次主从数据库环境的切换操作了。

六小结和参考链接

  1. 随着新版本的发行,数据库的配置和使用也越来越简单顺手了。
  2. 备库提升为主库的命令:pg_ctl promote;
  3. 新主库(原备库)的pg_hba.conf文件,要开放允许流复制访问数据库的信息给原主库的IP地址;
  4. 原主库配置为新备库的时候,务必要创建$PGDATA/standby.signal文件;
  5. 原主库配置为新备库的时候,务必要修改$PGDATA/postgresql.auto.conf文件,添加主库primary_conninfo的信息;

参考:https://www.percona.com/blog/2019/10/11/how-to-set-up-streaming-replication-in-postgresql-12/

https://www.2ndquadrant.com/en/blog/replication-configuration-changes-in-postgresql-12

延伸阅读之前写的一篇:PostgreSQL主从同步配置,切换步骤

留言