2010-01-20 88 views
0

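Postgres HA (based on WAL-shipping) is failing

I'm hoping someone can help me with a WAL-shipping and warm standby problem. My standby system runs for several weeks and then suddenly starts looking for .history files that don't exist. After that it breaks, and I can't restart it successfully without rebuilding the standby database.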

Both systems run CentOS 4.5 and Postgres 8.4.1. They use NFS to store the WAL files on the standby.

Relevant chunks of the log, with my comments:

[** Recovery is running normally **] 

Trigger file   : /tmp/pgsql.trigger 
Waiting for WAL file : 00000001000000830000005B 
WAL file path   : /var/tafkan_backup_from_db1/00000001000000830000005B 
Restoring to   : pg_xlog/RECOVERYXLOG 
Sleep interval   : 2 seconds 
Max wait interval  : 0 forever 
Command for restore  : cp "/var/tafkan_backup_from_db1/00000001000000830000005B" "pg_xlog/RECOVERYXLOG" 
Keep archive history : 00000001000000830000004D and later 
WAL file not present yet. Checking for trigger file... 
WAL file not present yet. Checking for trigger file... 
WAL file not present yet. Checking for trigger file... 
running restore   : OK 

Trigger file   : /tmp/pgsql.trigger 
Waiting for WAL file : 00000001000000830000005B 
WAL file path   : /var/tafkan_backup_from_db1/00000001000000830000005B 
Restoring to   : pg_xlog/RECOVERYXLOG 
Sleep interval   : 2 seconds 
Max wait interval  : 0 forever 
Command for restore  : cp "/var/tafkan_backup_from_db1/00000001000000830000005B" "pg_xlog/RECOVERYXLOG" 
Keep archive history : 000000000000000000000000 and later 
running restore   : OK 

[** All of a sudden it starts looking for .history files **] 

Trigger file   : /tmp/pgsql.trigger 
Waiting for WAL file : 00000002.history 
WAL file path   : /var/tafkan_backup_from_db1/00000002.history 
Restoring to   : pg_xlog/RECOVERYHISTORY 
Sleep interval   : 2 seconds 
Max wait interval  : 0 forever 
Command for restore  : cp "/var/tafkan_backup_from_db1/00000002.history" "pg_xlog/RECOVERYHISTORY" 
Keep archive history : 000000000000000000000000 and later 
running restore   :cp: cannot stat `/var/tafkan_backup_from_db1/00000002.history': No such file or directory 
cp: cannot stat `/var/tafkan_backup_from_db1/00000002.history': No such file or directory 
cp: cannot stat `/var/tafkan_backup_from_db1/00000002.history': No such file or directory 
cp: cannot stat `/var/tafkan_backup_from_db1/00000002.history': No such file or directory 
not restored 
history file not found 
Trigger file   : /tmp/pgsql.trigger 
Waiting for WAL file : 00000001.history 
WAL file path   : /var/tafkan_backup_from_db1/00000001.history 
Restoring to   : pg_xlog/RECOVERYHISTORY 
Sleep interval   : 2 seconds 
Max wait interval  : 0 forever 
Command for restore  : cp "/var/tafkan_backup_from_db1/00000001.history" "pg_xlog/RECOVERYHISTORY" 
Keep archive history : 000000000000000000000000 and later 
running restore   :cp: cannot stat `/var/tafkan_backup_from_db1/00000001.history': No such file or directory 
cp: cannot stat `/var/tafkan_backup_from_db1/00000001.history': No such file or directory 
cp: cannot stat `/var/tafkan_backup_from_db1/00000001.history': No such file or directory 
cp: cannot stat `/var/tafkan_backup_from_db1/00000001.history': No such file or directory 
not restored 
history file not found 

[** I stopped Postgres, renamed recovery.done to recovery.conf, and restarted it. **] 

Trigger file   : /tmp/pgsql.trigger 
Waiting for WAL file : 00000002.history 
WAL file path   : /var/tafkan_backup_from_db1/00000002.history 
Restoring to   : pg_xlog/RECOVERYHISTORY 
Sleep interval   : 2 seconds 
Max wait interval  : 0 forever 
Command for restore  : cp "/var/tafkan_backup_from_db1/00000002.history" "pg_xlog/RECOVERYHISTORY" 
Keep archive history : 000000000000000000000000 and later 
running restore   :cp: cannot stat `/var/tafkan_backup_from_db1/00000002.history': No such file or directory 
cp: cannot stat `/var/tafkan_backup_from_db1/00000002.history': No such file or directory 
cp: cannot stat `/var/tafkan_backup_from_db1/00000002.history': No such file or directory 
cp: cannot stat `/var/tafkan_backup_from_db1/00000002.history': No such file or directory 
not restored 
history file not found 
Trigger file   : /tmp/pgsql.trigger 
Waiting for WAL file : 0000000200000083000000A2 
WAL file path   : /var/tafkan_backup_from_db1/0000000200000083000000A2 
Restoring to   : pg_xlog/RECOVERYXLOG 
Sleep interval   : 2 seconds 
Max wait interval  : 0 forever 
Command for restore  : cp "/var/tafkan_backup_from_db1/0000000200000083000000A2" "pg_xlog/RECOVERYXLOG" 
Keep archive history : 000000000000000000000000 and later 
WAL file not present yet. Checking for trigger file... 
WAL file not present yet. Checking for trigger file... 
WAL file not present yet. Checking for trigger file... 
WAL file not present yet. Checking for trigger file... 

[** This file is not present. All WAL files start with 00000001. **] 

Any ideas? I don't even know what a history file is, and the (mostly excellent) documentation isn't very clear on this point.

PS. I wish I were running virtual machines, so I could use link text and not have to worry about any of this application-level HA nonsense :-)

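Update: here are some logs from the standby server from around that time. It looks like something made the server stop recovering and come online, but I don't know what. I'm fairly sure nothing could have created the trigger file.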

2010-01-20 03:30:15 EST 4b3a5c63.401b LOG: restored log file "00000001000000830000005A" from archive 
2010-01-20 03:30:23 EST 4b3a5c63.401b LOG: restored log file "00000001000000830000005B" from archive 
2010-01-20 03:30:23 EST 4b3a5c63.401b LOG: record with zero length at 83/5BFA2FF8 
2010-01-20 03:30:23 EST 4b3a5c63.401b LOG: redo done at 83/5BFA2FAC 
2010-01-20 03:30:23 EST 4b3a5c63.401b LOG: last completed transaction was at log time 2010-01-20 03:28:04.594399-05 
2010-01-20 03:30:25 EST 4b3a5c63.401b LOG: restored log file "00000001000000830000005B" from archive 
2010-01-20 03:30:37 EST 4b3a5c63.401b LOG: selected new timeline ID: 2 
2010-01-20 03:30:49 EST 4b3a5c63.401b LOG: archive recovery complete 
2010-01-20 03:30:59 EST 4b3a5c62.4019 LOG: database system is ready to accept connections 
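
For anyone checking their own standby for the same symptom, here is a minimal sketch that picks out the log lines showing recovery ending and the server coming online. The marker strings are taken from the log excerpt above; the function name `find_promotion_events` is hypothetical and only for illustration.

```python
# Messages that appear in the excerpt above when the standby leaves recovery.
# Note: the last one also appears on any normal server start, so it is only
# meaningful here when it follows the first two.
PROMOTION_MARKERS = (
    "selected new timeline ID",
    "archive recovery complete",
    "database system is ready to accept connections",
)


def find_promotion_events(lines):
    """Return the log lines that contain one of the promotion markers."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(marker in line for marker in PROMOTION_MARKERS)
    ]


# Sample lines copied from the standby log above.
SAMPLE_LOG = [
    "2010-01-20 03:30:23 EST 4b3a5c63.401b LOG: redo done at 83/5BFA2FAC",
    "2010-01-20 03:30:37 EST 4b3a5c63.401b LOG: selected new timeline ID: 2",
    "2010-01-20 03:30:49 EST 4b3a5c63.401b LOG: archive recovery complete",
    "2010-01-20 03:30:59 EST 4b3a5c62.4019 LOG: database system is ready to accept connections",
]

if __name__ == "__main__":
    for event in find_promotion_events(SAMPLE_LOG):
        print(event)
```

To scan a real log, the same function can be given an open file object instead of the sample list.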
+0

Hi sbleon, I just want to back up WAL files to a standby location; I don't need hot standby. Can you help? – 2012-01-31 09:26:45

+1

@indyaah, check the [excellent PostgreSQL documentation](http://www.postgresql.org/docs/) for your version. – sbleon 2012-03-01 21:42:23

+0

Thanks for the help, buddy. :D – 2012-03-03 09:17:23

Answers

1

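I was able to resolve this by updating the CentOS operating system on both of my PostgreSQL servers. So I think it was a symptom of some kind of underlying networking bug.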

1

A completely different approach to HA would be to host the PG database on a DRBD device shared between the two machines.

+0

Thanks for the suggestion! If I can't get WAL shipping to work reliably, that may be what I end up doing. – sbleon 2010-01-20 20:08:51

1

Are you using your own restore script/program? If so, please don't. Use pg_standby from PostgreSQL contrib.

Otherwise, ignore the .history files.

+0

I am using pg_standby. recovery.conf contains: "restore_command = 'pg_standby -l -d -s 2 -t /tmp/pgsql.trigger /var/tafkan_backup_from_db1 %f %p %r 2>>standby.log'". I can't ignore the .history files, because when pg_standby starts looking for them, recovery fails, recovery.conf gets moved to recovery.done, and WAL files start piling up quickly. – sbleon 2010-01-20 20:07:38

1

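Your replica came online at some point. The request for "00000002.history" is a lookup for the history file of timeline 00000002, whereas the rest of your logs start with 00000001, which is the original timeline.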

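I would check the PostgreSQL logs from just before it started looking for history files, to see whether there is any indication that the database came online, even for a moment.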
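
To make the timeline point concrete, here is a minimal sketch that reads the timeline ID out of the file names seen in the question. It relies on the standard naming scheme, where a WAL segment name is 24 hex digits (8 each for timeline, log, and segment) and a history file is the 8-digit timeline followed by `.history`. The function name `timeline_of` is hypothetical.

```python
import re

# WAL segment: 8 hex digits of timeline, 8 of log, 8 of segment.
SEGMENT_RE = re.compile(r"^([0-9A-F]{8})([0-9A-F]{8})([0-9A-F]{8})$")
# Timeline history file: the 8-digit timeline ID plus ".history".
HISTORY_RE = re.compile(r"^([0-9A-F]{8})\.history$")


def timeline_of(name):
    """Return the timeline ID encoded in a WAL or history file name, else None."""
    match = SEGMENT_RE.match(name) or HISTORY_RE.match(name)
    return int(match.group(1), 16) if match else None


if __name__ == "__main__":
    # File names taken from the pg_standby output in the question.
    for name in (
        "00000001000000830000005B",
        "00000002.history",
        "0000000200000083000000A2",
    ):
        print(name, "-> timeline", timeline_of(name))
```

The last file in that list is on timeline 2, which the primary never produced, so the standby would wait for it forever.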

+0

Thanks, Matthew. I've added some logs to my question. You're right that it came online, but I can't figure out what caused it, or why. – sbleon 2010-01-21 20:43:27

+0

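What was happening on the source side? The entry "record with zero length at 83/5BFA2FF8" looks like it simply tried to restore a partial WAL file. IIRC, when it hits an invalid record in the WAL, it rolls back to the last *good* record in the WAL and then comes online, regardless of whether the trigger file exists. I would look at the logs of both systems around 2010-01-20 03:28:04.594399-05 to see whether there are any errors in Postgres, the OS, or the network. – 2010-01-21 22:00:24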

+0

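That behavior makes sense. If the standby sees something that looks like a failure of the primary, it assumes the primary has died and that it should pick up the slack. I suspect there may be a network problem here. I'm going to look into that angle. Thanks! – sbleon 2010-01-22 20:46:18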
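
If a partially written WAL file in the archive is the suspect, one common defensive measure is to make the archive step write under a temporary name and rename the file only once it is complete. The sketch below illustrates that idea; it is not what fixed the problem in this thread (that was the OS update), and it assumes rename is atomic on the target filesystem. The function name `archive_atomically` and the `.tmp` suffix are hypothetical.

```python
import os
import shutil


def archive_atomically(src, dest):
    """Copy src to dest so that dest only ever appears fully written."""
    # An archive step should never overwrite an existing archived segment.
    if os.path.exists(dest):
        raise FileExistsError(f"refusing to overwrite {dest}")

    # Write under a temporary name that the standby will not ask for.
    tmp = dest + ".tmp"
    with open(src, "rb") as fin, open(tmp, "wb") as fout:
        shutil.copyfileobj(fin, fout)
        fout.flush()
        os.fsync(fout.fileno())  # push the data to storage before renaming

    # Publish the finished file under its final name in a single step.
    os.rename(tmp, dest)
```

A small wrapper script could call this from `archive_command` with `%p` as the source and the archive directory plus `%f` as the destination. Any exception should translate into a nonzero exit status so that PostgreSQL retries the segment later.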