2016-03-01 1459 views

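pgpool-II in master/slave mode: how can I most easily trigger a failover?

I am testing a toy PostgreSQL infrastructure on some local virtual machines to determine how pgpool behaves on failover. I configured a basic setup with two database machines (192.168.0.2 and 192.168.0.3) and one pgpool machine (192.168.0.4). 192.168.0.3 has been set up as a slave of 192.168.0.2 using streaming replication. pgpool-II is configured as follows: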

listen_addresses = '*' 
backend_hostname0 = '192.168.0.2' 
backend_port0 = 5432 
backend_weight0 = 1 
backend_data_directory0 = '/var/lib/postgresql/9.4/main/' 
backend_flag0 = 'ALLOW_TO_FAILOVER' 
backend_hostname1 = '192.168.0.3' 
backend_port1 = 5432 
backend_weight1 = 1 
backend_data_directory1 = '/var/lib/postgresql/9.4/main/' 
backend_flag1 = 'ALLOW_TO_FAILOVER' 
enable_pool_hba = on 
replication_mode = false 
master_slave_mode = on 
master_slave_sub_mode = 'stream' 
fail_over_on_backend_error = true 
failover_command = '/root/pgpool_failover_stream.sh %d %H /tmp/postgresql.trigger.5432' 
load_balance_mode = false 

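I have verified that all of this works. That is, when I change the master database, replication works correctly, and I can connect to the master, the slave, and pgpool-II from a sample application and get the results I expect.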

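Now I have started a long-running application connected to pgpool and then tried to force a failover by SSHing into the master database server and stopping postgres (running service postgresql stop as root). My application keeps executing queries correctly, but no failover occurs (the script is never run). I even tested connecting the application directly to the master database, and when I stop the postgres service the application eventually crashes.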

What am I doing wrong? Have I misconfigured pgpool? Or is there a better way to trigger a failover?

Edit: As requested, here is the part of the log where the first errors appear:

... 
2016-03-15 18:47:15: pid 1232: DEBUG: initializing backend status 
2016-03-15 18:47:15: pid 1231: DEBUG: initializing backend status 
2016-03-15 18:47:15: pid 1230: DEBUG: initializing backend status 
2016-03-15 18:47:15: pid 1209: ERROR: failed to authenticate 
2016-03-15 18:47:15: pid 1209: DETAIL: invalid authentication message response type, Expecting 'R' and received 'E' 
2016-03-15 18:47:15: pid 1209: LOG: find_primary_node: checking backend no 1 
2016-03-15 18:47:15: pid 1209: ERROR: failed to authenticate 
2016-03-15 18:47:15: pid 1209: DETAIL: invalid authentication message response type, Expecting 'R' and received 'E' 
2016-03-15 18:47:15: pid 1209: DEBUG: find_primary_node: no primary node found 
... 

Oddly, I can still connect to pgpool and execute queries, so I am clearly not understanding something there.

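Edit 2: These are the errors I get after running service postgresql shutdown on the master. I am showing everything up to the point where pgpool starts shutting down.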

... 
2016-03-16 17:24:57: pid 1012: DEBUG: session context: clearing doing extended query messaging. DONE 
2016-03-16 17:24:57: pid 1012: DEBUG: session context: setting doing extended query messaging. DONE 
2016-03-16 17:24:57: pid 1012: DEBUG: session context: setting query in progress. DONE 
2016-03-16 17:24:57: pid 1012: DEBUG: reading backend data packet kind 
2016-03-16 17:24:57: pid 1012: DETAIL: backend:0 of 2 kind = 'E' 
2016-03-16 17:24:57: pid 1012: DEBUG: processing backend response 
2016-03-16 17:24:57: pid 1012: DETAIL: received kind 'E'(45) from backend 
2016-03-16 17:24:57: pid 1012: ERROR: unable to forward message to frontend 
2016-03-16 17:24:57: pid 1012: DETAIL: FATAL error occured on backend 
2016-03-16 17:24:57: pid 1012: DEBUG: session context: setting query in progress. DONE 
2016-03-16 17:24:57: pid 1012: DEBUG: decide where to send the queries 
2016-03-16 17:24:57: pid 1012: DETAIL: destination = 3 for query= "DISCARD ALL" 
2016-03-16 17:24:57: pid 1012: DEBUG: waiting for query response 
2016-03-16 17:24:57: pid 1012: DETAIL: waiting for backend:0 to complete the query 
2016-03-16 17:24:57: pid 1012: FATAL: unable to read data from DB node 0 
2016-03-16 17:24:57: pid 1012: DETAIL: EOF encountered with backend 
2016-03-16 17:24:57: pid 998: DEBUG: reaper handler 
2016-03-16 17:24:57: pid 998: LOG: child process with pid: 1012 exits with status 256 
2016-03-16 17:24:57: pid 998: LOG: fork a new child process with pid: 1033 
2016-03-16 17:24:57: pid 998: DEBUG: reaper handler: exiting normally 
2016-03-16 17:24:57: pid 1033: DEBUG: initializing backend status 
2016-03-16 17:25:02: pid 1031: DEBUG: PCP child receives shutdown request signal 2 
2016-03-16 17:25:02: pid 1029: LOG: child process received shutdown request signal 2 
... 

Note that my sample application does in fact die when the master is shut down.

Edit 3: These are the errors I get in the new logs after setting sr_check_period, sr_check_user, and sr_check_password to reasonable values. All of the previous errors are now gone:

2016-03-31 17:45:00: pid 18363: DEBUG: detect error: kind: 1 
2016-03-31 17:45:00: pid 18363: DEBUG: reading backend data packet kind 
2016-03-31 17:45:00: pid 18363: DETAIL: backend:0 of 2 kind = '1' 
... 
2016-03-31 17:45:00: pid 18363: DEBUG: detect error: kind: S 

Answer

There could be several reasons why the failover script is not being executed. The first step is to set log_destination to syslog and enable debug mode (debug_level = 1).
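To confirm which values pgpool is actually configured with before reproducing the problem, something like the following can print the relevant lines. This is only a sketch: the function name show_settings is invented for this example, and the configuration path in the usage comment is an assumption that depends on your installation.

```shell
#!/bin/sh
# Print the active (uncommented) values of the given pgpool.conf keys.
show_settings() {
    conf_file="$1"; shift
    for key in "$@"; do
        # Match "key = value" at the start of a line, allowing leading
        # whitespace; commented-out lines (starting with #) do not match.
        grep -E "^[[:space:]]*${key}[[:space:]]*=" "$conf_file" \
            || echo "$key is not set in $conf_file"
    done
}

# Example (the path is an assumption, adjust it to your system):
# show_settings /etc/pgpool2/pgpool.conf log_destination debug_level
```

A key reported as not set is left at its built-in default, which is worth checking against the documentation for your pgpool version.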

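I have seen cases where the failover script did not receive the arguments for %d and %H (the special characters), and as a result the script could not ssh to the slave and touch the trigger file.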
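As a rough illustration of what such a failover script has to do, here is a minimal sketch. It assumes the argument order from the question's failover_command (%d is the id of the failed node, %H is the host name of the new master, followed by the trigger file path), that the pgpool machine can ssh to the standby as the postgres user without a password, and that the standby's recovery.conf names the same trigger file. The function name promote_standby and the SSH_CMD override are made up for this example; this is not the script that either of us actually uses.

```shell
#!/bin/sh
# Sketch of a failover script for pgpool-II with streaming replication.
# Called as: script <failed_node_id> <new_master_host> <trigger_file>

# SSH_CMD is a hypothetical override so the script can be dry-run
# (for example SSH_CMD=echo) without touching a real server.
SSH_CMD="${SSH_CMD:-ssh}"

promote_standby() {
    failed_node_id="$1"
    new_master_host="$2"
    trigger_file="$3"

    # pgpool substitutes %d and %H. If any argument arrives empty, stop
    # here rather than running ssh with a missing host name.
    if [ -z "$failed_node_id" ] || [ -z "$new_master_host" ] || [ -z "$trigger_file" ]; then
        echo "usage: $0 <failed_node_id> <new_master_host> <trigger_file>" >&2
        return 1
    fi

    echo "failover: node $failed_node_id is down, promoting $new_master_host"

    # Creating the trigger file tells the standby to leave recovery.
    $SSH_CMD -T "postgres@$new_master_host" "touch '$trigger_file'"
}

# Only act when pgpool (or you) passed arguments.
if [ "$#" -gt 0 ]; then
    promote_standby "$@"
fi
```

Running it by hand with SSH_CMD=echo shows the command it would send, which is a quick way to see whether the arguments arrive as expected. A fuller script would also compare the failed node id with the old primary id (%P) and do nothing when only the standby went down.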

If you post the log file, I can provide more details.

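Based on the new logs: I can see one error: failed to authenticate. Could you check whether the following pgpool parameters are configured correctly?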

health_check_user
health_check_password
recovery_user
recovery_password
wd_lifecheck_user
wd_lifecheck_password
sr_check_user
sr_check_password

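I assume you have already changed the postgres user's password on the database servers with

ALTER USER postgres PASSWORD 'yourpassword';

and make sure you use that same password in every one of the parameters above.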
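One way to test those credentials outside pgpool is to log in to each backend directly with the same user and password. The sketch below assumes psql is installed on the pgpool machine and uses the backend addresses from the question; the function name check_backend and the PSQL override are invented for this example. It also asks each server for pg_is_in_recovery(), which should come back as f on the master and t on the streaming standby.

```shell
#!/bin/sh
# Try the pgpool check credentials against each backend directly.
# PSQL is a hypothetical override so the check can be dry-run.
PSQL="${PSQL:-psql}"

check_backend() {
    host="$1"; user="$2"; password="$3"
    # -w: never prompt for a password; -A -t: print only the bare value.
    if role="$(PGPASSWORD="$password" $PSQL -w -At -h "$host" -p 5432 \
                 -U "$user" -d postgres -c 'SELECT pg_is_in_recovery()')"; then
        echo "$host: login ok, in_recovery=$role"
    else
        echo "$host: login FAILED for user $user"
        return 1
    fi
}

# Usage: script <user> <password>, with the values you put in
# sr_check_user / sr_check_password.
if [ "$#" -eq 2 ]; then
    for backend in 192.168.0.2 192.168.0.3; do
        check_backend "$backend" "$1" "$2"
    done
fi
```

If the login fails here as well, the problem is in the password or in pg_hba.conf on the database server rather than in pgpool itself.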

From the logs, it looks like an authentication problem. Can you tell me which version of pgpool you are using?

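This is the configuration we are using with 3 machines (1 master, 1 slave, and 1 pgpool machine). I have modified the settings to match your IP addresses: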

    listen_addresses = '*'
    port = 5433 
    socket_dir = '/var/run/postgresql' 
    pcp_port = 9898 
    pcp_socket_dir = '/var/run/postgresql' 

    backend_hostname0 = '192.168.0.2' 
    backend_port0 = 5432 
    backend_weight0 = 1 
    backend_data_directory0 = '/var/lib/postgresql/9.4/main' 
    backend_flag0 = 'ALLOW_TO_FAILOVER' 

    backend_hostname1 = '192.168.0.3' 
    backend_port1 = 5432 
    backend_weight1 = 1 
    backend_data_directory1 = '/var/lib/postgresql/9.4/main' 
    backend_flag1 = 'ALLOW_TO_FAILOVER' 

    enable_pool_hba = on 
    pool_passwd = '' 
    authentication_timeout = 60 
    ssl = off 
    num_init_children = 4 
    max_pool = 2 
    child_life_time = 300 
    child_max_connections = 0 
    connection_life_time = 0 
    client_idle_limit = 0 
    log_destination = 'stderr,syslog' 
    print_timestamp = on 
    log_connections = on 
    log_hostname = on 
    log_statement = on 
    log_per_node_statement = on 
    log_standby_delay = 'none' 
    syslog_facility = 'LOCAL0' 
    syslog_ident = 'pgpool' 
    debug_level = 1 
    pid_file_name = '/var/run/postgresql/pgpool.pid' 
    logdir = '/var/log/postgresql' 
    connection_cache = on 
    reset_query_list = 'ABORT; DISCARD ALL' 

    replication_mode = off 
    replicate_select = off 
    insert_lock = on 
    lobj_lock_table = '' 
    replication_stop_on_mismatch = off 
    failover_if_affected_tuples_mismatch = off 

    load_balance_mode = off 
    ignore_leading_white_space = on 
    white_function_list = '' 
    black_function_list = 'nextval,setval' 

    master_slave_mode = on 
    master_slave_sub_mode = 'stream' 
    sr_check_period = 10 
    sr_check_user = 'postgres' 
    sr_check_password = 'postgres123' 
    delay_threshold = 0 
    follow_master_command = '' 
    parallel_mode = off 
    pgpool2_hostname = 'pgmaster' 

    system_db_hostname = 'localhost' 
    system_db_port = 5432 
    system_db_dbname = 'pgpool' 
    system_db_schema = 'pgpool_catalog' 
    system_db_user = 'pgpool' 
    system_db_password = '' 

    health_check_period = 5 
    health_check_timeout = 20 
    health_check_user = 'postgres' 
    health_check_password = 'postgres123' 
    health_check_max_retries = 2 
    health_check_retry_delay = 1 

    failover_command = '/usr/sbin/failover_modified.sh %d "%H" %P /var/lib/postgresql/9.4/main/pgsql.trigger.5432' 
    failback_command = '' 
    fail_over_on_backend_error = on 
    search_primary_node_timeout = 10 

    recovery_user = 'postgres' 
    recovery_password = 'postgres123' 
    recovery_1st_stage_command = '' 
    recovery_2nd_stage_command = '' 
    recovery_timeout = 90 
    client_idle_limit_in_recovery = 0 

    use_watchdog = off 
    trusted_servers = '' 
    ping_path = '/bin' 
    wd_hostname = '' 
    wd_port = 9000 
    wd_authkey = '' 
    delegate_IP = '' 
    ifconfig_path = '/sbin' 
    if_up_cmd = 'ifconfig eth0:0 inet $_IP_$ netmask 255.255.255.0' 
    if_down_cmd = 'ifconfig eth0:0 down' 
    arping_path = '/usr/sbin' 
    arping_cmd = 'arping -U $_IP_$ -w 1' 

    clear_memqcache_on_escalation = on 
    wd_escalation_command = '' 

    wd_lifecheck_method = 'heartbeat' 
    wd_interval = 10 
    wd_heartbeat_port = 9694 
    wd_heartbeat_keepalive = 2 
    wd_heartbeat_deadtime = 30 
    heartbeat_destination0 = '192.168.0.2' 
    heartbeat_destination_port0 = 9694 
    heartbeat_device0 = '' 

    heartbeat_destination1 = '192.168.0.3' 
    wd_life_point = 3 
    wd_lifecheck_query = 'SELECT 1' 
    wd_lifecheck_dbname = 'postgres' 
    wd_lifecheck_user = 'postgres' 
    wd_lifecheck_password = 'postgres123' 

    relcache_expire = 0 
    relcache_size = 256 
    check_temp_table = on 

    memory_cache_enabled = off 
    memqcache_method = 'shmem' 
    memqcache_memcached_host = 'localhost' 
    memqcache_memcached_port = 11211 
    memqcache_total_size = 67108864 
    memqcache_max_num_cache = 1000000 
    memqcache_expire = 0 
    memqcache_auto_cache_invalidation = on 
    memqcache_maxcache = 409600 
    memqcache_cache_block_size = 1048576 
    memqcache_oiddir = '/var/log/pgpool/oiddir' 
    white_memqcache_table_list = '' 
    black_memqcache_table_list = '' 

Also, I hope you have modified pool_hba.conf to allow access to the master and slave servers.

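Hi Raveesh, thanks for your reply! I have enabled logging, and even at startup I noticed some errors that look like they could be related. I have edited my question to include the necessary information. – gdoug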

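Could you provide the logs produced after the master is shut down? I don't think these logs point to the real reason why the failover does not execute the script. –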

Updated again with the requested log information. – gdoug