Database Replication Failure

Overview

After an OS upgrade, a database replication process may fail. As result, if the master goes down, the slave node will not be able to become the master.

Failure Detection

To detect if a database replication has failed, follow this process.


  1. On the master node, run the following command:

    psql -U postgres -c 'select * from pg_stat_replication;'


    Sample output in a healthy cluster.

     [root@cliqr-centos7-base-image ~]# psql -U postgres -c 'select * from pg_stat_replication;'
     pid | usesysid | usename | application_name | client_addr | client_hostname | client_port |
     backend_start | backend_xmin | state | sent_location | write_location | flush_locatio
     n | replay_location | sync_priority | sync_state
     --------------------------------------------------------------------------------+
     -----------------------------------------------------------------------------+
     ------------------------------------------+-----------
     20381 | 16389 | replication | node1 | 10.193.73.214 | | 49112 | 20
     20-05-06 21:09:27.877976+00 | | streaming | B/C12AF710 | B/C12AF710 | B/C12AF710 
    
    B/C12AF570        1           sync (1 row)


  • In a cluster where replication is not happening, you will not see any rows being displayed.

Workaround

On the master node, execute the following steps:

  1. Tail the log file and look for replication slot name

    [root@db pg_log]# cd /var/lib/pgsql/9.6/data/pg_log
    [root@db pg_log]# tailf postgresql-Wed.log
    (make sure to tail the appropriate log file based on the day of the week)


    You will see lines similar to the following code block.

    < 2020-05-06 05:22:49.208 UTC > ERROR: replication slot "cliqr_rep_slot_node2" does not exist
     < 2020-05-06 05:22:54.216 UTC > ERROR: replication slot "cliqr_rep_slot_node2" does not exist
     < 2020-05-06 05:22:59.219 UTC > ERROR: replication slot "cliqr_rep_slot_node2" does not exist
     ^C
  2. Create a replication slot manually by using the following command:
    (replace the name of the replication slot if needed based on the output seen in the log file)

    [root@db pg_log]# psql -U postgres -c "SELECT * FROM pg_create_physical_replication_slot('cliqr_rep_slot_node2')"
     slot_name | xlog_position
     --------------------+-------------
     cliqr_rep_slot_node2 | (1 row)
  3. After about 10-15 seconds, run the following command on the master node to verify replication is working:

    psql -U postgres -c 'select * from pg_stat_replication;'

Replication on the Master Node

Each time there is a fail over, the replication slot must be created on the new master node:

If a master node goes down which results in a fail over, the failed node must be brought back up as the new slave node by the admin to make the cluster healthy again.


In addition, after the failed node is restored to healthy state, be sure to verify and fix the replication slot in the master node by using the above procedure.


  • No labels
Terms & Conditions Privacy Statement Cookies Trademarks