Postgres Replication Catch-Up

Well, this is my very first blog entry.  After working with DB2 for most of my career, I have recently taken a job working with PostgreSQL.  I had worked with it a little previously on a couple of smaller projects and am very happy to be working with it pretty much full time now.

So, in my new environment, we have a couple of servers set up as replication targets using the Postgres built-in streaming replication.  I have learned a few things I’d like to share regarding replication.

One thing that can cause issues with replication is network slowness or connection loss between your master and slave servers.  Another scenario is really heavy write activity on your master that the slave(s) cannot keep up with.  If you run into either of these, your replication may fall behind.

You can recognize that you are behind by running a couple of queries.  On the master, run this (the value shown is just an example; yours will differ):

master# select pg_current_xlog_location();
 pg_current_xlog_location
--------------------------
 D0/95E12A18
(1 row)

Then on the slave server, you can run a query like this:

slave# select pg_last_xlog_receive_location();
 pg_last_xlog_receive_location
-------------------------------
 D0/95C80000
(1 row)

The hex digits before the slash, together with the first 2 digits after it, identify the transaction log file you’re on (D0 and 95 in the examples above), and the remaining digits are the position within that file.  If the log file matches between the two queries, you’re in pretty good shape in terms of your slave keeping up with the master.
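If you don’t feel like decoding that value by hand, Postgres can do it for you.  On these older versions, pg_xlogfile_name() turns a location into the corresponding log file name (assuming you’re on timeline 1, as in the example here), and on 9.2 and later pg_xlog_location_diff() gives you the lag between two locations in bytes:

master# select pg_xlogfile_name('D0/95E12A18');
     pg_xlogfile_name
--------------------------
 00000001000000D000000095
(1 row)

master# select pg_xlog_location_diff('D0/95E12A18', 'D0/95C80000');
 pg_xlog_location_diff
-----------------------
               1649176
(1 row)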

As an alternative to the queries, on the master you can go to the pg_xlog directory within your Postgres data directory and look at the most recent log file, and on the slave you can do this:

$ ps -ef | grep -i "postgres: startup"

postgres  2549  2546  0 Apr09 ?        00:28:04 postgres: startup process   recovering 00000001000000D000000095

Scroll all the way to the right and you’ll see the log file currently being recovered.  If this matches the most recent log file on the master, then again you’re in pretty good shape.
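For the master side of that comparison, grabbing the name of the most recently touched log file is a one-liner (the data directory path here is just an example; yours may differ):

master$ ls -t /var/lib/pgsql/data/pg_xlog | grep -v archive_status | head -n 1
00000001000000D000000095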

Normally your replication will catch up on its own if it doesn’t fall too far behind.  How far behind is too far?  That depends on the wal_keep_segments parameter in your postgresql.conf file.  If your replication falls behind by more than wal_keep_segments log files, then you have a problem.

The wal_keep_segments parameter is the number of past transaction log files that Postgres will keep around in your pg_xlog directory before it starts purging them.  So, if your slave falls behind by more log files than the value of wal_keep_segments, the log file it needs will no longer be there on the master when it tries to catch up.
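For reference, here’s roughly what the relevant master-side settings look like in postgresql.conf (the values are just illustrative; with the default 16MB segment size, keeping 128 segments costs about 2GB of disk in pg_xlog):

# postgresql.conf on the master -- values are illustrative
wal_level = hot_standby        # archive also works for streaming; hot_standby lets the slave serve reads
max_wal_senders = 3            # number of simultaneous slave connections allowed
wal_keep_segments = 128        # keep ~2GB of old segments around for lagging slaves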

Now if you are also archiving off your transaction logs (using archive_mode = on with a valid archive_command in your postgresql.conf), then you can recover your replication slave pretty easily.  All you have to do is pull all the logs from your archive, starting with the log file your slave needs (based on the ps command given above), and copy them into the pg_xlog directory within your Postgres data directory on the slave server.  Postgres will automatically find them there and apply them to the replicated DB.  Once all the log files no longer available on the master have been applied, the slave will once again be able to request log files directly from the master, and replication will resume just as it was prior to whatever disruption caused the problem.
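As a sketch, assuming the archive is just a directory of log files on a host I’ll call archivehost (the hostname and all the paths here are made up; substitute your own), the copy can be as blunt as this:

slave$ # run as the postgres user so the copied files have the right ownership
slave$ rsync -av archivehost:/wal_archive/ /var/lib/pgsql/data/pg_xlog/

If the archive is large, you’d want to limit that to the files from the one the startup process is stuck on (00000001000000D000000095 in the example above) onward; anything older just wastes disk space until Postgres cleans it up.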

If you don’t archive your log files, then you’re in kinda sad shape.  You’ll need to take a new pg_basebackup on the master and restore it on your slave machine in order to get the replication back up and running again.
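For completeness, a minimal sketch of that rebuild, run on the slave as the postgres user (masterhost, the replicator user, and the data directory path are all placeholders, and you’ll still need your recovery.conf just like in the original setup):

slave$ pg_ctl -D /var/lib/pgsql/data stop
slave$ mv /var/lib/pgsql/data /var/lib/pgsql/data.old

slave$ # fresh base backup straight from the master, including the needed xlogs
slave$ pg_basebackup -h masterhost -U replicator -D /var/lib/pgsql/data -P --xlog

slave$ # put the recovery.conf back and start the slave; it reconnects and streams
slave$ cp /var/lib/pgsql/data.old/recovery.conf /var/lib/pgsql/data/
slave$ pg_ctl -D /var/lib/pgsql/data start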