A couple of days back a new feature has landed in the PostgreSQL world for the development of version 11, in the shape of the following commit:
commit: 6d9fa52645e71711410a66b5349df3be0dd49608
author: Peter Eisentraut <peter_e@gmx.net>
date: Mon, 11 Sep 2017 16:48:30 -0400
pg_receivewal: Add --endpos option
This is primarily useful for making tests of this utility more
deterministic, to avoid the complexity of starting pg_receivewal as a
deamon in TAP tests.
While this is less useful than the equivalent pg_recvlogical option,
users can as well use it for example to enforce WAL streaming up to a
end-of-backup position, to save only a minimal amount of WAL.
Use this new option to stream WAL data in a deterministic way within a
new set of TAP tests.
Author: Michael Paquier
pg_receivewal is mainly used in production deployments as a standalone daemon continuously streaming WAL from a given Postgres instance, be it a primary or a standby, to archive WAL which is afterwards used for recovery of backups or PITR purposes. This is also useful to allow standbys to catch up a primary when they have been disconnected for a long time, even wanted WAL segments may have been recycled on the primary after too many checkpoints. The use of replication slots can leverage that, but those can show as well their limits if the local pg_wal data gets too much bloated, leading to a replication slot being forcibly dropped (note that there is a patch proposed for integration with Postgres 11 allowing to advance the oldest WAL position of a physical slot).
Being firstly designed for testing purposes thanks to which the code coverage of for example pg_receivewal.c has been increased from 15% to 55% using a set of dedicated TAP tests really simplified by the fact that pg_receivewal can be controlled easily without signals with this feature, this option is as well useful for backup purposes. When taking a base backup using the set of low-level APIs pg_start_backup() and pg_stop_backup(), the WAL position (LSN) returned by pg_stop_backup() marks the point in WAL replay up to which the PostgreSQL instance that has been backed up can be recovered into a consistent state. When archiving is enabled, pg_stop_backup() is careful to wait for WAL segments needed for recovery to be archived before telling back to the client that the backup is successfully completed. Note as well that pg_stop_backup() enforces a WAL segment switch for this purpose.
Archiving using archive_mode and archive_command unfortunately complicates the failure handling in this area and may put in danger the PostgreSQL instance itself with a full pg_wal path if the archive command repeats failing for a reason or another. So some users prefer the possibility to fetch WAL segments in a way similar to what standbys use, via streaming replication. The failure handling happens then independently of the primary instance of Postgres running.
Using a daemon of pg_receivewal, one can be sure that a base is safe to use by monitoring the on-disk progress of WAL segments generated, and then making sure that the LSN position returned by pg_stop_backup() corresponds to a segment taht has been completed (well what I am writing here is not completely true either, one could always look at the contents of the current partial segment written at record level, but this is way more costly than looking at completed segments). However, using –endpos, it is possible to request an instance of pg_receivewal to stop streaming up to the position where backup can reach a consistent state, and no more, saving some WAL archive fetching which may not be needed.
Note that pg_receivewal determines the position it starts streaming by first looking at the contents of the archives it is saving data to and looking at the lastly completed segment, and requests the current LSN position generated by PostgreSQL if it finds nothing. It also always begins streaming from the beginning of a segment. So be careful when using it, and you would likely want to pair it with a replication slot, which in turn requires monitoring of its own with for example a background worker scanning pg_replication_slots periodically.