Skip to content

Commit 84bec44

Browse files
committed
Fix initialization of the WAL buffer at startup
There were two bugs: 1. We initialized the page headers on the allocated temporary WAL page even if the end-of-log was precisely at page boundary. That means we wrote beyond the end of the allocation. That readily gives an assertion failure on debug-enabled builds. Add test case and fix. 2. We initialize the WAL buffer by copying the contents of the WAL read buffer. The idea is that when we stop reading WAL, the last buffer in the reader becomes the new WAL buffer we'll write to. That's how PostgreSQL does it too, see code around comment "Tricky point here" in StartupXLOG(). However, that doesn't work with Neon. The startup procedure is a little different: we don't do normal WAL recovery and we don't read any WAL at startup, except when promoting a read replica. Vanilla PostgreSQL always reads WAL: it reads the last checkpoint record from the WAL if nothing else, but in Neon we don't necessarily read even that. In that case, the xlogreader's read buffer is still uninitialized by the time that we copy it. That's relatively harmless, the only consequence is that the initial WAL segment on local disk can contain garbage before the first WAL record that we write. That's why we haven't noticed until now. Furthermore, it seems that the uninitialized memory just happens to be all-zeros. However, it now caused the test_pg_waldump.py test to fail with the new communicator implementation. That was very coincidental - the new communicator process isn't even running yet when the WAL buffer is initialized. It seems to have changed the memory allocation just so that the uninitialized memory is no longer all-zeros. That's normally harmless too, but it makes the pg_waldump test to fail: pg_waldump, with the --ignore option, starts reading the WAL from the first non-zero bytes, so when the uninitialized portion was filled with garbage rather than zeros, it fails. This little patch to poison the allocated buffer with garbage was helpful while debugging, to make the test fail in a repeatable fashion with or without the new communicator: ``` diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index 988be3f..2f4844c2b86 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -97,6 +97,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir, */ state->readBuf = (char *) palloc_extended(XLOG_BLCKSZ, MCXT_ALLOC_NO_OOM); + memset(state->readBuf, 0x7e, XLOG_BLCKSZ); if (!state->readBuf) { pfree(state); ```
1 parent 99ab234 commit 84bec44

File tree

1 file changed

+57
-63
lines changed

1 file changed

+57
-63
lines changed

src/backend/access/transam/xlogrecovery.c

Lines changed: 57 additions & 63 deletions
Original file line numberDiff line numberDiff line change
@@ -1611,6 +1611,16 @@ FinishWalRecovery(void)
16111611
Assert(!WalRcvStreaming());
16121612
StandbyMode = false;
16131613

1614+
/*
1615+
* We cannot start generating new WAL if we don't have a valid prev-LSN
1616+
* to use for the first new WAL record. (Shouldn't happen.)
1617+
*/
1618+
if (NeonRecoveryRequested &&!neonWriteOk)
1619+
ereport(ERROR,
1620+
(errmsg("cannot start in read-write mode from this base backup")));
1621+
1622+
// FIXME: should we unlink neon.signal?
1623+
16141624
/*
16151625
* Determine where to start writing WAL next.
16161626
*
@@ -1674,84 +1684,68 @@ FinishWalRecovery(void)
16741684
}
16751685
}
16761686

1677-
/*
1678-
* When starting from a neon base backup, we don't have WAL. Initialize
1679-
* the WAL page where we will start writing new records from scratch,
1680-
* instead.
1681-
*/
1682-
if (NeonRecoveryRequested)
1683-
{
1684-
if (!neonWriteOk)
1685-
{
1686-
/*
1687-
* We cannot start generating new WAL if we don't have a valid prev-LSN
1688-
* to use for the first new WAL record. (Shouldn't happen.)
1689-
*/
1690-
ereport(ERROR,
1691-
(errmsg("cannot start in read-write mode from this base backup")));
1692-
}
1693-
else
1694-
{
1695-
int offs = endOfLog % XLOG_BLCKSZ;
1696-
XLogRecPtr pageBeginPtr = endOfLog - offs;
1697-
bool isLongHeader = (pageBeginPtr % wal_segment_size) == 0;
1698-
int lastPageSize = isLongHeader ? SizeOfXLogLongPHD : SizeOfXLogShortPHD;
1699-
char *page = palloc0(offs);
1700-
XLogPageHeader xlogPageHdr = (XLogPageHeader) page;
1701-
1702-
memcpy(page, xlogreader->readBuf, offs);
1703-
if (xlogPageHdr->xlp_magic != XLOG_PAGE_MAGIC)
1704-
{
1705-
xlogPageHdr->xlp_pageaddr = pageBeginPtr;
1706-
xlogPageHdr->xlp_magic = XLOG_PAGE_MAGIC;
1707-
xlogPageHdr->xlp_tli = recoveryTargetTLI;
1708-
xlogPageHdr->xlp_info = 0;
1709-
/*
1710-
* If we start writing with offset from page beginning, pretend in
1711-
* page header there is a record ending where actual data will
1712-
* start.
1713-
*/
1714-
xlogPageHdr->xlp_rem_len = offs - lastPageSize;
1715-
if (xlogPageHdr->xlp_rem_len > 0)
1716-
xlogPageHdr->xlp_info |= XLP_FIRST_IS_CONTRECORD;
1717-
readOff = XLogSegmentOffset(pageBeginPtr, wal_segment_size);
1718-
1719-
if (isLongHeader)
1720-
{
1721-
XLogLongPageHeader longHdr = (XLogLongPageHeader) page;
1722-
1723-
longHdr->xlp_sysid = GetSystemIdentifier();
1724-
longHdr->xlp_seg_size = wal_segment_size;
1725-
longHdr->xlp_xlog_blcksz = XLOG_BLCKSZ;
1726-
1727-
xlogPageHdr->xlp_info |= XLP_LONG_HEADER;
1728-
}
1729-
}
1730-
result->lastPageBeginPtr = pageBeginPtr;
1731-
result->lastPage = page;
1732-
elog(LOG, "Continue writing WAL at %X/%X", LSN_FORMAT_ARGS(xlogreader->EndRecPtr));
1733-
1734-
// FIXME: should we unlink neon.signal?
1735-
}
1736-
}
1687+
elog(LOG, "Continue writing WAL at %X/%X", LSN_FORMAT_ARGS(endOfLog));
17371688

17381689
/*
17391690
* Copy the last partial block to the caller, for initializing the WAL
17401691
* buffer for appending new WAL.
17411692
*/
1742-
else if (endOfLog % XLOG_BLCKSZ != 0)
1693+
if (endOfLog % XLOG_BLCKSZ != 0)
17431694
{
17441695
char *page;
17451696
int len;
17461697
XLogRecPtr pageBeginPtr;
17471698

17481699
pageBeginPtr = endOfLog - (endOfLog % XLOG_BLCKSZ);
1749-
Assert(readOff == XLogSegmentOffset(pageBeginPtr, wal_segment_size));
17501700

17511701
/* Copy the valid part of the last block */
17521702
len = endOfLog % XLOG_BLCKSZ;
17531703
page = palloc(len);
1754-
memcpy(page, xlogreader->readBuf, len);
1704+
1705+
/*
1706+
* With neon, it's possible that we start without having read any WAL
1707+
* whatsoever. In that case, initialize the WAL page where we will
1708+
* start writing new records from scratch, instead.
1709+
*/
1710+
if (NeonRecoveryRequested && endOfLog == RedoStartLSN)
1711+
{
1712+
bool isLongHeader = (pageBeginPtr % wal_segment_size) == 0;
1713+
int lastPageSize = isLongHeader ? SizeOfXLogLongPHD : SizeOfXLogShortPHD;
1714+
XLogPageHeader xlogPageHdr = (XLogPageHeader) page;
1715+
1716+
Assert(len >= lastPageSize);
1717+
1718+
xlogPageHdr->xlp_pageaddr = pageBeginPtr;
1719+
xlogPageHdr->xlp_magic = XLOG_PAGE_MAGIC;
1720+
xlogPageHdr->xlp_tli = recoveryTargetTLI;
1721+
xlogPageHdr->xlp_info = 0;
1722+
/*
1723+
* If we start writing with offset from page beginning, pretend in
1724+
* page header there is a record ending where actual data will
1725+
* start.
1726+
*/
1727+
xlogPageHdr->xlp_rem_len = len - lastPageSize;
1728+
if (xlogPageHdr->xlp_rem_len > 0)
1729+
xlogPageHdr->xlp_info |= XLP_FIRST_IS_CONTRECORD;
1730+
readOff = XLogSegmentOffset(pageBeginPtr, wal_segment_size);
1731+
1732+
if (isLongHeader)
1733+
{
1734+
XLogLongPageHeader longHdr = (XLogLongPageHeader) page;
1735+
1736+
longHdr->xlp_sysid = GetSystemIdentifier();
1737+
longHdr->xlp_seg_size = wal_segment_size;
1738+
longHdr->xlp_xlog_blcksz = XLOG_BLCKSZ;
1739+
1740+
xlogPageHdr->xlp_info |= XLP_LONG_HEADER;
1741+
}
1742+
}
1743+
else
1744+
{
1745+
Assert(readOff == XLogSegmentOffset(pageBeginPtr, wal_segment_size));
1746+
1747+
memcpy(page, xlogreader->readBuf, len);
1748+
}
17551749

17561750
result->lastPageBeginPtr = pageBeginPtr;
17571751
result->lastPage = page;

0 commit comments

Comments
 (0)