Stop me if you've heard this one before: you have a use case for Postgres logical replication, and a few minutes/hours/days after enabling it, PagerDuty is yelling that the database's disk is dangerously full and would you please do something about it before bits start spilling out onto the datacentre floor?

The most common cause of logical replication consuming disk (and the one this post is not about, so bear with me) is that the replication consumer is offline. A logical replication slot works by keeping track of the oldest position in the oldest write-ahead log (WAL) file that the consumer has already replicated; postgres will refuse to delete anything newer from disk until that cursor advances. As changes are replicated, the consumer advances the slot's restart_lsn, and WAL files older than that position can be cleaned up. If the consumer crashes and stops advancing the cursor, postgres retains a copy of all unconsumed changes so that the consumer can pick up where it left off. It follows that if the disk is filling up, the consumer is probably offline.

So I was taken a bit by surprise when some of our RDS postgres instances began eating their disks while the logical replication consumer was online, active, and claiming (accurately) that it saw no new changes to replicate. This isn't a narrative trick where I later reveal that I was wrong and the consumer wasn't really running, because it was definitely running! This was an environment with many databases set up for logical replication, and yet the disk-filling behaviour was occurring on only a small handful.

A heatmap analysis showed postgres instances falling into one of three categories:

1. Postgres instances that were write-heavy seemed to be working normally and were not filling up their disk.
2. Postgres instances with infrequent writes were filling up their disk when idle, then reclaiming the space shortly after a write occurred.
3. Postgres instances with no writes were filling up disk space at a slow but steady rate.

Maybe the second category could be attributed to databases with large changesets that were being vacuumed, but the metrics didn't back that story: RDS metrics differentiate between transaction log usage and general disk usage, and the transaction log disk usage showed the very same sawtooth pattern as space was reclaimed. Databases in the third category made even less sense: nothing was being written, so where was the disk space going?

Here's what's going on: a logical replication slot consumes changes from one database in a postgres instance, whereas the WAL files store changes made across all databases in that instance. This kind of makes sense, since WAL files are also used for streaming replication, where the WAL files themselves are replicated to another instance rather than the changes within. That seems further along than restart_lsn, and yet the consumer is reporting no changes on bigdata.
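The slot mechanics described above come down to arithmetic on WAL positions: restart_lsn is a 64-bit offset into the WAL stream, printed as two hex halves (e.g. `16/B374D848`). A small sketch of how one might measure how much WAL a slot is pinning on disk — the helper names and example values here are mine, not from the post:

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a postgres pg_lsn string like '16/B374D848' to a byte offset.

    The text form is two hex numbers: the high 32 bits and the low 32 bits
    of a 64-bit position in the WAL stream.
    """
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def slot_lag_bytes(current_wal_lsn: str, restart_lsn: str) -> int:
    """Roughly how much WAL postgres must retain for a slot: the distance
    between the instance's current WAL position and the slot's restart_lsn."""
    return lsn_to_bytes(current_wal_lsn) - lsn_to_bytes(restart_lsn)


# Hypothetical values, as they might appear in pg_replication_slots
# and pg_current_wal_lsn():
print(slot_lag_bytes("16/B374D848", "16/B0000000"))  # ~55 MiB of pinned WAL
```

In a live system the two inputs would come from querying `pg_current_wal_lsn()` and the `restart_lsn` column of `pg_replication_slots`; when the consumer is healthy this lag stays small, and when the cursor stops advancing it grows without bound.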