We recently had an issue with our service being down because the Postgres server went down. We quickly looked at the database server logs and realised that the data partition for Postgres was full. Disk usage history showed us that the partition had grown by 65 GB within 15 minutes.

Quick remedy

A quick look at the disk usage on the data directory revealed that pg_xlog and base were the biggest consumers of disk space.

$ du -sh /pg-data/postgres/9.4/data/*
387G    base
664K    global
48M     pg_clog
0       pg_dynshmem
8.0K    pg_hba.conf
4.0K    pg_ident.conf
0       pg_logical
147M    pg_multixact
8.0K    pg_notify
0       pg_replslot
0       pg_serial
0       pg_snapshots
0       pg_stat
188K    pg_stat_tmp
192K    pg_subtrans
0       pg_tblspc
0       pg_twophase
4.0K    PG_VERSION
65G     pg_xlog

We copied pg_xlog to a different partition that had more space and created a symlink to the new location.

$ cd /pg-data/postgres/9.4/data/

# Stop postgres
$ sudo /etc/init.d/postgresql stop

# Copy the current pg_xlog directory to the new partition
$ cp -a pg_xlog /wal/

# Remove the original pg_xlog directory
$ rm -rf pg_xlog

# Log in as the postgres user so the symlink gets the correct ownership
$ su - postgres

# Create the symlink
$ ln -s /wal/pg_xlog pg_xlog

# Start postgres
$ sudo /etc/init.d/postgresql start

With this, we were able to restart Postgres and resume normal operations.
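
Before moving on, a couple of sanity checks are worth running at this point (pg_isready ships with Postgres 9.3 and later):

$ cd /pg-data/postgres/9.4/data/

# The symlink should point at the new partition and be owned by postgres
$ ls -ld pg_xlog

# Postgres should be accepting connections again
$ pg_isready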

Finding the cause

The usual size of pg_xlog is about 1.2 GB. We needed to figure out why it grew to 65 GB within 15 minutes.
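
To put those numbers in perspective, assuming the default 16 MB WAL segment size:

$ echo "$((65 * 1024 / 16)) segments in 65 GB"
4160 segments in 65 GB
$ echo "$((1200 / 16)) segments in 1.2 GB"
75 segments in 1.2 GB

So pg_xlog went from well under a hundred segments to over four thousand in a quarter of an hour.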

Failing archival brought down the Postgres server.

We looked in the Postgres log to see when the low-space issue started, and we found this:

scp: /usr/ubu/var/postgres/server1.wal_tmp/000000010000315D00000075.gz: No space left on device
rm: cannot remove ‘/usr/ubu/var/postgres/wal.archive/000000010000315D00000075’: No such file or directory
2019-08-06 12:41:36 CEST [6915]: [1-1] user=,db=,app=,client= LOG:  archive command failed with exit code 1
2019-08-06 12:41:36 CEST [6915]: [2-1] user=,db=,app=,client= DETAIL:  The failed archive command was: /usr/ubu/var/postgres/bin/archiveWAL.sh pg_xlog/000000010000315D00000075 000000010000315D00000075

It seems like the archive command was failing!

We have Continuous Archiving and Point-in-Time Recovery (PITR) set up for our Postgres instance. In the ideal scenario, the archive command copies each completed Write Ahead Log (WAL) segment file to a different location, and the old segment files that have been archived are then cleared out of pg_xlog.
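
archiveWAL.sh is our own wrapper script, so its contents are not shown here. Purely to illustrate how this kind of setup hangs together, here is a minimal sketch of such a wrapper, pieced together from the log lines above (the real script certainly differs; the host name is made up):

#!/bin/bash
# Hypothetical archive_command wrapper. The DETAIL line above suggests it is
# configured roughly as:
#   archive_command = '/usr/ubu/var/postgres/bin/archiveWAL.sh %p %f'
# $1 = path to the WAL segment (%p), $2 = segment file name (%f)
set -euo pipefail

WAL_PATH="$1"
WAL_NAME="$2"
ARCHIVE_HOST="archive-host"                       # assumption
TMP_DIR="/usr/ubu/var/postgres/server1.wal_tmp"   # path from the scp error above

# Compress and ship the segment. Any failure makes the script exit non-zero,
# which tells Postgres the segment is not safely archived and must be kept.
gzip -c "$WAL_PATH" > "/tmp/${WAL_NAME}.gz"
scp "/tmp/${WAL_NAME}.gz" "${ARCHIVE_HOST}:${TMP_DIR}/${WAL_NAME}.gz"
rm -f "/tmp/${WAL_NAME}.gz"

The important property is the exit code: a failing archive command makes Postgres retry the same segment and hold on to everything in pg_xlog until archival succeeds.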

From this, we understood that the disk to which the WAL segment files are archived was full. That caused the archive command in our PITR setup to fail, which in turn let the pg_xlog directory grow, because WAL segment files are not deleted until they have been archived successfully.
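
In hindsight, this is something to alert on. From 9.4 onward, pg_stat_archiver exposes archiver failures; a minimal check (exp-prod being the database name used later in this post):

$ psql exp-prod -c "SELECT failed_count, last_failed_wal, last_failed_time FROM pg_stat_archiver;"

A failed_count that keeps climbing is an early warning that pg_xlog is about to start piling up.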

Too many WAL segment files.

We wanted to understand how the archive disk got full so fast. While investigating, we realised that an unusually large number of WAL segment files had been created, starting about an hour before the incident.

Usually, on average, there are two to four WAL segment files per minute, but during this period we had more than 40 per minute.
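
One rough way to see the rate from the archived segments themselves (a sketch; it assumes GNU ls and that the wal.archive directory from the error messages above is reachable):

$ ls -l --time-style=+%Y-%m-%d_%H:%M /usr/ubu/var/postgres/wal.archive \
    | awk 'NR > 1 {print $6}' | sort | uniq -c

Each output line is the number of segments archived in that minute.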

What is in WAL files?

Why were there suddenly so many WAL segment files? Were there any specific updates or inserts causing this?

Using pg_xlogdump, we looked at one of the WAL segment files written during the time of the issue.
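
A sketch of the invocation (the segment name is the one from the failed archive above; the path reflects the relocated pg_xlog, so adjust both as needed):

$ pg_xlogdump /wal/pg_xlog/000000010000315D00000075 | less

An excerpt of the output: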

...
rmgr: Heap   (Skipped...) desc: update: rel 4552/292934377/423775885; (Skipped...)
rmgr: Btree  (Skipped...) desc: insert: rel 4552/292934377/423775891; (Skipped...)
rmgr: Btree  (Skipped...) desc: insert: rel 4552/292934377/674927843; (Skipped...)
rmgr: Heap   (Skipped...) desc: update: rel 4552/292934377/423775885; (Skipped...)
rmgr: Btree  (Skipped...) desc: insert: rel 4552/292934377/423775891; (Skipped...)
rmgr: Btree  (Skipped...) desc: insert: rel 4552/292934377/674927843; (Skipped...)
rmgr: Heap   (Skipped...) desc: update: rel 4552/292934377/423775885; (Skipped...)
rmgr: Btree  (Skipped...) desc: insert: rel 4552/292934377/423775891; (Skipped...)
rmgr: Btree  (Skipped...) desc: insert: rel 4552/292934377/674927843; (Skipped...)
...

There were a lot of updates on the relation 4552/292934377/423775885. We needed to find the corresponding table from this rel information.

According to this article, the rel identifier contains three parts:

  1. 4552 is the tablespace OID.
  2. 292934377 is the database’s OID (both of these OIDs can be cross-checked against the system catalogs, as shown below).
  3. 423775885 is the filenode ID of a relation in that database. This can be any relation in Postgres (a table, an index, etc.).
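
A quick cross-check against the catalogs (exp-prod is the database name from the psql prompt further down):

$ psql exp-prod -c "SELECT spcname FROM pg_tablespace WHERE oid = 4552;"
$ psql exp-prod -c "SELECT datname FROM pg_database WHERE oid = 292934377;"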

To map the filenode ID to a relation on Postgres 9.4, we used pg_filenode_relation with the tablespace OID and the filenode ID.

exp-prod=> SELECT pg_filenode_relation(4552, 423775885);
 pg_filenode_relation
----------------------
 companies_pkey

This let us know that the updates were on the companies table and that the other inserts in the WAL were into indexes on the companies table.
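
The two filenodes from the Btree insert records can be resolved the same way, or looked up directly in pg_class (a sketch along the same lines):

$ psql exp-prod -c "SELECT relname, relkind FROM pg_class WHERE relfilenode IN (423775891, 674927843);"

A relkind of 'i' here would confirm that these are indexes.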

Using this, we were able to find the table causing the issue and refactor the code to prevent these excessive updates from happening.

Summary

There were too many updates on a single table, which caused too many WAL segment files to be created. These segment files were not archived to the remote disk because the remote disk was full. The failing archival caused the pg_xlog directory to grow until it filled the local disk, which brought Postgres down.

We used pg_xlogdump and pg_filenode_relation to read the WAL segment files and find the table whose updates caused the WAL to grow.