Debugging sudden increase in Postgres disk usage
We recently had an issue where our service went down because the Postgres server went down. A quick look at the database server logs showed that the data partition for Postgres was full. Disk usage history showed that the partition had grown by 65 GB within 15 minutes.
Quick remedy
A quick look at the disk usage of the data directory revealed that pg_xlog and base were the biggest consumers of disk space.
$ du -sh /pg-data/postgres/9.4/data/*
387G base
664K global
48M pg_clog
0 pg_dynshmem
8.0K pg_hba.conf
4.0K pg_ident.conf
0 pg_logical
147M pg_multixact
8.0K pg_notify
0 pg_replslot
0 pg_serial
0 pg_snapshots
0 pg_stat
188K pg_stat_tmp
192K pg_subtrans
0 pg_tblspc
0 pg_twophase
4.0K PG_VERSION
65G pg_xlog
We copied pg_xlog to a different partition with more space and created a symlink to the new location.
$ cd /pg-data/postgres/9.4/data/
# Stop Postgres
$ sudo /etc/init.d/postgresql stop
# Copy the current pg_xlog directory to the new partition
$ cp -a pg_xlog /wal/
# Remove the original pg_xlog directory
$ rm -rf pg_xlog
# Log in as the postgres user so the symlink gets the correct ownership
$ su - postgres
# su - lands in the postgres home directory, so change back to the data directory
$ cd /pg-data/postgres/9.4/data/
# Create the symlink
$ ln -s /wal/pg_xlog pg_xlog
# Start Postgres
$ sudo /etc/init.d/postgresql start
With this, we were able to restart Postgres and bring operations back to normal.
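As a quick sanity check that everything is back in place, it helps to confirm that the symlink points at the new partition and that fresh WAL segments are landing there (a sketch, using our paths):
# Verify the symlink and check that new segments are being written under /wal
$ ls -ld /pg-data/postgres/9.4/data/pg_xlog
$ ls -lt /wal/pg_xlog | head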
Finding the cause
The usual size of pg_xlog is about 1.2 GB, so we needed to figure out why it grew to 65 GB within 15 minutes.
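For context, each WAL segment file is 16 MB by default, so 1.2 GB is roughly 75 segments. On 9.4, the steady-state size of pg_xlog is bounded mainly by a few postgresql.conf settings; the values below are illustrative, not our production configuration:
# postgresql.conf knobs that bound pg_xlog size on 9.4 (illustrative values)
checkpoint_segments = 32            # segments between automatic checkpoints
checkpoint_completion_target = 0.5
wal_keep_segments = 0               # extra segments kept for streaming replicas
# pg_xlog normally stays around
# (2 + checkpoint_completion_target) * checkpoint_segments + 1 segments.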
Failing archival brought down the Postgres server.
We looked at the Postgres logs to see when the low-space issue started and found this:
scp: /usr/ubu/var/postgres/server1.wal_tmp/000000010000315D00000075.gz: No space left on device
rm: cannot remove ‘/usr/ubu/var/postgres/wal.archive/000000010000315D00000075’: No such file or directory
2019-08-06 12:41:36 CEST [6915]: [1-1] user=,db=,app=,client= LOG: archive command failed with exit code 1
2019-08-06 12:41:36 CEST [6915]: [2-1] user=,db=,app=,client= DETAIL: The failed archive command was: /usr/ubu/var/postgres/bin/archiveWAL.sh pg_xlog/000000010000315D00000075 000000010000315D00000075
It seemed like the archive command was failing!
We have Continuous Archiving and Point-in-Time Recovery (PITR) set up for our Postgres instance. In the ideal scenario, the archive command copies the Write Ahead Log (WAL) segment files to a different location, and the old WAL segment files are cleared from pg_xlog once they have been archived.
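For reference, continuous archiving is driven by a couple of postgresql.conf settings; ours looks roughly like this (the script path is the one from our logs, the other values are illustrative):
# postgresql.conf (roughly; the archive script path is from our setup)
wal_level = archive
archive_mode = on
# %p is the path of the segment to archive, %f is its file name.
# Postgres keeps a segment in pg_xlog until this command exits with status 0.
archive_command = '/usr/ubu/var/postgres/bin/archiveWAL.sh %p %f'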
From these logs, we understood that the disk to which the WAL segment files are archived was full. That caused the archive command in our PITR setup to fail, which in turn let the pg_xlog directory grow, since WAL segment files are not removed until they have been successfully archived.
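A handy way to see this backlog on disk (a sketch, using our data directory path) is to count the .ready marker files under pg_xlog/archive_status; each one is a segment still waiting to be archived:
# Count WAL segments waiting to be archived
$ ls /pg-data/postgres/9.4/data/pg_xlog/archive_status/ | grep -c '\.ready$'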
Too many WAL segment files.
We wanted to understand how the archive disk got full so fast. While investigating, we realised that an unusually large number of WAL segment files had been created starting about an hour before the incident.
Usually there are two to four WAL segment files per minute on average, but during this period we had more than 40 per minute.
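To get these per-minute numbers, one way (a sketch; the archive path here is from our logs, and the same works against pg_xlog itself) is to group the segment files by their modification minute:
# Count archived WAL segments per minute (GNU ls)
$ ls -l --time-style='+%Y-%m-%d %H:%M' /usr/ubu/var/postgres/wal.archive/ \
    | awk 'NF > 6 {print $6, $7}' | sort | uniq -c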
What is in the WAL files?
Why were there suddenly so many WAL segment files? Were there specific updates or inserts causing this?
Using pg_xlogdump, we looked at one of the WAL files from the time of the issue and saw this:
...
rmgr: Heap (Skipped...) desc: update: rel 4552/292934377/423775885; (Skipped...)
rmgr: Btree (Skipped...) desc: insert: rel 4552/292934377/423775891; (Skipped...)
rmgr: Btree (Skipped...) desc: insert: rel 4552/292934377/674927843; (Skipped...)
rmgr: Heap (Skipped...) desc: update: rel 4552/292934377/423775885; (Skipped...)
rmgr: Btree (Skipped...) desc: insert: rel 4552/292934377/423775891; (Skipped...)
rmgr: Btree (Skipped...) desc: insert: rel 4552/292934377/674927843; (Skipped...)
rmgr: Heap (Skipped...) desc: update: rel 4552/292934377/423775885; (Skipped...)
rmgr: Btree (Skipped...) desc: insert: rel 4552/292934377/423775891; (Skipped...)
rmgr: Btree (Skipped...) desc: insert: rel 4552/292934377/674927843; (Skipped...)
...
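For reference, pg_xlogdump (renamed pg_waldump from Postgres 10 onwards) takes a WAL segment file as its argument; invoking it looks roughly like this, using the segment name from our logs:
# Dump the records of a single WAL segment (path and segment name from our case)
$ pg_xlogdump /wal/pg_xlog/000000010000315D00000075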
There were a lot of updates on the relation 4552/292934377/423775885. We needed to find the corresponding table from this rel information.
According to this article, the rel identifier contains three parts:
4552 is the OID of the tablespace.
292934377 is the OID of the database.
423775885 is the filenode ID of a relation in that database. This can be any relation in Postgres (a table, an index, etc.).
To map the filenode ID to a relation on Postgres 9.4, we used pg_filenode_relation with the tablespace OID and the filenode ID.
exp-prod=> SELECT pg_filenode_relation(4552, 423775885);
pg_filenode_relation
----------------------
companies_pkey
This told us that the updates were on the companies table and that the other inserts in the WAL were to indexes on the companies table.
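As a cross-check, the remaining filenodes from the WAL dump can be mapped the same way, or one can go the other direction and list the filenodes of the table and its indexes from pg_class (a sketch, using the companies table from our case):
-- Map the other filenodes seen in the WAL dump
SELECT pg_filenode_relation(4552, 423775891);
SELECT pg_filenode_relation(4552, 674927843);
-- Or list the filenodes of the companies table and its indexes directly
SELECT c.relname, c.relfilenode, c.relkind
FROM pg_class c
WHERE c.oid = 'companies'::regclass
   OR c.oid IN (SELECT indexrelid FROM pg_index
                WHERE indrelid = 'companies'::regclass);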
Using this, we were able to find the table causing the issue and refactor the code to prevent these excessive updates from happening.
Summary
There were too many updates on a table, which caused too many WAL segment files to be created.
These segment files were not archived to the remote disk because the remote disk was full.
The failed archival caused the pg_xlog directory to grow until it filled the local disk, which brought Postgres down.
We used pg_xlogdump and pg_filenode_relation to read the WAL segment files and find the table whose updates caused the WAL to grow.