Thursday, May 31, 2018

Exploding kittens, I mean, databases.

There was a time when our database storage was depleting like there was no tomorrow: it went through 80GB of disk space in less than 48 hours.


Before any investigation was conducted, we did the obvious and gave the machine more disk space. The operation was painless on RDS.

The sequence of checks was:

* Shit, did we forget to clean the log?
Nope, the logs were removed after 3 days; that task was handled by RDS.
We also checked the database and its internal table sizes with the following query. It looked like all the occupied storage was actually spent on the data itself.

SELECT
    relname AS "Table",
    pg_size_pretty(pg_total_relation_size(relid)) AS "Size",
    pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid)) AS "External Size"
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC;
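
As a cross-check against those per-table numbers, the database-level total is worth a look too. A small sketch; current_database() simply resolves to whichever database you are connected to:

-- Total size of the current database, for comparison with the per-table sums above.
-- Note: this covers the database's data files only; WAL lives outside any single database.
SELECT pg_size_pretty(pg_database_size(current_database())) AS database_size;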

* Was there any migration / application change in the last couple of days that might be responsible for this?
Well, there was: we changed a column from varchar(500) to text. But no, this shouldn't explode the storage; to Postgres, varchar and text are the same thing under the hood. Check this article from Depesz: https://www.depesz.com/2010/03/02/charx-vs-varcharx-vs-varchar-vs-text/
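
For illustration only (the table and column names here are made up), the migration was of this shape, and pg_column_size shows the two types store values identically:

-- Hypothetical version of the column change; the real table/column names differ.
ALTER TABLE messages ALTER COLUMN body TYPE text;

-- varchar(500) and text occupy the same space on disk, so the type change
-- alone shouldn't eat storage.
SELECT pg_column_size('hello'::varchar(500)) AS varchar_size,
       pg_column_size('hello'::text)         AS text_size;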

* Was autovacuum on?
Autovacuum is pretty crucial to Postgres; it frees the storage left behind by update and delete statements. Our autovacuum might have been off, because the result of this query was alarming:

pp_pqsql_prod=> select count(*) from pg_stat_all_tables where last_autovacuum is not null;
 count
-------
     0
(1 row)

A sanity check on another RDS instance confirmed the abnormality:

db_authenticator_pp=> select count(*) from pg_stat_all_tables where last_autovacuum is not null;
 count
-------
    26
(1 row)

However, by the end of the investigation we still didn't know why there were no autovacuum records. DevOps confirmed the option had been on the whole time, and the autovacuum records did show up a couple of hours after the incident. We also ran an experiment of changing the storage on another RDS instance to see if that would wipe out the autovacuum records; it didn't.
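
For the record, whether the option is on can also be checked from inside Postgres. A minimal sketch of that check:

-- Autovacuum-related settings; autovacuum also relies on track_counts being
-- enabled to know which tables need vacuuming.
SELECT name, setting
FROM pg_settings
WHERE name IN ('autovacuum', 'track_counts');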

* And the last thing was AWS Database Migration Service (DMS).
We were using DMS to sync the content of our Postgres to a few other databases. The DMS machine had recently been scaled down to save cost, but at some point it became too small to fully digest the data changes coming from our RDS. And per AWS, undigested data changes pile up:

The most common clients for PostgreSQL logical replication are AWS Database Migration Service or a custom-managed host on an AWS EC2 instance. The logical replication slot knows nothing about the receiver of the stream; there is no requirement that the target be a replica database. If you set up a logical replication slot and don't read from the slot, data can be written to your DB instance's storage and you can quickly fill up the storage on your instance.

Learning this, we upsized the DMS machine and waited for the storage usage to change. That was the culprit.
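
In hindsight, the retained WAL could have been spotted from the database side. A minimal sketch of such a check, assuming Postgres 10 or later (on 9.x the equivalent functions are pg_current_xlog_location and pg_xlog_location_diff):

-- How much WAL each replication slot is forcing the instance to keep around.
-- A large retained_wal on the DMS slot means its consumer isn't keeping up.
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;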