Sunday, October 7, 2018

Safe Operations For High Volume PostgreSQL

Add a new column (safe)

In general, adding a new column is a safe operation that doesn't lock the table. However, there are cases where certain options of the command lock the table. These cases are quite common when working on an evolving product.

Add a new column with a non-nullable constraint / default value (unsafe)

This requires each row of the table to be updated so the default column value can be stored. The entire table is locked during this operation, depending on the size of the table, it can take a while and everything else comes to a halt. The solution is to add the new nullable column without a default value, backfill the desired value in batches, then add the non-nullable constraint/default value with an alter statement.

One big good news is that from PostgreSQL 11, adding a new column with default value will be painless, as it should have been :)

Add a default value to an existing column. (safe)

If the column is nullable previously, the default value won't alter those existing null rows. If the column is non-nullable, all rows are guaranteed to store some value. Either way, the operation does not block the table and is safe to execute.

Change the type of a column. (unsafe)

Strictly speaking, this operation locks the table and is therefore unsafe. If the underlining datatype is not changed, like increasing the length of varchar, then the table isn't locked. But if the change requires a rewrite/re-cast, each row of the table is to be updated and the table is locked for the same reason as adding a new column with a default value. The solution is to add a new column, backfill the new column, and get rid of the old one.

Add a Foreign Key (unsafe)

To protect data integrity, an AccessExclusiveLock is placed on both tables, grinds every read and write to a halt (PostgreSQL 9.6+ reportedly allows read). The solution is to take advantage of Postgres' ability to introduce invalid constraints. The procedure is to first create an invalid Foreign Key constraint by specifying NOT VALID in the ALTER TABLE statement, and then validate it in a separate VALIDATE CONSTRAINT statement. The second validation only requires a RowShareLock that doesn't block reads nor writes. Do note that if there is a reference to non-existing rows, the operation won't be completed and you have to take care of integrity on your own.

Add an index (unsafe)

By nature, the table being indexed is locked against writes and the entire index is built in a single scan of the table. Read transactions can still be done meanwhile.

For a production environment, it is always better to build the index without locking writes. CREATE INDEX comes with the CONCURRENTLY option for this purpose. Though building index is still extra work for the database and this is reflected on extra CPU and I/O load. This might still slow other operations, we notice an increased queue depth when we add index concurrently on one of the biggest tables in the system. Because our infrastructure is in the cloud, with a minimal budget, we can upsize the database a couple of sizes larger than normal. The extra power makes adding indices (still with the CONCURRENTLY option) a lot more comfortable and productive.

Add a column with a unique constraint (unsafe)

This operation will lock the table because it requires a scan for uniqueness. The solution is to add the column, add a unique index concurrently, and then add the constraint onto the table (ADD CONSTRAINT UNIQUE USING INDEX).

Drop a column / a constraint (safe)

This is safe and quick. If the operation appears to take time, the cause is more likely because the table/index was in use rather than the operation time itself. A quick check on pg_stat_activity should confirm that.

Rename an entity (safe)

The entity can be a table (or an index, sequence, or view) or a column in a table. This operation has no effect on stored data and therefore also isn't affected by the size of the data.

Turn on MultiAZ (unsafe)

This is particular to AWS RDS. AWS RDS MultiAZ setup is different from a typical read replica in nature, one is synchronous replication, the other is asynchronous. As such, naively turning on MultiAZ on a (high volume) write-intensive DB will lead to every single write operation being synchronously replicated to the standby replica. It would severely impact the performance of the master because initially the data on the slave machine would not satisfy many integrity constraints and result in either locks or scans on the master for missing data.

It is recommended to take advantage of the asynchronous nature of read replication to get around:

  1. Create a Multi-AZ Read Replica of your Single AZ DB Instance and enable automated backups.
  2. Once the read replica is synchronized, promote the read replica to be a standalone DB Instance.
  3. Redirect your application to the newly promoted read replica.
  4. Rename the source and the new instance to preserve the endpoint (optional).

Thursday, May 31, 2018

Exploding kittens, I mean, databases.

There was a time when our database storage depleted like no tomorrow. It went through 80GB of disk space in less than 48 hours.

Before any investigation was conducted, we did the obvious, gave the machine more disk space. The operation was painless on RDS.

The sequence of checks were:

* Shit, did we forget to clean the log?
Nope, log was removed after 3 days, the task was carried on by RDS.
We also checked the db and its internal size, using the following query. Seems like all the occupied storage was actually spent on data itself.

   relname as "Table",
   pg_size_pretty(pg_total_relation_size(relid)) As "Size",
   pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid)) as "External Size"
   FROM pg_catalog.pg_statio_user_tables ORDER BY pg_total_relation_size(relid) DESC;

* Was the any migration / application change in the last couple of days that might be responsible for this?
Well, there was, we changed a column from varchar(500) to text. But no, this shouldn't explode the storage, and to Postgres, varchar and text were the same thing under the hood. Check this article from Depesz

* Was auto_vacuum on?
auto_vacuum is pretty crucial to Postgres, to free storage from intermittent insert and update statements. Our auto_vacuum might have been off, the result of this query was alarming

pp_pqsql_prod=> select count(*) from pg_stat_all_tables where last_autovacuum is not null;
(1 row)

Sanity check on another RDS confirmed this abnormality

db_authenticator_pp=> select count(*) from pg_stat_all_tables where last_autovacuum is not null;
(1 row)

However, till the end of the investigation, we still didn't know why there was no auto vacuum record. DevOps confirmed the option was on the whole time. The table was populated a couple hours after the incident. We did an experiment of changing storage on another RDS instance, to see if that would wipe out auto vacuum record, it didn't.

* And the last thing was AWS Data Migration Service (DMS).
We were using DMS to sync the content of Postgres to a few other databases. The DMS machine was scaled down recently to save cost, but at some point, it got too small to completely digest the data changes in our RDS. And per AWS, undigested data changes piled up:

The most common clients for PostgreSQL logical replication are AWS Database Migration Service or a custom-managed host on an AWS EC2 instance. The logical replication slot knows nothing about the receiver of the stream; there is no requirement that the target be a replica database. If you set up a logical replication slot and don't read from the slot, data can be written to your DB instance's storage and you can quickly fill up the storage on your instance.

Learning this, we upsized the DMS machine and waited for change in storage. This was the culprit.

Monday, January 22, 2018

A non-DBA guide on AWS Elasticsearch Service

Elasticsearch serves in my production system as the secondary storage. Transactional data is ultimately stored in RDS Postgres, but Elasticsearch is employed for its superior query features and performance. And because I have a long-lived growing index, I have experienced growing pains with Elasticsearch (and every other parts of the system, it's a Sisyphean show) that made me look into AWS Elasticsearch Service (AWS ESS) in greater details. I am not covering rolling indices, continuous data flow with an indexing period and retention window, which is another great application of Elasticsearch but not my use case. I have spent more time on this than I should, because a definite answer to ES scalability is complicated. When the official guide from Elastic itself is "it depends", I don't think I have a good shot at coming out being correct.

Unlike Redis, ES can't be scaled effectively with blackbox approach. Each component of ES is scaled differently. Any ES cluster can be explained as following:
  • Cluster: ES is meant to be run in a cluster. Every instance of AWS ESS is cluster-ready even with only one node.
  • Node: A single ES instance. In AWS ESS, each node runs on a separate EC2 instance.
  • Index: A collection of documents. Very soon, referring to an ES Index as similar to an SQL database is a bad analogy (read more here). It is the highest level of data organization in ES.
  • Shard: Common with other distributed storage system, an index is split into shards and distributed across multiple nodes. ES automatically arranges and balances these shards.
  • Replica: A read-only replication of a shard.
I asserted the scalability of AWS ESS from the smallest to the largest elements of a cluster, and the findings are

1. Shard

Scaling on this level involves the size of a shard and the number of shard in a cluster.
AWS's rule of thumb is to keep shard size below 50GB per shard. But that might as well be the upper limit of it. In practice, the best shard size is between 10-30GB in a stable cluster, or 10-20GB when cluster size is subject to frequent change.
The number of shard in a cluster determines maximum size of a cluster and its performance. A node can keep multiple shards, but a shard can't be split further, it has to reside in a single node. If initially a single-node cluster has 10 shards, its maximum growth is 10 nodes, each serving a single shard. The reason people don't create a hundred-shard cluster from day one is that the higher number of shard a cluster has, the more it is taxed on communication between shards. The ideal number of shard is the balance between giving growth space for the dataset and avoid excessive communications between shards. I suggest allocate shards 3-5 times the number of nodes in the initial setup.

2. Replica

In ES, a replica contributes to fail-over, but its primary role is to boost search performance. It takes some of the query load from master shards. Secondly, instead of scanning all nodes in the cluster for a query, the present of replicas allows traversing less nodes while still ensuring all shards are scanned. Push this to an extreme, for n nodes, I can have n shards and n-1 replicas. This means a query never needs to traverse more than one node. However, this also means there is no scale out/in for the cluster, each node has to be big enough for the entire dataset. Not recommended, gotta love the speed though.

3. Node

Scaling node is about choosing the right size for the machine. And this is surprisingly straight forward given ES' nature. ES uses Java, and thus its performance is tied to the mighty JVM. JVM heap size recommendation for ES is 32GB. It (heap size) also can't be more than 50% of available memory. Therefore the ideal memory of an ES instance is 64GB. This is the reason why earlier I suggested a cap of 30GB on shard size, so that the entire shard can fit into memory. A machine with less or more memory is still perfectly functional, it is merely a matter of bang for the buck. I settle on scaling up my machine till 64GB RAM and subsequently scaling out. I still have to deal with a whopping 64GB free memory whenever I scale out (and its bill), so 32GB maybe a more cost-conscious threshold. Meanwhile, I entertain the extra memory with more replicas.

4. Cluster

Scaling a cluster of ES is not simply adding more machines into it, but also understanding the setup's topology. On AWS ESS, the focus on usability dwarfed most ES topology configuration. The only significant configuration left is dedicated master nodes. These nodes perform cluster management tasks and give the data node some slack for stability. AWS guide on this is comprehensible, I couldn't do a better job.

5. AWS ESS drawbacks

It wasn't until the previous point did AWS ESS drawbacks from native ES emerge. In reality, there are more to that. Below is a list of things omitted by AWS ESS.
  • ES supports a third type of node: coordinating nodes. These nodes participate in cross-cluster query where a request is first scattered to data nodes and then gathered into a single resultset. Not particular popular in small setup, but completely off the table with ESS.
  • There is only HTTP connection. ES supports TCP connection and this should be more favorable for JVM-based languages, receiving better support and cutting down additional network complexity
  • No in-place upgrades. Upgrading ES version in AWS ESS is unnecessary painful to do with zero downtime. Painful because it involves launching a new cluster, whitelisting it for reindexing, executing the reindexing and that updating all services to point to the new cluster. Unnecessary because ES comes with in-place / rolling version upgrade.
  • Backup frequency is limited. To only one a day. Backup in Elasticsearch is supposed to be pretty cheap.
  • Security. One of the biggest reason I haven't provided Kibana as an BI interface to my clients is because X-Pack is not supported.
  • Limited access to the rest of ES ecosystem. ES ecosystem is growing fast and a force to be reckoned with. No logs, and no plugins are supported in AWS RSS. Cutting edge much?


The initial storage should be 10x the amount of data. Select the EC2 instance whose storage to memory ratio is 8:1. If that means more than 64GB of memory, take multiple 64GB machines. The number of shard is 3x the number of node. Minimal replica is 1, more if budget allows. Scale up the machine until reaching 64GB memory, after which scale out. Look for AWS ESS alternatives when the rush is over. I take no responsibility over your employment status.

Thursday, January 4, 2018

A non-DBA guide on AWS ElasticCache Redis

I use Redis at work as either a cache engine or a shared memory between instances of the same service. It has been a great addition to the tech stack, offering simple ways to solve performance issues, through a well put together interface. While Redis delivers, AWS' managed Redis solution, ElastiCache, presents me with a few growth pains that have little to do with Redis technical design, and more with AWS trying to shoehorn to its hardware platform.

AWS Redis comes in two favors, cluster mode disabled (default) and enabled. And this is the most important choice you have to make that would later on expose you to some rigid aspects of a service whose name is elastic. While AWS makes it look like this is an option that can be turned on when needed (like RDS' multi-AZ), the choice is irreversible.

In the nut shell, cluster mode disabled means the entire data needs to fit into a single node. Up to 5 replica nodes can be added to help with the read load, but these replicas are exact copies of the data, they neither increase the storage capacity nor balance the write load.

Enabling cluster mode allows data to be split into maximum 15 shards, each shard share the same anatomy of a cluster-mode-disabled cluster (AWS please work on your naming, I will go with standalone mode from now), one primary node and up to 5 replicas.

Trials and errors led me to believe the cluster mode is simply superior to the standalone one, there might be some drawbacks, but those can be easily mitigated. Lets dive in.
  • In standalone mode, backup and restore are not supported on cache.t1.micro or cache.t2.* nodes. Whereas in cluster mode, these operations are supported on all node types. I was burnt by this once, for some reasons, my cache.t2.small node restarted itself and wiped out all the data.
  • People who aren't familiar with Redis might get turned away from cluster mode because the number of shards (node groups) couldn't be changed without downtime. But the latest version supported in AWS Elasticache, 3.2.10, allows you to do exactly this. Clusters in 3.2.10 can be scaled in and out, and rebalanced on the fly. Which leads to the next point.
  • Node type in a 3.2.10 clusters can't be changed (scale up/down) without downtime. 
  • Cluster mode does not support multiple databases. Therefore, when restoring to a cluster, the restore will fail if the RDB file references more than one database. If this is acknowledged early in product life circle, it is easily mitigated by proper namespaces. However, this poses a pain point for migrating from a standalone machine to a cluster.
To illustrate this point, I made a simple AWS Elasicache Redis Calculator.

For the same cache size, I can use a cluster of 2 cache.r*.large or a standalone cache.r*.xlarge. And other than the insignificant cost of EBS, the 2 options cost exactly the same. So if from day 01, you have a single-node cluster of cache.r*.large, you can scale all the way to a 15-node cluster of 150GB cache capacity, knowing that you are on the most economical option all the time. Pretty neat.

Only issue with Redis cluster so far is that while the same code that works on cluster, will work on standalone, the other way around is not true. In a Redis cluster topology, the keyspace is divided into hash slots. Different nodes hold a different set of hash slots. Multiple keys operations, or transactions involving multiple keys are allowed only if all the keys involves are in hash slots belonging to the same node.

During this research, I also learned that Redis, unlike Memcached, is single-threaded. Self-hosted Redis can work around this by running multi-tenant, multiple processes (shards) on the same machine. This is not an option in AWS Elasticache. So scaling CPU vertical is the only option and is not optimum. Once I go beyond my sweet 150-GB cluster, I am wasting money on CPU cores that provide zero benefit to Redis, and that much only for memory is just not justifiable. Like how Apple wants a hefty pay for that 128GB SSD.

Hence I included Redis Labs Cloud option into the calculator. Redis Labs is where Salvatore Sanfilippo, creator of Redis, is working, and should know a thing or two about Redis performance. But unless I got something wrong, Redis Labs Cloud doesn't strike me as a cheaper alternative. Yes it is better, but not cheaper.

One last thing, if you are caching long string, images, or JSON/XML structure in Redis, please consider compressing the data. Compression sinks my json blob 20 times. It would take a while till my cache is fully compressed, so far the outcome looks great.

P/S: AWS Elasticache Redis is plagued with mysterious node recovery and data loss. Here are a few instances.

Started in 2013, still true now as of 2018.

Online offline scaling -
Redis Labs Cloud as an alternative -
Redis multi-tenant -