Life as an engineer: A non-DBA guide on upgrading your AWS RDS

I use AWS RDS at work for my production database. I understand that it is not infinitely scalable, nor does it provide cross-region replication out of the box in case my datacenter ends up 3 meters under the sea, because global warming is real, people. But overall, the thing is pretty nice for a growing startup with a lot of battles to fight: a compute or storage upgrade with little or no downtime is just a few clicks away.

I originally titled this "When to upgrade your AWS RDS". But that's kinda stupid. When would you upgrade your database? I don't know, when everything is smooth and I have a bit of free time, I guess. The one and only reason I look into my database is when things run slow and one of those Kibana charts spikes. It is also worth noting that this is not about tuning Postgres parameters. That would belong in a DBA guide.

Hold your horses if adding indices and optimizing SQL popped into your mind at the mention of "performance"; we will get there. But I have learned that in the age of cloud, hardware problems still exist, virtually.

There are 2 hardware configs of RDS that matter when it comes to performance: CPU/RAM and disk space. If you don't pay attention to the difference between the two, you will pay handsomely and not necessarily get what you want. AWS is like a budget airline: anything outside the textbook examples has the potential to blow up your bill.

When I said disk space is one of the two important configs, I lied. Or rather, I oversimplified how RDS works. RDS storage puts a cap on how many I/O operations it can perform per second; the unit is known as IOPS. On general purpose SSD, the one I chose to start with, RDS provides a baseline performance of 3 IOPS/GB. This means that no matter how efficient your indices are or how fast your queries run, you can only push so many operations through a given amount of storage. The WriteIOPS and ReadIOPS metrics in CloudWatch tell you how much traffic is going in and out of the database. There is also a DiskQueueDepth metric showing how many I/O requests are sitting on a waiting list. If queue depth keeps fluctuating above zero and the sum of read and write IOPS sits at the baseline for your disk size, don't bother optimizing that query. The disk is the bottleneck.
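The check described above can be sketched in a few lines of Python. This is a rough sketch, not an AWS formula: the function names and the 90% saturation threshold are my own assumptions, the readings are made up, and it ignores any burst allowance the volume may have.

```python
def baseline_iops(storage_gb, iops_per_gb=3):
    """Baseline IOPS for a volume, using the 3 IOPS/GB ratio quoted above."""
    return storage_gb * iops_per_gb


def looks_io_bound(read_iops, write_iops, queue_depth, storage_gb,
                   saturation=0.9):
    """Return True when the disk, not the query, is the likely bottleneck.

    saturation is an assumed threshold: how close to the baseline the
    combined IOPS must be before we call the volume saturated.
    """
    total = read_iops + write_iops
    at_ceiling = total >= saturation * baseline_iops(storage_gb)
    return queue_depth > 0 and at_ceiling


# Hypothetical CloudWatch readings for a 100 GB volume (baseline 300 IOPS).
print(looks_io_bound(read_iops=180, write_iops=115, queue_depth=50, storage_gb=100))
print(looks_io_bound(read_iops=40, write_iops=20, queue_depth=0, storage_gb=100))
```

The first call describes a disk pinned at its ceiling with requests queuing up; the second describes a quiet disk, where a slow query is worth looking at.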

I wish I could show you a typical chart of a machine that is out of IOPS, but it was fixed a while back. Queue depth was dancing around the 50 line back then. Use your imagination.

There are 2 ways you can increase the IOPS of an SSD.

IOPS increases with disk space, so giving the RDS machine a lot more space than your data actually needs works.
AWS provides a higher performance class of SSD that comes with provisioned IOPS. The IOPS still scales with disk space, but at a much better ratio: 50:1 instead of 3:1.
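To make the two ratios concrete, here is a small sketch that computes how much storage each option needs before a target IOPS becomes reachable. The function name and the 1000 IOPS target are illustrative assumptions, and the sketch ignores any minimum volume size or minimum IOPS that AWS enforces.

```python
import math


def storage_needed_gb(target_iops, iops_per_gb):
    """Smallest whole number of GB whose IOPS allowance covers target_iops."""
    return math.ceil(target_iops / iops_per_gb)


# Hypothetical target of 1000 IOPS, using the ratios quoted above.
print(storage_needed_gb(1000, iops_per_gb=3))   # general purpose SSD, 3:1
print(storage_needed_gb(1000, iops_per_gb=50))  # provisioned IOPS SSD, 50:1
```

At 3:1 the target needs 334 GB of storage, while at 50:1 it needs only 20 GB, which is why the two options end up priced so differently.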

The pro of increasing disk space is that it is significantly cheaper than provisioned IOPS. The cost breakdown looks like this:

A t2.small machine with 100GB gives me 300 IOPS at $55/month. Never use a t2 machine for anything with a load or performance requirement, but that's a story for another time.
The same machine with 100GB and 1000 provisioned IOPS costs $165/month. And this is the minimum provisioned IOPS. Ok, it's three times the cost for a bit over three times the throughput, so not a bad deal, but hold on.
The same machine with 200GB gives 600 IOPS at merely $69/month.

The con of increasing disk space is that there is no easy way to reduce it afterwards. You could either spin up a new machine and dump your db there, or set up a replica (an option available in RDS) and promote it for the switchover. Both sound like a minor headache.

My service happens to require 500 IOPS and is not consumer-facing, so some downtime wouldn't hurt anyone. Saving roughly $100/month is a good deal to me. Your use case might be different.
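The arithmetic behind that decision fits in a short script. The prices and IOPS figures are the ones quoted above, while the `Option` class and the `cheapest_meeting` helper are names I made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    iops: int
    monthly_usd: float


# Figures taken from the cost breakdown above (same t2.small machine).
OPTIONS = [
    Option("100 GB general purpose", 300, 55),
    Option("100 GB + 1000 provisioned IOPS", 1000, 165),
    Option("200 GB general purpose", 600, 69),
]


def cheapest_meeting(options, required_iops):
    """Cheapest option whose IOPS covers the requirement, or None."""
    candidates = [o for o in options if o.iops >= required_iops]
    return min(candidates, key=lambda o: o.monthly_usd) if candidates else None


for o in OPTIONS:
    print(f"{o.name}: ${o.monthly_usd / o.iops:.3f} per IOPS per month")

best = cheapest_meeting(OPTIONS, required_iops=500)
print(f"Cheapest option for 500 IOPS: {best.name} at ${best.monthly_usd}/month")
```

For a 500 IOPS requirement the 200GB option wins, at $96/month less than the provisioned IOPS one. Plug in your own requirement and prices before trusting the answer.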

The second hardware config, CPU/RAM, is more straightforward.

If your queries run consistently slow whether IOPS is high or low, CPU usage forms a spiky chart, and, if you are unfortunate enough to be on a T2 machine, your CPU credit keeps depleting over time, then your CPU is the problem. You can try to remedy this by rewriting your queries, especially if you are using an ORM. You don't need to cover all of them. A look into the `pg_stat_activity` view during a high load moment should give you a glimpse of what is going on inside the database. Work on either the slowest query or the one that runs most often. I once managed to reduce CPU usage by 30% by optimizing for the second category.
If your ReadIOPS is not stable under load, there is a chance your data set does not all fit in memory. A dramatic drop in ReadIOPS after upgrading to a bigger machine would confirm this hypothesis. You are at the optimal memory level when ReadIOPS no longer takes such a dive or is reduced to a very small amount.
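For the `pg_stat_activity` step, something along these lines could work. It is a sketch under assumptions: the SQL only uses standard columns of the view, but how you run it (psql, an ORM, a driver) is up to you, and the sample queries and table names below are invented. Note that `pg_stat_activity` reports the query text as sent, so queries that differ only in literal values will not group together with this naive normalization.

```python
from collections import Counter

# Snapshot of what is running right now, slowest first. Run it a few times
# during a load spike with whatever Postgres client you already use.
ACTIVE_QUERIES_SQL = """
SELECT pid, now() - query_start AS runtime, query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY runtime DESC;
"""


def most_frequent(queries, top=3):
    """Rank query texts by how often they show up across the samples."""
    # Collapse whitespace and case so trivially different spellings match.
    normalized = (" ".join(q.split()).lower() for q in queries)
    return Counter(normalized).most_common(top)


# Hypothetical query texts collected from a few snapshots.
samples = [
    "SELECT * FROM orders WHERE user_id = $1",
    "select *  from orders where user_id = $1",
    "UPDATE carts SET total = $1 WHERE id = $2",
]
for query, count in most_frequent(samples):
    print(count, query)
```

The SQL gives you the "slowest one" candidates directly, and the counter gives you the "runs most often" candidates.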

There is the AWS RDS r* family, which is memory optimized (almost twice the memory compared to an m* cousin). But the ultimate fix is to upgrade your hardware to a machine that looks like a gym buddy of The Rock. Because why should I spend my time optimizing someone else's code? It solves my problem every time. Did I say this is much more straightforward?

Now go check your IOPS.

Friday, December 8, 2017
