Thursday, May 31, 2018

Exploding kittens, I mean, databases.

There was a time when our database storage depleted like no tomorrow. It went through 80GB of disk space in less than 48 hours.

Before any investigation was conducted, we did the obvious, gave the machine more disk space. The operation was painless on RDS.

The sequence of checks were:

* Shit, did we forget to clean the log?
Nope, log was removed after 3 days, the task was carried on by RDS.
We also checked the db and its internal size, using the following query. Seems like all the occupied storage was actually spent on data itself.

   relname as "Table",
   pg_size_pretty(pg_total_relation_size(relid)) As "Size",
   pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid)) as "External Size"
   FROM pg_catalog.pg_statio_user_tables ORDER BY pg_total_relation_size(relid) DESC;

* Was the any migration / application change in the last couple of days that might be responsible for this?
Well, there was, we changed a column from varchar(500) to text. But no, this shouldn't explode the storage, and to Postgres, varchar and text were the same thing under the hood. Check this article from Depesz

* Was auto_vacuum on?
auto_vacuum is pretty crucial to Postgres, to free storage from intermittent insert and update statements. Our auto_vacuum might have been off, the result of this query was alarming

pp_pqsql_prod=> select count(*) from pg_stat_all_tables where last_autovacuum is not null;
(1 row)

Sanity check on another RDS confirmed this abnormality

db_authenticator_pp=> select count(*) from pg_stat_all_tables where last_autovacuum is not null;
(1 row)

However, till the end of the investigation, we still didn't know why there was no auto vacuum record. DevOps confirmed the option was on the whole time. The table was populated a couple hours after the incident. We did an experiment of changing storage on another RDS instance, to see if that would wipe out auto vacuum record, it didn't.

* And the last thing was AWS Data Migration Service (DMS).
We were using DMS to sync the content of Postgres to a few other databases. The DMS machine was scaled down recently to save cost, but at some point, it got too small to completely digest the data changes in our RDS. And per AWS, undigested data changes piled up:

The most common clients for PostgreSQL logical replication are AWS Database Migration Service or a custom-managed host on an AWS EC2 instance. The logical replication slot knows nothing about the receiver of the stream; there is no requirement that the target be a replica database. If you set up a logical replication slot and don't read from the slot, data can be written to your DB instance's storage and you can quickly fill up the storage on your instance.

Learning this, we upsized the DMS machine and waited for change in storage. This was the culprit.

Monday, January 22, 2018

A non-DBA guide on AWS Elasticsearch Service

Elasticsearch serves in my production system as the secondary storage. Transactional data is ultimately stored in RDS Postgres, but Elasticsearch is employed for its superior query features and performance. And because I have a long-lived growing index, I have experienced growing pains with Elasticsearch (and every other parts of the system, it's a Sisyphean show) that made me look into AWS Elasticsearch Service (AWS ESS) in greater details. I am not covering rolling indices, continuous data flow with an indexing period and retention window, which is another great application of Elasticsearch but not my use case. I have spent more time on this than I should, because a definite answer to ES scalability is complicated. When the official guide from Elastic itself is "it depends", I don't think I have a good shot at coming out being correct.

Unlike Redis, ES can't be scaled effectively with blackbox approach. Each component of ES is scaled differently. Any ES cluster can be explained as following:
  • Cluster: ES is meant to be run in a cluster. Every instance of AWS ESS is cluster-ready even with only one node.
  • Node: A single ES instance. In AWS ESS, each node runs on a separate EC2 instance.
  • Index: A collection of documents. Very soon, referring to an ES Index as similar to an SQL database is a bad analogy (read more here). It is the highest level of data organization in ES.
  • Shard: Common with other distributed storage system, an index is split into shards and distributed across multiple nodes. ES automatically arranges and balances these shards.
  • Replica: A read-only replication of a shard.
I asserted the scalability of AWS ESS from the smallest to the largest elements of a cluster, and the findings are

1. Shard

Scaling on this level involves the size of a shard and the number of shard in a cluster.
AWS's rule of thumb is to keep shard size below 50GB per shard. But that might as well be the upper limit of it. In practice, the best shard size is between 10-30GB in a stable cluster, or 10-20GB when cluster size is subject to frequent change.
The number of shard in a cluster determines maximum size of a cluster and its performance. A node can keep multiple shards, but a shard can't be split further, it has to reside in a single node. If initially a single-node cluster has 10 shards, its maximum growth is 10 nodes, each serving a single shard. The reason people don't create a hundred-shard cluster from day one is that the higher number of shard a cluster has, the more it is taxed on communication between shards. The ideal number of shard is the balance between giving growth space for the dataset and avoid excessive communications between shards. I suggest allocate shards 3-5 times the number of nodes in the initial setup.

2. Replica

In ES, a replica contributes to fail-over, but its primary role is to boost search performance. It takes some of the query load from master shards. Secondly, instead of scanning all nodes in the cluster for a query, the present of replicas allows traversing less nodes while still ensuring all shards are scanned. Push this to an extreme, for n nodes, I can have n shards and n-1 replicas. This means a query never needs to traverse more than one node. However, this also means there is no scale out/in for the cluster, each node has to be big enough for the entire dataset. Not recommended, gotta love the speed though.

3. Node

Scaling node is about choosing the right size for the machine. And this is surprisingly straight forward given ES' nature. ES uses Java, and thus its performance is tied to the mighty JVM. JVM heap size recommendation for ES is 32GB. It (heap size) also can't be more than 50% of available memory. Therefore the ideal memory of an ES instance is 64GB. This is the reason why earlier I suggested a cap of 30GB on shard size, so that the entire shard can fit into memory. A machine with less or more memory is still perfectly functional, it is merely a matter of bang for the buck. I settle on scaling up my machine till 64GB RAM and subsequently scaling out. I still have to deal with a whopping 64GB free memory whenever I scale out (and its bill), so 32GB maybe a more cost-conscious threshold. Meanwhile, I entertain the extra memory with more replicas.

4. Cluster

Scaling a cluster of ES is not simply adding more machines into it, but also understanding the setup's topology. On AWS ESS, the focus on usability dwarfed most ES topology configuration. The only significant configuration left is dedicated master nodes. These nodes perform cluster management tasks and give the data node some slack for stability. AWS guide on this is comprehensible, I couldn't do a better job.

5. AWS ESS drawbacks

It wasn't until the previous point did AWS ESS drawbacks from native ES emerge. In reality, there are more to that. Below is a list of things omitted by AWS ESS.
  • ES supports a third type of node: coordinating nodes. These nodes participate in cross-cluster query where a request is first scattered to data nodes and then gathered into a single resultset. Not particular popular in small setup, but completely off the table with ESS.
  • There is only HTTP connection. ES supports TCP connection and this should be more favorable for JVM-based languages, receiving better support and cutting down additional network complexity
  • No in-place upgrades. Upgrading ES version in AWS ESS is unnecessary painful to do with zero downtime. Painful because it involves launching a new cluster, whitelisting it for reindexing, executing the reindexing and that updating all services to point to the new cluster. Unnecessary because ES comes with in-place / rolling version upgrade.
  • Backup frequency is limited. To only one a day. Backup in Elasticsearch is supposed to be pretty cheap.
  • Security. One of the biggest reason I haven't provided Kibana as an BI interface to my clients is because X-Pack is not supported.
  • Limited access to the rest of ES ecosystem. ES ecosystem is growing fast and a force to be reckoned with. No logs, and no plugins are supported in AWS RSS. Cutting edge much?


The initial storage should be 10x the amount of data. Select the EC2 instance whose storage to memory ratio is 8:1. If that means more than 64GB of memory, take multiple 64GB machines. The number of shard is 3x the number of node. Minimal replica is 1, more if budget allows. Scale up the machine until reaching 64GB memory, after which scale out. Look for AWS ESS alternatives when the rush is over. I take no responsibility over your employment status.

Thursday, January 4, 2018

A non-DBA guide on AWS ElasticCache Redis

I use Redis at work as either a cache engine or a shared memory between instances of the same service. It has been a great addition to the tech stack, offering simple ways to solve performance issues, through a well put together interface. While Redis delivers, AWS' managed Redis solution, ElastiCache, presents me with a few growth pains that have little to do with Redis technical design, and more with AWS trying to shoehorn to its hardware platform.

AWS Redis comes in two favors, cluster mode disabled (default) and enabled. And this is the most important choice you have to make that would later on expose you to some rigid aspects of a service whose name is elastic. While AWS makes it look like this is an option that can be turned on when needed (like RDS' multi-AZ), the choice is irreversible.

In the nut shell, cluster mode disabled means the entire data needs to fit into a single node. Up to 5 replica nodes can be added to help with the read load, but these replicas are exact copies of the data, they neither increase the storage capacity nor balance the write load.

Enabling cluster mode allows data to be split into maximum 15 shards, each shard share the same anatomy of a cluster-mode-disabled cluster (AWS please work on your naming, I will go with standalone mode from now), one primary node and up to 5 replicas.

Trials and errors led me to believe the cluster mode is simply superior to the standalone one, there might be some drawbacks, but those can be easily mitigated. Lets dive in.
  • In standalone mode, backup and restore are not supported on cache.t1.micro or cache.t2.* nodes. Whereas in cluster mode, these operations are supported on all node types. I was burnt by this once, for some reasons, my cache.t2.small node restarted itself and wiped out all the data.
  • People who aren't familiar with Redis might get turned away from cluster mode because the number of shards (node groups) couldn't be changed without downtime. But the latest version supported in AWS Elasticache, 3.2.10, allows you to do exactly this. Clusters in 3.2.10 can be scaled in and out, and rebalanced on the fly. Which leads to the next point.
  • Node type in a 3.2.10 clusters can't be changed (scale up/down) without downtime. 
  • Cluster mode does not support multiple databases. Therefore, when restoring to a cluster, the restore will fail if the RDB file references more than one database. If this is acknowledged early in product life circle, it is easily mitigated by proper namespaces. However, this poses a pain point for migrating from a standalone machine to a cluster.
To illustrate this point, I made a simple AWS Elasicache Redis Calculator.

For the same cache size, I can use a cluster of 2 cache.r*.large or a standalone cache.r*.xlarge. And other than the insignificant cost of EBS, the 2 options cost exactly the same. So if from day 01, you have a single-node cluster of cache.r*.large, you can scale all the way to a 15-node cluster of 150GB cache capacity, knowing that you are on the most economical option all the time. Pretty neat.

Only issue with Redis cluster so far is that while the same code that works on cluster, will work on standalone, the other way around is not true. In a Redis cluster topology, the keyspace is divided into hash slots. Different nodes hold a different set of hash slots. Multiple keys operations, or transactions involving multiple keys are allowed only if all the keys involves are in hash slots belonging to the same node.

During this research, I also learned that Redis, unlike Memcached, is single-threaded. Self-hosted Redis can work around this by running multi-tenant, multiple processes (shards) on the same machine. This is not an option in AWS Elasticache. So scaling CPU vertical is the only option and is not optimum. Once I go beyond my sweet 150-GB cluster, I am wasting money on CPU cores that provide zero benefit to Redis, and that much only for memory is just not justifiable. Like how Apple wants a hefty pay for that 128GB SSD.

Hence I included Redis Labs Cloud option into the calculator. Redis Labs is where Salvatore Sanfilippo, creator of Redis, is working, and should know a thing or two about Redis performance. But unless I got something wrong, Redis Labs Cloud doesn't strike me as a cheaper alternative. Yes it is better, but not cheaper.

One last thing, if you are caching long string, images, or JSON/XML structure in Redis, please consider compressing the data. Compression sinks my json blob 20 times. It would take a while till my cache is fully compressed, so far the outcome looks great.

P/S: AWS Elasticache Redis is plagued with mysterious node recovery and data loss. Here are a few instances.

Started in 2013, still true now as of 2018.

Online offline scaling -
Redis Labs Cloud as an alternative -
Redis multi-tenant -

Saturday, December 23, 2017

Parcel Perform - A day on timelapse

It's festive season, and our traffic grew 50x compared to a year ago. Internally we also learned a lot and grew a thicker skin. I thought it would be nice to time lapse a normal Friday at office, so we have something to look back next year.

Friday, December 8, 2017

A non-DBA guide on upgrading your AWS RDS

I use AWS RDS at work for my production database. I understand that it is not infinitely scalable nor provides regional replication out of the box in case my datacenter submerges 3 meters under the sea because global warming is real, people. But in overall, the thing is pretty nice for a growing startup with a lot of battles to fight, computing capacity or storage upgrade without downtime is just clicks away.

I originally titled this as "When to upgrade your AWS RDS". But that's kinda stupid. When would you upgrade your database? I don't know, when everything is smooth and I have a bit of free time, I guess. The one and only reason I look into my database is when thing runs slow and one of those Kibana chart spikes up. It is also worth noting that this is not about tuning those parameters in Postgres. That would be in a DBA guide.

Hold your horse if placing indices and optimizing SQL pop up in your mind upon the appearance of "performance", we will get there. But I have learned that in the age of cloud, hardware problems still exist, virtually.

There are 2 hardware configs of RDS that matter when it comes to performance: CPU/RAM and Disk Space. But if you don't pay attention to the difference in these 2, you will pay handsomely and not necessarily get what you want. AWS is like budget airlines, anything out of textbook examples has the potential to cause explosive cost.

When I said disk space is one of the two important configs, I lied. Or rather, I oversimplified how RDS works. RDS storage puts a cap on how many operations it can perform in a second, the unit is known as IOPS. On general purpose SSD, the one I chose to start with, RDS provides a baseline performance of 3 IOPS/GB. This means no matter how efficient your indices are nor how fast your query runs, you can only run so many operations on a block of storage. There are Write IOPS and Read IOPS metrics in CloudWatch telling you how much traffic is going in and out of the database. There is also a Queue Depth metric showing how many requests to database are put onto a waiting list. If queue depth fluctuating above zero, and sum of IOPS aligns with the size of your disk space, don't bother optimizing that query.

I wish I could show you a typical charge when a machine is out of IOPS, but it was fixed a while back. Queue depth was dancing around 50 line back then. Use your imagination.

There are 2 ways you can increase the IOPS of an SSD.
  • IOPS increases according to disk space, so giving the RDS machine a lot more space than your storage works. 
  • AWS provides a higher performance class of SSD that come with provisioned IOPS, the IOPS still increases according to disk space, but at a much better ratio, 50:1 instead of 3:1.
Pros of increasing disk space is that it is significantly cheaper than provisioned IOPS. The cost breakdown looks like this
  • A T2.small machine with 100GB gives me 300 IOPS at $55/month. Never use t2 machine for anything with load or performance requirement, but that's another story for another time.
  • The same machine with 100GB and 1000 provisioned IOPS costs $165/month. And this is the minimal provisioned IOPS. Ok, it's three times the cost for three times throughput, so not a bad deal, but hold on.
  • The same machine with 200GB gives 600 IOPS at merely $69/month.
The cons of increasing disk space is that there is no easy way to reduce it afterwards. You could either spin up a new machine and dump your db there, or set up a replica (an option available in RDS) and promote for the switch over. Both sound like minor headache.

My service happens to require 500 IOPS and is not consumer-facing so some downtime wouldn't hurt anyone. So saving $100/month is a good deal to me. Your use case might be different.

The second hardware config is CPU/RAM is more straight forward.

  • If your query runs consistently slow on high and low IOPS, CPU usage forms some spiky chart, and if you are unfortunate enough to be on a T2 machine, your CPU credit keeps depleting overtime, your CPU is the problem. You can try to remedy this by rewriting your queries, especially if you are using an ORM. You don't need to cover all of them. A look into `pg_stat_activity` table during a high load moment should give you a glimpse on what is going on inside the database. Work on either the slowest one, or the one that runs most often. I once managed to reduce the CPU usage by 30% by optimizing for the second category.
  • If your ReadIOPS is not stable under load, there might be a chance your data set is not all in memory. A dramatic drop in ReadIOPS when upgrading to a bigger machine would confirm this hypothesis. You are on the optimal memory level when ReadIOPS no longer takes such dive or is reduced to a very small amount.
The ultimate fix is to upgrade your hardware to a machine that looks like a gym buddy of The Rock. Because why should I spend my time optimizing someone else's code? Solve my problem all the time. Did I say this is much more straight forward?

Now go check your IOPS.

Sunday, September 10, 2017

Can Kafka guarantee message ordering?

If you never heard of Kafka, this is not for you. The short paragraph won't repeat the concept of a messaging system and its components, internal as well as external.

Kafka has been proven to be a versatile tool for different use cases. In some of those, message ordering, that messages are consumed in the same order as they are created, is more important than the speed of either consumption or production. There is a big difference between depositing $100 and then withdrawing from it, and the other way around.

This illustration is here to please the eyes rather than any particular purpose


In Kafka, order can only be guaranteed within a partition. This means that if messages were sent from the producer in a specific order, the broken will write them to a partition and all consumers will read from that in the same order. So naturally, single-partition topic is easier to enforce ordering compared to its multiple-partition siblings.

But sometimes, this setting is not desirable. Having a single partition limits the throughput speed to that of a single consumer. And in most of systems, an enforcement on global order of all messages ever created is unnecessary. The order of depositing and withdrawing of my account shouldn't interfere that of my colleagues (non-causal). In these cases, as long as relevant messages are sent to the same partition, message (causal) ordering is still guaranteed. In Kafka, this is achieved using keyed messages. Kafka partitioners would hash keys and ensure a key always go to the same partition given the number of partitions is not changed.

If the naive hash scheme skews the size of your partitions, e.g. a ADHD bitcoin broker with a disproportionately number of activities making his partition twice as large as anything else, Kafka allows custom partitioner.


Kafka numerous settings include `retries` indicates the number of time a message will be retried on encountering intermittent error, and ``, the number of unacknowledged messages the producer will send on a single connection before blocking (pipelining, not like they are sent concurrently). The combination of these two can cause a bit of trouble. It is hard to build a reliable system without zero `retries`. On the other hand, it is possible that the broker fails the first batch of message, succeeds in the second (already in flight), and then retries the first batch and succeeds this time. With positive `retries`, `` should be set to 1. This comes with a penalty on producer throughput and should only be used where order enforcement is a must.


Sadly, there is no ordering guarantee for messages written by different producers, the order of reception is used. If the system design insists on multiple producers, one topic, and strong ordering, the responsibility falls into the producer application code. This means bigger investment and is the reason why this comes last. As both clock and network in a distributed system are not reliable, sequence numbers can replace timestamp in showing order of messages. The Lamport timestamp is one of such simple and compact mechanism. Lamport timestamp however cannot tell whether two messages are concurrent or causally dependent. The more heavy-weight version vectors would gladly take the job with a few extra dimes.


Single partition, and producer are easier to maintain total order. On multiple partitions, get casually dependent messages to the same partition using key and partitioner. Independent sequence ordering is needed with multiple producers. Accept the tradeoff of not having message pipelining if you want retries and order in the same place.

Sunday, August 6, 2017

Stories of the underdogs II

The internet as we know today started in the late 1980s. By mid 1990s, it had already had a revolutionary impact on culture, commerce, and technology. Around that time Vietnam had its first commercial internet connection. The tipping point was 2004 when internet usage grew by 200% and gave 10% of the population access to the highway of information. A young population, and a network of locally connected game centers, the first profound impact of the internet in Vietnam was the gaming industry. There were some competitions and when the dust settled, two dominating players emerged: Singapore-based Garena and homegrown VNG. When Garena came to Vietnam in 2010, the press described the event as the doom of VNG, that the little Vietnamese startup would soon be devoured. And there was foundation to that, Garena was led by a Stanford MBA graduate, with a seed fund of $2M and operated at 5 different countries before entering Vietnam. On the other hand, VNG founder was originally a bank officer by day and ran his game center in an alley at night. The startup got $80k seed fund and was non-existential outside of the country. Seven years later, both are still around and extremely successful being the first two SEA startups worth more than $1B. There is a little twist to that though. VNG is still virtually non-existential outside of Vietnam. It manages to get on the same tier with Garena with only one market, while the later has 8. Within Vietnam, Garena is solely known as a game publisher, VNG has transformed into an Internet Giant whose products touch every Vietnamese internet users.

The story of VNG is inspiring, against all odds, and exactly the kind of folk tales you suppose to hear in the world of startups. Most startups suffer from inexperience, understaffed, under financed, and overloaded. Of course, once in a while, there are exceptions, like Color, a photo-based social network led by veteran Vietnamese American founders with $41M initial funding. But for the majority of startups, those 4 horsemen of entrepreneurship continue to chase them till exit. Yet despite being disadvantaged, startups succeeded and disrupted markets from time to time. Again, except if you were Color, the startup was a flop and sold to Apple at $7M. So what if we have read the startup stories wrong. That is, what if instead of succeeding despite of all the challenges, great startups succeed exactly because of the challenges they face?

To listen at these stories differently, first we need to redefine what is an advantage. The same qualities that make big companies appear to be fearsome are often the source of great weakness. For example,
when it comes to technology startups, being big isn't exactly an advantage. While a bigger developer team in theory can produce software faster, they are also significantly harder to manage, might suffer from low productivity because of communicating and bureaucracy overhead, and more expensive to keep. Instagram was acquired for a billion dollar with a team of 13. Whatsapp has 50 engineers for 900 million users. With the power of software, hardware, and automation, size is more a liability than an advantage.

On another point, it's tartarus for a company to expand its core competencies to other areas. Once its core competencies are formed, much of the company resources is spent on extending the scope of features so the product is applicable to more use cases, refining development process, and organizational structure to sustain all those activities. Innovations are still cultivated, usually in form of specialization of teams. which results in a stream of smaller, continuous improvement over an existing product or service. That also makes disruptive innovation, like creating products or services that did not exist before, harder. Google for example is an extremely successful company, yet despite its size and the number of projects, much of Google revenue surrounds its core competencies as a search engine. Most of the projects whose role is to ensure Google's relevancy in the next decade such as Android, self-driving car, or AI, are from acquisition and not in-house.

While big corporates focus on regional and global scale, Vietnam startups, even the successful ones, think and act local. This in turn results in products solving very specific problems existing only in the country, and virtually unheard of for anyone else. As a game distributor, Garena did and still aces VNG in every perspectives. They are well-tuned towards international partnership, multi-nation online game operation, and game tournament hosting. What left for VNG were games that are much less popular and operate in only one country. And it was meant to be that way. VNG resources were spent on another problem of Vietnam in the late 2000s. Back then, to run a game center, you needed to be some sort of computer/game enthusiast because the hardware constantly ran outdated, latest games kept popping up, and your computer needed to be protected against these children who were too eager to try whatever tricks they found on the internet. But that wouldn't scale when game center became a business and investment. So VNG came up with this business development team that would consult small business owners with new location, hardware and software setups, and on top of that, install a home grown management software that not only protected the computers against threats, updated with latest games, but also enabled a VNG membership country-wide. So all of sudden, credit player accommodated in one center, can be used in another, as long as the other place also installed VNG management software. The software spread like wildfire and so did VNG presence in every corner of the country.

The trend in Vietnamese startup in recent years can be summed up as following:
  • Solve local problems
  • Avoid direct confrontation with bigger players
  • Disproportionately abundant in Entertainment and Lifestyle segment, due to the size of it
  • Much fewer B2B effort even among each other, which sometimes unintentionally leads to fewer strategic partnership enhancing value proposition and creating win-win situation
In other words, many startups here find guerrilla tactic fits them right. But then, if such can be formulated, why aren't we seeing more successful startups from Vietnam? Because guerrilla tactic is hard, to the point of desperation. Niche local problems aren't always obvious, they more often lurk in a corner, hiding in plain sight. Finding them is like shooting arrow to a bullseye you cannot see. Sometime, the bullseye doesn't even exist. And you have to go on like that days after days, with marginal profit, investors and employees show doubt, sometimes very expressively, none of the effort seems worth it. You only turn to guerrilla tactic when nothing else seems to work. Even startups succeed in this tactic, drop it as soon as they can. Right when VNG revenue from game distribution was stable, they started looking for extension in conventional market, like social network, media streaming, and messaging, all of which eventually replace game distribution as VNG's cash cow.

Doing startup is one of the many ways to gain perspectives of the world. A great one. Much of what we consider valuable in our world arise out of these kinds of challenges, because the act of facing overwhelming odds produces greatness and beauty. But just as important, the nature of these challenges might not be what they appear to be, giants have weakness, and underdogs over and over accomplish the unexpected with right preparation.