Thursday, January 4, 2018

A non-DBA guide to AWS ElastiCache Redis


I use Redis at work as either a cache engine or a shared memory between instances of the same service. It has been a great addition to the tech stack, offering simple ways to solve performance issues through a well-put-together interface. While Redis itself delivers, AWS's managed Redis solution, ElastiCache, presents me with a few growing pains that have little to do with Redis's technical design and more with AWS trying to shoehorn it onto its hardware platform.

AWS Redis comes in two flavors: cluster mode disabled (the default) and cluster mode enabled. This is the most important choice you have to make, and it will later expose you to some rigid aspects of a service whose name is, ironically, Elastic. While AWS makes it look like an option that can be turned on when needed (like RDS' multi-AZ), the choice is irreversible.


In a nutshell, cluster mode disabled means the entire dataset needs to fit into a single node. Up to 5 replica nodes can be added to help with the read load, but these replicas are exact copies of the data; they neither increase the storage capacity nor balance the write load.

Enabling cluster mode allows data to be split across up to 15 shards. Each shard has the same anatomy as a cluster-mode-disabled cluster (AWS, please work on your naming; I will go with "standalone mode" from now on): one primary node and up to 5 replicas.
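
From the client's point of view, the difference mostly shows up in how you connect. Here is a minimal sketch using redis-py (its bundled cluster client, available in redis-py 4.1+; older setups used the separate redis-py-cluster package). The endpoint hostnames are placeholders, not real ElastiCache endpoints.

    import redis
    from redis.cluster import RedisCluster  # cluster client ships with redis-py 4.1+

    # Standalone mode: one primary endpoint for writes; the optional reader
    # endpoint load-balances reads across the replicas.
    primary = redis.Redis(host="my-cache.xxxxxx.ng.0001.use1.cache.amazonaws.com", port=6379)
    primary.set("greeting", "hello")

    # Cluster mode: connect to the configuration endpoint and let the
    # cluster-aware client discover the shards and route each key.
    cluster = RedisCluster(host="my-cluster.xxxxxx.clustercfg.use1.cache.amazonaws.com", port=6379)
    cluster.set("greeting", "hello")  # the client hashes the key and picks the shard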

Trial and error led me to believe that cluster mode is simply superior to standalone mode; there are some drawbacks, but they can be easily mitigated. Let's dive in.
  • In standalone mode, backup and restore are not supported on cache.t1.micro or cache.t2.* nodes, whereas in cluster mode these operations are supported on all node types. I was burnt by this once: for some reason, my cache.t2.small node restarted itself and wiped out all the data.
  • People who aren't familiar with Redis might be turned away from cluster mode because the number of shards (node groups) used to be impossible to change without downtime. But the latest version supported in AWS ElastiCache, 3.2.10, lets you do exactly that: 3.2.10 clusters can be scaled in and out, and rebalanced, on the fly. Which leads to the next point.
  • The node type in a 3.2.10 cluster can't be changed (scaled up/down) without downtime.
  • Cluster mode does not support multiple databases. Therefore, when restoring to a cluster, the restore will fail if the RDB file references more than one database. If this is acknowledged early in the product life cycle, it is easily mitigated by proper key namespaces (see the sketch right after this list). However, it is a pain point when migrating from a standalone machine to a cluster.
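
Namespacing is nothing fancy: instead of SELECTing numbered databases, prefix the keys. A rough sketch, with made-up prefix names, of what I mean:

    import redis

    r = redis.Redis()  # point this at your ElastiCache endpoint

    def namespaced(ns, key):
        # Build keys like "sessions:user:42" instead of relying on SELECT.
        return f"{ns}:{key}"

    r.set(namespaced("sessions", "user:42"), "some-session-token")
    r.set(namespaced("ratelimits", "user:42"), 3)

    # Clearing one "database" becomes deleting one prefix instead of FLUSHDB.
    for key in r.scan_iter(match="sessions:*"):
        r.delete(key)
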
To illustrate the sizing and cost trade-offs, I made a simple AWS ElastiCache Redis Calculator.


For the same cache size, I can use a cluster of 2 cache.r*.large shards or a standalone cache.r*.xlarge. Other than the insignificant cost of EBS, the two options cost exactly the same. So if, from day one, you have a single-shard cluster of cache.r*.large, you can scale all the way to a 15-shard cluster with around 150 GB of cache capacity, knowing that you are on the most economical option the whole time. Pretty neat.
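
A quick back-of-the-envelope check of that claim. The prices and memory sizes below are placeholders I typed in for illustration, not authoritative AWS numbers; pull the real ones from the pricing page or the calculator.

    # Illustrative only: assumed on-demand prices and memory sizes, NOT quotes.
    PRICE_PER_HOUR = {"cache.r4.large": 0.228, "cache.r4.xlarge": 0.456}
    MEMORY_GIB = {"cache.r4.large": 12.3, "cache.r4.xlarge": 25.05}

    def option(node_type, shards, replicas_per_shard=0):
        nodes = shards * (1 + replicas_per_shard)
        return {
            "cache_gib": round(shards * MEMORY_GIB[node_type], 2),
            "usd_per_hour": round(nodes * PRICE_PER_HOUR[node_type], 3),
        }

    print(option("cache.r4.large", shards=2))   # ~24.6 GiB for the same hourly cost...
    print(option("cache.r4.xlarge", shards=1))  # ...as ~25 GiB on a single bigger node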

The only issue with Redis cluster so far is that while code written for a cluster will also work against a standalone node, the other way around is not always true. In a Redis cluster topology, the keyspace is divided into hash slots, and different nodes hold different sets of hash slots. Multi-key operations, and transactions involving multiple keys, are allowed only if all the keys involved hash to the same slot.
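
The standard escape hatch is hash tags: only the part of the key inside {...} is hashed, so related keys can be forced into the same slot. A sketch (the cluster client and endpoint are placeholders):

    from redis.cluster import RedisCluster  # redis-py 4.1+

    rc = RedisCluster(host="my-cluster.xxxxxx.clustercfg.use1.cache.amazonaws.com", port=6379)

    # These two keys hash to different slots, so a single MGET or transaction
    # over both can be rejected with a CROSSSLOT error on a real cluster.
    rc.set("user:42:name", "alice")
    rc.set("user:42:email", "alice@example.com")

    # With hash tags, only "user:42" is hashed, so both keys share one slot
    # and multi-key operations over them are fine.
    rc.set("{user:42}:name", "alice")
    rc.set("{user:42}:email", "alice@example.com")
    print(rc.mget("{user:42}:name", "{user:42}:email"))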

During this research, I also learned that Redis, unlike Memcached, is single-threaded. Self-hosted Redis can work around this by running multi-tenant, i.e. multiple processes (shards) on the same machine. This is not an option in AWS ElastiCache, so scaling the CPU vertically is the only option, and it is not optimal. Once I go beyond my sweet 150-GB cluster, I am wasting money on CPU cores that provide zero benefit to Redis, and paying that much just for memory is not justifiable. Like how Apple wants a hefty premium for that 128GB SSD.

Hence I included the Redis Labs Cloud option in the calculator. Redis Labs is where Salvatore Sanfilippo, the creator of Redis, works, so they should know a thing or two about Redis performance. But unless I got something wrong, Redis Labs Cloud doesn't strike me as a cheaper alternative. Yes, it is better, but not cheaper.

One last thing: if you are caching long strings, images, or JSON/XML structures in Redis, please consider compressing the data. Compression shrinks my JSON blobs by a factor of 20. It will take a while until my cache is fully compressed, but so far the outcome looks great.
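
The write path barely changes. Here is roughly what I mean, using zlib from the standard library (the 20x figure is from my own payloads; yours will depend on how repetitive the data is):

    import json
    import zlib

    import redis

    r = redis.Redis()  # point this at your ElastiCache endpoint

    def cache_json(key, obj, ttl=3600):
        # Serialize, then compress before writing to Redis.
        payload = zlib.compress(json.dumps(obj).encode("utf-8"))
        r.set(key, payload, ex=ttl)

    def fetch_json(key):
        payload = r.get(key)
        if payload is None:
            return None
        return json.loads(zlib.decompress(payload).decode("utf-8"))

    cache_json("report:2018-01-04", {"rows": [{"id": i, "status": "ok"} for i in range(1000)]})
    print(fetch_json("report:2018-01-04")["rows"][0])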




P/S: AWS ElastiCache Redis is plagued by mysterious node recoveries and data loss. Here are a few instances:
https://forums.aws.amazon.com/thread.jspa?threadID=140869
https://forums.aws.amazon.com/thread.jspa?threadID=209636
https://forums.aws.amazon.com/thread.jspa?threadID=242396

The reports started in 2013 and are still coming in as of 2018.

---
References:
Online offline scaling - http://docs.aws.amazon.com/AmazonElastiCache/latest/UserGuide/scaling-redis-cluster-mode-enabled.html
Redis Labs Cloud as an alternative - http://www.cloudadmins.org/2016/07/redis-labs-great-alternative-to-aws-elastic-cache/
Redis multi-tenant - https://redislabs.com/redis-enterprise-documentation/administering/designing-production/hardware-requirements/
