Sunday, March 31, 2019

Do young developers have it easier?

The world of computer programming is changing, fast. In the short 10 years of my career, I have had the chance to watch industry standards emerge, get demolished, and be resurrected. Serverless, the latest abstraction upon abstractions that mankind calls programming, has had its glorious days and is now probably at the peak of the Gartner Hype Cycle. The time I spent on XML and its abomination of a sibling, XSLT, is time I will never get back, though the problems have not necessarily gone away. And everyone is doing deep learning left, right, and center, after decades of hiatus.

As we get better at reinventing programming paradigms, streamlining development processes, and specializing careers, new crops of developers come into the workforce, with perspectives drastically different from those of previous generations. But does growing up with an iPhone in their pockets necessarily mean these developers have it easier?

Like youngsters of any species, young developers, despite their non-existent social skills and blissful ignorance, are known for being fast, energetic, and tenacious. And I think this is where a lot of misperceptions come from. Where I give credit to young developers is their stamina. Programming is neither easy nor relaxing. It is a demanding job requiring intense concentration. A young developer has the energy, especially when combined with pizza and coffee, to stick to her computer at 2 in the morning, working on the intermittent quirks of an API she released earlier, and still be delighted by the new knowledge and by the fact that her work matters. Such work is harder when one gets older. An older developer has a life he needs to keep up with. Actually, he is probably pissed that the API doesn't just work the way it should.

That, however, does not mean young developers get shit done faster. They really don't, otherwise you would see me preparing my retirement plan really soon. If anything, they tend to screw things up faster than The Flash on Red Bull. I have had almost every intern accidentally drop a database, or delete something they shouldn't have, within their 3-month stint. Neither do they learn faster. I am in the camp that believes the further one progresses in one's career, the faster one learns. Programming knowledge is cumulative. "New" ideas have usually popped up some time before in a slightly different form, in some other language or situation. Young developers, outside of school work, do not spend time tinkering with assembly, distributed models, or the 7 layers of the network stack. They reap the benefits of other developers who worked out the details and packaged them nicely so that they are more approachable.

Is the fact that young developers rely on more layers of abstraction a bad thing? Of course not. I am glad that I can just sit here writing my post in plain English and not in HTML or CSS. That's exactly how we progress as a species, relying on abstractions built by the generations before. In that direction, current mature technology favors quick iterations and a shorter time to market. But wait, isn't that good old Agile, what's the big deal? Agile is about building one thing that is small yet works well, then a little more. Yet in an ecosystem with 600k apps in the Play Store alone, builders aren't even sure whether the one thing they are building is going to be a hit or a miss until it is out. There are so many high-quality reusable building blocks: authentication, real-time databases, infinitely scalable storage, etc. Writing new software is now less about constructing and more about figuring out the right combination of things to glue together. The rise of function-as-a-service means young developers can just build it. Works? No? Blow it up. Do it again. All before lunch starts. Older developers have been burned; they don't want to get burned again, and they dread the concept of redoing until figuring it out.

Though blessed with a beginner's mind, young developers, I think, still have a long way to go. I don't come across old developers that often. Partly because IT is a young profession where I am living. Partly because, by middle age, many move on to management, or move out. The ones who remain hands-on throughout their careers are really a force to be reckoned with. They have matured as developers and have learned the art of leadership. They delegate appropriate work to junior people, balance net contribution against learning something new, and thus save both their own time and the project's. People who have been around a while also see the bigger picture more clearly, both in terms of product scope and project life cycle: which problems can be delayed, and which shouldn't.

Then comes scaling. A new application gaining traction will, at some point, hit the scalability wall. Hard. Like Miley Cyrus on a wrecking ball hard. Scaling is a lot harder than throwing together a bunch of code one did not write *cough* StackOverflow *cough* and duct-taping it till it works. A scalable system takes real dedication, expertise, and a calling to learn more and better ways to be a great developer. And these don't come overnight.

Young developers are intelligent, hard-working, and excellent at solving the problems they are given.
Old developers are intelligent, lazy, and excellent at predicting the problems of 6 months' time and building the right foundation today to make them easy when the time comes.

Friday, February 22, 2019

Simple example on why LIMIT is not always the best thing for SQL Database

Since our production database grew to a size where querying data without indices became a suicide mission, we have learned that LIMIT does not always translate into faster query time and less work for the machine. Today at work, we ran into a beautifully simple example illustrating this point, which the more junior teammates found counter-intuitive.

Here we have a query over an 8M-row table; the condition fits squarely into an index.

EXPLAIN SELECT "record".*
FROM "record" WHERE ("record"."status" = 'not_found' AND "record"."updated_date" < '2018-12-23 15:48:23.008320+00:00');
                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on record  (cost=256890.59..3475874.26 rows=8448978 width=1740)
   Recheck Cond: (((status)::text = 'not_found'::text) AND (updated_date < '2018-12-23 15:48:23.00832+00'::timestamp with time zone))
   ->  Bitmap Index Scan on not_found_records_idx  (cost=0.00..254778.35 rows=8448978 width=0)
         Index Cond: (((status)::text = 'not_found'::text) AND (updated_date < '2018-12-23 15:48:23.00832+00'::timestamp with time zone))
(4 rows)
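
Judging by the plan, the not_found_records_idx index presumably covers (status, updated_date). Its definition is not shown here, but it would look something like this (a guess for illustration; the actual definition may differ):

CREATE INDEX not_found_records_idx ON record (status, updated_date);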


That was still a lot of records, so we were hoping a LIMIT clause would shorten the execution time.

EXPLAIN SELECT "record".*
FROM "record" WHERE ("record"."status" = 'not_found' AND "record"."updated_date" < '2018-12-23 15:48:23.008320+00:00') LIMIT 1;
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..0.46 rows=1 width=1740)
   ->  Seq Scan on record  (cost=0.00..3923250.74 rows=8448978 width=1740)
         Filter: ((updated_date < '2018-12-23 15:48:23.00832+00'::timestamp with time zone) AND ((status)::text = 'not_found'::text))
(3 rows)


Only it didn't. Adding the LIMIT clause changed the query plan from a Bitmap Index Scan to a Sequential Scan, which is about the worst thing we can do on an 8M-row table. It is a typical "optimization" by the planner, which decided that the query with the LIMIT clause was too small to be worth the index.

Operation-wise, an index scan is more expensive than a sequential scan: it has to read the index pages first and then the data pages for the relevant rows, while the sequential scan only deals with data pages. So when the LIMIT is small, the planner, guided by the table statistics, is tempted to believe that enough (random) rows match the filter condition for a sequential scan to find the first few quickly, which makes it look like the cheaper option.

In fact, the table statistics somehow suggest that every LIMIT is too small until it reaches half of the table!

EXPLAIN SELECT "record".*
FROM "record" WHERE ("record"."status" = 'not_found' AND "record"."updated_date" < '2018-12-23 15:48:23.008320+00:00') LIMIT 3000000;
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..1393038.57 rows=3000000 width=1740)
   ->  Seq Scan on record  (cost=0.00..3923250.74 rows=8448978 width=1740)
         Filter: ((updated_date < '2018-12-23 15:48:23.00832+00'::timestamp with time zone) AND ((status)::text = 'not_found'::text))
(3 rows)

EXPLAIN SELECT "record".*
FROM "record" WHERE ("record"."status" = 'not_found' AND "record"."updated_date" < '2018-12-23 15:48:23.008320+00:00') LIMIT 4000000;
                                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=257234.59..1781198.16 rows=4000000 width=1740)
   ->  Bitmap Heap Scan on record  (cost=257234.59..3476218.26 rows=8448978 width=1740)
         Recheck Cond: (((status)::text = 'not_found'::text) AND (updated_date < '2018-12-23 15:48:23.00832+00'::timestamp with time zone))
         ->  Bitmap Index Scan on not_found_records_idx  (cost=0.00..255122.35 rows=8448978 width=0)
               Index Cond: (((status)::text = 'not_found'::text) AND (updated_date < '2018-12-23 15:48:23.00832+00'::timestamp with time zone))
(5 rows)


Adding an ORDER BY clause to the query brings structure to the search and guides the planner to use the index:

EXPLAIN SELECT "record".*
FROM "record" WHERE ("record"."status" = 'not_found' AND "record"."updated_date" < '2018-12-23 15:48:23.008320+00:00') ORDER BY "record"."imported_date" LIMIT 1;
                                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=3518127.15..3518127.15 rows=1 width=1740)
   ->  Sort  (cost=3518127.15..3539249.59 rows=8448978 width=1740)
         Sort Key: imported_date
         ->  Bitmap Heap Scan on record  (cost=256898.59..3475882.26 rows=8448978 width=1740)
               Recheck Cond: (((status)::text = 'not_found'::text) AND (updated_date < '2018-12-23 15:48:23.00832+00'::timestamp with time zone))
               ->  Bitmap Index Scan on not_found_records_idx  (cost=0.00..254786.35 rows=8448978 width=0)
                     Index Cond: (((status)::text = 'not_found'::text) AND (updated_date < '2018-12-23 15:48:23.00832+00'::timestamp with time zone))
(7 rows)

Actually, any ORDER BY would do, even a random one:

EXPLAIN SELECT "record".*
FROM "record" WHERE ("record"."status" = 'not_found' AND "record"."updated_date" < '2018-12-23 15:48:23.008320+00:00') ORDER BY random() LIMIT 1;
                                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=3539253.59..3539253.60 rows=1 width=1748)
   ->  Sort  (cost=3539253.59..3560376.04 rows=8448978 width=1748)
         Sort Key: (random())
         ->  Bitmap Heap Scan on record  (cost=256902.59..3497008.70 rows=8448978 width=1748)
               Recheck Cond: (((status)::text = 'not_found'::text) AND (updated_date < '2018-12-23 15:48:23.00832+00'::timestamp with time zone))
               ->  Bitmap Index Scan on not_found_records_idx  (cost=0.00..254790.35 rows=8448978 width=0)
                     Index Cond: (((status)::text = 'not_found'::text) AND (updated_date < '2018-12-23 15:48:23.00832+00'::timestamp with time zone))
(7 rows)


However, that is still far from optimal. In order to return the first row, the database first needs to fetch all rows matching the index condition, and only after that can it sort. This strains the machine's memory and may even hit disk swap.

An index is actually an ordered structure (a B-Tree). The optimal query is the one that sorts by one of the indexed columns.

EXPLAIN SELECT "record".*
FROM "record" WHERE ("record"."status" = 'not_found' AND "record"."updated_date" < '2018-12-23 15:48:23.008320+00:00') ORDER BY "record"."updated_date" LIMIT 1;
                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.56..1.90 rows=1 width=1740)
   ->  Index Scan using not_found_records_idx on record  (cost=0.56..11280384.47 rows=8448978 width=1740)
         Index Cond: (((status)::text = 'not_found'::text) AND (updated_date < '2018-12-23 15:48:23.00832+00'::timestamp with time zone))
(3 rows)

This, in fact, is even more effective than the original no-order query, with or without LIMIT.

Friday, February 1, 2019

2018 - Shit am I getting old?

Roughly 4 years ago I put together The Transforming 2014. In retrospect, it was a great year. It felt exactly like what youth should feel like: that I could afford to fail and bounce back, that opportunities were there for the taking, and that I was the master of my own life. 2018 feels very different. The year followed a trajectory I have been carving out over the last few years. It was not particularly eventful; the changes are there, but not exactly tangible. For lack of eloquence, I would compare it to crying: all the sorrows and memories and grudges build up inside, but until the instant everything bursts out in the form of tears, the entire thing is internal and invisible. If a tree falls in a forest and no one is around to hear it, does it make a sound?


Size and Duration matter

Before I get accused of having lewd thoughts, this is a note about Parcel Perform. That's where I work. I am a software engineer here. For the fourth year. I have never worked on a single project that long, and I have definitely stayed in Vietnam longer than I intended.

We build a product around the belief that logistics optimization is the next frontier of the e-commerce war. And in that business, size matters. Specifically, dataset size, which grows by dozens of GB every week. At that speed, what worked a month ago won't work the next. The job description is as pure as it gets: find a way to process the incoming data efficiently before its size squashes you like an insignificant insect in a tropical storm, all while adding increasingly advanced features to deliver value. Or, if you prefer metaphors: build a train from whatever material you find first, while keeping it running on century-old rails, hoping you make it to the next station.

I can no longer just add a column with a default value to my database, because that would update hundreds of millions of rows and grind everything to a halt in the process. Everything I thought I knew about software is put to the test. At times, I feel like an imposter, a fraud. I feel alive every time something breaks. Which is about every day.

This is also my third time being employee #0 and building a tech team from the ground up. I have a knack for doing this. I know which minimal unit of work provides the most impact. And I know what developers need to level up their careers. But I hope I won't have to repeat the same thing at my next job. The first year of every startup is spent building glorified Excel sheets. The first engineer's day is always a constant juggle between development, testing, and expectation management. It gets old.

I have hung around Parcel Perform long enough to pass that stage. We get to work on interesting challenges unique to our business. Our tech stack has moulded around us. We have a support network, so the weight of the entire system does not have to rest on a single pair of shoulders. It takes time for things to come together, and so, duration matters.

Creativity Winter

As much as I enjoy seeing my technical skills grow in the last couple of years, the same can't be said about my creativity outside of work.
  • Between 2016 and 2018, I wrote 13 posts. I wrote 19 posts in 2015 alone. 
  • At some points, my photo app showed nothing but photos of whiteboards I took post-meeting.
  • The last Barcamp in Saigon was in 2016.
None of these was a conscious decision. Like, I never decided I didn't need these things in my life any more. I have noticed that my most creative period was when my life happened between airports and on MRT rides. Somehow the constant change of scenery kept my mind alert and fresh. Though I successfully prevented my career from repeating itself, life outside of work got stuck in an explore-and-exploit paradox. On the one hand, there must still be stuff out there to give me an adrenaline rush. On the other hand, comfort feels good. I long for a kick-start to get out of this limbo.

Not all hope is lost. I am writing this, albeit sluggishly. A kid at work got into photography recently; perhaps I can get a partner in crime. And Barcamp is back in 2019. (Surprised? Me too!)

Pretty sure we are the only company throwing morning kayak sessions in the city

I was at a HackerX event in 2018. Basically speed dating for developer employment. And that was literally how I started my part of the flirting duty. Getting a tandem kayak was pretty much an impulse purchase. But it was a great one. I wonder why I didn't think of it earlier. In Vietnam, I get to pretty much drag the boat along and pop it into any body of water I feel like. That my office is a few hundred meters away from a pier also helps. The same can't be said of the other countries I have lived in, due to either regulations or location. Be it a morning exercise before work or a more adventurous weekend, it's good fun and brings friends together. One at a time. There are only two seats on a tandem kayak, damn it.


I wish 2018 had been grandiose. Instead, I just got a lot of doubts. Have I figured things out, or has my life just gotten boring? Was the progress I made in my career solid, or was I making excuses and succumbing to familiar comfort? Have I got the right balance, or is work slowly gulping down my life? I still identify more as a 20-year-old kid scared shitless at his first job than as the 30-year-old I will be when this year ends. Well, I guess that's it, right? Until this year ends, I have another year to live in my 20s and do stupid stuff. The uncertainty of my 30s can wait; its time will come later :)

Tuesday, January 1, 2019

Parcel Perform - Another (hack)day on timelapse

It was another good year at Parcel Perform. I hope this video turns out to be a better version than last year's CCTV-footage-on-a-soundtrack. It is great to see the side-by-side comparison of how things changed in a year anyway.

(Yes, I wore the same T-shirt, that was on purpose. No, I don't own only 1 T-shirt in my wardrobe) 

Sunday, October 7, 2018

Safe Operations For High Volume PostgreSQL

Add a new column (safe)

In general, adding a new column is a safe operation that doesn't lock the table. However, there are cases where certain options of the command lock the table. These cases are quite common when working on an evolving product.

Add a new column with a non-nullable constraint / default value (unsafe)

This requires each row of the table to be updated so the default column value can be stored. The entire table is locked during this operation; depending on the size of the table, it can take a while, and everything else comes to a halt. The solution is to add the new column as nullable and without a default value, backfill the desired value in batches, then add the non-nullable constraint / default value with an ALTER statement, as sketched below.
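
A minimal sketch of that procedure, using a hypothetical orders table with an id primary key and a new is_archived column (the names are made up for illustration):

-- 1. Adding a nullable column with no default is only a catalog change, no rewrite.
ALTER TABLE orders ADD COLUMN is_archived boolean;

-- 2. Backfill in batches so no single statement holds locks for long.
UPDATE orders SET is_archived = false
WHERE id IN (SELECT id FROM orders WHERE is_archived IS NULL LIMIT 10000);
-- repeat until 0 rows are updated

-- 3. Only then attach the default and the NOT NULL constraint.
ALTER TABLE orders ALTER COLUMN is_archived SET DEFAULT false;
ALTER TABLE orders ALTER COLUMN is_archived SET NOT NULL;
-- (SET NOT NULL still scans the table to verify, but does not rewrite it)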

Add a default value to an existing column (safe)

If the column was previously nullable, the default value won't alter the existing null rows. If the column is non-nullable, all rows are already guaranteed to store some value. Either way, the operation does not block the table and is safe to execute.

Change the type of a column (unsafe)

Strictly speaking, this operation locks the table and is therefore unsafe. If the underlying datatype is not changed, like increasing the length of a varchar, the table isn't rewritten. But if the change requires a rewrite/re-cast, each row of the table has to be updated, and the table is locked for the same reason as adding a new column with a default value. The solution is to add a new column, backfill it, and get rid of the old one, along the lines of the sketch below.
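
A sketch with a hypothetical amount column being widened from integer to bigint (again, made-up names):

-- 1. Add the replacement column (nullable, no default).
ALTER TABLE orders ADD COLUMN amount_new bigint;

-- 2. Backfill in batches; new writes must also populate amount_new
--    (at the application level or via a trigger) while this runs.
UPDATE orders SET amount_new = amount
WHERE id IN (SELECT id FROM orders WHERE amount_new IS NULL AND amount IS NOT NULL LIMIT 10000);

-- 3. Swap the columns.
ALTER TABLE orders DROP COLUMN amount;
ALTER TABLE orders RENAME COLUMN amount_new TO amount;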

One piece of big good news is that from PostgreSQL 11, adding a new column with a default value will be painless, as it should always have been :)

Add a Foreign Key (unsafe)

To protect data integrity, an AccessExclusiveLock is placed on both tables, which again grinds every read and write to a halt (PostgreSQL 9.6+ reportedly allows reads). The solution is to take advantage of Postgres' ability to introduce an invalid constraint: first create an invalid Foreign Key constraint by specifying NOT VALID in the ALTER TABLE statement, then validate it in a separate VALIDATE CONSTRAINT statement. The validation only requires a RowShareLock, which blocks neither reads nor writes. Do note that if there are references to non-existing rows, the validation won't complete and you have to take care of integrity on your own.
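
A sketch with hypothetical orders and customers tables:

-- 1. Create the constraint without checking existing rows; only a brief lock is needed.
ALTER TABLE orders
  ADD CONSTRAINT orders_customer_id_fkey
  FOREIGN KEY (customer_id) REFERENCES customers (id)
  NOT VALID;

-- 2. Check the existing rows separately, without blocking normal reads and writes.
ALTER TABLE orders VALIDATE CONSTRAINT orders_customer_id_fkey;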

Add an index (unsafe)

By default, the table being indexed is locked against writes and the entire index is built in a single scan of the table. Read transactions can still run in the meantime.

For a production environment, it is always better to build the index without locking writes; CREATE INDEX comes with a CONCURRENTLY option for this purpose. Building the index is still extra work for the database, though, and this is reflected in extra CPU and I/O load. It might still slow down other operations; we noticed an increased queue depth when we added an index concurrently on one of the biggest tables in the system. Because our infrastructure is in the cloud, with a minimal budget we can temporarily upsize the database a couple of sizes larger than normal. The extra power makes adding indices (still with the CONCURRENTLY option) a lot more comfortable and productive.
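
For example (hypothetical names; note that CREATE INDEX CONCURRENTLY can't run inside a transaction block):

-- Build the index without blocking writes on the table.
CREATE INDEX CONCURRENTLY orders_status_idx ON orders (status);

-- If the build fails midway, it leaves an INVALID index behind; drop it and retry.
DROP INDEX CONCURRENTLY IF EXISTS orders_status_idx;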

Add a column with unique constraint (unsafe)

This operation locks the table because it requires a scan for uniqueness. The solution is to add the column, build a unique index concurrently, and then attach the constraint to the table using that index (ADD CONSTRAINT ... UNIQUE USING INDEX), as sketched below.
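
A sketch (made-up names):

-- 1. Add the column (nullable, no default, so no rewrite).
ALTER TABLE orders ADD COLUMN external_ref text;

-- 2. Build the unique index without blocking writes.
CREATE UNIQUE INDEX CONCURRENTLY orders_external_ref_key ON orders (external_ref);

-- 3. Attach the constraint using the already-built index; this step is quick.
ALTER TABLE orders
  ADD CONSTRAINT orders_external_ref_key UNIQUE USING INDEX orders_external_ref_key;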

Drop a column / a constraint (safe)

This is safe and quick. If the operation appears to take a long time, the cause is more likely that the table/index is in use and the statement is waiting for a lock, rather than the operation itself. A quick check on pg_stat_activity should confirm that.

Rename an entity (safe)

The entity can be a table (or an index, sequence, or view) or a column in a table. This operation has no effect on stored data and therefore also isn't affected by the size of data.

Thursday, May 31, 2018

Exploding kittens, I mean, databases.

There was a time when our database storage depleted like there was no tomorrow. It went through 80GB of disk space in less than 48 hours.


Before any investigation was conducted, we did the obvious and gave the machine more disk space. The operation was painless on RDS.

The sequence of checks was:

* Shit, did we forget to clean the logs?
Nope, logs were removed after 3 days; that task is handled by RDS.
We also checked the DB and its internal sizes using the following query. It seemed like all the occupied storage was actually spent on the data itself.

SELECT
   relname as "Table",
   pg_size_pretty(pg_total_relation_size(relid)) As "Size",
   pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid)) as "External Size"
   FROM pg_catalog.pg_statio_user_tables ORDER BY pg_total_relation_size(relid) DESC;

* Was there any migration / application change in the last couple of days that might be responsible for this?
Well, there was: we changed a column from varchar(500) to text. But no, this shouldn't explode the storage; to Postgres, varchar and text are the same thing under the hood. Check this article from Depesz: https://www.depesz.com/2010/03/02/charx-vs-varcharx-vs-varchar-vs-text/

* Was autovacuum on?
Autovacuum is pretty crucial to Postgres; it frees the storage left behind by intermittent update and delete statements. Ours might have been off, because the result of this query was alarming:

pp_pqsql_prod=> select count(*) from pg_stat_all_tables where last_autovacuum is not null;
 count
-------
     0
(1 row)

A sanity check on another RDS instance confirmed the abnormality:

db_authenticator_pp=> select count(*) from pg_stat_all_tables where last_autovacuum is not null;
 count
-------
    26
(1 row)

However, by the end of the investigation, we still didn't know why there were no autovacuum records. DevOps confirmed the option had been on the whole time, and the table was populated again a couple of hours after the incident. We ran an experiment changing the storage on another RDS instance to see if that would wipe out its autovacuum records; it didn't.

* And the last thing was AWS Data Migration Service (DMS).
We were using DMS to sync the content of Postgres to a few other databases. The DMS machine had recently been scaled down to save cost, but at some point it became too small to completely digest the data changes in our RDS. And per AWS, undigested data changes pile up:

The most common clients for PostgreSQL logical replication are AWS Database Migration Service or a custom-managed host on an AWS EC2 instance. The logical replication slot knows nothing about the receiver of the stream; there is no requirement that the target be a replica database. If you set up a logical replication slot and don't read from the slot, data can be written to your DB instance's storage and you can quickly fill up the storage on your instance.

Learning this, we upsized the DMS machine and waited for the storage to change. This was indeed the culprit.
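
For future reference, the WAL retained by a lagging replication slot can be checked directly. A sketch (for PostgreSQL 10+; older versions use pg_current_xlog_location() and pg_xlog_location_diff() instead):

SELECT
   slot_name,
   active,
   pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
   FROM pg_replication_slots ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;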

Monday, January 22, 2018

A non-DBA guide on AWS Elasticsearch Service



Elasticsearch serves as the secondary storage in my production system. Transactional data is ultimately stored in RDS Postgres, but Elasticsearch is employed for its superior query features and performance. And because I have a long-lived, growing index, I have experienced growing pains with Elasticsearch (and with every other part of the system; it's a Sisyphean show) that made me look into AWS Elasticsearch Service (AWS ESS) in greater detail. I am not covering rolling indices (a continuous data flow with an indexing period and retention window), which is another great application of Elasticsearch but not my use case. I have spent more time on this than I should have, because a definitive answer to ES scalability is complicated. When the official guidance from Elastic itself is "it depends", I don't think I have a good shot at coming out correct.

Unlike Redis, ES can't be scaled effectively with a black-box approach. Each component of ES is scaled differently. Any ES cluster can be described as follows:
Source: qbox.io
  • Cluster: ES is meant to be run in a cluster. Every instance of AWS ESS is cluster-ready even with only one node.
  • Node: A single ES instance. In AWS ESS, each node runs on a separate EC2 instance.
  • Index: A collection of documents. Comparing an ES index to an SQL database will very soon be a bad analogy (read more here). It is the highest level of data organization in ES.
  • Shard: As with other distributed storage systems, an index is split into shards that are distributed across multiple nodes. ES automatically arranges and balances these shards.
  • Replica: A read-only replication of a shard.
I assessed the scalability of AWS ESS from the smallest to the largest element of a cluster, and here are the findings.

1. Shard

Scaling at this level involves the size of each shard and the number of shards in the cluster.
AWS's rule of thumb is to keep shard size below 50GB per shard, but that might as well be treated as the upper limit. In practice, the best shard size is between 10-30GB in a stable cluster, or 10-20GB when the cluster size is subject to frequent change.
The number of shards in a cluster determines the maximum size of the cluster and its performance. A node can hold multiple shards, but a shard can't be split further; it has to reside on a single node. If a single-node cluster initially has 10 shards, its maximum growth is 10 nodes, each serving a single shard. The reason people don't create a hundred-shard cluster from day one is that the more shards a cluster has, the more it is taxed by communication between shards. The ideal number of shards strikes a balance between leaving growth space for the dataset and avoiding excessive communication between shards. I suggest allocating 3-5 times as many shards as nodes in the initial setup.

2. Replica

In ES, a replica contributes to fail-over, but its primary role is to boost search performance. It takes some of the query load off the primary shards. Secondly, instead of scanning all nodes in the cluster for a query, the presence of replicas allows traversing fewer nodes while still ensuring all shards are scanned. Pushed to an extreme: with n nodes, I can have n shards and n-1 replicas, which means a query never needs to traverse more than one node. However, it also means there is no scaling out/in for the cluster, since each node has to be big enough for the entire dataset. Not recommended, gotta love the speed though.

3. Node

Scaling a node is about choosing the right size for the machine, and this is surprisingly straightforward given ES's nature. ES runs on Java, so its performance is tied to the mighty JVM. The recommended JVM heap size for ES is 32GB, and the heap also can't be more than 50% of available memory; therefore the ideal memory for an ES instance is 64GB. This is also why I suggested a 30GB cap on shard size earlier, so that an entire shard can fit into memory. A machine with less or more memory is still perfectly functional; it is merely a matter of bang for the buck. I settled on scaling up my machines to 64GB RAM and scaling out after that. I still have to deal with a whopping 64GB of free memory (and its bill) whenever I scale out, so 32GB may be a more cost-conscious threshold. Meanwhile, I keep the extra memory busy with more replicas.

4. Cluster

Scaling an ES cluster is not simply adding more machines to it, but also understanding the setup's topology. On AWS ESS, the focus on usability strips away most of ES's topology configuration. The only significant configuration left is dedicated master nodes. These nodes perform cluster management tasks and give the data nodes some slack for stability. AWS's guide on this is comprehensible; I couldn't do a better job.

5. AWS ESS drawbacks

It wasn't until the previous point that AWS ESS's drawbacks compared to native ES emerged. In reality, there are more. Below is a list of things omitted by AWS ESS.
  • ES supports a third type of node: coordinating nodes. These nodes handle the scatter-gather phase of a query, where a request is first scattered to the data nodes and then gathered into a single result set. Not particularly popular in small setups, but completely off the table with ESS.
  • There are only HTTP connections. ES also supports TCP connections, which should be more favorable for JVM-based languages, receiving better support and cutting down on additional network complexity.
  • No in-place upgrades. Upgrading the ES version in AWS ESS is unnecessarily painful to do with zero downtime. Painful because it involves launching a new cluster, whitelisting it for reindexing, executing the reindexing, and then updating all services to point to the new cluster. Unnecessary because ES itself comes with in-place / rolling version upgrades.
  • Backup frequency is limited to only once a day, even though backups in Elasticsearch are supposed to be pretty cheap.
  • Security. One of the biggest reasons I haven't provided Kibana as a BI interface to my clients is that X-Pack is not supported.
  • Limited access to the rest of the ES ecosystem, which is growing fast and is a force to be reckoned with. No logs and no plugins are supported in AWS ESS. Cutting edge much?

TL;DR 

Provision initial storage at 10x the amount of data. Select an EC2 instance whose storage-to-memory ratio is 8:1. If that means more than 64GB of memory, take multiple 64GB machines. Set the number of shards to 3x the number of nodes. The minimum number of replicas is 1; add more if budget allows. Scale up the machines until they reach 64GB of memory, then scale out. Look for AWS ESS alternatives when the rush is over. I take no responsibility for your employment status.