—
The post is dedicated to Parcel Perform’s engineering journey to profit through the pandemic - also known affectionately as the longest Tet ever by the locals. Other teams fought hard too and deserved their own stories. Opinions are my own and do not express that of my employer.
—
Patient zero
It was a day in March that Covid stopped being a passing disease in some corners of the world like SARS, Zika, or Ebola and started to be a reality to me. Vietnam got its first case in January. Sequoia’s letter to its founders and CEOs was making its way around the Internet. And WHO declared Covid a pandemic.
I am not going to lie, despite the unease of entering a pandemic period, part of me was looking forward to it, fondly. I was ten when the dot-com boom took place. Vietnam was some remote corner of the world, known only for the war. The Internet was a foreign word. Mom brought home rice, canned food, and fish sauce in a basket to prepare for Y2K. Everything I know about the dot-com bubble, its magic and devastation, and how Internet giants emerged from its crumble was told to me. More than just an economic downturn, it was imprinted on me that not just surviving, but striving through a hard time is the ultimate test for me as an engineer, and the system I have built. All were thoughts and stories until that day in March. The old farts could move aside with their bubble. My generation had Covid.
The management acted swiftly and decisively. Changes were introduced to protect individuals and the company's sustainability. The flexible WFH policy had always been there. Teams were split to come to the office on alternative days. Meetings were moved online. Monthly townhalls were broadcast online with details of a break-even plan by end of the year. With everyone in lockdown, e-commerce activities - our main source of revenue - raised significantly. Vietnam also took aggressive measures to minimize the spread of the virus. The new changes in life both inside and outside of work were new and exciting. For a while, Covid had seemed like a challenge we could face with a sense of hope.
It was then the grim reality set in. As more customers worked online, our platform became the only way for them to stay on top of their logistics situation. The eyes were on us. The demand for service availability, something previously had not been as desirable as it could be, skyrocketed. In parallel, the virus spread at a speed that made the Mongolian hordes look like amateurs. People started to cite stories about the infamous Spanish Flu. Recovery estimation was changed from months to years. The virus was not the only thing in the air, fear also was. The investment market contracted, dried up, and imploded on its former optimistic self. No one wanted to bet on the uncertainty of what eventually became the biggest threat to humanity in this century (so far). We had to make changes to keep the remaining runway as long as they could be and planned for an unattractive funding round that we did not even know could happen.
By April, we found ourselves fighting bigger challenges, with a smaller and less effective workforce. Basically a sadistic role-play of the US health care system.
Is less more?
In software development, there is a sense of elitism of doing more with less (people). Whatsapp was sold with 55 employees, serving billions of messages every single day. Instagram with 13 employees, including the two founders. Markus Frind built Plenty of Fish solo for 6 years. Imagine what else these people could have done if they had had Asian parents!
That wasn't exactly our story though.
The break-even plan was broken down into monthly targets. We scored the first month, then got into a dry spell effectively the whole Q2. The B2B sales cycle was notoriously long, we had strong leads that took months to realize. The general anxiety about the future did not make anything better. The off-track plan placed a hiring freeze on us. Meanwhile, the unprecedented quarantine surge traffic and sales effort sent more work our way. We typically won customers by going the extra mile to make custom features and integrations. The work was not hard, but irritatingly time-consuming as most normal customers were not fluent in the programming if-this-then-what riddle.
- The time on custom work kept us from working on the core system.
- The under-invested core could not handle the surge in traffic, so here and there engineers got pulled out to help fire-fighting.
- The looming committed deadlines were peril dishes for easy implementation whose architectural simplicity was an afterthought. The code became brittle and less welcomed for an extension, making custom work and fire fighting even more time-consuming.
It was a vicious circle that could very quickly deplete the team morale and leave everyone burnt out.
We sook to find the balance between feature work, and core system. We designated 90% of the development time to be split between making new features, adding customizations, and paying tech-debt, and the remaining 10% on whatever the team thought would be a future issue if left unchecked. That went exactly as good as Donald Trump’s plan for the pandemic: utter chaos stems from reality detachment.
The reality was that we had more on our plate than we could chew. We struggled to keep up with the delivery schedule before, we would not be able to with 10% less. And though the time invested on tech-debt would help us in the long run, investment took time, the time we could not afford. In the deadline frenzy, the 10% budget was a forbidden fruit, and development time was given to whatever made the biggest noise at the time. Sometimes it was the core system because nothing spoke louder than a system outage. But for the rest of the time, it was a customer-first policy. We needed that revenue stream to pull through the hard time. We were the homeless of software development trying to make a saving account.
Less is more does not come solely from the engineering side of things, it only thrives where it aligns with the whole business as a cohesive unit. Whatsapp, Instagram, and Plenty of Fish are consumer mobile apps that demand very little customization compared to the world of B2B that we are in. SAP has thousands of developers - not Techcrunch headline material - yet denying so is to deny the laws of physics.
There were some interesting leads that we kept hearing about, but otherwise, Q2 ended on an eerily uneventful note. Little did we know, this was the night before the storm.
Growing pains
Then came Q3 in a way we could not have expected. The traffic that has not slowed down in previous quarters then gained even more momentum. Malls in the US and EU were shutting down and likely remained so over the holiday season due to concerns about spreading the virus (it indeed happened). People turned their compulsive buying online. Good times if your business is centered around tracking and analyzing e-commerce shipments. There was only one little convenience. The data stream flooded through us with all the force of the mighty Mekong before people built all those dams over her.
Having invested in a horizontally-scale application layer, our journey to scalability was a walk in the park. Except for trees, the park had carnivorous ents, pedestrian Nazguls, and birds fell beasts. The squirrels were cool, they probably just had rabies. The walkthrough Mordor taught me about growing pains more than all my teenage years combined.
- Lack of imagination. As much as the growth was welcomed, we did not successfully foresee the full impact of such growth on the system.
- Lack of experience and expertise. When incidents happened, we did not know what to do, not immediately, and not fluently executed. Various types of database lock, Kafka data corruption, and Flink zombie jobs all happened for the first time to all of us.
- Lack of infrastructure investment. The list of tech-debt was ever-growing, and the development of internal tooling came too little and too late.
The aftermath
- We transformed the functional engineering team into cross-functional squads just days before the country entered lockdown.
- We now have a CS team that collectively covers 24 hours of a day and a stronger presence in the EU while the common situation of the industry is to contract inwards.
- Our SLA went from best-attempt to actual tangible values. A move that increased the excitement of the sales team ten folds, the exact amount it decreased from my engineering team. A striking example of conservation of happiness.
- We ended up overshooting the break-even plan 2x.