Sunday, December 13, 2020

2020 - A year unlike any others

The post is dedicated to Parcel Perform’s engineering journey to profit through the pandemic - also known affectionately as the longest Tet ever by the locals. Other teams fought hard too and deserved their own stories. Opinions are my own and do not express that of my employer.

Patient zero

It was a day in March that Covid stopped being a passing disease in some corners of the world like SARS, Zika, or Ebola and started to be a reality to me. Vietnam got its first case in January. Sequoia’s letter to its founders and CEOs was making its way around the Internet. And WHO declared Covid a pandemic. 

I am not going to lie, despite the unease of entering a pandemic period, part of me was looking forward to it, fondly. I was ten when the dot-com boom took place. Vietnam was some remote corner of the world, known only for the war. The Internet was a foreign word. Mom brought home rice, canned food, and fish sauce in a basket to prepare for Y2K. Everything I know about the dot-com bubble, its magic and devastation, and how Internet giants emerged from its crumble was told to me. More than just an economic downturn, it was imprinted on me that not just surviving, but striving through a hard time is the ultimate test for me as an engineer, and the system I have built. All were thoughts and stories until that day in March. The old farts could move aside with their bubble. My generation had Covid.

The management acted swiftly and decisively. Changes were introduced to protect individuals and the company's sustainability. The flexible WFH policy had always been there. Teams were split to come to the office on alternative days. Meetings were moved online. Monthly townhalls were broadcast online with details of a break-even plan by end of the year. With everyone in lockdown, e-commerce activities - our main source of revenue - raised significantly. Vietnam also took aggressive measures to minimize the spread of the virus. The new changes in life both inside and outside of work were new and exciting. For a while, Covid had seemed like a challenge we could face with a sense of hope.

It was then the grim reality set in. As more customers worked online, our platform became the only way for them to stay on top of their logistics situation. The eyes were on us. The demand for service availability, something previously had not been as desirable as it could be, skyrocketed. In parallel, the virus spread at a speed that made the Mongolian hordes look like amateurs. People started to cite stories about the infamous Spanish Flu. Recovery estimation was changed from months to years. The virus was not the only thing in the air, fear also was. The investment market contracted, dried up, and imploded on its former optimistic self. No one wanted to bet on the uncertainty of what eventually became the biggest threat to humanity in this century (so far). We had to make changes to keep the remaining runway as long as they could be and planned for an unattractive funding round that we did not even know could happen.

By April, we found ourselves fighting bigger challenges, with a smaller and less effective workforce. Basically a sadistic role-play of the US health care system.

Is less more?

In software development, there is a sense of elitism of doing more with less (people). Whatsapp was sold with 55 employees, serving billions of messages every single day. Instagram with 13 employees, including the two founders. Markus Frind built Plenty of Fish solo for 6 years. Imagine what else these people could have done if they had had Asian parents! 

That wasn't exactly our story though. 

The break-even plan was broken down into monthly targets. We scored the first month, then got into a dry spell effectively the whole Q2. The B2B sales cycle was notoriously long, we had strong leads that took months to realize. The general anxiety about the future did not make anything better. The off-track plan placed a hiring freeze on us. Meanwhile, the unprecedented quarantine surge traffic and sales effort sent more work our way. We typically won customers by going the extra mile to make custom features and integrations. The work was not hard, but irritatingly time-consuming as most normal customers were not fluent in the programming if-this-then-what riddle.

  • The time on custom work kept us from working on the core system.
  • The under-invested core could not handle the surge in traffic, so here and there engineers got pulled out to help fire-fighting.
  • The looming committed deadlines were peril dishes for easy implementation whose architectural simplicity was an afterthought. The code became brittle and less welcomed for an extension, making custom work and fire fighting even more time-consuming.

It was a vicious circle that could very quickly deplete the team morale and leave everyone burnt out.

We sook to find the balance between feature work, and core system. We designated 90% of the development time to be split between making new features, adding customizations, and paying tech-debt, and the remaining 10% on whatever the team thought would be a future issue if left unchecked. That went exactly as good as Donald Trump’s plan for the pandemic: utter chaos stems from reality detachment.

The reality was that we had more on our plate than we could chew. We struggled to keep up with the delivery schedule before, we would not be able to with 10% less. And though the time invested on tech-debt would help us in the long run, investment took time, the time we could not afford. In the deadline frenzy, the 10% budget was a forbidden fruit, and development time was given to whatever made the biggest noise at the time. Sometimes it was the core system because nothing spoke louder than a system outage. But for the rest of the time, it was a customer-first policy. We needed that revenue stream to pull through the hard time. We were the homeless of software development trying to make a saving account.

Less is more does not come solely from the engineering side of things, it only thrives where it aligns with the whole business as a cohesive unit. Whatsapp, Instagram, and Plenty of Fish are consumer mobile apps that demand very little customization compared to the world of B2B that we are in. SAP has thousands of developers - not Techcrunch headline material - yet denying so is to deny the laws of physics.

There were some interesting leads that we kept hearing about, but otherwise, Q2 ended on an eerily uneventful note. Little did we know, this was the night before the storm.

Growing pains

Then came Q3 in a way we could not have expected. The traffic that has not slowed down in previous quarters then gained even more momentum. Malls in the US and EU were shutting down and likely remained so over the holiday season due to concerns about spreading the virus (it indeed happened). People turned their compulsive buying online. Good times if your business is centered around tracking and analyzing e-commerce shipments. There was only one little convenience. The data stream flooded through us with all the force of the mighty Mekong before people built all those dams over her.

Having invested in a horizontally-scale application layer, our journey to scalability was a walk in the park. Except for trees, the park had carnivorous ents, pedestrian Nazguls, and birds fell beasts. The squirrels were cool, they probably just had rabies. The walkthrough Mordor taught me about growing pains more than all my teenage years combined.

For months we wrestled with performance issues, always shoved into our hands at the most inconvenient moment. The embodiment of Murphy’s law, solidified by postmortems piled higher and deeper, stem from a series of adequacy:
  • Lack of imagination. As much as the growth was welcomed, we did not successfully foresee the full impact of such growth on the system.
  • Lack of experience and expertise. When incidents happened, we did not know what to do, not immediately, and not fluently executed. Various types of database lock, Kafka data corruption, and Flink zombie jobs all happened for the first time to all of us.
  • Lack of infrastructure investment. The list of tech-debt was ever-growing, and the development of internal tooling came too little and too late. 
There has always been a tug of war between building a solid system with imaginative traffic vs a house of cards with desirable features waiting to collapse on its own weight. Both compete on finite resources that are time and effort. A seasoned entrepreneur would say the latter is preferable as it indicates we have found a problem worth solving and people are willing to pay for the pain killer. But such knowledge offered little comfort as you were staring at the screen at two in the morning, with the weight of the entire system on the chest, feeling the cold sweeps in the limbs subduing other sensory leaving only numbness, getting angry with yourself and everyone and everything.

We had three system meltdowns for the three months of Q3. Each came in a magnitude that threatened the existence of Parcel Perform and made me question the decisions of my life. And when you do that three times in a row, you start to question your sanity too. But there has always been light at the end of a tunnel, no matter how long. With sheer efforts and a lot of hours staring at the screen and self-doubt - because hey we were lack of everything else - we pulled through. I wrote this piece to remind myself the fight is only over when I give up.

Performance issues suck really hard. Going through them is painful. And I wish them to never be a part of my life. But the growth which comes with the effort to overcome each and every performance challenge is undeniable. Every time we solve a problem, the system gets stronger and stands higher. We acquire knowledge we would not fathom before. Our process is taking steps to mature so we can fight together effectively and reserve the hard-earned knowledge, though we are a long way from the finish. The insurmountable mountains are less intimidating. Much of what we consider valuable in our world arises out of these kinds of challenges because the act of facing overwhelming odds produces greatness and beauty. Such is growth and pain at their worst and best combinations. 

We ended Q3, bleeding from the mouth, but confident to start hiring again.

The aftermath

2020 was a bizarre experience. The world was forced to accept a new reality for better or worse. We were forced to mature. Decisions in the back of our minds that we knew we would get there somehow were put in the first-row seat as the pandemic arrived.
  • We transformed the functional engineering team into cross-functional squads just days before the country entered lockdown.
  • We now have a CS team that collectively covers 24 hours of a day and a stronger presence in the EU while the common situation of the industry is to contract inwards.
  • Our SLA went from best-attempt to actual tangible values. A move that increased the excitement of the sales team ten folds, the exact amount it decreased from my engineering team. A striking example of conservation of happiness.
  • We ended up overshooting the break-even plan 2x.
But on the other hand, there is no denying that wherever we look we see improvement needed. We are trying to keep a system that has grown 4x since the beginning of the year stable. New components like pg_bouncer and setups like splitting the Flink cluster into single jobs were irreplaceably crucial. Yet the biggest pain points have shared the same pattern: the application logic we implemented a year ago can no longer handle the traffic we have today. We haven’t fully escaped the hideous Sisyphus circle. Patches of various degrees of permanence were introduced; still usually before we could fully complete the implementation and internalize the knowledge, another thing would happen. The little tech debts we eagerly took are coming back with compound interests.

And we are doing all these jugglings in the aftermath of a 6-month hiring freeze. We had had the same set of people working together from the beginning till the end of the pandemic (Vietnam calendar, we have been lucky). We worked intimately with all the services and developed an acute sense of troubleshooting. While that experience was pivotal in maintaining the system, our investment in the onboarding process, documentation, and internal tooling was insidiously slacked off. A person unfamiliar with the system would find what we have been working on in the last 6 months overwhelming. The participation of the new engineers post-hiring freeze was a waking call.

An exploding system and a high overhead of adding new members were a stressful combination. At times, we were probably just steps away from the death of a thousand cuts. Despite all that, there is a mutual feeling that the worst Covid has to offer was in the past and thing only gets better from here on. We have a strong team of young people who every day fight above their weight class. The debts we have are solvable with the right amount of time. We have a product that fits the market and has heaps of space to develop. When you manage to grow so much despite the turbulent time of 2020, fewer things can frighten you. We are looking at 2021 with much anticipation.