Saturday, May 1, 2021

Lessons from a promotion

My tech team is organized into squads - cross-functional teams owning end-to-end feature development. The squad leadership is a joint collaboration of a product owner, a project manager, and a tech lead. As the business expands, more squads are needed and it falls to me to fill these new tech lead vacancies. To facilitate professional growth, internal promotion is favored over external hires. Along the way, I learned a few lessons.

It starts with a job description

The one thing that makes or breaks the promotion is the job description. A vertical promotion from junior to senior involves performing harder and grander tasks with tighter deadlines, in other words, being more proficient at what you have already been doing. A promotion to a leadership position is more “horizontal” in that sense. Pretty much like the time you left high school, I don’t think any amount of prior experience can truly prepare one for what comes next.

Our tech leads work with people across all disciplines to provide a cohesive technical vision for the squad, contribute to the product strategy, and coach their members. In that role, many activities are new to them. They will be working with people whose functions they haven’t fully comprehended, like a tech lead with an FE background working with DevOps on a deployment plan. They are asked for estimations while given far fewer details than what they received pre-promotion. They are exposed to HR matters around the well-being of their crew, not all of which make everyone happy. And just sometimes they have the trauma of having their handcrafted solution taken out of context for an entirely different thing, with 3 days to deliver. Given the drastic change in scope of work (and the PTSD), it is understandable that post-promotion, some feel like a fish out of water. Unfortunately, if there is a structural approach to eliminate this sense of disorientation, I haven’t found it yet.

While it is tempting to propose a five-page-long job description listing every little detail one is supposed to perform, and hence solve the challenge once and for all, that managerial wet dream is nothing more than a motivational debt. Software development rewards people for their creative prowess and that in turn attracts great problem solvers to the craft. Practically spelling out what one needs to do is the opposite of that. The job description should enable the person to picture the boundaries of her authority and the impact she has on the team without resorting to dictating the specific activities. Everyone will have different responses to “make tactical moves to ensure successful deliveries”, or “look after the career development of team members”, and that’s part of the growth. Take that, Tiger Mom!

Strength in diversity

In the previous year, the squad model had some initial successes. The first two squads jelled and performed well, relatively uneventfully. Structurally, both were mirror images of each other: BE-heavy, big data focus, led by old-timers. So when it came to the next new squad, there was a strong urge to copy the earlier success: same leadership profile, same structure, and same kind of work. That should be easy: the management knows what to do, the promoted people have existing role models to follow, and things probably fall into the right place like they had done before. That was as close to a squad printer as I could think of.

In reality, my third squad was FE-heavy, had a strong interest in UI/UX topics, and had a product owner stationed away from the main body of the team. It couldn’t be any more different from the former two. I am glad that this happened.

A parthenogenetic offspring of a squad would have been an easy choice down a slippery slope.

I didn’t realize it at the time, but collectively the technical discussion had already leaned towards the server side of things more than it should. It is normal that individually each of us turns our face towards what we know and against what we don’t. But it gets dangerous when we all turn towards the same thing: we become ignorant of our faults and prejudices. In OOP, that is known as closed for modification, and closed for extension too.

The identical leadership profile would also send the wrong kind of signal, that one has to be X and work in Y to get promoted. Everyone with a different profile probably feels as unappreciated as a 40-year-old on Snapchat and takes their chance elsewhere.

With the birth of the third squad, I got to learn the importance of a design system, the vast untapped advancement of browser technologies, and the bias in BE-FE collaboration. All these are areas of improvement that wouldn’t have surfaced if we had gone down the easy path and promoted yet another BE engineer. It itches me to sound like a social justice warrior, but we did find new powers in diversity.

The support structure

No, the new squad was not released to the wild to fend for itself. That would have been bogus.

In fact, the support structure was the one area that received the most attention back in the squad formation. We defined the 3-prong structure where product, technology, and agenda support each other. The right people were handpicked for the backbone and the remaining vacancies received the highest recruitment priority. A meeting plan was laid out so everyone had multiple outlets to discuss their opinions.

The support structure was the least of my concerns, till something hit me in the face, something technical yet also... sociological.

The new squad got its people and work split from the two existing ones, like a cell division. Hence it found itself co-contributing to a number of code repos with the others. That led to some confusion about where its realm of existence started and ended. The team operated with a constant fear of stepping on someone else’s foot. The organizational structure had changed and the change had not been reflected in the software interfaces. Wham! It was such a classic case of Conway’s law that I was awed to observe it first hand, yet hurt for not seeing it coming earlier. The law was one of my favorite engineering observations, right up there with Murphy’s.

The following rectification was relatively straightforward. We educated people about the boundaries of squads, brought in service contracts to strengthen the interfaces between them, and proceeded to split shared services into smaller ones where it made sense.
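To make the idea of a service contract between squads a little more concrete, here is a minimal sketch. It is not our actual implementation; the contract, the field names, and the `contract_violations` helper are all hypothetical, and it assumes the consuming squad simply lists the fields and types it relies on from a service owned by another squad.

```python
from typing import Any

# Hypothetical contract: the fields (and types) a consuming squad relies on
# from a shared shipment service owned by another squad.
SHIPMENT_CONTRACT: dict[str, type] = {
    "tracking_number": str,
    "status": str,
    "event_count": int,
}


def contract_violations(payload: dict[str, Any], contract: dict[str, type]) -> list[str]:
    """Return human-readable violations; an empty list means the payload conforms."""
    violations = []
    for name, expected_type in contract.items():
        if name not in payload:
            violations.append(f"missing field: {name}")
        elif not isinstance(payload[name], expected_type):
            violations.append(
                f"wrong type for {name}: expected {expected_type.__name__}, "
                f"got {type(payload[name]).__name__}"
            )
    return violations


if __name__ == "__main__":
    # Extra fields are tolerated; only what the consumer depends on is checked.
    conforming = {"tracking_number": "PP123", "status": "delivered", "event_count": 4, "extra": True}
    broken = {"tracking_number": "PP123", "event_count": "4"}
    print(contract_violations(conforming, SHIPMENT_CONTRACT))
    print(contract_violations(broken, SHIPMENT_CONTRACT))
```

Run in the provider's test suite, a check like this lets the owning squad change internals freely while being told early when a change would break a neighbor.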

A personalized journey

Accompanying the tech leads on their way through the aforementioned obstacles was a rewarding experience, but easy it was not. There are many questions yet few definite answers: how much tech debt is too much, or when an internal tool should be built. There is much variation in preference: some are more than happy to deal with abstraction, while others are keen on a transparent view. And there is much uncertainty ahead that no plan can account for: how does one account for spending a month waiting for an engineer to onboard, just to have the guy quit a day before his start?

It is probably apparent now that I haven’t done enough of this to actually know what I am doing. But I am experienced enough in software development to deal with uncertainties. And I am invested in getting it to work for my team.

The Agile Manifesto has it that

Individuals and interactions over processes and tools
Working software over comprehensive documentation
Customer collaboration over contract negotiation
Responding to change over following a plan

They are good guidelines to drive our weekly catch-ups. We keep a very experimental approach to what we are doing, and maintain a close feedback loop. What works is replicated elsewhere; what doesn’t is studied. But most importantly, I always try to be a thought partner throughout this journey. Interactively growing a team where people are collaborative and open about their problems today is more important than having it down to a science with a rigid plan tomorrow.

If one day I have a toddler, I ain’t no need for books.


Sunday, December 13, 2020

2020 - A year unlike any other

This post is dedicated to Parcel Perform’s engineering journey to profit through the pandemic - also known affectionately as the longest Tet ever by the locals. Other teams fought hard too and deserve their own stories. Opinions are my own and do not express those of my employer.

Patient zero

It was a day in March that Covid stopped being a passing disease in some corners of the world like SARS, Zika, or Ebola and started to be a reality to me. Vietnam got its first case in January. Sequoia’s letter to its founders and CEOs was making its way around the Internet. And WHO declared Covid a pandemic. 

I am not going to lie, despite the unease of entering a pandemic period, part of me was looking forward to it, fondly. I was ten when the dot-com boom took place. Vietnam was some remote corner of the world, known only for the war. The Internet was a foreign word. Mom brought home rice, canned food, and fish sauce in a basket to prepare for Y2K. Everything I know about the dot-com bubble, its magic and devastation, and how Internet giants emerged from its rubble was told to me. More than just an economic downturn, it was imprinted on me that not just surviving, but thriving through a hard time is the ultimate test for me as an engineer, and for the system I have built. All were thoughts and stories until that day in March. The old farts could move aside with their bubble. My generation had Covid.

The management acted swiftly and decisively. Changes were introduced to protect individuals and the company's sustainability. The flexible WFH policy had always been there. Teams were split to come to the office on alternate days. Meetings were moved online. Monthly townhalls were broadcast online with details of a plan to break even by the end of the year. With everyone in lockdown, e-commerce activities - our main source of revenue - rose significantly. Vietnam also took aggressive measures to minimize the spread of the virus. The changes in life both inside and outside of work were new and exciting. For a while, Covid had seemed like a challenge we could face with a sense of hope.

It was then that the grim reality set in. As more customers worked online, our platform became the only way for them to stay on top of their logistics situation. The eyes were on us. The demand for service availability, something that previously had not been as good as it could have been, skyrocketed. In parallel, the virus spread at a speed that made the Mongolian hordes look like amateurs. People started to cite stories about the infamous Spanish Flu. Recovery estimates changed from months to years. The virus was not the only thing in the air; fear was too. The investment market contracted, dried up, imploded on its former optimistic self. No one wanted to bet on the uncertainty of what eventually became the biggest threat to humanity in this century (so far). We had to make changes to keep the remaining runway as long as it could be and plan for an unattractive funding round that we did not even know could happen.

By April, we found ourselves fighting bigger challenges, with a smaller and less effective workforce. Basically a sadistic role-play of the US health care system.

Is less more?

In software development, there is a sense of elitism about doing more with less (people). WhatsApp was sold with 55 employees, serving billions of messages every single day. Instagram with 13 employees, including the two founders. Markus Frind built Plenty of Fish solo for 6 years. Imagine what else these people could have done if they had had Asian parents!

That wasn't exactly our story though. 

The break-even plan was broken down into monthly targets. We scored the first month, then got into a dry spell for effectively the whole of Q2. The B2B sales cycle was notoriously long; we had strong leads that took months to realize. The general anxiety about the future did not make anything better. The off-track plan placed a hiring freeze on us. Meanwhile, the unprecedented surge in quarantine traffic and the sales effort sent more work our way. We typically won customers by going the extra mile to make custom features and integrations. The work was not hard, but irritatingly time-consuming as most normal customers were not fluent in the programming if-this-then-what riddle.

  • The time on custom work kept us from working on the core system.
  • The under-invested core could not handle the surge in traffic, so here and there engineers got pulled out to help with firefighting.
  • The looming committed deadlines were petri dishes for quick implementations whose architectural simplicity was an afterthought. The code became brittle and less welcoming to extension, making custom work and firefighting even more time-consuming.

It was a vicious circle that could very quickly deplete the team morale and leave everyone burnt out.

We sought to find the balance between feature work and the core system. We designated 90% of the development time to be split between making new features, adding customizations, and paying tech-debt, and the remaining 10% to whatever the team thought would be a future issue if left unchecked. That went exactly as well as Donald Trump’s plan for the pandemic: utter chaos stemming from detachment from reality.

The reality was that we had more on our plate than we could chew. We had struggled to keep up with the delivery schedule before; we would not be able to with 10% less. And though the time invested in tech-debt would help us in the long run, investment took time, the time we could not afford. In the deadline frenzy, the 10% budget was a forbidden fruit; development time was given to whatever made the biggest noise at the time. Sometimes it was the core system because nothing spoke louder than a system outage. But for the rest of the time, it was a customer-first policy. We needed that revenue stream to pull through the hard time. We were the homeless of software development trying to open a savings account.

Less is more does not come solely from the engineering side of things; it only thrives where it aligns with the whole business as a cohesive unit. WhatsApp, Instagram, and Plenty of Fish are consumer mobile apps that demand very little customization compared to the world of B2B that we are in. SAP has thousands of developers - not TechCrunch headline material - yet to deny that it needs them is to deny the laws of physics.

There were some interesting leads that we kept hearing about, but otherwise, Q2 ended on an eerily uneventful note. Little did we know, this was the calm before the storm.

Growing pains

Then came Q3 in a way we could not have expected. The traffic that had not slowed down in previous quarters gained even more momentum. Malls in the US and EU were shutting down and would likely remain so over the holiday season due to concerns about spreading the virus (it indeed happened). People turned their compulsive buying online. Good times if your business centered around tracking and analyzing e-commerce shipments. There was only one little inconvenience. The data stream flooded through us with all the force of the mighty Mekong before people built all those dams over her.

Having invested in a horizontally scalable application layer, our journey to scalability was a walk in the park. Except that instead of trees, the park had carnivorous Ents; instead of pedestrians, Nazgûl; and instead of birds, fell beasts. The squirrels were cool, they probably just had rabies. The walk through Mordor taught me more about growing pains than all my teenage years combined.

For months we wrestled with performance issues, always shoved into our hands at the most inconvenient moment. The embodiment of Murphy’s law, solidified by postmortems piled higher and deeper, stemmed from a series of inadequacies:
  • Lack of imagination. As much as the growth was welcomed, we did not successfully foresee the full impact of such growth on the system.
  • Lack of experience and expertise. When incidents happened, we did not know what to do, at least not immediately, and we did not execute fluently. Various types of database locks, Kafka data corruption, and Flink zombie jobs all happened for the first time to all of us.
  • Lack of infrastructure investment. The list of tech-debt was ever-growing, and the development of internal tooling came too little and too late. 
There has always been a tug of war between building a solid system for imaginary traffic vs a house of cards with desirable features waiting to collapse under its own weight. Both compete for the finite resources that are time and effort. A seasoned entrepreneur would say the latter is preferable as it indicates we have found a problem worth solving and people willing to pay for the painkiller. But such knowledge offered little comfort when you were staring at the screen at two in the morning, with the weight of the entire system on your chest, feeling a cold creep through your limbs that subdued every other sensation and left only numbness, getting angry with yourself and everyone and everything.

We had three system meltdowns in the three months of Q3. Each came at a magnitude that threatened the existence of Parcel Perform and made me question the decisions of my life. And when you do that three times in a row, you start to question your sanity too. But there has always been light at the end of the tunnel, no matter how long. With sheer effort, a lot of hours staring at the screen, and self-doubt - because hey, we lacked everything else - we pulled through. I wrote this piece to remind myself the fight is only over when I give up.

Performance issues suck really hard. Going through them is painful. And I wish them to never be a part of my life. But the growth which comes with the effort to overcome each and every performance challenge is undeniable. Every time we solve a problem, the system gets stronger and stands higher. We acquire knowledge we could not have fathomed before. Our process is taking steps to mature so we can fight together effectively and preserve the hard-earned knowledge, though we are a long way from the finish. The insurmountable mountains are less intimidating. Much of what we consider valuable in our world arises out of these kinds of challenges because the act of facing overwhelming odds produces greatness and beauty. Such is growth and pain at their worst and best combinations.

We ended Q3, bleeding from the mouth, but confident to start hiring again.

The aftermath

2020 was a bizarre experience. The world was forced to accept a new reality for better or worse. We were forced to mature. Decisions that had sat in the back of the mind, ones we knew we would get to somehow, were moved to a front-row seat as the pandemic arrived.
  • We transformed the functional engineering team into cross-functional squads just days before the country entered lockdown.
  • We now have a CS team that collectively covers 24 hours of a day and a stronger presence in the EU while the common situation of the industry is to contract inwards.
  • Our SLA went from best-attempt to actual tangible values. A move that increased the excitement of the sales team tenfold, the exact amount by which it decreased that of my engineering team. A striking example of conservation of happiness.
  • We ended up overshooting the break-even plan by 2x.
But on the other hand, there is no denying that wherever we look we see improvement needed. We are trying to keep stable a system that has grown 4x since the beginning of the year. New components like pg_bouncer and setups like splitting the Flink cluster into single jobs were irreplaceably crucial. Yet the biggest pain points have shared the same pattern: the application logic we implemented a year ago can no longer handle the traffic we have today. We haven’t fully escaped the hideous Sisyphean cycle. Patches of various degrees of permanence were introduced; still, usually before we could fully complete the implementation and internalize the knowledge, another thing would happen. The little tech debts we eagerly took on are coming back with compound interest.

And we are doing all this juggling in the aftermath of a 6-month hiring freeze. We had had the same set of people working together from the beginning till the end of the pandemic (Vietnam calendar, we have been lucky). We worked intimately with all the services and developed an acute sense of troubleshooting. While that experience was pivotal in maintaining the system, our investment in the onboarding process, documentation, and internal tooling insidiously slacked off. A person unfamiliar with the system would find what we have been working on in the last 6 months overwhelming. The participation of the new engineers after the hiring freeze was a wake-up call.

An exploding system and a high overhead of adding new members were a stressful combination. At times, we were probably just steps away from the death of a thousand cuts. Despite all that, there is a mutual feeling that the worst Covid has to offer is in the past and things will only get better from here on. We have a strong team of young people who every day punch above their weight class. The debts we have are solvable with the right amount of time. We have a product that fits the market and has heaps of space to develop. When you manage to grow so much despite the turbulent time of 2020, few things can frighten you. We are looking at 2021 with much anticipation.

Saturday, August 15, 2020

Don't give up

Nearly two in the morning. Outside the office, a couple is still deep in conversation. Even with the light spilling out of the 7-11, my eyes keep drooping shut and I can't make out their faces. A few days ago, several large customers started using the service and the volume of data jumped sharply. The system was like a single-story house standing in the path of a force-8 storm, leaking in countless places. Hopefully this is the last night. The team has resolved many of the issues. Sleep will come back after these days of restless watching, like minding a newborn.

I have worked at a startup for a few years now. Working here feels like driving a Soviet-era train on rails that do not exist yet. The train lays its track as it runs, out of whatever can be found nearby. Sometimes the building goes fast, sometimes slow. The track goes up and down. But the most important thing is that the train has to keep running.

Short on people, late to start, and low on money, yet still doing better than the competitors: there is no simpler or stronger way to say "I am good". On those days, you walk on clouds and the world is yours alone. And then there are days like today: body tensed, head bowed, eyes unable to rise above the desk. Work is one long chain of stupid mistakes.

When I was little, my parents used to say that when I grew up I would be able to do this and that. Ride a bike by the start of first grade. Run the March 26 camp in high school. Be independent at university. As if there were magic switches inside: reach the right age and a switch flips, and you understand gigantic systems, see through the ways of the world, and attain nirvana. In exactly that order.

The thing is, after two failed startups, not a single switch has flipped. Only the work has gotten harder. Often I am scared, like a swimmer far from shore afraid of drowning, wanting only to turn back so that all this pressure disappears. No more calls in the middle of the night. No more long nights alone in front of the screen, feeling my heartbeat quicken under my skin. No more tearing my hair out, helpless before all the whys. But a startup comes with many constraints. There is no safety net. If I let go now because it is too hard, there is no one behind me to do it either.

From the first line of code, when we struggled to handle 20k requests on a tiny virtual machine, to now, with a few terabytes flowing in and out every day, the system and the people around it have gone through puberty countless times, and have even died and come back to life. Every time, it was by not giving up that we found a way out.

There is no secret recipe that is always right for the problems of a complex system. What matters is patience, and not being too hard on yourself. Being able to see the chain of stupid mistakes is already the first step. Solve one problem, and the train runs for a day. Solve another, and it runs one more day. Then the third problem, the fourth, the fifth. Reach the end of the stupid chain, and you have arrived. Either that, or you fail and get a juicy blog post out of it on the road to fighting ignorance. If you stay safe and happy with tiny, pretty projects, how will you ever withstand big waves and strong winds?

Perhaps that is the final switch, and it was flipped long ago.

Sleep is for the weak.

I am weak.

P.S.: After I drafted the idea for this blog post, my system lost Kafka, for the first time in nearly 5 years. It took four more tense hours to resolve. A testament that an IT system only exists between outages, and that there is no such thing as the last outage.

Saturday, April 25, 2020

Building a postmortem culture


\ ˌpōs(t)-ˈmȯr-təm \ 
1. Autopsy
    A postmortem showed that the man had been poisoned.
2. An analysis or discussion of an event after it is over
    The blameful postmortem culture shuts down the exploration of the problem because no one wants to be seen as stupid, even if it's ignoring the clear truth.

A postmortem is a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and the follow-up actions to prevent the incident from recurring. The postmortem concept is well known in the technology industry.
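The parts named in that definition can be sketched as a small record type. This is only an illustration under my own assumptions, not a standard schema: the `Postmortem` class, its field names, and the `missing_sections` helper are hypothetical, chosen to mirror the definition above.

```python
from dataclasses import dataclass, field


@dataclass
class Postmortem:
    """Hypothetical record mirroring the parts of a postmortem listed above."""

    incident: str
    impact: str
    mitigation_actions: list[str] = field(default_factory=list)
    root_causes: list[str] = field(default_factory=list)
    follow_up_actions: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Names of sections that are still empty, in declaration order."""
        return [name for name, value in vars(self).items() if not value]


if __name__ == "__main__":
    draft = Postmortem(
        incident="Checkout API returned errors for 40 minutes",
        impact="Some customers could not see tracking updates",
    )
    # A draft with only the incident and impact filled in still lacks three sections.
    print(draft.missing_sections())
```

A check like `missing_sections` is handy as a gentle reminder during review; it says nothing about the quality of the root cause analysis, which still takes people.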

I picked up the concept of postmortem from my previous job at Silicon Straits Saigon. The idea that we could study an incident was there, but the guidelines and culture enforcement were weak. So though I was sold that postmortem was a powerful practice and that, with proper enforcement, it made a system more robust over time, I didn't exactly know how to start a culture around it. The most concrete guidelines I received were from Site Reliability Engineering. Wherever the Google practices seemed too extreme or impractical in my context, there was the Internet. The knowledge was powerful and enlightening, and I appreciated the journey in the last 6 months to transform it into operational HOWTOs.

From the very beginning, I was aware that a postmortem culture needed to be a joint effort of the entire organization for it to be effective. And I was never interested in being a secretary. But like many other initiatives that involve other people, you can't just make an announcement and expect things to happen, magically. I tried. A few times. So in the beginning it was just me recording the incidents in which I was a part of either the solution or the problem. Most of the time both. And that gave me the time and experience I needed to make calibrations to the plan before it was presented to everyone.

Work to take blame out of the process

Blame, both the act of blaming and the fear of being blamed, is the enemy of a productive postmortem culture. If a culture of finger-pointing and shaming individuals or teams for doing the "wrong" thing prevails, people will not bring issues to light for fear of punishment, or will stop investigations prematurely as soon as a "culprit" is identified. That halts the development of preventive methods for the same situation in the future. The force to blame is formidable; we as human beings are wired for it. Dr. Brené Brown, in a TED talk, explained that blame exists as "a way to discharge pain and discomfort". The fact that whenever you want to trace back whose code caused your miserable wake-up at two in the morning for a system outage, the command says `git blame` certainly doesn't help.

This is where being blameless gets popular in postmortem literature. And if there is anything subjective to an objective piece of work that is a postmortem, this is it. I find being completely blameless hard to implement. On the one hand, a postmortem is simply not a place to vent frustration. On the other hand, at times, it feels like tip-toeing around people, so worried about triggering their fragile souls that you miss out on chances to call out where and how services can be improved. This is where I come to an agreement with J. Paul Reed that it is important to acknowledge the human tendency to blame, allow a productive form of its expression, and yet constantly refocus to go beyond it.

Here are some examples. The examples may or may not involve me, in scenarios that may or may not have actually happened.

Blameful:
Someone pushed bad code to production via the emergency pipeline. The tests in CICD could have caught this, but someone thought he knew better. Seriously, if you aren't sure what you are doing, you shouldn't act so recklessly. Rolling back in the middle of the night is a waste of time.
Action items:
  • Think before you edit someone's code.

Completely blameless:
Last night, unauthorized code was pushed to production. CICD was skipped because CICD takes 30min and it was a firefighting situation. The fix was not compatible with a recent refactor.
Action items:
  • Improve CICD speed

Blame-aware:
Last night, unauthorized code was pushed to production. CICD was skipped because CICD takes 30min and it was a firefighting situation. The fix was not compatible with a recent refactor.
Action items:
  • Improve CICD speed
  • Infrequent contributors should use the safety net of CICD
  • Issues in a service need escalating to maintainer of the repo for code review
  • Rollback mechanism needs to be available to developers on pager duty.
I felt like in the examples above, without accepting that pushing code to an unfamiliar service in the middle of the night was a reckless action, we would miss the chance to put in preventive measures. But again it is subjective, perhaps my blame-aware version fits perfectly into a blameless version of another. I hope you get the point.

Work on some guidelines

It is useful to be as specific as possible about when a postmortem is expected, who should write it, what should be written, and what the goals of the record are. Not only does it provide a level of consistency across your organization, but it also prevents the task of writing a postmortem from being seen as a whimsical assignment from some higher-level authority, with someone being picked on as a punishment for doing the "wrong" thing.

Some of my personal notes on the matter:
  • Different teams might have different sets of postmortem triggers. The more critical your function is, the more detailed the triggers should be.
  • People who caused the incident might or might not be the ones to write the postmortem. The choice should be based on the level of contribution the person has to offer, both in terms of context and knowledge, not because of his previous actions.
  • Be patient. The people you are working with are professionals in software development, but the ability to write good software does not translate into the ability to write a good document. Quality of root cause analysis prevails over eloquence. Save the latter for your blog.

Work on the impact to your audience

In the beginning, the incidents I was working on were about a database migration to Aurora (and a hasty fallback), so I was assuming my audience would be my fellow developers, possibly extending to project managers; project managers love knowing why you are stealing time from their team. And a reasonable consequence was to write the postmortems in markdown and store them in the same code repo as the affected services. There were a few issues with that.

Firstly, in a technology startup, the scope of tech choice is always bigger than "just the tech team". In my case, Customer Success people need to know what the impacts on our customers were and are, Product people want to know if the choice comes with new possibilities, and Sales people want to sell those possibilities. As much as I love such integration between the developers and the rest of the company, the idea of granting universal access to code repos to view markdown postmortems terrifies my boss, and therefore subsequently me, obviously.

Secondly, as familiar as markdown is, it is not a very productive option if you want to include media in it. And we want charts of various system metrics during the time of the incident to be included in the postmortem.

Lastly, writing a postmortem is gradually becoming a collaborative effort, and a git repo, though it supports collaboration, does not do it in real-time.

Considering all the options, we finally settled with a shared Google Drive in the company account. It is neither techie nor fancy. But it allows very flexible accessibility, tracks versions, natively supports embedded media, and lets multiple people collaborate in real-time. We share our postmortems in a company-wide channel, and sometimes hold an additional presentation for particularly interesting ones.

Let it grow

When you have done your homework, built a foundation of trust and safety, laid out the guidelines and constantly improved them, and integrated the postmortems with your larger audience, it is probably time to take a step back and let the culture take a spin on its own. My company's postmortem culture won't be the same as Google's no matter how many Google books I read. And as long as it works for us, it doesn't matter.

With some gentle nudges, my colleagues are picking up postmortems on their own. We have seen contributions from Product Owners and Project Managers, besides the traditional developers and DevOps contributors. The findings are anticipated by a large audience across the company.

And in the latest incident, which involved the degradation of performance in a few key features of our SaaS offering over the course of a week, we identified another use for a postmortem. A postmortem updated regularly with the latest incident reports, findings, and potential impacts in a near real-time fashion is a powerful communication tool across the organization. It both ensures the flow of information to people who need it (CS to answer questions, PM to change project plans for urgent hotfixes, etc.) and allows developers to focus on their critical work without frequent interruptions.

As we grow and our system gets more sophisticated, hopefully, the constructive postmortem culture will turn out to be a solid building block.

Thursday, January 16, 2020

Run, Forrest, Run!

Tl;dr: I ran my first marathon, and whined about it. Move on.

4 years after finishing my first half marathon, I finally did my first full marathon, 42k of sweat and pain. 2019 was horrible for me; through all the ups and downs, the marathon plan was one of the few things that kept me together. The cutoff time was 7 hours. I wanted to do a sub-5 (complete the run under 5 hours) but ended up with a sub-6. I was squarely in the bottom quarter of my age group. So it wasn't all glory and stuff, but I am so glad I did it.

I must have started the training back in March or something, and didn't follow the training plan through and through, obviously. I got sick, which paused the plan by a week every time it happened. I got injuries that eventually put me out of action for a whole month. And when I was back, following the original training plan just gave me too much stress and guilt, which I certainly didn't need - my life was really low, so I forwent it and just ran whatever the fuck I wanted. That was probably 2 months ago.

The injuries were actually a blessing in disguise. They forced me to rethink my running form. I picked up a book on running (that is not Murakami's autobiography) and tried to avoid "common sense" misconceptions, like most notably, landing on your whole foot. I finished 42k without any injuries. Yay!

The day I got the bib, it came with a shock. I was put into the 30-39 age group. Technically, it is not my birthday yet. And despite all the talk, I was not mentally prepared for this. Ouch! Oh and I also got interviewed.

I had never run the full distance prior to the run which, in retrospect, wasn't a great idea. I now believe that the body will be prepared for an extra few km on top of the maximum distance you have covered, but not by a long shot. And it makes sense: why would my body be ready for 100k if I have never run 50k? The longest I had done was 30k, and that explains why I got such bad cramps from 34k on.

It was also the first run for which I got proper sleep the night before. And I wasn't hungry. I sure stuffed myself with loads of carbs, so full that on the night before the run, I thought it was stupid, I couldn't possibly run with such a stomach. But above all, shout out to the organizers: the route had more than sufficient water, electrolytes, and bananas.

One last thing, the Nike app has improved a lot between then and now. It is no longer off by 30% and comes with cooler features. Well done Nike.


If that hasn't bored you out of your skull yet, you might want to see how my run broke down. "How did you remember all of this?" - I knew I was gonna write one of these post-event, so it wasn't that hard. And I made up all the bits I didn't remember, including that I ran at all. Bwahaha.

Starting line: That's right, 42k is the first wave, the first-class citizen of a marathon. With all the volunteers standing around and looking, the limelight feels good. Wait, hang on. It's already 4. Why aren't we starting? Technical issue? Great, I am trying to get some work-life balance and here I am, with bugs.

0km: 10 minutes in, here we go guys!!! Let me just start my run on the Nike mobile app. Fuck fuck fuck. I dropped an energy bar while shoving the phone back into the running belt. Screw it, I am not fighting against a wave of runners for a stupid bar. What a start.

1km: An old man with a Vietnam flag on his back is making a crude joke that a bunch of fit men, leaving their horny wives and young children home, to run on the street at 4 in the morning must all have mental issues. It could have been a good joke, it could have. But why did you have to be so fucking disgusting in your choice of words old man? Urgg why are you even carrying our flag?

2km: Some are already making pit stops at the trees by the sides of the road. Shit, looking at them gives me the urge too. Nah. If I sweat enough, the excess liquid will just be repurposed in time. Probably. The 42k 4:45 pacers are here, but they seem slow (1) and have loud music on. Better keep some distance.

3km: Here it is, the first major water station. Thanks to the Starting Line Incident, I am down to 4 bars now. I should have more bananas. Double portion, please! There are Waldo, Doraemon, and Ao Dai right in front of me. Cute, but I am not falling behind casual cosplayers. Onwards!

5km: We are joined by a group of 21k runners; they seem to have a shorter route. I no longer hear the music of the 4:45 pacers. I also don't want to have my pace messed up by 21k runners. Time to speed up a bit.

7km: Just gulped down the first energy bar. Entering the beast - Phu My bridge. Still have a vivid memory of how it wore me out in my first 21k. Some 21k runners keep passing me. Well, at least they aren't 42k.

9km: The easier quarter of the bridge was easy. Neat, there is a water station before the hardest quarter. Go in for a shower. Feel so good. Kimochi!!!

10km: Wow that's the highest point of the ascending half already? That was quick. I'm feeling great. The training works!

12km: "Coming through!" I didn't yell but it was certainly loud as I ran past a few runners. I'm sprinting! Not supposed to put stress on my feet no? But I am on a runner's high. Gotta take advantage of this slope then.

14km: Keeping up a good speed. Ketchup guy wait for me! Well, he is a 42k runner in costume which, for the lack of visual detail, only makes me think of a bottle of ketchup. I might be running too fast. There is no downhill gravity to play with. Slowing down.

15km: Crossroad. Am I turning or keeping straight? Oh, there is a volunteer, neat. I asked you twice for the direction and the best you can fathom is "Huh?". You, sir, are truly an idiot.

16km: "That fires we don't put out, will bigger burn". And that's exactly why I am standing here right next to a tree, minding my own business. Here comes the same water station from the 3rd km. Banana! I am joined by a bunch of 21ks. This group is with pacers of 2:20. Guess I'm not doing too badly myself (2). But they are loud. I am putting in some distance.

18km: This is proceeding nicely. I'm bored. Time for some music. After all, what is the point of having the pinnacle of technology in my belt? And lost a bloody energy bar.

20km: I am rejoined by the 2:20 pacers. This time the topic is the color of the underwear the pacers are wearing. I should now add that the pacers in this group are women. "You're wearing nothing!" Someone screamed at the top of his lungs. Looks like he is having a really good time. No, he isn't carrying a Vietnam flag. I looked. I'd love to add some distance again, but I am getting slower.

21km: Canada International School eh? Funny. I'd be here again later this afternoon to watch a game of Saigon Heat. This is a massive waste of energy. Doraemon is behind me. I'm not running behind a cosplayer, not a blue fat cat with comically short legs (and balls for hands). Just a bit faster. Entering the differentiator turn, this is the part of the route that 21ks don't join. This stretch of the route seems to last forever (3).

25km: The sun is already high. I can't possibly head to a tree this time, can I? (4) Bracing myself for a stinky toilet. Wow, it's actually clean. This is awesome. The toilet, not me pissing.

27km: Doraemon is behind me again, but I can't possibly run any faster than this. I tried. Ran ahead of him for tens of meters and I would fall back to a normal pace and he would pass. Not just Doraemon though. I am losing count of how many have passed me.

28km: Good morning milady, can you help me with some of that muscle spray please, on both legs? Wow, that was refreshing! Thank you very much.

29km: God damn, some of that spray got onto my crotch. My balls are freezing. I sure hope they don't fall off.

30km: Got tension on the thighs. I got this. I got this. I trained for this. The app announces I have 12km left. Took me a while to calculate that I have run for 30km. Math is super hard.

32km: I have never run this far in one go. From here on, it's uncharted territory. Squats, I need to do a few squats, it stretches my thighs a bit so they are functional again.

33km: The tensions have turned into cramps. Squat. Run for a couple of hundred meters. Cramp. Squat. Rinse and repeat. It hurts so much.

34km: Arrggg I fought, but I can't run anymore. My thighs got cramps. My ankles hurt from all this stomping. And the soles of my feet too, for pretty much the same reason. Worst of all, my brain seems to go blank, this is stupid, what am I even doing. I have to walk now.

35km: I have run a few short dashes, a couple of hundred meters each. One of the attempts locked my legs and almost landed my face on asphalt. The cramps are still going strong. Someone just handed me a big chunk of ice. It freezes my hands. Dude, what am I supposed to do with this? My balls are gone and that's bad enough. Here, tree, your daily ration in a solid form.

36km: Here is the plan, I'm gonna run from one crossroad to the next, then walk till the next crossroad. I still get cramps whenever the 2 crossroads are a bit farther apart, but at least my mind has come back. The sunlight is roasting me. I miss you, sunshine.

41km: Fuck no! My legs gave up on me. Completely. I get cramps just from walking. No amount of squats seems to help. I can hear the crowd from here, so fucking close.

42km: My legs are at the stage where any excessive movement would give me cramps. The last 200 meters, the finish line is finally here. Here goes nothing. The legs don't seem to be mine, I move them like two sticks. I run. I cross the line. I get a high five. A girl puts this medal around my neck. Heck, I can't even recall what she looks like. But she was wearing a Bà Ba, that was a nice touch. Under normal conditions, I would have appreciated the outfit, but right now I am having a strong urge to vomit my guts out.

(1) This is probably the first sign that I didn't manage my energy level well. Too cocky. But again, I aimed for sub-5, so...
(2) I conveniently forgot the fact that 42k started 30 min in advance. But we also had a longer route since the beginning. All else being equal, I was running the first half between 2:10-2:20.
(3) It didn't. It lasted for 3.5km. Running on a familiar route made me feel like it was shorter.
(4) I talked to a friend about this. Pro-tip is to just pee on yourself. In a race, you probably consume enough water that your pee is transparent anyway. My shorts were white, so it didn't help much with the level of confidence. Best to do this at a water station where they usually put a big bucket of water for a quick shower.

Saturday, October 26, 2019

What you want to do in your life

Quite some time ago, I wrote "Good grades, good opportunities". Basically what I meant was that working hard in school is one way to propel yourself in life, but it isn't necessarily the best way. By focusing too much on a part, one risks missing the whole spectrum of possibilities and opportunities, like a sad puppy looking at a rainbow. Yet suppose you have graduated and what you want to do with your life is nothing but a fogged window, how would you get yourself into a position with better visibility?

My generation was fed that we should be following our passion. That implies somewhere in your mind there is a hardwired plan waiting to unfold, and that you know what you like - more so than any others. Well, if that was the case, we wouldn't be sitting here, would we? The people who said that also didn't really mean it. They probably meant that though the journey to discover what to do in your life is more banging your head against a brick wall than smooth sailing, don't let it demoralize you. I agree. You shouldn't underestimate your potential, and succumb to the thought you can't do what other people can.

The first step towards a way out is to ask whether you must have a job. Hint: if you can afford the question, you probably don't. Besides providing a means to get by, being in a job is also a useful experience. It is certain that each job is unique, but given a bit of abstraction, they have a fair share of similarity. For example, most engineering jobs are project-based and focus on time management, communication, and problem solving. These skills are interchangeable regardless of industry. Through the job, you also get exposed to an industry, where opportunities are barely visible to outsiders. No, the world doesn't have a conspiracy against people like you and me, nor do some people have it easier than others. It just changes fast and because innovation rarely follows a pattern, being in the right space is crucial.

Remember this is your day job. Working at a day job doesn't mean doing it half-heartedly. It means you shouldn't let your identity be defined by this job, just as an aspiring writer with a day job as a cram school teacher doesn't think of himself as a teacher. I usually tell fresh grads that who they are in the next 5 years is defined by what they do after 7PM. Because that is when the day job ends and real work begins. If you are one of the fortunate individuals who can afford not having a job, you get to do real work full time.

What real work? To discover what you like. This is actually harder than it sounds. There are a lot of jobs, ones that you don't know and ones whose names don't exist yet. And the dramatic way the media has described professionals only makes an accurate image harder to find. Most lawyers go through their careers without a single murder case. Literally no one hacks into a system by typing in a terminal like a maniac. And my favorite series House MD is about the exact opposite of what doctors do.

I am probably the worst one to whom you can talk about the definition of "like". But I know that working on a job where you only execute instructions is not going to be very fulfilling. It is better to work in areas where you are interested in understanding how things work, can exercise your freedom of thought, and get into the zone easily (where time flies). In other words, work on stuff about which you are curious. You don't need a full-time job in an industry to experience it. Better yet, don't even bother with the concept of industry, it puts boundaries on your thoughts. Just pick a project that seems interesting: volunteer at a professional event, research the answer to a hard question, build a bottle rocket on your rooftop. Choose a project that only takes a few weeks. Make sure it is something you can finish without any blocker. It should be a bit challenging, but not so hard that you feel overwhelmed. Online courses and books are both very helpful in this journey, though you should avoid acquiring too much data without a project to put it to use. Rinse and repeat with another project, and another, and yet another.

These self-directed activities can be a bit overwhelming at first, especially if you have got used to receiving assignments from school. But you will be able to adapt and get better. It is a part of being human, and being young. You need some initial discipline to get started, but once you do, given the project is an object of your interest, curiosity will turn work into play and discipline is no longer needed.

If you continue down this path, you will probably find something that stands out from the rest, enough that you want to come back for more. If you are lucky, you will find more than one. Depending on how good you are at multi-tasking, you might need to choose. Choose the option that leads to more options afterwards. There is a paradox of choice, but advancement in life is measured in possibilities. You go from one stage of education to another because the higher stage gives you more options. There are simply more things a guy with a college degree can do compared to one who graduated from high school. Building a software product gives you more options than doing business analysis. Once you have built a product, you can get into business analysis later if you prefer the narrow focus. But analyzing business is not going to make you a product manager.

Paulo Coelho's renowned The Alchemist had it that when you really want something, omens are the way the universe points you in the right direction. I have picked up many things in my life, and given up most of them. The ones that stick are usually ones where I experienced some sense of beginner's luck. Perhaps it is my omen. Yours might be the same. Or it could be the urge to do something you weren't allowed to do in your younger years. Or the thrill of an audacious plan that even people who have known you for years wouldn't think you could pull off (actually, the more people know you, the more likely they are to be subject to stereotypical thinking).

You should believe you have it within, that you will sense the click when the omen happens. Now go give yourself some more exposure to what the world has to offer.

Saturday, August 31, 2019

Postgres does not use index on gigantic numeric value

Nothing beats a Saturday morning when you wake up fresh and excited, ready for the second sleep, and realize that your Postgres database was harassed overnight and accumulated a number of downtime minutes that is too embarrassing to state here.

My database load looked like this during the outage.

buffer_mapping looks new. Postgres official documentation on this matter seems to be written by Captain Obvious: "Waiting to associate a data block with a buffer in the buffer pool." Thanks for nothing. Basically, buffer_mapping is a lightweight lock in read operations; my processes were fighting to reserve buffers in which to read data pages.

I have a read problem.

This query accounts for the highest number of locks:

pp_pqsql_prod=> explain select * FROM "big_table" WHERE "big_table"."id" = 9200190224054041915721;
                                           QUERY PLAN
 Gather  (cost=1000.00..4568847.85 rows=535675 width=1673)
   Workers Planned: 2
   ->  Parallel Seq Scan on big_table  (cost=0.00..4514280.35 rows=223198 width=1673)
         Filter: ((id)::numeric = '9200190224054041915721'::numeric)
(4 rows)

It is a sequential scan, on a 100M-row table, so it is obvious why it caused all the ruckus. What's less obvious is why Postgres performed a sequential scan on a primary key column.

With the help of a colleague, it turned out that given a smaller numeric value, the index kicks in just fine.

pp_pqsql_prod=> explain select * FROM "big_table" WHERE "big_table"."id" = 9200190;
                                              QUERY PLAN
 Index Scan using big_table_pkey on big_table  (cost=0.57..8.59 rows=1 width=1673)
   Index Cond: (id = 9200190)
(2 rows)

The problem is the size of the queried value. Eventually I stumbled upon this stackexchange Q&A. It can be seen in the first explain that because 9200190224054041915721 was too big to fit in a bigint, it had to be cast into the numeric data type. My primary key was not that big; its data type was bigint. So it had to be cast too, because apples can't be compared to oranges. What I have now is a numeric to numeric comparison, and a bigint index can't serve that.
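One way to keep such a value from ever reaching the planner is to reject it in the application before building the query. The snippet below is only a minimal sketch of that idea, assuming the id arrives from outside as a string or an integer. The helper name `to_bigint_id` is hypothetical, and this is not necessarily the fix that was applied here. The only fact it relies on is that a Postgres bigint is a signed 64-bit integer.

```python
from typing import Optional, Union

# A Postgres bigint is a signed 64-bit integer.
BIGINT_MIN = -(2**63)
BIGINT_MAX = 2**63 - 1


def to_bigint_id(raw: Union[int, str]) -> Optional[int]:
    """Hypothetical helper: return raw as an int if it fits in a bigint, else None."""
    # bool is a subclass of int, so reject it explicitly along with other types.
    if isinstance(raw, bool) or not isinstance(raw, (int, str)):
        return None
    try:
        value = int(raw)
    except ValueError:
        # Not an integer at all, e.g. "abc" or "12.5".
        return None
    # Out-of-range values would be typed as numeric by Postgres,
    # which forces the cast on the id column and defeats the index.
    return value if BIGINT_MIN <= value <= BIGINT_MAX else None


print(to_bigint_id("9200190"))                 # 9200190
print(to_bigint_id("9200190224054041915721"))  # None
```

A caller can treat None as "no such row" and skip the query altogether, since no bigint primary key can ever match a value outside that range.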

Problem be gone and so was my morning.