Saturday, August 15, 2020

Don't give up

Nearly two in the morning. Outside the office, a couple is still deep in conversation. Even with the light from the 7-11 spilling out, my eyes keep drooping shut and I can't make out their faces. A few days ago, several big customers started using the service, and the volume of data surged. The system was like a flimsy single-story house standing in the path of a force 8 storm, leaking in countless places. Hopefully this is the last night. The team has solved many of the problems. Sleep will return after these days of anxiously watching over a newborn.

I have worked at a startup for a few years. Working here feels like driving a Soviet-era train on tracks that don't exist yet. The train runs while the track is being laid, out of whatever can be found nearby. Sometimes the building goes fast, sometimes slow. Sometimes the track climbs, sometimes it drops. But what matters most is that the train has to keep running.

Short on people, late to start, and short on money, yet still doing better than the competitors: there is no simpler or more powerful way to say "I am good". On those days, you walk on clouds and the world belongs to you alone. And then there are days like today: body tensed, head bowed, eyes unable to look past the desk. The work is one long chain of stupid mistakes.

When I was little, my parents used to say that when I grew up I would be able to do this and that. Ride a bike by the start of first grade. Put together a 26/3 camp in high school. Live on my own in college. As if there were magic switches inside, and once you reached the right age a switch would flip, and you would understand enormous systems, see through the ways of the world, and reach nirvana. In exactly that order.

The thing is, after two failed startups, not a single switch has flipped. Only the work has gotten harder. Often I am afraid, like a swimmer far from shore afraid of drowning, and I just want to stop so that all this pressure disappears. No more calls in the middle of the night. No more long nights alone in front of the screen, feeling my heartbeat quicken under my skin. No more tearing my hair out, helpless before all the questions of why. But a startup comes with many constraints. There is no safety net. If I let go now, if I quit because it is too hard, there is no one behind me to do the work either.

From the first line of code, struggling to get through 20k requests on a tiny virtual machine, to today, with a few "T" going in and out every day, the system and the people around it have gone through puberty countless times, and have even died and come back to life. Every time, we found a way out because we didn't give up.

There is no secret recipe that always works for the problems of a complex system. What matters is to be patient and not be too hard on yourself. Being able to see the chain of stupid mistakes is already the first step. Solve one problem, and the train runs for a day. Solve another, and it runs one more day. Then the third problem, the fourth, the fifth. Reach the end of the chain of stupidity, and you have reached the destination. Either that, or you fail and get a juicy blog post out of it on the road to fighting ignorance. If you stay safe and happy with cute little projects, how will you ever withstand big waves and strong winds?

Perhaps that is the final switch, and it was flipped long ago.


Sleep is for the weak.

I am weak.


P/s: After I drafted the idea for this blog post, Kafka went down in my system, for the first time in nearly 5 years. It took another four tense hours to resolve. Proof that an IT system only exists between outages, and that there is no last outage.

Saturday, April 25, 2020

Building a postmortem culture

post·mor·tem

\ ˌpōs(t)-ˈmȯr-təm \ 
noun
1. Autopsy
    A postmortem showed that the man had been poisoned.
2. An analysis or discussion of an event after it is over
    The blameful postmortem culture shuts down the exploration of the problem because no one wants to be seen as stupid, even if it's ignoring the clear truth.

A postmortem is a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and the follow-up actions to prevent the incident from recurring. The postmortem concept is well known in the technology industry.
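The parts of that definition can be sketched as a small data structure. This is only an illustration, not a format my team actually uses: the class names, field names, and the `open_action_items` helper are assumptions made up for the sketch, and the sample values are borrowed from the hypothetical hotfix scenario used later in this post.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ActionItem:
    """One follow-up action meant to keep the incident from recurring."""
    description: str
    done: bool = False


@dataclass
class Postmortem:
    """Hypothetical record mirroring the parts of the definition above."""
    incident: str                                   # what happened
    impact: str                                     # who or what was affected
    mitigation: List[str] = field(default_factory=list)   # actions taken to mitigate or resolve
    root_causes: List[str] = field(default_factory=list)  # why it happened
    action_items: List[ActionItem] = field(default_factory=list)

    def open_action_items(self) -> List[ActionItem]:
        """Return the follow-up actions that are not done yet."""
        return [item for item in self.action_items if not item.done]


# Illustrative values only, taken from a made-up scenario.
pm = Postmortem(
    incident="Unauthorized code was pushed to production, skipping CICD",
    impact="Production had to be rolled back in the middle of the night",
    mitigation=["Rolled back the change"],
    root_causes=["The fix was not compatible with a recent refactor"],
    action_items=[
        ActionItem("Improve CICD speed"),
        ActionItem("Infrequent contributors should use the safety net of CICD"),
    ],
)
print(len(pm.open_action_items()))
```

Keeping action items as their own type makes it easy to ask, weeks later, which follow-ups are still open.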

I picked up the concept of postmortem from my previous job at Silicon Straits Saigon. The idea that we could study an incident was there, but the guidelines and culture enforcement were weak. So though I was sold that postmortem was a powerful practice that, with proper enforcement, made a system more robust over time, I didn't exactly know how to start a culture around it. The most concrete guidelines I received were from Site Reliability Engineering. Wherever the Google practices seemed too extreme or impractical in my context, there was the Internet. The knowledge was powerful and enlightening, and I appreciated the journey over the last 6 months to transform it into operational HOWTOs.

Since the very beginning, I was aware that a postmortem culture needed to be a joint effort of the entire organization for it to be effective. And I was never interested in being a secretary. But like many other initiatives that involve other people, you can't just make an announcement and expect things to happen, magically. I tried. A few times. So in the beginning it was just me recording the incidents in which I was part of either the solution or the problem. Most of the time both. And that gave me the time and experience I needed to calibrate the plan before it was presented to everyone.

Work to take blame out of the process

Blame, both the act of blaming and the fear of being blamed, is the enemy of a productive postmortem culture. If a culture of finger pointing and shaming individuals or teams for doing the "wrong" thing prevails, people will not bring issues to light for fear of punishment, or will stop investigations prematurely as soon as a "culprit" is identified. This halts the development of preventive measures for the same situation in the future. The urge to blame is formidable; we as human beings are wired for it. Dr. Brené Brown, in a TED talk, explained the existence of blame as "a way to discharge pain and discomfort". The fact that whenever you want to trace back whose code caused your miserable wake up at two in the morning for a system outage, the command says `git blame` certainly doesn't help.

This is where being blameless gets popular in postmortem literature. And if there is anything subjective about an objective piece of work like a postmortem, this is it. I find being completely blameless hard to implement. On the one hand, a postmortem is simply not a place to vent frustration. On the other hand, at times, it feels like tiptoeing around people, so worried about triggering their fragile souls that you miss chances to call out where and how services can be improved. This is where I agree with J. Paul Reed that it is important to acknowledge the human tendency to blame, allow a productive form of its expression, and yet constantly refocus to go beyond it.

Here are some examples. They might or might not involve me, in a scenario that might or might not be real.

Blameful:
Someone pushed bad code to production via the emergency pipeline. The tests in CICD could have caught this, but someone thought he knew better. Seriously, if you aren't sure what you are doing, you shouldn't act so recklessly. Rolling back in the middle of the night is a waste of time.
Action items:
  • Think before you edit someone's code.

Completely blameless:
Last night, unauthorized code was pushed to production. CICD was skipped because CICD takes 30 minutes and it was a firefighting situation. The fix was not compatible with a recent refactor.
Action items:
  • Improve CICD speed

Blame-aware:
Last night, unauthorized code was pushed to production. CICD was skipped because CICD takes 30 minutes and it was a firefighting situation. The fix was not compatible with a recent refactor.
Action items:
  • Improve CICD speed
  • Infrequent contributors should use the safety net of CICD
  • Issues in a service need escalating to maintainer of the repo for code review
  • Rollback mechanism needs to be available to developers on pager duty.
In the examples above, I feel that without accepting that pushing code to an unfamiliar service in the middle of the night was a reckless action, we would miss the chance to put preventive measures in place. But again, it is subjective; perhaps my blame-aware version fits perfectly into someone else's idea of blameless. I hope you get the point.

Work on some guidelines

It is useful to be as specific as possible about when a postmortem is expected, who should write it, what should be written, and what the goals of the record are. Not only does this provide a level of consistency across your organization, it also prevents the task of writing a postmortem from being seen as a whimsical assignment from some higher authority, with someone being picked on as punishment for doing the "wrong" thing.

Some of my personal notes on the matter:
  • Different teams might have different sets of postmortem triggers. The more critical your function is, the more detailed the triggers should be.
  • People who caused the incident might or might not be the ones to write the postmortem. The choice should be based on the level of contribution the person has to offer, both in terms of context and knowledge, not on their previous actions.
  • Be patient. The people you are working with are professionals in software development, but the ability to write good software does not translate into the ability to write a good document. Quality of root cause analysis prevails over eloquence. Save the latter for your blog.

Work on the impact to your audience

In the beginning, the incidents I was working on were about a database migration to Aurora (and a hasty fallback), so I assumed my audience would be my fellow developers, possibly extending to project managers. Project managers love knowing why you are stealing time from their team. A reasonable consequence was to write the postmortems in markdown and store them in the same code repo as the affected services. There were a few issues with that.

Firstly, in a technology startup, the scope of a tech choice is always bigger than "just the tech team". In my case, Customer Success people need to know what the impacts on our customers were and are, Product people want to know if the choice comes with new possibilities, and Sales people want to sell those possibilities. As much as I love such integration between the developers and the rest of the company, the idea of granting universal access to code repos to view markdown postmortems terrifies my boss, and therefore subsequently me, obviously.

Secondly, as familiar as markdown is, it is not a very productive option if you want to include media in it. And we want charts of various system metrics during the time of incident to be included in the postmortem.

Lastly, writing a postmortem is gradually becoming a collaborative effort, and a git repo, though it supports collaboration, does not do so in real time.

Considering all the options, we finally settled on a shared Google Drive in the company account. It is neither techie nor fancy. But it allows very flexible access control, tracks versions, natively supports embedded media, and lets multiple people collaborate in real time. We share our postmortems in a company-wide channel, and sometimes hold an additional presentation for particularly interesting ones.

Let it grow

When you have done your homework, built a foundation of trust and safety, laid out the guidelines and constantly improved them, and integrated the postmortems with your larger audience, it is probably time to take a step back and let the culture take a spin on its own. My company's postmortem culture won't be the same as Google's no matter how many Google books I read. And as long as it works for us, it doesn't matter.

With some gentle nudges, my colleagues are picking up postmortems on their own. We have seen contributions from Product Owners and Project Managers, besides the traditional developer and devops contributors. The findings are anticipated by a large audience across the company.

And in the latest incident, which involved the degradation of performance in a few key features of our SaaS offering over the course of a week, we identified another use of the postmortem. A postmortem updated regularly, in near real time, with the latest incident reports, findings, and potential impacts is a powerful communication tool across the organization. It both ensures the flow of information to the people who need it (CS to answer questions, PM to change project plans for urgent hotfixes, etc.) and allows developers to focus on their critical work without frequent interruptions.

As we grow and our system gets more sophisticated, hopefully the constructive postmortem culture will turn out to be a solid building block.

Thursday, January 16, 2020

Run, Forrest, Run!

Tl;dr: I ran my first marathon, and whined about it. Move on.


4 years after finishing my first half marathon, I finally did my first full marathon, 42k of sweat and pain. 2019 was horrible for me, and through all the ups and downs, the marathon plan was one of the few things that kept me together. The cut off time was 7 hours. I wanted to do a sub-5 (complete the run under 5 hours), but ended up with a sub-6. I was squarely in the bottom quarter of my age group. So it wasn't all glory and stuff, but I am so glad I did it.

I must have started the training back in March or something, and didn't follow the training plan through and through, obviously. I got sick, which paused the plan by a week every time it happened. I got injuries which eventually put me out of action for a whole month. And when I was back, following the original training plan just gave me too much stress and guilt, which I certainly didn't need - my life was really low, so I forwent it and just ran whatever the fuck I wanted. That was probably 2 months ago.

The injuries were actually a blessing in disguise. They forced me to rethink my running form. I picked up a book on running (that is not Murakami's autobiography) and tried to avoid "common sense" misconceptions, most notably landing on your whole foot. I finished 42k without any injuries. Yay!

The day I got the bib, it came with a shock. I was put into the 30-39 age group. Technically, it is not my birthday yet. And despite all the talk, I was not mentally prepared for this. Ouch! Oh and I also got interviewed.

I had never run the full distance prior to the race, and in retrospect, that wasn't a great idea. I now believe that the body prepares for a few extra kms on top of the maximum distance you have covered, but not by a long shot. And it makes sense: why would my body be ready for 100k if I have never run 50k? The longest I had done was 30k, and that explains why I got such bad cramps from 34k.

It was also the first run that I got proper sleep the night before. And I wasn't hungry. I sure stuffed myself with loads of carbs, so full that on the night before the run, I thought it was stupid, I couldn't possibly run with such a stomach. But above all, shout out to the organizers: the route had more than sufficient water, electrolytes, and bananas.

One last thing, the Nike app has improved a lot between then and now. It is no longer off by 30%, and comes with cooler features. Well done Nike.

---

If that hasn't bored you out of your skull yet, you might want to see how my run broke down. "How did you remember all of this?" - I knew I was gonna write one of these post-event, so it wasn't that hard. And I made up all the bits I didn't remember, including that I ran at all. Bwahaha.



Starting line: That's right, 42k is the first wave, the first class citizen of a marathon. With all the volunteers standing around and looking, the limelight feels good. Wait, hang on. It's already 4. Why aren't we starting? Technical issue? Great, I am trying to get some work life balance and here I am, with bugs.

0km: 10 minutes in, here we go guys!!! Let me just start my run on the Nike mobile app. Fuck fuck fuck. I dropped an energy bar while shoving the phone back into the running belt. Screw it, I am not fighting against a wave of runners for a stupid bar. What a start.

1km: An old man with a Vietnam flag on his back is making a crude joke that a bunch of fit men who leave their horny wives and young children at home to run on the street at 4 in the morning must all have mental issues. It could have been a good joke, it could have. But why did you have to be so fucking disgusting in your choice of words, old man? Urgg why are you even carrying our flag?

2km: Some are already making pit stops at the trees by the sides of the road. Shit, looking at them gives me the urge too. Nah. If I sweat enough, the excess liquid will just be repurposed in time. Probably. The 42k 4:45 pacers are here, but they seem slow (1) and have loud music on. Better keep some distance.

3km: Here it is, the first major water station. Thanks to the Starting Line Incident, I am down to 4 bars now. I should have more bananas. Double portion please! There are Waldo, Doraemon, and Ao Dai right in front of me. Cute, but I am not falling behind casual cosplayers. Onwards!

5km: We are joined by a group of 21k runners; they seem to have a shorter route. I no longer hear the music of the 4:45 pacers. I also don't want to have my pace messed up by 21k runners. Time to speed up a bit.

7km: Just gulped down the first energy bar. Entering the beast - Phu My bridge. Still have vivid memory how it wore me out in my first 21k. Some 21k runners keep passing me. Well at least they aren't 42k.

9km: The easier quarter of the bridge was easy. Neat, there is a water station before the hardest quarter. Go in for a shower. Feel so good. Kimochi!!!

10km: Wow that's the highest point of the ascending half already? That was quick. I'm feeling great. The training works!

12km: "Coming through!" I didn't yell but it was certainly loud as I ran past a few runners. I'm sprinting! Not supposed to put stress on my feet, no? But I am on a runner's high. Gotta take advantage of this slope then.

14km: Keeping up good speed. Ketchup guy wait for me! Well, he is a 42k runner in a costume which, for lack of visual detail, only makes me think of a bottle of ketchup. I might be running too fast. There is no downhill gravity to play with. Slowing down.

15km: Crossroad. Am I turning or keeping straight? Oh there is a volunteer, neat. I asked you twice for the direction and the best you can fathom is "Huh?". You, sir, are truly an idiot.

16km: "That fires we don't put out, will bigger burn". And that's exactly why I am standing here right next to a tree, minding my own business. Here comes the same water station from the 3rd km. Banana! I am joined by a bunch of 21ks. This group is with the 2:20 pacers. Guess I'm not doing too badly myself (2). But they are loud. I am putting in some distance.

18km: This is proceeding nicely. I'm bored. Time for some music. After all, what is the point of having the pinnacle of technology in my belt, and losing a bloody energy bar over it?

20km: I am rejoined by the 2:20 pacers. This time the topic is the color of the underwear the pacers are wearing. I should now add that the pacers in this group are women. "You're wearing nothing!" Someone screamed at the top of his lungs. Looks like he is having a really good time. No, he isn't carrying a Vietnam flag. I looked. I'd love to add some distance again, but I am getting slower.

21km: Canada International School eh? Funny. I'd be here again later this afternoon to watch a game of Saigon Heat. This is a massive waste of energy. Doraemon is behind me. I'm not running behind a cosplayer, not a blue fat cat with comically short legs (and balls for hands). Just a bit faster. Entering the differentiator turn, this is the part of the route that 21ks don't join. This stretch of the route seems to last forever (3).

25km: The sun is already high. I can't possibly head to a tree this time, can I? (4) Bracing myself for a stinky toilet. Wow it's actually clean. This is awesome. The toilet, not me pissing.

27km: Doraemon is behind me again, but I can't possibly run any faster than this. I tried. Ran ahead of him for tens of meters and I would fall back to normal pace and he would pass. Not just Doraemon though. I am losing count of how many have passed me.

28km: Good morning milady, can you help me with some of that muscle spray please, on both legs? Wow that was refreshing! Thank you very much.

29km: God damn, some of that spray got onto my crotch. My balls are freezing. I sure hope they don't fall off.

30km: Got tension on the thighs. I got this. I got this. I trained for this. The app announces I have 12km left. Took me a while to calculate that I have run for 30km. Math is super hard.

32km: I have never run this far in one go. From here on, it's uncharted territory. Squats, I need to do a few squats, it stretches my thighs a bit so they are functional again.

33km: The tensions have turned into cramps. Squat. Run for a couple of hundred meters. Cramp. Squat. Rinse and repeat. It hurts so much.

34km: Arrggg I fought, but I can't run any more. My thighs got cramps. My ankles hurt from all this stomping. And the soles of my feet too, for pretty much the same reason. Worst of all, my brain seems to go blank, this is stupid, what am I even doing. I have to walk now.

35km: I have run a few short dashes, a couple of hundred meters each. One of the attempts locked my legs and almost landed my face on the asphalt. The cramps are still going strong. Someone just handed me a big chunk of ice. It freezes my hands. Dude, what am I supposed to do with this? My balls are gone and that's bad enough. Here, tree, your daily ration in a solid form.

36km: Here is the plan, I'm gonna run from one crossroad to the next, then walk till the next crossroad. I still get cramps whenever the 2 crossroads are a bit farther apart, but at least my mind has come back. The sunlight is roasting me. I miss you, sunshine.

41km: Fuck no! My legs gave up on me. Completely. I get cramps just from walking. No amount of squats seems to help. I can hear the crowd from here, so fucking close.

42km: My legs are at the stage where any excessive movement would give me cramps. The last 200 meters, the finish line is finally here. Here goes nothing. The legs don't seem to be mine, I move them like two sticks. I run. I cross the line. I get a high five. A girl puts this medal around my neck. Heck, I can't even recall what she looks like. But she was wearing a Bà Ba, that was a nice touch. Under normal conditions, I would have appreciated the outfit, but right now I am having a strong urge to vomit my guts out.

---
(1) This is probably the first sign that I didn't manage my energy level well. Too cocky. But again, I aimed for sub-5, so...
(2) I conveniently forgot the fact that 42k started 30 min in advance. But we also had a longer route since the beginning. All else being equal, I was running the first half between 2:10-2:20.
(3) It didn't. It lasted for 3.5km. Running on a familiar route made me feel like it was shorter.
(4) I talked to a friend about this. Pro tip is to just pee on yourself. In a race, you probably consume enough water that your pee is transparent anyway. My shorts were white, so it didn't help much with the level of confidence. Best to do this at a water station, where they usually put a big bucket of water for a quick shower.