The context
Over the last few months, I was commissioned to build an analytics system for logistics. Think of it as Google Analytics for shipments. The project consists of an engine to collect shipment information, analytics logic to normalize, combine, and produce subsets of data through various stages, and a few different interfaces to present the data to targeted users. That nature suggests a system with discrete components that should be planned and built in a plug-and-play manner.

Having each component as a standalone, independent module allows me to choose the proper tool for each task. Java makes up the core of the system, where we need processing speed, stability, and off-the-shelf client support. Node.js is used wherever web sockets or massive asynchronous tasks are involved. And Python comes in where we want a quick way to glue things together and iterate on ideas. The communication hub is built around Apache Kafka. Up to this point, the whole development environment has been orchestrated with Docker and Docker Compose. We are planning to bring the Docker containers to production soon, once monitoring is in place.
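To make the plug-and-play idea concrete, here is a minimal sketch of what a producer on that Kafka hub might look like, assuming the `kafka-python` client and an illustrative `shipment-events` topic (neither the topic name nor the payload fields come from the actual project):

```python
# A hypothetical producer for the shipment-collection engine. Assumes the
# kafka-python client and a broker reachable at localhost:9092; the topic
# name and payload fields are illustrative, not taken from the project.
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Downstream analytics stages only need to agree on the message format,
# not on the language or runtime of the service that produced it.
producer.send("shipment-events", {
    "shipment_id": "SHP-0001",
    "status": "IN_TRANSIT",
    "location": "HKG",
})
producer.flush()
```

Any consumer, whether written in Java, Node.js, or Python, can then subscribe to the same topic and process the events at its own pace.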
The project is considered strategic, so a new team has been built around it. The goal is to have this team dedicated to the project, grow with it, and form the intellectual core in the long run. The project is run in two-week sprints with committed deliveries at the end of each sprint, though a faster delivery schedule is always welcome.
So that's the background. Let's look at the expensive lessons I have learned along the way.
1. Good onboarding is crucial
For many projects, onboarding a new team member is an important activity that is often neglected; instead, the newbie is handed a stack of obsolete documents. If she is lucky, she is given the source code and has to figure out how to bring up the system herself.

The main reason for the neglect is that during the onboarding period, the team's productivity slows down as key people are pulled away from their tasks to explain (parts of) the system to the newbie. And if your team follows a rigid delivery schedule like mine does, onboarding can be a stressful experience, as more balls have to be juggled: maintaining committed deliveries while spending decent time on thorough explanations.
I have learned that with so many balls in the air, you should go for good onboarding and renegotiate the delivery schedule if needed. While it is tempting to spend less time on the newbie (who, after all, is supposed to kick ass), there are too many reasons not to:
- There is just so much work that good onboarding requires. Over the course of a few days, one has to go through the project overview, business values, and core competencies. He also needs to set up a development environment and interact with the system he will later help build. Along the way, he needs to get familiar with all the components, what they do, and their scopes of responsibility. And before writing any line of code, he must understand the project's standards, tooling, and enforced and unenforced conventions. It is a major, haunting task.
- Plus, a microservices system is almost always complicated by nature, with all those moving components. A newbie, no matter how good he is, won't be productive until he has a good grasp of both the big picture and the little piece of Lego he will be working on. As a universal law of knowledge work, "just code the thing as spec'd" is a lose-lose situation.
- Given that, no matter how much midnight oil you burn, you aren't very likely to meet your deadline anyway, at least not in a decent way. Goofing around for a demo, or pulling together an immature onboarding, would only result in expensive technical debt and bad morale, both of which will definitely bite you back at some point, quite possibly the very next sprint.
A well-informed newbie will be productive in no time and make up for the delayed delivery. Always.
2. You can't possibly write enough documentation
Now, every time I start a new project, I tell myself, "This is more documentation than I have ever done before. It should be enough for even my mother to understand." And it's never enough.
And that is even more true for a system as dynamic as a microservices one. Seriously, the amount of documentation I have written this time is just ridiculous:
- At the highest level, there is architecture and infrastructure documentation. One focuses on the logical components, the other on the actual machines things run on.
- Integration across services is captured in flow diagrams, which then come with message formats (Kafka, MQ, remember?).
- Every component has its own setup guide (besides the whole-system setup guide), and those whose services are used by others have their own interactive API docs.
- Particularly complicated components get their own flow diagrams (though the rest of the system should be able to interact with them in a black-box manner).
That situation calls for a change in practice. While major high-level documents (like the architecture and infrastructure ones) should continue to drive the implementation, finer-grained documentation that undergoes multiple modifications each sprint should be derived from the code itself. The goal is to have the code serve as the one reliable source of truth, with everything else needed to make sense of it generated on demand. For example, a business analyst needs a database schema? Generate it from the database itself. A developer needs to know how to use the latest API? Comment blocks are extracted and combined with a simple interactive form to try them out, and you have your API doc.
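As a rough illustration of the "derive it from the code" idea, the sketch below pulls docstrings out of a Python module and renders them as Markdown; the module name `shipment_api` is hypothetical, and in practice you would more likely reach for Sphinx or Swagger/OpenAPI-style generators:

```python
# A rough sketch of deriving a doc page from the code itself: extract the
# docstrings of a module's functions and emit Markdown. The module name
# `shipment_api` is hypothetical.
import inspect

import shipment_api  # hypothetical module exposing the service's handlers


def module_to_markdown(module):
    lines = [f"# {module.__name__} API", ""]
    for name, func in inspect.getmembers(module, inspect.isfunction):
        doc = inspect.getdoc(func) or "(undocumented)"
        lines.append(f"## `{name}{inspect.signature(func)}`")
        lines.append("")
        lines.append(doc)
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    with open("API.md", "w") as fh:
        fh.write(module_to_markdown(shipment_api))
```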
3. Automate your environment setup
Now that you have automated document generation, it is a good time to automate the environment setup too, starting from dev and going all the way to production. My project uses Docker, but this applies just as well to other tools, like Chef, Puppet, or Ansible.

At the beginning of the project, with a handful of services, it is simple enough to announce a change in a service's Docker setup with a group chat message, or a poke. You can also easily add the new setup requirements to the service's README file. But by the time the project reaches a dozen services, if a developer has to either keep an eye on chat messages for setup changes or wade through that many README files, he will lose his mind.
Be it a bash script, a Python script, or whatever else, automate the environment setup as much as you can. A few things you can consider to begin with (see the sketch after this list) are:
- automatically pull the latest code of all services and restart their Docker containers
- automatically remove obsolete Docker images and containers (you accumulate a lot of these when building new services)
- automatically update configuration files based on their last-edited/updated timestamps
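Here is a hedged sketch of what such a helper can look like; the service directory names and the docker-compose setup below are assumptions, not the project's actual layout:

```python
# A sketch of a "refresh everything" helper: pull the latest code of every
# service, rebuild and restart the containers, and clean up the leftovers.
# Service directory names and the docker-compose setup are illustrative.
import subprocess

SERVICE_REPOS = ["collector", "analytics", "dashboard"]  # illustrative names


def run(cmd, cwd=None):
    print("$ " + " ".join(cmd))
    subprocess.run(cmd, cwd=cwd, check=True)


def refresh_all():
    # 1. Pull the latest code of every service.
    for repo in SERVICE_REPOS:
        run(["git", "pull"], cwd=repo)

    # 2. Rebuild and restart the containers defined in docker-compose.yml.
    run(["docker-compose", "up", "-d", "--build"])

    # 3. Throw away obsolete images and stopped containers left over from
    #    previous builds.
    run(["docker", "image", "prune", "-f"])
    run(["docker", "container", "prune", "-f"])


if __name__ == "__main__":
    refresh_all()
```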
I can't stress enough how relieving it was for my team to have smooth, reliable automation. No more wasting half an hour every morning trying to figure out why what worked yesterday stops working today. Seamless automation is worth every effort.
4. Standardize everything
The upside of microservices is that developers have a lot of freedom in doing their work, as long as the service contract is respected. The downside of microservices is that developers have too much freedom in doing their work.

Aside from the choices of technology, a few things I kept running into:
- Mixed use of `CMD` and `ENTRYPOINT` in Dockerfiles (hint: they aren't supposed to compete, but to complement each other)
- Inconsistent log formats. The logging setup fluctuated from plain standard output to log rotators, from random debug lines to nginx-style logs. The Wild West can only get so crazy.
- Bash scripts with different names, in different directories, all doing the same thing: `run.sh`, `start.sh`, in a `scripts/` directory, or at the root level. And so on.
Taken independently, these are just a bunch of little, harmless variations on the right things to do to keep a software system maintainable. But together, they make life miserable for anyone whose job involves jumping in and out of services (i.e. everyone but the developers). Of course, monolithic codebases have this issue too, but there it can be ruled out easily even without a quality police, because everyone looks at the same thorns all the time. I didn't realize how different personal preferences could be (but hey, after all, we are a new team).
Well, eventually the variations pissed me off so much that I had development paused so we could sync up on the conventions and make an oath to keep them. For the watch!
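To give one concrete example of the kind of thing worth syncing up on: every Python service can configure logging through the same small helper, so log lines carry the same fields in the same order regardless of which service wrote them. The format string and service name below are illustrative, not our exact convention:

```python
# A sketch of a shared logging convention: every Python service calls the
# same helper, so logs look identical regardless of which service wrote them.
import logging
import sys

LOG_FORMAT = "%(asctime)s %(levelname)s %(name)s %(message)s"


def configure_logging(service_name, level=logging.INFO):
    handler = logging.StreamHandler(sys.stdout)  # one agreed destination
    handler.setFormatter(logging.Formatter(LOG_FORMAT))
    logger = logging.getLogger(service_name)
    logger.setLevel(level)
    logger.addHandler(handler)
    return logger


# Usage inside any service:
log = configure_logging("shipment-collector")
log.info("service started")
```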
5. Collective ownership doesn't work
By the book, Agile management encourages everyone to share responsibility for code quality: anyone can make necessary changes anywhere, and everyone is expected to fix the problems they find.

While the shared-responsibility part still rings true in every bit and byte, and we encourage it with tight feedback loops, a rigid definition of done, and tests as a safety net, the latter two parts go from bad to really bad. In a microservices system that employs various technologies, anyone can still make a change anywhere, but it is probably in his best interest that he doesn't. Let's take a quick look at the technologies I mentioned up front: Node.js is asynchronous by nature and one must forgo the traditional threading model (I got many raised eyebrows for saying that Node.js runs on only one big-ass thread, but it does, people!). Python is a dynamically-typed language and Java is a statically-typed one. Each language calls for a different mindset. Make a change to a codebase whose underlying philosophy you don't understand, and technical debt is probably the best outcome you can hope for.
And fix the problems they find? Typical hopeless optimism from management.
A more practical model is to let a developer take complete ownership of his service(s). Others might chip in to help, but within the service's scope, he is the technical lead, in charge of maintaining quality standards, code conventions, and whatever else he considers important. That might sound too rigid for an Agile team, and maybe it is. But screw the Agile label, I need reliability. With no monitoring system in place and a dozen services running, each with its own set of runtime problems, I need each person to know one thing inside out, not a bit of everything.
6. Keep an eye on everything that moves
OK, saying that people didn't tell me about the need to monitor a microservices system would be an overstatement. The topic of monitoring appears one way or another in the books and articles I have read about microservices. What didn't seem to be stressed enough is the sheer amount of work required to get monitoring up to the level where you can be confident about the system without fearing that something will fall apart the moment you turn away.

Right in the development environment, having all the logs gathered in one place that you can later `tail -f *.log` is a huge time saver. Depending on the third-party libraries you use, your log might be populated with a mumbo jumbo of bullshit. Take the effort to filter that out of your log; the investment pays for itself every time you inspect an inter-service bug. If possible, slice your log into three groups (a sketch follows the list):
- Activity log (or debug log) to monitor the flow of data between services
- Error log, so you can find the most critical thing right away when something goes awry
- Third-party library logs, in case you want to play it safe
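Here is a sketch of that three-way split using Python's standard logging module; the file names and the library loggers quieted at the end are illustrative:

```python
# A sketch of the three-way log split: activity, errors, and third-party
# noise each end up in their own tail-able file. Names are illustrative.
import logging
import os


def setup_log_groups():
    os.makedirs("logs", exist_ok=True)
    fmt = logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")

    # Activity (debug) log: the flow of data between services.
    activity = logging.FileHandler("logs/activity.log")
    activity.setLevel(logging.DEBUG)
    activity.setFormatter(fmt)

    # Error log: only the things that should wake somebody up.
    errors = logging.FileHandler("logs/error.log")
    errors.setLevel(logging.ERROR)
    errors.setFormatter(fmt)

    root = logging.getLogger()
    root.setLevel(logging.DEBUG)
    root.addHandler(activity)
    root.addHandler(errors)

    # Third-party libraries get their own file and stay out of the
    # activity log entirely.
    third_party = logging.FileHandler("logs/third_party.log")
    third_party.setFormatter(fmt)
    for noisy in ("kafka", "urllib3"):  # illustrative library loggers
        lib_logger = logging.getLogger(noisy)
        lib_logger.setLevel(logging.WARNING)
        lib_logger.addHandler(third_party)
        lib_logger.propagate = False  # keep them out of activity.log
```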
Once the system is in production, the focus shifts away from the flow of data and toward performance and health checks. You want to be able to know how many nodes a request had to travel through before the user could see anything and how much time it spent at each node, and to set up triggers that escalate when certain thresholds are crossed. For this purpose, we are using the ELK stack, where Logstash crawls the logs from the distributed servers and feeds Elasticsearch, with Kibana as the presentation layer on top. Formatting the logs and organizing the Kibana reports is then an ongoing job.
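A minimal sketch of what those per-hop timing logs can look like, assuming each service wraps its handlers in something like this and ships the JSON lines to Logstash (the field names are illustrative):

```python
# Emit one JSON log line per handled request, carrying a correlation id so
# Kibana can stitch the hops of a single request back together.
import json
import logging
import time

log = logging.getLogger("request-timing")


def timed_handler(service_name, request_id, handler, payload):
    started = time.time()
    result = handler(payload)
    log.info(json.dumps({
        "service": service_name,
        "request_id": request_id,  # the same id travels across all hops
        "elapsed_ms": round((time.time() - started) * 1000, 2),
    }))
    return result
```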
While ELK tells you about the performance of the system, it shows little about its health: a service can be serving requests in under 100ms while its RAM usage is a whopping 90% and its CPU utilization is always above 70%. That calls for a different set of system monitoring tools, like AWS CloudWatch, Nagios, or New Relic.
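The kind of numbers those tools collect can be sketched in a few lines, assuming the `psutil` package; a service could expose something like this as a `/health` endpoint next to its normal API:

```python
# A minimal health snapshot: CPU, RAM, and disk usage of the host the
# service is running on. Assumes the psutil package.
import psutil


def health_snapshot():
    return {
        "cpu_percent": psutil.cpu_percent(interval=1),      # averaged over 1s
        "memory_percent": psutil.virtual_memory().percent,  # RAM in use
        "disk_percent": psutil.disk_usage("/").percent,
    }


if __name__ == "__main__":
    print(health_snapshot())
```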
From time to time, while doing housekeeping work such as migrating a service to a bigger machine, scaling a service out to a few instances, or deploying new service instances while iteratively shutting down old ones to achieve zero downtime, you get really tired of constantly checking whether a service is up, and whether it is still at the same IP address. Well, that is when you find yourself longing for service discovery, with something like Consul or etcd.
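For a taste of what that looks like, here is a hedged sketch of registering a service with a local Consul agent over its HTTP API; the service name, address, port, and `/health` endpoint are assumptions, and the Consul documentation covers the full set of check options:

```python
# Register a service (with an HTTP health check) against a local Consul
# agent. Assumes the requests package; names and ports are illustrative.
import requests


def register_with_consul(name, address, port):
    payload = {
        "Name": name,
        "Address": address,
        "Port": port,
        "Check": {
            "HTTP": "http://{}:{}/health".format(address, port),
            "Interval": "10s",
        },
    }
    resp = requests.put(
        "http://localhost:8500/v1/agent/service/register", json=payload
    )
    resp.raise_for_status()


register_with_consul("shipment-collector", "10.0.0.12", 8080)
```

Once the registration is in place, other services ask Consul where `shipment-collector` lives instead of hard-coding an IP address.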
I am trying to illustrate two points here. First, it is crucial to keep a close eye on the system as a whole and optimize the flow of data. Second, it is very tempting to apply all the bells and whistles, surround yourself with dashboards, and get distracted from the only thing that matters: the system itself.