The context
Over the last few months, I was commissioned to build an analytics system for logistics. Think of it as Google Analytics for shipments. The project consists of an engine to collect shipment information, analytics logic to normalize, combine, and produce subsets of data through various stages, and a few different interfaces to present the data to targeted users. That nature suggests a system with discrete components that should be planned and built in a plug-and-play manner.

Having each component as a standalone, independent module allows me to choose the proper tool for each task. Java makes up the core of the system, where we need processing speed, stability, and off-the-shelf client support. Node.js is used wherever web sockets or massive asynchronous tasks are involved. And Python comes in where we want a quick way to glue things together and iterate on ideas. The communication hub is built around Apache Kafka. Up to this point, the whole development environment has been orchestrated with Docker and Docker Compose. We are planning to bring the Docker containers to production soon, once monitoring is in place.
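To make the plug-and-play idea concrete, here is a minimal sketch of what a producer on that Kafka hub might look like, assuming the `kafka-python` client and an illustrative `shipment-events` topic (neither the topic name nor the payload fields come from the actual project):

```python
# A hypothetical producer for the shipment-collection engine. Assumes the
# kafka-python client and a broker reachable at localhost:9092; the topic
# name and payload fields are illustrative, not taken from the project.
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Downstream analytics stages only need to agree on the message format,
# not on the language or runtime of the service that produced it.
producer.send("shipment-events", {
    "shipment_id": "SHP-0001",
    "status": "IN_TRANSIT",
    "location": "HKG",
})
producer.flush()
```

Any consumer, whether written in Java, Node.js, or Python, can then subscribe to the same topic and process the events at its own pace.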
The project is considered strategic, so a new team has been built around it. The goal is to have this team dedicated to the project, grow with it, and form the intellectual core in the long run. The project is run in two-week sprints with committed deliveries at the end of each sprint, though a faster delivery schedule is always welcome.
So that's the background. Let's look at the expensive lessons I have learned along the way.
1. Good onboarding is crucial
For many projects, onboarding a new team member is an important activity that is often neglected; instead, the newbie is handed a stack of obsolete documents. If she is lucky, she is given the source code and has to figure out how to bring up the system herself.

The main reason for the neglect is that during the onboarding period, the team's productivity slows down as key people are pulled away from their tasks to explain (parts of) the system to the newbie. And if your team follows a rigid delivery schedule like mine does, onboarding can be a stressful experience, as more balls have to be juggled: maintaining committed deliveries while spending decent time on thorough explanations.
I have learned that with so many balls in the air, you should go for good onboarding and renegotiate the delivery schedule if needed. While it is tempting to spend less time on the newbie (who, after all, is supposed to kick ass), there are too many reasons not to:
- There is just so much work that good onboarding requires. Over the course of a few days, one has to go through the project overview, business values, and core competencies. He also needs to set up a development environment and interact with the system he will later help build. Along the way, he needs to get familiar with all the components, what they do, and their scopes of responsibility. And before writing any line of code, he must understand the project's standards, tooling, and enforced and unenforced conventions. It is a major, haunting task.
- Plus, a microservices system is almost always complicated by nature, with all those moving components. A newbie, no matter how good he is, won't be productive until he has a good grasp of both the big picture and the little piece of Lego he will be working on. As a universal law of knowledge work, "just code the thing as spec'd" is a lose-lose situation.
- Given that, no matter how much midnight oil you burn, you aren't very likely to meet your deadline anyway, at least not in a decent way. Goofing around for a demo, or pulling together an immature onboarding, would only result in expensive technical debt and bad morale, both of which will definitely bite you back at some point, quite possibly the very next sprint.
A well-informed newbie will be productive in no time and make up for the delayed delivery. Always.
2. You can't possibly write enough documentation
Now, every time I start a new project, I tell myself, "This is more documentation than I have ever done before. It should be enough for even my mother to understand." And it's never enough.
And that is even more true for a system as dynamic as a microservices one. Seriously, the amount of documentation I have written this time is just ridiculous:
- At the highest level, there is architecture and infrastructure documentation. One focuses on the logical components, the other on the actual machines things run on.
- Integration across services is captured in flow diagrams, which then come with message formats (Kafka, MQ, remember?).
- Every component has its own setup guide (besides the whole-system setup guide), and those whose services are used by others have their own interactive API docs.
- Particularly complicated components get their own flow diagrams (though the rest of the system should be able to interact with them in a black-box manner).
That situation calls for a change in practice. While major high-level documents (like the architecture and infrastructure ones) should continue to drive the implementation, finer-grained documentation that undergoes multiple modifications each sprint should be derived from the code itself. The goal is to have the code serve as the one reliable source of truth, with everything else needed to make sense of it generated on demand. For example, a business analyst needs a database schema? Generate it from the database itself. A developer needs to know how to use the latest API? Comment blocks are extracted and combined with a simple interactive form to try them out, and you have your API doc.
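As a rough illustration of the "derive it from the code" idea, the sketch below pulls docstrings out of a Python module and renders them as Markdown; the module name `shipment_api` is hypothetical, and in practice you would more likely reach for Sphinx or Swagger/OpenAPI-style generators:

```python
# A rough sketch of deriving a doc page from the code itself: extract the
# docstrings of a module's functions and emit Markdown. The module name
# `shipment_api` is hypothetical.
import inspect

import shipment_api  # hypothetical module exposing the service's handlers


def module_to_markdown(module):
    lines = [f"# {module.__name__} API", ""]
    for name, func in inspect.getmembers(module, inspect.isfunction):
        doc = inspect.getdoc(func) or "(undocumented)"
        lines.append(f"## `{name}{inspect.signature(func)}`")
        lines.append("")
        lines.append(doc)
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    with open("API.md", "w") as fh:
        fh.write(module_to_markdown(shipment_api))
```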
3. Automate your environment setup
Now that you have automated document generation, it is a good time to automate the environment setup too, starting from dev and going all the way to production. My project uses Docker, but this applies just as well to other tools, like Chef, Puppet, or Ansible.

At the beginning of the project, with a handful of services, it is simple enough to announce a change in a service's Docker setup with a group chat message, or a poke. You can also easily add the new setup requirements to the service's README file. But by the time the project reaches a dozen services, if a developer has to either keep an eye on chat messages for setup changes or wade through that many README files, he will lose his mind.
Be it a bash script, a Python script, or whatever else, automate the environment setup as much as you can. A few things you can consider to begin with (see the sketch after this list) are:
- automatically pull the latest code of all services and restart their Docker containers
- automatically remove obsolete Docker images and containers (you accumulate a lot of these when building new services)
- automatically update configuration files based on their last-edited/updated timestamps
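Here is a hedged sketch of what such a helper can look like; the service directory names and the docker-compose setup below are assumptions, not the project's actual layout:

```python
# A sketch of a "refresh everything" helper: pull the latest code of every
# service, rebuild and restart the containers, and clean up the leftovers.
# Service directory names and the docker-compose setup are illustrative.
import subprocess

SERVICE_REPOS = ["collector", "analytics", "dashboard"]  # illustrative names


def run(cmd, cwd=None):
    print("$ " + " ".join(cmd))
    subprocess.run(cmd, cwd=cwd, check=True)


def refresh_all():
    # 1. Pull the latest code of every service.
    for repo in SERVICE_REPOS:
        run(["git", "pull"], cwd=repo)

    # 2. Rebuild and restart the containers defined in docker-compose.yml.
    run(["docker-compose", "up", "-d", "--build"])

    # 3. Throw away obsolete images and stopped containers left over from
    #    previous builds.
    run(["docker", "image", "prune", "-f"])
    run(["docker", "container", "prune", "-f"])


if __name__ == "__main__":
    refresh_all()
```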
I can't stress enough how relieving it was for my team to have smooth, reliable automation. No more wasting half an hour every morning trying to figure out why what worked yesterday stops working today. Seamless automation is worth every effort.
4. Standardize everything
The upside of microservices is that developers have a lot of freedom in doing their work, as long as the service contract is respected. The downside of microservices is that developers have too much freedom in doing their work.

Aside from the choices of technology, a few things I kept running into:
- Mixed use of `CMD` and `ENTRYPOINT` in Dockerfiles (hint: they aren't supposed to compete, but to complement each other)
- Inconsistent log formats. The logging setup fluctuated from plain standard output to log rotators, from random debug lines to nginx-style logs. The Wild West can only get so crazy.
- Bash scripts with different names, in different directories, all doing the same thing: `run.sh`, `start.sh`, in a `scripts/` directory, or at the root level. And so on.
Taken independently, these are just a bunch of little, harmless variations on the right things to do to keep a software system maintainable. But together, they make life miserable for anyone whose job involves jumping in and out of services (i.e. everyone but the developers). Of course, monolithic codebases have this issue too, but there it can be ruled out easily even without a quality police, because everyone looks at the same thorns all the time. I didn't realize how different personal preferences could be (but hey, after all, we are a new team).
Well, eventually the variations pissed me off so much that I had development paused so we could sync up on the conventions and make an oath to keep them. For the watch!
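To give one concrete example of the kind of thing worth syncing up on: every Python service can configure logging through the same small helper, so log lines carry the same fields in the same order regardless of which service wrote them. The format string and service name below are illustrative, not our exact convention:

```python
# A sketch of a shared logging convention: every Python service calls the
# same helper, so logs look identical regardless of which service wrote them.
import logging
import sys

LOG_FORMAT = "%(asctime)s %(levelname)s %(name)s %(message)s"


def configure_logging(service_name, level=logging.INFO):
    handler = logging.StreamHandler(sys.stdout)  # one agreed destination
    handler.setFormatter(logging.Formatter(LOG_FORMAT))
    logger = logging.getLogger(service_name)
    logger.setLevel(level)
    logger.addHandler(handler)
    return logger


# Usage inside any service:
log = configure_logging("shipment-collector")
log.info("service started")
```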
5. Collective ownership doesn't work
By the book, Agile management encourages everyone to share responsibility for code quality: anyone can make necessary changes anywhere, and everyone is expected to fix the problems they find.

While the shared-responsibility part still rings true in every bit and byte, and we encourage it with tight feedback loops, a rigid definition of done, and tests as a safety net, the latter two parts go from bad to really bad. In a microservices system that employs various technologies, anyone can still make a change anywhere, but it is probably in his best interest that he doesn't. Let's take a quick look at the technologies I mentioned up front: Node.js is asynchronous by nature and one must forgo the traditional threading model (I got many raised eyebrows for saying that Node.js runs on only one big-ass thread, but it does, people!). Python is a dynamically-typed language and Java is a statically-typed one. Each language calls for a different mindset. Make a change to a codebase whose underlying philosophy you don't understand, and technical debt is probably the best outcome you can hope for.
And fix the problems they find? Typical hopeless optimism from management.
A more practical model is to let a developer take complete ownership of his service(s). Others might chip in to help, but within the service's scope, he is the technical lead, in charge of maintaining quality standards, code conventions, and whatever else he considers important. That might sound too rigid for an Agile team, and maybe it is. But screw the Agile label, I need reliability. With no monitoring system in place and a dozen services running, each with its own set of runtime problems, I need each person to know one thing inside out, not a bit of everything.
6. Keep an eye on everything that moves
OK, saying that people didn't tell me about the need to monitor a microservices system would be an overstatement. The topic of monitoring appears one way or another in the books and articles I have read about microservices. What didn't seem to be stressed enough is the sheer amount of work required to get monitoring up to the level where you can be confident about the system without fearing that something will fall apart the moment you turn away.

Right in the development environment, having all the logs gathered in one place that you can later `tail -f *.log` is a huge time saver. Depending on the third-party libraries you use, your log might be populated with a mumbo jumbo of bullshit. Take the effort to filter that out of your log; the investment pays for itself every time you inspect an inter-service bug. If possible, slice your log into three groups (a sketch follows the list):
- Activity log (or debug log) to monitor the flow of data between services
- Error log, so you can find the most critical thing right away when something goes awry
- Third-party library logs, in case you want to play it safe
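Here is a sketch of that three-way split using Python's standard logging module; the file names and the library loggers quieted at the end are illustrative:

```python
# A sketch of the three-way log split: activity, errors, and third-party
# noise each end up in their own tail-able file. Names are illustrative.
import logging
import os


def setup_log_groups():
    os.makedirs("logs", exist_ok=True)
    fmt = logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")

    # Activity (debug) log: the flow of data between services.
    activity = logging.FileHandler("logs/activity.log")
    activity.setLevel(logging.DEBUG)
    activity.setFormatter(fmt)

    # Error log: only the things that should wake somebody up.
    errors = logging.FileHandler("logs/error.log")
    errors.setLevel(logging.ERROR)
    errors.setFormatter(fmt)

    root = logging.getLogger()
    root.setLevel(logging.DEBUG)
    root.addHandler(activity)
    root.addHandler(errors)

    # Third-party libraries get their own file and stay out of the
    # activity log entirely.
    third_party = logging.FileHandler("logs/third_party.log")
    third_party.setFormatter(fmt)
    for noisy in ("kafka", "urllib3"):  # illustrative library loggers
        lib_logger = logging.getLogger(noisy)
        lib_logger.setLevel(logging.WARNING)
        lib_logger.addHandler(third_party)
        lib_logger.propagate = False  # keep them out of activity.log
```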
Once the system is in production, the focus shifts away from the flow of data and toward performance and health checks. You want to be able to know how many nodes a request had to travel through before the user could see anything and how much time it spent at each node, and to set up triggers that escalate when certain thresholds are crossed. For this purpose, we are using the ELK stack, where Logstash crawls the logs from the distributed servers and feeds Elasticsearch, with Kibana as the presentation layer on top. Formatting the logs and organizing the Kibana reports is then an ongoing job.
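A minimal sketch of what those per-hop timing logs can look like, assuming each service wraps its handlers in something like this and ships the JSON lines to Logstash (the field names are illustrative):

```python
# Emit one JSON log line per handled request, carrying a correlation id so
# Kibana can stitch the hops of a single request back together.
import json
import logging
import time

log = logging.getLogger("request-timing")


def timed_handler(service_name, request_id, handler, payload):
    started = time.time()
    result = handler(payload)
    log.info(json.dumps({
        "service": service_name,
        "request_id": request_id,  # the same id travels across all hops
        "elapsed_ms": round((time.time() - started) * 1000, 2),
    }))
    return result
```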
While ELK tells you about the performance of the system, it shows little about its health: a service can be serving requests in under 100ms while its RAM usage is a whopping 90% and its CPU utilization is always above 70%. That calls for a different set of system monitoring tools, like AWS CloudWatch, Nagios, or New Relic.
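The kind of numbers those tools collect can be sketched in a few lines, assuming the `psutil` package; a service could expose something like this as a `/health` endpoint next to its normal API:

```python
# A minimal health snapshot: CPU, RAM, and disk usage of the host the
# service is running on. Assumes the psutil package.
import psutil


def health_snapshot():
    return {
        "cpu_percent": psutil.cpu_percent(interval=1),      # averaged over 1s
        "memory_percent": psutil.virtual_memory().percent,  # RAM in use
        "disk_percent": psutil.disk_usage("/").percent,
    }


if __name__ == "__main__":
    print(health_snapshot())
```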
From time to time, while doing housekeeping work such as migrating a service to a bigger machine, scaling a service out to a few instances, or deploying new service instances while iteratively shutting down old ones to achieve zero downtime, you get really tired of constantly checking whether a service is up, and whether it is still at the same IP address. Well, that is when you find yourself longing for service discovery, with something like Consul or etcd.
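For a taste of what that looks like, here is a hedged sketch of registering a service with a local Consul agent over its HTTP API; the service name, address, port, and `/health` endpoint are assumptions, and the Consul documentation covers the full set of check options:

```python
# Register a service (with an HTTP health check) against a local Consul
# agent. Assumes the requests package; names and ports are illustrative.
import requests


def register_with_consul(name, address, port):
    payload = {
        "Name": name,
        "Address": address,
        "Port": port,
        "Check": {
            "HTTP": "http://{}:{}/health".format(address, port),
            "Interval": "10s",
        },
    }
    resp = requests.put(
        "http://localhost:8500/v1/agent/service/register", json=payload
    )
    resp.raise_for_status()


register_with_consul("shipment-collector", "10.0.0.12", 8080)
```

Once the registration is in place, other services ask Consul where `shipment-collector` lives instead of hard-coding an IP address.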
I am trying to illustrate two points here. First, it is crucial to keep a close eye on the system as a whole and optimize the flow of data. Second, it is very tempting to apply all the bells and whistles, surround yourself with dashboards, and get distracted from the only thing that matters: the system itself.