Yes, you should monitor your system
Knowing your system goes far beyond writing and reviewing the code. Chances are, you are the one maintaining it, taking feature requests from the product owner, and tweaking some documentation from time to time.
Also, you take all the blame when something goes bad.
Let’s say you are responsible for a complex system that needs to be connected to an external, real-time service such as Facebook Messenger.
One day your product owner messages you on Slack: “Production doesn’t work, some customers can’t reply to our chatbot. We lose money every minute! Do something about it.”
I know it better than I would like to admit. You know the drill. It is time to check if requests are getting proper responses, if the database is accessible and why the hell your unit test suite is still passing.
Everything looks fine. But the issue is confirmed to be the system’s fault. You need more info, and then you realize.
You yet again forgot to add logs to your software.
Sounds like a nightmare, doesn’t it? Fortunately, this doesn’t have to be your reality.
Why do I need to use logging in my projects?
As the example above shows, you never know what happens to your system in the wild. Logging is a way of teaching the service to talk to you as well as to the users. Suppose the external API changed slightly: it still accepts your requests, but returns different results. Nothing errors in your backend, yet the app is useless business-wise. You should keep a log of responses from the service, so you know when something works as expected and when it silently breaks. This can be a lifesaver if you deal with asynchronous tasks, where the order of execution can make or break your business.
Another often underrated use case for logging is auditing. There are many reasons why a long-running service may need to be audited. When that happens, a proper logging system is invaluable. The data science team can extract statistics about the system. Software engineers (even ones new to the project) can spot bottlenecks. And in the case of a merger with a different company, the legal team will have an easier time checking whether the system is compliant with all the new laws it has to obey.
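Here is a minimal sketch of what logging the external service's responses could look like in Python. The logger name, the `log_service_response` helper and the field names in `EXPECTED_FIELDS` are assumptions made for illustration. They are not taken from any real integration, so adjust them to whatever your service actually returns.

```python
import json
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logger = logging.getLogger("messenger_client")  # hypothetical logger name

# Hypothetical: fields we expect in every reply from the external service.
EXPECTED_FIELDS = {"recipient_id", "message_id"}


def log_service_response(status_code: int, body: dict) -> bool:
    """Log every response and warn when its shape differs from what we expect.

    Returns True when all expected fields are present.
    """
    logger.info(
        "External service replied: status=%s body=%s",
        status_code,
        json.dumps(body, sort_keys=True),
    )
    missing = EXPECTED_FIELDS - set(body)
    if missing:
        # No exception is raised here: the call "worked", it just changed shape.
        logger.warning("Response is missing expected fields: %s", sorted(missing))
        return False
    return True
```

With something like this in place, a silent change in the external API shows up as a stream of warnings instead of a confused message from the product owner.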
The developer that you pass the codebase to will also appreciate some higher verbosity.
We teach a lot about clean code, good practices and design patterns, but less about what actually happens once the system hits production. Let’s fill this gap.
How do I keep track
The best strategy for rigorous logging is to add logs incrementally with each code change. Just as you click “request changes” on a PR without unit tests, you should stop a PR that implements an important business logic function without logging anything to the console:
logger.info("User 1239876130 has been billed $30 for the Premium account renewal.")
As always, this comes down to team cooperation. Some may forget, some may oppose it because they don’t see the value, and some may not care, but the benefits will show eventually.
How should I implement logging?
To be honest, anyone who wants to implement logging without knowing the guidelines can be overwhelmed by the number of options. You can save to text files, print to STDOUT or STDERR, or send logs directly to external services. However, this complexity is not required.
The Twelve-Factor App, a set of industry-standard guidelines, treats logging as a first-class citizen of a system, on the same level as dependency management, app configuration and the codebase itself.
The general advice is really simple:
A twelve-factor app never concerns itself with routing or storage of its output stream. It should not attempt to write to or manage logfiles. Instead, each running process writes its event stream, unbuffered, to stdout. During local development, the developer will view this stream in the foreground of their terminal to observe the app’s behavior. — The Twelve-Factor App
So fear not! Your whole logging system can be as simple as print statements in the structure of your choice.
There are plenty of tools that work out of the box and can enhance your logging system. In this article I will try to list the ones that I have personally used, heard good things about, or am considering trying out in the future. If you know software that should make the list, please let me know at @wkulikowski1!
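As a rough illustration of the twelve-factor advice, the sketch below configures Python's standard `logging` module to write one line per event to stdout and nothing else. The helper name `build_stdout_logger`, the logger name `billing` and the line format are assumptions chosen for this example.

```python
import logging
import sys


def build_stdout_logger(name: str) -> logging.Logger:
    """Return a logger that writes each event as a single line to stdout."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        # StreamHandler flushes the stream after every record it emits.
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
        )
        logger.addHandler(handler)
    logger.propagate = False  # keep the root logger from printing the line twice
    return logger


logger = build_stdout_logger("billing")
logger.info(
    "User %s has been billed $%s for the Premium account renewal.", 1239876130, 30
)
```

The app never opens a log file. Routing and storage are left to whatever runs the process, such as your terminal during development or the platform in production.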
Cloud based solutions
Each of the major public cloud providers has its own logging solution, well integrated with the rest of its products. The list goes as follows:
- 👉 Google Cloud Logging for GCP (previously Stackdriver)
If your stack relies heavily on one of those platforms, staying inside the ecosystem may seem like the simplest solution. In AWS, for example, it is trivial to trigger a Lambda function when a certain log appears in CloudWatch. If you use Firebase, Google Cloud Logging is a couple of clicks away on your dashboard. Although vendor lock-in can become a problem in the future, using logging solutions from the same provider can cut your costs and speed up development.
Logstash is a part of the Elastic Stack. It focuses on gathering log data from a wide range of sources and then categorizing, sorting and transforming it as desired. Although in theory Logstash (and the whole Elastic Stack) is independent and open source, all cloud vendors offer some sort of ready, out-of-the-box solution for hosting the stack. It is pricey and demands a lot of computing power, but it remains extremely valuable for the many companies that run it in production every day.
Heroku add-ons
I am personally a fan of Heroku and how easy it makes adding new services to your system. For example, you can get a whole set of logging tools just by choosing the right add-on. Among those currently available are:
Although every solution serves a different need, each of them brings a valuable enhancement to your log management.
If logging is just a stream of plain text, monitoring is grouping, categorizing and visualizing insights about your system. Usually there are some charts, dashboards and, frankly, whatever the product manager requires at the time. Do we need to acquire as many new customers as possible? Then the most useful metrics will probably be landing page visits and user conversion. Do we need a stable payment system? Let’s track the number of errors our app experiences while calling the /payment endpoint.
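To make the /payment example a bit more concrete, here is a small sketch of an in-memory counter that tracks requests and server errors per endpoint. The class `EndpointErrorTracker` is hypothetical and only illustrates the idea. In a real system these numbers would usually go to a monitoring tool rather than live in process memory.

```python
from collections import Counter


class EndpointErrorTracker:
    """Count requests and 5xx responses per endpoint (illustrative only)."""

    def __init__(self) -> None:
        self.requests: Counter = Counter()
        self.errors: Counter = Counter()

    def record(self, endpoint: str, status_code: int) -> None:
        """Register one handled request and whether it ended in a server error."""
        self.requests[endpoint] += 1
        if status_code >= 500:
            self.errors[endpoint] += 1

    def error_rate(self, endpoint: str) -> float:
        """Share of requests to the endpoint that ended in a server error."""
        total = self.requests[endpoint]
        return self.errors[endpoint] / total if total else 0.0


tracker = EndpointErrorTracker()
for status in (200, 200, 500, 200):
    tracker.record("/payment", status)

print(tracker.error_rate("/payment"))  # 1 error out of 4 requests -> 0.25
```

A ratio like this is the kind of number that belongs on a dashboard: it is easy to chart over time and easy to compare against a target.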
Best practices for monitoring
It doesn’t take a data scientist to set up and interpret a dashboard, but it should be built and read with some important assumptions in mind.
Readability for newcomers
You’re not the only person reading the monitoring dashboard, and most likely you will not be around forever. Your role is to make the monitoring system accessible to everybody. Show critical charts first. Draw additional lines showing norms or averages for better context. Learn data visualization.
A healthy baseline could be the line mentioned in the previous paragraph. If I am a new maintainer of the project, I need to know what our targets are as soon as possible. If there is a business requirement for at least a 5% conversion rate, there should be a line on the chart showing how close the danger zone is.
True representation of your servers
Finally, your graphs should cover the whole system and picture it exactly as it is. Some endpoints will be under a heavier load than others, and some statistics are caused by events different from the ones you expect. If you are solving problem X and look only at stat Y, you assume that Y causes it. But what if it is caused by Z? You don’t see Z in your metrics. Not only will it be harder to measure now, you may also miss it entirely while solving the problem.
The list is probably super incomplete, but again, please let me know!
Kibana visualizes the logs that Logstash has collected and stored in Elasticsearch. It is a part of the Elastic Stack, which is really cool, and you should check it out!
Grafana is an “open observability platform” which focuses more on visualizing the database than logs. It is perfect for displaying ratios, business logic and more “static” data. Many companies use Grafana in production every day.
The last part of a truly healthy system is alerting. Stuff breaks. Your server will spit out 500 errors sooner or later, and you should be the first person to know about it.
We use Sentry extensively at 10Clouds and it works wonders for us. Sentry is open source as well, but the company also sells a managed solution as a SaaS package. After integration with your app, Sentry will catch all the errors, group them and notify you through the channel of your choice. Nothing happens unnoticed.
Nobody wants to maintain a black-box system, no matter if it works perfectly or fails mysteriously every second Thursday. Logging and monitoring have saved me more times than I would like to admit, and I encourage you to introduce an observability culture to your development team as well. If you need further help or just spot a mistake, as always, please message me on Twitter.
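To show the basic mechanism behind alerting without relying on any particular vendor, here is a sketch built only on Python's standard `logging` module. It is not Sentry's API. The `AlertHandler` class, the logger name and the `sent_alerts` list (a stand-in for Slack, e-mail or a pager) are all assumptions made for illustration.

```python
import logging
from typing import Callable, List


class AlertHandler(logging.Handler):
    """Forward every ERROR-or-worse log record to a notification callback."""

    def __init__(self, notify: Callable[[str], None]) -> None:
        super().__init__(level=logging.ERROR)  # ignore anything below ERROR
        self.notify = notify

    def emit(self, record: logging.LogRecord) -> None:
        self.notify(self.format(record))


sent_alerts: List[str] = []  # stand-in for a real notification channel

alert_logger = logging.getLogger("payments.alerts")  # hypothetical logger name
alert_logger.setLevel(logging.INFO)
alert_logger.propagate = False
alert_logger.addHandler(AlertHandler(sent_alerts.append))

alert_logger.info("Payment accepted")                # below ERROR: no alert
alert_logger.error("Payment endpoint returned 500")  # reaches the callback
```

A dedicated tool adds grouping, deduplication and delivery on top of this, but the core idea is the same: errors should reach a human without anyone having to go looking for them.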