A few months back I gave my first keynote at Agile Testing Days. It was a really exciting opportunity and I chose to share my journey to understanding observability. This meant going into a fair bit of detail, and the feedback has overwhelmingly been that the depth of content was what made people really understand the concepts (though of course it was a risk, and some felt it was a bit too deep in the weeds for a keynote!).
Thankfully the positive feedback has led to a few more deliveries, so you can see a version of the slides here and even a recorded version here. Having delivered the talk a few times now, I have gathered a fairly standard set of questions, so I figured recording them in some way might be helpful. Here goes…
ELK, Prometheus, … are expensive. What tooling can you recommend for small projects?
I would slightly challenge their expensiveness. Both are open source projects, so you mostly pay for what you store. I can understand how ELK costs run up quickly, but Prometheus is actually quite inexpensive to run, as you can store the data in S3 and keep the costs down to maintenance. But in any event, what to use for small projects? I would prioritise a few things…
1) Make sure the same tools are available in all environments. This builds better understanding, and therefore better leverage, of the tools. Of course your retention can be lower in non-prod, but keep the access the same.
2) Aggregation across all services. Make sure people can get to data about any and all services in a single place and query across them. In logging terms, that means getting logs into a central server and available via a single index. In metrics terms, this is standard in most metrics offerings.
3) Build in the structure early, even if you are not yet building in the discoverability via centralised tools. This means starting with JSON-structured logs. Start with a common status and metrics endpoint for each service so people know where to find it. And build tracing IDs into your logs.
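To make point 3 concrete, here is a minimal sketch of JSON-structured logs with a tracing ID, using only Python's standard library. The logger name, field names, and message are all made up for illustration:

```python
import json
import logging
import uuid

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            # Include a trace ID if the caller attached one via `extra=`.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)

logger = logging.getLogger("orders")          # hypothetical service logger
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# One trace ID carried across every log line for a given request
# makes the lines correlatable once they land in a central index.
trace_id = str(uuid.uuid4())
logger.info("order placed", extra={"trace_id": trace_id})
```

The payoff comes later: because every line is already JSON with consistent field names, pointing a central log server at the output needs no parsing rules.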
Do you have best practices for what to log?
Have a think about logging vs events here. If you want to log, the value is time-specific stuff. In general you can think about an event being created for each endpoint called. You can make that more fine-grained as you see a need and add a new event for each method called within a service, or even a new event for a super crazy for-loop set of logic. The idea is to scope an event to a piece of work that makes logical (and ideally business) sense. So, if you need to know when a process is triggered? Log it. If you need to know the order of events? Logging. If you want a massive stack trace? Could be a log. When looking at events, I would suggest adding fields for any specialty (non-PII) data. User IDs? Definitely. Parameters used in the methods? Probably.
In one slide you added the input parameters for a method to your properties. Do you also add local variables within the code? What about the result of a method?
Absolutely add more than just input parameters. Parameters are actually easy to get via framework tools in a fairly automated fashion (as suggested by another question!), so where you will see more manual additions in more mature code tends to be for internal information. Keep in mind this is the same as any other output, so you will need to be aware of any PII or other sensitive data you would not want pushed out to your telemetry. I have been working with a legacy app as well as the workshop app, and I have created some getting-started ideas for more legacy apps. For example, start by adding any variables you currently log out as part of a string as field/value pairs instead. Then look to capture the parameters of the methods. If that is already covered, you can look to add any intermediary data created within the body of a method.
Do you really collect your properties “manually” or do you use AOP (or something else) to collect them?
The app that I showed is a workshop app. It is small and has no intention of growing, and it keeps the conversation relatable even for non-deep-tech people. That is a long way of saying you are absolutely correct to think about ways to get this data with less manual/repetitive work. For example, we have just implemented an aspect (using Java’s AspectJ) to create subspans for each method within a certain class rather than adding them manually. So I would absolutely applaud your intention to use clean code methods while doing this!
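AspectJ itself is a Java tool, and the aspect mentioned above is not shown here, but the same "wrap every method without touching it" idea has a rough Python analogue in decorators. Everything below (class name, method, the printed span line) is an invented sketch, not the actual implementation:

```python
import functools
import time

def traced(fn):
    """Wrap a function so each call records a 'subspan'-style line."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.time()
        try:
            return fn(*args, **kwargs)
        finally:
            duration_ms = (time.time() - start) * 1000
            # Stand-in for emitting a real span to a tracing backend.
            print(f"span name={fn.__qualname__} duration_ms={duration_ms:.2f}")
    return wrapper

def instrument_class(cls):
    """Class decorator: apply `traced` to every public method,
    so no method needs manual instrumentation."""
    for name, attr in list(vars(cls).items()):
        if callable(attr) and not name.startswith("_"):
            setattr(cls, name, traced(attr))
    return cls

@instrument_class
class PriceService:  # hypothetical class being instrumented
    def quote(self, amount):
        return amount * 1.2
```

As with the aspect, the instrumentation lives in one place; adding a method to the class picks it up automatically.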
Do you test event logging?
We have tested that the way we call our libraries is correct. That is, we have validated that we do not end up in a null pointer exception scenario, and in the case where we use the AspectJ pattern in Java, we have tested that we are capturing the correct methods. At the same time, we are not testing that the library itself works correctly, so basically we follow the same patterns we use for any other library in our code.
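One way to express that distinction in a test, testing your call into the library rather than the library itself, is to stand in a mock for the logger. The function and field names here are hypothetical:

```python
import logging
from unittest import mock

def record_payment(logger, payment_id):
    """The code under test: emits one event per payment."""
    logger.info("payment recorded", extra={"payment_id": payment_id})

def test_logging_call_is_safe():
    # Assert our *call* is well-formed (no exception, expected fields),
    # not that the logging library itself works -- that is its own
    # project's responsibility.
    logger = mock.create_autospec(logging.Logger, instance=True)
    record_payment(logger, "p-1")
    logger.info.assert_called_once_with(
        "payment recorded", extra={"payment_id": "p-1"}
    )

test_logging_call_is_safe()
```

Using `create_autospec` also means a call that doesn't match the real `Logger` signature fails the test, which catches the wiring mistakes without exercising the library.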
Assume you have a method that throws an Exception and does not handle the Exception in a try – catch block. How do you make sure you receive your collected data?
I believe this can be handled in certain languages and tends to be called something like “flushing”. We are currently doing that in Python, but I believe other languages and most logging libraries have something to this effect.
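The exact mechanism we use isn't shown here, but as one illustration of the idea in Python's standard library, you can install an exception hook that flushes telemetry before the process dies on an unhandled exception. The logger name and message are made up:

```python
import logging
import sys

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("worker")  # hypothetical service logger

def flush_on_crash(exc_type, exc_value, tb):
    """Record the failure, then flush buffered output before dying.
    `logging.shutdown()` flushes and closes every installed handler."""
    logger.error("unhandled exception", exc_info=(exc_type, exc_value, tb))
    logging.shutdown()
    # Fall through to the default behaviour (print the traceback).
    sys.__excepthook__(exc_type, exc_value, tb)

# Install once at startup; it runs for any exception nobody catches.
sys.excepthook = flush_on_crash
```

Tracing libraries typically expose an equivalent explicit flush/shutdown call for the same reason: anything still buffered when the process exits is otherwise lost.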
If you had to decide between adding automated tests or event logging to a brownfield project (with neither), what would you do?
IMO this depends firstly on risk and secondly on the time you can invest. In the book “Working Effectively with Legacy Code” it is suggested to wrap any unfamiliar/untested code in a high-level end-to-end test before refactoring/introducing unit tests, so that you can properly refactor (change implementation, not behaviour) safely. So, if you can withstand the introduction of an issue into production and are comfortable with your process and time to fix, then go for logging/alerting. If that structure is not yet in place, I would lean towards pre-production automated testing. But if you do have the right circumstances, adding events provides a much more robust opportunity to identify and debug issues beyond the ones you are currently aware of, so there are definitely reasons to start there.
What do you think about sentry.io?
We use this at our company and I think it can provide fantastic benefit for front-end code. That being said, I bucket it with things like static code analysis tools, where all they can do is point out issues. And if you don’t have the time/prioritisation to fix the issues found, they quickly become ignored. For example, a few years ago I ran a static code analysis tool against my project and the output was so large it maxed out the console. That left the team dismissive of using it. It took me almost a month of chipping away at the “silly” errors to get the rest of the team to notice the really tough ones in there. This is happening right now with Sentry at MOO, where we max out our paid account’s worth of reported errors about halfway through each month and the mountain seems very high to climb!
Please let me know what other questions you may be interested in and I will look to expand this list! Also huge thanks to Alex Schladebeck and her team at Bredex for helping shape many of these interesting questions!