Recently there was an ask in a semi-public forum:
I’m interested in what baby steps I can take to try and improve observability in general but particularly in production. I know this is a quite general question (but not sure I know enough about it yet to ask the right thing.)
Are there resources or tips that you would recommend for an observability newbie? Something that could help me define achievable goals and a path to start chipping away at them or at the skills needed to achieve them?
A curious QA type
This is far from the first time I have heard this asked, and far from the first time I have tried to answer. This is just the first time that answer is making it onto this blog!
The context for this ask: the person has experience as a QA and works at an organisation that uses the observability vocabulary, but one where engineers are often too stretched to really invest in the necessary experiments and changes.
Before I share my answer, I should say, the fabulous Lisa Crispin was also in this forum and she kicked the answer off with some great suggestions:
- Introductory guides to observability on Honeycomb.io
- A (somewhat unkept) resource testingInDevOps.org
- Katrina Clokie’s book Practical Guide to Testing in DevOps
- Getting started with Proofs of Concept (PoCs) with tools and techniques like OpenTelemetry
- Asking questions centered around debugging production issues faster
I came late to the party, but here is what I wrote.
Note: this is unedited from a chat response since if I try and write something blog worthy I’ll never end up publishing it!
You will see the concept of “3 pillars” thrown around a lot. These are really helpful and comforting when reaching for a specific tool or skill to learn. They speak to the data types that often make up telemetry (fancy word for data your app exposes). The three are logs, metrics and traces. I would recommend the writing by Spees to understand the jargon.
When it comes to where to start in concept, I think the idea to keep in mind is how to lessen the distance between tech and business (to benefit both!). And to lessen the distance between noobs and long timers (again benefits both!).
So what is hard right now? Figuring out if users are happy? Figuring out where a problem is coming from? Start asking dream like questions…. “Boy I wish I could just do a quick search and know which service is struggling” or “wouldn’t it be nice to let our customers know they may be feeling impact before they tell us??”
Of course these are probably dream worlds for you (they are for me!) but asking them, and then asking why not will start to expose the missing data or missing data structure.
Each person and team needs to go on their own learning journey (yes… every human does touch the hot stove at least once!). But a hot tip:
Don’t forget about the goals of lessening the distance between business <-> tech, noobs <-> experienced
It will be tempting to really lean into the tech and experienced side of those things. But keeping an eye on balancing will lead you towards trying to use fewer tools (which naturally leads to using events over logs, as they can cover both tracing and logging style output!) and towards making it easier to query (and naturally this will lead towards structured data!).
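To make "events over logs" and "structured data" a little more concrete, here is a minimal sketch in Python using only the standard library. Everything in it is an assumption for illustration: the field names (`service`, `customer_tier`, `duration_ms`), the helper names, and the sample values are hypothetical and not taken from any particular tool. The idea is simply that one structured event can carry the timing and trace ID you would want from a trace plus the context you would normally put in a log line, and that it stays easy to query.

```python
import json
import time
import uuid


def build_event(name, trace_id, duration_ms, **fields):
    """Return one structured event as a plain dict (hypothetical shape)."""
    event = {
        "timestamp": time.time(),
        "name": name,
        "trace_id": trace_id,        # lets related events be grouped, trace style
        "duration_ms": duration_ms,  # timing, as a trace span would carry
    }
    event.update(fields)             # business and tech context, log style
    return event


def to_log_line(event):
    """Serialise the event as a single JSON line that tools can parse."""
    return json.dumps(event, sort_keys=True)


def slow_services(events, threshold_ms):
    """Answer 'which service is struggling?' with a simple filter."""
    return sorted(
        {e.get("service", "unknown") for e in events if e["duration_ms"] > threshold_ms}
    )


# Illustrative data only: two events from the same (made up) request.
trace_id = uuid.uuid4().hex
events = [
    build_event("checkout", trace_id, 1840.0, service="payments", customer_tier="free"),
    build_event("checkout", trace_id, 95.0, service="cart", customer_tier="free"),
]

for event in events:
    print(to_log_line(event))

print(slow_services(events, threshold_ms=1000))
```

Because every event is a set of key/value pairs rather than a free-text sentence, the "quick search" dream question becomes a filter on `duration_ms` and `service`, and a field like `customer_tier` keeps the business side in the same record as the tech side.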
A couple additional readings could be: