A (short) answer on getting started with observability as a QA

Recently there was an ask in a semi-public forum:

I’m interested in what baby steps I can take to try and improve observability in general but particularly in production. I know this is a quite general question (but not sure I know enough about it yet to ask the right thing.)

Are there resources or tips that you would recommend for an observability newbie? Something that could help me define achievable goals and a path to start chipping away at them or at the skills needed to achieve them?

A curious QA type

This is far from the first time I have heard this asked, and far from the first time I have tried to answer. It is just the first time that answer is making it to this blog!

The context for this ask is a person who has experience as a QA and is at an organisation that uses the observability vocabulary, but which is often too stretched for engineers to really invest in the necessary experiments and changes.

Before I share my answer, I should say, the fabulous Lisa Crispin was also in this forum and she kicked the answer off with some great suggestions:

  • Introductory guides to observability on Honeycomb.io
  • A (somewhat unmaintained) resource, testingInDevOps.org
  • Katrina Clokie’s book Practical Guide to Testing in DevOps
  • Getting started with Proofs of Concepts (PoCs) with tools and techniques like OpenTelemetry
  • Asking questions centered around debugging production issues faster

I came late to the party, but here is what I wrote.

Note: this is unedited from a chat response since if I try and write something blog worthy I’ll never end up publishing it!


You will see the concept of “3 pillars” thrown around a lot. These are really helpful and comforting when reaching for a specific tool or skill to learn. They speak to the data types that often make up telemetry (fancy word for data your app exposes). The three are logs, metrics and traces. I would recommend the writing by Spees to understand the jargon.

When it comes to where to start in concept, I think the idea to keep in mind is how to lessen the distance between tech and business (to benefit both!). And to lessen the distance between noobs and long timers (again benefits both!).

So what is hard right now? Figuring out if users are happy? Figuring out where a problem is coming from? Start asking dream-like questions… “Boy I wish I could just do a quick search and know which service is struggling” or “wouldn’t it be nice to let our customers know they may be feeling impact before they tell us??”

Of course these are probably dream worlds for you (they are for me!) but asking them, and then asking why not will start to expose the missing data or missing data structure.

Each person and team needs to go on their own learning journey (yes… every human does touch the hot stove at least once!). But a hot tip:

Don’t forget about the goals of lessening the distance between business <-> tech, noobs <-> experienced

It will be tempting to really lean into the tech and experienced side of those things. But keeping an eye on the balance will lead you towards using fewer tools (which naturally leads to using events over logs, as they can cover both tracing and logging style output!) and towards making it easier to query (which naturally leads towards structured data!).

A couple additional readings could be:


Using Google Secrets Manager (GSM) as a Terraform tfvars store

The premise

When writing Terraform, it is good practice to parameterise variables when and where you can. This helps to reduce duplication where possible and make changing variables easier. In some cases, the variables are convenience things like what region to run the cloud infrastructure in. In other cases they are secret values you actively do not want to store in your source control (e.g. an API key).

Terraform provides a few different ways to pass these variables in, including flat files and environment variables. While you can use these techniques securely, they can get unwieldy as projects grow, so many people opt for third-party storage options like HashiCorp Vault, AWS SSM, or GCP GSM.

During a move towards using GSM I found an unexpected behaviour that I wanted to capture.

The tech

Terraform is a DSL for writing declarative configuration files across many different cloud services. This post is not about what Terraform is, or why to use it as there are lots of other great resources for that.

Google Secrets Manager (GSM) is a key-value store with built-in versioning. Again though, this is not a post about what GSM is or why to use it.

For the purpose of this post, I am going to be using Cloud Run on the Google Cloud Platform (GCP) free tier. My Cloud Run code is pretty much a copy of the great quickstart blog from Seth Vargo and lets us get right to the discussion around secrets. The full code from this blog post is available here.

Setting variables in Terraform

Imagine a world where we have a Cloud Run service that needs an environment variable set. This would require:

1. A variable to be declared

variable "secret_variable" {}

2. That variable to be called from the service

resource "google_cloud_run_service" "my-service" {
  name     = "my-service"
  location = var.google_provider_vars.region

  template {
    spec {
      containers {
        image = "gcr.io/cloudrun/hello"
        env {
          name = "PUBLIC_VARIABLE"
          value = "insecure"
        }
        env {
          name = "PRIVATE_VARIABLE"
          value = var.secret_variable
        }
      }
    }
  }

  traffic {
    percent         = 100
    latest_revision = true
  }

  depends_on = [google_project_service.run]
}

And to run a command like terraform plan or terraform apply we would need to set that variable.

This seems simple enough, but we need a way to store this secret variable securely while also collaborating with a number of other people on the same project. This means passing the values somehow so that they can be shared and used in the same way by everyone.


A quick review of the built in options for using variables from Terraform:

Option 1: Let Terraform ask via the command line

Command line input is the easiest and quickest way to get moving. As Terraform runs it will stop at each missing variable declaration and ask for user input.

$ terraform plan
var.secret_variable
  Enter a value:

Not only is this error prone and tiring for a person, it is impossible within CI. For this reason it is very rare for anyone to use this method aside from early-days spiking.

Option 2: Set environment variables

Setting environment variables is a step better since this is at least repeatable within the scope of a single engineer and single session. That being said, it requires the environment variables to be stored, which makes them a target for anyone on the machine. It also requires that they be shared among any collaborators on the project. For those reasons this again is usually a viable option only for small projects.

$ export TF_VAR_secret_variable="super secret" && terraform plan

Option 3: Use an additional -var-file

While Terraform automatically picks up variables in a terraform.tfvars file within the project, you are able to pass in additional files via the command line. The viability of this strategy depends heavily on how you store this file. But no matter how securely you store it, it is still a file on disk, which makes it a higher risk if people have access to the server that runs Terraform.

$ terraform plan -var-file="terraform-secret.tfvars" 
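For illustration, a hedged sketch of what such a file might contain (matching the earlier variable declaration, and kept out of source control):

# terraform-secret.tfvars
secret_variable = "super secret"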

With all three of these options the variables are stored in some way on the operator’s machine and need to be shared across people via a secondary process.

In my particular case, we were using a tool called Atlantis to run our Terraform commands in CircleCI. Our Atlantis server was running in Kubernetes which allowed us to store the variables in a Kubernetes Secret and then “mount” that as if it were a directory on the server so Terraform could read in the values. In Atlantis you can tell it where to look for additional variable files when it is running Terraform commands.

So if you put all that together, we had something like this:

# atlantis-deployment.yaml
apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 1
  template:
    spec:
      containers:
      - name: atlantis
        image: runatlantis/atlantis:latest
        ...
        volumeMounts:
        - mountPath: "/.config/tfvars"
          name: atlantis-tfvars
          readOnly: true
      volumes:
      - name: atlantis-tfvars
        secret:
          secretName: atlantis-tfvars        

# Atlantis.yaml
---
version: 3
projects:
- dir: infra
  workflow: infra
workflows:
  infra:
    plan:
      steps:
      - init
      - plan:
          extra_args: [-var-file, /.config/tfvars/infra.tfvars]

While this worked well for a while, we wanted to evolve away from secrets on disk, and also away from a kind of “chicken and egg” scenario where we needed to write the secrets to this Kubernetes Secret in order to then write them to Terraform (which was often creating another, more specific Kubernetes Secret!).

In comes Google Secret Manager (GSM)

GSM was only introduced about a year ago and has been on our wish list for a bit. So we finally got around to spiking out GSM as a secret store, and on first research it seemed to both fit our need and be easy to implement.

So going back to the Cloud Run demo example, we needed to create a new secret and begin using it. This was a three-part process.

1. Create a secret in Terraform. This acts like a folder which can hold 0 to many versions of a single value.

resource "google_secret_manager_secret" "secret_variables" {
  secret_id = "secret_variables"
  project   = data.google_project.project.number

  replication {
    automatic = true
  }
}

2. Manually create a version of that secret in the GCP Console. This needs to be done manually as this is how we manage to not store any secret values in code. Think of this like storing a value in your password manager.

Notice that while .tfvars files are key = value text files, Terraform provides built-in support for JSON, so it seemed sensible to store those variables as JSON secrets.

{
    "secret_variable": "super secret"
}

3. Reference that secret version in Terraform code. This means reading and parsing that secret value and then referencing that value where necessary.

# secrets.tf
locals {
  secret_variables = jsondecode(data.google_secret_manager_secret_version.secret_variables.secret_data)
}

data "google_secret_manager_secret_version" "secret_variables" {
  provider = google-beta

  project = data.google_project.project.number
  secret  = google_secret_manager_secret.secret_variables.secret_id
  version = 1 # valid json
}

# cloudrun.tf
...
env {
  name = "PRIVATE_VARIABLE"
  value = local.secret_variables.secret_variable
 }
...

Nice! This was almost too easy! When we run our plan with this code we can see that the secret does not change and that is exactly what should happen. We just changed from feeding it via a file to reading it from a cloud service.

But there is a catch!

As I started to roll out the change past proof of concept level, I ran head first into the 64kb size limit of GSM which resulted in a much bigger issue.

While this limit is well documented, it doesn’t have the nicest user experience in the UI. When you paste a value into the version creator in the UI, it just stops at 64kb. No warning that you hit the limit. Just an abrupt end to the pasting of values. I did not notice this, and so when I tried to run a Terraform plan I was running it with invalid JSON.

Unfortunately, the built-in Terraform function jsondecode() reacts very badly to invalid JSON. It prints the entire invalid string in its error message. That behaviour does not seem to be in the official documentation, and usually isn’t a big issue.

So the combination of secret values and the unexpected truncation of those values meant that on that plan all the values of that secret were printed to the console!

Below is an example output from the demo Cloud Run repo:

$ terraform plan
...
Error: Error in function call

  on secrets.tf line 2, in locals:
   2:   secret_variables = jsondecode(data.google_secret_manager_secret_version.secret_variables.secret_data)
    |----------------
    | data.google_secret_manager_secret_version.secret_variables.secret_data is "{\n    \"secret_variable\": \"super secret\"\n"

Call to function "jsondecode" failed: EOF.

In evaluating the impact of this, it nearly convinced us not to keep using this setup. While it was nicer than our -var-file solution, the risk of invalid JSON exposing secrets was just too high to accept as-is.

Searching for a workaround

What we realised was that we needed a way to validate every secret Terraform would try to decode before it ran any commands. Given what I shared earlier about Atlantis, we had an idea to run a JSON validator as part of the Atlantis workflow.

While this felt better, we still had concerns. How would we futureproof against new secrets being added? What happens if we start using non-JSON secrets? And is this really the right place to be doing this check?

Our final solution

Thankfully we kept looking, and we came across Terraform’s “escape hatch”…the external data source. HashiCorp warns against it, for good reason, in the docs. Depending on the use case, it could create some really tough breaking scenarios on future Terraform upgrades. But right down the bottom there is a section called “Processing JSON in shell scripts”. While this is meant as a side effect of the data source, it was actually exactly what we needed!

So, instead of having to verify the secret JSON before running Terraform and then decode it separately, we could verify and decode in a single step while setting our own sensible error messaging.

First we needed to create a small shell script to parse valid JSON and error in a safe way on invalid JSON:

#!/usr/bin/env bash
# json_validator.sh: passes valid JSON through untouched, fails safely otherwise
set -e

SECRET_NAME=$1
SECRET_VALUE=$2

# Attempt to parse the value without printing it anywhere
set +e
echo "$SECRET_VALUE" | python -m json.tool >/dev/null
EXIT_CODE=$?
set -e

if [[ "$EXIT_CODE" -eq 0 ]]
then
  # Valid JSON: echo it for the external data source to consume
  echo "$SECRET_VALUE"; exit 0;
else
  # Invalid JSON: report the size, but never print the secret itself
  SECRET_VALUE_SIZE=$(echo "$SECRET_VALUE" | wc -c | awk '{ foo = $1 / 1024 ; print foo "kb" }')
  >&2 echo "The secret ${SECRET_NAME} is ${SECRET_VALUE_SIZE} and did not parse as valid json"; exit 1;
fi

Then we needed to swap out our local variable that used jsondecode and instead pass those values to this script. The external data source exposes a result attribute containing the returned JSON object, which can be referenced via the usual techniques:

# secrets.tf
data "external" "secret_variables" {
  program = [
    "./json_validator.sh",
    data.google_secret_manager_secret_version.secret_variables.secret,
    data.google_secret_manager_secret_version.secret_variables.secret_data
  ]
}

# cloudrun.tf
...
env {
  name = "PRIVATE_VARIABLE"
  value = data.external.secret_variables.result.secret_variable
 }
...

So the code is a bit more complicated, but not actually that different from the previous local variables example. The terraform plan output is where this really shines. If we send through bad JSON, this is what we get:

$ terraform plan
...
Error: failed to execute "./json_validator.sh": Expecting object: line 3 column 1 (char 40)
The secret secret_variables is 0.0351562kb and did not parse as valid json

Conclusion

I am fairly new to GCP in general, and since Google Secrets Manager is only about a year old I am also new to this service. That probably increased my surprise at the size limitation, but I was far more surprised by the behaviour of the Google Terraform provider, which did not mark the secret version data as sensitive and so let it be printed to the console output. But despite this surprise (and possible catastrophic error), I am super happy with the process we came up with!

FAQ for my Agile Testing Days keynote

A few months back I gave my first keynote at Agile Testing Days. It was a really exciting opportunity and I chose to share my journey to understanding observability. This meant I needed to go into a bit of detail, but I have overwhelmingly received the feedback that the depth of content was what made people really understand the concepts (though of course it was a risk and some felt it was a bit too deep in the weeds for a keynote!).

Thankfully the positive feedback has led to a few more deliveries of it so you can see a version of the slides here and even a recorded version here. By managing to deliver this a few times I have gotten a fairly standard set of questions so I figured recording these in some way may be helpful. Here goes…

ELK, Prometheus, … are expensive. What tooling can you recommend for small projects?

I would slightly challenge their expensiveness. Both are open source projects, so you pay for what you store. I can understand how ELK costs can run up quickly, but Prometheus is actually quite inexpensive to run as you can store data in S3 and keep the costs down to maintenance. But in any event, what should you use for small projects? I would prioritise a few things…

1) Make sure the same tools are available in all envs. This builds better understanding and therefore leverage of the tools. Of course your retention can be lower in non-prod, but keep the access the same.

2) Aggregation across all services. Make sure people can clearly get to data about any/all services in a single place and query across them. In logging terms that means getting them into a central server and available via a single index. In metrics terms this comes more standard with most offerings.

3) Build in the structure early, even if you are not building in the discoverability via centralised tools. This means start with JSON structured logs. Start with a common status and metrics endpoint for each service so people know where to find it. And build tracing IDs into your logs.
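As a hedged illustration (field names are just examples), a single structured log line with a tracing ID baked in might look like:

{
  "timestamp": "2021-03-01T10:15:02Z",
  "level": "info",
  "service": "orders",
  "trace_id": "4bf92f3577b34da6",
  "span_id": "00f067aa0ba902b7",
  "message": "order submitted"
}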

Do you have best practices for what to log?

Have a think about logging vs events here. If you want to log, the value is time-specific stuff. In general you can think about an event being created for each endpoint called. You can fine-grain that more as you see a need and add a new event for each method called within a service, or even add a new event for a super crazy for-loop set of logic. The idea is to scope an event to a piece of work that makes logical (and ideally business) sense. So, if you need to know when a process is triggered? Log it. If you need to know the order of events? Logging. If you want a massive stack trace? Could be a log. When looking at events, I would suggest adding fields for any specialty (non-PII) data. User IDs? Definitely. Parameters used in the methods? Probably.
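As a hedged sketch (the field names are made up), an event scoped to a single endpoint call might carry fields like:

{
  "name": "POST /orders",
  "duration_ms": 240,
  "user.id": "u-82731",
  "order.item_count": 3,
  "payment.provider": "stripe"
}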

In one slide you added the input parameters for a method to your properties. Do you also add local variables within the code? What about the result of a method?

Absolutely add more than just input parameters. Parameters are actually easy to get via framework tools in a more automated fashion (as suggested by another question!), so where you will see more manual additions in more mature code tends to be for internal information. Keep in mind, this is the same as any other output, so you will need to be aware of any PII or other sensitive data you would not want to push out to your telemetry. I have been working with a legacy app as well as the workshop app, and I have created some getting-started ideas for more legacy apps. For example, start with adding any variables you currently log out as part of a string as field/value pairs. Then look to get any parameters from the methods. If that stuff is already covered you can look to add any intermediary data that is created within the body of a method.

Do you really collect your properties “manually” or do you use AOP (or something else) to collect them?

The app that I showed is a workshop app. It is small and has no intention of growing. It also keeps the conversation relatable even for non-deep-tech people. That is a long way of saying you are absolutely correct to think about ways to get this data with less manual/repetitive work. For example, we have just implemented an aspect (using Java’s AspectJ) to create subspans for each method within a certain class rather than adding that manually. So I would absolutely applaud your intention to use clean code methods while doing this!

Do you test event logging?

We have tested that the way we call our libraries is correct. As in, we have validated that we do not end up in a null pointer exception scenario, and in the case where we are using the AspectJ pattern in Java we have tested that we are capturing the correct methods. At the same time, we are not testing that the library itself works correctly, so basically we are following the same patterns we use for any other library in our code.

Assume you have a method that throws an Exception and does not handle the Exception in a try–catch block. How do you make sure you receive your collected data?

I believe this can be handled in certain languages and tends to be called something like “flushing”. We are currently doing that in Python but I believe other languages and most logging libraries have something to this effect.
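As a minimal sketch of that idea in Python, using the standard logging module as a stand-in for whatever telemetry library is in play, the flush lives in a finally block so it runs even when the exception is never caught:

import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)

def do_work():
    # Stand-in for code that raises without its own try/except
    raise RuntimeError("boom")

def main():
    try:
        log.info("starting work")
        do_work()
    finally:
        # Flushes and closes all handlers even as the exception propagates;
        # tracing/event SDKs usually expose a similar force_flush() or close() hook.
        logging.shutdown()

if __name__ == "__main__":
    main()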

If you had to decide between adding automated tests or event logging to a brownfield project (with neither), what would you do?

IMO this very much depends on risk and, secondly, on time to invest. The book “Working Effectively with Legacy Code” suggests wrapping any unfamiliar/untested code in a high-level end-to-end test before refactoring/introducing unit tests, so that you can properly refactor (change implementation, not behaviour) safely. So, if you can withstand the introduction of an issue into production and are comfortable with your process and time to fix, then go for logging/alerting. If that structure is not yet in place I would lean towards pre-production automated testing. But if you do have the right circumstances, adding events provides a much more robust opportunity to identify and debug more issues than just the ones you are currently identifying, so there are definitely reasons to start there.

What do you think about sentry.io?

We use this at our company and I think it can provide fantastic benefit for front end code. That being said, I bucket it with things like static code analysis tools, where all they can do is point out issues. And if you don’t have the time / prioritisation to fix any issues found, they quickly become ignored. For example, a few years ago I ran a static code analysis tool against my project and the output was so large it maxed out the console. That left the team dismissive of using it. It took me almost a month of chipping away at the “silly” errors to get the rest of the team to notice the really tough ones in there. This is happening right now with Sentry at MOO, where we max out our paid account’s worth of reported errors about halfway through each month and the mountain seems very high to climb!

Please let me know what others you may be interested in and I will look to expand this list! Also huge thanks to Alex Schladebeck and her team at Bredex for helping shape many of these interesting questions!

Defining Platform Engineering vs SRE from a tester’s vocabulary

Going off the brilliant Angie Jones‘ tweet, I am going to try and spit out more blogs even if they are short so that they can start the interesting conversations.

 

I am grateful to be a part of some fun slack groups and in one a question was raised:

What is the difference between site reliability engineering and platform engineering? I think I’ve heard @abby.bangser say she’s a platform engineer. But it seems like that work is also about site reliability. I can’t articulate what it is I want to learn about because I’m not even sure of the right terms! “DevOps-y stuff” or “CD-stuff” doesn’t sound very professional.

From here a few people jumped in with their ideas, but the overwhelming response seemed to be “It all sounds super related”, and I have to agree. It is fantastic that the worlds are melding so much more these days, but as someone who felt very much on the outside of it all just 15 months ago, I have a huge amount of empathy for people who want to join the fray and just aren’t sure where to start or what terms mean what.

For some context, by the time I responded it was more of a comparison between SRE and Platform Engineering. In addition, this group is heavily populated by testers or former testers, so I chose to describe my understanding using my testing industry definitions. It seemed to help, but I would be very curious where this muddies the waters and what I can do to better welcome testing-experienced people into the world of operations.

So here is the exact quote of what I wrote…

IMO SRE is like the quality analyst but for systems. They are the ones who help identify/quantify, test for, and track quality metrics of the system (hence SLAs being big in that role). But just as a great QA can also be a badass bug hunter because they know the holistic system, an SRE can be a badass triager for the same reasons. Hence often being thought of during incidents.

Versus a platform engineer. IMO these are more like the automation engineers of the testing space. They understand the values the same as QA, but instead of focusing on how to socialise and define what the concepts look like at the org level, they are the “roll up your sleeves and do it” group. Helping run the tools that make your software teams rock and roll. They will probably run a CI server, a git server, maybe a code eval tool like sonarqube, testing tools like pact etc.

I’m totally open to evolving that. That’s just what I understand right now. I am targeting being an SRE, but by joining a platform team I feel I am gaining some understanding of the underlying system and tools so that I can be more effective in that role (just as I believe people who have some software delivery experience via personal or close collaboration are most effective when testing software).

And given the first person asking the question is in very much the same position I was in where they REALLY want to get involved in exploratory testing in production via observability tooling, I added one more thing…

Sorry actually last thing. I also wanted to learn how to use o11y tools but kept getting stopped by not having them/access. Hence going to platform to get rid of any blockers. BUT given the right job/org etc I wouldn’t have felt a need. And right now I’m actually kinda itching to get to use them for real on software as being on platform limits some usecases even if I’m able to make them available to the software teams.

So this evolved even more and did touch on how reliability engineering is SO MUCH more than just software (please see Nora Jones, John Allspaw and others if you need to learn more here!), but I want to understand…is using these analogies helpful? Harmful? Confusing? A good entry point?

Baby steps into tracing

At MOO we have had tracing for a long time. It was based on a solution created in-house before the likes of Jaeger and OpenZipkin were made widely available, and it arguably met the same requirements at the time. However, like most home-grown tools, the people who built it moved on, and the support and even the use of it dwindled to the point where some of the newer developers didn’t even realise tracing was supported at the company.

When the ask for tracing support surged recently, we realised our under-invested-in tool was just not the right fit anymore, so we went on a hunt for what we could move to. We identified roughly three categories of tools…the magic promise-to-do-everything-for-you-for-the-cost-of-your-first-born-child tool (™️), the open source tracing tool, and the new dark horse observability tool. We evaluated at least one tool in each category, and when balancing developmental, operational and cash costs we picked the dark horse option of Honeycomb. While Honeycomb can support a lot more, it also supported our tracing needs right now and we were excited for what we could do to grow into more interesting use cases.

So now it has been a few months and just like anything else, we have not grown in a clear linear path. We have access…mostly everyone has signed in at least once. We have tracing…in at least what used to have tracing plus a few more services. We have…a couple…big wins with more interesting use cases.

So now we are looking at how to make sure everyone at the company comes on the journey into the new and exciting world. While the previous statements were setting the scene and pretty steeped in clear facts, the next few things are just my personal findings about our experience and what I think has helped us.

While we have had some interesting wins using features like BubbleUp, it seems that focusing on getting pretty stock-standard tracing has been a good way to build experience with the UI, conversations around how communication across crews will work via their data interactions, and what our observability language and norms will be. Below are a couple of things that I have noticed in our experience that have helped with onboarding and upskilling.

One, two, many

When introducing a new use case for the data, get it working in a big-impact service first. In our case, we added deployment markers for our monolith (recently made into a pretty bad ass one too!). This means that when a new deployment goes out to production, we can tell, so if something goes wrong we can also see whether it correlates with a deployment.

This helped us clearly identify when one of our servers did not provision correctly after a deployment and we saw an introduction of 400 errors. In this case it was a provisioning issue (with starting up tomcat) and not a software issue, so it was limited to a single host and therefore not visible in the bottom graph since the count of errors was quite small in comparison to our traffic.

A look at how we served a higher number of 400 errors after a deploy (#92388), but the fix was actually hardware so there is no corresponding release to fix it.

Then one of our teams saw these and thought it was great! So they introduced markers for their application too.

And as we saw people picking up this awesome opportunity to gain insights, we also saw how the world would look if EVERY deployment was marked given the high number of services and deploys we do every day.

Overlapping deploy markers given excitement over the new feature to try.

Getting to good here is partly about learning best practices, like changing colours depending on the app and deciding which deploys need a marker, and partly about waiting for the feature request to show only certain markers rather than all or nothing.

But in either case, starting by introducing an example of a feature without having to have it sorted out at scale yet generated a level of conversation I just don’t think we would have had if it were around a planning whiteboard.

Work in bursts

Tracing can appear like a magical image of what is happening in your system. It is a bit like all of a sudden joining Ms Frizzle’s class and taking the magic school bus to see what actually happens when someone clicks a button on your website. But in reality it is just a bit of grunt work to make each and every service conform to some simple communication rules. In our case we need every service to send a header with its request that says what the overarching trace id is, what its own id is, and whether there is any specific data that it wants to be known by any of its children. What this does is allow every request that gets that header to know what bucket it falls in (the trace id) and how it is organised within the bucket (the other id is its parent). Well, there is a reason for the Tower of Babel story: communication can be hard, and we ran into that.
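As an illustrative sketch only (actual header names vary by library; Honeycomb’s beelines bundle this into a single X-Honeycomb-Trace header, and the W3C standard calls it traceparent), the propagated context boils down to something like:

X-Trace-Id: 7a085853722dc6d2         # which trace (bucket) this request belongs to
X-Parent-Span-Id: 67fa5d6fa77b5c54   # the caller's own span id, so this span knows its parent
X-Trace-Fields: tier=pro,region=eu   # extra context the caller wants its children to inherit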

We had traces that only showed our CDN level. A CDN (or Content Delivery Network) is a common tool used by websites to protect the application from unnecessary load and to decrease response times in locations around the world, by caching pages and content so they can be served from a server closer to the person requesting them without ever needing to touch our internal servers. In the case of the trace below, though, the cache status was “PASS” in Fastly, which means it did need to be responded to by our internal services, yet we don’t see anything listed below the initial CDN span.

A request that took almost 3 seconds, passed our cache, and we have no other visibility of what services were called.

We had traces that started without any indication of the users request via our CDN.

Hmmm the only way to call Bascule is via Fastly…why don’t we see a Fastly span there?

We had traces that knew their bucket, but didn’t have parents so didn’t know how they fit together.

When a span has a shared “trace id” but a “parent id” that does not correspond to any other span, we see “(missing)” rendered. That means site didn’t actually call render; there is some other service that lives between those two that told render about itself but didn’t tell Honeycomb.

These were all frustrating situations because we could see how close things were, but they were just not lining up. I mean…it is an age-old tale of integrations, and we definitely (re)learned how much more valuable it is to take 4 developers and work all together for 1 day than to take 4 developers and have each work for 1 day separately. This is important to call out because part of our pitch to leadership for bringing in a new tool included an estimate of time taken from developers. We had an agreement on general hours spent but were free to do with them as we pleased. While we would have preferred a hackathon right from the beginning, this wasn’t feasible early on due to product backlog constraints. But when we got to the end of the nitty gritty, that was the absolute best way to get across the line.

Build the border first

I am a big puzzler. I love em. And any good puzzler knows that by building the border first you can completely transform your expectations (unless you are doing a liberty puzzle…then all rules out the window!). All of a sudden the table space needed is way (usually) more than you actually need and you start to question how the heck all those pieces are going to fit in that frame?!

We decided to trial the same theory with tracing. If we had our border span, from the CDN request to its response, then we would know that any unaccounted-for time was a missing span, since every single request to our website is first serviced by Fastly, where it can be evaluated for special localisation treatment, possible cached response opportunities, or, of course, sent through to our backend services.

This tracking from a single entry point for all requests to our site has worked wonders and we now have two major scenarios where we start asking questions about coverage.

The first is visible in the trace below. In this trace we can see our CDN, and it says our request took 4.99s. Looking at the specific start timestamps, I can see that it took 29ms to get from Fastly to our internal service “site”. Then it took 8ms to get to the second internal service “render” and then 2ms to get a request to the “template” service. All of these times seem reasonable as direct parent-child relationships. But then we spend 4.847 seconds living in the unknown. Does render have a call to another service? Is it calling a DB? Is it just doing a really time-intensive activity? We don’t know, but we sure as heck want to look into it now!

While all of the spans start in very quick succession, what the heck happens in the render service after calling out to template?

The other big place we are asking questions is when we see time between spans. In the following trace we see what looks like the same type of request. It moves from our CDN to our internal service called site and then off to render and template. But there is a very big difference here. Instead of the request moving from Fastly to site in 29ms, it took 1.01s. This is a huge difference! When it was 29ms we could chalk it up to network speed, and that may still be the case in this second trace. The backend also accounted for much less of the total time: instead of the site span taking up 4.941 of the 4.990 seconds (99%), it was only 1.719 of the 3.44s (~50%). Given we have a couple of routing layers in there, we all of a sudden need more visibility into their behaviour.

A request which has a lot of time missing between our CDN and our first internal service (site).

Getting the border of our CDN request in has helped us spot this gap. Had we not had Fastly tracing, we would never know that the second trace was any different from the first, as they both have very similar internal service durations (4.99s vs 3.44s) and hit all the same services.

Wrap up

These are not meant to be “the right way”, just a reflection back on what has made the biggest impact in our first couple of months using a new tool for understanding our systems. Can’t wait to see where we go next!

Tracing our pipelines

Tracing has been around for ages. That being said, it seems to be most widely used in a very singular fashion. In my learning about tracing I have been really excited about the difference in power behind the new era of tools created by the industry and communities, like Jaeger and Honeycomb, in comparison to what I had previously had experience with.

If you look up OpenTracing (soon to be merged with OpenCensus), there are a lot of great ideas, but the ability to trace really boils down to every span needing an ID, a connection to its parent, and a duration. Once you have this information you can generate a waterfall graph that displays the relationship and duration of all calls required to complete an action.
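Put another way, every span reported to a tracing backend is conceptually just a small record like this hedged sketch (exact field names vary by tool):

{
  "trace_id": "7a085853722dc6d2",
  "span_id": "67fa5d6fa77b5c54",
  "parent_id": "0af7651916cd43dd",
  "name": "GET /checkout",
  "timestamp": "2020-08-01T10:15:02.000Z",
  "duration_ms": 182
}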

A generic waterfall graph showing a web framework relying on calls to 2 different RPC applications and some external APIs to complete a request. (https://opentracing.io/docs/best-practices/instrumenting-your-application/)

 

Studying tracing graphs can provide amazing insights, but once exposed to one with more detailed and specific information in it, the flood gate of ideas opens. The first time I actually created this was via a really great online tutorial and I snapped this screenshot:

Trace from Jaeger showing off key timing and count information as well as an error icon on the request that errors as well as integrated logs.

This not only shows me that it took 51 separate requests and 707.27ms to satisfy the request successfully. It also highlights that while the request was successful, there was an error in redis, which I can dig into and see that the log tells me it was a timeout. This. Is. Powerful.

And once you start thinking in tracing, it becomes quite obvious how many use cases there are for it. Does the problem have a few moving parts, and does each of them have a start and a finish? You may be able to trace it! Moving away from the technical web request version, what about a sales process? Or feature delivery? Or code pipelines? This. Is. Powerful.

And with power often comes greed, right? Well, I got greedy. I didn’t just want to see set fields, I wanted to add specific and personally valuable fields which are relevant to my team and my application. I also didn’t want to just use it in my Java or Python application, I wanted to just give it the necessary info as an API request, because then I am unleashed from being a software developer and can instead apply this to business processes too. The first time I managed this was with a quickstart and I was hooked.

 

Looking for a first use case

To me the idea of tracing pipelines was my aha! moment when I first read about it, but I didn’t have a good reason to stop what I was working on to play with the fun toys right that second. This was particularly true because I had no idea how I would get started. Then, excitingly, we started using Honeycomb.io at work, and the fantastic people there once again didn’t just show how they use their own tools, but enabled users to gain similar insights through a small buildevents library that they released. Of course I jumped in and did some POC work, but we couldn’t quite justify time in our backlog to properly build it out and integrate it, given it wasn’t really made for GitLab and our Alpine-based build images required some trickery to install it.

When a cost/value ratio isn’t quite right, you can either look to increase the value or decrease the cost to make the investment worth it. In our case the month of July did both!

In the realm of decreasing the cost, there were some releases that came out to enhance the support for GitLab and Alpine base images, as well as a growing comfort with the Honeycomb tool as we settled in as users.

To increase the value, we as an organisation had just finished merging a few highly coupled codebases, which left most of the team a bit confused as they found their feet in the new repository and pipeline world. Even more than just getting familiar, we had a need to understand how to optimise this now larger pipeline to still give fast and valuable feedback. This felt like the perfect time to try and get some visuals around what is happening.

 

Starting small

While we had images of unicorns dancing on rainbows where all of our pipelines were traced, we realised a need to quickly test our theory and build a plan. Continuing to keep implementation costs low across the organisation would be key as well, which meant both the buildevents library and the necessary credentials needed to come for free for most if not all teams and pipelines. This was the easy part, as GitLab could provide a shared environment variable for credentials and there are some highly leveraged build images to start with.

RUN curl -L -o /usr/local/bin/buildevents \
    https://artifactory.prod.moocrews.com/artifactory/github-remote/honeycombio/buildevents/releases/latest/download/buildevents-linux-amd64 && \
    chmod 755 /usr/local/bin/buildevents

As a note: the one-line change to our docker images has actually made it so easy that people don’t mind extending this out to their speciality images.

Our pipelines are all stored as code in .gitlab-ci.yml files. These files are maintained by teams for each of their repositories, so we did need to make independent changes where we wanted to implement this. But again, leveraging the before_script and after_script options from the templating language, we could make a fairly mundane change.

before_script:
  - date +%s > /tmp/job_start_time

after_script:
  - export BUILD_ID=$CI_PIPELINE_ID
  - export STEP_ID=$CI_JOB_ID-$CI_COMMIT_SHORT_SHA
  - export START_TIME=$(cat /tmp/job_start_time)
  - export NAME=$CI_JOB_NAME
  - echo "Sending honeycomb trace for ${NAME} (id= ${STEP_ID}) within ${BUILD_ID} pipeline starting at ${START_TIME}"
  - buildevents step $BUILD_ID $STEP_ID $START_TIME $NAME

 

This was very exciting given the low level of investment. That being said, I bet you can spot some concerns. For one, you may notice that the code added in the before_script actually creates and saves a file. The way GitLab works, each script definition (e.g. in a job or a before/after script) is run in an independent shell process. Therefore, environment variables aren’t good enough to pass data between them.

As a quick example, you can recreate this behaviour:

Open two different terminal windows. Type export testing="this is just in this shell" in one of them, and check that it worked with echo $testing. Then try echo $testing in a different window. You won’t get the same result!
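If you want to try the same demo as plain commands (the second shell prints an empty line because the variable never left the first shell):

# Terminal 1
$ export testing="this is just in this shell"
$ echo $testing
this is just in this shell

# Terminal 2 (a separate shell process)
$ echo $testing

$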

 

But even with this limitation, with just these two lines of changes, the following trace was created in one of our smaller code bases!

Trace in honeycomb for a basic build, test, release pipeline.

 

Next hurdle up though: I am sure you see the top span says “missing” and the trace time is set to 1970-01-01 01:00:00 (because the top span is missing). This is another fun limitation due to the isolation GitLab provides for its runs. Each job (and its corresponding before and after components) is run in an independent docker image, so the only way for them to share information is via artifacts. So if we want to be able to send a “build” command via the buildevents library to generate that top span, we would need to somehow know when the first start time file was created and save that as an artifact to be consumed after we are done with all steps of the pipeline. As we thought more about it, this is tricky not only because artifacts require decisions about how long to keep them around and how to clean them up. It is also tricky because we may need more personalised changes per pipeline to know the possible first and last steps. And finally, we would have to decide to finalise a pipeline even though people could come back through and either rerun or trigger a manual optional step after the pipeline was “finished”, which would make the trace show a span outside the scope of the higher-level build trace.

Given we still get the necessary information per step, and the shadow build span is there providing a sense of timing, we decided to move forward and get the business value of tracing our larger pipeline before tackling these changes.

 

Tackling a MUCH bigger pipeline

The next pipeline included multiple services in different languages, plus database and configuration changes, and therefore had a much more interesting set of stages and steps to run, so we put our “universal” and “easily adopted” solution to the test.

Our pipeline file structure looks a bit like this:

.gitlab-ci.yml
├── app1/
│   ├── .gitlab-ci-app1-build.yml
│   ├── .gitlab-ci-app1-test.yml
│   ├── .gitlab-ci-app1-release.yml
├── app2/
│   ├── .gitlab-ci-app2-build.yml
│   ├── .gitlab-ci-app2-test.yml
│   ├── .gitlab-ci-app2-release.yml
├── shareddb/
│   ├── .gitlab-ci-shareddb-build.yml
│   ├── .gitlab-ci-shareddb-test.yml
│   ├── .gitlab-ci-shareddb-release.yml
├── configuration/
│   ├── .gitlab-ci-config-build.yml
│   ├── .gitlab-ci-config-test.yml
│   ├── .gitlab-ci-config-release.yml

 

There is an array of base images in use, including the one we had already edited, and a number of the jobs use before_script or after_script, which we knew could cause issues. Instead of dealing with analysing all of this, we decided to just kick the tires a bit…we put the code changes into the base .gitlab-ci.yml file and ran the pipeline! It turns out that this easy solution provided data for over half of the steps as is. We then took the ones that failed and analysed what happened. Adding the buildevents library to three more docker images knocked off a large group of steps, and then adding the creation of the job_start_time file into some step-specific before_scripts got the rest.

Overall this process took us about a day of effort from start to finish and has the potential to really transform how people see our build pipeline as well as how we prioritise work associated with improving it.

 

Next steps

As ones to never rest on our laurels, we have a couple of hopes moving forward. For one, it would be great to dig into what a “build span” would look like and how we could get that information into our trace. Additionally, you can spot a bit of a gap between the steps in our example trace. We believe this is made up of a couple of things including waiting for a runner to be available for the next job to start as well as running the base steps like downloading necessary artifacts etc. We would love to also make clear what this time is made up of as it could be an opportunity for improvement too.

In addition, one GitLab limitation (and I would love to be proved wrong on this!) is that the concept of stages within a pipeline is just a shadow concept. Stages allow us to organise what steps can be run in parallel together, while forcing the stages themselves to run in serial. So you can imagine a typical pipeline may have the stages of build, test, deploy, but there may be multiple packages being built or tested in parallel. This is great for organisation, but when it comes to analysing run times the UI is basically useless for trending, and the API has no way to query for stage details. Given a stage can have many jobs running in < 1 minute and a single job that runs for > 10 minutes, it is extremely important to understand how a stage operates to correctly prioritise speeding up our bottlenecks. While this is visible in the graphs, we would love to make it more obvious.

Using cURL and jq to work with Trello data

As mentioned in this blog post, we are using Trello to handle our data at Speak Easy instead of the Google Sheets we used before. There are a lot of pros to this, but one con is that the data is just not as accessible as when you could sort lists etc. in Google Sheets. I decided to make that better by creating a few scripts in Python, leveraging just a basic cURL command and the jq library.

Todo list:

  1. Access the Trello API from the command line
  2. Identify first use case for API call from command line
  3. Create script for first use case

1. Access the Trello API from the command line

First things first, I want to make sure I can access the API with a cURL command with the right auth etc. To do this I leveraged my old Postman scripts and stole the cURL command from there. If you haven’t seen it before, Postman has a bunch of easy-to-export formats if you click the Code link just under the Save button.

How to open the code snippet window in Postman and select the cURL export type

Once I copied this to my clipboard I was able to run it in my command line and get the same results as in Postman. Realistically this isn’t super helpful on its own, because if anything it is harder to read. The power really comes when I start using the jq library. So now I need to install jq, which I am able to do easily through brew install jq. This means I can now “pipe” the response into jq and see the JSON formatted, as well as start manipulating it. So first things first, below you can see the difference in the cURL response before jq and after being piped to jq.

curl -X GET 'https://api.trello.com/1/board/######/lists?key=######&token=######' | jq '.'

2. Identify first use case for API call from command line

So why do I want to do this anyways? The first thing I thought about is how do we create a mailing list from our mentors? Previously it was a column in our google sheets doc. Now it is spread out across cards. I want to collect all the email addresses in each of the cards into a single list.

So let’s get to work!

I needed to actually improve the API call more than the jq filtering. In this case I was able to ask the Trello API for only the fields I cared about (name and desc which includes email).

curl -X GET 'https://api.trello.com/1/board/######/cards?lists=open&checklists=all&fields=name,desc&key=######&token=######' | jq '.'

 

So here is where I bet you have already spotted my mistake, and I have to fully admit that I am writing this live as I go and have just realised a show stopper! The whole point of this exercise was to access the email addresses more easily. The kicker, though, is that while we have email addresses in the cards, they are in the body of the description, and therefore I would need to not only do all this funky fun with jq, but also use some funky regex to parse them out. And that, my friends, is where I will be drawing the line.

So instead of working harder, I decided to work smarter and figure out how to add more useful data fields to the cards. Turns out there is a power up for that called “Custom Fields” which has allowed me to make an email text field for all cards. This comes with some docs for the API as well.

Picture of my Mentor Trello card with the custom email field.

So now, when I ask the Trello API for fields, I can ask for the ‘customFieldItems’ instead of the ‘desc’ field.

curl -X GET 'https://api.trello.com/1/board/######/cards?lists=open&fields=name&customFieldItems=true&key=######&token=######' | jq '.[] | select(.name == "MENTOR: Abby Bangser (EU)").customFieldItems[].value[]'
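Building on that, here is a hedged sketch of the end goal: collecting every mentor’s email custom field into one de-duplicated list. The .value.text path assumes a text-type custom field, and the ###### placeholders stand in for the real board id, key, and token:

curl -s -X GET 'https://api.trello.com/1/board/######/cards?lists=open&fields=name&customFieldItems=true&key=######&token=######' \
  | jq -r '.[].customFieldItems[]?.value.text // empty' \
  | sort -u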

Wrap up

So given I have just created myself a new todo list for this activity:

  1. Access the Trello API from the command line
  2. Identify first use case for API call from command line
  3. Introduce email fields for all Mentors
  4. Create script for first use case

I am going to call it quits for the day. The next blog will need to dive into moving this proof of concept into a Python script 🙂

In the mean time…is jq interesting? Want more details in a blog or is it just a utility I should show usages of?

 

First things first…AWS training wheels

So I am getting started on a first website implementation tomorrow. I have worked on enough early cloud projects at this point to know there are some basic housekeeping things I need to get in order before using my AWS account for, well, anything.

The two areas that I have found most important are basic user auth and account monitoring (which includes billing awareness), so I am going to focus on those two tonight. One thing to note: I am not going to go through the step-by-step of how to do these, as they are very well documented. I am happy to share specific links if you would like a place to start, but given the expectations set in this post, a Google search will provide tons of starting points.

Basic IAM

In AWS, IAM is the service which provides both authentication and authorisation (yes, I cheated and used the ambiguous term before). Given IAM is a service for authentication, there are only a couple of things which you need to tune, and most of them are provided as a checklist right on the landing page for the service.

The basic expectations of an AWS account’s IAM setup including no root access keys, MFA, and individual logins, and a password policy.

 

Access keys

The one you may be least familiar with if you are not an AWS user is the “Delete your root access keys” request. Access keys are the digital signature used when making command line requests to AWS. Your root user is an unbounded power user who cannot be limited from deleting/changing/creating things at will. If this gets compromised it could be a problem, but also most people like to protect themselves from themselves, and this is a perfect example of when to do that.

So you may ask, if you can’t use the root user account that you have created, what can you use? That is when you create a user with only the permissions you need to get the job done. Don’t worry, if your “job” expands over time you can always log back into root and expand your access, but it will require that extra bit of thought which can be good. In my case I have given pretty much complete admin access to S3, CloudFront, CodeBuild, and Route53.

MFA

The second one on the list is about MFA (Multi-Factor Authentication) being set up on your root user. This is great, but I would highly suggest that you get this set up on ALL users. Unfortunately this is not set by default and cannot be. However, one thing you can do before switching from the root account is to create a policy which requires MFA before a user can do anything of use and apply this to all users. I used this tutorial to create the policy (https://docs.aws.amazon.com/IAM/latest/UserGuide/tutorial_users-self-manage-mfa-and-creds.html).
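For reference, the heart of that policy is a blanket deny that only relaxes once MFA is present. Below is an abridged sketch; the full tutorial version allows a few more self-service IAM actions so users can manage their own credentials:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyAllExceptListedIfNoMFA",
      "Effect": "Deny",
      "NotAction": [
        "iam:CreateVirtualMFADevice",
        "iam:EnableMFADevice",
        "iam:ListMFADevices",
        "iam:ListVirtualMFADevices",
        "iam:ResyncMFADevice",
        "sts:GetSessionToken"
      ],
      "Resource": "*",
      "Condition": {
        "BoolIfExists": {"aws:MultiFactorAuthPresent": "false"}
      }
    }
  ]
}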

Account monitoring

Based on the AWS docs (https://docs.aws.amazon.com/IAM/latest/UserGuide/best-practices.html#keep-a-log) I wanted to confirm I had some basic level of account awareness. The ones applicable to a completely new account I figured would be CloudTrail, CloudWatch, and AWS Config.

 

CloudTrail

This is a very high-level access log which looks at create, modify, and delete actions in the account. There are of course concerns around reads, but that is just not what CloudTrail is used for. The good part is that CloudTrail is set up by default for the past 90 days for free. Between the limited data retention and the difficulty of actually using the massive amount of data CloudTrail generates, this is realistically something you would need to enhance given a larger footprint in AWS. But for now I am happy with the defaults.

CloudWatch

CloudWatch is a more fine-grained method of tracking services and, unlike CloudTrail, requires configuration. The only thing I am initially concerned with is being a bit of a dummy on costs, so this is where I am going to set up my billing alert. Thankfully this is another setup activity that is so well accepted it has clear documentation, which I followed (https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/monitor-charges.html).
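For anyone who prefers the alarm as code, this is roughly what the same thing looks like via the SDK. Billing metrics only live in us-east-1 and require the “Receive Billing Alerts” preference to be switched on first; the SNS topic ARN and the $10 threshold are examples, not recommendations.

// Sketch only: alarm when the month's estimated charges pass a threshold.
import {
  CloudWatchClient,
  PutMetricAlarmCommand,
} from "@aws-sdk/client-cloudwatch";

const cloudwatch = new CloudWatchClient({ region: "us-east-1" });

cloudwatch
  .send(
    new PutMetricAlarmCommand({
      AlarmName: "monthly-spend-over-10-usd",
      Namespace: "AWS/Billing",
      MetricName: "EstimatedCharges",
      Dimensions: [{ Name: "Currency", Value: "USD" }],
      Statistic: "Maximum",
      Period: 21600, // six hours, roughly how often billing data is published
      EvaluationPeriods: 1,
      Threshold: 10,
      ComparisonOperator: "GreaterThanThreshold",
      AlarmActions: ["arn:aws:sns:us-east-1:123456789012:billing-alerts"], // hypothetical topic
    })
  )
  .catch(console.error);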

AWS Config

Kinda wish I had looked this tool up before, if I am honest. It appears to be an easy framework for creating the “detective checks” I discussed at a previous engagement. What these detective checks are meant to do is evaluate the ongoing compliance of your account, given that changes can be made by many different people and processes over time. At this point I have just clicked through and used an existing managed rule, “iam-user-no-policies-check”, which actually did catch some issues with an old user on my account. I hope to create a new rule around the MFA requirement, but that involves a bit more work with Lambdas etc., so that will have to wait for another day and another blog post.
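As a sketch of what that click-through does, enabling the same managed rule via the SDK looks roughly like this. It assumes the Config recorder and delivery channel are already set up, which the console flow handles for you.

// Sketch only: enable the AWS managed rule that flags IAM users with
// policies attached directly instead of via groups or roles.
import {
  ConfigServiceClient,
  PutConfigRuleCommand,
} from "@aws-sdk/client-config-service";

const config = new ConfigServiceClient({});

config
  .send(
    new PutConfigRuleCommand({
      ConfigRule: {
        ConfigRuleName: "iam-user-no-policies-check",
        Description: "Checks that IAM users do not have policies attached directly.",
        Source: {
          Owner: "AWS",
          SourceIdentifier: "IAM_USER_NO_POLICIES_CHECK",
        },
      },
    })
  )
  .catch(console.error);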

Wrap up

So where are we now? Honestly not that far. But the bare minimum setup is complete and I should now be able to use this account with basic assurances around access and billing issues.

And what’s next? I am going to work with Gojko Adzic on getting a basic website up under a new domain name. The main technologies will be Jekyll, CodeDeploy, S3, and CloudFront.

 

Kicking off #SEBehindTheScenes

Hopefully this marks the start of a new onslaught of blogging, so I figure I should set out why as well as what will come.

For over a year now I have been working with SpeakEasy as an admin based on this simple call for help. When I first signed up, I had big dreams of the impact I could have. We talked about improving the service by removing some of the admin overhead. We talked about generating more data on how we were doing on matching, and more importantly how our applicants were getting on as speakers! We talked about how to support our mentors better so as to make that a great experience as well. Unfortunately I underestimated 1) my life circumstances and 2) the amount of effort Fiona and Anne-Marie had been putting in to “keep the lights on”.

So what have I done? I have moved us from a Google Docs admin experience to a Trello one, and I believe we have a couple of good outcomes (and areas for growth) from this move.

We now have visibility of the journey from original sign-up to match. It will need some work, but it has already quantified a fear we had around the drop-off rate between signing up on the site and getting a mentor match. We now know that over the last year 40% of applicants never got to the point of confirming their interest via email. Hmmm, that sounds like a barrier we may want to lower.

We also now have a process which allows us to track who has been contacted, for what reason, and at what time. It still involves a lot of manual work around moving columns, adding comments, etc. It also still has limitations: the comments are handwritten, which may limit our ability to search across applicant and mentor experiences for better data in the future. But it is a start, and given that I am handing over “head matching admin” to Kristine Corbus after a year of driving the process myself, I have the utmost faith this process will be put through its paces and improved!

So what am I going to be doing if not matching? A couple of things. First of all, Anne-Marie and I have been through the cycle a few times of getting excited about all the changes we could drive for SE, losing momentum because of the other things going on and the size of the task ahead, and then restarting. We have refocused and are planning on identifying the microtasks that can make a difference and hopefully start a push.

I will also hopefully be putting some of my goals around pipelines, observable systems, and testing in production to the test by putting up a new website for SE where we can start to incorporate some of the changes we want to see. That is really the key to this hashtag: it is a promise I have made to Anne-Marie to broadcast what I learn as I take on the task of re-platforming and then operating the SpeakEasy website.

Step 1? A call for help on Twitter:

 

Step 2? A pairing session with the one and only Gojko Adzic next week, which I plan to blog about as well 🙂

Fun with Google Form scripting

Sometimes doing the not-most-important thing on your todo list is so much fun. That was the case today.

I have started to work with a lot of really amazing people to support SpeakEasy as a place for new voices to be heard at testing conferences. Turns out it is a lot of work! It is so impressive to me that Anne-Marie and Fiona dreamed this up, actually got it running, and have made such a sustained impact on the community over the last few years. I personally am a graduate of the program, have continued on as a mentor, and am now a volunteer. As a group of new volunteers, we were given an open invitation from Anne-Marie and Fiona to make SpeakEasy our own. They know that the need to keep it running took its toll on some of the processes, and that with some extra volunteer power we may be able to address some of those underlying processes.

One process I was particularly interested in was how we were alerted about new mentees and how we would then take them through the journey from new sign-up to being mentored.

Oh the glory of Google Sheets! We added three columns to the sheet that gets created from the Google Form on the website. One was used to sign up for the job of getting that person matched, one was for tracking your progress on that journey, and one was for confirming the final status of that new sign-up. As you can imagine this gets messy, and it also limits the value we can get from data like how long it takes for someone to get matched.

Excel table row with data about a mentee and matching them with a mentor.
An example row from the “to be mentored” sheet.

 

In comes the power of a small script. I have known about the ability to script Google apps for a while now but had never played with it myself. This felt like the perfect opportunity. What I really wanted was for a new mentee to show up in a management tool for tracking and visibility, in our case Trello. But let’s start with how to get a Google Form to take action on submit.

I found this website that gave a great framework for my efforts. A couple of interesting notes to get it working though. First of all, do make sure you update all of the fields they suggest so that they match your form. But even if you do that, if you choose “Run > onFormSubmit” from the toolbar you will get an error like this:

TypeError: Cannot read property "response" from undefined. (line 24, file "Code")
This is the error when trying to run onFormSubmit from inside the Google scripting tool.

 

Basically it is telling you that because no form has been submitted, you cannot run this command. Shucks. I really wanted to test this before going live. Good thing I attached it to a play form, so I just went and did that! Submitted the form and waited for my glorious Trello card to show up.

Wahwahhhhh, turns out there are some typos in the script. It took me a while to sort them, but setting the “notification” option on the trigger to email me immediately helped a lot with debugging. In particular you will need to look at SendEmail() and update the “title” and “message” variables to “trelloTitle” and “trelloDescription” for the form to work.
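For reference, once those fixes are in, the handler ends up with roughly this shape. This is a sketch rather than the tutorial’s exact code; the Trello list id, API key, and token are placeholders you would fill in from your own account.

// Sketch only: a form-bound "on form submit" trigger that creates a Trello card.
function onFormSubmit(e: GoogleAppsScript.Events.FormsOnFormSubmit) {
  const answers = e.response.getItemResponses();

  // First question becomes the card title, everything else the description.
  const trelloTitle = answers[0].getResponse();
  const trelloDescription = answers
    .map((a) => `${a.getItem().getTitle()}: ${a.getResponse()}`)
    .join("\n");

  // Placeholders: LIST_ID, API_KEY, and API_TOKEN come from your Trello account.
  const url =
    "https://api.trello.com/1/cards" +
    "?idList=LIST_ID&key=API_KEY&token=API_TOKEN" +
    `&name=${encodeURIComponent(String(trelloTitle))}` +
    `&desc=${encodeURIComponent(trelloDescription)}`;

  UrlFetchApp.fetch(url, { method: "post" });
}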

Voila! Now on submit I have a new line in my response sheet as well as a new card in Trello.

A trello card with the details from the entry form on the SpeakEasy website
A new trello card for a mentee to match

But honestly, the more exciting thing is the power of collaboration that is unlocked now. We have timestamps for the activities we perform, we can see at a glance how much any one person has in progress, and we can track where bottlenecks may occur during the process.

A filled-in Trello card with dates and times of activities from new application through to introduction to a mentor.
A possible example of the process of getting Jane Doe matched in Trello.

 

Obviously we as a team need to sort out our own process with this tool, but the opportunities for collaboration are so much greater. Looking forward to seeing how it progresses!