Tracing our pipelines

Tracing has been around for ages, yet it seems to be used most widely in a fairly narrow way. As I have learned more about tracing, I have been genuinely excited by the difference in power between the newer tools built by the industry and community, like Jaeger and Honeycomb, and what I had previously worked with.

If you look up OpenTracing (soon to be merged with OpenCensus), there are a lot of great ideas, but the ability to trace really boils down to every span needing an ID, a connection to its parent, and a duration. Once you have that information you can generate a waterfall graph that displays the relationship and duration of all the calls required to complete an action.

A generic waterfall graph showing a web framework relying on calls to two different RPC applications and some external APIs to complete a request. (https://opentracing.io/docs/best-practices/instrumenting-your-application/)

 

Studying tracing graphs can provide amazing insights, but once you are exposed to one with more detailed and specific information in it, the floodgates of ideas open. The first time I actually created one of these myself was by following a really great online tutorial, and I snapped this screenshot:

Trace from Jaeger showing key timing and count information, an error icon on the request that failed, and integrated logs.

This not only shows me that it took 51 separate requests and 707.27ms to satisfy the request successfully. It also highlights that, while the request succeeded, there was an error in Redis, and digging into the attached log shows it was a timeout. This. Is. Powerful.

And once you start thinking in traces, it becomes quite obvious how many use cases there are. Does the problem have a few moving parts, each with a start and a finish? You may be able to trace it! Moving away from the technical web request version, what about a sales process? Or feature delivery? Or code pipelines? This. Is. Powerful.

And with power often comes greed, right? Well, I got greedy. I didn't just want to see a fixed set of fields; I wanted to add specific, personally valuable fields relevant to my team and my application. I also didn't want to be limited to instrumenting a Java or Python application; I wanted to hand over the necessary information as an API request, because then I am unleashed from being a software developer and can apply this to business processes too. The first time I managed this was with a quickstart, and I was hooked.
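To give a feel for how low that bar is, here is a sketch of sending a single span straight to Honeycomb's events API with curl. The dataset name, span IDs and the team field are made up for illustration, and the trace.* field names are simply the ones Honeycomb's tracing view conventionally expects (they are configurable per dataset):

# Hypothetical example: one span for a "generate-invoice" step in a business process.
# HONEYCOMB_WRITE_KEY and the "business-pipelines" dataset are placeholders.
curl https://api.honeycomb.io/1/events/business-pipelines \
  -X POST \
  -H "X-Honeycomb-Team: ${HONEYCOMB_WRITE_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
        "name": "generate-invoice",
        "service_name": "sales-process",
        "trace.trace_id": "order-4711",
        "trace.span_id": "generate-invoice-1",
        "trace.parent_id": "order-4711-root",
        "duration_ms": 5300,
        "team": "billing"
      }'

That last team field is exactly the kind of "personally valuable" extra I was after: anything you attach becomes queryable right next to the timing data.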

 

Looking for a first use case

The idea of tracing pipelines was my aha! moment when I first read about tracing, but I didn't have a good reason to stop what I was working on to play with the fun toys right that second. This was particularly true because I had no idea how I would get started. Then, excitingly, we started using Honeycomb.io at work, and the fantastic people there once again didn't just show how they use their own tools, but enabled users to gain similar insights through a small buildevents library that they released. Of course I jumped in and did some POC work, but we couldn't quite justify the time in our backlog to properly build it out and integrate it, given it wasn't really made for Gitlab and our alpine-based build images required some trickery to install it.

When a cost/value ratio isn’t quite right, you can either look to increase the value or decrease the cost to make the investment worth it. In our case the month of July did both!

In the realm of decreasing the cost, there were releases that enhanced the support for Gitlab and alpine base images, as well as a growing comfort with the Honeycomb tool as we settled in as users.

To increase the value, we as an organisation had just finished merging a few highly coupled codebases, which left most of the team a bit confused as they found their feet in the new repository and pipeline world. Beyond just getting familiar, we needed to understand how to optimise this now larger pipeline so it still gives fast and valuable feedback. This felt like the perfect time to try and get some visuals around what is happening.

 

Starting small

While we had images in our heads of unicorns dancing on rainbows, with all of our pipelines traced, we realised we needed to quickly test our theory and build a plan. Continuing to keep the implementation cost low across the organisation would also be key, which meant both the buildevents binary and the necessary credentials needed to come for free for most, if not all, teams and pipelines. This was the easy part, as Gitlab can provide a shared environment variable for credentials and there are some highly leveraged build images to start with.

# Pull the prebuilt buildevents binary (via our Artifactory mirror of the GitHub releases) and make it executable
RUN curl -L -o /usr/local/bin/buildevents \
    https://artifactory.prod.moocrews.com/artifactory/github-remote/honeycombio/buildevents/releases/latest/download/buildevents-linux-amd64 && \
    chmod 755 /usr/local/bin/buildevents

As a note: this one-line change to our docker images has actually made it so easy that people don't mind extending it to their speciality images.

Our pipelines are all stored as code in .gitlab-ci.yml files. These files are maintained by teams for each of their repositories, so we did need to make independent changes wherever we wanted to implement this. But again, leveraging the before_script and after_script options from the templating language, we could make a fairly mundane change.

before_script:
  # Record when the job started; the after_script runs in a separate shell process,
  # so this has to be handed over via a file rather than an environment variable.
  - date +%s > /tmp/job_start_time

after_script:
  - export BUILD_ID=$CI_PIPELINE_ID
  - export STEP_ID=$CI_JOB_ID-$CI_COMMIT_SHORT_SHA
  - export START_TIME=$(cat /tmp/job_start_time)
  - export NAME=$CI_JOB_NAME
  - echo "Sending honeycomb trace for ${NAME} (id= ${STEP_ID}) within ${BUILD_ID} pipeline starting at ${START_TIME}"
  # Send one span per job to Honeycomb via buildevents
  - buildevents step $BUILD_ID $STEP_ID $START_TIME $NAME

 

This was very exciting given the low level of investment. That being said, I bet you can spot some concerns. For one, you may notice that the code added in the before_script actually creates and saves a file. The way Gitlab works, each script definition (e.g. in a job or a before/after script) is run in an independent shell process, so environment variables aren't good enough to pass data between them.

As a quick example, you can recreate this behaviour:

Open two different terminal windows. Type export testing="this is just in this shell" in one of them, and check that it worked with echo $testing. Then try echo $testing in the other window. You won't get the same result!
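If you want the copy-paste version of that experiment (plain bash, nothing Gitlab specific):

# Terminal window 1: the variable only exists in this shell process
export testing="this is just in this shell"
echo $testing      # prints: this is just in this shell

# Terminal window 2: a separate process, just like Gitlab's separate script sections
echo $testing      # prints an empty line - the variable never made it across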

 

But even with this limitation, with just these two small script additions, the following trace was created in one of our smaller code bases!

Trace in Honeycomb for a basic build, test, release pipeline.

 

Next hurdle: I am sure you can see that the top span says "missing" and the trace time is set to 1970-01-01 01:00:00 (because the top span is missing). This is another fun limitation of the isolation Gitlab provides for its runs. Each job (and its corresponding before and after components) runs in an independent docker container, so the only way for jobs to share information is via artifacts. If we wanted to send a "build" command via the buildevents library to generate that top span, we would need to know when the first start-time file was created and save it as an artifact to be consumed once all steps of the pipeline are done. As we thought more about it, this gets tricky for a few reasons. Artifacts require decisions about how long you keep them around and how you clean them up. We might also need more personalised changes per pipeline to know the possible first and last steps. And finally, we would have to decide when to finalise a pipeline even though people could come back later and rerun a job or trigger a manual optional step after the pipeline was "finished", which would make the trace show a span outside the scope of the higher-level build span.
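For the record, here is a rough sketch of the shape that approach could take. The job names, stage names and retention period are hypothetical, not something we run today:

# Hypothetical first job: capture the pipeline start time and pass it along as an artifact.
record-pipeline-start:
  stage: build                    # assumes "build" is the first stage of the pipeline
  script:
    - date +%s > pipeline_start_time
  artifacts:
    paths:
      - pipeline_start_time
    expire_in: 1 day              # artifacts force exactly the retention decisions mentioned above

# Hypothetical final job: close the top-level span once everything else has run.
close-build-span:
  stage: release                  # assumes "release" is the last stage
  when: always                    # run even if earlier jobs failed so the span still gets sent
  script:
    # "success" is hard-coded here; a real version would also need to work out the overall result,
    # and reruns or late manual jobs would still land outside this span.
    - buildevents build $CI_PIPELINE_ID $(cat pipeline_start_time) success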

Given we still get the necessary information per step, and the shadow build span is there providing a sense of timing, we decided to move forward and get the business value of tracing our larger pipeline before tackling these changes.

 

Tackling a MUCH bigger pipeline

The next pipeline included multiple services in different languages, plus database and configuration changes, and therefore had a much more interesting set of stages and steps to run, so we put our "universal" and "easily adopted" solution to the test.

Our pipeline file structure looks a bit like this:

.gitlab-ci.yml
├── app1/
│   ├── .gitlab-ci-app1-build.yml
│   ├── .gitlab-ci-app1-test.yml
│   ├── .gitlab-ci-app1-release.yml
├── app2/
│   ├── .gitlab-ci-app2-build.yml
│   ├── .gitlab-ci-app2-test.yml
│   ├── .gitlab-ci-app2-release.yml
├── shareddb/
│   ├── .gitlab-ci-shareddb-build.yml
│   ├── .gitlab-ci-shareddb-test.yml
│   ├── .gitlab-ci-shareddb-release.yml
├── configuration/
│   ├── .gitlab-ci-config-build.yml
│   ├── .gitlab-ci-config-test.yml
│   ├── .gitlab-ci-config-release.yml

 

An array of base images are used, including the one we had already edited, and a number of the jobs use before_script or after_script, which we knew could cause issues. Instead of analysing all of this up front, we decided to just kick the tires a bit… we put the code changes into the base .gitlab-ci.yml file and ran the pipeline! It turns out this easy solution provided data for over half of the steps as is. We then took the ones that failed and analysed what happened. Adding the buildevents binary to three more docker images knocked off a large group of steps, and adding the creation of the job_start_time file into some step-specific before_scripts got the rest.
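The before_script cases needed touching because a job-level before_script replaces the top-level one rather than adding to it. A sketch of the kind of change involved (the job name and setup script are made up):

app1-build:
  stage: build
  before_script:
    - ./ci/app1-setup.sh              # stand-in for the job's existing setup commands
    # The job-level before_script overrides the shared one,
    # so the timing line has to be re-added here or the after_script has nothing to read.
    - date +%s > /tmp/job_start_time
  script:
    - make -C app1 build              # placeholder build command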

Overall this process took us about a day of effort from start to finish and has the potential to really transform how people see our build pipeline as well as how we prioritise work associated with improving it.

 

Next steps

Never ones to rest on our laurels, we have a couple of hopes moving forward. For one, it would be great to dig into what a "build span" would look like and how we could get that information into our trace. Additionally, you can spot a bit of a gap between the steps in our example trace. We believe this is made up of a couple of things, including waiting for a runner to become available for the next job as well as running the base steps like downloading necessary artifacts. We would love to make clear what this time is made up of, as it could be an opportunity for improvement too.

In addition, one Gitlab limitation (and I would love to be proved wrong on this!) is that the concept of stages within a pipeline is just a shadow concept. Stages let us organise which steps can run in parallel together while forcing the stages themselves to run in serial. You can imagine a typical pipeline having the stages build, test, deploy, with multiple packages being built or tested in parallel within each. This is great for organisation, but when it comes to analysing run times, the UI is basically useless for trending and the API has no way to query for stage details. Given a stage can have many jobs that run in under a minute alongside a single job that runs for more than ten minutes, it is extremely important to understand how a stage operates to correctly prioritise speeding up our bottlenecks. While this is visible in the graphs, we would love to make it more obvious.
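One workaround we have sketched out (but not yet built) is to derive stage-level numbers ourselves from the jobs API, which does at least tag each job with its stage. The host, project ID and token variables below are placeholders:

# Rough sketch: summarise each stage of a pipeline by its slowest job,
# since jobs within a stage run in parallel and the stage can only finish when its slowest member does.
curl --silent --header "PRIVATE-TOKEN: ${GITLAB_API_TOKEN}" \
  "https://gitlab.example.com/api/v4/projects/${PROJECT_ID}/pipelines/${CI_PIPELINE_ID}/jobs?per_page=100" \
  | jq -r 'group_by(.stage)[]
           | "\(.[0].stage): \(length) jobs, slowest \(map(.duration // 0) | max)s"'

Even a rough report like this would make our bottleneck stages far more obvious than the pipeline UI does today.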
