Baby steps into tracing

At MOO we have had tracing for a long time. It was based on a solution created in house before the likes of Jaeger and OpenZipkin were widely available, and it arguably met the same requirements at the time. However, like most home grown tools, the people who built it moved on and the support, and even the use of it, dwindled to the point where some of the newer developers didn’t even realise tracing was supported at the company.

When the ask for tracing support surged recently, we realised our under-invested tool was just not the right fit anymore, so we went on a hunt for what we could move to. We identified roughly three categories of tools…the magic promise-to-do-everything-for-you-for-the-cost-of-your-first-born-child tool (™️), the open source tracing tool, and then the new dark horse observability tool. We evaluated at least one tool in each category and, when balancing development, operational and cash costs, the dark horse option of Honeycomb came out on top. While Honeycomb can support a lot more, it covered our tracing needs right now and we were excited about growing into more interesting use cases.

So now it has been a few months and just like anything else, we have not grown in a clear linear path. We have access…mostly everyone has signed in at least once. We have tracing…in at least what used to have tracing plus a few more services. We have…a couple…big wins with more interesting use cases.

So now we are looking at how to make sure everyone at the company comes on the journey into the new and exciting world. While the previous statements were setting the scene and pretty steeped in clear facts, the next few things are just my personal findings about our experience and what I think has helped us.

While we have had some interesting wins using features like BubbleUp, it seems that focusing on getting pretty stock standard tracing in place has been a good way to build experience with the UI, start conversations about how crews will communicate with each other through their data, and establish what our observability language and norms will be. Below are a couple of things that I have noticed in our experience that have helped with onboarding and upskilling.

One, two, many

When introducing a new use case for the data, get it working in a big impact service first. In our case, we added deployment markers for our monolith (recently made into a pretty bad ass one too!). This means that when a new deployment goes out to production we can see it on our graphs, so if something goes wrong we can check whether it correlates with a deployment.
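For anyone curious what that takes, a deployment marker is just one small API call tacked onto the end of a deploy job. Roughly something like the below (the endpoint and field names are from my memory of the Honeycomb markers docs, and the dataset name and environment variable are placeholders, so treat it as a sketch and check the docs before copying it):

# Hypothetical final step of a deploy job: ask Honeycomb to draw a marker
# on the graphs for this dataset.
curl -s https://api.honeycomb.io/1/markers/my-dataset \
  -X POST \
  -H "X-Honeycomb-Team: $HONEYCOMB_WRITE_KEY" \
  -d '{"message": "monolith deploy", "type": "deploy"}'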

This helped us clearly identify when one of our servers did not provision correctly after a deployment and we saw an introduction of 400 errors. In this case it was a provisioning issue (with starting up Tomcat) rather than a software issue, so it was limited to a single host and therefore not visible in the bottom graph since the count of errors was quite small in comparison to our traffic.

A look at how we served a higher number of 400 errors after a deploy (#92388), but the fix was actually hardware so there is no corresponding release to fix it.

Then one of our teams saw these and thought it was great! So they introduced markers for their application too.

And as we saw people picking up this awesome opportunity to gain insights, we also saw how the world would look if EVERY deployment was marked given the high number of services and deploys we do every day.

multiple_deploys
Overlapping deploy markers given excitement over the new feature to try.

Getting to good here is part learning best practices, like changing colours depending on the app and deciding which deploys need a marker, and part waiting for the feature request to show only certain markers rather than all or nothing.

But in either case, starting by introducing an example of a feature without having to have it sorted out at scale yet generated a level of conversation I just don’t think we would have had if it were around a planning whiteboard.

Work in bursts

Tracing can appear like a magical image of what is happening in your system. It is a bit like suddenly joining Ms Frizzle’s class and taking the magical school bus to see what actually happens when someone clicks a button on your website. But in reality it is just a bit of grunt work to make each and every service conform to some simple communication rules. In our case, we need every service to send a header with its request that says what the overarching trace id is, what its own id is, and whether there is any specific data that it wants its children to know about. This allows every request that receives that header to know what bucket it falls in (the trace id) and how it is organised within the bucket (the other id is its parent). Well, there is a reason the Tower of Babel story exists: communication can be hard, and we ran into that.
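To make that concrete, here is a sketch of the contract from the calling service’s side. The header names are invented for illustration (the real names depend on whichever tracing format your services agree on) and the URL is a made up internal service; the important bit is that every outbound call carries the shared trace id, the caller’s own span id, and any extra context it wants its children to see.

# The caller forwards the trace "bucket" id, identifies itself as the parent,
# and passes along any context it wants child spans to know about.
TRACE_ID="0af7651916cd43dd8448eb211c80319c"   # shared by every span in the request
MY_SPAN_ID="b7ad6b7169203331"                 # this service's own span id

curl -s https://render.internal.example.com/page/123 \
  -H "X-Trace-Id: ${TRACE_ID}" \
  -H "X-Parent-Span-Id: ${MY_SPAN_ID}" \
  -H "X-Trace-Context: customer-tier=gold"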

We had traces that only showed our CDN level. A CDN (or Content Delivery Network) is a common tool used by websites to protect the application from unnecessary load and to decrease response times around the world by caching pages and content, which can then be served from a server closer to the person requesting it without ever needing to touch our internal servers. In the case of the trace below though, Fastly says the cache status was “PASS”, which means the request did need to be answered by our internal services, yet we don’t see any of them listed below the initial CDN span.

only_fastly_trace
A request that took almost 3 seconds, passed our cache, and we have no other visibility of what services were called.

We had traces that started without any indication of the user’s request via our CDN.

missing_context_trace
Hmmm the only way to call Bascule is via Fastly…why don’t we see a Fastly span there?

We had traces that knew their bucket, but didn’t have parents so didn’t know how they fit together.

missing_spans
When a span has a shared “trace id” but a “parent id” that does not correspond to any other span, we see “(missing)” rendered. That means site didn’t actually call render directly; there is some other service living between those two that told render about itself, but didn’t tell Honeycomb.

These were all frustrating situations because we could see how close things were, but they were just not lining up. I mean…it is an age old tale of integrations, and we definitely (re)learned how much more valuable it is to take 4 developers and have them work together for 1 day than to take 4 developers and have each work alone for 1 day. This is important to call out because part of our pitch to leadership for bringing in a new tool included an estimate of the time taken from developers. We had an agreement on general hours spent but were free to do with them as we pleased. While we would have preferred a hackathon right from the beginning, this wasn’t feasible early on due to product backlog constraints. But when we got to the end of the nitty gritty, that was the absolute best way to get across the line.

Build the border first

I am a big puzzler. I love em. And any good puzzler knows that by building the border first you can completely transform your expectations (unless you are doing a liberty puzzle…then all rules go out the window!). All of a sudden the table space needed is (usually) way more than you expected and you start to question how the heck all those pieces are going to fit in that frame?!

We decided to trial the same theory with tracing. If we had our border span, which runs from the CDN request to its response, then we would know that any unaccounted for time was a missing span. This works because every single request to our website is first serviced by Fastly, where it can be evaluated for special localisation treatment, possible cached response opportunities, or of course, to be sent through to our backend services.

This tracking from a single entry point for all requests to our site has worked wonders and we now have two major scenarios where we start asking questions about coverage.

The first is visible in the trace below. In this trace we can see our CDN and it says our request took 4.99s. Looking at the specific start timestamps, I can see that it took 29ms to get from Fastly to our internal service “site”. Then it took 8ms to get to the second internal service “render” and then 2ms to get a request to the “template” service. All of these times seem reasonable for direct parent-child relationships. But then we spend 4.847 seconds living in the unknown. Does render have a call to another service? Is it calling a DB? Is it just doing a really time intensive activity? We don’t know, but we sure as heck want to look into it now!

missing_context_2
While all of the spans start in very quick succession, what the heck happens in the render service after calling out to template?

The other big place we are asking questions is when we see time between spans. In the following trace we see what looks like the same type of request. It moves from our CDN to our internal service called site and then off to render and template. But there is a very big difference here. Instead of the request moving from Fastly to site in 29ms, it took 1.01s. This is a huge difference! When it was 29ms we could chalk it up to network speed, and that may be the case in this second trace too. The backend picture also changed: instead of the site span taking up 4.941 of the 4.990 seconds (99%), it was only 1.719 of the 3.44s (~50%). Given we have a couple of routing layers in there, we all of a sudden need more visibility into their behaviour.

missing_services
A request which has a lot of time missing between our CDN and our first internal service (site).

Getting the borders of our CDN request in has helped us spot this gap. Had we not had Fastly tracing, we would never know that the second trace was any different from the first, as they have fairly similar overall durations (4.99s vs 3.44s) and hit all the same services.

Wrap up

These are not meant to be “the right way”, just a reflection back on what has made the biggest impact in our first couple of months using a new tool for understanding our systems. Can’t wait to see where we go next!

Tracing our pipelines

Tracing has been around for ages. That being said, it seems to be most widely used in a very singular fashion. As I have learned about tracing, I have been really excited about the difference in power between the new era of tools created by the industry and communities, like Jaeger and Honeycomb, and what I had previously had experience with.

If you look up OpenTracing (soon to be merged with OpenCensus), there are a lot of great ideas, but the ability to trace really boils down to every span needing an ID, a connection to its parent, and a duration. Once you have this information you can generate a waterfall graph that displays the relationship and duration of all calls required to complete an action.

A generic waterfall graph showing a web framework relying on calls to 2 different RPC applications and some external APIs to complete a request. (https://opentracing.io/docs/best-practices/instrumenting-your-application/)

 

Studying tracing graphs can provide amazing insights, but once exposed to one with more detailed and specific information in it, the floodgates of ideas open. The first time I actually created one of these was via a really great online tutorial, and I snapped this screenshot:

Trace from Jaeger showing off key timing and count information as well as an error icon on the request that errors as well as integrated logs.

This not only shows me that it took 51 separate requests and 707.27ms to satisfy the request successfully. It also highlights that while the request was successful, there was an error in redis, and I can even dig in and see the log telling me it was a timeout. This. Is. Powerful.

And once you start thinking in tracing, it becomes quite obvious how many use cases there are for it. Does the problem have a few moving parts, and does each of them have a start and a finish? You may be able to trace it! Moving away from the technical web request version, what about a sales process? Or feature delivery? Or code pipelines? This. Is. Powerful.

And with power often comes greed, right? Well, I got greedy. I didn’t just want to see a set list of fields, I wanted to add specific and personally valuable fields which are relevant to my team and my application. I also didn’t want to be limited to using it in my Java or Python application, I wanted to just give it the necessary info as an API request, because then I am unleashed from being a software developer and can apply this to business processes too. The first time I managed this was with a quickstart and I was hooked.
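As an example of how low that bar is, here is roughly what sending a single span as a bare API request can look like. The endpoint and field names are from my memory of the Honeycomb events docs (and the trace.* names are configurable per dataset), and the dataset, ids, and extra fields are all made up for illustration, so treat this as a sketch rather than gospel:

# One span = one event: an id, a parent, a duration, plus whatever fields you care about.
curl -s https://api.honeycomb.io/1/events/my-business-process \
  -X POST \
  -H "X-Honeycomb-Team: $HONEYCOMB_WRITE_KEY" \
  -d '{
        "name": "approve-contract",
        "trace.trace_id": "sale-4711",
        "trace.span_id": "approve-1",
        "trace.parent_id": "sale-root",
        "duration_ms": 360000,
        "team": "sales"
      }'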

 

Looking for a first use case

The idea of tracing pipelines was my aha! moment when I first read about it, but I didn’t have a good reason to stop what I was working on to play with the fun toys right that second. This was particularly true because I had no idea how I would get started. Then, excitingly, we started using Honeycomb.io at work, and the fantastic people there once again didn’t just show how they use their own tools, but enabled users to gain similar insights through a small buildevents library that they released. Of course I jumped in and did some POC work, but we couldn’t quite justify time in our backlog to properly build it out and integrate it, given it wasn’t really made for GitLab and our Alpine based build images required some trickery to install it.

When a cost/value ratio isn’t quite right, you can either look to increase the value or decrease the cost to make the investment worth it. In our case the month of July did both!

In the realm of decreasing the cost, there were some releases that came out to enhance the support for GitLab and Alpine base images, as well as just a growing comfort with the Honeycomb tool as we settled in as users.

To increase the value, we as an organisation had just finished merging a few highly coupled codebases, which had left most of the team a bit confused as they found their feet in the new repository and pipeline world. Even more than just getting familiarity, we have a need to understand how to optimise this now larger pipeline to still give fast and valuable feedback. This felt like the perfect time to try and get some visuals around what is happening.

 

Starting small

While I had images in my brain of unicorns dancing on rainbows where all of our pipelines were traced, we realised a need to quickly test our theory and build a plan. Continuing to keep implementation costs low across the organisation would be key as well, which meant both the buildevents library and the necessary credentials needed to come for free for most if not all teams and pipelines. This was the easy part, as GitLab can provide a shared environment variable for credentials and there are some highly leveraged build images to start with.

RUN curl -L -o /usr/local/bin/buildevents \
    https://artifactory.prod.moocrews.com/artifactory/github-remote/honeycombio/buildevents/releases/latest/download/buildevents-linux-amd64 && \
    chmod 755 /usr/local/bin/buildevents

As a note: The one line change to our docker images has actually made it so easy that people don’t mind extending this out to their speciality images.

Our pipelines are all stored as code in .gitlab-ci.yml files. These files are maintained by teams for each of their repositories, so we did need to make independent changes where we wanted to implement this. But again, leveraging the before_script and after_script options from the templating language, we could make a fairly mundane change.

before_script:
  - date +%s > /tmp/job_start_time

after_script:
  - export BUILD_ID=$CI_PIPELINE_ID
  - export STEP_ID=$CI_JOB_ID-$CI_COMMIT_SHORT_SHA
  - export START_TIME=$(cat /tmp/job_start_time)
  - export NAME=$CI_JOB_NAME
  - echo "Sending honeycomb trace for ${NAME} (id= ${STEP_ID}) within ${BUILD_ID} pipeline starting at ${START_TIME}"
  - buildevents step $BUILD_ID $STEP_ID $START_TIME $NAME

 

This was very exciting given the low level of investment. That being said, I bet you can spot some concerns. For one, you may notice that the code added in the before_script actually creates and saves a file. The way GitLab works is that each script definition (e.g. in a job or a before/after script) is run from an independent shell process. Therefore, environment variables aren’t good enough to pass data between them.

As a quick example, you can recreate this behaviour:

Open two different terminal windows. Type export testing=”this is just in this shell” in one of them, and test that it worked by checking echo $testing. Then try echo $testing in a different window. You won’t get the same result!
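In code form, the experiment from that screenshot looks like this:

# Terminal 1
export testing="this is just in this shell"
echo $testing      # prints: this is just in this shell

# Terminal 2
echo $testing      # prints an empty line - the variable never left the first shell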

 

But even with this limitation, with just these two small additions, the following trace was created in one of our smaller code bases!

Trace in honeycomb for a basic build, test, release pipeline.

 

Next hurdle up though: I am sure you can see the top span says “missing” and the trace time is set at 1970-01-01 01:00:00 (because the top span is missing). This is another fun limitation due to the isolation GitLab provides for its runs. Each job (and its corresponding before and after components) is run in an independent docker image, so the only way for them to share information is via artifacts. So if we want to be able to send a “build” command via the buildevents library to generate that top span, we would need to somehow know when the first start time file was created and save that as an artifact to be consumed after we are done with all steps of the pipeline. As we thought more about it, this is tricky not only because artifacts require that you make decisions about how long you keep them around and how you clean them up. It is also tricky because we may need more personalised changes per pipeline to know the possible first and last steps. And finally, we would have to decide to finalise a pipeline even though people could come back through and either rerun or trigger a manual optional step after the pipeline was “finished”, which would make the trace show a span outside the scope of the higher level build span.
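For reference, the shape of that deferred change would be something along these lines: one final job that reads the earliest start time back out of an artifact and closes off the whole-pipeline span. The buildevents build arguments are from my memory of the buildevents README, and the artifact plumbing is entirely hypothetical, so this is a sketch of the idea rather than something we run today.

# Hypothetical final job of the pipeline. Assumes an earlier job saved the
# pipeline's first start time into an artifact file called pipeline_start_time.
export BUILD_ID=$CI_PIPELINE_ID
export START_TIME=$(cat pipeline_start_time)
# "success" would need to reflect the real pipeline outcome (success/failure)
buildevents build $BUILD_ID $START_TIME success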

Given we still get the necessary information per step, and the shadow build span is there providing a sense of timing, we decided to move forward and get the business value of tracing our larger pipeline before tackling these changes.

 

Tackling a MUCH bigger pipeline

The next pipeline included multiple services in different languages, database and configuration changes and therefore had a much more interesting set of stages and steps to run so we put our “universal” and “easily adopted” solution to the test.

Our pipeline file structure looks a bit like this:

.gitlab-ci.yml
├── app1/
│   ├── .gitlab-ci-app1-build.yml
│   ├── .gitlab-ci-app1-test.yml
│   ├── .gitlab-ci-app1-release.yml
├── app2/
│   ├── .gitlab-ci-app2-build.yml
│   ├── .gitlab-ci-app2-test.yml
│   ├── .gitlab-ci-app2-release.yml
├── shareddb/
│   ├── .gitlab-ci-shareddb-build.yml
│   ├── .gitlab-ci-shareddb-test.yml
│   ├── .gitlab-ci-shareddb-release.yml
├── configuration/
│   ├── .gitlab-ci-config-build.yml
│   ├── .gitlab-ci-config-test.yml
│   ├── .gitlab-ci-config-release.yml

 

There is an array of base images in use, including the one we had already edited, and a number of the jobs use before_script or after_script, which we knew could cause issues. Instead of dealing with analysing all of this, we decided to just kick the tires a bit…we put the code changes into the base .gitlab-ci.yml file and ran the pipeline! Turns out that this easy solution provided data for over half of the steps as is. We then took the ones that failed and analysed what happened. Adding the buildevents library to three more docker images knocked off a large group of steps, and then adding the creation of the job_start_time file into some step specific before_scripts got the rest.

Overall this process took us about a day of effort from start to finish and has the potential to really transform how people see our build pipeline as well as how we prioritise work associated with improving it.

 

Next steps

As ones to never rest on our laurels, we have a couple of hopes moving forward. For one, it would be great to dig into what a “build span” would look like and how we could get that information into our trace. Additionally, you can spot a bit of a gap between the steps in our example trace. We believe this is made up of a couple of things including waiting for a runner to be available for the next job to start as well as running the base steps like downloading necessary artifacts etc. We would love to also make clear what this time is made up of as it could be an opportunity for improvement too.

In addition, one GitLab limitation (and I would love to be proved wrong on this!) is that the concept of stages within a pipeline is just a shadow concept. Stages allow us to organise which steps can run in parallel together, while forcing the stages themselves to run in serial. So you can imagine a typical pipeline may have the stages of build, test, deploy, but there may be multiple packages being built or tested in parallel. This is great for organisation, but when it comes to analysing run times, the UI is basically useless for trending and the API has no way to query for stage details. Given a stage can have many jobs running in < 1 minute or a single job that runs for > 10 minutes, it is extremely important to understand how a stage operates to correctly prioritise speeding up our bottlenecks. While this is visible in the graphs, we would love to make it more obvious.

Using cURL and jq to work with Trello data

As mentioned in this blog post, we are using Trello to handle our data at Speak Easy instead of the Google Sheets that we used to use. There are a lot of pros to this, but one con is that the data is just not as accessible as when you could sort lists etc in Google Sheets. I decided to make that better by creating a few scripts (eventually in Python), starting with just a basic cURL command and the jq library.

Todo list:

  1. Access the Trello API from the command line
  2. Identify first use case for API call from command line
  3. Create script for first use case

1. Access the Trello API from the command line

First things first, I want to make sure I can access the API with a cURL command with the right auth etc. To do this I leveraged my old Postman scripts and stole the cURL command from there. If you haven’t seen it before, Postman has a bunch of easy to export formats if you click the Code link just under the Save button.

How to open the code snippet window in Postman and select the cURL export type

Once I copied this to my clipboard I was able to run it in my command line and get the same results as in Postman. Realistically this isn’t super helpful on its own because, if anything, it is harder to read. The power is really going to come when I start using the jq library. So now I need to install jq, which I am able to do easily through brew install jq. This means that I can now “pipe” the response into jq and see the JSON formatted, as well as start manipulating it. So first things first, below you can see the difference between the cURL response before jq and after being piped to jq.

CURLWithJqParsing
curl -X GET 'https://api.trello.com/1/board/######/lists?key=######&token=######' | jq '.'

2. Identify first use case for API call from command line

So why do I want to do this anyways? The first thing I thought about is how do we create a mailing list from our mentors? Previously it was a column in our google sheets doc. Now it is spread out across cards. I want to collect all the email addresses in each of the cards into a single list.

So let’s get to work!

I needed to actually improve the API call more than the jq filtering. In this case I was able to ask the Trello API for only the fields I cared about (name and desc which includes email).

curl -X GET 'https://api.trello.com/1/board/######/cards?lists=open&checklists=all&fields=name,desc&key=######&token=######' | jq '.'

 

So here is where I bet you have already spotted my mistake, and I have to fully admit that I am writing this live as I go and have just realised a show stopper! The whole point of this exercise was to access the email addresses more easily. The kicker, though, is that while we have email addresses in the cards, they are in the body of the description, and therefore I would need to not only do all this funky fun with jq, but also use some funky regex to parse them out. And that, my friends, is where I will be drawing the line.

So instead of working harder, I decided to work smarter and figure out how to add more useful data fields to the cards. Turns out there is a power up for that called “Custom Fields” which has allowed me to make an email text field for all cards. This comes with some docs for the API as well.

TrelloCardWithCustomEmailField
Picture of my Mentor Trello card with the custom email field.

So now, when I ask the Trello API for fields, I can ask for the ‘customFieldItems’ instead of the ‘desc’ field.

curl -X GET 'https://api.trello.com/1/board/######/cards?lists=open&fields=name&customFieldItems=true&key=######&token=######' | jq '.[] | select(.name == "MENTOR: Abby Bangser (EU)").customFieldItems[].value[]'
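And since the end goal is a mailing list rather than one person’s email address, the same call can be filtered down to every card’s custom field value in one go. This is a rough sketch (untested against our real board, and it assumes the email address is the only text custom field on the cards):

# List every card's email custom field value, one per line.
# The ###### values are the same board id, key and token as above.
curl -s "https://api.trello.com/1/board/######/cards?fields=name&customFieldItems=true&key=######&token=######" \
  | jq -r '.[] | .customFieldItems[]? | .value.text? // empty'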

Wrap up

So given I have just created myself a new todo list for this activity:

  1. Access the Trello API from the command line
  2. Identify first use case for API call from command line
  3. Introduce email fields for all Mentors
  4. Create script for first use case

I am going to call it quits for the day. The next blog will need to dive into moving this proof of concept into a Python script 🙂

In the meantime…is jq interesting? Want more details in a blog, or is it just a utility I should show usages of?

 

First things first…AWS training wheels

So I am getting started on a first website implementation tomorrow. I have worked on enough early cloud projects at this point to know there are some basic housekeeping things I need to get in order before using my AWS account for, well, anything.

The two areas that I have found most important are basic user auth and account monitoring (which includes billing awareness), so I am going to focus on those two for tonight. One thing to note: I am not going to go through the step by step of how to do these as they are very well documented. I am happy to share specific links if you would like a place to start, but given the expectations set in this post, a Google search will provide tons of starting points.

Basic IAM

In AWS, IAM is the service which provides both authentication and authorisation (yes, I cheated and used the ambiguous term before). Given IAM is the service for authentication, there are only a couple of things which you need to tune, and most of them are provided as a checklist right on the landing page for the service.

The basic expectations of an AWS account’s IAM setup including no root access keys, MFA, and individual logins, and a password policy.

 

Access keys

The one you may be least familiar with, if you are not an AWS user, is the “Delete your root access keys” request. Access keys are the digital signature used when making command line requests to AWS. Your root user is an unbounded power user who cannot be limited from deleting/changing/creating things at will. If this gets compromised it could be a problem, but also most people like to protect themselves from themselves, and this is a perfect example of when to do that.

So you may ask, if you can’t use the root user account that you have created, what can you use? That is when you create a user with only the permissions you need to get the job done. Don’t worry, if your “job” expands over time you can always log back into root and expand your access, but it will require that extra bit of thought, which can be good. In my case I have given my new user pretty much complete admin access to S3, CloudFront, CodeBuild, and Route53.

MFA

The second one on the list is about MFA (Multi Factor Authentication) being set up on your root user. This is great, but I would highly suggest that you get this set up on ALL users. Unfortunately this is not set by default and cannot be. However, one thing you can do before switching away from the root account is to create a policy which requires MFA before a user can do anything of use, and apply this to all users. I used this tutorial to create the policy (https://docs.aws.amazon.com/IAM/latest/UserGuide/tutorial_users-self-manage-mfa-and-creds.html).
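For a flavour of what that policy looks like, here is a trimmed-down sketch of the idea: deny everything except the self-service MFA and password actions unless the request was made with MFA. The real policy in the AWS tutorial lists a longer set of actions, so use theirs rather than this one.

# Create the policy from the CLI (policy body trimmed for readability).
aws iam create-policy \
  --policy-name ForceMFA \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Sid": "DenyAllExceptListedIfNoMFA",
      "Effect": "Deny",
      "NotAction": [
        "iam:ChangePassword",
        "iam:CreateVirtualMFADevice",
        "iam:EnableMFADevice",
        "iam:ListMFADevices",
        "sts:GetSessionToken"
      ],
      "Resource": "*",
      "Condition": {
        "BoolIfExists": {"aws:MultiFactorAuthPresent": "false"}
      }
    }]
  }'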

Account monitoring

Based on the AWS docs (https://docs.aws.amazon.com/IAM/latest/UserGuide/best-practices.html#keep-a-log) I wanted to confirm I had some basic level of account awareness. The ones applicable to a completely new account I figured would be CloudTrail, CloudWatch, and AWS Config.

 

CloudTrail

This is a very high level access log which captures create, modify, and delete actions in the account. There are of course concerns around reads, but that is just not what CloudTrail is used for. The good part is that CloudTrail is set up by default for the past 90 days for free. Between the limited data retention and the difficulty of actually using the massive amount of data CloudTrail generates, this is realistically something you need to enhance once you have a larger footprint in AWS. But for now I am happy with the defaults.

CloudWatch

CloudWatch is a more fine grained method of tracking services and requires configuration unlike CloudTrail. The only thing I am initially concerned with is being a bit of a dummy on costs so this is where I am going to set up my billing alert. Thankfully this is another setup activity that is so well accepted that it has clear documentation which I followed (https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/monitor-charges.html).
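The docs walk through the console, but the alert boils down to a single CloudWatch alarm. A rough CLI equivalent might look like the below (billing metrics only live in us-east-1, they need to be enabled in the billing preferences first, and the SNS topic ARN is a placeholder for wherever you want the notification to go):

# Alarm when the month-to-date estimated bill goes over 10 USD.
aws cloudwatch put-metric-alarm \
  --region us-east-1 \
  --alarm-name "billing-over-10-usd" \
  --namespace "AWS/Billing" \
  --metric-name "EstimatedCharges" \
  --dimensions Name=Currency,Value=USD \
  --statistic Maximum \
  --period 21600 \
  --evaluation-periods 1 \
  --threshold 10 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions "arn:aws:sns:us-east-1:123456789012:billing-alerts"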

AWS Config

Kinda wish I had looked this tool up before, if I am honest. It appears to be an easy framework to create the “detective checks” I discussed at a previous engagement. What these detective checks are meant to do is evaluate the ongoing compliance of your account, given that changes can be made by many different people and processes over time. At this point I have just clicked through and used an existing rule, “iam-user-no-policies-check”, which actually did catch some issues with an old user on my account. I hope to create a new rule around the MFA requirement, but that involves a bit more work with lambdas etc, so that will have to wait for another day and another blog post.

Wrap up

So where are we now? Honestly not that far. But the bare minimum set up is complete and I should be able to now use this account with basic assurances around access and billing issues.

And what’s next? I am going to work with Gojko Adzic on getting a basic website up under a new domain name. The main technologies will be Jekyll, CodeDeploy, S3, and CloudFront.

 

Kicking off #SEBehindTheScenes

Hopefully this marks the start of a new onslaught of blogging so I figure I should set up why as well as what will come.

For over a year now I have been working with SpeakEasy as an admin based on this simple call for help. When I first signed up, I had big dreams of the impact I could have. We talked about improving the service by removing some of the admin overhead. We talked about generating more data on how we were doing on matching, and more importantly how our applicants were getting on as speakers! We talked about how to support our mentors better so as to make that a great experience as well. Unfortunately I underestimated 1) my life circumstances and 2) the amount of effort Fiona and Anne-Marie had been putting in to “keep the lights on”.

So what have I done? I have moved us from a google docs admin experience to a Trello one and I believe we have a couple good outcomes (and areas for growth) from this move.

We now have visibility of the journey from original sign up to match. It will need some work but has already quantified a fear we had around drop off rate from signing up on the site to getting a mentor match. This means we now know that over the last year we have had 40% of applicants never get to the point of confirming their interest via email. Hmmm that sounds like a barrier we may want to lower.

We also now have a process which allows us to track who has been contacted, for what reason, and at what time. It still has a lot of manual process around moving columns, adding comments, etc. It also still has a lot of limitations, since the comments are hand written and may limit our ability to search across applicant and mentor experiences for better data in the future. But it is a start, and given that I am handing over “head matching admin” to Kristine Corbus after a year of driving the process myself, I have the utmost faith this process will be put through its paces and improved!

So what am I going to be doing if not matching? A couple things. First of all Anne-Marie and I have been through the cycle a few times of being excited about all the changes we could drive for SE, losing momentum because of the other things going on and the size of the task ahead, and then restarting. We have refocused and are planning on identifying the microtasks that can make a difference and hopefully start a push.

I will also hopefully be putting some of my goals around pipelines, observable systems, and testing in production to the test by putting up a new website for SE, where we can start to incorporate some of the new changes we want to see. That is really the key to this hashtag. It is a promise I have made to Anne-Marie to broadcast the learnings I have as I take on the task of re-platforming and then operating the SpeakEasy website.

Step 1? A call for help on twitter:

 

Step 2? A pairing session with the one and only Gojko Adzic next week, which I plan to blog about as well 🙂

Fun with google form scripting

Sometimes doing the not most important thing on your todo list is so fun. That was the case today.

I have started to work with a lot of really amazing people to support SpeakEasy as a place for new voices to be heard at testing conferences. Turns out it is a lot of work! It is so impressive to me that Anne Marie and Fiona dreamed this up, actually got it running, and have made such a sustained impact on the community over the last few years. I personally am a graduate of the program, have continued on as a mentor, and now am a volunteer. As a group of new volunteers, we were given the open invitation from Anne Marie and Fiona to make SpeakEasy our own. They know that the need to keep it running took its toll on some of the processes and that with some extra volunteer power we may be able to address some of the underlying processes.

One process I was particularly interested by was how we were alerted about new mentees and how we would then take them through the process from new sign up to being mentored.

Oh the glory of google sheets! We added three columns to the google sheet that gets created from the google form on the website. One was used to sign up for the job of getting that person matched, one was for tracking your progress on that journey, and one was for confirming final status of that new sign up. As you can imagine this gets messy and also limits the value we can get from data like how long it takes for someone to get matched.

Excel table row with data about a mentee and matching them with a mentor.
An example row from the “to be mentored” sheet.

 

In comes the power of a small script. I have known about the ability to script Google apps for a while now but never played with it myself. This felt like the perfect opportunity. What I really wanted was to be able to have a new mentee show up in a management tool for tracking and visibility, in our case Trello. But let’s start with how to get a Google form to take action on submit.

I found this website that gave a great framework for my efforts. A couple of interesting notes to get it working though. First of all, do make sure you update all of the fields they suggest to make sure they match your form. But even if you do that, if you choose to “Run > onFormSubmit” from the tool bar you will get an error like this:

TypeError: Cannot read property "response" from undefined. (line 24, file "Code")
This is the error when trying to run onFormSubmit from inside the Google scripting tool.

 

Basically it is telling you that because no form has been submitted, you cannot run this command. Shucks. I really wanted to test this before going live. Good thing I attached it to a play form, so I just went and did that! Submitted the form and waited for my glorious Trello card to show up.

Wahwahhhhh turns out that there are some typos in the script. It took me a while to sort them, but using the “notification” option on the triggers to say email me immediately helped a lot to debug. In particular you will need to look at SendEmail() and update the “title” and “message” variables to “trelloTitle” and “trelloDescription” for the form to work.

Voila! Now on submit I have a new line in my response sheet as well as a new card in Trello.

A trello card with the details from the entry form on the SpeakEasy website
A new trello card for a mentee to match

But honestly, the more exciting thing is the power of collaboration that is unlocked now. Now we have timestamps for activities that we perform, we can at a glance see how much any one person has in progress, and we can track where bottlenecks may occur during the process.

A filled in Trello card with dates and times of activties from new application through to introduced to mentor.
Possible example process of getting Jane Doe matched in Trello.

 

Obviously we as a team need to sort out our own process with this tool, but the opportunities for collaboration are so much greater. Looking forward to seeing how it progresses!

Pairing with a dev to create a utility to test translations

In another example of how automation is not just Selenium (or just unit testing or just…), I had a really great interaction with some developers on my team the other week while we were introducing a translations file for the first time in our greenfield app, and I wanted to recap it in case it helps others.

Disclaimer: it was arguably too late to be doing this at 6 months into our project. There is a lot more to translations (and even more to internationalisation), but I have yet to find a good reason not to at least extract all strings to a resource file of some sort from the beginning. Try to get that practice included ASAP, since it takes a lot of development AND testing effort to clean up random hard coded strings later on.

But whether you were quicker to the resource file than us or not, many of us have to test translations and that is really what this is about.

One day we went to finalise analysis and kick off a story which was meant to get our app translation friendly. As we dug into what that meant, we realised that the hidden costs were in:

  1. the number of pages and states that would need to be tested for full coverage of all strings
  2. exactly how translations were going to work moving forward

Because of this we chose to immediately split the story and worry about one thing at a time. First things first, let’s get the structure of a single translation understood, working well, and tested well. The stories became:

  • pick one static string that is easily accessible and translate it
  • translate all strings on page 1
  • translate all strings on page 2
  • translate all strings on page 3

We decided that it was not necessary to split any more fine grained than by page to start, but agreed we would reassess before playing those stories.

Now it is time to dig into that idea of understanding, implementing, and testing a single string…

This was inherently going to be a bit of an exploration story so we were working in abstracts during the kickoff. That being said, there were plenty of good questions to ask from the testing perspective. I shared my experience from a previous project where we set up a couple of tools to assist testing and I was hoping for the same tools here. The idea was to be able to create a “test” resource file (just as you may have a FR-fr or EN-gb language file) which could be kept up to date with the evolving  resource file and loaded as the language of choice easily. We also spoke about looking for more automated ways to test, but regardless of unit or integration test coverage, this tool was necessary to exploratory testing moving forward as more enhancements (and therefore more strings to translate) were added to the app. The devs seemed keen on the idea so I went on my merry way right into a 3 day vacation.

I came back rested and revived and so excited to see that story in the “done” column! But wait…there in the “to do” column was a glaring card “test translations”. skreeeeeech! What is that? Since when does our team not build in testing?! Obviously I was due for an update.

I spoke with one of the developers from the original kick off and first congratulated her on the story being done, then asked how it went. She explained to me that the framework for translations already existed in other apps and we were asked to follow the same pattern. Because of this, implementation was pretty easy, but understanding was still quite limited and testing was viewed as “not necessary” since it is the same as all other teams. Whew, ok, a lot to unpack…

  1. “We have always done it that way” is not only the most dangerous phrase in the English language, but absolutely the most dangerous phrase in a quality software team. I raised some questions about how the framework would impact our team directly (maintenance, performance, etc) and we quickly came to the conclusion that keeping it as a black box wasn’t going to work for us.
  2. Let’s define testing. No no no, not in another ranting way, in this context only. The developer was thinking regression testing, I was thinking future exploratory testing. They both needed to be discussed!

As we dug into the framework we adopted, I was brought up to speed on why putting in automated regression testing was probably not worth the effort in the short term. We moved on to the exploratory side. Translation strings are not something that will stay static. New strings will be added all the time, so we need a way to make sure that new effort can be validated. With very little explanation, this became clear to the developer and we put our heads together to come up with a solution given the kind of convoluted translations framework. Within about 10 minutes we had devised a plan which would take less than an hour to code and would provide me a 2 command process to load up a test translations file. Success!

So what does this look like?

create_test_translations.bash – Makes a copy of the base translations file, names it TE-te and adds “TRANSLATED–” to the beginning of all strings.
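Our real script was tied to the client’s framework so I can’t share it, but the idea is tiny. A minimal sketch, assuming a simple key=value properties file (the file names and paths here are made up for illustration):

# Copy the base language file to a fake "TE-te" locale and prefix every value
# with TRANSLATED-- so untranslated strings jump out in the UI.
SRC="src/main/resources/messages_en.properties"
DEST="src/main/resources/messages_TE-te.properties"
sed 's/=\(.*\)$/=TRANSLATED--\1/' "$SRC" > "$DEST"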

load_translations.bash – This was a bit of a work around for our framework and dealt with restarting the app in the required way to use the new test translations file.

And to clean up after?

git checkout

Definitely not fancy. Definitely not enough forever, but it met our current cost/value ratio needs. I unfortunately can’t show the client site, so instead I am going to use the amazing Troy Hunt site HackYourselfFirst so that you can get the gist. Hopefully you can see the “untranslated” strings a bit easier as well.

First, the site translated to Dutch…


 

Did you spot the 4 places that it was not translated? Did you think you spotted more than 4? Notice some words intentionally stay the same (proper nouns etc) and others should have been but weren’t.

Now for the site to be translated to my “test” language…

Troy Hunt’s HackYourselfFirst site with translations being applied to _most_ pieces of text.

 

This time I could quickly tell that the vote counter was not translated, as well as the Top manufacturers text. At least for me, this made it A LOT easier to cut down on both false positives (thinking Lexus or McLaren should have been translated) and false negatives (eyes skipping past the word votes).

Translations can be a tricky part of testing a global website, since most of us do not speak every language that the site will be displayed in. But there are certain heuristics that we can use to at least combat the most outrageous issues. Take a look at how strings of different lengths will look, look at how we handle new strings, deleted strings, changed strings. Etc etc etc.

As I mentioned at the beginning, there is A LOT more to internationalisation, but this was a way for our team to spend a little less time combing through text updates.