The premise
When writing Terraform, it is good practice to parameterise values where you can. This helps reduce duplication and makes changing values easier. In some cases the variables are conveniences, like which region to run the cloud infrastructure in. In other cases they are secret values you actively do not want to store in your source control (e.g. an API key).
Terraform provides a few different ways to pass these variables in, including flat files and environment variables. While you can use these techniques securely, they can get unwieldy as projects grow, so many people opt for 3rd party storage options like HashiCorp Vault, AWS SSM, or GCP GSM.
During a move towards using GSM I found an unexpected behaviour that I wanted to capture.
The tech
Terraform is a DSL for writing declarative configuration files across many different cloud services. This post is not about what Terraform is, or why to use it, as there are lots of other great resources for that.
Google Secret Manager (GSM) is a key/value store with built-in versioning. Again though, this is not a post about what GSM is or why to use it.
For the purpose of this post, I am going to be using Cloud Run on the Google Cloud Platform (GCP) free tier. My Cloud Run code is pretty much a copy of the great quickstart blog from Seth Vargo and lets us get right to the discussion around secrets. The full code from this blog post is available here.
Setting variables in Terraform
Imagine a world where we have a Cloud Run service that needs an environment variable set. This would require:
1. A variable to be declared
variable "secret_variable" {}
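As an aside, if you are on Terraform 0.14 or newer you can also mark the variable as sensitive so its value is redacted in plan output (this is not in the original demo, just a minimal sketch):

```
variable "secret_variable" {
  type        = string
  description = "A secret value we do not want in source control"
  sensitive   = true
}
```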
2. That variable to be called from the service
resource "google_cloud_run_service" "my-service" {
  name     = "my-service"
  location = var.google_provider_vars.region

  template {
    spec {
      containers {
        image = "gcr.io/cloudrun/hello"
        env {
          name  = "PUBLIC_VARIABLE"
          value = "insecure"
        }
        env {
          name  = "PRIVATE_VARIABLE"
          value = var.secret_variable
        }
      }
    }
  }

  traffic {
    percent         = 100
    latest_revision = true
  }

  depends_on = [google_project_service.run]
}
And to run a command like terraform plan or terraform apply we would need to set that variable.
This seems simple enough, but we need a way to store this secret variable securely while also collaborating with a number of other people on the same project. This means passing the values somehow so that they can be shared and used in the same way by everyone.
A quick review of the built in options for using variables from Terraform:
Option 1: Let Terraform ask via the command line
Command line input is the easiest and quickest way to get moving. As Terraform runs, it will stop at each variable that is missing a value and ask for user input.
$ terraform plan
var.secret_variable
Enter a value:
But this is not only error prone and tedious for a person, it is impossible in CI. For this reason it is very rare for anyone to use this method beyond early-days spiking.
Option 2: Set environment variables
Setting environment variables is a step better, since this is at least repeatable within the scope of a single engineer and a single session. That said, it requires the environment variables to be stored, which makes them a target for anyone on the machine. It also requires that they be shared among any collaborators on the project. For those reasons this again is usually only viable for small projects.
$ export TF_VAR_secret_variable="super secret" && terraform plan
Option 3: Use an additional -var-file
While Terraform automatically picks up variables in a terraform.tfvars file within the project, you can also pass in additional files via the command line. The viability of this strategy depends heavily on how you store the file. But no matter how securely you store it, it is still a file on disk, which is a higher risk if people have access to the server that runs Terraform.
$ terraform plan -var-file="terraform-secret.tfvars"
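For reference, such a var file is just plain key = value HCL assignments (the filename here matches the command above):

```
# terraform-secret.tfvars
secret_variable = "super secret"
```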
With all three of these options the variables are stored in some way on the operator's machine and need to be shared between people via a secondary process.
In my particular case, we were using a tool called Atlantis to run our Terraform commands in CircleCI. Our Atlantis server was running in Kubernetes which allowed us to store the variables in a Kubernetes Secret and then “mount” that as if it were a directory on the server so Terraform could read in the values. In Atlantis you can tell it where to look for additional variable files when it is running Terraform commands.
So if you put all that together, we had something like this:
# atlantis-deployment.yaml
apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 1
  template:
    spec:
      containers:
        - name: atlantis
          image: runatlantis/atlantis:latest
          ...
          volumeMounts:
            - mountPath: "/.config/tfvars"
              name: atlantis-tfvars
              readOnly: true
      volumes:
        - name: atlantis-tfvars
          secret:
            secretName: atlantis-tfvars
# atlantis.yaml
---
version: 3
projects:
  - dir: infra
    workflow: infra
workflows:
  infra:
    plan:
      steps:
        - init
        - plan:
            extra_args: [-var-file, /.config/tfvars/infra.tfvars]
While this worked well for a while, we wanted to evolve away from secrets on disk, and also away from a kind of "chicken and egg" scenario where we needed to write the secrets to this Kubernetes Secret in order to then write them to Terraform (which was often creating another, more specific Kubernetes Secret!).
In comes Google Secret Manager (GSM)
GSM was only introduced about a year ago, and had been on our wish list for a while. So we finally got around to spiking out GSM as a secret store, and on first research it seemed both to fit our need and to be easy to implement.
So, going back to the Cloud Run demo example, we needed to create a new secret and begin using it. This was a three-part process.
1. Create a secret in Terraform. This acts like a folder which can hold 0 to many versions of a single value.
resource "google_secret_manager_secret" "secret_variables" {
  secret_id = "secret_variables"
  project   = data.google_project.project.number

  replication {
    automatic = true
  }
}
2. Manually create a version of that secret in the GCP Console. This needs to be done manually as this is how we manage to not store any secret values in code. Think of this like storing a value in your password manager.
Notice that while .tfvar files are key:value text files, Terraform provides built-in support for JSON, so it seemed sensible to store those variables as JSON secrets.
{
  "secret_variable": "super secret"
}
3. Reference that secret version in Terraform code. This means reading and parsing that secret value and then referencing that value where necessary.
# secrets.tf
locals {
  secret_variables = jsondecode(data.google_secret_manager_secret_version.secret_variables.secret_data)
}

data "google_secret_manager_secret_version" "secret_variables" {
  provider = google-beta
  project  = data.google_project.project.number
  secret   = google_secret_manager_secret.secret_variables.secret_id
  version  = 1 # valid json
}
# cloudrun.tf
...
env {
  name  = "PRIVATE_VARIABLE"
  value = local.secret_variables.secret_variable
}
...
Nice! This was almost too easy! When we run our plan with this code we can see that the secret does not change and that is exactly what should happen. We just changed from feeding it via a file to reading it from a cloud service.
But there is a catch!
As I started to roll out the change past the proof-of-concept level, I ran head first into the 64KiB size limit of GSM, which resulted in a much bigger issue.
While this limit is well documented, the user experience in the UI is not the nicest. When you paste a value into the version creator, it simply stops at 64KiB. No warning that you hit the limit, just an abrupt end to the pasted value. I did not notice this, so when I tried to run a Terraform plan, I was running it with invalid JSON.
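To make the failure mode concrete, here is a small sketch (not from the original setup; the file names are arbitrary) that simulates the Console's silent cut-off: build a JSON value just over the 65536-byte limit, truncate it, and try to parse the result.

```shell
# Build a JSON value just over the 64KiB GSM limit, then truncate it
# the way the Console silently does, and check whether it still parses.
python3 -c 'import json; print(json.dumps({"secret_variable": "x" * 70000}))' > big.json
head -c 65536 big.json > truncated.json

if python3 -m json.tool truncated.json >/dev/null 2>&1; then
  echo "truncated value still parses"
else
  echo "truncated value is invalid JSON"
fi
```

The truncation lands in the middle of the string value, so the closing quote and brace are lost and the parse fails, exactly the situation that tripped up the plan.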
Unfortunately, the built-in Terraform function jsondecode() reacts very badly to invalid JSON: it prints the entire invalid string in its error message. That behaviour does not seem to be in the official documentation, and usually isn't a big issue.
So the combination of secret values and the unexpected truncation of those values meant that on that plan all the values of that secret were printed to the console!
Below is an example output from the demo Cloud Run repo:
$ terraform plan
...
Error: Error in function call
on secrets.tf line 2, in locals:
2: secret_variables = jsondecode(data.google_secret_manager_secret_version.secret_variables.secret_data)
|----------------
| data.google_secret_manager_secret_version.secret_variables.secret_data is "{\n \"secret_variable\": \"super secret\"\n"
Call to function "jsondecode" failed: EOF.
When we evaluated the impact of this, it was nearly enough to stop us using this setup at all. While it was nicer than our -var-file solution, the risk of invalid JSON felt too high.
Searching for a workaround
What we realised was that we needed a way to validate every secret that Terraform would try to decode before it ran any commands. Given what I shared earlier about Atlantis, we had an idea to run a JSON validator as part of the Atlantis workflow.
While this felt better, we still had concerns. How would we future-proof against new secrets being added? What happens if we start using non-JSON secrets? And is this really the right place to be doing this check?
Our final solution
Thankfully we kept looking, and came across Terraform's "escape hatch": the external data source. HashiCorp warns against this for good reason in the docs. Depending on the use case, it can create some really tough breaking scenarios on future Terraform upgrades. But right down the bottom there is a section, "Processing JSON in shell scripts". While this is meant as a side effect of the data source, it was exactly what we needed!
So instead of having to verify the secret JSON before running Terraform and then safely decode it, we could verify and decode in a single step while setting our own sensible error message.
First we needed to create a small shell script to parse valid JSON and error in a safe way on invalid JSON:
SECRET_NAME=$1
SECRET_VALUE=$2
set +e
echo "$SECRET_VALUE" | python -m json.tool >/dev/null
EXIT_CODE=$?
set -e
if [[ "$EXIT_CODE" -eq 0 ]]
then
echo "$SECRET_VALUE"; exit 0;
else
SECRET_VALUE_SIZE=$(echo $SECRET_VALUE | wc -c | awk '{ foo = $1 / 1024 ; print foo "kb" }')
>&2 echo "The secret "$SECRET_NAME" is $SECRET_VALUE_SIZE and did not parse as valid json"; exit 1;
fi
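A quick way to sanity-check the two paths of the script (not from the original post) is to exercise the same python -m json.tool check it relies on, with one valid and one invalid payload:

```shell
# The validator's core check: python -m json.tool exits 0 on valid JSON
# and non-zero on anything that fails to parse (such as a truncated value).
echo '{"secret_variable": "super secret"}' | python3 -m json.tool >/dev/null \
  && echo "valid json accepted"
echo '{"secret_variable": "super secret"' | python3 -m json.tool >/dev/null 2>&1 \
  || echo "invalid json rejected"
```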
Then we needed to trade out our local variable that used jsondecode and instead pass those values to this script. The external data source exposes a result attribute containing the parsed key/value pairs, which can be referenced via the usual techniques:
# secrets.tf
data "external" "secret_variables" {
  program = [
    "./json_validator.sh",
    data.google_secret_manager_secret_version.secret_variables.secret,
    data.google_secret_manager_secret_version.secret_variables.secret_data,
  ]
}

# cloudrun.tf
...
env {
  name  = "PRIVATE_VARIABLE"
  value = data.external.secret_variables.result.secret_variable
}
...
So the code is a bit more complicated, but not really that different from the previous local variables example. It is in the terraform plan output where this really shines. If we send through bad JSON, this is our output:
$ terraform plan
...
Error: failed to execute "./json_validator.sh": Expecting object: line 3 column 1 (char 40)
The secret secret_variables is 0.0351562kb and did not parse as valid json
Conclusion
I am fairly new to GCP in general, and since Google Secret Manager is only about a year old, I am also new to this service. That probably increased my surprise at the size limitation, but I was far more surprised by the behaviour of the Google Terraform provider, which did not mark the secret version data as sensitive, meaning it can end up printed in console output. So despite this surprise (and a possibly catastrophic error), I am super happy with the process we came up with!