AWS multi-account CI/CD with GitLab runners

sennder’s exponential growth towards the end of 2020 and beginning of 2021 has had a deep impact on all the levels of the company, including the product engineering department. In order to adapt to the business needs and be able to cover the opportunities in the market with innovative solutions, we needed to redefine the way in which software is developed, built, and operated at sennder so that it scales. With that purpose in mind, we decided to re-architect our entire cloud infrastructure, and after months of work, we released SennCloud.

SennCloud is the new cloud infrastructure framework for sennder’s product engineering. It is a clean, newly-built cloud-native ecosystem for the development and evolution of microservices on top of Amazon Web Services (AWS).

SennCloud, based on AWS, is designed with cloud computing best practices in mind. It breaks with the current status quo at sennder of having one single AWS account that aggregates all the environments and resources together. Instead, we now have individual, separate AWS accounts, paving the way for new security features and, equally important, new ways of working.

A new organization setup meant that the existing Continuous Integration / Continuous Deployment (CI/CD) setup, implemented to work within an individual AWS account, had to be redesigned too. Thus, a multi-account CI/CD system for SennCloud was created. This article will introduce the new CI/CD system and explain in detail how to set it up.

What are we using (and why)?

sennder started off its journey with one AWS account. The account hosted everything:

Our services
Our CI/CD infrastructure
Access management
Business intelligence applications

As the organization grew, the need for more segregation in our cloud infrastructure became more pressing. The main question we needed to answer was: How do we enable sennder’s product engineering teams to move as efficiently as possible while also guiding them with sufficient guardrails?

The platform team at sennder decided to evaluate AWS Organizations together with AWS Control Tower. AWS Control Tower gives us the ability to:

Roll out AWS accounts on-demand
Scale fast
Seamlessly apply policies and roll out access management
Maintain centralized control across the whole organization

Control Tower is enabled in our root account, the first entry point to our cloud world. It creates two AWS accounts: one for audits and another for log archives. These two accounts are associated with an organizational unit (OU) called the core OU.

Once those accounts are enrolled, we can choose how to structure our overall cloud setup.

There are numerous questions to answer when creating a new AWS account setup:

Do AWS accounts represent environments?
Does my team get its own AWS account?
How do we monitor the overall setup?
Who is responsible for maintaining the development environments?
And, trickiest of all: What is an environment?

We have not yet found the correct answer to those questions, but we decided to join forces with a variety of teams to gather as much input as possible before architecting the accounts.

An environment at sennder is a collection of services that serve a certain purpose along the software development lifecycle (SDLC). When engineers develop their services, they can test their code in a so-called development environment. The development environment does not host any common infrastructure. Every service is responsible for mocking or stubbing its dependencies. Our engineers are currently experimenting with contract testing so that teams can move independently and as quickly as possible. After trying out their services, the teams can deploy them into an integration environment. Every team is held accountable for the health of its services in this environment. Our quality assurance (QA) engineers can test the integration of the services before everything is released into production. Food for thought: maybe the integration environment will not be needed in the long run, as feature-flag services such as LaunchDarkly become more common.

For now, engineering at sennder operates on two accounts that correspond to two environments: development and production. Both accounts are associated with a separate OU, the so-called workload OU. Besides the two environment accounts, we maintain our networking in a separate admin account. Most of our account setup is created with Terraform. The Terraform state for common cloud infrastructure is kept in the admin account. Services store their own state in the respective environment accounts.

With continuous delivery in mind, the platform team at sennder aims for account segregation while keeping the infrastructure setup of every account as close to the production environment as possible.

Taking a step back from the multiple accounts, we want to provide a centralized solution for our CI/CD. We use GitLab’s SaaS offering to host our codebase and customized GitLab runners on Amazon Elastic Kubernetes Service (EKS) to build our software. Those runners need to be able to assume roles in the different AWS accounts, push images to Amazon Elastic Container Registry (ECR), deploy software, and register resources in Amazon API Gateway. The services are mainly hosted on Amazon Elastic Container Service (ECS).

How did we connect the different parts?

Since we use GitLab as a SaaS git repository manager, we decided to use GitLab CI to set up our pipelines. This approach allows us to store the pipeline definitions alongside the code.

As we wanted to have granular control over the performance of the pipelines, we deployed the GitLab runners into an AWS EKS cluster. In order to do so, we decided to rely on GitLab Runner’s Helm chart. The EKS cluster is placed within a dedicated account for the CI/CD infrastructure, which also includes the Docker image registry, supported by ECR, and an S3 bucket which is used by GitLab CI to cache contents between jobs and speed up the pipelines.

The main goal is to allow the centralized CI/CD infrastructure to deploy infrastructure and applications to different accounts within the organization, as depicted by the diagram below.

Runner types

Because the pieces of software we build and their pipeline configurations differ, not all jobs require the same amount of resources from the cluster. So we introduced three types of runners, deployed using the Helm chart with different values:

Small runners: Intended for lightweight pipelines, mostly based on command-line interfaces (CLIs): terraform, kubectl, helm, lint processes.
Medium runners: Used for intensive pipelines such as Docker image building.
Big runners: Dedicated to extremely intensive workloads such as compilation, webpage rendering, or performance testing.

The runner type is selected within a pipeline using tags so developers can customize the different jobs within the pipeline to use the runner that suits the workload best. The following snippet shows a Terraform job that relies on small runners. Notice how the tags: directive is used to select the runner.

image: xxxx.dkr.ecr.eu-central-1.amazonaws.com/platform/docker-base-images/docker-19.03-tf-13:latest
 
.prepare:
 variables:
   PLAN: ${CI_ENVIRONMENT_NAME}.plan.tfplan
   PLAN_JSON: ${CI_ENVIRONMENT_NAME}.tfplan.json
   TF_ROOT: ${CI_PROJECT_DIR}/iac/terraform/app
 before_script:
   - cd ${TF_ROOT}
   - terraform init
 tags:
   - small
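
The tag-based selection can be pictured with a small model. The sketch below is a simplified illustration rather than GitLab's actual scheduler: it assumes a runner is eligible for a job when the runner carries every tag the job lists, and it ignores settings such as whether a runner accepts untagged jobs. The `Runner` class, the `eligible_runners` helper, and the medium and big runner names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Runner:
    name: str
    tags: frozenset


def eligible_runners(job_tags, runners):
    """Return the names of runners whose tag set covers every tag on the job."""
    wanted = set(job_tags)
    return [runner.name for runner in runners if wanted <= runner.tags]


# Hypothetical fleet mirroring the three runner types described above.
FLEET = [
    Runner("small-gitlab-runner", frozenset({"small"})),
    Runner("medium-gitlab-runner", frozenset({"medium"})),
    Runner("big-gitlab-runner", frozenset({"big"})),
]

if __name__ == "__main__":
    # A job tagged "small", like the Terraform job above.
    print(eligible_runners(["small"], FLEET))
```

In this model, a job that lists both small and big matches no runner, because no single runner carries both tags.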

Delegated roles

Once the runners were in place, the next step was to ensure that they could deploy applications and infrastructure into the different accounts that make up SennCloud. This was done using AWS Identity and Access Management (IAM) roles. In each account that the GitLab CI pipelines deploy to, we added a delegated role that grants the permissions required over the resources deployed in that account.

The next step was to determine which entity would assume those delegated roles. The simplest solution would have been to authorize the role attached to the EKS nodes’ instance profiles to assume the delegated roles. With that approach, however, any pod running in the cluster could assume the delegated roles and deploy infrastructure into the accounts. To avoid this risk, we used kube2iam, a solution that associates AWS IAM roles with Kubernetes pods, to make sure that only the GitLab Runner pods can assume the roles and deploy.
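
To make the delegation concrete, here is a minimal sketch of the kind of trust policy a delegated role could carry so that only the runner role may assume it. It assumes the delegated role trusts the runner's kube2iam role directly; the actual policies used in SennCloud are not shown in this article. The account ID is a placeholder, the `trust_policy` helper is hypothetical, and the role name is the one that appears in the pod annotation later in this article.

```python
import json


def trust_policy(runner_role_arn):
    """Build a trust policy that lets only the given role assume the delegated role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": runner_role_arn},
                "Action": "sts:AssumeRole",
            }
        ],
    }


# Placeholder account ID; the real one is redacted in the article.
RUNNER_ROLE_ARN = "arn:aws:iam::111111111111:role/cicd-kube2iam-gitlab_runner_role"

if __name__ == "__main__":
    print(json.dumps(trust_policy(RUNNER_ROLE_ARN), indent=2))
```

Naming a single role as the principal, rather than a whole account, is what keeps other pods and other roles in the CI/CD account from deploying into the target accounts.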

Kube2iam is installed into the cluster as a Helm chart, and it requires additional AWS roles to be created prior to the installation. It runs as a DaemonSet, as it needs to be present on every node of the cluster so that all the IAM calls performed by the pods can be intercepted and validated.

Once kube2iam is present in the cluster, we just need to add the iam.amazonaws.com/role: annotation to make a specific pod use an existing IAM role. The following snippet shows the annotations of a running GitLab runner pod. In this case, the role associated with the pod is used to access the cache bucket.

apiVersion: v1
kind: Pod
metadata:
  annotations:
    iam.amazonaws.com/role: arn:aws:iam::xxxx:role/cicd-kube2iam-gitlab_runner_role
    kubernetes.io/psp: eks.privileged
  labels:
    app: small-gitlab-runner-gitlab-runner
    chart: gitlab-runner-0.25.0
    heritage: Helm
    pod-template-hash: 6b6cf4b866
    release: small-gitlab-runner
  name: small-gitlab-runner-gitlab-runner-6b6cf4b866-nkrw4
  namespace: default

AWS profiles & Terraform

As we said before, the infrastructure is provisioned by means of Terraform, and it needs to know which IAM Role to use when deploying to a specific account. Terraform handles account switching by means of AWS profiles that can be specified as environment variables or within the Terraform code itself. The following snippet shows a provider definition that will use the admin profile. Profiles are also specified within the backend configuration so that Terraform can access the remote state buckets where it will persist its data.

provider "aws" {
 region              = var.region
 allowed_account_ids = var.allowed_account_ids
 profile             = "admin"
}

In order for Terraform to assume the role associated with a specific profile, the profile configuration needs to be placed in the ~/.aws/config file within the system where Terraform CLI is executed. In our case, it is executed inside a CI/CD pipeline, which runs in a Docker container within a Kubernetes cluster. To make the profiles available within the pipeline container, we use tailored runner images in which the config file is added during the image build process.

The snippet below shows the content of the config file placed in one of our runner images. Notice how the roles assumed by the runners are the delegated roles of the different accounts. The credential_source = Ec2InstanceMetadata setting makes Terraform obtain its base credentials from the instance metadata endpoint. Because kube2iam intercepts those calls, the base credentials are those of the runner’s kube2iam role, which in turn assumes the delegated roles in the different accounts.

[profile admin]
role_arn = arn:aws:iam::xxxxxxxxxxxx:role/cicd/delegated-role
credential_source = Ec2InstanceMetadata
[profile cicd]
role_arn = arn:aws:iam::xxxxxxxxxxxx:role/cicd/delegated-role
credential_source = Ec2InstanceMetadata
[profile dev]
role_arn = arn:aws:iam::xxxxxxxxxxxx:role/cicd/delegated-role
credential_source = Ec2InstanceMetadata
[profile prod]
role_arn = arn:aws:iam::xxxxxxxxxxxx:role/cicd/delegated-role
credential_source = Ec2InstanceMetadata
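
Since every profile follows the same pattern, a file like this can be generated instead of written by hand. The following is a minimal sketch using only the Python standard library, not the way our runner images are actually built. The account IDs are placeholders, and the `ACCOUNTS` mapping and `render_aws_config` helper are hypothetical names.

```python
import configparser
import io

# Placeholder account IDs, one per profile; the real IDs are redacted in the article.
ACCOUNTS = {
    "admin": "111111111111",
    "cicd": "222222222222",
    "dev": "333333333333",
    "prod": "444444444444",
}


def render_aws_config(accounts, role_path="cicd/delegated-role"):
    """Render an AWS config file with one assume-role profile per account."""
    config = configparser.ConfigParser()
    for profile, account_id in accounts.items():
        config[f"profile {profile}"] = {
            "role_arn": f"arn:aws:iam::{account_id}:role/{role_path}",
            "credential_source": "Ec2InstanceMetadata",
        }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    # The output could be written to ~/.aws/config during the image build.
    print(render_aws_config(ACCOUNTS), end="")
```

Generating the file from a single mapping keeps the profiles consistent when a new account is added to the organization.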

How does SennCloud boost our business?

After reading our story of rolling out SennCloud, sennder’s new cloud setup, you might ask yourself, “But why? I am running on a single AWS account and everything works fine.” We understand that there are a hundred ways to architect a platform and no single right one. The key to enabling teams to use the platform as successfully as possible is to architect the platform in close collaboration with the teams.

We improved the SDLC in our organization by segregating everything else from the production environments and enabling every team member to work on their infrastructure. It might seem risky, but it is entirely feasible today with consolidated billing and guardrails in place. Be aware that you still need to monitor what is happening on your platform. Teams will forget to clean up after themselves and will need some guidance with their initial setup. We see this as an investment in our future, boosting the capabilities of our teams to new levels.

In order to gain experience and confidence in the new platform, you need rapid feedback loops. Our new CI/CD flow supports this because our pipelines can build software in a fast and reliable manner, making it easy for the teams to maneuver. Enhanced observability also supports this approach. Again, the key is to work closely with the teams. If the teams don’t know what is failing, you will have a hard time explaining it to them. But if they already know upfront where and how to dig deeper into their setup, real DevOps happens.

One topic that’s still on our plate is our production environments. Isolating production and development environments from each other seems right at first glance, but can definitely be improved. In the coming months, we will be working on new solutions to enable teams to maintain their own production software, paving the way for self-service infrastructure.

...

About the authors

Miguel Fontanilla (he/him)

Hello, hello, my name is Miguel and I joined sennder in October of 2020 to become a member of the platform team. Apart from being an amateur kite-surfer and snowboarder, I am a senior DevOps engineer and I’m always eager to investigate and try new technologies and tools within the modern software engineering landscape. I also love to help people understand these technologies, and that’s why I run my own blog and participate in open-source projects and communities.

Jan Carsten Lohmüller (he/him)

Hi, I am Jan Carsten, platform engineer, software developer, coffee enthusiast, and passionate cyclist. I have been involved in both startups as well as established global organizations and research. I joined sennder in August 2020 as a consultant from Netlight. My passion is sharing knowledge, enabling teams, and tweaking my .zshrc. My current scope of work is building a new cloud infrastructure framework for sennder’s engineering teams.
Also, check out Jan Carsten’s Medium page here.