About the author: Miguel Fontanilla joined sennder in October 2020 as a senior platform engineer on the platform team. His areas of expertise are cloud architecture, container orchestration, and automation. He is an AWS community builder member of the containers group.
Miguel is eager to investigate and try new technologies and tools within the modern software engineering landscape. He loves helping people understand these technologies. He runs his own blog and participates in conferences, open-source projects, and communities. If you want to know more about him, you can check out his LinkedIn profile.
To measure software delivery performance, more and more organizations are defaulting to the four key metrics defined by the DevOps Research and Assessment (DORA) research program. The DORA group found that elite-performing software teams, those who deliver the most value fastest and most consistently, optimise for four metrics in particular:
- Deployment Frequency – This metric measures how often code is deployed to production. It is correlated with both the speed and the quality of an engineering team, and it indicates how much value gets delivered to end-users and customers and how quickly a team can provide that value.
- Lead Time for Changes – This metric is defined as the amount of time it takes a commit to get into production. Lead time for changes is a good indicator of the efficiency of the development process, the code complexity, and the team’s capacity.
- Change Failure Rate – Change Failure Rate (CFR) identifies the percentage of deployments to production that result in a failure, and with it the overall risk that changes pose to a service. This metric isn’t concerned with failures that happen before deployment. As a result, any errors caught and fixed during the testing phase won’t be considered when tracking the change failure rate.
- Mean Time to Recover – Also known as Time to Restore or Mean Time to Restore (MTTR), this metric measures the time elapsed to recover after an incident occurs in production. For a single incident, it is the difference between the time the incident started and the time it was resolved. The metric is important as it encourages engineers to build more robust systems. It is usually calculated by tracking the average time between a bug report and the moment the bug fix is deployed.
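To make three of these definitions concrete, the sketch below computes lead time for changes, change failure rate, and time to restore from a handful of made-up records. The record fields (`committed_at`, `deployed_at`, `caused_failure`, `started_at`, `resolved_at`) and the sample values are assumptions for illustration only, not sennder's data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Deployment:
    committed_at: datetime   # when the change was committed
    deployed_at: datetime    # when the change reached production
    caused_failure: bool     # whether it led to a failure in production


@dataclass
class Incident:
    started_at: datetime
    resolved_at: datetime


def lead_time_for_changes(deployments):
    """Average time between commit and production deployment."""
    seconds = mean((d.deployed_at - d.committed_at).total_seconds() for d in deployments)
    return timedelta(seconds=seconds)


def change_failure_rate(deployments):
    """Share of production deployments that caused a failure (0.0 to 1.0)."""
    return sum(d.caused_failure for d in deployments) / len(deployments)


def mean_time_to_restore(incidents):
    """Average time between the start and the resolution of an incident."""
    seconds = mean((i.resolved_at - i.started_at).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


# Hypothetical sample data
day = datetime(2022, 1, 10)
deployments = [
    Deployment(day, day + timedelta(hours=4), False),
    Deployment(day, day + timedelta(hours=8), True),
    Deployment(day, day + timedelta(hours=6), False),
    Deployment(day, day + timedelta(hours=6), False),
]
incidents = [
    Incident(day, day + timedelta(minutes=30)),
    Incident(day, day + timedelta(minutes=90)),
]

print(lead_time_for_changes(deployments))  # 6:00:00
print(change_failure_rate(deployments))    # 0.25
print(mean_time_to_restore(incidents))     # 1:00:00
```

Averages are used here for simplicity; a median or percentile may be a better fit when a few outliers dominate.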
If you want to learn more, read this article by the DORA team about using the four key metrics to measure your DevOps performance.
Software Development Life Cycle at sennder
Prior to discussing the implementation details, it is important to understand our use cases and the ecosystem where our services run. For this purpose, we will analyze the characteristics of our software development framework, including development team structure.
Our engineering department is subdivided into Business Areas and Pods, which develop and own different services. A Pod is an individual end-to-end delivery team that forms the smallest unit of scale at sennder and is responsible for the success of a single product area. We use the terms team and pod interchangeably.
Business Areas (BAs) are defined as a logical grouping of pods that work towards a common mission and focus on solving customer needs of a particular part of the business. Services are made up of several subunits that we call components. For example, a specific service can have components that are exposed via APIs, as well as internal components that run workloads but are not exposed publicly.
The ecosystem our engineers use to build, deploy, and run their services relies on the following products and tools, which cover the overall lifecycle of our software:
- GitLab is the main source code management service.
- GitLab CI runs our pipelines to build, test, and deploy services.
- Terraform is the choice to build and maintain cloud infrastructure.
- DataDog is our single pane of glass for monitoring, logging, and tracing.
- Opsgenie is our incident management tool.
- Amazon Web Services (AWS) is the main cloud provider we use.
It is also worth noting that at sennder, we embrace a microservice paradigm, where long-running workloads coexist with event-driven workloads. As we rely on AWS services to run our workloads, we use different platforms to deploy our applications. In order to compute DORA4 metrics, we need to ingest data from all of them, including:
- Elastic Container Service (ECS) Fargate for serverless containers
- Elastic Kubernetes Service (EKS) for containers that make up more complex applications
- AWS Lambda for serverless functions (FaaS)
As part of our standardization efforts, we introduced a tagging schema for all our cloud resources to increase traceability and to get more granular cost control. Since we rely on Terraform to deploy most of our infrastructure, the tagging schemas are embedded into our Terraform modules, and they generate AWS tags on all the resources that are created.
The following image shows the tagging schema applied to our AWS resources. Some of the tags are configured by the developers, while some others are automatically generated by our CI pipelines, like the repository URLs or the repository IDs.
This standard tagging schema provides a set of tags that help us correlate the DORA4 metrics, specifically:
In the case of Elastic Kubernetes Service (EKS) applications, we follow a similar approach, but in this case, we rely on Kubernetes labels. The code below shows an example deployment using the labels. These labels can be standardized across services by leveraging Helm charts.
These tags play a crucial role in computing the DORA4 metrics, as they help correlate events from different sources.
Implementing Deployment Frequency
Once the context has been set, we can dive deeper into the technical implementation of the first metric. The metric is defined as how often an organization successfully releases to production. The key to measuring this metric lies in registering and averaging the deployment events per day.
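The idea of registering deployment events and averaging them per day can be sketched as follows. This is only an illustration under assumptions: the timestamps are invented, and the function names `deployments_per_day` and `average_daily_deployments` are hypothetical helpers, not part of our pipeline.

```python
from collections import Counter
from datetime import date, datetime


def deployments_per_day(timestamps):
    """Count successful deployment events per calendar day."""
    return Counter(ts.date() for ts in timestamps)


def average_daily_deployments(timestamps, start, end):
    """Average over every day in [start, end], including days without deployments."""
    days = (end - start).days + 1
    in_range = [ts for ts in timestamps if start <= ts.date() <= end]
    return len(in_range) / days


# Hypothetical deployment events for one service
events = [
    datetime(2022, 3, 1, 9), datetime(2022, 3, 1, 15),
    datetime(2022, 3, 2, 11),
    datetime(2022, 3, 4, 10), datetime(2022, 3, 4, 13), datetime(2022, 3, 4, 17),
]

print(deployments_per_day(events)[date(2022, 3, 4)])                         # 3
print(average_daily_deployments(events, date(2022, 3, 1), date(2022, 3, 4)))  # 1.5
```

Counting days without deployments in the denominator is a design choice; leaving them out would overstate the cadence of teams that deploy in bursts.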
It is important to define here what is considered a successful deployment to production. Ultimately, this depends on each team’s individual business requirements, as well as on the nature of the platform where the artifacts are deployed. In our case, taking into account the three computing platforms we rely on, we consider a deployment successful when:
- A new version of the function is deployed, which is reflected by the AWS event UpdateFunctionCode20150331v2 in AWS Lambda.
- The previous replica set of a Deployment is scaled to zero, after a successful rolling update in AWS EKS. Learn more about rolling updates.
- The SERVICE_DEPLOYMENT_COMPLETED event is triggered in AWS Elastic Container Service (ECS). Learn more about ECS events.
These events need to be ingested, correlated, and formatted in order to determine the deployment cadence of each pod. For our implementation, we relied on several serverless AWS services to build an ingestion pipeline: Lambda, EventBridge, and Simple Queue Service. The diagram below shows a high-level architecture of the implementation.
EventBridge is an AWS serverless event bus that can be used to receive, filter, transform, route, and deliver events from AWS, as well as external events. EventBridge receives an event, defined as an indicator of a change in environment, and applies a rule to route the event to a target. Rules match events to targets based on the structure of the event, called an event pattern, or on a schedule. All events that come to EventBridge are associated with an event bus. This article provides more information on EventBridge.
For our use case, we set different event rules in EventBridge to match the events described above using event patterns. The code snippet below shows the event pattern for the Lambda deployments. Note how the event name targets the UpdateFunctionCode20150331v2 event.
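A pattern of this kind might look like the following sketch, written as a Python dictionary. It assumes that Lambda code updates reach EventBridge as CloudTrail API-call events; the exact fields of our rule may differ. The `matches` helper is a deliberately simplified stand-in for EventBridge's matching logic (exact values only), included just to show how a pattern selects events.

```python
import json

# Assumed pattern for Lambda code updates delivered through CloudTrail
LAMBDA_DEPLOYMENT_PATTERN = {
    "source": ["aws.lambda"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["lambda.amazonaws.com"],
        "eventName": ["UpdateFunctionCode20150331v2"],
    },
}


def matches(pattern, event):
    """Simplified matcher: exact values only, no prefix or numeric operators."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            # Nested pattern: recurse into the nested part of the event
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            # Leaf: the event value must be one of the listed values
            return False
    return True


def make_event(event_name):
    """Build a trimmed, hypothetical CloudTrail-style event."""
    return {
        "source": "aws.lambda",
        "detail-type": "AWS API Call via CloudTrail",
        "detail": {"eventSource": "lambda.amazonaws.com", "eventName": event_name},
    }


code_update = make_event("UpdateFunctionCode20150331v2")
config_update = make_event("UpdateFunctionConfiguration20150331v2")

# The JSON string is what a rule definition would carry as its event pattern
pattern_json = json.dumps(LAMBDA_DEPLOYMENT_PATTERN)

print(matches(LAMBDA_DEPLOYMENT_PATTERN, code_update))    # True
print(matches(LAMBDA_DEPLOYMENT_PATTERN, config_update))  # False
```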
For ECS deployments, the event rule looks like this.
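As a hedged illustration, an ECS rule could use a pattern along these lines. It assumes the standard "ECS Deployment State Change" events, in which the service ARN is listed under `resources`; the ARN, account ID, and the `service_name_from_event` helper below are hypothetical.

```python
import json

# Assumed pattern for completed ECS service deployments
ECS_DEPLOYMENT_PATTERN = {
    "source": ["aws.ecs"],
    "detail-type": ["ECS Deployment State Change"],
    "detail": {"eventName": ["SERVICE_DEPLOYMENT_COMPLETED"]},
}


def service_name_from_event(event):
    """Extract the service name from the service ARN carried by the event."""
    arn = event["resources"][0]
    return arn.rsplit("/", 1)[-1]


# Trimmed, hypothetical event
sample_event = {
    "source": "aws.ecs",
    "detail-type": "ECS Deployment State Change",
    "resources": ["arn:aws:ecs:eu-central-1:111122223333:service/example-cluster/orders-api"],
    "detail": {"eventName": "SERVICE_DEPLOYMENT_COMPLETED"},
}

print(json.dumps(ECS_DEPLOYMENT_PATTERN))
print(service_name_from_event(sample_event))  # orders-api
```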
In the case of EKS deployments, we rely on the Kubernetes events exporter to detect the scale-down events of the previous replica sets belonging to a deployment and to send these events to our event bus. The following snippet shows the configuration of the exporter-cfg configmap that the events exporter mounts. This configmap allows setting ingestion rules based on the cluster events, as well as the receivers to send the events. If you want to know more about how to configure the events exporter, check this documentation.
In the case of events coming from EKS, there is no need to set an event rule, as the events are already filtered by the exporter before being sent to the target bus.
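The filtering logic that such an exporter rule expresses can be sketched in Python. This is not the exporter's configuration format; it only illustrates the condition, assuming the Deployment controller reports scale-downs with the reason `ScalingReplicaSet` and a message of the form "Scaled down replica set NAME to 0". The helper name `completed_rollout` is hypothetical.

```python
import re

# Matches "to 0" but not "to 2" or "to 10"; newer messages may append "from N"
SCALE_DOWN_TO_ZERO = re.compile(r"Scaled down replica set (\S+) to 0\b")


def completed_rollout(event):
    """Return the old replica set name if the event marks its scale-down to zero."""
    if event.get("reason") != "ScalingReplicaSet":
        return None
    if event.get("involvedObject", {}).get("kind") != "Deployment":
        return None
    match = SCALE_DOWN_TO_ZERO.search(event.get("message", ""))
    return match.group(1) if match else None


def make_event(message):
    """Build a trimmed, hypothetical Kubernetes event."""
    return {
        "reason": "ScalingReplicaSet",
        "involvedObject": {"kind": "Deployment", "name": "web"},
        "message": message,
    }


print(completed_rollout(make_event("Scaled down replica set web-5d8c7b9f4 to 0")))  # web-5d8c7b9f4
print(completed_rollout(make_event("Scaled up replica set web-7f6d9c to 3")))       # None
```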
Once the events are matched, they are sent to an Amazon Simple Queue Service (SQS) queue to be consumed later by Lambda functions. In addition, a dead-letter queue is configured so that messages that cannot be consumed are not dropped and can be reprocessed by the Lambda function.
Finally, the Lambda function is triggered when a new event is enqueued on the SQS queue. It processes the event, adds the right tags and formatting, and submits the event to the DataDog API.
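A rough sketch of such a function is shown below. The message body shape (`source`, `timestamp`, `tags`), the tag keys in `TAG_MAPPING`, and the spelling of the `dora4` prefix are all assumptions made for illustration. The `submit` parameter is a hypothetical hook that stands in for the call to DataDog so the example runs without network access. Returning `batchItemFailures` only has an effect when partial batch responses are enabled on the SQS event source mapping.

```python
import json

TAG_PREFIX = "dora4"  # assumed spelling of the prefix

# Hypothetical mapping from resource tag keys to DataDog tag keys
TAG_MAPPING = {"service": "service", "pod": "pod", "business_area": "business_area"}


def to_datadog_tags(resource_tags, source):
    """Prefix the known resource tags and record where the event came from."""
    tags = [f"{TAG_PREFIX}.source:{source}"]
    for key, dd_key in TAG_MAPPING.items():
        if key in resource_tags:
            tags.append(f"{TAG_PREFIX}.{dd_key}:{resource_tags[key]}")
    return tags


def handler(event, context, submit=print):
    """Process an SQS batch and report the messages that could not be handled."""
    failures = []
    for record in event["Records"]:
        try:
            body = json.loads(record["body"])
            tags = to_datadog_tags(body.get("tags", {}), body["source"])
            submit({"tags": tags, "timestamp": body["timestamp"]})
        except (KeyError, ValueError):
            # Malformed message: let SQS retry it and eventually dead-letter it
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}


# Hypothetical batch with one valid and one malformed message
batch = {
    "Records": [
        {"messageId": "1", "body": json.dumps(
            {"source": "ecs", "timestamp": 1646128800, "tags": {"service": "orders"}})},
        {"messageId": "2", "body": "not json"},
    ]
}
sent = []
result = handler(batch, None, submit=sent.append)
print(result)  # {'batchItemFailures': [{'itemIdentifier': '2'}]}
```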
This function maps the AWS and Kubernetes tags to DataDog-specific tags that we can use to filter and correlate the metrics. It also sets the event source and component type, so that we can differentiate between EKS, ECS, and Lambda events. The following DataDog tags are added to the events, with the DORA4 prefix segregating them from the rest of the metrics:
Metrics and Dashboards
When the Lambda function has set the correct format and tags for the event, it is pushed to the metrics API in DataDog. Using the metrics explorer, we can check the different deployment events that have happened over a specific range of time.
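As a minimal sketch of that last step, the code below builds a payload for DataDog's v1 series endpoint and posts it with the standard library. The metric name `dora4.deployments`, the choice of a count of 1 per deployment, and the US site URL are assumptions; the actual metric names and client we use may differ.

```python
import json
import urllib.request

# Assumes the US DataDog site; other sites use a different hostname
DATADOG_SERIES_URL = "https://api.datadoghq.com/api/v1/series"


def build_series_payload(timestamp, tags, metric="dora4.deployments"):
    """Represent one deployment as a count of 1 at the given Unix timestamp."""
    return {
        "series": [
            {"metric": metric, "type": "count", "points": [[timestamp, 1]], "tags": tags}
        ]
    }


def submit_series(payload, api_key):
    """POST the payload to DataDog and return the HTTP status code."""
    request = urllib.request.Request(
        DATADOG_SERIES_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "DD-API-KEY": api_key},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Hypothetical event; submit_series is not called here to avoid network access
payload = build_series_payload(1646128800, ["dora4.source:ecs", "dora4.service:orders"])
print(json.dumps(payload, indent=2))
```

Keeping payload construction separate from the HTTP call makes the formatting logic easy to test without a DataDog API key.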
Once the data is available in DataDog, all that is left to do is to build dashboards from the metrics ingested.
The image below shows one of the dashboards we use for the Deployment Frequency metric. The parameters on the top allow users to filter the metrics to match their service-specific values.
In the next article we will explore the remaining metrics, and share the conclusions and lessons learned from implementing them.