TECH CULTURE
10 minute read • April 25, 2023

Implementing DORA4 with Serverless Technology at sennder (Part 2)

Authors
Miguel Fontanilla

The first article introduced DORA4 metrics and the context for their implementation at sennder.

This article will focus on the remaining metrics, the conclusions, and the lessons learned.

Implementing the Four Key Metrics at sennder

The previous article presented the main use cases and the ecosystem where sennder apps are deployed and run.

As a quick reminder, here are some of the main characteristics of our software development environment:

  • Our engineering area is subdivided into Pods and Business Areas (BA)
  • Services are deployed on different AWS compute platforms: ECS, EKS and Lambda
  • We rely on GitLab as an SCM and CI/CD platform
  • Incidents are managed using Opsgenie
  • We rely on AWS tags and Kubernetes labels for service-level traceability

Lead Time for Changes

In order to compute the Lead Time for Changes metric, two main events need to be registered: when the commit happened and when the deployment containing that commit happened. In their four key metrics article, the DORA team suggests implementing this metric by maintaining a list of all the changes included in a specific deployment, for example by keeping a table that maps commit SHAs to deployments.

We followed a different strategy. We rely on GitLab as our main source code management tool, and we also run our pipelines, including the deployment ones, with GitLab CI. We consider two main contributors to the Lead Time for Changes:

  • The time from when a Merge Request containing a specific commit is opened until it is merged into the main branch.
  • The time elapsed from when the merge request is merged until the deployment pipeline triggered by that merge succeeds.

Adding these two times together gives us the overall Lead Time for Changes. However, this is only half of the job.
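As a simple illustration, the sketch below shows how these two contributions could be derived from GitLab API responses. The timestamp fields (`created_at`, `merged_at`, `finished_at`) are standard GitLab API fields, but the helper itself is only a sketch of the approach, not our production code.

```python
from datetime import datetime

# GitLab returns timestamps as ISO 8601 strings, e.g. "2023-04-20T09:15:30.000Z".
def parse_ts(ts: str) -> datetime:
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def lead_time_for_changes(merge_request: dict, deploy_pipeline: dict) -> float:
    """Lead Time for Changes, in seconds, for a single merge request."""
    opened = parse_ts(merge_request["created_at"])
    merged = parse_ts(merge_request["merged_at"])
    deployed = parse_ts(deploy_pipeline["finished_at"])

    review_time = (merged - opened).total_seconds()    # MR opened -> merged
    deploy_time = (deployed - merged).total_seconds()  # merged -> deployment pipeline success
    return review_time + deploy_time
```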

We need the ability to filter this metric by BA, team, and component like we do for the Deployment Frequency.

In this case, there are no deployment events ingested, so we cannot extract tags from the resources.

However, most of the values can be extracted from GitLab, as our repositories are organized in subgroups according to the tech division structure. The project ID, repository, BA, pod, and component name can be extracted directly from the GitLab application programming interface (API).

The architecture is simpler than for the Deployment Frequency metric, as in this case the lambda function just needs to query the GitLab merge requests, pipelines, and repositories APIs, and push the processed data to the DataDog metrics API.
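A condensed sketch of that flow is shown below. It assumes a `<ba>/<pod>/<component>` subgroup layout and uses the `datadog` Python client; the metric name `dora.lead_time_for_changes` and the exact group layout are illustrative assumptions rather than our exact implementation.

```python
import os
import requests
from datadog import initialize, api

GITLAB_API = "https://gitlab.com/api/v4"
GITLAB_HEADERS = {"PRIVATE-TOKEN": os.environ["GITLAB_TOKEN"]}

initialize(api_key=os.environ["DD_API_KEY"], app_key=os.environ["DD_APP_KEY"])

def tags_for_project(project_id: int) -> list:
    """Derive BA, pod, and component tags from the GitLab subgroup structure."""
    project = requests.get(f"{GITLAB_API}/projects/{project_id}", headers=GITLAB_HEADERS).json()
    # Assumed repository layout: <ba>/<pod>/<component>
    ba, pod, component = project["path_with_namespace"].split("/")[-3:]
    return [f"ba:{ba}", f"pod:{pod}", f"component:{component}", f"project_id:{project_id}"]

def push_lead_time(project_id: int, lead_time_seconds: float, timestamp: float) -> None:
    """Push one Lead Time for Changes data point to the DataDog metrics API."""
    api.Metric.send(
        metric="dora.lead_time_for_changes",  # hypothetical metric name
        points=[(timestamp, lead_time_seconds)],
        type="gauge",
        tags=tags_for_project(project_id),
    )
```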

Architecture

In this implementation, the lambda function is invoked periodically using a scheduled event rule so that it scrapes the GitLab API.

The schedule expression shown below ensures that the lambda is executed at 7:00 UTC from Monday to Friday. The lambda function only processes events from the day before, to avoid processing them twice. Since this eliminates the possibility of duplicate event processing, no queuing system is needed.

Scheduled event rule
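One way to express that cadence is the EventBridge schedule expression `cron(0 7 ? * MON-FRI *)`. The boto3 sketch below shows the equivalent rule definition; the rule name, target ID, and ARN are placeholders, and in practice this kind of resource would usually be managed with infrastructure as code.

```python
import boto3

events = boto3.client("events")

# Fire at 07:00 UTC, Monday to Friday.
events.put_rule(
    Name="dora4-ltfc-schedule",  # placeholder rule name
    ScheduleExpression="cron(0 7 ? * MON-FRI *)",
    State="ENABLED",
)

# Point the rule at the lambda function that scrapes the GitLab API.
events.put_targets(
    Rule="dora4-ltfc-schedule",
    Targets=[{
        "Id": "dora4-ltfc-lambda",
        "Arn": "arn:aws:lambda:eu-central-1:123456789012:function:dora4-ltfc",  # placeholder ARN
    }],
)
```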

Similarly to what we did for Deployment Frequency, we built DataDog dashboards that display the Lead Time for Changes (LTFC) metric and can be filtered using the tags.

Lead time for changes dashboard

Change Failure Rate

Change Failure Rate depends on two metrics: how many deployments were attempted, and how many resulted in production failures.

The number of deployments is already calculated by the Deployment Frequency metric. For this metric, the key is to determine how many incidents each service suffered. Our implementation only processes incidents that have already been resolved, so ongoing incidents are not counted until they are closed.

As mentioned in the previous article, we use OpsGenie for incident management, so this will be the source of truth for incident data.

However, it is important to ensure that the incidents can be correlated to deployment events. Since deployment events rely on resource tags for traceability, we decided to follow a similar approach and add tags to OpsGenie services. The following image shows an example service with a set of tags that allow us to match the deployment metrics.

OpsGenie tags

Tags can be manually configured in the OpsGenie graphical user interface (GUI).

In order to reduce user workload and ease standardization, we created Terraform modules to bootstrap OpsGenie services and their tags. This way, we can avoid manual errors when configuring the service tags.

The architecture is also simpler than that of the Deployment Frequency metric.

The lambda function is invoked periodically so that it scrapes the OpsGenie API in search of new incidents. For this metric, we use the incidents API and the services API.

Incident architecture

Whenever a new incident is found, the lambda function reads its tags and pushes a new item to the DataDog metrics API with the corresponding DataDog tags.

However, there is something important to note here. When resolved incidents are processed by the lambda function, they are not deleted from Opsgenie.

We keep them as they might be needed for postmortems. Thus, it is critical to ensure that incidents are not processed more than once if we want to provide accurate metrics. To prevent duplicate processing, we leverage incident tags: a DORA4-ACK tag is added to every incident that has been processed by the function, so the next time the lambda function pulls that incident, it will skip it.

DORA4-ACK tag
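The sketch below condenses this flow, assuming the Opsgenie REST API is queried with plain `requests`; the endpoint paths, the query string, and the `dora.incidents` metric name are illustrative assumptions rather than our exact production code.

```python
import os
import time
import requests
from datadog import initialize, api

OPSGENIE_API = "https://api.opsgenie.com/v1/incidents"
OPSGENIE_HEADERS = {"Authorization": f"GenieKey {os.environ['OPSGENIE_API_KEY']}"}

initialize(api_key=os.environ["DD_API_KEY"], app_key=os.environ["DD_APP_KEY"])

def process_resolved_incidents() -> None:
    # Fetch resolved incidents; the query syntax is an assumption.
    incidents = requests.get(
        OPSGENIE_API, headers=OPSGENIE_HEADERS, params={"query": "status:resolved"}
    ).json()["data"]

    for incident in incidents:
        tags = incident.get("tags", [])
        if "DORA4-ACK" in tags:
            continue  # already processed on a previous run, skip it

        # Push one incident data point with the service tags so it can be
        # correlated with the deployment metrics in DataDog.
        api.Metric.send(
            metric="dora.incidents",  # hypothetical metric name
            points=[(time.time(), 1)],
            type="count",
            tags=[t for t in tags if ":" in t],  # e.g. ba:..., pod:..., component:...
        )

        # Mark the incident as processed so it is never counted twice.
        requests.post(
            f"{OPSGENIE_API}/{incident['id']}/tags",
            headers=OPSGENIE_HEADERS,
            json={"tags": ["DORA4-ACK"]},
        )
```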

Once the number of incidents is available in DataDog, we can calculate the Change Failure Rate metric by dividing the number of incidents for a given component, team, or BA over a period by the number of deployments for the same component, team or BA over the same period.

As for the previous metrics, obtaining DataDog dashboards from the computed Change Failure Rate (CFR) is rather straightforward.

Change failure rate dashboard
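As an illustration, a dashboard widget for CFR can divide the two series with a formula. The metric names and template variables below are hypothetical, but the structure follows DataDog's formula-based queries.

```python
# Hypothetical DataDog query_value widget computing CFR as a percentage.
# Metric names and template variables ($ba, $team, $component) are assumptions.
cfr_widget = {
    "definition": {
        "title": "Change Failure Rate (%)",
        "type": "query_value",
        "requests": [{
            "response_format": "scalar",
            "formulas": [{"formula": "(incidents / deployments) * 100"}],
            "queries": [
                {"name": "incidents", "data_source": "metrics",
                 "query": "sum:dora.incidents{$ba,$team,$component}.as_count()"},
                {"name": "deployments", "data_source": "metrics",
                 "query": "sum:dora.deployments{$ba,$team,$component}.as_count()"},
            ],
        }],
    }
}
```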

Mean Time to Recover

To measure the Time to Restore Services, we need to know when the incident was created and when it was resolved.

These values and many more are available through the OpsGenie incidents API. For this metric, we will collect the following values:

  • impactDetectDate
  • impactStartDate
  • impactEndDate
  • id
  • priority
  • tags

Since all these values are already collected by the lambda function we developed for the Change Failure Rate, there is no need to build a new lambda function for Mean Time To Recover (MTTR).

The metric is computed as impactEndDate - impactStartDate by the lambda function before being pushed to the DataDog metrics API.
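A minimal sketch of that computation (reusing the incident payload already fetched by the Change Failure Rate function) could look like this:

```python
from datetime import datetime

def parse_ts(ts: str) -> datetime:
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def time_to_restore_seconds(incident: dict) -> float:
    """Restore time for one resolved Opsgenie incident, in seconds."""
    impact_start = parse_ts(incident["impactStartDate"])
    impact_end = parse_ts(incident["impactEndDate"])
    return (impact_end - impact_start).total_seconds()

# The result is pushed to the DataDog metrics API as a gauge, together with
# the incident priority and service tags, so dashboards can filter on them.
```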

As the priority is also processed and sent to DataDog, we can filter by priority in the dashboards. The following image shows one of the MTTR dashboards we use.

Mean time to restore dashboard

Summary

This two-part series of articles presented the four key metrics proposed by the DevOps Research and Assessment (DORA) research program and the way in which they were implemented at sennder.

Development teams can achieve significantly better business outcomes by measuring these metrics and continuously iterating to improve on them.

These metrics help DevOps and engineering leaders measure software delivery throughput and stability. All in all, they show how development teams can deliver better software to their customers, faster.

While DORA metrics are a great way for DevOps teams to measure and improve performance, the practice is not without its own set of challenges.

For most companies, the four metrics are simply a starting point; they need to be customized to fit the context of each application, rather than being applied uniformly across a team or organization.

There are several lessons we can share from our particular implementation and from the outcomes of measuring these metrics for our development teams.

The implementation of the four key metrics highlighted that we were deploying large changes, as our deployment frequency was low.

By monitoring this metric, teams that were delivering less frequently became aware of this situation and worked on improving their cadence. The following weekly trend dashboard shows how a specific team has improved its Deployment Frequency over the past months.

sennder deployment frequency

Apart from this, we were also able to optimize CD pipelines as we measured the Lead Time for Changes.

This metric allowed us to identify the jobs and workflows that were generating bottlenecks and improve them.

In addition, the Mean Time to Recover metric has helped us improve our incident management process.
