Building data science pipelines on GitLab

Nicolas Gautier has worked as a sennder machine learning engineer since February 2020. In this article, he recounts why we chose GitLab to build our data science pipelines and gives some insight into the decision-making process.

I have been working as a sennder machine learning engineer since February 2020. When I started, we already had some pretty decent models but very little in the way of automated deployment or testing. One of the first tasks I had the opportunity to contribute to was priming our machine learning CI/CD infrastructure.

A new hope

In this article, we look at why we chose GitLab to build our data science pipelines. The goal is to give insight into our decision process and explain this uncommon choice. The exciting thing about working at the intersection of data science and DevOps is that, while both fields are relatively mature, they don’t overlap much. It’s a bit of a wild west when it comes to tooling.

When it comes to bringing models to production, manual training only gets you so far. Most end-to-end training tasks involve several steps, e.g.:

Fetching data
Transforming data
Training
Reporting
Deployment
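
To make the steps above concrete, here is a minimal Python sketch of such a pipeline as a chain of functions. Everything in it is a hypothetical stand-in rather than our actual code: the function names, the toy data, and the "model" (a simple mean) are assumptions chosen only to show how each step feeds the next.

```python
# Hypothetical stand-ins for the five steps; none of this is real project code.

def fetch_data() -> list:
    # Assumption: in practice this would query a database or object store.
    return [1.0, 2.0, 3.0, 4.0]


def transform_data(raw: list) -> list:
    # Scale every value by the maximum so features fall in (0, 1].
    highest = max(raw)
    return [value / highest for value in raw]


def train(features: list) -> dict:
    # Toy "model": just the mean of the features.
    return {"mean": sum(features) / len(features)}


def report(model: dict) -> str:
    # Produce a short human-readable summary of the trained model.
    return f"mean={model['mean']:.3f}"


def deploy(model: dict) -> bool:
    # Placeholder: a real deployment would push the model to a serving system.
    return True


def run_pipeline() -> dict:
    # Each step consumes the output of the previous one.
    raw = fetch_data()
    features = transform_data(raw)
    model = train(features)
    summary = report(model)
    deployed = deploy(model)
    return {"summary": summary, "deployed": deployed}
```

Running these by hand works for one model. Once there are several models, schedules, and failure modes to handle, the chain needs something to run it automatically, which is where a pipeline tool comes in.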

We were lucky enough to start fresh, without the need to accommodate legacy systems. This kind of freedom is exhilarating, but also scary. What if we make the wrong choice?

Deploying software is nothing new, so what’s so special about machine learning? The answer is not much, not really. Sure, there are some specificities, but all in all it’s pretty similar to deploying other software. So surely an elegant solution already exists that we can adopt and never have to worry about this topic again.

Established solutions strike back

Requirements:

First of all, what do we need? Here were our requirements:

Accessible: having the best system ever can be detrimental if no one knows how to set up, operate, or maintain it. Not everyone wants to be, or should be, a DevOps master, and we should not assume that any team member now or in the future will have the time or inclination to spend hours ramping up on a given stack.
Powerful: we need to execute complex DAGs (directed acyclic graphs) reasonably well, with features such as conditional execution and retries.
Coherent with the rest of the organization: it made sense for other teams to be able to peer into our pipelines.
Integration with version control: this requirement seems basic but some solutions do not offer this out of the box (see below).
Maintained: nobody wants to use a tool that is not supported.
Kubernetes support: we work with Kubernetes a lot and having a completely different system would cause some issues.
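
To illustrate what the "Powerful" requirement means in practice, here is a small Python sketch of a DAG runner with retries and conditional execution. It is an illustration of the concepts only, not how GitLab or any of the tools discussed here work internally. The `run_dag` function, the job dictionary layout, and the example jobs are all hypothetical. In GitLab CI the same ideas are expressed declaratively in the pipeline configuration, through keywords such as `needs`, `rules`, and `retry`.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(jobs: dict, max_retries: int = 2):
    """Run jobs in dependency order.

    Hypothetical job layout: name -> {"run": callable(results),
    "needs": [upstream names], "when": optional callable(results) -> bool}.
    """
    graph = {name: spec.get("needs", []) for name, spec in jobs.items()}
    results, status = {}, {}
    # static_order() yields every job after all of its dependencies.
    for name in TopologicalSorter(graph).static_order():
        spec = jobs[name]
        # Skip a job if any upstream job failed or was skipped.
        if any(status[dep] != "success" for dep in spec.get("needs", [])):
            status[name] = "skipped"
            continue
        # Conditional execution: skip when the condition evaluates to False.
        condition = spec.get("when")
        if condition is not None and not condition(results):
            status[name] = "skipped"
            continue
        # Retries: one initial attempt plus up to max_retries more.
        for _ in range(1 + max_retries):
            try:
                results[name] = spec["run"](results)
                status[name] = "success"
                break
            except Exception:
                status[name] = "failed"
    return status, results


# Example: a fetch job that fails once, then a deploy gated on a condition.
attempts = {"fetch": 0}


def flaky_fetch(_results):
    attempts["fetch"] += 1
    if attempts["fetch"] < 2:
        raise RuntimeError("transient network error")
    return [1, 2, 3]


jobs = {
    "fetch": {"run": flaky_fetch},
    "train": {"needs": ["fetch"], "run": lambda r: sum(r["fetch"])},
    "deploy": {
        "needs": ["train"],
        "run": lambda r: "deployed",
        "when": lambda r: r["train"] > 10,  # only deploy a "good enough" model
    },
}

status, results = run_dag(jobs)
```

In this example the fetch job succeeds on its second attempt, training runs on its output, and deployment is skipped because the condition is not met.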

Our options:

With this in mind, let us consider the usual suspects. This is by no means an exhaustive or comprehensive overview; we had a finite amount of time to consider each system.

Jenkins: the reigning king, powerful and mature. Unfortunately, Jenkins also suffers from its age. It’s hard to deploy on Kubernetes (possible but not trivial). It also requires one to learn its own configuration language, which is, shall we say, verbose.
Airflow: its features are impressive; one can do just about anything with it. Because it came before the Kubernetes wave, it doesn’t play too well with Kubernetes (again possible but not trivial). It’s relatively complex, as it needs a deployed server, its own database, etc.
Argo pipelines: The cool new kid on the block. Powerful, with native handling of artifacts through remote storage backends and a wide range of DAG features. I have a personal soft spot for this one: since it’s built into Kubernetes, it integrates nicely. However, this means one needs some decent Kubernetes knowledge to operate it, and Kubernetes has a famously steep initial learning curve.
GitLab: The outsider. We looked at GitLab because our organization was migrating its CI there at the time. GitLab offers budding DAG features that are being enhanced. It’s not as feature-complete as some of the above solutions but it has the advantage of being easy to write. Since all developers need to write CI pipelines, anyone would be able to read and understand them. It’s also very easy to set up workers.

These are the pipelines you’re looking for

We went with GitLab. It offers everything we were looking for, the deciding factors being democratization and ease of use. We ran an initial proof of concept by writing the same data science pipeline in both Argo and GitLab. While this author was somewhat partial to Argo’s cool factor, our team voiced a strong preference for the alternative, and we haven’t looked back since. Here’s what a typical pipeline looks like:

If this could be useful to you, you can try everything using GitLab’s free offering and shared runners; you might be pleasantly surprised.