Background: Athemaster is a technology company offering solutions and expertise in implementing an Enterprise Data Hub and automating data integration with open source technologies. Our offices are based in Taipei, Taiwan. There I manage more than 50 data pipelines that depend on one another. First, some tables from the relational database have to be ingested into the Hadoop cluster. After that, some of the tables have to be joined. Then I run statistical models over the data to check for fraudulent behavior. If a model detects fraud, the suspect data is sent to our security department for verification. The whole pipeline is very long, and the relationships between its stages are complex. I set out to find a tool to help me manage these pipelines so that I could easily observe the status of each job and rerun or revise it.
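To make the dependency problem concrete, here is a minimal Python sketch of the pipeline described above, written outside of Jenkins. The stage names (ingest_tables, join_tables, fraud_model, notify_security) and both helper functions are hypothetical and only illustrate the idea: each stage lists the stages it depends on, and a topological sort gives a valid run order. It assumes Python 3.9 or later for the standard-library graphlib module.

```python
from graphlib import TopologicalSorter

# Hypothetical stage names; each stage maps to the set of stages it depends on.
PIPELINE = {
    "ingest_tables": set(),
    "join_tables": {"ingest_tables"},
    "fraud_model": {"join_tables"},
    "notify_security": {"fraud_model"},
}


def run_order(dependencies):
    """Return the stages in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


def stages_to_rerun(dependencies, failed_stage):
    """Return the failed stage plus every stage downstream of it, in run order."""
    order = run_order(dependencies)
    affected = {failed_stage}
    # Walking in topological order lets a failure propagate transitively.
    for stage in order:
        if dependencies[stage] & affected:
            affected.add(stage)
    return [stage for stage in order if stage in affected]


print(run_order(PIPELINE))
print(stages_to_rerun(PIPELINE, "join_tables"))
```

With more than 50 pipelines, keeping this kind of bookkeeping by hand quickly becomes unmanageable, which is the gap a pipeline management tool is meant to fill.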
Goals: To identify and adopt a tool for automating ETL and MLOps pipelines, to make the dependencies between data processing stages clearer, and to make pipeline configuration more flexible so that it is easier to manage.
"Jenkins makes complex data pipeline management simple."
Solution & Results: I chose Jenkins to manage our data pipelines for three reasons.
Easy to deploy and maintain. The basic requirement for a data pipeline management system is that it be stable and robust. The output of a data pipeline may be a table, a model, or a report, so an unstable system hurts our business directly. Jenkins is written in Java, is used by thousands of businesses, and is released frequently, so I trust it as a robust tool for critical work. Just run it and enjoy.
Observability and transparency. The Jenkins UI is very simple. I can see the status of a job easily and clearly, and I can customize views to filter and organize different types of jobs. If something goes wrong, I can click the red status indicator in the UI and go straight to the log of the failed stage.
Extensibility. Jenkins has a powerful plugin system, which makes it more capable and opens up more possibilities. I can always find the features I want among the available plugins.
Speaking of capabilities:
UI Design. The Jenkins UI is very simple, clear, and easy to use and understand.
Run condition and conditional build. Data pipelines are very complex. Depending on the result of one stage, I have to choose which build stages run next.
Parameterized Scheduler and parameterized trigger. Rerunning a data pipeline is very common. A parameterized trigger lets me easily rerun a pipeline with a custom parameter.
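The last two capabilities can be sketched in a few lines of Python. This is an illustration under stated assumptions, not my actual configuration: the server address, the job name fraud-detection, the parameter RUN_DATE, and the stage names are all hypothetical. The only Jenkins-specific piece is the buildWithParameters endpoint of the remote access API, which starts a parameterized build of a job.

```python
from urllib.parse import quote, urlencode


def next_stages(fraud_detected):
    """Pick the follow-up stages from the model result (a conditional build)."""
    # Hypothetical stage names: only notify security when fraud was found.
    return ["notify_security"] if fraud_detected else ["archive_results"]


def rerun_url(base_url, job_name, params):
    """Build the URL that triggers a parameterized build of one Jenkins job."""
    job_path = quote(job_name, safe="")
    query = urlencode(params)
    return f"{base_url.rstrip('/')}/job/{job_path}/buildWithParameters?{query}"


# Hypothetical server, job, and parameter: rerun the pipeline for one past date.
url = rerun_url(
    "https://jenkins.example.com",
    "fraud-detection",
    {"RUN_DATE": "2021-03-01"},
)
print(next_stages(fraud_detected=True))
print(url)
```

The sketch only builds the URL. Actually triggering the rerun would mean sending an authenticated POST request to it, for example with a Jenkins user name and API token.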
This is not the first time I have used Jenkins to manage data pipelines, and the results have always been excellent. For this use case, we saw:
Improved data pipeline transparency and observability, as confirmed by my team lead
The development-to-release cycle for a new pipeline shortened from 1 week to 1 day
10x faster troubleshooting