When a recommendation engine has to respond to hundreds of thousands of requests per second, there is no room for development downtime.
Background: Taboola’s recommendation engine responds to hundreds of thousands of requests per second. The service has to be fast: its p95 latency should stay below 500 milliseconds per request. That means we can’t afford any downtime, or even slower responses.
In addition, it’s critical to prevent the installation of a faulty version. A faulty version could lead to downtime or degraded performance, which can directly result in lost revenue. For this reason, we have multiple testing gateways during development to help catch a bad version before it ships. However, in our experience, unexpected and often bad things can still happen when the software meets production, and we need to be ready for that. Another important requirement is to deploy during office hours, when most of the engineers are available to assist should something go wrong.
Goals: To deploy a highly sophisticated Java service, one that is very actively developed on a daily basis, to thousands of servers in multiple data centers around the world.
"Jenkins pipelines made the implementation of a very complex flow easy."
Solution & Results: To meet these objectives, we designed a deployment flow. At a high level, the flow has the following stages:
Is today a deployment day? — We don’t deploy on holidays :)
Is today’s version valid? — Validate the version using canary testing which is implemented in another Jenkins flow
Data center verification — Deploy on a single data center and verify
New version for all — Deploy on the rest of the data centers (6 out of 7) in parallel
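As a rough illustration, the four stages above could be sketched in Python as follows. This is not the actual Jenkins Pipeline code, which is written in Groovy; the function name, the status strings, and the callables `version_is_valid` and `deploy_data_center` are hypothetical stand-ins for the holiday check, the canary flow, and the per-data-center deployment.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date
from typing import Callable, Iterable, Sequence


def run_deployment_flow(
    today: date,
    holidays: Iterable[date],
    version_is_valid: Callable[[], bool],
    deploy_data_center: Callable[[str], bool],
    data_centers: Sequence[str],
) -> str:
    """Run the four stages in order and return a short status string."""
    if not data_centers:
        raise ValueError("at least one data center is required")

    # Stage 1: skip holidays (how the real calendar check works is an assumption).
    if today in set(holidays):
        return "skipped: not a deployment day"

    # Stage 2: gate on the canary result, which comes from a separate flow.
    if not version_is_valid():
        return "aborted: version failed canary validation"

    # Stage 3: deploy to a single data center first and verify it.
    first, rest = data_centers[0], list(data_centers[1:])
    if not deploy_data_center(first):
        return f"aborted: verification failed in {first}"

    # Stage 4: deploy to the remaining data centers in parallel.
    if rest:
        with ThreadPoolExecutor(max_workers=len(rest)) as pool:
            results = dict(zip(rest, pool.map(deploy_data_center, rest)))
        failed = sorted(dc for dc, ok in results.items() if not ok)
        if failed:
            return "failed: " + ", ".join(failed)
    return "deployed"
```

The point of the ordering is that a bad version is stopped as early and as cheaply as possible: a holiday or a failed canary costs nothing, and a failure in the first data center leaves the other six untouched.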
The deployment procedure on a single data center goes like this:
Get the list of servers to be deployed
Calculate the size of the server batch (using metrics and math :)
For each server in the batch:
  Silence all alerts
  Stop the old version and remove it
  Install the new version
  Start the service
  Verify that the service started correctly
  Unsilence all alerts
Run a batch verification to check various metrics of the domain
Wait one minute before starting the next server batch
Repeat until no servers are left
For reference, the flow is detailed at: https://engineering.taboola.com/high-scale-service-deployment/
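The per-data-center procedure can be sketched in Python along these lines. It is only an illustration under stated assumptions, not the real Groovy implementation: the `ServerOps` interface and its method names are hypothetical, the batch size is passed in rather than computed from metrics, servers within a batch are handled one after another, and alerts are unsilenced even when a step fails.

```python
import time
from typing import Callable, List, Protocol, Sequence


class ServerOps(Protocol):
    """Hypothetical per-server operations; names are illustrative only."""
    def silence_alerts(self, server: str) -> None: ...
    def stop_and_remove_old(self, server: str) -> None: ...
    def install(self, server: str, version: str) -> None: ...
    def start(self, server: str) -> None: ...
    def started_correctly(self, server: str) -> bool: ...
    def unsilence_alerts(self, server: str) -> None: ...


def deploy_data_center(
    servers: Sequence[str],
    version: str,
    batch_size: int,
    ops: ServerOps,
    batch_is_healthy: Callable[[List[str]], bool],
    pause_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Deploy batch by batch; return False as soon as a check fails."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    remaining = list(servers)
    while remaining:
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        for server in batch:
            ops.silence_alerts(server)
            try:
                ops.stop_and_remove_old(server)
                ops.install(server, version)
                ops.start(server)
                if not ops.started_correctly(server):
                    return False
            finally:
                # Assumption: never leave a server with its alerts silenced.
                ops.unsilence_alerts(server)
        # Batch verification: check domain metrics before moving on.
        if not batch_is_healthy(batch):
            return False
        # Pause only when another batch is still waiting.
        if remaining:
            sleep(pause_seconds)
    return True
```

Stopping at the first failed check limits the damage from a bad version to at most one batch of servers in one data center.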
All of the logic is implemented with Jenkins Pipelines and Groovy. We created a large shared library repository containing our deployment flow infrastructure, which makes the process easy to maintain, extend and generalize to other services. We also use several Jenkins plugins during the flow run to report metrics and send alerts. For example, we integrated the PagerDuty plugin to trigger an alert in case of a failure. The alert is triggered and resolved automatically by code.
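One way to picture an alert that is triggered and resolved by code is a small wrapper around a deployment step. The Python sketch below is an assumption about the general pattern, not the plugin's actual API or the real pipeline code: `AlertClient`, `alert_on_failure`, and the alert key are hypothetical names, and resolving on every successful run assumes the alerting backend treats resolving a closed alert as a no-op.

```python
from contextlib import contextmanager
from typing import Iterator, Protocol


class AlertClient(Protocol):
    """Hypothetical alerting interface; not a real plugin API."""
    def trigger(self, key: str, summary: str) -> None: ...
    def resolve(self, key: str) -> None: ...


@contextmanager
def alert_on_failure(alerts: AlertClient, key: str) -> Iterator[None]:
    """Trigger an alert if the wrapped step raises; resolve it on success."""
    try:
        yield
    except Exception as exc:
        alerts.trigger(key, f"deployment step failed: {exc}")
        raise  # keep the failure visible to the caller
    else:
        # A later successful run closes any alert left open under this key.
        alerts.resolve(key)
```

A step would then be wrapped as `with alert_on_failure(client, "deploy-dc1"): ...`, so that a failed run pages someone and the next successful run clears the alert without manual cleanup.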
All in all, we saw great results, including:
a highly reliable deployment flow
a flow that is easier to maintain and extend, thanks to Jenkins Pipelines and Groovy
the ability to deploy to a larger number of servers in the same or even less time, thanks to the Jenkins Pipeline flow