Background: When delivering real-time ads in the marketplace, having developers on call to respond with fixes as needed is critical, as it is in any software operation. We found that our developer alerts were being routed through a "mediator": builds that failed, or that yielded metrics that should have triggered an alert, sent an email to the Network Operations Center (NOC). The NOC, in turn, contacted the responsible team or developer. This pass-along process delayed attending to the issues at hand, correcting the failures, and resuming the build.
In many cases, routing an alert through a NOC eliminates any sense of urgency. It could take up to 30 minutes before the responsible party was alerted and appropriate action took place. This delay can significantly affect builds and pipelines. I was surprised that there was no PagerDuty plugin for Jenkins yet, so I decided to develop and open-source one.
Goals: Build a new Jenkins plugin to alert developers of issues and accelerate response and resolution times.
"The key difference is the huge community, and the fact that Jenkins is a true 'battle-proven' tool." Alexander Leibzon, Software Developer/Architect
Solution & Results: We took inspiration from PagerDuty, an incident management platform that provides reliable notifications, automatic escalations, on-call scheduling, and other functionality to help teams detect and fix infrastructure problems quickly. The mobile app allows you to trigger, acknowledge, and resolve incidents.
Once the need for a Jenkins PagerDuty plugin was clear, turning to the Jenkins documentation and codebase was an easy choice. This is actually a really awesome way to get to know Jenkins in depth, much more than just using it to execute pipelines or run ad hoc jobs. After going over the possible plugin options, I built the PagerDuty plugin as a post-build notifier.
A critical factor in getting this done was the overall extensibility of the Jenkins components. Plus, the Jenkins community and documentation played a considerable role in the quick development and adoption of the plugin.
We can now trigger and resolve PagerDuty incidents directly from builds and pipelines. In addition, we shortened the "Alert to Resolution" cycle from half an hour to just a few minutes. Best of all, we now have a better, more holistic understanding of Jenkins internals.
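As a rough illustration of what triggering and resolving an incident from a build involves, the sketch below builds event bodies in the shape of PagerDuty's Events API v2 (a JSON document posted to `https://events.pagerduty.com/v2/enqueue`). Whether the plugin uses this API in exactly this way is an assumption. The helper `build_event`, the routing key, and the dedup key format are hypothetical names chosen for the example. Nothing is sent over the network.

```python
import json

# Assumption: events follow the PagerDuty Events API v2 shape.
# The plugin's actual requests may differ.
EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
ACTIONS = ("trigger", "acknowledge", "resolve")


def build_event(routing_key, action, dedup_key, summary=None, source=None,
                severity="error"):
    """Return the JSON-serializable body for one PagerDuty event.

    `dedup_key` ties a later "resolve" to the incident opened by "trigger".
    """
    if action not in ACTIONS:
        raise ValueError(f"unknown event action: {action!r}")
    event = {
        "routing_key": routing_key,
        "event_action": action,
        "dedup_key": dedup_key,
    }
    if action == "trigger":
        # A trigger event needs a payload describing what went wrong.
        if not summary or not source:
            raise ValueError("trigger events need a summary and a source")
        event["payload"] = {
            "summary": summary,
            "source": source,
            "severity": severity,
        }
    return event


if __name__ == "__main__":
    # Hypothetical job name and routing key, for illustration only.
    key = "jenkins/ads-build"
    opened = build_event("EXAMPLE_ROUTING_KEY", "trigger", key,
                         summary="ads-build #42 FAILURE", source="jenkins")
    closed = build_event("EXAMPLE_ROUTING_KEY", "resolve", key)
    print(json.dumps(opened, indent=2))
    print(json.dumps(closed, indent=2))
```

Reusing the same dedup key for the trigger and the resolve is what lets a later healthy build close the incident that an earlier failed build opened.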
With the open source PagerDuty plugin, we achieved our goals and more, including:
The ability to trigger incidents on various job statuses: Success, Failure, Aborted, Unstable, and Not Built
The ability to trigger incidents based on the number of consecutive build results
The ability to automatically resolve incidents when the job is back to normal
Being pipeline compatible
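
The capabilities listed above can be sketched as a small decision function. This is a minimal illustration under stated assumptions, not the plugin's actual implementation: `AlertConfig`, `decide`, and the field names are hypothetical. The sketch assumes that the "consecutive" count means the run of most recent builds whose status is in the configured trigger set, and that "back to normal" means a Success result while an incident is open.

```python
from dataclasses import dataclass

# Job statuses named in the list above, written as upper-case constants.
STATUSES = frozenset({"SUCCESS", "FAILURE", "ABORTED", "UNSTABLE", "NOT_BUILT"})


@dataclass(frozen=True)
class AlertConfig:
    """Hypothetical per-job settings; names are illustrative only."""
    trigger_on: frozenset            # statuses that should open an incident
    consecutive_threshold: int = 1   # matching builds in a row before alerting
    resolve_on_success: bool = True  # close the incident when the job recovers


def decide(history, config, incident_open=False):
    """Return "trigger", "resolve", or "none" for the latest build.

    `history` lists build results oldest first; the last entry is the
    build that just finished.
    """
    if not history:
        return "none"
    current = history[-1]
    if current not in STATUSES:
        raise ValueError(f"unknown build status: {current!r}")

    if current in config.trigger_on:
        # Count the unbroken run of alert-worthy results ending at this build.
        streak = 0
        for result in reversed(history):
            if result not in config.trigger_on:
                break
            streak += 1
        if streak >= config.consecutive_threshold and not incident_open:
            return "trigger"
        return "none"

    if current == "SUCCESS" and incident_open and config.resolve_on_success:
        return "resolve"
    return "none"


if __name__ == "__main__":
    cfg = AlertConfig(trigger_on=frozenset({"FAILURE", "UNSTABLE"}),
                      consecutive_threshold=2)
    print(decide(["SUCCESS", "FAILURE"], cfg))
    print(decide(["SUCCESS", "FAILURE", "FAILURE"], cfg))
    print(decide(["FAILURE", "FAILURE", "SUCCESS"], cfg, incident_open=True))
```

Tracking whether an incident is already open keeps a long failure streak from opening a new incident on every build, and it gives the resolve step something concrete to close.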