Zendesk has struggled with reliability since its beginning - in many ways it has been a victim of its own overnight success. Over the last few years we’ve had to take drastic measures to address major outages, such as implementing company-wide change freezes.
These measures hurt when you have 1000 engineers in 120 product development teams across the globe, and in many ways create more risk when the freeze begins to thaw.
To avoid these freezes, we have recently begun applying concepts from the Site Reliability Engineering (SRE) discipline, specifically Error Budgets along with SLOs/SLIs. The aim is to “scope” a freeze to only those systems that have more reliability issues.
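To make the idea of scoping a freeze concrete, here is a minimal sketch of an error budget check. It is an illustration only, not Zendesk's actual tooling: the service names, SLO targets, and request counts are all hypothetical, and it assumes a simple request-based availability SLI measured over a single window.

```python
from dataclasses import dataclass


@dataclass
class ServiceWindow:
    """Request counts for one service over one SLO window (hypothetical data)."""
    name: str
    slo_target: float      # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int
    failed_requests: int


def error_budget_remaining(window: ServiceWindow) -> float:
    """Fraction of the error budget left; zero or negative means it is spent."""
    allowed_failures = (1 - window.slo_target) * window.total_requests
    if allowed_failures == 0:
        # A 100% target (or no traffic) leaves no room for any failure.
        return 0.0 if window.failed_requests else 1.0
    return 1 - window.failed_requests / allowed_failures


def services_to_freeze(windows: list[ServiceWindow]) -> list[str]:
    """Scope the freeze to services that have exhausted their budget."""
    return [w.name for w in windows if error_budget_remaining(w) <= 0]


# Hypothetical services: only the one over budget would be frozen.
windows = [
    ServiceWindow("ticketing", 0.999, 1_000_000, 1_500),
    ServiceWindow("search", 0.999, 1_000_000, 200),
]
print(services_to_freeze(windows))
```

In this sketch a service that burns through its budget stops shipping changes, while services still within budget carry on, which is the opposite of a company-wide freeze.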
We’ve had some wins in introducing this approach, but are still very much at the beginning of this journey. This talk will tell the story of this journey along with providing some practical suggestions around tooling and practices to implement.