Zendesk has struggled with reliability since its beginning - in many ways it has been a victim of its own overnight success. Over the last few years we’ve had to take drastic measures to address major outages, such as implementing company-wide change freezes.
These measures hurt when you have 1000 engineers in 120 product development teams across the globe, and in many ways create more risk when the freeze begins to thaw.
To avoid these freezes, we have recently started to adopt concepts from the Site Reliability Engineering (SRE) discipline, specifically Error Budgets along with SLOs/SLIs. The aim is to “scope” a freeze to those systems that have more reliability issues.
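As a rough illustration of the idea, here is a minimal sketch of scoping a freeze by error budget. It is not Zendesk's actual tooling: the `ServiceWindow` type, the function names, and the rule "freeze once the budget is used up" are all assumptions made for the example. The SLI is taken to be the ratio of good events to total events over a window, and the SLO is the target for that ratio.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ServiceWindow:
    """Hypothetical per-service measurements over one SLO window."""
    name: str
    slo_target: float   # e.g. 0.999 means 99.9% of events must be good
    total_events: int   # all events observed in the window
    good_events: int    # events that met the SLI's definition of "good"


def budget_remaining(window: ServiceWindow) -> float:
    """Fraction of the error budget left: 1.0 is untouched, <= 0 is exhausted."""
    allowed_bad = (1.0 - window.slo_target) * window.total_events
    actual_bad = window.total_events - window.good_events
    if allowed_bad == 0:
        # A 100% target (or an empty window) leaves no budget to spend.
        return 1.0 if actual_bad == 0 else 0.0
    return 1.0 - actual_bad / allowed_bad


def services_to_freeze(windows: List[ServiceWindow]) -> List[str]:
    """Assumed policy: freeze only the services whose budget is exhausted."""
    return [w.name for w in windows if budget_remaining(w) <= 0]


if __name__ == "__main__":
    # Made-up numbers, purely for illustration.
    windows = [
        ServiceWindow("service-a", slo_target=0.999, total_events=1_000_000, good_events=999_500),
        ServiceWindow("service-b", slo_target=0.99, total_events=10_000, good_events=9_800),
    ]
    for w in windows:
        print(w.name, round(budget_remaining(w), 3))
    print("freeze:", services_to_freeze(windows))
```

With these made-up numbers, only the hypothetical `service-b` has overspent its budget, so a freeze would apply to it alone rather than to the whole company.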
We’ve had some wins in introducing this approach, but are still very much at the beginning of this journey. This talk tells the story of that journey and offers practical suggestions on tooling and practices to implement.
YOU MAY ALSO LIKE:
- Rolling out Error Budgets across a 1000 person global engineering organisation (SkillsCast recorded in December 2019)
- Foundations of Social Leadership (Online Meetup on 18th January 2023)
- Does Culture Impact Software Design? (SkillsCast recorded in October 2021)
- Storytelling: Shifting the Culture Through Sharing Experiences (SkillsCast recorded in July 2021)