At Hotels.com we run a bunch of microservices and infrastructure in production. Where applications previously ran on fixed hosts for their lifetime, moving our services to AWS and Kubernetes introduces two new types of change we must be prepared for: Kubernetes dynamically manages the lifecycle of our applications, and the EC2 servers that underlie our platform are ephemeral and may fail or be replaced at any time. Each incident impacts not only our revenue but also our customers' trust. In an effort to build resilience into our services, we've explored processes and tools like Toxiproxy and Kubemonkey to stress and "break" our systems on purpose, without impacting production.
In this talk we'll walk you through our work on resilience and chaos testing: why we need resilience, what it means for us, and which tools we have explored and used so far.
Daniel is a Software Engineer with over 12 years of experience. He specialises in designing microservices and high-load, scalable and fault-tolerant systems, and he's also an open source contributor.
Nikos is a Software Engineer at Hotels.com (Expedia Group). He works in a team that explores new technologies that can improve the Hotels.com and Expedia Group platforms, and he's part of the Open-Source and InnerSource groups there. His team recently started experimenting with Resilience and Chaos Engineering, mainly from an application's perspective.