SkillsCast

Monitoring highly distributed systems

7th November 2016 in London at CodeNode

There are 35 other SkillsCasts available from µCon 2016: The Microservices Conference

Knowing what's happening in your system is key to effective monitoring, troubleshooting, and crisis resolution. Unfortunately, when your microservice ecosystem scales to dozens or hundreds of microservices and every user action takes 10 microservices to complete, it becomes incredibly difficult to get the visibility and insight you need. At Jet, the team wants to know the current state of every distributed process, and there are a few hundred million of them per day. To gain this visibility, the team coupled two things: a common communication protocol, which provides an ID to correlate all the messages in a single process, and telemetry collection for every act of communication between microservices. Pulling this data together results in a stream of data from which the current state of their 100 million daily processes can be viewed with ease.
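As a rough illustration of the idea above, the sketch below tags each act of communication with a shared correlation ID and groups the resulting telemetry events by that ID to recover the state of each process. It is not the DrOrpheus protocol or Jet's telemetry pipeline; the names `TelemetryEvent`, `new_correlation_id`, `group_by_process`, and `current_state` are hypothetical and chosen only for this example.

```python
import uuid
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TelemetryEvent:
    """One act of communication between two services (hypothetical shape)."""
    correlation_id: str  # shared by every message in one distributed process
    source: str
    destination: str
    timestamp: float


def new_correlation_id() -> str:
    """Create an ID at the start of a process; every later message reuses it."""
    return uuid.uuid4().hex


def group_by_process(events):
    """Group telemetry events by correlation ID, ordered by timestamp."""
    processes = defaultdict(list)
    for event in events:
        processes[event.correlation_id].append(event)
    for hops in processes.values():
        hops.sort(key=lambda e: e.timestamp)
    return dict(processes)


def current_state(hops):
    """Return the last service a process is known to have reached."""
    return hops[-1].destination
```

A real system would consume the events as an unbounded stream rather than a list, but the grouping key stays the same: the correlation ID carried by every message.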

This stream of data allows the Jet team to build metaprograms which operate on the state of the distributed system. Examples include monitoring end-to-end SLAs, checking the status of any single process, powering an Ops platform, and automated integration testing of an entire distributed system.
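One such metaprogram, an end-to-end SLA monitor, might look something like the sketch below. It assumes each process has already been reduced to a start time and an optional completion time; the `ProcessState` type, the `sla_violations` function, and the time values are hypothetical and not taken from the talk.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProcessState:
    """Summary of one distributed process (hypothetical shape)."""
    correlation_id: str
    started_at: float                     # seconds
    completed_at: Optional[float] = None  # None while still running


def sla_violations(states, now, sla_seconds):
    """Return IDs of processes whose end-to-end duration exceeds the SLA.

    A process that is still running is measured against `now`, so stuck
    processes are reported as well as slow completed ones.
    """
    violations = []
    for state in states:
        end = state.completed_at if state.completed_at is not None else now
        if end - state.started_at > sla_seconds:
            violations.append(state.correlation_id)
    return violations
```

Measuring unfinished processes against the current time is a design choice made for this sketch: it lets the same check surface long-running processes for triage.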

This talk will share what the Jet team has done to build this real-time, holistic view of their 700+ microservice architecture, so that they can monitor every single process for completion, validate that every single process is behaving as expected, and empower their operations team to investigate and triage long-running processes (e.g. catalog management and clean up). The talk will cover the DrOrpheus communication protocol they use to create their distributed process context, the telemetry data collection architecture, and the XRay real-time telemetry processing platform, which enables them to convert billions of telemetry events per day into many different, but accurate, views of their distributed system's state.

About the Speaker

Erich Ess

Director of engineering at Jet.com. Building distributed systems and microservice platforms.
