Performance monitoring is an important part of running a successful theme park. Like a distributed system, theme parks have separate components (attractions), each with a queue of work to get through. How can we find out which of them are the least efficient? Which ones are slowing us down? Where should we spend time optimizing?
Join Mike for a roller-coaster ride through distributed system performance monitoring. Find out which measurements tell you the most about your system and how to optimize it. As an added bonus, you'll learn how to run a successful theme park! Mike has 20 years of experience developing and monitoring complex systems. In that time, he has visited some of the world's greatest theme parks.
Q&A
Question: Given SMTP itself is a (very cheap, lightweight, global scale) store-and-forward (queueing and work-stealing) mechanism, what's the difference between that queue server being unresponsive and the SMTP server being unresponsive?
Answer: Queuing systems are simple and tend not to fail very often. Once we've written to a queue, we can assume that the email will eventually be sent.
Question: I'm a bit more confused then. I can understand if it was: app → smtp (store and forward) ⇶ N x smtp (AV scanned) ⇛ smtp (external mail exchange). But that doesn't change the failure condition of the app if the queue server is down and the message can't make it into a queue (whether a "message queue" system or SMTP).
Answer: You are correct. If the queue server is down, then the app server is still unable to handle work. However, we found that the queue server was a lot more robust than the SMTP server, and this approach would work whether we were dealing with SMTP, a remote web service, a disk, or anything else.
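To make the store-and-forward idea concrete, here is a minimal sketch of a worker draining an outbox. The in-memory `deque` stands in for a durable queue server, and `drain` and `send` are hypothetical names used only for illustration; a real system would persist the queue and talk to an actual mail server.

```python
from collections import deque

def drain(outbox, send):
    """Try to send each queued email once.

    Emails whose send raises ConnectionError go back on the queue for a
    later pass, so a failing downstream server never loses work.
    `send` is a hypothetical callable standing in for the SMTP client.
    """
    retry = deque()
    sent = 0
    while outbox:
        email = outbox.popleft()
        try:
            send(email)
            sent += 1
        except ConnectionError:
            retry.append(email)  # keep it for the next pass
    outbox.extend(retry)
    return sent
```

The app only needs the queue write to succeed; the slower or less reliable downstream call happens later and can be retried.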
Question: Architecture diagrams are great for communicating with stakeholders.
Answer: Agreed. I highly recommend the work of Simon Brown and his C4 architecture diagramming technique. I also gave a talk on it about four years ago.
Question: Did you use the Universal Scalability Law in your observations and subsequent modelling and optimizations? (e.g. you mentioned "deadlocks", did you measure contention and coherence as well as duration, concurrency, throughput?)
Answer: I did not at the time. I might have saved an awful lot of time if I had.
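For readers unfamiliar with it, Neil Gunther's Universal Scalability Law models throughput at concurrency N as X(N) = λN / (1 + σ(N − 1) + κN(N − 1)), where σ is the contention coefficient and κ the coherence coefficient. The sketch below evaluates that formula; the function names are assumptions, and in practice the coefficients would be fitted to measured throughput, which was not done in the talk.

```python
import math

def usl_throughput(n, lam, sigma, kappa):
    """Throughput predicted by the Universal Scalability Law.

    n     -- concurrency (number of workers)
    lam   -- throughput of a single worker
    sigma -- contention coefficient (serialized share of the work)
    kappa -- coherence coefficient (cost of workers coordinating)
    """
    return lam * n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

def usl_peak_concurrency(sigma, kappa):
    """Concurrency at which predicted throughput peaks (requires kappa > 0)."""
    return math.sqrt((1 - sigma) / kappa)
```

Beyond the peak, adding workers reduces throughput, which is one way a model like this could have flagged the contention problems earlier.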
Question: “Unlike a theme park, we can’t just close the gates” - good advice for Kmart’s web team - It's far from ideal, but isn't it a viable solution if you don't have the ability to increase capacity in the short term?
Answer: Yes. If it's the only lever you have, you need to pull it. The alternative is that the system gets less and less responsive until it appears to have stopped.
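In software, "closing the gates" usually means refusing new work once the queue is full rather than letting it grow without bound. A minimal sketch using Python's standard `queue.Queue`, where `try_admit` is a hypothetical helper name:

```python
import queue

def try_admit(work_queue, item):
    """Add item to the queue, or reject it immediately if the queue is full.

    Returns True when the item was accepted and False when it was shed.
    """
    try:
        work_queue.put_nowait(item)
        return True
    except queue.Full:
        return False
```

A caller that gets `False` can tell the client to come back later, which keeps wait times bounded for the work already admitted.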
Question: Could you also slow the additions to a queue? In the case of Disneyland, if there was a mini attraction in the queue then they take longer to join the main queue. As a result the perceived time spent is reduced and hence increases satisfaction. This has been done at some stores for Santa photos. You queue up briefly. You then go into Santa’s cave and have something to do before joining the main queue. Wait time is similar but you feel better about the wait.
Answer: Yes, for sure. There's a lot of psychology that goes into the queues at Disneyland. They do a great job of making it seem like you are closer to the front than you really are. You can do that with a software system as well by identifying a subset of the process or message load and routing it to a different endpoint that you tune separately.
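As a rough illustration of routing a subset of the load to its own endpoint, the sketch below picks a queue by message type. The `route` function, the `"type"` field, and the two queues are all assumptions made for the example, not part of the system described in the talk.

```python
from collections import deque

def route(message, fast_types, fast_queue, main_queue):
    """Append the message to the separately tuned queue if its type qualifies.

    Returns the queue that received the message.
    """
    target = fast_queue if message["type"] in fast_types else main_queue
    target.append(message)
    return target
```

Each queue can then be given its own concurrency setting, so a slow message type no longer holds up everything behind it.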
I did write a blog post about some of these metrics if you're looking for more (or want to go over it again).
There's also a demo you can download and run (Windows only, I'm afraid) that lets you tweak duration and concurrency and see the impact on queue wait time and throughput.
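If you can't run the demo, the relationship can be sketched in a few lines. This is not the demo's code, only a simplified model that assumes messages arrive at a fixed interval and every message takes the same time to process; `simulate` and its parameters are hypothetical names.

```python
import heapq

def simulate(arrival_interval, duration, concurrency, n_messages):
    """Return (average queue wait, throughput) for a fixed-rate workload.

    Each of `concurrency` workers handles one message at a time, and every
    message takes `duration` time units to process.
    """
    free_at = [0.0] * concurrency  # min-heap of times each worker becomes free
    heapq.heapify(free_at)
    total_wait = 0.0
    last_finish = 0.0
    for i in range(n_messages):
        arrival = i * arrival_interval
        worker_free = heapq.heappop(free_at)  # earliest available worker
        start = max(arrival, worker_free)
        finish = start + duration
        heapq.heappush(free_at, finish)
        total_wait += start - arrival
        last_finish = max(last_finish, finish)
    return total_wait / n_messages, n_messages / last_finish
```

For example, `simulate(1.0, 2.0, 1, 4)` gives an average wait of 1.5 and a throughput of 0.5, while doubling concurrency with `simulate(1.0, 2.0, 2, 4)` removes the wait entirely and raises throughput to 0.8. Sustained throughput can never exceed `concurrency / duration`, so when messages arrive faster than that the queue wait keeps growing.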
The Science of Queues: Performance Monitoring for Themes Parks and Distributed Systems
Mike Minutillo
Engineer, Particular Software