This session is a deep dive into the modern best practices around asynchronous decoupling, resilience, and scalability that allow us to implement a large-scale software system from the building blocks of events and services, based on the speaker's experiences implementing such systems at Google, eBay, and other high-performing technology organizations.
We will outline the various options for handling event delivery and event ordering in a distributed system. We will cover data and persistence in an event-driven architecture. Finally, we will describe how to combine events, services, and so-called "serverless" functions into a powerful overall architecture.
You will leave with practical suggestions to help you accelerate your development velocity and drive business results.
Q&A
Question: When pulling out tables from a shared database, what would happen to the foreign keys?
Answer: You manage foreign keys explicitly in the application layer; you no longer get automatic referential integrity from the database.
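A minimal sketch of what that application-level check might look like (the service client and repository here are hypothetical stand-ins, not anything from the talk):

```python
# Hypothetical sketch: enforcing what used to be a foreign key in the
# application layer, once orders and customers live in separate services.

class CustomerNotFound(Exception):
    pass

def create_order(order_repo, customer_client, customer_id, items):
    # The database can no longer enforce the reference for us, so we ask
    # the owning service whether the referenced customer exists.
    if not customer_client.exists(customer_id):
        raise CustomerNotFound(customer_id)
    return order_repo.insert(customer_id=customer_id, items=items)
```

Note that the customer could still disappear between the check and the insert; at scale that window is usually handled with events and compensation rather than cross-service locks.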
Question: In that case, is there a period of time, as the saga plays out, where you could have broken data?
Answer: That's what I was trying to get at by talking about explicitly modeling the "intermediate states". You will be able to observe those intermediate ("pending", etc.) states. It's not broken; it's just not completed yet.
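As an illustration (the state names here are invented, not from the talk), those intermediate states can be first-class values the rest of the system is allowed to observe:

```python
from enum import Enum

# Hypothetical example: the saga's intermediate states are explicit,
# observable values rather than "broken" data.
class OrderState(Enum):
    PAYMENT_PENDING = "payment_pending"      # payment not yet confirmed
    INVENTORY_PENDING = "inventory_pending"  # stock reservation in flight
    COMPLETED = "completed"
    CANCELLED = "cancelled"                  # compensations were applied
```

A reader that sees PAYMENT_PENDING knows the workflow is still in flight, not that the data is inconsistent.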
Question: Is there a good read about use cases for workflows and sagas, with examples of how to implement one?
Answer: There are some excellent talks: one by Caitie McCaffrey at J on the Beach in 2018, and many by Chris Richardson on the Saga pattern. Chris Richardson also has an excellent book, Microservices Patterns, which covers the Saga pattern in detail, and his website has good material too: https://microservices.io/patterns/data/saga.html
Question: When distributed, how do you support DB restores without losing the events sent from another service in the meantime?
Answer: Don't do that :). Restoring data that loses previously committed work would be a big problem whether you are using events or not. Typical storage patterns at large scale write to multiple disks or replicas to avoid data loss.
Question: For the saga model you showed, are there not issues with creating bi-directional dependencies?
Answer: Not necessarily. Theoretically one could imagine a workflow that went back and forth between two services in a series of "handoffs" of the workflow. I wouldn't recommend this approach, though. If two services were so tightly coupled, they really should be one service.
Question: A question that always sticks with eventual consistency is how to ensure what's important is processed in the right order. In your customer address example, if the customer updates their address and then clicks purchase, how do we ensure we're not shipping to the old address, since the update may arrive after the shipment service has taken its stale copy?
Answer: Excellent question. For this particular example, I would have a
Question: Another example here might be that we have some data changed in a CRM system and a processing system runs directly after, but the processing depends on the data that has been stored in the CRM system. Any tips on how to handle this?
Answer: I think I'd have to understand the use-case in more detail to make a good recommendation. It sounds like the problem is that you update the CRM, but the CRM is eventually consistent itself, and returns stale data to the processing system? If there is any way to get the CRM system to fire off an event when it has the update, that might help?
Question: OK, let me explain this better. Take the use case of onboarding an investor, where we create the investor/person in the CRM service. We then call an Account Service to create the account for the investor, but due to eventual consistency between the CRM and Account service, the investor details are not in the Account service yet. Hope this makes more sense. Any tips on how to handle this?
Answer: Sounds like InvestorOnboarding should be a workflow instead of a succession of synchronous calls that are all expected to immediately succeed. There is clearly a state machine here: started -> personal-info-available -> account-created -> etc. Asking yourself how you know when the details are available in the CRM will help you answer how to transition from state to state. I hope this helps.
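A minimal sketch of that state machine, assuming events such as PersonalInfoAvailable and AccountCreated exist (both names are hypothetical):

```python
# Hypothetical sketch: InvestorOnboarding as an explicit state machine.
# Each incoming event advances the state; nothing assumes a downstream
# service is immediately consistent.

TRANSITIONS = {
    ("started", "PersonalInfoAvailable"): "personal-info-available",
    ("personal-info-available", "AccountCreated"): "account-created",
}

class InvestorOnboarding:
    def __init__(self):
        self.state = "started"

    def on_event(self, event_type):
        next_state = TRANSITIONS.get((self.state, event_type))
        if next_state is None:
            # Duplicate or out-of-order event: park it for retry instead
            # of failing the whole onboarding.
            return
        self.state = next_state
```

The call to the Account Service is then made only after the workflow has observed (for example via an event the CRM emits) that the investor's details really are available, rather than immediately after the synchronous write.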
Question: How do you deal with missed events? Do you use event backplanes that can be trusted, or do you deal with it in other ways? e.g. listen to events and maintain a cache, but synchronize periodically, just in case?
Answer: Great question. There are a number of approaches I have used:
1. Use a "reliable" transport.
2. If you don't trust (1), use a periodic "reconciliation batch" to find and synchronize inconsistencies (sketched after this answer). Financial institutions do this with each other all the time.
Strong point of view here: you need to be using a reliable transport like Kafka. I have spent what feels like years of my life debugging unreliable event systems like RabbitMQ. The fact that it mostly works doesn't help when it doesn't.
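A minimal sketch of such a reconciliation batch, assuming both sides can be read as id-to-record maps (all names here are hypothetical):

```python
# Hypothetical reconciliation batch: periodically compare the source of
# truth with a downstream copy and repair whatever a missed event left
# inconsistent.

def reconcile(source_records, replica_records, republish):
    """Both record arguments are dicts mapping id -> record."""
    for record_id, record in source_records.items():
        if replica_records.get(record_id) != record:
            # Missing or stale downstream: re-emit the event so the
            # normal consumer path repairs the replica.
            republish(record_id, record)
```

Run on a schedule, this bounds how long a missed event can leave the two systems out of sync.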
Question: Do you try and put a distributed transaction around the creation of materialised view records?
Answer: Essentially you use sagas: when you see the first event in a workflow you create a "record" in your materialised view, then wait for the child events to come in and fill in the details. You leave a "hole" for the other side of the join to arrive and fill.
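As an illustration (the event shapes and field names are invented), the view's handlers might upsert a partial row on the first event and fill the hole when the child event arrives:

```python
# Hypothetical materialised-view handlers: the first event creates a
# partial record; later events fill in the remaining fields.

view = {}  # order_id -> materialised row

def on_order_placed(event):
    view[event["order_id"]] = {
        "customer_id": event["customer_id"],
        "shipping_status": None,  # the "hole" awaiting the child event
    }

def on_shipment_created(event):
    # setdefault tolerates the child event arriving first.
    row = view.setdefault(event["order_id"],
                          {"customer_id": None, "shipping_status": None})
    row["shipping_status"] = event["status"]
```

Because each handler tolerates the events arriving in either order, no distributed transaction is needed.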
Question: Are there any ‘event contract’ standards winning out as the predominant industry standard, the way Swagger transformed RESTful APIs?
Answer: AsyncAPI is one of them: https://www.asyncapi.com/
Question: Do you have any good practices for tracing events as they go up and down through an entire system? For example, if data enters the system through the APIs at the front gate, do you have any recommendations to be able to trace that data at scale, all the way to the data store at the other end of that pipeline, especially if you have events that number in the billions?
Answer: As others have suggested, the way you phrase the question lends itself to a correlation/request ID which you pass along.
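A sketch of the idea, assuming you control the event envelope (the field and topic names are hypothetical):

```python
import uuid

def publish(topic, message):
    # Stand-in for your broker client; logging the correlation id at
    # every hop is what makes end-to-end tracing possible.
    print(topic, message["correlation_id"])

def enrich(payload):
    return payload  # stand-in for real processing

def ingest(payload):
    # The id is minted exactly once, at the front gate.
    envelope = {"correlation_id": str(uuid.uuid4()), "payload": payload}
    publish("orders.created", envelope)
    return envelope

def handle(envelope):
    # Downstream services copy the id unchanged into every event they emit.
    downstream = {"correlation_id": envelope["correlation_id"],
                  "payload": enrich(envelope["payload"])}
    publish("orders.enriched", downstream)
```

Searching your logs for a single correlation id then reconstructs the full path of one request, even among billions of events, as long as every hop logs it.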
Question: Any pitfalls to watch out for when making use of CDC (change data capture)?
Answer: Depending on how complex your DB schema is, CDC might be too low-level. If you are writing or updating multiple rows across several tables as part of a single semantic operation, it might be hard to "reconstruct" the real event from those individual DB changes.
On the plus side, CDC is dead simple.
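To illustrate the low-level point (the shape of these change records is invented for the example), a single semantic operation can surface as several row-level changes that you have to stitch back together by transaction id:

```python
from collections import defaultdict

# Hypothetical CDC stream: one semantic "order placed" operation appears
# as three row-level changes across two tables.
changes = [
    {"tx": 42, "table": "orders", "op": "insert", "row": {"id": 7}},
    {"tx": 42, "table": "order_lines", "op": "insert",
     "row": {"order_id": 7, "sku": "A"}},
    {"tx": 42, "table": "order_lines", "op": "insert",
     "row": {"order_id": 7, "sku": "B"}},
]

# Group by transaction id before trying to infer the real event.
by_tx = defaultdict(list)
for change in changes:
    by_tx[change["tx"]].append(change)

for tx, rows in by_tx.items():
    # Only at this point can you try to map the row set back to an
    # "OrderPlaced" event; the mapping gets harder as the schema grows.
    print(tx, [c["table"] for c in rows])
```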