The world we live in today is fed by data. From self-driving cars and route planning, to fraud prevention, to content and network recommendations, to ranking and bidding, today's systems not only consume low-latency data streams, they adapt to changing conditions modeled by that data.
While the world of software engineering has settled on best practices for developing and managing both stateless service architectures and database systems, the larger world of data infrastructure still presents a greenfield opportunity. To thrive, this field borrows from several disciplines: distributed systems, database systems, operating systems, control systems, and software engineering, to name a few.
Of particular interest to me is the subfield of data streams, specifically how to build high-fidelity nearline data streams as a service within a lean team. To build such systems, manual operations are a non-starter: all aspects of operating streaming data pipelines must be automated. Come to this talk to learn how to build such a system soup-to-nuts.
Q&A
Question: In your case, are the Observer and Deployer the sum of a few systems and processes, or have your team(s) invested in creating centralised solutions?
Answer: I'm at a startup now, so I use a different system than I did at PayPal.
Let me describe PayPal here first…
The Deployer actually does something I missed in the talk:
- It deploys code
- It waits for the Observer's signals to settle and then checks lag and loss metrics (it can also check scoring skew)
- If it detects that something is wrong, it will roll back the code AND roll back the Kafka checkpoint.
- This is why my system is at-least-once delivery: rolling back the checkpoint means the Deployer can replay messages from the window of a bad deployment (a sketch of this check-and-rollback loop follows below)
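As a rough illustration only, here is a minimal Python sketch of that deploy/verify/rollback loop. All helper names, thresholds, and metric names are mine, not the production system's:

```python
import time

SETTLE_SECONDS = 300         # assumed time to let the Observer's metrics settle
LAG_THRESHOLD_SECONDS = 60   # assumed SLO values; the real thresholds are not public
LOSS_THRESHOLD_PCT = 0.01

def deploy(version):
    """Hypothetical: roll the new build out to the pipeline's microservices."""
    print(f"deploying {version}")

def rollback_code(previous_version):
    """Hypothetical: redeploy the last known-good build."""
    print(f"rolling back code to {previous_version}")

def rollback_kafka_checkpoint(group_id, checkpoint):
    """Hypothetical: rewind the consumer group to the pre-deploy checkpoint so
    messages from the bad-deploy window are replayed (hence at-least-once)."""
    print(f"rewinding {group_id} to {checkpoint}")

def read_metric(name):
    """Hypothetical: read a rolling-window metric from the Observer's store."""
    return 0.0

def deploy_and_verify(new_version, previous_version, group_id, checkpoint):
    deploy(new_version)
    time.sleep(SETTLE_SECONDS)                 # wait for the Observer's signals to settle
    lag = read_metric("e2e_lag_p95_seconds")
    loss = read_metric("loss_pct")
    if lag > LAG_THRESHOLD_SECONDS or loss > LOSS_THRESHOLD_PCT:
        rollback_code(previous_version)                      # undo the bad build...
        rollback_kafka_checkpoint(group_id, checkpoint)      # ...and replay its window
        return False
    return True
```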
The Observer system uses reservoir-sampling instrumentation in each microservice to collect data. The data is sent over Kafka to a Spark Streaming job that aggregates it and writes it to ES (our metrics store).
A separate Spark job computes rolling-window loss and lag; this system also sends alarms.
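To illustrate the instrumentation side, here is a minimal reservoir-sampling sketch (Algorithm R) in Python; the actual PayPal instrumentation, topic names, and the Spark/ES plumbing are not shown, and all names here are placeholders:

```python
import random

class ReservoirSampler:
    """Keep a uniform random sample of at most k observations per flush window."""

    def __init__(self, k):
        self.k = k
        self.reservoir = []
        self.seen = 0

    def offer(self, value):
        """Offer one observation (e.g. a per-message latency)."""
        self.seen += 1
        if len(self.reservoir) < self.k:
            self.reservoir.append(value)
        else:
            # Replace an existing element with probability k / seen (Algorithm R),
            # so every observation has an equal chance of being in the sample.
            j = random.randrange(self.seen)
            if j < self.k:
                self.reservoir[j] = value

    def flush(self):
        """Return the current sample (to be published, e.g. to a Kafka metrics
        topic for the downstream Spark Streaming job), then reset."""
        sample, self.reservoir, self.seen = self.reservoir, [], 0
        return sample
```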
In my current startup, we run in AWS.
Our Observer system is 100% CloudWatch. To save money, we do some client-side metric sampling for time-based metrics (e.g. lag). We can't sample loss metrics or data points at all.
In my startup, we run our own K8s on EC2. We do HPA CPU-based autoscaling on K8s on top of memory-based EC2 autoscaling (two-level autoscaling). It's a bit of an advanced topic for a keynote.
Our Deployer system is currently more complex. Deployment is via kubectl commands (a small sketch of what that can look like follows).
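The talk doesn't cover the deployment mechanics; purely as an illustration, a thin Python wrapper over standard kubectl subcommands (`apply`, `rollout status`, `rollout undo`) might look like the following. The deployment name and manifest path are placeholders:

```python
import subprocess

DEPLOYMENT = "streaming-pipeline"   # placeholder deployment name

def kubectl(*args):
    """Run a kubectl command and fail loudly if it returns a non-zero exit code."""
    subprocess.run(["kubectl", *args], check=True)

def deploy(manifest_path):
    kubectl("apply", "-f", manifest_path)                      # push the new spec
    kubectl("rollout", "status", f"deployment/{DEPLOYMENT}")   # block until pods are ready

def rollback():
    kubectl("rollout", "undo", f"deployment/{DEPLOYMENT}")     # revert to the previous revision
```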
Question: What's the best way to learn more and practise with a solution?
Answer: I can write something up that shows more implementation details. You can connect with me on https://www.linkedin.com/in/siddharthanand/ (email: sanand@apache.org).
As I publish on this topic, I'll share it on LinkedIn.
You can also email me at sanand@apache.org after this Q&A with questions, and I can always find time for Zoom mentoring, etc.
Question: On the e2e lag, does the expel-time part mean the data has reached its destination, say, becoming available in a database?
Answer: In the expel node, we:
- Read from Kafka (autocommit disabled)
- Write to, say, BigQuery
- Wait for a 200 (OK)
- When we get it, we call KafkaConsumer.acknowledge()
So, yes: E2E lag encapsulates receiving the request at the source (S) and confirming the send at the destination (D). The entire chain is transactional.
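A minimal sketch of that expel pattern, using the kafka-python and google-cloud-bigquery clients (library choices, topic, and table names are my assumptions, not what the talk specifies; kafka-python's `commit()` plays the role of the acknowledge call above):

```python
import json
from kafka import KafkaConsumer            # kafka-python
from google.cloud import bigquery

TABLE_ID = "my-project.my_dataset.events"  # placeholder destination table

consumer = KafkaConsumer(
    "events",                              # placeholder topic
    bootstrap_servers="localhost:9092",
    group_id="expel-node",
    enable_auto_commit=False,              # autocommit disabled, as described above
    value_deserializer=lambda v: json.loads(v),
)
bq = bigquery.Client()

for message in consumer:
    # Write to the destination (BigQuery here) and wait for a successful response.
    errors = bq.insert_rows_json(TABLE_ID, [message.value])
    if not errors:
        # Only after the destination confirms the write do we commit the offset,
        # so a crash before this point replays the message (at-least-once).
        consumer.commit()
```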
Building & Operating Autonomous Data Streams
Sid Anand
Chief Architect, Datazoom