Apache Spark is a distributed computing framework that enables scalable, high-throughput, and fault-tolerant processing of data. Spark Streaming delivers the power of Spark to process streams of data in near real-time.
After a quick introduction, in this talk we are going to discuss the Spark Streaming "micro-batch" model that enables the re-use of Spark as a data processing engine for in-flight data.
In particular, we will place emphasis on:
• the different stream consumption approaches
• the performance characteristics of each, and
• the new Kafka "direct" receiver and its improved reliability.
Through several live examples, we will explore the Spark Streaming API and see how streaming jobs can be combined with other Spark libraries to create data products that extract value from data in real-time.
Gerard is the Lead Engineer Distributed Computing at Data-Fellas, where he works on the Data Engineering aspects of our products. Before Data-Fellas, he worked at Virdata, an IoT startup, building and extending their data processing pipelines on public clouds.
He has a background in Computer Engineering and is a former Java geek now converted to Scala and Functional Programming. Throughout his career at technology companies such as Alcatel-Lucent, Bell Labs and Sony, he has mostly been involved in creating and scaling up back-end systems for telecommunications, entertainment, Smart Grids and IoT.
He speaks at local and international meetups and conferences about Spark, Scala and Cassandra. He’s also a top contributor to Stack Overflow on the Apache-Spark and Spark-Streaming tags.