2 DAY COURSE

Lightbend Apache Spark for Scala - Professional

Topics covered in SPARK-01-02

Overview

Would you like to learn how to implement data analytics using Apache Spark for Reactive applications? Then join us for this two-day, hands-on course led by the world's leading Spark experts.

Join this two-day Apache Spark course for developers and learn how to implement data processing pipelines and analytics using Apache Spark. Come along to learn the Spark Core, SQL/DataFrame, Streaming, and MLlib (machine learning) APIs through hands-on exercises. You will also learn about Spark internals and tips for improving application performance. Additional coverage includes integration with Mesos, Hadoop, and Reactive frameworks like Akka.

Formerly known as the Apache Spark Workshop.

Learn how to:

  • Use the Spark Scala APIs to implement various data analytics algorithms for offline (batch-mode) and event-streaming applications
  • Understand Spark internals
  • Evaluate and improve Spark application performance
  • Test and deploy Spark applications
  • Integrate Spark with Mesos, Hadoop, and Akka

About the Expert

Vincent Van Steenbergen

Vincent Van Steenbergen is a Senior Data Engineer who has been working on Big Data projects using Machine Learning (recommender systems, fraud detection) and, more recently, Deep Learning (voice analysis, natural language processing). He regularly speaks at international conferences and meetups about Big Data tech stacks such as Scala, Akka, and Spark, as well as Machine Learning and Deep Learning.

About the Author

Dean Wampler

Dean Wampler, Ph.D., is the Architect for Big Data Products and Services in the Office of the CTO at Lightbend, where he focuses on the evolving “Fast Data” ecosystem for streaming applications based on the SMACK stack: Spark, Mesos, Akka (and the rest of the Lightbend Reactive Platform), Cassandra, and Kafka, along with other tools.

Program

Introduction - Why Spark

  • How Spark improves on Hadoop MapReduce
  • The core abstractions in Spark
  • What happens during a Spark job?
  • The Spark ecosystem
  • Deployment options
  • References for more information

Spark's Core API

  • Resilient Distributed Datasets (RDDs) and how they implement your job
  • Using the Spark Shell (interpreter) vs. submitting Spark batch jobs
  • Using the Spark web console
  • Reading and writing data files
  • Working with structured and unstructured data
  • Building data transformation pipelines (see the sketch after this list)
  • Spark under the hood: caching, checkpointing, partitioning, shuffling, etc.
  • Mastering the RDD API
  • Broadcast variables and accumulators
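
To give a flavor of the hands-on exercises, here is a minimal word-count pipeline against the RDD API. It is a sketch under stated assumptions, not course material: the input path (data/sample.txt) is hypothetical, and the local[*] master is only for experimenting on one machine. It touches several items from the list above: transformations, caching, and an accumulator.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // Hypothetical input path; any plain-text file will do.
    val lines = sc.textFile("data/sample.txt")

    // An accumulator gathers a side statistic while the data flows through.
    val blankLines = sc.accumulator(0L, "blank lines")

    val counts = lines
      .flatMap { line =>
        if (line.trim.isEmpty) blankLines += 1L
        line.toLowerCase.split("""\W+""")
      }
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .cache()  // keep the result in memory for reuse

    counts.take(10).foreach(println)
    println(s"Blank lines seen: ${blankLines.value}")
    sc.stop()
  }
}
```

In the Spark Shell the `sc` context is already provided, so you can paste just the pipeline itself.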

Spark SQL and DataFrames

  • Working with the DataFrame API for structured data (see the sketch after this list)
  • Working with SQL
  • Performance optimizations
  • Support for JSON and Parquet formats
  • Integration with Hadoop Hive
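
A point worth noting from this part of the course: the same query can typically be written either with DataFrame operators or as SQL over a registered table, and both go through the same Catalyst optimizer. A minimal sketch, assuming the Spark 1.x SQLContext API and a hypothetical data/people.json file with name and age fields:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object DataFrameDemo {
  def main(args: Array[String]): Unit = {
    val sc         = new SparkContext(new SparkConf().setAppName("DataFrameDemo").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._  // enables the $"column" syntax

    // Hypothetical JSON input; the schema is inferred from the data.
    val people = sqlContext.read.json("data/people.json")

    // The same aggregation through the DataFrame API...
    people.filter($"age" > 21).groupBy($"age").count().show()

    // ...and through SQL over a temporary table.
    people.registerTempTable("people")
    sqlContext.sql("SELECT age, COUNT(*) AS n FROM people WHERE age > 21 GROUP BY age").show()

    // DataFrames read and write Parquet natively.
    people.write.parquet("output/people.parquet")
    sc.stop()
  }
}
```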

Processing events with Spark Streaming

  • Working with time slices (“mini-batches”) of events
  • Working with moving windows of mini-batches (see the sketch after this list)
  • Reuse of code in batch-mode and streaming: the Lambda Architecture
  • Working with different streaming sources: sockets, file systems, Kafka, etc.
  • Resiliency and fault tolerance considerations
  • Stateful transformations (e.g., running statistics)
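
As an illustration of mini-batches and moving windows, here is a hedged sketch of a windowed word count over a socket source. The host and port are placeholders (feed them with `nc -lk 9999` for a quick test), and the batch, window, and slide durations are arbitrary illustrative choices; the one hard rule is that window and slide durations must be multiples of the batch interval.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWindowDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingWindowDemo").setMaster("local[2]")
    // Each mini-batch covers one second of events.
    val ssc = new StreamingContext(conf, Seconds(1))
    // A checkpoint directory is required by stateful transformations
    // such as updateStateByKey.
    ssc.checkpoint("checkpoints")

    // Placeholder socket source.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Word counts over a 30-second moving window, recomputed every 10 seconds.
    val windowedCounts = lines
      .flatMap(_.split("""\W+"""))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

    windowedCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```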

Other Spark-based Libraries

  • MLlib for machine learning (see the sketch after this list)
  • Discussion of GraphX for graph algorithms, Tachyon for distributed caching, and BlinkDB for approximate queries
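
To give MLlib a concrete shape, here is a small clustering sketch using the RDD-based KMeans API that was current at the time of this course. The input file is hypothetical (one whitespace-separated numeric vector per line), and k and the iteration cap are arbitrary:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object KMeansDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("KMeansDemo").setMaster("local[*]"))

    // Hypothetical input: one whitespace-separated feature vector per line.
    val points = sc.textFile("data/points.txt")
      .map(line => Vectors.dense(line.trim.split("""\s+""").map(_.toDouble)))
      .cache()  // iterative algorithms re-read the data many times

    val model = KMeans.train(points, k = 3, maxIterations = 20)
    model.clusterCenters.foreach(println)

    // Within Set Sum of Squared Errors: a rough measure of cluster tightness.
    println(s"WSSSE = ${model.computeCost(points)}")
    sc.stop()
  }
}
```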

Deploying to clusters

  • Spark’s clustering abstractions: cluster vs. client deployments, coarse-grained and fine-grained process management (see the sketch after this list)
  • Standalone mode
  • Mesos
  • Hadoop YARN
  • EC2
  • Cassandra rings
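
The practical takeaway is that application code is independent of the cluster manager: only the master URL changes, and in practice it is usually supplied via spark-submit rather than hardcoded. A sketch of the Spark 1.x-era master URLs, with placeholder hostnames:

```scala
import org.apache.spark.SparkConf

// The same application runs unchanged on each cluster manager;
// in deployment the master is normally passed with `spark-submit --master ...`.
object MasterUrls {
  val standalone = new SparkConf().setMaster("spark://master-host:7077") // Spark standalone mode
  val mesos      = new SparkConf().setMaster("mesos://master-host:5050") // Apache Mesos
  val yarn       = new SparkConf().setMaster("yarn-cluster")             // Hadoop YARN (Spark 1.x syntax)
  val local      = new SparkConf().setMaster("local[*]")                 // all local cores, for development
}
```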

Using Spark with the Lightbend Reactive Platform

  • Akka Streams and Spark Streaming (see the sketch below)
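
One concrete integration point in Spark 1.x was the actor-based receiver, which lets an Akka actor feed records into a DStream. A minimal sketch (this receiver API shipped with Spark Streaming at the time of this course; it was later moved out of core Spark):

```scala
import akka.actor.{Actor, Props}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.receiver.ActorHelper

// An Akka actor that forwards every string it receives into
// Spark Streaming via ActorHelper.store.
class Forwarder extends Actor with ActorHelper {
  def receive = {
    case s: String => store(s)
  }
}

object AkkaSparkDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("AkkaSparkDemo").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(1))

    // Messages sent to this actor become records in the DStream.
    val stream = ssc.actorStream[String](Props[Forwarder], "forwarder")
    stream.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```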

Conclusions

Audience

If you are an experienced developer and would like to learn how to write data-centric applications using Spark, this Apache Spark course is for you!

Prerequisites

To benefit from this Apache Spark course, you should have prior experience using Scala on a project, or have attended our Lightbend Scala Language - Professional course. Some prior experience with SQL, machine learning, and other Big Data tools will be helpful, but is not essential.

Bring your own hardware

Please bring your own laptop to this course, as it will help you put your newly learned skills into practice after the course using the same environment. If you are unable to bring a laptop, please contact us as soon as possible (on +44 20 7183 9040, or email us) and we'll sort something out for you!

Setup instructions for your laptop will arrive a week or so before the training.

Please submit all laptop requests a minimum of 48 hours prior to the course as laptops are subject to availability.
