Many organisations struggle to get machine learning projects to market quickly and to let data science teams share their features. In this talk, I will discuss the machine learning pipeline developed at a large Australian telecommunications company to achieve both goals using Kafka and Spark, as well as the challenges faced along the way. I’ll begin with the utility of, and motivation for, a centralised feature store, before looking at the technical and organisational complexities of such an undertaking. We will then dig into the implementation: the scalability headaches we faced and the solutions used to drastically improve the speed and organisational scalability of the system. Topics covered include providing a declarative API that allowed us to compile feature definitions into optimised Spark code, the complexity of a true streaming dedupe, adjusting the workflow for different machine learning use cases, fine-tuning resource allocation to avoid unnecessary bottlenecks, and supporting both streaming and batch data sources. Finally, we will touch on lessons learnt along the way and offer advice on things to avoid and on how to take things to the next stage.