Many organisations struggle to get machine learning projects to market quickly and to let data science teams share their features. In this talk, I will discuss the machine learning pipeline developed at a large Australian telecommunications company to achieve both goals using Kafka and Spark, along with the challenges faced along the way. I’ll begin with the utility of and motivation for a centralised feature store, then look at the complexities of such an undertaking, both technical and organisational. We will then dig into the implementation: the scalability headaches we faced and the solutions that drastically improved the speed and organisational scalability of the system. Areas covered include providing a declarative API that allowed us to compile feature definitions into optimised Spark code, the complexity of a true streaming dedupe, adjusting the workflow for different machine learning use cases, fine-tuning resource allocation to avoid unnecessary bottlenecks, and supporting both streaming and batch data sources. Finally, we will touch on lessons learnt along the way and offer advice on things to avoid and on how to take things to the next stage.