Hnrsnprvmi2bd98lxeiz
SkillsCast

Count-Min Sketch in Real Data Applications

10th December 2015 in London at Business Design Centre

There are 36 other SkillsCasts available from Scala eXchange 2015

Please log in to watch this conference skillscast.

547436379 640

In this talk Laura will briefly explain several probabilistic data structures used for approximate query answering, such as a Bloom Filter, HyperLogLog and Count-Min Sketch, as well as present canonical examples of use at Twitter for each of the structures mentioned. You will learn how these structures are implemented in Twitter’s open-source abstract algebra library Algebird and how they are used by Summingbird - an open-source domain-specific language (implemented in Scala) that integrates online and batch MapReduce computations in a single framework. 

Queries such as whether a user has seen a tweet, the number of unique users who have favourited a tweet, or the number of times a tweet was seen over a period of time, are all hard to answer exactly due to having huge datasets and tight time constraints. However, the exact number is often not necessary as long as the errors are bounded. Certain algebraic data structures enable approximate answers to the queries mentioned, owing to distributed computations performed over them that are guaranteed to be correct because of their algebraic properties.

Take a look at Algebird and Summingbird.

YOU MAY ALSO LIKE:

Count-Min Sketch in Real Data Applications

Laura Bledaite

Laura is a data scientist at Twitter with a taste for functional programming. She mostly writes Scala, though she really admires O'Caml as well. Laura knows a bit about recommender systems, statistics and math. She did her BSc in Financial and Actuarial Mathematics in Vilnius University, where with several friends she wrote a thesis about The Role of Tail Index in Analysis of Currency Returns.

SkillsCast

Please log in to watch this conference skillscast.

547436379 640

In this talk Laura will briefly explain several probabilistic data structures used for approximate query answering, such as a Bloom Filter, HyperLogLog and Count-Min Sketch, as well as present canonical examples of use at Twitter for each of the structures mentioned. You will learn how these structures are implemented in Twitter’s open-source abstract algebra library Algebird and how they are used by Summingbird - an open-source domain-specific language (implemented in Scala) that integrates online and batch MapReduce computations in a single framework. 

Queries such as whether a user has seen a tweet, the number of unique users who have favourited a tweet, or the number of times a tweet was seen over a period of time, are all hard to answer exactly due to having huge datasets and tight time constraints. However, the exact number is often not necessary as long as the errors are bounded. Certain algebraic data structures enable approximate answers to the queries mentioned, owing to distributed computations performed over them that are guaranteed to be correct because of their algebraic properties.

Take a look at Algebird and Summingbird.

YOU MAY ALSO LIKE:

About the Speaker

Count-Min Sketch in Real Data Applications

Laura Bledaite

Laura is a data scientist at Twitter with a taste for functional programming. She mostly writes Scala, though she really admires O'Caml as well. Laura knows a bit about recommender systems, statistics and math. She did her BSc in Financial and Actuarial Mathematics in Vilnius University, where with several friends she wrote a thesis about The Role of Tail Index in Analysis of Currency Returns.

Photos