5th July 2018 in London at CodeNode

There are 22 other SkillsCasts available from Infiniteconf 2018 - The conference on Big Data and AI

In this talk, we will look at how to summarize large, potentially unbounded streams of data efficiently (in both space and time) by approximating the underlying distribution using so-called sketch algorithms. The main approach we will look at is summarization via histograms. Histograms have a number of desirable properties: they work well in an online setting, are embarrassingly parallel, and are space-bounded. They also capture the entire (empirical) distribution, which often gets lost when doing descriptive statistics.
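To make the histogram properties above concrete, here is a minimal sketch of one common way to build a space-bounded streaming histogram: keep at most a fixed number of (centroid, count) bins and merge the two closest bins whenever the limit is exceeded. This is an illustrative assumption, not necessarily the algorithm presented in the talk; the class name `StreamingHistogram` and the parameter `max_bins` are hypothetical.

```python
import bisect


class StreamingHistogram:
    """Illustrative bounded-size histogram (hypothetical names, not from the talk)."""

    def __init__(self, max_bins=32):
        if max_bins < 1:
            raise ValueError("max_bins must be at least 1")
        self.max_bins = max_bins
        self.bins = []  # sorted list of [centroid, count]

    def update(self, x, count=1):
        """Online update: add one observation (or a pre-counted bin)."""
        centroids = [b[0] for b in self.bins]
        i = bisect.bisect_left(centroids, x)
        if i < len(self.bins) and self.bins[i][0] == x:
            self.bins[i][1] += count
        else:
            self.bins.insert(i, [x, count])
        self._compress()

    def _compress(self):
        """Merge the closest adjacent bins until the size bound holds."""
        while len(self.bins) > self.max_bins:
            i = min(
                range(len(self.bins) - 1),
                key=lambda j: self.bins[j + 1][0] - self.bins[j][0],
            )
            (c1, n1), (c2, n2) = self.bins[i], self.bins[i + 1]
            # The weighted mean keeps the overall sum of the data unchanged.
            self.bins[i:i + 2] = [[(c1 * n1 + c2 * n2) / (n1 + n2), n1 + n2]]

    def merge(self, other):
        """Combine two histograms, e.g. ones built in parallel on separate shards."""
        merged = StreamingHistogram(self.max_bins)
        merged.bins = [list(b) for b in self.bins]
        for centroid, count in other.bins:
            merged.update(centroid, count)
        return merged

    @property
    def total(self):
        return sum(n for _, n in self.bins)

    def mean(self):
        return sum(c * n for c, n in self.bins) / self.total
```

Memory stays bounded by `max_bins` regardless of stream length, and `merge` is what makes the structure embarrassingly parallel: each worker can summarize its own shard and the results can be combined afterwards.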

Building on that, we will delve into the related problems of sampling in a streaming setting and updating in a batch setting, and highlight some cool tricks such as capturing time dynamics via data snapshotting. To finish off, we will touch upon algorithms for summarizing categorical data, most notably the count-min sketch. The talk is motivated by my work at Metabase, an open-source analytics tool, where we rely heavily on histograms in building our "data scientist in a box".
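For readers unfamiliar with the count-min sketch mentioned above, the following is a minimal illustration of the idea: a small grid of counters, one hash function per row, and a frequency estimate taken as the minimum over rows. The class name, the default `width` and `depth`, and the choice of salted BLAKE2b as the hash family are assumptions made for this example, not details from the talk.

```python
import hashlib


class CountMinSketch:
    """Illustrative count-min sketch for approximate frequencies of categorical items."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        """Yield one (row, column) cell per row, using a differently salted hash per row."""
        data = str(item).encode("utf-8")
        for row in range(self.depth):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=row.to_bytes(8, "little")
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        """Never undercounts; collisions can only inflate the estimate."""
        return min(self.table[row][col] for row, col in self._cells(item))


cms = CountMinSketch()
for word in ["apple", "pear", "apple", "plum", "apple"]:
    cms.add(word)
print(cms.estimate("apple"))  # at least 3; exact unless every row collides
```

A wider table lowers the chance of collisions, and more rows lower the chance that all of them collide at once, so both parameters trade memory for accuracy.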

Sketch Algorithms

Simon Belak

Simon built his first computer out of Lego bricks and learned to program soon after. Emergence, networks, modes of thought, and the limits of language and expression are what make him smile (and stay up at night). The combination of Lisp and machine learning put him on the path of always striving to make himself redundant, if not outright obsolete.
