In this talk, we will look at how to efficiently (in both space and time) summarize large, potentially unbounded streams of data by approximating the underlying distribution using so-called sketch algorithms. The main approach we will explore is summarization via histograms. Histograms have a number of desirable properties: they work well in an online setting, are embarrassingly parallel, and are space-bounded. They also capture the entire (empirical) distribution, something that is often lost when reporting only descriptive statistics.
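The properties mentioned above can be illustrated with a minimal streaming histogram in the spirit of Ben-Haim and Tom-Yeh's merging-bin scheme. This is an illustrative sketch, not the implementation the talk describes: the class and parameter names are made up, and real implementations add interpolation for quantile queries.

```python
import bisect

class StreamingHistogram:
    """Fixed-size summary of an unbounded numeric stream: every point
    becomes a (value, count) bin, and whenever the bin budget is
    exceeded the two closest bins are merged into their weighted mean."""

    def __init__(self, max_bins=64):
        self.max_bins = max_bins          # space-bounded by construction
        self.bins = []                    # [value, count] pairs, sorted by value

    def _insert(self, value, count):
        bisect.insort(self.bins, [value, count])
        while len(self.bins) > self.max_bins:
            # merge the adjacent pair whose centers are closest together
            j = min(range(len(self.bins) - 1),
                    key=lambda k: self.bins[k + 1][0] - self.bins[k][0])
            (v1, c1), (v2, c2) = self.bins[j], self.bins[j + 1]
            self.bins[j:j + 2] = [[(v1 * c1 + v2 * c2) / (c1 + c2), c1 + c2]]

    def update(self, x):
        # online: one cheap insert-and-maybe-merge per incoming point
        self._insert(x, 1)

    def merge(self, other):
        # embarrassingly parallel: per-shard histograms combine by
        # re-inserting bins under the same bin budget
        for v, c in other.bins:
            self._insert(v, c)
```

Because `merge` is associative up to approximation error, shards of a stream can be summarized independently and combined at the end.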
Building on that, we will delve into the related problems of sampling in a streaming setting and updating in a batch setting, and highlight some neat tricks such as capturing time dynamics via data snapshotting. To finish, we will touch on algorithms for summarizing categorical data, most notably the count-min sketch. The talk is motivated by my work at Metabase -- an open-source analytics tool -- where we make heavy use of histograms in building our "data scientist in a box".
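For the categorical side, the count-min sketch can be sketched in a few lines: several rows of counters, each indexed by an independent hash of the item, with an item's frequency estimated as the minimum of its counters (which can over-count due to collisions, but never under-count). This is a minimal illustration with made-up names, using keyed BLAKE2 digests as the row hashes for determinism.

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counts for categorical data in O(width * depth)
    space, independent of the number of distinct items seen."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # one independent hash per row, derived by salting with the row id
        digest = hashlib.blake2b(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def update(self, item, count=1):
        for r in range(self.depth):
            self.table[r][self._index(r, item)] += count

    def estimate(self, item):
        # collisions only inflate counters, so the minimum over rows
        # is the tightest available upper bound on the true count
        return min(self.table[r][self._index(r, item)]
                   for r in range(self.depth))
```

Wider rows lower the expected over-count; more rows lower the probability that every row's counter is inflated at once.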
Simon built his first computer out of Lego bricks and learned to program soon after. Emergence, networks, modes of thought, and the limits of language and expression are what make him smile (and stay up at night). The combination of Lisp and machine learning has put him on the path of always striving to make himself redundant, if not outright obsolete.