16 Mar 2013 titus   » (Journeyer)

My 2013 PyCon talk: Awesome Big Data Algorithms

Schedule link

Description

Random algorithms and probabilistic data structures are algorithmically efficient and can provide shockingly good practical results. I will give a practical introduction, with live demos and bad jokes, to this fascinating algorithmic niche. I will conclude with some discussions of how our group has applied this to large sequencing data sets (although this will not be the focus of the talk).

Abstract:

I propose to start with Python implementations of most of the DS & A mentioned in this excellent blog post:

http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/

and also discuss skip lists and any other random algorithms that catch my fancy. I'll put everything together in an IPython notebook and add visualizations as appropriate.

I'll finish with some discussion of how we've put these approaches to work in my lab's research, which focuses on compressive approaches to large data sets (and is regularly featured in my Python-ic blog, http://ivory.idyll.org/blog/).

SkipLists:

skiplist IPython Notebook

Wikipedia page on SkipLists

John Shipman's excellent writeup

William Pugh's original article

The HackerNews (oops!) reference for my reddit-attributed quote about putting a gun to someone's head and asking them to write a log-time algorithm for storing stuff: https://news.ycombinator.com/item?id=2670632

HyperLogLog:

coinflips IPython Notebook

HyperLogLog counter IPython Notebook

Aggregate Knowledge's EXCELLENT blog post on HyperLogLog. The section on Big Pattern Observables is truly fantastic :)

Flajolet et al. is the original paper. It gets a bit technical in the middle but the discussions are great.

Nick Johnson's blog post on cardinality estimation

MetaMarkets' blog post on cardinality counting

More High Scalability blog posts, this one by Matt Abrams

The obligatory Stack Overflow Q&A

Vasily Evseenko's git repo https://github.com/svpcom/hyperloglog, forked from Nelson Goncalves's git repo, https://github.com/goncalvesnelson/Log-Log-Sketch, served as the source for my IPython Notebook.

Bloom Filters:

Bloom filter notebook

The Wikipedia page is pretty good.

Everything I know about Bloom filters comes from my research.

I briefly mentioned the CountMin Sketch, which is an extension of the basic Bloom filter approach, for counting frequency distributions of objects.

Other nifty things to look at

Quotient filters

Rapidly-exploring random trees

Random binary trees

Treaps

StackOverflow question on Bloom-filter like structures that go the other way

A survey of probabilistic data structures

K-Minimum Values over at Aggregate Knowledge again

Set operations on HyperLogLog counters, again over at Aggregate Knowledge.

Using SkipLists to calculate an efficient running median

Syndicated 2013-03-15 23:00:00 from Living in an Ivory Basement

Latest blog entries     Older blog entries

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!