MapReduce, Big Data... I'm confused

I'm feeling a little like Goldilocks and her porridge at the moment - some of the explanations were toooooo complicated and others were tooooo simple - and nothing was just right... and then I found it - so just wanting to share. (February 2011)

If you already code in Erlang or Lisp, then MapReduce is pretty simple stuff. If you're not really that geeky, you probably don't care. But if there's a little geek inside you and you've programmed in COBOL, for instance, you're probably curious as to exactly how MapReduce works. I needed an explanation - here's what I found:

MapReduce

This is Pete Warden's attempt at MapReduce for Idiots. Here's his 10-minute chat on Slideshare. Unfortunately, Pete is not actually an idiot. I am, though. Still needed more help.

I came across a slideshow on Saliya's Blog, in a post entitled Map Reduce Explained Simply as the Story of Sam - but although the pulped fruit analogy was cute, it didn't quite quench my thirst.

Google's Code University has a better description in its MapReduce Tutorial - this got me pretty far down the track on how to parallelize a task, but then I got lost when I hit the second chunk of code. Thud! That was me hitting the wall.

This was the one that explained it to me, from the blog of Kaushik Sathupadi - Map Reduce: A really simple introduction. Finally, an explanation that didn't use code - one that put it all very simply, as an imaginative human-worker problem. Read this.

If you want something to watch, there's a 30-minute explanation of Big Data, Hadoop etc. from Cloudera's CEO Mike Olson. His talk nicely explains the context for these concepts in today's data environment: what Hadoop is, and how it relates to other big data terms like MapReduce.

My Quick Summary

Map Reduce works with problems that can be broken down into sub-tasks, each of which can be worked on individually and independently. The results of those sub-tasks are additive, and consolidating them gives the solution to the original problem. The 'individually and independently' part is key to why Map Reduce is such big news - because the task can be distributed over many independent machines, you can use elastic compute clouds that scale up (and down) with the amount of data.
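To make that concrete, here's a tiny word-count sketch in Python - my own toy example, not code from any of the links above, and the function names are just for illustration. The map step counts words in each chunk on its own, and the reduce step merges the independent partial counts. A local process pool stands in for what Hadoop would do across many machines.

```python
# Toy word-count in the MapReduce style: each chunk is mapped
# independently (so the work could run on separate machines), then
# the partial results are reduced (consolidated) into one answer.
from collections import Counter
from multiprocessing import Pool

def map_chunk(chunk):
    """Map step: count words in one chunk, knowing nothing about other chunks."""
    return Counter(chunk.split())

def reduce_counts(partial_counts):
    """Reduce step: merge the independent partial counts into the final result."""
    total = Counter()
    for counts in partial_counts:
        total += counts
    return total

if __name__ == "__main__":
    chunks = [
        "the quick brown fox",
        "the lazy dog",
        "the fox jumps over the lazy dog",
    ]
    with Pool() as pool:                    # each chunk handled independently
        partials = pool.map(map_chunk, chunks)
    print(reduce_counts(partials))          # Counter({'the': 4, 'fox': 2, ...})
```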

Stuff to Read