Hadoop is NOT “Big Data” is NOT Analytics

I am amazed at the way the words “Hadoop”, “Big Data” and “Analytics” are bandied about in a very haphazard fashion these days. For those desirous of working in the field of Analytics (especially the very young but also some not so young), my earnest entreaty is to understand that these three words mean very different things. Using them interchangeably just demonstrates ignorance rather than expertise.

Perhaps a bit of history would help to give some perspective. Folks in academia have been solving “big data” problems for a long time, using the power of cluster and distributed computing to solve embarrassingly parallel problems. Before the advent of inexpensive “cloud-based” resources, universities and research organizations would build their own very large “super clusters” using commodity off-the-shelf (COTS) components or, if you go back even further, would use large shared-memory computers built on a non-uniform memory access (NUMA) architecture (the likes of Silicon Graphics sold these). As research and some large industrial organizations started building “Beowulf” clusters, they started putting together operating system packages (like ROCKS from the University of California at San Diego and the San Diego Supercomputer Center), and that made it easier for people to quickly set up their clusters. Of course, people still had to write distributed applications on them using scripting languages or message passing systems like MPI or PVM. Coders had to keep track of the inter-node communications themselves, making sure that the right pieces of information arrived at the right place at the right time.

The terms “Big Data” and “Hadoop” have gained favor in recent times. Hadoop, by abstracting away the pain of inter-node communication, has made it fairly easy for programmers to take any embarrassingly parallel problem and quickly task-farm it across large clusters. Big Data, on the other hand, is to me just the fuel that Hadoop works on, converting it into a form amenable to analysis. A person who is able to write code using Hadoop and the associated frameworks is not necessarily someone who can understand the underlying patterns in that data and come up with actionable insights. That is what a data scientist is supposed to do. Conversely, data scientists might not be able to write the code to convert “Big Data” into “actionable” data. That’s what a Hadoop practitioner does. These are very distinct job descriptions.
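
To make the “abstracting away” point a little more concrete, here is a minimal sketch of the map/shuffle/reduce pattern that Hadoop popularized, written in plain Python and run in a single process. It is not Hadoop's API: the function names (map_words, shuffle, reduce_counts, word_count) are made up for illustration, and the shuffle step here is just an in-memory dictionary standing in for the cross-node grouping a real framework would handle.

```python
from collections import defaultdict
from itertools import chain


def map_words(line):
    """Map step (hypothetical name): emit a (word, 1) pair per word in one line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group values by key.

    On a real cluster the framework moves data between nodes here;
    in this sketch it is only an in-memory dictionary.
    """
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step (hypothetical name): combine all values seen for one key."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    mapped = chain.from_iterable(map_words(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data is not hadoop", "hadoop is not analytics"]
    print(word_count(sample))
```

The programmer only writes the map and reduce functions. Everything in between, which is where the MPI-era bookkeeping used to live, is the framework's job. Note that nothing in this code says what the resulting counts mean, and that is where the analysis begins.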

While the term “analytics” has become a catch-all phrase used across the entire value chain, I personally prefer to reserve it for the job of actually working with the data to get analytical insights. That keeps the analysis itself distinct from the upstream and downstream elements of the data mining workflow.

Thoughts/suggestions/critiques are welcome!