Data analysis with Hadoop and Hive. New tools. Old game.

I’ve done a lot of data analysis in my life because for more than half of my career I was part of an actual business rather than an ISV. I’ve worked for a commodity exchange broker, wholesale and retail companies, Coca-Cola, Bayer, and other businesses where I did a lot of analytical work on top of software development. I did my first data analysis project in the late 80s with Lotus 1-2-3, when circumstances forced me to pause software development and sell jeans (BTW, that was a great experience in the end: I learned how to make a sale). Since then I’ve learned the power of databases and spreadsheets, which allowed me to dig into the data and help businesses grow.
Recently I did another data analysis project with new tools: Hadoop and Hive. I’ve been working with them for the last 2+ years, but more from a development perspective. This was the first time I did actual data digging with them. Just an hour into the analysis I understood that although these are new tools, the actual process is the same. Yes, they provide some structural (sets, arrays, maps) and scalability advantages, but one needs the same analytical state of mind to get answers from the data. And the fact that you have this power doesn’t mean that you have to use it right away.

Getting answers from the data is an iterative process: you write a query, analyze the result, refine your queries, and repeat. That’s when you want to move fast, before you write The Query that will give you the ultimate answer. This speed in finding the path is especially important with large data sets and Hadoop latency: start small and move fast until you are really ready to “Release the Kraken”.
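
To make the “start small” idea concrete, here is a minimal sketch in plain Python rather than Hive. Everything in it is hypothetical and only for illustration: the `sample_rows` and `revenue_by_region` helpers, the `region` and `amount` fields, and the randomly generated data standing in for a large table. The point is just the workflow: iterate on the logic against a small sample, then run the settled version once over everything.

```python
import random
from collections import defaultdict


def sample_rows(rows, fraction, seed=42):
    """Return a reproducible random subset of rows (roughly `fraction` of them)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


def revenue_by_region(rows):
    """The candidate 'query': total amount per region."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


# Hypothetical data standing in for a large table.
regions = ["north", "south", "east", "west"]
data_rng = random.Random(0)
full_data = [
    {"region": data_rng.choice(regions), "amount": data_rng.uniform(1, 100)}
    for _ in range(10_000)
]

# Iterate on the logic against about 1% of the data; this is the fast loop.
small = sample_rows(full_data, fraction=0.01)
draft = revenue_by_region(small)

# Only once the query looks right, run it over the full data set.
final = revenue_by_region(full_data)
```

In Hive itself the same habit would probably mean adding a `LIMIT` clause or a `TABLESAMPLE` to the early queries, or pulling a slice of the data into a spreadsheet, and saving the full-table run for the end.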

So I think that Big Data analysis doesn’t have a lot to do with Hadoop & Co. Well … unless you are on the operations side. It’s still more about how to dig into the data and quickly find a way to answer the question before turning it into a repeatable process suitable for the tools.
And that’s why learning how to use spreadsheets to analyze data is so important: it teaches you how to find the answers.