As a technical person, my first question about Big Data was what it actually is. I think one of the most relevant definitions of Big Data was provided by Gartner Inc. The definition is the following.

Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation. [1]

My curiosity was to find out how Big Data will create a new scientific revolution. I read two books about this topic that convinced me of the revolution that Big Data ignited. The books are listed below.

I am going to outline a few things that I learnt from these two books here. I have read a few other books on Big Data and its impact, but I think these two are the best.

Inductive Reasoning

Most scientific theories are established through “inductive reasoning”. The drawback of inductive reasoning is that it can sometimes lead us to false conclusions even when the premises are true. For example:

  • I tossed a coin 20 times and got heads every time
  • Therefore, whenever I toss the coin, I will get only heads

The main lesson in this example is that theory and a small experiment can disagree. In theory, the probability of getting heads on a single toss of a fair coin is 1/2, but a short run of tosses may not reflect that. The probability we estimate from 20 trials may turn out to be completely wrong once we increase the number of trials to 1,000 or more.
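To make this concrete, here is a minimal sketch that simulates a fair coin and compares the estimate from a few tosses with the estimate from many. The function name `estimate_heads_probability` and the seed values are my own choices for illustration, and the coin is assumed to be perfectly fair.

```python
import random

def estimate_heads_probability(num_tosses, seed=None):
    """Estimate P(heads) for a simulated fair coin from num_tosses tosses."""
    rng = random.Random(seed)  # seeded generator so runs are repeatable
    heads = sum(1 for _ in range(num_tosses) if rng.random() < 0.5)
    return heads / num_tosses

# Chance that a fair coin shows heads 20 times in a row: tiny, but not zero.
p_twenty_heads = 0.5 ** 20

if __name__ == "__main__":
    # Small samples can stray far from 1/2; large samples settle close to it.
    for n in (20, 1000, 100000):
        print(n, "tosses ->", estimate_heads_probability(n, seed=42))
    print("P(20 heads in a row) =", p_twenty_heads)
```

With only 20 tosses the estimate can land well away from 1/2, while the estimate from 100,000 tosses should sit very close to it. Getting 20 heads in a row has a probability of about one in a million, so someone tossing coins long enough will eventually see it and may draw the wrong conclusion.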

Quite a few scientific theories and disease diagnoses rest on inductive reasoning from a small sample set. In the world of Big Data, the size of the sample set is going to increase dramatically, which can take us closer to the truth. Most research results based on extrapolation should also be reviewed again with a much larger sample size.

Correlation

Most people who have taken a Predictive Analytics or Machine Learning course will have come across this term. Correlation is now the output expected from almost any analysis. Only when we find a proxy for an event can we make predictions about that event. The best example is the correlation between lightning and thunder. For most people, knowing when thunder will occur is enough; they do not need to be taught what causes lightning and thunder. It may be even more useful to show them, with some numbers, how to predict when thunder will occur.

  • Suppose my research at a particular location on Earth showed that, 60% of the time thunder occurred, lightning preceded it by 10 seconds.
  • Then lightning is my proxy event for predicting thunder.
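A minimal sketch of how such a figure could be computed from observations: given timestamps for lightning flashes and thunder claps, count the share of thunder events that had a flash in the preceding 10 seconds. The function name `fraction_preceded` and the timestamps are invented for illustration and are not real measurements.

```python
from bisect import bisect_left

def fraction_preceded(thunder_times, lightning_times, max_lag=10.0):
    """Share of thunder events with a lightning flash in the max_lag seconds before them."""
    if not thunder_times:
        return 0.0
    flashes = sorted(lightning_times)
    hits = 0
    for t in thunder_times:
        # Look for any flash in the half-open window [t - max_lag, t).
        first = bisect_left(flashes, t - max_lag)
        last = bisect_left(flashes, t)
        if last > first:
            hits += 1
    return hits / len(thunder_times)

# Hypothetical timestamps in seconds, made up for this example.
lightning = [0.0, 50.0, 120.0]
thunder = [8.0, 55.0, 100.0, 125.0, 300.0]
share = fraction_preceded(thunder, lightning)
```

In this made-up data, three of the five thunder events follow a flash within 10 seconds, so `share` comes out at 0.6, matching the 60% in the example above.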

The best use of this is in the e-commerce domain. If 60% of customers who bought product A also bought product B, then I need to watch the sales of product A to predict the sales of product B.
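The same idea can be sketched for shopping baskets: among the orders that contain product A, what fraction also contain product B? The helper name `co_purchase_rate`, the sample orders and the forecast of 200 units are all assumptions made up for this illustration.

```python
def co_purchase_rate(orders, product_a, product_b):
    """Fraction of orders containing product_a that also contain product_b."""
    with_a = [basket for basket in orders if product_a in basket]
    if not with_a:
        return 0.0  # nobody bought product_a, so there is nothing to measure
    with_both = sum(1 for basket in with_a if product_b in basket)
    return with_both / len(with_a)

# Hypothetical orders, invented for illustration only.
orders = [
    {"A", "B"},
    {"A", "B", "C"},
    {"A", "B"},
    {"A"},
    {"A", "C"},
    {"B"},
    {"C"},
]

rate = co_purchase_rate(orders, "A", "B")

# If 200 sales of product A are forecast, a rough estimate of the
# product B sales that come along with them is rate * 200.
expected_b_with_a = rate * 200
```

Here five orders contain A and three of those also contain B, so `rate` is 0.6. This only says that the two purchases tend to occur together; it says nothing about why, which is exactly the point of using a proxy.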

These two reasons are good enough to convince me of the revolution that Big Data has started. In my next post I will cover the types of projects that a technical person can expect to work on in the Big Data domain.