Okay, the time has come to retire “Big Data”. There was a great post in TechCrunch at the beginning of the year, “Why We Need to Kill Big Data”, as well as a good tweet by Dr. Wells Martin on the subject. The industry is ready for something new, and “Big Data” 1) is yesterday’s news, 2) doesn’t really describe what consumers want, and 3) segments the market in an unrealistic way.
The term arose in 2009 / 2010, when more and more organizations began to realize that connecting to and analyzing lots of data would yield a competitive advantage over their industry peers (making them 2.4 times more likely to outperform those peers, in fact). Combine that with the new companies that had started to develop capabilities around Hadoop, and we needed to create a new industry segment. (One of the reasons why I love Storage: we can create and kill industry segments whenever we want! – Remember ILM?) Hence “Big Data”, as it was mostly large enterprise companies analyzing large amounts of data.
However, as time moved forward, new capabilities became available to analyze lots of data. Additionally, open APIs let consumers reach into places such as social media sites (Twitter, for example) and sift through lots and lots of data before bringing it behind their firewall. Once inside, businesses could do more analysis before actioning a decision based on their research. In fact, there are now over 7,000 open APIs that customers can write to in order to analyze data. These include not only social media sites like Twitter and Facebook, but also The National Weather Bureau and The Census Bureau, to name a couple. Essentially, this access to information has created the equivalent of the periodic table of elements for data. By putting certain aspects of this data together, businesses can create actionable events that make their business more competitive. Let me give you an example.
A decade ago, Wal-Mart might have made a business decision to ship 110% of X sweaters to Maine, where “X” was the number of sweaters sold last year and the extra 10% accounted for the store being more popular than it was the previous year. Today, Wal-Mart has the ability to look at historical weather data to estimate how cold it may be in Maine this year, which affects how many sweaters they ship. They could look at census data to learn that fewer people now live in Maine, broken down into x males and y females in certain age ranges; this helps them know how many of each size, gender, and color to ship. They can also tap into Facebook and see that sweaters are out this year and vests are the new fashion statement. Next, they could look at Twitter and see a number of negative comments about a particular type of sweater that didn’t wear well, telling them to ship fewer sweaters of that brand. The end result is that Wal-Mart will have less “wrong” inventory on the shelves taking up valuable space, and will waste less money shipping too many of the wrong items to stores and then on to secondary discount stores. This helps Wal-Mart reduce costs and pass the savings on to the consumer, so if you want that sweater, you will probably get one that fits, is your color, and will be less expensive.
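The reasoning above can be sketched as a toy calculation. Everything here – the factor names, the weights, and the numbers – is hypothetical, invented just to show how several external data sources might combine into one shipment decision:

```python
def sweater_shipment(last_year_sold, store_growth, weather_factor,
                     population_factor, fashion_factor):
    """Toy model: adjust last year's shipment by several external signals.

    Each factor is a multiplier derived from an outside data source:
      weather_factor    - historical weather data (colder winter -> > 1.0)
      population_factor - census data (fewer residents -> < 1.0)
      fashion_factor    - social media sentiment (sweaters "out" -> < 1.0)
    """
    baseline = last_year_sold * (1 + store_growth)  # the decade-old rule
    return round(baseline * weather_factor * population_factor * fashion_factor)

# The old rule alone: 1,000 sweaters sold last year, store 10% more popular.
old_estimate = sweater_shipment(1000, 0.10, 1.0, 1.0, 1.0)   # -> 1100

# Add external signals: mild winter (0.9), shrinking population (0.95),
# vests in fashion this year (0.8).
new_estimate = sweater_shipment(1000, 0.10, 0.9, 0.95, 0.8)  # -> 752
```

The point is not the particular weights but the shape of the decision: each open data source contributes one signal, and together they shrink the amount of “wrong” inventory.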
Last year I took a trip to Vietnam, where I had been specifically asked to speak on the topic of “Big Data”. When I got there, the conference coordinator told me he didn’t really know what I would be talking about, because Vietnam doesn’t really have “Big Data” issues (they do seem to have some networking issues, though). I believe he felt that “Big Data” really meant storing and managing lots of data. I looked up “Big Data” on Wikipedia, and its definition is:
Nowhere in this description of “Big Data” does it talk about managing petabytes of information. I put the definition up at the beginning of my presentation and told the audience that while they may be used to managing only a terabyte of data, with new capabilities they can reach out onto the world wide interweb, analyze resources they never could before, and bring back only a few hundred megabytes of information to use in their decision-making process. So the term “Big Data” is really a misnomer.
So what is it that companies really want? They want to analyze as much data as they can to gain a competitive advantage. And what else gives you a competitive advantage? Time. If you are the first to make the right decision, you gain time-to-market advantages over the competition. If “Big Data” today doesn’t mean “a lot” of data, and when you perform your analysis you want answers ASAP, then what businesses really want is “Real-time Data”. (That isn’t to say more data isn’t better – I’ll get to that in a moment.) If businesses can analyze the right data today, in real time, and quickly create actionable events, then they can turn their business faster, be more flexible than the competition, and gain a competitive advantage that allows them to prosper.
So what does this all mean? Well, there are a few dynamics to Real-time Data. First, businesses need to keep as much data on-line and available as feasible. Here is the biggest, and probably most important, reason. As a great data scientist once told me, “You are only as smart today as the most recent knowledge you have.” What this means is that today you have some knowledge set that allows you to ask your data questions that will yield a set of answers. The challenge is that tomorrow you may gain some new knowledge that allows you to ask a new question – a question that could yield a better answer and require you to change your business thinking. By keeping the data on-line and available, you can ask your data questions whenever you want.
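A minimal sketch of that idea, using made-up sales records: because the data was kept on-line rather than archived, tomorrow’s new question can be asked against the same records without collecting anything again:

```python
# Hypothetical retained data set: every record kept on-line, not archived.
sales = [
    {"region": "Maine", "item": "sweater", "brand": "A", "units": 120},
    {"region": "Maine", "item": "sweater", "brand": "B", "units": 45},
    {"region": "Texas", "item": "sweater", "brand": "A", "units": 15},
]

# Today's question: how many sweaters did we sell in Maine?
maine_total = sum(r["units"] for r in sales
                  if r["region"] == "Maine" and r["item"] == "sweater")

# Tomorrow's new question (say, prompted by negative tweets about brand B):
# how exposed are we to brand B? Same data, no re-collection needed.
brand_b_total = sum(r["units"] for r in sales if r["brand"] == "B")
```

The value is in the second query: it wasn’t imaginable when the data was collected, but because the data stayed available, it cost nothing to ask.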
Real-time Data does require some new thinking from IT. Keeping as much capacity on-line and available at the right cost is still a challenge. Additionally, IT needs to think about how to access these 7,000 APIs and how much data they may bring into their shop at any time. It also forces IT to rethink backup: if you can reach out onto the world wide interweb and run some analytics at any moment to get a business result, do you need to back that result up? Maybe not, if you can just re-run the job. Cloud-like flexibility that gives the company the agility it needs becomes more important, especially when businesses can make decisions quickly.
IT really needs to think about how Real-time Data / Analytics will change the way companies start to leverage the most valuable asset in the data center – the DATA!