Tuesday, June 2, 2015

What is different about Big Data?

What is different about Big Data? Other than that it is, er, big?

This is a question that I am frequently asked, most often by people who are plainly skeptical about Big Data simply because it is hugely oversold these days. At the recent Annual Signal and Image Sciences Workshop (more about this later in this blog), one of those skeptics remarked to me that there is a certain sameness to all the articles on Big Data found on the Internet. The first, and often the only, takeaway you get out of those articles is that Big Data is big. Also, the authors of those articles readily give in to the temptation to create clever phrases using the word “big.” The Big Data & Analytics Special Interest Group was inaugurated with a great presentation titled Big Lies and Small Truths. In many articles, the apparently big benefits are touted, mixed in with cautionary mention of some other things Big: Infrastructure, Expense, Dangers, Fatigue and Brother.

So, what is really different about Big Data? To me, the most qualitatively distinguishing aspect of the data in Big Data is this: It has quietly come into existence, with no one in particular having intentionally created it. Where in the past data creation was a purposeful, protocol-driven activity, today vast amounts of data have simply come into existence because all of us choose to live and conduct business digitally. We do not stop to wipe off fingerprints, we let the soda dribble, we do not sweep away breadcrumbs. It is the nature of its creation, largely unintentional, evolving, uncurated, uncontrolled and imperfect, that distinguishes Big Data from the more familiar data. The bigness is not the essential feature, even if bigness is inevitable. This type of data quietly accumulated for years till, quite suddenly, we woke up to the Big Possibilities.

However, Big Problems arise because the data comprising Big Data does not originate from projects specifically designed to facilitate statistical inference and prediction. This brings us back to the event of May 13, the Annual Signal and Image Sciences Workshop hosted and sponsored by Lawrence Livermore National Laboratory. Unlike most forums, in which we hear about problems and innovative solutions that arise from the bigness of the data, this workshop featured several talks highlighting the more fundamental aspects of Big Data.

Many difficulties must be resolved before truly significant insights can be obtained from Big Data. Resolving them requires an understanding of the fundamental limits to information extraction and of the mathematical trade-offs involved in creating algorithms. The workshop offered a chance to hear about the relevance of kernel PCA, spectral clustering, the Johnson-Lindenstrauss lemma, and the Chinese restaurant process. The last has nothing to do with General Tso’s Chicken, which GrubHub’s elementary Big Data analysis tells us is the most popular Chinese dish in the USA.

Is there anything else that makes Big Data fundamentally different? The answer is yes. Compared to data sets studied traditionally, the data sets in most Big Data scenarios are of high dimension. And also high, of course, is the number of high dimensional observations available in the data set. Dimension, as in 2D and 3D, is a familiar concept. But with a little mental stretching, we can understand the dimension of a data set as the number of variables that are measured for each observation.

High dimensionality is troublesome even in scenarios with moderate sized data sets. When there are too many directions, which are what dimensions are, we don’t know which way to look. For instance, take a look at the Communities and Crime Data Set available in the Machine Learning Repository maintained by the University of California at Irvine. This is a 122-dimensional data set of size 1994. And this doesn’t even qualify as Big Data.
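To put a number on "too many directions", here is a small sketch that simply counts how many two-variable and three-variable views exist for a data set of the size quoted above. It uses only the 122 and 1994 figures from the text; the variable names are mine, and no actual data is loaded.

```python
from math import comb

# Figures quoted in the text for the Communities and Crime Data Set.
n_observations = 1994   # number of rows
n_dimensions = 122      # variables measured per observation

# Every pair of variables is a possible 2D scatter plot,
# and every triple is a possible 3D view.
pair_views = comb(n_dimensions, 2)     # 122 * 121 / 2 = 7381
triple_views = comb(n_dimensions, 3)   # 122 * 121 * 120 / 6 = 295240

print(f"{n_observations} observations x {n_dimensions} variables")
print(f"axis-aligned 2D views: {pair_views}")
print(f"axis-aligned 3D views: {triple_views}")
```

And these are only the axis-aligned views. Allowing arbitrary viewing directions, as the turntable idea further on does, makes the number of possible views unbounded.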

If you need any convincing that data sets can look meaningless except when looked at in just the right way, take a look at the 3D data set I specially prepared for this blog. Each data point in the set is a 3D point fixed in a cubical volume of space. One possible way to present such a data set on a 2D screen is this: imagine placing the collection of points on a slowly spinning turntable and simply watch till the points align in a way that makes sense.
In real-life high dimensional data, the complexity hides the meaning deep within the data. You (or more correctly, your algorithm) would not only need to figure out exactly which viewpoint to look from, but would also need to design funny mirrors to undo nonlinear distortions in the data.
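The turntable idea can be sketched in a few lines. The data below is not the data set prepared for this blog; it is a made-up stand-in (points scattered on three parallel vertical sheets at a hidden angle of 37 degrees), and the "stripe score" is just one hypothetical way for an algorithm to notice that a view has snapped into alignment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 600 points on three parallel vertical sheets.
n_points = 600
hidden_angle = np.deg2rad(37.0)  # orientation of the sheets (an assumption)
normal = np.array([np.cos(hidden_angle), np.sin(hidden_angle)])
in_plane = np.array([-np.sin(hidden_angle), np.cos(hidden_angle)])

sheet = rng.choice([-0.3, 0.0, 0.3], size=n_points)  # which sheet a point is on
along = rng.uniform(-0.5, 0.5, size=n_points)        # position along the sheet
height = rng.uniform(-0.5, 0.5, size=n_points)       # vertical coordinate
xy = sheet[:, None] * normal + along[:, None] * in_plane
xy += rng.normal(scale=0.005, size=xy.shape)         # a little measurement noise
points = np.column_stack([xy, height])


def stripe_score(view_deg):
    """Concentration of the horizontal screen coordinate for one turntable angle.

    The score is the sum of squared histogram fractions: it is high when the
    points pile up into a few narrow vertical stripes on the screen.
    """
    phi = np.deg2rad(view_deg)
    screen_x = points[:, 0] * np.cos(phi) + points[:, 1] * np.sin(phi)
    counts, _ = np.histogram(screen_x, bins=33, range=(-0.825, 0.825))
    fractions = counts / counts.sum()
    return float(np.sum(fractions ** 2))


# Spin the turntable one degree at a time and keep the most "striped" view.
angles = np.arange(0, 180)
scores = np.array([stripe_score(a) for a in angles])
best_angle = int(angles[np.argmax(scores)])
print(f"most structured view is near {best_angle} degrees")
```

From almost every angle the projection looks like a shapeless cloud; only within a degree or two of the hidden orientation do the three sheets collapse into clean stripes. A linear sweep like this does nothing about nonlinear distortions, which need different tools.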

Funny things happen in high dimensional spaces. For example, it is quite rare for two random lines drawn on a page (2D) to turn out to be perpendicular to each other. However, in the high dimensional space of many Big Data sets it is practically a given that any two random vectors will be almost perpendicular to each other. In dealing with data sets of high dimensionality, the data scientist is forced (actually the good data scientist is delighted) to think about issues like noise accumulation, spurious correlations, incidental endogeneity and algorithmic instability, which are rare considerations in working with traditional data. These are important issues because, at the end of the day, the data scientist is expected to help arrive at decisions from the data, or at the very least to help decision makers better understand high dimensional data and make informed decisions.
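The near-perpendicularity claim is easy to check numerically. The sketch below assumes "random vector" means a vector with independent standard normal components, which is one common convention and not the only one. It reports the average absolute cosine of the angle between random pairs as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(0)


def mean_abs_cosine(dim, n_pairs=2000):
    """Average |cos(angle)| between pairs of random Gaussian vectors in `dim` dimensions."""
    a = rng.standard_normal((n_pairs, dim))
    b = rng.standard_normal((n_pairs, dim))
    dots = np.sum(a * b, axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return float(np.mean(np.abs(dots / norms)))


results = {dim: mean_abs_cosine(dim) for dim in (2, 10, 100, 1000)}
for dim, value in results.items():
    # A cosine near 0 means the two vectors are close to perpendicular.
    typical_angle = np.degrees(np.arccos(value))
    print(f"dim={dim:5d}  mean |cos| = {value:.3f}  (about {typical_angle:.1f} degrees)")
```

Under this assumption the average absolute cosine shrinks roughly like one over the square root of the dimension, so in a thousand dimensions a typical pair is within a couple of degrees of a right angle.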

Nobel laureate Herbert Simon said it best: “Solving a problem simply means representing it so that the solution is obvious.” Large-scale data visualization is a growing research topic because high dimensional data is growing while our computer screens are not. Ideally, effective visualization algorithms highlight patterns and structure, and remove distracting clutter. Since human perception is limited to three-dimensional space, the challenge of visualizing high dimensional data is to optimally reduce the data and present it in a manner comprehensible to the user while preserving as much relevant information as possible.
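One standard way to "reduce the data while preserving as much information as possible" is principal component analysis, which picks the flat 2D view that retains the most variance. The sketch below is only an illustration on made-up data (500 observations in 50 dimensions whose real variation lives in two hidden directions). PCA is a linear method, and it is offered here as one familiar choice rather than as the best or only one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 2 hidden factors mixed into 50 measured variables, plus noise.
n_obs, n_dims = 500, 50
latent = rng.standard_normal((n_obs, 2)) * np.array([5.0, 3.0])
mixing = rng.standard_normal((2, n_dims))
data = latent @ mixing + 0.5 * rng.standard_normal((n_obs, n_dims))

# PCA via the singular value decomposition of the centered data matrix.
centered = data - data.mean(axis=0)
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)

# Coordinates of every observation in the best-fitting 2D view.
coords_2d = centered @ components[:2].T

# Share of the total variance that survives the reduction to 2D.
explained = singular_values ** 2 / np.sum(singular_values ** 2)
retained = float(explained[:2].sum())
print(f"2D view keeps {retained:.1%} of the variance")
```

The `coords_2d` array is what you would hand to a scatter plot. How much variance a 2D view retains on real data depends entirely on the data; the high figure here is a consequence of how the toy example was built.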

Unfortunately, having so many dimensions to play with (and so many tools being created by big.data.dot.coms) opens up almost limitless ways to look at data. Anyone who wants to find something will find something. On the other hand, it is rarely a lack of data that has resulted in bad consequences or policies. Domain expertise remains valuable despite all the data readily available. Astute practitioners place great importance on beginning by posing good questions rather than diving into the data headfirst. The questions shift the focus back to the purpose of using Big Data, or what should be its purpose: gaining tangible awareness and understanding of real-world behaviors or conditions, leading to an enrichment of society.
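"Anyone who wants to find something will find something" can be demonstrated with pure noise. In the sketch below, a random target is compared against more and more random, completely unrelated features; the sizes (50 observations, up to 10,000 features) are arbitrary choices of mine. The strongest correlation found tends to grow with the number of features searched, even though there is nothing to find.

```python
import numpy as np

rng = np.random.default_rng(0)

n_obs = 50
target = rng.standard_normal(n_obs)  # pure noise, unrelated to anything


def max_abs_correlation(n_features):
    """Largest |Pearson correlation| between the target and unrelated random features."""
    features = rng.standard_normal((n_obs, n_features))
    t = (target - target.mean()) / target.std()
    f = (features - features.mean(axis=0)) / features.std(axis=0)
    correlations = t @ f / n_obs  # one correlation per feature
    return float(np.max(np.abs(correlations)))


results = {k: max_abs_correlation(k) for k in (10, 100, 1000, 10000)}
for n_features, strongest in results.items():
    print(f"{n_features:6d} random features -> strongest |correlation| = {strongest:.2f}")
```

A correlation that would look impressive in a small, designed study is unremarkable when it is the winner among thousands of candidates, which is one more reason to start with a good question.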

[This post first appeared on the website of the IEEE Consultants Network of Silicon Valley.]


I welcome your comments. Do let me know if you have a numerical and insightful story to tell.