This is a question that I am frequently asked, most often by people who are plainly skeptical about Big Data simply because it is hugely oversold these days. At the recent Annual Signal and Image Sciences Workshop (more about this later in this blog), one of those skeptics remarked to me that there is a certain sameness to all the articles on Big Data found on the Internet. The first, and often the only, takeaway you get out of those articles is that Big Data is big. Also, the authors of those articles readily give in to the temptation to create clever phrases using the word “big.” The Big Data & Analytics Special Interest Group was inaugurated with a great presentation titled Big Lies and Small Truths. In many articles, the apparently big benefits are touted, mixed in with cautionary mention of some other things Big: Infrastructure, Expense, Dangers, Fatigue and Brother.
So, what is really different about Big Data? To me, the most qualitatively distinguishing aspect of the data in Big Data is this: It has quietly come into existence, with no one in particular having intentionally created it. Where in the past, data creation was a purposeful and protocol driven activity, today vast amounts of data have just come into existence by virtue of the fact that all of us choose to live and conduct business digitally. We do not stop to wipe off fingerprints, we let the soda dribble, we do not sweep away breadcrumbs. It is this nature of its creation, the largely unintentional, evolving, uncurated, uncontrolled and imperfect nature of Big Data that distinguishes it from the more familiar data. The bigness is not the essential feature, even if bigness is inevitable. This type of data quietly accumulated for years till, quite suddenly, we woke up to the Big Possibilities.
However, Big Problems arise due to the fact that the data comprising Big Data does not originate from projects specifically designed to facilitate statistical inference and prediction. This brings us back to the event of May 13, the Annual Signal and Image Sciences Workshop hosted and sponsored by Lawrence Livermore National Laboratories. Unlike most forums in which we hear about problems and innovative solutions that arise due to the bigness of the data, this workshop featured several talks highlighting its more fundamental aspects.
The difficulties that must be resolved before truly significant insights can be obtained from Big Data are many. These require an understanding of the fundamental limits to information extraction and the mathematical trade-offs involved in creating algorithms. The workshop offered a chance to hear about the relevance of kernel PCA, spectral clustering, Johnson-Lindenstrauss lemma, and the Chinese restaurant process. The last has nothing to do with General Tso’s Chicken, which delicacy GrubHub’s elementary Big Data analysis tells us is the most popular Chinese dish in the USA.
Is there anything else that makes Big Data fundamentally different? The answer is yes. Compared to data sets studied traditionally, the data sets in most Big Data scenarios are of high dimension. And also high, of course, is the number of high dimensional observations available in the data set. Dimension, as in 2D and 3D, is a familiar concept. But with a little mental stretching, we can understand the dimension of a data set as the number of variables that are measured for each observation.
High dimensionality is troublesome even in scenarios with moderate sized data sets. When there are too many directions, which are what dimensions are, we don’t know which way to look. For instance, take a look at the Communities and Crime Data Set available in the Machine Learning Repository maintained by the University of California at Irvine. This is a 122-dimensional data set of size 1994. And this doesn’t even qualify as Big Data.
If you need any convincing that data sets can look meaningless except when looked at in just the right way, take a look at the 3D data set I specially prepared for this blog. Each data point in the set is a 3D point fixed in a cubical volume of space. One possible way to present such a data set on a 2D screen is this: imagine placing the collection of points on a slowly spinning turntable and simply watch till the points align in a way that makes sense.
The complexity in real-life high dimensional data hides the meaning really deep within the data. You (or more correctly, your algorithm) would not only need to figure out exactly from what viewpoint to look, but would also need to design funny mirrors to undo nonlinear distortions in the data.
Funny things happen in high dimensional spaces. For example, it is quite rare for two random lines drawn on a page (2D) to turn out to be perpendicular to each other. However, in the high dimensional space of many Big Data sets it is practically a given that any two random vectors will be almost perpendicular to each other. In dealing with data sets of high dimensionality, the data scientist is forced (actually the good data scientist is delighted) to think about issues like noise accumulation, spurious correlations, incidental homogeneity and algorithmic instability which are rare considerations in working with traditional data. These are important issues because at the end of the day the data scientist is expected to help arrive at decisions from the data, or at very least help decision makers better understand high dimensional data and make informed decisions.
“Solving a problem simply means representing it so that the solution is obvious,” Nobel laureate Herbert Simon said it best. Large-scale data visualization is a growing research topic because high dimensional data is growing while our computer screens are not. Ideally, effective visualization algorithms highlight patterns and structure, and remove distracting clutter. Since human perception is limited to three-dimensional space, the challenge of visualizing high dimensional data is to optimally reduce the data and present it in a manner comprehensible to the user while preserving as much relevant information as possible.
Unfortunately, having so many dimensions (and the proliferation of tools being created by big.data.dot.coms) to play with opens up almost limitless ways to look at data. Anyone who wants to find something will find something. On the other hand, it is rarely lack of data that has resulted in bad consequences or policies. Domain expertise remains valuable despite all the data readily available. Astute practitioners place great importance on beginning with posing good questions rather than diving into the data headfirst. The questions shift the focus back to the purpose of utilizing big data – or what should be the purpose – gaining tangible awareness and understanding of real-world behaviors or conditions that lead to an enrichment of society.
Funny things happen in high dimensional spaces. For example, it is quite rare for two random lines drawn on a page (2D) to turn out to be perpendicular to each other. However, in the high dimensional space of many Big Data sets it is practically a given that any two random vectors will be almost perpendicular to each other. In dealing with data sets of high dimensionality, the data scientist is forced (actually the good data scientist is delighted) to think about issues like noise accumulation, spurious correlations, incidental homogeneity and algorithmic instability which are rare considerations in working with traditional data. These are important issues because at the end of the day the data scientist is expected to help arrive at decisions from the data, or at very least help decision makers better understand high dimensional data and make informed decisions.
“Solving a problem simply means representing it so that the solution is obvious,” Nobel laureate Herbert Simon said it best. Large-scale data visualization is a growing research topic because high dimensional data is growing while our computer screens are not. Ideally, effective visualization algorithms highlight patterns and structure, and remove distracting clutter. Since human perception is limited to three-dimensional space, the challenge of visualizing high dimensional data is to optimally reduce the data and present it in a manner comprehensible to the user while preserving as much relevant information as possible.
Unfortunately, having so many dimensions (and the proliferation of tools being created by big.data.dot.coms) to play with opens up almost limitless ways to look at data. Anyone who wants to find something will find something. On the other hand, it is rarely lack of data that has resulted in bad consequences or policies. Domain expertise remains valuable despite all the data readily available. Astute practitioners place great importance on beginning with posing good questions rather than diving into the data headfirst. The questions shift the focus back to the purpose of utilizing big data – or what should be the purpose – gaining tangible awareness and understanding of real-world behaviors or conditions that lead to an enrichment of society.
[This post first appeared on the website of the IEEE Consultants Network of Silicon Valley.]
Really nice post..
ReplyDeleteRegards..
Big Data Training in Chennai
Absolutely!!! try working with someone like that!!!
ReplyDeleteWeb Designing Training in Chennai
nice blog,..
ReplyDeleteSEO training in hyderabad by experts in digital markeing And by prosessional experts in seo.All the training by placement and also guide by the professionals.SEO training in hyderabad
ReplyDeletethis is valuable information for learners.thanks
http://hadooptraininginhyderabad.co.in/salesforce-training-in-hyderabad/
• Thanks for sharing such an interesting post.
ReplyDeleteios training in chennai
• Very good effort in collecting information.........
ReplyDeleteios training in chennai
It’s really amazing that we can record what our visitors do on our site. Thanks for sharing this awesome guide. I’m happy that I came across with your site this article is on point,thanks again and have a great day. Keep update more information..
ReplyDeleteBase SAS Training in Chennai
I really appreciate your blog. Thank you for sharing this blog.
ReplyDeletebest tableau online training in hyderabad.
online tableau course training
I really appreciate your blog. Thank you for sharing this blog.
ReplyDeletebest tableau online training in hyderabad.
online tableau course training
ReplyDeleteIts a wonderful post and very helpful, thanks for all this information. You are including better information regarding this topic in an effective way.Thank you so much
Installment loans
Payday loans
Title loans
Cash Advances
Thanks for sharing nice stuff really it. You might like read about
ReplyDeleteIntroduction to FFT
FFT in Matlab
DFT and IDFT in Matlab
What a wonderful blog!!! This is very helps to develop myself. Thank you and please keep blogging...
ReplyDeleteSpoken English Classes in Chennai
Best Spoken English Classes in Chennai
Spoken English Class in Chennai
Spoken English in Chennai
Best Spoken English Class in Chennai
Sekarang kita bahas cara mudah memenangkan game ini. Agar anda bukan hanya tau cara main tapi anda juga tau trik menangnya.
ReplyDeleteasikqq
dewaqq
sumoqq
interqq
pionpoker
bandar ceme terbaik
hobiqq
paito warna
forum prediksi
very useful post..
ReplyDeleteinplant training in chennai
inplant training in chennai
inplant training in chennai for it
Australia hosting
mexico web hosting
moldova web hosting
albania web hosting
andorra hosting
australia web hosting
denmark web hosting
Hey! Good blog. I was facing an error in my QuickBooks software, so I called QuickBooks Error 6123 (855)-756-1077. I was tended to by an experienced and friendly technician who helped me to get rid of that annoying issue in the least possible time.
ReplyDeleteHey! Mind-blowing blog. Keep writing such beautiful blogs. In case you are struggling with issues on QuickBooks software, dial QuickBooks Support Phone Number (877)603-0806. The team, on the other end, will assist you with the best technical services.
ReplyDeleteHey! What a wonderful blog. I loved your blog. QuickBooks is the best accounting software, however, it has lots of bugs like QuickBooks Error. To fix such issues, you can contact experts via QuickBooks technical support number
ReplyDeleteNice Blog !
ReplyDeleteAre you getting QuickBooks Error 8007 while working on QuickBooks software. You are not alone as many of the clientele have complained about this error.
Very nice blog! Thanks so much for sharing with us .In case if you face any technical issue in QuickBooks, you can contact Us:
ReplyDeleteQuickbooks Customer Service
Hey! Mind-blowing blog. Keep writing such beautiful blogs. In case you are struggling with issues on QuickBooks software, dial QuickBooks Customer Service Phone Number . The team, on the other end, will assist you with the best technical services.
ReplyDeleteHey! What a wonderful blog. I loved your blog. QuickBooks is the best accounting software, however, it has lots of bugs like QuickBooks Error. To fix such issues, you can contact experts via QuickBooks Phone Number
ReplyDeleteevent technology also essential your marketing platform communicates across your technology stack. vendor ideas for events, speakers bio examples and future endeavors mean
ReplyDeleteHey! Lovely blog. Your blog contains all the details and information related to the topic. In case you are a QuickBooks user, here is good news for you. You may encounter any error like QuickBooks Customer, visit at QuickBooks Customer Service Number for quick help at (888)981-4592.
ReplyDeletesmm panel
ReplyDeleteSmm panel
İs İlanlari
İNSTAGRAM TAKİPÇİ SATIN AL
hırdavatçı burada
Https://www.beyazesyateknikservisi.com.tr
servis
Tiktok hile
yapay çim fiyatları
ReplyDeleteCongratulations on your article, it was very helpful and successful. 9f21638e33ec3a52c8cf72f356c8f1dc
ReplyDeletewebsite kurma
website kurma
sms onay
Thank you for your explanation, very good content. 2c9636007527e1d1d5877983bd1140c3
ReplyDeletedefine dedektörü
Good content. You write beautiful things.
ReplyDeletevbet
vbet
sportsbet
korsan taksi
mrbahis
hacklink
hacklink
mrbahis
taksi
salt likit
ReplyDeletesalt likit
dr mood likit
big boss likit
dl likit
dark likit
CR4S
Portekiz yurtdışı kargo
ReplyDeleteRomanya yurtdışı kargo
Slovakya yurtdışı kargo
Slovenya yurtdışı kargo
İngiltere yurtdışı kargo
5Q6A
Azerbaycan yurtdışı kargo
ReplyDeleteAruba yurtdışı kargo
Avustralya yurtdışı kargo
Azor Adaları yurtdışı kargo
Bahamalar yurtdışı kargo
Q1DLYY
salt likit
ReplyDeletesalt likit
dr mood likit
big boss likit
dl likit
dark likit
TEO
https://saglamproxy.com
ReplyDeletemetin2 proxy
proxy satın al
knight online proxy
mobil proxy satın al
8Eİ