Monday, August 24, 2015

The FFT is Fifty

Their 1965 paper starts with the unassuming words "An efficient method for the calculation of the interactions of a 2^m factorial experiment was introduced by Yates and is widely known by his name." The paper does not use the word "fast" even once. But the algorithm described in that paper written by James Cooley and John Tukey[1] came to be known as the Fast Fourier Transform (FFT), and turned out to be just what the world was waiting for. Overnight, in universities and laboratories around the world, scientists and engineers began writing code and building hardware to implement the FFT. Fifty years later, versions of that algorithm are routinely used, often tacitly assumed to be available to programmers. Any computation that requires (either as an end, or as an intermediate step) the unraveling of the frequency content of information carrying signals benefits from the enormous speedup offered by the FFT. Other than the speedup, the FFT does not offer any computational result that was not available before its discovery. However, were it not for this amazing algorithm, it would be practically impossible to filter cell-phone signals, to compress movie files, do spectroscopy, take magnetic resonance imaging scans, perform quantum computing, and solve differential equations, among other things. The Cooley-Tukey paper created a surge in products that suddenly became possible, and often even became practical to operate in real-time.
Although the idea behind FFT can be explained without using complex math (it involves complex numbers, so I guess that does make the math complex), no attempt to do so will be made here. Suffice it to say that the algorithm belongs to a class of mathematical calculations that can either be performed in a straightforward manner, or can be solved by dividing the calculations into smaller steps. Cooley and Tukey showed how the dividing could be done in a manner such that the total effort needed to compute the simpler steps is much less than the effort that would be needed to tackle the computation without such division. There is a wealth of deep mathematics underlying the FFT algorithm. But let us remember that John Tukey, ever the pragmatic engineer, is said to have declared “I wouldn’t want to fly in a plane whose design depended on whether a function was Riemann or Lebesgue integrable.”
For signals with a million samples, the gain in speed obtained by virtue of the magic of FFT can be upwards of 50,000. To appreciate this speedup ratio, consider some calculation that takes 1 second to complete on your computer. To most of us, this would seem sluggish. In today's world we expect our devices to be instantly responsive, but we might tolerate such slowness once in a while, say when we are applying some fancy effects to produce an irresistible cat video. Now, imagine that same computation being 50,000 times slower i.e. taking almost 14 hours to execute.
For decades, the FFT algorithm and its direct variants have ruled the world of signal processing. It is only recently that researchers[2]  at MIT came up with an algorithm which takes advantage of a special property of real-life signals (sparsity) to achieve significantly faster performance. Although the speedup achieved (relative to the FFT, not relative to the straightforward computation) is not as spectacular, it is still useful in many practical applications. For example, it has the potential to reduce the time patients spend inside a MRI machine from an agonizing one hour to a bearable 20 minutes. There are many other applications where increased efficiency of computing the spectrum leads to other benefits such as longer battery life.
My introduction to the FFT came via a classic textbook, Athanasios Papoulis's Signal Analysis. In graduate school, I remember scribbling pages after pages of formulas as I worked to absorb the concept. When it all fell into place, the feeling was as exhilarating as though I had personally come up with the algorithm myself. Wonderful applications suggested themselves, and it seemed that the world could be saved by my new found understanding of the FFT.
I was to be humbled because I was soon introduced to an application made possible by the FFT. Transmultiplexers are computational devices that allow a single wire (or radio channel) to simultaneously transmit thousands of speech signals in a way that allows them to be separated by individual receivers. Remarkably, FFT figures into the calculations needed to pull off this feat. Unlike common applications I had encountered at that time where the input to the FFT is a sequence of samples of a single signal as it is generated in time, in the transmultiplexer calculations, the input to the FFT at each time step is the sequence formed by gathering samples of several independent speech signals at a given instant of time. It amazed me that you could, in a manner of speaking, meaningfully combine and take apart independent words uttered by a crowd at a given time instant. Now this is one application I could not have come up with.
I know that many engineers and scientists have their own personal story of their first encounter with the FFT. Well, we are nerds are we not? Can we share?
[1] Cooley, James W.; Tukey, John W. (1965). "An algorithm for the machine calculation of complex Fourier series". Mathematics of Computation, 19: 297–301.
[2] Haitham Hassanieh, Piotr Indyk, Dina Katabi, Eric Price. (2012). "Simple and Practical Algorithm for Sparse Fourier Transform". ACM-SIAM Symposium on Discrete Algorithms, Kyoto Japan

Tuesday, June 2, 2015

What is different about Big Data?

What is different about Big Data? Other than that it is, er, big?

This is a question that I am frequently asked, most often by people who are plainly skeptical about Big Data simply because it is hugely oversold these days. At the recent Annual Signal and Image Sciences Workshop (more about this later in this blog), one of those skeptics remarked to me that there is a certain sameness to all the articles on Big Data found on the Internet. The first, and often the only, takeaway you get out of those articles is that Big Data is big. Also, the authors of those articles readily give in to the temptation to create clever phrases using the word “big.” The Big Data & Analytics Special Interest Group was inaugurated with a great presentation titled Big Lies and Small Truths. In many articles, the apparently big benefits are touted, mixed in with cautionary mention of some other things Big: Infrastructure, Expense, Dangers, Fatigue and Brother.

So, what is really different about Big Data? To me, the most qualitatively distinguishing aspect of the data in Big Data is this: It has quietly come into existence, with no one in particular having intentionally created it. Where in the past, data creation was a purposeful and protocol driven activity, today vast amounts of data have just come into existence by virtue of the fact that all of us choose to live and conduct business digitally. We do not stop to wipe off fingerprints, we let the soda dribble, we do not sweep away breadcrumbs. It is this nature of its creation, the largely unintentional, evolving, uncurated, uncontrolled and imperfect nature of Big Data that distinguishes it from the more familiar data. The bigness is not the essential feature, even if bigness is inevitable. This type of data quietly accumulated for years till, quite suddenly, we woke up to the Big Possibilities.

However, Big Problems arise due to the fact that the data comprising Big Data does not originate from projects specifically designed to facilitate statistical inference and prediction. This brings us back to the event of May 13, the Annual Signal and Image Sciences Workshop hosted and sponsored by Lawrence Livermore National Laboratories. Unlike most forums in which we hear about problems and innovative solutions that arise due to the bigness of the data, this workshop featured several talks highlighting its more fundamental aspects.

The difficulties that must be resolved before truly significant insights can be obtained from Big Data are many. These require an understanding of the fundamental limits to information extraction and the mathematical trade-offs involved in creating algorithms. The workshop offered a chance to hear about the relevance of kernel PCA, spectral clustering, Johnson-Lindenstrauss lemma, and the Chinese restaurant process. The last has nothing to do with General Tso’s Chicken, which delicacy GrubHub’s elementary Big Data analysis tells us is the most popular Chinese dish in the USA.

Is there anything else that makes Big Data fundamentally different? The answer is yes. Compared to data sets studied traditionally, the data sets in most Big Data scenarios are of high dimension. And also high, of course, is the number of high dimensional observations available in the data set. Dimension, as in 2D and 3D, is a familiar concept. But with a little mental stretching, we can understand the dimension of a data set as the number of variables that are measured for each observation.

High dimensionality is troublesome even in scenarios with moderate sized data sets. When there are too many directions, which are what dimensions are, we don’t know which way to look. For instance, take a look at the Communities and Crime Data Set available in the Machine Learning Repository maintained by the University of California at Irvine. This is a 122-dimensional data set of size 1994. And this doesn’t even qualify as Big Data.

If you need any convincing that data sets can look meaningless except when looked at in just the right way, take a look at the 3D data set I specially prepared for this blog. Each data point in the set is a 3D point fixed in a cubical volume of space. One possible way to present such a data set on a 2D screen is this: imagine placing the collection of points on a slowly spinning turntable and simply watch till the points align in a way that makes sense.
The complexity in real-life high dimensional data hides the meaning really deep within the data. You (or more correctly, your algorithm) would not only need to figure out exactly from what viewpoint to look, but would also need to design funny mirrors to undo nonlinear distortions in the data.

Funny things happen in high dimensional spaces. For example, it is quite rare for two random lines drawn on a page (2D) to turn out to be perpendicular to each other. However, in the high dimensional space of many Big Data sets it is practically a given that any two random vectors will be almost perpendicular to each other. In dealing with data sets of high dimensionality, the data scientist is forced (actually the good data scientist is delighted) to think about issues like noise accumulation, spurious correlations, incidental homogeneity and algorithmic instability which are rare considerations in working with traditional data. These are important issues because at the end of the day the data scientist is expected to help arrive at decisions from the data, or at very least help decision makers better understand high dimensional data and make informed decisions.

“Solving a problem simply means representing it so that the solution is obvious,” Nobel laureate Herbert Simon said it best. Large-scale data visualization is a growing research topic because high dimensional data is growing while our computer screens are not. Ideally, effective visualization algorithms highlight patterns and structure, and remove distracting clutter. Since human perception is limited to three-dimensional space, the challenge of visualizing high dimensional data is to optimally reduce the data and present it in a manner comprehensible to the user while preserving as much relevant information as possible.

Unfortunately, having so many dimensions (and the proliferation of tools being created by to play with opens up almost limitless ways to look at data. Anyone who wants to find something will find something. On the other hand, it is rarely lack of data that has resulted in bad consequences or policies. Domain expertise remains valuable despite all the data readily available. Astute practitioners place great importance on beginning with posing good questions rather than diving into the data headfirst. The questions shift the focus back to the purpose of utilizing big data – or what should be the purpose – gaining tangible awareness and understanding of real-world behaviors or conditions that lead to an enrichment of society.

[This post first appeared on the website of the IEEE Consultants Network of Silicon Valley.]

Tuesday, July 22, 2014

A Gentle Introduction to Backpropagation

Why is this blog being written?

[Update: A more detailed version of this article with coding outline is available. Please read the note at the end of this article.]

Neural networks have always fascinated me ever since I became aware of them in the 1990s. They are often represented with a hypnotizing array of connections. In the last decade, deep neural networks have dominated pattern recognition, often replacing other algorithms for applications like computer vision and voice recognition. At least in specialized tasks, they indeed come close to mimicking the miraculous feats of cognition our brains are capable of.

While neural networks are capable of such feats, the very discovery of a method of programming such a computational device seems to me to be a miraculous feat of cognition worthy of celebration. My purpose in writing this blog is to share my perspective on an amazing algorithm made widely known by a 1986 publication in Nature.
I will assume that the reader has some understanding of the meaning and purpose of the various elements of a neural network such as the one shown in Figure 1. I have provided a little bit of background below. Other than that, this blog should be an easy read for those with some familiarity of basic calculus.

Figure 1: An example of a multi-layered neural network that can be used to associate an input consisting of 10 numbers with one of 4 decisions or predictions.

What is so difficult about designing a neural network?

To appreciate the difficulty involved in designing a neural network, consider this: The neural network shown in Figure 1 can be used to associate an input consisting of 10 numbers with one of 4 decisions or predictions. For example, the neural network shown may be used by a bank to determine if credit should be extended to a customer. In this case, the 10 input numbers represent various parameters relevant to an individual's financial responsibility such as balance in savings accounts, outstanding loan amounts, number of years of employment, and so on. The neural network takes in these 10 numbers, performs calculations and produces an output consisting of 4 numbers. Depending on the position at which the maximum of these 4 numbers appears in the output, the prediction could be one of the following:
  1. Excellent creditworthiness with high spending limit
  2. Average creditworthiness with moderate spending limit
  3. Low creditworthiness with low spending limit
  4. High default risk.
Based on this prediction, the bank would take the appropriate decision.

A neural network such as the one shown in Figure 1 can perform this miraculous feat of cognition only if it is specifically trained to do so. For the network to work correctly, the weight at each of its (10$\times $4)+(4$\times $6)+(6$\times $8)+(8$\times $4) =144 connections have to be carefully chosen such that the network classifies every input drawn from a known training set with a high degree of accuracy. This is a classic application of the supervised learning paradigm in machine learning.

There are, of course, no formulas in existence to directly set the values of the 144 weights. The only recourse is to start with some initial values for the 144 weights, check how good the resulting neural network is, and repeatedly refine the weights to progressively make the network more and more accurate. So, what is needed is a method of refining the weights.

If one were to think of the accuracy as some grand function of the weights, it makes sense to refine each weight by changing it by an amount proportional to the partial derivative of that grand function with respect to that weight. Why bring in partial derivatives? Because they precisely predict how the accuracy responds to small changes in the weights. In fact, at every iteration, performing a refinement guided by the partial derivatives results in a more advantageous gain in accuracy compared to any other method of refinement. The method of steepest descent does exactly what is suggested here – with the minor, but entirely equivalent, goal of seeking to progressively decrease the error, rather than increase the accuracy.

To keep things straight, let me list the concepts I have described thus far:
  1. The weights of the neural network must be set such that the error on a known training set is minimized. Ideally, the network must yield zero error on a large and representative training data set.
  2. For any given setting of weights, we can attempt to refine them by changing each weight by an amount proportional to the partial derivative of the error with respect to that weight. The partial derivatives themselves change after any such refinement, and must be recomputed.
  3. We can keep on refining the weights in this manner, and stop refining when the error is zero. In real life, we call it done when the error is low enough. Or refuses to fall anymore. Or we run out of time after a few million rounds of refinements.
Seems simple, right? Yes. As long as there is a simple way of calculating the partial derivative of error with respect to every weight. In principle, the partial derivatives can be calculated by systematically perturbing the weights and measuring the resulting changes in error. A word to the wise, don't try this at home. Or even on your neighborhood supercomputer.


In fact, this error minimization problem that must be solved to train a neural network eluded a practical solution for decades till D. E. Rumelhart, C. E. Hinton, and R. J. Williams (drawing inspiration from other researchers) demonstrated a technique, which they called backpropagation, and made it widely known (Nature 323, 533-536, 9 October 1986). It is essentially by building upon their method that today others have ventured to program neural networks with 60 million weights, with astounding results.

According to Bernard Widrow, now Professor Emeritus at Stanford University and one of the pioneers of neural networks "The basic concepts of backpropagation are easily grasped. Unfortunately, these simple ideas are often obscured by relatively intricate notation, so formal derivations of the backpropagation rule are often tedious." This is indeed unfortunate because the backpropagation rule is one of the most elegant applications of calculus that I have known. 

Easy as 1-2-3
Once you appreciate the fact that, in order to train a neural network, you need to somehow calculate the partial derivatives of the error with respect to weights, backpropagation can be easily and qualitatively derived by reducing it to three core concepts. It also helps immensely to keep the notation intuitive and easy to connect to the concept being symbolized.
Figure 2: The derivation of the backpropagation algorithm is simplified by adding an extra computational block to calculate the error and also by boxing parts of the network.

1. Boxing

Since training the neural network is all about minimizing the training error, the first step in the derivation involves tacking on an extra computational block to calculate the error between the actual output $\left\{o_1,o_2,o_3,o_4\right\}$ and a known target $\{t_1,t_2,t_3,t_4\}$. This is shown as a triangular block in Figure 2. For now, let us think of the output and the target as known and fixed entities. Although we need not concern ourselves with the exact formula to compute the error, I offer the familiar sum-of-squares error as an example

Next, we choose one of the layers (say Layer 3) and enclose that layer and all following layers (including the error calculating block) in a box, as shown in gray in Figure 2. Keep in mind that this is just one of several nested boxes we can construct in order to compartmentalize the network. Let us resolve not to worry about anything going on inside the box but simply think of the relationship between the input to the box and the output (i.e the error) coming out of the box. We will call this box the current box, and call the input to this box the current input$\{c_1,c_2,c_3,c_4,c_5,c_6\}$. It is important to recognize that the functional relationship between the current input to the box and the output emanating from the box is completely defined and can be computed. Let us denote this function as $E(c_1,c_2,c_3,c_4,c_5,c_6)$.
Figure 3: The box contains parts of the neural network hidden from view. It allows us to think about broad relationships among different parts of the network.

2. Sensitivity

As our journey through backpropagation continues, I gently request you to assume that the vector of partial derivatives $\frac{\partial E}{\partial c_1},\frac{\partial E}{\partial c_2}, \ldots
$ of the function $E(c_1,c_2,c_3,c_4,c_5,c_6)$ is known. This might seem like a stretch. After all, we have set out to find a method to compute some other (equally unrealized) partial derivatives. But I assure you it will all work out in the end. To emphasize the crucial nature of this simple concept, it has been given a name: Sensitivity. Let us denote the sensitivity of our current box as $\{\text{$\delta $c}_1,\text{$\delta $c}_2,\text{$\delta $c}_3,\text{$\delta
   $c}_4,\text{$\delta $c}_5,\text{$\delta $c}_6\}$. With the help of Figure 3 to hold these concepts in our mind, we can concretely think about how the output of the current box responds to a small perturbation applied to any of its current inputs. For example, if the fourth component of the current input changes by the small amount $\Delta c_4,$ we can expect the error at the output to change by $\Delta c_4 \delta c_4.$ Further, in addition to the hypothetical change in component 4, if there is a simultaneous change of $\Delta c_6$ in component 6, we can expect the error at the output to change by an additional amount, making the total change $\Delta c_4 \delta c_4 + \Delta c_6 \delta c_6.$ The effect of small simultaneous changes in the current input components simply add up at the output.

Knowing the sensitivity of the current box, what can we say about the sensitivity of the preceding box? Keep in mind that the preceding box encloses Layer 2 and all following layers, including the error calculating block. For our specific example, let us call the input to this box the preceding input, $\{p_1,p_2,p_3,p_4.\}$ It follows quite logically that the sensitivity of the preceding box (which we will naturally denote as $\{\text{$\delta $p}_1,\text{$\delta $p}_2,\text{$\delta $p}_3,\text{$\delta
   $p}_4\}$) must be related to the sensitivity of the current box and the extra neural network elements making up the difference between the two nested boxes. The extra elements are the very vital nonlinear activation function units, summing junctions and weights.

Figure 4(a) shows the current box and the extra elements that must be added to construct the preceding box. For clarity, all the elements not relevant to the calculation of the first component of sensitivity ($\delta p_1$) have been grayed out. Look closely at Figures 4(a) and 4(b) to understand how the sensitivity of the preceding box can easily be derived from first principles. Specifically, Figure 4(b) provides insight into how $\delta p_1(=\frac{\partial e}{\partial p_1})$ can be computed  by allowing the input component $p_1$ to change by a small quantity $\Delta p_1$ and following the resulting changes in the network. Notes: (i) The notation $\mathcal{A}'\left(p_1\right)$ has been used for the derivative of the activation function evaluated at $p_1.$ (ii) For clarity, not all changes in signals have been explicitly labeled. Those that are not labeled can easily be determined since they all follow an obvious pattern.
Figure 4(a) shows the current box and the extra elements that must be added to construct the preceding box. Figure 4(b) provides insight into how $\delta p_1(=\frac{\partial e}{\partial p_1})$ can be computed  by allowing the input component $p_1$ to change by a small quantity $\Delta p_1$ and following the resulting changes in the network.
This is indeed a deep result. (And one which we have managed to arrive at without recourse to tedious calculus and numbered equations.) By repeated application, this result allows us to work backwards and calculate the sensitivity of the error to changes in the input at every layer.

The algorithm gets the name backpropagation because the sensitivities are propagated backwards as they are calculated in sequence. The textbook formula to express the sensitivity of the preceding layer in terms of the sensitivity of the current layer is easily seen to be
\[\delta p_i = \mathcal{A}'(p_i) \sum _{j} w_{i j}\delta c_j \]

A starting point is all we need to completely calculate all the sensitivity terms throughout the neural network. To do this, we consider the error computing block itself as the first box. For this box, the input is $\left\{o_1,o_2,o_3,o_4\right\},$ and the output is $e$ as given in the sum-of-squares error formula we have seen before. Simple calculus gives us the components of the sensitivity of the error computing block
2 \left(o_1-t_1\right),
2 \left(o_2-t_2\right),
2 \left(o_3-t_3\right),
2 \left(o_4-t_4\right)

3. Weight updates

At this point, the last section writes itself. Following the same strategy outlined in the previous figure, look at Figure 5(a) and 5(b) to intuitively understand how the error changes in response to a small change in one of the weights, say $w_{1 1}$. Once again in these figures, details of connections not immediately relevant to this calculation have been grayed out. The much sought after partial derivative of error with respect to the specific weight $w_{1 1}$ is easily seen to be $\mathcal{A}\left(p_1\right) \delta c_1$. The textbook formula to compute the partial derivative of the error with respect to any weight is easily seen to be
\[\frac{\partial e}{\partial w_{i j}} =\mathcal{A}\left(p_i\right) \delta c_j \]
Figure 5: Another set of figures designed to provide insight into the error changes in response to a small change in one of the weights, $w_{1 1}$. Using this intuition, the partial derivative $\frac{\partial e}{\partial w_{1 1}}$ can be computed as $\mathcal{A}\left(p_1\right) \delta c_1$.
Now that we have a formula to calculate the partial derivative of error with respect to every weight in the network, we can proceed to iteratively refine the weights and minimize the error using the method of steepest descent.

In the most popular version of backpropagation, called stochastic backpropagation, the weights are initially set to small random values and the training set is randomly polled to pick out a single input-target pair. The input is passed through the network to compute internal signals (like $\mathcal{A}\left(p_1\right)$ and $\mathcal{A}'\left(p_1\right)$ shown in Figures 4 and 5) and the output vector. Once this is done, all the information needed to initiate backpropagation becomes available. The partial derivatives of error with respect to the weights can be computed, and the weights can be refined with intent to reduce the error. The process is iterated using another randomly chosen input-target pair.

The miraculous feat of cognition

I am in awe of the miraculous feat of cognition that lead early neural network researchers to arrive at the backpropagation algorithm. They clearly had the ability to see patterns and make elegant groupings which ultimately made it possible to train huge networks. Their work not only resulted in the neural network applications we use today, but have also inspired a host of other related algorithms which depend on error minimization.
Although this algorithm has been presented here as a single established method, it should be regarded as a framework. In my experience, appreciating how an algorithm is derived leads to insight which makes it possible to explore beneficial variations. The process of designing robust and efficient scientific algorithms frequently leads us to regard established frameworks as substrates upon which to build new and better algorithms.

Note: A PDF version of this article with high quality graphics and pseudocode ("Training Wheels for Training Neural Networks") for implementing the algorithm is available. Please visit and find Downloads on the home page.

Thursday, September 5, 2013

Land that Big Data job. Cash in on the movement.

Riding on the wave of a new movement requires that you learn new skills. I am often asked “What do I need to learn to profit from the Big Data trend?” My answer is “Learn some Machine Learning.”

Why do I say that? Three reasons
  1. For business leaders, investing just a little bit of time to pick up the basics of machine learning can help them formulate solvable problems and navigate the sea of potential technological options that can be used to realize business value
  2. Those with even a basic degree of comfort with numbers  (high school math, really) can quickly come up to speed on machine learning and go on to become leaders who contribute to gains on hundreds of challenging real-world data analysis problems from a wide span of industries and sciences.
  3. They are hiring. Do you have the skills? More and more companies are looking at data to increase productivity and competitiveness. What are you waiting for?

Machine Learning has arrived
After all, no less an authority than Bill Gates recently said, “When it comes to technology, there are four areas where I think a lot of exciting things will happen in the coming decades: big data, machine learning, genomics, and ubiquitous computing …” What is more, Bill Gates is also reported to have declared that “a breakthrough in machine learning would be worth ten Microsofts.”

Even if you happen to think that the above stand is debatable, there is no question that machine learning has arrived and is moving center stage. Large computational capital is readily available, and simple algorithms are waiting for someone to leverage them for handsome returns.

Numerous aspects of our life, both significant and trivial, have become incredibly measurable. Several aspects are already finely measured and stored. Privacy concerns aside, this might seem like a good thing till we realize that most of this data lies in deep slumber. Unused, unsorted and unedited, this data, rather than boosting the bottom line, kills productivity – cluttering up storage space on devices and slowing down connections. Just as most of us store more digital pictures than we can handle, companies frequently find that – although a treasure troves of data is on hand – the very volume of data makes it difficult to find what they need or glean actionable insight.

But we mustn’t blame the data. We must celebrate it. Astute business leaders have realized that tremendous insights exist in Big Data and want to use it in order to leap ahead of their competitors. Businesses are looking for expertise in machine learning to parse, reduce, simplify and categorize data.

Practitioner of this craft, sometimes called data scientists, are not quite software engineers. Although they might be quite competent at coding, their expertise lies in developing or adapting algorithms that operate on data to reveal order that can lead to insight. Data scientists can understand a data-centric problem, come up with a solvable formulation and recommend a practical solution. They may provide working prototypes to demonstrate the solution, and even sometimes provide segments of final code. They work closely with others more grounded in software engineering whose job is to translate their work, optimize it exquisitely and plug it into the larger framework of efficient runtime code implemented in the final product.

You are already a Data Scientist
Lest we forget, the ability to tame Big Data is the ultimate hallmark of intelligent beings. Your brain runs the best machine learning algorithms known. It absorbs massive data sets every moment of every day. The processing and directing of thought and action that you so effortlessly accomplish is amazing. Even more so once you consider that no neuron in your brain fires faster than 1 kHz, the speed of a circa 1980 PC. Of course, your brain’s biological computing system does not work the way a computer does. Rather than laboriously finding and opening large files loaded with complex information, your brain calls up millions of individually stored simple data elements, processes them in parallel in multiple locations, sometimes breaks rules leading to profoundly creative solutions and generally causes an emergence of sensible outcomes. This happened even before you spoke your first word, and now happens even when you sleep.

So, it looks like we already know machine learning without knowing that we know it. Many practitioners that I know report that the process of learning machine learning is accompanied by fleeting feelings of understanding how we think. This seems natural considering that machine learning is all about the problem of finding essential and distinguishing properties of a category, a quest that we are constantly engaged in.

Just good for fun and games?
Let me end this blog by taking another question that I am sometimes asked, “Is Big Data only useful for superficial stuff like recommending good movies to watch?”

My answer is that machine learning is used in several efforts to improve the quality of security, public health and safety. Methods for the automatic detection of events and other patterns in huge data sets find natural applications in areas such as flagging shipping containers likely to hold banned goods, timely alarms of disease outbreaks, warning of possible but yet-to-be-committed crimes, predicting dangerous sinkholes using satellite data, improved methods of assessing a patient’s risk of developing diabetes, proactively abating traffic congestion, early detection and containment of forest fires before they become too big, the list goes on ...

To be sure, Big Data and machine learning will be used to earn big bucks for corporations. But it is being, and can be, put to more noble uses in the wider world.

Friday, May 24, 2013

Prudential Harnesses the Wisdom of Crowds

“Statistics is boring! I can prove it to you!” I sometimes declare in statistics classes that I teach. I do that in order to make the class lively. As proof, I ask the class to name a famous physicist (Einstein's name comes up every time), a famous biologist (Darwin's name comes up every time), and so on. I finally, ask them to name a famous statistician. This is usually greeted with silence, laughter, and the inevitable smart aleck response, “Shashi.” Statistics has no celebrities to boast of? It must be boring! Is it really? I proceed to challenge the class with “I bet you definitely know the name of a famous and pioneering statistician.” This is indeed true, but the class has to wait till the end of the day to know who it is.

I often try to make up simple experiments to help the class realize that statistics is merely common sense quantified. Sadly, this is not the message most people take away from statistics classes or training that they might have endured. So, I was quite struck by the uniqueness of the Prudential commercial demonstrating statistics in an interactive way. In this commercial, filmed at a park in Austin, Texas, a group of 400 people were asked the question “How old is the oldest person you've ever known?” Each person was given a blue dot and asked to stick it on a 1,100-square-foot wall, lined up with the age of that person. The results, and the way the results develop into a neat histogram, are striking.

A participant in Prudential's commercial adds her sticker.
The final array of stickers is on the right.

The visual effect is spectacular because the blue dots organically pile up into a mountain well to the right of a reference line showing the retirement age of 65. Prudential prudently realized that the public cannot be expected to show sustained interest in static graphs or percentages (even when rendered in arresting, moving colors and fonts) attempting to communicate details of life expectancy and the cost of retirement. In a refreshing move, they harnessed the wisdom of the crowds to create a compelling visual for them.

Compelling, it is. But the thought that is so compellingly communicated – “Oh my gosh! Just look at this huge number of very very old people! If they retired at 65, what in the world are they going to live on in their 90s and beyond? Now, I better do something about this for myself!” – is quite far removed from the statistical facts that happen to be actually relevant to retirement and ageing.

Besides, there is a bit of a problem with the question How old is the oldest person you've ever known?” This was revealed when I attempted to duplicate the experiment using my friends as volunteers: Given more time to think and more opportunity for family members to jog our memory, we often find that we have known someone older than what we first thought. And, as it often happened in my experiment, we can recall knowing someone really old without knowing their exact age. A statisticians job is to prevent such problems by helping develop a solid protocol for the experiment.

But never mind. I don't want this blog to turn out to be a critique of Prudential’s approach to selling retirement planning products. On the contrary, I very much appreciate Prudential’s effort in bringing complex statistical concepts to the masses. In fact, a number of interesting statistical and computational ideas can be illustrated by taking Prudential’s experiment as a starting point. Let us explore.

First, let us try to quantitatively describe shape of the mountain of blue stickers. Since it looks suggestive of the famous bell curve, let us fit such a curve to approximate the tops of the stacks of blue stickers. The blue sticker experiment’s design – and the resulting images – actually makes it easy for us to fit a bell curve. The steps are easy: (1) Scan each column of pixels in the image from top to bottom and place a red mark at the first blue pixel encountered. (2) Next, allow a curve fitting algorithm (quite routinely used in scientific programming) to find the bell curve that best follows the red marks. The figure below shows the steps and the result of doing this exercise involving simple image processing and curve fitting operations.

Fitting a bell curve to the histogram of blue stickers

Next we ask questions. “Why does the mountain of blue stickers have the appearance that it does? Why does it look orderly even though the participants did not plan for creating the order? Why does it have a peak around 90 years?” We can answer these questions by doing our own blue sticker experiment, this time virtually (and at near-zero cost!)

Let us start with the 2011 US Census, which makes data on age distribution available on its website. From this data, we can apply curve fitting again to build an approximate mathematical model of age distribution, shown as the blue curve which approximately follows the red data points.

Age distribution data and its mathematical model

Having a mathematical model allows us to perform all manner of useful thought experiments, or simulations – or indulge our fancy. When done carefully, these experiments give rise to insight, help make predictions and often provide easy answers to real world questions. For example, from the mathematical model of age distribution, we can generate a random group of virtual people and be sure that the number of retirees (i.e. those past 65) is about what would be expected in an actual sample of the same size (roughly 12%). We can then create an easily understandable picture of the age distribution of that group.

A random group of virtual people. Those aged 65 and older are colored orange.

If I were a participant in Prudential’s experiment and if the people depicted in this picture were the ones (and only ones) ever known to me, I would go up to the wall and place my sticker above the 92 mark. How old is the oldest person you've ever known? 92.

Since we have a mathematical model, we can repeat this process of identifying the oldest individual as often as we like, each time using a different virtual group. Each repetition of the process corresponds to one instance of an individual in Prudential’s experiment mentally going over everyone ever known to him or her and identifying the oldest person known. The result of repeating the process a large number of times – each time placing a virtual sticker on a virtual wall – is shown in the figure below.

Virtual blue stickers generated by running a simulation of Prudential's experiment.
The histogram and the bell curve fit closely follow the result from the live experiment.

Even though the experiment involves generating random ages, we see that the pattern of stickers is not quite random. It follows what seems to be the familiar bell curve again (shown as a blue curve). What is more, it even peaks at around 90 years. The agreement between the actual (shown as a black dashed curve) and virtual versions is quite close. In aggregate, a predictable pattern emerges from randomness. The wisdom of the crowd ensures that the stickers fall into an orderly and familiar histogram, rather than be scattered all over the place.

So, that’s it for now. I hope we see more such creative attempts to infuse freshness into commercials. Everyone I know enjoyed this commercial. I am sure there were statisticians behind the scene who worked hard, (well before the experiment was conducted live) to make sure that it did not turn out to be a public flop. There is lots more material than can be covered in this blog. Does it really follow a bell curve? Does it matter? What is the distribution that is actually relevant to funding retirement plans?

As always, I will enjoy talking to you about this or any other numerical and insightful topics.