Everyone seems to be talking about the opportunities and possibilities of Big Data at the moment. But size isn’t everything; sometimes small is beautiful. The way that Big Data (and the associated analysis, or Data Mining) is currently being applied to manufacturing process improvement means that it is sadly going to fail to live up to expectations, and it won’t give a return on the huge investment being made. But it doesn’t have to be that way.
I’m referring to Big Data as consisting of a broad range of data types being generated in huge quantities. Data is more than numbers. With new and more accessible technology, it is now possible to carry out analyses on data that was previously uneconomic to consider. The ever-multiplying in-process sensors now generate a lot of data. Surely, as well as implementing a police state, Big Data Analytics can be put to use in this environment too? In addition, I’m also using the term Data Mining as it is generally used: to refer to a set of strategies for examining these large amounts of collected and stored data in order to generate new information, and hence new knowledge.
Right now, statistically savvy people are skeptical of the hype surrounding Big Data. Many point to stories such as that of the Literary Digest, which in 1936 mailed out 10 million survey cards in an attempt to forecast the result of that year’s US presidential election. After cross-tabulating an impressive 2.4 million responses, it predicted a crushing victory for Alfred Landon over the incumbent President Roosevelt. There was indeed a crushing victory, but it was Roosevelt’s. Yet, in the same election, George Gallup correctly forecast the result using a far smaller dataset. This was because Gallup understood what the Literary Digest did not: it’s far more important to have a representative sample than simply a big one. Today, the analyses carried out on data from Twitter, Facebook and smartphone locations repeat the same basic error, with similar outcomes. It’s as if we’ve forgotten the lessons of history. Certainly, at times new paradigms need to replace the currently accepted ones, but only once theories have been substantiated, can be explained, and can be shown to consistently make better, more accurate predictions. I remember my father telling me how painful it is to watch people make the same mistakes over and over. Finally I’m old enough to understand.
However, it is not actually those basic errors that will lead to the disappointment and the failure to get a return. Because these basic errors are understood by enough people, that problem will be fixed. Neither am I worried by any of the concerns of system integration or infrastructure; simply, it’s not my bag. And I don’t wish to criticize the tools. I’m a big fan of the programming language R, for example. I very much like being able to hack into any dataset in any format and process it just how I want. And I just love the fact that I can output the most wonderful information-rich graphics.
The real problems are at a much deeper, more fundamental level. Not long ago, we worked on a project that illustrates them nicely. It isn’t important what the product was, because the most important elements of the story are common to every project. In this case, however, (some very) Big Data was available, and management was very determined to learn how to use it. The improvement was needed because poor-quality output from some unidentified upstream process was causing a downstream process to be run much slower than budgeted, and frequently repeated. Plant output was dramatically lower than it should have been, and solving the problem was worth a lot of money.
But progress had been slow. It just happened that, nine months into their project, we were conducting a one-week workshop at the facility, and we shocked management by getting to the answer within a few hours. Sadly for their ambitions, the answer was not contained in that huge dataset. In our workshops we hold some formal sessions in a training room; the remainder of the time we let the attendees work on their projects in teams of two or three. This means we have to get around and find where they are working, which should be out on the shop floor. An hour after the first project session started, I caught this particular team still in a spare office close to the conference room, staring at a computer screen. Watching and listening for a few minutes, I realized one member was explaining to his teammates the structure of the data, which happened to be in an Excel spreadsheet, along with what their analysis had so far uncovered and what experiments he was proposing they carry out. But they couldn’t get the experiments completed that week (or even within a month), so, as far as he was concerned, they were done for now.
I’m considered by some to be a power user of Excel. Sometimes a test lasting a few seconds can generate a million rows of data, but I only need 50. If I need to, or if it will be quicker than writing some code in R, I’ll write an Excel macro to extract the small amount of data I actually need; it just has to be the right 50. Yet I had never before seen a spreadsheet with so many columns of data that the column reference required three letters. I’ve since checked; on my laptop the columns go out to XFD, which is the 16,384th column reading left to right. On this particular spreadsheet the last column of data was labeled AEZ, meaning there were 832 columns. One of them (column UJ, the 556th, which seemed almost an afterthought) was apparently the closest thing they had to a performance response (often called the Y). The remaining columns were assigned to variables measured by sensors in the manufacturing process, which were being analyzed to find any sort of correlation or pattern, although not necessarily linked with column UJ!
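For illustration only, the "extract just the rows you need" step might look like this in Python. Everything here is hypothetical: the file layout, the column names, and the filter condition are invented stand-ins, not the author's actual macro.

```python
import csv
import io

# Hypothetical miniature of a large test log; in practice this would be
# open("test_log.csv") on a million-row file rather than an in-memory string.
raw = io.StringIO(
    "time_ms,phase,torque\n"
    "1,warmup,0.2\n"
    "2,steady,1.4\n"
    "3,steady,1.5\n"
    "4,cooldown,0.3\n"
)

# Keep only the rows actually needed; they just have to be the *right* rows.
wanted = [row for row in csv.DictReader(raw) if row["phase"] == "steady"]
print(len(wanted))  # 2
```

The point is not the tooling: whether it is an Excel macro, R, or a few lines of Python, the skill is knowing which small subset of the data carries the information.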
Each row was a unit of the product. There were tens of thousands of rows, representing almost a year of production. Some columns were blank for the first 20,000 rows, because a sensor had only been added after the project started (on the recommendation of the project team). Sometimes a particular column was blank for a few thousand rows because there had been a sensor fault. All of this mess of several million pieces of data, with some pieces missing, needed to be taken care of. And so it had been – for nine miserable months!
Nevertheless, they had found over twenty variables that were promising suspects affecting plant performance, and had designed some experiments to test them further. Some of the proposed tests would be somewhat disruptive to production, costing a lot of money in downtime. Other variables could not be controlled, so it was either a matter of waiting for them to change (by some magical hand?) or of investing to make them controllable.
The true causal explanation for the poor performance, which we were able to find and fix within a few hours, showed that every single unit produced at one process step was problematic. Let’s restate that differently. Every row in the Big Data set was 832 columns of measurements made about parts that were almost the same, and all bad. None of them were substantially better or worse. There was no new knowledge hidden in that data, waiting to be found.
This is not a unique situation. It is the most common scenario that we have seen in thirty years of diagnosing product and process performance. What was really wrong was that each unit of the product was not bad all over. The processing time downstream was the result of a surface condition created upstream. But the surface condition was not uniform over the whole product. There was in fact a non-random, very repeatable pattern within a single cycle of the upstream process. The pattern pointed to just two potential causal mechanisms. The first one we tested was not it, but the second was. The tests were carried out during the project sessions in a couple of days.
I do not wish to embarrass anybody by describing the actual product. In fact, the most relevant part of the description could refer to 90% of the short projects we tackle in our workshops. In recent years we have had, more and more frequently, to persuade folks to dump their beloved Big Data sets. How do we know the answer is not there? Because what almost everyone has is the equivalent of one row of data per cycle of the process step that matters. But the variation in performance that has any leverage is within a row, and the dataset doesn’t have that resolution.
It’s a fundamental truth that, no matter how much data you have, you cannot extract useful information from it once the information has been removed, or if it was never there. It is far more effective to store data at the appropriate resolution for far fewer products, in order to have a better chance of maintaining the information content needed.
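A minimal sketch of that point, with made-up numbers: two units, each measured at four positions within one cycle of the upstream process. The within-cycle pattern carries the information; once each unit is aggregated to a single value per row, that pattern is gone, and no number of additional rows brings it back.

```python
# Made-up readings: two units, four positions within one upstream cycle.
# Position 3 is consistently off on every unit -- that is the signal.
unit_a = [10.0, 10.2, 14.0, 10.1]
unit_b = [9.9, 10.1, 13.9, 10.0]

# What typically lands in the Big Data set: one aggregated value per unit.
mean_a = sum(unit_a) / len(unit_a)
mean_b = sum(unit_b) / len(unit_b)

# The unit means are nearly identical, so the aggregated dataset shows
# almost no unit-to-unit variation to analyze...
print(mean_a, mean_b)

# ...while the within-cycle range (the information aggregation stripped out)
# is large and perfectly repeatable on every unit.
print(max(unit_a) - min(unit_a), max(unit_b) - min(unit_b))
```

Collecting full-resolution data on a handful of units would expose the position-3 pattern immediately; a year of one-row-per-unit averages never will.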
There’s another problem. What do I mean by leverage? Consider the list of suspect variables that had come out of the Big Data analytics. Did it mean anything? Certainly some of those variables would have had some influence on the performance we were interested in, but the economic impact of controlling them would be too small to notice. This is due to the sparsity-of-effects principle – a combination of the Pareto principle and the way that independent variables combine in their total effect, often termed the square root of the sum of the squares. Paradoxically, the bigger the sample size used for detecting correlations, the easier it is to find causal variables whose control will yield little or no economic impact.
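A quick worked example of that combination rule, using invented effect sizes: independent causes combine in quadrature (the square root of the sum of the squares), so even completely eliminating several real but minor causes barely moves the total.

```python
import math

# Invented standard deviations contributed by five independent causes.
effects = [10.0, 3.0, 2.0, 1.0, 1.0]

# Independent effects combine as the root of the sum of squares.
total = math.sqrt(sum(e * e for e in effects))

# Suppose we somehow eliminated every cause except the dominant one.
dominant_only = effects[0]

# Fixing four real causes completely shrinks total variation by only ~7%.
reduction = (total - dominant_only) / total
print(total, reduction)
```

This is why a longer list of statistically real correlations is not the same as a list of economically worthwhile targets: with these numbers, perfect control of the four minor causes buys almost nothing.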
Finally, a word on models of causation. Some Data Mining approaches are based upon pattern recognition rather than any theory about causality. Quite a few Big Data cheerleaders suggest that worrying about causation is old-school stuff (“David, you’re stuck in last century”). We’ll see; I suspect we’ll learn that is nonsense. But seriously, we do have to be conscious of any model of causation we adopt – both its strengths and its weaknesses. For example, fitting data to a general algebraic equation is not the same thing as a causal explanation. It is often very useful, for instance to test the leverage of this variable compared to that variable. But recognize that it is a shallow explanation. It is just algebra, and deserves the name Black Box Model – it’s dark inside and we can’t really see what’s going on. It is pure inductive reasoning, in which we say that a given cause will probably result in such-and-such an effect (because history looks that way). It is not the same as understanding the physics. Shallow explanations can help achieve better performance, and sometimes that’s all we need to do before moving on. Other times, we need a deeper causal explanation to achieve real competitive advantage.
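To make the black-box point concrete, here is a least-squares straight-line fit on invented data. The numbers and the fitted equation are illustrative only; the algebra interpolates nicely, but nothing in it explains a mechanism.

```python
# Invented data that happens to look linear.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

# Ordinary least squares for y = a + b*x: pure algebra, no mechanism.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = my - b * mx

# The fit predicts well inside the observed range, which can be useful for
# ranking the leverage of one variable against another...
print(a + b * 3.0)

# ...but it is a Black Box: it cannot say *why* y rises with x, nor whether
# the relationship survives outside the data we happened to collect.
```

Used with that limitation in mind, such a fit is a legitimate screening tool; mistaken for a causal explanation, it is exactly the inductive trap described above.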
In descending order of importance, I submit that there are three main drivers that will lead to Big Disappointment from Big Data Mining applied to manufacturing process improvement:
1. The information content of the data (often stripped out by aggregation, or never captured at all).
2. The lack of leverage from the hundreds of real correlations that are there to be discovered.
3. The dangers of inductive reasoning when using the black-box model (shallow explanations).
In the book Diagnosing Performance and Reliability, we describe our experiences and lessons learned over the last 30 years, along with what have turned out to be the most important strategies for diagnosing problems that had stumped many people. Most importantly, we explain the source of the data, and how it is connected into information leading to knowledge and understanding. There are over 40 real-world case studies; for the most part, we generated just the data needed. Today, in spite of being in the Big Data age, we find that either the information content has been stripped from the available data through aggregation and transformation, or else it was never collected. It is extremely rare to find economically relevant information in the Big Database of a manufacturing organization. But they could change that.