Everyone seems to be talking about the opportunities and possibilities of Big Data at the moment. But size isn't everything; sometimes small is beautiful. The way Big Data (and the associated analysis, or Data Mining) is currently being applied to manufacturing process improvement means it is sadly going to fail to live up to expectations, and it won't deliver a return on the huge investment being made. But it doesn't have to be that way.
By Big Data, I mean a broad range of data types being generated in huge quantities. Data is more than numbers. With new and more accessible technology, it is now possible to carry out analyses on data that was previously uneconomic to consider. The ever-multiplying in-process sensors now generate a lot of data. Surely, as well as implementing a police state, Big Data Analytics can be put to use in this environment too? I'm also using the term Data Mining as it is generally used: a set of strategies for examining these large stores of collected data in order to generate new information, and hence new knowledge.
Right now, statistically savvy people are skeptical of the hype surrounding Big Data. Many point to stories such as that of the Literary Digest, which in 1936 mailed out 10 million survey cards in an attempt to forecast the result of that year's US presidential election. After cross-tabulating an impressive 2.4 million responses, the magazine predicted a crushing victory for Alfred Landon over the incumbent President Roosevelt. There was indeed a crushing victory, but it was Roosevelt's. Yet, in the same election, George Gallup correctly forecast the result using a far smaller dataset. Gallup understood what the Literary Digest did not: it is far more important to have a representative sample than simply a big one. Today, analyses carried out on data from Twitter, Facebook and smartphone locations repeat the same basic error, with similar outcomes. It's as if we've forgotten the lessons of history. Certainly, at times new paradigms need to replace the currently accepted ones, but only once theories have been substantiated, can be explained, and can be shown to consistently make better, more accurate predictions. I remember my father telling me how painful it is to watch people make the same mistakes over and over. Finally I'm old enough to understand.
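The Literary Digest lesson can be made concrete with a toy simulation. The numbers below are invented purely for illustration: a small simple random sample lands close to the truth, while a far larger sample distorted by non-response bias misses badly.

```python
import random

random.seed(42)

# Hypothetical electorate: 60% support candidate A, 40% candidate B.
population = ["A"] * 60_000 + ["B"] * 40_000
random.shuffle(population)

# A small but representative simple random sample.
small_sample = random.sample(population, 1_000)
small_estimate = small_sample.count("A") / len(small_sample)

# A huge but biased "mail-out": suppose A-supporters respond only
# half as often as B-supporters (non-response bias, as in 1936).
biased_sample = [v for v in population
                 if random.random() < (0.25 if v == "A" else 0.50)]
biased_estimate = biased_sample.count("A") / len(biased_sample)

print(len(small_sample), round(small_estimate, 2))   # close to 0.60
print(len(biased_sample), round(biased_estimate, 2)) # well below 0.50, despite ~35x the size
```

The biased sample is tens of times larger, yet it calls the election for the wrong candidate; size cannot repair a sampling mechanism that systematically favors one group.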
However, it is not actually those basic errors that will lead to the disappointment and the failure to get a return; they are understood by enough people that the problem will be fixed. Neither am I worried by concerns about system integration or infrastructure. Simply, it's not my bag. And I don't wish to criticize the tools. I'm a big fan of the programming language R, for example. I very much like being able to hack into any dataset in any format and process it just how I want, and I love that I can output the most wonderful information-rich graphics.
The real problems lie at a much deeper, more fundamental level. Not long ago, we worked on a project that nicely illustrated them. It isn't important what the product was, because the most important elements of the story are common to every project. In this case, (some very) Big Data was available, and management was determined to learn how to use it. The improvement was needed because poor-quality output from some unidentified upstream process was causing a downstream process to be run much slower than budgeted, and frequently repeated. Plant output was dramatically lower than it should have been, and solving the problem was worth a lot of money.
But progress had been slow. It just happened that, nine months into their project, we were conducting a one-week workshop at the facility, and we shocked management by getting to the answer within a few hours. Sadly for their ambitions, the answer was not contained in that huge dataset. In our workshops, we hold some formal sessions in a training room; the rest of the time we let the attendees work on their projects in teams of two or three. This means we have to get around and find where they are working, which should be out on the shop floor. An hour after the first project session started, I caught this particular team still in a spare office close to the conference room, staring at a computer screen. Watching and listening for a few minutes, I realized one member was explaining to his teammates the structure of the data, which happened to be in an Excel spreadsheet, what their analysis had so far uncovered, and what experiments he was proposing they carry out. But they couldn't get the experiments completed that week (or even within a month), so, as far as he was concerned, they were done for now.
I'm considered by some to be a power-user of Excel. Sometimes a test lasting a few seconds can generate a million rows of data when I only need 50. If I need to, or if it will be quicker than writing some code in R, I'll write an Excel macro to extract the small amount of data I actually need; it just has to be the right 50. Yet I had never before seen a spreadsheet with so many columns of data that the column reference required three letters. I've since checked: on my laptop the columns run out to XFD, the 16,384th column reading left to right. On this particular spreadsheet the last column of data was labeled AEZ, meaning there were 832 columns. One of them (column UJ, the 556th, which seemed almost an afterthought) was apparently the closest thing they had to a performance response (often called the Y). The remaining columns were assigned to variables measured by sensors in the manufacturing process, and were being analyzed to find any sort of correlation or pattern, not necessarily even with column UJ!
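Those column counts are easy to verify. Excel column labels are a base-26 numbering with digits A through Z (worth 1 to 26), so a short function (a sketch in Python, though the same few lines work in any language) recovers the positions quoted above:

```python
def col_number(letters: str) -> int:
    """Convert an Excel column label (A, B, ..., Z, AA, AB, ...) to its
    1-based position: base-26 arithmetic with digits A=1 .. Z=26."""
    n = 0
    for ch in letters.upper():
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n

print(col_number("UJ"))   # 556  (the response column, the Y)
print(col_number("AEZ"))  # 832  (the last column of data)
print(col_number("XFD"))  # 16384 (Excel's rightmost column)
```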
Each row was a unit of the product. There were tens of thousands of rows, representing almost a year of production. Some columns were blank for the first 20,000 rows, because a sensor had only been added after the project started (on the recommendation of the project team). Sometimes a particular column was blank for a few thousand rows because there had been a sensor fault. All of this mess of several million pieces of data, some of them missing, needed to be taken care of. And so it had been – for nine miserable months!
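For readers who face a similar mess, this kind of column-coverage triage takes minutes, not months. Here is a minimal sketch using pandas; the data frame and sensor names are invented stand-ins for the plant spreadsheet, and "keep complete cases only" is just one defensible policy among several.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the plant spreadsheet (all names invented):
# sensor_b was only installed partway through the year,
# and sensor_c had a fault for a stretch of production.
df = pd.DataFrame({
    "unit_id":  range(1, 11),
    "sensor_a": np.random.default_rng(0).normal(size=10),
    "sensor_b": [np.nan] * 4 + [1.2, 1.1, 1.3, 1.2, 1.4, 1.1],
    "sensor_c": [0.5, 0.6, np.nan, np.nan, 0.5, 0.6, 0.5, 0.7, 0.6, 0.5],
})

# First, measure how much of each column is actually usable.
coverage = df.notna().mean()

# One simple policy: analyze only rows where every sensor reported.
complete = df.dropna()

print(coverage.round(2))
print(len(complete), "of", len(df), "rows are complete cases")
```

Knowing up front that a column is 60% blank changes what you can honestly conclude from it, which is rather the point.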
Nevertheless, they had found over twenty variables that were promising suspects for affecting plant performance, and had designed experiments to test them further. Some of the proposed tests would be disruptive to production, costing a lot of money in downtime. Other variables could not be controlled, so it was either a matter of waiting for them to change (by some magical hand?) or investing to make them controllable.
The true causal explanation for the poor performance, which we were able to find and fix within a few hours, showed that every single unit produced at one process step was problematic. Let's restate that differently: every row in the Big Data set was 832 columns of measurements made on parts that were almost the same, and all bad. None of them was substantially better or worse. There was no new knowledge hidden in that data, waiting to be found.
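There is a five-minute sanity check that would have revealed this before any mining began: look at whether the response itself varies. The snippet below is a sketch with invented numbers, not the plant's data; if the Y column is essentially constant, no amount of correlation hunting across the other 831 columns can explain anything.

```python
import statistics

# Hypothetical response column (the "Y"): every unit about equally bad.
y = [0.98, 0.97, 0.99, 0.98, 0.98, 0.97, 0.99, 0.98]

spread = statistics.pstdev(y)          # population standard deviation
cv = spread / statistics.mean(y)       # coefficient of variation

# A coefficient of variation this close to zero means there is
# nothing in Y for any X column to be correlated with.
print(round(cv, 4))
```

Check the variation in Y first; if it isn't there, the knowledge isn't there either.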