Making sense of a large pile of data, such as several hundred measurements of 5 to 10 different characteristics of a product, is a data analyst’s dream come true. In manufacturing environments, especially in the chemical industry, it often happens that we measure a wealth of data on each finished product batch. Some of these measurements are important, as they might directly relate to customer requirements; some are “informative” only, and the goal of collecting all the data is to have an idea of what is happening in the production process. That is the theory, at least. Of course, life is always a bit more complicated, and many operations will simply be overwhelmed by the amount of data without having adequate tools to make sense of it. So the data will be faithfully collected – after all, it is in the SOP – but it will not be used to draw conclusions.
What I would like to show is a method to actually use this wealth of information. The method is by no means new, but it has definitely failed to find its way into the manufacturing environment. I still recall the first reaction of a colleague when I showed it to him: “oh, this is for PhD work,” he said – meaning more or less “this is way too complex and practically useless”.
The method is called principal component analysis. The idea is to generate artificial variables out of the set of actually measured ones, in such a way that the new variables capture most of the variation present in the data set. To be more precise, we generate linear combinations of the original variables such that the first linear combination captures the most of the total variation, the second captures the most of the residual variation, and so on. If there is a systematic trend in the data, the hope is that the first few artificial variables (called principal components) will capture this systematic pattern and the latecomers will only catch random variation. Then we can restrict our analysis to the first few principal components and will have cleansed our data of most of the random variation.
To make this clearer, imagine a room in which we have temperature sensors in each of the 8 corners. We will have 8 sets of measurements that are somewhat correlated but also vary independently of each other. Running a PCA (principal component analysis) will yield a first principal component that is roughly the average of the 8 measurements – as obviously the average temperature of the room captures most of the variation (just think winter vs. summer, night vs. day). The second principal component will capture the difference between the averages of the top 4 sensors and the bottom 4 sensors (it is generally warmer on top, right?), the third component the difference between the side with the windows and the side without… and the rest probably only meaningless measurement variation, and possibly some complex but small differences due to the patterns of air currents in the room.
What did we gain? Instead of the original 8 values we only need to analyze 3 – and we will gain a much better insight into what is going on. We did not reduce the number of measurements, though – we still need all the sensors to construct our principal components.
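The room example can be sketched in R with simulated data. Everything here is made up for illustration (the sensor values, the size of the effects); the point is only that PCA recovers the room-wide and top-vs-bottom patterns as its first two components:

```r
# Simulated example, not real data: 8 corner sensors in one room.
set.seed(42)
n      <- 200
room   <- rnorm(n, mean = 21, sd = 4)   # overall room temperature (seasons, day/night)
updown <- rnorm(n, mean = 0,  sd = 1)   # top vs. bottom difference
top    <- sapply(1:4, function(i) room + updown + rnorm(n, sd = 0.3))
bottom <- sapply(1:4, function(i) room - updown + rnorm(n, sd = 0.3))
sensors <- cbind(top, bottom)
colnames(sensors) <- paste0("S", 1:8)

pca <- prcomp(sensors, scale = TRUE)

# Share of the total variance captured by each component; the first two
# should dominate, the remaining six contain mostly sensor noise.
round(pca$sdev^2 / sum(pca$sdev^2), 3)
```

With these settings the first two components capture well over 90% of the total variance, so the remaining six can safely be ignored.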
What did we lose? As my colleague put it – this is PhD work – so we lose a bit of transparency. It is relatively easy to explain a trend in a sensor at the top of the room, but a trend in the first principal component sounds like academic gibberish. So, we will have to learn to explain the results and to relate them to real-life experiences. This is not as difficult as it sounds, but it definitely needs to be carefully prepared, with the message tailored separately to each practical case.
Now, to close with a practical example: I just had the good luck to run into a suitable problem at a customer I work for. They measured 5 components of a product and had the respective data from 2014 onward. All data points represented product batches that were accepted by QA and delivered to the customers, as they were all in spec. The problem was that the customers still complained of highly variable quality, and especially of variations in quality over time.
Looking at five control charts, all consisting of quite skewed non-normal data, basically just increased the confusion. There were regularly points that were out of control, but did we have more of them in 2015 than in 2014? And if yes, was the difference significant? No one could tell.
So, I proposed to analyze the data with PCA.
In R one easy way to do this is by using the prcomp function in the following way:
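The code snippet itself did not survive here, but the description that follows pins it down. This is my reconstruction, with `InputDataFrame` filled with made-up numbers (and assumed column names) so the snippet runs on its own:

```r
# Assumed stand-in for the real data: each column is one measurement of the
# product, each row one batch. The values below are invented.
set.seed(1)
InputDataFrame <- data.frame(
  Turbidity    = rnorm(50, mean = 10,  sd = 2),
  Moisture     = rnorm(50, mean = 5,   sd = 1),
  Viscosity    = rnorm(50, mean = 300, sd = 40),
  Conductivity = rnorm(50, mean = 750, sd = 80),
  pH           = rnorm(50, mean = 7,   sd = 0.5)
)

# The essential line: run the PCA on standardized variables.
Results <- prcomp(InputDataFrame, scale = TRUE)
```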
Here, obviously, InputDataFrame contains the input data – that is, a table where each column contains the values of one specific measurement of the product (like viscosity, pH, conductivity etc.) and each row is one set of measurements for one batch of the product. The scale = TRUE option asks the function to standardize the variables before the analysis. This makes sense in general, but especially when the different measurements are on different scales (e.g. the conductivity could be a number between 500 and 1000 whereas the pH is less than 10). If they are not scaled, the variables with greater numeric values will dominate the analysis, which we want to avoid.
The Results object will contain all the outcomes of the PCA. The first component, sdev, is a list of the standard deviations captured by each principal component. The first value is the greatest and the rest follow in decreasing order. From these values we can generate the so-called “scree plot”, which shows how much of the total variance is captured by the first, second etc. principal component. The name comes from the image of a steep mountainside with rubble at the bottom – scree. The steeper the descent, the more variation is captured by the first few principal components – that is, the clearer the separation between systematic trends and useless noise.
A quick way to generate the scree plot out of the results of prcomp would be as follows: plot(Results$sdev^2/sum(Results$sdev^2), type = "l").
Simply calling plot on Results will give us a barchart of the variances captured by each principal component, which has comparable information, though it is not exactly the same.
The next element of the prcomp return structure is called “rotation”, and it tells us how the principal components are constructed out of the original values. It is best to see this in a practical example:
                             PC1         PC2
Turbidity             0.45146656 -0.59885316
Moisture              0.56232155 -0.36368385
X25..Sol.Viscosity.   0.06056854  0.04734582
Conductivity         -0.44352071 -0.59106943
pH..1..               0.52876578  0.39686805
In this case we have 5 principal components, but I am only showing the first two as an example. The first component is built as 0.45 times the standardized value of Turbidity, plus 0.56 times Moisture, plus 0.06 times the viscosity, minus 0.44 times the conductivity, plus 0.53 times the pH. For each measured batch we can generate a new data point by multiplying the standardized original measurements with these weights (called loadings in PCA language) and summing them up.
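To see that the scores really are just these weighted sums, one can recompute the first principal component by hand and compare it with what prcomp returns. Toy data again, since the customer's data set is not available; the column names are invented:

```r
# Made-up batch data; the point is only the mechanics of the score calculation.
set.seed(1)
dat <- data.frame(
  Turbidity    = rnorm(50, 10, 2),
  Moisture     = rnorm(50, 5, 1),
  Viscosity    = rnorm(50, 300, 40),
  Conductivity = rnorm(50, 750, 80),
  pH           = rnorm(50, 7, 0.5)
)
Results <- prcomp(dat, scale = TRUE)

# Standardize the original measurements, then weight them with the PC1 loadings.
standardized <- scale(dat)
pc1_by_hand  <- standardized %*% Results$rotation[, 1]

# This matches the scores prcomp has already stored for us:
all.equal(as.vector(pc1_by_hand), as.vector(Results$x[, 1]))
```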
What good does that do us? After looking at the scree plot we can see that most of the variation in our data is captured by the first two (maybe three) principal components. This means that we can generate new artificial values that have only two components (PC1 and PC2, instead of the original 5: Turbidity, Moisture etc.) and still describe the systematic trends in our data. This makes graphical analysis much easier – we can restrict ourselves to calculating the new artificial values for PC1 and PC2 and look at a two-dimensional graph to spot patterns. It also means that we have eliminated a lot of the random variation from our data: it is captured by the remaining components PC3 to PC5, which we conveniently leave out of the analysis. So, not only do we get a graph we can actually understand (I find my way in two dimensions much faster than in 5), but we also manage to clean up the picture by eliminating a lot of randomness.
As it happens, we don’t even need to do the calculation ourselves – the Results structure contains the recalculated values for our original data in a component called x, so Results$x[,1] contains the transformed values for the first principal component and Results$x[,2] those for the second. Plotting one against the other can give us a nice picture of the system’s evolution.
Here is the resulting plot of PC2 versus PC1 for the example I was working on.
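The figure itself is not reproduced here, but a plot of this kind can be generated along the following lines. The data and the year labels are my invention for the sketch; the real plot used the customer's batches from 2014 onward:

```r
# Made-up batch data and year labels, standing in for the real measurements.
set.seed(1)
dat <- data.frame(
  Turbidity    = rnorm(60, 10, 2),
  Moisture     = rnorm(60, 5, 1),
  Viscosity    = rnorm(60, 300, 40),
  Conductivity = rnorm(60, 750, 80),
  pH           = rnorm(60, 7, 0.5)
)
years   <- factor(rep(2014:2016, each = 20))  # one year label per batch
Results <- prcomp(dat, scale = TRUE)

# One point per batch in the plane of the first two principal components,
# colored by year.
plot(Results$x[, 1], Results$x[, 2],
     col = as.integer(years), pch = 19,
     xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(years), col = 1:3, pch = 19)
```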
The colors show the different years, and we can see that between 2015 and 2016 there was a very visible shift towards increasing PC1 values. Looking at the loadings, we see that the PC1 values increase if Turbidity, Moisture and/or pH increase, and/or conductivity decreases. Remember, these are all changes that are well within specs and also within the control limits, so they were not discernible with simple control charts. Still, once we knew what we were looking for, the trends of increasing moisture and decreasing conductivity were easily identifiable.
Now we can identify the process changes that led to these trends and strengthen the process control. PCA can also be used to plot the measured values of future batches, to verify that the measured values do not start creeping again. Not bad for an academic and useless method, right?
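For that monitoring idea, R already does the projection of new batches for us: predict() applied to a prcomp object centers and scales the new measurements with the historical parameters and returns their scores on the existing components. A sketch, again with invented numbers and column names:

```r
# Made-up historical data, standing in for the 2014+ batches.
set.seed(1)
historic <- data.frame(
  Turbidity    = rnorm(50, 10, 2),
  Moisture     = rnorm(50, 5, 1),
  Viscosity    = rnorm(50, 300, 40),
  Conductivity = rnorm(50, 750, 80),
  pH           = rnorm(50, 7, 0.5)
)
Results <- prcomp(historic, scale = TRUE)

# A new batch arrives: predict() standardizes it with the *historical*
# center and scale, then projects it onto the existing components.
new_batch <- data.frame(Turbidity = 11, Moisture = 5.2, Viscosity = 310,
                        Conductivity = 720, pH = 7.1)
new_scores <- predict(Results, newdata = new_batch)
new_scores[, 1:2]  # PC1 and PC2 of the new batch, ready for the trend plot
```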