The nice thing about working as a consultant is the variety of problems one is confronted with. Just as I was preparing the next blog post about the possibilities of simulation in statistics, I received a request to help design an experiment to determine the life-time of a chemical product. Life-time analysis – better known as survival analysis – is a large and interesting part of statistics. We in the “normal” Six Sigma world do not see much of it, but in DFSS projects it is one of the more important elements.

Today, Minitab provides some help in a relatively new sub-menu, and there are specialized programs such as Weibull++ aimed at specialists. Of course, R has a wealth of libraries to use as well. On the other hand, there is nothing better than a simulation to build intuition around a problem, so I decided to go that way, using R for the ease of writing scripts. Another problem in these cases is that calculating sample sizes is not easy and is not really well documented, in Minitab for example.

So, I wanted to simulate some life data analysis and especially look at the power of the analysis as a function of external stress – something even Weibull++ will not provide, as the company sells a separate program for planning HAST tests. (Just for those less versed in DFSS: HAST stands for Highly Accelerated Stress Test. The idea is that one can define a stress factor (like weight for mechanical assemblies or temperature for electronic equipment) and assume that there is a relationship between the distribution of survival times and the level of the stress factor – say, the higher the stress, the faster a breakdown occurs. Once this relationship is known, from theory or possibly from earlier experiments, one can run a test for a much shorter duration at an elevated stress level and work out the survival distribution at the normal stress level from the data.)

In our situation we had basically no knowledge of the relationship between the stress factor and the life-time, so what we needed to do was to run life-time tests at different levels of the stress factor and work out the relationship from the results, THEN use these results to extrapolate the life-time distribution at normal levels. Of course, the easiest approach would be to simply run the life-time tests at the normal levels; we just do not have the time and the resources to wait for months and months until the product dies.

However, if we want to see the effect of higher stress levels on the product, we absolutely need to know the power of the statistical method used to analyze the data. This will determine the sample size we need at each setting, and as usual, we want to avoid both the risk of not seeing anything in a stupidly underpowered study and the risk of wasting a lot of product and effort by using a higher than necessary sample size.

Estimating the power is simple in principle. We define an effect strength, simulate the data with the defined influence of the stress factor and run the analysis. We repeat this step 2000 (or more) times, count how many times we get a p-value less than 0.05, divide by the total number of experiments and there we are, we have a power estimate. Of course, the details are a bit more cumbersome.

 

First, the R code to do this is here:

# Sample size determination for accelerated life testing
# Assume the survival time distribution is exponential, that is
# no manufacturing errors and no aging – constant hazard rate
# Assume that the hazard rate is dependent on a stress factor F

# To define a realistic parametrization we use historical knowledge:
# we know that at F=20 the MTTF = 3*30 = 90 days
# lambda = 1/MTTF = 1/90
# the effect size is given as a reduction of the MTTF – so an increase in the
# forcing factor from 20 to 40 reduces the MTTF by DM days

require(survival)

F = c(20, 40, 60, 80)   # the stress factor levels used in the test

# DM is the effect size in MTTF days – in this case a change of 20 in the
# stress factor reduces the MTTF by 10 days – a weak effect
#DM = 10

# Percentage of censored values – for now the censoring will be randomly decided
#CensPer = 0.0

lambda0 = 0.011   # = 1/MTTF = 1/90 days

# Function to calculate the power of a setup consisting of a given sample size,
# effect size and percentage of censored values
SingleSetupRun = function(Nrun=100, DM=10, Ns=10, CensPer=0){
  ps     = numeric(Nrun)
  scales = numeric(Nrun)
  # FOR Nrun times DO
  for(i in 1:Nrun){
    Fs = numeric(0)
    Ts = numeric(0)
    #   FOR each value of F DO
    for(n in 0:3){
      #   Calculate the lambda corresponding to an MTTF reduced by n*DM days
      lambda = lambda0/(1 - lambda0*n*DM)
      #   Add the F values to the vector
      Fs = c(Fs, rep(F[n+1], Ns))
      #   Generate Ns samples from the exponential described by lambda
      Ts = c(Ts, rexp(Ns, lambda))
    #   ENDFOR
    }
    #   Build a censoring vector (1 = observed failure, 0 = censored)
    cs = rep(1, length(Fs))
    #   A percentage equal to CensPer is censored, chosen at random
    cptr = sample(1:length(Fs), as.integer(CensPer*length(Fs)), replace=FALSE)
    cs[cptr] = 0
    #   Build the data frame and the survival object
    sdata = data.frame(Fact=Fs, Stime=Ts, Cens=cs)
    S = Surv(sdata$Stime, sdata$Cens)
    #   Perform the regression
    mod = survreg(S ~ sdata$Fact, dist="weibull")
    #   Get the p-value as printed by survival:::print.summary.survreg
    #   (element 11 of the 3x4 coefficient table is the p-value of the stress factor)
    ps[i]     = summary(mod)$table[11]
    scales[i] = summary(mod)$scale
  # ENDFOR Nrun
  }
  print(sprintf("Power is: %.2f, for effect %d and sample size %d",
                sum(ps < 0.05)/length(ps), DM, Ns))
  sign.ind = which(ps < 0.05)
  print(sprintf("Percent scale>1 %.2f", sum(scales[sign.ind] > 1)/length(sign.ind)))
}

 

Now, let us look at some details.

Setting the parameter of the exponential distribution

 

In life data analysis probably the most important decision is choosing the distribution of the survival times. The simplest model is the one where the survival times are described by the exponential distribution. This is equivalent to the assumption that there are no memory effects present – that is, the remaining time until a failure is independent of the age of the product. A different way to describe this is by using the concept of the hazard. The hazard at time t is defined as the probability that the product fails exactly at time t, given that it has survived until this moment. The memoryless property (which is equivalent to the exponential distribution of survival times) simply says that the hazard is constant in time. To make things even simpler, the hazard is the only parameter of the exponential distribution – defined as “lambda” in the code above. Generating random values from a given exponential distribution is simply done by calling the R function rexp(n, lambda).
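To make the memoryless property a bit more tangible, here is a tiny numerical check – just an illustration, not part of the script above: for exponential survival times, the chance of surviving another 30 days is the same whether an item is brand new or has already survived 30 days.

# Numerical illustration of the memoryless property of the exponential
# distribution: P(T > s + t | T > s) = P(T > t)
lambda <- 1/90                      # hazard rate corresponding to an MTTF of 90 days
set.seed(1)                         # for reproducibility
times  <- rexp(1e6, rate = lambda)  # a large sample of survival times

s <- 30; t <- 30
mean(times[times > s] > s + t)      # survival past another 30 days, given 30 days already survived
mean(times > t)                     # unconditional survival past 30 days – about the same value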

 

So, if we want to simulate different effect sizes we just need to change the lambda parameter accordingly. There is the crux, however – what do we mean by accordingly? We (at least I) are not trained to think in terms of hazards – so how can we decide whether changing the lambda parameter by 0.1 is a strong or a weak effect? I honestly do not know. So I had to find a way to make the effect size easier to specify, and I came up with the idea of using the Mean Time To Failure, or MTTF. This is simply the expected value of the exponential distribution – and despite its name it is the time at which not 50% but roughly 37% of the tested items still survive. It can be shown that lambda = 1/MTTF, which is handy because most manufacturers provide the MTTF number. And since this is a time value, we can relate to it much more easily than to the concept of the hazard, so it becomes easy to define a weak, medium or strong effect. We know that our MTTF under normal, unstressed conditions should be around 90 days. So an effect that shifts the MTTF by around 10% can be considered weak. Knowing this, we can work backwards to the change in lambda that will reduce the MTTF by 10 days (10%-ish). The details are in the comments of the code.
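As a quick sanity check (again, just an illustration), this is how an MTTF shift of DM days translates into the lambda used inside SingleSetupRun:

# Translating an MTTF reduction of DM days into the corresponding hazard rate
lambda0 <- 1/90                 # baseline hazard, MTTF = 90 days
DM      <- 10                   # the "weak" effect: MTTF drops by 10 days per step
n       <- 1                    # one step up in the stress factor

1 / (1/lambda0 - n*DM)          # hazard for an MTTF of 80 days
lambda0 / (1 - lambda0*n*DM)    # the same value, written as in the function above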

Censoring

The other special feature of life data analysis is that we have two types of life data. There is the “real” type, where the value is simply the time of failure, but we also have “censored” data, where we did not see any failure but had to interrupt the observation. E.g. we planned the experiment for 8 weeks and at that time some items just kept on living. This is valuable information and it would be folly to discard it. But we cannot enter a value for the failure time either – it could be the next second or the next month, who knows? In life data analysis we mark these points as “censored” and we can still use them in the analysis – but obviously this analysis has to be of a different type than a normal least squares regression. For the power analysis, we also have the intuition that censoring will reduce the power – a censored data point simply does not carry all the information a “normal” point has. In our R function, CensPer is the variable giving what percentage of the available data is censored. I just attached the censored label randomly to data points – a better way would be to assume that the censored values are mostly those with a high life-time, as sketched below.
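For reference, a more realistic scheme would censor the items that are still alive when the study ends. A minimal sketch, with a purely hypothetical cut-off of 56 days (an 8-week study), could look like this:

# Sketch of end-of-study (type I) censoring instead of random censoring:
# everything still alive at the cut-off is marked as censored (0) and its
# observed time is truncated to the cut-off
cutoff <- 56                        # hypothetical study length of 8 weeks
Ts     <- rexp(40, rate = 1/90)     # simulated failure times
cs     <- as.integer(Ts <= cutoff)  # 1 = failure observed, 0 = censored
Ts     <- pmin(Ts, cutoff)          # censored items contribute only the cut-off time
# Surv(Ts, cs) can then be used in survreg() exactly as in SingleSetupRun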

Scale and power

Before looking at the results we still need to discuss another detail. When analyzing life data, as mentioned before, we can decide which distribution we assume for our data – and the simplest choice is the exponential, with a constant hazard rate. However, in real life we cannot be absolutely sure that this assumption holds – so in a real-life situation we should not pick the exponential but something that allows a little more flexibility. Fortunately, there is such a choice, the Weibull distribution, which is flexible enough and contains the exponential distribution as a special case. So, in the analysis our best option is to pick the Weibull distribution, and we should pick this option also when looking at the power in simulations. If you check the code, the line

mod = survreg(S ~ sdata$Fact, dist="weibull")

runs the regression, with the Weibull distribution as our choice. This has an additional advantage for the power analysis. If we look at the results of the regression, we will see a calculated value called “scale”. This gives us the direction of the effect detected by the regression – if the scale is greater than 1 the hazard increases with increasing stress factor, if it is smaller than 1 the hazard decreases with the stress. By looking at the distribution of the scale values we can judge how well a regression with the given sample and effect size can determine the direction of the effect. In principle, it is not enough to have a p-value less than 0.05 (a statistically significant effect); we would also need a scale that is greater than 1 (as we know that we set up the simulation in a way that the hazard increases with the strength of the factor).
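To make the mechanics explicit, here is a small stand-alone example of fitting one such model and applying this rule; the effect size and sample size below are picked purely for illustration.

# One simulated data set, one Weibull regression, and the two quantities
# used in the power simulation: the p-value of the stress coefficient and the scale
library(survival)

set.seed(42)
Fvals <- c(20, 40, 60, 80)                 # stress factor levels
Ns    <- 10                                # items per level
DM    <- 20                                # MTTF drops by 20 days per step (illustration only)
MTTFs <- 90 - DM * (0:3)                   # 90, 70, 50, 30 days
Fs    <- rep(Fvals, each = Ns)
Ts    <- rexp(length(Fs), rate = rep(1/MTTFs, each = Ns))

mod   <- survreg(Surv(Ts, rep(1, length(Ts))) ~ Fs, dist = "weibull")
pval  <- summary(mod)$table[2, 4]          # row 2 = Fs coefficient, column 4 = p
                                           # (the same entry as table[11] in the function above)
scal  <- summary(mod)$scale
pval < 0.05 && scal > 1                    # the decision rule used in this post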

Results

Well, the first and most important result is that the code above does give a pretty good idea of what the power will look like. Feel free to paste the code into RStudio and experiment with it.
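The cases discussed below can be reproduced (up to simulation noise, since every run generates new random data) with calls like these:

# Example calls mirroring the scenarios discussed below
SingleSetupRun(Nrun = 2000, DM = 10, Ns = 5)    # weak effect, tiny sample
SingleSetupRun(Nrun = 2000, DM = 10, Ns = 20)   # weak effect, larger sample
SingleSetupRun(Nrun = 2000, DM = 20, Ns = 5)    # medium effect
SingleSetupRun(Nrun = 2000, DM = 30, Ns = 5)    # strong effect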

Now some results: it will not come as a surprise that small effect sizes and small sample sizes result in very low power – it is almost hopeless to expect to see a significant effect. Moreover, the direction of the effect will be variable and misleading. For an effect of 10 days of shift in the MTTF and a sample of 5 per factor setting I got only 14% significant tests (a power of 14%), and of those only 16% had a scale>1 – basically useless for practical purposes.

Increasing the sample size will strongly increase the power, but has only a mediocre effect on identifying the right direction of the effect. At 10 samples per factor value the power increased to 18%, and of those 27% identified the right direction. At a sample size of 20 the power was 29% and scale>1 was at 32%. The numbers at the completely unrealistic sample size of 100 are: power at 84%, but percent scale>1 still only at 46%.

So, we can hope to identify the existence of an effect with this method, but the direction will be iffy even at very high sample sizes. However, if we have a stronger effect, both numbers improve drastically:

At an effect of 30 days on the MTTF, even at a sample size of 5 (!) we would have a power of 99% with scale>1 at 89%. This effect grows in a very non-linear way – at an effect of 20 days we would still only have a power of 45% with the scale correctly identified in only 26% of the cases. So, definitely, the method becomes more reliable at stronger effects.

How does censoring affect the power? As expected, all other things being equal, a higher percentage of censored values will reduce the power. For example, if we stay with the previous case but introduce censoring of 50% of our generated data, then the power decreases from 99% to 76% and the percent scale>1 from 89% to 75%.

The effect is not very strong though, and 50% censored does seem extreme, so there is hope that we can use this method, provided we manage to identify a suitable, effective stress factor.
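The censored variant of that scenario corresponds to a call like the following (again, the exact numbers will vary from run to run):

# Strong effect (DM = 30), small sample, but half of the data censored
SingleSetupRun(Nrun = 2000, DM = 30, Ns = 5, CensPer = 0.5)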

There is a lot more to find out with this simulation, and I encourage everyone to investigate this further on their own.

 

And, as always, any feedback is warmly welcome.