Five years ago I had to make an emergency run to Fry’s because the video card in my wife’s computer had died and she had a raid scheduled with her gaming guild. Faced with a wide selection of high-end cards, I picked her current card because it had a lifetime warranty. Today I’m having to replace her video card again, not because it’s defective. The card works fine; it just can’t handle the demands of a new game. That lifetime warranty provided little more value than a marketing ploy.

Reliability and warranty go hand in hand but the relationship is not as obvious as you might think. You hear stories all the time of equipment that fails right after the warranty period and wonder if the companies are designing their equipment to fail the day after the warranty expires.

Of course there are exceptions too. Much to my wife’s dismay, our oven is still going strong after 25 years of service. I don’t know what the typical warranty on an oven is but I’m sure it’s far less than 25 years. Then there’s the other side of the coin. Read the reviews on any popular electronics device and you’ll always see a few where it failed within a day, a month or a year and therefore isn’t fit to be used to keep their cat’s litterbox in place.

Do these early failures mean the company has poor quality and does no testing? Is my oven a sign of really great quality that isn’t seen anymore because everything built today uses inferior parts? The answers are not as clear as you might think and a lot of research goes into getting the numbers right.

Since my recently departed video card is fresh in my mind, I’ll start with the fan as an example. I’ll simplify the analysis by only considering the bearings as a failure mechanism. As the fan shaft rotates on the bearing surface, like a whetstone on a knife blade, friction removes a small amount of material. Sooner or later enough material is removed that the fan can no longer rotate and it stops working. We’ve all seen fans that had to be “helped” to start. A slight poke and it starts turning until the next time it’s turned off. That’s a worn out bearing.

Let’s make you the engineer in charge of our fan design. You have a number of choices. You can make the bearing surface larger to share more of the load, thereby cutting down on the wear but reducing the efficiency of the motor. You can use harder materials to reduce wear but this increases cost. You could put a smoother finish on the material you are using but this also increases cost. Nobody is willing to pay more for a fan that will outlive all the other components on the card. It’s always a tradeoff.

There are a number of other factors that will affect the life of our fan: the speed it runs at (increased wear on the bearing), how big the fan blades are (increased load on the bearing), what kind of tolerances are used to produce the bearing (you can make all the calculations you want but they won’t overcome sloppy workmanship) and finally how often the fan will be used.

Give all that to a reliability engineer and he’s going to tell you something like “you can expect 50 failures in a million hours of operation,” or that the Mean Time Between Failures (MTBF) on this fan design is 20,000 hours. (The MTBF is the hours of operation divided by the number of expected failures, and yes, I picked the number at random to make later calculations easier.) Now when you say “great, I can give this fan a 20,000 hour warranty,” he’s going to get a sour look on his face and try to explain probability and the bell curve to you. He may even throw in the bathtub curve, but it’s all related.
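To make that arithmetic concrete, here is a minimal Python sketch. The function name mtbf_hours is made up for illustration; the inputs are the figures quoted above.

```python
def mtbf_hours(total_operating_hours, expected_failures):
    """MTBF as defined above: hours of operation divided by expected failures."""
    return total_operating_hours / expected_failures


# 50 expected failures in a million hours of operation
fan_mtbf = mtbf_hours(1_000_000, 50)
print(fan_mtbf)  # 20000.0
```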

[Image: standard deviation diagram]

I’ll try to make this easier. The bell curve looks just like a bell. The top of the bell marks the midpoint of the failures for fans built to your specifications. Half will fail sooner and half will fail later. This point is your MTBF. If, like me, you have problems believing the bell curve is an accurate representation, remember I limited our fan discussion to bearing failure due to wear. When you have multiple components, each with a different failure rate, the bell curve becomes a very useful tool and amazingly accurate.

As you can see from the chart, if the warranty was for 20,000 hours, half the fans would probably fail inside warranty. In case you didn’t guess, that’s really bad for business. Going back to the bell curve in the chart, if I want to limit my warranty return rate to about two percent, my warranty period has to be roughly one third of my MTBF.
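If you want to check those numbers yourself, here is a rough Python sketch using the standard library’s NormalDist. The spread of the bell is my assumption, not something stated above: I set the standard deviation to one third of the MTBF, because that is the spread at which “one third of the MTBF” sits two standard deviations below the peak. A real fan would have its own spread, so treat the output as illustrative only.

```python
from statistics import NormalDist

MTBF_HOURS = 20_000
# Assumption for illustration: standard deviation is one third of the MTBF.
SIGMA_HOURS = MTBF_HOURS / 3

fan_life = NormalDist(mu=MTBF_HOURS, sigma=SIGMA_HOURS)


def fraction_failed_by(hours):
    """Share of fans expected to have failed by the given hour count."""
    return fan_life.cdf(hours)


def warranty_for_return_rate(rate):
    """Longest warranty (in hours) that keeps expected returns at or below rate."""
    return fan_life.inv_cdf(rate)


print(f"Failed by the MTBF:      {fraction_failed_by(MTBF_HOURS):.1%}")
print(f"Failed by MTBF / 3:      {fraction_failed_by(MTBF_HOURS / 3):.1%}")
print(f"Warranty for 2% returns: {warranty_for_return_rate(0.02):,.0f} hours")
```

Under that assumed spread, a warranty equal to the MTBF covers half of all failures, while a warranty of one third of the MTBF covers a little over two percent of them.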

Of course, as GM has found out on numerous occasions, even the best reliability engineers can’t help you if your suppliers are providing junk.

[Image: bathtub curve]

This would also be a good time to introduce the bathtub curve. It’s an unfortunate fact of life that some assemblies are destined to fail early in their life. Maybe the assembler had a hangover on that day or the inspector was planning a hot date for that evening. Either way, that assembly probably won’t survive much beyond the initial power on. That gives us two curves: early failures, usually caused by poor workmanship, and the bell curve representing a normal failure distribution. Connect the two curves and you see the bathtub between them.

Most manufacturers use testing and burn-in to eliminate as many of the early failures as possible. Letting those failures get out the door into your customer’s hands will get you extremely poor reviews. Looking at the bathtub curve, you also see the probability of failure never goes to zero. There’s always a chance the unit will fail right out of the box.
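The two-curve picture can be sketched in a few lines of Python. Every number here is invented for illustration: I assume three percent of units are early failures that die off quickly (modeled as an exponential decay averaging 200 hours), and the rest follow a bell curve centered on a 20,000 hour MTBF with a spread of one third of that. Reliability engineers use more careful models; this only shows how adding the two curves produces a tub shape.

```python
import math
from statistics import NormalDist

# All values below are hypothetical, chosen only to make the shape visible.
EARLY_FRACTION = 0.03       # share of units with workmanship defects
EARLY_MEAN_HOURS = 200.0    # average life of a defective unit
MTBF_HOURS = 20_000.0
SIGMA_HOURS = MTBF_HOURS / 3

wear_out = NormalDist(mu=MTBF_HOURS, sigma=SIGMA_HOURS)


def failure_density(hours):
    """Relative likelihood of a failure at the given age: early curve plus bell curve."""
    early = EARLY_FRACTION * math.exp(-hours / EARLY_MEAN_HOURS) / EARLY_MEAN_HOURS
    late = (1 - EARLY_FRACTION) * wear_out.pdf(hours)
    return early + late


for hours in (0, 500, 3_000, 10_000, 20_000):
    print(f"{hours:>6} h  {failure_density(hours):.2e}")
```

The printed values start high, drop to a low floor a few thousand hours in, and climb again toward the MTBF. The floor is small but never zero.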

All the components on my video card and my oven have similar failure curves. How long they will survive before failing becomes a matter of probability and parts count. The more parts that can fail, the higher the probability that something will fail.
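Here is a small Python sketch of the parts-count effect. It assumes each part fails independently of the others and that any single failure kills the product, which is a simplification. The two percent figure and the function name are made up for illustration.

```python
import math


def chance_any_part_fails(part_failure_chances):
    """Chance that at least one part fails, assuming the parts fail independently."""
    chance_all_survive = math.prod(1 - p for p in part_failure_chances)
    return 1 - chance_all_survive


# One part with a 2% chance of failing during the warranty period...
print(f"{chance_any_part_fails([0.02]):.1%}")

# ...versus fifty such parts, any one of which takes the product down.
print(f"{chance_any_part_fails([0.02] * 50):.1%}")
```

With fifty parts at two percent each, the chance that something fails is well over half, even though every individual part looks reliable.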

My grandmother’s washboard never broke but it had far fewer parts than my washer. I know that’s an unfair comparison, but today’s equipment offers significant improvements in efficiency and capability. These improvements come at the cost of higher complexity and lower reliability. Yes, your old black and white television lasted years but you expect a lot more from your home entertainment center today.

I’ve simplified the explanations and avoided much of the math but you can see where this is going. If I’m a manufacturer, I want a warranty period much shorter than my MTBF. If your hard drive failed a week after the warranty ran out, it was probably bad luck on your part.

If my oven is still running after 25 years, I could simply be lucky but it suggests my microwave is used more often than my oven and I indulge my wife’s cravings for take out a little too often. It’s all in the numbers.

© 2015 – 2019, Byron Seastrunk. All rights reserved.