Five years ago I had to make an emergency run to Fry’s because the video card in my wife’s computer had died and she had a raid scheduled with her gaming guild. From a wide selection of high end cards, I selected her current card because it had a lifetime warranty. Today I’m having to replace her video card again, not because it’s defective (the card works fine) but because it can’t handle the demands of a new game. That lifetime warranty provided little more value than a marketing ploy.
Reliability and warranty go hand in hand but the relationship is not as obvious as you might think. You hear stories all the time of equipment that fails right after the warranty period and wonder if the companies are designing their equipment to fail the day after the warranty expires.
Of course there are the exceptions too. Much to my wife’s dismay, our oven is still going strong after 25 years of service. I don’t know what the typical warranty on an oven is but I’m sure it’s far less than 25 years. Of course there’s also the other side of the coin. Read the reviews on any popular electronics device and you’ll always see a few where it failed that day, month or year and therefore isn’t fit to be used to keep their cat’s litterbox in place.
Do these early failures mean the company has poor quality and does no testing? Is my oven a sign of really great quality that isn’t seen anymore because everything built today uses inferior parts? The answers are not as clear as you might think and a lot of research goes into getting the numbers right.
Since my recently departed video card is fresh in my mind, I’ll start with the fan as an example. I’ll simplify the analysis by only considering the bearings as a failure mechanism. As the fan shaft rotates on the bearing surface, like a whetstone on a knife blade, friction causes a small amount of material to be removed. Sooner or later enough material is removed that the fan can no longer rotate and it stops working. We’ve all seen fans that had to be “helped” to start. A slight poke and it starts turning until the next time it’s turned off. That’s a worn out bearing.
Let’s make you the engineer in charge of our fan design. You have a number of choices. You can make the bearing surface larger to share more of the load, thereby cutting down on the wear but reducing the efficiency of the motor. You can use harder materials to reduce wear but this increases cost. You could put a smoother finish on the material you are using but this also increases cost. Nobody is willing to pay more for a fan that will outlive all the other components on the card. It’s always a tradeoff.
There are a number of other factors that will affect the life of our fan: the speed it runs at (increased wear on the bearing), how big the fan blades are (increased load on the bearing), what kind of tolerances are used to produce the bearing (you can make all the calculations you want but they won’t overcome sloppy workmanship) and finally how often the fan will be used.
Give all that to a reliability engineer and he’s going to tell you something like “you can expect 50 failures in a million hours of operation” or “the Mean Time Before Failure (MTBF) on this fan design is 20,000 hours.” (The MTBF is the hours of operation divided by the number of expected failures, and yes, I picked the number at random to make later calculations easier.) Now when you say “great, I can give this fan a 20,000 hour warranty,” he’s going to get a sour look on his face and try to explain probability and the bell curve to you. He may even throw in the bathtub curve but it’s all related.
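For anyone who wants to check the arithmetic, here is a minimal sketch of the MTBF calculation described above. The function name is my own invention for illustration, and the inputs are the made-up fan numbers from this article, not data from a real fan.

```python
def mtbf_hours(total_operating_hours, expected_failures):
    """MTBF as defined above: hours of operation divided by expected failures."""
    if expected_failures <= 0:
        raise ValueError("need at least one expected failure to compute an MTBF")
    return total_operating_hours / expected_failures


# The article's illustrative fan: 50 failures in a million hours of operation.
fan_mtbf = mtbf_hours(1_000_000, 50)
print(fan_mtbf)  # 20000.0
```

The same number can be read the other way around: one failure per 20,000 hours is the failure rate the engineer quoted.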
I’ll try to make this easier. The bell curve looks just like a bell. The top of the bell indicates the point of equal probability that a fan built to your specifications will fail. Half will fail sooner and half will fail later. This point is your MTBF. If, like me, you have problems believing the bell curve is an accurate representation, remember I limited our fan discussion to bearing failure due to wear. When you have multiple components, each with a different failure rate, the bell curve becomes a very useful tool and amazingly accurate.
As you can see from the chart, if the warranty was for 20,000 hours, half the fans would probably fail inside warranty. In case you didn’t guess, that’s really bad for business. Going back to the bell curve, if I want to limit my warranty return rate to two percent, my warranty period has to be one third of my MTBF.
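If you would like to play with the bell curve yourself, the sketch below models fan life as a normal distribution centered on the 20,000 hour MTBF. The width of the bell is not given anywhere in this article, so the 6,500 hour standard deviation is purely my assumption, picked because it roughly reproduces the one third figure above. A different spread would give a different warranty period.

```python
from statistics import NormalDist

MTBF_HOURS = 20_000
SIGMA_HOURS = 6_500  # ASSUMPTION: spread of the bell curve, not from the article

# Bell curve of fan lifetimes, centered on the MTBF.
fan_life = NormalDist(mu=MTBF_HOURS, sigma=SIGMA_HOURS)


def fraction_failed_by(hours):
    """Fraction of fans expected to have failed by the given hour."""
    return fan_life.cdf(hours)


def warranty_for_return_rate(rate):
    """Longest warranty (in hours) that keeps expected returns at the given rate."""
    return fan_life.inv_cdf(rate)


print(fraction_failed_by(MTBF_HOURS))   # 0.5, half the fans fail before the MTBF
print(warranty_for_return_rate(0.02))   # roughly 6,650 hours, about a third of the MTBF
```

Under these assumed numbers, a two percent return rate lands at about a third of the MTBF, which is why nobody offers a warranty as long as the MTBF.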
Of course, as GM has found out on numerous occasions, even the best reliability engineers can’t help you if your suppliers are providing junk.
This would also be a good time to introduce the bathtub curve. It’s an unfortunate fact of life that some assemblies are destined to fail early in their life. Maybe the assembler had a hangover on that day or the inspector was planning a hot date for that evening. Either way, that assembly probably won’t survive much beyond the initial power on. That gives us two curves: early failures, usually caused by poor workmanship, and the bell curve representing a normal failure distribution. Connect the two curves and you see the bathtub between them.
Most manufacturers use testing and burn in to eliminate as much of the early failures as possible. Letting those failures get out the door into your customer’s hands will get you extremely poor reviews. Looking at the bathtub curve, you also see the probability of failure never goes to zero. There’s always a chance the unit will fail right out of the box.
All the components on my video card and my oven have similar failure curves. How long they will survive before failing becomes a matter of probability and parts count. The more parts that can fail, the higher the probability that something will fail.
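To put rough numbers on the parts count argument, here is a small sketch. It assumes every part fails independently of the others and that the device dies as soon as any one part does. The 99 percent per-part survival figure is hypothetical, chosen only to show the trend.

```python
from math import prod


def system_survival(part_survival_probabilities):
    """Chance that every part survives, assuming independent failures.

    The device is treated as failed the moment any single part fails.
    """
    return prod(part_survival_probabilities)


# Hypothetical parts, each with a 99% chance of surviving the period of interest.
simple_device = [0.99] * 10
complex_device = [0.99] * 100

print(round(system_survival(simple_device), 3))   # 0.904
print(round(system_survival(complex_device), 3))  # 0.366
```

With the same quality of parts, the ten part device survives about 90 percent of the time while the hundred part device survives only about 37 percent of the time.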
My grandmother’s washboard never broke but it had far fewer parts than my washer. I know that’s an unfair comparison but today’s equipment offers significant improvements in efficiency and capability. These improvements come at the cost of higher complexity and lower reliability. Yes, your old black and white television lasted years but you expect a lot more from your home entertainment center today.
I’ve simplified the explanations and avoided much of the math but you can see where this is going. If I’m a manufacturer, I want a warranty period much shorter than my MTBF. If your hard drive failed a week after the warranty ran out, it was probably bad luck on your part.
If my oven is still running after 25 years, I could simply be lucky but it suggests my microwave is used more often than my oven and I indulge my wife’s cravings for take out a little too often. It’s all in the numbers.
© 2015 – 2019, Byron Seastrunk. All rights reserved.
A good article to make things easy to understand sir, i enjoyed reading it,
Question: when a designer wants to check his design, he probably chooses an accelerated life test, which not only eliminates early failures but also proves the life of the system and its critical design elements. If my burn in test makes my product survive for the warranty period, do I really need to invest time and money in accelerated tests?
I really can’t give you an answer although it seems you’ve already decided on one. Still, since you asked, I’ll give you my opinion. If you’re buying your components from reputable vendors, straight burn in, without thermal cycling, buys you nothing. Take a look at your burn in failure data. If it’s zero or extremely low, your burn in is wasted effort. Environmental Stress Screening (ESS) requires you to do thermal cycling. This is far more effective than simple burn in because it helps find mechanical issues, such as poor solder joints, that burn in will never find. Accelerated testing will improve your product’s reliability and greatly reduce your chamber time but does, as you noted, increase your investment costs.
Bottom line, if you’re meeting your warranty period and you have no desire to improve your first pass yield (take a look at your rework costs), what you’re doing is fine. I would highly suggest looking at your burn in failure rate. If it’s extremely low, your burn in is largely ineffective and could be eliminated.
I recently had an experience with a bad bearing on a Washing Machine. It is 12 years old and according to the M****G manufacturer the life expectancy of the machine is rated at 10 years. Naturally, as an engineer I began the process of repairing the machine. A certified repair person quoted me $967, so I took on the responsibility of doing it myself. After $179 in parts, the machine and some ingenuity in creating a tool to push the bearings out, the machine was repaired and is fully functional (much quieter than before due to the logic explained in your article about wear and tear). Unfortunately, my wife could not wait the one week it took to repair it and we bought a brand new washer for $1200. They offered me the 5-year extended warranty that I happily turned down because now I am a washing machine repair expert and can handle it myself 🙂