The Long and Winding Road to High Reliability
Although reliability engineering is a well-established discipline today, the concept itself has a long history of improvements made to ensure that electrical and electronic components are not only reliable but predictable. RF or microwave devices, subsystems, and systems have evolved in lockstep with the evolution of reliability engineering, and today even if a component is not explicitly called “hi-rel,” it will still be more reliable than a similar component of even a decade ago.
The foundation of what today is considered reliability prediction arose from probability and statistics described by Blaise Pascal and Pierre de Fermat in the 1600s, and it evolved into statistical quality and process control once producing goods in large quantities became possible in the 1920s. The word reliability itself was coined in 1816 by the poet Samuel Taylor Coleridge, who used it to compliment one of his contemporaries.
However, it took quite a while before an association with electrical devices was established, primarily because there were relatively few of them until just before World War II. Huge amounts of electronic equipment were needed in a very short time, revealing the fallibility of all types of electrical components, and especially vacuum tubes.
That’s not to say that no one took reliability into consideration before the war, and a classic example is Charles Lindbergh’s extraordinarily thorough assessment of engine reliability leading up to his 1927 solo flight from New York to Paris. It remains one of the most impressive feats of planning.
One of Lindbergh’s many crucial decisions was the number and reliability of the engines chosen to power the Spirit of St. Louis. He had a choice of using up to three engines, and (simple) logic dictates that multiple engines would be more reliable and safer than one. He had settled on a plane built by Ryan Aircraft, and as he had experience flying with Wright Aeronautical’s J-5C Whirlwind nine-cylinder, air-cooled engine (Figure 1), he insisted that it be used on the Ryan aircraft.
A three-engine configuration was ruled out as too complex, heavy, and costly, and although a twin-engine plane had benefits in terms of speed and fuel load, the twin-engine aircraft of his era required both engines to be functioning in order to fly. This meant that from a reliability standpoint a single failure was essentially a total one. But there were other conflicting factors, as a twin-engine aircraft would have an engine failure rate half that of a single-engine version, and the plane potentially had a higher probability of survival overall. On the other hand, Lindbergh highly valued the light weight and lower complexity of a single-engine aircraft.
To arrive at his ultimate decision, Lindbergh calculated based on his experience that if two engines were used and both were from the same lot, the likelihood of a defective engine in any lot was about 0.5%. So, if the lot size was 200 engines, one would be defective. Using this method, a twin-engine aircraft would have a 2-in-200 possibility of having a defective engine versus 1-in-200 for a single-engine aircraft, which therefore carries half the risk.
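The lot arithmetic can be checked with a short Python sketch. It is only an illustration under stated assumptions: the 0.5% defect probability comes from the text, while the function name and the treatment of each engine as an independent draw from the lot are assumptions made here, not a reconstruction of Lindbergh's own working.

```python
def prob_at_least_one_defective(p_defective: float, n_engines: int) -> float:
    """Chance that at least one of n engines is defective.

    Assumes each engine is an independent draw with the same defect
    probability (an assumption for illustration, not from the source).
    """
    return 1.0 - (1.0 - p_defective) ** n_engines


P_DEFECTIVE = 0.005  # "about 0.5%", i.e. 1 engine in a lot of 200

single_engine_risk = prob_at_least_one_defective(P_DEFECTIVE, 1)
twin_engine_risk = prob_at_least_one_defective(P_DEFECTIVE, 2)

print(f"single-engine risk: {single_engine_risk:.6f}")
print(f"twin-engine risk:   {twin_engine_risk:.6f}")
print(f"ratio twin/single:  {twin_engine_risk / single_engine_risk:.3f}")
```

For defect probabilities this small, the exact result for two engines is only slightly below the simple 2-in-200 sum, so the rule of thumb that doubling the engines roughly doubles the exposure holds.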
Viewed differently, assume that the Whirlwind engine had a 200-hr. (0.005 failure-per-hr.) operating life before requiring maintenance. With two engines, considered in series as both were required for flight, the total failure rate would be the sum of the individual failure rates, or 0.005 x 2 = 0.01 failures per hour. The two-engine plane would have an average of one engine failure every 100 hours of continuous operation, twice the failure rate of the single-engine option. So the latter is what he chose. Lindbergh obviously would not be inclined to make the same decision today, as engines (and many other things) are orders of magnitude more reliable. However, in 1927 his conclusion (and intuition) led him to a successful choice.
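The series calculation can also be sketched in a few lines of Python. The 0.005 failures-per-hour figure is taken from the text. The constant-failure-rate (exponential) survival model and the 33.5-hr. flight duration used in the last step are assumptions added for illustration, and the function names are hypothetical.

```python
import math


def series_failure_rate(rates_per_hour):
    """Failure rate of a series system: the sum of its parts' rates."""
    return sum(rates_per_hour)


def mean_hours_between_failures(rate_per_hour):
    """Average operating hours per failure for a constant failure rate."""
    return 1.0 / rate_per_hour


def survival_probability(rate_per_hour, hours):
    """Probability of no failure over `hours`, assuming a constant rate."""
    return math.exp(-rate_per_hour * hours)


ENGINE_RATE = 0.005   # failures per hour, from the text (200-hr. life)
FLIGHT_HOURS = 33.5   # assumed approximate duration of the crossing

single_rate = series_failure_rate([ENGINE_RATE])
twin_rate = series_failure_rate([ENGINE_RATE, ENGINE_RATE])

for label, rate in (("single", single_rate), ("twin", twin_rate)):
    print(
        f"{label}: {rate:.3f} failures/hr., "
        f"one failure per {mean_hours_between_failures(rate):.0f} hr., "
        f"P(no failure in flight) = {survival_probability(rate, FLIGHT_HOURS):.3f}"
    )
```

Under these assumptions the twin-engine arrangement is always less likely to complete the flight without an engine failure, which is the point of treating the engines as a series system.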
Lindbergh was also an extraordinary pilot and “intuitive” navigator who optimized every possible variable, from reducing the flight’s distance by 140 nm by using the great circle rather than the Mercator route to sometimes flying only 20 ft. above the water to correct for drift based on wave conditions. He was only 6 mi. off course when he reached Nova Scotia and 3 mi. off course (an accuracy of 0.1 deg. over a 1,700-mile distance) when he reached Ireland (he had predicted 50). In short, although Lindbergh was hailed as Lucky Lindy (and he was), his success had at least as much to do with meticulous reliability assessment and planning.
Reliability Meets Reality
Before World War II, the U.S. needed to make a rapid assessment of the capabilities and relative reliability of its subsystems, systems, and platforms. The results of one relatively comprehensive study showed, probably to no one’s surprise, that more than 50% of all electronic equipment and airborne platforms in the U.S. inventory could not meet minimum military requirements.
Although military requirements were defined only generally, it was obvious that a figure of merit was needed to determine a system’s reliability, so government procurement agencies looked to standardization of requirements on which they could base predictions. It was reasoned that without standards, accurately evaluating system performance would be difficult and often impossible, as components from different suppliers varied, as did their characteristics from one lot to the next and from one unit to the next.
In 1957 the Advisory Group on the Reliability of Electronic Equipment (AGREE) produced a report recommending that suppliers be held to more thorough reliability requirements and greater consistency, that the military find better ways to accurately determine why products fail, and that tests be conducted to establish statistical confidence for various types of components.
The report also accomplished two firsts: development of more stressful tests including temperature cycling and vibration (eventually resulting in MIL-STD-781), and the first definition of reliability as “the probability of a product performing without failure a specified function under given conditions for a specified period of time.”
At about the same time, Robert Lusser at Redstone Arsenal produced a report noting that electrical or electronic components were responsible for 60% of failures in an Army missile system he had sampled. This, he noted, was because the methods then used for determining the quality and reliability of electronic components were inadequate. In addition, Aeronautical Radio, Inc. (ARINC), created in 1929 by the FCC to coordinate and license commercial radio equipment, established a process it believed could reduce the infant mortality of vacuum tubes by a factor of four (Figure 2). And in 1956, RCA produced a document called TR-1100 dedicated to “Reliability Stress Analysis for Electronic Equipment,” which ultimately resulted in creation of the legendary or perhaps infamous Military Handbook 217 (MIL-HDBK-217).
The “handbook” has taken its fair share of criticism over the years for various inconsistencies, unrealistic assumptions, and many other things, but it’s interesting to note that although it was last updated in 1991, some test methods and other elements are still used today. Another major document from about the same time was created by the Rome Air Development Center (RADC). Entitled “Quality and Reliability Assurance Procedures for Monolithic Microcircuits,” it resulted in MIL-STD-883 and MIL-M-38510.
Over the years, many studies were conducted requiring thousands of hours of research and analysis by the government, industry, and academia, and significant results were achieved. This was not a moment too soon, because when the transistor arrived and other solid-state devices followed it, assessing the performance of these tiny devices required even more rigorous reliability testing.
This situation became far more difficult with the emergence of space flight, which introduced a host of potential failure mechanisms rarely if ever experienced on Earth. The result was the incredibly stringent set of requirements for “space qualification,” a term that like “hi-rel” (and “lite” beer) means little without qualification. By this time, the military and NASA had of necessity flown many missions above the atmosphere and in space using standard COTS components, a point to ponder.
Most space missions conducted by the Department of Defense, Defense Intelligence Agency, National Reconnaissance Office, National Geospatial-Intelligence Agency, and others rely on fully radiation-hardened, space-qualified parts. After all, using less radiation-tolerant devices makes little sense in billion-dollar missions. That said, even in space, DoD and NASA are under pressure to reduce costs, and not all their missions are long-term or high-altitude. In addition, the private space sector is advancing rapidly, and its missions so far are limited to low-Earth orbit. As a result, there is a need for devices that fill the gap between pure COTS, “upscreened” versions and far more costly devices that are radiation hardened by design. This market is just beginning to take shape.
Many of the demanding reliability requirements of space flight were the work of Willis Willoughby, who had a long, highly influential career at NASA ensuring that spacecraft worked reliably, including to the moon and back. The Navy Material Command later called on Willoughby’s talents to dramatically revise procurement standards and enact policies to improve reliability. Willoughby demanded that all contracts contain specifications for reliability and maintainability rather than simply those for performance. He’s also largely credited with emphasizing the importance of temperature cycling and random vibration, which later became environmental stress screening (ESS).
In the 1970s, accelerated life testing, the use of scanning electron microscopes for analysis and loose particle detection testing, and evaluation of plastic-packaged transistors and ICs were first realized. Important work included studies of soft error rates caused by alpha particles and accelerated testing of ICs with calculated activation energies of failure mechanisms. Bell Labs also initiated the goal of achieving communication system reliability with no more than two hours of downtime throughout 40 years of operation.
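Two of the ideas in this period lend themselves to a quick numerical sketch in Python. The first function converts the Bell Labs target (two hours of downtime in 40 years, from the text) into an availability figure. The second uses the Arrhenius relation, the model commonly paired with activation energies in accelerated testing. The text does not name a model, so its use here is an assumption, as are the function names and the example activation energy and temperatures, which are hypothetical values chosen only to exercise the formula.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K
HOURS_PER_YEAR = 365.25 * 24         # average year, including leap days


def availability(downtime_hours, service_years):
    """Fraction of the service life during which the system is up."""
    total_hours = service_years * HOURS_PER_YEAR
    return 1.0 - downtime_hours / total_hours


def arrhenius_acceleration(activation_energy_ev, use_temp_c, stress_temp_c):
    """Acceleration factor of a stress test relative to use conditions."""
    use_k = use_temp_c + 273.15
    stress_k = stress_temp_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (1.0 / use_k - 1.0 / stress_k)
    return math.exp(exponent)


# Bell Labs goal from the text: 2 hr. of downtime over 40 years.
bell_target = availability(downtime_hours=2.0, service_years=40.0)
print(f"required availability: {bell_target:.7f}")

# Hypothetical example values, not taken from the article.
example_factor = arrhenius_acceleration(0.7, use_temp_c=55.0, stress_temp_c=125.0)
print(f"example acceleration factor: {example_factor:.1f}")
```

The availability result works out to better than "five nines," which gives a sense of how ambitious the goal was for the period.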
In the 1980s, communications systems began to replace older mechanical switching systems with semiconductor-based devices and Bell Labs developed the first prediction methodology for telecommunications, while SAE developed a similar document (SAE 870050) for automotive applications. Predictability, and thus reliability techniques, continued to evolve as it was realized that the semiconductor die was not the sole cause of failures. As a result, the overall failure rate of electronic components dropped tenfold.
The PC brought reliability analysis to the benchtop and test bench, and for the first time, commercial programs were available for evaluating reliability. In the next decade, gallium arsenide began to permeate defense systems, especially phased-array radars, as well as commercial communications equipment and other applications.
Failure rate models were also developed based upon intrinsic defects that replaced some of the complexity-driven failure rates that dominated from the 1960s through the 1980s. This effort was led by RAC (the new name for RADC) and resulted in PRISM, a new approach to predictions. Reliability growth was recognized for components in this document. Many of the military specifications became obsolete and best commercial practices were often adopted.
The emergence of standards, test routines, and sophisticated equipment had by this time made it possible to predict the failure rate of solid-state devices very accurately. In fact, it was arguable that COTS devices were in many cases as reliable as those used in defense systems, which cost 10 times as much. To realize the cost benefits of COTS, Defense Secretary William Perry in 1994 required their use whenever possible in defense systems.
Summing It Up
Looking back on nearly a century of improvements in reliability, and in particular those of RF and microwave components, the results are impressive (Figure 3). Even vacuum tubes, which in the minds of most people no longer exist, are still powering everything from satellite transponders to radar and electronic warfare systems, and their reliability and longevity continue to improve. Small-signal and RF power devices operate for millions and even tens of millions of hours without failure, and the reliability of both active and passive components in general is predictable with impressive levels of accuracy.