# IN-SITU, REAL-TIME DETECTOR FOR FAULTS IN SOLDER JOINT NETWORKS BELONGING TO OPERATIONAL, FULLY PROGRAMMED FIELD PROGRAMMABLE GATE ARRAYS (FPGAS)

James P. Hofmeister
Sr. Principal Engineer
Ridgetop Group, Inc.
6595 N. Oracle Rd.
Tucson, AZ 85750
(520) 742-3300
hoffy@ridgetopgroup.com

Pradeep Lall
Thomas Walter Assoc. Professor
Auburn University
Dept of Mechanical Engineering and CAVE
Auburn, AL 36849
(334) 844-3424
lall@eng.auburn.edu

Russell Graves
Dir., Radiation Engr.
Ridgetop Group, Inc.
6595 N. Oracle Rd.
Tucson, AZ 85750
(520) 742-3300
russ@ridgetopgroup.com

Abstract - In this paper we introduce an in-situ solder-joint built-in self-test (SJ BIST) for detecting high-resistance faults in operational, fully-programmed field programmable gate arrays (FPGAs). The approach is simple to implement, offers a method to detect high-resistance faults that result from damaged solder-joints, and uses a maximum of one small capacitor externally-connected to each selected test pin or each group of two test pins.

#### INTRODUCTION

This paper introduces an innovative, in-situ solder-joint built-in self-test (SJ BIST) to detect high-resistance damage to solder-joint networks of fully operational Field Programmable Gate Arrays (FPGAs) in ball-grid array (BGA) packages such as the XILINX® FG1152/FG1156. FPGAs are used in many kinds of control systems in both defense and commercial applications.

A prototype two-port group SJ BIST core was designed, programmed, simulated, synthesized and loaded into an FPGA on a development board. The SJ BIST core correctly detects and reports instances of high resistance without false errors. The initial test results are presented in this paper. Initial designs for Highly Accelerated Life Test (HALT) experiments have been completed, and we plan on fabricating boards, populating them with programmed FPGAs and conducting HALTs at both the Center for Advanced Vehicle Electronics (CAVE) at Auburn University and at a Department of Defense contractor during the Phase II period of a Small Business Innovation Research contract award. Evaluation of the SJ BIST is also being conducted at a German university under the sponsorship of an automobile manufacturer.

## **Mechanics-of-Failure**

Solder-joint damage under thermo-mechanical and shock stresses is cumulative, and damage manifests in the form of plastic work and cracks, which propagate until eventual fracture of solder joints [1-4], resulting in FPGA operational failures. An illustration of a fractured solder joint (or bump) under thermo-mechanical stresses is shown in Figure 1. Thermo-mechanical stresses may result from differential expansion under environmental and operational temperature exposure due to mismatches in coefficients of thermal expansion (CTE). Shock loads may be imposed during shipping and normal operation in harsh environments. Even though one or more solder balls (bumps) are cracked, a solder-joint network belonging to a damaged bump might not immediately experience a catastrophic failure. One reason is that the other solder balls of the BGA package remain intact and tend to keep the package pressed toward the board, maintaining electrical contact between the surfaces of cracks [4-6].

However, subsequent mechanical vibration or shock tends to cause such cracked bumps to momentarily open and cause hard-to-diagnose faults of high resistance ($100\Omega$, $300\Omega$, $500\Omega$ and $1000\Omega$ have been used as threshold levels [1,7-10]) lasting for periods ranging from hundreds of nanoseconds, or less, to more than $1\mu s$ [1,5,10].



Figure 1: Crack Propagation at the Top and Bottom of a Solder Joint, 15mm BGA [2].

These intermittent faults increase in frequency, as evidenced by the practice of logging BGA package failures only after multiple events of high resistance: an initial event followed by some number (for example, 2 to 10) of additional events within a specified period of time, such as ten percent of the number of cycles of the initial event [8-10]. Even then, an intermittent high-resistance fault in a solder-joint network might not result in an operational fault. For example, the high-resistance fault might happen in a ground or power connection, or it might happen during a period when the network is not being written, or it might be too short in duration to cause a signal error. Figure 2 shows a shock-actuated intermittent open (high resistance) of a package interconnect.
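The logging practice described above can be sketched in a few lines. This is only an illustration of the counting rule, not the procedure of any cited standard; the function name `first_logged_failure`, its parameters, and the sample cycle numbers are hypothetical.

```python
def first_logged_failure(event_cycles, extra_events=9, window_fraction=0.10):
    """Return the cycle number of the first initial event that is followed by
    at least `extra_events` further high-resistance events within
    `window_fraction` of its own cycle count, or None if no event qualifies.

    event_cycles: cycle numbers at which high-resistance events were seen.
    """
    cycles = sorted(event_cycles)
    for i, start in enumerate(cycles):
        # The window is a fraction of the cycle count of the initial event.
        window_end = start + window_fraction * start
        followers = [c for c in cycles[i + 1:] if c <= window_end]
        if len(followers) >= extra_events:
            return start
    return None


# Hypothetical event log: an isolated event, then a cluster of events.
events = [1000, 1500, 1510, 1520, 1600]
print(first_logged_failure(events, extra_events=2))  # cluster starting at 1500
print(first_logged_failure(events, extra_events=9))  # not enough followers
```

The isolated event at cycle 1000 is never logged on its own, which is the point made in the text: a logged failure already stands for several high-resistance events.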

Figure 3 represents HALT results for XILINX FG1156 Daisy Chain packages in which 30 out of 32 tested packages failed in a test consisting of 3108 cycles. The temperature cycle of the HALT was a transition from -55°C to 125°C in 30 minutes: 3-minute ramps and 12-minute dwells. What is not immediately apparent is that each of the logged FPGA failures (diamond symbols) represents at least 30 events of high resistance: a FAIL was defined as at least 2 OPENs (net resistance of $500\Omega$ or higher) within one temperature cycle, and a failure was logged only after 15 FAILs [9]. A single fault in a temperature cycle was not counted as a FAIL event.



Figure 2: Shock-actuated Failure: Transient Strain and Resistance.



Figure 3: Representation of XILINX FPGA HALT Test Results [9].

# Location of Greatest Stress on FPGA I/O Ports

The I/O ports of an FPGA nearest the edges of the BGA package, especially those nearest one of the four corners of the package, experience the greatest thermo-mechanical stresses [11-14]. For this reason, the corner I/O solder joints of the XILINX FG1156 are either not used or they are used as additional ground connections. This means that I/O ports on the outer edge of the BGA package that are near one of the four corners are strong candidates for SJ BIST testing because those ports are likely to fail first.

#### State of the Art

In previous work, the authors have demonstrated the use of leading indicators of failure for prognostication of electronics [11-14]. One important reason for using an in-situ SJ BIST is that stress magnitudes are hard to derive, much less keep track of, which leads to inaccurate life expectancy predictions [15]. Another reason for using an in-situ SJ BIST is that even though a particular damaged solder-joint port might not result in immediate FPGA operational failure, the damage indicates the FPGA is likely to have other I/O ports that are damaged – the FPGA is no longer reliable. An in-situ SJ BIST can also be used in newly designed manufacturing reliability tests to address a concern that failure modes caused by the PWB-FPGA assembly are not being detected during component qualification [6].

Prior to this innovation, there were no known methods for detecting high-resistance solder-joint faults in operational, fully-programmed FPGAs. Furthermore, FPGAs are not amenable to the measurement techniques typically used in manufacturing reliability tests such as Highly Accelerated Life Tests (HALTs) [4]. This is because those measurement techniques require devices to be powered off, and because FPGA I/O ports are digital, rather than analog, circuits, an example of which is shown in Figure 4.

Modern BGA FPGAs, such as the fine-pitch XILINX FG1156, have more than a thousand I/O ports and very small pitch and ball sizes. For example, the FG1156 has a 34 x 34 array of nominal 0.60 mm solder balls with a pitch of 1.0 mm (see Figure 5). This tends to make physical inspection techniques impractical.

### IN-SITU SJ BIST INNOVATION

The in-situ SJ BIST innovation requires the attachment of a small capacitor to an I/O port, preferably an unused port near a corner of the package. The SJ BIST writes a logical '1' to charge the capacitor and then reads the voltage across the charged capacitor. If the solder-joint network is undamaged, the write causes the capacitor to be fully charged and a logical '1' is read by the SJ BIST. When the solder-joint network is sufficiently damaged, the RC time constant becomes large, the capacitor is insufficiently charged, a logical '0' instead of a logical '1' is read by the SJ BIST and a fault is reported.
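The write-then-read principle can be illustrated with an idealized first-order RC model. This is a minimal sketch under stated assumptions, not the authors' implementation: it treats the solder-joint network as a single series resistance, ignores the output impedance of the I/O driver, and uses a hypothetical 1 µs charge time and a hypothetical 2.0 V input threshold. It is not intended to reproduce the measured waveforms.

```python
import math


def capacitor_voltage(v_drive, resistance_ohm, capacitance_f, charge_time_s, v_start=0.0):
    """Voltage on a capacitor charged through a series resistance for a fixed time."""
    tau = resistance_ohm * capacitance_f  # RC time constant in seconds
    return v_drive + (v_start - v_drive) * math.exp(-charge_time_s / tau)


def read_bit(voltage, v_threshold=2.0):
    """Model the input buffer as a simple comparator (assumed threshold)."""
    return 1 if voltage >= v_threshold else 0


def write_one_then_read(resistance_ohm, capacitance_f, charge_time_s,
                        v_drive=3.3, v_threshold=2.0):
    """Write a '1' for charge_time_s, then report the bit that is read back."""
    v = capacitor_voltage(v_drive, resistance_ohm, capacitance_f, charge_time_s)
    return read_bit(v, v_threshold)


# Hypothetical values: 100 nF capacitor, 1 microsecond of charging.
healthy = write_one_then_read(1.0, 100e-9, 1e-6)    # low network resistance
damaged = write_one_then_read(100.0, 100e-9, 1e-6)  # high network resistance
print(healthy, damaged)
```

A fault is flagged whenever the bit read back differs from the bit written; in this model the only thing that changes between the two calls is the series resistance.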



Figure 4: Example of an FPGA I/O Buffer [16].



Figure 5: Bottom View of a XILINX FG1156 – Package Size is 35 x 35 mm with a 34 x 34 Array of Solder Balls of Nominal Diameter of 0.6 mm and a Pitch of 1.0mm [17].

# **SJ BIST Description**

This SJ BIST description is for two cases: one in which the solder-joint network is undamaged, and one in which the solder-joint network is damaged enough to cause errors (faults) in I/O signals.

#### **Undamaged Solder Joint**

Referring to Figure 6, the top picture is a 1.0 MHz clock input to a test FPGA and the bottom is the signal across a 1.0  $\mu$ F capacitor connected to two I/O ports selected for testing. The signal across the capacitor is caused by the SJ BIST writing '1s' and '0s.'



Figure 6: Solder Joint BIST – Input Clock (top) and Signal Across the Capacitor (bottom):  $2\mu s$  x 2.0V Grid.

Still referring to Figure 6, the positive pulse of the first clock causes the SJ BIST (configured for a two-port BIST) to write a logical one ('1') through the first I/O port to the capacitor. The charged voltage on the capacitor is read from the second I/O port during the positive pulse of the second clock. For this test, a one (3.3V) is both written and read, so the SJ BIST then writes a logical zero ('0') during the third clock through the first I/O port to the capacitor to discharge it. During the 4<sup>th</sup> clock, the capacitor voltage is read from the second I/O port. In this case, because a '0' was both written and read, the first I/O port is evaluated as being okay and no fault is reported.

During the next set of four clock periods, the SJ BIST writes through the second I/O port instead of the first I/O port; the SJ BIST reads from the first I/O port instead of the second I/O port.
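The eight-clock test sequence for the undamaged case can be written out as a simple schedule. This is a sketch for illustration only; the tuple layout, the names `TWO_PORT_SCHEDULE` and `step_at`, and the zero-based clock index are assumptions, and the fault-handling behavior of the core is not modeled.

```python
# Each entry is (action, port, bit): the port that is driven or sampled on
# that clock and the logic value written or expected.
TWO_PORT_SCHEDULE = [
    ("write", 1, 1),  # clock 1: write '1' through port 1
    ("read", 2, 1),   # clock 2: read back through port 2
    ("write", 1, 0),  # clock 3: write '0' through port 1
    ("read", 2, 0),   # clock 4: read back through port 2
    ("write", 2, 1),  # clocks 5-8: same test with the ports swapped
    ("read", 1, 1),
    ("write", 2, 0),
    ("read", 1, 0),
]


def step_at(clock_index):
    """Return the scheduled step for a zero-based clock index; the sequence repeats."""
    return TWO_PORT_SCHEDULE[clock_index % len(TWO_PORT_SCHEDULE)]


for clock in range(8):
    print(clock + 1, step_at(clock))
```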

The 1.0 $\mu\text{F}$ capacitor was then replaced by a 100 nF capacitor, which in turn was connected to the FPGA I/O port via network wiring having a resistance of 1.0 $\Omega$, which is higher than the total 100 m$\Omega$ resistance we expect as the maximum resistance of a solder ball and the network to which it is attached. Figure 7 shows the resulting signal, which, as expected, did not cause the SJ BIST to detect a fault.



Figure 7: Signal Across a 100 nF Capacitor Attached to a Solder Joint with a Network Resistance of 1.0  $\Omega$ : 2 $\mu$ s x 1.0V Grid.

## Damaged Solder Joint

Figure 8 shows the signal when the network resistance was increased from 1.0 to 100 $\Omega$. Referring to the first positive pulse of the output, the capacitor is charged through a first I/O port during a first clock and the voltage across the capacitor is read through a second I/O port during a second clock. Because of the increase in the network resistance, the charged voltage across the capacitor is about 2.0V instead of 3.3V, and the voltage at the output of the input buffer connected to the second I/O port (see Figure 4) is less than 2.0V. The buffer therefore reads a logical '0' instead of a logical '1'; this is a fault and the SJ BIST detects it.
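As a rough check on why the two cases behave differently, the first-order RC time constants of the two test networks can be compared. This is an idealized estimate that assumes a single lumped resistance $R$ in series with the 100 nF capacitor $C$; it ignores the output impedance of the I/O driver, which the text does not give.

$$
\tau = RC, \qquad
\tau_{1.0\,\Omega} = 1.0\,\Omega \times 100\,\text{nF} = 100\,\text{ns}, \qquad
\tau_{100\,\Omega} = 100\,\Omega \times 100\,\text{nF} = 10\,\mu\text{s}
$$

Under the same assumption, charging from 0 V toward 3.3 V follows $V(t) = 3.3\,(1 - e^{-t/\tau})$ volts, so reaching 2.0 V takes $t = \tau \ln(3.3/1.3) \approx 0.93\,\tau$. Raising the network resistance by a factor of 100 lengthens that time by the same factor.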

Still referring to Figure 8, in a third clock period, the fault is recorded by incrementing a fault counter for that I/O port and a '0' is written through that first I/O port. As seen, however, the capacitor does not fully discharge, which is another fault condition. In the next three clocks, the second fault is detected and evaluated, and then a '0' is written through the second I/O port instead of the first. As seen, the second write '0' is successful: the capacitor is fully discharged. The second fault condition is evaluated by the SJ BIST as being a continuation of the previously detected fault and it is not recorded as a new fault, so a fault counter is not incremented. Had the immediately preceding write-read of '1' been successful, this write-read of '0' would have been recorded as a new fault.
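The counting rule described above, in which a failed write-read '0' is treated as a continuation when it directly follows a failed write-read '1', can be sketched as follows. The class and method names are hypothetical, and the sketch assumes that every failed '1' test increments the counter; the text does not say how a fault spanning several test cycles is counted.

```python
class PortFaultCounter:
    """Per-port fault counter with the continuation rule (illustrative only)."""

    def __init__(self):
        self.count = 0
        self._prev_one_test_failed = False

    def record_one_test(self, passed):
        """Record the result of a write-read '1' test."""
        if not passed:
            self.count += 1  # assumption: each failed '1' test is counted
        self._prev_one_test_failed = not passed

    def record_zero_test(self, passed):
        """Record the result of the write-read '0' test that follows."""
        if not passed and not self._prev_one_test_failed:
            self.count += 1  # new fault: the preceding '1' test had passed
        # A failed '0' test after a failed '1' test is a continuation.
        self._prev_one_test_failed = False


counter = PortFaultCounter()
counter.record_one_test(passed=False)   # fault while writing '1'
counter.record_zero_test(passed=False)  # continuation, not counted again
counter.record_one_test(passed=True)
counter.record_zero_test(passed=False)  # counted as a new fault
print(counter.count)
```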



Figure 8: Signal Across Capacitor – Network Resistance Increased from 1.0 to 100  $\Omega$ : 2 $\mu$ s x 1.0V Grid.

## Prototype SJ BIST: Fault Evaluation

We have verified the SJ BIST works correctly for the following conditions: (1) fault during write-read '1' test of I/O port 1; (2) fault during write-read '0' test of I/O port 1; (3) fault during write-read '0' of I/O port 1 immediately following a write-read '1' fault of that port; (4) fault during write-read '1' test of I/O port 2; (5) fault during write-read '0' test of I/O port 2; (6) fault during write-read '0' of I/O port 2 immediately following a write-read '1' fault of that port; (7) long-lasting fault for I/O port 1; (8) long-lasting fault for I/O port 2; (9) multiple faults for I/O port 1; (10) multiple faults for I/O port 1 and I/O port 2.

## SJ BIST Signals

The SJ BIST must, at minimum, present one error signal (a fault indicator) either to an external FPGA I/O port or to an internal fault management program. For evaluation and investigation, our prototype SJ BIST core provides two error signals plus fault counts.

The SJ BIST must also, at minimum, accept one control signal: an enable/disable for the BIST.

# Error Signals and Fault Counts

In addition to recording fault counts, the prototype SJ BIST core described in this paper provides two error signals: (1) at least one fault has been detected in the 2-port network being tested and (2) at least one fault is currently active. The fault counts are provided for research evaluation purposes. For a deployed SJ BIST, we anticipate most applications would only use the two error signals. We also believe a deployed SJ BIST application would most likely use at least four groups of cores – one for each corner of an FPGA.

# Control Signals

In addition to CLK, the SJ BIST core has two input-control signals: ENABLE and RESET. ENABLE is used to turn the SJ BIST detection on and off; RESET is used to reset both the fault signal latches and the fault counters. For a deployed SJ BIST, RESET might not be used.
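The interplay of the two error signals with ENABLE and RESET can be sketched as a small behavioral model. This is an assumption-laden illustration, not the prototype's logic: the class name `SjBistSignals`, the per-clock `test_failed` input, and the choice to hold all outputs while ENABLE is low are hypothetical.

```python
class SjBistSignals:
    """Behavioral model of the error signals and control inputs (illustrative)."""

    def __init__(self):
        self.fault_detected = False  # latched: at least one fault has been seen
        self.fault_active = False    # a fault is currently present
        self.fault_count = 0

    def clock(self, enable, reset, test_failed):
        """Advance one clock with the given control inputs and test result."""
        if reset:
            # RESET clears both the fault signal latches and the fault counter.
            self.fault_detected = False
            self.fault_active = False
            self.fault_count = 0
            return
        if not enable:
            return  # assumption: outputs hold their values while disabled
        if test_failed and not self.fault_active:
            self.fault_count += 1  # count only the start of a fault
        self.fault_active = test_failed
        self.fault_detected = self.fault_detected or test_failed


signals = SjBistSignals()
signals.clock(enable=True, reset=False, test_failed=True)
signals.clock(enable=True, reset=False, test_failed=False)
print(signals.fault_detected, signals.fault_active, signals.fault_count)
```

After the fault clears, the latched signal stays high while the active signal drops, which is the distinction between the two error signals described above.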

# Faults: Duration, Detection and Number of Ports

Our current effort is focused on the design and development of two SJ BIST cores: a two-port and a one-port SJ BIST. To test more than one or two I/O ports, we believe that multiple SJ BIST cores should be used in the deployed FPGA.

Each SJ BIST core has advantages and disadvantages related to the number of gates, the number of externally-connected capacitors, the power dissipation and the minimum duration of a fault period for "guaranteed" detection.

# Fault Duration and Detection: Two-Port SJ BIST Core

Referring back to Figure 6, the signal sequence is write-read '1' (test I/O port 1), write-read '0' (test I/O port 1), write-read '1' (test I/O port 2) and write-read '0' (test I/O port 2). This sequence takes a total of 8 clocks to complete. This means the following: (1) a fault must have a minimum duration of 4 clock periods for "guaranteed" detection; (2) a fault with a duration of one-half of a clock period is detectable when it occurs at the start of either the write-read '1' or the write-read '0' sequence for that pin. For an FPGA with a 100MHz CLK, the clock period is 10 ns, so the guaranteed detection duration is 40 ns.

To test eight I/O ports, two I/O ports for each corner of a BGA package, four 2-port SJ BIST cores could be used and the error signals ORed together.

# Fault Duration and Detection: One-Port SJ BIST Core

A single-port SJ BIST core has been designed to address a possible issue with 4-clock period fault duration for guaranteed detection. In comparison to the original two-port SJ BIST core: (1) this core has a 2-clock guaranteed detection period instead of a 4-clock period; (2) this core uses more logic gates per tested I/O port; (3) this core requires double the number of externally-connected capacitors and (4) this core dissipates more power.

We also have a modified design that would further reduce the guaranteed detection periods to the following: 2 clock periods for a two-port SJ BIST core; and 1 clock period for a one-port SJ BIST core. We plan on programming, testing and validating these configurations during the next phase of our design and development.
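The guaranteed detection durations quoted in the two subsections above follow directly from the clock period. The short sketch below tabulates them for a given clock frequency; the function and dictionary names are hypothetical, and the clock counts are the ones stated in the text.

```python
def guaranteed_detection_ns(clock_hz, clocks_required):
    """Minimum fault duration, in nanoseconds, for guaranteed detection."""
    return clocks_required * 1e9 / clock_hz


# Clock periods required for guaranteed detection, as stated in the text.
CLOCKS_REQUIRED = {
    "two-port": 4,
    "one-port": 2,
    "two-port (modified)": 2,
    "one-port (modified)": 1,
}


def detection_table(clock_hz):
    """Map each core configuration to its guaranteed detection duration in ns."""
    return {name: guaranteed_detection_ns(clock_hz, n)
            for name, n in CLOCKS_REQUIRED.items()}


for name, duration in detection_table(100e6).items():
    print(f"{name}: {duration:.0f} ns")
```

At 100 MHz the two-port entry reproduces the 40 ns figure given earlier; the other entries scale in proportion to their clock counts.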

#### SUMMARY

In this paper we presented an overview of the physics of failure associated with the solder joints of FPGAs in BGA packages: the primary contributors to fatigue damage are thermo-mechanical stresses related to CTE mismatches, shock and vibration, and power on-off sequencing. Solder-joint fatigue damage can result in cracks that cause intermittent, hard-to-diagnose spikes of high resistance. In reliability testing, OPENs (faults) are often characterized by spikes of $100\Omega$ or more lasting for less than 100ns to $1\mu s$ or longer.

Prior to the innovative SJ BIST presented in this paper, there were no known methods for detecting high-resistance faults in solder-joint networks belonging to operational, fully-programmed FPGAs.

An in-situ SJ BIST that can be used in operational FPGAs is useful because stress magnitudes are hard to derive, which leads to inaccurate life expectancy predictions; and even though a particular damaged solder-joint port might not result in immediate FPGA operational failure, the damage indicates the FPGA is no longer reliable. An in-situ SJ BIST can also be used in newly designed manufacturing reliability tests to investigate failure modes related to the PWB-FPGA assembly.

Two prototype SJ BIST cores have been designed: a one-port SJ BIST and a two-port SJ BIST. The two-port SJ BIST was programmed, simulated, synthesized, loaded into an FPGA on a development board and tested in a laboratory. The test results show the SJ BIST core correctly detects and reports instances of high resistance ($100\Omega$ or more) without false errors: no errors are detected or reported when the network resistance is $1.0\Omega$ or less.

#### **FUTURE ACTIVITIES**

Extensive HALT experiments will be run. Some of the desired results of those experiments are the following: (1) determination of the minimum detectable fault duration; (2) the sensitivity of the SJ BIST; (3) optimal capacitor size relative to clock frequency and fault sensitivity; (4) statistical measures related to test I/O port location and first failure; and (5) reliability measurements and statistics given the absence or presence of fault detection by the SJ BIST.

#### **ACKNOWLEDGMENT**

The work presented in this paper was funded by Small Business Innovation Research contract awards from the Department of Defense, Naval Air, Joint Strike Fighter program: Contract No. N68335-05-C-0101 P00002. Two final patent applications were filed in January and February of 2006: one for the innovation topic of this paper and one for a related innovation.

#### REFERENCES

- [1]. Accelerated Reliability Task IPC-SM-785, SMT Force Group Standard, Product Reliability Committee of the IPC, Published by Analysis Tech., Inc., 2005, www.analysistech.com/event-tech-IPC-SM-785.
- [2]. P. Lall, M.N. Islam, N. Singh, J.C. Suhling and R. Darveaux, "Model for BGA and CSP Reliability in Automotive Underhood Applications," *IEEE Trans. Comp. and Pack. Tech.*, Vol. 27, No. 3, Sep. 2004, pp. 585-593.
- [3]. R. Gannamani, V. Valluri, Sidharth and M-L Zhang, "Reliability evaluation of chip scale packages," Advanced Micro Devices, Sunnyvale, CA, in Daisy Chain Samples, Application Note, Spansion, July 2003, pp. 4-9.

- [4]. Sony Semiconductor Quality and Reliability Handbook, Revised May 2001, http://www.sony.net/products/SC-HP/tec/catalog, Vol. 2, pp. 66-67, Vol. 4, pp. 120-129.
- [5]. Use Condition Based Reliability Evaluation: An Example Applied to Ball Grid Array (BGA) Packages, SEMATECH Technology Transfer #99083813A-XFR, International SEMATECH, 1999, pg. 6.
- [6]. Comparison of Ball Grid Array (BGA) Component and Assembly Level Qualification Tests and Failure Modes, SEMATECH Technology Transfer #00053957A-XFR, International SEMATECH, May 31, 2000, pp. 1-4.
- [7]. R. Roergren, P-E. Teghall and P. Carlsson, "Reliability of BGA Packages in an Automotive Environment," IVF-The Swedish Institute of Production Engineering Research, Argongatan 30, SE-431 53 Moelndal, Sweden, http://www.ivf.se, accessed Dec. 25, 2005.
- [8]. D.E. Hodges Popp, A. Mawer and G. Presas, "Flip chip PBGA solder joint reliability: power cycling versus thermal cycling," Motorola Semiconductor Products Sector, Austin, TX, Dec. 19, 2005.
- [9]. The Reliability Report, XILINX, xgoogle.xilinx.com, Sep. 1, 2003, pp. 225-229.
- [10]. J-P. Clech, D.M. Noctor, J.C. Manock, G.W. Lynott and F.E. Bader, "Surface mount assembly failure statistics and failure-free times," in *Proceedings, 44<sup>th</sup> ECTC,* Washington, D.C., May 1-4, 1994, pp. 487-497.
- [11]. P. Lall, P. Choudhary and S. Gupte, "Health Monitoring for Damage Initiation & Progression during Mechanical Shock in Electronic Assemblies," *Proceedings of the 56<sup>th</sup> IEEE Electronic Components and Technology Conference*, San Diego, California, pp. 85-94, May 30-June 2, 2006.
- [12]. P. Lall, M. Hande, M.N. Singh, J. Suhling and J. Lee, "Feature Extraction and Damage Data for Prognostication of Leaded and Leadfree Electronics," *Proceedings of the 56<sup>th</sup> IEEE Electronic Components and Technology Conference*, San Diego, California, pp.718-727, May 30-June 2, 2006.
- [13]. P. Lall, N. Islam and J. Suhling, "Leading Indicators of Failure for Prognostication of Leaded and Lead-Free Electronics in Harsh Environments," *Proceedings of the ASME InterPACK Conference,* San Francisco, CA, Paper IPACK2005-73426, pp. 1-9, July 17-22, 2005.
- [14]. P. Lall, M.N. Islam, K. Rahim and J. Suhling, "Prognostics and Health Management of Electronic Packaging," Accepted for Publication in *IEEE Transactions on Components and Packaging Technologies*, Paper Available in Digital Format on IEEE Xplore, pp. 1-12, March 2005.
- [15]. P. Lall, "Challenges in Accelerated Life Testing," *Inter Society Conf., Thermal Phenomena*, 2004, pg. 727.
- [16]. FPGA I/O Buffer shown was taken from documentation for an Altera FPGA development kit, May, 2006.
- [17]. XILINX Fine-Pitch BGA (FG1156/FGG1156) Package, PK039 (v1.2), June 25, 2004.

XILINX is a registered trademark of Xilinx, Inc.