### UNIVERSITY OF SOUTHAMPTON

FACULTY OF PHYSICAL SCIENCES AND ENGINEERING Electronics and Computer Science

# Design and Validation of Online Fault Tolerance Architecture for TSV-based 3D-IC

by Yi Zhao

Technical Report

26 July 2013

# Content

- Summary
- I. Introduction
- II. Preliminaries
- III. Proposed TSV fault tolerance Technique
  - A. Detection Block
    - I. Void or delamination defect
    - II. Short-to-substrate defect
  - B. Recovery Block
- IV. Simulation Results
  - A. HSpice Validation for Detection Block
  - B. Functional Validation for Recovery Block
  - C. Repair capability and Area Overhead
- V. Conclusion
- References

### Summary

This technical report presents the design, validation and evaluation of an efficient online fault tolerance technique for fault detection and recovery in the presence of three TSV defects: voids, delamination between TSV and landing pad, and TSV short-to-substrate. The technique employs a transition delay test for TSV fault detection. Fault recovery is carried out by employing redundant TSVs and rerouting input/output signals to fault-free TSVs. The technique is efficient because fault detection and recovery together require only a small number of clock cycles (2 × the number of TSVs per group). Simulations using HSPICE and ModelSim in a 65-nm technology validate fault detection and recovery. A synthesized RTL model of the technique is used to evaluate the area overhead. It is shown that regular and redundant TSVs can be divided into groups to minimize area overhead without affecting the fault tolerance capability of the technique.

### I. Introduction

Three-dimensional integrated circuits (3D-ICs) vertically stack multiple silicon dies to reduce overall wire length and power consumption, and to allow integration of heterogeneous technologies [1]. Through-silicon-via (TSV) based technology is used to implement 3D integration, which stacks dies or wafers with vertical TSV interconnects. Recent research has highlighted that the yield of TSV-based 3D-ICs is affected by TSV defects, which are introduced in the bonding stage of the fabrication process when different dies are bonded together; one defective TSV can potentially fail the entire 3D-IC along with all known-good dies [2, 7, 8, 23]. In addition to manufacturing defects, thermal stress induced in the fabrication process and during operation also causes TSV reliability issues. Recent publications have therefore highlighted these challenges and proposed solutions for improving testability, yield and reliability [3-16, 24].

Defective TSVs can be introduced by either manufacturing defects or thermally induced reliability issues [3-8, 10-11, 17, 22]. Various types of TSV defects are summarized in [7, 11], such as improper TSV filling, cracks in the TSV interface, delamination between TSV and landing pad, pinhole defects (short between TSV and substrate) and void growth due to electromigration. Among these, void, delamination and TSV short-to-substrate are the three main TSV defect types. Therefore, recent research has particularly focused on these three types of defects [3, 4, 5, 6, 8]. A study reported in [3] has described the TSV short-to-substrate defect, which forms a resistive path from TSV to substrate and therefore causes signal integrity issues. A voltage comparison structure was proposed in [3] to detect this type of TSV defect during pre-bond test. A void inside a TSV is another important defect type, which behaves as a resistive open defect. The work presented in [4] proposed a sensing amplification method to test this defect before the bonding process. From a post-bond testing point of view, defects can also be introduced during the bonding stage as well as in normal operation because of thermal stress induced in the fabrication process and during operation. Research reported in [9], [10] shows that thermal stress during the fabrication process and normal operation can cause damage to TSV interconnects, such as delamination of TSV interfaces. It is reported in [11] that void growth can also occur during operation due to thermal load. Therefore, pre-bond testing alone is not sufficient when taking these effects into consideration. A transition delay test infrastructure is presented in [5] for testing TSVs for resistive open defects; it addresses the IR-drop issue during transition test when considering both pre-bond and post-bond test. The work presented in [6] shows a ring oscillator structure to characterize propagation delay in TSVs to detect small delay faults due to resistive open TSV defects.

Redundant circuits have been employed by both academia and industry as an effective method for improving yield [12-15, 24]. In [24], redundant TSVs are provided for an 8GB 3D DDR3 to improve the chip yield. The work presented in [12], [14] shows how a single redundant TSV can be allocated to a number of regular TSVs for repair. The work presented in [13] partitions multiple regular and redundant TSVs into groups according to a specified grouping ratio, where redundant TSVs are used to repair defective TSVs in that group; this is used to optimize yield and reduce hardware overhead. A switch-based technique is proposed in [15] for TSV repair. By using this switch, redundant TSVs can be routed to repair 'distant' defective TSVs, thus increasing repair efficiency. All previous references [3-15, 24] have proposed techniques for improving testability and yield. The only work that focuses on improving in-field TSV reliability is presented in [16], which uses an on-chip processor for online fault detection and recovery. However, the available literature does not show any hardware-based in-field fault tolerance technique for TSV fault detection and recovery where such an on-chip processor is not available. This technical report presents the following contributions:

1. An efficient online fault tolerance technique, which requires only a small number of clock cycles (twice the number of TSVs per group) for fault detection and recovery in the presence of three latent TSV defects: void, delamination and short-to-substrate.
2. Electrical and logical simulations to demonstrate correct operation of detection and recovery using realistic fault models and a synthesized RTL model of the technique.
3. An evaluation of the trade-off between repair capability and area overhead of the technique, using a 65-nm technology and Synopsys Design Compiler. It is shown that the area overhead can be reduced without affecting repair capability through appropriate grouping of regular and redundant TSVs.

This report is organized as follows: Sec. II gives an overview of the electrical models used for TSV defects. The online fault tolerance technique is described in Sec. III. Simulation results are presented in Sec. IV and Sec. V concludes the report.



## II. Preliminaries

Fig. 1. Three types of TSV defects

TSVs are subject to manufacturing defects and reliability challenges during the lifetime of 3D-ICs. Fig. 1 shows the three defect types targeted in this work, which are due either to imperfect TSV fabrication or to thermally induced stress during operation: voids, delamination at the interface, and short-to-substrate. Void defects are caused by improper TSV filling and by thermal stress during normal operation of the device [17]. This defect increases TSV resistance [6] and causes delay faults. The second defect type (TSV delamination) is due either to misalignment of the bonding pad and TSV during fabrication or to thermal stress induced on the TSV in thermal processing and normal operation. This thermal stress can cause delamination at the interface of the TSV structure (TSV and its landing pad) and increases TSV resistance. These two types of defects (void and delamination) can both be modelled as a resistive open defect leading to an increase in delay. The equivalent electrical model of this defect is shown in Fig. 2(b) [8]: a resistor ( $R_{open}$ ) is added to the TSV model (Fig. 2(a)). The third defect type (short-to-substrate) is due to a pinhole in the dielectric layer that is deposited to form the side wall between TSV and substrate. It is modelled as a resistive path from TSV to substrate due to the non-conformal sidewall insulation [3]. The electrical equivalent of this defect is shown in Fig. 2(c). A resistor, denoted by R<sub>short</sub>, is added to the TSV model (Fig. 2(a)); it represents the leakage current path between TSV and substrate and reduces the current available for charging the TSV. Note that in this work, modelling of the TSV is based on a transmission line T-model [18], shown in Fig. 2(a), where R<sub>tsv</sub> and C<sub>tsv</sub> denote TSV resistance and capacitance respectively, R<sub>pull</sub> denotes the resistance of the pull-up network of the driving gate and C<sub>p</sub> denotes the parasitic capacitance of the circuit. It is assumed that the signal is transmitted by a driving gate from the TSV terminal in die 1 (referred to as terminal *t1*) to the TSV terminal in die 2 (referred to as terminal *t2*).



(b) Equivalent circuit of TSV with void or delamination defect [8]



(c) Equivalent circuit of TSV with short-to-substrate defect [3]

Fig. 2. Electrical equivalents for (a) TSV T-model; (b) TSV model incorporating void or delamination defect and (c) short-to-substrate defect.

## III. Proposed TSV fault tolerance Technique

Fig. 3(a) shows the block diagram of the proposed fault tolerance technique to test and repair a single TSV group. It consists of three blocks: detection block, recovery block and routing block. These blocks are used to test and repair a group of TSVs (referred to as a TSV group). A TSV group with a grouping ratio of m:n consists of m input (output) signals, m regular TSVs and n redundant TSVs, and can tolerate up to n TSV defects. The number of redundant TSVs in a design has an effect on yield, repair capability and hardware cost. For a given fault rate, recent papers have proposed algorithms to determine the grouping ratio that minimizes hardware cost and maximizes yield [13], [15]. In this work, it is assumed that TSVs are divided into groups at design time.

The detection block (Fig. 3(a)) is used for testing each TSV in a group. Input test patterns are applied from one die (Die 1) and the output test response is observed through the Test observation block located on the subsequent die (Die 2). The detection block uses a delay test to differentiate between faulty and fault-free TSVs, where each TSV is tested for void, delamination and short-to-substrate defects (Fig. 2). The status of each TSV is updated in the TSV status registers, which are located on both dies and hold the number and location of all faulty TSVs in a group. If a faulty TSV is found, fault recovery is initiated after identifying the number and location of all faulty TSVs in the group. Sec. III-A and Fig. 4 provide a detailed description of the detection block. Note that the detection block does not distinguish between different defect types, as that is typically required only for diagnosis.

The recovery block is used to bypass defective TSVs with fault-free TSVs. It is implemented on both dies that are connected by the TSV group. As shown in Fig. 3(a), it consists of a TSV status register and a control unit. The TSV status register holds the fault status of each TSV ('1' represents a faulty TSV and '0' represents a fault-free TSV). The control unit provides the control signals to bypass faulty TSVs and is used to configure the routing block. The routing block consists of a set of multiplexers and de-multiplexers that connect each signal line to a TSV. The control signals of these multiplexers (de-multiplexers) are provided by the control unit of the recovery block. Sec. III-B and Fig. 7 provide a detailed description of the recovery and routing blocks.

Fig. 3(b) shows the detection, recovery and routing blocks of the proposed fault tolerance technique. For illustration, a grouping ratio of 4:2 is used, where each group consists of 4 regular TSVs and 2 redundant TSVs and can therefore tolerate up to two defective TSVs. Test input and Test observation blocks are used for testing each TSV for the three defects (Fig. 2); test results are stored in the TSV status registers on both dies. A double TSV interconnection is used to update the TSV status register on die 1. This concept was also used in [2] for error communication between dies. Once fault detection is complete, recovery is initiated to reroute signals through fault-free TSVs (replacing defective TSVs) by reconfiguring the routing block between signals and TSVs (Fig. 3(b)). The control unit generates the selection signal for each signal line to connect it with the appropriate TSV. The connection boxes (de-Mux terminals within the routing block) shown in Fig. 3(b) are implemented using de-multiplexers between input signals and TSVs (Fig. 4). The connections for input (die 1) and output (die 2) signals are similar; the only difference is that de-multiplexers are used with input signals and multiplexers are used with output signals. For a grouping ratio of 4:2, each signal can use one of three possible TSVs; hence a 1-to-3 de-multiplexer is needed. The control unit also reports when the number of defective TSVs is higher than the maximum tolerance limit of a TSV group.

To illustrate the working of the recovery block for a grouping ratio of 4:2, assume there are two defective TSVs (TSV2 and TSV4) in a group (Fig. 3(b)). The reconfiguration circuits on both dies (Die 1 and Die 2) are similar, so for illustration we explain only the one on Die 1. It follows two connection rules. Firstly, once a TSV has been used by a signal line (shown as a tick in the connection box), no other signal line can use that TSV, because a TSV can be occupied by only one signal line. Secondly, if a TSV is defective, none of the connection boxes (de-Mux or Mux terminals, Fig. 3(b)) that correspond to that TSV can be used. Based on the connection rules and the test results stored in the TSV status register, the availability of each TSV is determined. Once the first signal line is connected, the method moves to the next signal line until all input signals are connected to a TSV. We next describe the operation of the detection and recovery blocks.

#### A. Detection Block

Fig. 4 shows the detection block for a single TSV, as an example. It consists of an input signal unit for test patterns and input signals, where transition signals are stored for test application. Fig. 4 also shows the test observation block (Fig. 3(a)), where the test output is captured in a flip-flop and stored in the TSV status registers. The SI signal and NAND gate are used to initialize the TSV status registers. The detection block applies a transition signal on one die (Die 1) and the output is observed on the subsequent die (Die 2). We next explain the working of the detection block when considering the three defects (Fig. 2).



Fig. 3(a). Fault tolerance technique

Fig. 3(b). Detection and recovery blocks for a grouping ratio of 4:2

#### I. Void or delamination defect

As described in Sec. II, void and delamination defects increase TSV resistance forming a higher resistance TSV path, thus increasing RC delay. To derive RC delay at t2 end (Fig. 4) of the TSV, we employed TSV electrical model with void or delamination defects shown in Fig. 2(b). The RC delay of TSV at t2 end is:

$$\left(R_{pull} + \frac{1}{2}R_{TSV} + \frac{1}{2}R_{open}\right)C_{TSV} + \left(R_{pull} + R_{TSV} + R_{open}\right)C_{p} \tag{1}$$

where $R_{open}$ denotes the open resistance due to a void or delamination defect, $R_{pull}$ denotes the resistance of the pull-up network driving the TSV (de-multiplexers, Fig. 4) and $C_p$ denotes the parasitic capacitance of the test circuit. When the TSV is fault-free ( $R_{open}\sim0$ ), the TSV resistance is small (hundreds of m$\Omega$ ) and can be ignored when compared to the pull-up resistance of the driving gate $R_{pull}$, which is usually several k$\Omega$, so the path delay is not affected by the TSV resistance. However, in the case of void or delamination defects, the open resistance of a TSV ( $R_{open}$ ) can be up to 1 M$\Omega$ [6], which is significantly higher than the cumulative effect of $R_{TSV}$ and $R_{pull}$.
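To make the dependence on the defect explicit, the terms of Eq. (1) can be regrouped into a fault-free part and a part that scales with $R_{open}$. This is only an algebraic rearrangement of Eq. (1), using the same symbols:

$$
\underbrace{\left(R_{pull} + \tfrac{1}{2}R_{TSV}\right)C_{TSV} + \left(R_{pull} + R_{TSV}\right)C_{p}}_{\text{fault-free delay}}
\;+\;
\underbrace{R_{open}\left(\tfrac{1}{2}C_{TSV} + C_{p}\right)}_{\text{defect contribution}}
$$

Under this model the RC delay at *t2* grows linearly with $R_{open}$, with slope $\tfrac{1}{2}C_{TSV} + C_{p}$, which is why a sufficiently large open resistance eventually pushes the transition past any fixed capture time.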

Consider the NAND gate (Fig. 4) with logic threshold voltage $L_{vth}$, where $L_{vth}$ of a gate input is the input voltage at which the output voltage reaches half of the supply voltage while the other gate input(s) are at non-controlling value(s) [19]. When a rising transition is applied to the TSV from *In\_TSV1* (Fig. 4), the delay at the *t2* end depends on the value of $R_{open}$: as $R_{open}$ increases, the rising transition at *t2* becomes slower, such that at a given capture time the voltage at *t2* is lower than $L_{vth}$, as illustrated in Fig. 5.

Therefore, if the TSV open resistance due to a void or delamination defect exceeds a critical value $R_{open-critical}$, the voltage at *t2* is lower than $L_{vth}$ at the given signal capture time and the test flags the TSV as faulty (Test\_result=1, Fig. 5). The signal capture time is set by the frequency of the test clock applied to the flip-flop shown in Fig. 4. Note that the internal clock may be used as the test clock to avoid the overhead of a separate DFT clock [20].

The TSV open critical resistance $R_{open-critical}$ is a function of the logic threshold voltage $L_{vth}$ and the signal capture time (set by the test clock frequency $F_{clock}$ ). For illustration, $L_{vth}$ is kept at 50% of V<sub>dd</sub>; in practice it varies per gate input and is also affected by process variation [20]. The range of TSV open resistance [0, $R_{open-critical}$ ] is referred to as the benign region: if $R_{open} < R_{open-critical}$, the TSV is regarded as fault-free, whereas if $R_{open} > R_{open-critical}$, a defective TSV with a void or delamination defect can be detected. In the first set of simulation results (Sec. IV-A), we investigate the delay of the TSV as a function of $R_{open}$ and evaluate $R_{open-critical}$ with respect to the test clock frequency $F_{clock}$.



Fig. 4 Detection block for a single TSV



Fig. 5 Test pattern for detection of void or delamination defect

#### II. Short-to-substrate defect

A short-to-substrate TSV defect leads to a resistive path between TSV and substrate and causes current leakage, as shown in Fig. 2(c), leading to reduced TSV charging current. Assume a rising transition is applied from *In\_TSV1* (Fig. 4). The TSV charging current can then be expressed as $I_{charge} = I_I - I_{leakage}$, where $I_I$ is the input current at *t1* and $I_{leakage}$ is the leakage current from TSV to substrate through the short resistor (Fig. 2(c)). Due to the lower TSV charging current ( $I_{charge}$ ), the rising transition time observed at *t2* increases with increasing defect size, i.e. with decreasing $R_{short}$. The testing method is similar to that of the void or delamination defect. In this case, the critical resistance $R_{short-critical}$ is the maximum detectable $R_{short}$: defects with $R_{short}$ in the range [0, $R_{short-critical}$ ] are detected, whereas resistance higher than $R_{short-critical}$ is not detectable. Detection is based on the voltage at *t2* (Fig. 4), which is compared with the logic threshold voltage ( $L_{vth}$ ) at a given capture time, as illustrated in Fig. 6. Note that the short-to-substrate resistance degrades the voltage level at both ends of the TSV, because $R_{short}$ forms a voltage divider with $R_{pull}$ and $R_{tsv}$.

It is clear that with smaller  $R_{short}$ , the voltage at t2 (Fig. 4) is lower, such that for a rising transition signal, the voltage at t2 is lower than  $L_{vth}$ , at signal capture time (Fig. 6). Simulation results using different defect sizes and test clock frequencies for detecting this type of defect are presented in Sec. IV-A.
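The voltage-divider effect can be illustrated with a minimal DC sketch. It is not the report's HSPICE model: it assumes the substrate is at ground, ignores all capacitances, and lumps the resistance between the supply and the leakage node into one hypothetical parameter `r_upstream` (the pull-up resistance plus whatever part of the TSV resistance lies before the short). The function names and the 2 kΩ value are illustrative assumptions.

```python
def steady_state_voltage_at_t2(v_dd, r_upstream, r_short):
    """DC level that t2 settles to for a logic-1 input.

    Assumes a simple divider: r_upstream from the supply to the leakage
    node, r_short from that node to a grounded substrate.
    """
    if r_upstream < 0 or r_short <= 0:
        raise ValueError("resistances must be positive")
    return v_dd * r_short / (r_upstream + r_short)


def never_crosses_threshold(v_dd, r_upstream, r_short, l_vth_fraction=0.5):
    """True if the settled voltage stays below the logic threshold.

    In that case the rising transition is missed at any capture time,
    so this is a sufficient (not necessary) condition for detection.
    """
    settled = steady_state_voltage_at_t2(v_dd, r_upstream, r_short)
    return settled < l_vth_fraction * v_dd


if __name__ == "__main__":
    V_DD = 1.2          # supply voltage used in Sec. IV-A
    R_UPSTREAM = 2e3    # hypothetical pull-up resistance ("several kOhm")
    for r_short in (500.0, 2e3, 20e3):
        v = steady_state_voltage_at_t2(V_DD, R_UPSTREAM, r_short)
        print(f"R_short = {r_short:8.0f} ohm -> settled V(t2) = {v:.3f} V, "
              f"always detected: {never_crosses_threshold(V_DD, R_UPSTREAM, r_short)}")
```

Defects with larger $R_{short}$ may still be detected because the transition is slowed as well as attenuated, so the DC condition only bounds $R_{short-critical}$ from below.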



Fig. 6 Test pattern to detect short-to-substrate TSV defect

#### B. Recovery Block

The recovery block (Fig. 3(a)) is used to bypass defective TSVs with fault-free TSVs and it is implemented on both dies that are connected by the TSV group. Recovery is initiated after completing the test for the three defect types (Fig. 2) and it reconfigures the connections between input/output signals and fault-free TSVs. This section has two objectives. First, it describes the working of the reconfiguration process by considering a design with a grouping ratio of 4:2. Second, it shows how the proposed technique can be scaled to any grouping ratio (m:n).

The circuits for reconfiguring input and output signals are similar and therefore only the input part is shown in Fig. 7(a). As can be seen, it consists of the following six components: 1) a routing block consisting of de-multiplexers to connect signal lines with TSVs; 2) a latch chain that stores the selection signals for the de-multiplexers; 3) a TSV status register, which stores the fault status of each TSV, where '0' indicates a fault-free and '1' a faulty TSV; 4) a signal line counter, which indicates the number of signal lines that have been configured and is also used to update the latch chain through "enable"; 5) an adder, the "Faulty TSV accumulator", which counts the number of faulty TSVs and provides the input to the latch chain; 6) a comparator, which compares the current number of faulty TSVs with the tolerance limit of the TSV group and reports an error if the tolerance limit is exceeded.



Fig. 7. (a) Reconfiguring a faulty design with a grouping ratio of 4:2; (b) reconfiguration process per clock cycle

In this example (Fig. 7(a)), each input signal can be routed to three possible TSVs, which is why the selection signal for each de-multiplexer has two bits. Two latches are required in the latch chain to store the selection signal of each de-multiplexer. The top two latches store the selection signal for the first de-multiplexer (signal line 1), and the remaining latch pairs are for the rest of the de-multiplexers (signal line 2 to signal line 4). The selection signal for the first de-multiplexer is scanned into the latch chain from the bottom and shifted up, such that after the configuration process completes it resides in the pair of latches on top. The proposed reconfiguration method sets the selection signals for the de-multiplexers sequentially. This is managed by the signal line counter, which receives the values shifted out of the TSV status register. If a '0' is received, a fault-free TSV has been found and a signal line can be configured: the counter outputs an "enable" signal, which triggers the latch chain to scan in a new value from the faulty TSV accumulator. This block is referred to as an accumulator because as soon as a '1' is received, the default TSV connection is no longer available for any of the remaining signal lines (still to be configured). The signal line counter is also used to count the number of signal lines that have been configured. For a grouping ratio of *4:2*, only four signal lines have to be configured; once the signal line counter reaches a count of four, it disables the latch chain, which means that the configuration process is complete.

Fig. 7(b) shows the reconfiguration process in detail (per clock cycle) for a design with a grouping ratio of 4:2 and two defective TSVs (TSV2 and TSV4). The reconfiguration process is initiated after the detection phase with '010100' as the initial value of the TSV status register. In total, four signal lines have to be reconfigured by updating the latch chain, which holds the selection signals of all de-multiplexers. As can be seen, in the first clock cycle, the first value shifted out of the TSV status register is '0', which is sent to the faulty TSV accumulator and the signal line counter. The output of the faulty TSV accumulator is therefore '00' and the value of the signal line counter becomes '1', which means that the first signal line can use TSV1. The signal line counter asserts the enable signal, and the latch chain scans in the new value for the first signal line from the faulty TSV accumulator. As shown in Fig. 7(b), the logic values in the latch chain become '00' for all four latch pairs. In the second clock cycle, the logic value shifted out of the TSV status register is '1', so the faulty TSV accumulator becomes '01' and the signal line counter stays at '1' because signal line 2 is not yet configured; the enable signal is set low, which keeps the latch chain at the same logic values as in the previous clock cycle. As shown in Fig. 7(b), signal line 2 is configured in the third clock cycle, when the value shifted out of the TSV status register is '0', and the logic values in the latch chain become '00', '00', '01'. The value of the signal line counter becomes 2, which means that two (out of four) signal lines have been configured. This process continues until all four signal lines are configured in the sixth clock cycle, when the latch chain holds '00', '01', '10', '10' in the four latch pairs. The resultant reconfiguration of signal lines is shown in Fig. 7(a), where all defective TSVs are bypassed. In Sec. IV-B, functional validation of this design is also demonstrated using ModelSim.
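The per-cycle behaviour described above can be sketched as a small behavioural model. This is an illustrative Python sketch, not the report's RTL: the function name `reconfigure`, the string encoding of the status register (TSV1 first) and the returned `mapping` list are assumptions introduced here for clarity.

```python
from math import ceil, log2


def reconfigure(status_bits, m, n):
    """Behavioural model of the recovery block for one m:n TSV group.

    status_bits: string of m+n characters, TSV1 first;
                 '1' = faulty TSV, '0' = fault-free TSV.
    Returns (selects, mapping, error):
      selects - latch-chain contents, top pair (signal line 1) first
      mapping - mapping[i] is the TSV number used by signal line i+1
      error   - comparator output, True if more than n TSVs are faulty
    """
    if n < 1 or len(status_bits) != m + n or set(status_bits) - {"0", "1"}:
        raise ValueError("expected m+n status bits of '0'/'1' and n >= 1")
    k = ceil(log2(n + 1))      # selection bits per de-multiplexer
    latch_chain = [0] * m      # index 0 is the top latch pair
    mapping = []
    faulty_count = 0           # faulty TSV accumulator
    # One status bit is shifted out per clock cycle.
    for tsv_number, bit in enumerate(status_bits, start=1):
        if bit == "1":
            faulty_count += 1  # enable stays low, latch chain holds its value
        elif len(mapping) < m:  # signal line counter has not reached m yet
            # Enable asserted: scan in at the bottom, shift everything up.
            latch_chain = latch_chain[1:] + [faulty_count]
            mapping.append(tsv_number)
    error = faulty_count > n   # comparator against the tolerance limit
    selects = [format(value, f"0{k}b") for value in latch_chain]
    return selects, mapping, error


if __name__ == "__main__":
    selects, mapping, error = reconfigure("010100", m=4, n=2)
    print(selects, mapping, error)
```

For the 4:2 example with status register '010100', the model ends with the latch pairs '00', '01', '10', '10' stated in the text, corresponding to signal lines 1 to 4 using TSV1, TSV3, TSV5 and TSV6.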

Fig. 8 shows the architecture of the proposed fault tolerance technique with a grouping ratio of m:n. As can be seen, each group contains m+n TSVs with m input/output signal lines. The TSV status register consists of m+n bits. Each signal line can use one of n+1 TSVs for communication, so a 1-to-(n+1) de-multiplexer is needed. The selection signal for signal line i needs $k=\lceil \log_2(n+1) \rceil$ bits, which are Si(0), Si(1), ..., Si(k-1). Therefore, for each signal line the latch chain contains k latches, which hold the de-multiplexer selection signal. The signal line counter generates m latch renew enable signals. The comparator is used to report if the number of faulty TSVs in a group exceeds the maximum tolerance limit.

Overall, for a grouping ratio of m:n, this technique requires m+n clock cycles to test m regular and n redundant TSVs serially and m+n clock cycles for repairing all TSVs in the presence of defects. In total it therefore requires only 2(m+n) clock cycles for fault detection and recovery. The theoretical lower bound to test and repair all TSVs per design is 2 clock cycles, assuming an infrastructure that tests and repairs all TSVs in parallel. The proposed technique approaches this theoretical lower bound by using only 2(m+n) clock cycles. The area overhead of the fault tolerance technique (detection, recovery and routing blocks on both dies) is:

$$
\begin{aligned}
Area ={}& A_{detection} + A_{routing} + A_{recovery} + A_{redundant\,TSV} \\
={}& (m+n)\ \text{NAND gates} + \left\{3(m+n) + 2m\lceil \log_2(n+1) \rceil\right\}\ \text{flip-flops} \\
&+ m\ \text{demux}_{1\text{-to-}(n+1)} + m\ \text{mux}_{(n+1)\text{-to-}1} \\
&+ 2\ \text{signal line counters}_{m\text{-bit}} + 2\ \text{accumulators} + \text{comparator} + A_{redundant\,TSV}
\end{aligned}
\tag{2}
$$

where "A" denotes area overhead of a TSV group with a grouping ratio of *m*:*n*; all other notations have their usual meaning. It can be seen that this technique can be easily scaled to suit a generic design with any specified grouping ratio. Simulation results presented in Sec. IV-C demonstrate how area overhead can be reduced without affecting fault tolerance and repair capability for various grouping ratios.
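As a worked instance of Eq. (2) and the cycle count above, the component counts can be tabulated for a given grouping ratio. The sketch below only counts components; it does not estimate silicon area, which depends on the cell library. The function name and dictionary keys are illustrative assumptions.

```python
from math import ceil, log2


def overhead_components(m, n):
    """Component counts from Eq. (2) for one TSV group with ratio m:n."""
    if m < 1 or n < 1:
        raise ValueError("m and n must be positive integers")
    k = ceil(log2(n + 1))  # selection bits per (de-)multiplexer
    return {
        "nand_gates": m + n,
        "flip_flops": 3 * (m + n) + 2 * m * k,
        "demux_1_to_n_plus_1": m,
        "mux_n_plus_1_to_1": m,
        "signal_line_counters_m_bit": 2,
        "accumulators": 2,
        "comparators": 1,
        "redundant_tsvs": n,
        # m+n cycles for detection plus m+n cycles for recovery
        "clock_cycles": 2 * (m + n),
    }


if __name__ == "__main__":
    for ratio in [(4, 2), (8, 2), (16, 4)]:
        print(ratio, overhead_components(*ratio))
```

For the 4:2 example this gives 6 NAND gates, 34 flip-flops and 12 clock cycles for detection plus recovery.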



Fig. 8. Architecture of fault tolerance technique with grouping ratio of *m*:*n*.

## IV. Simulation Results

Three sets of simulations are conducted to validate and evaluate the fault tolerance technique. The first set validates the detection block through HSPICE and characterizes the detectable resistance range for three defects: void, delamination and short-to-substrate. The second set functionally validates the recovery block through an RTL model implementation of the fault tolerance technique using ModelSim. The last set analyses the trade-off between area overhead and repair capability of the technique through synthesis using Synopsys Design Compiler.

### A. HSpice Validation for Detection Block

This simulation employs the electrical models of the TSV and the three defect types shown in Fig. 2. The test circuit (Fig. 4) is modelled in HSPICE using a 65-nm ST Microelectronics gate library. All simulations are carried out at 25°C and 1.2 V. For illustration, the defect-free TSV resistance and capacitance are 200 m$\Omega$ and 200 pF respectively, as they represent typical values [18]. The test clock frequency $F_{clock}$ is 1.5 GHz. It was shown in [20] that when considering process variation with $\pm 3\sigma$ variation effects, the logic threshold voltages of all gates (in a gate library) are within 20%-80% of V<sub>dd</sub>. This means that for a rising transition, logic-1 is guaranteed at $V_{Out} > 80\%$ of $V_{dd}$; similarly, logic-0 is guaranteed at $V_{Out} < 20\%$ of $V_{dd}$. Therefore, the rising (falling) transition delay is taken as the time for the TSV voltage to rise (fall) from 20% (80%) to 80% (20%) of $V_{dd}$.
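As a rough cross-check on this definition, suppose the *t2* node behaved like a single-pole RC stage with time constant $\tau$. This is an idealisation introduced here for illustration, not the report's HSPICE model. The 20%-to-80% rising delay would then follow from the exponential step response:

$$
\begin{aligned}
V(t) &= V_{dd}\left(1 - e^{-t/\tau}\right), \\
t_{20\%} &= \tau \ln\frac{1}{0.8}, \qquad t_{80\%} = \tau \ln\frac{1}{0.2}, \\
t_{80\%} - t_{20\%} &= \tau \ln 4 \approx 1.386\,\tau .
\end{aligned}
$$

Under this assumption the measured transition delay is proportional to the RC time constant, so it inherits the linear dependence on $R_{open}$ of Eq. (1).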

| Defect type | $R_{open}$ | Rising transition delay at t2 node (ns) | Falling transition delay at t2 node (ns) | Classification |
|---|---|---|---|---|
| Void/Delamination | 0 Ω (defect free) | 0.242 | 0.160 | Fault-free |
| | 1 kΩ | 0.311 | 0.225 | Fault-free |
| | 2 kΩ | 0.419 | 0.339 | Fault-free |
| | 3 kΩ | 0.541 | 0.469 | Fault-free |
| | 4 kΩ | 0.667 | 0.608 | Fault-free |
| | 5 kΩ | 0.805 | 0.743 | Faulty |
| | 10 kΩ | 1.492 | 1.441 | Faulty |
| | 50 kΩ | 7.085 | 7.033 | Faulty |
| | 100 kΩ | 14.121 | 14.030 | Faulty |
| | 1 MΩ | Stuck-open | Stuck-open | Faulty |

TABLE I: Void or delamination defect characterization

Fig. 9. Delay (Rising) as a function of open resistance showing critical open resistances for three test clock frequencies: 0.67GHz, 1.5GHz, 3.2GHz.

Table I shows the simulation results for void or delamination defects, i.e. the transition delay as a function of the open resistance $R_{open}$ caused by such defects. $R_{open}$ is in the range [0, 1 MΩ], where 0 Ω corresponds to fault-free TSV behaviour (with only the TSV resistance of 200 mΩ) and 1 MΩ represents a full-open TSV defect, at and beyond which the TSV can be treated as having a stuck-open fault. From Table I, it can be seen that the rising (falling) transition delay increases from 0.242 ns (0.160 ns) to 14.12 ns (14.03 ns) when $R_{open}$ increases from fault-free to 100 kΩ. These results indicate that, at a test clock frequency of 1.5 GHz, the detection block is capable of detecting void or delamination defects with $R_{open} > 4$ kΩ. This is because beyond this resistance value the rising transition takes longer than one test clock period, $\frac{1}{1.5 \times 10^{9}}$ s $\approx 0.67$ ns, to reach 80% of V<sub>dd</sub>, as shown in Table I. This open resistance is referred to as the resistive open critical resistance ($R_{open-critical}$ is 4 kΩ). Note that this value changes with $F_{clock}$. When testing resistive open defects, it is desirable to have a lower $R_{open-critical}$, which is possible by using higher test clock frequencies. The relationship between test clock frequency and detectable defect size is well studied [20], [21]; we quantify it for void/delamination TSV defects by considering three test frequencies. Fig. 9 shows an analysis of test clock frequency and critical open resistance ($R_{open-critical}$), where the rising delay is depicted as a function of the open resistance $R_{open}$. The test clock frequency is set to 0.67 GHz, 1.5 GHz, and 3.2 GHz, and the corresponding critical open resistance values are 10 kΩ, 4 kΩ, and 1 kΩ respectively. The shaded areas denote the benign regions for the respective test clock frequencies and, as expected, a higher test clock frequency lowers $R_{open-critical}$ and therefore widens the detectable range.
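The relationship between test clock frequency and critical open resistance can be illustrated with a small lookup over the rising delays of Table I. The sketch below is not part of the reported tool flow; it assumes that a TSV is flagged as faulty when its rising delay exceeds one test clock period, and that the period is compared at the 1 ps resolution of Table I. The function names are hypothetical.

```python
# Rising delays from Table I in picoseconds, keyed by open resistance in ohms.
# The 1 MOhm entry is omitted because it is reported as stuck-open.
RISING_DELAY_PS = {
    0: 242, 1_000: 311, 2_000: 419, 3_000: 541, 4_000: 667,
    5_000: 805, 10_000: 1492, 50_000: 7085, 100_000: 14121,
}


def clock_period_ps(f_clock_ghz: float) -> int:
    """Test clock period, rounded to 1 ps (assumed comparison resolution)."""
    return round(1000.0 / f_clock_ghz)


def critical_open_resistance(f_clock_ghz: float) -> int:
    """Largest tabulated R_open whose rising delay still fits in one period.

    Assumption: any larger tabulated resistance is detected as faulty.
    """
    period = clock_period_ps(f_clock_ghz)
    passing = [r for r, delay in RISING_DELAY_PS.items() if delay <= period]
    if not passing:
        raise ValueError("even the defect-free TSV misses this clock period")
    return max(passing)


if __name__ == "__main__":
    for f_ghz in (0.67, 1.5, 3.2):
        r_crit = critical_open_resistance(f_ghz)
        print(f"{f_ghz} GHz: R_open-critical = {r_crit} ohm")
```

Under these assumptions the lookup returns 10 kΩ, 4 kΩ, and 1 kΩ for 0.67 GHz, 1.5 GHz, and 3.2 GHz, matching the values quoted for Fig. 9.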

Table II shows the simulation results for the short-to-substrate defect. It lists the rising transition delay, along with the degraded TSV voltage, as a function of the short-to-substrate resistance, referred to as $R_{short}$. It can be seen that the delay increases from 0.242 ns when $R_{short}$ is 1 M$\Omega$ to 0.758 ns when $R_{short}$ is 1 k$\Omega$, and that the TSV behaves as a stuck-open defect for resistance values $\leq$ 900 $\Omega$. When comparing Table I and Table II, it can be observed that an $R_{short}$ of 1 M$\Omega$ gives the same rising delay as a fault-free TSV. It can be seen from Table II that for a test clock frequency of 1.5 GHz, TSVs with $R_{short} < 2$ k$\Omega$ are detectable; this value is referred to as the critical short resistance ($R_{short-critical}$ is 2 k$\Omega$). When the resistance value is 900 $\Omega$ or smaller, the TSV voltage falls to 0.6 V or below, i.e. it degrades by at least 50% of the supply voltage ($V_{dd}$ = 1.2 V), which is regarded as a stuck-open defect. Fig. 10 shows the relationship between the critical short-to-substrate resistance and the test clock frequency, where the rising delay is depicted as a function of $R_{short}$. The test clock frequency is set to 1.5 GHz, 1.8 GHz, and 3 GHz (for illustration), which leads to $R_{short-critical}$ of 2 k$\Omega$, 3 k$\Omega$, and 5 k$\Omega$ respectively. It can be observed that the delay increases faster when the short-to-substrate resistance is smaller. This is due to the higher leakage current of a bigger defect (i.e., smaller $R_{short}$), which leaves a smaller TSV charging current.

| $R_{short}$ | Rising delay (ns) | TSV voltage (V) | Status |
|-------------|-------------------|-----------------|--------|
| 0Ω | Stuck-open | 0 | Faulty |
| 500Ω | Stuck-open | 0.38 | Faulty |
| 900Ω | Stuck-open | 0.60 | Faulty |
| 1kΩ | 0.758 | 0.64 | Faulty |
| 2kΩ | 0.665 | 0.87 | Fault-free |
| 3kΩ | 0.551 | 0.97 | Fault-free |
| 4kΩ | 0.379 | 1.03 | Fault-free |
| 5kΩ | 0.336 | 1.06 | Fault-free |
| 10kΩ | 0.279 | 1.13 | Fault-free |
| 100kΩ | 0.245 | 1.19 | Fault-free |
| 1MΩ | 0.242 | 1.19 | Fault-free |

TABLE II: Short-to-substrate defect characterization

Fig. 10. Delay (Rising) as a function of short-to-substrate resistance showing critical resistances for three test clock frequencies: 1.5GHz, 1.8GHz, 3.0GHz.

#### B. Functional Validation for Recovery Block

This set of simulations functionally validates the recovery block and the reconfiguration process of the proposed technique (Sec. III-B) using ModelSim. For illustration, the reconfiguration process is simulated for a grouping ratio of 4:2 (four regular and two redundant TSVs), for the schematic shown in Fig. 7(a), with two defective TSVs (TSV2 and TSV4) as indicated by the contents of the TSV status register. The grouping ratio and the locations of the faulty TSVs are selected to match the per-clock-cycle process shown in Fig. 7(b). Fig. 11 shows the simulation results for every clock cycle; the initial content of the TSV status register is '010100'. From the simulation results, we can observe that after the first clock cycle the first value of the TSV status register is shifted out and TSV\_status (TSV status register) becomes '101000', which initiates the reconfiguration process. Signal\_line\_counter (Fig. 11) indicates the number of signal lines that have been configured; after six clock cycles all four signal lines are configured and its value equals '4'. As expected, each time the "enable (Latch chain)" signal is asserted, the latch chain scans in the output of the faulty TSV accumulate adder, and after six clock cycles the latch chain holds the selection signals for all de-multiplexers, which are S<sub>1</sub>(1), S<sub>1</sub>(0) = (0,0); S<sub>2</sub>(1), S<sub>2</sub>(0) = (0,1); S<sub>3</sub>(1), S<sub>3</sub>(0) = (1,0); and S<sub>4</sub>(1), S<sub>4</sub>(0) = (1,0). Comparing these with the expected results shown in Fig. 7(b), we observe that the signal lines are correctly configured.
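The reconfiguration sequence described above can be mirrored by a short behavioural model. This is only a sketch under stated assumptions, not the RTL of the recovery block: it assumes that one status bit is shifted out per clock cycle, that the accumulate adder counts the faulty TSVs seen so far, and that this count is latched as the 2-bit selection signal whenever a fault-free TSV is encountered. The function and variable names are hypothetical.

```python
def configure_signal_lines(tsv_status: str, regular_count: int) -> list[tuple[int, int]]:
    """Return (S(1), S(0)) for each signal line.

    tsv_status: status register contents, first-shifted bit first ('1' = faulty).
    regular_count: number of signal lines (regular TSVs) in the group.
    Assumes a 2-bit selection signal, i.e. at most three skipped TSVs.
    """
    faulty_seen = 0   # running sum kept by the faulty-TSV accumulate adder
    selections = []   # stands in for the contents of the latch chain

    for bit in tsv_status:  # one status bit is shifted out per clock cycle
        if bit == "1":
            faulty_seen += 1  # faulty TSV: skip it and shift later lines by one
        elif len(selections) < regular_count:
            # Fault-free TSV: latch the current count as this line's selection.
            selections.append(((faulty_seen >> 1) & 1, faulty_seen & 1))

    if len(selections) < regular_count:
        raise ValueError("not enough fault-free TSVs to route every signal line")
    return selections


if __name__ == "__main__":
    # 4:2 group with TSV2 and TSV4 faulty, as in the ModelSim run.
    print(configure_signal_lines("010100", regular_count=4))
```

For the status '010100' this model yields (0,0), (0,1), (1,0), (1,0), which agrees with the selection signals reported for S<sub>1</sub> to S<sub>4</sub>.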



Fig. 11. ModelSim functional validation of recovery block and reconfiguration process

TABLE III: Trade-off between repair capability and area overhead (regular TSV number is 1000, fault rate is 0.01).

| Redundancy<br>percentage | Grouping<br>ratio | Total<br>redundant<br>TSVs | Area<br>overhead<br>per die<br>(µm<sup>2</sup>) | Repair<br>capability<br>(%) |
|--------------------------|-------------------|----------------------------|---------------------------------------------------|-----------------------------|
| 100%                     | 1:1               | 1,000                      | 63,350                                            | 55                          |
| 100%                     | 2:2               | 1,000                      | 74,300                                            | 99                          |
| 100%                     | 3:3               | 1,000                      | 75,993                                            | 100                         |
| 50%                      | 2:1               | 500                        | 47,525                                            | 41                          |
| 50%                      | 4:2               | 500                        | 56,088                                            | 97                          |
| 50%                      | 6:3               | 500                        | 57,252                                            | 100                         |
| 25%                      | 12:3              | 250                        | 48,069                                            | 99                          |
| 25%                      | 16:4              | 250                        | 57,554                                            | 100                         |
| 10%                      | 40:4              | 100                        | 52,287                                            | 99                          |
| 10%                      | 50:5              | 100                        | 56,235                                            | 100                         |
| 5%                       | 100:5             | 50                         | 54,301                                            | 96                          |
| 5%                       | 120:6             | 50                         | 58,302                                            | 99                          |

#### C. Repair capability and Area Overhead

This simulation evaluates the area overhead and repair capability of the fault tolerance technique. As mentioned in Sec. III, grouping ratios are fixed at design time. This simulation shows that an optimal grouping ratio can be selected to achieve the target repair capability while using minimal area. For illustration, the number of regular TSVs used in this simulation is 1000. The area overhead of the proposed technique is computed through synthesis using Synopsys Design Compiler and the STMicroelectronics 65-nm gate library. The diameter of a TSV is 5 µm [18], so that the area overhead of each redundant TSV is 25 µm<sup>2</sup>. The second-to-last column of Table III shows the overall area overhead per die, which is the sum of the areas of the redundant TSVs and of the detection, recovery and routing blocks. It can also be estimated using Eq. (2) (Sec. III-B). For a given fault rate of 0.01 [13], the repair capability of the different grouping ratios is calculated using the method described in [13], and the results are shown in the last column. It can be seen (Table III) that for a redundancy percentage of 100%, the grouping ratio is varied (from 1:1 to 3:3) until 100% repair capability is reached. The repair capability is 100% for a grouping ratio of 3:3, but the area overhead increases with the grouping ratio (2:2, 3:3) for a given redundancy percentage (100%). This is because of the more complex control and routing block (Fig. 3(a)). To reduce the hardware overhead, the redundancy percentage is lowered. For a redundancy percentage of 50%, the grouping ratio of 6:3 achieves a lower area overhead than 3:3 while still achieving 100% repair capability. For a redundancy percentage of 25%, the grouping ratio of 16:4 achieves 100% repair capability but uses more area than 6:3. Overall, the best grouping ratio, i.e. the one that achieves the lowest area overhead with 100% repair capability, is 50:5, at a redundancy percentage of 10%. This clearly shows that, for a given fault rate, the area overhead can be reduced while achieving 100% repair capability by carefully selecting the grouping ratio.
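The selection of the grouping ratio from Table III can be expressed as a simple filter-and-minimise step. The sketch below only post-processes the reported numbers; it does not reproduce the synthesis flow or the repair capability calculation of [13]. The type and function names are hypothetical, and the 25 µm<sup>2</sup> footprint per redundant TSV is taken from the text above.

```python
from typing import NamedTuple


class GroupingOption(NamedTuple):
    ratio: str            # regular:redundant TSVs per group
    redundant_tsvs: int   # total redundant TSVs for 1000 regular TSVs
    area_um2: int         # overall area overhead per die (Table III)
    repair_pct: int       # repair capability in percent (Table III)


TABLE_III = [
    GroupingOption("1:1", 1000, 63_350, 55),
    GroupingOption("2:2", 1000, 74_300, 99),
    GroupingOption("3:3", 1000, 75_993, 100),
    GroupingOption("2:1", 500, 47_525, 41),
    GroupingOption("4:2", 500, 56_088, 97),
    GroupingOption("6:3", 500, 57_252, 100),
    GroupingOption("12:3", 250, 48_069, 99),
    GroupingOption("16:4", 250, 57_554, 100),
    GroupingOption("40:4", 100, 52_287, 99),
    GroupingOption("50:5", 100, 56_235, 100),
    GroupingOption("100:5", 50, 54_301, 96),
    GroupingOption("120:6", 50, 58_302, 99),
]

TSV_AREA_UM2 = 25  # footprint of one redundant TSV, as stated in the text


def best_grouping(options, target_repair_pct=100):
    """Lowest-area option that meets the target repair capability."""
    feasible = [o for o in options if o.repair_pct >= target_repair_pct]
    if not feasible:
        raise ValueError("no grouping ratio meets the target repair capability")
    return min(feasible, key=lambda o: o.area_um2)


def redundant_tsv_area(option):
    """Part of the overhead that is due to the redundant TSVs alone."""
    return option.redundant_tsvs * TSV_AREA_UM2


if __name__ == "__main__":
    best = best_grouping(TABLE_III)
    print(best.ratio, best.area_um2, redundant_tsv_area(best))
```

With a 100% target the filter selects 50:5 at 56,235 µm<sup>2</sup>, of which 2,500 µm<sup>2</sup> is the redundant TSVs themselves. Relaxing the target to 99% would select 12:3 at 48,069 µm<sup>2</sup>, which illustrates how the target repair capability drives the choice.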

### V. Conclusion

This report has presented a cost-effective and efficient online fault tolerance technique, with detailed validation and evaluation of fault detection and recovery, for improving the in-field reliability of TSV-based 3D-IC designs. Three important latent TSV defect types have been considered: void, delamination and TSV short-to-substrate. Fault detection is carried out by the detection block using a transition delay test. Fault recovery is carried out using redundant TSVs and rerouting input/output signals to fault-free TSVs. The technique is efficient because it requires only 2(m+n) clock cycles for fault detection and recovery for a design with *m* regular and *n* redundant TSVs in a group. The proposed technique is implemented using a 65-nm design library. Detailed electrical and logical simulations are carried out to validate the operation of the detection and recovery blocks. It is shown that the area overhead can be reduced without affecting repair capability through appropriate grouping of regular and redundant TSVs.

### References

[1] K. Banerjee *et al.*, "3-D ICs: A novel chip design for improving deep-submicrometer interconnect performance and systems-on-chip integration," *Proc. IEEE*, vol. 89, no. 5, pp. 602-633, 2001.

[2] N. Miyakawa et al., "Multilayer stacking technology using wafer-to-wafer stacked method," J. Emerg. Technol. Comput. Syst. vol. 4, no. 4, 2008.

[3] Minki Cho et al., "Design method and test structure to characterize and repair TSV defect induced signal degradation in 3D system," Computer-Aided Design (ICCAD), pp.694-697, 7-11 Nov. 2010.

[4] Po-Yuan Chen *et al.*, "On-Chip TSV Testing for 3D IC before Bonding Using Sense Amplification," *Asian Test Symposium*, pp.450-455, Nov. 2009.

[5] S. Panth and Sung-Kyu Lim, "Transition delay fault testing of 3D ICs with IR-drop study," VLSI Test Symposium, pp.270-275, April 2012.

[6] Shi-Yu Huang et al., "Small delay testing for TSVs in 3-D ICs," In Design Automation Conference 2012.

[7] K. Chakrabarty *et al.*, "TSV defects and TSV-induced circuit failures: The third dimension in test and design-for-test," *Reliability Physics Symposium (IRPS)*, April 2012.

[8] Ye Fangming and K. Chakrabarty, "TSV open defects in 3D integrated circuits: Characterization, test, and optimal spare allocation," DAC, 2012.

[9] P. Ramm *et al.*, "Through silicon via technology - processes and reliability for wafer-level 3D system integration," *ECTC*, 2008.

[10] K.H Lu, et al., "Thermal stress induced delamination of through silicon vias in 3-D interconnects," ECTC, pp. 40-45, June 2010.

[11] L. Gyujei *et al.*, "Interfacial reliability and micropartial stress analysis between TSV and CPB through NIT and MSA," *ECTC* 2011.

[12] I. Loi et al., "A low-overhead fault tolerance scheme for TSV-based 3-D network on chip links", *ICCAD*, pp. 598-602, 2008.

[13] Y. Zhao et al., "Cost-Effective TSV Grouping for Yield Improvement of 3D-ICs", ATS, Nov 2011.

[14] A.-C. Hsieh et al., "TSV Redundancy: Architecture and Design Issues in 3D IC," pp. 166-171, DATE 2010.

[15] L. Jiang et al., "On effective TSV repair for 3D-stacked ICs," DATE, pp.793-798, March 2012.

[16] L. Jiang et al. "On effective and efficient in-field TSV repair for stacked 3D ICs", DAC 2013.

[17] T. Frank et al., "Reliability approach of high density Through Silicon Via (TSV)," EPTC, pp.321-324, Dec. 2010.

[18] G. Katti *et al.*, "Electrical Modeling and Characterization of Through Silicon via for Three-Dimensional ICs," *Electron Devices, IEEE Transactions on*, vol.57, no.1, pp.256-262, Jan 2010.

[19] S. Khursheed *et al.*, "Gate-sizing-based single  $V_{dd}$  test for bridge defects in multivoltage designs", *IEEE Trans. on CAD*, 2010.

[20] S. Khursheed *et al.*, "Delay Test for Diagnosis of Power Switches," *IEEE Trans. on Very Large Scale Integration (VLSI) Systems*, 2013.

[21] M. Abramovici et al., "Digital Systems Testing and Testable Design", Computer Science Press, 1990.

[22] Frank, T.; Chappaz, C.; Leduc, P.; Arnaud, L.; Lorut, F.; Moreau, S.; Thuaire, A.; El-Farhane, R.; Anghel, L., "Resistance increase due to electromigration induced depletion under TSV," *Reliability Physics Symposium (IRPS), 2011 IEEE International*, April 2011

[23] A. W. T. et al., "Enabling SOI based assembly technology for three dimensional integrated circuits," pp. 352–355, IEDM 2005.

[24] U. Kang et al., "8 Gb 3-D DDR3 DRAM using through-silicon-via technology," *IEEE Journal of Solid-State Circuits*, vol. 45, no. 1, pp. 111–119, 2010.

[25] IWLS 2005 Benchmark circuits. URL: http://iwls.org/iwls2005/benchmarks.html

[26] Cong, J.; Guojie Luo; Yiyu Shi, "Thermal-aware cell and through-silicon-via co-placement for 3D ICs," *Design Automation Conference (DAC)*, pp.670-675,2011