# A 3D-Stacked Cortex-M0 SoC with 20.3Gbps/mm<sup>2</sup> 7.1mW/mm<sup>2</sup> Simultaneous Wireless Inter-Tier Data and Power Transfer

Benjamin J. Fletcher\*<sup>†</sup>, Terrence Mak\*, Shidhartha Das<sup>†</sup>

\*University of Southampton, UK, <sup>†</sup>Arm Research, Cambridge, UK, bjf1g13@ecs.soton.ac.uk

## Abstract

This paper presents a 2-tier 3D-stacked Cortex-M0 SoC, in 65nm CMOS technology, with wireless inter-tier power and data transfer through an inductively coupled bus. The proposed design is the first implementation of a wireless link as part of a standard SoC bus, and achieves $20.3\,\mathrm{Gbps/mm^2}$ data transfer and $7.1\,\mathrm{mW/mm^2}$ power transfer simultaneously through a $250\,\mu m$ channel. This also makes it the smallest ever reported inductive data and power link.

## Introduction

Wireless 3D integration using Inductively-coupled Links (ILs) has recently gained popularity as a low-cost method of realising stacked 3D-ICs [1]. When using ILs, data is encoded as a series of current pulses which are fed through planar inductors fabricated in each die, forming a magnetic field within the die-stack. This field can be sensed by neighbouring stacked dies, allowing the data to be decoded, and hence communicated wirelessly. In contrast with conventional 3D-IC/3D-SiP solutions, such as flip-chip bonding, wire-bonding or Through-Silicon Vias (TSVs), ILs require no additional post-fabrication processing, as dies can simply be picked and stacked using adhesive, significantly reducing the assembly cost [1-4].

A range of prior works explore the use of ILs for communicating data within a 3D stack [1-5], but they typically provide power by separate means, such as wire-bonded connections [2-5]. Whilst this is an adequate solution, the addition of wire-bonds to each die undermines the cost-saving benefits associated with '*wireless*' integration. In this paper, we address this by presenting an inductively-coupled vertical bus that performs both wireless data and power transfer. We also, for the first time, incorporate the ILs *within* the main SoC AHB-Lite bus, by implementing a Wireless Bus Interface (WBI).

## SoC Architecture

Fig. 1 shows the architecture of the presented SoC, including 2 stacked Arm Cortex-M0 CPUs, SRAM, and 5 parallel ILs (forming the vertical part of the SoC bus): $2 \times$ Data and Power (DAP) uplinks, $2 \times$ Data Only (DO) downlinks, and $1 \times$ Clock Link (CL). The CL is used to forward the clock from the lower (master) die to the upper (slave) die, the DAP links provide unified data and power transmission from the master die to the slave die, and the DO links allow reciprocal (slave to master) data communication.

Fig. 2(a) illustrates the DAP link design, consisting of modulator and driver circuits, a full-wave rectifier and a sense-amplifier (SA) based receiver. To avoid the large area overheads associated with a separate wireless power transfer (WPT) link, the proposed DAP link performs simultaneous data and power delivery through a single channel. To achieve this, biphase shift keying (BPSK) modulation is used, where a high-frequency ($f_{hf}$) carrier signal (selected to match the resonant frequency of the coupled system) delivers power, whilst the data is encoded at a lower frequency ($f_{dat}$) by introducing 180 degrees of phase shift in the carrier signal at each data edge. The encoded signal is transmitted through the channel and a full-wave rectifier is used to recover the power from the $f_{hf}$ signal, whilst the SA samples the phase of the RX signal to recover the data. For typical operation $f_{dat} \ll f_{hf}$, and hence several phase samples are collected for each data bit. A majority voter circuit is then used to determine the most-probable value, as illustrated by the operation waveforms (Fig. 2(d)).

Maintaining customizability is important, and as such the presented links are integrated using a WBI, implementing the AHB-Lite protocol. This allows memory-mapped peripherals existing in separate physical dies to be addressed on the main system bus. The WBI uses a 36-bit packet structure with 4 preamble bits followed by 32 data bits.
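The phase-encoding and majority-vote recovery described above can be illustrated with a minimal behavioural sketch. It is not the circuit implementation: the function names, the representation of carrier phase as +1/-1 samples, the choice of 4 samples per bit, and the tie-breaking rule are all assumptions made for illustration.

```python
def bpsk_modulate(bits, samples_per_bit):
    """Map each data bit to a run of carrier-phase samples.

    +1 stands for a carrier phase of 0 degrees, -1 for 180 degrees, so the
    phase flips exactly when the data value changes.
    """
    return [1 if bit else -1 for bit in bits for _ in range(samples_per_bit)]


def flip_samples(phases, positions):
    """Model isolated sampling errors by inverting the chosen phase samples."""
    corrupted = list(phases)
    for index in positions:
        corrupted[index] = -corrupted[index]
    return corrupted


def majority_vote_demodulate(phases, samples_per_bit):
    """Recover one bit per window of phase samples by majority vote.

    A tie (possible for an even window) is resolved to 0; that choice is an
    assumption of this sketch.
    """
    bits = []
    for start in range(0, len(phases), samples_per_bit):
        window = phases[start:start + samples_per_bit]
        ones = sum(1 for phase in window if phase > 0)
        bits.append(1 if 2 * ones > len(window) else 0)
    return bits


bits = [1, 0, 1, 1, 0, 0, 1, 0]
tx = bpsk_modulate(bits, samples_per_bit=4)
rx = flip_samples(tx, range(0, len(tx), 4))  # one bad sample in every bit
recovered = majority_vote_demodulate(rx, samples_per_bit=4)
print(recovered == bits)  # True: a single bad sample per bit is outvoted
```

With several phase samples per bit, a single mis-sampled phase does not corrupt the recovered data, which is the motivation for voting when $f_{dat} \ll f_{hf}$.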

The Clock Link (CL) design is shown in Fig. 2 (b). The CL uses an H-bridge transmitter, with a 2-stage cascaded amplifier on the RX side. To conserve area, a smaller non-resonant coupled inductor pair is selected for the clock link (which is not used for WPT). Under normal operation, the high-frequency carrier is transmitted through the CL, and $f_{dat}$ and $f_{sys}$ (used for the MCU) are generated using dividers in each die.
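As a rough worked example of how the divided clocks relate to link throughput, the sketch below derives $f_{dat}$ from the forwarded carrier and estimates the payload rate of the 36-bit WBI packet. The divide ratio of 4 is inferred from the 1.43GHz carrier and the 357MHz $f_{dat}$ quoted for Fig. 6; the assumption of one packet bit per $f_{dat}$ cycle and the function names are ours, not taken from the design.

```python
F_HF_HZ = 1.43e9      # forwarded carrier frequency (from the measurements)
PACKET_BITS = 36      # 4 preamble bits + 32 data bits
PAYLOAD_BITS = 32


def data_clock_hz(f_hf_hz, divide_ratio):
    """Data clock produced by an integer divider from the carrier."""
    return f_hf_hz / divide_ratio


def payload_rate_bps(f_dat_hz):
    """Useful data rate, assuming one packet bit per f_dat cycle."""
    return f_dat_hz * PAYLOAD_BITS / PACKET_BITS


normal = data_clock_hz(F_HF_HZ, 4)   # 357.5 MHz
high_bw = data_clock_hz(F_HF_HZ, 1)  # 1.43 GHz (f_hf : f_dat = 1:1)

print(f"normal mode  f_dat = {normal / 1e6:.1f} MHz")
print(f"high-bw mode payload = {payload_rate_bps(high_bw) / 1e9:.2f} Gbps")
```

Under these assumptions the high-bandwidth payload rate comes out at about 1.27 Gbps, which agrees with the channel bandwidth listed for this work in Fig. 7, although the source does not state how that figure was derived.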

## Results

Combining these elements, the proposed SoC was implemented in a 3D stacked 65nm test-chip (Fig. 3). Two identical dies were stacked in a Face-to-Back (F2B) arrangement (for system scalability) with a lateral offset of 400  $\mu$ m (to align the TX and RX channels). Standard wafer-thinning was performed to a thickness of 70  $\mu$ m, resulting in a total communication distance of 80  $\mu$ m including the epoxy adhesive used for assembly. For BER measurements, data patterns were generated in the MCU software and written wirelessly to addresses in the opposite die across the AHB-Lite bus, using the WBI.

Fig. 4 shows the performance of the clock link. At 1.43GHz (the resonant DAP/DO link frequency) the CL operates at 17.0pJ/cycle with a BER $<10^{-12}$, and jitter <5%. Fig. 5 (a) shows the DAP link power delivery vs. frequency with a 100$\Omega$, 250fF load. For testing, the recovered power is exported to external pads for measurement; in other systems it can instead be regulated and used to power stacked slave peripherals. At the 1.43GHz operating frequency, the design achieves an average power delivery of 0.83mW across the two DAP links whilst transmitting a PRBS, and a peak power delivery of 0.88mW (transmitting constant data). Fig. 5 (b) presents simulation results showing how the maximum WPT scales with additional wafer thinning, demonstrating capability for up to 2.0mW/link of wireless power delivery when dies are F2F stacked.

Fig. 6 (a)-(b) summarizes the data transmission performance of the DAP and DO links, which achieve a BER $<10^{-7}$ under normal operation and $<10^{-6}$ for high bandwidth applications (where $f_{hf}$:$f_{dat}$ is 1:1). For error-sensitive applications, additional error correction code (ECC) can be implemented in the MCU software. An example of this is shown in Fig. 6 (c), where Hamming(32,26) SEC-DED ECC runs as part of the MCU data transfer application. Using this approach, the BER can be decreased to $<10^{-10}$.
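To make the software ECC step concrete, the following is a minimal sketch of an extended Hamming(32,26) SEC-DED code, in which each 32-bit bus word carries 26 data bits. It is written in Python for readability rather than as MCU firmware, and the bit layout (overall parity in bit 0, Hamming parity at positions 1, 2, 4, 8 and 16) and all names are assumptions; the source does not describe the layout used on the chip.

```python
PARITY_POSITIONS = (1, 2, 4, 8, 16)
DATA_POSITIONS = [p for p in range(1, 32) if p not in PARITY_POSITIONS]  # 26


def _syndrome(word):
    """XOR of the positions (1..31) of all set bits; 0 for a valid codeword."""
    syndrome = 0
    for pos in range(1, 32):
        if (word >> pos) & 1:
            syndrome ^= pos
    return syndrome


def encode(data):
    """Encode a 26-bit integer into a 32-bit SEC-DED codeword."""
    if not 0 <= data < (1 << 26):
        raise ValueError("data must fit in 26 bits")
    word = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (data >> i) & 1:
            word |= 1 << pos
    syndrome = _syndrome(word)
    for p in PARITY_POSITIONS:       # set parity bits to zero the syndrome
        if syndrome & p:
            word |= 1 << p
    if bin(word).count("1") % 2:     # bit 0 makes the overall parity even
        word |= 1
    return word


def decode(word):
    """Return (data, status): status is 'ok', 'corrected' or 'double_error'."""
    syndrome = _syndrome(word)
    parity_odd = bin(word).count("1") % 2 == 1
    if parity_odd:
        # Odd overall parity: one bit flipped, at position `syndrome`
        # (a syndrome of 0 means the overall parity bit itself).
        word ^= 1 << syndrome
        status = "corrected"
    elif syndrome != 0:
        # Even parity with a non-zero syndrome: two bits flipped.
        return None, "double_error"
    else:
        status = "ok"
    data = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (word >> pos) & 1:
            data |= 1 << i
    return data, status


codeword = encode(0x2ABCDEF)
print(decode(codeword))               # clean word
print(decode(codeword ^ (1 << 13)))   # single-bit error is corrected
print(decode(codeword ^ 0b11)[1])     # two-bit error is only detected
```

In a transfer application, the sender would write `encode(data)` across the wireless bus and the receiver would call `decode` on each word, requesting retransmission when a double error is reported.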

Finally, Fig. 7 compares the design with prior works, demonstrating a  $7.8 \times$  reduction in area per link compared with existing implementations where both power and data are transmitted wirelessly (making this the smallest reported inductive power and data link). We also achieve a  $1.7 \times$  bandwidth improvement (per unit area), whilst remaining competitive in terms of WPT/mm<sup>2</sup>. This work also represents the first instance of unified wireless power and data transmission in a 3D-IC, and the first integration of ILs using a standard SoC bus protocol.
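The headline ratios can be checked against the channel sizes and bandwidths listed in Fig. 7. The sketch below assumes that the comparison point is Radecki et al. [6], the smaller of the two prior data-and-power links; the source does not name the reference design explicitly, and the helper names are illustrative.

```python
def area_mm2(side_um):
    """Area of a square channel, given its side length in micrometres."""
    return (side_um / 1000.0) ** 2


def bandwidth_density(gbps, side_um):
    """Channel bandwidth per unit area, in Gbps/mm^2."""
    return gbps / area_mm2(side_um)


this_work = bandwidth_density(1.27, 250)   # 250 um x 250 um channel
radecki = bandwidth_density(6.0, 700)      # 700 um x 700 um channel [6]

area_ratio = area_mm2(700) / area_mm2(250)
bandwidth_ratio = this_work / radecki

print(f"this work:       {this_work:.2f} Gbps/mm^2")
print(f"Radecki et al.:  {radecki:.1f} Gbps/mm^2")
print(f"area reduction:  {area_ratio:.1f}x")
print(f"bandwidth gain:  {bandwidth_ratio:.1f}x")
```

Under that assumption the figures reproduce the reported 20.32 Gbps/mm<sup>2</sup>, the $7.8 \times$ area reduction and the $1.7 \times$ bandwidth-density improvement.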



Fig. 1: (a) 3D illustration, and (b) System diagram showing the main functional blocks of the SoC, including: simultaneous wireless Data And Power (DAP) uplinks, Data Only (DO) downlinks, Clock Link (CL), Arm Cortex M0 MCU, AHB-Lite bus interface, and SRAM banks. The upper (slave) die can be interchanged depending on the application, whilst still sourcing power, clock and data through the wireless bus.



Fig. 2: Schematic of (a) Simultaneous Data and Power (DAP) Link circuits, (b) Clock Link circuits, and (c) Operation Waveforms.





Fig. 4: Measured performance of inductively coupled clock link including (a) Energy and jitter, and (b) BER. The clock jitter (shown in the eye diagrams) is less than 5% across all frequencies, measured externally from wire-bonded debug IO pins (likely resulting in an inflated representation of the internal jitter).

## References

- [1] D. Ditzel *et al.*, IEEE Hot Chips, 2014.
- [2] B. Fletcher *et al.*, IEEE ESSCIRC, 2019.
- [3] N. Miura *et al.*, IEEE ISSCC, 2011.
- [4] D. Mizoguchi *et al.*, IEEE ISSCC, 2004.
- [5] N. Miura *et al.*, IEEE JSSC, vol. 46(4), 2011.
- [6] A. Radecki *et al.*, IEEE JSSC, vol. 47(10), 2012.
- [7] Y. Yuxiang *et al.*, IEEE Symp. on VLSI Circuits, 2009.



Fig. 5: (a) Measured power delivery performance of the Data And Power (DAP) Links 0-1 with a 100 Ω, 250 fF load. (b) Simulated maximum WPT vs. communication distance for the presented design.



Fig. 6: (a) Measured Bit Error Rate (BER) as a function of transmit power for links 0-1 (DAP) and 2-3 (DO) ( $f_{dat}$ =357MHz). (b) Measured BER in high bandwidth mode ( $f_{dat}$ =1.43GHz). (c) Measured BER with software-based ECC. (d) Oscilloscope capture showing system operation.

| Parameter | Miura et al. [3] | Mizoguchi et al. [4] | Miura et al. [5] | Radecki et al. [6] | Y. Yuxiang et al. [7] | This Work |
|-----------|------------------|----------------------|------------------|--------------------|-----------------------|-----------|
| Link Type | Data Only | Data Only | Data Only | Data And Power | Data And Power | Data And Power |
| Approach | NRZ-Encoded Data | NRZ-Encoded Data | NRZ-Encoded Data | Nested Power & Data | Time-Interleaved Power & Data | Unified Power & Data |
| Wafer Thickness | 25 μm | 300 μm | 20 μm (+2 μm glue) | 20 μm | 200 μm | 70 μm (+10 μm glue) |
| Channel Size | 900 μm × 900 μm | 100 μm × 100 μm | 110 μm × 110 μm | 700 μm × 700 μm | 2 mm × 2 mm | 250 μm × 250 μm |
| Channel Bandwidth | 2.4 Gbps | 1.2 Gbps | 1.1 Gbps | 6 Gbps | 0.15 Gbps | 1.27 Gbps |
| Power/Area | - | - | - | 20.8 mW/mm<sup>2</sup> | 3.37 mW/mm<sup>2</sup> | 7.1 mW/mm<sup>2</sup> |
| Data/Area | 2.96 Gbps/mm<sup>2</sup> | 120 Gbps/mm<sup>2</sup> | 90.9 Gbps/mm<sup>2</sup> | 12.2 Gbps/mm<sup>2</sup> | 0.036 Gbps/mm<sup>2</sup> | 20.32 Gbps/mm<sup>2</sup> |
| Technology | 180 nm | 0.35 μm | 65 nm | 65 nm | 180 nm | 65 nm |
| System Integration | Inductive Links Only | Inductive Links Only | Inductive Links Only | Inductive Links Only | ROM Interface | Full SoC |

Fig. 7: Summary of silicon measurement results including comparison with previously reported IL implementations. Compared with prior works with power and data transmission, the presented chip achieves a  $7.8 \times$  reduction in area per link and is the first to integrate ILs as part of a complete SoC.