The University of Southampton
University of Southampton Institutional Repository

Novel lockstep-based fault mitigation approach for SoCs with roll-back and roll-forward recovery

Kasap, Server, Wachter, Eduardo Weber, Zhai, Xiaojun, Ehsan, Shoaib and McDonald-Maier, Klaus D. (2021) Novel lockstep-based fault mitigation approach for SoCs with roll-back and roll-forward recovery. Microelectronics Reliability, 124, [114297]. (doi:10.1016/j.microrel.2021.114297).

Record type: Article

Abstract

All-Programmable System-on-Chips (APSoCs) are a compelling option for deploying applications in radiation environments thanks to their high computing performance and power efficiency. Despite these advantages, APSoCs are sensitive to radiation like any other electronic device. Processors embedded in APSoCs therefore have to be adequately hardened against ionizing radiation to make them a viable design choice for harsh environments. This paper proposes a novel lockstep-based approach to harden the dual-core ARM Cortex-A9 processor in the Xilinx Zynq-7000 APSoC against radiation-induced soft errors by coupling it with a MicroBlaze TMR subsystem in the programmable logic (PL) layer of the Zynq. The proposed technique uses checkpointing along with roll-back and roll-forward mechanisms at the software level (i.e. software redundancy), as well as processor replication and checker circuits at the hardware level (i.e. hardware redundancy). Results of fault injection experiments show that the proposed approach achieves high levels of protection against soft errors, mitigating around 98% of bit-flips injected into the register files of both ARM cores while keeping the timing performance overhead as low as 25% if block and application sizes are adjusted appropriately. Furthermore, incorporating the roll-forward recovery operation in addition to the roll-back operation improves the Mean Workload Between Failures (MWBF) of the system by up to ≈19%, depending on the nature of the running application. When a fault occurs, the application can proceed faster if it is treated with the roll-forward operation rather than the roll-back operation, so relatively more data can be processed before the next error occurs in the system.
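The difference between the two recovery operations named in the abstract can be illustrated with a toy simulation. The sketch below is not the paper's implementation: the checksum workload, the function names (`step`, `run_block`, `run_lockstep`) and the fault format are all hypothetical, and the way the healthy replica is identified for roll-forward is replaced by a simulation oracle, because the abstract does not describe that mechanism. Two replicas run each data block from the last checkpoint, a checker compares their states, and a mismatch triggers either a re-run of the block (roll-back) or adoption of the healthy replica's state (roll-forward).

```python
def step(state, word):
    """Toy workload: fold one input word into a 32-bit running checksum."""
    return (state * 31 + word) & 0xFFFFFFFF


def run_block(state, block):
    """Execute one data block starting from the given state."""
    for word in block:
        state = step(state, word)
    return state


def run_lockstep(data, block_size, faults=None, roll_forward=False):
    """Simulate two lockstep replicas with a checkpoint after each block.

    `faults` maps a block index to (replica, bit): one transient bit-flip
    injected into that replica's state the first time the block runs.
    Returns (final_state, block_runs), where block_runs counts block
    executions in wall-clock terms (both replicas run in parallel).
    """
    pending = dict(faults or {})
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    checkpoint = 0
    block_runs = 0
    index = 0
    while index < len(blocks):
        # Both replicas start from the last verified checkpoint.
        states = [run_block(checkpoint, blocks[index]) for _ in range(2)]
        block_runs += 1
        fault = pending.pop(index, None)  # transient: injected only once
        if fault is not None:
            replica, bit = fault
            states[replica] ^= 1 << bit
        if states[0] != states[1]:  # checker detects a mismatch
            if roll_forward and fault is not None:
                # ASSUMPTION: an oracle tells us which replica is healthy.
                # Copy its state to the faulty replica and keep going.
                healthy = 1 - fault[0]
                states = [states[healthy]] * 2
            else:
                # Roll-back: keep the old checkpoint and re-run this block.
                continue
        checkpoint = states[0]  # states agree, so commit a new checkpoint
        index += 1
    return checkpoint, block_runs
```

With 16 input words, a block size of 4 and a single bit-flip injected during the third block, both strategies end with the fault-free result. In this toy model roll-back needs five block executions and roll-forward needs four, which is the intuition behind the abstract's statement that more data can be processed before the next error when roll-forward is available.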

Text
1-s2.0-S0026271421000767-main (1) - Version of Record
Available under License Creative Commons Attribution.

More information

Accepted/In Press date: 18 July 2021
e-pub ahead of print date: 5 August 2021
Published date: September 2021
Keywords: Lockstep, Reliability, Fault tolerance, Soft error mitigation, Zynq APSoC, ARM Cortex-A processor, MicroBlaze processor

Identifiers

Local EPrints ID: 473500
URI: http://eprints.soton.ac.uk/id/eprint/473500
ISSN: 0026-2714
PURE UUID: bb8c546c-beb9-480d-9b52-f02bb4427b95
ORCID for Shoaib Ehsan: orcid.org/0000-0001-9631-1898

Catalogue record

Date deposited: 20 Jan 2023 17:59
Last modified: 17 Mar 2024 04:16

Contributors

Author: Server Kasap
Author: Eduardo Weber Wachter
Author: Xiaojun Zhai
Author: Shoaib Ehsan
Author: Klaus D. McDonald-Maier
