Hardware performance counters for system reliability monitoring

As technology scales to nanometre dimensions, the error rate due
to temperature and voltage variations, single-event effects and
component degradation increases, making components less
reliable. To ensure that a system continues to function correctly
despite these known reliability issues, it must have the means to
detect errors caused by faults. A system that behaves normally
(no error detected) exhibits a characteristic profile, and any
deviation from this profile indicates an anomaly in the system.
In this paper, we propose using hardware performance counters
(HPCs) to measure events that occur during program execution.
We explore the available counters that could be used to identify
anomalous behaviour, and develop a methodology for observing
anomalies with HPCs by creating a fault-free pattern and
monitoring subsequent changes in that pattern. We evaluate the
proposed technique using GemFI, an architectural simulator based
on Gem5 with additional fault injection capabilities. We compare
the results obtained at the end of execution with data collected
during a time interval. Our results show that HPCs can be used to
identify anomalous behaviour in a system that would lead to
failure.
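
The profile-and-deviation idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the counter names, the mean and standard deviation summary, and the three-sigma threshold are all assumptions chosen for illustration, and the counter values would in practice come from a simulator or hardware readout rather than hand-written dictionaries.

```python
from statistics import mean, pstdev


def build_profile(fault_free_runs):
    """Summarise fault-free runs as a per-counter (mean, std dev) pair.

    Each run is a dict mapping a counter name to its observed value.
    The counter names are hypothetical placeholders.
    """
    counters = list(fault_free_runs[0].keys())
    profile = {}
    for name in counters:
        values = [run[name] for run in fault_free_runs]
        profile[name] = (mean(values), pstdev(values))
    return profile


def find_anomalies(profile, observed, threshold=3.0):
    """Return the counters whose observed value deviates from the profile.

    A counter is flagged when it lies more than `threshold` standard
    deviations from the fault-free mean (an assumed rule). If a counter
    never varied in the fault-free runs, any change at all is flagged.
    """
    anomalies = []
    for name, (mu, sigma) in profile.items():
        if abs(observed[name] - mu) > threshold * sigma:
            anomalies.append(name)
    return anomalies


# Hypothetical counter values from three fault-free executions.
fault_free_runs = [
    {"instructions": 1000, "branch_mispredictions": 10},
    {"instructions": 1000, "branch_mispredictions": 12},
    {"instructions": 1000, "branch_mispredictions": 11},
]
profile = build_profile(fault_free_runs)

normal_run = {"instructions": 1000, "branch_mispredictions": 12}
faulty_run = {"instructions": 1040, "branch_mispredictions": 30}
```

Calling `find_anomalies(profile, normal_run)` yields an empty list, while `find_anomalies(profile, faulty_run)` flags both counters. The `observed` dictionary could hold either end-of-execution totals or the counts sampled over a single time interval, provided the profile was built from values collected the same way.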

10.1109/IVSW.2017.8031548

IEEE

Woo, Lai Leng

Halak, Basel

Zwolinski, Mark

3 July 2017

Woo, Lai Leng, Halak, Basel and Zwolinski, Mark (2017) Hardware performance counters for system reliability monitoring. In 2nd International Verification and Security Workshop: IVSW 2017. IEEE. (doi:10.1109/IVSW.2017.8031548).