Hardware performance counters for system reliability monitoring
As technology scaling reaches nanometre scales,
the error rate due to variations in temperature and voltage,
single event effects and component degradation increases, making
components less reliable. In order to ensure a system continues
to function correctly while facing known reliability issues, it is
imperative that the system should have the means to detect the
occurrence of errors due to the presence of faults. A system that
behaves normally (no error detected in the system) exhibits a
profile, and any deviations from this profile indicate that there
is an anomaly in the system. In this paper, we propose to use
hardware performance counters (HPCs) to measure events that
occur during the execution of the program. We explore the
various counters available which could be used to identify the
anomalous behaviour in the system and develop a methodology
to observe the anomalies using HPCs by creating a fault-free
pattern and observing any subsequent changes in that
pattern. We evaluate the proposed technique using GemFI, an
architectural simulator based on Gem5 with additional fault
injection capabilities. We compare the results obtained at the
end of the execution with data collected during a time interval.
Our results show that HPCs can be used to identify anomalous
behaviour in a system that would lead to failure.
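The profile-and-deviation idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the counter names, the sample values, and the three-standard-deviation tolerance are all assumptions chosen for illustration. It builds a fault-free profile from several runs and flags any counter in a later run that deviates from it.

```python
from statistics import mean, stdev


def build_profile(fault_free_runs):
    """Summarise fault-free runs as per-counter (mean, standard deviation)."""
    counters = fault_free_runs[0].keys()
    return {
        name: (
            mean(run[name] for run in fault_free_runs),
            stdev(run[name] for run in fault_free_runs),
        )
        for name in counters
    }


def find_anomalies(profile, observed, tolerance=3.0):
    """Return counters lying more than `tolerance` standard deviations
    from the fault-free mean (tolerance is an assumed threshold)."""
    anomalies = {}
    for name, (mu, sigma) in profile.items():
        deviation = abs(observed[name] - mu)
        # If the baseline never varied (sigma == 0), any difference is flagged.
        if deviation > tolerance * sigma:
            anomalies[name] = observed[name]
    return anomalies


# Hypothetical counter readings; real HPC names and values depend on the platform.
fault_free_runs = [
    {"instructions": 1000, "branch_mispredictions": 20, "cache_misses": 50},
    {"instructions": 1000, "branch_mispredictions": 22, "cache_misses": 48},
    {"instructions": 1000, "branch_mispredictions": 21, "cache_misses": 52},
]
observed = {"instructions": 1000, "branch_mispredictions": 21, "cache_misses": 90}

profile = build_profile(fault_free_runs)
anomalies = find_anomalies(profile, observed)
print(anomalies)
```

The same check could be applied either to totals read at the end of execution or to readings sampled at fixed intervals, which are the two views the abstract compares.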
Woo, Lai Leng
Halak, Basel
Zwolinski, Mark
3 July 2017
Woo, Lai Leng, Halak, Basel and Zwolinski, Mark (2017) Hardware performance counters for system reliability monitoring. In 2nd International Verification and Security Workshop: IVSW 2017. IEEE. (doi:10.1109/IVSW.2017.8031548).
Record type: Conference or Workshop Item (Paper)
More information
Published date: 3 July 2017
Identifiers
Local EPrints ID: 412724
URI: http://eprints.soton.ac.uk/id/eprint/412724
PURE UUID: 6b80a268-d336-42ba-a930-cf9e5bfde3bb
Catalogue record
Date deposited: 27 Jul 2017 16:30
Last modified: 16 Mar 2024 04:07
Contributors
Author: Lai Leng Woo
Author: Basel Halak
Author: Mark Zwolinski