Nucleus: finding the sharing limit of heterogeneous cores
Nucleus: finding the sharing limit of heterogeneous cores
Heterogeneous multi-processors are designed to bridge the gap between performance and energy efficiency in modern embedded systems. This is achieved by pairing Out-of-Order (OoO) cores, yielding performance through aggressive speculation and latency masking, with In-Order (InO) cores, that preserve energy through simpler design. By leveraging migrations between them, workloads can therefore select the best setting for any given energy/delay envelope. However, migrations introduce execution overheads that can hurt performance if they happen too frequently. Finding the optimal migration frequency is critical to maximize energy savings while maintaining acceptable performance. We develop a simulation methodology that can 1) isolate the hardware effects of migrations from the software, 2) directly compare the performance of different core types, 3) quantify the performance degradation and 4) calculate the cost of migrations for each case. To showcase our methodology we run mibench, a microbenchmark suite, and show that migrations can happen as fast as every 100k instructions with little performance loss. We also show that, contrary to numerous recent studies, hypothetical designs do not need to share all of their internal components to be able to migrate at that frequency. Instead, we propose a feasible system that shares level 2 caches and a translation lookaside buffer that matches performance and efficiency. Our results show that there are phases comprising up to 10% that a migration to the OoO core leads to performance benefits without any additional energy cost when running on the InO core, and up to 6% of phases where a migration to the InO core can save energy without affecting performance. When considering a policy that focuses on improving the energy-delay product, results show that on average 66% of the phases can be migrated to deliver equal or better system operation without having to aggressively share the entire memory system or to revert to migration periods finer than 100k instructions.
Vougioukas, Ilias
b5654d64-ff5c-43ab-a005-97a72cc343d7
Sandberg, Andreas
d09c2a2a-151d-439c-b258-b852eeb56e33
Diestelhorst, Stephan
5ac0a14f-5a42-4e09-a173-399b97170272
Al-Hashimi, Bashir
0b29c671-a6d2-459c-af68-c4614dce3b5d
Merrett, Geoffrey
89b3a696-41de-44c3-89aa-b0aa29f54020
October 2017
Vougioukas, Ilias
b5654d64-ff5c-43ab-a005-97a72cc343d7
Sandberg, Andreas
d09c2a2a-151d-439c-b258-b852eeb56e33
Diestelhorst, Stephan
5ac0a14f-5a42-4e09-a173-399b97170272
Al-Hashimi, Bashir
0b29c671-a6d2-459c-af68-c4614dce3b5d
Merrett, Geoffrey
89b3a696-41de-44c3-89aa-b0aa29f54020
Vougioukas, Ilias, Sandberg, Andreas, Diestelhorst, Stephan, Al-Hashimi, Bashir and Merrett, Geoffrey
(2017)
Nucleus: finding the sharing limit of heterogeneous cores.
ACM Transactions on Embedded Computing Systems, 16 (5s).
(doi:10.1145/3126544).
Abstract
Heterogeneous multi-processors are designed to bridge the gap between performance and energy efficiency in modern embedded systems. This is achieved by pairing Out-of-Order (OoO) cores, yielding performance through aggressive speculation and latency masking, with In-Order (InO) cores, that preserve energy through simpler design. By leveraging migrations between them, workloads can therefore select the best setting for any given energy/delay envelope. However, migrations introduce execution overheads that can hurt performance if they happen too frequently. Finding the optimal migration frequency is critical to maximize energy savings while maintaining acceptable performance. We develop a simulation methodology that can 1) isolate the hardware effects of migrations from the software, 2) directly compare the performance of different core types, 3) quantify the performance degradation and 4) calculate the cost of migrations for each case. To showcase our methodology we run mibench, a microbenchmark suite, and show that migrations can happen as fast as every 100k instructions with little performance loss. We also show that, contrary to numerous recent studies, hypothetical designs do not need to share all of their internal components to be able to migrate at that frequency. Instead, we propose a feasible system that shares level 2 caches and a translation lookaside buffer that matches performance and efficiency. Our results show that there are phases comprising up to 10% that a migration to the OoO core leads to performance benefits without any additional energy cost when running on the InO core, and up to 6% of phases where a migration to the InO core can save energy without affecting performance. When considering a policy that focuses on improving the energy-delay product, results show that on average 66% of the phases can be migrated to deliver equal or better system operation without having to aggressively share the entire memory system or to revert to migration periods finer than 100k instructions.
Text
38_Vougioukas
- Accepted Manuscript
More information
Accepted/In Press date: 30 June 2017
e-pub ahead of print date: 10 October 2017
Published date: October 2017
Identifiers
Local EPrints ID: 412917
URI: http://eprints.soton.ac.uk/id/eprint/412917
ISSN: 1539-9087
PURE UUID: 94471bd3-75b4-49a2-9c0c-52ba8db2dba4
Catalogue record
Date deposited: 08 Aug 2017 16:31
Last modified: 16 Mar 2024 03:46
Export record
Altmetrics
Contributors
Author:
Ilias Vougioukas
Author:
Andreas Sandberg
Author:
Stephan Diestelhorst
Author:
Bashir Al-Hashimi
Author:
Geoffrey Merrett
Download statistics
Downloads from ePrints over the past year. Other digital versions may also be available to download e.g. from the publisher's website.
View more statistics