Runtime Algorithm and Hardware Management for Efficient DNN Inference on Mobile/Embedded Platforms
Xun, Lei (2025) Runtime Algorithm and Hardware Management for Efficient DNN Inference on Mobile/Embedded Platforms. University of Southampton, Doctoral Thesis, 150pp.
Record type: Thesis (Doctoral)
Abstract
Deep neural network (DNN) inference is increasingly being executed on mobile and embedded platforms due to enhanced privacy, reduced latency, and improved energy efficiency. Efficient DNN deployment on these platforms is challenging because of their limited computing resources. Although many static DNN model compression approaches have been proposed, they rely on prior knowledge of application performance requirements and hardware resource availability to determine the compression ratio. However, because both of these factors vary at runtime, statically compressed models cannot maintain consistent performance.

Prior work has addressed this issue through algorithmic approaches (e.g., DNN model switching or dynamic DNNs) or runtime hardware resource management. However, there is limited literature on integrating the advantages of both algorithms and hardware at runtime. In this thesis, we investigate runtime DNN algorithm and hardware management, and develop a runtime system that optimises DNN performance as well as power and energy efficiency by exploiting the trade-off opportunities offered by both algorithms and hardware platforms.

First, our study finds that earlier dynamic DNN models suffer from significant memory overhead, a limited runtime model compression ratio, and a narrow range of dynamic performance trade-offs. To address these issues, we propose a dynamic DNN approach that uses incremental training and group convolution pruning. In this approach, the channels of each DNN convolutional layer are divided into groups, which are then trained incrementally. At runtime, these pre-trained groups can be pruned to reduce latency and energy consumption, or added back to recover accuracy, all in real time and without any retraining. At the same compression ratio, our dynamic DNN model achieves a 2.4× reduction in memory footprint compared to prior work. In addition, we combine dynamic voltage and frequency scaling (DVFS) and task mapping with the model, enabling fine-grained and wide-ranging dynamic performance trade-offs.

Next, we identify three common issues with existing dynamic DNN approaches: (1) significant training time, (2) incompatibility with state-of-the-art Neural Architecture Search (NAS) deployment pipelines, and (3) suboptimal inference on heterogeneous hardware platforms. To address these problems, we propose the Dynamic Super-Network, a novel dynamic DNN approach designed specifically for NAS models. Unlike traditional, resource-intensive approaches that train dynamic DNN models, this approach pre-samples diverse sub-networks from a NAS super-network, eliminating the need for additional training. By sampling a separate sub-network library for each type of heterogeneous hardware resource (e.g., CPU and GPU) on modern SoCs, one backbone super-network can efficiently scale across all hardware resources. On an Nvidia Jetson Xavier NX platform using the ImageNet dataset, our approach outperforms state-of-the-art work by achieving up to 3.5× (CPU) and 2.4× (GPU) faster inference at similar Top-1 accuracy, or delivering 3.8% (CPU) and 5.1% (GPU) higher accuracy at similar latency.

Finally, to exploit opportunities in both algorithms and hardware platforms, we propose a hierarchical runtime resource management approach that adjusts dynamic DNN models and DVFS to meet application- and user-level performance requirements (e.g., accuracy and latency) while respecting hardware constraints (e.g., power consumption). Compared with the Linux schedutil governor, our approach achieves a 13.7% reduction in energy consumption and a 6.5% reduction in latency when deploying a single DNN model, and up to a 47.2% reduction in energy consumption and a 19% reduction in latency when deploying two DNN models concurrently.
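To illustrate the group-convolution pruning idea described in the abstract, the following is a minimal sketch in PyTorch, not the thesis implementation: the output channels of a convolutional layer are split into groups, and a runtime knob selects how many pre-trained groups are active, so compute can be reduced or accuracy recovered without retraining. The class name, layer sizes, and the active_groups attribute are illustrative assumptions.

# Minimal sketch (assumed names and sizes): channels of a convolutional layer
# are split into G groups; at runtime only the first `active_groups` groups are
# used, trading latency/energy against accuracy without any retraining.
import torch
import torch.nn.functional as F

class GroupPrunableConv2d(torch.nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size, groups=4, padding=1):
        super().__init__()
        assert out_ch % groups == 0
        self.groups = groups
        self.group_size = out_ch // groups
        self.weight = torch.nn.Parameter(
            torch.randn(out_ch, in_ch, kernel_size, kernel_size) * 0.01)
        self.bias = torch.nn.Parameter(torch.zeros(out_ch))
        self.padding = padding
        self.active_groups = groups  # runtime knob: prune or re-add channel groups

    def forward(self, x):
        # use only the first `active_groups` channel groups; pruned groups are
        # simply skipped, so switching configurations needs no retraining
        k = self.active_groups * self.group_size
        return F.conv2d(x, self.weight[:k], self.bias[:k], padding=self.padding)

conv = GroupPrunableConv2d(3, 64, 3, groups=4)
x = torch.randn(1, 3, 32, 32)
print(conv(x).shape)      # full model: 64 output channels
conv.active_groups = 2    # runtime pruning: half the groups
print(conv(x).shape)      # compressed model: 32 output channels

In a full network, the next layer would slice its input channels to match, and each group would be trained incrementally while previously trained groups stay frozen.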
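The Dynamic Super-Network idea of pre-sampling per-hardware sub-network libraries can be sketched as follows, under stated assumptions: sub-network configurations are sampled from a super-network's search space, scored for accuracy and latency on each target device, and only the latency/accuracy Pareto front is kept per device. The helpers estimate_accuracy and measure_latency are hypothetical stand-ins (e.g., an accuracy predictor and on-device profiling), not the thesis code or any specific NAS framework API.

# Minimal sketch: build one sub-network library per heterogeneous hardware
# resource (e.g., CPU and GPU) from a single backbone super-network.
import random

def sample_subnet(rng):
    # a sub-network is described by per-stage depth, width and kernel choices
    return {
        "depths": [rng.choice([2, 3, 4]) for _ in range(5)],
        "widths": [rng.choice([0.5, 0.75, 1.0]) for _ in range(5)],
        "kernels": [rng.choice([3, 5, 7]) for _ in range(5)],
    }

def estimate_accuracy(cfg):
    # stand-in: a real system would use an accuracy predictor or validation run
    return 0.60 + 0.10 * sum(cfg["widths"]) / len(cfg["widths"]) + random.uniform(0, 0.02)

def measure_latency(cfg, device):
    # stand-in: a real system would time the sub-network on the target device
    scale = 1.0 if device == "gpu" else 3.0
    return scale * sum(d * w * k for d, w, k in zip(cfg["depths"], cfg["widths"], cfg["kernels"]))

def build_library(device, num_samples=500, seed=0):
    rng = random.Random(seed)
    points = []
    for _ in range(num_samples):
        cfg = sample_subnet(rng)
        points.append((measure_latency(cfg, device), estimate_accuracy(cfg), cfg))
    points.sort(key=lambda p: p[0])            # ascending latency
    library, best_acc = [], float("-inf")
    for lat, acc, cfg in points:               # keep the latency/accuracy Pareto front
        if acc > best_acc:
            library.append({"latency": lat, "accuracy": acc, "config": cfg})
            best_acc = acc
    return library

# one super-network, one library per heterogeneous resource on the SoC
libraries = {device: build_library(device) for device in ("cpu", "gpu")}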
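Finally, the hierarchical runtime management idea, which combines an algorithm knob (which sub-network to run) with a hardware knob (DVFS), can be sketched as a simple feasibility search: prefer the most accurate sub-network, then the lowest-power frequency that still meets the latency target within the power budget. All numbers and the latency model below are illustrative assumptions, not measurements or the controller from the thesis.

# Minimal sketch: pick a (sub-network, DVFS level) operating point that meets a
# latency target within a power budget, preferring accuracy, then low power.
SUBNETS = [        # (name, relative compute cost, top-1 accuracy) - illustrative
    ("large", 1.00, 0.76),
    ("medium", 0.60, 0.74),
    ("small", 0.35, 0.70),
]
DVFS_LEVELS = [    # (frequency in GHz, power in W) - illustrative
    (1.9, 10.0),
    (1.4, 6.5),
    (0.9, 4.0),
]

def predict_latency_ms(cost, freq_ghz, base_ms=40.0):
    # toy model: latency scales with compute cost and inversely with frequency
    return base_ms * cost / freq_ghz

def choose_operating_point(latency_target_ms, power_budget_w):
    # outer loop: most accurate sub-network first (algorithm level);
    # inner step: lowest-power frequency that still meets latency (hardware level)
    for name, cost, acc in SUBNETS:
        feasible = [(power, freq) for freq, power in DVFS_LEVELS
                    if power <= power_budget_w
                    and predict_latency_ms(cost, freq) <= latency_target_ms]
        if feasible:
            power, freq = min(feasible)        # lowest feasible power wins
            return {"subnet": name, "accuracy": acc, "freq_ghz": freq, "power_w": power}
    return None  # no feasible point: the caller must relax its requirements

print(choose_operating_point(latency_target_ms=25.0, power_budget_w=7.0))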
Text: Lei Xun - Runtime Algorithm and Hardware Management for Efficient DNN Inference on Mobile and Embedded Platforms - Version of Record
Text: Final-thesis-submission-Examination-Mr-Lei-Xun - Restricted to Repository staff only
More information
Published date: 2025
Identifiers
Local EPrints ID: 502355
URI: http://eprints.soton.ac.uk/id/eprint/502355
PURE UUID: df70a10f-7f67-4f39-9204-6610fcf6b526
Catalogue record
Date deposited: 24 Jun 2025 16:35
Last modified: 11 Sep 2025 02:14
Contributors
Author: Lei Xun
Thesis advisor: Geoff Merrett
Thesis advisor: Jonathon Hare
Thesis advisor: Bashir Al-Hashimi