Efficient deep learning inference at the edge
Sadiq, Sulaiman (2025) Efficient deep learning inference at the edge. University of Southampton, Doctoral Thesis, 169pp.
Record type: Thesis (Doctoral)
Abstract
Deep learning has found success in a variety of fields. At the same time, the number of connected Internet of Things (IoT) devices is growing exponentially. This has led to the development of the field of TinyML, which aims to perform deep learning inference locally on resource-constrained IoT devices. Realising this objective requires optimisation across the inference stack, from the design of efficient algorithms to the inference systems that run them. In this thesis, we explore techniques for efficient model design and deployment under the complexity constraints imposed by device resources or application scenarios. The first contribution of this work is a gradient-based approach to derive models of varying complexity using neural architecture search (NAS). This is achieved by combining multi-objective optimisation with NAS, optimising the complexity of the model in addition to its quality. The method derives models of varying complexity without manual heuristics or expensive trial and error. The second contribution studies how inference software can effectively utilise device resources to enable the design and deployment of efficient models. Whereas prior work typically focuses on performing inference within internal memory constraints, we develop the TinyOps inference framework for microcontrollers (MCUs), which accelerates inference from external memories. TinyOps significantly raises the ceiling of achievable accuracy, with 1.4x-2.5x lower inference latency than previous approaches. The final contribution of this work is an in-depth analysis of deep neural network (DNN) deployment on MCUs, from which we derive heuristics for model design and show how inference can be adapted at runtime to varying latency constraints. We study the limitations of existing approaches and benchmark the throughput of low-level operations on MCUs. Using our heuristics, we derive models that achieve state-of-the-art TinyML ImageNet classification when considering accuracy, latency and energy efficiency. The heuristics are also utilised in a super-network-based approach to derive multiple models for different latency constraints. We show how an efficient accuracy-latency trade-off can be achieved at runtime with the TinyOps inference framework.
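To make the multi-objective formulation concrete, the sketch below shows the general idea behind a gradient-based, complexity-aware NAS objective: a differentiable estimate of model cost is added to the task loss so both can be minimised together by gradient descent. This is a minimal illustration under assumed names, not code from the thesis; `nas_loss`, `arch_logits`, `op_costs` and the penalty weight `lambda_c` are hypothetical.

```python
import torch

def nas_loss(task_loss, arch_logits, op_costs, lambda_c=0.1):
    # Illustrative sketch (not the thesis implementation): relax the discrete
    # per-layer operation choice into a softmax distribution so that the
    # expected complexity (e.g. MACs or latency) is differentiable w.r.t. the
    # architecture parameters and can be optimised jointly with model quality.
    probs = torch.softmax(arch_logits, dim=-1)   # (layers, candidate ops)
    expected_cost = (probs * op_costs).sum()     # expected complexity
    return task_loss + lambda_c * expected_cost

# Example: 3 layers, each choosing among 4 candidate ops with known costs.
arch_logits = torch.zeros(3, 4, requires_grad=True)  # learnable arch params
op_costs = torch.tensor([[1.0, 2.0, 4.0, 8.0]] * 3)  # per-op cost estimates
task_loss = torch.tensor(0.9)                        # stand-in for task loss
loss = nas_loss(task_loss, arch_logits, op_costs)
loss.backward()  # gradients flow into arch_logits through the cost term
```

Varying `lambda_c` shifts the balance between quality and complexity, which is one way models of differing complexity can be derived from the same search without manual heuristics.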
Text: SS_Thesis_A - Version of Record
Text: Final-thesis-submission-Examination-Mr-Sulaiman-Sadiq - Restricted to Repository staff only
More information
Published date: 2025
Identifiers
Local EPrints ID: 502129
URI: http://eprints.soton.ac.uk/id/eprint/502129
PURE UUID: 12a42638-5307-487d-8a06-4254fc300d65
Catalogue record
Date deposited: 17 Jun 2025 16:37
Last modified: 11 Sep 2025 02:14
Contributors
Author: Sulaiman Sadiq
Thesis advisor: Geoff Merrett
Thesis advisor: Jonathon Hare