Efficient deep learning inference at the edge
Sadiq, Sulaiman (2025) Efficient deep learning inference at the edge. University of Southampton, Doctoral Thesis, 169pp.
Record type: Thesis (Doctoral)
Abstract
Deep learning has found success in a variety of fields. At the same time, the number of connected Internet of Things (IoT) devices is growing exponentially. This has led to the development of the field of TinyML, which aims to perform deep learning inference locally on resource-constrained IoT devices. Realising this objective requires optimisation across the inference stack, from the design of efficient algorithms to the inference systems that run them. In this thesis, we explore techniques for efficient model design and deployment under the complexity constraints imposed by device resources or application scenarios. The first contribution of this work is a gradient-based approach to derive models of varying complexity using neural architecture search (NAS). This is achieved by combining multi-objective optimisation with NAS, optimising the complexity of the model in addition to its quality. The method derives models of varying complexity without manual heuristics or expensive trial and error. The second contribution studies how inference software can effectively utilise device resources to enable the design and deployment of efficient models. Whereas prior work typically focuses on performing inference within internal memory constraints, we develop the TinyOps inference framework for microcontrollers (MCUs), which accelerates inference from external memories. TinyOps significantly raises the ceiling of achievable accuracy, with 1.4x-2.5x lower inference latency than previous approaches. The final contribution of this work is an in-depth analysis of deep neural network (DNN) deployment on MCUs, from which we derive heuristics for model design and show how inference can be adapted at runtime to varying latency constraints. We study the limitations of existing approaches and benchmark the throughput of low-level operations on MCUs. Using our heuristics, we derive models that achieve state-of-the-art TinyML ImageNet classification when considering accuracy, latency and energy efficiency. The heuristics are also utilised in a super-network-based approach to derive multiple models for different latency constraints. We show how an efficient accuracy-latency trade-off can be achieved at runtime with the TinyOps inference framework.
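To make the multi-objective formulation concrete, the sketch below shows the general idea behind a gradient-based, complexity-aware NAS objective: a differentiable estimate of model cost is added to the task loss so both can be minimised together by gradient descent. This is a minimal illustration under assumed names, not code from the thesis; `nas_loss`, `arch_logits`, `op_costs` and the penalty weight `lambda_c` are hypothetical.

```python
import torch

def nas_loss(task_loss, arch_logits, op_costs, lambda_c=0.1):
    # Illustrative sketch (not the thesis implementation): relax the discrete
    # per-layer operation choice into a softmax distribution so that the
    # expected complexity (e.g. MACs or latency) is differentiable w.r.t. the
    # architecture parameters and can be optimised jointly with model quality.
    probs = torch.softmax(arch_logits, dim=-1)   # (layers, candidate ops)
    expected_cost = (probs * op_costs).sum()     # expected complexity
    return task_loss + lambda_c * expected_cost

# Example: 3 layers, each choosing among 4 candidate ops with known costs.
arch_logits = torch.zeros(3, 4, requires_grad=True)  # learnable arch params
op_costs = torch.tensor([[1.0, 2.0, 4.0, 8.0]] * 3)  # per-op cost estimates
task_loss = torch.tensor(0.9)                        # stand-in for task loss
loss = nas_loss(task_loss, arch_logits, op_costs)
loss.backward()  # gradients flow into arch_logits through the cost term
```

Varying `lambda_c` shifts the balance between quality and complexity, which is one way models of differing complexity can be derived from the same search without manual heuristics.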
Text: SS_Thesis_A - Version of Record
Text: Final-thesis-submission-Examination-Mr-Sulaiman-Sadiq - Restricted to Repository staff only
More information
Published date: 2025
Identifiers
Local EPrints ID: 502129
URI: http://eprints.soton.ac.uk/id/eprint/502129
PURE UUID: 12a42638-5307-487d-8a06-4254fc300d65
Catalogue record
Date deposited: 17 Jun 2025 16:37
Last modified: 11 Sep 2025 02:14
Contributors
Author: Sulaiman Sadiq
Thesis advisor: Geoff Merrett
Thesis advisor: Jonathon Hare