Persistence-based summaries for data analysis with applications to cyber security
First formalised by Poincaré in his seminal text Analysis Situs, meaning the geometry of position, topology is the mathematical study of structure that remains invariant under continuous deformation. For over a hundred years this understanding of shape was confined to pure mathematics. The advent of persistence-based summaries, which enable practitioners to compute concise representations of the topology of data with strong theoretical guarantees, has since led to applications of the topological notion of shape in data analysis and machine learning. The first part of this thesis is concerned with understanding and extending the application of persistence-based summaries to machine learning. Motivated by an investigation into the utility of topological loss terms through the lens of statistical learning theory, we adapt a recent extension of the higher-order Laplacian to the persistent case for machine learning, suggesting a vectorisation scheme and baselining its efficacy on the MNIST and MoleculeNet datasets. We find that it outperforms persistent homology across all of our baseline tasks. We also extend the ubiquitous fuzzy c-means clustering algorithm to the space of persistence diagrams, proving the same convergence guarantees as in the Euclidean case. We apply the fuzzy clustering algorithm to model selection, matching pre-trained deep learning models to datasets via the topology of their decision boundaries. In the second part of this thesis we consider applications of persistence-based summaries to cyber security. Cyber security is a critical application domain: the annual cost of cyber crime to the UK economy is estimated to exceed £27 billion, and the UK government considers cyber attacks a tier 1 national security risk. We investigate the utility of persistence-based summaries for detecting malicious behaviour in host-based computer logs, which are intrinsically highly structured. We find that our methods can rival a standard baseline from the literature.
University of Southampton
Davies, Thomas
September 2023
Sanchez Garcia, Ruben
Tran-Thanh, Long
Cirstea, Corina
Davies, Thomas (2023) Persistence-based summaries for data analysis with applications to cyber security. University of Southampton, Doctoral Thesis, 164pp.
Record type: Thesis (Doctoral)
Text: Thomas_Davies_Doctoral_thesis_pdfa - Version of Record
Text: Final-thesis-submission-Examination-Mr-Thomas-Davies - Restricted to Repository staff only
More information
Submitted date: April 2023
Published date: September 2023
Identifiers
Local EPrints ID: 481621
URI: http://eprints.soton.ac.uk/id/eprint/481621
PURE UUID: 1bd9009b-a0a7-43c4-8e80-fd6a6cff7e3a
Catalogue record
Date deposited: 05 Sep 2023 16:35
Last modified: 06 Jun 2024 01:48
Contributors
Author: Thomas Davies
Thesis advisor: Long Tran-Thanh
Thesis advisor: Corina Cirstea