The University of Southampton
University of Southampton Institutional Repository

High-resolution image-based malware classification using multiple instance learning

arXiv
Peters, Tim
b3ab1e07-326f-41c2-9813-f00f3b75bcf0
Farhat, Hikmat
4b7583f4-d03c-425e-a65a-82c0e157e7e6

Record type: UNSPECIFIED

Abstract

This paper proposes a novel method of classifying malware into families using high-resolution greyscale images and multiple instance learning to overcome adversarial binary enlargement. Current methods of visualisation-based malware classification largely rely on lossy transformations of inputs such as resizing to handle the large, variable-sized images. Through empirical analysis and experimentation, it is shown that these approaches cause crucial information loss that can be exploited. The proposed solution divides the images into patches and uses embedding-based multiple instance learning with a convolutional neural network and an attention aggregation function for classification. The implementation is evaluated on the Microsoft Malware Classification dataset and achieves accuracies of up to $96.6\%$ on adversarially enlarged samples compared to the baseline of $22.8\%$. The Python code is available online at https://github.com/timppeters/MIL-Malware-Images .
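The pipeline described in the abstract (greyscale image, patches, per-patch embeddings, attention aggregation) can be illustrated with a minimal sketch. This is not the authors' implementation, which is in the linked repository. The function names, the image width, the patch size, and the fixed attention parameters `V` and `w` are all hypothetical; the two-number patch embedding is a stand-in for the convolutional neural network; and the tanh-scored softmax is one common form of attention pooling, assumed here because the abstract does not specify the exact function.

```python
import math


def bytes_to_image(data, width):
    """Lay raw bytes out row by row as a greyscale image (last row zero-padded)."""
    rows = []
    for start in range(0, len(data), width):
        row = list(data[start:start + width])
        rows.append(row + [0] * (width - len(row)))
    return rows


def split_into_patches(image, patch_h, patch_w):
    """Cut the image into non-overlapping patches, zero-padding the edges."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch_h):
        for left in range(0, width, patch_w):
            patch = []
            for r in range(top, top + patch_h):
                row = image[r][left:left + patch_w] if r < height else []
                patch.append(row + [0] * (patch_w - len(row)))
            patches.append(patch)
    return patches


def embed_patch(patch):
    """Placeholder embedding (mean, standard deviation); a CNN would go here."""
    pixels = [p / 255.0 for row in patch for p in row]
    mean = sum(pixels) / len(pixels)
    var = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    return [mean, math.sqrt(var)]


def attention_pool(embeddings, V, w):
    """Assumed attention form: weights = softmax_k(w . tanh(V h_k))."""
    scores = []
    for h in embeddings:
        hidden = [math.tanh(sum(v * x for v, x in zip(v_row, h))) for v_row in V]
        scores.append(sum(wi * hi for wi, hi in zip(w, hidden)))
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(embeddings[0])
    bag = [sum(a * h[d] for a, h in zip(weights, embeddings)) for d in range(dim)]
    return bag, weights


# Toy "binary": 1024 bytes of content followed by 1024 appended zero bytes.
data = bytes(range(256)) * 4 + bytes(1024)
image = bytes_to_image(data, width=32)          # 64 rows of 32 pixels
patches = split_into_patches(image, 16, 16)     # the bag of instances
embeddings = [embed_patch(p) for p in patches]
# Illustrative, untrained attention parameters (hypothetical values).
bag, weights = attention_pool(embeddings, V=[[1.0, -0.5], [0.3, 2.0]], w=[0.7, -1.2])
print(len(patches), [round(a, 3) for a in weights])
```

Because the image is never resized, appending bytes to the file only adds patches to the bag and leaves the original patches unchanged. In a trained model, the attention weights would be learned so that uninformative padding patches contribute little to the bag-level representation passed to the classifier.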

Text
2311.12760v1 - Author's Original
Available under License Other.
Download (17MB)

More information

Published date: 21 November 2023
Additional Information: 13 figures, 2 tables
Keywords: cs.CR, cs.CV, cs.LG

Identifiers

Local EPrints ID: 492359
URI: http://eprints.soton.ac.uk/id/eprint/492359
PURE UUID: 5dcca454-3aba-463e-9026-df29de7d1dd0
ORCID for Hikmat Farhat: orcid.org/0000-0002-5043-227X

Catalogue record

Date deposited: 24 Jul 2024 17:12
Last modified: 25 Jul 2024 02:04

Contributors

Author: Tim Peters
Author: Hikmat Farhat
