The University of Southampton
University of Southampton Institutional Repository

MalFamAware: automatic family identification and malware classification through online clustering

Pitolli, Gregorio, Laurenza, Giuseppe, Aniello, Leonardo, Querzoni, Leonardo and Baldoni, Roberto (2020) MalFamAware: automatic family identification and malware classification through online clustering. International Journal of Information Security. (In Press)

Record type: Article

Abstract

The skyrocketing growth rate of new malware brings new challenges for protecting computers and networks. One way to keep pace with this trend is to distinguish truly novel malware from variants of known samples. This can be done by grouping known malware into families by similarity and classifying new samples into those families. As malware and their families evolve over time, approaches based on classifiers trained on a fixed ground truth are not suitable. Other techniques use clustering to identify families, but they need to periodically recluster the whole set of samples, which does not scale well. A promising approach is based on incremental clustering, where only not-yet-known samples are periodically clustered to identify new families, and classifiers are retrained accordingly. However, these solutions usually cannot react immediately to identify new malware families. In this paper we propose MalFamAware, a novel approach to malware family identification based on an online clustering algorithm, namely BIRCH, which efficiently updates clusters as new samples arrive without rescanning the entire dataset. MalFamAware can both classify new malware into existing families and identify new families at runtime. We present experimental evaluations in which MalFamAware outperforms both total reclustering and incremental clustering solutions in terms of accuracy and time. We also compare our solution with classifiers retrained over time, obtaining better accuracy, particularly when samples belong to not-yet-known families.
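
The online behaviour described in the abstract (absorb a new sample into an existing family, or open a new family on the spot, without rescanning earlier samples) can be illustrated with a minimal sketch. This is not the authors' implementation and not the full BIRCH algorithm: it keeps only BIRCH-style clustering features (count, linear sum, sum of squares) in a flat list, with no CF-tree. The class names, the `threshold` value and the two-dimensional feature vectors are assumptions made purely for illustration.

```python
import math


class ClusteringFeature:
    """BIRCH-style cluster summary: count, linear sum, sum of squared norms."""

    def __init__(self, point):
        self.n = 1
        self.ls = list(point)
        self.ss = sum(x * x for x in point)

    def centroid(self):
        return [s / self.n for s in self.ls]

    def radius_if_added(self, point):
        # Radius the cluster would have after absorbing `point`:
        # radius^2 = SS/N - ||LS/N||^2, computed from the summary alone.
        n = self.n + 1
        ls = [s + x for s, x in zip(self.ls, point)]
        ss = self.ss + sum(x * x for x in point)
        r2 = ss / n - sum((s / n) ** 2 for s in ls)
        return math.sqrt(max(r2, 0.0))  # guard against tiny negative round-off

    def add(self, point):
        self.n += 1
        self.ls = [s + x for s, x in zip(self.ls, point)]
        self.ss += sum(x * x for x in point)


class OnlineFamilyClusterer:
    """Hypothetical flat (tree-less) online clusterer, for illustration only."""

    def __init__(self, threshold):
        self.threshold = threshold  # assumed maximum cluster radius
        self.clusters = []

    def add_sample(self, point):
        """Return (family_id, is_new_family) for one incoming sample."""
        best_id, best_dist = None, float("inf")
        for i, cf in enumerate(self.clusters):
            d = math.dist(cf.centroid(), point)
            if d < best_dist:
                best_id, best_dist = i, d
        if (best_id is not None
                and self.clusters[best_id].radius_if_added(point) <= self.threshold):
            # Close enough: classify into the existing family and update it.
            self.clusters[best_id].add(point)
            return best_id, False
        # Too far from every known family: start a new one immediately.
        self.clusters.append(ClusteringFeature(point))
        return len(self.clusters) - 1, True


clusterer = OnlineFamilyClusterer(threshold=1.0)
for sample in ([0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [10.2, 9.9]):
    print(sample, clusterer.add_sample(sample))
```

Each sample is processed once, using only the stored summaries, which is the property that removes the need for periodic reclustering. A real deployment would need the CF-tree for efficient lookup and a feature extraction step for malware samples, neither of which is shown here.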

More information

Accepted/In Press date: 27 May 2020

Identifiers

Local EPrints ID: 441476
URI: http://eprints.soton.ac.uk/id/eprint/441476
ISSN: 1615-5270
PURE UUID: ed853308-8142-422b-976b-73402fb5f2c7
ORCID for Leonardo Aniello: orcid.org/0000-0003-2886-8445

Catalogue record

Date deposited: 15 Jun 2020 16:30
Last modified: 29 Jul 2021 01:51

Contributors

Author: Gregorio Pitolli
Author: Giuseppe Laurenza
Author: Leonardo Aniello
Author: Leonardo Querzoni
Author: Roberto Baldoni
