The University of Southampton
University of Southampton Institutional Repository

Sounds of the deep: How input representation, model choice, and dataset size influence underwater sound classification performance


Olcay, Abdullah, White, Paul R., Bull, Jonathan M., Risch, Denise, Dell, Benedict and White, Ellen L. (2025) Sounds of the deep: How input representation, model choice, and dataset size influence underwater sound classification performance. Journal of the Acoustical Society of America, 157 (4), 3017-3032. (doi:10.1121/10.0036498).

Record type: Article

Abstract

Convolutional Neural Networks (CNNs) have proven highly effective at automatically identifying and classifying underwater sound sources, enabling efficient analysis of marine environments. This work examines two key design choices for a CNN classifier, input representation and network architecture, analyzing how their importance changes as the training data size varies and how well they generalize between sites. Passive acoustic data from three offshore sites in Western Scotland were used for hierarchical classification, categorizing sounds into one of four classes: delphinid tonal, delphinid clicks, vessels, and ambient noise. Three input representations of the acoustic signals were investigated along with four CNN architectures, including three pre-trained for image classification tasks. Experiments show that a custom-built shallow CNN can outperform more complex architectures if the input representation is chosen appropriately. For example, a shallow CNN using a Mel-spectrogram normalized with Per-Channel Energy Normalization (MS-PCEN) achieved a 12.5% accuracy improvement over a ResNet model when only small amounts of training data were available. Comparing model performance across the three sites demonstrates that input representation is an important factor in achieving robust results between sites, with MS-PCEN performing best. However, the choice of input representation matters less as the training dataset size increases.
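
For readers unfamiliar with the normalization named in the abstract, the following is a minimal sketch of Per-Channel Energy Normalization applied to a grid of Mel-band energies. It is not the authors' implementation: the function name `pcen`, the frames-by-channels list layout, and all default parameter values (`s`, `alpha`, `delta`, `r`, `eps`) are illustrative assumptions, since this record does not state the settings used in the paper. The sketch follows the standard PCEN form: each channel is smoothed over time with a first-order low-pass filter, the input is divided by that smoothed energy raised to a power (adaptive gain control), and the result is root-compressed.

```python
from typing import List


def pcen(energy: List[List[float]],
         s: float = 0.025,      # smoothing coefficient (assumed value)
         alpha: float = 0.98,   # gain-control strength (assumed value)
         delta: float = 2.0,    # bias added before compression (assumed value)
         r: float = 0.5,        # compression exponent (assumed value)
         eps: float = 1e-6) -> List[List[float]]:
    """Apply PCEN to a (frames x channels) grid of non-negative energies."""
    if not energy:
        return []
    # Smoother state M, one value per channel, initialised from the first frame.
    smooth = list(energy[0])
    out = []
    for frame in energy:
        row = []
        for ch, e in enumerate(frame):
            # First-order IIR low-pass filter along time, per channel:
            # M[t] = (1 - s) * M[t-1] + s * E[t]
            smooth[ch] = (1.0 - s) * smooth[ch] + s * e
            # Adaptive gain control, then root compression:
            # PCEN = (E / (eps + M)^alpha + delta)^r - delta^r
            gain = (eps + smooth[ch]) ** alpha
            row.append((e / gain + delta) ** r - delta ** r)
        out.append(row)
    return out


# Hypothetical input: 3 time frames, 2 Mel channels.
mel_energy = [[1.0, 100.0], [1.0, 100.0], [4.0, 100.0]]
normalized = pcen(mel_energy)
```

Because each channel is divided by its own running energy, a steadily loud channel and a steadily quiet one end up on a similar scale, while a sudden rise (the last frame of the first channel above) stands out. This per-channel adaptation is one plausible reason such a representation could transfer between recording sites with different background noise, though the record itself does not give the mechanism.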

Text
JASA_Preprint 14 - Accepted Manuscript

More information

Accepted/In Press date: 4 April 2025
Published date: 18 April 2025

Identifiers

Local EPrints ID: 500698
URI: http://eprints.soton.ac.uk/id/eprint/500698
ISSN: 0001-4966
PURE UUID: 436b3ebc-20d0-4771-9d78-960c98884731
ORCID for Paul R. White: orcid.org/0000-0002-4787-8713
ORCID for Jonathan M. Bull: orcid.org/0000-0003-3373-5807
ORCID for Ellen L. White: orcid.org/0000-0002-3787-8699

Catalogue record

Date deposited: 09 May 2025 17:17
Last modified: 10 May 2025 02:17

Contributors

Author: Abdullah Olcay
Author: Paul R. White
Author: Jonathan M. Bull
Author: Denise Risch
Author: Benedict Dell
Author: Ellen L. White
