Sounds of the deep: How input representation, model choice, and dataset size influence underwater sound classification performance
3017-3032
Olcay, Abdullah
White, Paul R.
Bull, Jonathan M.
Risch, Denise
Dell, Benedict
White, Ellen L.
18 April 2025
Olcay, Abdullah, White, Paul R., Bull, Jonathan M., Risch, Denise, Dell, Benedict and White, Ellen L. (2025) Sounds of the deep: How input representation, model choice, and dataset size influence underwater sound classification performance. Journal of the Acoustical Society of America, 157 (4), 3017-3032. (doi:10.1121/10.0036498).
Abstract
Convolutional Neural Networks (CNNs) have proven highly effective in automatically identifying and classifying underwater sound sources, enabling efficient analysis of marine environments. This work examines two key design choices for a CNN classifier, input representation and network architecture, analyzing their importance as training data size varies and their effectiveness in generalizing between sites. Passive acoustic data from three offshore sites in Western Scotland were used for hierarchical classification, categorizing sounds into one of four classes: delphinid tonal, delphinid clicks, vessels, and ambient noise. Three different input representations of the acoustic signals were investigated along with four CNN architectures, including three pre-trained for image classification tasks. Experiments show that a custom-built shallow CNN can outperform more complex architectures if the input representation is chosen appropriately. For example, a shallow CNN using a Mel-spectrogram normalised with Per Channel Energy Normalization (MS-PCEN) achieved a 12.5% accuracy improvement over a ResNet model when small amounts of training data were available. Studying model performance across the three sites demonstrates that input representation is an important factor for achieving robust results between sites, with MS-PCEN achieving the best performance. However, the importance of the choice of input representation decreases as the training dataset size increases.
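As an illustration of the MS-PCEN input representation named in the abstract, the following is a minimal sketch of Per Channel Energy Normalization applied to a precomputed band-energy spectrogram. It assumes the commonly used formulation (a first-order smoother per frequency band, followed by adaptive gain control and root compression). The function name `pcen` and the default values of `s`, `alpha`, `delta`, `r`, and `eps` are illustrative assumptions and are not taken from the paper. The Mel filterbank step is omitted.

```python
def pcen(energy, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Apply PCEN to a spectrogram given as a list of frames.

    energy: list of frames, each a list of non-negative band energies.
    All default parameter values are illustrative assumptions.
    """
    if not energy:
        return []
    # Initialise the per-band smoother M with the first frame.
    smoothed = list(energy[0])
    out = []
    for frame in energy:
        # First-order IIR smoother: M[t] = (1 - s) * M[t-1] + s * E[t]
        smoothed = [(1.0 - s) * m + s * e for m, e in zip(smoothed, frame)]
        # Adaptive gain control (divide by M^alpha), then root compression.
        out.append([
            (e / (eps + m) ** alpha + delta) ** r - delta ** r
            for e, m in zip(frame, smoothed)
        ])
    return out


# Hypothetical toy input: steady background in one band, then a transient.
frames = [[1.0]] * 50 + [[10.0]]
normalised = pcen(frames)
```

In this sketch a steady background is mapped to a nearly level-independent value, while a sudden increase in band energy stands out against the smoothed estimate.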
Text: JASA_Preprint 14 - Accepted Manuscript
More information
Accepted/In Press date: 4 April 2025
Published date: 18 April 2025
Identifiers
Local EPrints ID: 500698
URI: http://eprints.soton.ac.uk/id/eprint/500698
ISSN: 0001-4966
PURE UUID: 436b3ebc-20d0-4771-9d78-960c98884731
Catalogue record
Date deposited: 09 May 2025 17:17
Last modified: 10 May 2025 02:17
Contributors
Author: Abdullah Olcay
Author: Paul R. White
Author: Jonathan M. Bull
Author: Denise Risch
Author: Benedict Dell
Author: Ellen L. White