Sounds of the deep: How input representation, model choice, and dataset size influence underwater sound classification performance

Convolutional Neural Networks (CNNs) have proven highly effective at automatically identifying and classifying underwater sound sources, enabling efficient analysis of marine environments. This work examines two key design choices for a CNN classifier, input representation and network architecture, analyzing how their importance changes as the training dataset size varies and how well they generalize between sites. Passive acoustic data from three offshore sites in Western Scotland were used for hierarchical classification, categorizing sounds into one of four classes: delphinid tonal, delphinid clicks, vessels, and ambient noise. Three input representations of the acoustic signals were investigated along with four CNN architectures, three of which were pre-trained for image classification tasks. Experiments show that a custom-built shallow CNN can outperform more complex architectures if the input representation is chosen appropriately. For example, a shallow CNN using a Mel-spectrogram normalized with Per-Channel Energy Normalization (MS-PCEN) achieved a 12.5% accuracy improvement over a ResNet model when only small amounts of training data were available. Comparing model performance across the three sites demonstrates that input representation is an important factor in achieving robust results between sites, with MS-PCEN achieving the best performance. However, the choice of input representation becomes less important as the training dataset size increases.
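To make the MS-PCEN input representation more concrete, the following is a minimal sketch of Per-Channel Energy Normalization applied to a Mel-band energy matrix, using the commonly published form of the PCEN equation. It is not the authors' pipeline: the function name `pcen`, the parameter values, and the random stand-in for a Mel-spectrogram are all assumptions made for illustration.

```python
import numpy as np


def pcen(energy, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Apply PCEN to a non-negative array of shape (n_bands, n_frames).

    Parameter values are illustrative defaults, not taken from the paper.
    """
    energy = np.asarray(energy, dtype=float)

    # First-order IIR smoother along time, computed independently per band:
    # M[t] = (1 - s) * M[t - 1] + s * E[t]
    smoothed = np.empty_like(energy)
    smoothed[:, 0] = energy[:, 0]  # initialise with the first frame
    for t in range(1, energy.shape[1]):
        smoothed[:, t] = (1.0 - s) * smoothed[:, t - 1] + s * energy[:, t]

    # Adaptive gain control: divide each band by its own smoothed energy.
    gain_normalised = energy / (eps + smoothed) ** alpha

    # Root compression with an offset, replacing the usual log scaling.
    return (gain_normalised + delta) ** r - delta ** r


# Hypothetical stand-in for a Mel-spectrogram: 64 bands, 200 frames.
rng = np.random.default_rng(0)
mel_energy = rng.random((64, 200)) + 0.1
ms_pcen = pcen(mel_energy)
```

Because each band is divided by a running estimate of its own energy, slowly varying background levels are suppressed while transient events remain visible. That is one plausible reason a representation of this kind could transfer better between recording sites with different ambient noise.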

10.1121/10.0036498

0001-4966

3017-3032

Olcay, Abdullah

7865eaa7-7e43-40fb-a2d3-3fc64f872faf

White, Paul R.

2dd2477b-5aa9-42e2-9d19-0806d994eaba

Bull, Jonathan M.

974037fd-544b-458f-98cc-ce8eca89e3c8

Risch, Denise

ac397a82-74f9-4305-956f-edd2ea5b2e3c

Dell, Benedict

9328b8aa-397f-4485-8fe3-db6e98ab6561

White, Ellen L.

50575aff-8aa1-4ee4-82e6-7e1bc5eefc70

18 April 2025