A computer vision method for finding mislabelled specimens within natural history collections
1. Natural history collections are essential for biodiversity and evolution research and for studying biotic responses to global change. However, the numbers of specimens within natural history collections pose management challenges. Reduced funds, declining taxonomic training, and expanding collections can lead to mislabelled or missing specimens. This highlights the need for innovative and non-destructive methods of taxonomic verification for specimens in large collections. While genetic analyses offer precise verification, they are resource-intensive and less effective on degraded DNA from older specimens, with risks of damage to smaller specimens.
2. Computer vision can automate tasks such as species-level verification and morphological examination, though these techniques have yet to be incorporated and utilised by natural history collections for such management tasks. Digitisation initiatives, such as those at the Natural History Museum, London (NHM), have gained momentum in recent years, converting specimens to digital formats and enhancing global accessibility.
3. Here, we describe a computer vision pipeline applied to the digitised British and Irish Lepidoptera collection at the NHM. Specifically, our pipeline identifies specimens that do not match their labelled species status.
4. The pipeline was run 100 times for the Butterfly and Moth datasets, with 99,350 of 350,208 specimens (28.37 %) flagged at least once. We attribute a portion of these flags to pipeline error, given the likelihood of some mislabelled specimens within the training datasets. However, specimens flagged consistently across > 80 % of pipeline runs are likely mislabelled within the collections. Taxonomic experts visually examined 210 such specimens, finding 145 to be incorrectly labelled in the collection or the NHM data portal. Additionally, 30 specimens were sent for genetic verification to confirm species-level identification.
5. This synergy of computer vision and genetic-based species identification enhances the accuracy and efficiency of managing natural history collections, preserving their value for future generations.
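The flagging rule in point 4 (count how often each specimen is flagged across repeated runs, then keep those flagged in more than 80 % of runs) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the representation of each run as a set of specimen IDs, and the toy data are all assumptions made for illustration. Only the 80 % threshold and the 99,350 of 350,208 figure come from the abstract.

```python
from collections import Counter


def flag_rates(runs):
    """Return the fraction of runs in which each specimen was flagged.

    `runs` is a list of sets of specimen IDs, one set per pipeline run
    (a hypothetical representation, not the published data format).
    """
    counts = Counter()
    for flagged in runs:
        counts.update(set(flagged))  # count each specimen once per run
    total = len(runs)
    return {specimen: n / total for specimen, n in counts.items()}


def consistently_flagged(runs, threshold=0.80):
    """Return specimens flagged in strictly more than `threshold` of runs."""
    rates = flag_rates(runs)
    return sorted(s for s, rate in rates.items() if rate > threshold)


# Toy data: 10 runs over four hypothetical specimens.
# A is flagged in 10 runs, C in 9, B in 8, D in 1.
runs = []
for i in range(10):
    flagged = {"A"}
    if i < 9:
        flagged.add("C")
    if i < 8:
        flagged.add("B")
    if i == 0:
        flagged.add("D")
    runs.append(flagged)

flagged_at_least_once = len(flag_rates(runs))   # 4 specimens
likely_mislabelled = consistently_flagged(runs)  # ["A", "C"]

# Arithmetic check of the proportion reported in the abstract.
reported_percentage = round(100 * 99_350 / 350_208, 2)  # 28.37
```

In this toy case, B is flagged in exactly 80 % of runs and is therefore excluded, because the rule is read here as a strict "greater than" threshold.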
Hollister, Jack Daniel
6276291d-9921-47d5-935d-008f68d00f2c
Martin, Geoff
8878a3c4-b538-4e38-90ee-8466a6f22093
Cai, Xiaohao
de483445-45e9-4b21-a4e8-b0427fc72cee
Horton, Tammy
c4b41665-f0bc-4f0f-a7af-b2b9afc02e34
Powell, Owain
d75df952-0e6b-40f5-b0e3-25b5828b5147
Sterling, Mark
e32d0e0b-3931-4abd-ad1c-13f852b0e00d
Turnbull, Glory
2c8eb61d-9377-4a91-8219-1cd189fc400c
Price, Ben
2ac259e5-45da-4c50-b31b-5daec7c5414e
Fenberg, Phillip
c73918cd-98cc-41e6-a18c-bf0de4f1ace8
July 2025
Hollister, Jack Daniel, Martin, Geoff, Cai, Xiaohao, Horton, Tammy, Powell, Owain, Sterling, Mark, Turnbull, Glory, Price, Ben and Fenberg, Phillip
(2025)
A computer vision method for finding mislabelled specimens within natural history collections.
Ecology and Evolution, 15 (7), [e71648].
(doi:10.1002/ece3.71648).
Text
JDH_revised_manuscript_CLEAN
- Accepted Manuscript
Restricted to Repository staff only until 1 October 2025.
Text
Hollister et al. 2025
- Version of Record
More information
Accepted/In Press date: 12 June 2025
e-pub ahead of print date: 13 July 2025
Published date: July 2025
Identifiers
Local EPrints ID: 503385
URI: http://eprints.soton.ac.uk/id/eprint/503385
ISSN: 2045-7758
PURE UUID: aa5b3141-0384-4c26-9151-f485c8b9eb96
Catalogue record
Date deposited: 30 Jul 2025 16:35
Last modified: 18 Sep 2025 02:01
Contributors
Author: Jack Daniel Hollister
Author: Geoff Martin
Author: Xiaohao Cai
Author: Tammy Horton
Author: Owain Powell
Author: Mark Sterling
Author: Glory Turnbull
Author: Ben Price