READ ME File For ‘Drennan 2024 Doctoral Thesis data’

Dataset DOI: 10.5258/SOTON/D2958

ReadMe Author: Regan Drennan, University of Southampton, https://orcid.org/0000-0003-0137-5464

This dataset supports the thesis entitled “Patterns of Diversity, Connectivity, and Evolution in Southern Ocean and Deep-Sea Annelids”
AWARDED BY: University of Southampton
DATE OF AWARD: 2024

DESCRIPTION OF THE DATA:

Data supporting Thesis Chapters 4 and 5, titles as follows:

Chapter 4: "Do molecular barcodes enhance morphological species identification in biodiversity assessments? A case study in integrative identification of annelid fauna from the Prince Gustav Channel, Northeastern Antarctic Peninsula"
Chapter 5: "Population genomics, cryptic diversity and phylogeographic structure in the Southern Ocean circumpolar annelid, Aglaophamus trissophyllus (Annelida: Nephtyidae)".

Data Chapters 2 and 3 are published, with supporting data uploaded to relevant repositories:
Chapter 2 DOI: 10.5852/ejt.2021.760.1447
Chapter 3 DOI: 10.3389/fmars.2020.595303

Five datasets are included in this data entry as follows:
Ch4_specimen_data_table.xlsx
Ch5_specimen_data_table.xlsx
Ch4_PGC_barcode_sequences.zip
Ch4_Kosterfjord_M_sarsi_sequences.zip
Ch5_Aglaophamus_sequences.zip

Also included is a link to an additional dataset titled "Drennan 2024 Doctoral Thesis Chapter 5 dataset - SNP catalogs" held externally on ZENODO due to large file size
LINK: https://zenodo.org/records/10606641?token=eyJhbGciOiJIUzUxMiJ9.eyJpZCI6IjMyMTg5Y2E4LWMwODUtNDU2ZS1hYjc5LTBkMGI5YTI5MDE1MyIsImRhdGEiOnt9LCJyYW5kb20iOiI1ZWIxOTI3YzNhOGVjYTBkODdmOTlkYjM1YmZiOGNmOSJ9.BnIA9e6NbazlZihzzB3qQa-6ZcKzmImL8t3A2Af4uhAecdcY8bIU3VECmDBGn1BzMwIX5O6WmKil6STBWcAx5A
ZENODO DOI: 10.5281/zenodo.10606641

.xlsx Excel spreadsheets contain full specimen collection, occurrence, and taxonomy data for specimens examined in thesis Chapters 4 and 5, using Darwin Core Archive style headings (see: https://dwc.tdwg.org/list/).

.zip files include folders
containing sequence data in the form of FASTA format text files for DNA barcode sequence data examined in Chapters 4 and 5. FASTA files can be opened and read with text editing software, or with specialised DNA sequence software such as Geneious, MEGA, and Mesquite. DNA sequence data was generated via Sanger sequencing and assembled in Geneious 10.09.01. DNA markers include COI, 16S rDNA, and 18S rDNA (separate FASTA file for each marker). See thesis Chapter 4 Methods section 4.2.2 and Chapter 5 Methods section 5.2.2 for more details.

ZENODO hosted data includes genomic data generated in thesis Chapter 5. Single nucleotide polymorphism (SNP) genomic data was prepared and sequenced using a ddRADseq library preparation protocol. Following sequencing, filtering and locus assembly were carried out using Stacks v 2.64 (https://catchenlab.life.illinois.edu/stacks/). Stacks generates a catalog to determine which haplotype alleles are present at every locus in each individual. This dataset includes all catalogs analysed in thesis Chapter 5 following initial QC, processing, and quality filtering steps. See Chapter 5 Results section 5.2.5 for more details.

Detailed description of each dataset is as follows:

____________________________
Ch4_specimen_data_table.xlsx

Single Excel spreadsheet of Prince Gustav Channel (Northeastern Antarctic Peninsula) annelid specimen collection, occurrence and taxonomy data, examined and updated in Chapter 4.

Information about geographic location of data collection: Prince Gustav Channel, Northeastern Antarctic Peninsula, Southern Ocean (see Ch4_specimen_data_table.xlsx for details)
Date of data collection: Specimens collected on RRS James Clark Ross expedition JR17003a February–March 2018 to the Prince Gustav Channel (see Ch4_specimen_data_table.xlsx for details). Collection, occurrence, and taxonomic data generated 2019-2023.
Licence: CC BY
Embargo: no
Related projects/Funders: NERC INSPIRE DTP
Related publication: Drennan et al.
2024a in prep

____________________________
Ch5_specimen_data_table.xlsx

Single Excel spreadsheet of collection, occurrence and taxonomy data for Aglaophamus cf. trissophyllus specimens collected from various Southern Ocean localities, examined in Chapter 5.

Information about geographic location of data collection: Southern Ocean, Antarctica (see Ch5_specimen_data_table.xlsx for details)
Date of data collection: Specimens collected over various Antarctic expeditions 2004-2018 (see Ch5_specimen_data_table.xlsx for details). Collection, occurrence, and taxonomic data generated 2019-2023.
Licence: CC BY
Embargo: no
Related projects/Funders: NERC INSPIRE DTP
Related publication: Drennan et al. 2024b in prep

____________________________
Ch4_PGC_barcode_sequences.zip

Barcode sequence data for Prince Gustav Channel annelids examined in Chapter 4. Sequences include specimen IDs.

Includes three FASTA files, one for each genetic marker examined, as follows:
Ch4_COI_barcodes.fasta
Ch4_16S_barcodes.fasta
Ch4_18S_barcodes.fasta

Information about geographic location of data collection: Prince Gustav Channel, Northeastern Antarctic Peninsula, Southern Ocean (see Ch4_specimen_data_table.xlsx for details)
Date of data collection: Specimens collected on RRS James Clark Ross expedition JR17003a February–March 2018 to the Prince Gustav Channel (see Ch4_specimen_data_table.xlsx for details). Sequence data generated 2019-2023.
Licence: CC BY
Embargo: Yes. 3 years (28 Feb 2027)
Related projects/Funders: NERC INSPIRE DTP
Related publication: Drennan et al. 2024a in prep
Data will be made publicly available on NCBI following publication of results (Drennan et al. 2024a in prep.)

____________________________
Ch4_Kosterfjord_M_sarsi_sequences.zip

Barcode sequence data for Kosterfjord Maldane sarsi examined in Chapter 4. Sequences include specimen IDs.
Includes three FASTA files, one for each genetic marker examined, as follows:
Ch4_Kosterfjord_M_sarsi_COI.fasta
Ch4_Kosterfjord_M_sarsi_16S.fasta
Ch4_Kosterfjord_M_sarsi_18S.fasta

Includes 15 Maldane sarsi individuals collected from the same site, Kosterfjord, Sweden (lat: 58.6498°, long: 11.0451°, depth 135 m) in 3 years (2017, 2019, 2021) as follows:
KYS01_2017_A
KYS01_2017_B
KYS01_2017_C
KYS01_2019_A
KYS01_2019_B
KYS01_2019_C
KYS01_2019_D
KYS01_2019_E
KYS01_2021_A
KYS01_2021_B
KYS01_2021_C
KYS01_2021_D
KYS01_2021_E
KYS01_2021_F
KYS01_2021_G

Sequence data generated 2022-2023. See Chapter 4 sections 4.2.1 and 4.3.4.7 for additional collection, geographic, and temporal context.
Licence: CC BY
Embargo: Yes. 3 years (28 Feb 2027)
Related projects/Funders: NERC INSPIRE DTP; LinnéSys: Systematics Research Fund
Related publication: Drennan et al. 2024a in prep
Data will be made publicly available on NCBI following publication of results (Drennan et al. 2024a in prep.)

____________________________
Ch5_Aglaophamus_sequences.zip

Barcode sequence data for Aglaophamus cf. trissophyllus examined in Chapter 5. Sequences include specimen IDs.

Includes three FASTA files, one for each genetic marker examined, as follows:
Aglaophamus_cf_trissophyllus_COI.fasta
Aglaophamus_cf_trissophyllus_16S.fasta
Aglaophamus_cf_trissophyllus_18S.fasta

Information about geographic location of data collection: Southern Ocean, Antarctica (see Ch5_specimen_data_table.xlsx for details)
Date of data collection: Specimens collected over various Antarctic expeditions 2004-2018 (see Ch5_specimen_data_table.xlsx for details). Sequence data generated 2021-2023.
Licence: CC BY
Embargo: Yes. 3 years (28 Feb 2027)
Related projects/Funders: NERC INSPIRE DTP
Related publication: Drennan et al. 2024b in prep
Data will be made publicly available on NCBI following publication of results (Drennan et al. 2024b in prep.)
____________________________
Drennan 2024 Doctoral Thesis Chapter 5 dataset - SNP catalogs

Zenodo DOI: 10.5281/zenodo.10606641

Four zipped catalog folders containing the final output of the Stacks “denovo_map.pl” de novo assembly pipeline. Each folder contains two major files: “catalog.fa.gz”, which contains the consensus sequence for each assembled locus in the data, and “catalog.calls”, a custom file that contains genotyping data. These files are intended to be read by the Stacks “populations” program, which can apply appropriate filters, calculate population genetic statistics, and export the data for further analyses, as in Chapter 5.

The four catalog folders are as follows:
All_species_600k_n113_catalog - combined catalog of all individuals across all putative species with >600k reads (113 individuals)
Agla1_Agla2_600k_n93_catalog - combined catalog for both putative species “Agla 1” and “Agla 2” individuals with >600k reads (93 individuals)
Agla1_600k_n73_catalog - catalog of putative species “Agla 1” individuals with >600k reads (73 individuals)
Agla3_600k_n28_catalog - catalog of putative species “Agla 2” individuals with >600k reads (28 individuals)

Information about geographic location of data collection: Southern Ocean, Antarctica (see Ch5_specimen_data_table.xlsx for details)
Date of data collection: Specimens collected over various Antarctic expeditions 2004-2018 (see Ch5_specimen_data_table.xlsx for details). Specimen extractions and sequencing took place 2022-2023.
Licence: CC BY
Data available upon request. Data will be made publicly available on NCBI following publication of results (Drennan et al. 2024b in prep.)
Related projects/Funders: NERC INSPIRE DTP
Related publication: Drennan et al. 2024b in prep

--------------
Date that this file was created: Feb, 2024