READ ME File For 'Skills for the Curation of Sensitive Data'

Dataset DOI: 10.5258/SOTON/D3811

Date that the file was created: January 2026

-------------------
GENERAL INFORMATION
-------------------

ReadMe Author: BENJAMIN THOMAS, University of Southampton
ORCID ID: 0000-0001-5240-7521

Date of data collection: MARCH-DECEMBER 2025

Information about geographic location of data collection: UK

--------------------------
SHARING/ACCESS INFORMATION
--------------------------

Licenses/restrictions placed on the data, or limitations of reuse: CREATIVE COMMONS CC-BY

Recommended citation for the data:

This dataset supports the publication:
AUTHORS: Thomas, B., Murray, H., Guignard-Duff, M., Hettrick, S., Broadbent, P.
TITLE: Skills for the Curation of Sensitive Data
JOURNAL: N/A
PAPER DOI IF KNOWN:

Links to other publicly accessible locations of the data: N/A

Links/relationships to ancillary or related data sets: N/A

--------------------
DATA & FILE OVERVIEW
--------------------

This dataset contains:

Sensitive_Data_Curation_Skills_Survey_Data_-_Anonymised_-_Skills_for_the_curation_of_sensitive_data.csv - the data from the Qualtrics survey (anonymised)

Topic_Guide_Key_Informants_-_clean_for_report.pdf - a list of questions for the initial 'key informant' interviews that helped shape the survey

Survey_for_Report.pdf - the survey questions exported from Qualtrics

Follow_up_Focus_Group_Topic_Guides_-_Clean_for_Report.pdf - a list of questions for the follow-up focus groups that explored themes from the survey

Data_Curation_Skills_for_Sensitive_Data-codes.xlsx - a list of thematic codes for the qualitative data, exported from Atlas.ti

Relationship between files, if important for context: All files support the collection and analysis of data related to the study

Additional related data collected that was not included in the current data package: The code for the analysis of quantitative data can be found using the DOI 10.5281/zenodo.18378100

*The data set does not contain the
transcripts of the qualitative interviews and focus groups. This is due to the high level of contextual information provided, which made anonymisation impossible.

--------------------------
METHODOLOGICAL INFORMATION
--------------------------

Our research was designed to capture perspectives on skills and training needs for the curation of sensitive data. This qualitative study was conducted over the course of 2025 and involved seven key informant interviews with people highly experienced in the curation of sensitive data in a range of contexts, a survey aimed at those in the UK who curate sensitive data, and three follow-up focus groups that explored the themes of skill gaps and training access/barriers in more depth.

First, we (the research team) drew on our networks to set up interviews with seven key informants, targeting a range of different settings so that we could understand the nuances of the UK sensitive data curation landscape. We asked them to define sensitive data and the types of data they work with, explain the division of labour in their data curation processes, list the skills needed for the curation of sensitive data, discuss the skill gaps that exist and the reasons for these gaps, and outline what training is available and what barriers there are to staff taking up this training. We also asked about how things might change in the future and about career development for those who curate data.

These findings were analysed by the research team and led to the development of an anonymous online survey designed to capture the broad range of experiences of curators of sensitive data. The survey was disseminated through HDR UK networks and ran over the summer, garnering 92 partial responses and 87 full responses.

Three focus groups ran in November to further probe themes from the survey, namely the prioritisation of skills to deliver, and the form our training should take.
Ethics

Approval for the study was granted by the University of Southampton’s ethics committee (ERGO number: 100680). Prior to interviews and focus groups, potential participants were given an information sheet detailing the aims of the research, the interview process (such as audio/video recording and transcription), confidentiality and anonymisation processes, and data storage and archiving. Participants completed an email consent process. Interviews and focus groups were transcribed, and transcripts pseudonymised and retained for archiving in a University of Southampton data repository.

Recruitment and sample

The research team identified a long list of potential ‘key informants’ who would be able to provide deep insight into the diverse sensitive data curation landscape. We then shortlisted candidates to select a spread of experience, and invited those shortlisted to an online interview via email. We conducted seven interviews. Table x shows selected attributes.

The survey was distributed via HDR UK networks including, for example, the UK TRE Community, Secure Access Data Professionals (SDAP) and SDE and Scottish Safe Haven networks. We intended to recruit as many curators of sensitive data as possible, with the networks well placed to reach people working with sensitive data across the four nations. We achieved 92 partial and 87 full responses to the survey. There is no data on the size of the workforce that curates sensitive data, so we are unable to comment on the significance of this sample. We did not collect any demographic data, but some statistics on career stage, organisation type and responsibilities feature in the report.

Focus groups were recruited via an expression of interest form, distributed through relevant networks. This received 33 responses, and we selected participants based on career stage and a diversity of organisational contexts.

Key informant interviews

The interviews were designed to last between 45 minutes and 1 hour, and were conducted online following a semi-structured topic guide. We asked participants about details of their work in the curation of sensitive data in order to contextualise their responses. We then asked for definitions of data curation and sensitive data, the skills needed for data curation work, any skill gaps that exist for them or their team, the reasons for any gaps, and the training that is (and is not) available. We asked about challenges to effective data curation and, finally, any future skills requirements that are likely to arise.

Focus groups

We ran three focus groups, which were designed to last for around 2 hours and were also conducted online. The first two were with a range of staff and public users of TREs, with one focused on prioritising skill gaps and what requires training, and the other focused on access and barriers to training. The final focus group was with more senior or experienced staff, who we asked about both topic areas. The topic guide for focus groups is in Appendix B.

Participants were invited to answer individually, and then given the opportunity to discuss each other’s responses. While we purposefully constrained the scope of the focus groups through the questions we posed, we did not try to force a consensus position. Rather, we sought data that provided context and depth to the themes from the survey.

Due to the small number of participants and the high level of contextual information, we are not able to effectively anonymise the transcripts. As such, we are not making these available as we have done with the survey data.

Data analysis

After transcription of the audio recordings, interviews and focus groups were coded in Atlas.ti, qualitative data analysis (QDA) software. We first coded to broad thematic and descriptive codes reflecting the interview and focus group themes.
These included information on participants and their contexts, skills, skill gaps, reasons for skill gaps, access and barriers to training, and future challenges. Later, more fine-grained coding captured the themes that emerged relating to skills, skill gaps, reasons for gaps, and training. These codes enabled us to analyse and understand the existing skills and training landscape, and allowed us to prioritise the skills we are going to develop training for. Appendix E contains the list of thematic codes.

Quantitative survey results were downloaded as a CSV file and read into R. For each item, rows with missing values were first removed. Then the number of instances of each response was counted and the percentage of respondents who gave the response was calculated and plotted. Tidyverse functions were used throughout data cleaning and analysis. The analysis code is stored in a private GitHub repository within the Southampton-RSG organisation.
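
Illustrative sketch of the quantitative steps

The per-item steps described above (remove rows with missing values, count each response, convert counts to percentages of respondents) can be sketched as follows. This is not the study's analysis code: the study used R with tidyverse functions, whereas this sketch uses Python's standard library. The column names "career_stage" and "org_type" and the toy rows are hypothetical and do not come from the dataset, and plotting is omitted.

```python
import csv
import io
from collections import Counter

# Toy rows for illustration only. These are NOT values from the dataset,
# and the column names "career_stage" and "org_type" are hypothetical.
SAMPLE_CSV = """career_stage,org_type
Early,University
Senior,
Early,NHS
,University
Mid,University
"""


def summarise_item(rows, item):
    """Return {response: (count, percent)} for one survey item.

    Rows where the item is missing (empty or whitespace only) are dropped
    first, so percentages are relative to respondents who answered the item.
    """
    answers = [row[item].strip() for row in rows if (row.get(item) or "").strip()]
    total = len(answers)
    counts = Counter(answers)
    return {resp: (n, round(100 * n / total, 1)) for resp, n in counts.items()}


def summarise_all(csv_text):
    """Summarise every column of a CSV supplied as text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    return {item: summarise_item(rows, item) for item in reader.fieldnames}


if __name__ == "__main__":
    for item, summary in summarise_all(SAMPLE_CSV).items():
        print(item)
        for response, (count, percent) in sorted(summary.items()):
            print(f"  {response}: {count} ({percent}%)")
```

Because missing values are removed separately for each item, the denominator can differ between items. That matters for a survey with both partial and full responses.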