READ ME File For 'SecureChains: Non-Expert Risk Assessment Results' Dataset

DOI: https://doi.org/10.5258/SOTON/D3350
Date that the file was created: January 2025

-------------------
GENERAL INFORMATION
-------------------

ReadMe Author: Betul GOKKAYA, University of Southampton [ORCID ID: 0009-0009-3632-9768]
Date of data collection: August 2023
Information about geographic location of data collection: the United Kingdom

--------------------------
SHARING/ACCESS INFORMATION
--------------------------

Licenses/restrictions placed on the data, or limitations of reuse: Creative Commons Attribution CC-BY

--------------------
DATA & FILE OVERVIEW
--------------------

Securechains_nonexpert_data.zip includes two CSV files:

Final Risk Score Comparison: This file contains the final risk scores derived from non-expert user responses alongside the SecureChains score, allowing a direct comparison of the overall risk assessments.

Detailed Question-Answer Analysis: This file contains the complete set of non-expert user responses, mapped to their corresponding questions. It also includes the specific vulnerability types, which show the reasoning behind each risk assessment without aggregating the scores into a final threat score.

--------------------------
METHODOLOGICAL INFORMATION
--------------------------

Description of Methods Used for Collection/Generation of Data:
The data were collected from two participant groups: cybersecurity professionals and cybersecurity PhD students who have professional experience. Participants were contacted by email and invited to take part in the study. A docx file was created containing scenarios designed to simulate various cybersecurity threats and vulnerabilities. The document included multiple-choice questions, which participants answered based on their analysis of each scenario.
Additionally, a web-based software tool developed for this study is provided. The tool contained all the questions needed to conduct a comprehensive risk assessment. Ethical approval for this data collection was obtained under the protocol ERGO 84485.

Methods for Processing the Data:
The raw data collected from participants were processed in Python. The following steps were involved:
Data Cleaning: Removing incomplete responses to ensure data quality.
Analysis: Calculating key statistical metrics, including:
    Standard Deviation
    Standard Error
    Agreement Levels for user responses to each risk result
The processed data were then analyzed to identify patterns in the participants' risk assessments. The following Python libraries were used:
    Pandas (for data cleaning and organization)
    NumPy (for statistical calculations)
    Matplotlib/Seaborn (for data visualization)

Standards and Calibration Information:
The study adhered to established standards for cybersecurity risk assessment. The scenarios and questions were designed to align with recognized frameworks, ensuring relevance and accuracy in assessing vulnerabilities and threats.

Environmental/Experimental Conditions:
The study was conducted in a virtual environment, where participants completed the assessment independently through the web-based software or the docx document. No external environmental variables were introduced, ensuring consistency in responses. To ensure data reliability:
    Responses were validated for completeness and logical consistency.
    Statistical checks were performed to identify outliers or anomalies in the data.
    Aggregated results were cross-verified against expected trends in cybersecurity risk assessment scenarios.

People Involved:
Betul Gokkaya was responsible for collecting all the data from participants and for ensuring that ethical guidelines were adhered to throughout the study.
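As an illustration of the metrics listed under Methods for Processing the Data, the following is a minimal sketch and not the study's actual analysis code. It uses only the Python standard library so that it runs without dependencies, whereas the original analysis used Pandas and NumPy. The function name summarize_scores and the example scores are hypothetical. The definition of agreement used here (the share of participants whose score equals the SecureChains score) is an assumption, since this README does not define how agreement levels were calculated.

```python
import math
import statistics


def summarize_scores(participant_scores, reference_score):
    """Return count, mean, sample SD, standard error, and agreement share."""
    n = len(participant_scores)
    mean = statistics.mean(participant_scores)
    # Sample standard deviation (n - 1 in the denominator).
    sd = statistics.stdev(participant_scores) if n > 1 else 0.0
    # Standard error of the mean: SD divided by the square root of n.
    se = sd / math.sqrt(n)
    # ASSUMPTION: agreement is the share of participants whose score
    # equals the reference (SecureChains) score.
    agreement = sum(s == reference_score for s in participant_scores) / n
    return {"n": n, "mean": mean, "sd": sd, "se": se, "agreement": agreement}


# Hypothetical scores for one risk; these values are not taken from the dataset.
example = summarize_scores([3, 4, 4, 5], reference_score=4)
print(example)
```

Other definitions of agreement (for example, scores within one point of the reference) would only require changing the comparison inside the sum.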
----------------------------------------------------
DATA-SPECIFIC INFORMATION: final_risks_all_user.csv
----------------------------------------------------

Number of Variables: 6 variables: User Name, Question Group, Risk, Risk Score, Impact Score, Event Likelihood.

Number of Cases/Rows: The dataset contains 222 rows, representing individual responses from participants or scenarios.

Variable List (Defining Abbreviations, Units of Measure, Codes, or Symbols Used):
User Name: Identifier for participants or systems (e.g., "Securechains", "participant_1"). No specific abbreviations used. Format: Text/String.
Question Group: The category or type of question (e.g., "IoT data disclosure", "Counterfeit components"). No specific abbreviations used. Format: Text/String.
Risk: Describes the severity of the risk. Values: "Low", "Medium", "High", "Very High". Format: Ordinal categorical variable.
Risk Score: Quantitative measure of the risk (e.g., 2, 3, 4, 5). Units: Whole numbers, no decimals. Range: 1 (lowest) to 5 (highest).
Impact Score: Score representing the impact of the risk (e.g., 3, 4, 5). Units: Whole numbers, no decimals. Range: 1 (lowest) to 5 (highest).
Event Likelihood: Score representing the likelihood of the event occurring (e.g., 3, 4, 5). Units: Whole numbers, no decimals. Range: 1 (least likely) to 5 (most likely).

Missing Data Codes:
NA: Not Available or missing information.
-1: System error or invalid entry.

Specialized Formats or Other Abbreviations Used:
Risk Levels (Low, Medium, High, Very High): Represented as text values and correspond to ordinal ranks based on severity.
Scores (Risk, Impact, Likelihood): Numeric values for statistical analysis, using a consistent scale (1-5).
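The missing data codes above need to be handled before the scores are used numerically. The following is a minimal sketch of one way to read this file with the Python standard library, under stated assumptions: the CSV header row is assumed to match the variable names listed above exactly, empty cells are additionally treated as missing, and the helper names parse_score and load_final_risks are hypothetical. The two sample rows are invented for illustration and are not taken from the dataset.

```python
import csv
import io

# ASSUMPTION: the CSV header uses these exact column names.
SCORE_COLUMNS = ("Risk Score", "Impact Score", "Event Likelihood")
# "NA" and "-1" are the documented missing data codes; the empty string
# is an extra assumption for blank cells.
MISSING_CODES = {"NA", "-1", ""}


def parse_score(raw):
    """Convert a raw cell to an int in 1..5, or None if missing or out of range."""
    raw = raw.strip()
    if raw in MISSING_CODES:
        return None
    value = int(raw)
    return value if 1 <= value <= 5 else None


def load_final_risks(handle):
    """Read rows from an open CSV handle and convert the three score columns."""
    rows = []
    for record in csv.DictReader(handle):
        for column in SCORE_COLUMNS:
            record[column] = parse_score(record[column])
        rows.append(record)
    return rows


# Hypothetical two-row sample standing in for final_risks_all_user.csv.
sample = io.StringIO(
    "User Name,Question Group,Risk,Risk Score,Impact Score,Event Likelihood\n"
    "Securechains,IoT data disclosure,High,4,4,3\n"
    "participant_1,IoT data disclosure,Medium,3,NA,-1\n"
)
rows = load_final_risks(sample)
print(rows)
```

To read the real file, the same function can be given a handle from open("final_risks_all_user.csv", newline="") in place of the in-memory sample.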
----------------------------------------------------
DATA-SPECIFIC INFORMATION: user_answers_w_securechains.csv
----------------------------------------------------

Number of Variables: 6 variables: QuestionCode, User_Name, User_Answer, question_group, value_type, cumulative_sum_value.

Number of Cases/Rows: The dataset contains 446 rows, representing individual participant responses to specific questions, along with calculated cumulative scores.

Variable List (Defining Abbreviations, Units of Measure, Codes, or Symbols Used):
QuestionCode: The text of the question posed to participants (e.g., "How likely is it that a cyber actor would exploit IoT devices in your supply chain..."). Format: Text/String.
User_Name: Identifier for the participant or system (e.g., "Securechains", "participant_1"). Format: Text/String.
User_Answer: Participant's response to the question, indicating their risk assessment or perception. Possible values: "Low", "Medium", "High", "Very High". Format: Ordinal categorical variable.
question_group: Categorization of the question into groups (e.g., IoT_group_1, external_user_risk_group1). Represents the type or domain of the risk being assessed. Format: Text/String.
value_type: Type of value being assessed for the question. Format: Text/String. Values:
    THREAT: Likelihood or possibility of the event occurring.
    IMPACT: Potential consequence or damage if the event occurs.
cumulative_sum_value: Calculated cumulative score representing the overall risk or importance derived from participants' responses. Units: Whole numbers (e.g., 3, 4, 5). Range: [Insert the minimum and maximum observed values].

Missing Data Codes:
NA: Indicates missing data or unanswered questions.
-1: System error or invalid entry during data capture.

Specialized Formats or Other Abbreviations Used:
User_Answer Values (Low, Medium, High, Very High): Ordinal categories representing increasing levels of risk or impact.
value_type (THREAT, IMPACT): Classifies the focus of each question.
question_group: Groups questions into specific areas of focus, such as IoT-related risks, external user risks, or hardware vulnerabilities.
cumulative_sum_value: Numeric values used for aggregated risk analysis.
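Because User_Answer is ordinal, one possible way to summarize the answers is to map them to ranks and aggregate per question_group and value_type. The following is a minimal sketch under stated assumptions. The ranks 1 to 4 express order only and are not the numeric scale used by SecureChains, which this README does not specify. The function name median_rank_by_group and all sample records are hypothetical.

```python
from collections import defaultdict
from statistics import median

# ASSUMPTION: ranks encode order only, not the dataset's own scoring scale.
ANSWER_RANK = {"Low": 1, "Medium": 2, "High": 3, "Very High": 4}


def median_rank_by_group(records):
    """Median answer rank per (question_group, value_type), skipping missing answers."""
    buckets = defaultdict(list)
    for rec in records:
        rank = ANSWER_RANK.get(rec["User_Answer"])
        if rank is None:  # "NA", "-1", or any unrecognized value
            continue
        buckets[(rec["question_group"], rec["value_type"])].append(rank)
    return {key: median(ranks) for key, ranks in buckets.items()}


# Hypothetical records; these values are not taken from the dataset.
records = [
    {"User_Name": "participant_1", "User_Answer": "High",
     "question_group": "IoT_group_1", "value_type": "THREAT"},
    {"User_Name": "participant_2", "User_Answer": "Very High",
     "question_group": "IoT_group_1", "value_type": "THREAT"},
    {"User_Name": "participant_3", "User_Answer": "NA",
     "question_group": "IoT_group_1", "value_type": "THREAT"},
    {"User_Name": "participant_1", "User_Answer": "Medium",
     "question_group": "IoT_group_1", "value_type": "IMPACT"},
]
result = median_rank_by_group(records)
print(result)
```

The median is used here rather than the mean because it depends only on the order of the categories, which is all that the ordinal coding guarantees.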