Title: Data for 'Safe Reward Learning from Human Preferences and Justifications'

Creator: Ilias Kazantzidis (ik3n19@soton.ac.uk)

Date README_DROPJ-DATA.txt created: September 2025

Data DOI: https://doi.org/10.5258/SOTON/D3749

Subject: Safe Agent Learning, Human Preferences, Justifications, Human-Agent Interaction, Safe Reinforcement Learning, World Models

Funders: UKRI

Recommended citation:
  AUTHORS: Ilias Kazantzidis, Timothy J. Norman, Christopher T. Freeman, Yali Du
  TITLE: Safe Reward Learning from Human Preferences and Justifications
  CONFERENCE: International Conference on Agents and Artificial Intelligence (ICAART 2026)

Access Information: Data available under a CC BY 4.0 licence

Language: English

Dates:
- Data Collected: May-September 2024
- Project Start: September 2023
- Project End: September 2024
- Release Date: N/A

Location: University of Southampton, Highfield Campus, SO17 1BJ

Methodology:
- The data were collected from the authors of the paper.
- Raw user trajectories were recorded with the keyboard via the OpenAI Gym interface (Brockman, G. (2016). "OpenAI Gym". arXiv preprint arXiv:1606.01540) of the Car Racing environment.
- Responses for ReQueST, DROS, DROP, DROPe and DROPJ feedback were collected via newly-created interfaces.
- Both the responses to queries and the response times were recorded.

Data Processing:
- Raw user rollouts can be processed to train a VAE or Dynamics Model, and human feedback can be processed to train a Reward model.
- Instructions can be found in the Appendix of the publication and in its associated codebase.

Attention - large uncompressed size
-----------------------------------
The following archives are small to download, but expand to many gigabytes when extracted:
- raw_user_rollouts_res84_ep600.zip -> ~25 GB after extraction
- raw_user_rollouts_res128_ep600_chuckcobst.zip -> ~59 GB after extraction
- raw_user_rollouts_res128_ep800_chuckccarobst.zip -> ~79 GB after extraction
Instead of extracting them, you can either use the pretrained VAE and MDN-RNN models, or extract fewer trajectories by following STEP 1 in dropj.py or obst_dropj.py.
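The rollout, query-index and response files described below are standard Python pickles, so they can all be loaded the same way. A minimal loading sketch (assumption: each file unpickles to the list or dictionary described in this README; the exact per-item structure is defined in the associated codebase):

```python
import pickle

def load_pickle(path):
    """Load one of the deposited .pkl files (raw rollouts, query index,
    responses, ...). Note: the raw rollout lists are tens of GB once
    unpickled, so make sure enough RAM is available."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Usage sketch (paths relative to the extracted archives):
# rollouts = load_pickle("data/carracing/raw_user_rollouts_res84_ep600.pkl")
# print(len(rollouts))   # number of user trajectories (Dataset R)
# queries = load_pickle("data/carracing/queries_20/data.pkl")
# list(queries)[:3]      # first few *.mp4 keys
```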
List of file names and descriptions:

- /data
  - /carracing  # for Car Racing with no chuckholes or cars
    - raw_user_rollouts_res84_ep600.pkl (extracted from raw_user_rollouts_res84_ep600.zip): list of user real-world trajectories (Dataset R) - 25.5GB
    - /queries_20 (extracted from queries_s20.zip) - 177MB
      - *.mp4 files: video pairs demonstrating preference queries
      - data.pkl: dictionary linking each *.mp4 video to the numpy arrays of the two trajectory segments
    - /responses_20
      - responses_user26_s20_q600_drop.pkl: preference responses in DROP
      - responses_user27_s20_q600_drope.pkl: preference responses in DROPe
      - responses_user27_s20_q600_dropj.pkl: preference and justification responses in DROPJ
    - dream_user_rollouts.pkl: list of user trajectories drawn in the virtual environment (Dataset D)
    - dream_user_rollouts_scaled.pkl: list of user trajectories drawn in the virtual environment (Dataset D), scaled to the appropriate range for reward model training
    - default_init_obs.pkl: default initial observation from the encoded trajectories, used as a starting point for the generation of dream user trajectories
    - default_init_obses_ep600enc_15.pkl: 15 default initial observations from the encoded trajectories, used as starting points for ReQueST trajectory generation
    - request_query_data_guiTrue_rewep200_rewinitFalse_gdit200.pkl: list of trajectory segments generated by ReQueST throughout training
    - request_traj_opt_times_guiTrue_rewep200_rewinitFalse_gdit200.pkl: trajectory optimisation times for ReQueST
    - dros_query_data_rewep200.pkl: list of trajectory segments generated with DROS in 'one-shot', before any human feedback is provided
  - /obstcarracing
    # for Obstacle Car Racing only with chuckholes
    - raw_user_rollouts_res128_ep600_chuckcobst.pkl (extracted from raw_user_rollouts_res128_ep600_chuckcobst.zip): list of user real-world trajectories (Dataset R) - 59.1GB
    - /queries_s20_chuckcobst (extracted from queries_s20_chuckcobst.zip) - 744MB
      - *.mp4 files: video pairs demonstrating preference queries
      - data.pkl: dictionary linking each *.mp4 video to the numpy arrays of the two trajectory segments
    - /responses_20_chuckcobst
      - responses_chuckcobst_user44_mjust.pkl: preference and justification responses in DROPJ with multiple justifications
    - dream_user_rollouts_chuckcobst_scaled.pkl: list of user trajectories drawn in the virtual environment (Dataset D), scaled to the appropriate range for reward model training
    - default_init_obs_chuckcobst.pkl: default initial observation from the encoded trajectories, used as a starting point for the generation of dream user trajectories
    - default_init_obses_chuckcobst.pkl: default initial observations from the encoded trajectories, used as starting points for the generation of dream user trajectories
    # for Obstacle Car Racing with chuckholes and cars
    - raw_user_rollouts_res128_ep800_chuckccarobst.pkl (extracted from raw_user_rollouts_res128_ep800_chuckccarobst.zip): list of user real-world trajectories (Dataset R) - 78.8GB
    - /queries_s20_chuckccarobst (extracted from queries_s20_chuckccarobst.zip) - 1.1GB
      - *.mp4 files: video pairs demonstrating preference queries
      - data.pkl: dictionary linking each *.mp4 video to the numpy arrays of the two trajectory segments
    - /responses_20_chuckccarobst
      - responses_chuckccarobst_user45_mjust.pkl: preferences and justifications for DROPJ with multiple justifications
    - dream_user_rollouts_chuckccarobst_scaled.pkl: list of user trajectories drawn in the virtual environment (Dataset D), scaled to the appropriate range for reward model training
    - default_init_obs_chuckccarobst.pkl: default initial observation from the encoded trajectories, used as a starting point for the generation of dream user trajectories
    - default_init_obses_chuckccarobst.pkl: default initial observations from the encoded trajectories, used as starting points for the generation of dream user trajectories
- /models
  - /carracing  # for Car Racing with no chuckholes or cars
    - enc_user_lat32_ch15L_res84_ep600_epochs100_lrs0_00001.pt: pretrained encoder model (VAE)
    - dyn_gcT_ep600_ch15L_lat32_epochs300_lrs0_0001.pt: pretrained dynamics model (MDN-RNN)
    - /pref_reward_models
      - reward_model__GUI_s20_nojustnoeq_resp.pt: reward model of DROP (nojustnoeq)
      - reward_model__GUI_s20_nojusteq_resp.pt: reward model of DROPe (nojusteq)
      - reward_model__GUI_s20_just_resp.pt: reward model of DROPJ (just)
    - /request_reward_models
      - *.pt: ReQueST reward models for different numbers of queries
    - /dros_reward_models
      - *.pt: DROS reward models for different numbers of queries
  - /obstcarracing
    # for Obstacle Car Racing only with chuckholes
    - enc_user_lat64_ch4res128_res128_ep600_chuckcobst_epochs90_lrs0.0001_kl2.pt: pretrained encoder model (VAE), only with chuckholes
    - dyn_gcT_ep600_chuckcobst_ch4res128_lat64_encep90_enclr0.0001_epochs50_lrs0.0001_rnn1024_mix7.pt: pretrained dynamics model (MDN-RNN), only with chuckholes
    # for Obstacle Car Racing with chuckholes and cars
    - enc_user_lat64_ch4res128_res128_ep800_chuckccarobst_epochs90_lrs0.0001_kl2.pt: pretrained encoder model (VAE), with chuckholes and cars
    - dyn_gcT_ep800_chuckccarobst_ch4res128_lat64_encep90_enclr0.0001_epochs30_lrs0.0001_rnn1024_mix7.pt: pretrained dynamics model (MDN-RNN), with chuckholes and cars
    - /pref_reward_models  # stars (*) denote variable weights for each justification
      - reward_model__GUI_s20_chuckcobst_mjust_wdef*_wgrass*_wchuck*.pt: reward models of DROPJ with multiple justifications, only with chuckholes
      - reward_model__GUI_s20_chuckccarobst_mjust_wdef*_wgrass*_wchuck*_wcar*.pt: reward models of DROPJ with multiple justifications, with chuckholes and cars

File formats: .pkl, .mp4, .pt
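The multiple-justification reward-model file names encode the per-justification weights directly (the wdef*/wgrass*/wchuck*/wcar* tokens, where the stars stand for the actual values). A small helper can recover them when iterating over the model directory; this is an illustrative sketch that assumes each weight appears as a '_w<name><number>' token, which may not match the exact numeric format of every deposited file:

```python
import re

def parse_justification_weights(filename):
    """Extract '_w<name><value>' tokens (e.g. wdef1, wgrass2) from a
    reward-model file name and return a {justification: weight} dict."""
    return {m.group(1): float(m.group(2))
            for m in re.finditer(r"_w([a-z]+)(\d+(?:\.\d+)?)", filename)}

# Hypothetical example (weights 1, 2 and 3 chosen for illustration):
# parse_justification_weights(
#     "reward_model__GUI_s20_chuckcobst_mjust_wdef1_wgrass2_wchuck3.pt")
# -> {'def': 1.0, 'grass': 2.0, 'chuck': 3.0}
```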