Title: Data for 'Safe Reward Learning from Human Preferences and Justifications'

Creator: Ilias Kazantzidis (ik3n19@soton.ac.uk)

Date README_DROPJ-DATA.txt created: September 2025

Data DOI: https://doi.org/10.5258/SOTON/D3749

Subject: Safe Agent Learning, Human Preferences, Justifications, Human-Agent Interaction, Safe Reinforcement Learning, World Models

Funders: UKRI

Recommended citation:
  AUTHORS: Ilias Kazantzidis, Timothy J. Norman, Christopher T. Freeman, Yali Du
  TITLE: Safe Reward Learning from Human Preferences and Justifications
  CONFERENCE: International Conference on Agents and Artificial Intelligence (ICAART 2026)

Access Information: Data available under a CC BY 4.0 licence

Language: English

Dates:
- Data Collected: May-September 2024
- Project Start: September 2023
- Project End: September 2024
- Release Date: N/A

Location: University of Southampton, Highfield Campus, SO17 1BJ

Methodology:
- The data were collected from the authors of the paper.
- Raw user trajectories were recorded with the keyboard via the OpenAI Gym interface (Brockman, G. (2016). "OpenAI Gym". arXiv preprint arXiv:1606.01540) of the Car Racing environment.
- Responses for ReQueST, DROS, DROP, DROPe and DROPJ feedback were collected via newly-created interfaces.
- Both the responses to queries and the response times were recorded.

Data Processing:
- Raw user rollouts can be processed to train a VAE or Dynamics Model, and human feedback can be processed to train a Reward model.
- Instructions can be found in the Appendix of the publication and in its associated codebase.

Attention - large uncompressed size
-----------------------------------
The following archives are small to download, but expand to many gigabytes when extracted:
- raw_user_rollouts_res84_ep600.zip -> ~25 GB after extraction
- raw_user_rollouts_res128_ep600_chuckcobst.zip -> ~59 GB after extraction
- raw_user_rollouts_res128_ep800_chuckccarobst.zip -> ~79 GB after extraction
Instead of extracting them, you can either use the pretrained VAE and MDN-RNN models, or extract fewer trajectories by following STEP 1 in dropj.py or obst_dropj.py.
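The rollout, query-index and response files described below are standard Python pickles, so they can all be loaded the same way. A minimal loading sketch (assumption: each file unpickles to the list or dictionary described in this README; the exact per-item structure is defined in the associated codebase):

```python
import pickle

def load_pickle(path):
    """Load one of the deposited .pkl files (raw rollouts, query index,
    responses, ...). Note: the raw rollout lists are tens of GB once
    unpickled, so make sure enough RAM is available."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Usage sketch (paths relative to the extracted archives):
# rollouts = load_pickle("data/carracing/raw_user_rollouts_res84_ep600.pkl")
# print(len(rollouts))   # number of user trajectories (Dataset R)
# queries = load_pickle("data/carracing/queries_20/data.pkl")
# list(queries)[:3]      # first few *.mp4 keys
```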
List of file names and descriptions:

- /data
  - /carracing  # for Car Racing with no chuckholes or cars
    - raw_user_rollouts_res84_ep600.pkl (extracted from raw_user_rollouts_res84_ep600.zip): list of user real-world trajectories (Dataset R) - 25.5GB
    - /queries_20 (extracted from queries_s20.zip) - 177MB
      - *.mp4 files: video pairs demonstrating preference queries
      - data.pkl: dictionary linking each *.mp4 video to the numpy arrays of the two trajectory segments
    - /responses_20
      - responses_user26_s20_q600_drop.pkl: preference responses in DROP
      - responses_user27_s20_q600_drope.pkl: preference responses in DROPe
      - responses_user27_s20_q600_dropj.pkl: preference and justification responses in DROPJ
    - dream_user_rollouts.pkl: list of user trajectories drawn in the virtual environment (Dataset D)
    - dream_user_rollouts_scaled.pkl: list of user trajectories drawn in the virtual environment (Dataset D), scaled to the appropriate range for reward model training
    - default_init_obs.pkl: default initial observation from the encoded trajectories, used as a starting point for the generation of dream user trajectories
    - default_init_obses_ep600enc_15.pkl: 15 default initial observations from the encoded trajectories, used as starting points for ReQueST trajectory generation
    - request_query_data_guiTrue_rewep200_rewinitFalse_gdit200.pkl: list of trajectory segments generated by ReQueST throughout training
    - request_traj_opt_times_guiTrue_rewep200_rewinitFalse_gdit200.pkl: trajectory optimisation times for ReQueST
    - dros_query_data_rewep200.pkl: list of trajectory segments generated with DROS in 'one-shot', before any human feedback is provided
  - /obstcarracing
    # for Obstacle Car Racing only with chuckholes
    - raw_user_rollouts_res128_ep600_chuckcobst.pkl (extracted from raw_user_rollouts_res128_ep600_chuckcobst.zip): list of user real-world trajectories (Dataset R) - 59.1GB
    - /queries_s20_chuckcobst (extracted from queries_s20_chuckcobst.zip) - 744MB
      - *.mp4 files: video pairs demonstrating preference queries
      - data.pkl: dictionary linking each *.mp4 video to the numpy arrays of the two trajectory segments
    - /responses_20_chuckcobst
      - responses_chuckcobst_user44_mjust.pkl: preference and justification responses in DROPJ with multiple justifications
    - dream_user_rollouts_chuckcobst_scaled.pkl: list of user trajectories drawn in the virtual environment (Dataset D), scaled to the appropriate range for reward model training
    - default_init_obs_chuckcobst.pkl: default initial observation from the encoded trajectories, used as a starting point for the generation of dream user trajectories
    - default_init_obses_chuckcobst.pkl: default initial observations from the encoded trajectories, used as starting points for the generation of dream user trajectories
    # for Obstacle Car Racing with chuckholes and cars
    - raw_user_rollouts_res128_ep800_chuckccarobst.pkl (extracted from raw_user_rollouts_res128_ep800_chuckccarobst.zip): list of user real-world trajectories (Dataset R) - 78.8GB
    - /queries_s20_chuckccarobst (extracted from queries_s20_chuckccarobst.zip) - 1.1GB
      - *.mp4 files: video pairs demonstrating preference queries
      - data.pkl: dictionary linking each *.mp4 video to the numpy arrays of the two trajectory segments
    - /responses_20_chuckccarobst
      - responses_chuckccarobst_user45_mjust.pkl: preferences and justifications for DROPJ with multiple justifications
    - dream_user_rollouts_chuckccarobst_scaled.pkl: list of user trajectories drawn in the virtual environment (Dataset D), scaled to the appropriate range for reward model training
    - default_init_obs_chuckccarobst.pkl: default initial observation from the encoded trajectories, used as a starting point for the generation of dream user trajectories
    - default_init_obses_chuckccarobst.pkl: default initial observations from the encoded trajectories, used as starting points for the generation of dream user trajectories
- /models
  - /carracing  # for Car Racing with no chuckholes or cars
    - enc_user_lat32_ch15L_res84_ep600_epochs100_lrs0_00001.pt: pretrained encoder model (VAE)
    - dyn_gcT_ep600_ch15L_lat32_epochs300_lrs0_0001.pt: pretrained dynamics model (MDN-RNN)
    - /pref_reward_models
      - reward_model__GUI_s20_nojustnoeq_resp.pt: reward model of DROP (nojustnoeq)
      - reward_model__GUI_s20_nojusteq_resp.pt: reward model of DROPe (nojusteq)
      - reward_model__GUI_s20_just_resp.pt: reward model of DROPJ (just)
    - /request_reward_models
      - *.pt: ReQueST reward models for different numbers of queries
    - /dros_reward_models
      - *.pt: DROS reward models for different numbers of queries
  - /obstcarracing
    # for Obstacle Car Racing only with chuckholes
    - enc_user_lat64_ch4res128_res128_ep600_chuckcobst_epochs90_lrs0.0001_kl2.pt: pretrained encoder model (VAE), only with chuckholes
    - dyn_gcT_ep600_chuckcobst_ch4res128_lat64_encep90_enclr0.0001_epochs50_lrs0.0001_rnn1024_mix7.pt: pretrained dynamics model (MDN-RNN), only with chuckholes
    # for Obstacle Car Racing with chuckholes and cars
    - enc_user_lat64_ch4res128_res128_ep800_chuckccarobst_epochs90_lrs0.0001_kl2.pt: pretrained encoder model (VAE), with chuckholes and cars
    - dyn_gcT_ep800_chuckccarobst_ch4res128_lat64_encep90_enclr0.0001_epochs30_lrs0.0001_rnn1024_mix7.pt: pretrained dynamics model (MDN-RNN), with chuckholes and cars
    - /pref_reward_models  # stars (*) denote variable weights for each justification
      - reward_model__GUI_s20_chuckcobst_mjust_wdef*_wgrass*_wchuck*.pt: reward models of DROPJ with multiple justifications, only with chuckholes
      - reward_model__GUI_s20_chuckccarobst_mjust_wdef*_wgrass*_wchuck*_wcar*.pt: reward models of DROPJ with multiple justifications, with chuckholes and cars

File formats: .pkl, .mp4, .pt
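The multiple-justification reward-model file names encode the per-justification weights directly (the wdef*/wgrass*/wchuck*/wcar* tokens, where the stars stand for the actual values). A small helper can recover them when iterating over the model directory; this is an illustrative sketch that assumes each weight appears as a '_w<name><number>' token, which may not match the exact numeric format of every deposited file:

```python
import re

def parse_justification_weights(filename):
    """Extract '_w<name><value>' tokens (e.g. wdef1, wgrass2) from a
    reward-model file name and return a {justification: weight} dict."""
    return {m.group(1): float(m.group(2))
            for m in re.finditer(r"_w([a-z]+)(\d+(?:\.\d+)?)", filename)}

# Hypothetical example (weights 1, 2 and 3 chosen for illustration):
# parse_justification_weights(
#     "reward_model__GUI_s20_chuckcobst_mjust_wdef1_wgrass2_wchuck3.pt")
# -> {'def': 1.0, 'grass': 2.0, 'chuck': 3.0}
```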