**************************
* datasets - readme file *
**************************

* INTRODUCTION *

The crowdsourcing datasets I use in my PhD thesis, "On the efficiency of data collection and aggregation for the combination of multiple classifiers", are publicly available. However, for the sake of reproducibility and ease of use, I am uploading a copy of all of them except the Galaxy Zoo 2 dataset (see the GALAXY ZOO section below). Note: the format of these datasets has been changed to match the input format of binary_sims.exe (see the source code in this same repository). Details are given in the FORMAT section below. This package contains the following five datasets:
- rte: from Snow et al. 2008
- temp: from Snow et al. 2008
- trec: from Lease et al. 2011
- bluebirds: from Welinder et al. 2010
- ducks: from Welinder et al. 2010
See Section 5.4.3 of my thesis for further details on the datasets.

* FORMAT *

Each dataset is given in comma-separated (CSV) format. Each line represents a separate datapoint with the following comma-separated entries:
- worker id: anonymised, contiguous non-negative integers starting from 0
- task id: contiguous non-negative integers starting from 0
- worker label: binary answer, either 0 or 1
- gold label: binary ground truth, either 0 or 1
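
As an illustration of the format described above, here is a minimal Python sketch that parses such a file and computes how often each worker agrees with the gold label. It is not part of binary_sims.exe. It assumes that the files have no header row and that the four entries appear in the order listed above. The names Datapoint, read_datapoints and worker_accuracy are hypothetical, and the sample rows are made up for the example.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, NamedTuple, TextIO


class Datapoint(NamedTuple):
    worker_id: int
    task_id: int
    worker_label: int
    gold_label: int


def read_datapoints(stream: TextIO) -> List[Datapoint]:
    """Parse lines of 'worker id,task id,worker label,gold label'.

    Assumption: no header row, and all four entries are integers.
    """
    datapoints = []
    for row in csv.reader(stream):
        if not row:  # skip blank lines
            continue
        worker_id, task_id, worker_label, gold_label = (int(x) for x in row)
        if worker_label not in (0, 1) or gold_label not in (0, 1):
            raise ValueError(f"non-binary label in row {row}")
        datapoints.append(Datapoint(worker_id, task_id, worker_label, gold_label))
    return datapoints


def worker_accuracy(datapoints: List[Datapoint]) -> Dict[int, float]:
    """Fraction of each worker's labels that match the gold label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for d in datapoints:
        total[d.worker_id] += 1
        correct[d.worker_id] += int(d.worker_label == d.gold_label)
    return {worker: correct[worker] / total[worker] for worker in total}


if __name__ == "__main__":
    # Made-up sample rows, not taken from any of the datasets.
    sample = io.StringIO("0,0,1,1\n0,1,0,0\n1,0,0,1\n1,1,0,0\n")
    points = read_datapoints(sample)
    print(worker_accuracy(points))
```

To read one of the actual files, pass an open file object instead of the in-memory sample, e.g. read_datapoints(open(path, newline="")), where path points to a dataset file in this package.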

* GALAXY ZOO *

The Galaxy Zoo 2 dataset that is publicly available does not contain the fine-grained details of the crowdsourcing process. In order to plot Figure 1.1 we had to contact the authors of Willett et al. 2013, get access to the full anonymised dataset (7.5 GB), and extract the figures we needed from the raw data. We encourage you to go through the same process.
