*****************************
* binary_sims - readme file *
*****************************

* INTRODUCTION *

The binary_sims package contains the source code for running the experiments of my PhD thesis "On the efficiency of data collection and aggregation for the combination of multiple classifiers". This readme file explains how the code is structured and how to use it.

* CROWD.H/CROWD.CPP *

The classes in these files generate a (simulated) crowdsourcing dataset. The interface is designed to let you read one new datapoint at a time. Old datapoints are always available, but do not try to look into the future beyond t+1 unless you know what you are doing. The available classes are:
- crowd_budget: generates the whole dataset at t=0 (and shuffles it).
- crowd_quota: adds workers one by one (sequential worker availability).
- crowd_workers: polls the fixed pool of workers for more data in a round-robin fashion.
- crowd_file: reads the data from a file (use this for experiments on real data).
All classes except crowd_file also provide the individual accuracy (Bernoulli parameter) of each worker.
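The one-datapoint-at-a-time access pattern described above can be sketched as follows. All names here (datapoint, crowd_stream, at, advance) are illustrative placeholders, not the package's actual interface:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

struct datapoint { int task; int worker; int label; };

// Minimal sketch of the access pattern: old datapoints stay available,
// but reads past the current time step are disallowed.
class crowd_stream {
    std::vector<datapoint> data;
    std::size_t now = 0;
public:
    explicit crowd_stream(std::vector<datapoint> d) : data(std::move(d)) {}
    const datapoint& at(std::size_t t) const {
        assert(t <= now);  // no peeking into the future
        return data[t];
    }
    bool advance() { return ++now < data.size(); }  // reveal one new datapoint
};
```

A consumer repeatedly calls advance() and re-runs its (online) inference on the newly revealed datapoint, while remaining free to revisit any earlier one.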

* DISTRO.H/DISTRO.CPP *

The classes in these files generate random numbers from several useful probability distributions. A method to correctly initialise the standard Mersenne Twister generator is also provided.
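For reference, correct initialisation matters because std::mt19937 has 624 words of internal state, so a single 32-bit seed covers only a tiny fraction of the state space. A standard way to seed it fully (the package's own method may differ) is:

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <functional>
#include <random>

// Fully seed a std::mt19937 by filling a seed_seq with as many words of
// entropy as the generator's state holds.
std::mt19937 make_seeded_mt19937() {
    std::array<std::uint32_t, std::mt19937::state_size> entropy;
    std::random_device rd;
    std::generate(entropy.begin(), entropy.end(), std::ref(rd));
    std::seed_seq seq(entropy.begin(), entropy.end());
    return std::mt19937(seq);
}
```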

* INFER.H/INFER.CPP *

Inference algorithms in their offline (infer_full) and online (infer_update) forms. They take as input a random number generator, the crowd dataset and the current time step (new label). The available algorithms are listed below (see my thesis for further explanation).
- infer_weighted: weighted majority voting. Accesses the true worker accuracy in the crowd dataset.
- infer_golden: gold standard plug-in aggregator. Estimates each new worker's accuracy by running preliminary trials.
- infer_majority: majority voting.
- infer_acyclic: Fast SBIC on the natural ordering of the crowd dataset.
- infer_delayed: vanilla implementation of Sorted SBIC. For offline use, consider infer_quick instead.
- infer_quick: log-time implementation of Sorted SBIC. DO NOT USE IN ONLINE MODE.
- infer_variational: approximate mean-field variational Bayes. See Liu et al. 2012 for details. Set alpha=beta=0 for traditional EM.
- infer_eigen: belief propagation/matrix factorisation. See Karger et al. 2014 for details.
- infer_triangle: triangular estimation. See Bonald et al. 2014 for details.
- infer_particle: Mirror Gibbs particle filter.
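As a concrete illustration, the core computation behind weighted majority voting (infer_weighted) is the log-odds-weighted vote, which is Bayes-optimal for independent workers with known accuracies. The function below is a self-contained sketch, not the package's actual signature:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Weighted majority vote: each worker's label (+1 or -1) is weighted by
// the log-odds log(p_i / (1 - p_i)) of that worker's accuracy p_i, so a
// single highly accurate worker can outvote several near-random ones.
int weighted_majority(const std::vector<int>& labels,       // +1 or -1
                      const std::vector<double>& accuracy)  // p_i in (0,1)
{
    double score = 0.0;
    for (std::size_t i = 0; i < labels.size(); ++i)
        score += labels[i] * std::log(accuracy[i] / (1.0 - accuracy[i]));
    return score >= 0.0 ? +1 : -1;
}
```

For example, a worker with accuracy 0.9 voting +1 outweighs two workers with accuracy 0.6 voting -1, whereas plain majority voting (infer_majority) would return -1.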

* MAIN.CPP *

Something to get the experiments going. See USAGE for more details.

* POLICY.H/POLICY.CPP *

Data collection policies. They take as input the dataset, the inference algorithm and the current time step (new worker), and overwrite the task assignment of the current worker. The available algorithms are listed below (see my thesis for further explanation).
- policy_uniform: uniform allocation.
- policy_balance: weight balancing policy. Relies on accurate estimates of each worker's accuracy.
- policy_uncertainty: uncertainty sampling. Relies on meaningful evaluation of the task posterior.
- policy_los: expected zero-one loss reduction. Relies on both worker accuracy estimates and task posterior evaluation.
- policy_eig: expected information gain. Relies on both worker accuracy estimates and task posterior evaluation.
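To make the idea behind uncertainty sampling (policy_uncertainty) concrete: the worker is assigned to the task whose estimated posterior is closest to 1/2, i.e. the task we are least certain about. This sketch uses illustrative names, not the package's interface:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Pick the task whose posterior probability (of the positive label) is
// closest to 0.5, i.e. the task with maximal uncertainty.
std::size_t most_uncertain_task(const std::vector<double>& posterior) {
    std::size_t best = 0;
    double best_gap = std::fabs(posterior[0] - 0.5);
    for (std::size_t t = 1; t < posterior.size(); ++t) {
        double gap = std::fabs(posterior[t] - 0.5);
        if (gap < best_gap) { best_gap = gap; best = t; }
    }
    return best;
}
```

This is why the list above notes that policy_uncertainty relies on a meaningful evaluation of the task posterior: with a poorly calibrated posterior, the "most uncertain" task is chosen essentially at random.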

* SIM.H/SIM.CPP *

The classes in these files help run the experiments and hold the corresponding results. Note: sim_budget reads the full dataset, while sim_error keeps querying new datapoints until the (estimated) task posterior falls below the given error threshold. Do not use sim_error with a fixed-size dataset unless you know what you are doing.
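One plausible reading of the sim_error stopping rule is sketched below: stop once the estimated probability of error, taken as the worst min(p, 1-p) over tasks, drops below the threshold. The actual criterion is in sim.cpp; this is only an illustration:

```cpp
#include <algorithm>
#include <vector>

// Illustrative stopping rule: true when every task's estimated
// probability of error, min(p, 1 - p), is below the threshold eps.
bool below_error_threshold(const std::vector<double>& posterior, double eps) {
    double worst = 0.0;
    for (double p : posterior)
        worst = std::max(worst, std::min(p, 1.0 - p));
    return worst < eps;
}
```

With a fixed-size dataset such a rule may never be satisfied before the data runs out, which is why the warning above applies.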

* USAGE *

The code follows the C++11 standard and has no dependencies on external libraries. Compiling it should be straightforward.

The file main.cpp contains two separate versions of the main function, plus a test function named test_all(). The first version of main runs experiments on simulated datasets. Options to output the zero-one loss, the predicted zero-one loss and the elapsed time are given in the code; the timing is based on the <chrono> library and has been tested on Windows only. The second version of main runs experiments on external datasets by specifying the desired input file.
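The timing option uses standard <chrono> facilities, so it should in principle work on any conforming platform; a minimal portable pattern (names here are illustrative) looks like this:

```cpp
#include <chrono>

// Measure the wall-clock time taken by a piece of work, in milliseconds,
// using the monotonic steady_clock (unaffected by system clock changes).
double elapsed_ms(void (*work)()) {
    auto t0 = std::chrono::steady_clock::now();
    work();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```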

Parameter list (first version of the main):
- output file name: the output will also be printed on stdout.
- random generator seed: use different seeds to obtain different random sequences.
- maximum number of simulations: repeat the same simulation (with different randomness) multiple times.
- maximum number of simulations with errors: repeat until we observe n simulations with at least one error.
- simulation type {sim_budget, sim_error}.
- simulation parameter: see sim.cpp/sim_parse() for details.
- inference type {infer_weighted, infer_majority, infer_acyclic}.
- number of tasks.
- inference parameters: not all inference types have extra parameters, see infer.cpp/infer_parse() for details.
- crowd type {crowd_budget, crowd_quota, crowd_workers}.
- number of workers: crowd_quota does not require this parameter, see crowd.cpp/crowd_parse() for details.
- worker accuracy distribution with parameters: continuous distribution in the interval [0,1]. See distro.cpp/distro_parse() for parameter details.
- worker quota distribution with parameters: discrete distribution on the natural numbers. Not required for crowd_workers. See distro.cpp/distro_parse() for parameter details.
- data collection policy {policy_uniform, policy_balance, policy_uncertainty, policy_los, policy_eig}.
