All filters take Mediawiki XML dumps in the 0.3 format, and output /something/.
Input is the first argument, defaulting to standard input. This can be made explicit with "-", as is convention.
_XML_ output is the second argument, defaulting to standard output. Some filters may provide other, non-XML information to standard output. In these cases, you /must/ use the second argument in order to keep the XML and non-XML separate.
Status and error reporting is performed via standard error.

These filters are lossy. They exist to fit some rather specific goals and, for performance reasons, do /not/ currently take any pains to ensure that unrecognized elements are preserved. In particular, some revision metadata is simply ignored, and thus never written out. Notably, the siteinfo element is _completely_ discarded. If you want these data preserved, you'll have to update the mediawikiformat.[ch] code (which isn't /too/ hard).

All filters operate on the data in file order. It is assumed that revisions are written oldest-first, which appears to be true of such dumps. All filters output (to standard error) memory usage information at the end of processing.


Available filters:
------------------
autotest
  Runs some tests. Ignores input, and always bails with a usage error. (It was just easier.)

dumptitles
  Outputs a plaintext, newline-separated list of page titles.

dumpall
  Outputs all _parsed_ data in an ad-hoc plaintext format; primarily for debugging. Text is truncated.

passthrough
  Outputs XML which is semantically identical to the input, save the aforementioned lossages. Superfluous whitespace is removed. This is common behaviour in the XML output.

discardnonart
  Outputs XML of the input, sans any pages which are in a namespace (pedantically, those whose title contains a colon, as this appears to be the most reliable way the XML encodes namespaces). This has the effect (at least on English Wikipedia) of removing media/image, help, special, talk, portal, template, etc. pages.
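  The colon test described above is simple enough to sketch; the function name here is hypothetical, not the one in the source:

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true if a page title looks namespaced, i.e.
   contains a colon ("Talk:Foo", "Template:Bar"). Note this will also
   flag the occasional mainspace article with a colon in its title. */
bool title_has_namespace(const char *title)
{
    return strchr(title, ':') != NULL;
}
```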

randomsubset
  Discards 50% of pages, and outputs the rest as XML. Note that, in order to get an O(1) memory implementation which can scale to arbitrarily huge dumps, this is not strictly accurate: rather than select exactly 50% of pages, each page has a 50% chance of inclusion. The probability can be overridden as a third argument, as a floating-point number between zero and one. A better algorithm, which is O(1) with regard to total data size but O(n) with regard to subset size, is to store a buffer of up to n pages, and probabilistically replace them with different pages as they are encountered. (This could be quite memory intensive, as each page may have thousands of revisions, each with thousands of bytes of text, all of which must be copied into the buffer.)
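  The buffer-based alternative described above is classic reservoir sampling. A minimal sketch over page indices rather than whole pages (the function is illustrative, not part of the filter):

```c
#include <stdlib.h>

/* Reservoir sampling: keep a uniform random subset of n items from a
   stream of stream_len items, using only O(n) memory. Here the "items"
   are just indices; the real cost would come from buffering full pages. */
void reservoir_sample(long *reservoir, long n, long stream_len)
{
    long i;
    for (i = 0; i < stream_len; i++) {
        if (i < n) {
            reservoir[i] = i;           /* fill the buffer first */
        } else {
            long j = rand() % (i + 1);  /* uniform in [0, i] */
            if (j < n)
                reservoir[j] = i;       /* replace with probability n/(i+1) */
        }
    }
}
```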

categorise
  Sorts edits into various categories (NOT Mediawiki categories), and outputs this information to standard output. The information is presented as XML, but does not count as "XML output", as it is not in the Mediawiki format. Accepts an optional "major edit" threshold, which is the greatest proportion of the page which could be changed while still being considered a minor edit; any more, and it is major. Specifically, the approximated Levenshtein distance (to get actual Levenshtein distance, you currently need to replace the call to upstr_distance with upstr_levenshtein and recompile) between the plaintext of the revisions, divided by the length of the plaintext of the newer revision, is used as the change proportion. The default value is 0.05.
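  For reference, a textbook dynamic-programming Levenshtein distance and the major/minor test as described above. This is a generic sketch, not the upstr_distance/upstr_levenshtein code from the source:

```c
#include <stdlib.h>
#include <string.h>

/* Standard single-row dynamic-programming Levenshtein distance. */
size_t levenshtein(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b), i, j, d;
    size_t *row = malloc((lb + 1) * sizeof *row);
    for (j = 0; j <= lb; j++)
        row[j] = j;
    for (i = 1; i <= la; i++) {
        size_t diag = row[0];           /* D[i-1][j-1] */
        row[0] = i;
        for (j = 1; j <= lb; j++) {
            size_t old = row[j];        /* D[i-1][j] */
            size_t sub = diag + (a[i-1] != b[j-1]);
            size_t del = row[j] + 1;
            size_t ins = row[j-1] + 1;
            row[j] = sub < del ? (sub < ins ? sub : ins)
                               : (del < ins ? del : ins);
            diag = old;
        }
    }
    d = row[lb];
    free(row);
    return d;
}

/* Major iff distance / length-of-newer exceeds the threshold (default 0.05). */
int is_major_edit(const char *older, const char *newer, double threshold)
{
    return (double)levenshtein(older, newer) / strlen(newer) > threshold;
}
```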


Postprocessing Perl scripts:
----------------------------

cataggr
  Turns the "categorise" filter's output into aggregate statistics. This is a bit nasty and task-specific, so reading at least the top of the source is advised. Will generate files of the form ag-(infix)-(graphname).dat if "infix" is provided as an argument, which can be fed to gnuplot with suitable headers. Also spits out a load of statistics to standard output.

compstrmags
  Compares the magnitude differences between the true and approximated Levenshtein distance algorithms. Takes a pair of "categorise" filter XML files as input.


Other files:
------------

*.gp
  Gnuplot headers for making pretty graphs from some of the cataggr output. Note that parameters in here will be specific to the datasets used, and to my 9-month report in LaTeX. In particular, they expect a folder structure not present in this tarball.

enwiki-trunchard.xml.bz2
  A fixed-up 100,000 line head of the huge dump XML file, bz2 compressed. Just enough to crudely test the tools.
  Note that this is under the GFDL, not the MIT license, as it's just Wikipedia content that I've repackaged.


Not included:
-------------

The Wikipedia dump subsets actually used in the experiment. Sorry, but even the 0.01% subset, gzipped, is 154MB.

