Preprints, Postprints, Peer Review, and Institutional vs. Central Self-Archiving

SUMMARY: Arxiv is a Central Repository (CR) in which physicists have been self-archiving their unrefereed preprints and their peer-reviewed postprints since 1991. There is now a growing movement toward distributed Institutional Repositories (IRs). Thanks to the OAI Protocol, all OAI-compliant IRs and CRs are now interoperable: their metadata can be harvested into search engines that treat all of their contents as if they were in one big virtual CR. What authors self-archive is their peer-reviewed publications, not just their unrefereed preprints. An archive is merely a repository, not a certifier of having met a peer-reviewed journal's quality standards.
Since the research institutions themselves are the primary research providers, with the direct interest in maximising the uptake and usage of their own research output, the natural place for them to deposit their own output is in their own IRs. Any central collections can be harvested via OAI. Institutions are also best placed to monitor and reward compliance with self-archiving mandates, both their own institutional mandates and those of the funders of their institutional research output. Arxiv has played an important role in getting us where we are, but it is likely that the era of CRs is coming to a close, and the era of distributed, interoperable IRs is now coming into its own in an entirely natural way, in keeping with the distributed nature of the Net/Web itself.

Comments on:

Ginsparg, Paul (2006) As We May Read. The Journal of Neuroscience, September 20, 2006, 26(38): 9606-9608 doi:10.1523/JNEUROSCI.3161-06.2006

"[A]rticles are deposited [in Arxiv] by researchers when they choose (either before, simultaneous with, or after peer review), and the articles are immediately available to researchers throughout the world."

Arxiv is a Central Repository (CR) in which researchers (mostly physicists, along with many mathematicians and some computer scientists) have been self-archiving their unrefereed preprints and their peer-reviewed postprints since 1991. It is important to keep in mind that researchers self-archive preprints as well as postprints, because it makes a big difference whether one extrapolates from Arxiv as a preprint CR or a postprint CR, as we shall see below.

It is also pertinent to bear in mind that Arxiv is indeed a Central Repository (CR), because there is now a growing movement toward distributed Institutional Repositories (IRs). The IR movement was facilitated by the Open Archives Initiative (OAI) Protocol for Metadata Harvesting, which renders all IRs and CRs interoperable; the OAI Protocol was itself created partly as a result of an initiative from Arxiv.

As a consequence of the OAI Protocol, all OAI-compliant IRs and CRs are interoperable: their metadata can be harvested into search engines that treat all of their contents as if they were in one big virtual CR.
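To make the "one big virtual CR" idea concrete, here is a minimal sketch of what a harvester does: it issues the same OAI-PMH request (the ListRecords verb with the oai_dc metadata format) to each repository and merges the returned records into a single searchable index. The repository URLs, the hand-written sample responses, and all function names below are assumptions made for illustration; a real harvester would fetch the XML over HTTP and would also handle resumption tokens, errors and deleted records.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and by Dublin Core.
OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build the ListRecords request URL for one repository's OAI base URL."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


def parse_records(xml_text, source):
    """Pull identifier and title out of a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter(f"{{{OAI_NS}}}record"):
        ident = rec.findtext(f"{{{OAI_NS}}}header/{{{OAI_NS}}}identifier")
        title = rec.findtext(f".//{{{DC_NS}}}title")
        records.append({"id": ident, "title": title, "source": source})
    return records


def build_virtual_index(responses):
    """Merge the records of many repositories into one flat list."""
    index = []
    for source, xml_text in responses.items():
        index.extend(parse_records(xml_text, source))
    return index


def search(index, term):
    """Case-insensitive title search across all harvested repositories."""
    term = term.lower()
    return [r for r in index if r["title"] and term in r["title"].lower()]


def sample_response(identifier, title):
    """Hand-written stand-in for what a repository would return (assumption)."""
    return f"""<OAI-PMH xmlns="{OAI_NS}">
  <ListRecords>
    <record>
      <header><identifier>{identifier}</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="{DC_NS}">
          <dc:title>{title}</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


# Two hypothetical repositories: one institutional, one central.
responses = {
    "https://ir.example.edu/oai": sample_response(
        "oai:ir.example.edu:1", "Peer Review and Self-Archiving"),
    "https://cr.example.org/oai": sample_response(
        "oai:cr.example.org:7", "Lattice Gauge Theory Notes"),
}
index = build_virtual_index(responses)
```

A user searching the merged index neither knows nor needs to know which repository holds a given paper, which is the sense in which the distributed IRs and CRs behave like one virtual CR.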

"As a pure dissemination system, [Arxiv] operates at a factor of 100-1000 times lower [1.0% - 0.1%] in cost than a conventionally peer-reviewed system (Ginsparg, 2001)."

This is true, but it is tantamount to saying that as a pure dissemination system, photocopying the articles published in journals operates at a fraction of the cost of publishing a journal: A fraction, but a parasitic fraction, for without the journal, there would be nothing to either photocopy or distribute in Arxiv.

Nothing but the unrefereed preprint, that is. And this brings us face to face with the fundamental question: What are the true costs of peer review, and peer review alone? The peers (scarce, overused resource though they are) review for free, so it is not their services whose costs we are talking about, but the cost of implementing the peer review: processing the submissions, picking the referees, processing their reports, deciding what revisions need to be done to meet the journal's quality standards for acceptance, and deciding -- perhaps again by consulting the referees -- whether those revisions have been successfully done. The selection of referees and the decision as to what needs to be done is usually made by a qualified, answerable super-peer: the editor (or a board of editors). The editors' services, and the clerical services for processing submissions, communicating with referees, and processing referee reports are the costs involved -- and these include not just accepted papers, but rejected ones too (with some journals' rejection rates being over 90%).

In other words, peer-reviewed journal publishing is not a "pure dissemination system." Implementing the peer review costs some money too. There are estimates of what it costs (about $500 per paper was the average estimate a few years ago, which is between one-third and one-sixth of the charge per article that today's "Open Choice" journals are proposing -- although a few journals with high rejection rates have suggested a figure of $10,000 per article, without making it clear whether this represents their costs per article or their income per article).

The annual cost per paper in Arxiv, to Arxiv, has been estimated at about $10 (a few years ago), so this is indeed somewhere between 2% of the low-end estimate and 0.1% of the high-end estimate. If we include the cost of keying in the deposit to the depositor, it's a few pennies more.
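The arithmetic behind those percentages can be checked in a few lines. This is only a sketch using the rough figures quoted above ($10, $500 and $10,000 per paper); the variable and function names are illustrative, and the implied "Open Choice" charge range is simply derived from the "one-third to one-sixth" remark, not from any published price list.

```python
# Rough per-paper figures quoted in the text (US dollars).
ARXIV_COST_PER_PAPER = 10.0      # annual cost per paper to Arxiv
PEER_REVIEW_LOW = 500.0          # average peer-review estimate
PEER_REVIEW_HIGH = 10_000.0      # figure suggested by a few journals


def cost_ratio_percent(dissemination_cost, peer_review_cost):
    """Dissemination cost as a percentage of the peer-review cost."""
    return 100.0 * dissemination_cost / peer_review_cost


# Arxiv's cost relative to the low-end and high-end estimates.
ratio_vs_low = cost_ratio_percent(ARXIV_COST_PER_PAPER, PEER_REVIEW_LOW)
ratio_vs_high = cost_ratio_percent(ARXIV_COST_PER_PAPER, PEER_REVIEW_HIGH)

# If $500 is between one-third and one-sixth of the "Open Choice" charge,
# the implied charge lies between 3 x $500 and 6 x $500.
implied_charge_range = (3 * PEER_REVIEW_LOW, 6 * PEER_REVIEW_LOW)

print(f"Arxiv vs. low-end estimate:  {ratio_vs_low:.1f}%")
print(f"Arxiv vs. high-end estimate: {ratio_vs_high:.1f}%")
print(f"Implied Open Choice charge:  ${implied_charge_range[0]:.0f} to ${implied_charge_range[1]:.0f}")
```

The two ratios come out at 2% and 0.1%, which is the same order of magnitude as the "factor of 100-1000 times lower" in the quoted passage.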

But what do these figures mean? Why compare the cost of online dissemination alone with the cost of peer review (or any of the other values a journal adds, such as the print edition, copy-editing, reference-checking, and mark-up)?

"with many of the production tasks automatable or off-loadable to the authors, the editorial costs will then dominate the costs of an unreviewed distribution system by many orders of magnitude."

Translation: Online dissemination of unrefereed preprints alone costs a lot less than peer-reviewed publication. True, but what follows from that? Peer-reviewed publication costs a lot more than photo-copying too, but what authors photocopy and distribute is their peer-reviewed publications, not just their unrefereed preprints.

"Although the most recently submitted articles have not yet necessarily undergone formal review, the vast majority of the articles can, would, or do eventually satisfy editorial requirements somewhere.... [Arxiv's moderated] submissions are at least 'of refereeable quality'."

Every paper is first an unrefereed preprint -- and then, eventually, most are revised into peer-reviewed, accepted articles (postprints). Hence if preprints are deposited in Arxiv at all, it stands to reason that Arxiv's most recently deposited (sic) papers (sic) have not yet undergone peer review. Tune in a year later, and they will have been, with the revised postprint now also deposited.

Preprints and postprints are deposited rather than "submitted" to IRs or CRs, because an archive is merely a repository, not a certifier of having met a peer-reviewed journal's quality standards: let's reserve "submission" for the attempt to meet a journal's peer-review quality standards. Moreover, unrefereed preprints are merely papers, not articles; they become articles when they have been accepted for publication by a peer-reviewed journal. This is not pedantry or formalism. It is merely the sorting out of what has and has not met known quality control standards. The tag certifying this is currently the journal name, with its established quality level and track-record. A peer-reviewed journal (apart from its function as an access-provider) is a peer-review service-provider/certifier, publicly answerable for its quality standards with its own prestige and reputation. And authors are in turn answerable to the editor and referees, for meeting their standards for acceptance; revision is not optional but obligatory, a condition on acceptance for publication. Hence earning the tag certifying acceptance is a dynamic, interactive process, and not merely a pass/fail system.

Publication is even less like a pass/fail system in that in most fields there is a hierarchy of journals, with a range of peer-review standards, from the one or few most rigorous ones at the top (usually the ones with the highest rejection rates), all the way down to what is sometimes almost a vanity press at the bottom (little better than an unrefereed preprint). These differences in quality standards are known and relied upon in the field. And papers are not really published or unpublished: Most are published, eventually, but at their own quality level. The journals are all autonomous, independent of the authors and the authors' institutions, each dependent on its own established standards for quality and selectivity. Users are in turn dependent on each journal's public track record in deciding what to trust.

It is not at all clear what an IR's or CR's certification of which of its deposits is "of refereeable quality" might mean to busy researchers who need to know whether a paper is worth risking their limited time to read and try to use, apply and build upon. Users currently do this by seeing whether and where it has been published (with the journal name and track record serving as their indicator of the article's probable level of quality, reliability and validity). Unrefereed preprints have always been something handled with care, having only the author's name, institution and prior track-record as a guide to their reliability. Is Arxiv's tag of being "of refereeable quality" meant to serve as a further guide? Or as a substitute for something?

"[P]roposed modifications of the peer review include a two-tier system (for more details, see Ginsparg, 2002), in which, on a first pass, only some cursory examination or other pro forma certification is given for acceptance into a standard tier. At some later point, a much smaller set of articles would be selected for more extensive evaluation."

This is a speculative hypothesis. It is no doubt being tested to see whether it works, whether it delivers results of quality and usability comparable to standard peer review, whether it is cost-effective, and whether it can replace journals. But as it stands, the hypothesis alone does not tell us whether and how well it will work; Arxiv is certainly not evidence for the validity of this hypothesis, since virtually all papers in Arxiv still undergo standard peer review. Arxiv is merely a CR that provides Open Access (OA) to both the preprints and the postprints.

"using standard search engines, more than one-third of the high-impact journal articles in a sample of biological/medical journals published in 2003 were found at nonjournal Web sites (Wren, 2005)."

This is very interesting. This is the higher end of a self-archiving rate that we have found to range between about 5% and 25% across disciplines. Physics is of course even higher (mostly because of Arxiv) and computer science higher still (see Citeseer, a Google-style harvester of distributed locally deposited papers).

"at least 75% of the publications listed [in neuroscience] were freely available either via direct links from the above Web page or via a straightforward Web search for the article title."

This is even more interesting. It means that in such fields the majority of the articles -- note that we are almost certainly not talking about unrefereed preprints here but about peer-reviewed postprints -- are being self-archived already, so the only thing that remains to be done is to deposit (or harvest) them into the author's own OAI-compliant IR rather than a random website, to maximise visibility, harvestability, and impact.

"The enormously powerful sorts of data mining and number crunching that are already taken for granted as applied to the open-access genomics databases can be applied to the full text"

Indeed. And semantic and scientometric analyses too (though article texts are not quite the same thing as the research data on which the articles are based, hence the analogy with the genomics database may be a bit misleading).

"it is likely that more research communities will join some form of global unified archive system without the current partitioning and access restrictions familiar from the paper medium"

What makes it most likely is the self-archiving mandates proposed or already adopted the world over (e.g., RCUK, Wellcome Trust, FRPAA, EC, plus individual institutional self-archiving mandates: CERN, Southampton, QUT, Minho).

But the deposits will not be done in one global CR, nor in a CR like Arxiv for each discipline or combination of disciplines. With the advent of the OAI protocol, all IRs and CRs are interoperable, and since the research institutions themselves are the primary research providers, with the direct interest in showcasing their own research output as well as maximizing its uptake, usage and impact, the natural place for them to deposit their own output is in their own IRs. Any central collections can be gathered via OAI harvesting. Institutions are also best placed to monitor and reward compliance with self-archiving mandates, both their own institutional mandates and those of the funders of their institutional research output.

Arxiv has played an important role in getting us where we are, but it is likely that the era of CRs is coming to a close, and the era of distributed, interoperable IRs is now coming into its own in an entirely natural way, in keeping with the distributed nature of the Net/Web itself.

Stevan Harnad
American Scientist Open Access Forum

Friday, October 6. 2006