SUMMARY: Arxiv is a Central Repository (CR) in which physicists have
been self-archiving their
unrefereed preprints and their peer-reviewed postprints since 1991.
There is now a growing movement toward distributed Institutional Repositories
(IRs). Thanks to the OAI
Protocol, all OAI-compliant IRs and CRs are now interoperable: their
metadata can be harvested into search engines that treat all of their
contents as if they were in one big virtual CR. What authors
self-archive is their peer-reviewed publications, not just their
unrefereed preprints. An archive is merely a repository, not a
certifier of having met a peer-reviewed journal's quality standards.
Since the research institutions
themselves are the primary research providers, with a direct interest
in maximising the uptake and usage of their own research output, the
natural place for them to deposit their own output is in their own IRs.
Any central collections can be harvested via OAI. Institutions are also
best placed to monitor and reward compliance with self-archiving
mandates, both their own institutional mandates and those of the
funders of their institutional research output. Arxiv has played an
important role in getting us where we are, but it is likely that the
era of CRs is coming to a close, and the era of distributed,
interoperable IRs is now coming into its own in an entirely natural
way, in keeping with the distributed nature of the Net/Web itself.
Comments on:
Ginsparg, Paul (2006) As We May Read. The Journal of Neuroscience,
September 20, 2006, 26(38): 9606-9608. doi:10.1523/JNEUROSCI.3161-06.2006
"[A]rticles are deposited [in Arxiv] by researchers when
they choose (either before, simultaneous with, or after peer review),
and the articles are immediately available to researchers throughout
the world."
Arxiv is a Central Repository
(CR) in which physicists (mostly, along with many mathematicians and some
computer scientists) have been self-archiving their unrefereed
preprints and their peer-reviewed postprints since 1991. It is
important to keep in mind that researchers self-archive preprints as
well as postprints, because it makes a big difference whether one
extrapolates from Arxiv as a preprint CR or a postprint CR, as we shall
see below.
It is also pertinent to bear in mind that Arxiv is indeed a Central
Repository (CR), because there is now a growing movement toward
distributed Institutional Repositories (IRs). The IR movement was
facilitated by the Open Archives Initiative (OAI) Protocol for Metadata
Harvesting, which was in turn created partly as a result of an
initiative from Arxiv.
As a consequence of the OAI Protocol, all OAI-compliant IRs and CRs are
interoperable: their metadata can be harvested into search engines that
treat all of their contents as if they were in one big virtual CR.
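To make the idea of one big virtual CR concrete, here is a minimal Python sketch, under stated assumptions, of what an OAI harvester does: it asks each repository for its records with a ListRecords request for Dublin Core (oai_dc) metadata, then merges the returned records into a single searchable index. The repository names (ir.example.edu, cr.example.org), the record identifiers and titles, and all helper functions are hypothetical, and the responses are hand-written stand-ins for what a real repository would return over HTTP. A real harvester would also need to follow resumption tokens and handle errors.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# XML namespaces defined by OAI-PMH 2.0 and by Dublin Core.
OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build the ListRecords request URL for one repository."""
    query = urllib.parse.urlencode(
        {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    )
    return f"{base_url}?{query}"


def sample_response(identifier, title):
    """Hand-written stand-in for a ListRecords response (illustration only)."""
    return f"""<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>{identifier}</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>{title}</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def parse_records(xml_text, source):
    """Extract identifier and title from every record in one response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iter(f"{OAI}record"):
        identifier = record.findtext(f"{OAI}header/{OAI}identifier")
        title = next((el.text for el in record.iter(f"{DC}title")), None)
        records.append(
            {"source": source, "identifier": identifier, "title": title}
        )
    return records


def build_virtual_index(responses):
    """Merge the records of many repositories into one list."""
    index = []
    for source, xml_text in responses.items():
        index.extend(parse_records(xml_text, source))
    return index


def search(index, term):
    """Case-insensitive title search across all harvested repositories."""
    term = term.lower()
    return [r for r in index if r["title"] and term in r["title"].lower()]


if __name__ == "__main__":
    # Hypothetical repositories: one institutional, one central.
    responses = {
        "ir.example.edu": sample_response("oai:ir.example.edu:1", "Quantum widgets"),
        "cr.example.org": sample_response("oai:cr.example.org:7", "Widgets in neuroscience"),
    }
    for hit in search(build_virtual_index(responses), "widgets"):
        print(hit["source"], hit["identifier"], hit["title"])
```

The point of the sketch is that the search function never needs to know whether a record came from an IR or a CR: once the metadata are harvested, the locus of deposit is invisible to the user.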
"As a pure dissemination system, [Arxiv] operates
at a factor of 100-1000 times lower [1.0% - 0.1%] in cost than a
conventionally peer-reviewed system (Ginsparg, 2001)."
This is true, but it is tantamount to saying that as a pure
dissemination system, photocopying the articles published in journals
operates at a fraction of the cost of publishing a journal: A fraction,
but a parasitic fraction, for without the journal, there would be
nothing to either photocopy or distribute in Arxiv.
Nothing but the unrefereed preprint, that is. And this brings us face
to face with the fundamental question: What are the true costs of
peer review, and peer review alone? The peers (scarce,
overused resource though they are) review for free, so it is not their
services whose costs we are talking about, but the cost of implementing
the peer review: processing the submissions, picking the referees,
processing their reports, deciding what revisions need to be done to
meet the journal's quality standards for acceptance, and deciding --
perhaps again by consulting the referees -- whether those revisions
have been successfully done. The selection of referees and the decision
as to what needs to be done is usually made by a qualified, answerable
super-peer: the editor (or a board of editors). The editors' services,
and the clerical services for processing submissions, communicating
with referees, and processing referee reports are the costs involved --
and these include not just accepted papers, but rejected ones too (with
some journals' rejection rates being over 90%).
In other words, peer-reviewed journal publishing is not a "pure
dissemination system." Implementing the peer review costs some money
too. There are estimates of what it costs (about $500 per paper was the
average estimate a few years ago, which is between one-third and
one-sixth of the charge per article that today's "Open Choice" journals
are currently proposing -- although a few journals
with high rejection rates have suggested a figure of $10,000 per
article, without making it clear whether this represents their costs
per article or their income per article).
The annual cost per paper in Arxiv, to Arxiv, has been estimated at
about $10 (a few years ago), so this is indeed somewhere between 2% of
the low-end estimate and 0.1% of the high-end estimate. If we include
the depositor's cost of keying in the deposit, it's a few pennies more.
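The arithmetic behind these percentages can be checked with a short Python sketch. The dollar figures are the estimates quoted above; the variable and function names are illustrative only, and the implied "Open Choice" range is simply derived from the one-third and one-sixth ratios already stated.

```python
# Estimates quoted in the text (US dollars per paper).
ARXIV_COST = 10            # annual cost per paper to Arxiv
PEER_REVIEW_LOW = 500      # average estimate of implementing peer review
PEER_REVIEW_HIGH = 10_000  # figure suggested by a few high-rejection journals


def percent_of(part, whole):
    """Return `part` as a percentage of `whole`."""
    return 100 * part / whole


# Arxiv's dissemination cost relative to each peer-review estimate.
low_end = percent_of(ARXIV_COST, PEER_REVIEW_LOW)    # 2.0
high_end = percent_of(ARXIV_COST, PEER_REVIEW_HIGH)  # 0.1

# If $500 is between one-third and one-sixth of the "Open Choice" charge,
# the charge itself lies between 3 x $500 and 6 x $500.
open_choice_range = (3 * PEER_REVIEW_LOW, 6 * PEER_REVIEW_LOW)  # (1500, 3000)

if __name__ == "__main__":
    print(f"{low_end}% of the low-end estimate")
    print(f"{high_end}% of the high-end estimate")
    print(f"Implied Open Choice charge: ${open_choice_range[0]} to ${open_choice_range[1]}")
```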
But what do these figures mean? Why compare the cost of online
dissemination alone with the cost of peer review (or any of the other
values a journal adds, such as the print edition, copy-editing,
reference-checking, and mark-up)?
"with many of the production tasks automatable or
off-loadable to the authors, the editorial costs will then dominate the
costs of an unreviewed distribution system by many orders of magnitude."
Translation: Online dissemination of unrefereed preprints alone costs a
lot less than peer-reviewed publication. True, but what follows from
that? Peer-reviewed publication costs a lot more than photocopying
too, but what authors photocopy and distribute is their peer-reviewed
publications, not just their unrefereed preprints.
"Although the most recently submitted articles
have not yet necessarily undergone formal review, the vast majority of
the articles can, would, or do eventually satisfy editorial
requirements somewhere.... [Arxiv's moderated] submissions are at least
'of refereeable quality'."
Every paper is first an unrefereed preprint -- and then, eventually,
most are revised into peer-reviewed, accepted articles (postprints).
Hence if preprints are deposited in Arxiv at all, it stands to reason
that Arxiv's most recently deposited (sic) papers (sic) have not yet
undergone peer review. Tune in a year later, and they will have been,
with the revised postprint now also deposited.
Preprints and postprints are deposited rather than "submitted" to IRs
or CRs, because an archive is merely a repository, not a certifier of
having met a peer-reviewed journal's quality standards: let's reserve
"submission" for the attempt to meet a journal's peer-review quality
standards. Moreover, unrefereed preprints are merely papers, not
articles; they become articles when they have been accepted for
publication by a peer-reviewed journal. This is not pedantry or
formalism. It is merely the sorting out of what has and has not met
known quality control standards. The tag certifying this is currently
the journal name, with its established quality level and track-record.
A peer-reviewed journal (apart from its function as an access-provider)
is a peer-review service-provider/certifier, publicly answerable for
its quality standards with its own prestige and reputation. And authors
are in turn answerable to the editor and referees, for meeting their
standards for acceptance; revision is not optional but obligatory, a
condition on acceptance for publication. Hence earning the tag
certifying acceptance is a dynamic, interactive process, and not merely
a pass/fail system.
Publication is even less like a pass/fail system in that in most fields
there is a hierarchy of journals, with a range of peer-review
standards, from the one or few most rigorous ones at the top (usually
the ones with the highest rejection rates), all the way down to what is
sometimes almost a vanity press at the bottom (little better than an
unrefereed preprint). These differences in quality standards are known
and relied upon in the field. And papers are not really published or
unpublished: Most are published, eventually, but at their own quality
level. The journals are all autonomous, independent of the authors and
the authors' institutions, each dependent on its own established
standards for quality and selectivity. Users are in turn dependent on
each journal's public track record in deciding what to trust.
It is not at all clear what an IR's or CR's certification of which of
its deposits is "of refereeable quality" might mean to busy researchers
who need to know whether a paper is worth risking their limited time to
read and try to use, apply and build upon. Users currently do this by
seeing whether and where it has been published (with the journal name
and track record serving as their indicator of the article's probable
level of quality, reliability and validity). Unrefereed preprints have
always been something handled with care, having only the author's name,
institution and prior track-record as a guide to their reliability. Is
Arxiv's tag of being "of refereeable quality" meant to serve as a
further guide? Or as a substitute for something?
"[P]roposed modifications of the peer review
include a two-tier system (for more details, see Ginsparg, 2002), in
which, on a first pass, only some cursory examination or other pro
forma certification is given for acceptance into a standard tier. At
some later point, a much smaller set of articles would be selected for
more extensive evaluation."
This is a speculative hypothesis. It is no doubt being tested to see
whether it works, whether it delivers results of quality and usability
comparable to standard peer review, whether it is cost-effective, and
whether it can
replace journals. But as it stands, the hypothesis alone does not tell
us whether and how well it will work; Arxiv is certainly not evidence
for the validity of this hypothesis, since virtually all papers in
Arxiv still undergo standard peer review. Arxiv is merely a CR that
provides Open Access (OA) to both the preprints and the postprints.
"using standard search engines, more than
one-third of the high-impact journal articles in a sample of
biological/medical journals published in 2003 were found at nonjournal
Web sites (Wren, 2005)."
This is very interesting. It is at the higher end of the self-archiving
rates that we have found, which range between about 5% and 25% across
disciplines. Physics is of course even higher (mostly because of Arxiv)
and computer science higher still (see Citeseer, a Google-style
harvester of distributed, locally deposited papers).
"at least 75% of the publications listed [in
neuroscience] were freely available either via direct links from the
above Web page or via a straightforward Web search for the article
title."
This is even more interesting. It means that in such fields the
majority of the articles -- note that we are almost certainly not
talking about unrefereed preprints here but about peer-reviewed
postprints -- are being self-archived already, so the only thing that
remains to be done is to deposit (or harvest) them into the author's
own OAI-compliant IR rather than a random website, to maximise
visibility, harvestability, and impact.
"The enormously powerful sorts of data mining and
number crunching that are already taken for granted as applied to the
open-access genomics databases can be applied to the full text"
Indeed. And semantic and scientometric analyses too (though article
texts are not quite the same thing as the research data on which the
articles are based, hence the analogy with the genomics database may be
a bit misleading).
"it is likely that more research communities will
join some form of global unified archive system without the current
partitioning and access restrictions familiar from the paper medium"
What makes it most likely is the self-archiving mandates proposed or
already adopted the world over (e.g., RCUK, Wellcome Trust, FRPAA, EC,
plus individual institutional self-archiving mandates: CERN,
Southampton, QUT, Minho).
But the deposits will not be done in one global CR, nor in a CR like
Arxiv for each discipline or combination of disciplines. With the
advent of the OAI protocol, all IRs and CRs are interoperable, and
since the research institutions themselves are the primary research
providers, with a direct interest in showcasing their own research
output as well as maximising its uptake, usage and impact, the natural
place for them to deposit their own output is in their own IRs. Any
central collections can be gathered via OAI harvesting. Institutions
are also best placed to monitor and reward compliance with
self-archiving mandates, both their own institutional mandates and
those of the funders of their institutional research output.
Arxiv has played an important role in getting us where we are, but it
is likely that the era of CRs is coming to a close, and the era of
distributed, interoperable IRs is now coming into its own in an
entirely natural way, in keeping with the distributed nature of the
Net/Web itself.
Stevan Harnad
American Scientist Open Access Forum