CTWatch Quarterly
August 2007

Incentivizing the Open Access Research Web
Publication-Archiving, Data-Archiving and Scientometrics

Tim Brody, U Southampton, UK
Les Carr, U Southampton, UK
Yves Gingras, UQAM
Chawki Hajjem, UQAM
Stevan Harnad, U Southampton, UK; UQAM
Alma Swan, U Southampton, UK; Key Perspectives

Introduction

The research production cycle has three components: the conduct of the research itself (R), the data (D), and the peer-reviewed publication (P) of the findings. Open Access (OA) means free online access to the publications (P-OA), but OA can also be extended to the data (D-OA): the two hurdles for D-OA are that not all researchers want to make their data OA and that the online infrastructure for D-OA still needs additional functionality. In contrast, all researchers, without exception, do want to make their publications P-OA, and the online infrastructure for publication-archiving (a worldwide interoperable network of OAI [1]-compliant Institutional Repositories [IRs][2]) already has all the requisite functionality for this.

Yet because only about 15% of researchers spontaneously self-archive their publications today, their funders and institutions are beginning to require OA self-archiving,[3] so as to maximize the usage and impact of their research output.

The adoption of these P-OA self-archiving mandates needs to be accelerated. Researchers’ careers and funding already depend on the impact (usage and citation) of their research. It has now been repeatedly demonstrated that making publications OA by self-archiving them in an OA IR dramatically enhances their research impact.[4] Research metrics (e.g., download and citation counts) are increasingly being used to estimate and reward research impact, notably in the UK Research Assessment Exercise (RAE).[5] But those metrics first need to be tested against human panel-based rankings in order to validate their predictive power.

Publications, their metadata, and their metrics are the database for the new science of scientometrics. The UK’s RAE, based on the research output of all disciplines from an entire nation, provides a unique opportunity for validating research metrics. In validating RAE metrics (through multiple regression analysis) [6] against panel rankings, the publication archive will be used as a data archive. Hence the RAE provides an important test case both for publication metrics and for data-archiving. It will not only provide incentives for the P-OA self-archiving of publications, but it will also help to increase both the functionality and the motivation for D-OA data-archiving.

Now let us look at all of this in a little more detail:

Research, Data, and Publications

Research consists of three components: (1) the conduct of the Research (R) itself (whether the gathering of empirical data, or data-analyses, or both), (2) the empirical Data (D) (including the output of the data-analyses), and (3) the peer-reviewed journal article (or conference paper) Publications (P) that report the findings. The online era has made it possible to conduct more and more research online (R), to provide online access (local or distributed) to the data (D), and to provide online access to the peer-reviewed articles that report the findings (P).

The technical demands of providing the online infrastructure for all of this are the greatest for R and D – online collaborations and online data-archiving. But apart from meeting the technical demands for R and for D-archiving, the rest is a matter of choice: once the functional infrastructure is available for researchers to collaborate online and to provide online access to their data, it is just a matter of whether and when researchers decide to use it.[7][8] Some research may not be amenable to online collaboration, and some researchers may for various reasons prefer not to collaborate, or not to make their data publicly accessible.

In contrast, when it comes to P, the peer-reviewed research publications, the technical demands of providing the online infrastructure are much less complicated and have already been met. Moreover, all researchers (except those working on trade or military secrets) want to share their findings with all potential users, by (i) publishing them in peer-reviewed journals in the first place and by (ii) sending reprints of their articles to any would-be user who does not have subscription access to the journals in which they were published. Most recently, in the online age, some researchers have also begun (iii) making their articles freely accessible online to all potential users webwide.

Open Access

Making articles freely accessible online is also called Open Access (OA). OA is optimal for research and hence inevitable. Yet even with all of P’s less exacting infrastructural demands already met, P-OA has been very slow in coming. Only about 15% of yearly research article output is being made OA spontaneously today. This article discusses what can be done to accelerate P-OA, to the joint advantage of R, D & P, using a very special hybrid example, based on the research corpus itself (P), serving as the database (D) for a new empirical discipline (R).

For “scientometrics” – the measurement of the growth and trajectory of knowledge – both the metadata and the full texts of research articles are data, as are their download and citation metrics. Scientometrics collects and analyzes these data by harvesting the texts, metadata, and metrics. P-OA, by providing the database for scientometrics, will allow scientometrics to better detect, assess, credit and reward research progress. This will not only encourage more researchers to make their own research publications P-OA (as well as encouraging their institutions and funders to mandate that they make them P-OA), but it will also encourage more researchers to make their data D-OA too, as well as to increase their online research collaborations (R). And although the generic infrastructure for making publications P-OA is already functionally ready, the specific infrastructure for treating P as D will be further shaped and stimulated by the requirements of scientometrics as R.

First, some potentially confusing details need to be made explicit and then set aside: publications (P) themselves sometimes contain research data (D). A prominent case is chemistry, where a research article may contain the raw data for a chemical structure. Some chemists have accordingly been advocating OA for chemical publications not just as Publications (P) but as primary research Data (D), which need to be made accessible, interoperable, harvestable and data-mineable for the sake of basic chemical research (R), rather than just for the usual reading of research articles by individual users. The digital processing of publication-embedded data is an important and valid objective, but it is a special case and hence will not be treated here, because the vast majority of research Publications (P) today do not include their raw data. It is best to consider the problem of online access to data that are embedded in publications as a special case of online access to D rather than as P. Similarly, the Human Genome Database,[9] inasmuch as it is a database rather than a peer-reviewed publication, is best considered as a special case of D rather than P.

Here, however, in the special case of scientometrics, we will be considering P as itself a form of D, rather than merely as containing embedded D within it. We will also be setting aside the distinction between publication metadata (author, title, date, journal, affiliation, abstract, references) and the publication’s full-text itself. Scientometrics considers both of these as data (D). Processing the full-text’s content is the “semiometric” component of scientometrics. But each citing publication’s reference metadata are also logically linked to the publications they cite, so as the P corpus becomes increasingly OA, these logical links will become online hyperlinks. This will allow citation metrics to become part of the P-OA database too, along with download metrics. (The latter are very much like weblinks or citations; they take the form of a “hit-and-run.” Like citations, however, they consist of a downloading site – identified by IP, although this could be made much more specific where the downloader agrees to supply more identifying metadata – plus a downloaded site and document.) We might call citation and download metrics “hypermetrics,” alongside the semiometrics, with which, together, they constitute scientometrics.
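As a concrete (and purely illustrative) sketch of what a download “hypermetric” looks like as data, the following Python fragment aggregates hypothetical download events – each recorded as a downloading site (an IP) plus a downloaded document – into per-article download counts. The event format here is an assumption for illustration, not the actual log format of any particular repository.

```python
# Minimal sketch (hypothetical log format): turning raw download events into
# per-article "hypermetrics". Each event is a "hit-and-run": a downloading
# site (an IP, or richer identity metadata if the downloader supplies it)
# plus the document downloaded.
from collections import Counter, defaultdict

# Hypothetical event tuples: (downloader_ip, document_id, date)
events = [
    ("152.78.0.10", "arxiv:hep-th/0101001", "2007-01-03"),
    ("132.204.3.7", "arxiv:hep-th/0101001", "2007-01-05"),
    ("152.78.0.10", "arxiv:astro-ph/0202002", "2007-01-05"),
]

downloads_per_doc = Counter(doc for _, doc, _ in events)   # raw download counts
unique_sites_per_doc = defaultdict(set)                    # distinct downloading sites
for ip, doc, _ in events:
    unique_sites_per_doc[doc].add(ip)

for doc, n in downloads_per_doc.items():
    print(doc, "downloads:", n, "distinct sites:", len(unique_sites_per_doc[doc]))
```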

Scientometrics

The objective of scientometrics is to extract quantitative data from P that will help the research publication output to be harvested, data-mined, quantified, searched, navigated, monitored, analyzed, interpreted, predicted, evaluated, credited and rewarded. To do all this, the database itself first has to exist and, preferably, it should be OA. Currently, the only way to do (digital) scientometrics is by purchasing licensed access to each publisher’s full-text database (for the semiometric component), along with licensed access to the Thomson ISI Web of Science [10] database for some of the hypermetrics. (Not the hypermetrics for all publications, because ISI only indexes about one third of the approximately 25,000 peer-reviewed research journals published across all fields, nations and languages.) Google Scholar [11] and Google Books [12] index still more, but are as yet very far from complete in their coverage – again because only about 15% of current annual research output is being made P-OA. But if this P-OA content can be raised to 100%, not only will doing scientometrics no longer depend on licensed access to its target data, but researchers themselves, in all disciplines, will no longer depend only on licensed access in order to be able to use the research findings on which they must build their own research.

Three things are needed to increase the target database from 15% to 100%: (1) functionality, (2) incentives, and (3) mandates.[3] The network infrastructure needs to provide the functionality, the metrics will provide the incentive, and the functionality and incentives together will induce researchers’ institutions and funders to mandate OA for their research output (just as they already mandate P itself: “publish or perish”).

Citebase

As noted, much of the functional infrastructure for providing OA has already been developed. In 2000,[13] using the Open Archives Initiative (OAI) Protocol for Metadata Harvesting (OAI-PMH) [1], the Eprints [14] group at the University of Southampton designed the first and now widely used free software (GNU Eprints) for creating OAI-interoperable Institutional Repositories (IRs). Researchers can self-archive the metadata and the full-texts of their peer-reviewed, published articles by depositing them in these IRs. (If they wish, they may also deposit their pre-peer-review preprints, any postpublication revisions, their accompanying research data [D-OA], and the metadata, summaries and reference lists of their books). Not only can Google and Google Scholar harvest the contents of these IRs, but so can OAI services such as OAIster,[15] a virtual central repository through which users can search all the distributed OAI-compliant IRs. The IRs can also provide download and other usage metrics. In addition, one of us (Tim Brody) has created Citebase,[16] a scientometric navigational and evaluational engine that can rank articles and authors on the basis of a variety of metrics.[17]
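Because the IRs are OAI-compliant, their metadata can be harvested with the standard OAI-PMH verbs. The following minimal Python sketch issues a ListRecords request for Dublin Core metadata and prints each record’s identifier and title; the repository base URL is a placeholder, and a real harvester would also follow resumptionTokens and handle errors.

```python
# Minimal sketch of harvesting an OAI-compliant IR via OAI-PMH (the protocol
# exposed by GNU EPrints repositories). The base URL below is a placeholder;
# any endpoint supporting the standard oai_dc metadata format would do.
import urllib.request
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

base_url = "http://example-repository.org/cgi/oai2"   # placeholder endpoint
url = base_url + "?verb=ListRecords&metadataPrefix=oai_dc"

with urllib.request.urlopen(url) as response:
    tree = ET.parse(response)

# A full harvester would loop on the resumptionToken element; this sketch
# only processes the first page of records.
for record in tree.iter(OAI + "record"):
    header = record.find(OAI + "header")
    identifier = header.findtext(OAI + "identifier")
    titles = [t.text for t in record.iter(DC + "title")]
    print(identifier, titles)
```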

Citebase’s current database [18] is not the network of IRs, because those IRs are still almost empty. (Only about 15% of research is being self-archived spontaneously today, because most institutions and funders have not yet mandated P-OA.) Consequently, for now, Citebase is instead focussed on the Physics Arxiv,[19] a special central repository and one of the oldest ones. In some areas of physics, the level of spontaneous self-archiving in Arxiv has already been at or near 100% for a number of years now. Hence, Arxiv provides a natural preview of what the capabilities of a scientometric engine like Citebase would be, once it could be applied to the entire research literature (because the entire literature had reached 100% P-OA).

First, Citebase links most of the citing articles to the cited articles in Arxiv (but not all of them, because Citebase’s linking software is not 100% successful for the articles in Arxiv, not all current articles are in Arxiv, and of course the oldest articles were published before OA self-archiving was possible). This generates citation counts for each successfully linked article. In addition, citation counts for authors are computed. However, this is currently being done for first-authors only: name-disambiguation still requires more work. On the other hand, once 100% P-OA is reached, it should be much easier to extract all names by triangulation - if persistent researcher-name identifiers have not yet come into their own by then.
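The following Python sketch, using made-up records, illustrates the kind of citation linking and counting described above: reference strings are matched to known articles (here by a crude normalised-title match, which is only one possible heuristic, not Citebase’s actual linking algorithm), and counts are accumulated per article and per first author only.

```python
# Minimal sketch of citation linking and counting over harvested metadata.
# Records and reference strings are invented for illustration.
from collections import Counter

def norm(title):
    """Crude normalisation for matching reference strings to titles."""
    return "".join(ch for ch in title.lower() if ch.isalnum())

# Hypothetical harvested metadata: article id -> (title, [authors])
articles = {
    "A1": ("Quark Confinement Revisited", ["Smith, J", "Lee, K"]),
    "A2": ("A Survey of Lattice Methods", ["Patel, R"]),
}
title_index = {norm(t): aid for aid, (t, _) in articles.items()}

# Hypothetical extracted reference strings, keyed by the citing article
references = {"A2": ["Quark confinement revisited"], "A1": []}

article_citations = Counter()
first_author_citations = Counter()   # first-author counts only, as in Citebase
for citing, refs in references.items():
    for ref in refs:
        cited = title_index.get(norm(ref))
        if cited and cited != citing:
            article_citations[cited] += 1
            first_author_citations[articles[cited][1][0]] += 1

print(article_citations)        # Counter({'A1': 1})
print(first_author_citations)   # Counter({'Smith, J': 1})
```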

So Citebase can rank either articles or authors in terms of their citation counts. It can also rank articles or authors in terms of their download counts (Figure 1). (Currently, this is based only on UK downloads: in this respect, Arxiv is not a fully OA database in the sense described above. Its metadata and texts are OA, but its download hypermetrics are not. Citebase gets its download metrics from a UK Arxiv mirror site, which Southampton happens to host. Despite the small and UK-biased download sample, it has nevertheless been possible to show that early download counts are highly correlated with – hence predictive of – later citation counts.[20])
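A minimal sketch of that download/citation test, using synthetic numbers rather than the actual Arxiv mirror data: correlate each article’s early download count with its later citation count to estimate how predictive the downloads are.

```python
# Minimal sketch, with synthetic numbers, of the kind of analysis reported in
# Brody et al. (2006): how well do early downloads predict later citations?
import numpy as np

early_downloads = np.array([120, 30, 75, 10, 200, 55])   # synthetic: first 6 months
later_citations = np.array([14, 2, 9, 1, 25, 6])          # synthetic: after 2 years

r = np.corrcoef(early_downloads, later_citations)[0, 1]
print(f"download/citation correlation: {r:.2f}")
```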

Figure 1. Citebase Ranking Metrics: List of current metrics on the basis of which Citebase can rank a set of articles.

Citebase can generate chronometrics too: the growth rate, decay rate and other parameters of the growth curve for both downloads and citations (Figure 2). It can also generate co-citation counts (how often two articles, or authors, are jointly cited). Citebase also provides “hub” and “authority” counts. An authority is cited by many authorities; a hub cites many authorities; a hub is more like a review article; and an authority is more like a much cited piece of primary research.
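Hub and authority scores of this recursive kind can be computed with a HITS-style iteration over the citation graph. The following sketch, on a toy graph, shows one standard way of doing so; it is not necessarily Citebase’s exact computation.

```python
# HITS-style hub/authority scores on a toy citation graph (illustrative only).
def hits(cites, iterations=50):
    """cites: dict mapping each article to the set of articles it cites."""
    nodes = set(cites) | {c for cited in cites.values() for c in cited}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # authority score: sum of the hub scores of the articles citing you
        auth = {n: sum(hub[m] for m in nodes if n in cites.get(m, ())) for n in nodes}
        # hub score: sum of the authority scores of the articles you cite
        hub = {n: sum(auth[c] for c in cites.get(n, ())) for n in nodes}
        # normalise to keep the scores bounded
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {n: v / a_norm for n, v in auth.items()}
        hub = {n: v / h_norm for n, v in hub.items()}
    return hub, auth

toy_graph = {"review": {"paperA", "paperB"}, "paperA": {"paperB"}, "paperB": set()}
hub, auth = hits(toy_graph)
print(max(auth, key=auth.get), "is the strongest authority")   # much-cited primary work
print(max(hub, key=hub.get), "is the strongest hub")           # review-like article
```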

Figure 2. Citebase Sample Output: Citebase download/citation chronogram showing growth of downloads and growth of citations.

Citebase can currently rank a sample of articles or authors on each of these metrics, one metric at a time. We shall see shortly how these separate, “vertically based” rankings, one metric at a time, can be made into a single “horizontally based” one, using weighted combinations of multiple metrics jointly to do the ranking. If Citebase were already being applied to the worldwide P-OA network of IRs, and if that network contained 100% of each institution’s research publication output, along with each publication’s metrics, this would not only maximise research access, usage and impact, as OA is meant to do, but it would also provide an unprecedented and invaluable database for scientometric data-mining and analysis. OA scientometrics - no longer constrained by the limited coverage, access tolls and non-interoperability of today’s multiple proprietary databases for publications and metrics - could trace the trajectory of ideas, findings, and authors across time, across fields and disciplines, across individuals, groups, institutions and nations, and even across languages. Past research influences and confluences could be mapped, ongoing ones could be monitored, and future ones could be predicted or even influenced (through the use of metrics to help guide research employment and funding decisions).
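The following sketch illustrates such a “horizontal” ranking on synthetic data: each article carries several metrics, each metric is standardised, and a weighted combination produces a single ranking score. The weights here are arbitrary placeholders; choosing and validating them is exactly the problem taken up below.

```python
# Minimal sketch of a weighted, multi-metric ("horizontal") ranking.
# The metric values and weights are placeholders for illustration.
import numpy as np

metrics = {                      # synthetic values: [citations, downloads, growth rate]
    "paperA": [10, 300, 0.2],
    "paperB": [25, 120, 0.5],
    "paperC": [5, 900, 0.1],
}
weights = np.array([0.5, 0.3, 0.2])          # placeholder weights, not validated

X = np.array(list(metrics.values()), dtype=float)
Z = (X - X.mean(axis=0)) / X.std(axis=0)     # standardise each metric column
scores = Z @ weights                          # weighted combination
for name, score in sorted(zip(metrics, scores), key=lambda p: -p[1]):
    print(name, round(float(score), 2))
```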

Citebase today, however, merely provides a glimpse of what would be possible with an OA scientometric database. Citebase is largely based on only one discipline (physics) and uses only a few of the rich potential arrays of candidate metrics, none of them as yet validated. But more content, more metrics, and validation are on the way.

The UK Research Assessment Exercise (RAE)

The UK has a unique Dual Support System [21] for research funding: competitive research grants are just one component; the other is top-sliced funding, awarded to each UK university, department by department, based on how each department is ranked by discipline-based panels of reviewers who assess their research output. In the past, this costly and time-consuming Research Assessment Exercise (RAE) [22] has been based on submitting each researcher’s four best papers every six years to be ‘peer-reviewed’ by the appointed panel, alongside other data such as student counts and grant income (but not citation counts, which departments had been forbidden to submit and panels forbidden to consider, for both journals and individuals).

To simplify the RAE and make it less time-consuming and costly, the UK has decided to phase out the panel-based RAE and replace it instead with ‘metrics.’[23] For a conversion to metrics, the only problem is determining which metrics to use. It was a surprising retrospective finding (based on post-RAE analyses in every discipline tested) that the departmental RAE rankings turned out to be highly correlated with the citation counts for the total research output of each department (Figure 3; [24][25]).

Figure 3. RAE citation/ranking correlation (“Research Assessment, Research Funding, and Citation Impact”): The correlation between RAE ratings and mean departmental citations was +0.91 (1996) and +0.86 (2001) in psychology; RAE rankings and citation counting measure broadly the same thing, and citation counting is both more cost-effective and more transparent. In the Research Assessment Exercise (RAE), the UK ranks and rewards the research output of its universities on the basis of costly and time consuming panel evaluations that have turned out to be highly correlated with citation counts. The RAE will be replacing the panel reviews by metrics after one last parallel panel/metric RAE in which many candidate metrics will be tested and validated against the panel rankings field by field.

Why would citation counts correlate highly with the panel’s subjective evaluation of researchers’ four submitted publications? Each panel was trying to assess quality and importance. But that is also what fellow-researchers assess, in deciding what to risk building their own research upon. When researchers take up a piece of research, apply and build upon it, they also cite it. They may sometimes cite work for other reasons, or they may fail to cite work even if they use it; but for the most part, a citation reflects research usage and hence research impact. If we take the panel rankings to have face validity, then the high correlation between citation counts and the panel rankings validates the citation metric as a faster, cheaper, proxy estimator.

New Online Research Metrics

Nor are one-dimensional citation counts the best we can do, metrically. There are many other research metrics waiting to be tested and validated: publication counts themselves are metrics. The number of years that a researcher has been publishing is also a potentially relevant and informative metric. (High citations later in a career are perhaps less impressive than earlier, though that no doubt depends on the field.) Total citations, average citations per year, and highest individual-article citation counts could all carry valid independent information, as could the average citation count (‘impact factor’) [26] of the journal in which each article is published. But not all citations are equal. By analogy with Google’s PageRank algorithm, citations can also be recursively weighted in terms of how highly cited the citing article or author is. Co-citations can be informative too: being co-cited with a Nobel Laureate may well mean more than being co-cited with a postgraduate student. Downloads can be counted in the online age and could serve as early indicators of impact.
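A minimal sketch of such recursive citation weighting, by direct analogy with PageRank: a citation from a highly weighted article counts for more than one from a lowly weighted article. The toy graph, damping factor and iteration count are illustrative only.

```python
# Minimal PageRank-style recursive weighting of citations (illustrative only).
def citation_rank(cites, damping=0.85, iterations=50):
    """cites: dict mapping each article to the list of articles it cites."""
    nodes = set(cites) | {c for cited in cites.values() for c in cited}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for citing, cited_list in cites.items():
            if cited_list:
                # a citing article passes its weight on, shared among the works it cites
                share = damping * rank[citing] / len(cited_list)
                for cited in cited_list:
                    new[cited] += share
        rank = new
    # Note: articles that cite nothing simply leak weight in this simplified version.
    return rank

print(citation_rank({"A": ["C"], "B": ["C"], "C": ["D"], "D": []}))
```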

Citation metrics today are based largely on journal articles citing journal articles – and mostly just those 8000 journals that are indexed by ISI’s Web of Science. That represents only a third (although probably the top third) of the total number of peer-reviewed journals published today, across all disciplines and all languages. OA self-archiving can make the other two-thirds of journal articles linkable and countable too. There are also many disciplines that are more book-based than journal-based, so book-citation metrics can now be collected as well (and Google Books and Google Scholar are already a potential source for book citation counts). Besides self-archiving the full-texts of their published articles, researchers could self-archive a summary, the bibliographic metadata, and the references cited by their books. These could then be citation-linked and harvested for metrics too. And of course researchers can also self-archive their data (D-OA), which could then also begin accruing download and citation counts. And web links themselves provide a further metric that is not quite the same as a citation link.

Many other data could be counted as metrics too. Co-author counts may have some significance and predictive value (positive or negative: they might just generate more spurious self-citations). It might make a difference in some fields whether citations come from a small, closed circle of specialists or from a broader base, crossing subfields, fields, or even disciplines: an ‘inbreeding/outbreeding’ metric can be calculated. Web link analysis suggests investigating ‘hub’ and ‘authority’ metrics. Patterns of change across time, ‘chronometrics,’ may be important and informative in some fields: the early rate of growth of downloads and citations, as well as their later rate of decay. There will be fast-moving fields where quick uptake is a promising sign, and there will be longer-latency fields where staying power is a better sign. ‘Semiometrics’ can also be used to measure the degree of distance and overlap between different texts, from unrelated works on unrelated topics all the way to blatant plagiarism.
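As a purely illustrative sketch of a semiometric overlap measure, the following fragment computes the cosine similarity between the word-frequency vectors of two texts, ranging from near 0 for unrelated works to near 1 for near-duplicates; real semiometrics would of course use far richer text processing.

```python
# Minimal sketch of a "semiometric" distance: cosine overlap between the
# word-frequency vectors of two texts (bag-of-words only, for illustration).
from collections import Counter
from math import sqrt

def cosine_overlap(text_a, text_b):
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    shared = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in shared)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

print(cosine_overlap("open access maximises research impact",
                     "open access research impact is maximised by self-archiving"))
```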

Validating Research Metrics

The one last parallel panel/metric RAE, in 2008, will provide a unique natural testbed for validating the rich new spectrum of Open Access metrics against the panel rankings. A statistical technique called multiple regression analysis can compute the contribution of each individual metric to the joint correlation of all the metrics with the RAE panel rankings. Once initialized by being validated against the panel rankings, the relative weight of each metric can then be adjusted and optimised according to the needs and criteria of each discipline, with the panels only serving as overseers and fine-tuners of the metric output, rather than having to try to re-review all the publications. This will allow research productivity and progress to be systematically monitored and rewarded.[27]
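A minimal sketch of that validation step, with synthetic departmental data in place of the real RAE submissions: the candidate metrics are regressed on the panel scores, and the fitted coefficients play the role of the per-discipline metric weights described above.

```python
# Minimal sketch, with synthetic data, of validating candidate metrics against
# RAE panel rankings by multiple regression.
import numpy as np

# Synthetic departments: columns = [citations, downloads, co-citations]
metrics = np.array([
    [450,  9000, 120],
    [300,  4000,  60],
    [700, 15000, 200],
    [150,  2500,  30],
    [520,  8000, 150],
], dtype=float)
panel_rank = np.array([5.0, 3.5, 6.5, 2.0, 5.5])   # synthetic panel scores

X = np.column_stack([np.ones(len(metrics)), metrics])      # add intercept column
coeffs, *_ = np.linalg.lstsq(X, panel_rank, rcond=None)    # least-squares fit
predicted = X @ coeffs
r = np.corrcoef(predicted, panel_rank)[0, 1]               # multiple correlation
print("weights (intercept first):", np.round(coeffs, 4))
print("multiple correlation R:", round(float(r), 3))
```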

This is a natural, ‘horizontal’ extension of Citebase’s current functionality but it does not need to be restricted to the UK RAE: once validated, the metric equations, with the weights suitably adjusted to each field, can provide ‘continuous assessment’ of the growth and direction of scientific and scholarly research. Not only will the network of P-OA IRs do double duty by providing access to research for researchers as well as serving as the database for the field of scientometrics, but it will also provide an incentive for data-archiving (D-OA) alongside publication-archiving (P-OA) for other fields too, both by providing an example of the power and potential of such a worldwide database in scientometrics and by providing potential new impact metrics for research data-impact, alongside the more familiar metrics for research publication-impact.

The Open Access Impact Advantage

Citebase has already been able to demonstrate that, in physics, OA self-archiving dramatically enhances citation impact (Figure 4a) for articles deposited in Arxiv, compared to articles in the same journal and year that are not self-archived.[28] Lawrence [29] had already shown this earlier for computer science. The advantage has since been confirmed in 10 further disciplines (Figure 4b) [28] using the bibliographic metadata from the ISI Science and Social Science Citation Index (on CD-ROM, leased to OST at UQAM) for millions of articles in thousands of journals, for which robots then trawled the web to see whether they could find a free online (OA) version of the full text. An OA/non-OA citation advantage – OA articles are cited more than non-OA articles in the same journal and year – has been found in every discipline tested so far (and in every year except the two very first years of Arxiv).
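The comparison itself is simple to express. The following sketch, with made-up citation counts for a single journal-year, computes the percentage OA citation advantage ((OA − NOA)/NOA) reported in Figure 4b.

```python
# Minimal sketch, with made-up numbers, of the OA/non-OA comparison: within a
# single journal and year, compare average citations for OA and non-OA
# articles and report the percentage advantage ((OA - NOA) / NOA).
def oa_advantage(oa_citations, noa_citations):
    oa = sum(oa_citations) / len(oa_citations)
    noa = sum(noa_citations) / len(noa_citations)
    return 100.0 * (oa - noa) / noa

# Hypothetical citation counts for one journal-year
oa_articles = [12, 8, 20, 5]      # articles found OA on the web
noa_articles = [6, 3, 9, 2, 4]    # articles not found OA

print(f"OA citation advantage: {oa_advantage(oa_articles, noa_articles):.0f}%")
```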

Figure 4a & 4b. Open Access citation advantage: Although only a small proportion of articles is currently being made Open Access, those articles are cited much more than articles in the same journal and year that are not.
4a. Particle physics, based on Arxiv.
4b. Ten other fields, based on webwide robot searches. By discipline: total articles (OA+NOA), gray curve; percentage OA (OA/(OA+NOA)), green bars; percentage OA citation advantage ((OA−NOA)/NOA), red bars, averaged across 1992-2003 and ranked by total articles. All disciplines show an OA citation advantage (Hajjem et al., IEEE DEB 2005).

There are many contributors to the OA advantage, but currently, with spontaneous OA self-archiving still hovering at only about 15%, a competitive advantage is one of its important components (Figure 5). (The University of Southampton, the first to adopt a self-archiving mandate, already enjoys an unexpectedly large “G-Factor,” which may well be due to the competitive advantage it gained from mandating OA self-archiving first: Figure 6.) With the growing use of research impact metrics, validated by the UK RAE, the OA advantage will become much more visible and salient to researchers. Together with the growth of data impact metrics alongside publication impact metrics, and the prominent example of how scientometrics can data-mine its online database, there should now be a positive feedback loop, encouraging data self-archiving, publication self-archiving, OA self-archiving mandates, and the continuing development of the functionality of the underlying infrastructure.

Figure 5. Open Access Impact Advantage: There are many contributors to the OA Impact Advantage, but an important one currently (with OA self-archiving still only at 15%) is the competitive advantage. Although this advantage will of course disappear at 100% OA, metrics will make it more evident to researchers today, providing a strong motivation to reap the current competitive advantage.
Figure 6. Southampton Web Impact G-Factor: An important contributor to the University of Southampton’s surprisingly high web impact ‘G-Factor’ is the fact that it was the first to adopt a departmental self-archiving mandate so as to maximise the visibility, usage and impact of its research output.[30]
1 Open Archives Initiative – http://www.openarchives.org/
2 ROAR - http://roar.eprints.org/
3 eprints ROARMAP - http://www.eprints.org/openaccess/policysignup/
4 http://opcit.eprints.org/oacitation-biblio.html
5 RAE - http://www.rae.ac.uk/
6 http://arxiv.org/abs/cs.IR/0703131
7 De Roure, D. and Frey, J. (2007) "Three Perspectives on Collaborative Knowledge Acquisition in e-Science," In Proceedings of Workshop on Semantic Web for Collaborative Knowledge Acquisition (SWeCKa) 2007, Hyderabad, India. http://eprints.ecs.soton.ac.uk/13997/
8 Murray-Rust, P., Mitchell, J.B.O. and Rzepa, H.S. (2005) "Communication and re-use of chemical information in bioscience," BioMed Central Bioinformatics, Vol. 6, p. 180. http://www.biomedcentral.com/content/supplementary/1471-2105-6-180-S1.html
9 The GDB Human Genome Database - http://www.gdb.org/
10 Web of Science - http://scientific.thomson.com/products/wos/
11 Google Scholar - http://scholar.google.com/
12 Google Books - http://books.google.com/
13 http://www.dlib.org/dlib/october00/10inbrief.html#HARNAD
14 EPrints - http://www.eprints.org/
15 OAIster - http://www.oaister.org/
16 Citebase - http://www.citebase.org/
17 Brody, T. (2006) “Evaluating Research Impact through Open Access to Scholarly Communication,” Doctoral Dissertation, Electronics and Computer Science, University of Southampton http://eprints.ecs.soton.ac.uk/13313/
18 http://www.citebase.org/help/
19 arXiv.org - http://www.arxiv.org/
20 Brody, T., Harnad, S. and Carr, L. (2006) "Earlier Web Usage Statistics as Predictors of Later Citation Impact," Journal of the American Society for Information Science and Technology (JASIST), Vol. 57, no. 8, pp. 1060-1072. http://eprints.ecs.soton.ac.uk/10713/
21 http://www.rcuk.ac.uk/aboutrcs/funding/dual/default.htm
22 http://www.hero.ac.uk/rae/
23 http://www.hefce.ac.uk/research/assessment/reform/
24 Smith, A., Eysenck, M. "The correlation between RAE ratings and citation counts in psychology," June 2002 http://psyserver.pc.rhbnc.ac.uk/citations.pdf
25 Harnad, S., Carr, L., Brody, T. & Oppenheim, C. (2003) "Mandated online RAE CVs Linked to University Eprint Archives: Improving the UK Research Assessment Exercise whilst making it cheaper and easier," Ariadne, Vol. 35 (April 2003). http://www.ariadne.ac.uk/issue35/harnad/
26 http://www.garfield.library.upenn.edu/papers/eval_of_science_oslo.html
27 Harnad, S. (2007) “Open Access Scientometrics and the UK Research Assessment Exercise,” Proceedings of the 11th Annual Meeting of the International Society for Scientometrics and Informetrics. Madrid, Spain, 25 June 2007 http://arxiv.org/abs/cs.IR/0703131
28 Harnad, S. & Brody, T. (2004) "Comparing the Impact of Open Access (OA) vs. Non-OA Articles in the Same Journals," D-Lib Magazine, Vol.10, no. 6. June http://www.dlib.org/dlib/june04/harnad/06harnad.html. Hajjem, C., Harnad, S. and Gingras, Y. (2005) Ten-Year Cross-Disciplinary Comparison of the Growth of Open Access and How it Increases Research Citation Impact. IEEE Data Engineering Bulletin 28(4) pp. 39-47. http://eprints.ecs.soton.ac.uk/12906/
29 Lawrence, S. (2001) "Free online availability substantially increases a paper's impact," Nature (web debates), 31 May 2001. http://www.nature.com/nature/debates/e-access/Articles/lawrence.html
30 Copyright Peter Hirst, 2006. This article and its contents and associated images may be freely reproduced and distributed provided that in every case its origin is properly attributed and a link to this website is included. http://www.universitymetrics.com/

URL to article: http://www.ctwatch.org/quarterly/articles/2007/08/incentivizing-the-open-access-research-web/