Harnad, S. (submitted, 2008) Validating Research Performance Metrics Against Peer Rankings. Inter-Research Ethics in Science and Environmental Politics. Theme Section on: 'The use and misuse of bibliometric indices in evaluating scholarly performance'
Chaire de recherche du Canada
Institut des sciences cognitives
Université du Québec à Montréal
Montreal, Quebec, Canada H3C 3P8
Department of Electronics and Computer Science
University of Southampton
SO17 1BJ UNITED KINGDOM
ABSTRACT: A rich and diverse set of potential bibliometric and scientometric predictors of research performance quality and importance is emerging today, from the classic metrics (publication counts, journal impact factors and individual article/author citation counts) to promising new online metrics such as download counts, hub/authority scores and growth/decay chronometrics. In and of themselves, however, metrics are circular: They need to be jointly tested and validated against what it is that they purport to measure and predict, with each metric weighted according to its contribution to their joint predictive power. The natural criterion against which to validate metrics is expert evaluation by peers, and a unique opportunity to do this is offered by the 2008 UK Research Assessment Exercise, in which a full spectrum of metrics can be jointly tested, field by field, against peer rankings.
KEY WORDS: Bibliometrics - Citation Analysis - Journal Impact Factor - Metric Validation - Multiple Regression - Peer Review - Research Assessment - Scientometrics - Web Metrics
Philosophers have a saying1 (about those who are sceptical about metaphysics): "Show me someone who wishes to refute metaphysics and I'll show you a metaphysician with a rival system" (meaning that there is no escaping metaphysics one way or the other: even anti-metaphysics is metaphysics). The same could be said of bibliometrics, or, more broadly, scientometrics.
If we divide the evaluation of scientific and scholarly research into (1) subjective evaluation (peer review) and (2) objective evaluation (scientometrics: henceforth just "metrics"), then even those who wish to refute metrics in favor of peer review first have to demonstrate that peer review (Harnad 2004a) is somehow more reliable and valid than metrics: And to demonstrate that without circularity (i.e., without simply decreeing that peer review is better because peers agree on what research is better and they also agree that peer review is better than metrics!), peer review too will have to be evaluated objectively, i.e., via metrics.
This is not to say that metrics themselves are exempt from the need for validation either. Trying to validate unvalidated metrics against unvalidated metrics is no better than trying to validate peer review with peer review: Circularity has to be eliminated on both sides.
The other contributions to this ESEP special issue have done a good job pointing out the inappropriateness of the unvalidated use of journal impact factors (JIFs) in evaluating anything, be it journal quality, research quality, or researcher quality (Campbell 2008). Not only is the JIF, in and of itself, not validated as a measure of journal quality, especially when comparing across different fields, but, being a journal average, it is a particularly blunt instrument for evaluating and comparing individual authors or papers: Comparing authors in terms of their JIFs is like comparing university student applicants in terms of the average marks of the secondary schools from which the applicants have graduated, instead of comparing them in terms of their own individual marks (Moed 2005).
Psychometrics of Cognitive Performance Capacity
But even author citation counts stand unvalidated in and of themselves. The problem can be best illustrated with an example from another metric field: psychometrics (Kline 2000). If we wish to construct a test of human aptitude, it is not sufficient simply to invent test-items that we hypothesize to be measuring the performance capacity in question, and use those items to construct a set that is internally consistent (i.e., higher scorers tend to score higher on all items, and vice versa) and repeatable (i.e., the same individual tends to get the same score on repeated sittings). So far, that is merely a reliable test, not necessarily a valid one.
Let us call the capacity we are trying to measure and predict with our test our "criterion." To validate a psychometric test, we have to show either that the test has face-validity (i.e., that it is itself a direct measure of the criterion, as in the case of a long-distance swimming test to test long-distance swimming ability, or a calculational test to test calculating ability) or, in the absence of face-validity, we have to show that our test is strongly correlated either with a face-valid test of the criterion or with a test that has already been validated (as being correlated with the criterion).
Scientometrics of Research Performance Quality
In psychometrics, it is the correlation with the criterion that gets us out of the problem of circularity. But what is the criterion in the case of scientometrics? Presumably it is research performance quality itself. But what is the face-valid measure of research performance quality? Apart from the rare cases where a piece of research instantly generates acknowledged break-throughs or applications, the research cycle is too slow and uncertain to provide an immediate face-valid indicator of quality. So what do we do? We turn to expert judgment: Journals (and research funders) consult qualified peer referees to evaluate the quality of research output (or, in the case of grants, the quality of research proposals).
Now, as noted, peer review itself stands in need of validation, just as metrics do: Even if we finesse the problem of reliability, by only considering peer judgments on which there is substantial agreement (Harnad 1985), it still cannot be said that peer review is a face-valid measure of research quality or importance, just as citation counts are not a face-valid measure of research quality or importance.
Getting Metrics off the Ground
It is useful again to return to the analogous case of psychometrics: How did IQ testing first get off the ground, given that there was no face-valid measure of intelligence? IQ tests were bootstrapped in two ways: First, there were (1) "expert" ratings of pupils' performance, by their teachers. Teacher ratings are better than nothing, but of course they too, like peer review, are neither face-valid nor already validated.
In addition, there was the reasonable hypothesis that, whatever intelligence was, (2) the children who at a given age could do what most children could only do at an older age were more likely to be more intelligent (and vice versa). The "Q" in IQ stands for "Quotient" (as in "Intelligence Quotient"): the ratio of an individual child's test scores (mental age) to the test norms for their own age (chronological age). Now this risks being merely a measure of precociousness or developmental delay, rather than intelligence, unless it can be shown that, in the long run, the children with the higher IQ ratios do indeed turn out to be the more intelligent ones. And in that case psychometricians had the advantage of being able to follow children and their test scores and their teacher ratings through their life cycles long enough and on a large enough population to be able to validate and calibrate the tests they constructed against their later academic and professional performance. Once tests are validated, the rest becomes a matter of optimization through calibration and fine-tuning, including the addition of further tests.
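As a worked illustration of the ratio just described: the multiplication by 100 is the conventional scaling for ratio IQ and is an addition here, not something stated in the text above, and the ages are hypothetical. A ten-year-old who performs at the level typical of twelve-year-olds would score:

```latex
\mathrm{IQ} \;=\; 100 \times \frac{\text{mental age}}{\text{chronological age}},
\qquad \text{e.g.}\quad 100 \times \frac{12}{10} = 120 .
```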
Multiple Metrics: Multiple Regression
Psychometric tests and performance capacity turned out to be multifactorial: No single test covers all of our aptitudes. It requires a battery of different tests (of reasoning ability, calculation, verbal skill, spatial visualization, etc.) to be able to make an accurate assessment of individuals' performance capacity and to predict their future academic and professional success. There exist general cognitive abilities as well as domain-specific special abilities (such as those required for music, drawing, sports); and even the domain-general abilities can be factored into a large single general intelligence factor, or "G", plus a number of lesser cognitive factors (Kline 2000). Each test has differential weightings on the underlying factors, and that is why multiple tests rather than just a single test need to be used for evaluation and prediction.
Scientometric measures do not consist of multiple tests with multiple items (Moed 2005). They are individual one-dimensional metrics, such as journal impact factors or individual citation counts. Some a priori functions of several variables such as the h-index (Hirsch 2005) have also been proposed recently, but they too yield one-dimensional metrics. Many further metrics have been proposed or are possible, among them (1) download counts (Hitchcock et al 2003), (2) chronometrics (growth- and decay-rate parameters for citations and downloads; Brody et al. 2006), (3) Google PageRank-like recursively weighted citation counts (citations from highly cited articles or authors get higher weights; Page et al 1999), (4) co-citation analysis, (5) hub/authority metrics (Kleinberg 1999), (6) endogamy/exogamy metrics (narrowness/width of citations across co-authors, authors and fields), (7) text-overlap and other semiometric measures, (8) prior research funding levels, doctoral student counts, etc. (Harnad 2004b; Harzing 2008).
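To make concrete how a function such as the h-index collapses a whole citation record into a single number, here is a minimal sketch using the standard definition (the largest h such that h of an author's papers have at least h citations each). The function name and the citation counts are hypothetical illustrations, not data from any of the studies cited.

```python
def h_index(citations):
    """Largest h such that at least h papers have >= h citations each."""
    ranked = sorted(citations, reverse=True)  # most-cited paper first
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank      # the top `rank` papers all have >= rank citations
        else:
            break         # later papers have even fewer citations
    return h


# Hypothetical citation counts for one author's five papers.
example_counts = [10, 8, 5, 4, 3]
print(h_index(example_counts))
```

Whatever the shape of the underlying distribution, the output is one integer, which is why such indices remain one-dimensional and still need validating against an external criterion.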
Without exception, however, none of these metrics can be said to have face validity: They still require objective validation. How to validate them? Jointly analyzing them for their intercorrelational structure could yield some common underlying factors that each metric measures to varying degrees, but that would still be circular because neither the metrics nor the factors have been validated against their external criterion.
Validating Metrics Against Peer Rankings
What is that external criterion -- the counterpart of psychometric performance capacity -- in the case of research performance quality? The natural candidate is peer review. Peer review does not have face-validity either, but (a) we rely on it already and (b) it is what critics of metrics typically recommend in place of metrics. So the natural way to test the validity of metrics is against peer review. If metrics and peer rankings turn out to be uncorrelated, that will be bad news. If they turn out to be strongly correlated, then we can have confidence in going on to use the metrics independently. Peer rankings can even be used to calibrate and optimize the relative "weights" on each of the metrics in our joint battery of candidate metrics, discipline by discipline.
The simplest case of linear regression analysis is the correlation of one variable (the "predictor") with another (the "criterion"). Correlations can vary from +1 to -1. The square of the correlation coefficient indicates the percentage of the variability in the criterion variable that is predictable from the predictor variable. In multiple regression analysis, there can be P different predictor variables and C different criterion variables. Again, the square of the overall PC correlation indicates what percentage of the variability in the criterion variables is jointly predictable from the predictor variables. Each of the individual predictor variables also has a ("beta") weight that indicates what proportion of that overall predictability is contributed by that particular variable.
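The single-predictor case can be sketched in a few lines. The data below are invented purely for illustration (one citation count and one peer rating per unit assessed), and only the standard library is used so that the sketch is self-contained; in practice a statistics package would be used.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: citation counts (predictor) and peer ratings (criterion).
citations = [2, 5, 9, 14, 20, 31]
peer_rating = [1.0, 2.0, 2.5, 3.5, 3.0, 4.5]

r = pearson_r(citations, peer_rating)
r_squared = r ** 2  # share of criterion variability predictable from the predictor
print(f"r = {r:.3f}, r^2 = {r_squared:.3f}")
```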
Now if we take peer review rankings as our (single) criterion (having first tested multiple peer rankings for reliability), and we take our battery of candidate metrics as our predictors, this yields a multiple regression equation of the form b1P1 + b2P2 + ... + bpPp = C. If the overall correlation of P with C is high, then we have a set of metrics that has been jointly validated against peer review (and, incidentally, vice versa). The metrics will have to be validated separately field by field, and their profile of beta weights will differ from field to field. Even after validation, the initialized beta weights of the battery of metrics for each research field will still have to be calibrated, updated and optimized, in continuing periodic cross-checks against peer review, along with ongoing checks on internal consistency for both the metrics and the peer rankings. But the metrics will have been validated.
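A minimal sketch of such a regression follows, under stated assumptions: all variables are first converted to z-scores (so that the equation has no intercept, matching the form above, and the b's are standardized beta weights), the weights are obtained by ordinary least squares, and the data for eight hypothetical departments are invented. The function names are illustrative only; a real analysis would use a statistics package and far more data.

```python
import math


def zscores(values):
    """Standardize to mean 0 and (population) standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Assumes the predictors are not perfectly collinear (a is non-singular).
    """
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back-substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def standardized_regression(predictors, criterion):
    """Return (standardized beta weights, R squared) for one criterion."""
    z_pred = [zscores(p) for p in predictors]
    z_crit = zscores(criterion)
    n = len(criterion)

    def corr(u, v):
        # For z-scores, the mean cross-product is the Pearson correlation.
        return sum(a * b for a, b in zip(u, v)) / n

    r_xx = [[corr(p, q) for q in z_pred] for p in z_pred]
    r_xy = [corr(p, z_crit) for p in z_pred]
    betas = solve(r_xx, r_xy)                  # normal equations
    r_squared = sum(b * r for b, r in zip(betas, r_xy))
    return betas, r_squared


# Hypothetical metrics (predictors) and peer scores (criterion).
citations = [12, 30, 45, 7, 60, 22, 38, 51]
downloads = [200, 340, 500, 150, 610, 420, 300, 580]
peer_score = [2.0, 3.0, 4.0, 1.5, 5.0, 3.0, 3.5, 4.5]

betas, r_squared = standardized_regression([citations, downloads], peer_score)
print("beta weights:", [round(b, 3) for b in betas])
print("R^2:", round(r_squared, 3))
```

Running the same function separately on each field's data would give the field-specific profiles of beta weights described above.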
The UK Research Assessment Exercise
Is there any way this validation could actually be done? After all, journal peer review and grant-proposal peer review are done piecewise, locally, and their referee ratings are both confidential and un-normalized. Hence they would not be jointly useable and comparable even if we had them available for every paper published within each field. There is, however, one systematic database that provides peer rankings for all research output in all fields at the scale of the entire research output of a large nation and research provider: The United Kingdom's Research Assessment Exercise (RAE) (Harnad 2007; Butler 2008).
For over two decades now, the UK has assembled peer panels to evaluate and rank the research output of every active researcher in every department of every UK university every six years. (The departments were then accorded top-sliced research funding in proportion to their RAE ranks.) The process was very costly and time-consuming. Moreover, it was shown in a number of correlational studies that the peer rankings were highly correlated with citation metrics in all fields tested (Oppenheim 1996) -- even though citations were not counted in doing the peer rankings. It was accordingly decided that after one grand parallel ranking/metrics exercise in 2008, the RAE would be replaced by metrics alone, supplemented by 'light-touch' peer review in some fields.
The Open Access Research Web: A Synergy
The database for the final RAE (2008) hence provides a unique opportunity to validate a rich and diverse battery of candidate metrics for each discipline: The broader the spectrum of potential metrics tested, the greater the potential for validity, predictiveness, and customizability according to each discipline's own unique profile. And as a bonus, generating and harvesting metrics on the Open Access Research Web will not only help measure and predict research performance and productivity: it will also help maximize it (Shadbolt et al 2006).
It has now been demonstrated in over a dozen disciplines, systematically comparing articles published in the same journal and year, that the citation counts of the articles that are made freely accessible to all would-be users on the web (Open Access, OA) are on average twice as high as the citation counts of those that are not (Lawrence 2001; Harnad & Brody 2004; Hajjem et al 2005; see Figure 1).
Figure 1. Percent increase in citations for articles (in the same issue and journal) that are made freely accessible online (Open Access, OA) compared to those that are not. The OA advantage has been found in all fields tested. (Data from Harnad & Brody 2004 and Hajjem et al 2005.)
There are many different factors contributing to this 'Open Access Impact Advantage' -- including an early access advantage (when the preprint is made accessible before the published postprint), a quality bias (higher quality articles are more likely to be made OA), a quality advantage (higher quality articles benefit more from being made OA for users who cannot otherwise afford access), a usage advantage (OA articles are more accessible, more quickly and easily, for downloading) and a competitive advantage (which will vanish once all articles are OA) -- but it is clear that OA is a net benefit to research and researchers in all fields.
Just as peer rankings and metrics can be used to mutually validate one another, so metrics can be used as incentives for providing OA, while OA itself, as it grows, enhances the predictive and directive power of metrics (Brody et al 2007): The prospect of increasing their usage and citation metrics (and their attendant rewards) is an incentive to researchers to provide Open Access to their findings. The resulting increase in openly accessible research not only means more research access, usage and progress, but it provides more open ways to harvest, data-mine and analyze both the research findings and the metrics themselves. This means richer metrics, and faster and more direct feedback between research output and metrics, helping to identify and reward ongoing research, and even to help set the direction for future research.
Citebase: A Scientometric Search Engine
A foretaste of the Open Access Research Web is given by Citebase, a scientometric search engine (Brody et al 2006; Hitchcock et al 2003: http://www.citebase.org/ ). Based mostly on the Physics Arxiv, Citebase reference-links its nearly 500,000 papers and can rank search results on the basis of citation counts, download counts, and various other metrics (see Figure 2) that Citebase provides.
Figure 2. Some of the metrics on which Citebase http://www.citebase.org/ can rank search results.
For a given paper, Citebase can also generate growth curves for downloads and the growth of citations (see Figure 3). It turns out that early download growth is a predictor of later citation growth (Brody et al. 2006).
Figure 3. Citebase http://www.citebase.org/ growth curves for citations (red) and downloads (green) for a particularly important author in physics (E. Witten).
The various different metrics according to which Citebase can rank papers or authors can only be applied individually, one at a time, in the current implementation. There is a menu (Figure 4: 'Rank matches by...') that allows the user to pick the metric. But in principle it is possible to redesign Citebase so as to rank according to multiple metrics at once, and even to adjust the weight on each metric. Imagining several of the vertical metric ranking options in Figure 2 arranged instead horizontally, with an adjustable weight (from -1 to +1) on each, gives an idea of how a search engine like this could be used to calibrate the outcomes of the multiple regression analysis described earlier for validating metrics. Exploratory analysis as well as fine-tuning adjustments could then be done by tweaking the beta weights.
Figure 4. Citebase http://www.citebase.org/ allows users to choose the metrics on which they wish to rank papers, as well as allowing them to navigate on the basis of citation links.
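How such a multi-metric ranking might work can be sketched as follows. This is not Citebase's implementation: the function names, the choice to standardize each metric before weighting (so that metrics on different scales are comparable), and the three example papers are all assumptions made for illustration.

```python
import math


def zscores(values):
    """Standardize values; a metric with no variation contributes nothing."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if sd == 0:
        return [0.0] * n
    return [(v - mean) / sd for v in values]


def rank_by_weighted_metrics(papers, weights):
    """Rank paper ids by a weighted sum of standardized metrics.

    `weights` maps a metric name to an adjustable weight in [-1, +1].
    """
    for name, w in weights.items():
        if not -1.0 <= w <= 1.0:
            raise ValueError(f"weight for {name!r} must lie in [-1, +1]")
    z = {name: zscores([p[name] for p in papers]) for name in weights}
    scored = []
    for i, paper in enumerate(papers):
        score = sum(weights[name] * z[name][i] for name in weights)
        scored.append((score, paper["id"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest score first
    return [paper_id for _, paper_id in scored]


# Hypothetical papers with two of the metrics a search engine might expose.
papers = [
    {"id": "A", "citations": 120, "downloads": 900},
    {"id": "B", "citations": 40, "downloads": 3000},
    {"id": "C", "citations": 75, "downloads": 1500},
]

print(rank_by_weighted_metrics(papers, {"citations": 1.0, "downloads": 0.0}))
print(rank_by_weighted_metrics(papers, {"citations": 0.0, "downloads": 1.0}))
```

Setting the weights to the beta weights obtained from a regression against peer rankings, and then adjusting them by hand, would be one way to do the exploratory fine-tuning described above.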
Bradley, F.H. (1897/2002) Appearance and Reality: A Metaphysical Essay. Adamant Media Corporation.
Brody, T., Carr, L., Gingras, Y., Hajjem, C., Harnad, S. and Swan, A. (2007) Incentivizing the Open Access Research Web: Publication-Archiving, Data-Archiving and Scientometrics. CTWatch Quarterly 3(3). http://eprints.ecs.soton.ac.uk/14418/
Brody, T., Harnad, S. and Carr, L. (2006) Earlier Web Usage Statistics as Predictors of Later Citation Impact. Journal of the American Society for Information Science and Technology (JASIST) 57(8) pp. 1060-1072. http://eprints.ecs.soton.ac.uk/10713/
Butler L (2008) Using a balanced approach to bibliometrics: quantitative performance measures in the Australian Research Quality Framework (this issue ESEP)
Campbell P (2008) Escape from the impact factor (this issue ESEP)
Hajjem, C., Harnad, S. and Gingras, Y. (2005) Ten-Year Cross-Disciplinary Comparison of the Growth of Open Access and How it Increases Research Citation Impact. IEEE Data Engineering Bulletin 28(4) pp. 39-47. http://eprints.ecs.soton.ac.uk/11688/
Harnad, S (1985) Rational disagreement in peer review. Science, Technology and Human Values. 10 p.55-62.
Harnad, S. (2004a) The invisible hand of peer review. In Shatz, B. (ed.) Peer Review: A Critical Inquiry. Rowman & Littlefield. Pp. 235-242. http://cogprints.org/1646/
Harnad, S. (2004b) Enrich Impact Measures Through Open Access Analysis. British Medical Journal 2004; 329: http://bmj.bmjjournals.com/cgi/eletters/329/7471/0-h#80657
Harnad, S. & Brody, T. (2004) Comparing the Impact of Open Access (OA) vs. Non-OA Articles in the Same Journals, D-Lib Magazine 10 (6) http://eprints.ecs.soton.ac.uk/10207/
Harnad, S. (2007) Open Access Scientometrics and the UK Research Assessment Exercise. In Proceedings of 11th Annual Meeting of the International Society for Scientometrics and Informetrics 11(1), pp. 27-33, Madrid, Spain. Torres-Salinas, D. and Moed, H. F., Eds. http://eprints.ecs.soton.ac.uk/13804/
Harzing AWK, van der Wal R (2008) Google Scholar as a new source for citation analysis (this issue ESEP)
Hirsch, Jorge E., (2005), "An index to quantify an individual's scientific research output" Proceedings of the National Academy of Sciences 102(46) 16569-16572
Hitchcock, Steve; Woukeu, Arouna; Brody, Tim; Carr, Les; Hall, Wendy and Harnad, Stevan. (2003) Evaluating Citebase, an open access Web-based citation-ranked search and impact discovery service
Kleinberg, Jon, M. (1999) Hubs, Authorities, and Communities. ACM Computing Surveys 31(4) http://www.cs.brown.edu/memex/ACM_HypertextTestbed/papers/10.html
Kline, Paul (2000) The New Psychometrics: Science, Psychology and Measurement. Routledge
Lawrence, S. (2001) Online or Invisible? Nature 411 (6837): 521
Moed, H. F. (2005) Citation Analysis in Research Evaluation. NY Springer.
Oppenheim, Charles (1996) Do citations count? Citation indexing and the research assessment exercise, Serials, 9:155-61, 1996. http://uksg.metapress.com/index/5YCDB0M2K3XGAYA6.pdf
Shadbolt, N., Brody, T., Carr, L. and Harnad, S. (2006) The Open Research Web: A Preview of the Optimal and the Inevitable, in Jacobs, N., Eds. Open Access: Key Strategic, Technical and Economic Aspects, chapter 21. Chandos. http://eprints.ecs.soton.ac.uk/12453/
Page, L., Brin, S., Motwani, R., Winograd, T. (1999) The PageRank Citation Ranking: Bringing Order to the Web. http://dbpubs.stanford.edu:8090/pub/1999-66
1 In "Appearance and Reality," Bradley (1897/2002) wrote (of Ayer) that "the man who is ready to prove that metaphysics is wholly impossible ... is a brother metaphysician with a rival theory."