Self-Archiving and Journal Subscriptions: Flawed Method and No Data

Monday, November 13. 2006

SUMMARY: There is no evidence to date that Open Access (OA) self-archiving causes journal cancellations. The Publishing Research Consortium commissioned a survey of acquisitions librarian preferences to see whether they could predict such cancellations in the future using a "Share of Preference model," but the study has a glaring methodological flaw that invalidates its conclusion (that self-archiving will cause cancellations). The study consisted of asking librarians which of three hypothetical products -- A, B or C -- they preferred least and most, for a variety of hypothetical combinations of 6 properties with 3-4 possible values each:
      1. ACCESS DELAY: 24-months, 12-months, 6-months, immediate access
      2. PERCENTAGE OF JOURNAL'S CONTENT: 100%, 80%, 60%, 40%
      3. COST: 100%, 50%, 25%, 0%
      4. VERSION: preprint, refereed, refereed+copy-edited, published-PDF
      5. ACCESS RELIABILITY: high, medium, low
      6. JOURNAL QUALITY: high, medium, low
No mention was made of OA self-archiving (in order to avoid "bias"); but, as a result, the model cannot make any prediction at all about the effects of self-archiving on cancellations. The questions on which it is based were about relative preferences for acquisition among competing "products" having different combinations of properties, and the model treated OA (0% cost) as if it were just one of those product properties. But self-archived articles are not products purchased by acquisitions librarians: they are papers given away by researchers, anarchically, and in parallel. Hence from the survey's "Share of Preference model" it is impossible to draw any conclusions about self-archiving causing cancellations by librarians, because the librarians were never asked what they would cancel, under what conditions; just what hypothetical products they would prefer over what. And of course they would prefer lower-priced, immediate products over higher-priced, delayed products! But if all articles in all journals were self-archived, the "Share of Preference model" does not give us the slightest clue about what journals librarians would acquire or cancel. Nor does it give us a clue as to what they would do between now (c. 15% self-archiving) and then (100% self-archiving). The banal fact that everyone would rather have something for free rather than paying for it certainly does not answer this question, or fill the gaping evidential gap about the existence, size, or timing of any hypothetical effect of self-archiving on cancellations. Nor does the study's one nontrivial finding: that librarians don't much care about the difference between a refereed author's draft and a published-PDF. (Let us hope that this study will be the last futile attempt to treat research as if it were done in order to generate or protect journal revenues. 
Even if valid evidence should eventually emerge that OA self-archiving does cause journal cancellations, it would be for the publishing community to adapt to that new reality, not for the research community to abstain from it, and its obvious benefits to research, researchers, their institutions, their funders, and the tax-paying public that funds the funders and for whose benefit the research is conducted.)

Self-Archiving and Journal Subscriptions:
Critique of Publishing Research Consortium Study

Stevan Harnad
The following is a critique of:

Chris Beckett and Simon Inger, Self-Archiving and Journal Subscriptions: Co-existence or Competition? An international Survey of Librarians' Preferences. Commissioned by the Publishing Research Consortium from Scholarly Information Strategies Ltd (SIS), a scholarly publishing consultancy. October 2006

Because there has so far been no detectable correlation between author self-archiving and journal cancellations, the Publishing Research Consortium commissioned a survey of acquisition librarians' preferences and attitudes about a number of hypothetical alternatives. From the responses a theoretical model was constructed, which predicted cancellations as more self-archived content becomes available. How did the study arrive at this prediction without any actual cancellation data?

The prediction was based on a rather simple methodological flaw: Librarians were given a series of hypothetical choices, each a choice among three hypothetical "products," A, B and C. The librarians were asked to pick which of the three product options they would prefer most and least. Each hypothetical product option consisted of a complicated combination of six properties, each taking one of 3-4 possible values.

Presenting this array of hypothetical product options as choices to acquisition librarians (apart from being highly complicated and highly hypothetical, with many hidden assumptions) is specious, for among the potential properties of the hypothetical "product" options was the property that some of the options were free.

But a free self-archived journal article is not a product: It is not something that an acquisitions librarian decides whether or not to acquire. Open Access (OA) is not a product-acquisition issue at all: At best (or worst) it's a product-cancellation issue.

Hence the only credible and direct hypothetical question one could have asked librarians about self-archived journal articles (and even then there would be no guarantee that librarians would actually do as they predicted they would do under the hypothetical conditions) would be about the circumstances under which they think they would cancel existing journals:

"Would you cancel journal X if 100% of its articles were accessible free online (80%? 60%? 40%?)? If they were accessible immediately (after 6 months? 12? 24?)?"

And even that question is laden with highly speculative and even indeterminate assumptions: How could librarians (or anyone) know what percentage of a journal was accessible for free, self-archived, for any particular journal?

And what about interactions between journal X and journal Y? (How to spend a given acquisitions budget -- what to acquire and what to cancel -- is presumably a comparative decision, and we are asking about the keep/cancel trade-offs.)

But what if 60% of all journals were free online (immediately? after 12 months?)? (Acquisition/cancellation decisions today are largely competitive ones: X gets cancelled in favour of Y. The rules of this trade-off game would presumably change if all journals were roughly on a par for their percentage of freely available online content or the length of the delay before it is freely available.)

Straightforward questions on what a librarian predicts they would cancel (in favour of what) under what hypothetical conditions (and how those conditions could be ascertained) might possibly have some weak predictive value. But such straightforward questions are not what this series of questions about preferences among hypothetical "product options" asked.

[Even straightforward hypothetical answers to straightforward hypothetical questions may not have any predictive value if the hypotheses are far-fetched or unfamiliar enough, if they have hidden or incoherent assumptions: I frankly don't believe there is a librarian alive who has a clue as to what they would keep or cancel if the self-archived versions of all journal articles were suddenly available free online today -- let alone what they would do as all journal contents gradually approached 100% availability, at various (uncertain) speeds, from a trajectory of increasing (but uncertain) free content (40% to 60% to 80%) and/or decreasing delay (24 months to 12 months to 6 months).]

And that's without mentioning intangibles such as any continuing demand for the paper edition, etc., nor how librarians could know the percentages available, how quickly the percentages would grow, and at what relative rate they would grow among more and less important journals, more and less expensive journals.

But it was not even these straightforward, if highly speculative, questions that were asked of librarians in this survey. Instead, they were asked to pick the most and least favoured option among three hypothetical "products," A, B and C, with a variety of complicated combinations of 6 hypothetical properties, which could each take 3-4 values:

      1. ACCESS DELAY: 24-months, 12-months, 6-months, immediate access
      2. PERCENTAGE OF JOURNAL'S CONTENT: 100%, 80%, 60%, 40%
      3. COST: 100%, 50%, 25%, 0%
      4. VERSION: preprint, refereed, refereed+copy-edited, published-PDF
      5. ACCESS RELIABILITY: high, medium, low
      6. JOURNAL QUALITY: high, medium, low

In each case, products A, B and C were given some combination of the values on properties 1-6, and the librarian had to choose which of the 3 combinations they most and least preferred.
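
To give a sense of how large this hypothetical option space is, here is a minimal sketch in Python that enumerates every possible "product" from the six properties listed above, and counts how many distinct A/B/C triplets could in principle be formed from them. The variable names are mine, chosen for illustration; nothing here is taken from the PRC report, and the survey itself presented only a sample of such combinations.

```python
import itertools
import math

# The six properties and their possible values, as listed above.
ATTRIBUTES = {
    "access_delay": ["24-months", "12-months", "6-months", "immediate"],
    "percent_of_content": ["100%", "80%", "60%", "40%"],
    "cost": ["100%", "50%", "25%", "0%"],
    "version": ["preprint", "refereed", "refereed+copy-edited", "published-PDF"],
    "access_reliability": ["high", "medium", "low"],
    "journal_quality": ["high", "medium", "low"],
}

# Every hypothetical "product" is one value per property.
profiles = [
    dict(zip(ATTRIBUTES, combo))
    for combo in itertools.product(*ATTRIBUTES.values())
]
n_profiles = len(profiles)

# Number of distinct unordered A/B/C triplets that could be drawn from them.
n_triples = math.comb(n_profiles, 3)

print(n_profiles, "possible products;", n_triples, "possible A/B/C triplets")
```

Whatever sample of triplets each librarian actually saw, everything else had to be interpolated or extrapolated by the model.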

From samples of these combinations (interpolated and extrapolated within and between librarians) the survey concludes that:

PRC: A major study of librarian purchasing preferences has shown that librarians will show a strong inclination towards the acquisition [sic] of Open Access (OA) materials as they discover that more and more learned material has become available in institutional repositories.

(1) OA materials are not "acquired" (and it is both misleading and absurd to cast either the questions or the responses in an acquisitions context). Non-OA products are acquired, and the availability of OA versions of them might or might not induce cancellation in favour of other non-OA products under various circumstances (that are not even touched upon by this study or its methodology).

Why would the model assume arbitrary differential rates of OA growth among journals rather than roughly uniform growth across all journals in each field (apart from random fluctuations)? And if there were systematic differential OA growth within a field, wouldn't librarians' decisions depend very much on the field, and on which journal contents happen to become OA faster, rather than on any general predictions generated from this theoretical model?

(2) Nothing whatsoever was determined about what happens as more and more OA becomes available all round, nor about how availability would be ascertained, nor at what rate OA would grow and be ascertained. There were merely static questions about 3 hypothetical competing "products," some stipulated to be PP% OA within MM months.

PRC: Overall the survey shows that a significant number of librarians are likely to substitute OA materials for subscribed resources, given certain levels of reliability, peer review and currency of the information available. This last factor is a critical one -- resources become much less favoured if they are embargoed for a significant length of time.

The survey shows nothing whatsoever about libraries substituting OA material for anything, because free self-archived content is not something a subscriber institution (library) provides (by buying it in) but something an author institution provides, via its IR, by self-archiving it.

If the questions had been forthrightly put as pertaining to cancellation decisions under various hypothetical conditions, then at least we would have had librarians' speculations about what they think they would cancel under those hypothetical conditions. But instead we have inferences from a model based on least- and most-preferred "product" options having little or no bearing on any question other than the librarians' preferences for the hypothetical properties: They prefer journals with lower prices, whose content is higher quality, more reliable, more immediate, peer-reviewed, and preferably 100% of it. (Librarians don't much care whether the peer-reviewed article is the author's final draft or the publisher's PDF, as long as it's peer-reviewed: That is a genuine finding of this study!)

There is no way at all to interpolate or extrapolate from data like these to draw valid or even coherent conclusions about self-archiving and cancellations, with or without a "conjoint analysis" model.

PRC: One of the key benefits of the conjoint analysis approach used in this survey was the removal of bias by not referring, when testing different product configurations, to any named incarnations of content types, including subscription journals, licensed full-text (or aggregated) databases, or articles on OA repositories.

This "bias" was eliminated at the cost of making it a questionnaire about acquisitions among a variety of competing "products" when it should have been a questionnaire about cancellations under a variety of hypothetical OA conditions (many of them unascertainable, hence moot).

PRC: The survey tested librarians' preferences for a series of hypothetical and unnamed products frequently showing unfamiliar combinations of attributes -- such as a fully priced journal embargoed for 24 months, or content at 25% of the price but through an unreliable service. By taking this approach, the survey measured librarians' preferences for an abstract set of potential products thus avoiding any pre-conceived preferences for named products, such as journals, licensed full-text (aggregated) databases or content on OA repositories.

Indeed. But OA is not an alternative product for acquisition: it is a property that might or might not induce cancellation in favor of other products under certain hypothetical (and presumably competitive) conditions.

PRC: The data were abstracted into a "Share of Preference" model (or simulator) which has then been used to model real-life products and thus create predictions for librarians' real-life preferences for these products. It is therefore possible to go beyond the comparisons, in this work, of journals versus OA and to model other preferences, such as between OA and licensed full-text databases.

The "Share of Preference model" might be viable when the preference really concerns competing products for acquisition, with a variety of rival properties, but it fails completely when applied to free non-products, not for acquisition at all, but treated as if they were just another among the rival properties of products competing for acquisition.

We could have said a-priori that librarians (like all consumers) will prefer a higher quality product over a lower quality product, 100% of a product over 60% of a product, an immediate product over a delayed product, a lower-priced product over a higher-priced product. A "Share of Preference model" could give some rough rank orders for those various combinations.
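
To make concrete what such a model can and cannot deliver, here is a minimal sketch of a share-of-preference calculation. It assumes the logit rule that is common in conjoint simulators (each product's share is proportional to the exponential of its summed utilities); whether the PRC simulator used exactly this rule is not something I can confirm. The part-worth utilities below are invented for illustration only and are not the study's estimates, and only three of the six properties are included to keep the example short.

```python
import math

# Hypothetical part-worth utilities, invented for illustration only.
PART_WORTHS = {
    "delay": {"24-months": -1.0, "12-months": -0.5, "6-months": 0.0, "immediate": 1.0},
    "cost": {"100%": -1.0, "50%": 0.0, "25%": 0.5, "0%": 1.0},
    "quality": {"high": 1.0, "medium": 0.0, "low": -1.0},
}


def utility(product):
    """Sum the part-worths of the values that make up one product."""
    return sum(PART_WORTHS[attr][level] for attr, level in product.items())


def share_of_preference(products):
    """Logit rule: share is exp(utility) divided by the sum over all options."""
    exps = {name: math.exp(utility(p)) for name, p in products.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


products = {
    "A": {"delay": "immediate", "cost": "100%", "quality": "high"},
    "B": {"delay": "immediate", "cost": "0%", "quality": "high"},
    "C": {"delay": "24-months", "cost": "0%", "quality": "high"},
}

shares = share_of_preference(products)
for name, share in shares.items():
    print(name, round(share, 3))
```

Note what the output is: a set of shares that always sum to 1 across whichever options were put into the comparison. Under these made-up utilities the free, immediate option B takes the largest share, which is the rank ordering anyone would have guessed. There is no "cancel" or "keep" outcome anywhere in the calculation.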

It seems natural to add to such a "Share of Preference model" that consumers will prefer a free product over a priced product, except that we are talking here about acquisitions librarians, who do not "acquire" free products but merely buy or cancel priced journals. This study simply does not and cannot indicate under what OA conditions they will cancel what for what.

The following (mild) conclusions are the only ones that can be drawn:

PRC: There is a strong preference for content that has undergone peer review.

Yes, and librarians don't much care whether the peer-reviewed content is the publisher's PDF version or the author's final version -- except that the publisher's PDF is for sale and the author's final draft is not! Nor does the model tell us under what conditions, if both versions are available for a journal X, librarians would cancel the publisher's PDF (and in favour of what journal Y?). The question is never even raised. That's the question the study was designed to answer, but the method could not answer it. The survey might as well have asked the librarians directly, for X/Y pairs of hypothetical or actual journals -- rather than A/B/C triplets of hypothetical "products" -- banal questions such as:

"If 100% of X were immediately available for free online and Y was not, and your users needed X and Y equally, and you could not afford both, and you currently subscribed to X and not to Y, would you cancel X for Y?"

I suspect that it is because -- in the absence of any actual evidence of self-archiving causing cancellations -- a survey on hypothetical cancellations of journal X in favour of journal Y (or no journal at all) under various %OA and months-delay conditions would not have been very convincing or informative that the survey instead resorted to "Share of Preference" modelling. But I'm afraid the outcome is even less convincing.

PRC: How soon content is made available is a key determinant of content model preference in librarian's acquisition behaviour; delay in availability reduces the attractiveness of a product offering.

Yes, immediate access is preferable to delayed access. And, no doubt, if/when librarians are ever inclined to cancel a journal X because PP% of its articles are freely available, they are more likely to do so if that PP% is immediately available than if it is only available 24 months after publication. But we could have guessed that without this study. The question is: Under what circumstances are librarians going to cancel what, when? This study does not and cannot tell us. Relative preference models can only tell us that they are more likely to do it under these conditions than under those conditions (and we already knew all that).

Having said all this, it is important to state clearly that, although there is still no evidence at all of self-archiving causing cancellations, it is possible, indeed probable, that self-archiving will cause some cancellations, eventually. No one knows (1) how soon it will cause cancellations, nor (2) how many cancellations it will cause. That all depends on how much demand there still is (a) for the print edition and (b) for the journal's online edition at that time, (c) how long that demand lasts, and (d) how quickly self-archiving grows and approaches 100%. (Perhaps someone should do a survey on people's predictions about those factors!)

But regardless of any of this -- and regardless also of the validity or invalidity of the present survey -- the possibility or probability of cancellation pressure is most definitely not the basis on which the research community should decide whether or not to self-archive and whether or not to mandate self-archiving. That decision must be based entirely on the benefits of OA self-archiving for research access, impact, productivity and progress -- definitely not on the basis of the possibility of revenue losses for publishers.

We do well to remind ourselves that these questions are not primarily about what is or is not good for the publishing industry. They are about what is and is not good for research, researchers, their institutions, their funders, and the tax-paying public that funds the funders. Research is supported and conducted and peer-reviewed and published for the sake of research progress and applications, not in order to support the publishing industry, or to protect it from risk.

And what is certain is that peer-reviewed research publishing can and will successfully adapt to Open Access: How can it fail to do so, when it is researchers who conduct the research, write the articles, perform the peer review, read, use, apply and cite the research, and, now, provide online access to it as well? Publishers are performing a valuable service (in implementing the peer review and in providing a paper and online edition) but it is publishing that must adapt to what is best for research in the online age, definitely not research that must adapt to what is best for publishing. And publishing can and will adapt.

Berners-Lee, T., De Roure, D., Harnad, S. and Shadbolt, N. (2005) Journal publishing and author self-archiving: Peaceful Co-Existence and Fruitful Collaboration

(I might add that Dr. Alma Swan is not the super-ennuated (sic) Proustian personage repeatedly cited in this PRC survey, but the cygnine author of a number of landmark surveys, one of them reporting the only existing evidence -- negative -- for a causal connection between OA self-archiving and cancellations.)

Swan, A. (2005) Open access self-archiving: An Introduction. JISC Technical Report.

Stevan Harnad
American Scientist Open Access Forum

Open Access Archivangelism