READ ME File For 'In-conference citation data ACM Hypertext & ECHT' (as updated to 2021 data, 2.ed)
Dataset DOI: 10.5258/SOTON/D1870
ReadMe Author: Mark W. R. Anderson,
University of Southampton
ORCID ID: 0000-0001-7396-0721
-----------------
TABLE OF CONTENTS
-----------------
1. DESCRIPTION OF THE DATA
2. DATASET CONTENTS
3. DATA SOURCES
4. DATASET CREATION
5. DATA RECORDS
6. NODES FILE COLUMNS
7. EDGES FILE COLUMNS
8. LICENCE FOR RE-USE
9. NOTES
------------------
--- END TOC ------
--------------------------
1. DESCRIPTION OF THE DATA
--------------------------
This data consists of node and edge tables allowing analysis of in-conference citations in
the proceedings of ACM Hypertext 1987-2021, plus ECHT 1990/92/94.
Changes in this update:
- Dataset now includes all published items (1,447) not just the 903 items with at least 1
in-conference inbound *or* outbound citation links.
- Dataset now includes author keywords (column #17), and a boolean indicating papers in
the cited/citing set (column #18)
- An unfortunate number of typos in the previous readme are now fixed.
This dataset does not currently include the full de-duped table of authors or the author
gender information used in the HT'22 paper (https://doi.org/10.1145/3511095.3531271) that
reports on this dataset. However, figures used in that paper are all based on this dataset
(and its richer source Tinderbox file).
-------------------
2. DATASET CONTENTS
-------------------
This dataset contains this readme plus 2 data files (3 files total), all in UTF-8
plain-text using comma-separated values (CSV) format. The files should be readable in
any text or code editor capable of reading plain text. The two data files are:
- ht_nodes_2021-1.csv (metadata on each discrete paper)
- ht_edges_2021-1.csv (links between papers)
The meaning of the per-column data in the data files is described in sections #6 and #7
below.
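As a minimal sketch of reading the two data files, the following uses only Python's
standard csv module. It assumes that each file starts with a header row and that both
files sit in the same folder; the function names read_table and load_dataset are
illustrative and not part of the dataset.

```python
import csv
from pathlib import Path

# File names as listed in this readme.
NODES_FILE = "ht_nodes_2021-1.csv"
EDGES_FILE = "ht_edges_2021-1.csv"


def read_table(path):
    """Read one CSV file into a list of dicts keyed by its header row.

    Assumption: the first row of the file holds the column labels.
    """
    # newline="" lets the csv module handle line endings inside quoted fields;
    # "utf-8-sig" reads plain UTF-8 and also skips a byte-order mark if present.
    with open(path, encoding="utf-8-sig", newline="") as handle:
        return list(csv.DictReader(handle))


def load_dataset(folder="."):
    """Return (nodes, edges) read from the two data files in `folder`."""
    folder = Path(folder)
    return read_table(folder / NODES_FILE), read_table(folder / EDGES_FILE)
```

A call such as `nodes, edges = load_dataset("path/to/dataset")` would then give one
dict per row. All values arrive as strings, so numeric and boolean columns need
converting before use.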
---------------
3. DATA SOURCES
---------------
1. ECHT 1990. Data is drawn from the author's (paper) copy of the ECHT'90 Proceedings
(Cambridge University Press, 1990, ISBN 0-521-40517-3). This is out of print and has no
e-book version. Digital data was re-keyed by this dataset's author from the book and
from OCR'd scans of the same.
2. ACM Hypertext 1987-2021, plus ECHT 1992 and 1994 (i.e. all other conferences bar the
above). Data was gathered via publicly accessible pages of the ACM Digital Library
(https://dl.acm.org), with corrections where errors in the master DL record were detected.
DL.ACM has no API and its Terms of Use specifically preclude screen-scraping: the library
HTML pages lack meaningful semantic mark-up such as would assist screen scraping even were
it to be authorised. Other sources such as dblp.org and Google Scholar certainly repeat
errors in the DL.ACM records even if not adding others.
Although this dataset's information is publicly accessible (ECHT'90 less so), it does not
exist in the form of a meaningful dataset instantly available for *other* purposes, such
as visualisation of citation trees.
The choice of a Creative Commons CC BY-NC-SA 4.0 licence respects the rights of the
source data owners and prevents commercial exploitation of the significant effort
involved in the creation of this dataset (at least, commercial exploitation *without*
seeking different *prior* approval). See section #8 for further detail of the licence.
In further re-use the original sources (ACM & CUP) must be acknowledged.
-------------------
4. DATASET CREATION
-------------------
As creation of the dataset was initially exploratory, it being unclear what information
was available across the subject domain, the primary tool used was Eastgate's
Tinderbox (https://eastgate.com/Tinderbox/). Tinderbox '.tbx' files are stored as
well-formed XML (with XML-compatible encoding of style RTFD text/embedded images). The
tool has strong & flexible support for incremental formalisation of content and has
very configurable export. The CSV files are created by export from Tinderbox.
---------------
5. DATA RECORDS
---------------
The dataset is drawn from data on 1,447 published items of which 1,073 were full/short
papers - the primary citable academic content. Here, the terms 'paper' and 'item' are
broadly interchangeable, though 'paper' likely refers to an item that is a full/short
paper, i.e. 9+pp. vs. 4–6pp, and is flagged with 'Is Paper' boolean data.
Represented within this, 903 items have 1 or more inbound *or* outbound in-conference
citation links. The export of both files is ordered as per the source, so by year
(conference) and, within year, by page number within the printed proceedings.
The latter sub-set of 903 nodes thus represents 62.4% of the wider corpus, and the
remaining 544 published items (37.6%) have no in-conference citation links at all, inbound
or outbound. Adjusting for 2021's 35 items that cannot yet have been cited (i.e. 1,412
total), 63.9% of items are in the cited/citing set with 36% having no in-conference
citation links. The nodes file includes 'Article Type' allowing further filtering (or use
'Is Paper') to look only within the 1,073 papers (which will give 789 nodes, 73.5% of that
lesser set, with as might be expected, a smaller percentage of un-cited items).
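The percentages above can be recomputed from the nodes table. The sketch below
assumes the rows have been read into dicts (for example with csv.DictReader), that
the header row uses the column labels from section #6 verbatim ('Is Paper',
'Has Citation Link', 'Conference Year'), and that booleans are written as
'true'/'false'. The helper names are illustrative only.

```python
def is_true(value):
    """Interpret a CSV cell such as 'true' or 'false' as a Python bool."""
    return str(value).strip().lower() == "true"


def link_coverage(nodes, papers_only=False, exclude_year=None):
    """Return (linked, total, percent) for the rows that pass the filters.

    linked  = rows whose 'Has Citation Link' is true
    total   = rows considered after filtering
    percent = 100 * linked / total, rounded to one decimal place
    """
    rows = [
        row for row in nodes
        if (not papers_only or is_true(row["Is Paper"]))
        and (exclude_year is None
             or row["Conference Year"].strip() != str(exclude_year))
    ]
    linked = sum(1 for row in rows if is_true(row["Has Citation Link"]))
    percent = round(100 * linked / len(rows), 1) if rows else 0.0
    return linked, len(rows), percent
```

If the files match the figures quoted above, link_coverage(nodes) should report 903
of 1,447 items, and link_coverage(nodes, papers_only=True) should report 789 of
1,073 papers.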
At the end of the descriptions (below) of each data column, its source Tinderbox attribute
(aka field) is shown. Thus ($ArticleTitle) implies the 'Label' column data is drawn from
the Tinderbox attribute named 'ArticleTitle'. The use of a node file column named 'Label'
is deliberate as use of this data in Gephi requires a node column of that name.
---------------------
6. NODES FILE COLUMNS
---------------------
Node descriptions take the form "column-label: data type: description (Tinderbox source
attribute)". There are 19 columns.
1. ID: 10-digit number: the UID of the source note in Tinderbox. ($ID)
2. Label: string: The published title of the paper/item. ($ArticleTitle)
3. Authors: string: A descriptive author listing in one of four forms based on the number
of authors: 1: "A", 2: "A & B", 3: "A, B & C", 4+: "A et al.". Also see the 'Author Names'
column for the full names of each author. ($Name or $AltName - choice reflects internal
Tinderbox title de-duplication issues not germane here)
4. In-Conference References: Number: count of HT conference published items referenced by
this item. ($NumberInConfRefs)
5. In-Conference Cited By: Number: count of HT conference published items that have
referenced this item. ($CitedCount)
6. Total References: Number: total count of the item's citation of references from any
source (i.e. its list of References). ($NumberOfRefs)
7. Conference Proceedings: String: Formal title of the item's parent Conference's
Proceedings. ($ConferenceProceedings)
8. Conference Title: String: Optional additional theme title of the Conference's
Proceedings, so only for some conferences. ($ConferenceName)
9. Conference Abbreviation: String: Abbreviation used to refer to that conference.
Initially 'Hypertext', e.g. Hypertext'89, it was later shortened to 'HT', e.g. HT'02. The
year is always two digits. European conferences used 'ECHT', e.g. ECHT'92. In general
use, the name and date may be separated, optionally, by a space and use either straight
or typographic single quote types before the elided year value. Only straight quotes are
used in this data. ($ConferenceShortTitle)
10. Conference Year: Number, 4-digit year YYYY: the year the conference was held.
($PublicationYear)
11. DOI URL: String (a URL): the DOI-based URL of the published item in the DL.ACM. There
is no such data for ECHT'90 as it is not formally an ACM conference and has no DL.ACM
record. ($DOIUrl)
12. Author Names: String, as a comma+space list; list of author names in normal first/last
order suitable for screen display. List order is as given in published paper. ($Authors)
13. First Author: String: The name of the paper's first author. ($FirstAuthor)
14. Article Type: String: the type of article the item represents, e.g. full paper,
short paper, demo, poster, keynote, etc. ($ArticleType)
15. DOI PDF URL: String (a URL): the DOI-based URL of the PDF published item in the DL.ACM
(requires login, so needs ACM account). There is no such data for ECHT'90 as it is not
formally an ACM conference and has no DL.ACM record. ($DOIPdfUrl)
16. Is Paper: Boolean (true/false): A 'true' value indicates the item is a full or short
paper. All other items are marked false. ($IsPaper)
17. Keywords: String: a quote-enclosed comma+space delimited list of the paper's
author-supplied keywords. ($RefKeywords)
18. Has Citation Link: Boolean (true/false): true if this item has one or more inbound or
outbound links in the corpus, i.e. the paper either cites another paper or is cited by one
at least once within the overall corpus of conference items. ($HasCitationLink)
19. Abstract: String: The text of the published item's abstract, if any. In-source line
breaks are replaced with a '####' delimiter to make the CSV more robust; substitute '\n'
or '\n\n' for the delimiter as appropriate for further use.
(Tinderbox source templates: column heads from "node-wrapper5", values from "node-sub-item5")
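As an illustration of the column conventions above, the following sketch converts
one raw node row (a dict of strings, e.g. from csv.DictReader) into typed values. It
assumes the header row uses the column labels listed above verbatim, that booleans
are written as 'true'/'false', and that list columns hold no items containing
comma+space themselves. The names clean_node and split_list are illustrative.

```python
ABSTRACT_BREAK = "####"  # stands in for line breaks in the 'Abstract' column
INT_COLUMNS = ("In-Conference References", "In-Conference Cited By",
               "Total References", "Conference Year")
LIST_COLUMNS = ("Author Names", "Keywords")
BOOL_COLUMNS = ("Is Paper", "Has Citation Link")


def split_list(cell):
    """Split a comma+space delimited cell into a list of trimmed items."""
    return [part.strip() for part in cell.split(", ") if part.strip()]


def clean_node(row):
    """Return a copy of `row` with typed values; untouched columns stay strings."""
    node = dict(row)
    for name in INT_COLUMNS:
        text = (node.get(name) or "").strip()
        node[name] = int(text) if text else None  # empty cell -> None
    for name in LIST_COLUMNS:
        node[name] = split_list(node.get(name) or "")
    for name in BOOL_COLUMNS:
        node[name] = (node.get(name) or "").strip().lower() == "true"
    # Restore single line breaks; use "\n\n" instead if paragraph gaps are wanted.
    node["Abstract"] = (node.get("Abstract") or "").replace(ABSTRACT_BREAK, "\n")
    return node
```

The 'ID' column is deliberately left as a string here so that it can be matched
directly against the SOURCE and TARGET values of the edges file.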
---------------------
7. EDGES FILE COLUMNS
---------------------
Here, the column naming uses a style common for use with Gephi.
Edges, i.e. links, are assumed to be directional from a source to a target.
There are 4 Data columns.
1. SOURCE: Number: node 'ID' (q.v.) of the link's source node. ($ID)
2. TARGET: Number: node 'ID' (q.v.) of the link's target (destination) node. ($ID)
3. LINKTYPE: String: not used here, so all rows have the default value 'cites'. The column
can represent the hypertext link type used within the main Tinderbox file, e.g. an author's
notes link to their articles with type 'authored', items citing an earlier article use
'cites', etc.
(no source attribute - data added as part of export, derived from Tinderbox's link-base)
4. WEIGHT: Number: Edge weighting. Not used, so default value of 1. (added during export)
(Tinderbox source templates: column heads from "edge-wrapper", values from "edge-item")
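A possible consistency check between the two files is sketched below: it tallies
the edges per node and compares the tallies with the counts stored in the nodes
file. It assumes that rows are dicts of strings keyed by the column labels given in
sections #6 and #7, that SOURCE is the citing item and TARGET the cited item, and
that each citation appears as exactly one edge row. The function names are
illustrative.

```python
from collections import Counter


def citation_counts(edges):
    """Return (cites, cited_by): per-node counts of outbound and inbound edges."""
    cites = Counter()
    cited_by = Counter()
    for edge in edges:
        cites[edge["SOURCE"].strip()] += 1     # the source item cites something
        cited_by[edge["TARGET"].strip()] += 1  # the target item is cited
    return cites, cited_by


def count_mismatches(nodes, edges):
    """Return IDs of nodes whose stored counts differ from the edge tallies."""
    cites, cited_by = citation_counts(edges)
    mismatched = []
    for node in nodes:
        node_id = node["ID"].strip()
        # An empty cell is treated as zero.
        stored = (int(node["In-Conference References"] or 0),
                  int(node["In-Conference Cited By"] or 0))
        if stored != (cites[node_id], cited_by[node_id]):
            mismatched.append(node_id)
    return mismatched
```

An empty list from count_mismatches(nodes, edges) would suggest the two files agree
under these assumptions; any IDs returned are candidates for closer inspection
rather than confirmed errors.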
---------------------
8. LICENCE FOR RE-USE
---------------------
This dataset is licensed under a Creative Commons Licence CC BY-NC-SA 4.0
(https://creativecommons.org/licenses/by-nc-sa/4.0/).
In overview, you are free to:
Share — copy and redistribute the material in any medium or format
Adapt — remix, transform, and build upon the material
The licensor cannot revoke these freedoms as long as you follow the license terms.
Under the following terms:
Attribution — You must give appropriate credit, provide a link to the license, and
indicate if changes were made. You may do so in any reasonable manner, but not in any way
that suggests the licensor endorses you or your use.
NonCommercial — You may not use the material for commercial purposes.
ShareAlike — If you remix, transform, or build upon the material, you must distribute your
contributions under the same license as the original.
No additional restrictions — You may not apply legal terms or technological measures that
legally restrict others from doing anything the license permits.
--------
9. NOTES
--------
Date of data collection: 2016–2021 (and ongoing for subsequent conferences)
Information about geographic location of data collection: n/a
Related projects/Funders: n/a
Related publication:
1. At HT'22. "Hypertext’s meta-history: Documenting in-conference citations, authors
and keyword data, 1987-2021" URL: https://doi.org/10.1145/3511095.3531271
Date that this file set (of 3 files) was created: (y-m-d) 2022-06-01