READ ME File For 'In-conference citation data ACM Hypertext & ECHT' (as updated to 2021 data)

Dataset DOI: 10.5258/SOTON/D1870

ReadMe Author: Mark W. R. Anderson, University of Southampton
ORCID ID: 0000-0001-7396-0721

-----------------
TABLE OF CONTENTS
-----------------
1. DESCRIPTION OF THE DATA
2. DATASET CONTENTS
3. DATA SOURCES
4. DATASET CREATION
5. DATA RECORDS
6. NODES FILE COLUMNS
7. EDGES FILE COLUMNS
8. LICENCE FOR RE-USE
9. NOTES
------------------
--- END TOC ------

--------------------------
1. DESCRIPTION OF THE DATA
--------------------------

This data consists of node and edge tables allowing analysis of in-conference citations in the proceedings of ACM Hypertext 1987-2021, plus ECHT 1990/92/94.

-------------------
2. DATASET CONTENTS
-------------------

This dataset contains this readme plus 2 data files (3 files total), all in UTF-8 plain text using comma-separated values (CSV) format. The files should be readable in any text or code editor capable of reading plain text.

The two data files are:
- ht_nodes_2021.csv (metadata on each discrete paper)
- ht_edges_2021.csv (links between papers)

The meaning of the per-column data in the data files is described in sections #6 and #7 below.

---------------
3. DATA SOURCES
---------------

1. ECHT 1990. Data is drawn from the author's (paper) copy of the ECHT'90 Proceedings (Cambridge University Press, 1990, ISBN 0-521-40517-3). This is out of print and has no e-book version. Data was re-keyed by the author.
2. ACM Hypertext 1987-2021, plus ECHT 1992 and 1994 (i.e. all other conferences bar the above). Data was gathered via publicly accessible pages of the ACM Digital Library (https://dl.acm.org). Terms of Use precluded screen scraping and there is no API.

Although this information is publicly accessible (ECHT'90 less so), it does not exist in the form of a dataset instantly available for *other* purposes, such as visualisation of citation trees.

The choice of a Creative Commons CC BY-NC-SA 4.0 licence respects the rights of the source data owners and prevents commercial exploitation of the significant effort involved in the creation of this dataset (at least, commercial exploitation *without* seeking different *prior* approval). See section #8 for further detail of the licence. In further re-use the original sources (ACM & CUP) must be acknowledged.

-------------------
4. DATASET CREATION
-------------------

As creation of the dataset was initially exploratory, it being unclear what information was available across the subject domain, the primary tool used was Eastgate's Tinderbox (https://eastgate.com/Tinderbox/). Tinderbox '.tbx' files are stored as well-formed XML (with XML-compatible encoding of styled RTFD text and embedded images). The tool has strong and flexible support for incremental formalisation of content and has very configurable export. The CSV files are created by export from Tinderbox.

---------------
5. DATA RECORDS
---------------

The dataset is drawn from data on 1,447 published items, of which 1,073 were full/short papers: the primary citable academic content. Here, the terms 'paper' and 'item' are broadly interchangeable, though 'paper' usually refers to an item that is a full or short paper (i.e. 9+ pp. vs. 4–6 pp.).

The data exported represent discrete items from the 1,412 items [sic] that have at least one inbound (cited by) OR outbound (cites) link. The export of both files is ordered as per the source, so by year (conference) and, within each year, by page number within the printed proceedings.

The 903 nodes in the dataset thus represent 62.4% of the wider corpus. Thus 544 published items (37.6%) were never cited in-conference. Adjusting for 2021's 35 items that cannot yet have been cited (i.e. 1,412 total), 63.9% of items are in this dataset, with 36% never having been cited at all in-conference.
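
The coverage figures above are simple ratios. As a minimal sketch, the Python below reproduces them using only the counts quoted in this section. The variable and function names are illustrative assumptions and are not part of the dataset.

```python
# Counts quoted in section 5 of this readme.
total_items = 1447   # all published items in the corpus
items_2021 = 35      # 2021 items, too recent to have been cited
nodes = 903          # items with at least one in-conference link


def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


never_linked = total_items - nodes           # 544
adjusted_total = total_items - items_2021    # 1412

share_all = pct(nodes, total_items)              # 62.4
share_uncited = pct(never_linked, total_items)   # 37.6
share_adjusted = pct(nodes, adjusted_total)      # 64.0

print(never_linked, adjusted_total, share_all, share_uncited, share_adjusted)
```

Rounded to one decimal place the adjusted share comes out at 64.0; the 63.9% quoted above appears to be truncated rather than rounded (903 / 1,412 is approximately 0.6395).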

The nodes file includes 'Article Type', allowing further filtering (or use 'Is Paper') to look only within the 1,073 papers. This gives 789 nodes, 73.5% of that smaller set, with (as might be expected) a smaller percentage of un-cited items.

At the end of the description (below) of each data column, its source Tinderbox attribute (aka field) is shown. Thus ($ArticleTitle) implies the 'Label' column data is drawn from the Tinderbox attribute named 'ArticleTitle'. The use of a node file column named 'Label' is deliberate, as use of this data in Gephi requires a node column of that name.

---------------------
6. NODES FILE COLUMNS
---------------------

Node descriptions take the form "column-label: data type: description (Tinderbox source attribute)". There are 17 columns.

1. ID: 10-digit number: the UID of the source note in Tinderbox. ($ID)
2. Label: String: the published title of the paper/item. ($ArticleTitle)
3. Authors: String: a short descriptive author listing, in one of four forms depending on the number of authors: 1: "A", 2: "A & B", 3: "A, B & C", 4+: "A et al.". Also see the 'Author Names' column for the full names of each author. ($Name or $AltName - choice reflects internal Tinderbox title de-duplication issues not germane here)
4. In-Conference References: Number: count of HT conference published items referenced by this item. ($NumberInConfRefs)
5. In-Conference Cited By: Number: count of HT conference published items that have referenced this item. ($CitedCount)
6. Total References: Number: total count of the item's citations of references from any source (i.e. its list of References). ($NumberOfRefs)
7. Conference Proceedings: String: formal title of the item's parent conference's proceedings. ($ConferenceProceedings)
8. Conference Title: String: optional additional theme title of the conference's proceedings, so present only for some conferences. ($ConferenceName)
9. Conference Abbreviation: String: abbreviation used to refer to that conference. Initially 'Hypertext', e.g. Hypertext'89, it was later shortened to 'HT', e.g. HT'02. The year is always two digits. European conferences used 'ECHT', e.g. ECHT'92. In general use, the name and date may optionally be separated by a space, and either a straight or a typographic single quote may precede the elided year value. Only straight quotes are used in this data. ($ConferenceShortTitle)
10. Conference Year: Number, 4-digit year YYYY: the year the conference was held. ($PublicationYear)
11. DOI URL: String (a URL): the DOI-based URL of the published item in the DL.ACM. There is no such data for ECHT'90 as it is not formally an ACM conference and has no DL.ACM record. ($DOIUrl)
12. Author Names: String, as a comma+space list: list of author names in normal first/last order suitable for screen display. List order is as given in the published paper. ($Authors)
13. First Author: String: the name of the paper's first author. ($FirstAuthor)
14. Article Type: String: the type of article the item represents, e.g. full paper, short paper, demo, poster, keynote, etc. ($ArticleType)
15. DOI PDF URL: String (a URL): the DOI-based URL of the PDF of the published item in the DL.ACM (requires login, so needs an ACM account). There is no such data for ECHT'90 as it is not formally an ACM conference and has no DL.ACM record. ($DOIPdfUrl)
16. Is Paper: Boolean (true/false): a 'true' value indicates the item is a full or short paper. All other items are marked 'false'. ($IsPaper)
17. Abstract: String: the text of the published item's abstract, if any. In-source line breaks are substituted with a '####' delimiter to make the CSV more robust; substitute '\n' or '\n\n' for the delimiter as appropriate for further use.

(Tinderbox source template: "node-wrapper1a")

---------------------
7. EDGES FILE COLUMNS
---------------------

Here, the column naming uses a style common for use with Gephi. Edges, i.e. links, are assumed to be directional from a source to a target. There are 4 data columns.

1. SOURCE: Number: node 'ID' (q.v.) of the link's source node. ($ID)
2. TARGET: Number: node 'ID' (q.v.) of the link's target (destination) node. ($ID)
3. LINKTYPE: String: not used, as all rows have the default value 'cites'. The column can represent the hypertext link type used within the main Tinderbox file, e.g. authors' notes link to their articles with type 'authored', items citing an earlier article use 'cites', etc. (no source attribute - data added as part of export, derived from Tinderbox's link-base)
4. WEIGHT: Number: edge weighting. Not used, so all rows have the default value of 1. (added during export)

(Tinderbox source template: "node-edges")

---------------------
8. LICENCE FOR RE-USE
---------------------

This dataset is licensed under a Creative Commons Licence CC BY-NC-SA 4.0 (https://creativecommons.org/licenses/by-nc-sa/4.0/).

In overview, you are free to:

- Share: copy and redistribute the material in any medium or format
- Adapt: remix, transform, and build upon the material

The licensor cannot revoke these freedoms as long as you follow the license terms.

Under the following terms:

- Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
- NonCommercial: You may not use the material for commercial purposes.
- ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
- No additional restrictions: You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.

--------
9. NOTES
--------

Date of data collection: 2016–2021 (and ongoing for subsequent conferences)
Information about geographic location of data collection: n/a
Related projects/Funders: n/a
Related publication: n/a
Date that this file set (of 3 files) was created: (y-m-d) 2022-02-21
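
As an illustration of how the column descriptions in sections #6 and #7 might be put to use, the following is a minimal Python sketch using only the standard library. It assumes (this readme does not state it explicitly) that each CSV file starts with a header row carrying the column labels exactly as listed above. The function names are hypothetical and are not part of the dataset.

```python
import csv
from collections import Counter


def read_rows(path):
    """Read a CSV file into a list of dicts keyed by column label.

    Assumes the first row is a header row. 'utf-8-sig' reads plain
    UTF-8 and also tolerates a leading byte-order mark, if present.
    """
    with open(path, newline="", encoding="utf-8-sig") as handle:
        return list(csv.DictReader(handle))


def restore_abstract(text):
    """Swap the '####' delimiter in an 'Abstract' value back to line breaks."""
    return text.replace("####", "\n")


def is_paper(node):
    """True if the node's 'Is Paper' column holds 'true' (full/short paper)."""
    return node["Is Paper"].strip().lower() == "true"


def cited_by_counts(edges):
    """Count inbound links per node, keyed by the edge's TARGET node ID."""
    return Counter(edge["TARGET"] for edge in edges)


def cited_by_mismatches(nodes, edges):
    """List node IDs whose 'In-Conference Cited By' value differs from the
    number of inbound edges. That the two should agree is an assumption
    based on the column descriptions, not something the readme guarantees.
    """
    counts = cited_by_counts(edges)
    return [
        node["ID"]
        for node in nodes
        if int(node["In-Conference Cited By"]) != counts[node["ID"]]
    ]
```

Typical use would be to call read_rows("ht_nodes_2021.csv") and read_rows("ht_edges_2021.csv"), filter the nodes with is_paper() to restrict analysis to full/short papers, and apply restore_abstract() to the 'Abstract' column before display. All values are read as strings, so numeric columns need converting with int() as in cited_by_mismatches().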