READ ME File For 'In-conference citation data ACM Hypertext & ECHT' (as updated to 2021 data)

Dataset DOI: 10.5258/SOTON/D1870

ReadMe Author: Mark W. R. Anderson, 
               University of Southampton
               ORCID ID: 0000-0001-7396-0721


-----------------
TABLE OF CONTENTS
-----------------
1. DESCRIPTION OF THE DATA
2. DATASET CONTENTS
3. DATA SOURCES
4. DATASET CREATION
5. DATA RECORDS
6. NODES FILE COLUMNS
7. EDGES FILE COLUMNS
8. LICENCE FOR RE-USE
9. NOTES
------------------ 
--- END TOC ------


--------------------------
1. DESCRIPTION OF THE DATA
--------------------------
This data consists of node and edge tables allowing analysis of in-conference citations in 
the proceedings of ACM Hypertext 1987-2021, plus ECHT 1990/92/94. 


-------------------
2. DATASET CONTENTS
-------------------
This dataset contains this readme plus 2 data files (3 files total), all in UTF-8 
plain text; the data files use comma-separated values (CSV) format. The files should be 
readable in any text or code editor capable of reading plain text. The two data files are:

 - ht_nodes_2021.csv (metadata on each discrete paper)
 
 - ht_edges_2021.csv (links between papers)
 
The meaning of the per-column data in the data files is described in sections #6 and #7 
below.
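
As a minimal sketch of reading the two data files, assuming Python 3 with only its 
standard library, and assuming the first row of each file holds the column labels 
described in sections #6 and #7. The helper names (read_rows, load_csv) and the sample 
values are illustrative only and are not part of the dataset.

```python
import csv
import io


def read_rows(handle):
    """Return every row of an open CSV file as a dict keyed by column label."""
    return list(csv.DictReader(handle))


def load_csv(path):
    """Open a UTF-8 CSV file and return its rows (assumes a header row)."""
    # newline="" lets the csv module handle line endings inside quoted fields.
    with open(path, encoding="utf-8", newline="") as handle:
        return read_rows(handle)


# Invented two-column sample so the sketch runs without the real files.
sample = io.StringIO('ID,Label\n1,"A title, with a comma"\n')
sample_rows = read_rows(sample)
print(sample_rows[0]["Label"])
```

With the files in the working directory, load_csv("ht_nodes_2021.csv") and 
load_csv("ht_edges_2021.csv") would return the node and edge rows. All values come back 
as strings, so numeric columns need converting before use.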


---------------
3. DATA SOURCES
---------------
1. ECHT 1990. Data is drawn from the author's (paper) copy of the ECHT'90 Proceedings 
(Cambridge University Press, 1990, ISBN 0-521-40517-3). This is out of print and has no 
e-book version. Data was re-keyed by the author.

2. ACM Hypertext 1987-2021, plus ECHT 1992 and 1994 (i.e. all other conferences bar the 
above). Data was gathered via publicly accessible pages of the ACM Digital Library 
(https://dl.acm.org). Terms of Use precluded screen scraping and there is no API.

Although this information is publicly accessible (ECHT'90 less so), it does not exist in
the form of a dataset instantly available for *other* purposes, such as visualisation of 
citation trees.

The choice of a Creative Commons CC BY-NC-SA 4.0 licence respects the rights of the 
source data owners and prevents commercial exploitation of the significant effort 
involved in the creation of this dataset (at least, commercial exploitation *without* 
seeking different *prior* approval). See section #8 for further detail of the licence.

In further re-use the original sources (ACM & CUP) must be acknowledged.


-------------------
4. DATASET CREATION
-------------------
As creation of the dataset was initially exploratory, it being unclear what information
was available across the subject domain, the primary tool used was Eastgate's
Tinderbox (https://eastgate.com/Tinderbox/). Tinderbox '.tbx' files are stored as
well-formed XML (with XML-compatible encoding of styled RTFD text/embedded images). The 
tool has strong & flexible support for incremental formalisation of content and has 
very configurable export. The CSV files are created by export from Tinderbox.


---------------
5. DATA RECORDS
---------------
The dataset is drawn from data on 1,447 published items of which 1,073 were full/short 
papers - the primary citable academic content. Here, the terms 'paper' and 'item' are 
broadly interchangeable, though 'paper' likely refers to an item that is a full/short 
paper, i.e. 9+pp. vs. 4–6pp.

The data exported represent discrete items from the 1,412 items [sic] that have at least 
one inbound (cited by) OR outbound (cites) link. The export of both files is ordered as 
per the source, so by year (conference) and within year by page number within the printed 
proceedings.

The 903 nodes in the dataset thus represent 62.4% of the wider corpus. Thus 544 published 
items (37.6%) were never cited in-conference. Adjusting for 2021's 35 items that cannot 
yet have been cited (i.e. 1,412 total), 63.9% of items are in this dataset with 36% never 
having been cited at all in-conference. The nodes file includes 'Article Type' allowing 
further filtering (or use 'Is Paper') to look only within the 1,073 papers (which will 
give 789 nodes, 73.5% of that lesser set, with, as might be expected, a smaller percentage 
of un-cited items).
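
The proportions above can be re-derived from the quoted counts. The following is a small 
arithmetic sketch in Python; the variable names and the percent() helper are illustrative 
only, and the results agree with the figures in the text to within rounding.

```python
# Counts quoted in this section.
total_items = 1447   # all published items
node_count = 903     # items exported as nodes
items_2021 = 35      # 2021 items, too recent to have been cited
paper_count = 1073   # full/short papers
paper_nodes = 789    # papers among the nodes


def percent(part, whole):
    """Return part/whole as a percentage."""
    return 100 * part / whole


unlinked = total_items - node_count   # items absent from the export
citable = total_items - items_2021    # corpus excluding the 2021 items

print(f"nodes / all items:       {percent(node_count, total_items):.2f}%")
print(f"unlinked / all items:    {percent(unlinked, total_items):.2f}%")
print(f"nodes / pre-2021 items:  {percent(node_count, citable):.2f}%")
print(f"paper nodes / papers:    {percent(paper_nodes, paper_count):.2f}%")
```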

At the end of the descriptions (below) of each data column, its source Tinderbox attribute 
(aka field) is shown. Thus ($ArticleTitle) implies the 'Label' column data is drawn from 
the Tinderbox attribute named 'ArticleTitle'. The use of a node file column named 'Label' 
is deliberate as use of this data in Gephi requires a node column of that name.


---------------------
6. NODES FILE COLUMNS
---------------------
Node descriptions take the form "column-label: data type: description (Tinderbox source 
attribute)". There are 17 columns.

1. ID: 10-digit number: the UID of the source note in Tinderbox. ($ID)

2. Label: string: The published title of the paper/item. ($ArticleTitle)

3. Authors: string: The descriptive author listing, in one of four forms based on the 
number of authors: 1: "A", 2: "A & B", 3: "A, B & C", 4+: "A et al.". Also see the 
'Author Names' column for full names of each author. ($Name or $AltName - choice reflects 
internal Tinderbox title de-duplication issues not germane here)

4. In-Conference References: Number: count of HT conference published items referenced by 
this item. ($NumberInConfRefs)

5. In-Conference Cited By: Number: count of HT conference published items that have 
referenced this item. ($CitedCount)

6. Total References: Number: total count of the item's citation of references from any 
source (i.e. its list of References). ($NumberOfRefs)

7. Conference Proceedings: String: Formal title of the item's parent Conference's 
Proceedings. ($ConferenceProceedings)

8. Conference Title: String: Optional additional theme title of the Conference's 
Proceedings, so only for some conferences. ($ConferenceName)

9. Conference Abbreviation: String: Abbreviation used to refer to that conference. 
Initially 'Hypertext', e.g. Hypertext'89, it was later shortened to 'HT', e.g. HT'02. The 
year is always two digits. European conferences used 'ECHT', e.g. ECHT'92. In general 
use, the name and date may be separated, optionally, by a space and use either straight 
or typographic single quote types before the elided year value. Only straight quotes are 
used in this data. ($ConferenceShortTitle)

10. Conference Year: Number, 4-digit year YYYY: the year the conference was held. 
($PublicationYear)

11. DOI URL: String (a URL): the DOI-based URL of the published item in the DL.ACM. There 
is no such data for ECHT'90 as it is not formally an ACM conference and has no DL.ACM 
record. ($DOIUrl)

12. Author Names: String, as a comma+space list; list of author names in normal first/last 
order suitable for screen display. List order is as given in published paper. ($Authors)

13. First Author: String: The name of the paper's first author. ($FirstAuthor)
 
14. Article Type: String: the type of article the item represents, e.g. full paper, 
short paper, demo, poster, keynote, etc. ($ArticleType)

15. DOI PDF URL: String (a URL): the DOI-based URL of the PDF published item in the DL.ACM 
(requires login, so needs ACM account). There is no such data for ECHT'90 as it is not 
formally an ACM conference and has no DL.ACM record. ($DOIPdfUrl)

16. Is Paper: Boolean (true/false). A 'true' value indicates the item is a full or short 
paper. All other items are marked false. ($IsPaper)

17. Abstract: String: The text of the published item's abstract, if any. In-source line 
breaks are substituted with a '####' delimiter to make the CSV more robust; substitute 
'\n' or '\n\n' for the delimiter as appropriate for further use.

(Tinderbox source template: "node-wrapper1a")
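
A few of the column conventions above can be illustrated with a short Python sketch. It 
assumes the cell values follow the descriptions literally ('####' as the abstract 
delimiter, a comma+space separated 'Author Names' list, 'true'/'false' in 'Is Paper'). 
The function names and the sample row are hypothetical, and whether the 'Authors' column 
uses surnames or full names is not stated here, so short_author_form() only shows the 
four-way pattern.

```python
def short_author_form(names):
    """Build the descriptive 'Authors' form from an ordered list of names."""
    if not names:
        return ""
    if len(names) == 1:
        return names[0]
    if len(names) == 2:
        return f"{names[0]} & {names[1]}"
    if len(names) == 3:
        return f"{names[0]}, {names[1]} & {names[2]}"
    return f"{names[0]} et al."


def split_author_names(value):
    """Split an 'Author Names' cell on the comma+space separator."""
    # Assumption: no individual name itself contains ", ".
    return [name for name in value.split(", ") if name]


def restore_abstract(value, newline="\n"):
    """Replace the '####' delimiter in an 'Abstract' cell with line breaks."""
    return value.replace("####", newline)


def is_paper(row):
    """True when the row's 'Is Paper' cell reads 'true' (case-insensitive)."""
    return row.get("Is Paper", "").strip().lower() == "true"


# Invented row for illustration; not taken from the dataset.
row = {
    "Author Names": "Ann Example, Bob Sample, Cy Test, Di Demo",
    "Is Paper": "true",
    "Abstract": "First paragraph.####Second paragraph.",
}
names = split_author_names(row["Author Names"])
print(short_author_form(names))
print(restore_abstract(row["Abstract"]))
print(is_paper(row))
```

Passing newline="\n\n" to restore_abstract() gives paragraph-style spacing instead of 
single line breaks.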

---------------------
7. EDGES FILE COLUMNS
---------------------
Here, the column naming uses a style common for use with Gephi.

Edges, i.e. links, are assumed to be directional from a source to a target.

There are 4 data columns.

1. SOURCE: Number: node 'ID' (q.v.) of the link's source node. ($ID)

2. TARGET: Number: node 'ID' (q.v.) of the link's target (destination) node. ($ID)

3. LINKTYPE: String: not used; all rows have the default value 'cites'. The column can 
represent the hypertext link type used within the main Tinderbox file, e.g. author's 
notes link to their articles with type 'authored', items citing an earlier article use 
'cites', etc. 
(no source attribute - data added as part of export, derived from Tinderbox's link-base)

4. WEIGHT: Number: Edge weighting. Not used, so default value of 1. (added during export)

(Tinderbox source template: "node-edges")
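
Because each edge runs from a citing item (SOURCE) to a cited item (TARGET), per-node 
citation counts can be tallied directly from the edges file. The sketch below uses only 
the Python standard library and assumes the header row uses the upper-case column names 
listed above. The function name and the sample IDs are invented for illustration. The 
resulting totals could be compared with the 'In-Conference References' and 'In-Conference 
Cited By' node columns, though an exact match is an assumption rather than something 
stated in this readme.

```python
import csv
import io
from collections import Counter


def citation_counts(edge_rows):
    """Count outbound (cites) and inbound (cited by) links per node ID."""
    outbound = Counter(row["SOURCE"] for row in edge_rows)
    inbound = Counter(row["TARGET"] for row in edge_rows)
    return outbound, inbound


# Invented edges (IDs are placeholders, not real Tinderbox IDs).
sample = io.StringIO(
    "SOURCE,TARGET,LINKTYPE,WEIGHT\n"
    "30,10,cites,1\n"
    "30,20,cites,1\n"
    "20,10,cites,1\n"
)
edges = list(csv.DictReader(sample))
outbound, inbound = citation_counts(edges)
print(inbound["10"], outbound["30"])
```

IDs are kept as strings here, which is enough for counting and for joining against the 
'ID' column of the nodes file when that is read the same way.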

---------------------
8. LICENCE FOR RE-USE
---------------------
This dataset is licensed under a Creative Commons licence, CC BY-NC-SA 4.0
(https://creativecommons.org/licenses/by-nc-sa/4.0/).

In overview, you are free to:

Share — copy and redistribute the material in any medium or format
Adapt — remix, transform, and build upon the material

The licensor cannot revoke these freedoms as long as you follow the license terms.

Under the following terms:

Attribution — You must give appropriate credit, provide a link to the license, and 
indicate if changes were made. You may do so in any reasonable manner, but not in any way 
that suggests the licensor endorses you or your use.

NonCommercial — You may not use the material for commercial purposes.

ShareAlike — If you remix, transform, or build upon the material, you must distribute your 
contributions under the same license as the original.

No additional restrictions — You may not apply legal terms or technological measures that 
legally restrict others from doing anything the license permits.


--------
9. NOTES
--------
Date of data collection: 2016–2021 (and ongoing for subsequent conferences)

Information about geographic location of data collection: n/a

Related projects/Funders: n/a

Related publication: n/a

Date that this file set (of 3 files) was created: (y-m-d) 2022-02-21