READ ME File For 'In-conference citation data ACM Hypertext & ECHT' (as updated to 2021 data, 2.ed)

Dataset DOI: 10.5258/SOTON/D1870

ReadMe Author: Mark W. R. Anderson, 
               University of Southampton
               ORCID ID: 0000-0001-7396-0721


-----------------
TABLE OF CONTENTS
-----------------
1. DESCRIPTION OF THE DATA
2. DATASET CONTENTS
3. DATA SOURCES
4. DATASET CREATION
5. DATA RECORDS
6. NODES FILE COLUMNS
7. EDGES FILE COLUMNS
8. LICENCE FOR RE-USE
9. NOTES
------------------ 
--- END TOC ------


--------------------------
1. DESCRIPTION OF THE DATA
--------------------------
This data consists of node and edge tables allowing analysis of in-conference citations in 
the proceedings of ACM Hypertext 1987-2021, plus ECHT 1990/92/94. 

Changes in this update:

 - Dataset now includes all published items (1,447), not just the 903 items with at least 1
   in-conference inbound *or* outbound citation link.
 - Dataset now includes author keywords (column #17), and a boolean indicating papers in
   the cited/citing set (column #18).
 - An unfortunate number of typos in the previous readme are now fixed.

This dataset does not currently include the full de-duped table of authors or the author
gender information used in the HT'22 paper (https://doi.org/10.1145/3511095.3531271) that
reports on this dataset. However, figures used in that paper are all based on this dataset 
(and its richer source Tinderbox file).

-------------------
2. DATASET CONTENTS
-------------------
This dataset contains this readme plus 2 data files (3 files total), all in UTF-8 
plain text; the data files use comma-separated values (CSV) format. The files should be 
readable in any text or code editor capable of reading plain text. The two data files are:

 - ht_nodes_2021-1.csv (metadata on each discrete paper)
 
 - ht_edges_2021-1.csv (links between papers)
 
The meaning of the per-column data in the data files is described in sections #6 and #7 
below.
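
As a minimal sketch, the two data files can be read with Python's standard csv module. 
This assumes each file starts with a row of column heads (as the export templates noted 
in sections #6 and #7 suggest) and that the files sit in the current working directory. 
The helper name read_rows is purely illustrative and not part of the dataset.

```python
import csv
from pathlib import Path


def read_rows(path):
    """Return the rows of a UTF-8 CSV file as a list of dicts keyed by column head."""
    # "utf-8-sig" reads plain UTF-8 and also tolerates an optional byte-order mark.
    # newline="" lets the csv module deal with any line breaks inside quoted fields.
    with open(path, encoding="utf-8-sig", newline="") as handle:
        return list(csv.DictReader(handle))


# Assumption: the data files are in the current working directory.
for name in ("ht_nodes_2021-1.csv", "ht_edges_2021-1.csv"):
    if Path(name).exists():
        rows = read_rows(name)
        heads = list(rows[0].keys()) if rows else []
        print(name, "->", len(rows), "rows; columns:", heads)
```

All values arrive as strings, so numeric and boolean columns need converting before use.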


---------------
3. DATA SOURCES
---------------
1. ECHT 1990. Data is drawn from the author's (paper) copy of the ECHT'90 Proceedings 
(Cambridge University Press, 1990, ISBN 0-521-40517-3). This is out of print and has no 
e-book version. Digital data was re-keyed by this dataset's author from the book and 
from OCR'd scans of the same.

2. ACM Hypertext 1987-2021, plus ECHT 1992 and 1994 (i.e. all other conferences bar the 
above). Data was gathered via publicly accessible pages of the ACM Digital Library 
(https://dl.acm.org), with corrections where errors in the master DL record were detected.
DL.ACM has no API and its Terms of Use specifically preclude screen-scraping: the library
HTML pages lack meaningful semantic mark-up such as would assist screen scraping even were
it to be authorised. Other sources such as dblp.org and Google Scholar certainly repeat 
errors in the DL.ACM records even if not adding others.

Although this dataset's information is publicly accessible (ECHT'90 less so), it does not 
exist in the form of a meaningful dataset instantly available for *other* purposes, such
as visualisation of citation trees.

The choice of a Creative Commons CC BY-NC-SA 4.0 licence respects the rights of the 
source data owners and prevents commercial exploitation of the significant effort 
involved in the creation of this dataset (at least, commercial exploitation *without* 
seeking different *prior* approval). See section #8 for further detail of the licence.

In further re-use the original sources (ACM & CUP) must be acknowledged.


-------------------
4. DATASET CREATION
-------------------
As creation of the dataset was initially exploratory, it being unclear what information
was available across the subject domain, the primary tool used was Eastgate's
Tinderbox (https://eastgate.com/Tinderbox/).  Tinderbox '.tbx' files are stored as
well-formed XML (with XML-compatible encoding of style RTFD text/embedded images). The 
tool has strong & flexible support for incremental formalisation of content and has 
very configurable export. The CSV files are created by export from Tinderbox.


---------------
5. DATA RECORDS
---------------
The dataset is drawn from data on 1,447 published items of which 1,079 were full/short 
papers - the primary citable academic content. Here, the terms 'paper' and 'item' are 
broadly interchangeable, though 'paper' usually refers to an item that is a full/short 
paper (i.e. 9+pp. vs. 4–6pp.), which is flagged with the 'Is Paper' boolean column.

Within this, 903 items have 1 or more inbound *or* outbound in-conference 
citation links. The export of both files is ordered as per the source, i.e. by year 
(conference) and, within year, by page number within the printed proceedings.

This sub-set of 903 nodes thus represents 62.4% of the wider corpus. 
Thus 544 published items (37.6%) were never cited in-conference. Adjusting for 2021's 
35 items that cannot yet have been cited (i.e. 1,412 total), 63.9% of items are in the 
cited/citing sub-set with 36% never having been cited at all in-conference. The nodes 
file includes 'Article Type' allowing further filtering (or use 'Is Paper') to look only 
within the 1,073 papers (which will give 789 nodes, 73.5% of that lesser set, with, as 
might be expected, a smaller percentage of un-cited items).
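
The percentages in the paragraph above can be re-derived from the quoted counts. The 
following is only a worked check of that arithmetic, using the figures exactly as given 
in the text; the variable and function names are illustrative.

```python
# Counts as quoted in the text above.
total_items = 1447    # all published items
linked_items = 903    # items with at least one in-conference citation link
items_2021 = 35       # 2021 items, which cannot yet have been cited
papers = 1073         # full/short papers, as quoted for the 'Is Paper' filter
linked_papers = 789   # papers with at least one in-conference citation link


def pct(part, whole):
    """Percentage of part in whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


unlinked_items = total_items - linked_items   # 544
adjusted_total = total_items - items_2021     # 1412

print("linked share:", pct(linked_items, total_items))
print("unlinked share:", pct(unlinked_items, total_items))
# Printed to two decimals, as this value sits very close to a rounding boundary.
print("linked share excluding 2021:", f"{100 * linked_items / adjusted_total:.2f}")
print("linked share of papers:", pct(linked_papers, papers))
```

The share excluding 2021 comes out at roughly 63.95%, so it shows as 63.9% or 64.0% 
depending on whether the value is truncated or rounded.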

At the end of the descriptions (below) of each data column, its source Tinderbox attribute 
(aka field) is shown. Thus ($ArticleTitle) implies the 'Label' column data is drawn from 
the Tinderbox attribute named 'ArticleTitle'. The use of a node file column named 'Label' 
is deliberate as use of this data in Gephi requires a node column of that name.


---------------------
6. NODES FILE COLUMNS
---------------------
Node descriptions take the form "column-label: data type: description (Tinderbox source 
attribute)". There are 19 columns.

1. ID: 10-digit number: the UID of the source note in Tinderbox. ($ID)

2. Label: string: The published title of the paper/item. ($ArticleTitle)

3. Authors: string: A descriptive author listing in one of four forms, chosen by the 
number of authors. Thus, 1: "A", 2: "A & B", 3: "A, B & C", 4+: "A et al.". Also see 
the 'Author Names' column for full names of each author. ($Name or $AltName - choice 
reflects internal Tinderbox title de-duplication issues not germane here)

4. In-Conference References: Number: count of HT conference published items referenced by 
this item. ($NumberInConfRefs)

5. In-Conference Cited By: Number: count of HT conference published items that have 
referenced this item. ($CitedCount)

6. Total References: Number: total count of the item's citation of references from any 
source (i.e. its list of References). ($NumberOfRefs)

7. Conference Proceedings: String: Formal title of the item's parent Conference's 
Proceedings. ($ConferenceProceedings)

8. Conference Title: String: Optional additional theme title of the Conference's 
Proceedings, so only for some conferences. ($ConferenceName)

9. Conference Abbreviation: String: Abbreviation used to refer to that conference. 
Initially 'Hypertext', e.g. Hypertext'89, it was later shortened to 'HT', e.g. HT'02. The 
year is always two digits. European conferences used 'ECHT', e.g. ECHT'92. In general 
use, the name and date may be separated, optionally, by a space and use either straight 
or typographic single quote types before the elided year value. Only straight quotes are 
used in this data. ($ConferenceShortTitle)

10. Conference Year: Number, 4-digit year YYYY: the year the conference was held. 
($PublicationYear)

11. DOI URL: String (a URL): the DOI-based URL of the published item in the DL.ACM. There 
is no such data for ECHT'90 as it is not formally an ACM conference and has no DL.ACM 
record. ($DOIUrl)

12. Author Names: String, as a comma+space list; list of author names in normal first/last 
order suitable for screen display. List order is as given in published paper. ($Authors)

13. First Author: String: The name of the paper's first author. ($FirstAuthor)
 
14. Article Type: String: the type of article the item represents, e.g. full paper, 
short paper, demo, poster, keynotes, etc. ($ArticleType)

15. DOI PDF URL: String (a URL): the DOI-based URL of the PDF published item in the DL.ACM 
(requires login, so needs ACM account). There is no such data for ECHT'90 as it is not 
formally an ACM conference and has no DL.ACM record. ($DOIPdfUrl)

16. Is Paper: Boolean (true/false). A 'true' value indicates the item is a full or short 
paper. All other items are marked false. ($IsPaper)

17. Keywords: String: a quote-enclosed comma+space delimited list of the paper's 
author-supplied keywords. ($RefKeywords)

18. Has Citation Link: Boolean (true/false): true if this item has one or more 
in/outbound links in the corpus, i.e. the paper either cites another paper or is cited 
by one at least once within the overall corpus of conference items. ($HasCitationLink)

19. Abstract: String: The text of the published item's abstract, if any. In-source line 
breaks are substituted with a '####' delimiter to make the CSV more robust; replace the 
delimiter with '\n' or '\n\n' as appropriate for further use.

(Tinderbox source templates: column heads from "node-wrapper5", values from "node-sub-item5")
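
Several of the node columns described above need light post-processing after the CSV is 
read. The sketch below shows one way to do this under stated assumptions: list values 
are split on comma+space (so a name or keyword that itself contains ", " would be split 
wrongly), booleans are the literal strings 'true'/'false', and '####' marks each original 
line break in an abstract. The function names are illustrative only. What "A" stands for 
in the 'Authors' forms (surname or full name) is not stated here, so describe_authors 
simply uses whatever strings it is given.

```python
def split_list(value):
    """Split a comma+space delimited field (e.g. Author Names, Keywords) into a list."""
    return [part for part in value.split(", ") if part] if value else []


def restore_abstract(value, line_break="\n\n"):
    """Replace the '####' delimiter in an abstract with a chosen line break."""
    return value.replace("####", line_break)


def as_bool(value):
    """Interpret a 'true'/'false' string (e.g. Is Paper, Has Citation Link)."""
    return value.strip().lower() == "true"


def describe_authors(names):
    """Build the short descriptive form used in the 'Authors' column."""
    if not names:
        return ""
    if len(names) == 1:
        return names[0]
    if len(names) == 2:
        return f"{names[0]} & {names[1]}"
    if len(names) == 3:
        return f"{names[0]}, {names[1]} & {names[2]}"
    return f"{names[0]} et al."


# Made-up values, purely to show the calls.
print(split_list("hypertext, citation analysis"))
print(restore_abstract("First paragraph.####Second paragraph."))
print(as_bool("true"), describe_authors(["A", "B", "C", "D"]))
```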

---------------------
7. EDGES FILE COLUMNS
---------------------
Here, the column naming uses a style common for use with Gephi.

Edges, i.e. links, are assumed to be directional from a source to a target.

There are 4 data columns.

1. SOURCE: Number: node 'ID' (q.v.) of the link's source node. ($ID)

2. TARGET: Number: node 'ID' (q.v.) of the link's target (destination) node. ($ID)

3. LINKTYPE: String: not used - all use default value 'cites'. Can represent the hypertext 
link type used within the main Tinderbox file, e.g. author's notes link to their articles 
with type 'authored'. Items citing earlier article use 'cites', etc. 
(no source attribute - data added as part of export, derived from Tinderbox's link-base)

4. WEIGHT: Number: Edge weighting. Not used, so default value of 1. (added during export)

(Tinderbox source templates: column heads from "edge-wrapper", values from "edge-item")
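
As an illustration of using the edge columns, the sketch below tallies outbound and 
inbound links per node ID. It assumes the column heads are spelled exactly 'SOURCE' and 
'TARGET', and that an edge runs from the citing item (source) to the cited item (target), 
as the 'cites' link type suggests. The IDs in the sample are made up; real IDs are the 
10-digit values from the nodes file.

```python
import csv
import io
from collections import Counter


def citation_counts(edge_file):
    """Count outbound links per SOURCE and inbound links per TARGET."""
    cites = Counter()      # node ID -> number of items it cites
    cited_by = Counter()   # node ID -> number of items citing it
    for row in csv.DictReader(edge_file):
        cites[row["SOURCE"]] += 1
        cited_by[row["TARGET"]] += 1
    return cites, cited_by


# A tiny made-up edge table in the same four-column layout.
sample = io.StringIO(
    "SOURCE,TARGET,LINKTYPE,WEIGHT\n"
    "101,100,cites,1\n"
    "102,100,cites,1\n"
    "102,101,cites,1\n"
)
cites, cited_by = citation_counts(sample)
print("cites:", dict(cites))
print("cited by:", dict(cited_by))
```

For the real file, pass an open file handle instead of the sample. The resulting tallies 
should be comparable with the 'In-Conference References' and 'In-Conference Cited By' 
columns of the nodes file, though that correspondence is an expectation rather than 
something verified here.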

---------------------
8. LICENCE FOR RE-USE
---------------------
This dataset is licensed under a Creative Commons Licence CC BY-NC-SA 4.0
(https://creativecommons.org/licenses/by-nc-sa/4.0/).

In overview, you are free to:

Share — copy and redistribute the material in any medium or format
Adapt — remix, transform, and build upon the material

The licensor cannot revoke these freedoms as long as you follow the license terms.

Under the following terms:

Attribution — You must give appropriate credit, provide a link to the license, and 
indicate if changes were made. You may do so in any reasonable manner, but not in any way 
that suggests the licensor endorses you or your use.

NonCommercial — You may not use the material for commercial purposes.

ShareAlike — If you remix, transform, or build upon the material, you must distribute your 
contributions under the same license as the original.

No additional restrictions — You may not apply legal terms or technological measures that 
legally restrict others from doing anything the license permits.


--------
9. NOTES
--------
Date of data collection: 2016–2021 (and ongoing for subsequent conferences)

Information about geographic location of data collection: n/a

Related projects/Funders: n/a

Related publication: 

1. At HT'22. "Hypertext’s meta-history: Documenting in-conference citations, authors 
and keyword data, 1987-2021" URL: https://doi.org/10.1145/3511095.3531271 

Date that this file set (of 3 files) was created: (y-m-d) 2022-06-01