Digitising the Turing Archive: A Pilot Study
Hall W, Hughes GV, Martinez K, Weal MJ, Wills GB.
March 2000.
Intelligence Agents Multimedia.
Department of Electronics and Computer Science
University of Southampton.
ISBN: 085432714-2
ECS Technical Report: IAM00-2
© 2000, University of Southampton
This report presents a summary of the pilot project to produce an on-line version of a selected portion of the archive of Alan Turing held at King's College, Cambridge. The design and creation of a database making use of information held in the archive catalogue is discussed. The production of a Web-based interface to access the on-line materials is described. The practical issues involved in digitising documents are covered and the lessons learnt from this process are included. Finally, the report presents an effort model and sample timings from which cost estimates can be obtained.
2. Creating the Web interface to the Archive Material.
3. Practical aspects of Capturing the Archive
3.1 Information Audit
3.2 Practical points when scanning
3.3 Post processing.
3.4 Effort Taken.
Appendix A Scanning process
B-1 Overheads.
B-2 Cost of Employment
B-3 The Equipment Cost
B-4 Additional Process Cost
B-5 Total Process Costs.
Appendix C Database Design Details
Information about the digitised images in the electronic archive is stored in a database. The database is used as the foundation for the dynamically created Web pages and for other applications to access the data. The design of the database has been influenced by the nature of the Turing archive and by the desire to use the same model for other archives. This section summarises the design and implementation of the database and the technology used to access the data.
The overall database design is implemented as two schemas. The information about files is held in 14 tables called MMarch, standing for Multimedia Archive. This is a standalone schema built before the project began, as part of an effort within the IAM group to build a database capable of storing meta information about any file stored on the Multimedia server. It has a number of fields for describing a multimedia file of any format. These fields are based on the Dublin Core Metadata project. The MMarch schema deliberately has no high-level structure built into it, so that it can be reused for all archives held on the Multimedia server. The second schema has been designed and written for this project to add structure to the file tables. The Turing schema has two layers: Items and Folders. Items contain Folders, which in turn contain Files. Files are held in the MMarch schema. The Items correspond to the items listed in the archive catalogue. Each File entry represents one image in the archive. The layer of Folders gives the database the necessary flexibility to describe how the Files are grouped together within the Items. The seven Turing tables have been designed to be flexible so that they can be readily reused for future archives held on the server. The full Entity Relationship diagram for the database is included in Appendix C.
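The Items, Folders and Files layering can be illustrated with a small sketch. This is not the project's DB2 schema (which is given in Appendix C): the table names, column names and sample rows below are hypothetical, the 14 MMarch tables are collapsed into a single file table, and SQLite stands in for DB2 so that the example is self-contained.

```python
import sqlite3

# Hypothetical, heavily simplified stand-in for the two schemas:
# one "file" table for MMarch, and Item/Folder tables for the archive layer.
SCHEMA = """
CREATE TABLE mmarch_file (
    file_id   INTEGER PRIMARY KEY,
    path      TEXT NOT NULL,
    dc_title  TEXT,              -- Dublin Core style descriptive fields
    dc_format TEXT
);
CREATE TABLE archive_item (
    item_id   INTEGER PRIMARY KEY,
    reference TEXT NOT NULL UNIQUE,   -- catalogue reference of the item
    title     TEXT
);
CREATE TABLE archive_folder (
    folder_id INTEGER PRIMARY KEY,
    item_id   INTEGER NOT NULL REFERENCES archive_item(item_id),
    label     TEXT
);
CREATE TABLE folder_file (
    folder_id INTEGER NOT NULL REFERENCES archive_folder(folder_id),
    file_id   INTEGER NOT NULL REFERENCES mmarch_file(file_id),
    sequence  INTEGER NOT NULL,       -- order of the image within the folder
    PRIMARY KEY (folder_id, file_id)
);
"""


def build_demo_db():
    """Create an in-memory database holding one item, one folder, two files."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO archive_item VALUES (1, 'EX/1', 'Example item')")
    conn.execute("INSERT INTO archive_folder VALUES (1, 1, 'Example folder')")
    conn.executemany(
        "INSERT INTO mmarch_file VALUES (?, ?, ?, ?)",
        [
            (1, "ex/1/page001.tif", "Page 1", "image/tiff"),
            (2, "ex/1/page002.tif", "Page 2", "image/tiff"),
        ],
    )
    conn.executemany(
        "INSERT INTO folder_file VALUES (?, ?, ?)", [(1, 1, 1), (1, 2, 2)]
    )
    return conn


def files_for_item(conn, reference):
    """Return the file paths of an item, walking Item -> Folder -> File."""
    rows = conn.execute(
        """
        SELECT f.path
        FROM archive_item i
        JOIN archive_folder d ON d.item_id = i.item_id
        JOIN folder_file ff   ON ff.folder_id = d.folder_id
        JOIN mmarch_file f    ON f.file_id = ff.file_id
        WHERE i.reference = ?
        ORDER BY d.folder_id, ff.sequence
        """,
        (reference,),
    )
    return [row[0] for row in rows]
```

Because the file table carries no archive-specific structure, a second archive could in principle add its own item and folder tables over the same file table, which is the kind of reuse the design aims for.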
A major part of this pilot study has been to design the database schema and construct a system to import the data into the tables in the database. It has been a significant task to gather and create the source files used to fill the fields of the tables. Documents such as the archive catalogue and information about the copyright status of documents have been used as a starting point for this process. Samples from the six files used as inputs to generate the database are included in Appendix C.
The system used for the project is the relational database DB2 from IBM. The Web server used is Apache and the dynamic Web page generation is achieved using the PHP language. This language allows Web pages to be written containing SQL queries to the archive database to build the Web pages dynamically. Appendix D contains some information about PHP and includes the PHP code behind the Browse page of the pilot Web site as an example.
2. Creating the Web interface to the Archive Material.
A Turing Archive test site was created consisting of approximately 40 pages. The aim of the website is to provide a clean interface for accessing the archival material.
The interface contains metadata and bibliographical information. The interface is designed to direct users with different backgrounds to appropriate archival material while at the same time not restricting access to the remaining archival material. The on-line archive currently consists of about one third of the available material from the Turing Archive held at King's College, Cambridge.
To identify the potential users, user scenarios were created. The potential users are divided into three broad categories.
From the user scenarios, it was possible to identify the type of resources the user might require access to.
The scenarios and user requirements lead naturally to the interface requirements for each user group.
The website consists of a front page that allows the user to choose one of several options.
Facilities exist for browsing and searching the on-line archive. The interface allowing direct access to the archive is perhaps more useful to the professional user, but anyone accessing the archive has these facilities available to them. The on-line catalogue pages are dynamically generated from the information held in the database. As well as browsing the on-line archive, users are able to view the full catalogue, listing all of the material that is present within the archive held at King's College, Cambridge. Where an on-line version of an item exists, links are supplied to the relevant files.
In addition to browsing the catalogue, some example trails have been constructed which are aimed at the different types of users. Two trails were designed to provide users with a general overview of Turing, his life and his works. Material for both trails was obtained from Andrew Hodges' website and a website at Manchester University that gives an overview of the History of Computing at Manchester. The first trail is aimed at the casual user, and as such each of the pages holds general information and provides links to material of a general nature held in the on-line archive. In addition, links are included to additional material held on other websites. Where a user follows a link to a page not held within the on-line archive, the target Web page is displayed in a new window. The intention of this is to help prevent the user from accidentally leaving the website and being unable to retrace their steps.
More detailed information can be obtained by clicking on specially created links that expand the current text by inserting additional text and links. By default these links are not underlined, in order to differentiate them from other links on the Web pages. The additional text is identified by a grey background, while the original text has a white background. The user knows that they have not left the trail because the same background colour is used throughout the trail. In addition, a trail map appears at the top and bottom of the page.
The second overview trail has similar navigational features, with a trail map provided at the top and bottom of each page. However, unlike the first trail there is no expanding text, as all the content is displayed on loading the page.
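The idea of generating catalogue pages from the database, with links only where an on-line version exists, can be sketched as follows. The pilot site does this in PHP against DB2 (the actual Browse page code is in Appendix D). The sketch below uses Python and SQLite purely for illustration, and the table names, column names, URLs and sample rows are all hypothetical.

```python
import sqlite3
from html import escape


def build_demo_catalogue():
    """Create a tiny in-memory catalogue; all rows are invented examples."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE catalogue (reference TEXT PRIMARY KEY, title TEXT NOT NULL);
        CREATE TABLE online_copy (reference TEXT PRIMARY KEY, url TEXT NOT NULL);
        INSERT INTO catalogue VALUES ('EX/1', 'Digitised example');
        INSERT INTO catalogue VALUES ('EX/2', 'Catalogue-only example');
        INSERT INTO online_copy VALUES ('EX/1', '/archive/ex-1/');
        """
    )
    return conn


def render_catalogue(conn):
    """Render the full catalogue as an HTML list.

    Every catalogue entry is listed; an entry becomes a link only when a
    matching row exists in online_copy (hence the LEFT JOIN).
    """
    rows = conn.execute(
        "SELECT c.reference, c.title, o.url "
        "FROM catalogue c LEFT JOIN online_copy o ON o.reference = c.reference "
        "ORDER BY c.reference"
    )
    lines = ["<ul>"]
    for reference, title, url in rows:
        label = f"{escape(reference)}: {escape(title)}"
        if url is not None:
            label = f'<a href="{escape(url)}">{label}</a>'
        lines.append(f"  <li>{label}</li>")
    lines.append("</ul>")
    return "\n".join(lines)
```

Calling `render_catalogue(build_demo_catalogue())` yields one linked entry and one plain entry, mirroring the distinction between digitised items and items that appear only in the catalogue.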
3. Practical aspects of Capturing the Archive
This section of the document summarises a report on the practical lessons learnt whilst digitising a portion of the Turing Archive [Wills 99], held at King's College, Cambridge.
There are similar documents available on the digitising of material, for example from the Higher Education Digitisation Service [Tanner 98] and a project titled Scoping the Future of the University of Oxford's Digital Library Collections [Lee 99]. The aim of this section of the report is not to repeat that information, but to highlight where the approach taken differs from that already reported.
3.1 Information Audit
An essential aspect of the scanning process is identifying the material to be digitised, especially if meaningful estimates of the time required are to be obtained. The information audit needs to focus on the type and size of documents, the amount of care required in handling, the type of protective covers used, and whether the document is loose-leaf or bound, as these factors will have an impact on the effort required.
The archive catalogue and audit information allowed the directory structure and naming conventions of the captured information to be decided upon. This simplifies the procedure used to capture the information. The information audit also aids the selection of the hardware to be used. For example, in the case of the Turing archive a large percentage of the material was foolscap; a conventional A4 scanner was therefore not sufficient, and an A3 scanner needed to be acquired. The information audit also allowed the size and type of the external storage media to be specified. In this case, an external SCSI hard disk was chosen.
Once the equipment had been purchased, a trial run of the complete process was undertaken. This demonstrated that the procedures used and the effort estimates were realistic. The trial run also allowed potential problems to be identified and solved prior to starting what could be a lengthy run. The file size for each of the captured document leaves could be quite large. For example, a foolscap, loose-leaf sheet using 24-bit colour at 300 dpi was approximately 24 MB in size. The size of the captured material came to approximately 11 GB for 786 files. The size of the files had a significant effect on the choice of back-up device.
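The storage figures above can be sanity-checked with a little arithmetic. The sketch below computes the uncompressed size of a scan from its dimensions, resolution and colour depth, and the average file size for the whole capture. The 8 x 13 inch foolscap dimensions are an assumption (the report does not state the scanned area), as is the use of 1 GB = 1024 MB. With those assumptions a single sheet comes to roughly 28 million bytes, the same order as the approximately 24 MB reported.

```python
def raw_scan_bytes(width_in, height_in, dpi=300, bits_per_pixel=24):
    """Uncompressed size in bytes of a scan of the given area.

    width_in and height_in are the scanned dimensions in inches.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


def average_file_mb(total_gb, n_files):
    """Average file size in MB, taking 1 GB as 1024 MB."""
    return total_gb * 1024 / n_files


# Assumed foolscap sheet of 8 x 13 inches at 300 dpi, 24-bit colour.
sheet_bytes = raw_scan_bytes(8, 13)

# Figures quoted in the report: about 11 GB for 786 files.
mean_mb = average_file_mb(11, 786)
```

The average of a little over 14 MB per file is well below the size of a full foolscap colour scan, which is what one would expect if the capture also included smaller documents.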
3.2 Practical points when scanning
One of the basic principles when capturing material for archival purposes is not to interpret the material or try to anticipate the proposed use of the material. Unless the material is an unaltered black and white photocopy or photograph, all scanning should be carried out in colour. This allows the researcher to see clearly the different inks and corrections made to a document. Tanner et al [Tanner 98] point out that this may also incur additional cost due to factors such as the increased storage space required and longer processing times.
The practical lessons learnt can be summarised as :
3.3 Post processing.
The digitisation process involves more than just scanning the documents. The following sections discuss some of the post-processing issues that arise after scanning.
Figure 1 Representation of pyramidal image
Figure 2 Typical manuscript containing corrections [AMT/C/25/15]
3.4 Effort Taken.
Tanner et al [Tanner 98] presented some prices as a guide, yet supplied no information on how the costs were derived. They point out that the reason for this is that many of the costs relate to internal resources. To estimate the costs it is first necessary to identify the effort required. In order to aid estimation of the effort for scanning documents, a model of the process can be used. Once the effort has been calculated then the cost can be calculated. Within cost accounting there are also a number of methods of allocating cost. Some of the main considerations when allocating costs when digitising material are presented in Appendix B.
The conceptual process model [BSI 92], shown in Figure 3, can be described as having:
Figure 3 The Capturing Process Model
Standard project techniques of estimating resources (and risk) should be applied to digitising of archival material. To aid this process historical data can be used, along with an engineering costing approach [Arnold 90]. Table 1 shows a summary of the time taken to scan pages from different document types.
Type of document | Average (minutes) | Tolerance +/- (minutes) |
Manuscript Foolscap | 1.70 | 0.930 |
Manuscript Foolscap Covered | 4.65 | 1.070 |
Photographs | 3.76 | 0.144 |
Correspondence | 2.54 | 0.036 |
Booklet | 2.81 | 1.524 |
Table 1 The time taken to scan documents.
The main effort with existing paper documentation is converting the information into an electronic format. This task can be sub-divided into five main activities, see Table 2.
Activity Number (i) | Activity Description |
1 | Collecting the information to be converted. |
2 | Converting the information to a standard raster format. |
3 | Cleaning-up the information after conversion. |
4 | Processing the information into a vector format. |
5 | Saving the information with appropriate file names and file hierarchy. |
Table 2 List of activities associated with digitising non-electronic media documents
Scanning is suitable for the majority of paper documents. However, large paper items, e.g. blueprints, may be best digitally photographed, or photographed by conventional means and the negatives then scanned. There are also equivalent processes for converting microfiche documents.
The effort required to deal with the paper information (E) is the sum of the effort for gathering, converting, correcting, and processing the paper information. When estimating the time to carry out each process it is assumed that time is included for verifying the work. The individual processes that make up the converting of paper based information to electronic information depend on: -
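Consistent with the description above, the effort is expressed as the sum of the efforts for the activities listed in Table 2:
$$E = \sum_{i=1}^{5} E_i \qquad (1)$$
where $E_i$ is the effort, in person hours, for activity $i$.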
The effort estimate, in person hours, does not allow for breaks, distractions or other work-related activities. When converting person hours to days, allow only 6 hours per day; when converting to weeks, use 30 hours per week.
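As a minimal sketch of how the sample timings and the conversion rule might be combined, the code below estimates the scanning effort for a batch and converts it to working days and weeks. The average times come from Table 1 and are treated here as minutes per page; the batch composition is invented for illustration, and only the scanning activity is covered, not the other activities of Table 2.

```python
# Average scanning times from Table 1, treated as minutes per page.
TABLE_1_MINUTES = {
    "Manuscript Foolscap": 1.70,
    "Manuscript Foolscap Covered": 4.65,
    "Photographs": 3.76,
    "Correspondence": 2.54,
    "Booklet": 2.81,
}

# Productive hours assumed by the report when converting person hours.
HOURS_PER_DAY = 6
HOURS_PER_WEEK = 30


def scanning_hours(page_counts):
    """Person hours needed to scan a batch, given page counts per document type."""
    minutes = sum(TABLE_1_MINUTES[kind] * count for kind, count in page_counts.items())
    return minutes / 60


def to_days_and_weeks(person_hours):
    """Convert person hours to (working days, working weeks)."""
    return person_hours / HOURS_PER_DAY, person_hours / HOURS_PER_WEEK


# Hypothetical batch: 600 foolscap manuscript pages and 300 letters.
batch = {"Manuscript Foolscap": 600, "Correspondence": 300}
hours = scanning_hours(batch)            # about 29.7 person hours
days, weeks = to_days_and_weeks(hours)   # about 4.95 days, about 0.99 weeks
```

Using 6 rather than 8 hours per day is what turns a figure that looks like under four days of work into a full working week.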
The timings in Table 1 are average times to scan a document. Each task therefore carries a tolerance, and the cumulative tolerance can be quite large. However, by using statistical tolerancing, the overall tolerance for the time to capture is much less (and more realistic) than the arithmetic sum of the tolerances [Burr 76]. When people are first introduced to a task, they frequently take longer to perform that task than when they have repeated the task a number of times. This is known as the learning-curve effect [Drury 96]. The learning curve is based on real world observations and hence the relationships described are empirical [Arnold 90]. The learning curve can be applied to each of the input tasks identified in the process model.
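Both ideas can be illustrated numerically. The sketch below assumes that statistical tolerancing means the usual root-sum-square combination of independent tolerances, and that the learning curve takes the common power-law form in which the time per repetition falls by a fixed percentage each time the number of repetitions doubles. Neither formula is spelled out in this report, and the 80% learning rate is an illustrative value only.

```python
import math


def arithmetic_tolerance(tolerances):
    """Worst-case combined tolerance: the plain sum of the individual tolerances."""
    return sum(tolerances)


def statistical_tolerance(tolerances):
    """Root-sum-square combined tolerance, assuming the tolerances are independent."""
    return math.sqrt(sum(t * t for t in tolerances))


def learning_curve_time(first_time, repetition, rate=0.8):
    """Time for the n-th repetition under a power-law learning curve.

    With rate=0.8, each doubling of the repetition count reduces the
    time per repetition to 80% of its previous value.
    """
    return first_time * repetition ** math.log2(rate)


# 100 foolscap manuscript pages, each with the Table 1 tolerance of 0.93 minutes.
page_tolerances = [0.93] * 100
worst_case = arithmetic_tolerance(page_tolerances)    # about 93 minutes
realistic = statistical_tolerance(page_tolerances)    # about 9.3 minutes
```

For n equal tolerances the root-sum-square grows with the square root of n rather than with n, which is why the statistical figure for a long run is so much smaller than the arithmetic sum.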
We have discussed the results of a pilot study to create an electronic archive of a portion of the Alan Turing archive. The design and implementation of the database has been described, including the need for a flexible database architecture for reuse with other archives. Aspects of the importing of data into the archive were covered, describing how available material such as the archive catalogue could be utilised. A Web-based interface to the on-line archive was described. This made use of technologies such as the dynamic creation of Web pages using PHP and the use of an external link service (the DLS).
The digitising of paper documents by scanning is a complex process that involves a number of considerations. To simplify the task a considerable amount of preparation is required. The preparation requires a thorough information audit and often a trial run. This allows a comprehensive procedure to be put in place that will ensure the whole process is simplified.
Key word extraction using OCR techniques is not viable for handwritten manuscripts and may not be cost effective for typed manuscripts that have been amended or corrected by hand.
The key to costing the conversion process is in understanding the effort required. Hence a simple process model has been developed. In addition, some sample times are presented that should allow a rough estimate of the time taken for future work.
The authors wish to thank The Institution of Electrical Engineers (IEE), The British Computer Society and the Department of Electronics and Computer Science, University of Southampton for funding this pilot project. The authors acknowledge the Archive Centre at King's College, Cambridge for assistance and for granting access to the Turing Archive. The authors would like to thank Dr. Jonathan Swinton and Dr. Andrew Hodges for the use of some of their Web pages within the trial Web site.
[Arnold 90] | Arnold J, Hope T. Accounting for Management Decision 2nd edition, Prentice Hall, 1990. |
[Burr 76] | Burr I.W. Statistical Quality Control Methods. Marcel Dekker Inc. 1976. |
[Drury 96] | Drury C. Management and Cost Accounting 4th edition. Thomson 1996. |
[Martinez 98] | Martinez K, Cupitt J, Perry S.T. High resolution colorimetric image browsing on the Web. The Seventh International World Wide Web Conference, 14 - 18 April 1998, Brisbane, Australia. |
[Tanner 98] | Tanner S, Robinson B. The Refugee Studies Programme Digital Library Feasibility Study. Published by HEDS, University of Hertfordshire, 1998. |
[Lee 99] | Lee S.D. Scoping the Future of the University of Oxford's Digital Library Collections. Published Oxford University, 1999. Available at http://www.bodley.ox.ac.uk/scoping/ |
[Moore 98] | Moore M. As Useful as ABC? IEE Manufacturing Engineering Vol. 77, No. 2. April 1998. pp 92-94. |
[Reynolds 92] | Reynolds AJ. The finance of Engineering Companies, An introduction for students and practising Engineers. Edward Arnold, 1992. |
[Wills 98] | Wills GB, Heath I, Crowder RM, Hall W. A Model for Authoring and Costing an Industrial Hypermedia Application. University of Southampton Report No. MM98-6 November 1998. ISBN Number 085432685-5. |
[Wills 99] | Wills G.B, Hughes G.V, Martinez K, Hall W. Practical Aspects of Capturing the Turing Archive. University of Southampton Report No. MM99-7 December 1999. ISBN Number 0854327053. |