Practical Aspects of Capturing the Turing Archive
Wills GB, Hughes GV, Martinez K, Hall W.
Technical Report No. M99-7, December 1999
ISBN: 0854327053
Copyright © 1999, University of Southampton. All rights reserved.
Contents
2. Information Audit and effort estimation
3. Practical points when scanning
Appendix A Scanning Effort
Appendix B Scanning process
Appendix C Costing
This report presents a summary of the practical lessons gained while digitising a large proportion of the Turing archive. It also presents an effort model, and sample timings from which cost estimates can be obtained.
1. Introduction
This report presents the practical lessons learnt whilst digitising a large proportion of the Turing Archive, held at King's College, Cambridge. Similar documents on digitising material are available, for example from the Higher Education Digitisation Service [Tanner 98] and a project titled Scoping the Future of the University of Oxford's Digital Library Collections [Lee 99]. The aim of this report is therefore not to repeat that information, but to present those aspects of the approach taken that differ from what has already been reported.
2. Information Audit and effort estimation
An essential aspect of the scanning process is in identification of the material to be digitised, especially if meaningful estimates of the time required are to be obtained. The information audit needs to focus on the type and size of documents, the amount of care required in handling, the type of protective covers used, and whether the document is loose leafed or bound, as these factors will have an impact on the effort required.
The archive catalogue and audit information allowed the directory structure and naming conventions for the captured information to be decided in advance, which simplified the capture procedure. At this point it is worth noting that the foliation number may not necessarily represent the actual order of the document leaves, but rather the order in which the archive received them, especially where the content is of a complex mathematical nature.
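As an illustration of such a convention (the scheme below is hypothetical; the actual directory structure and file names chosen for the archive are not reproduced in this report), catalogue references like AMT/C/25/15 can be mapped mechanically to paths and foliation-numbered file names:

```python
def image_path(catalogue_ref, folio, ext="tif"):
    """Map a catalogue reference (e.g. "AMT/C/25/15") and a foliation
    number to a directory path and file name. Hypothetical scheme,
    for illustration only."""
    parts = catalogue_ref.split("/")           # ["AMT", "C", "25", "15"]
    directory = "/".join(parts[:-1])           # group a document's scans together
    name = "_".join(parts) + f"_f{folio:03d}.{ext}"
    return f"{directory}/{name}"

print(image_path("AMT/C/25/15", 1))   # AMT/C/25/AMT_C_25_15_f001.tif
```

Deciding such a scheme before scanning begins means each captured image can be filed immediately, without revisiting earlier decisions mid-run.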
The information audit will also aid the selection of the hardware to be used. For example, in the case of the Turing archive a large percentage of the material was foolscap, so a conventional A4 scanner was not sufficient and an A3 scanner was purchased. The information audit also allowed the size and type of the external storage media to be specified; in this case, an external SCSI hard disk was chosen.
Once the equipment had been purchased, a trial run of the complete process was undertaken. This demonstrated that the procedures used and the effort estimates were realistic, and allowed potential problems to be identified and solved prior to starting what could be a lengthy run.
3. Practical points when scanning
One of the basic principles when capturing material for archival purposes is not to interpret the material or its proposed use. Therefore, unless the material is an unaltered black and white photocopy or photograph, all scanning should be carried out in colour. This allows the researcher to see clearly the different inks and corrections made to a document. Tanner et al [Tanner 98] point out that this may also incur additional cost due to factors such as the increased storage space required, longer processing times, etc.
To ensure that foliation numbers are not intrusive to the reader, it is common practice to append the foliation number in a corner, as close to the edge of the document as possible. However, with most scanners the first millimetre of travel is not recorded in the scanned image, so the foliation number can be truncated when an image is butted up against the edge of the glass. Therefore, the document needs to be aligned with marks on the side ruler, i.e. 1mm back from the top edge, to ensure that the foliation number can be clearly seen. This allows the researcher to reference the document accurately.
Similarly, writers do not always leave a clear margin at the edge of the paper. Therefore, to make it clear to researchers, who can only view the document on-line, that no information was lost in digitising the material, a small but distinct amount of over-scanning (so that the edges of the leaf are clearly visible) should take place.
The document, or a leaf of it, may not be square, especially with older documents or when leaves are placed in a protective cover. Two problems then arise. First, when the paper is aligned with one edge along the side ruler and set back 1mm, the foliation number or information near an edge of the document may still be truncated; the document then needs to be placed further from the top edge than first anticipated. Secondly, as over-scanning is required to ensure no information is lost, images will have an irregularly shaped white border, which may be removed or straightened during post-processing if required.
To capture a document by scanning, it is common for the software to pre-scan the glass and then allow an area to be selected for scanning. To speed up the scanning of leaves from the same document that are all the same size, an area slightly larger than the leaf should be selected in the preview scan. Provided the leaf is placed in a similar position each time, the pre-scan and area selection can then be omitted for subsequent leaves, while still taking the points raised above into consideration.
The scanning process uses a bright light in order to capture the content of a document. Unfortunately this will often reveal the content of the other side of the page. To stop this showing through on the scanned image, a dark piece of card is placed behind the leaf being scanned; watermarks and ink stains that are normally visible still remain visible in the scanned image. The card used was black, but on a small percentage of the images this may have affected the colour shading of the scanned image; a dark grey or similar colour would perhaps be better.
The process of scanning the documents essentially involves using a computer, so the normal health and safety rules for computer operators apply: correct seating arrangements, the VDU placed at the correct height, etc. In addition, regular breaks from viewing the screen must be allowed for, and this will have an impact on the effort estimation. Regular breaks are important for another reason: although scanning is a mundane task, it requires concentration in order to keep track of which documents have been scanned. Hence, it would be difficult for a person to carry out the digitisation process and perform another task at the same time.
The settings used for the scanning process were 24-bit colour at 300 dpi, with images saved in TIFF (Tagged Image File Format).
The size of the file varied with the setting and the original document. However, as an example, a foolscap loose leaf sheet using 24-bit colour at 300 dpi was approximately 24Mb in size. The size of the captured material was approximately 11Gb for 786 files. The size of the files had a significant effect on the choice of back-up device. To ensure that files are transferred quickly and effectively, two hard disks were used. One was the usual device holding the operating system and the swap file, the other simply held the data. This allowed the scanning software to transfer the temporary file it created on the system device quickly to the back-up device.
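The quoted file size can be sanity-checked from the scan settings alone. A minimal sketch, assuming an A4 sheet of 8.27 x 11.69 inches (an uncompressed 24-bit TIFF adds only a small header on top of the raw pixel data):

```python
def raw_scan_bytes(width_in, height_in, dpi=300, bytes_per_pixel=3):
    """Uncompressed image size: pixels across x pixels down x bytes per pixel.
    bytes_per_pixel=3 corresponds to 24-bit colour."""
    return round(width_in * dpi) * round(height_in * dpi) * bytes_per_pixel

a4 = raw_scan_bytes(8.27, 11.69)       # A4 at 300 dpi, 24-bit colour
print(f"{a4 / 2**20:.1f} MiB")         # ~24.9 MiB, matching the ~24Mb quoted
```

Foolscap sheets are larger than A4, which is consistent with the report's backup total of roughly 11Gb for under 800 files.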
4. Post-scanning processes
The digitisation process involves more than just scanning the documents. This section presents some of the post-scanning issues.
4.1 Viewing images across the Internet.
Using the settings above, the image of a scanned A4 sheet creates a 24Mb file, which is impractical to transmit across the Internet with today's technology. Even after the file has been compressed it can still be impractical; a 24Mb image when compressed is still several hundred kilobytes of data. Compression is used to reduce network bandwidth, thereby reducing transmission time, disk seeking and zoom/pan processing.
However, people often do not need the whole image all of the time: typically they view the whole document to get its look and feel, then focus on a region of interest within it. Hence, a process was required whereby a person could view the image, zoom in and out, and maintain the quality of the original scan, all without having to wait for a very large file to download.
A process developed by Martinez et al [Martinez 98] uses tiled JPEG-in-TIFF images as a method of pyramidal decomposition of the image to provide lossless images suitable for browsing, see Figure 1.
Figure 1 Representation of pyramidal image
The system fixes the large tile size at 800x600 pixels and stores each tile in a separate file with a three-letter name extension related to the tile position. The whole process is automated; for the 768 files scanned from the Turing archive it took approximately 6 hours (an overnight process) to create all the layers. The images are held as tiled pyramidal JPEG TIFF files on a server, and are transmitted via standard HTTP requests to a Java client. Only the portions of the image needed are requested and transmitted.
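The pyramidal decomposition can be sketched as repeated halving of the image until it fits within a single tile, with each level cut into fixed-size tiles. A rough illustration (800x600 tiles as above; the actual JPEG-in-TIFF file layout is described in [Martinez 98]):

```python
import math

def pyramid_levels(width, height, tile_w=800, tile_h=600):
    """Return (width, height, tile count) per pyramid level, halving the
    image dimensions until a single tile covers the whole image."""
    levels = []
    while True:
        tiles = math.ceil(width / tile_w) * math.ceil(height / tile_h)
        levels.append((width, height, tiles))
        if tiles == 1:
            return levels
        width, height = max(1, width // 2), max(1, height // 2)

# An A4 scan at 300 dpi needs four levels: 24, 6, 2 and finally 1 tile.
for w, h, n in pyramid_levels(2481, 3507):
    print(f"{w} x {h}: {n} tiles")
```

A client browsing the top of the pyramid fetches one small tile rather than the full-resolution scan, which is why only the portions needed are ever transmitted.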
4.2 Sampling
Some studies propose a fixed percentage of documents to be sampled after the process has been completed. However, there is risk involved in all sampling, and this must be understood. Hence, a sampling plan in accordance with a recognised statistical process, for example BS 6001 [BSI 96], should be used.
4.3 Key word extraction
The archive catalogue will normally provide the essential metadata for a document. However, as a document may contain many pages, it would aid users if they could proceed straight to the page containing the information they seek by using key words. As the archive stores only a raster image of each page, it is necessary to convert the image into a vector format in order to extract the key words; for instance, converting scanned images of a manuscript into ASCII text using Optical Character Recognition (OCR) techniques. Key word extraction is at present, within limited time and costs, only practical for typed manuscripts. For handwritten documents, for example letters, key words will require visual identification and manual entry into the navigational system.
4.3.1 Error rates for automated key word extraction.
Once the document has been through the OCR process, all the common (stop) words are removed to leave the keywords for the document. However, the OCR process is not perfect. The success of the conversion process is a combination of the quality of the hardware, software and the material to be scanned.
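The stop-word step can be sketched as a simple filter over the OCR output (the stop-word list here is a tiny illustrative sample, not the list actually used):

```python
# A tiny illustrative stop-word list; production systems use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "that", "it"}

def extract_keywords(ocr_text):
    """Lower-case the OCR output, strip surrounding punctuation, and drop
    stop words, leaving candidate key words for the page."""
    words = [w.strip(".,;:()[]").lower() for w in ocr_text.split()]
    return sorted({w for w in words if w and w not in STOP_WORDS})

print(extract_keywords("The congruence of the parastichy is noted in the text."))
```

Whatever the OCR errors elsewhere on the page, any correctly recognised occurrence of a content word survives this filter and can index the page.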
The combination of the quality of the hardware and software will result in a residual process error. To find this residual error rate, a small experiment was designed using sample documents from the Turing archive. The documents were first converted to black and white and saved as uncompressed Tagged Image File Format (TIFF) files. These then went through the OCR process, from which a list of sixty words that were not originally recognised was extracted. The list of words was put into a word-processor document to be used as a test document. The list was repeated at three font sizes (8, 10 and 12 point) in three TrueType fonts (Courier New, Times New Roman and Arial), see Table 1. The document was then printed on a 600 DPI laser printer to produce a high-quality image, and went through the OCR process having been scanned at 200 Dots Per Inch (DPI), 300 DPI and 400 DPI.
Font | Point | Errors at 200 DPI | Errors at 300 DPI | Errors at 400 DPI |
Courier New | 8 | 1 | 2 | 0 |
Courier New | 10 | 0 | 0 | 0 |
Courier New | 12 | 1 | 0 | 0 |
Times New Roman | 8 | 2 | 0 | 0 |
Times New Roman | 10 | 0 | 0 | 0 |
Times New Roman | 12 | 0 | 0 | 0 |
Arial | 8 | 1 | 0 | 0 |
Arial | 10 | 0 | 1 | 0 |
Arial | 12 | 0 | 0 | 0 |
Table 1 The number of errors after the OCR process for the test document.
The results in Table 1 show the number of words for which the OCR process did not correctly recognise all the characters. On average the residual error was less than 4% in recognising words from a printed document whose readability was good. The conclusion from this exercise is that the size of the characters and the DPI setting of the scanner influence the residual error. However, from Table 1 it can be seen that at a scanner setting of 400 DPI no errors were found. The number of errors does increase if other printable characters are included (for example, square brackets ([ ]) are recognised as a one (1)), but these will not be counted as key words.
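The sub-4% figure follows directly from Table 1: each cell counts failures out of the sixty test words, so the worst cell (2 errors) corresponds to about 3.3%. A quick check over the table data:

```python
# Errors per (font, point size) at 200, 300 and 400 DPI, from Table 1;
# each figure is out of the 60 test words.
errors = {
    ("Courier New", 8): (1, 2, 0), ("Courier New", 10): (0, 0, 0), ("Courier New", 12): (1, 0, 0),
    ("Times New Roman", 8): (2, 0, 0), ("Times New Roman", 10): (0, 0, 0), ("Times New Roman", 12): (0, 0, 0),
    ("Arial", 8): (1, 0, 0), ("Arial", 10): (0, 1, 0), ("Arial", 12): (0, 0, 0),
}
worst = max(e for row in errors.values() for e in row)
print(f"worst cell: {worst}/60 = {100 * worst / 60:.1f}%")   # 3.3%, i.e. under 4%
at_400 = sum(row[2] for row in errors.values())
print(f"errors at 400 DPI: {at_400}")                        # 0
```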
The feasibility study conducted by Tanner et al. [Tanner 98] rated the original document, scanned image, and OCR results as good, fair, acceptable and poor, see Appendix B for more details. The categorisation of the documents and scanning process is based on the expertise (experience and judgement) of the users. However, the rating of the OCR process is quantitative, i.e. the percentage of error.
The rating of the documents in the Turing archive falls in the lower portion of the scale, i.e. acceptable and poor. The reason for this, by and large, is that the archive deals with original work: many of the manuscripts are draft copies with typed-over or hand amendments and mistyped words, see Figure 2.
Figure 2 Typical corrected manuscript from the Turing Archive [AMT/C/25/15]
The corrections to the manuscripts can make it difficult to achieve a high rating (Good or Fair) according to Tanner's system. However, as the extraction process only requires one occurrence of a word to flag its presence on a page, and key words will generally occur more than once, the chances of successfully identifying a key word increase. For instance, in the text relating to Figure 2, the key word congruence appears three times and parastichy four times.
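This effect can be made concrete with a simple probability model. If a single occurrence of a word is recognised with probability p, and occurrences fail independently (an idealised assumption; in practice OCR failures on repeated occurrences of the same word are correlated), then a word appearing k times on a page is flagged with probability 1 - (1 - p)^k:

```python
def flag_probability(p, k):
    """Probability that at least one of k independent occurrences of a
    word is recognised, given per-occurrence recognition probability p."""
    return 1 - (1 - p) ** k

# Even with a poor 70% per-occurrence rate, three occurrences (as for
# "congruence" in the Figure 2 text) are flagged over 97% of the time.
print(round(flag_probability(0.70, 3), 3))
```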
The scanned documents in the Turing archive can be classified into three main groups. These are:
Document | Colour % error | Converted to Grey % error | Size 50% of Original % error | % of words different between colour & grey |
Unaltered Colour | | | | |
AMT/B/6/3 | 14.2 | 10.4 | 13.7 | 7.1 |
AMT/B/6/4 | 11.3 | 10.8 | 11.3 | 3.0 |
AMT/B/6/5 | 14.4 | 13.9 | 11.4 | 4.0 |
Altered Colour | | | | |
AMT/C/24/5 | 45.7 | 46.2 | 33.2 | 0.0 |
AMT/C/24/6 | 32.6 | 29.4 | 10.2 | 3.3 |
AMT/C/24/7 | 48.8 | 46.3 | 22.8 | 6.5 |
Table 2 Error rates for coloured images.
5. Effort Taken.
Tanner et al [Tanner 98] have presented some prices as a guide, yet no information on how the costs were derived; they point out that the reason for this is that many of the costs relate to internal resources. Hence, to estimate the costs it is first necessary to identify the effort required. In order to aid estimation of the effort for scanning documents, a model of the process can be used. The conceptual process model [BSI 92] is shown in Figure 3.
Figure 3 Conceptual Process Model
Figure 4 The Capturing Process Model
Standard project techniques for estimating resources (and risk) should be applied to the digitising of archival material. To aid this process, historical data can be used, along with an engineering costing approach [Arnold 90]. Table 3 shows a summary of the time taken to scan pages from different document types; a complete listing of the times taken is given in Appendix A.
Type of document | Average (minutes) | Tolerance +/- (minutes) |
Manuscript Foolscap | 1.70 | 0.930 |
Manuscript Foolscap Covered | 4.65 | 1.070 |
Photographs | 3.76 | 0.144 |
Correspondence | 2.54 | 0.036 |
Booklet | 2.81 | 1.524 |
Table 3 The time taken to scan documents.
The main effort with existing paper documentation is converting the information into an electronic format. This task can be sub-divided into five main activities, see Table 4.
Activity Number (i) | Activity Description |
1 | Collecting the information to be converted. |
2 | Converting the information to a standard raster format. |
3 | Cleaning-up the information after conversion. |
4 | Processing the information into a vector format. |
5 | Saving the information with appropriate file names and file hierarchy. |
Table 4 List of activities associated with digitising non-electronic media documents
Scanning can digitise the majority of paper documents. However, large paper items, e.g. blueprints, can be digitally photographed, or photographed by conventional means and the negatives scanned. There are also equivalent processes for converting microfiche documents.
There are a number of post process activities that can take place. For example, straightening crooked images and removing excessive over-scanning (i.e. white space around an image). Most documents only need to be in the raster format supplied by the conversion process. Key word extraction will also require conversion into a vector format.
Some material can be digitised by any of a number of companies that offer a bureau service. These companies will also do some of the post-conversion tidying-up, so the effort for correcting the converted documents is reduced. While the in-house effort is then zero, the cost is not, and must be taken into account when carrying out the cost accounting.
The effort required to deal with the paper information (E) is the sum of the effort for gathering, converting, correcting, and processing the paper information. When estimating the time to carry out each process it is assumed that time is included for verifying the work.
These individual processes that make up the conversion of paper-based information into electronic information each depend on a number of factors. The effort is expressed as:

E = Σ Ei, for i = 1 to 5    (1)

where Ei is the effort required for activity i of Table 4.
The effort estimation, in person-hours, does not allow for breaks, distractions, or time for the person to fulfil other work-related responsibilities. Hence, when converting person-hours to days allow only 6 hours per day, and when converting to weeks use 30 hours per week.
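This working-day adjustment is a simple division, sketched below:

```python
def effort_in_days(person_hours, hours_per_day=6):
    """Convert raw person-hours to working days, allowing only 6
    productive scanning hours per day as recommended in the text."""
    return person_hours / hours_per_day

def effort_in_weeks(person_hours, hours_per_week=30):
    """As above, at 30 productive hours per week."""
    return person_hours / hours_per_week

print(effort_in_days(90))    # 15.0 days
print(effort_in_weeks(90))   # 3.0 weeks
```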
The timings in Table 3 are average times to scan a document. Hence, there will be a tolerance on each of the tasks, and the cumulative tolerance can therefore be quite large. However, by using statistical tolerancing, the overall tolerance for the time to capture is much less (and more realistic) than the arithmetic sum of the tolerances [Burr 76].
TT = √( Σ (NT × TI²) )    (2)

where:
TT = the total tolerance
TI = the individual tolerance for each operation
NT = the number of times the operation is performed
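Root-sum-square (statistical) tolerancing can be contrasted with the arithmetic worst case in a few lines. The sketch below pairs the Table 3 tolerances with the leaf counts from Appendix A as repeat counts (an illustrative pairing):

```python
import math

def total_tolerance(ops):
    """ops: list of (individual tolerance, times performed) pairs.
    Returns the statistical (root-sum-square) total alongside the
    arithmetic worst-case total for comparison."""
    statistical = math.sqrt(sum(n * t ** 2 for t, n in ops))
    arithmetic = sum(n * t for t, n in ops)
    return statistical, arithmetic

# (tolerance in minutes, leaf count) per document type,
# from Table 3 and the Appendix A totals.
ops = [(0.930, 423), (1.070, 36), (0.144, 28), (0.036, 63), (1.524, 41)]
stat, arith = total_tolerance(ops)
print(f"statistical: {stat:.1f} min, arithmetic: {arith:.1f} min")  # ~22.4 vs ~500.7
```

The statistical total is a small fraction of the arithmetic sum, which is the point Burr makes: individual timing errors largely cancel over many repetitions.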
Within the literature there are estimates of the cost of scanning documents [Tanner 98]. However, the method of allocating costs will vary between organisations, hence it is essential first to identify the effort. Once the effort has been calculated, the cost can be calculated. Within cost accounting there are a number of methods of allocating cost; some of the main considerations when allocating costs for digitising material are presented in Appendix C.
When people are first introduced to a task, they frequently take longer to perform that task than when they have repeated the task a number of times. This is known as the learning-curve effect [Drury 96]. The learning curve is expressed as:

T = aX^b    (3)

where T is the cumulative average of the time required to carry out the task X times and a is the time required to carry out the task the first time. The exponent, b, is defined as:

b = log(learning rate) / log(2)    (4)

hence b will have a range of [-1, 0]. The learning curve is based on real-world observations and hence the relationships described are empirical [Arnold 90]. The learning rate can vary between 65% and 90% in the early stages of a process, and levels out to reach a steady state in which no further reductions in the time to perform the tasks can be achieved by learning. Tables are available to aid in applying the learning curve; these give the unit time and total unit time for different learning curves against the number of times the task is performed. The learning curve can be applied to each of the input tasks identified in the process model.
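The standard learning-curve calculation, T = aX^b with b = log(learning rate)/log 2, consistent with the definitions in the text, can be sketched as follows (the 80% rate and the 4-minute first-scan time are illustrative, not measured values):

```python
import math

def cumulative_average_time(a, x, learning_rate=0.80):
    """Cumulative average time per task after x repetitions.
    a: time for the first task. A learning_rate in (0.5, 1] gives
    an exponent b in [-1, 0), so the average falls as x grows."""
    b = math.log(learning_rate) / math.log(2)
    return a * x ** b

# With an 80% learning rate, the running average drops 20% at each
# doubling of the number of scans performed.
for x in (1, 2, 4, 8, 16):
    print(x, round(cumulative_average_time(4.0, x), 2))
```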
Using a different number of people will not affect the form of the cost equation [Wills 98]. Hence, equation 3 can be written as:

T = a(X/P)^b    (5)

where P is the number of people involved in the process.
6. Conclusion.
At first glance, digitising paper documents by scanning appears to be a straightforward and easy task. However, it is fraught with many minor considerations. Therefore, to simplify the task a considerable amount of preparation is required. The preparation will require a thorough information audit and often a trial run. This allows a comprehensive procedure to be put in place that will ensure the whole process runs smoothly.
Key word extraction by using OCR techniques is not viable for hand written manuscripts and may not be cost effective for hand amended/corrected typed manuscripts.
The key to costing the conversion process is in understanding the effort required. Hence a simple process model has been developed. In addition, some sample times are presented that should allow a rough estimate of the time required for future work.
The authors acknowledge the Archive Centre at King's College, Cambridge for assistance and for granting access to the Turing Archive.
References
[Arnold 90] | Arnold J, Hope T. Accounting for Management Decisions, 2nd edition. Prentice Hall, 1990. |
[BSI 92] | BSI. BS 6143: Part 1: 1992 Guide to the economics of quality. Part 1: Process cost model. British Standards Institution. http://www.bsi.org.uk |
[BSI 96] | BSI. BS 6001-1:1999 (ISO 2859-1:1999) Sampling procedures for inspection by attributes. British Standards Institution. http://www.bsi.org.uk |
[Burr 76] | Burr IW. Statistical Quality Control Methods. Marcel Dekker Inc., 1976. |
[Drury 96] | Drury C. Management and Cost Accounting, 4th edition. Thomson, 1996. |
[Lee 99] | Lee SD. Scoping the Future of the University of Oxford's Digital Library Collections. Oxford University, 1999. Available at http://www.bodley.ox.ac.uk/scoping/ |
[Martinez 98] | Martinez K, Cupitt J, Perry S. High resolution colorimetric image browsing on the Web. WWW7. Available at http://azul.ecs.soton.ac.uk/~km/papers/www7/ |
[Moore 98] | Moore M. As Useful as ABC? IEE Manufacturing Engineering, Vol. 77, No. 2, April 1998, pp 92-94. |
[Reynolds 92] | Reynolds AJ. The Finance of Engineering Companies: An Introduction for Students and Practising Engineers. Edward Arnold, 1992. |
[Tanner 98] | Tanner S, Robinson B. The Refugee Studies Programme Digital Library Feasibility Study. HEDS, University of Hertfordshire, 1998. Available at http://heds.herts.ac.uk/Guidance/RSP_fs.html |
[Wills 98] | Wills GB, Heath I, Crowder RM, Hall W. A Model for Authoring and Costing an Industrial Hypermedia Application. University of Southampton Technical Report No. MM98-6, November 1998. ISBN 085432685-5. |
Appendix A Scanning Effort
Type of document | Number of leaves | Time Taken (m) | Average time (m) | Cumulative Average (m) | Tolerance |
Manuscript Foolscap | 22, 69, 46, 93, 119, 44, 30 | 61, 114, 117, 168, 109, 72, 32 | 2.7873, 1.652, 2.54, 1.806, 0.916, 1.636, 1.067 | 1.771 | +0.855 / -1.002 |
Manuscript Foolscap Covered | 7, 22, 7 | 40, 68, 25 | 5.714, 3.091, 3.571 | 4.126 | +1.589 / -0.554 |
Photographs | 18, 10 | 65, 39 | 3.611, 3.900 | 3.756 | +/- 0.144 |
Correspondence Bound | 49, 14 | 126, 35 | 2.571, 2.500 | 2.536 | +/- 0.036 |
Booklet | 6, 35 | 26, 45 | 4.333, 1.286 | 2.810 | +/- 1.524 |
Table A-1 The time taken to scan documents.
Appendix B Tanner's Criteria
Below are the criteria Tanner et al. [Tanner 98] set for the scanning process:
Condition: the condition of the original pages in the document.
- Good: The paper is in very good condition, with the normal wear of being stored on a library shelf. The appearance should be almost as new; no tears, yellowing, foxing etc. should be present. The binding of the document is sound, or may be removed for processing.
- Fair: The paper is in good condition and shows only a small amount of wear, such as slight yellowing, dirt or minor tears.
- Acceptable: The paper is in generally OK condition but there are some problems, such as extensive yellowing, dirt or fading of text, but where the text is still readable. Included here are good-quality photocopies.
- Poor: The paper is in a generally poor condition, the text being very difficult to read, possibly owing to dirt, yellowing, foxing, show-through, tears, crumpled pages or generally poor paper condition. Coming into this category is smudged text such as found in poor photocopies, or where the printing was done to Mimeograph-type standards.

Scan Standard: the standard of the output file that could be expected and the level of post-processing required.
- Good: The scan will be very good and require almost no post-processing to achieve a top standard.
- Fair: The scan is very good but requires some clean-up or other post-processing (e.g. deskewing) to achieve the top standard. May also be used for material where the scanning is going to be difficult due to handling difficulties or special treatments.
- Acceptable: The scanned image is below average standard but can be made more acceptable through clean-up and other post-processing.
- Poor: The scanned image is very poor and can only attain an acceptable standard whatever the post-processing or other treatments used.

OCR Standard: the expected accuracy standard of the output files should OCR be carried out.
- Good: Accuracy at 99.99%, with almost no correction required.
- Fair: Accuracy at >99%, which can be brought to 99.99% with a small number of corrections.
- Acceptable: Accuracy at 90-99%.
- Poor: Accuracy below 90%.
Appendix C Costing
A detailed costing method is the engineering cost method [Arnold 90], used where a product or process is not part of the company's normal business activity. In the engineering method, estimates are made of the total time, labour, materials and capital equipment required to perform the activity. The difficulty comes in estimating the indirect costs of items such as insurance, maintenance and power. However, the engineering cost method leads to a very accurate prediction of future costs. In addition, the cost of the Information Audit should be included, the rationale being that the effort required to perform the audit is an integral part of the authoring methodology and has a direct effect on the efficiency of the authoring process; a poor audit will result in greater effort in the authoring process.
Overheads are those costs that cannot be directly assigned to a cost object such as a product, process, or customer group [Drury 96]. Included in this are the costs of services such as lighting, heating, building maintenance, rent for floor space and the corresponding proportion of the business rate. The traditional method of allocating these costs is to divide the overhead cost among the various cost centres of the organisation; each cost centre will then further apportion the cost among each of its activities. This method works well for cost accounting, especially where the overheads are small in relation to the direct costs. However, when the overhead costs are greater than the direct costs, disparate allocation of overhead costs takes place. Hence, a small but increasing number of engineering companies are changing over to activity-based costing to allocate the overheads, especially when the costing is to be used for management decisions [Moore 98]. Another method commonly used in cost estimation is to allocate a figure for overhead costs as a function of the cumulative labour costs, e.g. an additional forty percent. Different companies will use different methods of allocating these costs; however, the result is the same: a fixed figure that represents the overhead costs (CO).
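The percentage-of-labour method is the simplest to sketch (the forty percent figure is the example from the text, not a recommendation):

```python
def overhead_cost(cumulative_labour_cost, overhead_rate=0.40):
    """Allocate overheads as a fixed fraction of cumulative labour cost."""
    return cumulative_labour_cost * overhead_rate

labour = 5000.0                          # illustrative total labour cost
print(labour + overhead_cost(labour))    # 7000.0 including overheads
```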
There is a cost of employment beyond simply the salary paid to the employee. These costs include the employer's National Insurance contributions, pension fund contributions, health and other insurances. It is preferable to calculate an average hourly rate for these and add it to the hourly rate of the worker's salary to produce a cost of employment [Drury 96]. Hence, it is necessary to calculate the cost of employment for the different salary scales or bands of the employees involved in the authoring process. In the first instance it may be sufficient to assume that the people employed on the task are paid the same. However, this is rarely the case, owing to factors such as full- or part-time employment, length of service, seniority, etc. The extent to which these factors are taken into account will be related to the accuracy of the cost estimation required.
In cost accounting, depreciation is used to spread the cost of equipment over a number of years [Reynolds 92]. The 'life term' of the equipment is chosen based on the nature of the equipment; relative to factory process machinery, the life term of computer equipment is generally short, that is, less than five years. The depreciation can be a constant amount that allows for scrap value at the end, or a constant fraction of the residual amount, producing larger depreciation values in the early stages. However, for cost forecasting it makes more sense to include the full cost of the equipment (CE), including the cost of maintenance agreements, shipping, insurance, etc. In addition, the actual cost of purchasing equipment can be spread by the use of lease-purchase agreements, in which the company leases the equipment for a set period. If the company keeps the equipment to the end of the agreement it will own the equipment; prior to that it can return the equipment as in any other leasing agreement.
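The two depreciation schemes described can be sketched side by side (the figures are illustrative):

```python
def straight_line(cost, scrap, years):
    """Constant annual charge, leaving the scrap value at the end of life."""
    return [(cost - scrap) / years] * years

def declining_balance(cost, fraction, years):
    """Constant fraction of the residual value: larger charges early on,
    with a residual value remaining after the final year."""
    charges, residual = [], cost
    for _ in range(years):
        charge = residual * fraction
        charges.append(charge)
        residual -= charge
    return charges

print(straight_line(3000, 500, 5))                       # [500.0] x 5
print([round(c) for c in declining_balance(3000, 0.4, 5)])
```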
The additional process costs (CP) are variable overhead costs, in that these are overheads unique to the process itself. These include the cost of managing and supervising the process, the materials and power consumed in the process, the cost of the Information Audit, the cost of any subcontracted work (out sourcing the conversion process) etc.
The total cost of the process is:

CT = (E × CL) + CO + CE + CP    (B-1)

where E is the effort in person-hours, CL the hourly cost of employment, CO the overhead costs, CE the equipment costs and CP the additional process costs.