Accessing Textual Information Embedded in Internet Images
Indexing and searching WWW pages relies on analysing text. Current technology cannot process the text embedded in images on WWW pages. This paper argues that this is a significant problem, as text in image form is usually semantically important (e.g. headers, titles). The results of a recent study are presented to show that the majority (76%) of words embedded in images do not appear elsewhere in the main text and that the majority (56%) of ALT tag descriptions of images are incorrect or do not exist at all. Research under way to devise tools that extract text from images, based on the way humans perceive colour differences, is outlined and results are presented.
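The ALT tag finding can be made concrete with a small audit script. The following is a minimal sketch, not the method used in the study: it assumes plain HTML input, uses only the Python standard library, and the names `AltAudit` and `audit_alt` are hypothetical. It counts images whose ALT attribute is absent or empty; deciding whether an existing ALT description is incorrect would require reading the text in the image itself, which is the problem the paper addresses.

```python
from html.parser import HTMLParser


class AltAudit(HTMLParser):
    """Count <img> tags and those with a missing or empty ALT attribute."""

    def __init__(self):
        super().__init__()
        self.total = 0      # all <img> tags seen
        self.missing = 0    # <img> tags with no usable ALT text

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        if tag != "img":
            return
        self.total += 1
        alt = dict(attrs).get("alt")
        # Absent, valueless, or whitespace-only ALT all count as missing.
        if alt is None or not alt.strip():
            self.missing += 1


def audit_alt(html):
    """Return (total images, images without usable ALT text)."""
    parser = AltAudit()
    parser.feed(html)
    parser.close()
    return parser.total, parser.missing


# Hypothetical page fragment: a title image, a labelled logo, an empty ALT.
sample = (
    '<h1><img src="title.gif"></h1>'
    '<img src="logo.gif" alt="Company logo">'
    '<img src="nav.gif" alt="">'
)
total, missing = audit_alt(sample)
print(f"{missing} of {total} images lack ALT text")
```

A count like this gives only a lower bound on the problem described in the abstract, since ALT text that is present but wrong passes the check.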
Web document analysis, image analysis, text extraction
198-205
Antonacopoulos, Apostolos
Karatzas, Dimosthenis
Ortiz Lopez, J
2001
Antonacopoulos, Apostolos, Karatzas, Dimosthenis and Ortiz Lopez, J (2001) Accessing Textual Information Embedded in Internet Images. SPIE, Internet Imaging II, San Jose, United States. pp. 198-205.
Record type:
Conference or Workshop Item
(Paper)
More information
Published date: 2001
Additional Information:
Event Dates: January 2001
Venue - Dates:
SPIE, Internet Imaging II, San Jose, United States, 2001-01-01
Organisations:
Electronics & Computer Science
Identifiers
Local EPrints ID: 263506
URI: http://eprints.soton.ac.uk/id/eprint/263506
PURE UUID: 582e7d37-70da-4992-b760-147a5ba5f998
Catalogue record
Date deposited: 19 Feb 2007
Last modified: 14 Mar 2024 07:33
Contributors
Author:
Apostolos Antonacopoulos
Author:
Dimosthenis Karatzas
Author:
J Ortiz Lopez