Text Segmentation in Web Images Using Colour Perception and Topological Features
Karatzas, Dimosthenis (2003) Text Segmentation in Web Images Using Colour Perception and Topological Features. University of Liverpool, Computer Science, Doctoral Thesis.
Record type: Thesis (Doctoral)
Abstract
The research presented in this thesis addresses the problem of text segmentation in Web images. Text is routinely created in image form (headers, banners, etc.) on Web pages in an attempt to overcome the stylistic limitations of HTML. This text, however, has potentially high semantic value for indexing and searching the corresponding Web pages. Because current search engine technology does not extract or recognise text in images, text in image form is ignored. Moreover, it is desirable to obtain a uniform representation of all visible text on a Web page (for applications such as voice browsing or automated content analysis). This thesis presents two methods for text segmentation in Web images using colour perception and topological features. The nature of Web images and the problems they pose for text segmentation are described, and a study is performed to assess the magnitude of the problem and establish the need for automated text segmentation methods. Two segmentation methods are subsequently presented: the Split-and-Merge segmentation method and the Fuzzy segmentation method. Although each method approaches it in a distinctly different way, both rest on the safe assumption that a human being should be able to read the text in any given Web image. This anthropocentric character, together with the use of topological features of connected components, constitutes the underlying working principle of the methods. An approach for classifying the connected components resulting from the segmentation methods as either characters or parts of the background is also presented.
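As a rough illustration of what "connected components" and their topological features mean in this setting, the following minimal sketch groups neighbouring pixels of similar colour and reports each group's size and bounding box. It is not the Split-and-Merge or Fuzzy method from the thesis: the Euclidean RGB distance is only a crude stand-in for a perceptual colour measure, and the function names, the threshold value and the toy image are all assumptions made for the example.

```python
from collections import deque


def colour_distance(a, b):
    """Euclidean distance between two RGB tuples (assumed stand-in for a perceptual metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def connected_components(image, threshold=30.0):
    """Group 4-connected pixels whose colour lies within `threshold` of the seed pixel's colour."""
    height, width = len(image), len(image[0])
    labels = [[-1] * width for _ in range(height)]
    components = []
    for r in range(height):
        for c in range(width):
            if labels[r][c] != -1:
                continue  # pixel already belongs to a component
            seed = image[r][c]
            label = len(components)
            labels[r][c] = label
            queue = deque([(r, c)])
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and labels[ny][nx] == -1
                            and colour_distance(image[ny][nx], seed) <= threshold):
                        labels[ny][nx] = label
                        queue.append((ny, nx))
            components.append(pixels)
    return components


def describe(pixels):
    """Simple per-component features: pixel count and bounding box (top, left, bottom, right)."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    return {"size": len(pixels), "bbox": (min(ys), min(xs), max(ys), max(xs))}


# Toy 3x5 image: white background with two dark vertical strokes (hypothetical data).
W, D = (255, 255, 255), (10, 10, 10)
demo_image = [
    [W, W, W, W, W],
    [W, D, W, D, W],
    [W, D, W, D, W],
]
demo_components = connected_components(demo_image)
demo_features = [describe(p) for p in demo_components]
```

On the toy image this yields one large background component and two small stroke components. A later classification step, such as the one the abstract mentions, would have to decide from features like these which components are characters and which are background.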
Text: THESIS_Karatzas.pdf (Other)
More information
Published date: 2003
Keywords: text extraction, character segmentation, web document analysis, web images, colour perception, fuzzy
Organisations: Electronics & Computer Science
Identifiers
Local EPrints ID: 263525
URI: http://eprints.soton.ac.uk/id/eprint/263525
PURE UUID: 7e8e1a9c-86f6-4559-a6b1-5874e5056e23
Catalogue record
Date deposited: 19 Feb 2007
Last modified: 14 Mar 2024 07:34
Contributors
Author: Dimosthenis Karatzas