University of Southampton Institutional Repository

Text Segmentation in Web Images Using Colour Perception and Topological Features

Karatzas, Dimosthenis

Karatzas, Dimosthenis (2003) Text Segmentation in Web Images Using Colour Perception and Topological Features. University of Liverpool, Computer Science, Doctoral Thesis.

Record type: Thesis (Doctoral)

Abstract

The research presented in this thesis addresses the problem of text segmentation in Web images. Text is routinely created in image form (headers, banners, etc.) on Web pages in an attempt to overcome the stylistic limitations of HTML. This text, however, has potentially high semantic value for indexing and searching the corresponding Web pages. Because current search engine technology does not extract or recognise text in images, text in image form is ignored. Moreover, it is desirable to obtain a uniform representation of all visible text on a Web page (for applications such as voice browsing or automated content analysis). This thesis presents two methods for text segmentation in Web images using colour perception and topological features. The nature of Web images and the problems they pose for text segmentation are described, and a study is performed to assess the magnitude of the problem and establish the need for automated text segmentation methods. Two segmentation methods are subsequently presented: the Split-and-Merge segmentation method and the Fuzzy segmentation method. Although each method approaches it in a distinctly different way, both rest on the safe assumption that a human being should be able to read the text in any given Web image. This anthropocentric character, together with the use of topological features of connected components, constitutes the underlying working principle of the methods. An approach for classifying the connected components resulting from the segmentation methods as either characters or parts of the background is also presented.

PDF: THESIS_Karatzas.pdf (5MB)

More information

Published date: 2003
Keywords: text extraction, character segmentation, web document analysis, web images, colour perception, fuzzy
Organisations: Electronics & Computer Science

Identifiers

Local EPrints ID: 263525
URI: http://eprints.soton.ac.uk/id/eprint/263525
PURE UUID: 7e8e1a9c-86f6-4559-a6b1-5874e5056e23

Catalogue record

Date deposited: 19 Feb 2007
Last modified: 18 Jul 2017 07:44
