The University of Southampton
University of Southampton Institutional Repository

A systematic study of offline recognition of Thai printed and handwritten characters


Sae-Tang, Sutat (2011) A systematic study of offline recognition of Thai printed and handwritten characters. University of Southampton, Faculty of Physical and Applied Sciences, Doctoral Thesis, 138pp.

Record type: Thesis (Doctoral)

Abstract

Thai characters pose some unique problems that differ from those of English and of other oriental scripts. Thai characters are built from small loops combined with curves, and there are no spaces between words or sentences. Moreover, the characters in each line can be written on four levels, depending on the type of character. This research focuses on OCR for the Thai language, covering both printed and offline handwritten character recognition. The main aim is to overcome these problems with simple but effective methods.

A printed OCR system developed by the National Electronics and Computer Technology Center (NECTEC) uses Kohonen self-organising maps (SOMs) for rough classification and back-propagation neural networks for fine classification. The NECTEC OCR is evaluated on a printed dataset that contains over 0.6 million tokens. Comparisons of the classifier with and without the aspect ratio, and with and without SOMs, yield small but statistically significant differences in recognition rate. A very straightforward classifier, the nearest neighbour, was examined to evaluate overall recognition performance and for comparison with the NECTEC classifier. It shows a significant improvement in recognition rate (about 98%) over the NECTEC classifier (about 96%) on both the original and the distorted (rotated and noisy) data, but at the expense of longer recognition times.

For offline handwritten character recognition, three different classifiers are evaluated on three different datasets that contain, on average, approximately 10,000 tokens each. The neural network and hidden Markov models (HMMs) are more effective, giving higher recognition rates than the nearest neighbour classifier on the three datasets. The best result obtained from the HMMs is 91.1% on the ThaiCAM dataset. However, when evaluated on a different dataset, recognition rates drop drastically because online and offline handwritten data differ in many respects. Classification rates improved when the stroke width of characters in the online handwritten dataset was adjusted (by 12 percentage points) and when the training sets from the three datasets were combined (by 7.6 percentage points). A boosting algorithm, AdaBoost, yields a slight improvement in recognition rate (1.2 percentage points) over the original classifiers, that is, those without AdaBoost applied.
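
The nearest neighbour classifier mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state which features or distance measure the thesis uses, so everything below is an assumption for illustration only: characters are represented as flattened binary bitmaps, distance is the Hamming distance, and the names `hamming`, `nearest_neighbour`, and `TRAINING` are hypothetical. The brute-force scan over every stored sample also suggests why such a classifier can trade accuracy for longer recognition times.

```python
from typing import List, Sequence, Tuple

# Assumed representation: a flattened binary image, 1 = ink, 0 = background.
Bitmap = Sequence[int]


def hamming(a: Bitmap, b: Bitmap) -> int:
    """Count the pixel positions at which two equal-sized bitmaps differ."""
    if len(a) != len(b):
        raise ValueError("bitmaps must have the same size")
    return sum(x != y for x, y in zip(a, b))


def nearest_neighbour(sample: Bitmap, training: List[Tuple[Bitmap, str]]) -> str:
    """Return the label of the training bitmap closest to `sample`.

    Every query is compared against every stored training sample, so
    recognition time grows linearly with the size of the training set.
    """
    if not training:
        raise ValueError("training set is empty")
    _, best_label = min(training, key=lambda item: hamming(sample, item[0]))
    return best_label


# Toy 3x3 glyphs (hypothetical data, not taken from the thesis).
TRAINING = [
    ((1, 1, 1, 1, 0, 1, 1, 1, 1), "loop"),
    ((0, 1, 0, 0, 1, 0, 0, 1, 0), "bar"),
]

# A "loop" with one pixel flipped, standing in for a noisy scan.
noisy_loop = (1, 1, 1, 1, 0, 1, 1, 1, 0)
predicted = nearest_neighbour(noisy_loop, TRAINING)
print(predicted)
```

In this toy case the noisy glyph differs from the stored "loop" in one pixel and from the stored "bar" in six, so the classifier returns "loop". A practical system would normally normalise character size and position before comparing bitmaps.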

Text
SutatSaeTang-Thesis.pdf - Other
Download (2MB)

More information

Published date: 15 November 2011
Organisations: University of Southampton, Southampton Wireless Group

Identifiers

Local EPrints ID: 206079
URI: http://eprints.soton.ac.uk/id/eprint/206079
PURE UUID: 4e86c955-b563-4c77-818f-6329ec3332e0

Catalogue record

Date deposited: 14 Dec 2011 17:07
Last modified: 14 Mar 2024 04:36

Contributors

Author: Sutat Sae-Tang
Thesis advisor: John N. Carter
Thesis advisor: Robert Damper
