University of Southampton Institutional Repository

Syllables and Other String Kernel extensions

Morgan Kaufmann
Tschach, Hauke
Saunders, Craig
Shawe-Taylor, John

Tschach, Hauke, Saunders, Craig and Shawe-Taylor, John (2002) Syllables and Other String Kernel extensions. In Proceedings of ICML'02. Morgan Kaufmann.

Record type: Conference or Workshop Item (Paper)

Abstract

Recently, string kernels that compare documents as strings of letters have been shown to achieve good results on text classification problems. In this paper we introduce the application of the string kernel in conjunction with syllables. Using syllables shortens the representation of documents and, as a result, reduces computation time. Moreover, syllables provide a more natural representation of text than either the traditional coarse representation given by the bag-of-words or the overly fine one that results from considering individual letters only. We give some experimental results which show that syllables can be used effectively in text-categorisation problems. We also propose two extensions to the string kernel. The first introduces a new lambda-weighting scheme, in which different symbols can be given different decay weightings. This may be useful in text and other applications where the insertion of certain symbols is known to be less significant. The second introduces the concept of 'soft matching', where symbols can match (possibly weighted by relevance) even if they are not identical. Again, this provides a method of incorporating prior knowledge, whereby certain symbols can be regarded as a partial or exact match and contribute to the overall similarity measure for two data items.

More information

Published date: 2002
Venue - Dates: Nineteenth International Conference on Machine Learning (ICML '02), 2002-01-01
Organisations: Electronics & Computer Science

Identifiers

Local EPrints ID: 258975
URI: http://eprints.soton.ac.uk/id/eprint/258975
PURE UUID: 9d352640-0d24-4968-9516-7dd562cff720

Catalogue record

Date deposited: 03 Mar 2004
Last modified: 16 Mar 2024 00:38

Contributors

Author: Hauke Tschach
Author: Craig Saunders
Author: John Shawe-Taylor
