Syllables and Other String Kernel extensions
Tschach, Hauke
Saunders, Craig
Shawe-Taylor, John
2002
Tschach, Hauke, Saunders, Craig and Shawe-Taylor, John (2002) Syllables and Other String Kernel extensions. In Proceedings of ICML'02. Morgan Kaufmann.
Record type: Conference or Workshop Item (Paper)
Abstract
String kernels, which compare documents as strings of letters, have recently been shown to achieve good results on text classification problems. In this paper we apply the string kernel to syllables rather than letters. Using syllables shortens the representation of each document and so reduces computation time. Syllables also provide a more natural representation of text than either the coarse bag-of-words representation or the overly fine one obtained from individual letters. We present experimental results showing that syllables can be used effectively in text-categorisation problems. We also propose two extensions to the string kernel. The first is a new lambda-weighting scheme in which different symbols can be given different decay weightings. This may be useful in text and other applications where the insertion of certain symbols is known to be less significant. The second is 'soft matching', in which symbols can match (possibly weighted by relevance) even if they are not identical. This likewise provides a way to incorporate prior knowledge: certain symbols can be regarded as a partial or exact match and contribute to the overall similarity measure for two data items.
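The ideas in the abstract can be illustrated with a small brute-force sketch in Python. It is not the paper's algorithm or notation: it assumes the standard gap-weighted subsequence kernel, assumes that a subsequence's weight is the product of the decay factors of every symbol in the span it covers, and assumes that soft matching multiplies in a similarity score for each aligned pair of symbols. The names `span_weight`, `soft_string_kernel`, `decay` and `sim` are hypothetical. Because it enumerates all index tuples, it is only practical for very short sequences.

```python
from itertools import combinations
from math import prod


def span_weight(seq, idx, decay, default_decay):
    """Product of decay factors over every symbol in the span covered by idx."""
    return prod(decay.get(sym, default_decay) for sym in seq[idx[0]:idx[-1] + 1])


def soft_string_kernel(s, t, n, default_decay=0.5, decay=None, sim=None):
    """Brute-force gap-weighted subsequence kernel of order n (illustrative only).

    s, t  : sequences of hashable symbols (letters, syllables, words, ...)
    decay : optional per-symbol decay factors (assumed form of lambda-weighting)
    sim   : optional {(a, b): score} table for non-identical symbols
            (assumed form of soft matching); identical symbols score 1.0
    """
    decay = decay or {}

    def match(a, b):
        if a == b:
            return 1.0
        if sim is None:
            return 0.0
        return sim.get((a, b), sim.get((b, a), 0.0))

    total = 0.0
    for i in combinations(range(len(s)), n):
        weight_s = span_weight(s, i, decay, default_decay)
        for j in combinations(range(len(t)), n):
            m = prod(match(s[a], t[b]) for a, b in zip(i, j))
            if m:
                total += m * weight_s * span_weight(t, j, decay, default_decay)
    return total


# Letters as symbols: "cat" and "car" share only "ca", giving 0.5 ** 4.
k_letters = soft_string_kernel("cat", "car", n=2)

# Soft matching: let "t" and "r" count as a partial match.
k_soft = soft_string_kernel("cat", "car", n=2, sim={("t", "r"): 0.5})

# Syllables as symbols: the same function accepts lists of syllable strings.
k_syll = soft_string_kernel(["ker", "nel"], ["ker", "nel", "s"], n=2)
```

Passing a list of syllables instead of a string of letters gives a shorter sequence for the same text, which is the source of the reduced computation time mentioned in the abstract. Whether a soft-matched kernel of this form remains a valid (positive semi-definite) kernel depends on the similarity table chosen.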
Text: Syllable_ICML02.pdf (Other)
More information
Published date: 2002
Venue - Dates:
Nineteenth International Conference on Machine Learning (ICML '02), 2002-01-01
Organisations:
Electronics & Computer Science
Identifiers
Local EPrints ID: 258975
URI: http://eprints.soton.ac.uk/id/eprint/258975
PURE UUID: 9d352640-0d24-4968-9516-7dd562cff720
Catalogue record
Date deposited: 03 Mar 2004
Last modified: 16 Mar 2024 00:38
Contributors
Author: Hauke Tschach
Author: Craig Saunders
Author: John Shawe-Taylor