The University of Southampton
University of Southampton Institutional Repository

Text Classification using String Kernels

Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N. and Watkins, C. (2002) Text Classification using String Kernels. Journal of Machine Learning Research, 2 (3 - Su), 419-444.

Record type: Article

Abstract

We propose a novel approach for categorizing text documents based on the use of a special kernel. The kernel is an inner product in the feature space generated by all subsequences of length k. A subsequence is any ordered sequence of k characters occurring in the text, though not necessarily contiguously. The subsequences are weighted by an exponentially decaying factor of their full length in the text, hence emphasising those occurrences that are close to contiguous. A direct computation of this feature vector would involve a prohibitive amount of computation even for modest values of k, since the dimension of the feature space grows exponentially with k. The paper describes how, despite this fact, the inner product can be efficiently evaluated by a dynamic programming technique. Experimental comparisons of the kernel with a standard word feature space kernel (Joachims, 1998) show positive results on modestly sized datasets. The case of contiguous subsequences is also considered for comparison with the subsequences kernel with different decay factors. For larger documents and datasets, the paper introduces an approximation technique that is shown to deliver good approximations efficiently.
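
To make the abstract's central idea concrete, the following is a minimal Python sketch of a subsequence kernel of length k with decay factor lam. It is an illustration under stated assumptions, not the authors' implementation: this record does not give the recursion, so the dynamic program below follows one common way of organising it, and the names ssk, explicit_features, explicit_kernel and lam are hypothetical. Each occurrence of a subsequence is weighted by lam raised to the span it covers in the text (last index minus first index plus one). The brute-force version builds the feature vector directly and is only usable on very short strings; the dynamic program runs in time proportional to k * len(s) * len(t).

```python
from collections import defaultdict
from itertools import combinations


def explicit_features(s, k, lam):
    """Brute-force feature map: one coordinate per length-k subsequence.

    Every index tuple i_1 < ... < i_k contributes lam ** (i_k - i_1 + 1),
    i.e. the decay is applied to the full span of the occurrence.
    """
    phi = defaultdict(float)
    for idx in combinations(range(len(s)), k):
        u = "".join(s[i] for i in idx)
        phi[u] += lam ** (idx[-1] - idx[0] + 1)
    return phi


def explicit_kernel(s, t, k, lam):
    """Inner product of the two explicit feature vectors (slow reference)."""
    ps, pt = explicit_features(s, k, lam), explicit_features(t, k, lam)
    return sum(w * pt[u] for u, w in ps.items() if u in pt)


def ssk(s, t, k, lam):
    """Same kernel value via dynamic programming, without building features.

    kp[i][a][b] is an auxiliary quantity for the prefixes s[:a] and t[:b]:
    it sums over common subsequences of length i, with the decay measured
    from the start of each occurrence to the END of the prefix.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    m, n = len(s), len(t)
    kp = [[[0.0] * (n + 1) for _ in range(m + 1)] for _ in range(k)]
    for a in range(m + 1):
        for b in range(n + 1):
            kp[0][a][b] = 1.0  # the empty subsequence matches everywhere

    for i in range(1, k):
        for a in range(i, m + 1):
            kpp = 0.0  # running sum over positions of t matching s[a-1]
            for b in range(i, n + 1):
                match = lam * lam * kp[i - 1][a - 1][b - 1] if s[a - 1] == t[b - 1] else 0.0
                kpp = lam * kpp + match
                kp[i][a][b] = lam * kp[i][a - 1][b] + kpp

    # Close each subsequence on a matching final character pair.
    total = 0.0
    for a in range(k, m + 1):
        for b in range(k, n + 1):
            if s[a - 1] == t[b - 1]:
                total += lam * lam * kp[k - 1][a - 1][b - 1]
    return total
```

As a small check of the weighting, "cat" and "car" share only the length-2 subsequence "ca", which is contiguous in both strings, so ssk("cat", "car", 2, lam) equals lam ** 4. In practice kernel values are often normalised by the square root of ssk(s, s, k, lam) * ssk(t, t, k, lam) so that document length does not dominate; whether and how to normalise is left as a choice here.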

Text
String_JMLR02.pdf - Other
Download (220kB)

More information

Published date: 2002
Keywords: Kernels, Support Vector Machines, Text Classification
Organisations: Electronics & Computer Science

Identifiers

Local EPrints ID: 258968
URI: http://eprints.soton.ac.uk/id/eprint/258968
PURE UUID: 3bd24c09-1dae-4a94-9d7f-7e711e541de5

Catalogue record

Date deposited: 03 Mar 2004
Last modified: 14 Mar 2024 06:16

Contributors

Author: H. Lodhi
Author: C. Saunders
Author: J. Shawe-Taylor
Author: N. Cristianini
Author: C. Watkins
