The University of Southampton
University of Southampton Institutional Repository

A generalized language model as the combination of skipped n-grams and modified Kneser-Ney smoothing


Pickhardt, Rene, Gottron, Thomas, Körner, Martin, Wagner, Paul Georg, Speicher, Till and Staab, Steffen (2014) A generalized language model as the combination of skipped n-grams and modified Kneser-Ney smoothing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014). vol. 1, Association for Computational Linguistics. pp. 1145-1154. (doi:10.3115/v1/P14-1108).

Record type: Conference or Workshop Item (Paper)

Abstract

We introduce a novel approach for building language models based on a systematic, recursive exploration of skip n-gram models, which are interpolated using modified Kneser-Ney smoothing. Our approach generalizes language models, as it contains the classical interpolation with lower-order models as a special case. In this paper we motivate, formalize and present our approach. In an extensive empirical experiment over English text corpora, we demonstrate that our generalized language models lead to a substantial reduction of perplexity, between 3.1% and 12.7%, in comparison to traditional language models using modified Kneser-Ney smoothing. Furthermore, we investigate the behaviour over three other languages and a domain-specific corpus, where we observed consistent improvements. Finally, we show that the strength of our approach lies in particular in its ability to cope with sparse training data: using a very small training data set of only 736 KB of text, we achieve a perplexity reduction of as much as 25.7%.
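
As an illustration of the skip n-gram idea named in the abstract, the following is a minimal Python sketch and not the authors' implementation. It enumerates the variants of an n-gram in which some of the history words are replaced by a wildcard, and computes a relative perplexity reduction of the kind quoted above. The names `WILDCARD`, `skip_variants` and `perplexity_reduction` are hypothetical, and the smoothing and interpolation steps are not shown.

```python
from itertools import combinations

WILDCARD = "_"  # hypothetical placeholder symbol for a skipped word


def skip_variants(ngram):
    """Return every variant of `ngram` in which some subset of the history
    words (all but the last) is replaced by a wildcard.

    The last word (the one being predicted) is always kept.
    """
    history, target = ngram[:-1], ngram[-1]
    variants = []
    for k in range(len(history) + 1):
        # Choose which k history positions to skip.
        for skipped in combinations(range(len(history)), k):
            pattern = tuple(
                WILDCARD if i in skipped else word
                for i, word in enumerate(history)
            )
            variants.append(pattern + (target,))
    return variants


def perplexity_reduction(baseline, improved):
    """Relative perplexity reduction of `improved` over `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline


variants = skip_variants(("the", "quick", "brown", "fox"))
for v in variants:
    print(" ".join(v))
```

In this sketch, a variant such as `_ quick brown fox` plays the role of the ordinary lower-order trigram `quick brown fox`, which is one way to read the abstract's statement that classical interpolation is contained as a special case; variants such as `the _ brown fox` have no counterpart in a classical back-off chain.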

This record has no associated files available for download.

More information

e-pub ahead of print date: 22 May 2014
Published date: 22 May 2014
Venue - Dates: 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, United States, 2014-06-22 - 2014-06-27

Identifiers

Local EPrints ID: 413616
URI: http://eprints.soton.ac.uk/id/eprint/413616
PURE UUID: 41f851e3-63d2-4edf-8a99-b9970abd1c89
ORCID for Steffen Staab: orcid.org/0000-0002-0780-4154

Catalogue record

Date deposited: 30 Aug 2017 16:31
Last modified: 17 Mar 2024 03:38

Contributors

Author: Rene Pickhardt
Author: Thomas Gottron
Author: Martin Körner
Author: Paul Georg Wagner
Author: Till Speicher
Author: Steffen Staab
