The University of Southampton
University of Southampton Institutional Repository

Chinese Unknown Word Identification Based on Local Bigram Model

Wang, Zhuoran and Liu, Ting (2005) Chinese Unknown Word Identification Based on Local Bigram Model. International Journal of Computer Processing of Oriental Languages, Vol. 1 (3), 185-196. (doi:10.1142/S0219427905001286).

Record type: Article

Abstract

This paper presents a Chinese unknown word identification system based on a local bigram model. Our word segmentation system generally employs a statistical unigram model. To identify unknown words, however, we exploit their contextual information and apply a bigram model locally. We combine the two models, which have different dimensions, by adjusting an interpolation value derived from a smoothing method. As a simplification of a full bigram model, the method is simple and feasible: its algorithmic complexity is low and it does not need a large training corpus. Experimental results show that the solution is effective.
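
The abstract does not give the interpolation formula, so the following is only a minimal sketch of how a unigram and a bigram estimate can be combined by generic linear interpolation, assuming pre-segmented training sentences. The function names (train_counts, interpolated_prob), the weight lam and the toy corpus are hypothetical and are not taken from the paper.

```python
from collections import Counter


def train_counts(sentences):
    """Count unigrams, bigrams and bigram contexts over pre-segmented sentences."""
    unigrams, bigrams, contexts = Counter(), Counter(), Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
        # A word is a bigram context only when another word follows it.
        contexts.update(words[:-1])
    return unigrams, bigrams, contexts


def interpolated_prob(prev, word, unigrams, bigrams, contexts, lam=0.7):
    """Assumed form: lam * P(word | prev) + (1 - lam) * P(word)."""
    total = sum(unigrams.values())
    p_uni = unigrams[word] / total if total else 0.0
    # Fall back to zero bigram mass when the context was never observed.
    p_bi = bigrams[(prev, word)] / contexts[prev] if contexts[prev] else 0.0
    return lam * p_bi + (1.0 - lam) * p_uni


# Hypothetical toy corpus of already segmented sentences.
corpus = [["我", "爱", "北京"], ["我", "爱", "上海"]]
unigrams, bigrams, contexts = train_counts(corpus)
score = interpolated_prob("爱", "北京", unigrams, bigrams, contexts, lam=0.5)
```

Setting lam to 0 reduces the score to the plain unigram estimate, so in a sketch like this the bigram term could be switched on only around candidate unknown words, which is one way to read "applying a bigram model locally".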

This record has no associated files available for download.

More information

Published date: September 2005
Keywords: Unknown word identification, Chinese word segmentation, Local bigram model
Organisations: Electronics & Computer Science

Identifiers

Local EPrints ID: 261543
URI: http://eprints.soton.ac.uk/id/eprint/261543
ISSN: 0219-4279
PURE UUID: 8d7c2071-9284-40cc-a831-3698c0a4cfa4

Catalogue record

Date deposited: 13 Nov 2005
Last modified: 14 Mar 2024 06:54

Contributors

Author: Zhuoran Wang
Author: Ting Liu
