Chinese Unknown Word Identification Based on Local Bigram Model
185-196
Wang, Zhuoran
ee2cbf1e-1250-46a7-8787-82920a656570
Liu, Ting
d86f6607-a0c9-4877-b4c6-fa9705fe6b6b
September 2005
Wang, Zhuoran and Liu, Ting (2005) Chinese Unknown Word Identification Based on Local Bigram Model. International Journal of Computer Processing of Oriental Languages, Vol. 1 (3), 185-196. (doi:10.1142/S0219427905001286).
Abstract
This paper presents a Chinese unknown word identification system based on a local bigram model. Our word segmentation system generally employs a statistical unigram model. To identify unknown words, however, we exploit their contextual information and apply a bigram model locally. We combine these two models of different dimensions by adjusting an interpolation value derived from a smoothing method. As a simplification of a full bigram model, the method is both simple and feasible: its algorithmic complexity is low and it needs relatively little training data. Our experimental results show that the approach is effective.
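The abstract does not give the interpolation formula, so the following is only a minimal sketch of the general idea, assuming standard linear (Jelinek-Mercer style) interpolation between maximum-likelihood unigram and bigram estimates. The function names, the `lam` weight, and the toy corpus are hypothetical and are not taken from the paper.

```python
from collections import Counter


def train_counts(sentences):
    """Count unigrams, bigrams, and bigram histories in pre-segmented sentences."""
    unigrams, bigrams, histories = Counter(), Counter(), Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
        # A word is a bigram history only when another word follows it.
        histories.update(words[:-1])
    return unigrams, bigrams, histories


def interpolated_prob(prev, word, unigrams, bigrams, histories, lam=0.5):
    """Return lam * P(word | prev) + (1 - lam) * P(word).

    Assumption: plain linear interpolation with a fixed weight `lam`;
    the paper's actual weighting scheme is not described in this record.
    """
    total = sum(unigrams.values())
    p_uni = unigrams[word] / total if total else 0.0
    p_bi = bigrams[(prev, word)] / histories[prev] if histories[prev] else 0.0
    return lam * p_bi + (1.0 - lam) * p_uni


# Hypothetical toy corpus of already-segmented sentences.
corpus = [["我", "爱", "北京"], ["我", "爱", "你"]]
uni, bi, hist = train_counts(corpus)
score = interpolated_prob("爱", "北京", uni, bi, hist, lam=0.5)
```

Setting `lam` to 0 reduces the score to the unigram estimate, and raising it gives more weight to the local context, which is the kind of adjustment the abstract refers to.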
This record has no associated files available for download.
More information
Published date: September 2005
Keywords:
Unknown word identification, Chinese word segmentation, Local bigram model
Organisations:
Electronics & Computer Science
Identifiers
Local EPrints ID: 261543
URI: http://eprints.soton.ac.uk/id/eprint/261543
ISSN: 0219-4279
PURE UUID: 8d7c2071-9284-40cc-a831-3698c0a4cfa4
Catalogue record
Date deposited: 13 Nov 2005
Last modified: 14 Mar 2024 06:54
Contributors
Author: Zhuoran Wang
Author: Ting Liu