《Periodical of Ocean University of China》 2017-12

LDA-Based Approach for Automatic Detection of Multilingual Text

ZHANG Wei; LI Wen; CHEN Dan; LI Zeng-Jie (College of Information Science and Engineering Technology, Ocean University of China)
The paper proposes an unsupervised multilingual identification method based on Latent Dirichlet Allocation (LDA) for the automatic detection of multilingual text. Approaching the problem from the perspective of speech recognition, it adapts LDA to language identification, using n-grams as features. Rather than selecting the number of topics by perplexity, as is usual, the paper introduces a selection method based on minimum description length (MDL) and adopts collapsed Gibbs sampling as the learning method to build an unsupervised, LDA-based language identifier. The mitlm toolkit is used to generate n-gram count files and to establish a character-level language model for multilingual identification. The LDA model is then compared with three other language identification systems. The experiments use nine European languages from the ECI/MCI benchmark, and the paper analyzes the results in detail; the method achieves good accuracy and recall without any annotation.
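As a rough illustration of the pipeline the abstract describes, the following minimal sketch extracts character n-grams and runs a textbook collapsed Gibbs sampler for LDA over them. It is not the authors' implementation: the function names `char_ngrams` and `collapsed_gibbs_lda`, the hyperparameters `alpha` and `beta`, and the toy sentences are all assumptions made here for illustration. N-gram counting is done in plain Python rather than with mitlm, and the number of topics is fixed by hand rather than chosen by the paper's MDL criterion.

```python
import random


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of `text` (empty if too short)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def collapsed_gibbs_lda(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA over lists of tokens.

    alpha and beta are assumed symmetric Dirichlet priors.
    Returns (doc_topic_counts, topic_word_counts, vocab).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    ndk = [[0] * num_topics for _ in docs]               # document-topic counts
    nkw = [[0] * vocab_size for _ in range(num_topics)]  # topic-word counts
    nk = [0] * num_topics                                # tokens assigned to each topic
    z = []                                               # topic assignment per token

    # Random initialisation of topic assignments.
    for d, doc in enumerate(docs):
        assignments = []
        for w in doc:
            k = rng.randrange(num_topics)
            assignments.append(k)
            ndk[d][k] += 1
            nkw[k][word_id[w]] += 1
            nk[k] += 1
        z.append(assignments)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                # Remove the current token from the counts.
                k = z[d][i]
                ndk[d][k] -= 1
                nkw[k][wid] -= 1
                nk[k] -= 1
                # Conditional p(z = t | all other assignments), up to a constant.
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][wid] + beta) / (nk[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                # Add the token back under its newly sampled topic.
                z[d][i] = k
                ndk[d][k] += 1
                nkw[k][wid] += 1
                nk[k] += 1
    return ndk, nkw, vocab


if __name__ == "__main__":
    # Hypothetical toy corpus; each sentence is treated as one document.
    sentences = ["the cat sat on the mat", "der hund liegt auf dem boden"]
    docs = [char_ngrams(s, 3) for s in sentences]
    ndk, _, _ = collapsed_gibbs_lda(docs, num_topics=2, iters=50)
    for s, counts in zip(sentences, ndk):
        print(s, "-> dominant topic", counts.index(max(counts)))
```

In this sketch each topic plays the role of a latent language, so the dominant topic of a document serves as its unsupervised language label. With a corpus this small the assignment is not reliable; the example only shows the mechanics of the sampler.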
【Fund】: Supported by the Natural Science Foundation of Shandong Province (ZR2012FM016)
【Category Index】: TP391.1