《Journal of Shenyang University of Technology》 2017-01

Data cleaning technology based on N-Gram algorithm

MA Ping-quan; SONG Kai; JI Jian-wei (College of Information and Electrical Engineering, Shenyang Agricultural University; School of Automation and Electrical Engineering, Shenyang Ligong University)
To address the large number of approximately duplicate records found in databases, the attribute structure of such records and the causes of their occurrence were analyzed. The N-Gram algorithm was applied to each data record to compute a key value, namely the N-Gram value, which represents the attributes of that record. The records in the database were then sorted by key value to form an ordered database, and the similarity between records was calculated. The records identified as approximate duplicates were cleaned using a permutation-and-combination cleaning approach. The experimental results show that the N-Gram algorithm effectively improves both the recall and the precision of approximately duplicate record detection.
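The pipeline described in the abstract (compute a key per record, sort by key, then compare records for similarity) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not say how the N-Gram value is computed, so the key used here (the sum of corpus-wide bigram frequencies), the Dice similarity measure, the sliding window, and the names `ngram_key`, `dice_similarity`, and `find_near_duplicates` are all assumptions made for illustration. The cleaning step is not shown.

```python
from collections import Counter


def ngrams(text, n=2):
    """Return the character n-grams of a lowercased, whitespace-free string."""
    s = "".join(text.lower().split())
    if len(s) < n:
        return [s] if s else []
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def ngram_key(text, freq, n=2):
    """Hypothetical key: sum of corpus-wide frequencies of the record's n-grams."""
    return sum(freq[g] for g in ngrams(text, n))


def dice_similarity(a, b, n=2):
    """Dice coefficient over n-gram multisets, a value in [0, 1]."""
    ca, cb = Counter(ngrams(a, n)), Counter(ngrams(b, n))
    total = sum(ca.values()) + sum(cb.values())
    if total == 0:
        return 1.0  # two empty records are treated as identical
    overlap = sum((ca & cb).values())
    return 2.0 * overlap / total


def find_near_duplicates(records, n=2, window=3, threshold=0.8):
    """Sort records by key, then compare each with its next window-1 neighbours."""
    freq = Counter(g for r in records for g in ngrams(r, n))
    order = sorted(range(len(records)),
                   key=lambda i: (ngram_key(records[i], freq, n), i))
    pairs = []
    for pos, i in enumerate(order):
        for j in order[pos + 1:pos + window]:
            if dice_similarity(records[i], records[j], n) >= threshold:
                pairs.append((min(i, j), max(i, j)))
    return sorted(pairs)


records = [
    "John Smith, Shenyang",
    "Jon Smith, Shenyang",
    "Alice Wang, Beijing",
]
duplicates = find_near_duplicates(records)
```

Sorting by key first means only neighbouring records are compared, which avoids comparing every pair in a large table. The window size and threshold are tuning parameters chosen arbitrarily here.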
【Fund】: Scientific Research Project of the Education Department of Liaoning Province (LG201610)
【Category Index】: TP311.13