《Journal of Chinese Information Processing》 2011-01

Distributed Index for Near Duplicate Detection

ZHANG Yue, YU Haomin, ZHANG Qi, HUANG Xuanjing (School of Computer Science and Technology, Fudan University, Shanghai 201203, China)
Detecting near-duplicate documents effectively in large corpora has been an active research topic in recent years. Near-duplicate detection algorithms usually rely on an inverted index to improve efficiency. However, as the corpus grows, a single-machine implementation of the index becomes intractable, so a distributed index structure is required. To keep up with rapidly increasing data sizes, such a structure must be both efficient and scalable. In this paper, we compare two distributed index structures, the Term-Split Index and the Doc-Split Index, and provide a Map-Reduce implementation of each. Based on these two index structures, we propose two approaches, the Term-Split Approach and the Doc-Split Approach, for detecting near-duplicate documents under the Map-Reduce paradigm. Finally, we compare the performance of the two approaches on the WT10G corpus. Experimental results show that the Doc-Split Approach is more efficient and scales better.
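The difference between the two index structures can be illustrated with a small single-process sketch. It is not the paper's Map-Reduce implementation, which the abstract does not describe. It assumes the usual meaning of the terms: a term-split index assigns each term's complete posting list to one partition, while a doc-split index assigns each document to one partition and builds a local inverted index there. The function names, the whitespace tokenizer, and the CRC32-based partitioning are all illustrative assumptions.

```python
from collections import defaultdict
import zlib


def tokenize(text):
    # Hypothetical tokenizer: lowercase, split on whitespace.
    return text.lower().split()


def stable_hash(s):
    # CRC32 is deterministic across runs, unlike the built-in hash() for str.
    return zlib.crc32(s.encode("utf-8"))


def build_term_split(docs, num_parts):
    """Partition by term: each term's full posting list lives in one partition."""
    parts = [defaultdict(set) for _ in range(num_parts)]
    for doc_id, text in docs.items():
        for term in set(tokenize(text)):
            parts[stable_hash(term) % num_parts][term].add(doc_id)
    return parts


def build_doc_split(docs, num_parts):
    """Partition by document: each partition indexes only its own documents."""
    parts = [defaultdict(set) for _ in range(num_parts)]
    for doc_id, text in docs.items():
        p = stable_hash(str(doc_id)) % num_parts
        for term in set(tokenize(text)):
            parts[p][term].add(doc_id)
    return parts


def lookup(parts, term):
    """Union a term's postings across all partitions."""
    result = set()
    for part in parts:
        result |= part.get(term, set())
    return result
```

Under these assumptions, a term lookup touches exactly one partition in the term-split layout but must be merged across all partitions in the doc-split layout; conversely, adding a document updates a single partition in the doc-split layout but potentially many in the term-split layout.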
【Key Words】: near duplicate detection; copy detection; Map-Reduce
【Fund】: National Natural Science Foundation of China (61073069, 61003092); National High Technology Research and Development Program of China (863 Program) (2009AA01A346)
【Category Index】: TP391.3