《Acta Electronica Sinica》 2000-S1

Research and Evaluation of Near replicas of Web Pages Detection Algorithms

WANG Jianyong, XIE Zhengmao, LEI Ming, LI Xiaoming (Dept. of Computer Science & Technology, Peking University, Beijing 100871, China)
Many documents are replicated across the World Wide Web. Finding near replicas of web pages efficiently and accurately has become an important topic in search engine research, since detecting them can improve the quality of search services. In this paper, we propose five near-replica detection algorithms for search engines that rely on keyword matching, and evaluate them using the WebGather search engine system. We also compare our method with one of the most popular copy detection mechanisms. Our method has been successfully adopted to remove near replicas of web pages in WebGather, and it can also be widely used in building digital libraries.
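The abstract does not describe the five algorithms, so the following is only a minimal sketch of the general idea of keyword-based near-replica detection, not the authors' method. It assumes that each page is summarized by its most frequent words and that two pages count as near replicas when those keyword sets overlap strongly. The function names, the `top_n` and `threshold` parameters, and the use of Jaccard overlap are all hypothetical choices made for illustration.

```python
import re
from collections import Counter


def keyword_signature(text, top_n=5):
    """Return the top_n most frequent words (3+ letters) as a set.

    Assumption: word frequency stands in for whatever keyword
    extraction a real search engine would use.
    """
    words = re.findall(r"[a-z]{3,}", text.lower())
    counts = Counter(words)
    # Sort by descending frequency, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return frozenset(word for word, _ in ranked[:top_n])


def keyword_overlap(sig_a, sig_b):
    """Jaccard overlap between two keyword signatures, in [0, 1]."""
    if not sig_a and not sig_b:
        return 1.0
    return len(sig_a & sig_b) / len(sig_a | sig_b)


def is_near_replica(text_a, text_b, top_n=5, threshold=0.8):
    """Flag two pages as near replicas when their signatures overlap enough.

    Assumption: the threshold is illustrative, not a value from the paper.
    """
    sig_a = keyword_signature(text_a, top_n)
    sig_b = keyword_signature(text_b, top_n)
    return keyword_overlap(sig_a, sig_b) >= threshold
```

The regular expression here only handles unaccented Latin-alphabet text; Chinese pages would need a word segmenter in place of `re.findall`. Comparing fixed-size keyword sets keeps each comparison cheap, at the cost of missing replicas whose differences happen to shift the top keywords.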
【Fund】: National 973 Major Basic Research Development Program (No. G1999032706)
【Category Index】: TP393