The Parallel Implementation and Application of an Improved K-means Algorithm
LI Xiao-yu; YU Li-ying; LEI Hang; TANG Xue-fei; School of Information and Software Engineering, University of Electronic Science and Technology of China; Chengdu COMSYS Information Tech. Co., Ltd.
With the growth of massive data, clustering, one of the core problems in big data research, faces increasing difficulties such as high computational complexity and insufficient computing resources. This paper proposes an improved parallel K-means algorithm based on Hadoop. The traditional K-means algorithm often converges to a local optimum because its initial centers are chosen at random. To overcome this problem, we use the Canopy algorithm to initialize the cluster centers and apply K-means within each canopy; clusters are then merged across canopies. The resulting clustering is stable and requires fewer iterations. In addition, we present parallel implementation methods and strategies for the improved algorithm based on the MapReduce distributed computing model. We also introduce a new text clustering method that improves the similarity measure. Experimental results demonstrate the validity and scalability of our method.
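The idea of seeding K-means with Canopy centers can be illustrated with a minimal single-machine sketch. It is not the Hadoop implementation described in this paper: the function names `canopies` and `kmeans`, the thresholds `t1` and `t2`, the Euclidean distance, and the choice of the first remaining point as each canopy center are all assumptions made for illustration. The sketch covers only the seeding step; restricting K-means to canopies, merging clusters across canopies, and the MapReduce decomposition are omitted because the abstract does not specify them.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopies(points, t1, t2, dist=euclidean):
    """Build canopies with a loose threshold t1 and a tight threshold t2 (t1 > t2).

    Returns a list of (center, members) pairs. Assumption: the first
    remaining point is taken as the next center (it could also be random).
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    result = []
    while remaining:
        center = remaining.pop(0)
        # Every point within the loose threshold belongs to this canopy.
        members = [p for p in points if dist(center, p) <= t1]
        result.append((center, members))
        # Points within the tight threshold can no longer become centers.
        remaining = [p for p in remaining if dist(center, p) > t2]
    return result


def kmeans(points, centers, max_iter=100, dist=euclidean):
    """Plain Lloyd-style K-means starting from the given centers."""
    centers = [tuple(c) for c in centers]
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            # Assign each point to its nearest current center.
            idx = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[idx].append(p)
        # Recompute each center as the mean of its cluster; keep the old
        # center if a cluster ends up empty.
        new_centers = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    seeds = [center for center, _ in canopies(data, t1=5.0, t2=3.0)]
    final_centers, final_clusters = kmeans(data, seeds)
    print(seeds)
    print(final_centers)
```

In this sketch the number of clusters is not fixed in advance; it follows from how many canopies the thresholds produce, and the seeds are deterministic for a given input order rather than randomly chosen.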