Open Access Article

Survey of Clustering Methods for Large Scale Dataset

Anupama Jawale, Ganesh Magar

Section: Survey Paper, Product Type: Journal Paper
Volume 7, Issue 5, Page no. 1338-1344, May 2019

CrossRef-DOI:   https://doi.org/10.26438/ijcse/v7i5.13381344

Online published on May 31, 2019

Copyright © Anupama Jawale, Ganesh Magar. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


How to Cite this Paper

IEEE Style Citation: Anupama Jawale, Ganesh Magar, “Survey of Clustering Methods for Large Scale Dataset,” International Journal of Computer Sciences and Engineering, Vol.7, Issue.5, pp.1338-1344, 2019.

MLA Style Citation: Anupama Jawale, Ganesh Magar. "Survey of Clustering Methods for Large Scale Dataset." International Journal of Computer Sciences and Engineering 7.5 (2019): 1338-1344.

APA Style Citation: Anupama Jawale, Ganesh Magar (2019). Survey of Clustering Methods for Large Scale Dataset. International Journal of Computer Sciences and Engineering, 7(5), 1338-1344.

BibTex Style Citation:
@article{Jawale_2019,
author = {Anupama Jawale and Ganesh Magar},
title = {Survey of Clustering Methods for Large Scale Dataset},
journal = {International Journal of Computer Sciences and Engineering},
issue_date = {5 2019},
volume = {7},
number = {5},
month = {5},
year = {2019},
issn = {2347-2693},
pages = {1338--1344},
url = {https://www.ijcseonline.org/full_paper_view.php?paper_id=4410},
doi = {10.26438/ijcse/v7i5.13381344},
publisher = {IJCSE, Indore, INDIA},
}

RIS Style Citation:
TY  - JOUR
DO  - 10.26438/ijcse/v7i5.13381344
UR  - https://www.ijcseonline.org/full_paper_view.php?paper_id=4410
TI  - Survey of Clustering Methods for Large Scale Dataset
T2  - International Journal of Computer Sciences and Engineering
AU  - Jawale, Anupama
AU  - Magar, Ganesh
PY  - 2019
DA  - 2019/05/31
PB  - IJCSE, Indore, INDIA
SP  - 1338
EP  - 1344
IS  - 5
VL  - 7
SN  - 2347-2693
ER  -

Abstract

This study compares several clustering algorithms and evaluates their performance on large datasets. Analysis of large datasets is required for effective knowledge discovery, and data mining and machine learning techniques are often used to refine such datasets. The traditional approach to processing large datasets is inefficient, so a fast parallel processing environment needs to be considered to improve performance. The study focuses on four clustering algorithms, K-Means, Ward's, PAM and CLARA, and examines their performance on larger datasets in GeoJson and CSV formats. The statistical techniques of medoid and centroid computation are used in the experimental work, with different sample sizes, to measure the performance of the algorithms. The experiments were carried out in R on the Azure cloud, using an HDInsight cluster for parallel computing. The study provides evidence that CLARA shows constant medoid computations across different sample sizes, compared with PAM and K-Means. The silhouette widths of CLARA (0.41) and PAM (0.36) indicate that well-defined clusters are present in the CLARA results. In the parallel processing environment, and in comparison with Ward's algorithm on larger datasets, execution time was reduced by 45.72% for DBSCAN, 99.95% for K-Means and 99.96% for CLARA.
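
The two measures quoted in the abstract, average silhouette width and percentage reduction in run time, can be illustrated with a minimal sketch. It is written in Python with the standard library only, whereas the study itself used R, and it is not the authors' implementation. The function names, the toy points and the timings are assumptions made for illustration; none of the numbers come from the paper.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_silhouette_width(points, labels):
    """Mean of s(i) = (b - a) / max(a, b) over all points.

    a is the mean distance from point i to the other members of its own
    cluster; b is the smallest mean distance from i to any other cluster.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette width needs at least two clusters")

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # convention: a singleton cluster contributes s(i) = 0
        a = sum(euclidean(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        if denom > 0:  # all-identical points would give 0/0; treat as s(i) = 0
            total += (b - a) / denom
    return total / len(points)


def percent_time_reduction(baseline_seconds, new_seconds):
    """Run-time reduction relative to a baseline, in percent."""
    return 100.0 * (baseline_seconds - new_seconds) / baseline_seconds


# Toy data (hypothetical): two well-separated groups of three points each.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]  # matches the visible grouping
bad_labels = [0, 1, 0, 1, 0, 1]   # deliberately mixes the two groups

good_width = average_silhouette_width(toy_points, good_labels)
bad_width = average_silhouette_width(toy_points, bad_labels)
```

A width closer to 1 indicates compact, well-separated clusters, which is the sense in which the abstract compares 0.41 for CLARA with 0.36 for PAM. The `percent_time_reduction` helper shows the arithmetic behind figures such as "99.95%": the baseline would be the run time of Ward's algorithm and the second argument the run time of the algorithm being compared.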

Keywords / Index Terms

Azure, CLARA, Clustering Algorithms, GeoJson dataset, PAM, R Studio, Ward’s Method
