A Brief Account of Iterative Big Data Clustering Algorithms

M. Shankar Lingam¹, A. M. Sudhakara²

University of Mysore, Manasa Gangotri, Mysore, India.
Director, CIST, University of Mysore, India.

Correspondence should be addressed to: sudhakara.mysore@gmail.com.

Section: Review Paper, Product Type: Journal Paper
Volume 5, Issue 10, Pages 292-301, October 2017

CrossRef-DOI: https://doi.org/10.26438/ijcse/v5i10.292301

Published online on Oct 30, 2017

Copyright © M. Shankar Lingam, A. M. Sudhakara. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to Cite this Paper

IEEE Citation

IEEE Style Citation: M. Shankar Lingam, A. M. Sudhakara, “A Brief Account of Iterative Big Data Clustering Algorithms,” International Journal of Computer Sciences and Engineering, Vol.5, Issue.10, pp.292-301, 2017.

MLA Citation

MLA Style Citation: M. Shankar Lingam, A. M. Sudhakara "A Brief Account of Iterative Big Data Clustering Algorithms." International Journal of Computer Sciences and Engineering 5.10 (2017): 292-301.

APA Citation

APA Style Citation: M. Shankar Lingam, A. M. Sudhakara (2017). A Brief Account of Iterative Big Data Clustering Algorithms. International Journal of Computer Sciences and Engineering, 5(10), 292-301.

BibTex Citation

BibTex Style Citation:
@article{Lingam_2017,
author = {M. Shankar Lingam and A. M. Sudhakara},
title = {A Brief Account of Iterative Big Data Clustering Algorithms},
journal = {International Journal of Computer Sciences and Engineering},
issue_date = {10 2017},
volume = {5},
number = {10},
month = {10},
year = {2017},
issn = {2347-2693},
pages = {292--301},
url = {https://www.ijcseonline.org/full_paper_view.php?paper_id=1518},
doi = {10.26438/ijcse/v5i10.292301},
publisher = {IJCSE, Indore, INDIA},
}

RIS Citation

RIS Style Citation:
TY  - JOUR
DO  - 10.26438/ijcse/v5i10.292301
UR  - https://www.ijcseonline.org/full_paper_view.php?paper_id=1518
TI  - A Brief Account of Iterative Big Data Clustering Algorithms
T2  - International Journal of Computer Sciences and Engineering
AU  - M. Shankar Lingam
AU  - A. M. Sudhakara
PY  - 2017
DA  - 2017/10/30
PB  - IJCSE, Indore, INDIA
SP  - 292
EP  - 301
IS  - 10
VL  - 5
SN  - 2347-2693
ER  -

Abstract

Today, most organizations must deal with large and rapidly growing quantities of data. To cope with these explosively growing volumes, one must be able to extract, analyze, and process information in a timely manner. Clustering has therefore been recognized as an essential tool for analyzing big data. Technological progress, particularly in finance and business informatics, poses a major challenge for large-scale data clustering. To address this issue, researchers have developed parallel clustering algorithms based on parallel programming models. MapReduce is one of the most commonly used frameworks for this purpose, and it has received considerable attention thanks to its flexibility, fault tolerance, and ease of programming. However, its performance suffers for iterative applications. This paper gives a detailed review of iterative frameworks that can help MapReduce overcome its limitations for iterative algorithms.
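
To make the iteration issue concrete, the following is a minimal single-machine sketch of k-means written as repeated map and reduce phases. It is an illustration only and is not taken from the paper: the function names (`map_phase`, `reduce_phase`, `kmeans_mapreduce`), the sample points, and the convergence tolerance are all assumptions chosen for the example. Each pass of the driver loop stands in for one complete MapReduce job; on a real cluster every such job would typically be scheduled afresh and would reread its input, which is the kind of overhead that iterative frameworks aim to reduce.

```python
from collections import defaultdict


def map_phase(points, centroids):
    """Map: emit (index of nearest centroid, point) for every input point."""
    for p in points:
        nearest = min(
            range(len(centroids)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
        )
        yield nearest, p


def reduce_phase(pairs, centroids):
    """Reduce: average the points assigned to each centroid index."""
    groups = defaultdict(list)
    for key, p in pairs:
        groups[key].append(p)
    new_centroids = list(centroids)  # a centroid with no points stays where it is
    for key, members in groups.items():
        new_centroids[key] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return new_centroids


def kmeans_mapreduce(points, centroids, max_iters=20, tol=1e-9):
    """Driver loop: every pass corresponds to one full map + reduce job."""
    jobs = 0
    for _ in range(max_iters):
        new_centroids = reduce_phase(map_phase(points, centroids), centroids)
        jobs += 1
        # Largest squared movement of any centroid in this pass.
        shift = max(
            sum((a - b) ** 2 for a, b in zip(old, new))
            for old, new in zip(centroids, new_centroids)
        )
        centroids = new_centroids
        if shift <= tol:
            break
    return centroids, jobs


# Hypothetical toy data: two well-separated groups in two dimensions.
points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
centroids, jobs = kmeans_mapreduce(points, [(0.0, 0.0), (5.0, 5.0)])
print(centroids, jobs)
```

Even on this toy input the driver needs a second job just to confirm that the centroids have stopped moving, while the input points never change between passes. That unchanged input is what makes caching it across iterations attractive.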

Keywords / Index Terms

clustering, framework, MapReduce
