Open Access Article

An Improved Shuffling Approach Towards Skew Mitigation in Mapreduce

N. K. Seera, S. Taruna

Section: Research Paper, Product Type: Journal Paper
Volume-6, Issue-7, Page no. 819-826, Jul-2018

CrossRef-DOI: https://doi.org/10.26438/ijcse/v6i7.819826

Online published on Jul 31, 2018

Copyright © N. K. Seera, S. Taruna. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to Cite this Paper

IEEE Style Citation: N. K. Seera, S. Taruna, “An Improved Shuffling Approach Towards Skew Mitigation in Mapreduce,” International Journal of Computer Sciences and Engineering, Vol.6, Issue.7, pp.819-826, 2018.

MLA Style Citation: N. K. Seera, S. Taruna "An Improved Shuffling Approach Towards Skew Mitigation in Mapreduce." International Journal of Computer Sciences and Engineering 6.7 (2018): 819-826.

APA Style Citation: N. K. Seera, S. Taruna (2018). An Improved Shuffling Approach Towards Skew Mitigation in Mapreduce. International Journal of Computer Sciences and Engineering, 6(7), 819-826.

BibTex Style Citation:
@article{Seera_2018,
author = {N. K. Seera and S. Taruna},
title = {An Improved Shuffling Approach Towards Skew Mitigation in Mapreduce},
journal = {International Journal of Computer Sciences and Engineering},
issue_date = {7 2018},
volume = {6},
number = {7},
month = {7},
year = {2018},
issn = {2347-2693},
pages = {819--826},
url = {https://www.ijcseonline.org/full_paper_view.php?paper_id=2518},
doi = {10.26438/ijcse/v6i7.819826},
publisher = {IJCSE, Indore, INDIA},
}

RIS Style Citation:
TY  - JOUR
DO  - 10.26438/ijcse/v6i7.819826
UR  - https://www.ijcseonline.org/full_paper_view.php?paper_id=2518
TI  - An Improved Shuffling Approach Towards Skew Mitigation in Mapreduce
T2  - International Journal of Computer Sciences and Engineering
AU  - Seera, N. K.
AU  - Taruna, S.
PY  - 2018
DA  - 2018/07/31
PB  - IJCSE, Indore, INDIA
SP  - 819
EP  - 826
IS  - 7
VL  - 6
SN  - 2347-2693
ER  -

Abstract

In MapReduce applications, map tasks are generally launched in parallel and assigned equal-sized input splits, so map-side skew is rare. Reduce-side skew, in contrast, is much more challenging because the shuffling of the intermediate data, the partition sizes, and the assignment of partitions to worker nodes cannot be determined at early stages. It is therefore one of the critical problems in the MapReduce model, one that should be studied thoroughly and for which possible solutions need to be framed. This paper studies the various causes of skew and the common approaches used for skew mitigation in real-world applications. It presents a novel approach to reduce-side skew in which the large volume of intermediate data is preprocessed by intermediate nodes to reduce the size of the intermediate data per key. The partial results from the intermediate nodes are collected, aggregated, and sent to the final worker nodes to generate the final output. The proposed model is applicable to applications in which there is no interdependency between the values of similar keys. This approach contrasts with one in which the data of skewed nodes is dynamically repartitioned into small fragments and assigned to idle nodes in the cluster.
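
The idea of preprocessing on intermediate nodes can be illustrated with a minimal single-process sketch. It is not the authors' implementation: the function names (`pre_aggregate`, `final_reduce`, `split_round_robin`), the round-robin spreading of records, and the use of addition as the combine function are all assumptions made for illustration. The sketch reads "no interdependency between values" as meaning that the combine function is associative and commutative, so that partial results can be merged in any order.

```python
from operator import add
from typing import Callable, Dict, Hashable, Iterable, List, Tuple

Pair = Tuple[Hashable, int]
Combine = Callable[[int, int], int]


def pre_aggregate(pairs: Iterable[Pair], combine: Combine) -> Dict[Hashable, int]:
    """Hypothetical intermediate node: fold the values of each key into one partial result."""
    partial: Dict[Hashable, int] = {}
    for key, value in pairs:
        partial[key] = combine(partial[key], value) if key in partial else value
    return partial


def final_reduce(partials: Iterable[Dict[Hashable, int]], combine: Combine) -> Dict[Hashable, int]:
    """Hypothetical final worker: merge the partial results from all intermediate nodes."""
    result: Dict[Hashable, int] = {}
    for partial in partials:
        for key, value in partial.items():
            result[key] = combine(result[key], value) if key in result else value
    return result


def split_round_robin(pairs: List[Pair], n: int) -> List[List[Pair]]:
    """Assumed distribution policy: spread records evenly over n intermediate nodes."""
    chunks: List[List[Pair]] = [[] for _ in range(n)]
    for i, pair in enumerate(pairs):
        chunks[i % n].append(pair)
    return chunks


# Skewed map output: key "a" carries most of the records.
map_output: List[Pair] = [("a", 1)] * 9 + [("b", 1)] * 2 + [("c", 1)]

chunks = split_round_robin(map_output, 3)
partials = [pre_aggregate(chunk, add) for chunk in chunks]
totals = final_reduce(partials, add)

records_before = len(map_output)               # records a single skewed reducer would receive
records_after = sum(len(p) for p in partials)  # records sent on to the final workers
print(totals, records_before, records_after)
```

In this toy input the final stage receives 6 partial records instead of 12 raw ones, and the heavy key "a" arrives as three partial sums rather than nine values. A reduce function whose result depends on seeing all values of a key together (a median, for example) would not fit this pattern without further work.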

Keywords / Index Terms

MapReduce, Skew Mitigation, Shuffling, Partitioning
