
Handling of Class Imbalanced Problem in Big Data Sets: An Experimental Evaluation (UCPMOT)

S.S. Patil, S. P. Sonavane

Section: Research Paper, Product Type: Journal Paper
Volume-06, Issue-01, Page no. 1-9, Feb-2018

CrossRef-DOI:   https://doi.org/10.26438/ijcse/v6si1.19

Online published on Feb 28, 2018

Copyright © S.S. Patil, S. P. Sonavane. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


How to Cite this Paper

IEEE Style Citation: S.S. Patil, S. P. Sonavane, “Handling of Class Imbalanced Problem in Big Data Sets: An Experimental Evaluation (UCPMOT),” International Journal of Computer Sciences and Engineering, Vol.06, Issue.01, pp.1-9, 2018.

MLA Style Citation: S.S. Patil, S. P. Sonavane "Handling of Class Imbalanced Problem in Big Data Sets: An Experimental Evaluation (UCPMOT)." International Journal of Computer Sciences and Engineering 06.01 (2018): 1-9.

APA Style Citation: S.S. Patil, S. P. Sonavane, (2018). Handling of Class Imbalanced Problem in Big Data Sets: An Experimental Evaluation (UCPMOT). International Journal of Computer Sciences and Engineering, 06(01), 1-9.

BibTex Style Citation:
@article{Patil_2018,
author = {S.S. Patil and S. P. Sonavane},
title = {Handling of Class Imbalanced Problem in Big Data Sets: An Experimental Evaluation (UCPMOT)},
journal = {International Journal of Computer Sciences and Engineering},
issue_date = {2 2018},
volume = {06},
Issue = {01},
month = {2},
year = {2018},
issn = {2347-2693},
pages = {1-9},
url = {https://www.ijcseonline.org/full_spl_paper_view.php?paper_id=183},
doi = {https://doi.org/10.26438/ijcse/v6i1.19},
publisher = {IJCSE, Indore, INDIA}
}

RIS Style Citation:
TY - JOUR
DO - https://doi.org/10.26438/ijcse/v6i1.19
UR - https://www.ijcseonline.org/full_spl_paper_view.php?paper_id=183
TI - Handling of Class Imbalanced Problem in Big Data Sets: An Experimental Evaluation (UCPMOT)
T2 - International Journal of Computer Sciences and Engineering
AU - S.S. Patil
AU - S. P. Sonavane
PY - 2018
DA - 2018/02/28
PB - IJCSE, Indore, INDIA
SP - 1
EP - 9
IS - 01
VL - 06
SN - 2347-2693
ER -

           

Abstract

The rapid growth of NoSQL data has created a new context for data processing. A new generation of data handling technologies, backed by massive resources, helps store and process these very large data sets. Current attention focuses on uncovering hidden information by assimilating this bulk data and handling it as required; the data are then pre-processed and converted for analysis. The volume and variety of these data sets keep rising relentlessly. Moreover, class imbalance in many large real-world data sets has become a point of concern in the research domain. The skewed distribution of classes makes learning with traditional classifiers difficult, because they tend to favor the majority classes. In recent years, numerous solutions have been proposed to address imbalanced classification. However, they fail to address data characteristics such as overlapping and redundancy, which affect classification performance. A rational over_sampling technique, the Updated Class Purity Maximization Over_Sampling Technique (UCPMOT), which uses Safe-Level based synthetic sample creation, is proposed to handle imbalanced data sets efficiently. The newly suggested Lowest versus Highest method addresses the handling of multi-class data sets. Data sets from the UCI repository are processed using MapReduce-based programming on the Hadoop framework. The evaluation parameters F-measure and AUC are used to validate the performance of the proposed technique against benchmark techniques. The results obtained show that the proposed technique outperforms the benchmark techniques.
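To illustrate why F-measure and AUC are preferred over plain accuracy on skewed data, the following is a minimal Python sketch, not the authors' implementation. The function names `f_measure` and `auc`, the toy labels, and the choice of label 1 as the minority (positive) class are all assumptions made for illustration. AUC is computed here by its pairwise-ranking definition, which is adequate for small examples only.

```python
from typing import Sequence


def f_measure(y_true: Sequence[int], y_pred: Sequence[int],
              positive: int = 1, beta: float = 1.0) -> float:
    """F-measure of the `positive` class (beta = 1 gives the usual F1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def auc(y_true: Sequence[int], scores: Sequence[float],
        positive: int = 1) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. This is O(n_pos * n_neg), so it is
    meant for small illustrative inputs, not for big data sets.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical skewed data: 8 majority (0) and 2 minority (1) samples.
    y_true = [0] * 8 + [1] * 2
    always_majority = [0] * 10  # a classifier that ignores the minority class
    accuracy = sum(t == p for t, p in zip(y_true, always_majority)) / len(y_true)
    print("accuracy :", accuracy)
    print("F-measure:", f_measure(y_true, always_majority))
```

In this toy case a classifier that always predicts the majority class is right on 8 of 10 samples, yet its F-measure for the minority class is zero, which is the bias toward majority classes that the abstract describes.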

Keywords / Index Terms

Imbalanced datasets, Big Data, Over_sampling techniques, Multi-class, Safe-Level based Synthetic Samples
