Tackling Imbalance Datasets: Methods, Techniques & Comparisons

Shivam Kumar, Deepanshu Ahuja, Sandeep Kumar

Dept. of Computer Science & Engineering, Sharda University, Greater Noida, India.
Dept. of Computer Science & Engineering, Sharda University, Greater Noida, India.
Dept. of Computer Science & Engineering, Sharda University, Greater Noida, India.

Section: Research Paper, Product Type: Journal Paper
Volume-11, Issue-5, Page no. 6-12, May-2023

CrossRef-DOI: https://doi.org/10.26438/ijcse/v11i5.612

Online published on May 31, 2023

Copyright © Shivam Kumar, Deepanshu Ahuja, Sandeep Kumar. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to Cite this Paper

IEEE Style Citation: Shivam Kumar, Deepanshu Ahuja, Sandeep Kumar, “Tackling Imbalance Datasets: Methods, Techniques & Comparisons,” International Journal of Computer Sciences and Engineering, Vol.11, Issue.5, pp.6-12, 2023.

MLA Style Citation: Shivam Kumar, Deepanshu Ahuja, Sandeep Kumar "Tackling Imbalance Datasets: Methods, Techniques & Comparisons." International Journal of Computer Sciences and Engineering 11.5 (2023): 6-12.

APA Style Citation: Shivam Kumar, Deepanshu Ahuja, Sandeep Kumar, (2023). Tackling Imbalance Datasets: Methods, Techniques & Comparisons. International Journal of Computer Sciences and Engineering, 11(5), 6-12.

BibTex Style Citation:
@article{Kumar_2023,
author = {Shivam Kumar and Deepanshu Ahuja and Sandeep Kumar},
title = {Tackling Imbalance Datasets: Methods, Techniques \& Comparisons},
journal = {International Journal of Computer Sciences and Engineering},
issue_date = {5 2023},
volume = {11},
number = {5},
month = {5},
year = {2023},
issn = {2347-2693},
pages = {6--12},
url = {https://www.ijcseonline.org/full_paper_view.php?paper_id=5569},
doi = {10.26438/ijcse/v11i5.612},
publisher = {IJCSE, Indore, INDIA}
}

RIS Style Citation:
TY  - JOUR
DO  - 10.26438/ijcse/v11i5.612
UR  - https://www.ijcseonline.org/full_paper_view.php?paper_id=5569
TI  - Tackling Imbalance Datasets: Methods, Techniques & Comparisons
T2  - International Journal of Computer Sciences and Engineering
AU  - Kumar, Shivam
AU  - Ahuja, Deepanshu
AU  - Kumar, Sandeep
PY  - 2023
DA  - 2023/05/31
PB  - IJCSE, Indore, INDIA
SP  - 6
EP  - 12
IS  - 5
VL  - 11
SN  - 2347-2693
ER  -

Abstract

Learning from data, including duplication and extraction, has remained a focus of extensive research for many years. A classification dataset with skewed class proportions is referred to as imbalanced; the term originated in discussions of skewed distributions in binary tasks. In an imbalanced dataset, observations are unevenly distributed across the target classes: one class has a very large number of observations while the other has far fewer. The emergence of the big-data era, together with the growth of machine learning and data mining (data science), has deepened research into learning from imbalanced datasets and brought new challenges with it. Data-level and algorithm-level methods are widely used and continue to improve, while hybrid approaches have grown in popularity because they draw on both earlier families and combine their strengths while reducing their weaknesses. To advance the field of addressing imbalanced datasets and to compare existing approaches and methodologies, this paper discusses the open questions and challenges that remain to be resolved and offers ideas for directions of further investigation. The main issue with an imbalanced class distribution is that poor training practice biases the model in favour of the majority class. Deep learning and machine learning algorithms are often trained on datasets in which some categories are underrepresented. Conventional methods advise undersampling the majority class and oversampling the minority class before the learning stage. This research investigates various traditional and contemporary strategies for addressing this issue, including learning modules that use carefully designed representations of majority and minority samples.
The work of several researchers is compiled in a logical manner, and numerous opportunities and future difficulties for research in the field are discussed.
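
As a rough illustration of the conventional data-level approach mentioned in the abstract (oversampling the minority class or undersampling the majority class before learning), the sketch below balances a toy dataset by random resampling. It is not taken from the paper: the function name `random_resample`, its `mode` and `seed` parameters, and the toy data are all hypothetical, and only the Python standard library is assumed.

```python
import random
from collections import Counter


def random_resample(samples, labels, mode="over", seed=0):
    """Balance a labelled dataset by random resampling (illustrative only).

    mode="over":  duplicate random examples of smaller classes until every
                  class matches the largest class size.
    mode="under": keep a random subset of larger classes so every class
                  matches the smallest class size.
    """
    if mode not in ("over", "under"):
        raise ValueError("mode must be 'over' or 'under'")
    rng = random.Random(seed)

    # Group the samples by their class label.
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)

    sizes = [len(group) for group in by_class.values()]
    target = max(sizes) if mode == "over" else min(sizes)

    out_samples, out_labels = [], []
    for y, group in by_class.items():
        if mode == "over":
            # Keep every original example, then draw extras with replacement.
            chosen = group + rng.choices(group, k=target - len(group))
        else:
            # Draw a subset of the class without replacement.
            chosen = rng.sample(group, target)
        out_samples.extend(chosen)
        out_labels.extend([y] * len(chosen))
    return out_samples, out_labels


# Hypothetical toy data: 8 majority examples (label 0), 2 minority (label 1).
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2

X_over, y_over = random_resample(X, y, mode="over")
X_under, y_under = random_resample(X, y, mode="under")
print(Counter(y_over))   # Counter({0: 8, 1: 8})
print(Counter(y_under))  # Counter({0: 2, 1: 2})
```

Random oversampling only repeats existing minority examples, and random undersampling discards majority examples; synthetic techniques such as SMOTE and ADASYN, named in the keywords, generate new minority samples instead. In practice, resampling is usually applied to the training split only, so that evaluation data keeps the original class distribution.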

Keywords / Index Terms

Multiclass, Classification, Imbalance, Prediction, Majority, Minority, Synthetic Minority Over-sampling Technique (SMOTE), Simplified Swarm Optimization (SSO), Particle Swarm Optimization (PSO), Adaptive Synthetic (ADASYN), Diversified One-vs-One strategy (DOVO), Diversified Error Correcting Output Codes (DECOC).
