Open Access Article

Towards the Deployment of Machine Learning Solutions for Document Classification

Bichitrananda Behera, G. Kumaravelan

Section: Research Paper, Product Type: Journal Paper
Volume-7, Issue-3, Page no. 193-201, Mar-2019

CrossRef-DOI: https://doi.org/10.26438/ijcse/v7i3.193201

Online published on Mar 31, 2019

Copyright © Bichitrananda Behera, G. Kumaravelan. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to Cite this Paper

IEEE Style Citation: Bichitrananda Behera, G. Kumaravelan, “Towards the Deployment of Machine Learning Solutions for Document Classification,” International Journal of Computer Sciences and Engineering, Vol.7, Issue.3, pp.193-201, 2019.

MLA Style Citation: Bichitrananda Behera, G. Kumaravelan. "Towards the Deployment of Machine Learning Solutions for Document Classification." International Journal of Computer Sciences and Engineering 7.3 (2019): 193-201.

APA Style Citation: Bichitrananda Behera, G. Kumaravelan (2019). Towards the Deployment of Machine Learning Solutions for Document Classification. International Journal of Computer Sciences and Engineering, 7(3), 193-201.

BibTex Style Citation:
@article{Behera_2019,
author = {Bichitrananda Behera and G. Kumaravelan},
title = {Towards the Deployment of Machine Learning Solutions for Document Classification},
journal = {International Journal of Computer Sciences and Engineering},
issue_date = {3 2019},
volume = {7},
number = {3},
month = {3},
year = {2019},
issn = {2347-2693},
pages = {193--201},
url = {https://www.ijcseonline.org/full_paper_view.php?paper_id=3818},
doi = {10.26438/ijcse/v7i3.193201},
publisher = {IJCSE, Indore, INDIA},
}

RIS Style Citation:
TY  - JOUR
DO  - 10.26438/ijcse/v7i3.193201
UR  - https://www.ijcseonline.org/full_paper_view.php?paper_id=3818
TI  - Towards the Deployment of Machine Learning Solutions for Document Classification
T2  - International Journal of Computer Sciences and Engineering
AU  - Behera, Bichitrananda
AU  - Kumaravelan, G.
PY  - 2019
DA  - 2019/03/31
PB  - IJCSE, Indore, INDIA
SP  - 193
EP  - 201
IS  - 3
VL  - 7
SN  - 2347-2693
ER  -

Abstract

In the era of internet-connected devices, the volume of unstructured data is growing rapidly across many file formats. In particular, a great deal of knowledge is hidden in textual data such as emails, blogs, tweets, and log files. The primary challenge with such data is to classify its content into predefined classes quickly, in real time. This paper therefore investigates the deployment of state-of-the-art Machine Learning (ML) algorithms for the automatic classification of text documents: decision tree, k-nearest neighbour, Rocchio, ridge, passive-aggressive, multinomial naïve Bayes, Bernoulli naïve Bayes, support vector machine, and artificial neural networks, including perceptron, stochastic gradient descent, and back-propagation neural network. The classifiers are applied to the benchmark datasets 20 Newsgroups, BBC News, BBC Sport, and IMDB. Finally, the performance of all of these built-in classifiers is compared and empirically evaluated using well-defined metrics: accuracy, error rate, precision, recall, F-measure, and the Kappa statistic.
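
The evaluation metrics named in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' evaluation code: the function name `evaluate`, the toy labels, and the choice of macro averaging for precision, recall, and F-measure are assumptions made here for illustration, since the abstract does not say how per-class scores were aggregated.

```python
from collections import Counter


def evaluate(y_true, y_pred):
    """Compute accuracy, error rate, macro precision/recall/F-measure, and Cohen's kappa.

    Macro averaging (unweighted mean over classes) is an assumption of this sketch.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    n = len(y_true)
    labels = sorted(set(y_true) | set(y_pred))
    pairs = Counter(zip(y_true, y_pred))  # (true label, predicted label) -> count
    true_totals = Counter(y_true)         # documents per true class
    pred_totals = Counter(y_pred)         # documents per predicted class

    # Accuracy: share of documents whose predicted class equals the true class.
    accuracy = sum(pairs[(c, c)] for c in labels) / n

    precisions, recalls, f_measures = [], [], []
    for c in labels:
        tp = pairs[(c, c)]
        p = tp / pred_totals[c] if pred_totals[c] else 0.0
        r = tp / true_totals[c] if true_totals[c] else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0  # harmonic mean of p and r
        precisions.append(p)
        recalls.append(r)
        f_measures.append(f)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    expected = sum(true_totals[c] * pred_totals[c] for c in labels) / (n * n)
    # Kappa is undefined when chance agreement is 1; this sketch reports 0.0 then.
    kappa = (accuracy - expected) / (1 - expected) if expected < 1 else 0.0

    k = len(labels)
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        "precision": sum(precisions) / k,
        "recall": sum(recalls) / k,
        "f_measure": sum(f_measures) / k,
        "kappa": kappa,
    }


# Hypothetical toy labels, not taken from the paper's datasets.
y_true = ["sport", "sport", "sport", "tech", "tech", "tech", "film", "film"]
y_pred = ["sport", "sport", "tech", "tech", "tech", "tech", "film", "sport"]

for name, value in evaluate(y_true, y_pred).items():
    print(f"{name}: {value:.4f}")
```

Kappa is useful alongside accuracy because it discounts the agreement a classifier would reach by chance, which matters when class sizes are unbalanced.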

Keywords / Index Terms

Text mining, Machine learning, Document classification, Information retrieval, Comparative study
