Open Access Article

Evaluating Techniques for Pre-Processing of Unstructured Text For Text Classification

Susan Koshy, R. Padmajavalli

Section: Research Paper, Product Type: Journal Paper
Volume-6, Issue-8, Page no. 151-160, Aug-2018

CrossRef-DOI:   https://doi.org/10.26438/ijcse/v6i8.151160

Online published on Aug 31, 2018

Copyright © Susan Koshy, R. Padmajavalli. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to Cite this Paper

IEEE Style Citation: Susan Koshy, R. Padmajavalli, “Evaluating Techniques for Pre-Processing of Unstructured Text For Text Classification,” International Journal of Computer Sciences and Engineering, Vol.6, Issue.8, pp.151-160, 2018.

MLA Style Citation: Susan Koshy, R. Padmajavalli "Evaluating Techniques for Pre-Processing of Unstructured Text For Text Classification." International Journal of Computer Sciences and Engineering 6.8 (2018): 151-160.

APA Style Citation: Susan Koshy, R. Padmajavalli, (2018). Evaluating Techniques for Pre-Processing of Unstructured Text For Text Classification. International Journal of Computer Sciences and Engineering, 6(8), 151-160.

BibTex Style Citation:
@article{Koshy_2018,
author = {Susan Koshy and R. Padmajavalli},
title = {Evaluating Techniques for Pre-Processing of Unstructured Text For Text Classification},
journal = {International Journal of Computer Sciences and Engineering},
issue_date = {8 2018},
volume = {6},
Issue = {8},
month = {8},
year = {2018},
issn = {2347-2693},
pages = {151-160},
url = {https://www.ijcseonline.org/full_paper_view.php?paper_id=2670},
doi = {10.26438/ijcse/v6i8.151160},
publisher = {IJCSE, Indore, INDIA},
}

RIS Style Citation:
TY - JOUR
DO - 10.26438/ijcse/v6i8.151160
UR - https://www.ijcseonline.org/full_paper_view.php?paper_id=2670
TI - Evaluating Techniques for Pre-Processing of Unstructured Text For Text Classification
T2 - International Journal of Computer Sciences and Engineering
AU - Koshy, Susan
AU - Padmajavalli, R.
PY - 2018
DA - 2018/08/31
PB - IJCSE, Indore, INDIA
SP - 151
EP - 160
IS - 8
VL - 6
SN - 2347-2693
ER -

Abstract

Digital information available over the internet can be analyzed for knowledge discovery and intelligent decision making. Text categorization is an important and extensively studied problem in machine learning. Text classification, the grouping of text into appropriate categories, requires pre-processing techniques and machine learning algorithms. Pre-processing, or data cleaning, involves removal of HTML characters, tokenization, stop-word removal, stemming, lemmatization, and advanced processes such as part-of-speech tagging, followed by representation in a form appropriate for machine learning. This paper experimentally evaluates the impact of stemming and tokenization techniques on text classification across five text datasets.
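
The pre-processing steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the stop-word list, the `crude_stem` suffix stripper, and all function names are assumptions chosen for illustration, and the stemmer is a toy stand-in rather than the Porter or Lovins algorithm. Lemmatization and part-of-speech tagging are omitted because they need external language resources.

```python
import re
from collections import Counter
from html import unescape

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to", "was"}


def strip_html(text):
    """Replace tags with spaces, then decode entities such as &amp;."""
    return unescape(re.sub(r"<[^>]+>", " ", text))


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(token):
    """Strip one common suffix, keeping at least three characters.

    A toy stand-in for a real stemmer, used only to show where
    stemming sits in the pipeline.
    """
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """HTML removal -> tokenization -> stop-word removal -> stemming."""
    tokens = tokenize(strip_html(text))
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def term_frequencies(tokens):
    """Bag-of-words counts, the simplest vector space representation."""
    return Counter(tokens)


if __name__ == "__main__":
    review = "<p>The rooms are cleaned &amp; the staff is helping</p>"
    print(term_frequencies(preprocess(review)))
```

Swapping `crude_stem` or `tokenize` for another variant while holding the classifier fixed is one way to set up the kind of comparison the abstract describes.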

Keywords / Index Terms

Tokenization, stemming, part-of-speech tagging, document representation, vector space model
