A Survey on Text Pre-processing Techniques and Tools

Ravi Lourdusamy, Stanislaus Abraham

Section:Survey Paper, Product Type: Journal Paper
Volume-06 , Issue-03 , Page no. 148-157, Apr-2018

CrossRef-DOI: https://doi.org/10.26438/ijcse/v6si3.148157

Online published on Apr 30, 2018

Copyright © Ravi Lourdusamy, Stanislaus Abraham . This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to Cite this Paper

IEEE Style Citation: Ravi Lourdusamy, Stanislaus Abraham, “A Survey on Text Pre-processing Techniques and Tools,” International Journal of Computer Sciences and Engineering, Vol.06, Issue.03, pp.148-157, 2018.

MLA Style Citation: Ravi Lourdusamy, and Stanislaus Abraham. "A Survey on Text Pre-processing Techniques and Tools." International Journal of Computer Sciences and Engineering 06.03 (2018): 148-157.

APA Style Citation: Ravi Lourdusamy, & Stanislaus Abraham (2018). A Survey on Text Pre-processing Techniques and Tools. International Journal of Computer Sciences and Engineering, 06(03), 148-157.

BibTex Style Citation:
@article{Lourdusamy_2018,
author = {Ravi Lourdusamy and Stanislaus Abraham},
title = {A Survey on Text Pre-processing Techniques and Tools},
journal = {International Journal of Computer Sciences and Engineering},
issue_date = {4 2018},
volume = {06},
Issue = {03},
month = {4},
year = {2018},
issn = {2347-2693},
pages = {148-157},
url = {https://www.ijcseonline.org/full_spl_paper_view.php?paper_id=337},
doi = {https://doi.org/10.26438/ijcse/v6i3.148157},
publisher = {IJCSE, Indore, INDIA},
}

RIS Style Citation:
TY - JOUR
DO - https://doi.org/10.26438/ijcse/v6i3.148157
UR - https://www.ijcseonline.org/full_spl_paper_view.php?paper_id=337
TI - A Survey on Text Pre-processing Techniques and Tools
T2 - International Journal of Computer Sciences and Engineering
AU - Ravi Lourdusamy
AU - Stanislaus Abraham
PY - 2018
DA - 2018/04/30
PB - IJCSE, Indore, INDIA
SP - 148
EP - 157
IS - 03
VL - 06
SN - 2347-2693
ER -

Abstract

We live in an era of digital data explosion on the Internet. Data warehouses deal with numerical databases rather than textual sources. Nearly eighty percent of digital data is in semi-structured or unstructured textual form. Knowledge mining techniques developed over the past decade, and those being developed now, continue to draw attention as a means of transforming such textual data into useful information and knowledge. This knowledge and information benefits many application fields, such as social networks, business management, customer care management systems, market analysis, search engines, and fraud detection, to name a few. Text Mining (TM) is needed to obtain the desired information from such voluminous data. TM is multi-disciplinary in nature, and several TM techniques are deployed to extract knowledge from textual sources. Input text for such techniques needs to be pre-processed and cleaned. This survey briefly presents pre-processing tools for TM in general and Natural Language Processing (NLP) in particular. It also presents the broad categories of TM techniques in use. The focus of this paper is to explore and analyze several features of text pre-processing techniques and tools that would interest researchers in the area of TM.

Keywords / Index Terms

Text Mining, Pre-processing techniques, Pre-processing Tools, Natural Language Processing
