Pre-processing Phase of Automatic Text Summarization for the Assamese Language

Gunadeep Chetia, Gopal Chandra Hazarika


Section: Research Paper, Product Type: Journal Paper
Volume-6, Issue-10, Page no. 159-163, Oct-2018

CrossRef-DOI: https://doi.org/10.26438/ijcse/v6i10.159163

Online published on Oct 31, 2018

Copyright © Gunadeep Chetia, Gopal Chandra Hazarika. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


How to Cite this Paper


IEEE Style Citation: Gunadeep Chetia, Gopal Chandra Hazarika, “Pre-processing Phase of Automatic Text Summarization for the Assamese Language,” International Journal of Computer Sciences and Engineering, Vol.6, Issue.10, pp.159-163, 2018.

MLA Style Citation: Gunadeep Chetia, Gopal Chandra Hazarika. "Pre-processing Phase of Automatic Text Summarization for the Assamese Language." International Journal of Computer Sciences and Engineering 6.10 (2018): 159-163.

APA Style Citation: Gunadeep Chetia, Gopal Chandra Hazarika (2018). Pre-processing Phase of Automatic Text Summarization for the Assamese Language. International Journal of Computer Sciences and Engineering, 6(10), 159-163.

BibTex Style Citation:
@article{Chetia_2018,
author = {Gunadeep Chetia and Gopal Chandra Hazarika},
title = {Pre-processing Phase of Automatic Text Summarization for the Assamese Language},
journal = {International Journal of Computer Sciences and Engineering},
issue_date = {10 2018},
volume = {6},
number = {10},
month = {10},
year = {2018},
issn = {2347-2693},
pages = {159--163},
url = {https://www.ijcseonline.org/full_paper_view.php?paper_id=2998},
doi = {10.26438/ijcse/v6i10.159163},
publisher = {IJCSE, Indore, INDIA},
}

RIS Style Citation:
TY  - JOUR
AU  - Chetia, Gunadeep
AU  - Hazarika, Gopal Chandra
TI  - Pre-processing Phase of Automatic Text Summarization for the Assamese Language
T2  - International Journal of Computer Sciences and Engineering
PY  - 2018
DA  - 2018/10/31
PB  - IJCSE, Indore, INDIA
VL  - 6
IS  - 10
SP  - 159
EP  - 163
SN  - 2347-2693
DO  - 10.26438/ijcse/v6i10.159163
UR  - https://www.ijcseonline.org/full_paper_view.php?paper_id=2998
ER  -


Abstract

Pre-processing is the first phase of automatic text summarization and an important one: it normalizes a text document and generates a structured representation of the text. The major pre-processing tasks are segmentation, tokenization, stop-word removal, stemming and lemmatization. In this paper, we discuss these tasks as they apply to automatically summarizing Assamese text documents. Both stemming and lemmatization play an important role in pre-processing a morphologically rich, highly inflected language like Assamese. We present a corpus-based approach for stemming Assamese words using an n-gram similarity matching technique. We also propose a hybrid method for lemmatizing Assamese verbs to obtain the grammatically correct root of a verb. Verbs are the most heavily inflected word category in Assamese, and stemming alone is not sufficient to recover their original roots. So, after segmentation, tokenization and stop-word removal, we first apply stemming to all the words in the text document irrespective of their grammatical categories and then apply lemmatization only to the Assamese verbs. To identify the Assamese verbs, we use a look-up dictionary that contains a list of possible stems along with the corresponding lemma of each verb.
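The order of operations described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the abstract does not say which n-gram similarity measure is used, so the Dice coefficient over character bigrams is an assumption here, as are the 0.6 threshold, the function names, and the Latin-script placeholder words standing in for Assamese data. The one language-specific detail is the danda (U+0964), which Assamese uses as its sentence terminator.

```python
import re

DANDA = "\u0964"  # the danda, used as the full stop in Assamese text


def segment(text):
    """Split text into sentences on the danda, '?' and '!'."""
    parts = re.split("[" + DANDA + "?!]+", text)
    return [p.strip() for p in parts if p.strip()]


def tokenize(sentence):
    """Split a sentence into tokens on whitespace and basic punctuation."""
    return [t for t in re.split(r"[\s,;:()]+", sentence) if t]


def char_ngrams(word, n=2):
    """Set of character n-grams (by code point, not by grapheme cluster)."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def dice(a, b, n=2):
    """Dice coefficient 2|A & B| / (|A| + |B|) over n-gram sets (assumed measure)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 1.0 if a == b else 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def stem(word, corpus_stems, threshold=0.6):
    """Return the most similar corpus stem if it clears the (assumed) threshold."""
    best = max(corpus_stems, key=lambda s: dice(word, s), default=None)
    if best is not None and dice(word, best) >= threshold:
        return best
    return word


def lemmatize(stem_form, verb_lemmas):
    """Look the stem up in a stem-to-lemma dictionary; non-verbs pass through."""
    return verb_lemmas.get(stem_form, stem_form)


def preprocess(text, stop_words, corpus_stems, verb_lemmas):
    """Segment, tokenize, drop stop words, stem every word, lemmatize verbs."""
    result = []
    for sentence in segment(text):
        tokens = [t for t in tokenize(sentence) if t not in stop_words]
        stems = [stem(t, corpus_stems) for t in tokens]
        result.append([lemmatize(s, verb_lemmas) for s in stems])
    return result
```

In this sketch a word counts as a verb simply because its stem appears as a key in `verb_lemmas`, mirroring the look-up dictionary of stems and lemmas mentioned in the abstract. A real system for Assamese would likely need n-grams over grapheme clusters rather than raw code points, since vowel signs and conjuncts span several code points.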

Keywords / Index Terms

Pre-processing, Summarization, Stemming, Lemmatization, n-gram

