Author Identification on Imbalanced Class Dataset of Indian Literature in Marathi

Sunil D. Kale, Rajesh S. Prasad

Section: Research Paper, Product Type: Journal Paper
Volume-6, Issue-11, Page no. 542-547, Nov-2018

CrossRef-DOI: https://doi.org/10.26438/ijcse/v6i11.542547

Online published on Nov 30, 2018

Copyright © Sunil D. Kale, Rajesh S. Prasad. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to Cite this Paper

IEEE Citation

IEEE Style Citation: Sunil D. Kale, Rajesh S. Prasad, “Author Identification on Imbalanced Class Dataset of Indian Literature in Marathi,” International Journal of Computer Sciences and Engineering, Vol.6, Issue.11, pp.542-547, 2018.

MLA Citation

MLA Style Citation: Sunil D. Kale, Rajesh S. Prasad "Author Identification on Imbalanced Class Dataset of Indian Literature in Marathi." International Journal of Computer Sciences and Engineering 6.11 (2018): 542-547.

APA Citation

APA Style Citation: Sunil D. Kale, Rajesh S. Prasad, (2018). Author Identification on Imbalanced Class Dataset of Indian Literature in Marathi. International Journal of Computer Sciences and Engineering, 6(11), 542-547.

BibTex Citation

BibTex Style Citation:
@article{Kale_2018,
author = {Sunil D. Kale and Rajesh S. Prasad},
title = {Author Identification on Imbalanced Class Dataset of Indian Literature in Marathi},
journal = {International Journal of Computer Sciences and Engineering},
issue_date = {11 2018},
volume = {6},
Issue = {11},
month = {11},
year = {2018},
issn = {2347-2693},
pages = {542-547},
url = {https://www.ijcseonline.org/full_paper_view.php?paper_id=3203},
doi = {10.26438/ijcse/v6i11.542547},
publisher = {IJCSE, Indore, INDIA},
}

RIS Citation

RIS Style Citation:
TY - JOUR
DO - 10.26438/ijcse/v6i11.542547
UR - https://www.ijcseonline.org/full_paper_view.php?paper_id=3203
TI - Author Identification on Imbalanced Class Dataset of Indian Literature in Marathi
T2 - International Journal of Computer Sciences and Engineering
AU - Kale, Sunil D.
AU - Prasad, Rajesh S.
PY - 2018
DA - 2018/11/30
PB - IJCSE, Indore, INDIA
SP - 542
EP - 547
IS - 11
VL - 6
SN - 2347-2693
ER -

Abstract

Author identification is an application of text mining: the task of determining the author of an anonymous text document. Its applications include digital forensics, plagiarism detection, and copyright disputes. A large body of work exists for English, whereas work on author identification for Indian regional languages is limited. This paper presents author identification for Marathi, an Indian regional language. It proposes a technique for identifying probable authors via linguistic stylometry, i.e. the statistical analysis of variations in literary style between one author or genre and another. In total, 11 features are extracted: 8 lexical and syntactic features and 3 word n-gram features. Experiments are performed with the 8 features and three machine learning algorithms: k-nearest neighbor, Naïve Bayes, and Support Vector Machine. Results based on word n-grams (unigram, bigram, and trigram) are also presented. The experiments show better results with the word n-gram method.
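
To make the word n-gram idea concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes whitespace tokenization, raw n-gram counts, cosine similarity, and a 1-nearest-neighbor decision; the function names (`word_ngrams`, `cosine`, `predict_author`) and the toy training texts are hypothetical and chosen only for illustration.

```python
from collections import Counter
import math


def word_ngrams(text, n):
    """Count word n-grams (as tuples) in whitespace-tokenized text."""
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict_author(train, text, n=1):
    """1-nearest-neighbor: return the author of the most similar training text."""
    query = word_ngrams(text, n)
    best_author, best_score = None, -1.0
    for author, sample in train:
        score = cosine(query, word_ngrams(sample, n))
        if score > best_score:
            best_author, best_score = author, score
    return best_author


# Toy data (hypothetical, not from the paper's Marathi corpus).
train = [
    ("author_A", "the river runs to the sea"),
    ("author_B", "stars fall over the quiet hills"),
]
prediction = predict_author(train, "the river meets the sea", n=1)
```

Setting `n` to 1, 2, or 3 gives unigram, bigram, and trigram features respectively. Because `str.split()` separates on whitespace, the same tokenization applies to Devanagari text, though a real system would likely need punctuation handling and a stronger classifier.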

Keywords / Index Terms

Author Identification, Text Mining, Machine Learning, Marathi Language, Stylometry
