
Author Identification on Imbalanced Class Dataset of Indian Literature in Marathi

Sunil D. Kale, Rajesh S. Prasad

Section: Research Paper, Product Type: Journal Paper
Volume-6, Issue-11, Page no. 542-547, Nov-2018

CrossRef-DOI: https://doi.org/10.26438/ijcse/v6i11.542547

Online published on Nov 30, 2018

Copyright © Sunil D. Kale, Rajesh S. Prasad. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to Cite this Paper

IEEE Style Citation: S. D. Kale and R. S. Prasad, “Author Identification on Imbalanced Class Dataset of Indian Literature in Marathi,” International Journal of Computer Sciences and Engineering, vol. 6, no. 11, pp. 542-547, 2018.

MLA Style Citation: Kale, Sunil D., and Rajesh S. Prasad. "Author Identification on Imbalanced Class Dataset of Indian Literature in Marathi." International Journal of Computer Sciences and Engineering 6.11 (2018): 542-547.

APA Style Citation: Kale, S. D., & Prasad, R. S. (2018). Author Identification on Imbalanced Class Dataset of Indian Literature in Marathi. International Journal of Computer Sciences and Engineering, 6(11), 542-547.

BibTex Style Citation:
@article{Kale_2018,
author = {Sunil D. Kale and Rajesh S. Prasad},
title = {Author Identification on Imbalanced Class Dataset of Indian Literature in Marathi},
journal = {International Journal of Computer Sciences and Engineering},
issue_date = {11 2018},
volume = {6},
number = {11},
month = {11},
year = {2018},
issn = {2347-2693},
pages = {542-547},
url = {https://www.ijcseonline.org/full_paper_view.php?paper_id=3203},
doi = {10.26438/ijcse/v6i11.542547},
publisher = {IJCSE, Indore, INDIA},
}

RIS Style Citation:
TY  - JOUR
DO  - 10.26438/ijcse/v6i11.542547
UR  - https://www.ijcseonline.org/full_paper_view.php?paper_id=3203
TI  - Author Identification on Imbalanced Class Dataset of Indian Literature in Marathi
T2  - International Journal of Computer Sciences and Engineering
AU  - Kale, Sunil D.
AU  - Prasad, Rajesh S.
PY  - 2018
DA  - 2018/11/30
PB  - IJCSE, Indore, INDIA
SP  - 542
EP  - 547
IS  - 11
VL  - 6
SN  - 2347-2693
ER  - 

Abstract

Author identification is an application of text mining: the task of determining the author of an anonymous text document. Its applications include digital forensics, plagiarism detection, and copyright disputes. A considerable amount of work has already been done on the English language, whereas work on author identification for Indian regional languages is limited. This paper presents author identification for Marathi, an Indian regional language. It proposes a technique for identifying probable authors via linguistic stylometry, i.e. the statistical analysis of variations in literary style between one author or genre and another. In total, 11 features are extracted: 8 lexical and syntactic features and 3 word n-gram features. Experiments are performed with the 8 lexical and syntactic features using three machine learning algorithms: k-nearest neighbor, Naïve Bayes, and Support Vector Machine. Results based on word n-grams (unigram, bigram, and trigram) are also presented. The experiments show better results with the word n-gram method.
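As an illustration of the word n-gram features mentioned above, the following is a minimal Python sketch of how unigram, bigram, and trigram counts could be extracted from a text. It is not the implementation used in the paper: the function name `word_ngrams`, the whitespace tokenization, and the sample sentence are all assumptions made for the example.

```python
from collections import Counter

def word_ngrams(text, n):
    """Count word n-grams in `text` (hypothetical helper).

    Assumption: tokens are separated by whitespace; the paper's
    actual tokenization and preprocessing are not specified here.
    """
    tokens = text.split()
    # Slide a window of n consecutive tokens over the token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Made-up sample sentence, used only for illustration.
sample = "मी शाळेत जातो आणि मी शाळेत शिकतो"

unigrams = word_ngrams(sample, 1)
bigrams = word_ngrams(sample, 2)
trigrams = word_ngrams(sample, 3)

# Show the most frequent unigrams and bigrams with their counts.
print(unigrams.most_common(2))
print(bigrams.most_common(1))
```

Counts of this kind, computed per document, could then serve as a feature vector for a classifier; how the paper weights or selects n-grams is not stated in the abstract.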

Keywords / Index Terms

Author Identification, Text Mining, Machine Learning, Marathi Language, Stylometry
