Open Access Article

Text Similarity on Native Languages Documents

Ramandeep Kaur, Prabhjeet Kaur

Section: Research Paper, Product Type: Journal Paper
Volume 9, Issue 4, Page no. 15-19, Apr 2021

CrossRef-DOI: https://doi.org/10.26438/ijcse/v9i4.1519

Published online on Apr 30, 2021

Copyright © Ramandeep Kaur, Prabhjeet Kaur. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to Cite this Paper

IEEE Style Citation: Ramandeep Kaur, Prabhjeet Kaur, “Text Similarity on Native Languages Documents,” International Journal of Computer Sciences and Engineering, Vol.9, Issue.4, pp.15-19, 2021.

MLA Style Citation: Ramandeep Kaur, Prabhjeet Kaur. "Text Similarity on Native Languages Documents." International Journal of Computer Sciences and Engineering 9.4 (2021): 15-19.

APA Style Citation: Ramandeep Kaur, Prabhjeet Kaur (2021). Text Similarity on Native Languages Documents. International Journal of Computer Sciences and Engineering, 9(4), 15-19.

BibTex Style Citation:
@article{Kaur_2021,
author = {Ramandeep Kaur and Prabhjeet Kaur},
title = {Text Similarity on Native Languages Documents},
journal = {International Journal of Computer Sciences and Engineering},
issue_date = {4 2021},
volume = {9},
number = {4},
month = {4},
year = {2021},
issn = {2347-2693},
pages = {15--19},
url = {https://www.ijcseonline.org/full_paper_view.php?paper_id=5319},
doi = {10.26438/ijcse/v9i4.1519},
publisher = {IJCSE, Indore, INDIA},
}

RIS Style Citation:
TY  - JOUR
DO  - 10.26438/ijcse/v9i4.1519
UR  - https://www.ijcseonline.org/full_paper_view.php?paper_id=5319
TI  - Text Similarity on Native Languages Documents
T2  - International Journal of Computer Sciences and Engineering
AU  - Ramandeep Kaur
AU  - Prabhjeet Kaur
PY  - 2021
DA  - 2021/04/30
PB  - IJCSE, Indore, INDIA
SP  - 15
EP  - 19
IS  - 4
VL  - 9
SN  - 2347-2693
ER  - 

Abstract

Measuring text similarity is a challenging task when the text is written in a local language and is large in volume. Text similarity tools are readily available, but very few of them support regional languages. To address this gap, we introduce a text similarity approach for native languages. In this paper we focus on the Punjabi language and use cosine similarity to measure how closely one Punjabi document matches other Punjabi documents. The text in both documents is divided into n-grams, and the n-grams common to both are then found. The text in the documents is subject to pre-processing, which includes tokenization and punctuation removal, followed by stop-word removal and stemming. After the pre-processing step, the similarity score is calculated using cosine similarity. This work is intended as one step toward giving native languages more attention. The features, performance, advantages, and disadvantages of various similarity measures are discussed. We also evaluate these measures to help researchers select the one that best fits their requirements.
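
The pipeline outlined in the abstract (pre-processing, n-gram extraction, cosine similarity) can be illustrated with a minimal Python sketch. This is not the authors' implementation; it rests on several assumptions that the abstract does not settle: the n-grams are taken at word level, `STOP_WORDS` is a tiny hypothetical sample rather than a curated Punjabi stop-word list, `stem` is a placeholder that returns the token unchanged because no stemmer is named, and all function names are illustrative.

```python
import math
import re
from collections import Counter

# Hypothetical sample only; a real system would need a curated Punjabi list.
STOP_WORDS = {"ਦਾ", "ਦੀ", "ਦੇ", "ਹੈ", "ਅਤੇ", "ਨੂੰ"}

# Common ASCII punctuation plus the danda (U+0964) and double danda (U+0965).
PUNCTUATION = re.compile(r'[\u0964\u0965,.;:!?"()\[\]{}-]')


def stem(token):
    """Placeholder stemmer (assumption): returns the token unchanged."""
    return token


def preprocess(text):
    """Remove punctuation, tokenize on whitespace, drop stop words, stem."""
    cleaned = PUNCTUATION.sub(" ", text)
    tokens = cleaned.split()
    return [stem(t) for t in tokens if t not in STOP_WORDS]


def ngrams(tokens, n):
    """Return the list of word n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def cosine_similarity(a, b):
    """Cosine of the angle between two n-gram count vectors (Counters)."""
    common = set(a) & set(b)  # only shared n-grams contribute to the dot product
    dot = sum(a[g] * b[g] for g in common)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as having no similarity
    return dot / (norm_a * norm_b)


def similarity(doc_a, doc_b, n=2):
    """Similarity score of two documents based on their word n-gram counts."""
    grams_a = Counter(ngrams(preprocess(doc_a), n))
    grams_b = Counter(ngrams(preprocess(doc_b), n))
    return cosine_similarity(grams_a, grams_b)


if __name__ == "__main__":
    first = "ਪੰਜਾਬੀ ਭਾਸ਼ਾ ਬਹੁਤ ਸੁੰਦਰ ਹੈ।"
    second = "ਪੰਜਾਬੀ ਭਾਸ਼ਾ ਬਹੁਤ ਪੁਰਾਣੀ ਹੈ।"
    print(similarity(first, second, n=2))
```

The score ranges from 0 (no shared n-grams) to 1 (identical n-gram counts). Larger values of `n` make the comparison stricter, since longer word sequences must match exactly.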

Keywords / Index Terms

Semantic similarity, Corpus-based similarity, Knowledge-based similarity, Semantic relatedness
