An Efficient Duplicate Detection Algorithm Using Data Cleansing

J. Selvi, R. Gayathri

Open Access Article

Section: Survey Paper, Product Type: Journal Paper
Volume-07, Issue-04, Page no. 277-280, Feb-2019

Online published on Feb 28, 2019

Copyright © J. Selvi, R. Gayathri. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to Cite this Paper


IEEE Citation

IEEE Style Citation: J. Selvi, R. Gayathri, “An Efficient Duplicate Detection Algorithm Using Data Cleansing,” International Journal of Computer Sciences and Engineering, Vol.07, Issue.04, pp.277-280, 2019.

MLA Citation

MLA Style Citation: J. Selvi, R. Gayathri. "An Efficient Duplicate Detection Algorithm Using Data Cleansing." International Journal of Computer Sciences and Engineering 07.04 (2019): 277-280.

APA Citation

APA Style Citation: J. Selvi, R. Gayathri (2019). An Efficient Duplicate Detection Algorithm Using Data Cleansing. International Journal of Computer Sciences and Engineering, 07(04), 277-280.

BibTex Citation

BibTex Style Citation:
@article{Selvi_2019,
author = {J. Selvi and R. Gayathri},
title = {An Efficient Duplicate Detection Algorithm Using Data Cleansing},
journal = {International Journal of Computer Sciences and Engineering},
issue_date = {2 2019},
volume = {07},
number = {04},
month = {2},
year = {2019},
issn = {2347-2693},
pages = {277-280},
url = {https://www.ijcseonline.org/full_spl_paper_view.php?paper_id=771},
publisher = {IJCSE, Indore, INDIA},
}

RIS Citation

RIS Style Citation:
TY - JOUR
UR - https://www.ijcseonline.org/full_spl_paper_view.php?paper_id=771
TI - An Efficient Duplicate Detection Algorithm Using Data Cleansing
T2 - International Journal of Computer Sciences and Engineering
AU - J. Selvi
AU - R. Gayathri
PY - 2019
DA - 2019/02/28
PB - IJCSE, Indore, INDIA
SP - 277
EP - 280
IS - 04
VL - 07
SN - 2347-2693
ER -

Abstract

The aim of the technique is to minimize data duplication in web mining patterns during web-based search in large data mining applications. Although there is a long line of work on identifying duplicates in relational data, only a few solutions focus on duplicate detection in more complex hierarchical structures, such as XML data. This paper presents a novel method for XML duplicate detection, called XML Dup. XML Dup uses a Bayesian network to determine the probability of two XML elements being duplicates, considering not only the information within the elements but also the way that information is structured. In addition, to improve the efficiency of the network evaluation, a novel pruning strategy is presented that is capable of significant gains over the unoptimized version of the algorithm. Through experiments, we show that the algorithm achieves high precision and recall scores on several data sets. XML Dup also outperforms another state-of-the-art duplicate detection solution in terms of both efficiency and effectiveness.
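
The abstract does not give the structure of the Bayesian network or its probabilities, so the following is only a minimal illustrative sketch of the general idea, not the XML Dup algorithm itself. It assumes that the score for two elements is a product of factors (one for the element's own text, one per child tag), that string similarity from Python's `difflib` can stand in for a per-value probability, and that a fixed penalty of 0.5 applies when a child tag appears on one side only. The names `text_similarity` and `duplicate_score` are hypothetical. Because every factor is at most 1, the running product is an upper bound on the final score, which is what makes early pruning against a threshold safe.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher


def text_similarity(a, b):
    """String similarity in [0, 1]; an assumed stand-in for a per-value probability."""
    a, b = (a or "").strip().lower(), (b or "").strip().lower()
    if not a and not b:
        return 1.0
    return SequenceMatcher(None, a, b).ratio()


def duplicate_score(x, y, threshold=0.0):
    """Score in [0, 1] that elements x and y describe the same object.

    The score is a product of factors, each at most 1, so the running
    product can only shrink. Once it falls below `threshold`, the pair
    can no longer reach the threshold and evaluation stops early.
    """
    if x.tag != y.tag:
        return 0.0
    score = text_similarity(x.text, y.text)
    tags = sorted({c.tag for c in x} | {c.tag for c in y})
    for tag in tags:
        if score < threshold:
            # Pruning: the upper bound is already below the cut-off.
            return score
        xs, ys = x.findall(tag), y.findall(tag)
        if not xs or not ys:
            # Assumed penalty for structure present on one side only.
            score *= 0.5
            continue
        # Match each child of x to its best counterpart in y, then average.
        best = [max(duplicate_score(cx, cy) for cy in ys) for cx in xs]
        score *= sum(best) / len(best)
    return score
```

A caller would parse two candidate elements with `ET.fromstring` and treat the pair as duplicates when `duplicate_score(x, y, threshold=t)` is at least `t`. A pruned evaluation returns a value below `t`, so the classification is the same as without pruning.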

Keywords / Index Terms

Duplicate Detection, Network Evaluation, Efficiency, Effectiveness

References

[1] S. R. Alenazi and Kamsuriah, “Record Duplication Detection in Database: A Review,” Int. J. Adv. Sci. Eng. Inf. Technol., vol. 6, no. 6, pp. 838–845, 2016.
[2] F. N. Mahmood and A. Ismail, “Semantic Similarity Measurement Methods: The State-of-the-art,” Res. J. Appl. Sci. Eng. Technol., vol. 8, no. 18, pp. 1923–1932, 2014.
[3] A. Osama, Helmi, “A Comparative Study of Duplicate Record Detection Techniques,” Middle East, 2012.
[4] D. Vatsalan and P. Christen, “Privacy-Preserving Matching Of Similar Patients,” J. Biomed. Inform., vol. 59, pp. 285–298, 2016.
[5] Christenp and Timc, “Freely Extensible Biomedical Record Linkage,” 2013. [Online]. Available: https://sourceforge.net/projects/febrl/.
[6] M. G. Elfeky, V. S. Verykios, and A. K. Elmagarmid, “TAILOR: A Record Linkage Toolbox,” Proc. 18th Int. Conf. Data Eng., pp. 17–28, 2002.
[7] W. E. Yancey, “Big Match: A Program For Extracting Probable Matches From A Large File For Record Linkage,” Computing, vol. 1, no. 1, pp. 1–8, 2002.
[8] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios, “Duplicate Record Detection: A Survey,” IEEE Trans. Knowl. Data Eng., vol. 19, no. 1, pp. 1–16, Jan. 2007.
[9] W. H. Gomaa and A. A. Fahmy, “A Survey of Text Similarity Approaches,” Int. J. Comput. Appl., vol. 68, no. 13, pp. 13–18, Apr. 2013.
[10] R. T. Nakatsu and E. B. Grossman, “A Task-Fit Model of Crowdsourcing: Finding the Right Crowdsourcing Approach to Fit the Task,” J. Inf. Sci., pp. 1–11, 2014.
[11] “SemEval-2015 The 9th International Workshop on Semantic Evaluation,” New York 12571 USA, 2015.
[12] Nirmalrani V, E. P. Sim, and Arun PR, “Detection of near duplicate web pages using four stage algorithm,” in 2015 International Conference on Communications and Signal Processing (ICCSP), 2015, pp. 0644–0648.
[13] Y. Jiang, G. Li, J. Feng, and W. Li, “String Similarity Joins: An Experimental Evaluation,” VLDB, pp. 625–636, 2014.
[14] P. A. V. Hall and G. R. Dowling, “Approximate String Matching,” ACM Comput. Surv., vol. 12, no. 4, pp. 381–402, 1980.
[15] J. L. Peterson, “Computer Programs For Detecting And Correcting Spelling Errors,” Commun. ACM, vol. 23, no. 12, pp. 676–687, Dec. 1980.
[16] V. Wandhekar, “Validation of Deduplication in Data using Similarity Measure,” Int. J. Comput. Appl., vol. 116, no. 21, pp. 18–22, 2015.
[17] K. Williams and C. L. Giles, “Near Duplicate Detection In An Academic Digital Library,” Proc. 2013 ACM Symp. Doc. Eng. - DocEng ’13, pp. 91–94, 2013.
[18] A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig, “Syntactic Clustering of the Web,” Comput. Networks ISDN Syst., vol. 29, no. 8, pp. 1157–1166, 1997.
[19] K. Dreßler and A.-C. N. Ngomo, “On the Efficient Execution of Bounded Jaro-Winkler Distances,” Semant. Web 8, vol. 0, no. 0, pp. 1–13, 2017.
[20] S. B. Needleman and C. D. Wunsch, “A General Method Applicable To The Search For Similarities In The Amino Acid Sequence Of Two Proteins,” J. Mol. Biol., vol. 48, no. 3, pp. 443–453, 1970.
[21] T. F. Smith and M. S. Waterman, “Identification Of Common Molecular Subsequences,” J. Mol. Biol., vol. 147, no. 1, pp. 195–197, Mar. 1981.

ISSN :	2347-2693 (Online)