Open Access Article

On Applying Document Similarity Measures for Template based Clustering of Web Documents

T.I. Bagban, P. J. Kulkarni

Section: Research Paper, Product Type: Journal Paper
Volume-06, Issue-01, Page no. 37-42, Feb-2018

CrossRef-DOI:   https://doi.org/10.26438/ijcse/v6si1.3742

Online published on Feb 28, 2018

Copyright © T.I. Bagban, P. J. Kulkarni. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to Cite this Paper

IEEE Style Citation: T.I. Bagban, P. J. Kulkarni, “On Applying Document Similarity Measures for Template based Clustering of Web Documents,” International Journal of Computer Sciences and Engineering, Vol.06, Issue.01, pp.37-42, 2018.

MLA Style Citation: T.I. Bagban, P. J. Kulkarni "On Applying Document Similarity Measures for Template based Clustering of Web Documents." International Journal of Computer Sciences and Engineering 06.01 (2018): 37-42.

APA Style Citation: T.I. Bagban, P. J. Kulkarni, (2018). On Applying Document Similarity Measures for Template based Clustering of Web Documents. International Journal of Computer Sciences and Engineering, 06(01), 37-42.

BibTex Style Citation:
@article{Bagban_2018,
author = {T.I. Bagban and P. J. Kulkarni},
title = {On Applying Document Similarity Measures for Template based Clustering of Web Documents},
journal = {International Journal of Computer Sciences and Engineering},
issue_date = {2 2018},
volume = {06},
Issue = {01},
month = {2},
year = {2018},
issn = {2347-2693},
pages = {37-42},
url = {https://www.ijcseonline.org/full_spl_paper_view.php?paper_id=188},
doi = {https://doi.org/10.26438/ijcse/v6i1.3742},
publisher = {IJCSE, Indore, INDIA},
}

RIS Style Citation:
TY - JOUR
DO - https://doi.org/10.26438/ijcse/v6i1.3742
UR - https://www.ijcseonline.org/full_spl_paper_view.php?paper_id=188
TI - On Applying Document Similarity Measures for Template based Clustering of Web Documents
T2 - International Journal of Computer Sciences and Engineering
AU - T.I. Bagban
AU - P. J. Kulkarni
PY - 2018
DA - 2018/02/28
PB - IJCSE, Indore, INDIA
SP - 37-42
IS - 01
VL - 06
SN - 2347-2693
ER -

           

Abstract

The World Wide Web is a convenient and widely used source of information on the Internet. To reduce content generation and publishing time, templates are used to populate the contents of web documents. Templates give readers easy access to web document contents through a consistent layout and structure. For search engines, however, templates contribute irrelevant terms that degrade accuracy and performance. Templates are also used by the wrapper induction tools in information extractors to extract and integrate information from various e-commerce sites. Templates have therefore received a lot of attention in efforts to improve search engine performance and content integration. In this paper we discuss how heterogeneous web documents, i.e. web documents generated from different templates, can be clustered. We apply document similarity measures to cluster heterogeneous web documents generated from templates. Our experimental results on real data sets show that the cosine similarity measure is more suitable for template-based clustering of heterogeneous web documents.
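To make the idea of a document similarity measure concrete, the following is a minimal sketch of the two measures named in the keywords, cosine and Jaccard, applied to web pages. The abstract does not say which features the authors compare, so this sketch assumes, purely for illustration, that each page is represented by the counts of its HTML start tags. The names `TagCollector`, `tag_counts`, `cosine_similarity`, `jaccard_similarity` and the sample pages are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects start-tag names as a stand-in for template features (assumption)."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_counts(html):
    """Return a Counter mapping each tag name to its number of occurrences."""
    parser = TagCollector()
    parser.feed(html)
    return Counter(parser.tags)


def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors; sensitive to frequencies."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(a, b):
    """Size of the shared tag set over the combined tag set; ignores frequencies."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


# Hypothetical pages: page_a and page_b share a layout, page_c uses a different one.
page_a = "<html><body><div><h1>Phone</h1><p>Price</p></div></body></html>"
page_b = "<html><body><div><h1>Laptop</h1><p>Price</p><p>Specs</p></div></body></html>"
page_c = "<html><body><table><tr><td>News</td></tr></table></body></html>"

vec_a, vec_b, vec_c = tag_counts(page_a), tag_counts(page_b), tag_counts(page_c)
print(cosine_similarity(vec_a, vec_b), jaccard_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c), jaccard_similarity(vec_a, vec_c))
```

In this toy setup both measures score the two pages with a shared layout higher than the pair with different layouts. Cosine also reflects how often each tag occurs, while Jaccard only records whether a tag is present.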

Keywords / Index Terms

Template, Clustering, Cosine, Jaccard, Agglomerative Hierarchical Clustering
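As an illustration of how agglomerative hierarchical clustering can group pages by template, here is a minimal pure-Python sketch. It is not the authors' implementation: the average-linkage criterion, the `max_distance` stopping threshold, the tag-count vectors and all function names are assumptions chosen for the example, since the abstract does not specify these details.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """1 - cosine similarity between two count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - (dot / norm if norm else 0.0)


def average_linkage(c1, c2, vectors):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerative_clusters(vectors, max_distance=0.5):
    """Start with one cluster per page and repeatedly merge the closest pair.

    Merging stops when the closest pair is farther apart than max_distance
    (an assumed threshold) or when a single cluster remains.
    """
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = average_linkage(clusters[x], clusters[y], vectors)
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        if d > max_distance:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Hypothetical tag-count vectors: two div-based pages and two table-based pages.
pages = [
    Counter({"html": 1, "body": 1, "div": 3, "p": 2}),
    Counter({"html": 1, "body": 1, "div": 3, "p": 4}),
    Counter({"html": 1, "body": 1, "table": 1, "tr": 5, "td": 10}),
    Counter({"html": 1, "body": 1, "table": 1, "tr": 4, "td": 8}),
]
clusters = agglomerative_clusters(pages, max_distance=0.5)
print(clusters)
```

With these sample vectors the two div-based pages end up in one cluster and the two table-based pages in another. Replacing `cosine_distance` with a Jaccard-based distance would be one way to compare the two measures under the same clustering procedure.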
