HTML Tag Structure Based Content Retrieval from Web Pages

S.S. Bhamare

School of Computer Sciences, Kavayitri Bahinabai Chaudhari North Maharashtra University, Jalgaon (M.S) India.

Section: Research Paper, Product Type: Journal Paper
Volume-10, Issue-11, Page no. 35-39, Nov-2022

CrossRef-DOI: https://doi.org/10.26438/ijcse/v10i11.3539

Online published on Nov 30, 2022

Copyright © S.S. Bhamare. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to Cite this Paper

IEEE Style Citation: S.S. Bhamare, “HTML Tag Structure Based Content Retrieval from Web Pages,” International Journal of Computer Sciences and Engineering, Vol.10, Issue.11, pp.35-39, 2022.

MLA Style Citation: S.S. Bhamare. "HTML Tag Structure Based Content Retrieval from Web Pages." International Journal of Computer Sciences and Engineering 10.11 (2022): 35-39.

APA Style Citation: S.S. Bhamare (2022). HTML Tag Structure Based Content Retrieval from Web Pages. International Journal of Computer Sciences and Engineering, 10(11), 35-39.

BibTex Style Citation:
@article{Bhamare_2022,
author = {S.S. Bhamare},
title = {HTML Tag Structure Based Content Retrieval from Web Pages},
journal = {International Journal of Computer Sciences and Engineering},
issue_date = {11 2022},
volume = {10},
Issue = {11},
month = {11},
year = {2022},
issn = {2347-2693},
pages = {35-39},
url = {https://www.ijcseonline.org/full_paper_view.php?paper_id=5531},
doi = {https://doi.org/10.26438/ijcse/v10i11.3539},
publisher = {IJCSE, Indore, INDIA},
}

RIS Style Citation:
TY - JOUR
DO - https://doi.org/10.26438/ijcse/v10i11.3539
UR - https://www.ijcseonline.org/full_paper_view.php?paper_id=5531
TI - HTML Tag Structure Based Content Retrieval from Web Pages
T2 - International Journal of Computer Sciences and Engineering
AU - S.S. Bhamare
PY - 2022
DA - 2022/11/30
PB - IJCSE, Indore, INDIA
SP - 35
EP - 39
IS - 11
VL - 10
SN - 2347-2693
ER -

Abstract

The World Wide Web (WWW) contains an enormous number of web pages that are accessible to users. These pages are formatted in HTML (HyperText Markup Language), and web pages, pictures, videos and other online content can all be accessed through a web browser, which makes the Web a very useful means of collecting information. Information retrieval systems help retrieve the relevant information from web documents. The retrieval process involves three stages: identifying the documents to be processed, writing a query, and using a search mechanism to retrieve the relevant web document information. This paper discusses how the HTML tag structure of a web page can be used to retrieve the main, informative content from web pages for efficient web mining operations.

Keywords / Index Terms

WWW, Web Page, HTML Tags, Text Density
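
To make the idea of tag-structure-based retrieval concrete, the following is a minimal sketch of one simple text-density heuristic: for each block-level element, divide the number of text characters in its subtree by the number of tags in that subtree, and keep the block with the highest ratio. This is an illustration only and is not taken from the paper; the names DensityParser and extract_main_content, the choice of block tags, and the scoring formula are all assumptions made for the example. It uses only Python's standard html.parser module.

```python
from html.parser import HTMLParser

# Assumed tag sets for this sketch; a real system would tune these.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
SKIP_TAGS = {"script", "style"}          # their text is never page content
BLOCK_TAGS = {"div", "p", "article", "section", "td", "main"}


class DensityParser(HTMLParser):
    """Scores block elements by text characters per tag in their subtree."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open elements, innermost last
        self.blocks = []   # (density, text) for each closed block element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            # Void elements have no closing tag; count them on the parent.
            if self.stack:
                self.stack[-1]["tags"] += 1
            return
        self.stack.append({"tag": tag, "text": [], "tags": 1})

    def handle_data(self, data):
        if not self.stack or any(n["tag"] in SKIP_TAGS for n in self.stack):
            return
        words = data.split()
        if words:
            self.stack[-1]["text"].append(" ".join(words))

    def handle_endtag(self, tag):
        if not any(n["tag"] == tag for n in self.stack):
            return  # stray closing tag, ignore it
        while self.stack:
            node = self.stack.pop()
            self._close(node)
            if node["tag"] == tag:
                break

    def _close(self, node):
        text = " ".join(node["text"])
        if node["tag"] in BLOCK_TAGS and text:
            self.blocks.append((len(text) / node["tags"], text))
        if self.stack:
            # Roll the subtree's text and tag count up into the parent.
            parent = self.stack[-1]
            parent["text"].extend(node["text"])
            parent["tags"] += node["tags"]


def extract_main_content(html):
    """Return the text of the block with the highest text density."""
    parser = DensityParser()
    parser.feed(html)
    parser.close()
    while parser.stack:                  # flush elements left unclosed
        parser._close(parser.stack.pop())
    if not parser.blocks:
        return ""
    return max(parser.blocks, key=lambda block: block[0])[1]


if __name__ == "__main__":
    page = """
    <html><body>
    <div id="nav"><a href="/">Home</a> <a href="/about">About</a></div>
    <div id="content"><p>Web pages mix informative text with navigation
    links, banners and other noise.</p></div>
    </body></html>
    """
    print(extract_main_content(page))
```

Link-heavy blocks such as navigation bars score low because they contain many tags and little text, while a paragraph of running text scores high. One limitation of this simple ratio is that it favours a single dense paragraph over a multi-paragraph article container, so a practical system would likely combine it with other signals.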
