Open Access Article

Noise Removal from News Web Sites

N. Narwal¹

  1. Dept. of Computer Science, Maharaja Surajmal Institute (GGSIP University), New Delhi, India.

Correspondence should be addressed to: neetunarwal@gmail.com.

Section: Survey Paper, Product Type: Journal Paper
Volume-5, Issue-9, Page no. 237-243, Sep-2017

CrossRef-DOI: https://doi.org/10.26438/ijcse/v5i9.237243

Published online on Sep 30, 2017

Copyright © N. Narwal. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to Cite this Paper

IEEE Style Citation: N. Narwal, “Noise Removal from News Web Sites,” International Journal of Computer Sciences and Engineering, Vol. 5, Issue 9, pp. 237-243, 2017.

MLA Style Citation: Narwal, N. "Noise Removal from News Web Sites." International Journal of Computer Sciences and Engineering 5.9 (2017): 237-243.

APA Style Citation: Narwal, N. (2017). Noise Removal from News Web Sites. International Journal of Computer Sciences and Engineering, 5(9), 237-243.

BibTex Style Citation:
@article{Narwal_2017,
author = {N. Narwal},
title = {Noise Removal from News Web Sites},
journal = {International Journal of Computer Sciences and Engineering},
issue_date = {9 2017},
volume = {5},
Issue = {9},
month = {9},
year = {2017},
issn = {2347-2693},
pages = {237-243},
url = {https://www.ijcseonline.org/full_paper_view.php?paper_id=1463},
doi = {https://doi.org/10.26438/ijcse/v5i9.237243},
publisher = {IJCSE, Indore, INDIA},
}

RIS Style Citation:
TY - JOUR
DO - https://doi.org/10.26438/ijcse/v5i9.237243
UR - https://www.ijcseonline.org/full_paper_view.php?paper_id=1463
TI - Noise Removal from News Web Sites
T2 - International Journal of Computer Sciences and Engineering
AU - N. Narwal
PY - 2017
DA - 2017/09/30
PB - IJCSE, Indore, INDIA
SP - 237
EP - 243
IS - 9
VL - 5
SN - 2347-2693
ER -

Abstract

Most websites contain useful information, but alongside it they carry irrelevant material such as advertisements, copyright notices, and external links. This irrelevant material is considered noise. On popular English-language news web sites such as Times of India, Hindustan Times, and Indian Express, only 30-40% of a page is news-related information; the rest is noise. In this paper we propose a novel approach that extracts informative content from news web sites in an unsupervised fashion. Our method uses a web page segmentation technique to partition the page into non-overlapping rectangular blocks. An Artificial Neural Network (ANN) classifier then uses the features of each block to label it as relevant or irrelevant. The main content blocks are retained, and the user is presented with a clean news web page. Empirical evaluation shows that the ANN classifier identifies web content with 96.03% accuracy, which allows the page content to be filtered accurately.
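The classify-and-filter step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation. The abstract does not specify the block features, the network architecture, or how cosine similarity is used, so the sketch assumes two hypothetical features per block (link density and cosine similarity to the headline) and stands in a single sigmoid unit for the ANN. The function names, the hand-labelled training vectors, and the toy page are all invented for illustration, and page segmentation itself is not shown.

```python
import math
import re
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the bag-of-words vectors of two strings."""
    a = Counter(re.findall(r"[a-z0-9]+", text_a.lower()))
    b = Counter(re.findall(r"[a-z0-9]+", text_b.lower()))
    dot = sum(count * b[word] for word, count in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


def block_features(block, headline):
    """Assumed features: share of words inside links, similarity to headline."""
    n_words = len(block["text"].split())
    link_density = block["link_words"] / n_words if n_words else 1.0
    return [link_density, cosine_similarity(block["text"], headline)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(features, weights, bias):
    """Probability that a block is relevant, from a single sigmoid unit."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def train(samples, labels, epochs=2000, lr=0.5):
    """Fit the unit by batch gradient descent on the log loss."""
    weights, bias = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in zip(samples, labels):
            err = predict(x, weights, bias) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


def filter_page(blocks, headline, weights, bias):
    """Keep only the blocks the classifier labels as relevant."""
    return [b for b in blocks
            if predict(block_features(b, headline), weights, bias) >= 0.5]


# Hand-made training data (an assumption): [link_density, headline_similarity].
train_x = [[0.05, 0.6], [0.1, 0.5], [0.0, 0.7],   # relevant blocks
           [0.9, 0.0], [0.8, 0.1], [1.0, 0.05]]   # noise blocks
train_y = [1, 1, 1, 0, 0, 0]
weights, bias = train(train_x, train_y)

# Toy page, already segmented into blocks (segmentation is not shown here).
headline = "Budget session begins in parliament"
page = [
    {"text": "Home Sports Business Subscribe Login", "link_words": 5},
    {"text": "Parliament budget session begins Monday as finance minister "
             "presents the budget", "link_words": 0},
    {"text": "Privacy Policy Terms of Use Contact Us", "link_words": 7},
]
kept = filter_page(page, headline, weights, bias)
for block in kept:
    print(block["text"])
```

On this toy page the navigation and footer blocks consist entirely of link text and share no words with the headline, so they fall on the noise side of the learned boundary and only the article block is kept. A real system would replace the two assumed features with whatever visual and textual features the segmentation step provides, and the single unit with a multi-layer network.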

Keywords / Index Terms

Artificial Neural Network, Web Page Segmentation, Visual Blocks, Cosine Similarity
