Pattern Matching for Extraction of Core Contents from News Web Pages

Publication Year: 1395
Document type: Conference paper
Language: English
This paper is 6 pages long and is available for download in PDF format.

National scientific document ID:

IRANWEB02_075

Indexing date: 9 Mordad 1395

Abstract:

Web pages, besides their core content, contain other elements such as banners, navigational elements, copyright information, and external links. This noisy content takes up much of the page area and is typically unrelated to the main subject of the page. Most of the information available on web pages is represented in XML, HTML, or XHTML format, which mostly holds semi-structured text documents that lack a formal document structure. Such documents do not discriminate between the text and the schema, and the amount of structure used to represent the text depends on the purpose. No semantics are applied to semi-structured documents. The core content of a text document must therefore be extracted before its words or sentences can be analysed to retrieve relevant information. Although many existing methods formulate the actual content identification problem as a DOM tree node selection problem, each has some lacunae. Here we propose an approach based on a pattern matching technique. The technique uses a simple heuristic to extract core content from web pages, which are mostly semi-structured in nature. It requires visiting the appropriate news web site using its URL, accessing the links related to each news page of a specified category, and extracting the data, including metadata, from each of these news web pages. The approach uses a devised algorithm that applies regular expressions (regexes) to identify the correct pattern for extracting the actual text content from these news documents. The proposed approach deals with news web pages of any size and extracts core content efficiently and with high accuracy.
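
The abstract does not give the regular expressions or the heuristic the authors actually use, so the following is only a minimal Python sketch of the general idea: regex-based removal of noisy elements, followed by a simple word-count filter on the remaining paragraphs. Every pattern, function name (`extract_core_content`, `category_links`), the `min_words` threshold, and the assumption that category links contain `/<category>/` in their `href` are hypothetical illustrations, not the paper's algorithm. Fetching the page from its URL is left out so that the sketch runs without network access.

```python
import re
from html import unescape

# Hypothetical patterns: the paper's real regexes are not stated in the abstract.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
NOISE_RE = re.compile(
    r"<(script|style|nav|header|footer|aside)\b[^>]*>.*?</\1\s*>",
    re.IGNORECASE | re.DOTALL,
)
TITLE_RE = re.compile(r"<title\b[^>]*>(.*?)</title\s*>", re.IGNORECASE | re.DOTALL)
PARAGRAPH_RE = re.compile(r"<p\b[^>]*>(.*?)</p\s*>", re.IGNORECASE | re.DOTALL)
LINK_RE = re.compile(r'<a\b[^>]*\bhref="([^"]+)"', re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")


def clean_text(fragment):
    """Drop any remaining tags, decode entities, and collapse whitespace."""
    text = unescape(TAG_RE.sub(" ", fragment))
    return re.sub(r"\s+", " ", text).strip()


def extract_core_content(html, min_words=5):
    """Return the page title and the paragraphs that look like article text.

    Assumed heuristic: a paragraph with fewer than `min_words` words is
    treated as noise (button labels, bylines, share widgets).
    """
    html = COMMENT_RE.sub(" ", html)
    title_match = TITLE_RE.search(html)
    title = clean_text(title_match.group(1)) if title_match else ""
    body = NOISE_RE.sub(" ", html)  # remove navigation, scripts, footers, etc.
    paragraphs = [clean_text(p) for p in PARAGRAPH_RE.findall(body)]
    paragraphs = [p for p in paragraphs if len(p.split()) >= min_words]
    return {"title": title, "paragraphs": paragraphs}


def category_links(html, category):
    """Collect hrefs whose path contains the given category segment."""
    return [href for href in LINK_RE.findall(html) if f"/{category}/" in href]
```

Calling `extract_core_content(page_html)` on a downloaded news page would return a dictionary with the title and the retained paragraphs, and `category_links(index_html, "sports")` would list the candidate article links for one category. Regexes are brittle on malformed or deeply nested markup, which is one reason the DOM-based methods mentioned in the abstract exist; this sketch makes no claim about the accuracy the authors report.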

Authors

Sandeep Sirsat

Associate Professor and Head, Department of Computer Science, Shri Shivaji Science & Arts College, Chikhali, Maharashtra, India

Vinay Chavan

Associate Professor and Head, Department of Computer Science, S. K. Prowal College, Kampti, Nagpur, India