CIVILICA We Respect the Science
(Specialized publisher of the country's conferences / Publication license number from the Ministry of Culture and Islamic Guidance: 8971)

Pattern Matching for Extraction of Core Contents from News Web Pages

Paper title: Pattern Matching for Extraction of Core Contents from News Web Pages
National paper ID: IRANWEB02_075
Published in: Second International Conference on Web Research, 1395 (Solar Hijri)
Author details:

Sandeep Sirsat - Associate Professor and Head, Department of Computer Science, Shri Shivaji Science & Arts College, Chikhali, Maharashtra, India
Vinay Chavan - Associate Professor and Head, Department of Computer Science, S. K. Prowal College, Kampti, Nagpur, India

Abstract:
Besides their core content, web pages contain other elements such as banners, navigational elements, copyright information, and external links. This noisy content covers much of the page area and is typically unrelated to the main subject of the page. Most information on web pages is represented in XML, HTML, or XHTML, formats that mostly hold semi-structured text documents lacking a formal document structure. Such documents do not discriminate between the text and the schema, and the amount of structure used to represent the text depends on the purpose. No semantics are applied to semi-structured documents. Retrieving relevant information therefore requires extracting the core content of a text document so that its words or sentences can be analysed. Although many existing methods formulate content identification as a DOM tree node selection problem, each has shortcomings. Here we propose an approach based on a pattern matching technique. It uses a simple heuristic to extract core content from web pages, which are mostly semi-structured in nature. It requires visiting the appropriate news web site by its URL, following the links to each news page of a specified category, and extracting the data, including metadata, from each of these news web pages. The approach uses a devised algorithm that applies regular expressions (regexes) to identify the correct pattern for extracting the actual text content from these news documents. The proposed approach handles news web pages of any size and extracts core content efficiently and with high accuracy.
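The steps named in the abstract (collect the links of one news category, then apply regexes to pull the text and metadata out of each page) can be illustrated with a minimal Python sketch. The abstract does not give the authors' patterns or heuristic, so everything here is an assumption made for illustration: the regexes, the list of "noise" tags, the `min_words` threshold, the function names `extract_links` and `extract_core`, and the `SAMPLE` page are all hypothetical. Fetching is left out; a page string could come from, for example, `urllib.request.urlopen`.

```python
import html
import re

# Hypothetical patterns; the paper's actual regexes are not given in the abstract.
NOISE_BLOCKS = re.compile(
    r"<(script|style|nav|header|footer|aside)\b[^>]*>.*?</\1\s*>",
    re.IGNORECASE | re.DOTALL,
)
COMMENTS = re.compile(r"<!--.*?-->", re.DOTALL)
PARAGRAPH = re.compile(r"<p\b[^>]*>(.*?)</p\s*>", re.IGNORECASE | re.DOTALL)
TITLE = re.compile(r"<title\b[^>]*>(.*?)</title\s*>", re.IGNORECASE | re.DOTALL)
LINK = re.compile(r'<a\b[^>]*\bhref\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)
TAG = re.compile(r"<[^>]+>")


def extract_links(page, prefix):
    """Return hrefs starting with `prefix`, in page order, without duplicates."""
    links = []
    for href in LINK.findall(page):
        if href.startswith(prefix) and href not in links:
            links.append(href)
    return links


def clean(fragment):
    """Strip inline tags, decode entities, and collapse whitespace."""
    text = html.unescape(TAG.sub(" ", fragment))
    return " ".join(text.split())


def extract_core(page, min_words=5):
    """Return the page title (metadata) and the paragraphs judged to be core text."""
    match = TITLE.search(page)
    title = clean(match.group(1)) if match else ""
    # Drop comments and whole noisy regions before looking for paragraphs.
    body = NOISE_BLOCKS.sub(" ", COMMENTS.sub(" ", page))
    paragraphs = [clean(p) for p in PARAGRAPH.findall(body)]
    # Assumed heuristic: very short paragraphs are treated as boilerplate.
    paragraphs = [p for p in paragraphs if len(p.split()) >= min_words]
    return {"title": title, "text": "\n".join(paragraphs)}


# Hypothetical page used only to exercise the sketch.
SAMPLE = """<html><head><title>Rain delays harvest</title>
<script>var x = "<p>not content</p>";</script></head>
<body><nav><a href="/sports/1">Sports</a><p>Home | World | Sports</p></nav>
<a href="/world/42">Story</a> <a href="/world/42">Story again</a>
<p>Heavy rain delayed the wheat harvest across the region this week.</p>
<p>Farmers &amp; traders said losses could <b>double</b> by Friday.</p>
<footer><p>Copyright 2016 Example News, all rights reserved here.</p></footer>
</body></html>"""

if __name__ == "__main__":
    print(extract_links(SAMPLE, "/world/"))
    print(extract_core(SAMPLE)["text"])
```

Regexes keep the sketch short and avoid building a DOM tree, in the spirit of the abstract, but this version has known limits: it does not handle nested noise blocks of the same tag, and it assumes article text sits in `<p>` elements.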

Keywords:
Pattern matching, Information extraction, Document Object Model, tags

Dedicated paper page and full-text download: https://civilica.com/doc/481719/