Web Content Extraction Using Contextual Rules

Ahmad Pouramini; Shahram Nasiri

Published in: 2nd International Conference on Knowledge-Based Engineering and Innovation

Publication year: 1394 (Solar Hijri)

Document type: Conference paper

Language: English

This paper is 5 pages long and is available in PDF format.

Permanent link to this paper:

https://civilica.com/doc/553343

National scientific document ID:

KBEI02_293

Indexing date: 5 Bahman 1395 (Solar Hijri)

Abstract:

Extracting the main content from web pages has many applications, such as mobile phone browsing, enhancing page readability, and speech rendering for the visually impaired. In applications that provide a service to end users, identifying the content of interest is better served by user assistance through a visual tool than by an unsupervised method. In this paper, we propose a wrapping language, supported by a visual tool, for creating wrappers that extract the main content from web pages. The language is designed to be easy to use and expressive enough to cover the most common scenarios. Extraction rules in this language can employ various types of features (syntactic, semantic, visual, densitometric) to identify the content of interest. Moreover, contextual information can be used as context variables to restrict the application of each rule to certain parts of the page and to refine their content. Furthermore, the rules can be organized hierarchically so that wrappers for similar websites share common rules. The system is particularly suitable for extracting the main content from blogs, news, and encyclopedia websites.
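The abstract does not show the wrapping language itself, so the following Python sketch only illustrates the three ideas it names: feature-based rules, context restriction, and hierarchical rule sharing. Every name here (`Block`, `Rule`, `Wrapper`, the `region` field, the 0.3 link-density threshold) is a hypothetical assumption made for illustration and is not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Block:
    """A simplified page block (assumption: a real tool would derive these from the DOM)."""
    tag: str
    region: str          # contextual information, e.g. "article" or "sidebar"
    text: str
    link_chars: int = 0  # number of characters that sit inside links

    @property
    def link_density(self) -> float:
        # Densitometric feature: share of the block's text that is link text.
        return self.link_chars / len(self.text) if self.text else 1.0


@dataclass
class Rule:
    """Hypothetical extraction rule that only applies where `context` holds."""
    name: str
    context: Callable[[Block], bool]    # restricts the rule to parts of the page
    condition: Callable[[Block], bool]  # feature test on the block itself

    def matches(self, block: Block) -> bool:
        return self.context(block) and self.condition(block)


@dataclass
class Wrapper:
    """A rule set that inherits shared rules from an optional parent wrapper."""
    rules: List[Rule] = field(default_factory=list)
    parent: Optional["Wrapper"] = None

    def all_rules(self) -> List[Rule]:
        inherited = self.parent.all_rules() if self.parent else []
        return inherited + self.rules

    def extract(self, blocks: List[Block]) -> List[str]:
        rules = self.all_rules()
        return [b.text for b in blocks if any(r.matches(b) for r in rules)]


# Shared rule for news-like sites: low-link-density paragraphs inside the article.
news_base = Wrapper(rules=[
    Rule("body_paragraph",
         context=lambda b: b.region == "article",
         condition=lambda b: b.tag == "p" and b.link_density < 0.3),
])

# Site-specific wrapper that reuses the shared rule and adds a headline rule.
site_wrapper = Wrapper(parent=news_base, rules=[
    Rule("headline",
         context=lambda b: b.region == "article",
         condition=lambda b: b.tag == "h1"),
])

page = [
    Block("h1", "article", "Example headline"),
    Block("p", "article", "First paragraph of the story."),
    Block("p", "sidebar", "Related links and more", link_chars=20),
    Block("p", "article", "Read more here", link_chars=14),
]
main_content = site_wrapper.extract(page)
```

In this sketch the sidebar paragraph is rejected by the context test and the "Read more here" paragraph by the link-density test, so only the headline and the first paragraph are kept. Keeping the shared rule in a parent wrapper is one simple way to model rule sharing among wrappers for similar websites.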

Keywords:

Web mining, Main content extraction, Web wrappers

Authors

Ahmad Pouramini

Department of Computer Engineering, Sirjan University of Technology, Sirjan, Iran

Shahram Nasiri

Department of Computer Engineering, Sirjan University of Technology, Sirjan, Iran