CIVILICA We Respect the Science
(Specialized publisher of the country's conferences / Publishing license number from the Ministry of Culture and Islamic Guidance: 8971)

Web Data Extraction Using Textual Anchors

Article title: Web Data Extraction Using Textual Anchors
National article ID: JR_JKBEI-2-4_004
Published in Issue 4, Volume 2 (Jan), 1394
Author details:

Ahmad Pouramini - Department of Computer Engineering, Sirjan University of Technology, Sirjan, Iran

Abstract:
In this paper, we present an approach and a visual tool, called ABDES, for creating web wrappers that extract data records from web pages. Our approach relies mainly on the visible page content, simulating the way a human user scans a web page for specific data. To create a wrapper, we use text features such as textual delimiters, keywords, constants, or text patterns, which we call anchors, to build patterns for the target data regions and data records. We offer a polynomial-time data extraction algorithm in which these patterns are checked against the page elements in a mixed bottom-up and top-down traversal of the DOM tree. The extracted data is mapped directly onto a hierarchical XML structure as the output of the algorithm. The wrappers generated by the system are robust and independent of the HTML structure, so they can be adapted to multiple websites to gather and integrate information.
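To make the idea of textual anchors concrete, the following is a minimal sketch and not the ABDES algorithm itself: it scans the visible text of a page in document order instead of performing the mixed bottom-up and top-down DOM traversal described above. The anchor patterns (`Title:`, `Price:`), the field names, and the helper `extract_records` are hypothetical, chosen only for illustration. It assumes each field appears in a single text node and that the first field marks the start of a new record.

```python
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects visible text chunks, skipping script and style content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._hidden = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._hidden += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._hidden:
            self._hidden -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if text and not self._hidden:
            self.chunks.append(text)


# Hypothetical anchors: a keyword delimiter followed by the value to capture.
ANCHORS = {
    "title": re.compile(r"Title:\s*(.+)"),
    "price": re.compile(r"Price:\s*\$?(\d+(?:\.\d+)?)"),
}


def extract_records(html, anchors=ANCHORS, first_field="title"):
    """Match anchors against visible text and build an XML tree of records."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    root = ET.Element("records")
    record = None
    for chunk in parser.chunks:
        for name, pattern in anchors.items():
            match = pattern.search(chunk)
            if not match:
                continue
            # Assumption: the first field opens a new record.
            if name == first_field or record is None:
                record = ET.SubElement(root, "record")
            ET.SubElement(record, name).text = match.group(1)
    return root


if __name__ == "__main__":
    page = "<ul><li>Title: Dune</li><li>Price: $9.99</li></ul>"
    print(ET.tostring(extract_records(page), encoding="unicode"))
```

Because the patterns refer only to visible text, the same anchors can be applied to a list layout and a table layout of the same data without modification, which is the kind of independence from HTML structure the abstract claims for the generated wrappers.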

Keywords:
Web Data Record Extraction; Web Wrapper Generation; Web Information Extraction

Article page and full-text download: https://civilica.com/doc/489909/