Employing a novel content-based similarity measure for a machine learning-driven focused crawler

The World Wide Web is growing rapidly, to the point where managing its data has become challenging. Search engines collect data from across the web on behalf of users, and web crawlers, a major component of search engines, retrieve the data relevant to user requests. A focused crawler restricts itself to a predefined subject and retrieves only the pages relevant to it. In this paper, we propose an efficient focused web crawling approach that combines a content-based similarity measure with a Naive Bayes classifier to find pages relevant to a particular subject. Our initial experiments show satisfactory improvements: accuracy and recall increase by 4% and 1%, respectively.
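
The abstract does not spell out the similarity measure or how it is combined with the classifier, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a standard TF-IDF cosine similarity as a stand-in for the content-based measure, a multinomial Naive Bayes classifier with add-one smoothing, and a hypothetical weighted average (`alpha`) to blend the two scores. The function names, the toy training data, and the topic string are all illustrative assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict) per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF keeps weights positive even for terms in every document.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


class NaiveBayes:
    """Multinomial Naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = {t for doc in docs for t in doc}
        self.prior, self.counts, self.totals = {}, {}, {}
        for c in self.classes:
            class_docs = [d for d, y in zip(docs, labels) if y == c]
            self.prior[c] = math.log(len(class_docs) / len(docs))
            self.counts[c] = Counter(t for d in class_docs for t in d)
            self.totals[c] = sum(self.counts[c].values())
        return self

    def prob(self, doc, c):
        """Posterior probability of class c for a tokenized document."""
        v = len(self.vocab)
        scores = {}
        for k in self.classes:
            s = self.prior[k]
            for t in doc:
                if t in self.vocab:  # ignore terms never seen in training
                    s += math.log((self.counts[k][t] + 1) / (self.totals[k] + v))
            scores[k] = s
        m = max(scores.values())  # subtract the max for numerical stability
        exp = {k: math.exp(s - m) for k, s in scores.items()}
        return exp[c] / sum(exp.values())


def relevance(page, topic, model, train_docs, alpha=0.5):
    """Blend TF-IDF cosine similarity with the classifier's P(relevant).

    The weighted average is an assumption made for this sketch.
    """
    docs = train_docs + [tokenize(topic), tokenize(page)]
    vecs = tfidf_vectors(docs)
    sim = cosine(vecs[-2], vecs[-1])
    return alpha * sim + (1 - alpha) * model.prob(tokenize(page), "relevant")


# Toy, made-up training data (not from the paper).
TRAIN = [
    ("web crawler downloads pages and follows links", "relevant"),
    ("search engine index ranks crawled pages", "relevant"),
    ("chocolate cake recipe with butter and sugar", "irrelevant"),
    ("football match score and league table", "irrelevant"),
]
train_docs = [tokenize(text) for text, _ in TRAIN]
model = NaiveBayes().fit(train_docs, [label for _, label in TRAIN])
TOPIC = "focused web crawler search engine"

on_topic = relevance("a crawler follows links to index web pages", TOPIC, model, train_docs)
off_topic = relevance("bake the cake with sugar and butter", TOPIC, model, train_docs)
```

In a crawler, a score like this could be used to decide whether a fetched page is kept and how highly its outgoing links are prioritized in the crawl frontier. With the toy data above, the on-topic page scores higher than the off-topic one.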

Keywords:

Focused crawler, Web crawler, Naive Bayes classification, Relevant page, TF-IDF criteria

Authors

Atiye Jabalameli

Department of Electrical and Computer Engineering, University of Kashan, Kashan, Iran

S. Mehdi Vahidipour

Department of Electrical and Computer Engineering, University of Kashan, Kashan, Iran

Mohammad Mahdi Mohammadi

Department of Computer Engineering, Amirkabir University of Technology, Tehran, Iran

This paper is classified under the following subject categories:

Artificial Intelligence > Machine Learning

Permanent link to this paper:

https://civilica.com/doc/1011676

National scientific document ID:

CEPS06_121

Indexing date: 9 Ordibehesht 1399 (Solar Hijri calendar)

How to Cite This Paper:

To cite this paper in your research, you can use the following entry in your reference list:

Jabalameli, Atiye and Vahidipour, S. Mehdi and Mohammadi, Mohammad Mahdi, 1398, Employing a novel content-based similarity measure for a machine learning-driven focused crawler, 6th National Conference on Applied Research in Computer Engineering and Information Technology, Tehran, https://civilica.com/doc/1011676

Scientometrics

Details of the institution that published this paper:

Ranking of University of Kashan

Type of center: Public university

Paper count: 9,037
