Language detection for classification and content-based web pages filtering

Publish Year: 1390
Document type: Conference paper
Language: English

Pages: 5 (PDF format)

National scientific document ID:

IAUHNCEC01_064

Indexing date: 18 Tir 1391

Abstract:

As the number of documents on the internet grows daily, automatic language detection is becoming more important. In this paper we use a language detection system to classify and filter immoral web pages based on their content. The system can detect the 10 most used languages in immoral web pages, including Farsi. We introduce a new combined method that consists of three parts: a URL processor, a page encoding processor, and a text processor. To produce a final result, the system has a voter that combines the outputs of these three parts. We used immoral web pages and labeled web pages as the input data set, both to build a linguistic model for each language and to evaluate the system. Our experiments show 95% accuracy. The URL alone is not sufficient, because the name used in the address may not reveal that a page is immoral. The encoding alone is not sufficient either, because many web pages in different languages can use the same encoding. Consequently, no single method can solve the problem by itself. This paper shows that combining these three methods gives very promising results. The paper is structured as follows: related work, problem definition, proposed solution, interpretation of results, conclusion, and future work.
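
The idea of a voter combining three independent guesses can be illustrated with a minimal sketch. The abstract does not state the voting rule, so the weighted majority vote below, the `WEIGHTS` values, and the `vote` function name are all assumptions made for illustration, not the authors' method.

```python
from collections import defaultdict
from typing import Dict, Optional

# Hypothetical weights: the abstract does not say how the voter weighs each part.
WEIGHTS = {"url": 1.0, "encoding": 1.0, "text": 2.0}


def vote(guesses: Dict[str, Optional[str]],
         weights: Dict[str, float] = WEIGHTS) -> Optional[str]:
    """Return the language with the highest total weight, or None if no part guessed."""
    scores: Dict[str, float] = defaultdict(float)
    for part, lang in guesses.items():
        if lang is not None:  # a part may abstain when it finds no evidence
            scores[lang] += weights.get(part, 0.0)
    if not scores:
        return None
    # Ties are broken alphabetically so the result is deterministic.
    return min(scores, key=lambda lang: (-scores[lang], lang))


# The URL suggests English, but the encoding and the text both suggest Farsi.
result = vote({"url": "en", "encoding": "fa", "text": "fa"})
```

Here a misleading URL is outvoted by the other two parts, which is the kind of failure of a single method that the abstract describes.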

Authors

Saman Bashbaghi

Computer Engineering Dept., Bu-Ali Sina University, Hamedan, Iran

Hassan Khotanlou

Computer Engineering Dept., Bu-Ali Sina University, Hamedan, Iran
