CIVILICA We Respect the Science
(Specialized publisher of national conference proceedings / Publication license number from the Ministry of Culture and Islamic Guidance: 8971)

Information Theoretic Text Classification

Paper title: Information Theoretic Text Classification
National paper ID: ACCSI12_350
Published in: 12th Annual Conference of the Computer Society of Iran, 1385 (Solar Hijri calendar)
Authors:

Aavani - Mathematical Sciences Dept., Sharif Univ. Tech., Tehran, Iran; Sepanta Robotics & AI Research Foundation, Tehran, Iran
Farjudian - Mathematical Sciences Dept., Sharif Univ. Tech., Tehran, Iran
almani-Jelodar - Sepanta Robotics & AI Research Foundation, Tehran, Iran
Andalib - Sepanta Robotics & AI Research Foundation, Tehran, Iran

Abstract:
Assigning natural language texts to two or more predefined categories based on their content is an important component of many information organization and management tasks. This paper presents an information-theoretic approach to the text classification problem, which we call ITTC. We prove that ITTC is theoretically equivalent to the Bayesian classifier. However, when classification is performed over dynamic or noisy data, or when the training data do not represent all probable cases, ITTC outperforms the Bayesian classifier. We also show that the complexity of ITTC on the test set grows linearly with the size of the input data. We use several newsgroups to evaluate our approach and demonstrate its superior performance.
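
The abstract does not describe ITTC's construction, so the following is only an illustrative sketch of the general idea behind an entropy-based classifier, not the authors' method. It assumes a Laplace-smoothed unigram model per category and assigns a text to the category under which it has the shortest code length, i.e. the smallest value of -sum(log2 P(w | c)). With equal category priors this is the same decision as maximizing the naive Bayes likelihood, which is one simple way to see how an information-theoretic rule and a Bayesian one can coincide. The function names train, code_length and classify are hypothetical.

```python
import math
from collections import Counter


def train(labeled_docs):
    """Collect per-category word counts and the shared vocabulary.

    labeled_docs is an iterable of (label, text) pairs (assumed format).
    """
    counts = {}
    vocab = set()
    for label, text in labeled_docs:
        words = text.lower().split()
        counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return counts, vocab


def code_length(words, class_counts, vocab_size):
    """Bits needed to encode words under a Laplace-smoothed unigram model.

    The extra +1 in the denominator reserves probability mass for words
    never seen in training (an assumption of this sketch).
    """
    total = sum(class_counts.values()) + vocab_size + 1
    return -sum(math.log2((class_counts[w] + 1) / total) for w in words)


def classify(text, counts, vocab):
    """Return the category that encodes the text in the fewest bits."""
    words = text.lower().split()
    return min(counts, key=lambda c: code_length(words, counts[c], len(vocab)))
```

For a fixed number of categories, classify in this sketch makes one pass over the words of the text per category, so its cost grows linearly with the length of the input. A call might look like classify("memory update", *train(docs)), where docs is a list of (label, text) pairs.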

Keywords:
Bayesian classifier, Entropy, Markov chain, text classification

Paper page and full-text download: https://civilica.com/doc/44736/