CIVILICA We Respect the Science
(Specialized publisher of the country's conferences / Publication license number from the Ministry of Culture and Islamic Guidance: 8971)

Text Anomalies Detection Using Histograms of Words

Paper title: Text Anomalies Detection Using Histograms of Words
National paper ID: JR_ACSIJ-5-1_010
Published in: Issue 1, Volume 5, January, year 1395
Authors:

Abdulwahed Almarimi - Institute of Computer Science, Faculty of Science, P. J. Šafárik University in Košice, 04001 Košice, Slovakia
Gabriela Andrejková - Institute of Computer Science, Faculty of Science, P. J. Šafárik University in Košice, 04001 Košice, Slovakia

Abstract:
Authors of written texts can largely be characterized by a collection of attributes obtained from their texts. Texts by the same author are very similar in style. We can therefore assume that the attributes of a full text are very similar to the attributes of parts of the same text. In the same spirit, different parts of the same text can be compared with each other. In this paper, we describe an algorithm based on histograms of a text mapped to the interval [0, 1]. The mapping keeps the word order of the text. The histograms are analyzed from a clustering point of view. If the cluster dispersion is not large, the text was probably written by a single author. If the cluster dispersion is large, the text is split into two or more parts and the same analysis is applied to each part. The experiments were done on English and Arabic texts. For combined English texts, our algorithm discovers that the texts were not written by one author. We obtained similar results for combined Arabic texts. Our algorithm can be used for a basic analysis of whether a text was written by one author.
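
The procedure outlined in the abstract (map words to [0, 1], build histograms of text parts, measure their dispersion, split when the dispersion is large) can be illustrated with a minimal Python sketch. The abstract does not specify the mapping, the number of bins, the dispersion measure, or the threshold, so everything concrete here is an assumption made for illustration only: the mapping by scaled word length (`MAX_LEN`), the function names `map_words`, `histogram`, `dispersion` and `find_segments`, the mean Euclidean distance to the centroid as the dispersion, and the default parameter values. It should not be read as the authors' implementation.

```python
MAX_LEN = 15  # assumed cap on word length; not taken from the paper


def map_words(text):
    """Map each word to a value in [0, 1], keeping the word order.

    Assumption: the value is the word length scaled by MAX_LEN.
    """
    return [min(len(w), MAX_LEN) / MAX_LEN for w in text.lower().split()]


def histogram(values, bins=10):
    """Normalized histogram of values in [0, 1] with equal-width bins."""
    counts = [0] * bins
    for v in values:
        counts[min(int(v * bins), bins - 1)] += 1
    total = sum(counts) or 1  # avoid division by zero for empty input
    return [c / total for c in counts]


def dispersion(histograms):
    """Mean Euclidean distance of the histograms to their centroid."""
    n = len(histograms)
    centroid = [sum(col) / n for col in zip(*histograms)]
    dists = [
        sum((a - b) ** 2 for a, b in zip(h, centroid)) ** 0.5
        for h in histograms
    ]
    return sum(dists) / n


def find_segments(values, start=0, end=None, n_parts=4, bins=10,
                  threshold=0.1, min_len=40):
    """Return (start, end) word ranges that look stylistically uniform.

    A range is cut into n_parts equal parts (leftover words at the end
    of the range are ignored when building histograms). If the part
    histograms are close together, the range is kept whole; otherwise
    it is split in half and each half is analyzed in the same way.
    """
    if end is None:
        end = len(values)
    length = end - start
    if length < min_len:
        return [(start, end)]  # too short to analyze further
    size = length // n_parts
    hists = [
        histogram(values[start + i * size:start + (i + 1) * size], bins)
        for i in range(n_parts)
    ]
    if dispersion(hists) <= threshold:
        return [(start, end)]
    mid = start + length // 2
    return (find_segments(values, start, mid, n_parts, bins, threshold, min_len)
            + find_segments(values, mid, end, n_parts, bins, threshold, min_len))


if __name__ == "__main__":
    # Toy "combined" text: 200 very short words followed by 200 long ones.
    combined = " ".join(["a"] * 200 + ["extraordinarily"] * 200)
    print(find_segments(map_words(combined)))
```

A single returned range suggests one consistent style under these assumptions, while several ranges mark places where the word histograms change noticeably. With real texts, the threshold and the number of parts would have to be tuned on texts of known authorship.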

Keywords:
Authorship attribution, stylometry, anomaly detection, histogram

Paper page and full-text download: https://civilica.com/doc/793677/