Authorship Clustering using Homogeneous Feature Space and Two-stepped Automatic Fuzzy Cmeans Clustering

Publication Year: 1399 (Solar Hijri)
Document Type: Journal Article
Language: English

Length: 10 pages (PDF)

National Scientific Document ID:

JR_JAISIS-1-1_006

Indexing Date: 17 Farvardin 1400

Abstract:

Identifying the author of an anonymous or disputed document is a cornerstone of automatic forensic applications. It is also a challenging task for both humans and computers, given the complex content of documents and the variety of their backgrounds. By its nature, it is always treated as an unsupervised task. Clustering documents according to the linguistic style of their authors has received little attention from the research community. The PAN evaluation framework was the first effort to promote the development of author clustering. Among the different approaches to the task, this article proposes a method based on a set of homogeneous features and two-stepped automatic fuzzy c-means (FCM) clustering. We use word n-grams, part-of-speech tags, and some other context-free features; we then estimate the number of clusters using a document similarity graph (DSG); finally, we use FCM to cluster the corpus. The method completes the task in a very short amount of time, and its performance is comparable with that of the leaderboard competitors in the PAN CLEF 2017 challenge.
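
The pipeline outlined in the abstract (features, then an estimate of the number of clusters from a similarity graph, then FCM) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not say how the DSG is built or what the two FCM steps are, so the sketch makes its own assumptions: it uses only word n-gram frequencies (no part-of-speech tags), links two documents when their cosine similarity reaches a hypothetical `threshold`, takes the number of connected components as the cluster count, and runs a single pass of textbook fuzzy c-means. All function names and parameter values are illustrative.

```python
import math
import random
from collections import Counter


def ngram_profile(text, n=2):
    """Relative frequencies of the word n-grams in one document."""
    words = text.lower().split()
    grams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    total = sum(grams.values()) or 1
    return {g: c / total for g, c in grams.items()}


def cosine(p, q):
    """Cosine similarity between two sparse profiles (dicts)."""
    dot = sum(v * q.get(g, 0.0) for g, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def estimate_k(profiles, threshold=0.3):
    """Assumed heuristic (not from the paper): count connected components of
    the graph linking documents whose similarity reaches `threshold`."""
    parent = list(range(len(profiles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(profiles)):
        for j in range(i + 1, len(profiles)):
            if cosine(profiles[i], profiles[j]) >= threshold:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(len(profiles))})


def to_vectors(profiles):
    """Project sparse profiles onto one shared, sorted n-gram vocabulary."""
    vocab = sorted({g for p in profiles for g in p})
    return [[p.get(g, 0.0) for g in vocab] for p in profiles]


def fuzzy_c_means(points, k, m=2.0, iters=100, seed=0):
    """Textbook FCM with fuzzifier m > 1; returns memberships u[i][c]."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    u = [[rng.random() for _ in range(k)] for _ in range(n)]
    u = [[x / sum(row) for x in row] for row in u]  # each row sums to 1
    for _ in range(iters):
        centers = []
        for c in range(k):
            w = [u[i][c] ** m for i in range(n)]
            tot = sum(w)
            centers.append([sum(w[i] * points[i][d] for i in range(n)) / tot
                            for d in range(dim)])
        for i in range(n):
            # Clamp distances so a point sitting on a center does not divide by zero.
            dist = [max(math.dist(points[i], centers[c]), 1e-12) for c in range(k)]
            inv = [d ** (-2.0 / (m - 1.0)) for d in dist]
            s = sum(inv)
            u[i] = [x / s for x in inv]
    return u


if __name__ == "__main__":
    docs = [
        "the cat sat on the mat",
        "the cat sat on the rug",
        "stocks fell sharply on monday",
        "stocks rose sharply on monday",
    ]
    profiles = [ngram_profile(d) for d in docs]
    k = estimate_k(profiles)
    memberships = fuzzy_c_means(to_vectors(profiles), k)
    print(k, [[round(x, 2) for x in row] for row in memberships])
```

Estimating the cluster count before running FCM matters because FCM needs the number of clusters as an input, while in author clustering the number of authors is unknown. The threshold used here is an arbitrary illustrative value and would need tuning on real data.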

Authors

Mohammad Aminian

Computer Engineering Department, Bu Ali Sina University, Hamedan, Iran

Mahdi Eskandari

Computer Engineering Department, Bu Ali Sina University, Hamedan, Iran