Computing Semantic Similarity of Documents Based on Semantic Tensors

Publication year: 1394
Document type: Journal article
Language: English

This paper is 10 pages long and is available for download in PDF format.

National scientific document ID:

JR_JIST-3-2_008

Indexing date: 9 Esfand 1395

Abstract:

Exploiting the semantic content of texts has always been an important and challenging problem in Natural Language Processing because of its wide range of applications, such as finding documents related to a query, document classification, and computing the semantic similarity of documents. This paper proposes a novel corpus-based approach for computing the semantic similarity of texts that uses the Wikipedia corpus, organized in a three-dimensional tensor structure. First, the semantic vectors of the words in a document are obtained from the vector space derived from the words in Wikipedia articles; the semantic vector of the document is then formed from the vectors of its words. The semantic similarity of a pair of documents is computed by comparing their corresponding semantic vectors. Because its vectors are high-dimensional, the vector space of the Wikipedia corpus suffers from the curse of dimensionality: vectors in a high-dimensional space are usually very similar to one another, which makes it meaningless to try to identify the most appropriate semantic vector for a word. The proposed approach therefore tries to mitigate the curse of dimensionality by reducing the dimensions of the vector space through random indexing. This reduction also significantly lowers the memory consumption of the approach. Additionally, the structured co-occurrence obtained through random indexing makes it feasible to address synonymous and polysemous words.
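
The pipeline described in the abstract (word vectors built through random indexing, document vectors formed from word vectors, similarity by vector comparison) can be illustrated with a minimal sketch. It is not the authors' implementation: it uses a toy corpus instead of Wikipedia, omits the three-dimensional tensor organization, and assumes cosine similarity as the comparison measure. The names `index_vector`, `train_word_vectors`, `document_vector`, and `cosine`, and the values of `DIM`, `NONZERO`, and the context window, are all hypothetical choices made for illustration.

```python
import math
import random
from collections import defaultdict

DIM = 512     # reduced dimensionality (assumed value, not from the paper)
NONZERO = 8   # number of +1/-1 entries per index vector (assumed value)


def index_vector(word, dim=DIM, nonzero=NONZERO):
    """Sparse ternary random vector, reproducible for a given word."""
    rng = random.Random(word)  # seeding with the word keeps it deterministic
    vec = [0.0] * dim
    for pos in rng.sample(range(dim), nonzero):
        vec[pos] = rng.choice((-1.0, 1.0))
    return vec


def train_word_vectors(corpus, window=2, dim=DIM):
    """For each word, sum the index vectors of its neighbours in the window."""
    vectors = defaultdict(lambda: [0.0] * dim)
    for text in corpus:
        tokens = text.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    target = vectors[word]
                    for k, value in enumerate(index_vector(tokens[j], dim)):
                        target[k] += value
    return dict(vectors)


def document_vector(text, word_vectors, dim=DIM):
    """Form a document vector as the sum of its known word vectors."""
    doc = [0.0] * dim
    for token in text.lower().split():
        for k, value in enumerate(word_vectors.get(token, ())):
            doc[k] += value
    return doc


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Toy corpus standing in for Wikipedia articles (illustrative only).
    corpus = [
        "the cat sat on the mat",
        "the dog sat on the rug",
        "stocks fell as markets closed",
    ]
    word_vectors = train_word_vectors(corpus)
    doc_a = document_vector("the cat sat on the mat", word_vectors)
    doc_b = document_vector("the dog sat on the rug", word_vectors)
    print(cosine(doc_a, doc_b))
```

In this sketch every vector has a fixed length `DIM` regardless of vocabulary size, which is the source of the memory saving that the abstract attributes to random indexing. Words that occur in similar contexts accumulate similar sums of index vectors, so their vectors tend to point in similar directions.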

Authors

Navid Bahrami

Department of Electrical, Computer and IT Engineering, Qazvin Branch, Islamic Azad University, Qazvin, Iran

Amir Hossein Jadidinejad

Department of Electrical, Computer and IT Engineering, Qazvin Branch, Islamic Azad University, Qazvin, Iran

Mozhdeh Nazari

Department of Engineering, Guilan Science and Research Branch, Islamic Azad University, Rasht, Iran