Enhanced Self-Attention Model for Cross-Lingual Semantic Textual Similarity in SOV and SVO Languages: Persian and English Case Study

Publish Year: 1402
Document Type: Journal Article
زبان: English

This paper is 14 pages long and is available for download in PDF format.


National Scientific Document ID: JR_JCR-17-1_005

Indexing Date: 13 Dey 1402

Abstract:

Semantic Textual Similarity (STS) is a subfield of natural language processing (NLP) that has received extensive research attention in recent years. Measuring the semantic similarity between words, phrases, paragraphs, and documents plays a significant role in NLP and computational linguistics, with applications in plagiarism detection, machine translation, information retrieval, and related areas. STS aims to develop computational methods that capture nuanced degrees of resemblance in meaning between words, phrases, sentences, paragraphs, or entire documents, a task that is especially challenging for languages with few digital resources. One of the most important dimensions of linguistic diversity is word-order variation: some languages follow Subject-Object-Verb (SOV) word order, while others follow Subject-Verb-Object (SVO) patterns. These structural disparities, compounded by features such as pronoun-dropping, make cross-lingual STS for languages like Persian exceptionally intricate. For such low-resource languages, this study proposes a customized model based on linguistic properties. Leveraging the pronoun-dropping and SOV word-order characteristics of Persian, we introduce an innovative enhancement: a novel weighted relative positional encoding integrated into the self-attention mechanism. Moreover, we enrich context representations by infusing co-occurrence information through pointwise mutual information (PMI) factors. This paper introduces a cross-lingual model for semantic similarity analysis between Persian and English texts, utilizing parallel corpora. The experiments show that our proposed model achieves better performance than competing models.
An ablation study also shows that our system converges faster and is less prone to overfitting. The proposed model is evaluated on Persian-English and Persian-Persian STS benchmarks and achieved 88.29% and 91.65% Pearson correlation coefficients on monolingual and cross-lingual STS-B, respectively.
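The two mechanisms named in the abstract can be illustrated in simplified form. The sketch below is hypothetical: the paper's actual weighting scheme, distance function, and the way PMI factors are injected into the context representations are not reproduced here. It shows (a) a PMI matrix computed from word co-occurrence counts and (b) scaled dot-product self-attention with an additive, weighted relative-position bias (here simply `-rel_weight * |i - j|`, an illustrative choice favouring nearby tokens).

```python
import math
import numpy as np

def pmi_matrix(cooc):
    """Positive PMI from a word co-occurrence count matrix.

    cooc: (V, V) array of raw co-occurrence counts.
    Returns max(log(p(i,j) / (p(i) * p(j))), 0) elementwise.
    """
    total = cooc.sum()
    p_ij = cooc / total                              # joint probabilities
    p_i = cooc.sum(axis=1, keepdims=True) / total    # row marginals
    p_j = cooc.sum(axis=0, keepdims=True) / total    # column marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_ij / (p_i * p_j))
    return np.maximum(pmi, 0.0)  # clip negatives/-inf: positive PMI

def weighted_relative_attention(q, k, v, rel_weight=0.1):
    """Self-attention with an illustrative weighted relative-position bias.

    q, k, v: (seq_len, d) arrays for a single head.
    rel_weight scales the penalty applied to distant token pairs.
    """
    seq_len, d = q.shape
    scores = q @ k.T / math.sqrt(d)                  # scaled dot products
    pos = np.arange(seq_len)
    rel_bias = -rel_weight * np.abs(pos[:, None] - pos[None, :])
    scores = scores + rel_bias                       # additive position bias
    # numerically stable softmax over each row
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

A learned per-distance weight table (rather than a single scalar `rel_weight`) would bring this closer to standard relative positional encodings; the additive-bias formulation is used here because it composes cleanly with the softmax.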

Authors

Ebrahim Ganjalipour

Department of Applied Mathematics and Computer Science, Lahijan Branch, Islamic Azad University, Lahijan, Iran

Amir Hossein Refahi Sheikhani

Department of Applied Mathematics and Computer Science, Lahijan Branch, Islamic Azad University, Lahijan, Iran

Sohrab Kordrostami

Department of Applied Mathematics and Computer Science, Lahijan Branch, Islamic Azad University, Lahijan, Iran

Ali Asghar Hosseinzadeh

Department of Applied Mathematics and Computer Science, Lahijan Branch, Islamic Azad University, Lahijan, Iran