Enhanced Self-Attention Model for Cross-Lingual Semantic Textual Similarity in SOV and SVO Languages: Persian and English Case Study

Publish Year: 1402
Document Type: Journal Article
زبان: English

This paper is 14 pages long and is available for download in PDF format.


National Scientific Document ID: JR_JCR-17-1_005

Indexing Date: 13 Dey 1402

Abstract:

Semantic Textual Similarity (STS) is a subfield of natural language processing (NLP) that has received extensive research attention in recent years. Measuring the semantic similarity between words, phrases, paragraphs, and documents plays a significant role in NLP and computational linguistics, with applications in plagiarism detection, machine translation, information retrieval, and related areas. STS aims to develop computational methods that capture nuanced degrees of resemblance in meaning between words, phrases, sentences, paragraphs, or entire documents, a task that is especially challenging for languages with few digital resources. One of the most important dimensions of linguistic diversity is word-order variation: some languages follow Subject-Object-Verb (SOV) word order, while others follow Subject-Verb-Object (SVO) patterns. These structural disparities, compounded by features such as pronoun-dropping, make cross-lingual STS for languages like Persian exceptionally intricate. For such low-resource languages, this study proposes a customized model based on linguistic properties. Leveraging the pronoun-dropping and SOV word-order characteristics of Persian, we introduce an innovative enhancement: a novel weighted relative positional encoding integrated into the self-attention mechanism. Moreover, we enrich context representations by infusing co-occurrence information through pointwise mutual information (PMI) factors. This paper introduces a cross-lingual model for semantic similarity analysis between Persian and English texts, utilizing parallel corpora. The experiments show that our proposed model achieves better performance than competing models.
An ablation study also shows that our system converges faster and is less prone to overfitting. The proposed model is evaluated on Persian-English and Persian-Persian STS benchmarks and achieved 88.29% and 91.65% Pearson correlation coefficients on monolingual and cross-lingual STS-B, respectively.
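The two mechanisms named in the abstract can be illustrated in simplified form. The sketch below is hypothetical: the paper's actual weighting scheme, distance function, and the way PMI factors are injected into the context representations are not reproduced here. It shows (a) a PMI matrix computed from word co-occurrence counts and (b) scaled dot-product self-attention with an additive, weighted relative-position bias (here simply `-rel_weight * |i - j|`, an illustrative choice favouring nearby tokens).

```python
import math
import numpy as np

def pmi_matrix(cooc):
    """Positive PMI from a word co-occurrence count matrix.

    cooc: (V, V) array of raw co-occurrence counts.
    Returns max(log(p(i,j) / (p(i) * p(j))), 0) elementwise.
    """
    total = cooc.sum()
    p_ij = cooc / total                              # joint probabilities
    p_i = cooc.sum(axis=1, keepdims=True) / total    # row marginals
    p_j = cooc.sum(axis=0, keepdims=True) / total    # column marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_ij / (p_i * p_j))
    return np.maximum(pmi, 0.0)  # clip negatives/-inf: positive PMI

def weighted_relative_attention(q, k, v, rel_weight=0.1):
    """Self-attention with an illustrative weighted relative-position bias.

    q, k, v: (seq_len, d) arrays for a single head.
    rel_weight scales the penalty applied to distant token pairs.
    """
    seq_len, d = q.shape
    scores = q @ k.T / math.sqrt(d)                  # scaled dot products
    pos = np.arange(seq_len)
    rel_bias = -rel_weight * np.abs(pos[:, None] - pos[None, :])
    scores = scores + rel_bias                       # additive position bias
    # numerically stable softmax over each row
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

A learned per-distance weight table (rather than a single scalar `rel_weight`) would bring this closer to standard relative positional encodings; the additive-bias formulation is used here because it composes cleanly with the softmax.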

Authors

Ebrahim Ganjalipour

Department of Applied Mathematics and Computer Science, Lahijan Branch, Islamic Azad University, Lahijan, Iran

Amir Hossein Refahi Sheikhani

Department of Applied Mathematics and Computer Science, Lahijan Branch, Islamic Azad University, Lahijan, Iran

Sohrab Kordrostami

Department of Applied Mathematics and Computer Science, Lahijan Branch, Islamic Azad University, Lahijan, Iran

Ali Asghar Hosseinzadeh

Department of Applied Mathematics and Computer Science, Lahijan Branch, Islamic Azad University, Lahijan, Iran