Domain-based Abbreviation Expansion using Topic Modelling for Data Cleaning

Publish Year: 1403
Type: Journal paper
Language: English

This paper is 12 pages long and is available for download in PDF format.

Document National Code:

JR_JCSE-11-2_001

Index date: 25 January 2025

Domain-based Abbreviation Expansion using Topic Modelling for Data Cleaning abstract

Data cleaning is a necessary step in data analytics and management and plays an essential role in obtaining reliable data for further analysis and business use. The main challenge with unstructured textual data is applying data cleaning effectively: large volumes of text are generated rapidly and must be cleaned before they can be used to extract knowledge. Abbreviations appear increasingly often in textual datasets, so abbreviation disambiguation is a critical issue for data analysis and data reliability. Many abbreviation expansion approaches have been proposed to handle this issue, but they do not take into account the domain to which the abbreviations belong. To overcome this drawback, this paper proposes a domain-based abbreviation expansion method using topic modeling. In this method, topic modeling is first applied to the text, and the domain is then determined from the contribution of the topics, which reduces the search space. Finally, abbreviations are expanded according to their domains. The proposed method was validated by applying it to a COVID-19 tweets dataset and employing a logistic regression classifier. Two comparisons were made: against the dataset before applying the proposed method, and against the results of the CrowdCorrect approach. The results show a clear improvement in the precision, recall, and accuracy of the classifier.
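
The last two steps described in the abstract (choosing a domain from topic contributions, then expanding abbreviations within that domain only) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the per-document topic weights have already been produced by some topic model, and the topic-to-domain mapping, the dictionaries, and the function names `pick_domain` and `expand_abbreviations` are all hypothetical.

```python
import re

# Hypothetical mapping from topic index to domain label (assumed, not from the paper).
TOPIC_TO_DOMAIN = {0: "health", 1: "health", 2: "finance"}

# Hypothetical per-domain abbreviation dictionaries (illustrative entries only).
DOMAIN_DICTIONARIES = {
    "health": {"icu": "intensive care unit", "pt": "patient"},
    "finance": {"pt": "part time", "roi": "return on investment"},
}


def pick_domain(topic_weights, topic_to_domain=TOPIC_TO_DOMAIN):
    """Sum topic contributions per domain and return the domain with the largest total."""
    totals = {}
    for topic, weight in enumerate(topic_weights):
        domain = topic_to_domain[topic]
        totals[domain] = totals.get(domain, 0.0) + weight
    return max(totals, key=totals.get)


def expand_abbreviations(text, topic_weights):
    """Expand abbreviations using only the dictionary of the inferred domain."""
    domain = pick_domain(topic_weights)
    dictionary = DOMAIN_DICTIONARIES[domain]

    def replace(match):
        # Look up each alphabetic token case-insensitively; keep unknown tokens as-is.
        token = match.group(0)
        return dictionary.get(token.lower(), token)

    return domain, re.sub(r"[A-Za-z]+", replace, text)


# The same abbreviation "pt" expands differently depending on the inferred domain.
print(expand_abbreviations("pt moved to ICU", [0.5, 0.3, 0.2]))
print(expand_abbreviations("pt job with good ROI", [0.1, 0.1, 0.8]))
```

Restricting the lookup to one domain dictionary is what shrinks the search space in this sketch: an ambiguous abbreviation such as "pt" has only one candidate once the domain is fixed.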

Domain-based Abbreviation Expansion using Topic Modelling for Data Cleaning authors

Ali Aji

Department of Software Engineering, Faculty of Computer, University of Isfahan, Isfahan, Iran.

Afsaneh Fatemi

Department of Software Engineering, Faculty of Computer, University of Isfahan, Isfahan, Iran.

Mohammad Ali Nematbakhsh

Department of Software Engineering, Faculty of Computer, University of Isfahan, Isfahan, Iran.
