Investigating Hostile Post Detection in Gujarati: A Machine Learning Approach

B. J. Rameshbhai; K. Rana

Article title: Investigating Hostile Post Detection in Gujarati: A Machine Learning Approach
National article ID: JR_IJE-37-7_008
Published in 1403 (Solar Hijri calendar)

Author details:

B. J. Rameshbhai - Department of Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India
K. Rana - Department of Computer Engineering, Sarvajanik College of Engineering and Technology, Gujarat Technological University, Ahmedabad, Gujarat, India

Abstract:

Hostile posts on social media are a serious problem for individuals, governments and organizations. There is a critical need for an automated system that can investigate and identify hostile posts in large-scale data. Gujarati is the sixth most spoken language in India. In this work, we construct a large hostile post dataset in the Gujarati language. The data are collected from Twitter, Instagram and Facebook. Our dataset consists of 151,000 distinct comments, of which 10,000 posts are manually annotated. These posts are labeled as either Hostile or Non-Hostile. We use the dataset in two forms: (i) the original Gujarati text and (ii) English text translated from the Gujarati. We also compare performance with and without pre-processing, where pre-processing removes extra symbols and replaces emojis with their textual descriptions. We conduct experiments with supervised machine learning models (Support Vector Machine, Decision Tree, Random Forest, Gaussian Naive Bayes, Logistic Regression and K-Nearest Neighbor) and with an unsupervised model (k-means clustering). We evaluate these models with Bag-of-Words and TF-IDF feature extraction. We observe that TF-IDF features work well for classification. Among these models, Logistic Regression performs best, with an accuracy of 0.68 and an F1-score of 0.67. The purpose of this research is to create a benchmark dataset and provide baseline results for detecting hostile posts in the Gujarati language.
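
The pre-processing and TF-IDF steps mentioned in the abstract could look roughly like the sketch below. This is not the authors' code: the function names `preprocess` and `tfidf` are hypothetical, the use of Unicode character names as "emoji descriptions" is an assumption, and the smoothed IDF formula is one common variant chosen for illustration. It relies only on the Python standard library.

```python
import math
import re
import unicodedata
from collections import Counter


def preprocess(text: str) -> str:
    """Replace emojis with their Unicode names and drop other symbols.

    Assumption: the Unicode character name stands in for the
    "emoji description" referred to in the abstract.
    """
    out = []
    for ch in text:
        cat = unicodedata.category(ch)
        if cat == "So":  # "Symbol, other", which covers most emoji
            name = unicodedata.name(ch, "")
            out.append(f" {name.lower()} " if name else " ")
        elif cat[0] in "LMN" or ch.isspace():
            # Keep letters, combining marks (Gujarati vowel signs) and digits.
            out.append(ch)
        else:
            # Punctuation, remaining symbols and format characters become spaces.
            out.append(" ")
    return re.sub(r"\s+", " ", "".join(out)).strip()


def tfidf(docs: list[str]) -> list[dict[str, float]]:
    """Compute per-document TF-IDF weights over whitespace tokens.

    Uses relative term frequency and a smoothed IDF,
    idf(t) = ln((1 + n) / (1 + df(t))) + 1, as an illustrative choice.
    """
    tokenized = [d.split() for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each token.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append({
            tok: (c / len(toks)) * (math.log((1 + n) / (1 + df[tok])) + 1)
            for tok, c in counts.items()
        })
    return vectors
```

For example, `preprocess("સરસ!!! 😀")` strips the exclamation marks and expands the emoji into words, and `tfidf([...])` then returns one sparse weight dictionary per post. Such vectors would be the input to a classifier such as Logistic Regression.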

Keywords:

Hostile Text Detection, Machine Learning, Hate Text Detection, Text Classification, Gujarati Text Dataset

Article page and full-text download: https://civilica.com/doc/1965652/