Document Clustering Using Deep Pre-trained Language Model Embeddings for Information Retrieval
Publication year: 1404 (Solar Hijri)
Document type: Conference paper
Language: English
Pages: 16 (PDF format)
National scientific document ID:
TSTACON02_085
Indexing date: 26 Bahman 1404
Abstract:
Document clustering is critical to information retrieval (IR) because it improves user navigation, semantic organization, and exploration of large text collections. Existing clustering techniques, however, often suffer from poor accuracy and semantic inconsistency: many rely on superficial textual representations and misclassify relevant documents as noise. This study aims to develop a clustering pipeline that produces semantically meaningful and structurally coherent groups of documents to support more effective IR. We propose a method that combines SBERT embeddings for deep semantic representation, UMAP for structure-preserving dimensionality reduction, and HDBSCAN for flexible, density-based clustering that does not require the number of clusters to be specified in advance. Experimental evaluations on the 20 Newsgroups dataset show that our best configuration, using the paraphrase-mpnet-base-v2 model, achieves a Silhouette Score of 0.6853, an Adjusted Rand Index (ARI) of 0.7865, and a Normalized Mutual Information (NMI) of 0.8186. These results suggest that embedding-based clustering methods can substantially improve the interpretability and effectiveness of IR systems on real-world text collections.
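The three-stage pipeline described in the abstract could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the `sentence-transformers`, `umap-learn`, and `hdbscan` Python packages, and every hyperparameter (`n_neighbors`, `n_components`, `min_cluster_size`, `random_state`) is a placeholder chosen for illustration, since the abstract does not report the settings used. The function names `cluster_documents` and `adjusted_rand_index` are hypothetical. The ARI helper is a plain-Python version of the standard pair-counting formula, included only to show what the reported metric measures.

```python
from collections import Counter
from math import comb


def cluster_documents(texts, model_name="paraphrase-mpnet-base-v2"):
    """Embed texts with SBERT, reduce with UMAP, cluster with HDBSCAN.

    Third-party imports are kept local so the module loads without them.
    All hyperparameters are illustrative assumptions, not the paper's values.
    """
    from sentence_transformers import SentenceTransformer
    import umap
    import hdbscan

    # 1) Dense semantic embeddings, one vector per document.
    embeddings = SentenceTransformer(model_name).encode(
        texts, show_progress_bar=False
    )
    # 2) Low-dimensional projection before density estimation.
    reducer = umap.UMAP(
        n_neighbors=15, n_components=5, metric="cosine", random_state=42
    )
    reduced = reducer.fit_transform(embeddings)
    # 3) Density-based clustering; the label -1 marks points treated as noise.
    return hdbscan.HDBSCAN(min_cluster_size=15).fit_predict(reduced)


def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand Index from pair counts (noise label -1 counts as a group)."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have the same length")
    n = len(true_labels)
    if n < 2:
        return 1.0  # no pairs to disagree on
    # Pairs of documents that share a group in both labelings.
    both = sum(comb(c, 2) for c in Counter(zip(true_labels, pred_labels)).values())
    # Pairs that share a group in each labeling separately.
    in_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    in_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = in_true * in_pred / comb(n, 2)  # agreement expected by chance
    maximum = 0.5 * (in_true + in_pred)
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings are a single group
    return (both - expected) / (maximum - expected)
```

Given a list of document strings `texts` and their reference categories `categories`, one might call `labels = cluster_documents(texts)` and then `adjusted_rand_index(categories, list(labels))`. An ARI of 1 means the clustering matches the reference grouping up to relabeling, while values near 0 indicate agreement no better than chance.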
Keywords:
Authors
Mahdi Mohammadiha
Department of Computer Engineering, Faculty of Engineering, International University of Imam Khomeini, Qazvin, Iran
Mohammad Hassan Sadreddini
Department of Computer Engineering, Faculty of Engineering, International University of Imam Khomeini, Qazvin, Iran
Morteza Mohammadi Zanjireh
Department of Computer Engineering, Faculty of Engineering, International University of Imam Khomeini, Qazvin, Iran