Improving speech emotion recognition using audio transformer and features fusion

Publish Year: 1402 (Solar Hijri, 2023)
Document type: Conference paper
Language: English

This paper is 8 pages long and available for download in PDF format.




National scientific document ID: ICAISV01_004

Indexing date: 6 Shahrivar 1402 (28 August 2023)

Abstract:

The purpose of speech emotion recognition is to recognize different speaker emotions by extracting and classifying salient features from a pre-processed speech signal. In this paper, a baseline method based on the fusion of features extracted from pre-trained AlexNet, BiLSTM, and Wav2vec2.0 models is improved for speech emotion recognition. To this end, as in the baseline model, spectrogram, MFCC, and raw-signal features are used, respectively. To improve the performance of the baseline model, on the one hand, the first and second derivatives of the MFCC are extracted in addition to the MFCC itself. On the other hand, for feature extraction from the concatenated vector, the Audio Transformer with Patchout (PaSST) replaces the BiLSTM of the baseline model. Then, an attention unit is used to exploit the effective information extracted from the MFCC and the spectrogram, and also to weight the Wav2vec2.0 output. Finally, the features extracted from AlexNet and PaSST, together with the weighted Wav2vec2.0 output, are fused and fed to a Softmax classifier. Experiments show that the proposed algorithm reaches a weighted accuracy of 61.56% on the RAVDESS dataset.
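The abstract mentions appending the first and second derivatives of the MFCC to the MFCC features. The paper does not give its exact formula, so as a minimal sketch, the widely used regression-based delta computation (the same form used by HTK and librosa) over a window of ±n frames could look like this; the frame and coefficient counts below are illustrative, not taken from the paper:

```python
import numpy as np

def delta(feat, n=2):
    """Delta (derivative) features: regression slope over a window of
    +/- n frames, with edge padding at the boundaries."""
    # feat: (num_frames, num_coeffs) feature matrix, e.g. MFCCs
    padded = np.pad(feat, ((n, n), (0, 0)), mode="edge")
    denom = 2 * sum(i * i for i in range(1, n + 1))
    out = np.zeros_like(feat, dtype=float)
    for t in range(feat.shape[0]):
        # weighted differences of frames i steps ahead and behind
        out[t] = sum(i * (padded[t + n + i] - padded[t + n - i])
                     for i in range(1, n + 1)) / denom
    return out

mfcc = np.random.randn(100, 13)                    # hypothetical MFCC matrix
d1 = delta(mfcc)                                   # first derivative (delta)
d2 = delta(d1)                                     # second derivative (delta-delta)
stacked = np.concatenate([mfcc, d1, d2], axis=1)   # (100, 39) combined features
```

Stacking the static coefficients with their deltas and delta-deltas triples the feature dimension and gives the downstream model explicit short-term dynamics, which is the stated motivation for adding the derivatives.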

Authors

Fateme Mehrpouyan

Faculty of Electrical and Computer Engineering, Babol Noshirvani University of Technology, Mazandaran, Iran

Mehdi Ezoji

Faculty of Electrical and Computer Engineering, Babol Noshirvani University of Technology, Mazandaran, Iran