
A Comparison of CQT Spectrogram with STFT-based Acoustic Features in Deep Learning-based Synthetic Speech Detection

Publish Year: 1402 (Solar Hijri)
Type: Journal paper
Language: English
View: 227

This paper has 12 pages and is available for download in PDF format.



Document National Code: JR_JADM-11-1_010

Index date: 9 April 2023

Abstract

Automatic Speaker Verification (ASV) systems have proven to be vulnerable to various types of presentation attacks, among which Logical Access attacks are manufactured using voice conversion and text-to-speech methods. In recent years, a large body of work has concentrated on synthetic speech detection, and with the arrival of deep learning-based methods and their success across many fields of computer science, they have become a prevailing tool for this task as well. Most deep neural network-based techniques for synthetic speech detection have employed acoustic features based on the Short-Term Fourier Transform (STFT), which are extracted from the raw audio signal. Recently, however, it has been shown that the Constant Q Transform (CQT) spectrogram can be a beneficial asset both for improving performance and for reducing the processing power and time of deep learning-based synthetic speech detection. In this work, we compare the CQT spectrogram with some of the most widely used STFT-based acoustic features. As secondary objectives, we improve the model's performance as much as possible using methods such as self-attention and one-class learning; short-duration synthetic speech detection is also among these secondary goals. Finally, we find that the CQT spectrogram-based model not only outperforms the STFT-based acoustic feature extraction methods but also reduces the processing time and resources needed to distinguish genuine speech from fake. The CQT spectrogram-based model also places well among the best works on the LA subset of the ASVspoof 2019 dataset, especially in terms of Equal Error Rate.
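To illustrate the core difference the abstract relies on, the sketch below computes a plain STFT magnitude spectrogram (linearly spaced frequency bins) and the geometrically spaced center frequencies that characterize a CQT front end. This is a minimal NumPy-only illustration, not the paper's implementation; the FFT size, hop length, `fmin`, and bins-per-octave values are illustrative assumptions, not the authors' configuration.

```python
import numpy as np

def stft_spectrogram(x, n_fft=512, hop=128):
    # Frame the signal, apply a Hann window, and take the magnitude of the
    # real FFT of each frame. STFT bins are LINEARLY spaced in frequency.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T  # shape: (n_fft//2 + 1, n_frames)

def cqt_center_frequencies(n_bins=84, bins_per_octave=12, fmin=32.70):
    # CQT bins are GEOMETRICALLY (log-) spaced: every octave gets the same
    # number of bins, so the ratio of center frequency to bandwidth
    # (the Q factor) stays constant across bins.
    return fmin * 2.0 ** (np.arange(n_bins) / bins_per_octave)

sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440.0 * t)   # 1 s of a 440 Hz tone as a toy signal

S = stft_spectrogram(x)
freqs = cqt_center_frequencies()

print(S.shape)                                        # (257, 122)
print(np.allclose(np.diff(np.log2(freqs)), 1 / 12))   # True: log-spaced bins
```

The log spacing is why CQT spectrograms can cover the same frequency range with far fewer bins than an STFT of comparable low-frequency resolution, which is consistent with the processing-time reduction the abstract reports.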

Keywords:

Authors

P. Abdzadeh

Faculty of New Sciences and Technologies, University of Tehran, Tehran, Iran.

H. Veisi

Faculty of New Sciences and Technologies, University of Tehran, Tehran, Iran.
