Improvement of Text-Independent Speaker Verification Using Gender-like Feature

Pornprom Kiawjak; Somkiat Wangsiripitak; Kitsuchart Pasupa

doi:10.1109/kst51265.2021.9415832

Date

2021-1-21

Authors

Pornprom Kiawjak

Somkiat Wangsiripitak

Kitsuchart Pasupa

Abstract

Text-independent speaker verification is the task of verifying a speaker's identity from the characteristics of their voice. We propose a combined deep Convolutional Neural Network (CNN) consisting of (i) a first CNN trained for gender classification, which is then used to create a gender-like embedding, and (ii) a second CNN trained to classify each speaker with one additional input: the gender-like feature (embedding) from the first CNN. The classification layer of the second CNN is removed so that the remaining combined deep CNN can be used for one-shot learning and verification of unseen speakers. On average, our proposed CNN achieved an Equal Error Rate (EER) 0.40% lower than that of VGGVox (ResNet-50). Additionally, we investigated scenarios in which the gender is known; the evaluation was performed only on utterance pairs that comply with each scenario. When only the gender of the claimed identity is known, the EER is 0.52% lower than that of VGGVox (ResNet-50), averaged over the two genders. In the more specific situation where the gender of the person making the claim is also known, two dedicated networks were retrained, one for female and one for male speakers, in addition to our first network, which was trained for both genders. Interestingly, compared to the first network, the female network achieved a lower EER on female-female verification, while the male network performed worse. Nevertheless, our two dedicated networks outperformed VGGVox (ResNet-50) by 0.88% in EER.
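For readers unfamiliar with the metric, the following is a minimal sketch of how verification trials might be scored and how an EER could be estimated from them. It is an illustration only, not the authors' evaluation code: the use of cosine similarity between embeddings, the function names, and the toy scores are all assumptions, since the abstract does not state the scoring function.

```python
import math


def cosine_similarity(a, b):
    """Similarity between two embeddings (assumed scoring function)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def equal_error_rate(scores, labels):
    """Approximate EER from trial scores.

    scores: similarity score per utterance pair (higher = more similar).
    labels: 1 if the pair is from the same speaker, 0 otherwise.
    A pair is accepted when its score is >= the threshold. The EER is
    taken at the threshold where the false acceptance rate (FAR) and
    the false rejection rate (FRR) are closest, as their mean.
    """
    n_target = sum(labels)
    n_impostor = len(labels) - n_target
    best_gap, eer = float("inf"), None
    for threshold in sorted(set(scores)):
        false_accepts = sum(
            1 for s, l in zip(scores, labels) if l == 0 and s >= threshold
        )
        false_rejects = sum(
            1 for s, l in zip(scores, labels) if l == 1 and s < threshold
        )
        far = false_accepts / n_impostor
        frr = false_rejects / n_target
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy trial list (made-up numbers, purely illustrative).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 1, 0, 0]
toy_eer = equal_error_rate(toy_scores, toy_labels)
```

In this sketch each trial score would come from comparing the embedding of an enrollment utterance with that of a test utterance, which is how a network with its classification layer removed can be applied to speakers not seen during training.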