Dropout Regularization for Self-Supervised Learning of Transformer Encoder Speech Representation (3 minutes introduction)

Jian Luo (Ping An Technology, China), Jianzong Wang (Ping An Technology, China), Ning Cheng (Ping An Technology, China), Jing Xiao (Ping An Technology, China)

Predicting altered acoustic frames is an effective approach to self-supervised learning of speech representations. However, it is challenging to prevent the pretrained model from overfitting. In this paper, we propose introducing two dropout regularization methods into the pretraining of a transformer encoder: (1) attention dropout and (2) layer dropout. Both methods encourage the model to use global speech information and avoid simply copying local spectrum features when reconstructing the masked frames. We evaluate the proposed methods on phoneme classification and speaker recognition tasks. The experiments demonstrate that our dropout approaches achieve competitive results and improve classification accuracy on downstream tasks.
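The abstract does not give the exact formulation of the two dropout methods, so the following is only a minimal sketch of how they are commonly realized: inverted dropout applied to softmax attention weights, and layer dropout as random skipping of whole residual layers during training. All function names, the toy layers, and the dropout rates are hypothetical and chosen for illustration; they are not taken from the paper.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_dropout(weights, p, rng, training=True):
    """Zero each attention weight with probability p and rescale the
    survivors by 1 / (1 - p) (standard inverted dropout; an assumption)."""
    if not training or p == 0.0:
        return list(weights)
    keep = 1.0 - p
    return [w / keep if rng.random() < keep else 0.0 for w in weights]


def encode_with_layer_dropout(x, layers, p, rng, training=True):
    """Apply residual layers in order, skipping each one with
    probability p during training (LayerDrop-style; an assumption)."""
    for layer in layers:
        if training and rng.random() < p:
            continue  # skipped layer: the residual path passes x through unchanged
        x = [xi + di for xi, di in zip(x, layer(x))]
    return x


rng = random.Random(0)

# Attention weights of one query over four frames (toy values).
weights = softmax([2.0, 1.0, 0.5, 0.1])
dropped = attention_dropout(weights, p=0.5, rng=rng)

# Four toy "layers", each adding 10% of its input through the residual path.
toy_layers = [lambda v: [0.1 * vi for vi in v] for _ in range(4)]
output = encode_with_layer_dropout([1.0, 2.0], toy_layers, p=0.5, rng=rng)
```

With `training=False`, both functions are deterministic: attention weights are returned unchanged and every layer is applied, which is the usual behavior at evaluation time.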

InterSpeech 2021

Related Recordings

Speech Decomposition based on a Hybrid Speech Model and Optimal Segmentation
(3 minutes introduction)

Noise robust pitch stylization using minimum mean absolute error criterion
(3 minutes introduction)
