Speech Transcript - BERTphone: Phonetically-aware Encoder Representations for Utterance-level Speaker and Language Recognition

0:00:16	hello everyone, my name is Shaoshi Ling
0:00:19	amount i sent his ml only database i
0:00:23	am presenting our paper, BERTphone
0:00:26	phonetically-aware encoder representations for utterance-level speaker and language
0:00:31	recognition
0:00:33	this is joint work with Julian, Yuzong, and Katrin
0:00:39	so let's look at the motivation first
0:00:44	will be coming d
0:00:46	transformer-based contextual representations like BERT or GPT have shown great success
0:00:53	in downstream
0:00:55	natural language understanding tasks
0:00:58	so similarly, in speech, we are thinking of
0:01:01	acoustic representations
0:01:03	that can help downstream speech tasks
0:01:05	in the same way
0:01:07	that representations do
0:01:09	in NLP
0:01:12	and that's because
0:01:14	downstream speech tasks
0:01:16	like speaker recognition or language recognition have very limited training data
0:01:23	but there are a lot of large
0:01:26	speech and ASR corpora, or even unlabeled speech corpora
0:01:31	so learning acoustic representations can utilize those large corpora to help downstream
0:01:38	speech tasks
0:01:41	so the most important question here is what information do we need
0:01:47	for these downstream speech tasks
0:01:50	the first thing we think of is definitely phonetic information
0:01:54	in fact
0:01:55	phonetic information for speaker recognition and language recognition has already been explored
0:02:01	for a very long time
0:02:02	so past works have used, for example, the method of using an ASR acoustic model
0:02:08	as the frame-level feature extractor for speaker and language recognition
0:02:15	this is often done with bottleneck features
0:02:19	or more generally intermediate frame-wise features from a DNN trained on large speech corpora
0:02:26	however
0:02:27	since speaker recognition can require higher-level acoustic information like speaker traits
0:02:33	that is less relevant for ASR
0:02:36	these features may be insufficient
0:02:39	for speaker recognition in some cases
0:02:43	so to overcome this, one popular method
0:02:47	is to multi-task the ASR system with speaker recognition
0:02:56	there is also a new trend
0:02:59	in self-supervised acoustic representations
0:03:05	which are able to improve ASR by pre-training on large amounts of monolingual speech
0:03:11	so those self-supervised models can capture some global acoustic structure
0:03:19	and can help ASR
0:03:21	and also potentially other downstream speech tasks
0:03:25	so some examples
0:03:27	these models involve a self-supervised objective
0:03:31	that is either contrastive
0:03:33	or autoregressive or
0:03:36	masked reconstruction
0:03:38	in fact some of those works have already shown success
0:03:44	by applying self-supervised acoustic representations to speaker recognition
0:03:52	so we propose BERTphone
0:03:54	which includes both phonetic information
0:03:57	and
0:03:58	self-supervised acoustic representation
0:04:02	which I just talked about in the previous slides
0:04:05	so here is an overview of our model: this is the input
0:04:11	and we do feature extraction
0:04:15	and we mask
0:04:16	spans of frames
0:04:18	then we pass this masked frame sequence into a transformer encoder
0:04:23	to get the BERTphone representation
0:04:27	so then we do multi-task learning
0:04:29	on these BERTphone vectors
0:04:31	so on the left side this is the ASR task, so we use the CTC loss
0:04:35	here
0:04:37	and on the right side this is the self-supervised
0:04:41	acoustic representation task
0:04:43	so we use an L1 loss here to reconstruct the masked frames back to the original frames
0:04:51	so for the training criteria
0:04:54	in the reconstruction task
0:04:57	we just use the L1 loss
0:04:59	to reconstruct the masked frames
0:05:01	to the original frames
0:05:03	so it's basically the same as a denoising autoencoder
0:05:08	specifically
0:05:10	what we do is randomly mask spans of ten frames at a random five
0:05:16	percent of the positions
0:05:18	and replace them with zero vectors
0:05:22	so in this way we mask fifteen percent of tokens
0:05:27	which is similar to BERT's
0:05:29	pre-training scheme
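
Editor's sketch of the span masking described above, in NumPy. This is an illustration under stated assumptions, not the authors' code: the function name `mask_spans`, the per-frame Bernoulli sampling of start positions, and the default values (span length and start rate as heard in the talk) are all assumptions. The fraction of frames that ends up masked depends on the span length, the start rate, and how much the spans overlap.

```python
import numpy as np

def mask_spans(frames, span_len=10, start_prob=0.05, seed=0):
    """Zero out spans of `span_len` frames beginning at random start positions.

    frames: array of shape (T, D), one feature vector per frame.
    Returns (masked_copy, mask), where mask is a boolean array of shape (T,).
    Hypothetical helper; defaults follow the numbers as heard in the talk.
    """
    rng = np.random.default_rng(seed)
    num_frames = frames.shape[0]
    # Each position independently becomes a span start with probability start_prob.
    starts = np.flatnonzero(rng.random(num_frames) < start_prob)
    mask = np.zeros(num_frames, dtype=bool)
    for s in starts:
        mask[s:s + span_len] = True  # slicing clips spans at the utterance end
    masked = frames.copy()
    masked[mask] = 0.0  # masked frames are replaced with zero vectors
    return masked, mask

# Example: 500 frames of 40-dimensional features (random stand-in data).
frames = np.random.default_rng(1).normal(size=(500, 40))
masked, mask = mask_spans(frames)
```

The encoder would see `masked`, while `frames` stays available as the reconstruction target.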
0:05:32	and for the ASR task we just use the standard CTC
0:05:36	training criterion
0:05:38	and then we combine both of the losses together
0:05:44	so
0:05:44	lambda here is the hyperparameter and this here is the sequence length
0:05:50	one thing to note is that we rescale the reconstruction loss
0:05:58	to be proportional to the CTC loss
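
Editor's sketch of how the two losses might be combined. The talk states that lambda interpolates between a CTC-only model (lambda = 0) and a reconstruction-only model (lambda = 1), and that the reconstruction loss is rescaled to be proportional to the CTC loss. The convex-combination form below and the `recon_scale` argument are assumptions standing in for details the talk does not spell out; the CTC loss is taken as an already computed number.

```python
import numpy as np

def l1_reconstruction_loss(reconstructed, original):
    """Mean absolute error between reconstructed and original frames."""
    return float(np.mean(np.abs(reconstructed - original)))

def combined_loss(recon_loss, ctc_loss, lam, recon_scale=1.0):
    """Interpolate between the reconstruction and CTC losses.

    lam = 0 gives the CTC-only objective, lam = 1 the reconstruction-only one.
    recon_scale is a hypothetical stand-in for the rescaling mentioned in the talk.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * recon_scale * recon_loss + (1.0 - lam) * ctc_loss

# Example with made-up loss values, purely for illustration.
recon = l1_reconstruction_loss(np.zeros((4, 3)), np.ones((4, 3)))
total = combined_loss(recon, ctc_loss=5.0, lam=0.2)
```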
0:06:04	so after we finish pre-training the BERTphone model
0:06:09	it can be fixed
0:06:11	and then used to extract features from the data
0:06:16	so as shown here
0:06:18	we use the BERTphone model to extract
0:06:22	the BERTphone representation, which is this one here
0:06:25	and then we pass this, sorry, we pass this BERTphone representation into
0:06:33	the task-specific model
0:06:35	so for the task-specific
0:06:37	model
0:06:39	we just use x-vector plus self-attention pooling as the architecture
0:06:44	and for both language recognition and speaker recognition we use a cross-entropy
0:06:50	loss
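
Editor's sketch of a self-attention pooling step, the part of the task-specific model that turns frame-level vectors into one utterance-level vector. The talk does not give the exact formulation, so this is a generic single-head version: the scoring vector `w` is a hypothetical learned parameter, and the real model may use a different scoring function.

```python
import numpy as np

def self_attention_pooling(frames, w):
    """Pool frame-level vectors into a single utterance-level vector.

    frames: array of shape (T, D); w: scoring vector of shape (D,).
    Returns (pooled, weights) with pooled of shape (D,) and weights of shape (T,).
    """
    scores = frames @ w                      # one scalar score per frame
    scores = scores - scores.max()           # stabilise the softmax numerically
    weights = np.exp(scores) / np.exp(scores).sum()
    pooled = weights @ frames                # attention-weighted average of frames
    return pooled, weights

# Example: 200 frames of 768-dimensional encoder outputs (random stand-in data).
rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 768))
pooled, weights = self_attention_pooling(frames, rng.normal(size=768))
```

With an all-zero `w` the weights become uniform and the result reduces to plain mean pooling.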
0:06:53	so
0:06:54	at test time, for closed-set language recognition
0:06:58	we use the softmax layer directly, since the language recognition task has fixed language
0:07:04	categories
0:07:06	for speaker recognition we use PLDA to compare pairs of speaker embeddings
0:07:11	which we extract here
0:07:15	okay, so next let's talk about the experimental setup
0:07:19	so we use forty-dimensional mean-normalized MFCCs as the
0:07:25	inputs
0:07:26	our BERTphone parameters, learning schedule, and training details
0:07:30	are consistent with the BERT base model
0:07:34	which is twelve self-attention layers
0:07:37	with seven hundred sixty-eight hidden dimensions
0:07:40	and the general adaptation tasks
0:07:43	we take variable-length speech utterances and group them into batches
0:07:49	spread over multiple GPUs
0:07:52	our or model is over first recent batches
0:07:55	to a maximum learning rate of one zero point zero one
0:07:59	averaging the model over the last thirty epochs
0:08:02	for training data
0:08:06	for BERTphone pre-training
0:08:07	we train two BERTphone models on two different datasets
0:08:11	the first one is Fisher English, which is sampled at eight kilohertz
0:08:18	it's a telephone conversation dataset
0:08:21	and then we train another BERTphone model on
0:08:26	TED-LIUM
0:08:27	which is sampled at sixteen kilo-
0:08:30	hertz
0:08:31	which is English TED talks
0:08:34	for speaker recognition
0:08:36	we use the Fisher BERTphone model for the Fisher speaker recognition task
0:08:41	and we use the TED-LIUM BERTphone model for the VoxCeleb
0:08:45	speaker recognition task
0:08:47	so one thing to note is that
0:08:49	even though TED-LIUM and VoxCeleb are both broadcast speech
0:08:54	they don't have any
0:08:56	data overlap
0:08:58	between them
0:09:00	so VoxCeleb can be considered an
0:09:02	out-of-domain downstream task
0:09:06	for the TED-LIUM BERTphone model
0:09:09	for language recognition
0:09:10	we use the closed-set two thousand seven NIST
0:09:13	LRE evaluation
0:09:19	so here are the results for the language recognition experiments
0:09:24	so as you can see here
0:09:27	we have huge improvements using BERTphone
0:09:29	features instead of MFCCs as input
0:09:34	we achieve the state of the art in the three-second and ten-second conditions
0:09:39	though we underperform
0:09:41	previous systems at thirty seconds
0:09:43	but we estimate the past are all into an investor
0:09:49	so next are the speaker recognition
0:09:51	experiments
0:09:54	so on both
0:09:59	datasets
0:09:59	we first show that
0:10:01	using BERTphone is much better than using MFCCs as input
0:10:06	and in the Fisher
0:10:09	speaker recognition
0:10:11	case
0:10:13	performed you includes over ship feature the multi tasking approach where is a phone adding
0:10:18	extra thirty were jointly reading is star in speaker condition which is this like here
0:10:26	and in the last case, the out-of-domain VoxCeleb speaker recognition task
0:10:33	in
0:10:34	BERTphone gives around an eighteen percent relative reduction in equal error rate
0:10:41	compared with the model trained directly on MFCCs
0:10:45	our model also improves over
0:10:48	recent work
0:10:50	that uses the in-domain training set
0:10:51	with multi-tasking and adversarial training, which is this line
0:11:00	we did some ablation studies
0:11:03	the first one is
0:11:06	we try to interpolate the
0:11:09	loss scale
0:11:10	lambda, which is the loss scale between the reconstruction and CTC losses
0:11:16	so as shown in this table
0:11:19	we interpolate between lambda equals zero, which is the CTC-only model, and lambda
0:11:24	equals one, which is the reconstruction-only model
0:11:28	it is you recognition in speaker recognition performance is you command was slightly degree
0:11:33	one performance on training to reconstruct
0:11:37	for language recognition, the CTC-only model
0:11:41	does best
0:11:42	adding reconstruction results in degradation
0:11:45	possibly
0:11:46	as it degrades the quality of the phonetic information encoded
0:11:51	for speaker recognition, the model does best
0:11:54	when some CTC loss is introduced
0:11:57	in line with previous work on the relevance of phonetic information to
0:12:02	speaker recognition
0:12:04	as expected
0:12:06	using the CTC-only model actively degrades speaker recognition performance
0:12:12	in prayer and critics
0:12:15	we might just an is this clusters and take along the i for example and
0:12:21	because
0:12:23	zero point you
0:12:26	so we did another ablation study to interpret the
0:12:32	information stored in the different BERTphone layers
0:12:38	so basically what we do is retrain the model to use a global softmax
0:12:43	over learned weights
0:12:45	to pool representations over the layers
0:12:49	so in this way we know
0:12:52	one
0:12:54	which layers the model is focused on
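
Editor's sketch of pooling encoder layers with a global softmax over per-layer weights, as described in this part of the talk. The function name `pool_layers` and the shape conventions are assumptions; in the actual study the per-layer logits would be learned together with the task-specific model, and the resulting weights show which layers each task relies on.

```python
import numpy as np

def pool_layers(layer_outputs, layer_logits):
    """Combine per-layer representations with softmax-normalised scalar weights.

    layer_outputs: array of shape (L, T, D), one (T, D) matrix per encoder layer.
    layer_logits: array of shape (L,), one scalar per layer (learned in practice).
    Returns (pooled, weights) with pooled of shape (T, D) and weights of shape (L,).
    """
    logits = layer_logits - layer_logits.max()         # numerical stability
    weights = np.exp(logits) / np.exp(logits).sum()    # global softmax over layers
    pooled = np.tensordot(weights, layer_outputs, axes=1)  # weighted sum over L
    return pooled, weights

# Example: 12 layers, 5 frames, 8 dimensions (random stand-in data).
layers = np.random.default_rng(0).normal(size=(12, 5, 8))
pooled, weights = pool_layers(layers, np.zeros(12))
```

Inspecting `weights` after training is what indicates whether a task draws mostly on later or on intermediate layers.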
0:12:58	so in this work we can see that
0:13:01	gaining this face but
0:13:03	language recognition uses representations primarily from the later layers
0:13:08	this is consistent with language recognition primarily using phonetic information
0:13:13	in contrast
0:13:14	speaker recognition uses more intermediate layers
0:13:18	this is just a house data some focus the end of from that information being
0:13:23	the average
0:13:25	you know
0:13:26	this matches our intuition that language recognition uses higher-level features
0:13:32	for example phonetic sequence information
0:13:35	while speaker recognition uses primarily lower-level features
0:13:40	qualities like pitch
0:13:41	and vocal range
0:13:43	plus
0:13:43	possibly some phonetic
0:13:46	preferences
0:13:49	so in sum
0:13:50	we introduce BERTphone
0:13:52	which is a phonetically-aware
0:13:54	acoustic representation
0:13:57	and using BERTphone with a small task-specific model we can improve performance on multiple speech
0:14:04	tasks
0:14:05	namely language and speaker recognition
0:14:08	we achieve the state of the art
0:14:11	a Cavg of six point one six
0:14:14	on the LRE07 three-second
0:14:17	closed-set language recognition task
0:14:20	and
0:14:21	an eighteen percent relative reduction in speaker EER on two datasets
0:14:27	future work includes exploring additional gains from fine-tuning BERTphone
0:14:32	and exploring more advanced self-supervised acoustic representation methods
0:14:41	thank you, and thank you for your attention

BERTphone: Phonetically-aware Encoder Representations for Utterance-level Speaker and Language Recognition

Speaker Recognition 1

Shaoshi Ling, Julian Salazar, Yuzong Liu, Katrin Kirchhoff