Speech Transcript - Unsupervised Regularization of the Embedding Extractor for Robust Language Identification

0:00:15	hello everyone, and thank you for attending this presentation
0:00:18	I am Raphaël Duroselle, and I am presenting work done with Denis Jouvet and
0:00:22	Irina Illina
0:00:24	about unsupervised domain adaptation of a language identification system, with the goal of
0:00:30	being robust to the transmission channel
0:00:33	in this work we address the problem of language identification for a transmission channel
0:00:39	which has not been observed during training of the system, and we use
0:00:43	unlabeled data from this target transmission channel
0:00:47	this problem is called unsupervised domain adaptation
0:00:51	we propose to add a regularization loss function to the classification loss function of the embedding
0:00:57	extractor during its training
0:00:59	in this presentation we first define the task of unsupervised domain adaptation for language
0:01:06	identification
0:01:08	then we describe the proposed method of regularization of the embedding extractor
0:01:14	and finally we present our experiments and results
0:01:18	so first, the task of unsupervised domain adaptation for language identification
0:01:24	we use a standard language identification system based on x-vectors
0:01:29	this system is constituted of three blocks: first a bottleneck feature extractor, whose aim is
0:01:35	to extract frame-level features
0:01:38	it is a stacked neural network which has been trained to predict triphones
0:01:44	and
0:01:46	frame-level embeddings are extracted from it and they are used as input of
0:01:51	the x-vector extractor
0:01:53	the x-vector extractor is a neural network discriminatively trained
0:01:58	to predict
0:01:58	languages
0:02:01	we extract a segment-level embedding from this neural network and finally
0:02:07	a language classifier, constituted of dimension reduction and a support vector machine,
0:02:14	computes a score for each target language
0:02:19	we train such a system on the corpus that we use in this
0:02:25	study
0:02:26	it contains five languages: Arabic, English, Farsi, Pashto and Urdu
0:02:32	we have recordings for these five languages on nine transmission channels:
0:02:37	first the telephone channel
0:02:39	and eight radio channels. so now let us listen to a few files: first
0:02:45	a telephone recording
0:02:53	now and feature file
0:03:01	speech if
0:03:07	and you which
0:03:14	as you may have heard, the files on the radio channels present very difficult
0:03:20	noise and distortion characteristics, so this is a real challenge for domain adaptation
0:03:28	during this
0:03:29	work we use the same languages for both training and testing of the
0:03:35	system
0:03:36	our first work
0:03:38	was to investigate the domain mismatch issue with the corpus, so we trained a language
0:03:44	identification system for each of the nine transmission channels
0:03:49	and it corresponds to the rows of this table, and we tested these
0:03:54	systems on the nine transmission channels, the columns of this table
0:03:59	so first
0:04:01	on the diagonal we have the performance in the matched conditions and we achieve an equal
0:04:06	error rate ranging between [unclear] and fifteen percent, a nearly acceptable performance
0:04:15	outside of the diagonal, when we test on a channel which has not
0:04:20	been observed during training, we achieve a very poor performance
0:04:24	so
0:04:25	it means that domain mismatch is a real issue in this setting
0:04:30	conversely
0:04:32	on the last line we train a system with the data of the nine transmission
0:04:37	channels and we achieve good performance on all channels, meaning that the
0:04:43	language identification system has the capacity
0:04:49	to work well on all channels, provided that they are observed during training
0:04:56	the goal of our work is to improve performance outside of the diagonal of this
0:05:02	table
0:05:02	without using labeled data from the target
0:05:07	transmission channels
0:05:09	so this problem is called unsupervised domain adaptation
0:05:13	where domains correspond to transmission channels
0:05:16	so
0:05:17	we have a source domain called S with labeled data, where x is the audio
0:05:23	recording and y is the corresponding language label, and we have unlabeled data
0:05:29	from a target domain T
0:05:33	our goal
0:05:33	is to achieve good language identification performance on the target domain
0:05:40	so now we describe
0:05:43	our method for unsupervised domain adaptation, which is regularization of the embedding
0:05:48	extractor
0:05:50	a lot of unsupervised domain adaptation methods are based on a very simple idea:
0:05:57	making the distributions of representations of both domains similar
0:06:03	and this can be done by using only unlabeled data, by aligning the distributions of representations
0:06:10	then with these similar representations you can train a classifier with labeled data from
0:06:17	the source domain, and if the representation is invariant between domains
0:06:22	this classifier will also achieve good performance on the target domain. so this
0:06:26	is a diagram to understand this idea: if you have only labeled data from one
0:06:32	domain, you are able to train a classifier, but it does not perform
0:06:36	well on an unseen target domain
0:06:39	consequently we use unlabeled data from both domains to learn
0:06:44	a space of representations where representations of both domains have the same distribution. consequently
0:06:52	if a classifier is trained on the source domain, it will also work well
0:06:57	on the target domain
0:06:59	so now the question is
0:07:03	where to enforce invariance of the representation within the language identification system
0:07:09	and
0:07:11	our idea is to apply it to the x-vector, which seems natural since it is
0:07:16	a representation with language information, directly extracted from a neural network trained to predict languages
0:07:26	so our method to
0:07:29	create a domain-invariant x-vector is to add a domain adaptation regularization loss function
0:07:35	to the loss function of the embedding extractor. so classically the embedding extractor
0:07:40	is trained with a classification loss, cross-entropy,
0:07:43	computed on the source samples
0:07:46	this is a loss function
0:07:48	that takes labeled data, x and y, from the source
0:07:53	domain
0:07:55	we add to this loss function a regularization term whose goal
0:08:01	is to make the x-vector invariant; it is computed between the distributions of
0:08:06	x-vectors of both domains. here the parameter lambda controls
0:08:10	the compromise between invariance of the representation between domains
0:08:15	and classification performance
0:08:17	on the source domain
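
The regularized objective described here can be written compactly. The notation is an assumption, since the talk does not spell out symbols: $\theta$ denotes the parameters of the embedding extractor, $(X_S, Y_S)$ the labeled source samples, $X_T$ the unlabeled target samples, and $\lambda$ the regularization weight.

```latex
\mathcal{L}(\theta)
  = \mathcal{L}_{\mathrm{CE}}(X_S, Y_S; \theta)
  + \lambda \, \mathcal{L}_{\mathrm{reg}}(X_S, X_T; \theta)
```

With $\lambda = 0$ this reduces to ordinary supervised training on the source domain; larger values trade source classification accuracy for invariance between domains.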
0:08:19	as regularization loss we decided to use the maximum mean discrepancy
0:08:25	so the maximum mean discrepancy
0:08:27	is a divergence function that corresponds
0:08:30	to the supremum
0:08:32	of the difference between the averages of a function
0:08:36	over both domains
0:08:38	where the supremum is taken over a space of functions H
0:08:42	if H is the unit ball of a reproducing kernel Hilbert
0:08:48	space
0:08:49	the maximum mean discrepancy
0:08:52	can be expressed with expectations of
0:08:55	kernel values of samples from both domains
0:08:58	and it can be estimated with finite samples
0:09:03	so with a mini-batch during training of the system
0:09:06	we do exactly that during training: for each mini-batch we compute the maximum
0:09:13	mean discrepancy on this mini-batch and we add it to the classification loss function
0:09:19	in this work we use a Gaussian kernel to define the space of
0:09:24	functions
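
A minimal NumPy sketch of the mini-batch estimate described here. It assumes a biased estimator of the squared maximum mean discrepancy and a single fixed Gaussian kernel bandwidth `sigma`; the talk states neither choice, and the function names `gaussian_kernel`, `mmd2` and `regularized_loss` are hypothetical. In actual training this term would be written in an automatic differentiation framework so that it can be back-propagated through the embedding extractor.

```python
import numpy as np


def gaussian_kernel(a, b, sigma=1.0):
    """Kernel matrix with entries exp(-||a_i - b_j||^2 / (2 * sigma^2))."""
    sq_dists = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Guard against tiny negative values caused by floating-point rounding.
    sq_dists = np.maximum(sq_dists, 0.0)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))


def mmd2(source, target, sigma=1.0):
    """Biased estimate of the squared MMD between two batches of embeddings.

    source: array of shape (n_source, dim), embeddings from the source domain.
    target: array of shape (n_target, dim), embeddings from the target domain.
    """
    k_ss = gaussian_kernel(source, source, sigma)
    k_tt = gaussian_kernel(target, target, sigma)
    k_st = gaussian_kernel(source, target, sigma)
    return float(k_ss.mean() + k_tt.mean() - 2.0 * k_st.mean())


def regularized_loss(cross_entropy, source, target, lam, sigma=1.0):
    """Classification loss on the source batch plus lam times the MMD term."""
    return cross_entropy + lam * mmd2(source, target, sigma)
```

Only the embeddings enter the regularization term, so the target half of the mini-batch needs no language labels.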
0:09:28	we compare this method of regularization of the embedding extractor to a feature-based
0:09:33	domain adaptation method called correlation alignment, or CORAL
0:09:39	the idea of a feature-based domain adaptation method is to transform representations of the
0:09:45	source domain to make them more similar to the target domain, then
0:09:50	train
0:09:51	the following blocks of the system
0:09:55	with labeled data from the source domain that have been transformed
0:09:59	and then apply this classifier on the target domain
0:10:03	correlation alignment is a transformation of the source data towards the target data that is
0:10:09	a matrix multiplication, with the goal of making the covariance matrices of both domains similar
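
A minimal NumPy sketch of such a covariance-matching matrix multiplication: the source features are whitened with the source covariance, then re-colored with the target covariance. The small ridge term `eps`, the absence of any mean handling, and the function names are assumptions made for illustration; they are not taken from the talk.

```python
import numpy as np


def _sym_matrix_power(mat, power):
    """Real power of a symmetric positive-definite matrix via eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(mat)
    return (eigvecs * eigvals ** power) @ eigvecs.T


def coral_transform(source, target, eps=1e-6):
    """Map source features so that their covariance matches the target covariance.

    source: array of shape (n_source, dim); target: array of shape (n_target, dim).
    eps is a small ridge added to both covariances for numerical stability.
    """
    dim = source.shape[1]
    cov_s = np.cov(source, rowvar=False) + eps * np.eye(dim)
    cov_t = np.cov(target, rowvar=False) + eps * np.eye(dim)
    # Whitening with cov_s^(-1/2) followed by re-coloring with cov_t^(1/2).
    transform = _sym_matrix_power(cov_s, -0.5) @ _sym_matrix_power(cov_t, 0.5)
    return source @ transform
```

The following blocks of the system would then be trained on `coral_transform(source, target)` together with the source labels, and applied unchanged to target data.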
0:10:16	we apply this method to
0:10:18	two
0:10:20	blocks of the system:
0:10:21	the x-vector extractor, so
0:10:23	we
0:10:24	transform
0:10:26	the frame-level bottleneck features,
0:10:28	and the classifier, so we apply correlation alignment to the segment-level x-vectors
0:10:37	and finally we could have used a model-based domain adaptation method for the
0:10:43	language classifier. since our goal is to prove that domain adaptation
0:10:50	of the embedding extractor is superior to domain adaptation of the
0:10:55	classifier
0:10:56	we simply trained the classifier with labeled data from the target domain
0:11:01	so it is not domain adaptation but supervised training
0:11:07	and it gives us a bound on the potential performance of an adaptation
0:11:12	of the back-end classifier to the target domain
0:11:18	so
0:11:19	in this work we compare four methods:
0:11:24	two
0:11:25	feature-based domain adaptation methods, applied to the embedding extractor and to
0:11:30	the classifier, and our model-based method applied to the embedding
0:11:35	extractor, compared to
0:11:37	a bound on the performance that could be achieved by model-based adaptation of the
0:11:43	final classifier
0:11:47	so now let us present the experiments
0:11:51	so we
0:11:53	trained systems with all these methods
0:11:56	with the same settings, so
0:11:59	the same bottleneck feature extractor, which is a pre-trained
0:12:05	bottleneck feature extractor,
0:12:08	and the same neural network architecture for the
0:12:10	x-vector extractor
0:12:12	and
0:12:13	for the training with regularization of the embedding extractor
0:12:18	we perform validation on one channel, so with one domain adaptation scenario
0:12:23	we select the hyperparameter lambda, which controls the compromise between the
0:12:29	two loss functions, based on performance on the target domain
0:12:32	of this domain adaptation scenario, and then this value of lambda is selected and
0:12:37	applied to all of the other domain adaptation scenarios
0:12:41	why is it important?
0:12:43	it is because
0:12:44	in a real domain adaptation scenario we cannot use labeled data from the
0:12:49	target domain to set lambda, so this parameter has to be robust
0:12:55	and then we have to choose the source domain, so we always use
0:12:59	the telephone channel as the source domain, since
0:13:02	most
0:13:04	language recognition corpora
0:13:07	are telephone corpora
0:13:09	and
0:13:10	the target domain is each of the eight radio channels
0:13:15	so we
0:13:16	have unlabeled data from this domain
0:13:21	so first we have to select the parameter lambda
0:13:26	so we train
0:13:27	an embedding extractor
0:13:29	with different values of lambda, corresponding to the colors of these
0:13:33	plots
0:13:35	so this is the value of the regularization loss function on the validation set
0:13:42	we observe that at the beginning of training
0:13:49	the maximum mean discrepancy is close to zero
0:13:52	since the x-vector extractor is
0:13:55	randomly initialized and the distributions of both domains are similar
0:14:00	then it increases during training because
0:14:03	classification needs to make a difference between the domains
0:14:08	and the value that it reaches is controlled by the value of
0:14:14	the regularization parameter lambda
0:14:19	so now we look at the classification loss function, cross-entropy
0:14:23	in these plots we have the classification loss function on the source domain in
0:14:29	solid lines
0:14:30	and on the target domain in dotted lines. the dotted lines, corresponding to cross-entropy
0:14:35	computed on the target domain, are not available in a real unsupervised
0:14:40	domain adaptation
0:14:42	training experiment
0:14:43	but we include them here to understand what happens
0:14:49	so when the parameter lambda of the regularization is small
0:14:54	here
0:14:56	cross-entropy is reduced on the source domain but explodes on the target domain
0:15:03	but by
0:15:05	increasing the value of lambda we manage to reduce the gap between both
0:15:11	domains, as for the green
0:15:13	and
0:15:14	[unclear] curves
0:15:16	but it slows down training on the source domain, and for a high value of
0:15:23	lambda
0:15:24	so lambda equals one hundred, we are not able to
0:15:30	converge on the source domain
0:15:32	so the choice of lambda
0:15:34	is a compromise
0:15:35	between
0:15:37	reducing the gap between domains
0:15:39	and good convergence on the source domain
0:15:42	and we selected a value
0:15:45	for lambda
0:15:47	and then we use this value for all domain adaptation scenarios, meaning telephone as source
0:15:52	domain and each of the eight radio channels as target
0:15:56	so in this table we only present performance for two target domains
0:16:03	and
0:16:05	these are the
0:16:07	best and worst performance for the baseline system on the target domain, as well as the average
0:16:12	performance over the eight channels
0:16:17	but results for all channels are consistent
0:16:21	so first we look at the performance of the baseline systems
0:16:25	trained on the source domain or trained on the target domain
0:16:33	performance of the system trained on the source domain is really poor, with an
0:16:38	average equal error rate of forty-two percent
0:16:40	and training on the target domain achieves an equal error rate of twelve percent
0:16:47	so full
0:16:48	for system trained with baseline domain adaptation data so first the feature based them into
0:16:54	the efficient data
0:16:55	if it is applied to the classifier you l
0:16:59	we go from forty two such as tree the nine percent average equal weight
0:17:07	so we are she a slight improvement
0:17:10	we see that
0:17:11	the feature-based domain adaptation method is more efficient when applied to the embedding
0:17:16	extractor
0:17:17	supporting our idea
0:17:20	that adaptation of the embedding extractor is superior to adaptation of the classifier
0:17:25	and if the embedding extractor is adapted
0:17:29	adding
0:17:30	a feature-based adaptation of the classifier does not really improve performance
0:17:35	finally
0:17:36	model-based training
0:17:38	of the classifier, with embeddings trained on the source domain,
0:17:43	achieves good performance, but it is significantly worse than systems that are
0:17:48	trained on the target domain. it means that
0:17:53	embeddings trained on the source domain are not perfectly suited for the
0:17:57	target domain, and we should also gain from adapting the embedding extractor
0:18:04	so
0:18:05	domain adaptation
0:18:07	of the classifier
0:18:10	cannot compensate for the domain mismatch in the space of embeddings
0:18:15	and finally we can look at the results of our method,
0:18:20	the maximum mean discrepancy regularization of the embedding extractor
0:18:24	so first, when the back-end classifier is trained on the source domain, it is an
0:18:30	unsupervised domain adaptation experiment, and for seven out of eight
0:18:36	channels of the corpus we achieve a better performance than the system
0:18:42	with supervised training on the target domain
0:18:46	so with the exception of one channel
0:18:48	so this is a very good result, showing that invariance in the space
0:18:52	of embeddings
0:18:53	is useful and is achieved with the addition of the regularization loss
0:18:58	but, and this is the last line
0:19:01	of the table,
0:19:03	if we train the back-end classifier on the target domain we are still
0:19:08	able to improve performance with these embeddings
0:19:12	it means that these embeddings are not fully invariant
0:19:16	and that we could work to improve again the invariance of these embeddings
0:19:21	or we could combine this method with an unsupervised domain adaptation
0:19:27	of the classifier
0:19:30	so in this paper we studied the transmission channel mismatch for a language
0:19:36	identification system and proposed an unsupervised domain adaptation method for such a system
0:19:43	so the proposed method
0:19:45	is to add a regularization loss function
0:19:48	to the loss function of the embedding extractor
0:19:51	and this loss function is the maximum mean discrepancy
0:19:55	so we compared
0:19:56	this method
0:19:58	with
0:19:59	supervised training of the whole system on the target domain and
0:20:06	we supported the idea that adaptation of the embedding extractor is more
0:20:12	efficient than adaptation
0:20:14	of the classifier
0:20:16	in an x-vector-based language identification system
0:20:20	thank you

Unsupervised Regularization of the Embedding Extractor for Robust Language Identification

Speaker and Language Recognition

Raphaël Duroselle, Denis Jouvet, Irina Illina