| 0:00:13 | Thank you very much for the introduction. |
|---|
| 0:00:16 | My name is [inaudible], and I come from [inaudible]. |
|---|
| 0:00:20 | Today I will present our work on feature compensation for short utterance spoken language identification. |
|---|
| 0:00:31 | I will organize this presentation as follows. |
|---|
| 0:00:35 | First, we introduce the short utterance language identification (LID) task. |
|---|
| 0:00:42 | Then I will show the neural network based embedding technique, the x-vector extractor, and how the x-vectors are used for the LID task. |
|---|
| 0:00:53 | After that, the feature compensation learning will be introduced. |
|---|
| 0:00:58 | Then I will show our experimental setup and results, and finally the summary and conclusions. |
|---|
| 0:01:10 | Language identification techniques are typically used as a pre-processing stage in multilingual speech recognition and translation systems. |
|---|
| 0:01:22 | For real-time speech processing systems, improving the performance on the short utterance task is important, |
|---|
| 0:01:31 | because it can help to reduce the real-time factor and the latency of the overall system. |
|---|
| 0:01:40 | One of the state-of-the-art methods is the i-vector based method; this method is very effective with relatively long utterances. |
|---|
| 0:01:52 | Recently, most researchers have focused on neural network based approaches, because LID is a classification task, and therefore a neural network model can be directly used for classification. |
|---|
| 0:02:10 | The embeddings have shown good performance on the short utterance LID task. The x-vector was initially proposed for the speaker verification task, and in recent studies it was also successfully used for the LID task. |
|---|
| 0:02:28 | In this work, we focus on the x-vector based method. |
|---|
| 0:02:36 | The x-vector is a neural network based embedding representation. Note that it was originally proposed for speaker recognition, but it can also be applied to language identification. |
|---|
| 0:02:51 | The network for extracting x-vectors consists of three modules: a frame-level feature extractor, a statistics pooling layer, and utterance-level representation layers. |
|---|
| 0:03:11 | The frame-level feature extractor module outputs frame-level representations of the utterance; it is applied over a sequence of acoustic features. For this module, a time delay neural network (TDNN) or a convolutional neural network is usually used. |
|---|
| 0:03:34 | Then, the statistics pooling layer converts the variable-length frame-level features into a fixed-dimensional vector by using their mean and standard deviation. |
|---|
| 0:03:53 | Finally, fully connected layers are used to process the utterance-level representations, and the final softmax layer outputs the posterior probabilities over the target languages. |
|---|
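Putting the three modules together, the statistics pooling step can be sketched as follows. This is a minimal numpy illustration, not the speakers' code; the 512-dimensional frame features and frame counts are hypothetical sizes chosen for the example.

```python
import numpy as np

def statistics_pooling(frames: np.ndarray) -> np.ndarray:
    """Map a variable-length (T, D) sequence of frame-level features
    to a fixed 2*D vector of per-dimension mean and standard
    deviation, as in the x-vector architecture described above."""
    mean = frames.mean(axis=0)
    std = frames.std(axis=0)
    return np.concatenate([mean, std])

# Utterances of different lengths map to the same output size,
# which is what lets the utterance-level layers be fixed-size.
short = np.random.randn(50, 512)    # few frames (hypothetical)
long_ = np.random.randn(1000, 512)  # many frames (hypothetical)
assert statistics_pooling(short).shape == statistics_pooling(long_).shape == (512 * 2,)
```

The fixed-dimensional output of this pooling is what the fully connected utterance-level layers then process.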
| 0:04:16 | The x-vector extractor is mostly used for the speaker verification task. In the verification task, the x-vector extractor works as the front-end, that is, it is used to extract utterance-level representations, and in the back-end, PLDA or cosine similarity can be used to compare them. |
|---|
| 0:04:40 | For the LID task, the front-end and back-end approach can also be used; back-ends such as logistic regression have become widely used for the classification task. |
|---|
| 0:04:54 | Since LID is a closed-set classification task, we can also directly use the network outputs for classification. |
|---|
| 0:05:05 | This work targets the short utterance LID task. When the test utterances become shorter, the performance also decreases. |
|---|
| 0:05:18 | The degradation is mainly because the limited information in a short duration causes a large variation in the representations of short utterances. |
|---|
| 0:05:29 | To reduce the variation of short utterances, normalization methods using the corresponding long utterances were investigated for i-vectors. Since the x-vector is a neural network based embedding, it is natural that we can also apply a similar idea to the x-vector extractor. |
|---|
| 0:05:55 | Therefore, we think that a similar idea can improve the LID performance when using the x-vector network. |
|---|
| 0:06:07 | The feature compensation is done by reducing the difference between the representations obtained from the long duration and the short duration inputs. |
|---|
| 0:06:21 | Here, S_s is the representation of the short utterance, and S_l is the representation of the corresponding long utterance in the x-vector space. |
|---|
| 0:06:38 | This equation can be rewritten as this one. |
|---|
| 0:06:44 | For training, we train the x-vector network by using the long duration encodings as targets for the short duration inputs, to model the compensation mapping function, considering the difference between the long and the short utterances. |
|---|
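A minimal sketch of this compensation objective follows. The variable names and the simple squared-error form are assumptions on my part (the exact loss appears only on the slide): the idea is that the embedding of a short utterance is pulled toward the fixed embedding of its corresponding long utterance.

```python
import numpy as np

def compensation_loss(s_short: np.ndarray, s_long: np.ndarray) -> float:
    """Mean squared difference between the representation of a short
    utterance (s_short) and of its corresponding long utterance
    (s_long). The long-utterance embedding is treated as a fixed
    target, so minimizing this pulls s_short toward s_long."""
    return float(np.mean((s_short - s_long) ** 2))

# Toy embeddings: the closer the short-utterance representation is
# to the long one, the smaller the penalty.
s_long = np.array([0.5, -1.0, 2.0])
s_short = np.array([0.4, -0.8, 2.5])
assert compensation_loss(s_short, s_long) > 0.0
assert compensation_loss(s_long, s_long) == 0.0
```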
| 0:07:08 | Short utterances contain very limited information. Therefore, to improve the performance on short utterances, both high-level language information and local phonetic information are important. |
|---|
| 0:07:25 | We suppose that the mean component of the vector captures the language information, and the variance component describes the information related to the local phonetic information. |
|---|
| 0:07:37 | Based on this consideration, we propose to normalize only the mean component of the vector towards the representation of the long utterance, while leaving the variance components free to retain the frame-level phonetic information, which provides discriminative features for language identification. |
|---|
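Assuming the pooled vector is the concatenation [mean; std] produced by statistics pooling, a mean-only variant of the compensation penalty could be sketched like this. This is an illustration under that assumption, not the exact formulation from the slide.

```python
import numpy as np

def mean_only_loss(v_short: np.ndarray, v_long: np.ndarray) -> float:
    """Penalize only the first half (the mean statistics) of the
    pooled [mean; std] vector. The std half is left free, so the
    frame-level phonetic information it carries is not normalized
    away."""
    d = v_short.shape[0] // 2
    return float(np.mean((v_short[:d] - v_long[:d]) ** 2))

# The std halves differ, but only the mean halves contribute:
v_long = np.array([1.0, 2.0, 0.3, 0.4])   # [mean; std]
v_short = np.array([1.0, 2.0, 0.9, 0.1])  # same means, different stds
assert mean_only_loss(v_short, v_long) == 0.0
```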
| 0:08:09 | The cost function of the proposed method is shown on this slide, where the representation of the utterance is obtained from the neural network. |
|---|
| 0:08:28 | In the proposed method one, we use the x-vector embedding as the representation, |
|---|
| 0:08:39 | and in the proposed method two, we use the ResNet with global average pooling to obtain the representation. |
|---|
| 0:08:52 | We evaluated the proposed method on the NIST Language Recognition Evaluation (LRE) 2017 set. |
|---|
| 0:09:03 | As the training data, we used the training and development data provided for LRE 2017, together with additional telephone data. |
|---|
| 0:09:22 | For the test set, we used the standard NIST evaluation set. |
|---|
| 0:09:38 | We also prepared 1, 1.5, and 2 second test segments. |
|---|
| 0:09:47 | As the input features, we used 60-dimensional filterbank features. |
|---|
| 0:09:55 | The average cost (Cavg) defined by NIST was used as the evaluation metric. |
|---|
| 0:10:03 | For the analysis, we compared the ResNet system and the x-vector system. |
|---|
| 0:10:10 | The ResNet system uses a residual network followed by global average pooling and fully connected layers with a softmax output. |
|---|
| 0:10:27 | For the x-vector system, a TDNN was used as the frame-level feature extractor. |
|---|
| 0:10:37 | For the training examples, the long examples are segments of between five and ten seconds, and the short utterance examples are truncated to two seconds. |
|---|
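The example preparation described here can be sketched as a simple cropping step. The 5-10 second and 2 second lengths are from the talk; the frame rate and the choice to cut the short segment from inside the long one are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_example_pair(frames: np.ndarray, fps: int = 100):
    """From one recording's (T, D) feature matrix, cut a long
    training segment of 5-10 s and a 2 s short segment from inside
    it. `fps` is a hypothetical 100 frames per second."""
    long_len = int(rng.uniform(5, 10) * fps)
    start = rng.integers(0, frames.shape[0] - long_len + 1)
    long_seg = frames[start:start + long_len]
    short_start = rng.integers(0, long_len - 2 * fps + 1)
    short_seg = long_seg[short_start:short_start + 2 * fps]
    return long_seg, short_seg

utt = np.random.randn(3000, 60)  # 30 s of 60-dim features (toy data)
long_seg, short_seg = make_example_pair(utt)
assert 500 <= long_seg.shape[0] <= 1000  # 5-10 s
assert short_seg.shape[0] == 200         # 2 s
```

Pairing each short segment with a long segment that contains it is what makes the compensation target well defined during training.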
| 0:10:53 | On this slide, we show the results of the baseline systems for comparison. We also list the results for different durations of test utterances. |
|---|
| 0:11:07 | We can see that the x-vector system is more effective on long duration utterances, while on short utterances the ResNet system showed better performance. |
|---|
| 0:11:27 | Because of the duration mismatch, the model trained with long duration samples performed worse on the short duration test data. |
|---|
| 0:11:42 | This slide shows the results with the feature compensation methods. |
|---|
| 0:11:48 | In this table, the baseline is the x-vector network trained with the short examples. |
|---|
| 0:11:56 | The results of the mean and variance compensation learning and of the two proposed methods are listed in this table. |
|---|
| 0:12:10 | For further evaluation, we give a figure to compare the baseline, the mean and variance compensation, and the proposed method. |
|---|
| 0:12:22 | From the results, we can say that the compensation by using both the mean and the variance could improve the performance, but not on all of the utterances. |
|---|
| 0:12:38 | In contrast, the proposed compensation by using the mean only significantly improved the performance, yielding the best results. |
|---|
| 0:13:00 | To conclude: in this work, we investigated an improvement of the neural network based embedding technique, the x-vector, for the short utterance LID task. |
|---|
| 0:13:13 | We compared the compensation using both the mean and the variance with the proposed method, which compensates the mean only. |
|---|
| 0:13:26 | The mean is expected to capture high-level, abstract language information, while our method leaves the variance components free, because they retain the phonetic information of the short utterances. |
|---|
| 0:13:40 | The results show that the proposed method is more effective for the short utterance LID task. |
|---|
| 0:13:51 | That's all. Thank you for your attention. |
|---|