Speech Transcript - Joint Training End-to-End Speech Recognition Systems with Speaker Attributes

0:00:14	hello everyone, today my report is
0:00:18	joint training end-to-end speech recognition systems with speaker attributes; I'm Sheng
0:00:24	Li, and my colleagues and I
0:00:26	all work for NICT,
0:00:29	advanced technology lab, located in Kyoto, Japan
0:00:39	in this paper our motivation is that we focus on improving the performance of the
0:00:46	state-of-the-art transformer-based end-to-end speech recognition system
0:00:51	as we know, a multilingual speech-to-speech translation system
0:00:58	has to handle data from all kinds of speakers
0:01:01	and how to improve the end-to-end speech recognition system
0:01:07	for dealing with such diverse input is the focus of this paper
0:01:20	we are using a transformer-based end-to-end speech recognition system, and
0:01:25	it is the state of the art,
0:01:28	although its parameter size is ten times larger than the traditional
0:01:33	deep neural network and hidden Markov hybrid models; how to compress this model
0:01:39	into a relatively small size is another focus of this paper; actually it is
0:01:45	our previous
0:01:48	Interspeech paper in two thousand and nineteen; we also introduce it in this paper as
0:01:54	a summarization
0:02:02	so this paper tries to solve this series of problems using the following two
0:02:08	technologies: the first is recurrent stacked layers,
0:02:13	the second is speech attributes and their combinations; the recurrent stacked layers try
0:02:20	to compress the model size, and the
0:02:23	speech attributes,
0:02:24	as label-level augmentation to train the model, I mean, train the compressed
0:02:31	model explicitly,
0:02:34	actually do something like speaker adaptive training
0:02:41	to improve the result
0:02:47	in this slide we introduce how we compress the whole model
0:02:52	using the recurrent stacked
0:02:55	layers
0:02:56	so for the conventional transformer-based model,
0:03:00	with conventional layers, I mean, each layer is independent of the others
0:03:06	there are, for example, six layers:
0:03:10	six encoding layers and six decoding layers
0:03:14	so the parameter size is very large; so if we
0:03:18	use the same parameters for all layers in the encoder and
0:03:22	also the same kind of sharing in the decoder,
0:03:26	we can compress the model to
0:03:29	one over six of its original size, taking the example of six and six layers of the
0:03:34	transformer-based model
0:03:39	this idea is simple but very effective
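
The one-over-six figure can be checked with a rough weight count. The following is a minimal sketch, not the speaker's implementation: the layer sizes (`d_model=512`, `d_ff=2048`) and the helper names `layer_params` and `stack_params` are assumptions chosen for illustration, and biases, layer normalisation and embeddings are ignored.

```python
def layer_params(d_model: int, d_ff: int) -> int:
    """Rough weight count of one transformer encoder layer (biases, layer norm ignored)."""
    attention = 4 * d_model * d_model   # query, key, value and output projections
    feed_forward = 2 * d_model * d_ff   # two position-wise linear maps
    return attention + feed_forward


def stack_params(num_layers: int, d_model: int, d_ff: int, shared: bool) -> int:
    """Weights stored for a stack of layers; a shared stack keeps a single copy."""
    copies = 1 if shared else num_layers
    return copies * layer_params(d_model, d_ff)


# Assumed sizes for illustration only: six layers, d_model=512, d_ff=2048.
full = stack_params(6, 512, 2048, shared=False)
shared = stack_params(6, 512, 2048, shared=True)
ratio = shared / full  # fraction of the full stack that the shared stack stores
```

Under these assumptions the shared stack stores one sixth of the weights of the six-layer stack. Components outside the stack, such as embeddings and the output layer, are not covered by this count, so the reduction for a whole model would differ.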
0:03:47	this is our experimental setting; for the dataset we use the Japanese speech recognition corpus, the CSJ corpus
0:03:54	the training set sums up to five hundred hours
0:04:00	with a development set and three testing sets
0:04:06	as for the model training settings, we use eight attention heads, five hundred and twelve
0:04:12	hidden units,
0:04:13	six blocks for the encoders and six blocks for the decoders
0:04:20	for the experimental settings, we use the word piece model subword
0:04:27	as the training unit
0:04:31	with forty-dimensional filterbank as the feature extraction
0:04:48	as our experimental
0:04:50	result,
0:04:52	this is the shared model and the full model; in the shared model, shared layers are
0:05:00	used, and in the full model, the vanilla layers are used
0:05:06	for the full models we tried different numbers of layers
0:05:13	I mean, the numbers in this case are the number of encoders
0:05:19	and the number of decoders
0:05:22	we find that
0:05:24	for the full models
0:05:26	the six-encoder and six-decoder structure can achieve the best
0:05:32	result;
0:05:33	making it much deeper brings no significant improvement
0:05:40	for the shared model we also observed that six encoders and six
0:05:45	decoders
0:05:47	achieve the best result; however, there is a performance gap for
0:05:52	the shared model
0:05:55	I mean, the performance has about a one percent absolute,
0:06:01	one percent absolute performance gap
0:06:06	how to minimize the performance gap is
0:06:11	the other focus
0:06:26	as a summarization of the first experimental results of this paper,
0:06:32	our observation is that the shared model with recurrent stacked layers is
0:06:38	six times smaller than the original vanilla layers in the full
0:06:45	model
0:06:46	and we can speed up the decoding to twice as fast as the original
0:06:53	decoding speed, and training is ten percent faster; we use the CPU
0:07:00	in the decoding, and in the standard training we use the
0:07:07	GPU
0:07:09	and going deeper is not beneficial,
0:07:13	I mean more than six layers
0:07:15	so for the experiments we chose the
0:07:18	six layers
0:07:21	experimental setting;
0:07:23	in the following experiments we only used the six and six layers
0:07:28	structure
0:07:31	and the non-linear operations are more important than the number of parameters
0:07:36	I mean
0:07:39	in the process, and that's why the shared,
0:07:42	I mean, the recurrent stacked layers work
0:07:51	why do we propose this speech attribute augmentation in the training? because
0:07:59	the autoregressive model has the nature that it recognizes the words one by one,
0:08:03	using the previously recognized words; its decoding speed is very slow, but
0:08:09	using this nature we can do something similar to speaker adaptive
0:08:15	training by introducing the attributes at the beginning of the label
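
Placing attributes at the beginning of the label can be sketched as simple list concatenation. This is only an illustration under assumed conventions: the `<name:value>` tag format, the function name `add_attribute_tags`, the example tokens and the fallback value for missing attributes are all hypothetical and are not taken from the talk.

```python
from typing import Dict, List


def add_attribute_tags(tokens: List[str], attributes: Dict[str, str], order: List[str]) -> List[str]:
    """Return the label sequence with one tag per selected attribute placed in front."""
    # Attributes missing for an utterance fall back to an 'unregistered' value (assumption).
    tags = [f"<{name}:{attributes.get(name, 'unregistered')}>" for name in order]
    return tags + tokens


# Hypothetical utterance: two subword tokens and two known attributes.
label = add_attribute_tags(
    tokens=["kyou", "wa"],
    attributes={"sex": "female", "duration": "short"},
    order=["duration", "sex", "age"],
)
# label == ['<duration:short>', '<sex:female>', '<age:unregistered>', 'kyou', 'wa']
```

Because an autoregressive decoder conditions every output on the outputs before it, tags placed first are visible to all later word predictions, which is the property the talk relies on.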
0:08:29	this is the definition of the speech attributes; it includes the speakers' information
0:08:35	and also includes the
0:08:37	speech segments' information
0:08:40	we give a formal definition of the speech attributes; first we use the dialect of the speakers
0:08:48	I mean, because we are using a corpus of the Japanese language, we use the
0:08:53	place where the speaker was born
0:08:57	I mean
0:09:00	the birthplace and so on; the place is what we put here
0:09:04	and also we have the duration of the utterance; for this attribute
0:09:10	we find it is very useful so we put it here, although it has nothing to do with
0:09:14	the speakers' information; and also, this corpus is from
0:09:21	several different
0:09:24	resources: academic, simulated, dialogue, read,
0:09:28	miscellaneous and something else
0:09:30	and also, fourthly, there is the sex of the speakers: female, male and
0:09:36	unknown,
0:09:37	I mean unregistered information
0:09:40	for the age
0:09:42	we group all the speakers into four groups:
0:09:47	young, middle-aged, old and unregistered information; and as the last group we use
0:09:55	the educational level of the speakers: middle school, high school, bachelor, master,
0:10:01	doctor, and unregistered information
0:10:04	with all these definitions we
0:10:07	use individual attributes
0:10:10	and different numbers of combinations
0:10:13	to train;
0:10:15	that is, I mean, we put those
0:10:18	attributes as tags in the model training
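
Trying individual attributes and different numbers of combinations amounts to enumerating subsets of the attribute groups. The sketch below assumes six groups with hypothetical short names, based on one reading of the list in the talk; the function name `attribute_subsets` is also an assumption.

```python
from itertools import combinations

# Hypothetical short names for the attribute groups mentioned in the talk.
GROUPS = ["dialect", "duration", "source", "sex", "age", "education"]


def attribute_subsets(groups, size):
    """All unordered selections of `size` attribute groups."""
    return [tuple(c) for c in combinations(groups, size)]


pairs = attribute_subsets(GROUPS, 2)
# Number of candidate tag sets for each combination size.
counts = {k: len(attribute_subsets(GROUPS, k)) for k in range(1, len(GROUPS) + 1)}
```

Each subset corresponds to one training configuration, so the number of models to train grows quickly with the number of groups.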
0:10:22	here are the experiments utilising all the speech attributes in the label
0:10:27	so the first line is
0:10:29	without attributes,
0:10:32	I mean the very conventional method,
0:10:36	but we put it here to compare as a baseline
0:10:39	and also we use the speakers',
0:10:42	the speaker ID
0:10:44	as the attribute; there are one thousand
0:10:48	and five hundred or more speakers
0:10:51	I mean we put the speaker's ID at the beginning of the sentence as a label;
0:10:55	we can see it is not effective at all
0:10:59	and we also tried,
0:11:01	I mean, individual
0:11:03	groups of tags, individual groups of tags at the beginning of the
0:11:07	label,
0:11:09	and find that the sex
0:11:11	information is the best, and also duration
0:11:14	and speech
0:11:16	they are effective individually; we also tried
0:11:21	two-attribute combinations, three-attribute combinations, four-attribute combinations
0:11:26	and five-attribute combinations; you can
0:11:31	similarly
0:11:32	combine them together
0:11:34	to see which will work
0:11:36	the best
0:11:38	but to be more specific, we find that for two groups,
0:11:43	sex and duration works the best, and for three groups,
0:11:49	duration,
0:11:50	sex and age works the best,
0:11:54	and for four groups, the combination with duration, sex and age works the best of the best
0:12:01	still, this four-group attribute combination can be compared to
0:12:09	the full model's baseline,
0:12:12	I mean the uncompressed larger network,
0:12:16	so the performance is comparable
0:12:20	we also find that using these tags,
0:12:23	using the speech attributes
0:12:25	to augment the training of the
0:12:27	full model, I mean the uncompressed big model, is not effective because the
0:12:33	model size is large enough to learn by itself
0:12:46	as the observation of this part of the experiments, we find the full model can
0:12:51	learn the speech attributes by itself, I mean implicitly,
0:12:56	because the model size is so large that it can learn it all
0:13:00	by itself
0:13:02	so the shared model needs explicit signals
0:13:07	we also find the attribute combinations of duration, sex, age and the
0:13:13	and
0:13:14	and a single sex
0:13:17	tag are mostly effective
0:13:21	this information can also be predicted from
0:13:24	the resource information
0:13:27	if
0:13:27	before
0:13:28	I mean
0:13:29	doing the real ASR recognition at test time
0:13:39	to conclude,
0:13:40	the end-to-end transformer-based speech recognition models
0:13:46	have many layers in the encoder and decoder, and this makes the model
0:13:50	very large
0:13:51	in conventional models each layer has independent parameters
0:13:57	and this makes the model too large, so we propose recurrent stacked
0:14:03	layers,
0:14:04	which use the same parameters for all layers in the encoder and decoder individually
0:14:10	we also propose speech attributes as input signals
0:14:15	to perform augmentation to train
0:14:19	such recurrent stacked layers
0:14:24	explicitly
0:14:26	so finally,
0:14:27	in the extensive experiments on the CSJ corpus,
0:14:37	we show the model size reduction can reach
0:14:41	ninety three percent
0:14:43	and also ten percent
0:14:45	faster training by using the GPU
0:14:48	there is an insignificant loss in character error rate in the speech recognition
0:14:53	if we use speech attributes
0:14:56	and we also find the increase of the attention entropy visualised in the feature maps
0:15:04	the future work:
0:15:05	we will maximize the compression with the recurrent stacked layers and with
0:15:12	small degradation
0:15:15	we will also develop more flexible decoding
0:15:22	by choosing the layers dynamically
0:15:25	and we can use lower precision in the model,
0:15:30	in the model storage and the parameters, and fast attention and other softmax-side
0:15:34	technology
0:15:37	to make the model smaller
0:15:40	and we will also investigate the model representations of each layer
0:15:45	to see how much useful information comes from
0:15:49	each layer
0:15:51	thank you so much for listening
0:15:55	this is the end of the presentation and any questions are welcome

Joint Training End-to-End Speech Recognition Systems with Speaker Attributes

Speech Application

Sheng Li, Xugang Lu, Raj Dabre, Peng Shen, Hisashi Kawai