0:00:06 Hi everyone. Thank you for joining this tutorial session on neural automatic speech recognition. I am a researcher from Google Research.
0:00:21 This sixty-minute tutorial will be organized into two parts. The first part, presented by me, explains the basic formulations and some algorithms for neural speech recognition.
0:00:33 The second part covers software and implementations for neural speech recognition, and will be presented by my coworker.
0:00:43 Let's get into the first part.
0:00:47 First of all, I want to define what neural, or end-to-end, speech recognition is.
0:00:51 In this session I use this term for techniques that recognize speech with a single end-to-end neural network, though those techniques can sometimes also be applied to non-end-to-end speech recognition systems.
0:01:07 End-to-end speech recognition is a type of speech recognition that involves neural networks converting acoustic features directly into words.
0:01:17 As you may already know, a conventional speech recognizer consists of roughly three parts: an acoustic model, a pronunciation model, and a language model.
0:01:29 Each model represents a probabilistic distribution, and a search algorithm finds the best possible hypothesis from those models.
0:01:43 The end-to-end approach uses a single system instead: one neural network that directly maps the feature sequence to the word sequence is used to represent the whole process of speech recognition.
0:01:59 One obvious advantage of this approach is the simplicity of the system: a conventional recognizer, with its search algorithm over several internal components, can be very complicated to build and maintain.
0:02:14 Recently, the end-to-end approach has even been extended to directly handle raw waveform signals instead of precomputed feature vectors.
0:02:25 This session explains how to design those neural networks that directly output words from feature vectors or raw waveform signals.
0:02:45 In this first part, I will explain three approaches for end-to-end speech recognition, and also recent advances over those three.
0:02:58 Let's begin with the first section.
0:03:02 Most classical speech recognition models use this factorization. It describes the generative story of the feature vector sequence X and the word sequence W, and it models the joint distribution of the two variables by introducing additional latent variables: the phoneme sequence and the related HMM state sequence.
0:03:32 The joint distribution is usually decomposed by assuming that the phoneme sequence is generated depending on the word sequence, that the HMM states are generated depending on the phoneme sequence, and that the feature vectors are generated depending on the HMM states.
0:03:50 So here we typically assume conditional independence between the introduced variables.
0:03:58 This assumption looks reasonable, but it also introduces some limitations.
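For reference, the decomposition just described can be written out as follows; this is the standard HMM-based formulation, where P and S denote the phoneme and HMM state sequences introduced above:

$$
p(X, W) \;=\; \sum_{P,\,S} \underbrace{p(X \mid S)}_{\text{acoustic model}}\;
\underbrace{p(S \mid P)}_{\text{HMM}}\;
\underbrace{p(P \mid W)}_{\text{pronunciation model}}\;
\underbrace{p(W)}_{\text{language model}}
$$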
0:04:06 In the conventional approach, deep learning techniques are introduced into each component of this decomposition.
0:04:14 For example, for the word sequence probability, an RNN language model is often used to get better predictions of words.
0:04:23 For acoustic modeling, people often use a time-delay neural network or a recurrent neural network for modeling the emission probability of the feature sequence.
0:04:35 In the next slides, I review those works that enhance individual components with deep learning techniques.
0:04:46 DNN-HMM hybrid approaches are a very famous way to enhance conventional acoustic models.
0:04:54 In this approach, the definition of the emission probability used as the acoustic model of the conventional speech recognizer is changed: the probability of the features given the HMM state is transformed into a probability that is proportional to this ratio.
0:05:12 This is the ratio between the predictive probability of the HMM state given the feature vector, and the marginal probability of the HMM state.
0:05:23 The predictive distribution is modeled by a neural net, and the marginal distribution is modeled by a simple categorical distribution.
0:05:33 This is a convenient way to bring the expressive power of neural nets into conventional speech recognizers. However, it has some problems.
0:05:48 First, because the predictive distribution and the marginal distribution are independently parameterized with different parameters, the Bayes rule used here is only an approximation.
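In symbols, the hybrid trick replaces the emission probability with a scaled neural-net posterior; this is the usual pseudo-likelihood formulation, where p_NN is the network's predictive distribution and the hatted prior is estimated separately:

$$
p(x \mid s) \;=\; \frac{p(s \mid x)\, p(x)}{p(s)} \;\propto\; \frac{p_{\text{NN}}(s \mid x)}{\hat{p}(s)}
$$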
0:06:09 Second, it is known that an HMM state label is a very difficult target for a classifier to estimate; by classifiers here I mean neural-net classifiers.
0:06:24 For example, for some stationary phonemes, it is very difficult to classify whether an acoustic feature vector belongs to the first half of the phoneme segment or the second half of the phoneme segment.
0:06:40 This fact makes training and prediction of the classifier more confusing, or in other words, less stable.
0:06:51 Connectionist temporal classification, or CTC, can be regarded as a remedy for that problem.
0:06:58 Its idea is that each label is represented by only a few points in the output sequence. This is done by introducing a dummy label, the blank symbol, and associating most of the input vectors with the blank; only a few input frames, typically around the center of a segment, contribute to the final output.
0:07:25 This diagram shows a speech-to-text neural network with the CTC approach. In this case, we have an input sequence with eight elements; each input vector is classified into the label set augmented with the blank symbol, and the final result is obtained by removing the blank symbols from the output.
0:07:52 One advantage of this viewpoint is that we no longer need to estimate HMM state labels with an existing conventional speech recognition system, so it is possible to train the neural network from scratch.
0:08:08 Another advantage of CTC is its generality: we can use it for arbitrary sequence-to-sequence tasks, not only speech recognition.
0:08:21 So it can be used either to estimate phoneme sequences, as in conventional systems, or to estimate word or grapheme sequences directly, as in end-to-end approaches.
0:08:31 However, each label here is estimated independently, so CTC is not able to model the dependency between output labels.
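To make this concrete, here is a minimal PyTorch sketch of CTC training and greedy blank-removal decoding; the toy shapes and three-symbol vocabulary are assumptions made for this example, not details from the talk:

```python
import torch
import torch.nn.functional as F

# Toy setup: T=8 input frames, vocabulary {blank=0, 'a'=1, 'b'=2}.
T, V, batch = 8, 3, 1
logits = torch.randn(T, batch, V, requires_grad=True)   # network outputs
log_probs = F.log_softmax(logits, dim=-1)

target = torch.tensor([[1, 2]])                         # reference "a b"
input_lengths = torch.tensor([T])
target_lengths = torch.tensor([2])

# CTC loss marginalizes over all blank-augmented alignments.
loss = F.ctc_loss(log_probs, target, input_lengths, target_lengths, blank=0)
loss.backward()

# Greedy decoding: pick the best label per frame, then collapse repeats
# and remove blanks, as described in the talk.
best = log_probs.argmax(-1).squeeze(1).tolist()         # e.g. [0,1,1,0,0,2,0,0]
collapsed = [p for i, p in enumerate(best)
             if p != 0 and (i == 0 or p != best[i - 1])]
print(collapsed)                                        # e.g. [1, 2] -> "a b"
```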
0:08:44 Let me elaborate on the independence assumption introduced by CTC.
0:08:50 It is known that the label-and-blank structure, the graphical model behind CTC, can be represented by a finite state transducer.
0:09:00 If we represent them as transducers, we can see that the conventional left-to-right HMM and the CTC neural net have quite similar topologies.
0:09:12 So, in fact, using only CTC for speech recognition is very similar to doing conventional speech recognition without a language model.
0:09:23 However, CTC still has some good properties.
0:09:28 The first is that it combines well with downsampling approaches in neural networks; conventional GMM-based alignment does not work very well with downsampled features.
0:09:46 Also, even after obtaining an HMM state alignment, the conventional approach tries to associate a single label with each time step; that makes the alignment ambiguous around phoneme boundaries, and this ambiguity becomes worse if the features are downsampled.
0:10:08 Since CTC only classifies something like the center of each segment, it is less sensitive to this boundary ambiguity.
0:10:16 The second advantage, related to the first, is that we do not need to classify the internal structure of phonemes, like the first and second half of a phoneme; this makes training more stable and prediction less complicated.
0:10:32 It also means that CTC with neural nets, combined with a search algorithm, tends to produce sharply peaked scores for each frame.
0:10:45 So using CTC even within classical speech recognition is a good idea, because it needs no frame-level alignment between the features and the labels.
0:10:56 Even if CTC is used only as a part of the system, we still have the advantages described before: downsampling can be applied, and it can form a good combination with the search algorithm.
0:11:14 Such combinations have been presented in past work.
0:11:17 Indeed, there is a class of hybrid approaches using CTC, where the CTC model simply replaces the acoustic model of a conventional ASR system.
0:11:34 Let's move on to the next component.
0:11:37 The language model can also be enhanced by introducing recurrent neural nets, for example LSTMs, long short-term memory neural nets, used as autoregressive models.
0:11:49 An RNN language model predicts the distribution over the next word given the history of previously estimated words.
0:12:00 Unlike earlier n-gram language model approaches, an RNN language model embeds a word and its context into a continuous vector, and uses it to make the prediction of the next word.
0:12:14 Since we use recurrence for making this continuous context representation, RNN language models can in theory handle an unbounded length of word history.
0:12:28 Even so, in practice they are often very difficult to optimize; still, we usually see significant improvements over n-gram language models.
0:12:38 There is a downside in the context representation of RNN-based models: in n-gram approaches, the number of possible contexts is bounded by the number of different word histories, which is finite.
0:12:55 However, RNN language models do not put such a bound on the contexts, so each different word history gets its own distinct context representation.
0:13:08 One could say this is a serious downside for computation, but in fact it is not that inefficient.
0:13:17 The reason is that this continuous representation typically requires less space to store in memory than enumerating the word histories.
0:13:28 If we compare the sizes of speech recognition systems built with the conventional approach and with the neural network approach, the sizes are actually comparable, or the neural nets are even smaller than language models expanded into weighted finite state transducers.
0:13:46 So it might be a bit counterintuitive, but neural net approaches actually fit very well with mobile devices too, especially if the device has an accelerator for matrix multiplication, for example.
0:14:06 Another important property concerns the computational efficiency with respect to tokenization.
0:14:12 n-gram models used in conventional approaches take a short context for making a prediction, so each token used to be long enough, typically a whole word, for making an accurate prediction.
0:14:24 However, RNN language models can handle long contexts; that means we can use finer tokenization methods, with sub-word tokens, or maybe even grapheme-based tokens.
0:14:38 Two tokenizers are commonly used with neural language models. Both are very similar in the sense that they tokenize the data by matching existing tokens: they start from small tokens, such as characters, and gradually merge them, and both select the pair of tokens to be merged according to some criterion.
0:15:02 Byte pair encoding uses the number of adjacent occurrences of token pairs in the dataset, whereas the word-piece approach evaluates the likelihood of the dataset under a simple unigram language model over the defined tokens.
0:15:19 Using those finer vocabularies for decoding results in a smaller token inventory, and the number of different tokens in the system often corresponds to the size of the output layer of the neural network.
0:15:33 Thus it also contributes to the computational efficiency of the neural nets.
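For intuition, here is a tiny Python sketch of a single byte-pair-encoding merge step on a made-up corpus; the helper names are illustrative only:

```python
from collections import Counter

# Toy corpus, already split into character tokens.
corpus = [list("low"), list("lower"), list("lowest")]

def most_frequent_pair(corpus):
    """Count adjacent token pairs, the BPE merge criterion from the talk."""
    pairs = Counter()
    for seq in corpus:
        pairs.update(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged = []
    for seq in corpus:
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(seq[i] + seq[i + 1])
                i += 2
            else:
                out.append(seq[i])
                i += 1
        merged.append(out)
    return merged

pair = most_frequent_pair(corpus)      # ('l', 'o') here
corpus = merge_pair(corpus, pair)
print(pair, corpus)                    # token 'lo' now appears in every word
```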
0:15:40 So far I have introduced the advantages of CTC and the advantages of RNN language models.
0:15:47 The next section is about the RNN transducer, which combines the strengths of both models.
0:15:57 As I mentioned, CTC turned out to be insensitive to the dependency between output tokens, and an RNN language model can be used as a component that injects the missing dependency.
0:16:09 By combining CTC-based prediction with an RNN-based context handler, we get the RNN transducer, or RNN-T.
0:16:17 This diagram shows the architecture of the RNN transducer.
0:16:24 This part corresponds to the CTC predictor: it computes a distribution over the next tokens, where the token inventory is augmented with the blank symbol.
0:16:43 And this part corresponds to the RNN language model: this feedback loop makes the prediction dependent on the previous words, and that is what injects the dependency on the previous output tokens.
0:17:00 CTC and RNN-T share a common structure: both use the blank symbol to align the input and output elements.
0:17:09 As I showed, in CTC the blank-augmented alignment sequence roughly corresponds to the HMM state sequence in the conventional acoustic model, and similar to the HMM states, it is handled as a latent variable in the likelihood function.
0:17:26 As usual, this latent variable is marginalized out to define the likelihood function shown here.
0:17:36 Both CTC and RNN-T models with the blank symbol use this simple handcrafted model for the probability of the output word sequence given the alignment sequence: removing the blanks deterministically maps an alignment to its word sequence.
0:17:50 Due to this simple definition of the probability of the words given the alignment, the likelihood function can be simplified in this way.
0:18:05 The difference between CTC and RNN-T appears in the second component: the probability of the alignment given the input feature vectors X.
0:18:17 CTC assumes frame-wise independence here, whereas RNN-T introduces dependency into the predictions; that is, each prediction depends on the previous alignment variables.
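Written out, the two alignment models differ only in their conditioning; with A = (a_1, ..., a_T) the blank-augmented alignment and B the deterministic blank-removal map described above:

$$
p(W \mid X) = \sum_{A:\,\mathcal{B}(A)=W} p(A \mid X),
\qquad
p_{\text{CTC}}(A \mid X) = \prod_t p(a_t \mid X),
\qquad
p_{\text{RNN-T}}(A \mid X) = \prod_t p(a_t \mid X, a_{<t}).
$$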
0:18:33 To explain how the alignment is modeled in RNN-T, this slide shows the case where we have four input vectors, e1, e2, e3, and e4, and a short reference output sequence.
0:18:55 We show the case where the reference output sequence is fixed, as in the training phase.
0:19:02 The joint network, denoted as f here, takes the encoder output corresponding to a given time step, together with the context vector computed by the prediction network.
0:19:18 The first estimate is given by feeding the first encoder output, e1, and the initial context, c0, to the joint network.
0:19:29 If we choose the first output of the model to be blank, the model moves on to reading from the next encoder output, so the model switches its input to e2.
0:19:44 If the second element of the alignment sequence is the first token in the reference, the context vector is updated from c0 to c1, and the model continues predicting whether the next output should be blank or some other token.
0:20:08 For example, if the next output is the second token in the reference, the context vector is updated from c1 to c2.
0:20:19 By repeating the same process until we reach the final step here, we get the posterior probability of a single alignment path.
0:20:33 For the training of such neural networks with latent alignment variables, we need to compute the expectation of the gradients under the posterior distribution of the alignment variables.
0:20:51 Typically, the forward-backward algorithm is used for this purpose.
0:20:56 However, the forward-backward algorithm over general graphs is not computationally efficient; to put it plainly, it is not GPU- or TPU-friendly.
0:21:09 However, the alignment lattices defined in RNN-T and CTC have a grid-shaped structure, and for this kind of structure the forward-backward algorithm is sufficiently fast and can be GPU- or TPU-accelerated.
0:21:26 In this case, we need to compute the sum of probabilities over all the paths through the lattice; the forward variables accumulate, node by node, the sum of the probabilities of all paths reaching each node.
0:21:44 Since the summation terms can be written as shifting and summation operations over whole arrays, they can be implemented efficiently on TPUs, for example.
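To make the grid recursion concrete, below is a small NumPy sketch of the CTC forward pass written as shift-and-add operations over whole state vectors, in the spirit described above; it is a didactic reconstruction, not the speaker's code:

```python
import numpy as np

def ctc_forward_logprob(log_probs, labels, blank=0):
    """Vectorized CTC forward pass over the blank-interleaved lattice.

    log_probs: (T, V) frame-wise log-probabilities from the network.
    labels:    (L,) target token ids (no blanks).
    Returns the CTC log-likelihood log p(labels | input).
    """
    ext = np.full(2 * len(labels) + 1, blank)
    ext[1::2] = labels                       # blank-interleaved label sequence
    T, S = log_probs.shape[0], len(ext)

    alpha = np.full(S, -np.inf)
    alpha[0] = log_probs[0, ext[0]]
    if S > 1:
        alpha[1] = log_probs[0, ext[1]]
    for t in range(1, T):
        stay = alpha
        prev1 = np.concatenate(([-np.inf], alpha[:-1]))            # shift by 1
        prev2 = np.concatenate(([-np.inf, -np.inf], alpha[:-2]))   # skip a blank
        # Skips are only allowed into a non-blank state whose label differs
        # from the label two states back.
        allow_skip = np.zeros(S, dtype=bool)
        allow_skip[2:] = (ext[2:] != blank) & (ext[2:] != ext[:-2])
        prev2 = np.where(allow_skip, prev2, -np.inf)
        alpha = np.logaddexp(np.logaddexp(stay, prev1), prev2) + log_probs[t, ext]
    return np.logaddexp(alpha[-1], alpha[-2]) if S > 1 else alpha[-1]
```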
0:22:02 Next, I would like to introduce encoder-decoder neural networks enhanced with an attention mechanism.
0:22:10 CTC and RNN-T have alignment variables that effectively decide which encoder output vectors should be used for making the prediction of the next token; this kind of information is often referred to as attention.
0:22:27 The point is that we want to model a probability distribution over a time-varying alignment variable a_i, where a_i is the time stamp we should look at when making the prediction for the i-th word.
0:22:47 We can construct this distribution by using a softmax over attention weights computed from the input sequence X and the previous output words y_1 through y_{i-1}.
0:23:01 We combine this attention probability with a simple RNN-based encoder and an RNN-based decoder; this is how the resulting neural network is defined.
0:23:16 That is, we introduce an attention module that takes the information from all the encoder outputs and from the decoder state of the previous time step.
0:23:28 This module internally computes the attention probability I mentioned before, the probability of a_i given the context and the encoder outputs, and it outputs a summary vector computed by taking the expectation of the encoder outputs under this distribution.
0:23:49 The attention probability introduced here is typically defined by introducing a function that evaluates a matching score, or similarity, between the decoder context information and each encoder output, which is then normalized with a softmax.
0:24:08 If this scoring function is also represented by a neural net, all of the components, including the computation of the expectation under this probability distribution, can be optimized by simple backpropagation minimizing a cross-entropy criterion.
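A minimal NumPy sketch of that soft-attention step follows: score each encoder output against the decoder state, normalize with a softmax, and take the expectation. The dot-product scorer is just one common choice, assumed here for brevity:

```python
import numpy as np

def soft_attention(decoder_state, encoder_outputs):
    """decoder_state: (d,); encoder_outputs: (T, d).
    Returns the attention weights and the expected (summary) vector."""
    scores = encoder_outputs @ decoder_state          # similarity per frame
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                          # softmax over time
    summary = weights @ encoder_outputs               # expectation of outputs
    return weights, summary

T, d = 6, 4
enc = np.random.randn(T, d)
dec = np.random.randn(d)
w, ctx = soft_attention(dec, enc)
print(w.shape, ctx.shape)                             # (6,) (4,)
```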
0:24:28 Compared to RNN-T, the alignment here is represented internally in the neural net, whereas RNN-T handles it as a latent variable in the likelihood function, which is the actual objective function.
0:24:42 This sort of attention is called soft attention, or soft alignment, since we use the encoder outputs through an expectation; hard alignment, in contrast, makes the prediction only after deciding which encoder output is to be used.
0:24:57 So soft attention is better in terms of simplicity of implementation and also of optimization.
0:25:04 It is also an advantage that it has only a few modeling assumptions.
0:25:11 However, compared to RNN-T, it is harder to enforce monotonicity of the alignment.
0:25:18 In speech recognition, since the words and the corresponding acoustic features are assumed to appear in the same order, we assume that the alignment should be monotonic.
0:25:31 If we plot the attention probability like this, where the y-axis is the position in the output tokens and the x-axis is the position in the encoded feature sequence, most of the probability mass should lie in the diagonal region.
0:25:50 However, since soft attention is too flexible, we sometimes see off-diagonal peaks like these, and more careful decoding is needed for resolving such problems.
0:26:06 A well-known extension of soft attention is the self-attention used in Transformers.
0:26:12 Ordinary attention can be viewed as a key-value store, where the query is computed from the decoder state, and the keys and values are computed from the encoder outputs.
0:26:25 In self-attention, the attention components are computed differently: the queries, keys, and values are all computed from the previous layer's output.
0:26:34 Roughly speaking, this corresponds to paying attention to the inputs from the other time stamps, where the degree of attention to those inputs is itself computed from the previous layer's output.
0:26:50 The Transformer is a neural net component that applies this self-attention operation multiple times to integrate information from the inputs at the other time stamps.
0:27:01 We can construct both the encoder and the decoder based on this Transformer.
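Here is a compact NumPy sketch of single-head scaled dot-product self-attention, with queries, keys, and values all derived from the same layer input as just described; the random projection matrices stand in for learned parameters:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (T, d) previous-layer outputs. Q, K, V all come from X itself."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # similarity between time steps
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)   # softmax over source positions
    return weights @ V                          # each step mixes all others

T, d = 5, 8
rng = np.random.default_rng(0)
X = rng.standard_normal((T, d))
Y = self_attention(X, *(rng.standard_normal((d, d)) for _ in range(3)))
print(Y.shape)                                  # (5, 8)
```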
0:27:07 Transformers are nowadays used as a drop-in replacement for RNNs: we can use them for constructing the acoustic model of a conventional hybrid speech recognizer, or we can define a Transformer transducer, where Transformers are used instead of the RNNs in the RNN transducer.
0:27:30 The last section of this part introduces recent advances in neural speech recognition.
0:27:36 Even though end-to-end speech recognition and its related technologies are progressing quickly, we still have some disadvantages compared to conventional speech recognizers; I will focus on three of them.
0:27:53 The first one is that with a conventional system it is very easy to integrate side information to bias the recognition result, whereas with end-to-end architectures it is not trivial to do so.
0:28:08 The second point is that end-to-end speech recognizers in general require a huge amount of training data to work well, so methods to overcome data sparsity issues are also important.
0:28:23 The third point is that with a conventional system it is relatively easy to use unpaired data, such as text-only data or untranscribed audio data.
0:28:34 In this section, I introduce some example studies for overcoming those limitations.
0:28:45by things is particularly important for real applications
0:28:50speech recognition all used to find something in the database for example if we want
0:28:55to build a system to make a phone call
0:28:58speech recognizers shows a button name in the user's context are used
0:29:04same kind of behaviour is needed for various kinds of entire eighties
0:29:11like sometimes or what names
0:29:15in commissioner is biasing speech recognizer is very easy it can be done just by
0:29:20integrating additional language models that has enhanced
0:29:25probability for such but in cities
0:29:29well solution for this into and rows is introducing another addition we can see that
0:29:35focuses on
0:29:37predefined set or context vectors
0:29:41i we explain the middle of cortical texture us one text out this the utterance
0:29:46where
0:29:48in this method context for at such as a names or sometimes i encoded to
0:29:53single vector
0:29:55on the other jamaican detect pitch context of it does should be activated to the
0:30:00court to estimate the next word
0:30:04and just an example were normalization probabilities
0:30:08well as it out that
0:30:11talk to
0:30:14is addition we can start to think that some biasing for it is like but
0:30:17fruit are you all want to brew joe's actually corresponding to some names
0:30:25and this additional input vector representing context
0:30:28is expected to have the rest of the decoding process
0:30:32so after saying after the user saying talk to it is expected that some i
0:30:38can imbue for all
0:30:41and this context is attention mechanism can
0:30:46so we still behave via by a by adding additional probability to joe's a name
0:30:53context against us
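A rough sketch of that contextual-biasing step: encode each bias phrase into a vector, attend over them with the decoder state, and append the resulting summary to the decoder input. The names and dimensions are illustrative, not the published model:

```python
import numpy as np

def bias_attention(decoder_state, phrase_embeddings):
    """Attend over per-phrase context vectors (one per name/song title)."""
    scores = phrase_embeddings @ decoder_state
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ phrase_embeddings                      # context summary vector

d = 8
rng = np.random.default_rng(1)
phrases = rng.standard_normal((3, d))                 # e.g. 3 contact names, pre-encoded
dec_state = rng.standard_normal(d)
context = bias_attention(dec_state, phrases)
decoder_input = np.concatenate([dec_state, context])  # fed to next-word predictor
print(decoder_input.shape)                            # (16,)
```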
0:31:00 The next topic is a multi-dialect model for overcoming data sparsity; I will introduce a method proposed in prior work.
0:31:12 The method is simple: it just adds a one-hot vector representing the dialect as an additional input, and it uses a dataset constructed by pooling the data from all the dialects.
0:31:25 If we feed the correct dialect ID input consistently during training and decoding, a speech recognizer trained in this way can switch its mode depending on the dialect of the input data.
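The mechanism is literally a feature concatenation. Here is a tiny sketch, with made-up dimensions, of appending a one-hot dialect ID to every acoustic frame:

```python
import numpy as np

def add_dialect_id(features, dialect, num_dialects):
    """features: (T, d) acoustic frames; dialect: integer ID.
    Returns (T, d + num_dialects) frames with the one-hot ID appended."""
    one_hot = np.zeros(num_dialects)
    one_hot[dialect] = 1.0
    tiled = np.tile(one_hot, (features.shape[0], 1))  # same ID on every frame
    return np.concatenate([features, tiled], axis=1)

frames = np.random.randn(100, 80)        # e.g. 100 frames of 80-dim filterbanks
augmented = add_dialect_id(frames, dialect=2, num_dialects=8)
print(augmented.shape)                   # (100, 88)
```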
0:31:41 These are the results.
0:31:43 From this row showing the baselines, we see that just training a separate end-to-end speech recognizer per dialect is not a good idea: the performance is significantly worse for dialects with smaller datasets.
0:32:02 This row shows the results with transfer learning; here, transfer learning means first pre-training on the pooled dataset and then continuing the training on the matched, per-dialect dataset.
0:32:17 Transfer learning can indeed improve the results; however, we could obtain a further improvement just by integrating the dialect ID input.
0:32:29 Like the contextual-biasing method I explained before, having an additional metadata input can be helpful for overcoming the lack of data.
0:32:42 So designing a neural architecture that can properly handle such additional metadata inputs is increasingly important nowadays.
0:32:54 The last topic is about the use of unpaired data.
0:32:58 As I have already mentioned, end-to-end speech recognition requires a huge amount of training data, and the situation is even worse because it is not trivial how to use unpaired data.
0:33:10 Conventional speech recognition can at least leverage text-only data for language modeling, and it is also relatively easy to make some use of audio-only data.
0:33:26 To overcome these issues, unsupervised pre-training is now gaining attention.
0:33:35 Here, we want to optimize the encoder of a speech recognizer only by using non-transcribed data.
0:33:41 Of course, it is not possible to perform cross-entropy training against the true labels if the data is not transcribed.
0:33:50 Inspired by methods developed in the image processing field, recent methods use the mutual information between the context information and the instantaneous, local information.
0:34:03 Mutual information is in general very difficult to optimize, but recent methods work around this with a method called noise contrastive estimation.
0:34:18 Here I want to explain the famous network called wav2vec 2.0; this is a diagram of it.
0:34:28 This method aims at pre-training a CNN-based encoder by maximizing the mutual information between the encoder outputs and their surrounding context.
0:34:41 The surrounding context is actually summarized by a Transformer.
0:34:48 In the InfoNCE formulation used here, we basically want to maximize the similarity between the projected encoder output and the context vector.
0:34:59 However, there is a trivial solution: if we only maximize the similarity between the encoder output and the context vector, the similarity becomes maximal when the encoder maps all the data points into a single constant output, say, the zero vector.
0:35:19 InfoNCE therefore introduces negative samples: encoder outputs drawn from random time steps; it tries to minimize the similarity between the context and those randomly sampled encoder outputs.
0:35:34 So the objective maximizes the similarity between the context and the time-aligned encoder output, but it minimizes the similarity between the context and the randomly sampled encoder outputs.
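A minimal PyTorch-style sketch of that contrastive objective, with one positive (time-aligned) pair against random negatives; the cosine similarity and temperature value are common choices assumed here, not details given in the talk:

```python
import torch
import torch.nn.functional as F

def info_nce(context, positive, negatives, temperature=0.1):
    """context, positive: (d,); negatives: (K, d) encoder outputs from
    random time steps. Returns the InfoNCE loss for one position."""
    cands = torch.cat([positive.unsqueeze(0), negatives], dim=0)   # (K+1, d)
    sims = F.cosine_similarity(context.unsqueeze(0), cands) / temperature
    # Cross-entropy with the positive at index 0: pulls the aligned pair
    # together and pushes the random pairs apart.
    return F.cross_entropy(sims.unsqueeze(0), torch.zeros(1, dtype=torch.long))

d, K = 16, 10
ctx, pos, negs = torch.randn(d), torch.randn(d), torch.randn(K, d)
print(info_nce(ctx, pos, negs).item())
```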
0:35:51 wav2vec 2.0 is very famous because of its surprising performance on speech recognition problems.
0:35:58 It is reported that only a few minutes of transcribed training data are sufficient for obtaining a working end-to-end speech recognizer, if the encoder is pre-trained on roughly fifty thousand hours of audio with this contrastive training.
0:36:15 So this result shows that untranscribed audio, which is much easier to collect than transcribed data, can be extremely valuable.
0:36:31 OK, that concludes my part; thank you for watching, and let me hand over to my colleague for the next part, which covers the software aspects of end-to-end speech recognition.
0:36:47 Hi, I am also from Google Research, and I will talk about toolkit implementations for end-to-end neural speech recognition.
0:36:57 Today, I will first talk about the available toolkits, in about five minutes.
0:37:04 Then we will try decoding with a pretrained model in the toolkit, introducing the pretrained-model collection.
0:37:10 After that, we will train a neural speech recognition model from scratch, in about ten minutes.
0:37:19 In the last part, we will show how to extend the models and tasks introduced in the earlier section; for example, how to try the Transformer or other state-of-the-art components, something like that.
0:37:38 So, first of all, I will show an overview of the toolkits.
0:37:43 This table, taken from a recent ACL paper, briefly summarizes a comparison between the various toolkits.
0:38:01 In this table, all of the listed toolkits support automatic speech recognition tasks.
0:38:11 Some of them also support different tasks, like speech translation and text-to-speech.
0:38:25 Note that pre-trained models are available in several of the toolkits.
0:38:34 In this tutorial, we will focus on ESPnet, because it supports many tasks for end-to-end modeling, and it also supports tools to train the models, so I think it is easy to try.
0:38:58 Its implementation is hosted on GitHub, and if you want to know more detailed results, they are described in this paper.
0:39:07 This paper covers speech recognition, text-to-speech, and speech translation; the speech enhancement features will be coming soon, so please stay tuned.
0:39:28 In this tutorial, we will try ESPnet2. It is a major update from the ESPnet1 toolkit.
0:39:40 There are differences between them; let me mention the major ones.
0:39:46 For example, ESPnet1 depends on many external binaries, for example Kaldi's feature extraction. ESPnet2, in contrast, takes a minimalist approach: it mainly depends on PyTorch, so it can be installed and integrated easily.
0:40:08 The models themselves are almost the same; in particular, the ASR and TTS models are mostly unchanged.
0:40:17 However, the set of supported tasks in ESPnet2 is still a work in progress; pretrained models are already available, though, so it is nice to try if you are interested in ASR, TTS, and also speech enhancement.
0:40:36 If you are interested in ESPnet1, please visit this URL; it shows the usage of ESPnet1, and there is also a speech tutorial there.
0:40:52 This tutorial has hands-on examples hosted on Google Colab; the Colab notebooks are linked from the web page, and you can just copy the samples into your own notebook to try them.
0:41:11 But please make sure that you are using a GPU runtime in Google Colab when you visit this page, because some of the code we use in this tutorial needs it.
0:41:27 This slide introduces pretrained models; that means models that were already trained by someone on some task and dataset.
0:41:43 ESPnet hosts such trained models in its model-zoo repository, with the model files hosted on Zenodo.
0:41:55 For example, for the ASR task, there are models trained on LibriSpeech for English speech recognition, on CSJ for Japanese, and on several other corpora.
0:42:11 TTS also has pretrained models there.
0:42:17 If you want to see the full list of the available models, please see this URL.
0:42:27 This slide shows how to use them in Python.
0:42:35 To try it, we first download the checkpoint of the model and unpack it into a model object.
0:42:47 After that, you can feed some waveform data, as a NumPy array, into the model, and it transcribes it; let me show the results.
0:43:00to do this results
0:43:02now so rats
0:43:05get started and crawl
0:43:09so basically the you all out in the page eight
0:43:14you will find
0:43:17e
0:43:18note of it
0:43:19like this
0:43:21therefore trying we will
0:43:24in still use
0:43:28and before
0:43:30running at feast make sure you all collecting
0:43:34the i could a long time
0:43:37it is
0:43:38available
0:43:40on
0:43:42right corner
0:43:44and priest select the change runtime five and
0:43:51check the gpu we selected
0:43:55note that the u is not
0:43:57what it and see if you might be
0:44:01so you want the training
0:44:04 So, first, we install ESPnet: because it is not installed by default, we just pip-install it in a single cell.
0:44:14 It can take a while, because it pulls in many dependencies.
0:44:26 As I said, ESPnet provides pretrained models.
0:44:33 First, I downloaded a waveform file from the LibriSpeech dataset, and I will try to perform recognition on this downloaded wave.
0:44:53 To do this, we first download a pretrained model.
0:45:02 For example, this model was trained by Shinji Watanabe on the LibriSpeech ASR task, and it uses the Transformer architecture for the neural networks.
0:45:19 Then I read the waveform here and feed it into the model object we built.
0:45:30 The output is an n-best list of results, so I selected the best one to see how it looks.
0:45:40 So this is the result from the pretrained LibriSpeech model; let's also play the waveform and check it against the transcription.
0:45:58 It performs pretty well.
0:46:00 So let's go back to the slides.
0:46:07 Next, I will show you how to use ESPnet for the predefined tasks.
0:46:13 The directory named egs2 contains all of the recipe datasets inside.
0:46:21 In it, you will find directories with the same files and structure; for example, here is the asr1 recipe directory.
0:46:30 Basically, you run the run.sh script created inside each of them, and you reproduce the results reported in its README file.
0:46:41 So let me show you what kind of stages are inside the run.sh; you can start from any of the stages, or run all of them, and the stages to run are specified with command-line flags.
0:47:00 Stages 1 to 5 perform data preparation, stages 6 to 8 perform the language model training, and stages 9 and 10 perform the ASR training; after that, the evaluation is performed.
0:47:16 And finally, you can upload your trained model so that other people can use it.
0:47:23 Let me give the details of the data preparation stages.
0:47:27 In this tutorial, we focus on the AN4 task: a very small dataset of recorded alphanumeric strings from CMU, which is nice for a fast experiment.
0:47:38 The very first stage downloads the AN4 data; the next stages prepare the raw files and format everything into Kaldi-style data directories; after that, we perform some preprocessing of the speech and text data.
0:48:00 For the text, in this case we use the sentencepiece tokenizer to build the token representation, and that representation is used in the training and evaluation stages.
0:48:18 In stages 6 to 8, we perform the language model training and intermediate evaluation, like perplexity computation; after that, the ASR training, decoding, and evaluation are performed.
0:48:34 You can monitor the training with TensorBoard, even inside Google Colab; for example, the accuracy of the softmax output or the CTC output can be monitored during training.
0:48:55 And this is an example of what the evaluation and scoring results look like.
0:49:01 ESPnet provides a convenient tool for reformatting the results into a Markdown table, which is more readable; as you can see here, for each subset it shows the word error rate, and also the character error rate and the token error rate.
0:49:22 And finally, once we have the trained model, you can use exactly the same Python API that I showed in the beginning to run inference with your own model, if you specify the configuration and checkpoint files to use.
0:49:43 So now let's go to the Colab notebook again.
0:49:52 Let's see what the example directory looks like.
0:50:00 You can use shell commands, like in a usual notebook, and you can also use the file explorer from this icon.
0:50:15 You will find that many datasets are available under egs2; in this tutorial, we focus on AN4, and inside it, on this asr1 task.
0:50:30 For now, we run the run.sh in this cell.
0:50:37 Before running it, we need to install a few more dependencies to run the training: some Kaldi utilities are unfortunately still required at the moment, so we download prebuilt binaries for them, and after installing everything, we run the run.sh in the cell here.
0:51:08 So, at the start, the recipe downloads the AN4 data from the CMU site, where it is freely available; after the download finishes, the data preparation begins.
0:51:28 You can see here that many log files are created; then the data preparation stages run, and stage 5 performs the tokenization and text statistics collection.
0:51:44 These files are the results from the sentencepiece training; as I said, the AN4 recipe uses sentencepiece for the tokenization.
0:51:56 After the sentencepiece training is finished, the language model will be trained; let's see here.
0:52:04 And after that, the ASR training starts here; however, I fast-forward the training in this demo. When I ran it, the training finished in about ten minutes, which I think is reasonable.
0:52:22 So let's see what the prepared data looks like; the prepared data is stored in the data directory, and we can find the prepared files here.
0:52:33 For example, this is the transcript: the text file is here, and its first column shows the utterance ID, with the corresponding text after it.
0:52:47 You will find the corresponding speech files listed under the same IDs in the wav.scp file, so if you look up an utterance ID from here, you can find its audio there.
0:53:00 After the training, let's look at the training outputs. The exp directory is used as the working directory of the training phase, and it stores many things: for example, the log files, some of the checkpoints, and also the attention-weight plots we will look at here; the configuration is also recorded there in a YAML file.
0:53:30 Let's see how the configuration file looks.
0:53:36 The configuration YAML records every piece of information used during the training.
0:53:43 Here, for example, are the results of the tokenization setup: the token list built from the dataset, the path to the sentencepiece model binary used for it, and the structure of the network.
0:54:13 During training, you can also monitor it with TensorBoard, inside Google Colab or in your own environment.
0:54:22 You can follow, for example, how the accuracy improves after each epoch of training.
0:54:32 Next, let's see the other information: this is one of the attention visualizations stored in the experiment directory.
0:54:46 Since this is a very short utterance, the plot does not really show a long diagonal alignment pattern, but I think it looks okay.
0:55:00 So here is the evaluation result.
0:55:04 As I said, the last stage prints it, and for more details I just pasted it into the notebook cell.
0:55:14 You can see the final error rates on the test set here; for example, the sentence error rate is 64.9, and the word error rate is 6.5.
0:55:28 OK, so now let's use this model for inference.
0:55:34 First of all, we need to specify the checkpoint to use; I decided to use this one here, because according to the validation accuracy it seems to be the best.
0:55:52 We then get the recognition result, and we can play the corresponding speech to compare; the transcription matches what the speaker says.
0:56:00more than seriously speaker that is it is more
0:56:04okay so
0:56:06thanks for putting the stuff there
0:56:10this that stuff there will explain how to extend models and pat task
0:56:16so that's
0:56:18the total section in
0:56:20he interest to
0:56:22and cortical architecture and transparent and our transducer
0:56:27when you have regression
0:56:28there are they how to use that
0:56:31it's
0:56:33this is the answer
0:56:35sometimes
0:56:36like i and four task deftly already says of the predefined
0:56:42plot configuration younger fought so you can just
0:56:47that's fine why is a coefficient and take a look at that you going and
0:56:52there are none of the values of a number of the units
0:56:56inside younger five
0:56:59i think it mostly goal of this fine is that yes it has test trying
0:57:03many things like activation or
0:57:06where tight so
0:57:08make things like that
0:57:11 However, if you cannot find what you need there, you can extend the models yourself.
0:57:18 For example, the RNN, Transformer, and transducer encoder and decoder networks all implement these common interfaces, which makes it easy to swap modules and keeps the complexity of those variants out of the main implementation.
0:57:41 So, the ESPnet ASR model uses these abstractions: it receives the encoder and the decoder implementations, and passes the encoded speech features and the text targets to them, as explained in this figure.
0:58:14 And you can switch the implementations from command-line arguments; this is the actual code for that in the source.
0:58:19just a you implementation in this
0:58:23so score
0:58:27and
0:58:29if you want to send your task like you wanna
0:58:35try sub tasks you on the is that are it is well for possible
0:58:41then you extend that i was task
0:58:44so existing asr was tedious task implements this
0:58:49that is
0:58:50and
0:58:52to get the this
0:58:55task i don't think feature
0:58:57like a distributed training on divan sampling but checkpoint rejoining like that
0:59:04as the was gonna section five we show you how used in payments
0:59:10that
0:59:12models
0:59:13so that is it yes did have rivets e
0:59:17and that and check the yes to implementation and
0:59:26okay
0:59:27the out into for some so
0:59:30and there is
0:59:32model definition here
0:59:35so as i said in the us by a base
0:59:41it implements have a sort the svm modeling the phase here
0:59:46and actually simply call use the board mess of
0:59:52the read and the most value is
0:59:54so received a for the nist
0:59:56it's here
0:59:58so increase to use this be used in baton text output as seen that argument
1:00:04and then it we kinda rate and was fine tuning full
1:00:09euros the angle tunnels
1:00:12so well that's in there
1:00:15the first thing go the network coding rates the without the input of the think
1:00:21of the networks
1:00:22still this angle regularization and
1:00:25well you see that output and it and
1:00:30this is the outfit a within good as input and
1:00:33they're pretty they're
1:00:36text target
1:00:37and calculated function here and the same their thing having in thus it is inference
1:00:44so this is exactly same impotent target as well as the political there that those
1:00:51are anti do the same thing
1:00:54yes exactly same arguments
1:00:57and then combine
1:01:01thus values i-th honours the scrolling nazi it's quite easy and
1:01:06same as the so we into using their you know section
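A stripped-down sketch of that forward logic, a weighted hybrid of attention and CTC losses; the module names mirror the description above but are simplified stand-ins for the real ESPnetASRModel code:

```python
import torch

class TinyASRModel(torch.nn.Module):
    """Minimal hybrid CTC/attention loss combination, as described above."""

    def __init__(self, encoder, decoder, ctc, ctc_weight=0.3):
        super().__init__()
        self.encoder, self.decoder, self.ctc = encoder, decoder, ctc
        self.ctc_weight = ctc_weight

    def forward(self, speech, speech_lengths, text, text_lengths):
        # 1. Encode the (augmented) input speech.
        enc_out, enc_lengths = self.encoder(speech, speech_lengths)
        # 2. Attention branch: decoder consumes encoder output + text target.
        loss_att = self.decoder(enc_out, enc_lengths, text, text_lengths)
        # 3. CTC branch: exactly the same encoder output and target.
        loss_ctc = self.ctc(enc_out, enc_lengths, text, text_lengths)
        # 4. Combine the two losses with a fixed weight.
        return self.ctc_weight * loss_ctc + (1 - self.ctc_weight) * loss_att
```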
1:01:11 So, thanks for watching this tutorial.