0:00:06 Hi everyone. Thank you for joining this tutorial session on neural automatic speech recognition. I am a researcher from Google Research.
0:00:21 This sixty-minute tutorial will be organized into two parts. The first part, presented by me, explains the basic formulations and some algorithms for neural speech recognition.
0:00:33 The second part covers software and implementations for neural speech recognition, and will be presented by my coworker.
0:00:43 Let's get into the first part.
0:00:47 First of all, I want to define what neural, or end-to-end, speech recognition is.
0:00:51 In this session I use this term for techniques that recognize speech with a single end-to-end neural network, though those techniques can sometimes also be applied to non-end-to-end speech recognition systems.
0:01:07 End-to-end speech recognition is a type of speech recognition that involves neural networks converting acoustic features directly into words.
0:01:17 As you may already know, a conventional speech recognizer consists of roughly three parts: an acoustic model, a pronunciation model, and a language model.
0:01:29 Each model represents a probabilistic distribution, and a search algorithm finds the best possible hypothesis from those models.
0:01:43 The end-to-end approach uses a single system instead: one neural network that directly maps the feature sequence to the word sequence is used to represent the whole process of speech recognition.
0:01:59 One obvious advantage of this approach is the simplicity of the system: a conventional recognizer, with its search algorithm over several internal components, can be very complicated to build and maintain.
0:02:14 Recently, the end-to-end approach has even been extended to directly handle raw waveform signals instead of precomputed feature vectors.
0:02:25 This session explains how to design those neural networks that directly output words from feature vectors or raw waveform signals.
0:02:45 In this first part, I will explain three approaches for end-to-end speech recognition, and also recent advances over those three.
0:02:58 Let's begin with the first section.
0:03:02 Most classical speech recognition models use this factorization. It describes the generative story of the feature vector sequence X and the word sequence W, and it models the joint distribution of the two variables by introducing additional latent variables: the phoneme sequence and the related HMM state sequence.
0:03:32 The joint distribution is usually decomposed by assuming that the phoneme sequence is generated depending on the word sequence, that the HMM states are generated depending on the phoneme sequence, and that the feature vectors are generated depending on the HMM states.
0:03:50 So here we typically assume conditional independence between the introduced variables.
0:03:58 This assumption looks reasonable, but it also introduces some limitations.
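For reference, the decomposition just described can be written out as follows; this is the standard HMM-based formulation, where P and S denote the phoneme and HMM state sequences introduced above:

$$
p(X, W) \;=\; \sum_{P,\,S} \underbrace{p(X \mid S)}_{\text{acoustic model}}\;
\underbrace{p(S \mid P)}_{\text{HMM}}\;
\underbrace{p(P \mid W)}_{\text{pronunciation model}}\;
\underbrace{p(W)}_{\text{language model}}
$$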
0:04:06 In the conventional approach, deep learning techniques are introduced into each component of this decomposition.
0:04:14 For example, for the word sequence probability, an RNN language model is often used to get better predictions of words.
0:04:23 For acoustic modeling, people often use a time-delay neural network or a recurrent neural network for modeling the emission probability of the feature sequence.
0:04:35 In the next slides, I review those works that enhance individual components with deep learning techniques.
0:04:46 DNN-HMM hybrid approaches are a very famous way to enhance conventional acoustic models.
0:04:54 In this approach, the definition of the emission probability used as the acoustic model of the conventional speech recognizer is changed: the probability of the features given the HMM state is transformed into a probability that is proportional to this ratio.
0:05:12 This is the ratio between the predictive probability of the HMM state given the feature vector, and the marginal probability of the HMM state.
0:05:23 The predictive distribution is modeled by a neural net, and the marginal distribution is modeled by a simple categorical distribution.
0:05:33 This is a convenient way to bring the expressive power of neural nets into conventional speech recognizers. However, it has some problems.
0:05:48 First, because the predictive distribution and the marginal distribution are independently parameterized with different parameters, the Bayes rule used here is only an approximation.
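In symbols, the hybrid trick replaces the emission probability with a scaled neural-net posterior; this is the usual pseudo-likelihood formulation, where p_NN is the network's predictive distribution and the hatted prior is estimated separately:

$$
p(x \mid s) \;=\; \frac{p(s \mid x)\, p(x)}{p(s)} \;\propto\; \frac{p_{\text{NN}}(s \mid x)}{\hat{p}(s)}
$$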
0:06:09 Second, it is known that an HMM state label is a very difficult target for a classifier to estimate; by classifiers here I mean neural-net classifiers.
0:06:24 For example, for some stationary phonemes, it is very difficult to classify whether an acoustic feature vector belongs to the first half of the phoneme segment or the second half of the phoneme segment.
0:06:40 This fact makes training and prediction of the classifier more confusing, or in other words, less stable.
0:06:51 Connectionist temporal classification, or CTC, can be regarded as a remedy for that problem.
0:06:58 Its idea is that each label is represented by only a few points in the output sequence. This is done by introducing a dummy label, the blank symbol, and associating most of the input vectors with the blank; only a few input frames, typically around the center of a segment, contribute to the final output.
0:07:25 This diagram shows a speech-to-text neural network with the CTC approach. In this case, we have an input sequence with eight elements; each input vector is classified into the label set augmented with the blank symbol, and the final result is obtained by removing the blank symbols from the output.
0:07:52 One advantage of this viewpoint is that we no longer need to estimate HMM state labels with an existing conventional speech recognition system, so it is possible to train the neural network from scratch.
0:08:08 Another advantage of CTC is its generality: we can use it for arbitrary sequence-to-sequence tasks, not only speech recognition.
0:08:21 So it can be used either to estimate phoneme sequences, as in conventional systems, or to estimate word or grapheme sequences directly, as in end-to-end approaches.
0:08:31 However, each label here is estimated independently, so CTC is not able to model the dependency between output labels.
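To make this concrete, here is a minimal PyTorch sketch of CTC training and greedy blank-removal decoding; the toy shapes and three-symbol vocabulary are assumptions made for this example, not details from the talk:

```python
import torch
import torch.nn.functional as F

# Toy setup: T=8 input frames, vocabulary {blank=0, 'a'=1, 'b'=2}.
T, V, batch = 8, 3, 1
logits = torch.randn(T, batch, V, requires_grad=True)   # network outputs
log_probs = F.log_softmax(logits, dim=-1)

target = torch.tensor([[1, 2]])                         # reference "a b"
input_lengths = torch.tensor([T])
target_lengths = torch.tensor([2])

# CTC loss marginalizes over all blank-augmented alignments.
loss = F.ctc_loss(log_probs, target, input_lengths, target_lengths, blank=0)
loss.backward()

# Greedy decoding: pick the best label per frame, then collapse repeats
# and remove blanks, as described in the talk.
best = log_probs.argmax(-1).squeeze(1).tolist()         # e.g. [0,1,1,0,0,2,0,0]
collapsed = [p for i, p in enumerate(best)
             if p != 0 and (i == 0 or p != best[i - 1])]
print(collapsed)                                        # e.g. [1, 2] -> "a b"
```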
0:08:44 Let me elaborate on the independence assumption introduced by CTC.
0:08:50 It is known that the label-and-blank structure, the graphical model behind CTC, can be represented by a finite state transducer.
0:09:00 If we represent them as transducers, we can see that the conventional left-to-right HMM and the CTC neural net have quite similar topologies.
0:09:12 So, in fact, using only CTC for speech recognition is very similar to doing conventional speech recognition without a language model.
0:09:23 However, CTC still has some good properties.
0:09:28 The first is that it combines well with downsampling approaches in neural networks; conventional GMM-based alignment does not work very well with downsampled features.
0:09:46 Also, even after obtaining an HMM state alignment, the conventional approach tries to associate a single label with each time step; that makes the alignment ambiguous around phoneme boundaries, and this ambiguity becomes worse if the features are downsampled.
0:10:08 Since CTC only classifies something like the center of each segment, it is less sensitive to this boundary ambiguity.
0:10:16 The second advantage, related to the first, is that we do not need to classify the internal structure of phonemes, like the first and second half of a phoneme; this makes training more stable and prediction less complicated.
0:10:32 It also means that CTC with neural nets, combined with a search algorithm, tends to produce sharply peaked scores for each frame.
0:10:45 So using CTC even within classical speech recognition is a good idea, because it needs no frame-level alignment between the features and the labels.
0:10:56 Even if CTC is used only as a part of the system, we still have the advantages described before: downsampling can be applied, and it can form a good combination with the search algorithm.
0:11:14 Such combinations have been presented in past work.
0:11:17 Indeed, there is a class of hybrid approaches using CTC, where the CTC model simply replaces the acoustic model of a conventional ASR system.
0:11:34 Let's move on to the next component.
0:11:37 The language model can also be enhanced by introducing recurrent neural nets, for example LSTMs, long short-term memory neural nets, used as autoregressive models.
0:11:49 An RNN language model predicts the distribution over the next word given the history of previously estimated words.
0:12:00 Unlike earlier n-gram language model approaches, an RNN language model embeds a word and its context into a continuous vector, and uses it to make the prediction of the next word.
0:12:14 Since we use recurrence for making this continuous context representation, RNN language models can in theory handle an unbounded length of word history.
0:12:28 Even so, in practice they are often very difficult to optimize; still, we usually see significant improvements over n-gram language models.
0:12:38 There is a downside in the context representation of RNN-based models: in n-gram approaches, the number of possible contexts is bounded by the number of different word histories, which is finite.
0:12:55 However, RNN language models do not put such a bound on the contexts, so each different word history gets its own distinct context representation.
0:13:08 One could say this is a serious downside for computation, but in fact it is not that inefficient.
0:13:17 The reason is that this continuous representation typically requires less space to store in memory than enumerating the word histories.
0:13:28 If we compare the sizes of speech recognition systems built with the conventional approach and with the neural network approach, the sizes are actually comparable, or the neural nets are even smaller than language models expanded into weighted finite state transducers.
0:13:46 So it might be a bit counterintuitive, but neural net approaches actually fit very well with mobile devices too, especially if the device has an accelerator for matrix multiplication, for example.
0:14:06 Another important property concerns the computational efficiency with respect to tokenization.
0:14:12 n-gram models used in conventional approaches take a short context for making a prediction, so each token used to be long enough, typically a whole word, for making an accurate prediction.
0:14:24 However, RNN language models can handle long contexts; that means we can use finer tokenization methods, with sub-word tokens, or maybe even grapheme-based tokens.
0:14:38 Two tokenizers are commonly used with neural language models. Both are very similar in the sense that they tokenize the data by matching existing tokens: they start from small tokens, such as characters, and gradually merge them, and both select the pair of tokens to be merged according to some criterion.
0:15:02 Byte pair encoding uses the number of adjacent occurrences of token pairs in the dataset, whereas the word-piece approach evaluates the likelihood of the dataset under a simple unigram language model over the defined tokens.
0:15:19 Using those finer vocabularies for decoding results in a smaller token inventory, and the number of different tokens in the system often corresponds to the size of the output layer of the neural network.
0:15:33 Thus it also contributes to the computational efficiency of the neural nets.
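For intuition, here is a tiny Python sketch of a single byte-pair-encoding merge step on a made-up corpus; the helper names are illustrative only:

```python
from collections import Counter

# Toy corpus, already split into character tokens.
corpus = [list("low"), list("lower"), list("lowest")]

def most_frequent_pair(corpus):
    """Count adjacent token pairs, the BPE merge criterion from the talk."""
    pairs = Counter()
    for seq in corpus:
        pairs.update(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged = []
    for seq in corpus:
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(seq[i] + seq[i + 1])
                i += 2
            else:
                out.append(seq[i])
                i += 1
        merged.append(out)
    return merged

pair = most_frequent_pair(corpus)      # ('l', 'o') here
corpus = merge_pair(corpus, pair)
print(pair, corpus)                    # token 'lo' now appears in every word
```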
0:15:40 So far I have introduced the advantages of CTC and the advantages of RNN language models.
0:15:47 The next section is about the RNN transducer, which combines the strengths of both models.
0:15:57 As I mentioned, CTC turned out to be insensitive to the dependency between output tokens, and an RNN language model can be used as a component that injects the missing dependency.
0:16:09 By combining CTC-based prediction with an RNN-based context handler, we get the RNN transducer, or RNN-T.
0:16:17 This diagram shows the architecture of the RNN transducer.
0:16:24 This part corresponds to the CTC predictor: it computes a distribution over the next tokens, where the token inventory is augmented with the blank symbol.
0:16:43 And this part corresponds to the RNN language model: this feedback loop makes the prediction dependent on the previous words, and that is what injects the dependency on the previous output tokens.
0:17:00 CTC and RNN-T share a common structure: both use the blank symbol to align the input and output elements.
0:17:09 As I showed, in CTC the blank-augmented alignment sequence roughly corresponds to the HMM state sequence in the conventional acoustic model, and similar to the HMM states, it is handled as a latent variable in the likelihood function.
0:17:26 As usual, this latent variable is marginalized out to define the likelihood function shown here.
0:17:36 Both CTC and RNN-T models with the blank symbol use this simple handcrafted model for the probability of the output word sequence given the alignment sequence: removing the blanks deterministically maps an alignment to its word sequence.
0:17:50 Due to this simple definition of the probability of the words given the alignment, the likelihood function can be simplified in this way.
0:18:05 The difference between CTC and RNN-T appears in the second component: the probability of the alignment given the input feature vectors X.
0:18:17 CTC assumes frame-wise independence here, whereas RNN-T introduces dependency into the predictions; that is, each prediction depends on the previous alignment variables.
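Written out, the two alignment models differ only in their conditioning; with A = (a_1, ..., a_T) the blank-augmented alignment and B the deterministic blank-removal map described above:

$$
p(W \mid X) = \sum_{A:\,\mathcal{B}(A)=W} p(A \mid X),
\qquad
p_{\text{CTC}}(A \mid X) = \prod_t p(a_t \mid X),
\qquad
p_{\text{RNN-T}}(A \mid X) = \prod_t p(a_t \mid X, a_{<t}).
$$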
0:18:33 To explain how the alignment is modeled in RNN-T, this slide shows the case where we have four input vectors, e1, e2, e3, and e4, and a short reference output sequence.
0:18:55 We show the case where the reference output sequence is fixed, as in the training phase.
0:19:02 The joint network, denoted as f here, takes the encoder output corresponding to a given time step, together with the context vector computed by the prediction network.
0:19:18 The first estimate is given by feeding the first encoder output, e1, and the initial context, c0, to the joint network.
0:19:29 If we choose the first output of the model to be blank, the model moves on to reading from the next encoder output, so the model switches its input to e2.
0:19:44 If the second element of the alignment sequence is the first token in the reference, the context vector is updated from c0 to c1, and the model continues predicting whether the next output should be blank or some other token.
0:20:08 For example, if the next output is the second token in the reference, the context vector is updated from c1 to c2.
0:20:19 By repeating the same process until we reach the final step here, we get the posterior probability of a single alignment path.
0:20:33 For the training of such neural networks with latent alignment variables, we need to compute the expectation of the gradients under the posterior distribution of the alignment variables.
0:20:51 Typically, the forward-backward algorithm is used for this purpose.
0:20:56 However, the forward-backward algorithm over general graphs is not computationally efficient; to put it plainly, it is not GPU- or TPU-friendly.
0:21:09 However, the alignment lattices defined in RNN-T and CTC have a grid-shaped structure, and for this kind of structure the forward-backward algorithm is sufficiently fast and can be GPU- or TPU-accelerated.
0:21:26 In this case, we need to compute the sum of probabilities over all the paths through the lattice; the forward variables accumulate, node by node, the sum of the probabilities of all paths reaching each node.
0:21:44 Since the summation terms can be written as shifting and summation operations over whole arrays, they can be implemented efficiently on TPUs, for example.
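To make the grid recursion concrete, below is a small NumPy sketch of the CTC forward pass written as shift-and-add operations over whole state vectors, in the spirit described above; it is a didactic reconstruction, not the speaker's code:

```python
import numpy as np

def ctc_forward_logprob(log_probs, labels, blank=0):
    """Vectorized CTC forward pass over the blank-interleaved lattice.

    log_probs: (T, V) frame-wise log-probabilities from the network.
    labels:    (L,) target token ids (no blanks).
    Returns the CTC log-likelihood log p(labels | input).
    """
    ext = np.full(2 * len(labels) + 1, blank)
    ext[1::2] = labels                       # blank-interleaved label sequence
    T, S = log_probs.shape[0], len(ext)

    alpha = np.full(S, -np.inf)
    alpha[0] = log_probs[0, ext[0]]
    if S > 1:
        alpha[1] = log_probs[0, ext[1]]
    for t in range(1, T):
        stay = alpha
        prev1 = np.concatenate(([-np.inf], alpha[:-1]))            # shift by 1
        prev2 = np.concatenate(([-np.inf, -np.inf], alpha[:-2]))   # skip a blank
        # Skips are only allowed into a non-blank state whose label differs
        # from the label two states back.
        allow_skip = np.zeros(S, dtype=bool)
        allow_skip[2:] = (ext[2:] != blank) & (ext[2:] != ext[:-2])
        prev2 = np.where(allow_skip, prev2, -np.inf)
        alpha = np.logaddexp(np.logaddexp(stay, prev1), prev2) + log_probs[t, ext]
    return np.logaddexp(alpha[-1], alpha[-2]) if S > 1 else alpha[-1]
```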
0:22:02 Next, I would like to introduce encoder-decoder neural networks enhanced with an attention mechanism.
0:22:10 CTC and RNN-T have alignment variables that effectively decide which encoder output vectors should be used for making the prediction of the next token; this kind of information is often referred to as attention.
0:22:27 The point is that we want to model a probability distribution over a time-varying alignment variable a_i, where a_i is the time stamp we should look at when making the prediction for the i-th word.
0:22:47 We can construct this distribution by using a softmax over attention weights computed from the input sequence X and the previous output words y_1 through y_{i-1}.
0:23:01 We combine this attention probability with a simple RNN-based encoder and an RNN-based decoder; this is how the resulting neural network is defined.
0:23:16 That is, we introduce an attention module that takes the information from all the encoder outputs and from the decoder state of the previous time step.
0:23:28 This module internally computes the attention probability I mentioned before, the probability of a_i given the context and the encoder outputs, and it outputs a summary vector computed by taking the expectation of the encoder outputs under this distribution.
0:23:49 The attention probability introduced here is typically defined by introducing a function that evaluates a matching score, or similarity, between the decoder context information and each encoder output, which is then normalized with a softmax.
0:24:08 If this scoring function is also represented by a neural net, all of the components, including the computation of the expectation under this probability distribution, can be optimized by simple backpropagation minimizing a cross-entropy criterion.
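A minimal NumPy sketch of that soft-attention step follows: score each encoder output against the decoder state, normalize with a softmax, and take the expectation. The dot-product scorer is just one common choice, assumed here for brevity:

```python
import numpy as np

def soft_attention(decoder_state, encoder_outputs):
    """decoder_state: (d,); encoder_outputs: (T, d).
    Returns the attention weights and the expected (summary) vector."""
    scores = encoder_outputs @ decoder_state          # similarity per frame
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                          # softmax over time
    summary = weights @ encoder_outputs               # expectation of outputs
    return weights, summary

T, d = 6, 4
enc = np.random.randn(T, d)
dec = np.random.randn(d)
w, ctx = soft_attention(dec, enc)
print(w.shape, ctx.shape)                             # (6,) (4,)
```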
0:24:28 Compared to RNN-T, the alignment here is represented internally in the neural net, whereas RNN-T handles it as a latent variable in the likelihood function, which is the actual objective function.
0:24:42 This sort of attention is called soft attention, or soft alignment, since we use the encoder outputs through an expectation; hard alignment, in contrast, makes the prediction only after deciding which encoder output is to be used.
0:24:57 So soft attention is better in terms of simplicity of implementation and also of optimization.
0:25:04 It is also an advantage that it has only a few modeling assumptions.
0:25:11 However, compared to RNN-T, it is harder to enforce monotonicity of the alignment.
0:25:18 In speech recognition, since the words and the corresponding acoustic features are assumed to appear in the same order, we assume that the alignment should be monotonic.
0:25:31 If we plot the attention probability like this, where the y-axis is the position in the output tokens and the x-axis is the position in the encoded feature sequence, most of the probability mass should lie in the diagonal region.
0:25:50 However, since soft attention is too flexible, we sometimes see off-diagonal peaks like these, and more careful decoding is needed for resolving such problems.
0:26:06 A well-known extension of soft attention is the self-attention used in Transformers.
0:26:12 Ordinary attention can be viewed as a key-value store, where the query is computed from the decoder state, and the keys and values are computed from the encoder outputs.
0:26:25 In self-attention, the attention components are computed differently: the queries, keys, and values are all computed from the previous layer's output.
0:26:34 Roughly speaking, this corresponds to paying attention to the inputs from the other time stamps, where the degree of attention to those inputs is itself computed from the previous layer's output.
0:26:50 The Transformer is a neural net component that applies this self-attention operation multiple times to integrate information from the inputs at the other time stamps.
0:27:01 We can construct both the encoder and the decoder based on this Transformer.
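Here is a compact NumPy sketch of single-head scaled dot-product self-attention, with queries, keys, and values all derived from the same layer input as just described; the random projection matrices stand in for learned parameters:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (T, d) previous-layer outputs. Q, K, V all come from X itself."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # similarity between time steps
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)   # softmax over source positions
    return weights @ V                          # each step mixes all others

T, d = 5, 8
rng = np.random.default_rng(0)
X = rng.standard_normal((T, d))
Y = self_attention(X, *(rng.standard_normal((d, d)) for _ in range(3)))
print(Y.shape)                                  # (5, 8)
```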
0:27:07 Transformers are nowadays used as a drop-in replacement for RNNs: we can use them for constructing the acoustic model of a conventional hybrid speech recognizer, or we can define a Transformer transducer, where Transformers are used instead of the RNNs in the RNN transducer.
0:27:30 The last section of this part introduces recent advances in neural speech recognition.
0:27:36 Even though end-to-end speech recognition and its related technologies are progressing quickly, we still have some disadvantages compared to conventional speech recognizers; I will focus on three of them.
0:27:53 The first one is that with a conventional system it is very easy to integrate side information to bias the recognition result, whereas with end-to-end architectures it is not trivial to do so.
0:28:08 The second point is that end-to-end speech recognizers in general require a huge amount of training data to work well, so methods to overcome data sparsity issues are also important.
0:28:23 The third point is that with a conventional system it is relatively easy to use unpaired data, such as text-only data or untranscribed audio data.
0:28:34 In this section, I introduce some example studies for overcoming those limitations.
0:28:45by things is particularly important for real applications
0:28:50speech recognition all used to find something in the database for example if we want
0:28:55to build a system to make a phone call
0:28:58speech recognizers shows a button name in the user's context are used
0:29:04same kind of behaviour is needed for various kinds of entire eighties
0:29:11like sometimes or what names
0:29:15in commissioner is biasing speech recognizer is very easy it can be done just by
0:29:20integrating additional language models that has enhanced
0:29:25probability for such but in cities
0:29:29well solution for this into and rows is introducing another addition we can see that
0:29:35focuses on
0:29:37predefined set or context vectors
0:29:41i we explain the middle of cortical texture us one text out this the utterance
0:29:46where
0:29:48in this method context for at such as a names or sometimes i encoded to
0:29:53single vector
0:29:55on the other jamaican detect pitch context of it does should be activated to the
0:30:00court to estimate the next word
0:30:04and just an example were normalization probabilities
0:30:08well as it out that
0:30:11talk to
0:30:14is addition we can start to think that some biasing for it is like but
0:30:17fruit are you all want to brew joe's actually corresponding to some names
0:30:25and this additional input vector representing context
0:30:28is expected to have the rest of the decoding process
0:30:32so after saying after the user saying talk to it is expected that some i
0:30:38can imbue for all
0:30:41and this context is attention mechanism can
0:30:46so we still behave via by a by adding additional probability to joe's a name
0:30:53context against us
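A rough sketch of that contextual-biasing step: encode each bias phrase into a vector, attend over them with the decoder state, and append the resulting summary to the decoder input. The names and dimensions are illustrative, not the published model:

```python
import numpy as np

def bias_attention(decoder_state, phrase_embeddings):
    """Attend over per-phrase context vectors (one per name/song title)."""
    scores = phrase_embeddings @ decoder_state
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ phrase_embeddings                      # context summary vector

d = 8
rng = np.random.default_rng(1)
phrases = rng.standard_normal((3, d))                 # e.g. 3 contact names, pre-encoded
dec_state = rng.standard_normal(d)
context = bias_attention(dec_state, phrases)
decoder_input = np.concatenate([dec_state, context])  # fed to next-word predictor
print(decoder_input.shape)                            # (16,)
```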
0:31:00 The next topic is a multi-dialect model for overcoming data sparsity; I will introduce a method proposed in prior work.
0:31:12 The method is simple: it just adds a one-hot vector representing the dialect as an additional input, and it uses a dataset constructed by pooling the data from all the dialects.
0:31:25 If we feed the correct dialect ID input consistently during training and decoding, a speech recognizer trained in this way can switch its mode depending on the dialect of the input data.
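The mechanism is literally a feature concatenation. Here is a tiny sketch, with made-up dimensions, of appending a one-hot dialect ID to every acoustic frame:

```python
import numpy as np

def add_dialect_id(features, dialect, num_dialects):
    """features: (T, d) acoustic frames; dialect: integer ID.
    Returns (T, d + num_dialects) frames with the one-hot ID appended."""
    one_hot = np.zeros(num_dialects)
    one_hot[dialect] = 1.0
    tiled = np.tile(one_hot, (features.shape[0], 1))  # same ID on every frame
    return np.concatenate([features, tiled], axis=1)

frames = np.random.randn(100, 80)        # e.g. 100 frames of 80-dim filterbanks
augmented = add_dialect_id(frames, dialect=2, num_dialects=8)
print(augmented.shape)                   # (100, 88)
```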
0:31:41 These are the results.
0:31:43 From this row showing the baselines, we see that just training a separate end-to-end speech recognizer per dialect is not a good idea: the performance is significantly worse for dialects with smaller datasets.
0:32:02 This row shows the results with transfer learning; here, transfer learning means first pre-training on the pooled dataset and then continuing the training on the matched, per-dialect dataset.
0:32:17 Transfer learning can indeed improve the results; however, we could obtain a further improvement just by integrating the dialect ID input.
0:32:29 Like the contextual-biasing method I explained before, having an additional metadata input can be helpful for overcoming the lack of data.
0:32:42 So designing a neural architecture that can properly handle such additional metadata inputs is increasingly important nowadays.
0:32:54 The last topic is about the use of unpaired data.
0:32:58 As I have already mentioned, end-to-end speech recognition requires a huge amount of training data, and the situation is even worse because it is not trivial how to use unpaired data.
0:33:10 Conventional speech recognition can at least leverage text-only data for language modeling, and it is also relatively easy to make some use of audio-only data.
0:33:26 To overcome these issues, unsupervised pre-training is now gaining attention.
0:33:35 Here, we want to optimize the encoder of a speech recognizer only by using non-transcribed data.
0:33:41 Of course, it is not possible to perform cross-entropy training against the true labels if the data is not transcribed.
0:33:50 Inspired by methods developed in the image processing field, recent methods use the mutual information between the context information and the instantaneous, local information.
0:34:03 Mutual information is in general very difficult to optimize, but recent methods work around this with a method called noise contrastive estimation.
0:34:18 Here I want to explain the famous network called wav2vec 2.0; this is a diagram of it.
0:34:28 This method aims at pre-training a CNN-based encoder by maximizing the mutual information between the encoder outputs and their surrounding context.
0:34:41 The surrounding context is actually summarized by a Transformer.
0:34:48 In the InfoNCE formulation used here, we basically want to maximize the similarity between the projected encoder output and the context vector.
0:34:59 However, there is a trivial solution: if we only maximize the similarity between the encoder output and the context vector, the similarity becomes maximal when the encoder maps all the data points into a single constant output, say, the zero vector.
0:35:19 InfoNCE therefore introduces negative samples: encoder outputs drawn from random time steps; it tries to minimize the similarity between the context and those randomly sampled encoder outputs.
0:35:34 So the objective maximizes the similarity between the context and the time-aligned encoder output, but it minimizes the similarity between the context and the randomly sampled encoder outputs.
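A minimal PyTorch-style sketch of that contrastive objective, with one positive (time-aligned) pair against random negatives; the cosine similarity and temperature value are common choices assumed here, not details given in the talk:

```python
import torch
import torch.nn.functional as F

def info_nce(context, positive, negatives, temperature=0.1):
    """context, positive: (d,); negatives: (K, d) encoder outputs from
    random time steps. Returns the InfoNCE loss for one position."""
    cands = torch.cat([positive.unsqueeze(0), negatives], dim=0)   # (K+1, d)
    sims = F.cosine_similarity(context.unsqueeze(0), cands) / temperature
    # Cross-entropy with the positive at index 0: pulls the aligned pair
    # together and pushes the random pairs apart.
    return F.cross_entropy(sims.unsqueeze(0), torch.zeros(1, dtype=torch.long))

d, K = 16, 10
ctx, pos, negs = torch.randn(d), torch.randn(d), torch.randn(K, d)
print(info_nce(ctx, pos, negs).item())
```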
0:35:51 wav2vec 2.0 is very famous because of its surprising performance on speech recognition problems.
0:35:58 It is reported that only a few minutes of transcribed training data are sufficient for obtaining a working end-to-end speech recognizer, if the encoder is pre-trained on roughly fifty thousand hours of audio with this contrastive training.
0:36:15 So this result shows that untranscribed audio, which is much easier to collect than transcribed data, can be extremely valuable.
0:36:31 OK, that concludes my part; thank you for watching, and let me hand over to my colleague for the next part, which covers the software aspects of end-to-end speech recognition.
0:36:47 Hi, I am also from Google Research, and I will talk about toolkit implementations for end-to-end neural speech recognition.
0:36:57 Today, I will first talk about the available toolkits, in about five minutes.
0:37:04 Then we will try decoding with a pretrained model in the toolkit, introducing the pretrained-model collection.
0:37:10 After that, we will train a neural speech recognition model from scratch, in about ten minutes.
0:37:19 In the last part, we will show how to extend the models and tasks introduced in the earlier section; for example, how to try the Transformer or other state-of-the-art components, something like that.
0:37:38 So, first of all, I will show an overview of the toolkits.
0:37:43 This table, taken from a recent ACL paper, briefly summarizes a comparison between the various toolkits.
0:38:01 In this table, all of the listed toolkits support automatic speech recognition tasks.
0:38:11 Some of them also support different tasks, like speech translation and text-to-speech.
0:38:25 Note that pre-trained models are available in several of the toolkits.
0:38:34 In this tutorial, we will focus on ESPnet, because it supports many tasks for end-to-end modeling, and it also supports tools to train the models, so I think it is easy to try.
0:38:58 Its implementation is hosted on GitHub, and if you want to know more detailed results, they are described in this paper.
0:39:07 This paper covers speech recognition, text-to-speech, and speech translation; the speech enhancement features will be coming soon, so please stay tuned.
0:39:28 In this tutorial, we will try ESPnet2. It is a major update from the ESPnet1 toolkit.
0:39:40 There are differences between them; let me mention the major ones.
0:39:46 For example, ESPnet1 depends on many external binaries, for example Kaldi's feature extraction. ESPnet2, in contrast, takes a minimalist approach: it mainly depends on PyTorch, so it can be installed and integrated easily.
0:40:08 The models themselves are almost the same; in particular, the ASR and TTS models are mostly unchanged.
0:40:17 However, the set of supported tasks in ESPnet2 is still a work in progress; pretrained models are already available, though, so it is nice to try if you are interested in ASR, TTS, and also speech enhancement.
0:40:36 If you are interested in ESPnet1, please visit this URL; it shows the usage of ESPnet1, and there is also a speech tutorial there.
0:40:52 This tutorial has hands-on examples hosted on Google Colab; the Colab notebooks are linked from the web page, and you can just copy the samples into your own notebook to try them.
0:41:11 But please make sure that you are using a GPU runtime in Google Colab when you visit this page, because some of the code we use in this tutorial needs it.
0:41:27 This slide introduces pretrained models; that means models that were already trained by someone on some task and dataset.
0:41:43 ESPnet hosts such trained models in its model-zoo repository, with the model files hosted on Zenodo.
0:41:55 For example, for the ASR task, there are models trained on LibriSpeech for English speech recognition, on CSJ for Japanese, and on several other corpora.
0:42:11 TTS also has pretrained models there.
0:42:17 If you want to see the full list of the available models, please see this URL.
0:42:27 This slide shows how to use them in Python.
0:42:35 To try it, we first download the checkpoint of the model and unpack it into a model object.
0:42:47 After that, you can feed some waveform data, as a NumPy array, into the model, and it transcribes it; let me show the results.
0:43:00to do this results
0:43:02now so rats
0:43:05get started and crawl
0:43:09so basically the you all out in the page eight
0:43:14you will find
0:43:17e
0:43:18note of it
0:43:19like this
0:43:21therefore trying we will
0:43:24in still use
0:43:28and before
0:43:30running at feast make sure you all collecting
0:43:34the i could a long time
0:43:37it is
0:43:38available
0:43:40on
0:43:42right corner
0:43:44and priest select the change runtime five and
0:43:51check the gpu we selected
0:43:55note that the u is not
0:43:57what it and see if you might be
0:44:01so you want the training
0:44:04 So, first, we install ESPnet: because it is not installed by default, we just pip-install it in a single cell.
0:44:14 It can take a while, because it pulls in many dependencies.
0:44:26 As I said, ESPnet provides pretrained models.
0:44:33 First, I downloaded a waveform file from the LibriSpeech dataset, and I will try to perform recognition on this downloaded wave.
0:44:53 To do this, we first download a pretrained model.
0:45:02 For example, this model was trained by Shinji Watanabe on the LibriSpeech ASR task, and it uses the Transformer architecture for the neural networks.
0:45:19 Then I read the waveform here and feed it into the model object we built.
0:45:30 The output is an n-best list of results, so I selected the best one to see how it looks.
0:45:40 So this is the result from the pretrained LibriSpeech model; let's also play the waveform and check it against the transcription.
0:45:58 It performs pretty well.
0:46:00 So let's go back to the slides.
0:46:07 Next, I will show you how to use ESPnet for the predefined tasks.
0:46:13 The directory named egs2 contains all of the recipe datasets inside.
0:46:21 In it, you will find directories with the same files and structure; for example, here is the asr1 recipe directory.
0:46:30 Basically, you run the run.sh script created inside each of them, and you reproduce the results reported in its README file.
0:46:41 So let me show you what kind of stages are inside the run.sh; you can start from any of the stages, or run all of them, and the stages to run are specified with command-line flags.
0:47:00 Stages 1 to 5 perform data preparation, stages 6 to 8 perform the language model training, and stages 9 and 10 perform the ASR training; after that, the evaluation is performed.
0:47:16 And finally, you can upload your trained model so that other people can use it.
0:47:23 Let me give the details of the data preparation stages.
0:47:27 In this tutorial, we focus on the AN4 task: a very small dataset of recorded alphanumeric strings from CMU, which is nice for a fast experiment.
0:47:38 The very first stage downloads the AN4 data; the next stages prepare the raw files and format everything into Kaldi-style data directories; after that, we perform some preprocessing of the speech and text data.
0:48:00 For the text, in this case we use the sentencepiece tokenizer to build the token representation, and that representation is used in the training and evaluation stages.
0:48:18 In stages 6 to 8, we perform the language model training and intermediate evaluation, like perplexity computation; after that, the ASR training, decoding, and evaluation are performed.
0:48:34 You can monitor the training with TensorBoard, even inside Google Colab; for example, the accuracy of the softmax output or the CTC output can be monitored during training.
0:48:55 And this is an example of what the evaluation and scoring results look like.
0:49:01 ESPnet provides a convenient tool for reformatting the results into a Markdown table, which is more readable; as you can see here, for each subset it shows the word error rate, and also the character error rate and the token error rate.
0:49:22 And finally, once we have the trained model, you can use exactly the same Python API that I showed in the beginning to run inference with your own model, if you specify the configuration and checkpoint files to use.
0:49:43 So now let's go to the Colab notebook again.
0:49:52 Let's see what the example directory looks like.
0:50:00 You can use shell commands, like in a usual notebook, and you can also use the file explorer from this icon.
0:50:15 You will find that many datasets are available under egs2; in this tutorial, we focus on AN4, and inside it, on this asr1 task.
0:50:30 For now, we run the run.sh in this cell.
0:50:37 Before running it, we need to install a few more dependencies to run the training: some Kaldi utilities are unfortunately still required at the moment, so we download prebuilt binaries for them, and after installing everything, we run the run.sh in the cell here.
0:51:08 So, at the start, the recipe downloads the AN4 data from the CMU site, where it is freely available; after the download finishes, the data preparation begins.
0:51:28 You can see here that many log files are created; then the data preparation stages run, and stage 5 performs the tokenization and text statistics collection.
0:51:44 These files are the results from the sentencepiece training; as I said, the AN4 recipe uses sentencepiece for the tokenization.
0:51:56 After the sentencepiece training is finished, the language model will be trained; let's see here.
0:52:04 And after that, the ASR training starts here; however, I fast-forward the training in this demo. When I ran it, the training finished in about ten minutes, which I think is reasonable.
0:52:22 So let's see what the prepared data looks like; the prepared data is stored in the data directory, and we can find the prepared files here.
0:52:33 For example, this is the transcript: the text file is here, and its first column shows the utterance ID, with the corresponding text after it.
0:52:47 You will find the corresponding speech files listed under the same IDs in the wav.scp file, so if you look up an utterance ID from here, you can find its audio there.
0:53:00 After the training, let's look at the training outputs. The exp directory is used as the working directory of the training phase, and it stores many things: for example, the log files, some of the checkpoints, and also the attention-weight plots we will look at here; the configuration is also recorded there in a YAML file.
0:53:30 Let's see how the configuration file looks.
0:53:36 The configuration YAML records every piece of information used during the training.
0:53:43 Here, for example, are the results of the tokenization setup: the token list built from the dataset, the path to the sentencepiece model binary used for it, and the structure of the network.
0:54:13 During training, you can also monitor it with TensorBoard, inside Google Colab or in your own environment.
0:54:22 You can follow, for example, how the accuracy improves after each epoch of training.
0:54:32 Next, let's see the other information: this is one of the attention visualizations stored in the experiment directory.
0:54:46 Since this is a very short utterance, the plot does not really show a long diagonal alignment pattern, but I think it looks okay.
0:55:00 So here is the evaluation result.
0:55:04 As I said, the last stage prints it, and for more details I just pasted it into the notebook cell.
0:55:14 You can see the final error rates on the test set here; for example, the sentence error rate is 64.9, and the word error rate is 6.5.
0:55:28 OK, so now let's use this model for inference.
0:55:34 First of all, we need to specify the checkpoint to use; I decided to use this one here, because according to the validation accuracy it seems to be the best.
0:55:52 We then get the recognition result, and we can play the corresponding speech to compare; the transcription matches what the speaker says.
0:56:00more than seriously speaker that is it is more
0:56:04okay so
0:56:06thanks for putting the stuff there
0:56:10this that stuff there will explain how to extend models and pat task
0:56:16so that's
0:56:18the total section in
0:56:20he interest to
0:56:22and cortical architecture and transparent and our transducer
0:56:27when you have regression
0:56:28there are they how to use that
0:56:31it's
0:56:33this is the answer
0:56:35sometimes
0:56:36like i and four task deftly already says of the predefined
0:56:42plot configuration younger fought so you can just
0:56:47that's fine why is a coefficient and take a look at that you going and
0:56:52there are none of the values of a number of the units
0:56:56inside younger five
0:56:59i think it mostly goal of this fine is that yes it has test trying
0:57:03many things like activation or
0:57:06where tight so
0:57:08make things like that
0:57:11 However, if you cannot find what you need there, you can extend the models yourself.
0:57:18 For example, the RNN, Transformer, and transducer encoder and decoder networks all implement these common interfaces, which makes it easy to swap modules and keeps the complexity of those variants out of the main implementation.
0:57:41 So, the ESPnet ASR model uses these abstractions: it receives the encoder and the decoder implementations, and passes the encoded speech features and the text targets to them, as explained in this figure.
0:58:14 And you can switch the implementations from command-line arguments; this is the actual code for that in the source.
0:58:19just a you implementation in this
0:58:23so score
0:58:27and
0:58:29if you want to send your task like you wanna
0:58:35try sub tasks you on the is that are it is well for possible
0:58:41then you extend that i was task
0:58:44so existing asr was tedious task implements this
0:58:49that is
0:58:50and
0:58:52to get the this
0:58:55task i don't think feature
0:58:57like a distributed training on divan sampling but checkpoint rejoining like that
0:59:04as the was gonna section five we show you how used in payments
0:59:10that
0:59:12models
0:59:13so that is it yes did have rivets e
0:59:17and that and check the yes to implementation and
0:59:26okay
0:59:27the out into for some so
0:59:30and there is
0:59:32model definition here
0:59:35so as i said in the us by a base
0:59:41it implements have a sort the svm modeling the phase here
0:59:46and actually simply call use the board mess of
0:59:52the read and the most value is
0:59:54so received a for the nist
0:59:56it's here
0:59:58so increase to use this be used in baton text output as seen that argument
1:00:04and then it we kinda rate and was fine tuning full
1:00:09euros the angle tunnels
1:00:12so well that's in there
1:00:15the first thing go the network coding rates the without the input of the think
1:00:21of the networks
1:00:22still this angle regularization and
1:00:25well you see that output and it and
1:00:30this is the outfit a within good as input and
1:00:33they're pretty they're
1:00:36text target
1:00:37and calculated function here and the same their thing having in thus it is inference
1:00:44so this is exactly same impotent target as well as the political there that those
1:00:51are anti do the same thing
1:00:54yes exactly same arguments
1:00:57and then combine
1:01:01thus values i-th honours the scrolling nazi it's quite easy and
1:01:06same as the so we into using their you know section
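A stripped-down sketch of that forward logic, a weighted hybrid of attention and CTC losses; the module names mirror the description above but are simplified stand-ins for the real ESPnetASRModel code:

```python
import torch

class TinyASRModel(torch.nn.Module):
    """Minimal hybrid CTC/attention loss combination, as described above."""

    def __init__(self, encoder, decoder, ctc, ctc_weight=0.3):
        super().__init__()
        self.encoder, self.decoder, self.ctc = encoder, decoder, ctc
        self.ctc_weight = ctc_weight

    def forward(self, speech, speech_lengths, text, text_lengths):
        # 1. Encode the (augmented) input speech.
        enc_out, enc_lengths = self.encoder(speech, speech_lengths)
        # 2. Attention branch: decoder consumes encoder output + text target.
        loss_att = self.decoder(enc_out, enc_lengths, text, text_lengths)
        # 3. CTC branch: exactly the same encoder output and target.
        loss_ctc = self.ctc(enc_out, enc_lengths, text, text_lengths)
        # 4. Combine the two losses with a fixed weight.
        return self.ctc_weight * loss_ctc + (1 - self.ctc_weight) * loss_att
```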
1:01:11 So, thanks for watching this tutorial.