0:00:00 Welcome to Speaker Odyssey 2020. This is the tutorial session on text-to-speech synthesis. I'm Xin Wang from the National Institute of Informatics, Japan, and I'm going to deliver this tutorial on text-to-speech synthesis.
0:00:15 First, a brief self-introduction. I'm a postdoc at NII; I got my PhD two years ago from SOKENDAI. During my PhD I was working on text-to-speech synthesis, and since the postdoc I have been working on speech and also music audio generation. Meanwhile, I have also been getting involved in the ASVspoof and Voice Privacy challenges of this year.
0:00:42 For this tutorial, I'd like to first apologize about the abstract. In the abstract I mentioned that I would explain the recent neural-network-based acoustic models and waveform generators, the classic hidden-Markov-model-based approaches, and also voice conversion. But that abstract seems to be too ambitious: I think I cannot cover all these topics in a one-hour tutorial. So in this tutorial I will focus on the recent neural-network-based acoustic models, including Tacotron and its variants. Other topics, such as waveform generators and HMMs, I have left out of this tutorial; if you are interested, you can find useful notes and reference papers in the slides.
0:01:35 For this tutorial I'm going to focus on the recent approaches like Tacotron and the related sequence-to-sequence TTS models. I'm going to talk about how they work and what the differences are. This tutorial is based on my own reading list: I summarize what I have learned and what I have implemented with my colleagues, so the content may not be comprehensive. However, I have tried my best to include more content, summarized in the notes on each slide. I also provide an appendix with a reading list of what I have read in the past. I hope you enjoy this tutorial, and of course all your feedback is welcome.
0:02:23 For this tutorial, I'd like to first give a brief introduction to the current situation, the state of the art of TTS research. After that, I will give an overview of TTS, briefly introducing the classical methods and why we are where we are today. Then I will spend most of the time of this tutorial on sequence-to-sequence TTS, the state-of-the-art TTS nowadays, explaining the different types of sequence-to-sequence TTS: those based on soft attention, hard attention, and hybrid approaches. Finally, I will make a summary and draw conclusions.
0:03:07 Let's begin with the introduction. TTS is a technology that converts input text into an output waveform. One famous example of a TTS application is the speech device used by Professor Stephen Hawking. Nowadays we have more types of applications based on TTS: one example is intelligent robots, and we also have speech-based assistant systems on cell phones and computers.
0:03:35 Research on TTS has a really long history. If we read the books and reference papers on TTS, we can find many different types of TTS methods, for example formant synthesis and unit selection. The reason why researchers are still working on TTS is that they want to make synthesized speech as natural as possible, as natural as human speech, and for some types of applications we also want the synthesized speech to sound like us. Toward this goal, researchers have put enormous effort into TTS research. However, it was not until recent years that researchers found really good models to achieve this goal.
0:04:25 Here I'd like to use the ASVspoof data to show the rapid progress of TTS. The first picture is from ASVspoof 2015: an i-vector space where we show different types of TTS systems and their distance from natural speech, genuine human speech. You can see there are many systems here, and most of them are based on HMM synthesis or GMM-based voice conversion. In this edition, basically, TTS is really far from natural speech; it is only unit selection that is close to natural speech.
0:05:03 So how about ASVspoof 2019, after four years of research? Here are the results based on x-vectors; compare them with the picture from 2015. We can see there are now so many systems that are really close to natural speech, not only unit selection. I'd like to give examples here. The first example is the HMM- and DNN-based systems: as you can see from this figure, they are still far from natural speech, while unit selection is still close to it. Meanwhile, we can see other types of TTS methods, including sequence-to-sequence TTS and WaveNet, and they are really close to natural speech.
0:05:48 Of course, this figure is based on acoustic features, either x-vectors or i-vectors. But the question is whether the synthesized speech really sounds natural in human perception.
0:06:03 To answer that question, I'd like to use the results from our recent study, where we conducted a human evaluation on the ASVspoof 2019 data. We asked human evaluators to rate how much the synthesized speech sounds like the target speakers, and what the quality of the synthesized speech is compared with natural speech. We show the results using DET curves. On the left-hand side, we can see that the HMM-DNN system is really far from natural speech in terms of speaker similarity; its whole distribution is far away from the natural target speech. Unit selection is closer, but still not as close as the sequence-to-sequence system, which, as you can see from this picture, is really close to the target speaker's natural speech. In this case the EER is roughly close to fifty percent, which means the synthesized speech really sounds like the target speakers, and human beings cannot tell them apart.
0:07:21 There is a similar trend if we look at the results in terms of speech quality: the HMM-DNN and unit-selection systems are not good enough, and it is only the sequence-to-sequence model that is really close to natural speech. From these results we can get a general idea of how the recent models based on the sequence-to-sequence framework improve the quality and speaker similarity, to the point where even human beings cannot tell them from natural speech.
0:07:54 OK, after introducing the results, I'd like to play some samples from the ASVspoof 2019 database, so that you can get a general impression of how these models sound compared with natural speech.
0:08:10 (TTS and natural samples from the ASVspoof 2019 database are played, several systems reading the same sentence for each of two speakers.)
0:08:32 These were samples from two speakers. I think you may agree that unit selection sounds like natural speech in terms of the speaker identity, but you can sometimes perceive the glitches where different units are concatenated together. The HMM system sounds close, but it sounds like artificial speech. It is the sequence-to-sequence models that truly sound like the target speakers. If you are interested, you can find more samples on our website, or download the ASVspoof 2019 database to have a try.
0:09:15 After listening to the TTS samples from ASVspoof 2019, I'm going to talk about TTS in more detail: what kinds of problems we face when we build a TTS system, what kinds of solutions we can use, and how we arrived at the idea of sequence-to-sequence TTS models.
0:09:36 So what are the problems we may face when we build a TTS system? To give an example, here is one sentence from the guidelines for ToBI labeling: "Marianna made the marmalade." The first thing we need to note when we convert the text into a waveform is that the text is basically discrete, it comes from a finite set of symbols, while the waveform is continuous in the time domain and also in the amplitude domain. Because of this basic difference between text and speech, the first thing we notice is the ambiguity in pronunciation: for example, the "ma" segments in "Marianna", "marmalade", and "made" are pronounced in different ways. The second thing is about alignment: for example, when we say "made", we may shorten or lengthen the duration of the sound when we pronounce it. This kind of alignment has to be learned from the data, which is not easy. Another issue is how to recover information that is not encoded in the text, for example the speaker identity and prosody. These are the different issues we face when we build TTS systems.
0:10:55 Here is one example of using a classical TTS pipeline to convert the text into the output waveform. The first step of the system is to clean the input text, doing some kind of text normalization to remove all kinds of non-standard symbols from the input text. After that, the system converts the text into phoneme strings; the phonemes are symbols that tell the computer how to read each word. Of course, this is not enough: we may need to add additional prosodic tags to each word or to some part of a word, for example when we emphasize "Marianna" instead of "made". Given this linguistic information about how to read the text, the system then converts it into acoustic units or acoustic features. Finally, the system uses a waveform generator to convert the acoustic information into the output waveform. In the literature we normally refer to the first steps of such a system as the front end, and to the rest as the back end.
0:12:07 In this tutorial I will not cover the topics on the front end; readers can refer to textbooks on the front end. Here we focus on the back-end issues, especially how we learn the alignment between the text and the waveform in the back-end models.
0:12:28 The first example I'd like to explain is the unit-selection-based back end. As the name suggests, this method is quite simple and straightforward: for each input unit, we directly select one speech segment from a large database; after that, we directly concatenate these speech units into the output waveform. There is no explicit modeling of the alignment between the text and the waveform, because this alignment has been preserved in the speech units, so we don't really have to care about alignment in this kind of method.
0:13:07 However, the story becomes different when we use an HMM-based back end to synthesize speech. Unlike unit selection, which directly generates the waveform, in HTS, the HMM-based approach, we don't directly predict the waveform; instead, we first predict a sequence of acoustic features from the input text. Each acoustic feature vector may correspond to, say, twenty-five milliseconds of the waveform, and we can use vocoders to reconstruct the waveform from the acoustic feature vectors. Each acoustic feature vector may contain, for example, the cepstral coefficients, the F0, and other kinds of acoustic features specific to the speech vocoder. That is the general idea: in HTS we don't directly predict the waveform; instead, we first predict the acoustic feature vectors from the input text.
0:14:11 The question is how we can do that. Remember that the input information has been extracted from the text, including the phoneme identities and other prosodic tags. In HTS we normally encode, or convert, the linguistic features into a vector for each input unit. Each vector may contain information such as the phoneme identity, whether the syllable is stressed, and so on, and we assign this kind of vector to each unit. The question, of course, is how we can convert this sequence of encoded linguistic vectors into the output acoustic feature vectors. Remember that the number of vectors we have is equal to the number of units in the text, and this number is much smaller than the number of acoustic feature vectors we need to predict. This is the alignment issue.
0:15:11 This is how the HTS system handles this issue. Since the system is based on HMMs, the first thing we need to do is convert the linguistic vectors into HMM states. This is done by simply searching through decision trees; after that, we get the HMM states for each specific vector. After searching and finding the HMM states for each linguistic vector, the next step is to predict the duration of each HMM state, for example repeating the first HMM state two times and the second one three times. Given this duration information, we can create a state sequence like this. The length of this HMM state sequence will then be equal to the number of vectors we need to predict in the output. Now the regression task becomes much easier, because we can use many types of algorithms to generate the vectors from each HMM state. Specifically, the HTS system uses the so-called maximum likelihood parameter generation (MLPG) algorithm to produce the acoustic feature vectors from the HMM states. This is how the HTS system produces the output from the input linguistic feature vectors.
0:16:43 To briefly summarize the HTS system, we can use this picture. We generate the linguistic features from the input text; we do the search in the decision trees; after that, we predict the duration of each HMM state, and this is where the alignment is produced. Then we generate the output acoustic features; after that, everything is straightforward: just convert each state into output vectors and do the waveform generation using the vocoder.
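To make the duration-based alignment step concrete, here is a minimal NumPy sketch of the expansion, assuming we already have one vector per HMM state and a predicted duration for each state; the arrays and numbers are only illustrative, not the actual HTS implementation:

```python
import numpy as np

# one (toy) vector per HMM state
state_vectors = np.array([[0.1, 0.2],   # state 1
                          [0.5, 0.4],   # state 2
                          [0.9, 0.8]])  # state 3

# predicted number of frames for each state (from the duration model)
durations = [2, 3, 1]

# repeat each state vector according to its duration, so the expanded
# sequence has as many rows as the acoustic feature frames to predict
frame_level = np.repeat(state_vectors, durations, axis=0)
print(frame_level.shape)  # (6, 2): 2 + 3 + 1 frames
```

Once the two sequences have the same length, the remaining regression from frame-level input to frame-level acoustic features is the easy part.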
0:17:18 Going from HTS to DNNs is straightforward: we just need to replace the HMM states with neural networks, feed-forward or recurrent ones. However, with this kind of framework we still need the duration model: we need to predict the alignment from the linguistic feature vectors, because without it we cannot prepare the input to the neural network. Indeed, as the paper by Alex Graves says, RNNs are usually restricted to problems where the input and output sequences are well aligned. In other words, when using the common feed-forward or recurrent neural networks, we still need additional tools, including the HMM, to learn and generate the alignment for the TTS task.
0:18:07 One may wonder whether we can use a single model to jointly learn the alignment and do the regression, and this is where the sequence-to-sequence models come on stage. In fact, they are even more ambitious: they use a single neural network to jointly learn the alignment, do the regression, and even conduct the linguistic analysis of the input text. There is a lot of recent work showing that this approach is reasonable, and it really steps up the neural networks so that we can achieve better quality for TTS.
0:18:42 OK, let's look at the sequence-to-sequence TTS models. Remember that the task of a sequence-to-sequence model is to convert the text into the acoustic feature sequence, and we need to solve three specific tasks: how to derive linguistic features, how to learn and generate the alignment, and how to generate the output sequence. Again, we cannot just use the common neural networks such as the feed-forward or recurrent ones. For this kind of sequence-to-sequence model, we normally use the attention mechanism. For the explanation, I will use x as the input while y is the output. Note that the input has M time steps while the output has N time steps, so they have different lengths.
0:19:33 The first framework we can use is the so-called encoder-decoder framework. Here we use RNN layers as the encoder to process the input, and we extract the vector c from the last hidden state of the encoder. After that, we use this vector c as the condition to generate the output sequence step by step. If we write down the equations, they look like this: you can see how the output is factorized over the time steps, and the condition c is used at every time step. This framework is straightforward and simple: no matter how long the input sequence is, we can always compress the information of the input sequence into a single vector. However, this is also an issue, because we need to use this single vector c across all the time steps when we generate the output.
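As a rough illustration of this plain encoder-decoder idea, here is a minimal NumPy sketch with a toy recurrence; the dimensions and weights are arbitrary, and it is not any particular paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 4))             # input sequence: 5 steps, dim 4
W_enc = rng.standard_normal((4, 8)) * 0.1   # toy encoder weights
W_dec = rng.standard_normal((16, 8)) * 0.1  # toy decoder weights

# encoder: a toy recurrence; only the LAST hidden state is kept as c
h = np.zeros(8)
for x_t in x:
    h = np.tanh(x_t @ W_enc + h)
c = h                                       # the single context vector

# decoder: the SAME c conditions every output time step
y = np.zeros(8)
outputs = []
for _ in range(7):                          # generate 7 output steps
    y = np.tanh(np.concatenate([y, c]) @ W_dec)
    outputs.append(y)
print(len(outputs), outputs[0].shape)       # 7 steps, each of dim 8
```

Notice that c never changes inside the generation loop; that is exactly the limitation discussed next.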
0:20:31 Can we extract a different context from the input when we generate different output time steps? The answer is yes, and we can use the attention mechanism to achieve this goal. Suppose we want to generate the second time step, y2. We take the hidden state of the decoder at the previous time step and feed it back to the encoder side. From that we compute some kind of weight vector through a softmax layer, and then we do a weighted sum over the input information to produce the vector c2. We can use this vector c2 as the input to the decoder and produce the vector y2. That is how the context information is calculated for the second time step. Also note that we can save the output of the softmax layer; it is the weight information used for the second time step. We can repeat the process for the next time step: in this case we feed back the decoder history at the second time step and then calculate the vector c3 for the output y3. In general, we can do this for any time step, and we can write the equations like these. If we save all the outputs of the softmax along all the time steps, we can notice that the weights calculated by the softmax gradually change as we generate the output along the time axis: the weights, hopefully, also move along the input sequence, as you can see from this picture. This is known as the alignment matrix, and you can find this kind of picture in many papers on TTS or speech recognition.
0:22:15 To briefly summarize the attention-based sequence-to-sequence models, we can use these equations. For each time step n, we calculate the softmax weight vector, alpha_n; then we use this alpha vector to summarize the information from the input, doing a weighted sum over the h vectors. That gives us the context vector c_n for each time step. With the context c_n we can generate the output y_n, and we repeat the process for all time steps. This is, in general, how the attention-based sequence-to-sequence model works.
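Putting those equations into code, a minimal NumPy sketch of one generation loop could look like this; a plain dot-product scorer and a toy decoder update stand in for whatever networks one actually uses:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

rng = np.random.default_rng(0)
H = rng.standard_normal((5, 8))         # encoder outputs h_1..h_5 (5 input steps)
W_out = rng.standard_normal((8, 8)) * 0.1

s = np.zeros(8)                         # decoder state
alignment = []
for n in range(7):                      # 7 output time steps
    scores = H @ s                      # one score per input step (dot scorer)
    alpha = softmax(scores)             # alignment vector alpha_n
    c = alpha @ H                       # context c_n: weighted sum over h vectors
    s = np.tanh(c @ W_out)              # toy decoder update / output y_n
    alignment.append(alpha)             # rows of the alignment matrix
```

Stacking the saved alpha vectors gives exactly the alignment matrix shown in the picture.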
0:23:05 As you can see from the previous explanation, the attention mechanism is essential for a sequence-to-sequence TTS model, and for this reason many different types of attention have been proposed. When I read the papers, I noticed there are so many different types of attention we can use: self-attention, forward attention, hard attention, soft attention. So what is the relationship between the different types of attention, and what is the purpose of using one specific attention? In the next few slides I will explain them in a more systematic way. As my proposal, I organize attention along three axes: what kind of features are used to compute the alignment, how the alignment is computed, and what kind of constraints we put on the alignment. On the first axis, the features used to compute the alignment, we can organize attention based on whether it is content-based, location-aware, or purely location-based. On the axis of how to compute the alignment, we can organize attention into three groups: additive, dot-product, and scaled dot-product attention. On the final axis, we can say an attention is monotonic, forward, local, or global. This is my proposal for organizing the so-called soft attention.
0:24:38 Soft attention is not the only group we can find in the literature. If we read the papers, we can find another group, so-called hard attention. The difference from soft attention is that in hard attention the alignment is treated as a latent random variable, and we need to use all kinds of tools, such as dynamic programming and marginalization, to calculate the probability and to marginalize out the latent variable. I will talk more about the difference between the two groups of attention in later slides, but for now I will focus on soft attention.
0:25:17 Let's first look at the dot, scaled-dot, and additive attention. These are three types of attention which use different ways to compute the alignment matrix. Suppose we are going to compute the output y_n for the n-th time step. What we have is the decoder state of the previous time step, s_{n-1}; we also have the features extracted from the input text, which are denoted as h. These three types of attention differ in the way they compute the input to the softmax layer; the output of the softmax will be the alignment matrix. The first one, dot attention, directly multiplies the two vectors, s_{n-1} from the decoder and h from the encoder; that is why it is called dot-product attention. Scaled-dot attention is quite similar, but in this case we add a scaling factor d to change the magnitude of the activation fed to the softmax layer. The last type is additive attention: in this case we apply a linear transformation to each of the two vectors and then add the transformed vectors together; that is the reason why it is called additive attention. Note that for all three types of attention in this example, we are using the s vector from the decoder and the h vector from the encoder. In other words, we can consider h as the content of the input: we combine the content from the input text with the hidden state from the decoder in order to compute the alignment matrix.
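Written side by side, the three scoring functions might look like the following NumPy sketch; the weight matrices W, V and the vector w are placeholders for trainable parameters, and the exact parameterization varies across papers:

```python
import numpy as np

def dot_score(s_prev, h_m):
    # dot attention: inner product of decoder state and encoder feature
    return s_prev @ h_m

def scaled_dot_score(s_prev, h_m):
    # scaled-dot attention: the same product, divided by sqrt(dimension)
    return (s_prev @ h_m) / np.sqrt(h_m.shape[-1])

def additive_score(s_prev, h_m, W, V, w):
    # additive attention: linearly transform both vectors, add, project
    return w @ np.tanh(W @ s_prev + V @ h_m)

rng = np.random.default_rng(0)
s_prev, h_m = rng.standard_normal(8), rng.standard_normal(8)
W, V, w = (rng.standard_normal((8, 8)), rng.standard_normal((8, 8)),
           rng.standard_normal(8))
print(dot_score(s_prev, h_m), scaled_dot_score(s_prev, h_m),
      additive_score(s_prev, h_m, W, V, w))
```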
0:27:13 This brings us to the second axis along which we can classify the different types of attention. The question here is what kind of features we can use to compute the alignment. In the previous slide I explained the dot, scaled-dot, and additive attention using examples where we use the decoder state and the content vector h to compute the alignment. These methods are called content-based attention, because they use the content vector. However, this is not the only way to compute the alignment. The second way is the so-called location-aware attention: as you can see from these two equations, compared with content-based attention, location-aware attention additionally uses the attention vector from the previous time step. This attention is aware of the previous alignment, and that is why it is called location-aware attention. The third type of attention in this group is the so-called location-based attention. Compared with location-aware attention, we can notice from this equation that the content vector h is removed from the input. In other words, in location-based attention we don't care about the content; we compute the alignment matrix purely based on the decoder state and the alignment from the previous time step. Finally, there is a small variant of location-based attention: in this case we only use the decoder state to compute the alignment, without using the alignment from the previous time step.
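Schematically, the difference between these groups is just which arguments enter the scoring function; a toy sketch, with `f` standing in for any trainable transform that ends in a scalar score, could be:

```python
import numpy as np

def f(*args):
    # stand-in for any trainable transform that ends in a scalar score
    return sum(float(a.sum()) for a in args)

s_prev = np.ones(8)       # decoder state s_{n-1}
h_m = np.ones(8)          # encoder content vector h_m
alpha_prev = np.ones(5)   # alignment vector of the previous output step

score_content_based  = f(s_prev, h_m)              # dot / scaled-dot / additive
score_location_aware = f(s_prev, h_m, alpha_prev)  # also sees previous alignment
score_location_based = f(s_prev, alpha_prev)       # the content h_m is removed
score_state_only     = f(s_prev)                   # the small final variant
```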
0:29:03 From the equations of the four types of attention, you may notice that when we compute the attention, or the alignment matrix, for each output time step, we consider all input time steps. This leads to the third axis along which we can classify attention. Along this axis, I'd like to explain two types of attention. The first one is the so-called global attention: as the name suggests, when we compute the alignment for each output time step, we consider extracting information from all the input time steps, so the alignment vector alpha here has no zero elements. In contrast, when we use local attention, we constrain some of the alignment weights to be zero; for example, in this case we only consider extracting information from the input steps in the middle.
0:30:03 Now I have explained the three axes along which we can classify soft attention, and in fact all the examples I have explained can find their location in this 3D space. But let me give one more concrete example, which is self-attention. Self-attention is scaled-dot, content-based, and global attention; let's see how it is defined. If we look at the equations of self-attention, we can notice why it is called scaled-dot, global, and content-based attention. The special thing about self-attention is that we extract both feature vectors, h here and s here, from the input sequence itself; in other words, we are computing the alignment of the input sequence against itself. Of course, because we can compute everything in parallel, we can also define a matrix form for self-attention. In this case we formulate the input feature sequence as a matrix and do the scaled-dot attention in matrix form; the matrices are called the query, key, and value matrices, and in this case they all refer to the same matrix H. In other words, this self-attention does a transformation on the input sequence, and the output sequence has the same length as the input. In some sense, we can consider self-attention as a special type of convolutional or recurrent layer that transforms the input into an output of the same length. Of course, we can also use self-attention for alignment learning; in that case it is just a special type of soft attention, based on the scaled-dot and content-based attention. As you can see from the equations, in that case we replace the query matrix with the states from the decoder, but the process is quite similar, and we can do everything in parallel using matrix multiplication.
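In matrix form, a minimal NumPy sketch of this scaled-dot self-attention could be as follows; the dimensions are arbitrary, and in a real model Q, K, and V would be linear projections of H rather than H itself:

```python
import numpy as np

def softmax_rows(M):
    e = np.exp(M - M.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
H = rng.standard_normal((5, 8))        # input sequence as a matrix (5 x 8)

# in plain self-attention, query, key and value all refer to H itself
Q, K, V = H, H, H
A = softmax_rows(Q @ K.T / np.sqrt(H.shape[1]))  # 5 x 5 alignment matrix
out = A @ V                            # output: same length as the input
print(out.shape)                       # (5, 8)

# for alignment learning, Q would instead hold the decoder states, giving
# an (output-length x input-length) alignment matrix
```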
0:32:21 By now I have explained all three axes for classifying soft attention, and also the example of self-attention. In fact, there are more ways to combine the different types of attention, and you can find the variants in the paper published by Google this year.
0:32:43 Given this explanation of soft attention, let me now quickly explain how it works in a TTS system. When we use the attention-based sequence-to-sequence models for TTS, we use almost the same framework as those used for machine translation or speech recognition. In this case, the input is the phonemes or characters, the output is the acoustic feature vector sequence, and we still have the encoder, the attention, and the decoder, which is autoregressive. Of course, we can do something more, for example adding more layers in the decoder, increasing the number of recurrent layers, or adding the pre-net, which receives the feedback from the previous time step in the autoregressive decoder; this is free to choose. But the basic idea is still the attention-based approach to learn the alignment between the input and the output.
0:33:48 This gives us the basics to understand the first famous TTS system based on the sequence-to-sequence model: the Tacotron system. As you can see from the picture in the original paper, the architecture of the network can be generally divided into three parts: the decoder, the attention, and the encoder. Different systems just differ in how they define the encoder, for example by using different types of hidden layers to extract information from the input phoneme or character sequences, but the basic idea is still the same: use attention to learn the alignment between the input and the output.
0:34:38 In fact, Tacotron is not the only model that uses the sequence-to-sequence approach. As far as I have read, the first one might be the pioneering work by Alex Graves: if you listened to his talk in 2015, you may notice that he played some samples using the attention-based framework, so attention-based sequence-to-sequence models existed already in 2015. After that, at Interspeech, there was one paper on Mandarin TTS which was the first to use attention in a published paper. Then came the Tacotron system in 2017, and meanwhile there are different types of systems, for example Char2Wav, Tacotron 2, DC-TTS, Deep Voice 3, and Transformer TTS. All of these systems are based on the attention mechanism. But here I'd also like to mention one special system, the so-called VoiceLoop, which is also a sequence-to-sequence TTS model but actually uses a different type of alignment learning, the so-called memory buffer. If you are interested in this model, you can find the illustration in the appendix.
0:36:01 To help you understand the differences between the different types of sequence-to-sequence TTS systems, I summarize the details and differences in this table. There are many details here, for example the waveform generator, the acoustic features, and the architecture of the decoder and encoder, but let's focus on the attention. As you can see, the Tacotron-based systems mainly use additive attention, with or without location-awareness. There are also other systems, for example Char2Wav, which directly uses location-based attention, and there is a purely self-attention-based system, the Transformer TTS. You can find the details later in the slides.
0:36:56 Now I'd like to play some samples published with these papers. They are from the official websites, and the data is in the public domain. For the systems trained on their own internal data, I cannot put the samples here, but you can find them on the respective websites.
0:37:17 (Audio samples from the published demo pages are played; one of the test sentences is "Prosecutors have opened a massive investigation into allegations of fixing games and illegal betting.")
0:37:53 After playing the samples, I hope you have a general impression of how the sequence-to-sequence TTS systems sound. Of course, the quality might not be as good as what we heard from ASVspoof 2019; there are many different reasons for that. If you want to find other good examples, I suggest the samples of Tacotron and the Transformer TTS, where they used their own internal data to train the systems. After listening to the samples, you may wonder whether soft attention is good enough for the TTS purpose. I think the answer is no: the samples I played are all well-decoded samples, but there are actually many cases where the sequence-to-sequence-based TTS systems do not work. In those cases we need to consider specific attention mechanisms that are designed for TTS.
0:38:53 This leads us to another group of systems, which use monotonic and forward attention. Before explaining this type of model, I think we first need to explain why the global attention, or the global alignment, sometimes does not work. Remember that for the global alignment, or global attention, we need to compute the alignment between every pair of input and output time steps. This might be necessary for other tasks such as machine translation, but it might not be necessary for TTS, and this kind of alignment is hard to learn; sometimes it does not work.
0:39:38 I'd like to play one sample. This is a sample from a paper from Microsoft Research, where they used global attention to synthesize a very long sentence. You can hear how it sounds; the text transcription is here, so this would be the input.
0:39:59 (The sample is played. The input text contains backslash-separated file paths, "crashes\reports\...\1\15 ... that makes post-processing a little painful ...", and the synthesized speech breaks down partway through the sentence.)
0:40:25 I hope this interesting example shows you how soft attention might not work when we use a long text as input. This is an issue we need to solve.
0:40:35 So what can we do to alleviate the problem? One thing we can consider is that, for text-to-speech, there is some kind of monotonic relationship between the input and the output, because human beings read text from left to right. We can use this prior knowledge to constrain the alignment, so that it becomes easier for the system to learn the mapping from the input to the output. The idea looks like this. This is the motivation behind the monotonic and forward attention, and the central idea of the forward or monotonic attention is to recompute the alignment matrix: suppose we have computed an alignment matrix like this; then, with some kind of prior knowledge, we recompute the alignment matrix to encourage a monotonic alignment.
0:41:34 To give you an example of how it works, let's consider this simple task: convert the input x1, x2, x3 into the output y1, y2, y3. Suppose we have used soft attention and have computed the alignment for the first time step. This is where we can introduce the prior knowledge to constrain the alignment learning. Suppose the alignment can only start from the first input time step: we can add the alignment vector alpha-hat-0 here to indicate the initial condition, so in this case alpha-hat-0 is (1, 0, 0). Furthermore, we constrain the alignment so that it can only stay at the same input step or transit from the previous input step to the next one, like a left-to-right HMM. Based on these conditions, we can recompute the alignment vector alpha-1 like this. To give a concrete example: suppose alpha-1 is equal to (0.5, 0.4, 0.1); after the recalculation, we get a new vector, and you can notice how the probability of aligning y1 with x3 is reduced from 0.1 to 0. This is how we do the forward recalculation of the alignment matrix and suppress the impossible alignments during the model training stage. Of course, in the paper they also propose other mechanisms to recompute the alignment matrix, but the central idea is the same. Given the recalculated alignment vector, we can use it to compute the output of the first time step; then we repeat the process, learning the alignment and computing the outputs y1 to y3.
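As a sanity check of that recalculation, here is a minimal NumPy sketch of the forward recursion; it follows the stay-or-move-one-step rule described above, and the normalization detail may differ slightly from the original forward attention paper:

```python
import numpy as np

def forward_recompute(alpha, alpha_hat_prev):
    # each input position can be reached by staying put or by moving
    # one step forward, like a left-to-right HMM; then renormalize
    shifted = np.concatenate(([0.0], alpha_hat_prev[:-1]))
    new = alpha * (alpha_hat_prev + shifted)
    return new / new.sum()

alpha_hat_0 = np.array([1.0, 0.0, 0.0])  # must start at the first input step
alpha_1 = np.array([0.5, 0.4, 0.1])      # soft-attention weights for step 1

print(forward_recompute(alpha_1, alpha_hat_0))
# [0.5556 0.4444 0.    ] -- the weight on x3 drops from 0.1 to 0
```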
0:43:46 Interestingly, if we check the alignment matrices in the paper, we can notice how the forward attention differs from the common soft-attention-based approaches, especially in the first row of the figure, which shows the alignment after only one training epoch. For the baseline, without any constraint, the alignment is simply a random or uniform distribution; for the forward attention, with the recalculation over the matrix, you can see how the alignment matrix already looks like a monotonic shape. We can also consider this monotonic shape as a prior constraint on what we learn from the input and output data. Based on these examples, I think you can understand why the forward attention makes it easier for the TTS system to learn the alignment between the input and the output.
0:44:51 In addition to the forward attention, there are also other types of monotonic attention, for example using different recursion forms or combining it with local attention. However, I'd like to mention that the forward, or so-called monotonic, attention cannot guarantee the attention to be exactly monotonic. There are many reasons for that, but I think the fundamental reason is that we are still using soft attention, where we compute the alignment and summarize the context from the input data in a deterministic way. This is the issue we would like to solve with hard attention, which I will explain in later slides.
0:45:40 OK, let's play some samples to see how the forward attention works. This is the same text that I played before: with soft attention, the TTS system does not really work on this sample. Let's listen to how the forward-attention-based system handles it.
0:46:00 (The sample is played again with the forward-attention system; this time the whole sentence, including the file paths and "that makes post-processing a little painful ...", is synthesized intelligibly.)
0:46:19 From this example, you can notice how the forward attention made the system successfully read the later part of this long sentence. This is a good example of how the forward attention works. But again, as I mentioned in the previous slide, the forward attention is not guaranteed to produce a monotonic alignment. Here is one example from the Microsoft paper.
0:46:47 (The sample is played: a news sentence in which the synthesized speech gets stuck and repeats the phrase "rival chip firms" many times.)
0:47:19 With this funny example, I hope you can notice how the forward-attention system repeats the phrase "rival chip firms" multiple times. You can also see the alignment in the picture here: in this case the alignment is not monotonic. So again, soft attention, even this forward attention, does not guarantee a monotonic alignment learned from the data. Anyway, from the previous samples, I think you can hear how the forward attention can help the TTS system to learn the alignment for long sentences. There are actually many TTS systems using the forward attention, for example the four papers here. I will not play their samples; if you are interested, you can find the samples on the websites or in the slides.
0:48:09 Since soft attention cannot guarantee a monotonic alignment during generation, we have to find another solution, and one potential answer could be hard attention. Here is my understanding of how hard attention works. Suppose we have the soft-attention alignment matrix; this matrix tells us the probability that each output time step is aligned with each input time step. From this alignment probability matrix we may sample a monotonic alignment like this. That is the idea if we want to use a monotonic alignment for TTS generation. However, we have to take into consideration that there are multiple candidates for the alignment, for example alignments number two and number three here, and we have different probabilities of drawing these samples. Accordingly, during training we have to take into account the uncertainty over the different alignments: if we want to evaluate the model likelihood during training, we have to treat the alignment as a latent variable of this probabilistic model. This idea is very similar to the hidden Markov model, and, as you can imagine, during training we have to use dynamic programming, the forward algorithm, or search algorithms to evaluate the model likelihood.
0:49:38 To give you a more intuitive picture of how hard attention works, we can compare it with soft attention. As you can see from this picture, for soft attention, for each output time step we just directly calculate the weighted sum to extract information from the input; this is how we produce the alignment during generation with soft attention, and we repeat this operation for all the time steps. In contrast, with hard attention we have to draw samples: we have to select only one possible alignment for each time step. Of course, we can use more sophisticated techniques such as beam search or Viterbi decoding to select a good alignment for the TTS generation, but this is how we do the generation with hard attention: compared with soft attention, we don't do a weighted sum; instead, we draw samples. Similarly, in the training stage, we have to use dynamic programming to sum over all possible alignments in order to evaluate the model likelihood for the hard-attention-based models. In contrast, soft attention does not require us to do so; we just do the same as what we do in the generation stage: the weighted sum at each time step.
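The contrast at generation time can be boiled down to a few lines; here is a minimal NumPy sketch, where the alignment probabilities are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.standard_normal((5, 8))                # encoder outputs, 5 input steps
alpha = np.array([0.1, 0.6, 0.2, 0.05, 0.05])  # alignment probabilities

# soft attention: deterministic weighted sum over ALL input steps
c_soft = alpha @ H

# hard attention: draw ONE input position from the same distribution
m = rng.choice(len(alpha), p=alpha)
c_hard = H[m]
# (during training, the sampled position is a latent variable, so the
#  likelihood is computed by summing over alignments with dynamic programming)
```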
0:51:06 The difference between soft attention and hard attention requires us to use a different space to categorize the different techniques for hard attention, and that leads to this space, which I think makes it easy to understand the different kinds of hard-attention techniques. However, due to the limited time, I cannot explain the details of hard attention here; if you are interested, please check these slides, where I explain hard attention in more detail.
0:51:38 In terms of TTS systems with hard attention, as far as we know, there is only one group actually using hard attention for TTS, and it is our group. You can find the reference papers on the website below, and you can also find many details on how we use different types of search and sampling techniques to produce the output alignment from the hard-attention-based models.
0:52:07 Given the details on soft attention and the brief introduction to hard attention, we now come to the third group, the hybrid approaches for sequence-to-sequence TTS models. From the first part of this tutorial, I hope you can understand that soft attention is easy to implement, but it might not work when generating long utterances. Hard attention may help to solve this issue, because it guarantees a monotonic alignment during generation. However, according to our experiments, hard attention might not be as accurate as soft attention; for example, sometimes it may overestimate the duration of silences. Furthermore, for both soft and hard attention, we compute an alignment probability for each pair of input and output time steps. For TTS, the output sequence can be quite long, which means we have to calculate a large matrix of alignment probabilities, and that is not easy.
0:53:13 Of course, we can do something more efficient. Suppose we can summarize the alignment information from the matrix, so that we know roughly how many output time steps we need to generate for each input token. Using this information, we can build one probabilistic model for each input token, just to estimate how many time steps it needs to produce during the generation stage. This idea is not new: it has actually been used in the HMM- and DNN-based systems, and it is also the idea behind the hybrid approaches. The hybrid approaches first use an attention-based model to extract the alignment matrix; after that, they summarize the information, for example the duration, i.e., how many output time steps we need to repeat for each input token. After summarizing this information, we can train the duration model directly for each input token. During the generation stage, we can directly plug in the trained duration model: as you can see from this picture, we just need to predict how many output time steps to repeat for each input token. Given this duration information, we can do the upsampling simply by duplicating each input vector. The input to the decoder will then be well aligned with the output sequence we want to generate, and we can use normal neural networks, such as feed-forward, recurrent, or autoregressive neural networks, to convert the input into the output acoustic feature sequence.
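The upsampling step is essentially the same duplication we sketched earlier for the HMM states, now applied to the encoder outputs, in the spirit of what FastSpeech calls a length regulator. A minimal sketch with made-up durations:

```python
import numpy as np

rng = np.random.default_rng(0)
encoder_out = rng.standard_normal((4, 8))  # one vector per input token
durations = np.array([3, 5, 1, 4])         # predicted frames per token

# upsample by duplicating each token vector according to its duration,
# so the decoder input is already aligned with the output frames
decoder_in = np.repeat(encoder_out, durations, axis=0)
print(decoder_in.shape)   # (13, 8): 3 + 5 + 1 + 4 output frames
```

After this step, no attention is needed at generation time; the decoder only does frame-by-frame regression.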
0:54:58 Here are some TTS systems using the hybrid approaches. FastSpeech uses soft attention to extract the durations, while AlignTTS and other systems use different kinds of techniques to extract the durations. I'd like to play some samples extracted from the published papers; I will play just one sample for each system, from FastSpeech and FastSpeech 2.
0:55:28 (One short sample each from FastSpeech and FastSpeech 2 is played; both read the same sentence.)
0:55:44 Although I only played short samples here, you can find long sentences on their websites. What I want to show with these examples is that, by using the hybrid approaches, we can generate synthetic speech with quite robust durations. I think that is one strong point of the hybrid approaches.
0:56:06 OK, let's come to the summary. In this tutorial I first explained the pipeline TTS systems, including the HMM- and DNN-based systems. In pipeline TTS, we need a front end to extract the linguistic information from the input text; after that, we need a duration model to predict the duration of each input unit; following that, we need the acoustic model and the waveform generator to convert the linguistic features into the final waveform.
0:56:42 In 2016, Google DeepMind proposed WaveNet. Although it is not explained in this tutorial, I'd like to mention that the original WaveNet still needs the front end and the duration model. It achieved its astonishing performance because it uses a single network to directly convert the linguistic features into the waveform sampling points; this avoids the issues, or the artifacts, introduced when we use conventional waveform generators like vocoders. Different from these two types of TTS systems, the sequence-to-sequence models use a single model to convert the input text into the acoustic features: a single model does the alignment learning, the duration modeling, and the acoustic modeling. In fact, many sequence-to-sequence models also use WaveNet-like waveform generators to further improve the quality of the synthesized speech.
0:57:42 If we summarize the differences between the pipeline systems and the sequence-to-sequence systems, I think there are four aspects. The first one is that we replace the conventional front end of the pipeline system with a trainable, implicit front end in the sequence-to-sequence model. Second, instead of using an external duration model, we may do the duration modeling jointly with the sequence-to-sequence mapping. The third point is about the acoustic models: although it was not explained in this tutorial, most of the sequence-to-sequence models use so-called autoregressive decoding, producing one output time step conditioned on the previous time steps. The last point is the neural waveform models: as I mentioned in the previous slide, many of the sequence-to-sequence models use neural waveform models like WaveNet.
0:58:39 The first three types of differences are implemented through the attention-based sequence-to-sequence models, so in this tutorial we focused on the attention mechanism. We first explained soft attention, and we grouped the soft-attention approaches along three dimensions: what kind of features are used to calculate the alignment matrix, how the alignment is calculated, and what kind of constraints we put on the alignment. We also mentioned the shortcoming of soft attention: it does not guarantee a monotonic structure. We then explained the hard-attention-based approach; however, hard attention might not be accurate enough to produce natural speech. That brings us to the last possible solution, the hybrid approach, where we don't use attention during the generation.
0:59:35 All four aspects are quite essential to the performance of the sequence-to-sequence TTS models. Of course, we may wonder what the most important factor contributing to the performance of the sequence-to-sequence models is. To answer that, Oliver Watts and his colleagues designed experiments and tried to analyze the impact of each factor on the quality of the speech generated by the sequence-to-sequence models. I strongly recommend reading their paper to understand why the sequence-to-sequence models outperform the TTS pipeline systems.
1:00:13 Before we end this tutorial, let me briefly mention other research topics based on the sequence-to-sequence TTS models. The first one is the neural waveform models, which have been used in many sequence-to-sequence models; due to the limited time, I cannot explain the neural waveform models, but you can find the reference papers in the reading list. Another topic is speaker, style, and emotion modeling in sequence-to-sequence models, and prosody is also a hot topic in sequence-to-sequence modeling. In terms of multi-speaker modeling, most of the sequence-to-sequence models are quite straightforward: they either jointly train the speaker vectors with the sequence-to-sequence model, or they use a separate speaker model to extract the speaker vectors from the reference speech; the latter is the so-called zero-shot learning for multi-speaker TTS. In terms of prosody, some papers focus on segmental prosody, for example the lexical tone or the pitch accent; most of these papers deal with tonal or pitch-accent languages such as Mandarin or Japanese. In terms of the suprasegmental variations, there are papers combining prosody embeddings with the Tacotron-based systems, and also systems using variational encoders to extract the prosody embeddings from the reference speech. Finally, I'd like to mention another direction of TTS research, which is TTS for entertainment. In this paper, the authors used traditional Japanese comedy data to train the TTS system; the goal of this kind of TTS system is not only speech communication but also to entertain the audience.
1:02:16 This is the end of this tutorial. You can find the slides on my GitHub page; I recommend checking the appendix slides and the reading list. Thank you for listening.