0:00:14 Okay, so I'm pleased to introduce the next guest speaker, Keiichi Tokuda from the Nagoya Institute of Technology. He's extremely well known, but for those who don't know, he is the pioneer of statistical speech synthesis, in particular HMM-based speech synthesis. Please welcome him. [applause]
0:00:44 Most speech recognition researchers regard speech synthesis as a messy problem. That is the reason why I will talk about a statistical formulation of speech synthesis in this presentation.
0:01:07 Okay. To realize speech synthesis systems, many approaches have been proposed. Before the nineties, rule-based formant synthesis was studied; in this approach, the parameters are determined by hand-crafted rules. After the nineties, the corpus-based concatenative speech synthesis approach became dominant; state-of-the-art speech synthesis systems based on unit selection can generate natural-sounding speech. In recent years, the statistical parametric speech synthesis approach has gained popularity. It has several advantages, such as flexibility in voice characteristics, a small footprint, automatic voice building, and so on.
0:02:02 But I think the most important advantage of the statistical approach is that we can use mathematically well-defined models and algorithms. In this talk, I would like to discuss how we can formulate, and understand, the whole speech synthesis process, including speech feature extraction, acoustic modeling, text processing, and so on, in a unified statistical framework.
0:02:33 Okay. The basic problem of speech synthesis can be stated as shown here. We have a speech database, that is, a set of texts and the corresponding speech waveforms. Given a text to be synthesized, find the speech waveform corresponding to that text. The problem can be represented by this equation, and it can be solved by estimating the predictive distribution of the waveform given the text and the database, and then drawing samples from the predictive distribution.
0:03:29 Basically, it's quite simple. However, estimating the predictive distribution directly is very hard, so we have to introduce an acoustic model, for example an HMM. Then this part corresponds to the training part, and this part corresponds to the generation part.
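[Editor's note: the slide equations are not captured in the transcript. As a hedged reconstruction from the speaker's published formulation of this problem, the predictive distribution and its training/generation split read roughly:

$$x \sim p(x \mid w, \mathcal{X}, \mathcal{W}), \qquad
p(x \mid w, \mathcal{X}, \mathcal{W}) = \int \underbrace{p(x \mid w, \lambda)}_{\text{generation}}\, \underbrace{p(\lambda \mid \mathcal{X}, \mathcal{W})}_{\text{training}}\, d\lambda,$$

where $\mathcal{X}$ and $\mathcal{W}$ are the database waveforms and texts, $w$ is the text to be synthesized, and $\lambda$ is the acoustic model.]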
0:04:03 First, I would like to discuss the generation part.
0:04:12 As we know, modeling speech waveforms directly with acoustic models is very difficult, so we have to introduce a parametric representation of the speech waveform: for example, cepstrum or mel-cepstrum for the spectrum, and F0 for the excitation. Accordingly, the generation part is decomposed into these two terms. We also know that the text should be converted to labels, because the same text can have multiple pronunciations, part-of-speech tags, lexical stress, or other information. So the generation part is decomposed into these three terms: text processing, speech parameter generation from the acoustic model, and speech waveform reconstruction.
0:05:31 It is difficult to perform the integrals and summations over all the variables, so we approximate them by joint maximization, as shown here. However, joint maximization is still hard, so it is approximated by a step-by-step maximization problem. This corresponds to the training part; this maximization corresponds to text analysis; and this corresponds to speech parameter generation from the acoustic model.
0:06:19 I have talked about the generation part, but the training part also requires a parametric representation of the speech waveform, and labels. Accordingly, the training part can be approximated by a step-by-step maximization problem in a similar manner to the generation part: labeling of the speech database, feature extraction from the speech database, and acoustic model training.
0:07:01 As a result, the original problem is decomposed into these sub-problems: these are for the training part and these are for the generation part. Feature extraction from the speech database, labeling, and acoustic model training; then text analysis of the text to be synthesized, and speech parameter generation from the acoustic model; and finally we reconstruct the speech waveform by sampling from this distribution.
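[Editor's note: written out symbolically (again a hedged reconstruction of the slide, with $\mathbf{o}, \mathbf{l}$ the database speech parameters and labels, and $o, l$ those of the sentence to be synthesized), the step-by-step maximizations are roughly:

$$\hat{\mathbf{l}} = \arg\max_{\mathbf{l}} P(\mathbf{l} \mid \mathcal{W}), \quad
\hat{\mathbf{o}} = \arg\max_{\mathbf{o}} p(\mathcal{X} \mid \mathbf{o}), \quad
\hat\lambda = \arg\max_{\lambda} p(\hat{\mathbf{o}} \mid \hat{\mathbf{l}}, \lambda),$$
$$\hat l = \arg\max_{l} P(l \mid w), \quad
\hat o = \arg\max_{o} p(o \mid \hat l, \hat\lambda), \quad
x \sim p(x \mid \hat o).$$]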
0:07:44 Okay, I have just talked about the mathematical formulation. In the following, I would like to explain each component step by step, then show examples to demonstrate the flexibility of the statistical approach, and finally give some discussion and conclusions.
0:08:10"'kay"
0:08:12this the overview of an hmm based speech synthesis system
0:08:17the training part is similar to those used in hmm based speech recognition system
0:08:23the essential difference
0:08:25it that the state output vector in clues
0:08:29not only spectrum parameters
0:08:32for example mel-cepstrum
0:08:34but also excited some parameters if zero parameters
0:08:39on the other hand
0:08:40the synthesis part
0:08:43does the inverse operation of speech recognition
0:08:48that is
0:08:49phoneme hmms
0:08:50or concatenated according to the labels
0:08:54i drive the from the text
0:08:56to be synthesized
0:08:58yeah
0:08:59a sequence or speech parameters
0:09:02a spectrum parameters and F zero parameters
0:09:06is determined in such a way that it's at most probable probability for the hmm is max
0:09:13and finally
0:09:14switch maple
0:09:17is in fact by using speech synthesis filter
0:09:21and that each part correspond to the
0:09:25supper problem
0:09:28that we
0:09:30feature extraction
0:09:32and the model training
0:09:34and that text analysis for the text to be synthesized
0:09:37and speech parameter generation from acoustic model trained a cost model
0:09:43and speech waveform reconstruction
0:09:47 First, I would like to talk about speech feature extraction and speech waveform reconstruction, which correspond to these sub-problems. They are based on the source-filter model of human speech production. In this presentation, I assume the system function H(z) is represented by mel-cepstral coefficients, that is, frequency-warped cepstral coefficients, defined by this equation. The frequency warping function defined by this first-order all-pass system function gives a good approximation to auditory frequency scales, with an appropriate choice of the warping parameter. By assuming that x, a short segment of the speech waveform, is a Gaussian process, we determine the mel-cepstrum c in such a way that its likelihood with respect to x is maximized; it is just the maximum likelihood estimation of the mel-cepstral coefficients. Because the likelihood of x is convex with respect to c, the solution can easily be obtained by an iterative algorithm.
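[Editor's note: the slide definitions are not visible in the transcript. Reconstructed here from the speaker's mel-cepstral analysis papers, so treat the exact notation as an assumption, they are:

$$H(z) = \exp \sum_{m=0}^{M} \tilde c(m)\, \tilde z^{-m}, \qquad
\tilde z^{-1} = \frac{z^{-1} - \alpha}{1 - \alpha z^{-1}}, \quad |\alpha| < 1,$$

where the all-pass coefficient $\alpha$ sets the warping; a value around $\alpha = 0.42$ is commonly used to approximate the mel scale at a 16 kHz sampling rate.]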
0:11:22 Okay. To resynthesize speech, H(z) is controlled according to the estimated mel-cepstrum and excited by a pulse train or white noise for voiced and unvoiced segments, respectively. This is the pulse train, and this is the white noise. The excitation signal is generated based on the voiced/unvoiced information and the F0 extracted from the original speech. [plays original speech and excitation signal] It has the same F0 at each point. By exciting the speech synthesis filter, controlled by the mel-cepstral coefficient vectors, with this excitation signal, we can reconstruct the speech waveform.
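[Editor's note: as a concrete sketch of this excitation scheme (illustrative only; the function name and the pulse normalization are my assumptions, not the speaker's code), pulse-train/noise excitation from a frame-level F0 track could look like:]

```python
import numpy as np

def make_excitation(f0, frame_shift, fs):
    """Pulse-train / white-noise excitation from a frame-level F0 contour.
    f0: array of F0 values in Hz, one per frame, 0 for unvoiced frames.
    frame_shift: frame shift in samples; fs: sampling rate in Hz.
    Hypothetical helper for illustration, not the speaker's implementation."""
    e = np.zeros(len(f0) * frame_shift)
    phase = 0.0                             # accumulated phase in cycles
    for i, f in enumerate(f0):
        start = i * frame_shift
        if f > 0:                           # voiced: pulse at each phase wrap
            for n in range(start, start + frame_shift):
                phase += f / fs
                if phase >= 1.0:
                    phase -= 1.0
                    e[n] = np.sqrt(fs / f)  # one common unit-power scaling
        else:                               # unvoiced: white Gaussian noise
            e[start:start + frame_shift] = np.random.randn(frame_shift)
            phase = 0.0
    return e                                # feed to the synthesis filter H(z)
```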
0:12:30 [plays resynthesized sample] So now the problem is how we can generate both speech parameters, mel-cepstrum and F0, from the text to be synthesized, with the corresponding acoustic model.
0:12:53 Okay. Next I would like to talk about this maximization problem, which corresponds to acoustic modeling. This is a hidden Markov model, an HMM, with a left-to-right topology, as used in speech recognition systems. We use the same structure for speech synthesis. Please note that the state output probability is defined as a single Gaussian, because that is enough for speech synthesis: we use a speaker-dependent model for speech synthesis.
0:13:35 As I explained, we need to model not only the spectral parameters but also the F0 parameters to resynthesize the speech waveform. Therefore, the state output vector consists of a spectrum part and an F0 part: the spectrum part consists of the mel-cepstral coefficient vector and its delta and delta-delta, and the F0 part consists of F0 and its delta and delta-delta.
0:14:08 The problem in modeling F0 with an HMM is that we cannot apply the conventional discrete or continuous output distributions, because the F0 value is not defined in unvoiced regions. That is, the observation sequence of F0 is composed of one-dimensional continuous values and a discrete symbol which represents "unvoiced". Several heuristic methods have been investigated for handling the unvoiced regions, for example interpolating the gaps, or substituting random values for the unvoiced regions.
0:14:59 To model this kind of observation sequence in a statistically correct manner, we have defined a new kind of HMM. We refer to it as the multi-space probability distribution HMM, or MSD-HMM. It includes the discrete HMM and the continuous mixture HMM as special cases, and furthermore it can model sequences of observation vectors with variable dimensionality, including discrete symbols. This shows the structure of the MSD-HMM specialized for F0 modeling: each state has weights which represent the probabilities of voiced and unvoiced, and a continuous distribution for the voiced observations. An EM algorithm can easily be derived for training this type of HMM.
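[Editor's note: a minimal sketch of the per-state output probability of such an F0 stream; this is my own illustrative code, assuming log-F0 observations and a scalar Gaussian for the voiced space.]

```python
import numpy as np

def msd_state_logprob(obs, w_voiced, mean, var):
    """Log output probability of one MSD-HMM state for an F0 observation.
    obs: a float (e.g. log F0) for a voiced frame, or None for unvoiced.
    w_voiced: the state's voiced-space weight; (mean, var): the voiced Gaussian."""
    if obs is None:                              # discrete "unvoiced" symbol
        return np.log(1.0 - w_voiced)
    # voiced frame: space weight times a continuous Gaussian density
    log_pdf = -0.5 * (np.log(2.0 * np.pi * var) + (obs - mean) ** 2 / var)
    return np.log(w_voiced) + log_pdf
```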
0:16:08 Okay. Combining the spectrum part and the F0 part, the state output distribution has a multi-stream structure like this.
0:16:23 Okay, now I would like to talk about the model structure. In speech recognition, the preceding and succeeding phone identities are regarded as context. On the other hand, in speech synthesis, the current phone identity can also be a context, because, unlike in speech recognition, it is given by the text and does not have to be recognized. Furthermore, there are many other contextual factors that affect the spectrum, F0, and duration, as shown here: for example, the number of phones in the syllable, the position of the current syllable in the current word, part of speech, and other linguistic information. Since there are too many combinations, it is difficult to have all possible models. To avoid this problem, in the same manner as HMM-based speech recognition, we use context-dependent HMMs and apply a decision-tree-based context clustering technique. In this figure, HTK-style triphone labels are shown; however, in the case of speech synthesis the label is very long, because it includes all of this information, so we also list many other questions about this information.
0:18:14 Okay. Each of the spectrum and F0 has its own influential contextual factors, so the distributions for spectrum and F0 should be clustered independently. This results in a stream-dependent context clustering structure, as shown here.
0:18:38 In the standard HMM, the state duration probability is implicit: it decreases exponentially with increasing duration. However, this is too simple to control the temporal structure of the speech parameter sequence. Therefore, we assume that the state durations are Gaussian. Note that an HMM with an explicit duration model is called a hidden semi-Markov model, or HSMM, and we need a special type of EM algorithm for parameter re-estimation of this model.
0:19:23 Okay. As a result, the state durations of each HMM are modeled by a three-dimensional Gaussian, and the context-dependent three-dimensional Gaussians are clustered by a decision tree. So now we have seven decision trees in this example: three for the spectrum, the mel-cepstrum; three for F0; and one for duration.
0:20:00 Okay. Next I would like to talk about the second maximization problem, which corresponds to speech parameter generation from the acoustic model. By concatenating context-dependent HMMs according to the labels derived from the text to be synthesized, a sentence HMM can be composed. For the given sentence HMM, we determine the speech parameter vector sequence o which maximizes the output probability. The summation in this equation can be approximated by a maximization, and it can then be decomposed into two maximization problems: first, we determine the state sequence q-hat independently of o, and then we determine the speech parameter vector sequence o-hat for the fixed state sequence q-hat.
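[Editor's note: in symbols, a reconstruction consistent with the notation read out in the talk:

$$\hat q = \arg\max_{q} P(q \mid \hat l, \lambda), \qquad
\hat o = \arg\max_{o} p(o \mid \hat q, \lambda).$$]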
0:21:18 The first problem can be solved very easily: because the state durations are modeled by Gaussians, the solution is simply given by the means of the duration Gaussians.
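[Editor's note: a small sketch of this duration step. The rho term is a speaking-rate control used in HTS-style systems; treat the exact formula as an assumption rather than what is on the slide.]

```python
import numpy as np

def state_durations(dur_means, dur_vars, rho=0.0):
    """Pick state durations from per-state Gaussian duration models.
    With rho = 0 each duration is just the Gaussian mean; nonzero rho
    shifts durations in proportion to the variances, changing the
    overall speaking rate. Rounded to at least one frame per state."""
    d = np.asarray(dur_means) + rho * np.asarray(dur_vars)
    return np.maximum(1, np.round(d)).astype(int)

# e.g. durations for a 3-state model:
# state_durations([4.1, 7.8, 3.2], [1.0, 2.5, 0.8])
```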
0:21:38 Unfortunately, the direct solution of the second problem is inappropriate for synthesizing speech. This is an example of parameter generation from an HMM composed by concatenating phoneme HMMs. Each vertical dotted line represents a state boundary. We assume that the covariance matrices are diagonal, so each state has its means and variances; for example, this horizontal dotted line represents the mean of this state, and the shaded area represents the variance of this state. By maximizing the output probability, the parameter sequence becomes the mean vector sequence, resulting in a step-wise function like this, because that is the most likely sequence for the sequence of state output Gaussians. These jumps cause discontinuities in the synthetic speech.
0:22:59 To avoid this problem, we assume that each state output vector o consists of the mel-cepstral coefficient vector and its dynamic feature vectors, delta and delta-delta, which correspond to the first and second derivatives of the speech parameter vector c and can be calculated as linear combinations of neighboring speech parameter vectors. Most speech recognition systems also use this type of speech parameters. The relationship between c, delta c, and delta-delta c can be arranged in matrix form, as shown here: o stacks the mel-cepstral coefficient vectors with their delta and delta-delta; c includes all the mel-cepstral coefficient vectors of the utterance; and W is the matrix for calculating the deltas. Under this constraint on o, maximizing p with respect to o is equivalent to maximizing it with respect to c. Thus, by setting the derivative equal to zero, we obtain a set of linear equations, which can be written in matrix form. The dimensionality of the equation is very high, for example tens of thousands, because c includes all the mel-cepstral coefficient vectors of the utterance. Fortunately, by using the special structure of this matrix (it is a very sparse matrix), it can be solved by a fast algorithm.
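[Editor's note: an illustrative sketch of this generation step for a single feature dimension, using dense linear algebra for clarity. The delta windows shown are the common (-0.5, 0, 0.5) and (1, -2, 1) choices, which is an assumption; a real system exploits the band structure rather than calling a dense solver.]

```python
import numpy as np

def mlpg(means, variances):
    """Maximum-likelihood parameter generation with dynamic features.
    means, variances: (T, 3) per-frame Gaussian statistics for
    (static, delta, delta-delta) along the chosen state sequence.
    Returns the smooth static trajectory c of length T by solving
    (W' S^-1 W) c = W' S^-1 mu, the linear system described in the talk."""
    T = means.shape[0]
    W = np.zeros((3 * T, T))
    for t in range(T):
        W[3 * t, t] = 1.0                          # static window
        if 0 < t < T - 1:
            W[3 * t + 1, t - 1] = -0.5             # delta window
            W[3 * t + 1, t + 1] = 0.5
            W[3 * t + 2, t - 1] = 1.0              # delta-delta window
            W[3 * t + 2, t] = -2.0
            W[3 * t + 2, t + 1] = 1.0
    mu = means.reshape(-1)
    prec = 1.0 / variances.reshape(-1)             # diagonal covariances
    A = (W.T * prec) @ W                           # W' S^-1 W (banded, sparse)
    b = (W.T * prec) @ mu                          # W' S^-1 mu
    return np.linalg.solve(A, b)                   # the generated trajectory
```

[Because A is banded, a Cholesky-based recursion solves it in time linear in T; that is the fast algorithm alluded to above.]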
0:25:06 Okay, this is an example of parameter generation from a sentence HMM using the dynamic features. This shows the trajectory of the second coefficient of the generated mel-cepstrum sequence, and these show its delta and delta-delta, which correspond to the first and second derivatives of the trajectory. These three trajectories are constrained by each other and determined simultaneously by maximizing the total output probability. As a result, the trajectory is constrained to be realistic, as defined by the statistics of the static and dynamic features.
0:26:15 You may have noticed that this probability is improper as a distribution of c, because it is not normalized with respect to c. Interestingly, by normalizing it with respect to c, we can derive a new type of trajectory model, which we call the trajectory HMM. I'm sorry, but I won't go into its details in this presentation.
0:26:50 These figures show the spectra calculated from the mel-cepstrum vectors generated without and with the dynamic features, respectively. It can be seen that, by taking the dynamic features into account, a smoothly varying sequence of spectra can be obtained. And these show the generated F0 contours: without the dynamic features, the generated F0 sequence becomes a step-wise function; on the other hand, by taking the dynamic features into account, we can generate F0 trajectories which approximate natural F0 contours.
0:27:40 Okay, now I would like to play some synthesized speech samples to demonstrate the effect of the dynamic features in speech parameter generation. This one was synthesized from the model trained with both static and dynamic features; this one without the spectrum dynamic features; this one without the F0 dynamic features; and this one without both the spectrum and F0 dynamic features. Let me play this one. [plays sample in Japanese] Without the spectrum dynamic features, you may perceive frequent discontinuities. [plays sample] Without the F0 dynamic features, you may perceive a different type of discontinuity. [plays sample] Without both, we may perceive serious discontinuities. [plays sample] And with both. [plays sample] From these examples we can see the importance of the dynamic features in HMM-based speech synthesis.
0:29:30 Okay. In the next part, I will show some examples to demonstrate the flexibility of the statistical approach. First, I would like to show an example of emotional speech synthesis. I'm sorry, this is a very old demo, so the speech quality is not so good. This sample was synthesized from a model trained with neutral speech, and this one from a model trained with angry speech. Again, I'm sorry that it's in Japanese; this is the English translation. First, neutral. [plays sample] It has flat prosody. And from the angry model. [plays sample] Okay, one more sentence: neutral. [plays sample] And angry. [plays sample] It sounds like he's angry. We can see that by training the system with a small amount of emotional speech data, we can synthesize emotional speech very easily; in this approach it is not necessary to hand-craft heuristic rules for emotional speech.
0:31:12 Next, I will show an example of speaker adaptation in speech synthesis. We applied a speaker adaptation technique used in speech recognition, MLLR, to the synthesis system. Here, a speaker-independent model, an average voice model, was adapted to a target speaker, and this is the adapted model. Okay, this sample was synthesized from the speaker-independent model. [plays sample] I'm sorry, it's in Japanese. And this is synthesized speech for the target speaker: it is synthesized, but it has the target speaker's voice characteristics. [plays sample] And these were synthesized from the models adapted with four utterances and with fifty utterances. [plays samples] Let me play them again: the speaker-independent model; adapted with four utterances; adapted with fifty utterances. [plays samples] If these sound very similar, it means that the system can mimic the target speaker's voice using a very small amount of adaptation data.
0:33:00 And we have another sample, mimicking a famous person's voice. [plays sample: "The Tokyo Institute of Technology was founded in 1905 as a pioneering academic institution dedicated to industrial education."] Can you tell who he is? Yes, you're right. Please note that this demo was done by Junichi Yamagishi at the CSTR of the University of Edinburgh, and it was synthesized by the system adapted to that speaker's voice.
0:33:47 Okay, the next example is speaker interpolation in speech synthesis. When we have several speaker-dependent HMM sets, by interpolating among the HMM parameters, the means and variances, we can generate a new HMM set which corresponds to a new voice. In this case we have two speaker-dependent models: one trained by a female speaker and one trained by a male speaker.
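[Editor's note: a minimal sketch of the idea, using simple linear interpolation of Gaussian parameters; published interpolation schemes for this system are more elaborate, so take this as illustration.]

```python
import numpy as np

def interpolate_gaussians(means, variances, weights):
    """Blend the state-output Gaussians of several speaker-dependent
    HMM sets with interpolation weights that sum to one; the blended
    parameters define a new HMM set, i.e. a new voice."""
    w = np.asarray(weights, dtype=float)[:, None]
    mean = (w * np.asarray(means)).sum(axis=0)
    var = (w * np.asarray(variances)).sum(axis=0)
    return mean, var

# a voice halfway between a female model and a male model:
# mu, var = interpolate_gaussians([mu_female, mu_male],
#                                 [var_female, var_male], [0.5, 0.5])
```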
0:34:24 Okay, let me play speech samples synthesized from the female model. [plays sample] And this one was synthesized from the male speaker's model. [plays sample] We can interpolate between these two models with an arbitrary interpolation ratio. This is at the center of the two models; we cannot tell whether the speaker is male or female. [plays sample] And we can change the interpolation ratio gradually within an utterance, from female to male. [plays sample] It sounds male at the end. And this is the same, except that we have four speaker-dependent models: the first speaker, the second, the third, and the fourth. [plays samples] This is at the center of the four speakers. [plays sample] And again we can change the interpolation ratios. [plays samples]
0:36:33 It is interesting, but how could it be used? If we train each model with a specific speaking style, we can interpolate among speaking styles, which could be useful for spoken dialogue systems. In this case we have two models: one trained with a neutral voice and one trained with a high-tension voice, by the same speaker. Okay, first the neutral voice. [plays sample] And the high-tension model. [plays sample] If you feel it's too much, we can adjust the degree of the expression by interpolating between the two models, for example this one. [plays sample] And we can also extrapolate the two models. Let me replay all of them in this order. [plays samples] Please note that it's not just changing the average F0; the whole prosody is changed.
0:38:18 Okay, the next example is the eigenvoice. The eigenvoice technique was developed for very fast speaker adaptation in speech recognition; in speech synthesis, it can be used for creating new voices. [shows demo interface] These sliders represent the weights for the eigenvoices; by adjusting them, we can find a favorite voice. Each eigenvoice, the first eigenvoice, the second eigenvoice, and so on, may correspond to a specific voice characteristic.
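[Editor's note: as a sketch of what sits behind such sliders: PCA over speaker "supervectors", i.e. all mean vectors of a speaker-dependent model stacked into one long vector. The function names are mine, for illustration.]

```python
import numpy as np

def build_eigenvoices(supervectors, k):
    """PCA over speaker supervectors: returns the bias (mean) supervector
    and the top-k principal directions, the 'eigenvoices'."""
    X = np.asarray(supervectors)                 # (n_speakers, dim)
    bias = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - bias, full_matrices=False)
    return bias, Vt[:k]

def new_voice(bias, eigenvoices, weights):
    """A new voice's mean supervector: the bias plus a weighted sum of
    eigenvoices; the weights are what the demo's sliders control."""
    return bias + np.asarray(weights) @ eigenvoices
```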
0:39:20 Let me play some speech samples. The first eigenvoice with a negative weight. [plays sample] And now with a positive weight on the first eigenvoice. [plays sample] I'm sorry, that was the maximum weight. The second eigenvoice with a negative weight. [plays sample] And with a positive weight. [plays sample] And the same for the other eigenvoices. [plays samples] By setting the weights appropriately, we can generate various voices and find our favorite voice. [plays samples] I hope this one is better. Anyway, this shows the flexibility of the statistical approach to speech synthesis.
0:41:05 Okay. Similarly to other corpus-based approaches, the HMM-based system has a very compact language-dependent part, so it can easily be applied to other languages. I would like to play some of them. Japanese. [plays sample] English. [plays sample] Chinese. [plays sample] Korean. [plays sample] Finnish. [plays sample] And this one is also English, but trained on a baby's voice. [plays sample]
0:42:09 And the next example shows that even a singing voice can be used as the training data. As a result, the system can sing any piece of music with his or her voice. This is one of the training data. [plays sample] She is a semi-professional singer. And this sample was synthesized using the trained acoustic model; she had not sung this song. [plays sample] It may sound as if she sang it, but she has never sung this song; it was synthesized.
0:43:31 Okay, this is the final part. I would like to show the basic problem of speech synthesis again, this one. Solving this problem directly, based on this equation, is ideal, but we had to decompose it into tractable sub-problems, because the direct solution is not feasible with currently available computational resources. However, we can relax the approximations. For example, by marginalizing over the acoustic model parameters, the HMM parameters, we can derive a variational Bayesian acoustic modeling technique for speech synthesis. Or, by marginalizing over the labels, we can derive joint front-end and back-end model training, where the front end means the text processing and the back end the acoustic model. Or, by including the speech waveform generation part in the statistical model, we can derive a better statistical vocoder. Anyway, please note that this kind of improved technique can be derived based on this equation, which represents the basic problem of speech synthesis.
0:45:20 Okay, to summarize this presentation: I have talked about the statistical formulation of the speech synthesis problem. The whole speech synthesis process is described in a statistical framework, and it gives us a unified view and reveals what is correct and what is wrong. Another point I should emphasize is the importance of the database. As for future work, we still have many problems which we should solve, based on the equation which represents the speech synthesis problem.

0:46:11 Okay, this is the final slide. Is speech synthesis a messy problem? No, I don't think so. I would be happy if many speech recognition researchers joined speech synthesis research; it must be very helpful to the TTS research area. That's all, thank you very much. [applause]
0:46:57 [Session chair] Thanks for such a good talk. We have some time for questions. Michael?
0:47:03 [Question] Thank you very much for a wonderful talk on speech synthesis. At some point in the future, I guess we won't even have to have our presenters make presentations anymore; we could just synthesize them. [Tokuda] I would like to do that. [Question] So, one of the things you alluded to at the end of your talk, I was wondering if you could elaborate a little bit more. One of the problems you can still hear in some of the examples you played, depending on the speaker, is the quality of the final waveform generation. I'm just wondering if you could say a few words about some of the current techniques that are being looked at in order to improve the quality of the waveform generation. The model you showed at the beginning of the talk is still a relatively simple excitation-driven spectral model, and I know people have looked at fancier stuff. I'm just wondering if you have some comments as to what you think are interesting, promising directions to improve the quality of the waveform generation.
0:48:11 [Tokuda] I didn't mention it, but in the newest system we are using the STRAIGHT vocoding technique, and it improves the speech quality very much. However, I'm afraid that it is not based on the statistical framework. So I would like to include that kind of vocoding part in this equation; it must be this one. But currently we still use many approximations. For example, the formulation assumes Gaussian samples, which is right for unvoiced segments; however, it's not appropriate for periodic segments. So we need a more sophisticated speech waveform generation model, and I believe that can solve that kind of problem with the vocoder. That is what we want to do.
0:49:32 [Question] Hi, yes, I have a couple of questions related to the smoothing of the cepstral coefficients you talked about. The use of the deltas and double-deltas gives you smoothed cepstral coefficients. How important is that, relative to, say, generating static coefficients and then applying a moving-average filter, some smoothing like that? [Tokuda] Okay, good question. I have a slide for that; one moment. [searches slides] Oh, I'm sorry, it was not included in this slide set, but anyway: the delta is very effective, and the delta-delta is not so effective. Of course we can apply some heuristics, lowpass filtering or something like that, but using the delta parameters is still more effective; I have the MOS scores showing that.
0:51:37solve the least squares equations
0:51:40yeah there's no weights
0:51:48yeah O we have a
0:51:50that's no choice weight or some operation
0:51:54we just have a
0:51:56we just have a
0:51:59definition of the probably
0:52:01but it gently
0:52:02and just
0:52:03using
0:52:08 [Question] So I have a question. Obviously we work on both text-to-speech and speech recognition, and there's a frequent comment, which came up in various talks: this whole question of how good an HMM is as a model of speech. The received wisdom is either that it's kind of terrible, or that it's terrible but so tractable that it's useful. Given the success of this technique in speech synthesis, do you think it in fact demonstrates that HMMs are a good model of speech? Because I think the quality is far higher than anybody would have believed possible. And what follows from that for how we should build models?
0:53:00 [Tokuda] Yeah, that's a good question. We have been organizing an evaluation campaign for TTS systems, and we have found that the intelligibility of HMM-based systems is almost perfect, almost comparable to natural speech. But the naturalness is still insufficient compared with natural speech. So I believe we have to improve the prosodic part, within the statistical framework. Also, human speech conveys various non-verbal information, and that cannot be done by the current speech synthesis systems; that kind of aspect of speech should be included.
0:54:19 [Question] So, your talk was very nice. I want to go a little bit further along Paul's line of questioning, because I was thinking about your final call to the ASR community to join you in this stuff. One of the things you're seeing a lot with HMMs in the speech field is that everybody's moving towards discriminative models of various kinds, and the nice thing about HMMs for the synthesis problem is that it really is a generative problem, right? So in some ways the model matches a little better, which is sort of what Paul was touching on. So do you see, moving forward in synthesis, that discriminative techniques are going to be playing a part, or do you think that generative models are definitely going to be the right way to model this kind of thing? [Tokuda] Yeah, a good question. Discriminative training is not needed for speech synthesis; it is not necessary to discriminate between utterances. Another point is that we could set a specific objective function based on human perception, somewhat like discriminative training in speech recognition. But anyway, in speech synthesis the basic problem is generation, so we can concentrate on generative models; it's not necessary to tackle discriminative training. That is a nice point of speech synthesis research.
0:56:09 [Inaudible follow-up question] [Tokuda] Yeah, but I want to do that kind of optimization in the statistical framework: by changing the hyperparameters or the model structure, we can do it within the statistical framework.
0:57:01 [Question] I've got a related question: you generate the maximum-likelihood sequence, but if you have a really good generative model, we'd really like to sample stochastically. [Tokuda] We are doing something like that. Let me show this slide. Okay: these are the given variables, and this is the speech waveform, and this is the predictive distribution, and we do sample the speech waveform; exciting the speech synthesis filter by Gaussian white noise is exactly such sampling. The speech parameters, the mel-cepstrum and F0, should be marginalized out in this equation; as a crude approximation, we generate them with the maximum likelihood criterion. It's just an approximation, but that is the criterion. Does it make sense?
0:58:35 [Question] Well, I guess I'm wondering whether it's a good approximation. [Tokuda] Yeah, that is the reason why we have to reduce or remove, that is, relax, the approximations as future work. [Session chair] Okay, I think it's time to close, so let's thank the speaker again. [applause]