Speech Transcript - Speech Synthesis as A Statistical Machine Learning Problem

0:00:14	okay so i'm pleased to introduce the next guest speaker who is keiichi tokuda from the nagoya institute of
0:00:20	technology
0:00:22	he's extremely well known but for those who don't know he's the pioneer of statistical speech synthesis in particular
0:00:30	hmm based speech synthesis
0:00:33	right a single actual not so like i
0:00:38	i
0:00:39	i
0:00:41	okay
0:00:44	i'm afraid
0:00:45	that
0:00:47	most
0:00:47	speech recognition researchers
0:00:50	regard speech synthesis
0:00:52	as a messy problem
0:00:55	that is the reason why
0:00:56	yeah
0:00:58	i will
0:01:00	talk about a statistical formulation of speech synthesis
0:01:04	in this presentation
0:01:07	okay
0:01:09	to realise speech synthesis systems
0:01:12	many approaches have been proposed
0:01:15	before the nineties
0:01:17	rule based formant synthesis had been studied
0:01:21	in this case speech parameters are controlled by hand crafted rules
0:01:27	after the nineties
0:01:28	the corpus based concatenative speech synthesis
0:01:31	approach became dominant
0:01:33	state of the art
0:01:35	speech synthesis systems
0:01:36	based on unit selection can generate natural sounding speech
0:01:41	in recent years
0:01:43	the statistical parametric speech synthesis approach has gained popularity
0:01:49	it has
0:01:50	several advantages
0:01:53	such as
0:01:54	flexibility in voice characteristics
0:01:57	a small footprint
0:01:59	automatic voice building
0:02:00	and so on
0:02:02	and among them
0:02:03	the most important
0:02:05	advantage of the statistical approach
0:02:08	is that
0:02:09	we can use mathematically well defined models and algorithms
0:02:14	in this talk
0:02:15	i would like to discuss how we can formulate
0:02:18	and understand the whole speech synthesis process
0:02:22	including speech feature extraction acoustic modeling text processing and so on in a unified statistical framework
0:02:33	okay
0:02:34	the basic problem
0:02:36	of speech synthesis
0:02:38	can be stated as shown here
0:02:43	we have a speech database
0:02:46	that is
0:02:49	yeah
0:02:50	a set of texts
0:02:52	and corresponding speech waveforms
0:02:57	given a text
0:02:58	to be synthesized
0:03:00	what is the speech waveform corresponding to the text
0:03:07	the problem can be represented by this equation
0:03:13	and it can be solved
0:03:15	by estimating the
0:03:17	predictive distribution
0:03:19	given these variables
0:03:23	and then drawing samples
0:03:26	from the predictive distribution
0:03:29	basically it's quite simple
0:03:32	however
0:03:35	estimating that
0:03:36	predictive distribution
0:03:38	is very hard
0:03:40	so
0:03:42	we have to introduce acoustic model parameters
0:03:47	this is the acoustic model for example an hmm
0:03:52	and this part corresponds to the training part
0:03:57	and this part corresponds
0:04:00	to the generation part
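
The equation on the slide is not preserved in the transcript. A plausible reconstruction is sketched below. All symbol names are assumptions: X and W for the speech waveforms and texts in the database, w for the text to be synthesized, x for the waveform to be generated, and lambda for the acoustic model parameters.

```latex
\begin{align*}
x &\sim p(x \mid w, X, W) \\
p(x \mid w, X, W) &= \int p(x \mid w, \lambda)\, p(\lambda \mid X, W)\, d\lambda
\end{align*}
```

Under this reading, the factor p(lambda | X, W) would be the training part and p(x | w, lambda) the generation part.
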
0:04:03	first i'd like to discuss the generation part
0:04:12	as we know modeling speech waveform
0:04:14	directly by
0:04:17	acoustic models is very difficult
0:04:19	so we have to introduce
0:04:21	a parametric representation of the speech waveform
0:04:25	O
0:04:27	is a parametric representation of the speech waveform
0:04:31	for example cepstrum or mel cepstrum and F zero
0:04:38	accordingly this generation part
0:04:43	is decomposed into these two terms
0:04:49	we also know
0:04:53	that text should be converted to labels
0:04:57	because the same text
0:04:59	can have multiple pronunciations
0:05:02	parts of speech lexical stress
0:05:06	or other information
0:05:08	so that generation part
0:05:11	is decomposed
0:05:12	into these three terms
0:05:17	text processing
0:05:19	and
0:05:20	speech
0:05:22	parameter generation from the acoustic model
0:05:25	and speech waveform reconstruction
0:05:31	and it is difficult to perform the integral and summation
0:05:36	over all the variables
0:05:40	so we approximate them by joint maximization as shown here
0:05:46	however
0:05:48	joint maximization is still hard
0:05:51	so it
0:05:52	is approximated by a step by step maximization problem
0:05:58	this corresponds to the training part
0:06:01	and
0:06:02	this
0:06:04	maximization with respect to the labels
0:06:07	corresponds to
0:06:10	text analysis
0:06:11	and this corresponds to
0:06:14	speech parameter generation from the acoustic model
0:06:19	i
0:06:20	talked about the generation part
0:06:24	but the training part
0:06:26	also requires a parametric representation of the speech waveform and labels
0:06:36	accordingly the
0:06:38	training part
0:06:40	can be approximated by a step by step maximization problem in a similar manner to the generation part
0:06:49	labeling
0:06:50	of the speech database
0:06:53	feature extraction of the speech database
0:06:56	and acoustic model training
0:07:01	as a result
0:07:03	the original problem
0:07:05	is
0:07:06	decomposed into these sub-problems
0:07:09	those are for the
0:07:11	training part and those are
0:07:14	for the generation part
0:07:16	feature extraction
0:07:18	of the speech database
0:07:20	labeling
0:07:21	and acoustic model training
0:07:24	and text analysis
0:07:25	of
0:07:27	the text
0:07:28	to be synthesized
0:07:29	and the speech parameter generation from acoustic model
0:07:33	and finally yeah we reconstruct speech waveform
0:07:37	by sampling from this
0:07:39	distribution
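
The six sub-problems listed above can be written as a chain of separate optimizations. The notation below is an assumed reconstruction, since the slide equations are not in the transcript: O and L are the speech parameters and labels of the database, o and l those of the sentence to be synthesized, and lambda the acoustic model.

```latex
\begin{align*}
\hat{O} &= \arg\max_{O} p(X \mid O) && \text{feature extraction} \\
\hat{L} &= \arg\max_{L} p(L \mid W) && \text{labeling} \\
\hat{\lambda} &= \arg\max_{\lambda} p(\hat{O} \mid \hat{L}, \lambda)\, p(\lambda) && \text{acoustic model training} \\
\hat{l} &= \arg\max_{l} p(l \mid w) && \text{text analysis} \\
\hat{o} &= \arg\max_{o} p(o \mid \hat{l}, \hat{\lambda}) && \text{speech parameter generation} \\
x &\sim p(x \mid \hat{o}) && \text{waveform reconstruction}
\end{align*}
```
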
0:07:44	okay
0:07:46	i just talked about the
0:07:48	mathematical formulation
0:07:50	in the following
0:07:51	i'd like to explain
0:07:53	each component step by step
0:07:56	and then
0:07:57	show examples to demonstrate the flexibility of the statistical approach
0:08:04	and finally give some discussion and conclusions
0:08:10	okay
0:08:12	this is the overview of an hmm based speech synthesis system
0:08:17	the training part is similar to that used in hmm based speech recognition systems
0:08:23	the essential difference
0:08:25	is that the state output vector includes
0:08:29	not only spectrum parameters
0:08:32	for example mel-cepstrum
0:08:34	but also excitation parameters that is F zero parameters
0:08:39	on the other hand
0:08:40	the synthesis part
0:08:43	does the inverse operation of speech recognition
0:08:48	that is
0:08:49	phoneme hmms
0:08:50	are concatenated according to the labels
0:08:54	derived from the text
0:08:56	to be synthesized
0:08:58	then
0:08:59	a sequence of speech parameters
0:09:02	spectrum parameters and F zero parameters
0:09:06	is determined in such a way that its output probability for the hmm is maximized
0:09:13	and finally
0:09:14	the speech waveform
0:09:17	is synthesized by using a speech synthesis filter
0:09:21	and each part corresponds to the
0:09:25	sub-problems
0:09:28	that is
0:09:30	feature extraction
0:09:32	and model training
0:09:34	and text analysis for the text to be synthesized
0:09:37	and speech parameter generation from the trained acoustic model
0:09:43	and speech waveform reconstruction
0:09:47	first
0:09:48	i'd like to talk about speech feature extraction
0:09:52	and speech waveform reconstruction which correspond to these sub-problems
0:10:03	they are based on the source-filter model of human speech production
0:10:10	in this presentation
0:10:11	i assume the
0:10:13	system function
0:10:15	H of Z is represented by mel-cepstral coefficients
0:10:21	that is
0:10:22	frequency warped cepstral coefficients
0:10:25	defined by this equation
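
The slide equation is not in the transcript. The usual definition of a mel-cepstral representation of H(z) is sketched below, where the order M, the coefficients c(m) and the warping parameter alpha are notation assumed here.

```latex
\begin{align*}
H(z) &= \exp \sum_{m=0}^{M} c(m)\, \tilde{z}^{-m} \\
\tilde{z}^{-1} &= \frac{z^{-1} - \alpha}{1 - \alpha z^{-1}}, \qquad |\alpha| < 1
\end{align*}
```
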
0:10:28	the frequency warping function defined by this
0:10:32	first order allpass system function
0:10:34	gives us a good approximation to auditory frequency scales
0:10:40	with an appropriate choice of alpha
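
To make the warping concrete, here is a minimal sketch that evaluates the phase response of a first-order all-pass function, which acts as the warped frequency axis. The function name `allpass_warp` is hypothetical, and the value 0.42 for the warping parameter is only a commonly quoted choice for 16 kHz speech, assumed here for illustration.

```python
import numpy as np

def allpass_warp(omega, alpha):
    """Warped frequency (radians) given by the phase response of the
    first-order all-pass function (z^-1 - alpha) / (1 - alpha z^-1)."""
    omega = np.asarray(omega, dtype=float)
    return np.arctan2((1.0 - alpha**2) * np.sin(omega),
                      (1.0 + alpha**2) * np.cos(omega) - 2.0 * alpha)

# Assumed example value: alpha = 0.42 is often used for 16 kHz sampling.
omega = np.linspace(0.0, np.pi, 9)
warped = allpass_warp(omega, alpha=0.42)
```

With a positive warping parameter the low frequencies are stretched and the high frequencies compressed, which is the qualitative behaviour of auditory scales. Setting the parameter to zero leaves the axis unchanged.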
0:10:45	by assuming X
0:10:47	X is
0:10:49	a short segment of a speech waveform
0:10:52	assuming X is a gaussian process
0:10:55	we determine C
0:10:57	the mel-cepstrum
0:10:58	in such a way that
0:11:00	its likelihood
0:11:01	with respect to X
0:11:04	is maximized
0:11:05	this is just the maximum likelihood estimation of mel-cepstral coefficients
0:11:10	because P of X
0:11:12	is convex with respect to C
0:11:15	the solution can easily be obtained by an iterative algorithm
0:11:22	okay
0:11:23	to resynthesize speech
0:11:26	H of Z is controlled according to the estimated mel-cepstrum
0:11:32	and excited by a pulse train
0:11:34	or white noise
0:11:36	for voiced and unvoiced segments respectively
0:11:42	this is the
0:11:43	pulse train
0:11:47	and this is white noise
0:11:51	the excitation signal is generated based on voiced unvoiced information and F zero
0:11:58	extracted from the original speech
0:12:01	this is the original speech and this is the excitation signal
0:12:12	it has the same F zero
0:12:15	at this point
0:12:17	and by exciting a speech synthesis filter controlled by mel-cepstral coefficient vectors
0:12:24	with this excitation signal we can reconstruct the speech waveform
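
The excitation described above can be sketched in a few lines. This is an illustrative sketch, not the system's implementation: the function name `make_excitation`, the 16 kHz sampling rate, the 80-sample frame shift and the pulse amplitude scaling are all assumptions.

```python
import numpy as np

def make_excitation(f0_per_frame, sample_rate=16000, frame_shift=80, seed=0):
    """Build an excitation signal from frame-wise F0 values (0 means unvoiced)."""
    rng = np.random.default_rng(seed)
    excitation = np.zeros(len(f0_per_frame) * frame_shift)
    next_pulse = 0.0  # sample position of the next pitch pulse
    for i, f0 in enumerate(f0_per_frame):
        start, end = i * frame_shift, (i + 1) * frame_shift
        if f0 > 0:  # voiced frame: pulses spaced one pitch period apart
            period = sample_rate / f0
            next_pulse = max(next_pulse, start)
            while next_pulse < end:
                excitation[int(next_pulse)] = np.sqrt(period)  # roughly unit power
                next_pulse += period
        else:  # unvoiced frame: white Gaussian noise
            excitation[start:end] = rng.standard_normal(frame_shift)
            next_pulse = end
    return excitation

# Two unvoiced frames followed by two voiced frames at 200 Hz.
excitation = make_excitation([0.0, 0.0, 200.0, 200.0])
```

Passing this signal through a synthesis filter controlled by the mel-cepstral coefficients would then give the reconstructed waveform. The filter itself is outside the scope of this sketch.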
0:12:30	i don't you have somebody else et cetera
0:12:34	so now the problem
0:12:36	is
0:12:37	how we can
0:12:38	generate both speech parameters
0:12:42	from the text
0:12:43	to be synthesized with the corresponding acoustic
0:12:48	model
0:12:53	okay
0:12:55	next i'd like to talk about this maximization problem
0:12:58	which corresponds to acoustic modeling
0:13:03	this is a hidden markov model hmm with a left to right topology
0:13:09	which is used in speech recognition systems
0:13:12	we also use the same structure for speech synthesis
0:13:16	please note that the state output probability is defined as
0:13:21	a single gaussian because
0:13:24	it's enough for speech synthesis since we are using a speaker-dependent model
0:13:30	for speech synthesis
0:13:35	as i explained
0:13:36	we need to model not only spectral parameters
0:13:40	but also F zero parameters to resynthesize the speech waveform
0:13:44	so the state output vector consists of
0:13:48	a spectrum part
0:13:50	and an F zero part
0:13:52	the spectrum part consists of the mel-cepstral coefficient vector
0:13:58	and its delta and delta-delta
0:14:01	and the F zero part consists of F zero and its delta and delta-delta
0:14:08	the problem
0:14:10	in modeling F zero by an hmm
0:14:14	is that
0:14:15	we cannot apply the conventional discrete or continuous distributions
0:14:21	because
0:14:22	F zero values are
0:14:24	not defined in the unvoiced region
0:14:27	that is
0:14:28	the observation sequence of F zero is composed of
0:14:33	one dimensional continuous values
0:14:37	and a discrete symbol which represents unvoiced
0:14:42	several heuristic methods have been investigated for handling the unvoiced region
0:14:49	for example
0:14:50	interpolating the gaps
0:14:52	or substituting random values for the unvoiced regions
0:14:59	to model this kind of observation sequence in a statistically correct manner
0:15:05	we have defined a new kind of hmm
0:15:08	yeah
0:15:10	we refer to it as the multi-space probability distribution hmm
0:15:14	or msd hmm
0:15:16	it includes the discrete hmm and the continuous mixture hmm
0:15:21	as special cases
0:15:24	and furthermore it can model sequences of
0:15:28	observation vectors with variable dimensionality including discrete symbols
0:15:35	this shows the structure of the msd hmm
0:15:38	specialised for F zero modeling
0:15:42	each state
0:15:43	has weights
0:15:45	which represent
0:15:47	the probabilities
0:15:48	of voiced
0:15:50	and unvoiced
0:15:52	and a
0:15:53	continuous distribution for voiced
0:15:56	observations
0:15:58	that is not bad
0:16:00	an em algorithm can easily be derived for training this type of hmm
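
A minimal sketch of the per-state output likelihood this structure implies is given below. The function name, the use of `None` for an unvoiced frame, and the single Gaussian over log F0 are assumptions made for illustration.

```python
import math

def msd_f0_log_likelihood(log_f0, voiced_weight, mean, variance):
    """Log-likelihood of one F0 observation under an MSD-style state.

    log_f0 is None for an unvoiced frame, otherwise a continuous value.
    The state holds a voiced/unvoiced weight and a Gaussian for voiced frames.
    """
    if not 0.0 < voiced_weight < 1.0:
        raise ValueError("voiced_weight must lie strictly between 0 and 1")
    if log_f0 is None:
        # Unvoiced: a zero-dimensional space, so only the weight contributes.
        return math.log(1.0 - voiced_weight)
    log_gauss = -0.5 * (math.log(2.0 * math.pi * variance)
                        + (log_f0 - mean) ** 2 / variance)
    return math.log(voiced_weight) + log_gauss

unvoiced_ll = msd_f0_log_likelihood(None, 0.8, 5.0, 0.04)
voiced_ll = msd_f0_log_likelihood(5.0, 0.8, 5.0, 0.04)
```

Because both cases are proper likelihoods, no interpolated or random F0 values need to be invented for unvoiced frames.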
0:16:08	okay
0:16:09	by combining the spectrum part and F zero part the state output distribution
0:16:15	has a
0:16:16	multi-stream structure
0:16:18	like this
0:16:23	okay
0:16:25	now
0:16:26	i'd like to talk about
0:16:28	the model structure
0:16:30	in speech recognition
0:16:32	preceding and succeeding phone identities are regarded as context
0:16:39	on the other hand
0:16:40	in speech synthesis
0:16:43	current phone identity can also be a context
0:16:47	because i
0:16:48	no it's not necessary to know
0:16:51	what the speech recognition result
0:16:54	furthermore
0:16:55	there are
0:16:57	many other contextual factors
0:16:59	that affect
0:17:01	spectrum
0:17:02	F zero
0:17:03	and duration as shown here
0:17:06	for example the number of
0:17:09	phones in the current syllable
0:17:12	or
0:17:13	the current syllable in the current word or part of speech or other linguistic information and so on
0:17:22	since there are
0:17:24	too many combinations
0:17:26	it's difficult to have all possible models
0:17:31	to avoid the problem in the same manner as hmm based speech recognition
0:17:36	we use context-dependent hmms
0:17:39	and apply a decision tree based context clustering technique
0:17:44	in this figure
0:17:47	htk style triphone labels are shown
0:17:51	however
0:17:52	in the case of speech synthesis the label is very long because it
0:17:58	includes
0:18:00	all this information
0:18:02	so we also list many other questions
0:18:07	about
0:18:09	this information
0:18:14	okay
0:18:15	spectrum and F zero each have their own influential contextual factors so the distributions for spectrum
0:18:23	and F zero should be clustered independently
0:18:27	it results in
0:18:30	a stream dependent context clustering structure
0:18:34	i strongly
0:18:38	in the standard hmm case
0:18:40	the state duration probability decreases exponentially with increasing duration
0:18:49	however
0:18:50	it's too simple to control the temporal structure of the speech parameter sequence
0:18:56	therefore
0:18:57	we assume that the state
0:19:00	durations
0:19:01	are gaussian
0:19:03	and note that the hmm with an explicit duration model is called
0:19:10	a hidden semi-markov model
0:19:12	or hsmm
0:19:15	and we need a special type of em algorithm for parameter re-estimation of this model
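
The contrast between the two duration models can be written out explicitly. In this sketch a is the self-transition probability of a state, d the number of frames spent in it, and m and sigma squared the mean and variance of the explicit duration Gaussian. These symbols are notation assumed here, not taken from the slides.

```latex
\begin{align*}
p_{\text{HMM}}(d) &= a^{\,d-1}(1 - a), \qquad d = 1, 2, \dots \\
p_{\text{HSMM}}(d) &= \mathcal{N}(d \mid m, \sigma^{2})
\end{align*}
```

The first form always peaks at a duration of one frame, which is why it gives little control over temporal structure.
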
0:19:23	okay as a result state durations of each hmm
0:19:29	are modeled
0:19:30	by a three dimensional
0:19:33	gaussian
0:19:35	and
0:19:36	context-dependent three dimensional gaussians
0:19:39	are clustered by
0:19:41	a decision tree
0:19:43	so now we have
0:19:45	seven decision trees in this example
0:19:48	three for spectrum that is mel-cepstrum
0:19:52	and three for F zero
0:19:54	and one for duration
0:20:00	okay
0:20:01	next i'd like to talk about the second maximization problem
0:20:05	which corresponds to speech parameter generation
0:20:09	from the acoustic model
0:20:12	by concatenating context-dependent hmms
0:20:16	according to the labels derived from the text to be synthesized
0:20:21	a sentence hmm can be
0:20:25	composed
0:20:28	for a given sentence hmm
0:20:32	we determine the speech parameter vector sequence
0:20:35	O
0:20:37	which maximizes
0:20:38	the output probability
0:20:41	P
0:20:43	this equation can be approximated by this one
0:20:48	the summation is approximated by maximization
0:20:52	furthermore it can be decomposed into two maximization problems
0:20:58	first
0:21:00	we determine the state sequence Q hat
0:21:03	independently of O then
0:21:05	we
0:21:06	determine
0:21:08	the speech parameter vector sequence
0:21:10	O
0:21:11	hat
0:21:12	for the
0:21:13	fixed state sequence
0:21:16	Q hat
0:21:18	the first problem
0:21:20	can be solved
0:21:21	very easily
0:21:23	because state durations are modelled by gaussians
0:21:29	the solution is simply given by the means of the gaussians
0:21:33	a postage or some other
0:21:38	unfortunately
0:21:39	the direct solution for the
0:21:42	second problem is inappropriate for synthesizing speech
0:21:48	this is an example of parameter generation from an hmm
0:21:52	composed by concatenation of phoneme hmms
0:21:58	each vertical dotted line
0:22:03	represents a state boundary
0:22:08	we assume that the covariance matrix is diagonal
0:22:12	so each state has its mean and variance
0:22:16	for example this
0:22:19	horizontal dotted line
0:22:21	represents the mean of this state and the shaded area
0:22:26	represents the
0:22:27	variance
0:22:28	of this state
0:22:31	by maximizing the output probability
0:22:34	the parameter sequence becomes the mean vector sequence
0:22:39	resulting in a step wise function like this
0:22:42	because
0:22:43	this is the most likely sequence for the sequence of a state of gaussians
0:22:50	and these jumps
0:22:52	cause discontinuities in synthetic speech
0:22:59	to avoid
0:23:00	the problem
0:23:02	we assume that each state output vector O
0:23:06	consists of the mel-cepstral coefficient
0:23:09	vector
0:23:10	and its dynamic feature vectors
0:23:13	delta and delta-delta
0:23:15	which correspond to the first
0:23:17	and second derivatives
0:23:19	of the speech parameter vector C
0:23:23	and can be calculated as a linear combination of neighboring speech parameter vectors
0:23:31	most speech recognition systems also use this type of speech parameters
0:23:35	the
0:23:36	relationship
0:23:39	between
0:23:41	C and delta C and delta-delta C can be arranged in a matrix form
0:23:47	as shown here
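
The matrix form on the slide is not in the transcript. One common way to write it is sketched below. The particular window coefficients for delta and delta-delta are an assumed, typical choice rather than something stated in the talk.

```latex
\begin{align*}
o_t &= \begin{bmatrix} c_t^{\top} & \Delta c_t^{\top} & \Delta^{2} c_t^{\top} \end{bmatrix}^{\top} \\
\Delta c_t &= \tfrac{1}{2}\,(c_{t+1} - c_{t-1}) \\
\Delta^{2} c_t &= c_{t-1} - 2 c_t + c_{t+1} \\
O &= W C
\end{align*}
```

Here W is a sparse band matrix holding the window coefficients, C stacks the static vectors of all frames, and O stacks the full output vectors.
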
0:23:49	C is the
0:23:51	mel-cepstral coefficient vector
0:23:54	and delta and delta-delta and this is the state output vector
0:24:01	and
0:24:03	C includes all
0:24:06	mel-cepstral coefficient vectors for the utterance
0:24:09	and W is the matrix for calculating delta and delta-delta
0:24:16	under this constraint
0:24:18	on O
0:24:20	maximizing P with respect to O
0:24:24	is equivalent to that with respect to C
0:24:28	thus by setting the derivative equal to zero we obtain a set of linear equations
0:24:34	which can be written in
0:24:36	matrix form
0:24:38	the dimensionality
0:24:40	of the equation is very high
0:24:43	for example tens of thousands because C includes all the mel-cepstral coefficient vectors for the utterance
0:24:52	fortunately
0:24:53	by using the special structure of this matrix
0:24:58	it's a very sparse matrix
0:25:00	it can be solved by
0:25:02	a fast algorithm
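
As an illustration of the linear system described above, the following sketch solves W' S W c = W' S mu for one scalar parameter per frame, where S is the diagonal inverse covariance and mu stacks the state means along the fixed state sequence. The window coefficients, the edge handling and the function names are assumptions. A dense solver is used for brevity, whereas a practical system would exploit the band structure.

```python
import numpy as np

# (offset, coefficient) pairs for the static, delta and delta-delta windows.
WINDOWS = [
    [(0, 1.0)],
    [(-1, -0.5), (1, 0.5)],
    [(-1, 1.0), (0, -2.0), (1, 1.0)],
]

def build_window_matrix(num_frames):
    """Stack the windows into W so that o = W @ c (one scalar per frame)."""
    W = np.zeros((len(WINDOWS) * num_frames, num_frames))
    for t in range(num_frames):
        for k, window in enumerate(WINDOWS):
            for offset, coeff in window:
                tau = min(max(t + offset, 0), num_frames - 1)  # repeat edge frames
                W[len(WINDOWS) * t + k, tau] += coeff
    return W

def generate_trajectory(means, variances):
    """means, variances: shape (T, 3) with static/delta/delta-delta statistics."""
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    W = build_window_matrix(means.shape[0])
    precision = 1.0 / variances.reshape(-1)        # diagonal of the inverse covariance
    A = W.T @ (precision[:, None] * W)             # W' S W
    b = W.T @ (precision * means.reshape(-1))      # W' S mu
    return np.linalg.solve(A, b)

# Two states with static means 0 and 1 and zero-mean dynamic features.
step_means = [[0.0, 0.0, 0.0]] * 5 + [[1.0, 0.0, 0.0]] * 5
trajectory = generate_trajectory(step_means, np.ones((10, 3)))
```

Without the dynamic feature rows the solution would be the step-wise mean sequence. With them, the jump between the two states is spread over several frames.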
0:25:06	okay
0:25:07	this is an example of
0:25:09	parameter generation
0:25:12	from a sentence hmm using dynamic feature parameters
0:25:18	this shows
0:25:21	the trajectory
0:25:23	of the
0:25:24	second coefficient
0:25:26	of the generated mel cepstrum
0:25:30	sequence
0:25:32	and
0:25:33	these
0:25:34	show its delta
0:25:36	and delta-delta which correspond to the first
0:25:40	and second derivatives of the
0:25:43	trajectory
0:25:46	these three
0:25:47	trajectories are constrained by each other
0:25:51	and determined simultaneously
0:25:54	by maximizing
0:25:56	the total output probability
0:25:59	as a result
0:26:00	the trajectory
0:26:03	is constrained to be realistic as defined by the statistics
0:26:08	of static and dynamic features
0:26:15	you may have noticed that
0:26:19	P of O
0:26:21	is improper as a distribution of C
0:26:25	because it's not normalized
0:26:27	with respect to C
0:26:29	interestingly by normalizing
0:26:32	P with
0:26:34	respect to C we can derive a new type of trajectory model which we call the
0:26:40	trajectory hmm
0:26:42	oh i'm sorry but i won't go into details in this presentation
0:26:50	okay these figures show the spectra calculated from the mel cepstrum vectors generated
0:26:56	without dynamic feature parameters
0:26:59	and with dynamic feature parameters respectively
0:27:03	it can be seen that by taking into account
0:27:06	dynamic feature parameters
0:27:10	a smoothly varying sequence of spectra can be obtained
0:27:16	and these show the generated F zero patterns
0:27:19	without dynamic features
0:27:22	the generated F zero sequence becomes a step wise function
0:27:27	on the other hand by taking into account dynamic features
0:27:31	we can generate F zero trajectories
0:27:34	which approximate the natural F zero pattern
0:27:40	okay
0:27:41	now i would like to play some synthesized speech samples to show the effect of dynamic features
0:27:49	in speech parameter generation
0:27:54	this was synthesized
0:27:57	from the model trained with both
0:28:00	static and dynamic features
0:28:04	and this was synthesized
0:28:06	without spectrum dynamic features
0:28:09	and this was
0:28:11	synthesized without
0:28:13	F zero dynamic features
0:28:15	and this was synthesized without both spectrum and F zero dynamic features
0:28:22	as the mean they this one
0:28:26	like you know you're not model i mean and sorry again
0:28:30	jeez a known only on like you know you're the model i mean and
0:28:34	it's some of
0:28:35	and without
0:28:37	spectrum dynamic features you may perceive frequent discontinuities in this
0:28:45	these are not lonely and that you know you wanna model i mean i
0:28:48	can't find it
0:28:51	and now without F zero dynamic features
0:28:54	in this case you may perceive a different type of discontinuity
0:28:59	jeez on only on like you know you on i mean and without both you may perceive serious discontinuities
0:29:09	these are known only and then you know you want to model i mean again we both
0:29:15	jeez unknown only and like you know you're the model i mean and
0:29:18	yep
0:29:19	from these examples we can see the importance of dynamic features in hmm based speech synthesis
0:29:30	okay
0:29:31	in the next part
0:29:33	i'd like to show some
0:29:34	examples
0:29:36	to demonstrate the flexibility of the statistical approach
0:29:43	first i'd like to show an example of emotional speech synthesis
0:29:50	i'm sorry but
0:29:52	this is a very old demo so the speech quality
0:29:58	is bad
0:30:01	and
0:30:02	this sample was synthesized from a model trained with
0:30:07	neutral speech
0:30:10	and this was synthesized from the model trained with angry
0:30:16	speech
0:30:17	this that the case again i'm sorry that it's in japanese
0:30:22	this is english translation
0:30:25	just a neutral
0:30:27	people who need it i and unable maintain okay it has flat prosody
0:30:34	and from the angry model
0:30:37	J i
0:30:41	okay
0:30:42	one of the sentence
0:30:44	neutral
0:30:45	meeting anyone i if you can have enough time
0:30:48	yeah i
0:30:52	it sounds like he's angry
0:30:54	yeah
0:30:56	and we see that by training the system with a small amount of emotional speech data we can synthesize
0:31:01	emotional speech very easily it's not necessary
0:31:05	to hand craft heuristic rules for emotional speech
0:31:12	next i'd like to show an example of speaker adaptation in speech synthesis
0:31:17	we applied a speaker adaptation technique used in speech recognition that is mllr to the synthesis system
0:31:26	and this is the speaker independent model
0:31:31	and it was adapted to a target speaker
0:31:35	and this is
0:31:36	the adapted model
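
For reference, the core of mean adaptation in mllr is an affine transform of the Gaussian means. In the sketch below, mu is a mean vector of the speaker independent model, and A and b are a transform estimated from the adaptation data by maximum likelihood and shared across many Gaussians. The symbols are notation assumed here.

```latex
\hat{\mu} = A \mu + b
```

Because one transform is shared by many distributions, only a small amount of adaptation data is needed to move the whole model toward the target speaker.
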
0:31:39	okay this sample
0:31:41	was
0:31:42	synthesized from the
0:31:45	speaker independent model
0:31:48	for channel sometime recognition
0:31:51	okay i'm sorry
0:31:53	it's in japanese
0:31:55	and this was synthesized
0:31:58	oh this is synthesized speech
0:32:00	but the for speaker i
0:32:04	so this is inside speech but yeah it has speech bias voice
0:32:08	a voice characteristics
0:32:11	can also i snuck
0:32:13	and this was synthesized from the adapted model
0:32:18	with
0:32:20	four utterances
0:32:22	oh yeah no sunlight recognition and fifty utterances
0:32:27	of course you know something that is not let me play them again
0:32:32	speaker independent model
0:32:34	cocaine or something recognition for utterances
0:32:38	yeah no sunlight recognition
0:32:42	not something that is not
0:32:45	and also i recognition
0:32:48	if
0:32:48	these three sound
0:32:51	very similar it means that the system can mimic the target speaker's voice using a very small amount of
0:32:58	adaptation data
0:33:00	and then we have another sample
0:33:02	mimicking a
0:33:04	famous person's voice
0:33:10	institute of technology energy was founded in nineteen O five isn't going to hire technical to pioneering academic institution dedicated
0:33:18	to industrial education
0:33:21	can you find for hey
0:33:24	yes
0:33:26	you're right
0:33:28	please note that
0:33:31	this was done by junichi yamagishi at C S T R of the university of edinburgh
0:33:38	and they
0:33:39	yeah it was us inside by the system adapted to justify
0:33:47	okay
0:33:48	next example is speaker interpolation in speech synthesis
0:33:53	when we have several speaker dependent hmm sets
0:33:57	by interpolating among the hmm parameters
0:34:01	means and variances
0:34:04	we can generate a new hmm set
0:34:07	which correspond to a new voice
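
A minimal sketch of this kind of interpolation for one pair of corresponding Gaussians is shown below. Linearly interpolating both the means and the variances is only one of several possible schemes and is assumed here for simplicity. The function name is hypothetical.

```python
import numpy as np

def interpolate_gaussians(means, variances, weights):
    """Linearly combine corresponding Gaussians from several speaker models.

    means, variances: shape (num_speakers, dim); weights: shape (num_speakers,).
    Weights that sum to one but leave the [0, 1] range extrapolate instead
    of interpolate; variances can then become invalid, so check them.
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must sum to one")
    return weights @ means, weights @ variances

# Halfway between a hypothetical female and male state distribution.
mid_mean, mid_var = interpolate_gaussians(
    means=[[0.0, 2.0], [4.0, 6.0]],
    variances=[[1.0, 1.0], [3.0, 3.0]],
    weights=[0.5, 0.5],
)
```

Applying the same weights to every state of the two models gives a new model set, and changing the weights over time gives a voice that moves from one speaker to the other.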
0:34:10	in this case we have two speaker-dependent models
0:34:15	one
0:34:16	is trained by a female speaker
0:34:19	and one
0:34:20	is trained by a male speaker
0:34:24	okay let me play
0:34:26	speech samples
0:34:27	synthesized from
0:34:29	female model
0:34:36	sorry okay
0:34:42	i
0:34:42	and this was synthesized from a male speaker's model
0:34:49	well when i don't and we can interpolate between these two models with an arbitrary interpolation ratio
0:34:59	this is the center of
0:35:00	the two
0:35:02	models
0:35:03	we cannot tell whether he or she is
0:35:06	male or female
0:35:09	and
0:35:14	and
0:35:15	we can change the interpolation ratio linearly within an utterance
0:35:21	from female to male
0:35:24	well
0:35:32	we do not know
0:35:37	sounds like
0:35:38	male finally
0:35:41	and this is the same except we have four speaker-dependent
0:35:46	models
0:35:47	the first speaker
0:35:48	and when the second speaker per speaker i don't want to know in a manner i don't have for speaker
0:36:06	and this is the center of these four speakers
0:36:14	and then we can also change the interpolation ratio
0:36:17	and
0:36:23	oh in a manner another
0:36:25	i don't know
0:36:31	oh in
0:36:33	it is interesting
0:36:34	but
0:36:35	could be used to this
0:36:38	i
0:36:41	yeah
0:36:42	if we train each model
0:36:45	with a specific speaking style we can interpolate among speaking styles so it could be useful for spoken dialogue systems
0:36:54	in this case
0:36:55	we have two models
0:36:57	one trained with
0:36:59	a neutral
0:37:01	voice
0:37:03	and one trained with
0:37:06	high tension voice
0:37:07	by the same speaker
0:37:10	okay
0:37:11	first neutral voice
0:37:14	oh
0:37:18	and the high tension model
0:37:21	i
0:37:22	i
0:37:25	if you feel it's too much
0:37:28	we can adjust the
0:37:30	degree of the expression by interpolating between two models
0:37:35	for example this one
0:37:41	and we can
0:37:43	also extrapolate between the two models
0:37:47	let me play all of them
0:37:50	in this order
0:37:53	oh
0:38:02	i
0:38:03	i
0:38:05	oh
0:38:06	oh
0:38:08	oh
0:38:09	please note that
0:38:11	it's not just changing the average F zero the prosody itself is changed
0:38:18	okay
0:38:19	next example is eigenvoice
0:38:22	the eigenvoice technique was developed for very fast speaker adaptation in speech recognition
0:38:30	in speech synthesis
0:38:32	it can be used for creating new voices
0:38:37	image of something more
0:38:56	okay
0:38:58	these represent weights
0:39:01	for eigenvoices
0:39:03	by adjusting them we can find a favourite voice
0:39:09	each eigenvoice first eigenvoice and second eigenvoice
0:39:12	it's eigenvoice
0:39:14	may correspond to a specific voice characteristic
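
The idea of building a voice from weighted eigenvoices can be sketched as follows. The sketch assumes each speaker-dependent model has been flattened into a supervector of its mean parameters and that the eigenvoices are obtained by principal component analysis. Both the setup and the function names are assumptions for illustration.

```python
import numpy as np

def train_eigenvoices(supervectors, num_eigenvoices):
    """Return the mean supervector and the leading principal directions."""
    X = np.asarray(supervectors, dtype=float)
    mean = X.mean(axis=0)
    # Rows of vt are orthonormal directions ordered by explained variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:num_eigenvoices]

def synthesize_voice(mean, eigenvoices, weights):
    """Build a new supervector as the mean plus a weighted sum of eigenvoices."""
    return mean + np.asarray(weights, dtype=float) @ eigenvoices

# Five hypothetical speakers with eight model parameters each.
rng = np.random.default_rng(0)
speaker_supervectors = rng.standard_normal((5, 8))
mean_voice, eigenvoices = train_eigenvoices(speaker_supervectors, num_eigenvoices=2)
new_voice = synthesize_voice(mean_voice, eigenvoices, weights=[1.5, -0.5])
```

Moving a single weight from negative to positive while the others stay fixed corresponds to the kind of slider adjustment demonstrated in the talk.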
0:39:20	let me play some speech samples
0:39:23	but
0:39:25	for the
0:39:27	first eigenvoice with a negative weight
0:39:33	yeah
0:39:35	i
0:39:36	okay
0:39:38	and now with a
0:39:39	positive weight for the first eigenvoice
0:39:42	okay
0:39:44	oh no what contributes to say okay
0:39:48	i'm sorry that this is the maximum with the ball
0:39:52	and the second eigenvoice with a negative weight
0:40:00	and with a
0:40:01	positive weight
0:40:03	okay they're not what makes you don't sound that made on for eigenvoice
0:40:14	and we've
0:40:15	was divided up with the weight
0:40:19	i
0:40:20	yeah
0:40:23	at
0:40:24	and by setting the weights appropriately we can obtain various voices
0:40:29	and find our favourite voice
0:40:35	some them then
0:40:37	but i
0:40:39	i
0:40:41	i hope
0:40:43	this is better
0:40:44	okay
0:40:46	the
0:40:49	anyway this shows the flexibility of the statistical approach to speech synthesis
0:41:05	okay
0:41:06	similarly to other corpus based approaches
0:41:09	the hmm based
0:41:10	system has a
0:41:12	very compact language dependent part
0:41:16	so it can easily be applied to other languages
0:41:20	i'd like to play some of them
0:41:24	japanese change that you know you and i one and i mean i'm sorry
0:41:29	english
0:41:30	you would not keep the truth from chinese
0:41:34	well or from grand cherokee
0:41:40	korean
0:41:41	then they can match the categories and the finnish
0:41:46	only taken a little mental but it once again i must be sent to contain an snr essential or several
0:41:52	minima
0:41:53	and this is also english but trained with a
0:41:58	baby
0:41:59	voice
0:42:04	yeah i
0:42:07	okay
0:42:09	and now
0:42:10	the next example
0:42:12	shows that
0:42:13	even
0:42:14	a singing voice can be used as training data
0:42:18	as a result
0:42:20	the system can sing any piece of music
0:42:23	with his or her voice and singing style
0:42:28	and
0:42:30	this is one of the training data
0:42:37	okay
0:42:44	she is a semi professional singer
0:42:48	and now
0:42:50	the server
0:42:52	and
0:42:53	and anyway and
0:42:55	this sample
0:42:56	was
0:42:57	synthesized
0:42:58	by using the trained acoustic model
0:43:03	she has not
0:43:05	sung this song
0:43:11	yeah
0:43:17	oh i
0:43:23	maybe it sounds that are
0:43:25	but we have not seen this story
0:43:28	us in this
0:43:31	okay
0:43:34	this is the final part
0:43:40	i'd like to show the
0:43:42	basic problem of speech synthesis okay this one
0:43:49	solving this problem directly
0:43:53	based on
0:43:56	this equation
0:43:59	is ideal
0:44:00	but we have to decompose it into tractable sub-problems because the
0:44:08	direct solution is not feasible with currently available computational resources
0:44:15	however we can relax the approximation
0:44:21	for example
0:44:23	by marginalizing model parameters
0:44:26	that is hmm or acoustic model parameters
0:44:28	we can derive a variational bayesian acoustic modeling technique for speech synthesis
0:44:34	or
0:44:36	by marginalizing labels
0:44:38	we can derive
0:44:40	joint front-end and back-end model training
0:44:44	the front end means text processing
0:44:47	and the back end means the acoustic model
0:44:50	or by including the speech
0:44:53	waveform generation part in the statistical model we can also derive a waveform level
0:45:01	statistical model
0:45:03	anyway please note that
0:45:05	these kinds of improved techniques
0:45:08	can be derived
0:45:09	based on
0:45:11	this equation which represents
0:45:13	the basic problem
0:45:15	of speech synthesis
0:45:20	okay
0:45:22	to summarize this presentation
0:45:25	i have talked about the statistical formulation of the speech synthesis problem
0:45:30	the whole speech synthesis process is described in a statistical framework
0:45:35	and it gives us a unified view and reveals
0:45:39	what is correct and what is wrong
0:45:43	another point i should
0:45:46	emphasize is
0:45:47	the importance of the database
0:45:51	future work
0:45:52	still we have many problems
0:45:55	which we should solve
0:45:57	based on
0:46:02	the
0:46:03	equation which represents the speech synthesis problem
0:46:11	okay this is the
0:46:13	final slide
0:46:15	is speech synthesis
0:46:17	a messy problem
0:46:20	no i don't think so
0:46:23	i would be happy
0:46:25	if many speech recognition researchers join speech synthesis research
0:46:31	it must be a very attractive
0:46:33	research area
0:46:35	that's all thank you very much
0:46:38	i
0:46:57	yeah thanks for such a good talk we have some time for questions
0:47:01	michael
0:47:03	oh thank you very much for a wonderful talk on speech synthesis at some point in the future i guess
0:47:09	we don't even have to have our presenters make presentations anymore we could just synthesise them i would like to
0:47:16	do that
0:47:18	i'm not the fifteen you speaking
0:47:21	so one of the quest one of the things you alluded to at the end of your talk i was wondering
0:47:25	if you could elaborate a little bit more
0:47:28	one of the problems you can still hear in some of the examples you played is a certain buzziness
0:47:33	dependent upon the speaker in the quality of the final waveform generation i'm just wondering if you could say a
0:47:39	few words about some of the current
0:47:42	techniques that are being looked at in order to improve the quality of the waveform generation the model you
0:47:51	showed at the beginning of the talk is still a relatively simple excitation and spectral sort of model and
0:47:58	i know people have looked at fancier stuff
0:48:01	i'm just wondering if you have some comments as to what you think are interesting promising directions to improve the
0:48:08	quality of the waveform generation yep
0:48:11	i didn't
0:48:13	mention but in the newest system we are using the STRAIGHT vocoding technique and it can improve
0:48:22	the speech quality very much
0:48:25	however
0:48:27	i'm afraid that
0:48:29	it's not based on
0:48:32	the statistical
0:48:34	framework
0:48:36	so
0:48:37	i would like to include that kind of most you
0:48:42	the vocoding part should be included in this equation
0:48:47	that must be
0:48:49	this one
0:48:50	i but
0:48:51	currently
0:48:53	we still
0:48:54	use
0:48:55	many approximations
0:48:57	for example
0:48:58	the formulation
0:49:00	is
0:49:02	correct
0:49:02	for gaussian processes
0:49:05	it is right for unvoiced segments
0:49:07	however it's
0:49:08	not appropriate for
0:49:10	periodic signals
0:49:12	so we need a more sophisticated speech waveform generation model
0:49:17	and i believe that
0:49:19	that can for that kind
0:49:21	problem
0:49:22	the vocal
0:49:25	it is
0:49:27	the onset do
0:49:32	hi yes i have a couple of questions related to the smoothing of the cepstral coefficients we talked about
0:49:38	i'm so the use of the deltas and double deltas gives you the smoothed
0:49:42	cepstral coefficients
0:49:45	how is how important is that relative to say representing or generating static coefficients and then perhaps applying a moving
0:49:55	a moving average filter some somewhere
0:49:58	smoothing like that
0:49:59	okay good question
0:50:04	the initial
0:50:06	slide
0:50:08	i have with one
0:50:13	wait a moment
0:50:47	oh i'm sorry if it was not shown clearly in the slide but anyway
0:50:54	delta is very effective
0:50:57	and
0:50:58	delta-delta is not so effective
0:51:00	and of course we can apply some heuristics such as smoothing by filtering or something like that
0:51:07	it's
0:51:09	still effective
0:51:11	but it is worse
0:51:13	than using the delta and delta-delta parameters
0:51:17	i have the mos scores
0:51:20	i'm sorry i can't find it one other follow-up question on that when you set up those linear equations
0:51:26	do you
0:51:27	weight
0:51:29	all the different dimensions equally all the deltas and double deltas and the static coefficients are those all weighted
0:51:36	equally when you
0:51:37	solve the least squares equations
0:51:40	yeah there's no weights
0:51:48	yeah we have
0:51:50	no choice of weights or such an operation
0:51:54	we just have a
0:51:56	we just have a
0:51:59	definition of the probability
0:52:01	density
0:52:02	and just
0:52:03	using
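[Editor's sketch] The exchange above concerns the linear equations that turn static and delta statistics into a smooth cepstral trajectory. The following is a minimal illustration, not the speaker's implementation. It assumes one cepstral dimension, Gaussian means with diagonal variances per frame, and a single delta window of 0.5 * (c[t+1] - c[t-1]) clamped at the edges. All function and variable names are hypothetical. Under these assumptions no hand-set weights appear: the inverse variances from the probability density act as the only weighting in the normal equations (W' P W) c = W' P mu.

```python
def build_window_matrix(num_frames):
    """Return W (2T x T) mapping a static track c to stacked [c_t, delta_t]."""
    W = [[0.0] * num_frames for _ in range(2 * num_frames)]
    for t in range(num_frames):
        W[2 * t][t] = 1.0                      # static row: picks c_t
        prev_t = max(t - 1, 0)                 # clamp the window at the edges
        next_t = min(t + 1, num_frames - 1)
        W[2 * t + 1][next_t] += 0.5            # delta row: 0.5 * (c[t+1] - c[t-1])
        W[2 * t + 1][prev_t] -= 0.5
    return W


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def generate_trajectory(static_means, delta_means, static_vars, delta_vars):
    """Most likely static track given Gaussian static and delta statistics."""
    T = len(static_means)
    W = build_window_matrix(T)
    mu, prec = [], []
    for t in range(T):
        mu += [static_means[t], delta_means[t]]
        prec += [1.0 / static_vars[t], 1.0 / delta_vars[t]]  # implicit weights
    rows = range(2 * T)
    # Normal equations: A = W' P W and b = W' P mu, with P = diag(prec).
    A = [[sum(W[r][i] * prec[r] * W[r][j] for r in rows) for j in range(T)]
         for i in range(T)]
    b = [sum(W[r][i] * prec[r] * mu[r] for r in rows) for i in range(T)]
    return solve(A, b)


# A step in the static means is smoothed because the delta means are zero.
step_means = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
smooth_track = generate_trajectory(step_means, [0.0] * 6, [1.0] * 6, [0.1] * 6)
```

Making the delta variances very large in this sketch switches the smoothing off, which is one way to see that the trade-off between static and dynamic features comes from the model statistics rather than from tuned weights.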
0:52:08	so i have a question
0:52:10	so obviously we've
0:52:13	in texas speech local optimum
0:52:15	speech recognition but it's
0:52:18	a frequent comment and it came up in various talks is you know this whole question of how good
0:52:23	is the hmm as a model of speech and the
0:52:26	received wisdom is either that it's kind of terrible or
0:52:30	it's terrible but it's so tractable and useful that we use it but in speech synthesis do you think that the success of this
0:52:36	technique
0:52:37	has
0:52:38	in fact demonstrated that hmms are a good model of speech
0:52:41	because i think
0:52:42	the quality is far
0:52:44	higher than anything anybody would have believed
0:52:46	possible
0:52:47	and what follows
0:52:50	this workshop on the
0:52:52	what is the same channel build models
0:53:00	yeah that's
0:53:01	yeah a good but difficult question
0:53:03	and
0:53:06	anyway
0:53:10	to
0:53:13	we have been organising the blizzard challenge
0:53:17	it's an
0:53:19	evaluation campaign
0:53:21	of speech synthesis systems
0:53:22	and we have found that
0:53:24	the intelligibility of hmm based systems
0:53:29	is almost perfect
0:53:31	almost
0:53:32	comparable to that of natural speech
0:53:35	but still the naturalness
0:53:39	is
0:53:40	insufficient compared with natural speech
0:53:43	so maybe you
0:53:47	prosody
0:53:48	so
0:53:49	oh
0:53:50	i believe that we have to improve the prosodic part
0:53:55	well
0:53:58	of the
0:53:59	statistical model
0:54:01	maybe
0:54:04	human speech conveys various non-verbal information
0:54:09	but it cannot be done in the current speech synthesis
0:54:13	system
0:54:14	that kind of aspect of speech should be included
0:54:19	so
0:54:20	your talk was very nice
0:54:22	i want to go a little bit further along paul's line of questioning because i was thinking about your
0:54:28	final call to the asr community to join in you know this stuff
0:54:32	one of the things you're seeing a lot with hmm stuff in the speech community is that
0:54:37	everybody's moving towards discriminative models of various kinds and whatnot and the nice thing about
0:54:42	you know
0:54:43	the hmms for the synthesis problem is it really is a generative problem right so in some ways
0:54:49	the model matches a little better which is sort of what paul was touching on
0:54:53	so do you
0:54:54	see
0:54:55	in moving forward in synthesis that
0:54:59	discriminative techniques are gonna be
0:55:02	playing a part in that kind of thing or do you think that you know generative you know bayesian
0:55:08	or generative
0:55:09	models are definitely gonna be the right way to
0:55:12	model this kind of thing
0:55:13	yeah a good question
0:55:17	discriminative training
0:55:21	does not allow for speech utterances
0:55:25	it is not necessary to discriminate
0:55:28	and
0:55:29	another point
0:55:31	that
0:55:32	we can set a specific
0:55:36	objective function based on human perception
0:55:39	don't
0:55:40	quickly like
0:55:42	a discriminative training in speech recognition
0:55:46	but anyway yeah
0:55:49	in speech synthesis
0:55:50	the basic problem
0:55:52	is
0:55:53	generation
0:55:54	so we can concentrate on
0:55:58	the generative model
0:55:59	it's not necessary to struggle
0:56:01	with discriminative training
0:56:04	that is a nice point of speech synthesis research
0:56:09	but it in
0:56:30	i
0:56:33	i
0:56:36	oh
0:56:40	yeah
0:56:41	yeah
0:56:42	but
0:56:43	i want to
0:56:45	do that kind of
0:56:48	optimization in a statistical framework
0:56:51	by changing the pdf parameters or the model structure we can do that in the statistical framework
0:57:01	and i've got a related question you so you generate the maximum likelihood sequence
0:57:07	if you have a really good
0:57:08	and generative model we'd really like to sample stochastically
0:57:18	there's not that
0:57:20	we are using sampling
0:57:22	because
0:57:31	i'm sure this
0:57:37	okay
0:57:38	those are given variables
0:57:41	and this is the speech waveform
0:57:44	and a dct predicted distribution and we have something a speech waveform
0:57:50	the exciting speech synthesis filter by a gaussian
0:57:55	what noise
0:57:56	it's
0:57:57	just or something
0:58:01	and that speech parameters mel cepstrum or this result if a zero S
0:58:09	is marginalised in the equation
0:58:13	so as a pokemon approximation we add generate it we imagine criteria
0:58:21	maximum likelihood criterion
0:58:24	it's just approximation
0:58:26	but
0:58:27	this criterion is
0:58:29	something
0:58:31	does it make sense
0:58:35	well i guess i'm wondering whether it's good approximation of
0:58:40	yeah that's the reason why we have to reduce or remove
0:58:45	we wanna relax the approximation in the future
0:58:49	future work
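[Editor's sketch] The answer above describes waveform generation as sampling: a synthesis filter is excited with Gaussian white noise. The following is a minimal illustration of that single idea, not the system described in the talk. It swaps in a plain all-pole (autoregressive) filter for the mel-cepstral synthesis filter, and the function name, coefficient values and seed handling are all hypothetical. The filter coefficients are held fixed, which corresponds to treating the speech parameters as already chosen (for example by the maximum likelihood criterion) and drawing only the waveform at random.

```python
import random


def sample_waveform(ar_coeffs, gain, num_samples, seed=0):
    """Draw one waveform from a Gaussian all-pole model.

    x[n] = gain * e[n] - sum_k ar_coeffs[k-1] * x[n-k],  with e[n] ~ N(0, 1)
    """
    rng = random.Random(seed)          # local generator keeps draws reproducible
    x = []
    for n in range(num_samples):
        excitation = rng.gauss(0.0, 1.0)               # gaussian white noise
        feedback = sum(a * x[n - k]
                       for k, a in enumerate(ar_coeffs, start=1)
                       if n - k >= 0)
        x.append(gain * excitation - feedback)
    return x


# Two draws from the same fixed filter: same distribution, different waveforms.
draw_a = sample_waveform([-0.9], gain=1.0, num_samples=100, seed=1)
draw_b = sample_waveform([-0.9], gain=1.0, num_samples=100, seed=2)
```

Each call with a new seed gives a different waveform from the same distribution, which is the sense in which noise excitation is sampling rather than maximisation.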
0:58:52	okay i think it's time to close so let's thank the speaker
0:59:00	i

Speech Synthesis as A Statistical Machine Learning Problem

Invited Speakers

Keiichi Tokuda (Nagoya Institute of Technology)