Speech Transcript - UTILIZING GLOTTAL SOURCE PULSE LIBRARY FOR GENERATING IMPROVED EXCITATION SIGNAL FOR HMM-BASED SPEECH SYNTHESIS

0:00:14	Okay, hello everybody.
0:00:15	I'm Tuomo Raitio from Aalto University,
0:00:18	Helsinki, Finland,
0:00:20	and I'm going to talk about HMM-based
0:00:22	speech synthesis
0:00:23	and how to improve its quality
0:00:26	by utilizing a glottal source pulse library.
0:00:30	And this is,
0:00:31	um,
0:00:32	made in collaboration with
0:00:34	my colleagues
0:00:35	Antti Suni and Martti Vainio
0:00:37	from the University of Helsinki,
0:00:40	and Hannu Pulakka and Paavo Alku
0:00:42	from Aalto University.
0:00:45	Okay, so here's the content
0:00:47	of my talk,
0:00:48	so let's go straight to the background.
0:00:52	So, the basics:
0:00:53	the goal of text-to-speech is to generate natural-sounding, expressive speech
0:00:58	from arbitrary text, and there are
0:01:00	two major TTS trends.
0:01:02	One is unit selection, which is based on
0:01:04	concatenating prerecorded
0:01:06	acoustic units,
0:01:07	and this yields,
0:01:08	um,
0:01:10	very good quality at its best,
0:01:12	but the adaptability
0:01:14	is somewhat poor.
0:01:16	The other method is
0:01:17	statistical,
0:01:18	which is based on modeling speech parameters with hidden Markov models,
0:01:22	and it has better adaptability.
0:01:24	And
0:01:25	this work
0:01:26	is about statistical
0:01:29	synthesis.
0:01:31	So,
0:01:33	but the problem is that the quality is not too good.
0:01:36	So our proposal for this
0:01:39	is
0:01:39	to
0:01:40	decompose the speech signal into the glottal source
0:01:43	signal and the vocal tract transfer function,
0:01:47	and second,
0:01:48	um,
0:01:49	to further decompose the glottal source into several parameters
0:01:53	and a glottal pulse library.
0:01:56	And then we model these parameters in a normal
0:01:59	HMM-based
0:02:00	speech synthesis framework,
0:02:01	HTS.
0:02:03	And
0:02:03	at the synthesis stage
0:02:05	we reconstruct the,
0:02:07	um,
0:02:08	source signal from the pulses and the parameters
0:02:12	and filter it with the vocal tract filter.
0:02:17	So, the basics. The source of the, um,
0:02:21	voiced speech
0:02:22	is the glottal excitation,
0:02:25	and then the
0:02:26	signal goes through the vocal tract and then we have speech.
0:02:29	So we are interested in this glottal excitation very much in this work.
0:02:34	So,
0:02:36	and, uh,
0:02:36	up here
0:02:38	we have a speech signal
0:02:41	and the estimated
0:02:42	glottal flow below that.
0:02:44	And
0:02:45	how can we
0:02:46	estimate the signal?
0:02:48	We can, for example, use a method called glottal inverse filtering,
0:02:52	which estimates
0:02:53	the glottal source signal
0:02:54	from the speech signal itself.
0:02:57	There are several methods
0:02:58	to perform this task.
0:03:00	I won't go further into that,
0:03:03	but we use
0:03:05	a method that is based on iterative
0:03:07	LPC,
0:03:08	the use of LPC.
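
The idea of inverse filtering mentioned above can be illustrated with a minimal Python sketch. This is not the speaker's iterative method: it is a single-pass LPC inverse filter (autocorrelation method with the Levinson-Durbin recursion), and the function names `autocorrelation`, `levinson_durbin`, `lpc`, and `inverse_filter` are illustrative assumptions. The residual it produces is only a crude stand-in for a glottal source estimate, because a plain LPC fit also absorbs the spectral tilt of the source into the filter.

```python
def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of one (already windowed) frame."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Return inverse-filter coefficients a[0..order] with a[0] == 1."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k                  # updated prediction error
    return a


def lpc(x, order):
    """Estimate the all-pole (vocal tract) model of a frame."""
    return levinson_durbin(autocorrelation(x, order), order)


def inverse_filter(x, a):
    """Apply the FIR filter A(z): e[n] = sum_j a[j] * x[n - j]."""
    return [sum(a[j] * x[n - j] for j in range(len(a)) if n - j >= 0)
            for n in range(len(x))]
```

Calling `inverse_filter(frame, lpc(frame, order))` removes the estimated all-pole envelope from a frame. An iterative scheme such as the one described in the talk would repeat this kind of estimation to separate the source and vocal tract contributions more carefully.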
0:03:13	Okay, and then to the speech synthesis system.
0:03:16	So this is
0:03:17	very familiar
0:03:18	to most of you,
0:03:20	but I will go through this fast. So we have a
0:03:22	speech database and then we parameterize it
0:03:25	and train
0:03:26	the parameters according to the labels.
0:03:30	And at the synthesis stage
0:03:31	we input text, and
0:03:35	we can generate parameters according to the text, and so we can reconstruct
0:03:39	speech.
0:03:40	And in this work we are interested in these
0:03:43	parameterization and synthesis steps,
0:03:46	and
0:03:48	by improving these we
0:03:49	try to make the speech
0:03:51	more natural.
0:03:54	So what we do in speech parameterization is we first
0:03:57	window the signal, of course, and
0:04:00	[unclear]
0:04:01	and then we do
0:04:04	glottal inverse filtering, so we decompose the speech signal
0:04:07	into the physiologically corresponding parts, which are the
0:04:11	glottal source
0:04:12	and, uh, vocal tract.
0:04:15	We parameterize the vocal tract with LSFs
0:04:19	and further
0:04:21	parameterize the source with several parameters,
0:04:24	which are
0:04:25	fundamental frequency,
0:04:26	harmonic-to-noise ratio,
0:04:28	spectral tilt with LPC, and
0:04:31	harmonic magnitudes
0:04:33	in the lower band.
0:04:35	And finally
0:04:37	we extract the glottal pulses to the library
0:04:41	and link the pulses with the corresponding
0:04:43	source parameters.
0:04:46	um
0:04:47	So how do we do that?
0:04:49	First,
0:04:50	we determine the glottal closure instants
0:04:53	from the differentiated glottal flow signal,
0:04:57	and then we extract each complete
0:05:00	two-period glottal source segment
0:05:02	and window it with the Hann window.
0:05:06	Then we link it
0:05:08	with
0:05:09	the corresponding glottal source parameters, which are the energy, fundamental frequency,
0:05:14	voice source spectrum, harmonic-to-noise ratio, and the harmonics.
0:05:19	And in addition we store
0:05:22	a downsampled ten-millisecond version of the pulse waveform
0:05:26	in order to
0:05:28	calculate the concatenation cost
0:05:30	at the synthesis stage.
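
The extraction steps described above can be sketched roughly as follows in Python. This is a minimal illustration under stated assumptions, not the speaker's implementation: the glottal closure instants are assumed to be given as sample indices, only energy and fundamental frequency are stored (the talk also lists the voice source spectrum, harmonic-to-noise ratio, and harmonics), and the "downsampled version" is interpreted here as a fixed-length, linearly resampled copy. The names `hann`, `resample_linear`, `extract_pulses`, and the dictionary keys are hypothetical.

```python
import math


def hann(n):
    """Symmetric Hann window of length n."""
    if n < 2:
        return [1.0] * n
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def resample_linear(x, n_out):
    """Linearly interpolate x to a fixed length n_out (n_out >= 2)."""
    if len(x) == 1:
        return [x[0]] * n_out
    out = []
    for i in range(n_out):
        pos = i * (len(x) - 1) / (n_out - 1)
        lo = int(pos)
        hi = min(lo + 1, len(x) - 1)
        frac = pos - lo
        out.append((1.0 - frac) * x[lo] + frac * x[hi])
    return out


def extract_pulses(source, gcis, fs, short_len=10):
    """Cut two-period, Hann-windowed pulses between closure instants.

    source: estimated glottal source samples
    gcis:   glottal closure instants as increasing sample indices (assumed given)
    fs:     sampling rate in Hz
    """
    library = []
    for prev, mid, nxt in zip(gcis, gcis[1:], gcis[2:]):
        segment = source[prev:nxt + 1]          # spans two pitch periods
        window = hann(len(segment))
        pulse = [s * w for s, w in zip(segment, window)]
        library.append({
            "pulse": pulse,
            "center": mid - prev,               # closure instant inside the pulse
            "f0": 2.0 * fs / (nxt - prev),      # two periods span nxt - prev samples
            "energy": sum(p * p for p in pulse),
            "short": resample_linear(pulse, short_len),
        })
    return library
```

Each library entry keeps the full waveform for synthesis and the short copy for cheap comparison between neighbouring pulses.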
0:05:35	And the pulse library may consist of hundreds
0:05:38	or even thousands of glottal pulses.
0:05:41	And here is an example of some of the pulses,
0:05:46	um,
0:05:47	from a male speaker.
0:05:51	Okay.
0:05:52	And then the synthesis stage.
0:05:55	So what we do is we want to reconstruct the voiced
0:05:59	excitation,
0:06:00	so we select
0:06:01	the
0:06:02	best-matching pulses
0:06:04	according to the
0:06:05	source parameters generated by the HMMs
0:06:08	and, um,
0:06:10	the library.
0:06:12	We scale the magnitudes
0:06:14	of the pulses
0:06:15	and overlap-add them
0:06:17	to create the voiced excitation.
0:06:19	For unvoiced excitation we use only white noise.
0:06:23	Finally, we filter the combined excitation and get the synthetic
0:06:27	speech.
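
The scaling and overlap-add step described above might look like the following Python sketch. It assumes that the selected pulses, their centre offsets, and the target pulse positions are already known; the function names `scale_to_energy`, `overlap_add`, and `white_noise` are illustrative, and matching the pulse energy is only one plausible way to "scale the magnitudes".

```python
import math
import random


def scale_to_energy(pulse, target_energy):
    """Gain that makes the pulse energy equal to target_energy."""
    energy = sum(v * v for v in pulse)
    return math.sqrt(target_energy / energy) if energy > 0.0 else 0.0


def overlap_add(pulses, centers, positions, gains, length):
    """Place each pulse so its centre sample lands at the given position."""
    out = [0.0] * length
    for pulse, center, pos, gain in zip(pulses, centers, positions, gains):
        start = pos - center
        for i, v in enumerate(pulse):
            n = start + i
            if 0 <= n < length:        # drop samples falling outside the buffer
                out[n] += gain * v
    return out


def white_noise(length, gain, rng=None):
    """Gaussian white noise for the unvoiced excitation."""
    rng = rng or random.Random()
    return [gain * rng.gauss(0.0, 1.0) for _ in range(length)]
```

In a full synthesizer the voiced and unvoiced excitations would be combined and passed through the vocal tract filter; that filtering step is left out of this sketch.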
0:06:30	And
0:06:31	the pulses are
0:06:33	selected
0:06:34	by minimizing the joint cost composed of target
0:06:37	and concatenation costs.
0:06:39	So the target cost is the
0:06:41	root mean square error between the voice source parameters
0:06:44	generated by the HMMs and the ones stored
0:06:46	for each pulse.
0:06:48	And we can of course have different weights
0:06:50	for different parameters
0:06:52	to tune
0:06:53	this system.
0:06:55	And the concatenation cost is the RMS
0:06:58	error between the downsampled versions of the pulses
0:07:01	being concatenated.
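
The joint cost described above can be sketched in Python as follows. The data layout (a `params` vector and a `short` downsampled waveform per library pulse), the weight normalization inside `weighted_rms`, and the greedy left-to-right selection are assumptions made for illustration; the talk only states that both costs are RMS errors and that the weights are tunable.

```python
import math


def weighted_rms(a, b, weights=None):
    """Weighted root mean square error between two equal-length vectors."""
    if weights is None:
        weights = [1.0] * len(a)
    num = sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b))
    return math.sqrt(num / sum(weights))


def joint_cost(target, candidate, previous,
               w_target=1.0, w_concat=1.0, param_weights=None):
    """Target cost plus concatenation cost for one candidate pulse."""
    cost = w_target * weighted_rms(target, candidate["params"], param_weights)
    if previous is not None:
        # Compare the downsampled waveforms of neighbouring pulses.
        cost += w_concat * weighted_rms(previous["short"], candidate["short"])
    return cost


def select_greedy(targets, library, **weights):
    """Pick, pulse by pulse, the library entry with the lowest joint cost."""
    chosen = []
    previous = None
    for target in targets:
        best = min(library,
                   key=lambda c: joint_cost(target, c, previous, **weights))
        chosen.append(best)
        previous = best
    return chosen
```

Raising `w_concat` favours smooth pulse sequences over close parameter matches, which is the trade-off the weights are meant to tune.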
0:07:08	Okay, here's an example of,
0:07:11	um,
0:07:12	how well this goes.
0:07:14	So here's
0:07:15	the voiced excitation.
0:07:25	And
0:07:26	and
0:07:29	and here is the unvoiced excitation,
0:07:32	less interesting.
0:07:40	And
0:07:40	then combined, I will play
0:07:42	that too.
0:07:44	And with the filtering we finally get
0:07:47	the speech.
0:07:50	[audio sample played]
0:07:58	So that was Finnish, so probably you didn't understand.
0:08:01	I have more samples
0:08:02	later.
0:08:05	Well, first, the results.
0:08:07	So
0:08:08	previously we had used in the same system
0:08:10	only one glottal pulse, which was modified
0:08:13	according to the voice source parameters, and we had the result that
0:08:16	it was preferred over the
0:08:19	basic STRAIGHT method.
0:08:22	And
0:08:23	we have some samples from
0:08:24	that system.
0:08:35	[audio samples played]
0:08:56	So we also participated in the Blizzard Challenge
0:08:59	2010
0:09:00	with
0:09:01	more results,
0:09:04	and here are some samples from that.
0:09:10	[audio samples played]
0:09:37	So the quality is, uh,
0:09:39	quite good.
0:09:40	And here are the samples comparing the single pulse technique and the
0:09:45	pulse library technique.
0:09:49	I hope you can hear the differences in this,
0:09:52	but sometimes the difference is not so big.
0:09:55	So maybe I'll play these samples.
0:10:12	[audio samples played]
0:10:27	Yeah,
0:10:28	and
0:10:30	I heard some,
0:10:32	some differences,
0:10:33	yes.
0:10:35	um
0:10:36	Okay, here are spectrograms comparing
0:10:39	the
0:10:40	difference in quality, if you don't hear it.
0:10:42	You can see, for example, that, uh, here
0:10:45	the noise
0:10:46	is modeled better,
0:10:48	uh,
0:10:49	with this pulse library technique [unclear] the single pulse technique,
0:10:53	and
0:10:54	I suppose voiced fricatives here
0:10:56	are modeled better because the single pulse technique couldn't
0:10:59	produce,
0:11:00	um,
0:11:02	soft pulses,
0:11:04	and high frequencies as well
0:11:05	are more natural.
0:11:11	And we conducted some
0:11:14	listening tests,
0:11:16	and we found that the
0:11:18	method was slightly preferred over the single pulse technique,
0:11:23	but the difference wasn't so great,
0:11:25	but the, uh,
0:11:27	speaker similarity
0:11:28	was better,
0:11:31	and
0:11:32	very many,
0:11:34	um,
0:11:35	sounds were lots more natural.
0:11:38	Yeah,
0:11:38	this is
0:11:39	kind of, um,
0:11:41	concatenation of the source,
0:11:43	so there are the same problems as in concatenative synthesis,
0:11:47	so we have some discontinuities there
0:11:50	and some more artifacts
0:11:52	compared to the single pulse technique.
0:12:00	Okay, here's the summary. So we have
0:12:03	a physiologically motivated high-quality speech synthesizer,
0:12:06	and this allows
0:12:08	for better [unclear] and control of the speech parameters
0:12:12	and speech
0:12:13	characteristics, and
0:12:15	this pulse library generates more natural excitation
0:12:18	because it
0:12:19	[unclear]
0:12:21	in the library,
0:12:22	and it is slightly preferred over the single pulse technique.
0:12:26	And there are the
0:12:27	references,
0:12:28	and I thank you for your attention.
0:12:38	Time for questions.
0:12:39	Try the microphones.
0:12:51	Can I have a question or two?
0:12:54	Oh,
0:12:55	unit selection, um,
0:12:57	of pitch periods.
0:12:58	Could you say something about how large the inventory is and how complex that search is? That, uh,
0:13:03	that's potentially a much larger search problem than with
0:13:05	diphone-size units. Well, yeah, we are in the
0:13:09	initial stage of developing this, still;
0:13:11	we are not experts in concatenative synthesis, but
0:13:14	we have,
0:13:15	um, tried,
0:13:17	um,
0:13:18	various sizes,
0:13:19	from, for example, ten pulses to twenty thousand pulses,
0:13:23	and it depends,
0:13:24	depends a lot,
0:13:26	uh,
0:13:26	on the speech material. Sometimes
0:13:28	even a hundred pulses might be almost as good as
0:13:32	ten thousand pulses,
0:13:33	so it's,
0:13:34	um, we are trying to make some sense of how to choose the
0:13:39	pulses also, so that the synthesis could be,
0:13:43	uh,
0:13:43	done with
0:13:45	very few pulses.
0:13:50	And the question is, how do you choose the appropriate glottal pulse
0:13:55	from the library?
0:13:56	So, [unclear], how to choose,
0:13:59	so, a library,
0:14:00	how to select, so how to choose a pulse from the library? Yeah.
0:14:04	Yeah, we have this, uh, target cost
0:14:07	and concatenation cost,
0:14:08	and we have weights
0:14:09	for all of these.
0:14:11	The weights are tuned by hand
0:14:13	at this moment.
0:14:14	And the target cost is the
0:14:16	RMS error between the source parameters
0:14:19	of the library
0:14:21	and the ones from the HMMs,
0:14:24	and
0:14:25	the concatenation cost is, uh,
0:14:27	[unclear] only over three pulses,
0:14:32	so it's
0:14:33	the similarity between the
0:14:36	pulses,
0:14:37	and they should be as similar as possible,
0:14:40	at least we thought we could start this way.
0:14:44	And
0:14:45	the total error is a combination of these,
0:14:48	but, uh, what we use is Viterbi over a voiced segment
0:14:52	to, uh,
0:14:53	optimize this procedure.
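
The Viterbi search over a voiced segment mentioned in this answer can be sketched in Python as below. The cost tables are assumed to be precomputed: `target_costs[t][k]` is the target cost of library pulse `k` at pulse position `t`, and `concat_costs[j][k]` is the concatenation cost of placing pulse `k` after pulse `j`. The function name `viterbi_select` and this table layout are illustrative assumptions, not details given in the talk.

```python
def viterbi_select(target_costs, concat_costs):
    """Return (path, total_cost) minimizing the summed target and concatenation costs.

    target_costs: list over pulse positions of lists over candidates
    concat_costs: square table indexed [previous candidate][current candidate]
    """
    n_positions = len(target_costs)
    n_candidates = len(target_costs[0])

    best = list(target_costs[0])     # best cost of a path ending in each candidate
    back = []                        # back-pointers for each later position

    for t in range(1, n_positions):
        new_best = []
        pointers = []
        for k in range(n_candidates):
            # Cheapest predecessor for candidate k at position t.
            j_best = min(range(n_candidates),
                         key=lambda j: best[j] + concat_costs[j][k])
            new_best.append(best[j_best] + concat_costs[j_best][k]
                            + target_costs[t][k])
            pointers.append(j_best)
        best = new_best
        back.append(pointers)

    # Trace the cheapest path backwards from the best final candidate.
    k = min(range(n_candidates), key=lambda i: best[i])
    total = best[k]
    path = [k]
    for pointers in reversed(back):
        k = pointers[k]
        path.append(k)
    path.reverse()
    return path, total
```

The search cost grows with the square of the library size per pulse position, which is one reason the library size discussed in the earlier question matters in practice.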
0:14:56	Um, I [unclear] inconsistency between
0:15:00	the chosen glottal pulse
0:15:02	and
0:15:04	[unclear] the parameters
0:15:07	which are generated from the HMMs, because they are modeled in the HMMs
0:15:12	[unclear]
0:15:13	I have another question regarding the same thing.
0:15:16	I realize that you use the line spectral pairs, or LSFs,
0:15:22	for modeling the pulses,
0:15:24	and can you,
0:15:25	could you give some
0:15:26	rationale why you want to choose that? And also,
0:15:30	still on the same RMS: what are you really measuring the similarity of, what content,
0:15:36	the frequency spectrum,
0:15:38	magnitude, or phase, or whatever?
0:15:42	Um,
0:15:43	and
0:15:45	you mean, um,
0:15:46	here,
0:15:48	the voice source spectrum,
0:15:50	here?
0:15:52	Yeah, the parameterization.
0:15:55	Yeah.
0:15:56	So the vocal tract spectrum is [unclear], and, yeah, or just of
0:16:01	the spectrum of the source? Yes, yeah.
0:16:04	Yeah, that's an interesting question.
0:16:07	The,
0:16:09	it stems from the fact that firstly
0:16:12	we have modeled
0:16:13	the spectrum of the source by this,
0:16:16	and here we have included the
0:16:18	parameter here.
0:16:21	Actually, I'm not sure,
0:16:23	not sure that it would be the
0:16:25	most, uh,
0:16:27	useful. It is
0:16:28	a problem, how to choose the best parameters for this,
0:16:32	and research is going on into what are the best parameters
0:16:35	for selecting the best pulses.
0:16:38	So,
0:16:39	yeah, we try to model the spectral tilt
0:16:41	and the spectral fine structure with
0:16:44	LSFs
0:16:45	of the source.
0:16:47	It's probably [unclear] because they're not
0:16:49	really like, uh,
0:16:51	[unclear]
0:16:53	So the distortion is measured in the frequency-domain magnitude,
0:16:57	I mean, yeah, LSFs.
0:16:59	Yeah, I was not sure whether you are measuring the whole thing
0:17:02	in the time domain or in the frequency domain, or with or without phase, or what.
0:17:08	No, it's only the LSFs.
0:17:10	Okay.
0:17:11	So we can improve it, maybe,
0:17:13	to [unclear] the frequency domain.
0:17:21	Oh,
0:17:22	oh.
0:17:26	How many [unclear] HMM [unclear] for the vocal tract parameters?
0:17:32	So, yeah, I cannot remember that number.
0:17:38	Okay, I think...
0:17:46	Okay.
0:17:47	Thank you.

Speech Synthesis

Presented by: Tuomo Raitio, Author(s): Tuomo Raitio, Aalto University, Finland; Antti Suni, University of Helsinki, Finland; Hannu Pulakka, Aalto University, Finland; Martti Vainio, University of Helsinki, Finland; Paavo Alku, Aalto University, Finland