0:00:14 Okay, hello everybody.
0:00:15 I'm Tuomo Raitio from Aalto University, Helsinki, Finland,
0:00:20 and I'm going to talk about HMM-based speech synthesis
0:00:23 and how to improve its quality
0:00:26 by utilizing a glottal source pulse library.
0:00:30 This work was made in collaboration with
0:00:35 my colleagues Antti Suni and Martti Vainio
0:00:37 from the University of Helsinki,
0:00:40 and Hannu Pulakka and Paavo Alku
0:00:42 from Aalto University.
0:00:45 Okay, so here's the content
0:00:47 of my talk.
0:00:48 So let's go straight to the background.
0:00:52 So, as you all know, the goal of text-to-speech is to generate natural-sounding speech
0:00:58 from arbitrary text, and today there are
0:01:00 two major TTS trends.
0:01:02 One is unit selection, which is based on
0:01:04 concatenating natural pre-recorded
0:01:06 acoustic units,
0:01:07 and its
0:01:10 quality is very good at its best,
0:01:12 but the adaptability
0:01:14 is somewhat poor.
0:01:16 The other method is statistical parametric synthesis,
0:01:18 which is based on modeling speech parameters with hidden Markov models,
0:01:22 and its strength is the adaptability.
0:01:25 This work
0:01:26 is about the statistical approach,
0:01:33 but the problem is that the quality is not too good.
0:01:36 So here is our proposal for this:
0:01:40 first, decompose the speech signal into the glottal source
0:01:43 signal and the vocal tract transfer function;
0:01:47 and second,
0:01:49 further decompose the glottal source into several parameters
0:01:53 and a glottal pulse library.
0:01:56 We then model these parameters in a normal
0:01:59 HMM-based
0:02:00 speech synthesis framework,
0:02:01 HTS,
0:02:03 and in the synthesis stage
0:02:05 we reconstruct the
0:02:08 glottal source signal from the pulse library and the parameters,
0:02:12 and filter it with the vocal tract filter.
0:02:17 So, the basics of the glottal source:
0:02:21 in voiced speech,
0:02:22 the glottal excitation is generated at the vocal folds,
0:02:25 and then the
0:02:26 signal goes through the vocal tract, and then we have speech.
0:02:29 So we are very much interested in this glottal excitation in this work.
0:02:38 Here, for a speaker, we have the speech signal,
0:02:41 and the estimated
0:02:42 glottal flow below it.
0:02:45 So how can we
0:02:46 estimate this signal?
0:02:48 We can, for example, use a method called glottal inverse filtering,
0:02:52 which estimates
0:02:53 the glottal source signal
0:02:54 from the speech signal itself.
0:02:57 There are several methods
0:02:58 to perform this task;
0:03:00 I won't go further into that,
0:03:03 but the method we use
0:03:05 is based on iterative adaptive inverse filtering,
0:03:07 IAIF,
0:03:08 which makes use of LPC.
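In code, the basic idea behind LPC-based inverse filtering can be sketched roughly like this. This is a minimal NumPy sketch, not the actual IAIF algorithm: it does a single LPC analysis and inverse-filters with it, ignoring lip-radiation compensation and the iterative refinement that IAIF adds, and the LPC order is an arbitrary assumption.

```python
import numpy as np

def lpc(frame, order):
    """LPC coefficients via the autocorrelation method (a[0] = 1)."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    # Solve the normal equations R a = -r[1..p] (Toeplitz system)
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.concatenate(([1.0], np.linalg.solve(R, -r[1:order + 1])))

def inverse_filter(frame, order=18):
    """Whiten the frame with its own LPC inverse filter:
    e[n] = sum_k a[k] * x[n-k].  For voiced speech the residual
    approximates the (derivative of the) glottal source."""
    a = lpc(frame, order)
    return np.convolve(frame, a)[:len(frame)]
```

The inverse filter removes the vocal-tract resonances estimated by LPC, which is why the residual energy of a resonant signal drops sharply after filtering.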
0:03:13 Okay, and then to the speech synthesis system.
0:03:16 This is
0:03:17 probably very familiar
0:03:18 to most of you,
0:03:20 so I will go through this fast. We have a
0:03:22 speech database, and then we parameterize it
0:03:25 and train
0:03:26 the parameters according to the labels.
0:03:30 In the synthesis stage,
0:03:31 we input text and labels,
0:03:35 and we can generate parameters according to the text, and then we can reconstruct speech.
0:03:40 In this work we are interested in the
0:03:43 parameterization and synthesis steps;
0:03:48 by improving these, we
0:03:49 try to make the speech
0:03:51 more natural.
0:03:54 So what we do in speech parameterization is: we first
0:03:57 window the signal, of course,
0:04:00 and then
0:04:04 apply glottal inverse filtering, so we decompose the speech signal
0:04:07 into two physiologically corresponding parts, which are the
0:04:11 glottal source
0:04:12 and the vocal tract.
0:04:15 We parameterize the vocal tract with LSFs,
0:04:19 and we
0:04:21 parameterize the source with several parameters:
0:04:25 fundamental frequency,
0:04:26 harmonic-to-noise ratio,
0:04:28 spectral tilt with LSFs, and
0:04:31 harmonic magnitudes
0:04:33 in the lower band.
0:04:35 And finally,
0:04:37 we extract the glottal pulses into the pulse library
0:04:41 and link the pulses with the corresponding
0:04:43 source parameters.
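The LSF parameterization mentioned here converts an LPC polynomial into line spectral frequencies. A small NumPy sketch of the textbook conversion: form the symmetric and antisymmetric polynomials P(z) = A(z) + z^-(p+1) A(1/z) and Q(z) = A(z) - z^-(p+1) A(1/z), whose roots lie on the unit circle, and take the sorted root angles. Using np.roots is fine at low orders; production code typically uses a Chebyshev-polynomial root search instead.

```python
import numpy as np

def lpc_to_lsf(a):
    """Convert an LPC polynomial a (a[0] = 1, order p) into line
    spectral frequencies: the sorted root angles, in (0, pi), of the
    symmetric/antisymmetric polynomials P and Q built from A(z)."""
    a = np.asarray(a, dtype=float)
    P = np.concatenate((a, [0.0])) + np.concatenate(([0.0], a[::-1]))
    Q = np.concatenate((a, [0.0])) - np.concatenate(([0.0], a[::-1]))
    lsf = []
    for poly in (P, Q):
        angles = np.angle(np.roots(poly))
        # keep one angle per conjugate pair, excluding the trivial
        # roots at z = 1 and z = -1
        lsf.extend(w for w in angles if 1e-8 < w < np.pi - 1e-8)
    return np.sort(np.array(lsf))
```

For a stable A(z), the P and Q roots interleave on the unit circle, which is why LSFs quantize and interpolate better than raw LPC coefficients.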
0:04:47 So how do we do that?
0:04:50 We determine the glottal closure instants
0:04:53 from the differentiated glottal flow signal,
0:04:57 and then we extract each complete
0:05:00 two-period glottal source segment
0:05:02 and window it with the Hann window.
0:05:06 We link each pulse to
0:05:09 the corresponding glottal source parameters, which are the energy, fundamental frequency,
0:05:14 voice source spectrum, harmonic-to-noise ratio, and the harmonics.
0:05:19 In addition, we store
0:05:22 a downsampled ten-millisecond version of the pulse waveform
0:05:26 in order to
0:05:28 calculate the concatenation cost
0:05:30 in the synthesis stage.
0:05:35 The pulse library may consist of hundreds
0:05:38 or even thousands of glottal flow pulses;
0:05:41 here is an example of some of the pulses
0:05:47 from two male speakers.
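The extraction step might be sketched like this, assuming the glottal closure instants (GCIs) have already been detected. The dictionary keys and the fixed downsampled length n_down are illustrative stand-ins for the 10 ms downsampled copy described in the talk.

```python
import numpy as np

def extract_pulse_library(glottal_source, gcis, n_down=16):
    """Build a pulse library: around each glottal closure instant (GCI),
    cut a two-period segment (previous GCI to next GCI), apply a Hann
    window, and store a fixed-length downsampled copy of the waveform
    for cheap concatenation-cost evaluation at synthesis time."""
    library = []
    for i in range(1, len(gcis) - 1):
        start, centre, end = gcis[i - 1], gcis[i], gcis[i + 1]
        pulse = glottal_source[start:end] * np.hanning(end - start)
        idx = np.linspace(0.0, len(pulse) - 1.0, n_down)
        library.append({
            "pulse": pulse,
            "downsampled": np.interp(idx, np.arange(len(pulse)), pulse),
            "energy": float(np.sqrt(np.mean(pulse ** 2))),
            "period": centre - start,  # local pitch period in samples
        })
    return library
```

Each entry would also carry the linked source parameters (F0, HNR, spectral tilt, harmonics) in the full system.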
0:05:52 And then the synthesis stage.
0:05:55 So what we do, when we want to reconstruct the voiced excitation,
0:06:00 is we select the
0:06:02 best matching pulses
0:06:04 according to the
0:06:05 source parameters generated by the HMMs,
0:06:08 and then
0:06:10 we interpolate
0:06:12 and scale
0:06:14 the pulses
0:06:15 and overlap-add them
0:06:17 to create the voiced excitation.
0:06:19 For unvoiced excitation we use only white noise.
0:06:23 Finally, both are filtered and combined, and we get the synthetic speech.
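A rough sketch of the voiced-excitation construction, under the assumption (as in the talk) that each stored pulse covers two pitch periods and is Hann-windowed, so advancing by one period makes adjacent windows overlap-add smoothly. The helper names are hypothetical.

```python
import numpy as np

def fit_pulse(pulse, target_period, target_energy):
    """Resample a two-period library pulse to the target pitch period
    and scale it to the target energy (both produced by the HMMs)."""
    idx = np.linspace(0.0, len(pulse) - 1.0, 2 * target_period)
    p = np.interp(idx, np.arange(len(pulse)), pulse)
    rms = np.sqrt(np.mean(p ** 2))
    return p * (target_energy / rms) if rms > 0 else p

def overlap_add(pulses, periods, length):
    """Pitch-synchronous overlap-add: each two-period pulse is advanced
    by one pitch period, so neighbours overlap by one period and the
    Hann windows sum smoothly."""
    out = np.zeros(length)
    pos = 0
    for pulse, period in zip(pulses, periods):
        end = min(pos + len(pulse), length)
        out[pos:end] += pulse[:end - pos]
        pos += period
    return out
```

Unvoiced regions would simply be filled with scaled white noise before the vocal tract filtering.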
0:06:31 The pulses are selected
0:06:33 by minimizing a joint cost composed of target
0:06:37 and concatenation costs.
0:06:39 The target cost is the
0:06:41 root mean square error between the voice source parameters
0:06:44 generated by the HMMs and the ones stored
0:06:46 for each pulse,
0:06:48 and we can of course have different weights
0:06:50 for different parameters
0:06:52 to tune
0:06:53 the system.
0:06:55 The concatenation cost is the RMS
0:06:58 error between the downsampled versions of the pulses
0:07:01 described earlier.
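These two costs, and a simple selection loop using them, might look as follows. The dictionary keys are hypothetical, and the loop is greedy for brevity; the actual search over a voiced segment is discussed in the Q&A.

```python
import numpy as np

def target_cost(pulse_params, target_params, weights):
    """Weighted RMS error between a pulse's stored source parameters
    and the parameters generated by the HMMs for this frame."""
    d = (np.asarray(pulse_params) - np.asarray(target_params)) * np.asarray(weights)
    return float(np.sqrt(np.mean(d ** 2)))

def concat_cost(prev_down, cand_down):
    """RMS error between the downsampled waveforms of two pulses."""
    return float(np.sqrt(np.mean((np.asarray(prev_down) - np.asarray(cand_down)) ** 2)))

def select_pulses(library, targets, weights, wc=1.0):
    """Greedy left-to-right selection minimizing target cost plus a
    weighted concatenation cost against the previously chosen pulse."""
    chosen = []
    for tgt in targets:
        best, best_cost = None, float("inf")
        for entry in library:
            c = target_cost(entry["params"], tgt, weights)
            if chosen:
                c += wc * concat_cost(chosen[-1]["downsampled"], entry["downsampled"])
            if c < best_cost:
                best, best_cost = entry, c
        chosen.append(best)
    return chosen
```

The per-parameter weights play the same tuning role as described in the talk (set by hand in this work).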
0:07:08 Okay, here's an example of
0:07:12 how all this works.
0:07:14 Here is
0:07:15 the voiced excitation. [plays sample]
0:07:29 And here is the unvoiced excitation; it is
0:07:32 less interesting. [plays sample]
0:07:40 Then, combined... I will play
0:07:42 the two together,
0:07:44 and with further filtering we finally get
0:07:47 the synthetic speech. [plays sample]
0:07:58 So that was quite noisy, so probably you didn't understand it;
0:08:01 I have more samples.
0:08:05 But first, some earlier results:
0:08:08 we had used, in the same system,
0:08:10 only one glottal flow pulse, which was modified
0:08:13 according to the voice source parameters, and we had the result that
0:08:16 it was preferred over the
0:08:19 basic STRAIGHT method.
0:08:23 Here are some samples from
0:08:24 that system. [plays samples]
0:08:56 We also participated in the Blizzard Challenge
0:08:59 2010
0:09:00 with a
0:09:01 more or less similar system,
0:09:04 and here are some samples from that. [plays samples]
0:09:37 So the quality is
0:09:39 quite good.
0:09:40 And here are samples comparing the single pulse technique and the
0:09:45 new pulse library technique.
0:09:49 I hope you can hear the differences;
0:09:52 sometimes the difference is not so big,
0:09:55 so maybe I'll play them in short sections. [plays samples]
0:10:30 I heard some...
0:10:32 some differences.
0:10:36 Okay, here are spectrograms comparing the
0:10:40 difference in quality, if you didn't hear it.
0:10:42 You can see, for example, that here
0:10:45 the noise
0:10:46 is modeled better
0:10:49 by the pulse library technique than by the single pulse technique.
0:10:54 Also, voiced fricatives here
0:10:56 are modeled better, because the single pulse technique couldn't produce
0:11:02 soft pulses,
0:11:04 and the high frequencies as well
0:11:05 are more natural.
0:11:11 And we conducted some
0:11:14 listening tests,
0:11:16 and we found that the
0:11:18 new method was slightly preferred over the single pulse technique;
0:11:23 the difference was not so great,
0:11:25 but the
0:11:27 speaker similarity
0:11:28 score was better,
0:11:32 and very many
0:11:35 sounds were lots more natural.
0:11:38 This is
0:11:39 kind of a
0:11:41 concatenative synthesis of the source,
0:11:43 so there are the same problems as in concatenative synthesis:
0:11:47 we have some discontinuities there,
0:11:50 and some more artifacts
0:11:52 compared to the consistency of the single pulse technique.
0:12:00 Okay, here's the summary. So we
0:12:03 know there is a need for flexible, high-quality speech synthesis,
0:12:06 and this HMM-based approach is
0:12:08 probably the most flexible in controlling all the speech parameters.
0:12:12 It is a speech production
0:12:13 based technique, and
0:12:15 the pulse library generates a more natural excitation signal
0:12:18 because it
0:12:19 retains the natural variation
0:12:21 of the pulses,
0:12:22 and it is slightly preferred over the single pulse technique.
0:12:26 And that's it;
0:12:28 I thank you for your attention.
0:12:38 There is time for questions;
0:12:39 please use the microphones.
0:12:51 Q: Can I ask a question about the
0:12:55 unit selection of
0:12:57 pitch periods?
0:12:58 Could you say something about how large the pulse library is and how complex the search is?
0:13:03 That's potentially a much larger search problem than in ordinary unit selection.
0:13:05 A: Yeah, we are in the
0:13:09 initial stage of developing this still;
0:13:11 there are experts in concatenative synthesis here, but
0:13:14 you are right.
0:13:15 We have tried
0:13:18 various sizes,
0:13:19 from, for example, ten pulses to twenty thousand pulses,
0:13:23 and it depends...
0:13:24 it depends a lot
0:13:26 on the speech material; sometimes
0:13:28 even a hundred pulses might be almost as good as
0:13:32 ten thousand pulses.
0:13:33 So
0:13:34 we are trying to make some sense of how to choose the pulses
0:13:39 so that the best quality could be
0:13:43 achieved with
0:13:45 very few pulses.
0:13:50 Q: And a related question: how do you choose the appropriate glottal pulse
0:13:55 from the library?
0:13:56 A: Sorry, how to choose the
0:13:59 library?
0:14:00 Q: No, in the selection stage: how do you choose a pulse from the library?
0:14:04 A: Yeah, we have this target cost
0:14:07 and the concatenation cost,
0:14:08 and we have weights
0:14:09 for all of these;
0:14:11 they are tuned by hand
0:14:13 at this moment.
0:14:14 The target cost is the
0:14:16 RMS error between the source parameters
0:14:19 of the library pulses
0:14:21 and the ones generated from the HMMs.
0:14:25 The concatenation cost is,
0:14:27 in this work, computed over only three pulses,
0:14:32 so it
0:14:33 measures the similarity between the
0:14:36 pulses,
0:14:37 and they should be as similar as possible;
0:14:40 at least you could state it this way.
0:14:45 Q: Is the total cost a combination of these?
0:14:48 A: Yes, and we use Viterbi over each voiced segment
0:14:52 to
0:14:53 optimize this procedure.
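The Viterbi search mentioned here is the standard dynamic program over per-frame target costs and pairwise concatenation costs. A compact NumPy sketch, with the cost matrices as placeholder inputs:

```python
import numpy as np

def viterbi_pulse_selection(target_costs, concat_costs):
    """Minimal-cost pulse sequence over a voiced segment.
    target_costs: (T, N) target cost of each of N library pulses at
    each of T frames; concat_costs: (N, N) cost of placing pulse j
    right after pulse i.  Returns the optimal index sequence."""
    T, N = target_costs.shape
    cost = target_costs[0].copy()          # best cost ending in each pulse
    back = np.zeros((T, N), dtype=int)     # backpointers
    for t in range(1, T):
        trans = cost[:, None] + concat_costs      # (N, N): from i to j
        back[t] = np.argmin(trans, axis=0)
        cost = trans[back[t], np.arange(N)] + target_costs[t]
    path = [int(np.argmin(cost))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

Unlike greedy selection, this trades off a worse pulse now against cheaper joins later, which is exactly what the segment-level optimization buys.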
0:14:56 Q: Then isn't there a discrepancy between the
0:15:00 chosen glottal pulses
0:15:04 and the source parameters
0:15:07 which are generated from the HMMs, since it is the parameters that are modeled in the HMMs?
0:15:13 Q: I have a couple of questions regarding the same thing.
0:15:16 If I heard right, you use line spectral pairs, LSFs,
0:15:22 for modeling the pulses.
0:15:24 Could you give some
0:15:26 rationale why you chose that? And also,
0:15:30 with the LSFs, what exactly is the similarity you are measuring, what contents:
0:15:36 the frequency spectrum,
0:15:38 the phase, or whatever?
0:15:45 A: You mean
0:15:48 the voice source spectrum?
0:15:52 Q: Yeah, the parameterization.
0:15:56 The vocal tract spectrum is set up separately, and I think you model just
0:16:01 the spectrum of the source. A: Yes.
0:16:04 Yeah, that's an interesting question.
0:16:09 The choice of LSFs stems from the fact that, firstly,
0:16:12 we have to model
0:16:13 the spectrum of the source anyway,
0:16:16 and here we have included it
0:16:18 as a parameter.
0:16:21 Actually, I am not
0:16:23 sure that it would be the
0:16:25 most useful;
0:16:27 it is
0:16:28 a problem how to choose the best parameters and weights for this,
0:16:32 and we are still working out what are the best parameters
0:16:35 for selecting the best pulses.
0:16:39 We try to model the spectral tilt
0:16:41 and the spectral fine structure with the
0:16:44 LSFs
0:16:45 of the source,
0:16:47 which is a bit problematic, because they're not
0:16:49 really like
0:16:51 formants.
0:16:53 Q: So the distortion is measured in the frequency domain, magnitude only?
0:16:57 A: Yeah, in the LSF domain.
0:16:59 Q: I wasn't sure whether you were measuring the whole thing
0:17:02 in the time domain or in the frequency domain, or whether there was phase or what.
0:17:08 A: No, it's only the LSFs,
0:17:11 so maybe we can improve it
0:17:13 by measuring in the frequency domain directly.
0:17:26 Q: How many leaves do you have in your HMM decision trees for the vocal tract parameters?
0:17:32 A: Sorry, I cannot remember that number right now.
0:17:38 Q: Okay, thank you.
0:17:47 A: Thank you.