Okay, hello everybody. I'm Tuomo Raitio from Aalto University, Helsinki, Finland, and I'm going to talk about HMM-based speech synthesis, and how to improve its quality by utilizing a glottal source pulse library.
This work was made in collaboration with Hannu Pulakka, Antti Suni and Martti Vainio from the University of Helsinki, and Paavo Alku.
Okay, so here's the content of my talk. Let's go straight to the background.
The goal of text-to-speech is to generate natural-sounding, expressive speech from arbitrary text. There are two major TTS trends. One is unit selection, which is based on concatenating natural, pre-recorded acoustic units. Its quality is very good at its best, but its adaptability is somewhat poor. The other method is statistical parametric synthesis, which is based on modeling speech parameters with hidden Markov models. Its quality is more moderate, but its adaptability is better.
This work is about statistical parametric synthesis. The problem is that its quality is not too good. Our proposal for this is, first, to decompose the speech signal into the glottal source signal and the vocal tract transfer function.
Second, we further decompose the glottal source into several parameters and a glottal pulse library. Then we model these parameters in a normal HMM-based speech synthesis framework, HTS. In synthesis, we reconstruct the source signal from the pulse library and the parameters, and filter it with the vocal tract filter.
So here are the basics of the source-filter model. In voiced speech, the glottal excitation is generated at the glottis, the signal goes through the vocal tract, and then we have speech. In this work, we are very much interested in this glottal excitation.
Here on top is the speech signal of a speaker, and below it the estimated glottal flow. How can we estimate this signal? We can, for example, use a method called glottal inverse filtering, which estimates the glottal source signal from the speech signal itself.
There are several methods to perform this task. I won't go further into them here, but the method we use is based on iterative adaptive inverse filtering (IAIF), which uses LPC.
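As a rough, hedged sketch of the idea (the actual IAIF method iterates between glottal source and vocal tract estimates; this shows only a single LPC-based inverse-filtering pass, with function names and parameter values of my own choosing):

```python
import numpy as np
from scipy.signal import lfilter

def lpc(x, order):
    """LPC coefficients via the autocorrelation method (Levinson-Durbin)."""
    r = np.correlate(x, x, "full")[len(x) - 1:len(x) + order]
    a = np.array([1.0])
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:], r[1:i][::-1])) / err
        a = np.concatenate([a, [0.0]])
        a = a + k * a[::-1]          # Levinson-Durbin order update
        err *= 1.0 - k * k
    return a

def inverse_filter(speech, order=20):
    """One pass of LPC-based glottal inverse filtering."""
    # Pre-emphasis roughly cancels the lip-radiation effect before
    # estimating the vocal tract with LPC.
    pre = lfilter([1.0, -0.99], [1.0], speech)
    a = lpc(pre * np.hanning(len(pre)), order)
    # Filtering the speech with the inverse vocal tract filter A(z) gives
    # the glottal flow derivative; leaky integration gives the flow estimate.
    dglottal = lfilter(a, [1.0], speech)
    return lfilter([1.0], [1.0, -0.99], dglottal)
```

The full IAIF algorithm repeats this with alternating low-order glottal models and higher-order vocal tract models; the single pass above only conveys the basic inverse-filtering operation.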
Okay, then to the speech synthesis system. This is very familiar to most of you, so I will go through it fast. We have a speech database, and we parameterize it and train the models according to the labels. In the synthesis stage, we input text and labels, and we can generate parameters according to them, so we can reconstruct speech. In this work we are interested in the parameterization and synthesis steps,
and by improving these we try to make the speech more natural.
So what we do in speech parameterization is this. We first window the signal, of course, and then we perform glottal inverse filtering, so we decompose the speech signal into two physiologically corresponding parts, which are the glottal source and the vocal tract. We parameterize the vocal tract with LSFs, and we parameterize the source with several parameters: fundamental frequency, harmonic-to-noise ratio, spectral tilt with LSFs, and harmonic magnitudes in the lower band. Finally, we extract the glottal pulses into the pulse library and link the pulses with the corresponding source parameters.
So how do we do that? First, we determine the glottal closure instants from the differentiated glottal source signal, and then we extract each complete two-period glottal source segment and window it with the Hann window. We link to each pulse its corresponding glottal source parameters, which are the energy, fundamental frequency, voice source spectrum, harmonic-to-noise ratio, and the harmonics. In addition, we store a downsampled ten-millisecond version of the pulse waveform in order to calculate the concatenation cost in the synthesis stage. The pulse library may consist of hundreds or even thousands of glottal pulses. Here, as an example, are some of the pulses from a male speaker.
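The extraction described above could be sketched roughly like this; the two-period segments, the Hann window, the linked source parameters, and the downsampled ~10 ms prototype are from the talk, while the dictionary layout, the parameter subset, and the interpolation-based resampling are my own assumptions:

```python
import numpy as np

def build_pulse_library(glottal_flow, gcis, fs, proto_ms=10.0):
    """Extract two-period, Hann-windowed glottal pulses at consecutive
    glottal closure instants (GCIs), linking each pulse with source
    parameters and a fixed-length prototype for the concatenation cost."""
    library = []
    proto_len = int(fs * proto_ms / 1000)
    for prev, cur, nxt in zip(gcis, gcis[1:], gcis[2:]):
        # One two-period segment, tapered with a Hann window.
        pulse = glottal_flow[prev:nxt] * np.hanning(nxt - prev)
        # Source parameters linked to the pulse (energy and F0 shown here;
        # the full system also stores spectral tilt, HNR and harmonics).
        f0 = fs / (cur - prev)
        energy = float(np.sqrt(np.mean(pulse ** 2)))
        # Resample to a fixed-length prototype by linear interpolation.
        proto = np.interp(np.linspace(0, len(pulse) - 1, proto_len),
                          np.arange(len(pulse)), pulse)
        library.append({"pulse": pulse, "f0": f0, "energy": energy,
                        "proto": proto})
    return library
```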
Okay, then to the synthesis stage. What we do is we want to reconstruct the voiced excitation, so we select the best matching pulses from the library according to the parameters generated by the HMMs, we scale and interpolate the pulses, and overlap-add them to create the voiced excitation. For unvoiced excitation we use only white noise. These are then filtered and combined to create the excitation signal.
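A minimal sketch of the overlap-add step; because each Hann-windowed pulse spans two periods, adjacent pulses overlap by one period and sum into a continuous excitation (centering each pulse on its glottal closure instant is my assumption about the placement):

```python
import numpy as np

def overlap_add_excitation(pulses, gcis, length):
    """Overlap-add selected two-period pulses at consecutive glottal
    closure instants to form the voiced excitation signal."""
    excitation = np.zeros(length)
    for pulse, gci in zip(pulses, gcis):
        start = gci - len(pulse) // 2          # center the pulse on its GCI
        lo, hi = max(start, 0), min(start + len(pulse), length)
        excitation[lo:hi] += pulse[lo - start:hi - start]
    return excitation
```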
The pulses are selected by minimizing a joint cost composed of target and concatenation costs. The target cost is the root-mean-square error between the voice source parameters generated by the HMMs and the ones stored for each pulse. We can of course have different weights for the different parameters to tune the system. The concatenation cost is the RMS error between the downsampled versions of consecutive pulses.
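A simplified sketch of the selection, assuming a dictionary-per-pulse library as in the earlier sketch; note that the actual system minimizes the joint cost over a whole voiced segment, whereas this greedy version picks each pulse locally:

```python
import numpy as np

def select_pulses(targets, library, weights, concat_weight=1.0):
    """For each frame's HMM-generated source parameters, pick the library
    pulse minimizing a weighted-RMS target cost plus an RMS concatenation
    cost against the previously chosen pulse's downsampled prototype."""
    keys = list(weights)
    params = np.array([[p[k] for k in keys] for p in library])
    w = np.array([weights[k] for k in keys])
    protos = np.array([p["proto"] for p in library])
    chosen, prev = [], None
    for t in targets:
        tv = np.array([t[k] for k in keys])
        cost = np.sqrt(np.mean(w * (params - tv) ** 2, axis=1))
        if prev is not None:
            concat = np.sqrt(np.mean((protos - protos[prev]) ** 2, axis=1))
            cost = cost + concat_weight * concat
        prev = int(np.argmin(cost))
        chosen.append(prev)
    return chosen
```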
Okay, here's an example of how this goes. Here is the voiced excitation, and here the unvoiced excitation, which is less interesting. These are then combined, and after the filtering we finally get the speech. [plays audio sample] That was Finnish, so probably you didn't understand it; I have more samples later.
First, about previous results. In a previous system we used only one glottal pulse, which was modified according to the voice source parameters, and we had the result that it was preferred over the basic STRAIGHT method. Here are some samples from that system.
[plays audio samples]
We also participated in the Blizzard Challenge in 2010 with this method, with more results, and here are some samples from that.
[plays audio samples]
So the quality is quite good. And here are samples comparing the single pulse technique and the new pulse library technique. I hope you can hear the difference; it is not so big, so maybe I'll play them again. [plays audio samples] Yes, I heard some differences.
Okay, here are spectrograms comparing the two. If you didn't hear the difference in quality, you can see it here. For example, the noise is modeled better with the pulse library technique than with the single pulse technique, and the voiced fricatives here are also modeled better, because the single pulse technique couldn't produce soft pulses. The high frequencies as well are more natural.
We conducted some listening tests, and we found that the new method was slightly preferred over the single pulse technique. The difference was not so great, but the speaker similarity score was better, and very many sounds were a lot more natural.
This is a kind of concatenative use of the source, so there are the same problems as in concatenative synthesis: we have some discontinuities, and some more artifacts compared to the single pulse technique.
Okay, here are the conclusions. We were motivated to create a natural, high-quality speech synthesizer, and this method allows for flexibility and control over the speech parameters and speech characteristics. The pulse library generates a more natural excitation than a single pulse, and the new method was slightly preferred over the single pulse technique. And here are the references.
Thank you for your attention.

[Session chair]: We have time for questions; please use the microphones.
Question: Can I ask a question about the unit selection of pitch periods? Could you say something about how large the pulse library is and how complex the search is? That's potentially a much larger search problem than in normal unit selection.
Answer: Yeah, we are still in the initial stage of developing this, and we are not experts in concatenative synthesis, but you are right. We have tried various sizes, from for example ten pulses to twenty thousand pulses, and it depends a lot on the speech material; sometimes even a hundred pulses might be almost as good as ten thousand pulses. So we are trying to make some sense of how to choose the pulses, so that the excitation could be as good with very few pulses.
Question: A related question: how do you choose the appropriate glottal pulse from the library?
Answer: We have the target cost and the concatenation cost, and we have weights for all of these; they are tuned by hand at this moment. The target cost is the RMS error between the source parameters of the library pulses and the ones generated from the HMMs. The concatenation cost is the RMS error between adjacent pulses, so it measures the similarity between the pulses, which should be as similar as possible; at least you could start this way. The total error is a combination of these, and it is computed over a whole voiced segment to optimize the procedure. The best path then chooses the glottal pulses whose source parameters are as similar as possible to the ones generated from the HMMs, since those parameters are what the HMMs model.
Question: I have a question regarding the same thing. I would surmise that you use line spectral pairs, LSFs, for modeling the pulses. Could you give some rationale for why you chose that? And also, when you measure the similarity, what does it contain: the frequency spectrum, the nature of the phase, or what?

Answer: You mean here, the voice source spectrum?

Question: Yes, the parameterization. The vocal tract spectrum is LSFs as well, but I mean the spectrum of the source.

Answer: The source, yes.
Yes, that's an interesting question. The choice of LSFs stems from the fact that, firstly, we model the spectrum of the source with them, and here we have included the tilt parameter. Actually, I'm not sure that it would be the most useful choice. It is a problem how to choose the best parameters and weights, and we should do more work on choosing the best parameters for selecting the best pulses. So we try to model the spectral tilt and the spectral fine structure of the source with the LSFs, but it's probably not optimal.
Question: So the distortion is measured in the frequency-domain magnitude, with the LSFs? I'm not sure whether you are measuring the whole thing in the time domain or in the frequency domain, with phase or without.

Answer: No, it's only the LSFs.

Question: Okay.

Answer: So we could maybe improve it by moving to the frequency domain.
Question: How many coefficients do you use in your HMM system for the vocal tract parameters?

Answer: Yeah, I cannot remember that number right now.

Question: Okay, thank you.

Answer: Thank you.