Okay, hello everybody. I'm Tuomo Raitio from Aalto University, Helsinki, Finland, and I'm going to talk about HMM-based speech synthesis, and how to improve its quality by utilizing a glottal source pulse library.
This work was made in collaboration with Hannu Pulakka, Antti Suni and Martti Vainio from the University of Helsinki, and Paavo Alku.
Okay, so here's the content of my talk. Let's go straight to the background.
The goal of text-to-speech is to generate natural-sounding, expressive speech from arbitrary text. There are two major TTS trends. One is unit selection, which is based on concatenating natural, pre-recorded acoustic units. Its quality is very good at its best, but its adaptability is somewhat poor. The other method is statistical parametric synthesis, which is based on modeling speech parameters with hidden Markov models. Its quality is more moderate, but its adaptability is better.
This work is about statistical parametric synthesis. The problem is that its quality is not too good. Our proposal for this is, first, to decompose the speech signal into the glottal source signal and the vocal tract transfer function.
Second, we further decompose the glottal source into several parameters and a glottal pulse library. Then we model these parameters in a normal HMM-based speech synthesis framework, HTS. In synthesis, we reconstruct the source signal from the pulse library and the parameters, and filter it with the vocal tract filter.
So here are the basics of the source-filter model. In voiced speech, the glottal excitation is generated at the glottis, the signal goes through the vocal tract, and then we have speech. In this work, we are very much interested in this glottal excitation.
Here on top is the speech signal of a speaker, and below it the estimated glottal flow. How can we estimate this signal? We can, for example, use a method called glottal inverse filtering, which estimates the glottal source signal from the speech signal itself.
There are several methods to perform this task. I won't go further into them here, but the method we use is based on iterative adaptive inverse filtering (IAIF), which uses LPC.
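As a rough, hedged sketch of the idea (the actual IAIF method iterates between glottal source and vocal tract estimates; this shows only a single LPC-based inverse-filtering pass, with function names and parameter values of my own choosing):

```python
import numpy as np
from scipy.signal import lfilter

def lpc(x, order):
    """LPC coefficients via the autocorrelation method (Levinson-Durbin)."""
    r = np.correlate(x, x, "full")[len(x) - 1:len(x) + order]
    a = np.array([1.0])
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:], r[1:i][::-1])) / err
        a = np.concatenate([a, [0.0]])
        a = a + k * a[::-1]          # Levinson-Durbin order update
        err *= 1.0 - k * k
    return a

def inverse_filter(speech, order=20):
    """One pass of LPC-based glottal inverse filtering."""
    # Pre-emphasis roughly cancels the lip-radiation effect before
    # estimating the vocal tract with LPC.
    pre = lfilter([1.0, -0.99], [1.0], speech)
    a = lpc(pre * np.hanning(len(pre)), order)
    # Filtering the speech with the inverse vocal tract filter A(z) gives
    # the glottal flow derivative; leaky integration gives the flow estimate.
    dglottal = lfilter(a, [1.0], speech)
    return lfilter([1.0], [1.0, -0.99], dglottal)
```

The full IAIF algorithm repeats this with alternating low-order glottal models and higher-order vocal tract models; the single pass above only conveys the basic inverse-filtering operation.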
Okay, then to the speech synthesis system. This is very familiar to most of you, so I will go through it fast. We have a speech database, and we parameterize it and train the models according to the labels. In the synthesis stage, we input text and labels, and we can generate parameters according to them, so we can reconstruct speech. In this work we are interested in the parameterization and synthesis steps,
and by improving these we try to make the speech more natural.
So what we do in speech parameterization is this. We first window the signal, of course, and then we perform glottal inverse filtering, so we decompose the speech signal into two physiologically corresponding parts, which are the glottal source and the vocal tract. We parameterize the vocal tract with LSFs, and we parameterize the source with several parameters: fundamental frequency, harmonic-to-noise ratio, spectral tilt with LSFs, and harmonic magnitudes in the lower band. Finally, we extract the glottal pulses into the pulse library and link the pulses with the corresponding source parameters.
So how do we do that? First, we determine the glottal closure instants from the differentiated glottal source signal, and then we extract each complete two-period glottal source segment and window it with the Hann window. We link to each pulse its corresponding glottal source parameters, which are the energy, fundamental frequency, voice source spectrum, harmonic-to-noise ratio, and the harmonics. In addition, we store a downsampled ten-millisecond version of the pulse waveform in order to calculate the concatenation cost in the synthesis stage. The pulse library may consist of hundreds or even thousands of glottal pulses. Here, as an example, are some of the pulses from a male speaker.
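The extraction described above could be sketched roughly like this; the two-period segments, the Hann window, the linked source parameters, and the downsampled ~10 ms prototype are from the talk, while the dictionary layout, the parameter subset, and the interpolation-based resampling are my own assumptions:

```python
import numpy as np

def build_pulse_library(glottal_flow, gcis, fs, proto_ms=10.0):
    """Extract two-period, Hann-windowed glottal pulses at consecutive
    glottal closure instants (GCIs), linking each pulse with source
    parameters and a fixed-length prototype for the concatenation cost."""
    library = []
    proto_len = int(fs * proto_ms / 1000)
    for prev, cur, nxt in zip(gcis, gcis[1:], gcis[2:]):
        # One two-period segment, tapered with a Hann window.
        pulse = glottal_flow[prev:nxt] * np.hanning(nxt - prev)
        # Source parameters linked to the pulse (energy and F0 shown here;
        # the full system also stores spectral tilt, HNR and harmonics).
        f0 = fs / (cur - prev)
        energy = float(np.sqrt(np.mean(pulse ** 2)))
        # Resample to a fixed-length prototype by linear interpolation.
        proto = np.interp(np.linspace(0, len(pulse) - 1, proto_len),
                          np.arange(len(pulse)), pulse)
        library.append({"pulse": pulse, "f0": f0, "energy": energy,
                        "proto": proto})
    return library
```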
Okay, then to the synthesis stage. What we do is we want to reconstruct the voiced excitation, so we select the best matching pulses from the library according to the parameters generated by the HMMs, we scale and interpolate the pulses, and overlap-add them to create the voiced excitation. For unvoiced excitation we use only white noise. These are then filtered and combined to create the excitation signal.
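A minimal sketch of the overlap-add step; because each Hann-windowed pulse spans two periods, adjacent pulses overlap by one period and sum into a continuous excitation (centering each pulse on its glottal closure instant is my assumption about the placement):

```python
import numpy as np

def overlap_add_excitation(pulses, gcis, length):
    """Overlap-add selected two-period pulses at consecutive glottal
    closure instants to form the voiced excitation signal."""
    excitation = np.zeros(length)
    for pulse, gci in zip(pulses, gcis):
        start = gci - len(pulse) // 2          # center the pulse on its GCI
        lo, hi = max(start, 0), min(start + len(pulse), length)
        excitation[lo:hi] += pulse[lo - start:hi - start]
    return excitation
```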
The pulses are selected by minimizing a joint cost composed of target and concatenation costs. The target cost is the root-mean-square error between the voice source parameters generated by the HMMs and the ones stored for each pulse. We can of course have different weights for the different parameters to tune the system. The concatenation cost is the RMS error between the downsampled versions of consecutive pulses.
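A simplified sketch of the selection, assuming a dictionary-per-pulse library as in the earlier sketch; note that the actual system minimizes the joint cost over a whole voiced segment, whereas this greedy version picks each pulse locally:

```python
import numpy as np

def select_pulses(targets, library, weights, concat_weight=1.0):
    """For each frame's HMM-generated source parameters, pick the library
    pulse minimizing a weighted-RMS target cost plus an RMS concatenation
    cost against the previously chosen pulse's downsampled prototype."""
    keys = list(weights)
    params = np.array([[p[k] for k in keys] for p in library])
    w = np.array([weights[k] for k in keys])
    protos = np.array([p["proto"] for p in library])
    chosen, prev = [], None
    for t in targets:
        tv = np.array([t[k] for k in keys])
        cost = np.sqrt(np.mean(w * (params - tv) ** 2, axis=1))
        if prev is not None:
            concat = np.sqrt(np.mean((protos - protos[prev]) ** 2, axis=1))
            cost = cost + concat_weight * concat
        prev = int(np.argmin(cost))
        chosen.append(prev)
    return chosen
```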
Okay, here's an example of how this goes. Here is the voiced excitation, and here the unvoiced excitation, which is less interesting. These are then combined, and after the filtering we finally get the speech. [plays audio sample] That was Finnish, so probably you didn't understand it; I have more samples later.
First, about previous results. In a previous system we used only one glottal pulse, which was modified according to the voice source parameters, and we had the result that it was preferred over the basic STRAIGHT method. Here are some samples from that system.
[plays audio samples]
We also participated in the Blizzard Challenge in 2010 with this method, with more results, and here are some samples from that.
[plays audio samples]
So the quality is quite good. And here are samples comparing the single pulse technique and the new pulse library technique. I hope you can hear the difference; it is not so big, so maybe I'll play them again. [plays audio samples] Yes, I heard some differences.
Okay, here are spectrograms comparing the two. If you didn't hear the difference in quality, you can see it here. For example, the noise is modeled better with the pulse library technique than with the single pulse technique, and the voiced fricatives here are also modeled better, because the single pulse technique couldn't produce soft pulses. The high frequencies as well are more natural.
We conducted some listening tests, and we found that the new method was slightly preferred over the single pulse technique. The difference was not so great, but the speaker similarity score was better, and very many sounds were a lot more natural.
This is a kind of concatenative use of the source, so there are the same problems as in concatenative synthesis: we have some discontinuities, and some more artifacts compared to the single pulse technique.
Okay, here are the conclusions. We were motivated to create a natural, high-quality speech synthesizer, and this method allows for flexibility and control over the speech parameters and speech characteristics. The pulse library generates a more natural excitation than a single pulse, and the new method was slightly preferred over the single pulse technique. And here are the references.
Thank you for your attention.

[Session chair]: We have time for questions; please use the microphones.
Question: Can I ask a question about the unit selection of pitch periods? Could you say something about how large the pulse library is and how complex the search is? That's potentially a much larger search problem than in normal unit selection.
Answer: Yeah, we are still in the initial stage of developing this, and we are not experts in concatenative synthesis, but you are right. We have tried various sizes, from for example ten pulses to twenty thousand pulses, and it depends a lot on the speech material; sometimes even a hundred pulses might be almost as good as ten thousand pulses. So we are trying to make some sense of how to choose the pulses, so that the excitation could be as good with very few pulses.
Question: A related question: how do you choose the appropriate glottal pulse from the library?
Answer: We have the target cost and the concatenation cost, and we have weights for all of these; they are tuned by hand at this moment. The target cost is the RMS error between the source parameters of the library pulses and the ones generated from the HMMs. The concatenation cost is the RMS error between adjacent pulses, so it measures the similarity between the pulses, which should be as similar as possible; at least you could start this way. The total error is a combination of these, and it is computed over a whole voiced segment to optimize the procedure. The best path then chooses the glottal pulses whose source parameters are as similar as possible to the ones generated from the HMMs, since those parameters are what the HMMs model.
Question: I have a question regarding the same thing. I would surmise that you use line spectral pairs, LSFs, for modeling the pulses. Could you give some rationale for why you chose that? And also, when you measure the similarity, what does it contain: the frequency spectrum, the nature of the phase, or what?

Answer: You mean here, the voice source spectrum?

Question: Yes, the parameterization. The vocal tract spectrum is LSFs as well, but I mean the spectrum of the source.

Answer: The source, yes.
Yes, that's an interesting question. The choice of LSFs stems from the fact that, firstly, we model the spectrum of the source with them, and here we have included the tilt parameter. Actually, I'm not sure that it would be the most useful choice. It is a problem how to choose the best parameters and weights, and we should do more work on choosing the best parameters for selecting the best pulses. So we try to model the spectral tilt and the spectral fine structure of the source with the LSFs, but it's probably not optimal.
Question: So the distortion is measured in the frequency-domain magnitude, with the LSFs? I'm not sure whether you are measuring the whole thing in the time domain or in the frequency domain, with phase or without.

Answer: No, it's only the LSFs.

Question: Okay.

Answer: So we could maybe improve it by moving to the frequency domain.
Question: How many coefficients do you use in your HMM system for the vocal tract parameters?

Answer: Yeah, I cannot remember that number right now.

Question: Okay, thank you.

Answer: Thank you.