OK, hello everybody. I'm Tuomo Raitio from Aalto University, Helsinki, Finland, and I'm going to talk about HMM-based speech synthesis and how to improve its quality by utilizing a glottal source pulse library. This work was made in collaboration with colleagues from Aalto University and the University of Helsinki.

Here is the content of my talk, so let's go straight to the background. The goal of text-to-speech is to generate natural-sounding speech from arbitrary text. There are two major TTS trends. One is unit selection, which is based on concatenating natural pre-recorded acoustic units; it has very good quality at its best, but its adaptability is somewhat poor. The other method is statistical parametric synthesis, which is based on modeling speech parameters with hidden Markov models; it has good adaptability, and this work is about statistical synthesis. The problem there is that the quality is not so good.

Our proposal is, first, to decompose the speech signal into a glottal source signal and a vocal tract transfer function. Second, we further decompose the glottal source into several parameters and a glottal pulse library. We then model these parameters in a normal HMM-based speech synthesis framework (HTS), and in synthesis we reconstruct the glottal source signal from the pulse library and the generated parameters and filter it with the vocal tract filter.

So, the basics: the source of voiced speech is the glottal excitation; the signal then goes through the vocal tract, and we have speech. In this work we are very much interested in this glottal excitation. Here you can see a speaker's speech signal and the estimated glottal flow below it. How can we estimate this signal? We can, for example, use a method called glottal inverse filtering.
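The idea of inverse filtering can be illustrated with a minimal sketch. This is not the IAIF algorithm used in the talk, only a crude single-pass LPC version of the same principle: fit an all-pole vocal tract model and filter the speech through its inverse to expose the source. The function names, the integration coefficient 0.99, and the model order are my own assumptions for illustration.

```python
import numpy as np
from scipy.signal import lfilter

def lpc(x, order):
    """LPC coefficients via the autocorrelation method (Levinson-Durbin)."""
    n = len(x)
    r = np.correlate(x, x, mode="full")[n - 1:n + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i + 1] += k * a[i - 1::-1].copy()  # reflection update
        err *= 1.0 - k * k
    return a

def glottal_inverse_filter(frame, order):
    """Crude one-pass inverse filtering: cancel the lip-radiation
    differentiation by leaky integration, fit an all-pole vocal tract
    model on the windowed frame, and inverse-filter the frame; the
    residual is a rough estimate of the glottal source."""
    integrated = lfilter([1.0], [1.0, -0.99], frame)
    a = lpc(integrated * np.hanning(len(integrated)), order)
    return lfilter(a, [1.0], frame)
```

On a synthetic AR signal, `lpc` recovers the generating filter coefficients, which is the property the vocal tract model relies on.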
Glottal inverse filtering estimates the glottal source signal from the speech signal itself. There are several methods for this task; I won't go further into them here, but the one we use is based on iterative adaptive inverse filtering (IAIF), which uses LPC.

Then, the speech synthesis system. This is very familiar to most of you, so I will go through it fast. We have a speech database, we parameterize it, and we train the models according to the labels. In synthesis, we input text and labels, generate the parameters according to the labels, and reconstruct the speech. In this work we are interested in the parameterization and synthesis steps, and by improving these we try to make the speech more natural.

What we do in speech parameterization is this: we first window the signal, of course, and then apply glottal inverse filtering. That decomposes the speech signal into two physiologically motivated parts, the glottal source and the vocal tract. We parameterize the vocal tract with line spectral frequencies (LSFs), and we parameterize the source with several parameters: fundamental frequency, harmonic-to-noise ratio, the voice source spectrum with LSFs, and a harmonic model for the lower band. Finally, we extract the glottal pulses into the pulse library and link each pulse with its corresponding source parameters.

How do we do that? First we estimate the glottal closure instants from the differentiated glottal source signal; then we extract each two-period glottal source segment and window it with the Hann window. To each pulse we link the corresponding glottal source parameters, which are the energy, fundamental frequency, voice source spectrum, harmonic-to-noise ratio, and the harmonics. In addition, we store a downsampled ten-millisecond version of each pulse in order to calculate the concatenation cost in the synthesis stage. The pulse library may consist of hundreds or even thousands of glottal pulses. Here you can see, as an example, some of the pulses from two male speakers.
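The pulse-extraction step just described can be sketched as follows. This is a simplified illustration, not the actual system: the GCI list is assumed to be given by some external estimator, only energy and local F0 are computed as example features, and the fixed-length copy stands in for the downsampled ten-millisecond version used for the concatenation cost.

```python
import numpy as np

def build_pulse_library(source, gcis, fs, n_ds=None):
    """Extract two-period, Hann-windowed glottal pulses at glottal
    closure instants (GCIs) and tag each with example source
    parameters plus a fixed-length copy for concatenation costs."""
    n_ds = n_ds or int(0.010 * fs)  # ~10 ms reference length
    library = []
    # each pulse spans two pitch periods: GCI[i] .. GCI[i+2]
    for left, mid, right in zip(gcis, gcis[1:], gcis[2:]):
        pulse = source[left:right] * np.hanning(right - left)
        params = {
            "energy": float(np.sqrt(np.mean(pulse ** 2))),
            "f0": fs / (mid - left),  # local pitch from adjacent GCIs
        }
        # fixed-length resampled copy, used only for the concatenation cost
        grid = np.linspace(0.0, len(pulse) - 1.0, n_ds)
        downsampled = np.interp(grid, np.arange(len(pulse)), pulse)
        library.append({"pulse": pulse, "params": params, "ds": downsampled})
    return library
```

In the real system each entry would also carry the voice source spectrum, harmonic-to-noise ratio, and harmonic parameters mentioned in the talk.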
OK, then the synthesis stage. We want to reconstruct the voiced excitation, so we select the best matching pulses according to the source parameters generated by the HMMs; we then interpolate and scale the pulses and overlap-add them to create the voiced excitation, like this. For unvoiced excitation we use only white noise. Finally, the excitations are combined and filtered to get the speech.

The pulses are selected by minimizing a joint cost composed of target and concatenation costs. The target cost is the root-mean-square error between the voice source parameters generated by the HMMs and the ones stored for each pulse, and we can of course have different weights for the different parameters to tune the system. The concatenation cost uses the RMS error between the downsampled versions of the pulses to be concatenated.

Here is an example of how this works. Here is the voiced excitation [plays sample], and here is the unvoiced excitation, which is less interesting [plays sample]; when we combine the two and apply the filters, we finally get the synthesized speech [plays sample]. So probably you didn't understand it; I have more samples later.

First, some earlier results. We used the same system, but with only one glottal pulse, which was modified according to the voice source parameters, and the result was that it was preferred over a basic STRAIGHT-based method. I have some samples from that system. [plays samples] We also participated in the Blizzard Challenge in 2010 with that system, and here are some samples from it. [plays samples] So the quality is quite good. And here are samples comparing the single-pulse technique and the pulse-library technique; I hope you can hear the differences. [plays samples] To some, the difference is not so big, so maybe I'll play these again. [plays samples] I heard some differences there, yes.
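The selection and overlap-add steps can be illustrated with a minimal sketch. Two simplifications to note: the search here is greedy left-to-right rather than a full joint optimization, and the library-entry layout (a numeric "params" vector plus a "ds" downsampled waveform) and the weight values are assumptions for illustration.

```python
import numpy as np

def select_pulses(targets, library, w_target=1.0, w_concat=0.5):
    """Greedy left-to-right pulse selection minimizing a weighted sum of
    target cost (RMSE between HMM-generated source parameters and the
    pulse's stored parameters) and concatenation cost (RMSE between
    downsampled pulse waveforms of adjacent pulses)."""
    chosen, prev_ds = [], None
    for t in targets:  # t: source-parameter vector generated by the HMMs
        best, best_cost = None, np.inf
        for entry in library:
            cost = w_target * np.sqrt(np.mean((t - entry["params"]) ** 2))
            if prev_ds is not None:
                cost += w_concat * np.sqrt(np.mean((prev_ds - entry["ds"]) ** 2))
            if cost < best_cost:
                best, best_cost = entry, cost
        chosen.append(best)
        prev_ds = best["ds"]
    return chosen

def overlap_add(pulses, pitch_periods):
    """Place two-period pulses at consecutive pitch marks and sum the
    overlapping halves to form the voiced excitation."""
    out = np.zeros(sum(pitch_periods) + len(pulses[-1]))
    pos = 0
    for pulse, period in zip(pulses, pitch_periods):
        out[pos:pos + len(pulse)] += pulse
        pos += period
    return out
```

A dynamic-programming (Viterbi) search over a whole voiced segment, as raised in the Q&A below, would replace the greedy loop but use the same two cost terms.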
Here are spectrograms comparing the difference in quality. If you couldn't hear it, you can see, for example, that the noise is modeled better by the pulse-library technique than by the single-pulse technique; the voiced fricatives here are also better, because the single-pulse technique couldn't produce soft pulses, and the high frequencies as well are more natural.

We conducted some listening tests, and we found that the new method was slightly preferred over the single-pulse technique. The differences were not so great, but the speaker similarity rating was better, and very many sounds were lots more natural. There is also a kind of degradation from concatenating the source: the same problems arise as in concatenative synthesis, so we get some discontinuities and some artifacts compared to the consistency of the single-pulse technique.

OK, here is the summary. We want natural, high-quality speech synthesis; this framework allows flexible control of all the speech parameters, and the pulse library generates a more natural excitation signal, and because of that it is slightly preferred over the single-pulse method. And here are the references. Thank you for your attention.

[Session chair] There is time for questions; please use the microphones.

[Question] Can I ask a question? You do unit selection over pitch periods. Could you say something about how large the inventory is and how complex the search is? That is potentially a much larger search problem than with phone-sized units.

[Answer] Well, we are in the initial stage of developing this, and we are not experts in concatenative synthesis, but we have tried various sizes, from, for example, ten pulses to twenty thousand pulses, and it depends a lot on the speech material; sometimes even a hundred pulses might be almost as good as ten thousand pulses.
We are still trying to make sense of how to choose the pulses; it is also interesting that the quality can be good with very few pulses.

[Question] My question concerns how to choose appropriate glottal pulses from the library: what criteria do you use to select a pulse from the library?

[Answer] We have the target cost and the concatenation cost, and we have weights for all of these; the weights are tuned by hand at this moment. The target cost is the RMS error between the source parameters of the library pulse and the ones generated from the HMMs, and the concatenation cost is evaluated over the downsampled versions of neighboring pulses, so it measures the similarity between the pulses, which should be as similar as possible.

[Question] So the total error is a combination of these; do you use a Viterbi search over a voiced segment to optimize this procedure?

[Answer] That is part of it, yes; in any case the chosen glottal pulses have to match the source parameters which are generated from the HMMs, because those are what is modeled in the HMMs.

[Question] I have a question regarding the same thing. I gather that you use line spectral pairs for modeling the pulses. Could you give some rationale for why you chose that? And also, with the same LSF distance, what is it really whose similarity you are measuring: the spectral magnitude, the phase, or what?

[Answer] You mean the voice source spectrum here? [Questioner] Yes, the parameterization. [Answer] So the vocal tract spectrum is modeled with LSFs, and likewise the spectrum of the source. Yes, that's an interesting question. The choice stems from the fact that, firstly, we already model the spectrum of the source this way, and here we have included it as a parameter.
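The hand-weighted target cost described in this answer amounts to a weighted RMS error over the parameter vector. A tiny sketch of that idea, with the function name and weight values as illustrative assumptions:

```python
import numpy as np

def weighted_target_cost(generated, stored, weights):
    """Weighted RMS error between HMM-generated source parameters and
    the parameters stored with a library pulse; the weights are the
    hand-tuned per-parameter importances mentioned in the answer."""
    d = np.asarray(weights, float) * (
        np.asarray(generated, float) - np.asarray(stored, float)
    )
    return float(np.sqrt(np.mean(d ** 2)))
```

Setting a weight to zero removes that parameter from the selection criterion, which is one simple way to experiment with which source parameters matter most.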
Actually, I'm not so sure that it is the most useful choice; it is a problem how to choose the best parameters, and we are still working on which parameters are best for selecting the best pulses. So we try to model the spectral tilt and the spectral fine structure with the LSFs of the source, but it's probably not optimal.

[Question] So the distortion is measured in the frequency domain, on the magnitude? [Answer] Yes, with the LSFs. [Question] I was not sure whether you are measuring the similarity in the time domain or in the frequency domain, with or without phase. [Answer] No, it's only the LSFs. [Question] OK, so you could maybe improve it by going to the frequency domain. How many coefficients do you have in your HMMs for the vocal tract parameters? [Answer] I cannot remember that number right now. [Question] OK, thank you. [Answer] Thank you.