| 0:00:14 | okay hello everybody |
|---|
| 0:00:15 | i'm Tuomo Raitio from Aalto University |
|---|
| 0:00:18 | helsinki finland |
|---|
| 0:00:20 | and i'm going to talk about HMM-based |
|---|
| 0:00:22 | speech synthesis |
|---|
| 0:00:23 | and how to improve its quality |
|---|
| 0:00:26 | by utilizing a glottal source pulse library |
|---|
| 0:00:30 | and this is |
|---|
| 0:00:31 | um |
|---|
| 0:00:32 | made in collaboration with |
|---|
| 0:00:34 | my colleagues |
|---|
| 0:00:35 | Antti Suni and Martti Vainio |
|---|
| 0:00:37 | from the University of Helsinki |
|---|
| 0:00:40 | and Hannu Pulakka and Paavo Alku |
|---|
| 0:00:42 | um from Aalto University |
|---|
| 0:00:45 | okay so here's the content |
|---|
| 0:00:47 | of my talk |
|---|
| 0:00:48 | so let's go straight to the background |
|---|
| 0:00:52 | so |
|---|
| 0:00:53 | the goal of text-to-speech is to generate natural-sounding expressive speech |
|---|
| 0:00:58 | from arbitrary text and there are |
|---|
| 0:01:00 | two major TTS trends |
|---|
| 0:01:02 | one is unit selection which is based on |
|---|
| 0:01:04 | concatenating pre-recorded |
|---|
| 0:01:06 | acoustic units |
|---|
| 0:01:07 | and this yields |
|---|
| 0:01:08 | um |
|---|
| 0:01:10 | very good quality at its best |
|---|
| 0:01:12 | but the adaptability |
|---|
| 0:01:14 | is somewhat poor |
|---|
| 0:01:16 | the other method is |
|---|
| 0:01:17 | statistical |
|---|
| 0:01:18 | which is based on modeling speech parameters with hidden Markov models |
|---|
| 0:01:22 | and it has better adaptability |
|---|
| 0:01:24 | and |
|---|
| 0:01:25 | this work |
|---|
| 0:01:26 | is about statistical |
|---|
| 0:01:29 | synthesis |
|---|
| 0:01:31 | so |
|---|
| 0:01:33 | but the problem is that the quality is not too good |
|---|
| 0:01:36 | so our proposal for this |
|---|
| 0:01:39 | is |
|---|
| 0:01:39 | to |
|---|
| 0:01:40 | decompose the speech signal into the glottal source |
|---|
| 0:01:43 | signal and the vocal tract transfer function |
|---|
| 0:01:47 | and second |
|---|
| 0:01:48 | um |
|---|
| 0:01:49 | to further decompose the glottal source into several parameters |
|---|
| 0:01:53 | and a glottal pulse library |
|---|
| 0:01:56 | and then we model these parameters in a normal |
|---|
| 0:01:59 | HMM-based |
|---|
| 0:02:00 | speech synthesis framework |
|---|
| 0:02:01 | HTS |
|---|
| 0:02:03 | and |
|---|
| 0:02:03 | in synthesis |
|---|
| 0:02:05 | we reconstruct the |
|---|
| 0:02:07 | um |
|---|
| 0:02:08 | source signal from the pulses and the parameters |
|---|
| 0:02:12 | and filter it with the vocal tract filter |
|---|
| 0:02:17 | so the basics so the source of the um |
|---|
| 0:02:21 | voiced speech |
|---|
| 0:02:22 | is the glottal excitation |
|---|
| 0:02:25 | and then the |
|---|
| 0:02:26 | signal goes through the vocal tract and then we have speech |
|---|
| 0:02:29 | so we are interested in this glottal excitation very much in this work |
|---|
| 0:02:34 | so |
|---|
| 0:02:36 | and uh |
|---|
| 0:02:36 | here |
|---|
| 0:02:38 | uh we have a speech signal |
|---|
| 0:02:41 | and the estimated |
|---|
| 0:02:42 | glottal flow below that |
|---|
| 0:02:44 | and |
|---|
| 0:02:45 | how can we |
|---|
| 0:02:46 | estimate this signal |
|---|
| 0:02:48 | we can for example use a method called glottal inverse filtering |
|---|
| 0:02:52 | which estimates |
|---|
| 0:02:53 | the glottal source signal |
|---|
| 0:02:54 | from the speech signal itself |
|---|
| 0:02:57 | there are several methods |
|---|
| 0:02:58 | to perform this task |
|---|
| 0:03:00 | i won't go further into that |
|---|
| 0:03:03 | but we use |
|---|
| 0:03:05 | a method that is based on iterative |
|---|
| 0:03:07 | LPC |
|---|
| 0:03:08 | use of LPC |
|---|
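
A minimal sketch of the idea behind LPC-based inverse filtering, under stated assumptions: it does a single LPC pass (autocorrelation method with the Levinson-Durbin recursion) and filters the signal with the resulting inverse filter. It is not the iterative method the speaker refers to, and the function names `autocorrelation`, `levinson_durbin` and `inverse_filter` are illustrative only.

```python
def autocorrelation(x, order):
    """Autocorrelation r[0..order] of a frame x."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(order + 1)]


def levinson_durbin(r, order):
    """Solve for predictor polynomial A(z) = 1 + a1*z^-1 + ... + ap*z^-p."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)                # remaining prediction error
    return a


def inverse_filter(x, a):
    """Apply the FIR inverse filter A(z) to x; the output is the residual."""
    return [sum(a[k] * x[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(x))]


# Toy example: the impulse response of a one-pole "vocal tract" 1 / (1 - 0.9 z^-1).
speech = [0.9 ** n for n in range(200)]
coeffs = levinson_durbin(autocorrelation(speech, 1), 1)
residual = inverse_filter(speech, coeffs)   # close to a single impulse
```

In this toy case the residual collapses back to the impulse that excited the filter, which is the sense in which inverse filtering recovers a source estimate from the speech signal itself.
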
| 0:03:13 | okay and then to the speech synthesis system |
|---|
| 0:03:16 | so this is |
|---|
| 0:03:17 | very familiar |
|---|
| 0:03:18 | to most of you |
|---|
| 0:03:20 | but i will go through this fast so we have a |
|---|
| 0:03:22 | speech database and then we parameterize it |
|---|
| 0:03:25 | and train |
|---|
| 0:03:26 | the parameters according to the labels |
|---|
| 0:03:30 | and in synthesis |
|---|
| 0:03:31 | we input text and |
|---|
| 0:03:35 | we can generate parameters according to the text and so we can reconstruct |
|---|
| 0:03:39 | speech |
|---|
| 0:03:40 | and in this work we are interested in these |
|---|
| 0:03:43 | parameterization and synthesis steps |
|---|
| 0:03:46 | and |
|---|
| 0:03:48 | by improving these we |
|---|
| 0:03:49 | try to make the speech |
|---|
| 0:03:51 | more natural |
|---|
| 0:03:54 | so what we do in speech parameterization is we first |
|---|
| 0:03:57 | window the signal of course and |
|---|
| 0:04:00 | extract the energy |
|---|
| 0:04:01 | and then we do |
|---|
| 0:04:04 | inverse filtering so we decompose the speech signal |
|---|
| 0:04:07 | into the physiologically corresponding parts which are the |
|---|
| 0:04:11 | glottal source |
|---|
| 0:04:12 | and uh vocal tract |
|---|
| 0:04:15 | we parameterize the vocal tract with LSFs |
|---|
| 0:04:19 | and further |
|---|
| 0:04:21 | parameterize the source with several parameters |
|---|
| 0:04:24 | which are |
|---|
| 0:04:25 | fundamental frequency |
|---|
| 0:04:26 | harmonic-to-noise ratio |
|---|
| 0:04:28 | spectral tilt with LPC and |
|---|
| 0:04:31 | harmonic magnitudes |
|---|
| 0:04:33 | in the lower band |
|---|
| 0:04:35 | and finally |
|---|
| 0:04:37 | we extract the glottal flow pulses to the library |
|---|
| 0:04:41 | and link the pulses with the corresponding |
|---|
| 0:04:43 | source parameters |
|---|
| 0:04:46 | um |
|---|
| 0:04:47 | so how do we do that |
|---|
| 0:04:49 | first |
|---|
| 0:04:50 | um we determine the glottal closure instants |
|---|
| 0:04:53 | from the differentiated glottal flow signal |
|---|
| 0:04:57 | and then we extract each complete |
|---|
| 0:05:00 | two-period glottal source segment |
|---|
| 0:05:02 | and window it with the hann window |
|---|
| 0:05:06 | then we link |
|---|
| 0:05:08 | to it |
|---|
| 0:05:09 | the corresponding glottal source parameters which are the energy fundamental frequency |
|---|
| 0:05:14 | voice source spectrum harmonic-to-noise ratio and the harmonics |
|---|
| 0:05:19 | and in addition we store |
|---|
| 0:05:22 | a down-sampled ten-millisecond version of the pulse waveform |
|---|
| 0:05:26 | in order to |
|---|
| 0:05:28 | calculate the concatenation cost |
|---|
| 0:05:30 | in the synthesis stage |
|---|
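
The pulse extraction step just described can be sketched roughly as follows, assuming the glottal closure instants are already available as sample indices. Each pulse spans from the previous closure instant to the next one (two pitch periods) and is Hann-windowed. The names `hann` and `extract_pulses` are hypothetical, and the parameter linking and down-sampling steps are left out.

```python
import math


def hann(n):
    """Symmetric Hann window of length n."""
    if n == 1:
        return [1.0]
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]


def extract_pulses(source, gcis):
    """Cut one two-period, Hann-windowed segment around each interior closure instant.

    source: estimated glottal source signal (list of floats)
    gcis:   sorted sample indices of glottal closure instants
    """
    pulses = []
    for prev, _center, nxt in zip(gcis, gcis[1:], gcis[2:]):
        segment = source[prev:nxt + 1]          # previous GCI .. next GCI
        window = hann(len(segment))
        pulses.append([s * w for s, w in zip(segment, window)])
    return pulses


# Toy example: a constant "source" with closure instants every 10 samples.
library = extract_pulses([1.0] * 41, [0, 10, 20, 30, 40])
```

With evenly spaced closure instants the window peak falls on the middle instant; with uneven pitch periods it would not, which a real implementation would need to handle.
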
| 0:05:35 | and the pulse library may consist of hundreds |
|---|
| 0:05:38 | or even thousands of glottal flow pulses |
|---|
| 0:05:41 | um and here is an example of some of the pulses |
|---|
| 0:05:46 | um um |
|---|
| 0:05:47 | from a male speaker |
|---|
| 0:05:51 | okay |
|---|
| 0:05:52 | and then the synthesis stage |
|---|
| 0:05:55 | so what we do is we want to reconstruct the voiced |
|---|
| 0:05:59 | excitation |
|---|
| 0:06:00 | so we select |
|---|
| 0:06:01 | the |
|---|
| 0:06:02 | best matching pulses |
|---|
| 0:06:04 | according to the |
|---|
| 0:06:05 | source parameters generated by the HMM |
|---|
| 0:06:08 | and um |
|---|
| 0:06:10 | from the library |
|---|
| 0:06:12 | we scale the amplitude |
|---|
| 0:06:14 | of the pulses |
|---|
| 0:06:15 | and overlap-add them |
|---|
| 0:06:17 | to create the voiced excitation |
|---|
| 0:06:19 | for the unvoiced excitation we use only white noise |
|---|
| 0:06:23 | and we filter the combined excitation and get the synthetic |
|---|
| 0:06:27 | speech |
|---|
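
A rough sketch of the scale and overlap-add step for the voiced excitation, assuming one selected pulse, one gain and one fundamental frequency value per pitch period. The function name `overlap_add` and its arguments are illustrative. The unvoiced noise and the vocal tract filtering are not included.

```python
def overlap_add(pulses, gains, f0_hz, sample_rate):
    """Scale each pulse and sum the pulses at pitch-period spacing.

    pulses:      list of pulse waveforms (lists of floats)
    gains:       one amplitude scale factor per pulse
    f0_hz:       fundamental frequency for each pulse, in Hz
    sample_rate: sampling rate in Hz
    """
    # The start of each pulse is one pitch period after the previous start.
    positions = []
    position = 0
    for f0 in f0_hz:
        positions.append(position)
        position += round(sample_rate / f0)

    length = max(p + len(w) for p, w in zip(positions, pulses))
    excitation = [0.0] * length
    for start, gain, pulse in zip(positions, gains, pulses):
        for k, value in enumerate(pulse):
            excitation[start + k] += gain * value   # overlapping parts are summed
    return excitation


# Two 4-sample pulses, 2 samples apart: the middle two samples overlap.
voiced = overlap_add([[1.0] * 4, [1.0] * 4], [1.0, 2.0], [100.0, 100.0], 200)
```

Because each stored pulse covers two pitch periods, consecutive pulses overlap by about half their length, so the windowed segments sum to a continuous excitation.
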
| 0:06:30 | and |
|---|
| 0:06:31 | the pulses are |
|---|
| 0:06:33 | um selected |
|---|
| 0:06:34 | by minimizing the joint cost composed of target |
|---|
| 0:06:37 | and concatenation costs |
|---|
| 0:06:39 | so the target cost is the uh |
|---|
| 0:06:41 | root mean square error between the voice source parameters |
|---|
| 0:06:44 | generated by the HMM and the ones stored |
|---|
| 0:06:46 | for each pulse |
|---|
| 0:06:48 | and we can of course have different weights |
|---|
| 0:06:50 | for different parameters |
|---|
| 0:06:52 | to tune |
|---|
| 0:06:53 | this system |
|---|
| 0:06:55 | and the concatenation cost is the RMS |
|---|
| 0:06:58 | error between the down-sampled versions of the pulses |
|---|
| 0:07:01 | to be concatenated |
|---|
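
One way to minimize such a joint cost is a dynamic-programming (Viterbi-style) search over the pulse sequence. The sketch below assumes each library entry is a `(parameters, downsampled_waveform)` pair; the names `rms_error` and `select_pulses`, and the weights `w_target` and `w_concat`, are hypothetical. It illustrates the cost structure described above and is not the system's actual implementation.

```python
import math


def rms_error(a, b, weights=None):
    """(Optionally weighted) root mean square error between two vectors."""
    if weights is None:
        weights = [1.0] * len(a)
    total = sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b))
    return math.sqrt(total / len(a))


def select_pulses(targets, library, weights=None, w_target=1.0, w_concat=1.0):
    """Return library indices minimizing target cost plus concatenation cost.

    targets: one generated source-parameter vector per pulse position
    library: list of (parameters, downsampled_waveform) pairs
    """
    n = len(library)
    # Best cumulative cost of any path ending in pulse j at the first position.
    cost = [w_target * rms_error(targets[0], params, weights)
            for params, _ in library]
    back_pointers = []
    for target in targets[1:]:
        new_cost, pointers = [], []
        for params_j, wave_j in library:
            # Cheapest predecessor, taking the join with pulse j into account.
            joins = [cost[i] + w_concat * rms_error(library[i][1], wave_j)
                     for i in range(n)]
            best_i = min(range(n), key=joins.__getitem__)
            new_cost.append(joins[best_i]
                            + w_target * rms_error(target, params_j, weights))
            pointers.append(best_i)
        cost = new_cost
        back_pointers.append(pointers)

    # Trace the best path backwards from the cheapest final pulse.
    j = min(range(n), key=cost.__getitem__)
    path = [j]
    for pointers in reversed(back_pointers):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return path


library = [([100.0, 0.1], [0.0, 1.0, 0.0]),
           ([120.0, 0.5], [0.0, 0.9, 0.1]),
           ([200.0, 0.9], [1.0, -1.0, 1.0])]
targets = [[100.0, 0.1], [121.0, 0.5], [119.0, 0.5]]
chosen = select_pulses(targets, library)
```

The search cost grows with the square of the library size per pulse position, so a large library would likely need pruning of candidates by target cost first.
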
| 0:07:08 | okay here's an example of the |
|---|
| 0:07:11 | um |
|---|
| 0:07:12 | how all this goes |
|---|
| 0:07:14 | so here is |
|---|
| 0:07:15 | the voiced excitation |
|---|
| 0:07:25 | and |
|---|
| 0:07:26 | and |
|---|
| 0:07:29 | and here is the unvoiced excitation |
|---|
| 0:07:32 | less interesting |
|---|
| 0:07:40 | and |
|---|
| 0:07:40 | then combined i won't play |
|---|
| 0:07:42 | that two thirty |
|---|
| 0:07:44 | and with the filtering we get finally |
|---|
| 0:07:47 | the speech |
|---|
| 0:07:50 | [audio sample] |
|---|
| 0:07:58 | so that was Finnish so probably you didn't understand |
|---|
| 0:08:01 | i have more samples |
|---|
| 0:08:02 | later |
|---|
| 0:08:05 | well first about the results |
|---|
| 0:08:07 | so |
|---|
| 0:08:08 | earlier we had used in the same system |
|---|
| 0:08:10 | only one glottal flow pulse which was modified |
|---|
| 0:08:13 | according to the voice source parameters and we had the result that |
|---|
| 0:08:16 | it was preferred over the |
|---|
| 0:08:19 | basic STRAIGHT method |
|---|
| 0:08:22 | and |
|---|
| 0:08:23 | we have some samples from |
|---|
| 0:08:24 | that system |
|---|
| 0:08:35 | [audio samples] |
|---|
| 0:08:56 | so we also participated in the Blizzard Challenge |
|---|
| 0:08:59 | two thousand ten |
|---|
| 0:09:00 | with a |
|---|
| 0:09:01 | more more results |
|---|
| 0:09:04 | and here are some samples from that |
|---|
| 0:09:10 | [audio samples] |
|---|
| 0:09:37 | so the quality is uh |
|---|
| 0:09:39 | quite good |
|---|
| 0:09:40 | and here are the samples comparing the single pulse technique and the |
|---|
| 0:09:45 | pulse library technique |
|---|
| 0:09:48 | the |
|---|
| 0:09:49 | i hope you can hear the differences in this |
|---|
| 0:09:52 | since the difference is not so big |
|---|
| 0:09:55 | so maybe i plate the in this persons |
|---|
| 0:10:12 | [audio samples] |
|---|
| 0:10:27 | yeah |
|---|
| 0:10:28 | and |
|---|
| 0:10:30 | i heard some |
|---|
| 0:10:32 | some differences |
|---|
| 0:10:33 | yes |
|---|
| 0:10:35 | um |
|---|
| 0:10:36 | okay here are uh spectrograms comparing |
|---|
| 0:10:39 | the |
|---|
| 0:10:40 | difference in quality if you didn't hear it |
|---|
| 0:10:42 | you can see for example that uh here |
|---|
| 0:10:45 | that the noise |
|---|
| 0:10:46 | is modeled better |
|---|
| 0:10:48 | uh |
|---|
| 0:10:49 | this is the pulse library technique and this one is the single pulse technique |
|---|
| 0:10:53 | and |
|---|
| 0:10:54 | for example voiced fricatives here |
|---|
| 0:10:56 | are modeled better because the single pulse technique couldn't |
|---|
| 0:10:59 | produce |
|---|
| 0:11:00 | um |
|---|
| 0:11:02 | soft pulses |
|---|
| 0:11:04 | and high frequencies as well |
|---|
| 0:11:05 | are more natural |
|---|
| 0:11:11 | and we conducted some |
|---|
| 0:11:14 | listening tests |
|---|
| 0:11:16 | and we found that the |
|---|
| 0:11:18 | method was slightly preferred over the single pulse technique |
|---|
| 0:11:23 | but the difference was not so great |
|---|
| 0:11:25 | but the uh |
|---|
| 0:11:27 | speaker similarity |
|---|
| 0:11:28 | was better |
|---|
| 0:11:31 | and |
|---|
| 0:11:32 | very many |
|---|
| 0:11:34 | um |
|---|
| 0:11:35 | sounds were a lot more natural |
|---|
| 0:11:38 | yeah |
|---|
| 0:11:38 | this is |
|---|
| 0:11:39 | kind of um |
|---|
| 0:11:41 | concatenation of the source |
|---|
| 0:11:43 | so there are the same problems as in concatenative synthesis |
|---|
| 0:11:47 | so we have some discontinuities there |
|---|
| 0:11:50 | and some more artifacts |
|---|
| 0:11:52 | compared to the single pulse technique |
|---|
| 0:12:00 | okay here's the summary so we have |
|---|
| 0:12:03 | a physiologically motivated high-quality speech synthesizer |
|---|
| 0:12:06 | and this allows |
|---|
| 0:12:08 | for better modeling and control of the speech parameters |
|---|
| 0:12:12 | a a speech X |
|---|
| 0:12:13 | have take this and |
|---|
| 0:12:15 | this pulse library generates more natural excitation |
|---|
| 0:12:18 | because it |
|---|
| 0:12:19 | a three like and |
|---|
| 0:12:21 | in the three |
|---|
| 0:12:22 | and it is slightly preferred over the single pulse |
|---|
| 0:12:26 | and here are the |
|---|
| 0:12:27 | references |
|---|
| 0:12:28 | and i thank you for your attention |
|---|
| 0:12:38 | time for questions |
|---|
| 0:12:39 | try the microphones |
|---|
| 0:12:51 | can i ask a question or two |
|---|
| 0:12:54 | oh |
|---|
| 0:12:55 | on unit selection of um |
|---|
| 0:12:57 | pitch periods |
|---|
| 0:12:58 | could you say something about how large the inventory is and how complex that search is that uh |
|---|
| 0:13:03 | that's potentially a much larger search problem than with |
|---|
| 0:13:05 | diphone-size units well yeah we are in the |
|---|
| 0:13:09 | initial stage of developing this still |
|---|
| 0:13:11 | we are not experts in concatenative synthesis but |
|---|
| 0:13:14 | we have |
|---|
| 0:13:15 | um tried |
|---|
| 0:13:17 | um |
|---|
| 0:13:18 | various sizes |
|---|
| 0:13:19 | from for example ten pulses to twenty thousand pulses |
|---|
| 0:13:23 | and it depends |
|---|
| 0:13:24 | depends a lot |
|---|
| 0:13:26 | uh |
|---|
| 0:13:26 | on the speech material sometimes |
|---|
| 0:13:28 | even a hundred pulses might be almost as good as |
|---|
| 0:13:32 | ten thousand pulses |
|---|
| 0:13:33 | so |
|---|
| 0:13:34 | um we are trying to make some sense of how to choose the |
|---|
| 0:13:39 | pulses so that it could be |
|---|
| 0:13:43 | uh |
|---|
| 0:13:43 | done with |
|---|
| 0:13:45 | very few pulses |
|---|
| 0:13:50 | and my question is could you tell me how to choose an appropriate glottal pulse |
|---|
| 0:13:55 | from the library |
|---|
| 0:13:56 | so how to choose |
|---|
| 0:13:59 | from the library |
|---|
| 0:14:00 | how to select so how to choose a pulse from the library yeah |
|---|
| 0:14:04 | yeah we have this uh target cost |
|---|
| 0:14:07 | and concatenation cost |
|---|
| 0:14:08 | and we have weights |
|---|
| 0:14:09 | for all of these |
|---|
| 0:14:11 | the weights are tuned by hand |
|---|
| 0:14:13 | at this moment |
|---|
| 0:14:14 | and the target cost is the |
|---|
| 0:14:16 | RMS error between the source parameters |
|---|
| 0:14:19 | of the library |
|---|
| 0:14:21 | and the ones from the HMMs |
|---|
| 0:14:24 | and |
|---|
| 0:14:25 | the concatenation cost is uh |
|---|
| 0:14:27 | in this but be D D over only over three pulses |
|---|
| 0:14:32 | so it measures |
|---|
| 0:14:33 | the similarity between the |
|---|
| 0:14:36 | pulses |
|---|
| 0:14:37 | and they should be as similar as possible |
|---|
| 0:14:40 | at least we could start this way |
|---|
| 0:14:44 | and |
|---|
| 0:14:45 | to optimize the whole combination of these |
|---|
| 0:14:48 | uh we use Viterbi over a voiced segment |
|---|
| 0:14:52 | to uh |
|---|
| 0:14:53 | optimize this procedure |
|---|
| 0:14:56 | um isn't there possibly an inconsistency between |
|---|
| 0:15:00 | the chosen glottal pulse |
|---|
| 0:15:02 | and |
|---|
| 0:15:04 | and the source parameters |
|---|
| 0:15:07 | which are generated from the HMMs because they are modeled in the HMMs or |
|---|
| 0:15:12 | since |
|---|
| 0:15:13 | i have another question regarding the same thing |
|---|
| 0:15:16 | i was surprised that you use the line spectral pairs or LSFs |
|---|
| 0:15:22 | for modeling the pulses |
|---|
| 0:15:24 | and can you |
|---|
| 0:15:25 | could you give some |
|---|
| 0:15:26 | rationale why you want to choose that and also |
|---|
| 0:15:30 | with the same RMS what is it really that you are measuring the similarity of what content |
|---|
| 0:15:36 | is that frequency spectral |
|---|
| 0:15:38 | nature or phase or whatever |
|---|
| 0:15:42 | um |
|---|
| 0:15:43 | and |
|---|
| 0:15:45 | you mean um |
|---|
| 0:15:46 | here |
|---|
| 0:15:48 | the voice source spectrum |
|---|
| 0:15:50 | here |
|---|
| 0:15:52 | yeah the parameterization |
|---|
| 0:15:55 | yeah |
|---|
| 0:15:56 | so the vocal tract spectrum is set up as LSFs and the |
|---|
| 0:16:01 | spectrum of the source yes yeah |
|---|
| 0:16:04 | yeah that's an interesting question |
|---|
| 0:16:07 | the |
|---|
| 0:16:09 | it stems from the fact that firstly |
|---|
| 0:16:12 | we have modeled |
|---|
| 0:16:13 | the spectrum of the source by this |
|---|
| 0:16:16 | and here we have included |
|---|
| 0:16:18 | it as a parameter here |
|---|
| 0:16:21 | actually i'm not sure |
|---|
| 0:16:23 | sure that it would be the |
|---|
| 0:16:25 | most uh |
|---|
| 0:16:27 | useful it is |
|---|
| 0:16:28 | a problem how to choose the best parameters for this |
|---|
| 0:16:32 | and the research is going on into what are the best parameters |
|---|
| 0:16:35 | for selecting the best pulses |
|---|
| 0:16:38 | so |
|---|
| 0:16:39 | yeah we try to model the spectral tilt |
|---|
| 0:16:41 | and the spectral fine structure with the |
|---|
| 0:16:44 | LSFs |
|---|
| 0:16:45 | of the source |
|---|
| 0:16:47 | as it it's probably be of because they're not |
|---|
| 0:16:49 | really like uh |
|---|
| 0:16:51 | and the most is |
|---|
| 0:16:53 | so the distortion is measured in the frequency domain magnitude |
|---|
| 0:16:57 | uh yeah LSFs |
|---|
| 0:16:59 | yeah i just wanted to be sure whether you are measuring the whole thing |
|---|
| 0:17:02 | in the time domain or in the frequency domain or with or without phase or what |
|---|
| 0:17:08 | no it's only the LSFs |
|---|
| 0:17:10 | okay |
|---|
| 0:17:11 | so we can improve it maybe |
|---|
| 0:17:13 | to put it in the frequency domain |
|---|
| 0:17:21 | oh |
|---|
| 0:17:22 | oh |
|---|
| 0:17:26 | how many coefficients did you use in your HMM synthesis for the vocal tract parameters |
|---|
| 0:17:32 | so yeah i cannot remember that number |
|---|
| 0:17:38 | okay i think of |
|---|
| 0:17:46 | okay |
|---|
| 0:17:47 | thank you |
|---|