0:00:13 | and |
---|---|

0:00:13 | just that the stance was you met and the statistics in as creation ask data |

0:00:18 | and the subject of my talk is to introduce an improved method for text-independent |

0:00:23 | phonetic segmentation based on the microcanonical multiscale formalism |

0:00:28 | in brief |

0:00:29 | i will first focus on why we view speech as a complex signal |

0:00:33 | in the physical sense, |

0:00:34 | that is to say, to view it |

0:00:37 | as a realisation of a complex system |

0:00:40 | then, after that, |

0:00:41 | i will introduce a recent formalism in the study of complex systems which might be used as a powerful tool |

0:00:47 | to catch the nonlinear character of the speech signal |

0:00:49 | this is called the microcanonical multiscale formalism, MMF, |

0:00:53 | and i will show the general potential of the MMF to be applied to speech |

0:00:58 | analysis |

0:00:59 | and then i will turn to the |

0:01:02 | application of this formalism to phonetic segmentation of the speech signal, and i will introduce |

0:01:07 | a basic method and an improvement to it for segmentation |

0:01:10 | and finally i will take some time to present experimental results and to conclude |

0:01:16 | so it has been |

0:01:17 | theoretically and experimentally established that there exist |

0:01:21 | nonlinear phenomena in the production process of the speech |

0:01:25 | signal, for example the Reynolds number, which is a |

0:01:28 | number characterising different flow regimes, |

0:01:31 | is reported to be in the thousands, |

0:01:33 | which corresponds to turbulent flow |

0:01:35 | but as we know, most of the |

0:01:38 | approaches in speech processing are |

0:01:40 | based on the linear source-filter model, which cannot adequately take into account |

0:01:45 | the nonlinear character of the speech signal |

0:01:48 | hence our goal here is to find and evaluate key parameters which are responsible for the complex |

0:01:54 | character of the speech signal |

0:01:55 | previous studies have shown that such parameters do exist, but they are very hard to estimate |

0:02:02 | our strategy is to take the |

0:02:04 | knowledge coming from statistical physics and to relate the complexity to the predictability of each point inside |

0:02:10 | the signal |

0:02:11 | and in practice we need to |

0:02:13 | develop computationally efficient tools to |

0:02:16 | yeah |

0:02:19 | to estimate these parameters, if they exist, and to use them in practical applications |

0:02:25 | as important one |

0:02:27 | in the study of complex systems, the first phase started in the late forties with the classical work |

0:02:32 | of Kolmogorov |

0:02:34 | and |

0:02:34 | which was the basis for the later approaches in this domain, |

0:02:38 | which are based on the study of structure functions |

0:02:41 | the main result of these methods is to |

0:02:44 | recognise globally the existence of a multiscale structure without giving access to |

0:02:50 | state there |

0:02:51 | i mean |

0:02:53 | oh is a use is two |

0:02:55 | side |

0:02:56 | because they are based on statistical averages and on the stationarity assumption |

0:03:01 | they can be used to decide whether a system is complex or not, but they do not give much more information |

0:03:06 | in the second phase of methods we try to |

0:03:08 | uh |

0:03:09 | determine geometrically, inside the signal, where the complexity happens and how it is organised |

0:03:15 | in more |

0:03:16 | precise terms, we try to find a subset inside the signal which has the highest |

0:03:20 | information content, and we try to explain how |

0:03:23 | the transfer of |

0:03:24 | information between different scales |

0:03:28 | organises itself |

0:03:30 | these methods have been made possible by the approach of statistical physics in the study of |

0:03:35 | i lily system and the two size |

0:03:38 | the study of the notion of |

0:03:40 | transition inside a complex system |

0:03:44 | it has been shown that geometric multiscale organisation is responsible for the complexity inside |

0:03:50 | a signal |

0:03:51 | a typical example of this is the cascade of energy in fully developed turbulence |

0:03:57 | the fingerprint in practice is the existence of a power-law behaviour in the temporal correlation functions, |

0:04:03 | which have to be |

0:04:04 | evaluated without any stationarity assumption, at each point inside the signal |

0:04:09 | the exponents related to this power law, as we will see shortly, |

0:04:14 | are called singularity exponents, and it can be shown that they completely explain the |

0:04:19 | organisation of multiscale structures |

0:04:23 | and |

0:04:26 | as an example, in the |

0:04:28 | classical canonical formalism in the study of multiscale signals, |

0:04:33 | the canonical formalism, which was the first attempt to estimate |

0:04:37 | singularity exponents as a global property of the signal, with |

0:04:41 | what is called the Legendre spectrum, in this equation we have |

0:04:45 | a complex signal s |

0:04:47 | and a multiresolution function Γ_r operating at the scale r, |

0:04:53 | and the brackets stand for expectation over |

0:04:56 | a statistical ensemble |

0:04:59 | the exponent of this power law, τ_p, could be related to the |

0:05:03 | distribution of singularity exponents |

0:05:05 | through a Legendre transform, but the main problem is that it is a global description, it does not give access to |

0:05:11 | equal |

0:05:12 | and the local dynamics of the signal |

0:05:15 | so in the |

0:05:17 | microcanonical formalism we try, |

0:05:19 | instead of relying on statistical averages, to look |

0:05:23 | inside the signal |

0:05:26 | we try to introduce |

0:05:27 | singularity exponents, each of which |

0:05:29 | is related to a geometric location inside the signal, via |

0:05:33 | the time index t here, and |

0:05:35 | yeah |

0:05:36 | the multiresolution function Γ_r |

0:05:38 | at the scale r, and here the |

0:05:41 | exponent of this power law is called the singularity exponent |

0:05:45 | and |

0:05:46 | can be estimated |

0:05:47 | precisely at |

0:05:49 | the transition fronts of the signal |

0:05:51 | yeah |

0:05:52 | the main problem is the precise estimation of these parameters |

0:05:56 | and in this regard one of the crucial |

0:06:00 | problems is the choice of the functional Γ_r, for example we can use |

0:06:05 | simply the linear increments, |

0:06:07 | but it has been shown that this does not give a precise estimation of h(t) because of |

0:06:12 | the |

0:06:13 | instability and sensitivity of these |

0:06:16 | linear increments |

0:06:17 | a better choice for that |

0:06:19 | is the gradient modulus measure, which is defined as the integral of the gradient modulus over the |

0:06:25 | ball |

0:06:26 | of radius r, B_r(t) in this equation, normalised by the Lebesgue measure on the real line |

0:06:31 | this is defined from the typical characterisation of |

0:06:35 | kinetic energy in turbulence, and |

0:06:37 | it has been shown that it |

0:06:40 | can |

0:06:41 | is related to the information content of each point if we use this measure for |

0:06:46 | yeah |

0:06:47 | the calculation of h(t) |
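
The estimation of h(t) described above can be sketched in code. This is a minimal illustration, not the speaker's estimator: it assumes the power law Γ_r(s(t)) ∝ r^h(t), takes Γ_r as the average of the gradient modulus over a window of half-width r samples, and estimates h(t) as the slope of a log-log regression across a few scales. The function name `singularity_exponents`, the set of radii and the small constant `eps` are assumptions made for this sketch.

```python
import numpy as np

def singularity_exponents(s, radii=(1, 2, 4, 8, 16)):
    """Rough per-sample estimate of h(t) by log-log regression over scales.

    Hypothetical helper: the radii and the regression are illustrative
    choices, not the estimation procedure used in the talk.
    """
    s = np.asarray(s, dtype=float)
    grad = np.abs(np.gradient(s))              # gradient modulus |s'(t)|
    eps = 1e-12                                # avoids log(0) in flat regions
    log_r = np.log(np.asarray(radii, dtype=float))
    log_gamma = np.empty((len(radii), s.size))
    for i, r in enumerate(radii):
        # average of |s'| over the ball B_r(t): integral normalised by its length
        kernel = np.ones(2 * r + 1) / (2 * r + 1)
        gamma = np.convolve(grad, kernel, mode="same")
        log_gamma[i] = np.log(gamma + eps)
    # least-squares slope of log Γ_r against log r, independently at each t
    x = log_r - log_r.mean()
    y = log_gamma - log_gamma.mean(axis=0)
    return (x[:, None] * y).sum(axis=0) / (x ** 2).sum()
```

With this convention a sharp transition gives a negative exponent, while smooth or flat regions give values close to zero, which agrees with the remark that the lowest exponents mark the least predictable points.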

0:06:50 | so, having this, if we can have a good estimate of h(t) |

0:06:55 | we can mark |

0:06:57 | a very important subset inside the signal which is called the most singular manifold, this corresponds to |

0:07:02 | the |

0:07:02 | points inside the signal which have the lowest values of singularity exponents |

0:07:06 | it has been shown that the |

0:07:08 | lower the value of the singularity exponent, the higher |

0:07:12 | the unpredictability at the given point |

0:07:13 | so the critical transitions of the signal are happening |

0:07:19 | at these points |

0:07:20 | and a reconstruction formula has been proposed |

0:07:24 | and it has been shown in many applications that we can reconstruct the whole signal having access to only |

0:07:29 | this small subset of the data |

0:07:31 | so this is just to show the importance of the singularity exponents |

0:07:35 | having said that, we can turn to see how they can be applied to the speech signal |

0:07:39 | previously we have shown the estimation procedure of h(t) for the speech signal, and we have shown |

0:07:45 | that we can have |

0:07:46 | a good estimate of h(t) for the majority of points in the speech signal, here we |

0:07:51 | have a speech signal extracted from the |

0:07:54 | TIMIT database, with vertical red lines which show the |

0:07:57 | phoneme boundaries, taken from the manual transcriptions provided in the TIMIT database, and |

0:08:02 | of course the objective of text-independent phonetic segmentation is to identify these phoneme boundaries |

0:08:08 | within a |

0:08:09 | tolerance window |

0:08:12 | so |

0:08:14 | since these are |

0:08:15 | different phonemes |

0:08:16 | and we know that they have different statistical properties, we |

0:08:20 | expect the singularity exponents to have different behaviours |

0:08:24 | to show this we studied the |

0:08:27 | a can |

0:08:27 | distribution of the singularity exponents, the time evolution of the distribution of singularity exponents |

0:08:33 | so we have windows of length thirty milliseconds, we compute the |

0:08:36 | histogram of h |

0:08:38 | and we plot its |

0:08:40 | evolution over time |

0:08:42 | in this graphical representation, which is the conditional |

0:08:48 | histogram of singularity exponents conditioned on time, |

0:08:52 | we can easily note a remarkable change in the distribution of singularity exponents between different phonemes |
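
The time-conditioned histogram described above could be computed along the following lines. This is a sketch under assumptions: the windows are taken as non-overlapping, the sampling rate defaults to 16 kHz (the rate of TIMIT recordings), and the number of bins and the value range are arbitrary choices. The function name `histogram_evolution` is hypothetical.

```python
import numpy as np

def histogram_evolution(h, fs=16000, win_ms=30, bins=20, value_range=(-1.0, 2.0)):
    """Normalised histogram of the exponents h in consecutive windows.

    Returns an array of shape (n_windows, bins), one histogram per row,
    together with the bin edges. All parameters are illustrative.
    """
    h = np.asarray(h, dtype=float)
    win = int(fs * win_ms / 1000)              # window length in samples
    n_windows = len(h) // win
    edges = np.linspace(value_range[0], value_range[1], bins + 1)
    out = np.zeros((n_windows, bins))
    for i in range(n_windows):
        counts, _ = np.histogram(h[i * win:(i + 1) * win], bins=edges)
        out[i] = counts / max(counts.sum(), 1)  # guard against empty windows
    return out, edges
```

Plotting the returned array as an image, with time on one axis and exponent value on the other, gives the kind of representation in which a change of distribution between phonemes becomes visible.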

0:08:59 | this has been extensively |

0:09:02 | evaluated over different speech |

0:09:04 | signals |

0:09:05 | but the problem is that we cannot use this |

0:09:08 | graphical representation for developing |

0:09:11 | an automatic segmentation algorithm |

0:09:14 | or you provide a E |

0:09:16 | is here to be used for an automatic algorithm |

0:09:19 | we observe that the easiest interpretation of these changes in distribution is a change in the average |

0:09:25 | we define a new measure which we call ACC, which is simply the primitive of the exponents |

0:09:30 | and |

0:09:30 | this could be considered as the instantaneous average of singularity exponents |

0:09:37 | we can see the resulting functional |

0:09:39 | and it is clear that it shows |

0:09:43 | the difference in distributions more clearly |

0:09:46 | so inside each phoneme the |

0:09:48 | ACC is |

0:09:50 | more or less linear, and we note a change in |

0:09:53 | slope at the phoneme boundary |

0:09:56 | however |

0:09:56 | to develop an automatic |

0:09:58 | segmentation algorithm, a very simple method can be used, to fit a piecewise linear curve to this |

0:10:04 | ACC by minimising the mean square error |
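
The ACC and the piecewise linear fit described above can be illustrated as follows. The talk does not say how the fit is computed, so this sketch swaps in a simple greedy top-down splitting: a segment is split at the point that minimises the summed squared error of two straight-line fits, and splitting stops once a single line fits within `threshold`. The names `acc`, `line_fit_error` and `breakpoints`, and the parameters `threshold` and `min_len`, are assumptions of this sketch.

```python
import numpy as np

def acc(h):
    """Discrete primitive (cumulative sum) of the singularity exponents."""
    return np.cumsum(np.asarray(h, dtype=float))

def line_fit_error(y, lo, hi):
    """Sum of squared residuals of the least-squares line fitted to y[lo:hi]."""
    t = np.arange(hi - lo, dtype=float)
    seg = y[lo:hi]
    coeffs = np.polyfit(t, seg, 1)
    res = seg - np.polyval(coeffs, t)
    return float(res @ res)

def breakpoints(y, lo, hi, threshold, min_len=8):
    """Greedy top-down piecewise linear segmentation of y[lo:hi].

    Returns the sorted indices where a new linear piece starts.
    The exhaustive search over split points is quadratic in segment length.
    """
    if hi - lo < 2 * min_len or line_fit_error(y, lo, hi) <= threshold:
        return []
    best = min(
        range(lo + min_len, hi - min_len + 1),
        key=lambda k: line_fit_error(y, lo, k) + line_fit_error(y, k, hi),
    )
    left = breakpoints(y, lo, best, threshold, min_len)
    right = breakpoints(y, best, hi, threshold, min_len)
    return left + [best] + right
```

A typical call would be `breakpoints(acc(h), 0, len(h), threshold)`, with the returned indices serving as candidate boundaries at sample resolution.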

0:10:07 | here we have |

0:10:09 | the ACC along with the fitted curve |

0:10:12 | and we have identified the breaking points as the candidate points |

0:10:17 | see that you have a a twenty five many |

0:10:19 | most of the |

0:10:21 | boundaries with very good resolution, because |

0:10:23 | a there are the |

0:10:25 | because we do not have any windowing |

0:10:27 | problem in this method, we have |

0:10:29 | access to the highest possible resolution, which is the sampling frequency of the speech signal |

0:10:33 | so |

0:10:34 | the preliminary simulations show that this |

0:10:37 | simple method |

0:10:38 | has comparable results with the state of the art, this was presented in our previous works |

0:10:44 | and |

0:10:45 | another advantage is that it is not |

0:10:50 | sensitive to the threshold |

0:10:51 | selection, as we will see in the experimental results |

0:10:55 | but by performing an error analysis of this method we observed that |

0:11:00 | the i mean see in the |

0:11:01 | uh |

0:11:03 | that's |

0:11:04 | there are distinctive differences in the distribution of singularity exponents, but the ACC is not able to reveal |

0:11:10 | them to |

0:11:11 | identify the |

0:11:13 | boundaries |

0:11:15 | there are also points where there is no distinctive |

0:11:17 | change in the distributions, but the ACC and linear curve fitting make some mistakes |

0:11:23 | hence we try to use a |

0:11:24 | classical approach in the |

0:11:26 | detection of change |

0:11:28 | change detection, which has been widely used in segmentation, |

0:11:33 | which is a two-step procedure, first |

0:11:35 | to select a set of candidate boundaries |

0:11:38 | and then to use a hypothesis test to make the decision, to |

0:11:43 | see whether each candidate corresponds to a change in the |

0:11:47 | features or not |

0:11:50 | so for the preselection step we have two observations, first we saw that some of the missed |

0:11:55 | boundaries correspond to the |

0:11:56 | transitions from fricatives and stops to vowels |

0:12:00 | and uh |

0:12:02 | second, we saw that the easiest |

0:12:04 | transitions to detect are the transitions between |

0:12:07 | low-energy segments, silences or pauses, and phonemes, because |

0:12:11 | in silence we have |

0:12:13 | high positive values of singularity exponents, and in active parts we have |

0:12:17 | only negative values |

0:12:18 | so it is easy to |

0:12:20 | detect the change in the |

0:12:22 | slope of the ACC |

0:12:24 | hence we decided to |

0:12:26 | apply a low-pass filter to the original signal and do exactly the same, |

0:12:33 | to compute the singularity exponents and the ACC for the low-pass signal, as an example in the |

0:12:37 | left |

0:12:39 | figure you can see the ACC of the original signal, and in the right one you can |

0:12:43 | see the ACC of the low-pass filtered |

0:12:46 | have to |

0:12:47 | signal, we know that fricatives, stops and affricates are |

0:12:51 | essentially high-band signals, and low-pass filtering |

0:12:53 | turns them into low-energy |

0:12:56 | signals |

0:12:58 | you can see that in the left |

0:13:00 | figure we have some change in the |

0:13:02 | shape of the ACC, but it is not easy to detect with the |

0:13:05 | linear curve fitting, but on the right-hand side it is |

0:13:10 | much easier to detect, here is another example, again i emphasise that we have a change in |

0:13:15 | the original ACC |

0:13:16 | but it is |

0:13:17 | not easy to detect |

0:13:18 | but in the low-pass version on the right-hand side |

0:13:21 | it is really easy to detect |

0:13:24 | so as the first step we apply the MMF ACC-based algorithm |

0:13:28 | to the |

0:13:29 | signal and its low-pass filtered version |

0:13:31 | and we |

0:13:32 | gather all the breaking points as the candidates |

0:13:36 | and in the second |

0:13:37 | step we perform a |

0:13:41 | dynamic windowing |

0:13:42 | followed by a log-likelihood ratio hypothesis test to see, |

0:13:46 | for each one of the candidates, whether they actually correspond to a change in the distribution of singularity exponents or not |

0:13:51 | i emphasise that we do this on the singularity exponents of the signal itself, because we are interested |

0:13:57 | to show the strength of singularity exponents, the low-pass filtered version |

0:14:02 | does not have any real meaning, it is just some diversity that we add to our algorithm |

0:14:07 | so this is the dynamic windowing procedure, for each point |

0:14:11 | we consider three windows, X, Y and Z, |

0:14:14 | and we have two Gaussian hypotheses, |

0:14:17 | a question |

0:14:18 | and |

0:14:19 | the hypothesis H0 that the singularity exponents of Z are generated by a single |

0:14:23 | Gaussian, or |

0:14:24 | H1, that they are generated by two Gaussians on |

0:14:27 | X and Y |

0:14:28 | if the likelihood of H1 is |

0:14:31 | greater than that of H0, we take the candidate as a boundary, otherwise we remove |

0:14:36 | it from the candidate list, and then |

0:14:39 | we go to the next |

0:14:41 | three |
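
The hypothesis test described above can be sketched as a Gaussian log-likelihood ratio between two models: one Gaussian for the union of the two windows (H0) and one Gaussian per window (H1). This is an illustration under assumptions, not the speaker's implementation. With maximum-likelihood fits the ratio is never negative, so the sketch compares it to an explicit `threshold`, which is an assumption here since the talk does not give the decision rule in detail. The function names are hypothetical.

```python
import numpy as np

def gaussian_loglik(w):
    """Log-likelihood of the samples w under their maximum-likelihood Gaussian."""
    w = np.asarray(w, dtype=float)
    var = max(w.var(), 1e-12)                  # floor avoids log(0) on constant data
    return -0.5 * w.size * (np.log(2.0 * np.pi * var) + 1.0)

def llr(x, y):
    """Log-likelihood ratio of H1 (two Gaussians on x and y) against H0 (one on their union)."""
    z = np.concatenate([np.asarray(x, dtype=float), np.asarray(y, dtype=float)])
    return gaussian_loglik(x) + gaussian_loglik(y) - gaussian_loglik(z)

def keep_candidate(x, y, threshold):
    """Accept the candidate boundary between windows x and y if the LLR exceeds threshold."""
    return llr(x, y) > threshold
```

Here `x` and `y` would hold the singularity exponents in the windows on either side of a candidate, and candidates failing the test would be dropped from the list.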

0:14:42 | so |

0:14:43 | our experiments and simulations were done on the TIMIT database, on the full training part of TIMIT, which consists |

0:14:50 | of four thousand six hundred |

0:14:51 | sentences, and we have developed the |

0:14:54 | algorithm over randomly chosen files from this data |

0:14:58 | we have |

0:14:59 | tried to report all the possible performance scores, because it is difficult in the literature to compare, |

0:15:06 | we have reported all of them to simplify later comparisons |

0:15:10 | there are two categories of |

0:15:11 | scores, as partial scores we have the hit rate, which shows the |

0:15:17 | the |

0:15:18 | rate of correctly detected boundaries, the over-segmentation, which shows |

0:15:23 | how many more boundaries we have detected, and the false alarm, which shows |

0:15:26 | how much |

0:15:27 | i |

0:15:27 | how many false boundaries we have detected |

0:15:30 | the problem with these partial scores is that |

0:15:33 | they can go in opposite directions, for example an improvement in hit rate |

0:15:37 | could correspond to an increase in false alarm rate, so we cannot do a |

0:15:41 | comparison based only on the partial scores, but there are global scores |

0:15:45 | which take these partial scores simultaneously into account |

0:15:48 | for example the F1-value |

0:15:50 | takes hit rate and false alarm into account, and the R-value takes hit rate and |

0:15:54 | over-segmentation into account, |

0:15:56 | with more emphasis on the over-segmentation rate |
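
The partial and global scores mentioned above can be computed as in the sketch below. The talk does not spell out the formulas, so this uses the definitions commonly found in the segmentation literature: hit rate as matched reference boundaries over all reference boundaries, over-segmentation as the relative excess of detections, false alarm as unmatched detections over all detections, F1 as the harmonic mean of precision and hit rate, and the R-value in its usual form. The greedy one-to-one matching within the tolerance is a simplification assumed for this sketch, and the function names are hypothetical.

```python
from math import sqrt

def count_hits(ref, det, tol):
    """Greedy one-to-one matching of detections to reference boundaries within tol."""
    det = sorted(det)
    used = [False] * len(det)
    hits = 0
    for r in sorted(ref):
        best = None
        for j, d in enumerate(det):
            if used[j] or abs(d - r) > tol:
                continue
            if best is None or abs(d - r) < abs(det[best] - r):
                best = j
        if best is not None:
            used[best] = True
            hits += 1
    return hits

def segmentation_scores(ref, det, tol):
    """Hit rate, over-segmentation, false alarm, F1 and R-value, all as fractions."""
    hits = count_hits(ref, det, tol)
    hr = hits / len(ref)                        # hit rate
    os_ = len(det) / len(ref) - 1.0             # over-segmentation
    fa = (len(det) - hits) / len(det)           # false alarm rate
    precision = 1.0 - fa
    f1 = 2.0 * precision * hr / (precision + hr) if precision + hr > 0 else 0.0
    r1 = sqrt((1.0 - hr) ** 2 + os_ ** 2)       # distance from the ideal point (HR=1, OS=0)
    r2 = (-os_ + hr - 1.0) / sqrt(2.0)          # penalises over-segmentation
    r_value = 1.0 - (abs(r1) + abs(r2)) / 2.0
    return {"HR": hr, "OS": os_, "FA": fa, "F1": f1, "R": r_value}
```

Boundaries and tolerance should be given in the same unit, for example milliseconds or samples.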

0:16:00 | now the experimental results, first we can see the comparison of the |

0:16:04 | ACC-based basic algorithm and the improvement |

0:16:08 | and on the |

0:16:09 | for different tolerances |

0:16:12 | we can see that we have like |

0:16:13 | two or three percent |

0:16:15 | improvement in false alarm rate, and like |

0:16:18 | four percent in over-segmentation, and hit rates are more or less the same |

0:16:23 | and this shows the |

0:16:25 | improvement over the basic procedure |

0:16:27 | that compared |

0:16:28 | then we compare to the |

0:16:31 | a friends number so and which is the |

0:16:34 | state of the art in the literature |

0:16:36 | we can see that for the tolerance of twenty-five milliseconds we are almost the same |

0:16:41 | contrary |

0:16:42 | there is some improvement in the false alarm rate, and we have |

0:16:46 | ten percent improvement in over-segmentation |

0:16:49 | rate |

0:16:50 | more importantly, if we go to |

0:16:53 | lower tolerances, for five milliseconds, we can see that |

0:16:57 | for |

0:16:57 | all of these we have like |

0:16:59 | more than ten percent improvement in hit rate, false alarm and over-segmentation, this is because of the |

0:17:04 | high resolution of the ACC function |

0:17:08 | that's the bit ones |

0:17:10 | we do not have windowing, so we have access to the finest possible resolution |

0:17:17 | in terms of the global measures we can see |

0:17:19 | that for the lower tolerances we have more than ten percent improvement in both of the |

0:17:25 | okay |

0:17:26 | for in both of the |

0:17:27 | um |

0:17:28 | a |

0:17:29 | scores, and for twenty-five milliseconds we have like six or four percent |

0:17:34 | improvement in R-value and F1 |

0:17:37 | i mentioned that the method is not sensitive to the threshold, which is a |

0:17:42 | problem of the |

0:17:44 | so-called |

0:17:45 | so |

0:17:46 | text-independent methods of phonetic segmentation |

0:17:50 | here we |

0:17:51 | have shown the |

0:17:53 | sensitivity of the R-value to the curve-fitting threshold |

0:17:57 | we have changed the threshold by over four hundred percent, |

0:18:01 | the value of the threshold, and the |

0:18:02 | R-value has only changed by zero point five percent, this shows that |

0:18:07 | the choice of the threshold is not important at all in this algorithm |

0:18:12 | i choose a |

0:18:14 | threshold independence is an important feature |

0:18:18 | of |

0:18:20 | we have |

0:18:21 | with this study we have emphasised the strength of singularity exponents in the detection of |

0:18:26 | transition fronts in the speech signal |

0:18:31 | more importantly, the promising |

0:18:34 | and encouraging results in phonetic segmentation show the |

0:18:38 | potential of the MMF in the analysis of the local dynamics of the speech signal, hence the |

0:18:43 | future work is to use the MMF in |

0:18:46 | other domains of speech technology |

0:18:48 | and to use the |

0:18:50 | reconstruction formula or the concept of the optimal wavelet, which is ongoing research, and |

0:18:56 | result |

0:18:57 | i hope to have good results in that |

0:18:59 | from |

0:19:00 | thank you very much for your attention |

0:19:06 | right on time |

0:19:11 | we can take questions one on one, but this is officially the end of the session |

0:19:15 | oh |

0:19:16 | okay |

0:19:17 | yeah |

0:19:18 | i |