0:00:13 Thank you.
0:00:18 The subject of my talk is to introduce an improved method for text-independent phonetic segmentation, based on the Microcanonical Multiscale Formalism.
0:00:28 In brief, I will first focus on why we view speech as a complex signal in the physical sense, that is to say, as the realisation of a complex system. After that, I will introduce a formalism from the study of complex systems which might be a powerful tool to characterise the nonlinear character of the speech signal. This is called the Microcanonical Multiscale Formalism, or MMF, and I will show the general potential of the MMF to be applied to speech analysis. Then I will turn to one application of this formalism, the phonetic segmentation of the speech signal, and I will introduce a basic and an improved method for segmentation. Finally, I will take some time to present experimental results and to conclude.
0:01:16 It has been theoretically and experimentally established that there exist nonlinear phenomena in the production process of the speech signal; for example, the Reynolds number, which characterises different flow regimes, is reported to reach values of several thousand, which corresponds to turbulent flow. However, as we know, most of the methods in speech processing are based on the linear source-filter model, which cannot adequately take into account the nonlinear character of the speech signal. Hence, the goal here is to find and evaluate key parameters which are responsible for the complex character of the speech signal. Previous studies have shown that such parameters do exist, but they are very hard to estimate.
0:02:02 Our strategy is to take the knowledge coming from statistical physics and to relate the complexity to the predictability of each point inside the signal, and in practice to develop computationally efficient tools to estimate these parameters, if they exist, and to use them for practical applications, segmentation being an important one.
0:02:27 In the study of complex systems, the first phase started in the late forties with the classical work of Kolmogorov, which was the basis for the later progress in this domain, based on the study of structure functions. The main result of these methods is to recognise globally the existence of a multiscale structure, without giving access to its local organisation. Because they rely on statistical averages and stationarity assumptions, they can be used to decide whether a system is complex or not, but they do not give much more information.
0:03:06 In the second phase, one tries to determine geometrically, inside the signal, where the complexity happens and how it develops. In more precise terms, we try to find a subset inside the signal which has the highest information content, and we try to explain how the transfer of information between the different scales organises itself.
0:03:30 Such methods have been made possible by the approaches of statistical physics to the study of highly nonlinear systems, and in particular by the study of the notion of transitions inside complex systems. It has been shown that a geometric multiscale organisation is responsible for the complexity inside a signal; a typical example is the cascade of energy in fully developed turbulence. Its fingerprint in practice is the existence of a power-law behaviour in the temporal correlation function, which has to be evaluated, without any stationarity assumption, at each point inside the signal. The exponents related to this power law, as we will see shortly, are called singularity exponents, and it can be shown that they completely explain the organisation of the multiscale structures.
0:04:23 An example of this is the classical canonical formalism used in the study of multifractal signals, which was the first attempt to attach singularity exponents, as a global property of the signal, to what is called the Legendre spectrum. In this equation we have a complex signal s, a multiresolution function Gamma_r operating at the scale r, and the brackets stand for the expectation over a statistical ensemble. The exponents of these power laws, tau(p), can be related to the distribution of singularity exponents through a Legendre transform, but the main problem is that this is a global description: it does not give access to the local dynamics of the signal.
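The equation referred to on the slide is not reproduced in the transcript; a standard form of the canonical (structure-function) description it points to is, roughly:

```latex
% Canonical multifractal formalism (standard textbook form; the exact
% slide notation may differ):
\bigl\langle \lvert \Gamma_r(s) \rvert^{p} \bigr\rangle \;\propto\; r^{\tau(p)}, \qquad r \to 0,
\qquad\text{with}\qquad
D(h) \;=\; \min_{p}\bigl(\, p\,h - \tau(p) + 1 \,\bigr)
```

Here the angle brackets denote the ensemble expectation and D(h), obtained from tau(p) by a Legendre transform, is the (global) singularity spectrum.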
0:05:15 In the microcanonical formalism, instead of relying on statistical ensembles, we try to associate to each geometric location in the signal, the time index t here, a local power law of the multiresolution function Gamma_r. The exponent h(t) of this power law is called the singularity exponent, and it can be estimated precisely at each point, giving a view of the transition fronts of the signal.
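In standard MMF notation (reconstructed from the description, since the slide is not in the transcript), the local power law reads:

```latex
% Local (microcanonical) power law defining the singularity exponent h(t):
\Gamma_r(s)(t) \;=\; \alpha(t)\, r^{\,h(t)} \;+\; o\!\left(r^{\,h(t)}\right), \qquad r \to 0
```

The exponent h(t) is obtained from the behaviour of Gamma_r(s)(t) across scales at each time index t.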
0:05:51 The main difficulty is the precise estimation of these exponents, and in this regard one of the crucial choices is the choice of the functional Gamma_r. For example, we could simply use the linear increments, but it has been shown that this does not give a precise estimation of h(t), because of the instability and noise sensitivity of these linear increments.
0:06:17 A better choice for Gamma_r has turned out to be the gradient-modulus measure, which is defined as the integral of the modulus of the signal's variations over the ball B_r(t) in this equation, normalised by a robust measure defined from a typical characterisation of the kinetic energy. It has been shown that this measure is related to the information content of each point when we use it for the calculation of h(t).
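As an illustration only (the exact measure, normalisation and scale set are on the slides, not in the transcript), a minimal sketch of estimating h(t) from a gradient-modulus-style multiresolution measure via a log-log regression over scales could look like this; the dyadic scales and the crude global normalisation are assumptions:

```python
import numpy as np

def singularity_exponents(s, scales=(1, 2, 4, 8, 16)):
    """Rough per-sample estimate of h(t): slope of log-measure vs log-scale,
    using a gradient-modulus measure integrated over balls B_r(t)."""
    grad = np.abs(np.diff(s, prepend=s[0]))           # modulus of signal variations
    norm = grad.mean() + 1e-12                         # crude global normalisation
    log_r = np.log(np.asarray(scales, dtype=float))
    log_m = []
    for r in scales:
        kernel = np.ones(2 * r + 1)
        measure = np.convolve(grad, kernel, mode="same") / norm   # integral over B_r(t)
        log_m.append(np.log(measure + 1e-12))
    log_m = np.stack(log_m)                            # shape: (n_scales, n_samples)
    # least-squares slope at every sample: log-measure ~ h(t) * log-scale + c
    A = np.vstack([log_r, np.ones_like(log_r)]).T
    coeffs, _, _, _ = np.linalg.lstsq(A, log_m, rcond=None)
    return coeffs[0]                                   # slope = h(t) estimate
```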
0:06:50 If we can obtain a good estimate of h(t), we can then extract a very important subset of the signal, called the Most Singular Manifold. It corresponds to the points of the signal which have the lowest values of singularity exponents. It has been shown that the lower the value of the singularity exponent, the higher the information content at the given point, so the critical transitions of the signal are happening at these points. A reconstruction formula has been proposed, and it has been shown in many applications that we can reconstruct the whole signal having access only to this small subset of the data. This justifies the importance of the singularity exponents. Having said that, we can turn to see how they can be applied to the speech signal.
0:07:39 Previously we presented the estimation procedure of h(t) for the speech signal, and we showed that we can obtain a good estimate of h(t) for the majority of points in the speech signal. Here we have a speech signal extracted from the TIMIT database, with vertical red lines showing the phoneme boundaries taken from the manual transcriptions provided with TIMIT. Of course, the objective of text-independent phonetic segmentation is to identify these phoneme boundaries within a given tolerance window.
0:08:12 Since different phonemes have, as we know, different statistical properties, we expect the singularity exponents to have different behaviours. To show this, we studied the time evolution of the distribution of singularity exponents: over windows of length thirty milliseconds we compute the histogram of the exponents, and we plot its evolution over time. In this graphical representation, which is the histogram of singularity exponents conditioned on time, we can easily notice a remarkable change in the distribution of singularity exponents between different phonemes. This has been extensively verified over different speech signals.
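A minimal sketch of that representation, given per-sample exponents h and the 30 ms windows mentioned in the talk (the hop size and bin count are assumptions):

```python
import numpy as np

def exponent_histogram_over_time(h, fs, win_ms=30, hop_ms=10, n_bins=40):
    """Histogram of singularity exponents conditioned on time: one normalised
    histogram per 30 ms window (hop size and bin count are illustrative)."""
    win = int(fs * win_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    edges = np.linspace(h.min(), h.max(), n_bins + 1)
    hists = []
    for start in range(0, len(h) - win + 1, hop):
        counts, _ = np.histogram(h[start:start + win], bins=edges, density=True)
        hists.append(counts)
    return np.array(hists), edges   # rows: time windows, columns: exponent bins
```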
0:09:05 But the problem is that we cannot use this graphical representation directly to develop an automatic segmentation algorithm. To provide a quantity that is easier to use in an automatic algorithm, we note that the easiest interpretation of this change in distribution is a change in the average. We define a new measure, which we call the ACC; it is simply the primitive of the exponents, and it can be considered as the instantaneous average of the singularity exponents. We can see the resulting functional, and it is clear that it shows the difference in distributions more clearly: inside each phoneme the ACC is more or less linear, and we observe a change of slope at each phoneme boundary.
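The ACC as described is just the primitive (running sum) of the exponents; a minimal sketch, with the running-average normalisation left optional because the talk does not specify it:

```python
import numpy as np

def acc(h, normalize=False):
    """ACC functional: cumulative sum of the singularity exponents;
    with normalize=True it becomes a running (instantaneous) average.
    The exact normalisation used in the original work is not stated here."""
    c = np.cumsum(h)
    if normalize:
        c = c / np.arange(1, len(h) + 1)
    return c
```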
0:09:56 To develop an automatic segmentation algorithm, a very simple method is to fit a piecewise linear curve to this ACC by minimising the mean squared error. We go along the signal with this fitting and we take the breaking points as candidate boundary points. You can see that, within a tolerance of twenty-five milliseconds, most of the boundaries are detected with very good resolution. Because there is no windowing involved, we have access to the highest possible resolution, which is the sampling frequency of the speech signal.
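One way to realise this idea is a greedy fit: extend the current linear segment along the ACC and declare a breaking point whenever the least-squares line no longer fits. The greedy strategy, the minimum segment length and the error threshold below are assumptions; the talk only specifies MSE-based piecewise linear fitting:

```python
import numpy as np

def piecewise_linear_breakpoints(acc_curve, min_len=80, mse_thresh=1e-3):
    """Greedy piecewise-linear fit of the ACC: extend the current segment
    while a straight line still fits well; emit a breaking point otherwise."""
    breakpoints, start, n = [], 0, len(acc_curve)
    end = start + min_len
    while end < n:
        x = np.arange(start, end)
        y = acc_curve[start:end]
        slope, intercept = np.polyfit(x, y, 1)
        mse = np.mean((y - (slope * x + intercept)) ** 2)
        if mse > mse_thresh:                 # line no longer fits: boundary candidate
            breakpoints.append(end)
            start, end = end, end + min_len
        else:
            end += 1
    return breakpoints
```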
0:10:33 The preliminary simulations show that this simple method has results comparable with the state of the art presented in previous works. Another advantage is that it is not sensitive to the threshold selection, as we will see in the experimental results.
0:10:55 However, by performing an error analysis of this method, we observed two kinds of errors. At some of the missed boundaries there is indeed a distinct difference in the distribution of singularity exponents, but the ACC is not able to reveal it, so the boundary is not identified. At other points there is no distinctive change in the distributions, yet the ACC and the linear curve fitting still make some mistakes. Hence we tried a classical approach to change detection, which has been widely used in segmentation: a two-step procedure in which we first select a set of candidate boundaries, and then apply a hypothesis test to decide whether each candidate actually corresponds to a change in the distribution of the underlying features or not.
0:11:50 For the candidate selection we have two observations. First, we observed that some of the missed boundaries correspond to transitions involving fricatives and stops. Second, the easiest transitions to detect are the transitions between silences or pauses and phonemes, because in silence we have mostly positive values of singularity exponents while in the active parts we have mostly negative values, so it is easy to detect the change in the slope of the ACC.
0:12:24 Hence we chose to apply a low-pass filter to the original signal and to do exactly the same thing: compute the singularity exponents and the ACC for the low-pass signal. As an example, in the left figure you can see the ACC of the original signal, and on the right the ACC of the low-pass-filtered signal. We know that fricatives, stops and affricates are essentially high-band signals, so low-pass filtering turns them into low-energy segments. You can see that in the left figure there is some change in the shape of the ACC, but it is not easy to detect with the linear curve fitting, while on the right-hand side it is much easier to detect. Here is another example: again there is a change in the original ACC, but it is not easy to detect, whereas in the low-pass version on the right-hand side it is really easy to detect.
0:13:24 So, in the first step, we apply the ACC-based detection to the signal and to its low-pass-filtered version, and we pool all the breaking points as candidates. In the second step, we perform a dynamic windowing followed by a log-likelihood ratio hypothesis test to decide, for each of the candidates, whether it actually corresponds to a change in the distribution of singularity exponents or not. I emphasise that we do this on the singularity exponents of the original signal itself, because we are interested in showing the strength of the singularity exponents; the low-pass-filtered version does not have any real meaning by itself, it is only used to add some diversity to the candidates.
0:14:07 This is the dynamic windowing procedure: for each candidate point we consider a window around it and we pose two hypotheses. The first hypothesis is that the singularity exponents inside the window are generated by a single Gaussian; the second is that they are generated by two Gaussians, one on each side of the candidate. If the likelihood for the two-Gaussian hypothesis is greater, we keep the candidate as a boundary; otherwise we remove it from the candidate list, and then we go to the next candidate.
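A minimal sketch of this log-likelihood ratio test on the exponents inside a window around a candidate; the half-window length and the decision margin are assumptions, and a fixed window is used instead of the dynamic windowing described in the talk:

```python
import numpy as np

def gaussian_loglik(x):
    """Log-likelihood of samples x under a single ML-fitted Gaussian."""
    mu, var = x.mean(), x.var() + 1e-12
    return -0.5 * len(x) * (np.log(2 * np.pi * var) + 1.0)

def is_boundary(h, candidate, half_win=200, margin=0.0):
    """Keep the candidate if two Gaussians (one per side) explain the
    exponents in the window better than a single Gaussian."""
    left = h[max(0, candidate - half_win):candidate]
    right = h[candidate:candidate + half_win]
    both = np.concatenate([left, right])
    llr = (gaussian_loglik(left) + gaussian_loglik(right)) - gaussian_loglik(both)
    return llr > margin
```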
0:14:42 Our simulations were done on TIMIT, on the full training set, which consists of four thousand six hundred sentences; the method was developed on randomly chosen files from this data. We have tried to report all the usual performance measures, because comparison is difficult in the literature, so reporting all of them simplifies later comparisons.
0:15:10 There are two categories of scores. The partial scores are the hit rate, which shows the rate of correctly detected boundaries; the over-segmentation, which shows how many more boundaries we have detected than the reference; and the false alarm rate, which shows how many false detections we have made. The problem with these partial scores is that they can go in opposite directions: for example, an improvement in hit rate could correspond to an increase in the false alarm rate, so we cannot judge based only on the partial scores. The global scores combine the partial scores into a single number: for example, the F-value takes the hit rate and the false alarm rate into account, while the R-value takes the hit rate and the over-segmentation rate into account.
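For reference, a sketch of these global scores; the R-value below follows the definition commonly used in the segmentation literature (Räsänen et al.), and the precision used for the F-value is taken as 100 minus the false alarm rate, both of which are assumptions since the talk does not spell out the formulas:

```python
import numpy as np

def global_scores(hit_rate, false_alarm, over_segmentation):
    """Combine partial scores (all in percent) into global scores.
    Formulas follow common usage in the literature, not the talk itself."""
    precision = 100.0 - false_alarm
    f_value = 2 * precision * hit_rate / (precision + hit_rate + 1e-12)
    r1 = np.sqrt((100.0 - hit_rate) ** 2 + over_segmentation ** 2)
    r2 = (-over_segmentation + hit_rate - 100.0) / np.sqrt(2.0)
    r_value = 1.0 - (abs(r1) + abs(r2)) / 200.0
    return f_value, r_value
```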
0:16:00 Turning to the experimental results: first we compare the basic ACC method with the improved one, for different styles of utterances. We can see that we have around two or three per cent improvement in the false alarm rate and around four per cent in over-segmentation, while the hit rates are more or less the same; this shows the improvement brought by the second-step procedure.
0:16:27 Then we compared with the reference method, which is the state of the art in the literature. We can see that for a tolerance of twenty-five milliseconds the hit rates were almost the same; on the other hand, we have an improvement in the false alarm rate and a ten per cent improvement in the over-segmentation rate.
0:16:50 More importantly, even if we go to low tolerances such as five milliseconds, we can see that we have more than ten per cent improvement in hit rate, false alarm and over-segmentation. This is because of the high resolution of the ACC functional: since there is no windowing, we have access to the finest possible resolution.
0:17:17 In terms of the global measures, we can see that for the lower tolerances we have more than ten per cent improvement in both of the scores, and for twenty-five milliseconds we have around four to six per cent improvement in the R-value and the F-value.
0:17:37 Finally, as I mentioned, the method is not sensitive to the threshold, which is a problem of the classical text-independent methods of phonetic segmentation. To show this, we studied the sensitivity of the results to the curve-fitting threshold: we changed the value of the threshold by over four hundred per cent, and the R-value changed only by about zero point five per cent. This shows that the choice of the threshold is not important at all in this algorithm, which is an important feature for a text-independent method.
0:18:20 To conclude: in this work we have shown, and emphasised, the strength of singularity exponents in the detection of transition fronts in the speech signal. More importantly, the encouraging results in phonetic segmentation show the potential of the MMF in the analysis of the local dynamics of the speech signal. Hence, our future work is to use the MMF in other domains of speech technology, and also to make use of the reconstruction formula, which is ongoing research for which we hope to have good results soon. Thank you very much for your attention.
0:19:06 (Chair) Right on time. We can take one or two questions, but this is officially the end of the session.