Today I would like to present our work, entitled linguistic influences on bottom-up and top-down clustering for speaker diarization. Let me give a short overview of the work. First I will give a short introduction with the motivation for this work, then continue with the formulation of the problem, then move on to a comparison of these two clustering systems, and finally illustrate our ideas with some experimental work.

As we have seen during the last RT'09 evaluation, there are basically two main approaches to speaker diarization: one is bottom-up, also called agglomerative hierarchical clustering, and the other is top-down, or divisive hierarchical clustering. We recently published, well, actually last year, a paper giving a purification process for speaker diarization systems, and we saw that it gives some consistent improvement for the top-down system. But when trying to apply it to the bottom-up system, we saw that the results were totally inconsistent. So that is the motivation of this work: to know why it does not work, which led us to have a look at the influence of linguistic variation on bottom-up and top-down.

Let's start with the formulation of the problem. Here you have an audio stream, and we want to solve the problem of who spoke when. We propose to call G the segmentation, that is, the set of boundaries at each speaker turn, and S the speaker sequence, that is, the list of the successive speakers. So "who" is S and "when" is G in this case. We can summarise this with the following equation: finding the optimum S and the optimum G as the argument of the maximum of the probability of S and G given a set of observations, in this case O, the audio stream. Just using Bayes' rule on that first equation, we get the second line you see on the screen, where the denominator can be dropped.
It does not depend on S or G, which gives equation number one there. With this equation we can see that there are two models required to solve this optimization task. The first one, needed to compute P of O given S and G, is the acoustic speaker model; it is often a GMM in the approaches currently used in the state of the art. The second model is P of S and G, which is often omitted, except maybe in some previous work and in the work that has just been presented.

Looking at this equation, we see that we have two main difficulties. First, of course, we do not know who the speakers are. Secondly, the acoustic model would, in a perfect world, depend only on the speaker, but it can depend as well on other nuisances, like the linguistic content. So for the next part of this presentation we make the following assumption: the major nuisance variation is due only to the linguistic content. What we are going to focus on is the different phonemes, and they are going to be written Q.

Considering this assumption, we can reformulate our equation, taking the speaker inventory delta, [...] the possible speaker sequences. So now we are looking for the optimum S and G plus the optimal speaker inventory delta, considering now the influence of the phonemes. We can move on to the second line on the screen, which corresponds to marginalizing the probability over the different phonemes Q, and the third line is, as just explained, obtained with Bayes' rule.

Next, for speaker diarization, we make the following assumptions. First, all the speakers are equally probable, so P of S and G can just disappear. The second assumption is that we can expect the phonemes to be independent of the speaker and independent of G.
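As a written companion to the spoken description, and only as a sketch reconstructed from it, the formulation can be set down as follows. O denotes the audio observations, S the speaker sequence, G the segmentation and Q the phonemes. The speaker inventory delta is left implicit here, and the exact form shown on the slides may differ.

```latex
\begin{align*}
(\hat{S}, \hat{G})
  &= \arg\max_{S,G} P(S, G \mid O)
   = \arg\max_{S,G} \frac{P(O \mid S, G)\, P(S, G)}{P(O)} \\
  &= \arg\max_{S,G} P(O \mid S, G)\, P(S, G) \tag{1} \\[6pt]
(\hat{S}, \hat{G})
  &= \arg\max_{S,G} \sum_{Q} P(S, G, Q \mid O) \\
  &= \arg\max_{S,G} \sum_{Q} P(O \mid S, G, Q)\, P(Q \mid S, G)\, P(S, G)
\end{align*}
```

The first assumption makes P(S, G) a constant in the maximization, and the second one acts on the factor P(Q | S, G).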
With these assumptions the phoneme term reduces to just the prior of Q. So finally we have two equations: the first for the simple approach, the second for a maybe more complete approach. Comparing both of them, which would lead to the same results in a perfect world, we see that the second equation is phone-normalized, and that in the first one we should have normalization as well. It means that P of O given S and G has to be trained while taking the different phonemes into account. To summarise, we see from this equation that the speaker inventory delta has to be optimized together with S and G, and there is no [direct] solution for that. That is the reason why we have to find the optimum with a search, the main ones being the bottom-up and top-down approaches.

If we move on to comparing these two approaches: top-down starts with just one cluster and divides it iteratively in order to get the optimum number of clusters, while bottom-up is the opposite scenario: we start with plenty of clusters and merge them iteratively. Bottom-up is by far the more popular approach; it got the best results in the RT'09 evaluation. Top-down is maybe a bit less popular but achieves competitive results; our work, for instance, shows that for the single distant microphone condition it can lead to comparable results.

But the question is: okay, we start with an initialization, it will converge to some clusters, and how can we be sure that these clusters correspond to speakers and not to other acoustic nuisances, like the phonemes? All we know is that these approaches converge to a local maximum. In a perfect world the inter-speaker variation dominates over the intra-speaker variation, and [...] we should say, okay, bottom-up and top-down should lead to exactly the same results. But of course nothing is perfect: there is as well the influence of the linguistic content, which can be very significant, and the speaker models may not be well normalized against it.
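The two search directions described above can be sketched on toy data. This is an illustration only, not the systems used in this work: the points are one-dimensional numbers instead of speech features, the distance is a plain difference of cluster means instead of a model-based criterion, and the target number of clusters is given in advance, whereas a real system has to find it. The function names `bottom_up` and `top_down` are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def bottom_up(points, n_clusters):
    """Agglomerative: start with one cluster per point, merge the two closest."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        means = [mean(c) for c in clusters]
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = abs(means[i] - means[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def top_down(points, n_clusters):
    """Divisive: start with one cluster, split the most spread-out one at its mean."""
    clusters = [list(points)]
    while len(clusters) < n_clusters:
        k = max(range(len(clusters)), key=lambda i: variance(clusters[i]))
        if variance(clusters[k]) == 0:
            break  # nothing left to split
        c = clusters.pop(k)
        m = mean(c)
        clusters.append([p for p in c if p <= m])
        clusters.append([p for p in c if p > m])
    return clusters


points = [0.0, 0.1, 0.2, 5.0, 5.1, 9.9, 10.0]
print(bottom_up(points, 3))
print(top_down(points, 3))
```

On such well-separated data both searches end in the same partition, which is the perfect-world case mentioned in the talk; on real speech they need not agree.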
When the speaker models are not well normalized, the system can converge to a local maximum which corresponds not to speaker units but to other acoustic units, like the phonemes Q.

In the case of top-down, the new speakers are introduced from a normalized background model. This model is trained with all of the available speech, so we can expect it to be well normalized, and the speakers are iteratively introduced with a large amount of data, so we can expect these new models to be quite well normalized as well. But there is a huge risk as well: besides normalizing the linguistic variation, we may normalize the speaker variation as well. That is of course what we do not want; we want to get a highly speaker-discriminative system.

By comparison, with bottom-up we have a system with some very small clusters, which can [converge to] a local maximum and be highly discriminative, so from this point of view it has an advantage. But the problem is that, as the clusters are very small at the beginning, there is a huge risk that the system converges to some other acoustic units and is less well normalized. So finally, to summarise, both systems have their own drawbacks and their own advantages with respect to speaker discrimination and normalization against linguistic nuisances.

Let's illustrate this now with some experimental work. Here is our experimental setup. We have a common speech activity detector for both systems. On the left, I think it's on the left for you as well, yes, you have the bottom-up system; it is a classical state-of-the-art system, described in the reference you can see. I am not going to spend too much time on this. On the other side you have the top-down system, a typical top-down system as well. So these are the two core systems. Next we use a purification, following the paper shown here; this is an optional step.
We will see the difference it makes later. It is followed by a MAP-based resegmentation, then a normalization of the features and a final resegmentation.

For the datasets, the development data comes from conference meetings, from the NIST RT'04, '05 and '06 evaluations. For the evaluation sets we propose to use the RT'07 and RT'09 datasets, as well as a set of TV shows, [corpus name unclear], which is a corpus of TV shows.

Here are now the diarization results. The first column is the score with overlapped speech and the second is the score without it. Of course, as our systems do not process the overlapped speech, we just focus on the second one. Just looking at the results without the purification step, we see that, first, both of the systems give [...], while the results are much worse for the TV shows. Second, it is not always the same system which provides the best results across the datasets: for example, for RT'07 top-down gives better results, while for RT'09 it is the bottom-up.

We can also consider the results with purification. We can see that purification can give a degradation in performance for the bottom-up system, while for the top-down system it always improves the system. So the question is: okay, maybe purification improves the discrimination between clusters [for top-down], whereas the bottom-up clusters are less well normalized against phones, so that in this case the purification is useless?

To explain this a bit more clearly, we look at the cluster purity. We propose to look at all the clusters output by each of the systems and compute the purity for all of these clusters. The purity is computed like this: we take the time, in seconds, of the most represented speaker in the cluster and divide it by the total number of seconds of the cluster.
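The purity measure just described can be sketched in a few lines. This is a minimal illustration under assumptions: the input is taken to be a list of (cluster, reference speaker, duration in seconds) triples, "most represented speaker" is one reading of the spoken definition, and the function name `cluster_purity` is hypothetical.

```python
from collections import defaultdict


def cluster_purity(segments):
    """Return {cluster_id: purity}.

    purity = time of the most represented speaker in the cluster
             / total time of the cluster
    """
    # time_per_speaker[cluster][speaker] accumulates seconds of speech
    time_per_speaker = defaultdict(lambda: defaultdict(float))
    for cluster_id, speaker, duration in segments:
        time_per_speaker[cluster_id][speaker] += duration
    return {
        cluster_id: max(times.values()) / sum(times.values())
        for cluster_id, times in time_per_speaker.items()
    }


# Toy input: cluster c1 mixes two speakers, cluster c2 holds only one.
segments = [
    ("c1", "A", 30.0),
    ("c1", "B", 10.0),
    ("c2", "B", 20.0),
]
print(cluster_purity(segments))
```

A purity of 1.0 means the cluster holds a single speaker; it says nothing by itself about whether that speaker is spread over several clusters, which is why the number of clusters is looked at alongside it.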
With this measure there are different situations. If we have a high purity and a small number of clusters, we can expect the system to be likely to have converged to speaker units. With a very high number of clusters, it is likely that the system has converged to other acoustic units. [...] it is harder to say what happened, different scenarios are possible, and the same for the last case.

Looking at the purity and the number of clusters in the table, we see that when we do not use the purification process, top-down compared with bottom-up has more or less the same purity, but top-down has fewer clusters. On the right is the number of clusters: top-down has fewer clusters than bottom-up, and bottom-up has more clusters than the ideal number of clusters from the ground truth. So we can expect top-down to be in the first situation, having converged to speakers, whereas bottom-up is probably in another case.

Now let's see what happens with the purification. First, for top-down, the purity is improved [...], so for sure it helps the system to converge to speakers more than without purification. [For bottom-up] there is a consistent increase in purity, [but there are still more] clusters than for top-down, so it is not obvious to say in which situation we are.

The last part of this experimental work looks at the phone normalization. In this case we take all the clusters for a system, and for each of these clusters we draw a histogram of the different phonemes. We do this for all the clusters generated by the top-down system, and the same for the bottom-up. For all of them we compute the inter-cluster distance between the histograms, which is the Kullback-Leibler distance, and then we take the average of all these distances for each of the systems. How small or large this average is depends on the phoneme distribution in the clusters.
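The phone-normalization measure described above can be sketched as follows. This is an illustration under assumptions, not the exact procedure of the experiments: each cluster is represented as a list of phoneme labels, the histograms are smoothed with a small constant so that the divergence stays finite, and the symmetric form of the Kullback-Leibler divergence is used because the talk calls it a distance. All function names are hypothetical.

```python
import math
from itertools import combinations


def phone_histogram(phone_labels, phone_set, eps=1e-6):
    """Normalized phoneme histogram of one cluster, with additive smoothing."""
    counts = [phone_labels.count(p) + eps for p in phone_set]
    total = sum(counts)
    return [c / total for c in counts]


def symmetric_kl(p, q):
    """KL(p || q) + KL(q || p) for two discrete distributions."""
    return sum(
        a * math.log(a / b) + b * math.log(b / a)
        for a, b in zip(p, q)
    )


def average_intercluster_distance(clusters, phone_set):
    """Average pairwise distance between the phoneme histograms of clusters.

    Needs at least two clusters.
    """
    hists = [phone_histogram(c, phone_set) for c in clusters]
    dists = [symmetric_kl(a, b) for a, b in combinations(hists, 2)]
    return sum(dists) / len(dists)


phones = ["a", "b"]
# Both clusters contain the same mix of phonemes.
balanced = [["a", "b", "a", "b"], ["b", "a", "b", "a"]]
# Each cluster is dominated by a different phoneme.
skewed = [["a", "a", "a", "b"], ["b", "b", "b", "a"]]
print(average_intercluster_distance(balanced, phones))
print(average_intercluster_distance(skewed, phones))
```

The value for the balanced clusters is lower than for the skewed ones, which is the direction of the comparison made between the two systems.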
This average distance should be small when there is a similar distribution of the different phonemes in the different clusters, which means that the system is well normalized. We can expect it to be high when there is a higher degree of convergence to phonemes, so that the distributions are not equal in the different clusters.

Here are the results. First, without the purification step, the higher value is for bottom-up, which really shows that in the top-down case the clusters are better normalized. Applying now the purification, we see that there is an improvement for both of the systems, but bottom-up plus purification is still much higher than top-down plus purification, which might explain the behaviour of purification on the bottom-up system.

So, to conclude, we have seen in these slides that both approaches, bottom-up and top-down, give comparable results but have different behaviours. Bottom-up is more discriminative but often suffers from clusters which are less normalized against linguistic content, while top-down suffers from clusters which are better normalized but less speaker-discriminative. I think one of the conclusions of this work is that there is a good thing to gain from a combination of these two approaches. We recently published a bit on that, but I think there are a lot of other things to try. As future work, we can maybe design a specific purification process for bottom-up, taking into consideration this linguistic influence, which is quite particular to this approach. [...] And that's it, thanks.

Okay, any questions? Okay, just a quick question. [Question largely unintelligible.]

What I would like to say is that it is the core of these two approaches which [matters]; the purification was just a motivation which led to this work. The core of the bottom-up and the core of the top-down act differently with respect to the linguistic influence.
In this case that means the phoneme content of the speech. [...]