Hi everybody. I'm presenting the work on the use of GSV-SVM for speaker diarization and tracking, which was carried out by my colleague [unintelligible] in collaboration with [unintelligible]. After presenting the tasks and the motivation, I will describe the two tasks that are explored, acoustic speaker diarization and speaker tracking, along with the systems used, then the results obtained, before conclusions and perspectives.

First, about the two tasks that we consider. Acoustic speaker diarization is a "who spoke when" task combining segmentation and clustering, as was already described in the previous talk. We consider it as a preprocessing for automatic speech recognition and for rich transcription. In this situation we have no a priori information on the number of speakers or on the speakers' voices, and we consider here only acoustically driven approaches, because there are also approaches making linguistic use of the transcription.

But we are also interested in speaker tracking, where we want to detect the regions of spoken documents that are uttered by a given speaker. In this situation we have a list of the speakers to track, and we are provided with training data for these speakers. We consider the speaker tracking task as a combination of acoustic speaker diarization plus a speaker verification module in our configuration.

Our motivation in this work was to include in our systems the SVM techniques that, as you all know, have become very successful in speaker recognition. We started with the GSV, Gaussian supervector, system, which was an easy framework to develop and which also opens the way to the other features that can be used in SVM systems for speaker recognition, such as MLLR, CMLLR and lattice MLLR, which are also efficient to combine
with the other approaches.

I also want to say a word about the context of the Quaero program, which motivated our work. It is a French-funded research and innovation program that aims to improve automatic multimedia document structuring and indexing, and for this work we wanted to work specifically on speaker diarization and tracking for broadcast data. That is why we are still on offline diarization systems, because we are working on archives of broadcast data that are recorded and fetched from the web, the radio or TV, which also allows easy integration of [unintelligible].

We worked on the data of the French ESTER evaluation. I hope that the data will soon be available to the whole community; for now it has been distributed to the participants of this evaluation in two thousand eight. There are one hundred [unintelligible] target speakers, for which we have about one [unintelligible] on average as training data. The data consist of French-speaking radio shows from different sources, French ones but also African ones. For the impostors we have the data of ESTER 1, the previous evaluation, with about four hundred impostors. The ESTER 2 development data consist of twenty radio shows for a total of six hours, and the evaluation data are roughly the same amount, twenty-six radio shows for seven hours.

I also provide some statistics on the number of speakers, the speaking length and the segment length on the ESTER 2 development and evaluation data sets. On the development set we have between nine and twenty-five speakers per show, with a mean of [unintelligible], and the evaluation set is roughly alike [unintelligible]. The speaking length also varies a lot, with a mean of sixty-five seconds, ranging between half a second and
more than ten minutes. On the evaluation set it is a bit higher, with an average of [unintelligible] seconds, but you can see that the standard deviation is very, very high, so this is just to give a rough idea. The segments last on average sixteen to seventeen seconds on the development and the evaluation sets, also ranging from a fraction of a second to [unintelligible].

I will now describe the acoustic speaker diarization system, which is basically the system that was recapped in the previous talk, and which was developed by [unintelligible], myself and [unintelligible] for the NIST two thousand four evaluation. Basically it is a multistage diarization system.

The initial segmentation uses a front end with standard MFCC features, as found in ASR systems. The speech activity detection relies on Viterbi decoding with several GMMs of speech, music and noise. The speech segments are then cut into small segments, using two adjacent sliding windows of [unintelligible] and a local divergence measure to place the boundaries. After that, GMMs are trained on the segments and there is a resegmentation of the data. This is the initial segmentation.

Then we have the first step of agglomerative clustering, using the classical BIC criterion with a full covariance matrix on a single Gaussian. The only specific thing is the penalty, which is the local BIC penalty, taking into account only the amount of data of the two clusters that are being merged and not of the whole data.

We put the output of the BIC clustering into a second step of speaker ID modeling and clustering, with slightly different features, using feature warping and MAP-adapting the UBM. The clustering relies on the cross log-likelihood ratio between the two clusters.

So what we did, and it was rather simple stuff, was looking at the GSV-SVM system
and integrating it into the system in place of the last SID clustering stage. I think I will skip quickly over this, since it was presented before: the GSV consists of the supervector of the means of the adapted GMM.

Combining diarization systems can improve on the individual systems. There are several ways of doing this combination: piping one system into the other one, which is the kind of thing that we already do in our system; we can also merge different systems, or use a cluster voting technique. We did a fusion at the score level, which means that during the clustering process we compute an average score between the GSV-SVM and the GMM-UBM scores, with a weight that is optimized on the development data.

The performance measure is the diarization error rate. It was already described in the previous talk, so I won't go too much into it again. Just to say that we also computed the cluster purity and coverage, which are the ratio of the dominant reference speaker within a cluster and, conversely, [unintelligible], and which can provide a better insight into the speaker errors. We used the NIST tool for scoring, following the ESTER 2 evaluation plan.

Here is a figure of the diarization error rate for our systems. To the left we have the performance of the GMM-UBM system, to the right the GSV-SVM system, and on the X axis the different combination weights. The green curve is for the evaluation set and the red curve for the development set. We see that the GSV-SVM performs better than the GMM-UBM, so we have a bit of a gain with this system, but the combination is very successful here. In more detail, what we get is a ten percent relative improvement, from eleven point one to ten point one, from the best performing system to the GMM-UBM plus SVM system on the
development set, and on the evaluation set we also see the error rate going down, from nine point six to eight point three. That was for the acoustic speaker diarization system.

Now some words about speaker tracking. As I said, we view this system as a combination of the acoustic speaker diarization system, to the left, with a speaker verification system, and we have three possible ways of doing this combination. We can do the speaker verification on the initial segments of the system, or on the clusters output by the BIC step or by the SID clustering step. In each case the segments are then compared to the speaker models and labeled accordingly.

A few words, quickly, on the systems. For the tracking system we use GMM-UBM and GSV-SVM systems that have the same properties as the systems that we have already presented for the diarization. For the verification, we choose the target model with the highest likelihood ratio in the verification phase. The GSV-SVM follows the same architecture, with the constraint that we choose the impostors and targets using the gender and channel matching the current condition. We also perform a weighted average of the scores, so a score-level system fusion.

The performance measures for the tracking task were as defined in the ESTER 2 evaluation campaign: recall and precision, and an F-measure combining both recall and precision, in a time-weighted manner, and also a speaker-weighted version that was proposed during the campaign, the F-measure per speaker.

We also have some results on DET curves, which were simulated by using short segments of the evaluation data for all the different possible [unintelligible]. So here, on the evaluation data, for both the GMM-UBM and the GSV-SVM systems, we show with the red,
the green and the blue curves different versions of the GMM-UBM system, with the verification applied either at the output of the initial segmentation, in blue, at the output of the BIC stage, in red, or at the output of the SID stage, in green. It appears that there are not that many differences, and we have a slightly better performance by using the output of the final stage, which is the SID stage. The GSV-SVM system is only shown here on the output of the SID clustering step, and it is shown to perform much better than the GMM-UBM.

Here are some figures. The table provides the recall, the precision, the F-measure, and the F-measure averaged by speaker. I will mainly focus on the F-measure column of the results obtained on the dev and on the eval sets.

On the development set we observe that the SID clustering step provides better performance than the other conditions, and a performance comparable to that of the GSV-SVM system, and the combination improves upon both systems. On the evaluation data set, the GSV-SVM is performing much better than the GMM-UBM system, and in this case the combination slightly outperforms the GSV-SVM and is better than the GMM-UBM system.

So that was, I would say, a simple experimental framework that was done to integrate the GSV-SVM system into the speaker diarization and tracking systems. The GSV-SVM provided good performance compared to the existing standard GMM-UBM that we had, and the score-level fusion was satisfactory.

There are some caveats, for example in the impostor set, which is not very balanced according to gender and channel. There are some very small sets, for example for female narrowband data, where we have very few impostors for the
experiments. And of course we want to go further with other SVM features like MLLR, CMLLR and lattice MLLR, and also with the very interesting directions that were presented in the previous talk. Thank you.

Thank you for that talk. You are using delta and double-delta features, and some others have found that eliminating the deltas helps. So did you try eliminating the deltas?

Yes. For example, in the first stage, on the initial segmentation, we use the delta and the delta-delta. On the BIC stage we use full covariance matrices with no deltas at all, only the static features, and on the second stage we use only the delta, not the delta-delta. One rationale is trying to have different feature representations, to combine different aspects, different flavors. I'm not sure that it is optimal this way, because we did not test all configurations; it was one way of doing it. But I agree that it is not clearly convincing that they always bring something.

Contrary to what people say, we observed that when the data is very clean, when we make some recordings in an acoustic room, the deltas did give some gain, but on data from [unintelligible] you will never see any gain from the deltas. Thank you.

On slide fourteen, or forty? Slide forty, okay. Can you explain the results on the evaluation data? Was there a difference between the databases?

The databases were not recorded on the same dates, and they have a slightly different balance between the sources of data. There are some data coming from French sources and some from African ones [unintelligible], and the balance between the dev and the eval is slightly different, and I think it explains some
of the results. And also the fact that, well, I think that for the diarization system [unintelligible] twenty or twenty-six shows, well, it's not that much.

And then when you do speaker tracking, sometimes you have a speaker who speaks a lot, so his model is fine, and sometimes a speaker has only very little speech [unintelligible]. So how do you handle that in this case?

On the speaker tracking it is a verification, where the score is compared to a threshold. So there is only the normalization by the length, but for the threshold there is no normalization according to the length of the data. I agree that it is something that needs to be addressed.
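
The score-level fusion described in the talk, a weighted average of the GMM-UBM and GSV-SVM scores with the weight optimized on development data, can be sketched roughly as below. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the toy score pairs, the zero decision threshold and the simple error count used as the tuning objective are all hypothetical (the talk tunes the weight against the diarization error rate on the development set).

```python
from typing import Callable, Iterable, List, Tuple


def fuse_scores(gmm_ubm: float, gsv_svm: float, weight: float) -> float:
    """Weighted average of two system scores.

    `weight` is the share given to the GSV-SVM score; 0 keeps only the
    GMM-UBM score and 1 keeps only the GSV-SVM score.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return (1.0 - weight) * gmm_ubm + weight * gsv_svm


def pick_weight(candidates: Iterable[float],
                dev_error: Callable[[float], float]) -> float:
    """Return the candidate weight with the lowest development-set error."""
    return min(candidates, key=dev_error)


# Hypothetical development data: (GMM-UBM score, GSV-SVM score, same speaker?).
# Scores are assumed to be already normalized to a comparable range.
dev_pairs: List[Tuple[float, float, bool]] = [
    (0.4, -0.2, True),
    (-0.3, 0.5, True),
    (0.2, -0.6, False),
    (-0.5, 0.1, False),
]


def dev_error(weight: float) -> int:
    """Toy objective: number of wrong same/different decisions at threshold 0."""
    return sum(
        (fuse_scores(g, s, weight) > 0.0) != same
        for g, s, same in dev_pairs
    )


best_weight = pick_weight([0.0, 0.5, 1.0], dev_error)
print("selected fusion weight:", best_weight)
```

On this toy data each system alone makes two wrong decisions while the equal-weight average makes none, which mirrors the qualitative point of the talk that the two score types can be complementary. In practice the scores of the two systems would need to be brought to comparable ranges before averaging.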