Thank you for coming. We are from the University of [inaudible], and this presentation is about variability compensation for speaker segmentation in two-speaker telephone conversations. We also present a technique to generate several segmentation hypotheses for a given recording and to select the best hypothesis among them.

This work is focused on the segmentation of two-speaker conversations, so it is a speaker diarization problem: we are answering the question "who spoke when?". It is an easier task than general diarization, since the number of speakers is known and limited to two. In this case, once we find the boundaries between the speakers, the diarization problem is solved, so we can treat it as a segmentation problem.

Recent advances in the field of speaker verification have motivated new approaches for the segmentation of two-speaker conversations, many of them based on factor analysis using eigenvoices. In these approaches the speaker model, a GMM supervector, can be represented by a small-dimensionality vector that we call the speaker factors, whose dimension is much lower than that of the GMM supervector. The main idea is that, with such a compact speaker representation, we can estimate the parameters of the representation on very short segments, and that is what we exploit for speaker segmentation.

We extract a stream of speaker factors over the input signal: frame by frame, over a one-second sliding window, we extract a sequence of speaker-factor vectors. Then we cluster these speaker factors into two clusters using PCA plus k-means clustering. Once we have the two clusters, we fit a single full-covariance Gaussian per cluster, that is, per speaker, and with a Viterbi decoding we obtain a first segmentation output. Finally, we refine this output with a resegmentation step using MFCC features and GMM speaker models.

The main contribution of this work is the analysis and compensation of the different types of variability. First, let me describe the types of variability we can find. If we take a set of similar recordings containing different speakers and analyze the variability across these recordings, we find that it is mainly due to the speakers, so we refer to it as speaker variability. If we analyze a set of recordings belonging to the same speaker, we see that there is also variability among these recordings, usually due to aspects like the channel or the mood of the speaker; this is usually known as intersession variability. In addition, if we take a single recording containing a single speaker, cut it into smaller slices, and analyze the variability over these slices, we see that there is also variability within the recording, usually due to the phonetic content or to changes of the channel along the recording; we will refer to it as intrasession variability.

In our approach for speaker segmentation we are only modeling the speaker variability, so the question is: are the other types of variability, intersession and intrasession, affecting the performance? And do we need to compensate for intersession variability? We know that intersession variability compensation is very important for speaker recognition.
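To make the clustering step above more concrete, here is a toy sketch in Python. The data are synthetic stand-ins for the speaker-factor stream, and for simplicity the k-means runs directly on the first principal component (the actual system clusters in the full factor space); none of this is the authors' code.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for the stream of speaker factors: one 20-dimensional
# vector per one-second window, drawn around two different "speaker" means.
n_frames, dim = 400, 20
speaker_means = rng.normal(size=(2, dim))
labels_true = rng.integers(0, 2, size=n_frames)
factors = speaker_means[labels_true] + 0.3 * rng.normal(size=(n_frames, dim))

# PCA followed by k-means into two clusters (= the two speakers). Here the
# k-means runs directly on the first principal component for simplicity.
proj = PCA(n_components=1).fit_transform(factors)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(proj)

# One full-covariance Gaussian per cluster, as in the talk; these models
# would then drive a Viterbi decoding to obtain the first segmentation.
for k in range(2):
    cluster = factors[km.labels_ == k]
    mu = cluster.mean(axis=0)              # mean vector
    sigma = np.cov(cluster, rowvar=False)  # full covariance matrix
    print(k, cluster.shape[0], sigma.shape)
```

With well-separated synthetic speakers, the one-dimensional projection is bimodal and k-means recovers the two clusters almost perfectly.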
However, we would argue that it is not so important for speaker segmentation. In some preliminary experiments the channel factors did help a little, but we believe they should not help much, because in diarization you do not see the same speaker across different sessions: the speakers share a single session, and you have no prior information about them. Actually, we believe that intersession variability may even help to separate the speakers in a diarization task, because the channel carries information that can help to tell the speakers apart.

And what about intrasession variability? State-of-the-art speaker recognition systems do not usually take intrasession variability into account, since they use the whole conversation to train a model. But we think it is important for speaker segmentation and diarization, because many state-of-the-art systems are based on the clustering of very small segments, and if we can compensate the variability between the segments of a given speaker, the clustering process should be easier. That is what we try to do.

Given a dataset containing several speakers and several recordings per speaker, we extract a stream of speaker factors from each recording. We then consider every speaker as a different class, and we model the speaker and intersession variability as between-class variance, since we believe that both help to separate the speakers within a recording, while we model the intrasession variability as within-class variance. In this framework it is easy to apply well-known techniques such as linear discriminant analysis (LDA), which maximizes the between-class variance while minimizing the within-class variance, and within-class covariance normalization (WCCN), which normalizes the covariance of every class towards the identity matrix. Both techniques have been successfully applied for intersession compensation in speaker recognition.

To evaluate these two approaches we use the NIST SRE summed-channel condition, containing more than two thousand five-minute telephone conversations. The speech/non-speech marks are given, and we measure performance in terms of speaker segmentation error, or diarization error rate: since we start from the reference speech/non-speech segmentation and do not take overlapped speech into account, the diarization error rate equals the speaker segmentation error. We use a 0.25-second collar for scoring.

Here we have the results for a system using a small UBM of 256 Gaussians and MFCC features; in this case we do not apply the resegmentation step. Our baseline obtains a 2.1% segmentation error, and using intersession variability compensation with WCCN and twenty speaker factors the segmentation error is reduced. Another baseline with fifty speaker factors does slightly better, not much, but slightly better. We also tried LDA for dimensionality reduction; the LDA helps, obtaining a 2% segmentation error, but it is not as good as WCCN, and even the combination of both is not better than applying WCCN directly. However, when we tried these systems after the resegmentation step, it was surprising that the resegmentation made all the results more or less equal, both with twenty and with fifty speaker factors. So the intersession variability compensation with WCCN was still working and giving an improvement, but increasing the number of speaker factors did not seem useful.
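Since WCCN is central to this work, a minimal numerical sketch may help. This is a toy version on synthetic classes (all names and sizes are assumptions, not the system's code): WCCN estimates the average within-class covariance W and maps the data through B, where B Bᵀ = W⁻¹, so the transformed within-class covariance becomes the identity.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical training data: 8 classes (e.g. speakers) sharing a
# non-identity within-class covariance, as WCCN assumes.
dim, n_classes, n_per = 10, 8, 200
A = rng.normal(size=(dim, dim))
within_true = A @ A.T / dim
class_means = 3.0 * rng.normal(size=(n_classes, dim))
X = np.vstack([m + rng.multivariate_normal(np.zeros(dim), within_true, n_per)
               for m in class_means])
y = np.repeat(np.arange(n_classes), n_per)

def wccn(X, y):
    """Return B with B @ B.T = inv(W), W = average within-class covariance.
    Mapping x -> B.T @ x then whitens the within-class scatter."""
    W = np.mean([np.cov(X[y == c], rowvar=False) for c in np.unique(y)], axis=0)
    return np.linalg.cholesky(np.linalg.inv(W))

B = wccn(X, y)
Xw = X @ B  # each row becomes B.T @ x

# After the mapping, the average within-class covariance is the identity
# matrix (exactly, up to floating point, for the estimated W).
W_after = np.mean([np.cov(Xw[y == c], rowvar=False) for c in np.unique(y)], axis=0)
print(np.round(W_after, 3))
```

The Cholesky factor is just one convenient square root of W⁻¹; any B with B Bᵀ = W⁻¹ performs the same normalization.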
We were a little disappointed with these results, because we thought all of it should help, so we tried a new setup, which is not in the paper I am presenting here: a bigger UBM and more features. In this setup, increasing the number of speaker factors does help. Our baseline with fifty speaker factors obtains a 1.8% segmentation error, lower than the 2.1% from before, and using channel compensation it is reduced to 1.4%. We also increased the number of speaker factors to test the LDA, and we see that the LDA is helping more than before; moreover, our best configuration is now the combination of LDA plus WCCN. It seems that the baseline with many speaker factors is not better than the baseline with fifty speaker factors, but with LDA we can take advantage of the additional speaker factors. Our best result is a 1.3% segmentation error.

On the other hand, we also propose a technique to generate several segmentation hypotheses and to select the best one based on a set of comparison metrics. What we do is apply an iterative binary splitting to the recording, obtaining four levels of splitting, as you can see in the figure, and we segment every slice with the proposed system. Then, for every level, we select the best slices and combine them to train the two speaker models, and with these two speaker models we resegment the whole recording using a Viterbi resegmentation with MFCC features and GMM speaker models. To select the best slices and the best level, we use a set of comparison metrics.
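The iterative splitting can be sketched as follows. This is a simplified equal-length version (the function name and the non-adaptive splitting are assumptions), producing the four levels of slices that are each segmented and then compared.

```python
def split_levels(n_frames, n_levels=4):
    """Binary splitting: level k cuts the recording into 2**k equal slices,
    returned as (start, end) frame-index ranges. Level 0 is the whole file."""
    levels = []
    for k in range(n_levels):
        n_slices = 2 ** k
        edges = [round(i * n_frames / n_slices) for i in range(n_slices + 1)]
        levels.append(list(zip(edges[:-1], edges[1:])))
    return levels

# Each slice would be segmented independently; per level, the best slices
# train the two speaker models used for the final Viterbi resegmentation.
for level in split_levels(1000):
    print(level)
```

For a 1000-frame recording this yields 1, 2, 4, and 8 slices at levels 0 through 3.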
We also use majority voting among the hypotheses. The comparison metrics used in this work were, first, the Bayesian information criterion (BIC), computed using MFCC features and Gaussian speaker models, and second, the KL divergence, computed in the speaker-factor space: we fit Gaussian speaker models in that space and compute the KL divergence between both models. To fuse both metrics we use the FoCal toolkit, well known in speaker verification, and the fusion weights were optimized to single out the hypotheses with a segmentation error below 1% on the summed-channel data.

Here we have the results for this hypothesis generation and selection strategy. When we are not using intersession variability compensation, this solution improves the results: the baseline was at 2.1%, and we obtain 1.9% with the proposed strategy. If we had an ideal confidence measure that always selected the best level, we could go down to a 1.1% segmentation error, so the potential of the approach is high. However, when using intersession variability compensation we did not obtain a significant improvement: there was an improvement, but it was not statistically significant. So we tried to analyze why it did not help, and we wanted to improve the set of comparison metrics, because the space of possible comparison metrics is large and we were using only simple metrics to fuse the segmentation hypotheses. Since we were not really happy with these results, we tried again with the larger UBM, more features, our best setup for intersession variability compensation, and a new set of comparison metrics. These are new results, not included in the paper.
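Both comparison metrics have simple closed forms for Gaussian models, so they can be written down concretely. Below is a toy sketch of a symmetric KL divergence between full-covariance Gaussians and a standard delta-BIC; the exact penalty weight and model choices used in the talk's system are not specified, so these are assumptions.

```python
import numpy as np

def gauss_fit(X):
    """Fit a single full-covariance Gaussian to the rows of X."""
    return X.mean(axis=0), np.cov(X, rowvar=False)

def kl_gauss(m0, S0, m1, S1):
    """Closed-form KL( N(m0,S0) || N(m1,S1) )."""
    d = m0.size
    S1_inv = np.linalg.inv(S1)
    diff = m1 - m0
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - d
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

def sym_kl(m0, S0, m1, S1):
    """Symmetric KL, usable as a distance between two speaker models."""
    return kl_gauss(m0, S0, m1, S1) + kl_gauss(m1, S1, m0, S0)

def delta_bic(X1, X2, lam=1.0):
    """Delta-BIC: one shared Gaussian vs. one Gaussian per segment; larger
    values suggest the two segments come from different speakers."""
    n1, n2, d = len(X1), len(X2), X1.shape[1]
    _, S1 = gauss_fit(X1)
    _, S2 = gauss_fit(X2)
    _, S = gauss_fit(np.vstack([X1, X2]))
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n1 + n2)
    return (0.5 * (n1 + n2) * np.log(np.linalg.det(S))
            - 0.5 * n1 * np.log(np.linalg.det(S1))
            - 0.5 * n2 * np.log(np.linalg.det(S2)) - penalty)

rng = np.random.default_rng(2)
same = rng.normal(size=(300, 5))          # one synthetic "speaker"
other = rng.normal(size=(300, 5)) + 4.0   # a clearly different one

# Both metrics should separate same-speaker from different-speaker pairs.
print(sym_kl(*gauss_fit(same), *gauss_fit(other)))             # large
print(sym_kl(*gauss_fit(same[:150]), *gauss_fit(same[150:])))  # small
print(delta_bic(same, other) > 0, delta_bic(same[:150], same[150:]) < 0)
```

In a real system the two Gaussians would come from the two hypothesized speaker models of each segmentation hypothesis, and the metric scores would be fused into a single confidence per hypothesis.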
With this we could reduce the segmentation error from 1.3% to 1.2%, and in the best configuration to 1.0%; and if we could always select the best level, we could get down to a 0.7% segmentation error, which is quite good compared to the 1.3% baseline.

As conclusions of this work: we have presented two methods for variability compensation in speaker segmentation. We have shown that WCCN alone obtains better performance than LDA alone, and performs similarly to the combination of both. As the number of speaker factors increases, so does the computational cost, so WCCN seems better suited for low-computational-cost applications; but when computational cost is not a problem, our best configuration uses a high number of speaker factors together with LDA plus WCCN. In the summary of results, our best configuration reaches 1.3%, so we improved the system from 1.9% to 1.3%.

Note also that the reason WCCN helps so much is probably the initialization used in this study: our system uses PCA plus k-means as initialization, and normalizing the within-class covariance of every speaker probably helps the k-means, which implicitly assumes that all the classes have the same covariance. So it is probably because of this that WCCN is so effective here. Finally, we have also presented a hypothesis generation and selection technique which can improve the diarization results; for our best configuration, it reduces the segmentation error from 1.3% to 1.2% with the larger setup. That is all, thank you very much. We have time for questions.
[Audience question, partly inaudible.]

Yes, just one dimension. I did not mention it because it is covered in another paper, but this choice is mostly about reducing variability. For the PCA we keep just one dimension to initialize the k-means: the k-means itself uses all the dimensions, but to initialize its means we use the first dimension of the PCA output.

[Follow-up question, partly inaudible.]

Yes, sure. I mean, in our experiments we are keeping one dimension, and maybe that is not the best you can do, but that single dimension, usually the first dimension of the PCA output, turns out to be the best one for this purpose. Still, we get about an 18% diarization error rate when using that one dimension alone, so we are not sure it is the best representation. We also tried to plug the PCA output with more dimensions directly into the k-means.

[Session chair:] If there are no more questions, let's thank the speaker again.