okay. um, and as he said, my name is laura. i'm a phd student at the university of california, berkeley, and i also work at the international computer science institute, or icsi, as many of you know. and i would like to also acknowledge my co-author, who was a fundamental part of this work.

so, just a quick overview; it's a pretty standard structure. i'll start out with what we're trying to do and why, then go through related work and our approach to the problem, give you the results, do a little bit of additional analysis, and conclude with a summary and future work.

so i think we can all agree that automatic speaker recognition system performance depends on a number of factors, one of which is intrinsic speaker characteristics. the notion is that, just as we as humans notice that certain speakers sound more alike, automatic systems will perform better or worse for different speakers. so the goal of this work is to predict which speaker pairs will be difficult for automatic speaker recognition systems to distinguish. we did some preliminary work showing that speaker pairs that are hard for one system are also hard for others. and of course you could just use a system itself to select speaker pairs, and you'd probably do really well, but we wanted to stay away from using any one system and instead have a general approach, and just use features that will hopefully capture some degree of speaker similarity. the motivation, besides this being an interesting task, is to potentially better focus the research and to reduce the amount of data needed to estimate system performance.

so there's a couple of types of related work. the first has to do with the idea of different speakers causing different problems. the infamous doddington zoo paper categorized speakers based on system performance: so you have goats, who cause a large
number of false rejections as target speakers; you have lambs, who cause a large number of false acceptances as target speakers; wolves, who cause a large number of false acceptances as impostor speakers; and finally sheep, the default, well-behaved speakers. in this work we don't actually distinguish between the wolves and the lambs, since we're looking at speaker pairs, but we went with wolves in the title because hunting for lambs didn't sound so good. there's been other work done on dealing with these speakers that may be difficult: work showing that there are performance differences between high- and low-pitched speakers, and some work that tried out methods to deal with these problem speakers.

the other area of related work that's relevant is the features that are used to describe or characterize speakers, and here you can draw on a lot of different types of work. obviously, speaker recognition approaches have used a variety of features; certainly not an exhaustive list here, but things like pitch and energy distributions and dynamics, prosodic statistics, and jitter and shimmer. in work on perceptual speaker characterization or discrimination you find a lot of formant frequencies and bandwidths and dynamic features. and other acoustic parameters that influence voice individuality include the pitch frequency contour and fluctuation, again the formant frequencies, and the long-term average spectrum.

so, our approach is fairly straightforward. basically, we compute feature values over some speech data corresponding to every speaker, and then, using these feature values, we compute a measure of similarity for all speaker pairs. looking at these measures, we take the speaker pairs that have the highest and lowest values in terms of these similarity measures, and compare performance on those speaker pairs to performance on all pairs. so, the features we consider here: first
of all, pitch statistics: the mean, median, range, and mean absolute slope. for jitter and shimmer, we use the relative average perturbation and five-point amplitude perturbation quotient versions. then formant frequency statistics: the mean and median of the first three formants; i'll mention that we work with eight-kilohertz data, so although higher formants might be useful, we didn't calculate them here. then energy: long-term average spectrum energy statistics, including the mean, standard deviation, range, slope, and local peakedness. we also did a fourteenth-order lpc analysis and found the frequencies from the coefficients, both with and without a minimum magnitude requirement, which essentially limits the bandwidth, and then we collect those frequencies into a histogram. and finally we have the mode and median of the spectrum.

so we have all these features; what measures do we use? for the scalar features, which almost all of them are, we simply took the absolute or percent difference. also, in addition to using the formant frequencies individually, we looked at sums of the formant frequencies, and we looked at vectors of formant frequencies, taking the euclidean distance between the vectors. and finally, for the histograms of frequencies, we calculated the correlation as the measure.

there are two different ways you can compute a single measure for a speaker pair. the first is, for every speaker, to take all their feature values over the conversation sides that are available and get an average feature value over the conversations, and then compute the measure between these average values for the two speakers. the other approach is to work on a conversation-by-conversation basis for the two speakers: compute the distance measure for each conversation pair first, and then average over the conversation pairs.
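the two ways of getting a single per-pair measure could be sketched roughly like this; the data layout, feature values, and function names here are all my own illustration, not the actual system:

```python
import math

# hypothetical data: speaker -> list of per-conversation feature vectors
# (here just mean F1/F2/F3 in hertz; the numbers are invented)
convs = {
    "spk_a": [[510.0, 1480.0, 2500.0], [525.0, 1510.0, 2530.0]],
    "spk_b": [[620.0, 1700.0, 2650.0], [605.0, 1690.0, 2605.0]],
}

def euclidean(u, v):
    """euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def average(vectors):
    """componentwise average of a list of feature vectors."""
    return [sum(col) / len(col) for col in zip(*vectors)]

# scheme 1: average each speaker's features over their conversations
# first, then compute one distance between the two average vectors
def dist_avg_first(a, b):
    return euclidean(average(convs[a]), average(convs[b]))

# scheme 2: compute the distance for every conversation pair,
# then average those per-pair distances
def dist_pairwise(a, b):
    ds = [euclidean(u, v) for u in convs[a] for v in convs[b]]
    return sum(ds) / len(ds)
```

for a scalar feature the analogous measure would simply be `abs(x - y)` or the percent difference `abs(x - y) / x`.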
for the result types, i just present whichever method gave the larger difference.

so, the data that we used for the feature measure calculation and speaker pair selection was part two of the nist follow-up evaluation data. this is all interview data, which is recorded on microphones; we limited it to a single microphone channel for quality purposes. and just to point out, almost all of the speakers have four conversations available; there's a handful with three or five.

once we have the speaker pairs selected, we evaluate performance using the data from the nist two thousand eight evaluation, short2-short3 condition. this data varies from the pair-selection data in a couple of respects: in addition to possibly being an interview, it can also be speech from a telephone conversation, and in addition to the one microphone channel, there are other microphones available, as well as the telephone channel. we had submissions that were shared by participating sites, so thank you to everyone who shared their submissions. the short2-short3 condition originally had, i think, maybe around ninety thousand trials or so; i had to remove the trials that correspond to speakers that weren't in the selection data, and after that you're left with about fifty-five thousand trials. and then, furthermore, when you sub-select and only keep trials corresponding to some percentage of speaker pairs, you get around four thousand or eleven thousand trials, and we only keep target trials for speakers who show up in one of the selected pairs.

so, how do we evaluate the system performance? there are various metrics you can use; what we look at here is the minimum detection cost function, which of course is a weighted sum of the miss and false alarm rates, with relative weights for the errors. this is all done with the two thousand eight cost function, so it's not the low-false-alarm cost of the two thousand ten evaluation.
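as a rough sketch, that minimum detection cost can be computed by sweeping a decision threshold over the scores; the cost parameters below are the two thousand eight values (a miss costs ten, a false alarm costs one, target prior of zero point zero one), and the function name is mine:

```python
def min_dcf(target_scores, nontarget_scores,
            c_miss=10.0, c_fa=1.0, p_target=0.01):
    """minimum detection cost over all decision thresholds (sre08 costs)."""
    best = float("inf")
    # candidate thresholds: every observed score, plus one above the max
    candidates = sorted(set(target_scores) | set(nontarget_scores))
    for t in candidates + [candidates[-1] + 1.0]:
        # accept when score >= t: a miss is a target below t,
        # a false alarm is a non-target at or above t
        p_miss = sum(s < t for s in target_scores) / len(target_scores)
        p_fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        dcf = c_miss * p_miss * p_target + c_fa * p_fa * (1 - p_target)
        best = min(best, dcf)
    return best
```

with perfectly separated scores this gives zero; with the scores fully inverted it bottoms out at whichever single error type is cheaper under the weights.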
and then, since we're looking at impostor speaker pairs, we also look at the false alarm rate, which of course, for a given decision threshold, is simply the number of false alarm errors that occur out of the total number of non-target trials.

so, for every system submission that we have, we first compute the dcf over just the trials for the most or least similar speaker pairs, and take the change in dcf relative to what it is for all speaker pairs; then we take all these system differences and average over the systems, so the results are relative to a typical overall system. and then, for the false alarm rate, for the all-speakers case we find a decision threshold that generates a false alarm rate of one percent, and then, at that same decision threshold, we see what the false alarm rate is for the most and least similar speaker pairs. and of course, if more similar speaker pairs actually correspond to more-difficult-to-distinguish speaker pairs, then we expect these changes in the dcf and false alarm rate to be positive.

so here are the results when you look at one percent of the speaker pairs. in each case the top row corresponds to the least similar speaker pairs, so performance is improving, and the bottom row corresponds to the most similar speaker pairs, so performance is getting worse. we notice that we are able to find features and measures that select speaker pairs with the desired effect. if we then compare performance on one percent to performance on five percent, you can see that the effect is less pronounced when you include more speaker pairs; in some cases you even have these negative, or opposite, trends from what you'd expect. i've pretty much mentioned all these points; the only other thing to note is that changes in performance are not uniform across site submissions, so that is one issue.
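the fixed-threshold comparison just described could be sketched like this; the helper names are mine:

```python
def fa_rate(nontarget_scores, threshold):
    """fraction of impostor scores at or above the decision threshold."""
    return sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)

def threshold_for_fa(nontarget_scores, target_rate=0.01):
    """smallest observed threshold whose false alarm rate is <= target_rate."""
    for t in sorted(nontarget_scores):
        if fa_rate(nontarget_scores, t) <= target_rate:
            return t
    return max(nontarget_scores) + 1.0
```

the idea is to call `threshold_for_fa` on the impostor scores over all speaker pairs to pin the operating point at one percent, then call `fa_rate` with that same threshold on just the trials from the most (or least) similar pairs and compare.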
okay, so here's what the det curves look like for one system, when we use the euclidean distance between vectors of the first three formants to select the speaker pairs. the solid lines correspond to keeping the most similar speaker pairs, and the dashed lines to the least similar; red is one percent, green is five percent, and the black line is the case for all speaker pairs. what we notice is that, in this particular instance, there's a bigger difference when looking at the least similar speaker pairs, while the most similar speaker pairs are much closer to the performance over all speakers. although that doesn't happen all of the time, it is certainly the general tendency to have this larger gap in that direction. and here's another example that shows that it doesn't always hold: this is a different system and a different feature measure, the percent difference of median energy in this case, and you get better separation here. so there is variation in how much separation there is, and across systems.

so we've been able to do reasonably well with these simple features, but we expected we could probably do even better if we used more knowledge about the speakers. so we decided to simply use gmms, since they obviously show up in a lot of systems. we adapted speaker-specific gmms, and then calculated the kl divergence between them as the measure of speaker similarity. when we do that, not surprisingly, we get better results. the previous charts all had an axis running from negative fifty percent to fifty percent, so you can see already that there are larger differences, which, as i said, is what we would expect. here's the det curve for a system using the kl divergence measure; again you can see that these are larger differences from the all-pairs performance, and we again see this asymmetry, where there's a bigger gap for the dissimilar pairs than for the similar ones.
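there's no closed form for the kl divergence between two gmms, so one common approach, and a plausible sketch of the idea, is a monte carlo estimate; this toy version is one-dimensional with made-up parameters, whereas the real systems would use adapted multivariate gmms:

```python
import math
import random

def gmm_pdf(x, comps):
    """density of a 1-d gmm given as [(weight, mean, stdev), ...]."""
    return sum(w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
               for w, m, s in comps)

def gmm_sample(comps, rng):
    """draw one sample: pick a component by weight, then sample its gaussian."""
    w, m, s = rng.choices(comps, weights=[c[0] for c in comps])[0]
    return rng.gauss(m, s)

def kl_mc(p, q, n=20000, seed=0):
    """monte carlo estimate of KL(p || q) = E_p[log p(x) / q(x)]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = gmm_sample(p, rng)
        total += math.log(gmm_pdf(x, p) / gmm_pdf(x, q))
    return total / n

def symmetric_kl(p, q):
    """symmetrized divergence, used here as a speaker dissimilarity score."""
    return kl_mc(p, q) + kl_mc(q, p)
```

a larger `symmetric_kl` value between two speakers' gmms would then mark the pair as more dissimilar, hence presumably easier to distinguish.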
so, as i mentioned, we tend to be more successful at selecting easy-to-distinguish speaker pairs, possibly because these pairs may be easier to find. one possible explanation would be that if you have a speaker pair that is very dissimilar in terms of pitch or formant frequencies, a big difference in any one feature is probably going to mean that the system is not going to confuse them. but on the flip side, if you're trying to figure out what makes a speaker pair difficult, any single feature may not be enough to capture that information.

so, using the kl divergence measure, we took a closer look at the speaker pairs that are selected as the most and least similar. in addition to the one percent and five percent groups, we also looked at three percent, ten percent, and twenty percent. in this data there were a hundred fifty speakers overall, leading to around eighteen hundred unique same-sex speaker pairs. one thing we noted is that in the groups of least similar speaker pairs, the ones with the larger values of the divergence that we would expect to be easier to distinguish, the majority of pairs are male: if you look at any one group, about seventy-five percent of the speaker pairs in the group will be male, on average. to a lesser extent, we noticed the opposite tendency when we look at the more similar speaker pairs, which somewhat tend to be more female; the one and three percent groups still have more male pairs, but the other groups have more female. this may be part of the reason why system performance is typically better on males; it may just be that males exhibit a greater range of differences between them, so that there are likely to be more dissimilar male speaker pairs.

and finally, looking at these groups, we noticed that there is a tendency to find two types of speakers: there are speakers who frequently appear as members of difficult-to-distinguish speaker pairs, and speakers who frequently appear as members of easy-to-distinguish pairs. in fact, there are fifteen speakers who never appear in the most similar group, and twenty-four speakers who never appear in the most dissimilar
group. i forget, but i think for those twenty-four speakers it's ten male and fourteen female. so this tends to support the idea that there are these wolves out there: speakers who are more difficult, or more similar to other speakers.

so, just a summary of what i mentioned. first of all, it is possible to predict what speaker pairs will be difficult for a typical speaker recognition system to distinguish. of the features that we considered here, which capture pitch and formant frequencies, the best one seemed to be the euclidean distance between the first three formant frequencies, but the best measure overall was the more complex kl divergence measure between speaker-specific gmms. i mentioned, of course, that we're typically more successful at identifying dissimilar speaker pairs, and that, in addition to finding speaker pairs, these measures can potentially provide useful information about a speaker's tendency to be similar or dissimilar to other speakers.

so, future work. one thing to try is testing combinations of multiple feature measures as the method for selecting similar speaker pairs. i did a little bit of work on this, where i basically assigned a rank according to each feature measure, summed the ranks over the speaker pairs, and did the selection that way, and that improved results. another extension is, instead of focusing on impostor speaker pairs, to see if you can figure out what target speakers will be difficult for a system to correctly recognize. and one thing that certainly needs to be investigated is the consistency of behavior for these types of speakers across different systems; we may be able to find potential trends in behavior across classes or types of systems. of course, with the site submissions that we used here, almost all of the submissions are in fact fusions of multiple systems, so we might need to do more of a breakdown to really get at that.
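the rank-combination idea could look something like this sketch; the feature measures and values are invented for illustration, and lower values are taken to mean more similar:

```python
# similarity measures per speaker pair, one dict per feature measure;
# lower value = more similar (the numbers here are made up)
measures = {
    "pitch_diff":   {("a", "b"): 0.10, ("a", "c"): 0.40, ("b", "c"): 0.25},
    "formant_dist": {("a", "b"): 30.0, ("a", "c"): 90.0, ("b", "c"): 20.0},
}

def rank_sum(measures):
    """rank pairs under each measure, sum the ranks; small total = similar."""
    totals = {}
    for vals in measures.values():
        ordered = sorted(vals, key=vals.get)       # most similar first
        for rank, pair in enumerate(ordered):
            totals[pair] = totals.get(pair, 0) + rank
    return sorted(totals, key=totals.get)          # pairs, most similar first
```

selection would then just take the top (or bottom) few percent of the returned ordering instead of thresholding any single feature measure.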
okay, that's all i have. thank you.

thank you, laura, for the presentation. i have a question about your formant extraction: did you do one extraction over all the vowels, or did you condition the extraction on the vowel? you know, my question is, is this independent of the vowel, or do you do the extraction according to the vowel?

no, we didn't; it was just over the entire file. so you could probably get much better estimates than what we actually did.

because the problem is that you have a lot of disturbance depending on the vowel, so i think that it is maybe more the phonological information that you're capturing than the speaker information.

yeah, of course, this is convolving the phonetic information with the speaker information.

okay, thank you.

(a second question from the audience is largely inaudible.)

i'm sorry, are you talking about that paper? i think you mean... yeah, i know, but it was definitely the case that it was... (remainder inaudible)