All right, everybody. I want to start already before the wine tasting, so that I can continue afterwards. I want to talk about speaker linking. And what can you actually expect before we go off to the wine tasting? A few things: I have a high-priced question, I have a graph, for the mathematicians I have a formula, and also a picture, and finally, maybe, or maybe not, depending on how well I do, a joke. So let's start. By the way, if you're not interested in this subject, you can keep yourself busy with detecting the OOVs; that is a specific use case for you.

So, I was reading a book. I have it at home and I haven't finished it yet, but it tells about how people in the Second World War were eavesdropping on the communication of the other side, from the English perspective. They were listening to the Morse code signals, and the codes were encrypted, but they were still able to deduce some kind of information, namely the person behind the Morse code apparatus. I believe that in Morse code terminology this is called the "fist": the way your fist goes up and down identifies the operator. So even though they didn't know the identity of the people, they were able to link one broadcast, at one particular instance in time, from one particular direction or whatever, to another one later, and from that they could deduce movements of troops. Even though the messages themselves were encrypted, they were still able to extract some information. That gives you an idea of what this could be useful for.

Another example of linking, or clustering as you might call it, a rather specific implementation, is actually on the web, made by a big web-based software firm: you can do this with photographs, with faces in fact, and it works pretty well. I tried it on a whole lot of pictures, and even though the clustering itself isn't very good in terms of actual performance figures, you get a cluster, you click the wrong ones away, you type two or three letters of the person's name, which it can of course match against your e-mail database, and you have made a new cluster; then you get the next person, et cetera. So it works very well in an interactive setting, even though the clustering performance itself is, in this particular case, pretty bad.

Now, just a short intermission. The way I see it, clustering is actually kind of old-fashioned: we are doing some kind of identification, right, putting people into sets and making hard decisions about it. And we don't like identification; identification has the problem of the priors. If you want to do proper identification you need priors, and we don't know them, so what are we going to do with them? So here is a little test for you. We always work with equal error rates, even in language recognition. Suppose you have a system with a certain equal error rate, five percent say, in a two-class detection task. Now you are going to apply this system to speaker identification, and you do the identification by taking the segment, computing a score against one model, computing a score against the other, and, with equal priors, choosing the model with the maximum likelihood score. You don't do anything clever, no discriminative training between the two speakers. The question then is: what is your identification error rate going to be, one percent, five percent, or ten percent? That is a question you can think about during this talk, if you don't want to watch the slides.
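(A minimal simulation sketch of this quiz, not from the talk: it assumes equal-variance Gaussian target and non-target score distributions, calibrated to the stated five percent equal error rate; everything below is illustrative.)

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)

# Assumed score model: non-target scores ~ N(0, 1), target scores ~ N(d, 1).
# At the EER operating point the threshold sits at d/2, so a 5% EER
# corresponds to a separation d = 2 * Phi^{-1}(0.95).
eer = 0.05
d = 2 * NormalDist().inv_cdf(1 - eer)

# Two-speaker closed-set identification with equal priors: score the test
# segment against the true speaker's model and against the other speaker's
# model, pick the maximum score, nothing clever.
n = 1_000_000
s_true = rng.normal(d, 1.0, n)      # scores against the correct model
s_other = rng.normal(0.0, 1.0, n)   # scores against the competing model
id_error = np.mean(s_other > s_true)

print(f"detection EER {eer:.0%} -> 2-class identification error {id_error:.2%}")
# Analytically P(error) = Phi(-d / sqrt(2)), about 1% here: under these
# Gaussian assumptions, two-class identification comes out easier than
# the 5% detection EER might suggest.
```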
So, speaker linking. This term was actually used before; we were inspired by George Doddington, who is a source of lots of inspiration on the question of what kinds of problems should be solved in speaker recognition, and by his characteristically dismissive type of answers: George can do these kinds of things. Note that speaker linking is a batch problem. Whatever it is exactly, I am interested in a large set of speech segments. If I can do diarization within a single show, I think that's fantastic, but I would actually like to do diarization over all the television shows of an entire year, or whatever: large-scale problems. And again it is a kind of clustering; we want to link those speakers. I think the large-scale aspect is the real problem, and that is what I want to show, or want to investigate; it is a bit of an exploratory presentation.

Niko already said previously that this is related to all kinds of other things. Speaker clustering, of course, is basically the same problem, but we are focusing now on large-scale problems. Partitioning is probably a much nicer way of doing things, but you need prior distributions over all partitionings, and it probably does not work at large scale. The problem has relations with lots of other things. First of all diarization: in diarization you also need the segmentation, of course, but you typically apply diarization within a single recording, and like I said, I would like to make the links between recordings as well. There is a relation with the multi-conversation training conditions in the speaker recognition evaluations: there you may have diarization as an additional task, and you know that there is exactly one common speaker, a common link, between all the training segments that you have, so there is more prior information. Speaker tracking is related as well; there, I think, you are given a model for a particular speaker and then you have to find that speaker in a large collection. And finally it is of course related to clustering in general, with the difference that in many clustering problems it is not really clear what the classes are: if you look at topic clustering, well, what makes a topic a topic can be one thing or another, while here, with speakers, we know the truth.

A very quick overview of the types of clustering algorithms that could take on this problem. There is the top-down way; you might say the way we train our GMMs is a way of doing this, where you start with a single cluster and successively divide it. And there is the bottom-up way, agglomerative clustering, which is typically what we do in diarization: you start with the individual segments and you cluster them together until you say, this is enough, I have found my classes. Now I am sure that there are many more clustering algorithms that are actually better than these, but let me concentrate on agglomerative clustering.
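(A minimal sketch of the bottom-up idea, not from the talk and with assumptions of my own: segments are represented by a precomputed pairwise similarity matrix, clusters are compared by average linkage, and merging stops at a threshold. The full rescan before every merge is what drives the cubic cost discussed next.)

```python
import numpy as np

def agglomerative(scores: np.ndarray, threshold: float) -> list[list[int]]:
    """Offline bottom-up clustering of n segments, given a symmetric
    n-by-n similarity matrix (higher means more alike).  Average linkage;
    the naive full rescan before every merge makes the total cost grow
    roughly with n cubed."""
    clusters = [[i] for i in range(len(scores))]
    while len(clusters) > 1:
        best, pair = -np.inf, None
        for a in range(len(clusters)):            # scan all cluster pairs
            for b in range(a + 1, len(clusters)):
                s = np.mean([scores[i, j]
                             for i in clusters[a] for j in clusters[b]])
                if s > best:
                    best, pair = s, (a, b)
        if best < threshold:                       # stopping criterion
            break
        a, b = pair
        clusters[a].extend(clusters.pop(b))        # merge the best pair
    return clusters
```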
One of the things that bothers me about this kind of clustering is that it does not scale with time. If you take the simplest agglomerative clustering idea, then you start with the full number of segments, you find the best matching pair, you merge them, and then you do it again; the total complexity is of the order of the number of segments to the third power. And if you then want intermediate updates, in some kind of online situation, say you have recorded shows over a whole year and you get an extra show with extra speaker segments that you want to put in, then again you add an extra order of complexity. And of course you also keep getting more data every day, so this cannot really be done in an incremental context.

What is the next thing? Oh yes: that was agglomerative clustering done offline, where you collect all your data and then say, I am going to do the clustering in one very careful batch. You can also do it online: you take one segment at a time, and the next segment is assigned either to an existing cluster or it makes a new cluster. That is a lot simpler, and the incremental complexity is now of the order of the number of clusters found so far. For the divisive case I don't know exactly; I think it is also high.

Some further aspects of this clustering. You can decide either to retrain your models during the clustering process or not. There are advantages if you do: you have more data, so better models; but you might also reinforce your clustering errors. Another question is whether you are going to use the data in your speaker comparison matrix for some form of normalization, for instance for the general acoustics that you are getting in, or to normalize scores; or, maybe even better, whether you want to train discriminatively. If you train the clusters discriminatively, believing that the clusters are very good, then you probably get much better speaker separation. This is something we are not used to doing in speaker detection, for good reason, but I think if you are really after large-scale clustering you might consider doing these things, though I also think they are not trivial to do. Another aspect is whether we are going to make hard decisions, hard clusters where you either put speaker segments together or not, or do it in some kind of soft way, more along the lines of the speaker partitioning with priors. It might well be better to do it the soft way; compare the way you attribute your data to the mixtures in a GMM, which is also done in a soft way, and that works better than doing it the hard way. So this is something to consider as well.
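(And a matching sketch of the online variant described above, again not from the talk: one segment at a time, join the best scoring cluster or open a new one. The scoring function is a stand-in of my own, not the GMM-SVM scores of the experiment later on.)

```python
from typing import Any, Callable

def online_cluster(segments: list, score: Callable[[Any, Any], float],
                   threshold: float) -> list[list]:
    """One-pass clustering: each new segment either joins the best scoring
    existing cluster or opens a new one, so the incremental cost per
    segment is proportional to the number of clusters found so far.
    No model retraining; a cluster is scored as the mean score of the
    new segment against its current members."""
    clusters: list[list] = []
    for seg in segments:
        best, home = -float("inf"), None
        for c in clusters:
            s = sum(score(seg, m) for m in c) / len(c)
            if s > best:
                best, home = s, c
        if home is not None and best >= threshold:
            home.append(seg)          # join an existing cluster
        else:
            clusters.append([seg])    # or start a new one
    return clusters
```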
All right, another aspect of speaker clustering: how do I evaluate how well I am doing? For speaker detection we have come very far in defining good evaluation measures; that problem we understand. But what are we going to do for clustering? People who do clustering usually have some form of single evaluation measure, and I don't know which ones are the best, but the ones that I like are the impurities, or one minus the purities; we like to look at errors, so I would go with impurity. Basically, if you have a cluster at the end of the clustering process, you want to know how homogeneous it is, and the simplest way of looking at that is: which speaker occurs most, and what fraction does that speaker cover? The impurity measures how many segments are different from that most frequently occurring speaker. If you want to express this mathematically, then the way I defined it looks rather complicated, and I couldn't get it any simpler.

But the interesting thing is this: that is the cluster purity you see in the general clustering literature, and in speaker detection we know there is always the other side. Cluster impurity is comparable to minimizing false alarms, and you know there is always the downside, the misses. So we should also define something like speaker impurity, which is the same definition but with respect to the reference speaker. You don't always see both of these, but I think you should just compute both and see how they trade off in your final clustering. The reason is that it is trivial to reach a cluster impurity of zero, so perfect cluster purity, by just making a single cluster for every segment; that alone tells you nothing, so you need the other part.

There are also other measures, which are more probabilistic in nature. Instead of looking only at the most frequently occurring speaker in your cluster, you can look at the whole distribution, so you get some kind of entropy measure for your cluster, which you can average over all clusters, weighted by the number of segments in each cluster. And again, it is not on this slide, but next to the cluster entropy you can also define a speaker entropy, and you should again look at both of these measures.
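(The rather complicated formula itself is in the paper; what follows is a sketch of how I read the definitions above, not the author's code, assuming hard cluster assignments and a reference speaker label per segment: cluster impurity is the segment-weighted fraction outside each cluster's most frequent speaker, speaker impurity swaps the roles of cluster and speaker, and the entropies are the distribution-based variants.)

```python
from collections import Counter
from math import log2

def cluster_metrics(assignment: dict) -> tuple[float, float]:
    """assignment maps segment -> (found_cluster, reference_speaker).
    Returns (cluster_impurity, cluster_entropy), both segment-weighted:
    impurity counts the segments outside each cluster's most frequently
    occurring speaker; entropy uses the whole speaker distribution."""
    members: dict = {}
    for cluster, speaker in assignment.values():
        members.setdefault(cluster, []).append(speaker)
    n = len(assignment)
    impurity = entropy = 0.0
    for speakers in members.values():
        counts = Counter(speakers)
        impurity += len(speakers) - max(counts.values())
        entropy += -sum(c * log2(c / len(speakers)) for c in counts.values())
    return impurity / n, entropy / n

def speaker_metrics(assignment: dict) -> tuple[float, float]:
    """The other side: swap the roles of found cluster and reference
    speaker, measuring how scattered each true speaker is over clusters."""
    flipped = {seg: (spk, cl) for seg, (cl, spk) in assignment.items()}
    return cluster_metrics(flipped)
```

(On the all-singletons clustering the cluster-side numbers come out at zero while the speaker-side numbers are maximal, which is exactly why both sides are needed.)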
Then we come to the experimental section. It is a small experiment, actually carried out a while ago; it seems pretty ancient in terms of speaker recognition development, it must be a good two years ago now. At that time we had a state-of-the-art system; in fact we still have the same system, but it is not state of the art any more. Anyway, it is a GMM-SVM system, with T-norm, that performed pretty well on the 2006 evaluation set. The experiment was done while we were preparing for the 2008 evaluation, which is why I worked with that data: at the time we did not have the truth data for 2008 yet, so we used the 2006 data. I simply used all the test segments, some thirty-seven hundred of them, male and female. Now, you could say that proper speaker detection should not have cross-gender trials, because those portions of the trials tend to be non-target trials and that is kind of unfair; but here we are not really doing speaker detection, we are doing clustering, so if gender gives you some information to cluster on, maybe it is fair to use it. Moreover, our system at the time was completely gender independent; there was not a single gender condition in it. The segments are five minutes each.

So, two versions of agglomerative clustering: one online, taking one segment at a time and making decisions, and one offline, the complete batch variant. These are the results. You see speaker impurity versus cluster impurity for both types of agglomerative clustering, and you can define something like an equal impurity point. I put DET axes on the plot, for the people who cannot live without DET axes, and that actually works very well: there is no reason why these curves should come out straight, at least not one that I understand easily, but it works very well. You see that of these two approaches, one is much simpler, the online version is much simpler than the offline version, but they perform more or less the same. Another interesting thing, which you cannot see in the graph, is that the thresholds that you put in the clustering algorithm for stopping are quite different for the two algorithms, so they apparently are looking at different things.

Then the last subject, and that is the scalability of this whole process, because I really mean large numbers here; we were at some thirty-seven hundred segments, and for 2006 I did not have more, at least I did not take more. So here I am looking at what the equal impurity is as a function of the number of segments, on a log scale, and you see what some people would call graceful degradation; I think that is a fantastic phrase and I had to learn to use it. It comes out as more or less a single straight line. That may have to do with the way the segments are chosen in the NIST evaluations, because you can also express it as the number of speakers, but then on a linear axis, and you get exactly the same graph; so there seems to be a fixed relation between the number of segments and the number of speakers. To reduce the problem size I just randomly left out segments, going down from the full problem, and again you see the same kind of graceful degradation. But for this speaker recognition system, at this performance, there is some number of segments at which we will reach an equal impurity of fifty percent, if this trend continues, and we should not go beyond that. So I think if you define the problem of speaker clustering, or speaker linking, you have a problem with scalability, in terms of the number of speakers, or the number of segments, or whatever you want to look at. From that perspective I think it is an interesting problem: it gets harder with scale, and I suppose that has to do with the fact that identification gets harder with more classes.

That said, there is actually, in other fields, something called the CMC, the cumulative match characteristic, as an analysis tool, if I remember well what it is: it measures how well your target object ranks within the best n classes or segments returned. It is a way of looking at identification, where with the equal error rate you are looking at how things go with two classes, and that has been analyzed in a different literature already. And of course the really nice thing would be, once we have defined a good evaluation measure and a proper test that we understand and that scales, to look at different algorithms, because the algorithms I used here are pretty trivial, and I am sure that you can use global algorithms, which consider everything at the same time, and perform much better. And there is also a question I did not mention: I started with a score matrix, I just scored everything against everything, which was pretty demanding at the time; if I think about it now, everybody scores everything against everything, but at the time it was quite something. Can we do better than that? Can we use the alternative speech segments that we have already seen, or that we are receiving, globally, for either normalization or discriminative training? That is another question, and that is where I would like to stop.
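(To make the equal impurity point concrete, a sketch of how one might trace the trade-off the way the talk plots it, reusing the hypothetical helpers from the sketches above: sweep the stopping threshold, cluster at each setting, and take the setting where cluster and speaker impurity come closest.)

```python
def equal_impurity(scores, speakers, thresholds):
    """Sweep the stopping threshold of the agglomerative sketch above and
    return the setting where cluster impurity and speaker impurity come
    closest, an approximation of the equal impurity point.
    `speakers` maps segment index -> reference speaker label."""
    best_t, best_gap, best_ei = None, float("inf"), None
    for t in thresholds:
        clusters = agglomerative(scores, t)
        assignment = {seg: (ci, speakers[seg])
                      for ci, segs in enumerate(clusters) for seg in segs}
        c_imp, _ = cluster_metrics(assignment)
        s_imp, _ = speaker_metrics(assignment)
        gap = abs(c_imp - s_imp)
        if gap < best_gap:
            best_t, best_gap, best_ei = t, gap, (c_imp + s_imp) / 2
    return best_t, best_ei
```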
Am I running over time? I have five minutes left? I'm fine then, but I do not have any more slides, so that was kind of a nice way to finish on time.

Thanks, David. We have time for some comments or questions.

Yes, it seems quite upsetting, as you mentioned, that things seem to break down when the problems get too large. And in your conclusion you conjectured that maybe you should do some kind of global treatment. I think I have exactly the same problem, in my method for example, and also in the variational Bayes method which Patrick suggested to me to try for the coming workshop: you are effectively doing the same kind of thing, but in an unsupervised way. So we will have to look at that and see if we can cure it.

I have got two questions. One: you defined speaker linking; could you clarify how it is different from speaker clustering? Apart from the scale I don't know; it is the same problem. The way I see it is that linking is more like a task and clustering more like a way of doing it; I think they are otherwise identical. The reason we call it linking is that we are also busy with large-scale diarization, and there you have two stages: first, within your meeting or within your broadcast, segmentation and clustering kinds of things, diarization; and then you try to link the different clusters between meetings or between broadcasts. In order to separate the two we call the second stage linking rather than clustering, otherwise we would have clustering here and clustering there. In the linking stage there is maybe a little less uncertainty about the speech segments for one speaker. The second question: I am still puzzled by the online system; to me it looks like top-down clustering. Do you think the online one performs worse? It is not quite one or the other; this online clustering, I am not sure whether to call it agglomerative. Well, it is in a sense: one segment at a time, you try to fit it somewhere in your clusters. Maybe not in the formal sense of the word.

Speaking of linking, partitioning and things like that: we got even more involved in this with speaker clustering, and language clustering came into it too, and there are papers looking at it. But the point we ran into at one point is that in general, for these linking-like tasks, people try to come up with measures of cluster purity, just as for diarization, you know, these horribly complicated purity measures and all that. The thing to keep in mind is that somebody's measure is hard to relate to what you are actually doing. It is the same with the diarization error rate: we use that a lot, but the diarization error rate, cluster purity, all these other measures, I view them as diagnostics. It is not that they are not worth working on, but we are not really going to get a single measure of performance where, when we optimize it, we are done; it always opens the question: now what do I do with it? Diarization has this problem too. Look at speech recognition: they may want to do diarization or clustering on audio for adapting the speech recognizer, and what they want for adaptation is nothing like what we would call really good diarization. So let's see: I think we are going to dig into these things in the next three days and then come back to diarization, the partitioning, the linking, and
talk about it; but at some point we are going to have to be a little careful about creating these numbers and saying, oh, I got a number X better than Y, because other people are going to say: well, what does that mean, what does it help you do, and why? That is not clear to me either; at some point we should start actually linking them to something you do at the end.

So in your experiment on linking, you ran this as a batch over the test set? Yes. And I agree: this is just one application, and the focus here was more on looking at what happens when things scale up. The nice thing about speaker detection is that you do not have to worry when things scale up; you just get better estimates of how well you are doing, and in theory the performance, in terms of the cost function or whatever it should be, stays more or less the same. Here it does not stay stable, things shift, so you might say we are doing the wrong thing. On the other hand, trying to cluster speakers might still be a useful thing.

Right, but here is an example of the weighting question: okay, you get these clusters out; what does someone do with them? Do you put them all in the same boat? This opens up, in general, for partitioning and the other things that are going on: you get these things out, and in some sense, what am I doing with that stuff? I get a thousand clusters and the average impurity is so many percent; but what if someone says, I want to search for somebody in my collection? For example, we went down this trajectory: we started with clustering, and some of the things we were working on were diarization, although lately we have pulled away from doing diarization as such; instead the task is detection. We want to see how, in context, doing diarization helps you purify your data to enroll and test models; we want to see how well you do when linking these two together, and whether the diarization error rate correlates with the detection task. There is a correlation, but it seems very loose. That is one thing: I think in general, when people put up a task like the linking you are talking about, you could ask: if I drop from twenty percent to ten percent, did I get twice as good in my application? Am I measuring in centimeters when things are in miles? I just do not have a feel for it; well, I guess as the error rates go down you do better, but where is good enough, today?

A comment on this: your task is very close to a task we proposed in the past; actually there is no difference, it was exactly the same task. To explain the interest of such a task, the use case we had in mind: raw TV broadcasts. You do speaker diarization on each recording, and afterwards you want to combine the results. And this is a real need: if you work with a national media organization, they do not have the computing power to come back, each time they receive a new file, and redo the whole indexing; the indexing of each recording would come out different each time. The second constraint is that you have to implement the indexing incrementally: you cannot, each time you receive a new file, come back and redo all the computing. So in this case you have a strong difference between diarization, which you run once on each recording, and speaker linking, where you could have hundreds of thousands of hours of video.

We have room for one last comment.
Okay, I also just want to follow up on something, on why we brought up this kind of problem, and also about the evaluation measure. I did not discuss this in my presentation, but in the paper I show how you can do all the usual tasks within this framework: the partitioning problem, multi-segment training, unsupervised adaptation, and so on. So by generalizing in the right way we can learn more about all the normal tasks that we are doing; if you can solve this problem, you can solve everything else. Well, sorry, yes: provided your segmentation is given. And the last thing is the evaluation metric: you could use it for a very practical purpose, namely to numerically optimize discriminative training. That is probably what I am going to be doing the next three weeks at the workshop, and probably what I am going to use.

Okay. So, with that, let's keep going. Thanks.