Good morning everyone. My name is Raymond, and we are from The Chinese University of Hong Kong and the Institute for Infocomm Research in Singapore. First I think I have to highlight two points which characterise our work today. The first is that, unlike the previous presentations, which at least touched on speaker recognition, our work is exclusively on language recognition. The second point is that we tried a kind of unorthodox, alternative approach, focusing on a very specific language recognition scenario: we find that in the previous LRE 2009 evaluation there are some very difficult language pairs, so we focus on those scenarios, and that is why we call this work application-dependent score calibration for language recognition.

Here is the outline of today's presentation. First we will introduce the problem, and then say a little bit about detection cost. After that we will illustrate our calibration in two parts, first pairwise language recognition and then general language recognition, and finally give a summary.

The language recognition task we address is defined as follows: given a target language, the task is to detect the presence of the target in a testing trial. In practice, a language detector calculates a score indicating the presence of the target and then makes a decision; when an erroneous decision is made, there is a detection cost. A typical detection cost function, which I think most of the audience is familiar with, weighs detection misses and false alarms. In our work we interpret score calibration as the adjustment of the magnitudes of scores, which in turn affects the detector's decisions, and the objective of the calibration is to obtain a minimum detection cost.

More generally, in global calibration, or what is elsewhere called application-independent calibration, the parameters of the detection cost function are usually ignored. The result is that global calibration transforms the likelihood scores in a global manner and does not pay special attention to highly confusable trials. We do not say whether that is good or bad, but in this work we are going to do it another way. In LRE 2009 there are some pairs of related languages listed explicitly in the evaluation specification. Detection of these related languages becomes a bottleneck, because it is easy to mix them up, for example Russian and Ukrainian, or Hindi and Urdu. In the following we will focus on these language pairs, one pair at a time: for example, we call Russian the target language and Ukrainian the related language, and afterwards Ukrainian becomes the target language and Russian the related language. Altogether we have ten rounds of calibration, such that the final overall error will be reduced.

Now a very brief recap on detection cost, just so you can follow what we are going to do, because you will be looking at a lot of diagrams. Suppose we have two classes H_T and H_R, a target language and its related language, and we have the log-likelihood ratio for the target language, which we call λ_HT; it is the score from the detector for H_T. Let k be the index of the test trial. If we plot λ_HT against k, it looks like this: each point is the score of one trial. The circles stand for trials whose true class is the target H_T, and the triangles represent trials whose true class is the related class H_R. We focus on the filled circles and filled triangles. The filled triangles are false alarms, because they are above the detection threshold, and the filled circles are detection misses, because they are under the threshold. Keeping it very simple, the objective is only to reduce these misses and false alarms.
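To make the counting concrete, here is a minimal sketch of the miss and false-alarm counting just described, using made-up scores and an equal weighting of the two error rates (the actual evaluation cost function uses its own weights and priors):

```python
import numpy as np

# Hypothetical scores lambda_HT for six trials, with true labels
# (True = trial really belongs to the target language H_T).
lam_HT = np.array([2.1, -0.3, 1.5, -1.2, 0.8, -0.9])
is_target = np.array([True, True, True, False, False, False])

theta = 0.0  # detection threshold, fixed in advance

accept = lam_HT > theta
misses = np.sum(is_target & ~accept)        # filled circles below theta
false_alarms = np.sum(~is_target & accept)  # filled triangles above theta

p_miss = misses / np.sum(is_target)
p_fa = false_alarms / np.sum(~is_target)
# A typical detection cost weighs the two error rates:
c_det = 0.5 * p_miss + 0.5 * p_fa
```

With these illustrative numbers one target trial is missed and one non-target trial is falsely accepted, so both error rates are 1/3.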
Counting these filled circles and filled triangles, however, is a discrete thing, and we do not want to do that; we want to do it in a quantitative way. This can be done by minimising the erroneous deviation with respect to the detection threshold, which means we want to minimise the distances of the filled triangles and filled circles from the detection threshold. We assume that the detection threshold is fixed at the very beginning.

Now we can introduce how we do the pairwise language recognition. First we make a simple hypothesis: because these are related pairs of languages, the log-likelihood ratios of the two related languages, λ_HT and λ_HR, contain very similar and complementary information. Before, I showed you a plot with only the log-likelihood ratio λ_HT for the target; now we introduce another dimension, λ_HR, the detection score for the related hypothesis. The trend of the scores normally follows this manner, and to understand it easily we can pick any trial from the target class H_T: it is natural that it has a very high score of λ_HT, because the detector is detecting the target class, and a low score of λ_HR, because it does not belong to the related class. Similarly, a trial from the related class has a high score on λ_HR and a low score on λ_HT. This shape simply prompts the thought: how about rotating the whole score space, such that we obtain a new score space and a detection threshold like this? Mathematically, when we make the detection decision we consider not only λ_HT but also λ_HR, which means we use the detection scores from both the target-language and the related-language detectors to make the final decision for a trial claimed to be H_T.

We formulate it like this. We said we want to work quantitatively, minimising the erroneous deviation, which is the distance of the error points from the threshold, so we look at the displacement term λ − θ, the displacement of each score from the threshold. For a detection miss the score is below the threshold, so this difference is negative, and for a false alarm the difference is positive. The variable y represents the true label of the detection trial, with the sign convention chosen so that the product y(λ − θ) is positive for the two error cases, misses and false alarms, and negative for correct acceptances and correct rejections. We then use the max operation with zero to remove the correct acceptance and rejection scores, so what is finally left over is only the erroneous deviation, and we sum over the whole database, from the first trial to the last. We would like to adjust the detection log-likelihood ratio, so the adjusted log-likelihood ratio λ' should reduce this erroneous deviation.

Perhaps I should go back to the previous slide: what we do is reduce the erroneous deviation, the distance between these errors and the threshold, and the way we do that is by rotating the score space. The rotation of the score space is accomplished by this equation, because we make a linear combination of the scores from the two detectors, and the result is that the score space is rotated. Now the whole problem is formulated: we have the objective function of erroneous deviation, and we want to minimise it subject to the linear combination with parameter α, together with a little constraint to make sure that the final updated log-likelihood ratios are not out of range.
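As a rough sketch of this erroneous-deviation objective, the following parameterises the linear score combination as a rotation by an angle phi; the angle, the sign convention for y, and the grid search are our illustrative stand-ins for the α parameters and optimisation in the paper, not its exact formulation:

```python
import numpy as np

def erroneous_deviation(phi, lam_HT, lam_HR, y, theta=0.0):
    # Rotate the 2-D score space by angle phi (illustrative
    # parameterisation of the linear combination of the two scores).
    lam_new = np.cos(phi) * lam_HT - np.sin(phi) * lam_HR
    # y = -1 for target trials, +1 for related-class trials, so that
    # y * (lambda' - theta) is positive exactly for misses and false
    # alarms; max(0, .) discards correct acceptances and rejections.
    return np.sum(np.maximum(0.0, y * (lam_new - theta)))

# Hypothetical development scores (lam_HT, lam_HR) for one language pair.
lam_HT = np.array([1.8, -0.2, 2.0, 0.4, -1.5])
lam_HR = np.array([-1.0, -1.1, -0.5, 1.2, 0.9])
y = np.array([-1, -1, -1, 1, 1])  # first three trials are target trials

# Grid-search the rotation angle on the development data.
phis = np.linspace(0.0, np.pi / 2, 91)
best_phi = min(phis, key=lambda p: erroneous_deviation(p, lam_HT, lam_HR, y))
```

With these made-up scores the unrotated space (phi = 0) has one miss and one false alarm contributing deviation, and a suitable rotation drives the erroneous deviation to zero, which is exactly the behaviour the rotation is meant to exploit.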
After we have done this optimisation, rotating the score space on the development set, we apply the learned α parameters to the evaluation dataset, and then we go back to the normal error metric, which is the detection cost; because this time we illustrate the pairwise language recognition process, we have one miss term and one false-alarm term in the error metric.

This is the key diagram of our system. What we use is a phonotactic and prosodic fusion system. I have to admit that we only use one subsystem, which alone is not a top-performing system for the evaluation task, but what we want to test is the effectiveness of this score calibration in this particular scenario. How do we get the scores from the different detectors? We have located the ten difficult target languages; for each target language we take the log-likelihood ratio of the target and the log-likelihood ratio of the related class, and then we do the parameter optimisation, which means we rotate the score space so that we obtain λ', the updated log-likelihood ratio. The training data we use are the LRE 1996 to 2007 corpora, and the evaluation data is the LRE 2009 evaluation set. To give you a brief idea of the amount of data we had for the general task, which you will see in a later slide, the number of trials is about ten thousand for twenty-three languages. To train the α parameters for rotating the score space we use a development set, which comes from the LRE 2007 evaluation set and excerpts from the LRE 2009 development set; there is a total of six thousand trials, and all test utterances are thirty seconds.

This is the result of the pairwise language recognition. The original EER given here is about twenty percent for these difficult language pairs, and after we apply the score calibration the error is about nineteen percent, which is about a five percent relative EER reduction. We can see that the Bosnian-Croatian confusion cannot be reduced by this method, which I guess is because the two languages are mixed up very seriously in our scores. The related-language-pair confusion reduction is more significant for the worst-performing pairs: for example, if we compare Dari and Farsi, the error reduction there is more significant with the help of this score calibration process.

The improvement from pairwise language recognition is not very significant, but we want to extend this method to general language recognition, and we will see a more significant error reduction there. We just used the pairwise average cost function for the pairwise language recognition, again with one miss term and one false-alarm term, but if we move to the general task the cost function becomes more complicated, because there are more target languages: for the detection of each language there is one miss term and twenty-two false-alarm terms to pool into the average cost. As you see, I have highlighted one part in red, because previously we had been operating on data from two languages only, so there were only circles and triangles, but when we expand to the general task there is also out-of-set data, the data that does not reside in these two languages; these are the data marked as red circles here. Again I show you the general trend of the data in the detection scores of the two classifiers: the log-likelihood ratios for H_T and H_R show a very similar trend because the two languages are very similar, so a trial with a high score on λ_HT also tends to have a high score on λ_HR.
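The general average cost just described, with one miss term per target and the false-alarm terms pooled over the remaining languages, might be sketched like this (three languages instead of twenty-three for brevity, so two false-alarm terms per target; the 0.5/0.5 weighting is an illustrative assumption, not the official cost parameters):

```python
import numpy as np

# Hypothetical detector scores per language over four trials.
langs = ["ru", "uk", "hi"]
scores = {
    "ru": np.array([1.2, 0.5, -0.8, 0.3]),
    "uk": np.array([-0.2, 0.9, -1.0, -0.6]),
    "hi": np.array([-1.1, -0.7, 1.4, -0.9]),
}
true_lang = np.array(["ru", "uk", "hi", "ru"])
theta = 0.0  # single detection threshold for all languages

def c_avg(scores, true_lang, theta):
    per_target = []
    for t in langs:
        acc = scores[t] > theta                     # detector t accepts these
        p_miss = np.mean(~acc[true_lang == t])      # one miss term
        p_fa = [np.mean(acc[true_lang == o])        # one FA term per
                for o in langs if o != t]           # non-target language
        per_target.append(0.5 * p_miss + 0.5 * np.mean(p_fa))
    return float(np.mean(per_target))
```

In this toy data the only error is the "ru" detector falsely accepting the one "uk" trial, and you can see how a single false alarm is diluted by the averaging over non-target languages, which is why the miss term dominates the general cost.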
Actually there are some modifications we have to make when we proceed from the two-language case to the general twenty-three-language case. The first is that, as said, we have a lot of out-of-set data, and we do not want to touch that data, because we are afraid this may affect the detection of the other language classes. The second is that, as mentioned, in the general cost function there are twenty-two false-alarm terms, so the false alarms for each language pair become minor, and we have to put more stress on reducing detection misses rather than on reducing detection false alarms in order to obtain a low average detection cost.

So these are the three rules we apply when we proceed from pairwise language recognition to the general task. Rule one: we only select detection trials which are likely to belong to the two related languages H_T and H_R. Of course we do not know in advance which language a trial belongs to, so we apply a heuristic method, which is not detailed in the paper, to choose only these trials to operate on. Rule two: we weight the cost of a detection miss twenty-two times heavier. As you saw in an earlier slide, we formulated the erroneous-deviation objective with a miss term and a false-alarm term, and we now weight the miss term twenty-two times more than the false-alarm term. Rule three: we shift the reference point for the calculation of the erroneous deviation. The point of doing this can be explained as follows. We have said that detection misses are more important, so we have to put more focus on them in the calibration. Go back to the original detection threshold figure. If you still remember, we had the filled circles here contributing erroneous deviation, but now look also at these trials just above the threshold: they fall into the region of correct acceptance, so they would not be handled in any way if we did nothing. If detection misses are so important, why don't we also try to look at these borderline points, by moving the reference threshold used in the calibration to be higher than the actual detection threshold? So we shift the reference point upward, and then we try the offset which gives us the lowest general language recognition error.

This is the revised objective function. Basically it is exactly the same problem as on the previous slide, which you saw for the calibration with two languages, but now with the three modifications shown in red. After we have done the calibration with the development set, we go back to the evaluation set and use the conventional average cost function and EER to evaluate.

This page is maybe a little bit intimidating, so allow me some time to explain. There are four diagrams here. We use the development set to tune the α parameters for the score-space rotation. This is the scatter of λ_HT against λ_HR before rotation, and this is after rotation. As you can see, we only choose a subset: the black dots are the target-class scores and the white dots are the related-class scores, and only the black and white dots are operated on; they are rotated a little bit. This is the result for the evaluation set, which of course looks more messy, and there is also some rotation here. What we want from the rotation is that more target-class scores stay in the upper end of the y-axis, so that there will be fewer detection misses. In the development set the effect is not very clear, because the target-class black dots are already high on the y-axis, but in the evaluation set we can see that the dots scattered down the curve here, marked in red and green, have moved up after rotation of the score space. This is the overall result of the equal error rate after applying the score-space rotation.
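Under the same illustrative rotation parameterisation as before, the three rules might be folded into the objective roughly as follows; the boolean selection mask, the uniform upward shift of the reference point by ζ, and the per-trial weighting are all our guesses at one plausible reading, not the paper's exact equation:

```python
import numpy as np

def general_erroneous_deviation(phi, lam_HT, lam_HR, y, selected,
                                theta=0.0, zeta=3.5, miss_weight=22.0):
    """Erroneous deviation with the three rules for the general task."""
    # Rule 1: operate only on trials the heuristic marks as likely to
    # belong to the pair; out-of-set trials are left untouched.
    lam_new = np.cos(phi) * lam_HT[selected] - np.sin(phi) * lam_HR[selected]
    ys = y[selected]
    # Rule 3: reference point shifted up by zeta, so borderline target
    # trials just above the real threshold also contribute deviation.
    dev = np.maximum(0.0, ys * (lam_new - (theta + zeta)))
    # Rule 2: detection misses (target trials, ys == -1) weighted
    # twenty-two times heavier than false alarms.
    w = np.where(ys == -1, miss_weight, 1.0)
    return float(np.sum(w * dev))

# Hypothetical scores; the fourth trial is treated as out-of-set.
lam_HT = np.array([2.0, -1.0, 0.5, 3.0])
lam_HR = np.array([-1.0, 1.5, 0.2, -2.0])
y = np.array([-1, 1, -1, 1])
selected = np.array([True, True, True, False])
d = general_erroneous_deviation(0.0, lam_HT, lam_HR, y, selected)
```

With the shifted reference, target trials sitting between θ and θ + ζ now contribute deviation even though they would be accepted at the real threshold, which is how the borderline points get pulled upward during calibration.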
Before rotation we had about a 4.45 percent equal error rate, using a single detection threshold for the detection of all languages, and after this calibration the error is reduced to about 3.3 percent, which is about a twenty-five percent relative reduction in EER. As introduced before, there is a parameter ζ which accounts for the shifting of the reference threshold, and as ζ becomes larger and larger we become more and more attentive to these borderline points, the possible miss points. We tried different settings of ζ, and at ζ equal to 3.5 we got the lowest equal error rate.

Here comes the summary of today's talk. In language recognition, language-pair detection is difficult for the five pairs of related languages. A linear combination of detection scores between the target language and the related language brings about a 5.8 percent relative EER reduction, where we optimise the α parameters for the score-space rotation. The application-dependent calibration can also be applied to the general task of detection, and there it brings about a twenty-five percent relative reduction of EER.

For future work, we have been thinking of some unsupervised methods to find these related target pairs, because in this work we started with the given pairs of related targets, with no derivation needed, since they are already included in the evaluation specification. We have also thought about application to other detection tasks, but we understand this work is very specific to this particular language recognition task, and we think special care has to be taken if we migrate it to other detection tasks. This is the end of today's presentation; thank you very much. Before the questions, my co-authors and I would like to thank the organising committee for the chance to present this work. Any questions for Raymond?

[Audience:] I have been taking part in these evaluations, including the language evaluation, which seems to relate to what you are doing. It sounds like what you are saying is this: in the training you can do anything you want, and you can set your threshold anywhere you want, but within the testing, suppose I am testing a sample that looks very much like Russian, and my task is to detect Russian, but I happen to know from the Ukrainian model that it looks even more like Ukrainian. Are you allowed to use that?

[Raymond:] Yes, for a test sample we can look at all the languages and see which is closest. So in the detection test we have a small pool of scores coming from the different languages, and we compare them to decide which one, say Russian, the sample possibly is.

[Audience:] You still use a linear combination. All languages are related to some extent, so why not use a linear combination of all the languages, not just the one related language?

[Raymond:] We have actually tried that, and the interesting thing is that the combination only works for these kinds of related languages. I think a very simple explanation of why this works is that the more similar the two languages are, the more complementary the scores from the two detectors become. Say Russian and Ukrainian are very similar; if I use the score combination of these two languages, then I have more confidence to reject languages which are not Russian, and this is the main source of the performance improvement: we get a significant reduction of false alarms from the other languages, but not from the related language.

[Moderator:] Any more questions? If not, further discussion can continue during lunch. Let's thank the speaker.