0:00:07 Good morning everyone. My name is Raymond, and we are from the Chinese University of Hong Kong and the Institute for Infocomm Research in Singapore. First, I should highlight two points which characterise our work today. The first is that, unlike the previous presentations, which at least touched on speaker recognition, our work today is exclusively on language recognition. The second point is that we tried a somewhat unconventional alternative approach, focusing on a very specific language recognition scenario. We found that in the previous evaluation, LRE 2009, there were some very difficult languages, so we focus just on these scenarios. That is why we titled this work application-dependent score calibration for language recognition.
0:00:56 Here is the outline of today's presentation. First I will introduce the problem, and then we will say a little about the detection cost. After that we will illustrate our calibration in two parts: first pairwise language recognition, and then general language recognition. Finally comes the summary.
0:01:15 The language recognition task we address is defined as follows: given a target language, the task is to detect the presence of the target in a test trial. In practice, a language detector calculates a score indicating the presence of the target and then makes a decision. When an erroneous decision is made, there is a detection cost. A typical detection cost function, which I think most of you are familiar with, weights detection misses and false alarms. In our work, we interpret score calibration as the adjustment of the magnitudes of the scores, which in turn affects the detector's decisions, and the objective of the calibration is to achieve a minimum detection cost.
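The detection cost described here can be sketched in a few lines. This is a minimal illustration rather than the evaluation code used in the talk; the cost weights `c_miss`, `c_fa` and the prior `p_target` are assumed to take the usual NIST-style values.

```python
def detection_cost(scores, labels, threshold,
                   c_miss=1.0, c_fa=1.0, p_target=0.5):
    """Weighted combination of miss rate and false-alarm rate.

    labels: True for target-language trials, False otherwise.
    A miss is a target trial scoring below the threshold; a false
    alarm is a non-target trial scoring at or above it.
    """
    target = [s for s, l in zip(scores, labels) if l]
    nontarget = [s for s, l in zip(scores, labels) if not l]
    p_miss = sum(s < threshold for s in target) / len(target)
    p_fa = sum(s >= threshold for s in nontarget) / len(nontarget)
    return c_miss * p_target * p_miss + c_fa * (1 - p_target) * p_fa
```

Calibration, as described above, adjusts the score magnitudes so that this quantity is minimised for a fixed threshold.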
0:01:59 More generally, in global calibration, or what others have called application-independent calibration, the parameters of the detection cost function are usually ignored. The result is that global calibration transforms the likelihood scores in a global manner and does not pay special attention to highly confusable trials. We do not say whether that is good or bad, but in this work we are going to go the other way.
0:02:31 In LRE 2009 there are some pairs of related languages, listed explicitly in the evaluation specification. Detection of these related languages becomes a bottleneck because they are typically easy to mix up, for example Russian and Ukrainian, or Hindi and Urdu. In the following we will focus on these language pairs, always one at a time. For example, we take Russian as the target language, with Ukrainian as its related language; afterwards Ukrainian becomes the target language and the related language becomes Russian. In this way we have ten rounds of calibration, such that the final overall error will be reduced.
0:03:14 Now a very brief recap of the detection cost. You may have seen many diagrams like this, but it will help you follow what we are going to do. Suppose we have two classes, H_T and H_R: a target language and a related language. We have the log-likelihood ratio for the target language, which we call lambda_{H_T}; it is the score from the detector for H_T. Let k be the index of the test trial. If we plot lambda_{H_T} against k, it looks like this: each point is the score of one trial, and there are circles and triangles. Circles stand for the trials whose true class is the target class H_T, and triangles represent the trials whose true class is the related class H_R. We focus on the filled circles and triangles. It is easy to see that the filled triangles are false alarms, because they are above the threshold, and the filled circles are detection misses, because they are under the threshold.
0:04:35 Again we keep it very simple: the objective is only to reduce the misses and false alarms. Reducing them directly would mean reducing the counts of these filled circles and filled triangles, which is a discrete thing; we do not want to do that, we want to do it in a quantitative way. All of this can be done by minimising the erroneous deviation with respect to the detection threshold, which means we want to minimise the weighted distance of these filled triangles and filled circles from the detection threshold. We assume that this detection threshold is already fixed at the very beginning.
0:05:17 Now we can introduce how we do the pairwise language recognition. First we make a simple hypothesis. As we said, there are pairs of related languages, so the log-likelihood ratios of the two related languages, lambda_{H_T} and lambda_{H_R}, contain very similar and complementary information. Before, you saw a plot showing only the log-likelihood ratio for the target class H_T. Now we introduce another dimension, lambda_{H_R}, the detection score for the related hypothesis, and the trend of the scores normally follows this manner. To understand it easily, just pick any trial from the target class H_T: it is natural that it has a very high score of lambda_{H_T}, because the detector is detecting the target class, and a low score of lambda_{H_R}, because it does not belong to the related class H_R. Similarly, a trial from the related class has a high score in lambda_{H_R} and a low score in lambda_{H_T}.
0:06:34 This shape naturally prompts us to think: how about rotating the whole score space, such that we obtain a new score space and a detection threshold like this? Mathematically, it means that when we make the detection decision we consider not only lambda_{H_T} but also lambda_{H_R}; that is, we use the detection scores from the detectors for both the target language and the related language to help the final decision on whether a trial belongs to the target class H_T.
0:07:16 Mathematically we formulate it like this. As we said, we want to do this in a quantitative way, minimising the total erroneous deviation, which is the distance of the error points from the threshold. Let us take this equation step by step. First look at lambda minus theta: this is the displacement of each score from the threshold. For a detection miss, the score is below the threshold, so this difference is negative, and for a false alarm the difference is positive. And y represents the true label of the detection trial: if it belongs to the target class the label is +1, and if it does not belong to the target class it is -1. By combining y with this displacement term, the two cases of error always yield a positive value, while correct acceptances and correct rejections always yield a negative value. Then we use the max operation to remove the correct-acceptance and correct-rejection scores, so what is finally left over is only the erroneous deviation, and we sum it over the whole database, trial by trial, in the last row. We would like to adjust the detection log-likelihood ratios, so it is the adjusted log-likelihood ratio, lambda-dash, that should produce this total erroneous deviation.
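The objective just described can be sketched as a hinge-style sum. This is an illustrative reconstruction under one possible sign convention (y = +1 for target trials, -1 otherwise), not the authors' exact formula from the paper.

```python
def erroneous_deviation(scores, labels, theta):
    """Total erroneous deviation: sum over trials of the distance of
    misses and false alarms from the (fixed) detection threshold theta.

    labels: +1 for target-class trials, -1 otherwise.
    """
    total = 0.0
    for lam, y in zip(scores, labels):
        # -y * (lam - theta) is positive exactly for the two error cases
        # (miss: y=+1 and lam < theta; false alarm: y=-1 and lam > theta);
        # the max() discards correct acceptances and rejections.
        total += max(0.0, -y * (lam - theta))
    return total
```

Unlike raw error counts, this quantity varies smoothly with the scores, which is what makes it usable as an optimisation objective.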
0:08:56 Perhaps I should go back to the last line. What we do is reduce the erroneous deviation, the distance of these errors from the threshold, and the way we do that is by rotating the score space. The rotation of the score space is accomplished by this equation: we form a linear combination of the scores from the two detectors, and the result is that the score space is rotated.
0:09:28 Here the whole problem is formulated: we have the objective function of erroneous deviation, and we want to minimise it, subject to the linear-combination vector alpha. We also have a small constraint just to make sure that the final calibrated log-likelihood ratios are not out of range. After we have done this optimisation, rotating the score space on the development set, we apply the resulting alpha parameters to the evaluation data set and go back to the normal error metric, which is the detection cost. Because this time we are illustrating the pairwise language recognition process, we have one miss term and one false-alarm term in the cost.
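As a concrete sketch, the rotation can be parameterised by a single angle and tuned by grid search over development scores. This is a simplified stand-in for the constrained optimisation described in the talk (the range constraint on the calibrated ratios is omitted, and the angle grid is an assumption for illustration).

```python
import math

def calibrate_pair(s_t, s_r, labels, theta=0.0, n_angles=181):
    """Pick the rotation angle whose linear combination of the two
    detector scores, lam' = cos(a)*lam_HT + sin(a)*lam_HR, minimises
    the total erroneous deviation for a fixed threshold theta.

    s_t, s_r: scores from the target and related detectors.
    labels:   +1 for target-class trials, -1 otherwise.
    """
    def deviation(angle):
        ca, sa = math.cos(angle), math.sin(angle)
        return sum(max(0.0, -y * (ca * a + sa * b - theta))
                   for a, b, y in zip(s_t, s_r, labels))

    # Sweep angles in [-pi/2, pi/2] and keep the best one.
    angles = [math.pi * (k / (n_angles - 1) - 0.5) for k in range(n_angles)]
    return min(angles, key=deviation)
```

The angle found on the development set defines the alpha vector that is then applied, unchanged, to the evaluation scores.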
0:10:21 This is the block diagram of our system. What we use is a phonotactic and prosodic fusion system; admittedly, we only use one subsystem. It performed competitively in the evaluation, so it is not a bad system to start with, but what we want to examine is the effectiveness of the score calibration in this particular scenario. How do we get the scores from the different detectors? We have selected ten difficult target languages. For each target language we take the log-likelihood ratio of the target and the log-likelihood ratio of the related class, and then we do the parameter optimisation, which means we rotate the score space such that we obtain the updated log-likelihood ratio, lambda-dash. The training data we use are the NIST LRE 1996 to 2007 corpora, and the evaluation data is the LRE 2009 evaluation set. To give you a brief idea of the amount of data we have for the general task, which you will see in a later slide, the number of trials is about ten thousand for twenty-three languages. To train the alpha parameters for rotating the score space we use a development set, which comes from the LRE 2007 evaluation set and excerpts from the LRE 2009 development set, with a total of about six thousand trials. The test utterance duration is thirty seconds.
0:12:03 This is the result of the pairwise language recognition. The original EER given here is about twenty percent for these difficult languages, and after we apply the score calibration the error is about nineteen percent, which is about a five percent relative EER reduction. We can see that the Bosnian-Croatian confusion cannot be reduced by this method, which I guess is because the two languages mix up very seriously in our scores. Within a related language pair, the confusion reduction is more significant for the worse-performing language: for example, if we compare Dari and Farsi, the error reduction in Dari is more significant, with the help of the scores from its counterpart, Farsi.
0:13:07 The improvement from pairwise language recognition is not very significant, but we want to extend this method to the general language recognition task, and there we will see a more significant error reduction.
0:13:21 Let us first revisit the average cost function. For pairwise language recognition, again, we have one miss term and one false-alarm term. But if we move to the general task, the cost function becomes more complicated, because there are more target languages: for the detection of each language there is one miss term and twenty-two false-alarm terms to be pooled into the average cost. As you see, I have highlighted that part in red.
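The general cost just described, with one miss term and N-1 false-alarm terms per target language (twenty-two when N = 23), can be sketched as follows. The NIST-style weights are assumed for illustration.

```python
def average_cost(p_miss, p_fa, c_miss=1.0, c_fa=1.0, p_target=0.5):
    """NIST LRE-style average detection cost (illustrative sketch).

    p_miss: dict mapping language -> miss rate for that target.
    p_fa:   dict mapping (target, nontarget) pairs -> false-alarm rate.
    For each target there is one miss term and N-1 false-alarm terms,
    and the per-target costs are averaged over the N targets.
    """
    langs = list(p_miss)
    n = len(langs)
    total = 0.0
    for lt in langs:
        fa = sum(p_fa[(lt, ln)] for ln in langs if ln != lt) / (n - 1)
        total += c_miss * p_target * p_miss[lt] + c_fa * (1 - p_target) * fa
    return total / n
```

Because each pairwise false-alarm rate is divided by N-1, a single miss outweighs any single false alarm, which motivates the miss weighting discussed below.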
0:13:59 Previously we were only looking at data for two languages, so there were only circles and triangles. But when we expand to the general task there is more: you get the out-of-set data, which is the data that does not reside in these two related languages. This out-of-set data is marked with red circles here. Again I show you the general trend of the data in the detection scores of the two classifiers. The log-likelihood ratios for H_T and H_R give very similar trends, because the two languages are very similar, so a trial that has a high score in lambda_{H_T} also gives a high score in lambda_{H_R}.
0:14:52 There are some modifications we have to make when we proceed from the two-language case to the general twenty-three-language case. The first is that, as said, we have a lot of out-of-set data, and we do not want to touch this out-of-set data, because we are afraid that doing so may affect the detection of the other language classes. The second is that, as mentioned, in the general cost function there are twenty-two false-alarm terms, so the false alarm for each language pair becomes less dominant, and we have to put more stress on reducing the detection misses, rather than on reducing the false alarms, in order to have a low average detection cost.
0:15:43 These are the three rules we applied when we proceeded from pairwise language recognition to the general task. The first rule is that we only select detection trials which are likely to belong to the two related languages H_T and H_R. Of course we do not know in advance which language they belong to, so we apply a heuristic method, which is not included in the paper, to choose only these trials to operate on. The second rule is that we weight the cost of a detection miss twenty-two times heavier. As you saw in an earlier slide, the erroneous deviation objective function has a miss term and a false-alarm term, and now we put a weight twenty-two times larger on the miss term. The third rule is that we shift the reference point for the calculation of the total erroneous deviation. The point of doing this can be explained together with rule two. We have said that detection misses are more important, so we have to put more focus on detection misses in the calibration. Go back to the original detection threshold picture here: if you still remember, we had the filled circles, the misses, contributing to the erroneous deviation, and then we have all of these borderline circles just above the threshold. These trials are supposed to fall into the region of correct acceptance, and they would not be handled in any way if we did nothing. So, if detection misses are so important, why don't we also try to pull up these borderline points, by moving the reference threshold for the miss term higher? We let this shift fluctuate and look for the best value, the one which gives us the lowest general language recognition error.
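Putting the three rules together, the revised per-trial loss might look like the sketch below. The exact form is in the paper; this version, with an in-pair trial mask, a 22x miss weight, and a zeta-shifted reference point for the miss term, is an illustrative assumption.

```python
def weighted_deviation(scores, labels, theta, in_pair,
                       miss_weight=22.0, zeta=0.0):
    """Revised erroneous deviation (assumed form, per the three rules):
    rule 1: only trials flagged as belonging to the language pair count;
    rule 2: misses are weighted miss_weight (= 22) times heavier,
            matching the 22 false-alarm terms per miss term in the cost;
    rule 3: the reference point for the miss term is shifted up by zeta,
            so borderline target trials just above theta are also pulled up.
    """
    total = 0.0
    for lam, y, keep in zip(scores, labels, in_pair):
        if not keep:
            continue  # rule 1: leave out-of-set trials untouched
        if y > 0:     # target trial: shifted, heavily weighted miss term
            total += miss_weight * max(0.0, (theta + zeta) - lam)
        else:         # non-target trial: ordinary false-alarm term
            total += max(0.0, lam - theta)
    return total
```

With zeta = 0 and miss_weight = 1 this reduces to the pairwise objective; the decision threshold itself stays fixed, only the optimisation's reference point moves.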
0:17:32 This is the revised objective function. It is basically the same problem as in the previous slides for the calibration with two languages, but now with the three modifications shown in red. After we have done the calibration with the development set, we go back to the evaluation data set and use the conventional average cost function to evaluate the EER.
0:18:02 This page is maybe a little bit intimidating, so allow me some time to explain. There are four diagrams here. We use the development set to tune the alpha parameters for the score-space rotation. This is the scatter of lambda_{H_T} against lambda_{H_R} before rotation, and this is after the rotation. As you can see, we only choose a subset: the black dots are the log-likelihood ratios for the target class, and the blue dots are the scores for the related class. Only the black and blue dots are operated on, and they are rotated a little. This is the result for the evaluation set; of course it looks more messy, but there is also some rotation here. What we want from the rotation is that more target-class scores stay in the upper end of the y-axis, so that there will be fewer detection misses. In the development set it is not very clear, because the black dots of the target class are already high on the y-axis. But in the evaluation set we can see that the black dots that were scattered down the curve, mixed up with the red and green dots, have moved up after the rotation of the score space.
0:19:37 This is the overall result, the equal error rate after applying the score-space rotation. Before, we had a 4.45% equal error rate, using a single detection threshold for the detection of all languages. After the rotation, the error is reduced to about 3.3%, which is about a twenty-five percent relative reduction of the EER.
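For reference, the equal error rates quoted here can be computed with a simple threshold sweep. The following is a minimal sketch (a real implementation would interpolate along the DET curve) rather than the evaluation tooling used for these numbers.

```python
def equal_error_rate(scores, labels):
    """EER sketch: sweep a threshold over every distinct score and
    return the operating point where miss rate and false-alarm rate
    are closest to each other.

    labels: True for target-language trials, False otherwise.
    """
    tgt = [s for s, l in zip(scores, labels) if l]
    non = [s for s, l in zip(scores, labels) if not l]
    best_gap, eer = float("inf"), 1.0
    for t in sorted(set(scores)):
        p_miss = sum(s < t for s in tgt) / len(tgt)
        p_fa = sum(s >= t for s in non) / len(non)
        gap = abs(p_miss - p_fa)
        if gap < best_gap:
            best_gap, eer = gap, (p_miss + p_fa) / 2
    return eer
```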
0:20:07 We also introduced, as shown before, a parameter zeta which accounts for the shifting of the detection threshold used as the reference point. As zeta grows larger and larger, we pay more and more attention to these borderline points near the threshold. We tried different settings of zeta, and with zeta equal to 3.5 we got the lowest equal error rate.
0:20:41 Here comes the summary of today's talk. In language recognition, we carried out language-pair detection for the five pairs of related languages. A linear combination of the detection scores between the target language and the related language brings about a 5.8 percent relative EER reduction. We then revised the parameters of the optimisation for the score-space rotation, and the application-dependent calibration can be applied to the general detection task, bringing about a twenty-five percent relative reduction of the EER.
0:21:13 For future work, we have been thinking of some unsupervised methods to find these related targets, because in this work we started with the given pairs of related targets, with no derivation, as they were already included in the specification. We have also thought about applications to other detection tasks, but we understand that this work is very specific to this particular language recognition task, and we think special care has to be taken if we migrate it to other detection tasks. This is the end of today's presentation. Thank you very much.
0:21:57 (Session chair) Before the questions, there is an announcement from the organising committee. So, any questions for Raymond?
0:22:19 (Audience) I have been taking part in these evaluations, and there is a question about the evaluation protocol which seems to relate to what you are doing. In the training you can do anything you want, training and calibration, and you can set your threshold anywhere you want. But it sounds like what you are saying is that, within the testing, when you are testing a sample and it looks very much like Russian, and my task is to detect Russian, I happen to know from the details of the Ukrainian model that it looks more like Ukrainian. Are you allowed to do that? I had the impression that this was not allowed for some reason. So you can look at all the languages and see which is closest?

0:23:16 (Raymond) Actually, in the testing we just have a small set of held-out data from the different languages, and then we compare, to choose whether the trial is possibly Russian.

(Audience) Okay.
0:23:36 (Audience) You use a linear combination, but all languages are related to some degree, so why not use a linear combination of all of them, or at least a small subset?

0:23:47 (Raymond) We have actually tried that, and the intriguing thing is that it only works for these related language pairs. I think a very simple explanation of why this works is that if the two languages are more and more similar, then the scores from the two detectors have more complementary effects. Say Russian and Ukrainian: they are very similar, which means that if I use the score combination of these two languages, then I can apply the combined confidence to reject languages which are neither Russian nor Ukrainian. And this is the main source of the performance improvement we get: by doing this we get a significant reduction of false alarms against the other languages, but not against the related language.
0:24:52 (Session chair) If there are no more questions, the discussion can continue during lunch. Let's thank the speaker.