0:00:07 Good morning everyone. My name is Raymond, and we are from the Chinese University of Hong Kong and the Institute for Infocomm Research in Singapore. First, I should highlight two points which characterise our work today. The first is that, unlike the previous presentations, which at least touched on speaker recognition, our work today is exclusively on language recognition. The second point is that we tried a somewhat unconventional alternative approach, focusing on a very specific language recognition scenario. We found that in the previous evaluation, LRE 2009, there were some very difficult languages, so we focus just on these scenarios. That is why we titled this work application-dependent score calibration for language recognition.
0:00:56 Here is the outline of today's presentation. First I will introduce the problem, and then we will say a little about the detection cost. After that we will illustrate our calibration in two parts: first pairwise language recognition, and then general language recognition. Finally comes the summary.
0:01:15 The language recognition task we address is defined as follows: given a target language, the task is to detect the presence of the target in a test trial. In practice, a language detector calculates a score indicating the presence of the target and then makes a decision. When an erroneous decision is made, there is a detection cost. A typical detection cost function, which I think most of you are familiar with, weights detection misses and false alarms. In our work, we interpret score calibration as the adjustment of the magnitudes of the scores, which in turn affects the detector's decisions, and the objective of the calibration is to achieve a minimum detection cost.
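The detection cost described here can be sketched in a few lines. This is a minimal illustration rather than the evaluation code used in the talk; the cost weights `c_miss`, `c_fa` and the prior `p_target` are assumed to take the usual NIST-style values.

```python
def detection_cost(scores, labels, threshold,
                   c_miss=1.0, c_fa=1.0, p_target=0.5):
    """Weighted combination of miss rate and false-alarm rate.

    labels: True for target-language trials, False otherwise.
    A miss is a target trial scoring below the threshold; a false
    alarm is a non-target trial scoring at or above it.
    """
    target = [s for s, l in zip(scores, labels) if l]
    nontarget = [s for s, l in zip(scores, labels) if not l]
    p_miss = sum(s < threshold for s in target) / len(target)
    p_fa = sum(s >= threshold for s in nontarget) / len(nontarget)
    return c_miss * p_target * p_miss + c_fa * (1 - p_target) * p_fa
```

Calibration, as described above, adjusts the score magnitudes so that this quantity is minimised for a fixed threshold.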
0:01:59 More generally, in global calibration, or what others have called application-independent calibration, the parameters of the detection cost function are usually ignored. The result is that global calibration transforms the likelihood scores in a global manner and does not pay special attention to highly confusable trials. We do not say whether that is good or bad, but in this work we are going to go the other way.
0:02:31 In LRE 2009 there are some pairs of related languages, listed explicitly in the evaluation specification. Detection of these related languages becomes a bottleneck because they are typically easy to mix up, for example Russian and Ukrainian, or Hindi and Urdu. In the following we will focus on these language pairs, always one at a time. For example, we take Russian as the target language, with Ukrainian as its related language; afterwards Ukrainian becomes the target language and the related language becomes Russian. In this way we have ten rounds of calibration, such that the final overall error will be reduced.
0:03:14 Now a very brief recap of the detection cost. You may have seen many diagrams like this, but it will help you follow what we are going to do. Suppose we have two classes, H_T and H_R: a target language and a related language. We have the log-likelihood ratio for the target language, which we call lambda_{H_T}; it is the score from the detector for H_T. Let k be the index of the test trial. If we plot lambda_{H_T} against k, it looks like this: each point is the score of one trial, and there are circles and triangles. Circles stand for the trials whose true class is the target class H_T, and triangles represent the trials whose true class is the related class H_R. We focus on the filled circles and triangles. It is easy to see that the filled triangles are false alarms, because they are above the threshold, and the filled circles are detection misses, because they are under the threshold.
0:04:35 Again we keep it very simple: the objective is only to reduce the misses and false alarms. Reducing them directly would mean reducing the counts of these filled circles and filled triangles, which is a discrete thing; we do not want to do that, we want to do it in a quantitative way. All of this can be done by minimising the erroneous deviation with respect to the detection threshold, which means we want to minimise the weighted distance of these filled triangles and filled circles from the detection threshold. We assume that this detection threshold is already fixed at the very beginning.
0:05:17 Now we can introduce how we do the pairwise language recognition. First we make a simple hypothesis. As we said, there are pairs of related languages, so the log-likelihood ratios of the two related languages, lambda_{H_T} and lambda_{H_R}, contain very similar and complementary information. Before, you saw a plot showing only the log-likelihood ratio for the target class H_T. Now we introduce another dimension, lambda_{H_R}, the detection score for the related hypothesis, and the trend of the scores normally follows this manner. To understand it easily, just pick any trial from the target class H_T: it is natural that it has a very high score of lambda_{H_T}, because the detector is detecting the target class, and a low score of lambda_{H_R}, because it does not belong to the related class H_R. Similarly, a trial from the related class has a high score in lambda_{H_R} and a low score in lambda_{H_T}.
0:06:34 This shape naturally prompts us to think: how about rotating the whole score space, such that we obtain a new score space and a detection threshold like this? Mathematically, it means that when we make the detection decision we consider not only lambda_{H_T} but also lambda_{H_R}; that is, we use the detection scores from the detectors for both the target language and the related language to help the final decision on whether a trial belongs to the target class H_T.
0:07:16 Mathematically we formulate it like this. As we said, we want to do this in a quantitative way, minimising the total erroneous deviation, which is the distance of the error points from the threshold. Let us take this equation step by step. First look at lambda minus theta: this is the displacement of each score from the threshold. For a detection miss, the score is below the threshold, so this difference is negative, and for a false alarm the difference is positive. And y represents the true label of the detection trial: if it belongs to the target class the label is +1, and if it does not belong to the target class it is -1. By combining y with this displacement term, the two cases of error always yield a positive value, while correct acceptances and correct rejections always yield a negative value. Then we use the max operation to remove the correct-acceptance and correct-rejection scores, so what is finally left over is only the erroneous deviation, and we sum it over the whole database, trial by trial, in the last row. We would like to adjust the detection log-likelihood ratios, so it is the adjusted log-likelihood ratio, lambda-dash, that should produce this total erroneous deviation.
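The objective just described can be sketched as a hinge-style sum. This is an illustrative reconstruction under one possible sign convention (y = +1 for target trials, -1 otherwise), not the authors' exact formula from the paper.

```python
def erroneous_deviation(scores, labels, theta):
    """Total erroneous deviation: sum over trials of the distance of
    misses and false alarms from the (fixed) detection threshold theta.

    labels: +1 for target-class trials, -1 otherwise.
    """
    total = 0.0
    for lam, y in zip(scores, labels):
        # -y * (lam - theta) is positive exactly for the two error cases
        # (miss: y=+1 and lam < theta; false alarm: y=-1 and lam > theta);
        # the max() discards correct acceptances and rejections.
        total += max(0.0, -y * (lam - theta))
    return total
```

Unlike raw error counts, this quantity varies smoothly with the scores, which is what makes it usable as an optimisation objective.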
0:08:56 Perhaps I should go back to the last line. What we do is reduce the erroneous deviation, the distance of these errors from the threshold, and the way we do that is by rotating the score space. The rotation of the score space is accomplished by this equation: we form a linear combination of the scores from the two detectors, and the result is that the score space is rotated.
0:09:28 Here the whole problem is formulated: we have the objective function of erroneous deviation, and we want to minimise it, subject to the linear-combination vector alpha. We also have a small constraint just to make sure that the final calibrated log-likelihood ratios are not out of range. After we have done this optimisation, rotating the score space on the development set, we apply the resulting alpha parameters to the evaluation data set and go back to the normal error metric, which is the detection cost. Because this time we are illustrating the pairwise language recognition process, we have one miss term and one false-alarm term in the cost.
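As a concrete sketch, the rotation can be parameterised by a single angle and tuned by grid search over development scores. This is a simplified stand-in for the constrained optimisation described in the talk (the range constraint on the calibrated ratios is omitted, and the angle grid is an assumption for illustration).

```python
import math

def calibrate_pair(s_t, s_r, labels, theta=0.0, n_angles=181):
    """Pick the rotation angle whose linear combination of the two
    detector scores, lam' = cos(a)*lam_HT + sin(a)*lam_HR, minimises
    the total erroneous deviation for a fixed threshold theta.

    s_t, s_r: scores from the target and related detectors.
    labels:   +1 for target-class trials, -1 otherwise.
    """
    def deviation(angle):
        ca, sa = math.cos(angle), math.sin(angle)
        return sum(max(0.0, -y * (ca * a + sa * b - theta))
                   for a, b, y in zip(s_t, s_r, labels))

    # Sweep angles in [-pi/2, pi/2] and keep the best one.
    angles = [math.pi * (k / (n_angles - 1) - 0.5) for k in range(n_angles)]
    return min(angles, key=deviation)
```

The angle found on the development set defines the alpha vector that is then applied, unchanged, to the evaluation scores.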
0:10:21 This is the block diagram of our system. What we use is a phonotactic and prosodic fusion system; admittedly, we only use one subsystem. It performed competitively in the evaluation, so it is not a bad system to start with, but what we want to examine is the effectiveness of the score calibration in this particular scenario. How do we get the scores from the different detectors? We have selected ten difficult target languages. For each target language we take the log-likelihood ratio of the target and the log-likelihood ratio of the related class, and then we do the parameter optimisation, which means we rotate the score space such that we obtain the updated log-likelihood ratio, lambda-dash. The training data we use are the NIST LRE 1996 to 2007 corpora, and the evaluation data is the LRE 2009 evaluation set. To give you a brief idea of the amount of data we have for the general task, which you will see in a later slide, the number of trials is about ten thousand for twenty-three languages. To train the alpha parameters for rotating the score space we use a development set, which comes from the LRE 2007 evaluation set and excerpts from the LRE 2009 development set, with a total of about six thousand trials. The test utterance duration is thirty seconds.
0:12:03 This is the result of the pairwise language recognition. The original EER given here is about twenty percent for these difficult languages, and after we apply the score calibration the error is about nineteen percent, which is about a five percent relative EER reduction. We can see that the Bosnian-Croatian confusion cannot be reduced by this method, which I guess is because the two languages mix up very seriously in our scores. Within a related language pair, the confusion reduction is more significant for the worse-performing language: for example, if we compare Dari and Farsi, the error reduction in Dari is more significant, with the help of the scores from its counterpart, Farsi.
0:13:07 The improvement from pairwise language recognition is not very significant, but we want to extend this method to the general language recognition task, and there we will see a more significant error reduction.
0:13:21 Let us first revisit the average cost function. For pairwise language recognition, again, we have one miss term and one false-alarm term. But if we move to the general task, the cost function becomes more complicated, because there are more target languages: for the detection of each language there is one miss term and twenty-two false-alarm terms to be pooled into the average cost. As you see, I have highlighted that part in red.
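The general cost just described, with one miss term and N-1 false-alarm terms per target language (twenty-two when N = 23), can be sketched as follows. The NIST-style weights are assumed for illustration.

```python
def average_cost(p_miss, p_fa, c_miss=1.0, c_fa=1.0, p_target=0.5):
    """NIST LRE-style average detection cost (illustrative sketch).

    p_miss: dict mapping language -> miss rate for that target.
    p_fa:   dict mapping (target, nontarget) pairs -> false-alarm rate.
    For each target there is one miss term and N-1 false-alarm terms,
    and the per-target costs are averaged over the N targets.
    """
    langs = list(p_miss)
    n = len(langs)
    total = 0.0
    for lt in langs:
        fa = sum(p_fa[(lt, ln)] for ln in langs if ln != lt) / (n - 1)
        total += c_miss * p_target * p_miss[lt] + c_fa * (1 - p_target) * fa
    return total / n
```

Because each pairwise false-alarm rate is divided by N-1, a single miss outweighs any single false alarm, which motivates the miss weighting discussed below.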
0:13:59 Previously we were only looking at data for two languages, so there were only circles and triangles. But when we expand to the general task there is more: you get the out-of-set data, which is the data that does not reside in these two related languages. This out-of-set data is marked with red circles here. Again I show you the general trend of the data in the detection scores of the two classifiers. The log-likelihood ratios for H_T and H_R give very similar trends, because the two languages are very similar, so a trial that has a high score in lambda_{H_T} also gives a high score in lambda_{H_R}.
0:14:52 There are some modifications we have to make when we proceed from the two-language case to the general twenty-three-language case. The first is that, as said, we have a lot of out-of-set data, and we do not want to touch this out-of-set data, because we are afraid that doing so may affect the detection of the other language classes. The second is that, as mentioned, in the general cost function there are twenty-two false-alarm terms, so the false alarm for each language pair becomes less dominant, and we have to put more stress on reducing the detection misses, rather than on reducing the false alarms, in order to have a low average detection cost.
0:15:43 These are the three rules we applied when we proceeded from pairwise language recognition to the general task. The first rule is that we only select detection trials which are likely to belong to the two related languages H_T and H_R. Of course we do not know in advance which language they belong to, so we apply a heuristic method, which is not included in the paper, to choose only these trials to operate on. The second rule is that we weight the cost of a detection miss twenty-two times heavier. As you saw in an earlier slide, the erroneous deviation objective function has a miss term and a false-alarm term, and now we put a weight twenty-two times larger on the miss term. The third rule is that we shift the reference point for the calculation of the total erroneous deviation. The point of doing this can be explained together with rule two. We have said that detection misses are more important, so we have to put more focus on detection misses in the calibration. Go back to the original detection threshold picture here: if you still remember, we had the filled circles, the misses, contributing to the erroneous deviation, and then we have all of these borderline circles just above the threshold. These trials are supposed to fall into the region of correct acceptance, and they would not be handled in any way if we did nothing. So, if detection misses are so important, why don't we also try to pull up these borderline points, by moving the reference threshold for the miss term higher? We let this shift fluctuate and look for the best value, the one which gives us the lowest general language recognition error.
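Putting the three rules together, the revised per-trial loss might look like the sketch below. The exact form is in the paper; this version, with an in-pair trial mask, a 22x miss weight, and a zeta-shifted reference point for the miss term, is an illustrative assumption.

```python
def weighted_deviation(scores, labels, theta, in_pair,
                       miss_weight=22.0, zeta=0.0):
    """Revised erroneous deviation (assumed form, per the three rules):
    rule 1: only trials flagged as belonging to the language pair count;
    rule 2: misses are weighted miss_weight (= 22) times heavier,
            matching the 22 false-alarm terms per miss term in the cost;
    rule 3: the reference point for the miss term is shifted up by zeta,
            so borderline target trials just above theta are also pulled up.
    """
    total = 0.0
    for lam, y, keep in zip(scores, labels, in_pair):
        if not keep:
            continue  # rule 1: leave out-of-set trials untouched
        if y > 0:     # target trial: shifted, heavily weighted miss term
            total += miss_weight * max(0.0, (theta + zeta) - lam)
        else:         # non-target trial: ordinary false-alarm term
            total += max(0.0, lam - theta)
    return total
```

With zeta = 0 and miss_weight = 1 this reduces to the pairwise objective; the decision threshold itself stays fixed, only the optimisation's reference point moves.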
0:17:32 This is the revised objective function. It is basically the same problem as in the previous slides for the calibration with two languages, but now with the three modifications shown in red. After we have done the calibration with the development set, we go back to the evaluation data set and use the conventional average cost function to evaluate the EER.
0:18:02 This page is maybe a little bit intimidating, so allow me some time to explain. There are four diagrams here. We use the development set to tune the alpha parameters for the score-space rotation. This is the scatter of lambda_{H_T} against lambda_{H_R} before rotation, and this is after the rotation. As you can see, we only choose a subset: the black dots are the log-likelihood ratios for the target class, and the blue dots are the scores for the related class. Only the black and blue dots are operated on, and they are rotated a little. This is the result for the evaluation set; of course it looks more messy, but there is also some rotation here. What we want from the rotation is that more target-class scores stay in the upper end of the y-axis, so that there will be fewer detection misses. In the development set it is not very clear, because the black dots of the target class are already high on the y-axis. But in the evaluation set we can see that the black dots that were scattered down the curve, mixed up with the red and green dots, have moved up after the rotation of the score space.
0:19:37 This is the overall result, the equal error rate after applying the score-space rotation. Before, we had a 4.45% equal error rate, using a single detection threshold for the detection of all languages. After the rotation, the error is reduced to about 3.3%, which is about a twenty-five percent relative reduction of the EER.
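For reference, the equal error rates quoted here can be computed with a simple threshold sweep. The following is a minimal sketch (a real implementation would interpolate along the DET curve) rather than the evaluation tooling used for these numbers.

```python
def equal_error_rate(scores, labels):
    """EER sketch: sweep a threshold over every distinct score and
    return the operating point where miss rate and false-alarm rate
    are closest to each other.

    labels: True for target-language trials, False otherwise.
    """
    tgt = [s for s, l in zip(scores, labels) if l]
    non = [s for s, l in zip(scores, labels) if not l]
    best_gap, eer = float("inf"), 1.0
    for t in sorted(set(scores)):
        p_miss = sum(s < t for s in tgt) / len(tgt)
        p_fa = sum(s >= t for s in non) / len(non)
        gap = abs(p_miss - p_fa)
        if gap < best_gap:
            best_gap, eer = gap, (p_miss + p_fa) / 2
    return eer
```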
0:20:07 We also introduced, as shown before, a parameter zeta which accounts for the shifting of the detection threshold used as the reference point. As zeta grows larger and larger, we pay more and more attention to these borderline points near the threshold. We tried different settings of zeta, and with zeta equal to 3.5 we got the lowest equal error rate.
0:20:41 Here comes the summary of today's talk. In language recognition, we carried out language-pair detection for the five pairs of related languages. A linear combination of the detection scores between the target language and the related language brings about a 5.8 percent relative EER reduction. We then revised the parameters of the optimisation for the score-space rotation, and the application-dependent calibration can be applied to the general detection task, bringing about a twenty-five percent relative reduction of the EER.
0:21:13 For future work, we have been thinking of some unsupervised methods to find these related targets, because in this work we started with the given pairs of related targets, with no derivation, as they were already included in the specification. We have also thought about applications to other detection tasks, but we understand that this work is very specific to this particular language recognition task, and we think special care has to be taken if we migrate it to other detection tasks. This is the end of today's presentation. Thank you very much.
0:21:57 (Session chair) Before the questions, there is an announcement from the organising committee. So, any questions for Raymond?
0:22:19 (Audience) I have been taking part in these evaluations, and there is a question about the evaluation protocol which seems to relate to what you are doing. In the training you can do anything you want, training and calibration, and you can set your threshold anywhere you want. But it sounds like what you are saying is that, within the testing, when you are testing a sample and it looks very much like Russian, and my task is to detect Russian, I happen to know from the details of the Ukrainian model that it looks more like Ukrainian. Are you allowed to do that? I had the impression that this was not allowed for some reason. So you can look at all the languages and see which is closest?

0:23:16 (Raymond) Actually, in the testing we just have a small set of held-out data from the different languages, and then we compare, to choose whether the trial is possibly Russian.

(Audience) Okay.
0:23:36 (Audience) You use a linear combination, but all languages are related to some degree, so why not use a linear combination of all of them, or at least a small subset?

0:23:47 (Raymond) We have actually tried that, and the intriguing thing is that it only works for these related language pairs. I think a very simple explanation of why this works is that if the two languages are more and more similar, then the scores from the two detectors have more complementary effects. Say Russian and Ukrainian: they are very similar, which means that if I use the score combination of these two languages, then I can apply the combined confidence to reject languages which are neither Russian nor Ukrainian. And this is the main source of the performance improvement we get: by doing this we get a significant reduction of false alarms against the other languages, but not against the related language.
0:24:52 (Session chair) If there are no more questions, the discussion can continue during lunch. Let's thank the speaker.