Speech transcript - INCLUDING HUMAN EXPERTISE IN SPEAKER RECOGNITION SYSTEMS: REPORT ON A PILOT EVALUATION

0:00:14	uh figure out much check
0:00:16	um low
0:00:17	uh i thank you all very much for coming
0:00:19	uh i was strongly encouraged to be brief in order to allow time for questions
0:00:24	uh but first i'd like to begin by acknowledging my fellow authors uh jack godfrey george doddington
0:00:29	and alvin martin
0:00:31	uh as well as the uh the hasr participants
0:00:34	many of whom uh are in this room
0:00:36	uh for their
0:00:38	hard work and effort in conducting hasr research
0:00:43	so the question we're trying to address is
0:00:45	how can human experts effectively
0:00:47	um
0:00:49	uh utilize automatic speaker recognition technology
0:00:52	uh to our knowledge this is still an open question
0:00:55	uh so we included a small pilot test in the twenty ten nist speaker recognition evaluation
0:01:02	um
0:01:04	the
0:01:05	uh task in hasr is to determine whether two different speech segments were both spoken by the uh same speaker
0:01:11	the hasr evaluation included two tests
0:01:14	uh the first called hasr one consisted of fifteen trials uh that is fifteen pairs of speech segments
0:01:19	uh and uh the second hasr two consisted of a hundred and fifty trials uh the first fifteen
0:01:24	of which
0:01:25	were the hasr one trials
0:01:27	hasr systems could use human listeners uh or machines or both
0:01:31	and anyone who wished to participate uh was welcome
0:01:37	again uh each trial consisted of two speech segments
0:01:41	uh and the task is to determine whether they were spoken by the same speaker
0:01:45	uh there was no time limit on the amount of the scheme presented
0:01:48	uh but it was required that trials be processed separately and independently one at a time and in sequence
0:01:55	for each trial
0:01:56	each system provided a same speaker uh or different speaker decision
0:02:00	uh as well as a numeric score
0:02:03	where a higher score indicated greater confidence
0:02:05	in a uh
0:02:06	a same speaker decision
0:02:09	because of the limited number of trials the evaluate uh evaluation metric consisted of simply tallying the number of misses
0:02:15	and false alarms
0:02:17	uh let me note that a miss is deciding the segments were spoken by different speakers when they were spoken
0:02:22	by the same speaker
0:02:23	and a false alarm is deciding segments are spoken by uh the same speaker when in fact they
0:02:28	were spoken by different speakers
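The tally described here can be sketched in a few lines of Python. This is only an illustration of the definitions given in the talk, not the scoring software used in the evaluation; the `Trial` class, its field names, and the toy data are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    # Hypothetical record for one trial (names are illustrative only).
    same_speaker_truth: bool     # True if both segments really share a speaker
    same_speaker_decision: bool  # True if the system answered "same speaker"


def tally_errors(trials):
    """Return (misses, false_alarms) for a list of trials."""
    # Miss: the truth is "same speaker" but the system said "different".
    misses = sum(
        1 for t in trials
        if t.same_speaker_truth and not t.same_speaker_decision
    )
    # False alarm: the truth is "different speaker" but the system said "same".
    false_alarms = sum(
        1 for t in trials
        if not t.same_speaker_truth and t.same_speaker_decision
    )
    return misses, false_alarms


# Toy data, invented for illustration.
trials = [
    Trial(True, True),
    Trial(True, False),
    Trial(False, True),
    Trial(False, False),
    Trial(False, True),
]
misses, false_alarms = tally_errors(trials)
print(f"misses={misses} false_alarms={false_alarms}")
```

Dividing the sum of the two counts by the number of trials gives a per-system error fraction.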
0:02:35	uh due to the limited number of trials it was necessary to select challenging segment pairs
0:02:41	uh in each case one of the segments was a three minute recording of an interview uh over one
0:02:45	of several different microphones
0:02:47	uh and the other
0:02:49	uh the other segment was a five minute call recorded over a telephone channel
0:02:54	for hasr one segment pair similarity was determined using an automatic system uh and the most similar different speaker
0:03:01	pairs
0:03:02	uh were selected for
0:03:04	uh different speaker trials and the least
0:03:06	similar speaker segments uh were chosen for
0:03:09	uh same speaker trials
0:03:12	these pairs were then screened by humans
0:03:14	to select the most difficult trials and eliminate any content cues
0:03:18	hasr two was selected in the same way uh the only difference being the screening
0:03:23	was
0:03:28	alright right
0:03:29	now that we know all about the hasr evaluation
0:03:31	uh let's play a game
0:03:33	it's called same speaker different speaker
0:03:35	and it's played by listening to uh speech segments and uh voting whether they were spoken by
0:03:41	the same speaker
0:03:46	[audio: first pair of speech segments played]
0:04:23	okay how many people believe it was the same speaker
0:04:27	okay uh how many different speakers
0:04:29	okay overwhelmingly same but some different
0:04:32	okay and the second one
0:04:35	[audio: second pair of speech segments played]
0:05:16	all right how many people think same speaker
0:05:19	uh just a couple
0:05:20	uh i i
0:05:21	how many uh how many different speaker
0:05:24	um well
0:05:27	you
0:05:28	there's a set of a little differently yeah but you may be surprised to learn that the first one was
0:05:31	different speaker
0:05:33	and the second was same speaker
0:05:35	um
0:05:37	yeah it's true it's absolutely true
0:05:39	and let us note that these were the trials in hasr one that had the most misses and false
0:05:42	alarms
0:05:54	okay so let's see how the hasr uh one systems did
0:05:57	uh on the top are same speaker trials and on the bottom different speaker trials
0:06:02	uh there were twenty systems that participated from fifteen sites uh in six different countries
0:06:07	uh the green portion of the bars represents correct decisions
0:06:11	the blue misses
0:06:12	and the red false alarms
0:06:14	uh as we look from left to right we see trials in uh increasing difficulty for the
0:06:19	systems
0:06:20	uh and we just listened to
0:06:23	uh this trial and
0:06:24	and this one
0:06:31	uh here we see individual system performance uh on the hasr one trials
0:06:36	each bar represents the total number of errors divided by the total number of trials uh that's fifteen in
0:06:42	this case
0:06:43	again blue indicates misses and red false alarms
0:06:46	uh the system with the fewest
0:06:48	errors
0:06:49	uh had two misses and no false alarms
0:06:52	and the system with the most
0:06:54	had four misses and uh seven false alarms
0:07:07	i
0:07:09	okay um here we consider the performance of uh systems uh from the sites that participated in both hasr
0:07:16	one and hasr two
0:07:18	uh the bar on the left for each system pair uh represents uh hasr one trials and the one on
0:07:23	the right uh represents uh hasr two trials
0:07:27	uh sorry
0:07:28	left uh hasr one
0:07:31	and then right
0:07:32	hasr two
0:07:34	uh again blue is misses and red is false alarms
0:07:37	and uh on average
0:07:40	um
0:07:42	the hasr one trials uh
0:07:43	proved more challenging uh than the hasr two trials
0:07:47	now if you took your time and carefully read the fine print
0:07:51	of the sre uh sre ten evaluation plan
0:07:54	uh you would discover that we embedded in the automatic uh system evaluation the hasr trials
0:08:00	uh to uh see how the automatic systems do
0:08:05	so when we look at the uh three leading systems in the main evaluation and look at how they
0:08:10	did on the
0:08:12	uh hasr trials
0:08:13	um this is what we see here on the right
0:08:16	one thing we should note on this uh is that
0:08:19	the actual decisions
0:08:21	are being displayed here for the hasr systems
0:08:23	uh but we were not able to do that for the um
0:08:26	automatic systems uh due to the thousand to one different speaker to same speaker prior probability uh given in
0:08:33	the main evaluation
0:08:34	uh so we
0:08:37	uh adjusted the decision threshold
0:08:39	uh of the automatic systems so as to produce equal counts of misses and false alarms
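One way to picture this threshold adjustment is the sketch below. It is an assumption-laden illustration rather than the procedure actually used: it assumes a trial is called "same speaker" when its score is at or above the threshold, it simply tries every observed score as a candidate threshold, and the function names and toy scores are invented for the example.

```python
def count_errors(scores, is_target, threshold):
    """Count (misses, false_alarms) when 'same speaker' means score >= threshold."""
    misses = sum(1 for s, t in zip(scores, is_target) if t and s < threshold)
    false_alarms = sum(
        1 for s, t in zip(scores, is_target) if not t and s >= threshold
    )
    return misses, false_alarms


def equal_count_threshold(scores, is_target):
    """Pick the candidate threshold whose miss and false alarm counts are closest.

    Candidates are the observed scores plus one value above the maximum
    (which rejects every trial). Ties go to the lowest candidate.
    """
    candidates = sorted(set(scores)) + [max(scores) + 1.0]

    def gap(threshold):
        misses, false_alarms = count_errors(scores, is_target, threshold)
        return abs(misses - false_alarms)

    return min(candidates, key=gap)


# Toy scores, invented for illustration: four same speaker (target) trials
# followed by four different speaker (non-target) trials.
scores = [2.1, 1.4, 0.3, -0.2, 1.0, 0.5, -1.3, -0.8]
is_target = [True, True, True, True, False, False, False, False]

threshold = equal_count_threshold(scores, is_target)
print(threshold, count_errors(scores, is_target, threshold))
```

With only a handful of trials the two counts cannot always be made exactly equal, which is why the sketch settles for the smallest difference.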
0:08:48	uh so we saw that leading automatic systems had noticeably fewer errors than the hasr systems uh and the
0:08:54	tests proved quite challenging
0:08:57	i
0:08:58	in fact uh half the systems got more trials right than wrong in hasr one
0:09:04	yes thank you
0:09:05	i
0:09:07	um
0:09:08	so uh we leave you um with
0:09:11	uh a couple questions
0:09:13	uh first was this data appropriate for supporting hasr research
0:09:17	um and where do we go from here
0:09:22	we are planning another hasr evaluation to be held in conjunction with sre twelve
0:09:27	uh we expect there to be two tests
0:09:29	uh the first with twenty trials and the second with two hundred
0:09:33	and the trial selection process is planned to be similar to that in hasr ten
0:09:37	uh but hopefully with less human screening
0:09:40	uh the data will uh still be in english only
0:09:43	uh and the evaluation period is planned to be four months
0:09:46	uh which is much longer than the automatic system evaluation is typically
0:09:52	um we are eager for your feedback
0:09:55	uh so please email
0:09:56	or speak with us
0:09:58	um
0:10:00	i should note that statistical significance is of great importance
0:10:04	to nist
0:10:05	so it's of interest to us
0:10:07	uh but with so few trials none can be assigned
0:10:10	uh to these results
0:10:12	uh we are also interested in ideas on how to improve uh the trial selection process so again please
0:10:18	uh
0:10:18	share with us
0:10:20	uh for more information uh or to provide feedback
0:10:23	um here are some websites or speak with us
0:10:26	uh email is on the paper
0:10:28	thank you very much
0:10:35	so for questions please come to the mike
0:10:49	right
0:11:07	okay i would like to have more explanation i can
0:11:11	and uh the proximity had difficulty and that approximately optimized how you
0:11:17	next year it is proximity
0:11:19	exactly sure um
0:11:21	well
0:11:29	so uh we ran a full matrix of
0:11:33	uh uh
0:11:35	um uh
0:11:36	interview train interview test non-target trials of all speaker pairs
0:11:40	uh thirty seven speaker pairs uh were identified
0:11:43	uh using a threshold of
0:11:45	the scores where the idea was
0:11:47	uh
0:11:48	the pair was included if the score was in the top one percent of
0:11:51	um
0:11:53	scores in the direction
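The top one percent selection described in this answer might look roughly like the following sketch. It is an assumption, not the actual procedure: the transcript does not make clear how the two scoring directions of a speaker pair were combined, so the example just ranks ordered (train speaker, test speaker) pairs by a single score. The speaker labels and scores are made up, and `top_percent_pairs` is a hypothetical helper.

```python
import math


def top_percent_pairs(pair_scores, percent=1.0):
    """Return the pairs whose scores fall in the top `percent` of all scores.

    `pair_scores` maps an ordered (train_speaker, test_speaker) pair to the
    similarity score an automatic system gave that non-target trial.
    """
    ranked = sorted(pair_scores.items(), key=lambda item: item[1], reverse=True)
    keep = max(1, math.ceil(len(ranked) * percent / 100.0))
    return [pair for pair, _score in ranked[:keep]]


# Toy full matrix over 20 hypothetical speakers (380 ordered pairs).
speakers = [f"spk{i:02d}" for i in range(20)]
pair_scores = {}
for i, train_spk in enumerate(speakers):
    for j, test_spk in enumerate(speakers):
        if i != j:
            # Made-up similarity: neighbouring indices sound more alike.
            pair_scores[(train_spk, test_spk)] = -abs(i - j) + 0.01 * i

selected = top_percent_pairs(pair_scores, percent=1.0)
print(len(selected), selected[0])
```

In the talk the automatically selected pairs were only candidates; the final trials were then chosen by listening to segment combinations for each pair.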
0:11:55	so
0:11:55	those thirty seven pairs were chosen and then
0:11:58	um combinations of segments for each speaker pair
0:12:01	um were listened to
0:12:03	to determine which would be used for
0:12:05	one
0:12:06	uh for non-target
0:12:07	and for
0:12:08	uh target trials
0:12:10	or same speaker trials
0:12:12	um
0:12:13	uh we did a full matrix
0:12:15	uh of the actual segments
0:12:18	and then listened to the segment
0:12:19	pairs
0:12:20	that way
0:12:20	and that was for hasr one for hasr two uh that was
0:12:23	very careful
0:12:24	uh screening
0:12:25	the process was similar just with a larger
0:12:35	uh quick what was
0:12:36	the percentage of non-native speakers in the data
0:12:40	uh just uh non-native
0:12:43	have that you what i
0:12:45	it
0:12:46	percent of non-native speakers in the hasr uh some people who were not native us english speakers
0:12:52	um let's see
0:12:54	um
0:12:55	something in one
0:12:57	two
0:12:59	um
0:13:04	uh i'm thinking of two
0:13:07	right
0:13:12	oh
0:13:13	three
0:13:14	oh
0:13:17	oh i'm sorry i misunderstood
0:13:19	or maybe uh i'm sorry are you asking for the trials or for the
0:13:22	participants
0:13:24	oh i'm sorry yeah i
0:13:26	yes
0:13:30	uh i do not know that offhand but that's something we can uh find out
0:13:34	um i will note that everyone who uh was recorded was recorded in philadelphia
0:13:39	uh but that's a fairly international city so i
0:13:49	i believe that's correct but uh sometimes
0:13:54	i
0:13:57	i
0:14:02	yes
0:14:26	uh there's another question
0:14:32	what was the gender breakdown did you specifically select for uh an equal divide or did you
0:14:38	choose based upon
0:14:39	the challenge of the pairs
0:14:42	uh sure again i don't have the gender breakdown handy but this was uh um
0:14:47	um
0:14:48	this just fell out we did not try to uh balance this but uh
0:14:52	also of course all trials were
0:14:54	same sex
0:14:56	trials
0:14:59	thank you very much

INCLUDING HUMAN EXPERTISE IN SPEAKER RECOGNITION SYSTEMS: REPORT ON A PILOT EVALUATION

Human Assisted Speaker Recognition

Presenter: Craig Greenberg, Authors: Craig Greenberg, Alvin Martin, National Institute of Standards and Technology, United States; George Doddington, N/A, United States; John Godfrey, US Department of Defense, United States