Hello, and thank you all very much for coming. I was strongly encouraged to be brief in order to allow time for questions, but I would like to begin by thanking my fellow authors, Jack Godfrey, George Doddington, and Alvin Martin, as well as the HASR participants, many of whom are in this room, for their hard work and effort in conducting this research.

So, the question we're trying to address is: how can human experts effectively utilize automatic speaker recognition technology? To our knowledge this is still an open question, so we included a small pilot test in the 2010 NIST Speaker Recognition Evaluation. The task was to determine whether two different speech segments were both spoken by the same speaker. The HASR evaluation included two tests. The first, called HASR1, consisted of fifteen trials, that is, fifteen pairs of speech segments; the second, HASR2, consisted of one hundred and fifty trials, the first fifteen of which were the HASR1 trials. HASR systems could use human listeners, or machines, or both, and anyone who wished to participate was welcome.

Again, each trial consisted of two speech segments, and the task was to determine whether they were spoken by the same speaker. There was no limit on the amount of time that could be spent, but it was required that trials be processed separately and independently, one at a time. For each trial, each system provided a same-speaker or different-speaker decision, as well as a numeric score, where a higher score indicated greater confidence in a same-speaker decision. Because of the limited number of trials, the evaluation metric consisted of simply tallying the number of misses and false alarms. Let me note that a miss is deciding the segments were spoken by different speakers when they were in fact spoken by the same speaker, and a false alarm is deciding the segments were spoken by the same speaker when in fact they were spoken by different speakers.
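The tallying metric just described can be sketched in a few lines of Python. This is a minimal illustration; the decisions and ground-truth labels below are made up for the example, not evaluation data.

```python
# Tally the two error types for same-speaker / different-speaker decisions.
# A miss: deciding "different" when the segments were from the same speaker.
# A false alarm: deciding "same" when they were from different speakers.

def tally_errors(decisions, truths):
    misses = sum(1 for d, t in zip(decisions, truths)
                 if d == "different" and t == "same")
    false_alarms = sum(1 for d, t in zip(decisions, truths)
                       if d == "same" and t == "different")
    return misses, false_alarms

# Hypothetical decisions and ground truth for five trials:
decisions = ["same", "different", "same", "different", "same"]
truths = ["same", "same", "different", "different", "same"]
print(tally_errors(decisions, truths))  # (1, 1)
```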
Due to the limited number of trials, it was necessary to select challenging segment pairs. In each case, one of the segments was a three-minute recording of an interview over one of several different microphones, and the other segment was a five-minute call recorded over a telephone channel. For HASR1, segment-pair similarity was determined using an automatic system: the most similar different-speaker pairs were selected for the different-speaker trials, and the least similar same-speaker pairs were chosen for the same-speaker trials. The pairs were then screened by humans to select the most difficult trials and to remove any content cues. HASR2 was selected in the same way, the only difference being in the screening.

All right, now that we know all about the HASR evaluation, let's play a game. It's called Same Speaker, Different Speaker, and it's played by listening to pairs of speech segments and voting on whether they were spoken by the same speaker.

[first segment pair plays]

Okay, how many people believe that was the same speaker? And how many believe different speakers? Okay, overwhelmingly same, but some different. And the second pair.

[second segment pair plays]

All right, how many people think same speaker? Just a couple. And how many think different speakers? Well, the vote went a little differently there. You may be surprised to learn that the first pair was different speakers and the second was the same speaker. It's true, it's absolutely true. And let me note that these were the trials in HASR1 that had the most misses and false alarms.

Okay, so let's see how the HASR1 systems did; on the top are the same-speaker trials, and on the bottom the different-speaker trials. There were twenty systems that participated, from fifteen sites in six different countries. The green portion of the bars
represents correct decisions, the blue misses, and the red false alarms. As we look from left to right, we see the trials in increasing order of difficulty for the systems, and the pairs we just listened to are these trials here.

Here we see individual system performance on the HASR1 trials. Each bar represents the total number of errors divided by the total number of trials, fifteen in this case. Again, blue indicates misses and red false alarms. The system with the fewest errors had two misses and no false alarms, and the system with the most had four misses and seven false alarms.

Here we consider the performance of the systems from the sites that participated in both HASR1 and HASR2. For each system pair, the bar on the left represents errors on the HASR1 trials, and the bar on the right represents errors on the HASR2 trials. Again, blue indicates misses and red false alarms. On average, the HASR1 trials proved more challenging than the HASR2 trials.

Now, if you took your time and carefully read the fine print of the SRE10 evaluation plan, you would discover that we embedded the HASR trials in the automatic system evaluation, to see how the automatic systems would do on them. So when we look at the three leading systems in the main evaluation, and how they did on the HASR trials, this is what we see here on the right. One thing we should note is that the actual decisions are displayed here for the HASR systems, but we were not able to do that for the automatic systems, due to the thousand-to-one different-speaker to same-speaker prior probability given in the main evaluation. So we adjusted the decision threshold of the automatic systems so as to produce equal counts of misses and false alarms. We saw that the leading automatic systems had noticeably fewer errors than the HASR systems.
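The threshold adjustment just described, choosing the operating point on a system's scores where misses and false alarms occur in equal numbers, can be sketched as follows. This is a minimal illustration with made-up scores; the brute-force scan over candidate thresholds is one simple way to do it, not necessarily how the evaluation tooling did.

```python
# Find a decision threshold over system scores that (approximately)
# equalizes the counts of misses and false alarms.
# Higher scores indicate greater same-speaker confidence.

def equal_error_threshold(target_scores, nontarget_scores):
    """Scan every score as a candidate threshold; return the one
    minimizing |misses - false_alarms|."""
    candidates = sorted(target_scores + nontarget_scores)
    best_t, best_gap = None, float("inf")
    for t in candidates:
        # miss: score below threshold on a same-speaker trial
        misses = sum(1 for s in target_scores if s < t)
        # false alarm: score at/above threshold on a different-speaker trial
        false_alarms = sum(1 for s in nontarget_scores if s >= t)
        gap = abs(misses - false_alarms)
        if gap < best_gap:
            best_t, best_gap = t, gap
    return best_t

# Hypothetical scores:
same_speaker = [2.1, 1.7, 0.4, 3.0]
diff_speaker = [-1.2, 0.9, -0.3, 0.1]
t = equal_error_threshold(same_speaker, diff_speaker)  # 0.9 for these scores
```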
And the tests proved quite challenging; in fact, half the systems got more trials wrong than right on HASR1.

So we leave you with a couple of questions. First, was this data appropriate for supporting HASR research? And where do we go from here? We are planning another HASR evaluation, to be held in conjunction with SRE12. We expect there to be two tests, the first with twenty trials and the second with two hundred. The trial selection process is planned to be similar to that in HASR 2010, but hopefully with less human screening. The data will still be in English only, and the evaluation period is planned to be four months, which is much longer than is typical for the automatic system evaluation. We are eager for your feedback, so please email us or speak with us. I should note that statistical significance is of great importance to NIST, so it is of interest to us, but with so few trials none can be assigned to these results. We are also interested in ideas on how to improve the trial selection process, so again, please share them with us. For more information, or to provide feedback, here are some websites, or speak with us; our names are on the paper. Thank you very much.

[Session chair] For questions, please come to the mike.

[Question] I would like a little more explanation of the proximity measure: how exactly was proximity determined?

[Answer] Sure. We ran a full matrix of interview-train, interview-test non-target trials over all speaker pairs. Thirty-seven speaker pairs were identified using a threshold on the scores, where the idea was that a score was included if it was among the top one percent of scores in each direction. So those thirty-seven pairs were chosen, and then combinations of segments for each speaker pair were listened to, to determine which would be used for the non-target trials. For the target trials, or same-speaker trials, we did a full matrix of the actual segments and selected the segment pairs that way. That was for HASR1; for HASR2 the screening process was similar, just with a larger number of trials.

[Question] A quick question: what was the percentage of non-native speakers in the HASR data, that is, people who were not native US English speakers?

[Answer] Are you asking about the trials or the participants?

[Question] The trials.

[Answer] I do not know that offhand, but that is something we can find out and report. I will note that everyone who was recorded was recorded in Philadelphia, but that is, of course, a large international city. I believe that's correct, but sometimes I'm mistaken.

[Question] What was the gender breakdown? Did you specifically select for it, or did you choose based purely on the challenge of the pairs?

[Answer] Sure, I don't have the gender breakdown handy, but it simply fell out of the selection; we did not try to balance it. All trials, of course, were same-sex trials. Thank you very much.
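The pair-selection procedure described in the talk and in the first answer above, scoring a full cross-speaker matrix with an automatic system, then keeping the most similar different-speaker pairs and the least similar same-speaker pairs as the hardest trials, can be sketched roughly as follows. This is a minimal illustration: the function, segment names, and scores are hypothetical placeholders, not the evaluation's actual tooling or data.

```python
# Select challenging trials from an automatic system's similarity scores:
# different-speaker pairs with the HIGHEST scores (most confusable), and
# same-speaker pairs with the LOWEST scores (least similar-sounding).

def select_trials(scores, labels, n_diff, n_same):
    """scores: {(seg_a, seg_b): similarity}; labels: {seg: speaker_id}."""
    diff_pairs = [(s, p) for p, s in scores.items()
                  if labels[p[0]] != labels[p[1]]]
    same_pairs = [(s, p) for p, s in scores.items()
                  if labels[p[0]] == labels[p[1]]]
    # Most similar different-speaker pairs -> hardest different-speaker trials.
    hardest_diff = [p for _, p in sorted(diff_pairs, reverse=True)[:n_diff]]
    # Least similar same-speaker pairs -> hardest same-speaker trials.
    hardest_same = [p for _, p in sorted(same_pairs)[:n_same]]
    return hardest_diff, hardest_same

# Hypothetical segments from three speakers:
labels = {"a1": "spk1", "a2": "spk1", "b1": "spk2", "c1": "spk3"}
scores = {("a1", "a2"): 0.2, ("a1", "b1"): 0.9,
          ("a2", "c1"): 0.7, ("b1", "c1"): 0.1}
hard_diff, hard_same = select_trials(scores, labels, n_diff=1, n_same=1)
# hard_diff == [("a1", "b1")], hard_same == [("a1", "a2")]
```

In the evaluation itself, the automatically selected pairs were then screened further by human listeners; this sketch covers only the automatic scoring step.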