Speech Transcript - Exploring the Effects of Device Variability on Forensic Speaker Comparison Using VOCALISE and NFI-FRIDA, A Forensically Realistic Database

0:00:14	Hello everyone, my name is David van der Vloed.
0:00:18	I work at the NFI and I will talk to you about
0:00:22	automatic speaker recognition in forensic voice comparison.
0:00:27	As such, I'm a user of automatic speaker recognition technology, not a developer,
0:00:32	which gives me a unique perspective that I hope will be insightful for you.
0:00:37	And this is a study into representativeness, a concept that's really important
0:00:43	in doing actual cases in forensic voice comparison.
0:00:47	My two co-authors here,
0:00:49	they are the heroes who actually develop
0:00:51	automatic speaker recognition systems. They work at Oxford Wave Research and it was their
0:00:56	system that was used for this study.
0:01:01	But I'm just the humble user, and I will talk about automatic speaker recognition from
0:01:05	that perspective.
0:01:07	So in forensic voice comparison you will typically have an offender recording from the police:
0:01:11	somebody did something bad in this recording
0:01:14	and the identity of the speaker is unknown. And
0:01:18	there will be a suspect, of whom the police think: okay, this guy must be
0:01:21	the same as the offender. So there is the suspect recording, and
0:01:25	the recordings can come from everywhere.
0:01:28	The important part is we get two recordings:
0:01:31	one of them has a contested speaker identity and the other one is just the
0:01:35	suspect, and nobody disputes that.
0:01:38	And the question is always: hey, are these guys
0:01:40	the same person? Are these people the same person?
0:01:43	Of course we translate this into hypotheses, so
0:01:48	we work in a Bayesian framework,
0:01:53	but what it all boils down to is: is it the same guy or not? And when
0:01:57	you use automatic speaker recognition, you
0:01:59	chuck the recordings into your system, you give it some user-submitted
0:02:04	data and a reference normalization cohort,
0:02:06	and then it gives you a score.
0:02:09	And this score
0:02:11	sort of exists in a void:
0:02:13	there's no way of
0:02:15	telling what a score means. It could be seventeen and nobody knows how good that is. So
0:02:19	you need a relevant population. So you look at your
0:02:23	potential relevant population, recordings of known speaker identity,
0:02:30	and you check: my case recordings are the blue guys, so my relevant
0:02:35	population is blue people. And I compare those blue people
0:02:40	in the same manner as I did in the case,
0:02:43	and that gives me same-speaker scores and different-speaker scores. These
0:02:47	can be made into distributions, and then I can bring back the case score.
0:02:52	And here in this example I can see I have an LR of about four,
0:02:57	because the intersections with the green line and the orange line give a ratio of four.
0:03:03	And this four, that's a likelihood ratio, and
0:03:07	now we don't have a meaningless score anymore; we have a meaningful number, a likelihood
0:03:12	ratio. This is
0:03:13	an expression of the weight of evidence.
0:03:15	It can actually be used in casework or in court:
0:03:19	the judge can
0:03:20	weigh this in his or her decision
0:03:24	about the case as a whole.
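
The step from a raw score to a likelihood ratio can be sketched in a few lines. This is only an illustration, not the method used by the system in the talk: it assumes the same-speaker and different-speaker scores from the relevant population are each modelled by a Gaussian, and all score values and the function name `likelihood_ratio` are made up for the example.

```python
from statistics import NormalDist


def likelihood_ratio(case_score, same_speaker_scores, different_speaker_scores):
    """Ratio of the two score densities evaluated at the case score.

    Assumption: each score set is modelled by a single Gaussian fitted to
    the relevant population scores. Real systems may calibrate differently.
    """
    same = NormalDist.from_samples(same_speaker_scores)
    diff = NormalDist.from_samples(different_speaker_scores)
    return same.pdf(case_score) / diff.pdf(case_score)


# Hypothetical relevant population scores (not from the study).
same_speaker_scores = [14.0, 16.0, 18.0, 20.0, 22.0]
different_speaker_scores = [4.0, 6.0, 8.0, 10.0, 12.0]

# A case score of seventeen only becomes meaningful relative to these scores.
lr = likelihood_ratio(17.0, same_speaker_scores, different_speaker_scores)
print(f"likelihood ratio: {lr:.1f}")
```

Swapping in a different relevant population changes both fitted distributions, and therefore the likelihood ratio for the same case score.
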
0:03:26	Okay, let's backtrack.
0:03:28	There was this choice of potential relevant population and I said: okay, let's look
0:03:33	at the colour of the guys.
0:03:36	But reality is a bit more complex than just colours. Maybe I should
0:03:40	have checked whether they were wearing sunglasses, and you would get
0:03:44	another relevant population. Or maybe I should have checked
0:03:48	whether they had hats on, or maybe the combination of these two.
0:03:52	And
0:03:53	there's the earlier
0:03:55	result I got.
0:03:57	But when checking for hats, it might be that the distributions are shifted
0:04:01	and the actual resulting LR will be way lower than what I had before. Or
0:04:05	had I checked for sunglasses, it might shift just the other way.
0:04:10	And if I had
0:04:13	checked for every single piece of metadata I could think of,
0:04:17	colour, hats and glasses, I would probably not have sufficient data to even do
0:04:21	this.
0:04:23	So you can see this has a major impact on the result of the case.
0:04:26	And
0:04:29	this is
0:04:30	a real problem in forensic voice comparison, because when I was talking about hats, when
0:04:34	I was talking about
0:04:37	sunglasses, I actually meant, of course, conditions:
0:04:41	case recording conditions. And that's an enormous list; I've just given you some off the
0:04:46	top of my head. This, when you think of it,
0:04:48	could be close to infinity.
0:04:52	So that's a real problem: you don't really know what to select for. And even
0:04:56	within
0:04:58	this list,
0:04:59	look at broad recording type there:
0:05:01	in there, there's multiple categories, and within those categories there's
0:05:06	even smaller
0:05:07	elements to look for. And it's just not clear:
0:05:11	could some of these things safely be ignored because they have
0:05:15	no impact on the LRs
0:05:17	at all?
0:05:18	Or it may be really crucial, and then it's really important; you don't wanna
0:05:22	forget it, because then you get this wrong
0:05:25	likelihood ratio that could
0:05:27	potentially lead to a miscarriage of justice.
0:05:31	So in order to do research into this relevant population problem, we collected a database. It's
0:05:36	called NFI-FRIDA,
0:05:38	Forensically Realistic Inter-Device Audio. It's got two hundred and fifty male speakers,
0:05:43	and the other characteristics here are just the target audience of forensic voice comparison in
0:05:48	the Netherlands, basically. And their speech was recorded on multiple devices simultaneously, so every utterance
0:05:55	of speech is recorded in different ways.
0:05:58	And I have an example of this.
0:06:03	The guy there is sitting and he's talking on the phone with someone who is
0:06:07	not a participant,
0:06:08	and
0:06:09	he is wearing a headset.
0:06:12	a text-dependent i for the subset of the testing for
0:06:18	and there was a
0:06:19	no improvement due to stupidity i financed data suggesting for
0:06:26	and i guess to
0:06:34	And there's the microphone on the other side of the room.
0:06:43	And the police kindly provided actual intercepts of the telephone.
0:06:50	and however
0:06:53	And this is a still of a video made by an iPhone, which is
0:06:58	the sixth recording.
0:07:00	So this is a list of the recording devices, and
0:07:04	it says there "inside only" for two, for three microphones,
0:07:09	and I will explain this right now.
0:07:11	So
0:07:12	every participant had two days of recording; every day had eight recording sessions.
0:07:17	Four of them are inside, four of them are outside.
0:07:21	All those inside and outside sessions were divided into silent background and noisy
0:07:25	background. For inside that meant just no sound, or
0:07:29	a white-noise radio
0:07:31	making noise for the noisy background. Outside,
0:07:34	the actual location varied, so there was a sort of silent place
0:07:39	and there is a busy place right in the centre, as you can see.
0:07:45	And then there was the other variation, where the actual telephone used was one of
0:07:49	two phones, one of them an iPhone. And this made eight conversations per day,
0:07:53	and there's two days of those. And
0:07:57	the conversations are five minutes of spontaneous telephone speech, and
0:08:01	we actually transcribed half of it, the iPhone recordings, which helped us edit the recordings,
0:08:07	since there is speech/non-speech information available.
0:08:09	And look at the numbers:
0:08:12	you can see each speaker has about one hour twenty minutes of speech duration.
0:08:16	The recording duration is longer, because for every
0:08:20	speech utterance there are multiple recordings, of course.
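
The session design described above is a small full-factorial grid, which can be enumerated as a quick check of the numbers. This is a sketch only: the label `phone_a` is a placeholder, since the transcript does not clearly name the second phone, and the variable names are illustrative.

```python
from itertools import product

# Factors per recording day, as described in the talk.
locations = ["inside", "outside"]
backgrounds = ["silent", "noisy"]
phones = ["phone_a", "iphone"]  # "phone_a" is a placeholder, not the real model
days = [1, 2]
MINUTES_PER_CONVERSATION = 5

# Every combination of location, background and phone is one session.
sessions_per_day = list(product(locations, backgrounds, phones))
all_sessions = list(product(days, sessions_per_day))
total_minutes = len(all_sessions) * MINUTES_PER_CONVERSATION

print(len(sessions_per_day), "sessions per day")
print(len(all_sessions), "sessions per speaker")
print(total_minutes, "minutes of conversation per speaker")
```

Two days of eight five-minute conversations come to eighty minutes, which is consistent with the roughly one hour twenty minutes per speaker mentioned in the talk.
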
0:08:23	So why this database? It is of course forensically relevant in its speaker demographics,
0:08:28	like I said, but the really cool part is the simultaneous recordings. And this
0:08:35	makes studying the influence of the recording device possible; more specifically, the relevance of data
0:08:43	that's recorded by different recording devices.
0:08:46	The system we used for this study is VOCALISE by Oxford Wave Research.
0:08:51	It's an x-vector system, and
0:08:54	it has a really cool feature where you can visualise the speakers; you can see in
0:08:58	the bottom right, that's the x-vectors brought down to three dimensions.
0:09:04	You also have the option to use earlier generations, i-vector, GMM; you have the option to
0:09:10	not use MFCCs but use auto-phonetic features.
0:09:16	And they have a speciality other than the normal stuff: there's reference normalization, which is
0:09:21	very standard, but you can also submit data
0:09:25	for the PLDA to adapt it better to the case conditions.
0:09:31	So we took one hundred and thirty-five speakers from FRIDA. All
0:09:35	the recordings were edited to net speech and cut to forty seconds,
0:09:40	and they were divided into two groups: there's test data and there's the reference normalization
0:09:46	cohort. And we also did experiments without the reference normalization cohort.
0:09:51	And I should say, for every speaker, every day, there's five recordings for the
0:09:55	first five devices; the smartphone video was not included but the other five are there.
0:10:01	So
0:10:02	you know what you do: you chuck it into VOCALISE and this is what you get.
0:10:07	These are convex hull equal error rates, and as you
0:10:13	can see, when you do a matching experiment you get quite good numbers. This does
0:10:17	not compare to i-vector performance at all;
0:10:20	it's really better. You can actually see that the first three devices, the high-quality
0:10:26	close microphones, perform
0:10:28	pretty well against each other even when mismatched.
0:10:31	This is not
0:10:33	quite true for the other two devices: if you mismatch good
0:10:36	recordings with the far microphone or the telephone intercept, you really start to notice.
0:10:41	And of course if you take the other two recording types and compare those
0:10:46	in a mismatch, that's gonna be the
0:10:49	one performing the worst.
0:10:51	Note, if I do this with an i-vector system,
0:10:54	this four-, five-ish equal error rate is actually what you get
0:11:00	all over the place, and the one that's highlighted now would be probably ten percent
0:11:04	equal error rate.
0:11:05	Also note that reference normalization actually helps:
0:11:08	the lower equal error rate is
0:11:11	almost everywhere a bit lower than the upper one.
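
For readers less familiar with the metric, an equal error rate can be approximated from two sets of scores as below. Note that this sketch uses a plain threshold sweep, not the convex hull variant reported in the talk, and the scores and the function name `equal_error_rate` are invented for illustration.

```python
def equal_error_rate(same_speaker_scores, different_speaker_scores):
    """Approximate EER by sweeping a threshold over all observed scores.

    At each threshold, a same-speaker score below it is a miss and a
    different-speaker score at or above it is a false alarm. The EER is
    taken where the two error rates are closest to each other.
    """
    thresholds = sorted(set(same_speaker_scores) | set(different_speaker_scores))
    best_gap, best_eer = None, None
    for t in thresholds:
        miss = sum(s < t for s in same_speaker_scores) / len(same_speaker_scores)
        false_alarm = sum(s >= t for s in different_speaker_scores) / len(
            different_speaker_scores
        )
        gap = abs(miss - false_alarm)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (miss + false_alarm) / 2
    return best_eer


# Hypothetical scores with some overlap between the two sets.
same_speaker_scores = [2.0, 3.0, 4.0, 5.0]
different_speaker_scores = [0.0, 1.0, 1.5, 2.5]
eer = equal_error_rate(same_speaker_scores, different_speaker_scores)
print(f"EER: {eer:.2%}")
```

The more the two score sets overlap, as happens under device mismatch, the higher this number gets.
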
0:11:15	So back to the original question: can we do research that finds out whether something
0:11:20	is crucial or whether things can be ignored when selecting relevant population data?
0:11:25	So we took the one hundred thirty-five FRIDA speakers and we divided them
0:11:30	into mock cases, so
0:11:31	those would be the actual case recordings,
0:11:34	and background data. We did this in three ways and the results are pooled
0:11:38	afterwards.
0:11:40	When you compare device one for those mock cases, it would make sense to use
0:11:45	device one as background data, because then you have matching background data, matching
0:11:49	relevant population data.
0:11:51	You could also use the other devices, and then you would have mismatching background
0:11:55	data.
0:11:56	So we did this for every device type in the mock cases and every device
0:12:00	type in the background data, the relevant population data. That gave us twenty-five
0:12:05	sets of LRs, which are represented as Cllrs. And if you look at
0:12:11	this table, the diagonal means that we
0:12:14	used the right relevant population data: so the mock case type was in device one,
0:12:19	the headset, so we took the headset recordings
0:12:22	for the relevant population data, and so on.
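
The five-by-five design and the Cllr summary can be sketched as follows. The Cllr formula is the standard log-likelihood-ratio cost; the likelihood ratio values are invented, and the device numbering 1 to 5 simply follows the order used in the talk.

```python
import math
from itertools import product


def cllr(same_speaker_lrs, different_speaker_lrs):
    """Log-likelihood-ratio cost: lower is better, 1.0 means uninformative LRs."""
    same_cost = sum(math.log2(1 + 1 / lr) for lr in same_speaker_lrs) / len(
        same_speaker_lrs
    )
    diff_cost = sum(math.log2(1 + lr) for lr in different_speaker_lrs) / len(
        different_speaker_lrs
    )
    return 0.5 * (same_cost + diff_cost)


# Every mock-case device paired with every background-data device.
devices = range(1, 6)
conditions = list(product(devices, repeat=2))
matching = [(case, background) for case, background in conditions if case == background]
print(len(conditions), "conditions, of which", len(matching), "on the diagonal")

# Hypothetical LR sets for one condition (not results from the study).
good = cllr(same_speaker_lrs=[50.0, 100.0], different_speaker_lrs=[0.02, 0.01])
uninformative = cllr(same_speaker_lrs=[1.0, 1.0], different_speaker_lrs=[1.0, 1.0])
print(f"Cllr good: {good:.3f}, Cllr uninformative: {uninformative:.3f}")
```

In the study's table, each of the twenty-five cells would hold one such Cllr, so rows can be read as "which background device represents this case device best".
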
0:12:26	If you look at the first row, it basically tells you what you should do when
0:12:30	you have a case
0:12:32	in device one.
0:12:33	It may make sense to use device one, but you can also see that
0:12:36	devices two and three are just as good.
0:12:39	Device four is a bit worse, that's the far one with a little reverberation in it, and
0:12:44	the telephone is definitely bad:
0:12:47	that gives you a huge penalty in performance.
0:12:50	And the same holds for the other two close microphones,
0:12:56	except maybe device three, but they seem to be quite interchangeable, and device four is
0:13:01	in between, and device five is just out of the question; that gives you a
0:13:05	penalty in performance.
0:13:07	If you look at the other two
0:13:09	recording types again,
0:13:11	they had better be represented by themselves: as you can see in the numbers, the lowest
0:13:16	Cllr is for the matching background data and there's nothing that really comes close.
0:13:21	So, graphically represented,
0:13:23	that means the three high-quality microphones can sort of represent each other,
0:13:29	as can be seen by the arrows. And
0:13:32	the far microphone, which is still a direct microphone but it's far away and the
0:13:37	recording is reverberant,
0:13:38	that's
0:13:39	an intermediate one. And the telephone intercept: definitely don't
0:13:42	use it
0:13:43	to represent
0:13:45	the microphones, or vice versa.
0:13:49	So that's an answer to the question:
0:13:52	within broad recording type, you cannot gloss over telephone versus direct
0:13:57	microphone; that's definitely a crucial difference there. But within direct microphone, the brand or
0:14:03	make or type of microphone is really not so important. That's what these results seem
0:14:09	to suggest.
0:14:11	So these results were not very surprising to me,
0:14:16	but it shows you the type of research that is of
0:14:18	big interest to users of automatic speaker recognition, because it gives you a
0:14:24	guideline
0:14:25	on how to choose relevant population data, and it gives you basically a guideline on how
0:14:30	to use ASR properly,
0:14:33	making it available for and the

Exploring the Effects of Device Variability on Forensic Speaker Comparison Using VOCALISE and NFI-FRIDA, A Forensically Realistic Database

Speech Application

David van der Vloed, Finnian Kelly, Anil Alexander