Good morning everybody. Well, my contribution here will be mostly focused on giving an overview of some guidelines that have been developed in the last two years concerning, directly or indirectly, speaker recognition systems, or semi-automatic speaker recognition with human intervention, mainly in the feature extraction.

And the message, before anything else: before doing something with speaker recognition in court, in Europe at least, we should read these guidelines, because they have been generated after a process of consensus in the community, so I think they are relevant. That's the message if you want to do something in Europe; and if you're not from Europe, I think they at least deserve a read, so you know what's going on in our environment.

Well, the first one is the so-called ENFSI guideline for evaluative reporting in forensic science, which most of you probably already know. It was released in two thousand and fifteen; I'll talk about it later.

The second one (if this works... something's wrong, I don't know what's going on) is a guideline that we have developed in collaboration with the NFI, and with consensus: recommendations on the validation of likelihood ratio methods for forensic evidence evaluation.

(Something's wrong with the computer; so much for the Windows system.)

And the third one is a set of recent guidelines on methodology and best practice for forensic semi-automatic and automatic speaker recognition, also developed by ENFSI, the European Network of Forensic Science Institutes, in particular by its forensic speech analysis working group.

All three of them are available: the second one is already published in Forensic Science International, and the first and the third are in the repository of documents from ENFSI.

Some critical recommendations of the first guideline are about expressing conclusions in court in general, not only in speaker recognition but in forensic science in general; they are recommendations for all forensic science fields.

There are some critical recommendations in the guideline that I want to especially stress.

The first one is that the expression of conclusions must be probabilistic. Moreover, it is recommended in the guideline to transform the probabilistic statement into the form of a likelihood ratio, in terms of formal equivalence. And what is absolutely stressed is that absolute statements should be avoided: categorical statements like identification or exclusion.

The second one is that when one has to define the hypotheses in the case (say, same guy versus different guy; or this voice comes from this guy, versus this speech segment comes from another person with these characteristics), one has to consider at least one alternative. There can be many of them, but at least one. And a clear definition of the hypotheses is also mandatory, because that definition determines what data we have to handle in order to compute this weight of evidence.

The third one: the findings must be evaluated given each of the hypotheses, and that leads us to some kind of likelihood for each hypothesis; in the two-hypothesis case, we arrive at a likelihood ratio.

The fourth one says that the conclusions are expressed in terms of support for the hypotheses, instead of probability of the hypotheses. Putting it this way, as support for the hypotheses given the findings, is a quite easy way to avoid some fallacies in reasoning. And it forces us to express the support, the weight of the evidence, in terms of a likelihood ratio rather than a posterior probability ratio. So "support" is an important word if you want to avoid those kinds of fallacies.
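In standard Bayesian notation (my own summary, not from the slides), the distinction between the two kinds of statement reads:

```latex
\underbrace{\frac{P(H_1 \mid E)}{P(H_2 \mid E)}}_{\text{posterior odds (for the court)}}
\;=\;
\underbrace{\frac{P(E \mid H_1)}{P(E \mid H_2)}}_{\text{likelihood ratio (the expert's report)}}
\;\times\;
\underbrace{\frac{P(H_1)}{P(H_2)}}_{\text{prior odds (not the expert's)}}
```

Reporting only the middle factor, the degree to which the findings support one hypothesis over the other, avoids accidentally stating the left-hand side, which would require priors the expert does not have.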

And the last one is that data-driven approaches should be the final goal. But in the meantime there are many practitioners who cannot move straight to data-driven approaches, so the guideline also considers the use of subjective judgement, subjective probabilities and so on; still, it is recommended that data-driven evaluation be the long-term goal.

There is also an example on speaker recognition. It is not an example of what speaker recognition should be; it generated some controversy within the ENFSI forensic speech analysis group, because there are many ways of doing speaker recognition. It is an example of an automatic case, generated by people who used automatic speaker recognition for it, but it is just a template: a given example of how to do this in one particular scenario, with one particular weight-of-evidence conclusion about the speaker.

Well, the second guideline is the guideline on validation that we have been developing with people from the NFI. This guideline is aimed at recommending that everybody in forensic science who uses likelihood ratios move to objective, empirical validation procedures, which is typically not the case in many forensic science fields. Here in speaker recognition we definitely use them: in this conference, everybody uses an experimental environment to validate their methods.

But there are two questions here. First, if you're not used to that: how to do it? How performance is measured, how performance measures should be interpreted, which performance measures are relevant. And the second one: okay, I have validated my system with some performance measures, so how do I put that into play in order to allow a technique to go to court? There are some recommendations regarding laboratory accreditation, procedures and so on.

The guideline is quite concrete, but I'm not going to go into many details. The point is simply determining whether an implemented likelihood ratio method is fit to be used in court, and everything should be documented.

We are in the process of turning this guideline into an ISO standard for forensic biometrics; some of the people here are collaborating on that, from standards bodies and laboratories related to ISO.

And we proposed a table of relevant performance characteristics. The table is not intended to be read in detail, but you can see things we are used to here, like Cllr and EER; we contributed these into the general forensic field. But the performance measures are not limited to these ones; it's just a proposal, so the guideline is supposed to be open in that sense, and everybody can contribute more performance measures. These are the minimum requirements that we understand a validation process should contain regarding performance measures.

There is also a strong emphasis, and most of my colleagues will talk about it, on the use of relevant forensic data, not only laboratory data: using a NIST evaluation is nice, but it is not enough.

Lastly, we close with a critical point: performance measurement under forensic casework conditions, which is an extremely tricky issue. My colleagues will talk about it later.

Finally, the ENFSI guideline for forensic automatic and semi-automatic speaker recognition, which was led by Andrzej Drygajlo within the forensic speech analysis working group.

This guideline is compatible with the ENFSI guideline for reporting, and also compatible with the validation guideline we have been talking about; and it also addresses many other issues: the most used technologies and methods, which state-of-the-art methods are reliable, the most used features (the features typically used here and in different approaches, and which are more reliable), audio preprocessing, and what the influence is if a human being is in the process, and so on. So it is a best-practice guideline, developed within the forensic speech group; many of us here have been contributing to it, so it is a guideline that presents a high degree of agreement today, I mean.

Okay, that was my contribution.

Thank you, Daniel. So we have one minute for a small, quick question; in any case we'll have more time when all the first talks are done. Any question for Daniel on the guidelines? Then we continue with the next speaker.





...is going to continue now with his presentation. How do you go full screen?

Good morning everybody. I'm Jonas, from Sweden. I work for a company, and also at the University of Gothenburg currently. I'm going to talk a little bit about accreditation for forensic speaker comparison, which we are going through.

So, the company: we have performed casework for around eleven years, in Sweden, Norway and the US, almost all of them Swedish cases. There are three people employed, almost all part-time, all employed by the university as well. And we are the subcontractor of the Swedish National Forensic Centre; basically we handle more or less all the cases in this small area.

I'll just give you a short overview: quickly mention the methods we employ, then talk about the evaluations for accreditation, which is where Daniel's stuff comes in, then very briefly what a forensic conclusion in Sweden looks like, and finally quite a few questions to put up there.

So before explaining the three parts, very briefly: there is of course a screening process. "NFC screened" means the screening happens at NFC; that has developed over the years, and these days the screened-out part of the cases is around fifty percent, and that happens at NFC. Before, a lot more of the screening used to be done in-house by us, but nowadays they do it because it's cheaper for them. And then there's always a second screening done at our place as well. And we always keep the case open, so that we can actually add samples during the analysis, even after we have taken on the analysis.


The first part of the analysis is the linguistic-phonetic perceptual analysis. These days, in some cases, it can also begin with a light listening, depending on how many people are involved. In the linguistic part you go through different steps of perceptual evaluation. We try to keep it in some kind of Bayesian manner; how do we treat cognitive bias? To keep it very brief: you go through the material once and deliberately bias yourself towards the one hypothesis, and then you go through it again and bias yourself towards the other. Two people always do this, and in most cases a third person does a blind test. Whether all three go through it depends on the case, at some level; it's a matter of cost and how much work we can put into the case.

The second part is the acoustic measurements that we still do as part of the standard protocol: articulation rate (basically syllables produced per second), a few fundamental frequency measures (graphed), and then long-term formant analysis, which nowadays is handled more or less automatically, and is also put into an i-vector system.

And the third part is of course the automatic systems. Currently there are two systems in use; we are evaluating one more system, and researching another: four systems altogether.

When it comes to the evaluation for accreditation, we've been fumbling around in the dark, basically not knowing exactly what to do, and I think that goes for others too. We very much appreciate the work that's been done by ENFSI, maybe especially since we're on a tight schedule: our next deadline for accreditation is very close. So when I was at a meeting about five months ago and heard about this work on the guideline, you know, with Rudolf, those guidelines became really important for us for how to treat the validation of automatic systems. They don't solve everything, of course, and you can discuss that at length, but at least there are guidelines now that we can follow; we basically know what to do for the accreditation at least, and then you hear people discussing.

Some of these are just examples of the plots that are suggested in the guidelines; for an outsider it all looks fine, and you can get the figures for each of those plots. And these are some examples of the problems you can start running into; this one is from the validation work with the NFI. You create accuracy heat maps for the results, in this case Cllr means, but also for equal error rate and so on, for different tests on a huge telephone database. So you can more or less see what happens in the evaluation when there is more than one training sample, and when the test sample gets shorter and shorter.

But if you consider all those plots and all those figures in an accreditation process, you realise it's going to be quite a few pages. You don't have to read all this; just consider how many validations. Very quickly: during these eleven years we've done over a hundred evaluations. And if you consider all the different conditions, different durations, microphone, distant microphone, mobile recordings, with and without face cover, in and outside a car, indoor, outdoor, different languages, different compression; we've done more or less all of those, with different datasets and some simulations. You can imagine what a large document that would be, documenting all those evaluations for the accreditation process.

The perceptual-phonetic analysis also has to be evaluated. This has been a difficulty for us, because there have been two of us before, and we know each other pretty well, to some extent at least; so we've been trying to evaluate each other, back and forth, over the years. Now we have a third person, and she basically goes through testing: because even if you have a PhD, in speech pathology in her case, and you're great, you still have to be evaluated on everything, and you're not really used to doing forensic analysis on telephone material. She had to go through a training phase, then a testing phase, and then blind evaluations. We started at a small scale, of course, because it's extremely time-consuming: the last one, with twenty-three speakers, took her some three days to perform the analyses.

Just quickly showing you what the National Forensic Centre verbal scale looks like: a nine-point ordinal scale of conclusions over two hypotheses, from level plus four down to level minus four, with zero in the middle. It reads something like: level plus four is "the results are extremely much more probable if the main hypothesis is true compared to the alternative"; minus two is "the results are more probable given the alternative hypothesis compared to the main one". Behind each level there is a standard likelihood ratio.
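For illustration only, since the actual LR values behind NFC's levels are not given in the talk, a mapping of this kind can be sketched with hypothetical order-of-magnitude thresholds:

```python
from math import log10

def lr_to_level(lr, max_level=4):
    """Map a likelihood ratio to a signed ordinal level on a nine-point
    scale (-max_level .. +max_level, 0 = neutral).  The power-of-ten
    thresholds used here are an assumption, not NFC's published values."""
    level = int(log10(lr))  # truncates toward zero: LRs near 1 map to level 0
    return max(-max_level, min(max_level, level))
```

Each reported level then corresponds to a band of likelihood ratios rather than a single number, which is how a verbal scale with a standard LR behind each level can work in practice.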

Important to remember: even if you do all these evaluations and put together this probably thousand-page document for accreditation, every case is unique. How much can you actually infer from all the evaluations you've done to each and every case? It's not easy at all, even though it looks thorough once it's all evaluated. So there's still a lot to think about: even when you've gone through the accreditation process and received the stamp, it's not as if the evaluation stops. That's just a general point I want to put out there as well.

What is needed to have a transparent report? We still don't know; that's something we need to discuss much more. And who has to be able to understand this report? Is it the jury or the judge, or actually another expert? Probably the latter, which is basically how it is. I think that's it, pretty quick.


Excellent. I mean, we have time for a couple of questions.

[Inaudible question.]

It's just two examples, because if I put all of them up the slide would look crazy; so I just took plus and minus two to give you an example. I could have taken minus four.

I suppose this is probably more of a comment that we'll get to later, but based on what I've seen so far in the first two talks, and there's nothing wrong with them, one concern I have is that the big issue of course is the data, right? There's a lot of talk about guidelines and accreditation, and everybody keeps saying the data is the problem, but it keeps getting pushed aside. If we have guidelines and accreditation now, it's going to look more official, and that's disconcerting. It's not really a question, more a discussion of how we're ever actually going to get our hands around the data issue. You don't need to answer it all for me.

Well, what I can tell you is that there is a lot of data, though of course we cannot cover all these conditions with that amount of data. But to me it's also the sensitivity of the data. So I can tell you there's a lot of data, but I can't really tell you how it was collected or what data it is, because it's all kept behind secrecy to too large an extent. That's also a problem, especially in Sweden, when you want to publish things: a lot of the evaluations that we've done over the years we couldn't publish. I hope that is actually going to change now, but it's a huge problem. If I publish something, I have to be able to give the data to another researcher if he asks for it, or he has to be able to come to the institution and actually use the data, for falsifiability. If you can't do that, you can't really publish anything. So that's been a difficulty; but now, probably, maybe we can do that anyway, because the organisation has changed. We'll see.

Thank you. Let's go on to our next speaker.

I'm going to talk a little bit about some aspects of how it works at the BKA. At the BKA we have done speaker recognition since the seventies. In the early days it was done automatically, but the technology wasn't really ready, and so the method used, starting from the eighties, was the auditory and acoustic-phonetic method. Since about two thousand five we have used both: this auditory and acoustic-phonetic method, combined with automatic speaker recognition.

Just a few slides. You heard from Daniel about these guidelines on semi-automatic and automatic speaker recognition. Just repeating two of the aspects: one is that the outcome of an automatic or semi-automatic method is the likelihood ratio; it's all about systems that output likelihood ratios. Another important aspect, which Daniel mentioned as well, is that validation of a likelihood ratio method has to be performed with speech samples that are typical and representative of the speech material forensic practitioners are confronted with in their everyday work. So it has to be forensically relevant.


These guidelines are accessible; even here on the workshop website you might have noticed there is a link. It takes you to the ENFSI website, and the four documents are there.


Since we have issued those guidelines, we have to practice what we preach, so we have to get busy collecting forensically relevant data. We started doing this a while ago; one of those activities was published at Odyssey two thousand twelve, and our activities are ongoing. Another key point is our collaboration with the NFI.


They have compiled the NFI-FRITS corpus, which was documented at Odyssey two thousand fourteen, and we have a special licence to work with them on this corpus, with many restrictions and so forth.

Also, in terms of what kind of data we have: the best coverage is for matching conditions involving telephone-intercept data. What's more difficult are other conditions, especially mismatched conditions. One type of condition we frequently have is terrorist videos: people making announcements to the public, disguising their faces, encouraging people to come to their training camps and things like that. As opposed to telephone-intercepted recordings: these same guys call home, there is an interception, and the speech is captured over a telephone channel. So this is a mismatch in terms of technology but also in speech style: in the video the person reads something, for example, or makes an announcement, or has learned it by heart, and that is different from a natural telephone conversation.

So we do have some of this data, but it's more difficult to collect. Another challenge is language: we have casework in several languages and we want to cover them, and we do collect data for different languages, but there is a limit to what we can do. As a parallel strategy, we also investigate the effects, both the size and the type of the effect, of mismatch in the data we have. One type of situation: we have a testing corpus, but we don't have the right reference population for it, so we have to use a reference population from another language. What is the effect, in terms of shifting the likelihood ratios, if we use the incorrect reference population? It's not a big effect. So these kinds of effects are to some extent predictable.
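The kind of effect described, a shift in the likelihood ratio when the reference population is wrong, can be sketched numerically. The Gaussian score models and all parameters below are invented for illustration; they are not the BKA's models:

```python
from math import exp, log10, pi, sqrt

def normal_pdf(x, mu, sd):
    """Density of a normal distribution, used here as a toy score model."""
    return exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * sqrt(2 * pi))

score = 1.0  # similarity score from comparing questioned and known material

p_same = normal_pdf(score, 1.0, 0.5)         # same-speaker score model
p_matched = normal_pdf(score, -1.0, 1.0)     # matched reference population
p_mismatched = normal_pdf(score, -0.3, 1.0)  # wrong-language reference population

lr_matched = p_same / p_matched        # LR with the right population
lr_mismatched = p_same / p_mismatched  # LR with the wrong population
shift_log10 = log10(lr_matched) - log10(lr_mismatched)
```

Because the same-speaker term cancels in the shift, the effect depends only on how the two reference densities evaluate the score, which is why such effects can be measured once and then predicted.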

That's what we also try to capture. Language is a big issue: it's not just one language, there are several languages we want to cover. That was the more practical problem; now to more conceptual problems and issues.
problem and it issues

One is combining different kinds of evidence. There is quantifiable evidence, the likelihood ratios coming from automatic or semi-automatic systems, which is what the guideline is about; and there is also qualitative evidence, coming from the auditory-phonetic and acoustic-phonetic method. We use both kinds of evidence. Some practitioners in this audience work only with quantifiable evidence; others work with both kinds, and the question is how to combine the two. Since not everything is quantifiable, if we use both methods we eventually have to make strength-of-evidence statements that are not entirely quantitative. In the end, one component is qualitative, so the combination of the evidence has to be qualitative, because you cannot calculate your way through; there is some qualitative aspect. That's a standing issue, not unsolvable of course, but an issue for all institutions that use both likelihood-ratio-producing methods and qualitative methods.

The other one, probably the most painful problem, is this one.

It is about interfacing with the court. You can do your audio work well or not so well, but in some cases you have to go to court and interface with people from the court, and they have a different mindset and different expectations and so forth. The situation we have in Germany is that the courts still expect posterior statements. They expect things like what you see in this table: "identity can or cannot be assessed", "is probable", "highly probable", "very probable", "can be assumed with near certainty". This is the sort of language they are used to, and still expect. We discuss this, of course, but there is a sort of psychological inertia against switching to a Bayesian framework.

The ideal of the Bayesian framework is this: the speech expert supplies likelihood ratios, other forensic experts do the same, and then the court applies the priors and calculates the posterior odds from the prior odds and all the likelihood ratios coming from the experts. That would be the ideal scenario.
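Numerically, the division of labour in that ideal scenario looks like this (the figures are invented for illustration):

```python
# Court's job: choose prior odds from the case circumstances,
# e.g. one suspect among roughly a thousand plausible sources.
prior_odds = 1 / 1000

# Expert's job: report only the likelihood ratio for the speech evidence.
lr_speech = 10_000.0

# Court again: combine them into posterior odds and, if desired, a probability.
posterior_odds = prior_odds * lr_speech
posterior_prob = posterior_odds / (1 + posterior_odds)
```

The same LR of 10,000 yields a posterior probability of about 0.91 here, but would yield only about 0.01 against prior odds of one in a million, which is exactly why the expert's LR alone is not a statement about identity.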

That is still not the case; there is resistance against implementing it. In the Netherlands and Sweden you are much further along than we are in Germany. I don't know if you can say, especially on point three, how the state of affairs is there, but this is a topic for discussion. So these are the interfaces, and these are the expectations coming from the court about the sort of things they want, and so forth. That's basically my system model.

Good, thank you very much. We have time for a couple of questions.

Could you just say something about how you actually, at the moment, go about combining the quantitative and the qualitative data? Is there some explicit statement about how you do that, and how you integrate any kind of relationship between those two types of evidence?

What we do with the automatic system... here, this is a plot coming from the guidelines, and what we have is that we have...














The resistance against the Bayesian framework: could vocabulary contribute to it? Are there good German words for, say, likelihood ratio or prior odds?

I think you have to explain the concepts in any case. No, I think it's not a language problem; it's more a gradual process of convincing the courts.

The reason I'm asking is that in my home language, Afrikaans, we don't really have words for these things. We have a word for probability, but it's used for likelihood as well; there isn't even a distinction between likelihood and probability, it's just the one word. And for "a posteriori probability" I've got no idea how to say it. So I was wondering whether the vocabulary matters.

My comment: there are two terms that come up again and again which contribute, at least partially, to this problem of interfacing with the legal profession. One of them is "support", and the other one is the use of "speaker recognition". Now, if you keep on talking about speaker recognition, it's not surprising that the court thinks you've done speaker recognition. Right? And you're not doing speaker recognition; you're giving them a likelihood ratio, whereas speaker recognition comes with a posterior. I think that's okay for us, because we understand the distinction, but of course if you keep on talking about forensic speaker recognition, it's not surprising that the legal profession understands it in that sense.

And secondly, one of the things that really gets my back up is this word "support", as in "the likelihood ratio supports the hypothesis". It doesn't. The meaning of the likelihood ratio is only realised in the posterior, when you take the two parts, prior and likelihood ratio, into account. It can be reversed: a likelihood ratio of a thousand can be outweighed by the priors. On its own it has no such meaning. That's the problem I see with talking about the likelihood ratio as support for the prosecution hypothesis: the trouble is that "support" is posterior language. I know that's what people use; I think it's a very bad choice.


What would you say, then?

Well, when you say "support", you're trying to say something about the posterior in the absence of the prior. There are plenty of other words, but it seems to have established itself as the standard expression. We can discuss this later, but I think the guideline implicitly states that there is no consideration of other information, or of a prior opinion, when you use it: in conjunction with "support for the hypothesis", what is meant is that the results are more likely given one hypothesis than given the other.

I understand the sentiment over the whole thing, but if you say "my likelihood ratio gives support to the prosecution hypothesis", or to the defence hypothesis, well, that is the wording that's been used. I understand the problem. What I would like to stress is that it is not the likelihood ratio that supports; it is the findings that support, via the weight of evidence, which is quantified in a likelihood ratio: "the findings support...".


Okay, so: next.


Good morning. The title of my talk today is "Opening the black box", for forensic automatic speaker recognition, and this talk was prepared by Finnian and myself. We're from Oxford Wave Research, which is an audio and speech R&D company based out of Oxford, and our experience in this field is that we develop systems for automatic speaker recognition, speaker diarization and audio processing. We've been working in this field for quite a while; our products are used by law enforcement in the UK and by other agencies in the UK, Europe and the Middle East, including the Met Police in the UK and the NFI.

The topic I'd like to address goes with some of the comments that have come up already: it is the fact that automatic speaker recognition is a black box. This is a comment that one of our colleagues made at one of our conferences, and it stuck with me. I think a lot of work needs to be directed at addressing the fact that automatic speaker recognition methodology is a black box.

Over the last few days we have been treated to a variety of new algorithms and techniques, and variations and modifications of different algorithms. It isn't any surprise that these mathematically complex methods are a black box to laypeople, the juries, the judges and the lawyers, and, to a certain extent, even to the forensic experts who are using them.


As we've seen, recent advances have come from systems with a large number of variables, and there was a comment earlier about it all being about the data: the training and evaluation data, the feature modelling and the parameter choices. In an evaluation you might have fifteen systems, with variations in where the components have been placed, one way or the other, and in which parameters have been included; and the focus has been on getting incremental improvements on these large databases. And I'd like to note: the variability in these databases has been designed, or controlled. How does this sit within the context of opening up this black box? If you've got real forensic casework recordings, how do you use these systems, and how do you address the court?

Let's look at the ENFSI guidelines for some support. Now, the ENFSI guidelines say that any expert method needs balance, transparency, robustness and logic. Some of these have already been addressed, so I won't go into all of them. The things that stick out: balance, for example, that you have competing hypotheses, or propositions, and the evidence is considered with respect to these hypotheses and propositions, given of course the prior background. Then there is logic, the fact that you don't want to transpose the conditional: evaluating the hypothesis instead of evaluating the evidence. And robustness, which is slightly different from the sort of speaker-engineering robustness we usually talk about: here it means how well the method holds up to scrutiny, how well it would hold up to cross-examination, the actual techniques that are used. And, I think quite importantly, something you don't get in a black box: transparency.

How well would the forensic expert be able to explain the methods, and explain the data that goes into the system they are using?

now let's take a very simple, straightforward example: suppose the expert used i-vectors, in the standard sense, a straightforward automatic pipeline. you start by training the ubm, and you've got a whole lot of data that you can put into training the ubm. you choose another set of data for training the total variability space. and then, if you're using lda and plda, you can use yet another speaker set, and that approach is used a lot.
and this is just before you get to testing and training and validation, or equal error rates and so on.
so before you've even got started, you've got data decisions, multiple data decisions: about the ubm training, about the tv matrix, about the lda and the plda.
and this is before considering things like what the relevant population is, the likelihood ratio method and so on. all of this is embedded within the system.
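to make those embedded decisions concrete, they can be written out explicitly. a minimal sketch in python; the stage names follow the pipeline just described, but the dataset labels are hypothetical, not taken from any particular system:

```python
# Illustrative sketch: each training stage of an i-vector pipeline
# consumes its own dataset, chosen before any testing or validation.
# Stage names follow the talk (UBM, total variability matrix, LDA/PLDA,
# score-to-LLR calibration); the dataset labels are hypothetical.

pipeline_data_decisions = {
    "ubm_training":       "large_generic_corpus",     # lots of unlabelled speech
    "tv_matrix_training": "second_generic_corpus",    # total variability space
    "lda_plda_training":  "labelled_speaker_corpus",  # needs speaker labels
    "score_to_llr":       "case_specific_data",       # calibration, closest to case
}

def report_decisions(decisions):
    """List every dataset choice embedded in the system, so the expert
    can be transparent about each one in a report."""
    return [f"{stage}: trained on {data}" for stage, data in decisions.items()]

for line in report_decisions(pipeline_data_decisions):
    print(line)
```

the point of the sketch is only that each stage is a separate, reportable data decision made before any validation happens.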


going back to doug's comment about data: systems that are developed with these kinds of background data have to be explicit about their effects on the likelihood ratio, or at least there needs to be transparency about those effects and about how the systems are calibrated. that's one part of the problem, the sort of automatic black box, if you will, that somebody could help open up.

now, in the uk, most of the forensic speaker recognition casework is performed by forensic phoneticians, and they have a lot of experience and knowledge: they understand the material and the language, they understand the idiosyncrasies of the speech, and they understand the legal requirements of their jurisdiction. they want to include these automatic methods, but do all automatic systems give them this?

and how do you then combine this automatic score that you've got with the knowledge that you have, for example about the fact that this speaker says something that is very particular to a region or place? how do you put these things together?

okay, assuming you even wanted to make your analysis more objective, using likelihood ratios and evaluating your system performance beforehand, how do you continue to do this?

what generally happens, or happened, was that you had two camps set against each other. you had the traditional forensic phonetics based approach, looking at phonetics and voice quality and linguistics, and then you had the automatic camp, which looked at the spectrum and, you know, treated it as a signal processing problem. and they were set against each other;

sometimes we don't even sit together at conferences


it's nice that this is starting to converge on a common logical platform, which is beginning to be accepted: the bayesian likelihood ratio framework. and it's nice because you can have these multiple methods and approaches, and they can be put together, pulling in the same direction.

i've been working on this problem for quite some years, together with a lot of colleagues who work with forensic casework, and i really think the black box issue is quite an important problem. it creates a situation where the forensic expert has, for example, four systems that they haven't elaborately calibrated, and they aren't able to look inside the automatic systems at all. okay, so i go back to the earlier point about every case being different: the expert should be able to say which system parameters, and which new data, to use at every step of the speaker recognition process.

and in some sense this doesn't just go for, you know, commercial systems: the expert should not be limited to these prepackaged, preprocessed, manufacturer-provided models, and they should be able to train the system specifically for the problem domain.

and it was in this context that, at one point, we looked at putting something together, and this is by no means the only way of doing things:


you know, we put together an automatic system that was built with an open box architecture, if you will. so one thing it gives you is flexibility in the features that you put in: you could use automatic spectral features like mfccs, and that is important, but you could also use traditional forensic parameters like formants,

and then, and this may be more debatable, you can use user-provided features. again, this allows the strength of these mathematical modelling techniques, like i-vector plda and gmm-ubm, to be used within the context of these lexical features.


the idea behind doing this was that you were able to introduce data at all stages in the i-vector pipeline or the gmm-ubm pipeline, and, to a certain extent, tune the system to the conditions of the case.

the last thing you might ask is: does this make it less of a big black box? no, it doesn't; it is as complicated as it is. but what it tries to do is open it up, to what goes into it and what data goes into it, and that allows for validation that's more meaningful in the context of the case.

thanks. so there is time for only one quick question, and then we go to the next speaker. anyone? very quick then, and then the question itself.

i'm on the other side, so i'm biased, sorry for that, but this is a very interesting topic, the black box thing and so on. my opinion: of course it should be trainable, yes, because i think that when a forensic expert goes to court, the court, if it asks something, needs to understand what's going on: what type of data were you using, what type of algorithms were you using, was the system adapted to your specific case. yes, it's obvious; that's the main problem in forensics, that every case is different, and you need to have some flexibility there.

but be careful with that, because if you create a system where you can tune everything, then you make unsolvable the problem that we were solving before. because if you want a system that is validated, and at the same time you can change everything every time, that's going to be a problem, because then you are going to need to validate the system for every single case. so that, for me, creates

a big problem in applying the method, because you need to change data, and sometimes it is not easy to change data in the form of audio files and so on. and if every single case needs different parameters and so on, it also makes the system more difficult to operate. so i think that we need to find a place where you balance both things: transparency and openness of the system, but also some stability of specific things in the system, just to make the validation of the system feasible for what it does.

okay, thank you. in any case we can continue this, maybe; this is an interesting discussion. after that there's the next speaker, and, well, actually, some of these points come up again in the other talks and in the demo and in this challenge, so you can also continue with that.

okay, i'm going to tell you about, as simon introduced, a multi-laboratory evaluation of forensic voice comparison,
that is being organised by myself and my former phd student of all bands and

so i think we've already talked about doesn't need for evaluation of forensic evidence

this goes across all branches of forensic evidence best been calls since the nineteen sixties

for forensic voice comparison to be evaluated under realistic case what conditions but i think

just by what everybody here said i think this still goes widely unheeded

so our contribution to this is to run this forensic evaluation, which we're calling forensic_eval_01. it's designed to be open to operational forensic laboratories; we especially want them to take part. it's also going to be open to research groups.

and we're providing training and testing data representing the conditions of one forensic case. so we're providing data based on a relevant population for the case, based on the speaking styles for this particular case, and also the particular recording conditions for this case.


we are going to have the papers reporting on the evaluation of each system published in a virtual special issue of speech communication. the call for papers, the submission system, is not quite set up yet, but i'm hoping it'll be done maybe by the end of this week or next week.

for information, what's already available: you can find it by going to my website, and you can get started if you want to start.

so there's an introductory paper which is already available, a draft of it at least, and it includes a description of the data and the rules for the evaluation.

each paper evaluating a system needs to describe the system in sufficient detail that it could potentially be replicated, and we're thinking at the level of: it could be replicated by forensic practitioners who have the requisite skills and knowledge and facilities.

we're not putting a tight deadline on this; people working in operational forensic laboratories are very busy, and their priority is to actually do casework. so we're giving a two year time period within which people can evaluate systems and submit.

so, disclaimer: casework conditions vary substantially from case to case. basically, i'm of the opinion that at this stage you essentially do have to evaluate your system on a case by case basis, because the conditions are so variable from case to case.

and what that means is: whatever results one gets out of taking part in this evaluation, one should not assume that those are generalisable to other cases, unless one can make the case that, yes, the conditions of this other case are very similar to the conditions in forensic_eval_01.

so, a little bit about the data. it's based on a real case: the offender recording is of a telephone call made to a financial institution's call centre. this picture is just something i pulled off the internet. it's a landline recording at the call centre, and it has babble and typing background noise. it's saved in a compressed format, because of course they want to reduce the amount of storage they need. it's forty six seconds long, and it is clearly an adult male australian english speaker.

the suspect recording: we should be able to get a nice high quality suspect recording, yes? okay? or no. i have a photo up here: this is the actual room where the suspect recording was made. you see it has nice hard walls, and i think the person taking the photo is in the opposite corner of the room. right, imagine what the reverberation is like. and you see this here? that's a nice fan, and the microphone is in this box.

so there are problems with the suspect recording as well, but that's pretty typical of the sorts of problems that we experience in real forensic casework.

so the data that we're providing come from a database we collected; the whole database is actually available, but this set is extracted from it. we've got male australian english speakers, we have multiple non-contemporaneous recordings of each speaker, and we have multiple speaking tasks per recording session.

we've got high quality audio; we actually had to record the speakers from the relevant population, and we had to record the relevant speaking styles. but then what we've done is taken that audio and simulated the technical recording conditions that i just mentioned, so that it matches pretty closely the signal in both case conditions.

so we have training data from a hundred and five speakers. if you're used to nist sre, that sounds ridiculously low, but i think availability of relevant data is a major problem in forensic voice comparison, and that's actually quite a lot of data compared to what people can usually manage to get.

and the test data comes from a total of sixty one speakers.

so i think i have time to show you some preliminary results based on the data from forensic_eval_01.

these are results that ewald and i actually produced. this is not part of the special issue in speech communication; it's something that we did previously, which has already been submitted, but it's on almost exactly the same data.

so in this example we're looking at an i-vector system: mfccs, ubm, t matrix, lda, plda, and then a score to likelihood ratio conversion at the end using logistic regression.

and we trained two different versions of this system. one uses generic data: the first training level does not use the data i just talked about, it uses a whole bunch of nist sre data, about an order of magnitude more speakers and two orders of magnitude more recordings. we used the generic data for everything up to the score, for training all the models that get to the score, and then we used the case specific data for training the model that goes from the score to the likelihood ratio, the logistic regression model at the end. that's a fairly typical way of doing things, because you do all the hard training upfront.
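the score-to-likelihood-ratio step described here, linear logistic regression, can be sketched in a few lines of pure python. this is an illustrative version trained on synthetic scores, assuming a simple linear map llr = a*score + b; it is not the implementation used in the study:

```python
import math
import random

def fit_score_to_llr(same_scores, diff_scores, lr=0.01, epochs=2000):
    """Fit a linear logistic-regression score -> log-likelihood-ratio map,
    llr = a * score + b, by gradient descent on the cross-entropy loss,
    treating same-speaker trials as the positive class (equal priors)."""
    a, b = 1.0, 0.0
    data = [(s, 1.0) for s in same_scores] + [(s, 0.0) for s in diff_scores]
    n = len(data)
    for _ in range(epochs):
        ga = gb = 0.0
        for s, y in data:
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))  # posterior at equal priors
            ga += (p - y) * s / n
            gb += (p - y) / n
        a -= lr * ga
        b -= lr * gb
    return a, b

# Toy example with synthetic scores (illustrative only):
random.seed(0)
same = [random.gauss(2.0, 1.0) for _ in range(200)]   # same-speaker scores
diff = [random.gauss(-2.0, 1.0) for _ in range(200)]  # different-speaker scores
a, b = fit_score_to_llr(same, diff)
llr = a * 1.5 + b  # calibrated LLR for a new score of 1.5
print(a, b, llr)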

right, and we did another system where we used case specific data all the way through: we trained the models that get to the scores using case specific data, and then we trained the score to likelihood ratio model using case specific data.

and here are some results in terms of cllr; if you don't know cllr, it's a measure of the accuracy of the likelihood ratios, and lower is better.

okay, so the case specific data: the version trained using case specific data all the way through performed much better than the version using generic data to get to the score and then case specific data for the score to likelihood ratio conversion.
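for readers unfamiliar with cllr, it can be computed directly from the validation-trial log likelihood ratios. a minimal sketch, taking the llrs in base two for simplicity (implementations often take natural-log lrs and convert):

```python
import math

def cllr(same_speaker_llrs, diff_speaker_llrs):
    """Log-likelihood-ratio cost (Cllr): average penalty, in bits, for the
    likelihood ratios a system produced on known same-speaker and known
    different-speaker trials.  Lower is better; a system that always
    outputs LLR = 0 (LR = 1, uninformative) scores exactly 1.0."""
    p_same = sum(math.log2(1.0 + 2.0 ** -llr) for llr in same_speaker_llrs)
    p_diff = sum(math.log2(1.0 + 2.0 ** llr) for llr in diff_speaker_llrs)
    return 0.5 * (p_same / len(same_speaker_llrs) +
                  p_diff / len(diff_speaker_llrs))

# Uninformative system: Cllr is exactly 1.
print(cllr([0.0, 0.0], [0.0, 0.0]))          # → 1.0
# Well-behaved system: positive LLRs on same-speaker trials,
# negative LLRs on different-speaker trials, so Cllr is below 1.
print(cllr([3.0, 2.0], [-3.0, -2.0]) < 1.0)  # → True
```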

and if you like tippett plots, here are the tippett plots: there's the generic data system, and there's the case specific one. and if you understand tippett plots, that's a huge difference.
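a tippett plot is just a pair of cumulative curves over the validation llrs. conventions vary between authors; one common form, the proportion of trials with llr at or above each threshold, can be sketched as follows (the function name is made up for illustration):

```python
def tippett_curves(same_llrs, diff_llrs):
    """Coordinates for a Tippett plot: for each threshold t, the proportion
    of same-speaker trials with LLR >= t and the proportion of
    different-speaker trials with LLR >= t.  A large horizontal gap
    between the two curves means the system separates the two kinds of
    trials well."""
    thresholds = sorted(set(same_llrs) | set(diff_llrs))
    same_curve = [(t, sum(1 for x in same_llrs if x >= t) / len(same_llrs))
                  for t in thresholds]
    diff_curve = [(t, sum(1 for x in diff_llrs if x >= t) / len(diff_llrs))
                  for t in thresholds]
    return same_curve, diff_curve

# Toy LLRs (illustrative): the curves would normally be plotted.
same_c, diff_c = tippett_curves([1.0, 2.0, 3.0], [-3.0, -1.0, 0.5])
print(same_c)
print(diff_c)
```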

david van der vloed, who has already been mentioned, is doing very well in this presentation for not having been here.

he's already started doing the evaluation, and we've got some results from him, which he's kindly allowed us to show here. he was testing batvox with its different user options. in batvox, one user option is the reference population: you put in either data from all hundred and five speakers, or you let batvox select a subset of thirty. and he tried using no impostor data, and he tried using impostor data from all hundred and five training speakers.

here are the results, summarized: if you use data from all the speakers, instead of having batvox select a subset, you get better performance. if you compare using impostors versus not using impostors, using impostors gives you better performance. so the combination that gets you the best performance is the two together.

and if you like tippett plots, there's a tippett plot. one thing that's clear to notice is that when you're only using the thirty speakers selected by batvox, there's a clear bias here, whereas there's maybe a bias there, but it's less clear.

okay, i'll skip the last part.

thank you. we have just time for one question before we move into the final phase of open questions on all the presentations. remember, the session ends at nine forty five, so there's less than ten minutes. so if we could begin with some questions for geoff, that would be great.

if the data was totally appropriate, would it be viable to do a comparison of the two systems that you put up, based on your evaluation?

i was prepared for that question. here's the best of each. the red one is the best of the batvox systems, and the blue one is the best of the i-vector systems we did. so the blue one is better in terms of cllr, and there's the difference in terms of the tippett plots as well.

right, and i think, going back to our systems, there were two versions of our systems, and i think the big difference was using case relevant data all the way through, whereas the other version was using a lot of generic data to get to the score level. and i think batvox works better than our system that used generic data at the beginning, but i think ours works better than batvox when we use case relevant data all the way through.

what's the difference in the likelihood ratios for the case data? that's the crucial thing. sorry?

what was the outcome? so you've compared two systems, but i would like to know what the difference is in the likelihood ratios that the systems gave you for the actual comparison, for the actual case. yes?

well, we haven't tested that. when we did the actual case, we chose one system, we used one system. so for doing the casework we chose one system, we validated the performance of that one system, and we didn't go out and try a whole bunch of other systems on the actual case.

right, because when we do casework, it's not a research activity; we're not trying to choose the best one. and also, the problem that comes up is: okay, you might say we chose three or four different systems and then we picked the one that worked the best. we would then be overtraining, overtraining on the test set: we'd have optimized to the test set rather than to the previously unseen actual suspect and offender recordings.

and then there's also the problem of, you know: okay, you've presented three different systems; which one should we believe?

precisely, and that's what i, or the defence counsel, would ask. yes, and not that i would have expected it to happen, but suppose one of the systems gives you a log lr of minus five and the other one gives you a log lr of four.

right, so certainly... what we do in our practice is: we pick the system we're going to use, we optimize it to the conditions of the case, and then we freeze the system.

we then test the system using test data, and after that we don't go back and change the system again. that's it; that's how well the system works. and then the last thing we do is run the actual suspect and offender recordings.

so we don't go: gee, i got a relatively low likelihood ratio; who's paying me? the prosecution; they want a high one; i'll go back and fiddle with the system and i can get a better answer. we keep a strict chronological order to avoid any of that, not that anybody suggested we would be doing anything like that.

yes, i understand that, but we're talking about different systems; i know the last point was all about freezing the system, but at the moment we're comparing systems. that's what this was about. well, the results there, where we're comparing systems, are across a whole bunch of test trials, so it's averaged over a whole bunch of test trials.

so from the comparison of the two different systems, based on this, you might decide that you want to use one of them, you might decide you want to use the best performing system. but in a future case, you would maybe decide to choose one of those systems, and if the conditions of that future case are different, i would then test the performance of the system under the conditions of that new case. i might have decided on the basis of this case, but i'm not taking this case as the validation for a case whose conditions are very different.

i know you're in a hurry, but i guess my question goes to both michael and geoff. as you go through your casework: most judges are not experts in, say, speech or speaker verification. so if you're working, for example, with tippett plots, do you present those in court proceedings? and if so, how do the defence and prosecuting attorneys actually react? the proper way to ask it: how do you present the support, the plots, how do you present results?

yes, on your first point: in recent years we did include the tippett plot, together with the case specific testing that was described before. so we do explain everything and try to make it easy and so forth. we're not shielding the court from those results; we're giving them the results and then trying to explain and assess how this is used.


yes, it's all stuff that we put in our reports; of course we see it as the validation of the system. and typically it's a solicitor or a lawyer i'm dealing with,

and then they phone me up and start asking me questions: what does this mean, what does this mean? and i say: no, okay, i'll come to your office, we'll spend a day together, i will go through the basics with you so that you've got a sufficient level of understanding, and then the next day you can ask specific questions about this particular case and this particular report. and so, you know, we start by doing the very basics, what's a likelihood ratio, and sometime by mid afternoon we get to the testing level and to explaining what something like a tippett plot means.

and then you get to court, and the courts seem to be designed to prevent this transfer of information from the expert to the trier of fact.

because, you know, if you were going to train somebody, what would you do? you might send them something to read beforehand, you give them a little lecture, you get them to ask questions, you ask them confirmation questions to see if they understand. but in court, the lawyer asks you the questions and you answer only those questions, and the jury isn't allowed to ask questions.

getting the trier of fact to understand this is a serious problem.

i'm not a researcher in that area, so i don't have good solutions; that's the one thing.

thank you

i just have a suggestion, which is to stop seeing the likelihood ratio as a single number. the likelihood ratio is not a number, it's a ratio, and it's very important to be able to present both parts of the ratio: the similarity and the typicality. it's really important, for example, because when you are changing the reference population, it could be very interesting for the court to make the link between the similarity and typicality values and your decision about the reference population.
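that decomposition can be illustrated with a toy single-feature model, where each part of the ratio is a density evaluated at the evidence value; the gaussian models and all the numbers here are purely illustrative:

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def likelihood_ratio(evidence, suspect_mean, suspect_sd, pop_mean, pop_sd):
    """LR as a ratio of two densities evaluated at the evidence:
    numerator   = similarity (how well the evidence fits the suspect model)
    denominator = typicality (how common such a value is in the
                              reference population)."""
    similarity = normal_pdf(evidence, suspect_mean, suspect_sd)
    typicality = normal_pdf(evidence, pop_mean, pop_sd)
    return similarity, typicality, similarity / typicality

# Toy numbers (illustrative): an evidence value close to the suspect's
# model but unusual in the reference population yields a large LR.
sim, typ, lr = likelihood_ratio(evidence=5.0,
                                suspect_mean=5.2, suspect_sd=0.5,
                                pop_mean=2.0, pop_sd=1.5)
print(sim, typ, lr)
```

reporting `sim` and `typ` alongside `lr`, as the questioner suggests, lets the court see which of the two parts drives the result, and how it would change with a different reference population.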

in fact, some new software, perhaps also batvox, will show you in the report how the likelihood ratio is actually calculated from the evidence: where the evidence falls with respect to the suspect distribution, and with respect to the distribution coming from the reference population, the different-speaker distribution. so you see the two distributions and the evidence point, and you can see how the likelihood ratio is calculated.

the question is then, and this is important: the court can request that these be added to the report or not, but one can always see inside how it is calculated.


i guess there are two pieces going on here. one thing geoff ended up presenting was the accuracy of the system, right, the performance of the system. and then we have the whole thing about the likelihood ratio, that number that comes out, that you would present to the trier of fact, which we all think is, or seems to be, the going way, the way to go.

one issue i have with the likelihood ratio, when we talk about it being a number, is that there is no real ground truth likelihood ratio, right? in reality, the only ground truth likelihood ratios that we can even calibrate ourselves to are infinity and zero, right? it's either true or not true. between those two poles we start saying that we have actually evaluated a likelihood ratio of six point three,

but we never actually evaluate it that way; we don't estimate the likelihood ratio relative to any ground truth likelihood ratio, because the ground truth likelihood ratio lives at the poles. right, we only evaluate it through the posteriors; the llr is tied to the posterior.

so i guess my question to the people who go to court is: what do you say the ground truth is? how do you say what it means to be between the two poles? when you talk about a ground truth likelihood ratio, what is that?

thank you. maybe one answer: for me, and this is a personal opinion, for me the answer is the calibration of the likelihood ratio. so i definitely agree that the only ground truth is the final label, which of the two propositions is true.


what we have tried to do, in this validation guideline that comes from the work that has been done precisely here in speaker recognition, is to say: okay, a likelihood ratio is better if it supports the right decision, and the decision has a ground truth, the final label.

and there's another issue, which is the issue of calibration. calibration helps you to make better decisions, because if your likelihood ratios are calibrated, then when you apply them to make decisions, the expected cost is reduced.


that's one effect of calibration, and the other is that calibration gives you a kind of tuning mechanism that generates heavier or lighter weight of evidence depending on your discriminating power.

so systems with very good discrimination should generate higher likelihood ratios in good conditions than systems with weaker discrimination, which should generate weaker likelihood ratios. those are the two properties of calibration: on the one hand you improve your decisions, which is the final accuracy measure you're looking for, and on the other hand you have this kind of limiting entity that is telling you, okay, if you are not discriminating well, the likelihood ratios should be small. so that's the idea behind the performance measures that have been proposed.
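that limiting property can be seen in a toy equal-variance gaussian score model: the same observed score yields a much stronger calibrated llr from a system whose same-speaker and different-speaker score distributions are well separated than from one where they overlap. the model and numbers are illustrative only:

```python
def calibrated_llr(score, same_mean, diff_mean, sd):
    """Calibrated natural-log likelihood ratio for a score under
    equal-variance Gaussian same/different score models:
    llr = (same_mean - diff_mean) / sd**2 * (score - midpoint)."""
    midpoint = 0.5 * (same_mean + diff_mean)
    return (same_mean - diff_mean) / sd ** 2 * (score - midpoint)

# Strong system: same/different score distributions far apart.
strong = calibrated_llr(score=2.0, same_mean=2.0, diff_mean=-2.0, sd=1.0)
# Weak system: distributions close together, so calibration keeps the
# magnitude of the reported LLR small for the very same score.
weak = calibrated_llr(score=2.0, same_mean=0.5, diff_mean=-0.5, sd=1.0)
print(strong, weak)  # → 8.0 2.0
```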


i mean, i know this feels a bit philosophical, but it also just seems that, in everything we want to say, we're presenting this likelihood ratio, and talking about scales, right, verbal bands for it, but at the end of the day, what we're really talking about is a decision,

which has priors in it. even when you calibrate, everything's done through priors that are there; you may say you've integrated them out, we go through all this, but the reality is, at the end of the day, a six point three: you can't say that, in ground truth, my estimated likelihood ratio of six point three was really close to the true likelihood ratio, because the truth is at the poles, and what gets you to the hard decision is the prior. so i think, in a way, that's what geoff was getting at too:

you're really just trying to tell people: over all the times i used this, this is how often it said the same when it really was the same, and this is how often it said the same when it was not, you know, the typicality and the similarity. i'm just wondering, in the sense of breaking it down:

are we making it more complicated? in trying to describe this to the court, are we getting too complicated by overlaying it with so many issues, and training ourselves to try to get away from any of the priors, versus just trying to give

a simple answer? like, at this forensic meeting, one guy set up a purely visual way of doing it. you put down the dots: here are all the dots from when i ran it and it was the same speaker, here are the dots from when it was not the same, and here's the dot from when i ran the case data through, and you can visually see where it sits relative to the others. it's true that that's the two distributions, but in some sense it's almost just saying: here's what i got when i knew the truth, one way and the other, and here's where this case sits, and you, looking at it, can decide whether you think it's closer to the one or the other, without overlaying so many issues on boiling it down to a single number.

but i mean, that's using the same equation. one of the things, in my opinion, is that it's not only about the likelihood ratio; likelihood ratios and this framework are a kind of support for what we are trying to do. but there's another issue, which is competences. the likelihood ratio is there mainly because of the division of competences: the final decision has to be made by someone,

and that person, the fact finder, asks for some information. so how can the person who has information that the fact finder does not have communicate his opinion about that piece of information, which the fact finder then tries to integrate with the whole?
that's the main issue behind the likelihood ratio framework. the decision could be made by anyone, but there is this separation of competences. so the formalism is there because of real problems: leaving everything without a formalism leads to things that are considered illogical, because then decisions are effectively made in the reports. that's one issue.

on the issue of simplicity, of complexity, i fully agree with you. i think that things have to be made much simpler. i was talking about this with joe before, at the banquet yesterday: if you've got a chemical analysis,

there's one guy who expresses his opinion about a comparison of two pieces of glass, using scanning electron microscopy with energy dispersive x-rays and so on, and nobody is asking him to display what's going on inside the microscope, or whatever energies are present, right?

so for me, the way to go is to have agreement in the community, to have standards regulating the use of the procedures, agreed standards that come with some kind of quality assurance and error rates. in my opinion that's the way to go. giving a lot of information to judges is something that can be counterproductive, in my view, so the balance between transparency and not biasing communication is important. so i think your argument is in some way about this issue, and it is a very important issue for me: keeping things simple is the starting point if you want to put a new method forward.

can i just add: i think there are lots of details that we can talk about later, but i think we have to present something which we believe is logically correct first, and then we have to worry, second, about how to communicate it. it's not appropriate to present something which we believe is logically incorrect just because it's easy to present.

and in the exact example you were giving, i think that's one where, if the jury looks at that, they will immediately jump to "it was him"; they will jump to an identification. and so i think that's a problem.

okay, we have to move on; maybe this one.



yes, i just want to comment on the point proposed a moment ago. i agree, and i think we should be honest: when we experts are writing a forensic report,

it's not for the judge. it's only for someone, for example another expert, maybe on the opposing side, who will be able to examine your information and to give some input to the lawyer.

when we are engaged and in front of the court, the only important thing is how you present your opinion. it's based only on what you are saying there; it has nothing to do with the likelihood ratio. you could say "my likelihood ratio is ten", but then, you know, it comes down to whether they believe me or like me, you know.

so we have to be clear: the report is on the scientific side, so that people can find enough information in order to criticise the work,

and in court the information is given orally by the expert.

i have discussed this with many forensic scientists, and we always agree that transparency is important, that everything has to be transparently reported, and so on.

talking about explanations in court, though, the balance has to be taken into account. for example, in dna analysis, in the eighties, they started to use probabilities for reporting, and it was a huge mess for ten or twenty years; people got things exactly wrong, and interpretation fallacies were common.


that experience tells us that there has to be a balance in reporting. transparency is important, on the one hand, but when someone goes to court to explain what the report says, it is probably better to keep things as simple as possible rather than to complicate them.

for me, for example, having a performance graph with a lot of details can be okay for us, but when you show that to a lawyer, probably the information that he takes from that graph is not what you're trying to convey.

so the problem is the level of detail at which you are transparent: probably too much detail gives the person listening a different message than the one the person speaking intended.

So the balance has to be there. I'm not saying how to do things, but the balance has to be there. And I fully agree with you: we have to be transparent. Transparency and the level of detail are things that have to be considered.

Can I? Do you want me to? I can add on that, too.

Just one minute.

Should I? Okay.

I think what you're saying is really important, and you can never sort of leave off the responsibility for what you're actually expressing in court. It's somewhat subjective, whatever you do: you judge the weight of the evidence, and there's a danger in that.

I read something in the theory of science about something called "physics envy", and that very much appears when you move into a different paradigm. The justice system is actually a completely different paradigm, where argumentation is actually the thing that they are doing.

It's not engineering or science in the way we are used to it. You do a lot of analyses, but when you end up in court it's a lot of argumentation, and you express some opinion based on all the analyses you made.

This is the big point with the physics envy: that you don't leave the responsibility to just the numbers, saying "I have this system, this is the score, and you do whatever you want with it", because the communication is equally important, and so are the uncertainties, I think.

Like in our system with the nine-point scale, there are some likelihood ratio bands. It's not really that important, but it's also historical, and they are used to this kind of system. And of course DNA is much stronger as evidence: they much more often have a plus four, while in our cases we're almost never above a plus two, for example. You have to express the kind of strength that you can actually get to.

There are a lot of parts to it. No matter if you use an automatic system or base it on phonetic analysis, there is going to be some subjective part to it. I mean, even the things that you produce with an automatic system: you know, you have chosen the data, so there is some subjectiveness to all of it.

So I think it's really important to remember that. I think someone has written a really good article on this in the theory of science, on physics envy, because if you show all these numbers and all these graphs, nobody will understand in court, I promise you.


The defence lawyer will say something like: "Okay, so you actually adjusted your system to the case, did you do that?" And then probably, in the end, when you've been in court for twelve hours in the chair, he forces you to say that you did that. Then he's going to say: "Okay, so it's subjective." And then you're done.

So you have to really think about how you express things in court: try to stick to your opinion and what you based it on. But remember the physics envy, I think it's really important: they see numbers and a score and they will just go "he's really smart, this guy", you know.

So, okay, good. Thank you so much, and I think we want a round of applause for all the panelists.