Speech Transcript - What are we missing with i-vectors? A perceptual analysis of i-vector-based falsely accepted trials

0:00:14	the title of my presentation is what are we missing with i-vectors
0:00:19	a perceptual analysis of i-vector based falsely accepted trials and it was done in collaboration with people
0:00:26	from the phonetics lab of the csic the research
0:00:34	institution established for many years in spain
0:00:37	so plus not talking about
0:00:42	i-vectors
0:00:46	yes we will not i i-vectors but tones
0:00:50	and
0:00:51	those i-vectors give us a compact and elegant solution where every utterance can be represented
0:00:57	in a fixed dimension vector
0:01:01	they have also given us great and efficient performance over a wide range
0:01:06	of conditions and allow us to
0:01:09	apply state-of-the-art pattern recognition techniques
0:01:14	and more the recently we are able to perform speaker recognition without point it is
0:01:20	a really great
0:01:24	we have we can avoid a lot of problems
0:01:28	and especially and i think that in the point
0:01:32	we don't produce calibrated likelihood ratios to forensic speaker recognition when we have lots of
0:01:38	i think that in accumulating a for this we have seen a nice paper we
0:01:43	wanted from that if i own do that's not that's
0:01:48	in this paper some but what if you feel you just a and have
0:01:52	wally score be in this paper has gone given a step farther when they have
0:01:56	not only over a being able to calculate an icon regularly richer when they have
0:02:03	i have recordings from the of these channel intercept assistant but they also have obtain
0:02:08	the day i select aggregation for to collapse the all that was to do so
0:02:14	they have an assessed not just a little bit about all the pros to sell
0:02:18	this is a
0:02:20	a great
0:02:22	we as a starting point but we have to look a little more in detail
0:02:27	about
0:02:28	i-vectors
0:02:30	and they explicitly courses lead to ignore a high-level and source little information
0:02:37	so the speaker and information
0:02:39	and is reduced to reach the short term but this is has a lot of
0:02:44	advantages for features to for conditional for real points
0:02:49	some users and imitate also so
0:02:52	but being still spectral only the detection decisions
0:02:57	probably will be uncorrelated with human perception this morning joe highlighted this issue
0:03:04	of a possible loss of credibility of the system if the it's a very user
0:03:09	if i boardman ldc perceive rate disagreements between what the system is doing and what
0:03:17	what they can see that they can see that humans are pristine
0:03:22	moreover a we have almost that of that ignorance on the or you know those
0:03:27	detection errors
0:03:29	and when we have also you know system we are simply trying to restore system
0:03:33	probabilistic but we don't do not fit the specific with them
0:03:37	and we can have a transparent estimates is that which is very good but finally
0:03:42	we if we have a roast we cannot display at all what's the recent of
0:03:47	the
0:03:49	of the art
0:03:49	and it's very important to have to be able to provide explanations of all the
0:03:53	wires system is working set a specific way
0:03:57	and just a final reminder we usually assess systems on average error rates
0:04:04	but from the user's perspective
0:04:06	they perceive performance basically case by case so it can be that with a
0:04:10	large error in even a single trial the system will be affected as a
0:04:15	as a whole
0:04:17	so what we did for the paper was to select a set of i-vector
0:04:24	based false acceptance trials from
0:04:28	sre ten and hasr sre ten
0:04:32	and we're gonna some a team of us find useful additions force the not english
0:04:38	a great
0:04:40	and
0:04:41	the objective was to explore to better understand what they do with their with that
0:04:47	a date down and all
0:04:50	that it just a and sre that's
0:04:54	as we have a and we might with that of data what target type of
0:05:00	types of different that they think they could find and also the number of different
0:05:06	types of different that they can have taken finding a single signal a trial
0:05:11	and first of all a disclaimer this is not a paper on
0:05:16	speaker recognition by humans
0:05:17	both of them know in advance that the speakers in every trial are
0:05:22	different
0:05:23	so
0:05:23	all we are asking them is to highlight differences that they find
0:05:29	between the two utterances but without any decision just to
0:05:36	see what they can find
0:05:39	and
0:05:40	in trials where the i-vector system has provided a
0:05:44	likelihood ratio greater than one
0:05:48	as it takes them a long time for the analysis we had to select a subset
0:05:53	of trials
0:05:55	so we selected we will use the scores from our submission to nist two thousand
0:06:01	and ten
0:06:02	and what we did was a outlier proper selection
0:06:06	first of all we took the sixteen false acceptances that we actually
0:06:12	had
0:06:12	with the hasr
0:06:15	with the hasr set
0:06:17	but also as those trials were specifically selected to be specially difficult for
0:06:25	humans just in case that was at peace stuff on it for that for the
0:06:30	analysis we also selected fifty different false acceptance trials from the sre ten
0:06:38	and in that case we had thousands of different
0:06:44	trials and the condition was that we selected just those with log likelihood ratios in the range
0:06:49	from three to five which translates into likelihood ratios between twenty and
0:06:54	one hundred and fifty also so to those were a big are for example systems
0:06:59	that we usually
0:07:01	and how when we use our i-vector systems with
0:07:05	and eight now with the real by a lot of availability
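The selection rule described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scores are taken to be natural-log likelihood ratios (which is what makes a range of three to five correspond to likelihood ratios of roughly 20 to 150), and the trial records, field names and the `select_trials` helper are hypothetical, not the authors' actual tooling.

```python
import math

def select_trials(trials, lo=3.0, hi=5.0):
    """Keep non-target trials whose natural-log likelihood ratio lies in [lo, hi].

    A non-target trial with log LR > 0 (LR > 1) supports the same-speaker
    hypothesis even though the speakers differ, i.e. it is a false acceptance.
    """
    return [t for t in trials if not t["target"] and lo <= t["log_lr"] <= hi]

# Hypothetical toy trials, invented for illustration only.
trials = [
    {"id": "t1", "target": False, "log_lr": 3.7},   # false acceptance inside the range
    {"id": "t2", "target": False, "log_lr": -2.1},  # correctly rejected non-target
    {"id": "t3", "target": True, "log_lr": 4.2},    # target trial, not of interest here
    {"id": "t4", "target": False, "log_lr": 5.6},   # false acceptance above the range
]

selected = select_trials(trials)
lr_bounds = (math.exp(3.0), math.exp(5.0))  # likelihood ratios for the range limits

print([t["id"] for t in selected])
print(lr_bounds)
```

If the scores were base-10 logarithms instead, the same range would correspond to likelihood ratios between one thousand and one hundred thousand, so the base matters when reading the figures quoted in the talk.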
0:07:10	and after that we ended up with sixty six trials and they are there
0:07:15	are short rehearing not the about the mean this but trial they select it does
0:07:21	for detailed work eighteen trials nine male and nine female for them probably it's
0:07:26	a it's a and fourteen from a test everything
0:07:37	this is the final this which is in the paper just i want the soda
0:07:41	because we will and referred to every trial using the them
0:07:46	the number of the target id
0:07:49	ability of which one of the speakers
0:07:53	second disclaimer i'm not a phonetician i even have problems with english
0:07:59	okay i would be talking about but of things that my colleagues is therefore that
0:08:04	takes a lot declared it so yes
0:08:06	my apologies if i say something not right
0:08:12	and this is the range of features that they will explore they will be
0:08:17	noted by phonation type temporal characteristics what extent means that what the characteristics degree
0:08:23	of the solid deep or something like than all the type of non-linguistic features or
0:08:28	what robert was impressions of
0:08:31	so that they will just
0:08:33	what they will extend
0:08:36	what they do with the selected trials is to perform detailed listening of
0:08:41	about one hour for each one of the trials and we focus on the features
0:08:46	which are present all along the conversation
0:08:50	i will show some samples
0:08:52	but that is
0:08:54	the feature the differences that they are finding are present along
0:08:58	the whole conversation
0:09:02	and those comparison will be maybe linguistically k compare compatible segment example select you think
0:09:08	that set consisting of motown and finally some of the observation would be confidence through
0:09:15	acoustically or estimate a and then
0:09:20	by seasonal i used in mentioning that might expect so you don't seem a spectrogram
0:09:27	so the last part of my presentation will be simply so and some of the
0:09:32	use a file
0:09:34	in every case i will show the number of the trial
0:09:43	where the audio comes from and also the likelihood ratio that is the value of the
0:09:48	degree of support that the i-vector system has given to
0:09:52	the same speaker hypothesis so we know in advance they are different
0:09:56	this the i-vectors is that we say
0:09:59	and then the same of these c same speaker and we will see it for
0:10:04	every trial
0:10:05	and the that the that fault
0:10:08	degree of support of that are that can easily and english
0:10:12	all possible this is a case without a very high misleading value on the three
0:10:19	just and the operator what we use an obtain even for targets
0:10:25	and in that case for example what they found is that this for speech a
0:10:31	lot of the whole conversation is
0:10:33	and not different
0:10:35	no but we do you wanna go well
0:10:39	the it's for the blue line
0:10:42	for the right one
0:10:44	i really but i four
0:10:48	a sound like different by the that are over a regular or you are well
0:10:55	i really i four
0:11:02	and a set of features that they then used
0:11:05	you just about the long as variability
0:11:08	in the collective synthesis people usually tends to decrease the energy at the end up
0:11:13	there is at least that's happened with the for speaker in that case
0:11:24	our that the second speaker in that try out is
0:11:27	keeping the same stress can do you and we'll especially for to keep that log
0:11:34	in this
0:11:42	and this is consequently repeated during the whole conversation
0:11:48	in this case and which has which had a celebration of at a smaller value
0:11:53	obviously value and there's a
0:11:57	only dysphonic voice you once only one of the sides of the conversation is that
0:12:09	they have no idea what are okay
0:12:15	they have no idea what like are okay
0:12:22	is that is for the one
0:12:24	well there are no but neural network grammar
0:12:30	well there are no but you'll never bigger
0:12:34	for example you that are compared to the one light both phase right
0:12:39	but
0:12:40	and this is the spectral analysis of the of that powering latt uses a
0:12:46	without hi everyone would ratio on you know we have
0:12:50	much lower
0:12:54	another type of situation that they found is the presence of creaky voice
0:12:58	for example it is not very usual to find a speaker like the second one
0:13:03	here who speaks during all the conversation with creaky voice
0:13:08	i
0:13:11	i normal rate and this
0:13:15	second one no you know
0:13:18	no you know
0:13:22	this is not very frequent but it is present in this case and it's very
0:13:25	creaky but what is quite usual is the presence of creaky voice at
0:13:32	the end of the phrase
0:13:35	we will pop up a sample here in that case work like ratio measly like
0:13:42	results about fifty
0:13:47	two
0:13:51	one segment well
0:13:54	well
0:13:56	well
0:13:58	well
0:14:02	we also found issues about sorry more boys system where you the voice difficult is
0:14:08	to haul the bit it's a similar segment with and that type of speech you
0:14:15	can see the
0:14:16	tennessee of the mean value is quite similar however the second one we have
0:14:21	you use the oscillation problems to maintain that
0:14:27	i together i
0:14:32	i get a very i
0:14:36	second one
0:14:37	we will be known
0:14:42	no
0:14:51	also a feature that was found was about the speech rate
0:14:56	for example in that case there are two different speakers which showed different
0:15:02	levels of activity
0:15:04	what about how would be better marketing
0:15:08	moreover
0:15:12	it was bigger really
0:15:14	we were able to leave
0:15:20	there are also issues of hyperarticulation for example
0:15:24	the phase
0:15:25	really different see if you're selling you know
0:15:29	one the other one i for like you know
0:15:33	well
0:15:34	almost basis some
0:15:36	also this can be found in other cases with the
0:15:42	without using any of a key and where the formant a three of on here
0:15:47	it's much more the about more standard for speaker
0:15:52	your
0:15:55	second
0:15:56	huh
0:15:58	the form of a second formant is much lower than the
0:16:04	signal for one for speaker
0:16:06	also there may be found differences in the specific patterns of realisation of some
0:16:12	sounds for example finding differences in the type of s that the
0:16:17	speaker uses
0:16:19	for example in that case and the as in that speaker starts of the five
0:16:27	hundred you're while the as in the second speaker this
0:16:32	start above
0:16:34	three thousand system or a standard student s
0:16:38	i
0:16:45	also cases where there are differences in the degree of nasalisation
0:16:53	sample here i
0:16:56	this is like that
0:16:57	you don't want together
0:17:01	and that of kind of nice of voice when and in this case the other
0:17:06	one is i per thousand since we have a goal or something
0:17:15	also that uses about impaired melodic voices
0:17:18	so regular
0:17:23	no we in
0:17:27	what is the one i know you know
0:17:35	in some cases they found extralinguistic features for example the noisy breathing of
0:17:40	that speaker
0:17:42	you can hear
0:17:44	that you are construction some parties
0:17:49	for
0:17:58	for example
0:18:01	well as well
0:18:08	while the second one has no noisy breathing at all
0:18:12	they're also presents all squats or
0:18:16	strong not control of the o
0:18:21	g
0:18:26	e
0:18:29	or not the case of some of the presence of rectly voice
0:18:33	e and o
0:18:42	i go off all gonna
0:18:47	so finally this is the summary of
0:18:51	this work where the idea is that if you look at the rows of the
0:18:55	table you can find the amount of times that one given feature is
0:19:02	found
0:19:03	but it's more important to look trial by trial at the columns of the table
0:19:09	and see that
0:19:11	for every trial there are
0:19:13	there's an average of about four different types of differences that are found
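The two ways of reading the summary table (how often each feature is found, and how many types of difference appear in each trial) can be sketched as below. The annotations are hypothetical toy data invented purely for illustration, with feature names borrowed from the talk; they are not the table from the paper, and the trial names are placeholders.

```python
from collections import Counter

# Hypothetical annotations: trial id -> set of difference types the listeners noted.
annotations = {
    "trial_a": {"phonation type", "speech rate", "creaky voice", "realisation of s"},
    "trial_b": {"creaky voice", "nasalisation", "noisy breathing"},
    "trial_c": {"phonation type", "creaky voice", "speech rate",
                "hyperarticulation", "nasalisation"},
}

# Reading by feature: in how many trials was each type of difference found?
feature_counts = Counter(f for feats in annotations.values() for f in feats)

# Reading by trial: how many different types of difference were found in each trial?
per_trial = {trial: len(feats) for trial, feats in annotations.items()}
mean_per_trial = sum(per_trial.values()) / len(per_trial)

print(feature_counts.most_common(3))
print(per_trial)
print(mean_per_trial)
```

With real annotations, the per-trial counts are the figures behind a statement such as an average of about four types of difference per trial, while the per-feature counts suggest which cues might be worth trying to detect automatically.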
0:19:19	especially of interest if we want to make automatic procedures to detect
0:19:25	some kind of features are the features related to phonation type
0:19:33	creaky voice and those
0:19:37	related to specific patterns of realisation of specific sounds
0:19:42	so to summarise
0:19:44	we have shown that perceptual analysis shows no correlation with the i-
0:19:51	vector false acceptances
0:19:53	and
0:19:54	there is detectable a useful information goals trials that just produce away from poland uses
0:20:01	what one bs recognition rate is
0:20:04	furthermore there's like
0:20:06	a relational
0:20:07	and specifically the but the realisation that bit of a specific cells
0:20:12	but also at would that those could provide an
0:20:17	we try to reach no signals transcription of the whole utterances and they could be
0:20:22	used to provide some kind of soft information or
0:20:26	and
0:20:28	this what specific highlight the inter some provide an objective measurements about this for you
0:20:34	not the spectral features especially for speaker
0:20:38	thank you
0:21:00	just listening to
0:21:01	second creaky wanna sell like was actually clipping happening in the first
0:21:06	creaky voice
0:21:08	solves one it
0:21:09	perhaps the reason the system to see the same because audio clip like three
0:21:16	was there any analysis on when you when people listening to these false like taking
0:21:22	part of the audio acquisition and one that was quality as well
0:21:28	there was no it's okay
0:21:30	special analysis or preprocessing of the data we just selected the data
0:21:34	as it was and it was given to them and what they found
0:21:40	is just what they did so
0:21:45	how can you tell them so
0:21:57	what's the variance from the sets consist of experts on
0:22:02	to ten
0:22:05	that's good
0:22:06	there was a very high actually the second one was a student of the rate
0:22:12	was just you from one to and
0:22:15	maybe they provide for they come from the same school of listening
0:22:19	and then the degree of agreement ones
0:22:24	impressing we will be working completely separate
0:22:29	i we have to say that there were no this is chosen there were no
0:22:32	scoring sorry what does i found difference on i five difference on but the degree
0:22:37	of
0:22:38	but i can say that it was almost exactly the same maybe there was one
0:22:43	of the differences and that one of the informant the of the on
0:22:53	i was wondering
0:22:54	since you only used
0:22:57	non-target trials
0:22:59	if you had conducted the same experiment with the same phoneticians on target trials
0:23:04	how many of those differences would they also find especially the prosodic differences
0:23:12	of course they will find a lot of them what
0:23:14	we are trying to do is to look for clues we rolled analysis nowhere to
0:23:20	look for
0:23:21	for a different of information and of course those prosodic and just prosodic information that
0:23:28	prosodic information is very easily and modify a and b and you can depend a
0:23:34	lot on the on the type of conversation
0:23:36	that's why i stress the idea of the issues of
0:23:41	voice production and specific patterns of realisation which can be much more dependent upon
0:23:46	the speaker but
0:23:48	of course this part of the word that could be don't and of course they
0:23:52	would because when
0:23:55	i suppose like then participate in that kind of a humans just this evaluation they
0:24:01	also did not
0:24:09	yes as the result of this analysis the use it just the but kind of
0:24:14	features that we used
0:24:17	system the future
0:24:20	which so you mention the prosody given duration what do you suggest
0:24:28	that we look at for improving system
0:24:33	i'm not suggesting anything special i'm just giving the information of what they found but what
0:24:38	i'm saying is that for example
0:24:41	those voice quality features and
0:24:43	specific patterns of realisation of some sounds have a good degree of
0:24:49	parameters that can be
0:24:50	properly detected
0:24:52	let's see if they can improve the overall system

What are we missing with i-vectors? A perceptual analysis of i-vector-based falsely accepted trials

Speaker Modeling I

Joaquin Gonzalez-Rodriguez, Juana Gil, Rubén Pérez and Javier Franco-Pedroso