0:00:35 Well, this is how we organised it: of the three authors, it was the first author, individually, who had the general idea of source normalization. I'm the one presenting it, right after the previous talk.
0:01:01 If, after this talk, you think "this is fantastic, I'm going to implement this tomorrow or next week", then what you need to do has already been done by the first author, and he also made the slides, so I'm very happy with that. If instead you think afterwards "I didn't get it; why didn't I think of this before?", that is probably due to me not being able to convey the message. And if you thought the same thing beforehand, then you're sort of even with what we have.
0:01:45 This is sort of an automatically generated summary of my presentation today, which I think is kind of pointless in this particular case, because it contains lots of acronyms that can only be explained later. So, on to the motivation.
0:02:01 The motivation for this work is the following. In speaker recognition we all follow the NIST evaluations, where things change from year to year, so we get into the situation of receiving new data we haven't seen before. What kind of data, what kind of noise? Maybe some people know, but most of us don't know how we're going to deal with that. I sometimes have to guess in advance what I'm going to talk about, and every once in a while it's actually rubbish, I guess, because I haven't seen the data it is going to come from.
0:02:46 But anyway, the basic idea is that if the conditions in train and test are of a different kind, you would have liked to have seen that before. What do you do if you know there will be conditions you won't have seen? One way of dealing with that is this idea of source normalization, which I'll try to explain. Oh, here are some slides about i-vectors; I think I'll skip these two, since you probably understand this much better than I do.
0:03:25 The basic idea, just to review the i-vector for this particular presentation, is that it is a very low-dimensional representation of the entire utterance, containing, apart from speaker information, other information as well.
0:03:41 Essential to the idea of source normalization is that one step we do in the standard approach, namely the within-class covariance normalization (WCCN) before the PLDA, needs to be changed. It is in training, where the within-class and between-class scatter matrices are computed, that the source normalization takes place.
0:04:18 So here I note that we actually need to estimate those scatter matrices. This is the mathematics, just to stay in line with the previous talks, so that we have at least some mathematics on the screen: this is the expression for the within-speaker scatter matrix, and this is what source normalization is going to try to estimate in a better way.
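For reference, the standard WCCN step that the talk says needs changing can be sketched as follows; this is a generic numpy version of the textbook technique, not necessarily the exact implementation behind the slides.

```python
import numpy as np

def wccn(ivectors, speakers):
    """WCCN: return a whitening matrix B with B @ B.T = inv(W), where W is
    the within-speaker covariance estimated on the training i-vectors."""
    dim = ivectors.shape[1]
    W = np.zeros((dim, dim))
    for spk in np.unique(speakers):
        x = ivectors[speakers == spk]
        d = x - x.mean(axis=0)            # deviations from the speaker mean
        W += d.T @ d
    W /= len(ivectors)
    # Cholesky factor of the inverse within-speaker covariance
    return np.linalg.cholesky(np.linalg.inv(W))
```

Applying `y = x @ B` to every i-vector makes the within-speaker covariance of the transformed training data the identity, which is the assumption the scoring back-end relies on.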
0:04:48 Because what is the problem with WCCN in this particular setting? The issue is that not all relevant kinds of variation are observed in the training data, and this happens all the more often if you don't have matched data.
0:05:11 So here is another graphical representation of what typically happens. We look at one kind of data, where the label we have in mind is, say, the language. You have lots of English-language data, and every once in a while we get some tests where the language is not English. We had that before, and the 2008 evaluation contained some as well; and who knows what you'll get in 2012. So maybe language itself is not the most relevant label at the moment, but it is a good example of where things change.
0:05:56 What is important here is that even if we have some training data for these languages, we will not have all speakers in all the different languages. Typically, the speakers are decoupled from the language: for some languages you have some speakers, and for other languages you have other speakers. So you know the problem: in the end, in your recognition, you may have to compare one segment in one language with another segment in another language, where it might be the case that it's actually the same speaker.
0:06:30 So what I'm about to show is why this kind of difference in language labels is going to influence this within-class scatter matrix. This is one way of viewing how the i-vectors might be distributed in this i-vector space.
0:07:01 The three big circles denote the different sources; in this case a source might be a language, each with its own mean. There is also a global mean, which would be the mean i-vector, I guess. Then we have some speakers: one speaker, with a little bit of variability, comes from one source; another speaker, she or he, comes from another source; and there are a few more speakers in the last one. You can imagine that if you compute the between-speaker variation here, you actually include a lot of between-source variation, and that's probably not a good thing, because what you want to model is the difference between speakers, not between sources. Yet the WCCN is going to base itself on exactly this information.
0:07:57 Related to this is that the source variance is not correctly observed: the variance between the sources is not explicitly modelled. So that's another problem for WCCN. This slide summarises again what the problems are.
0:08:27 Let's move on to the solution; I think this is much more interesting to see. How do we tackle the problem that these sources hang around globally different means in this i-vector space? The solution is very simple: compute these means for every source. So here you look at the scatter matrix conditioned on the source: we simply compute the mean for every source, and before computing the scatter matrix, we subtract these means. The effect, basically, is that you take all three sources — this could equally apply to, say, microphone and telephone data, or to languages, or other labels — and you subtract the mean per label, per language.
0:09:26 And then this scatter matrix will be estimated better. The mathematics will then say: that is a very nice fix for the within-class variation, but we still have the between-class variation — or maybe it's the other way around. Either way, the idea is that you compensate for one scatter matrix, and because you have the total variability, you can compute the other one as the difference from the total variability.
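A minimal numpy sketch of this construction; it assumes (as one reading of the above) that the between-speaker scatter is computed after subtracting per-source means, and that the within-speaker scatter is then taken as the remainder of the total variability. The function name and exact weighting are illustrative, not taken from the paper.

```python
import numpy as np

def sn_scatter(ivectors, speakers, sources):
    """Source-normalized scatter estimation (illustrative): compute the
    between-speaker scatter S_B after removing each source's mean, and take
    the within-speaker scatter as the remainder of the total variability."""
    n, dim = ivectors.shape
    d = ivectors - ivectors.mean(axis=0)
    S_T = d.T @ d / n                        # total-variability scatter
    centred = ivectors.astype(float)         # copy; means subtracted below
    for src in np.unique(sources):           # subtract per-source means
        idx = sources == src
        centred[idx] -= centred[idx].mean(axis=0)
    S_B = np.zeros((dim, dim))
    for spk in np.unique(speakers):          # speaker means of the
        idx = speakers == spk                # source-centred data
        m = centred[idx].mean(axis=0)
        S_B += idx.sum() * np.outer(m, m)
    S_B /= n
    return S_T - S_B, S_B                    # S_W = S_T - S_B, and S_B
```

With a single source this reduces to the conventional total = within + between decomposition, which is a useful sanity check.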
0:10:08 One thing this idea stresses is that you only need the language labels — if source normalization is applied to language — for the development set. So your language labels live in your development data: when training your system, you have all kinds of labels on your data, in this case the language label. But in applying the system, you do not need the language labels, because they are only used to build a better transform for this WCCN.
0:10:42 How can you actually see that it works? Well, one way is to look at the distribution of i-vectors after WCCN. When you do not apply this source-normalization technique, that is shown on the left, and the different colours encode the label that we're worried about, in this case language: you see one colour per language. The languages might be familiar to the language-recognition people here. What you see is that the languages seem to occupy different places. This is after an LDA dimension reduction to two dimensions, just to visualise the problem. And you see that with this language normalization — source normalization by language — all these different labels become much more similar, so the basic assumptions that i-vector systems are based on should hold a little better.
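Plots like the ones described can be produced with a classical two-dimensional LDA projection; the sketch below is a generic numpy version (top eigenvectors of inv(S_W) @ S_B), not the authors' exact plotting code.

```python
import numpy as np

def lda_project_2d(ivectors, labels):
    """Classical LDA to two dimensions: take the top two eigenvectors of
    inv(S_W) @ S_B and project, so each class (here: language) cluster can
    be drawn in a flat scatter plot."""
    dim = ivectors.shape[1]
    mu = ivectors.mean(axis=0)
    S_W = np.zeros((dim, dim))
    S_B = np.zeros((dim, dim))
    for lab in np.unique(labels):
        x = ivectors[labels == lab]
        d = x - x.mean(axis=0)
        S_W += d.T @ d                       # within-class scatter
        m = x.mean(axis=0) - mu
        S_B += len(x) * np.outer(m, m)       # between-class scatter
    evals, evecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    top2 = evecs[:, np.argsort(-evals.real)[:2]].real
    return ivectors @ top2
```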
0:11:53 Okay, now our system results, because we need to have tables in the presentation. First, what kind of experiment did we do? We used most of the usual NIST databases for training the i-vector machinery, but we added one specific database, CallFriend, a little database that was used for the first language-recognition evaluations, so it contains a variety of languages — twelve languages are in there.
0:12:36 As for the evaluation data, we used two data sets: the NIST 2008 dataset and the 2010 one. You might think: why would you do that? There wasn't actually much different-language data in 2010, it was essentially English. But we use it for two purposes: one is training the calibration, and the other is to see whether what we do doesn't hurt the basic English performance too much. The 2008 data, of course, is going to be used as the test data, where there are trials in different languages; and there is also the English-only condition, so that we can compare whether we actually hurt ourselves.
0:13:27 The durations are simply the standard ones; you will have seen these numbers before, I'd say, so there's nothing new here. And these are the breakdown numbers per language for the training data.
0:13:49 These, finally, are the results. Now, let me try to explain what it all means: bold figures mean the result is better. The first condition is the performance on all trials for SRE'08, measured in error rates; there is no calibration involved here. You see these numbers go down, so for SRE'08 it works when we see some other languages, and I believe that holds for English as well.
0:14:38 If we look at English only, then the numbers get a little bit worse, so it does hurt our system, but it doesn't hurt it much. The same holds for the other set: as we can see, the system gets hurt a bit, but that's about it. That's the basic conclusion there.
0:15:01 Here we have a breakdown where we look at the languages from SRE'08, at the different conditions in the trials: same language or different language, and whether English is involved. The top row should have the best performance, because it still contains the many English trials, and there the baseline system works best. But that row includes both English and non-English. If you break it down — for instance, where you demand a different language within the trial, or suppose that the target trials are same-language — we see that the new figures, the ones on the right, are slightly better than the red ones, the baseline ones. And the same holds with respect to the other condition.
0:16:01 In addition, you can look specifically at the non-English trials, where there is otherwise no restriction: there it helps. For the same-language trials, where you actually restrict the trials — you say both sides are the same language, but not English — it still helps. There is one condition where, for whatever reason, it does not help, and by a big difference; this is something we don't understand. That's for the non-English trials where you specify that the two sides are in different languages. So usually it seems to work, except for that one particular condition where it doesn't. But I must say there are actually not too many trials there — it doesn't show very nicely in the graph — so I don't know how accurate this measurement is.
0:17:04 Now I'll move on to calibration. Part of the goal of this kind of experiment is to look at being more robust in calibration across languages, and for that we use a different measure, the measure used by the keynote speaker: the Cllr. One way of looking at how good your calibration is, is to look at the difference between the Cllr and the minimum attainable Cllr, the Cllr-min; that difference is the calibration loss, and that is what is shown in this kind of graph.
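The Cllr referred to here is the usual cost of log-likelihood-ratio scores; a small numpy sketch, assuming scores are natural-log likelihood ratios:

```python
import numpy as np

def cllr(target_llrs, nontarget_llrs):
    """Cllr in bits: the average proper-scoring cost of a set of scores
    interpreted as natural-log likelihood ratios. 0 is perfect; a system
    that always outputs LLR = 0 costs exactly 1 bit."""
    c_tar = np.mean(np.logaddexp(0.0, -np.asarray(target_llrs, float)))
    c_non = np.mean(np.logaddexp(0.0, np.asarray(nontarget_llrs, float)))
    return 0.5 * (c_tar + c_non) / np.log(2.0)
```

The calibration loss discussed on the slide would then be `cllr(scores)` minus the minimum Cllr of the same scores, which requires an optimal (PAV) recalibration and is not shown here.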
0:17:59 Alright, so we have two conditions here, called "mismatched" and "matched". (I was actually thinking that, with the tendency in the United States to build words this way, we might have made a pair like "mismatched" and "misniched", but that might be too hard for you guys.) The coloured figures are the new thing that we are trying to promote here, and black is the old approach; bold marks the better figure. And we show this separately for female and male.
0:18:47 Now, for this mismatched condition — by "mismatch" we mean that we calibrate on English only; to be precise, 2010 is used for calibration and we apply that to SRE'08, or it may have been the other way around; we did it that way in order to be able to calibrate on English and test on other languages as well. In this particular condition it always works. In the matched condition, that is, looking only at English scores with a calibration trained on well-behaved English scores,
0:19:28 you see that it doesn't always help; in fact, there is one condition where what it does help with is the miscalibration itself: the extra cost you incur through miscalibration becomes less. So for calibration there is still some help, even English-only, but for the discrimination figures there isn't. Alright, I hope that explains the numbers well enough.
0:19:59 For the managers amongst us, it is just easier to draw this. This is just the miscalibration: the amount of information you lose by not being able to produce proper likelihood ratios. It goes down in the conditions where we applied the language normalization, but for English-only trials you don't notice the difference.
0:20:35 So, I have a slide with conclusions here. We used source normalization, which is a general framework, and I have to say it has been applied before: you should be able to find three or four conference proceedings about this technique, applied with the definition of the source being microphone, or interview, or telephone speech. And we have even applied it — I should say the first author has — with the source being the sex of the speaker. So even though speakers generally don't change sex, at least not within these evaluations, you can use this approach to compensate for situations where you might not have enough data.
0:21:32 For telephone conditions this didn't make much difference, but for conditions where there wasn't really much data, it did help to pool the male and female i-vectors and build a single gender-independent recognition system, and then apply source normalization, with the speaker's sex as the label of the i-vector, and normalize that way. In recognition you then no longer need the labelling, so you can basically ignore that column of your trial list. So, applied to languages it seems to work, it doesn't hurt the English trials too much, and that's basically the message.
0:22:54 We stuck to the speaker-recognition task, and we do not try to use language as a feature for discriminating between speakers. In this research, of course, you could do that very well, but we would rather take it as a challenge that you should be able to recognize speakers even if the speaker speaks a different language than seen before in the training. You could, of course, make it easier for yourself by saying: different languages, hence different speakers. It's a choice.
0:24:36 Yeah, and I remember calibration was one of the major problems in 2006, where, if you had non-English data, the discrimination performance could actually be reasonable, but the calibration was off. I'm not so sure that still holds for the systems of today, though; systems nowadays are generally behaving better.
0:25:34 No, I don't think that is what we want to say. What I think it says is that, when you estimate the channel — the between-channel variability — part of the total variance is due to the fact that segments are in a different language, and you don't observe that in the within-speaker term. So the system attributes the between-language variability to the channel variability, and that is not adequate in this case, where different languages can come from the same speaker.