0:00:06 So, yesterday it was mentioned that it might be necessary to do something to wake up the audience, and the suggestion was to do a colour wheel, which I didn't do; I'm also not going to do that today.
0:00:31 I was assisted in this work by my colleague, and we're from South Africa.
0:00:41 This is the second paper out of three, essentially on the same topic, so I'll start by defining it, then say what is different in our paper from the other two, and then we'll get to the rest.
0:00:59 In defining what I'm calling the partitioning problem: we're all very familiar with the canonical detection problem, where you need to decide, given two speech segments, do we have one or two speakers here. If you want to generalise that, which is essentially my main interest here, how do we generalise the canonical problem: we allow more than two input segments, and the natural questions are then how many speakers are there, and how do we divide the segments between the speakers?
0:01:42 So here's an example, the simplest generalisation: we go from two to three. The task is then to partition the set of inputs into subsets. This is why we're calling it a partitioning problem, because that's just the set-theory way of stating the problem. Immediately we get five possibilities; it should be quite obvious that there could be one, two or three speakers, and in the two-speaker case there are three ways to partition.
0:02:19 Partitioning is the most general problem of this kind, if you assume there's a single speaker in each segment. I'm saying it is the most general because, if you have the answer about how the segments are partitioned, you can answer any other kind of detection, verification, identification (open- or closed-set) or clustering problem that you can define within this set. The connection with diarization has already been mentioned: there you also need a segmentation, but we're presupposing the segmentation is given.
0:03:05 The problem is general, which is good, but you have to be careful, because the complexity explodes. Here's a little table that shows that the number of possible ways to partition can become very large very quickly. Again, for the canonical problem there are just two solutions; for our example, which we'll discuss some more, it's five; and we're not going to discuss the last row in full.
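The counts in that table are the Bell numbers: the number of ways n segments can be partitioned among an unknown number of speakers. A minimal sketch of how one could reproduce the table (mine, not from the paper), using the Bell-triangle recurrence:

```python
def bell(n):
    """Number of ways to partition n speech segments among speakers:
    the n-th Bell number, computed with the Bell-triangle recurrence."""
    row = [1]
    for _ in range(n - 1):
        prev, row = row, [row[-1]]
        for v in prev:
            row.append(row[-1] + v)
    return row[-1]

# bell(2) == 2: the canonical detection problem.
# bell(3) == 5: the three-input example above.
# bell(20) == 51724158235372: the complexity explodes.
```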
0:03:43 To get to what's new: the material in the paper is too much to present in a single talk; we struggled to fit everything into the allotted pages. So I'll just highlight what is new here, and why you might want to go and read the full paper.
0:04:07 As mentioned, this is identical to what has been mentioned before, and will probably be mentioned after this, so the problem itself is not new. I'm stressing the generality which I've just mentioned, but then I also have to mention that we're focusing on problems with a small number of inputs, whereas, for example, another paper here treats the case of a large number of inputs.
0:04:42 Further, in the background we emphasise solutions that deliver probabilistic output, in other words calibrated likelihoods, and we also propose an associated evaluation criterion, which I'm not going to discuss further here.
0:05:03 And then, something which I am going to discuss further: our paper gives a closed-form solution to this very general problem, by using a simple additive Gaussian generative model in i-vector space. That's exactly what Patrick explained, except we're not doing it heavy-tailed, just plain Gaussian.
0:05:30 This model gives us likelihood outputs, and it is tractable, even fast, when we don't do too many segments at once.
0:05:44 Yesterday in his keynote, Patrick mentioned that, using this kind of modelling, you can calculate the likelihood for any type of speaker recognition problem. This is exactly what we show in our paper: the formulas that you can use.
0:06:07 Again, very briefly, the i-vector PLDA model: it models speaker and channel effects as independent multivariate Gaussians, which add to give the i-vectors. I don't need to explain that every speech segment gets mapped to an i-vector. The reason we call it an i-vector is simply because it's of intermediate size; the "i" is for intermediate: it's larger than an acoustic feature vector, smaller than a supervector.
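As a hedged sketch of that additive model in code (my reading of the talk; the covariance names B, for between-speaker, and W, for within-speaker, are assumed labels, not the paper's notation):

```python
import numpy as np

def sample_ivectors(partition, B, W, rng):
    """Sample synthetic i-vectors for a given partition of segment labels.
    Each block of the partition is one speaker: x = y + e, where the
    speaker effect y ~ N(0, B) is shared by the block and the channel
    effect e ~ N(0, W) is drawn per segment."""
    d = B.shape[0]
    Lb, Lw = np.linalg.cholesky(B), np.linalg.cholesky(W)
    X = {}
    for block in partition:
        y = Lb @ rng.standard_normal(d)               # speaker effect
        for seg in block:
            X[seg] = y + Lw @ rng.standard_normal(d)  # plus channel effect
    return X

# e.g. X = sample_ivectors([['train', 'test'], ['adapt']], B, W,
#                          np.random.default_rng(0))
```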
0:06:39 I might also mention, about "total variability": these i-vectors cannot be reconstructed to give you the original speech, so in my opinion they don't reflect the total variability in the signal.
0:06:59 The i-vector solution: the hyperparameters of this generative model, in other words the covariance matrices that explain all the variability, have to be trained with an EM algorithm, similar to JFA; there's some detail in the paper.
0:07:23 So I'm going to concentrate on the scoring, because it's nice and simple with this very simple model.
0:07:32 We're given a set of segments, each represented by an i-vector: A, B, C. What I've got here represents a subset of that set; each subset is imaged by my generative model, and then we can calculate the likelihood that all of the segments in the subset belong to the same speaker. The details of how to calculate the likelihood of a subset are in the paper.
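For the plain Gaussian model, that subset likelihood is a standard Gaussian integral, marginalising out the shared speaker effect. A minimal sketch under the same assumed B/W naming as above; see the paper for the authors' own derivation:

```python
import numpy as np

def subset_loglik(X, B, W):
    """log p(x_1..x_n | one common speaker): integrate out the shared
    speaker effect y ~ N(0, B), with x_i = y + e_i and e_i ~ N(0, W).
    X is an (n, d) array of i-vectors."""
    n, d = X.shape
    Winv = np.linalg.inv(W)
    P = np.linalg.inv(B) + n * Winv              # posterior precision of y
    s = Winv @ X.sum(axis=0)                     # precision-weighted sum
    logdet = lambda M: np.linalg.slogdet(M)[1]
    quad = np.einsum('ij,jk,ik->', X, Winv, X)   # sum_i of x_i' W^-1 x_i
    return -0.5 * (n * d * np.log(2 * np.pi) + n * logdet(W) + logdet(B)
                   + logdet(P) + quad - s @ np.linalg.solve(P, s))
```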
0:08:06 So I'm showing now how to go from the subset likelihoods to the likelihood of a full partitioning of the full set. Again, for the three inputs, here's one of the possibilities. The model is simple, so the likelihoods multiply; that's very nice, very comfortable to use. This is all you need to solve all of those problems, to get a closed-form solution. Of course it's not always going to be a good solution, but you get a solution.
0:08:41 So for the three-input example: the three inputs represent a trial, and the output is the five different likelihoods for the five partitioning possibilities.
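Assembling the trial output is then just bookkeeping. A sketch that enumerates every partition and adds subset log-likelihoods, reusing the hypothetical subset_loglik helper from above:

```python
import numpy as np

def partitions(items):
    """Yield every set partition of a list of segment labels."""
    if len(items) == 1:
        yield [items]
        return
    first, rest = items[0], items[1:]
    for p in partitions(rest):
        for i in range(len(p)):                   # put first into an existing block
            yield p[:i] + [[first] + p[i]] + p[i+1:]
        yield [[first]] + p                       # or give first its own block

def partition_logliks(X, B, W):
    """Log-likelihood of every partition of the trial; because the subset
    likelihoods multiply, their logs add.  X maps label -> i-vector."""
    return {tuple(map(tuple, p)):
                sum(subset_loglik(np.vstack([X[s] for s in block]), B, W)
                    for block in p)
            for p in partitions(list(X))}

# For a three-segment trial this returns the five log-likelihoods.
```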
0:08:58 This solution is neat and tractable, but as already mentioned, it blows up if you try to use too many input segments.
0:09:13 Moving to experimental results: the experimental results on real NIST data are available in the full paper, but in the rest of this talk we're going to use an experiment with synthetic data. The reason I didn't lead with that in the paper is that reviewers, especially the anonymous ones, tend not to like synthetic data. But everybody here is wearing name tags, so my anonymous peers are not here, and I'm going to proceed with my synthetic data experiments.
0:09:58 This takes the form of a little tutorial, I think, in probability theory. The generality of the partitioning problem and the simplicity of the i-vector model are very handy tools to examine a few questions one might have about basic things in speaker recognition, so I'd like to show you how this works.
0:10:33 The example we're going to discuss is the NIST unsupervised adaptation task, and I promised at some point yesterday that this would be discussed. We're going to analyse it by making it a special case of the partitioning problem.
0:10:56 The basic problem is that you need more prior information than that which was provided in the original definition of the task. The next several slides are going to be on that.
0:11:14so
0:11:15the input
0:11:16um
0:11:17we're looking at the simplest case
0:11:19you're given a train segment
0:11:21which is known to be of the target speaker
0:11:24uh then you'd also given and the adaptation segment
0:11:28which
0:11:29my or may not be from the target segment and you're allowed to use that
0:11:33and then finally there's a test segment and your job is to decide
0:11:37was this the target speaker or not
0:11:40 With three inputs, as mentioned, there are five possibilities for how to partition them. We can group the first two as belonging to the target hypothesis, and the last three as instances of non-target partitions: non-target because the test segment has a different speaker from the train segment.
0:12:09 So we need a prior. NIST provided the target prior. We don't need a prior for the train segment; we already know it's of the target speaker. But what about the adaptation segment? That prior was not stated in the original problem.
0:12:31 We can assemble these two priors, just in the obvious way, to give a full probability distribution over the five possibilities. I've somewhat arbitrarily constrained the last ones, to simplify matters here: we're assuming that if the test segment is not the target, the adaptation segment is also not going to be.
0:13:02 So the whole thing assembles like this: the generative model supplies the five likelihoods for the five partitioning possibilities, and then you use, as Patrick said, the basic rules of probability theory, the sum rule and the product rule. But you need those extra priors. This prior, which has not been mentioned before, you need in order to properly express the likelihood ratio between the target and the non-target hypotheses.
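As a sketch of that assembly for the train/adapt/test trial (the 50/50 split under the non-target hypothesis below is my own illustrative choice, not the paper's):

```python
import numpy as np
from scipy.special import logsumexp

def adaptation_llr(ll, p_adapt):
    """Target-vs-non-target log-likelihood-ratio for a (train, adapt, test)
    trial.  ll holds the five partition log-likelihoods; p_adapt is the
    assumed prior that the adaptation segment is from the target speaker."""
    # Target hypothesis: the test segment shares the train speaker.
    log_tar = logsumexp([np.log(p_adapt)     + ll['train,adapt,test'],
                         np.log(1 - p_adapt) + ll['train,test|adapt']])
    # Non-target hypothesis, under the talk's simplifying assumption that a
    # non-target test implies a non-target adaptation segment, which rules
    # out the 'train,adapt|test' partition.  Illustrative 50/50 split:
    log_non = logsumexp([np.log(0.5) + ll['train|adapt,test'],
                         np.log(0.5) + ll['train|adapt|test']])
    return log_tar - log_non
```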
0:13:45 The experiment that we did was to demonstrate what role this prior plays: what might happen if you assume a bad prior, and how closely it should match the actual proportion in the data that you're working with.
0:14:12 We use synthetic i-vectors because we're not interested in examining the data, or indeed the PLDA model. When making synthetic data with the model, the data is a perfect match to the model, and that focuses the experiment on the role of the prior.
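A sketch of that data generation, reusing the hypothetical sample_ivectors helper from earlier; the actual proportion of target adaptation segments is a knob we control directly:

```python
def sample_trial(B, W, rng, p_target, p_adapt_true):
    """Draw one synthetic (train, adapt, test) trial.  p_target is the
    target prior for the test segment; p_adapt_true is the *actual*
    proportion of target-speaker adaptation segments, as opposed to the
    *assumed* prior used in scoring."""
    is_target = rng.random() < p_target
    if is_target:
        part = ([['train', 'adapt', 'test']] if rng.random() < p_adapt_true
                else [['train', 'test'], ['adapt']])
    else:  # illustrative split between the two allowed non-target partitions
        part = ([['train'], ['adapt', 'test']] if rng.random() < 0.5
                else [['train'], ['adapt'], ['test']])
    return sample_ivectors(part, B, W, rng), is_target
```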
0:14:41 So, back to this system diagram: we adjust two things independently. One is the proportion of target adaptation segments in the data; the other is the assumed prior for how large that proportion might be. And we evaluate the whole thing via the equal error rate.
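The whole experiment is then a sweep over those two knobs. A rough sketch, with a crude EER estimate and a small glue function keying the five partition log-likelihoods the way adaptation_llr above expects (all helper names come from the earlier hypothetical snippets):

```python
import numpy as np

def trial_logliks(X, B, W):
    """The five partition log-likelihoods of one (train, adapt, test) trial."""
    s = lambda *segs: subset_loglik(np.vstack([X[k] for k in segs]), B, W)
    return {'train,adapt,test': s('train', 'adapt', 'test'),
            'train,test|adapt': s('train', 'test') + s('adapt'),
            'train,adapt|test': s('train', 'adapt') + s('test'),
            'train|adapt,test': s('train') + s('adapt', 'test'),
            'train|adapt|test': s('train') + s('adapt') + s('test')}

def eer(tar, non):
    """Crude equal-error-rate estimate from target/non-target score arrays."""
    ts = np.sort(np.concatenate([tar, non]))
    miss = np.array([np.mean(tar < t) for t in ts])
    fa = np.array([np.mean(non >= t) for t in ts])
    i = np.argmin(np.abs(miss - fa))
    return 0.5 * (miss[i] + fa[i])

def eer_surface(B, W, rng, assumed_grid, actual_grid, n_trials=5000):
    """EER over the grid of assumed prior vs. actual target proportion."""
    surface = {}
    for p_assumed in assumed_grid:
        for p_actual in actual_grid:
            tar, non = [], []
            for _ in range(n_trials):
                # fixed 0.5 target prior for the test segment (illustrative)
                X, is_tar = sample_trial(B, W, rng, 0.5, p_actual)
                score = adaptation_llr(trial_logliks(X, B, W), p_assumed)
                (tar if is_tar else non).append(score)
            surface[p_assumed, p_actual] = eer(np.array(tar), np.array(non))
    return surface
```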
0:15:07 The results look something like this. On this horizontal axis we have the assumed prior, increasing in this direction; the other horizontal axis is the actual proportion; and the vertical axis, of course, is the equal error rate. This corner here is the best situation to be in: you assume there are many target adaptation segments, there are in fact many, and the adaptation works. In the back corner over there you're saying: okay, I'm not expecting any targets in the adaptation data, so I'm not adapting. The bad place to be is there, where you're assuming you'll find many target adaptation segments, but there aren't any.
0:16:01 The important thing to realise here is that it's not so bad to assume that there aren't any target adaptation segments, because then you're just back to what you would have done without adaptation; but it is bad to have the mismatch the other way.
0:16:23 So the prior is important. You might choose to ignore the prior, but it's not going to go away; it's there, influencing things even if you ignore it.
0:16:36 That brings me to the conclusion of the talk. Back in the real world, we've already applied this partitioning software to help us find mislabellings in the data which we needed for development for this evaluation.
0:16:58 We've only started on this work; in the workshop that's starting next week here, we'll be exploring this problem some more.
0:17:16 Okay, that's all, thank you.
0:17:25 Chair: Time for some questions or comments.
0:17:29 Q: Could it be, if you look at the real case, that the proportions change? Usually in a real application you will have impostors trying to get into the system to cheat it, and sometimes a target speaker coming in, so over time the proportion of targets will grow.
0:18:06 A: I agree, and this framework allows for that, because that prior there, the prior that you plug in to get the final score, you can make trial-dependent. So if you know about this behaviour, you can modify the prior as time progresses.
0:18:24 Chair: Sorry, could you turn it on? Oh yes. Now we have time for more questions.
0:18:47 Q: Do you see that this could be used for speaker diarization?
0:18:55 A: If I understood you correctly, you ask whether this is a one-step diarization system?
0:19:01 Q: Yes, when we do diarization.
0:19:08 A: No, I'm assuming here that the segmentation is given, so I assume that there are no segments which have two speakers in them.
0:19:19 Q: What if we first apply a system for segmentation, so that you get all the boundaries, which could then be used to partition the segments?
0:19:34 A: I wouldn't recommend it, because as I pointed out, if you have a thousand segments, then you can't enumerate all the ways of partitioning them; these methods are not designed for the large-scale case. But there are other approximate methods: you could still start from the same Gaussian PLDA model, but then you would need something like variational Bayes to handle a large number of segments. We're going to play with that at the workshop as well.
0:20:15 Chair: Okay, let's thank the speaker.