0:00:06 So, yesterday it was mentioned that it might be necessary to do something to wake up the audience, and the suggestion was to do a colour wheel, which I didn't do; I'm also not going to do that today.
0:00:31 I was assisted in this work by my colleague, and we're from South Africa.
0:00:41 This is the second paper out of three, essentially on the same topic, so I'll start by defining it, then say what is different in our paper from the other two, and then we'll get to the rest.
0:00:59 In defining what I'm calling the partitioning problem: we're all very familiar with the canonical detection problem, where you need to decide, given two speech segments, do we have one or two speakers here. If you want to generalise that, which is essentially my main interest here, how do we generalise the canonical problem: we allow more than two input segments, and the natural questions are then how many speakers are there, and how do we divide the segments between the speakers?
0:01:42 So here's an example, the simplest generalisation: we go from two to three. The task is then to partition the set of inputs into subsets. This is why we're calling it a partitioning problem, because that's just the set-theory way of stating the problem. Immediately we get five possibilities; it should be quite obvious that there could be one, two or three speakers, and in the two-speaker case there are three ways to partition.
0:02:19 Partitioning is the most general problem of this kind, if you assume there's a single speaker in each segment. I'm saying it is the most general because, if you have the answer about how the segments are partitioned, you can answer any other kind of detection, verification, identification (open- or closed-set) or clustering problem that you can define within this set. The connection with diarization has already been mentioned: there you also need a segmentation, but we're presupposing the segmentation is given.
0:03:05 The problem is general, which is good, but you have to be careful, because the complexity explodes. Here's a little table that shows that the number of possible ways to partition can become very large very quickly. Again, for the canonical problem there are just two solutions; for our example, which we'll discuss some more, it's five; and we're not going to discuss the last row in full.
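The counts in that table are the Bell numbers: the number of ways n segments can be partitioned among an unknown number of speakers. A minimal sketch of how one could reproduce the table (mine, not from the paper), using the Bell-triangle recurrence:

```python
def bell(n):
    """Number of ways to partition n speech segments among speakers:
    the n-th Bell number, computed with the Bell-triangle recurrence."""
    row = [1]
    for _ in range(n - 1):
        prev, row = row, [row[-1]]
        for v in prev:
            row.append(row[-1] + v)
    return row[-1]

# bell(2) == 2: the canonical detection problem.
# bell(3) == 5: the three-input example above.
# bell(20) == 51724158235372: the complexity explodes.
```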
0:03:43 To get to what's new: the material in the paper is too much to present in a single talk; we struggled to fit everything into the allotted pages. So I'll just highlight what is new here, and why you might want to go and read the full paper.
0:04:07 As mentioned, this is identical to what has been mentioned before, and will probably be mentioned after this, so the problem itself is not new. I'm stressing the generality which I've just mentioned, but then I also have to mention that we're focusing on problems with a small number of inputs, whereas, for example, another paper here treats the case of a large number of inputs.
0:04:42 Further, in the background we emphasise solutions that deliver probabilistic output, in other words calibrated likelihoods, and we also propose an associated evaluation criterion, which I'm not going to discuss further here.
0:05:03 And then, something which I am going to discuss further: our paper gives a closed-form solution to this very general problem, by using a simple additive Gaussian generative model in i-vector space. That's exactly what Patrick explained, except we're not doing it heavy-tailed, just plain Gaussian.
0:05:30 This model gives us likelihood outputs, and it is tractable, even fast, when we don't do too many segments at once.
0:05:44 Yesterday in his keynote, Patrick mentioned that, using this kind of modelling, you can calculate the likelihood for any type of speaker recognition problem. This is exactly what we show in our paper: the formulas that you can use.
0:06:07 Again, very briefly, the i-vector PLDA model: it models speaker and channel effects as independent multivariate Gaussians, which add to give the i-vectors. I don't need to explain that every speech segment gets mapped to an i-vector. The reason we call it an i-vector is simply because it's of intermediate size; the "i" is for intermediate: it's larger than an acoustic feature vector, smaller than a supervector.
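As a hedged sketch of that additive model in code (my reading of the talk; the covariance names B, for between-speaker, and W, for within-speaker, are assumed labels, not the paper's notation):

```python
import numpy as np

def sample_ivectors(partition, B, W, rng):
    """Sample synthetic i-vectors for a given partition of segment labels.
    Each block of the partition is one speaker: x = y + e, where the
    speaker effect y ~ N(0, B) is shared by the block and the channel
    effect e ~ N(0, W) is drawn per segment."""
    d = B.shape[0]
    Lb, Lw = np.linalg.cholesky(B), np.linalg.cholesky(W)
    X = {}
    for block in partition:
        y = Lb @ rng.standard_normal(d)               # speaker effect
        for seg in block:
            X[seg] = y + Lw @ rng.standard_normal(d)  # plus channel effect
    return X

# e.g. X = sample_ivectors([['train', 'test'], ['adapt']], B, W,
#                          np.random.default_rng(0))
```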
0:06:39 I might also mention, about "total variability": these i-vectors cannot be reconstructed to give you the original speech, so in my opinion they don't reflect the total variability in the signal.
0:06:59 The i-vector solution: the hyperparameters of this generative model, in other words the covariance matrices that explain all the variability, have to be trained with an EM algorithm, similar to JFA; there's some detail in the paper.
0:07:23 So I'm going to concentrate on the scoring, because it's nice and simple with this very simple model.
0:07:32 We're given a set of segments, each represented by an i-vector: A, B, C. What I've got here represents a subset of that set; each subset is imaged by my generative model, and then we can calculate the likelihood that all of the segments in the subset belong to the same speaker. The details of how to calculate the likelihood of a subset are in the paper.
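For the plain Gaussian model, that subset likelihood is a standard Gaussian integral, marginalising out the shared speaker effect. A minimal sketch under the same assumed B/W naming as above; see the paper for the authors' own derivation:

```python
import numpy as np

def subset_loglik(X, B, W):
    """log p(x_1..x_n | one common speaker): integrate out the shared
    speaker effect y ~ N(0, B), with x_i = y + e_i and e_i ~ N(0, W).
    X is an (n, d) array of i-vectors."""
    n, d = X.shape
    Winv = np.linalg.inv(W)
    P = np.linalg.inv(B) + n * Winv              # posterior precision of y
    s = Winv @ X.sum(axis=0)                     # precision-weighted sum
    logdet = lambda M: np.linalg.slogdet(M)[1]
    quad = np.einsum('ij,jk,ik->', X, Winv, X)   # sum_i of x_i' W^-1 x_i
    return -0.5 * (n * d * np.log(2 * np.pi) + n * logdet(W) + logdet(B)
                   + logdet(P) + quad - s @ np.linalg.solve(P, s))
```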
0:08:06 So I'm showing now how to go from the subset likelihoods to the likelihood of a full partitioning of the full set. Again, for the three inputs, here's one of the possibilities. The model is simple, so the likelihoods multiply; that's very nice, very comfortable to use. This is all you need to solve all of those problems, to get a closed-form solution. Of course it's not always going to be a good solution, but you get a solution.
0:08:41 So for the three-input example: the three inputs represent a trial, and the output is the five different likelihoods for the five partitioning possibilities.
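Assembling the trial output is then just bookkeeping. A sketch that enumerates every partition and adds subset log-likelihoods, reusing the hypothetical subset_loglik helper from above:

```python
import numpy as np

def partitions(items):
    """Yield every set partition of a list of segment labels."""
    if len(items) == 1:
        yield [items]
        return
    first, rest = items[0], items[1:]
    for p in partitions(rest):
        for i in range(len(p)):                   # put first into an existing block
            yield p[:i] + [[first] + p[i]] + p[i+1:]
        yield [[first]] + p                       # or give first its own block

def partition_logliks(X, B, W):
    """Log-likelihood of every partition of the trial; because the subset
    likelihoods multiply, their logs add.  X maps label -> i-vector."""
    return {tuple(map(tuple, p)):
                sum(subset_loglik(np.vstack([X[s] for s in block]), B, W)
                    for block in p)
            for p in partitions(list(X))}

# For a three-segment trial this returns the five log-likelihoods.
```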
0:08:58 This solution is neat and tractable, but as already mentioned, it blows up if you try to use too many input segments.
0:09:13 Moving to experimental results: the experimental results on real NIST data are available in the full paper, but in the rest of this talk we're going to use an experiment with synthetic data. The reason I didn't lead with that in the paper is that reviewers, especially the anonymous ones, tend not to like synthetic data. But everybody here is wearing name tags, so my anonymous peers are not here, and I'm going to proceed with my synthetic data experiments.
0:09:58 This takes the form of a little tutorial, I think, in probability theory. The generality of the partitioning problem and the simplicity of the i-vector model are very handy tools to examine a few questions one might have about basic things in speaker recognition, so I'd like to show you how this works.
0:10:33 The example we're going to discuss is the NIST unsupervised adaptation task, and I promised at some point yesterday that this would be discussed. We're going to analyse it by making it a special case of the partitioning problem.
0:10:56 The basic problem is that you need more prior information than that which was provided in the original definition of the task. The next several slides are going to be on that.
0:11:14so
0:11:15the input
0:11:16um
0:11:17we're looking at the simplest case
0:11:19you're given a train segment
0:11:21which is known to be of the target speaker
0:11:24uh then you'd also given and the adaptation segment
0:11:28which
0:11:29my or may not be from the target segment and you're allowed to use that
0:11:33and then finally there's a test segment and your job is to decide
0:11:37was this the target speaker or not
0:11:40 With three inputs, as mentioned, there are five possibilities for how to partition them. We can group the first two as belonging to the target hypothesis, and the last three as instances of non-target partitions: non-target because the test segment has a different speaker from the train segment.
0:12:09 So we need a prior. NIST provided the target prior. We don't need a prior for the train segment; we already know it's of the target speaker. But what about the adaptation segment? That prior was not stated in the original problem.
0:12:31 We can assemble these two priors, just in the obvious way, to give a full probability distribution over the five possibilities. I've somewhat arbitrarily constrained the last ones, to simplify matters here: we're assuming that if the test segment is not the target, the adaptation segment is also not going to be.
0:13:02 So the whole thing assembles like this: the generative model supplies the five likelihoods for the five partitioning possibilities, and then you use, as Patrick said, the basic rules of probability theory, the sum rule and the product rule. But you need those extra priors. This prior, which has not been mentioned before, you need in order to properly express the likelihood ratio between the target and the non-target hypotheses.
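As a sketch of that assembly for the train/adapt/test trial (the 50/50 split under the non-target hypothesis below is my own illustrative choice, not the paper's):

```python
import numpy as np
from scipy.special import logsumexp

def adaptation_llr(ll, p_adapt):
    """Target-vs-non-target log-likelihood-ratio for a (train, adapt, test)
    trial.  ll holds the five partition log-likelihoods; p_adapt is the
    assumed prior that the adaptation segment is from the target speaker."""
    # Target hypothesis: the test segment shares the train speaker.
    log_tar = logsumexp([np.log(p_adapt)     + ll['train,adapt,test'],
                         np.log(1 - p_adapt) + ll['train,test|adapt']])
    # Non-target hypothesis, under the talk's simplifying assumption that a
    # non-target test implies a non-target adaptation segment, which rules
    # out the 'train,adapt|test' partition.  Illustrative 50/50 split:
    log_non = logsumexp([np.log(0.5) + ll['train|adapt,test'],
                         np.log(0.5) + ll['train|adapt|test']])
    return log_tar - log_non
```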
0:13:45 The experiment that we did was to demonstrate what role this prior plays: what might happen if you assume a bad prior, and how closely it should match the actual proportion in the data that you're working with.
0:14:12 We use synthetic i-vectors because we're not interested in examining the data, or indeed the PLDA model. When making synthetic data with the model, the data is a perfect match to the model, and that focuses the experiment on the role of the prior.
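A sketch of that data generation, reusing the hypothetical sample_ivectors helper from earlier; the actual proportion of target adaptation segments is a knob we control directly:

```python
def sample_trial(B, W, rng, p_target, p_adapt_true):
    """Draw one synthetic (train, adapt, test) trial.  p_target is the
    target prior for the test segment; p_adapt_true is the *actual*
    proportion of target-speaker adaptation segments, as opposed to the
    *assumed* prior used in scoring."""
    is_target = rng.random() < p_target
    if is_target:
        part = ([['train', 'adapt', 'test']] if rng.random() < p_adapt_true
                else [['train', 'test'], ['adapt']])
    else:  # illustrative split between the two allowed non-target partitions
        part = ([['train'], ['adapt', 'test']] if rng.random() < 0.5
                else [['train'], ['adapt'], ['test']])
    return sample_ivectors(part, B, W, rng), is_target
```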
0:14:41 So, back to this system diagram: we adjust two things independently. One is the proportion of target adaptation segments in the data; the other is the assumed prior for how large that proportion might be. And we evaluate the whole thing via the equal error rate.
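The whole experiment is then a sweep over those two knobs. A rough sketch, with a crude EER estimate and a small glue function keying the five partition log-likelihoods the way adaptation_llr above expects (all helper names come from the earlier hypothetical snippets):

```python
import numpy as np

def trial_logliks(X, B, W):
    """The five partition log-likelihoods of one (train, adapt, test) trial."""
    s = lambda *segs: subset_loglik(np.vstack([X[k] for k in segs]), B, W)
    return {'train,adapt,test': s('train', 'adapt', 'test'),
            'train,test|adapt': s('train', 'test') + s('adapt'),
            'train,adapt|test': s('train', 'adapt') + s('test'),
            'train|adapt,test': s('train') + s('adapt', 'test'),
            'train|adapt|test': s('train') + s('adapt') + s('test')}

def eer(tar, non):
    """Crude equal-error-rate estimate from target/non-target score arrays."""
    ts = np.sort(np.concatenate([tar, non]))
    miss = np.array([np.mean(tar < t) for t in ts])
    fa = np.array([np.mean(non >= t) for t in ts])
    i = np.argmin(np.abs(miss - fa))
    return 0.5 * (miss[i] + fa[i])

def eer_surface(B, W, rng, assumed_grid, actual_grid, n_trials=5000):
    """EER over the grid of assumed prior vs. actual target proportion."""
    surface = {}
    for p_assumed in assumed_grid:
        for p_actual in actual_grid:
            tar, non = [], []
            for _ in range(n_trials):
                # fixed 0.5 target prior for the test segment (illustrative)
                X, is_tar = sample_trial(B, W, rng, 0.5, p_actual)
                score = adaptation_llr(trial_logliks(X, B, W), p_assumed)
                (tar if is_tar else non).append(score)
            surface[p_assumed, p_actual] = eer(np.array(tar), np.array(non))
    return surface
```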
0:15:07 The results look something like this. On this horizontal axis we have the assumed prior, increasing in this direction; the other horizontal axis is the actual proportion; and the vertical axis, of course, is the equal error rate. This corner here is the best situation to be in: you assume there are many target adaptation segments, there are in fact many, and the adaptation works. In the back corner over there you're saying: okay, I'm not expecting any targets in the adaptation data, so I'm not adapting. The bad place to be is there, where you're assuming you'll find many target adaptation segments, but there aren't any.
0:16:01 The important thing to realise here is that it's not so bad to assume that there aren't any target adaptation segments, because then you're just back to what you would have done without adaptation; but it is bad to have the mismatch the other way.
0:16:23 So the prior is important. You might choose to ignore the prior, but it's not going to go away; it's there, influencing things even if you ignore it.
0:16:36 That brings me to the conclusion of the talk. Back in the real world, we've already applied this partitioning software to help us find mislabellings in the data which we needed for development for this evaluation.
0:16:58 We've only started on this work; in the workshop that's starting next week here, we'll be exploring this problem some more.
0:17:16 Okay, that's all, thank you.
0:17:25 Chair: Time for some questions or comments.
0:17:29 Q: Could it be, if you look at the real case, that the proportions change? Usually in a real application you will have impostors trying to get into the system to cheat it, and sometimes a target speaker coming in, so over time the proportion of targets will grow.
0:18:06 A: I agree, and this framework allows for that, because that prior there, the prior that you plug in to get the final score, you can make trial-dependent. So if you know about this behaviour, you can modify the prior as time progresses.
0:18:24 Chair: Sorry, could you turn it on? Oh yes. Now we have time for more questions.
0:18:47 Q: Do you see that this could be used for speaker diarization?
0:18:55 A: If I understood you correctly, you ask whether this is a one-step diarization system?
0:19:01 Q: Yes, when we do diarization.
0:19:08 A: No, I'm assuming here that the segmentation is given, so I assume that there are no segments which have two speakers in them.
0:19:19 Q: What if we first apply a system for segmentation, so that you get all the boundaries, which could then be used to partition the segments?
0:19:34 A: I wouldn't recommend it, because as I pointed out, if you have a thousand segments, then you can't enumerate all the ways of partitioning them; these methods are not designed for the large-scale case. But there are other approximate methods: you could still start from the same Gaussian PLDA model, but then you would need something like variational Bayes to handle a large number of segments. We're going to play with that at the workshop as well.
0:20:15 Chair: Okay, let's thank the speaker.