Speech Transcript - Person Instance Graphs for Named Speaker Identification in TV Broadcast

0:00:15	hi everyone so i'm hervé bredin from the limsi in france
0:00:20	so this is joint work with all those people and you might know claude
0:00:24	barras the last author
0:00:25	he says hi
0:00:26	if you know him
0:00:30	so i'm going to talk about this notion of person instance graphs for named
0:00:34	speaker identification in tv broadcast
0:00:37	so this is the outline of my talk
0:00:39	first i'm going to give you a bit of context
0:00:43	then i'm going to discuss this notion of person instance graph how we can
0:00:47	build them
0:00:49	and then how we can mine those graphs to do speaker identification
0:00:54	in tv shows and present some experimental results and then conclude my talk
0:01:01	so
0:01:02	about the context we were working in the framework of this french challenge called
0:01:08	repere
0:01:10	where we were given tv shows like this one for instance
0:01:15	they were
0:01:16	talk shows tv news and we were asked to answer automatically these two
0:01:22	questions who speaks when
0:01:23	and
0:01:24	who appears when
0:01:29	in this form so we really need to do speaker diarization and then try to
0:01:33	identify each speech turn separately
0:01:36	and provide a normalized
0:01:41	name
0:01:42	it was very important to give the exact form of the name like nicolas
0:01:46	sarkozy fossil america but my here
0:01:50	i'm only going to focus on the who speaks when task here
0:01:54	so there are many ways multiple sources of information to answer those questions
0:02:00	so obviously we can use the audio stream to do speaker diarization and identification
0:02:04	we can also process the speech to get some transcription from it
0:02:10	we can obviously use the visual stream to do face clustering and recognition and we can
0:02:14	try to get some names also from the
0:02:16	ocr here
0:02:18	and
0:02:20	and so there are those two streams coming from asr and ocr
0:02:24	and we can do named entity detection on these and try to propagate
0:02:29	the names to the speaker clusters for instance here i'm not going to use at all
0:02:33	the visual information because this is a
0:02:37	speaker diarization session
0:02:40	okay
0:02:41	so there are two ways of recognizing people in this kind of video the
0:02:45	unsupervised way and the supervised way
0:02:47	on the left part in green i show you how we can do that
0:02:51	in an unsupervised fashion that means that we are not allowed to use prior
0:02:55	biometric models
0:02:57	to recognize the person the speaker
0:03:00	so it is usually done like this we first transcribe the speech and try
0:03:06	to extract names from this speech transcript
0:03:10	and in parallel we do speaker diarization and then we try to propagate the names
0:03:14	that were detected in the speech transcript to the speaker clusters
0:03:18	to try to name
0:03:21	the speaker clusters that's what we call named speaker identification so this is fully
0:03:25	unsupervised in terms of
0:03:27	biometric models
0:03:29	on the other side obviously
0:03:32	when we have training data for various speakers we can for instance
0:03:36	build an i-vector
0:03:39	speaker id system and use it
0:03:41	to do acoustic-based speaker identification
0:03:44	and we could also try to fuse those two into one unified framework and
0:03:50	that's what i'm going to talk about this talk is about trying to
0:03:53	do all of that in one unified framework
0:03:59	okay
0:04:00	so this framework
0:04:03	is actually what i call the person instance graph so i'm going to describe it
0:04:09	as
0:04:10	well as i can so that you get an idea of
0:04:13	how it's built
0:04:15	so starting from the speech signal
0:04:19	we apply an off-the-shelf speech-to-text
0:04:23	system from the company vocapia research
0:04:26	and so it provides
0:04:28	both the speech transcription so these are the
0:04:30	black dots here
0:04:32	and here you have a zoom on one particular speech turn and it also provides
0:04:36	us with a segmentation into speech turns
0:04:40	so in the rest of my talk
0:04:43	the speech turns will be denoted by t like turn
0:04:47	and for instance in this video
0:04:50	in this audio since we don't use video there are five speech
0:04:54	turns denoted t one to t five
0:04:56	those are the first nodes
0:04:58	of my graph of this person instance graph
0:05:03	on top of this speech transcript we can try to do spoken name detection
0:05:09	to do that
0:05:11	we use conditional random fields based on the wapiti implementation of
0:05:15	crf
0:05:17	we trained two different classes of models
0:05:20	some of them were trained to only detect parts of names like
0:05:24	first name last name titles
0:05:26	and others were trained to detect complete names at once
0:05:32	and so there are a bunch of models that we trained here and
0:05:36	their outputs were combined using yet another
0:05:39	crf
0:05:42	model
0:05:43	using the outputs of these models as features
0:05:48	so what we get from this model is the names that are detected
0:05:52	in the text stream
0:05:55	and so here for instance there were five
0:05:59	spoken names that were detected
0:06:01	and they are connected in this graph
0:06:04	to a canonical representation of the person here nicolas sarkozy nicolas sarkozy's name
0:06:10	was
0:06:11	detected and it's connected
0:06:15	to yet another
0:06:17	node in this graph which represents nicolas sarkozy
0:06:21	so in the rest of the talk s will be spoken names
0:06:27	that's what s stands for
0:06:28	and the identity
0:06:30	vertices in this graph are denoted i
0:06:36	so there are here for instance four identity nodes and five spoken
0:06:41	names in this graph
0:06:44	and so what can we do with those names that were detected so what we
0:06:47	want is we want to
0:06:50	propagate those spoken names to the neighboring speech turns we want to try
0:06:54	to
0:06:56	use them to identify the speakers in the conversation
0:07:01	so there are many ways of estimating the probability that the spoken name s
0:07:05	is actually the identity of the speech turn t
0:07:08	in the literature so at first people were using hand-made rules
0:07:13	based on
0:07:14	the context of the pronounced name in the speech transcript
0:07:18	other people used contextual n-grams
0:07:20	and
0:07:22	even more recently semantic classification trees so we chose to use contextual n-grams here
0:07:27	so let me show you an example for example if in the speech transcript
0:07:31	someone says thank you s where s might be nicolas sarkozy for instance then it's very likely
0:07:36	that the previous speech turn
0:07:37	is actually nicolas sarkozy so that's basically what it does here
0:07:41	there is an eighty eight percent chance that the spoken name s
0:07:47	is actually the identity of the previous speech turn t one
0:07:50	that's how we are able to connect spoken names to speech turns in the graph
0:07:55	so edges are weighted by these probabilities
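
The contextual n-gram idea described above can be sketched as a simple count-based estimate. This is only an illustration under stated assumptions: the relation labels, the `<NAME>` placeholder, the function names and the toy counts are all hypothetical, and the actual model used in the talk is not specified in the transcript.

```python
from collections import Counter, defaultdict

# Hypothetical relation labels: which speech turn (relative to the one
# containing the spoken name) the name is assumed to refer to.
RELATIONS = ("previous", "current", "next", "none")


def train_context_model(examples):
    """Estimate P(relation | context) from (context, relation) training pairs."""
    counts = defaultdict(Counter)
    for context, relation in examples:
        if relation not in RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        counts[context][relation] += 1
    model = {}
    for context, relation_counts in counts.items():
        total = sum(relation_counts.values())
        # A Counter returns 0 for relations never seen with this context.
        model[context] = {r: relation_counts[r] / total for r in RELATIONS}
    return model


def propagation_probability(model, context, relation):
    """Weight of the edge between a spoken name and the related speech turn."""
    return model.get(context, {}).get(relation, 0.0)


# Toy counts, chosen only so that the example lands on an 88% figure.
toy_examples = ([("thank you <NAME>", "previous")] * 22
                + [("thank you <NAME>", "none")] * 3)
toy_model = train_context_model(toy_examples)
p_previous = propagation_probability(toy_model, "thank you <NAME>", "previous")
```

With these toy counts `p_previous` is 22/25, which would become the weight of the edge between the spoken name and the previous speech turn.
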
0:08:01	and then so
0:08:02	it's good but we can only propagate the names to the neighboring speech turns
0:08:07	so what can we do next we can also compute some
0:08:11	kind of similarity between all the speech turns
0:08:13	here we simply use the bayesian information criterion based on mfcc features for each
0:08:19	speech turn and here for instance you have the
0:08:22	the
0:08:24	inter-speaker distribution of the bic
0:08:29	similarity measure and
0:08:31	in green the intra-speaker one on our repere dataset
0:08:35	and so based on those two distributions we can estimate some kind of probability that
0:08:40	two speech turns t and t prime are the same speaker
0:08:43	that's how we connect all the speech turns in the graph
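
One way to turn the two score distributions into a probability is Bayes' rule. The sketch below assumes that both distributions are fitted with Gaussians and that the prior is known; neither assumption comes from the talk, and the numeric parameters are toy values.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def same_speaker_probability(score, intra, inter, prior_same=0.5):
    """Posterior P(same speaker | score) via Bayes' rule.

    intra and inter are (mean, std) pairs for the assumed Gaussian fits of
    the intra-speaker and inter-speaker score distributions.
    """
    like_same = gaussian_pdf(score, *intra) * prior_same
    like_diff = gaussian_pdf(score, *inter) * (1.0 - prior_same)
    return like_same / (like_same + like_diff)


# Toy parameters (not from the talk): a higher score means more similar.
INTRA = (2.0, 1.0)
INTER = (-2.0, 1.0)
p_mid = same_speaker_probability(0.0, INTRA, INTER)
p_high = same_speaker_probability(2.0, INTRA, INTER)
```

A score halfway between the two toy means gives an uninformative posterior, while a score at the intra-speaker mean gives a posterior close to one.
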
0:08:48	so at this point we have this big graph here
0:08:54	so i'm just going to focus on the notation here so v is the set of
0:08:58	vertices in this graph so there are three types of vertices speech turns t
0:09:03	spoken names s
0:09:04	and identity vertices i
0:09:07	and this graph is not necessarily complete
0:09:13	for instance this identity vertex might not be connected to this speech
0:09:18	turn for instance so
0:09:20	this is an incomplete graph and
0:09:23	we denote by p
0:09:24	the weights that are
0:09:27	given to each edge p v v prime is actually the probability that the
0:09:32	two vertices v and v prime
0:09:34	are actually the same person the same identity
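
A minimal data structure for such a graph could look like the following. The class name, method names and vertex ids are hypothetical, and the weight of 1.0 on the edge between a spoken name and its identity is an assumption of this sketch rather than something stated in the talk.

```python
from dataclasses import dataclass, field

KINDS = ("turn", "name", "identity")  # t, s and i vertices


@dataclass
class PersonInstanceGraph:
    vertices: dict = field(default_factory=dict)  # vertex id -> kind
    edges: dict = field(default_factory=dict)     # frozenset({v, v'}) -> p

    def add_vertex(self, vertex, kind):
        if kind not in KINDS:
            raise ValueError(f"unknown vertex kind: {kind}")
        self.vertices[vertex] = kind

    def add_edge(self, v, v_prime, p):
        """Store p_vv', the probability that v and v' are the same person."""
        if v not in self.vertices or v_prime not in self.vertices:
            raise KeyError("both vertices must be added first")
        if v == v_prime:
            raise ValueError("self-loops are not stored")
        if not 0.0 <= p <= 1.0:
            raise ValueError("p must be a probability")
        self.edges[frozenset((v, v_prime))] = p

    def probability(self, v, v_prime):
        """Return p_vv', or None when the edge is absent (incomplete graph)."""
        return self.edges.get(frozenset((v, v_prime)))


graph = PersonInstanceGraph()
graph.add_vertex("t1", "turn")
graph.add_vertex("s1", "name")
graph.add_vertex("i1", "identity")
graph.add_edge("s1", "t1", 0.88)  # spoken name to neighboring speech turn
graph.add_edge("s1", "i1", 1.0)   # spoken name to its identity (assumed weight)
```

Edges are stored on unordered pairs, so `graph.probability("t1", "s1")` and `graph.probability("s1", "t1")` return the same weight, and a missing edge such as `("t1", "i1")` simply returns `None`.
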
0:09:39	so now that we have this graph what do we want to achieve we want to
0:09:42	mine those graphs
0:09:44	to finally get our answer so try to give an identity to each of these
0:09:50	speech turns
0:09:51	so you see in this example so this is the reference here
0:09:55	it's nearly impossible to get a because
0:09:58	the name of this guy a is never even pronounced
0:10:02	in the tv show
0:10:04	so
0:10:05	by chance we may have a
0:10:09	biometric model for this guy
0:10:11	so there are
0:10:13	this is a very messy slide
0:10:15	but
0:10:17	so depending on how many edges we put in this graph we can address
0:10:20	different tasks
0:10:21	for instance if we just connect spoken names with speech turns we are able
0:10:27	just to
0:10:29	identify the addressee
0:10:30	of each speech turn each time so only neighboring speech turns can be
0:10:36	identified but then if we add those
0:10:39	speech turn to speech turn
0:10:43	edges
0:10:43	we're able to propagate the names to all the speech turns
0:10:46	and if by chance we have biometric models for this guy gas and j
0:10:52	then using an i-vector system for instance we are able to connect each speech
0:10:57	turn to all
0:10:59	biometric models
0:11:00	and
0:11:03	estimate some kind of probability that those are the same person
0:11:07	so this is completely supervised speaker identification using these and this is completely unsupervised and
0:11:13	we can try to add all these edges in this big graph to do jointly
0:11:16	unsupervised and supervised
0:11:19	speaker identification
0:11:24	so
0:11:25	how can we mine these graphs then
0:11:28	and the objective is always the same it is for each vertex in this graph to
0:11:32	try to give the correct identity
0:11:34	so this can actually be modeled as a clustering problem
0:11:37	where we want to group all instances all vertices in the graph corresponding to the
0:11:43	same person
0:11:44	with the actual identity so here is what we expect from a perfect system
0:11:50	in this graph
0:11:52	we would like to
0:11:53	put in the same cluster
0:11:55	the speech turns by speaker c and all the names spoken
0:11:59	all the times his name is pronounced also in the same group
0:12:03	and we would like for speaker a in my first example
0:12:09	even though we don't have an identity a in the graph we want to
0:12:13	be able to
0:12:14	cluster all his speech turns like that
0:12:16	and some spoken names are useless to identify
0:12:20	anyone because this is just someone we're talking about and not someone who
0:12:23	is present in the tv show
0:12:27	so to do that
0:12:29	we define
0:12:30	a set of functions called clustering functions so
0:12:35	delta
0:12:37	associates to each pair of nodes v v prime in this graph the value one
0:12:41	if they are in the same cluster and zero otherwise
0:12:45	the thing is not all functions defined like that
0:12:48	actually code for a valid clustering what we need to do is we need to
0:12:52	add some other constraints to these functions for instance
0:12:58	v must be in the same cluster as itself
0:13:01	symmetry constraints and also transitivity constraints like if v and v prime are in
0:13:06	the same cluster and v prime and v second are in the same cluster then
0:13:09	v and v second must be in the same cluster
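
The three constraints just listed (reflexivity, symmetry, transitivity) can be checked directly on a candidate clustering function. The sketch below is illustrative only; the function names and the toy vertices are hypothetical.

```python
from itertools import product


def is_valid_clustering(vertices, delta):
    """Check that delta(v, v') in {0, 1} encodes a partition of the vertices."""
    for v in vertices:
        if delta(v, v) != 1:                        # reflexivity
            return False
    for v, w in product(vertices, repeat=2):
        if delta(v, w) != delta(w, v):              # symmetry
            return False
    for u, v, w in product(vertices, repeat=3):
        if delta(u, v) == 1 and delta(v, w) == 1 and delta(u, w) != 1:
            return False                            # transitivity
    return True


def delta_from_labels(labels):
    """Build a clustering function from a vertex -> cluster label mapping."""
    return lambda v, w: 1 if labels[v] == labels[w] else 0


vertices = ["t1", "t2", "s1"]
good = delta_from_labels({"t1": 0, "t2": 0, "s1": 1})

# A function that links t1-t2 and t2-s1 but not t1-s1 breaks transitivity.
linked = {frozenset(("t1", "t2")), frozenset(("t2", "s1"))}
bad = lambda v, w: 1 if v == w or frozenset((v, w)) in linked else 0
```

Any function built from cluster labels passes the check by construction, which is why the constraints only matter when delta is treated as a set of free binary variables, as in an integer linear program.
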
0:13:11	so this defines a search space
0:13:15	delta v
0:13:16	on the set of vertices
0:13:18	but
0:13:20	we need to look for
0:13:22	the best clustering function delta
0:13:25	that best clusters all our data
0:13:29	so to do that we use integer linear programming
0:13:32	and we want to maximize this objective function
0:13:36	basically a good clustering would
0:13:40	group similar data
0:13:42	or data with high
0:13:45	probability
0:13:46	into the same cluster and separate
0:13:51	vertices with low similarity into different clusters so that's what this
0:13:56	objective function does
0:13:58	and it is just normalized by the
0:14:00	number of edges in the graph
0:14:02	and we have this parameter alpha that can be tuned
0:14:06	to balance between intra-cluster similarity and inter-cluster dissimilarity
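
The slide showing the objective function is not part of the transcript. One form consistent with the spoken description (reward high-probability pairs placed together, reward low-probability pairs placed apart, normalize by the number of edges, balance the two with alpha) might be written as follows. The symbols E for the edge set and Delta_V for the set of valid clustering functions are assumed notation, and the exact expression used by the authors may differ.

```latex
\delta^{*} = \operatorname*{argmax}_{\delta \in \Delta_V}
\frac{1}{|E|} \left[
  \alpha \sum_{(v, v') \in E} \delta_{vv'} \, p_{vv'}
  + (1 - \alpha) \sum_{(v, v') \in E} (1 - \delta_{vv'}) \, (1 - p_{vv'})
\right]
```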
0:14:12	and we also add additional constraints like for instance
0:14:16	for every speech turn in the graph
0:14:19	it can have at most one identity
0:14:23	well it depends if you are schizophrenic or not
0:14:27	but usually you have only one identity
0:14:29	and also we force spoken names
0:14:33	to be in the same cluster as their identity
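
To make the constrained clustering concrete, here is a brute-force stand-in for the integer linear program that only works on tiny graphs. It is not the solver used in the talk. The objective is an assumed form matching the spoken description, and all names and toy weights are hypothetical.

```python
from itertools import product


def objective(labels, edges, alpha):
    """Assumed objective: reward likely pairs together, unlikely pairs apart."""
    total = 0.0
    for (v, w), p in edges.items():
        same = labels[v] == labels[w]
        total += alpha * p if same else (1.0 - alpha) * (1.0 - p)
    return total / len(edges)


def best_clustering(kinds, edges, name_identity, alpha=0.5):
    """Exhaustive search over cluster labelings (tiny graphs only).

    kinds: vertex -> "turn" | "name" | "identity"
    edges: (v, v') -> probability that v and v' are the same person
    name_identity: spoken name vertex -> its identity vertex
    """
    vertices = list(kinds)
    identities = [v for v in vertices if kinds[v] == "identity"]
    best_score, best_labels = float("-inf"), None
    for assignment in product(range(len(vertices)), repeat=len(vertices)):
        labels = dict(zip(vertices, assignment))
        # At most one identity per cluster, hence per speech turn.
        if len({labels[i] for i in identities}) != len(identities):
            continue
        # Each spoken name stays in the same cluster as its identity.
        if any(labels[s] != labels[i] for s, i in name_identity.items()):
            continue
        score = objective(labels, edges, alpha)
        if score > best_score:
            best_score, best_labels = score, labels
    return best_score, best_labels


# Toy graph: two speech turns, one spoken name and its identity.
kinds = {"t1": "turn", "t2": "turn", "s1": "name", "i1": "identity"}
edges = {("t1", "t2"): 0.9, ("s1", "t1"): 0.8, ("s1", "i1"): 1.0}
score, labels = best_clustering(kinds, edges, {"s1": "i1"})
```

On this toy graph the best labeling puts all four vertices in one cluster, so both speech turns inherit the identity `i1` even though `t2` has no direct edge to the spoken name.
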
0:14:39	the thing with this formulation is that
0:14:44	you see that we sum over all the edges of this graph
0:14:48	and the problem is that there are many more
0:14:54	speech turn to speech turn edges than there are for instance speech turn to spoken
0:14:59	name edges
0:15:00	so
0:15:02	i divided this objective function into sub-objective functions
0:15:09	this is basically exactly the same except that
0:15:12	we
0:15:13	give a weight to every type of edges
0:15:17	so this way we can give more weight for instance to spoken name to speech
0:15:22	turn edges in this graph
0:15:24	and this gives a set of
0:15:30	hyperparameters that we need to optimize so beta and alpha
0:15:36	and this is
0:15:40	optimized using a random search in the
0:15:43	alpha beta space
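
A random search over the alpha and beta space can be sketched as below. The sampling ranges, the normalization of the betas, the edge type names and the toy error function are all assumptions of this sketch; in practice `evaluate` would run the clustering and score it on a development set.

```python
import random


def random_search(evaluate, edge_types, n_trials=200, seed=0):
    """Sample (alpha, beta) at random and keep the setting with the lowest error.

    evaluate: callable (alpha, beta) -> error on a development set
    edge_types: names of the edge types that each get a weight beta
    """
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        alpha = rng.random()
        raw = [rng.random() + 1e-12 for _ in edge_types]
        total = sum(raw)
        # Normalizing the betas to sum to one is an assumption of this sketch.
        beta = {t: r / total for t, r in zip(edge_types, raw)}
        error = evaluate(alpha, beta)
        if best is None or error < best[0]:
            best = (error, alpha, beta)
    return best


# Hypothetical stand-in for "run the clustering and score it on the dev set".
def toy_error(alpha, beta):
    return abs(alpha - 0.3) + abs(beta["name-turn"] - 0.6)


EDGE_TYPES = ("turn-turn", "name-turn")
best_error, best_alpha, best_beta = random_search(toy_error, EDGE_TYPES)
```

Fixing the seed makes the search reproducible, which helps when comparing graph configurations tuned on the same development set.
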
0:15:46	how much more time
0:15:50	so i'm coming to the
0:15:53	experimental results
0:15:56	so
0:15:57	this is the corpus that we were given by the organizers of the repere challenge
0:16:04	so the corpus is divided into seven types of shows like tv news
0:16:09	talk shows
0:16:12	so the training set is made of twenty eight hours fully annotated in terms of
0:16:16	speakers speech transcript
0:16:19	and
0:16:21	spoken names
0:16:22	and also we are given visual information which is not relevant here but
0:16:28	for instance we get an annotation of
0:16:33	one frame every ten seconds we know exactly who appears in this frame
0:16:39	so this training set is used to estimate the probability between speech turns to
0:16:45	train the i-vector system and to train the speech turn to spoken name propagation probability
0:16:54	we used the development set
0:16:57	nine hours to estimate the hyperparameters alpha and beta
0:17:02	and we use the test set
0:17:04	and it's evaluated this way this is basically the identification error rate so
0:17:09	this is the total amount of
0:17:11	the total duration wrongly
0:17:15	identified plus
0:17:18	missed detections plus false alarms divided by the total duration of speech in the
0:17:22	reference
0:17:23	so this can go higher than one if you
0:17:26	we
0:17:27	do lots of false alarms for instance
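
The identification error rate as described above can be computed from aligned segments. The segment representation (duration, reference name, hypothesis name, with `None` meaning no named speaker) is an assumption of this sketch, and the toy durations are made up.

```python
def identification_error_rate(segments):
    """(confusion + miss + false alarm) / total reference speech duration.

    segments: iterable of (duration, reference, hypothesis) where reference
    and hypothesis are speaker names, or None when there is no speaker.
    """
    confusion = miss = false_alarm = total = 0.0
    for duration, reference, hypothesis in segments:
        if reference is not None:
            total += duration
            if hypothesis is None:
                miss += duration
            elif hypothesis != reference:
                confusion += duration
        elif hypothesis is not None:
            false_alarm += duration
    if total == 0.0:
        raise ValueError("the reference contains no speech")
    return (confusion + miss + false_alarm) / total


# Toy alignment in seconds (not data from the talk).
toy_segments = [
    (60.0, "a", "a"),    # correct
    (20.0, "a", "b"),    # confusion
    (10.0, "b", None),   # missed detection
    (10.0, None, "a"),   # false alarm
    (10.0, "b", "b"),    # correct
]
toy_ier = identification_error_rate(toy_segments)
```

Because false alarms are counted in the numerator but not in the denominator, a system that names a lot of non-speech can push the rate above one.
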
0:17:31	so here is the big table of results i'm going to focus on
0:17:36	a few selected points
0:17:38	so in this configuration b where we are completely unsupervised
0:17:44	it's
0:17:46	we can see that an oracle that would be able to name
0:17:50	someone as soon as his name is pronounced in the stream
0:17:54	anywhere in the audio stream
0:17:56	can only get fifty six percent recall anyway
0:18:01	we get twenty nine here using this graph
0:18:05	so there is a long way to go
0:18:08	to get perfect results here
0:18:11	when we combine the whole thing
0:18:15	the same oracle would get fourteen percent
0:18:20	identification error rate
0:18:22	and this oracle is able to recognize someone as soon as
0:18:25	either there is a biometric model for him or the name is pronounced in the
0:18:29	speech transcript
0:18:31	so
0:18:31	also there is a long way to go to get perfect results
0:18:35	but so i'm just going to focus on the interesting results now i mean the
0:18:40	ones that actually worked
0:18:44	so
0:18:46	note this is a better results angle i'm going to skip it as well
0:18:51	by adding the red edges in the graph so going from a to b
0:18:54	we're able to increase the recall so that was expected because we are now able
0:18:58	to propagate the names to all the speech turns
0:19:00	but also what's interesting is that we also increase the precision
0:19:04	which wasn't what i expected at first
0:19:08	when i did this work
0:19:12	and what's interesting also is that we can combine those two approaches named speaker
0:19:17	identification which is completely unsupervised
0:19:19	with standard
0:19:21	i-vector acoustic speaker identification
0:19:24	and we are able to get ten percent absolute improvement compared to
0:19:30	the i-vector system
0:19:32	and it works both for precision so we are able to increase the precision of
0:19:36	an i-vector system using those spoken names
0:19:39	and obviously recall because there are some persons for which we don't have
0:19:43	biometric models so
0:19:45	we can use the spoken names to
0:19:49	improve the identification
0:19:54	and i also wanted to stress this point that we also have results based on
0:19:59	fully manual
0:20:02	spoken name detection
0:20:03	and it happens that even though our
0:20:06	name detection system has a slot error rate of around thirty five percent
0:20:12	it actually doesn't degrade when we go from manual name detection to fully
0:20:17	automatic name detection so this is
0:20:19	an interesting result that we are robust to this kind of errors maybe because
0:20:23	spoken names are often repeated multiple times in the video so we manage to
0:20:27	get one of these
0:20:32	this is just
0:20:34	a representation of the weights beta that are automatically
0:20:40	obtained using hyperparameter tuning
0:20:43	when we only use this configuration b so this is completely unsupervised
0:20:48	it actually gives more weight
0:20:50	to speech turn to spoken name edges than to the edges between two
0:20:56	speech turns
0:20:57	and when we use the full graph
0:21:00	it actually gives the same weights
0:21:02	to the i-vector edges
0:21:04	and the speech turn to spoken name edges
0:21:08	so
0:21:09	this is the conclusion
0:21:11	so we got this ten percent absolute improvement over the i-vector system using spoken
0:21:16	names so this is kind of cheating because we're using more information but
0:21:21	this can be improved even more if we add for instance written names
0:21:25	experiments that we did
0:21:27	gave another fifteen percent increase in performance
0:21:32	and so there are still a lot of errors that we need to address
0:21:36	thank you very much
0:21:37	and thank you
0:21:42	just a quick advertisement on this corpus that may be of interest for those of
0:21:46	you doing speaker diarization as well
0:22:03	and i have the first question
0:22:07	not using any a priori knowledge on the distribution of speakers in a conversation or
0:22:14	in the media file like quite everybody
0:22:18	could you comment and do you think there is
0:22:20	some information to get that's the next step actually we plan to modify this
0:22:27	objective function to take the structure of the show into account so for instance we
0:22:32	could add here a term
0:22:35	that
0:22:36	takes into account the prior probability that when a speaker speaks at time
0:22:42	t then there is a high chance that we can hear him again thirty seconds
0:22:46	later so this is not at all taken into account for now but we
0:22:51	really need to add this
0:22:54	prior information on the structure
0:22:56	i totally agree but did you mean just the prior knowledge on the presence
0:23:01	of the speaker or
0:23:02	on
0:23:03	i don't know
0:23:05	the this
0:23:06	this is planned we're going to add some extra terms here to enforce
0:23:10	some kind of structure
0:23:13	okay thanks and just
0:23:15	could you also give a picture of the results of the evaluation campaign
0:23:21	you say that this was done in the framework of the repere evaluation
0:23:25	it could be nice to have an idea of what were the differences
0:23:30	between the different participants
0:23:33	you close to be a
0:23:35	we notice of the based on did you see some differences i don't know
0:23:40	the main difference was in the who appears when task in speaker id we were more
0:23:46	or less the same had the same results
0:23:48	but
0:23:50	actually what gives the most information in terms of identities is actually
0:23:58	the names that are written on screen
0:24:01	usually it's really easy to propagate them to the current
0:24:06	speaker
0:24:08	and this is the fifteen percent improvement in terms of performance when we
0:24:13	use the visual
0:24:15	stream
0:24:27	no it's basically based on the
0:24:34	segmentation used for this is done with gaussian divergence followed by some kind
0:24:41	of linear clustering and
0:24:44	no it's not oracle so among the thirty five percent errors
0:24:49	there are
0:24:50	i think five
0:24:52	to ten percent coming from the speech activity detection and segmentation errors

Person Instance Graphs for Named Speaker Identification in TV Broadcast

Speaker Diarization

Hervé Bredin, Antoine Laurent, Achintya Sarkar, Viet-Bac Le, Sophie Rosset and Claude Barras