0:00:15hi everyone so i'm [speaker name unclear] from limsi in france
0:00:20so this is a joint work with all those people and you might know claude
0:00:24barras the last author
0:00:25he says hi
0:00:26if you know him
0:00:30so i'm going to talk about this notion of person instance graphs for named
0:00:34speaker identification in tv broadcast
0:00:37so this is the outline of my talk
0:00:39first i'm going to give you a bit of context
0:00:43then i'm going to discuss this notion of person instance graph how we can
0:00:47build them
0:00:49and then how we can mine those graphs to do speaker identification
0:00:54in tv shows and present some experimental results and then conclude my talk
0:01:02about the context so we were working in the framework of this french challenge called
0:01:08repere
0:01:10where we were given tv shows like this one for instance
0:01:15they were
0:01:16talk shows tv news and we were asked to answer automatically these two
0:01:22questions who speaks when
0:01:24who appears when
0:01:29in this form so we really need to do speaker diarization and then try to
0:01:33identify each speech turn separately
0:01:36and provide normalized names
0:01:42it was very important to give the exact form of the name like nicolas
0:01:46sarkozy [other example names unclear]
0:01:50here i'm only going to focus on the who speaks when task
0:01:54so there are multiple sources of information to answer those questions
0:02:00so obviously we can use the audio stream to do speaker diarization and identification
0:02:04we can also process the speech to get some transcription from it
0:02:10we can obviously use the visual stream to do face clustering and recognition and we can
0:02:14try to get some names also from the
0:02:16ocr here
0:02:20and so there are those two text streams coming from asr and ocr
0:02:24too and we can do named entity detection on them and try to propagate
0:02:29the names to the speaker clusters for instance here i'm not going to use at all
0:02:33the visual information because this is a
0:02:37speaker [word unclear]
0:02:41so there are two ways of recognising people in this kind of video the
0:02:45unsupervised way and the supervised way
0:02:47on the left part in green i show you how we can do that
0:02:51in the unsupervised fashion that means that we are not allowed to use prior
0:02:55biometric models
0:02:57to recognize the person the speaker
0:03:00so it is usually done like this we first transcribe the speech and try
0:03:06to extract names from this speech transcript
0:03:10and in parallel we do speaker diarization and then we try to propagate the names
0:03:14that were detected in the speech transcript to the speaker clusters
0:03:18to try to name
0:03:21the speaker clusters that's what we call named speaker identification so this is fully
0:03:25unsupervised in terms of
0:03:27biometric models
0:03:29on the other side obviously
0:03:32when we have training data for various speakers we can for instance
0:03:36build an i-vector
0:03:39speaker id system and use it
0:03:41to do acoustic-based speaker identification
0:03:44and we could also try to fuse those two into one unified framework and
0:03:50that's what i'm going to talk about this talk is about trying to
0:03:53do all of that in one unified framework
0:04:00so this framework
0:04:03is actually what i call the person instance graph so i'm going to describe it
0:04:10as well as i can so that you get an idea of
0:04:13how it's built
0:04:15so starting from the speech signal
0:04:19we apply an off-the-shelf speech-to-text
0:04:23system from the company vocapia research
0:04:26and so it provides
0:04:28both the speech transcription so these are the
0:04:30black dots here
0:04:32and here you have a zoom on one particular speech turn and it also provides
0:04:36us with a segmentation into speech turns
0:04:40so in the rest of my talk
0:04:43the speech turns will be denoted by t like turn
0:04:47and for instance in this video
0:04:50in this whole audio as we don't use the video there are five speech
0:04:54turns denoted t one to t five
0:04:56those are the first nodes
0:04:58of my graph of this person instance graph
0:05:03on top of this speech transcript we can try to do spoken name detection
0:05:09to do that
0:05:11we use conditional random fields based on the wapiti implementation
0:05:17we train two different classes of models
0:05:20some of them were trained to only detect parts of names like
0:05:24first name last name titles
0:05:26and others were trained to detect complete names at once
0:05:32and so there are a bunch of models that we trained here and
0:05:36their outputs were combined using yet another
0:05:43one using the output of these models as features
0:05:48so what we get from these models is the names that are detected
0:05:52in the text stream
0:05:55and so here for instance there were five
0:05:59spoken names that were detected
0:06:01and they are connected in this graph
0:06:04to a canonical representation of the person here nicolas sarkozy's name was
0:06:11detected and it's connected
0:06:15to yet another
0:06:17node in this graph which represents nicolas sarkozy
0:06:21so in the rest of the talk spoken names will be
0:06:27denoted s
0:06:28and the identity
0:06:30vertices in this graph are denoted i
0:06:36so there are here for instance four identity nodes and five spoken
0:06:41names in this graph
0:06:44and so what can we do with those names that were detected so what we
0:06:47want is we want to
0:06:50propagate those spoken names to the neighboring speech turns we want to try
0:06:56to use them to identify the speaker in the conversation
0:07:01so there are many ways of estimating the probability that the spoken name s
0:07:05is actually the identity of the speech turn t
0:07:08in the literature so at first people were using hand-made rules
0:07:13based on the
0:07:14context of the pronounced name in the speech transcript
0:07:18other people used contextual n-grams
0:07:22and more recently semantic classification trees so we chose to use contextual n-grams here
0:07:27so let me show you an example if in the speech transcript
0:07:31someone says thank you s where s might be nicolas sarkozy for instance then it's very likely
0:07:36that the previous speech turn
0:07:37is actually nicolas sarkozy so that's basically what it does here
0:07:41there is an eighty eight percent chance that the spoken name s
0:07:47is actually the identity of the previous speech turn t one
0:07:50that's how we are able to connect spoken names to speech turns in the graph
0:07:55so edges are weighted by these probabilities
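The contextual n-gram idea described above can be sketched as a simple relative-frequency estimate. This is only an illustration under stated assumptions: the training examples, the `<name>` placeholder, the position labels and the function names are all hypothetical, and the real system may estimate these probabilities differently.

```python
from collections import Counter, defaultdict

# Hypothetical annotated examples: the n-gram context around a spoken name,
# and which speech turn (relative to the current one) the name refers to.
TRAINING = [
    ("thank you <name>", "previous"),
    ("thank you <name>", "previous"),
    ("thank you <name>", "previous"),
    ("thank you <name>", "other"),
    ("good evening <name>", "next"),
    ("good evening <name>", "next"),
    ("good evening <name>", "other"),
]


def train_propagation_model(examples):
    """Estimate P(position | context) as relative frequencies."""
    counts = defaultdict(Counter)
    for context, position in examples:
        counts[context][position] += 1
    return {
        context: {pos: n / sum(c.values()) for pos, n in c.items()}
        for context, c in counts.items()
    }


def edge_weight(model, context, position):
    """Weight of the edge between a spoken name and a neighboring turn."""
    return model.get(context, {}).get(position, 0.0)


model = train_propagation_model(TRAINING)
print(edge_weight(model, "thank you <name>", "previous"))
```

With real annotated transcripts the same lookup would give the weight of each spoken name to speech turn edge; unseen contexts fall back to zero here, which is a simplification.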
0:08:01and then so
0:08:02it's good but we can only propagate the names to the neighboring speech turns
0:08:07so what can we do next we can also compute some
0:08:11kind of similarity between all the speech turns
0:08:13here we simply use the bayesian information criterion based on mfcc features for each
0:08:19speech turn and here for instance you have the
0:08:22inter-speaker
0:08:24distribution of the bic
0:08:29similarity measure and
0:08:31in green the intra-speaker one on our repere dataset
0:08:35and so based on those two distributions we can estimate some kind of probability that
0:08:40two speech turns t and t prime are the same speaker
0:08:43that's how we connect all the speech turns in the graph
0:08:43that's how we connect all the speech turns in the graph
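One way to turn a similarity score into a same-speaker probability from the two distributions is Bayes' rule. The sketch below assumes each distribution is summarized by a Gaussian and that higher scores mean more similar; the talk does not say how the distributions are modeled, and the numbers and function names are made up for illustration.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def same_speaker_probability(score, intra, inter, prior_same=0.5):
    """Posterior probability that two speech turns share a speaker.

    intra and inter are (mean, std) pairs summarizing the intra-speaker
    and inter-speaker score distributions (assumed Gaussian here).
    """
    like_same = prior_same * gaussian_pdf(score, *intra)
    like_diff = (1.0 - prior_same) * gaussian_pdf(score, *inter)
    return like_same / (like_same + like_diff)


# Made-up distribution parameters, not values from the talk.
INTRA = (1.0, 1.0)
INTER = (-1.0, 1.0)
print(same_speaker_probability(2.0, INTRA, INTER))
```

The resulting value would be used as the weight of the edge between the two speech turns.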
0:08:48so at this point we have this big graph here
0:08:54so i'm just going to focus on the notation here so v is the set of
0:08:58vertices in this graph so there are three types of vertices speech turns t
0:09:03spoken names s
0:09:04and identity vertices i
0:09:07and this graph is not necessarily complete
0:09:13for instance this identity vertex may not be connected to this speech
0:09:18turn for instance so
0:09:20this is an incomplete graph and
0:09:23we denote by p
0:09:24the weights that are
0:09:27given to each edge so p v v prime is actually the probability that the
0:09:32two vertices v and v prime
0:09:34are actually the same person or the same identity
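The notation above maps onto a small data structure. This is a minimal sketch, not the authors' implementation: the class names `Vertex` and `PersonInstanceGraph`, the vertex labels and the example weights (apart from the eighty eight percent quoted earlier) are assumptions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Vertex:
    kind: str   # "T" speech turn, "S" spoken name, "I" identity
    label: str


@dataclass
class PersonInstanceGraph:
    vertices: set = field(default_factory=set)
    edges: dict = field(default_factory=dict)  # frozenset({v, v'}) -> p

    def add_edge(self, v, w, p):
        """Connect v and w with the probability p that they are the same person."""
        if not 0.0 <= p <= 1.0:
            raise ValueError("edge weight must be a probability")
        self.vertices.update((v, w))
        self.edges[frozenset((v, w))] = p

    def probability(self, v, w):
        """Return p for the pair, or None when the graph has no such edge."""
        return self.edges.get(frozenset((v, w)))


graph = PersonInstanceGraph()
t1, t2 = Vertex("T", "t1"), Vertex("T", "t2")
s1, i1 = Vertex("S", "s1"), Vertex("I", "i1")
graph.add_edge(s1, t1, 0.88)  # spoken name to previous speech turn
graph.add_edge(t1, t2, 0.20)  # speech turn to speech turn
graph.add_edge(s1, i1, 1.00)  # spoken name to its identity
print(graph.probability(t1, s1), graph.probability(t2, i1))
```

Missing pairs return `None` rather than zero, which mirrors the point that the graph is incomplete.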
0:09:39so now that we have this graph what do we want to achieve we want to
0:09:42mine those graphs
0:09:44to finally get our answer so try to give an identity to each of these
0:09:50speech turns
0:09:51so you see in this example so this is the reference here
0:09:55it's nearly impossible to get a because
0:09:58the name of this guy a is never even pronounced in the
0:10:02tv show
0:10:05by chance we may have
0:10:09a biometric model for this guy
0:10:11so there are
0:10:13this is a very messy slide
0:10:17so depending on how many edges we put in this graph we can address
0:10:20different tasks
0:10:21for instance if we just connect the spoken names with speech turns we are able
0:10:27just to
0:10:29identify the addressee
0:10:30of each speech turn so only neighboring speech turns can be
0:10:36identified but then if we add those
0:10:39speech turn to speech turn edges
0:10:43we're able to propagate the names to all the speech turns
0:10:46and if by chance we have biometric models for these guys [speaker labels unclear]
0:10:52then using an i-vector system for instance we are able to connect each speech
0:10:57turn to all
0:10:59biometric models
0:11:03and estimate some kind of probability that those are the same person
0:11:07so this is completely supervised speaker identification and this is completely unsupervised and
0:11:13we can try to add all these edges in this big graph to do jointly
0:11:16unsupervised and supervised
0:11:19speaker identification
0:11:25how can we mine these graphs then
0:11:28the objective is always to take each vertex in this graph and
0:11:32try to give it the correct identity
0:11:34so this can actually be modeled as a clustering problem
0:11:37where we want to group all instances all vertices in the graph corresponding to the
0:11:43same person
0:11:44with the actual identity so here is what we expect from a perfect system
0:11:50in this graph
0:11:52we would like to
0:11:53put in the same cluster
0:11:55the speech turns by speaker c and all the names spoken
0:11:59well all the times his name is pronounced also in the same group
0:12:03and we would like the same for speaker a in my first example
0:12:09even though we don't have an identity a in the graph we want to
0:12:13be able to
0:12:14cluster all his speech turns like that
0:12:16and some spoken names are useless to identify
0:12:20anyone because this is just someone we're talking about and not someone who
0:12:23is present in the tv show
0:12:27so to do that
0:12:29we define
0:12:30a set of functions called clustering functions so
0:12:35delta
0:12:37associated to each pair of vertices in this graph v v prime is one
0:12:41if they are in the same cluster and zero otherwise
0:12:45the thing is not all functions defined like that
0:12:48actually code for a valid clustering what we need to do is we need to
0:12:52add some other constraints to these functions for instance
0:12:58v must be in the same cluster as itself
0:13:01symmetry constraints and also transitivity constraints like if v and v prime are in
0:13:06the same cluster and v prime and v second are in the same cluster then
0:13:09v and v second must be in the same cluster
0:13:11so this defines a search space
0:13:15delta v
0:13:16on the set of vertices
0:13:20we need to look for
0:13:22the best clustering function delta
0:13:25that best clusters all our data
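The three constraints on the clustering function (reflexivity, symmetry, transitivity) can be checked directly on a small example. This is a sketch with hypothetical helper names; it represents delta as a dictionary over ordered pairs of vertices, which is one convenient choice among several.

```python
from itertools import product


def is_valid_clustering(vertices, delta):
    """Check that delta (pair -> 0 or 1) encodes a valid clustering."""
    # Reflexivity: every vertex is in the same cluster as itself.
    if any(delta[(v, v)] != 1 for v in vertices):
        return False
    # Symmetry: delta(v, w) equals delta(w, v).
    if any(delta[(v, w)] != delta[(w, v)]
           for v, w in product(vertices, repeat=2)):
        return False
    # Transitivity: v~w and w~x together imply v~x.
    return not any(
        delta[(u, v)] and delta[(v, w)] and not delta[(u, w)]
        for u, v, w in product(vertices, repeat=3)
    )


def delta_from_clusters(clusters):
    """Build delta from an explicit list of clusters."""
    label = {v: k for k, cluster in enumerate(clusters) for v in cluster}
    return {(v, w): int(label[v] == label[w]) for v in label for w in label}


vertices = ["t1", "t2", "s1"]
good = delta_from_clusters([["t1", "s1"], ["t2"]])
broken = dict(good)
broken[("t1", "t2")] = broken[("t2", "t1")] = 1  # t2~t1 and t1~s1 but not t2~s1
print(is_valid_clustering(vertices, good), is_valid_clustering(vertices, broken))
```

The second delta is symmetric and reflexive but violates transitivity, so it does not correspond to any partition of the vertices.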
0:13:29so to do that we use integer linear programming
0:13:32and we want to maximize this objective function
0:13:36basically a good clustering would
0:13:40group similar data
0:13:42or data with high
0:13:46probability into the same cluster and separate
0:13:51vertices with low similarity into two different clusters so that's what this
0:13:56objective function does
0:13:58and it is just normalized by the
0:14:00number of edges in the graph
0:14:02and we have this parameter alpha that can be tuned
0:14:06to balance between intra-cluster similarity and inter-cluster dissimilarity
0:14:12and we also add additional constraints like for instance
0:14:16for every speech turn in the graph
0:14:19it can have at most one identity
0:14:23[aside unclear]
0:14:27but usually you have only one identity
0:14:29and also we force spoken names
0:14:33to be in the same cluster as their identity
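To make the objective and the two extra constraints concrete, here is a toy sketch that enumerates every partition of a four-vertex graph instead of calling an integer linear programming solver. The exact form of the objective is an assumption reconstructed from the description (alpha times p for pairs placed together, one minus alpha times one minus p for pairs kept apart, averaged over edges), and the vertex names and weights are made up apart from the eighty eight percent quoted earlier.

```python
def partitions(items):
    """Yield every partition of a list as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in partitions(rest):
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        yield [[first]] + partition


def objective(clusters, edges, alpha):
    """Reward likely pairs kept together and unlikely pairs kept apart."""
    label = {v: k for k, block in enumerate(clusters) for v in block}
    total = 0.0
    for (v, w), p in edges.items():
        same = label[v] == label[w]
        total += alpha * p if same else (1.0 - alpha) * (1.0 - p)
    return total / len(edges)


def satisfies_constraints(clusters, name_to_identity):
    """At most one identity per speech turn; names stay with their identity."""
    label = {v: k for k, block in enumerate(clusters) for v in block}
    for block in clusters:
        has_turn = any(v.startswith("T:") for v in block)
        n_identities = sum(v.startswith("I:") for v in block)
        if has_turn and n_identities > 1:
            return False
    return all(label[s] == label[i] for s, i in name_to_identity.items())


def best_clustering(vertices, edges, name_to_identity, alpha=0.5):
    """Exhaustive search; only feasible for a handful of vertices."""
    feasible = (c for c in partitions(list(vertices))
                if satisfies_constraints(c, name_to_identity))
    return max(feasible, key=lambda c: objective(c, edges, alpha))


VERTICES = ["T:t1", "T:t2", "S:s1", "I:a"]
EDGES = {("S:s1", "T:t1"): 0.88, ("T:t1", "T:t2"): 0.20, ("S:s1", "I:a"): 1.0}
best = best_clustering(VERTICES, EDGES, {"S:s1": "I:a"})
print(best)
```

On this toy graph the first speech turn ends up with the spoken name and its identity, and the second speech turn stays alone. Exhaustive enumeration grows far too quickly for real shows, which is why a solver is needed in practice.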
0:14:39the thing with this formulation is that
0:14:44you see that we sum over all the edges of this graph
0:14:48and the problem is that there are many more
0:14:54speech turn to speech turn edges than there are for instance speech turn to spoken
0:14:59name edges
0:15:02so i divided this objective function into sub-objective functions
0:15:09this is basically exactly the same except that
0:15:13we give a weight to every type of edge
0:15:17so this way we can give more weight for instance to spoken name to speech
0:15:22turn edges in this graph
0:15:24and this gives a set of
0:15:30hyperparameters that we need to optimize so beta and alpha
0:15:36and this is
0:15:40optimized using a random search
0:15:43in the alpha beta space
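The per-edge-type weighting and the random search can be sketched as follows. Everything here is illustrative: the edge type names, the normalization of beta to sum to one, and especially `toy_dev_error`, which stands in for running the full system on a development set and measuring its error.

```python
import random

EDGE_TYPES = ("turn-turn", "turn-name", "turn-identity")


def weighted_objective(same_cluster, edges_by_type, alpha, beta):
    """Sum of per-type sub-objectives, each normalized by its own edge count."""
    total = 0.0
    for edge_type, edges in edges_by_type.items():
        if not edges:
            continue
        sub = sum(alpha * p if same_cluster[pair] else (1 - alpha) * (1 - p)
                  for pair, p in edges.items())
        total += beta[edge_type] * sub / len(edges)
    return total


def sample_hyperparameters(rng):
    """Draw alpha in [0, 1) and a beta weight per edge type summing to one."""
    alpha = rng.random()
    raw = [rng.random() for _ in EDGE_TYPES]
    beta = {t: r / sum(raw) for t, r in zip(EDGE_TYPES, raw)}
    return alpha, beta


def random_search(evaluate, n_trials=50, seed=0):
    """Keep the sampled (alpha, beta) with the lowest evaluation error."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        alpha, beta = sample_hyperparameters(rng)
        error = evaluate(alpha, beta)
        if best is None or error < best[0]:
            best = (error, alpha, beta)
    return best


def toy_dev_error(alpha, beta):
    """Hypothetical stand-in for the error measured on a development set."""
    return abs(alpha - 0.3) + abs(beta["turn-name"] - 0.6)


error, alpha, beta = random_search(toy_dev_error)
print(round(error, 3), round(alpha, 3), {t: round(b, 3) for t, b in beta.items()})
```

Normalizing each sub-objective by its own number of edges keeps the numerous speech turn to speech turn edges from drowning out the rarer spoken name edges, which is the motivation given above.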
0:15:46how much more time
0:15:50so i'm coming to the
0:15:53experimental results
0:15:57here is the corpus that we were given by the organisers of the repere challenge
0:16:04so the corpus is divided into seven types of shows like tv news
0:16:09talk shows
0:16:12so the training set is made of twenty eight hours fully annotated in terms of
0:16:16speakers speech transcripts
0:16:19and names
0:16:21the spoken names
0:16:22and also we are given visual information which is not relevant here but
0:16:28for instance we get an annotation for
0:16:33one frame every ten seconds we know exactly who appears in this frame
0:16:39so this training set is used to estimate the probability between speech turns to
0:16:45train the i-vector system and to train the speech turn to spoken name propagation probability
0:16:54we used the development set
0:16:57nine hours to estimate the hyperparameters alpha and beta
0:17:02and we use the test set
0:17:04for evaluation and the metric is basically the identification error rate so
0:17:09this is the total
0:17:11duration of wrongly
0:17:15identified speech plus
0:17:18missed detection plus false alarm divided by the total duration of speech in the
0:17:23reference so this can go higher than one if you
0:17:27produce lots of false alarms for instance
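The metric as described reduces to a single ratio of durations. The function name and the example durations below are hypothetical; the sketch only shows why the value can exceed one.

```python
def identification_error_rate(confusion, miss, false_alarm, total):
    """(wrongly identified + missed + false alarm) / reference speech duration.

    All arguments are durations in the same unit, for example seconds.
    """
    if total <= 0:
        raise ValueError("total reference speech duration must be positive")
    return (confusion + miss + false_alarm) / total


# 100 s of reference speech: 30 s wrongly identified, 20 s missed, 10 s false alarm.
print(identification_error_rate(30.0, 20.0, 10.0, 100.0))
# A system that hallucinates a lot of speech can go above one.
print(identification_error_rate(30.0, 20.0, 150.0, 100.0))
```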
0:17:31so here is the big table of results i'm going to focus on
0:17:36a few selected points
0:17:38so in this configuration b where we are completely unsupervised
0:17:46we can see that an oracle that would be able to name
0:17:50someone as soon as his name is pronounced
0:17:54anywhere in the audio stream
0:17:56can only get fifty six percent recall anyway
0:18:01we get twenty nine here using this graph
0:18:05so there is a long way to go
0:18:08to get perfect results here
0:18:11when we combine the whole thing
0:18:15the same oracle would get fourteen percent
0:18:20identification error rate
0:18:22and this oracle is able to recognize someone as soon as
0:18:25either there is a biometric model for him or the name is pronounced in the
0:18:29speech transcript
0:18:31so there is also a long way to go to get perfect results
0:18:35but so i'm just going to focus on the interesting results now i mean the
0:18:40ones that actually worked
0:18:46no this is a bad result actually i'm going to skip it as well
0:18:51by adding the red edges in the graph so going from a to b
0:18:54we're able to increase the recall so that was expected because we are now able
0:18:58to propagate the names to all the speech turns
0:19:00but also what's interesting is that we also increase the precision
0:19:04which wasn't what i expected at first
0:19:08when i did this work
0:19:12and what's interesting also is that we can combine those two approaches the named speaker
0:19:17identification which is completely unsupervised
0:19:19with standard
0:19:21i-vector acoustic speaker identification
0:19:24and we are able to get a ten percent absolute improvement compared to
0:19:30the i-vector system
0:19:32and it works both for precision so we are able to increase the precision of
0:19:36an i-vector system using those spoken names
0:19:39and obviously recall because there are some persons for which we don't have
0:19:43biometric models so
0:19:45we can use the spoken names
0:19:49to improve the identification
0:19:54and i also wanted to stress this point that we also have results based on
0:19:59fully manual
0:20:02spoken name detection
0:20:03and it happens that even though our
0:20:06name detection system has a slot error rate of around thirty five percent
0:20:12it actually doesn't degrade when we go from manual name detection to fully
0:20:17automatic name detection so this is
0:20:19an interesting result that we are robust to this kind of errors maybe because
0:20:23spoken names are often repeated multiple times in the video so we manage to
0:20:27get one of these
0:20:32this is just
0:20:34a representation of the weights beta that we automatically
0:20:40obtain using hyperparameter tuning
0:20:43when we only use this configuration b so this is completely unsupervised
0:20:48it actually gives more weight
0:20:50to speech turn to spoken name edges than to the edges between two
0:20:56speech turns
0:20:57and when we do this for the full graph
0:21:00it actually gives the same weights
0:21:02to the i-vector edges
0:21:04and the speech turn to spoken name edges
0:21:09this is the conclusion
0:21:11so we got this ten percent absolute improvement over the i-vector system using spoken
0:21:16names so this is kind of cheating because we're using more information but
0:21:21this can be improved even more if we add for instance written names
0:21:25experiments that we did
0:21:27gave another fifteen percent increase in performance
0:21:32and so there are still a lot of errors that we need to address
0:21:36thank you very much
0:21:37and thank you
0:21:42just a quick advertisement on this corpus that may be of interest for those of
0:21:46you doing speaker diarization as well
0:22:03and i have the first question
0:22:07not using any a priori knowledge on the distribution of speakers in a conversation or
0:22:14in the media file [unclear]
0:22:18could you comment and do you think there is
0:22:20some information to get that's the next step actually we plan to modify this
0:22:27objective function to take the structure of the show into account so for instance we
0:22:32could add here a term
0:22:36to take into account the prior probability that when a speaker speaks at time
0:22:42t then there is a high chance that we can hear him again thirty seconds
0:22:46later so this is not at all taken into account for now but we
0:22:51really need to add this
0:22:54prior information on the structure
0:22:56i totally agree but did you mean just the prior knowledge on the presence
0:23:01of the speaker or
0:23:03i don't know
0:23:05this
0:23:06is planned we're going to add some extra terms here to enforce
0:23:10some kind of structure
0:23:13okay thanks and just
0:23:15could you also give a picture of the results of the evaluation campaign
0:23:21you said that this work was done in the context of an evaluation
0:23:25it could be nice to have an idea of what the differences were
0:23:30between the different participants
0:23:33you close to be a
0:23:35we notice of the based on did you see some differences i don't know
0:23:40the main difference was in the who appears when task in speaker id we were more or
0:23:46less the same with the same results
0:23:48but what
0:23:50actually gives the most information in terms of identities is
0:23:58the names that are written on screen
0:24:01usually it's really easy to propagate them to the current speech turn
0:24:08and there is this fifteen percent improvement in terms of performance when we
0:24:13use the visual
0:24:15stream
0:24:27no it's basically based on the
0:24:34segmentation used for this it starts with gaussian divergence followed by some kind
0:24:41of linear clustering and
0:24:44no it's not oracle so among the thirty five percent error
0:24:49there is
0:24:50i think five
0:24:52to ten percent coming from the speech activity detection and segmentation errors