Hi everyone. I'm from LIMSI, in France.

This is joint work with all these people, and you might know Claude Barras, the last author. He says hi, if you know him.

I'm going to talk about this notion of person instance graphs for named speaker identification in TV broadcast.

Here is the outline of my talk. First, I'm going to give you a bit of context. Then I'm going to discuss this notion of person instance graphs and how we can build them, then how we can mine those graphs to do speaker identification in TV shows, present some experimental results, and conclude my talk.

About the context: we were working in the framework of the French challenge called REPERE. We were given TV shows, like this one for instance; they were talk shows and TV news, and we were asked to answer two questions automatically: who speaks when, and who appears when?

And in this form: we really need to do speaker diarization, then try to identify each speech turn separately, and provide normalized names. It was very important to give the exact form of the name, like 'Nicolas Sarkozy' for instance. But here I'm only going to focus on the 'who speaks when' task.

There are multiple sources of information to answer those questions. Obviously, we can use the audio stream to do speaker diarization and identification. We can also process the speech to get a transcription from it. We can use the visual stream to do face clustering and recognition, and we can try to get some names from the OCR as well.

So there are those two text streams, coming from ASR and OCR; we can do named entity detection on them and try to propagate the detected names to the speaker clusters, for instance. Here I'm not going to use the visual information, because this is a speaker workshop.

Okay.

There are two ways of recognizing people in this kind of video: the unsupervised way and the supervised way. On the left, in green, I show how we can do that in an unsupervised fashion, meaning that we are not allowed to use prior biometric models to recognize the speaker.

It is usually done like this: we first transcribe the speech and try to extract names from the speech transcript; in parallel, we do speaker diarization; and then we try to propagate the names detected in the speech transcript to the speaker clusters, to try to name them. That's what we call named speaker identification, and it is fully unsupervised in terms of biometric models.

On the other side, obviously, when training data is available, we can for instance build an i-vector speaker ID system and use it to do acoustic-based speaker identification. And we can also try to fuse those two into one unified framework; that's what this talk is about: doing all of that in one unified framework.

Okay.

This framework is what I call the person instance graph, and I'm going to describe it as well as I can, so that you get an idea of how it is built.

Starting from the speech signal, we apply an off-the-shelf speech-to-text system from the company Vocapia Research. It provides both the speech transcription (these are the black dots here; here you have a zoom on one particular speech turn) and the segmentation into speech turns.

In the rest of my talk, speech turns will be denoted t, like 'turn'. For instance, in this audio excerpt there are five speech turns, denoted t1 to t5. Those are the first nodes of my graph, of this person instance graph.

On top of this speech transcript, we can try to do spoken name detection. To do that, we use conditional random fields, based on the Wapiti implementation of CRFs. We trained two different classes of models: some of them were trained to only detect parts of names (first name, last name, titles), and the others were trained to detect complete names at once. So there is a bunch of models that we trained here, and their outputs were combined using yet another CRF model, using the outputs of these models as features.
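As a rough illustration of this stacked setup, here is a minimal sketch in Python; it uses the sklearn-crfsuite library instead of Wapiti, and the features, labels, and toy data are made up for illustration.

```python
# Minimal sketch of stacked CRFs for spoken name detection.
# Uses sklearn-crfsuite instead of Wapiti; toy data, labels, and
# feature names are illustrative only.
import sklearn_crfsuite

def token_features(tokens, i, stage1_labels=None):
    """Basic features for token i; optionally add first-stage CRF outputs."""
    feats = {
        "word": tokens[i].lower(),
        "is_capitalized": tokens[i][0].isupper(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }
    if stage1_labels is not None:
        # the outputs of the first-stage models become features for the combiner
        for name, labels in stage1_labels.items():
            feats[f"stage1_{name}"] = labels[i]
    return feats

def featurize(sent, stage1=None):
    return [token_features(sent, i, stage1) for i in range(len(sent))]

# toy training sentence, annotated with "parts of names" and "complete names"
sentences = [["thank", "you", "Nicolas", "Sarkozy"]]
labels_parts = [["O", "O", "FIRST", "LAST"]]
labels_full = [["O", "O", "NAME", "NAME"]]

# stage 1: one model for name parts, one for complete names
crf_parts = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf_parts.fit([featurize(s) for s in sentences], labels_parts)
crf_full = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf_full.fit([featurize(s) for s in sentences], labels_full)

# stage 2: a combiner CRF that sees the stage-1 predictions as features
def stage1_outputs(sent):
    return {
        "parts": crf_parts.predict([featurize(sent)])[0],
        "full": crf_full.predict([featurize(sent)])[0],
    }

crf_combiner = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf_combiner.fit([featurize(s, stage1_outputs(s)) for s in sentences], labels_full)
```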

What we get from this model is that names are detected in the text stream. Here, for instance, five spoken names were detected, and they are connected in the graph to a canonical representation of the person: here, the name 'Nicolas Sarkozy' was detected, and it is connected to yet another node in the graph, which represents Nicolas Sarkozy.

In the rest of the talk, spoken names will be denoted s, and the identity vertices in the graph are denoted i. So here, for instance, there are four identity nodes and five spoken names in the graph.

So what can we do with those detected names? We want to propagate the spoken names to the neighboring speech turns; we want to use them to identify the speakers in the conversation.

There are many ways of estimating the probability that a spoken name s is actually the identity of a speech turn t in the literature. At first, people were using hand-made rules based on the context of the spoken name in the speech transcript; other people used contextual n-grams; and, even more recently, semantic classification trees. We chose to use contextual n-grams here.

Let me show you an example. If, in the speech transcript, someone says 'thank you, Mister Nicolas Sarkozy' for instance, then it is very likely that the previous speech turn is actually Nicolas Sarkozy. That's basically what this says here: there is an eighty-eight percent chance that the spoken name s is the identity of the previous speech turn t1. That's how we are able to connect spoken names to speech turns in the graph, with edges weighted by these probabilities.
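To make the idea concrete, here is a sketch of how such probabilities could be estimated by counting over annotated data; the context patterns, offsets, and observations are hypothetical.

```python
# Sketch: estimating p(spoken name s identifies the turn at offset dt | context)
# by counting over an annotated training set. Patterns and data are hypothetical.
from collections import defaultdict

# (context n-gram around the name) -> (relative turn offset) -> count
counts = defaultdict(lambda: defaultdict(int))

# hypothetical training observations: (n-gram context, true relative turn);
# offset -1 = previous speech turn, 0 = current, +1 = next
observations = [
    ("thank you <name>", -1),
    ("thank you <name>", -1),
    ("<name> you have the floor", +1),
]
for context, offset in observations:
    counts[context][offset] += 1

def name_turn_probability(context, offset):
    """p(name in this context is the identity of the turn at `offset`)."""
    total = sum(counts[context].values())
    return counts[context][offset] / total if total else 0.0

# e.g. 'thank you, Mister X' -> the previous turn is very likely X
print(name_turn_probability("thank you <name>", -1))  # -> 1.0 on this toy data
```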

This is good, but we can only propagate the names to the neighboring speech turns. So what can we do next? We can also compute some kind of similarity between all the speech turns. Here we simply use the Bayesian information criterion (BIC), based on MFCC features for each speech turn. Here, for instance, you have the inter-speaker distribution of the BIC similarity measure, and, in green, the intra-speaker one, on our REPERE dataset. Based on those two distributions, we can estimate some kind of probability that two speech turns t and t' are the same speaker. That's how we connect all the speech turns in the graph.
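For reference, here is a minimal sketch of a BIC-style comparison between two speech turns represented by their MFCC matrices; the penalty weight `lam` is an assumption and would normally be tuned.

```python
# Sketch: Delta-BIC between two speech turns from their MFCC features.
# Lower values suggest the two turns come from the same speaker.
import numpy as np

def delta_bic(mfcc1, mfcc2, lam=1.0):
    """mfcc1, mfcc2: (n_frames, n_dims) arrays of MFCC features."""
    n1, d = mfcc1.shape
    n2, _ = mfcc2.shape
    n = n1 + n2
    both = np.vstack([mfcc1, mfcc2])
    # log-determinants of full covariance matrices
    logdet = lambda x: np.linalg.slogdet(np.cov(x, rowvar=False))[1]
    gain = 0.5 * (n * logdet(both) - n1 * logdet(mfcc1) - n2 * logdet(mfcc2))
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return gain - penalty  # > 0: separate models fit better (different speakers)

rng = np.random.default_rng(0)
print(delta_bic(rng.normal(size=(200, 12)), rng.normal(size=(150, 12))))
```

The raw score is then mapped to a probability using the intra- versus inter-speaker score distributions estimated on the training set, as described above.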

At this point, we have this big graph. Let me just focus on the notation here. V is the set of vertices in the graph; there are three types of vertices: speech turns t, spoken names s, and identity vertices i. This graph is not necessarily complete: this identity vertex, for instance, may not be connected to this speech turn. So this is an incomplete graph, and we denote by p the weights given to the edges: p(v, v') is the probability that the two vertices v and v' correspond to the same person, the same identity.
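Putting the pieces together, the resulting graph could be represented like this (a sketch using networkx; node names and probabilities are illustrative):

```python
# Sketch of a person instance graph: three vertex types, edges weighted
# by the probability p(v, v') that both vertices are the same person.
import networkx as nx

G = nx.Graph()
# speech turns t, spoken names s, identity vertices i
for t in ["t1", "t2", "t3"]:
    G.add_node(t, kind="turn")
G.add_node("s1", kind="spoken_name")
G.add_node("Nicolas_Sarkozy", kind="identity")

# spoken name -> neighboring speech turn (contextual n-gram probability)
G.add_edge("s1", "t1", p=0.88)
# spoken name -> its canonical identity (the same person by construction)
G.add_edge("s1", "Nicolas_Sarkozy", p=1.0)
# speech turn <-> speech turn (BIC similarity turned into a probability)
G.add_edge("t1", "t2", p=0.75)
G.add_edge("t2", "t3", p=0.10)
# note: the graph is not complete; missing edges simply carry no evidence
```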

Now that we have this graph, what do we want to achieve? We want to mine those graphs to finally get our answer, that is, to give an identity to each of the speech turns. You see in this example (this is the reference here) that it is nearly impossible to get speaker A, because the name of this guy is never even pronounced in the TV show. But, by chance, we may have a biometric model for him.

This is a very messy slide, but, depending on how many edges we put in this graph, we can address different tasks. For instance, if we just connect spoken names to speech turns, we are only able to identify the addressee of each speech turn, so only neighboring speech turns can be identified. Then, if we add the speech-turn-to-speech-turn edges, we are able to propagate the names to all the speech turns. And if, by chance, we have biometric models for some of these people, then, using an i-vector system for instance, we are able to connect each speech turn to all the biometric models and estimate some kind of probability that they are the same person.

This part is completely supervised speaker identification, this part is completely unsupervised, and we can use all the edges of this big graph to do unsupervised and supervised speaker identification jointly.

So, how can we mine these graphs? The objective is always the same: to give the correct identity to each vertex in the graph. This can actually be modeled as a clustering problem, where we want to group all instances, all vertices of the graph, corresponding to the same person, together with the actual identity. Here is what we would expect from a perfect system on this graph.

We would like to put in the same cluster the speech turns by speaker C and all the times his name is pronounced, also here, in the same cluster. And for speaker A, from my first example, even though we don't have an identity node A in the graph, we want to be able to cluster his speech turns on their own, like that. And some spoken names are useless for identification, because they refer to someone we are talking about, not someone who is present in the TV show.

To do that, we define a set of functions that we call clustering functions. A clustering function delta associates, to each pair of vertices (v, v') in the graph, the value one if they are in the same cluster, and zero otherwise. The thing is, not every function defined like that actually encodes a valid clustering; we need to add other constraints to these functions. For instance, reflexivity: v must be in the same cluster as itself; symmetry constraints; and transitivity constraints: if v and v' are in the same cluster, and v' and v'' are in the same cluster, then v and v'' must be in the same cluster. This defines a search space of clustering functions over the set of vertices.
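In symbols, as far as it can be reconstructed from this description, the clustering function and its validity constraints read as follows; the linear form of transitivity is the usual ILP encoding:

```latex
\begin{aligned}
\delta &: V \times V \to \{0,1\} \\
\delta(v,v)  &= 1                  && \text{(reflexivity)} \\
\delta(v,v') &= \delta(v',v)       && \text{(symmetry)} \\
\delta(v,v') + \delta(v',v'') - \delta(v,v'') &\le 1 && \text{(transitivity, linearized)}
\end{aligned}
```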

We then need to look for the best clustering function delta, the one that best clusters all our data. To do that, we use integer linear programming, and we want to maximize this objective function. Basically, a good clustering groups similar data, data with a high probability of being the same person, into the same cluster, and separates vertices with low similarity into different clusters. That's what this objective function does; it is simply normalized by the number of edges in the graph. And we have this parameter alpha that can be tuned to balance intra-cluster similarity against inter-cluster dissimilarity.
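Reconstructed from this description, the objective has roughly the following form, with alpha trading off the two terms:

```latex
\max_{\delta \in \Delta} \;
\frac{\alpha}{|E|} \sum_{(v,v') \in E} \delta(v,v')\, p(v,v')
\;+\;
\frac{1-\alpha}{|E|} \sum_{(v,v') \in E} \bigl(1-\delta(v,v')\bigr)\bigl(1-p(v,v')\bigr)
```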

We also add additional constraints. For instance, every speech turn in the graph can have at most one identity; well, it depends, but usually you have only one identity. And we also force spoken names to be in the same cluster as their identity.
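Here is a minimal sketch of this ILP using the PuLP solver, on the toy graph from before; alpha is fixed arbitrarily, and the two constraints just mentioned are only noted in comments:

```python
# Sketch: mining the person instance graph with integer linear programming.
# Toy edge probabilities; alpha is set arbitrarily here.
from itertools import combinations, permutations
from pulp import LpBinary, LpMaximize, LpProblem, LpVariable, lpSum

p = {("s1", "t1"): 0.88, ("t1", "t2"): 0.75, ("t2", "t3"): 0.10}
V = sorted({v for edge in p for v in edge})
alpha = 0.5

prob = LpProblem("person_instance_clustering", LpMaximize)

# one binary variable per unordered pair: delta(v, v') = same cluster?
delta = {(u, v): LpVariable(f"delta_{u}_{v}", cat=LpBinary)
         for u, v in combinations(V, 2)}
d = lambda u, v: delta[(u, v)] if (u, v) in delta else delta[(v, u)]

# objective: reward grouping high-probability pairs and separating
# low-probability ones, normalized by the number of edges
w = 1.0 / len(p)
prob += lpSum(w * (alpha * q * d(u, v) + (1 - alpha) * (1 - q) * (1 - d(u, v)))
              for (u, v), q in p.items())

# transitivity, linearized: delta(u,v) + delta(v,w) - delta(u,w) <= 1
for u, v, x in permutations(V, 3):
    prob += d(u, v) + d(v, x) - d(u, x) <= 1

# (omitted here: each speech turn clustered with at most one identity
# vertex, and each spoken name forced into the cluster of its identity)

prob.solve()
print([pair for pair, var in delta.items() if var.value() == 1])
```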

The thing with this formulation is that, as you can see, we sum over all the edges of the graph, and the problem is that there are many more speech-turn-to-speech-turn edges than there are speech-turn-to-spoken-name edges. So I divided this objective function into sub-objective functions; it is basically exactly the same, except that we give a weight to every type of edge. This way, we can give more weight, for instance, to spoken-name-to-speech-turn edges. This yields a set of hyper-parameters that we need to optimize, the betas and alpha, and they are optimized using a random search in the alpha-beta space.
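The search itself can be as simple as the following sketch; `evaluate_on_dev` is a hypothetical callback that solves the ILP with the given hyper-parameters and scores the result on the development set.

```python
# Sketch: random search over alpha and the per-edge-type weights beta.
import random

EDGE_TYPES = ["turn_turn", "turn_spoken_name", "turn_ivector"]

def random_search(evaluate_on_dev, n_trials=100, seed=0):
    rng = random.Random(seed)
    best_err, best_params = float("inf"), None
    for _ in range(n_trials):
        alpha = rng.random()
        beta = {t: rng.random() for t in EDGE_TYPES}
        total = sum(beta.values())
        beta = {t: b / total for t, b in beta.items()}  # normalize to sum to 1
        err = evaluate_on_dev(alpha, beta)  # hypothetical: returns dev-set IER
        if err < best_err:
            best_err, best_params = err, (alpha, beta)
    return best_params, best_err
```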

How much more time do I have? So, I'm coming to the experimental results.

Here is the corpus we were given by the organizers of the REPERE challenge. It is divided into seven types of shows, such as TV news and talk shows. The training set is made of twenty-eight hours, fully annotated in terms of speakers, speech transcripts, and spoken names.

We were also given visual annotations, which are not relevant here; for instance, for one frame every ten seconds, we know exactly who appears in that frame. This training set is used to estimate the probabilities between speech turns, to train the i-vector system, and to train the speech-turn-to-spoken-name propagation probabilities.

We used the development set, nine hours, to estimate the hyper-parameters alpha and beta, and we used the test set for evaluation. The metric is basically the identification error rate: the total duration of wrongly identified speech, plus false alarms and missed detections, divided by the total duration of speech in the reference. This can go higher than one if you produce lots of false alarms, for instance.
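In code, the metric could look like this sketch, which works on frame-aligned labels for simplicity; the official metric works on timed segments:

```python
# Sketch: identification error rate on frame-aligned labels.
# `ref` and `hyp` map each frame to a speaker name, or None for non-speech.
def identification_error_rate(ref, hyp):
    confusion = false_alarm = miss = 0
    total = sum(1 for r in ref if r is not None)  # total reference speech
    for r, h in zip(ref, hyp):
        if r is not None and h is None:
            miss += 1
        elif r is None and h is not None:
            false_alarm += 1
        elif r is not None and h != r:
            confusion += 1
    return (confusion + false_alarm + miss) / total  # can exceed 1.0

ref = ["A", "A", None, "B", "B", "B"]
hyp = ["A", "C", "C", "B", None, "B"]
print(identification_error_rate(ref, hyp))  # (1 + 1 + 1) / 5 = 0.6
```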

So here is the big table of results; I'm going to focus on a few selected points. In configuration B, where we are completely unsupervised, we can see that an oracle, one that would name a person as soon as their name is pronounced anywhere in the audio stream, would only get fifty-six percent recall anyway. We get twenty-nine here using this graph, so there is a long way to go to reach perfect results.

When we combine the whole thing, the corresponding oracle would get fourteen percent identification error rate; this oracle is able to recognize a person as soon as either there is a biometric model for them, or their name is pronounced in the speech transcript. So again, there is a long way to go to reach perfect results.

So I'm just going to focus on the interesting results now, I mean the ones that actually worked, and skip this one as well.

By adding the red edges in the graph, going from A to B, we were able to increase the recall. That was expected, because we are now able to propagate the names to all the speech turns. But what is interesting is that we also increased the precision, which wasn't what I expected at first when I did this work.

What is also interesting is that we can combine those two approaches, the completely unsupervised named speaker identification and standard i-vector acoustic speaker identification, and we get a ten percent absolute improvement compared to the i-vector system. It works both for precision, since we are able to increase the precision of an i-vector system using the spoken names, and obviously for recall, because there are some persons for whom we don't have biometric models, so we can use the spoken names to improve the identification.

I also wanted to stress this point: we also have results based on fully manual spoken name detection. It turns out that, even though our name detection system has a slot error rate of around thirty-five percent, performance does not degrade when we go from manual name detection to fully automatic name detection. This is an interesting result: we are robust to this kind of error, maybe because spoken names are often repeated multiple times in the video, so we manage to catch at least one of them.

This is just a representation of the weights beta that we obtained automatically through hyper-parameter tuning. When we only use configuration B, completely unsupervised, it actually gives more weight to speech-turn-to-spoken-name edges than to the edges between two speech turns. And for the full graph, it gives the same weight to the i-vector edges and to the speech-turn-to-spoken-name edges.

So, to conclude: we got this ten percent absolute improvement over the i-vector system using spoken names. This is kind of cheating, because we are using more information, but it can be improved even more if we add, for instance, written names: experiments that we did with them gave another fifteen percent increase in performance. And there are still a lot of errors that we need to address. Thank you very much.

Thank you.

Just a quick advertisement on this corpus, which may be of interest for those of you doing speaker diarization as well. And I have the first question:

Q: You are not using any a priori knowledge on the distribution of speakers in a conversation or in the media file, like almost everybody does. Could you comment on that, and do you think adding such information would help?

A: That's the next step, actually. We plan to modify the objective function to take the structure of the show into account. For instance, we could add a term here that takes into account the prior probability that, when a speaker speaks at time t, there is a high chance that we can hear them again thirty seconds later. This is not at all taken into account for now, but we really need to add this prior information about the structure.

A: I totally agree. But did you mean just prior knowledge on the presence of the speaker, or on... I don't know. This is planned: we are going to add some extra terms here to enforce some kind of structure.

Q: Okay, thanks. And could you also give us a picture of the results of the evaluation campaign? You said this was done in the context of an evaluation; it could be nice to have an idea of the differences between the participants. Were you close to each other? Did you see some differences?

A: The main difference was on the 'who appears when' task; on speaker ID we all got more or less the same results. But what actually gives the most information in terms of identity is the names that are written on screen: usually it is really easy to propagate them to the current speaker, and there is a fifteen percent improvement in performance when we use the visual stream.

A: No, the segmentation used for this is basically BIC-based, with Gaussian divergence followed by some kind of linear clustering. And no, it is not an oracle: out of the thirty-five percent, there are, I think, five to ten percent coming from speech activity detection and segmentation errors.