Speech Transcript - The REPERE Challenge: finding people in a multimodal context

so let's talk or more complex

right so that we present you do that

So the goal of this

challenge is to find people in a

multimodal context.

What do we mean by multimodal context? It means that the participants can use the speech

and the image

to recognize people.

It is a French collaboration. The corpus is provided by ELDA,

the evaluation is organized by LNE, and

three research consortia participate in the challenge. This presentation

is a presentation of the evaluation; it is not a presentation of the systems that participate

in the challenge.

If you want more details about the systems themselves,

please go to Interspeech, where you might see some

of them.

So, about my presentation: I will first present the tasks, then the corpus,

the metrics we used,

and some results from the dry run campaign,

and then some perspectives and conclusions.

So the main task is to answer the question: who is present in the videos?

That means: who is visible or

who is speaking in the videos.

Two conditions are proposed. The first is a supervised condition, which means that the participants can

build

a priori models, for the face or for the speech, of the different

persons that might be in the videos.

On the other side, you have an unsupervised condition, where the participants are allowed to

use only the test videos to find the people.

This main task is

the central one, and we also have sub-tasks that

are more precise in the question. That means answering the questions: who

is speaking, who is visible in the video, what names are cited

in the speech,

and what names are displayed on the screen.

To answer these questions there are two conditions: a multimodal condition, where people can

use all the modalities to answer the question, and a monomodal condition where, for who

is speaking, they can only use the speech,

and for who is visible, they can only use

the video, that is, the image.

and

we assume that to answer these questions there are some technologies that are

necessary, and so we propose that

we also evaluate them:

we assess the speaker diarization, the speech transcription, the head detection and segmentation, the

overlaid text detection and segmentation,

and the optical character recognition for the text on screen.

So, about the schedule: as I said, the dry run

was conducted this year, and the first and second official campaigns will be in

two thousand thirteen and two thousand fourteen.

So, what about the shows? You have seven different shows that are

annotated in the corpus.

There are different episodes of the same show, so some

people, for example the presenters, are present in multiple

yeah

episodes, and in different shows.

We worked with different kinds of shows, like news information shows or political debates.

You have the parliament's questions to the government sessions too,

and the celebrity news shows.

We chose these kinds of shows because they are very different and variable.

Some of them are more difficult than others because of the kind of speech. For

example, for the celebrity news show you have

more spontaneous speech, and for the parliament's questions to the government it is always

read speech. So the idea is to mix the conditions of speech.

All these

shows come from two different channels,

and at the end of the project there will be sixty hours of

videos

in the database. I can imagine that you do not know what these shows look like,

so I propose to show you a little sample, to give you an idea of

the

data.

yeah

sure

yeah

i think yeah

yeah

yeah so

The corpus was annotated with visual annotations,

I mean from the image point of view.

On average we annotate one image every ten seconds.

We delimit the heads in each of these

images.

The heads are described, to say for example whether there is

an occlusion of the head or not, for example if you have an object or

something in front of it.

The person is named.

For the text, the text objects are delimited by a rectangle and transcribed,

and in all the text transcriptions the person names

are annotated in the text.

And so as to have something which is

more like a diarization, the appearance and disappearance times of all the heads

and all the text

are

given, so as

to know the full duration of each appearance.
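
As an illustration of the visual annotation just described, here is a minimal Python sketch of how one annotated key frame could be represented. All class, field and function names (`HeadAnnotation`, `TextAnnotation`, `KeyFrame`, `key_frame_times`, `visible_names`) are hypothetical and do not reflect the actual file format of the corpus. The strictly regular ten-second spacing is a simplifying assumption, since the talk only says one image every ten seconds is annotated.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class HeadAnnotation:
    outline: List[Tuple[int, int]]   # points delimiting the head (hypothetical encoding)
    occluded: bool                   # True when something hides part of the head
    person_name: Optional[str]       # None when the annotators could not name the person
    appears_at: float                # seconds, start of the appearance
    disappears_at: float             # seconds, end of the appearance


@dataclass
class TextAnnotation:
    box: Tuple[int, int, int, int]   # rectangle as (x, y, width, height), an assumption
    transcription: str
    person_names: List[str] = field(default_factory=list)
    appears_at: float = 0.0
    disappears_at: float = 0.0


@dataclass
class KeyFrame:
    time: float
    heads: List[HeadAnnotation] = field(default_factory=list)
    texts: List[TextAnnotation] = field(default_factory=list)


def key_frame_times(duration: float, step: float = 10.0) -> List[float]:
    """Return regularly spaced key-frame times (assumed spacing: `step` seconds)."""
    times = []
    t = 0.0
    while t < duration:
        times.append(t)
        t += step
    return times


def visible_names(frame: KeyFrame) -> List[str]:
    """Names of the identified persons whose head is annotated in the frame."""
    return sorted(h.person_name for h in frame.heads if h.person_name is not None)
```

A head whose `person_name` is `None` still counts as a detected person. It simply has no name attached to it.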

For the speech annotation, we have a standard transcription of all the data,

with the speaker turn segmentation and the music segmentation too,

and a rich speech transcription that includes all the disfluencies

and

all the

hesitation words

that French speakers use, and so on,

and all these kinds of expressions that might be useful to recognise

the people.

And we name all the persons that are speaking, and all

the

names of people that are cited in the speech transcription

are annotated too. Here you have an example, and that's

where the name is at the beginning.

So the main metric we use is the Estimated Global Error Rate. It is based

on the misses and false alarms, but we want to take into account the fact that

the system may have found the correct number of people who are present in

the video. That is why we include a confusion: it means that if you have

the right number of people but you make an error on the name

of the people, it is

a less important error than to miss someone. That is why

we use this metric for the main task and for the

questions who is speaking, who is visible,

and what names are displayed.
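
The error counting described above can be sketched as follows. This is only a minimal illustration, assuming that each key frame comes with a set of reference names and a set of hypothesis names. The function name `eger`, the per-frame counting rule and the default `confusion_weight` of 0.5 are assumptions made for this example, chosen only to reflect the remark that a confusion is a less important error than a miss. The official scoring tool may count things differently.

```python
from typing import Iterable, Set, Tuple


def eger(frames: Iterable[Tuple[Set[str], Set[str]]],
         confusion_weight: float = 0.5) -> float:
    """Estimated global error rate over (reference, hypothesis) name sets.

    confusion_weight is a hypothetical parameter: values below 1.0 make a
    wrong name cheaper than a missed person or a false alarm.
    """
    confusions = misses = false_alarms = total = 0
    for reference, hypothesis in frames:
        correct = len(reference & hypothesis)
        unmatched_ref = len(reference) - correct
        unmatched_hyp = len(hypothesis) - correct
        # A wrong name paired with a person who is present is a confusion.
        confusion = min(unmatched_ref, unmatched_hyp)
        confusions += confusion
        misses += unmatched_ref - confusion
        false_alarms += unmatched_hyp - confusion
        total += len(reference)
    if total == 0:
        raise ValueError("the reference contains no person")
    return (confusion_weight * confusions + misses + false_alarms) / total


# Toy data (invented): one confusion, one miss and one false alarm
# over three reference persons.
toy_frames = [
    ({"A", "B"}, {"A", "C"}),  # B is present, C is named instead: confusion
    ({"D"}, set()),            # D is present, nobody is named: miss
    (set(), {"E"}),            # nobody is present, E is named: false alarm
]
print(eger(toy_frames))
```

With `confusion_weight=1.0` the three error types count equally, which makes it easy to compare both conventions on the same data.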

And for what names are cited we use the Slot Error Rate, which is a

comparison of the hypothesis and the reference intervals for the names.
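
A slot error rate is commonly computed as the number of substitutions, deletions and insertions divided by the number of reference slots. The sketch below is a simplified illustration of that idea, assuming that a slot is a time interval with a name and that two slots are paired as soon as they overlap in time. The names `Slot`, `overlaps` and `slot_error_rate`, the greedy pairing and the equal weight given to every error type are assumptions, not the scoring rules of the evaluation.

```python
from typing import List, Tuple

Slot = Tuple[float, float, str]  # (start_time, end_time, name)


def overlaps(a: Slot, b: Slot) -> bool:
    """True when the two time intervals share some duration."""
    return a[0] < b[1] and b[0] < a[1]


def slot_error_rate(reference: List[Slot], hypothesis: List[Slot]) -> float:
    """(substitutions + deletions + insertions) / number of reference slots."""
    if not reference:
        raise ValueError("reference must contain at least one slot")
    unused = list(hypothesis)
    substitutions = deletions = 0
    for ref in reference:
        candidates = [h for h in unused if overlaps(ref, h)]
        if not candidates:
            deletions += 1  # the cited name was not found at all
            continue
        # Greedy choice: prefer an overlapping slot that carries the right name.
        match = next((h for h in candidates if h[2] == ref[2]), candidates[0])
        unused.remove(match)
        if match[2] != ref[2]:
            substitutions += 1  # found at the right place, wrong name
    insertions = len(unused)  # hypothesis slots matching no reference slot
    return (substitutions + deletions + insertions) / len(reference)
```

Because insertions are counted, this rate can exceed 1.0 when a system outputs many spurious names.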

So, for the dry run: the dry run corpus is a very small

corpus. The goal was to see what we can do with these

metrics and this kind of corpus. It is clear that it is not enough for

the systems to develop something which is

high-performing, but that was not the goal of the dry run.

And what we saw here is that

the speech duration per speaker is very short.

The majority of the speakers speak for less than twenty seconds, but it

depends on the show, as you can see for each show,

and you also have people who speak for more than

two hundred and sixty seconds. So that is the diversity of the corpus. And for

the distribution of people according to the number of key frames

in which they appear, you have the same thing: some of them do not appear

very much, as you can see. But usually, when someone appears

a lot, that person also speaks a lot, and so by combining audio and visual information you

might find who is speaking and who is present in the video.

And so if you look at

the moments where the name is displayed, or the face is visible, or the

speaker is speaking, in all the corpus, you can see that for eight percent

the person is speaking, appears, and has his name displayed on the

videos at the same time.

and

But for example you have

seventy

percent of the people who just have their name displayed on the screen. And so for

the main task, for example, you do not have to say that these

people are present in the video, because they are not speaking and they are

not

visible. And this distribution

is very different according to the kind of show. For example, for BFM Story

you have

as much as thirty-two percent of the people whose displayed names

are not useful to find the people, and for LCP, on

the contrary, you are sure that if you find the name of a person, it means

that

this person is present in the video. So the participants have to analyze

this kind of thing;

it might be useful to have this kind of information.
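
The kind of distribution discussed here (speaking, visible, name displayed, or any combination) can be computed with a simple count. The following is a minimal sketch under the assumption that each observation of a person is reduced to the set of modalities in which that person occurs. The function name `modality_distribution` and the modality labels are hypothetical, and the data in the comment is invented for illustration.

```python
from collections import Counter
from typing import Dict, FrozenSet, Iterable


def modality_distribution(
    observations: Iterable[FrozenSet[str]],
) -> Dict[FrozenSet[str], float]:
    """Share of observations for each combination of modalities."""
    counts = Counter(observations)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {combination: n / total for combination, n in counts.items()}


# Invented observations, one per person occurrence:
# toy = [
#     frozenset({"speaking", "visible", "name_displayed"}),
#     frozenset({"name_displayed"}),
#     frozenset({"name_displayed"}),
#     frozenset({"speaking"}),
# ]
# modality_distribution(toy)[frozenset({"name_displayed"})] is the share of
# occurrences where only the name is on screen.
```

Running this per show, rather than over the whole corpus, would expose the differences between shows mentioned in the talk.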

So

here you have the annotations and the clues you can use

to answer the questions.

you know i

There are more than

two hundred and sixty-seven people

in the dataset,

and one hundred seventy-one people in the test set.

And as you can see,

there are some anonymous people. That means that the annotators

were not able to know who they were from what is in the

video, just by watching the video.

That is why I call them anonymous, and the systems have to find that there is

someone, but they do not have to

name them.

For the first results, it is clear that it is a dry run, so the results

are not so good.

What we want to compare is

the results of the systems

for the main task,

compared to the tasks who is speaking and who is visible. And as you can

see,

they have better results when saying who is speaking than when saying who is

visible in the videos, and for the main task the main problem is to

say who is visible.

So,

for who is speaking

in particular, we analyzed the results by comparing

the results for the supervised multimodal condition and the supervised monomodal condition.

As you can see, there is no significant difference in the results

between the two conditions. That means that

the information that comes from

the image or from the OCR was not used by the systems to improve their

results.

And on this side you have

the results by show.

The center of the circle represents the mean performance,

and

the radius represents the standard deviation of the performance.
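
The per-show summary shown on the slide (a circle whose center is the mean and whose radius is the standard deviation) can be reproduced with a few lines of Python. This is a minimal sketch: the function name `summarize_by_show` is hypothetical, the input is assumed to be a list of scores per show, and the population standard deviation is used as one possible choice, since the talk does not say which estimator was applied.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def summarize_by_show(
    scores: Dict[str, List[float]],
) -> Dict[str, Tuple[float, float]]:
    """Map each show to (mean score, standard deviation of the scores)."""
    return {
        show: (mean(values), pstdev(values))
        for show, values in scores.items()
        if values  # shows without any score are skipped
    }
```

A show with a low mean but a large deviation is the situation described next in the talk: the average looks good, but the performance is not stable.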

And as you can see, according to the show, the systems are

more or less successful. So if we compare them:

for this one here, the results are very consistent, which means that

this show is processed in a stable way. But regarding the green, the

dark green one, even if the value is lower, the variation of the performance is

more important, so that might be something that the systems have to improve too.

Regarding who is visible

in the videos,

doing the same kind of analysis, you can see that there is a significant

difference between the supervised multimodal condition and the supervised monomodal condition. So here the speech

is useful, and the systems have used this complementary information. And here again

you have

the representation of the results according to the show, and here again you

have different performances, and the variation of the performance across shows

is important.

For what names are cited,

we focus here on the kinds of mistakes

made by the systems. And again, as you

can see, deletion is the most important

error here, for all the systems that participated.

And the

result might be that

the systems that

detect whether a cited word is a name have to be improved, because they

miss a lot of names.

For what names are displayed, the performance again can be improved, and we focus on

the OCR and text segmentation results.

The OCR results are not so good, but not so bad:

again, the systems can extract some information, and the segmentation is quite good.

So it is again the problem of extracting the names from the text

that is the main problem.

So, in conclusion, the goal of the challenge is to find people in a multimodal

condition, in the French language. The main question is who is present in the

video, but you have sub-tasks and

several questions that can be helpful to address the main task.

And this challenge is now open to anyone who wishes to participate, so

you can join.

And for the dry run, it is clear that speech information can improve the

visual results, and we also saw an important variability of the performance according to the shows.

For the perspectives:

for the metrics,

we want to take into account the role of the person in the videos,

because for some applications, in particular for clustering of videos,

the importance of a person depends on his role in the video, so

it is an important point to work on. And

we want to weight

the importance of the people according to the available modalities: if

you miss someone who is speaking and

is visible, you will make more errors than if that person is just

speaking or just visible on the screen.

For that we want to

improve the characterisation of the differences between scenarios,

use the due to linger more speech analysis

a more

and this is a different size for the videos

and to work more on what happens when the same speaker

is in different shows, because it is not exactly the same thing to speak in parliament

and to be in a debate with other people. So it is this kind of social

shows and the time we hope for

So thank you. If you have any questions, please ask.

All the description that I have given here is for

all the

data, because after that, all of this will be in

the training part for the official campaign, so that is

why we did

the analysis on all the data. And it is

it's a choice because we don't have but that's a now we don't have announced

does not shoot to speech it's in doing so

morning

since the yes

yeah

Yes, the continuous videos are presented to the systems;

they can use all the videos.

For the annotation, for the evaluation, the evaluation is based on key

frames.

That is only for the evaluation; for the participants it is all

the videos.

And it is just because it is very expensive to do this kind of

annotation, so that is why we use key frames for the evaluation.

And that is why we indicate the beginning and the end of the appearance of the

people, so that the systems also have something

like diarization, even if it is not exactly diarization for the videos.

But it is always the problem of the

expensive part of doing this kind of

annotation.

yeah

For the speech, for the question who is speaking, they have to

answer for all the video, but for the visible part they have

to focus on

the key frames.

But it is clear that at the beginning they do not know where the key

frames are.

At test time they do not know

where the people are in the video.

For the transcription, they have to transcribe all the videos.

At the beginning the systems just have access to

the videos,

and they have to use their own system to transcribe the videos.

that's the for the set task we bss for example

a use that you have to some reassurance after this for the main task they

just have to use what is available at the beginning, and

they use the technologies they want.

so that summer yeah

a used to transcribe it does this one

the up on transcription

and also prefer just doing yeah generalization and have a

they don't for a lot on does unsupervised condition and so he was a lot

of face models or

the voice

no i think of the name of the detailed the shows so for example a

single a lot of the present data because for the information show is always the

same presentation right now

in all that they all signed the or the shows the old interest shows but

they don't know a always the

in fact that people for example so yeah

yeah

SESSION 09: Speaker Recognition Evaluation

Juliette Kahn