0:00:26 So let's talk about a more complex task.
0:00:59 The goal of this challenge is to find people in a multimodal context.
0:01:07 What we mean by a multimodal context is that the participants can use both the speech and the image to recognize people.
0:01:18 It is a French collaboration: the corpus is provided by ELDA and the evaluation is organized by LNE.
0:01:28 Three consortia participated in the challenge. This presentation is a presentation of the evaluation, not a presentation of the systems that participated in the challenge.
0:01:41 If you want more details about the systems themselves, please go to the Interspeech sessions, where you might see some of them.
0:01:55 About my presentation: I will first present the tasks, then the corpus, the metrics we used, and some results from the dry run campaign; then some perspectives and conclusions.
0:02:13 The main task is to answer the question: who is present in the video?
0:02:19 That means: who is visible or who is speaking in the video.
0:02:25 Two conditions are proposed. The first is a supervised condition: the participants can build a priori models of the faces or the voices of the different persons that might appear in the videos.
0:02:45 On the other side you have an unsupervised condition, where the participants are only allowed to use the test videos themselves to find the people.
0:02:59 Besides this main task, we also have subtasks that ask more precise questions: who is speaking, who is visible in the video, what names are cited in the speech, and what names are displayed on the screen.
0:03:21 To answer these questions there are again two conditions: a multimodal condition, where participants can use all the modalities, and a monomodal condition where, for example, for 'who is speaking' they can only use the speech, and for 'who is visible' they can only use the image.
0:03:47 We assume that to answer these questions some component technologies are necessary, so we also assess them: speaker diarization, speech transcription, head detection and segmentation, overlaid text detection and segmentation, and optical character recognition for the text on screen.
0:04:18 A word about the schedule: as I said, a dry run was conducted, and the first and second official campaigns will be in 2013 and 2014.
0:04:36 What about the shows? There are seven different shows annotated in the corpus, with several episodes of each show, so that some people, for example the presenters, are present in multiple episodes, and some appear in different shows.
0:05:02 We worked with different kinds of shows: news shows, political debates, parliamentary questions to the government, and celebrity news shows.
0:05:17 We chose these kinds of shows because they are very different and variable. Some of them are more difficult, for example because of the kind of speech: in the celebrity news show you have more spontaneous speech, while the parliamentary questions to the government are mostly read speech, so the corpus mixes the speech conditions.
0:05:50 All these shows come from two different channels, and at the end of the project there will be sixty hours of video in the database.
0:06:02 I imagine you don't know these shows, so I propose to show you a little sample, so that you have an idea of the data.
0:06:25 [video sample plays]
0:08:03 So, the corpus was annotated first with visual annotations, from the image point of view: we annotated one image every ten seconds.
0:08:21 We delimit the heads with bounding boxes, and the heads are described: we indicate whether there is an occlusion of the head, for example if something partially hides it, and the person is named.
0:08:45 The text objects are delimited in rectangles and transcribed, and in all the detected text transcriptions the person names are annotated.
0:09:02 And, so as to have something more like a diarization, the appearance intervals of all the heads and all the texts are also given, so we know the full duration of each appearance.
0:09:22 For the speech annotation we have a standard transcription of all the videos, with the speaker turn segmentation, and the music segmentation too.
0:09:33 We also produced a rich speech transcription that includes all the disfluencies and all the hesitation words; I am French so I don't know the exact English term, but words like 'alright', 'so', and all the kinds of expressions that might be useful to recognize the people.
0:10:00 We name all the persons that are speaking, and all the mentions of people in the speech transcription are annotated with the person's name; you have an example here, where the cited name is annotated at the beginning.
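As an illustration of the annotation scheme just described (one annotated image every ten seconds, head boxes with an occlusion flag and a name, and transcribed overlaid-text boxes with person names tagged inside), one annotated key frame could be represented like this. All field names and types here are my own assumptions, not the official annotation format:

```python
from dataclasses import dataclass, field

@dataclass
class Box:
    x: int
    y: int
    width: int
    height: int

@dataclass
class HeadAnnotation:
    person: str            # person name, or an anonymous id for unknown people
    box: Box
    occluded: bool         # whether something partially hides the head

@dataclass
class TextAnnotation:
    box: Box
    transcription: str     # full transcription of the overlaid text
    person_names: list[str] = field(default_factory=list)  # names tagged in the text

@dataclass
class KeyFrame:
    timestamp: float       # seconds from the start of the video; one frame every 10 s
    heads: list[HeadAnnotation] = field(default_factory=list)
    texts: list[TextAnnotation] = field(default_factory=list)
```

The separate appearance intervals (beginning and end of each head or text) would live alongside these key frames, since the key frames alone do not give the full duration of an appearance.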
0:10:25 The main metric we use is the Estimated Global Error Rate, which is based on misses and false alarms, but we wanted to reward the fact that a system has found the correct number of people present in the video. That is why we include a confusion error: if you find the right number of people but miss one person and give a wrong name instead, that miss paired with a false alarm counts as a single confusion, which is a less important error than missing someone entirely.
0:11:05 We use this metric for the main task and for the questions 'who is speaking', 'who is visible' and 'what names are displayed'.
0:11:16 For 'what names are cited' we use the slot error rate, which is a comparison of the hypothesis and the reference intervals for the names.
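The counting rule for the main metric, where a miss paired with a false alarm collapses into a single, lighter confusion error, can be sketched as follows. This is my own illustration of the idea, not the official scoring tool, and it scores a single annotated frame:

```python
def eger(reference, hypothesis):
    """Sketch of an Estimated Global Error Rate for one annotated frame.

    reference, hypothesis: sets of person names. A miss is a reference
    name absent from the hypothesis; a false alarm is a hypothesis name
    absent from the reference. A miss paired with a false alarm is
    counted once, as a confusion, instead of twice.
    """
    misses = reference - hypothesis
    false_alarms = hypothesis - reference
    confusions = min(len(misses), len(false_alarms))
    errors = (confusions
              + (len(misses) - confusions)
              + (len(false_alarms) - confusions))
    return errors / max(len(reference), 1)
```

For example, with reference {A, B, C} and hypothesis {A, B, D}, the wrong name D paired with the missing C counts as one confusion, giving an error rate of 1/3 rather than the 2/3 that separate miss and false-alarm counts would give.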
0:11:27 About the dry run: the dry run corpus is a very short corpus. The goal was to see what we can do with these metrics on this kind of corpus, and it is clear that it is not enough for the systems to develop something with good performance, but that was not the goal of the dry run.
0:11:50 What we saw here is that the speech duration per speaker is very short: the majority of the speakers speak less than twenty seconds. This is a consequence of the kinds of shows: depending on the show, you also have people who speak much more, over two hundred and sixty seconds, so that is the diversity of the corpus.
0:12:24 For the distribution of people according to the number of key frames in which they appear, you see the same thing: some of them appear in only a few frames. But usually, when someone appears on many key frames, he also speaks a lot, so by combining the audio and the visual information you might find who is speaking and who is present in the video.
0:12:50 And if you align the moments where the person's name is displayed, the face is visible, and the person is speaking, over the whole corpus, you can see that in eight percent of the cases the person who is speaking also appears with his name displayed on the video at the same time.
0:13:15 But you also have, for example, around seventy percent of the people who just have their name displayed on the screen; so for the main task you must not say that these people are present in the video, because they are neither speaking nor visible.
0:13:37 And this distribution is very different according to the kind of show: for one of the shows, as much as thirty-two percent of the displayed names are not useful to find the people, while for LCP it is the contrary: you are sure that if you find the name of a person, this person is present in the video. So the participants have to analyze this kind of thing; this kind of information might be useful to them.
0:14:22 So here you have the annotations and the clues you can use to answer the questions.
0:14:34 There are more than two hundred and sixty-seven people in the dataset, with one hundred and seventy-one people in the test set.
0:14:52 As you can see, there are also some anonymous persons: for these, the annotators were not able to tell who the person was just by watching the video. That is why we call them anonymous; the systems have to find that there is someone, but they have no name for him.
0:15:21 For the first results: it is clear that this is a dry run, so the results are not so good. What we want to compare here are the results of the systems for the main task against the subtasks 'who is speaking' and 'who is visible'. As you can see, the systems get better results at saying who is speaking than at saying who is visible in the videos, so for the main task the main problem is to say who is visible rather than who is speaking.
0:16:04 For 'who is speaking' in particular, we analyzed the results comparing the supervised multimodal condition and the supervised monomodal condition, and as you can see there is no significant difference between the two conditions. That means the information coming from the other modality, the image, was not used by the systems to improve their results.
0:16:38 On this side you have the results by show. The center of each circle represents the mean performance, and the radius represents the standard deviation over the episodes.
0:16:57 As you can see, depending on the show, the systems are more or less robust. If we compare them, for one show the results are very consistent, so that show is correctly processed; but for the dark green one, even if the mean error is lower, the variation of the performance is more important, so that might be something the systems have to improve.
0:17:38 Regarding who is visible in the videos, doing the same kind of analysis, you can see that there is a significant difference between the supervised multimodal condition and the supervised monomodal condition: here the speech and the audio are useful, and the systems have exploited this complementary information.
0:18:05 And here again you have the representation of the results according to the show, and here again the performances differ and the variation of the performance across shows is important.
0:18:19 For 'who is cited', we focus here on the kinds of mistakes, the error rates obtained by the systems, and as you can see, deletions are the most important errors here, for all the systems that participated.
0:18:39 The conclusion might be that the ability of the systems to detect that a sentence contains a name has to be improved, because at the moment they miss a lot of names.
0:18:55 For 'what names are displayed', the performance again can be improved. We focused on the OCR and text segmentation results: the OCR results are not so bad, the systems can extract some information, and the segmentation is quite good, so it is again the extraction of the names from the text that is the main problem.
0:19:25 In conclusion: the goal of the challenge is to find people in a multimodal context, in French-language videos. The main question is who is present in the video, but you have subtasks, several questions that can be helpful to solve the main task.
0:19:47 This challenge is now open to anyone who wishes to participate, so you can join.
0:19:56 From the dry run, it is clear that fusing the information can improve the results, and we also observed an important variability of the performance according to the shows.
0:20:07 For the perspectives: on the metrics, we want to take into account the role of the person in the video, because for some applications, in particular for clustering of videos, the importance of a person depends on his role in the video, so it is an important piece of work.
0:20:33 We also want to weight the importance of the people according to the available modalities: if you miss someone who is both speaking and visible, you make a bigger error than if he is just speaking or just visible on the screen.
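A minimal sketch of this weighting idea, where missing a person evident in both modalities costs more than missing one seen or heard in a single modality. The function and the weight values are illustrative assumptions, not part of the evaluation:

```python
def miss_weight(speaking: bool, visible: bool) -> float:
    """Cost of missing one person, depending on the modalities in which
    that person was evident. Weight values are arbitrary placeholders."""
    if speaking and visible:
        return 2.0   # hardest error: the person was both heard and seen
    if speaking or visible:
        return 1.0
    return 0.0       # neither speaks nor appears: nothing to miss

def weighted_misses(persons):
    """persons: iterable of (speaking, visible, missed) triples.
    Returns the total weighted miss cost over all missed persons."""
    return sum(miss_weight(s, v) for s, v, missed in persons if missed)
```

With such a weighting, a system that misses a person who both speaks and appears on screen is penalized twice as much as one that misses a voice-only speaker.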
0:20:55 For that, we want to improve the characterization of the differences between shows, using deeper linguistic and speech analysis of the videos, and to model what the same speaker does in different shows, because it is not exactly the same thing to speak in parliament and to be in a debate with other people in this kind of social show. That is what we hope for.
0:21:31 So thank you, and if you have a question...
0:21:35 [question from the audience]
0:21:59 All the statistics I have presented here are over all the data, because all of this will be in the training part for the official campaign; that is why we did the analysis on all the data.
0:22:27 It is a choice, because for now we don't have the split between training and test data for the official campaign.
0:22:37 [question from the audience]
0:23:04 Yes, the continuous video is presented to the systems; they can use all of the videos. It is only the annotation, for the evaluation, that is based on key frames: the evaluation uses the key frames, but the participants must process all the videos.
0:23:23 That is just because, as I said, it is very expensive to do this kind of annotation; that is why we restrict the evaluation this way.
0:23:34 And that is why we indicate the beginning and the end of the appearance of each person, so that the systems also have something like a diarization, even if it is not exactly a diarization for the videos. But it is always the problem of the expensive part of doing this kind of annotation.
0:24:02 Yes: for the speech, for the question 'who is speaking', they have to answer for the whole video; but for the visible part they have to focus on the key frames.
0:24:19 But it is clear that at test time they don't know in advance where the key frames are, nor where the people are in the video.
0:24:43 [question from the audience]
0:24:56 For the transcripts: they have to transcribe all the videos themselves. At the beginning the systems just have access to the audio, so they have to use their own systems to transcribe the videos.
0:25:15 For some subtasks we provide references, but for the main task they just have what is given at the beginning, and they use the technologies they want.
0:25:32 So some of them used their own transcription of the speech, while others preferred to do just a diarization; some worked a lot on the unsupervised condition, and others used a lot of face models or voice models.
0:26:08 [question from the audience]
0:26:14 No, I think they know the names of the shows, so for example they can learn a lot about the presenters, because for the news show it is always the same presenter.
0:26:27 They know all the training shows, but they don't always know the people, for example the guests, that will appear.