0:00:26so let's talk or more complex
0:00:53right so that we present you do that
0:00:59so the goal of the challenge is really
0:01:01to find people in a
0:01:03multimodal context
0:01:08what we mean by multimodal context is that the participants can use the speech
0:01:14and the image
0:01:15to recognize people
0:01:18it is a french collaboration the corpus is provided by ELDA
0:01:24the evaluation is organized by LNE
0:01:28and three research consortia participated in the challenge this presentation
0:01:34is a presentation of the evaluation it is not a presentation of the systems that participated
0:01:39in the challenge
0:01:41and if you want more details about the systems
0:01:47please go to interspeech where you might see some
0:01:52of them
0:01:55so about my presentation i will first present the tasks then the corpus
0:02:01the metrics we used
0:02:03and some results from the dry run that we conducted
0:02:08and then some perspectives and some conclusions
0:02:13so the main task is to answer the question who is present in the videos
0:02:19that means who is visible or
0:02:22who is speaking in the videos
0:02:25two conditions are proposed the first is a supervised condition which means that the participants can use
0:02:34a priori models for the face or for the speech of the different
0:02:41persons that might be in the videos
0:02:45on the other side you have an unsupervised condition where the participants are allowed to
0:02:52use only the test videos to find the people
0:02:59this is the main task
0:03:02and we also have other sub-tasks that
0:03:06are more precise in the question that means they answer the questions who
0:03:11is speaking who is visible in the video what names are cited
0:03:15in the speech
0:03:18what names are displayed on the screen
0:03:21to answer these questions there are two conditions too a multimodal condition where people can
0:03:29use all the modalities to answer the question and a monomodal condition where for who
0:03:36is speaking they can only use the speech
0:03:40and for who is visible they can only use
0:03:43the video that is the image
0:03:50we assume that to answer these questions some technologies are
0:03:56necessary and so we propose that
0:03:59we also
0:04:00assess speaker diarization speech transcription head detection and segmentation
0:04:09overlaid text detection and segmentation
0:04:11and the optical character recognition for the text on screen
0:04:18so about the schedule as i said a dry run
0:04:22was conducted this year and the first and second official campaigns will be in
0:04:29two thousand thirteen and two thousand fourteen
0:04:36so about the shows you have seven different shows that are
0:04:43annotated in the corpus
0:04:45and there are different episodes of the same show so we assume that some
0:04:51people for example the presenters are present in multiple
0:04:57episodes and in different shows
0:05:02we worked with different kinds of shows like news shows or political debates
0:05:09you have the parliament questions to the government sessions too
0:05:13and celebrity news shows
0:05:17we chose these kinds of shows because they are very different and varied
0:05:24some of them are more difficult than others because of the kind of speech
0:05:30for example for the celebrity news show you have
0:05:35more spontaneous speech and for the parliament questions to the government it is always
0:05:43read speech so the corpus mixes the speech conditions
0:05:50all these
0:05:52shows come from two different channels
0:05:55and at the end of the project there will be sixty hours of
0:06:00videos for
0:06:02the database i imagine that you don't know what these shows look like
0:06:06so i propose to show you a few samples so that you have an idea of
0:06:13of the
0:06:14the D
0:06:55i think yeah
0:08:03yeah so
0:08:06the corpus was annotated with visual annotations
0:08:11i mean from the image point of view
0:08:15on average we annotate one image every ten seconds
0:08:21we delimit the heads
0:08:27the heads are described for example to say whether there is
0:08:32an occlusion of the head or not or
0:08:40something like that
0:08:43the person is named
0:08:45the text objects are delimited by a rectangle and transcribed
0:08:52and in all the transcribed text the person names
0:08:59are annotated
0:09:02and so as to have something which
0:09:05is closer to a diarization the appearance and disappearance of all the heads
0:09:12and all the text
0:09:14are given so as
0:09:17to know the full duration of their appearance
0:09:24for the speech annotation we have a standard transcription of all the videos
0:09:29with the speaker turn segmentation and the music segmentation too
0:09:33and a rich speech transcription that includes all the disfluencies
0:09:40all the
0:09:42and all that the world like you're a
0:09:45i'm french you know some not station but a more like to alright so all
0:09:52i think so and all this kind of expression that might be useful to recognise
0:09:59the people
0:10:00and we name all the persons that are speaking and all
0:10:07the names of persons that are cited in the speech transcription
0:10:13are annotated too here you have an example and that's
0:10:20what i want the user name so it's at the beginning
0:10:25so the main metric we use is the estimated global error rate it is based
0:10:32on misses and false alarms but we also want to take into account the fact that
0:10:37the system has found the correct number of people who are present in
0:10:42the video that's why we include a confusion that means that if you have
0:10:48the right number of people but you make a mistake on the name
0:10:54of the people it is a less
0:10:58important error than a miss or a false alarm that's why
0:11:05we use this metric for the main task and for the
0:11:10questions who is speaking who is visible
0:11:13and what names are displayed
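A minimal sketch of how an error rate of this kind could be computed, assuming that each annotated key frame gives a reference set and a hypothesis set of person names. The pairing rule (a missed name and a spurious name in the same frame count as one confusion) and the default `confusion_weight` of 0.5 are assumptions made for illustration; the talk only says that a confusion is a less important error than a miss or a false alarm. The function names are hypothetical.

```python
from typing import Iterable, Set, Tuple


def frame_errors(reference: Set[str], hypothesis: Set[str]) -> Tuple[int, int, int]:
    """Return (confusions, misses, false_alarms) for one annotated frame."""
    missed = len(reference - hypothesis)    # reference names that were not returned
    spurious = len(hypothesis - reference)  # returned names absent from the reference
    # Assumption: a missed name and a spurious name in the same frame are
    # paired and counted as one confusion (right count, wrong name).
    confusions = min(missed, spurious)
    return confusions, missed - confusions, spurious - confusions


def estimated_global_error_rate(
    frames: Iterable[Tuple[Set[str], Set[str]]],
    confusion_weight: float = 0.5,  # assumed value, not taken from the talk
) -> float:
    """Weighted errors divided by the number of persons in the reference."""
    total = 0
    errors = 0.0
    for reference, hypothesis in frames:
        confusions, misses, false_alarms = frame_errors(set(reference), set(hypothesis))
        errors += confusion_weight * confusions + misses + false_alarms
        total += len(set(reference))
    if total == 0:
        raise ValueError("the reference contains no person to find")
    return errors / total


# Toy data, invented for the example.
frames = [
    ({"alice", "bob"}, {"alice", "carol"}),  # one confusion (bob vs carol)
    ({"dave"}, set()),                       # one miss
    ({"erin"}, {"erin", "frank"}),           # one false alarm
]
print(estimated_global_error_rate(frames))  # 0.625
```

With a weight of 1.0 the confusion would cost as much as a miss or a false alarm, which is the behaviour the speaker says the metric is meant to avoid.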
0:11:16and for what names are cited we use the slot error rate which is a
0:11:20comparison of the hypothesis and the reference intervals for the names
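The slot error rate is commonly defined as the number of substituted, deleted and inserted slots divided by the number of slots in the reference. A minimal sketch under that common definition follows; whether the challenge weights the error types differently is not stated in the talk, so equal weights are an assumption, and the function name is hypothetical.

```python
def slot_error_rate(substitutions: int, deletions: int, insertions: int,
                    reference_slots: int) -> float:
    """(substitutions + deletions + insertions) / number of reference slots."""
    if reference_slots <= 0:
        raise ValueError("the reference must contain at least one slot")
    return (substitutions + deletions + insertions) / reference_slots


# Toy counts, invented for the example: 20 cited names in the reference.
print(slot_error_rate(substitutions=2, deletions=5, insertions=1, reference_slots=20))  # 0.4
```

Because insertions are counted in the numerator, the rate can exceed 1 when a system returns many spurious names.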
0:11:27so for the dry run the dry run corpus is a very short
0:11:33corpus the goal was to see what we can do with these
0:11:39metrics on this kind of corpus and it's clear that it's not enough for
0:11:43the systems to develop something which is
0:11:46high-performance but that was not the goal of the dry run
0:11:50and what we saw here is that
0:11:53the speech duration per speaker is very short
0:11:58and the majority of the speakers speak less than twenty seconds but it
0:12:04depends on the show as you can see for each show
0:12:09and you also have people who speak
0:12:16more than
0:12:17two hundred and sixty seconds so that is the diversity of the corpus and for
0:12:24the distribution of people according to the number of key frames
0:12:29in which they appear you have the same thing some of them do not appear so
0:12:34much as you can see but usually when someone appears
0:12:40a lot he also speaks a lot and so by combining audio and visual information you
0:12:45might find who is speaking and who is present in the video
0:12:50and so if you look at
0:12:53the moments where the name is displayed or the face is visible or the
0:13:00speaker is speaking in the whole corpus you can see that for eight percent
0:13:05the person is speaking appears and has his name displayed in the
0:13:11videos at the same time
0:13:15but for example you have
0:13:18a set seventy
0:13:21percent of the people who just have their name displayed on the screen and so for
0:13:26the main task for example you must not say that these
0:13:31people are present in the video because they are neither speaking nor
0:13:37visible and this distribution
0:13:40is very different according to the kind of show for example for BFM Story
0:13:49you have
0:13:51more or less
0:13:52thirty two percent of the displayed names
0:13:57that are not useful to find the people and for LCP it is
0:14:01the contrary you are sure that if you find the name of a person
0:14:08this person is present in the video so the participants have to analyze
0:14:16this kind of thing
0:14:17it might be useful to have this kind of information
0:14:22so the
0:14:24here you have the annotations and the clues you can use
0:14:30to answer the questions
0:14:34you know i
0:14:38there are more than
0:14:41two hundred and sixty seven people
0:14:45in the dataset
0:14:47and one hundred seventy one people for the test set
0:14:52and as you can see
0:14:55there are some anonymous people that means that the annotators
0:15:01were not able to know who they are
0:15:08just by watching the video
0:15:12that's why they are called anonymous and the systems have to find that there is
0:15:16someone but they do not have to name him
0:15:21for the first results it's clear that it's a dry run so the results
0:15:27are not so good
0:15:29what we want to compare is
0:15:32here you have the results of the systems
0:15:37for the main task
0:15:38compared to the tasks who is speaking and who is visible and as you can see
0:15:45they have better results to say who is speaking than to say who is
0:15:51visible in the videos and for the main task the main problem is to
0:15:57say who is visible
0:16:01please speaking
0:16:04for speaking
0:16:07in particular we analyzed the results by comparing
0:16:12the results for the supervised multimodal condition and the supervised monomodal condition
0:16:19and as you can see there is no significant difference in the results
0:16:24between the two conditions that means that
0:16:28the information that comes from
0:16:31the other modalities was not used by the systems to improve their results
0:16:39and on the other side you have
0:16:43the results by show
0:16:47the center of the circle represents the mean performance
0:16:53the radius represents the standard deviation of the performance
0:16:57and as you can see according to the show the systems are
0:17:03more or less accurate so if we compare them
0:17:08for the yellow one the results are very stable that means that
0:17:14this show is processed consistently but regarding the green the
0:17:22dark green one even if the error is lower the variation of the performance is
0:17:29more important so that might be something that the systems have to improve
0:17:38regarding who is visible
0:17:40in the videos
0:17:42doing the same kind of analysis you can see that there is a significant
0:17:48difference between the supervised multimodal condition and the supervised monomodal condition so here the speech
0:17:56is useful and the systems have used this complementary information and here again
0:18:05you have
0:18:06the representation of the results according to the show and here again you
0:18:13have different performance and the variation of the performance across shows
0:18:17is important
0:18:19for what names are cited
0:18:22we focus here on the kinds of mistakes
0:18:27made by the systems and as you can
0:18:31see deletion is the most important
0:18:34here for all the systems that participated
0:18:37and the
0:18:39conclusion might be that
0:18:43the ability of the systems
0:18:44to detect the person names has to be improved because they
0:18:51miss a lot of names
0:18:55for what names are displayed the performance again can be improved and we focus on
0:19:00the OCR and text segmentation results
0:19:04and the OCR results are not so good but not so bad
0:19:09so again they can extract some information from the text and the segmentation is quite good
0:19:16so it's again the problem of extracting the name from the text
0:19:21that is the main problem
0:19:25so in conclusion the goal of the challenge is to find people in a multimodal
0:19:33condition in the french language the main question is who is present in the
0:19:38video but you have sub-tasks and
0:19:41several questions that can be helpful to address the main task
0:19:47and this challenge is now open to anyone who wishes to participate so
0:19:54you can join
0:19:56and for the dry run it is clear that fusing information can improve the results and
0:20:01we also observe an important variability of the performance according to the shows
0:20:07for the perspectives
0:20:09for the metrics
0:20:11we want to take into account the importance of the persons in the videos
0:20:16because for some applications in particular for clustering of videos
0:20:24the importance of a person depends on his role in the video so
0:20:30it's important to work on that and
0:20:33we want to weight
0:20:36the importance of the people according to the available modalities if
0:20:43you miss someone who is speaking and
0:20:46is visible you will make more errors than if he is just
0:20:52speaking or just visible on the screen
0:20:55for that we want to
0:20:58to improve the characterisation of the difference between scenario
0:21:02use the due to linger more speech analysis
0:21:07a more
0:21:09and this is a different size for the videos
0:21:12and dropped or more that's what the same speaker
0:21:16is in different shows because it's not exactly the same thing to speak in parliament
0:21:22and to be in a debate with other people so it's this kind of social
0:21:27shows and the time we hope for
0:21:31so thank you and if you have a question
0:21:59all the description that i have given here is for
0:22:04all the
0:22:05all the data because after that all of it will be in
0:22:12the training part for the official campaign so that's
0:22:18why we have done
0:22:20the analysis on all the data it's a
0:22:27it's a choice because we don't have but that's a now we don't have announced
0:22:32does not shoot to speech it's in doing so
0:22:37since the yes
0:23:04yeah the continuous video is presented to the systems
0:23:07they can use all the videos
0:23:10but for the annotation for the evaluation the evaluation is based on key
0:23:17frames that is only for the evaluation and for the participants it must be all
0:23:22the videos
0:23:23and it's just because it's very expensive to do this kind of
0:23:30annotation that's why we do this for the evaluation
0:23:34and that's why we indicate the beginning and the end of the appearance of the
0:23:39people so that the systems also have something
0:23:45like a diarization even if it's not exactly a diarization for the videos
0:23:48but it's always the problem
0:23:51of the expensive part of doing this kind of
0:23:55annotation
0:24:03for the speech for the question who is speaking they have to
0:24:07answer for all the video but for the visible part they have
0:24:13to focus on
0:24:16the key frames
0:24:19but it is clear that at the beginning they don't know where the key
0:24:23frames are
0:24:25they don't just so when it's the test and
0:24:28where wednesday
0:24:30where are the people in the video
0:24:56for the transcription they have to transcribe all the videos
0:25:03the systems at the beginning just have access to
0:25:09the videos
0:25:11they have to use their own system to transcribe the videos
0:25:15that's the for the set task we bss for example
0:25:22a use that you have to some reassurance after this for the main task they
0:25:27just have the videos at the beginning and
0:25:30they use the technologies they want
0:25:32so that summer yeah
0:25:35a used to transcribe it does this one
0:25:37the up on transcription
0:25:39and also prefer just doing yeah generalization and have a
0:25:46they don't for a lot on does unsupervised condition and so he was a lot
0:25:51of face models or
0:25:56the voice
0:26:14no they are given the name of the shows so for example they
0:26:19can learn a lot about the presenter because for the news show it is always the
0:26:25same presenter
0:26:27in all that they all signed the or the shows the old interest shows but
0:26:32they don't know a always the
0:26:36in fact that people for example so yeah