0:00:21 I'm waiting for the screen.
0:00:41 Hi, I'm from the University of Avignon.
0:00:45 I will talk to you about a preliminary study we did on speaker diarization of videos
0:00:50 originally uploaded to the web.
0:00:54 I will start with an introduction, then I will describe our speaker diarization system,
0:01:00 describe the database we used for this study,
0:01:04 show you some results, and draw
0:01:08 some conclusions.
0:01:10 As you may know,
0:01:12 speaker diarization is the process of finding, in an audio stream, who spoke when, with no a priori
0:01:17 information on the
0:01:19 identity of the speakers or their number.
0:01:22 It is important to note that
0:01:24 in the speaker diarization process
0:01:27 we do not do speaker identification.
0:01:32 As you also know,
0:01:34 there are two main approaches
0:01:37 for speaker diarization systems:
0:01:39 bottom-up and top-down.
0:01:41 The top-down approach is used by systems such as the LIA system, and the bottom-up approach is used
0:01:49 by systems such as the LIUM system.
0:01:52 In the top-down approach we start with no speakers and we add them
0:01:57 one by one until a stop criterion is reached,
0:02:01 and in the bottom-up approach we start with a lot of speakers and we
0:02:06 merge them until a stop criterion is reached.
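The merging loop of the bottom-up strategy can be sketched with a toy example. Everything below is a made-up illustration (1-D segment "features", an absolute-difference distance, a hypothetical stop threshold); real systems cluster GMMs of acoustic features with criteria such as BIC or CLR.

```python
# Toy sketch of the bottom-up (agglomerative) strategy: start with one
# cluster per segment and merge the closest pair until no pair is closer
# than a stop threshold.

def bottom_up(segments, stop_threshold):
    """segments: list of 1-D feature means, one per initial segment."""
    clusters = [[s] for s in segments]          # one cluster per segment

    def dist(a, b):                             # distance between cluster means
        mean = lambda c: sum(c) / len(c)
        return abs(mean(a) - mean(b))

    while len(clusters) > 1:
        # find the closest pair of clusters
        pairs = [(dist(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        d, i, j = min(pairs)
        if d > stop_threshold:                  # stop criterion reached
            break
        clusters[i] = clusters[i] + clusters[j] # merge j into i
        del clusters[j]
    return clusters

# Two speakers around 0.0 and 5.0: four segments collapse to two clusters.
print(len(bottom_up([0.1, 0.2, 5.0, 5.1], stop_threshold=1.0)))  # → 2
```

The top-down strategy described above inverts this loop: it starts from a single speaker model and adds one speaker at a time while the stop criterion allows it.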
0:02:11 The main idea of this study was
0:02:13 to test
0:02:14 our speaker diarization system and its behavior on new
0:02:21 content,
0:02:22 in a new context,
0:02:24 which is web video.
0:02:26 This system had previously been tested on
0:02:29 broadcast news,
0:02:30 with data from the french evaluation campaign
0:02:33 ESTER,
0:02:34 and on meeting data
0:02:37 from the
0:02:38 NIST evaluation campaign RT.
0:02:46 So,
0:02:48 this is the description of our system.
0:02:53 There are three main steps.
0:02:55 First we process the data with the speech/non-speech segmentation, also called
0:03:00 speech activity detection.
0:03:03 Then we have a segmentation step,
0:03:05 which assigns a speaker to
0:03:07 every segment,
0:03:08 and a re-segmentation step, which aims to refine
0:03:12 the results we have produced.
0:03:15 In the speech/non-speech detection we initialize an HMM from given GMMs,
0:03:22 we apply a Viterbi decoding, and we output a segmented file.
0:03:26 This file is the basis for the next step, which is the segmentation step.
0:03:31 In the segmentation step we initialize
0:03:33 an HMM with one speaker,
0:03:35 which will be the default speaker.
0:03:38 We try to add a speaker, and when it is added
0:03:41 we enter a loop of training and decoding.
0:03:46 We then check whether we can add a new speaker: if
0:03:50 we cannot,
0:03:50 we output our segmented file,
0:03:52 and if we can add a speaker,
0:03:55 we go back
0:03:56 to the beginning of the loop.
0:04:00 Then, finally, there is the re-segmentation step, in which we
0:04:05 generate an HMM
0:04:06 from the previous
0:04:07 segmented file
0:04:09 and, in a loop, apply
0:04:11 Viterbi decoding and MAP adaptation,
0:04:14 and we obtain our final segmentation.
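The speech activity detection step just described (an HMM initialized from GMMs and decoded with Viterbi) can be sketched as follows. The emission log-likelihoods and transition probabilities here are made-up numbers standing in for scores that a real system gets from trained GMMs; the sticky transitions favour staying in the same state, which smooths the segmentation.

```python
# Minimal sketch of speech activity detection: a two-state
# speech / non-speech HMM decoded with Viterbi.
import math

STATES = ("nonspeech", "speech")
LOG_STAY, LOG_SWITCH = math.log(0.9), math.log(0.1)

def viterbi_sad(emission_scores):
    """emission_scores: list of dicts {state: log-likelihood} per frame."""
    # initialise with the first frame (uniform prior)
    score = {s: emission_scores[0][s] for s in STATES}
    back = []
    for obs in emission_scores[1:]:
        new_score, ptr = {}, {}
        for s in STATES:
            # best predecessor for state s
            prev = max(STATES, key=lambda p: score[p] +
                       (LOG_STAY if p == s else LOG_SWITCH))
            new_score[s] = (score[prev] +
                            (LOG_STAY if prev == s else LOG_SWITCH) + obs[s])
            ptr[s] = prev
        score, back = new_score, back + [ptr]
    # backtrack the best path
    state = max(STATES, key=score.get)
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return list(reversed(path))

frames = ([{"speech": -1.0, "nonspeech": -3.0}] * 3 +   # speech-like frames
          [{"speech": -3.0, "nonspeech": -1.0}] * 3)    # non-speech-like
print(viterbi_sad(frames))   # 3 "speech" frames then 3 "nonspeech" frames
```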
0:04:20 As I said in the introduction, the main idea of this study was to test our system on
0:04:24 a new context, which is web video files.
0:04:28 The content of web video files is uncontrolled: you find different videos, such as movie trailers
0:04:34 or broadcast news,
0:04:36 recorded with different tools; for example, you can have a video recorded in a studio or with a
0:04:42 cell phone.
0:04:43 We decided to
0:04:45 use a database
0:04:47 which is divided
0:04:49 into seven categories,
0:04:51 described just after.
0:04:54 The database
0:04:56 contains almost eight hundred videos in seven categories:
0:04:59 documentary, movie trailer, cartoon, commercial, news,
0:05:03 sport, and music video.
0:05:05 In this study we left out
0:05:08 two categories:
0:05:09 sport, because we don't have
0:05:11 the video stream,
0:05:13 and music video, because it is very difficult and very particular data.
0:05:20 We manually annotated
0:05:22 a part of this corpus:
0:05:24 we annotated the
0:05:27 audio files
0:05:29 of one hundred
0:05:31 and twenty-nine video files,
0:05:34 which represent around ten hours and a half.
0:05:38 These numbers are about the annotated part
0:05:42 of the corpus.
0:05:44 The two main things that we can deduce from this table are that we
0:05:50 have the category which should be the best, the news,
0:05:56 and the one which should be the worst,
0:05:58 the movie trailer.
0:05:59 These categories should be the best and the worst
0:06:02 because the length of the speaker turns
0:06:06 for the news is very high and for the movie trailers is very low.
0:06:10 This
0:06:11 information
0:06:13 is important because, if you remember what I said just before,
0:06:17 we learn models with the data, and if we don't have enough data to learn a model with,
0:06:21 we shouldn't have
0:06:23 good results.
0:06:28 So, the results.
0:06:34 For this study we compared our system to the LIUM system.
0:06:41 The LIUM system works
0:06:44 like our system,
0:06:47 with a speech/non-speech segmentation step,
0:06:52 then a segmentation based on the BIC criterion, and a re-segmentation.
0:06:59 We tested
0:07:00 these systems on
0:07:03 different data sets:
0:07:04 the RT'09
0:07:07 data set from the NIST
0:07:09 evaluation campaign,
0:07:11 which is meeting data;
0:07:15 the ESTER two thousand eight data set
0:07:17 from the french evaluation campaign ESTER, which is broadcast news data;
0:07:22 and our annotated subset
0:07:26 of web videos, with manual and automatic speech/non-speech
0:07:29 segmentation.
0:07:31 We will see after why we did this.
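The BIC criterion mentioned for the LIUM segmentation decides whether two adjacent segments are better modeled by one Gaussian or by two. A minimal 1-D sketch (real systems use full-covariance Gaussians over MFCC features, and the penalty weight λ is tuned; all numbers below are illustrative):

```python
# Sketch of the ΔBIC test used in BIC segmentation/clustering: model two
# adjacent segments x and y with one Gaussian vs. one Gaussian each.
# With this sign convention, ΔBIC > 0 suggests two different speakers
# (keep the boundary between the segments).
import math

def _var(data):
    m = sum(data) / len(data)
    return sum((v - m) ** 2 for v in data) / len(data)

def delta_bic(x, y, lam=1.0):
    n, nx, ny = len(x) + len(y), len(x), len(y)
    # log-likelihood gain of splitting, via the (n/2) * log(variance) terms
    gain = (n / 2.0) * math.log(_var(x + y)) \
         - (nx / 2.0) * math.log(_var(x)) \
         - (ny / 2.0) * math.log(_var(y))
    penalty = lam * 0.5 * (1 + 1) * math.log(n)   # extra mean + variance
    return gain - penalty

same = [0.0, 0.1, -0.1, 0.05, -0.05]
other = [5.0, 5.1, 4.9, 5.05, 4.95]
print(delta_bic(same, other) > 0)   # → True: likely two speakers
print(delta_bic(same, same) > 0)    # → False: likely one speaker
```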
0:07:35 So these are our preliminary results.
0:07:37 The first
0:07:39 thing that we can outline
0:07:41 is that we have
0:07:43 quite good results:
0:07:45 if you remember what I showed you just before,
0:07:48 we are not so far from the state-of-the-art
0:07:51 results.
0:07:54 The second thing is that
0:07:57 we know that the LIUM system outperforms ours,
0:08:02 and you can see that on ESTER two thousand eight
0:08:05 they do two times better than us,
0:08:10 while our system,
0:08:12 on the RT'09 data,
0:08:16 got closer:
0:08:21 the same remark cannot be applied there, because they are not two times better
0:08:27 than our system.
0:08:33 Then you can see that
0:08:35 a large part of the
0:08:39 diarization error rate
0:08:41 is due to speech/non-speech segmentation errors,
0:08:44 so we tried to measure the influence of the first
0:08:50 speech/non-speech
0:08:52 detection step.
0:08:53 This is the reason why we applied our
0:08:56 system on the automatic speech/non-speech segmentation
0:09:00 and on the manual segmentation.
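The point about speech/non-speech errors can be made concrete with the usual decomposition of the diarization error rate. A frame-level sketch (the real metric is computed over time with a scoring collar and an optimal speaker mapping, both omitted here; the labels are illustrative):

```python
# Sketch of how the diarization error rate (DER) decomposes, frame by
# frame, into missed speech, false-alarm speech, and speaker confusion.
# Labels: None = non-speech, otherwise a speaker id (hypothesis speakers
# are assumed already mapped to reference ids).

def der_components(reference, hypothesis):
    miss = false_alarm = confusion = speech = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref is not None:
            speech += 1
            if hyp is None:
                miss += 1                 # speech scored as non-speech
            elif hyp != ref:
                confusion += 1            # wrong speaker
        elif hyp is not None:
            false_alarm += 1              # non-speech scored as speech
    # DER = (miss + false alarm + confusion) / total reference speech time
    return (miss + false_alarm + confusion) / speech

ref = ["A", "A", "A", None, "B", "B"]
hyp = ["A", "A", None, "A",  "B", "A"]
print(der_components(ref, hyp))   # (1 miss + 1 FA + 1 conf) / 5 = 0.6
```

A perfect speech/non-speech segmentation zeroes the miss and false-alarm terms, which is why only speaker confusion remains in that condition.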
0:09:03 The result: there is nearly no error
0:09:07 with the perfect
0:09:12 speech/non-speech segmentation.
0:09:16 We also tried to measure the behavior of the systems
0:09:20 on each category of the data set.
0:09:22 As expected, you can see that the best category is the news category,
0:09:28 and the worst category for our system is
0:09:31 the movie trailer category.
0:09:38 You can see that the LIUM system outperforms our system in nearly all
0:09:45 the categories,
0:09:46 but the ranges of the
0:09:49 scores are quite close:
0:09:52 for example, for news the minimum diarization error rate is around zero percent for each system,
0:09:58 and the maximum diarization error rate, for cartoons, is around
0:10:02 seventy-two percent
0:10:04 for both.
0:10:08 Another thing that we can deduce from this table,
0:10:14 and it's also something that we knew,
0:10:16 is that the LIUM system finds more speakers than our system.
0:10:22 But you can see,
0:10:24 when you look at the scores,
0:10:28 that the speakers found by the LIUM system
0:10:31 are not more reliable than the
0:10:34 speakers we found, even if
0:10:35 the number of speakers found
0:10:37 is higher.
0:10:44 In conclusion, this study outlines the difficulties encountered
0:10:50 by both systems
0:10:53 when a new kind of data is used.
0:10:56 It also outlines
0:10:58 that it's a very difficult database,
0:11:00 with a lot of variability between categories: high interactivity in some of them, and a variable number and duration
0:11:07 of the speaker turns.
0:11:10 There is also a lot of overlapped speech,
0:11:12 which should explain our bad results.
0:11:18 Our perspectives
0:11:20 are, first, to work only with the category where we are the best,
0:11:25 and, second,
0:11:28 the main
0:11:29 research axis will be
0:11:31 to use high-level information from the video stream to help the decision
0:11:36 on the speakers.
0:11:38 Thank you for your attention,
0:11:40 and if you have any questions…
0:11:57 Two questions. The first:
0:12:00 did you score overlapped speech?
0:12:02 No, because our system cannot handle
0:12:06 overlapped speech. Okay.
0:12:08 In the data sets marked manually, with the
0:12:12 number of speakers and average speaker turns,
0:12:14 do you know the distribution? Another important factor in diarization is that even if I have
0:12:19 five speakers, if it's dominated by two,
0:12:23 you can actually do
0:12:24 all right if
0:12:25 two speakers speak ninety percent of the time.
0:12:27 I wonder whether you had a notion, on the different categories, of how it might have been distributed.
0:12:31 We didn't really measure that,
0:12:36 but I am confident the repartition is quite equitable between all the speakers
0:12:41 for some categories;
0:12:44 there is no
0:12:48 dominant speaker.
0:12:52 It depends on the categories:
0:12:54 for news and documentaries there is a main speaker,
0:12:58 but for movie trailers, cartoons, and commercials it's not the same.
0:13:10 Do you do anything special with music? Because I can imagine there is a lot of music, for
0:13:14 example in movie trailers,
0:13:16 or it can be only music, or music in the background.
0:13:20 No, we don't use music
0:13:23 information for now.
0:13:25 It might be
0:13:26 something interesting to do.
0:13:29 And just to answer your question:
0:13:31 we don't detect the music
0:13:34 first.
0:13:36 Which means that you do not score
0:13:39 the parts that are music?
0:13:43 It depends on
0:13:44 how they are labeled by the speech/non-speech step: if the music is recognized
0:13:54 at the non-speech level,
0:13:55 it won't be scored, but if it is
0:13:57 marked as speech,
0:14:00 it will be scored.
0:14:30 Here
0:14:32 again, it depends on the categories:
0:14:34 movie trailers and cartoons
0:14:36 are very noisy,
0:14:39 whereas
0:14:40 news is
0:14:41 quite clean.