Speech transcript - SPEAKER DIARIZATION OF HETEROGENEOUS WEB VIDEO FILES: A PRELIMINARY STUDY

I'm waiting for the screen. Yeah. Hi, I'm [name unclear] from the University of Avignon, and I will talk about a preliminary study we did on speaker diarization of heterogeneous web video files. I will start with an introduction, then I will describe the [unclear] speaker diarization system, describe the database we used for this study, show you some results, and end with a conclusion and some perspectives.

Speaker diarization is the process of finding, in an audio stream, who spoke when, with no a priori information on the identity of the speakers or on their number. It is important to note that in the speaker diarization process we don't do speaker identification [unclear].

There are two approaches for speaker diarization systems: bottom-up and top-down. The top-down approach is used by a system such as the [unclear] system, and the bottom-up approach is used by a system such as the [unclear] system. In the top-down system we start with no speakers and we add them one by one until the stop criterion is reached. In the bottom-up approach we start with a lot of speakers and we merge them until the stop criterion is reached.

The main idea of this study was to test our speaker diarization system and its behavior on new content, in a new context, which is web video. This system has been tested on broadcast news data in the French evaluation campaign ESTER, and on meeting data in the NIST evaluation campaign [unclear].

This is the description of our system. There are three main steps in our process. We start with the speech/non-speech segmentation, also called speech activity detection. Then we have a segmentation step, and then a re-segmentation step, which aims to refine the results we have produced.
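The difference between the two approaches described above is mostly one of control flow, and a toy sketch may make it concrete. The code below is only an illustration under strong assumptions: each segment is reduced to a single hypothetical 1-D feature value, distance between cluster means stands in for the real model-based criteria, and `stop_distance` is an invented stop criterion. It is not the HMM/GMM machinery of either system discussed in the talk.

```python
from statistics import mean


def bottom_up(features, stop_distance):
    """Start with one cluster per segment and merge the two closest
    clusters until the closest pair is farther apart than stop_distance."""
    clusters = [[f] for f in features]
    while len(clusters) > 1:
        # Distance between every pair of cluster means.
        pairs = [(abs(mean(a) - mean(b)), i, j)
                 for i, a in enumerate(clusters)
                 for j, b in enumerate(clusters) if i < j]
        dist, i, j = min(pairs)
        if dist > stop_distance:  # stop criterion reached
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def assign(features, centroids):
    """Put every segment in the group of its nearest centroid."""
    groups = [[] for _ in centroids]
    for f in features:
        k = min(range(len(centroids)), key=lambda k: abs(f - centroids[k]))
        groups[k].append(f)
    return groups


def top_down(features, stop_distance):
    """Start with a single default cluster and add clusters one by one:
    the worst-modelled segment seeds a new cluster, until every segment
    lies within stop_distance of some centroid."""
    centroids = [mean(features)]
    for _ in range(len(features)):  # hard bound on the number of additions
        dist, seed = max((min(abs(f - c) for c in centroids), f)
                         for f in features)
        if dist <= stop_distance:  # stop criterion reached
            break
        centroids.append(seed)
        groups = assign(features, centroids)
        centroids = [mean(g) for g in groups if g]
    return assign(features, centroids)


# Hypothetical per-segment features: three well-separated "speakers".
toy = [1.0, 1.2, 0.9, 5.0, 5.3, 9.8, 10.1]
print(sorted(sorted(c) for c in bottom_up(toy, 1.5)))
print(sorted(sorted(c) for c in top_down(toy, 1.5)))
```

On this well-separated toy input both strategies end with the same three groups; on harder data they generally stop at different numbers of clusters, which is one reason the two families of systems can behave differently.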
In the speech/non-speech detection we initialize an HMM from the given GMMs, we apply a Viterbi decoding, and we output a segmentation file. This file is the basis for the next step, which is the segmentation step.

In the segmentation step we initialize an HMM with one speaker, which will be the default speaker. We try to add a speaker [unclear], and in the inner loop of training and decoding we check whether we can add a new speaker. If we can't, we have our segmentation output, and if we can add the speaker we go back to the beginning of the loop.

Finally, there is the re-segmentation step: we generate an HMM from the previous segmentation file, and then, in a loop, Viterbi decoding and [unclear] adaptation, and we have our final segmentation.

As I said in the introduction, the main idea of this study was to test our system in a new context, which is web video files. The content of web video files is uncontrolled: [unclear] videos such as movie trailers or broadcast news, [unclear]. For example, you can have a video recorded in a studio or with a cell phone.

We decided to build a database [unclear], which is divided into seven categories, described just after. [Unclear] contains [unclear] eight hundred videos in seven categories: documentary, movie trailer, cartoon, commercial, news, sport, and music video. In this study we left out two categories: sport, because we don't have the video stream, and music video, because it is very difficult and very particular.

We manually annotated a part of this corpus. We annotated the audio files of one hundred and twenty-nine video files, which represent around [unclear] hours. These numbers are about the annotated part of the corpus. The two main things that we can deduce from this table are that we have the category which should be the best, the news, at the bottom of the table, and the one which should be the worst, the movie trailer. These categories should be the best and the worst because the length of the speaker turns for the news is very high and for the movie trailer is very low. This information is very important because, if you remember what I said just before, we learn models from data, and if we don't have enough data to learn our models, we shouldn't have good results.

So, the results. For this study we compared our system to the [unclear] bottom-up system. The [unclear] bottom-up system works like our system, with the speech/non-speech segmentation, then a segmentation based on the BIC criterion, and a re-segmentation. We tested these systems on different datasets: the [unclear, possibly RT 09] dataset from the NIST evaluation campaign, which is meeting data; the ESTER 2008 dataset from the French evaluation campaign ESTER 2, which is broadcast news data; and our annotated subset of [unclear], with manual and automatic speech/non-speech segmentation. We will see afterwards why.

These are our preliminary results. The first thing that we can outline is that we have quite good results, if you remember what [unclear] said just before: we are not so far from the state-of-the-art results. The second thing is that we know that the [unclear] system outperforms ours, and you can see that on ESTER 2008 they do two times better than our system. But on the [unclear] corpus this remark cannot be applied, because they are not two times better than our system. Then you can see that a large part of the diarization error rate is due to speech/non-speech segmentation error.
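The remark that part of the diarization error rate comes from speech/non-speech segmentation is easier to follow with the metric written out. The sketch below uses the usual decomposition of the diarization error rate into missed speech, false alarm, and speaker confusion, divided by the reference speech time. It rests on simplifying assumptions that are not from the talk: labels are given per frame, `None` marks non-speech, there is no overlapped speech and no scoring collar, and the best one-to-one speaker mapping is found by brute force, which is only practical for a handful of speakers. The function name `diarization_error` and the toy labels are illustrative.

```python
from itertools import permutations


def diarization_error(reference, hypothesis):
    """Frame-level diarization error rate and its three components."""
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")
    frames = list(zip(reference, hypothesis))
    speech = sum(r is not None for r, _ in frames)
    if speech == 0:
        raise ValueError("reference contains no speech")

    # Speech/non-speech errors: these two terms vanish when the
    # speech activity detection is perfect.
    miss = sum(r is not None and h is None for r, h in frames)
    false_alarm = sum(r is None and h is not None for r, h in frames)

    # Speaker confusion: frames where both sides have speech but the
    # hypothesis speaker, under the best one-to-one mapping, is wrong.
    ref_ids = sorted({r for r, _ in frames if r is not None})
    hyp_ids = sorted({h for _, h in frames if h is not None})
    both = [(r, h) for r, h in frames if r is not None and h is not None]
    # Pad with None so a hypothesis speaker may stay unmapped.
    targets = ref_ids + [None] * len(hyp_ids)
    best_correct = 0
    for perm in permutations(targets, len(hyp_ids)):
        mapping = dict(zip(hyp_ids, perm))
        correct = sum(mapping[h] == r for r, h in both)
        best_correct = max(best_correct, correct)
    confusion = len(both) - best_correct

    return {
        "miss": miss,
        "false_alarm": false_alarm,
        "confusion": confusion,
        "der": (miss + false_alarm + confusion) / speech,
    }


# Toy example: 10 frames, two reference speakers, None = non-speech.
ref = ["A", "A", "A", "A", None, None, "B", "B", "B", "B"]
hyp = ["s1", "s1", "s1", None, "s2", None, "s2", "s2", "s1", "s1"]
print(diarization_error(ref, hyp))
```

In this decomposition, replacing the automatic speech/non-speech output with the manual one sets the first two terms to zero, so what remains is speaker confusion alone.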
So we tried to go further, to measure the influence of the first step, the speech/non-speech detection. This is the reason why we applied our systems on the automatic speech/non-speech segmentation and on the manual segmentation. The results: there is nearly no [unclear] with the perfect speech/non-speech segmentation.

We also tried to go further, to measure the influence of the system and of the data [unclear]. As expected, you can see that the best category is the news category, and the worst category for our system is the movie trailer category. You can see that the [unclear] system outperforms our system in nearly all the categories, but the ranges of the scores are quite close. For example, for news the minimum diarization error rate is around zero percent for each system, and the maximum diarization error rate for cartoons is around seventy-two percent [unclear].

Another thing that we can deduce from this table, and it is also something that we knew, is that the [unclear] system finds more speakers than our system. But when you look at the scores, you can see that the speakers found by the [unclear] system are not more reliable than the speakers we found, even if the number of speakers found [unclear].

In conclusion, this study outlines the difficulties encountered by both systems [unclear]. It also outlines that it is a very difficult database, with a lot of [unclear] between categories, high interactivity, [unclear] the number and the duration of the speaker turns, and [unclear]. This should explain why we have bad results. Our perspectives are, first, to [unclear] only the categories where we are the best, and second, the main research axis will be to use [unclear] information from the video stream to help the decision on the speakers. Thank you for your attention, and if you have any questions [unclear].

[Audience] Two questions. First, did you score overlapped speech?

[Presenter] No, because our system cannot handle overlapped speech for now.

[Audience] Okay. And [unclear] the datasets marked manually, the number of speakers and the average speaker turn: do you know the distribution? Another important factor in diarization is that, even if there are five speakers, if it is dominated by two, you can actually do all right if those speakers take ninety percent of the talk. [Unclear] on the different categories, how that might have been distributed?

[Presenter] We didn't really measure that, but [unclear] the repartition is quite [unclear] for all the speakers for some categories, but [unclear]. It depends on the categories. For news and documentary there is a main speaker, but for movie trailers, cartoons and commercials it is not the same.

[Audience] Do you do anything special with music? Because I can imagine there is a lot of music, for example in movie trailers. It can be only music, or music in the background.

[Presenter] We don't use music information for now. It might be something interesting to do. [Unclear]

[Audience] Which means that you do not score [unclear] the parts where there is music?

[Presenter] It depends on how it is labeled by the speech/non-speech step. If the music is recognized as non-speech, it won't be scored, but if it is marked as speech, it will be scored. [Unclear] Here again it depends on the categories: movie trailers and cartoons are very noisy [unclear].

Speaker Diarization

Presenter: Pierre Clement, Authors: Pierre Clement, Université d'Avignon, France; Thierry Bazillon, Aix Marseille Université, France; Corinne Fredouille, Université d'Avignon, France