0:00:15Today I'm here to present this work, the influence of transition costs
0:00:20in the segmentation stage of speaker diarization.
0:00:24I work in the Speech Technology Group at the UPM, the Universidad
0:00:29Politécnica de Madrid, here in Spain.
0:00:32So here is the outline I'm going to follow in this presentation. First of
0:00:36all, I'm going to explain the baseline system, the baseline architecture that we
0:00:41have used, and what its modules and stages are,
0:00:47focusing on the segmentation and clustering stage, which is
0:00:52basically the diarization stage,
0:00:55which is where we have been making changes and analysing results.
0:01:02And what we
0:01:05wanted in this work was to analyze the effect of
0:01:10the parameters involved in the duration of the speaker turns.
0:01:15I mean,
0:01:17a turn can be very long or very short: which parameters are involved in this decision,
0:01:22and how much are they affecting our system?
0:01:28Then I will present the experiments we have done with the development
0:01:34dataset, all the analysis, and some conclusions.
0:01:39Here is the baseline system architecture. We work with multiple
0:01:45signals from multiple microphones, so that's our input, which we first filter to reduce noise,
0:01:53and then we extract from these various signals the delay between them, that is, the
0:01:59time delay of arrival, and we use this information for two things.
0:02:05One is the acoustic fusion, which is, as you
0:02:09probably know,
0:02:10to create a single signal just by summing all
0:02:14the signals from the different microphones, delayed one with respect to
0:02:19the other,
0:02:22so that, when they are summed,
0:02:25one delayed with respect to the other, this sum gives at the end the
0:02:29voice in a single acoustic signal with less noise.
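The acoustic fusion described here is delay-and-sum beamforming. A minimal sketch follows, assuming whole-sample delays that are already known; the function name `delay_and_sum` and the toy signals are illustrative and do not come from the talk, where the delays would be the estimated TDOA values.

```python
def delay_and_sum(channels, delays):
    """Average several microphone channels after undoing each channel's delay.

    channels: list of equal-length lists of samples, one list per microphone.
    delays:   per-channel delay in whole samples relative to a reference
              channel (hypothetical input; in practice it would come from
              the TDOA estimation step).
    """
    n = len(channels[0])
    out = [0.0] * n
    for samples, d in zip(channels, delays):
        for t in range(n):
            src = t + d  # a channel delayed by d holds reference sample t at index t + d
            if 0 <= src < n:
                out[t] += samples[src]
    return [v / len(channels) for v in out]


ref = [0.0, 1.0, 0.0, -1.0, 0.0, 0.0]
late = [0.0, 0.0, 1.0, 0.0, -1.0, 0.0]  # same signal, arriving one sample later
aligned = delay_and_sum([ref, late], [0, 1])
```

Once the second channel is shifted back by its one-sample delay, the two copies add coherently and the average reproduces the reference waveform, while uncorrelated noise on each microphone would tend to average out.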
0:02:36Now we use this signal to extract the cepstral features, the mel-frequency
0:02:41cepstral coefficients,
0:02:43and also to extract information about where there is speech and where there is
0:02:48not, with the voice activity detection.
0:02:52On the other hand, the delays, the TDOAs, are used
0:02:58as an input to
0:03:00the last stage, that is, the segmentation and agglomerative clustering of
0:03:04speech turns. That is the diarization stage,
0:03:09where we actually decide who is speaking and when each one speaks, and so where
0:03:14we perform the diarization.
0:03:17it's a calm
0:03:19what it doesn't matter
0:03:22That is the
0:03:23segmentation and agglomerative clustering, I mean.
0:03:27Here is a more detailed diagram of what this stage
0:03:33performs. We first start with the initialization, which is a segmentation of the signal
0:03:40that for us is uniform. For the baseline system we use this uniform
0:03:46segmentation into N clusters; for us it is
0:03:48sixteen. It could be more, but we use sixteen because then we
0:03:53are iteratively going to
0:03:55reduce the number by merging the hypothesized clusters.
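One simple reading of the uniform initialization is to cut the recording into consecutive chunks of equal length, one per initial cluster. The sketch below assumes that reading; the function name and the frame count are illustrative, and the baseline system may distribute the initial segments differently.

```python
def uniform_initialization(num_frames, num_clusters=16):
    """Return one cluster label per frame, using equal consecutive chunks."""
    return [
        min(t * num_clusters // num_frames, num_clusters - 1)
        for t in range(num_frames)
    ]


# Toy recording of 32 frames split into the 16 initial clusters of the talk.
labels = uniform_initialization(num_frames=32, num_clusters=16)
```

With 32 frames and 16 clusters each cluster starts with two consecutive frames; the later re-segmentation and merging steps are what turn this arbitrary start into speaker clusters.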
0:04:04After this initial segmentation, we start the segmentation and training,
0:04:10and during this stage, with this segmentation we create models that
0:04:15we then use to re-segment the signal,
0:04:20and at the end we will have a better segmentation according to
0:04:26the speaker models we have trained.
0:04:29After the re-segmentation, we compare these
0:04:34clusters one to one, in pairs, and we merge those that are most
0:04:40similar.
0:04:42We use for that the Bayesian information criterion,
0:04:47as is used
0:04:48in general.
0:04:50We do all this iteratively until there are no
0:04:55more clusters to merge, because
0:04:57the system considers that there are no more clusters that are
0:05:00similar enough to be merged, and
0:05:03it finishes.
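The merge loop described here can be sketched as follows. This is only a skeleton under stated assumptions: `merge_score` is a hypothetical stand-in for the BIC-based comparison (positive meaning "merge"), the toy score based on cluster means is invented for illustration, and the re-segmentation and re-training between merges are left as a comment.

```python
def agglomerate(clusters, merge_score):
    """Merge the best-scoring pair until no pair scores above zero.

    clusters:    dict mapping a cluster name to its list of values.
    merge_score: function(a, b) -> float; positive means the pair is
                 similar enough to merge (stand-in for a BIC-style test).
    """
    clusters = {name: list(items) for name, items in clusters.items()}
    while len(clusters) > 1:
        names = sorted(clusters)
        pairs = [
            (merge_score(clusters[a], clusters[b]), a, b)
            for i, a in enumerate(names)
            for b in names[i + 1:]
        ]
        best, a, b = max(pairs)
        if best <= 0:
            break  # no remaining pair is similar enough: stop
        clusters[a] = clusters[a] + clusters.pop(b)
        # In the real system, re-segmentation and model re-training go here.
    return clusters


def toy_score(a, b):
    """Illustrative score: positive when the two cluster means are within 1.0."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return 1.0 - abs(mean_a - mean_b)


result = agglomerate(
    {"c0": [0.0, 0.2], "c1": [0.1, 0.3], "c2": [5.0, 5.2]}, toy_score
)
```

In this toy run the two nearby clusters are merged and the distant one is kept, so the loop stops with two clusters, mirroring the stopping rule of "no more clusters similar enough to be merged".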
0:05:05that's features
0:05:10Well, here
0:05:12we are getting to the point. This is a diagram where you can see
0:05:17all the parameters that are actually involved in the
0:05:21duration of the speaker turns. We have one parameter called minimum duration;
0:05:28for us it is two fifty frames, two hundred fifty frames of ten milliseconds,
0:05:32so around two point five seconds. That is like saying,
0:05:36okay, I want my speaker turns to be at least two point
0:05:42five seconds long, because if they are shorter, well, I'm not that
0:05:46interested in them, so let's force the system to stay at least
0:05:52two fifty frames.
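The arithmetic behind the minimum duration can be checked directly. The sketch assumes a 10 ms frame step, which is what makes 250 frames come out at about 2.5 seconds; the value of 200 frames is the alternative setting tried later in the talk.

```python
FRAME_SHIFT_MS = 10  # assumed frame step, consistent with 250 frames being about 2.5 s


def frames_to_seconds(num_frames, frame_shift_ms=FRAME_SHIFT_MS):
    """Convert a duration in frames to seconds."""
    return num_frames * frame_shift_ms / 1000


baseline = frames_to_seconds(250)  # baseline minimum duration
shorter = frames_to_seconds(200)   # alternative minimum duration
```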
0:05:53Then there are these parameters alpha and beta. We wanted to comment on them because
0:05:58they also have an influence on the duration of
0:06:04the speaker turns.
0:06:07I mean, alpha is the probability you would apply to
0:06:14remaining in the same speaker and not moving to another, and beta is for
0:06:19moving from one cluster to another speaker. So we
0:06:23set them to one.
0:06:24Both alpha and beta are set to one, that is,
0:06:28they actually have no influence on
0:06:34the final decision.
0:06:35But we also, in the ICSI system, which is the one used in these
0:06:39experiments, have this one slash M, and we
0:06:44discovered it was there from the ICSI system. And
0:06:49the thing for us is that it is not just that this parameter
0:06:54has some influence in the decision of moving from one speaker to
0:06:58another, but that this M is the number of active clusters.
0:07:04Our system iteratively reduces this number of clusters. It goes from sixteen,
0:07:10for us, because that is what we used,
0:07:12down to the final number, which could be two
0:07:16or one.
0:07:17Well, one would be just the limit, but two, three, four; and in each iteration
0:07:22it is going to change.
0:07:26Here you see when the system actually decides to change. First we have
0:07:33the likelihood, in the basic equation,
0:07:38the likelihood of some frames
0:07:41belonging to one cluster; on the other side, the likelihood of the same frames
0:07:46belonging to another cluster.
0:07:48Those two are related to the data we have, so we are okay
0:07:54with this. But this other parameter, which we have called logarithm of K,
0:08:00is independent of the data. In
0:08:06the diagram of the previous slide, it is what we presented as
0:08:11one slash M.
0:08:14It is independent, it has nothing to do with the data, and
0:08:18actually if K is lower than one it is penalizing changes,
0:08:24because the logarithm of
0:08:26zero point whatever is negative,
0:08:29and if it is higher than
0:08:32one it favours these changes.
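One way to write the decision described here is shown below. The exact form on the slide is not in the transcript, so the symbols are assumptions: $X$ is the block of frames being considered, $\theta_i$ is the model of the current cluster, $\theta_j$ is the model of the candidate cluster, and $K$ is the transition weight.

```latex
\[
  \underbrace{\log p(X \mid \theta_j)}_{\text{same frames, other cluster}}
  \;+\; \log K
  \;>\;
  \underbrace{\log p(X \mid \theta_i)}_{\text{frames in current cluster}}
  \quad\Longrightarrow\quad \text{change cluster}
\]
\[
  K < 1 \Rightarrow \log K < 0 \ (\text{changes penalized}), \qquad
  K = 1 \Rightarrow \log K = 0 \ (\text{data only}), \qquad
  K > 1 \Rightarrow \log K > 0 \ (\text{changes favoured})
\]
```

Only the two likelihood terms depend on the data; the $\log K$ term shifts the threshold by the same amount for every decision.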
0:08:36And as I said,
0:08:39as we have that K is one slash M
0:08:43and M decreases in every iteration, K also increases in every iteration, but it
0:08:48is still always going to be lower than one, so in our baseline system
0:08:52we are always penalizing changes, even though we really don't know if
0:08:57we want to make
0:08:59shorter turns or longer ones; we are doing it anyway.
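The behaviour of the baseline weight can be tabulated directly. The sketch assumes natural logarithms and that M drops by one per iteration from 16 down to 2; the real final number of clusters depends on the recording, and the function name is illustrative.

```python
import math


def baseline_log_weights(start_clusters=16, final_clusters=2):
    """Return (M, log(1/M)) for each clustering iteration as M shrinks."""
    return [
        (m, math.log(1.0 / m))
        for m in range(start_clusters, final_clusters - 1, -1)
    ]


schedule = baseline_log_weights()
for m, log_k in schedule:
    print(f"M={m:2d}  K=1/M={1.0 / m:.3f}  log K={log_k:.3f}")
```

Every value of log K in the table is negative, and it moves towards zero as clusters are merged, so the penalty on speaker changes is always present but weakens over the iterations.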
0:09:03And we see that if we have a lower number of speakers, because
0:09:08we decrease from sixteen, if we have a lower number of speakers
0:09:13we really have a higher probability of having changes,
0:09:19and so more of these transitions.
0:09:22So we thought, well, let's set this to a constant;
0:09:26maybe it works fine, and with this we remove this variability and take the decision
0:09:32based only on the data. And also,
0:09:35of course, as we decided to do this experiment, we decided also
0:09:39to say, okay,
0:09:40let's set this K to a fixed value,
0:09:43negative maybe, but what we actually wanted to look at is whether we could
0:09:49favour, that is,
0:09:51this transition weight is now fixed, so it doesn't change over iterations,
0:09:56but also
0:09:58maybe a positive value, so we would favour transitions, speaker changes.
0:10:06Those are the experiments.
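The effect of a fixed transition weight can be illustrated with a small Viterbi decoder. This is a toy sketch, not the system from the talk: it has two clusters, no minimum-duration constraint, staying in a cluster costs nothing, and moving to another cluster adds log K. The per-frame scores are invented for illustration.

```python
import math


def viterbi_labels(frame_loglik, log_k):
    """Best cluster sequence when each cluster change adds log_k to the score.

    frame_loglik[t][c] is the log-likelihood of frame t under cluster c.
    """
    num_clusters = len(frame_loglik[0])
    score = list(frame_loglik[0])
    back = []
    for obs in frame_loglik[1:]:
        new_score, pointers = [], []
        for c in range(num_clusters):
            best_prev = max(
                range(num_clusters),
                key=lambda p: score[p] + (0.0 if p == c else log_k),
            )
            trans = 0.0 if best_prev == c else log_k
            new_score.append(score[best_prev] + trans + obs[c])
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)
    state = max(range(num_clusters), key=lambda c: score[c])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


def count_changes(path):
    """Number of positions where the cluster label changes."""
    return sum(1 for a, b in zip(path, path[1:]) if a != b)


# Cluster 0 fits most frames; cluster 1 fits frame 4 only slightly better.
toy = [[-1.0, -3.0]] * 4 + [[-2.0, -1.5]] + [[-1.0, -3.0]] * 3
neutral = viterbi_labels(toy, math.log(1.0))    # K = 1: the data alone decides
penalised = viterbi_labels(toy, math.log(0.5))  # K = 1/2: each change is penalized
```

With K equal to one the decoder follows the data and switches to cluster 1 for the single ambiguous frame; with K equal to one half the small likelihood gain no longer pays for two changes, so the turn is kept whole.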
0:10:07Here is the database I used.
0:10:12we have the development set that is
0:10:15probably switch task is to evaluate somebody eight meetings
0:10:18from yes
0:10:19you see two thousand two thousand and five two thousand six and seven
0:10:27and we have used that as the development dataset, and then the test set,
0:10:31which is RT-09, from
0:10:33come on
0:10:35well that the element that the this it is the from these results presented here
0:10:39i the two thousand nine
0:10:43Here are all the experiments we have done to analyze,
0:10:47to study, the effect of these
0:10:49parameters. We wanted to check the effect on the system when we have
0:10:55this K
0:10:56as a constant weight,
0:10:59well i applied to taxis you have
0:11:03and we wanted also to tasty to evaluate its influence a
0:11:10we if we are also taking into account the minimum duration parameter and talk you
0:11:16because well also of them are actually influencing parameter data duration of the speaker time
0:11:23we used widely work in the baseline if i two fifty frames so there there's
0:11:29the baseline which is the flat line
0:11:32it's this is only after a score of course because the transition weighting the baseline
0:11:36is it to one it's last m so you change to about the process
0:11:41And then we have all these other
0:11:44experiments here. The transition weight can be, well, if it's one, I
0:11:49want you to note that when the transition weight is one, it's like a constant:
0:11:54its effect, the logarithm of one, is zero, so
0:11:57there is no effect at all, it is only the data.
0:12:00And if you go very low, you are imposing that changes are
0:12:05very few, actually.
0:12:08I have also
0:12:09put the value for the transition weight equal to zero.
0:12:15It's like fifty,
0:12:16and a vertical line would be there. The error
0:12:24was very high, because this is forbidding transitions, so the system
0:12:28at the end assigns all the recording to one speaker, which obviously gives
0:12:34a very high error rate.
0:12:38Then we saw, okay, with minimum duration equal to two hundred we actually have
0:12:44a very stable section, well, a section with low error rate, a very stable section
0:12:49around a transition weight of one,
0:12:52and with a lower error rate than the baseline, so
0:12:57maybe it's good to take this into consideration.
0:13:02So let's see
0:13:03what happens at the end,
0:13:07what happens in the end
0:13:08if we fix
0:13:10this, with the test set. We chose
0:13:12three points. Of all those points that we have checked, that we have evaluated
0:13:17with the development dataset, where
0:13:19we were better than the baseline, we chose three:
0:13:22one, two, three.
0:13:23We also computed the diarization for
0:13:30a transition weight equal to
0:13:33one slash M, which is the baseline, but with a minimum duration of
0:13:37two hundred,
0:13:39so we could actually compare
0:13:42the improvement due to the transition weight variation and due to the minimum duration
0:13:48separately, because the baseline uses minimum duration two fifty.
0:13:54I liked very much the idea of setting it constant, because, well,
0:14:00it is good to see that
0:14:02the parameter is independent, and then, if we can set it constant and have better results,
0:14:06or at least, if not the best results, at least good enough,
0:14:11why not?
0:14:12It is something less to train for future experiments.
0:14:16The problem is, actually, with the test set it didn't go very well.
0:14:22Not by very much, but, well, maybe if we compute the average of the two error
0:14:28rates it's good, but
0:14:30it was worse.
0:14:32What was better, we saw,
0:14:36were the results for the
0:14:39transition weight equal to three, which is actually favouring changes
0:14:44of speaker,
0:14:45and reducing the minimum duration of the speaker turn.
0:14:52Conclusions. I think I have gone through these more or less during
0:14:57the presentation; this is more like a summary. What I think is that this transition
0:15:03weight, I don't know if we discovered it, because
0:15:06it was previously there, it came from ICSI, well, maybe some of you
0:15:14have worked with it.
0:15:18For us, we discovered that very small changes can affect the
0:15:24accuracy very much, and that's why I would like at the beginning to have it set constant.
0:15:28But even if you don't want to set it constant, at least you have to know
0:15:31that it exists. If you want to change the duration, or to
0:15:37work with the duration of the speaker turns, it's important to
0:15:43run experiments
0:15:46with both the transition weight and the minimum duration, because, well, a value of three
0:15:52for this
0:15:54actually got
0:15:56better results.
0:15:57For us that is good, but the main thing we learned from this is that
0:16:05if the variability of the speaker turns is very high,
0:16:09or can be very high, you must
0:16:13try to take it into account. Maybe a constant value with this technique,
0:16:21set to one, is the best option, so you can
0:16:24make the system more robust for future experiments.
0:16:30Well, that's more or less it, I think.
0:16:39so we then proposed
0:16:51thanks to multiple english
0:16:55first of all
0:16:57we look for this so that it's much platoons good solution from the whole circle
0:17:05two six should also and so each time constant
0:17:15so smooth the lasted a okay a sycamore to all the weights
0:17:22we should
0:17:25the phone but it's very important to train them
0:17:30the show a high constant
0:17:34i know not so the remote were used for training the transition probabilities
0:17:40in rooms do not want to work with them or whatever how to cope with
0:17:47this remote is
0:17:50and it's as much as the solution
0:17:55this transition
0:17:58the motivation and the results
0:18:04i dunno why the snow
0:18:06okay well
0:18:18i don't use the word
0:18:21those in differences
0:18:24all rates were all the routes two goals in the logo to go all this
0:18:33as a constant
0:18:35it's a cost and is
0:18:38one two three doesn't matter at all
0:18:41so in quantum o one with the home and speaker of the
0:18:47because why you try
0:18:56If it is three, you know, it is a constant value, but it is a different
0:19:01number, and the decision is taken when this inequality is fulfilled. I mean,
0:19:07you have this inequality,
0:19:08like, okay, the likelihood that these frames belong to
0:19:14this cluster, and the likelihood that these same frames belong to another cluster, completely different,
0:19:20and then K is used. If it's high, you are forcing, okay,
0:19:25a change of cluster; if it's very low,
0:19:29you forbid them
0:19:30to go to another cluster.
0:19:32That's why it's a variable: you favour more the transitions, the changes, or you penalise them.
0:19:51so why i of course there is also probably
0:19:56you transition words so
0:19:59we can use
0:20:00okay for volume of english could be thing
0:20:04in there as well
0:20:05first for ratio between the core model on the
0:20:12moreover maybe
0:20:13become the new speaker
0:20:15so it's
0:20:17so do you think is sort of threshold are just
0:20:21it would be dependent on the task of the database just have one still because
0:20:27i haven't actually an take a nice okay i have a right
0:20:32That's why I think it is better for the system to be more robust,
0:20:37to be used in future tasks or, you know, other databases, and
0:20:43well it is the speaker out of the rights and in different meetings before and
0:20:47databases that have a slightly longer duration maybe i speak and a lot at all
0:20:53their interface the on a sorta
0:20:56and if you are in that room with four people just don't well it depends
0:21:01on the basically see that that's why i tied states yes okay if a if
0:21:07you can that's a similar results yes have you
0:21:12is a time you don't have to train and that's always
0:21:18if you it you have a similar result or something in that this would to
0:21:23know what you have you will have a less work to do you know used
0:21:28to let that the c sent one you that bayes and unique the menu or
0:21:36get rid of this problem, right? And also, because this is like a
0:21:44preliminary work, I would like to maybe use this,
0:21:49if I
0:21:50somehow could. I really don't know, I don't have any clear idea of
0:21:55what to do,
0:21:58but when I get a good result,
0:22:01if I somehow had an idea of how long the speaker turns are
0:22:05going to be,
0:22:06or how many speakers, or maybe if I had some information about the role of
0:22:10the speakers in the room, that could
0:22:14not would be used to i think smiling at that is aligned and that that's
0:22:19all a lot
0:22:21actually staying i think this kind of the probability of this is a low enough
0:22:26to these one or something some way of extracting this information in
0:22:32unsupervised diarization could be tricky, but still I think then you could
0:22:38tune this parameter to get better results.
0:22:44but not
0:22:48show you questions