0:00:06yeah we'll come back to the session
0:00:08so now let's
0:00:10and not only from my seat deployable on recent
0:00:13so now we change the topic today to speaker diarization
0:00:17and uh
0:00:18now in speaker diarization
0:00:20one of the important things is to guess the number
0:00:25let me give you that
0:00:26we have a speaker
0:00:27and each one
0:00:29you need it
0:00:30do the segmentation
0:00:35and uh so we have four people as in the first people is on the
0:00:39online diarization of telephone conversations
0:00:43uh presented by all three penthouse
0:00:46i know
0:00:47giving him
0:00:49the topic of the presentation is online diarization of telephone conversations
0:00:54yeah i will begin by presenting the speaker
0:00:57diarization problem
0:00:59and after that i will
0:01:01talk about online speaker diarization
0:01:04and give a short overview of current speaker diarization systems
0:01:10i will then present the suggested online speaker diarization system
0:01:14including description derivation time complexity and performance
0:01:19and i will
0:01:21of course
0:01:22the conclusion
0:01:23the task of a speaker diarization system is to assign temporal segments of speech
0:01:28among the
0:01:29participants in a conversation
0:01:32speaker diarization basically attempts
0:01:36to segment and cluster a conversation
0:01:39such that
0:01:40as we see here on the left is a manual diarization of a conversation done
0:01:44by a human listener
0:01:46and on the right
0:01:48an automatic diarization
0:01:49exhibited by
0:01:51the suggested speaker diarization system
0:01:57state of the art speaker diarization systems operate in an offline manner
0:02:01that is
0:02:03conversation samples are
0:02:05gathered until the conversation ends
0:02:08followed by an application of the diarization system
0:02:12for some applications such as forensics or
0:02:15speech recognition
0:02:17online diarization could be beneficial
0:02:19that is if
0:02:20we want to
0:02:21apply some automatic speaker recognition system
0:02:24we would uh
0:02:26be able to see the
0:02:27diarization of the conversation until the point
0:02:30where we want to apply it
0:02:33online or semi online diarization can be achieved by removing
0:02:36or minimising the size of the
0:02:39uh buffer
0:02:42however this
0:02:43incurs some
0:02:45difficulty to the system because
0:02:47the amount of data is reduced
0:02:53most of the offline diarization systems operate in a two stage uh process
0:02:57first the audio stream is
0:03:00over segmented by some change detection algorithm
0:03:05and then an agglomerative
0:03:07hierarchical clustering
0:03:09algorithm is applied
0:03:11in which
0:03:12segments are merged
0:03:14until some termination conditions are met
0:03:16generally the number of the
0:03:18final speakers in the conversation
0:03:23some recent approaches in uh offline diarization system
0:03:26include gmmubm
0:03:28figure modelling
0:03:30speaker identification clustering
0:03:32and the fusion of several systems with several
0:03:35feature sets
0:03:37in order to apply
0:03:38diarization
0:03:41online speaker diarization systems encountered in the literature include
0:03:46online gmm learning
0:03:48and some novelty detection algorithms applied
0:03:51to detecting when a new speaker is appearing in a conversation
0:03:56and uh gmm ubm
0:03:58schemes
0:04:02most of the
0:04:03state of the art diarization systems
0:04:05online and offline
0:04:07encountered in the literature require some
0:04:10offline training of background channel or gender
0:04:14models in order to apply
0:04:16the diarization algorithms
0:04:18some require several sets of features
0:04:21and
0:04:23practically all require a large amount of
0:04:26computation power
0:04:30the suggested online diarization system operates in a two stage process
0:04:35an unsupervised algorithm is applied
0:04:38over an initial training segment
0:04:41of the conversation
0:04:43followed by
0:04:44the use of the models generated in the first stage in order to produce
0:04:49uh
0:04:50the segmentation of the conversation
0:04:52on demand
0:04:55that is
0:04:58a on
0:04:59the samples are entered into the
0:05:01preprocessing stage
0:05:03feature extraction
0:05:05uh into the buffer
0:05:06which incorporates the uh initial training segments
0:05:10diarization is applied only on the initial training segment
0:05:13and models are generated from the initial
0:05:17training segment
0:05:19once the models are available we could
0:05:21a apply or perform segmentation of the conversation
0:05:25based on these
0:05:26initial models
0:05:29however a major assumption
0:05:31is that
0:05:32all of the speakers in the conversation must participate in this initial training segment
0:05:37or else
0:05:38the
0:05:39model for these speakers will not
0:05:41be available
0:05:42for the rest of the segmentation process
0:05:46the first stage of diarization
0:05:48is to provide a
0:05:51telephone conversation diarization over the initial training segment
0:05:55in which
0:05:56the samples in the initial training segment are preprocessed
0:05:59feature extraction
0:06:01is applied on the initial training segment
0:06:03followed by an initial assignment algorithm
0:06:06that is in a conversation and let's assume a telephone conversation once we have
0:06:10we have
0:06:11successfully identified the non speech
0:06:13we still have two speakers
0:06:15to assign
0:06:16features to
0:06:18that is we
0:06:19can identify the speech
0:06:21however we must apply some kind of algorithm to assign features
0:06:25to either of the speakers
0:06:28once features are assigned to each of the speakers
0:06:30uh an iterative process of modelling
0:06:33and time series clustering
0:06:35is applied until termination conditions are met
0:06:39once termination conditions are met we can provide
0:06:43the segmentation
0:06:44modelling in this paper
0:06:46or in this work is done by som and the time series processing is done by
0:06:51some variant of the
0:06:53hidden markov model
0:06:56when we apply diarization over short segments of speech eh
0:07:00two main issues arise
0:07:03first a low model complexity is required
0:07:06because of the sparse amount of data
0:07:09and another problem is the clustering constraints that is we would not like to
0:07:17skip between speakers we would like to
0:07:21put some
0:07:21constraints on the time of
0:07:24for each
0:07:26the first problem is tackled by replacing the common gmm models
0:07:30by a self organising map
0:07:33that is we train a self organising map
0:07:35for each of the speakers
0:07:37the self organising map was uh
0:07:40presented by kohonen
0:07:42and is composed of three main stages the first uh
0:07:48the second is a rough
0:07:51and finally
0:07:52a fine tuning
0:07:54of the neurons
0:07:57into the distribution of
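A minimal sketch of training one self organising map per speaker, in the spirit of the stages mentioned above. The talk does not name the first stage or give any parameters, so the initialisation from random frames, the one-dimensional neuron grid, the function name `train_som`, and every learning rate, neighbourhood width and epoch count below are assumptions made for illustration only.

```python
import numpy as np

def train_som(features, n_neurons=4, seed=0):
    """Train a small 1-D self organising map on (T, D) feature frames.

    All hyper-parameters here are illustrative assumptions, not values
    taken from the talk.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(features, dtype=float)
    # Stage 1 (assumed): initialise the neurons from randomly chosen frames.
    codebook = x[rng.choice(len(x), size=n_neurons, replace=False)].copy()
    grid = np.arange(n_neurons)  # position of each neuron on the 1-D map
    # Stage 2 (rough training) then stage 3 (fine tuning):
    # each tuple is (epochs, learning rate, neighbourhood width).
    for epochs, lr, width in [(5, 0.5, 2.0), (20, 0.05, 0.5)]:
        for _ in range(epochs):
            for frame in x[rng.permutation(len(x))]:
                # The winner is the neuron closest to the current frame.
                winner = ((codebook - frame) ** 2).sum(axis=1).argmin()
                # Neurons near the winner on the map are pulled more strongly.
                influence = np.exp(-((grid - winner) ** 2) / (2 * width ** 2))
                codebook += lr * influence[:, None] * (frame - codebook)
    return codebook
```

The rough stage uses a wide neighbourhood and a large step so the map spreads over the data, while the fine tuning stage uses a narrow neighbourhood and a small step so each neuron settles locally.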
0:08:04once we have
0:08:06trained a model for each of the speakers in the conversation
0:08:09we would require some means to estimate
0:08:13the likelihood
0:08:15given a new feature
0:08:17we would like to
0:08:19estimate the probability or the likelihood of the feature observation given the model
0:08:25under the assumption of
0:08:27normality that is each centroid in the self organising map
0:08:31is
0:08:32the mean of a gaussian
0:08:36with a unit covariance matrix
0:08:38we could apply
0:08:40the following equation in order to estimate
0:08:42the log likelihood
0:08:44or the minus log likelihood of the
0:08:51note that we estimate the log likelihood only with a single neuron
0:08:56generally it will
0:08:58contain the most
0:08:59to um
0:09:00most of the information regarding the closest
0:09:03observation point
0:09:07the joint likelihood of
0:09:09a set of features
0:09:11could be estimated by the sum
0:09:13of the log likelihoods of the single features
0:09:15given that they are
0:09:17independent
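The equation itself was on a slide and is not in the transcript. A minimal sketch of what the description implies, assuming each neuron is the mean of a Gaussian with identity covariance and that only the nearest neuron is used: the minus log likelihood of a frame is half its squared distance to the nearest neuron plus a constant, and the joint value for a set of independent frames is the sum. The function names are hypothetical.

```python
import numpy as np

def neg_log_likelihood(features, codebook):
    """Per-frame minus log likelihood under the nearest neuron.

    features: (T, D) frames; codebook: (K, D) neuron centroids.
    Assumes a unit covariance Gaussian centred on each neuron.
    """
    features = np.atleast_2d(np.asarray(features, dtype=float))
    codebook = np.asarray(codebook, dtype=float)
    dim = features.shape[1]
    # Squared Euclidean distance from every frame to every neuron.
    sq_dist = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    nearest = sq_dist.min(axis=1)
    # -log N(x; mu, I) = 0.5 * ||x - mu||^2 + 0.5 * D * log(2 * pi)
    return 0.5 * nearest + 0.5 * dim * np.log(2 * np.pi)

def joint_neg_log_likelihood(features, codebook):
    """Joint minus log likelihood of independent frames: a plain sum."""
    return float(neg_log_likelihood(features, codebook).sum())
```

Because the constant term is the same for every speaker model, comparing models only requires the nearest-neuron distances.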
0:09:21clustering constraints are enabled using
0:09:25a hidden markov model or a minimum duration hidden markov model
0:09:30in this model each state
0:09:31is modelled using a
0:09:33hyper state that is
0:09:35in each hyper state we enforce a minimum duration of transitions from
0:09:40one state
0:09:42to another state
0:09:43and in this manner we could use the
0:09:46hidden markov model in order to enforce the minimum duration time
0:09:50for each of the
0:09:52each state in the minimum duration hidden markov model is a left to right
0:09:56hyper state
0:09:58in which the som is used
0:09:59to estimate
0:10:00the log likelihood
0:10:02or the emission probability
0:10:03for each of the observations
0:10:09i don't know
0:10:10in the
0:10:11that's right
0:10:12in the transition matrix of the hidden markov model the elements on the diagonal
0:10:17are the hyper state transition matrices
0:10:20and the elements off the diagonal are the inter hyper state transition matrices
0:10:25and this matrix is updated as
0:10:27part of the training process
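A minimal sketch of a block-structured transition matrix for a minimum duration hidden Markov model, assuming each hyper state is a left to right chain of sub-states that all share one emission model, and that only the last sub-state may stay or jump to another hyper state. The function name, the uniform split of the leaving probability and the value of `stay_prob` are hypothetical; in the talk the matrix is updated during training rather than fixed.

```python
import numpy as np

def min_duration_transitions(n_hyper, min_frames, stay_prob=0.9):
    """Build a (Q, Q) transition matrix with Q = n_hyper * min_frames.

    Diagonal blocks hold the left to right chain inside a hyper state;
    off-diagonal blocks hold the jumps between hyper states.
    """
    q = n_hyper * min_frames
    trans = np.zeros((q, q))
    for s in range(n_hyper):
        first = s * min_frames
        last = first + min_frames - 1
        # Inside a hyper state the path is forced forward one sub-state
        # per frame, which enforces the minimum duration.
        for i in range(first, last):
            trans[i, i + 1] = 1.0
        if n_hyper == 1:
            trans[last, last] = 1.0
            continue
        # Only the last sub-state may stay or leave for another hyper state.
        trans[last, last] = stay_prob
        leave = (1.0 - stay_prob) / (n_hyper - 1)
        for other in range(n_hyper):
            if other != s:
                trans[last, other * min_frames] = leave
    return trans
```

For a two speaker telephone call with a non speech class, `min_duration_transitions(3, min_frames)` would give three hyper states, each of which must be occupied for at least `min_frames` frames before a change.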
0:10:35once we have the models for each of the speakers in the hmm segmentation is applied
0:10:40using the viterbi
0:10:43time series clustering algorithm
0:10:45that is
0:10:47samples of the um
0:10:49sound wave
0:10:50are entered into a buffer
0:10:51diarization over the initial training segment
0:10:53is applied
0:10:55and a hidden markov model is generated by the diarization system
0:11:00once we have this
0:11:01hidden markov model segmentation is applied almost
0:11:04instantaneously on demand
0:11:07i would
0:11:07and it
0:11:11the viterbi algorithm
0:11:12computation complexity is in the order of Q squared T where Q is the number of states in the
0:11:19hmm and T is the number of features
0:11:21uh in the conversation
0:11:24so that
0:11:25initialisation and recursion of the viterbi algorithm could be applied online that is
0:11:31F F S with which is
0:11:32after they were
0:11:33which is the first feature
0:11:35is used to initialise
0:11:37the viterbi algorithm
0:11:39followed by the rest of the features
0:11:41from F two
0:11:43in the recursion process
0:11:45once segmentation is demanded
0:11:49termination and backtracking could be applied online
0:11:52and that is almost instantaneous
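A minimal sketch of running the Viterbi algorithm incrementally as described above: the first frame initialises the scores, every later frame performs one recursion step, and termination plus backtracking run only when a segmentation is requested. The class and method names are hypothetical, and the per-state emission log likelihoods are assumed to be computed elsewhere.

```python
import numpy as np

class OnlineViterbi:
    """Incremental Viterbi decoder with on-demand backtracking."""

    def __init__(self, log_trans, log_init):
        # log_trans[i, j] is the log probability of moving from state i to j.
        self.log_trans = np.asarray(log_trans, dtype=float)
        self.log_init = np.asarray(log_init, dtype=float)
        self.delta = None   # best log score of a path ending in each state
        self.backptr = []   # best predecessor of each state, one array per frame

    def push(self, log_emit):
        """Consume one frame of per-state emission log likelihoods."""
        log_emit = np.asarray(log_emit, dtype=float)
        if self.delta is None:
            # Initialisation with the first feature.
            self.delta = self.log_init + log_emit
            return
        # Recursion: Q * Q work per frame, hence Q squared T overall.
        scores = self.delta[:, None] + self.log_trans
        self.backptr.append(scores.argmax(axis=0))
        self.delta = scores.max(axis=0) + log_emit

    def backtrack(self):
        """Termination and backtracking over what has been seen so far."""
        if self.delta is None:
            return []
        state = int(self.delta.argmax())
        path = [state]
        for ptr in reversed(self.backptr):
            state = int(ptr[state])
            path.append(state)
        return path[::-1]
```

Calling `backtrack()` does not modify the decoder, so a segmentation can be requested at any point and decoding can then continue with further `push` calls.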
0:11:58a graph
0:11:59showing the time required
0:12:01to generate the segmentation of a conversation as a function of the conversation length
0:12:05uh is given here
0:12:07and it shows that
0:12:09for a four hundred
0:12:10second conversation for example
0:12:12only one millisecond
0:12:14of computer time is required
0:12:17and in the current implementation of the diarization system
0:12:20one second of processing time is equivalent to
0:12:23seventy three seconds of audio
0:12:27during the first stage of diarization
0:12:31for experimentation the database used was
0:12:34two thousand forty eight conversations from the nist two thousand and five speaker recognition evaluation
0:12:40recordings all two speaker conversations in four wire which were summed
0:12:45and normalised in order to generate two speaker conversations
0:12:50the features extracted were
0:12:53mfcc features twelve mfcc including
0:12:56delta features
0:12:59the entire database was first
0:13:01processed by the diarization system using all of the data available
0:13:05to produce
0:13:06twenty percent diarization error rate and six point nine percent
0:13:10speaker error rate
0:13:15the diarization error rate
0:13:17the way we measured it was to include
0:13:20all of the errors available that is
0:13:23confusion and the uh
0:13:25also i mean
0:13:26speech and nonspeech
0:13:28also overlapped speech which are segments of
0:13:32speakers speaking together
0:13:34was also considered as an error
0:13:36in the speaker error rate
0:13:39we actually eliminated
0:13:41the nonspeech in both of the segmentations
0:13:44in order to generate only the speaker confusion
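A simplified frame level sketch of the two measures as they are described here, not the official scoring procedure. It assumes one label per frame (0 for non speech, 1 and 2 for the speakers), hypothesis speaker labels already mapped to the reference speakers, and reference speech frames as the denominator of the diarization error rate; overlapped speech cannot be represented with a single label per frame. The function name is hypothetical.

```python
import numpy as np

NON_SPEECH = 0

def frame_error_rates(reference, hypothesis):
    """Return (diarization error rate, speaker error rate) over frames."""
    ref = np.asarray(reference)
    hyp = np.asarray(hypothesis)
    ref_speech = ref != NON_SPEECH
    hyp_speech = hyp != NON_SPEECH
    missed = ref_speech & ~hyp_speech        # speech scored as non speech
    false_alarm = ~ref_speech & hyp_speech   # non speech scored as speech
    both = ref_speech & hyp_speech
    confusion = both & (ref != hyp)          # wrong speaker
    # Diarization error rate: all error types together.
    der = (missed.sum() + false_alarm.sum() + confusion.sum()) / ref_speech.sum()
    # Speaker error rate: non speech removed from both segmentations,
    # so only speaker confusion remains.
    speaker_error = confusion.sum() / both.sum()
    return float(der), float(speaker_error)
```

The gap between the two numbers reported in the talk (for example twenty percent against six point nine percent) is then explained by the speech and non speech errors, which only the first measure counts.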
0:13:49the diarization error rate as a function of the initial segment length
0:13:53is shown compared to
0:13:56the optimal
0:13:58performance obtained by applying the diarization system over the entire segment
0:14:04as we can see
0:14:08say one twenty seconds or two minutes of initial training segment we receive twenty four
0:14:15percent diarization error rate
0:14:17and the
0:14:22is also presented in the
0:14:24application of a speaker error
0:14:30it seems that given
0:14:32two minutes of initial training segment the diarization error rate is
0:14:35sufficiently close
0:14:36uh to the diarization error rate obtained by applying
0:14:40the diarization over the entire conversation
0:14:43and using one hundred twenty
0:14:44seconds of the initial training segment
0:14:47we could obtain
0:14:49twenty four percent diarization error rate
0:14:51and
0:14:52ten point six
0:14:54percent speaker error rate
0:14:56while using one hundred eighty
0:14:58seconds of initial training segment
0:15:00provides twenty two point three percent diarization error rate
0:15:02and about ten percent speaker error rate
0:15:05the delta features
0:15:07did not
0:15:08provide an improved performance
0:15:14to conclude
0:15:14an online speaker diarization system
0:15:17uh was presented
0:15:19and it was shown that using as few as
0:15:21one hundred twenty seconds of conversation we could apply
0:15:25and provide
0:15:26segmentation of the conversation
0:15:28with an increase of
0:15:29four percent
0:15:30when compared to the diarization error rate obtained by applying the diarization system
0:15:35over the entire conversation
0:15:37for the
0:15:38purposes of robustness and simplicity
0:15:40gmm models were replaced by a self organising map
0:15:46um
0:15:48we assume no prior information regarding the speakers or the conversation that is we use
0:15:53no background models of any kind
0:15:56in order to apply
0:15:58diarization
0:15:59and no parameters are required
0:16:01to be trained offline
0:16:03in order to apply diarization
0:16:07thank you
0:16:14take some questions
0:16:46well as opposed to some initialisation
0:16:48uh maybe i missed what is the length of the segment
0:16:52that you feed into the som
0:16:55okay that's fine
0:16:56we've done this
0:16:57experiment using a variable length of initial training segment that is
0:17:01assuming you have
0:17:03one hundred and twenty seconds of initial training segment
0:17:06some of which belongs to speaker a some of which belongs to speaker B
0:17:10and some belongs to non speech
0:17:12that is the the the exact amount of features
0:17:15belonging to each of the speakers was not measured because it's a it's a
0:17:19function of the initialisation algorithm
0:17:22but um i i mean
0:17:24what you know
0:17:26do also
0:17:27self organising map
0:17:29is using the short segments
0:17:31from this initialisation
0:17:33and do you have a fixed
0:17:36for the for the segments or is it
0:17:38so the uh
0:17:40segmented okay
0:17:53the initial training segment
0:17:55diarization is actually applied on the initial training segment
0:17:59that is
0:18:00speech or nonspeech is
0:18:02uh detected
0:18:03nonspeech is removed and then the segments belonging to
0:18:06speech are
0:18:07distributed among the two speakers
0:18:10in the conversation
0:18:11the distribution of the features to each of the speakers is a function of the initialisation algorithm
0:18:17which is a variant of the K means
0:18:20clustering algorithm
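The talk only says the initial assignment is a variant of K means and gives no details, so the following is plain two cluster K means shown as a stand-in, not the speaker's variant. The function name, the deterministic seeding (first frame and the frame farthest from it) and the iteration count are assumptions for illustration.

```python
import numpy as np

def two_means(features, n_iter=20):
    """Split speech frames into two groups with plain K means (K = 2).

    features: (T, D) frames already identified as speech, with T >= 2.
    Returns (labels, centers).
    """
    x = np.asarray(features, dtype=float)
    # Assumed deterministic start: the first frame and the frame farthest from it.
    far = ((x - x[0]) ** 2).sum(axis=1).argmax()
    centers = np.stack([x[0], x[far]])
    labels = np.zeros(len(x), dtype=int)
    for _ in range(n_iter):
        # Assign every frame to its nearest center.
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Move each center to the mean of the frames assigned to it.
        for k in range(2):
            if np.any(labels == k):
                centers[k] = x[labels == k].mean(axis=0)
    return labels, centers
```

The two resulting groups would only seed the speaker models; the iterative modelling and time series clustering described in the talk would then refine the assignment.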
0:18:24the exact amount of features assigned to each of the
0:18:28speakers was not measured
0:18:32um i have a note on the question about the overlapping speech you said that you
0:18:37um overlapping speech in the responses but
0:18:40you score it as an error
0:18:42yeah and that you did not take it
0:18:44into account so we
0:18:45always and they're only one way to
0:18:47yeah and do you have an idea of the amount
0:18:50appeal is that it
0:18:51yes to to your result
0:18:52we have used two databases for uh diarization and
0:18:56the one used here was two thousand and forty eight conversations from the nist
0:19:00the two of them
0:19:01two thousand and five speaker recognition
0:19:04if i correctly remember it was about
0:19:07three point eight
0:19:09percent of overlapped speech
0:19:11on average
0:19:21i also have two questions first
0:19:22have you evaluated the degradation you get
0:19:25from replacing the gaussian model with the
0:19:27the uh som model
0:19:29and secondly
0:19:32uh could you i mean you want to use the initial
0:19:35you know so many seconds
0:19:36for for building your your uh
0:19:39you're speaker clusters
0:19:40a could you just redo that every so often i mean most uh
0:19:45machines this dataset more than once if you record
0:19:47uh you can continue doing online segmentation and in the background you can we compute your
0:19:53speaker clusters
0:19:54you know every
0:19:55uh thirty seconds or something like that
0:19:57of course
0:19:58for the first question
0:19:59we have examined
0:20:01self organising maps and gmm models for diarization
0:20:04in papers presented previously
0:20:07that is
0:20:08gmm and som for diarization
0:20:10in our experiments
0:20:12presented the same performance
0:20:14so we didn't find any reason to use a gmm
0:20:18especially because the training process for som
0:20:21is a lot
0:20:22faster
0:20:25and for us more robust
0:20:27for the second question
0:20:29a paper was submitted to interspeech
0:20:32that does
0:20:34exactly that
0:20:38so i
0:20:39two questions
0:20:41one is the um
0:20:43comment about each set
0:20:44being used
0:20:46it is the first
0:20:47you get good performance going
0:20:49first hundred twenty seconds
0:20:50your initial
0:20:52at the door
0:20:52the files are only
0:20:54i mean for
0:20:54five minutes long you're using
0:20:56percent of the data
0:20:58you into that realistic to go halfway through a conversation
0:21:04not because just
0:21:07if we use about a thirty plus
0:21:10thirty second of the data in order to initialise the conversation
0:21:14the performance
0:21:15why that is
0:21:17i mean
0:21:17we get like a thirty three percent diarization error rate and
0:21:25four percent speaker
0:21:27the amount of data
0:21:29required for the initial training by the diarization system
0:21:32is quite large
0:21:36if we have
0:21:37uh the possibility to train the system online as the conversation goes
0:21:42it would be great
0:21:43that's exactly what we partition
0:21:44in it
0:21:45in the next
0:21:47a paper in this
0:21:48did you see the link that was also it's just to name one
0:21:52they're looking at things like that
0:21:54oh well
0:21:56we use that
0:21:57well what where the conversation although ten minutes
0:22:01let's not the duration issues knots of its duration
0:22:05you conversations between street
0:22:08i take it turns you take
0:22:10you know it's
0:22:11duty cycle
0:22:14if you look
0:22:17format you like
0:22:17i mean
0:22:18E R
0:22:19you know
0:22:20there it should be fine
0:22:21if someone dominates
0:22:22first part
0:22:23conversation you know well
0:22:25that's so
0:22:26and i also think in the call home and call friend
0:22:30but the actually
0:22:31more than
0:22:32two people getting
0:22:34two people on one side getting on
0:22:37um so you have more
0:22:39so what
0:22:40yeah the point
0:22:42maybe you had this
0:22:45type your address in the
0:22:48what you compare this to so for example the window has
0:22:52at published papers we did this workshop that's me
0:22:55exactly this task
0:22:56you start out blindly
0:22:58you start building up doing online
0:23:00did you use that the baseline
0:23:02did you
0:23:05no i i think uh to
0:23:07two papers
0:23:09a which perform this online diarization task
0:23:12but mostly on broadcast news
0:23:15not on telephone i believe
0:23:26this very little
0:23:27a problem
0:23:30i would we have yeah
0:23:32um you know
0:23:36thank you
0:23:43i wanted to know if you have some idea
0:23:46to detect a new cluster a new speaker your system seems not to be able to
0:23:51add a cluster
0:23:52during decoding
0:23:55our diarization system is
0:23:57only oriented to telephone conversations between two speakers that is we already assume that the number of speakers is
0:24:05i have encountered some ideas
0:24:09part of which use the leader follower algorithm which is practically very simple
0:24:14that is the distance
0:24:16if we segment the conversation and take a new segment
0:24:20you can take the distance to the
0:24:22current models you have
0:24:24and if the distance exceeds a certain threshold
0:24:28then you
0:24:29need a new model
0:24:31you say that
0:24:32this is a new speaker
0:24:33and you train a new model for it
0:24:35and use it
0:24:35in order to cluster the conversation
0:24:38later on
0:24:40when you come to the end of the conversation you could also use
0:24:43the distance matrix between the models
0:24:46you know to um
0:24:47merge models which are
0:24:49very very close
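A minimal sketch of the leader follower idea as described in this answer, which the speaker mentions as something encountered elsewhere and not as part of the presented system. Each model is reduced to a mean vector and the distance is Euclidean; the function name, the `rate` update step and the threshold are assumptions for illustration, and the final merging of close models is left out.

```python
import numpy as np

def leader_follower(segments, threshold, rate=0.1):
    """Label segments online, opening a new model when none is close enough.

    segments: iterable of (n_frames, D) arrays.
    Returns (labels, models), where each model is a mean vector.
    """
    models, labels = [], []
    for seg in segments:
        mean = np.asarray(seg, dtype=float).mean(axis=0)
        if models:
            dists = [np.linalg.norm(mean - m) for m in models]
            best = int(np.argmin(dists))
        if not models or dists[best] > threshold:
            # Distance exceeds the threshold: declare a new speaker.
            models.append(mean)
            labels.append(len(models) - 1)
        else:
            # Otherwise pull the closest model slightly toward the segment.
            models[best] += rate * (mean - models[best])
            labels.append(best)
    return labels, models
```

The result depends heavily on the threshold: too low and one speaker is split into several models, too high and distinct speakers share one.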
0:25:04i want to make one of the um
0:25:07uh when you say that uh
0:25:10two minutes
0:25:11out of five in real life
0:25:13we never know what will be the length
0:25:18can be four minutes
0:25:19can be ten minutes
0:25:22two minutes
0:25:35before the meeting
0:25:37just make it
0:26:00i don't agree because we
0:26:03to me it's for initialisation
0:26:05does it matter if after the
0:26:08yeah the computation
0:26:10one means more or
0:26:12when you mean
0:26:13do you do need it i mean to me
0:26:15you should
0:26:18doesn't matter
0:26:19a piece to me
0:26:22the results
0:26:23no matter what
0:26:39if you one day
0:26:42the conversation
0:26:45i see
0:26:47to initiate
0:26:49can have better without
0:26:52on the
0:26:53i think you be more if we if we just need to know how many how
0:26:57almost iterations then you get
0:26:59sufficient statistics
0:27:00cover both speaker
0:27:07this is not
0:27:08it is not an issue of the percentage of the conversation it's a matter of
0:27:12the amount of statistics required to train
0:27:14two speaker models
0:27:16that is
0:27:17if the conversation would last
0:27:19for half an hour following the two minutes
0:27:22unless the channel is changed in such a manner that the models are no longer valid
0:27:28the result will be the same
0:27:31you are correct we have examined payment
0:27:34in order
0:27:34show that and that we wanted
0:27:36right i think we do not have anything to speak you know what i mean
0:27:43so you so you have an online system but i suspect you actually don't i suspect that your online system
0:27:47is actually an offline system
0:27:52do you output
0:27:53anything before you reach the end of the file
0:27:57at any point
0:27:58we can get results
0:28:01that's the output of that
0:28:02diarisation system
0:28:05do you but you do use an hmm
0:28:07i do have an original
0:28:09so you are deferring your decisions
0:28:14so you output as soon but you output the history as soon as it resolves to a single
0:28:19uh so the
0:28:21resolves to a single path
0:28:23uh the results are on user request
0:28:25that is
0:28:27using the hmm
0:28:30in order to provide diarization results
0:28:33i only need to
0:28:34perform termination and backtracking
0:28:36and this could be done using
0:28:38one millisecond of processing time
0:28:41this stage
0:28:42can all be done online
0:28:44that is initialisation using only the first feature
0:28:47and the rest of the features in the recursion stage
0:28:52for any new feature
0:28:55termination and backtracking is only memory access so i could provide results
0:29:04instantaneously before the uh uh
0:29:07hmm resolves to a single path
0:29:11you know
0:29:11what i really want to say is i think that
0:29:13this uh online offline distinction is really a red herring
0:29:19it would be better i think to um
0:29:23talk about the
0:29:27deferral
0:29:28the allowed deferral time before a decision needs to be made
0:29:31uh you know
0:29:33you you you make a distinction between online and offline but really what you're doing is you're conflating that with a
0:29:38particular approach
0:29:42to create models with an initial segment
0:29:47to my way of thinking
0:29:50um that doesn't really make the distinction between what is online and what's offline and if i would call it semi online
0:29:57it would be okay with you
0:29:59oh what what i would like
0:30:00to see is
0:30:03a specification of the uh
0:30:05amount of time
0:30:06that the decision is allowed to be deferred
0:30:10and uh you know and if you do that then um
0:30:14the um
0:30:15in an offline system that deferral time would be infinite
0:30:19in an online system the deferral time would be
0:30:23something that is demanded by the application yeah
0:30:29definition of the system
0:30:30of the client
0:30:45but that