0:00:06 Okay, so this talk is about unsupervised compensation of intra-session intra-speaker variability for speaker diarisation.
0:00:17 Okay, so the outline of the talk is the following. I will first define the notion of intra-session intra-speaker variability — it is actually very close to the concept of inter-segment variability that was described in the previous presentation.
0:00:36 I will talk about how to estimate this variability, in a supervised manner and in an unsupervised manner, and how to compensate for it.
0:00:45 Then I will propose a two-speaker diarisation system which is supervector-based and which also utilises this compensation of intra-session intra-speaker variability — which from now on I will just call intra-speaker variability.
0:01:03 Then I will report experiments, and I will summarise and talk about future work.
0:01:11 Okay, so the reason we deal with intra-speaker variability is that it is the reason why speaker diarisation is not a trivial task: if there were no intra-speaker variability, it would be very easy to perform this task.
0:01:27 Now, when I talk about intra-speaker variability, I am talking about phonetic variability, energy or loudness variability within the speech of a single speaker, acoustic or speaker-intrinsic variability, speech-rate variation, and even non-speech variability, which is due to the VAD: sometimes the VAD makes more errors and sometimes fewer, and this can also cause variability that can be harmful to the diarisation algorithm.
0:01:58 The relevant tasks for this kind of variability are, first of all, of course, speaker diarisation, but also speaker recognition with short training and testing sessions.
0:02:10 Okay, so the idea now is first to propose a proper generative model to handle this kind of variability. I will start with the classical GMM model, where a speaker is modelled by a GMM and the frames are generated independently according to this GMM.
0:02:32 In 2005 we proposed a modified model that also accounts for intersession variability. According to this model a speaker is not modelled by a single GMM; rather, it is modelled by a PDF over the session space. Every session is modelled by a GMM, and frames are generated according to this session GMM.
0:02:57 So now what we want to do is to augment this model again and propose a hierarchical model, where a speaker is a PDF over the session space, a session is again a PDF over the segment space, and a segment is modelled by a GMM; the frames are generated independently according to the segment GMM.
0:03:25 If we want to visualise this in the GMM supervector space, we can see in this slide four supervectors corresponding to four different segments of the same speaker, recorded in one particular session, and this can be modelled by this distribution.
0:03:56 Now, if the same speaker talks in a different session, we get this other distribution, and we can model the entire set of sessions with a single distribution, which will be the speaker-A distribution; and we can do the same for speaker B.
0:04:19 Okay, so according to this generative model, we assume that each supervector — for a particular speaker, a particular session and a particular segment — distributes normally with some mean and covariance, where the mean is speaker- and session-dependent and the covariance is also speaker- and session-dependent. This is a quite general model; we will later make assumptions in order to simplify it, so that it is actually possible to use this model.
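To make the hierarchy concrete, here is a minimal sketch (not from the talk) that samples segment supervectors top-down, speaker → session → segment. The dimension, the isotropic Gaussians, and the two scale values are purely hypothetical illustration choices.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                 # supervector dimension (hypothetical)
session_std = 0.5     # intersession variability scale (hypothetical)
segment_std = 0.2     # intra-session intra-speaker variability scale (hypothetical)

# Speaker level: a PDF over the session space (here a Gaussian around a mean).
speaker_mean = rng.normal(size=D)

def sample_session(speaker_mean):
    """A session is drawn from the speaker's PDF over the session space."""
    return speaker_mean + session_std * rng.normal(size=D)

def sample_segment_supervector(session_mean):
    """A segment supervector is drawn from the session's PDF over segments."""
    return session_mean + segment_std * rng.normal(size=D)

# Draw four segments from one session of this speaker.
session_mean = sample_session(speaker_mean)
segments = np.array([sample_segment_supervector(session_mean) for _ in range(4)])
```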
0:04:55 Okay, so back in 2007, in an Interspeech paper called Trainable Speaker Diarization, we used this generative model to do supervised intra-speaker variability modelling for speaker diarisation, in the context of supervised diarisation. In that case we assumed that the covariance of the intra-speaker variability is speaker- and session-independent, so we have a single covariance matrix, and we estimated it from a labelled, segmented development set; so it is quite trivial to estimate. Once we have estimated this covariance matrix, we can use it to develop a metric that is induced from it, and use this metric to actually perform the diarisation. The techniques that we used were PCA and WCCN.
0:05:59 Okay, so this was in 2007; but the problem with this technique is that we must have a labelled development set, and this development set must be labelled according to speaker turns, and sometimes this can be problematic.
0:06:27 Okay, so in this paper, what we do is assume that the covariance matrix is session-dependent — it is not global, as before, but session-dependent — and, in the algorithm described in the next slide, we actually do not need any labelled data: there is no training process at all. We just estimate the covariance matrix, which is session-dependent, on the fly, on a per-session basis.
0:07:08 Okay, so how do we do that? The first stage is GMM supervector parameterisation. We take the session that we want to apply diarisation to, and we first extract standard features, such as MFCC features, at the standard frame rate. Then we estimate a session-dependent UBM; the UBM is of low order — a typical order can be 64 — and we do that using the EM algorithm. After that, we take the speech signal and divide it into overlapping superframes at a frame rate of ten per second: we define a chunk of one second as a superframe, and we have ninety percent overlap. For each superframe we estimate a GMM using standard MAP adaptation from the UBM that we just estimated, and then we take the parameters of the GMM and concatenate them into a supervector; this is the representation for that superframe. So at the end of this process the session is parameterised by a sequence of supervectors.
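The parameterisation stage described above can be sketched roughly as follows. The toy features, the two-component "UBM", and the relevance factor r = 16 are hypothetical stand-ins, and only mean-only MAP adaptation is shown.

```python
import numpy as np

def superframes(features, frame_rate=100, length_s=1.0, step_s=0.1):
    """Slice a (T, d) frame sequence into overlapping one-second superframes."""
    win, hop = int(length_s * frame_rate), int(step_s * frame_rate)
    return [features[t:t + win] for t in range(0, len(features) - win + 1, hop)]

def map_adapted_supervector(frames, ubm_means, ubm_vars, ubm_weights, r=16.0):
    """Mean-only MAP adaptation of a diagonal-covariance UBM; the adapted
    component means are stacked into a supervector."""
    diff = frames[:, None, :] - ubm_means[None, :, :]                  # (T, K, d)
    ll = -0.5 * np.sum(diff ** 2 / ubm_vars + np.log(2 * np.pi * ubm_vars), axis=2)
    ll += np.log(ubm_weights)                                          # (T, K)
    post = np.exp(ll - ll.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)                            # responsibilities
    n_k = post.sum(axis=0)                                             # soft counts
    e_k = (post.T @ frames) / np.maximum(n_k, 1e-10)[:, None]          # posterior means
    alpha = n_k / (n_k + r)
    adapted = alpha[:, None] * e_k + (1 - alpha)[:, None] * ubm_means
    return adapted.ravel()

# Toy session: 300 frames of 4-dim "MFCCs" and a 2-component toy "UBM".
rng = np.random.default_rng(1)
feats = rng.normal(size=(300, 4))
means, variances = rng.normal(size=(2, 4)), np.ones((2, 4))
weights = np.array([0.5, 0.5])
svs = np.array([map_adapted_supervector(sf, means, variances, weights)
                for sf in superframes(feats)])
```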
0:08:27 Okay, so the next phase is to estimate the intra-speaker covariance matrix. We are given the sequence of supervectors, and we note that the supervector at time t, s_t, is actually the sum of two components. The first component is a speaker mean component, which is speaker-dependent and session-dependent. The second component is the actual instance of the intra-speaker variability, which is denoted by i_t; according to our previous assumption, i_t distributes normally with zero mean and a covariance matrix which is session-dependent. Now, our goal is to estimate this covariance matrix, and what we do is consider the difference between two consecutive supervectors. So in the equation we define delta_t, which is the difference between the supervectors of two consecutive superframes, and we can see that the difference between two superframes is actually the difference between the corresponding intra-speaker variability instances, i_t minus i_{t-1}, plus a mean component which is usually zero, because most of the time there is no speaker change; but every time there is a speaker change this value is non-zero.
0:09:57 Now what we do is something that is a bit risky: we assume that most of the time there is no speaker change, and that we can live with the noise that is added by the violation of this assumption, so we just neglect the impact of the speaker changes on the covariance matrix of delta_t. Conceptually, we are throwing away the few places where there is a speaker change and computing the covariance over the rest. So if we want to estimate the covariance of the intra-speaker variability, all we actually have to do is compute the empirical covariance of the delta supervectors; that is the equation at the bottom of the slide.
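The covariance-estimation step reduces to a few lines. This sketch uses a synthetic supervector stream and, per the "risky" assumption above, makes no attempt to detect speaker changes.

```python
import numpy as np

def intra_speaker_cov(supervectors):
    """Empirical covariance of consecutive supervector differences delta_t."""
    deltas = np.diff(supervectors, axis=0)      # delta_t = s_t - s_{t-1}
    deltas = deltas - deltas.mean(axis=0)       # center (mean term is ~0 anyway)
    return deltas.T @ deltas / (len(deltas) - 1)

rng = np.random.default_rng(2)
svs = rng.normal(size=(200, 6))                 # toy supervector sequence
cov = intra_speaker_cov(svs)
```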
0:11:03 Okay, so once we have done that and we have some kind of estimate of the intra-speaker variability, we can try to compensate for it. The way we do that is that we assume, of course, that it is of low rank, and we can apply PCA to find a basis for this low-rank subspace of the supervector space. Now we have two possible compensation approaches. The first one is NAP: we compensate for the intra-speaker variability in the supervector space, and this is only useful if we want to do the diarisation in that space. The alternative is to use feature-based NAP, where we compensate for the intra-speaker variability in the feature space, and then we can actually use any other diarisation algorithm on these compensated features.
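A rough sketch of the supervector-space NAP step: take the top eigenvectors of the estimated intra-speaker covariance as the assumed low-rank nuisance subspace and project it out. The rank of 2 and the toy data are arbitrary illustration choices.

```python
import numpy as np

def nap_projection(cov, rank):
    """Return a function that removes the top-`rank` nuisance directions."""
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    u = eigvecs[:, -rank:]                      # orthonormal nuisance basis
    def compensate(supervectors):
        # Subtract the component of each supervector inside the subspace.
        return supervectors - (supervectors @ u) @ u.T
    return compensate

rng = np.random.default_rng(3)
svs = rng.normal(size=(50, 6))
cov = np.cov(np.diff(svs, axis=0), rowvar=False)   # as estimated from the deltas
compensate = nap_projection(cov, rank=2)
clean = compensate(svs)
```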
0:11:57 Okay, so what we actually decided is to use NAP and then to do the rest of the diarisation in the supervector domain. The motivation for that, which I am about to explain, is illustrated in the figures at the bottom of the slide. On the left-hand side we see an illustration — this is not real data, just an illustration — of two speakers. Below we have the same data, but without the speaker labels, and the problem is that if we want to apply a diarisation algorithm, it of course does not know the colours of the points, and it is very hard to separate the two speakers. Now, if we work in the supervector space, as we see in the middle of the slide, the distributions of the speakers tend to be more unimodal; therefore it is much easier to try to do diarisation. This is of course due to the fact that we do some smoothing, because a superframe is one second long. Now, if we manage to remove a substantial amount of the intra-speaker variability, what we get is the illustration on the right-hand side of the slide, and even if we do not know the two colours — if we do not know which points belong to which speaker — it is still reasonable that we may find a solution for separating the speakers, because the clustering problem is now very easy.
0:13:43 Okay, so an overview of the algorithm is as follows. First, of course, we parameterise the speech into a time series of supervectors, and we compensate for intra-speaker variability as I have already shown; then we use the Viterbi algorithm to do segmentation. For the Viterbi segmentation we just assign a single state to each speaker — actually we also use a minimum-length constraint, so this is a simplification, but basically we are using one state per speaker — and all we have to do is estimate transition probabilities and output probabilities. The transition probabilities are very easy to estimate: we can just take a very small development set and estimate the statistics of the length of a speaker turn, and if we know the mean length of a speaker turn we can estimate the transition probabilities. The tricky part is estimating the output probabilities — the probability of the supervector at time t given speaker i. We do not do this directly; we do it indirectly, and I will describe this in the next slide. So let us say we can do this in some way; then we just apply the Viterbi segmentation and come up with a segmentation, and then we refine the diarisation using iterative Viterbi resegmentation: we come back to the original frame-based features — that is, we train the speaker HMM using our previous segmentation and rerun the Viterbi segmentation, but now the resolution is better because we are working at a frame rate of a hundred per second. We do this for a couple of iterations, and that is the end of the algorithm.
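The Viterbi stage can be sketched as a plain two-state decode in which the self-loop probability is derived from the mean turn length. The toy emission log-likelihoods are hypothetical, and the explicit minimum-length constraint mentioned in the talk is omitted for brevity.

```python
import numpy as np

def two_speaker_viterbi(loglik, mean_turn_frames):
    """Two-state Viterbi; switch probability is 1 / mean speaker-turn length."""
    p_switch = 1.0 / mean_turn_frames
    log_a = np.log([[1 - p_switch, p_switch],
                    [p_switch, 1 - p_switch]])
    T = len(loglik)
    score = np.full((T, 2), -np.inf)
    back = np.zeros((T, 2), dtype=int)
    score[0] = np.log(0.5) + loglik[0]          # equal priors on the two speakers
    for t in range(1, T):
        for j in range(2):
            cand = score[t - 1] + log_a[:, j]
            back[t, j] = np.argmax(cand)
            score[t, j] = cand[back[t, j]] + loglik[t, j]
    path = np.zeros(T, dtype=int)               # backtrack the best state sequence
    path[-1] = np.argmax(score[-1])
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path

# Toy emissions: speaker 0 favoured for the first half, speaker 1 after.
ll = np.full((40, 2), -1.0)
ll[:20, 0] = -0.2
ll[20:, 1] = -0.2
path = two_speaker_viterbi(ll, mean_turn_frames=10.0)
```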
0:14:42 So what I still have to talk about is how to estimate the output probabilities — the probability of the supervector at time t given speaker i.
0:15:54 Okay, so the method is the following — I will give some motivation after this slide. We find the largest eigenvalue and take the corresponding eigenvector; this eigenvector is taken from the covariance matrix of all the supervectors. So we take the compensated supervectors from the session, compute the covariance matrix of these supervectors, and take the first eigenvector. Then we take each compensated supervector and project it onto this largest eigenvector, and what we get, for each time t, is p_t: the projection of supervector s_t onto the largest eigenvector.
0:16:48 I am not going to convince you now — the details are in the paper — but to try to give some intuition: the log-likelihood ratio that we are looking for, the log-likelihood of supervector s_t given speaker 1 divided by the likelihood of s_t given speaker 2, is actually a linear function of this projection that we have found. So what we have to do is to somehow estimate the parameters a and b, and if we estimate a and b we can approximate the likelihood we are looking for, in order to plug it into the Viterbi, by just taking this projection.
0:17:30 Now, b is actually related to the dominance of the speakers: if we have two speakers and one of them is more dominant, then b will be either positive or negative. In our experiments we just assumed b to be zero; that is, we assume the speakers are equally dominant. This assumption is of course not always true, but we hope to accommodate this problem by using the final Viterbi resegmentation. And a is assumed to be corpus-dependent and constant, so we just have to estimate a single parameter, which we estimate from a small development set; the derivations are in the paper.
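The scoring rule just described can be sketched as follows: project each compensated supervector onto the leading eigenvector of their covariance and treat a·p_t + b (with b = 0, i.e. equal dominance) as the approximate log-likelihood ratio. The toy clusters and a = 1 are hypothetical.

```python
import numpy as np

def projection_llr(supervectors, a=1.0, b=0.0):
    """Approximate per-superframe LLR between the two speakers."""
    centered = supervectors - supervectors.mean(axis=0)
    cov = centered.T @ centered / (len(centered) - 1)
    _, eigvecs = np.linalg.eigh(cov)
    u = eigvecs[:, -1]                 # eigenvector of the largest eigenvalue
    p = centered @ u                   # p_t: projection of s_t onto u
    return a * p + b

# Toy data: two well-separated speaker clusters in "compensated" space.
rng = np.random.default_rng(4)
spk0 = rng.normal(scale=0.1, size=(30, 5))
spk1 = rng.normal(scale=0.1, size=(30, 5))
spk1[:, 0] += 5.0                      # the speakers differ along one direction
svs = np.vstack([spk0, spk1])
llr = projection_llr(svs)
```

Thresholding the LLR at zero then separates the two toy clusters, which mirrors the intuition on the next slide.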
0:18:14 Just to get the motivation, let us say that we have a very easy problem. In this case the red points are one speaker and the blue points are the second speaker, and this is actually in the supervector space, after we have done the intra-speaker compensation — some residual variability remains. If we take all the blue and the red points together, compute the covariance matrix, and take the first eigenvector, what we get is the black arrow. If we take the black arrow and just project all the points onto it, we get the distribution on the right-hand side of the slide, and if we decide according to the projection onto the black arrow — if we just compute the decision boundary — we see that it is actually exactly the optimal decision boundary. We can compute the optimal decision boundary because we know the true distributions of the red and the blue points, since this is artificial data. So this algorithm works in this simple case.
0:19:23 Now, if we increase the amount of intra-speaker variability — as if we did not manage to remove almost all of it — we see that as the amount of excess variability grows, the decision boundary that we obtain is no longer exactly the optimal decision boundary, and the algorithm can actually fail when there is too much residual intra-speaker variability. So the hope is that our algorithm, which is supposed to compensate for the intra-speaker variability, actually manages, on the data we are going to work on, to remove enough of the variability to let this method work.
0:20:11 Okay, so now I will report experiments. We used a small development set from the NIST 2005 evaluation; we used one hundred sessions. Actually, I do not think we really need so much data for development, because we estimate only a handful of parameters: the development set is used to tune the HMM transition parameters and the a parameter, which is used for the log-likelihood calibration. And we used the NIST 2005 corpus for evaluation.
0:20:51 What we did is we took the stereo phone calls and just summed the two sides artificially, in order to get two-wire data, and we derived the ground truth from the ASR transcripts provided by NIST. We report speaker error rate — our error measure is the speaker error rate — and we use the standard NIST scoring.
0:21:24 And, in order to be able to compare our results, we used a baseline system which is BIC-based, inspired by the LIMSI 2005 system. It is based on detection of speaker changes using BIC, then iterations of Viterbi resegmentation along with BIC-based clustering, followed finally by a phase of Viterbi resegmentation.
0:21:52 Okay, so these are the results with the whole algorithm. On all the data, we achieved a speaker error rate of 6.1% for the baseline; when we just used the supervector-based system, without any intra-speaker variability compensation, we got 4.8%, and when we also used intra-speaker variability compensation we got 2.8%. In this experiment the supervector GMM order is 64 and the NAP compensation order is 5.
0:22:34 We ran some experiments in order to try to improve on this — we actually did not manage to — where we tried to see what happens if we just change the front end. We found that feature warping actually degrades performance, and this has already been explained by the fact that the channel is actually something we want to exploit when we do diarisation, because it may be the case that different speakers are on different channels, so we do not want to remove channel information. Other front-end changes we tried also slightly degraded performance.
0:23:13 We also wanted to check the potential of the method — what the error rate would be if we had perfect calibration — and we improved, very slightly, to 2.6.
0:23:31 Some more experiments were run to check the sensitivity of the system to the GMM order and to the NAP compensation order. For the GMM order, the best orders were 64 and 128; I did not try to increase it further, so I do not know what would happen with a higher order. For the NAP order, we see that the system is quite insensitive: from an order of about fifteen we already get a speaker error rate of around three, but the best performance is at an order of around five.
0:24:11 Okay, so finally, before I conclude: one of the motivations for this work was to use this diarisation for speaker ID on summed two-wire data, so we tried to see whether we get improvements using this diarisation. What we did is we tried three different diarisation systems: the first was the reference (manual) diarisation, the second was the baseline diarisation, and the third was the proposed diarisation. And we tried to apply the diarisation either only on the test data, or on both the test data and the training data — because one of our sponsors actually has this problem where the training data is also summed; therefore we wanted to check the performance also when the training data is summed.
0:25:09 We used a NAP-based speaker ID system, which achieves an equal error rate of 5.2 on the NIST 2005 female data set.
0:25:21 What we concluded is that when only the test data is summed, the proposed method achieves performance that is equivalent to using manual diarisation. When both the train and test data are summed, there is a degradation compared to the reference diarisation; however, the degradation that we get for the baseline system is higher than when we use the proposed system.
0:25:51 Okay, so to summarise: we have described an algorithm for unsupervised estimation of intra-session intra-speaker variability, and we have also described two methods to compensate for this variability, one of them using NAP in the supervector space; and although I did not report any results for it, we also ran experiments using NAP in the feature (MFCC) space. Using a two-speaker diarisation system based on the GMM supervectors, we got a speaker error rate of 4.8%, and if we also apply intra-speaker variability compensation within the supervector-based two-speaker diarisation system, we get a further improvement to 2.8%. And finally, the whole system improves speaker ID accuracy for summed audio, especially for the summed-training case.
0:26:50 Now, future work would be, first of all, to apply feature-space NAP for intra-speaker variability compensation and then to use different diarisation systems on top of it — that could be interesting. We did try feature-space NAP, but then we just estimated supervectors again, so everything we did was still in the supervector space. And of course we should try to extend this work to diarisation of multiple speakers, possibly in broadcast news or in meetings, and to integrate other methods, such as the inter-speaker variability modelling approaches that were proposed lately, with this approach.
0:27:50 First of all, congratulations on your work. You restricted yourself to two-speaker diarisation — just segmentation, something like diarisation without model selection — and in my opinion the beauty of diarisation is the model selection, but anyway: you claim that with this subspace projection you somehow Gaussianise the data, right?
0:28:19 Yes — I think that, first of all, by using the supervector approach it is more convenient to do diarisation, because in some sense your distributions tend to be more unimodal; and also, I claim that you can use techniques similar to those used for speaker recognition, such as intersession variability modelling, in this domain.
0:28:50 So if you forget about two speakers and have a real model selection task —
0:28:56 — one question would be something like this: you consider the projections, let us say the emissions, in this subspace after the projection. Since you claim these would be Gaussian, traditional model selection techniques, like the Bayesian information criterion, would hopefully be applicable — because the misspecification of the model is obvious when you use it in the MFCC domain, which is clearly multimodal, but if you now also consider the Gaussian assumption, then probably BIC will give you answers in this subspace about the number of speakers as well.
0:29:44 Possibly — even within an HMM framework. So we should consider this, definitely.
0:30:03 Now, a question about the summed condition: how can you define the summed condition in the training set? Because if you do the diarisation on the training set, you do not know which model is which, okay?
0:30:19 So, the method that was decided with some sponsor is that you do the diarisation, and then — you have the reference segmentation — you just decide which side has more overlap with the segmentation that you got. So you know which portions of the speech actually belong to the target speaker.
0:30:51 Because you have the four-wire data and you have a segmentation, you get two clusters, and then you just compare these clusters automatically to the reference to decide which cluster is more correct. And if you think about it, if someone trains on summed data, one way of using this is that you can apply automatic diarisation and just let the user listen to the two clusters and decide which cluster is the right one. That is the motivation.
0:31:31 Thank you.