Good afternoon, and thank you for coming. I am presenting work we have done at our university on intersession variability compensation for speaker segmentation of two-speaker telephone conversations. We also present a technique to generate several segmentation hypotheses for a given recording and to select the best one.

This work is focused on the segmentation of two-speaker conversations, so it is a speaker diarization problem:
we are answering the question "who spoke when?". It is an easier task than general diarization, since the number of speakers is known and limited to two. In this case, finding the boundaries between the speaker turns is most of the diarization problem, so we can treat it as a segmentation problem. Eigenvoice modeling has proven useful in the field of speaker verification, and this has motivated new approaches for the segmentation of two-speaker conversations.
Many of them are based on factor analysis using eigenvoices. In these approaches the speaker is modeled with a GMM supervector, which can be represented by a low-dimensional vector that we call the speaker factors, whose dimension is much lower than that of the GMM supervector. The main idea is that, with such a compact speaker representation, we can estimate the parameters of the representation on very short segments, and that is what we do for speaker segmentation.
We extract a stream of speaker factors over the input signal: sliding a one-second window frame by frame, we obtain a sequence of speaker-factor vectors. Then we cluster these speaker factors into two clusters using PCA plus k-means clustering. Once we have the two classes, we fit a single full-covariance Gaussian for each speaker, and with them we obtain a first segmentation. Finally, we refine this segmentation with a resegmentation step using MFCC features and GMM speaker models.
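As a rough illustration of this first clustering pass, here is a minimal numpy sketch, assuming the speaker factors have already been extracted. The function name and the median-split initialization are our own choices for the sketch, not necessarily the exact implementation used in the system.

```python
import numpy as np

def cluster_speaker_factors(F, n_iter=20):
    """Cluster a stream of speaker-factor vectors F (T x d) into two
    clusters with PCA + k-means, then fit one full-covariance
    Gaussian per cluster (hypothetical reconstruction of the talk's
    first pass)."""
    F = F - F.mean(axis=0)                      # center the factors
    # PCA via SVD; project onto the first principal component
    _, _, Vt = np.linalg.svd(F, full_matrices=False)
    proj = F @ Vt[0]                            # 1-D projection
    # initialize the two clusters by splitting the projection
    labels = (proj > np.median(proj)).astype(int)
    for _ in range(n_iter):                     # k-means on full vectors
        mus = np.stack([F[labels == k].mean(axis=0) for k in (0, 1)])
        d = ((F[:, None, :] - mus[None]) ** 2).sum(axis=2)
        new = d.argmin(axis=1)
        if np.array_equal(new, labels):
            break
        labels = new
    # one full-covariance Gaussian per cluster
    gauss = [(F[labels == k].mean(axis=0), np.cov(F[labels == k].T))
             for k in (0, 1)]
    return labels, gauss
```

On well-separated factor streams the two clusters map directly onto the two speakers; the resegmentation step then refines the boundaries.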
The main contribution of this work is the study of the types of variability affecting the speaker factors. First, if we have a set of recordings containing different speakers and we analyze the variability present across these recordings, we find that there is variability mainly due to the different speakers present in the recordings; this is usually referred to as the speaker variability.
But if we analyze a set of recordings belonging to the same speaker, we can see that there is also variability among these recordings, usually due to aspects like the channel or the mood of the speaker. This variability is usually known as intersession variability.
In addition, if we analyze a recording containing a single speaker, divide it into small slices, and analyze the variability across these slices, we see that there is also variability within the recording, usually due to the phonetic content or to variations of the channel along the recording. We will refer to it as intra-session variability.
In our approach to speaker segmentation we are only modeling the speaker variability, so the question is: are the other two types of variability, intersession and intra-session, affecting the segmentation performance? That is, do we need to compensate for intersession variability and for intra-session variability?
We know that intersession variability compensation is very important for speaker recognition, but one could argue that it is not so important for speaker segmentation, and even that the channel factors could help. We had some preliminary experiments suggesting that the compensation helped a little, but we believe it should not help much, because in the diarization task you do not see the same speaker over different sessions: all the speakers are within a single session, so you have no prior information about the speakers. Actually, we believed that intersession variability might even help to separate the speakers in the diarization task, because the channel carries information that can help to tell the speakers apart.
And what about intra-session variability? In the field of speaker recognition, state-of-the-art systems take into account only intersession variability, not intra-session variability, since they use the whole conversation to train a model. But we think it is very important for speaker segmentation and diarization, because many state-of-the-art systems are based on clustering very short segments; if we can compensate the variability between the segments of a given speaker, the clustering process should be easier.
That is what we try to do. Given a dataset containing several speakers and several recordings per speaker, we extract a stream of speaker factors from each recording. Then we consider every session as a different class, and we model the speaker and intersession variability as between-class variability, since we believe that both speaker and intersession variability help to separate the two speakers within a recording, and we model the intra-session variability as within-class variability.
With this framework it is easy to apply well-known techniques such as linear discriminant analysis (LDA), which maximizes the between-class variance while minimizing the within-class variance, or within-class covariance normalization (WCCN), which normalizes the within-class covariance of every class to the identity matrix. Both techniques have been successfully applied for intersession compensation in speaker recognition systems.
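A minimal numpy sketch of the WCCN computation just described, assuming speaker factors `X` with one session label per row (the function name is ours):

```python
import numpy as np

def wccn_transform(X, y):
    """Within-class covariance normalization (WCCN): estimate the
    average within-class covariance W over the classes (sessions)
    and return the Cholesky-based transform A such that
    A.T @ W @ A = I.  X is (N x d) speaker factors, y the class
    label of each row.  Sketch, not the exact system code."""
    d = X.shape[1]
    W = np.zeros((d, d))
    classes = np.unique(y)
    for c in classes:
        W += np.cov(X[y == c].T, bias=True)
    W /= len(classes)
    # lower-triangular A with A @ A.T = W^{-1}
    A = np.linalg.cholesky(np.linalg.inv(W))
    return A  # apply as X @ A
```

After applying `X @ A`, the average within-class covariance of the transformed factors is the identity matrix, which is exactly the normalization described above.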
To evaluate these two approaches we used the NIST SRE summed-channel condition, containing more than two thousand five-minute telephone conversations. The speech/non-speech marks are given, and we measure performance in terms of the speaker segmentation error, the speaker error part of the diarization error rate. Since we assume the speech/non-speech segmentation and we do not take overlapped speech into account, the diarization error rate is the same as the segmentation error. For scoring we use a 0.25-second collar.
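Under these assumptions (two speakers, speech/non-speech given, no overlap), the score reduces to a frame-level labeling error with the best of the two possible speaker mappings. A simplified stand-in for the NIST scoring, without the collar handling:

```python
import numpy as np

def segmentation_error(ref, hyp):
    """Frame-level segmentation error for a two-speaker task: since
    cluster labels are arbitrary, score both possible label mappings
    and keep the better one.  Simplified sketch of DER scoring with
    no collar and no overlapped speech."""
    ref, hyp = np.asarray(ref), np.asarray(hyp)
    err = np.mean(ref != hyp)
    return min(err, 1.0 - err)  # best of the two mappings
```

Real NIST scoring additionally ignores frames inside a collar around reference boundaries and handles overlapped speech; this sketch does neither.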
Here we have the results for a system using a small UBM with 256 Gaussians and MFCC features; in this case we do not use the resegmentation step. With twenty speaker factors, using intersession variability compensation with WCCN, the segmentation error is reduced to 2.5%. Another baseline with fifty speaker factors is slightly better, not much, but slightly. We also tried LDA for dimensionality reduction; LDA is helping, but WCCN alone is better, obtaining a 2% segmentation error, and even the combination of both is not better than WCCN directly.
But when we tried these systems after the resegmentation step, it was surprising that the results became more or less equal whether we used twenty or fifty speaker factors: the intra-session variability compensation with WCCN was still working and giving an improvement, but it no longer seemed useful to increase the number of speaker factors. We were a little disappointed with this, because we had expected it to help, so we ran a new set of experiments, not in the paper, that we are presenting here,
with a larger UBM and more features. In this case, increasing the number of speaker factors does help. Our baseline with fifty speaker factors gives a 1.8% segmentation error, lower than the 2.1% we had before, and when we use channel compensation with WCCN the error goes down to 1.4%. We also increased the number of speaker factors to test LDA, and we see that LDA is helping even more than before; our best configuration now is the combination of LDA plus WCCN. So it seems that, although the baseline with more speaker factors is not better than the baseline with fifty, with LDA we can take advantage of more speaker factors. Our best result is a 1.3% segmentation error.
On the other hand, we propose a technique to generate several segmentation hypotheses and to select the best one based on a set of confidence measures. What we do is iteratively apply a binary decomposition of the recording, obtaining four levels of splitting, as we can see in the figure. We segment every slice with the proposed system; then, for every level, we select the best slices and combine them to build the two speaker models. With these two speaker models we resegment the whole recording, using iterative Viterbi resegmentation with MFCC features and GMM speaker models.
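Our understanding of the splitting scheme can be sketched as follows. This is a hypothetical reconstruction that splits uniformly; the actual system may split adaptively, and the slice scoring is not included:

```python
def split_levels(n_frames, n_levels=4):
    """Multi-level binary splitting of a recording used to produce
    alternative hypotheses: level k slices the recording into 2**k
    equal parts, returned as (start, end) frame-index pairs."""
    levels = []
    for k in range(n_levels):
        parts = 2 ** k
        bounds = [(i * n_frames // parts, (i + 1) * n_frames // parts)
                  for i in range(parts)]
        levels.append(bounds)
    return levels
```

Each slice at each level is then segmented independently, giving one candidate hypothesis per level.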
To select the best slices and the best level among the four, we use confidence measures and also a majority voting step. The confidence measures used in this work were, first, the Bayesian information criterion (BIC), computed using MFCC features and the GMM speaker models, and second, the KL divergence in the speaker-factor space: we fit Gaussian speaker models in that space and compute the KL distance between both models. To fuse both confidence measures we use the FoCal toolkit, well known in speaker verification; the fusion weights were optimized to separate the hypotheses with a segmentation error below one percent on the summed channel.
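The second confidence measure has a closed form. Here is a sketch of the KL divergence between two Gaussians, as it would be computed between the two speaker models in the speaker-factor space (a symmetric variant simply averages the two directions):

```python
import numpy as np

def gauss_kl(mu0, S0, mu1, S1):
    """Closed-form KL divergence KL(N(mu0,S0) || N(mu1,S1)) between
    two multivariate Gaussians; sketch of the distance used as a
    confidence measure in the speaker-factor space."""
    d = len(mu0)
    S1_inv = np.linalg.inv(S1)
    dm = mu1 - mu0
    return 0.5 * (np.trace(S1_inv @ S0) + dm @ S1_inv @ dm - d
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))
```

Intuitively, well-separated speaker models give a large divergence, which makes the corresponding segmentation hypothesis more trustworthy.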
Here we have the results for this hypothesis generation and selection strategy. When we are not using intersession variability compensation, this solution improves the results over our baseline: the baseline is at 2.1% and we get 1.9% with our strategy. If we had an ideal confidence measure and could always select the best level, we could go down to a 1.1% segmentation error, but our confidence measures were far from ideal. Moreover, when using the intra-session variability compensation, the improvement we obtained was not statistically significant.
So we tried to work out what would help, and we wanted to find a better set of confidence measures, because the number of possible confidence measures for fusing segmentation hypotheses is very large, and we had started with fairly simple ones. We were not really happy with these results, so we tried again with the large UBM and more features, with our best configuration for the variability compensation, and also with a new set of confidence measures. These are new results, not in the paper. We could reduce the segmentation error from 1.3% to 1.2%, and in the best case to 1.0%; and if we could always select the best level, we could reduce it to a 0.7% speaker error, which is quite good compared to the baseline.
As conclusions of this work: we have presented two techniques for intra-session variability compensation, and we have shown that they help for speaker segmentation. WCCN obtains better performance than LDA alone, and is somewhat similar to the combination of LDA plus WCCN. Since increasing the number of speaker factors increases the computational cost, WCCN seems the better choice for low computational cost applications. But of course, when computational cost is not a problem, our best configuration uses a high number of speaker factors together with LDA.
Here we have a summary of the results. Our best segmentation error is 1.3%: with the variability compensation we went from 1.9% to 1.3%.
Also note that WCCN is probably helping so much because of the particular initialization used in this study: we use PCA plus k-means as the initialization, and normalizing the within-class covariance probably helps the k-means, which assumes that all the classes have the same covariance. That is probably why we see such a clear benefit from the WCCN.
We have also presented a hypothesis generation and selection technique which can improve the segmentation results; for our best configuration it reduces the segmentation error from 1.3% to 1.2% with a large UBM. I think that's all. Thank you very much.
We have time for questions.

(Question from the audience.)
Just one dimension. I didn't mention it because it is in another paper, but it makes the system much more robust with the PCA. We keep just one dimension: to run the k-means we use all the dimensions, but we initialize the means of the k-means from the first dimension of the PCA output.
Yes, but I mean, in our experiments I am keeping one dimension. Maybe it is not the best you can do, but the first dimension of the PCA is usually the best one for this task, and we are getting about a 1.8% diarization error rate just using one dimension. So we are not sure it is the best representation.
We have also tried to just plug the PCA output, with more dimensions, into the k-means.
Are there any other questions? Then let's thank the speaker again.