0:00:12 Hello everyone. This is the presentation video for our paper, Optimal Mapping Loss: A Faster Loss for End-to-End Speaker Diarization. I'm [inaudible] from [inaudible] University, and here I will give a brief introduction to the paper.
0:00:33 Recently, neural network based approaches have become more and more popular for the subtasks of speaker diarization, such as voice activity detection, speaker embedding extraction, and clustering.
0:00:49 However, end-to-end speaker diarization still remains challenging, partly because the speaker label ambiguity problem makes it difficult to design the loss. The permutation invariant training loss, or PIT loss, can be a possible solution and has been applied to the SA-EEND network, but its time complexity increases factorially as the number of speakers increases.
0:01:23 In this paper, we investigate improvements on the PIT calculation and further propose a novel optimal mapping loss, which directly computes the best match between the output posterior sequence and the ground-truth sequence via the Hungarian algorithm. Our proposed loss significantly reduces the time cost to polynomial time, while keeping the same performance as the PIT loss.
0:01:57 So what is speaker label ambiguity? Given an audio whose ground-truth speaker sequence is A, B, B, C, we naturally encode the speakers into integers and get the encoded labels (1, 2, 2, 3). For outputs like (1, 2, 2, 3) and (2, 1, 1, 3), both should be counted as correct from the view of speaker diarization.
0:02:29 Traditional loss functions like the binary cross entropy loss assess the first output with a low loss, while assessing the second output with a high loss. This obviously does not meet our expectations. The reason behind this is that speaker diarization only focuses on the relative difference of speaker identities.
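To make the ambiguity concrete, here is a minimal sketch (using the four-frame integer-label example above; the clusters helper is ours, not from the paper) showing that a naive frame-wise comparison penalizes an output whose speaker indices are merely permuted:

```python
import numpy as np

# Frame-level speaker indices for the same audio; output_b only renames
# speakers 1 and 2 relative to the ground truth.
truth    = np.array([1, 2, 2, 3])
output_a = np.array([1, 2, 2, 3])
output_b = np.array([2, 1, 1, 3])

# A naive frame-wise match treats output_b as mostly wrong...
print((output_a == truth).mean())  # 1.0
print((output_b == truth).mean())  # 0.25

# ...yet both outputs group the frames into identical speaker clusters,
# which is all that diarization cares about.
def clusters(labels):
    return {frozenset(np.flatnonzero(labels == s)) for s in set(labels)}

print(clusters(output_a) == clusters(truth))  # True
print(clusters(output_b) == clusters(truth))  # True
```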
0:02:59 Let's have a review of the latest end-to-end speaker diarization system, SA-EEND, and see how it solves this problem. SA-EEND adopts the encoder part of the Transformer model, with the positional encoding removed.
0:03:17 To be specific, given an input sequence of log-mel filterbank features, it directly generates the speaker posteriors as the model output. In addition, the PIT loss is used to cope with the speaker label ambiguity problem.
0:03:38 Figure 2 shows an overview of SA-EEND for the two-speaker case. Given a sequence of features x_1, x_2, ..., x_T, the corresponding ground-truth labels are defined as follows, where T is the duration and N is the number of speakers in the audio.
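The label format itself appears only on the slide and is not captured in the transcript; a plausible reconstruction, following the usual SA-EEND notation, is:

```latex
Y = [\mathbf{y}_1, \mathbf{y}_2, \dots, \mathbf{y}_T], \qquad
\mathbf{y}_t = [y_{t,1}, \dots, y_{t,N}]^{\top} \in \{0,1\}^{N},
```

where y_{t,n} = 1 if speaker n is active at frame t and 0 otherwise.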
0:04:03 The speaker label y_t indicates the joint activity of the N speakers at frame t. It should be noted that y_t is a multi-label binary vector, not a one-hot vector, and for non-speech regions y_t is all zeros.
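As an illustration of this label format, here is a hypothetical six-frame, two-speaker clip (the values are made up):

```python
import numpy as np

# Each row y_t is a multi-label 0/1 vector: several entries may be 1 at
# once (overlapped speech), and all entries are 0 for non-speech frames.
Y = np.array([
    [1, 0],  # only speaker 1 is talking
    [1, 0],
    [1, 1],  # overlap: both speakers active -- not a one-hot vector
    [0, 1],  # only speaker 2 is talking
    [0, 0],  # silence: the all-zero vector
    [0, 1],
])  # shape (T, N) = (6, 2)
```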
0:04:23 The speech frame features employed as the network input are handled as follows. First, the raw features are transformed by a linear layer and then fed into multiple stacked Transformer encoder layers. The output is then passed through a second linear layer and the sigmoid function, generating the N speakers' posteriors as the system output.
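A minimal PyTorch sketch of this forward pass (the layer sizes and the input dimension are illustrative assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class SAEENDSketch(nn.Module):
    def __init__(self, feat_dim=345, d_model=256, n_heads=4,
                 n_layers=4, n_speakers=2):
        super().__init__()
        self.linear_in = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Note: no positional encoding, as described above.
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.linear_out = nn.Linear(d_model, n_speakers)

    def forward(self, x):                      # x: (B, T, feat_dim)
        h = self.linear_in(x)                  # first linear layer
        h = self.encoder(h)                    # stacked Transformer encoders
        return torch.sigmoid(self.linear_out(h))  # (B, T, N) posteriors
```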
0:04:56 We believe that you are already familiar with the Transformer, so let's skip the introduction here. For more details, please refer to the paper.
0:05:08 The system outputs ŷ and the ground-truth labels y are columnized into the following formats, where ŷ_n and y_n have the same shape of T, containing the posteriors and the labels of speaker n over time, respectively.
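Reconstructed from the spoken description (the slide's exact typesetting is not in the transcript), the columnized format is presumably

```latex
\hat{Y} = [\hat{\mathbf{y}}_1, \dots, \hat{\mathbf{y}}_N],\quad
Y = [\mathbf{y}_1, \dots, \mathbf{y}_N],\quad
\hat{\mathbf{y}}_n \in (0,1)^{T},\;
\mathbf{y}_n \in \{0,1\}^{T},
```

where, unlike the frame-indexed y_t above, the subscript n here indexes speakers, so each column is one speaker's track over time.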
0:05:31 The speaker label ambiguity problem tells us that no matter how we shuffle y by column, it is still a valid label description.
0:05:42 To cope with this problem, the PIT loss considers all possible column-wise permutations of y and computes the binary cross entropy loss between ŷ and each kind of permutation. A permutation is denoted by indexes over (1, 2, ..., N), and the minimal loss is returned for the network backpropagation.
0:06:11 In brief, the PIT loss function can be written as follows.
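The formula itself is shown on the slide; reconstructed from the spoken description and the standard PIT definition, it should read approximately

```latex
\mathcal{L}_{\mathrm{PIT}}
  = \frac{1}{TN}\,
    \min_{\phi \in \mathrm{Perm}(N)}
    \sum_{n=1}^{N}\sum_{t=1}^{T}
    \mathrm{BCE}\!\left(\hat{y}_{t,n},\, y_{t,\phi(n)}\right).
```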
0:06:17 Here φ runs over all possible permutations of (1, ..., N), and the time complexity of the PIT loss computation is O(T × N × N!).
0:06:34 The time cost of the PIT loss becomes expensive as the number of speakers increases.
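A straightforward implementation of the naive PIT loss might look like the following sketch (PyTorch, single sample; the O(T × N × N!) cost comes from recomputing the full BCE for every permutation):

```python
import itertools
import torch
import torch.nn.functional as F

def pit_loss_naive(pred, label):
    """pred: (T, N) posteriors in (0, 1); label: (T, N) binary floats."""
    T, N = pred.shape
    best = None
    for perm in itertools.permutations(range(N)):       # N! permutations
        # Full BCE over all T frames and N speakers, per permutation.
        loss = F.binary_cross_entropy(pred, label[:, list(perm)])
        if best is None or loss < best:
            best = loss
    return best
```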
0:06:41 To deal with that, we come up with the first improvement for the PIT loss: we found redundant computations in the process of the PIT computation. To prove this, we rewrite the equation as follows.
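A plausible reconstruction of that rewriting (again, the slide is not captured in the transcript): pulling the per-speaker terms apart makes the pairwise structure explicit,

```latex
\mathcal{L}_{\mathrm{PIT}}
  = \frac{1}{TN}\,\min_{\phi \in \mathrm{Perm}(N)}
    \sum_{n=1}^{N} \ell\bigl(n, \phi(n)\bigr),
  \qquad
  \ell(n, m) = \sum_{t=1}^{T}\mathrm{BCE}\!\left(\hat{y}_{t,n},\, y_{t,m}\right),
```

so every permutation only combines values of ℓ(n, m).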
0:07:00 Since both n and φ(n) range from 1 to N, only N² distinct pairs actually occur in the computation of the PIT loss function. However, in the permutation process the BCE function is called N × N! times, so each pair is repeatedly computed.
0:07:25 Our proposed idea is simple. We first compute the BCE losses of all N² (ŷ_n, y_m) pairs and store them in a loss matrix. Then, in the permutation process, given a (ŷ_n, y_{φ(n)}) pair, we just index the matrix and return the corresponding amount. The details are shown in Algorithm 1,
0:07:53 and the time complexity becomes O(T × N²) plus O(N × N!). Thanks to the construction of the loss matrix L, we achieve this huge improvement on the calculation.
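A sketch of this precomputation idea in the spirit of Algorithm 1 (not the paper's exact listing; it reuses pit_loss_naive's conventions):

```python
import itertools
import torch
import torch.nn.functional as F

def pit_loss_precomputed(pred, label):
    """PIT with a pairwise loss matrix: O(T * N^2) to fill L, then only
    O(N * N!) table lookups during the permutation search."""
    T, N = pred.shape
    # L[n, m] = mean BCE between output track n and ground-truth track m.
    L = torch.empty(N, N)
    for n in range(N):
        for m in range(N):
            L[n, m] = F.binary_cross_entropy(pred[:, n], label[:, m])
    best = None
    for perm in itertools.permutations(range(N)):
        loss = sum(L[n, m] for n, m in enumerate(perm)) / N  # lookups only
        if best is None or loss < best:
            best = loss
    return best
```

Averaging the N per-track mean BCEs reproduces the overall mean of the naive version, since every track has the same length T.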
0:08:12 However, the computational cost still increases factorially when N is large. To deal with this problem, we must remove the permutation process entirely.
0:08:25 The relationship between ŷ_n and y_{φ(n)} is described by the BCE loss, and in the end each ŷ_n must be assigned one and only one optimal y_{φ(n)}, as shown in Figure 4.
0:08:44 This is a typical job assignment problem. Hence we propose the optimal mapping loss, which employs the Hungarian algorithm to find the best matching between ŷ and y.
0:09:01 The Hungarian algorithm solves the assignment problem and finds the optimal mapping in polynomial time. In our case, the element L_{ij} is defined as the BCE cost of assigning ground-truth speaker j to output posterior i, and the goal is to find the optimal assignment indexes so that the overall cost is the lowest.
0:09:31 The Hungarian algorithm can be described as follows, and its time complexity is O(N³).
0:09:41 Our optimal mapping loss is shown in Algorithm 2. In total, the complexity is O(T × N²) plus O(N³).
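A compact sketch of the optimal mapping loss using SciPy's Hungarian solver, linear_sum_assignment (an illustration of the idea rather than the paper's exact Algorithm 2):

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def optimal_mapping_loss(pred, label):
    """O(T * N^2) for the cost matrix plus O(N^3) for the assignment."""
    T, N = pred.shape
    with torch.no_grad():
        # L[i, j] = BCE cost of assigning ground-truth speaker j to output i.
        L = torch.empty(N, N)
        for i in range(N):
            for j in range(N):
                L[i, j] = F.binary_cross_entropy(pred[:, i], label[:, j])
        # Hungarian algorithm: minimal-cost one-to-one assignment.
        _, cols = linear_sum_assignment(L.numpy())
    # Recompute on the chosen mapping so gradients flow through pred.
    return F.binary_cross_entropy(pred, label[:, list(cols)])
```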
0:10:01 Experimental results.
0:10:04 In the first experiment, we generate fake network outputs ranging from 0 to 1 and binary ground-truth labels at random, with the shape of B × T × N, where the batch size B is set to 128, T is set to 500, and the number of speakers N ranges from 2 to 10.
0:10:31 For each N, we repeat the process of data generation and loss computation using the different loss functions one hundred times. The average time costs are reported in Table 2.
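A rough re-creation of this timing setup (hedged: per-sample timing with the optimal_mapping_loss sketch above; the batch dimension is folded into the repeat loop):

```python
import time
import torch

def time_loss(loss_fn, N, T=500, repeats=100):
    total = 0.0
    for _ in range(repeats):
        pred = torch.rand(T, N)                      # fake posteriors in [0, 1)
        label = torch.randint(0, 2, (T, N)).float()  # random binary labels
        start = time.perf_counter()
        loss_fn(pred, label)
        total += time.perf_counter() - start
    return total / repeats

for N in range(2, 11):
    print(N, time_loss(optimal_mapping_loss, N))
```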
0:10:46 The experiments are carried out on both CPU and GPU platforms.
0:10:52 For each case, we report the time costs over all the speaker numbers.
0:10:57 We see that the optimal mapping loss is the fastest, and when N is larger than 4 its time costs stay relatively stable.
0:11:06 In contrast, the other two loss functions become too slow to use when the number of speakers reaches ten.
0:11:18 In addition, we also repeated the SA-EEND experiments using the different loss functions, to see whether our proposed function is compatible with network training. Results are shown in Table 3. As expected, the model trained with different loss functions results in the same diarization error rate. Therefore, all the loss functions are equally effective.
0:11:48 That is all. Thank you.