0:00:14 Hello, everyone. I am from Duke Kunshan University.
0:00:20 Here I give a brief introduction to our paper,
0:00:24 "DIHARD II is Still Hard: Experimental Results and Discussions from the DKU-LENOVO Team."
0:00:33 In this paper, we present the submitted systems for the Second DIHARD Speech
0:00:38 Diarization Challenge.
0:00:40 The diarization system includes multiple modules,
0:00:43 namely voice activity detection, speaker embedding extraction, similarity measurement, clustering, re-segmentation,
0:00:51 and overlap detection.
0:00:54 For each module, we explore different technologies to enhance the performance.
0:00:59 Our final submission includes the ResNet-based VAD, the Deep ResNet
0:01:06 based speaker embedding,
0:01:08 the LSTM-based similarity scoring, and
0:01:11 spectral clustering.
0:01:13 VB diarization is also applied in the re-segmentation stage,
0:01:18 and overlap detection also brings slight improvement.
0:01:23 Our proposed system achieves the DER of 18.84 percent for Track 1 and 27.90
0:01:29 percent for Track 2.
0:01:34 Although our systems have reduced the DERs by 27.5 percent
0:01:39 and 31.7 percent relatively against the official baseline systems,
0:01:45 we believe that the diarization task is still very difficult.
0:01:51 Metadata analysis.
0:01:53 We carried out a metadata analysis on the development set to show how hard the competition is.
0:02:01 Several indicators are considered:
0:02:04 the duration of the audio,
0:02:07 the number of speakers,
0:02:09 the speech percentage, and the overlap ratio.
0:02:12 The overlap ratio determines the minimum diarization error rate a system is able to
0:02:18 achieve without handling overlapped speech.
0:02:22 It is defined as follows,
0:02:28 based on the speech regions of each speaker i.
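The formula itself was shown on the slide rather than read out; a plausible reconstruction from the description, with S_i denoting the speech regions of speaker i, is:

```latex
\text{overlap ratio} \;=\; \frac{\sum_i |S_i| \;-\; \bigl|\bigcup_i S_i\bigr|}{\bigl|\bigcup_i S_i\bigr|}
```

where |.| denotes total duration, so the ratio is zero when no speech overlaps. Treat this as an assumption, not the paper's exact notation.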
0:02:32 In summary, the competition is hard mainly because: first, the audios are drawn from a
0:02:39 diverse set of challenging domains;
0:02:42 second, the number of speakers varies in a very large range;
0:02:48 and third, high overlap error accounts for part of the DER.
0:02:53 Several corpora are employed in our experiments for training.
0:02:58 Note that VoxCeleb, which combines short utterances of individual speakers, is
0:03:04 suitable for speaker embedding training.
0:03:07 Most of the training audios are drawn from databases in the meeting and telephone domains.
0:03:14 The meeting data consists of
0:03:16 the ICSI, ISL, and NIST corpora,
0:03:22 while the telephone data comes from
0:03:24 multilingual corpora,
0:03:27 including Arabic,
0:03:32 Mandarin, and Spanish.
0:03:35 They are used for training voice activity detection,
0:03:39 similarity measurement, and overlap detection.
0:03:43 MUSAN and the RIRs corpora
0:03:46 are employed for data augmentation.
0:03:50 Voice activity detection.
0:03:53 First is WebRTC, the official baseline VAD for Track 2.
0:03:57 It splits the audio into frames of twenty milliseconds.
0:04:03 For each input frame,
0:04:05 WebRTC generates the speech decision.
0:04:08 An optional setting of WebRTC is the aggressiveness mode.
0:04:12 There is a list of modes here;
0:04:16 three is the most aggressive mode, filtering out almost all non-speech.
0:04:21 We also propose a neural-network-based approach for the VAD task.
0:04:26 The neural network, as shown in Figure 2, consists of a ResNet module,
0:04:32 multiple bidirectional LSTM layers, and linear layers.
0:04:37 Our motivation is that the ResNet module
0:04:40 generates representative feature mappings for speech and non-speech,
0:04:45 and then the bidirectional LSTMs capture the sequential information.
0:04:50 The input is a long sequence of frame-level features.
0:04:55 Each frame in the sequence is fed into the ResNet,
0:04:59 generating multiple-channel
0:05:03 feature mappings.
0:05:05 We then apply average pooling
0:05:07 on each channel and obtain a C-dimensional vector.
0:05:11 Next, the bidirectional LSTM layers capture the forward and backward sequential information.
0:05:19 Outputs from the bidirectional LSTMs
0:05:21 pass to the linear layers,
0:05:23 ended with the sigmoid function,
0:05:26 which generates the speech posteriors.
0:05:31 After voice activity detection,
0:05:33 a sliding window of 1.5 seconds length and 0.75
0:05:39 seconds shift splits speech into short segments.
0:05:44 Speaker embeddings are then extracted to represent the segments.
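The segmentation step described above can be sketched in a few lines. This is a minimal illustration, assuming a 1.5-second window and 0.75-second shift; the helper name and the tail-handling choice are my own, not the paper's.

```python
# Sketch of the sliding-window segmentation: cut a VAD speech region
# into short overlapping segments (window/shift values are assumptions).

def slide_window(start, end, win=1.5, shift=0.75):
    """Split a speech region [start, end) in seconds into overlapping segments."""
    segments = []
    t = start
    while t + win < end:
        segments.append((t, t + win))
        t += shift
    if t < end:
        # keep the shorter tail so no speech is dropped
        segments.append((t, end))
    return segments

# Example: a 4-second speech region yields five overlapping segments.
segs = slide_window(0.0, 4.0)
```

Each segment is then passed to the embedding extractor independently.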
0:05:48 Here we consider three models:
0:05:50 the i-vector extractor, the x-vector extractor, and the Deep ResNet.
0:05:55 For the i-vector extractor, we follow the Kaldi SRE16 V1 recipe
0:06:00 and use the telephone audios for system training.
0:06:06 For the x-vector extractor, we also follow the Kaldi recipe and
0:06:11 use VoxCeleb to train the model.
0:06:14 As for the Deep ResNet,
0:06:16 it consists of three main components:
0:06:20 a ResNet module, a two-dimensional statistics pooling layer,
0:06:25 and a feed-forward network.
0:06:28 The feed-forward network includes two linear layers,
0:06:32 with a dropout of 0.5 in between.
0:06:36 Given a sequence of input features,
0:06:39 the ResNet module first converts them into multiple-channel feature mappings.
0:06:45 Then the statistics pooling layer calculates the mean and standard deviation statistics for
0:06:52 each channel,
0:06:53 generating the utterance-level representation
0:06:56 of 2C dimensions.
0:06:58 Last, the feed-forward network transforms the utterance-level feature representation to speaker posteriors.
0:07:07 The embedding dimension is 128.
0:07:11 Training also adopts spectral augmentation,
0:07:15 and detailed parameters can be viewed in Table 3.
0:07:20 Given the speaker embedding sequence x1, x2, ..., xN,
0:07:25 we compute a similarity score s_ij between any two speaker embeddings x_i and x_j,
0:07:33 and obtain the similarity matrix S of size N times N.
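Building the N-by-N score matrix can be sketched as follows. The talk scores pairs with PLDA or an LSTM; as a simple stand-in for illustration, this sketch uses plain cosine similarity (an assumption, not the submitted system's scorer).

```python
import numpy as np

def similarity_matrix(embeddings):
    """embeddings: (N, D) array of speaker embeddings -> (N, N) score matrix."""
    x = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    x = x / np.clip(norms, 1e-12, None)   # L2-normalize each embedding
    return x @ x.T                        # s_ij = cosine(x_i, x_j)

S = similarity_matrix([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
```

The resulting matrix S is symmetric, with s_ii = 1, and is what the clustering stage consumes.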
0:07:38 The first choice for the similarity measurement is PLDA.
0:07:43 It can be expressed as a log-likelihood ratio between two hypotheses:
0:07:47 one hypothesis assumes that the embeddings x_i and x_j are from different speakers,
0:07:53 while the other one assumes that they are from the same speaker.
0:07:58 The PLDA model is trained and then adapted with the development set.
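The exact equation was shown on the slide rather than read out; a standard log-likelihood-ratio form consistent with the two hypotheses just described is:

```latex
s_{ij} \;=\; \log \frac{p(x_i, x_j \mid \mathcal{H}_1)}
                       {p(x_i \mid \mathcal{H}_0)\, p(x_j \mid \mathcal{H}_0)}
```

where H1 is the same-speaker hypothesis and H0 the different-speaker hypothesis. A large s_ij favors merging the two segments in clustering.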
0:08:05 One shortcoming is that
0:08:07 the PLDA scores speaker embeddings in individual pairs and thus
0:08:13 cannot utilize the sequential information.
0:08:16 Therefore, we propose an LSTM-based scoring model to capture the forward and backward dependencies.
0:08:24 In comparison with PLDA,
0:08:27 scores are calculated between vector and sequence, rather than vector and vector.
0:08:33 Given the speaker embeddings x1, x2, ..., xN,
0:08:37 each embedding x_i can be compared with the whole sequence.
0:08:43 We feed the sequence into an LSTM network
0:08:47 and generate scores between the input and the candidate vectors,
0:08:51 as shown in Equation 7.
0:08:55 The scoring network
0:08:58 includes two bidirectional LSTM layers and two linear layers.
0:09:03 The output layer is one-dimensional, connected with the sigmoid function.
0:09:10 In the clustering stage,
0:09:12 two methods are adopted.
0:09:15 The first one is agglomerative hierarchical clustering,
0:09:20 which performs the merging in a bottom-up manner.
0:09:24 Segments are initialized as individual clusters,
0:09:29 and each time the two clusters with the highest score are merged, until a preset
0:09:35 threshold is reached.
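The merge loop just described can be sketched as below. This is a simplified illustration: it assumes average linkage over the similarity matrix and a fixed stopping threshold, which may differ from the submitted system's exact linkage.

```python
def ahc(S, threshold):
    """Agglomerative clustering sketch. S: N x N similarity matrix (list of lists).
    Returns a list of clusters, each a list of segment indices."""
    clusters = [[i] for i in range(len(S))]
    while len(clusters) > 1:
        best, pair = float("-inf"), None
        # find the pair of clusters with the highest average similarity
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                scores = [S[i][j] for i in clusters[a] for j in clusters[b]]
                score = sum(scores) / len(scores)
                if score > best:
                    best, pair = score, (a, b)
        if best < threshold:          # stop: no pair is similar enough
            break
        a, b = pair
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters
```

Each surviving cluster is then treated as one speaker.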
0:09:37 Another one is spectral clustering,
0:09:41 which works on the score similarity matrix.
0:09:44 Given the similarity matrix S,
0:09:47 we can see it as a weighted graph, where s_ij is the weight of the edge between node i
0:09:52 and node j, i.e., an undirected graph.
0:09:56 By removing weak edges with small weights,
0:09:59 spectral clustering divides the original graph into multiple subgraphs,
0:10:05 where each subgraph
0:10:09 is a cluster.
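The graph intuition above can be made concrete in a simplified form. Note this sketch is not the full spectral algorithm (which works on Laplacian eigenvectors); it only illustrates the "prune weak edges, read off subgraphs" picture, with the cutoff value as an assumption.

```python
def prune_and_cluster(S, cutoff):
    """S: N x N symmetric similarity matrix. Drop edges below `cutoff`,
    then return the connected components as clusters."""
    n = len(S)
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        comp, stack = [], [start]
        seen.add(start)
        while stack:                      # depth-first search over strong edges
            u = stack.pop()
            comp.append(u)
            for v in range(n):
                if v not in seen and S[u][v] >= cutoff:
                    seen.add(v)
                    stack.append(v)
        clusters.append(sorted(comp))
    return clusters
```

Real spectral clustering replaces the hard cutoff with an eigendecomposition, which is more robust when clusters are weakly connected.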
0:10:11 Then, re-segmentation methods are adopted to refine the results at the frame level.
0:10:17 GMM re-segmentation consists of constructing the speaker-specific GMMs
0:10:23 for each speaker according to the clustering results.
0:10:27 Then, for each frame in the audio,
0:10:30 we assign it to the GMM with the highest posterior.
0:10:35 The process iterates until convergence.
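A toy version of that re-segmentation loop, to make the fit/reassign/repeat structure concrete. As a stand-in for the speaker GMMs it fits a single one-dimensional Gaussian per speaker; real systems use multi-component GMMs over acoustic features.

```python
import math

def resegment(frames, labels, n_speakers, max_iter=10):
    """frames: 1-D feature per frame; labels: current speaker per frame.
    Iteratively refit speaker models and reassign frames until stable."""
    for _ in range(max_iter):
        # fit per-speaker mean/variance from the current assignment
        stats = []
        for s in range(n_speakers):
            xs = [f for f, l in zip(frames, labels) if l == s]
            mu = sum(xs) / len(xs)
            var = sum((x - mu) ** 2 for x in xs) / len(xs) + 1e-6
            stats.append((mu, var))

        def loglik(x, mu, var):
            return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

        # reassign each frame to the most likely speaker model
        new = [max(range(n_speakers), key=lambda s: loglik(f, *stats[s]))
               for f in frames]
        if new == labels:   # converged
            break
        labels = new
    return labels
```

The sketch assumes every speaker keeps at least one frame; production code would guard against empty clusters.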
0:10:39 Another method is
0:10:41 VB re-segmentation.
0:10:43 We construct the GMM model
0:10:45 with eigenvoice priors.
0:10:49 In the VB estimation, all the speaker-specific GMMs share the same component weights and covariance matrices,
0:10:59 and the mean vectors are projected from the total variability subspace.
0:11:06 With such priors,
0:11:08 VB diarization improves the re-segmentation performance.
0:11:13 The last module we consider is overlap detection.
0:11:17 The model structure, data, and training configurations are all the same as those
0:11:23 in the ResNet-LSTM voice activity detection system,
0:11:27 except that we change the labels from speech/non-speech to overlap/non-overlap.
0:11:34 For testing, in cases
0:11:36 where a segment is detected as overlapped speech,
0:11:40 we extend its boundaries by twenty frames and take the two nearest speakers appearing in
0:11:46 the extended segment as the labels of the original segment.
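One possible reading of that labeling heuristic, as a sketch. The frame-level data layout, the helper name, and the "walk outward from the segment" tie-breaking are all assumptions for illustration; the paper's exact procedure may differ.

```python
def overlap_labels(frame_speakers, seg_start, seg_end, margin=20):
    """frame_speakers: per-frame single-speaker labels (None = non-speech).
    Extend the overlap segment [seg_start, seg_end) by `margin` frames and
    return the (up to) two speakers nearest to it in the extended range."""
    lo = max(0, seg_start - margin)
    hi = min(len(frame_speakers), seg_end + margin)
    # inspect frames nearest to the segment boundaries first
    order = sorted(range(lo, hi),
                   key=lambda f: min(abs(f - seg_start), abs(f - seg_end)))
    chosen = []
    for f in order:
        spk = frame_speakers[f]
        if spk is not None and spk not in chosen:
            chosen.append(spk)
        if len(chosen) == 2:
            break
    return chosen
```

The overlap segment then receives both returned speakers in the final output.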
0:11:53 Experimental results.
0:11:55 We first evaluate the voice activity detection performance.
0:11:59 We carried out an independent evaluation of the WebRTC and our ResNet-based VAD.
0:12:07 The metric used here is the accuracy rate,
0:12:09 and results are shown in Table 4.
0:12:12 Basically, without model adaptation,
0:12:16 our proposed model is just slightly better than the official baseline.
0:12:20 However, if we fine-tune the model on the development set,
0:12:26 the accuracy is increased to 91.4 percent on the eval set.
0:12:32 Recall that our training data is drawn from the meeting and telephone
0:12:39 domains, while the DIHARD database covers eleven domains.
0:12:43 Domain mismatch hurts the performance,
0:12:46 while model adaptation brings significant improvement.
0:12:52 In Table 5,
0:12:54 we compare different combinations of the speaker embedding,
0:12:58 similarity scoring, and clustering methods on Track 1.
0:13:03 It is observed that the Deep ResNet
0:13:07 outperforms the i-vector extractor in all
0:13:11 combinations.
0:13:13 Also, the LSTM-based scoring followed by spectral clustering achieves
0:13:19 better DER in comparison to the PLDA with AHC.
0:13:24 The best single system is System 6,
0:13:27 which achieves the DER of 20.87 percent.
0:13:32 When we fuse the candidate systems by averaging their score matrices,
0:13:38 the DER is further reduced.
0:13:46 Re-segmentation is carried out on the best single system and the fusion system,
0:13:52 and results are shown in Table 6.
0:13:55 In our expectation,
0:13:57 the VB algorithm should outperform the GMM one,
0:14:02 and the re-segmentation methods
0:14:04 should bring similar improvements for both systems.
0:14:08 However, surprisingly,
0:14:10 for the fusion system, the predictions after re-segmentation do not become more accurate.
0:14:18 The most improvement is seen in System 6 with VB diarization,
0:14:25 which reduces the DER by 1.65 percent absolutely.
0:14:32 The last module in our diarization system is overlap detection.
0:14:37 Since a large amount of overlapped speech is present in the development
0:14:44 set, it is not hard to estimate that there is around ten percent
0:14:48 of overlap error in the eval set.
0:14:52 Experiments are carried out on System 6 with VB diarization,
0:14:57 and results are shown in Table 7.
0:15:00 Handling the overlapped speech only slightly improves the result, by
0:15:04 0.58 percent on Track 1 and 0.69
0:15:10 percent on Track 2.
0:15:12 It is still very challenging, because we recover less than
0:15:17 ten percent of the overlapped speech.
0:15:22 Last, to understand how our system performs in each specific domain,
0:15:28 we calculate the DERs on the development set with System 6,
0:15:32 per domain.
0:15:34 Results are shown in Figure 3.
0:15:38 The system performs worst on these domains:
0:15:42 restaurant,
0:15:43 web video, and meeting.
0:15:46 Speech in these domains is conversational, and the errors are mainly due to high overlap errors.
0:15:53 The child domain
0:15:55 also shows a high DER that is not
0:16:00 explained by overlap errors.
0:16:03 It is probably because the audios are drawn from young children,
0:16:08 who are only six to eighteen months old.
0:16:13 This is a mismatch
0:16:14 in comparison with the speakers in our training databases.
0:16:18 As a result,
0:16:21 System 6 performs poorly on these challenging domains.
0:16:28 Thank you for watching.