Hello everyone. We are from Duke Kunshan University.

Here I will give a brief introduction to our paper,

DIHARD II is still hard: experimental results and discussions from the DKU-LENOVO team.

In this paper, we present the submitted systems for the second DIHARD speech diarization challenge.

Our diarization system includes multiple modules, namely voice activity detection, speaker embedding extraction, similarity measurement, clustering, resegmentation and overlap detection.

For each module, we explore different techniques to enhance the performance.

Our final submission includes the LSTM-based VAD, the ResNet-based speaker embedding, the LSTM-based similarity scoring and spectral clustering.

VB diarization is also applied in the resegmentation stage.

And overlap detection also brings some improvement.

Our proposed system achieves 18.84% DER for track 1 and 27.90% DER for track 2.

Both systems reduce the DERs by 27.5% and 31.7% relative, compared with the official baselines.

Still, we believe that the diarization task is far from being solved.

Metadata analysis.

We carried out a metadata analysis on the development set to show how hard the competition is. Several indicators are considered:

the duration of the audios, the number of speakers, the speech percentage, and the overlap ratio.

The overlap ratio determines the minimum diarization error rate a system is able to achieve without handling overlapped speech. It is defined as follows,


where S_i denotes the speech regions of speaker i.
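The equation itself was only on the slide; as a sketch of the definition, assuming the overlap ratio measures the fraction of total speaker time that falls in overlapped regions, with S_i the speech regions of speaker i:

\[
\text{overlap ratio} = \frac{\sum_i |S_i| \;-\; \left|\bigcup_i S_i\right|}{\sum_i |S_i|}
\]

The numerator counts speaker time beyond what a single-speaker-per-frame output could cover, so a system that never emits overlapping labels misses at least this fraction.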

In summary, the competition is hard for several reasons: first, the audios are drawn from a diverse set of challenging domains; second, the number of speakers varies in a very large range; third, the high overlap ratio contributes to the DER.

Here are the datasets employed in our experiments for training. Note that we combine short utterances of the same speakers to make them suitable for speaker embedding training.

Most of the audios are drawn from databases in the meeting and telephone domains. The meeting data consists of the ICSI, ISL and NIST corpora, among others.

The telephone data covers multilingual corpora, including Arabic, Mandarin and Spanish.

They are used for training the voice activity detection, similarity measurement and overlap detection models.

MUSAN and the RIR corpora are employed for data augmentation.

Voice activity detection.

The WebRTC VAD is the official baseline system for track 2. It first splits the audio into frames of twenty milliseconds.

For each input frame, it generates a speech or non-speech decision.

An optional setting of the WebRTC VAD is the aggressiveness mode; there is a list of modes, where 3 is the most aggressive, filtering out almost all non-speech.

We also propose an LSTM-based approach for the VAD task. The neural network, as shown in figure 2, consists of a ResNet module, multiple bidirectional LSTM layers and linear layers.

Our motivation is that the ResNet module generates representative feature mappings for speech and non-speech, and then the bidirectional LSTMs capture the sequential information.

The input is a long sequence of frame-level features. Each frame in the sequence is fed into the ResNet, generating multi-channel feature mappings. We then apply global average pooling on each channel and get a C-dimensional vector.

Next, the bidirectional LSTM layers capture the forward and backward sequential information. Outputs from the bidirectional LSTMs are passed to the linear layers, and the output, with the sigmoid function, generates the speech posteriors.

After voice activity detection, a sliding window with 1.5 seconds length and 0.5 seconds shift splits the speech into short segments. The speaker embeddings are then extracted from the segments.
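As a rough illustration of this segmentation step (a sketch under the window and shift values quoted above, not the actual challenge code):

```python
def sliding_window_segments(speech_regions, win=1.5, shift=0.5):
    """Split VAD speech regions into fixed-length overlapping segments.

    speech_regions: list of (start, end) times in seconds from the VAD.
    Returns (start, end) segments of at most `win` seconds, advancing
    by `shift` seconds; the last window is clipped to the region end.
    """
    segments = []
    for start, end in speech_regions:
        t = start
        while t < end:
            segments.append((t, min(t + win, end)))
            if t + win >= end:
                break
            t += shift
    return segments
```

Each segment then gets one speaker embedding, so a 0.5 s shift gives roughly two embeddings per second of speech.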

Here we consider three models: the i-vector, the x-vector and the ResNet x-vector.

For the i-vector extractor, we follow the Kaldi callhome_diarization v1 recipe, where the telephone audios are used for system training. For the x-vector, we also follow the Kaldi recipe to train the model.

As for the ResNet x-vector, it consists of three main components: a ResNet, a two-dimensional statistics pooling layer, and a feed-forward network. The feed-forward network includes two linear layers, with a dropout of 0.5 in between.

Given a sequence of input features, the ResNet first converts them into multi-channel feature mappings. Then the statistics pooling layer calculates the mean and standard deviation statistics for each channel, generating an utterance-level representation of 2C dimensions. Last, the feed-forward network transforms the utterance-level representation to speaker posteriors. The embedding dimension is 128.

The training also employs spectral augmentation, and detailed parameters can be viewed in table 3.
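The statistics pooling step can be sketched in a few lines of numpy; the (C, F, T) feature-map shape and the epsilon for numerical stability are assumptions for illustration, not details from the paper:

```python
import numpy as np

def stats_pooling(feature_maps, eps=1e-8):
    """Two-dimensional statistics pooling over a (C, F, T) feature map:
    per-channel mean and standard deviation over the time-frequency
    plane, concatenated into a 2C-dimensional utterance-level vector."""
    c = feature_maps.shape[0]
    flat = feature_maps.reshape(c, -1)      # (C, F*T)
    mean = flat.mean(axis=1)                # (C,)
    std = np.sqrt(flat.var(axis=1) + eps)   # (C,)
    return np.concatenate([mean, std])      # (2C,)
```

The 2C-dimensional output is what the feed-forward network above then maps to speaker posteriors.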

Given the speaker embedding sequence x1, x2, ..., xN, we compute the similarity score s_ij between any two speaker embeddings x_i and x_j, and obtain the similarity matrix S of size N by N.

The first method for the similarity measurement is PLDA. It can be expressed as follows.
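The slide equation is not in the audio; in its standard log-likelihood-ratio form, the PLDA score between embeddings x_i and x_j would read:

\[
s_{ij} = \log \frac{p(x_i, x_j \mid \mathcal{H}_1)}{p(x_i \mid \mathcal{H}_0)\, p(x_j \mid \mathcal{H}_0)}
\]

with H1 the same-speaker hypothesis and H0 the different-speaker hypothesis, matching the description below.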

Hypothesis H0 assumes that the embeddings x_i and x_j are from different speakers, while H1 assumes that they are from the same speaker. The PLDA model is trained on our training data and adapted with the development set.

However, PLDA scores the speaker embeddings in unordered pairs, which ignores the sequential information. Therefore, we propose an LSTM-based scoring model to capture the forward and backward sequential information.

In comparison with PLDA, scores are calculated between a vector and a sequence, rather than between vector pairs. Given the speaker embeddings x1, x2, ..., xN, each embedding x_i can be compared with the whole sequence. We feed this sequence into an LSTM network and generate scores for the input candidate vectors, as shown in equation 7.

The network includes two bidirectional LSTM layers and two linear layers. The output layer is one-dimensional, connected with the sigmoid function.

In the clustering stage, two methods are explored. The first method is agglomerative hierarchical clustering, which merges iteratively based on the pairwise similarities. Segments are initialized as individual clusters, and each time the two clusters with the highest score are merged, until a stopping threshold is met.
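A minimal numpy sketch of this merging loop (average-linkage scoring and the stopping threshold are illustrative assumptions; the system's actual linkage may differ):

```python
import numpy as np

def ahc(sim, threshold):
    """Agglomerative hierarchical clustering on a similarity matrix.

    Each segment starts as its own cluster; the pair of clusters with
    the highest average cross-similarity is merged, until that best
    score falls below `threshold`.  Returns a label per segment.
    """
    n = sim.shape[0]
    clusters = [[i] for i in range(n)]

    def score(a, b):
        # average similarity between all cross-cluster segment pairs
        return np.mean([sim[i, j] for i in a for j in b])

    while len(clusters) > 1:
        best, pair = -np.inf, None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                s = score(clusters[p], clusters[q])
                if s > best:
                    best, pair = s, (p, q)
        if best < threshold:
            break  # stopping threshold met
        p, q = pair
        clusters[p] += clusters.pop(q)

    labels = np.empty(n, dtype=int)
    for k, c in enumerate(clusters):
        labels[c] = k
    return labels
```

The threshold plays the role of the stopping criterion mentioned above and in practice is tuned on the development set.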

The other method is spectral clustering, which is a graph-based algorithm. Given the similarity matrix S, we can see s_ij as the weight of the edge between node i and node j in an undirected graph. By removing weak edges with small weights, spectral clustering divides the original graph into multiple subgraphs, where each subgraph is a cluster.
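A small sketch of this graph view, assuming unnormalized-Laplacian spectral clustering with a hypothetical pruning threshold and a tiny built-in k-means (the system's actual variant may differ):

```python
import numpy as np

def spectral_cluster(sim, n_clusters, prune=0.3):
    """Spectral clustering on a similarity matrix: prune weak edges,
    eigendecompose the graph Laplacian, and group the rows of the
    bottom eigenvectors with a simple k-means step."""
    w = np.where(sim >= prune, sim, 0.0)   # remove weak edges
    np.fill_diagonal(w, 0.0)
    laplacian = np.diag(w.sum(axis=1)) - w
    # eigenvectors of the smallest eigenvalues span cluster indicators
    vals, vecs = np.linalg.eigh(laplacian)
    emb = vecs[:, :n_clusters]             # (N, n_clusters)
    # farthest-point initialization, then a few k-means iterations
    centers = [emb[0]]
    for _ in range(1, n_clusters):
        d = np.min([((emb - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(emb[int(np.argmax(d))])
    centers = np.array(centers)
    for _ in range(20):
        labels = np.argmin(
            ((emb[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2),
            axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centers[k] = emb[labels == k].mean(axis=0)
    return labels
```

After pruning, well-separated subgraphs show up as near-zero eigenvalues of the Laplacian, and their indicator-like eigenvectors make the clusters easy to separate.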

The resegmentation stage is then applied to refine the frame-wise alignments.

GMM resegmentation starts with constructing the speaker-specific GMMs for each speaker according to the clustering results. Then, for each frame in the audio, we assign it to the GMM with the highest posterior. The process iterates until convergence.
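The per-frame assignment step can be sketched as follows; the diagonal-covariance GMM representation and equal speaker priors are illustrative assumptions, not details from the paper:

```python
import numpy as np

def reassign_frames(features, gmms):
    """One pass of the re-segmentation step: each frame goes to the
    speaker GMM with the highest log-likelihood (equal speaker priors,
    so the likelihood ranking matches the posterior ranking).
    `gmms` maps speaker -> (weights, means, variances) of a
    diagonal-covariance GMM with shapes (K,), (K, D), (K, D)."""
    def gmm_loglik(x, weights, means, variances):
        # log sum_k w_k * N(x; mu_k, diag(var_k))
        diff = x[None, :] - means                          # (K, D)
        ll = (-0.5 * (diff ** 2 / variances
                      + np.log(2 * np.pi * variances)).sum(axis=1)
              + np.log(weights))
        m = ll.max()                                       # log-sum-exp
        return m + np.log(np.exp(ll - m).sum())

    speakers = list(gmms)
    scores = np.array([[gmm_loglik(f, *gmms[s]) for s in speakers]
                       for f in features])
    return [speakers[i] for i in scores.argmax(axis=1)]
```

In the full procedure this assignment alternates with re-estimating the speaker GMMs until the labels stop changing.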

The other method is VB resegmentation. We start with constructing a GMM model with eigenvoice priors. In contrast to GMM resegmentation, all speaker-specific GMMs share the same component weights and covariance matrices, and the mean vectors are projected from the total variability subspace. VB diarization improves the resegmentation performance.

The last module we consider is overlap detection. The model structure, data and training configuration are all the same as those in the LSTM-based voice activity detection system, except that we change the labels from speech/non-speech to overlap/non-overlap.

For testing, each detected segment is referred to as overlapped speech. We extend its boundaries by twenty frames, and the speakers appearing in the extended segment are taken as the labels of the original segment.

Experimental results.

We first directly evaluate the voice activity detection performance, with an independent evaluation of the proposed LSTM-based VAD. The metric used here is the frame-level accuracy, and results are shown in table 4.

Without model adaptation, our proposed model is just slightly better than the official baseline. However, if we fine-tune the model on the development set, the accuracy is increased to 91.4% on the eval set.

As we can see, our training data is drawn from the meeting and telephone domains, while the DIHARD database covers eleven domains. The domain mismatch hurts the performance, while model adaptation brings significant improvement.

In table 5, we compare different combinations of the speaker embedding, similarity scoring and clustering methods on track 1. It is observed that the ResNet x-vector outperforms the i-vector and x-vector in most combinations. Systems with LSTM-based scoring followed by spectral clustering achieve better DERs in comparison with those using PLDA and AHC.
The best single system is system 6, which achieves a DER of 20.87%. When we fuse the best-performing systems by averaging their score matrices, the DER further reduces. VB diarization is then carried out on the best single system and the fusion system, and results are shown in table 6.

In our expectation, the VB algorithm should outperform the GMM resegmentation, and the resegmentation models should bring similar improvements for both systems. However, for the fusion system, the predictions after resegmentation do not become more accurate. The most obvious improvement is seen on system 6 with VB diarization, which reduces the DER by 1.65% absolute.

The last module in our diarization system is overlap detection. As the overlap statistics on the development set suggest, we assume that around ten percent of the speech time in the eval set is overlapped as well.

Experiments are carried out on system 6 with VB diarization, and results are shown in table 7.

Assigning the overlapped speech only slightly improves the performance: 0.38% on track 1 and 0.69% on track 2. It is very challenging, because we recall less than ten percent of the overlapped speech.

Last, to understand how our system performs in each specific domain, we group the DERs on the development set for system 6 by domain. Results are shown in figure 3.

The system performs worst on the restaurant, web video, meeting and child domains, mostly due to high overlap errors.

Besides the high overlap error rate, the child domain also shows a high DER.

It is probably because the audios are drawn from young children, who are only six to eighteen months old. This is a mismatch compared with the speakers in our training databases.

As a result, system 6 performs poorly in this challenging domain.

Thank you for watching.