Speech Transcript - Online Diarization of Telephone Conversations

0:00:06	yeah we'll come back to the session
0:00:08	so now let's
0:00:10	and not only from my seat deployable on recent
0:00:13	so now we change the topic today to speaker diarization
0:00:17	and uh
0:00:18	not in speaker diarization
0:00:20	uh
0:00:20	one of the important things to guess the number
0:00:23	right
0:00:25	let me give you that
0:00:26	we have a speaker
0:00:27	and each one
0:00:29	you need it
0:00:30	do the segmentation
0:00:33	uh
0:00:35	and uh so we have four papers and the first paper is on
0:00:39	online diarization of telephone conversations
0:00:43	presented by Oshry Ben-Harush
0:00:45	please
0:00:46	i know
0:00:47	giving him
0:00:49	the topic of the presentation is online diarization of telephone conversations
0:00:54	i will begin by presenting the speaker diarization problem
0:00:59	afterwards i will talk about online speaker diarization
0:01:04	and give a short overview of current speaker diarization systems
0:01:10	i will then present the suggested online speaker diarization system
0:01:14	including its description, diarization, time complexity and performance
0:01:19	and i will conclude, of course, with the conclusions
0:01:23	the task of a speaker diarization system is to assign temporal segments of speech
0:01:28	among the participants in a conversation
0:01:32	speaker diarization basically attempts to segment and cluster the conversation
0:01:39	such that, as we see here, on the left is a manual diarization of a conversation done by a human listener
0:01:46	and on the right the automatic diarization produced by the suggested speaker diarization system
0:01:56	most state of the art speaker diarization systems operate in an offline manner
0:02:01	that is, conversation samples are gathered until the conversation ends
0:02:08	followed by an application of the diarization system
0:02:11	however, for some applications such as forensics or speech recognition
0:02:17	online diarization could be beneficial
0:02:19	that is, if we want to apply some automatic speaker recognition system
0:02:24	we would be able to see the diarization of the conversation up to the point
0:02:30	where we want to apply it
0:02:33	online or semi-online diarization can be achieved by removing or minimising the size of the buffer
0:02:42	however this incurs some difficulty for the system because the amount of data is reduced
0:02:53	most of the offline diarization systems operate in a two stage process
0:02:57	first, the audio stream is over-segmented by some change detection algorithm
0:03:05	and then an agglomerative hierarchical clustering algorithm is applied
0:03:11	in which segments are merged until some termination conditions are met
0:03:16	generally the final number of speakers in the conversation
0:03:23	some recent approaches in offline diarization systems
0:03:26	include gmm-ubm speaker modelling
0:03:30	speaker identification clustering
0:03:32	and the fusion of several systems with several feature sets
0:03:37	in order to apply diarization
0:03:41	online speaker diarization systems encountered in the literature
0:03:45	include online gmm learning
0:03:48	some novelty detection algorithms applied to detecting when a new speaker appears in a conversation
0:03:56	and a gmm-ubm based scheme
0:04:02	most of the state of the art diarization systems, online and offline
0:04:07	encountered in the literature require some offline-trained background, channel or gender models
0:04:14	in order to apply the diarization algorithms
0:04:18	some require several sets of features
0:04:21	and practically all require a large amount of computation power
0:04:30	the suggested online diarization system operates in a two stage process
0:04:35	first, an unsupervised algorithm is applied over an initial training segment of the conversation
0:04:43	followed by the use of the models generated in the first stage
0:04:48	in order to apply segmentation of the conversation on demand
0:04:55	that is, the sound samples are entered into the preprocessing stage
0:05:03	feature extraction
0:05:05	and into the buffer, which incorporates the initial training segment
0:05:10	diarization is applied only on the initial training segment
0:05:13	and models are generated from the initial training segment
0:05:19	once the models are available we could perform segmentation of the conversation
0:05:25	based on these initial models
0:05:29	however, a major assumption is that
0:05:32	all of the speakers in the conversation must participate in this initial training segment
0:05:37	or else a model for these speakers will not be available
0:05:42	for the rest of the segmentation process
0:05:46	the first stage is the diarization of the telephone conversation over the initial training segment
0:05:55	in which the samples in the initial training segment are preprocessed
0:05:59	feature extraction is applied on the initial training segment
0:06:03	followed by an initial assignment algorithm
0:06:06	that is, in a conversation, and let's assume a telephone conversation, once we have
0:06:11	successfully identified the non speech
0:06:13	we still have two speakers to assign features to
0:06:18	that is, we can identify the speech
0:06:21	however we must apply some kind of algorithm to assign features to either of the speakers
0:06:28	once features are assigned to each of the speakers
0:06:30	an iterative process of modelling and time series clustering
0:06:35	is applied until termination conditions are met
0:06:39	once termination conditions are met we can provide the segmentation
0:06:44	modelling in this work is done by som and the time series clustering is done by
0:06:51	some variant of the hidden markov model
0:06:56	when we apply diarization over short segments of speech
0:07:00	two main issues arise
0:07:02	one is that a low model complexity is required
0:07:06	because of the sparse amount of data
0:07:09	and another problem is the clustering constraints, that is, we would not like the
0:07:15	segmentation to skip between speakers, we would like to
0:07:20	employ physical constraints on the time of speech for each speaker
0:07:26	the first problem is tackled by replacing the common gmm models
0:07:30	by a self organising map
0:07:33	that is, we train a self organising map for each of the speakers
0:07:37	the self organising map was presented by Kohonen
0:07:42	and is composed of three main stages: the first is initialisation
0:07:48	the second is rough training
0:07:51	and finally a fine tuning of the neurons, or the centroids
0:07:57	into the distribution of the points
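
The three training stages mentioned in the talk (initialisation, rough training, fine tuning) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the talk: the function name `train_som`, the one-dimensional map topology, the learning rates, the neighbourhood radius and the epoch counts.

```python
import random

def train_som(features, n_neurons=6, rough_epochs=5, fine_epochs=20, seed=0):
    """Train a tiny 1-D self-organising map and return its centroids.

    All hyper-parameters are illustrative assumptions, not values from the talk.
    """
    rng = random.Random(seed)
    dim = len(features[0])
    # Stage 1: initialisation, here by copying randomly chosen feature vectors.
    neurons = [list(rng.choice(features)) for _ in range(n_neurons)]

    def run(epochs, rate, radius):
        for _ in range(epochs):
            for x in rng.sample(features, len(features)):
                # Best-matching unit: the neuron closest to the sample.
                bmu = min(
                    range(n_neurons),
                    key=lambda k: sum((x[d] - neurons[k][d]) ** 2 for d in range(dim)),
                )
                # Move the winner and its map neighbours towards the sample.
                for k in range(n_neurons):
                    if abs(k - bmu) <= radius:
                        for d in range(dim):
                            neurons[k][d] += rate * (x[d] - neurons[k][d])

    # Stage 2: rough training, large learning rate and wide neighbourhood.
    run(rough_epochs, rate=0.5, radius=2)
    # Stage 3: fine tuning, small learning rate, only the winner moves.
    run(fine_epochs, rate=0.05, radius=0)
    return neurons
```

In a setting like the one described, one such map would be trained per speaker on the feature vectors currently assigned to that speaker.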
0:08:04	once we have trained the model for each of the speakers in the conversation
0:08:09	we would require some means to estimate the likelihood
0:08:15	given a new feature
0:08:17	we would like to estimate the likelihood of the feature observation given the model
0:08:25	under the assumption of normality, that is, each centroid in the self organising map
0:08:31	is the mean of a normal probability with a unit covariance matrix
0:08:38	we could apply the following equation in order to estimate
0:08:42	the log likelihood, or the minus log likelihood, of the observation
0:08:51	note that we estimate the log likelihood only with a single neuron
0:08:55	because generally it will contain most of the information regarding the closest observation point
0:09:07	the joint log likelihood of a set of features
0:09:11	could be estimated by the sum of the log likelihoods of the single features
0:09:15	given that they are independent
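
The equation shown on the slide is not captured in the transcript. The sketch below assumes the standard Gaussian form with unit covariance centred on the nearest neuron, which matches the verbal description; the function names are hypothetical.

```python
import math

def neg_log_likelihood(x, neurons):
    """Minus log likelihood of one feature vector under a SOM.

    Assumes a unit-covariance Gaussian centred on the nearest neuron only,
    as described verbally in the talk; the exact slide equation is unknown.
    """
    d = len(x)
    nearest_sq = min(
        sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in neurons
    )
    return 0.5 * nearest_sq + 0.5 * d * math.log(2.0 * math.pi)

def joint_neg_log_likelihood(features, neurons):
    """Sum of per-feature values, valid if the features are independent."""
    return sum(neg_log_likelihood(x, neurons) for x in features)
```

Because the constant term is the same for every speaker model, only the squared distance to the nearest centroid matters when comparing speakers.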
0:09:21	clustering constraints are enabled using
0:09:25	a hidden markov model, or rather a minimum duration hidden markov model
0:09:30	in this model each state is modelled using a hyper state, that is
0:09:35	in each hyper state we enforce a minimum duration before transitions from
0:09:40	one state to another state
0:09:43	and in this manner we could use the
0:09:46	hidden markov model in order to enforce the minimum duration time
0:09:50	for each of the speakers
0:09:52	each state in the minimum duration hidden markov model is a left to right
0:09:56	hyper state in which the som is used
0:09:59	to estimate the log likelihood, or the emission probability
0:10:03	for each of the observations
0:10:09	in the transition matrix of the hidden markov model, the elements on the diagonal
0:10:17	are the hyper state transition matrices
0:10:20	and the elements off the diagonal are the transitions between hyper states
0:10:25	and this matrix is updated as part of the training process
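
One common way to build such a block-structured transition matrix is sketched below. It is an assumption about the construction, not the speaker's code: each hyper state is a left-to-right chain of `min_dur` sub-states, and only the last sub-state may stay or jump to the first sub-state of another hyper state. The name `min_duration_transitions` and the `stay` probability are hypothetical, and the sketch expects at least two hyper states.

```python
def min_duration_transitions(n_hyper, min_dur, stay=0.9):
    """Build a (n_hyper * min_dur)-square transition matrix for n_hyper >= 2.

    Diagonal blocks are left-to-right chains; off-diagonal blocks hold the
    only permitted jumps between hyper states. Values are illustrative.
    """
    q = n_hyper * min_dur
    a = [[0.0] * q for _ in range(q)]
    for h in range(n_hyper):
        first = h * min_dur
        last = first + min_dur - 1
        # Inside a hyper state every sub-state must advance to the next one,
        # which is what enforces the minimum duration.
        for s in range(first, last):
            a[s][s + 1] = 1.0
        # The last sub-state may stay, or leave to the start of another hyper state.
        a[last][last] = stay
        for other in range(n_hyper):
            if other != h:
                a[last][other * min_dur] = (1.0 - stay) / (n_hyper - 1)
    return a
```

With `min_dur` sub-states per hyper state, a speaker turn cannot be shorter than `min_dur` feature frames, which prevents the segmentation from skipping rapidly between speakers.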
0:10:33	segmentation
0:10:35	once we have the models for each of the speakers in the hmm, segmentation is applied
0:10:40	using the viterbi time series clustering algorithm
0:10:45	that is, samples of the sound wave are entered into a buffer
0:10:51	diarization is applied on the initial training segment
0:10:55	and a hidden markov model is generated by the diarization system
0:11:00	once we have this hidden markov model, segmentation is applied almost
0:11:04	instantaneously on demand
0:11:11	the viterbi algorithm
0:11:12	computation complexity is in the order of Q squared T, where Q is the number of states in the
0:11:19	hmm and T is the number of features
0:11:21	in the conversation
0:11:24	note that
0:11:25	initialisation and recursion of the viterbi algorithm could be applied online, that is
0:11:31	f one, which is the first feature
0:11:35	is used to initialise the viterbi algorithm
0:11:39	followed by f two and onward in the recursion process
0:11:44	once segmentation is demanded
0:11:48	termination and backtracking could be applied online
0:11:52	and that is almost instantaneous
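
The split between per-feature work (initialisation and recursion) and on-demand work (termination and backtracking) can be sketched as follows. The class name `OnlineViterbi` and its interface are hypothetical; the per-state emission log likelihoods are assumed to be supplied by the caller, for example from the speaker models.

```python
class OnlineViterbi:
    """Viterbi decoding where each feature is consumed as it arrives."""

    def __init__(self, log_init, log_trans):
        self.log_init = log_init      # log initial state probabilities
        self.log_trans = log_trans    # log_trans[i][j] = log P(j | i)
        self.delta = None             # best log score per state so far
        self.backptr = []             # one list of back-pointers per frame

    def push(self, log_emit):
        """Consume one feature's emission log likelihoods: O(Q^2) work."""
        q = len(self.log_init)
        if self.delta is None:
            # Initialisation with the first feature.
            self.delta = [self.log_init[j] + log_emit[j] for j in range(q)]
            return
        # Recursion for every later feature.
        new_delta, ptr = [], []
        for j in range(q):
            best = max(range(q), key=lambda i: self.delta[i] + self.log_trans[i][j])
            ptr.append(best)
            new_delta.append(self.delta[best] + self.log_trans[best][j] + log_emit[j])
        self.delta = new_delta
        self.backptr.append(ptr)

    def segment(self):
        """Termination and backtracking, run only when a result is requested."""
        state = max(range(len(self.delta)), key=lambda j: self.delta[j])
        path = [state]
        for ptr in reversed(self.backptr):
            state = ptr[state]
            path.append(state)
        path.reverse()
        return path
```

Calling `push` once per feature spreads the Q squared T cost over the conversation, so `segment` only has to walk the stored back-pointers, which is why a result can be returned quickly at any point.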
0:11:58	a graph stating the time required
0:12:01	to generate the segmentation of a conversation as a function of the conversation length
0:12:05	is given here
0:12:07	and it shows that for a four hundred second conversation, for example
0:12:12	only one millisecond of computer time is required
0:12:17	and in the current implementation of the diarization system
0:12:20	one second of processing time is equivalent to
0:12:23	seventy three seconds of audio
0:12:27	during the first stage of diarization
0:12:31	for experimentation, the database used was
0:12:34	two thousand forty eight conversations from the nist two thousand and five speaker recognition evaluation
0:12:40	the recordings are two speaker conversations in four wire, which were summed
0:12:45	and normalised in order to generate two speaker conversations
0:12:50	the features extracted were
0:12:52	twelve mfcc features, and twelve mfcc including delta features
0:12:59	the entire database was first
0:13:01	processed by the diarization system using all of the data available
0:13:05	to produce
0:13:06	twenty percent diarization error rate and six point nine percent
0:13:10	speaker error rate
0:13:15	diarization error rate
0:13:17	the way we measured it was to include
0:13:20	all of the errors available, that is
0:13:22	speaker confusion and also errors between
0:13:26	speech and nonspeech
0:13:28	also overlapped speech, which are segments of
0:13:32	speakers speaking together
0:13:34	was also considered as an error
0:13:36	in the speaker error rate
0:13:39	we actually eliminated
0:13:41	the nonspeech in both of the segmentations
0:13:44	in order to generate only the speaker confusion
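
A frame-level sketch of the two measures as they are described verbally. It is a simplification under stated assumptions and not the official scoring procedure: labels are given per frame, both rates are normalised by the number of frames considered, and the label names (`"A"`, `"B"`, `"1"`, `"2"`, `"NS"`, `"OV"`) and the function `error_rates` are hypothetical.

```python
from itertools import permutations

def error_rates(ref, hyp, ref_speakers=("A", "B"), hyp_speakers=("1", "2"),
                nonspeech="NS"):
    """Return (diarization error rate, speaker error rate) per frame.

    The reference may carry an overlap label such as "OV"; the hypothesis can
    never match it, so overlapped frames always count as errors.
    """
    best_der = best_spk = None
    # Hypothesis speaker names are arbitrary, so try every mapping.
    for perm in permutations(ref_speakers):
        mapping = dict(zip(hyp_speakers, perm))
        mapping[nonspeech] = nonspeech
        mapped = [mapping[h] for h in hyp]
        # Diarization error: every mismatching frame counts.
        der = sum(r != m for r, m in zip(ref, mapped)) / len(ref)
        # Speaker error: drop frames marked nonspeech in either segmentation.
        speech = [(r, m) for r, m in zip(ref, mapped)
                  if r != nonspeech and m != nonspeech]
        spk = sum(r != m for r, m in speech) / len(speech) if speech else 0.0
        if best_der is None or der < best_der:
            best_der, best_spk = der, spk
    return best_der, best_spk
```

Standard evaluations typically score time intervals rather than frames and normalise by reference speech time, so numbers from this sketch are only indicative.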
0:13:49	the diarization error rate as a function of the initial segment length
0:13:53	is shown to approach
0:13:56	the optimal performance obtained by applying the diarization system over the entire segment
0:14:04	as we can see
0:14:05	for, say, one twenty seconds, or two minutes, of initial training segment we achieve twenty four
0:14:15	percent diarization error rate
0:14:17	and this behaviour
0:14:22	is also seen in the speaker error rate
0:14:30	it seems that given
0:14:32	two minutes of initial training segment the diarization error rate is
0:14:35	sufficiently close
0:14:36	to the diarization error rate obtained by applying
0:14:40	the diarization over the entire conversation
0:14:43	and using one twenty
0:14:44	seconds of the initial training segment
0:14:47	we could obtain
0:14:49	twenty four percent diarization error rate
0:14:51	and ten point six percent
0:14:54	speaker error rate
0:14:56	well using one and i think
0:14:58	seconds of initial training segment
0:15:00	provide twenty two point three diarization error rate
0:15:02	and about ten percent speaker error rate
0:15:05	the delta features
0:15:07	did not
0:15:08	provide an improved performance
0:15:14	to conclude
0:15:14	a semi-online speaker diarization system
0:15:17	was presented
0:15:19	and it was shown that using as few as
0:15:21	one hundred twenty seconds of conversation we could
0:15:25	provide
0:15:26	segmentation of the conversation
0:15:28	with an increase of
0:15:29	four percent
0:15:30	when compared to the diarization error rate obtained by applying the diarization system
0:15:35	over the entire conversation
0:15:37	for the
0:15:38	purpose of robustness and simplicity
0:15:40	gmm models were replaced by a self organising map
0:15:46	and
0:15:48	we assume no prior information regarding the speakers or the conversation, that is, we use
0:15:53	no background models of any kind
0:15:56	in order to apply
0:15:58	diarization
0:15:59	and no parameters are required
0:16:01	to be trained offline
0:16:03	in order to apply diarization
0:16:07	thank you
0:16:14	take some questions
0:16:37	oh
0:16:44	no
0:16:45	uh
0:16:46	well as opposed to some initialisation
0:16:48	maybe i missed it, what is the length of the segment
0:16:52	that you feed into the som
0:16:55	okay, that's fine
0:16:56	we've done this
0:16:57	experiment using a variable length of initial training segment, that is
0:17:01	assuming you have
0:17:03	one hundred and twenty seconds of initial training segment
0:17:06	some of which belongs to speaker A, some of which belongs to speaker B
0:17:10	and some belongs to non speech
0:17:12	the exact amount of features
0:17:15	belonging to each of the speakers was not measured because it's a
0:17:19	function of the initialisation algorithm
0:17:22	okay
0:17:22	but um i i mean
0:17:24	what you know
0:17:26	do also
0:17:27	self organising map
0:17:29	is using the short segments
0:17:31	from this initialisation
0:17:32	yeah
0:17:33	and do you have a fixed
0:17:34	flanks
0:17:36	for the for the segments or is it
0:17:38	so the uh
0:17:40	segmented okay
0:17:41	okay
0:17:49	here
0:17:51	okay
0:17:53	the initial training segment
0:17:55	diarization is actually applied on the initial training segment
0:17:59	that is, first
0:18:00	speech or nonspeech is detected
0:18:03	nonspeech is removed, and then the segments belonging to
0:18:06	speech are
0:18:07	distributed among the two speakers
0:18:10	in the conversation
0:18:11	the distribution of the features to each of the speakers is a function of the initialisation algorithm
0:18:17	which is a variant of the k-means
0:18:20	clustering algorithm
0:18:23	so
0:18:24	the exact amount of features assigned to each of the
0:18:27	speakers
0:18:28	was not measured
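
The talk only says the initial assignment is a variant of k-means and does not specify which. As a stand-in, the sketch below uses plain two-cluster k-means on the speech features; the seeding with the first and last feature vectors, the iteration count and the name `kmeans_two_speakers` are all illustrative assumptions.

```python
def kmeans_two_speakers(features, iterations=10):
    """Split speech feature vectors into two clusters with plain k-means.

    Plain k-means is a stand-in for the unspecified variant used in the talk.
    Returns (labels, centroids).
    """
    # Hypothetical seeding: the first and the last feature vector.
    centroids = [list(features[0]), list(features[-1])]
    labels = [0] * len(features)
    for _ in range(iterations):
        # Assignment step: each feature goes to the closer centroid.
        for i, x in enumerate(features):
            dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids]
            labels[i] = 0 if dists[0] <= dists[1] else 1
        # Update step: each centroid becomes the mean of its members.
        for k in (0, 1):
            members = [x for x, lab in zip(features, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

How many features end up with each speaker depends on the data and the seeding, which is consistent with the remark that the exact amount per speaker was not measured.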
0:18:31	okay
0:18:32	um i have another question about the overlapping speech, you said that you
0:18:37	have overlapping speech in the references but
0:18:40	you score it as an error
0:18:42	yeah and that you did not take it
0:18:44	into account so we
0:18:45	always and they're only one way to
0:18:47	yeah and do you have an idea of the amount
0:18:50	appeal is that it
0:18:51	yes to to your result
0:18:52	we have used two databases for diarization, and
0:18:56	the one used here was two thousand and forty eight conversations from the nist
0:19:01	two thousand and five speaker recognition evaluation
0:19:03	and
0:19:04	if i correctly remember it was about
0:19:07	three point eight
0:19:09	percent
0:19:09	of overlapped speech
0:19:11	on average
0:19:12	okay
0:19:14	like
0:19:21	i also have two questions first
0:19:22	have you evaluated the degradation you get
0:19:25	from replacing the gaussian model with the
0:19:27	som model
0:19:29	and secondly
0:19:30	um
0:19:32	uh could you i mean you want to use the initial
0:19:35	you know so many seconds
0:19:36	for for building your your uh
0:19:39	you're speaker clusters
0:19:40	a could you just redo that every so often i mean most uh
0:19:45	machines this dataset more than once if you record
0:19:47	uh you can continue doing online segmentation and in the background you can recompute your
0:19:53	speaker clusters
0:19:54	you know every
0:19:55	uh thirty seconds or something like that
0:19:57	of course
0:19:58	for the first question
0:19:59	we have examined
0:20:01	self organising maps and gmm models for diarization
0:20:04	in papers presented previously
0:20:07	that is
0:20:08	gmm and som for diarization
0:20:10	in our experiments
0:20:12	presented the same performance
0:20:14	so we didn't find any reason to use a gmm
0:20:18	especially because the training process for som
0:20:21	is a lot faster
0:20:24	and
0:20:24	basically
0:20:25	for us more robust
0:20:27	for the second question
0:20:29	an exact paper was submitted to interspeech
0:20:32	it does
0:20:34	exactly that
0:20:38	so i
0:20:39	two questions
0:20:40	here
0:20:41	one is the um
0:20:43	comment about each set
0:20:44	being used
0:20:45	yeah
0:20:46	it is the first
0:20:47	you get good performance going
0:20:49	first hundred twenty seconds
0:20:50	your initial
0:20:51	thing
0:20:52	at the door
0:20:52	the files are only
0:20:54	i mean for
0:20:54	five minutes long you're using
0:20:56	yeah
0:20:56	percent of the data
0:20:57	yeah
0:20:58	you into that realistic to go halfway through a conversation
0:21:02	absolutely
0:21:04	not because just
0:21:05	and
0:21:07	if we use about thirty seconds
0:21:10	of the data in order to initialise the conversation
0:21:14	the performance degrades, that is
0:21:17	i mean
0:21:17	we get like a thirty three percent diarization error rate and
0:21:21	about
0:21:23	twenty
0:21:25	four percent speaker error rate
0:21:27	the amount of data
0:21:29	required by the diarization system
0:21:32	is quite large
0:21:34	so
0:21:36	if we have
0:21:37	the possibility to train the system online as the conversation goes
0:21:42	it would be great
0:21:43	that's exactly what we partition
0:21:44	in it
0:21:45	in the next
0:21:47	a paper in this
0:21:48	fearing
0:21:48	did you see the link that was also it's just to name one
0:21:52	they're looking at things like that
0:21:54	oh well
0:21:55	oh
0:21:56	we use that
0:21:57	well what where the conversation although ten minutes
0:22:01	let's not the duration issues knots of its duration
0:22:04	structure
0:22:04	right
0:22:05	you conversations between street
0:22:08	they
0:22:08	i take it turns you take
0:22:10	you know it's
0:22:11	the
0:22:11	duty cycle
0:22:12	very
0:22:14	if you look
0:22:15	variation
0:22:16	variance
0:22:17	format you like
0:22:17	i mean
0:22:18	E R
0:22:19	you know
0:22:20	there it should be fine
0:22:21	if someone dominates
0:22:22	first part
0:22:23	conversation you know well
0:22:25	exactly
0:22:25	that's so
0:22:26	and i also think in the call home and call friend
0:22:30	but the actually
0:22:31	more than
0:22:32	two people getting
0:22:33	yeah
0:22:33	yeah
0:22:34	two people on one side getting on
0:22:36	sharing
0:22:37	um so you have more
0:22:38	realistic
0:22:39	action
0:22:39	so what
0:22:40	yeah the point
0:22:41	in
0:22:42	maybe you had this
0:22:43	i
0:22:43	really
0:22:45	type your address in the
0:22:46	online
0:22:47	what
0:22:47	online
0:22:48	what you compare this to so for example the window has
0:22:52	at published papers we did this workshop that's me
0:22:55	exactly this task
0:22:56	you start out blindly
0:22:58	you start building up doing online
0:23:00	did you use that the baseline
0:23:02	did you
0:23:02	formant
0:23:03	okay
0:23:03	yeah
0:23:05	no i i think uh to
0:23:07	to
0:23:07	two papers
0:23:09	a which perform this online diarization task
0:23:12	but mostly of broadcast news
0:23:15	naked on telephone i believe
0:23:17	so
0:23:26	this very little
0:23:27	a problem
0:23:29	yeah
0:23:30	i would we have yeah
0:23:32	um you know
0:23:33	yeah
0:23:36	thank you
0:23:43	i wanted to know if you have some idea
0:23:46	to detect a new cluster, a new speaker, since the system seems not to be able to
0:23:51	add a class
0:23:52	during decoding
0:23:54	yeah
0:23:55	our diarization system is
0:23:57	only oriented to telephone conversations between two speakers, that is, we already assume that the number of speakers is
0:24:04	two
0:24:05	but
0:24:05	i have encountered some ideas
0:24:09	part of which use the leader follower algorithm, which is practically very simple
0:24:14	that is
0:24:16	if we segment the conversation and take a new segment
0:24:20	you can take the distance to the
0:24:22	current models you have
0:24:24	and if the distance for all of them is above a certain threshold
0:24:28	then you
0:24:29	need a new model
0:24:31	you say that
0:24:32	this is a new speaker
0:24:33	and you train a new model for it
0:24:35	and use it
0:24:35	in order to cluster the conversation
0:24:38	later on
0:24:40	when you come to the end of the conversation you could also use
0:24:43	the distance matrix between the models
0:24:46	in order to
0:24:47	merge models which are
0:24:49	very very close
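
The leader follower idea described in this answer can be sketched as below. It is an illustration only, not part of the presented system: each segment is assumed to be summarised by a single vector (for example its mean feature), each speaker model is a single centroid, and the Euclidean distance, the `threshold`, the update `rate` and the name `leader_follower` are all hypothetical choices.

```python
def leader_follower(segments, threshold, rate=0.1):
    """Assign each segment vector to an existing model or open a new one.

    Returns (labels, models), where models are running centroids.
    """
    models, labels = [], []
    for seg in segments:
        dists = [
            sum((a - b) ** 2 for a, b in zip(seg, m)) ** 0.5 for m in models
        ]
        if models and min(dists) <= threshold:
            # Close enough to a known speaker: follow and adapt that model.
            k = dists.index(min(dists))
            models[k] = [m + rate * (s - m) for m, s in zip(models[k], seg)]
        else:
            # Too far from every model: declare a new speaker.
            models.append(list(seg))
            k = len(models) - 1
        labels.append(k)
    return labels, models
```

A final pass that merges models whose mutual distance is very small, as suggested at the end of the answer, could be added on top of the returned `models`.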
0:25:03	uh
0:25:04	i want to make one of the um
0:25:07	uh when you say that uh
0:25:09	uh
0:25:10	to meaning
0:25:11	out of five in real life
0:25:13	we never know what will be the length
0:25:15	eh
0:25:16	recitation
0:25:18	can be for mean
0:25:19	can be ten million
0:25:21	so
0:25:21	yeah
0:25:22	to mean
0:25:23	oh
0:25:24	no
0:25:25	oh
0:25:29	right
0:25:30	just
0:25:34	action
0:25:34	finished
0:25:35	before the meeting
0:25:37	just make it
0:25:38	cation
0:25:42	yeah
0:25:55	hmmm
0:25:58	no
0:26:00	i don't agree because we
0:26:02	uh
0:26:03	to me it's for initialisation
0:26:05	does it matter if after the
0:26:07	the
0:26:08	yeah the computation
0:26:09	before
0:26:10	one means more or
0:26:12	when you mean
0:26:13	do you do need it i mean to me
0:26:15	you should
0:26:16	them
0:26:17	and
0:26:18	doesn't matter
0:26:19	a piece to me
0:26:21	online
0:26:21	king
0:26:22	the results
0:26:23	no matter what
0:26:24	and
0:26:37	so
0:26:38	so
0:26:39	if you one day
0:26:40	four
0:26:41	fig
0:26:42	the conversation
0:26:43	so
0:26:43	oh
0:26:44	fig
0:26:45	well
0:26:45	i see
0:26:46	second
0:26:47	to initiate
0:26:48	then
0:26:48	yeah
0:26:49	can have better without
0:26:52	on the
0:26:53	i think you be more if we if we just need to know how many how
0:26:57	almost iterations then you get
0:26:59	sufficient statistics
0:27:00	cover both speaker
0:27:03	right
0:27:04	cindy
0:27:05	fig
0:27:07	this is not
0:27:08	it is not only a matter of the percentage of the conversation, it's a matter of
0:27:12	the amount of statistics required to train
0:27:14	two speaker models, right
0:27:16	that is
0:27:17	if the conversation would last
0:27:19	for half an hour following the two minutes
0:27:22	unless the channel changes in such a manner that the models are no longer valid
0:27:28	the result will be the same
0:27:31	but
0:27:31	you are correct we have examined payment
0:27:34	in order
0:27:34	show that and that we wanted
0:27:36	right i think we do not have anything to speak you know what i mean
0:27:40	right
0:27:43	so you have an online system but i suspect you actually don't, i suspect that your online system
0:27:47	is actually an offline system
0:27:51	okay
0:27:52	do you know what
0:27:53	anything before you reach the end of the file
0:27:57	in any point
0:27:58	where we get results
0:28:01	that's the output of that
0:28:02	diarisation system
0:28:05	but you do use an hmm
0:28:07	i do have an hmm
0:28:09	so you are deferring your decisions
0:28:14	so you output the history as soon as it resolves to a single path
0:28:23	the results are given on user request
0:28:25	that is
0:28:26	yeah
0:28:27	using the hmm
0:28:30	in order to provide diarization results
0:28:33	i only need to
0:28:34	perform termination and backtracking
0:28:36	and this could be done using
0:28:38	one millisecond of processing time
0:28:41	the other stages
0:28:42	can all be done online
0:28:44	that is, initialisation using only the first feature
0:28:47	and the rest of the features in the recursion stage
0:28:52	for any new feature
0:28:55	termination and backtracking are only memory lookups, so i could provide results
0:29:02	instantaneously
0:29:04	what
0:29:04	instantaneously before the uh uh
0:29:07	hmm results to single path
0:29:09	yeah
0:29:10	hmmm
0:29:11	you know
0:29:11	what i really want to say is i think that
0:29:13	this online offline distinction is really a red herring
0:29:19	it would be better i think to
0:29:23	talk about the
0:29:28	allowed deferral time before a decision needs to be made
0:29:31	uh you know
0:29:33	you make a distinction between online and offline but really what you're doing is you're convolving that with a
0:29:38	particular approach
0:29:40	where you
0:29:42	create models with an initial segment
0:29:45	that
0:29:47	to my way of thinking
0:29:50	doesn't really make the distinction between what is online and what's offline
0:29:57	and if i would call it semi online, would it be okay with you
0:29:59	what i would like
0:30:00	to see is a
0:30:03	specification of the
0:30:05	amount of time
0:30:06	that the decision is allowed to be deferred
0:30:10	and you know, if you do that, then
0:30:15	in an offline system that deferral time would be infinite
0:30:19	in an online system the deferral time would be
0:30:23	something that is demanded by the application
0:30:27	i
0:30:28	online
0:30:29	definition of the system
0:30:30	of the client
0:30:40	oh
0:30:45	but that

Online Diarization of Telephone Conversations

SESSION 6: Diarization

Added: 14. 7. 2010 11:08, Author: Oshry Ben-Harush (Ben-Gurion University), Itshak Lapidot (Sami Shamoon College of Engineering), Hugo Guterman (Ben-Gurion University), Length: 0:30:46