| 0:00:13 | Hello everyone, and thank you for attending my presentation. |
|---|
| 0:00:21 | My name is [inaudible], and I am going to present to you our work |
|---|
| 0:00:28 | on linguistically aided speaker diarization using speaker role information. |
|---|
| 0:00:35 | First of all, in a few words, our task is speaker diarization, |
|---|
| 0:00:43 | which aims to answer the question "who spoke when?". |
|---|
| 0:00:47 | So, |
|---|
| 0:00:50 | given as input a raw speech signal, |
|---|
| 0:00:53 | what we want is to partition the signal into speaker-homogeneous segments, |
|---|
| 0:00:59 | without having any prior information about the speakers present in the session. |
|---|
| 0:01:07 | Conceptually and traditionally, |
|---|
| 0:01:10 | this task involves two steps. |
|---|
| 0:01:14 | First, |
|---|
| 0:01:15 | we want to segment the signal |
|---|
| 0:01:19 | into speaker-homogeneous segments, and this can be done either in a uniform way or according |
|---|
| 0:01:25 | to some speaker change detection. |
|---|
| 0:01:28 | And then, given those speaker segments, we want to cluster them into speaker groups. |
|---|
| 0:01:36 | But |
|---|
| 0:01:37 | there are specific problems connected to |
|---|
| 0:01:42 | this step of clustering. |
|---|
| 0:01:44 | In particular, if there are |
|---|
| 0:01:50 | speakers within the conversation |
|---|
| 0:01:54 | who are very similar in terms of their acoustic characteristics, |
|---|
| 0:01:59 | then there is the risk of merging |
|---|
| 0:02:03 | the corresponding clusters together. |
|---|
| 0:02:07 | Also, |
|---|
| 0:02:08 | if there is too much noise or silence |
|---|
| 0:02:10 | within the speech signal, |
|---|
| 0:02:13 | which has possibly not been caught by voice activity detection, |
|---|
| 0:02:20 | then we may construct additional clusters |
|---|
| 0:02:24 | modeling those nuisances. |
|---|
| 0:02:28 | And as a result, |
|---|
| 0:02:30 | this can in fact degrade |
|---|
| 0:02:32 | the performance of the system, |
|---|
| 0:02:35 | even if |
|---|
| 0:02:36 | we knew in advance the number of speakers |
|---|
| 0:02:39 | in the conversation. |
|---|
| 0:02:44 | In this work, we focus on scenarios where the speakers assume |
|---|
| 0:02:50 | specific roles. |
|---|
| 0:02:52 | For example, we may think of a doctor-patient interaction, a classroom interaction where we have |
|---|
| 0:02:59 | the teacher and the students, |
|---|
| 0:03:02 | an interview where we have the interviewer and the interviewee, and so on. |
|---|
| 0:03:08 | And the interesting feature of those scenarios |
|---|
| 0:03:12 | is that different roles are usually associated |
|---|
| 0:03:16 | with distinctive |
|---|
| 0:03:18 | linguistic behaviors. |
|---|
| 0:03:21 | For example, in an interview, we expect that the interviewer will ask most of the questions and |
|---|
| 0:03:25 | the interviewee will answer those questions. |
|---|
| 0:03:29 | Or, in a medical conversation, we expect that the patient will describe their symptoms |
|---|
| 0:03:37 | and the doctor will |
|---|
| 0:03:39 | give medical instructions. |
|---|
| 0:03:42 | So the question now is: can we leverage language, and more specifically |
|---|
| 0:03:47 | those linguistic patterns, |
|---|
| 0:03:49 | to aid |
|---|
| 0:03:50 | diarization? |
|---|
| 0:03:54 | So, |
|---|
| 0:03:56 | if we recall the problem formulation for |
|---|
| 0:04:00 | diarization in the traditional approach, |
|---|
| 0:04:04 | what we |
|---|
| 0:04:05 | do is, given the audio signal, |
|---|
| 0:04:08 | we first segment it, we extract speaker embeddings, and then we cluster them. |
|---|
| 0:04:16 | Instead, |
|---|
| 0:04:18 | what we propose is to also |
|---|
| 0:04:22 | process the lexical information, which can be derived |
|---|
| 0:04:26 | from an ASR system, |
|---|
| 0:04:32 | and to use |
|---|
| 0:04:33 | some external knowledge about the roles within the conversation, |
|---|
| 0:04:40 | and use this knowledge to estimate the role profiles. |
|---|
| 0:04:45 | By profiles we mean the acoustic |
|---|
| 0:04:47 | signatures |
|---|
| 0:04:49 | of the speakers in the conversation. |
|---|
| 0:04:51 | And now, since we have those role profiles, we can convert the clustering problem |
|---|
| 0:04:57 | into a classification one, |
|---|
| 0:04:59 | and thus |
|---|
| 0:05:00 | we can avoid the potential problems connected with clustering |
|---|
| 0:05:05 | that we mentioned earlier. |
|---|
| 0:05:08 | Now, in the next few slides, I want to go into more detail |
|---|
| 0:05:14 | on the specific |
|---|
| 0:05:16 | modules we use |
|---|
| 0:05:18 | and how we have implemented them. |
|---|
| 0:05:22 | So notice here that in the first |
|---|
| 0:05:25 | couple of steps of our system, |
|---|
| 0:05:28 | we only process the textual information. |
|---|
| 0:05:31 | So, given the text, the first step is that we want to segment this stream of text |
|---|
| 0:05:38 | in such a way that, |
|---|
| 0:05:39 | after this segmentation step, |
|---|
| 0:05:44 | we expect |
|---|
| 0:05:45 | every segment |
|---|
| 0:05:46 | to be uttered by a single speaker. |
|---|
| 0:05:50 | Now, ideally, we would want a system |
|---|
| 0:05:52 | that |
|---|
| 0:05:54 | would know exactly |
|---|
| 0:05:58 | where there is |
|---|
| 0:06:00 | a speaker change in the conversation. |
|---|
| 0:06:04 | Instead, |
|---|
| 0:06:06 | what is feasible is to assume |
|---|
| 0:06:07 | that there is a single speaker per sentence, |
|---|
| 0:06:11 | so we will segment at the sentence level. |
|---|
| 0:06:15 | And to do so, we view this problem as a sequence labeling, or sequence tagging, |
|---|
| 0:06:21 | problem. |
|---|
| 0:06:24 | Following a common architecture here, we initially construct |
|---|
| 0:06:33 | a character-level representation for each word; |
|---|
| 0:06:39 | we concatenate |
|---|
| 0:06:42 | this representation with the |
|---|
| 0:06:46 | word embedding of the corresponding word, and then this |
|---|
| 0:06:50 | sequence of word representations is fed into a bi-LSTM network, |
|---|
| 0:06:58 | which predicts a sequence of labels. |
|---|
| 0:07:02 | The labels here are two: |
|---|
| 0:07:04 | "b" denotes that the word is at the beginning of a sentence, and "m" denotes that the |
|---|
| 0:07:10 | word is in the middle of a sentence, |
|---|
| 0:07:13 | which essentially means |
|---|
| 0:07:14 | every word which is not at the beginning. |
|---|
| 0:07:17 | So our sentences here are each one of those sequences of |
|---|
| 0:07:22 | words, from one "b" |
|---|
| 0:07:27 | until the next one. |
|---|
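The decoding step described above, where a new sentence starts at every "b" tag and runs until the next one, can be sketched as follows. This is an illustrative post-processing snippet with made-up example words, not the bi-LSTM tagger itself:

```python
def tags_to_sentences(words, tags):
    """Group a tagged word stream into sentences.

    Each word carries a label 'b' (beginning of a sentence) or 'm'
    (middle, i.e. any word that is not sentence-initial): a sentence
    runs from one 'b' until the next one.
    """
    sentences = []
    for word, tag in zip(words, tags):
        if tag == "b" or not sentences:  # start a new sentence
            sentences.append([word])
        else:                            # continue the current one
            sentences[-1].append(word)
    return [" ".join(s) for s in sentences]

words = ["how", "are", "you", "i", "am", "fine"]
tags  = ["b",  "m",  "m",  "b", "m",  "m"]
print(tags_to_sentences(words, tags))  # ['how are you', 'i am fine']
```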
| 0:07:31 | Now, given those segments, we want to assign a role |
|---|
| 0:07:35 | to each one of those. |
|---|
| 0:07:37 | So, |
|---|
| 0:07:39 | in the domain we are working on, we assume that we know a priori |
|---|
| 0:07:45 | the roles appearing in this domain. |
|---|
| 0:07:49 | So we build |
|---|
| 0:07:50 | role-specific |
|---|
| 0:07:52 | language models, trained for each role, and we also have a generic language model, |
|---|
| 0:07:58 | and from those we construct the corresponding role models. |
|---|
| 0:08:04 | Specifically, we interpolate the role-specific and the generic language models, |
|---|
| 0:08:13 | and all the weights of this interpolation |
|---|
| 0:08:17 | are optimized on a development set. |
|---|
| 0:08:21 | So, once we have interpolated the language models, |
|---|
| 0:08:24 | we can simply assign |
|---|
| 0:08:26 | to each text segment the role that minimizes the corresponding perplexity. |
|---|
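As a toy illustration of this perplexity-based role assignment, here is a sketch using add-alpha-smoothed unigram language models in place of the models trained in the talk; the corpora, vocabulary, and interpolation weight below are all invented for the example:

```python
import math
from collections import Counter

def unigram_lm(corpus_tokens, vocab, alpha=0.1):
    """Add-alpha-smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(corpus_tokens)
    total = len(corpus_tokens)
    return {w: (counts[w] + alpha) / (total + alpha * len(vocab)) for w in vocab}

def interpolate(role_lm, generic_lm, lam):
    """Linear interpolation of a role-specific and a generic LM;
    the weight lam would be tuned on a development set."""
    return {w: lam * role_lm[w] + (1 - lam) * generic_lm[w] for w in role_lm}

def perplexity(lm, tokens):
    logp = sum(math.log(lm[w]) for w in tokens)
    return math.exp(-logp / len(tokens))

vocab = {"how", "are", "you", "feeling", "i", "feel", "sad", "today"}
therapist_lm = unigram_lm("how are you feeling today".split(), vocab)
patient_lm   = unigram_lm("i feel sad".split(), vocab)
generic_lm   = unigram_lm("you i today feeling".split(), vocab)

lms = {"therapist": interpolate(therapist_lm, generic_lm, 0.7),
       "patient":   interpolate(patient_lm,   generic_lm, 0.7)}

segment = "i feel sad today".split()
# assign the role whose interpolated LM gives the lowest perplexity
role = min(lms, key=lambda r: perplexity(lms[r], segment))
print(role)  # patient
```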
| 0:08:35 | Now, notice that so far we have only operated on the text. |
|---|
| 0:08:40 | In the next step, to estimate the acoustic identities of the speakers |
|---|
| 0:08:45 | appearing in the conversation, |
|---|
| 0:08:47 | we also need the audio. |
|---|
| 0:08:50 | So here we need to align the text and the audio. |
|---|
| 0:08:56 | And the |
|---|
| 0:08:57 | textual information comes from an ASR system, which means that in a real-world application |
|---|
| 0:09:03 | this alignment information is already available as a by-product of recognition. |
|---|
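Since ASR output typically carries word-level timestamps, mapping each text segment back to its audio span reduces to a lookup. A minimal sketch, with hypothetical segment indices and times (not data from the talk):

```python
def segment_times(segments, word_times):
    """Map each text segment to an audio interval.

    segments   : list of lists of word indices, in utterance order
    word_times : list of (start_sec, end_sec) per word, from the ASR
    Returns (start, end) per segment: the start time of its first
    word and the end time of its last word.
    """
    return [(word_times[seg[0]][0], word_times[seg[-1]][1]) for seg in segments]

# two segments over a six-word utterance, with hypothetical timestamps
word_times = [(0.0, 0.4), (0.4, 0.7), (0.7, 1.1), (1.5, 1.8), (1.8, 2.2), (2.2, 2.6)]
segments = [[0, 1, 2], [3, 4, 5]]
print(segment_times(segments, word_times))  # [(0.0, 1.1), (1.5, 2.6)]
```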
| 0:09:10 | So, having those audio-aligned segments, |
|---|
| 0:09:13 | we extract a speaker embedding, with a pre-trained embedding extractor, |
|---|
| 0:09:21 | for each segment |
|---|
| 0:09:23 | assigned to a specific role, |
|---|
| 0:09:26 | and we can now define, as the role's |
|---|
| 0:09:31 | overall acoustic identity, |
|---|
| 0:09:33 | the average of all those |
|---|
| 0:09:36 | speaker embeddings drawn from that role. |
|---|
| 0:09:41 | By doing so, however, |
|---|
| 0:09:45 | we assume that |
|---|
| 0:09:47 | we can rely on all the |
|---|
| 0:09:51 | segments being correctly role-assigned. |
|---|
| 0:09:54 | However, |
|---|
| 0:09:56 | we cannot be equally confident about all the role assignments, and the reason is that, |
|---|
| 0:10:03 | since we have conversational interactions, |
|---|
| 0:10:07 | after the oversegmentation we may have |
|---|
| 0:10:10 | some very short segments, for example |
|---|
| 0:10:14 | even one-word utterances like |
|---|
| 0:10:17 | "mm-hmm", which do not contain sufficient information |
|---|
| 0:10:20 | for text-based role recognition. |
|---|
| 0:10:25 | So what we are doing instead is that we |
|---|
| 0:10:28 | assign a confidence measure |
|---|
| 0:10:30 | to each one of those segments, |
|---|
| 0:10:32 | and this confidence measure is the absolute difference |
|---|
| 0:10:35 | between the best perplexity we have |
|---|
| 0:10:40 | from a role model and the second best one. |
|---|
| 0:10:45 | And now we can define a refined |
|---|
| 0:10:52 | role profile |
|---|
| 0:10:58 | as an average, but for this average we only take into account the |
|---|
| 0:11:00 | segments |
|---|
| 0:11:03 | for which the confidence |
|---|
| 0:11:05 | is above |
|---|
| 0:11:06 | some threshold, |
|---|
| 0:11:09 | and this is decided by a tunable parameter of our system. |
|---|
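A small sketch of this confidence-filtered profile estimation. The embeddings, role labels, and confidence values are invented for illustration; in the system described, the confidence is the perplexity gap between the best and second-best role models:

```python
def role_profiles(embeddings, roles, confidences, threshold):
    """Average each role's segment embeddings, keeping only the
    segments whose confidence (best-vs-second-best perplexity gap)
    is above a tunable threshold."""
    profiles = {}
    for role in set(roles):
        kept = [e for e, r, c in zip(embeddings, roles, confidences)
                if r == role and c >= threshold]
        if kept:  # element-wise mean of the surviving embeddings
            dim = len(kept[0])
            profiles[role] = [sum(e[i] for e in kept) / len(kept) for i in range(dim)]
    return profiles

emb   = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.5, 0.5]]
roles = ["therapist", "therapist", "patient", "patient"]
conf  = [0.9, 0.2, 0.8, 0.7]
profiles = role_profiles(emb, roles, conf, threshold=0.5)
print(profiles["therapist"])  # [1.0, 0.0]  (only the first segment survives)
print(profiles["patient"])    # [0.25, 0.75]
```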
| 0:11:16 | So, now that we have estimated the role profiles, we are ready to |
|---|
| 0:11:22 | perform |
|---|
| 0:11:23 | the diarization, |
|---|
| 0:11:25 | where instead of clustering we can have a classification approach. |
|---|
| 0:11:32 | Here, we follow the traditional approach for diarization, where first we segment |
|---|
| 0:11:38 | the speech signal uniformly with a sliding window, |
|---|
| 0:11:42 | we extract |
|---|
| 0:11:44 | a speaker embedding for each resulting segment, |
|---|
| 0:11:49 | and we compute |
|---|
| 0:11:51 | the cosine similarity |
|---|
| 0:11:54 | of each segment embedding |
|---|
| 0:11:57 | with all the role profiles we have just estimated. |
|---|
| 0:12:03 | And the role that we assign to each segment |
|---|
| 0:12:07 | is the one |
|---|
| 0:12:09 | whose profile is most similar to the segment, |
|---|
| 0:12:11 | that is, the one that maximizes |
|---|
| 0:12:13 | this cosine similarity score. |
|---|
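This final classification stage reduces to an argmax over cosine similarities. A minimal sketch with two-dimensional toy embeddings (real speaker embeddings are much higher-dimensional):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def assign_roles(window_embeddings, profiles):
    """Label each sliding-window embedding with the role whose
    profile gives the maximum cosine similarity."""
    return [max(profiles, key=lambda role: cosine(e, profiles[role]))
            for e in window_embeddings]

profiles = {"therapist": [1.0, 0.0], "patient": [0.0, 1.0]}
windows  = [[0.9, 0.1], [0.2, 0.8]]
print(assign_roles(windows, profiles))  # ['therapist', 'patient']
```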
| 0:12:21 | So this is the system we are proposing, and we are going |
|---|
| 0:12:26 | to evaluate the system on dyadic psychotherapy interactions, where we have two roles, namely |
|---|
| 0:12:32 | the therapist and the patient. |
|---|
| 0:12:38 | And we are also going to use a mix of corpora |
|---|
| 0:12:41 | in order to train our sentence tagger and our language models. |
|---|
| 0:12:47 | Here you can see the datasets and the sizes of the corpora |
|---|
| 0:12:53 | we are using. |
|---|
| 0:12:57 | I am not going to go into detail |
|---|
| 0:13:00 | on the specific parameters that we used for the system and the several subsystems. |
|---|
| 0:13:07 | I will just mention that the F1 score of our sentence tagger was about |
|---|
| 0:13:14 | 0.8, after |
|---|
| 0:13:18 | evaluating it on a held-out test set, |
|---|
| 0:13:21 | and that the word error rate of the ASR system we are using |
|---|
| 0:13:26 | was about forty percent for this dataset, which is admittedly high, but actually |
|---|
| 0:13:33 | is quite common when it comes to such spontaneous clinical conversations. |
|---|
| 0:13:40 | And |
|---|
| 0:13:41 | as baselines, we will use an audio-only and a language-only baseline. |
|---|
| 0:13:46 | For the audio-only baseline, our comparison system |
|---|
| 0:13:50 | is the one that we have |
|---|
| 0:13:52 | already mentioned: the traditional system, where we have a uniform segmentation and then |
|---|
| 0:13:58 | the clustering. |
|---|
| 0:14:01 | And for the language-only baseline, |
|---|
| 0:14:04 | we essentially follow the first steps |
|---|
| 0:14:07 | of our text-based system: |
|---|
| 0:14:09 | we take the text, we segment it with our |
|---|
| 0:14:14 | sentence tagger, |
|---|
| 0:14:16 | and we assign each |
|---|
| 0:14:20 | segment to a role, |
|---|
| 0:14:22 | and the only thing that we need to do in order to evaluate the |
|---|
| 0:14:25 | diarization is to |
|---|
| 0:14:28 | align here |
|---|
| 0:14:31 | the audio and the text; and, |
|---|
| 0:14:34 | as I have already mentioned, |
|---|
| 0:14:36 | since the text comes from an ASR, the alignment information is |
|---|
| 0:14:40 | already available. |
|---|
| 0:14:44 | Here are our results on the psychotherapy data we have tested on, |
|---|
| 0:14:51 | depending on whether we have used the reference transcripts or the ASR transcripts, |
|---|
| 0:14:58 | and on whether we are using our sentence tagger or an oracle text segmentation. |
|---|
| 0:15:02 | Here, our |
|---|
| 0:15:05 | unimodal |
|---|
| 0:15:06 | baselines are the same as the systems that we have just |
|---|
| 0:15:10 | introduced, and by looking at the numbers we can make some |
|---|
| 0:15:15 | interesting observations and |
|---|
| 0:15:18 | draw some interesting conclusions. |
|---|
| 0:15:21 | First of all, |
|---|
| 0:15:22 | if we compare the two baselines we have, |
|---|
| 0:15:27 | we see that the results are |
|---|
| 0:15:31 | better when we use the audio. |
|---|
| 0:15:33 | That suggests |
|---|
| 0:15:34 | that the acoustic stream, as expected, contains more information for the task of |
|---|
| 0:15:40 | speaker diarization, |
|---|
| 0:15:42 | and this is why |
|---|
| 0:15:44 | we propose using the linguistic information only as a supplementary cue. |
|---|
| 0:15:51 | Now, what is interesting to notice is that, |
|---|
| 0:15:56 | if we compare the language-only system with our sentence tagger for the |
|---|
| 0:16:01 | segmentation |
|---|
| 0:16:02 | against the oracle-based one, |
|---|
| 0:16:04 | there is a big |
|---|
| 0:16:05 | performance gap. |
|---|
| 0:16:09 | And the reason for that is that |
|---|
| 0:16:11 | the tagger oversegments and, as I also mentioned, |
|---|
| 0:16:15 | we may have very short segments that |
|---|
| 0:16:18 | do not contain sufficient information for role recognition. |
|---|
| 0:16:23 | However, in our system, we use this linguistic information only |
|---|
| 0:16:27 | where it would be useful: in order to aggregate |
|---|
| 0:16:30 | all the |
|---|
| 0:16:34 | role segments to get the acoustic identity |
|---|
| 0:16:38 | of each role. |
|---|
| 0:16:41 | So |
|---|
| 0:16:42 | such inaccuracies are largely canceled out in our system after this |
|---|
| 0:16:48 | profile estimation step. |
|---|
| 0:16:50 | A similar effect |
|---|
| 0:16:52 | is observed |
|---|
| 0:16:54 | when we compare the |
|---|
| 0:16:57 | results using the reference or the ASR transcripts. |
|---|
| 0:17:01 | Since in the ASR condition we have a pretty high word error rate, |
|---|
| 0:17:06 | we have a severe degradation in performance for the language-only system |
|---|
| 0:17:10 | when using the ASR |
|---|
| 0:17:13 | results. |
|---|
| 0:17:15 | However, |
|---|
| 0:17:16 | when the transcripts are only used for the profile estimation, as we are doing in our |
|---|
| 0:17:21 | proposed system, |
|---|
| 0:17:22 | then the performance degradation |
|---|
| 0:17:24 | is substantially smaller. |
|---|
| 0:17:29 | Finally, |
|---|
| 0:17:31 | what we see here is that, if we estimate the profiles |
|---|
| 0:17:35 | using |
|---|
| 0:17:36 | not all of the |
|---|
| 0:17:39 | role-relevant segments, but only |
|---|
| 0:17:42 | the segments that we are most confident about, then we have a further performance improvement. |
|---|
| 0:17:50 | And instead of the threshold parameter that we introduced |
|---|
| 0:17:53 | earlier, |
|---|
| 0:17:54 | here |
|---|
| 0:17:55 | we are using a certain percentage of the |
|---|
| 0:18:00 | best segments |
|---|
| 0:18:01 | per session (by best segments I mean the segments that we are most confident about), |
|---|
| 0:18:06 | and this is a parameter optimized on the development set. |
|---|
| 0:18:11 | A first observation to be made from this figure, |
|---|
| 0:18:16 | where we have illustrated the |
|---|
| 0:18:19 | diarization error rate as a function |
|---|
| 0:18:23 | of the number of segments per session kept for |
|---|
| 0:18:28 | the profile estimation, is that, |
|---|
| 0:18:30 | unless we use |
|---|
| 0:18:33 | a very small number of segments per session, most of the time |
|---|
| 0:18:38 | the performance is better than |
|---|
| 0:18:41 | the audio-only baseline, which is illustrated by the dashed line we see here. |
|---|
| 0:18:48 | Also, |
|---|
| 0:18:49 | if we compare the |
|---|
| 0:18:52 | blue and red lines, |
|---|
| 0:18:55 | what we see is that, even though |
|---|
| 0:18:57 | when we are using |
|---|
| 0:19:00 | the sentence tagger |
|---|
| 0:19:04 | (which is the red line) |
|---|
| 0:19:09 | we have a slightly worse performance than with an oracle |
|---|
| 0:19:14 | segmentation, we observe that, if we choose |
|---|
| 0:19:18 | suitably the number of segments to use, |
|---|
| 0:19:21 | then the tagger performance approaches the oracle |
|---|
| 0:19:26 | segmentation performance. |
|---|
| 0:19:30 | To sum up my presentation: today we proposed a system for speaker diarization |
|---|
| 0:19:36 | in scenarios where the speakers assume specific roles, |
|---|
| 0:19:40 | and we use the lexical information associated |
|---|
| 0:19:44 | with those roles |
|---|
| 0:19:45 | in order to estimate the acoustic identities of the speakers, |
|---|
| 0:19:49 | which in turn gives us the ability to follow a classification approach |
|---|
| 0:19:54 | instead of the clustering |
|---|
| 0:19:56 | approaches usually employed in diarization. |
|---|
| 0:20:01 | We evaluated our system on dyadic psychotherapy interactions, |
|---|
| 0:20:05 | and we achieved a relative improvement of about |
|---|
| 0:20:09 | thirty percent |
|---|
| 0:20:10 | compared to the audio-only baseline. |
|---|
| 0:20:14 | So, |
|---|
| 0:20:16 | this was my presentation; |
|---|
| 0:20:18 | thank you very much for your attention. |
|---|