0:00:13 Hello everyone, and thank you for attending my presentation. My name is [inaudible], and I am going to present our work on linguistically aided speaker diarization using speaker role information. But first, let's see what our task is and what its particular issues are, starting with a generic diarization setting and the question we want to answer.
0:00:47 So, we are given as input a raw speech signal, and what we want is to partition the signal into speaker-homogeneous regions, without having any prior information about the speakers in the signal.
0:01:07 Conceptually, and traditionally, the task involves two steps: first, we want to segment the signal into speaker-homogeneous segments, and this can be done either in a uniform way or according to some speaker change detection; and then, having those speech segments, we want to cluster them into distinct speaker groups.
0:01:37 There are specific problems connected to this step of clustering. In particular, if the speakers within the conversation are very close to each other in terms of their acoustic characteristics, then there is a risk of erroneously merging the corresponding clusters together. And if there is too much noise or silence within the speech signal, which possibly has not been caught by voice activity detection, then we may construct additional clusters modeling those nuisances. As a result, such errors degrade the performance of the system; many of them could be mitigated if we knew in advance the number of speakers in the conversation.
0:02:44 In this work we focus on scenarios where the speakers assume specific roles. For example, we may think of a doctor-patient interaction, or a teaching session where we have the teacher and the students, or an interview where we have the interviewer and the interviewee, and so on.
0:03:08 The interesting feature of those scenarios is that the different roles are usually associated with distinct linguistic patterns. For example, in an interview we expect that the interviewer will ask most of the questions and the interviewee will answer those questions. Or, in a medical conversation, we expect that the patient will describe their symptoms and the doctor will use medical terminology. So the question now is: can we use language, and more specifically those linguistic patterns, to assist speaker diarization?
0:03:56 If we recall the pipeline for diarization in the traditional approach, what we do, given the audio signal, is first segment it and then cluster the resulting segments. What we propose is to also process the textual information, which can be obtained from an ASR system.
0:04:32 The idea is to extract some knowledge about the roles that appear within the conversation, and use this knowledge to estimate role profiles. By profiles we mean the acoustic identities of the speakers in the conversation. Now, since we have those profiles, we can convert the clustering problem into a classification one, and thus we can avoid the potential problems and error sources which are connected with clustering, as mentioned earlier.
0:05:08 In the next few slides I want to go into more detail on what components our system uses and how we have implemented them.
0:05:22 Notice that in the first couple of steps of our system we only process the textual stream. So, given the text, the first step is that we want to segment the raw transcribed text in such a way that each segment, after this segmentation step, is expected to be uttered by a single speaker. In other words, we want the system to know, at a textual level, where there is a speaker change in the conversation. For simplicity, we assume that there is a single speaker per sentence, so we will segment at the sentence level.
0:06:15 To that end, we view this problem as a sequence labeling, or sequence tagging, task. We construct an architecture similar to the one shown here: initially we construct a character-level representation for each word, and then we concatenate this representation with the word embedding of the corresponding word. This sequence of word representations is fed into a bidirectional LSTM, which predicts a sequence of labels. The labels here are two: 'b' denotes that the word is at the beginning of a sentence, and 'm' denotes that the word is in the middle of a sentence, which essentially means every word that is not sentence-initial. So a sentence here is each span of words that starts with a 'b' label and extends until the next one.
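As an illustration of this tagging scheme, here is a minimal sketch of how a predicted 'b'/'m' label sequence can be turned back into sentence segments; the tagger itself (the character-plus-word BiLSTM) is not reproduced, and the function name and toy word sequence are only illustrative:

```python
def labels_to_sentences(words, labels):
    """Group a tagged word sequence into sentences.

    Each sentence starts at a word labeled 'b' (beginning) and
    extends over the following 'm' (middle) words until the next 'b'.
    """
    sentences, current = [], []
    for word, label in zip(words, labels):
        if label == "b" and current:
            sentences.append(current)
            current = []
        current.append(word)
    if current:
        sentences.append(current)
    return sentences

words = ["how", "are", "you", "i", "am", "fine"]
labels = ["b", "m", "m", "b", "m", "m"]
print(labels_to_sentences(words, labels))
# → [['how', 'are', 'you'], ['i', 'am', 'fine']]
```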
0:07:35 Now that we have those segments, we want to assign a role to each of them. Since we know the domain we are working on, we assume that we know a priori the roles in this domain. So, for each role, we build a role-specific language model, and we also build a generic, role-independent language model; for their construction we use n-gram models. We then interpolate each role-specific language model with the generic one, and the interpolation weights, as well as some other parameters of the system, are optimized on a development set. Once we have the interpolated language models, we can simply assign to each text segment the role that minimizes the corresponding perplexity.
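As a rough sketch of this assignment rule, assuming toy unigram probability tables in place of the actual n-gram models (the function names, interpolation weight, floor probability, and example vocabularies are all illustrative):

```python
import math

def interpolate(role_lm, generic_lm, lam):
    """Linear interpolation of a role-specific and a generic unigram table."""
    vocab = set(role_lm) | set(generic_lm)
    return {w: lam * role_lm.get(w, 0.0) + (1 - lam) * generic_lm.get(w, 0.0)
            for w in vocab}

def perplexity(tokens, lm, floor=1e-6):
    """Perplexity of a token sequence under a unigram model;
    unseen words receive a small floor probability."""
    logp = sum(math.log(max(lm.get(t, 0.0), floor)) for t in tokens)
    return math.exp(-logp / len(tokens))

def assign_role(tokens, role_lms, generic_lm, lam=0.7):
    """Pick the role whose interpolated LM minimizes perplexity."""
    ppl = {role: perplexity(tokens, interpolate(lm, generic_lm, lam))
           for role, lm in role_lms.items()}
    return min(ppl, key=ppl.get), ppl

# toy probability tables, purely illustrative
doctor = {"pain": 0.2, "prescribe": 0.3, "dose": 0.3, "you": 0.2}
patient = {"pain": 0.4, "hurts": 0.3, "i": 0.3}
generic = {"you": 0.3, "i": 0.3, "pain": 0.2, "the": 0.2}

role, _ = assign_role(["i", "have", "pain"],
                      {"doctor": doctor, "patient": patient}, generic)
print(role)  # → patient
```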
0:08:35 Note that so far we have only worked with the text. In the next step, in order to estimate the acoustic identities of the speakers appearing in the conversation, we also need the audio. So here we need to align the text and the audio; since the textual information comes from an ASR system, which would be the case in a real-world application, this alignment information is already available. So, having those role-assigned, audio-aligned segments, we extract a speaker embedding, in our case an x-vector, for each segment assigned to a specific role, and we can now define, as the profile for a role's acoustic identity, the average of all the speaker embeddings of the segments assigned to that role.
0:09:41 By doing so, however, we implicitly rely equally on all the role assignments, while we cannot be equally confident about all the role-assigned segments. The reason is that, since we are dealing with conversational interactions, after the sentence-level segmentation we may have some very short segments, for example single words like "yeah" or "okay", which do not contain sufficient information for robust role recognition. So what we are doing instead is to assign a confidence measure to each of those segments, and this confidence measure is the absolute difference between the best perplexity we get and the second-best one. We can then again define the profile as an average, but now for this average we only take into account the segments for which the confidence is above some threshold, and this threshold is a tunable parameter of our system.
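A minimal sketch of this confidence-filtered profile estimation, assuming segment embeddings are already extracted (the function names, the two-dimensional toy "embeddings", the fallback behavior when no segment passes the threshold, and the threshold value are all illustrative):

```python
import numpy as np

def confidence(ppls):
    """Confidence of a segment's role assignment: the absolute gap
    between the best (lowest) and second-best perplexity."""
    best, second = sorted(ppls.values())[:2]
    return abs(second - best)

def role_profile(embeddings, confidences, tau):
    """Average the speaker embeddings for one role, keeping only
    segments whose confidence exceeds the threshold tau."""
    embeddings = np.asarray(embeddings, dtype=float)
    confidences = np.asarray(confidences, dtype=float)
    keep = confidences > tau
    if not keep.any():  # fallback: use all segments for this role
        keep = np.ones(len(embeddings), dtype=bool)
    return embeddings[keep].mean(axis=0)

# three segments assigned to one role; the last is a short,
# unreliable segment with low confidence
embs = [[1.0, 0.0], [0.8, 0.2], [-1.0, 5.0]]
conf = [9.0, 7.5, 0.3]
print(role_profile(embs, conf, tau=1.0))  # → [0.9 0.1]
```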
0:11:16 Now that we have estimated the role profiles, we are ready to proceed to diarization, where instead of clustering we can have a classification module. We follow the traditional approach for diarization, where first we segment the speech signal uniformly with a sliding window and extract an embedding for each resulting segment. We then compute the cosine similarity of each segment with all the role profiles we have just estimated, and the role that is assigned to each segment is the one that is most similar to the segment; in other words, the one that maximizes this cosine similarity score.
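The classification step above can be sketched as follows, assuming the windowed segment embeddings and the role profiles live in the same embedding space (the two-dimensional vectors and the role names are toy examples):

```python
import numpy as np

def classify_segments(segment_embs, profiles):
    """Assign each windowed segment the role whose profile has the
    highest cosine similarity with the segment's embedding."""
    roles = list(profiles)
    P = np.stack([profiles[r] for r in roles])      # (R, d) profile matrix
    X = np.asarray(segment_embs, dtype=float)       # (N, d) segment embeddings
    P = P / np.linalg.norm(P, axis=1, keepdims=True)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = X @ P.T                                  # (N, R) cosine similarities
    return [roles[i] for i in sims.argmax(axis=1)]

profiles = {"doctor": np.array([1.0, 0.0]), "patient": np.array([0.0, 1.0])}
segments = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.6]]
print(classify_segments(segments, profiles))  # → ['doctor', 'patient', 'doctor']
```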
0:12:21 So this is the system we are proposing, and we are going to evaluate it on dyadic clinical interactions where we have two roles, namely the doctor and the patient. We are also going to use a mix of corpora in order to train our sentence tagger and our language models; here in this table you can see the datasets and the sizes of the corpora we are using.
0:12:57 I am not going to go into detail on the specific parameters that we used for the system and its several subsystems. I will just mention that the F1 score of our sentence tagger was roughly 0.8, and that the word error rate of the ASR system we are using was about 40% for our dataset; a rather high number, but our data are challenging, consisting of real-world, spontaneous medical conversations.
0:13:41 As baselines, we will use an audio-only and a language-only baseline. For the audio-only baseline we use the system that we have already mentioned, the traditional diarization pipeline where we have a uniform segmentation and then PLDA-based clustering. For the language-only baseline we essentially run the first steps of our text-based system: given the text, we segment it with our sentence tagger and we assign each segment to a role. The only thing that we then need to do in order to evaluate diarization performance is to align the audio and the text; as I have already mentioned, when the text comes from an ASR system, the alignment information is already available.
0:14:44 Here are our results on the data we tested on. We have used either the reference transcript or the ASR transcript, and either our sentence tagger or an oracle text segmentation; shown here are our baselines, as well as the system that we have introduced. By looking at the numbers we can make some interesting observations and draw some interesting conclusions. First of all, if we compare the two baselines, we see that the results are better with the audio-only one. This is expected, since the acoustic stream contains richer information for the task of speaker diarization, and this is why we propose using the linguistic information only as a supplementary cue.
0:15:51 What is interesting to notice is that for the language-only system, comparing the oracle segmentation with the sentence-tagger-based one, there is a large performance gap. The reason for that is that with the tagger, after oversegmentation, as I also mentioned, we may have very short segments that do not contain sufficient information for role recognition. However, in our full system we use this information only as an intermediate step, in order to aggregate all the role-assigned segments and get the acoustic identity of each role; so such errors are, to a large extent, cancelled out in our system after this averaging.
0:16:50 A similar effect is observed when we compare the results using the reference or the ASR transcript. Because in the ASR condition we have a pretty high word error rate, we see a severe degradation in performance for the language-only system when using ASR output. But when the transcripts are only used for the profile estimation, as we are doing in our proposed system, then the performance degradation is substantially smaller.
0:17:31 What we see here is that if we estimate the profiles not on all the role-assigned segments, but only on the segments that we are most confident about, then we get a further performance improvement. Instead of the threshold parameter that we introduced earlier, here we are using a certain percentage of the segments per session, meaning the segments that we are most confident about, where this percentage is a parameter optimized on the development set.
0:18:11 A first observation that can be made from this figure, where we have plotted the diarization error rate as a function of the number of segments per session taken into account for the profile estimation, is that unless we use a very small number of segments per session, most of the time the performance is better than the audio-only baseline, which is illustrated by the dashed line here. If we compare the blue and the red lines, what we see is that even though we get slightly worse performance when using our sequence tagger, which is the red line, than with the oracle segmentation, if we carefully choose the number of segments to use, then the tagger performance approaches the oracle segmentation performance.
0:19:30 To sum up my presentation: today we proposed a system for speaker diarization in scenarios where the speakers assume specific roles, and we use the lexical information associated with those roles in order to estimate the acoustic identities of the speakers, which gives us the ability to follow a classification approach instead of the clustering approaches usually applied to diarization. We evaluated our system on dyadic clinical interactions, and we achieved a relative improvement of about 30% compared to the audio-only baseline. This was the end of my presentation; thank you very much for your attention.