0:00:15Thank you for the
0:00:17introduction. Good morning.
0:00:19This work was done
0:00:21in collaboration with my colleagues from IBM Haifa Research in Israel and the IBM
0:00:26Thomas Watson Research Center.
0:00:31This work is about
0:00:34the European project named HERMES. So
0:00:38first I will introduce this
0:00:40project,
0:00:41and then we will discuss our work on speech transcription,
0:00:46speaker tracking and spoken information retrieval
0:00:49in this project.
0:00:55HERMES is a three-year-long research project in the area of ambient assisted living,
0:01:02partially funded by the EU.
0:01:08The goal of this project is to develop a personal system for the elderly
0:01:15to alleviate normal ageing-related cognitive decline
0:01:21by providing memory support and cognitive training.
0:01:26So the broad idea is
0:01:28to record, in audio
0:01:31and video,
0:01:32the personal experience of the user,
0:01:35in part manually
0:01:38and in part automatically,
0:01:41and then to extract
0:01:43metadata from these recordings
0:01:46and to offer to the user a certain set of services.
0:01:52As for the audio recordings, which are the primary focus here,
0:01:58the user is equipped with a mobile device; throughout the talk I will call it a PDA,
0:02:04personal digital assistant,
0:02:07and the user can record with it his or her conversations of interest,
0:02:13at home or outside.
0:02:18The central service, or application, that the HERMES system offers to the user is called
0:02:25HERMES MyPast, which is a search of the past experience of the user,
0:02:30in our case recorded in audio, and this is my primary focus, specifically the
0:02:36speech-processing-related
0:02:38part of this application.
0:02:42So the idea is to let the user
0:02:45submit
0:02:48a query like, for example, "What did the doctor tell me yesterday about the diet?"
0:02:53If you look at this query, it is a composite query: it
0:02:59contains spoken words,
0:03:02like "diet",
0:03:03and the identity of
0:03:08the counterpart of the conversation. Then the system
0:03:11will retrieve and return to the user
0:03:15relevant fragments of relevant conversations that match this query.
0:03:22The query will be composed using a dedicated interface supporting the composition of such queries,
0:03:29and not in free form, of course.
0:03:32Now, this is the control flow of the speech processing that supports this application.
0:03:37Basically, what we need to do:
0:03:39we have to extract
0:03:43the speaker identity from these conversations recorded with a PDA
0:03:49and to transcribe speech to text.
0:03:52Then we have to index
0:03:54all this information
0:03:55and be able to search over this indexed information not only fast but
0:04:02also
0:04:04correctly.
0:04:08Such an application poses certain challenges to the speech processing. First of all, this is open-domain conversational speech,
0:04:15which is
0:04:16always a challenge.
0:04:18Furthermore, the recordings are made with a distantly placed PDA device,
0:04:25and typically two people are talking to each other with the PDA placed on the table between them.
0:04:33Secondly,
0:04:35these are elderly voices, which are reported in the literature to pose challenges to the
0:04:42ASR system.
0:04:44And last but not least, massive data collection for training cannot be afforded in
0:04:48such a project.
0:04:52The target language for the HERMES prototype system was Castilian Spanish.
0:04:57In the beginning of the project we
0:05:00performed data collection, trying to collect as much audio data as we could.
0:05:06We collected data from forty-seven elderly and four young speakers.
0:05:12The data was recorded simultaneously by the PDA, which is our target, and also by a headset microphone,
0:05:18for research and analysis.
0:05:20In total we collected about forty hours of data, which was distributed
0:05:26among dialogues, which are our target,
0:05:30freestyle monologues and read-out dialogues.
0:05:33All these data underwent manual verbatim transcription and
0:05:38speaker labelling.
0:05:41And now I switch to the speech-to-text transcription part.
0:05:45Our work on this part was based on the Attila
0:05:51toolkit developed by IBM Research.
0:05:55The systems that we used and built within this project are similar to each other in terms of
0:06:01their architecture.
0:06:03They employ two-pass decoding
0:06:05with feature-space speaker adaptation and discriminative acoustic models at the second pass, and they employ trigram
0:06:13statistical language models.
0:06:15The development here went through three phases:
0:06:18the baseline system, the intermediate system and the advanced system.
0:06:23As a baseline we adopted the
0:06:26Spanish ASR system developed by IBM in the TC-STAR European project
0:06:31for transcription of parliamentary speeches.
0:06:34This is a huge system: the acoustic model contains about four thousand HMM states and about one hundred K
0:06:41Gaussians.
0:06:43In the TC-STAR evaluations this system achieved eight percent word error rate, which is very successful.
0:06:49We evaluated this baseline system on the HERMES data,
0:06:54including read-out dialogues
0:06:57recorded with lip microphones and the PDA,
0:07:00and the word error rates from this evaluation are presented in this table.
0:07:09Actually, this evaluation revealed
0:07:11a high degree of mismatch between the baseline training conditions, which are parliamentary speeches
0:07:18recorded with a close-talking microphone,
0:07:20and the HERMES target conditions, which are free dialogues recorded with a distantly placed PDA.
0:07:26In this table you can see the influence of the linguistic aspect of this mismatch
0:07:32and of the acoustic aspect separately,
0:07:35but overall, due to both aspects, the word error rate
0:07:40jumps from twenty-four percent
0:07:43for read-out dialogues recorded with the lip microphone
0:07:45to sixty-eight percent for the dialogues recorded with the PDA.
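The word error rates quoted in the talk follow the usual definition: the minimum number of word substitutions, deletions and insertions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. The sketch below is only an illustration of that definition. It assumes plain whitespace tokenisation with no text normalisation, and the function name and example sentences are made up for this note, not taken from the HERMES evaluation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against six reference words.
    print(word_error_rate("the doctor talked about the diet",
                          "the doctor talk the diet"))
```

Because insertions count as errors, the rate can exceed one hundred percent when the hypothesis is much longer than the reference.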
0:07:50We built an intermediate system by adaptation of the baseline language model and acoustic model.
0:07:57The language model adaptation included
0:08:00building a new language model on a subset of the HERMES conversation transcripts
0:08:07and interpolation between the baseline language model and the
0:08:11new language model.
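Linear interpolation of two language models can be sketched as a weighted sum of their probabilities. The example below is a toy illustration under stated assumptions: it uses unigram probabilities stored in dictionaries instead of the trigram models mentioned in the talk, and the weight `lam` and all probability values are hypothetical, since the talk does not give them.

```python
def interpolate_models(p_base: dict, p_new: dict, lam: float) -> dict:
    """Return lam * p_new + (1 - lam) * p_base over the union vocabulary."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("interpolation weight must lie in [0, 1]")
    vocabulary = set(p_base) | set(p_new)
    return {
        word: lam * p_new.get(word, 0.0) + (1.0 - lam) * p_base.get(word, 0.0)
        for word in vocabulary
    }


if __name__ == "__main__":
    # Hypothetical toy distributions: a parliamentary baseline and an in-domain model.
    baseline = {"parliament": 0.6, "diet": 0.1, "the": 0.3}
    in_domain = {"diet": 0.5, "doctor": 0.2, "the": 0.3}
    mixed = interpolate_models(baseline, in_domain, lam=0.5)
    print(sorted(mixed.items()))
```

In practice the weight is usually tuned on held-out in-domain text, for instance by minimising perplexity.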
0:08:13The acoustic model adaptation was done at speaker enrollment, by adaptation
0:08:20of the baseline acoustic model on the HERMES monologue data of the target speaker.
0:08:25In this table you can see the evaluation of the intermediate system
0:08:32on the dialogues recorded by the PDA,
0:08:36and here you can see the contributions of the language model adaptation
0:08:43and the acoustic model adaptation
0:08:45separately, but overall the intermediate system
0:08:49reduces the word error rate from sixty-eight percent to fifty-four.
0:08:57We built
0:08:58the advanced system by training completely on the HERMES PDA data,
0:09:04and we bootstrapped this training process with the initial alignments obtained with the baseline.
0:09:11This advanced system was trained on thirty-eight hours of speech produced by forty-nine speakers.
0:09:18It is a very small data set,
0:09:20and we put aside the data of only two elderly speakers,
0:09:24a male and a female, for the tests.
0:09:26This acoustic model is relatively small, about four times smaller than the
0:09:32baseline and the intermediate, and it does not require speaker enrollment, so
0:09:37in that sense it is a deployment-friendly system.
0:09:40Here you can see the evaluation of all three systems
0:09:44on the same dataset, comprised of conversational speech recorded by the PDA, and you see that the advanced
0:09:50system achieved
0:09:51thirty-nine point two percent word error rate,
0:09:54which is a dramatic improvement in accuracy relative to the
0:09:59baseline and the intermediate.
0:10:02Now we switch to speaker tracking.
0:10:04As you know, speaker tracking is a task aiming to answer the question "who
0:10:09spoke when?" on single-channel
0:10:12audio.
0:10:14It can be seen as the concatenation of two subtasks:
0:10:16speaker diarization, which segments
0:10:20the audio into speaker turns and then clusters these segments according to speaker similarity,
0:10:27and speaker recognition,
0:10:29which assigns speaker identity labels to these clusters.
0:10:33In HERMES we deal with two-speaker conversations, which are typically dialogues, the conversations of the user
0:10:40with another person.
0:10:42Speaker tracking in HERMES is used
0:10:45primarily for the search;
0:10:48here we need only to know the
0:10:50identity of the speakers participating in the conversation. The secondary use is for
0:10:57enhancement of the transcript intelligibility while browsing
0:11:03by the user.
0:11:05For the two-speaker diarization, a novel, very effective and simple technique
0:11:10has been developed,
0:11:13and it is described in detail in this paper.
0:11:19This
0:11:20technique, evaluated on the NIST telephone data, achieved a two point eight
0:11:26percent error rate.
0:11:29On HERMES data it achieved twenty-four percent
0:11:34frame error rate, which means the percentage of incorrectly classified frames,
0:11:41and the difference
0:11:43in performance is accounted for by the very challenging recording conditions in HERMES.
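The frame error rate described here, the share of incorrectly classified frames, can be sketched as follows. Diarization cluster labels are arbitrary, so this sketch scores the best mapping from hypothesis clusters to reference speakers. The per-frame label lists and the function name are assumptions made for illustration; the talk does not describe the scoring tool actually used in HERMES.

```python
from itertools import permutations


def frame_error_rate(reference: list, hypothesis: list) -> float:
    """Fraction of misclassified frames under the best cluster-to-speaker mapping."""
    if len(reference) != len(hypothesis) or not reference:
        raise ValueError("label sequences must be non-empty and equally long")

    ref_labels = sorted(set(reference))
    hyp_labels = sorted(set(hypothesis))
    # Pad with None so surplus hypothesis clusters map to "no speaker".
    targets = ref_labels + [None] * max(0, len(hyp_labels) - len(ref_labels))

    best_errors = len(reference)
    for assignment in permutations(targets, len(hyp_labels)):
        mapping = dict(zip(hyp_labels, assignment))
        errors = sum(1 for r, h in zip(reference, hypothesis) if mapping[h] != r)
        best_errors = min(best_errors, errors)
    return best_errors / len(reference)


if __name__ == "__main__":
    reference = list("AAAABBBB")           # true speaker per frame
    hypothesis = [1, 1, 1, 0, 0, 0, 0, 1]  # arbitrary cluster ids per frame
    print(frame_error_rate(reference, hypothesis))
```

Trying every mapping is only practical for a handful of speakers, which fits the two-speaker dialogues discussed in the talk.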
0:11:51And now speaker recognition.
0:11:55Here speaker recognition is applied to the segments provided by the speaker diarization.
0:12:01This facilitates speaker recognition, because speaker recognition on unsegmented
0:12:07multi-party audio is extremely challenging.
0:12:12So it facilitates, but
0:12:16still the problem persists, because the diarization is not perfect, so the segments that we apply the speaker recognition to
0:12:23typically contain frames from both speakers.
0:12:27At the same time, state-of-the-art
0:12:32speaker recognition algorithms
0:12:34are not immune to the interfering speaker, so additional work is needed here, and to this end a
0:12:41novel approach,
0:12:42a very effective one, was
0:12:46developed in the HERMES project
0:12:49that reduces the influence of the interfering speaker. The algorithms underlying
0:12:55this technique
0:13:02are described in detail in these two publications.
0:13:07The equal error rate on the NIST telephone data is
0:13:12about four percent, and on the HERMES data processed by the
0:13:16diarization it is about eleven percent; again, the difference is accounted for by the HERMES recording conditions.
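The equal error rate is the operating point at which the false acceptance rate and the false rejection rate of a verification system coincide. A minimal sketch follows, assuming that higher scores mean "same speaker" and that the rate is approximated at the threshold where the two error rates are closest. The scores are invented for illustration and are not HERMES or NIST results.

```python
def equal_error_rate(target_scores: list, impostor_scores: list) -> float:
    """Approximate the EER by sweeping thresholds over all observed scores."""
    if not target_scores or not impostor_scores:
        raise ValueError("both score lists must be non-empty")

    best_gap = float("inf")
    eer = 1.0
    for threshold in sorted(set(target_scores) | set(impostor_scores)):
        # A trial is accepted when its score is at or above the threshold.
        false_reject = sum(s < threshold for s in target_scores) / len(target_scores)
        false_accept = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        gap = abs(false_accept - false_reject)
        if gap < best_gap:
            best_gap = gap
            eer = (false_accept + false_reject) / 2.0
    return eer


if __name__ == "__main__":
    targets = [0.9, 0.8, 0.7, 0.4]    # scores of genuine-speaker trials
    impostors = [0.6, 0.3, 0.2, 0.1]  # scores of impostor trials
    print(equal_error_rate(targets, impostors))
```

With few trials the two error curves may never cross exactly, which is why the sketch averages them at the closest point.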
0:13:24And finally we move to the spoken information retrieval.
0:13:30All the metadata extracted from the audio is indexed. So what are we indexing?
0:13:35The word confusion networks provided by the ASR system, which means
0:13:39for each word we index
0:13:41the N-best alternatives;
0:13:43we index them
0:13:47along with their confidence measures.
0:13:50Next, the word time stamps,
0:13:52and finally the speaker identities associated with the conversation.
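The indexed items listed above (N-best word alternatives with confidences, word time stamps, and the speakers of each conversation) could be organised as an inverted index. The sketch below is one possible layout, not the HERMES implementation: the input dictionary format, the `Posting` record and the `n_best` cut-off are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Posting:
    conversation_id: str
    start: float        # slot start time in seconds
    end: float          # slot end time in seconds
    confidence: float   # confidence of this word alternative


def build_index(conversations: dict, n_best: int = 3):
    """Index the top n_best alternatives of every confusion-network slot."""
    index = defaultdict(list)
    speakers = {}
    for conv_id, conv in conversations.items():
        speakers[conv_id] = set(conv["speakers"])
        for slot in conv["confusion_network"]:
            ranked = sorted(slot["alternatives"], key=lambda alt: alt[1], reverse=True)
            for word, confidence in ranked[:n_best]:
                index[word.lower()].append(
                    Posting(conv_id, slot["start"], slot["end"], confidence)
                )
    return index, speakers


if __name__ == "__main__":
    conversations = {
        "conv-1": {
            "speakers": ["user", "doctor"],
            "confusion_network": [
                {"start": 12.0, "end": 12.4,
                 "alternatives": [("dieta", 0.6), ("dicta", 0.3), ("dita", 0.1)]},
            ],
        }
    }
    index, speakers = build_index(conversations, n_best=2)
    print(index["dieta"], speakers["conv-1"])
```

Keeping lower-ranked alternatives lets a search still find a word that the recognizer placed second or third in a slot.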
0:13:57We defined a query language
0:13:59which enables combining spoken terms and speaker identity in the same query,
0:14:07and our search function returns a list of items ordered by relevance. Each
0:14:13item contains
0:14:14the ID of the conversation and the time stamps of the relevant fragment inside the conversation.
0:14:21It also employs
0:14:23spell checking.
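A composite query of this kind, spoken terms plus an optional speaker, with results ordered by relevance and returned as conversation ID and time span, might be sketched as below. Everything in it is an assumption for illustration: the tuple-based index layout, the relevance score (a plain sum of confidences) and the use of `difflib.get_close_matches` as a stand-in for spell checking. The talk does not specify how HERMES implements any of these.

```python
import difflib


def search(index: dict, speakers: dict, terms: list, speaker: str = None) -> list:
    """Return (conversation_id, start, end) tuples, most relevant first."""
    scores = {}
    spans = {}
    for term in terms:
        term = term.lower()
        if term not in index:
            # Stand-in spell check: fall back to the closest indexed word, if any.
            close = difflib.get_close_matches(term, list(index), n=1)
            if not close:
                continue
            term = close[0]
        for conv_id, start, end, confidence in index[term]:
            if speaker is not None and speaker not in speakers.get(conv_id, set()):
                continue  # conversation does not involve the requested speaker
            scores[conv_id] = scores.get(conv_id, 0.0) + confidence
            first, last = spans.get(conv_id, (start, end))
            spans[conv_id] = (min(first, start), max(last, end))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [(conv_id, spans[conv_id][0], spans[conv_id][1]) for conv_id in ranked]


if __name__ == "__main__":
    # Hypothetical index: word -> list of (conversation, start, end, confidence).
    index = {
        "dieta": [("c1", 12.0, 12.4, 0.9), ("c2", 3.0, 3.5, 0.4)],
        "doctor": [("c1", 10.0, 10.5, 0.8)],
    }
    speakers = {"c1": {"user", "doctor_garcia"}, "c2": {"user", "maria"}}
    print(search(index, speakers, ["dieta", "doctor"]))
    print(search(index, speakers, ["dieta"], speaker="maria"))
```

The returned span simply covers all matched words in a conversation; a real system would more likely return tighter fragments around clusters of hits.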
0:14:24Eventually we evaluated
0:14:27our end-to-end system,
0:14:29including ASR,
0:14:32indexing and retrieval.
0:14:35We tested this system in the task of
0:14:39conversation retrieval based on a content query, which means
0:14:44that we ignored the timing information returned by the search function and we did
0:14:51not include the speaker identity in the query.
0:14:54For this evaluation we used the same twenty conversations from the male and female elderly speakers that are
0:15:00used in the ASR evaluation.
0:15:02Fifty-five queries were composed manually against these twenty conversations,
0:15:09which means
0:15:10for each conversation a few queries, from one to four.
0:15:15They were composed by a Spanish speaker.
0:15:21Now, the idea was to compare the speech search
0:15:25to the textual search, which was considered as the reference.
0:15:31So for each query we found and marked the relevant conversations
0:15:36by searching with this query over the verbatim transcripts of all twenty conversations.
0:15:43In general, for each query there was more than one relevant conversation, because
0:15:48the conversations shared more or less the same topics.
0:15:55Then we applied the speech search and used the standard
0:16:01mean average precision measure to quantify the accuracy, the relative accuracy, of
0:16:08this search.
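Mean average precision, as used in this comparison, averages over queries the precision measured at each rank where a relevant conversation is retrieved. In the sketch below the relevant sets play the role of the textual-search reference described in the talk. The query and conversation identifiers are invented for illustration.

```python
def average_precision(ranked: list, relevant: set) -> float:
    """Average of precision values at the ranks of the relevant items."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant items that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(results: dict, references: dict) -> float:
    """Mean of average precision over all queries that have a reference."""
    if not references:
        raise ValueError("at least one reference query is required")
    total = sum(
        average_precision(results.get(query, []), relevant)
        for query, relevant in references.items()
    )
    return total / len(references)


if __name__ == "__main__":
    # Hypothetical speech-search rankings and textual-search references.
    results = {"q1": ["c1", "c3", "c2"], "q2": ["c4"]}
    references = {"q1": {"c1", "c2"}, "q2": {"c5"}}
    print(mean_average_precision(results, references))
```

A value of one would mean the speech search ranks exactly the conversations found by the textual search at the top for every query.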
0:16:09And here you can see four evaluations;
0:16:12each evaluation is represented by
0:16:16two bars:
0:16:18the blue bar is the ASR word error rate and the red bar is the MAP,
0:16:24in percent.
0:16:27The first evaluation was for the baseline ASR system,
0:16:31then the intermediate ASR system and then the advanced ASR system.
0:16:36All these three evaluations were done
0:16:39when we indexed only the first, the top best guess from the ASR.
0:16:45Here you can see that with the advanced system we achieve seventy percent mean
0:16:51average precision of the search.
0:16:53The final evaluation, for the advanced system, was done using the index of the entire
0:16:59word confusion network, and
0:17:01here we reach seventy-six percent MAP, which means that we are pretty close to the
0:17:08textual search.
0:17:09And to wrap up:
0:17:12first, it seems that,
0:17:14despite the challenges,
0:17:15the technologies are mature enough to meet the challenges of the ambient assisted living application.
0:17:22Secondly, the
0:17:24availability of domain-specific annotated data is very important, extremely important.
0:17:30On the other hand,
0:17:32many projects cannot afford a big-scale data collection, so there is a need
0:17:38for
0:17:39collaboration and data sharing,
0:17:41which will be
0:17:43very useful for the progress in this area.
0:17:46Next, the speaker recognition approaches provide
0:17:52reasonable performance on two-speaker conversations recorded
0:17:57in such an environment, with a distant mobile device. And finally,
0:18:03the
0:18:04advanced speech search technology
0:18:06can overcome the substantial ASR error rate and
0:18:13allows us to approach the performance of
0:18:16textual information retrieval.
0:18:19Thank you.
0:18:24Okay, questions?
0:18:30Thank you for the talk. Two questions.
0:18:32The first one:
0:18:34elderly people will
0:18:35use that over a long time,
0:18:37so you could expect them to enroll,
0:18:41or you could use
0:18:43unsupervised adaptation.
0:18:46So my first question is:
0:18:49you are right now
0:18:52at about thirty-nine percent word error rate;
0:18:55what is your prediction on how far you would get with
0:18:59unsupervised adaptation or supervised adaptation?
0:19:03And the other question is:
0:19:05with that kind of population
0:19:07you could have
0:19:08dramatic changes, like the person has a stroke or so, which would
0:19:12totally change the voice.
0:19:14So do you have any
0:19:17idea of how you would deal with that?
0:19:21Sorry, could you repeat the second question?
0:19:24The second question is:
0:19:25elderly people
0:19:28can have a very dramatic change of the voice characteristics,
0:19:32for instance
0:19:33because of a condition they develop,
0:19:36a stroke,
0:19:38something like that, which totally changes the voice.
0:19:42Okay, so the first question.
0:19:46In general,
0:19:47supervised speaker enrollment
0:19:52can help
0:19:53to bring the word error rate lower, but it
0:20:02complicates the deployment, the installation of such a system,
0:20:08and I'm not quite sure that the gain in the ASR accuracy
0:20:16will be paid off at the level of the speech search, because we are not going to
0:20:20show these transcripts; what we need is
0:20:23just to search. So
0:20:27I'm not sure that
0:20:29this complication of the deployment will be paid off at the level of the deployed system.
0:20:35To your second question: yes, absolutely, I agree with you, and this is
0:20:42a research area.
0:20:44Actually, in this project we have pilots with
0:20:49real problems,
0:20:52and this is the first time.
0:21:02So
0:21:04I would like to know the answer myself.
0:21:11You know,
0:21:14it may be that speaker enrollment is not
0:21:17so useful.
0:21:21To some extent you can keep using this system as the voice characteristics
0:21:28degrade.
0:21:33I do not know; maybe at
0:21:35some stage the user
0:21:36will not be able to use such a system anymore.
0:21:41Okay, thank you.
0:21:43I think we need to thank the speaker.