Okay, thank you for the introduction. Good morning. This work was done in collaboration with colleagues from the IBM Haifa Research Lab in Israel and the IBM Thomas J. Watson Research Center, and it is about the European project named HERMES. First I will introduce the project, and then we will discuss the work on speech transcription, speaker tracking, and spoken information retrieval within this project.

HERMES is a three-year-long research project in the area of ambient assisted living, partially funded by the European Union. The goal of the project is to develop a personal assistive system for elderly users, to alleviate normal age-related cognitive decline by providing memory support and cognitive training. The broad approach is to record, in audio and video, the personal experiences of the user, partly manually and partly automatically, then to extract metadata from these recordings and to offer the user a certain set of services.

As for the audio recordings, which are our primary focus here, the user is equipped with a mobile device; throughout the talk I will call it a PDA, a personal digital assistant. The user can record his or her conversations of interest at any time. One of the central services, or applications, that the HERMES system offers to the user is called HERMES My Past, which is a search over the past experiences of the user, in our case recorded in audio. This is my primary focus, and specifically the speech-processing-related part of this application.

The idea is to let the user submit a query like, for example, "What did the doctor tell me yesterday about the diet?" If you look at it, this is a composite query: it contains spoken words, like "diet", and metadata, such as the date ("yesterday") and the identity of the counterpart in the conversation ("the doctor"). The system will then retrieve and present to the user the relevant fragments of the relevant conversations that match this query. The query is composed using a dedicated interface supporting the composition of such metadata attributes and free-form keywords.

Now, the control flow of the speech processing that supports this application. Basically, we have to extract the speaker identities from the conversations recorded with the PDA and to transcribe the speech to text; then we have to index all this information and be able to search over this indexed information, not only fast but also accurately.

Such an application poses certain challenges to speech processing. First of all, this is open-domain conversational speech, which is always a challenge. Furthermore, the recordings are made with a distantly placed PDA device: typically two people are talking to each other, and the PDA is placed on the table between them. Secondly, these are elderly voices, which are reported in the literature to pose challenges to ASR systems. And last but not least, massive data collection for training cannot be afforded in such a project. The target language for the HERMES prototype system was Castilian Spanish.
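To make the shape of such a composite query concrete, here is a minimal sketch in Python; the class and field names are hypothetical illustrations, not the actual HERMES interface.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Hypothetical representation of a composite "My Past" query:
# free-form spoken keywords plus metadata constraints.
@dataclass
class MyPastQuery:
    keywords: List[str]               # spoken words to match, e.g. ["diet"]
    speaker: Optional[str] = None     # identity of the conversation partner
    on_date: Optional[date] = None    # restrict to conversations on this date

# "What did the doctor tell me yesterday about the diet?"
# (a concrete date stands in for "yesterday")
query = MyPastQuery(keywords=["diet"], speaker="doctor", on_date=date(2011, 5, 23))
print(query)
```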
At the beginning of the project we performed a data collection effort, trying to collect as much audio data as we could. We collected data from forty-seven elderly speakers and four young speakers. The data was recorded simultaneously by the PDA, which is our target device, and also by a headset microphone for research purposes. In total we collected about forty hours of data, distributed among dialogues (which are our target), freestyle monologues, and read-out texts. All this data underwent manual verbatim transcription and speaker labeling.

Now I switch to the speech-to-text transcription. The work on this part was based on the Attila toolkit developed by IBM. All the systems that we used and built within this project are similar to each other in terms of their architecture: they employ two-pass decoding with feature-space speaker adaptation and discriminatively trained acoustic models at the second pass, and they employ a trigram statistical language model. The development went through three phases: the baseline system, the intermediate system, and the advanced system.

As a baseline we adopted the Spanish system developed by IBM in the TC-STAR European project for transcription of parliamentary speech. This is a huge system: its acoustic model contains about four thousand HMM states and about one hundred thousand Gaussians. In the TC-STAR evaluations this system achieved an eight percent word error rate, which is very successful. We evaluated this baseline system on the HERMES data, including read-out texts and dialogues recorded with the headset microphones and the PDA; the word error rates from this evaluation are presented in this table. This evaluation revealed a high degree of mismatch between the baseline training conditions, which are parliamentary speeches recorded with close-talking microphones, and the target conditions, which are free dialogues recorded with a distantly placed PDA. In the table you can see the influence of the linguistic aspect of this mismatch and of the acoustic aspect separately; altogether, due to both aspects, the word error rate jumps from twenty-four percent for the read-out texts recorded with the headset microphone to sixty-eight percent for the dialogues recorded with the PDA.

Next we built an intermediate system by adaptation of the baseline language model and acoustic model. The language model adaptation included training a new language model on a subset of the HERMES conversation transcripts and interpolating between the baseline language model and the new one. The acoustic model adaptation was done via speaker enrollment, using supervised adaptation of the baseline acoustic model on the HERMES monologue data of the target speaker. In this table you can see the evaluation of the intermediate system on the dialogues recorded by the PDA, with the contributions of the language model adaptation and the acoustic model adaptation shown separately; overall, the intermediate system reduces the word error rate from sixty-eight percent to fifty-four percent.
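As a rough illustration of the language model interpolation step just described, here is a generic sketch; the mixing weight and the toy probabilities are made up, and the real systems used smoothed trigram models rather than these toy distributions.

```python
# Linear interpolation of two language models, as when mixing the baseline
# (parliamentary) LM with the in-domain (HERMES) LM:
#   P(w | h) = lam * P_domain(w | h) + (1 - lam) * P_baseline(w | h)

def interpolate_lm(p_domain, p_baseline, lam=0.5):
    """Combine two conditional word distributions P(w | history)."""
    words = set(p_domain) | set(p_baseline)
    return {w: lam * p_domain.get(w, 0.0) + (1.0 - lam) * p_baseline.get(w, 0.0)
            for w in words}

# Toy distributions for some fixed history, e.g. "about the":
p_baseline = {"budget": 0.6, "treaty": 0.3, "diet": 0.1}   # parliamentary style
p_domain   = {"diet": 0.5, "doctor": 0.3, "family": 0.2}   # conversational style
print(interpolate_lm(p_domain, p_baseline, lam=0.7))
```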
Finally, we built the advanced system, trained completely on the HERMES PDA data, bootstrapping the training process with initial alignments obtained with the baseline system. The advanced system was trained on about thirty-eight hours of speech produced by forty-nine speakers; it is a very limited data set, and we held out only two elderly speakers, a male and a female, for testing. This model is relatively small, about four times smaller than the baseline and intermediate models, and it does not require speaker enrollment, so in that sense it is a deployment-friendly system. Here you can see the evaluation of all three systems on the same data set, comprised of conversational speech recorded by the PDA. You can see that the advanced system achieved a thirty-nine point two percent word error rate, which is a dramatic improvement in accuracy relative to the baseline and intermediate systems.

Now we switch to speaker tracking. As you know, speaker tracking is a task aiming to answer the question "who spoke when" in single-channel audio. It can be seen as the concatenation of two subtasks: speaker diarization, which segments the audio into speaker turns and then clusters these segments according to speaker similarity, and speaker recognition, which assigns speaker identity labels to these clusters. In HERMES we deal with two-speaker conversations, typically dialogues. Speaker tracking is used in HERMES primarily for the search, where we only need to know the identities of the participants in the conversation; its secondary use is for enhancement of the transcribed speech, to improve its intelligibility while browsing.

For the diarization, a novel, very effective, and simple technique has been developed; it is described in detail in the paper cited on the slide. Evaluated on NIST telephone data, this technique achieved a 2.8 percent frame error rate; on the HERMES data it achieved a 24 percent frame error rate, meaning the percentage of incorrectly classified frames. The difference in performance is accounted for by the very challenging recording conditions in HERMES.

Now, speaker recognition. In HERMES, speaker recognition is applied to the segments provided by the speaker diarization. This facilitates the speaker recognition, because speaker recognition on unsegmented multi-party audio is extremely challenging. Still, a problem persists: the diarization is not perfect, so the segments that we apply speaker recognition to typically contain frames from both speakers. At the same time, state-of-the-art speaker recognition algorithms are not robust to an interfering speaker, so additional work was needed here. To this end, a very effective approach was developed in the HERMES project that reduces the influence of the interfering speaker; the underlying algorithms and this technique are described in detail in these two publications. The equal error rate on the NIST telephone data is about four percent, and on the HERMES dialogues segmented by the diarization it is about eleven percent; again, the difference is accounted for by the HERMES recording conditions.
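As a rough illustration of the two-speaker diarization subtask, here is a generic frame-clustering sketch; it is not the actual HERMES algorithm (that one is described in the cited papers), and the file name and smoothing parameters are arbitrary assumptions.

```python
import librosa
from scipy.signal import medfilt
from sklearn.cluster import KMeans

# Generic two-speaker diarization sketch: cluster MFCC frames into two
# speakers, then smooth the frame labels into contiguous speaker turns.
audio, sr = librosa.load("dialogue.wav", sr=16000)          # hypothetical file
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13).T    # frames x coeffs

labels = KMeans(n_clusters=2, n_init=10).fit_predict(mfcc)  # 2-speaker assumption
labels = medfilt(labels.astype(float), kernel_size=51)      # smooth spurious flips

# Convert frame labels to (start_sec, end_sec, speaker) turns.
hop_sec = 512 / sr                                          # librosa default hop
turns, start = [], 0
for i in range(1, len(labels) + 1):
    if i == len(labels) or labels[i] != labels[start]:
        turns.append((start * hop_sec, i * hop_sec, int(labels[start])))
        start = i
print(turns[:5])
```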
Finally, we move to the spoken information retrieval. All the metadata extracted from the audio is indexed. What are we indexing? First, the word confusion networks provided by the ASR system, meaning that for each word we index the N best alternatives along with their confidence measures; next, the word time stamps; and finally, the speaker identities associated with the conversation. We defined a query language which enables combining spoken terms and speaker identities in the same query. Our search function returns a list of items ordered by relevance; each item contains the ID of a conversation and the time stamps of the relevant fragments inside that conversation. It also employs a spell checker.

Eventually, we evaluated our end-to-end systems, including ASR, indexing, and retrieval. We tested these systems on the task of conversation retrieval based on content-only queries, which means that we did not use the timing information returned by the search function, and we did not include the speaker identity in the query for this evaluation. We used the same twenty conversations from the male and female elderly speakers that were used in the ASR evaluation. Fifty-five queries were composed manually against these twenty conversations, meaning from one to four queries per conversation, composed by Spanish-speaking people. The idea was to compare the speech search to the textual search, which was considered the reference: for each query, we found and marked the relevant conversations by searching with this query over the verbatim transcripts of all twenty conversations. In general, for each query there was more than one relevant conversation, because the conversations shared more or less the same topics. Then we applied the speech search and used the standard mean average precision (MAP) measure to quantify the retrieval accuracy.

Here you can see four evaluations; each evaluation is represented by two bars, the blue bar being the ASR word error rate and the red bar the MAP percentage. The first evaluation was for the baseline system, then the intermediate, and then the advanced system; all three of these evaluations were done when we indexed only the first, top-best ASR hypothesis. You can see that with the advanced system we achieved seventy percent mean average precision for the search. The final evaluation, for the advanced system, was done with indexing of the entire word confusion network, and it brought seventy-six percent MAP, which means that we are pretty close to the textual search.
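As a rough illustration of how indexing word confusion networks lets a query match more than the 1-best hypothesis, here is a generic sketch; the data structures, toy confidences, and scoring are made-up assumptions, not the actual HERMES index.

```python
from collections import defaultdict

# Generic word-confusion-network index: each WCN slot carries several word
# alternatives with confidences; the index maps a word to its
# (conversation, start_time, confidence) occurrences, so a query can match
# lower-ranked alternatives and not just the top ASR guess.
index = defaultdict(list)

def index_wcn(conversation_id, wcn):
    """wcn: list of slots; each slot is (start_sec, [(word, confidence), ...])."""
    for start, alternatives in wcn:
        for word, conf in alternatives:
            index[word].append((conversation_id, start, conf))

def search(keywords):
    """Rank conversations by the summed confidence of matched query words."""
    scores, hits = defaultdict(float), defaultdict(list)
    for kw in keywords:
        for conv, start, conf in index.get(kw, []):
            scores[conv] += conf
            hits[conv].append(start)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [(conv, scores[conv], sorted(hits[conv])) for conv in ranked]

# Toy example: "diet" is only a second-best alternative but still retrievable.
index_wcn("conv_017", [(12.3, [("died", 0.55), ("diet", 0.30)]),
                       (12.9, [("plan", 0.80)])])
print(search(["diet", "plan"]))
```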
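And here is a worked sketch of the mean average precision measure used in this evaluation, with toy relevance judgments for illustration only.

```python
# Mean average precision over a set of queries: for each query, average the
# precision at each rank where a relevant conversation is retrieved, then
# average over queries.
def average_precision(ranked_ids, relevant_ids):
    hits, precisions = 0, []
    for rank, cid in enumerate(ranked_ids, start=1):
        if cid in relevant_ids:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0

queries = [
    (["c3", "c1", "c7"], {"c1", "c7"}),   # relevant items at ranks 2 and 3
    (["c2", "c5", "c9"], {"c2"}),         # relevant item at rank 1
]
mean_ap = sum(average_precision(r, rel) for r, rel in queries) / len(queries)
print(f"MAP = {mean_ap:.3f}")   # (0.583 + 1.0) / 2 = 0.792
```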
To wrap up: first, it seems that the speech technologies are mature enough to meet the challenges of ambient assisted living applications. Secondly, the availability of domain-specific data proved to be extremely important; on the other hand, many projects cannot afford a large-scale data collection, so today there is a need for data collaboration and data sharing, which would be very useful for the progress in this area. Next, the speaker recognition provides reasonable performance on two-speaker conversations recorded by a distantly placed mobile device. And finally, the advanced speech search technology can overcome a substantial ASR error rate and allows us to approach the performance of textual information retrieval. Thank you.

[Moderator:] Okay, questions?

[Audience member:] Thank you for the talk. I have two questions. The first one: elderly people will use the system over a long time, so you could expect them to enroll, or you could use unsupervised adaptation. So my first question is: if you are now at about thirty-nine percent word error rate, what is your prediction of how far you could get with unsupervised adaptation, or supervised adaptation? And the other question is: with that kind of population you could have dramatic changes, like a person suffering a stroke, which would totally change the voice. Do you have any idea of how you would deal with that?

[Speaker:] Could you repeat the second question?

[Audience member:] The second question is that an elderly person can have a very dramatic change of the voice characteristics, for instance because of a stroke or something like that, which totally changes the voice.

[Speaker:] Okay, so the first question. In general, supervised speaker enrollment can help to bring the word error rate lower, but it complicates the deployment and installation of such a system, and I am not quite sure that the gain in ASR accuracy would pay off at the level of the speech search, because we are not going to publish these transcripts; what we need is just the search. So I am not sure that this complication of the deployment would pay off at the level of the application.

To your second question: yes, absolutely, I agree with you, and this is a research area. Actually, in this project we were confronted with lots of real problems, and this is the first time we faced this one, so I would like to know the answer myself. It may be that speaker enrollment is not so useful here; to some extent you could keep using the system as the voice characteristics degrade, but to what extent, honestly, I do not know. Maybe at some point the user will not be able to use such a system anymore.

[Moderator:] Okay, thank you.

[Speaker:] I think we would need to retrain the speaker models.