0:00:13 Okay, thank you for the introduction, and good morning. This work was done in collaboration with colleagues from the IBM Haifa Research Lab in Israel and the IBM Thomas J. Watson Research Center.
0:00:31 This work is about the European project named HERMES. First I will introduce the project, and then we will discuss our work on speech transcription, speaker tracking, and spoken information retrieval within it.
0:00:53 HERMES is a three-year-long research project in the area of ambient assisted living, partially funded by the European Union. The goal of the project is to develop a personal assistant system for elderly users, to alleviate normal ageing-related cognitive decline by providing memory support and cognitive training.
0:01:26 So the broad idea is to record, in audio and video, the personal experience of the user, in part manually and in part automatically, then to extract metadata from these recordings and to offer the user a certain set of services on top of them.
0:01:51 As for the audio recordings, which are our primary focus here, the user is equipped with a mobile device, which from here on I will call the PDA, a personal digital assistant. With it, the user can record whole conversations of interest at any time.
0:02:18 One of the central services, or applications, that the HERMES system offers to the user is called HERMES MyPast, which is a search over the past experience of the user, in our case recorded in audio. This is my primary focus, and specifically the speech-technology-related part of this application.
0:02:42 So the idea is to let the user submit a query, for example: "What did the doctor tell me yesterday about the diet?". If you look at this query, it is a composite query: it contains spoken words, like "diet", together with contextual attributes, namely the identity of the counterpart in the conversation and its time. The system will then retrieve and present to the user the relevant fragments of the relevant conversations that match this query. The query is composed using a dedicated interface supporting the composition of such queries, rather than in free form.
0:03:32 Now, this is the control flow of the speech technology that supports this application. Basically, what we need to do is the following: we have to extract the speaker identities from the conversations recorded with the PDA and to transcribe the speech to text; then we have to index all this information and be able to search over it, not only fast but also accurately.
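To make this control flow concrete, here is a minimal Python sketch; every function name and value in it is a hypothetical stand-in for the real HERMES components, not code from the project.

```python
# Hypothetical sketch of the control flow: diarize -> recognize speakers ->
# transcribe -> index -> search. All helpers are invented stand-ins.
from collections import defaultdict

def diarize(audio):
    # Stand-in: cut the audio into (cluster, start, end) speaker turns.
    return [("A", 0.0, 4.2), ("B", 4.2, 9.7)]

def recognize_speaker(audio, turn):
    # Stand-in: map a diarization cluster to a speaker identity label.
    return {"A": "user", "B": "doctor"}[turn[0]]

def transcribe(audio, turn):
    # Stand-in: speech-to-text, returning (word, time) pairs.
    return [("dieta", turn[1] + 1.0)]

def index_conversation(audio, conv_id, index):
    for turn in diarize(audio):
        speaker = recognize_speaker(audio, turn)
        for word, t in transcribe(audio, turn):
            index[word].append((conv_id, speaker, t))

index = defaultdict(list)
index_conversation(audio=None, conv_id="conv-017", index=index)
print(index["dieta"])
# [('conv-017', 'user', 1.0), ('conv-017', 'doctor', 5.2)]
```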
0:04:07 Such an application poses certain challenges to the speech processing. First of all, this is open-domain conversational speech, which is always a challenge; furthermore, the recordings are made with a distantly placed PDA device: typically, two people are talking to each other with the PDA placed on the table between them. Secondly, these are elderly voices, which are reported in the literature to pose challenges to ASR systems. And last but not least, massive data collection for training cannot be afforded in such a project.
0:04:52 The target language for the HERMES prototype system was Castilian Spanish. At the beginning of the project we performed a data collection, trying to collect as much audio data as feasible. We collected data from forty-seven elderly and four young speakers. The data was recorded simultaneously by the PDA, which is our target device, and by a headset microphone for research and analysis purposes. In total we collected about forty hours of data, distributed among dialogues, which are our target, freestyle monologues, and read-out text. All this data underwent manual verbatim transcription and speaker labeling.
0:05:41 Now I move to the speech-to-text transcription part. Our work on this part was based on the Attila toolkit developed by IBM. The systems that we used and built within this project are similar to one another in terms of their architecture: they employ two-pass decoding, with feature-space speaker adaptation and discriminatively trained acoustic models in the second pass, and they employ a 3-gram statistical language model.
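As an illustration of this two-pass scheme, here is a minimal sketch, assuming stub decoder and transform-estimation functions; this is not the Attila API, whose actual interfaces are not shown in the talk.

```python
# Sketch of two-pass decoding with feature-space speaker adaptation;
# decode() and estimate_fmllr() are stubs, not the real Attila interfaces.
import numpy as np

def decode(features, model):
    # Stand-in decoder: returns a hypothesis plus a frame-level alignment.
    return "hypothesis", np.zeros(len(features), dtype=int)

def estimate_fmllr(features, alignment):
    # Stand-in: a real system estimates an affine transform x' = Ax + b
    # that maximizes the likelihood of the first-pass alignment.
    dim = features.shape[1]
    return np.eye(dim), np.zeros(dim)

features = np.random.randn(300, 40)      # e.g. 300 frames of 40-dim features

hyp1, align = decode(features, model="first-pass acoustic model")     # pass 1
A, b = estimate_fmllr(features, align)                                # adapt
hyp2, _ = decode(features @ A.T + b,
                 model="discriminative acoustic model")               # pass 2
```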
0:06:15 The development here went through three phases: the baseline system, the intermediate system, and the advanced system. As a baseline we adopted a Spanish ASR system developed by IBM in the TC-STAR European project for the transcription of parliamentary speeches. This is a huge system: its acoustic model contains about four thousand HMM states and about one hundred thousand Gaussians. In the TC-STAR evaluations this system achieved an eight percent word error rate, which is very successful. We evaluated this baseline system on the HERMES data, including read-out text and dialogues recorded with lapel microphones and the PDA.
0:07:00 The word error rates from this evaluation are presented in this table. This evaluation actually revealed a high degree of mismatch between the baseline training conditions, which are parliamentary speeches recorded with close-talking microphones, and the HERMES target conditions, which are free dialogues recorded with a distantly placed PDA. In the table you can see the influence of the linguistic and the acoustic aspects of this mismatch separately; overall, due to both aspects, the word error rate jumps from twenty-four percent, for read-out speech recorded with the lapel microphone, to sixty-eight percent for the dialogues recorded with the PDA.
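For reference, the word error rate quoted throughout is the word-level edit distance between hypothesis and reference, normalized by the reference length; a small self-contained implementation with an invented Spanish example:

```python
# Word error rate: word-level Levenshtein distance over reference length.
def wer(reference, hypothesis):
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # deletions
    for j in range(len(h) + 1):
        d[0][j] = j                      # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(wer("que dijo el medico sobre la dieta",
          "que dijo medico sobre una dieta"))   # 2 errors / 7 words = 0.2857
```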
0:07:50 Next, we built an intermediate system by adapting the baseline language model and acoustic model. The language model adaptation consisted of training a new language model on a subset of the HERMES conversation transcripts and interpolating between the baseline language model and the new one. The acoustic model adaptation was done via speaker enrollment, that is, supervised adaptation of the baseline acoustic model on the HERMES monologue data of the target speaker.
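The language model interpolation step amounts to a linear mixture of the two models. A toy sketch, with invented probability tables and an illustrative weight (in practice the weight would be tuned on held-out data):

```python
# Linear interpolation of the baseline LM with an in-domain LM.
LAM = 0.5   # interpolation weight; illustrative, tuned on held-out data

def p_interpolated(word, history, p_base, p_domain, lam=LAM):
    """P(w|h) = lam * P_domain(w|h) + (1 - lam) * P_base(w|h)."""
    return lam * p_domain(word, history) + (1.0 - lam) * p_base(word, history)

# Toy stand-ins for the two 3-gram models:
p_base = lambda w, h: {"dieta": 0.001}.get(w, 1e-5)
p_domain = lambda w, h: {"dieta": 0.020}.get(w, 1e-5)

print(p_interpolated("dieta", ("sobre", "la"), p_base, p_domain))   # 0.0105
```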
0:08:25 In this table you can see the evaluation of the intermediate system on the dialogues recorded by the PDA, with the contributions of the language model adaptation and the acoustic model adaptation shown separately. Overall, the intermediate system reduces the word error rate from sixty-eight percent to fifty-four percent.
0:08:55 Finally, we built the advanced system, trained completely on the HERMES PDA data, bootstrapping the training process with initial alignments obtained with the baseline system. This advanced system was trained on only about eight hours of speech produced by forty-nine speakers; it is a very limited data set, and we put aside the data of only two elderly speakers, a male and a female, for the test. The resulting model is relatively small, about four times smaller than the baseline and intermediate ones, and it does not require speaker enrollment, so in that sense it is a deployment-friendly system.
0:09:40 Here you can see the evaluation of all three systems on the same data set, comprised of the conversational speech recorded by the PDA. You can see that the advanced system achieved a thirty-nine point two percent word error rate, which is a dramatic improvement in accuracy relative to the baseline and intermediate systems.
0:10:02 Now we switch to the speaker tracking. As you know, the speaker tracking task aims to answer the question of who spoke when in a single channel of audio. It can be seen as the concatenation of two sub-tasks: speaker diarization, which cuts the audio into speaker turns and then clusters these segments according to speaker similarity, and speaker recognition, which assigns speaker identity labels to these clusters.
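A toy illustration of these two sub-tasks, under the simplifying assumption that each segment is already summarized as a small embedding vector; the vectors, the enrolled models, and the greedy clustering are all invented for illustration, not the HERMES algorithm:

```python
# Toy diarization-then-recognition: cluster segment embeddings into two
# speaker clusters, then label each cluster with an enrolled identity.
import numpy as np

segments = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
enrolled = {"user": np.array([1.0, 0.0]), "doctor": np.array([0.0, 1.0])}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Diarization (clustering): seed two clusters with the most dissimilar pair
# of segments, then attach every segment to the closer seed.
i, j = min(((a, b) for a in range(len(segments))
            for b in range(a + 1, len(segments))),
           key=lambda p: cosine(segments[p[0]], segments[p[1]]))
clusters = {i: [], j: []}
for k, seg in enumerate(segments):
    seed = max(clusters, key=lambda c: cosine(seg, segments[c]))
    clusters[seed].append(k)

# Recognition: label each cluster with the best-matching enrolled identity.
for seed, members in clusters.items():
    centroid = segments[members].mean(axis=0)
    label = max(enrolled, key=lambda s: cosine(centroid, enrolled[s]))
    print(f"cluster seeded by segment {seed}: segments {members} -> {label}")
```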
0:10:33 In HERMES we deal with two-speaker conversations, typically dialogues of the target speaker with another person. The speaker tracking in HERMES is used primarily for the search: there we only need to know the identities of the speakers participating in a conversation. Secondarily, it is used to enhance the readability of the transcribed speech while the user browses it.
0:11:05 For the two-speaker diarization, a novel, very effective and simple technique has been developed; it is described in detail in this paper by the second author. Evaluated on the NIST telephone evaluation data, this technique achieved a two point eight percent equal error rate; on the HERMES data it achieved a twenty-four percent frame error rate, that is, the percentage of incorrectly classified frames. The difference in performance is accounted for by the very challenging recording conditions in HERMES.
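The frame error rate mentioned here is simply the fraction of frames assigned to the wrong speaker; a toy example:

```python
# Frame error rate: share of frames whose speaker label disagrees with the
# reference labeling (toy ten-frame example).
ref = ["A", "A", "A", "B", "B", "B", "B", "A", "A", "A"]
hyp = ["A", "A", "B", "B", "B", "B", "A", "A", "A", "A"]
fer = sum(r != h for r, h in zip(ref, hyp)) / len(ref)
print(f"frame error rate: {fer:.0%}")   # -> 20%
```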
0:11:51 Now, speaker recognition. Here, speaker recognition is applied to the segments provided by the speaker diarization. This facilitates the task, because speaker recognition on unsegmented multi-party audio is extremely challenging. Still, a problem persists: since the diarization is not perfect, the segments we apply speaker recognition to typically contain frames from both speakers, and at the same time the state-of-the-art speaker recognition algorithms are not immune to an interfering speaker. So additional work was needed here, and to this end a very effective approach that reduces the influence of the interfering speaker was developed in the HERMES project; it is described in detail in these two publications. The equal error rate of this approach on the NIST telephone evaluation data is about four percent, and on the HERMES dialogues, segmented by the diarization, it is about eleven percent; again, the difference is accounted for by the HERMES recording conditions.
0:13:24 And finally we move to the spoken information retrieval. All the metadata extracted from the audio is indexed. So what are we indexing? First, the word confusion networks provided by the ASR system, meaning that for each word we index its n-best alternatives along with their confidence measures; next, the word time stamps; and finally, the speaker identities associated with the conversation.
0:13:57 We defined a query language which enables combining spoken terms and speaker identities in the same query. The search function returns a list of items ordered by relevance; each item contains the ID of a conversation and the time stamps of the relevant fragment inside that conversation. The search also employs spell checking.
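A sketch of what indexing a word confusion network together with time stamps and speaker identities could look like; the slot structure, the confidence values, and the example query are invented for illustration, not the HERMES index format:

```python
# One confusion-network slot = competing word hypotheses with confidences;
# every alternative becomes a posting carrying confidence, time stamp,
# conversation id, and the speaker identities of the conversation.
from collections import defaultdict

confusion_network = [
    {"time": 12.4, "alts": [("dieta", 0.71), ("vieta", 0.18), ("dita", 0.05)]},
    {"time": 13.1, "alts": [("sal", 0.62), ("sol", 0.30)]},
]

index = defaultdict(list)

def index_cn(cn, conv_id, speakers):
    for slot in cn:
        for word, conf in slot["alts"]:
            index[word].append({"conv": conv_id, "t": slot["time"],
                                "conf": conf, "speakers": speakers})

index_cn(confusion_network, "conv-017", speakers=("user", "doctor"))

# A query combining a spoken term with a speaker identity then reduces to a
# posting-list lookup with a filter, ranked by confidence:
hits = sorted((p for p in index["dieta"] if "doctor" in p["speakers"]),
              key=lambda p: -p["conf"])
print(hits[0])   # {'conv': 'conv-017', 't': 12.4, 'conf': 0.71, ...}
```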
0:14:24 Eventually, we evaluated our end-to-end systems, including ASR, indexing, and retrieval. We tested these systems on the task of conversation retrieval based on content only, which means that we did not exploit the timing information output by the search function, and we did not include the speaker identity in the query. For this evaluation we used the same twenty conversations, from the male and female elderly speakers, that were used in the ASR evaluation.
0:15:02 Fifty-five queries were composed manually against these twenty conversations, which means that from one to four queries apply to each conversation; they were composed by Spanish-speaking people. The idea was to evaluate the speech search against the textual search, which was considered as the reference. So for each query we found and marked the relevant conversations by searching with this query over the verbatim transcripts of all twenty conversations. In general, each query had more than one relevant conversation, because the conversations shared more or less the same topics.
0:15:53 We then applied the speech search and used the standard mean average precision measure to quantify the retrieval accuracy relative to the textual search.
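Mean average precision averages, over queries, the precision measured at each relevant item in the ranked result list. A small self-contained implementation with invented data:

```python
# Mean average precision (MAP) over a set of queries.
def average_precision(ranked_ids, relevant_ids):
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in relevant_ids:
            hits += 1
            precision_sum += hits / rank     # precision at this relevant hit
    return precision_sum / max(len(relevant_ids), 1)

def mean_average_precision(results, judgments):
    aps = [average_precision(r, j) for r, j in zip(results, judgments)]
    return sum(aps) / len(aps)

# Toy example: two queries over ranked conversation ids.
print(mean_average_precision(
    results=[["c3", "c1", "c7"], ["c2", "c5"]],
    judgments=[{"c1", "c7"}, {"c5"}]))       # -> about 0.54
```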
0:16:09 Here you can see four evaluations; each evaluation is represented by two bars: the blue bar is the ASR word error rate, and the red bar is the mean average precision of the search. The first evaluation was for the baseline ASR system, then the intermediate one, and then the advanced one. All three of these evaluations were done with only the first, top-best guess from the ASR being indexed. You can see that with the advanced ASR system we achieved a seventy percent mean average precision of the search. The final evaluation, for the advanced system, was done with the entire word confusion network indexed, and it brings seventy-six percent mean average precision, which means that we are pretty close to the textual search.
0:17:09 And to wrap up. First, it seems that the speech technologies are mature enough to meet the challenges of ambient assisted living applications. Secondly, the availability of domain-specific data proved to be extremely important; on the other hand, many projects cannot afford a big-scale data collection, so they would benefit from collaboration and data sharing, which would be very useful for the progress in this area. Next, the speaker recognition proved to be workable, with a reasonable performance, on two-speaker conversations recorded by such a challenging channel as a distant mobile device. And finally, the advanced speech search technology can overcome a substantial ASR error rate and allows us to approach the performance of textual information retrieval.
0:18:19 Thank you.
0:18:24 Okay, questions?
0:18:27 Okay.
0:18:30 Thank you for the talk. Two questions. The first one: elderly people use this over a long time, so you could expect them to enroll, or you could use that to do an unsupervised adaptation. So my first question is: if you are right now at about a thirty-nine percent word error rate, what is your prediction on how far you would get with unsupervised adaptation, or supervised adaptation? And the other question is: with that kind of population you could have dramatic changes, like a person that has a stroke, which would totally change the acoustics. So do you have any idea of how you would deal with that?
0:19:21 Excuse me, could you repeat the second question?
0:19:24 The second question is: an elderly person can have a very dramatic change of the voice characteristics, for instance because they develop some condition, or have a stroke, something like that, which totally changes the acoustics.
0:19:42 Okay, so the first question. In general, the supervised speaker enrollment can help to bring the word error rate lower, but it complicates the deployment, the installation, of such a system. And I am not quite sure that the gain in ASR accuracy would pay off at the level of the speech search, because we are not going to publish these transcripts; what we need is just to search over them. So I am not sure that this complication of the deployment would pay off at the level of the deployed application.
0:20:35 To your second question: absolutely, I agree with you, and this is, you know, an open research area. Actually, in this project we had to bypass a lot of such real-world problems, and this is the first time... I would like to know the answer myself. You know, it suggests that maybe speaker enrollment is not so useful. To some extent you can keep using this system while the voice characteristics degrade gradually, and it may tolerate them to a degree; I do not know. Maybe at that stage the user will not be able to use such a system at all.
0:21:41 Okay, thank you.
0:21:43 I think we now need to thank the speaker.