Speech Transcript - Language Diversity: Speech Processing In A Multi-Lingual Context

0:00:15	Thank you Isabel very much our feelings are shared. I'd like to thank the organisation
0:00:21	here for inviting me
0:00:22	to be a keynote speaker. It's really an honor. It's also a big challenge, so
0:00:28	I hope I will
0:00:30	make some messages that come across. Well to ... at least some of you.
0:00:35	As Isabel said I've been working in speech and speech processing for many years now
0:00:39	and today I'll
0:00:40	focus mostly on automatic speech recognition. But there's a little bit of context that
0:00:44	I'll talk about first.
0:00:46	So.
0:00:48	At LIMSI - being in Europe of course - we try to work on speech
0:00:53	recognition in a multilingual
0:00:55	context, processing at least a fair number of the European languages.
0:00:59	This isn't really new. We are seeing sort of a regrowth in wanting to do
0:01:03	speech recognition in
0:01:04	different languages but, if you go back a long time ago there was some
0:01:08	research that was there. We just didn't hear about it as much.
0:01:10	And that's probably because there weren't too many common corpora and benchmark tests to be
0:01:16	able to compare results on, and so people tended to report on ... your papers were
0:01:20	accepted more easily if
0:01:21	you used common data. Which is still the case.
0:01:23	And it's logical you want to compare results to other peoples'
0:01:27	results, but now there's more and more data out there, there's more test sets out
0:01:31	there and so we
0:01:32	can do more comparisons and we're seeing more languages covered. So I think that's really
0:01:36	nice.
0:01:37	So I'll speak about some of our research results in Quaero and Babel programs.
0:01:44	Sure.
0:01:46	Is that better? Sorry, o.k. I was popping this morning a bit so I wanted
0:01:52	to be not too close.
0:01:54	So, I'll speak highlights and research results from Quaero and Babel.
0:01:57	And then I want to touch upon some activities
0:01:59	I did with some colleagues at LIMSI and at the Laboratory of Phonetics in Paris
0:02:03	trying to use speech technologies for other
0:02:06	applications, to carry out linguistic studies and corpus-based studies.
0:02:10	And I'll mention briefly a couple of perceptual experiments that we've done
0:02:14	and then finally some concluding results, remarks.
0:02:18	So I guess we probably all agree in this community that we've seen a lot
0:02:22	of progress over the last
0:02:23	decade or two decades.
0:02:25	We're actually seeing some
0:02:27	technologies that are using
0:02:29	speech or are working on it. I think that's kind of fun, it's really nice.
0:02:32	But we see it for a few languages
0:02:34	and as we heard yesterday from Haizhou. He mentioned that about 1 % of the
0:02:40	world's languages are
0:02:41	actually seen in our proceedings, so we have something about them.
0:02:44	That's pretty low, but it's
0:02:46	up there from one or two that we did maybe twenty years ago. We're soaking
0:02:49	up the sun.
0:02:51	One of the problems I mentioned before is that our
0:02:54	technology typically relies on having a lot of language resources and these are difficult and
0:02:59	expensive to get. And so therefore it's harder to get them for limited languages
0:03:03	and current practice still
0:03:05	takes several months to years to bring up systems for a language, if you count
0:03:09	the time for collecting
0:03:10	data and processing it and transcribing it and things like that.
0:03:14	So this is sort of a step back in time and we'll see if we
0:03:17	go back to say the late eighties,
0:03:18	early nineties. We had systems that were basically command and control, dictation, and we had some
0:03:24	dialogue systems, usually
0:03:25	pretty limited to a specific task like ATIS (Air Travel Information System),
0:03:28	travel reservation, things like that.
0:03:30	Some of you here probably know well and some of you are maybe too young
0:03:33	and don't even know it, because the
0:03:35	publications are not necessarily scanned in online, so you don't see them.
0:03:39	But in fact we're now seeing a regrowth in the same activities. When you look
0:03:43	at voice mail and
0:03:44	dictation machines in your phones and the different
0:03:51	the personal assistants that we're seeing that are finally coming out now.
0:03:54	And so that's really exciting to see this sort of a pick up again that
0:03:57	we saw in the past.
0:03:59	And then of course we have some new applications, or somewhat new applications,
0:04:02	that are growing as we've got better processing capability both in terms of computers and
0:04:10	data out there. And so we have speech analytics of both call center data or
0:04:14	meeting type data,
0:04:14	lectures, there's a bunch of things.
0:04:16	We have speech-to-speech translation and also indexation of audio, video,
0:04:21	documents which is used for media mining,
0:04:24	tasks and companies are very interested in that.
0:04:27	And of course for the speech analytics people that are really interested in finding out
0:04:30	what people
0:04:31	wanna buy and trying to sell them things and stuff like that.
0:04:35	So let me back up a little bit and talk about something.
0:04:40	Why is speech
0:04:41	processing difficult? All of us speak easily and,
0:04:46	as was sort of mentioned, this is sort of natural, we learn language, but
0:04:50	I think any of us who learned a foreign language at least as an adult
0:04:54	understand that's a little bit
0:04:55	harder than it really seems and so I learned French when I was ...
0:04:58	... after my PhD. I won't say my age.
0:05:02	And it wasn't so easy to learn, it wasn't so natural and my daughter who
0:05:06	grew up in a bilingual environment speaks
0:05:07	French and English fluently and she's better in the other languages than I am.
0:05:10	So I was a good unilingual American speaker.
0:05:15	No other contact with the other languages.
0:05:17	We need context to understand what's being said, so you speak differently if you're speaking
0:05:22	in public than
0:05:24	if you're speaking to someone who you know. We all know this.
0:05:26	?? projector screen with speech is continuous and so if I'm
0:05:30	talking to you I might say it is not easy or it's not easy
0:05:34	but if I'm talking to my mother it's not easy.
0:05:38	I reduced that it's not easy. Well that's not so clear where the words are
0:05:43	and so I think that we all know once again that humans
0:05:47	reduce the pronunciation in regions that are of low information. Where it's not very important
0:05:52	you're
0:05:53	putting the minimum effort into saying what you want to say.
0:05:56	And of course there's other variability factors that
0:05:58	and I'll also mention that the speaker's characteristic accent,
0:06:01	the context we are in. Humans do very well in adapting to this.
0:06:05	Machines don't, in general.
0:06:08	So here I wanna, since I am taking a step back
0:06:10	in time I wanted to play a couple of very simple samples.
0:06:20	Is there anyone in this room that doesn't know this type of sentence?
0:06:23	You all heard it.
0:06:24	Good! Okay. That's TIMIT, that's going back a really, really long time ago.
0:06:29	I was involved in
0:06:32	some of the selection for TIMIT but not in the non-sense's.
0:06:34	Those were selected to elicit pronunciation variants for
0:06:37	different styles of speaking
0:06:39	and you can hear that in the sample here
0:06:41	even in a very, very simple read text
0:06:44	we have different realisations of the word greasy.
0:06:51	So in the first case we have an S, in the second case we have
0:06:53	a Z.
0:06:53	And we can see that ...
0:06:55	I can't really point to the screen there. I don't have a pointer. Do you
0:06:58	have a pointer?
0:06:59	I think I've refused it before.
0:07:02	So in any case you can see in blue there that S is quite a
0:07:04	bit longer, you can see the voicing and the Z.
0:07:06	And is everyone here familiar with spectrograms? I sort of assumed the worst. It's okay,
0:07:12	good.
0:07:14	So here's another example of more conversational type speech.
0:07:18	And we'll see that people interrupt each other. You can hear some of the hesitations.
0:07:40	So in this example it's office corpus and participants called each other
0:07:46	and they're supposed to talk about some topic that they were given and they have
0:07:49	a mutual topic.
0:07:50	You're supposed to talk about it but not everybody does.
0:07:53	But even in this ...
0:07:55	They don't know each other very well but they still interrupted each other. They did
0:07:58	some turn taking
0:07:58	and you hear you can see the
0:08:00	hmm and laughter and there's someone else in another
0:08:05	presentation. I don't know where it is.
0:08:08	Now I'm gonna play an example from Mandarin and I'm having
0:08:11	confidence that Billy Hartman who is probably in ?? with his wife
0:08:15	gave me the correct translation,
0:08:17	correct text here, because I don't understand it. Which is even more spontaneous so it's
0:08:22	an example
0:08:23	taken from the CallHome Mandarin corpus where we think it's a mother and a daughter
0:08:27	who are talking to
0:08:29	each other about the daughter's job interview.
0:08:32	So for those who speak Mandarin.
0:08:48	So if I understood correctly and the translation is correct
0:08:52	basically talking about it and the mother doesn't understand what the job interview is about
0:08:57	and the daughter says
0:08:58	she says: Don't speak to me in another
0:09:01	language. Speak to me in Chinese. And the daughter says: You wouldn't have understood anyway,
0:09:04	even if I spoke in Chinese
0:09:05	And so I've had some similar situations speaking with my mother
0:09:10	That's what I do.
0:09:13	So now I'm gonna switch gears a little bit and talk about the
0:09:17	Quaero program which is one of the two topics I want to mainly focus on
0:09:22	here, we talk about the speech
0:09:23	recognition in different languages.
0:09:25	This is a large project
0:09:27	in France, it's a research and innovation project
0:09:30	which was funded by OSEO, a French innovation agency.
0:09:34	It was initiated in 2004
0:09:37	but didn't start until 2008 and then ran for almost six years until the end
0:09:41	of 2013
0:09:41	so it's relatively recent that it finished. It was really fun
0:09:45	but when we started putting it together
0:09:47	the web was a lot different than it is now. So as we also heard,
0:09:50	I think it was this morning, there was no YouTube,
0:09:52	no Facebook, no Twitter, no Google Books, iPhones. All that didn't exist.
0:09:57	So life was boring, what do we do with our free time, right?
0:10:01	Instead of spending your time on the ??.
0:10:03	I think it's hard to be in the position of young people who don't know
0:10:07	life without all of this
0:10:08	and my daughter grew up with all of it.
0:10:10	And so it's very hard to relate to what this situation really is
0:10:13	but in any case
0:10:15	to get back to sort of processing of this data
0:10:17	we have tons and tons of data. I read that there's roughly 100 hours of
0:10:24	video uploaded to YouTube every minute.
0:10:27	And that's a huge amount of data and 61 languages. So if we are treating
0:10:31	about 7 of them,
0:10:32	we are not so bad. Maybe we cover the languages doing the videos there.
0:10:35	But we don't know how to organise this data, we don't know how to access
0:10:39	this data.
0:10:39	And so Quaero was trying to aim at this. How can we organize the data,
0:10:43	how can we access it, how can we index it
0:10:45	how can we build applications that can make use of today's technology and do something
0:10:49	interesting with it. I'm not gonna talk about all that. If you're interested I suggest
0:10:54	you go to the Quaero website and
0:10:57	you can find some demos and links and things like that
0:10:59	there. I'm gonna focus on the work that we did in speech processing
0:11:03	and at LIMSI we spoke about, we worked on mostly speech and text processing
0:11:08	including this applied to speech data, so named entity searchable text and speech,
0:11:13	work translation,
0:11:14	both of text and speech.
0:11:17	So here is
0:11:19	showing the speech processing technologies that we
0:11:23	worked on in the project. So the first box we have is audio speaker segmentation
0:11:28	such as chopping
0:11:29	signal and trying to decide speech and nonspeech regions
0:11:32	and
0:11:33	dividing into segments corresponding to different speakers
0:11:37	detecting speaker changes
0:11:39	then we may or may not know
0:11:40	the identity of the language being spoken
0:11:43	so we have a box of language identification if we don't know it.
0:11:47	Most of the time we want to transcribe the speech data because
0:11:51	speech is ubiquitous
0:11:52	there's speech all over the place and it has a high amount of information content
0:11:55	and so we believe it is
0:11:57	the most useful. We work in speech and not in image. Image people
0:12:00	might tell us that
0:12:01	image is more useful for indexing this type of data.
0:12:06	One advantage we have with speech relative to image, that I've just mentioned,
0:12:10	is that speech has an underlying written representation that we all pretty much agree
0:12:17	upon
0:12:17	more or less. We're able to decide where the words are.
0:12:19	We might differ a little bit but we pretty much agree upon it. For images that
0:12:22	is not the case:
0:12:23	if you give an image to two different people
0:12:25	someone will tell you it's a blue house, someone will tell you it's trees in
0:12:28	the park
0:12:29	with a little blue cabin in it. Something like that. You get a
0:12:31	very different description based on what people are interested in
0:12:35	and their manner of expressing things. For speech in general
0:12:38	we're a little bit more normalized we
0:12:40	pretty much agree on what would be there.
0:12:42	Then you might wanna do
0:12:44	other type of processing such as speaker diarization.
0:12:47	This morning doctor ?? spoke about the Twitter statistics during the presidential elections and
0:12:53	that was something we actually worked on in Quaero.
0:12:56	Which was to try and look at a corpus of recordings
0:12:59	and look at speaker times within this corpus of recordings, you might have hundreds or
0:13:03	thousands of hours of recordings and look at how many
0:13:05	speakers are speaking when and how much time is allocated
0:13:08	and that's actually something that's got a potential use, at least in France,
0:13:11	where they control that during the election period all the parties get the same amount of
0:13:15	speaking time.
0:13:16	So you want very accurate measures of who is speaking when
0:13:20	so that everybody has a fair game
0:13:21	during the elections.
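The speaking-time measurement described above can be illustrated with a minimal sketch. It assumes the diarization output is already available as `(speaker, start, end)` tuples in seconds; that format and the function name `speaking_time` are assumptions made for illustration, not something described in the talk.

```python
from collections import defaultdict


def speaking_time(segments):
    """Sum the speaking time per speaker.

    `segments` is an iterable of (speaker, start_sec, end_sec) tuples,
    an assumed format for diarization output.
    """
    totals = defaultdict(float)
    for speaker, start, end in segments:
        if end < start:
            raise ValueError(f"segment ends before it starts: {speaker} {start} {end}")
        totals[speaker] += end - start
    return dict(totals)


# Hypothetical segments for two speakers in one recording.
example_segments = [
    ("speaker_A", 0.0, 10.0),
    ("speaker_B", 10.0, 25.0),
    ("speaker_A", 25.0, 30.0),
]
print(speaking_time(example_segments))
```

Summing over hundreds of hours of recordings is the same loop over more segments; the accuracy of the totals depends entirely on how accurate the segment boundaries and speaker labels are.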
0:13:23	Other things that we worked on were adding the metadata to the
0:13:27	transcriptions, you might add punctuation or markers to
0:13:30	make it more readable, you might want to transform numbers from
0:13:34	words into digit sequences like in newspaper text, for example 1997.
0:13:38	And you might want to identify
0:13:41	entities or speakers or topics that can be useful for automatic processing, so you could put
0:13:47	tags in
0:13:48	where the same entities are.
0:13:50	And then finally the other box there is speech
0:13:52	translation typically based on the speech transcription
0:13:57	but we're also trying to work on having a tighter link between the speech and
0:14:01	the translation
0:14:01	portions, so you don't just transcribe and then
0:14:05	translate.
0:14:06	But we're trying to have a tighter
0:14:08	relation between the two.
0:14:11	Let me talk a little bit now about speech recognition.
0:14:14	Everybody I think knows this box, so basically we have
0:14:18	the main point is just that we have three important models. The language model,
0:14:23	the pronunciation model and the acoustic model.
0:14:25	And these are all typically estimated on very large corpora.
0:14:28	This is where we're getting into problems with the low resource languages.
0:14:32	And I want to give a couple of illustrations of
0:14:35	why, at least I believe (and I've spent effort on pronunciation modeling),
0:14:39	it is really important that we have the right pronunciations in the dictionary.
0:14:42	So we take these two examples, on the left we have two versions of coupon.
0:14:51	And on the right we have
0:14:53	two versions of interest.
0:14:56	So in the case of the coupon in one case we have the ya sound
0:15:00	inserted there
0:15:01	and our models for speech recognition are typically
0:15:04	modeling phones in context.
0:15:07	And so we can see that if we have a transcription just k. u. p.
0:15:11	for it.
0:15:12	We're gonna have this ya there, and in that case it's not gonna be a
0:15:14	very good match to the one that we have in the second case.
0:15:17	That's really a very big difference and also the U becomes almost a fronted EU.
0:15:22	That is not
0:15:23	distinguishable technically in English.
0:15:25	And the same thing for interest. We have interest or interest.
0:15:29	Well in one case you have the N and in the other case you have the TR cluster.
0:15:32	These are very different and you can imagine that if we ...
0:15:35	since our
0:15:37	acoustic models are based on alignment of these transcriptions with the audio signal
0:15:41	if we have more accurate pronunciations we're going to
0:15:44	have better acoustic models at the end and that's what our goal is.
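To make the point about pronunciation variants and context-dependent models concrete, here is a small sketch. The phone symbols and the two variants per word are illustrative assumptions, not the actual dictionary entries used in the systems described; `triphones` is a hypothetical helper that expands a phone sequence into left-context/phone/right-context units.

```python
# Illustrative lexicon with two pronunciation variants per word.
# The phone symbols are assumptions chosen for readability.
LEXICON = {
    "coupon": [
        ["k", "u", "p", "a", "n"],        # variant without the inserted "y"
        ["k", "y", "u", "p", "a", "n"],   # variant with the inserted "y"
    ],
    "interest": [
        ["i", "n", "t", "r", "e", "s", "t"],  # variant with the "tr" cluster
        ["i", "n", "e", "r", "e", "s", "t"],  # reduced variant without the "t"
    ],
}


def triphones(phones):
    """Expand a phone sequence into context-dependent units 'left-phone+right'.

    '#' marks the word boundary.
    """
    padded = ["#"] + list(phones) + ["#"]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


for word, variants in LEXICON.items():
    first, second = (set(triphones(v)) for v in variants)
    print(word, "units only in variant 1:", sorted(first - second))
    print(word, "units only in variant 2:", sorted(second - first))
```

If only the first variant of "coupon" is in the dictionary, audio containing the "y" gets aligned to units such as `k-u+p` that do not match what was said, which is how a wrong pronunciation can degrade the acoustic models trained from that alignment.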
0:15:49	So now I want to speak a little bit about
0:15:52	what we call lightly supervised learning, there's many terms
0:15:55	being used for it now.
0:15:56	Unsupervised training, semi-supervised training, lightly supervised training
0:16:01	and so
0:16:02	basically one goal is that ... and Ann mentioned something
0:16:05	about this yesterday, maybe machines can just learn on their own.
0:16:08	So here we have a machine
0:16:09	he's reading the newspaper, he's looking at the TV and he's learning.
0:16:14	Okay that's great.
0:16:15	That's something we would like to happen.
0:16:17	But we still believe that we need to put some guidance there and so this
0:16:20	is
0:16:21	researcher here trying to
0:16:23	give some information and supervision to the machine that's learning.
0:16:28	When we look at traditional acoustic modeling we typically use from several hundred to
0:16:33	several thousands of hours of carefully annotated data and once again I said before that
0:16:37	this is expensive
0:16:38	and so people are trying to look into
0:16:40	ways to reduce this, reducing the amount of supervision for the training process.
0:16:45	And so
0:16:46	I believe that some people in this room are doing it
0:16:50	to automate the process of collecting the data. To automate the iterative learning of the
0:16:54	systems by themselves
0:16:55	even including the evaluation so having some
0:16:59	data used to evaluate on that is not necessarily really carefully annotated and most the
0:17:03	time it is
0:17:04	but there's been some work trying to use
0:17:06	unannotated data to improve the system
0:17:08	which I think is really exciting.
0:17:11	So we talk about reduced supervision and
0:17:13	unsupervised training that has a lot of different names that are used.
0:17:16	The basic idea is to use some existing speech recognizer,
0:17:19	you transcribe some data, you assume that this transcription is true.
0:17:24	Then you build new models estimating with this transcription and you reiterate and there's been
0:17:28	a
0:17:28	lot of work on it for about fifteen years now
0:17:31	and many different variants that have been explored: whether to filter the data, whether to
0:17:36	use confidence factors, do you
0:17:37	train on things that are only good, do you take things in the
0:17:39	middle range, there's many things you can read about.
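The transcribe, retrain and iterate idea can be sketched as a generic loop. This is a minimal sketch under stated assumptions, not the training recipe used in the work described: `decode` and `train` are hypothetical callables supplied by the caller, and the confidence threshold is just one of the filtering variants mentioned above.

```python
def self_train(seed_model, batches, decode, train, min_conf=0.0):
    """Iteratively retrain a model on its own automatic transcripts.

    seed_model : initial recognizer (for example, borrowed from other languages)
    batches    : list of lists of untranscribed utterances, one list per iteration
    decode     : hypothetical callable (model, utterance) -> (transcript, confidence)
    train      : hypothetical callable (list of (utterance, transcript)) -> new model
    min_conf   : keep only hypotheses with at least this confidence
    """
    model = seed_model
    kept_per_iteration = []
    for batch in batches:
        labelled = []
        for utterance in batch:
            transcript, confidence = decode(model, utterance)
            if confidence >= min_conf:
                # The automatic transcript is treated as if it were the truth.
                labelled.append((utterance, transcript))
        model = train(labelled)
        kept_per_iteration.append(len(labelled))
    return model, kept_per_iteration
```

Passing successively larger batches gives a schedule in which the amount of audio grows from one iteration to the next while every batch is re-decoded with the latest model.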
0:17:42	Something that's pretty exciting that we see in the Babel work
0:17:46	is that even now if we apply this to systems starting
0:17:49	with very high word error rate, it still seems to be converging
0:17:52	and that's really nice.
0:17:54	The first things I'll talk about are going to be the case of broadcast
0:17:57	news data where we have a lot of
0:17:59	data we can use as supervision. And by this I mean we're using language
0:18:02	models that are trained on many
0:18:04	millions of words of text and this is giving some information to the systems. It's
0:18:08	not completely
0:18:08	unsupervised, which is why you see these different names for
0:18:11	what's being done by different researchers.
0:18:13	It's all about the same but it's called by different names
0:18:16	and so here I wanted to illustrate
0:18:19	this was the case study for the Hungarian that we did in
0:18:22	the Quaero program and it was presented at last year's Interspeech, so maybe some of
0:18:26	you saw it, by A. Roy.
0:18:29	And we started off with
0:18:31	having
0:18:33	seed models, at this point up here at about eighty percent, so seed models that come
0:18:36	from other
0:18:37	languages, five languages we took them from, so we did what most people would
0:18:41	call cross language transfer.
0:18:42	These models came from if I
0:18:45	have it correctly: English, French, Russian, Italian and German
0:18:48	and we tried to just choose the best match between
0:18:50	the phone set of Hungarian to one of these languages.
0:18:54	And then we
0:18:56	use this model here to transcribe about 40 hours of data which is this point
0:19:00	here
0:19:01	and this size of the circle
0:19:03	is showing you
0:19:05	roughly how much data is used
0:19:06	so this is forty hours, then we double again here
0:19:09	and go to about eighty hours, and this is the word error rate
0:19:12	and this is the iteration number
0:19:14	and so as we increase the amount of data, we increase the model
0:19:20	size, we have more parameters in the models as we go on
0:19:23	so the second point here is using
0:19:26	the same amount of data, but using more context
0:19:28	so we built a bigger model, so we once again took this model, we redecoded
0:19:32	all the data the forty hours
0:19:33	and built another model and so now we went down to about sixty percent, so
0:19:37	we're still kind of high
0:19:38	we doubled the data again and we're probably at about a hundred fifty
0:19:42	hundred fifty something like that. Then we got down to about
0:19:45	fifty percent. These are all using the same language model
0:19:48	so that wasn't changed in the study
0:19:50	and then finally here we use about three hundred hours of
0:19:53	training data we're down to about thirty or thirty five percent
0:19:56	and of course everybody knows that
0:19:59	these were done with just standard PLP F0 features
0:20:03	and
0:20:04	pretty much everybody's using features generated by the MLP
0:20:07	so we took
0:20:08	our english MLP,
0:20:09	we generated features on the Hungarian data, so that's
0:20:13	cross-lingual transfer
0:20:16	of models there
0:20:17	and we see that it gives a small gain, a little bit,
0:20:20	once our amount of data is fixed
0:20:22	and then we took the transcripts that were generated by this system
0:20:26	here
0:20:26	and
0:20:28	we built an MLP
0:20:29	training an MLP for the Hungarian language and there we also get now about a two
0:20:33	or three percent
0:20:34	absolute gain and we're down to a word error rate of about twenty five percent
0:20:38	which isn't wonderful,
0:20:39	it's still relatively high, but it's good enough for some applications such as media
0:20:43	monitoring and
0:20:44	things like that.
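Since the results here are all quoted as word error rates, a short reference sketch of the metric may help: the minimum number of substitutions, deletions and insertions needed to turn the reference into the hypothesis, divided by the reference length. This is the standard textbook definition and is not the scoring tool used for the numbers in the talk; the function names are chosen for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference token
                            curr[j - 1] + 1,      # insertion of a hypothesis token
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Errors divided by reference length, for any token sequences."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


reference = "it is not easy".split()
hypothesis = "it's not easy".split()
print(error_rate(reference, hypothesis))  # 0.5: one substitution and one deletion over 4 words
```

Passing lists of characters instead of lists of words gives a character error rate with the same function.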
0:20:46	And so this was done with completely untranscribed data and we did it with a bunch of
0:20:49	languages
0:20:50	so now let me show you some results for the ...
0:20:54	I think it's about nineteen languages we did in Quaero, we did more
0:20:56	we did twenty three but this is only for nineteen of them
0:21:00	and if we look here, if you go up to Czech,
0:21:03	these were trained in a standard supervised manner
0:21:05	with somewhere between a hundred and five hundred hours
0:21:09	of data depending upon the language.
0:21:13	And so these with the blue shading were trained in unsupervised manner
0:21:17	once again we have the word error rate on the left and this is the
0:21:20	average error rate across
0:21:22	three to four hours of data per show,
0:21:25	per language, sorry.
0:21:27	And so we can see that while in general
0:21:30	the error rates are a little bit lower for
0:21:33	the supervised training, these aren't so bad, some of them are really in about the same
0:21:39	range and
0:21:40	you have to take the result with a little bit of grain of salt, because
0:21:43	some of these languages here
0:21:45	might be a little bit less well trained or a little bit less well advanced
0:21:49	than the lower scoring languages. These might be
0:21:52	doing a little bit better if we worked more on them.
0:21:56	But this isn't the full story so now I'm going to complicate the figure
0:22:00	and in green you have
0:22:02	word error rate on the lowest file, so that's the audio file that had the
0:22:07	lowest word error
0:22:08	rate per language so these are in green.
0:22:11	Okay, so these files are easy they're probably news like files
0:22:15	okay.
0:22:16	And it gets very low, even Portuguese we're down around three percent for one of
0:22:19	these segments
0:22:20	and then in yellow we have the worst scoring one
0:22:24	and these worst scoring files
0:22:26	are typically more interactive, spontaneous speech: talk shows, debates, noisy recordings that are outside,
0:22:33	there's a lot of variability factors that come in.
0:22:35	So even though this blue curve is kinda nice we really see we have a
0:22:38	lot of work to do
0:22:40	if we want to be able to process all the data up here.
0:22:44	So now I'm going to switch gears and not talk any more about Quaero and
0:22:48	talk a little bit about Babel.
0:22:50	Where it's a lot harder to have supervision from
0:22:53	language model because you are working on languages in Babel
0:22:56	that have very little data, that is hard to get or typically have little data
0:22:59	but not all of them
0:23:00	are really in that situation.
0:23:02	And so this is the
0:23:06	a sentence I took from Mary Harper's slides
0:23:09	that she presented at ?? calling and so the idea
0:23:13	that's being investigated is to apply
0:23:15	different techniques from linguistics,
0:23:18	machine learning and speech processing methods
0:23:20	to be able to do speech recognition for keyword search, and I highly recommend for
0:23:25	people that are
0:23:25	not familiar with Mary's talks to go see them.
0:23:28	I know that the ASRU one is online on Superlectures,
0:23:31	and the ?? column one I don't know, so people here probably know better than
0:23:34	me if ?? is there.
0:23:35	But they're really interesting talks
0:23:38	and if you're interested in this topic I suggest you
0:23:41	go there.
0:23:42	So, keyword spotting. Yesterday Ann spoke about how children can do keyword spotting very young
0:23:49	and so I wanna do a first test for you, because by basic keyword spotting
0:23:54	what I mean is that
0:23:55	you're gonna localise in the audio signals some points where you have
0:24:00	your detected keyword
0:24:02	so these two you detected right
0:24:03	here
0:24:04	you missed it; whatever keyword it was, it occurred but you
0:24:08	didn't get it
0:24:09	and here you detected a keyword but it wasn't there
0:24:11	so that's a false alarm. So here you have the miss, the false alarm and the correct.
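The three outcomes just described (correct detection, miss, false alarm) can be counted with a small sketch. It assumes reference occurrences and detections are given as time stamps in seconds and that a detection counts as correct when it falls within a tolerance of an unmatched reference occurrence; the tolerance value, the greedy matching rule and the function name are assumptions for illustration, not the official scoring procedure of any evaluation.

```python
def score_detections(reference, detected, tolerance=0.5):
    """Count hits, misses and false alarms for one keyword.

    reference : times (sec) where the keyword really occurs
    detected  : times (sec) where the system claims the keyword occurs
    tolerance : maximum time difference for a detection to count as a hit
    """
    unmatched = sorted(reference)
    hits = 0
    false_alarms = 0
    for t in sorted(detected):
        match = next((r for r in unmatched if abs(r - t) <= tolerance), None)
        if match is None:
            false_alarms += 1        # detected, but no keyword there
        else:
            unmatched.remove(match)  # each occurrence can be matched only once
            hits += 1
    misses = len(unmatched)          # keyword occurred but was not detected
    return hits, misses, false_alarms


# Two correct detections, one miss, one false alarm.
print(score_detections(reference=[1.0, 2.0, 5.0], detected=[1.1, 2.2, 8.0]))  # (2, 1, 1)
```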
0:24:15	So now let me play you a couple of samples
0:24:18	and this is actually a test of two things at the same time.
0:24:21	One is language ID, so I'm gonna play samples in
0:24:24	different languages and there's two times six different languages
0:24:27	and there's a common word in all of these
0:24:30	samples.
0:24:30	And so I'd like people to let me know if you can detect
0:24:33	this word, so see if we as adults can do like children
0:24:37	can do.
0:24:58	Do you want to hear it again?
0:25:00	And can we make it a little louder?
0:25:03	Is it possible to be a little bit louder on the audio because I can't
0:25:06	control it here.
0:25:09	I don't think it goes any louder.
0:25:11	I have it on the loudest.
0:25:31	Okay so I'll show you the languages there first. Anyone get the languages? There's probably a
0:25:35	speaker of each
0:25:36	language here, so you probably recognised your own language.
0:25:40	So the languages were: Tagalog, Arabic, French, Dutch, Haitian and Lithuanian.
0:26:08	Shall I play it again?
0:26:09	It's okay?
0:26:11	Alright, so.
0:26:12	So here's this second set of languages that we have there, the last one is
0:26:16	Tamil. I'm not really
0:26:17	sure the end where there were taxi in different places. Google translate told us it
0:26:23	was.
0:26:23	But there might be some native speakers here that can
0:26:26	tell us if that is or not. To me it sounded like income taxes and
0:26:29	sales tax.
0:26:30	But I don't
0:26:33	really know. Google told us that it was: to income from
0:26:35	taxes and sale of taxes, or something like that so anyway so
0:26:41	basically, did everyone
0:26:43	catch the word taxes or only some of you did?
0:26:46	Taxes is one of those words that seems to be relatively
0:26:50	common and
0:26:51	in many languages anyway that's the same thing.
0:26:57	Before talking about keyword spotting I'm not gonna talk about it too much actually, is
0:27:00	I wanted to
0:27:01	show some results on conversational telephone speech. So we'll talk about term error rate here
0:27:06	rather than word error rate, because in Mandarin we
0:27:09	measure the character error rate rather than the word error rate. So for English
0:27:12	and Arabic we're measuring word error rate and for Mandarin it's character
0:27:17	and these results are for I believe the NIST archives of
0:27:21	for
0:27:21	transcription task
0:27:23	and English systems are trained on about
0:27:25	two thousand hours of
0:27:28	data with annotations.
0:27:30	The Arabic and Mandarin systems were probably
0:27:32	trained on about two hundred or three hundred hours of data.
0:27:34	It's quite a bit less
0:27:35	and we can see that the English system gets pretty good. We're down to about
0:27:39	eighteen percent
0:27:40	of the word error rate. The Arabic is really quite high
0:27:43	about forty five percent. Maybe in part due to different dialects
0:27:47	and also maybe in part due to pronunciation modeling because
0:27:50	it's very
0:27:51	difficult in Arabic if you don't have the diacriticised form.
0:27:58	We also at LIMSI work on some other languages
0:28:00	including French, Spanish, Russian, Italian and these are
0:28:03	just some results to show you that we're sort of in the same ballpark
0:28:06	of error rates
0:28:07	for these systems, for once again conversational speech
0:28:10	and these are trained on about a hundred to two hundred hours of data.
0:28:14	Now let's go to Babel which can just be very challenging compared to what we
0:28:17	see here which is
0:28:18	already harder than what we had for the broadcast type data.
0:28:22	And before that I just want to say a few words about what we mean by
0:28:26	low resource language so in general
0:28:28	these days it means it has got low presence on the Internet.
0:28:31	That's probably not what ethnologists in English would agree
0:28:34	upon but I think from the technology community we are gonna say
0:28:37	if you cannot get any data, it's a low resource language.
0:28:40	It's got limited text resources
0:28:42	well at least in electronic form
0:28:45	there is
0:28:46	little or
0:28:47	some, but not too much audio data,
0:28:49	you may or may not find some pronunciation dictionaries and it can be difficult to
0:28:54	find
0:28:54	maybe reliable knowledge about the language if you google different things and you find some
0:29:02	characteristics about the language you get three different people telling you three different
0:29:05	things and you don't really know what to believe.
0:29:08	And one point I'd like to make is that this is true for what we're
0:29:12	calling these low resource languages,
0:29:13	but it is also true many times for different types of applications that in the past
0:29:16	we dealt with,
0:29:17	even in well resourced languages. You might not have any data on the type of
0:29:21	task you're addressing.
0:29:22	So here's an overview of the Babel languages for the first two years of the
0:29:27	program
0:29:27	and I'm roughly trying to give an idea of the characteristics of the languages. I'm
0:29:31	sure that these
0:29:32	are not really a hundred percent correct.
0:29:34	I tried to classify the characteristics into general classes and give something we can
0:29:40	easily understand,
0:29:41	and so for example we see in the list of languages we have Assamese and Bengali,
0:29:46	which are
0:29:47	relatively closely related,
0:29:48	and
0:29:50	Cantonese, Lao
0:29:52	and Vietnamese,
0:29:53	that use
0:29:55	different scripts. Bengali and Assamese share the same written script.
0:29:59	We also have the Pashto,
0:30:02	which uses the Arabic script, so there we have the problem of diacritization
0:30:05	too.
0:30:06	And then we have
0:30:08	Turkish, Tagalog, Vietnamese and Zulu,
0:30:12	which was actually very challenging because there we had clicks we needed to deal with.
0:30:16	So they use different scripts,
0:30:18	some of the languages have tones, so in this case we had four that had
0:30:22	tone,
0:30:22	we were trying to classify the morphology into being easy, medium and hard,
0:30:27	okay, this is not very,
0:30:28	I'm sure it is not very reliable, but basically three of them we consider to
0:30:33	have a difficult
0:30:33	morphology, so that was the Pashto, the Turkish and the Zulu.
0:30:39	And the others are low.
0:30:41	The next column is the number of dialects, and this is not
0:30:44	the number of dialects in the language, this is the number of dialects in the
0:30:48	corpus collected
0:30:48	in the context of Babel.
0:30:50	So in some cases we only had one, as in Lao and Zulu, but in
0:30:53	other cases we had, for Cantonese, as many as
0:30:55	five, and in Turkish as many as seven.
0:30:58	And then once again whether or not
0:31:00	the G2P
0:31:02	is easy or difficult
0:31:04	and so some of them are easy, some of them seem to be hard.
0:31:07	In particular the Pashto,
0:31:09	and for the Cantonese it is basically a dictionary lookup
0:31:14	with a limited character set.
0:31:16	So here, in the
0:31:18	last column, I'm showing the word error rates for
0:31:21	the Babel languages, and it's drawn in a different style.
0:31:24	If you look at the top of the blue bar
0:31:26	that's the
0:31:28	word error rate
0:31:29	of the
0:31:30	worst language. So in this case for the .. in fact for both of them
0:31:34	with the top of the blue
0:31:36	this language here is about
0:31:38	fifty and some percent and sixty and some percent, that's Pashto
0:31:41	and the top of the orange is just showing you the range of the
0:31:45	word error rates across the different languages.
0:31:50	This word error rate I said backwards. This is the best language
0:31:54	and this is the worst language. The top here is Pashto,
0:31:57	which is about seventy percent in one case and
0:31:59	fifty five percent for another
0:32:02	and this is the best which I believe is Vietnamese and Cantonese.
0:32:06	Sorry, if I confused you there.
0:32:12	And I'm wrong again with that too. I mixed up the keyword spotting
0:32:15	So this is, I should've read my notes,
0:32:16	the lowest word error rate was for Haitian and Tagalog
0:32:21	and the highest was for Pashto.
0:32:23	And in this case we had what's called in our community, you can see it
0:32:27	in other papers, the Full LP,
0:32:29	which means you have somewhere between sixty and ninety hours of annotated data for training,
0:32:33	and
0:32:34	there's the LLP, which is the low resourced condition,
0:32:37	which is only ten hours of annotated data per language, but you can use the
0:32:41	additional data here
0:32:41	in an unsupervised or semi-supervised manner.
0:32:46	So some of the research directions that you've probably seen a fair amount of talks
0:32:50	about here
0:32:51	are looking into language-independent methods
0:32:54	to develop
0:32:55	speech-to-text and keyword spotting for the languages looking into multilingual acoustic modeling.
0:33:00	Yesterday there was some talk by the Cambridge people and there was also talk from
0:33:04	MIT
0:33:05	trying to improve model accuracy with these limited training conditions,
0:33:10	using unsupervised or semi-supervised
0:33:13	techniques for the conversational data,
0:33:15	for which we don't have too much
0:33:17	information coming from the language model.
0:33:19	It's a very weak language model that we have,
0:33:22	and trying to explore multilingual and
0:33:24	unsupervised MLP training. And both of those have been pretty successful,
0:33:28	whereas the multilingual acoustic modeling using standard ?? HMMs is a little bit less
0:33:32	successful.
0:33:32	And one other thing that we're seeing
0:33:35	interest in is using graphemic models, because these could sort of avoid the problem
0:33:40	of
0:33:40	having to do grapheme to phoneme conversion,
0:33:42	and it reduces the problem of pronunciation modeling to
0:33:45	something closer to the text normalization you have to do anyway
0:33:48	for language modeling.
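As a rough illustration of the graphemic idea, here is a minimal Python sketch, assuming the "pronunciation" of a word is simply its normalized sequence of written characters. The function names and the normalization choices (NFC, lowercasing) are hypothetical, and this is not the lexicon-building code of any Babel system.

```python
import unicodedata


def graphemic_pronunciation(word):
    """Return the list of graphemes used in place of a phone sequence."""
    # Assumed normalization: NFC plus lowercasing, the kind of step
    # one would apply to language model text anyway.
    normalized = unicodedata.normalize("NFC", word.lower())
    # Keep letters (L*) and combining marks (M*); drop digits,
    # punctuation and whitespace.
    return [ch for ch in normalized
            if unicodedata.category(ch)[0] in ("L", "M")]


def build_graphemic_lexicon(words):
    """Map each word to its grapheme sequence; no G2P rules are needed."""
    return {word: graphemic_pronunciation(word) for word in words}


lexicon = build_graphemic_lexicon(["Babel", "o.k."])
print(lexicon)
```

The only language-specific work left in such a sketch is the normalization step, which is the point made in the talk.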
0:33:51	So now I wanted to talk just
0:33:53	briefly about something that didn't work that we tried at LIMSI. So one of the
0:33:56	languages is Haitian, so this is great, you know,
0:33:58	we work in French, we developed a decent French system,
0:34:01	so why not try using French models to help our Haitian system?
0:34:05	And so the first thing we did was to try to run our French system
0:34:09	on Haitian data. It was a disaster,
0:34:11	it was really bad.
0:34:12	Then we took the French
0:34:15	acoustic models and the language model for the Haitian data, but that also
0:34:18	wasn't very good.
0:34:20	Then we said okay, let's try adding varying amounts of
0:34:24	French data to the
0:34:25	Haitian system. So this is the Haitian baseline, so we have about
0:34:28	seventy and some percent word error rate, so seventy two ??.
0:34:33	If we add ten hours of French we get worse, we got about seventy four
0:34:36	or seventy five.
0:34:37	We add twenty hours, it got worse again.
0:34:39	We add fifty hours, it got worse again. We said oops! This is not working,
0:34:43	stop.
0:34:43	This was
0:34:45	work that we never really got back to. We wanted to look a little more
0:34:48	at trying to understand better why
0:34:49	this was happening. We don't know if it's due to the fact that the recording
0:34:52	conditions were very different, we
0:34:53	don't know if there were really phonetic or phonological differences between the languages.
0:34:58	And then we had another bright idea. We said,
0:35:01	okay, let's not use standard French data, we also have some accented French data from
0:35:05	Africa,
0:35:06	we have some data from North Africa, from,
0:35:10	I don't remember where the other was from,
0:35:12	and so we said let's try doing that.
0:35:14	Same results. We took ten hours of data we had and basically
0:35:17	it degraded the same way.
0:35:18	So we were kinda disappointed by the results and then
0:35:21	dropped working on it for a while.
0:35:22	We hoped to get back to some of this again. There
0:35:24	was a paper from KIT that was talking about using
0:35:30	multilingual and bilingual models
0:35:32	for recognition of non-native speech and that actually was getting some gain, so I thought
0:35:36	that was
0:35:37	a positive result
0:35:39	instead of our negative result here.
0:35:41	Let me just,
0:35:44	one of the.
0:35:46	One of the things we also tried was to do some joint models for Bengali and Assamese,
0:35:50	because we, being naive and not speaking
0:35:52	these languages, decided this was something we could try:
0:35:54	put them together and see if we can get some gain.
0:35:56	In one condition we got a tiny little gain from the language model trained on the joint set of
0:36:01	data, but really tiny,
0:36:03	and the acoustic model once again didn't help us.
0:36:06	And I heard that yesterday somebody commented on it,
0:36:08	saying that they really are quite different languages and we shouldn't be
0:36:11	assuming, just because we don't understand them, that they are very close.
0:36:14	But we did have Bengali speakers in our lab and they told us they were
0:36:18	pretty close,
0:36:18	so it wasn't based on nothing.
0:36:20	So let me just give a couple of results on keyword spotting just to give
0:36:24	you sort of an idea of
0:36:26	what type of things we're talking about and what the results are.
0:36:29	On the left part of the graph I give results
0:36:33	from
0:36:35	2006, it was the spoken term detection task that was run by NIST, and it
0:36:39	was done on more
0:36:41	cases than this one. This is on broadcast news and conversational data, and you can
0:36:45	see that the measure that is used
0:36:47	here is MTWV, which is the Maximum Term Weighted Value, and
0:36:52	I don't wanna go into it,
0:36:53	but basically it's a measure of false alarms and misses, and you can put a
0:36:57	penalty on it.
0:36:58	The higher the number the better, so on the other slide
0:37:00	we wanted a lower number because it was word error rate,
0:37:02	and on these ones we want a high number.
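For readers who want the measure spelled out, here is a minimal Python sketch of a term weighted value. It assumes the form used in the NIST 2006 spoken term detection evaluation, TWV = 1 - average over terms of (P_miss + beta * P_FA), with beta = 999.9 as the false alarm penalty; the function names and the input layout are hypothetical, and this is not the official scoring tool.

```python
def term_weighted_value(stats, speech_seconds, beta=999.9):
    """TWV at one decision threshold.

    stats maps each term to (n_true, n_correct, n_false_alarm).
    speech_seconds is the amount of audio searched, in seconds.
    """
    values = []
    for n_true, n_correct, n_false_alarm in stats.values():
        if n_true == 0:
            continue  # terms with no reference occurrence are skipped
        p_miss = 1.0 - n_correct / n_true
        # Assumed convention: one non-target trial per second of speech.
        p_false_alarm = n_false_alarm / (speech_seconds - n_true)
        values.append(1.0 - (p_miss + beta * p_false_alarm))
    if not values:
        raise ValueError("no term has reference occurrences")
    return sum(values) / len(values)


def maximum_twv(stats_by_threshold, speech_seconds, beta=999.9):
    """MTWV: the best TWV over all candidate thresholds."""
    return max(term_weighted_value(stats, speech_seconds, beta)
               for stats in stats_by_threshold.values())


example = {"term_a": (10, 8, 0), "term_b": (5, 5, 0)}
print(term_weighted_value(example, speech_seconds=3600))
```

A perfect system scores 1, a system that outputs nothing scores 0, and the large beta means a handful of false alarms can cancel out many correct detections.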
0:37:05	And so we can see that for the broadcast news data it was about eighty
0:37:08	two or eighty five percent, and for
0:37:10	the CTS data it's pretty close, up around eighty,
0:37:13	but if we look at the Babel languages now
0:37:16	we are down between forty five... So once again now the
0:37:20	worst language is here, which is around forty five percent for
0:37:24	the full training, again sixty to ninety hours of supervised training,
0:37:27	and the best one goes up to about seventy two percent.
0:37:31	Now I look at my notes so I get these right, and the worst language was
0:37:35	Pashto
0:37:36	and the best languages were Cantonese and Vietnamese.
0:37:39	And this is now the limited condition and you can see that you take a
0:37:43	really big hit
0:37:44	for the worst language here
0:37:47	but in fact on the best ones, we're not doing so much worse. So these
0:37:51	systems were trained on the ten
0:37:52	hours and then the additional data was used in an unsupervised manner,
0:37:56	and then there's a bunch of bells and whistles and a bunch of techniques used
0:37:59	to get these
0:37:59	keyword spotting results that I didn't talk about and I won't talk about,
0:38:02	but there's a lot of talks on it that you'll see here that
0:38:05	you can go to, and I think there are two sessions tomorrow
0:38:07	and maybe another poster.
0:38:09	Once again there's talks from Mary Harper if you're interested in finding out more.
0:38:16	So some findings from Babel, so you've seen unsupervised training
0:38:22	is helping a little bit at least, even though we have very poor language models.
0:38:26	The multilingual acoustic models don't seem to be very successful,
0:38:29	but there is
0:38:31	some hope from some research going on.
0:38:34	The multilingual MLPs are a bit more successful, and there's quite a few papers talking about
0:38:38	that.
0:38:39	Something that
0:38:40	we've used at LIMSI for a while, but was also shown in the Babel
0:38:44	program, is that pitch features are useful even for non-tonal languages.
0:38:48	In the past we used pitch for work on tonal languages and we
0:38:51	didn't use it all the time.
0:38:53	And now I think a lot of people are just systematically using it in their
0:38:56	systems.
0:38:56	Graphemic models are once again becoming
0:38:59	popular and they give results very close to phonemic ones,
0:39:03	and then for keyword spotting there's a bunch of important things:
0:39:06	score normalization
0:39:09	is extremely important, there was a talk at
0:39:11	the last ASRU meeting,
0:39:13	and dealing with out-of-vocabulary keywords. So basically when you get a keyword you don't necessarily
0:39:17	know all those words, and particularly when you have ten hours of data, the transcripts of
0:39:21	that,
0:39:21	you've got a very small vocabulary. You have no idea what type of query a
0:39:25	person will give and you need to do something, do tricks,
0:39:28	to be able to recognize and find these keywords in the audio,
0:39:32	and
0:39:33	typically what is being investigated now is subword units
0:39:35	and proxy type things, and I'm sure you'll find papers on that here.
0:39:39	So let me switch gears now in my last fifteen minutes,
0:39:44	ten minutes. Okay, to talk about some linguistic
0:39:46	studies and the idea is to use speech technologies
0:39:49	as tools to study language variation, to do error analysis,
0:39:53	there are two recent workshops that I listed on the slides.
0:39:58	And I'm gonna take a case study from Luxembourgish, the language of Luxembourg.
0:40:02	This was done working closely with Martine Adda-Decker from LIMSI, who is Luxembourgish, for those
0:40:06	who don't know her.
0:40:08	She says that Luxembourg is really a true multilingual environment, sort of like Singapore,
0:40:13	and in fact it seems a lot like Singapore.
0:40:15	The capital city has the same name as the country for both of these.
0:40:19	Well, it's a little bit different,
0:40:22	it's a little bit warmer here.
0:40:24	But Luxembourg is about three times the size of Singapore
0:40:29	and Singapore has about ten times the amount of people.
0:40:32	So it's
0:40:33	not quite the same.
0:40:34	So the basic question we're asking for Luxembourgish is, given that you've got a lot of
0:40:39	contact with English, French and German, which language is the closest?
0:40:44	And there was a paper, there were a couple of papers that Martine is first author
0:40:48	of,
0:40:48	at different workshops, and the most recent one was at the last SLTU.
0:40:53	This is a plot showing the number of shared words between Luxembourgish, French, English and
0:40:59	German
0:41:00	and so the bottom curve is English,
0:41:02	the middle one is German and the top one is French
0:41:05	and
0:41:06	along the x-axis is the size of the word list sorted by frequency,
0:41:10	and on the y-axis is the number of shared words, and so you can see
0:41:13	that at the low end we've got the
0:41:15	function words, as we expect those are the most frequent in the languages,
0:41:18	then you get more general content words.
0:41:22	And higher up you get technical terms and proper names.
0:41:27	And you can see that in general there's more sharing with French
0:41:31	than with German or English, at least at the lexical level.
0:41:35	And you have
0:41:36	once again the highest amount of sharing when you get
0:41:39	technical terms, and it's because these are shared across languages more generally.
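The curve described here can be sketched in a few lines of Python. This is a minimal illustration, assuming the word list of one language is ranked by frequency and then checked against the vocabulary of another language at growing list sizes; the function names and the toy tokens are made up for the example and are not the data behind the plot.

```python
from collections import Counter


def rank_by_frequency(tokens):
    """Return the distinct words of a corpus, most frequent first."""
    return [word for word, _ in Counter(tokens).most_common()]


def shared_word_counts(ranked_words, other_vocabulary, sizes):
    """For each list size N, count the top-N words also found in the other language."""
    other = set(other_vocabulary)
    return [sum(1 for word in ranked_words[:n] if word in other)
            for n in sizes]


# Toy, purely illustrative data.
tokens = "de de de an an merci computer".split()
ranked = rank_by_frequency(tokens)
other_language_vocabulary = {"de", "merci", "computer"}
print(shared_word_counts(ranked, other_language_vocabulary, sizes=[1, 2, 4]))
```

Running the same function against each comparison language's vocabulary gives one curve per language, which is what the three curves on the slide show.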
0:41:44	So what we tried to do is ask the question: given that we have this
0:41:47	similarity
0:41:47	to some extent at the lexical level, is there this type of similarity at the
0:41:53	phonological level?
0:41:54	And so what we did, we took acoustic models from English, French and German.
0:41:58	We tried to do an equivalence between these
0:42:01	IPA symbols
0:42:02	and those in Luxembourgish. So Martine defined the set of
0:42:05	phones for
0:42:06	Luxembourgish
0:42:07	and then we
0:42:08	made a hacked-up pronunciation dictionary that would allow
0:42:12	a language change to happen after any phoneme, so
0:42:15	this can get pretty big, you have a lot of pronunciations, because
0:42:18	if you had three letters you're going to be able at each point to go
0:42:22	to
0:42:22	the other ones. You can see the illustration here with the path, you can go anywhere.
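To make the size of such a dictionary concrete, here is a minimal Python sketch, assuming each phone of a word can be scored by the acoustic model of any one of the source languages. The language tags, the `phone_language` naming and the example phones are illustrative assumptions, not the actual LIMSI dictionary format.

```python
from itertools import product

LANGUAGES = ("en", "fr", "de")  # assumed tags for the three source models


def multilingual_variants(phones, languages=LANGUAGES):
    """List every way of assigning a source-language model to each phone."""
    return [
        [f"{phone}_{language}" for phone, language in zip(phones, choice)]
        for choice in product(languages, repeat=len(phones))
    ]


variants = multilingual_variants(["m", "o", "l"])
print(len(variants))   # 3 languages, 3 phones: 3 ** 3 variants
print(variants[:2])
```

The number of variants grows as the number of languages raised to the number of phones, which is why this kind of dictionary gets big quickly.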
0:42:25	And then we also trained a multilingual model
0:42:29	on the three sets of data
0:42:30	together, so we took a subset of the English, French and German data and did
0:42:33	what we called a pooled model.
0:42:36	And so for the first experiment we tried to align
0:42:41	the audio data with
0:42:43	these models in parallel, so that the system could choose which acoustic model it likes
0:42:47	best:
0:42:48	the English, French, German or pooled,
0:42:50	and then we did a second experiment, so we trained a
0:42:54	Luxembourgish model in an unsupervised manner, just like I showed for Hungarian, and we said now
0:42:59	let's use that, and we replaced the pooled model with the Luxembourgish model.
0:43:03	And so of course our expectation is that once we put the Luxembourgish model in there
0:43:07	it should get
0:43:08	the data, so the alignment should go to that model, that's what we expect.
0:43:12	And
0:43:12	so here's what we got
0:43:14	so on the left is experiment 1, the one where we have the pooled model.
0:43:16	On the right we have the Luxembourgish model, and
0:43:20	the top is
0:43:20	German, then we have French, English and pooled or Luxembourgish, and so we were really
0:43:25	disappointed. So the first thing we see is, first of all, Luxembourgish doesn't take everything,
0:43:29	and second, we have pretty much the same distribution, there's very little change.
0:43:33	So we said okay, let's try and look at this a little bit more. Martine
0:43:36	said let's look at this
0:43:37	more carefully, because she knows the language.
0:43:39	I was looking at it, and so we looked at
0:43:42	some diphthongs
0:43:43	which only exist in Luxembourgish, and so we had this ?? card base. We were trying
0:43:47	to choose something close when we took
0:43:48	English and French, and now we see the effect that we want,
0:43:51	so originally they went to English, which has diphthongs, or more diphthongs.
0:43:55	And now they go to Luxembourgish. So we are happy we've got some results we
0:44:00	wanted.
0:44:01	We should do some more work, looking at more things, but we are happy with this result.
0:44:07	The second thing I wanted to mention was talking a bit about language change, and
0:44:10	this was
0:44:11	an associated corpus-based phonetic study that was also presented last year at Interspeech by Maria
0:44:16	Candeias.
0:44:17	And we were looking at three different phenomena that are growing
0:44:21	in the society, so you have consonant cluster reduction, so explique,
0:44:26	exclaim, so you have eXCLaim,
0:44:29	you get rid of the ??.
0:44:30	There are too many things to pronounce.
0:44:33	The palatalization and affrication of dental stops, which is a sign of
0:44:37	social status in the immigrant population.
0:44:40	And in fact for me, when you hear the cha or ja, they sound
0:44:44	very normal to me, because we have them
0:44:46	in English and I'm used to it, so we do that in English, and then
0:44:50	the third one is the
0:44:52	fricative epithesis,
0:44:53	which is at the end of a word, you have this
0:44:55	?? type sound. Sorry.
0:44:58	And I'll play you an example
0:45:03	And that was something that I remember very distinctly when I first came to France
0:45:06	I heard it all the
0:45:07	time and women did this. It was some characteristic of women's speech that at the
0:45:12	end there's eesh.
0:45:13	And it's very common,
0:45:14	but in fact it is now
0:45:16	growing more,
0:45:17	even in male speech.
0:45:20	But these were examples that were taken from broadcast
0:45:24	data, so this is people talking on the radio and the television
0:45:27	so you imagine that if they're doing it, it's something that is really
0:45:29	now accepted by the community. That's really a growing trend, and so
0:45:33	Maria was looking at these over the last decade, and so what we did was the
0:45:36	same type of thing. We took
0:45:38	a dictionary and we allowed after Es to have this eesh
0:45:41	type sound. We allowed different phonemes to go there
0:45:44	and then looked at alignments and how many counts we got of the different
0:45:49	occurrences.
0:45:50	And so here we're just showing that between 2003
0:45:55	and 2007
0:45:57	this is becoming longer
0:45:59	and it's also increased in frequency by about twenty percent.
0:46:05	So now let me just, the last thing I wanted to talk about
0:46:07	was human performance, and we all know that humans do better
0:46:10	than machines on transcription tasks,
0:46:12	and machines have trouble dealing with variability that humans do much better with.
0:46:18	So here is a plot of, this is based on some work of doctor ??
0:46:23	and his colleagues.
0:46:24	So
0:46:26	we took
0:46:27	sample stimuli from what the recognizer got wrong. So everything you see is
0:46:32	100 % word error rate by the recognizer, on very confusable little function words
0:46:36	like ah.
0:46:38	And an in English.
0:46:41	Things like that.
0:46:42	And we played these stimuli to listeners, with 14 native
0:46:44	French subjects and 70 English subjects.
0:46:48	And everyone who listened understood the stimuli
0:46:50	and so here you can see that if we give just a local context, a
0:46:53	three
0:46:53	gram context, which is what many recognisers have, to the humans,
0:46:57	they make thirty percent errors on this,
0:46:59	but the system was a hundred percent wrong.
0:47:01	If we up that context to five grams, so we've got one more word each side,
0:47:06	they now go down by about twenty percent. So this is nice, going in the right
0:47:09	direction, the context is
0:47:10	helping us, it seems, a little bit.
0:47:12	And if we go up to seven or nine gram
0:47:14	we are doing a little bit better, but we still have about fifteen percent error rate
0:47:18	by humans on this
0:47:19	task, so our feeling is that these are intrinsically
0:47:21	ambiguous given even a small context. We need a larger one.
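The growing context windows in this experiment can be illustrated with a short Python sketch. It assumes the target word sits in the middle of each n-gram, so a three gram has one word on each side and a five gram has two; the function name and the example sentence are hypothetical and not the stimuli used in the study.

```python
def centered_context(words, index, n):
    """Return the n-gram centered on words[index], clipped at the edges."""
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd number")
    half = n // 2
    start = max(0, index - half)
    return words[start:index + half + 1]


sentence = "we played the stimuli to a group of listeners".split()
target = sentence.index("a")
for n in (3, 5, 7, 9):
    print(n, " ".join(centered_context(sentence, target, n)))
```

Each step adds one word on each side of the confusable target, which is the kind of extra information the listeners received at each condition.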
0:47:26	And just to have some control we also put in some samples where the recognizer
0:47:30	was correct
0:47:30	and here now it's zero word error rate for the recogniser, and you see the humans
0:47:34	make very few errors also,
0:47:36	which comforts us that it's not an experimental problem that we have higher rates for humans.
0:47:41	So I just wanna play one more example
0:47:43	from the human misunderstanding.
0:47:45	This is coming from a French talk show, I think there's enough French people here that will
0:47:49	follow it.
0:47:55	And the
0:47:56	other person
0:48:01	So the error that happens is that one speaker said là aussi
0:48:05	which is very close to là aussi en
0:48:07	which is very close là aussi en.
0:48:10	I pronounce it poorly.
0:48:14	And in fact what was really interesting about this
0:48:18	is that the correction came about twenty words
0:48:21	later than when the person actually said là aussi en.
0:48:24	And so the two that were talking each had their own mindset
0:48:27	and they weren't really listening to the other one completely, and this is once again
0:48:30	a broadcast talk show.
0:48:31	I can play the longer sentence for people later if you're interested.
0:48:35	And so my last slide
0:48:37	is that as a community we are processing more languages and a wider variety of data.
0:48:42	We are able to get by with less supervision, at least for some of the
0:48:47	training data.
0:48:48	We're seeing some successful applications
0:48:50	with this imperfect technology.
0:48:53	Something we'd
0:48:55	like to extend to is to use the technology for other
0:48:58	purposes. We still have little semantic and world knowledge in our models.
0:49:02	And we still have a lot of progress to make, because those word error rates
0:49:05	are still flying and there's a lot of tasks there,
0:49:07	and so maybe we need to do some deep thinking
0:49:11	on how to deal with this.
0:49:13	So that's all.
0:49:15	Thank you.
0:49:20	We have time for some questions?
0:49:35	No questions.
0:49:48	Hi Lori.
0:49:49	Hi Malcolm. In the semi-supervised learning cases do you have any sense
0:49:53	of when things fail?
0:49:54	Why things converge or diverge?
0:50:00	We had some problems with some languages ... this is on broadcast data?
0:50:04	We had some problems if you had
0:50:06	poor text normalisation or if you didn't do good filtering to make sure that the
0:50:09	text data were really
0:50:10	from the language that you were targeting
0:50:12	it can fail, it just doesn't converge. So this is one case, and in fact we
0:50:16	had two languages where the
0:50:17	problem was like that. So basically the word segmentation wasn't good.
0:50:20	I think if you have
0:50:24	too much garbage in your
0:50:26	language model you're going to be
0:50:28	giving poor information. What amazes me, and we haven't
0:50:31	done too much of the work actually ourselves at LIMSI yet,
0:50:35	is that it still seems to be working to some degree for the Babel data,
0:50:39	where we're flying with these word error rates and we have very little
0:50:42	language model data, but probably what we have is
0:50:45	correct, because it's manual transcripts we're using for it,
0:50:48	unlike the case where you're downloading data from the web and you don't really know what
0:50:50	you are getting.
0:50:51	And so if you put garbage in, you're getting garbage out. That's why we need
0:50:54	a human to supervise what's
0:50:55	going on, at least to some extent.
0:50:58	So it was quantified to some extent?
0:51:02	I don't
0:51:03	really have enough information and I know that one of the languages that we tried,
0:51:07	so basically you'd get
0:51:08	some improvement, but you'd stagnate maybe at the level of
0:51:11	the second or third iteration and just not improve further.
0:51:15	It didn't happen too often.
0:51:17	And it's something that
0:51:20	I don't really have a good feeling for. Something I didn't talk about was text
0:51:24	normalisation, that really is an important part of our work. It is something sort of
0:51:27	considered, I think, front-end work
0:51:28	and people don't talk about it too much.
0:51:32	Any more questions?
0:51:35	Well if not, I would like to invite the organiser, our chairman,
0:51:41	to give a token
0:51:43	of our appreciation to Lori.
0:51:46	Let's thank her again.

Keynotes

Lori Lamel, LIMSI-CNRS, France