Good morning, everyone. I'm going to give a sort of retrospective, revisiting some of the work we did before; I hope the links between the pieces will come through. Basically, I'll be talking about semi-supervised and unsupervised acoustic model training with limited linguistic resources. As most of us know, there has actually been a lot of research on this over the last decade. So I'm going to talk about some of the experience we've had at LIMSI with lightly supervised and unsupervised training, and give a couple of case studies. The first case study will be on English, which isn't exactly low resource. Then I'll talk very briefly about some different types of lexical units for modeling, such as graphemic versus phonemic units in Babel. And, as already mentioned, I added a slide on acoustic model interpolation, since we were talking about how to deal with all this heterogeneous data. I'll finish with some comments.
Over the last decade or two we've seen a lot of advances in speech processing technologies. A lot of the technology is getting out there from industrial companies and has become commonplace for many people, so people now expect that this stuff really works. It's great that the technology is really getting out there, but people's expectations are really high. We still have the problem that our systems are pretty much developed for a given task and a given language, and we still have a lot of work to do to get good performance on other tasks and languages: as a community we only cover a few tens, maybe fifty or so, languages now. Many times language variants should actually even be considered different languages; it's just easier to develop a system for each variant. We still rely on language resources a lot, but over the last decade or two we've been reducing the reliance on human intervention, so we can build systems with a little bit less human work.
I guess everybody knows this, or maybe not everybody, if there are some people here who aren't working on speech recognition: our holy grail is technology that works on anything, independent of the speaker and of the task; noise is no problem, changing your microphone is no problem. As someone said, maybe fortunately for us researchers, this remains a dream.
But we do have much lower error rates than we had a decade or two ago. We can process many more types of data, with different speaking styles and different conditions. Originally, the work we were doing always required read speech; who needs to recognize something that was read from a text? It doesn't seem that logical now, looking back at it. We cover more languages, and we have a fair amount of work on enriching the output to make the transcripts more usable by humans or machines, which is not exactly the same thing: you may want quite different information if you're doing downstream processing by machines versus producing something for a human to read.
So what's a low-resource language? I don't really have an answer, but I think many of us in this community typically mean that there aren't many electronic resources, so we don't find information online, because that's what we're using now to develop systems. If you speak to linguists, I think you may get very different answers, and I don't really want to get into that. Basically, we need to be able to find data if we want to develop systems. The languages I'm going to talk about are low resource in the sense that the usual agencies like the LDC don't have resources that they distribute (Google probably has them): can we get data online, and can we develop systems with the data that we find online?
I'm not going to really talk about the Babel-type languages or other rare languages where you don't even have established writing conventions, where you don't necessarily have any information about the language except maybe from some linguists who have spoken with some speakers or visited. I guess we'll go a little bit more in that direction with Marianne in the next part of the talk. And of course there are frameworks, pushed from outside, aiming to do speech translation for these sorts of languages with really no resources. So we have little or essentially no available audio data, probably nothing in terms of dictionaries, you don't even necessarily have word lists, and in general very limited knowledge about the language. But you can also consider that many types of data for well-resourced languages, or for language variants, are almost low resource, because we just don't have much available data for them.
So let me take a little step back in time, to the late nineties and early two thousands. One of the questions that you get all the time from funding agencies is: how much data do you need? We tried to answer this, but I don't think anybody really knows; the honest answer is that it depends where you want to be and what you want to do. The funding agencies were, at least at the time, always complaining that data collection is costly: why are you always asking to fund data? It's a recurrent question.
So these are curves we did back in two thousand, showing, with supervised training on broadcast news data in English, how the word error rate evolves as a function of the amount of training data. (Do I have a pointer? The red one? No? Anyway.) You start with the really high number on the left, which comes from bootstrapping the system with just ten minutes of audio data, together with a well-trained language model. The next point is one and a half hours, where you see that the word error rate is about thirty-three percent; then, as we add more data, it goes down, and once we get to fifty or a hundred hours it sort of starts to plateau, so we're getting diminishing returns from additional data. So one thing we can say: once we get to about a hundred hours of data, basically you don't want to spend a lot of money on additional data, because you're just not getting much return.
Once again, this is on broadcast news data, and we had a reasonably well-trained language model, so we're seeing this asymptotic behavior of the error rate. Something we've observed in the community at large is that when you start a new task, you get rapid progress; it's really fun, because the error rates are dropping by twenty or thirty percent and everyone is happy. But once you get to some reasonable level, we're getting about six percent per year: we did some counting, and if you look over, say, ten or fifteen years of progress, it seems the average improvement is about six percent per year.
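To put that six percent in perspective, here is a quick bit of arithmetic (assuming, as seems intended here, a relative reduction that compounds from year to year):

    # Hedged arithmetic: 6% relative improvement per year, compounded.
    wer = 30.0                                 # illustrative starting WER (%)
    after_10_years = wer * (1 - 0.06) ** 10    # ~16.2%, roughly half
    after_15_years = wer * (1 - 0.06) ** 15    # ~11.9%, roughly forty percent

So a decade of steady progress at that rate only cuts the error rate roughly in half.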
That pace is slow, and if we don't want that, additional data should cost less, and we need to learn how to use it better. This is sort of the message we were trying to get across back in two thousand, and I think it's still quite relevant today.
You can think about different types, or levels, of supervision. Way back, people were saying we should use phonetic, or phone-level, transcriptions for training our phone models. That's logical: it gives you more information, so it should be better than using words. And people did that. Our experience, when we did some tests on this using TIMIT-type data, Switchboard, and BREF, a read-speech corpus in French, is that humans liked the manual segmentations and phonetic transcriptions better, but the systems liked the automatic ones better. So basically, if you used the word-level transcription with a dictionary that covers a reasonable number of pronunciation variants, the systems were better than when trained on the manual phonetic transcriptions. Maybe that would not be true nowadays; I don't know, we haven't redone it.
But that satisfied us enough to say, okay, we can go ahead with this approach where we do the standard alignment of word-level transcriptions with the audio. Then, going to the next level, you can have a large amount of carefully annotated data, where large is around a hundred hours or more. You can have some carefully annotated data plus a lot of unlabeled data with approximate transcriptions; I'm going to give some results on this. Or you can have no annotated data at all but be able to find some sort of related texts or related information, or have some small amount of annotated data, and then use this to bootstrap the systems; that will be sort of semi-supervised. This is what we heard about a little bit yesterday, and it's what people have been doing: you basically transcribe raw data with an existing system, you say this is ground truth, and you do your standard training to build the models. This works. There are lots of variants that have been published: you can filter, you can use confidence measures, you can use consensus networks, you can do ROVER, you can use lattices; lots of different variants.
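As a rough illustration of that basic loop, here is a minimal sketch; recognize() and train_acoustic_model() are hypothetical stand-ins for a real decoder and trainer, and the confidence threshold is illustrative:

    # Minimal self-training sketch: decode raw audio, keep confident output,
    # treat it as ground truth, retrain, and repeat.
    def self_training(seed_model, raw_audio, iterations=4, threshold=0.7):
        model = seed_model
        for _ in range(iterations):
            training_set = []
            for utt in raw_audio:
                hyp = recognize(model, utt)  # hypothetical decoder call
                # Keep utterances whose words all look reasonably confident.
                if hyp.words and min(w.confidence for w in hyp.words) >= threshold:
                    training_set.append((utt, [w.text for w in hyp.words]))
            model = train_acoustic_model(training_set)  # hypothetical trainer
        return model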
I listed some of the early work; in my recollection it was people involved in the EARS program and people involved in projects in Europe. If I forgot people, I'm sorry, I don't mean to, but these are the early adopters of this type of activity that come to mind.
If we just go back to supervised training, and I think most people in this room know this, so I'm not going to stand on it for a long time: you need to normalize the transcriptions, but you do that for the language model anyway, so that's not so bad. You need to do things like creating a word list, you need to come up with phonemic transcriptions, and, in the old days, when we spotted errors in the transcripts we actually spent time correcting them, because we only had thirty or fifty hours and we thought it would gain us something. I think young people today wouldn't even think about doing this, but we spent a lot of time on it. And then you do your standard training.
So this shows results using what we called lightly supervised training. With a fully trained system, the word error rate with manual transcriptions was eighteen percent. We used closed captions to train the language model; I'm showing here several different variants we did. So what we had was a sort of approximate transcription. In these numbers we started with, I think, ten hours of original transcribed data, and we then automatically transcribed varying amounts of raw, unlabeled data. We can use that data unfiltered, which is close to unsupervised training; or we can do what we'd call lightly supervised: keep only the portions where there is not too much of an error rate difference between the transcript we generate and what the caption was. We did this at a sort of phrase or segment level: we kept for training only the segments where the word error rate of the automatic transcription, measured against the captions, was less than some threshold X; I don't remember exactly what it was, probably less than twenty or thirty percent. You can see that we get pretty close: within ten percent absolute of training on the manual transcriptions. In fact, what we mostly do now is just not bother filtering; it's easy, you just train on everything, and it seems to give about the same type of results.
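For concreteness, the selection step looks something like the following sketch, where recognize() and word_error_rate() are hypothetical placeholders for the decoder and a standard WER scorer:

    # Lightly supervised selection: keep segments where the automatic
    # transcript agrees well enough with the approximate closed caption.
    def select_segments(segments, model, max_wer=0.3):
        kept = []
        for seg in segments:
            hyp = recognize(model, seg.audio)   # hypothetical decoder
            # Score the hypothesis against the caption as if it were a reference.
            if word_error_rate(seg.caption, hyp.text) <= max_wer:
                kept.append((seg.audio, hyp.text))   # train on the ASR output
        return kept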
A measure that was introduced by BBN is called the word error rate recovery: basically, you look at how much of the improvement you get from supervised training is also obtained with unsupervised training, starting from the same initial point. What we get here is about eighty-five to ninety percent: we're recovering most of what we could have gotten had we done supervised training.
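The measure itself is simple; here is a small sketch, using the numbers from the ten-minute experiment later in the talk as an illustration:

    def wer_recovery(wer_seed, wer_unsup, wer_sup):
        """Fraction of the supervised gain achieved by unsupervised training."""
        return (wer_seed - wer_unsup) / (wer_seed - wer_sup)

    # e.g. seed at 65%, unsupervised at 37.4%, supervised at 30%:
    # (65 - 37.4) / (65 - 30) = 0.79, i.e. about 79% of the gain recovered.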
One problem with this work is that there was some knowledge in the system: we did have prior knowledge in the dictionary, and we did have a pretty good language model that was close to the data; it wasn't exactly the same data, but it was close. So we discussed this, I think it was at an EARS meeting, or maybe a conference; mostly I was discussing it with Rich Schwartz, and we said, well, let's take it to an extreme: let's see if we can use one hour of training data, or ten minutes of training data. We were crazy enough to do this at the time, and it was a lot of computation, because every time you evaluate with a different language model, or with different amounts of data, you have to decode the data multiple times. These days it would be very easy to run one of these experiments, but at the time it took time. So here we see that if we start with a ten-minute bootstrap system, we get a word error rate of sixty-five percent. We actually didn't think this would work when we did it, but it was okay: we just took some raw data, three to four hours of it, and trained on that. In fact it was an improvement to throw away the original ten minutes; it was just more complicated to merge the models than to drop them. We took the three to four hours of automatic transcripts, kept growing the set, and stopped at about a hundred and forty hours, where we got 37.4 percent. If you use the same language model with the full training data, supervised, you only get to about thirty percent. So we're getting a pretty good part of the way to where we need to be, and we were happy with that.
At about this time a student came to LIMSI, and while we didn't really apply this method in his work, since we didn't have enough audio data, we did try to look at questions that we'd been asked for a long time: how much data do you need to train models? What improvement in performance can you expect when you have limited resources? What's more important, audio data or text data? And how can you improve the language models when you have very little? This was around two thousand four, if I remember correctly, and we had available what I guess we'd consider a reasonably small amount of data: thirty-seven hours of audio, which was reasonably good but not a lot, about five million words of texts, and not much else.
One of the first things he did was to look at the influence of the audio transcripts versus additional texts on the out-of-vocabulary rate. On the left we have the OOV rate. The curves correspond to two hours, ten hours, and thirty-five hours of transcripts, which give about seven K, twenty K, and fifty K words respectively, and they show how the OOV rate goes down as you add more data. Along the bottom x-axis we're adding in different amounts of text data, up to the roughly five million words of text sources. If you start with just two hours of audio transcripts and you add ten K words of text, you don't really lower your OOV rate much; with a hundred K you gain a little bit more, et cetera, et cetera. If you have ten hours of transcripts, there isn't much of an effect. So you see that the effect of adding the text data is less than that of adding the audio transcripts. There's a caveat: we probably have some sort of mismatch, since the audio data is the same type of data we're trying to recognize, while the text data is related but not really the same.
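For clarity, the OOV rate here is just the fraction of running words not covered by the vocabulary; a minimal sketch:

    def oov_rate(vocabulary, running_words):
        # vocabulary: set of known word forms; running_words: list of tokens.
        unknown = sum(1 for w in running_words if w not in vocabulary)
        return unknown / len(running_words)

    # The experiment varies the vocabulary: words seen in 2/10/35 hours of
    # transcripts, optionally extended with frequent words from added texts.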
Here's another curve where we're trying to look at the amount of audio data versus text data for the language model; it's a little bit complicated. The top curve uses just two hours of audio data in the acoustic model; the green one, once again, is ten hours, and the red is thirty-five. You can see that even if you add in more text data, you're not really improving the word error rate much. So we asked: is the limitation coming from the acoustic data, or is it coming from the transcripts? We know the texts are less close to the task. The purple and the blue curves use the language model transcripts from the ten hours and from the thirty-five hours, and we can see that if you only have two hours of audio data, it's just not going to do very well; you need more. Once you get to ten hours, it seems the additional improvement you get is a little smaller. This is interesting because it's close to what's currently being used in the Babel project, where we're actually working with ten hours for some of the conditions.
Let me spend a few minutes on some other work he did on the language modeling side, on word decomposition. Amharic has a very rich morphology and lots of word forms, so you get high out-of-vocabulary rates. A problem for language modeling is that rare forms are not very well modeled, so it's interesting to use word decompounding. When you look at the literature you see mixed results across languages: you get nice gains for some, and for others you don't always get a gain; you get the OOV reduction, but you don't necessarily get it in word error rate. One idea we had in this work was to try to avoid, when you generate the units, creating units that are easily confusable. That's what I'm going to give a couple of ideas about. To do this we built matched conditions: we retrained the language models and the acoustic models for all the conditions.
So this is showing: we used the Morfessor algorithm, which was relatively recent at the time; basic Morfessor decomposes the words. Then there's another method, referred to as Harris, where basically you look at the number of distinct letters that can follow a given string; that gives you an idea of the perplexity at that point. If a lot of different letters can follow, you're likely at the beginning of a new unit; if not, you're likely still within the same morph.
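Here is a sketch of that successor-variety idea, assuming a plain word list as input; the threshold is illustrative:

    from collections import defaultdict

    def successor_variety(words):
        # For each prefix, collect the distinct letters that can follow it.
        followers = defaultdict(set)
        for word in words:
            for i in range(1, len(word)):
                followers[word[:i]].add(word[i])
        return {prefix: len(nxt) for prefix, nxt in followers.items()}

    def split_points(word, variety, threshold=4):
        # A position where many different letters can follow the prefix is
        # likely a morph boundary.
        return [i for i in range(1, len(word))
                if variety.get(word[:i], 0) >= threshold]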
We also tried using distinctive feature properties to bring some speech information into the decomposition, and looked at phonemic confusion constraints that were generated using phone alignments. Basically, if a split would produce two sequences that are easily confusable phonemically, differing only in phones that the alignments show are commonly confused, the split is not allowed; if the resulting units are not easily confusable, the split is okay. These constraints are relatively language-independent, but of course you do need to know the phonemes of the language, or at least to have some phoneme set for it.
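And here is a sketch of how such a constraint can veto a split; the confusable-pair table and phonemize() are hypothetical stand-ins for the phone-alignment machinery used in the real work:

    CONFUSABLE = {("b", "m"), ("p", "b")}    # illustrative phone pairs only

    def confusable(phones_a, phones_b):
        # Two equal-length phone strings differing only in confusable pairs.
        return (len(phones_a) == len(phones_b) and
                all(a == b or (a, b) in CONFUSABLE or (b, a) in CONFUSABLE
                    for a, b in zip(phones_a, phones_b)))

    def split_allowed(parts, lexicon_phones):
        # Reject splits whose pieces would be near-homophones of real entries.
        for part in parts:
            part_phones = phonemize(part)    # hypothetical G2P call
            if any(confusable(part_phones, entry) for entry in lexicon_phones):
                return False
        return True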
This looks at what happens, in terms of the number of tokens you get after the splitting, for the different configurations, and at their lengths. Length here was represented in phonemes, with everything mapped to phonemes, which is why you see values of two, four, six, eight, et cetera. The main point: the baseline, plain words, is the black curve. Once you start cutting, the units get shorter, as you'd expect; that's the goal. And if you use the confusion constraints, the green and purple curves, the distributions are in general a little less shifted to the left, so we're creating slightly fewer very short units, which was what we were trying to do.
Then here's a table that we probably don't want to go into too much, but if you look at the baseline system, we had 22.6 percent; all the numbers are relatively close. If you split without constraints, the error rate in general gets worse; those are the black rows. Using the distinctive features didn't really help either. The only configurations that were slightly better than the baseline were the ones using the phoneme confusion constraint. So it's really important to avoid adding confusion and homophones to your system; that's sort of the take-home message for this.
The other thing is that we got a fifty percent reduction in OOV rates. That was good, except that we were also introducing errors and confusions on the little affixes, and that was compensating for the OOVs we recovered. We did some studies looking at the previously-OOV words: about half of them were correctly recognized using this method, but that gain was offset by errors newly introduced elsewhere.
Just one more slide to sort of wrap up this work; it was more logical to put it here in the talk. We've used unsupervised decomposition, once again usually based on Morfessor or some modifications of it, for Finnish, Hungarian, German, and Russian. For the first three languages we got reasonable gains, between one and three percent, and we could reduce our vocabulary sizes from seven hundred thousand or two million words down to around three hundred thousand, which is a little more usable for the system and probably easier to train reliable estimates for. For some of them we did acoustic model retraining; we did not for German. For Finnish we tried both with and without acoustic model retraining, and both times got about a three percent difference using the morphologically decomposed system, whether or not we retrained the acoustic model. Interestingly for us, Morfessor worked well for Finnish, I think in part because the authors are Finnish, so the output was maybe designed with it in mind. We also tried this on Russian CTS, conversational telephone speech for people who might not know the acronym, where we only had, at the time, about ten hours of training data. We got a reduction in the OOV rate and were able to use a smaller vocabulary, but we didn't get a gain in word error rate. Once again, though, that was very preliminary work.
So now I'm going to shift gears. Where am I on time, fourteen minutes gone? Okay, I'll go a little faster.
Let me speak for a few minutes about Finnish, one of the first languages for which we had no transcribed audio at all, only untranscribed audio. We found some online data with approximate transcripts, coming from a source originally intended for non-native learners of Finnish. There was no transcribed development data either. So how were we going to do this? For many companies it's easy to hire someone to transcribe some data; for us it's not: it takes time to find the person, and for government research labs the hiring is complicated. So can we get ahead by doing something simpler? What we did was use these approximate transcriptions not just for the unsupervised training but also for development. And once again, as I said before, we used morphological decomposition for this. Here's a curve showing the estimated word error rate as we increase the amount of unsupervised training data: two hours, five hours, and then it sort of stabilizes once we get to around ten to fifteen hours. We're getting a gain; the measure is approximate, but it's going in the right direction. About two months later we had somebody come in and transcribe data for us. It was a two or three hour set, not a lot, and it still took a while, first to get the person and then for them to do it. And you can see that the error rate measured on the human transcripts follows exactly the same curve. The true error rates are higher, so we were underestimating, because, as was done for selecting the unsupervised training data, we measured on regions where there was a good match between the approximate transcriptions and what the system produced. But that didn't matter, because it allowed us to develop without having to wait for transcribed data to become available.
So the message there is that unsupervised acoustic model training worked reasonably well using these approximate transcripts. It has since also worked for the language models: we can improve our language models using this sort of approximate transcription. We then added some cross-lingual MLP features to the system, trying both French and English, and got about a ten percent improvement; and, as I said before, we used the morphological decomposition.
So now I'm going to talk about another language that's also considered somewhat low resource: Latvian. This was work done with a colleague at LIMSI who is Russian, so Latvian was sort of an interesting language for him; basically we were starting from almost nothing. The usual agencies don't distribute corpora for it, but you can find text and audio on the net, so it was something we could reasonably do. It's a Baltic language with not so many speakers, about one and a half million. It's a complicated language morphologically, with a lot of cases, and I forget half of it, but the pronunciation is reasonably straightforward.
Here's sort of an overview of the language models. We found a fair amount of data: 1.6 million words of in-domain data and 142 million words of newspaper text, where in-domain means it comes from the sites of radio and TV stations, and the rest is just newspapers. We used about a five-hundred-thousand-word vocabulary, just keeping words that occurred more than three times. The text processing was more or less standard. However, and this is really important, if you don't do the text processing carefully, you have problems getting the unsupervised training to converge; that seems to be our experience. It was pretty much standard language models; we threw in some neural network language models at the end, so given the talk we heard earlier, that line may be of interest.
This figure shows the word error rate as a function of the training iteration; the circles give you roughly the size of the acoustic models. We were roughly doubling, at each step, the amount of audio data used in an unsupervised manner. At the level marked here we added in the MLP features from Russian. Our initial seed models came from a mix of three languages, English, French, and Russian, and the audio data went from about sixty hours at the early stage to about seven or eight hundred hours, raw, at the last stage; you only end up using about half of it when you build the models. And of course something that's important is to increase the number of contexts and the number of states you model at the same time: it doesn't suffice to just add more data and keep the model topology fixed; you don't get much of a gain from that.
Afterwards we did some additional parameter tuning, two-pass decoding, and used a four-gram LM, and you can see the gains over the original. Here 'ci' is case-insensitive and 'cd' is case-sensitive, not context-independent and context-dependent: we're looking at the word error rate taking case into account, because for text people are going to read, you really want the case to be correct, and even for search engines it's sometimes important, because you want to know whether something is a proper name or not. For people interested in neural network language models: they gave about a one and a half to two percent gain. That's on dev data, and when we validated we got pretty much the same results. So we were happy with that: it's completely unsupervised, and we developed the system in less than a month. At the end of this we tried roughly the same thing for Hungarian. We used seed data from five languages; we had less audio data, so we only went to about three hundred hours; and we originally used an MLP trained on English, then used the transcripts at that level to train a Hungarian MLP on the unsupervised transcriptions, which got us another 2.8 percent or so.
Just to give an overview, here are some results from the program, which some of you are aware of, I'm sure, and some of you less so. The systems to the left are trained on supervised data, where the amount of supervised data varies from fifty to a hundred fifty or two hundred hours depending upon the language. The ones on the right are all trained unsupervised. The lower line is the average error rate across the test data, about three hours per language. You see the rates going up as you move right: the unsupervised systems on the right are in general a little bit higher, while the ones on the left not so much; they're pretty good. Bulgarian and its neighbors are a little bit higher here, and then there's Luxembourgish, which I'll come back to in a few minutes. If you look at the lowest word error rates, the segments we had from TV and radio news, they're pretty low; in fact some of the unsupervised ones are even below the supervised ones. And finally, the worst-case word error rate is still pretty high, so we've still got a fair amount of work to do. These data are mixed news and conversations, and some languages are more interactive than others, things like that. I'm going to skip the next slide, which has too much stuff on it; it showed the amounts of data we used, so if people are interested, come find me later.
Now I want to say two or three words about dictionaries. One of the costly things in this whole process is the dictionary, and so more recently there's been a growing interest in using graphemic-type units rather than phonetic units in systems. The first work that I found was Kanthak and Ney; maybe people are aware of earlier work doing this that I'm not. This avoids the production of the pronunciation dictionary: basically, the G2P problem becomes a text normalization problem. You can have numbers and things like that, so you have to convert dates and times and all those types of things into words, and once you've done that, the letters themselves are your units.
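Once the normalization is done, the dictionary itself becomes almost trivial; a minimal sketch, assuming the words have already been normalized:

    def graphemic_lexicon(normalized_words):
        # Each word is "pronounced" as its sequence of letters, so the whole
        # G2P problem reduces to the text normalization done beforehand.
        return {w: list(w.lower()) for w in normalized_words}

    # e.g. graphemic_lexicon(["okay"]) -> {"okay": ["o", "k", "a", "y"]}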
We did this at LIMSI for Turkish, Tagalog, and Pashto within the Babel program, and, like other previous studies, we got roughly comparable results in general; but for some languages we actually do better with the graphemic systems than with the phonemic systems. In fact, I should mention that back in the GALE days there was already work using graphemic systems. Here are some results we have on Pashto in the Babel program: we had a two-pass system using the BBN voice activity detection and bottleneck-type features, and we used both graphemic and phonemic systems. You can see that they're about the same; the phonemic is about one point higher than the graphemic. But if we do a two-pass system, where we cross-adapt between the phonemic and the graphemic, we actually get a reasonable gain from it. We believe that one of the problems in Pashto was actually the pronunciation generation: when the pronunciations are bad, or there's a lot of variability you don't capture, that's where the graphemic systems can actually outperform the phonemic ones.
So let me now speak about Luxembourgish, just because it's fun. This is work done with Martine Adda-Decker, who is from Luxembourg, for those of you who know her. It's a little country with not too many people, but it's a really multilingual environment: when people go to school, their first language of instruction is German, and then, I believe, French and English come in at later stages; but they speak their local language, Luxembourgish, among themselves. And apparently, even though it's a tiny country, and maybe I'm exaggerating a little bit, you even have multiple dialects in different regions. The first studies we did were segmentation experiments to see which languages are favored by Luxembourgish data. We had basically no transcribed data at the time, so we transcribed ten or fifteen minutes of the data ourselves, and then we made some approximate phone mappings. Some Luxembourgish phonemes map pretty well onto French, English, and German; for others, say a vowel that doesn't really exist in English but can be seeded from German or French, we used the closest English phone. That gives a mapping where we have the same number of phonemes, sort of the same phoneme inventory, for each language.
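As a sketch of the kind of mapping table this produces (the phone symbols are made up for illustration, not the actual Luxembourgish inventory):

    # For each Luxembourgish phone, the closest phone in each seed language,
    # so the three seed inventories line up one-to-one.
    PHONE_MAP = {
        "a":  {"de": "a",  "fr": "a",  "en": "a"},
        "oy": {"de": "OY", "fr": "wa", "en": "oy"},  # no close French match
    }

    def seed_inventory(language):
        # The per-language model set used in the parallel alignment.
        return {lb_phone: m[language] for lb_phone, m in PHONE_MAP.items()}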
Then we said, okay, we build the models and put them in parallel, so we have a superset of models, and we try to align the Luxembourgish data with them, using the small amount we had transcribed, and see which languages are preferred. I don't know that much about the language myself, but you can have French words inserted in the middle, so apart from the languages being related, they're also mixed. So we allowed the language to change at word boundaries: within a word you had to use the phones from a given seed model, but the language could change at word boundaries. We found, as you might expect since Luxembourgish is a Germanic language, that the segmentation in general preferred German; second was English, which is closest; but there's about ten percent that went to French. The preference for English typically showed up on diphthong-like sounds; we don't really know exactly why.
Based on that, and a couple of years later, once we'd gotten some transcribed broadcast news data in Luxembourgish, we took our seed models, which are context-independent and tiny, so they're not going to perform well, and just decoded the two or three hours of data. You can see that the word error rates are high, as you'd expect, in the right range for the amount of data and for context-independent models, but the German models were preferred. We also built pooled models, estimated on the data from the languages together, and those did a little better than the German ones. However, before we knew this, because we didn't have that data when we started, we had used the pooled models to do the automatic transcription. Once again we applied our standard techniques, and you can see that we go from about thirty-five to about twenty-nine percent word error rate by doubling the data, increasing the number of contexts, adding MLP features, et cetera; and we were able to model more contexts.
But it's still kind of high compared to some of the other languages, so Martine looked at a classification of the errors, and you can see that basically there are a lot of confusions between homophones. Some of the data is pretty interactive; it's not the same broadcast news data we have for other languages, and there are human production errors: people making false starts, repetitions, mispronouncing something, or speaking at a distance from the microphone. And a large percentage, we estimated somewhere between fifteen and twenty percent, are writing variants, because Luxembourgish is mostly a spoken language: are these really errors or not? Here's an example of some of the writing variants of the word for Saturday; I'm not going to try to pronounce them, and probably not correctly if I did. All of these are allowable written forms: depending on the regional variant you can say it with one initial consonant or another, you can change the vowel, and all of these are accepted in the written form; we find them in the texts, and in what people say. So even though I don't know if people really consider this a low-resource language, there's not much data, almost none available; the language is used for speaking but not really for writing. How am I doing for time? I think it's okay.
I'm going to speak for one minute about Korean, where we're trying to do the same thing, but once again we don't have transcribed dev data. We were doing a study looking at the size of the language model to use for decoding in the unsupervised training: word language models from roughly a hundred twenty thousand words up to two million words, or a two-thousand-unit character language model, just for the decoding. We also looked at using phone units versus half-syllable acoustic units. The only data we had was from the LDC, a roughly ten-hour data set. If we train a standard model on it, holding out the last two files for dev, because we didn't have anything else, we get a word error rate of about thirty percent and a character error rate of about twenty percent on that data. That's probably optimistic, because the dev data comes from the same recordings as the training data, just the last two files, so it's very close to it. So here we're increasing the amount of data we use from the web, and looking at the influence on the word error rate and the character error rate when different-size language models are used for the unsupervised transcription. You can see that for the two-hundred-thousand-word and the two-million-word models it's about the same; sorry, these results are all decoded with the same final language model, so it's only what we use for the unsupervised training that's changing. The results are basically the same as we go through the same data, but the character language model, which lets us skip the word segmentation step, is doing slightly better in terms of character error rate than the others. We don't know if that's real; we need to look into it some more, since this is really recent work. And for people who think it's easy to get transcribers: we've spent a month now looking for someone in France who has the right to work and can transcribe Korean for us. We finally found someone, and they can't start working until February. So yes, it sounds like an easy thing to do, but depending on your constraints for hiring, it's not so easy. We're going to follow up on this more; we hope to have some clearer results in the end.
Two words about acoustic model interpolation, because we spoke earlier about having this heterogeneous data: how do we combine data from different sources? Rich made the statement that you want to use all of it, you don't want to throw data away, but you need data weighting. One way of doing it is just to weight some of the data more and some of it less. A PhD student at LIMSI has been working on acoustic model interpolation, and had a paper at Interspeech, I think, looking at whether, instead of pooling everything, you can train on different subsets and then interpolate the models. He used this on European Portuguese: the baseline with pooling gave 31.7 percent, and the interpolation gave about the same result, but it's easier to deal with, because you can train on smaller data sets and then interpolate; it's the same idea as what's been done for language modeling for years now.
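At the parameter level the idea is directly analogous to language model interpolation; here is a sketch for GMM mixture weights over a shared set of Gaussians (the two-source setup and the weights are illustrative, not the recipe from the paper):

    import numpy as np

    def interpolate_mixture_weights(weight_vectors, lambdas):
        # weight_vectors: per-source mixture weights over shared Gaussians;
        # lambdas: interpolation weights, tuned on held-out data, summing to 1.
        combined = sum(lam * np.asarray(w)
                       for lam, w in zip(lambdas, weight_vectors))
        return combined / combined.sum()    # renormalize for safety

    # e.g. favour the better-matched source:
    # interpolate_mixture_weights([w_broadcast, w_telephone], [0.7, 0.3])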
We also looked at what happens when you interpolate across different variants of English; this is some work that was published back in two thousand ten. Basically, we gained a little bit for some of the variants and didn't degrade on any of them with respect to the pooled model, whereas with MAP adaptation we actually degraded a little on one or two of the variants; I don't remember which ones.
So let me finish up. I guess the take-home message is that unsupervised acoustic model training has been successful. It's been applied a lot to broadcast-news-type data, and more recently, in the Babel project, to a wider range of data. I think it's really exciting that we can do this. We still have to find the data, but it's really nice that this type of approach works. The error rates are still kind of high; but even though they're high, we're going in the right direction in general, and I'm sure Rich or other people can say some more about this during the meeting. Something interesting, and this will make some people from yesterday happy, is that the MLPs seem more robust to the fact that the transcriptions are imperfect: they take less of a hit than the HMMs, which is a sort of interesting observation.
So we can use this untranscribed, automatically transcribed data to produce references for training the MLPs, which is really nice. Our hope is that this type of approach will allow us to extend to different types of tasks more easily, since you don't have to spend the time of collecting the data and then transcribing what you collect; though we still have to collect it. I didn't speak about multilingual acoustic modeling, which is something that, in general, we chose not to do; for bootstrapping we just take models from other languages. Is it better to use multilingual models, can we do better? I think it depends on what you have on hand. It would be nice, but everything we've tried in Babel has gotten worse, so we're a little bit disappointed with what we've been doing there.
Then, of course, there's something I didn't talk about: what do you do for languages that have no standard written form? I touched on it with Luxembourgish, for example. I don't really know; we're trying to do some work on it, and others are too. There's even a paper of ours from around two thousand five, I think, on trying to automatically discover lexical units. But one of the main problems is knowing whether the units are meaningful. I've caught myself saying, a bunch of times, that our systems are going to learn that people say "like" and "you know" and run them into the neighboring word; the word-like unit you get is meaningful in some cases and not meaningful in other cases, so how do you deal with that, and how do you know what's useful?
But I think it's really exciting; it's been fun stuff. I hope those of you who have worked on unsupervised training will continue, and those of you who haven't will give it a try. So thank you for giving me the opportunity to speak. These are all the people I've worked with closely on this work, and there are probably other people I've forgotten; sorry. So thank you.
Thanks, Lori. We have some time for questions.
Question: You showed a gradual improvement as more data comes in. Do you have any idea of what is being improved; I mean, which words are getting better and which ones stay bad?

Answer: Probably not; no, we haven't looked at that, and it's an interesting thing to look at. Something we would like to do, and haven't done yet, is to not just incrementally increase the amount of data, which is what we normally do, but to actually change the data sets, just using different random portions. I think we'd cover things better, because once your models like something, they're going to continue liking it. Looking at the words would be interesting.
Question: If you talk to a machine learning person who works on semi-supervised learning, they get really nervous when you say self-training or semi-supervision, because you're starting with something that isn't working that well, and it can actually go unstable in the opposite direction. So there's this sensitivity: if you're starting with a baseline recognizer trained on a small amount that works reasonably well, you can improve; if you're starting with something that works really badly, it can get worse. And I noticed that a lot of the results in this talk were on broadcast news, where you're starting with a reasonably performing system. For all these languages, how much transcribed data did you start with in the language?

Answer: We had nothing; we started from zero, zero transcribed data in the language. For all the languages here to the right we have zero in-language data, and we started with seed models, context-independent phone models taken from another language and reduced to the bare minimum, so you're at roughly sixty to eighty percent word error rate when you start on your data. So we're starting really high. But our language models, even though they're trained on newspaper text, newswire text, and things like that, are pretty representative of the task; there can be very strong constraints in there. Which is why I find the Babel work really exciting, though we haven't done the unsupervised part ourselves there: it's really exciting because they don't have this strong constraint coming from a language model. All you have is a small amount of transcriptions, around ten hours, so there the information coming into the system is coming from the text, and that, I personally believe, is why it works.
One thing to add: if you don't normalize correctly, and we had situations where people said, I don't know how to pronounce numbers, so I'm just going to keep them as digits, convergence is a lot harder; it doesn't work very well. If you say you're going to throw away the material with numbers, which is what some of the people doing the language modeling did, it also doesn't work so well. So you really need something that represents the language pretty well, it seems, from my experience. We also had problems for some languages when you take texts that are online: you get texts in other languages mixed in, and you actually have to filter out the texts that are not in the language you're targeting.
Did you want to come up during the questions? Okay. We have time for one last question.
Question: Depending on the application, at the end of the day you may want to have readable text, for example for translation or for broadcast news, and at that point things like getting the names right are probably more important than a percent of word error rate. What about case and punctuation?

Answer: All our systems produce case and punctuation, but we're not measuring the word error rate on the punctuation; all the systems produce punctuated, case-sensitive output.
The named entities are not specifically detected, but hopefully, if the proper nouns were uppercase in the language model training data, they'll come out right. That's actually why I had that slide with the case-sensitive and case-insensitive word error rates: there's about a two percent difference between them. The punctuation is a lot harder to evaluate. In some work we did with a colleague who has since left, we tried to evaluate the punctuation against references, and it's very difficult: if you give the task to humans, they don't agree on how to punctuate things, maybe especially not for speech. For full stops the inter-annotator agreement is closer to eighty percent, and if you go to commas it drops considerably. So it's very hard. But what you really want is something that's acceptable; you don't really care about matching a single ground truth. I think it's sort of like translation: you don't really care exactly what it is as long as it's reasonable, reasonably correct punctuation where multiple forms are possible, just as you can translate something in multiple ways, and if you produce one of them, that's correct. I think punctuation falls in the same category: whether or not you use a comma after something is not very important as long as it can be understood correctly. So I think it's a really hard problem to evaluate; in fact, evaluating it well is even harder than producing something that seems reasonable.
Question: Do you do the punctuation as a separate component? Answer: Yes, the punctuation is done in a post-processing step. Other sites have done it too; BBN has done punctuation, and other sites as well, over the past ten years or so. But you're right. Okay.