Okay, thank you very much. Thanks for having me here and giving me the chance to talk about our work.
So I would like to talk about how we can build systems with very low resources, and, very much as Lori did, I would like to give a brief definition of what we consider low-resource languages. It is perhaps not surprising that if you ask linguists, each would give you a different answer, but the criteria people converge on are these: an under-resourced language is a language which lacks electronic resources for speech and language processing, lacks presence on the web, lacks a writing system or a stable orthography, and may lack linguistic expertise. A language qualifies as under-resourced if just one of these criteria applies, but in many cases several of them apply at once. Some people say under-resourced, some people say low-resource, or less-resourced, or whatever, but it all means the same thing.
However, what is really different is minority languages: a minority language is not necessarily low-resourced, and a low-resource language is not necessarily a minority language, so these two notions have to be kept apart.
We tried to put together some definitions; there will be a special issue of Speech Communication, which I was lucky to put together with Laurent Besacier, Etienne Barnard, and Alexey Karpov, about under-resourced languages and how to build ASR systems for them. It will be out in January of next year, so twenty fourteen.
So, the ideal case, I consider, and I think many people agree, is that you have plenty of resources. That means you know the phone set of the language, you have plenty of audio with corresponding transcripts, you have the pronunciation rules, you have experts who build the pronunciation dictionaries for you, and you have plenty of text data. Then you can build ASR systems, NLP or machine translation systems, and also TTS systems, for that matter. That would be the ideal world.
However, the world is not ideal, and I was asked to talk about low resources: what do we do if we have very little or no data? I would like to present a few of the solutions we are currently working on. The first one addresses the case where you do not have any transcripts at all in the language you would like to work on. The second is when you do not have any pronunciation dictionary: how can you dig into, for example, the web to get pronunciations? The third is some work on what we are currently doing if there is no writing system at all for a language. And then: what can you do if you do not have any linguistic expertise, so no language experts at hand?
For all the solutions I would like to show today, my general underlying assumption is that I want to leverage existing knowledge and data resources which I have already gathered from many languages. I sort of believe that if you have seen a lot of things in many languages, it should help you to build the resources for the next language. So to me, the holy grail of rapid adaptation to new languages, and probably also to new domains, is this: you have plenty of data in many different languages, you derive a global inventory of models from it, and then, with a tiny bit of data, or maybe next to zero data, you basically adapt to that particular language.
When I started my thesis with Alex Waibel, he pushed me very much: if you want to do something with many languages, it would be good to have many languages. So I started out collecting a large corpus of data, which became GlobalPhone. I wanted something which is uniform, in the sense that it exists in many different languages but has the same form and the same style and so on. I started doing this in ninety-five, and ever since we keep collecting data, so we have now accumulated twenty-one languages; for each language we have twenty hours spoken by one hundred speakers. Altogether this is a small amount compared to what people use today, but it covers many languages. Over the years we have built recognizers in all these languages, so we have a pool of knowledge in the acoustics and also in dictionaries and text, and I would like to show you how to leverage this for situations where we do not have any resources.
The underlying idea, when I started with my thesis (that was still Gaussian mixture model time, so pre-DNNs), was to share acoustic models across languages in order to get some sort of global phone inventory. At that time, and others did this as well, we used the IPA, the International Phonetic Alphabet, to share data across languages: whenever two phones have the same IPA representation, they are jointly trained in a language-independent fashion. One observation we made, and this is a graph where we do this for twelve languages, sharing based on IPA, was that the curve just keeps going straight up, and that was only twelve languages. So my expectation was that this would keep growing: the more languages we see, the more diversity we get, and we are not anywhere near asymptotic behavior.
Then, as we heard yesterday in these wonderful talks from Morgan and from Frank Seide about DNNs: why am I so excited about DNNs? Because for the very first time we see not only that they work well for bootstrapping new languages, but also something I had never seen before in acoustic modeling. Whenever I shared acoustic models across languages, I would get worse performance on the training languages; I never managed to get better performance on the languages used for training, I only got slightly better performance on new, unseen languages. Now, with DNNs, for the very first time, if you train a multilingual network, you can even outperform the monolingual systems on the languages which are in the training set. That makes me very happy, and I think it gives hope for using them to bootstrap new languages.
The first thing I would like to talk about is what can be done if there are no transcriptions at all. So you have a new language; you do have, let's say, a pronunciation dictionary and a good amount of speech, but you do not have any transcriptions. For this experiment we took Czech, which does not mean that I consider Czech to be a low-resource language. There were two reasons for taking Czech: first of all, the conference was in the Czech Republic, so we thought this was an appealing idea; the other reason is that we had four other Slavic languages which are related to Czech, and we wanted to figure out whether we can gain from having related languages. So we had two sets of experiments. We took four Slavic languages, namely Croatian, Russian, Bulgarian, and Polish, in the hope that the relatedness might help for bootstrapping Czech. And then we took another situation, which is probably more realistic, namely that you already have ASR systems in resource-rich languages, that is English, French, German, and Spanish. We basically said: okay, we have speech, we have a dictionary, and we have a language model in Czech; what we are lacking are the transcripts. And the idea is that we wanted to leverage knowledge we had already gained in many other languages.
The idea goes as follows. We take the source languages, in this case Polish, Croatian, Bulgarian, and Russian, and the recognizers we have in these languages. Now we take our target Czech dictionary and language model, which we assume to be given, and we express the Czech dictionary in terms of Polish sounds. This allows us to use the context-dependent acoustic models from our Polish ASR system. We do the very same thing for Croatian, Bulgarian, and Russian, so we basically create four different dictionaries, each expressed in the phone set of one source language: they contain Czech words, but expressed in Croatian phones, Russian phones, and so on.
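To make the mechanics concrete, here is a minimal sketch of remapping a target-language dictionary into a source-language phone set; this is my own illustration, not the actual GlobalPhone tooling, and the mapping table and phone symbols are hypothetical toy values.

```python
# Sketch: express a Czech dictionary in the Polish phone set via an
# IPA-motivated mapping table (toy rules; the Q&A mentions that on the
# order of fifty such rules per language pair suffice).

czech_to_polish = {
    "r_hacek": "Z",   # Czech ř has no Polish counterpart; Polish ż [ʐ] is closest
    "c_hacek": "tS",  # č maps to Polish cz
    "i:": "i",        # long vowels collapse onto short ones
}

def remap_entry(phones, mapping):
    """Replace each target phone by its source-language counterpart;
    phones with identical IPA symbols pass through unchanged."""
    return [mapping.get(p, p) for p in phones]

czech_dict = {"řeka": ["r_hacek", "e", "k", "a"]}   # "river"
polish_variant = {w: remap_entry(p, czech_to_polish) for w, p in czech_dict.items()}
print(polish_variant)   # {'řeka': ['Z', 'e', 'k', 'a']}
```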
So now we can do the following: we run the Polish recognizer on the Czech data, and it gives us Czech words; we have the Russian dictionary and the Russian ASR, the Polish one, and so on, so we basically get four hypotheses of the Czech words. Then we leverage a confidence score which had been proposed by Thomas Kemp and Thomas Schaaf, called 'A-stabil'. What we basically do is calculate how often the same word appears in the hypotheses from the different recognizers: this is the Czech hypothesis from the Polish recognizer, this is the Croatian one, the Bulgarian, the Russian. Whenever two or more languages agree on the same word, we assume this word to be correctly identified, and it can then be used for training. So the whole idea is to leverage the votes of models from different languages.
The reason why we think this works quite well is the following: if you take the usual confidence score and apply it with just one language, we know that the confidence score is rather unreliable. This plot shows the word error rate over the confidence score threshold, and if you have just one language, and you really start out with very poor performance, then even a very good confidence score is not very reliable. But if you do this for several languages, then whenever at least two languages agree on the same words at the same positions in the hypotheses, you get a much more reliable behavior of the confidence score. And we learned that a threshold of roughly one divided by n plus an offset, where n is the number of languages applied, is a good value.
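As a concrete illustration of the voting idea, here is a minimal sketch: count, per word position, how many source-language recognizers agree, and accept words whose agreement exceeds the 1/n-plus-offset threshold. The position-wise alignment is a simplification (the real A-stabil works on properly aligned hypotheses), and the offset value is a placeholder.

```python
from collections import Counter

def multilingual_astabil(hypotheses, offset=0.2):
    """hypotheses: one word sequence per source-language recognizer.
    Returns (position, word) pairs accepted as reliably transcribed.
    Simplification: assumes the hypotheses are already position-aligned."""
    n = len(hypotheses)
    threshold = 1.0 / n + offset        # rule of thumb quoted in the talk
    accepted = []
    for pos in range(min(len(h) for h in hypotheses)):
        word, votes = Counter(h[pos] for h in hypotheses).most_common(1)[0]
        if votes / n > threshold:
            accepted.append((pos, word))
    return accepted

# Czech word hypotheses from four recognizers (Polish, Croatian,
# Bulgarian, Russian acoustic models):
hyps = [["dobry", "den", "praha"],
        ["dobry", "den", "draha"],
        ["dobry", "ten", "prata"],
        ["dobry", "den", "brada"]]
print(multilingual_astabil(hyps))   # [(0, 'dobry'), (1, 'den')]; position 2 disagrees
```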
The whole framework then works as follows. You create your source recognizers with the mapped dictionary and run them on your Czech audio data to produce transcriptions. Once you have the transcriptions, you use the multilingual A-stabil score to find which parts of the transcriptions match, and then you adapt the acoustic model of each source recognizer using the transcriptions you just derived: the Polish recognizer gets tuned towards Czech, the Russian recognizer gets tuned towards Czech, the Croatian one, and so on, so everything basically gets tuned in to Czech. Then you throw everything away except the adapted acoustic models, and we iterate: we repeat the whole process, each time throwing the transcriptions away. Once we have a certain amount of data automatically transcribed this way, we use a multilingual phone inventory to bootstrap the Czech system.
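Tying the steps together, the iteration might look like the following structural sketch; `Recognizer` is a trivial stub standing in for a real ASR system, `multilingual_astabil` is the voting sketch from above, and all method bodies are placeholders rather than the actual implementation.

```python
class Recognizer:
    """Stub for a source-language ASR system with a remapped Czech dictionary."""
    def __init__(self, name):
        self.name = name
    def decode(self, audio):
        # A real system would decode the Czech audio with the remapped
        # dictionary and the Czech language model; toy output here.
        return ["dobry", "den"]
    def adapt(self, audio, reliable_words):
        # A real system would adapt its acoustic model (e.g. MAP adaptation,
        # as mentioned in the Q&A) on the segments that passed the threshold.
        pass

def bootstrap_target(recognizers, audio, iterations=3):
    reliable = []
    for _ in range(iterations):
        hyps = [r.decode(audio) for r in recognizers]      # four hypotheses
        reliable = multilingual_astabil(hyps)              # keep agreed words
        for r in recognizers:
            r.adapt(audio, reliable)                       # tune towards Czech
    return reliable   # automatic transcriptions for bootstrapping the Czech system
```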
We were quite happy with the performance we saw. What you see here is the first iteration, the second, the third, then the bootstrapping, and the final step after iterating one more time, and this is the performance you get with the Bulgarian, the Croatian, the Polish, and the Russian acoustic models. The take-home message is: if you use the related languages, the best system we can come up with is at 22.7 percent, which is very close to the baseline we get if we assume that twenty hours of transcriptions are given. So this is the supervised case, and this is the fully unsupervised case for the related setting. And this little one here is the result for the non-related, resource-rich languages; the best number we got there is 23.3 percent. So we are still within range; even though we do not get as close as in the related case, I believe the difference is marginal.
You may wonder about two things. One is: does this work for conversational speech? That is something we are currently trying; what I showed is more planned speech, and we are in the process of doing the same for conversational speech. The other question you may have is why the results work so well at all; please keep in mind that we only have twenty-three hours of training data in this setup and a rather low perplexity, because it is newspaper articles.
We of course wondered whether we would gain something by increasing the number of source languages, so we did the same thing, this time for Vietnamese, and we looked at the difference between using two, four, and six source languages. What we discovered over the iterations is the following: the bars show the amount of data we could extract for transcription, and the curves show the quality of the transcriptions. The quality of the transcriptions seems to go slightly up if we use more languages, but above all we can derive more transcription data: the amount of data we can extract is larger. And looking at the performance: with six languages we finally get 16.8 on Vietnamese, which compares to 14.3 if we build the system in a supervised fashion.
Now, the caveat here is: if we have an expert, a native speaker who knows how to build systems, he or she could get something in the range of 11.8. So the gap, the gain you get by doing language-specific things, is enormous, I would say, and significantly larger than the gap between supervised and unsupervised. If you have the choice, having somebody who knows something about the language and does the tuning and tweaking, like, for Vietnamese, tone modeling, pitch features, using multi-syllable units and so on, gives larger gains than the discrepancy between transcribed and untranscribed data.
Okay. So this was about what we can do if we do not have transcriptions; the second thing I would like to discuss is what you do if you have no pronunciation dictionary. Again, the idea is to leverage what we already have from other languages: we assume we have seen a lot of scripts and a lot of pronunciation rules, and we would now like to generate the dictionary for a new language.
Let me first talk briefly about writing systems. First of all, of course, we wondered how many languages are written at all: does it, for example, make sense to look into languages that do not have a writing system? If you consider Omniglot, which is a site maintained by Simon Ager, it lists about seven hundred and fifty languages which have a script, and he estimates that the true number of languages with a script is close to two thousand. That basically means that the majority of the other six thousand or so languages do not have a written form, so this seems to be a rather serious issue.
If you look at writing systems, there are different types. There are the logographic ones, which means the characters of the writing system are based on semantic units: the graphemes represent meaning rather than sound; a typical example is Chinese Hanzi. Then there are scripts which are called phonographic; here the graphemes represent sounds rather than meaning. And there are different forms of phonographic scripts: in segmental ones, one grapheme roughly corresponds to one sound, which is very convenient for building dictionaries; in the syllabic case, a grapheme may represent an entire syllable; and then there are featural ones, like Korean, where one grapheme may represent an articulatory feature.
But if you look at the world, most parts of the map are green, which means they use Roman, that is Latin, scripts, and another big chunk here is the one with Cyrillic script; both are phonographic segmental scripts. So this seems to be very nice for automatic pronunciation dictionary generation, and these are the ones we looked at first. That means we assume that there is a correlation, a relationship, between graphemes and sounds, which we call grapheme-to-phoneme or letter-to-sound. But we also know that there are languages which have a very close grapheme-to-phoneme relationship, and then there are others, like English, which are more pathological, sorry, I don't want to offend anybody.
Okay, so of course the first thing we wanted to look at is how much data we need to create a pronunciation dictionary. What you see here, for ten GlobalPhone dictionaries, is the phoneme error rate on the y-axis and the number of phonemes on the x-axis. What it means is: we took pronunciations we had in our dictionary, we trained a G2P model based on Sequitur, which is from Max Bisani, and then we used the G2P to generate more entries for the dictionary. Then we compared the generated entries to the true entries in the dictionary, and that gave us the phoneme error rate. So if you have, for example, 5K, five thousand, phonemes, that means we took roughly one thousand word pronunciations, and you have five thousand phonemes you can learn on.
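For reference, the phoneme error rate used here is just the Levenshtein edit distance between generated and reference pronunciation, normalized by the reference length; a minimal sketch of my own, not Sequitur code:

```python
def phoneme_error_rate(hyp, ref):
    """Levenshtein distance between two phone sequences / reference length."""
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            d[i][j] = min(d[i - 1][j] + 1,                              # deletion
                          d[i][j - 1] + 1,                              # insertion
                          d[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])) # substitution
    return d[len(hyp)][len(ref)] / len(ref)

# G2P output vs. dictionary reference (toy phone symbols):
print(phoneme_error_rate(["S", "p", "r", "a:", "x", "@"],
                         ["S", "p", "R", "a:", "x", "@"]))   # 1/6 ≈ 0.167
```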
What we learned from this graph: first of all, G2P performance depends on the language; that is not really surprising, we knew this already. We have this bunch down here, these are the Slavic languages, so you guys are really lucky, and Spanish is down there too; then we have Portuguese, and this one is German, and the worst case in our scenario is English. So there is a strong dependency on the language. The other thing we saw is: if you have a close grapheme-to-phoneme relationship, then 5K phonemes might already be enough to build a decent G2P model, and it saturates roughly after 15K examples. But we also see that a language like German needs six times more examples than a language like Portuguese; so depending on the relationship, you have to work harder at getting examples in order to build a good G2P model.
Now, where do you get those examples from? Tim Schlippe, who did this work, looked into Wiktionary, which is a resource where you find pronunciations put in by volunteers, and he looked at how many languages are already covered in Wiktionary with at least 1K pronunciations, so one thousand words, which roughly corresponds to five thousand phonemes. We found thirty-seven languages covered; that was in 2011, when he did this. He also found, and this shows the growth of Wiktionary pronunciation entries over several years, this is 2010, this is 2011, that there seems to be a lot going on: the community seems to have discovered this and keeps putting in more words. So the hope is that in the future Wiktionary will cover both more languages and more words.
Now of course the question is how to use this. We built an interface where you can basically upload a vocabulary file; it goes off and searches Wiktionary for the entries it can find, uses all the entries it finds to train a G2P model, and then generates whatever else you need. It goes into the Wiktionary pages and looks for the IPA-based pronunciations. There is a lot of cleaning to do, because the pronunciations do not always refer to the word you were asking for: sometimes you look up one word and the IPA you get belongs to a different word. So you have to do a lot of checking to make sure that you are not learning the wrong things.
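To give a rough idea of what that harvesting involves, here is a minimal sketch; it is my own illustration, not the actual interface, the URL pattern is simply Wiktionary's public page scheme, and the slash-delimited regex is deliberately naive, which is exactly why the kind of checking described above is needed.

```python
import re
from urllib.parse import quote
from urllib.request import urlopen

def fetch_ipa_candidates(word, lang="en"):
    """Fetch a Wiktionary page and grab IPA-looking strings between slashes.
    Toy illustration: real pages mix languages, dialects, and templates, so
    every candidate must be validated before it feeds G2P training."""
    url = f"https://{lang}.wiktionary.org/wiki/{quote(word)}"
    with urlopen(url) as page:
        html = page.read().decode("utf-8")
    return re.findall(r"/([^/<>\s]{1,40})/", html)   # naive: anything in /.../ 

# candidates = fetch_ipa_candidates("Sprache", lang="de")
```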
That is one of the reasons why the Wiktionary-based G2P performs worse than if you start from a good dictionary. This chart compares, for six languages, the G2P performance based on what we learned from Wiktionary to the performance based on what we already had in GlobalPhone, and if you compare the same colours, you can see some significant distances. That probably comes down to two things: one is the errors in what we found in Wiktionary, and the other is that there may be many contributors putting pronunciations into Wiktionary, which means you can have inconsistencies, and inconsistencies are always an issue for G2P models.
0:23:23so if you have nothing but you have related language you can also do a
0:23:26very to force very simple approach and that's what we tried with ukrainian so we
0:23:32want to ukrainian we had a student who was collecting data and we said okay
0:23:37we already have russian gary german english
0:23:40we have many languages why don't you try to
0:23:43build the dictionary from what we already have
0:23:46and it's a very simple approach so basically it has
0:23:49for steps in the first step
0:23:51you to the grapheme to grapheme mapping so you have russian
0:23:56and you met the ukrainian graphemes to russian graphemes
0:24:00once you've done this then there's a second step you can now basically applied to
0:24:04russian grapheme-to-phoneme model
0:24:07and then you have a russian phonemes and then you have to map them back
0:24:11so to speak ukrainian mapping so you see there is a lot of mapping going
0:24:15on
0:24:16and then you do some post processing
He did this for the different source languages; let's look at Russian, for example. He first mapped the Ukrainian graphemes to Russian graphemes, which amounted to forty-three rules, so we had to come up with forty-three of them. Then he ran the G2P, and after that he created fifty-six rules to convert the Russian phonemes back to Ukrainian phonemes. The post-processing required another fifty-seven rules, so altogether we had to create about one hundred and sixty rules, which can be done in a day.
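A minimal sketch of that four-step chain; the rule tables below are hypothetical stand-ins for the real 43 + 56 + 57 hand-written rules.

```python
# Toy rule tables (hypothetical; the real ones are much larger).
ukr2rus_graphemes = {"і": "и", "ї": "йі", "є": "йе"}
rus2ukr_phonemes  = {"ɨ": "ɪ"}

def ukrainian_pronunciation(word, russian_g2p):
    # Step 1: map Ukrainian graphemes to Russian graphemes (43 rules).
    rus_spelling = "".join(ukr2rus_graphemes.get(c, c) for c in word)
    # Step 2: apply the existing Russian grapheme-to-phoneme model.
    rus_phones = russian_g2p(rus_spelling)
    # Step 3: map Russian phonemes back to Ukrainian ones (56 rules).
    ukr_phones = [rus2ukr_phonemes.get(p, p) for p in rus_phones]
    # Step 4: post-processing fixes, e.g. palatalization (57 rules).
    return ukr_phones
```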
And what comes out of it is a dictionary. When we plug this into the Ukrainian ASR, it gives a 21.6 percent error rate. If we use the simplest approach, where we just use the graphemes for ASR, we get something like 23.8 percent. If we ask the student to sit down and write rules to create the dictionary by hand, so hand-crafted rules, which piled up to about eight hundred rules, he gets 22.4. In the process we wondered why the Russian-based one is better than the hand-crafted one, and found that we had some issues with our rules; we fixed the rules and then basically ended up with the same performance.
So that means: if you don't have much time and you need to reduce the number of rules you want an expert to write, then the crude method is probably an option. But that, of course, depends on having a reasonably related language, because if you do it with non-related languages, you get significantly worse results; if we do it with English or German, it doesn't pan out.
Okay, I have fifteen minutes. The other thing you can do if you don't have pronunciations is to ask the crowd. We created a small toolkit which gives you a keyboard, and on the keyboard you have the IPA sounds; you have to get used to it, but you can press a symbol and listen to how it sounds. Then you get a word and are asked to produce the pronunciation of this word, and once you are done, you can go to the next word. We basically fed this into Amazon Mechanical Turk and let it run for twelve days, to find out whether this would give good results or not.
The first thing we learned: we ran it for twelve days and calculated how long people spend building pronunciations; the average time was fifty-three seconds. Most of the work, the majority, we basically had to reject: out of the nineteen hundred pronunciations we got, more than fifty percent were not really useful, because, as we found out, people would game the task and basically spend a second giving a single phoneme. So people were very fast but very sloppy, and we could not really find a good way to create incentives for providing very good answers. We did this in parallel, with many people working on the same words, in order to find out whether we would get good pronunciations, but it did not really pan out nicely.
So the second thing we did was reach out to our friends and volunteers, and we worked harder on improving the interface: we had a nice welcome page and a tutorial, we gave people quality feedback so that they had an incentive to work harder, and we also put it out as a game, so people would compete with each other in a friendly manner. One thing we found is that in the beginning they would spend a lot of time getting the words right, six minutes on one word, which I found quite high, and finally they were down to roughly one and a half minutes per word. You can see here the number of users; it is really hard to keep the users on track, but certainly the median time per word went down significantly, and the errors we found per session went down as well. Still, it is hard to keep people at it, so getting really massive amounts of pronunciations out of crowdsourcing remains rather challenging.
Okay, so now I would like to talk about a new approach we are working on for the case where there is no writing system at all; that means you would like to build a system for a language which does not have a written form. This is sort of how we think of it: you have nothing. Take Klingon; we assume that Klingon is not written, even if that is not true. You can ask a speaker to speak a sentence for you in Klingon: you can simply ask him to say 'I am sick', and he would say 'jIrop'. I don't speak Klingon, but maybe some people do. And you can ask him for 'I am healthy', and he would also produce a phrase, 'jIpIv'. What people then do, subconsciously, is try to identify sounds; we perceive this as a sequence of sounds. And if he says 'jIrop' and 'jIpIv', we would probably assume that 'jI' might be something like 'I', because this is the repeating part in what we asked him to produce. Then we could derive that 'rop' seems to be the word for sick and 'pIv' seems to be the word for healthy. And the idea is to take this process that we are doing in our minds and put it onto the machine.
0:30:24on the machine
0:30:25and they had in already substantial work done the first one probably the that's what
0:30:31i found was performed from the wall basis he
0:30:34so he was using a monolingual unsupervised segmentation in order to get the phone sequences
0:30:39into words
0:30:41and then there was the work from our is the work from the past and
0:30:44speaker and alex waibel they were using a cross lingual word to word alignment based
0:30:50on keys plus
0:30:52and then in the combination of work they basically combine the monolingual in the keys
0:30:56plus approach
0:30:57in order to find the phoneme to a word alignments
0:31:02and so phoenix are back in our lab she started using a cross lingual words
0:31:08to phoneme alignment it for that you improved we extended the
0:31:12it is a approach but let me first roughly explain how it work
You basically have two sentences. You have a source-language sentence, in this case a German one, and then you have an English phone sequence which you happen to have derived from what the person was saying. Ideally you have the words and even these little word boundaries already in there, but of course, if you create the phone sequence with a phone recognizer, you may have a lot of errors, and of course you do not have any word boundaries. The question now is: how can you get the alignment between the words and the corresponding phones?
And one of the issues, okay: what Felix did next is extend the GIZA++ model; he is using IBM Model 3 and changed it such that it can handle the very different sequence lengths, putting in a word length probability to counterbalance the fact that in word-to-phoneme alignment you have way more phonemes to align to one word. The model he came up with basically skips the lexical translation in the first stage: it uses the fertility, uses the distortion, but puts in the word length, so the words act as sort of placeholders, and then, via the lexical translation, the placeholders are translated into the phone sequence.
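Schematically, in my own simplified notation rather than the model's exact equations (and collapsing details such as NULL words and the fertility bookkeeping), the generative story just described factors roughly as

$$
P(\phi_1^J \mid w_1^I) \;\approx\; \sum_{a_1^J}\; \prod_{i=1}^{I} \underbrace{p(\ell_i \mid w_i)}_{\text{word length}} \;\prod_{j=1}^{J} \underbrace{p(a_j \mid j, I, J)}_{\text{distortion}}\; \underbrace{p(\phi_j \mid w_{a_j})}_{\text{lexical translation}}
$$

where each source word $w_i$ acts as a placeholder that emits $\ell_i$ phones $\phi_j$, and the word length term takes over the role that fertility plays in word-to-word alignment.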
The first results were on BTEC English-Spanish. He compared the GIZA++ word-to-phoneme alignment to his Model 3P alignment, and you can see that, consistently, independent of the phoneme error rate of the original phone sequence, he could outperform the GIZA++ word-to-phoneme alignment, and in terms of F-score he could also outperform the monolingual approach. This is the pair from English to Spanish, and this is the pair from Spanish to English. An F-score of eighty percent here means, for example, that he got the segmentation, the word boundaries in the phone sequences, right with about eighty percent accuracy, which is quite good.
So once this is solved, and I mean we are not anywhere near that, but once we do have a correspondence between the phones and the words, we can think of the next step, which is to take the phone sequences and generate a pronunciation dictionary.
Now we run into a problem which Lori already mentioned in her talk: some words have different meanings. For example, the German word 'Sprache' can mean both speech and language. So what will happen is that you get a lot of different pronunciations: some will sound like 'speech', even though they may just be variants of one another, but others are significantly different, and those are the 'language' words. So the first thing we did is cluster all the occurring phone sequences of the different word meanings; we used DBSCAN clustering for that. If we are lucky, we end up with one cluster which represents everything that goes to 'language', and the other cluster represents everything that goes to 'speech'. Then, if you have several sequences which sound like 'speech', Felix used the n-best candidates within a cluster to generate one single result for 'speech', and those results are put into the dictionary. Since we do not have a real written form, we cannot give them orthographic word labels; in our case we just used numbers. So there is one word which gets the ID one and has the pronunciation for 'language', and there is another word which gets the ID two and has the pronunciation for 'speech'.
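A minimal sketch of that clustering step, assuming scikit-learn's DBSCAN over a precomputed edit-distance matrix between the pronunciation variants; the eps and min_samples values are made up for the toy data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Phone sequences observed for the same source word, e.g. German "Sprache":
variants = [["s", "p", "i", "tS"], ["s", "b", "i", "tS"],      # ~ "speech"
            ["l", "ae", "N", "g", "w", "@", "dZ"],             # ~ "language"
            ["l", "ae", "N", "w", "@", "dZ"]]

def edit_distance(a, b):
    d = [[max(i, j) if i * j == 0 else 0 for j in range(len(b) + 1)]
         for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return d[-1][-1]

dist = np.array([[edit_distance(a, b) for b in variants] for a in variants])
labels = DBSCAN(eps=2.0, min_samples=1, metric="precomputed").fit_predict(dist)
print(labels)   # [0 0 1 1]: one cluster per meaning; each cluster gets a word ID
```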
Okay. We tried this out, not with BTEC; we moved to pronunciation extraction on Bible data, because Felix found a very nice page, Bible Gateway, which provides translations of the Bible in many different languages. He took fifteen translations in ten languages, and they even provide audio in English, Portuguese, and some other languages to play around with. He verse-aligned the Bible texts, extracting more than thirty thousand verse alignments for training his approach. What you see here is the distribution of the absolute phoneme errors in the extracted pronunciations: he compared the extracted pronunciations to the real word pronunciations. For example, this bar here means that if you use the Spanish Bible to create English pronunciations, there are about three thousand nine hundred words, out of the roughly fourteen thousand that were extracted, which have no phoneme error at all, so they correspond exactly to what is found in the dictionary; in total the dictionary had something like fourteen thousand words. This is one of the English Bible translations, and this plot gives you the distributions of the phoneme errors across all the different source languages he started from to extract the pronunciations.
One thing about what we did in this case: in a first step, we assumed that the target phone sequence is the canonical pronunciation. That means this has not yet been done with a phone recognizer on audio, but under the assumption that the phone sequence is correct and only the word boundaries are unknown; the next step now is to do this with a real phone recognizer and see how well it carries over. And then the step after that, which we have not done yet: we now have transcribed audio, transcribed in terms of word IDs, so we could create a dictionary based on the word IDs, or whatever you want to use here, plus a language model, and then we could put all these pieces together to build a speech recognition system for a language which has no written form. But this last step we have not done yet.
So, I have four minutes left, three minutes left, okay, so the third topic I will run through really quickly. The question here is what can be done if there are not many linguistic experts, and I think everybody has seen this problem: you want to build a system in a language and there is simply nobody around who could help; Lori was just talking about finding a Korean speaker for transcribing, right? So, what we did: the work started several years ago, it was an NSF project, actually, funded by Mary Harper, and I did it together with Alan Black. The idea was: how can you bridge the gap between the technology experts and the language experts? So we built a lot of web-based tools, and these tools allow somebody who does not know anything, or not very much, about speech recognition or TTS to work on the language by him- or herself. You basically have handrails which tell you what to do in the next step in order to build a system, and the person can go step by step through building the system.
0:39:22every here it is used in a semi now which is a multilingual seminar also
0:39:27across seen U N K I T
0:39:30and we always adopt the languages of the students who take the score so what
0:39:34the use we have really done a lot of systems asr and tts systems in
0:39:39the respective languages
0:39:40and whenever we learn something we plug this into
0:39:43introduce the true
0:39:45and we using this we for education but we also using this for
0:39:50purposes to do up resources and reach out
0:39:53two people if we don't have time at hand
0:39:55for example we had one
0:39:58one
0:39:58one time wanted to build something in company and there was only a single speaker
0:40:02of company it seeing you so he was sending basically is what the web page
0:40:07to his people back in india and got a lot of speech and work out
0:40:11of this
0:40:12by
0:40:12being able to do virtually
Okay, I think I should close here. I wanted to show you some solutions we are working on for when we have few or no resources, and I tried to give one example each for having no transcripts, no pronunciations, and no writing system at all. I also wanted to stress that we keep working on our Rapid Language Adaptation Toolkit, in order to leverage fieldwork and to be able to do outreach to the community, and in order to finally bridge the gap between the people who know the technology and the people who know the language. Thank you very much.
We have time for a few questions.
I just had a clarification question about your first study, where you had no transcriptions: how do you obtain the pronunciation dictionary and the other models?
Ah, I see. Let me just go back. The idea is that you have a dictionary in the target language, Czech: you have a Czech dictionary given in the Czech phone set. But the source recognizer you would like to run on the data is in, let's say, Polish, so what you need to do is take the Polish phones and replace the Czech phones in the dictionary with them. You basically do the mapping in the Czech dictionary, so you get the Polish variant of the Czech dictionary.
How do we map? I forgot to say that, I am sorry: through GlobalPhone everything is expressed in IPA, so doing the mapping is rather straightforward; you can do it with about fifty mapping rules, I would say, nothing too fancy, and the context-dependent models can be used this way. After the iterations, the acoustic models get MAP-adapted anyway.
For the recognizer for languages without a writing system, actually, the thing you presented in the last part: what would be your recommendation for presenting the output? Because if there is no writing system in the ASR target, you probably cannot present the users with just word IDs, numerical IDs.
So, what do we do? I think if you have a system for a language which is not written at all, TTS might be a good approach: you basically speak the output to the people so that they can listen to it. That would be one option, and then all you have to do is, when you get word number one, run TTS on its phones. And then the other, simple and straightforward, option is to apply some sort of phoneme-to-grapheme mapping to write it back. But that is all we could come up with; actually, I do not like the number idea either, so we have now switched to something with underscores, so that you can at least read it.
You could potentially add translation, or understanding, as a way to present it?
Sure. Actually, what comes out there is the word IDs and the target-word-to-source-word alignment, so you could shove this into a translation system and then translate it into a language that is written, correct. I am personally more interested in the ASR side, though.
Just one question: when you were doing the Bible work and finding the words and the error differences, did you actually look at how many of the generated words are not real words? Because you do get a list, right, and I did not really get to see that.
So, one thing is, for the English case, you can see that the number of generated words is roughly the number of real words, so it is neither overgenerating nor undergenerating. But you can also see that Swedish, for example, tends to undergenerate, while Bulgarian and Czech overgenerate, and what we found is that this is related to how many words the source language has: Swedish has very few and Czech has many, so they undergenerate or overgenerate accordingly. And of course the name of the game would be to keep this in balance, if we assume that it is roughly one-to-one.
Does that answer the question?

We have time for one last quick question.
A practical question: say you have an application and you want to build a speech-recognition-based interface. Looking at all these numbers, can you advise me whether what you are doing is the way to go, or should I rather go and transcribe audio and build the speech recognizer from that?
So, for the Rapid Language Adaptation server, that is exactly what we were wondering, and we are now doing error blaming on the components to advise the user what is better: is it better to work on the pronunciations, or is it better to work on the audio? For the first hour, the results usually say it is better to work on getting some audio data, so one hour at least, and then on getting the pronunciations right. But it highly depends on how much data you have and where you are, so it is difficult to answer. I think that generating an automatic answer from the error blaming, to tell the user what to do next and how to invest his or her time best, is the right way to go.
A question for all three speakers this morning, for Tanja, for Lori, and for Mary: I think the big elephant in the room is getting textual data for languages that are only spoken, and you touched on this a little bit. If you don't have web data and only have conversational data that is recorded, you have to make the transcripts; I mean, with the databases that we developed, by far the largest cost was always getting the text data and the translation data, while acoustic modeling is, by comparison, relatively straightforward and cheap to do. So my question, and the elephant here in the room, Mary, is this gigantic database of conversational data, and I am wondering how, as a community, we can make faster progress on that problem.
I have never heard a question to a speaker who has not spoken yet. Well, maybe this is something to think about; it is a good question we can come back to. So let's bring that up again later today. Okay, let's thank Tanja once again.