0:00:15 I am Mary Harper, and in October two thousand ten I went to IARPA to develop this program, Babel. And I say "BAY-bel", but there are lots of ways to pronounce Babel, and you can actually go to this website and find out about all the ways of saying it. A lot of people like this example. For me Babel is "BAY-bel" only, and I grew up in Buffalo, New York, where that was the dialectal variant in use; if you want a taller tower, you say "ba-BELL". And of course there's also the original Hebrew word, and a variety of other ways of pronouncing it as well. But as Morgan pointed out, however you say it now, that isn't necessarily how it sounded then, for some reason.
0:01:07 Okay, so every program, whether it's at DARPA or IARPA, has sort of a back story; you have to have a motivation, sort of an elevator speech. So my challenge is this: you're in a situation where you're dealing with a crisis, or it might be a humanitarian need, for example, where you have to deal with a lot of noisy speech in order to respond, and you have thousands of hours and no time to listen. You might have one or two people who could listen to it, but you're certainly not going to get through it in any time that would be reasonable in order to help people. And if you have no existing speech technology for that language, you have problems. But if you could rapidly develop that technology, say in a day or two, you actually might be able to do something with it.
0:02:02 It sort of addresses two gaps. It's hard to build up the human capital in a language, because that can take years, and in some cases we only have one or two people who know the language; we see, even just in developing the resources, that we don't have this language capital. And there's also a technology gap. This slide was certainly drawn up a number of years ago, but of the roughly three hundred and ninety-three languages that have a million or more speakers, we've touched very few; we've really only studied a handful seriously. I mean, we study English all the time because it's easy: there are corpora and so on. And it can take way too much time, months to years, to build a new language, especially if you have to transcribe the data. The systems developed for English don't always work well on other languages; they can help with the bootstrap, but they certainly don't give you the kind of error rates that somebody might want to see.
0:03:09 So the basic idea underlying Babel: rather than just evaluate word error rate, the director was very adamant that she wanted to have a real task, not just transcription. And so keyword search; fortunately NIST had run a spoken term detection evaluation in two thousand six, so we settled on the keyword search task. The basic idea is that you use speech recognition, or phone recognition, or something else, to index the thousands of hours of audio, and then you have some way of putting in a query. For Babel we use orthographic queries, and those who are doing low-resource work do other things with the data in order to accommodate the fact that we use orthographic queries. And then we evaluate whether the keyword was correctly identified in the audio.
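(To make the task concrete, here is a minimal sketch of indexing recognizer output and answering an orthographic query. It is only an illustration under simple assumptions: real Babel systems index lattices rather than one-best hypotheses, and the data structures and names here are invented for the example.)

```python
from collections import defaultdict

# Toy index over ASR output. Each hypothesis is (utterance_id, word, start, end, score).
# Real systems index lattices, not one-best output, to preserve search alternatives.
def build_index(hypotheses):
    index = defaultdict(list)
    for utt, word, start, end, score in hypotheses:
        index[word.lower()].append((utt, start, end, score))
    return index

def search(index, query, threshold=0.5):
    """Return putative hits for a single-word orthographic query above a score threshold."""
    hits = index.get(query.lower(), [])
    return [h for h in hits if h[3] >= threshold]

hyps = [("utt1", "tsunami", 2.1, 2.8, 0.82),
        ("utt2", "tsunami", 10.4, 11.0, 0.31)]
print(search(build_index(hyps), "tsunami"))  # only the confident hit survives
```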
0:04:06 So our approach is really to work with a wide variety of languages, not just European languages; I think it's really important to study languages that have a wide variety of properties, and in real recording conditions as much as possible. Obviously the collections are going to suffer from the fact that when you go into these countries you may not be able to record in a highly reverberant room or something, but the hope is that you can get these sort of real-world recording situations. And then we constrain the resources in various ways. We actually collect a lot of data, but we create a wide variety of conditions for people to evaluate, and they can create conditions as well to answer questions that they think are important, like getting by without a lexicon. So we gradually reduce the amount of transcribed speech we give them for training, but we also give them the audio untranscribed. We also reduce the amount of time that they have to evaluate on the surprise language, and I think that's critical: not starting off with something that's impossible at the outset, but actually getting people to the point where they can develop the technology, is extremely important here. And we set the targets to be roughly a three-times improvement over what you get with phonetic search, and I think that was critical; that was based on the STD 2006 BBN results on Cantonese and Mandarin, where they got roughly a 0.3 ATWV, so we set that as the target level.
0:05:50 So the goal is to improve speech technology with limited amounts of ground-truth data. Building speech systems for a non-English language is extremely important, and so is improving speech recognition through innovative use of the technology and different approaches, across a wide variety of languages, so that you can get fast development of keyword search systems to tackle this problem.
0:06:20 Just to give you a sense of the layout of the program: other than the base year, where they had a little more than the nine months because it was a fifteen-month period, they have roughly nine months to work with the data, and the data doesn't necessarily all arrive on day one. Then the evaluation starts, where they have about a month to do the keyword search on the practice languages; we evaluate everything, because it's really important to understand what progress is being made on the different languages, since the languages are all different. And then we give them a surprise language, where we hand over the data (I'll talk about that a little bit), and they have a certain number of weeks to build their system, which decreases over the program: in the base period it was four weeks, and in the option one period they have three weeks. Then they have one week to return their keyword search results. You might ask why a whole week: there are a lot of queries to run for the evaluation, with people trying out many keywords, so it is important to leave a sufficient amount of time there as well.
0:07:32 So the measure that we're using for performance is the actual term-weighted value, ATWV, which was developed by NIST, I think in coordination with a number of sponsors of that evaluation. It's built around a use case: you've got people who would like to be able to find stuff, and they don't tolerate a great number of false alarms, so you wouldn't want to use an F-score. The other thing is rare terms, given the Zipfian nature of language and the fact that rare terms may be very useful for finding things that are critical. I mean, "tsunami" might be a very common thing in the traffic you're collecting, but it may not have been there in your training data, and you want to be able to find the tsunami things in your audio. What you have to realize is that it is term-weighted: it's evaluated over all terms regardless of the frequency of those terms, so a singleton counts the same in the score as something that is highly frequent. And then there are a number of things to keep in mind, like the cost and value constants: by and large you've got this large weighting on the probability of false alarm, but the systems typically have very low probability of false alarm, so once you understand that, it's sort of a tradeoff between those two things, and really, missing something hurts the score badly when there are singletons. So that's something you want to keep in mind as you look at the results that I'm going to go through.
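(For reference, here is a rough sketch of how term-weighted value is computed. The authoritative definition is in the NIST KWS evaluation plan; the beta constant and the one-trial-per-second approximation below are the commonly cited choices and should be read as assumptions.)

```python
def term_weighted_value(term_stats, total_speech_seconds, beta=999.9):
    """Sketch of TWV: 1 - average over terms of (P_miss + beta * P_FA).

    term_stats maps each term to (n_true, n_correct, n_false_alarm).
    Every term is weighted equally regardless of frequency, so a missed
    singleton hurts as much as a missed frequent term.
    """
    losses = []
    for n_true, n_correct, n_fa in term_stats.values():
        if n_true == 0:                               # terms absent from the reference are skipped
            continue
        p_miss = 1.0 - n_correct / n_true
        n_trials = total_speech_seconds - n_true      # roughly one non-target trial per second
        p_fa = n_fa / n_trials
        losses.append(p_miss + beta * p_fa)
    return 1.0 - sum(losses) / len(losses)
```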
0:09:13 So the Babel program has a number of dimensions in terms of the people who are working on it. Obviously the program wouldn't exist without data, and Appen has been the data collector from day one; I actually talked to them about the notion of the data collection before I even went on the job. Then we have the test and evaluation team; that's what T&E stands for. It's important to realize that you need NIST, who can run an evaluation, plus the technical support to set an evaluation up for an approach like this: we actually have a group that builds systems so that we can do forced alignments and things like that, there's help with some of the logistics, and CASL provides needed help with linguistics. They advise me on a number of dimensions, certainly on getting good phonetic coverage of a language and getting diversity across the languages; it would be really hard for me, since I don't know all these languages, to do that myself. The other thing is that there is a sort of teaming between the T&E team and Appen in order to ensure that the quality of the data is appropriate for the task that we're doing. Keyword search is not something that Appen had supported before, and errors in the data really do make it very challenging to evaluate keyword search, so we actually do some of that cleanup offline; I'll talk a little bit about that later on. And then we have four teams, where I've put the primes on the left: CMU, IBM, ICSI, and BBN are the primes, and you can see all the people who participated in the base period. Sometimes there's some reconfiguration, but this is the picture as it was at the time of the base period.
0:11:13 So, lots of work: I think there are sixteen papers here that were supported by Babel, and if you go back to ICASSP and Interspeech over the past couple of years, I think there are probably a hundred papers or so that have been sponsored by Babel, all with great work. I want to point out that as I go through things, I don't have time to touch on all the work or all the cool things that people are doing; I'm just going to point out some things I selected, sort of interesting lessons learned, and there are a lot of other things people are doing that are quite interesting. I'm also going to point out how we changed things for the option period, and the kinds of things that look like real glimmers of hope, I think. So I'm not going to exhaust the research; you'll be able to see it at future conferences.
0:12:07 So the data collection is actually quite daunting. We are collecting the data with seven languages in collection at a time. We only needed four practice languages and one surprise language for the base period, but we collected seven, and it was a good thing we did, because what we had planned to use as the development language and the surprise language were Assamese and Bengali, where Assamese was supposed to be the surprise; things went wrong with those collections, and so we basically had to use the other five languages. It is really important to be over-collecting for your needs at a particular time. The amount of time we spend to collect seven languages, given the fact that you stagger the kickoffs, is roughly two years, so you can see that there are these two-year overlapped periods. It is really interesting: right now we're working on sixteen and getting ready to send out funds for five more, so this really is the critical period for making sure the rest of the program is going to play out. But you can see there is an increasing number of languages in each period; subtract one for the surprise language and you can see how many are being used for practice. So you can imagine that by the time you hit the end of the program, multilingual systems are going to be really highly supported.
0:13:45 We have a variety of criteria for selecting languages; I'll talk about that a little bit more on the next slide. Most of these are multi-dialectal, and they also represent a wide variety of recording conditions. Starting in the option period we also started collecting a microphone channel, and all of the data include surprise environments or channels in the evaluation. So there is always something that's new; it's not there for hundreds of hours of the data, but it's there so people can assess whether their methods are working on these things. So: languages from a variety of language families, with different features (phonotactic, morphological, syntactic, and so on), whether or not they're tonal; data collected in country, which I think is really important, because then you're living with a wide variety of telecommunications situations; dialectal variation; and a wide variety of environments. The easiest environment tends to be the home-office one, where there's a landline or a mobile; it's not always a landline in some of these countries now, so the landline is disappearing in some of the collections we're doing. Probably the hardest place in Babel is the car, collected with a car kit; it tends to be one of the worst. And then there are others; obviously you want to have non-telephone-channel data in there as well. And metadata balance: we provide the metadata with each of the audio files, so that the collection could ultimately be used to support dialect ID, language ID, or other things. You want to collect this data in such a way that it can be used for a variety of purposes.
0:15:30 We start off doing a risk assessment. Obviously you don't want to go into a country where there's a likelihood that people will die while they're doing the collection, so you have to take that into consideration. We also have to take into consideration whether or not we can potentially get transcribers and people who know something about the language, so all those things are certainly taken into account. Then we begin the work of vetting a language, where we work on what Appen calls a language specific peculiarities document. It typically involves providing the phoneme set that is going to be used by Appen, and a variety of other things: something about the dialects and which primary dialect they would standardize on, and, for example, spelling conventions that some people use and some people don't. It is part of an iterative process, so we keep it going; it provides the start of the lexicon, along with other information that is very useful. Then there's a small database of transcribed conversation that they send to us, which is reviewed by CASL and others to make sure that the transcription quality is reasonable. Sometimes we also get a lexicon to take a look at and provide feedback that affects things. Only then do we receive an interim delivery, which is about three hours of conversation, and with that we actually start looking for variant spellings, words whose spellings are diverging, because you can use the lexicon, together with some language experts, to help you spot these, and we try to clean that up; so spelling normalization is something that we do. It perhaps adds a certain amount of artificiality, but it certainly is important to do, and I can tell you it's not going to be a hundred percent accurate; it's being done with a certain amount of limitation on the resources that are available. Finally we get the big delivery, and that's reviewed and partitioned into training, dev, and eval. Every collection is collected as if it were a surprise language, where we use seventy-five hours for the eval, but for the development languages, the practice languages, we only use fifteen, so in many cases we have a lot of leftover audio that we just don't pass on.
0:17:58 We also develop keywords, using a certain amount of the data, and we have them annotated by Appen so that we can assign types and so on; that way we can have a certain notion of balance among the keywords, and we make sure that we come up with a certain number of names and so on, so that there's balance in the test. Also, the segments that Appen provides can be very large, so we re-segment using voice activity detection, and those segments are passed back to Appen for a judgment of quality, where they're compared to the original segments. Then we do forced alignments on the dev and eval and give the forced alignments to the performers.
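(The re-segmentation step is ordinary voice activity detection. As a minimal illustration only, an energy-based toy rather than the pipeline actually used, something like the following splits long recordings at stretches of low energy; the threshold and pause-length parameters are assumptions.)

```python
import numpy as np

def energy_vad_segments(samples, rate, frame_ms=25, hop_ms=10,
                        threshold_db=-40.0, min_silence_s=0.3):
    """Toy energy-based VAD: return (start_sec, end_sec) speech segments.

    samples: 1-D numpy array of floats in [-1, 1]; rate: sample rate in Hz.
    """
    frame, hop = int(rate * frame_ms / 1000), int(rate * hop_ms / 1000)
    frames = [samples[i:i + frame] for i in range(0, len(samples) - frame, hop)]
    log_e = [10 * np.log10(np.mean(f ** 2) + 1e-10) for f in frames]
    voiced = [e > threshold_db for e in log_e]

    segments, start, silence = [], None, 0.0
    for i, v in enumerate(voiced):
        t = i * hop / rate
        if v:
            if start is None:
                start = t          # speech begins
            silence = 0.0
        elif start is not None:
            silence += hop / rate
            if silence >= min_silence_s:            # long pause: close the segment
                segments.append((start, t - silence))
                start, silence = None, 0.0
    if start is not None:
        segments.append((start, len(samples) / rate))
    return segments
```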
0:18:43 These are the period one languages. We began with Cantonese, Pashto, Tagalog, and Turkish, which were pretty risk-free languages, and then we tested on Vietnamese. Remember, Vietnamese was not meant to be the surprise language, and it ended up being somewhat challenging, in that for Cantonese they provided word boundaries, but in Vietnamese it was just the syllables, so things tend to be short words, and they also did not do a bang-up job of including all the dialectal variants of the pronunciations, which I think probably also caused problems. Still, as a resource it's a great resource if you're interested in understanding the Vietnamese dialects. You can see the number of dialects per language: Cantonese has five, Pashto four, Tagalog three, Turkish seven, and so on. The Cantonese dialects were probably pretty hard for some people to understand, so at the beginning, when we used the data, there was some question about whether those dialects were really Cantonese; but they were.
0:19:55 When we evaluate: NIST developed an evaluation plan, and there were three conditions for the language resources that are used. There's the base language resource condition, which is "I use the resources I was given and nothing else." There's the Babel language resource condition, where you could use the other Babel language packs that you have available, and that's very nice for multilingual work. And then if you want to bring in other, non-Babel resources, you can do that: for example, if you want to bring in web text or something like that, or a pronunciation lexicon, or if you have some found data, you can do that. Then there's the amount of training data that they use from the language pack: you could either use the eighty hours of conversational speech together with the scripted speech, or you could use the limited condition, which uses ten hours of transcription that is sub-selected from the eighty hours, so it's a proper subset of the eighty-hour set. And then there are two conditions for evaluating keywords. There's the hard condition, no test audio reuse: you build your keyword search system without knowledge of the keywords, and you just do the search based on those keywords; you're not able to re-decode or retrain or anything like that with knowledge of them. The test audio reuse condition is where you do have knowledge of the keywords: you could do things like automatically add them to the lexicon and do crazy things in terms of the language model, and if you were going to do the other-LR condition, you could go out and look for language model data, and so on. So there's a lot of variability here. In the option period we've actually changed things up a lot: people can declare the resources they use, and so there are a lot of interesting new conditions that performers can come up with on their own. This was the start, but there is certainly going to be, I think, a lot more variability in the experiments people run in the future.
0:22:07 Another innovation that came out of the program: since we're evaluating so many languages, and we don't want to prevent people from running experimental conditions, NIST developed the Babel scoring server. This allows researchers to submit and get evaluated against the test data. We don't release all the test data after the test; we release some portion of it, and if you want to go on and evaluate against the full test set, there is a sequestered part. I think that's really important: if you're writing a paper ten months after the evaluation and you want to go back and re-evaluate, or you've discovered something new and you want to test your hypothesis on the past languages, you can do that and still get the full test set. I think that's very important, and I really think it's going to make a lot of difference in terms of the pure science the program can support.
0:23:15 Jon Fiscus put together this plot for the open eval: this is submissions, week by week, through the twenty-seventh week of the program, and you can see where there are spikes in the cumulative number of submissions. But you can also see that even after the evaluations are over, especially with Vietnamese, people keep submitting, because Vietnamese was somewhat challenging and some people wanted to continue the work, and on a number of other languages as well. The results go back to you as soon as they say everything is okay; there is a sort of intermediate point where they want to make sure that everything is working properly, so it usually takes about a week before the first results are released, but as soon as they are, people can report them openly.
0:24:13so
0:24:15in the first period people to the state and a lot of creative things
0:24:19people submitted primary in contrast systems and
0:24:23for the most are trying to or submissions word system combinations and we'll talk a
0:24:28little bit about system combination because it really does seem to help except for the
0:24:32swordfish
0:24:33all performers were able to make the program targets in all languages including the surprise
0:24:38using the full language pair
0:24:40and that in the base language resource condition with no audio
0:24:45and of course
0:24:47there are other conditions where you could potentially do better
0:24:50program targets were exceeded with ten hours of training and for the five languages by
0:24:55some people
0:24:57usually using system combination
0:24:59right
0:25:01system combination reduces
0:25:04the token error rate and increases atwv compared to single systems
0:25:10but even single system full
0:25:12language pair
0:25:14single system full language pack systems
0:25:16maybe program target
0:25:20with the with the language back
0:25:23all systems have of course have very probable low false alarm
0:25:29warring this miss rate places a significant role in increasing atwv and that something you
0:25:34want to sort of keep in mind
0:25:36and there were several collection factors that actually attracted atwv language dialect environment gender
0:25:44and i'm good just show you some poor results i think that are sort of
0:25:47interesting
0:25:48 I don't think I've shown this even to the performers; I actually put this together for my program review. I'm not sure whether these slides are posted; actually, they're probably not. But here you can see the base-LR, full-language-pack results all marked, and not everybody submits to every condition; the only one that was required was the full language pack base-LR. And you can see people made their targets, in all of the languages.
0:26:28 Gender affects ATWV, and, what was kind of interesting, word error as well. In this set of collections the systems did better with female speech, which is kind of interesting, though not in all the languages, and sometimes by a lot; look at Tagalog, for example, where the males are so much worse. I don't know why; I mean, we collect two thousand speakers per language, and I'm sure there are interactions with other factors. But environment is important. You can see that overall, pooling over all systems, you get an average of about 0.51 ATWV; the car and the unexpected environment were sort of equally hard, the landline and the mobile in the home office are sort of the best, and the place and street recordings are somewhere in between; typically those are probably done with a cell phone.
0:27:41 When you look across languages (this is kind of a messy slide), the car data is significantly worse than the rest, and obviously Pashto was the harder language overall, but there's something going on there. And it is kind of interesting: you look at Turkish and the landline is wonderful; well, they probably have a much more stable landline environment, and in some of these countries landlines may be rare, so maybe for Pashto the mobile was the predominant thing. What I didn't give you here is the breakout of the distributions. Dialect and ATWV also interacted: for this language you can see northeast, northwest, southeast, and southwest dialects, and southwest was really under-represented, which only really became clear as the collection went on. But you can see people could still do something with it; some of these dialects were related, but certainly the ones that had a higher amount of data were the best, the ones that had the least amount of data were sort of the worst, and that was true across the board. So dialect certainly does add a dimension of challenge to the data.
0:29:04 I think it's the microphone, somehow or another; we're getting an echo.
0:29:20 So what helps? Well, early on it was clear, especially with the Cantonese data, that you have to re-segment the data and do silence modeling to get rid of the silence, or you kind of screw things up. Robust multilingual MLP features were really important, I think; they really played a major role. Deep learning really started to shine in the program very early, and I think there's lots and lots of room for it to keep shining and for very interesting experiments. Pitch features in every language were useful, at least for most people, and what's kind of cool about that is that it gives some hope for more universal feature extraction. One of the things that was really extremely important was to develop methods for preserving potential search alternatives, and there are a variety of ways of doing that, including denser lattices and smarter ways of doing the queries; there are a number of papers here that you can see on this topic, and in other venues too. Then, combining systems, especially when little training data is available, really matters a lot, and it matters a lot whether you try to build the systems differently or just randomly seed them differently. System combination is very useful; semi-supervised training is very helpful for acoustic models and features; and score normalization really plays a big role: if you do nothing else, score normalization gives you a lot.
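(The score normalization referred to here is typically keyword-specific: raw detection scores are rescaled per keyword so that one global threshold treats frequent and infrequent keywords more evenly. Below is a minimal sketch of a sum-to-one style normalization; the gamma exponent is an assumed knob, and this is an illustration rather than any particular team's recipe.)

```python
from collections import defaultdict

def sum_to_one_normalize(detections, gamma=1.0):
    """Normalize detection scores per keyword so each keyword's scores sum to one.

    detections: list of (keyword, location, score) tuples with raw scores in [0, 1].
    gamma is an assumed sharpening exponent; gamma=1.0 is plain sum-to-one.
    """
    totals = defaultdict(float)
    for kw, _, score in detections:
        totals[kw] += score ** gamma
    return [(kw, loc, (score ** gamma) / totals[kw] if totals[kw] > 0 else 0.0)
            for kw, loc, score in detections]
```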
0:31:03 I could report any number of things; I just picked a smattering. Typically the reason I picked something was not an endorsement per se; it was largely because there was some picture that fit with the point I was trying to make. Several of these have papers appearing here, so I've put those in where I could line up the result, because some of these results were things I got from site visits as opposed to from papers, since I had to prepare the talk a while ago. But you can see: stacked bottleneck features versus plain bottleneck features give you an eight percent reduction in word error and a concomitant improvement in ATWV. Adding fundamental frequency and probability of voicing reduces word error; this was on Vietnamese, I believe. Regenerating the neural network targets added a percent, and semi-supervised training helped a lot too, and those were all additive. Very cool.
0:32:13 So features are very important, and deep learning is very helpful. We have a comparison here between shallow and deep networks, and you can see the shallow versus the deep ATWV: a two to three percent absolute improvement. This was using the Kaldi tandem SAT fMPE full-language-pack models. Pitch helps even for non-tonal languages. This is from Dan Povey, who has been playing around with pitch features because he was very unhappy with how Kaldi performed with pitch on Vietnamese, so he has basically done a lot of interesting work there. You can see that when you add the SAcC pitch feature, sometimes the error goes up; it goes down a little bit for Bengali, but his method, which he incorporated into Kaldi, gives an improvement on all of those languages. So Vietnamese and Cantonese are tonal, but you can see languages like Assamese improve as well, and certainly a lot of other people, with similar approaches to the problem, have this kind of result.
0:33:18have a similar program your problem and so one
0:33:21have this kind of result large lattices help up to a point
0:33:26right so you've got a
0:33:28i actually haven't so this is the data per
0:33:31where random is up in the upper right corner
0:33:34and the further down go the better but that curve shows the operating
0:33:39performance in terms of trade off between probability of false alarm probability of miss
0:33:44and so
0:33:45in further down is really important
0:33:48and so you can see the green line is done with small lattices
0:33:51and the purple and the line is done with larger
0:33:57and the normals
0:33:58lattices and eventually it is diminishing returns but certainly reserving stuff
0:34:04that you want to find is extremely important
0:34:10 Knowledge of the keywords helps, and you can see it helps even more with the limited language pack, where you might not know about those words based on the ten-hour subset. So if you know about the keywords, you can leverage that knowledge in interesting ways, like not pruning things away; you always want to keep the probabilities right, but you might want to set specific beams for specific keywords.
0:34:41 One of the teams has developed a white-list approach that uses the test audio reuse condition, and you can see here: with knowledge of the keywords before they decode, they get a recall of keywords of about ninety-two percent; without knowledge of the keywords it's seventy-four percent, and you can see the big difference in ATWV. Without that knowledge, the number of hits per keyword is much lower, and the number of keywords without hits is much higher. But if you simply look at, say, infrequent words that may be important, just boosting them in the language model actually does give you something in terms of being able to preserve those keywords, and the recall ends up somewhere in between. That's beneficial: it's preserving stuff so that you don't prune things out.
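(As an illustration of the boosting idea, here is a toy unigram example of my own, not the team's actual method: the keyword list is given a little extra language-model mass so keyword hypotheses survive pruning; the boost value is an arbitrary assumption.)

```python
import math

def boost_keywords(unigram_counts, keywords, boost=5.0):
    """Toy unigram LM boost: add pseudo-counts for keywords, then renormalize.

    unigram_counts: dict word -> count from the training transcripts.
    keywords: iterable of keyword strings (possibly unseen in training).
    Returns dict word -> log-probability.
    """
    counts = dict(unigram_counts)
    for kw in keywords:
        counts[kw] = counts.get(kw, 0) + boost   # unseen keywords get nonzero mass
    total = sum(counts.values())
    return {w: math.log(c / total) for w, c in counts.items()}

lm = boost_keywords({"the": 100, "flood": 2}, ["tsunami"])
print(lm["tsunami"])   # finite log-probability instead of being out-of-vocabulary
```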
0:35:35 And when you look at system combination (I think system combination is about preserving stuff too), you get big gains. This is the dev data set, and you can see the best single system versus the combined system, on a full language pack and a limited language pack, and you see that, except for Pashto, system combination gets you to about a 0.3 ATWV, which is pretty amazing: the word error rates aren't great, but you can actually make the target. Amazing. Here's another picture of system combination, where you can see the individual systems built with various modeling approaches, DNNs and so on, and then you have the combination; and, importantly, these are limited-language-pack results as well, so you're going to see much more modest scores.
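(A minimal sketch of what combining keyword-search outputs can look like, my illustration rather than a specific team's method: normalize each system's scores per keyword first, then merge hits that overlap in time and average their scores across systems. This also illustrates the point made shortly, that it pays to normalize before you combine.)

```python
def combine_detections(system_outputs, overlap_tol=0.5):
    """Toy combination of keyword detection lists from several systems.

    system_outputs: list of lists of (keyword, utt, start, end, score),
    assumed already normalized per keyword within each system.
    Hits for the same keyword and utterance whose midpoints fall into the
    same overlap_tol-second bin are treated as one event, and their scores
    are averaged over all systems (a system with no hit contributes zero).
    """
    merged = {}  # (keyword, utt, time bin) -> list of scores
    for output in system_outputs:
        for kw, utt, start, end, score in output:
            bin_id = round(((start + end) / 2.0) / overlap_tol)
            merged.setdefault((kw, utt, bin_id), []).append(score)
    n_systems = len(system_outputs)
    return [(kw, utt, bin_id * overlap_tol, sum(scores) / n_systems)
            for (kw, utt, bin_id), scores in merged.items()]
```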
0:36:37 And good old score normalization: this is the BBN result, dev and eval per language, where you look at Cantonese, Pashto, Turkish, Tagalog, and Vietnamese, and you can see normalization gives you a significant improvement, not always of the same size on the dev and the eval set, so there's some impact of the set. But you can see that normalization, and doing it well, is certainly a big part of the program, and there are a lot of methods that people are working on now, including rescoring.
0:37:14 The other interesting result, which I believe appears here as a poster (I couldn't put all the names of the authors up there in a way that would be readable, so I put "et al."), is that when you normalize is very important. You've got the contrast between the no-audio-reuse and the audio-reuse conditions, and you can look at either one row or the other. If I do normalization after system combination, I only get so far; but if I normalize before I do system combination, I do really well. And if I normalize the best tokenization before score combination, I can basically build a single system that is better than what you produce normalizing late; normalizing early wins. So if you're doing combinations of various representations, it is important to get the scores onto the same scale, in the same place; it really is important, and it makes a big difference. And quite frankly, a single system is going to be much easier to run, so it's kind of an interesting thing to know.
0:38:25 Another paper that appears here touches on analysis: the effect of thresholds on ATWV is an interesting thing to look at, because you can get a number of reference points. There's the fair threshold, which is just based on my notion of what I can do given what I have in the data; versus setting the threshold to be optimal for each keyword; and then, if I play around and make sure that I keep the things that matter and throw away the things that don't, so that I basically set the probability of a hit to one and my probability of a miss to zero, you can see that the probability space is also playing a major role in your ability to get the keywords. It's not just a matter of calibration; getting better probabilities seems to be an important aspect as well.
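(For context, the keyword-specific decision threshold used in much of the Babel-era literature falls straight out of the TWV objective: keep a detection when its expected gain in recall outweighs the expected false-alarm penalty. Here is a hedged sketch; estimating the expected keyword count as the sum of posteriors is one common choice, not the only one.)

```python
def keyword_threshold_decisions(posteriors, total_speech_seconds, beta=999.9):
    """TWV-motivated per-keyword threshold (sketch).

    posteriors: detection posteriors for one keyword from the search output.
    The expected true count is estimated as the sum of posteriors. A detection
    with posterior p is worth keeping when
        p / n_true >= beta * (1 - p) / n_nontarget_trials,
    which rearranges to the threshold below.
    """
    n_true = max(sum(posteriors), 1e-8)        # crude estimate of true occurrences
    n_trials = total_speech_seconds - n_true   # roughly one non-target trial per second
    threshold = beta * n_true / (n_trials + beta * n_true)
    return [(p, p >= threshold) for p in posteriors]
```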
0:39:22 So there are a lot of interesting things that people can look at, and analysis, I think, is a really important aspect of the program: understanding why something works and why something doesn't. Finding out why something doesn't work isn't such a bad thing; it basically gives you a piece of knowledge that really is important in terms of solving the problem.
0:39:43 We also had an open keyword search evaluation in two thousand thirteen, for Vietnamese, and we had a lot of participants: the four Babel performers plus eight outside teams who ended up submitting systems, and I list them here. We had eight wonderful volunteers who actually participated in the open KWS meeting. The results are all over the place, and I've put them up there (these are posted; the open KWS results are posted, so you can go take a look) so that, if people want to participate in the next one, maybe they won't feel so shy about the possibility of submitting something that may not be super. Certainly the Babel people have a lot more practice with the data, but you can see the scores were all over the place, and people really did a lot of interesting things: there were higher-resource approaches as well as low-resource approaches.
0:40:47 In period two we added six languages: five practice languages and one surprise. The teams only have sixty hours of transcribed training, though they do have the remaining twenty hours untranscribed; there's also the ten-hour limited training set, and they have to exceed the program targets now on both conditions, because they got so close. We also want approaches that use things like morphology and so on, which maybe would help you in the ten-hour set; maybe the sixty hours, or the eighty hours, is a little bit too large to show the benefit. And then they'll have three weeks to build the surprise-language system. The languages are Bengali and Assamese, which were collected in the first period, so they don't have another channel, they're pure telephony; then we have Zulu, Haitian Creole, and Lao; and of course we have a surprise, which I'm not going to out here. Assamese and Bengali, I think, are somewhat okay, but Zulu appears to be quite challenging and Haitian Creole appears to be quite simple, and these are aspects of the languages; I don't think they're aspects of the collection. And then Lao will have its own challenges, because again we couldn't annotate the compounds reliably, and so the Lao words, not the borrowed words, are not multisyllabic; they're single syllables, excuse me.
0:42:32 CASL put together some of the challenges of these languages and presented them, and I thought that was interesting. There's the notion of shared language models, where you can share between Bengali and Assamese; Assamese also doesn't have much of a web presence, so that's an interesting thing to deal with. There's the borrowing from French in Haitian Creole. There's the phonology: there are tonal languages here too; Lao has tone, kind of like Cantonese and Vietnamese, but the tone system is very different. Unfortunately the tone marking could not be done reliably, so it didn't make sense to put it in the resource. Then you also have some segmental phonology issues in Bengali, and morphology is an issue in a big way, maybe more so than in Bengali; the OOV rate is higher than in any of the languages we've seen, including Turkish, which didn't really have a terrible OOV rate. And then there are other aspects that linguists might be interested in looking at, like how alike the languages are. There's the question of script: Bengali and Assamese are sort of very similar; strictly speaking there is an Assamese script, but it really is much the same as the Bengali one; and then you have Lao, which has yet another script. There's a lot of code switching in Haitian Creole, and certainly I see some in the other languages as well, so those can be problems. And then there are a lot of short words in Haitian Creole and Lao; I guess the shorter words are hurting Haitian Creole, maybe; well, maybe not.
0:44:15 So, exciting directions people are going in. One of the things that we want is more analysis, and so we revised the evaluation plan (it is posted at the open KWS site, so you can take a look if you want) so that people can evaluate a lot more conditions and then share those conditions with each other, so that others can evaluate likewise. There's a lot of work going on in multilingual processing right now; it's very intriguing and very interesting, and I think, yes, the deep learning, those neural net models, certainly seems to play a role in the progress that people are making. Machine learning got off to a somewhat slow start, because you're trying to integrate that community into the speech community, but they're beginning to take off too, so stay tuned; I think there are a lot of interesting things that are going to happen.
0:45:16 Smart lattices and consensus networks were beginning to play a role at the end of the last period, but I think they're actually making much more progress now. The thing is that a lot of work was done on consensus networks to make them work with the keyword search task: originally they were developed by Lidia Mangu basically to do a last pass right before you gave your one-best output, and they were great for that, but there were things you could do to make them work a little bit better for keyword search. And then morphology: again, this is a community integration, with people who largely worked on text working with the speech community, and there are a lot of tradeoffs between whether you want to break words up into little pieces, which might be great if you're doing text and not so great if you're doing speech. A lot of the integration of the teams is beginning to bear fruit there as well.
0:46:20 It's quite interesting. And a big thing that I think is really important is getting by with less: ten hours of training, or less (I haven't seen results with less, but I certainly think it would be cool), and no pronunciation lexicon. Everybody promised to do decimation studies, but to a large extent the program targets, unfortunately, sometimes seem to drive the research toward the program targets as opposed to actually exploring the space of experiments. So there is a tradeoff between having annual evaluations and getting people to do research, but I really do hope that people will explore these conditions, because I think they're really important.
0:47:07 So I'm ending with a slide about the open KWS; that slide is partly why I'm here, and you can see the timescale: registration is going to close at the end of January, so if you're interested at all, please do consider it. The Vietnamese language pack will be available for those of you who have not participated before, and the open KWS people who have participated, as long as they participate again, can keep the data; so if you just keep participating, you can actually keep all the surprise languages, and hopefully NIST will open up some of those languages by evaluating on them too. So there's lots of data, and it's very useful; there are a lot of things you could do with that data to support basic speech recognition and other types of speech research. And hopefully by the end of the program this will all be released publicly to everybody, since we bought all the data.
0:48:11 But you can see the surprise language build pack is going to be sent out a week or so before the evaluation begins, and then we send a password, so there won't be any problem with the download; although the download is going to be a little bit harder this time, since we have the channel data, which is not downsampled in any way, because we figured that's an aspect of handling that data. Then people have the three weeks. We send out the evaluation pack ahead of time as well, since it's larger: it's seventy-five hours, and some of that is channel data. Then we'll send a password on April twenty-eighth, at which point people have a week to complete their submissions; you can submit many things. NIST will keep an eye on things to make sure that submissions are sound and there are no problems, and there is a point of contact and so on, so it should not be a very bad experience. The other thing is that there will be an open KWS meeting, where everybody would be expected to participate, so there is a bit of a burden there for people who might participate; but I think that the meeting last time was very valuable, and the Babel folks were really very generous in sharing their insights, so I think it's a great opportunity to hear about the work and to be able to ask questions and interact with the Babel participants. So I think it's a really good thing to have the open KWS.
0:49:42 And last but not least, this is the catch-up-with-the-slides stage: this is one of the things you have to do in the pitch for the program. I put a little asterisk there after "languages covered" because, obviously, it's nice to be able to say "all", but really there's the caveat that this has to be a language that has an orthographic transcription. And I have to say, even just having an orthographic transcription does not make it easy to create a language pack, and some languages are really much more normalized than others. As much as we have done a lot of work in terms of normalizing English, there are a lot of spelling variants that happen, and it's a lot harder to do in these other languages, where there really aren't well-studied conventions. So I'll star that caveat, because you really do have to have the capability of being able to clean up the language, even when it has a presence on the web.
0:50:47 And the timelines, which we talked about as well: we're moving down to ten to forty hours of transcription, working with variable recording conditions, and ultimately having them develop a system in a week. The immediate impact has been language data; we've shared language data and had open evals, and that impacts the community and also helps the government. New methods in speech search and speech systems are sort of the medium-term impact, and getting effective keyword search in new languages, delivered quickly, is the ultimate deliverable. So learning how to do that, learning how to solve the problem of "here's a new language, now build the system", is really the core principle of the program, and everything really needs to be projected in that direction.
0:51:36 And ultimately there are lots of other questions to ask, like: what if I only have a certain amount of time to transcribe? We didn't find that we could manage that very well programmatically, but people can certainly investigate it, where they consider the time to transcribe and to clean things up when selecting the data they will work with. The nice thing is that the eighty hours of audio are there regardless of how much transcribed data you use, so there's a lot of room to investigate a wide variety of ways of getting by with less, including getting by with no lexicon, or getting by without transcripts at all. Certainly there is work like that going on in the program; it may not reach the performance that the best systems do, but I would say it's all equally important and vital to the program, so having a wide variety of things going on, I think, is really important. I'm done, so I'll take your questions.