0:00:15 I am Mary Harper, and in October two thousand ten I went to IARPA to develop this program, Babel. And I say "BAY-bel", but there are lots of ways to pronounce Babel, and you can actually go to this website and find out about all the ways of saying it. A lot of people like this example. For me Babel is "BAY-bel" only, and I grew up in Buffalo, New York, where that was the dialectal variant in use; if you want a taller tower, you say "ba-BELL". And of course there's also the original Hebrew word, and a variety of other ways of pronouncing it as well. But as Morgan pointed out, however you say it now, that isn't necessarily how it sounded then, for some reason.
0:01:07 Okay, so every program, whether it's at DARPA or IARPA, has sort of a back story; you have to have a motivation, sort of an elevator speech. So my challenge is this: you're in a situation where you're dealing with a crisis, or it might be a humanitarian need, for example, where you have to deal with a lot of noisy speech in order to respond, and you have thousands of hours and no time to listen. You might have one or two people who could listen to it, but you're certainly not going to get through it in any time that would be reasonable in order to help people. And if you have no existing speech technology for that language, you have problems. But if you could rapidly develop that technology, say in a day or two, you actually might be able to do something with it.
0:02:02 It sort of addresses two gaps. It's hard to build up the human capital in a language, because that can take years, and in some cases we only have one or two people who know the language; we see, even just in developing the resources, that we don't have this language capital. And there's also a technology gap. This slide was certainly drawn up a number of years ago, but of the roughly three hundred and ninety-three languages that have a million or more speakers, we've touched very few; we've really only studied a handful seriously. I mean, we study English all the time because it's easy: there are corpora and so on. And it can take way too much time, months to years, to build a new language, especially if you have to transcribe the data. The systems developed for English don't always work well on other languages; they can help with the bootstrap, but they certainly don't give you the kind of error rates that somebody might want to see.
0:03:09 So the basic idea underlying Babel: rather than just evaluate word error rate, the director was very adamant that she wanted to have a real task, not just transcription. And so keyword search; fortunately NIST had run a spoken term detection evaluation in two thousand six, so we settled on the keyword search task. The basic idea is that you use speech recognition, or phone recognition, or something else, to index the thousands of hours of audio, and then you have some way of putting in a query. For Babel we use orthographic queries, and those who are doing low-resource work do other things with the data in order to accommodate the fact that we use orthographic queries. And then we evaluate whether the keyword was correctly identified in the audio.
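(To make the task concrete, here is a minimal sketch of indexing recognizer output and answering an orthographic query. It is only an illustration under simple assumptions: real Babel systems index lattices rather than one-best hypotheses, and the data structures and names here are invented for the example.)

```python
from collections import defaultdict

# Toy index over ASR output. Each hypothesis is (utterance_id, word, start, end, score).
# Real systems index lattices, not one-best output, to preserve search alternatives.
def build_index(hypotheses):
    index = defaultdict(list)
    for utt, word, start, end, score in hypotheses:
        index[word.lower()].append((utt, start, end, score))
    return index

def search(index, query, threshold=0.5):
    """Return putative hits for a single-word orthographic query above a score threshold."""
    hits = index.get(query.lower(), [])
    return [h for h in hits if h[3] >= threshold]

hyps = [("utt1", "tsunami", 2.1, 2.8, 0.82),
        ("utt2", "tsunami", 10.4, 11.0, 0.31)]
print(search(build_index(hyps), "tsunami"))  # only the confident hit survives
```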
0:04:06 So our approach is really to work with a wide variety of languages, not just European languages; I think it's really important to study languages that have a wide variety of properties, and in real recording conditions as much as possible. Obviously the collections are going to suffer from the fact that when you go into these countries you may not be able to record in a highly reverberant room or something, but the hope is that you can get these sort of real-world recording situations. And then we constrain the resources in various ways. We actually collect a lot of data, but we create a wide variety of conditions for people to evaluate, and they can create conditions as well to answer questions that they think are important, like getting by without a lexicon. So we gradually reduce the amount of transcribed speech we give them for training, but we also give them the audio untranscribed. We also reduce the amount of time that they have to evaluate on the surprise language, and I think that's critical: not starting off with something that's impossible at the outset, but actually getting people to the point where they can develop the technology, is extremely important here. And we set the targets to be roughly a three-times improvement over what you get with phonetic search, and I think that was critical; that was based on the STD 2006 BBN results on Cantonese and Mandarin, where they got roughly a 0.3 ATWV, so we set that as the target level.
0:05:50 So the goal is to improve speech technology with limited amounts of ground-truth data. Building speech systems for a non-English language is extremely important, and so is improving speech recognition through innovative use of the technology and different approaches, across a wide variety of languages, so that you can get fast development of keyword search systems to tackle this problem.
0:06:20 Just to give you a sense of the layout of the program: other than the base year, where they had a little more than the nine months because it was a fifteen-month period, they have roughly nine months to work with the data, and the data doesn't necessarily all arrive on day one. Then the evaluation starts, where they have about a month to do the keyword search on the practice languages; we evaluate everything, because it's really important to understand what progress is being made on the different languages, since the languages are all different. And then we give them a surprise language, where we hand over the data (I'll talk about that a little bit), and they have a certain number of weeks to build their system, which decreases over the program: in the base period it was four weeks, and in the option one period they have three weeks. Then they have one week to return their keyword search results. You might ask why a whole week: there are a lot of queries to run for the evaluation, with people trying out many keywords, so it is important to leave a sufficient amount of time there as well.
0:07:32 So the measure that we're using for performance is the actual term-weighted value, ATWV, which was developed by NIST, I think in coordination with a number of sponsors of that evaluation. It's built around a use case: you've got people who would like to be able to find stuff, and they don't tolerate a great number of false alarms, so you wouldn't want to use an F-score. The other thing is rare terms, given the Zipfian nature of language and the fact that rare terms may be very useful for finding things that are critical. I mean, "tsunami" might be a very common thing in the traffic you're collecting, but it may not have been there in your training data, and you want to be able to find the tsunami things in your audio. What you have to realize is that it is term-weighted: it's evaluated over all terms regardless of the frequency of those terms, so a singleton counts the same in the score as something that is highly frequent. And then there are a number of things to keep in mind, like the cost and value constants: by and large you've got this large weighting on the probability of false alarm, but the systems typically have very low probability of false alarm, so once you understand that, it's sort of a tradeoff between those two things, and really, missing something hurts the score badly when there are singletons. So that's something you want to keep in mind as you look at the results that I'm going to go through.
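(For reference, here is a rough sketch of how term-weighted value is computed. The authoritative definition is in the NIST KWS evaluation plan; the beta constant and the one-trial-per-second approximation below are the commonly cited choices and should be read as assumptions.)

```python
def term_weighted_value(term_stats, total_speech_seconds, beta=999.9):
    """Sketch of TWV: 1 - average over terms of (P_miss + beta * P_FA).

    term_stats maps each term to (n_true, n_correct, n_false_alarm).
    Every term is weighted equally regardless of frequency, so a missed
    singleton hurts as much as a missed frequent term.
    """
    losses = []
    for n_true, n_correct, n_fa in term_stats.values():
        if n_true == 0:                               # terms absent from the reference are skipped
            continue
        p_miss = 1.0 - n_correct / n_true
        n_trials = total_speech_seconds - n_true      # roughly one non-target trial per second
        p_fa = n_fa / n_trials
        losses.append(p_miss + beta * p_fa)
    return 1.0 - sum(losses) / len(losses)
```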
0:09:13 So the Babel program has a number of dimensions in terms of the people who are working on it. Obviously the program wouldn't exist without data, and Appen has been the data collector from day one; I actually talked to them about the notion of the data collection before I even went on the job. Then we have the test and evaluation team; that's what T&E stands for. It's important to realize that you need NIST, who can run an evaluation, plus the technical support to set an evaluation up for an approach like this: we actually have a group that builds systems so that we can do forced alignments and things like that, there's help with some of the logistics, and CASL provides needed help with linguistics. They advise me on a number of dimensions, certainly on getting good phonetic coverage of a language and getting diversity across the languages; it would be really hard for me, since I don't know all these languages, to do that myself. The other thing is that there is a sort of teaming between the T&E team and Appen in order to ensure that the quality of the data is appropriate for the task that we're doing. Keyword search is not something that Appen had supported before, and errors in the data really do make it very challenging to evaluate keyword search, so we actually do some of that cleanup offline; I'll talk a little bit about that later on. And then we have four teams, where I've put the primes on the left: CMU, IBM, ICSI, and BBN are the primes, and you can see all the people who participated in the base period. Sometimes there's some reconfiguration, but this is the picture as it was at the time of the base period.
0:11:13 So, lots of work: I think there are sixteen papers here that were supported by Babel, and if you go back to ICASSP and Interspeech over the past couple of years, I think there are probably a hundred papers or so that have been sponsored by Babel, all with great work. I want to point out that as I go through things, I don't have time to touch on all the work or all the cool things that people are doing; I'm just going to point out some things I selected, sort of interesting lessons learned, and there are a lot of other things people are doing that are quite interesting. I'm also going to point out how we changed things for the option period, and the kinds of things that look like real glimmers of hope, I think. So I'm not going to exhaust the research; you'll be able to see it at future conferences.
0:12:07 So the data collection is actually quite daunting. We are collecting the data with seven languages in collection at a time. We only needed four practice languages and one surprise language for the base period, but we collected seven, and it was a good thing we did, because what we had planned to use as the development language and the surprise language were Assamese and Bengali, where Assamese was supposed to be the surprise; things went wrong with those collections, and so we basically had to use the other five languages. It is really important to be over-collecting for your needs at a particular time. The amount of time we spend to collect seven languages, given the fact that you stagger the kickoffs, is roughly two years, so you can see that there are these two-year overlapped periods. It is really interesting: right now we're working on sixteen and getting ready to send out funds for five more, so this really is the critical period for making sure the rest of the program is going to play out. But you can see there is an increasing number of languages in each period; subtract one for the surprise language and you can see how many are being used for practice. So you can imagine that by the time you hit the end of the program, multilingual systems are going to be really highly supported.
0:13:45 We have a variety of criteria for selecting languages; I'll talk about that a little bit more on the next slide. Most of these are multi-dialectal, and they also represent a wide variety of recording conditions. Starting in the option period we also started collecting a microphone channel, and all of the data include surprise environments or channels in the evaluation. So there is always something that's new; it's not there for hundreds of hours of the data, but it's there so people can assess whether their methods are working on these things. So: languages from a variety of language families, with different features (phonotactic, morphological, syntactic, and so on), whether or not they're tonal; data collected in country, which I think is really important, because then you're living with a wide variety of telecommunications situations; dialectal variation; and a wide variety of environments. The easiest environment tends to be the home-office one, where there's a landline or a mobile; it's not always a landline in some of these countries now, so the landline is disappearing in some of the collections we're doing. Probably the hardest place in Babel is the car, collected with a car kit; it tends to be one of the worst. And then there are others; obviously you want to have non-telephone-channel data in there as well. And metadata balance: we provide the metadata with each of the audio files, so that the collection could ultimately be used to support dialect ID, language ID, or other things. You want to collect this data in such a way that it can be used for a variety of purposes.
0:15:30 We start off doing a risk assessment. Obviously you don't want to go into a country where there's a likelihood that people will die while they're doing the collection, so you have to take that into consideration. We also have to take into consideration whether or not we can potentially get transcribers and people who know something about the language, so all those things are certainly taken into account. Then we begin the work of vetting a language, where we work on what Appen calls a language specific peculiarities document. It typically involves providing the phoneme set that is going to be used by Appen, and a variety of other things: something about the dialects and which primary dialect they would standardize on, and, for example, spelling conventions that some people use and some people don't. It is part of an iterative process, so we keep it going; it provides the start of the lexicon, along with other information that is very useful. Then there's a small database of transcribed conversation that they send to us, which is reviewed by CASL and others to make sure that the transcription quality is reasonable. Sometimes we also get a lexicon to take a look at and provide feedback that affects things. Only then do we receive an interim delivery, which is about three hours of conversation, and with that we actually start looking for variant spellings, words whose spellings are diverging, because you can use the lexicon, together with some language experts, to help you spot these, and we try to clean that up; so spelling normalization is something that we do. It perhaps adds a certain amount of artificiality, but it certainly is important to do, and I can tell you it's not going to be a hundred percent accurate; it's being done with a certain amount of limitation on the resources that are available. Finally we get the big delivery, and that's reviewed and partitioned into training, dev, and eval. Every collection is collected as if it were a surprise language, where we use seventy-five hours for the eval, but for the development languages, the practice languages, we only use fifteen, so in many cases we have a lot of leftover audio that we just don't pass on.
0:17:58 We also develop keywords, using a certain amount of the data, and we have them annotated by Appen so that we can assign types and so on; that way we can have a certain notion of balance among the keywords, and we make sure that we come up with a certain number of names and so on, so that there's balance in the test. Also, the segments that Appen provides can be very large, so we re-segment using voice activity detection, and those segments are passed back to Appen for a judgment of quality, where they're compared to the original segments. Then we do forced alignments on the dev and eval and give the forced alignments to the performers.
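(The re-segmentation step is ordinary voice activity detection. As a minimal illustration only, an energy-based toy rather than the pipeline actually used, something like the following splits long recordings at stretches of low energy; the threshold and pause-length parameters are assumptions.)

```python
import numpy as np

def energy_vad_segments(samples, rate, frame_ms=25, hop_ms=10,
                        threshold_db=-40.0, min_silence_s=0.3):
    """Toy energy-based VAD: return (start_sec, end_sec) speech segments.

    samples: 1-D numpy array of floats in [-1, 1]; rate: sample rate in Hz.
    """
    frame, hop = int(rate * frame_ms / 1000), int(rate * hop_ms / 1000)
    frames = [samples[i:i + frame] for i in range(0, len(samples) - frame, hop)]
    log_e = [10 * np.log10(np.mean(f ** 2) + 1e-10) for f in frames]
    voiced = [e > threshold_db for e in log_e]

    segments, start, silence = [], None, 0.0
    for i, v in enumerate(voiced):
        t = i * hop / rate
        if v:
            if start is None:
                start = t          # speech begins
            silence = 0.0
        elif start is not None:
            silence += hop / rate
            if silence >= min_silence_s:            # long pause: close the segment
                segments.append((start, t - silence))
                start, silence = None, 0.0
    if start is not None:
        segments.append((start, len(samples) / rate))
    return segments
```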
0:18:43 These are the period one languages. We began with Cantonese, Pashto, Tagalog, and Turkish, which were pretty risk-free languages, and then we tested on Vietnamese. Remember, Vietnamese was not meant to be the surprise language, and it ended up being somewhat challenging, in that for Cantonese they provided word boundaries, but in Vietnamese it was just the syllables, so things tend to be short words, and they also did not do a bang-up job of including all the dialectal variants of the pronunciations, which I think probably also caused problems. Still, as a resource it's a great resource if you're interested in understanding the Vietnamese dialects. You can see the number of dialects per language: Cantonese has five, Pashto four, Tagalog three, Turkish seven, and so on. The Cantonese dialects were probably pretty hard for some people to understand, so at the beginning, when we used the data, there was some question about whether those dialects were really Cantonese; but they were.
0:19:55 When we evaluate: NIST developed an evaluation plan, and there were three conditions for the language resources that are used. There's the base language resource condition, which is "I use the resources I was given and nothing else." There's the Babel language resource condition, where you could use the other Babel language packs that you have available, and that's very nice for multilingual work. And then if you want to bring in other, non-Babel resources, you can do that: for example, if you want to bring in web text or something like that, or a pronunciation lexicon, or if you have some found data, you can do that. Then there's the amount of training data that they use from the language pack: you could either use the eighty hours of conversational speech together with the scripted speech, or you could use the limited condition, which uses ten hours of transcription that is sub-selected from the eighty hours, so it's a proper subset of the eighty-hour set. And then there are two conditions for evaluating keywords. There's the hard condition, no test audio reuse: you build your keyword search system without knowledge of the keywords, and you just do the search based on those keywords; you're not able to re-decode or retrain or anything like that with knowledge of them. The test audio reuse condition is where you do have knowledge of the keywords: you could do things like automatically add them to the lexicon and do crazy things in terms of the language model, and if you were going to do the other-LR condition, you could go out and look for language model data, and so on. So there's a lot of variability here. In the option period we've actually changed things up a lot: people can declare the resources they use, and so there are a lot of interesting new conditions that performers can come up with on their own. This was the start, but there is certainly going to be, I think, a lot more variability in the experiments people run in the future.
0:22:07 Another innovation that came out of the program: since we're evaluating so many languages, and we don't want to prevent people from running experimental conditions, NIST developed the Babel scoring server. This allows researchers to submit and get evaluated against the test data. We don't release all the test data after the test; we release some portion of it, and if you want to go on and evaluate against the full test set, there is a sequestered part. I think that's really important: if you're writing a paper ten months after the evaluation and you want to go back and re-evaluate, or you've discovered something new and you want to test your hypothesis on the past languages, you can do that and still get the full test set. I think that's very important, and I really think it's going to make a lot of difference in terms of the pure science the program can support.
0:23:15 Jon Fiscus put together this plot for the open eval: this is submissions, week by week, through the twenty-seventh week of the program, and you can see where there are spikes in the cumulative number of submissions. But you can also see that even after the evaluations are over, especially with Vietnamese, people keep submitting, because Vietnamese was somewhat challenging and some people wanted to continue the work, and on a number of other languages as well. The results go back to you as soon as they say everything is okay; there is a sort of intermediate point where they want to make sure that everything is working properly, so it usually takes about a week before the first results are released, but as soon as they are, people can report them openly.
0:24:13so
0:24:15in the first period people to the state and a lot of creative things
0:24:19people submitted primary in contrast systems and
0:24:23for the most are trying to or submissions word system combinations and we'll talk a
0:24:28little bit about system combination because it really does seem to help except for the
0:24:32swordfish
0:24:33all performers were able to make the program targets in all languages including the surprise
0:24:38using the full language pair
0:24:40and that in the base language resource condition with no audio
0:24:45and of course
0:24:47there are other conditions where you could potentially do better
0:24:50program targets were exceeded with ten hours of training and for the five languages by
0:24:55some people
0:24:57usually using system combination
0:24:59right
0:25:01system combination reduces
0:25:04the token error rate and increases atwv compared to single systems
0:25:10but even single system full
0:25:12language pair
0:25:14single system full language pack systems
0:25:16maybe program target
0:25:20with the with the language back
0:25:23all systems have of course have very probable low false alarm
0:25:29warring this miss rate places a significant role in increasing atwv and that something you
0:25:34want to sort of keep in mind
0:25:36and there were several collection factors that actually attracted atwv language dialect environment gender
0:25:44and i'm good just show you some poor results i think that are sort of
0:25:47interesting
0:25:48 I don't think I've shown this even to the performers; I actually put this together for my program review. I'm not sure whether these slides are posted; actually, they're probably not. But here you can see the base-LR, full-language-pack results all marked, and not everybody submits to every condition; the only one that was required was the full language pack base-LR. And you can see people made their targets, in all of the languages.
0:26:28 Gender affects ATWV, and, what was kind of interesting, word error as well. In this set of collections the systems did better with female speech, which is kind of interesting, though not in all the languages, and sometimes by a lot; look at Tagalog, for example, where the males are so much worse. I don't know why; I mean, we collect two thousand speakers per language, and I'm sure there are interactions with other factors. But environment is important. You can see that overall, pooling over all systems, you get an average of about 0.51 ATWV; the car and the unexpected environment were sort of equally hard, the landline and the mobile in the home office are sort of the best, and the place and street recordings are somewhere in between; typically those are probably done with a cell phone.
0:27:41 When you look across languages (this is kind of a messy slide), the car data is significantly worse than the rest, and obviously Pashto was the harder language overall, but there's something going on there. And it is kind of interesting: you look at Turkish and the landline is wonderful; well, they probably have a much more stable landline environment, and in some of these countries landlines may be rare, so maybe for Pashto the mobile was the predominant thing. What I didn't give you here is the breakout of the distributions. Dialect and ATWV also interacted: for this language you can see northeast, northwest, southeast, and southwest dialects, and southwest was really under-represented, which only really became clear as the collection went on. But you can see people could still do something with it; some of these dialects were related, but certainly the ones that had a higher amount of data were the best, the ones that had the least amount of data were sort of the worst, and that was true across the board. So dialect certainly does add a dimension of challenge to the data.
0:29:04 I think it's the microphone, somehow or another; we're getting an echo.
0:29:20 So what helps? Well, early on it was clear, especially with the Cantonese data, that you have to re-segment the data and do silence modeling to get rid of the silence, or you kind of screw things up. Robust multilingual MLP features were really important, I think; they really played a major role. Deep learning really started to shine in the program very early, and I think there's lots and lots of room for it to keep shining and for very interesting experiments. Pitch features in every language were useful, at least for most people, and what's kind of cool about that is that it gives some hope for more universal feature extraction. One of the things that was really extremely important was to develop methods for preserving potential search alternatives, and there are a variety of ways of doing that, including denser lattices and smarter ways of doing the queries; there are a number of papers here that you can see on this topic, and in other venues too. Then, combining systems, especially when little training data is available, really matters a lot, and it matters a lot whether you try to build the systems differently or just randomly seed them differently. System combination is very useful; semi-supervised training is very helpful for acoustic models and features; and score normalization really plays a big role: if you do nothing else, score normalization gives you a lot.
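(The score normalization referred to here is typically keyword-specific: raw detection scores are rescaled per keyword so that one global threshold treats frequent and infrequent keywords more evenly. Below is a minimal sketch of a sum-to-one style normalization; the gamma exponent is an assumed knob, and this is an illustration rather than any particular team's recipe.)

```python
from collections import defaultdict

def sum_to_one_normalize(detections, gamma=1.0):
    """Normalize detection scores per keyword so each keyword's scores sum to one.

    detections: list of (keyword, location, score) tuples with raw scores in [0, 1].
    gamma is an assumed sharpening exponent; gamma=1.0 is plain sum-to-one.
    """
    totals = defaultdict(float)
    for kw, _, score in detections:
        totals[kw] += score ** gamma
    return [(kw, loc, (score ** gamma) / totals[kw] if totals[kw] > 0 else 0.0)
            for kw, loc, score in detections]
```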
0:31:03 I could report any number of things; I just picked a smattering. Typically the reason I picked something was not an endorsement per se; it was largely because there was some picture that fit with the point I was trying to make. Several of these have papers appearing here, so I've put those in where I could line up the result, because some of these results were things I got from site visits as opposed to from papers, since I had to prepare the talk a while ago. But you can see: stacked bottleneck features versus plain bottleneck features give you an eight percent reduction in word error and a concomitant improvement in ATWV. Adding fundamental frequency and probability of voicing reduces word error; this was on Vietnamese, I believe. Regenerating the neural network targets added a percent, and semi-supervised training helped a lot too, and those were all additive. Very cool.
0:32:13 So features are very important, and deep learning is very helpful. We have a comparison here between shallow and deep networks, and you can see the shallow versus the deep ATWV: a two to three percent absolute improvement. This was using the Kaldi tandem SAT fMPE full-language-pack models. Pitch helps even for non-tonal languages. This is from Dan Povey, who has been playing around with pitch features because he was very unhappy with how Kaldi performed with pitch on Vietnamese, so he has basically done a lot of interesting work there. You can see that when you add the SAcC pitch feature, sometimes the error goes up; it goes down a little bit for Bengali, but his method, which he incorporated into Kaldi, gives an improvement on all of those languages. So Vietnamese and Cantonese are tonal, but you can see languages like Assamese improve as well, and certainly a lot of other people, with similar approaches to the problem, have this kind of result.
0:33:18have a similar program your problem and so one
0:33:21have this kind of result large lattices help up to a point
0:33:26right so you've got a
0:33:28i actually haven't so this is the data per
0:33:31where random is up in the upper right corner
0:33:34and the further down go the better but that curve shows the operating
0:33:39performance in terms of trade off between probability of false alarm probability of miss
0:33:44and so
0:33:45in further down is really important
0:33:48and so you can see the green line is done with small lattices
0:33:51and the purple and the line is done with larger
0:33:57and the normals
0:33:58lattices and eventually it is diminishing returns but certainly reserving stuff
0:34:04that you want to find is extremely important
0:34:10 Knowledge of the keywords helps, and you can see it helps even more with the limited language pack, where you might not know about those words based on the ten-hour subset. So if you know about the keywords, you can leverage that knowledge in interesting ways, like not pruning things away; you always want to keep the probabilities right, but you might want to set specific beams for specific keywords.
0:34:41 One of the teams has developed a white-list approach that uses the test audio reuse condition, and you can see here: with knowledge of the keywords before they decode, they get a recall of keywords of about ninety-two percent; without knowledge of the keywords it's seventy-four percent, and you can see the big difference in ATWV. Without that knowledge, the number of hits per keyword is much lower, and the number of keywords without hits is much higher. But if you simply look at, say, infrequent words that may be important, just boosting them in the language model actually does give you something in terms of being able to preserve those keywords, and the recall ends up somewhere in between. That's beneficial: it's preserving stuff so that you don't prune things out.
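(As an illustration of the boosting idea, here is a toy unigram example of my own, not the team's actual method: the keyword list is given a little extra language-model mass so keyword hypotheses survive pruning; the boost value is an arbitrary assumption.)

```python
import math

def boost_keywords(unigram_counts, keywords, boost=5.0):
    """Toy unigram LM boost: add pseudo-counts for keywords, then renormalize.

    unigram_counts: dict word -> count from the training transcripts.
    keywords: iterable of keyword strings (possibly unseen in training).
    Returns dict word -> log-probability.
    """
    counts = dict(unigram_counts)
    for kw in keywords:
        counts[kw] = counts.get(kw, 0) + boost   # unseen keywords get nonzero mass
    total = sum(counts.values())
    return {w: math.log(c / total) for w, c in counts.items()}

lm = boost_keywords({"the": 100, "flood": 2}, ["tsunami"])
print(lm["tsunami"])   # finite log-probability instead of being out-of-vocabulary
```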
0:35:35 And when you look at system combination (I think system combination is about preserving stuff too), you get big gains. This is the dev data set, and you can see the best single system versus the combined system, on a full language pack and a limited language pack, and you see that, except for Pashto, system combination gets you to about a 0.3 ATWV, which is pretty amazing: the word error rates aren't great, but you can actually make the target. Amazing. Here's another picture of system combination, where you can see the individual systems built with various modeling approaches, DNNs and so on, and then you have the combination; and, importantly, these are limited-language-pack results as well, so you're going to see much more modest scores.
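(A minimal sketch of what combining keyword-search outputs can look like, my illustration rather than a specific team's method: normalize each system's scores per keyword first, then merge hits that overlap in time and average their scores across systems. This also illustrates the point made shortly, that it pays to normalize before you combine.)

```python
def combine_detections(system_outputs, overlap_tol=0.5):
    """Toy combination of keyword detection lists from several systems.

    system_outputs: list of lists of (keyword, utt, start, end, score),
    assumed already normalized per keyword within each system.
    Hits for the same keyword and utterance whose midpoints fall into the
    same overlap_tol-second bin are treated as one event, and their scores
    are averaged over all systems (a system with no hit contributes zero).
    """
    merged = {}  # (keyword, utt, time bin) -> list of scores
    for output in system_outputs:
        for kw, utt, start, end, score in output:
            bin_id = round(((start + end) / 2.0) / overlap_tol)
            merged.setdefault((kw, utt, bin_id), []).append(score)
    n_systems = len(system_outputs)
    return [(kw, utt, bin_id * overlap_tol, sum(scores) / n_systems)
            for (kw, utt, bin_id), scores in merged.items()]
```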
0:36:37 And good old score normalization: this is the BBN result, dev and eval per language, where you look at Cantonese, Pashto, Turkish, Tagalog, and Vietnamese, and you can see normalization gives you a significant improvement, not always of the same size on the dev and the eval set, so there's some impact of the set. But you can see that normalization, and doing it well, is certainly a big part of the program, and there are a lot of methods that people are working on now, including rescoring.
0:37:14 The other interesting result, which I believe appears here as a poster (I couldn't put all the names of the authors up there in a way that would be readable, so I put "et al."), is that when you normalize is very important. You've got the contrast between the no-audio-reuse and the audio-reuse conditions, and you can look at either one row or the other. If I do normalization after system combination, I only get so far; but if I normalize before I do system combination, I do really well. And if I normalize the best tokenization before score combination, I can basically build a single system that is better than what you produce normalizing late; normalizing early wins. So if you're doing combinations of various representations, it is important to get the scores onto the same scale, in the same place; it really is important, and it makes a big difference. And quite frankly, a single system is going to be much easier to run, so it's kind of an interesting thing to know.
0:38:25 Another paper that appears here touches on analysis: the effect of thresholds on ATWV is an interesting thing to look at, because you can get a number of reference points. There's the fair threshold, which is just based on my notion of what I can do given what I have in the data; versus setting the threshold to be optimal for each keyword; and then, if I play around and make sure that I keep the things that matter and throw away the things that don't, so that I basically set the probability of a hit to one and my probability of a miss to zero, you can see that the probability space is also playing a major role in your ability to get the keywords. It's not just a matter of calibration; getting better probabilities seems to be an important aspect as well.
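(For context, the keyword-specific decision threshold used in much of the Babel-era literature falls straight out of the TWV objective: keep a detection when its expected gain in recall outweighs the expected false-alarm penalty. Here is a hedged sketch; estimating the expected keyword count as the sum of posteriors is one common choice, not the only one.)

```python
def keyword_threshold_decisions(posteriors, total_speech_seconds, beta=999.9):
    """TWV-motivated per-keyword threshold (sketch).

    posteriors: detection posteriors for one keyword from the search output.
    The expected true count is estimated as the sum of posteriors. A detection
    with posterior p is worth keeping when
        p / n_true >= beta * (1 - p) / n_nontarget_trials,
    which rearranges to the threshold below.
    """
    n_true = max(sum(posteriors), 1e-8)        # crude estimate of true occurrences
    n_trials = total_speech_seconds - n_true   # roughly one non-target trial per second
    threshold = beta * n_true / (n_trials + beta * n_true)
    return [(p, p >= threshold) for p in posteriors]
```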
0:39:22 So there are a lot of interesting things that people can look at, and analysis, I think, is a really important aspect of the program: understanding why something works and why something doesn't. Finding out why something doesn't work isn't such a bad thing; it basically gives you a piece of knowledge that really is important in terms of solving the problem.
0:39:43 We also had an open keyword search evaluation in two thousand thirteen, for Vietnamese, and we had a lot of participants: the four Babel performers plus eight outside teams who ended up submitting systems, and I list them here. We had eight wonderful volunteers who actually participated in the open KWS meeting. The results are all over the place, and I've put them up there (these are posted; the open KWS results are posted, so you can go take a look) so that, if people want to participate in the next one, maybe they won't feel so shy about the possibility of submitting something that may not be super. Certainly the Babel people have a lot more practice with the data, but you can see the scores were all over the place, and people really did a lot of interesting things: there were higher-resource approaches as well as low-resource approaches.
0:40:47 In period two we added six languages: five practice languages and one surprise. The teams only have sixty hours of transcribed training, though they do have the remaining twenty hours untranscribed; there's also the ten-hour limited training set, and they have to exceed the program targets now on both conditions, because they got so close. We also want approaches that use things like morphology and so on, which maybe would help you in the ten-hour set; maybe the sixty hours, or the eighty hours, is a little bit too large to show the benefit. And then they'll have three weeks to build the surprise-language system. The languages are Bengali and Assamese, which were collected in the first period, so they don't have another channel, they're pure telephony; then we have Zulu, Haitian Creole, and Lao; and of course we have a surprise, which I'm not going to out here. Assamese and Bengali, I think, are somewhat okay, but Zulu appears to be quite challenging and Haitian Creole appears to be quite simple, and these are aspects of the languages; I don't think they're aspects of the collection. And then Lao will have its own challenges, because again we couldn't annotate the compounds reliably, and so the Lao words, not the borrowed words, are not multisyllabic; they're single syllables, excuse me.
0:42:32 CASL put together some of the challenges of these languages and presented them, and I thought that was interesting. There's the notion of shared language models, where you can share between Bengali and Assamese; Assamese also doesn't have much of a web presence, so that's an interesting thing to deal with. There's the borrowing from French in Haitian Creole. There's the phonology: there are tonal languages here too; Lao has tone, kind of like Cantonese and Vietnamese, but the tone system is very different. Unfortunately the tone marking could not be done reliably, so it didn't make sense to put it in the resource. Then you also have some segmental phonology issues in Bengali, and morphology is an issue in a big way, maybe more so than in Bengali; the OOV rate is higher than in any of the languages we've seen, including Turkish, which didn't really have a terrible OOV rate. And then there are other aspects that linguists might be interested in looking at, like how alike the languages are. There's the question of script: Bengali and Assamese are sort of very similar; strictly speaking there is an Assamese script, but it really is much the same as the Bengali one; and then you have Lao, which has yet another script. There's a lot of code switching in Haitian Creole, and certainly I see some in the other languages as well, so those can be problems. And then there are a lot of short words in Haitian Creole and Lao; I guess the shorter words are hurting Haitian Creole, maybe; well, maybe not.
0:44:15 So, exciting directions people are going in. One of the things that we want is more analysis, and so we revised the evaluation plan (it is posted at the open KWS site, so you can take a look if you want) so that people can evaluate a lot more conditions and then share those conditions with each other, so that others can evaluate likewise. There's a lot of work going on in multilingual processing right now; it's very intriguing and very interesting, and I think, yes, the deep learning, those neural net models, certainly seems to play a role in the progress that people are making. Machine learning got off to a somewhat slow start, because you're trying to integrate that community into the speech community, but they're beginning to take off too, so stay tuned; I think there are a lot of interesting things that are going to happen.
0:45:16 Smart lattices and consensus networks were beginning to play a role at the end of the last period, but I think they're actually making much more progress now. The thing is that a lot of work was done on consensus networks to make them work with the keyword search task: originally they were developed by Lidia Mangu basically to do a last pass right before you gave your one-best output, and they were great for that, but there were things you could do to make them work a little bit better for keyword search. And then morphology: again, this is a community integration, with people who largely worked on text working with the speech community, and there are a lot of tradeoffs between whether you want to break words up into little pieces, which might be great if you're doing text and not so great if you're doing speech. A lot of the integration of the teams is beginning to bear fruit there as well.
0:46:20 It's quite interesting. And a big thing that I think is really important is getting by with less: ten hours of training, or less (I haven't seen results with less, but I certainly think it would be cool), and no pronunciation lexicon. Everybody promised to do decimation studies, but to a large extent the program targets, unfortunately, sometimes seem to drive the research toward the program targets as opposed to actually exploring the space of experiments. So there is a tradeoff between having annual evaluations and getting people to do research, but I really do hope that people will explore these conditions, because I think they're really important.
0:47:07 So I'm ending with a slide about the open KWS; that slide is partly why I'm here, and you can see the timescale: registration is going to close at the end of January, so if you're interested at all, please do consider it. The Vietnamese language pack will be available for those of you who have not participated before, and the open KWS people who have participated, as long as they participate again, can keep the data; so if you just keep participating, you can actually keep all the surprise languages, and hopefully NIST will open up some of those languages by evaluating on them too. So there's lots of data, and it's very useful; there are a lot of things you could do with that data to support basic speech recognition and other types of speech research. And hopefully by the end of the program this will all be released publicly to everybody, since we bought all the data.
0:48:11 But you can see the surprise language build pack is going to be sent out a week or so before the evaluation begins, and then we send a password, so there won't be any problem with the download; although the download is going to be a little bit harder this time, since we have the channel data, which is not downsampled in any way, because we figured that's an aspect of handling that data. Then people have the three weeks. We send out the evaluation pack ahead of time as well, since it's larger: it's seventy-five hours, and some of that is channel data. Then we'll send a password on April twenty-eighth, at which point people have a week to complete their submissions; you can submit many things. NIST will keep an eye on things to make sure that submissions are sound and there are no problems, and there is a point of contact and so on, so it should not be a very bad experience. The other thing is that there will be an open KWS meeting, where everybody would be expected to participate, so there is a bit of a burden there for people who might participate; but I think that the meeting last time was very valuable, and the Babel folks were really very generous in sharing their insights, so I think it's a great opportunity to hear about the work and to be able to ask questions and interact with the Babel participants. So I think it's a really good thing to have the open KWS.
0:49:42 And last but not least, this is the catch-up-with-the-slides stage: this is one of the things you have to do in the pitch for the program. I put a little asterisk there after "languages covered" because, obviously, it's nice to be able to say "all", but really there's the caveat that this has to be a language that has an orthographic transcription. And I have to say, even just having an orthographic transcription does not make it easy to create a language pack, and some languages are really much more normalized than others. As much as we have done a lot of work in terms of normalizing English, there are a lot of spelling variants that happen, and it's a lot harder to do in these other languages, where there really aren't well-studied conventions. So I'll star that caveat, because you really do have to have the capability of being able to clean up the language, even when it has a presence on the web.
0:50:47 And the timelines, which we talked about as well: we're moving down to ten to forty hours of transcription, working with variable recording conditions, and ultimately having them develop a system in a week. The immediate impact has been language data; we've shared language data and had open evals, and that impacts the community and also helps the government. New methods in speech search and speech systems are sort of the medium-term impact, and getting effective keyword search in new languages, delivered quickly, is the ultimate deliverable. So learning how to do that, learning how to solve the problem of "here's a new language, now build the system", is really the core principle of the program, and everything really needs to be projected in that direction.
0:51:36 And ultimately there are lots of other questions to ask, like: what if I only have a certain amount of time to transcribe? We didn't find that we could manage that very well programmatically, but people can certainly investigate it, where they consider the time to transcribe and to clean things up when selecting the data they will work with. The nice thing is that the eighty hours of audio are there regardless of how much transcribed data you use, so there's a lot of room to investigate a wide variety of ways of getting by with less, including getting by with no lexicon, or getting by without transcripts at all. Certainly there is work like that going on in the program; it may not reach the performance that the best systems do, but I would say it's all equally important and vital to the program, so having a wide variety of things going on, I think, is really important. I'm done, so I'll take your questions.