0:00:17 So, as I mentioned, the goal of the presentation is to tell you that there is life beyond DARPA. There is lots of interesting stuff out there; some of the things we do in companies like Google are very different, the scale is unique, and there are different challenges.
0:00:36 And I think it is really exciting to be at Interspeech, because suddenly, finally, speech is becoming useful. If you had told me four years ago that speech was something that would really be used by mainstream people, I wouldn't have believed it. In fact, before I came to Google I was orienting my career more toward data mining and speech mining, because I thought that was the only way speech could be useful; I really did not believe in interactivity, you know, humans talking to computers. So I am the first one surprised that it is working.
0:01:16 So, to tell you a little bit about the history of speech at Google, since people ask me about it all the time: the team started around 2005, and there were discussions internally about whether we would build our own systems or license the technology. The inclination at the time was more in favour of licensing, but some of us were pushing for building our own. We won, and I'm glad we pushed and convinced people.
0:02:00 On the other hand, Google has always had this culture where we build everything ourselves, even our own hardware and data centres, so it wasn't a strange decision. In 2006 we basically started to build what at that time was a state-of-the-art system, and we were very lucky because we got people with a lot of experience building training infrastructure and decoders. Language modeling was easy for us because we could leverage all the work from the translation team. The real challenge was to build this infrastructure on top of Google's distributed computing machinery; that was really different.
0:02:47 So in 2007, and this reflects the mindset of the team, we built a system called GOOG-411. It was really directory assistance: you would call a telephone number and then you could ask something like "tell me what is the closest gas station". Everything was going through the voice channel. Also in 2007 we started the voicemail transcription project, because we had a product, Google Voice, still available, where a telephone number is assigned to you and call forwarding works, so voicemail transcription was also relevant.
0:03:26 Around 2007 and 2008 there was a radical change with the appearance of smartphones and Android. But wait: if you hear me describing something in this talk, I am not claiming we did it first. In fact there is a rich history of smartphones before that, from Nokia, and probably Microsoft and Apple. In any case, for us it was an eye-opening experience, and we decided to basically drop any work that went through the telephone channel, switch directly to smartphones and the Android operating system, and send everything through that channel.
0:04:08 In 2009 we also had a project on YouTube transcription and alignment; actually, I was initially working in that area, because that was what I had been doing before. The idea was to facilitate video transcription, captioning and alignment, and that work is still around; people here have been doing a lot of work in that area, really interesting stuff.
0:04:30 In 2009 we basically went from voice search on Android telephones to dictation: in the keyboard you will sometimes see a tiny microphone that allows you to dictate. In 2010 we also enabled what we call the intent API, which basically allows developers to leverage our servers so they can build speech into their applications on Android. And in 2010 we started down the path of going beyond transcription into semantics, with voice actions: basically the next step, understanding what you want from the voice query.
0:05:04 In 2011 we went onto the desktop, bringing voice search into your laptop. Also in 2011 we started a speech project with the team in London, and we started our language expansion; that is something I have been doing for the last three years, and I will tell you a little bit about it, perhaps because it is somewhat relevant to the work on low-resource languages here. In 2012 we started activities in earnest on speaker and language identification, with the goal of building a state-of-the-art system, and I think we are pretty much there. And this year we basically published a web interface for speech, so that you can call our recognizers from any web page. Also in 2013, and I would really like to talk a little bit about this, comes the transformation of the role of speech at Google: from providing transcriptions to basically going into understanding, a conversational system.
0:06:16 People ask me all the time how the speech team is organised, so I also wanted to tell you, so you don't ask me anymore. What I want to say first is that we focus exclusively on speech recognition; we don't focus on semantic understanding, because there are other teams for that. We do a little bit of semantics when it matters, to help us generate language models, for data mining, things like that. The group is organised into several subgroups. There is the acoustic modeling team: they basically work on acoustic model algorithms, adaptation and robustness. I would say it is probably the most research-oriented group; they are really at the edge of new things, and all the work on DNNs happens in that group.
0:07:05 Then there is another group which we call "languages modeling", a curious name, because this team is the result of merging the language modeling team and the internationalisation team, so it is "languages", plural. We do a lot of work related to building lexicons and language modeling; we take care of keeping our acoustic models fresh (I will mention that a little bit later); we develop new languages; and we are in charge of improving the quality of our systems, bringing in improvement after improvement. The speaker and language identification activities are also in this group. It is really large now; it is headed by me, in New York. We must be thirty people, probably, by now.
0:07:58 Then there is the services group; they take care of all the infrastructure, the servers that are continuously running our products. They handle deployment and scalability; these are the more software-engineering, core activities in the team. Then we have a platform and decoder team, in charge of the decoders, which can run anywhere from the device to the distributed system; activities like word spotting and speaker identification also live there. Then we have a large data operations team; their goal is to run data collections, annotations and transcriptions. Once you start doing so many languages and so many problems, you need data, and it has to be annotated; and of course they need tools, so that you can manage all this data without too much hassle.
0:08:55 And of course we have the text-to-speech activities, the TTS guys. I think the makeup of the team is probably fifty percent software engineers and fifty percent speech scientists; lately we have been growing a lot on the software engineering side, I think really more than on the speech science side. And in addition to this core team, we have teams of linguists spread all over; we often bring up a linguistic team in a country, as we have done for example with a team in Ireland, and they have been helping us with speech recognition. But I like to say that everybody codes, even some of our linguists. I remember there was this joke that at Google even the lawyers are able to write code.
0:09:48 Anyway, let me tell you a little bit more about some of the technologies. At Google, basically everything we do in the speech team is big data, which kind of bothers me, because I think this community has been doing big data for a long time, so I don't know why it is suddenly fashionable and has a new name, but whatever. I will not tell you about DNNs, because you know more than me and we have a session on that topic. You know we have a distributed infrastructure, and I can tell you it takes a week to train on a few thousand hours of data; we are busy trying to make it faster.
0:10:33 But there is something fundamentally different about how we do acoustic modeling at Google, which is that we do not transcribe data. This may come as a surprise, but there are reasons for it. The main reason is that there is a lot of data to process. I have done the back-of-the-envelope computation: every day our servers process from five to ten years of audio, more or less. That sounds like a lot, but if you talk to people with a background in data centres processing telephone calls, it is not that much actually; they probably process more. Still, when you have so much data, it is like drinking from a fire hose: you have this speech fire hose pointed at you, and you have to figure out how to get something out of it, because it is a lot of data.
0:11:35 Just to give you an idea, we break down our languages into tiers. Tier one languages are the important ones, the ones that generate traffic; then there is tier two; and then there is tier three, like Icelandic, which still generates little traffic. A tier one language generates something like a thousand hours of traffic per day, although it depends on the language. But even the tier two ones, the ones that are a little less important or that we launched recently, like Vietnamese or Ukrainian, generate basically hundreds of hours per day.
0:12:10 So for us, and I was thinking about this yesterday, the endangered resource is probably not the data, unlike in a lot of the sponsored-program work; our endangered resource is probably the people to look at all this data coming in. So really what we try to do is automate the hell out of it as much as we can. I personally dislike having an engineer looking at one language for more than three months, because that is not scalable; I prefer that we come up with a solution and put it into an automatic infrastructure.
0:12:49 So, in our team we have been investing a lot in what we call our refresh pipeline; we have been working on it for maybe three or four years now. Let me explain what the refresh is. Basically, any query that comes into voice search is logged, with the audio of what was spoken. So we keep audio logs, lots of data, as I mentioned: from five to ten years of audio per day. And unlike in a typical acoustic model training setup, the one you learn at school, we do not transcribe it, because that is impossible; there is no way to transcribe all of that. Instead we follow the traditional approach of unsupervised training. What we do is basically take the transcription produced by the production system, and then we massage that data as much as we can, trying to decide when we can trust a transcription and when we cannot. We do a lot of data selection, and a lot of work goes into that, applying combinations of statistics and confidence scores; and then we train.
0:14:04 So that is the image I would like you to keep: the audio comes from the applications and goes into the production system; the transcription is provided to the user or the developer; we log it; and then it goes into our offline infrastructure, where we massage the data extensively and actually score it.
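As a minimal sketch of that selection step, assuming a hypothetical `LoggedUtterance` record and an arbitrary confidence threshold (none of this is Google's actual pipeline), the gating could look like this:

```python
from dataclasses import dataclass

@dataclass
class LoggedUtterance:
    audio_path: str    # pointer to the logged audio
    hypothesis: str    # transcription the production recognizer returned
    confidence: float  # recognizer confidence score in [0, 1]

def select_for_training(utterances, min_conf=0.8):
    """Keep only utterances whose production transcription we are
    willing to trust as a training label; everything else is dropped
    rather than sent to human transcribers."""
    return [u for u in utterances if u.confidence >= min_conf]
```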
0:14:24 This is what we call our data preparation. One of the things we have been looking at, which I think is interesting, is how you sample from these five or ten years of data that you get every day. What do you apply? Do you try random selection? Do you bin the data according to confidence scores and then try to select particular bins? Because once you organise the data by confidence, you might be tempted to use the data with the highest confidence; but then you could argue that you are not learning anything new, because if the recognizer is already correct there is not much to learn. So you go a little deeper into the confidence range and select utterances where the recognizer struggles, but not too much. It is not obvious what to do, and I think this is a very active area of work for us.
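A toy version of the binning idea might look like the following; it reuses the hypothetical `LoggedUtterance` sketch from above, and the bin quotas are invented for illustration. The talk's point is precisely that the right quotas are an open question.

```python
import random
from collections import defaultdict

def sample_by_confidence_bins(utterances, quotas, bin_width=0.1):
    """Bin utterances by recognizer confidence, then draw a fixed quota
    from each bin, e.g. taking more from mid-confidence bins (where the
    recognizer struggles a bit) than from the top bin (already easy)."""
    bins = defaultdict(list)
    for u in utterances:
        bins[min(int(u.confidence / bin_width), 9)].append(u)
    sample = []
    for b, quota in quotas.items():
        pool = bins.get(b, [])
        sample.extend(random.sample(pool, min(quota, len(pool))))
    return sample

# e.g. favour the 0.7-0.9 range over the 0.9-1.0 bin:
# sample_by_confidence_bins(data, quotas={7: 5000, 8: 5000, 9: 2000})
```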
0:15:26 Related to that, we do a lot of what we call distribution flattening, and we do it for two reasons. One reason is to increase triphone coverage. For example, I can tell you we discovered a problem where somebody was asking for the weather in a particular place and the recognizer was failing all the time; when we investigated, we found that a particular triphone occurred there but was poorly covered in training. So there are good reasons not to train your system only on the head of the distribution when it comes to triphones or words; hence the flattening. The other reason is that you have to be careful because a few queries can be enormously popular. In Korea, for example, if you just select the utterances with high confidence scores, ten or fifteen percent of your corpus is going to be composed of three queries (one of them is a game; honestly, I have forgotten the last one). So if you are not careful, you are building a three-word recognizer. You really need to flatten, and not trust the distribution completely.
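One simple way to implement this kind of flattening is to cap how many utterances any single transcription may contribute; a sketch under the same hypothetical data model as above, with an illustrative cap value:

```python
from collections import Counter

def flatten_by_transcription(utterances, cap=100):
    """Limit each distinct transcription to `cap` training utterances,
    so a handful of hugely popular queries cannot dominate the corpus
    and yield a recognizer that only knows three words."""
    counts = Counter()
    kept = []
    for u in utterances:
        if counts[u.hypothesis] < cap:
            counts[u.hypothesis] += 1
            kept.append(u)
    return kept
```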
0:16:43 But of course unsupervised training is dangerous, and you have probably heard stories about it; I know people at Apple have worked with this. This is what we call the "P problem"; let me tell you what it is. The story is that we launched the Korean system, we started to collect traffic, and we went into retraining with this unsupervised approach. Of course, when you have so much data you do not look at the logs; you just push it through the system. But at some point we did look at the logs, and we noticed that thirty percent of the traffic was a single token: "P". We were a little mystified by that, so we listened to the data, and we noticed that whenever there was wind, or a pop effect, or a car passing by, the recognizer was matching it with the token "P": our lexicon provides a phonetic sequence for it that plausibly matches this kind of noise. And it would do so with high confidence. So the token gets hypothesized, that data gets into training, the "P" token becomes stronger, and it starts to capture even more. That is the P problem, and you have to be vigilant about it. We have observed that every language seems to have its P token. Sometimes it is a single letter; sometimes it is a sequence of consonants, because those kind of match noise; other times it is something else. There are some examples here on the slide.
0:18:39so
0:18:42so you know we deal with this in many ways
0:18:46some of the phase we do it for example in terms of transcription flattening or
0:18:51triphone flattening
0:18:52those help a lot
0:18:54they start to filter out these people transcriptions
0:18:57but another simple thing we will use we have at this is said that contains
0:19:00a lot of noise
0:19:03cars passing by an air blowing to the microphone and we always i evaluate what
0:19:09we call our project set so
0:19:12if the rejects set us that's providing a lot of those transcriptions then
0:19:16it's really nice because you number one identify what that of the new P tokens
0:19:22and you can filter again slows
0:19:26we also model these noise explicitly we have noise models to capture this kind of
0:19:30problems and get it get rid of it in the transcription
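A crude sketch of how such a reject set can flag P-token candidates: decode noise-only audio and see which tokens keep appearing. The threshold and the whitespace tokenization are assumptions for illustration, not the actual filtering logic.

```python
from collections import Counter

def find_p_token_candidates(reject_set_hyps, min_share=0.05):
    """Given recognizer hypotheses for a noise-only 'reject set'
    (cars passing by, wind on the microphone), return tokens that
    account for a suspicious share of all hypothesized tokens,
    i.e. likely 'P tokens' to filter out of unsupervised training."""
    counts = Counter(tok for hyp in reject_set_hyps for tok in hyp.split())
    total = sum(counts.values()) or 1
    return [tok for tok, c in counts.most_common() if c / total >= min_share]
```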
0:19:34 And from time to time, because when you only do unsupervised transcription there is the danger that the system drifts into odd corner-case behaviour, it is not a bad idea to retranscribe corpora and train a new model from them, so you remove those corner cases and start again. As I said, this unsupervised training is a very active area for us, and there are a lot of very interesting problems to deal with.
0:20:02 So, given all of this, with some safeguards in place, we basically select something like four thousand hours of audio, and we retrain our acoustic models. We started doing this every six months; now I think we have a monthly cycle, and I am hoping we get to a two-week cycle, so that every two weeks our acoustic models are retrained. There are two reasons for that. One is that we need to track the changing fleet of Android devices out there: there are many telephones with different hardware and different microphone configurations, and every week there is a new model, so it is important to track those changes. You also want to track user behavior. The usage of our system changes: initially it was short queries; now queries are longer and more conversational, and the acoustics change a little, so we want to track that too. And this continuous acoustic model training basically allows us not only to track the changes but to improve performance: in each iteration there are more things, bigger acoustic models, so it is not just tracking; it really does help.
0:21:24 I also have to say that in the refresh pipeline, although I have been talking mostly about acoustic models, we use similar ideas for language modeling and for pronunciation learning. The other thing we do is work closely with the acoustic modeling team, so that whenever there are new best practices, new recipes (as I said, recently they like to prototype on Icelandic, which is small and well matched), whenever they discover something that works really nicely there, we bring it into our pipeline. We basically track the work of the acoustic modeling team, and when something works well we try to encode it into this massive workflow that, as I said, retrains everything every two weeks and evaluates it on the test sets; I will tell you a little bit more later about our metrics. The other thing the pipeline does, and this is really neat, is create a change list, basically telling you: okay, this is everything I changed, and here is the evaluation. If you like the model, the only thing you have to say is: yes, I like this model, push it. So we still have a human saying yes or no; we could train a neural network to do that for us, I guess.
0:22:49 Another thing we do now, because we have been thinking about how to improve this even further, is what we call the David and Goliath approach. Let me show you the idea. You have David, which is the production system: very good looking and fast, maybe a little bit dumb, but good looking and fast. And then you have this really smart Goliath, which is huge and can take all the compute it wants. The idea is that we can reprocess a lot of our logs with rich acoustic models, with huge language models, with deeper DNNs: things you cannot put into production because the production system has to run in real time. So instead of taking just the transcriptions from the production system, we reprocess the audio, and when we do that we immediately see reductions in the transcription error rate of ten percent. Then you can apply your unsupervised tricks: select data, retrain the acoustic models. Actually, one of the things we are starting to do is keep, alongside the production acoustic models, really rich acoustic models whose only purpose is to reprocess data; they might be far slower, and you can apply multiple passes, as we do in this process right now. The aim is that, through all these tricks, we reduce the error rate on the transcriptions we use for training. And again, the goal is really not to transcribe manually if we can avoid it.
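In code, the David-and-Goliath trick could be as simple as a second, offline decoding pass that overrides the production label whenever the big system is more confident. Here `goliath_decode` is a stand-in for whatever heavyweight offline recognizer is available, and the sketch reuses the hypothetical `LoggedUtterance` fields from earlier:

```python
def goliath_relabel(utterances, goliath_decode):
    """Second pass over logged audio with a big, slow 'Goliath' system
    (richer acoustic model, huge language model, multiple passes).
    Keep its hypothesis whenever it beats the real-time 'David'
    production transcription on confidence."""
    for u in utterances:
        hyp, conf = goliath_decode(u.audio_path)  # offline, not real time
        if conf > u.confidence:
            u.hypothesis, u.confidence = hyp, conf
    return utterances
```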
0:24:25 And as I said, similar ideas are used in language modeling and in pronunciation modeling, where we try to learn from the audio.
0:24:35 Let me tell you a little bit about the metrics, because, surprisingly, word error rate is not the only thing we care about. Voice search basically exhibits behavior similar to search: a heavy head and a really long tail. If the only thing you do is measure the word error rate on a test set you transcribed a month or two ago, there are several problems. One is that most likely you are measuring the head of the distribution, tokens like "facebook", things like that; after a while you do very well on the common tokens, but you really care about the tail, those tokens that show up two or three times a day, and most test sets do not cover them. It is also not practical to transcribe every single day; it would be optimal, lovely, but not possible. And the queries keep changing, evolving all the time, so a test set that was representative one month ago might not be anymore. And even with the best speech transcribers there is a long turnaround between sending the data out and getting it back, so you cannot use those transcriptions for a month.
0:25:50 So we use alternative metrics. I mean, we still use word error rate, but we also use two alternatives. One is what we call side-by-side testing. The idea is that you just want to measure the difference between two systems: the production system and a candidate system. The candidate could have a new acoustic model, a new language model, a new pronunciation lexicon, or a combination of the three, whatever it is. We basically select a thousand or three thousand utterances from yesterday; we have the transcriptions that the production system gave us, and we reprocess the audio with the new candidate system. Then we look at the differences; if the hypotheses are the same, we do not care. On the differences we run a side-by-side evaluation, using an infrastructure the search side of Google has had for many years. Think of it as a small Mechanical Turk, with raters distributed all over the world who are pretty familiar with the task; the only thing we ask them is to listen to the audio and tell us which transcription more closely matches it. This is very fast, so you can get results in a couple of hours, sometimes less.
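The statistics behind such a side-by-side test can be as simple as a sign test over the rater verdicts on the utterances where the two systems disagree; a textbook sketch, not necessarily the analysis actually used:

```python
from math import comb

def sign_test_p_value(prefers_production, prefers_candidate):
    """Two-sided sign test over rater preferences on the diffs
    (utterances where production and candidate disagree; ties are
    discarded). A small p-value suggests a real quality difference."""
    n = prefers_production + prefers_candidate
    k = min(prefers_production, prefers_candidate)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Raters prefer the candidate on 120 diffs and production on 80:
# sign_test_p_value(80, 120) -> ~0.006, a significant candidate win.
```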
0:27:08 The other thing we do, which I think is even more interesting, is what we call live experiments. The idea is that you have a new candidate system that you feel is pretty good; we simply deploy it into production and let it take a small fraction of the traffic, one percent, ten percent. And we track not speech metrics but indirect metrics, like the click-through rate: whether users click on the results more or not; whether users correct by hand the transcriptions we provide; whether the user stays with the application or goes away. Of course there is a lot of statistical processing to understand when the results are significant and when they are not. But this is really useful, because it allows us to evaluate systems quickly before we increase the traffic beyond one or two percent. And the best thing is that the user does not even know that he has been subjected to an experiment.
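For the "is it significant" part, a basic check on click-through rates is a two-proportion z-test between the control slice and the experiment slice; again a textbook sketch, not Google's actual experiment analysis:

```python
from math import erf, sqrt

def ctr_z_test(clicks_ctl, n_ctl, clicks_exp, n_exp):
    """Two-proportion z-test comparing click-through rate on the
    production (control) traffic slice against the candidate
    (experiment) slice. Returns (z, two-sided p-value)."""
    p1, p2 = clicks_ctl / n_ctl, clicks_exp / n_exp
    pooled = (clicks_ctl + clicks_exp) / (n_ctl + n_exp)
    se = sqrt(pooled * (1 - pooled) * (1 / n_ctl + 1 / n_exp))
    z = (p2 - p1) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value
```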
0:28:13 So those are the kinds of metrics we use. This chart relates to how the system is doing: our traffic has been growing a lot, basically over the last three years. It seems to double every six months or so, and we do not see the trend slowing down. Part of the reason is not just that speech is becoming more used; it is that we have been adding more languages. The role of English in our applications has been diminishing: it used to be, of course, the majority of the traffic, and now I think it is a little less than fifty percent; the top ten non-English languages generate most of the remaining traffic, and the other thirty-seven generate less. What we have seen in the past is that once the quality of a language improves, its traffic begins to show up. Another very interesting trend is the percentage of queries for which there is a semantic interpretation, where instead of just a transcription we provide a parse in the output; it is increasing, and I think for English it is around twenty percent of the time that we act on the query: we parse it, we understand what it says. So here are some graphs.
0:29:32 Here, for example, you can track the improvement in French, where we keep adding things: better pronunciations, better acoustic models, a new language model, a language model trained with cleaner data (we do a lot of massaging of our queries before we build language models, but I do not have time to tell you about that), a larger language model, and so on and so forth. It is a continuous process.
0:30:02 Let me talk a little bit about the languages effort; I thought it might be relevant to this audience, and I have also been working on it for a while. In 2011 we decided to focus on new languages. It was a necessity, because Android was becoming global and we needed to bring voice search to everybody, as far as we could. So we went through the initial analysis; like many of you would, we went to Ethnologue, we got those awesome charts of all the languages, organised by families and so on, and then we looked at the statistics like everybody else: there are roughly seven thousand languages, grouped into families, and so forth. Only about six percent of the languages are spoken by more than a million people, and around six percent are spoken by fewer than a hundred people, so probably we will not bother with those. You conclude that with roughly three hundred languages you basically cover ninety-nine percent of the population. So internally we keep saying that our goal is to build voice search in the top three hundred languages. It is a good selling point, a good sound bite. In reality, probably after we reach a certain number we will have to rethink what to do with the next ones, because for many of them there is no web; there is nothing to search. But you could argue that it is a feedback loop, right? Once you have speech recognition technology for a language, maybe you facilitate the creation of content, so we need to break this loop somehow.
0:31:44 What we are very proud of is our approach to rapid prototyping of languages. Rather than an algorithmic approach, which would have been a PhD topic in itself, we decided to take a more process-oriented approach: focus on process. We basically focused on solving the main problem, which, as you have probably heard all week, is how the hell you get data. So we spent a lot of time developing tools to collect data very quickly and very efficiently. On one hand, we built software tools that run on telephones and allow a team we send out to collect two hundred hours of data in a week. We also built a lot of web-based tools for annotation. The result is that in three years we collected around sixty languages, and at any time we have teams out collecting; right now we have teams in the field and more planned, we are starting on Indian languages, and Farsi we are going to collect in L.A.
0:33:00 This is how our data collection application looks; it is called DataHound, and there is actually an open-source version of this kind of tool that a colleague put together; I think you should talk to him, because you can do this too. And this is how our web-based annotation tool looks. It is the tool we use with our vendors and with our own linguistic teams distributed worldwide so they can give us annotations: this, for example, is a phonetic transcription task; or they can do a pronunciation selection task, or a transcription task for a test set, things like that.
0:33:43 Just to talk a bit more about rapid prototyping: lexicons, I think, are an area where we are still not as fast as I would like. A lot of our lexicons are rule-based, which is good, because now that we have trained linguists they can probably put together a lexicon for a regular language, like Spanish or Swahili, in a day, most likely. For languages that are more difficult, we rely on a combination: we collect a seed lexicon with our tools and then we train a grapheme-to-phoneme model. Language modeling has never been a problem for us, because we just mine web pages, and we have the advantage of the query logs, what people search for in that particular language, which is very useful. Of course, every language has its nuances, whether segmentation, word-boundary modeling, or inflection, and we end up building a tool for each phenomenon, like the inflection modeling we have been working on for Russian. The good thing is that once you build the tools, you can deploy them for any language, so I am hoping that at some point we run out of weird linguistic phenomena. There is lots of data massaging and normalisation in place; one problem you always have is mixed languages in the queries, so you have to classify them, or something like that, but the process is pretty automatic. And acoustic modeling is the most automatic part of all: once you have the data ready, you push a button, and typically a day later you have a neural network trained.
0:35:17we basically have a date operations the in domain this data collection some we a
0:35:20lot of preparation and then we made in these we call it works on languages
0:35:27we meet for a week in the room
0:35:28and we typically have a success rate of fifty or seventy percent
0:35:33meaning that in a we get a system that is but lacks and very
0:35:36reluctantly forest means and the right of
0:35:39around ten percent
0:35:41and some languages are quite a little bit more work and you know six months
0:35:44later we go back to the
0:35:46so we have been lots in languages at an average of
0:35:49four five last year we more we like the thing
0:35:52very so this is their language coverage we have
0:35:56a
0:35:57so you can i mean he's forty eight in production
0:36:03somebody asked me that they why we have basque coliseum
0:36:06and gotten and spanish
0:36:08you can figure it out
0:36:13we have all these languages in preparation time i might maybe in the i'm hiding
0:36:17the innocence was clearly
0:36:19and that's a set our teams are collecting more data so we will keep going
0:36:24 An interesting thing is that as we have gone into more and more languages, we have even done dead languages (I think we still have Latin, although we ran into trouble there too), and we have built imaginary languages. So this is my challenge for the audience: see if you can tell me what language this is. Let me see... [tries to play an audio demo] Somebody must be downloading a movie... I can try again... if not, all right. Okay, we will try again sometime today.
0:37:41 I wanted to briefly mention APIs; I am running out of time. Basically, all these languages are available through two APIs. One is the Android API; here is a pointer, just search for "speech android". And there is also a web API, a very simple one: you send us the audio and we give you the transcripts, and we are thinking about enriching it a little more. A lot of developers have been building applications on top of these APIs, and for us they are very important, for two reasons. First, when we launch a new language, they really provide us with more data, and at the beginning more data is gold. Second, they expose users and developers to the idea that, hey, I can build applications with speech recognition, and this is good for us. The developers are sometimes faster at doing things that turn out to be useful. For example, when we started working on Google Now, which is a kind of semantic assistant system of ours, we did not have data; but because we have this API, and developers had been building Siri-like applications on Android for years, we could leverage that data, and it was really good for semantic annotations.
0:38:58 And just to finish: I think we are now in the middle of a big transition in speech recognition, at least within Google, from transcription to a more conversational interface. There are all these new features (and this is not speech; it is being done by other teams), things like coreference resolution, so that the interaction becomes more conversational; pronoun resolution; query refinement by voice; and more to come. They make the applications much more interesting, and I really think the company is in the middle of this transformation where Google goes from a white box where you type to more of an assistant where you talk, where you engage in a conversation. I like to think of Google as trying to become like a butler you can talk to. And that changes everything; it is the long-term vision of speaking to the computer, like HAL, though I hope with a somewhat better personality. That is a little bit of where we are trying to go: a pervasive role for speech, not only on your Android telephone but on your desktop, in your car, in your appliances at home; an assistant that makes access to information (which is what Google is about) easier and less intimidating for many users. The aim is to have, and this is related to speech technologies, not just the microphone here, but microphones that are always on, always listening. We have taken baby steps with this thing called "OK Google", a hotword, so you can talk to your device at home, talk to your refrigerator, whatever it is, and have a conversation, one informed not just by what you say but by your data, with really high-quality speech recognition that we try to make better all the time, and so on and so forth: real conversation. So that is what I wanted to tell you.
0:41:16 [Question time; we are running a little late.]
0:41:31 [Question, largely inaudible, about data volume versus annotation quality.]
0:41:54 As far as that is concerned, I think we chose to collect as much data as we could; it was really a philosophical choice, rather than spending more money on more careful annotations, especially for translation, where we did not do that. It is not always the volume... [partially inaudible]
0:42:35 [Audience member] First, a comment: many students at our university use Google transcriptions as part of their projects, and it actually works very nicely; it has been a great resource to work with. But I have one question: why can you not recognise my name?
0:43:01 [Speaker] The thing is, we have in our engineering culture this thing we call fixits, where we identify a problem, we get together in a room, and we do not stop until we resolve it. Name recognition was identified as one in July (I actually was not part of that effort), we came up with a solution, and it has been deployed, like, today, actually, to production. Well, check tomorrow.
0:43:31 [Audience member] But there is a serious question there: names and certain words really do not seem to show up in open speech systems, so how do you deal with that?
[Speaker] So, name recognition is difficult to excel at, because the space is pretty much infinite. We do a variety of things, such as dynamic language models based on your data. I mean, doing name recognition on your own names, the names of the people you talk to, that you can do; the hard part is a generic system that somehow has to believe any name is possible. That is ultimately the problem: we typically operate with a million words in our vocabulary, and we are going to two million soon, but the space of names is still way bigger. The only way to handle this kind of problem is with more personalisation, so that we know about you and can adapt to you.
Thank you.