0:00:17 | So, as I mentioned, the goal of the presentation is to tell you that there |
---|
0:00:20 | is life beyond DARPA, |
---|
0:00:22 | that there is lots of interesting stuff. |
---|
0:00:25 | Some of the things we do in companies like Google, this is very |
---|
0:00:30 | different, the scale is |
---|
0:00:32 | unique, |
---|
0:00:33 | and there are different challenges. |
---|
0:00:36 | And I think it is really exciting to be at Interspeech, |
---|
0:00:40 | because suddenly, finally, speech is becoming useful. If you had told me |
---|
0:00:47 | four years ago that speech was something that would |
---|
0:00:51 | really be used by mainstream people, I wouldn't have believed it. I mean, in |
---|
0:00:56 | fact, before I came to Google I was orienting my career more into |
---|
0:00:59 | the area of data mining and speech mining, because I thought that was the only |
---|
0:01:03 | way |
---|
0:01:04 | speech could be useful. I really did not believe in interactivity, you know, humans |
---|
0:01:09 | talking to computers. |
---|
0:01:11 | So I am very surprised that it is working. |
---|
0:01:16 | So, |
---|
0:01:18 | to tell you a little bit |
---|
0:01:21 | about the history of speech at Google, which I think is kind of interesting: |
---|
0:01:26 | people ask me about it all the time, |
---|
0:01:28 | the multimodal folks especially. |
---|
0:01:29 | So, |
---|
0:01:31 | the team started around two thousand five, |
---|
0:01:34 | and I think we |
---|
0:01:37 | had discussions internally about |
---|
0:01:40 | whether we would build our own systems or license the technology, and |
---|
0:01:46 | the decision was not a plain one. The sentiment at the time was more |
---|
0:01:50 | in favor of licensing, but part of the team was pushing for "no, |
---|
0:01:55 | we need to build our own stuff." So we won, and I'm glad we |
---|
0:01:58 | pushed and convinced people. |
---|
0:02:00 | But on the other hand, Google has this culture where we |
---|
0:02:04 | do everything, we build everything, even our own hardware |
---|
0:02:07 | and data centers, so it wasn't a strange decision. In two thousand six we |
---|
0:02:11 | basically started to build what at that time was a state-of-the-art system, |
---|
0:02:15 | and we were very lucky, because we got people |
---|
0:02:19 | who had a lot of experience building training infrastructure and all that, so you could |
---|
0:02:24 | say we got a lot of the pieces in place, |
---|
0:02:27 | building the decoders. |
---|
0:02:29 | Language modeling was easy for us because we could leverage all the work from the |
---|
0:02:33 | translation team, |
---|
0:02:36 | and really the challenge for us was to build this infrastructure |
---|
0:02:40 | on top of Google's distributed computing machinery, so that was really |
---|
0:02:45 | different. |
---|
0:02:47 | So in two thousand seven, |
---|
0:02:50 | and this sort of reflects the mindset of the team, we built a system |
---|
0:02:53 | that was called GOOG-411, and it was really directory assistance: you would |
---|
0:02:57 | call a telephone number and then you could ask, like, |
---|
0:03:00 | hey, |
---|
0:03:02 | tell me what is the closest gas station. |
---|
0:03:04 | So everything was going through the voice channel. |
---|
0:03:08 | Similarly, in two thousand seven we also started the voicemail transcription project, because we |
---|
0:03:13 | had a product, still available, called Google Voice; it's like a |
---|
0:03:17 | telephone number assigned to you, and that's still working. |
---|
0:03:22 | So voicemail transcription was also relevant. |
---|
0:03:26 | Around two thousand seven, two thousand eight, |
---|
0:03:28 | things changed radically with |
---|
0:03:32 | the appearance of |
---|
0:03:34 | smart telephones, with Android. I mean, |
---|
0:03:36 | but wait, if you hear me describing something in this talk, it |
---|
0:03:41 | doesn't really mean, I'm not meaning to say, that we did it first, |
---|
0:03:45 | because, in fact, you know, there is a rich history |
---|
0:03:48 | of smart telephones before: Nokia, probably Microsoft, and Apple. But in any case, for |
---|
0:03:53 | us |
---|
0:03:55 | that was like an eye-opening experience, and we decided to basically drop |
---|
0:03:59 | any kind of work that was going through the telephone channel and |
---|
0:04:02 | switch directly to smartphones and the Android operating system, and send everything through that channel. |
---|
0:04:08 | In two thousand nine we also had |
---|
0:04:10 | a project on YouTube transcription and alignment, and actually initially I was working in that |
---|
0:04:14 | area, because that was what I had been doing before. |
---|
0:04:18 | The idea was to facilitate video transcription, also captioning and alignment, |
---|
0:04:22 | and that is still around; |
---|
0:04:25 | people here have been doing a lot of work in that area, really interesting |
---|
0:04:29 | stuff. |
---|
0:04:30 | And in two thousand nine we basically went from voice search on Android telephones to |
---|
0:04:34 | dictation: |
---|
0:04:36 | in the keyboard you will see sometimes a tiny microphone that allows you to dictate. |
---|
0:04:41 | Around two thousand ten we also enabled what we call the intent API; basically |
---|
0:04:45 | it allows developers to leverage our servers so they can build speech into their |
---|
0:04:51 | applications on Android. |
---|
0:04:53 | And in two thousand ten we started on this path of going beyond transcription |
---|
0:04:57 | into semantics, basically understanding what you want from voice commands. |
---|
0:05:04 | In two thousand eleven we went into the desktop, basically bringing voice search into your |
---|
0:05:08 | laptop. |
---|
0:05:10 | In two thousand eleven we also started this speech project in London; that was when |
---|
0:05:15 | the team in London joined us. |
---|
0:05:17 | And in two thousand eleven we started our language expansion; that's something I have |
---|
0:05:22 | been doing for the last three years, and I'll tell you a little bit about that, |
---|
0:05:27 | perhaps because it's a little bit relevant to the work |
---|
0:05:30 | on low-resource languages here. |
---|
0:05:32 | In two thousand twelve we started, in earnest, activities in speaker and language |
---|
0:05:36 | identification, |
---|
0:05:38 | with the goal of building state-of-the-art systems, and I think we're pretty much |
---|
0:05:42 | there. |
---|
0:05:43 | And this year we basically published a web interface for speech, so that, |
---|
0:05:48 | basically, |
---|
0:05:51 | you can call our recognizers from any web page. |
---|
0:05:54 | And in two thousand thirteen, also this year, we have |
---|
0:05:57 | gone into conversational search, and I would really like to talk a little bit about that, |
---|
0:06:01 | this transformation of Google, of |
---|
0:06:05 | the transcription, |
---|
0:06:06 | no, well, |
---|
0:06:07 | the role of speech at Google, from providing transcriptions to basically going into |
---|
0:06:12 | understanding and dialogue systems. |
---|
0:06:16 | People ask me all the time how the speech team is organized, so I |
---|
0:06:19 | also wanted to tell you, so you don't ask me anymore. |
---|
0:06:22 | So, |
---|
0:06:24 | what I want to say is that we focus exclusively on speech recognition; we |
---|
0:06:28 | don't focus on semantic understanding or on the UI; there are other teams for that. |
---|
0:06:34 | I mean, we do a little bit of semantics, just what matters to help us, |
---|
0:06:37 | for example to generate |
---|
0:06:39 | language models, for data mining, things like that. |
---|
0:06:42 | So, the group is organized, it has a head, and is organized in several |
---|
0:06:47 | subgroups. There is the acoustic modeling team; |
---|
0:06:50 | this has its own lead, and, you know, they basically work on |
---|
0:06:53 | acoustic model algorithms, adaptation, |
---|
0:06:56 | and robustness. |
---|
0:06:57 | I would say they are probably the most research oriented; they are really at the edge |
---|
0:07:02 | of new things; all the work on DNNs is done in that group. |
---|
0:07:05 | Then there's another group, which |
---|
0:07:07 | we call the "languages modeling" team, |
---|
0:07:10 | a kind of recursive, strange name, because this team is the result of |
---|
0:07:15 | merging the language modeling team and the internationalization team, so we |
---|
0:07:19 | ended up with, maybe, |
---|
0:07:21 | well, it's languages, "languages modeling". |
---|
0:07:23 | And we do a lot of work related to building lexicons and language modeling; we |
---|
0:07:29 | take care of |
---|
0:07:31 | keeping our acoustic models fresh; I'll mention that a little bit. |
---|
0:07:35 | We develop new languages, and we are in charge of improving the quality of our |
---|
0:07:40 | systems, bringing the quality up, improving things. We also have the |
---|
0:07:45 | speaker and language ID activities in this group, and this group is really large |
---|
0:07:49 | now; it's headed by me in New York and a colleague in Mountain View. |
---|
0:07:53 | We must be, |
---|
0:07:55 | what, |
---|
0:07:56 | thirty people, probably, now. |
---|
0:07:58 | Then there is the services group; they take care of all the infrastructure, you know, |
---|
0:08:02 | the servers that are continuously running our products. |
---|
0:08:07 | They take care of deployment, scalability; |
---|
0:08:10 | these are the more software-engineering activities, core activities, in the team. And then |
---|
0:08:14 | we have a platform and decoder team; they are in charge of our decoders, |
---|
0:08:18 | our engines, |
---|
0:08:20 | decoders that can run from the device to the distributed systems. |
---|
0:08:25 | There are activities there, too, on word spotting and speaker ID. |
---|
0:08:29 | Then |
---|
0:08:31 | we have a large data operations and tooling team. |
---|
0:08:35 | Well, |
---|
0:08:35 | this is the team that handles the data; |
---|
0:08:39 | really their goal is to do the data collections, the annotations, the transcriptions. |
---|
0:08:43 | Once you start doing so many languages and so many problems, you need |
---|
0:08:46 | data to be annotated, |
---|
0:08:49 | and, on their end, of course, they need tools, so that it is not too much |
---|
0:08:53 | hassle to handle all this data. |
---|
0:08:55 | And of course we have the TTS activities; |
---|
0:08:57 | you have probably all seen the demo at the booth, or heard it, and you have |
---|
0:09:01 | probably run into the TTS guys. |
---|
0:09:05 | I think the makeup of the team is |
---|
0:09:08 | probably fifty percent software engineers, |
---|
0:09:10 | fifty percent |
---|
0:09:12 | speech scientists. |
---|
0:09:14 | Lately we have been growing a lot on the software-engineering side; I |
---|
0:09:17 | think we should really grow more now on the speech-science side. |
---|
0:09:21 | And in addition to this core team, we have distributed teams of linguists; |
---|
0:09:24 | they are spread all over. I mean, what we do, we often bring |
---|
0:09:28 | up a linguistic team in a country, like we have, for example, with the team |
---|
0:09:32 | in Ireland; |
---|
0:09:33 | they have been helping us with speech recognition and TTS. |
---|
0:09:37 | But I like to say that everybody codes, |
---|
0:09:39 | even some of our linguists. |
---|
0:09:41 | When I joined Google, I remember there was this joke that even the lawyers |
---|
0:09:45 | were able to write code. |
---|
0:09:48 | Anyway, so let me tell you a little bit more about some of the |
---|
0:09:52 | technologies. |
---|
0:09:56 | At Google, basically everything we do in the speech team is big data, |
---|
0:10:01 | which kind of bothers me, because I think this community has been doing |
---|
0:10:04 | big data for a long time, so I don't know why it is now fashionable and has a |
---|
0:10:07 | new name, but whatever. |
---|
0:10:10 | So I will not tell you about |
---|
0:10:13 | DNNs, because, you know, |
---|
0:10:16 | you know more than me, and we have a session on that topic. |
---|
0:10:22 | You know, we have a distributed infrastructure, |
---|
0:10:25 | and I can tell you it takes a week to train on a few thousand hours of |
---|
0:10:29 | data; things like this help us make it faster. |
---|
0:10:33 | But there is something that is fundamentally different when it comes to acoustic |
---|
0:10:37 | modeling at Google, |
---|
0:10:40 | which is that we do not transcribe data, |
---|
0:10:42 | and this may come as a surprise, but there are reasons for that. |
---|
0:10:48 | The main reason is that there is a lot of data to process. I |
---|
0:10:51 | think, |
---|
0:10:53 | I have done the computations, back of the envelope: I think every day our |
---|
0:10:57 | servers process from five to ten years of audio, more or less. That sounds like |
---|
0:11:02 | a lot, but if you talk to people like Marcel, who has a background in |
---|
0:11:08 | call-center data, |
---|
0:11:10 | you know, where they process telephone calls and all that, |
---|
0:11:13 | it's not that much, actually; they probably process more. |
---|
0:11:17 | But when you have so much data, |
---|
0:11:21 | it's like drinking from the fire hose. We have this huge flow; |
---|
0:11:26 | you have the speech fire hose pointed at you, and you have to |
---|
0:11:29 | figure out how to get something out of it, right, because it is a lot of |
---|
0:11:33 | data. |
---|
0:11:35 | So, |
---|
0:11:36 | and this is just to give you an idea: |
---|
0:11:39 | we break down our languages into tiers. Tier one languages are the important ones, the |
---|
0:11:43 | ones that get the most traffic, and then there is the tier two, |
---|
0:11:46 | and then there is the tier three, like Icelandic; I'll get to that a little later. |
---|
0:11:51 | But even, |
---|
0:11:52 | you know, a tier one language easily generates more like a thousand hours |
---|
0:11:56 | per day of traffic; |
---|
0:11:57 | of course it depends on the language. |
---|
0:11:59 | But even the tier two languages, |
---|
0:12:02 | the ones that are a little bit less important or that we launched recently, like Vietnamese |
---|
0:12:05 | or Ukrainian, basically get around a hundred hours per day. |
---|
0:12:10 | So, |
---|
0:12:11 | so for us, and I was thinking about that yesterday, |
---|
0:12:15 | our limited resource probably is not the data, unlike a lot of the |
---|
0:12:19 | work on under-resourced languages; our limited resource probably is the people to look at |
---|
0:12:23 | all this data coming in. |
---|
0:12:25 | So, |
---|
0:12:26 | so really what we try to do is, we try to automate the hell out |
---|
0:12:31 | of it as much as we can. I personally dislike |
---|
0:12:35 | having an engineer looking at a language for more than three months, because it's |
---|
0:12:39 | not scalable; |
---|
0:12:41 | I prefer that we come up with a solution and we put it into our automatic |
---|
0:12:45 | infrastructure. |
---|
0:12:49 | So, |
---|
0:12:51 | in our team we have been investing a lot in what we call our refresh |
---|
0:12:55 | pipeline. |
---|
0:12:56 | We have been |
---|
0:12:59 | at it for maybe around four years now. |
---|
0:13:03 | And what does this thing refresh? |
---|
0:13:05 | So, |
---|
0:13:07 | basically, any query that comes into voice search is logged, like what was |
---|
0:13:12 | spoken. |
---|
0:13:13 | So we keep, you know, audio logs, |
---|
0:13:15 | lots of data; as I mentioned, from five to ten years of audio |
---|
0:13:18 | per day. |
---|
0:13:20 | And, you know, well, |
---|
0:13:23 | unlike in a typical acoustic modeling training setup, you know, the one you learn |
---|
0:13:26 | in school, |
---|
0:13:28 | we don't transcribe it, because it's impossible; there's no way to transcribe all that |
---|
0:13:33 | and then follow the traditional approach. So what we do is unsupervised training. |
---|
0:13:37 | What we do is, we |
---|
0:13:39 | basically look at the transcription produced by the production system, |
---|
0:13:43 | and then we massage that data as much as we can, trying to decide |
---|
0:13:46 | when we can trust a transcription and when we cannot. |
---|
0:13:50 | We do a lot of data selection; |
---|
0:13:53 | a lot of work goes into that, |
---|
0:13:56 | applying combinations of statistics and |
---|
0:13:58 | confidence scores, |
---|
0:14:00 | and then we train. |
---|
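The trust decision described here can be sketched roughly as follows. This is an illustrative outline only; the field names and thresholds are invented for the example and are not the actual production logic.

```python
# Minimal sketch of confidence-gated selection of machine-transcribed
# utterances for unsupervised training. Fields and thresholds are
# hypothetical, for illustration only.
from dataclasses import dataclass

@dataclass
class Utterance:
    audio_id: str
    hypothesis: str    # transcription produced by the production recognizer
    confidence: float  # recognizer confidence in [0.0, 1.0]

def select_for_training(utterances, min_conf=0.6, max_conf=0.95, min_words=1):
    """Trust the machine transcription only inside a confidence band:
    below min_conf it is too likely wrong; above max_conf the model
    would learn little new. Empty hypotheses are dropped as well."""
    return [
        u for u in utterances
        if min_conf <= u.confidence <= max_conf
        and len(u.hypothesis.split()) >= min_words
    ]
```

In practice the band would be combined with the other statistics the talk mentions; the point is only that both tails of the confidence distribution get discarded.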
0:14:04 | So, |
---|
0:14:05 | that's the image I like to use: the audio comes from applications, goes into |
---|
0:14:09 | the production system, the transcription is provided to the user or the developer, we |
---|
0:14:13 | log it, |
---|
0:14:14 | and then it goes into our infrastructure, where we massage the data extensively |
---|
0:14:19 | and score it all. |
---|
0:14:24 | So this is what we call |
---|
0:14:26 | our data refresh operation. |
---|
0:14:30 | And one of the things we have been looking at, |
---|
0:14:32 | which I think is interesting, is: |
---|
0:14:33 | how do you sample from these |
---|
0:14:36 | ten years of data that you get every day? What do you apply? |
---|
0:14:41 | Do you try random selection? |
---|
0:14:42 | Do you bin the data according to confidence scores and then try to select from |
---|
0:14:49 | a particular bin? Right, because once you organize the data according to confidence, |
---|
0:14:55 | you might be tempted to use the data that has the |
---|
0:14:58 | highest confidence, |
---|
0:15:00 | but then you could argue that you're not learning anything new, right, because if the |
---|
0:15:03 | recognizer is correct, there is not much to learn. |
---|
0:15:06 | So you go a little bit deeper into the confidence range and select |
---|
0:15:10 | phrases, utterances, where the recognizer tries but is not too sure. |
---|
0:15:14 | It's not obvious what to do, |
---|
0:15:16 | and I think this is a very active area of work for us. |
---|
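The binning-and-sampling idea could look something like this as a sketch. The bin edges and per-bin weights are made up for the example; as the talk notes, the right policy is an open question.

```python
# Illustrative sketch: bin logged utterances by confidence, then sample
# more heavily from the middle bins, where the recognizer is neither
# trivially right nor hopelessly wrong. Edges and weights are invented.
import random

def bin_by_confidence(utterances, edges=(0.0, 0.5, 0.7, 0.9, 1.0)):
    """utterances: list of (audio_id, confidence). Returns one list per bin."""
    bins = [[] for _ in range(len(edges) - 1)]
    for uid, conf in utterances:
        for i in range(len(edges) - 1):
            if edges[i] <= conf <= edges[i + 1]:
                bins[i].append((uid, conf))
                break
    return bins

def sample_bins(bins, weights=(0.0, 0.5, 0.4, 0.1), total=100, rng=None):
    """Draw up to total * weight utterances from each bin; the lowest
    bin gets weight zero, so near-certain junk is never selected."""
    rng = rng or random.Random(0)
    picked = []
    for b, w in zip(bins, weights):
        k = min(len(b), int(total * w))
        picked.extend(rng.sample(b, k))
    return picked
```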
0:15:23 | There are other tricks we use. For example, we do a lot of what we call |
---|
0:15:26 | distribution flattening, |
---|
0:15:28 | and, |
---|
0:15:29 | and we do this for two reasons. One reason we do it is to |
---|
0:15:33 | increase the triphone coverage. |
---|
0:15:36 | For example, I can tell you we discovered this problem where somebody was |
---|
0:15:41 | asking for the weather in a particular city |
---|
0:15:44 | often enough, and |
---|
0:15:46 | the recognizer was failing all the time, and when we investigated, we discovered that a particular |
---|
0:15:51 | triphone |
---|
0:15:53 | scored badly; it was just not well trained. |
---|
0:15:56 | So there are good reasons to really not train your systems to only model the head |
---|
0:16:01 | of the distribution when it comes to triphones or words; that is the flattening. |
---|
0:16:07 | The other reason is that you have to be careful, because voice search is |
---|
0:16:10 | very popular, for example, in Korea, and if |
---|
0:16:12 | you just select high confidence scores and select the utterances by that, |
---|
0:16:17 | ten or fifteen percent of your corpus is going to be composed of |
---|
0:16:20 | three queries. One is a game, |
---|
0:16:23 | another is a search for a clinic or something like that, |
---|
0:16:27 | and the last one, |
---|
0:16:29 | I have forgotten. |
---|
0:16:30 | So, if you are not careful, you are building a, |
---|
0:16:34 | a three-word recognizer. |
---|
0:16:36 | So you really need to flatten, |
---|
0:16:38 | to not trust |
---|
0:16:39 | the distribution completely. |
---|
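A minimal sketch of transcription flattening as described here: cap how many utterances any single transcription can contribute, so a few hugely popular queries cannot dominate the training set. The cap value is illustrative.

```python
# Sketch of transcription flattening. Keeps at most `cap` utterances per
# distinct transcription; everything past the cap is dropped.
from collections import defaultdict

def flatten_by_transcription(utterances, cap=100):
    """utterances: iterable of (audio_id, transcription)."""
    seen = defaultdict(int)
    kept = []
    for uid, text in utterances:
        if seen[text] < cap:
            seen[text] += 1
            kept.append((uid, text))
    return kept
```

The same cap logic applied at the triphone level would address the coverage problem mentioned just above.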
0:16:43 | But of course unsupervised training is dangerous, right, and you probably have heard |
---|
0:16:48 | stories about it; I know people at Apple have dealt with this. |
---|
0:16:53 | This is what we call the "P problem", so if you hear someone |
---|
0:16:57 | talk about the P problem, this is what it is. |
---|
0:17:00 | So the story is that we launched the Korean system, |
---|
0:17:04 | and we started to collect traffic, and we were going into |
---|
0:17:09 | retraining with this unsupervised approach. |
---|
0:17:13 | And of course, |
---|
0:17:14 | when you have so much data, you don't look at, you don't look at the logs, |
---|
0:17:18 | right, you just push it into the system. But at some point we did look, |
---|
0:17:21 | you know, at the logs, and |
---|
0:17:22 | we noticed that thirty percent of the traffic |
---|
0:17:25 | was the token "P". |
---|
0:17:30 | We were a little bit mystified by that, so we listened to the data, and we |
---|
0:17:33 | noticed that when there is wind, or far-off talking, or that kind of effect, |
---|
0:17:39 | or |
---|
0:17:40 | cars passing by, |
---|
0:17:42 | the recognizer was matching that with the token "P", |
---|
0:17:46 | you know, something that sounds like "p", for which our lexicon provides a phonetic sequence |
---|
0:17:50 | like /p/. |
---|
0:17:51 | So it's a plosive; it matches explosive noise, |
---|
0:17:55 | and it would do it with high confidence. |
---|
0:17:59 | So |
---|
0:18:00 | this hypothesis gets into the retraining data, and the "P" token becomes like, |
---|
0:18:05 | I think, a black hole: it starts to capture more and more. |
---|
0:18:08 | So, |
---|
0:18:10 | so this is the P problem, |
---|
0:18:12 | one you have to be vigilant for. |
---|
0:18:14 | And we have observed that every language seems to have a "P" token. |
---|
0:18:19 | So |
---|
0:18:21 | we have found, for example, that we have "P" tokens in other languages |
---|
0:18:25 | as well. |
---|
0:18:26 | Sometimes it, |
---|
0:18:27 | sometimes it is like a sequence of consonants, because they kind of match |
---|
0:18:32 | noise, but other times it is something else. |
---|
0:18:36 | There are some examples here. |
---|
0:18:39 | So, |
---|
0:18:42 | so, you know, we deal with this in many ways. |
---|
0:18:46 | Some of the things we do, for example, in terms of transcription flattening or |
---|
0:18:51 | triphone flattening, |
---|
0:18:52 | those help a lot; |
---|
0:18:54 | they start to filter out these "P" transcriptions. |
---|
0:18:57 | But another simple thing we use: we have a test set that contains |
---|
0:19:00 | a lot of noise, |
---|
0:19:03 | cars passing by, air blowing into the microphone, and we always evaluate on what |
---|
0:19:09 | we call our reject set. So |
---|
0:19:12 | if the reject set starts producing a lot of those transcriptions, then |
---|
0:19:16 | it's really nice, because you, number one, identify what the new "P" tokens are, |
---|
0:19:22 | and you can filter against those. |
---|
0:19:26 | We also model this noise explicitly; we have noise models to capture this kind of |
---|
0:19:30 | problem and get rid of it in the transcriptions. |
---|
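The reject-set screening could be sketched like this: run the recognizer over audio known to contain no speech, and flag any token it keeps producing. The fraction threshold is a made-up illustration of how a recurring noise-matched token might be detected.

```python
# Sketch of "P token" detection with a noise-only reject set. Any token
# the recognizer emits for a large fraction of the noise clips is a
# candidate junk token to filter from the training data.
from collections import Counter

def find_p_tokens(reject_set_hyps, min_fraction=0.3):
    """reject_set_hyps: recognizer hypotheses, one string per noise clip.
    Returns tokens appearing in more than min_fraction of the clips."""
    counts = Counter()
    for hyp in reject_set_hyps:
        counts.update(set(hyp.split()))  # count each token once per clip
    n = len(reject_set_hyps)
    return {tok for tok, c in counts.items() if n and c / n > min_fraction}
```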
0:19:34 | From time to time, I think, when you only do unsupervised transcription, there is |
---|
0:19:38 | the danger that the system is going to get stuck in some corner of its behavior, |
---|
0:19:42 | so from time to time it's not a bad idea to retranscribe corpora |
---|
0:19:46 | and release a new model, so you kind of remove it from these bad |
---|
0:19:50 | corner cases and start again. |
---|
0:19:53 | But as I said, this unsupervised training is a very active area for us, |
---|
0:19:57 | and there are a lot of very interesting problems to deal with. |
---|
0:20:02 | So, given all of this, with some safeguards, |
---|
0:20:05 | we basically select |
---|
0:20:08 | something like |
---|
0:20:10 | four thousand hours or so, |
---|
0:20:12 | and we retrain our acoustic models, and we tend to do this, |
---|
0:20:15 | we started doing it every six months; now I think we have a monthly cycle, |
---|
0:20:19 | and I am hoping we get into a two-week cycle, so every two weeks our |
---|
0:20:22 | acoustic models are retrained. |
---|
0:20:25 | And there are, there are two reasons for that. |
---|
0:20:29 | One reason is that we need to track the changing fleet of Android devices |
---|
0:20:33 | out there. |
---|
0:20:36 | There are many telephones with different hardware, different microphone configurations, so it's important to track |
---|
0:20:41 | those changes, and every week there is a new model, |
---|
0:20:44 | so you need to track that. |
---|
0:20:46 | I think you also want to track user behavior, new usage; for |
---|
0:20:51 | example, there is, |
---|
0:20:53 | there are different uses of our system. Initially it was short queries; |
---|
0:20:57 | now there are longer queries, more conversational, and, you know, |
---|
0:21:01 | the acoustics change a little bit, so we want to track those. |
---|
0:21:05 | And it turns out |
---|
0:21:08 | this continuous acoustic model training |
---|
0:21:10 | basically allows us not only to track but to improve the performance, and indeed |
---|
0:21:15 | in this particular plot there are more things going on (bigger acoustic models and so on); we are |
---|
0:21:19 | not just tracking, |
---|
0:21:22 | but it does help. |
---|
0:21:24 | I also have to say that |
---|
0:21:26 | the refresh pipeline, |
---|
0:21:29 | I mean, I am talking mostly about acoustic models, |
---|
0:21:31 | but we also use similar ideas for language modeling, |
---|
0:21:35 | for pronunciation learning. |
---|
0:21:39 | So the other thing we do is, we obviously work with the acoustic modeling team, |
---|
0:21:44 | so whenever there are best practices, new ideas, |
---|
0:21:47 | and, as I said before, they really like to work and prototype on Icelandic, |
---|
0:21:52 | which is small and well matched, |
---|
0:21:54 | so whenever they discover something that works really nicely in Icelandic, we bring |
---|
0:21:58 | it into our pipeline, and we basically, we track the work of the acoustic modeling |
---|
0:22:03 | team; |
---|
0:22:04 | if something works well, |
---|
0:22:07 | we try to encode it into a massive workflow |
---|
0:22:11 | that, as I said, every two weeks basically retrains everything. |
---|
0:22:15 | It evaluates with our testing sets; |
---|
0:22:19 | I will tell you a little bit more later about our metrics. |
---|
0:22:24 | And the other thing it does, and this is |
---|
0:22:28 | really neat, is it creates a changelist, basically telling you, okay, this is all I |
---|
0:22:32 | changed, and right after, it does an evaluation, so |
---|
0:22:35 | if you do like the model, the only thing you have to say is, yes, I like |
---|
0:22:38 | this model, submit it, and so on. |
---|
0:22:40 | So we still have a human saying yes or no; |
---|
0:22:43 | we could train a neural network to do that for us, I guess. |
---|
0:22:49 | And another thing we do now, we have been thinking of how |
---|
0:22:53 | we can improve this even more, |
---|
0:22:56 | so we have been following another approach we call the |
---|
0:22:59 | David and Goliath approach; |
---|
0:23:01 | I will show you the idea. |
---|
0:23:03 | The idea of the David and Goliath approach is that you have a very good-looking |
---|
0:23:08 | and fast |
---|
0:23:09 | David, which is the production system; |
---|
0:23:12 | maybe it's a little bit dumb, but it's good-looking, you know, fast. |
---|
0:23:15 | And then you have this really smart Goliath, which, you know, |
---|
0:23:18 | is huge and slow, and it gets computed in the data centers. |
---|
0:23:23 | And the idea here is that we can |
---|
0:23:25 | reprocess a lot of our logs with rich acoustic models, |
---|
0:23:29 | with huge language models, with deeper DNNs, things you cannot put into production because the |
---|
0:23:34 | system must be real time. |
---|
0:23:36 | And the goal is that, instead of taking just the transcriptions from |
---|
0:23:40 | the production system, |
---|
0:23:41 | we reprocess all the audio, |
---|
0:23:43 | and if we do that, we immediately see reductions in the transcription error rate of |
---|
0:23:47 | ten percent. |
---|
0:23:49 | And then you can run the same tricks: select data, |
---|
0:23:54 | retrain the acoustic model. |
---|
0:23:56 | And actually, one of the things we're starting to do is to |
---|
0:23:59 | have |
---|
0:23:59 | production acoustic models and really rich acoustic models |
---|
0:24:03 | whose only purpose is to reprocess data; they might be slower. |
---|
0:24:07 | And you can iterate; there are probably two iterations of this process right |
---|
0:24:10 | now. |
---|
0:24:11 | And the aim is that, through |
---|
0:24:13 | all these tricks, we can reduce the error rate on the transcriptions we use for training. |
---|
0:24:18 | And again, the goal is really not to transcribe manually if we can avoid it. |
---|
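The two-pass idea, replacing the first-pass production hypotheses with a richer offline re-decode, might be sketched as follows. The `rich_decoder` callable is a hypothetical stand-in for the slow offline recognizer.

```python
# Sketch of offline log reprocessing: keep the logged audio, but take the
# training transcription from a slower, richer model instead of the
# real-time first pass. Interfaces are illustrative.
def reprocess_logs(logged_utts, rich_decoder):
    """logged_utts: list of dicts with 'audio' and 'fast_hyp' (the
    real-time system's transcription). Returns training pairs whose
    transcription comes from the rich second pass."""
    return [{"audio": u["audio"], "train_hyp": rich_decoder(u["audio"])}
            for u in logged_utts]
```

The output would then feed the same selection and flattening steps described earlier, just starting from better transcriptions.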
0:24:25 | And as I said, similar ideas are used in language modeling |
---|
0:24:30 | and in pronunciation modeling, |
---|
0:24:32 | where we try to learn from the audio. |
---|
0:24:35 | Let me tell you a little bit about the metrics, |
---|
0:24:39 | because, surprisingly, word error rate is not |
---|
0:24:42 | the only thing we look at. |
---|
0:24:45 | Voice search basically exhibits a similar behavior to search: you know, it's a |
---|
0:24:50 | head-heavy distribution with a really long tail. |
---|
0:24:54 | If the only thing you do is measure the word error rate |
---|
0:24:57 | on a test set that you transcribed, like, a month ago or two months ago, |
---|
0:25:01 | there are several problems with that. One is that most likely you're going to be |
---|
0:25:04 | measuring the head of the distribution, |
---|
0:25:06 | tokens like "facebook", things like that. |
---|
0:25:10 | So, I mean, after a while you look very good on the common tokens, |
---|
0:25:15 | but you really care about the tail, those tokens that appear three times a day; how |
---|
0:25:19 | well do you do on those? |
---|
0:25:21 | And most test sets don't cover the tail. |
---|
0:25:24 | It is also not practical to transcribe every single day; I would love it, but it's |
---|
0:25:30 | not possible. |
---|
0:25:32 | And the queries are changing; |
---|
0:25:33 | the traffic is evolving all the time, so whatever test set was relevant one month |
---|
0:25:38 | ago might not be |
---|
0:25:39 | relevant today. |
---|
0:25:41 | And, you know, even with the best speech transcribers, there is lead time between |
---|
0:25:45 | packing up the data and getting it back, so you can only use those a |
---|
0:25:48 | month later. |
---|
0:25:50 | So we use |
---|
0:25:52 | alternative metrics. I mean, |
---|
0:25:54 | we still use word error rate, |
---|
0:25:55 | but we also use two alternative metrics. |
---|
0:25:58 | One is what we call side-by-side testing. So the idea is that |
---|
0:26:03 | you just want to measure the difference between two systems, the production system and |
---|
0:26:07 | a candidate system. The candidate system could be a system that has a new acoustic |
---|
0:26:10 | model, |
---|
0:26:11 | or a new language model, or a new pronunciation lexicon, or a combination of the three, |
---|
0:26:14 | whatever it is. |
---|
0:26:16 | And what we do is, we basically look at the, |
---|
0:26:19 | we select, like, |
---|
0:26:20 | a thousand utterances or three thousand utterances from yesterday. |
---|
0:26:24 | We have the transcriptions that the production system gave us, |
---|
0:26:28 | and then we reprocess those with the new candidate system. |
---|
0:26:31 | We look at the differences; I mean, if the hypotheses are the same in both, |
---|
0:26:35 | we don't care. |
---|
0:26:36 | On the differences we do an SxS evaluation, and Google has had, the |
---|
0:26:41 | search team has had, for many years, |
---|
0:26:44 | an infrastructure for this; think of it like a small |
---|
0:26:47 | Mechanical Turk, with users distributed everywhere in the world |
---|
0:26:52 | who know the language, and the only thing we really ask them is to listen to |
---|
0:26:55 | the audio and tell us which one, |
---|
0:26:57 | which transcription, more closely matches the audio. |
---|
0:27:01 | So this is basically very fast to do; |
---|
0:27:03 | you can, you can get results in a couple of hours, sometimes less. |
---|
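Preparing such a side-by-side test amounts to keeping only the utterances where the two systems disagree, since identical hypotheses carry no information for the raters. A minimal sketch, with hypothetical input shapes:

```python
# Sketch of side-by-side (SxS) test preparation: select the utterances
# where production and candidate hypotheses differ, for human raters.
def sxs_pairs(prod_hyps, cand_hyps):
    """prod_hyps, cand_hyps: dicts mapping audio_id -> transcription.
    Returns (audio_id, prod, cand) triples the raters should listen to."""
    return [(uid, prod_hyps[uid], cand_hyps[uid])
            for uid in sorted(prod_hyps)
            if uid in cand_hyps and prod_hyps[uid] != cand_hyps[uid]]
```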
0:27:08 | the other thing which i think is even more interesting is what |
---|
0:27:11 | we call live experiments |
---|
0:27:13 | so the idea is that you have a new candidate system |
---|
0:27:15 | that you feel is pretty good |
---|
0:27:17 | and we basically deploy this system into production and it starts taking a |
---|
0:27:21 | little bit of the traffic one percent ten percent |
---|
0:27:24 | and |
---|
0:27:25 | and we basically track usage metrics that we have in the background metrics like the |
---|
0:27:30 | click through rates |
---|
0:27:31 | whether the users are clicking on the results more or not |
---|
0:27:35 | whether the users correct by hand the transcription we provide |
---|
0:27:40 | whether the user stays with the application or |
---|
0:27:43 | goes away |
---|
0:27:47 | and you know of course there is a lot of statistical processing to |
---|
0:27:50 | understand when the results are significant and when they're not |
---|
0:27:54 | but this is just really useful because |
---|
0:27:56 | it allows us to |
---|
0:27:58 | validate systems quickly before we increase the traffic to |
---|
0:28:02 | one hundred percent |
---|
0:28:03 | and the nice thing is that the user doesn't even know that he's |
---|
0:28:06 | been subject to an experiment |
---|
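The statistical processing mentioned here can be as simple as a two-proportion z-test on click-through rates between the production and candidate arms. This is the generic textbook test, shown only as a sketch; it is not necessarily what the production pipeline uses.

```python
import math

def ctr_z_test(clicks_prod, n_prod, clicks_cand, n_cand):
    """Two-proportion z-test: is the candidate's click-through rate
    significantly different from production's?"""
    p_prod = clicks_prod / n_prod
    p_cand = clicks_cand / n_cand
    # pooled proportion under the null hypothesis of no difference
    p_pool = (clicks_prod + clicks_cand) / (n_prod + n_cand)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_prod + 1 / n_cand))
    z = (p_cand - p_prod) / se
    return p_prod, p_cand, z  # |z| > 1.96 is significant at the 5% level
```

Only when the candidate's metrics are significantly non-worse does the traffic fraction get ramped up further.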
0:28:13 | so other kinds of metrics we use are |
---|
0:28:16 | this is kind of usage related how is the system doing |
---|
0:28:20 | the traffic has been growing a lot basically in the |
---|
0:28:24 | last three years |
---|
0:28:27 | it seems like it doubles every six months or so and |
---|
0:28:31 | we don't see the trend going down |
---|
0:28:34 | i mean the reason for that is not just that speech is becoming more |
---|
0:28:37 | used it is that we have been adding more languages |
---|
0:28:41 | the role of english in our applications has been diminishing it used to be of |
---|
0:28:45 | course the majority and i think now it is a little bit less than fifty percent |
---|
0:28:49 | and the top ten non english languages now generate more |
---|
0:28:53 | than fifty percent of our traffic |
---|
0:28:55 | and the other thirty seven languages generate less |
---|
0:28:57 | but |
---|
0:28:58 | but what we actually have seen in the past is that once the quality of |
---|
0:29:02 | a language improves |
---|
0:29:04 | usage begins to grow |
---|
0:29:08 | another thing that is very interesting is |
---|
0:29:10 | the percentage of queries where there is a semantic interpretation instead of just |
---|
0:29:14 | a transcription we are providing |
---|
0:29:18 | where we parse the output is increasing and i think this is only for |
---|
0:29:21 | english it is beginning to be around twenty percent of the time we act on the |
---|
0:29:25 | query |
---|
0:29:26 | we parse it |
---|
0:29:27 | we understand what it says so these are some graphics |
---|
0:29:32 | here is for example how we track improvements in french |
---|
0:29:36 | where you know we keep adding things better pronunciations larger acoustic models |
---|
0:29:41 | our own language model |
---|
0:29:44 | a language model trained with clean data |
---|
0:29:46 | we do a lot of data massaging of |
---|
0:29:50 | our queries before we build language models but i don't have time to tell you about |
---|
0:29:54 | that |
---|
0:29:55 | a larger language model and so on and so forth so this is a continuous process |
---|
0:30:02 | let me talk a little bit about the languages effort i thought it |
---|
0:30:06 | might be relevant to the audience |
---|
0:30:08 | and also i have been working on that for a while |
---|
0:30:10 | so in two thousand eleven we decided to focus on new languages |
---|
0:30:16 | i mean it was a necessity because android was becoming global and we needed to bring |
---|
0:30:20 | voice search to as many languages as we could |
---|
0:30:22 | so we went through the initial analysis and like many of you would we went |
---|
0:30:26 | to ethnologue we got these awesome pictures of all languages and |
---|
0:30:30 | you know |
---|
0:30:30 | organized at some level it is really cool |
---|
0:30:33 | and then you look at the statistics like everybody else you see seven thousand |
---|
0:30:37 | languages |
---|
0:30:39 | and so forth families of languages and |
---|
0:30:42 | and then you look at the statistics |
---|
0:30:45 | six percent so |
---|
0:30:47 | more than six percent of the languages are spoken by more than a million people |
---|
0:30:52 | and six percent are spoken by less than a hundred people so we probably will |
---|
0:30:55 | not bother with those |
---|
0:30:57 | and then you come up with a count and more or less |
---|
0:30:59 | you basically cover |
---|
0:31:01 | ninety nine percent of the population so |
---|
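The back-of-the-envelope count goes something like this: sort languages by speaker population and find the cutoff that covers the target fraction of people. The speaker numbers in the test are invented for illustration, not Ethnologue's actual figures.

```python
def languages_for_coverage(speakers_by_language, target=0.99):
    """Sort languages by speaker count and return how many are needed
    to cover the target fraction of the total population."""
    counts = sorted(speakers_by_language.values(), reverse=True)
    total = sum(counts)
    covered, needed = 0, 0
    for c in counts:
        if covered / total >= target:
            break  # target reached, remaining languages add little
        covered += c
        needed += 1
    return needed
```

With realistic figures this kind of calculation is what yields the "top three hundred languages cover ninety-nine percent of the population" claim.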
0:31:03 | we internally keep saying that our goal is to build voice search in the |
---|
0:31:07 | top three hundred languages i think |
---|
0:31:10 | it's a good selling point it's a good sound bite |
---|
0:31:13 | i think in reality |
---|
0:31:15 | probably after we reach a certain number |
---|
0:31:18 | and that's a lot |
---|
0:31:21 | we probably would have to rethink what to do with the next ones |
---|
0:31:24 | because there is not much incentive for |
---|
0:31:26 | the oral only languages where there is nothing to search so |
---|
0:31:30 | but you could argue that it's a feedback loop right when you |
---|
0:31:35 | have speech recognition technology maybe you facilitate the creation of content so we need |
---|
0:31:40 | to break this |
---|
0:31:41 | loop somehow |
---|
0:31:44 | one thing we are very proud of is our approach to rapid prototyping of languages |
---|
0:31:50 | rather than an algorithmic approach which |
---|
0:31:53 | would have been |
---|
0:31:55 | interesting in itself |
---|
0:31:57 | we decided to take a more process driven approach focus on process |
---|
0:32:03 | we basically focused on solving the two main problems |
---|
0:32:07 | which obviously many of you here this week will relate to which is how |
---|
0:32:11 | the hell do we get data |
---|
0:32:13 | so we basically spent a lot of time developing tools to collect data very quickly |
---|
0:32:16 | very efficiently |
---|
0:32:18 | on one hand |
---|
0:32:20 | we built software tools that run on telephones that allow us to send a |
---|
0:32:24 | team that can collect in a week two hundred hours of data |
---|
0:32:28 | we also built a lot of web based tools to do annotations |
---|
0:32:33 | so the result is that in three years we collected |
---|
0:32:37 | around sixty languages and at any time we have teams doing collections so right now |
---|
0:32:41 | we have teams out |
---|
0:32:42 | we're planning more and also collecting what is being spoken and all that stuff |
---|
0:32:48 | and so |
---|
0:32:49 | we're starting with more indian languages |
---|
0:32:52 | we |
---|
0:32:53 | and farsi we're going to collect in L A |
---|
0:32:56 | so let me tell you a little bit about the tools |
---|
0:33:00 | so this is how our data collection application looks it is called datahound |
---|
0:33:03 | and there is actually an open source version of this that |
---|
0:33:07 | one of our engineers |
---|
0:33:08 | put together so i think you should talk to him |
---|
0:33:13 | because you can also use it |
---|
0:33:15 | and this is how our web based tool for annotations looks |
---|
0:33:19 | this is the tool we use with our vendors |
---|
0:33:22 | with our own linguistic teams distributed worldwide so they can give us annotations |
---|
0:33:27 | this is a phonetic transcription task |
---|
0:33:30 | or they can do for example this is a pronunciation selection task |
---|
0:33:33 | or they can do a transcription task |
---|
0:33:37 | for test sets something like that |
---|
0:33:40 | and |
---|
0:33:41 | you know in |
---|
0:33:43 | just to talk a bit more about rapid prototyping |
---|
0:33:46 | for lexicons i think lexicons are an area where we're still not as fast |
---|
0:33:50 | as i would like |
---|
0:33:50 | a lot of our lexicons are rule based which is good because now that we have |
---|
0:33:55 | trained linguists they can probably put together a lexicon |
---|
0:33:59 | for languages that are fairly regular |
---|
0:34:01 | like spanish or swahili in a day most likely |
---|
0:34:06 | and for languages that are |
---|
0:34:07 | more difficult |
---|
0:34:09 | we bootstrap the lexicon we collect a seed corpus with our tools and then |
---|
0:34:14 | we train a g2p model |
---|
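Bootstrapping a pronunciation lexicon from a seed corpus can be sketched like this. The one-to-one letter/phoneme alignment and unigram mapping below are drastic simplifications of a real G2P model (which would use joint-sequence or neural models); they only illustrate why the approach works for highly regular orthographies like Spanish or Swahili.

```python
from collections import Counter, defaultdict

def train_g2p(seed_lexicon):
    """Learn the most likely phoneme per grapheme from (word, phones) pairs,
    assuming a naive one-to-one letter/phoneme alignment."""
    stats = defaultdict(Counter)
    for word, phones in seed_lexicon:
        if len(word) != len(phones):
            continue  # skip entries the naive alignment can't handle
        for g, p in zip(word, phones):
            stats[g][p] += 1
    return {g: c.most_common(1)[0][0] for g, c in stats.items()}

def apply_g2p(model, word):
    # fall back to the letter itself for unseen graphemes
    return [model.get(g, g) for g in word]
```

Once a seed lexicon is collected with the annotation tools, a model like this (in practice, a much stronger one) generates pronunciations for the rest of the vocabulary.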
0:34:17 | for language modeling that has never been a problem for us because we just mine |
---|
0:34:20 | web pages and we have the advantage of knowing what people |
---|
0:34:24 | search in a particular language that's very useful |
---|
0:34:28 | of course every language has its nuances and you know whether it's |
---|
0:34:32 | segmentation word modeling or inflection we end up building a tool for each phenomenon |
---|
0:34:37 | like we have been working on inflection modeling for russian and |
---|
0:34:40 | but the good thing is once you build the tools you can deploy them for |
---|
0:34:44 | any language so i'm hoping at some point we run out of |
---|
0:34:47 | linguistic phenomena |
---|
0:34:49 | and we have solved them all |
---|
0:34:51 | there is lots of data massaging and sanitization that we have in place one problem you have |
---|
0:34:55 | is mixed languages in the queries |
---|
0:34:57 | so you have to classify them or something like that |
---|
0:34:59 | but the process is pretty automatic and |
---|
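A toy version of that query-log mining step: normalize each query, then count n-grams as the raw material for the language model. The normalization here is deliberately crude; real pipelines handle numbers, segmentation, and inflection per language, as noted above.

```python
import re
from collections import Counter

def count_ngrams(queries, n=2):
    """Normalize query-log text and count n-grams for LM estimation."""
    counts = Counter()
    for q in queries:
        # crude normalization: lowercase, keep only letter runs
        tokens = re.findall(r"[^\W\d_]+", q.lower(), re.UNICODE)
        padded = ["<s>"] + tokens + ["</s>"]
        for i in range(len(padded) - n + 1):
            counts[tuple(padded[i:i + n])] += 1
    return counts
```

From counts like these a smoothed n-gram model is estimated; mixed-language queries would be filtered out by a language classifier before this step.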
0:35:02 | for acoustic modeling it is the most automatic thing once you have the data |
---|
0:35:06 | you push the button |
---|
0:35:08 | and typically a day later you have a neural network trained |
---|
0:35:13 | and the way we develop a language now is |
---|
0:35:17 | we basically have a data operations team doing the data collection then we do a |
---|
0:35:20 | lot of preparation and then we meet in what we call workshops on languages |
---|
0:35:27 | we meet for a week in a room |
---|
0:35:28 | and we typically have a success rate of fifty to seventy percent |
---|
0:35:33 | meaning that in a week we get a system that is launchable and |
---|
0:35:36 | launchable for us means a word error rate of |
---|
0:35:39 | around ten percent |
---|
0:35:41 | and some languages require a little bit more work and you know six months |
---|
0:35:44 | later we go back to them |
---|
0:35:46 | so we have been launching languages at an average of |
---|
0:35:49 | four or five per year last year we did more i think |
---|
0:35:52 | so this is the language coverage we have |
---|
0:35:56 | a |
---|
0:35:57 | so you can see i mean it's forty eight in production |
---|
0:36:03 | somebody asked me the other day why we have basque galician |
---|
0:36:06 | and catalan and spanish |
---|
0:36:08 | you can figure it out |
---|
0:36:13 | we have all these languages in preparation and maybe i'm hiding |
---|
0:36:17 | the ones in progress |
---|
0:36:19 | and as i said our teams are collecting more data so we will keep going |
---|
0:36:24 | an interesting thing is that |
---|
0:36:27 | we have gone into dead languages i think we still have latin |
---|
0:36:32 | although we ran into problems there as well |
---|
0:36:35 | and we have built imaginary languages so this is my challenge for the |
---|
0:36:39 | audience |
---|
0:36:40 | see if you can tell me what language this is |
---|
0:36:43 | let me see |
---|
0:37:12 | somebody downloading a movie |
---|
0:37:27 | i can try to do it |
---|
0:37:30 | you know if the network allows it |
---|
0:37:35 | okay well i think we will try sometime later today |
---|
0:37:41 | i wanted to briefly mention apis i am running out of time |
---|
0:37:46 | basically all these languages are available in two apis one is the android |
---|
0:37:51 | api here is a pointer just look for |
---|
0:37:54 | android speech |
---|
0:37:55 | and there's also a web api it is a very simple api you send us the waveform we |
---|
0:38:00 | give you the transcript and we're thinking about enriching it a little bit more |
---|
0:38:04 | but a lot of developers have been building |
---|
0:38:07 | applications on top of these apis and i think for us these apis |
---|
0:38:11 | of course are pretty |
---|
0:38:12 | important |
---|
0:38:14 | for two reasons when we launch a new language they really provide us with more |
---|
0:38:19 | data and at the beginning more data is what we need |
---|
0:38:22 | and i think it also exposes users and developers to the idea that hey i |
---|
0:38:27 | can build applications with speech recognition and this is good for us |
---|
0:38:33 | the second reason is |
---|
0:38:34 | sometimes |
---|
0:38:36 | the developers are faster in doing things that are useful like for example when we |
---|
0:38:41 | started working on google now which is a large |
---|
0:38:43 | kind of semantic assistant system |
---|
0:38:46 | we didn't have data but because we have this api and developers had been building |
---|
0:38:51 | siri like applications for years we could leverage that data and it gave us |
---|
0:38:55 | really good semantic annotations |
---|
0:38:58 | and just to finish a little bit |
---|
0:39:02 | i think we are now |
---|
0:39:04 | in the middle of this big transition in speech recognition at least within google from |
---|
0:39:09 | transcription to a more conversational interface |
---|
0:39:12 | and you know there are all these new features that are not speech |
---|
0:39:15 | per se they are done by other teams |
---|
0:39:17 | but you know things like coreference resolution so it becomes more conversational |
---|
0:39:22 | pronoun resolution query refinements by voice |
---|
0:39:26 | and more to come |
---|
0:39:27 | they make the application more interesting and |
---|
0:39:30 | i really think that the company |
---|
0:39:33 | is in the middle of this transformation where google goes from this white box where |
---|
0:39:38 | you type |
---|
0:39:39 | into more of an assistant where you |
---|
0:39:42 | talk you engage in a conversation |
---|
0:39:45 | i like to think of google as |
---|
0:39:47 | trying to become like a butler you can talk to |
---|
0:39:50 | and you know that changes everything it is this long term vision of talking to the computer |
---|
0:39:55 | but i hope with |
---|
0:39:57 | a little bit better personality |
---|
0:39:59 | that's a little bit where |
---|
0:40:01 | we are trying to go this |
---|
0:40:03 | pervasive |
---|
0:40:04 | role of speech not only on your android telephone but on your desktop your |
---|
0:40:08 | car |
---|
0:40:09 | in your appliances at home |
---|
0:40:12 | you know an assistant that |
---|
0:40:14 | makes access to information which is what google is about easier and less |
---|
0:40:19 | intimidating for many users |
---|
0:40:21 | so |
---|
0:40:23 | the aim is to have a |
---|
0:40:26 | and this is related to speech technologies not only the microphone here but various |
---|
0:40:30 | microphones always on always listening to you we have made baby steps with this thing |
---|
0:40:36 | called okay google |
---|
0:40:37 | it is seamless |
---|
0:40:38 | so you can talk to your device at home talk to your refrigerator whatever it |
---|
0:40:43 | is and it's about the conversation |
---|
0:40:45 | predicated not just on what you say but |
---|
0:40:48 | on your data |
---|
0:40:50 | with really high quality speech recognition that we try to make better all the time |
---|
0:40:55 | and so on and so forth really a conversation |
---|
0:41:00 | so that is what i wanted to tell you |
---|
0:41:16 | questions we are a little bit late but |
---|
0:41:19 | a |
---|
0:41:31 | [inaudible audience question] |
---|
0:41:46 | [inaudible] |
---|
0:41:54 | as far as that is concerned i think the choice was to |
---|
0:41:59 | collect as much data as possible rather than |
---|
0:42:04 | to spend more money on more careful annotations especially for translation |
---|
0:42:10 | where we actually did not |
---|
0:42:14 | it's not always the quantity it's the quality |
---|
0:42:35 | first a comment |
---|
0:42:38 | many students at our university use |
---|
0:42:41 | google transcriptions as part of projects it actually works very nicely |
---|
0:42:47 | it's been a great |
---|
0:42:49 | resource |
---|
0:42:50 | but i have one question which is |
---|
0:42:55 | you cannot recognise my name |
---|
0:42:58 | why |
---|
0:43:01 | the answer is you know we have |
---|
0:43:06 | internally in engineering this thing we call a fixit which is |
---|
0:43:11 | when we identify a problem and we get together in a room and we don't stop |
---|
0:43:15 | until we resolve it |
---|
0:43:16 | so name recognition was identified in july |
---|
0:43:20 | and actually i wasn't in that effort |
---|
0:43:23 | we came up with a solution and it's been deployed |
---|
0:43:27 | to the actual products |
---|
0:43:29 | we will check tomorrow |
---|
0:43:31 | no but to address the serious question there are out of vocabulary |
---|
0:43:37 | words that actually don't show up in open speech systems so how |
---|
0:43:42 | do we deal with that so name recognition is difficult to excel at |
---|
0:43:47 | because the space is pretty much infinite |
---|
0:43:50 | so we do a variety of things like dynamic language models |
---|
0:43:55 | based upon your data |
---|
0:43:56 | so i mean to do name recognition of your names |
---|
0:43:59 | the names you talk to you know that we can do but it's harder when you |
---|
0:44:03 | have a generic system |
---|
0:44:06 | that has to somehow recognise them |
---|
0:44:10 | that's ultimately the problem i mean we operate typically with a million words |
---|
0:44:17 | in our vocabulary we are going to two million soon |
---|
0:44:21 | but still that is not enough the only way to handle this kind of |
---|
0:44:26 | problem is with |
---|
0:44:29 | more personalization so we know about you |
---|
0:44:32 | so we can adapt to you |
---|
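Personalization of the kind described here, biasing recognition toward names the user actually knows, can be illustrated as a simple n-best rescoring pass over the recognizer's hypotheses. The boost value and scoring scheme are invented for this sketch and bear no relation to the production system.

```python
def rescore_with_contacts(nbest, contacts, boost=2.0):
    """Pick the best hypothesis from an n-best list of (text, score)
    pairs, boosting hypotheses that contain the user's contact names."""
    contact_words = {w.lower() for name in contacts for w in name.split()}

    def score(item):
        hyp, base = item
        hits = sum(1 for w in hyp.lower().split() if w in contact_words)
        return base + boost * hits  # reward contact-name matches

    return max(nbest, key=score)[0]
```

In a real system the biasing happens inside decoding via dynamic language models built from the user's data, not as a post-hoc rescoring, but the effect is the same: "jan smit" beats an acoustically similar generic phrase when Jan Smit is in your contacts.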
0:44:37 | thank you |
---|