Speech Transcript - The growing role of speech in Google products

0:00:17	so as i mentioned the goal really for the presentation was to tell you that there
0:00:20	is life beyond darpa
0:00:22	and that there is lots of interesting stuff
0:00:25	some of the things we do in companies like google are very
0:00:30	different the scale is
0:00:32	unique
0:00:33	and there are different challenges
0:00:36	and i think it is a really exciting time to be at interspeech
0:00:40	because suddenly finally speech is becoming useful if you had told me
0:00:47	four years ago that speech was something that would
0:00:51	be really used by mainstream people i wouldn't have believed it i mean in
0:00:56	fact before i came to google i was orienting my career more into
0:00:59	the area of data mining and speech mining because i thought that was the only
0:01:03	way
0:01:04	that speech could be useful i really did not believe in interactivity you know humans
0:01:09	talking to computers
0:01:11	so i am very surprised that it is working
0:01:16	so
0:01:18	to tell you a little bit
0:01:21	about the history of speech at google which is kind of interesting
0:01:26	people ask me all the time
0:01:28	the multimodal you once
0:01:29	so
0:01:31	the team was started around two thousand five
0:01:34	and i think
0:01:37	there were discussions internally whether
0:01:40	we should build our own systems or license technology and
0:01:46	the decision was not plain i think the opinion at the time was more
0:01:50	in favour of licensing but my trial a me we were pushing for no
0:01:55	we need to build our own stuff so we won and i'm glad we
0:01:58	pushed and convinced K
0:02:00	but on the other hand that was very easy google has this culture where we
0:02:04	do everything we build everything even our own hardware
0:02:07	and data centres so it was an easy decision in two thousand six we
0:02:11	basically started to build what at that time was a state-of-the-art system
0:02:15	and we were very lucky because we got people like if you look any
0:02:19	who had a lot of experience building training infrastructure and all that you can
0:02:24	sell week a lot of impostors bins and
0:02:27	building decoders
0:02:29	language modeling was easy for us because we could leverage all the work from the
0:02:33	translation team
0:02:36	and the challenge for us was basically to build this infrastructure
0:02:40	on top of google distributed computing technology so that was really
0:02:45	different
0:02:47	so in two thousand and seven
0:02:50	and this reflects the mindset of the team we built a system
0:02:53	that was called goog-411 and it was really directory assistance you would
0:02:57	call a telephone number and then you could ask like
0:03:00	i
0:03:02	tell me what is the closest based on
0:03:04	so everything was going through the voice channel
0:03:08	and similarly in two thousand and seven we also started the voicemail transcription project because we
0:03:13	had a product still available called google voice it's like a
0:03:17	telephone number assigned to you and that's forwarding
0:03:22	so a voicemail transcription was also relevant
0:03:26	and around two thousand seven eight
0:03:28	there was a radical change with
0:03:32	the appearance of
0:03:34	smart telephones with android i mean
0:03:36	by the way if you hear me describing something in this talk it
0:03:41	doesn't really mean i am not meaning to say that we did it before
0:03:45	just that we did it because in fact you know there is a rich history
0:03:48	of smart telephones before nokia probably microsoft and apple but in any case for
0:03:53	us
0:03:55	that was like an eye opening experience and we decided to basically stop
0:03:59	any kind of work where the audio was going through the telephone channel and
0:04:02	switch directly to smart phones and leverage the operating system and send everything through the data channel
0:04:08	in two thousand and nine we also had
0:04:10	a project for youtube transcription and alignment and actually initially i was working in that
0:04:14	area because that was what i had been doing before
0:04:18	the idea was to facilitate video transcription and caption alignment
0:04:22	and that is still around
0:04:25	hi just here has been doing a lot of work in that area really interesting
0:04:29	stuff
0:04:30	and in two thousand nine we basically went from voice search on android telephones to
0:04:34	dictation
0:04:36	in the keyboard you will see sometimes a tiny microphone that allows you to dictate
0:04:41	in two thousand and ten we also enabled what we call the intent api basically
0:04:45	it is to allow developers to leverage our servers so they can build speech into their
0:04:51	applications on android
0:04:53	and in two thousand ten we started on this path of going beyond transcription
0:04:57	into semantics basically the next step understanding with voice actions
0:05:04	in eleven we went into the desktop which is just bringing search into your
0:05:08	laptop
0:05:10	in two thousand eleven we started this speech project this was then be analysis on
0:05:15	the team is london with us
0:05:17	and in two thousand eleven we started our language expansion that's something i have
0:05:22	been doing for the last three years i will tell you a little bit about that
0:05:27	perhaps because it's a little bit relevant to the work on
0:05:30	but there was no pooling
0:05:32	in two thousand twelve we started in earnest activities in speaker and language
0:05:36	identification
0:05:38	with the goal of building a state-of-the-art system and i think we're pretty much
0:05:42	there
0:05:43	and this year we basically published a web interface for speech so that
0:05:48	you can basically that is you can inject
0:05:51	you can call our recognizers from any web page
0:05:54	and in two thousand and thirteen also this year we have
0:05:57	gone into this and i would like to talk a little bit about that
0:06:01	this transformation of google from
0:06:05	from transcription
0:06:06	not google
0:06:07	the role of speech at google from providing transcriptions to basically going into
0:06:12	understanding and assistance
0:06:16	people ask me all the time how is the speech team organised so i
0:06:19	also wanted to tell you so you don't ask me anymore
0:06:22	so
0:06:24	what i want to say is that we focus exclusively on speech recognition we
0:06:28	don't focus on semantic understanding or nlp there are other teams for that
0:06:34	i mean with a little bit of an emptiness matters to help us put out
0:06:37	a generate that
0:06:39	language models for data mining of is like that
0:06:42	so the group is headed by difference we can and is organised in several
0:06:47	subgroups there is the acoustic modeling team
0:06:50	this is have i mean Q by kenny and you know they basically work on
0:06:53	acoustic model algorithms adaptation
0:06:56	and robustness
0:06:57	i would say that they are probably the most research oriented they are really at the edge
0:07:02	of new things all the work on dnns is in that group
0:07:05	then there's another group which
0:07:07	we call languages modeling
0:07:10	it is a curious name because this team is the result of
0:07:15	merging the language modeling team and the internationalization team so
0:07:19	my school maybe
0:07:21	so it's language languages
0:07:23	and we do a lot of work related to building lexicons and language modeling we
0:07:29	take care of
0:07:31	keeping our acoustic models fresh i will mention that a little bit
0:07:35	we develop new languages and we are in charge of improving the quality of our
0:07:40	systems we also have
0:07:45	speaker and language id activities in this group and this one is really large
0:07:49	now is headed by me in new york france what form montague
0:07:53	we must be
0:07:55	once we
0:07:56	thirty probably no
0:07:58	then there is the services group they take care of all the infrastructure you know
0:08:02	the set of servers that are continuously running our products
0:08:07	they take care of deployment scalability
0:08:10	these are the more software engineering activities core activities in the team and then
0:08:14	we have a platform and decoder team they are in charge of new decoders and
0:08:18	algorithms
0:08:20	decoders that can run from the device to the distributed system
0:08:25	there are activities on keyword spotting and speaker id there also
0:08:29	then
0:08:31	we have a large data operations and posting
0:08:35	well
0:08:35	this is the team lattices and of front end that must analysis on the front
0:08:39	really their goal is to do data collections annotations transcriptions
0:08:43	once you start doing so many languages and so many products you need data
0:08:46	that has to be annotated
0:08:49	and of course they need tools to
0:08:53	massage this data
0:08:55	and of course we have the tts activities
0:08:57	so everybody see the man to be or need in you are
0:09:01	probably rough happen have and then the tts guys that
0:09:05	i think the makeup of the team is
0:09:08	probably fifty percent software engineers
0:09:10	fifty percent
0:09:12	speech scientists
0:09:14	lately we have been growing a lot on the software engineering side i
0:09:17	think we should really grow more now on the speech science side
0:09:21	and in addition to this core team we have a team of linguists
0:09:24	they are spread all over i mean what we do is we often bring
0:09:28	up a linguistic team in a country like we have for example with the team
0:09:32	in ireland
0:09:33	they have been helping us with speech recognition and we just
0:09:37	but i like to say that everybody codes
0:09:39	even some of our linguists
0:09:41	when i joined google i remember there was this joke that even the lawyers at
0:09:45	google were able to write code
0:09:48	anyway so let me tell you a little bit more about some of the
0:09:52	technologies
0:09:56	basically everything we do in the speech team is big data
0:10:01	which kind of bothers me because i think that this community has been doing
0:10:04	big data for a long time so i don't know why now it is fashionable and has a
0:10:07	new name but whatever
0:10:10	so i will not tell you about
0:10:13	dnns because you know
0:10:16	you know more than me and we have fit into that the topic
0:10:22	you know we have a distributed infrastructure and
0:10:25	i can tell you it takes a week to train on a few thousand hours of
0:10:29	data things busloads of make it faster
0:10:33	but there is something that is fundamentally different when it comes to acoustic
0:10:37	modeling at google
0:10:40	which is that we do not transcribe data
0:10:42	and this might come as a surprise but there are reasons for that
0:10:48	the main reason is that there is a lot of data to process so i
0:10:51	think
0:10:53	i have done the calculations back of the envelope i think every day our
0:10:57	servers process from five to ten years of audio more or less which sounds like
0:11:02	a lot but if you talk to people like marcel who has a background in
0:11:08	a data centre
0:11:10	you know where they process telephone calls
0:11:13	it's not that much actually they probably process more
0:11:17	but when you have so much data
0:11:21	it's like drinking from the fire hose so we have this humid shallow
0:11:26	so you have the speech fire hose pointed at you and you have to
0:11:29	figure out how to get something out of it right because this is a lot of
0:11:33	data
0:11:35	so
0:11:36	and this is to give you just an idea
0:11:39	we break down our languages into tier one languages the important ones the
0:11:43	ones that generate traffic and then the tier two
0:11:46	and then there is the tier three like icelandic that still generate a little
0:11:51	but even
0:11:52	you know the tier one is generating more like thousand hours
0:11:56	per day of traffic
0:11:57	probably it depends on the language
0:11:59	but even the tier two
0:12:02	the ones that are a little bit less important or that we launched recently like vietnamese
0:12:05	or ukrainian basically eight hundred hours per day
0:12:10	so
0:12:11	so for us i was thinking about that yesterday
0:12:15	our scarce resource probably is not the data like in a lot of the
0:12:19	work about the sponsoring our scarce resource probably is the people to look at
0:12:23	all this data coming in
0:12:25	so
0:12:26	so really what we try to do is we try to automate the hell out
0:12:31	of it as much as we can i personally dislike
0:12:35	to have any engineer looking at a language for more than three months because it's
0:12:39	not scalable
0:12:41	i prefer that we come up with the solution and we put it into our automatic
0:12:45	infrastructure
0:12:49	so
0:12:51	in our team we have been investing a lot in what we call our refresh
0:12:55	pipeline
0:12:56	are you wasting
0:12:59	maybe around in that it for
0:13:03	and what does it mean refresh
0:13:05	so
0:13:07	basically anything that comes into voice search is logged like what is time for
0:13:12	spoken
0:13:13	so we keep you know the logs
0:13:15	lots of data as i mentioned from five to ten years of audio
0:13:18	per day
0:13:20	and you want like well
0:13:23	unlike in a typical acoustic modeling training set up you know the one you learn in
0:13:26	school
0:13:28	we don't transcribe it because it's impossible there's no way to transcribe all that
0:13:33	and then follow the traditional approach of supervised training
0:13:37	so what we do is we
0:13:39	basically look at the transcription produced by the production engine
0:13:43	and then we massage that data as much as we can trying to decide
0:13:46	when can we trust the transcription and when we cannot
0:13:50	we do a lot of data selection
0:13:53	there is a lot of work in that
0:13:56	we apply combinations of statistics and
0:13:58	confidence scores
0:14:00	and then we train
0:14:04	so
0:14:05	that's the image i would like to use the audio comes from applications goes in
0:14:09	you know then the transcription is provided to the user or the developer we
0:14:13	log it
0:14:14	and then it goes into our infrastructure where we massage the data extensively
0:14:19	and actually scoreboard
0:14:24	so this is what we call
0:14:26	all the other a desperation
0:14:30	and one of the things we have been looking at
0:14:32	which i think is interesting is
0:14:33	how do you sample from these
0:14:36	ten years of data that you get per day what is it that you apply
0:14:41	do you try random selection
0:14:42	do you bin the data according to confidence scores and then you try to select
0:14:49	a particular bin right because if you organise the data according to confidence
0:14:55	you might be tempted to use the data that has
0:14:58	the highest confidence
0:15:00	but then you could argue that you're not learning anything new right because if the
0:15:03	recognizer is correct there is not much to learn
0:15:06	so you go a little bit deeper into the confidence and select
0:15:10	utterances that you trust but not too much
0:15:14	it's not obvious what to do
0:15:16	and i think it is a very active area of work for us
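
The confidence-band idea the speaker describes (skip the highest-confidence utterances, keep the ones the recognizer trusts "but not too much") could be sketched roughly as below. This is only an illustration, not the team's pipeline: the dictionary keys `transcript` and `confidence`, the band limits 0.6 and 0.9, and the sample size are all assumptions made for the example.

```python
import random


def select_mid_confidence(utterances, low=0.6, high=0.9, k=1000, seed=0):
    """Return up to k utterances whose confidence falls in [low, high).

    `utterances` is a list of dicts with the hypothetical keys
    "transcript" and "confidence" (a float between 0 and 1).
    """
    # Keep only the band: confident enough to trust the automatic
    # transcript, but not so confident that there is nothing left to learn.
    band = [u for u in utterances if low <= u["confidence"] < high]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    rng.shuffle(band)
    return band[:k]
```

The talk itself says the right choice of band is not obvious, so the thresholds here would need to be tuned per language rather than taken as given.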
0:15:23	the other thing we look at for example we do a lot of what we call
0:15:26	distribution flattening
0:15:28	so
0:15:29	and we do this for two reasons one reason we do it is to
0:15:33	increase the triphone coverage
0:15:36	for example i can tell you we discovered this problem where somebody was
0:15:41	asking for weather in so it
0:15:44	well enough and
0:15:46	the recognizer was failing all the time and when we investigated we discovered that a particular
0:15:51	triphone
0:15:53	score but it was not you know what or so
0:15:56	so there are good reasons to really not train your systems to only model the head
0:16:01	of the distribution when it comes to triphones or words but to do the flattening
0:16:07	the other reason is that you have to be careful because some queries are
0:16:10	very popular for example in korea
0:16:12	if you just select high confidence scores and select the utterances by that
0:16:17	ten or fifteen percent of your corpus is going to be composed of
0:16:20	three queries one is a down
0:16:23	but of research and using a clinic or you are not available
0:16:27	last one is bound
0:16:29	abilities forgotten
0:16:30	so if you are not careful you are building a
0:16:34	three word recognizer
0:16:36	so you really need to do flattening
0:16:38	and not trust
0:16:39	the distribution completely
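
One simple way to picture the flattening described above is a cap on how many times the same transcript may enter the training set, so that a few very popular queries cannot dominate. The sketch below assumes that reading; the talk does not say how the flattening is actually implemented, and the key `transcript` and the cap of 3 are hypothetical.

```python
from collections import defaultdict


def flatten_by_transcript(utterances, max_per_transcript=3):
    """Keep at most `max_per_transcript` utterances for each distinct transcript."""
    seen = defaultdict(int)  # transcript text -> how many copies kept so far
    kept = []
    for utt in utterances:
        text = utt["transcript"]
        if seen[text] < max_per_transcript:
            seen[text] += 1
            kept.append(utt)
    return kept
```

The same counting scheme could be keyed on triphones instead of whole transcripts, which would correspond to the triphone flattening mentioned later in the talk.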
0:16:43	but of course unsupervised training is dangerous right and you probably have heard
0:16:48	stories about it i know people in apple have worked with these
0:16:53	this is what we call the P problem so if you hear someone
0:16:57	talk about the P problem i will tell you what it is
0:17:00	so the story is that we launched a korean system
0:17:04	and we started to collect traffic and we went into
0:17:09	retraining with this unsupervised approach
0:17:13	and of course
0:17:14	when you have so much data you don't look at the logs
0:17:18	right you just push it through the system but at some point we did look
0:17:21	at the logs
0:17:22	and we noticed that thirty percent of the traffic
0:17:25	was the token P
0:17:30	we were a little bit mystified by that so we listened to the data and we
0:17:33	noticed that when there is wind or traffic
0:17:39	or
0:17:40	cars passing by
0:17:42	the recognizer matches that with the token P
0:17:46	you know it sounds like that for which our lexicon provides a phonetic sequence
0:17:50	like
0:17:51	so it's plausible so it matches this noise
0:17:55	and it will do it with high confidence
0:17:59	so
0:18:00	it is that's hypothesize in that it that the P R talk in becomes like
0:18:05	i think it starts to capture more
0:18:08	so
0:18:10	so this is the P problem
0:18:12	that you have to be vigilant for
0:18:14	we have observed that every language seems to have a P token
0:18:19	so
0:18:21	we have found for example P tokens in other languages
0:18:25	a feasible
0:18:26	sometimes it
0:18:27	sometimes it is like a sequence of consonants right because they kind of match
0:18:32	noise but other times it is something else
0:18:36	and there are some examples here
0:18:39	so
0:18:42	so you know we deal with this in many ways
0:18:46	some of the ways we do it are for example transcription flattening or
0:18:51	triphone flattening
0:18:52	those help a lot
0:18:54	they start to filter out these P token transcriptions
0:18:57	but another simple thing we do is we have a test set that contains
0:19:00	a lot of noise
0:19:03	cars passing by and air blowing into the microphone and we always evaluate on what
0:19:09	we call our reject set so
0:19:12	if the reject set starts providing a lot of those transcriptions then
0:19:16	it's really nice because you number one identify what are the new P tokens
0:19:22	and you can filter against those
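
The reject-set check described above might look something like the following sketch. It assumes the recognizer has already been run on a noise-only set and that its hypotheses are plain strings; the function names, the key `transcript`, and the 5 percent threshold are made up for illustration and are not taken from the talk.

```python
from collections import Counter


def suspicious_tokens(reject_hypotheses, min_fraction=0.05):
    """Tokens that appear in at least `min_fraction` of noise-only hypotheses."""
    n = len(reject_hypotheses)
    if n == 0:
        return set()
    counts = Counter()
    for hyp in reject_hypotheses:
        # Count each token once per hypothesis, not once per occurrence.
        counts.update(set(hyp.split()))
    return {tok for tok, c in counts.items() if c / n >= min_fraction}


def drop_suspicious(utterances, bad_tokens):
    """Remove training utterances whose transcript contains a suspicious token."""
    return [u for u in utterances
            if not bad_tokens & set(u["transcript"].split())]
```

Dropping every utterance that contains a flagged token is deliberately blunt; if the token is also a legitimate word, a softer rule (for example, dropping only utterances that consist of nothing but that token) would probably be safer.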
0:19:26	we also model this noise explicitly we have noise models to capture this kind of
0:19:30	problem and get rid of it in the transcription
0:19:34	from time to time i think when you only do unsupervised training there is
0:19:38	the danger that the system is going to go into some corner in its behaviour
0:19:42	so from time to time it is not a bad idea to retranscribe corpora
0:19:46	and use a new model so you kind of remove it from these
0:19:50	corner cases and then start again
0:19:53	but as i said this unsupervised training is a very active area for us
0:19:57	and there are a lot of very interesting problems to deal with
0:20:02	so given all that with some safeguards
0:20:05	we basically select
0:20:08	something like
0:20:10	four thousand hours or so
0:20:12	and we retrain our acoustic models and we tend to do this
0:20:15	we started doing it every six months now i think we have a monthly cycle
0:20:19	and i am hoping we get into a two week cycle so every two weeks our
0:20:22	acoustic models are retrained
0:20:25	and there are two reasons for that
0:20:29	one reason is that we need to track the changing fleet of android devices
0:20:33	unlike apple
0:20:36	there are many telephones with different hardware different microphone configurations so it's important to track
0:20:41	those changes and every week there is a new model
0:20:44	so you need to track that
0:20:46	i think you also want to track changes in user behavior for
0:20:51	example
0:20:53	there are different uses of our system initially it was voice search queries
0:20:57	now there are longer queries more conversational and you know
0:21:01	the acoustics change a little bit so we want to track those
0:21:05	and it does help
0:21:08	this continuous acoustic model training
0:21:10	basically allows us not only to track but to improve the performance and indeed
0:21:15	in this particular plot there are more things like bigger acoustic models than
0:21:19	just tracking but
0:21:22	but it does help
0:21:24	i also have to say that
0:21:26	the refresh pipeline
0:21:29	i mean i am talking mostly about acoustic models
0:21:31	but we also use similar ideas for language modeling
0:21:35	and for pronunciation learning
0:21:39	so the other thing we do is we obviously work with the acoustic modeling team
0:21:44	so whenever there are best practices new ideas
0:21:47	and i say for some reason they really like to prototype on icelandic
0:21:52	well matched me white
0:21:54	so whenever they discover something that works really well in icelandic we bring
0:21:58	it into our pipeline and we basically track the work of the acoustic modeling
0:22:03	team
0:22:04	when something works well
0:22:07	and we try to encode these into a massive workflow
0:22:11	that as i said every two weeks does everything it trains
0:22:15	it evaluates with the testing sets
0:22:19	i will tell you a little bit more later about our metrics
0:22:24	and the other thing it does is that it actually creates this is
0:22:28	really neat it creates a change list basically telling you okay this is all i
0:22:32	changed and it is ready after it does an evaluation so
0:22:35	if you like the model the only thing you have to do is say yes i like
0:22:38	this model simple as that
0:22:40	so we still have a human saying yes or no
0:22:43	we could train a neural network to do that for us i guess
0:22:49	and another thing we had we now is to we have been thinking of how
0:22:53	can we improve these even more
0:22:56	so we have been that following another thing we check all the that they will
0:22:59	also help approach
0:23:01	a show you why they
0:23:03	maybe of the david hassle of approach is that you have a very good looking
0:23:08	i'm fast
0:23:09	i will think which is the production system
0:23:12	maybe it's a little bit done based would look you know fast
0:23:15	and then you have it really is mar i will thing where you know it's
0:23:18	which is likely the computed in the U Cs
0:23:23	and the idea here is that we can
0:23:25	reprocess a lot of our logs with rich acoustic models
0:23:29	with rich language models with deeper beams things you could not do in production because the
0:23:34	system must be real time
0:23:36	and the goal is that instead of taking just the transcriptions from
0:23:40	the production system we
0:23:41	reprocess all the audio
0:23:43	and if we do that we immediately see reductions in the transcription error rate of
0:23:47	ten percent
0:23:49	and then you can understand that sticks select data
0:23:54	retrained acoustic model
0:23:56	and actually one of the things we're starting to do is to
0:23:59	have
0:23:59	production acoustic models and really rich acoustic models
0:24:03	whose only purpose is to reprocess data so they might be slower
0:24:07	and you can it that it is probably applies we had in this process right
0:24:10	now
0:24:11	and the aim is that through
0:24:13	all these tricks we can reduce the error rate on our transcriptions for training
0:24:18	and again the goal is really not to transcribe if we can avoid it
0:24:25	and as i said similar ideas are used in language modeling
0:24:30	and in pronunciation modeling
0:24:32	i where we try to learn from all the
0:24:35	let me tell you a little bit about the metrics
0:24:39	because surprisingly word error rate is not
0:24:42	the only thing we use
0:24:45	voice search basically exhibits a similar behavior to search you know it's a
0:24:50	zipf distribution with a really long tail
0:24:54	if the only thing you do is measure the word error rate
0:24:57	on the test set you transcribed like a month ago or two months ago
0:25:01	there are several problems with that one is that most likely you're going to be
0:25:04	measuring the head of the distribution
0:25:06	the tokens like facebook things like that
0:25:10	so i mean after a while you do very well on the common tokens
0:25:15	but you really care about the tail those tokens that you hear three times a day how
0:25:19	well do you do
0:25:21	and test sets don't cover that
0:25:24	it is also not practical to transcribe every single day i know optimal loving but
0:25:30	not possible
0:25:32	and the queries are changing
0:25:33	they are evolving all the time whatever testing set was relevant one month
0:25:38	ago might not be
0:25:39	but i don't know
0:25:41	and you know even the best on speech transcribers there is only time between in
0:25:45	the data packing it's in the you get in so you can use of those
0:25:48	among
0:25:50	so we use two
0:25:52	alternative metrics i mean
0:25:54	we still use word error rate
0:25:55	but we also use two alternative metrics
0:25:58	one is what we call side by side testing so the idea is that
0:26:03	you just want to measure the difference between two systems the production system and
0:26:07	a candidate system the candidate system could be a system that has a new acoustic
0:26:10	model
0:26:11	or a new language model or a new pronunciation lexicon or a combination of the three
0:26:14	whatever it is
0:26:16	and what we do is we basically
0:26:19	select like
0:26:20	a thousand utterances or three thousand utterances from yesterday
0:26:24	we have the transcriptions that the production system gave us
0:26:28	and then we reprocess those with the new candidate system
0:26:31	we look at the differences i mean if the hypotheses are the same in both
0:26:35	we don't care
0:26:36	and on the differences we do an a b test and google
0:26:41	the search team has had for many years
0:26:44	an infrastructure think of it like a small
0:26:47	mechanical turk with users distributed everywhere in the world
0:26:52	with which you are probably familiar and the only thing we really ask them is listen to
0:26:55	the audio and tell us which one
0:26:57	more closely matches the audio
0:27:01	so this is basically very fast to do
0:27:03	so you can get results in a couple of hours sometimes less
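
The side by side procedure described above has two mechanical steps: keep only the utterances where the two systems disagree, then count which hypothesis the human raters preferred. A minimal sketch of those two steps follows. The data layout (dictionaries from an utterance id to a hypothesis string) and the vote labels `production`, `candidate` and `tie` are assumptions for the example, not details from the talk.

```python
from collections import Counter


def differing_utterances(production, candidate):
    """Return (id, production_hyp, candidate_hyp) where the two systems disagree.

    Both arguments map a hypothetical utterance id to a hypothesis string.
    Utterances with identical hypotheses are skipped, since they carry no
    information about which system is better.
    """
    return [(uid, production[uid], candidate[uid])
            for uid in sorted(production)
            if uid in candidate and production[uid] != candidate[uid]]


def tally_votes(votes):
    """Count rater preferences; each vote is 'production', 'candidate' or 'tie'."""
    counts = Counter(votes)
    return {label: counts[label] for label in ("production", "candidate", "tie")}
```

Only the disagreeing pairs would be sent to raters, which is what keeps the turnaround down to hours rather than the weeks a full transcription would take.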
0:27:08	the other thing we do which i think is even more interesting is what
0:27:11	we call live experiments
0:27:13	so the idea is that you have a new candidate system
0:27:15	that you feel is pretty good
0:27:17	and we deploy this system into production and it starts taking a
0:27:21	little bit of traffic one percent ten percent
0:27:24	and
0:27:25	and we basically track a few metrics they are basically indirect metrics like the
0:27:30	click through rates
0:27:31	whether the users are clicking on the result more or not
0:27:35	whether the users are correcting by hand the transcription we provide
0:27:40	whether the user stays with the application or
0:27:43	goes away
0:27:47	and you know of course there is a lot of statistical processing to
0:27:50	understand when the results are significant and when they're not
0:27:54	but this is just really useful because
0:27:56	it allows us to
0:27:58	evaluate systems quickly before we increase the traffic to
0:28:02	one hundred percent
0:28:03	and of course the nice thing is that the user doesn't even know that he has
0:28:06	been subject to an experiment
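
The talk does not say which statistics are used to decide whether a live experiment result is significant. As one textbook possibility, a two-proportion z-test on click through rates could be sketched as below; the function name and the choice of test are assumptions, and a real system would likely need more care (for example, correcting for many metrics being tracked at once).

```python
import math


def two_proportion_z(clicks_a, n_a, clicks_b, n_b):
    """Two-sided z-test for a difference in click through rate.

    Arm A is the production system, arm B the candidate.
    Returns (z, p_value); z > 0 means the candidate has the higher rate.
    """
    p_a = clicks_a / n_a
    p_b = clicks_b / n_b
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("pooled rate is 0 or 1; the test is undefined")
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

With a small traffic share such as the one percent mentioned above, the candidate arm has far fewer queries than production, so small differences in rate may need a longer run before they reach significance.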
0:28:13	so other kinds of metrics we use are
0:28:16	this is kind of hundred related to how is this system doing
0:28:20	the traffic is growing a lot basically in the last
0:28:24	in the last three years
0:28:27	it seems like it doubles every six months or so i think
0:28:31	we don't see the trend going down
0:28:34	i mean there are reasons for that it is not just that speech is becoming more
0:28:37	used it is that we have been adding more languages
0:28:41	the role of english in our applications has been diminishing it used to be of
0:28:45	course most of it i think now it is a little bit less than fifty percent
0:28:49	and the top ten non english languages they generate now more
0:28:53	than fifty percent of our traffic
0:28:55	and the other thirty seven they generate less
0:28:57	but
0:28:58	but what we actually have seen in the past is that once the quality of
0:29:02	a language improves
0:29:04	a things begin to show that
0:29:08	another thing that is very interesting is
0:29:10	the percentage of queries where there has been a semantic interpretation where instead of just
0:29:14	a transcription we are providing
0:29:18	a parse in the output is increasing and i think this is only for
0:29:21	english it is beginning to be around twenty percent of the time we act on the
0:29:25	query
0:29:26	we parse it
0:29:27	we understand what it is so these are some graphs
0:29:32	here is for example how we track an improvement in french
0:29:36	where you know we keep adding things better pronunciations lighter acoustic models
0:29:41	at own language model
0:29:44	a language model trained with clean data
0:29:46	we do a lot of data massaging of
0:29:50	our queries before we build language models but i don't have time to tell you about
0:29:54	that
0:29:55	a larger language model so on and so forth so this is a continuous process
0:30:02	let me tell you a little bit about the languages effort which i thought
0:30:06	might be relevant to the audience
0:30:08	and also i have been working on that for a while
0:30:10	so in two thousand eleven we decided to focus on new languages
0:30:16	i mean it was a necessity because android was becoming global and we needed to bring
0:30:20	voice as much as we could
0:30:22	so we went through the initial analysis and like many of you we went
0:30:26	to ethnologue and we got these awesome pictures of all languages and
0:30:30	you know
0:30:30	organising activities some level a real cool
0:30:33	and then you look at the statistics like everybody else you see around seven thousand
0:30:37	languages
0:30:39	and so forth families of languages and
0:30:42	and then you look at the statistics
0:30:45	six percent of so it's a spoken by
0:30:47	more than six percent of the languages are spoken by more than a million people
0:30:52	and six percent are spoken by less than a hundred people so we probably will
0:30:55	not bother with those
0:30:57	and then if you cover the top three hundred more or less
0:30:59	you basically cover
0:31:01	ninety nine percent of population so
0:31:03	we internally keep talking about that our goal is to build voice search in the
0:31:07	top three hundred languages i think
0:31:10	it's a good selling point it's a good sound bite
0:31:13	i think in reality
0:31:15	probably after we reach at which are they
0:31:18	we at that's a lot
0:31:21	we probably would have to rethink what to do with the next ones
0:31:24	because there are many sentiment for
0:31:26	some of these languages there is no web there is nothing to search so
0:31:30	but you could argue that it's a feedback loop right when you have
0:31:35	speech recognition technology maybe you facilitate the creation of content so we need
0:31:40	to break this
0:31:41	loop somehow
0:31:44	well that is we are very problem is our approach to rapid prototyping of languages
0:31:50	i think rather than an algorithmic approach which is
0:31:53	that would have been
0:31:55	i would in itself
0:31:57	a thinking we decided to take a more process approach focus on process
0:32:03	we basically focus on solving the two main problems
0:32:07	which obviously may i here this week we need you want to which is how
0:32:11	the hell do we get data
0:32:13	so we basically spent a bit of time developing tools to collect data very quickly
0:32:16	very efficiently
0:32:18	other hand
0:32:20	we build software tools that run on the telephones that allow us to send a
0:32:24	team and collect data in a week two hundred hours of data
0:32:28	we also built a lot of web based tools to do annotations
0:32:33	so the result is that in three years we collected more than sixty languages
0:32:37	around sixty languages and at any time we have teams in collections so right now
0:32:41	we have teams see
0:32:42	more we're planning at and also be what is an outspoken and all that stuff
0:32:48	it's a so
0:32:49	we're starting in it for indian languages
0:32:52	we
0:32:53	and farsi we're going to collect in L A
0:32:56	so little bit is still you need a
0:33:00	so this is how our data collection application looks it is called datahound
0:33:03	and there is actually an open-source version of this that
0:33:07	idiom button are
0:33:08	put together with the ground so i think what you to talk to him
0:33:13	because you can also do it
0:33:15	and this is what our web based tool for annotations looks like
0:33:19	this is the tool we use with our vendors
0:33:22	with our own linguistic teams distributed worldwide so they can give us and i think
0:33:27	this is a phonetic transcription task
0:33:30	or they can do for example this is a pronunciation selection task
0:33:33	or they can do a transcription task
0:33:37	for test sets something like that
0:33:40	and
0:33:41	you know in
0:33:43	just to talk a bit more about rapid prototyping
0:33:46	for lexicons i think lexicons is an area where we're still not as fast
0:33:50	as i'd like
0:33:50	a lot of our lexicons are rule based which is good because now that we have
0:33:55	trained linguists they can probably put together a lexicon
0:33:59	for languages that are regular
0:34:01	like spanish or swahili in a day most likely
0:34:06	and for the ones that are
0:34:07	more difficult then we
0:34:09	we either buy a commercial lexicon or we collect a seed corpus with our tools and then
0:34:14	we train a g2p
0:34:17	for language modeling that has never been a problem for us because we just mine
0:34:20	web pages and we have the advantage of knowing what people
0:34:24	search in the particular language that's very useful
0:34:28	of course every language has its nuances and you know whether it's
0:34:32	segmentation or tone modeling or inflection we end up building a tool for any language
0:34:37	like we have been working on inflection modeling for russian and
0:34:40	but the good thing is once you build the tools you can deploy them for
0:34:44	any language so i'm hoping at some point we run out of weird
0:34:47	linguistic phenomena
0:34:49	and we have to swallow and
0:34:51	lots of data massaging text normalizations that we have in place one problem you have
0:34:55	is mixed languages in the queries
0:34:57	so you have to classify or something like that
0:34:59	but the process is pretty automatic and
0:35:02	for acoustic modeling it is the most automatic thing once you have the data ready
0:35:06	you push the button
0:35:08	and typically a day later you have a neural network trained
0:35:13	and the way we develop languages now is
0:35:17	we basically have a date operations the in domain this data collection some we a
0:35:20	lot of preparation and then we meet in these we call them workshops on languages
0:35:27	we meet for a week in a room
0:35:28	and we typically have a success rate of fifty or seventy percent
0:35:33	meaning that in a week we get a system that is production ready and
0:35:36	production ready for us means a word error rate of
0:35:39	around ten percent
0:35:41	and some languages require a little bit more work and you know six months
0:35:44	later we go back to them
0:35:46	so we have been launching languages at an average of
0:35:49	four five last year we more we like the thing
0:35:52	so this is the language coverage we have
0:35:56	a
0:35:57	so you can see i mean it's forty eight in production
0:36:03	somebody asked me the other day why we have basque galician
0:36:06	and catalan and spanish
0:36:08	you can figure it out
0:36:13	we have all these languages in preparation time i might maybe in the i'm hiding
0:36:17	the innocence was clearly
0:36:19	and as i said our teams are collecting more data so we will keep going
0:36:24	an interesting thing is that we
0:36:27	we have gone into dead languages i think we still have latin
0:36:32	although we run into well also delayed
0:36:35	and we have built imaginary languages so this is my challenge for there are private
0:36:39	lessons
0:36:40	see if you can tell me what language this is
0:36:43	let me see
0:37:12	somebody downloading a movie
0:37:27	i can try to do
0:37:30	you know if not outright in
0:37:35	okay we'll i think will try sometime today
0:37:41	i wanted to briefly mention apis i'm running out of time
0:37:46	basically all these languages are available in two apis one is the android
0:37:51	api this is a pointer just look for
0:37:54	speech android
0:37:55	and there's also a web api this is a very simple api you send the waveform we
0:38:00	give you the transcripts and we're thinking about enriching them a little bit more
0:38:04	but a lot of developers have been building a
0:38:07	applications on top of these apis and i think for us the apis
0:38:11	of course is pretty
0:38:12	they are very important because
0:38:14	for two reasons when we launch a new language they really provide us with more
0:38:19	data and at the beginning more latest book
0:38:22	and i think it also exposes users and developers to the idea that hey i
0:38:27	can build applications with speech recognition and this is good for us
0:38:33	recent which is
0:38:34	sometimes are
0:38:36	the developers are faster in doing things that are useful like for example when we
0:38:41	started working on google now which is a large
0:38:43	kind of semantic assistant system
0:38:46	we didn't have data but because we have this api and developers have been building
0:38:51	siri like applications in android for years we could leverage that data and it was
0:38:55	really good for semantic annotations
0:38:58	and just to finish a little bit
0:39:02	i think we are now in
0:39:04	in the middle of this big transition in speech recognition at least within google from
0:39:09	transcription to a more conversational interface
0:39:12	and you know there's all these new features that i did this is not speech
0:39:15	it is being done by other teams
0:39:17	but you know things like coreference resolution so it becomes more conversational
0:39:22	pronoun resolution query refinements by voice
0:39:26	and more to come
0:39:27	they make the application more interesting and
0:39:30	we really i think that the company
0:39:33	is in the middle of this transformation where google goes from this white box where
0:39:38	you type
0:39:39	into more of an assistant where you
0:39:42	you talk you engage in a conversation
0:39:45	like to think of who a list
0:39:47	trying to become like bachelor you can talk to
0:39:50	on you know that changes everything these long term be single based on the computer
0:39:55	what i hope with
0:39:57	a little bit better personality but
0:39:59	that's a little bit where R
0:40:01	where we are trying to go this
0:40:03	pervasive
0:40:04	role of speech not only on your android telephone but on your desktop your
0:40:08	car
0:40:09	in your appliances at home
0:40:12	you know an assistant that
0:40:14	even makes access to information which is what mobile is about even easier and less
0:40:19	intimidating for many users
0:40:21	so
0:40:23	the aim is to have a
0:40:26	and this is related to speech technologies are not only the microphone here about various
0:40:30	microphones are always on always listening to you we have baby steps with this thing
0:40:36	called okay google
0:40:37	a signal less
0:40:38	so you can talk to your device at home talk to your refrigerator whatever it
0:40:43	is i know it's about the conversation
0:40:45	predicated not so what you
0:40:48	about your data
0:40:50	with really high quality speech recognition we try to get the time better
0:40:55	and so on and so forth and really conversation
0:41:00	so that was what i just wanted to tell you
0:41:16	the questions with a little bit late but
0:41:19	a
0:41:31	i don't becomes R
0:41:46	we with
0:41:54	it is concerned i think there are four
0:41:59	collect as much data and it was really a philosophical choice
0:42:04	to spend more money on more careful annotations especially for translation
0:42:10	where we actually did not sell
0:42:14	it's not always the body it's the call
0:42:35	first a comment
0:42:38	many students at our university use
0:42:41	google transcriptions as part of projects and it actually works very nicely
0:42:47	it's been a great
0:42:49	source
0:42:50	work but i have one question which is
0:42:55	you cannot recognise my name
0:42:58	why
0:43:01	i is that you know we have a
0:43:06	in intra lingual engineering this thing we call yellow's which is
0:43:11	when we identify a problem and we get together in a room and we don't stop
0:43:15	until it is resolved
0:43:16	so name recognition was identified in july
0:43:20	and i actually i wasn't in that it for
0:43:23	we came up with the solution so it's been deployed a like
0:43:27	to they actually products
0:43:29	well chuck tomorrow
0:43:31	no but at the know this you notice that the serious question there utility in
0:43:37	some words of the but actually don't see to show opener speech systems so how
0:43:42	do we do the so that my name recognition is difficult to excel in a
0:43:47	because the space is pretty much infinite
0:43:50	so we do a variety of things a dynamic language models
0:43:55	based upon your data
0:43:56	so i mean to do name recognition of your names
0:43:59	the names you talk to you know that you can do but it's when you
0:44:03	have as a generic system
0:44:06	that actually somehow believe so it's
0:44:10	there that's ultimately problem for the i mean we operate typically with a million tokens
0:44:17	in our vocabulary we are going to two million with the song
0:44:21	but still a that is way more the only way to handle this kind of
0:44:26	problem is with
0:44:29	more personalization so we know about you
0:44:32	so we can do you need to you
0:44:37	thank you

Applications Day

Pedro Moreno (Google)