0:00:15 | great so it's a pleasure |
---|
0:00:18 | to talk here today indeed i |
---|
0:00:20 | followed your |
---|
0:00:24 | community for a couple of years about ten years ago and then moved on |
---|
0:00:27 | to doing a few different things but i might come back to it who knows |
---|
0:00:31 | i was very interested in what i've seen during the week |
---|
0:00:35 | so thanks for inviting me |
---|
0:00:37 | so today i'll talk about a few |
---|
0:00:41 | projects related to embedding spaces i'm going to |
---|
0:00:45 | tell you what they are first and what we can do with them but |
---|
0:00:50 | pay attention to the fact that this is not only my work there's plenty of |
---|
0:00:54 | people that have worked on that with me |
---|
0:00:56 | and i'd like to thank them first |
---|
0:01:01 | so |
---|
0:01:02 | embedding spaces what are they useful for in fact it started |
---|
0:01:07 | like a decade ago when people started to try to think |
---|
0:01:12 | how can we represent |
---|
0:01:14 | discrete objects in a continuous space in such a way that it becomes useful to |
---|
0:01:19 | manipulate them |
---|
0:01:20 | like if you think of words |
---|
0:01:23 | words are difficult to manipulate because |
---|
0:01:26 | you cannot compare two words easily at least |
---|
0:01:30 | in terms of mathematics there is no comparator so how can you represent words such that |
---|
0:01:35 | then after that you can |
---|
0:01:36 | manipulate them and compare them |
---|
0:01:41 | so these are the embedding spaces where we're gonna project words |
---|
0:01:47 | once we have projected words into these spaces |
---|
0:01:49 | what about projecting anything else |
---|
0:01:51 | like images |
---|
0:01:53 | or speech or |
---|
0:01:55 | music |
---|
0:01:56 | so how can we do that that's gonna be the second part |
---|
0:02:00 | and we'll see that once we |
---|
0:02:02 | can manipulate complex objects in these spaces |
---|
0:02:05 | then because you |
---|
0:02:07 | you learn |
---|
0:02:09 | semantic representations of these words you can actually |
---|
0:02:13 | try to |
---|
0:02:16 | discover things in complex objects like images that you've never seen before we'll see |
---|
0:02:21 | how we can do that |
---|
0:02:22 | and i'll |
---|
0:02:24 | briefly describe at the end |
---|
0:02:26 | some recent work we've done trying to |
---|
0:02:29 | do similar things from images but now applied to speech |
---|
0:02:32 | and vice versa |
---|
0:02:36 | so let's |
---|
0:02:38 | let's start with what i mean by embedding so |
---|
0:02:41 | so here i depict a three d embedding space but of course in general |
---|
0:02:46 | it's more like a one hundred two hundred one thousand |
---|
0:02:49 | dimensional embedding space |
---|
0:02:51 | and |
---|
0:02:52 | think of this space as a |
---|
0:02:54 | real vector space |
---|
0:02:58 | where each point |
---|
0:02:59 | could be the position of a discrete object like a word so |
---|
0:03:06 | here we have let's say the position of this word |
---|
0:03:10 | here we have the position of the word obama |
---|
0:03:12 | and here |
---|
0:03:13 | like here the positions of dolphin seaworld and paris |
---|
0:03:17 | and |
---|
0:03:19 | we want to learn where to put these words and basically at the beginning |
---|
0:03:23 | we just don't know |
---|
0:03:24 | so we're gonna |
---|
0:03:25 | find a random position for each of the words |
---|
0:03:28 | of the dictionary that we're given |
---|
0:03:31 | and then we want to modify the positions move them around and at some point |
---|
0:03:35 | we want it to be such that |
---|
0:03:38 | nearby words so the words nearby the position of the word dolphin |
---|
0:03:42 | should have similar meanings or at least related meanings |
---|
0:03:46 | so you'd like seaworld and dolphin to be not far |
---|
0:03:48 | but relatively far from say paris and obama |
---|
0:03:53 | and the hope is that if we can achieve that it's gonna be useful |
---|
0:03:57 | for manipulating these words afterwards so how can we do that |
---|
0:04:02 | so |
---|
0:04:04 | about ten years ago my brother yoshua so just for those of you |
---|
0:04:08 | who don't know there are two bengios |
---|
0:04:10 | you invited me not my brother |
---|
0:04:17 | both of us work in deep learning |
---|
0:04:20 | so |
---|
0:04:22 | and |
---|
0:04:22 | so about ten years ago he |
---|
0:04:26 | and |
---|
0:04:27 | described a project where he could learn such embeddings |
---|
0:04:31 | and here is how he started |
---|
0:04:34 | this thing |
---|
0:04:37 | so |
---|
0:04:38 | he was using neural networks so this is a neural network where we have the |
---|
0:04:43 | inputs and you have layers |
---|
0:04:45 | connected to each other and at the end you have the output layer |
---|
0:04:48 | and the goal was to try to learn a representation for words so what he |
---|
0:04:53 | did |
---|
0:04:53 | was take |
---|
0:04:54 | sentences |
---|
0:04:55 | like you can capture sentences from wikipedia from the web from anywhere really |
---|
0:04:59 | where you can just grab sentences |
---|
0:05:02 | and your goal is to try to |
---|
0:05:04 | find a representation of words |
---|
0:05:07 | such that |
---|
0:05:09 | it will be easy to try to predict |
---|
0:05:12 | if i show you a few words can you predict what will be the next word |
---|
0:05:15 | so you put |
---|
0:05:17 | say these four words as input of the model |
---|
0:05:20 | you crunch them and at the end you predict a word in your dictionary and |
---|
0:05:24 | there's one unit for each of the words of your dictionary |
---|
0:05:27 | and you try to predict that the next word will be cheese |
---|
0:05:31 | in that case |
---|
0:05:32 | now how do you represent the words |
---|
0:05:34 | you represent them as a vector in |
---|
0:05:37 | this d dimensional space |
---|
0:05:39 | which at the beginning will be just a random vector |
---|
0:05:42 | so this is what |
---|
0:05:44 | we call the embedding space |
---|
0:05:46 | so think of it as a big lookup table or a |
---|
0:05:49 | big matrix where each line |
---|
0:05:51 | is the representation of a word |
---|
0:05:53 | and if you see the sequence the cat eats you're gonna just look at the |
---|
0:05:57 | word the |
---|
0:05:59 | look up the word cat each of them has a vector representation you just |
---|
0:06:03 | put that as input of your model |
---|
0:06:05 | and then you pass it through your model and you predict the next word |
---|
0:06:09 | and the hope is that if you do that often |
---|
0:06:11 | and you |
---|
0:06:12 | as you might imagine you can feed the model lots of such data since it's |
---|
0:06:16 | easy to get |
---|
0:06:17 | you will find good representations of words |
---|
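To make the idea concrete, here is a minimal sketch of such a model: an embedding lookup table feeding a small network that predicts the next word. This is illustrative PyTorch, not the original system; the vocabulary size, embedding dimension, and layer sizes are all assumptions.

```python
# Minimal sketch of an embedding table + next-word prediction network.
import torch
import torch.nn as nn

class TinyNeuralLM(nn.Module):
    def __init__(self, vocab_size=100_000, dim=50, context=4, hidden=256):
        super().__init__()
        # the "big lookup table": one d-dimensional vector per word,
        # initialized at random and moved around during training
        self.emb = nn.Embedding(vocab_size, dim)
        self.hidden = nn.Linear(context * dim, hidden)
        self.out = nn.Linear(hidden, vocab_size)   # one output unit per word

    def forward(self, context_ids):                # context_ids: (batch, context)
        e = self.emb(context_ids)                  # (batch, context, dim)
        h = torch.tanh(self.hidden(e.flatten(1)))
        return self.out(h)                         # scores for the next word

model = TinyNeuralLM()
logits = model(torch.randint(0, 100_000, (8, 4)))      # e.g. four context words
loss = nn.functional.cross_entropy(logits, torch.randint(0, 100_000, (8,)))
loss.backward()   # gradients also flow into the embedding table
```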
0:06:23 | so the first time he did it it was very painful very slow machines ten years |
---|
0:06:28 | ago were very slow the dictionary was very small |
---|
0:06:31 | it wasn't very useful but since then things have improved the model has |
---|
0:06:35 | been simplified we have more data and more gpus |
---|
0:06:39 | and these things start to work quite well |
---|
0:06:42 | here is an example of an embedding space that we trained |
---|
0:06:46 | about two years ago i think |
---|
0:06:48 | and you don't have to try to see well |
---|
0:06:51 | what's in that space |
---|
0:06:53 | we can pick a word |
---|
0:06:54 | so like here i pick the word apple |
---|
0:06:56 | and look in the space what are the nearest words say in the |
---|
0:07:00 | euclidean space so i look |
---|
0:07:02 | at the positions of all the words i |
---|
0:07:04 | sort them with respect to the distance to the target word |
---|
0:07:07 | and i look at the other words and what i see is that |
---|
0:07:10 | the other words are very semantically similar to apple so you have fruits apples melons |
---|
0:07:14 | peach |
---|
0:07:15 | and whatever |
---|
0:07:17 | if you take stab you get things that are less eh happy |
---|
0:07:22 | and around iphone |
---|
0:07:24 | you see stuff like ipad whatever |
---|
0:07:27 | so |
---|
0:07:29 | so it does capture things and this was trained in an unsupervised way it's |
---|
0:07:34 | just show sentences |
---|
0:07:35 | and that's what you get |
---|
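A hedged sketch of the nearest-word query just described, assuming a trained embedding matrix `emb` of shape (vocab, dim) and a `vocab` dict mapping words to row indices (both hypothetical names):

```python
# Nearest neighbours of a word in euclidean distance, as in the apple example.
import numpy as np

def nearest(word, emb, vocab, k=5):
    ids = {i: w for w, i in vocab.items()}
    q = emb[vocab[word]]
    d = np.linalg.norm(emb - q, axis=1)             # distance to every word
    return [ids[i] for i in np.argsort(d)[1:k + 1]] # skip the word itself

# nearest("apple", emb, vocab) -> e.g. fruit words, per the talk
```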
0:07:39 | and this was with a fifty dimensional embedding so you don't even need to have |
---|
0:07:43 | a very large space this was about |
---|
0:07:45 | i think a hundred thousand word |
---|
0:07:47 | dictionary |
---|
0:07:49 | so there are a hundred thousand vectors here that are hidden in that space |
---|
0:07:54 | so |
---|
0:07:56 | with time these kinds of models have evolved and usually evolved here just means |
---|
0:08:01 | simplified so it was |
---|
0:08:03 | a complex architecture and it became actually much simpler now it's |
---|
0:08:08 | in fact almost a linear model |
---|
0:08:10 | just the embedding of the word and then a linear transformation and |
---|
0:08:15 | you try to predict |
---|
0:08:16 | another word and that's it |
---|
0:08:19 | the way it's done is again you take a sentence you randomly pick |
---|
0:08:24 | a word in that sentence and then you randomly pick another word that you're gonna |
---|
0:08:27 | try to predict |
---|
0:08:28 | so it's not anymore |
---|
0:08:30 | the next word and it's not anymore |
---|
0:08:32 | a window of words that helps you predict the next one it's |
---|
0:08:36 | a random word trying to predict another random word around it |
---|
0:08:39 | and it's just the fact that it's around it that makes it |
---|
0:08:42 | interesting because words tend |
---|
0:08:44 | to |
---|
0:08:46 | they tend to |
---|
0:08:47 | co-occur often together so |
---|
0:08:50 | so that's very simple now the code is actually available and it's very efficient |
---|
0:08:55 | to train so you can train your own embedding space |
---|
0:08:58 | in a matter of an hour or two of training so it's very efficient |
---|
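As far as one can tell from the description, this simplified model is the skip-gram idea: pick a random word, predict a random word near it. A minimal sketch under that assumption (sizes and names are illustrative, not the released code):

```python
# Skip-gram style pair sampling and prediction: a word predicts a nearby word.
import random
import torch
import torch.nn as nn

vocab_size, dim = 50_000, 100
emb_in = nn.Embedding(vocab_size, dim)   # embedding of the picked word
emb_out = nn.Linear(dim, vocab_size)     # linear map to predict the nearby word

def training_pair(sentence_ids, window=5):
    i = random.randrange(len(sentence_ids))                  # random word
    lo, hi = max(0, i - window), min(len(sentence_ids), i + window + 1)
    j = random.choice([k for k in range(lo, hi) if k != i])  # random word around it
    return sentence_ids[i], sentence_ids[j]

w, c = training_pair([4, 17, 293, 8, 51])
logits = emb_out(emb_in(torch.tensor([w])))
loss = nn.functional.cross_entropy(logits, torch.tensor([c]))
```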
0:09:04 | so |
---|
0:09:05 | so here we have an example where |
---|
0:09:08 | we took all the terms we saw in wikipedia and here are examples again |
---|
0:09:12 | of embedding spaces so these are the words nearby tiger shark |
---|
0:09:16 | or car these are words sorted or clustered according to their semantics |
---|
0:09:21 | and |
---|
0:09:22 | you see i don't know here all the food things |
---|
0:09:26 | all the reptiles et cetera so it captures semantics |
---|
0:09:30 | but it's actually even more |
---|
0:09:33 | striking than that |
---|
0:09:34 | because |
---|
0:09:36 | you can play some games with these embedded words so for instance |
---|
0:09:40 | if you take after training it the embedding space |
---|
0:09:45 | using this skip-gram model |
---|
0:09:47 | you look at the embedding positions of rome |
---|
0:09:50 | italy berlin and germany you look at where they are in the space |
---|
0:09:54 | and you can apply operators on that so you take the embedding position of |
---|
0:09:59 | rome and you subtract italy |
---|
0:10:02 | you add germany and what do you get berlin |
---|
0:10:05 | that means you can actually generalize fine so the vector that went from rome to |
---|
0:10:11 | italy is the same as the one that goes from berlin to germany because they |
---|
0:10:15 | have the same relation to each other |
---|
0:10:17 | and that's for semantics and you also have syntactic relations |
---|
0:10:21 | like hardest to harder or biggest to bigger |
---|
0:10:24 | with the same kind of argument |
---|
0:10:26 | and that's |
---|
0:10:27 | surprisingly working |
---|
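A hedged sketch of that vector arithmetic, reusing the hypothetical `emb` matrix and `vocab` dict from the earlier sketch:

```python
# rome - italy + germany ~ berlin, via cosine similarity in the embedding space.
import numpy as np

def analogy(a, b, c, emb, vocab, k=1):
    q = emb[vocab[a]] - emb[vocab[b]] + emb[vocab[c]]
    sims = emb @ q / (np.linalg.norm(emb, axis=1) * np.linalg.norm(q) + 1e-9)
    ids = {i: w for w, i in vocab.items()}
    best = [i for i in np.argsort(-sims) if ids[i] not in (a, b, c)][:k]
    return [ids[i] for i in best]

# analogy("rome", "italy", "germany", emb, vocab) -> hopefully ["berlin"]
```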
0:10:29 | you can do similar tricks for translation you can see that you can train |
---|
0:10:36 | a |
---|
0:10:38 | separate |
---|
0:10:39 | embedding space for separate languages |
---|
0:10:41 | and you'll find that these relations actually work from |
---|
0:10:45 | language to language and you can use these kinds of things to help |
---|
0:10:48 | translation and there are |
---|
0:10:50 | tons of tricks that you can see in the literature nowadays using these embedding |
---|
0:10:54 | spaces so they are very |
---|
0:10:55 | interesting to manipulate |
---|
0:11:00 | before i forget feel free to ask any questions whenever you want of course |
---|
0:11:06 | so what else can we do with these embedding spaces so |
---|
0:11:12 | about four five years ago i was actually interested at google in |
---|
0:11:15 | images |
---|
0:11:17 | and i wanted to try to see if i could train a model |
---|
0:11:20 | to annotate images but being at google i'm not |
---|
0:11:24 | interested in trying to label images out of a hundred classes of course i'm more interested in |
---|
0:11:29 | the large scale setting so when you have six hundred thousand classes |
---|
0:11:33 | that's more interesting for me |
---|
0:11:35 | but at that time at least |
---|
0:11:37 | the research you'd see the |
---|
0:11:39 | computer vision literature was more interested or focusing on tasks where you have a |
---|
0:11:44 | hundred two hundred |
---|
0:11:45 | up to a thousand classes but that's |
---|
0:11:47 | that's about it |
---|
0:11:50 | so can we go further than that |
---|
0:11:54 | so of course image annotation is a hard task there's plenty of problems with |
---|
0:11:59 | that you can think of the fact that objects very often |
---|
0:12:05 | look alike and actually this problem |
---|
0:12:07 | is even worse when the number of classes grows |
---|
0:12:10 | so when you have only two classes it's very easy to discriminate but if you |
---|
0:12:13 | have a |
---|
0:12:14 | thousand classes or a hundred thousand classes |
---|
0:12:16 | you can be sure that |
---|
0:12:18 | two of these classes are very visually similar to each other |
---|
0:12:22 | so the problems are |
---|
0:12:24 | becoming harder as the number of classes grows |
---|
0:12:28 | there's plenty of other problems related to computer vision which i won't go into |
---|
0:12:32 | details but |
---|
0:12:33 | let me just |
---|
0:12:35 | summarize how computer vision was done |
---|
0:12:37 | about four five years ago |
---|
0:12:39 | things have evolved a lot since then but at that time |
---|
0:12:44 | you had two steps |
---|
0:12:45 | feature extraction and classification |
---|
0:12:48 | so first you would extract features and the way you extract features |
---|
0:12:53 | was |
---|
0:12:55 | say very similar to how you would extract features from voice |
---|
0:13:00 | you would find a good representation of |
---|
0:13:04 | patches in the image and then |
---|
0:13:05 | aggregate them in some way and that would be your representation |
---|
0:13:11 | once you had that you would then try to classify using the best classifier |
---|
0:13:15 | that was available at the time |
---|
0:13:17 | which was an svm so you'd train an svm for each of your classes |
---|
0:13:21 | and hope |
---|
0:13:21 | that it would scale well |
---|
0:13:23 | eh it didn't |
---|
0:13:25 | so one of the |
---|
0:13:27 | problems you had was |
---|
0:13:29 | and |
---|
0:13:30 | that |
---|
0:13:31 | very similar images would give rise to |
---|
0:13:33 | very different labels |
---|
0:13:35 | that would be completely unrelated semantically so for instance these are three |
---|
0:13:40 | whoops |
---|
0:13:41 | these are |
---|
0:13:42 | three images |
---|
0:13:44 | from a video that |
---|
0:13:46 | are like a few seconds apart |
---|
0:13:49 | it's a shark or something like a shark and these are the labels that |
---|
0:13:53 | your classical svm classifier would give and you see that this is very semantically different |
---|
0:13:58 | from that |
---|
0:13:59 | here it's airliner for those who don't see and here it's tiger shark |
---|
0:14:03 | but the images are quite similar to our eye so something's |
---|
0:14:07 | not working somewhere |
---|
0:14:10 | and that's the kind of problem we'd like to be able to solve that would |
---|
0:14:13 | be if i show you two similar images i'd like to have two similar labels |
---|
0:14:16 | at least |
---|
0:14:18 | at least semantically |
---|
0:14:20 | so why isn't that working |
---|
0:14:23 | so one argument about that is that when we try to classify images |
---|
0:14:28 | we impose no relation between our labels all the labels you can |
---|
0:14:34 | think of them as being at the corners |
---|
0:14:37 | of the hypercube |
---|
0:14:38 | whose dimension is the number of classes |
---|
0:14:40 | where the edges are the number of classes |
---|
0:14:43 | and so there's no more relation between |
---|
0:14:45 | a shark and an airliner than there is a relation between |
---|
0:14:50 | a shark and a tiger shark even though there is a semantic relation between them we don't |
---|
0:14:55 | capture that with the way we |
---|
0:14:57 | train our classifiers and that's probably bad |
---|
0:15:01 | so what if instead of having the labels at these |
---|
0:15:04 | corners here |
---|
0:15:05 | what about putting them inside |
---|
0:15:08 | i guess |
---|
0:15:10 | so now the labels are inside this hypercube like in these embedding spaces |
---|
0:15:14 | i was talking about earlier |
---|
0:15:17 | and it's fine because first of all |
---|
0:15:20 | the size of this hypercube does not anymore depend |
---|
0:15:24 | on the number of labels classes |
---|
0:15:26 | you can have way more classes than the actual size of your |
---|
0:15:31 | space |
---|
0:15:32 | because now it's a real space |
---|
0:15:34 | and you can put your labels in such a way that nearby labels |
---|
0:15:37 | have nearby meanings |
---|
0:15:39 | and |
---|
0:15:40 | what happens is that if you make a mistake by picking the wrong label |
---|
0:15:44 | hopefully you're gonna pick a label that was nearby hence has a semantic meaning not |
---|
0:15:49 | too far so |
---|
0:15:50 | hopefully that would work |
---|
0:15:52 | and there's an even more interesting thing that could happen you could |
---|
0:15:55 | put more labels in the space than the ones for which you have images |
---|
0:15:58 | and maybe you'd be able to label an image of a topic you've never |
---|
0:16:02 | seen any image of before just because it is semantically related |
---|
0:16:08 | so we tried to see if that |
---|
0:16:10 | is just a dream |
---|
0:16:12 | so about four years ago we started working on this project and |
---|
0:16:17 | of course we were interested in these embedding spaces |
---|
0:16:22 | and |
---|
0:16:23 | what we tried was basically to merge the idea of having an image classifier and an embedding |
---|
0:16:27 | space |
---|
0:16:28 | and |
---|
0:16:30 | so |
---|
0:16:31 | we had this project called wsabie |
---|
0:16:34 | where what we want to do is |
---|
0:16:36 | try to learn jointly how to take an image |
---|
0:16:39 | and project its input representation into |
---|
0:16:43 | an embedding space so you would have a |
---|
0:16:46 | projection from the representation of that image into a |
---|
0:16:49 | point |
---|
0:16:49 | in that space and in the same space you'd have points that represent |
---|
0:16:54 | labels |
---|
0:16:55 | or classes |
---|
0:16:56 | like dolphin and obama and eiffel tower |
---|
0:16:58 | and the goal was to jointly try to find the positions of the labels |
---|
0:17:03 | and try to find the mapping from the image |
---|
0:17:05 | to the label space and if you can do that jointly you hopefully solve |
---|
0:17:10 | the |
---|
0:17:11 | classification task jointly |
---|
0:17:13 | and have a good embedding space for words |
---|
0:17:18 | and |
---|
0:17:19 | being at google |
---|
0:17:20 | i should tell you that everything i see looks like a ranking problem |
---|
0:17:26 | and so i obviously saw that this is a ranking problem |
---|
0:17:30 | where in that case the goal is if you show me an image i'm gonna |
---|
0:17:34 | try to rank |
---|
0:17:35 | the labels such that the nearest label is the one that corresponds to the image |
---|
0:17:40 | sounds good that's a correct ranking problem and you also want to make sure |
---|
0:17:44 | that |
---|
0:17:45 | for similar labels if you are to make a mistake let's make sure that the mistake |
---|
0:17:48 | is semantically reasonable so that |
---|
0:17:51 | if you were to click on the word |
---|
0:17:52 | it would be a reasonable word even if it's not the perfect word |
---|
0:17:57 | so we are going to train our model with a ranking loss |
---|
0:18:02 | in mind |
---|
0:18:04 | okay |
---|
0:18:06 | it was actually a very simple model |
---|
0:18:08 | again just a linear mapping so |
---|
0:18:11 | what we had was |
---|
0:18:14 | so this is prior to the deep learning |
---|
0:18:18 | era in some sense |
---|
0:18:19 | at least in computer vision |
---|
0:18:23 | so we worked with sift features |
---|
0:18:25 | the |
---|
0:18:26 | the mfcc of the |
---|
0:18:29 | image world if you like |
---|
0:18:31 | and what we were looking for was just a linear mapping between these features |
---|
0:18:35 | of an image and the embedding space so you take your features |
---|
0:18:39 | that's x the representation of an image |
---|
0:18:42 | and you just multiply it by a matrix v |
---|
0:18:44 | such that the result is another vector that is hopefully in the |
---|
0:18:48 | embedding space |
---|
0:18:49 | and again like for the embedding space you have a vector |
---|
0:18:53 | representation of each of your words |
---|
0:18:56 | which in that case are the labels |
---|
0:18:58 | of the image classification task |
---|
0:19:00 | and you want to find |
---|
0:19:03 | w the representation of your labels and v the mapping between image features and |
---|
0:19:08 | the embedding space |
---|
0:19:09 | such that it optimizes a task what is the task |
---|
0:19:12 | we are going to define a similarity between two points in the space and in |
---|
0:19:16 | this case the similarity function between |
---|
0:19:18 | an image |
---|
0:19:19 | x and a label i |
---|
0:19:21 | is just |
---|
0:19:22 | a dot product |
---|
0:19:24 | in the embedding space so you take the image you project it into the embedding |
---|
0:19:27 | space |
---|
0:19:28 | you multiply it by the label |
---|
0:19:30 | with which you're considering to label that image |
---|
0:19:34 | and you want that score to be high |
---|
0:19:36 | for the correct label and to |
---|
0:19:38 | be low for incorrect labels |
---|
0:19:41 | and so you're going to put some constraints because |
---|
0:19:45 | we're doing machine learning we need to put some regularization |
---|
0:19:49 | such that |
---|
0:19:53 | the values of the embedding space are basically constrained so you control the norm |
---|
0:19:58 | of |
---|
0:19:59 | both the norm of the |
---|
0:20:02 | mapping and the norm of the embedding of the |
---|
0:20:05 | labels itself |
---|
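In symbols, the similarity is f(x, i) = w_i · (V x). A minimal numpy sketch under assumed sizes (the feature dimension, embedding dimension, and label count here are illustrative):

```python
# Bilinear scoring: project the image features, dot-product with a label embedding.
import numpy as np

d_img, d_emb, n_labels = 4096, 100, 100_000
V = 0.01 * np.random.randn(d_emb, d_img)       # image -> embedding mapping
W = 0.01 * np.random.randn(n_labels, d_emb)    # one embedding row per label

def score(x, i):
    return W[i] @ (V @ x)   # f(x, i) = w_i . (V x), high for the correct label
```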
0:20:08 | okay so as i said |
---|
0:20:12 | we are going to try to solve this problem using a ranking loss so |
---|
0:20:15 | what do i mean by that |
---|
0:20:18 | well |
---|
0:20:19 | we are going to |
---|
0:20:21 | construct a loss |
---|
0:20:23 | to minimize such that |
---|
0:20:26 | for every image in our training set that's this part here |
---|
0:20:30 | and for every correct label of that image an image could have more than one |
---|
0:20:34 | label |
---|
0:20:35 | that's often the case |
---|
0:20:37 | and for every incorrect label of that image |
---|
0:20:41 | and that's a lot |
---|
0:20:42 | so that sum is big |
---|
0:20:44 | you want to make sure that |
---|
0:20:46 | this score |
---|
0:20:48 | so the score of the correct label should be higher than the score of any |
---|
0:20:51 | incorrect label plus a margin |
---|
0:20:54 | what this basically says is it's a hinge loss |
---|
0:20:58 | so |
---|
0:20:59 | not only do you want the score of the correct label to be higher than any other |
---|
0:21:02 | one you want to make sure that it's higher by a margin so that it generalizes better |
---|
0:21:06 | here the margin is one but it's a constant you can put what you want |
---|
0:21:11 | and if that's not the case then you pay a price |
---|
0:21:14 | and that's the price to pay and you want to minimize that price |
---|
0:21:18 | and you can optimize it very efficiently by stochastic gradient descent by simply |
---|
0:21:22 | sampling an image from your training set sampling |
---|
0:21:27 | a positive label |
---|
0:21:28 | from the set of correct labels of that image |
---|
0:21:31 | and then simply any other label which |
---|
0:21:33 | most likely will be an incorrect label |
---|
0:21:37 | you have your triplet then you compute your loss |
---|
0:21:39 | and if |
---|
0:21:41 | if the loss is positive you change the parameters of |
---|
0:21:46 | your model v and w here |
---|
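A hedged sketch of one such stochastic step, reusing V, W, and the dot-product score from the previous sketch; this illustrates the triplet hinge update, not the paper's exact code:

```python
# One SGD step on the triplet hinge loss: image, positive label, sampled label.
import numpy as np

def sgd_step(x, pos_labels, V, W, lr=0.01, margin=1.0):
    pos = np.random.choice(pos_labels)        # a correct label of the image
    neg = np.random.randint(W.shape[0])       # any other label, most likely incorrect
    if neg in pos_labels:
        return                                # unlucky draw, skip this step
    e = V @ x                                 # image projected into the embedding space
    if margin - W[pos] @ e + W[neg] @ e > 0:  # hinge violated: pos not ahead by the margin
        diff = W[pos] - W[neg]
        W[pos] += lr * e                      # pull the correct label towards the image
        W[neg] -= lr * e                      # push the sampled label away
        V += lr * np.outer(diff, x)           # move the image towards pos, away from neg
    # (the norms of V and of the label embeddings are also constrained in practice)
```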
0:21:49 | so that's good and it works actually |
---|
0:21:52 | but you can actually do better |
---|
0:21:54 | i'm not going to go into details about how you can do better but |
---|
0:21:56 | think of the |
---|
0:21:57 | following problem |
---|
0:22:00 | what you want is at least |
---|
0:22:02 | again |
---|
0:22:03 | when you want to rank a hundred thousand or more objects |
---|
0:22:08 | what you want |
---|
0:22:09 | is that in the top ranking |
---|
0:22:12 | positions of the objects you're going to return there's something interesting |
---|
0:22:15 | so |
---|
0:22:17 | if i show you two functions that are ranking labels |
---|
0:22:21 | one of them returns a |
---|
0:22:23 | correct label in position |
---|
0:22:25 | one and another correct label in position |
---|
0:22:28 | one thousand |
---|
0:22:29 | you should find it more interesting than another function that returns |
---|
0:22:34 | the two correct labels in positions five hundred and five hundred one |
---|
0:22:37 | even though in terms of ranking they have the same value |
---|
0:22:40 | in terms of the user using it you try to have at least one label |
---|
0:22:44 | returned in the top positions |
---|
0:22:45 | so you want to favour the top of the ranking you want to put a |
---|
0:22:48 | lot of interest there |
---|
0:22:49 | and there are ways to modify |
---|
0:22:52 | these kinds of losses to favour the top of the ranking |
---|
0:22:56 | i won't go into the details they are in the |
---|
0:22:59 | paper but |
---|
0:23:00 | it actually makes a huge difference in terms of the perception of the user |
---|
0:23:04 | because |
---|
0:23:04 | at least at the top |
---|
0:23:06 | of the ranking you see things that make sense |
---|
0:23:11 | so let's look at |
---|
0:23:14 | these experiments the first experiments we've done |
---|
0:23:18 | we had |
---|
0:23:20 | so at that time there was |
---|
0:23:22 | a database of images in the computer vision literature that started appearing called |
---|
0:23:27 | imagenet |
---|
0:23:28 | it's still there it's growing |
---|
0:23:31 | at that time there were sixteen thousand labels in the imagenet |
---|
0:23:35 | corpus now there are more than twenty thousand |
---|
0:23:38 | but |
---|
0:23:39 | nobody was actually using the corpus as it is |
---|
0:23:44 | people had selected about a thousand labels and they were only playing with one |
---|
0:23:49 | thousand and that's still the case unfortunately |
---|
0:23:51 | almost nobody plays with all the corpus that is actually available and that contains |
---|
0:23:56 | millions of images |
---|
0:23:57 | at that time about five million images |
---|
0:24:00 | i think now it's more like ten million images |
---|
0:24:02 | so that's good |
---|
0:24:03 | but nobody's using it |
---|
0:24:06 | so we considered that a small dataset |
---|
0:24:09 | and we looked at a bigger one |
---|
0:24:11 | which came from the web so we looked at many images from the web |
---|
0:24:16 | and for the web data we don't really have any labels the way we |
---|
0:24:20 | get our labels is by |
---|
0:24:22 | looking at what people do on image search on google image search you type a |
---|
0:24:26 | query that people query often you see images |
---|
0:24:29 | you click on an image |
---|
0:24:30 | now if |
---|
0:24:31 | many of you click on the same image for the same query we are going |
---|
0:24:35 | to consider that this is a good |
---|
0:24:37 | that the query is a good label for |
---|
0:24:40 | that image |
---|
0:24:41 | so it's very noisy |
---|
0:24:43 | a lot of things happen here |
---|
0:24:44 | but it's usually reasonable |
---|
0:24:46 | and you can collect as many as you want |
---|
0:24:49 | so this is a very small set of what was actually available but still |
---|
0:24:53 | there were more than a hundred thousand labels in our set so that was interesting |
---|
0:24:58 | so we actually published a |
---|
0:25:00 | paper showing these results and i want to emphasize the fact that |
---|
0:25:04 | we had |
---|
0:25:05 | a one percent |
---|
0:25:06 | accuracy |
---|
0:25:08 | on that data so ninety nine percent wrong |
---|
0:25:11 | and it was published |
---|
0:25:16 | so i think that's good |
---|
0:25:17 | we also hoped |
---|
0:25:21 | so this is anecdotal |
---|
0:25:23 | so let me summarize the thing by saying that this algorithm was better than the |
---|
0:25:27 | many things we tried this is just a summary |
---|
0:25:30 | so these numbers are higher than the other ones |
---|
0:25:33 | we show two types of metrics precision at one which is accuracy |
---|
0:25:37 | and precision at ten |
---|
0:25:38 | which is how many good labels you return at the top so that's more like |
---|
0:25:42 | a ranking metric |
---|
0:25:44 | and that looks more like what you see on google you look at the |
---|
0:25:47 | page and you're happy if you see the document you want at the top |
---|
0:25:53 | of course if you count more than one guess then the numbers grow and |
---|
0:25:57 | everything gets better but |
---|
0:25:58 | still the numbers are small and so the question is |
---|
0:26:01 | is it any useful anyway because it's so small |
---|
0:26:04 | and it turns out it is so first of all let's look at the embedding |
---|
0:26:07 | space again it's always fun to look at |
---|
0:26:09 | what happened after we've trained the model |
---|
0:26:11 | so remember the model was trained with |
---|
0:26:14 | just pairs of image and label |
---|
0:26:17 | no relation between words no relation between images just pairs of image and label and |
---|
0:26:22 | we're gonna just look at |
---|
0:26:23 | where the labels are in the space no images yet |
---|
0:26:27 | so i look at the label barack obama and i look at the nearby |
---|
0:26:31 | labels out of the hundred thousand labels |
---|
0:26:34 | and these are the labels we see |
---|
0:26:35 | and so the nearest one is basically a spelling mistake because well people type anything |
---|
0:26:40 | on the web |
---|
0:26:42 | and the other ones are also very similar and then you see this one |
---|
0:26:45 | which |
---|
0:26:46 | i don't know what it is |
---|
0:26:49 | and then if you take beckham you see again similar things then |
---|
0:26:53 | interestingly you see semantic relations you see that there are soccer players |
---|
0:26:58 | happening not far |
---|
0:27:00 | maybe they look alike i don't know |
---|
0:27:03 | you also see things like translations so dolphin is near dauphin and delfin |
---|
0:27:08 | or similar objects like whale |
---|
0:27:11 | and |
---|
0:27:12 | you see like for eiffel tower you see |
---|
0:27:16 | either |
---|
0:27:17 | things not far or similar visually |
---|
0:27:21 | to the eiffel tower and all this has been learned in some sense i've never |
---|
0:27:24 | told the model that |
---|
0:27:26 | dolphin is like dauphin |
---|
0:27:29 | they are near just because they share similar images basically |
---|
0:27:32 | in the embedding space |
---|
0:27:34 | so that's nice but what about the actual task |
---|
0:27:37 | so here is |
---|
0:27:39 | a sample of four images |
---|
0:27:42 | from the test set |
---|
0:27:44 | all of them |
---|
0:27:45 | if i had to compute the score precision at one it would be zero basically |
---|
0:27:49 | i failed on all these images as expected i mean i fail ninety nine percent |
---|
0:27:54 | of the time so this is four of these ninety nine |
---|
0:27:58 | but the failures are gracious in some sense so these are supposed to |
---|
0:28:02 | be |
---|
0:28:03 | dolphin |
---|
0:28:04 | and these are the answers and you see the words that appear |
---|
0:28:09 | afterwards |
---|
0:28:09 | so dolphin here is in position thirty that's good |
---|
0:28:12 | here it's in position i don't know like eight |
---|
0:28:16 | but the other words around make sense maybe they're the wrong |
---|
0:28:20 | answer but at the end what we give |
---|
0:28:22 | would satisfy many humans and that's |
---|
0:28:25 | that's good just because they have actually very similar semantic meanings |
---|
0:28:30 | so we have the barack obama thing here |
---|
0:28:33 | we have a |
---|
0:28:35 | i was interested in the las vegas strip here because maybe you don't know |
---|
0:28:38 | but there's a copy of the eiffel tower in las vegas |
---|
0:28:40 | and so it actually made sense |
---|
0:28:43 | i was surprised |
---|
0:28:44 | so |
---|
0:28:47 | so that's interesting the way we make mistakes is now more interesting we still |
---|
0:28:51 | make a lot of mistakes but at least |
---|
0:28:53 | the answers make sense and that's better |
---|
0:28:56 | but so that was as of four years ago and |
---|
0:29:00 | what happened after that was |
---|
0:29:03 | the deep learning |
---|
0:29:05 | era started |
---|
0:29:06 | and everything changed in the image field |
---|
0:29:10 | like it did in the speech field i would say |
---|
0:29:13 | so |
---|
0:29:14 | now that's how we do image recognition |
---|
0:29:17 | so |
---|
0:29:18 | the way we do it is by taking an image and applying this |
---|
0:29:23 | deep network |
---|
0:29:24 | up until you finally take a decision using a |
---|
0:29:27 | softmax layer at the end of your deep architecture |
---|
0:29:30 | and the thing that works best these days is convolutional nets and |
---|
0:29:35 | for those of you who don't know what these are it's basically layers that look |
---|
0:29:39 | only at a small part of the image so there's a unit here that |
---|
0:29:44 | looks only at this part of the image and tries to |
---|
0:29:46 | get a value for this part |
---|
0:29:48 | but the function that gets this value is the same as the one that looks |
---|
0:29:53 | at this part of the image and this and this and this |
---|
0:29:55 | so we are actually convolving a function along the whole image |
---|
0:30:00 | and returning this convolution at the output of this layer |
---|
0:30:04 | and then we pool the answers locally so we look at the answers of that |
---|
0:30:08 | set of convolutions in a local patch |
---|
0:30:10 | and return something like the max or the mean |
---|
0:30:14 | what works usually is the max but you can try any pooling |
---|
0:30:19 | thing and you do that again layer after layer |
---|
0:30:22 | and at the top |
---|
0:30:23 | you do full connections and at the end |
---|
0:30:27 | you get an answer so it |
---|
0:30:28 | is a much more involved architecture it's very |
---|
0:30:32 | slow to train |
---|
0:30:34 | you need gpus and all that but |
---|
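For the curious, here is a minimal sketch of that kind of architecture in PyTorch (a toy 32x32 input and illustrative layer sizes, not the actual competition network):

```python
# Stacks of convolution + max pooling, then full connections and a softmax.
import torch
import torch.nn as nn

convnet = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5),  # same function convolved across the image
    nn.ReLU(),
    nn.MaxPool2d(2),                  # pool answers locally, keeping the max
    nn.Conv2d(16, 32, kernel_size=5),
    nn.ReLU(),
    nn.MaxPool2d(2),                  # layer after layer of conv + pool
    nn.Flatten(),
    nn.Linear(32 * 5 * 5, 128),       # full connections at the top
    nn.ReLU(),
    nn.Linear(128, 1000),             # one logit per class, softmax at the end
)

logits = convnet(torch.randn(1, 3, 32, 32))   # tiny 32x32 image for the sketch
probs = logits.softmax(dim=1)
```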
0:30:38 | i must say first of all they were developed about twenty five years ago so it's |
---|
0:30:42 | nothing new |
---|
0:30:43 | but only now do we have the data that shows how good they are because |
---|
0:30:47 | before there was not enough data and there was not enough |
---|
0:30:50 | machine power like gpus |
---|
0:30:52 | to actually train such a complex architecture so now it works |
---|
0:30:55 | and it actually works very well so the first time it was used on |
---|
0:31:00 | this competition called imagenet which is a competition to classify |
---|
0:31:05 | with a thousand |
---|
0:31:06 | labels |
---|
0:31:07 | it basically blew away the competition so all of the other competitors were using |
---|
0:31:13 | classical computer vision techniques and they were actually the best in their |
---|
0:31:16 | field |
---|
0:31:17 | and they were like ten percent away from the |
---|
0:31:19 | deep learning approach so |
---|
0:31:22 | it changed everything and now |
---|
0:31:24 | at least in the cvpr literature |
---|
0:31:27 | almost nobody is not using convolutional |
---|
0:31:30 | deep |
---|
0:31:31 | architectures |
---|
0:31:34 | maybe just a slide to say that we do use such things at |
---|
0:31:37 | google for real products so it's not just research |
---|
0:31:40 | for instance if you |
---|
0:31:42 | type queries like |
---|
0:31:44 | my photos of something |
---|
0:31:46 | we're gonna try to look in your own photos unlabeled |
---|
0:31:49 | and try to return |
---|
0:31:52 | your photos of |
---|
0:31:53 | sunsets here |
---|
0:31:55 | and it's done using the type of architecture |
---|
0:31:58 | that won this competition |
---|
0:31:59 | i must say that we actually |
---|
0:32:01 | well |
---|
0:32:03 | bought |
---|
0:32:03 | the authors of that paper |
---|
0:32:07 | that |
---|
0:32:08 | is geoff hinton |
---|
0:32:11 | alex krizhevsky and ilya sutskever |
---|
0:32:13 | they are now working at google so they helped us a bit |
---|
0:32:19 | it works |
---|
0:32:23 | they were very good hires |
---|
0:32:25 | okay let's continue now |
---|
0:32:29 | so let's go back to our embedding spaces and the fact that you can put |
---|
0:32:33 | a lot of things in an embedding space |
---|
0:32:37 | so on one side we have these embedding spaces that are very powerful because |
---|
0:32:40 | they capture the semantics of the labels of the dataset |
---|
0:32:44 | and we have these powerful deep learning architectures that |
---|
0:32:47 | are the best in class now |
---|
0:32:49 | so can we marry these two things in a way that would |
---|
0:32:53 | be useful |
---|
0:32:54 | and |
---|
0:32:55 | in fact what we found was that you can use these two and try to |
---|
0:32:58 | be able to label an image with a label that is not one of the ones |
---|
0:33:03 | that appear here anywhere and that's |
---|
0:33:06 | interesting because now |
---|
0:33:08 | even though this was trained on a thousand labels we can try to reason |
---|
0:33:11 | about say a hundred thousand labels even though we haven't seen ninety nine percent of the |
---|
0:33:17 | labels |
---|
0:33:21 | surprisingly it's actually very simple to do |
---|
0:33:24 | we started by doing something more complex but eventually we converged again |
---|
0:33:27 | on the simplest |
---|
0:33:29 | solution and here is how you do it |
---|
0:33:31 | so first |
---|
0:33:32 | obviously |
---|
0:33:33 | you train these two things |
---|
0:33:35 | separately you train your best deep learning architecture your image classifier |
---|
0:33:39 | and you train your best embedding model on labels |
---|
0:33:42 | the only thing that you require is that |
---|
0:33:44 | the labels for which you trained your deep |
---|
0:33:49 | architecture |
---|
0:33:50 | should be embedded in the space so if one of the labels is car |
---|
0:33:53 | make sure that car is here but that shouldn't be a problem because here you can |
---|
0:33:57 | put anything as long as you see text |
---|
0:33:59 | related to these labels |
---|
0:34:01 | so that was an easy |
---|
0:34:03 | requirement |
---|
0:34:05 | once you have that here is what you do |
---|
0:34:08 | you take an image |
---|
0:34:10 | and you compute |
---|
0:34:11 | so the |
---|
0:34:13 | the score of the deep learning |
---|
0:34:17 | model so the score of the deep learning model is actually the posterior probability of |
---|
0:34:21 | a label |
---|
0:34:22 | given |
---|
0:34:23 | the image and you have this vector of p of the label given the |
---|
0:34:27 | image |
---|
0:34:28 | you are going to compute all these scores you have a thousand of them |
---|
0:34:32 | but you are going to only take the top ones |
---|
0:34:36 | top ones could be the top thousand if you want but |
---|
0:34:38 | it's gonna be faster if you take the top ten |
---|
0:34:43 | and you are going |
---|
0:34:45 | to |
---|
0:34:47 | look at the labels corresponding to these top ones so suppose the top k labels |
---|
0:34:53 | contained |
---|
0:34:54 | these |
---|
0:34:55 | suppose these words are the top ten labels obtained so bear lion tiger |
---|
0:35:00 | et cetera |
---|
0:35:01 | you are going to look in the embedding space at the top ten |
---|
0:35:04 | labels you obtained here |
---|
0:35:05 | where they are and you are going to make an average of them in |
---|
0:35:09 | the embedding space |
---|
0:35:10 | but it's gonna be a weighted average of course and the weight will be how |
---|
0:35:13 | much you think it is the actual label |
---|
0:35:17 | so if you really think it's a lion |
---|
0:35:19 | the result of the weighted combination would be very near lion |
---|
0:35:24 | if you really think it's a bear it's gonna be near bear |
---|
0:35:27 | if you think it's between the bear and the lion so you obtain |
---|
0:35:31 | fifty percent bear fifty percent lion you're gonna be in a position between bear and lion |
---|
0:35:35 | like in the middle |
---|
0:35:36 | and that's what this |
---|
0:35:38 | thing says so you average the top labels you found in the embedding space and |
---|
0:35:43 | you find the position and that's where you should be now you look around here |
---|
0:35:47 | in the embedding space and look at the nearest labels |
---|
0:35:50 | they might be labels from the top thousand or they might be any other |
---|
0:35:53 | labels |
---|
0:35:53 | and that's your answer |
---|
0:35:55 | and because it can be any other label it can be |
---|
0:35:57 | labels of subjects you've never seen |
---|
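A hedged sketch of this convex-combination procedure (function and variable names are assumptions; `p` is the classifier's softmax over its thousand trained labels, `emb` the embedding matrix over the full label vocabulary):

```python
# Average the embeddings of the classifier's top labels, weighted by their
# probabilities, then return the nearest labels in the full embedding space.
import numpy as np

def convex_combination_predict(p, class_ids, emb, k=10, n_out=5):
    # p: (1000,) softmax output of the image classifier
    # class_ids: np.array mapping each trained class to its row in `emb`
    # emb: (100_000, dim) embeddings for the full label vocabulary
    top = np.argsort(-p)[:k]                  # top-k predicted labels
    w = p[top] / p[top].sum()                 # convex weights
    point = w @ emb[class_ids[top]]           # weighted average in the space
    d = np.linalg.norm(emb - point, axis=1)   # distance to every label,
    return np.argsort(d)[:n_out]              # possibly labels never seen in training
```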
0:36:01 | does it work |
---|
0:36:03 | it does actually surprisingly |
---|
0:36:05 | not perfectly by far again you see like |
---|
0:36:09 | a few percent precision but it does work |
---|
0:36:13 | good enough that it's better than what we've seen elsewhere |
---|
0:36:16 | so this is the model that is doing this |
---|
0:36:20 | convex |
---|
0:36:23 | combination of semantic embeddings and when it was using the top ten |
---|
0:36:27 | labels that's what's called conse |
---|
0:36:29 | there is also something that we published called devise which instead of |
---|
0:36:33 | doing this simple convex formulation tries to learn the mapping between the two |
---|
0:36:38 | and the mapping was surprisingly not as good as just the simple combination |
---|
0:36:43 | and this would be the output of the model itself so this cannot actually find |
---|
0:36:48 | the correct solution because |
---|
0:36:49 | we know that the correct solution for that image the correct label |
---|
0:36:53 | is not in the top ten or the top thousand it's not a label the |
---|
0:36:57 | model knows about so it will make a mistake |
---|
0:37:00 | while these ones have access to the full embedding space and can actually say something |
---|
0:37:04 | about |
---|
0:37:05 | things they've never seen and that's |
---|
0:37:07 | okay that works |
---|
0:37:08 | in this case |
---|
0:37:11 | okay |
---|
0:37:12 | so |
---|
0:37:13 | that was nice for images but recently i thought okay |
---|
0:37:17 | what about speech |
---|
0:37:20 | so about ten years ago i was working in speech so i had some |
---|
0:37:24 | knowledge about how speech models work |
---|
0:37:26 | but in the meanwhile of course everything changed the deep learning wave also hit the |
---|
0:37:30 | speech community |
---|
0:37:31 | and now nobody's using gmms and stuff like that anymore we use deep |
---|
0:37:37 | networks so how is speech |
---|
0:37:40 | recognition done nowadays |
---|
0:37:43 | so this is speech in |
---|
0:37:45 | one slide |
---|
0:37:47 | you take your |
---|
0:37:49 | speech signal you transform it using some features |
---|
0:37:53 | and |
---|
0:37:55 | for the training set that you have you take the sequence of words |
---|
0:37:59 | and you |
---|
0:38:00 | you |
---|
0:38:01 | cut it into sub |
---|
0:38:02 | word units which are usually phonemes or |
---|
0:38:05 | biphones triphones or whatever you want |
---|
0:38:07 | and these phones are then cut into sub phone units which are called states because |
---|
0:38:12 | they are states of hmms even though we're not using hmms anymore |
---|
0:38:16 | and |
---|
0:38:17 | then we try to align the audio with the states |
---|
0:38:22 | so we take a previous model and we try to say okay with our previous |
---|
0:38:26 | model |
---|
0:38:27 | this part of the audio should correspond to state number two hundred forty five |
---|
0:38:32 | and we do that for all our training set and that becomes our training data |
---|
0:38:38 | to train a deep architecture whose output size is the number of states you have and |
---|
0:38:43 | you try to predict |
---|
0:38:44 | which state this audio should correspond to out of the |
---|
0:38:48 | in our case |
---|
0:38:50 | fourteen thousand states |
---|
0:38:51 | so the actual speech |
---|
0:38:53 | acoustic model is a classifier of fourteen thousand classes |
---|
0:38:58 | this is how it works and i think |
---|
0:39:01 | we do that because that's |
---|
0:39:02 | how we've been doing speech for a while but it seems unreasonable to me |
---|
0:39:06 | that we're trying to classify audio into states |
---|
0:39:09 | which even we as humans would have a hard time |
---|
0:39:13 | with as a task to do because these states have no particular meaning |
---|
0:39:17 | the phonemes themselves have been designed by linguists and maybe that's not what the |
---|
0:39:21 | data would say |
---|
0:39:22 | we should maybe look at the data instead of asking a linguist |
---|
0:39:26 | i don't know how many linguists we have here |
---|
0:39:28 | hopefully not too many |
---|
0:39:33 | i |
---|
0:39:34 | and |
---|
0:39:36 | so let's see if we can |
---|
0:39:38 | get rid of these states and phonemes and all that |
---|
0:39:41 | of course it's gonna be hard and we will not succeed |
---|
0:39:44 | very well but at least i think it's worth trying |
---|
0:39:48 | and see where we go |
---|
0:39:50 | so |
---|
0:39:52 | so what can we do |
---|
0:39:55 | so the first thing i did was a very naive approach i took the data |
---|
0:39:59 | and instead of cutting the data and segmenting the data at the state level i said |
---|
0:40:05 | okay i forget about states i forget about phonemes what else do we have words |
---|
0:40:09 | so let's segment the |
---|
0:40:11 | training set we have in terms of words and |
---|
0:40:14 | that's an easier task because it's usually easier to segment your data in terms |
---|
0:40:18 | of words well humans would agree |
---|
0:40:20 | roughly where a word starts and where a word ends |
---|
0:40:24 | so let's try to learn a model that tries to just classify words |
---|
0:40:29 | and that's what i did so i had my audio data and used a deep architecture |
---|
0:40:33 | and tried to predict at the end |
---|
0:40:35 | the word directly |
---|
0:40:36 | so that assumes that it has already been segmented |
---|
0:40:39 | the same way that the state based model was assuming that it was already segmented |
---|
0:40:43 | but instead of seeing |
---|
0:40:44 | only one state plus context i'm gonna see the whole word |
---|
0:40:48 | now it turns out words are not that long |
---|
0:40:51 | with a window of about two seconds i capture like ninety nine percent |
---|
0:40:55 | of the training words i had |
---|
0:40:57 | so you need about two hundred frames to express and capture most of the words |
---|
0:41:02 | or at least of the training set i had access to which |
---|
0:41:05 | is query data from |
---|
0:41:07 | google |
---|
0:41:10 | so i trained your typical deep convolution model the same kind of model that was |
---|
0:41:15 | used for images but here now used for speech |
---|
0:41:19 | i used a |
---|
0:41:21 | dictionary that is actually small in the sense that in the training set |
---|
0:41:27 | not all possible words appear |
---|
0:41:29 | so i used only about fifty thousand words |
---|
0:41:32 | which |
---|
0:41:33 | looks big but it's actually small compared to the actual number of words |
---|
0:41:37 | that people will use in our test set for which we need something that can |
---|
0:41:41 | work at least |
---|
0:41:43 | so we have a problem later but let's forget about that problem so far |
---|
0:41:47 | and try to classify our training set into one of the forty eight thousand words |
---|
0:41:54 | so we train the model and that's nice and you get some accuracy seventy |
---|
0:41:57 | three percent |
---|
0:41:58 | is it good is it bad i don't know |
---|
0:42:01 | it's reasonable |
---|
0:42:03 | we'll see where we go with this |
---|
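A minimal sketch of such a whole-word classifier, assuming roughly 200 frames of 40-dimensional features per two-second window and a 48k-word output (all sizes illustrative, not the actual model):

```python
# Convolutional net over a fixed two-second window, predicting the word directly.
import torch
import torch.nn as nn

n_words, n_frames, n_feats = 48_000, 200, 40
word_clf = nn.Sequential(
    nn.Conv1d(n_feats, 64, kernel_size=9),   # convolve over time
    nn.ReLU(),
    nn.MaxPool1d(4),
    nn.Conv1d(64, 128, kernel_size=9),
    nn.ReLU(),
    nn.AdaptiveMaxPool1d(1),                 # pool over the whole window
    nn.Flatten(),
    nn.Linear(128, n_words),                 # one output per word in the dictionary
)

logits = word_clf(torch.randn(1, n_feats, n_frames))  # one segmented word window
```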
0:42:07 | the first thing to say is that if you have this you are not done at |
---|
0:42:10 | all with the speech recognition task because |
---|
0:42:14 | i've assumed that someone gave me an aligned dataset so my |
---|
0:42:18 | training data was aligned at the word level |
---|
0:42:21 | but now if i want to do speech recognition i'm not going to be given |
---|
0:42:24 | the alignment i have to align it myself |
---|
0:42:27 | so |
---|
0:42:27 | since i wanted to have results quickly i said okay i'm gonna forget about |
---|
0:42:31 | the alignment |
---|
0:42:32 | i will use |
---|
0:42:33 | the recognizer we have to provide targets so i take a model |
---|
0:42:37 | and i just run the speech recognizer we have |
---|
0:42:41 | which happens to be quite good |
---|
0:42:42 | and i look at the lattice which is |
---|
0:42:46 | a compact representation of the top k |
---|
0:42:49 | sequences of words |
---|
0:42:50 | that could have been uttered for this utterance |
---|
0:42:55 | of acoustics |
---|
0:42:56 | and i will only look at the arcs of that lattice and try to |
---|
0:42:59 | rescore it so now i know that |
---|
0:43:01 | for each arc there is a beginning and end time so i can take the |
---|
0:43:05 | audio of that part of the word of the sequence and try to score it |
---|
0:43:09 | and say okay i think it should be this word with this probability or this score |
---|
0:43:13 | and i can get the score and try to |
---|
0:43:17 | rescore the lattice with that |
---|
0:43:18 | that's good but it doesn't solve the problem of the unknown words |
---|
0:43:22 | my model was trained with forty eight thousand words |
---|
0:43:25 | and the decoder will see way more words so how will i ever be able |
---|
0:43:29 | to |
---|
0:43:30 | to |
---|
0:43:31 | classify those words with this |
---|
0:43:33 | this is a problem so let's try to go further with our idea |
---|
0:43:37 | and let's try to reason about how we could actually be able to |
---|
0:43:41 | produce a new word |
---|
0:43:43 | or score unknown words |
---|
0:43:46 | that's where the embedding spaces start to be useful |
---|
0:43:49 | so here is the suggestion |
---|
0:43:51 | we're gonna try to learn |
---|
0:43:54 | a mapping between |
---|
0:43:55 | a new representation of words that we have access to and this space of words |
---|
0:44:01 | so what do i have access to that i can use |
---|
0:44:04 | it's things that make up the word like the letters of the word or the |
---|
0:44:08 | letter n-grams of a word so for instance i take the word hello and i |
---|
0:44:12 | can extract |
---|
0:44:13 | what i call features |
---|
0:44:14 | i don't know like the letters it has |
---|
0:44:17 | the bigram letters it has the trigram letters it has the fourgram letters it has |
---|
0:44:22 | the |
---|
0:44:23 | n-gram letters it has all of them |
---|
0:44:25 | so that's a lot of features |
---|
0:44:28 | but then |
---|
0:44:29 | maybe they are useful |
---|
0:44:30 | actually if you add two more symbols |
---|
0:44:33 | like |
---|
0:44:34 | beginning and end of word |
---|
0:44:36 | then it's even more interesting because |
---|
0:44:39 | so the ing in english is very often |
---|
0:44:42 | an ending of words and so ing plus end of word is a very |
---|
0:44:46 | powerful feature so let's try to add that as features |
---|
0:44:49 | and |
---|
0:44:51 | try to represent words like this so |
---|
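A hedged sketch of the feature extraction just described, with explicit begin/end-of-word symbols so that for example "ing" at the end of a word becomes its own feature:

```python
# Letter n-gram features for a word, with begin/end-of-word markers.
def letter_ngrams(word, n_max=4, begin="[", end="]"):
    w = begin + word + end
    feats = set()
    for n in range(1, n_max + 1):
        for i in range(len(w) - n + 1):
            feats.add(w[i:i + n])
    return feats

# letter_ngrams("hello") contains "h", "he", "[he", "llo]", ... as features
```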
0:44:55 | the first thing i |
---|
0:44:57 | did was |
---|
0:44:58 | trying to see if i take a word extract the features and show you only |
---|
0:45:03 | this |
---|
0:45:04 | can you tell me given this that the word i was talking |
---|
0:45:08 | about was hello |
---|
0:45:10 | turns out that it's actually a very easy task and |
---|
0:45:14 | on the test set i got about ninety nine percent accuracy |
---|
0:45:18 | if i train a simple model to predict |
---|
0:45:20 | which word it is given its features so these features actually really |
---|
0:45:25 | capture enough of the word |
---|
0:45:27 | to tell you that this is hello |
---|
0:45:29 | so that's good let's use these features |
---|
0:45:32 | but how can we use it |
---|
0:45:33 | so we're gonna use it in an |
---|
0:45:35 | embedding |
---|
0:45:36 | deep learning kind of architecture |
---|
0:45:39 | in the following way |
---|
0:45:40 | so we had our first model which was you take the audio and you try |
---|
0:45:44 | to predict what word it is |
---|
0:45:46 | now |
---|
0:45:47 | my hope is that |
---|
0:45:49 | the last layer of this architecture captures a lot of information about the |
---|
0:45:54 | whole word |
---|
0:45:56 | and that two words that sound |
---|
0:46:00 | alike |
---|
0:46:01 | will be not far |
---|
0:46:02 | in the representation of the last layer of this |
---|
0:46:05 | deep architecture |
---|
0:46:07 | and |
---|
0:46:08 | what i will try to make sure is that indeed i can learn |
---|
0:46:12 | a mapping between |
---|
0:46:14 | any word |
---|
0:46:16 | and |
---|
0:46:17 | the position in that space that corresponds to the word so that space contains words |
---|
0:46:21 | but now not organized in terms of |
---|
0:46:25 | how they are related semantically they are organized in that space the space being the |
---|
0:46:29 | last layer of the deep architecture |
---|
0:46:30 | in terms of how they sound alike |
---|
0:46:32 | two words that sound alike will be nearby in that space and that's great |
---|
0:46:38 | so now i'm going to train |
---|
0:46:40 | a ranking model that will take |
---|
0:46:44 | an audio acoustic and project it into that space |
---|
0:46:47 | we take |
---|
0:46:48 | the word this audio acoustic corresponds to |
---|
0:46:53 | transform it into features project it into another space that i hope will be |
---|
0:46:59 | similar to this one |
---|
0:47:00 | and try to make sure that the representation of the correct word in that space |
---|
0:47:04 | is here |
---|
0:47:06 | near the representation of the audio nearer than the representation of any other word |
---|
0:47:11 | so i want to make sure that in that space i take the audio i project it |
---|
0:47:15 | i take the letters of the correct word i project them and they should be nearby |
---|
0:47:19 | in the embedding space |
---|
0:47:20 | and by nearby i just mean |
---|
0:47:22 | that it's nearer than any other word i would take and project so that i |
---|
0:47:26 | could rank the words and the nearest word of an acoustic sequence would be the |
---|
0:47:30 | correct word |
---|
0:47:31 | and that would work for any word any sequence of letters i can express |
---|
0:47:35 | does that make sense |
---|
0:47:37 | okay. so again, that's your typical ranking loss, and i trained that model |
---|
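A minimal sketch of one common form of that ranking loss, assuming a margin-based hinge over dot-product scores (the talk does not spell out the exact variant):

```python
import numpy as np

def hinge_ranking_loss(audio_emb, correct_emb, wrong_emb, margin=1.0):
    """One triplet of the ranking loss: the audio embedding should match the
    correct word's letter embedding better than a sampled wrong word's, by at
    least `margin`. In training, gradients would flow into both towers."""
    s_pos = float(np.dot(audio_emb, correct_emb))
    s_neg = float(np.dot(audio_emb, wrong_emb))
    return max(0.0, margin - s_pos + s_neg)

# toy check: a well-separated triplet incurs zero loss
a = np.array([1.0, 0.0])
pos = np.array([0.9, 0.1])
neg = np.array([-1.0, 0.0])
print(hinge_ranking_loss(a, pos, neg))  # 0.0
```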
0:47:44 | and now with that model i can actually score any word. so even though the model was only trained with fifty thousand words, with this addition i can now score an infinite number of words, as long as they are made of letters, which is fine in this case since it was only english |
---|
0:48:05 | okay, so how does it work? first of all, it doesn't work as well as the first model: if i use only that model i get seventy-three percent accuracy, but if i use this one, on the much bigger set of words, i get only fifty-three percent accuracy. but it may still be enough to be able to use it in the decoder, no? |
---|
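To show how such a model could be used for scoring at decode time, here is a hypothetical usage sketch; `project_letters` stands in for the trained letters-to-embedding tower and is an assumption, not taken from the talk:

```python
import numpy as np

def rank_words(audio_emb, candidates, project_letters):
    """Rank candidate spellings against one acoustic segment. Because
    `project_letters` accepts any string, words never seen in training
    can be scored too."""
    scores = {w: float(np.dot(audio_emb, project_letters(w))) for w in candidates}
    return sorted(scores, key=scores.get, reverse=True)

# call shape only, with a dummy projection in place of the trained tower
rng = np.random.default_rng(0)
dummy = lambda w: rng.normal(size=16)
print(rank_words(rng.normal(size=16), ["hello", "hallo", "yellow"], dummy))
```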
0:48:28 | and here is another useful by-product of these embedding spaces, now that we're talking about embedding spaces of audio: i take a word, project it into the embedding space, and look at the other words around it, and i see words that sound similar; they probably have completely different meanings, but they sound the same |
---|
0:48:47 | you can even push in a string that is actually not a word and try to see how you would pronounce it, which could be interesting |
---|
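A sketch of that neighbour lookup, assuming we already have a matrix of audio-trained word embeddings (all names and data here are illustrative):

```python
import numpy as np

def nearest_words(query, words, emb, k=5):
    """k nearest neighbours by cosine similarity; in an audio-trained space
    these neighbours sound alike rather than mean alike."""
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = normed @ normed[words.index(query)]
    order = np.argsort(-sims)
    return [words[i] for i in order if words[i] != query][:k]

# toy usage with random vectors standing in for trained embeddings
rng = np.random.default_rng(0)
vocab = ["hello", "hallow", "yellow", "world"]
print(nearest_words("hello", vocab, rng.normal(size=(4, 8)), k=2))
```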
0:48:57 | okay, so does it work? well, it works, but basically so far it only works in combination: if you do rescoring, you combine it with a good model |
---|
0:49:11 | so it's just preliminary work, but i think there are other things to try in that space that remain to be tried; it is still only preliminary |
---|
0:49:22 | it only improves the result slightly, even though "slightly" here actually means significantly, given the size of the data involved; that is still not as satisfying as i would like |
---|
0:49:31 | but i think it contains the seeds of an idea: that we should consider these audio embedding spaces, and designing embedding spaces for audio is, i think, something to consider later |
---|
0:49:46 | maybe i can tell you a bit about the kinds of errors the model was making. it was making mistakes like "it's" being replaced by "its", "five" replaced by "5", which, we agree, are different words, "okay" replaced by "OK", that kind of mistake. so they were mostly mistakes from the language model and not much from the acoustic model; but nevertheless you need to train them jointly, which i haven't, and so there's work to do here |
---|
0:50:14 | okay, so i'm going to stop now. here is the conclusion: i hope i convinced you that these embedding spaces are very powerful. the fact that you can take any kind of data, whether discrete data like words or complex data like images or sounds, and project them into a space where you can compare them, where you can look at the nearest neighbours, and where you can even apply operators to them, like averages or subtractions and things like that, is a very powerful way to handle complex objects |
---|
0:50:51 | we've actually tried it in many other applications; i can tell you about a few of them if you ask. for instance music recommendation: we had this music service where you can upload your music, and we're going to try to help you make playlists with it, or try to suggest new music to buy, and things like that |
---|
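As a sketch of how the "averages" operator from the conclusion could drive such a playlist feature (a hypothetical illustration, not the production system):

```python
import numpy as np

def extend_playlist(playlist_ids, catalog_emb, k=10):
    """Average the embeddings of the songs already in the playlist and
    suggest the catalogue items nearest to that centroid."""
    centroid = catalog_emb[playlist_ids].mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    normed = catalog_emb / np.linalg.norm(catalog_emb, axis=1, keepdims=True)
    ranked = np.argsort(-(normed @ centroid))
    return [int(i) for i in ranked if int(i) not in set(playlist_ids)][:k]

# toy usage with random vectors standing in for song embeddings
rng = np.random.default_rng(0)
print(extend_playlist([0, 3], rng.normal(size=(50, 32)), k=3))
```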
0:51:14 | and we do that not only with metadata but also using the audio representation of the music itself, so we've actually represented your music in these kinds of embedding spaces and we look around in that space |
---|
0:51:25 | we've done that for videos of course, for languages and machine translation as i talked about, and lately trying to do things in speech recognition |
---|
0:51:33 | and i think there's even more to do. why not try these kinds of things for speaker verification or language classification? i don't know, but maybe next year |
---|
0:52:13 | thank you very much |
---|
0:52:17 | i would like to know: what's wrong with linguistics? |
---|
0:52:22 | nothing is wrong with linguistics, but i'm afraid of making early decisions. for instance, for speech, when we take words we represent them as sequences of phonemes, and often there is more than one representation, more than one way to represent the word, and you need linguists to decide what the correct way, or the correct ways, should be |
---|
0:52:41 | and that is a discrete decision that you need to put in, because that's how you're going to represent the audio, you always do that. so you are making early decisions, and some of them might be wrong; i'd like to get rid of early wrong decisions |
---|
0:52:55 | [inaudible audience follow-up about transcriptions differing] |
---|
0:53:09 | i think that's wrong |
---|
0:53:12 | a comment about the images: what about video? i mean, that's probably where most of the work is now |
---|
0:53:27 | you're asking: what about video? so, we have people working on that: we have youtube, which is part of google and contains a few videos, and we have a big group trying these kinds of approaches for youtube, so i cannot speak for them, but i know they have good results |
---|
0:53:52 | the way you trained makes a strict distinction between the word and the acoustics; what is the difference between this and having sequential training over the whole sentence? |
---|
0:54:05 | what do you mean, basically one state for the whole sentence? oh, a recurrent net, sure. so you could use a recurrent net instead of using a convnet over the acoustics, and you get plusses and minuses, i would say |
---|
0:54:22 | the plus of using a recurrent net is that you don't need to decide a priori what the maximum size is; the minus is that it might forget more than you want, the representation being more compressed than with a model where you decide the size. so we are actually trying with LSTMs now, so i'm not saying it's wrong, it's a good idea; these experiments were done with a convnet, so go ahead and use your recurrent net |
---|
0:54:56 | i think my question is in the same direction: they were mentioning video, but what about sentences? are you able to represent sentences, sequences of words, in this way? |
---|
0:55:07 | so, you may know the current line of work by my colleagues quoc le and ilya sutskever, who are actually trying to do that kind of thing. they use LSTMs, a recurrent net; i won't go into the details of how it works, but you first read some input about your sentence, which could be the sentence in another language, or the video, or the audio, whatever it is, and then you output a sentence |
---|
0:55:36 | and you train it all to output the right sentence, so that you are actually reasoning about sentences. it's early work so far, but i hope it's going to work |
---|
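A toy sketch of that read-then-output idea, with an untrained vanilla RNN standing in for the LSTMs the actual work uses (so the output tokens are meaningless until trained; all sizes and weights here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
H, V = 32, 100                               # hidden size, toy vocabulary size
Wxh = rng.normal(0, 0.1, (H, V))             # input-to-hidden weights
Whh = rng.normal(0, 0.1, (H, H))             # hidden-to-hidden weights
Why = rng.normal(0, 0.1, (V, H))             # hidden-to-output weights

def encode(token_ids):
    """Read the whole input (a sentence, or in principle audio frames)
    into a single state vector."""
    h = np.zeros(H)
    for t in token_ids:
        x = np.zeros(V); x[t] = 1.0
        h = np.tanh(Wxh @ x + Whh @ h)
    return h

def decode(h, steps=5):
    """Greedily emit output tokens from the encoded state; training would
    tune all three weight matrices to produce the right sentence."""
    out = []
    for _ in range(steps):
        t = int(np.argmax(Why @ h))
        out.append(t)
        x = np.zeros(V); x[t] = 1.0
        h = np.tanh(Wxh @ x + Whh @ h)
    return out

print(decode(encode([3, 17, 42])))
```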
0:55:51 | [partly inaudible] my question is: can the supervised methods you showed be somehow turned into unsupervised ones? |
---|
0:56:07 | so, it's hard for me to see the distinction between the two. when you train your embedding spaces using only sentences, is that supervised or unsupervised data? i mean, these are sentences that exist in the world, but you don't need people to label them; they appear by themselves on the web. i don't know if that is supervised or not |
---|
0:56:33 | so the distinction is not clear to me; you will have to tell me more |
---|
0:56:42 | what i might be getting at is that when you get the sentence, it is supervised in the sense that a human generated what you were showing; supervised in the sense that you say "this is an english sentence" |
---|
0:56:54 | so in the unsupervised case i just give you the data, and when you do unsupervised clustering, things just look similar to each other; in the supervised case you said "this is this word, this is that word", that's given, it's something to guide you along |
---|
0:57:16 | and maybe one question would be: if you start throwing things in there, how do you use it, almost as in clustering? |
---|
0:57:28 | i think the hope of unsupervised learning, and i do believe we need a lot more work in that field, that it's crucial, is to find structure in the world. the things that happen in the world happen with some structure, maybe randomly, but with some distribution |
---|
0:57:43 | and you want to constrain the space where you're going to operate with these objects, these embedding spaces or any other hidden representation, so that they take that structure into account, and so that it is easier to say "these two things are nearby": because of that structure it doesn't go the other way around, you cannot go right-to-left, things only go in that direction, so you compare them like that |
---|
0:58:07 | that's what you want to use your unsupervised data for. so for instance you can take audio and try to learn a representation of the audio in a compact way just by looking at audio, as long as it is audio of the kind of things you will see later: not just random audio, but maybe people talking, without understanding what they say; or images that exist but without labels; or text that has been written in your language |
---|
0:58:34 | you don't need to know what that text is about, or what this image is about, as long as it is a set of images that are valid, in the sense that other images you would see come from the same distribution |
---|
0:58:45 | so it's very useful, it's a hard task, and we need it a lot |
---|
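One classic way to make that concrete is to learn a compact code from unlabeled data alone. Below is a minimal sketch using a linear autoencoder trained on reconstruction error; the choice of model and all sizes are assumptions, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 20))        # stand-in for unlabeled examples
D, H = X.shape[1], 5                  # data dimension, code dimension
W1 = rng.normal(0, 0.1, (H, D))       # encoder weights
W2 = rng.normal(0, 0.1, (D, H))       # decoder weights
lr = 0.05

for _ in range(300):
    Z = X @ W1.T                      # encode: compact code, no labels used
    Xhat = Z @ W2.T                   # decode: reconstruct the input
    err = Xhat - X
    gW2 = err.T @ Z / len(X)          # gradient of squared reconstruction error
    gW1 = (err @ W2).T @ X / len(X)
    W1 -= lr * gW1
    W2 -= lr * gW2

print(float(np.mean(err ** 2)))       # reconstruction error shrinks over training
```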
0:58:55 | so you are trying to recognise out-of-vocabulary words using this second model; can you comment on how successful it was, on the gain you obtained by combining the two models, in terms of recognising out-of-vocabulary words? |
---|
0:59:13 | so the first model was only trained to recognise words that were known. with the second one, i used it on our test set, which contains ten times more distinct words; most of the words in the test set were not in the train set. so the decoder results i gave in terms of word error rate were on a vocabulary more than ten times bigger than the training set's |
---|
0:59:38 | and it was using this letter representation, so is that what you meant? |
---|
0:59:43 | [partly inaudible] i mean, how successful was it in recognising them? |
---|
0:59:49 | well, so it is out-of-vocabulary with respect to the training set, but it's not solving the real task, which i'm sure you are interested in: out-of-vocabulary with respect to the test set, that is, a word that is not even in my dictionary when i decode, and that i'd like to be able to reason about. i haven't tried that, and i think it's a more interesting task |
---|
1:00:14 | [inaudible question] |
---|
1:00:48 | ...which is the first part i talked about. people started working on that years ago, yes, but it's a hard task, yes |
---|
1:01:17 | i agree, i haven't tried, but i guess videos would be the best way to see that, where you have audio and images together; i have not personally worked on that, but i know people are trying it on other data |
---|
1:01:32 | yes, i think [inaudible] |
---|