0:00:15 Thank you, it's a pleasure to talk here today. Indeed, I followed your community for a couple of years, about ten years ago, and then moved on to a few different things, but I might come back to it, who knows. I was very interested in what I've seen during the week, so thanks for inviting me. Today I'll talk about a few projects related to embedding spaces: I'll tell you first what they are, and then what we can do with them. But pay attention to the fact that this is not only my work; there are plenty of people who have worked on it with me, and I'd like to thank them first.
0:01:02 Embedding spaces: what are they useful for? It started about a decade ago, when people began to think about how to represent discrete objects in a continuous space in such a way that it becomes useful to manipulate them. Think of words: words are difficult to manipulate because you cannot easily compare two words, at least in the mathematical sense of a comparator. So how can you represent words such that afterwards you can manipulate and compare them? That's what these embedding spaces are: we are going to project words into them. Then, once we have projected words into these spaces, what about projecting anything else, like images, or speech? How we can do that will be the second part. We will see that once we can manipulate a complex object in these spaces, because you have learned semantic representations of these words, you can actually try to discover things in complex objects, like images you have never seen before; we will see how to do that. And I will briefly describe at the end some recent work where we tried to do similar things, from images but now applied to speech, and vice versa.
0:02:36 So let's start with what I mean by embedding. Here I depict a 3-dimensional embedding space, but of course in general it's more like a 100-, 200-, or 1000-dimensional embedding space. Think of this space as a real vector space, where each point could be the position of a discrete object like a word. So here we might have the position of the word 'dolphin', here the position of 'Obama', and here, say, the positions of 'sea' and 'Paris'. We want to learn where to put these words, and at the beginning we just don't know, so we are going to pick a random position for each word of the dictionary we are given. Then we want to modify the positions, to move them, such that at some point nearby words have similar meanings, or at least related meanings. So the word 'sea' and the word 'dolphin' should end up not far from each other, but relatively far from, say, 'Paris' and 'Obama'. If we can achieve that, it's going to be useful for manipulating these words afterwards. So how can we do that?
0:04:04 This started about ten years ago with my brother. For those of you who don't know, there are two of us, and you may have invited me thinking of my brother; both of us work in deep learning. About ten years ago he described a project where he could learn such embeddings, and here is how he started it.
0:04:38 He was using a neural network: a network where you have the inputs, then layers connected to each other, and at the end the output layer. The goal was to learn a representation for words. What he would take is sentences; you can capture sentences from Wikipedia, from the web, from anywhere you can grab sentences. And your goal is to find a representation of words such that it becomes easy to predict: if I show you a few words, can you predict what the next word will be? So you put, say, four words as input of the model, you crunch them, and at the end you predict a word from your dictionary; there is one output unit for each word of your dictionary, and you try to predict that the next word will be 'cheese', in that case. Now, how do you represent the words? You represent them as vectors in a d-dimensional space, which at the beginning are just random vectors. This is what we call the embedding space. Think of it as a big lookup table, or a big matrix, where each line is the representation of a word. If you see the sequence 'the cat ate', you just look up the words 'the', 'cat' and 'ate'; each of them has a vector representation; you put those as input of your model, you pass them through the model, and you predict the next word. The hope is that if you do that often, and as you might imagine you can feed the model a lot of such data since it is easy to get, you will find good representations of words.
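As a rough sketch, the lookup-table-plus-prediction idea can be written as follows; the tiny vocabulary, the dimensions and the single linear layer are toy assumptions (the original model had more layers):

```python
import numpy as np

# Hypothetical toy setup: 6-word vocabulary, 4-dimensional embeddings,
# a context of 3 words used to predict the next one.
rng = np.random.default_rng(0)
vocab = ["the", "cat", "ate", "cheese", "dog", "sat"]
V, d, context = len(vocab), 4, 3

E = rng.normal(size=(V, d))            # embedding lookup table: one row per word
W = rng.normal(size=(context * d, V))  # linear layer mapping context to vocab scores

def predict_next(words):
    """Look up each context word, concatenate embeddings, score every word."""
    idx = [vocab.index(w) for w in words]
    x = E[idx].reshape(-1)             # concatenated context embeddings
    scores = x @ W
    p = np.exp(scores - scores.max())
    return p / p.sum()                 # softmax over the whole vocabulary

p = predict_next(["the", "cat", "ate"])
print(p.shape)  # (6,) -- one probability per vocabulary word
```

During training, both `E` (the word positions) and `W` would be updated by gradient descent so that the probability of the actually observed next word goes up.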
0:06:23 The first time he did it, it was very painful: machines ten years ago were very slow, the dictionary was very small, and it wasn't very useful. But since then things have improved: the model has been simplified, we have more data, we have GPUs, and these things have started to work quite well. Here is an example of an embedding space that we trained about two years ago, I think. You don't have to try to see everything that's in that space; instead, we can pick a word. Here I picked the word 'apple', and I look in the space for the nearest words, say in the Euclidean sense: I take the positions of all the words, sort them with respect to their distance to the target word, and look at the nearest ones. What I see is that the nearest words are semantically very similar to 'apple': you have fruits, apples, melon, and so on. If you take 'Steve Jobs', you get things that are less about the fruit, and around 'iPhone' you see stuff like 'iPad' and the other i-things. So it does capture something, and this was trained in an unsupervised way: we just show sentences, and that's what you get. And this was with a 50-dimensional embedding, so you don't even need a very large space, over, I think, about a hundred thousand words. So there are a hundred thousand vectors hidden in that space.
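The nearest-neighbor inspection just described is easy to sketch; the five words and their 2-d vectors below are made up purely for illustration:

```python
import numpy as np

# Hypothetical mini embedding table; in the talk this would be
# ~100,000 words in 50 dimensions.
emb = {
    "apple":  np.array([1.0, 0.1]),
    "melon":  np.array([0.9, 0.2]),
    "fruit":  np.array([0.8, 0.0]),
    "iphone": np.array([-1.0, 1.0]),
    "ipad":   np.array([-0.9, 1.1]),
}

def nearest(word, k=2):
    """Sort all other words by Euclidean distance to the target word."""
    q = emb[word]
    dists = {w: float(np.linalg.norm(v - q)) for w, v in emb.items() if w != word}
    return sorted(dists, key=dists.get)[:k]

print(nearest("apple"))   # ['melon', 'fruit']
print(nearest("iphone"))  # 'ipad' comes first
```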
0:07:56 With time these kinds of models have evolved, and here 'evolved' usually just means 'simplified'. It was a complex architecture, and it became much simpler: in fact, these are now almost linear models. You just take the embedding of a word, apply a linear transformation, try to predict another word, and that's it. The way it's done: you take a sentence, you randomly pick a word in that sentence, and then you randomly pick another word nearby, which you are going to try to predict. So it's no longer the next word, and it's no longer a window of words that helps you predict the next one; it's a random word trying to predict another random word around it. And it's exactly the fact that the two words are taken from the same neighborhood that makes this interesting, because related words tend to co-occur often. So it's very simple now, the code is actually available, and it's very efficient: you can train your own embedding space in a matter of hours.
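The pair-sampling scheme just described might look like this; the window size and the sentence are arbitrary choices:

```python
import random

# Sketch of how training pairs are drawn in the simplified (skip-gram style)
# model: pick a random word in a sentence, then a different random word near it.
def sample_pair(sentence, window=2, rng=random.Random(0)):
    i = rng.randrange(len(sentence))          # random target position
    lo = max(0, i - window)
    hi = min(len(sentence) - 1, i + window)
    j = i
    while j == i:                             # a *different* nearby position
        j = rng.randint(lo, hi)
    return sentence[i], sentence[j]           # (input word, word to predict)

sent = "the cat sat on the mat".split()
pairs = [sample_pair(sent) for _ in range(5)]
print(pairs)
```

Each such pair is one training example for the almost-linear model: embed the first word, apply the linear transformation, and push up the score of the second word.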
0:09:04 So here we have an example: we took all the terms we saw in Wikipedia, and here are examples again of embedding spaces. These are the words nearby 'tiger shark', or 'car'; these are words sorted, or clustered, according to their semantics. You see, here, all the food things, all the reptiles, et cetera. So it captures semantics; but it's actually even stronger than that.
0:09:36 You can play some games with these word embeddings. For instance, after training the embedding space using this skip-gram model, you look at the embedding positions of 'Rome', 'Italy', 'Berlin' and 'Germany', where they are in the space, and you can apply operators on them: you take the embedding position of 'Rome', you subtract 'Italy', you add 'Germany', and what do you get? 'Berlin'. That means you can actually generalize and find that the vector that goes from 'Rome' to 'Italy' is the same as the one that goes from 'Berlin' to 'Germany', because they have the same relation to each other. That's for semantics, and you also have syntactic relations, like 'hardest' is to 'harder' what 'biggest' is to 'bigger', with the same kind of argument. And that's surprisingly working. You can do similar tricks for translation: you can train embedding spaces for separate languages, and you will find that these relations actually carry over from language to language, so you can use these kinds of things to help train translation systems. There are tons of tricks you can find in the literature nowadays using these embedding spaces, so they are very interesting to manipulate.
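The 'Rome - Italy + Germany = Berlin' game, sketched on hand-made toy vectors that are laid out to have exactly this structure (real learned embeddings only satisfy it approximately):

```python
import numpy as np

# Toy vectors where country/capital pairs share the same offset by construction.
emb = {
    "rome":    np.array([1.0, 1.0]),
    "italy":   np.array([1.0, 2.0]),
    "berlin":  np.array([3.0, 1.0]),
    "germany": np.array([3.0, 2.0]),
    "paris":   np.array([5.0, 1.0]),
}

def analogy(a, b, c):
    """Return the word nearest to emb[a] - emb[b] + emb[c]."""
    q = emb[a] - emb[b] + emb[c]
    cands = {w: float(np.linalg.norm(v - q))
             for w, v in emb.items() if w not in (a, b, c)}
    return min(cands, key=cands.get)

print(analogy("rome", "italy", "germany"))  # berlin
```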
0:11:00 Before I forget: feel free to ask any question whenever you want, of course.
0:11:06 So what else can we do with these embedding spaces? About four or five years ago, I was at Google and I wanted to see whether I could train a model to annotate images. Being at Google, I'm not interested in trying to label images out of a hundred classes, of course; I'm more interested in the large-scale setting, where you have hundreds of thousands of classes. That's more interesting for me. But at that time, at least, the researchers in the computer vision literature were focusing on tasks with a hundred, two hundred, up to a thousand classes, but that's about it.
0:11:50 So, can we go further than that? Of course, image annotation is a hard task; there are plenty of problems with it. Think of the fact that objects often look alike, and this problem gets even worse as the number of classes grows. When you have only two classes it's very easy to discriminate; if you have a thousand classes, or a hundred thousand classes, you can be sure that some of those classes are visually very similar to each other. So the problem becomes harder as the number of classes grows. There are plenty of other problems related to computer vision, which I won't go into in detail, but let me just summarize how computer vision was done about four or five years ago. Things have evolved a lot since then, but at that time you had two steps: feature extraction and classification. First you would extract features, and the way you extract features is very similar to how you would extract them from voice: you find a good representation of places in the image, then aggregate them in some way, and that is your representation. Once you had that, you would classify it using the best classifier available at the time, which was an SVM: you would train an SVM for each of your classes and hope that it would scale well. It didn't.
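As a very rough sketch of that two-step recipe, here is a one-vs-rest linear classifier (a stand-in for the per-class SVMs) trained on random stand-in "features"; everything here is illustrative, not the actual pipeline:

```python
import numpy as np

# Pre-deep-learning recipe, structurally: a fixed feature vector per image,
# then one independent linear, SVM-like classifier per class.
rng = np.random.default_rng(0)
n, dim, n_classes = 200, 20, 5
X = rng.normal(size=(n, dim))             # "extracted features", one row per image
y = rng.integers(0, n_classes, size=n)    # class index per image

W = np.zeros((n_classes, dim))
for c in range(n_classes):                # one binary classifier per class
    t = np.where(y == c, 1.0, -1.0)       # class c vs. the rest
    for _ in range(50):                   # a few hinge-loss gradient sweeps
        margins = t * (X @ W[c])
        viol = margins < 1.0              # hinge: only violating examples update
        W[c] += 0.01 * (t[viol] @ X[viol]) / n

pred = np.argmax(X @ W.T, axis=1)         # highest-scoring class wins
print(float((pred == y).mean()))          # training accuracy on the toy data
```

Note the structural flaw discussed next: each of the `n_classes` classifiers is trained independently, so nothing relates one label to another.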
0:13:25 One of the problems you had was that very similar images would give rise to very different labels, completely unrelated semantically. For instance, these are three frames of a video, a few seconds apart. It's a shark, or something like a shark, and these are the labels that a classically trained classifier would give; you can see that this one is semantically very different from that one. Here it's 'airliner', for those who can't see, and here it's 'tiger shark', but the images are quite similar. So something is not working somewhere. And that's the kind of problem we would like to solve: if I show you two similar images, I'd like to get two similar labels, at least semantically.
0:14:20 So why isn't that working? One argument is that when we train classifiers for images, we impose no relation between the labels. You can think of all the labels as sitting at the corners of a hypercube whose dimension is the number of classes. So there is no more relation between 'shark' and 'dolphin' than there is between 'dolphin' and 'Eiffel Tower', even though there is a semantic relation between the first two; we don't capture it with the way we train our classifiers, and that's probably bad. So what if, instead of putting the labels at those corners, we put them inside? Now the labels live inside this hypercube, like in the embedding spaces I was talking about earlier. And that's fine, because first of all the size of this hypercube no longer depends on the number of classes: you can have many more classes than the dimension of your space, because now it's a real-valued space. And you can place your labels in such a way that nearby labels have nearby meanings. What happens then is that if you make a mistake and pick the wrong label, hopefully you pick a label that was nearby, hence with a semantic meaning not too far off, so hopefully that's acceptable. And there's an even more interesting thing that could happen: you could put more labels in the space than the ones for which you have images, and maybe you'd be able to label an image of a topic you've never seen any image of before, just because it is semantically related. So we tried to see whether this is just a dream.
0:16:12 About four years ago we started working on this project. We were interested in these embedding spaces, and what we tried was basically to merge the idea of an image classifier with an embedding. We had this algorithm, called Wsabie, where we wanted to jointly learn how to take an image and project its input representation into an embedding space, so you have a projection from the representation of that image into that space, and, in the same space, you have points that represent your classes, like 'dolphin', 'Obama' and 'Eiffel Tower'. The goal was to jointly find the positions of the labels and the mapping from the image to the label space; if you can do that jointly, you hopefully solve the classification task and get a good embedding space for words at the same time.

0:17:19 Being at Google, I should tell you that everything I see looks like a ranking problem, and so obviously I saw a ranking problem here too. In this case the goal is: if you show me an image, I am going to try to rank the labels such that the nearest label is the one that corresponds to the image. That's a classical ranking problem. You also want to make sure that, if you are to make a mistake, the mistake is semantically reasonable: if you were to look at the returned word, it should be a reasonable word even if it's not the perfect one. So we are going to train our model with a ranking loss in mind.
0:18:06 It was actually a very simple model, again just a linear mapping. This was prior to the deep-learning era, in some sense, at least in computer vision, so we worked with engineered features, the MFCCs of the image world, if you like. What we were looking for was just a linear mapping between the features of an image and the embedding space. You take your feature vector x, the representation of an image, and you multiply it by a matrix V, such that the result is another vector that lives, hopefully, in the embedding space. And, as before, you have a vector representation for each of your words, which in this case are the labels of the image classification task. You want to find W, the representations of your labels, and V, the mapping between image features and the embedding space, such that they optimize the task. What is the task? We define a similarity between two points in the space: in this case the similarity between an image x and a label i is just a dot product in the embedding space. You take the image, project it into the embedding space, and take the dot product with the embedding of the label you are considering for that image. You want that score to be high for the correct label and low for incorrect labels. And we are going to add some constraints, because we're doing machine learning and we need some regularization: you constrain the norms in the embedding space, controlling both the norm of the mapping and the norms of the label embeddings themselves.
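In code, the scoring part of this model is tiny; the shapes and data below are toy assumptions:

```python
import numpy as np

# Sketch of the linear joint-embedding scoring: a matrix V projects image
# features into the embedding space, each label i has an embedding row W[i],
# and the similarity is a dot product: f(x, i) = (V x) . W[i].
rng = np.random.default_rng(0)
d_feat, d_emb, n_labels = 100, 10, 7

V = rng.normal(size=(d_emb, d_feat))    # learned image-to-embedding map
W = rng.normal(size=(n_labels, d_emb))  # learned label embeddings, one per row

x = rng.normal(size=d_feat)             # feature vector of one image
scores = W @ (V @ x)                    # one score per label
best = int(np.argmax(scores))
print(scores.shape, best)
```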
0:20:08 Okay, so as I said, we are going to try to solve this problem using a ranking loss. What do I mean by that? We are going to construct a loss to minimize such that, for every image in our training set (that's this part here), for every correct label of that image (an image can have more than one label, and that's often the case), and for every incorrect label of that image (and that's a lot, so it's a nice big sum), the score of the correct label should be higher than the score of any incorrect label, plus a margin. What this describes is basically a hinge loss: not only do you want the score of the correct label to be higher than any other one, you want it higher by a margin, so that it generalizes better. Here the margin is one, but it's a constant and you can put what you want. If the constraint is not satisfied, then you pay a price, and you want to minimize that price. You can optimize this very efficiently by stochastic gradient descent: you sample an image from your training set, you sample a positive label from the set of correct labels of that image, and then you sample any other label, which most likely will be an incorrect label. You have your triplet, you compute your loss, and if the loss is positive you change the parameters of your model, V and W here.
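The triplet sampling and update can be sketched like this, with the gradient of the hinge term taken by hand; the dimensions, learning rate and data are toy stand-ins:

```python
import numpy as np

# One SGD step of the triplet ranking loss on the linear model:
# sample (image, correct label, other label) and update V and W
# only when the hinge constraint is violated.
rng = np.random.default_rng(1)
d_feat, d_emb, n_labels, lr, margin = 20, 5, 50, 0.1, 1.0
V = rng.normal(scale=0.1, size=(d_emb, d_feat))
W = rng.normal(scale=0.1, size=(n_labels, d_emb))

def sgd_step(x, pos, neg):
    """Hinge: max(0, margin - f(x,pos) + f(x,neg)), with f(x,i) = (V x) . W[i]."""
    global V, W
    z = V @ x
    loss = max(0.0, margin - z @ W[pos] + z @ W[neg])
    if loss > 0.0:                       # only violated triplets produce a gradient
        V += lr * np.outer(W[pos] - W[neg], x)
        W[pos] += lr * z
        W[neg] -= lr * z
    return loss

x = rng.normal(size=d_feat)
pos, neg = 3, 17                         # correct label index, sampled other label
before = sgd_step(x, pos, neg)
after = max(0.0, margin - (V @ x) @ W[pos] + (V @ x) @ W[neg])
print(before >= after)                   # the step reduces this triplet's loss
```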
0:21:49 So that's good, and it actually works. But you can do better. I'm not going to go into the details of how, but think of the following problem. When you rank a hundred thousand objects, what you want is that the top ranking positions contain something interesting. Suppose I show you two functions that rank labels: one of them returns a correct label in position one and another correct label in position one thousand; the other returns the two correct labels in positions five hundred and five hundred and one. You should find the first one more interesting, even though in terms of average rank they have the same value. For a user, you want at least one correct label returned in the top positions. So you want to favour the top of the ranking; you want to put a lot of the emphasis there. And there are ways to modify these kinds of losses to favour the top of the ranking. I won't go into the details, they are in the paper, but it makes a huge difference in terms of the perception of the user: at least at the top of the ranking, you see things that make sense.
0:23:11 So let's look at the first experiments we did. At that time a database had started appearing in the computer vision literature, called ImageNet. It's still there, and it's growing. At that time there were sixteen thousand labels in the ImageNet corpus; now there are more than twenty thousand. Nobody was actually using the corpus as it is: people had selected about a thousand labels and were only playing with those thousand, and that's still the case, unfortunately. Almost nobody plays with the whole corpus that is actually available, which contains millions of images: at that time about five million, and I think now it's more like ten million. So that's good, but nobody's using it. So we considered that the small dataset, and we looked at a bigger one, which came from the web. For the web data we don't really have labels, so the way we obtained labels was by looking at what people do on Google image search: people type a query, they see images, and they click on an image. If many people click on the same image for the same query, we consider the query a good label for that image. It's very noisy, a lot of things happen there, but it's usually reasonable, and you can collect as much as you want. This was a very small subset of what was actually available, but still, there were more than a hundred thousand labels in our set, so that was interesting.
0:24:58 So we actually published a paper showing these results, and I want to emphasize the fact that we had one percent precision on that data, so a ninety-nine percent error rate, and it was published. I think that's good; it gives hope. Here is a summary of the results: this algorithm was better than the many things we tried, so these numbers are higher than the other ones. We show two types of metric: precision at one, which is accuracy, and precision at ten, which is how many correct labels you returned in the top ten. The latter is more like a ranking metric, and it looks more like what you see on Google: you look at the page and you're happy if you see the document you want near the top. Of course, if you count more than one returned label and sample more, the numbers grow and everything gets better, but the numbers remain small, and so the question is: is it useful at all, given that the numbers are so small?
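The two metrics can be sketched as follows, on a made-up ranked list:

```python
# Precision at k: fraction of the top-k predicted labels that are correct.
def precision_at_k(ranked, correct, k):
    top = ranked[:k]
    return sum(1 for label in top if label in correct) / k

# Toy prediction for one image: a ranked list of 10 labels, 2 of them correct.
ranked = ["airliner", "tiger shark", "whale", "boat"] + ["x%d" % i for i in range(6)]
correct = {"tiger shark", "whale"}

print(precision_at_k(ranked, correct, 1))   # 0.0 -> counted as a total miss
print(precision_at_k(ranked, correct, 10))  # 0.2 -> 2 correct labels in the top 10
```

Averaging these per-image values over the test set gives the numbers reported in the table.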
0:26:04 It turns out it is. First of all, let's look at the embedding space again; it's always fun to look at what happened after we trained the model. Remember, the model was trained with just pairs of image and label: no relations between words, no relations between images, just pairs of image and label. We are going to look only at where the labels ended up in the space, no images yet. So I look at the label 'Barack Obama' and at the nearby labels, out of the hundred thousand labels, and these are the labels we see. The nearest one is basically a spelling mistake, because, well, people type anything on the web; the other ones are also very similar; and then you see this one, and I don't know what it is. If you take 'Beckham', you again see similar things; then, interestingly, you see semantic relations: you see soccer players appearing not far away. Maybe they look alike, I don't know. You also see things like translations, so 'dolphin' is near 'delfin', and similar objects, like 'whale'. You see the 'Eiffel Tower', which was used in training, and not far from it things that are visually similar. All of this has been learned in some sense: I never told the model that 'dolphin' is like 'delfin'; they are near each other just because they share similar images, basically. That's what the embedding space did.
0:27:34 So that's nice, but what about the actual task? Here is a sample of four images from the test set. On all of them, if I had to compute the score, precision at one would be zero: I failed on all these images, as expected; I mean, I fail ninety-nine percent of the time, so these are from those ninety-nine. But the failures are gracious, in some sense. These are supposed to be, say, a dolphin and a car, and you see the words that come up. 'Dolphin' here is in position thirty, and here it's in position, I don't know, eight; but the other words around make sense. They may be the wrong answer, but the answers we give would satisfy many humans, and that's good, just because they have very similar semantic meanings. So we have the Barack Obama thing here. I was interested in the Las Vegas strip here, because, maybe you don't know this, but there's a copy of the Eiffel Tower in Las Vegas, so it actually made sense; I was surprised. So that's interesting: the way the model makes mistakes is now more interesting. It still makes a lot of mistakes, but at least the answers make sense, and that's better.
0:28:56 So that was as of four years ago, and what happened after that was that the deep-learning era started, and everything changed in the image field, like it did in speech, I would say. Here is how we do image recognition now: we take an image and apply a deep network, and at the end we take a decision using a softmax layer on top of the deep architecture. The thing that works best these days is the convolutional network. For those of you who don't know what these are, they are basically layers that look only at a small part of the image. There is a unit here that looks only at this part of the image and tries to compute a value for it, but the function that computes this value is the same as the one that looks at this part of the image, and this one, and this one. So we are actually convolving one function along the whole image and returning the result of this convolution at the output of the layer. Then we pool the answers locally: we look at the answers of that set of convolutions in a local patch and return something like the maximum or the mean; what usually works best is the max, but you can try any pooling. And you do that again, layer after layer; at the end you add fully connected layers, and you get an answer. So it is a much more involved architecture; it's very slow to train, and you need GPUs and all that. I must say, first of all, that convolutional networks were developed about twenty-five years ago, so this is nothing new; only now do we have the data that shows how good they are, because before there was neither enough data nor enough machine power, like GPUs, to train such a complex architecture. So now it works, and it works very well. The first time it was used on the ImageNet competition, a competition where you classify images among a thousand classes, it basically blew the competition away: the other competitors, who were using classical computer vision techniques and were actually the best out there, ended up something like ten percent behind the deep-learning approach. It changed everything, and now, at least in the computer vision literature, almost nobody is not using deep architectures.
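A minimal sketch of one convolution-plus-max-pooling stage, with explicit loops to make the weight sharing visible (real implementations are heavily optimized and stack many such stages):

```python
import numpy as np

def conv2d(img, kernel):
    """Slide the SAME kernel over the whole image (weight sharing)."""
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kernel)
    return out

def maxpool2d(x, size=2):
    """Keep only the local maximum of each size-by-size patch."""
    H, W = x.shape
    return np.array([[x[i:i+size, j:j+size].max()
                      for j in range(0, W - size + 1, size)]
                     for i in range(0, H - size + 1, size)])

img = np.arange(36.0).reshape(6, 6)        # stand-in for an image
feat = maxpool2d(conv2d(img, np.ones((3, 3))))
print(feat.shape)                          # (2, 2)
```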
0:31:34 Maybe just one slide to say that we actually use such things at Google for real products; it's not just research. For instance, if you type a query like 'my photos of' something, we are going to look in your own photos, which are unlabeled, and try to return your photos of, say, a sunset. And it's done using the type of architecture that won this competition. I must say that the authors of that paper, that is Geoff Hinton, Alex Krizhevsky and Ilya Sutskever, are now working at Google, so they helped us a bit. It works; they are very good.
0:32:25 Okay, let's continue. Let's go back to our embedding spaces and the fact that you can put a lot of things in an embedding space. On one side we have these embedding spaces, which are very powerful because they capture the semantics of labels. On the other side we have these powerful deep-learning architectures, which are the best classifiers now. So can we marry these two things in a way that would be useful? In fact, what we found is that you can use the two together to label an image with a label that never appeared in the classifier's training set. That's interesting because, even though the classifier was trained on a thousand labels, we can now try to reason about several hundred thousand labels, even though we have never seen images for ninety-nine percent of them. Surprisingly, it's actually very simple to do. We started by doing something more complex, but eventually we converged, again, on the simplest thing.
0:33:29so we shouldn't and here is how you do it
0:33:31so first
0:33:33you train these two things
0:33:35separately you train your best deep-learning architecture on your image classifier
0:33:39and you train your best the bidding melon labels
0:33:42the only thing that you require is that
0:33:44the labels that are the at the that were for which you train your deep
0:33:50should be embedded in the space so if one of the label is car
0:33:53make sure that colours here but that shouldn't be a problem because here you can
0:33:57put anything as long as you see text
0:33:59related to these label
0:34:01so that was an easy
0:34:05once you have that here is what you do
0:34:08you take an image
0:34:10and you compute
0:34:11the
0:34:13score of the deep-learning
0:34:17model so the score of the deep-learning model is actually the posterior probability of
0:34:21a label
0:34:23given the image and you have this vector of p(label | image)
0:34:28you are going to compute all these scores you have a thousand of them
0:34:32but you are going to only take the top ones
0:34:36the top could be the top thousand if you want but
0:34:38it's going to be faster if you take the top ten
0:34:43and you are going to
0:34:47look at the labels corresponding to these top ones so suppose the top-k labels
0:34:55suppose these words are the top-ten labels you obtained so bear lion tiger
0:35:00et cetera
0:35:01you are going to look at the embeddings of the top-ten
0:35:04labels you obtained here
0:35:05where they are and you are going to compute an average of them in
0:35:09the embedding space
0:35:10but it's going to be a weighted average and the weight will be how
0:35:13much you think it is the actual label
0:35:17so if you really think it's a lion
0:35:19the result of the weighted combination will be very near the lion
0:35:24if you really think it's a bear it's going to be near the bear
0:35:27if you think it's between the bear and the lion say you obtained
0:35:31fifty percent bear fifty percent lion you're going to be in a position between bear and lion
0:35:35like in the middle
0:35:36and that's what this
0:35:38thing says so you average the top labels you found in the embedding space and
0:35:43you find a position and that's where you should be now you look around there
0:35:47in the embedding space and look at the nearest label
0:35:50it might be a label from the top dozen or it might be another one
0:35:53and that's your answer
0:35:55and because it can be any other label it can be
0:35:57labels of objects you've never seen
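A minimal sketch of this convex-combination idea, with made-up labels and toy two-dimensional embeddings (the real ones come from trained text and image models):

```python
import math

# Toy word-embedding table; the words and vectors are made up for illustration.
emb = {
    "bear":  (1.0, 0.0),
    "lion":  (0.0, 1.0),
    "tiger": (0.1, 0.9),
    "liger": (0.05, 0.95),  # a label the image classifier was never trained on
}

def conse_predict(top_labels, top_probs, all_labels):
    """Weighted average of the top-k label embeddings, then nearest
    neighbour over ALL labels in the space, including unseen ones."""
    total = sum(top_probs)
    weights = [p / total for p in top_probs]          # renormalise the top-k scores
    point = tuple(sum(w * emb[l][d] for w, l in zip(weights, top_labels))
                  for d in range(2))
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))
    return max(all_labels, key=lambda w: cos(emb[w], point))

# The classifier thinks "lion" and "tiger" are both likely; the averaged
# point lands nearest the unseen label "liger" in this toy space.
print(conse_predict(["lion", "tiger"], [0.5, 0.45], emb))
```

Because the nearest-neighbour search runs over the whole embedding table, the answer can be a label the classifier itself has never output.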
0:36:01does it work
0:36:03it does actually surprisingly
0:36:05not perfectly by far you see like
0:36:09a few percent precision but it does work
0:36:13well enough that it's better than what we've seen elsewhere
0:36:16so this is the model that is doing this
0:36:20convex
0:36:23combination of semantic embeddings and the one using the top-ten
0:36:27labels that's what's called ConSE
0:36:29for comparison there is something that we also published called DeViSE which instead of
0:36:33doing this simple convex formulation tries to learn the mapping between the two
0:36:38and the mapping was surprisingly not as good as just the simple combination
0:36:43and this would be the output of the model itself so this one cannot actually find
0:36:48the correct solution because
0:36:49we know that the correct label of that image
0:36:53is not in the top ten or the top thousand it's not a label the
0:36:57model knows about so it will make a mistake
0:37:00while these ones have access to the full embedding space and can answer
0:37:05things they have never seen and that's
0:37:07okay that works
0:37:08in this case
0:37:13that was nice for images but recently i thought okay
0:37:17what about speech
0:37:20so about ten years ago i was working in speech so i had some
0:37:24knowledge about how speech models work
0:37:26but in the meanwhile of course everything changed the deep-learning wave also hit the
0:37:30speech community
0:37:31and now nobody's using GMMs and stuff like that anymore we use deep
0:37:37networks so how is speech
0:37:40recognition done nowadays
0:37:43so this is speech in
0:37:45one slide
0:37:47you take your
0:37:49speech signal and you transform it using some features
0:37:55and for the training set that you have you take the sequence of words
0:37:59and you
0:38:01cut it into sub-word
0:38:02units which are usually phonemes or
0:38:05biphones, triphones, or whatever you want
0:38:07and these phones are then cut into sub-phone units which are called states because
0:38:12they are states of HMMs even though we're not using HMMs anymore
0:38:17and then we try to align the audio with the states
0:38:22so we take a previous model and we try to say okay with our previous model
0:38:27this part of the audio should correspond to state number two hundred forty-five
0:38:32and we do that for all our training set and that becomes our training data
0:38:38to train a deep architecture whose output size is the number of states you have and
0:38:43you try to predict
0:38:44which state this audio should correspond to out of the
0:38:48in our case
0:38:50fourteen thousand states
0:38:51so the actual speech
0:38:53acoustic model is a classifier over fourteen thousand classes
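The plumbing of that conventional pipeline can be caricatured in a few lines; the context width, the feature frames, and the state count below are placeholder values, and only the input side of the classifier is shown:

```python
# Caricature of the conventional acoustic model: each audio frame, stacked
# with some context frames, is classified into one of N HMM states.
# All sizes here are illustrative placeholders, not the real system's values.
NUM_STATES = 14000   # e.g. clustered context-dependent states
CONTEXT = 5          # frames of context on each side of the current frame

def stack_context(frames, t):
    """Build the classifier input for frame t: the frame plus CONTEXT
    neighbours on each side, clamping indices at the utterance boundaries."""
    idx = [min(max(i, 0), len(frames) - 1)
           for i in range(t - CONTEXT, t + CONTEXT + 1)]
    return [x for i in idx for x in frames[i]]   # flatten into one input vector

# A trained deep net would map this stacked input to a softmax over
# NUM_STATES posteriors p(state | frames); here we only show the input side.
frames = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]   # three toy 2-dim feature frames
print(len(stack_context(frames, 1)))            # (2*CONTEXT+1) * 2 = 22
```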
0:38:58this is how it works and
0:39:01we do that because that's
0:39:02how we've been doing speech forever but it seems unreasonable to me
0:39:06that we're trying to classify audio into states
0:39:09which even humans would have a hard time
0:39:13doing as a task because these states have no particular meaning
0:39:17the phonemes themselves have been designed by linguists and maybe that's not what the data
0:39:21would say
0:39:22we should maybe look at the data instead of asking a linguist
0:39:26i don't know how many linguists we have here
0:39:28hopefully not too many
0:39:36so let's see if we can
0:39:38get rid of these states and phonemes and all that
0:39:41of course it's going to be hard and we will not succeed
0:39:44very well but at least i think it's worth trying
0:39:48and see where we go
0:39:52so what can we do
0:39:55so the first thing i tried was a very naive approach i took the data
0:39:59and instead of cutting the data and segmenting the data at the state level i said
0:40:05okay i forget about states i forget about phonemes what else do we have words
0:40:09so let's segment the
0:40:11training set we have in terms of words and
0:40:14that's an easier task because it's usually easier to segment your data in terms
0:40:18of words humans would agree
0:40:20roughly where a word starts and where a word ends
0:40:24so let's try to learn a model that should just classify words
0:40:29and that's what i did so i had my audio data and i used a deep architecture
0:40:33and tried to predict at the end
0:40:35the word directly
0:40:36so that assumes that it has already been segmented
0:40:39the same way that the state-based model was assuming that it was already segmented
0:40:43but instead of seeing
0:40:44only one state plus context i'm going to see the whole word
0:40:48now it turns out words are not that long
0:40:51with a window of about two seconds i captured like ninety-nine percent
0:40:55of the training words i had
0:40:57so you need about two hundred frames to express and capture most of the words
0:41:02or at least of the training set i had access to which
0:41:05is query data from
0:41:07google
0:41:10so i trained your typical deep convolutional model the same kind of model that was
0:41:15used for images but now i used it for speech
0:41:19i used a
0:41:21dictionary that was actually quite small in the sense that in the training set
0:41:27not all possible words appear
0:41:29so i used only about fifty thousand words
0:41:33looks big but it's actually small compared to the actual number of words
0:41:37that people will use in our test set for which we need something that can
0:41:41at least work
0:41:43so we'll have a problem later but let's forget about that problem so far
0:41:47and try to classify our training set into one of the forty-eight thousand words
0:41:54so we can train the model and that's nice and you get some accuracy
0:41:57seventy-three percent
0:41:58is it good is it bad i don't know
0:42:01it's reasonable
0:42:03we'll see where we go with this
0:42:07the first thing to say is that if you have this you are not done at
0:42:10all with the speech recognition task because
0:42:14i've assumed that someone gave me an aligned dataset so my
0:42:18training data was aligned at the word level
0:42:21but now if i want to do speech recognition i'm not going to be given
0:42:24the alignment i have to align it myself
0:42:27since i wanted results quickly i said okay i'm going to forget about
0:42:31the alignment
0:42:32i will use
0:42:33the decoder we have to provide a target so i take a model
0:42:37and i just run the speech recognizer we have
0:42:41which happens to be quite good
0:42:42and i look at the lattice which is
0:42:46a compact representation of the top-k
0:42:49sequences of words
0:42:50that could have been uttered for this utterance
0:42:55given the acoustics
0:42:56and i will only look at the arcs of that lattice and try to
0:42:59rescore them so now i know that
0:43:01for each arc there is a beginning and end time so i can take the
0:43:05audio of that part of the sequence and try to score it
0:43:09and say okay i think it should be this word with this probability here is my score
0:43:13and i can take that score and use it
0:43:17to rescore the lattice
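The arc rescoring just described might look like this; the `Arc` structure, its field names, and the word scorer are hypothetical stand-ins for illustration, not a real decoder's API:

```python
from collections import namedtuple

# Hypothetical lattice arc: a word hypothesis with its time span and the
# decoder's score. The fields and the scorer below are made up.
Arc = namedtuple("Arc", "word start end score")

def rescore_arcs(arcs, audio, word_scorer, alpha=0.5):
    """Mix each arc's decoder score with the word model's score
    computed on the audio slice the arc spans."""
    out = []
    for arc in arcs:
        segment = audio[arc.start:arc.end]      # the audio for this arc only
        s = word_scorer(segment, arc.word)      # e.g. p(word | segment)
        out.append(arc._replace(score=alpha * arc.score + (1 - alpha) * s))
    return out

# Toy usage with a dummy scorer that strongly prefers "hello":
audio = list(range(100))
arcs = [Arc("hello", 0, 50, 0.8), Arc("yellow", 0, 50, 0.7)]
scorer = lambda seg, w: 0.9 if w == "hello" else 0.2
rescored = rescore_arcs(arcs, audio, scorer)
```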
0:43:18that's good but it doesn't solve the out-of-vocabulary problem
0:43:22my model was trained with forty-eight thousand words
0:43:25and the decoder will see way more words so how will i ever be able
0:43:31to classify those words with this
0:43:33this is a problem so let's try to go further with our idea
0:43:37and let's try to reason about how we could actually be able to
0:43:41produce an unknown word
0:43:43or score unknown words
0:43:46that's where the embedding spaces start to be useful
0:43:49so here is the suggestion
0:43:51we're going to try to learn
0:43:54a mapping between
0:43:55a representation of words that we have access to and the space of words
0:44:01so what do i have access to that i can use
0:44:04things that make up the word like the letters of the word or the
0:44:08letter n-grams of a word so for instance i take the word "hello" and i
0:44:12can extract
0:44:13what i call features
0:44:14out of it like the letters it has
0:44:17the bigram letters it has the trigram letters it has the four-gram letters it has
0:44:23and so on all of them
0:44:25so that's a lot of features
0:44:28but then
0:44:29maybe they are useful
0:44:30actually if you add two more symbols
0:44:34beginning and end of word
0:44:36then it's even more interesting because
0:44:39the "ing" in english is very often
0:44:42an ending of words and knowing that "ing" is at the end is a very
0:44:46powerful feature so let's try to add that as features
0:44:49and
0:44:51try to represent words like this so
0:44:55the first thing i
0:44:57did was
0:44:58trying to see if i take a word and extract its features and show you only
0:45:04the features can you tell me given this that the word i was talking
0:45:08about was "hello"
0:45:10it turns out that it's actually a very easy task and
0:45:14on the test set i got about ninety-nine percent accuracy
0:45:18if i train a simple model to predict
0:45:20which word it is given its features so these features actually really
0:45:25capture enough of the word
0:45:27to tell you that this is "hello"
0:45:29so that's good let's use these features
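Extracting these letter n-gram features, with the two extra boundary symbols, takes only a few lines; the bracket characters used as markers here are an arbitrary choice:

```python
def word_features(word, max_n=4):
    """All letter n-grams of the word up to length max_n, with explicit
    begin/end-of-word markers so that e.g. word-final 'ing' becomes the
    distinct feature 'ing]'."""
    s = "[" + word + "]"          # '[' marks word start, ']' marks word end
    feats = set()
    for n in range(1, max_n + 1):
        for i in range(len(s) - n + 1):
            feats.add(s[i:i + n])
    return feats

feats = word_features("hello")
print("[he" in feats, "lo]" in feats)        # word-initial / word-final n-grams
print("ing]" in word_features("singing"))    # word-final "ing" is its own feature
```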
0:45:32but how can we use them
0:45:33so we're going to use them in an
0:45:35embedding
0:45:36deep-learning kind of architecture
0:45:39in the following way
0:45:40so we had our first model where you take the audio and you try
0:45:44to predict what word it is
0:45:47my hypothesis is that
0:45:49the last layer of this architecture captures a lot of information about the
0:45:54whole word
0:45:56and that two words that sound alike
0:46:01will not be far apart
0:46:02in the representation of the last layer of this
0:46:05deep architecture
0:46:08what i will try to make sure is that indeed i can try to learn
0:46:12a mapping between
0:46:14any word and
0:46:17the position in that space that corresponds to the word so that space contains words
0:46:21but now they are not organised in terms of
0:46:25how they are related semantically they are organized in that space the space being the
0:46:29last layer of the deep architecture
0:46:30in terms of how they sound
0:46:32two words that sound alike will be nearby in that space and that's great
0:46:38so now i'm going to train
0:46:40a ranking model that will take
0:46:44an audio acoustic sequence and project it into that space
0:46:47then take
0:46:48the word this audio acoustic corresponds to
0:46:53transform it into features project it into another space that i hope will be
0:46:59similar to this one
0:47:00and try to make sure that the representation of the correct word in that space
0:47:04is near the audio
0:47:06and actually nearer than the representation of any other word
0:47:11so i want to make sure that in that space i take the audio and project it
0:47:15i take the letters of the correct word and project them and they should be nearby in
0:47:19the embedding space
0:47:20and by nearby i just mean
0:47:22that it's nearer than any other word i would take and project so that i
0:47:26could rank the words and the nearest word to an acoustic sequence would be the
0:47:30correct word
0:47:31and that would work for any word any sequence of letters i can express
0:47:35does that make sense
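The criterion just described is a standard margin ranking (hinge) loss; here is a toy sketch with plain vectors standing in for the two learned projections (audio network and letter-feature network):

```python
# Sketch of the ranking criterion with toy vectors. In the real model the
# audio embedding comes from the deep audio network and the word embedding
# from the letter-feature network; here both are plain made-up vectors.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def ranking_loss(audio_emb, correct_emb, wrong_emb, margin=1.0):
    """Zero when the correct word outscores the sampled wrong word by at
    least `margin` in the shared space, positive otherwise."""
    return max(0.0, margin - dot(audio_emb, correct_emb) + dot(audio_emb, wrong_emb))

audio = [1.0, 0.0]     # projected acoustics
right = [0.9, 0.1]     # projected letter features of the correct word
wrong = [0.0, 1.0]     # projected letter features of a sampled wrong word
loss = ranking_loss(audio, right, wrong)   # hinge: max(0, 1 - 0.9 + 0.0)
```

During training, gradients of this loss would push the correct word's projection toward the audio and the sampled wrong word's projection away; at test time any string of letters can be projected and ranked, which is what makes unseen words scorable.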
0:47:38and so again that's your typical ranking loss
0:47:42and i trained that
0:47:44and now with that model i can actually score any word so even though
0:47:48this model was only trained with fifty thousand words
0:47:53with this addition i can now score any number of words as long
0:47:57as they are made of letters
0:48:00okay in this case that was only english
0:48:05okay so does it work first of all it doesn't work as well
0:48:09as the original model so if i
0:48:12use only this model i get
0:48:14seventy-three percent accuracy but if i use
0:48:17this model
0:48:19with the much bigger set of words i get only fifty-three percent accuracy
0:48:23but it's still
0:48:24maybe enough to be able to use it in the decoder no
0:48:28and here's another useful example of these embedding spaces now we're talking about
0:48:33embedding spaces of audio
0:48:34so i take a word i project it into the embedding space and i look at
0:48:38other words around it
0:48:40and i see words that
0:48:42sound similar they probably have completely different meanings but they sound the same
0:48:47you can even
0:48:49push in any string that is actually not a word and try to see
0:48:52how you would pronounce it so
0:48:55that could be interesting
0:48:57okay so does it work well it works
0:49:03basically so far
0:49:04it only works in combination if you are rescoring you combine
0:49:09it with a good model
0:49:11so it's just preliminary work but i think there are other things to try
0:49:16in that space
0:49:17that remain to be tried it's still preliminary so
0:49:20you only improve slightly the result
0:49:22even though it's slight here it means it actually improves significantly because of the size of
0:49:26the data
0:49:28which is still not where i was hoping it would be for me
0:49:31but i think it contains
0:49:34i think the seeds of an answer that we should consider these audio spaces
0:49:39these embedding spaces for audio as i think
0:49:41something to consider
0:49:46maybe i can tell you a bit about the kind of errors the model was
0:49:49making it was making mistakes like "it's" was replaced by "its"
0:49:53"five" was replaced by "5"
0:49:57we agree they are different words
0:50:00"okay" was replaced by "OK" and that kind of mistake so it was mostly mistakes
0:50:05from the language model and not much from the acoustic model but
0:50:08nevertheless you need to train them jointly which i haven't
0:50:11and so there's work to do here
0:50:14okay so
0:50:16i'm going to stop now so these are the conclusions i hope i convinced you
0:50:21that these embedding spaces are very powerful the fact that
0:50:26you can
0:50:27take any kind of data whether it is discrete data like words or complex data
0:50:32like images or sounds and project it into a space where you can compare it where
0:50:36you can
0:50:36look at the nearest neighbours in that space or where you can make new
0:50:40operations
0:50:41on them like averages or
0:50:44subtractions and stuff like that
0:50:45this is a very powerful
0:50:49way to consider complex objects
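As a toy illustration of such operations, here is "arithmetic plus nearest neighbour" on a made-up embedding table (real embeddings come from trained models; the words and vectors below are invented):

```python
# Average / subtract vectors in an embedding space, then read the answer
# off with a nearest-neighbour search. All values are made up.
emb = {"paris": (0.9, 0.1), "france": (1.0, 0.0),
       "rome":  (0.1, 0.9), "italy":  (0.0, 1.0),
       "berlin": (0.85, 0.15), "germany": (0.95, 0.05)}

def nearest(v, table, exclude=()):
    d2 = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return min((w for w in table if w not in exclude), key=lambda w: d2(table[w], v))

# paris - france + italy lands near rome in this toy space; excluding the
# three query words is the usual convention in analogy evaluations.
v = tuple(p - f + i for p, f, i in zip(emb["paris"], emb["france"], emb["italy"]))
print(nearest(v, emb, exclude={"paris", "france", "italy"}))
```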
0:50:51we've tried it actually in many other applications
0:50:54i can tell you of a few of them which we talked about like
0:50:57music recommendations where we had this feature
0:51:02where you can upload your music and we're going to try to
0:51:05help you
0:51:07make playlists with it
0:51:08or try to encourage
0:51:11you to buy new music and stuff like that
0:51:14and we do that using
0:51:15not only the metadata but
0:51:16also the audio representation of the music
0:51:19so we've actually
0:51:20represented your music in these kinds of embedding spaces and looked around in that space
0:51:25we've done that for videos of course for languages and machine translation as i talked
0:51:29about and lately as i said trying to do things in speech recognition
0:51:33and i think there's even more to do
0:51:36and why not try these kinds of things for speaker verification or language
0:51:42identification i don't know but maybe next year
0:52:13thank you very much
0:52:17i would like to know what's wrong with linguistics
0:52:22so nothing is wrong with linguistics but i'm afraid of taking early decisions so for
0:52:27instance for speech when we take words and we represent them as sequences of
0:52:32phonemes there is often
0:52:33more than one representation more than one way to represent the word
0:52:36and you need linguists
0:52:37to decide what is the correct way or the correct ways
0:52:41and there are discrete decisions that you need to put in
0:52:44because that's how you're going to represent the audio you always do that
0:52:48and you are making early decisions some of which might be wrong i'd like to
0:52:52get rid of early wrong decisions
0:52:55for instance different transcribers transcribe things differently
0:53:09and i think that's wrong
0:53:12any comments about that
0:53:18what about video i mean you probably must work on that now
0:53:24so if i hear you correctly
0:53:27you're asking what about videos
0:53:30we have people working on that so we have YouTube which is part
0:53:34of Google and contains a few videos
0:53:41and we have a big group trying these kinds of approaches for YouTube so
0:53:44i cannot speak
0:53:46for them but i know they have good results
0:53:52so when you trained you made this distinction between the word and the
0:53:59acoustics what's the difference between this and sequence training
0:54:03people do
0:54:05this kind of thing what do you mean basically one sees the whole sentence
0:54:09all at once oh i see sure so
0:54:12you could use a recurrent net instead of using a convolutional net over the acoustics
0:54:17and you get
0:54:19pluses and
0:54:20minuses i would say the
0:54:22plus about using a recurrent net is that you don't need to decide a priori
0:54:27what's the maximum size
0:54:28the minus is that
0:54:30it may learn more than what you want
0:54:32your representation is less constrained than a model where you decide so we are
0:54:38actually trying with LSTMs now for that so i'm not saying it's wrong
0:54:42it's a good idea
0:54:44these experiments were done with a convnet
0:54:46go ahead and use your recurrent net
0:54:56i think my question is in the same direction he was
0:55:00mentioning video but what about sentences are you able to represent sentences sequences of words in
0:55:07this space so you know the current line of work by my colleagues Quoc Le
0:55:12and Ilya Sutskever
0:55:15who are actually trying now to do that kind of thing so they use
0:55:19LSTMs or recurrent nets i'm not going to go into details of how it works
0:55:23but you first read
0:55:28some input about your sentence it could be the sentence in another language or it
0:55:32could be a video or it could be the audio whatever it is
0:55:34and then you output a sentence
0:55:36and you train it all to output the right answer so that you
0:55:40are actually reasoning about sentences
0:55:42so it's early work so far but i
0:55:45hope it's going to work
0:55:51you want me to ask about the numbers on the board i'll pass on that
0:55:57my question is
0:56:00whether the supervised methods you showed
0:56:03can somehow be made unsupervised
0:56:07so it's hard to see the distinction between the two then
0:56:12when you train your embedding spaces using only sentences is that supervised data or unsupervised
0:56:18data i mean it's sentences that
0:56:20exist in the world
0:56:22but you don't need people to label them they appear
0:56:27by themselves on the web
0:56:29i don't know if this is supervised or not
0:56:34the distinction is not clear to me
0:56:37you have to tell me more
0:56:42what i might be getting at is that when you get the sentences
0:56:47a human generated what you are showing so
0:56:51it's supervised in the sense that you say these are correct english sentences
0:56:54yes so what i'm getting at is that in the unsupervised case i just give
0:57:01you data as it is but when you do unsupervised clustering
0:57:06things look similar because
0:57:09that's how the world pictures them
0:57:11in the supervised case you said this is a word this is a word it's given it's
0:57:16kind of something to guide you along and maybe one question would be if you start throwing
0:57:21things in there these are questions
0:57:24how to use it almost as in clustering how to use
0:57:27unsupervised data
0:57:28i think the hope of unsupervised learning and i do believe that we need a
0:57:31lot of work in that field it's crucial is to find
0:57:35structure in the world
0:57:38the things that happen in the world happen with some structure maybe
0:57:41randomly but with some distribution
0:57:43and you want to
0:57:45constrain the space where you're going to operate with these objects these embedding spaces
0:57:50or
0:57:51any other hidden representation
0:57:53such that they take into account that structure so that it is easier
0:57:57to say well these two things
0:57:59are nearby because in that structure it doesn't go the other way around you cannot go
0:58:03from right to left things only go in that direction so you compare them like that
0:58:07that's what you want to use your unsupervised data for so for instance you
0:58:12can take raw audio
0:58:14and try to learn a representation of the audio in a compact way just by looking
0:58:17at audio as long as it's audio
0:58:19of the kind you will see later so not just random
0:58:23audio but maybe people talking but without
0:58:26understanding what they say or images
0:58:29that exist but without labels
0:58:31or again text that has been written in your language
0:58:34but you don't need to know what that text is about or what
0:58:37this image is about as long as it's a set of images that
0:58:41are valid
0:58:41in the sense that other images you would see come from the same distribution
0:58:45so it's very useful it's a hard task
0:58:48but we need it a lot
0:58:55so you are trying to recognize out-of-vocabulary words using this embedding
0:59:00can you comment on how successful like the gain you have obtained by
0:59:07combining the two models
0:59:09in terms of recognizing out-of-vocabulary words
0:59:13so the first model was only trained on recognizing words that were known
0:59:17in the second one
0:59:19i used it
0:59:20on our test set which contains ten times more different words most of the words
0:59:24in the test set were not in the training set so the
0:59:27decoder results i gave in terms of word error rate were on a
0:59:32vocabulary that was
0:59:34more than ten times bigger than the training set
0:59:38and it was using this letter representation so is that what you meant
0:59:43i mean were you successful because i think you looked at the lattice
0:59:49so it's out-of-vocabulary from the training set but it's not solving the real task which
0:59:54i'm sure you are
0:59:55interested in
0:59:56which is out-of-vocabulary with respect to the test set
0:59:59that is a word that is not even in my
1:00:04decoder vocabulary and i'd like to be able to reason about it
1:00:07so i haven't tried that and i think it's a more
1:00:10interesting task
1:00:14[question from the audience, mostly inaudible]
1:00:48which is the first part i talked about people started working on that
1:00:51years ago yes but it's a hard task yes
1:01:17i agree i haven't done it but i guess videos would be the best
1:01:21way to see that where you have audio and
1:01:24images but i have not personally worked on that though i know people
1:01:28are working on that and on other data
1:01:31yes i think so