0:00:15 | great so it's a pleasure |
---|
0:00:18 | to talk here today indeed i |
---|
0:00:20 | followed your |
---|
0:00:24 | community for a couple of years about ten years ago and then moved on |
---|
0:00:27 | to doing a few different things but i might come back to it who knows |
---|
0:00:31 | i was very interested in what i've seen during the week |
---|
0:00:35 | so thanks for inviting me |
---|
0:00:37 | so today i'll talk about a few |
---|
0:00:41 | projects related to embedding spaces i'm going to |
---|
0:00:45 | tell you what they are first and what we can do with them but |
---|
0:00:50 | pay attention to the fact that this is not only my work there's plenty of |
---|
0:00:54 | people that have worked on that with me |
---|
0:00:56 | and i'd like to thank them first |
---|
0:01:01 | so |
---|
0:01:02 | embedding spaces what are they useful for in fact it started |
---|
0:01:07 | like a decade ago when people started to try to think |
---|
0:01:12 | how can we represent |
---|
0:01:14 | discrete objects in a continuous space in such a way that it becomes useful to |
---|
0:01:19 | manipulate them |
---|
0:01:20 | like if you think of words |
---|
0:01:23 | words are difficult to manipulate because |
---|
0:01:26 | you cannot compare two words easily at least |
---|
0:01:30 | in terms of mathematics there is no comparator so how can you represent words such that |
---|
0:01:35 | then after that you can |
---|
0:01:36 | manipulate them and compare them |
---|
0:01:41 | so these are the embedding spaces where we're gonna project words |
---|
0:01:47 | once we have projected words into these spaces |
---|
0:01:49 | what about projecting anything else |
---|
0:01:51 | like images |
---|
0:01:53 | or speech or |
---|
0:01:55 | music |
---|
0:01:56 | so how can we do that that's gonna be the second part |
---|
0:02:00 | and we'll see that once we |
---|
0:02:02 | can manipulate complex objects in these spaces |
---|
0:02:05 | then because you |
---|
0:02:07 | you learn |
---|
0:02:09 | semantic representations of these words you can actually |
---|
0:02:13 | try to |
---|
0:02:16 | discover things in complex objects like images that you've never seen before we'll see |
---|
0:02:21 | how we can do that |
---|
0:02:22 | and i'll |
---|
0:02:24 | briefly describe at the end |
---|
0:02:26 | some recent work we've done trying to |
---|
0:02:29 | do similar things from images but now applied to speech |
---|
0:02:32 | and vice versa |
---|
0:02:36 | so let's |
---|
0:02:38 | let's start with what i mean by embedding so |
---|
0:02:41 | so here i depict a three d embedding space but of course in general |
---|
0:02:46 | it's more like a one hundred two hundred one thousand |
---|
0:02:49 | dimensional embedding space |
---|
0:02:51 | and |
---|
0:02:52 | think of this space as a |
---|
0:02:54 | real vector space |
---|
0:02:58 | where each point |
---|
0:02:59 | could be the position of a discrete object like a word so |
---|
0:03:06 | here we have let's say the position of this word |
---|
0:03:10 | here we have the position of the word obama |
---|
0:03:12 | and here |
---|
0:03:13 | like here the positions of dolphin seaworld and paris |
---|
0:03:17 | and |
---|
0:03:19 | we want to learn where to put these words and basically at the beginning |
---|
0:03:23 | we just don't know |
---|
0:03:24 | so we're gonna |
---|
0:03:25 | find a random position for each of the words |
---|
0:03:28 | of the dictionary that we're given |
---|
0:03:31 | and then we want to modify the positions move them around and at some point |
---|
0:03:35 | we want it to be such that |
---|
0:03:38 | nearby words so the words nearby the position of the word dolphin |
---|
0:03:42 | should have similar meanings or at least related meanings |
---|
0:03:46 | so you'd like seaworld and dolphin to be not far |
---|
0:03:48 | but relatively far from say paris and obama |
---|
0:03:53 | and the hope is that if we can achieve that it's gonna be useful |
---|
0:03:57 | for manipulating these words afterwards so how can we do that |
---|
0:04:02 | so |
---|
0:04:04 | about ten years ago my brother yoshua so just for those of you |
---|
0:04:08 | who don't know there are two bengios |
---|
0:04:10 | you invited me not my brother |
---|
0:04:17 | both of us work in deep learning |
---|
0:04:20 | so |
---|
0:04:22 | and |
---|
0:04:22 | so about ten years ago he |
---|
0:04:26 | and |
---|
0:04:27 | described a project where he could learn such embeddings |
---|
0:04:31 | and here is how he started |
---|
0:04:34 | this thing |
---|
0:04:37 | so |
---|
0:04:38 | he was using neural networks so this is a neural network where we have the |
---|
0:04:43 | inputs and you have layers |
---|
0:04:45 | connected to each other and at the end you have the output layer |
---|
0:04:48 | and the goal was to try to learn a representation for words so what he |
---|
0:04:53 | did |
---|
0:04:53 | was take |
---|
0:04:54 | sentences |
---|
0:04:55 | like you can capture sentences from wikipedia from the web from anywhere really |
---|
0:04:59 | where you can just grab sentences |
---|
0:05:02 | and your goal is to try to |
---|
0:05:04 | find a representation of words |
---|
0:05:07 | such that |
---|
0:05:09 | it will be easy to try to predict |
---|
0:05:12 | if i show you a few words can you predict what will be the next word |
---|
0:05:15 | so you put |
---|
0:05:17 | say these four words as input of the model |
---|
0:05:20 | you crunch them and at the end you predict a word in your dictionary and |
---|
0:05:24 | there's one unit for each of the words of your dictionary |
---|
0:05:27 | and you try to predict that the next word will be cheese |
---|
0:05:31 | in that case |
---|
0:05:32 | now how do you represent the words |
---|
0:05:34 | you represent them as a vector in |
---|
0:05:37 | this d dimensional space |
---|
0:05:39 | which at the beginning will be just a random vector |
---|
0:05:42 | so this is what |
---|
0:05:44 | we call the embedding space |
---|
0:05:46 | so think of it as a big lookup table or a |
---|
0:05:49 | big matrix where each line |
---|
0:05:51 | is the representation of a word |
---|
0:05:53 | and if you see the sequence the cat eats you're gonna just look at the |
---|
0:05:57 | word the |
---|
0:05:59 | look up the word cat each of them has a vector representation you just |
---|
0:06:03 | put that as input of your model |
---|
0:06:05 | and then you pass it through your model and you predict the next word |
---|
0:06:09 | and the hope is that if you do that often |
---|
0:06:11 | and you |
---|
0:06:12 | as you might imagine you can feed the model lots of such data since it's |
---|
0:06:16 | easy to get |
---|
0:06:17 | you will find good representations of words |
---|
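To make the idea concrete, here is a minimal sketch of such a model: an embedding lookup table feeding a small network that predicts the next word. This is illustrative PyTorch, not the original system; the vocabulary size, embedding dimension, and layer sizes are all assumptions.

```python
# Minimal sketch of an embedding table + next-word prediction network.
import torch
import torch.nn as nn

class TinyNeuralLM(nn.Module):
    def __init__(self, vocab_size=100_000, dim=50, context=4, hidden=256):
        super().__init__()
        # the "big lookup table": one d-dimensional vector per word,
        # initialized at random and moved around during training
        self.emb = nn.Embedding(vocab_size, dim)
        self.hidden = nn.Linear(context * dim, hidden)
        self.out = nn.Linear(hidden, vocab_size)   # one output unit per word

    def forward(self, context_ids):                # context_ids: (batch, context)
        e = self.emb(context_ids)                  # (batch, context, dim)
        h = torch.tanh(self.hidden(e.flatten(1)))
        return self.out(h)                         # scores for the next word

model = TinyNeuralLM()
logits = model(torch.randint(0, 100_000, (8, 4)))      # e.g. four context words
loss = nn.functional.cross_entropy(logits, torch.randint(0, 100_000, (8,)))
loss.backward()   # gradients also flow into the embedding table
```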
0:06:23 | so the first time he did it it was very painful very slow machines ten years |
---|
0:06:28 | ago were very slow the dictionary was very small |
---|
0:06:31 | it wasn't very useful but since then things have improved the model has |
---|
0:06:35 | been simplified we have more data and more gpus |
---|
0:06:39 | and these things start to work quite well |
---|
0:06:42 | here is an example of an embedding space that we trained |
---|
0:06:46 | about two years ago i think |
---|
0:06:48 | and you don't have to try to see well |
---|
0:06:51 | what's in that space |
---|
0:06:53 | we can pick a word |
---|
0:06:54 | so like here i pick the word apple |
---|
0:06:56 | and look in the space what are the nearest words say in the |
---|
0:07:00 | euclidean space so i look |
---|
0:07:02 | at the positions of all the words i |
---|
0:07:04 | sort them with respect to the distance to the target word |
---|
0:07:07 | and i look at the other words and what i see is that |
---|
0:07:10 | the other words are very semantically similar to apple so you have fruits apples melons |
---|
0:07:14 | peach |
---|
0:07:15 | and whatever |
---|
0:07:17 | if you take stab you get things that are less eh happy |
---|
0:07:22 | and around iphone |
---|
0:07:24 | you see stuff like ipad whatever |
---|
0:07:27 | so |
---|
0:07:29 | so it does capture things and this was trained in an unsupervised way it's |
---|
0:07:34 | just show sentences |
---|
0:07:35 | and that's what you get |
---|
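A hedged sketch of the nearest-word query just described, assuming a trained embedding matrix `emb` of shape (vocab, dim) and a `vocab` dict mapping words to row indices (both hypothetical names):

```python
# Nearest neighbours of a word in euclidean distance, as in the apple example.
import numpy as np

def nearest(word, emb, vocab, k=5):
    ids = {i: w for w, i in vocab.items()}
    q = emb[vocab[word]]
    d = np.linalg.norm(emb - q, axis=1)             # distance to every word
    return [ids[i] for i in np.argsort(d)[1:k + 1]] # skip the word itself

# nearest("apple", emb, vocab) -> e.g. fruit words, per the talk
```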
0:07:39 | and this was with a fifty dimensional embedding so you don't even need to have |
---|
0:07:43 | a very large space this was about |
---|
0:07:45 | i think a hundred thousand word |
---|
0:07:47 | dictionary |
---|
0:07:49 | so there are a hundred thousand vectors here that are hidden in that space |
---|
0:07:54 | so |
---|
0:07:56 | with time these kinds of models have evolved and usually evolved here just means |
---|
0:08:01 | simplified so it was |
---|
0:08:03 | a complex architecture and it became actually much simpler now it's |
---|
0:08:08 | in fact almost a linear model |
---|
0:08:10 | just the embedding of the word and then a linear transformation and |
---|
0:08:15 | you try to predict |
---|
0:08:16 | another word and that's it |
---|
0:08:19 | the way it's done is again you take a sentence you randomly pick |
---|
0:08:24 | a word in that sentence and then you randomly pick another word that you're gonna |
---|
0:08:27 | try to predict |
---|
0:08:28 | so it's not anymore |
---|
0:08:30 | the next word and it's not anymore |
---|
0:08:32 | a window of words that helps you predict the next one it's |
---|
0:08:36 | a random word trying to predict another random word around it |
---|
0:08:39 | and it's just the fact that it's around it that makes it |
---|
0:08:42 | interesting because words tend |
---|
0:08:44 | to |
---|
0:08:46 | they tend to |
---|
0:08:47 | co-occur often together so |
---|
0:08:50 | so that's very simple now the code is actually available and it's very efficient |
---|
0:08:55 | to train so you can train your own embedding space |
---|
0:08:58 | in a matter of an hour or two of training so it's very efficient |
---|
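As far as one can tell from the description, this simplified model is the skip-gram idea: pick a random word, predict a random word near it. A minimal sketch under that assumption (sizes and names are illustrative, not the released code):

```python
# Skip-gram style pair sampling and prediction: a word predicts a nearby word.
import random
import torch
import torch.nn as nn

vocab_size, dim = 50_000, 100
emb_in = nn.Embedding(vocab_size, dim)   # embedding of the picked word
emb_out = nn.Linear(dim, vocab_size)     # linear map to predict the nearby word

def training_pair(sentence_ids, window=5):
    i = random.randrange(len(sentence_ids))                  # random word
    lo, hi = max(0, i - window), min(len(sentence_ids), i + window + 1)
    j = random.choice([k for k in range(lo, hi) if k != i])  # random word around it
    return sentence_ids[i], sentence_ids[j]

w, c = training_pair([4, 17, 293, 8, 51])
logits = emb_out(emb_in(torch.tensor([w])))
loss = nn.functional.cross_entropy(logits, torch.tensor([c]))
```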
0:09:04 | so |
---|
0:09:05 | so here we have an example where |
---|
0:09:08 | we took all the terms we saw in wikipedia and here are examples again |
---|
0:09:12 | of embedding spaces so these are the words nearby tiger shark |
---|
0:09:16 | or car these are words sorted or clustered according to their semantics |
---|
0:09:21 | and |
---|
0:09:22 | you see i don't know here all the food things |
---|
0:09:26 | all the reptiles et cetera so it captures semantics |
---|
0:09:30 | but it's actually even more |
---|
0:09:33 | striking than that |
---|
0:09:34 | because |
---|
0:09:36 | you can play some games with these embedded words so for instance |
---|
0:09:40 | if you take after training it the embedding space |
---|
0:09:45 | using this skip-gram model |
---|
0:09:47 | you look at the embedding positions of rome |
---|
0:09:50 | italy berlin and germany you look at where they are in the space |
---|
0:09:54 | and you can apply operators on that so you take the embedding position of |
---|
0:09:59 | rome and you subtract italy |
---|
0:10:02 | you add germany and what do you get berlin |
---|
0:10:05 | that means you can actually generalize fine so the vector that went from rome to |
---|
0:10:11 | italy is the same as the one that goes from berlin to germany because they |
---|
0:10:15 | have the same relation to each other |
---|
0:10:17 | and that's for semantics and you also have syntactic relations |
---|
0:10:21 | like hardest to harder or biggest to bigger |
---|
0:10:24 | with the same kind of argument |
---|
0:10:26 | and that's |
---|
0:10:27 | surprisingly working |
---|
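A hedged sketch of that vector arithmetic, reusing the hypothetical `emb` matrix and `vocab` dict from the earlier sketch:

```python
# rome - italy + germany ~ berlin, via cosine similarity in the embedding space.
import numpy as np

def analogy(a, b, c, emb, vocab, k=1):
    q = emb[vocab[a]] - emb[vocab[b]] + emb[vocab[c]]
    sims = emb @ q / (np.linalg.norm(emb, axis=1) * np.linalg.norm(q) + 1e-9)
    ids = {i: w for w, i in vocab.items()}
    best = [i for i in np.argsort(-sims) if ids[i] not in (a, b, c)][:k]
    return [ids[i] for i in best]

# analogy("rome", "italy", "germany", emb, vocab) -> hopefully ["berlin"]
```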
0:10:29 | you can do similar tricks for translation you can see that you can train |
---|
0:10:36 | a |
---|
0:10:38 | separate |
---|
0:10:39 | embedding space for separate languages |
---|
0:10:41 | and you'll find that these relations actually work from |
---|
0:10:45 | language to language and you can use these kinds of things to help |
---|
0:10:48 | translation and there are |
---|
0:10:50 | tons of tricks that you can see in the literature nowadays using these embedding |
---|
0:10:54 | spaces so they are very |
---|
0:10:55 | interesting to manipulate |
---|
0:11:00 | before i forget feel free to ask any questions whenever you want of course |
---|
0:11:06 | so what else can we do with these embedding spaces so |
---|
0:11:12 | about four five years ago i was actually interested at google in |
---|
0:11:15 | images |
---|
0:11:17 | and i wanted to try to see if i could train a model |
---|
0:11:20 | to annotate images but being at google i'm not |
---|
0:11:24 | interested in trying to label images out of a hundred classes of course i'm more interested in |
---|
0:11:29 | the large scale setting so when you have six hundred thousand classes |
---|
0:11:33 | that's more interesting for me |
---|
0:11:35 | but at that time at least |
---|
0:11:37 | the research you'd see the |
---|
0:11:39 | computer vision literature was more interested or focusing on tasks where you have a |
---|
0:11:44 | hundred two hundred |
---|
0:11:45 | up to a thousand classes but that's |
---|
0:11:47 | that's about it |
---|
0:11:50 | so can we go further than that |
---|
0:11:54 | so of course image annotation is a hard task there's plenty of problems with |
---|
0:11:59 | that you can think of the fact that objects very often |
---|
0:12:05 | look alike and actually this problem |
---|
0:12:07 | is even worse when the number of classes grows |
---|
0:12:10 | so when you have only two classes it's very easy to discriminate but if you |
---|
0:12:13 | have a |
---|
0:12:14 | thousand classes or a hundred thousand classes |
---|
0:12:16 | you can be sure that |
---|
0:12:18 | two of these classes are very visually similar to each other |
---|
0:12:22 | so the problems are |
---|
0:12:24 | becoming harder as the number of classes grows |
---|
0:12:28 | there's plenty of other problems related to computer vision which i won't go into |
---|
0:12:32 | details but |
---|
0:12:33 | let me just |
---|
0:12:35 | summarize how computer vision was done |
---|
0:12:37 | about four five years ago |
---|
0:12:39 | things have evolved a lot since then but at that time |
---|
0:12:44 | you had two steps |
---|
0:12:45 | feature extraction and classification |
---|
0:12:48 | so first you would extract features and the way you extract features |
---|
0:12:53 | was |
---|
0:12:55 | say very similar to how you would extract features from voice |
---|
0:13:00 | you would find a good representation of |
---|
0:13:04 | patches in the image and then |
---|
0:13:05 | aggregate them in some way and that would be your representation |
---|
0:13:11 | once you had that you would then try to classify using the best classifier |
---|
0:13:15 | that was available at the time |
---|
0:13:17 | which was an svm so you'd train an svm for each of your classes |
---|
0:13:21 | and hope |
---|
0:13:21 | that it would scale well |
---|
0:13:23 | eh it didn't |
---|
0:13:25 | so one of the |
---|
0:13:27 | problems you had was |
---|
0:13:29 | and |
---|
0:13:30 | that |
---|
0:13:31 | very similar images would give rise to |
---|
0:13:33 | very different labels |
---|
0:13:35 | that would be completely unrelated semantically so for instance these are three |
---|
0:13:40 | whoops |
---|
0:13:41 | these are |
---|
0:13:42 | three images |
---|
0:13:44 | from a video that |
---|
0:13:46 | are like a few seconds apart |
---|
0:13:49 | it's a shark or something like a shark and these are the labels that |
---|
0:13:53 | your classical svm classifier would give and you see that this is very semantically different |
---|
0:13:58 | from that |
---|
0:13:59 | here it's airliner for those who don't see and here it's tiger shark |
---|
0:14:03 | but the images are quite similar to our eye so something's |
---|
0:14:07 | not working somewhere |
---|
0:14:10 | and that's the kind of problem we'd like to be able to solve that would |
---|
0:14:13 | be if i show you two similar images i'd like to have two similar labels |
---|
0:14:16 | at least |
---|
0:14:18 | at least semantically |
---|
0:14:20 | so why isn't that working |
---|
0:14:23 | so one argument about that is that when we try to classify images |
---|
0:14:28 | we impose no relation between our labels all the labels you can |
---|
0:14:34 | think of them as being at the corners |
---|
0:14:37 | of the hypercube |
---|
0:14:38 | whose dimension is the number of classes |
---|
0:14:40 | where the edges are the number of classes |
---|
0:14:43 | and so there's no more relation between |
---|
0:14:45 | a shark and an airliner than there is a relation between |
---|
0:14:50 | a shark and a tiger shark even though there is a semantic relation between them we don't |
---|
0:14:55 | capture that with the way we |
---|
0:14:57 | train our classifiers and that's probably bad |
---|
0:15:01 | so what if instead of having the labels at these |
---|
0:15:04 | corners here |
---|
0:15:05 | what about putting them inside |
---|
0:15:08 | i guess |
---|
0:15:10 | so now the labels are inside this hypercube like in these embedding spaces |
---|
0:15:14 | i was talking about earlier |
---|
0:15:17 | and it's fine because first of all |
---|
0:15:20 | the size of this hypercube does not anymore depend |
---|
0:15:24 | on the number of labels classes |
---|
0:15:26 | you can have way more classes than the actual size of your |
---|
0:15:31 | space |
---|
0:15:32 | because now it's a real space |
---|
0:15:34 | and you can put your labels in such a way that nearby labels |
---|
0:15:37 | have nearby meanings |
---|
0:15:39 | and |
---|
0:15:40 | what happens is that if you make a mistake by picking the wrong label |
---|
0:15:44 | hopefully you're gonna pick a label that was nearby hence has a semantic meaning not |
---|
0:15:49 | too far so |
---|
0:15:50 | hopefully that would work |
---|
0:15:52 | and there's an even more interesting thing that could happen you could |
---|
0:15:55 | put more labels in the space than the ones for which you have images |
---|
0:15:58 | and maybe you'd be able to label an image of a topic you've never |
---|
0:16:02 | seen any image of before just because it is semantically related |
---|
0:16:08 | so we tried to see if that |
---|
0:16:10 | is just a dream |
---|
0:16:12 | so about four years ago we started working on this project and |
---|
0:16:17 | of course we were interested in these embedding spaces |
---|
0:16:22 | and |
---|
0:16:23 | what we tried was basically to merge the idea of having an image classifier and an embedding |
---|
0:16:27 | space |
---|
0:16:28 | and |
---|
0:16:30 | so |
---|
0:16:31 | we had this project called wsabie |
---|
0:16:34 | where what we want to do is |
---|
0:16:36 | try to learn jointly how to take an image |
---|
0:16:39 | and project its input representation into |
---|
0:16:43 | an embedding space so you would have a |
---|
0:16:46 | projection from the representation of that image into a |
---|
0:16:49 | point |
---|
0:16:49 | in that space and in the same space you'd have points that represent |
---|
0:16:54 | labels |
---|
0:16:55 | or classes |
---|
0:16:56 | like dolphin and obama and eiffel tower |
---|
0:16:58 | and the goal was to jointly try to find the positions of the labels |
---|
0:17:03 | and try to find the mapping from the image |
---|
0:17:05 | to the label space and if you can do that jointly you hopefully solve |
---|
0:17:10 | the |
---|
0:17:11 | classification task jointly |
---|
0:17:13 | and have a good embedding space for words |
---|
0:17:18 | and |
---|
0:17:19 | being at google |
---|
0:17:20 | i should tell you that everything i see looks like a ranking problem |
---|
0:17:26 | and so i obviously saw that this is a ranking problem |
---|
0:17:30 | where in that case the goal is if you show me an image i'm gonna |
---|
0:17:34 | try to rank |
---|
0:17:35 | the labels such that the nearest label is the one that corresponds to the image |
---|
0:17:40 | sounds good that's a correct ranking problem and you also want to make sure |
---|
0:17:44 | that |
---|
0:17:45 | for similar labels if you are to make a mistake let's make sure that the mistake |
---|
0:17:48 | is semantically reasonable so that |
---|
0:17:51 | if you were to click on the word |
---|
0:17:52 | it would be a reasonable word even if it's not the perfect word |
---|
0:17:57 | so we are going to train our model with a ranking loss |
---|
0:18:02 | in mind |
---|
0:18:04 | okay |
---|
0:18:06 | it was actually a very simple model |
---|
0:18:08 | again just a linear mapping so |
---|
0:18:11 | what we had was |
---|
0:18:14 | so this is prior to the deep learning |
---|
0:18:18 | era in some sense |
---|
0:18:19 | at least in computer vision |
---|
0:18:23 | so we worked with sift features |
---|
0:18:25 | the |
---|
0:18:26 | the mfcc of the |
---|
0:18:29 | image world if you like |
---|
0:18:31 | and what we were looking for was just a linear mapping between these features |
---|
0:18:35 | of an image and the embedding space so you take your features |
---|
0:18:39 | that's x the representation of an image |
---|
0:18:42 | and you just multiply it by a matrix v |
---|
0:18:44 | such that the result is another vector that is hopefully in the |
---|
0:18:48 | embedding space |
---|
0:18:49 | and again like for the embedding space you have a vector |
---|
0:18:53 | representation of each of your words |
---|
0:18:56 | which in that case are the labels |
---|
0:18:58 | of the image classification task |
---|
0:19:00 | and you want to find |
---|
0:19:03 | w the representation of your labels and v the mapping between image features and |
---|
0:19:08 | the embedding space |
---|
0:19:09 | such that it optimizes a task what is the task |
---|
0:19:12 | we are going to define a similarity between two points in the space and in |
---|
0:19:16 | this case the similarity function between |
---|
0:19:18 | an image |
---|
0:19:19 | x and a label i |
---|
0:19:21 | is just |
---|
0:19:22 | a dot product |
---|
0:19:24 | in the embedding space so you take the image you project it into the embedding |
---|
0:19:27 | space |
---|
0:19:28 | you multiply it by the label |
---|
0:19:30 | with which you're considering to label that image |
---|
0:19:34 | and you want that score to be high |
---|
0:19:36 | for the correct label and to |
---|
0:19:38 | be low for incorrect labels |
---|
0:19:41 | and so you're going to put some constraints because |
---|
0:19:45 | we're doing machine learning we need to put some regularization |
---|
0:19:49 | such that |
---|
0:19:53 | the values of the embedding space are basically constrained so you control the norm |
---|
0:19:58 | of |
---|
0:19:59 | both the norm of the |
---|
0:20:02 | mapping and the norm of the embedding of the |
---|
0:20:05 | labels itself |
---|
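In symbols, the similarity is f(x, i) = w_i · (V x). A minimal numpy sketch under assumed sizes (the feature dimension, embedding dimension, and label count here are illustrative):

```python
# Bilinear scoring: project the image features, dot-product with a label embedding.
import numpy as np

d_img, d_emb, n_labels = 4096, 100, 100_000
V = 0.01 * np.random.randn(d_emb, d_img)       # image -> embedding mapping
W = 0.01 * np.random.randn(n_labels, d_emb)    # one embedding row per label

def score(x, i):
    return W[i] @ (V @ x)   # f(x, i) = w_i . (V x), high for the correct label
```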
0:20:08 | okay so as i said |
---|
0:20:12 | we are going to try to solve this problem using a ranking loss so |
---|
0:20:15 | what do i mean by that |
---|
0:20:18 | well |
---|
0:20:19 | we are going to |
---|
0:20:21 | construct a loss |
---|
0:20:23 | to minimize such that |
---|
0:20:26 | for every image in our training set that's this part here |
---|
0:20:30 | and for every correct label of that image an image could have more than one |
---|
0:20:34 | label |
---|
0:20:35 | that's often the case |
---|
0:20:37 | and for every incorrect label of that image |
---|
0:20:41 | and that's a lot |
---|
0:20:42 | so that sum is big |
---|
0:20:44 | you want to make sure that |
---|
0:20:46 | this score |
---|
0:20:48 | so the score of the correct label should be higher than the score of any |
---|
0:20:51 | incorrect label plus a margin |
---|
0:20:54 | what this basically says is it's a hinge loss |
---|
0:20:58 | so |
---|
0:20:59 | not only do you want the score of the correct label to be higher than any other |
---|
0:21:02 | one you want to make sure that it's higher by a margin so that it generalizes better |
---|
0:21:06 | here the margin is one but it's a constant you can put what you want |
---|
0:21:11 | and if that's not the case then you pay a price |
---|
0:21:14 | and that's the price to pay and you want to minimize that price |
---|
0:21:18 | and you can optimize it very efficiently by stochastic gradient descent by simply |
---|
0:21:22 | sampling an image from your training set sampling |
---|
0:21:27 | a positive label |
---|
0:21:28 | from the set of correct labels of that image |
---|
0:21:31 | and then simply any other label which |
---|
0:21:33 | most likely will be an incorrect label |
---|
0:21:37 | you have your triplet then you compute your loss |
---|
0:21:39 | and if |
---|
0:21:41 | if the loss is positive you change the parameters of |
---|
0:21:46 | your model v and w here |
---|
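A hedged sketch of one such stochastic step, reusing V, W, and the dot-product score from the previous sketch; this illustrates the triplet hinge update, not the paper's exact code:

```python
# One SGD step on the triplet hinge loss: image, positive label, sampled label.
import numpy as np

def sgd_step(x, pos_labels, V, W, lr=0.01, margin=1.0):
    pos = np.random.choice(pos_labels)        # a correct label of the image
    neg = np.random.randint(W.shape[0])       # any other label, most likely incorrect
    if neg in pos_labels:
        return                                # unlucky draw, skip this step
    e = V @ x                                 # image projected into the embedding space
    if margin - W[pos] @ e + W[neg] @ e > 0:  # hinge violated: pos not ahead by the margin
        diff = W[pos] - W[neg]
        W[pos] += lr * e                      # pull the correct label towards the image
        W[neg] -= lr * e                      # push the sampled label away
        V += lr * np.outer(diff, x)           # move the image towards pos, away from neg
    # (the norms of V and of the label embeddings are also constrained in practice)
```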
0:21:49 | so that's good and it works actually |
---|
0:21:52 | but you can actually do better |
---|
0:21:54 | i'm not going to go into details about how you can do better but |
---|
0:21:56 | think of the |
---|
0:21:57 | following problem |
---|
0:22:00 | what you want is at least |
---|
0:22:02 | again |
---|
0:22:03 | when you want to rank a hundred thousand or more objects |
---|
0:22:08 | what you want |
---|
0:22:09 | is that in the top ranking |
---|
0:22:12 | positions of the objects you're going to return there's something interesting |
---|
0:22:15 | so |
---|
0:22:17 | if i show you two functions that are ranking labels |
---|
0:22:21 | one of them returns a |
---|
0:22:23 | correct label in position |
---|
0:22:25 | one and another correct label in position |
---|
0:22:28 | one thousand |
---|
0:22:29 | you should find it more interesting than another function that returns |
---|
0:22:34 | the two correct labels in positions five hundred and five hundred one |
---|
0:22:37 | even though in terms of ranking they have the same value |
---|
0:22:40 | in terms of the user using it you try to have at least one label |
---|
0:22:44 | returned in the top positions |
---|
0:22:45 | so you want to favour the top of the ranking you want to put a |
---|
0:22:48 | lot of interest there |
---|
0:22:49 | and there are ways to modify |
---|
0:22:52 | these kinds of losses to favour the top of the ranking |
---|
0:22:56 | i won't go into the details they are in the |
---|
0:22:59 | paper but |
---|
0:23:00 | it actually makes a huge difference in terms of the perception of the user |
---|
0:23:04 | because |
---|
0:23:04 | at least at the top |
---|
0:23:06 | of the ranking you see things that make sense |
---|
0:23:11 | so let's look at |
---|
0:23:14 | these experiments the first experiments we've done |
---|
0:23:18 | we had |
---|
0:23:20 | so at that time there was |
---|
0:23:22 | a database of images in the computer vision literature that started appearing called |
---|
0:23:27 | imagenet |
---|
0:23:28 | it's still there it's growing |
---|
0:23:31 | at that time there were sixteen thousand labels in the imagenet |
---|
0:23:35 | corpus now there are more than twenty thousand |
---|
0:23:38 | but |
---|
0:23:39 | nobody was actually using the corpus as it is |
---|
0:23:44 | people had selected about a thousand labels and they were only playing with one |
---|
0:23:49 | thousand and that's still the case unfortunately |
---|
0:23:51 | almost nobody plays with all the corpus that is actually available and that contains |
---|
0:23:56 | millions of images |
---|
0:23:57 | at that time about five million images |
---|
0:24:00 | i think now it's more like ten million images |
---|
0:24:02 | so that's good |
---|
0:24:03 | but nobody's using it |
---|
0:24:06 | so we considered that a small dataset |
---|
0:24:09 | and we looked at a bigger one |
---|
0:24:11 | which came from the web so we looked at many images from the web |
---|
0:24:16 | and for the web data we don't really have any labels the way we |
---|
0:24:20 | get our labels is by |
---|
0:24:22 | looking at what people do on image search on google image search you type a |
---|
0:24:26 | query that people query often you see images |
---|
0:24:29 | you click on an image |
---|
0:24:30 | now if |
---|
0:24:31 | many of you click on the same image for the same query we are going |
---|
0:24:35 | to consider that this is a good |
---|
0:24:37 | that the query is a good label for |
---|
0:24:40 | that image |
---|
0:24:41 | so it's very noisy |
---|
0:24:43 | a lot of things happen here |
---|
0:24:44 | but it's usually reasonable |
---|
0:24:46 | and you can collect as many as you want |
---|
0:24:49 | so this is a very small set of what was actually available but still |
---|
0:24:53 | there were more than a hundred thousand labels in our set so that was interesting |
---|
0:24:58 | so we actually published a |
---|
0:25:00 | paper showing these results and i want to emphasize the fact that |
---|
0:25:04 | we had |
---|
0:25:05 | a one percent |
---|
0:25:06 | accuracy |
---|
0:25:08 | on that data so ninety nine percent wrong |
---|
0:25:11 | and it was published |
---|
0:25:16 | so i think that's good |
---|
0:25:17 | we also hoped |
---|
0:25:21 | so this is anecdotal |
---|
0:25:23 | so let me summarize the thing by saying that this algorithm was better than the |
---|
0:25:27 | many things we tried this is just a summary |
---|
0:25:30 | so these numbers are higher than the other ones |
---|
0:25:33 | we show two types of metrics precision at one which is accuracy |
---|
0:25:37 | and precision at ten |
---|
0:25:38 | which is how many good labels you return at the top so that's more like |
---|
0:25:42 | a ranking metric |
---|
0:25:44 | and that looks more like what you see on google you look at the |
---|
0:25:47 | page and you're happy if you see the document you want at the top |
---|
0:25:53 | of course if you count more than one guess then the numbers grow and |
---|
0:25:57 | everything gets better but |
---|
0:25:58 | still the numbers are small and so the question is |
---|
0:26:01 | is it any useful anyway because it's so small |
---|
0:26:04 | and it turns out it is so first of all let's look at the embedding |
---|
0:26:07 | space again it's always fun to look at |
---|
0:26:09 | what happened after we've trained the model |
---|
0:26:11 | so remember the model was trained with |
---|
0:26:14 | just pairs of image and label |
---|
0:26:17 | no relation between words no relation between images just pairs of image and label and |
---|
0:26:22 | we're gonna just look at |
---|
0:26:23 | where the labels are in the space no images yet |
---|
0:26:27 | so i look at the label barack obama and i look at the nearby |
---|
0:26:31 | labels out of the hundred thousand labels |
---|
0:26:34 | and these are the labels we see |
---|
0:26:35 | and so the nearest one is basically a spelling mistake because well people type anything |
---|
0:26:40 | on the web |
---|
0:26:42 | and the other ones are also very similar and then you see this one |
---|
0:26:45 | which |
---|
0:26:46 | i don't know what it is |
---|
0:26:49 | and then if you take beckham you see again similar things then |
---|
0:26:53 | interestingly you see semantic relations you see that there are soccer players |
---|
0:26:58 | happening not far |
---|
0:27:00 | maybe they look alike i don't know |
---|
0:27:03 | you also see things like translations so dolphin is near dauphin and delfin |
---|
0:27:08 | or similar objects like whale |
---|
0:27:11 | and |
---|
0:27:12 | you see like for eiffel tower you see |
---|
0:27:16 | either |
---|
0:27:17 | things not far or similar visually |
---|
0:27:21 | to the eiffel tower and all this has been learned in some sense i've never |
---|
0:27:24 | told the model that |
---|
0:27:26 | dolphin is like dauphin |
---|
0:27:29 | they are near just because they share similar images basically |
---|
0:27:32 | in the embedding space |
---|
0:27:34 | so that's nice but what about the actual task |
---|
0:27:37 | so here is |
---|
0:27:39 | a sample of four images |
---|
0:27:42 | from the test set |
---|
0:27:44 | all of them |
---|
0:27:45 | if i had to compute the score precision at one it would be zero basically |
---|
0:27:49 | i failed on all these images as expected i mean i fail ninety nine percent |
---|
0:27:54 | of the time so this is four of these ninety nine |
---|
0:27:58 | but the failures are gracious in some sense so these are supposed to |
---|
0:28:02 | be |
---|
0:28:03 | dolphin |
---|
0:28:04 | and these are the answers and you see the words that appear |
---|
0:28:09 | afterwards |
---|
0:28:09 | so dolphin here is in position thirty that's good |
---|
0:28:12 | here it's in position i don't know like eight |
---|
0:28:16 | but the other words around make sense maybe they're the wrong |
---|
0:28:20 | answer but at the end what we give |
---|
0:28:22 | would satisfy many humans and that's |
---|
0:28:25 | that's good just because they have actually very similar semantic meanings |
---|
0:28:30 | so we have the barack obama thing here |
---|
0:28:33 | we have a |
---|
0:28:35 | i was interested in the las vegas strip here because maybe you don't know |
---|
0:28:38 | but there's a copy of the eiffel tower in las vegas |
---|
0:28:40 | and so it actually made sense |
---|
0:28:43 | i was surprised |
---|
0:28:44 | so |
---|
0:28:47 | so that's interesting the way we make mistakes is now more interesting we still |
---|
0:28:51 | make a lot of mistakes but at least |
---|
0:28:53 | the answers make sense and that's better |
---|
0:28:56 | but so that was as of four years ago and |
---|
0:29:00 | what happened after that was |
---|
0:29:03 | the deep learning |
---|
0:29:05 | era started |
---|
0:29:06 | and everything changed in the image field |
---|
0:29:10 | like it did in the speech field i would say |
---|
0:29:13 | so |
---|
0:29:14 | now that's how we do image recognition |
---|
0:29:17 | so |
---|
0:29:18 | the way we do it is by taking an image and applying this |
---|
0:29:23 | deep network |
---|
0:29:24 | up until you finally take a decision using a |
---|
0:29:27 | softmax layer at the end of your deep architecture |
---|
0:29:30 | and the thing that works best these days is convolutional nets and |
---|
0:29:35 | for those of you who don't know what these are it's basically layers that look |
---|
0:29:39 | only at a small part of the image so there's a unit here that |
---|
0:29:44 | looks only at this part of the image and tries to |
---|
0:29:46 | get a value for this part |
---|
0:29:48 | but the function that gets this value is the same as the one that looks |
---|
0:29:53 | at this part of the image and this and this and this |
---|
0:29:55 | so we are actually convolving a function along the whole image |
---|
0:30:00 | and returning this convolution at the output of this layer |
---|
0:30:04 | and then we pool the answers locally so we look at the answers of that |
---|
0:30:08 | set of convolutions in a local patch |
---|
0:30:10 | and return something like the max or the mean |
---|
0:30:14 | what works usually is the max but you can try any pooling |
---|
0:30:19 | thing and you do that again layer after layer |
---|
0:30:22 | and at the top |
---|
0:30:23 | you do full connections and at the end |
---|
0:30:27 | you get an answer so it |
---|
0:30:28 | is a much more involved architecture it's very |
---|
0:30:32 | slow to train |
---|
0:30:34 | you need gpus and all that but |
---|
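For the curious, here is a minimal sketch of that kind of architecture in PyTorch (a toy 32x32 input and illustrative layer sizes, not the actual competition network):

```python
# Stacks of convolution + max pooling, then full connections and a softmax.
import torch
import torch.nn as nn

convnet = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5),  # same function convolved across the image
    nn.ReLU(),
    nn.MaxPool2d(2),                  # pool answers locally, keeping the max
    nn.Conv2d(16, 32, kernel_size=5),
    nn.ReLU(),
    nn.MaxPool2d(2),                  # layer after layer of conv + pool
    nn.Flatten(),
    nn.Linear(32 * 5 * 5, 128),       # full connections at the top
    nn.ReLU(),
    nn.Linear(128, 1000),             # one logit per class, softmax at the end
)

logits = convnet(torch.randn(1, 3, 32, 32))   # tiny 32x32 image for the sketch
probs = logits.softmax(dim=1)
```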
0:30:38 | i must say first of all they were developed about twenty five years ago so it's |
---|
0:30:42 | nothing new |
---|
0:30:43 | but only now do we have the data that shows how good they are because |
---|
0:30:47 | before there was not enough data and there was not enough |
---|
0:30:50 | machine power like gpus |
---|
0:30:52 | to actually train such a complex architecture so now it works |
---|
0:30:55 | and it actually works very well so the first time it was used on |
---|
0:31:00 | this competition called imagenet which is a competition to classify |
---|
0:31:05 | with a thousand |
---|
0:31:06 | labels |
---|
0:31:07 | it basically blew away the competition so all of the other competitors were using |
---|
0:31:13 | classical computer vision techniques and they were actually the best in their |
---|
0:31:16 | field |
---|
0:31:17 | and they were like ten percent away from the |
---|
0:31:19 | deep learning approach so |
---|
0:31:22 | it changed everything and now |
---|
0:31:24 | at least in the cvpr literature |
---|
0:31:27 | almost nobody is not using convolutional |
---|
0:31:30 | deep |
---|
0:31:31 | architectures |
---|
0:31:34 | maybe just a slide to say that we do use such things at |
---|
0:31:37 | google for real products so it's not just research |
---|
0:31:40 | for instance if you |
---|
0:31:42 | type queries like |
---|
0:31:44 | my photos of something |
---|
0:31:46 | we're gonna try to look in your own photos unlabeled |
---|
0:31:49 | and try to return |
---|
0:31:52 | your photos of |
---|
0:31:53 | sunsets here |
---|
0:31:55 | and it's done using the type of architecture |
---|
0:31:58 | that won this competition |
---|
0:31:59 | i must say that we actually |
---|
0:32:01 | well |
---|
0:32:03 | bought |
---|
0:32:03 | the authors of that paper |
---|
0:32:07 | that |
---|
0:32:08 | is geoff hinton |
---|
0:32:11 | alex krizhevsky and ilya sutskever |
---|
0:32:13 | they are now working at google so they helped us a bit |
---|
0:32:19 | it works |
---|
0:32:23 | they were very good hires |
---|
0:32:25 | okay let's continue now |
---|
0:32:29 | so let's go back to our embedding spaces and the fact that you can put |
---|
0:32:33 | a lot of things in an embedding space |
---|
0:32:37 | so on one side we have these embedding spaces that are very powerful because |
---|
0:32:40 | they capture the semantics of the labels of the dataset |
---|
0:32:44 | and we have these powerful deep learning architectures that |
---|
0:32:47 | are the best in class now |
---|
0:32:49 | so can we marry these two things in a way that would |
---|
0:32:53 | be useful |
---|
0:32:54 | and |
---|
0:32:55 | in fact what we found was that you can use these two and try to |
---|
0:32:58 | be able to label an image with a label that is not one of the ones |
---|
0:33:03 | that appear here anywhere and that's |
---|
0:33:06 | interesting because now |
---|
0:33:08 | even though this was trained on a thousand labels we can try to reason |
---|
0:33:11 | about say a hundred thousand labels even though we haven't seen ninety nine percent of the |
---|
0:33:17 | labels |
---|
0:33:21 | surprisingly it's actually very simple to do |
---|
0:33:24 | we started by doing something more complex but eventually we converged again |
---|
0:33:27 | on the simplest |
---|
0:33:29 | solution and here is how you do it |
---|
0:33:31 | so first |
---|
0:33:32 | obviously |
---|
0:33:33 | you train these two things |
---|
0:33:35 | separately you train your best deep learning architecture your image classifier |
---|
0:33:39 | and you train your best embedding model on labels |
---|
0:33:42 | the only thing that you require is that |
---|
0:33:44 | the labels for which you trained your deep |
---|
0:33:49 | architecture |
---|
0:33:50 | should be embedded in the space so if one of the labels is car |
---|
0:33:53 | make sure that car is here but that shouldn't be a problem because here you can |
---|
0:33:57 | put anything as long as you see text |
---|
0:33:59 | related to these labels |
---|
0:34:01 | so that was an easy |
---|
0:34:03 | requirement |
---|
0:34:05 | once you have that here is what you do |
---|
0:34:08 | you take an image |
---|
0:34:10 | and you compute |
---|
0:34:11 | so the |
---|
0:34:13 | the score of the deep learning |
---|
0:34:17 | model so the score of the deep learning model is actually the posterior probability of |
---|
0:34:21 | a label |
---|
0:34:22 | given |
---|
0:34:23 | the image and you have this vector of p of the label given the |
---|
0:34:27 | image |
---|
0:34:28 | you are going to compute all these scores you have a thousand of them |
---|
0:34:32 | but you are going to only take the top ones |
---|
0:34:36 | top ones could be the top thousand if you want but |
---|
0:34:38 | it's gonna be faster if you take the top ten |
---|
0:34:43 | and you are going |
---|
0:34:45 | to |
---|
0:34:47 | look at the labels corresponding to these top ones so suppose the top k labels |
---|
0:34:53 | contained |
---|
0:34:54 | these |
---|
0:34:55 | suppose these words are the top ten labels obtained so bear lion tiger |
---|
0:35:00 | et cetera |
---|
0:35:01 | you are going to look in the embedding space at the top ten |
---|
0:35:04 | labels you obtained here |
---|
0:35:05 | where they are and you are going to make an average of them in |
---|
0:35:09 | the embedding space |
---|
0:35:10 | but it's gonna be a weighted average of course and the weight will be how |
---|
0:35:13 | much you think it is the actual label |
---|
0:35:17 | so if you really think it's a lion |
---|
0:35:19 | the result of the weighted combination would be very near lion |
---|
0:35:24 | if you really think it's a bear it's gonna be near bear |
---|
0:35:27 | if you think it's between the bear and the lion so you obtain |
---|
0:35:31 | fifty percent bear fifty percent lion you're gonna be in a position between bear and lion |
---|
0:35:35 | like in the middle |
---|
0:35:36 | and that's what this |
---|
0:35:38 | thing says so you average the top labels you found in the embedding space and |
---|
0:35:43 | you find the position and that's where you should be now you look around here |
---|
0:35:47 | in the embedding space and look at the nearest labels |
---|
0:35:50 | they might be labels from the top thousand or they might be any other |
---|
0:35:53 | labels |
---|
0:35:53 | and that's your answer |
---|
0:35:55 | and because it can be any other label it can be |
---|
0:35:57 | labels of subjects you've never seen |
---|
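A hedged sketch of this convex-combination procedure (function and variable names are assumptions; `p` is the classifier's softmax over its thousand trained labels, `emb` the embedding matrix over the full label vocabulary):

```python
# Average the embeddings of the classifier's top labels, weighted by their
# probabilities, then return the nearest labels in the full embedding space.
import numpy as np

def convex_combination_predict(p, class_ids, emb, k=10, n_out=5):
    # p: (1000,) softmax output of the image classifier
    # class_ids: np.array mapping each trained class to its row in `emb`
    # emb: (100_000, dim) embeddings for the full label vocabulary
    top = np.argsort(-p)[:k]                  # top-k predicted labels
    w = p[top] / p[top].sum()                 # convex weights
    point = w @ emb[class_ids[top]]           # weighted average in the space
    d = np.linalg.norm(emb - point, axis=1)   # distance to every label,
    return np.argsort(d)[:n_out]              # possibly labels never seen in training
```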
0:36:01 | does it work |
---|
0:36:03 | it does actually surprisingly |
---|
0:36:05 | not perfectly by far again you see like |
---|
0:36:09 | a few percent precision but it does work |
---|
0:36:13 | good enough that it's better than what we've seen elsewhere |
---|
0:36:16 | so this is the model that is doing this |
---|
0:36:20 | convex |
---|
0:36:23 | combination of semantic embeddings and when it was using the top ten |
---|
0:36:27 | labels that's what's called conse |
---|
0:36:29 | there is also something that we published called devise which instead of |
---|
0:36:33 | doing this simple convex formulation tries to learn the mapping between the two |
---|
0:36:38 | and the mapping was surprisingly not as good as just the simple combination |
---|
0:36:43 | and this would be the output of the model itself so this cannot actually find |
---|
0:36:48 | the correct solution because |
---|
0:36:49 | we know that the correct solution for that image the correct label |
---|
0:36:53 | is not in the top ten or the top thousand it's not a label the |
---|
0:36:57 | model knows about so it will make a mistake |
---|
0:37:00 | while these ones have access to the full embedding space and can actually say something |
---|
0:37:04 | about |
---|
0:37:05 | things they've never seen and that's |
---|
0:37:07 | okay that works |
---|
0:37:08 | in this case |
---|
0:37:11 | okay |
---|
0:37:12 | so |
---|
0:37:13 | that was nice for images but recently i thought okay |
---|
0:37:17 | what about speech |
---|
0:37:20 | so about ten years ago i was working in speech so i had some |
---|
0:37:24 | knowledge about how speech models work |
---|
0:37:26 | but in the meanwhile of course everything changed the deep learning wave also hit the |
---|
0:37:30 | speech community |
---|
0:37:31 | and now nobody's using gmms and stuff like that anymore we use deep |
---|
0:37:37 | networks so how is speech |
---|
0:37:40 | recognition done nowadays |
---|
0:37:43 | so this is speech in |
---|
0:37:45 | one slide |
---|
0:37:47 | you take your |
---|
0:37:49 | speech signal you transform it using some features |
---|
0:37:53 | and |
---|
0:37:55 | for the training set that you have you take the sequence of words |
---|
0:37:59 | and you |
---|
0:38:00 | you |
---|
0:38:01 | cut it into sub |
---|
0:38:02 | word units which are usually phonemes or |
---|
0:38:05 | biphones triphones or whatever you want |
---|
0:38:07 | and these phones are then cut into sub phone units which are called states because |
---|
0:38:12 | they are states of hmms even though we're not using hmms anymore |
---|
0:38:16 | and |
---|
0:38:17 | then we try to align the audio with the states |
---|
0:38:22 | so we take a previous model and we try to say okay with our previous |
---|
0:38:26 | model |
---|
0:38:27 | this part of the audio should correspond to state number two hundred forty five |
---|
0:38:32 | and we do that for all our training set and that becomes our training data |
---|
0:38:38 | to train a deep architecture whose output size is the number of states you have and |
---|
0:38:43 | you try to predict |
---|
0:38:44 | which state this audio should correspond to out of the |
---|
0:38:48 | in our case |
---|
0:38:50 | fourteen thousand states |
---|
0:38:51 | so the actual speech |
---|
0:38:53 | acoustic model is a classifier of fourteen thousand classes |
---|
0:38:58 | this is how it works and i think |
---|
0:39:01 | we do that because that's |
---|
0:39:02 | how we've been doing speech for a while but it seems unreasonable to me |
---|
0:39:06 | that we're trying to classify audio into states |
---|
0:39:09 | which even we as humans would have a hard time |
---|
0:39:13 | with as a task to do because these states have no particular meaning |
---|
0:39:17 | the phonemes themselves have been designed by linguists and maybe that's not what the |
---|
0:39:21 | data would say |
---|
0:39:22 | we should maybe look at the data instead of asking a linguist |
---|
0:39:26 | i don't know how many linguists we have here |
---|
0:39:28 | hopefully not too many |
---|
0:39:33 | i |
---|
0:39:34 | and |
---|
0:39:36 | so let's see if we can |
---|
0:39:38 | get rid of these states and phonemes and all that |
---|
0:39:41 | of course it's gonna be hard and we will not succeed |
---|
0:39:44 | very well but at least i think it's worth trying |
---|
0:39:48 | and see where we go |
---|
0:39:50 | so |
---|
0:39:52 | so what can we do |
---|
0:39:55 | so the first thing i did was a very naive approach i took the data |
---|
0:39:59 | and instead of cutting the data and segmenting the data at the state level i said |
---|
0:40:05 | okay i forget about states i forget about phonemes what else do we have words |
---|
0:40:09 | so let's segment the |
---|
0:40:11 | training set we have in terms of words and |
---|
0:40:14 | that's an easier task because it's usually easier to segment your data in terms |
---|
0:40:18 | of words well humans would agree |
---|
0:40:20 | roughly where a word starts and where a word ends |
---|
0:40:24 | so let's try to learn a model that tries to just classify words |
---|
0:40:29 | and that's what i did so i had my audio data and used a deep architecture |
---|
0:40:33 | and tried to predict at the end |
---|
0:40:35 | the word directly |
---|
0:40:36 | so that assumes that it has already been segmented |
---|
0:40:39 | the same way that the state based model was assuming that it was already segmented |
---|
0:40:43 | but instead of seeing |
---|
0:40:44 | only one state plus context i'm gonna see the whole word |
---|
0:40:48 | now it turns out words are not that long |
---|
0:40:51 | with a window of about two seconds i capture like ninety nine percent |
---|
0:40:55 | of the training words i had |
---|
0:40:57 | so you need about two hundred frames to express and capture most of the words |
---|
0:41:02 | or at least of the training set i had access to which |
---|
0:41:05 | is query data from |
---|
0:41:07 | google |
---|
0:41:10 | so i trained your typical deep convolution model the same kind of model that was |
---|
0:41:15 | used for images but here now used for speech |
---|
0:41:19 | i used a |
---|
0:41:21 | dictionary that is actually small in the sense that in the training set |
---|
0:41:27 | not all possible words appear |
---|
0:41:29 | so i used only about fifty thousand words |
---|
0:41:32 | which |
---|
0:41:33 | looks big but it's actually small compared to the actual number of words |
---|
0:41:37 | that people will use in our test set for which we need something that can |
---|
0:41:41 | work at least |
---|
0:41:43 | so we have a problem later but let's forget about that problem so far |
---|
0:41:47 | and try to classify our training set into one of the forty eight thousand words |
---|
0:41:54 | so we train the model and that's nice and you get some accuracy seventy |
---|
0:41:57 | three percent |
---|
0:41:58 | is it good is it bad i don't know |
---|
0:42:01 | it's reasonable |
---|
0:42:03 | we'll see where we go with this |
---|
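A minimal sketch of such a whole-word classifier, assuming roughly 200 frames of 40-dimensional features per two-second window and a 48k-word output (all sizes illustrative, not the actual model):

```python
# Convolutional net over a fixed two-second window, predicting the word directly.
import torch
import torch.nn as nn

n_words, n_frames, n_feats = 48_000, 200, 40
word_clf = nn.Sequential(
    nn.Conv1d(n_feats, 64, kernel_size=9),   # convolve over time
    nn.ReLU(),
    nn.MaxPool1d(4),
    nn.Conv1d(64, 128, kernel_size=9),
    nn.ReLU(),
    nn.AdaptiveMaxPool1d(1),                 # pool over the whole window
    nn.Flatten(),
    nn.Linear(128, n_words),                 # one output per word in the dictionary
)

logits = word_clf(torch.randn(1, n_feats, n_frames))  # one segmented word window
```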
0:42:07 | the first thing to say is that if you have this you are not done at |
---|
0:42:10 | all with the speech recognition task because |
---|
0:42:14 | i've assumed that someone gave me an aligned dataset so my |
---|
0:42:18 | training data was aligned at the word level |
---|
0:42:21 | but now if i want to do speech recognition i'm not going to be given |
---|
0:42:24 | the alignment i have to align it myself |
---|
0:42:27 | so |
---|
0:42:27 | since i wanted to have results quickly i said okay i'm gonna forget about |
---|
0:42:31 | the alignment |
---|
0:42:32 | i will use |
---|
0:42:33 | the recognizer we have to provide targets so i take a model |
---|
0:42:37 | and i just run the speech recognizer we have |
---|
0:42:41 | which happens to be quite good |
---|
0:42:42 | and i look at the lattice which is |
---|
0:42:46 | a compact representation of the top k |
---|
0:42:49 | sequences of words |
---|
0:42:50 | that could have been uttered for this utterance |
---|
0:42:55 | of acoustics |
---|
0:42:56 | and i will only look at the arcs of that lattice and try to |
---|
0:42:59 | rescore it so now i know that |
---|
0:43:01 | for each arc there is a beginning and end time so i can take the |
---|
0:43:05 | audio of that part of the word of the sequence and try to score it |
---|
0:43:09 | and say okay i think it should be this word with this probability or this score |
---|
0:43:13 | and i can get the score and try to |
---|
0:43:17 | rescore the lattice with that |
---|
0:43:18 | that's good but it doesn't solve the problem of the unknown words |
---|
0:43:22 | my model was trained with forty eight thousand words |
---|
0:43:25 | and the decoder will see way more words so how will i ever be able |
---|
0:43:29 | to |
---|
0:43:30 | to |
---|
0:43:31 | classify those words with this |
---|
0:43:33 | this is a problem so let's try to go further with our idea |
---|
0:43:37 | and let's try to reason about how we could actually be able to |
---|
0:43:41 | produce a new word |
---|
0:43:43 | or score unknown words |
---|
0:43:46 | that's where the embedding spaces start to be useful |
---|
0:43:49 | so here is the suggestion |
---|
0:43:51 | we're gonna try to learn |
---|
0:43:54 | a mapping between |
---|
0:43:55 | a new representation of words that we have access to and this space of words |
---|
0:44:01 | so what do i have access to that i can use |
---|
0:44:04 | it's things that make up the word like the letters of the word or the |
---|
0:44:08 | letter n-grams of a word so for instance i take the word hello and i |
---|
0:44:12 | can extract |
---|
0:44:13 | what i call features |
---|
0:44:14 | i don't know like the letters it has |
---|
0:44:17 | the bigram letters it has the trigram letters it has the fourgram letters it has |
---|
0:44:22 | the |
---|
0:44:23 | n-gram letters it has all of them |
---|
0:44:25 | so that's a lot of features |
---|
0:44:28 | but then |
---|
0:44:29 | maybe they are useful |
---|
0:44:30 | actually if you add two more symbols |
---|
0:44:33 | like |
---|
0:44:34 | beginning and end of word |
---|
0:44:36 | then it's even more interesting because |
---|
0:44:39 | so the ing in english is very often |
---|
0:44:42 | an ending of words and so ing plus end of word is a very |
---|
0:44:46 | powerful feature so let's try to add that as features |
---|
0:44:49 | and |
---|
0:44:51 | try to represent words like this so |
---|
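A hedged sketch of the feature extraction just described, with explicit begin/end-of-word symbols so that for example "ing" at the end of a word becomes its own feature:

```python
# Letter n-gram features for a word, with begin/end-of-word markers.
def letter_ngrams(word, n_max=4, begin="[", end="]"):
    w = begin + word + end
    feats = set()
    for n in range(1, n_max + 1):
        for i in range(len(w) - n + 1):
            feats.add(w[i:i + n])
    return feats

# letter_ngrams("hello") contains "h", "he", "[he", "llo]", ... as features
```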
0:44:55 | the first thing i |
---|
0:44:57 | did was |
---|
0:44:58 | trying to see if i take a word extract the features and show you only |
---|
0:45:03 | this |
---|
0:45:04 | can you tell me given this that the word i was talking |
---|
0:45:08 | about was hello |
---|
0:45:10 | turns out that it's actually a very easy task and |
---|
0:45:14 | on the test set i got about ninety nine percent accuracy |
---|
0:45:18 | if i train a simple model to predict |
---|
0:45:20 | which word it is given its features so these features actually really |
---|
0:45:25 | capture enough of the word |
---|
0:45:27 | to tell you that this is hello |
---|
0:45:29 | so that's good let's use these features |
---|
0:45:32 | but how can we use it |
---|
0:45:33 | so we're gonna use it in an |
---|
0:45:35 | embedding |
---|
0:45:36 | deep learning kind of architecture |
---|
0:45:39 | in the following way |
---|
0:45:40 | so we had our first model which was you take the audio and you try |
---|
0:45:44 | to predict what word it is |
---|
0:45:46 | now |
---|
0:45:47 | my hope is that |
---|
0:45:49 | the last layer of this architecture captures a lot of information about the |
---|
0:45:54 | whole word |
---|
0:45:56 | and that two words that sound |
---|
0:46:00 | alike |
---|
0:46:01 | will be not far |
---|
0:46:02 | in the representation of the last layer of this |
---|
0:46:05 | deep architecture |
---|
0:46:07 | and |
---|
0:46:08 | what i will try to make sure is that indeed i can learn |
---|
0:46:12 | a mapping between |
---|
0:46:14 | any word |
---|
0:46:16 | and |
---|
0:46:17 | the position in that space that corresponds to the word so that space contains words |
---|
0:46:21 | but now not organized in terms of |
---|
0:46:25 | how they are related semantically they are organized in that space the space being the |
---|
0:46:29 | last layer of the deep architecture |
---|
0:46:30 | in terms of how they sound alike |
---|
0:46:32 | two words that sound alike will be nearby in that space and that's great |
---|
0:46:38 | so now i'm going to train |
---|
0:46:40 | a ranking model that will take |
---|
0:46:44 | an audio acoustic and project it into that space |
---|
0:46:47 | we take |
---|
0:46:48 | the word this audio acoustic corresponds to |
---|
0:46:53 | transform it into features project it into another space that i hope will be |
---|
0:46:59 | similar to this one |
---|
0:47:00 | and try to make sure that the representation of the correct word in that space |
---|
0:47:04 | is here |
---|
0:47:06 | near the representation of the audio nearer than the representation of any other word |
---|
0:47:11 | so i want to make sure that in that space i take the audio i project it |
---|
0:47:15 | i take the letters of the correct word i project them and they should be nearby |
---|
0:47:19 | in the embedding space |
---|
0:47:20 | and by nearby i just mean |
---|
0:47:22 | that it's nearer than any other word i would take and project so that i |
---|
0:47:26 | could rank the words and the nearest word of an acoustic sequence would be the |
---|
0:47:30 | correct word |
---|
0:47:31 | and that would work for any word any sequence of letters i can express |
---|
0:47:35 | does that make sense |
---|
0:47:37 | okay. so again, that's your typical ranking loss, and i trained that model |
---|
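A minimal sketch of one common form of that ranking loss, assuming a margin-based hinge over dot-product scores (the talk does not spell out the exact variant):

```python
import numpy as np

def hinge_ranking_loss(audio_emb, correct_emb, wrong_emb, margin=1.0):
    """One triplet of the ranking loss: the audio embedding should match the
    correct word's letter embedding better than a sampled wrong word's, by at
    least `margin`. In training, gradients would flow into both towers."""
    s_pos = float(np.dot(audio_emb, correct_emb))
    s_neg = float(np.dot(audio_emb, wrong_emb))
    return max(0.0, margin - s_pos + s_neg)

# toy check: a well-separated triplet incurs zero loss
a = np.array([1.0, 0.0])
pos = np.array([0.9, 0.1])
neg = np.array([-1.0, 0.0])
print(hinge_ranking_loss(a, pos, neg))  # 0.0
```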
0:47:44 | and now with that model i can actually score any word. so even though the model was only trained with fifty thousand words, with this addition i can now score an infinite number of words, as long as they are made of letters, which is fine in this case since it was only english |
---|
0:48:05 | okay, so how does it work? first of all, it doesn't work as well as the first model: if i use only that model i get seventy-three percent accuracy, but if i use this one, on the much bigger set of words, i get only fifty-three percent accuracy. but it may still be enough to be able to use it in the decoder, no? |
---|
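To show how such a model could be used for scoring at decode time, here is a hypothetical usage sketch; `project_letters` stands in for the trained letters-to-embedding tower and is an assumption, not taken from the talk:

```python
import numpy as np

def rank_words(audio_emb, candidates, project_letters):
    """Rank candidate spellings against one acoustic segment. Because
    `project_letters` accepts any string, words never seen in training
    can be scored too."""
    scores = {w: float(np.dot(audio_emb, project_letters(w))) for w in candidates}
    return sorted(scores, key=scores.get, reverse=True)

# call shape only, with a dummy projection in place of the trained tower
rng = np.random.default_rng(0)
dummy = lambda w: rng.normal(size=16)
print(rank_words(rng.normal(size=16), ["hello", "hallo", "yellow"], dummy))
```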
0:48:28 | and here is another useful by-product of these embedding spaces, now that we're talking about embedding spaces of audio: i take a word, project it into the embedding space, and look at the other words around it, and i see words that sound similar; they probably have completely different meanings, but they sound the same |
---|
0:48:47 | you can even push in a string that is actually not a word and try to see how you would pronounce it, which could be interesting |
---|
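A sketch of that neighbour lookup, assuming we already have a matrix of audio-trained word embeddings (all names and data here are illustrative):

```python
import numpy as np

def nearest_words(query, words, emb, k=5):
    """k nearest neighbours by cosine similarity; in an audio-trained space
    these neighbours sound alike rather than mean alike."""
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = normed @ normed[words.index(query)]
    order = np.argsort(-sims)
    return [words[i] for i in order if words[i] != query][:k]

# toy usage with random vectors standing in for trained embeddings
rng = np.random.default_rng(0)
vocab = ["hello", "hallow", "yellow", "world"]
print(nearest_words("hello", vocab, rng.normal(size=(4, 8)), k=2))
```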
0:48:57 | okay, so does it work? well, it works, but basically so far it only works in combination: if you do rescoring, you combine it with a good model |
---|
0:49:11 | so it's just preliminary work, but i think there are other things to try in that space that remain to be tried; it is still only preliminary |
---|
0:49:22 | it only improves the result slightly, even though "slightly" here actually means significantly, given the size of the data involved; that is still not as satisfying as i would like |
---|
0:49:31 | but i think it contains the seeds of an idea: that we should consider these audio embedding spaces, and designing embedding spaces for audio is, i think, something to consider later |
---|
0:49:46 | maybe i can tell you a bit about the kinds of errors the model was making. it was making mistakes like "it's" being replaced by "its", "five" replaced by "5", which, we agree, are different words, "okay" replaced by "OK", that kind of mistake. so they were mostly mistakes from the language model and not much from the acoustic model; but nevertheless you need to train them jointly, which i haven't, and so there's work to do here |
---|
0:50:14 | okay, so i'm going to stop now. here is the conclusion: i hope i convinced you that these embedding spaces are very powerful. the fact that you can take any kind of data, whether discrete data like words or complex data like images or sounds, and project them into a space where you can compare them, where you can look at the nearest neighbours, and where you can even apply operators to them, like averages or subtractions and things like that, is a very powerful way to handle complex objects |
---|
0:50:51 | we've actually tried it in many other applications; i can tell you about a few of them if you ask. for instance music recommendation: we had this music service where you can upload your music, and we're going to try to help you make playlists with it, or try to suggest new music to buy, and things like that |
---|
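As a sketch of how the "averages" operator from the conclusion could drive such a playlist feature (a hypothetical illustration, not the production system):

```python
import numpy as np

def extend_playlist(playlist_ids, catalog_emb, k=10):
    """Average the embeddings of the songs already in the playlist and
    suggest the catalogue items nearest to that centroid."""
    centroid = catalog_emb[playlist_ids].mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    normed = catalog_emb / np.linalg.norm(catalog_emb, axis=1, keepdims=True)
    ranked = np.argsort(-(normed @ centroid))
    return [int(i) for i in ranked if int(i) not in set(playlist_ids)][:k]

# toy usage with random vectors standing in for song embeddings
rng = np.random.default_rng(0)
print(extend_playlist([0, 3], rng.normal(size=(50, 32)), k=3))
```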
0:51:14 | and we do that not only with metadata but also using the audio representation of the music itself, so we've actually represented your music in these kinds of embedding spaces and we look around in that space |
---|
0:51:25 | we've done that for videos of course, for languages and machine translation as i talked about, and lately trying to do things in speech recognition |
---|
0:51:33 | and i think there's even more to do. why not try these kinds of things for speaker verification or language classification? i don't know, but maybe next year |
---|
0:52:13 | thank you very much |
---|
0:52:17 | i would like to know: what's wrong with linguistics? |
---|
0:52:22 | nothing is wrong with linguistics, but i'm afraid of making early decisions. for instance, for speech, when we take words we represent them as sequences of phonemes, and often there is more than one representation, more than one way to represent the word, and you need linguists to decide what the correct way, or the correct ways, should be |
---|
0:52:41 | and that is a discrete decision that you need to put in, because that's how you're going to represent the audio, you always do that. so you are making early decisions, and some of them might be wrong; i'd like to get rid of early wrong decisions |
---|
0:52:55 | [inaudible audience follow-up about transcriptions differing] |
---|
0:53:09 | i think that's wrong |
---|
0:53:12 | a comment about the images: what about video? i mean, that's probably where most of the work is now |
---|
0:53:27 | you're asking: what about video? so, we have people working on that: we have youtube, which is part of google and contains a few videos, and we have a big group trying these kinds of approaches for youtube, so i cannot speak for them, but i know they have good results |
---|
0:53:52 | the way you trained makes a strict distinction between the word and the acoustics; what is the difference between this and having sequential training over the whole sentence? |
---|
0:54:05 | what do you mean, basically one state for the whole sentence? oh, a recurrent net, sure. so you could use a recurrent net instead of using a convnet over the acoustics, and you get plusses and minuses, i would say |
---|
0:54:22 | the plus of using a recurrent net is that you don't need to decide a priori what the maximum size is; the minus is that it might forget more than you want, the representation being more compressed than with a model where you decide the size. so we are actually trying with LSTMs now, so i'm not saying it's wrong, it's a good idea; these experiments were done with a convnet, so go ahead and use your recurrent net |
---|
0:54:56 | i think my question is in the same direction: they were mentioning video, but what about sentences? are you able to represent sentences, sequences of words, in this way? |
---|
0:55:07 | so, you may know the current line of work by my colleagues quoc le and ilya sutskever, who are actually trying to do that kind of thing. they use LSTMs, a recurrent net; i won't go into the details of how it works, but you first read some input about your sentence, which could be the sentence in another language, or the video, or the audio, whatever it is, and then you output a sentence |
---|
0:55:36 | and you train it all to output the right sentence, so that you are actually reasoning about sentences. it's early work so far, but i hope it's going to work |
---|
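A toy sketch of that read-then-output idea, with an untrained vanilla RNN standing in for the LSTMs the actual work uses (so the output tokens are meaningless until trained; all sizes and weights here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
H, V = 32, 100                               # hidden size, toy vocabulary size
Wxh = rng.normal(0, 0.1, (H, V))             # input-to-hidden weights
Whh = rng.normal(0, 0.1, (H, H))             # hidden-to-hidden weights
Why = rng.normal(0, 0.1, (V, H))             # hidden-to-output weights

def encode(token_ids):
    """Read the whole input (a sentence, or in principle audio frames)
    into a single state vector."""
    h = np.zeros(H)
    for t in token_ids:
        x = np.zeros(V); x[t] = 1.0
        h = np.tanh(Wxh @ x + Whh @ h)
    return h

def decode(h, steps=5):
    """Greedily emit output tokens from the encoded state; training would
    tune all three weight matrices to produce the right sentence."""
    out = []
    for _ in range(steps):
        t = int(np.argmax(Why @ h))
        out.append(t)
        x = np.zeros(V); x[t] = 1.0
        h = np.tanh(Wxh @ x + Whh @ h)
    return out

print(decode(encode([3, 17, 42])))
```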
0:55:51 | [partly inaudible] my question is: can the supervised methods you showed be somehow turned into unsupervised ones? |
---|
0:56:07 | so, it's hard for me to see the distinction between the two. when you train your embedding spaces using only sentences, is that supervised or unsupervised data? i mean, these are sentences that exist in the world, but you don't need people to label them; they appear by themselves on the web. i don't know if that is supervised or not |
---|
0:56:33 | so the distinction is not clear to me; you will have to tell me more |
---|
0:56:42 | what i might be getting at is that when you get the sentence, it is supervised in the sense that a human generated what you were showing; supervised in the sense that you say "this is an english sentence" |
---|
0:56:54 | so in the unsupervised case i just give you the data, and when you do unsupervised clustering, things just look similar to each other; in the supervised case you said "this is this word, this is that word", that's given, it's something to guide you along |
---|
0:57:16 | and maybe one question would be: if you start throwing things in there, how do you use it, almost as in clustering? |
---|
0:57:28 | i think the hope of unsupervised learning, and i do believe we need a lot more work in that field, that it's crucial, is to find structure in the world. the things that happen in the world happen with some structure, maybe randomly, but with some distribution |
---|
0:57:43 | and you want to constrain the space where you're going to operate with these objects, these embedding spaces or any other hidden representation, so that they take that structure into account, and so that it is easier to say "these two things are nearby": because of that structure it doesn't go the other way around, you cannot go right-to-left, things only go in that direction, so you compare them like that |
---|
0:58:07 | that's what you want to use your unsupervised data for. so for instance you can take audio and try to learn a representation of the audio in a compact way just by looking at audio, as long as it is audio of the kind of things you will see later: not just random audio, but maybe people talking, without understanding what they say; or images that exist but without labels; or text that has been written in your language |
---|
0:58:34 | you don't need to know what that text is about, or what this image is about, as long as it is a set of images that are valid, in the sense that other images you would see come from the same distribution |
---|
0:58:45 | so it's very useful, it's a hard task, and we need it a lot |
---|
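One classic way to make that concrete is to learn a compact code from unlabeled data alone. Below is a minimal sketch using a linear autoencoder trained on reconstruction error; the choice of model and all sizes are assumptions, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 20))        # stand-in for unlabeled examples
D, H = X.shape[1], 5                  # data dimension, code dimension
W1 = rng.normal(0, 0.1, (H, D))       # encoder weights
W2 = rng.normal(0, 0.1, (D, H))       # decoder weights
lr = 0.05

for _ in range(300):
    Z = X @ W1.T                      # encode: compact code, no labels used
    Xhat = Z @ W2.T                   # decode: reconstruct the input
    err = Xhat - X
    gW2 = err.T @ Z / len(X)          # gradient of squared reconstruction error
    gW1 = (err @ W2).T @ X / len(X)
    W1 -= lr * gW1
    W2 -= lr * gW2

print(float(np.mean(err ** 2)))       # reconstruction error shrinks over training
```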
0:58:55 | so you are trying to recognise out-of-vocabulary words using this second model; can you comment on how successful it was, on the gain you obtained by combining the two models, in terms of recognising out-of-vocabulary words? |
---|
0:59:13 | so the first model was only trained to recognise words that were known. with the second one, i used it on our test set, which contains ten times more distinct words; most of the words in the test set were not in the train set. so the decoder results i gave in terms of word error rate were on a vocabulary more than ten times bigger than the training set's |
---|
0:59:38 | and it was using this letter representation, so is that what you meant? |
---|
0:59:43 | [partly inaudible] i mean, how successful was it in recognising them? |
---|
0:59:49 | well, so it is out-of-vocabulary with respect to the training set, but it's not solving the real task, which i'm sure you are interested in: out-of-vocabulary with respect to the test set, that is, a word that is not even in my dictionary when i decode, and that i'd like to be able to reason about. i haven't tried that, and i think it's a more interesting task |
---|
1:00:14 | [inaudible question] |
---|
1:00:48 | ...which is the first part i talked about. people started working on that years ago, yes, but it's a hard task, yes |
---|
1:01:17 | i agree, i haven't tried, but i guess videos would be the best way to see that, where you have audio and images together; i have not personally worked on that, but i know people are trying it on other data |
---|
1:01:32 | yes, i think [inaudible] |
---|