Great — so it's a pleasure to talk here today. Indeed, I followed your community for a couple of years about ten years ago, and then moved on to a few different things, but I might come back to it, who knows. I was very interested in what I've seen during the week, so thanks for inviting me.
So today I'll talk about a few projects related to embedding spaces. I'm going to tell you first what they are, and then what we can do with them. But please note that this is not only my work: there are plenty of people who have worked on this with me, and I'd like to thank them first.
So — embeddings: what are they, and what are they useful for? It started about a decade ago, when people began to think about how to represent discrete objects in a continuous space in such a way that it becomes useful to manipulate them. If you think of words: words are difficult to manipulate because you cannot easily compare two words — there is, at least in mathematical terms, no obvious comparator. So how can you represent words such that you can then manipulate them and compare them? That's what these embedding spaces are: we're going to project words into them. And once we have projected words into these spaces, what about projecting anything else — like images, or speech, or music? How we can do that is going to be the second part.
And we'll see that once we can manipulate complex objects in these spaces — because you learn semantic representations of these words — you can actually try to discover things in complex objects, like images, that you've never seen before; we'll see how to do that. Then, at the end, I'll briefly describe some recent work where we try to do similar things, but now applied to speech.
So let's start with what I mean by "embedding". Here I depict a 3-D embedding space, but of course in general it's more like a 100-, 200-, or 1000-dimensional embedding space. Think of this space as a real vector space, where each point can be the position of a discrete object, like a word. So here we have the position of the word "Obama", and here the position of, say, the word "Paris".
We want to learn where to put these words, and at the beginning we just don't know, so we're going to pick a random position for each word of the dictionary we are given. Then we want to move the positions around, such that at some point nearby words — the words near the position of the word "dolphin" — have similar meanings, or at least related meanings. So you'd like "sea" and "dolphin" to be not far apart, but relatively far from, say, "Paris" and "Obama". If we can achieve that, it's going to be useful for manipulating these words afterwards. So how can we do that?
About ten years ago, my brother — for those of you who don't know, there are two Bengios, and you invited me, not my brother; both of us work in deep learning — about ten years ago, then, he described a project where he could learn such embeddings, and here is how he started it.
He was using a neural network: you have the inputs, then layers connected to each other, and at the end an output layer. The goal was to learn a representation for words. What he did was take sentences — you can grab sentences from Wikipedia, from the web, from anywhere — and the objective is to find a representation of words such that it becomes easy to predict, if I show you a few words, what the next word will be. So you put these few words as input of the model, you crunch them, and at the end you predict a word out of your dictionary — there is one output unit for each word of the dictionary — and you try to predict that the next word will be, say, "cheese".
Now, how do you represent a word? You represent it as a vector in a d-dimensional space, which at the beginning is just a random vector. This is what we call the embedding space. Think of it as a big lookup table, a big matrix where each line is the representation of one word. If you see the sequence "the cat eats", you just look up the words — each of them has a vector representation — you put those vectors as input of your model, you pass them through the model, and you predict the next word. The hope is that if you do this often — and, as you can imagine, you can feed the model tons of such data, since it's easy to get — you will find good representations of words.
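To make this concrete, here is a minimal numpy sketch of that kind of model; the toy vocabulary, the layer sizes and all the names are made up for illustration, not taken from the original system, and the weights here are untrained.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "eats", "cheese", "dog"]
V, d, h = len(vocab), 50, 100           # vocabulary size, embedding dim, hidden units
E = rng.normal(scale=0.1, size=(V, d))  # the lookup table: one row per word
W1 = rng.normal(scale=0.1, size=(3 * d, h))
W2 = rng.normal(scale=0.1, size=(h, V))

def predict_next(context):
    """Given 3 word ids (e.g. "the cat eats"), return a distribution over the next word."""
    x = E[context].reshape(-1)          # look up and concatenate the 3 embeddings
    hidden = np.tanh(x @ W1)            # one hidden layer
    logits = hidden @ W2
    p = np.exp(logits - logits.max())
    return p / p.sum()                  # softmax over the dictionary

p = predict_next([0, 1, 2])             # with random weights this is arbitrary;
print(vocab[int(np.argmax(p))])         # training on text would push it toward "cheese"
```

Training the output layer and the lookup table E jointly is what makes the rows of E become useful word representations.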
The first time he did it, it was very painful and very slow: machines ten years ago were very slow, the dictionary was very small, and it wasn't very useful. But since then things have improved — the model has been simplified, we have more data, we have GPUs — and now these things work quite well.
Here is an example of an embedding space that we trained about two years ago, I think. You don't have to try to read everything in that space. We can pick a word — here I picked the word "apple" — and look for the nearest words in the space, in the Euclidean sense: I take the positions of all the words, sort them by distance to the target word, and look at the closest ones. What I see is that the nearest words are semantically very similar to "apple": you get fruits — apples, melons, peaches and whatnot. If you pick other words you get other neighbourhoods; around "iPhone", for instance, you see stuff like "i-whatever".
So it does capture something, and this was trained in an unsupervised way: you just show it sentences, and that's what you get. And this was with a fifty-dimensional embedding, so you don't even need a very large space; the dictionary was about a hundred thousand words, so there are a hundred thousand vectors hiding in that space.
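The lookup I just described is nothing more than a nearest-neighbour search over that matrix; a minimal sketch, with a toy random embedding matrix standing in for the trained one:

```python
import numpy as np

def nearest_words(E, vocab, query, k=10):
    """Sort the whole vocabulary by Euclidean distance to the query word."""
    q = E[vocab.index(query)]
    dist = np.linalg.norm(E - q, axis=1)
    return [vocab[i] for i in np.argsort(dist)[1:k + 1]]  # skip the query itself

vocab = ["apple", "melon", "peach", "paris"]
E = np.random.default_rng(0).normal(size=(len(vocab), 50))  # stand-in for trained vectors
print(nearest_words(E, vocab, "apple", k=3))
```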
With time, these kinds of models have evolved — and here "evolved" just means simplified. It was a complex architecture, and it became much simpler; in fact these are now almost linear models: just the embedding of a word, then a linear transformation, and you try to predict another word — that's it. The way it's done is: you take a sentence, you randomly pick a word in that sentence, and then you randomly pick another word nearby that you're going to try to predict. So it's no longer the next word, and no longer a window of words helping you predict the next one: it's a random word trying to predict another random word around it. And it's precisely this randomness that makes it interesting, because words that tend to occur together often end up related. It's very simple now, the code is actually available — this is essentially the word2vec model — and it's very efficient to train: you can train your own embedding space in a matter of hours.
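If you want to try it yourself, gensim is one widely used reimplementation of this family of models — a sketch, assuming gensim 4.x and a toy corpus in place of real text:

```python
from gensim.models import Word2Vec

# Any iterable of tokenized sentences will do; Wikipedia dumps are a common choice.
sentences = [
    ["the", "cat", "eats", "cheese"],
    ["the", "dog", "eats", "meat"],
]
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, sg=1)  # sg=1: skip-gram
print(model.wv.most_similar("cat", topn=3))
```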
So here we have an example where we took all the terms we saw on Wikipedia, and again looked at the embedding space: these are the words near "tiger shark", or near "car" — words sorted, or clustered, according to their semantics. You see, here, all the food things; there, all the reptiles; et cetera. So it captures semantics. But it's actually even stronger than that.
You can play games with these embedded words. For instance, after training the embedding space — using this kind of skip-gram model — you look at the embedding positions of Rome, Italy, Berlin and Germany, where they are in the space, and you can apply operators to them. Take the embedding position of "Rome", subtract "Italy", add "Germany" — what do you get? "Berlin". That means you can actually generalize: the vector that goes from "Rome" to "Italy" is the same as the one that goes from "Berlin" to "Germany", because the pairs have the same relation to each other. And that's for semantics; you also get syntactic relations, like "hardest" to "harder", or "biggest" to "bigger", with the same kind of arithmetic. And that works surprisingly well.
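The arithmetic itself is just vector addition on the rows of the embedding matrix; a sketch, reusing the E and vocab conventions from the earlier snippet:

```python
import numpy as np

def analogy(E, vocab, a, b, c, k=3):
    """Words nearest to E[a] - E[b] + E[c]; e.g. rome - italy + germany ~ berlin."""
    v = E[vocab.index(a)] - E[vocab.index(b)] + E[vocab.index(c)]
    dist = np.linalg.norm(E - v, axis=1)
    return [vocab[i] for i in np.argsort(dist)[:k]]

# With trained vectors, analogy(E, vocab, "rome", "italy", "germany")
# should rank "berlin" at or near the top.
```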
You can do similar tricks for translation: you can train separate embedding spaces for separate languages, and you'll find that these relations carry over from language to language, so you can use them to help train translation systems. There are tons of tricks you can find in the literature nowadays using these embedding spaces, so they're very interesting to manipulate.
Before I forget: feel free to ask a question whenever you want, of course.
So what else can we do with these embedding spaces? About four or five years ago, at Google, I became interested in images, and I wanted to see whether I could train a model to annotate images. Being at Google, I'm not interested in labeling images out of a hundred classes; I'm more interested in the large-scale setting, where you have hundreds of thousands of classes. That's what's interesting to me. But at that time, at least, the computer vision literature was focused on tasks with a hundred, two hundred, up to a thousand classes — and that was about it.
So, can we go further than that? Of course, image annotation is a hard task; there are plenty of problems with it. Think of the fact that objects often look alike — and this problem gets worse as the number of classes grows. With only two classes it's easy to discriminate; with a thousand, or a hundred thousand, classes, you can be sure that some two of those classes are visually very similar to each other. So the problem keeps getting harder as the number of classes grows.
There are plenty of other problems related to computer vision which I won't go into, but let me just summarize how computer vision was done about four or five years ago. Things have evolved a lot since then, but at that time you had two steps: feature extraction and classification. First you would extract features, in a way very similar to how you would extract features from voice: you find a good representation of patches in the image, then aggregate them in some way, and that becomes your representation. Once you had that, you would classify it using the best classifier available at the time, which was an SVM: you train an SVM for each of your classes and hope that it scales well. It didn't.
One of the problems was that very similar images would give rise to very different labels — labels that could be completely unrelated semantically. For instance — whoops — these are three images from a video, a few seconds apart. It's a shark, or something like a shark, and these are the labels a classical classifier would give. You can see that this one is semantically very different from that one: here it says "airliner", for those who can't see it, and here "tiger shark". But the images are quite similar to our eyes, so something is not working somewhere. That's the kind of problem I'd like to solve: if I show you two similar images, I'd like to get two similar labels — at least semantically.
So why isn't it working? One argument is that when we try to classify images, we impose no relation between the labels: you can think of the labels as sitting at the corners of a hypercube whose dimension is the number of classes. So there is no more relation between, say, "tiger shark" and "tiger" than between "tiger shark" and "airliner", even though there is a semantic relation between them. We don't capture it in the way we train our classifiers, and that's probably bad.
What if, instead of having the labels at those corners, we put them inside the hypercube? Now the labels live inside, like in the embedding spaces I was talking about earlier. That's nice because, first of all, the size of this hypercube no longer depends on the number of labels, of classes: you can have many more classes than the dimension of your space, because now it's a real-valued space. And you can place your labels so that nearby labels have nearby meanings. Then, if you make a mistake and pick a wrong label, hopefully you pick one that is nearby, and hence has a semantic meaning not too far off. So hopefully that would work.
And there's an even more interesting thing that could happen: you could put more labels into the space than the ones for which you have images, and maybe you'd be able to label an image of a topic you've never seen any image of before, just because it is semantically related. So we tried to see whether that's just a dream.
About four years ago we started working on this. We were, of course, interested in these embedding spaces, and what we tried was basically to merge the idea of an image classifier with an embedding space. We had this project, called WSABIE, where we wanted to jointly learn how to take an image and project its input representation into an embedding space — so you have a projection from the representation of the image to a point in that space — while in the same space you have points that represent labels, our classes, like "dolphin" and "Obama" and "Eiffel Tower". The goal was to jointly find the positions of the labels and the mapping from the image to the label space. If you can do that jointly, you hopefully solve both things at once: the classification task, and getting a good embedding space for words.
And, being at Google, I should tell you that everything I see looks like a ranking problem — so obviously I saw a ranking problem here. In this case the goal is: if you show me an image, I'm going to try to rank the labels such that the nearest label is the correct one for the image. That's a proper ranking problem. And you also want to make sure that, if you do make a mistake, the mistake is semantically reasonable: if you were to click on the word, it would be a reasonable word, even if it's not the perfect one. So we're going to train our model with a ranking loss in mind.
The model was actually very simple — again, just a linear mapping. This was prior to the deep-learning era, in some sense, at least in computer vision, so we worked with engineered features: the MFCCs of the image world, if you like. What we were looking for was just a linear mapping between these features of an image and the embedding space. You take your features — that's x, the feature representation of an image — and you multiply them by a matrix V, such that the result is another vector that is, hopefully, in the embedding space. And, just as for the word embeddings, you have a vector representation of each of your words, which in this case are the labels of the image classification task. So you want to find W, the representation of your labels, and V, the mapping between image features and the embedding space,
such that a certain task is optimized. What is the task? We define a similarity between two points in the space: in this case the similarity between an image x and a label i is just a dot product in the embedding space. You take the image, project it into the embedding space, and take the dot product with the embedding of the label you're considering for that image. You want that score to be high for the correct label and low for incorrect labels. And you add some constraints — we're doing machine learning, so we need some regularization — such that the values in the embedding space are constrained: you control the norm of both the mapping and of the label embeddings.
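So the whole model is just two matrices; a minimal numpy sketch with made-up sizes (the real feature dimensions were different):

```python
import numpy as np

rng = np.random.default_rng(0)
n_feat, d, n_labels = 1000, 100, 100_000        # image features, embedding dim, labels
V = rng.normal(scale=0.01, size=(d, n_feat))    # mapping: image features -> embedding
W = rng.normal(scale=0.01, size=(n_labels, d))  # one embedding row per label

def scores(x):
    """Dot product between the projected image and every label embedding."""
    return W @ (V @ x)

x = rng.normal(size=n_feat)                     # some image feature vector
top10 = np.argsort(-scores(x))[:10]             # the ten highest-scoring labels
```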
OK, so as I said, we're going to solve this problem using a ranking loss. What do I mean by that? We construct a loss to minimize such that, for every image in our training set — that's this part here — and for every correct label of that image (an image can have more than one label; that's often the case), and for every incorrect label of that image — and that's a lot; that's the expensive bit — we want the score of the correct label to be higher than the score of any incorrect label, plus a margin. What this says is basically a hinge loss: not only do you want the score of the correct label to be higher than any other one, you want it higher by a margin, so that it generalizes better. Here the margin is one, but it's a constant; you can put what you want. If that's not the case, you pay a price, and you want to minimize that price.
You can optimize this very efficiently by stochastic gradient descent: sample an image from your training set, sample a positive label from the set of correct labels of that image, and then sample any other label, which will most likely be an incorrect one. You have your triplet, you compute your loss, and if the loss is positive you update the parameters of your model — V and W here. So that's good, and it actually works.
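One such SGD step, as a sketch — this is the spirit of the update, not the exact code we used:

```python
import numpy as np

def sgd_step(x, pos, V, W, lr=0.1, margin=1.0):
    """One triplet update: sample a random other label and, if it violates
    the margin, push the correct score up and the competing score down."""
    emb = V @ x                                   # project the image
    neg = np.random.randint(W.shape[0])           # sample any other label
    if neg == pos:
        return
    if margin - W[pos] @ emb + W[neg] @ emb > 0:  # the hinge is active: a violation
        grad = W[pos] - W[neg]
        W[pos] += lr * emb                        # raise the correct label's score
        W[neg] -= lr * emb                        # lower the sampled label's score
        V += lr * np.outer(grad, x)               # adjust the mapping too
```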
But you can actually do better. I'm not going to go into all the details of how, but think of the following problem. When you want to rank, say, a hundred thousand objects, what you want is for the top ranking positions to contain something interesting. If I show you two ranking functions, and one of them returns a correct label in position one and another correct label in position one thousand, you should find it more interesting than a function that returns the two correct labels in positions five hundred and five hundred and one — even though, in terms of a plain ranking loss, they have the same value. For the user actually using the system, you want at least one correct label returned in the top positions. So you want to favour the top of the ranking; you want to put a lot of the effort there. And there are ways to modify these kinds of losses to favour the top of the ranking. I won't go into the details — they are in the paper — but it makes a huge difference in terms of the perception of the user, because at least at the top of the ranking you see things that make sense.
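For reference, the trick in the paper — the WARP loss of WSABIE — roughly works like this, sketched from my understanding of it: keep sampling negatives until one violates the margin; the number of draws it took gives an estimate of the rank of the correct label, and the update is weighted by a function of that rank that concentrates effort on the top positions.

```python
import numpy as np

def warp_weight(x, pos, V, W, margin=1.0, max_tries=100):
    """Sample negatives until one violates the margin; weight the update by
    L(rank) = sum_{j<=rank} 1/j, which favours the top of the ranking."""
    emb = V @ x
    for tries in range(1, max_tries + 1):
        neg = np.random.randint(W.shape[0])
        if neg != pos and margin - W[pos] @ emb + W[neg] @ emb > 0:
            rank = (W.shape[0] - 1) // tries      # quick violation => poorly ranked
            return sum(1.0 / j for j in range(1, rank + 1)), neg
    return 0.0, None                              # no violator found: skip the update
```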
So let's look at the first experiments we did. At that time, a database had started appearing in the computer vision literature, called ImageNet. It's still there, and it's growing. Back then there were sixteen thousand labels in the ImageNet corpus; now it's more than twenty thousand. But nobody was actually using the corpus as it is: people had selected about a thousand labels and were only playing with that thousand — and that's still the case, unfortunately. Almost nobody plays with the full corpus that is actually available, which contains millions of images: at that time about five million, and I think now it's more like ten million. So that's good, but nobody's using it.
So we considered that a small dataset, and we looked at a bigger one, which came from the web. For the web data we don't really have labels; the way we obtained labels was by looking at what people do on Google image search: they type a query, they see images, they click on an image. Now, if many people click on the same image for the same query, we consider the query a good label for that image. It's very noisy — a lot of things happen there — but it's usually reasonable, and you can collect as much of it as you want. What we used was a very small subset of what was actually available, but still, there were more than a hundred thousand labels in our set, so that was interesting.
We actually published a paper showing these results, and I want to emphasize that we had one percent accuracy on that data — a ninety-nine percent error rate — and it got published. So I think that's good.
Here are the results, to summarize: this algorithm was better than the many other things we tried — these numbers are higher than the other ones. We show two types of metrics: precision at one, which is accuracy, and precision at ten, which measures how many good labels you returned in the top ten; that's more like a ranking metric, and more like what you experience on Google: you look at the page and you're happy if you see the document you want near the top. Of course, if you count more than one position, the numbers grow and everything looks better. But still, the numbers are small, so the question is: is it useful at all, given that the scores are so low? It turns out it is. First of all, let's look at the embedding space again — it's always fun to look at what happened after training the model.
Remember, the model was trained with just pairs of image and label — no relation between words, no relation between images; just an image and its label. Let's first look at where the labels ended up in the space, no images yet. I take the label "Barack Obama" and look at the nearby labels, out of the hundred thousand labels, and these are the labels we see.
The nearest one is basically a spelling mistake — because, well, people type anything on the web. The other ones are also very similar, and then you get this one, which I don't know what it is. If you take "Beckham" you again see similar things, and then, interestingly, you see semantic relations: other soccer players show up not far away — maybe they look alike, I don't know. You also see things like translations: "dolphin" is near "dauphin", and near similar concepts like "whale". And for "Eiffel Tower" you see either things that are not far from it or things that look similar. None of this was ever told to the model: I never said that "dauphin" is like "dolphin"; they end up close just because they share similar images, basically. That's what the embedding space did.
So that's nice, but what about the actual task? Here is a sample of four images from the test set. On all of them, if I computed precision at one, the score would be zero: I failed on all of these, as expected — I mean, I fail ninety-nine percent of the time, so these are from those ninety-nine. But the failures are gracious, in some sense. This one is supposed to be a dolphin, this one a car, and you see the words that come out: "dolphin" here is in position thirty — that's good; here it's in position, I don't know, eight. But the other words around them make sense: they may be the wrong answer in the end, but the answers we give would satisfy many humans, and that's good, just because they have very similar semantic meanings.
We have the Barack Obama one here. And I was intrigued by this last one, because — maybe you don't know this — there is a copy of the Eiffel Tower in Las Vegas, so the answer actually made sense; I was surprised. So that's interesting: the way you make mistakes becomes more interesting. We still make a lot of mistakes, but at least the answers make sense, and that's better.
So that was as of four years ago. What happened after that is that the deep-learning era started, and everything changed in the image field — like it did in the speech field, I would say. So now, here is how we do image recognition: you take an image and apply a deep network, until at the end you take a decision using a softmax layer on top of the deep architecture.
The thing that works best these days is the convolutional network. For those of you who don't know what these are: they are basically layers that look only at a small part of the image. There's a unit here that looks only at this part of the image and tries to compute a value for this part; and the function that computes this value is the same as the one that looks at this part of the image, and this one, and this one. So we are actually convolving one function along the whole image and returning that convolution as the output of the layer. Then we pool the answers locally: we look at the answers of that set of convolutions in a local patch and return something like the max, or the mean — what usually works best is the max, but you can try any pooling. And you do that again, layer after layer; when you're done you add fully connected layers, and at the end you get an answer. So it's a much more involved architecture; it's very slow to train, you need GPUs and all that.
But I must say, first of all, that these were developed about twenty-five years ago or so — nothing new. It's only now that we have the data that shows how good they are, because before there was not enough data, and not enough machine power, like GPUs, to actually train such a complex architecture. So now it works.
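A toy version of one convolution-plus-pooling stage, just to show the sliding and the pooling; real networks stack many of these, with learned kernels and nonlinearities:

```python
import numpy as np

def conv_maxpool(img, kernel, pool=2):
    """Slide the same small function over every patch of the image,
    then pool the answers locally by taking the max."""
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (img[i:i + kh, j:j + kw] * kernel).sum()
    H2, W2 = out.shape[0] // pool, out.shape[1] // pool
    return out[:H2 * pool, :W2 * pool].reshape(H2, pool, W2, pool).max(axis=(1, 3))

img = np.random.default_rng(0).normal(size=(8, 8))
print(conv_maxpool(img, np.ones((3, 3))).shape)   # -> (3, 3)
```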
And it actually works very well. The first time it was used on the competition called ImageNet — a competition where the goal is to classify images into a thousand labels — it basically blew the competition away: all the other competitors, who were using classical computer vision techniques and were actually the best in their field, ended up something like ten points behind the deep-learning approach. So it changed everything, and now, at least in the computer vision literature, almost nobody is not using deep architectures.
Maybe just one slide to say that we use this kind of thing at Google for real products, so it's not just research. For instance, if you type a query like "my photos of sunsets", we're going to look into your own photos, which are unlabeled, and try to return your photos of a sunset. And that's done using the type of architecture that won this competition. I should say that the authors of that paper — Geoffrey Hinton, Alex Krizhevsky and Ilya Sutskever — are now working at Google, so they helped us a bit. It works; they were very good hires.
OK, let's continue. Let's go back to our embedding spaces and the fact that you can put a lot of things into an embedding space. On one side we have these embedding spaces, which are very powerful because they capture the semantics of labels; on the other side we have these powerful deep-learning architectures, which are the best image classifiers now. Can we marry these two things in a way that would be useful? In fact, what we found is that you can use the two together to label an image with a label that does not appear anywhere in the classifier's training set. That's interesting because, even though the classifier was trained on a thousand labels, we can try to reason about, say, a hundred thousand labels, even though we have never seen ninety-nine percent of them.
Surprisingly, it's very simple to do. We started with something more complex, but eventually we converged, again, on the simplest thing, and here is how you do it. First, obviously, you train the two things separately: you train your best deep-learning architecture as your image classifier, and you train your best embedding model on labels. The only requirement is that the labels on which you trained your deep architecture should be embedded in the space: if one of the labels is "car", make sure "car" is in there. That shouldn't be a problem, because you can embed anything as long as you have seen text related to these labels. So that was an easy requirement.
Once you have that, here is what you do. You take an image and you compute the scores of the deep-learning model. The score of the deep model is actually the posterior probability of a label given the image, so you have this vector of p(label | image). You compute all these scores — you have a thousand of them — but you keep only the top ones. The top ones could be the top thousand if you want, but it's faster if you take, say, the top ten. Then you look at the labels corresponding to these top scores — suppose the top labels obtained are bear, lion, tiger, et cetera. You look at where these top labels sit in the embedding space, and you take an average of them in the embedding space — a weighted average, where the weight is how strongly the classifier believes each label is the actual one. So if it really thinks it's a lion, the result of the weighted combination will be very near "lion"; if it really thinks it's a bear, it will be near "bear"; and if it thinks it's between the bear and the lion — say fifty percent bear, fifty percent lion — you end up at a position between bear and lion, right in the middle. That's what this equation says: you average the top labels you found, in the embedding space, and that gives you a position — that's where you should be. Now you look around that point for the nearest labels. They might be labels from the top ten, but they might be other labels — and because it can be any other label, it can be a label of a subject you've never seen.
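The whole procedure fits in a few lines; a sketch with invented names, where train_label_ids maps each classifier output to its row in the label-embedding matrix:

```python
import numpy as np

def conse_predict(probs, train_label_ids, E_labels, k=10):
    """Average the embeddings of the classifier's top-k labels, weighted by
    probability, then return the nearest label in the FULL embedding space --
    possibly one the classifier was never trained on."""
    top = np.argsort(-probs)[:k]
    rows = [train_label_ids[i] for i in top]
    w = probs[top] / probs[top].sum()              # convex weights
    point = (w[:, None] * E_labels[rows]).sum(axis=0)
    dist = np.linalg.norm(E_labels - point, axis=1)
    return int(np.argmin(dist))

rng = np.random.default_rng(0)
E_labels = rng.normal(size=(100_000, 100))         # embeddings for ALL labels
train_label_ids = list(range(1000))                # the 1000 the classifier knows
probs = rng.dirichlet(np.ones(1000))               # stand-in for a softmax output
print(conse_predict(probs, train_label_ids, E_labels))
```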
Does it work? It does, actually, surprisingly — not perfectly, by far: again you see a few percent precision. But it works well enough to be better than what we had seen elsewhere. This is the model doing the convex combination of semantic embeddings, using the top ten labels: that's called ConSE. We compared it with something else we published, called DeViSE, which, instead of this simple convex formulation, tries to learn the mapping between the two — and the learned mapping was, surprisingly, not as good as the simple combination. And this is the output of the deep model itself, which cannot find the correct solution here: we know that the correct label of that image is not among the thousand labels the model knows about, so it will always make a mistake, while the other two have access to the full embedding space and can say something about things they have never seen. So, OK: it works, in this case.
So that was nice for images, but recently I thought: OK, what about speech? About ten years ago I was working in speech, so I had some knowledge of how speech is modeled, but in the meanwhile, of course, everything changed: the deep-learning wave also hit the speech community, and nobody is using GMMs and the like anymore; we use deep networks. So how is speech recognition done nowadays?
This is speech recognition in one slide. You take your speech signal and transform it using some features. Then, for the training set you have, you take the sequence of words and cut it into sub-word units, which are usually phonemes — or biphones, triphones, whatever you want. These phones are then cut into sub-phone units called states — called that because they used to be the states of HMMs, even though we're not really using HMMs anymore. Then we align the audio with the states: we take a previous model and say, OK, according to our previous model this part of the audio should correspond to state number 245. We do that for the whole training set, and that becomes the training data for a deep architecture whose output size is the number of states: you try to predict which state this piece of audio corresponds to, out of — in our case — fourteen thousand states. So the actual acoustic model of a speech recognizer is a classifier with fourteen thousand classes.
That's how it works, and I think we do it that way because that's how we've been doing speech forever. But it seems unreasonable to me: we're trying to classify audio into states that even humans would have a hard time telling apart, because these states have no particular meaning. The phonemes themselves were designed by linguists, and maybe that's not what the data would say — maybe we should look at the data instead of asking a linguist. I don't know how many linguists we have here; hopefully not too many. So let's see if we can get rid of these states and phonemes and all that. Of course it's going to be hard, and we will not succeed very well, but I think it's worth trying, to see where we get.
So what can we do? The first thing I tried was a very naive approach. I took the data, and instead of cutting and segmenting it at the state level, I said: OK, forget about states, forget about phonemes — what else do we have? Words. So let's segment the training set in terms of words. That's an easier task, because it's usually easier to segment data into words: humans would roughly agree on where a word starts and where it ends. So let's try to learn a model that just classifies words. That's what I did: I took my audio data, used a deep architecture, and tried to predict, at the end, the word directly. That assumes the data has already been segmented — but the state-based model was assuming the same thing; the difference is that instead of seeing one state plus context, I see the whole word. Now, it turns out words are not that long: with a window of about two seconds I capture something like ninety-nine percent of the training words. So you need about two hundred frames to capture most of the words — at least in the training set I had access to, which is query data from Google.
So I trained your typical deep convolutional model — the same kind of model that is used for images, but here used for speech. The dictionary I used was small, in the sense that not all possible words appear in the training set: I used only about fifty thousand words, which looks big but is actually small compared to the number of words people use in our test set, for which we eventually need something that works. So we'll have a problem later, but let's forget about it for now and try to classify our training set into one of the forty-eight thousand words. So we train the model, and we get some accuracy: seventy-three percent. Is that good? Is it bad? I don't know. It's reasonable; let's see where we can go with this.
The first thing to say is that even with this, you are not at all done with the speech recognition task, because I assumed someone gave me an aligned dataset: my training data was aligned at the word level. But if I want to do actual speech recognition, I'm not going to be given the alignment; I have to align it myself. Since I wanted to evaluate this quickly, I said: OK, I'm going to forget about the alignment problem; I'll use the current recognizer to provide targets. So I take a model — I run the speech recognizer we have, which happens to be quite good — and I look at the lattice, which is a compact representation of the top-k sequences of words that could have been uttered for this sequence of acoustics. And I only look at the arcs of that lattice and try to rescore them: for each arc I know the beginning and end time, so I can take the audio of that part of the sequence and score it — say, I think it should be this word with this probability — and use my score to rescore the lattice.
That's good, but it doesn't solve the problem of unknown words: my model was trained with forty-eight thousand words, and the decoder will see way more words, so how would I ever be able to score those words with this model? That's a problem, so let's push the idea further and think about how we could actually produce, or score, unknown words. That's where embedding spaces start to be useful again.
So here is the suggestion: we're going to learn a mapping between a representation of words that we always have access to, and this space of words. What do I have access to that I can always use? The things that make up the word: the letters of the word, or the letter n-grams of the word. For instance, I take the word "hello" and I can extract features: the letters it has, the letter bigrams it has, the trigrams, the 4-grams, the 5-grams — all of them. That's a lot of features, but maybe they're useful. And actually, if you add two more symbols, marking the beginning and the end of the word, it gets even more interesting: for example, "ing" at the end of a word is very frequent in English, so knowing that a word ends in "ing" is a very powerful feature. So let's add those symbols as features too, and try to represent words like this.
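Extracting these features is trivial; a small sketch, with ^ and $ standing in for the beginning- and end-of-word symbols:

```python
def letter_ngrams(word, n_max=5):
    """All letter n-grams of the word, with ^ and $ marking its boundaries,
    so that e.g. 'ing$' becomes a feature of its own."""
    w = "^" + word + "$"
    return {w[i:i + n] for n in range(1, n_max + 1)
                       for i in range(len(w) - n + 1)}

print(sorted(letter_ngrams("hello", n_max=3)))
```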
The first thing I checked was: if I take a word, extract these features, and show you only the features, can you tell me that the word I was talking about was "hello"? It turns out that's actually a very easy task: I trained a simple model to predict which word it is given its features, and on the test set I got about ninety-nine percent accuracy. So these features really capture enough of the word to tell you that this is "hello". Good — so let's use these features. But how?
We're going to use them in an embedding, deep-learning kind of architecture, in the following way. We had our first model, which takes the audio and tries to predict which word it is. Now, my hypothesis is that the last layer of this architecture captures a lot of information about the whole word, and that two words that sound alike will sit not far apart in the representation at the last layer of the deep architecture. What I'll try to do is learn a mapping between any word and the position in that space that corresponds to the word. So that space contains words, but they are not organized by how they relate semantically: they are organized — the space being the last layer of the deep architecture — by how they sound. Two words that sound alike will be nearby in that space, and that's great.
So now I'm going to train a ranking model that takes an audio sequence and projects it into that space; takes the word this audio corresponds to, transforms it into its letter features, and projects it into another space that I hope will be similar to the first one; and makes sure that the representation of the correct word lands near the representation of the audio. By "near" I just mean nearer than any other word I could take and project, so that I can rank words, and the nearest word to an acoustic sequence should be the correct word. And that works for any word — any sequence of letters I can write down. Does that make sense?
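Here is a sketch of the word side of that model; everything is made up — the hashing of n-grams into a fixed-size vector is a stand-in for whatever feature indexing you prefer, and audio_emb stands in for the last layer of the acoustic network:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_feat = 100, 2 ** 15                       # embedding dim, hashed n-gram bins
U = rng.normal(scale=0.01, size=(d, n_feat))   # letter features -> acoustic space

def word_embedding(word):
    """Project ANY string into the acoustic space via its letter n-grams,
    so words never seen in training can be scored too."""
    w = "^" + word + "$"                       # boundary symbols
    v = np.zeros(n_feat)
    for n in range(1, 6):
        for i in range(len(w) - n + 1):
            v[hash(w[i:i + n]) % n_feat] += 1.0
    return U @ v

def rank_words(audio_emb, candidates):
    """Rank candidates by dot product with the projected audio; training uses
    the usual triplet hinge so the correct word should land on top."""
    return sorted(candidates, key=lambda w: -float(audio_emb @ word_embedding(w)))

audio_emb = rng.normal(size=d)                 # stand-in for the net's last layer
print(rank_words(audio_emb, ["hello", "hallo", "cheese"]))
```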
OK. So again, it's your typical ranking loss, and I trained that model. With it I can now score any word: even though the first model was trained with only fifty thousand words, with this addition I can score an infinite set of words, as long as they are made of letters — which is fine in this case, since it was English only.
So how does it work? First of all, it doesn't work as well as the word classifier: if I use only the first model I get seventy-three percent accuracy, but if I use this model, over the much bigger set of words, I get only fifty-three percent accuracy. Still, that may be enough to be useful in the decoder, no?
And here's another fun example of these embedding spaces — now we're talking about embedding spaces of audio. I take a word, project it into the embedding space, and look at the other words around it, and I see words that sound similar: they probably have completely different meanings, but they sound the same. You can even push in any string that is not actually a word, and see how it would be pronounced. That could be interesting.
OK, so does it work? It works — so far only in combination: if you do rescoring, you combine it with a good model. It's just preliminary work, and I think there are other things to try in that space that remain to be tried; it's still preliminary. It improves the result slightly — and even though the improvement is slight, it's actually significant, given the size of the test data. It's still not where I'd like it to be, but I think it contains the seeds of something we should consider: these audio embedding spaces — a kind of semantic space for audio — are, I think, something to consider.
Maybe I can tell you a bit about the kind of errors the model was making. It made mistakes like: "it's" replaced by "its"; "five" replaced by "5" — which, we agree, are different words; "okay" replaced by "OK"; that kind of mistake. So the errors came mostly from the language model, not so much from the acoustic model. Nevertheless, you'd want to train the two jointly, which I haven't done, so there's work to do here.
OK, so I'm going to stop here. These are the conclusions. I hope I convinced you that these embedding spaces are very powerful: you can take any kind of data — whether discrete data like words, or complex data like images or sounds — and project it into a space where you can compare things, where you can look at nearest neighbours, and where you can even apply operators to them, like averages or subtractions and so on. It's a very powerful way to handle complex objects.
We've actually tried it in many other applications. For instance, music recommendation: we have this music service where you can upload your music, and we try to help you build playlists with it, or to recommend new music to buy, things like that; and we do it using not only metadata but also the audio representation itself — we project your music into these kinds of embedding spaces and look around in that space. We've done it for videos, of course; for languages, in machine translation, which I talked about; and lately, as I said, we've been trying things in speech recognition. I think there's even more to do: why not try these kinds of things for speaker verification, or language identification? I don't know — maybe I'll tell you next year. So, thank you very much.
Q: I would like to know: what's wrong with linguistics?

A: Nothing is wrong with linguistics, but I'm afraid of making early decisions. For instance, in speech, when we take words and represent them as sequences of phonemes, there is often more than one way to represent a word, and you need linguists to decide what the correct way, or ways, should be. Those are discrete decisions you have to bake in, because that's how you're going to represent the audio forever after — and some of those early decisions might be wrong. I'd like to get rid of early wrong decisions.
Q: And for transcriptions — people transcribe the same things differently, or a single convention is imposed; I think that's wrong too.
Q: A comment about the images: what about video? That's probably where most of the work is now.

A: So you're asking: what about video, about actions? We have people working on that: we have YouTube, which is part of Google and contains a few videos, and there's a big group trying these kinds of approaches on YouTube. I can't speak for them, but I know they have good results.
Q: On the way you trained the mapping between the words and the acoustics — what's the difference between this and sequential training, as people do?

A: What do you mean — basically something that sees the whole sentence, a recurrent net? Sure. You could use a recurrent net instead of a convolutional net over the acoustics, and you'd get pluses and minuses, I'd say. The plus of using a recurrent net is that you don't need to decide a priori on a maximum size; the minus is that it may compress more than you want — the representation is more squashed than with a model where you choose the window yourself. We're actually trying LSTMs for that now, so I'm not saying it's wrong; it's a good idea. These experiments were done with a convolutional net — go ahead and use your recurrent net.
Q: I think my question goes in the same direction. You were mentioning video — but what about sentences? Are you able to represent sentences, sequences of words, in this space?

A: You may know the current line of work by my colleagues Quoc Le and Ilya Sutskever, who are actually trying to do that kind of thing. They use LSTMs, recurrent nets — I won't go into the details of how it works — where you first read some input, which could be a sentence in another language, or a video, or audio, whatever; and then you output a sentence, and you train the whole thing to output the right answer. So there you are actually reasoning about sentences. It's early work so far, but I hope it's going to work.
Q: You want me to ask about the numbers on the board? I'll pass on that. My question is: what you showed is supervised — can it somehow be made unsupervised?

A: It's hard to see the distinction between the two here. When you train your embedding spaces using only sentences, is that supervised or unsupervised data? They are sentences that exist in the world, but you don't need to ask people to label them; they appear by themselves on the web. I don't know whether that's supervised or not, so the distinction is not clear to me — you'll have to tell me more.
Q: What I might be getting at is: "supervised" usually means human-generated. What you were showing is supervised in the sense that you say, this is an English sentence, this picture is a whale — it's given, and it guides the learning. In unsupervised clustering, you just say things look similar. So the question is: if you start throwing things in there, how would you use this in an unsupervised, clustering-like way?
A: I think the hope of unsupervised learning — and I do believe we need a lot more work in that field; it's crucial — is to find structure in the world. The things that happen in the world happen with some structure, maybe randomly, but with some distribution, and you want to constrain the space in which you operate with these objects — these embedding spaces, or any other hidden representation — so that it takes that structure into account. You want to be able to say that these two things are nearby because of that structure, and not the other way around: you cannot go left to right, things go only in that direction, so compare them accordingly. That's what you want to use your unsupervised data for. For instance, you can take raw audio and try to learn a compact representation of audio just by looking at audio — as long as it's audio of the kind you will see later. Not just random audio, but, say, people talking, without understanding what they say; or images that exist, but without labels; or text that has been written in your language, where you don't need to know what the text or the image is about, as long as it comes from the same distribution as the data you will see later. So it's very useful; it's a hard task, and we need a lot of it.
Q: You were trying to recognize out-of-vocabulary words using this sampling — can you comment on how successful it was, like the gain you obtained by combining the two models, compared to just recognizing out-of-vocabulary words?

A: So the first model could only recognize words that were known. With the second one, I used it on our test set, which contains ten times more distinct words; most of the words in the test set were not in the training set. The decoder results I gave, in terms of word error rate, were on a vocabulary more than ten times bigger than the training set's, and it was using this letter representation. Is that what you meant?

Q: I mean, were you successful? Because, looking at the lattice...

A: Well, it's out-of-vocabulary with respect to the training set, but it's not solving the real task, which I'm sure you are interested in: out-of-vocabulary with respect to the decoder — a word that is not even in my dictionary when I decode, and that I'd still like to be able to reason about. I haven't tried that, and I think it's the more interesting task.
Q: [partly unintelligible] ...about training them jointly?

A: For the first part I talked about, people started working on that a few years ago, yes — but it's the hard part, yes. And on combining audio and images: I agree, I haven't done it, but I guess videos would be the best way to look at that, where you have audio and images together. I haven't personally worked on that, though I know people are trying it on other data. Yes, I think so.