Great — so it's a pleasure to talk here today. Indeed, I followed your community for a couple of years about ten years ago, and then moved on to a few different things, but I might come back to it, who knows. I was very interested in what I've seen during the week, so thanks for inviting me.
So today I'll talk about a few projects related to embedding spaces. I'm going to tell you first what they are, and then what we can do with them. But please note that this is not only my work: there are plenty of people who have worked on this with me, and I'd like to thank them first.
So — embeddings: what are they, and what are they useful for? It started about a decade ago, when people began to think about how to represent discrete objects in a continuous space in such a way that it becomes useful to manipulate them. If you think of words: words are difficult to manipulate because you cannot easily compare two words — there is, at least in mathematical terms, no obvious comparator. So how can you represent words such that you can then manipulate them and compare them? That's what these embedding spaces are: we're going to project words into them. And once we have projected words into these spaces, what about projecting anything else — like images, or speech, or music? How we can do that is going to be the second part.
And we'll see that once we can manipulate complex objects in these spaces — because you learn semantic representations of these words — you can actually try to discover things in complex objects, like images, that you've never seen before; we'll see how to do that. Then, at the end, I'll briefly describe some recent work where we try to do similar things, but now applied to speech.
So let's start with what I mean by "embedding". Here I depict a 3-D embedding space, but of course in general it's more like a 100-, 200-, or 1000-dimensional embedding space. Think of this space as a real vector space, where each point can be the position of a discrete object, like a word. So here we have the position of the word "Obama", and here the position of, say, the word "Paris".
We want to learn where to put these words, and at the beginning we just don't know, so we're going to pick a random position for each word of the dictionary we are given. Then we want to move the positions around, such that at some point nearby words — the words near the position of the word "dolphin" — have similar meanings, or at least related meanings. So you'd like "sea" and "dolphin" to be not far apart, but relatively far from, say, "Paris" and "Obama". If we can achieve that, it's going to be useful for manipulating these words afterwards. So how can we do that?
About ten years ago, my brother — for those of you who don't know, there are two Bengios, and you invited me, not my brother; both of us work in deep learning — about ten years ago, then, he described a project where he could learn such embeddings, and here is how he started it.
He was using a neural network: you have the inputs, then layers connected to each other, and at the end an output layer. The goal was to learn a representation for words. What he did was take sentences — you can grab sentences from Wikipedia, from the web, from anywhere — and the objective is to find a representation of words such that it becomes easy to predict, if I show you a few words, what the next word will be. So you put these few words as input of the model, you crunch them, and at the end you predict a word out of your dictionary — there is one output unit for each word of the dictionary — and you try to predict that the next word will be, say, "cheese".
Now, how do you represent a word? You represent it as a vector in a d-dimensional space, which at the beginning is just a random vector. This is what we call the embedding space. Think of it as a big lookup table, a big matrix where each line is the representation of one word. If you see the sequence "the cat eats", you just look up the words — each of them has a vector representation — you put those vectors as input of your model, you pass them through the model, and you predict the next word. The hope is that if you do this often — and, as you can imagine, you can feed the model tons of such data, since it's easy to get — you will find good representations of words.
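To make this concrete, here is a minimal numpy sketch of that kind of model; the toy vocabulary, the layer sizes and all the names are made up for illustration, not taken from the original system, and the weights here are untrained.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "eats", "cheese", "dog"]
V, d, h = len(vocab), 50, 100           # vocabulary size, embedding dim, hidden units
E = rng.normal(scale=0.1, size=(V, d))  # the lookup table: one row per word
W1 = rng.normal(scale=0.1, size=(3 * d, h))
W2 = rng.normal(scale=0.1, size=(h, V))

def predict_next(context):
    """Given 3 word ids (e.g. "the cat eats"), return a distribution over the next word."""
    x = E[context].reshape(-1)          # look up and concatenate the 3 embeddings
    hidden = np.tanh(x @ W1)            # one hidden layer
    logits = hidden @ W2
    p = np.exp(logits - logits.max())
    return p / p.sum()                  # softmax over the dictionary

p = predict_next([0, 1, 2])             # with random weights this is arbitrary;
print(vocab[int(np.argmax(p))])         # training on text would push it toward "cheese"
```

Training the output layer and the lookup table E jointly is what makes the rows of E become useful word representations.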
The first time he did it, it was very painful and very slow: machines ten years ago were very slow, the dictionary was very small, and it wasn't very useful. But since then things have improved — the model has been simplified, we have more data, we have GPUs — and now these things work quite well.
Here is an example of an embedding space that we trained about two years ago, I think. You don't have to try to read everything in that space. We can pick a word — here I picked the word "apple" — and look for the nearest words in the space, in the Euclidean sense: I take the positions of all the words, sort them by distance to the target word, and look at the closest ones. What I see is that the nearest words are semantically very similar to "apple": you get fruits — apples, melons, peaches and whatnot. If you pick other words you get other neighbourhoods; around "iPhone", for instance, you see stuff like "i-whatever".
So it does capture something, and this was trained in an unsupervised way: you just show it sentences, and that's what you get. And this was with a fifty-dimensional embedding, so you don't even need a very large space; the dictionary was about a hundred thousand words, so there are a hundred thousand vectors hiding in that space.
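The lookup I just described is nothing more than a nearest-neighbour search over that matrix; a minimal sketch, with a toy random embedding matrix standing in for the trained one:

```python
import numpy as np

def nearest_words(E, vocab, query, k=10):
    """Sort the whole vocabulary by Euclidean distance to the query word."""
    q = E[vocab.index(query)]
    dist = np.linalg.norm(E - q, axis=1)
    return [vocab[i] for i in np.argsort(dist)[1:k + 1]]  # skip the query itself

vocab = ["apple", "melon", "peach", "paris"]
E = np.random.default_rng(0).normal(size=(len(vocab), 50))  # stand-in for trained vectors
print(nearest_words(E, vocab, "apple", k=3))
```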
With time, these kinds of models have evolved — and here "evolved" just means simplified. It was a complex architecture, and it became much simpler; in fact these are now almost linear models: just the embedding of a word, then a linear transformation, and you try to predict another word — that's it. The way it's done is: you take a sentence, you randomly pick a word in that sentence, and then you randomly pick another word nearby that you're going to try to predict. So it's no longer the next word, and no longer a window of words helping you predict the next one: it's a random word trying to predict another random word around it. And it's precisely this randomness that makes it interesting, because words that tend to occur together often end up related. It's very simple now, the code is actually available — this is essentially the word2vec model — and it's very efficient to train: you can train your own embedding space in a matter of hours.
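If you want to try it yourself, gensim is one widely used reimplementation of this family of models — a sketch, assuming gensim 4.x and a toy corpus in place of real text:

```python
from gensim.models import Word2Vec

# Any iterable of tokenized sentences will do; Wikipedia dumps are a common choice.
sentences = [
    ["the", "cat", "eats", "cheese"],
    ["the", "dog", "eats", "meat"],
]
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, sg=1)  # sg=1: skip-gram
print(model.wv.most_similar("cat", topn=3))
```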
So here we have an example where we took all the terms we saw on Wikipedia, and again looked at the embedding space: these are the words near "tiger shark", or near "car" — words sorted, or clustered, according to their semantics. You see, here, all the food things; there, all the reptiles; et cetera. So it captures semantics. But it's actually even stronger than that.
You can play games with these embedded words. For instance, after training the embedding space — using this kind of skip-gram model — you look at the embedding positions of Rome, Italy, Berlin and Germany, where they are in the space, and you can apply operators to them. Take the embedding position of "Rome", subtract "Italy", add "Germany" — what do you get? "Berlin". That means you can actually generalize: the vector that goes from "Rome" to "Italy" is the same as the one that goes from "Berlin" to "Germany", because the pairs have the same relation to each other. And that's for semantics; you also get syntactic relations, like "hardest" to "harder", or "biggest" to "bigger", with the same kind of arithmetic. And that works surprisingly well.
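The arithmetic itself is just vector addition on the rows of the embedding matrix; a sketch, reusing the E and vocab conventions from the earlier snippet:

```python
import numpy as np

def analogy(E, vocab, a, b, c, k=3):
    """Words nearest to E[a] - E[b] + E[c]; e.g. rome - italy + germany ~ berlin."""
    v = E[vocab.index(a)] - E[vocab.index(b)] + E[vocab.index(c)]
    dist = np.linalg.norm(E - v, axis=1)
    return [vocab[i] for i in np.argsort(dist)[:k]]

# With trained vectors, analogy(E, vocab, "rome", "italy", "germany")
# should rank "berlin" at or near the top.
```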
You can do similar tricks for translation: you can train separate embedding spaces for separate languages, and you'll find that these relations carry over from language to language, so you can use them to help train translation systems. There are tons of tricks you can find in the literature nowadays using these embedding spaces, so they're very interesting to manipulate.
Before I forget: feel free to ask a question whenever you want, of course.
So what else can we do with these embedding spaces? About four or five years ago, at Google, I became interested in images, and I wanted to see whether I could train a model to annotate images. Being at Google, I'm not interested in labeling images out of a hundred classes; I'm more interested in the large-scale setting, where you have hundreds of thousands of classes. That's what's interesting to me. But at that time, at least, the computer vision literature was focused on tasks with a hundred, two hundred, up to a thousand classes — and that was about it.
So, can we go further than that? Of course, image annotation is a hard task; there are plenty of problems with it. Think of the fact that objects often look alike — and this problem gets worse as the number of classes grows. With only two classes it's easy to discriminate; with a thousand, or a hundred thousand, classes, you can be sure that some two of those classes are visually very similar to each other. So the problem keeps getting harder as the number of classes grows.
There are plenty of other problems related to computer vision which I won't go into, but let me just summarize how computer vision was done about four or five years ago. Things have evolved a lot since then, but at that time you had two steps: feature extraction and classification. First you would extract features, in a way very similar to how you would extract features from voice: you find a good representation of patches in the image, then aggregate them in some way, and that becomes your representation. Once you had that, you would classify it using the best classifier available at the time, which was an SVM: you train an SVM for each of your classes and hope that it scales well. It didn't.
One of the problems was that very similar images would give rise to very different labels — labels that could be completely unrelated semantically. For instance — whoops — these are three images from a video, a few seconds apart. It's a shark, or something like a shark, and these are the labels a classical classifier would give. You can see that this one is semantically very different from that one: here it says "airliner", for those who can't see it, and here "tiger shark". But the images are quite similar to our eyes, so something is not working somewhere. That's the kind of problem I'd like to solve: if I show you two similar images, I'd like to get two similar labels — at least semantically.
So why isn't it working? One argument is that when we try to classify images, we impose no relation between the labels: you can think of the labels as sitting at the corners of a hypercube whose dimension is the number of classes. So there is no more relation between, say, "tiger shark" and "tiger" than between "tiger shark" and "airliner", even though there is a semantic relation between them. We don't capture it in the way we train our classifiers, and that's probably bad.
What if, instead of having the labels at those corners, we put them inside the hypercube? Now the labels live inside, like in the embedding spaces I was talking about earlier. That's nice because, first of all, the size of this hypercube no longer depends on the number of labels, of classes: you can have many more classes than the dimension of your space, because now it's a real-valued space. And you can place your labels so that nearby labels have nearby meanings. Then, if you make a mistake and pick a wrong label, hopefully you pick one that is nearby, and hence has a semantic meaning not too far off. So hopefully that would work.
And there's an even more interesting thing that could happen: you could put more labels into the space than the ones for which you have images, and maybe you'd be able to label an image of a topic you've never seen any image of before, just because it is semantically related. So we tried to see whether that's just a dream.
About four years ago we started working on this. We were, of course, interested in these embedding spaces, and what we tried was basically to merge the idea of an image classifier with an embedding space. We had this project, called WSABIE, where we wanted to jointly learn how to take an image and project its input representation into an embedding space — so you have a projection from the representation of the image to a point in that space — while in the same space you have points that represent labels, our classes, like "dolphin" and "Obama" and "Eiffel Tower". The goal was to jointly find the positions of the labels and the mapping from the image to the label space. If you can do that jointly, you hopefully solve both things at once: the classification task, and getting a good embedding space for words.
And, being at Google, I should tell you that everything I see looks like a ranking problem — so obviously I saw a ranking problem here. In this case the goal is: if you show me an image, I'm going to try to rank the labels such that the nearest label is the correct one for the image. That's a proper ranking problem. And you also want to make sure that, if you do make a mistake, the mistake is semantically reasonable: if you were to click on the word, it would be a reasonable word, even if it's not the perfect one. So we're going to train our model with a ranking loss in mind.
The model was actually very simple — again, just a linear mapping. This was prior to the deep-learning era, in some sense, at least in computer vision, so we worked with engineered features: the MFCCs of the image world, if you like. What we were looking for was just a linear mapping between these features of an image and the embedding space. You take your features — that's x, the feature representation of an image — and you multiply them by a matrix V, such that the result is another vector that is, hopefully, in the embedding space. And, just as for the word embeddings, you have a vector representation of each of your words, which in this case are the labels of the image classification task. So you want to find W, the representation of your labels, and V, the mapping between image features and the embedding space,
such that a certain task is optimized. What is the task? We define a similarity between two points in the space: in this case the similarity between an image x and a label i is just a dot product in the embedding space. You take the image, project it into the embedding space, and take the dot product with the embedding of the label you're considering for that image. You want that score to be high for the correct label and low for incorrect labels. And you add some constraints — we're doing machine learning, so we need some regularization — such that the values in the embedding space are constrained: you control the norm of both the mapping and of the label embeddings.
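So the whole model is just two matrices; a minimal numpy sketch with made-up sizes (the real feature dimensions were different):

```python
import numpy as np

rng = np.random.default_rng(0)
n_feat, d, n_labels = 1000, 100, 100_000        # image features, embedding dim, labels
V = rng.normal(scale=0.01, size=(d, n_feat))    # mapping: image features -> embedding
W = rng.normal(scale=0.01, size=(n_labels, d))  # one embedding row per label

def scores(x):
    """Dot product between the projected image and every label embedding."""
    return W @ (V @ x)

x = rng.normal(size=n_feat)                     # some image feature vector
top10 = np.argsort(-scores(x))[:10]             # the ten highest-scoring labels
```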
OK, so as I said, we're going to solve this problem using a ranking loss. What do I mean by that? We construct a loss to minimize such that, for every image in our training set — that's this part here — and for every correct label of that image (an image can have more than one label; that's often the case), and for every incorrect label of that image — and that's a lot; that's the expensive bit — we want the score of the correct label to be higher than the score of any incorrect label, plus a margin. What this says is basically a hinge loss: not only do you want the score of the correct label to be higher than any other one, you want it higher by a margin, so that it generalizes better. Here the margin is one, but it's a constant; you can put what you want. If that's not the case, you pay a price, and you want to minimize that price.
You can optimize this very efficiently by stochastic gradient descent: sample an image from your training set, sample a positive label from the set of correct labels of that image, and then sample any other label, which will most likely be an incorrect one. You have your triplet, you compute your loss, and if the loss is positive you update the parameters of your model — V and W here. So that's good, and it actually works.
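One such SGD step, as a sketch — this is the spirit of the update, not the exact code we used:

```python
import numpy as np

def sgd_step(x, pos, V, W, lr=0.1, margin=1.0):
    """One triplet update: sample a random other label and, if it violates
    the margin, push the correct score up and the competing score down."""
    emb = V @ x                                   # project the image
    neg = np.random.randint(W.shape[0])           # sample any other label
    if neg == pos:
        return
    if margin - W[pos] @ emb + W[neg] @ emb > 0:  # the hinge is active: a violation
        grad = W[pos] - W[neg]
        W[pos] += lr * emb                        # raise the correct label's score
        W[neg] -= lr * emb                        # lower the sampled label's score
        V += lr * np.outer(grad, x)               # adjust the mapping too
```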
But you can actually do better. I'm not going to go into all the details of how, but think of the following problem. When you want to rank, say, a hundred thousand objects, what you want is for the top ranking positions to contain something interesting. If I show you two ranking functions, and one of them returns a correct label in position one and another correct label in position one thousand, you should find it more interesting than a function that returns the two correct labels in positions five hundred and five hundred and one — even though, in terms of a plain ranking loss, they have the same value. For the user actually using the system, you want at least one correct label returned in the top positions. So you want to favour the top of the ranking; you want to put a lot of the effort there. And there are ways to modify these kinds of losses to favour the top of the ranking. I won't go into the details — they are in the paper — but it makes a huge difference in terms of the perception of the user, because at least at the top of the ranking you see things that make sense.
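For reference, the trick in the paper — the WARP loss of WSABIE — roughly works like this, sketched from my understanding of it: keep sampling negatives until one violates the margin; the number of draws it took gives an estimate of the rank of the correct label, and the update is weighted by a function of that rank that concentrates effort on the top positions.

```python
import numpy as np

def warp_weight(x, pos, V, W, margin=1.0, max_tries=100):
    """Sample negatives until one violates the margin; weight the update by
    L(rank) = sum_{j<=rank} 1/j, which favours the top of the ranking."""
    emb = V @ x
    for tries in range(1, max_tries + 1):
        neg = np.random.randint(W.shape[0])
        if neg != pos and margin - W[pos] @ emb + W[neg] @ emb > 0:
            rank = (W.shape[0] - 1) // tries      # quick violation => poorly ranked
            return sum(1.0 / j for j in range(1, rank + 1)), neg
    return 0.0, None                              # no violator found: skip the update
```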
So let's look at the first experiments we did. At that time, a database had started appearing in the computer vision literature, called ImageNet. It's still there, and it's growing. Back then there were sixteen thousand labels in the ImageNet corpus; now it's more than twenty thousand. But nobody was actually using the corpus as it is: people had selected about a thousand labels and were only playing with that thousand — and that's still the case, unfortunately. Almost nobody plays with the full corpus that is actually available, which contains millions of images: at that time about five million, and I think now it's more like ten million. So that's good, but nobody's using it.
So we considered that a small dataset, and we looked at a bigger one, which came from the web. For the web data we don't really have labels; the way we obtained labels was by looking at what people do on Google image search: they type a query, they see images, they click on an image. Now, if many people click on the same image for the same query, we consider the query a good label for that image. It's very noisy — a lot of things happen there — but it's usually reasonable, and you can collect as much of it as you want. What we used was a very small subset of what was actually available, but still, there were more than a hundred thousand labels in our set, so that was interesting.
We actually published a paper showing these results, and I want to emphasize that we had one percent accuracy on that data — a ninety-nine percent error rate — and it got published. So I think that's good.
Here are the results, to summarize: this algorithm was better than the many other things we tried — these numbers are higher than the other ones. We show two types of metrics: precision at one, which is accuracy, and precision at ten, which measures how many good labels you returned in the top ten; that's more like a ranking metric, and more like what you experience on Google: you look at the page and you're happy if you see the document you want near the top. Of course, if you count more than one position, the numbers grow and everything looks better. But still, the numbers are small, so the question is: is it useful at all, given that the scores are so low? It turns out it is. First of all, let's look at the embedding space again — it's always fun to look at what happened after training the model.
Remember, the model was trained with just pairs of image and label — no relation between words, no relation between images; just an image and its label. Let's first look at where the labels ended up in the space, no images yet. I take the label "Barack Obama" and look at the nearby labels, out of the hundred thousand labels, and these are the labels we see.
The nearest one is basically a spelling mistake — because, well, people type anything on the web. The other ones are also very similar, and then you get this one, which I don't know what it is. If you take "Beckham" you again see similar things, and then, interestingly, you see semantic relations: other soccer players show up not far away — maybe they look alike, I don't know. You also see things like translations: "dolphin" is near "dauphin", and near similar concepts like "whale". And for "Eiffel Tower" you see either things that are not far from it or things that look similar. None of this was ever told to the model: I never said that "dauphin" is like "dolphin"; they end up close just because they share similar images, basically. That's what the embedding space did.
So that's nice, but what about the actual task? Here is a sample of four images from the test set. On all of them, if I computed precision at one, the score would be zero: I failed on all of these, as expected — I mean, I fail ninety-nine percent of the time, so these are from those ninety-nine. But the failures are gracious, in some sense. This one is supposed to be a dolphin, this one a car, and you see the words that come out: "dolphin" here is in position thirty — that's good; here it's in position, I don't know, eight. But the other words around them make sense: they may be the wrong answer in the end, but the answers we give would satisfy many humans, and that's good, just because they have very similar semantic meanings.
We have the Barack Obama one here. And I was intrigued by this last one, because — maybe you don't know this — there is a copy of the Eiffel Tower in Las Vegas, so the answer actually made sense; I was surprised. So that's interesting: the way you make mistakes becomes more interesting. We still make a lot of mistakes, but at least the answers make sense, and that's better.
So that was as of four years ago. What happened after that is that the deep-learning era started, and everything changed in the image field — like it did in the speech field, I would say. So now, here is how we do image recognition: you take an image and apply a deep network, until at the end you take a decision using a softmax layer on top of the deep architecture.
The thing that works best these days is the convolutional network. For those of you who don't know what these are: they are basically layers that look only at a small part of the image. There's a unit here that looks only at this part of the image and tries to compute a value for this part; and the function that computes this value is the same as the one that looks at this part of the image, and this one, and this one. So we are actually convolving one function along the whole image and returning that convolution as the output of the layer. Then we pool the answers locally: we look at the answers of that set of convolutions in a local patch and return something like the max, or the mean — what usually works best is the max, but you can try any pooling. And you do that again, layer after layer; when you're done you add fully connected layers, and at the end you get an answer. So it's a much more involved architecture; it's very slow to train, you need GPUs and all that.
But I must say, first of all, that these were developed about twenty-five years ago or so — nothing new. It's only now that we have the data that shows how good they are, because before there was not enough data, and not enough machine power, like GPUs, to actually train such a complex architecture. So now it works.
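A toy version of one convolution-plus-pooling stage, just to show the sliding and the pooling; real networks stack many of these, with learned kernels and nonlinearities:

```python
import numpy as np

def conv_maxpool(img, kernel, pool=2):
    """Slide the same small function over every patch of the image,
    then pool the answers locally by taking the max."""
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (img[i:i + kh, j:j + kw] * kernel).sum()
    H2, W2 = out.shape[0] // pool, out.shape[1] // pool
    return out[:H2 * pool, :W2 * pool].reshape(H2, pool, W2, pool).max(axis=(1, 3))

img = np.random.default_rng(0).normal(size=(8, 8))
print(conv_maxpool(img, np.ones((3, 3))).shape)   # -> (3, 3)
```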
And it actually works very well. The first time it was used on the competition called ImageNet — a competition where the goal is to classify images into a thousand labels — it basically blew the competition away: all the other competitors, who were using classical computer vision techniques and were actually the best in their field, ended up something like ten points behind the deep-learning approach. So it changed everything, and now, at least in the computer vision literature, almost nobody is not using deep architectures.
Maybe just one slide to say that we use this kind of thing at Google for real products, so it's not just research. For instance, if you type a query like "my photos of sunsets", we're going to look into your own photos, which are unlabeled, and try to return your photos of a sunset. And that's done using the type of architecture that won this competition. I should say that the authors of that paper — Geoffrey Hinton, Alex Krizhevsky and Ilya Sutskever — are now working at Google, so they helped us a bit. It works; they were very good hires.
OK, let's continue. Let's go back to our embedding spaces and the fact that you can put a lot of things into an embedding space. On one side we have these embedding spaces, which are very powerful because they capture the semantics of labels; on the other side we have these powerful deep-learning architectures, which are the best image classifiers now. Can we marry these two things in a way that would be useful? In fact, what we found is that you can use the two together to label an image with a label that does not appear anywhere in the classifier's training set. That's interesting because, even though the classifier was trained on a thousand labels, we can try to reason about, say, a hundred thousand labels, even though we have never seen ninety-nine percent of them.
Surprisingly, it's very simple to do. We started with something more complex, but eventually we converged, again, on the simplest thing, and here is how you do it. First, obviously, you train the two things separately: you train your best deep-learning architecture as your image classifier, and you train your best embedding model on labels. The only requirement is that the labels on which you trained your deep architecture should be embedded in the space: if one of the labels is "car", make sure "car" is in there. That shouldn't be a problem, because you can embed anything as long as you have seen text related to these labels. So that was an easy requirement.
Once you have that, here is what you do. You take an image and you compute the scores of the deep-learning model. The score of the deep model is actually the posterior probability of a label given the image, so you have this vector of p(label | image). You compute all these scores — you have a thousand of them — but you keep only the top ones. The top ones could be the top thousand if you want, but it's faster if you take, say, the top ten. Then you look at the labels corresponding to these top scores — suppose the top labels obtained are bear, lion, tiger, et cetera. You look at where these top labels sit in the embedding space, and you take an average of them in the embedding space — a weighted average, where the weight is how strongly the classifier believes each label is the actual one. So if it really thinks it's a lion, the result of the weighted combination will be very near "lion"; if it really thinks it's a bear, it will be near "bear"; and if it thinks it's between the bear and the lion — say fifty percent bear, fifty percent lion — you end up at a position between bear and lion, right in the middle. That's what this equation says: you average the top labels you found, in the embedding space, and that gives you a position — that's where you should be. Now you look around that point for the nearest labels. They might be labels from the top ten, but they might be other labels — and because it can be any other label, it can be a label of a subject you've never seen.
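The whole procedure fits in a few lines; a sketch with invented names, where train_label_ids maps each classifier output to its row in the label-embedding matrix:

```python
import numpy as np

def conse_predict(probs, train_label_ids, E_labels, k=10):
    """Average the embeddings of the classifier's top-k labels, weighted by
    probability, then return the nearest label in the FULL embedding space --
    possibly one the classifier was never trained on."""
    top = np.argsort(-probs)[:k]
    rows = [train_label_ids[i] for i in top]
    w = probs[top] / probs[top].sum()              # convex weights
    point = (w[:, None] * E_labels[rows]).sum(axis=0)
    dist = np.linalg.norm(E_labels - point, axis=1)
    return int(np.argmin(dist))

rng = np.random.default_rng(0)
E_labels = rng.normal(size=(100_000, 100))         # embeddings for ALL labels
train_label_ids = list(range(1000))                # the 1000 the classifier knows
probs = rng.dirichlet(np.ones(1000))               # stand-in for a softmax output
print(conse_predict(probs, train_label_ids, E_labels))
```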
Does it work? It does, actually, surprisingly — not perfectly, by far: again you see a few percent precision. But it works well enough to be better than what we had seen elsewhere. This is the model doing the convex combination of semantic embeddings, using the top ten labels: that's called ConSE. We compared it with something else we published, called DeViSE, which, instead of this simple convex formulation, tries to learn the mapping between the two — and the learned mapping was, surprisingly, not as good as the simple combination. And this is the output of the deep model itself, which cannot find the correct solution here: we know that the correct label of that image is not among the thousand labels the model knows about, so it will always make a mistake, while the other two have access to the full embedding space and can say something about things they have never seen. So, OK: it works, in this case.
So that was nice for images, but recently I thought: OK, what about speech? About ten years ago I was working in speech, so I had some knowledge of how speech is modeled, but in the meanwhile, of course, everything changed: the deep-learning wave also hit the speech community, and nobody is using GMMs and the like anymore; we use deep networks. So how is speech recognition done nowadays?
This is speech recognition in one slide. You take your speech signal and transform it using some features. Then, for the training set you have, you take the sequence of words and cut it into sub-word units, which are usually phonemes — or biphones, triphones, whatever you want. These phones are then cut into sub-phone units called states — called that because they used to be the states of HMMs, even though we're not really using HMMs anymore. Then we align the audio with the states: we take a previous model and say, OK, according to our previous model this part of the audio should correspond to state number 245. We do that for the whole training set, and that becomes the training data for a deep architecture whose output size is the number of states: you try to predict which state this piece of audio corresponds to, out of — in our case — fourteen thousand states. So the actual acoustic model of a speech recognizer is a classifier with fourteen thousand classes.
That's how it works, and I think we do it that way because that's how we've been doing speech forever. But it seems unreasonable to me: we're trying to classify audio into states that even humans would have a hard time telling apart, because these states have no particular meaning. The phonemes themselves were designed by linguists, and maybe that's not what the data would say — maybe we should look at the data instead of asking a linguist. I don't know how many linguists we have here; hopefully not too many. So let's see if we can get rid of these states and phonemes and all that. Of course it's going to be hard, and we will not succeed very well, but I think it's worth trying, to see where we get.
So what can we do? The first thing I tried was a very naive approach. I took the data, and instead of cutting and segmenting it at the state level, I said: OK, forget about states, forget about phonemes — what else do we have? Words. So let's segment the training set in terms of words. That's an easier task, because it's usually easier to segment data into words: humans would roughly agree on where a word starts and where it ends. So let's try to learn a model that just classifies words. That's what I did: I took my audio data, used a deep architecture, and tried to predict, at the end, the word directly. That assumes the data has already been segmented — but the state-based model was assuming the same thing; the difference is that instead of seeing one state plus context, I see the whole word. Now, it turns out words are not that long: with a window of about two seconds I capture something like ninety-nine percent of the training words. So you need about two hundred frames to capture most of the words — at least in the training set I had access to, which is query data from Google.
So I trained your typical deep convolutional model — the same kind of model that is used for images, but here used for speech. The dictionary I used was small, in the sense that not all possible words appear in the training set: I used only about fifty thousand words, which looks big but is actually small compared to the number of words people use in our test set, for which we eventually need something that works. So we'll have a problem later, but let's forget about it for now and try to classify our training set into one of the forty-eight thousand words. So we train the model, and we get some accuracy: seventy-three percent. Is that good? Is it bad? I don't know. It's reasonable; let's see where we can go with this.
The first thing to say is that even with this, you are not at all done with the speech recognition task, because I assumed someone gave me an aligned dataset: my training data was aligned at the word level. But if I want to do actual speech recognition, I'm not going to be given the alignment; I have to align it myself. Since I wanted to evaluate this quickly, I said: OK, I'm going to forget about the alignment problem; I'll use the current recognizer to provide targets. So I take a model — I run the speech recognizer we have, which happens to be quite good — and I look at the lattice, which is a compact representation of the top-k sequences of words that could have been uttered for this sequence of acoustics. And I only look at the arcs of that lattice and try to rescore them: for each arc I know the beginning and end time, so I can take the audio of that part of the sequence and score it — say, I think it should be this word with this probability — and use my score to rescore the lattice.
That's good, but it doesn't solve the problem of unknown words: my model was trained with forty-eight thousand words, and the decoder will see way more words, so how would I ever be able to score those words with this model? That's a problem, so let's push the idea further and think about how we could actually produce, or score, unknown words. That's where embedding spaces start to be useful again.
So here is the suggestion: we're going to learn a mapping between a representation of words that we always have access to, and this space of words. What do I have access to that I can always use? The things that make up the word: the letters of the word, or the letter n-grams of the word. For instance, I take the word "hello" and I can extract features: the letters it has, the letter bigrams it has, the trigrams, the 4-grams, the 5-grams — all of them. That's a lot of features, but maybe they're useful. And actually, if you add two more symbols, marking the beginning and the end of the word, it gets even more interesting: for example, "ing" at the end of a word is very frequent in English, so knowing that a word ends in "ing" is a very powerful feature. So let's add those symbols as features too, and try to represent words like this.
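Extracting these features is trivial; a small sketch, with ^ and $ standing in for the beginning- and end-of-word symbols:

```python
def letter_ngrams(word, n_max=5):
    """All letter n-grams of the word, with ^ and $ marking its boundaries,
    so that e.g. 'ing$' becomes a feature of its own."""
    w = "^" + word + "$"
    return {w[i:i + n] for n in range(1, n_max + 1)
                       for i in range(len(w) - n + 1)}

print(sorted(letter_ngrams("hello", n_max=3)))
```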
The first thing I checked was: if I take a word, extract these features, and show you only the features, can you tell me that the word I was talking about was "hello"? It turns out that's actually a very easy task: I trained a simple model to predict which word it is given its features, and on the test set I got about ninety-nine percent accuracy. So these features really capture enough of the word to tell you that this is "hello". Good — so let's use these features. But how?
We're going to use them in an embedding, deep-learning kind of architecture, in the following way. We had our first model, which takes the audio and tries to predict which word it is. Now, my hypothesis is that the last layer of this architecture captures a lot of information about the whole word, and that two words that sound alike will sit not far apart in the representation at the last layer of the deep architecture. What I'll try to do is learn a mapping between any word and the position in that space that corresponds to the word. So that space contains words, but they are not organized by how they relate semantically: they are organized — the space being the last layer of the deep architecture — by how they sound. Two words that sound alike will be nearby in that space, and that's great.
So now I'm going to train a ranking model that takes an audio sequence and projects it into that space; takes the word this audio corresponds to, transforms it into its letter features, and projects it into another space that I hope will be similar to the first one; and makes sure that the representation of the correct word lands near the representation of the audio. By "near" I just mean nearer than any other word I could take and project, so that I can rank words, and the nearest word to an acoustic sequence should be the correct word. And that works for any word — any sequence of letters I can write down. Does that make sense?
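Here is a sketch of the word side of that model; everything is made up — the hashing of n-grams into a fixed-size vector is a stand-in for whatever feature indexing you prefer, and audio_emb stands in for the last layer of the acoustic network:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_feat = 100, 2 ** 15                       # embedding dim, hashed n-gram bins
U = rng.normal(scale=0.01, size=(d, n_feat))   # letter features -> acoustic space

def word_embedding(word):
    """Project ANY string into the acoustic space via its letter n-grams,
    so words never seen in training can be scored too."""
    w = "^" + word + "$"                       # boundary symbols
    v = np.zeros(n_feat)
    for n in range(1, 6):
        for i in range(len(w) - n + 1):
            v[hash(w[i:i + n]) % n_feat] += 1.0
    return U @ v

def rank_words(audio_emb, candidates):
    """Rank candidates by dot product with the projected audio; training uses
    the usual triplet hinge so the correct word should land on top."""
    return sorted(candidates, key=lambda w: -float(audio_emb @ word_embedding(w)))

audio_emb = rng.normal(size=d)                 # stand-in for the net's last layer
print(rank_words(audio_emb, ["hello", "hallo", "cheese"]))
```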
OK. So again, it's your typical ranking loss, and I trained that model. With it I can now score any word: even though the first model was trained with only fifty thousand words, with this addition I can score an infinite set of words, as long as they are made of letters — which is fine in this case, since it was English only.
So how does it work? First of all, it doesn't work as well as the word classifier: if I use only the first model I get seventy-three percent accuracy, but if I use this model, over the much bigger set of words, I get only fifty-three percent accuracy. Still, that may be enough to be useful in the decoder, no?
And here's another fun example of these embedding spaces — now we're talking about embedding spaces of audio. I take a word, project it into the embedding space, and look at the other words around it, and I see words that sound similar: they probably have completely different meanings, but they sound the same. You can even push in any string that is not actually a word, and see how it would be pronounced. That could be interesting.
OK, so does it work? It works — so far only in combination: if you do rescoring, you combine it with a good model. It's just preliminary work, and I think there are other things to try in that space that remain to be tried; it's still preliminary. It improves the result slightly — and even though the improvement is slight, it's actually significant, given the size of the test data. It's still not where I'd like it to be, but I think it contains the seeds of something we should consider: these audio embedding spaces — a kind of semantic space for audio — are, I think, something to consider.
Maybe I can tell you a bit about the kind of errors the model was making. It made mistakes like: "it's" replaced by "its"; "five" replaced by "5" — which, we agree, are different words; "okay" replaced by "OK"; that kind of mistake. So the errors came mostly from the language model, not so much from the acoustic model. Nevertheless, you'd want to train the two jointly, which I haven't done, so there's work to do here.
OK, so I'm going to stop here. These are the conclusions. I hope I convinced you that these embedding spaces are very powerful: you can take any kind of data — whether discrete data like words, or complex data like images or sounds — and project it into a space where you can compare things, where you can look at nearest neighbours, and where you can even apply operators to them, like averages or subtractions and so on. It's a very powerful way to handle complex objects.
We've actually tried it in many other applications. For instance, music recommendation: we have this music service where you can upload your music, and we try to help you build playlists with it, or to recommend new music to buy, things like that; and we do it using not only metadata but also the audio representation itself — we project your music into these kinds of embedding spaces and look around in that space. We've done it for videos, of course; for languages, in machine translation, which I talked about; and lately, as I said, we've been trying things in speech recognition. I think there's even more to do: why not try these kinds of things for speaker verification, or language identification? I don't know — maybe I'll tell you next year. So, thank you very much.
Q: I would like to know: what's wrong with linguistics?

A: Nothing is wrong with linguistics, but I'm afraid of making early decisions. For instance, in speech, when we take words and represent them as sequences of phonemes, there is often more than one way to represent a word, and you need linguists to decide what the correct way, or ways, should be. Those are discrete decisions you have to bake in, because that's how you're going to represent the audio forever after — and some of those early decisions might be wrong. I'd like to get rid of early wrong decisions.
Q: And for transcriptions — people transcribe the same things differently, or a single convention is imposed; I think that's wrong too.
Q: A comment about the images: what about video? That's probably where most of the work is now.

A: So you're asking: what about video, about actions? We have people working on that: we have YouTube, which is part of Google and contains a few videos, and there's a big group trying these kinds of approaches on YouTube. I can't speak for them, but I know they have good results.
Q: On the way you trained the mapping between the words and the acoustics — what's the difference between this and sequential training, as people do?

A: What do you mean — basically something that sees the whole sentence, a recurrent net? Sure. You could use a recurrent net instead of a convolutional net over the acoustics, and you'd get pluses and minuses, I'd say. The plus of using a recurrent net is that you don't need to decide a priori on a maximum size; the minus is that it may compress more than you want — the representation is more squashed than with a model where you choose the window yourself. We're actually trying LSTMs for that now, so I'm not saying it's wrong; it's a good idea. These experiments were done with a convolutional net — go ahead and use your recurrent net.
Q: I think my question goes in the same direction. You were mentioning video — but what about sentences? Are you able to represent sentences, sequences of words, in this space?

A: You may know the current line of work by my colleagues Quoc Le and Ilya Sutskever, who are actually trying to do that kind of thing. They use LSTMs, recurrent nets — I won't go into the details of how it works — where you first read some input, which could be a sentence in another language, or a video, or audio, whatever; and then you output a sentence, and you train the whole thing to output the right answer. So there you are actually reasoning about sentences. It's early work so far, but I hope it's going to work.
Q: You want me to ask about the numbers on the board? I'll pass on that. My question is: what you showed is supervised — can it somehow be made unsupervised?

A: It's hard to see the distinction between the two here. When you train your embedding spaces using only sentences, is that supervised or unsupervised data? They are sentences that exist in the world, but you don't need to ask people to label them; they appear by themselves on the web. I don't know whether that's supervised or not, so the distinction is not clear to me — you'll have to tell me more.
Q: What I might be getting at is: "supervised" usually means human-generated. What you were showing is supervised in the sense that you say, this is an English sentence, this picture is a whale — it's given, and it guides the learning. In unsupervised clustering, you just say things look similar. So the question is: if you start throwing things in there, how would you use this in an unsupervised, clustering-like way?
A: I think the hope of unsupervised learning — and I do believe we need a lot more work in that field; it's crucial — is to find structure in the world. The things that happen in the world happen with some structure, maybe randomly, but with some distribution, and you want to constrain the space in which you operate with these objects — these embedding spaces, or any other hidden representation — so that it takes that structure into account. You want to be able to say that these two things are nearby because of that structure, and not the other way around: you cannot go left to right, things go only in that direction, so compare them accordingly. That's what you want to use your unsupervised data for. For instance, you can take raw audio and try to learn a compact representation of audio just by looking at audio — as long as it's audio of the kind you will see later. Not just random audio, but, say, people talking, without understanding what they say; or images that exist, but without labels; or text that has been written in your language, where you don't need to know what the text or the image is about, as long as it comes from the same distribution as the data you will see later. So it's very useful; it's a hard task, and we need a lot of it.
Q: You were trying to recognize out-of-vocabulary words using this sampling — can you comment on how successful it was, like the gain you obtained by combining the two models, compared to just recognizing out-of-vocabulary words?

A: So the first model could only recognize words that were known. With the second one, I used it on our test set, which contains ten times more distinct words; most of the words in the test set were not in the training set. The decoder results I gave, in terms of word error rate, were on a vocabulary more than ten times bigger than the training set's, and it was using this letter representation. Is that what you meant?

Q: I mean, were you successful? Because, looking at the lattice...

A: Well, it's out-of-vocabulary with respect to the training set, but it's not solving the real task, which I'm sure you are interested in: out-of-vocabulary with respect to the decoder — a word that is not even in my dictionary when I decode, and that I'd still like to be able to reason about. I haven't tried that, and I think it's the more interesting task.
Q: [partly unintelligible] ...about training them jointly?

A: For the first part I talked about, people started working on that a few years ago, yes — but it's the hard part, yes. And on combining audio and images: I agree, I haven't done it, but I guess videos would be the best way to look at that, where you have audio and images together. I haven't personally worked on that, though I know people are trying it on other data. Yes, I think so.