Speech Transcript - Towards Unsupervised Learning of Speech Representations

two

yes

so interviews the allow testing well

thank you on the first i want to thank you again for of anything this

water in it is very

and if you could the a moment and if you could your but you did

very well and i'm sure we will we will take advantage of you organisation an

optional

secondly equal i'm really be too

have you location to introduce you musical of any will be whole first the speaker

i will be sure because i'm quite sure but quite all of japanese

no you would really so i will not even introduction they know you even if

you are still you with a true

you go according to me at least we will see you was a really a

few buttons

so we go you us your master in the second and hearing that went to

university the frozen even

so about twenty ten years ago

and the you went to be a is to use a student rental

with the wheel invisible trying to one be from the sunni a remote case the

you then you is the on droning for distance speech in the two thousand seventeen

and then known the meta

maybe not the useful to introduce them you know that a

i'm also true but they will all of us so awful you know very well

and you start work has a also is a sure working closely with a threshold

venue

you work on several topics may be wrong than representation only for speech button not

only about

and he really you also one of the code from the of the speech way

initiative for building you

two k open okay for a speech and speaker recognition it was a singing about

even us to

we use it to form you use you already have a long list of speakers

in the topics and i know you we need to as a very nice a

or a now and i don't even for you but before two

do they do you say it will be wall of introduction if you want before

I will explain how the session will work. We will start with a pre-recorded video by Mirco. During the video, if you want to ask some questions, please put them in the question and answer box.

Then we will have fifteen minutes of live questions and answers with Mirco. During this session you can use both the question and answer box and the raise-your-hand feature to ask your questions.

so we could be want to say some well handled before two good we do

Thank you very much for the introduction. Hello! I hope the audio within the video will be fine, but in the worst case you guys probably have to increase the volume a little bit. Let's see how it goes.

it can be cool i think we give a really do know

sorry we have a simple was shown to an small technical problem good we don't

have you do you the

before it was working so it's better to does which the previews

present

annotation we're

can't hear nothing alright

yes a

can and have a little stuff

okay training

Hi everyone, I'm Mirco Ravanelli, and I'm very happy to give this talk here today. Let me first thank the organizers for the invitation to share with the speech community a talk entitled "Towards Unsupervised Learning of Speech Representations".

Self-supervised learning is gaining a lot of popularity in the machine learning field, and of course it is getting popular within the speech community as well. So today I would like to share the experience that I gained after working for two or three years on this topic.

Okay, but before diving into self-supervised learning, let me remark on some of the limitations of supervised learning, which is the dominant paradigm these days.

You can see deep learning as a way to learn hierarchical representations, where we start from low-level concepts, we combine them, and we create higher-level concepts. Deep learning is a very general paradigm. These days it is implemented through deep neural networks that are often trained in a supervised way, using large annotated corpora. This supervised-only approach has achieved great success in many practical applications, but it is clear today that this paradigm has some limitations. What are these issues?

For example, we need data, and not generic data but annotated data, and producing annotations can be an issue because it is expensive and time-consuming, and it requires human annotators. Moreover, supervised learning is data hungry and also computationally demanding. Why? Because these days, to reach state-of-the-art performance in machine learning, we need a lot of data, and a lot of data requires a lot of computation. This in fact limits the access to supervised learning technology to a restricted set of users.

Moreover, if we train a system in a supervised way, the representations that we learn might be biased towards a specific application. For instance, if we train a system for speaker identification, the representations that we learn would not be good for speech recognition. So we might want to learn some kind of general representation that makes transfer learning much easier and better.

The third limitation is actually more of an observation, and it is that our brain does not use only supervised learning: it combines different learning modalities. I'm pretty sure that combining different learning modalities is crucial to reach higher levels of artificial intelligence. We can combine supervised learning with contrastive learning, with imitation learning, with reinforcement learning and, of course, with self-supervised learning.

So what is self-supervised learning? Self-supervised learning is a type of unsupervised learning where we do have a supervision, but the supervision is extracted from the signal itself. In self-supervised learning we don't have humans that have to create labels: the labels are created basically for free, and we can create tons of them without effort. Normally, in self-supervised learning, we apply some kind of known transformation to the input signal and use the resulting outcomes as labels, as targets.

Let me clarify this with some examples derived from the computer vision community, which was the first one to investigate this approach. In the computer vision community they noticed quite early, earlier than the others, that by solving some kind of simple task we are able to train a neural network that learns some kind of meaningful representation. For instance, you can ask your neural network to solve a relative positioning task, where you have small patches of an image and you have to decide their relative position. You can ask your neural network to predict the right colours of an image, or to find the correct rotation of an image. All of these tasks are relatively easy, but if we design a system that learns to solve them, we inherently require the system to have some kind of semantic knowledge of the world, or at least semantic knowledge of the image, and that can be very helpful to learn hopefully high-level, robust representations.
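To make the "labels for free" idea concrete, here is a minimal sketch of the rotation task, assuming an image is just a 2-D NumPy array. The function name `make_rotation_example` and the toy image are hypothetical choices for illustration and do not come from the talk.

```python
import numpy as np

def make_rotation_example(image, rng):
    """Return (rotated_image, label); label k means a rotation of k * 90 degrees."""
    k = int(rng.integers(0, 4))    # the known transformation that we apply
    return np.rot90(image, k), k   # the label comes for free, no annotator needed

rng = np.random.default_rng(0)
image = np.arange(12).reshape(3, 4)  # stand-in for a real image
rotated, label = make_rotation_example(image, rng)
```

A network trained to predict `label` from `rotated` never sees a human annotation; the target is entirely determined by the transformation we chose.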

And yes, self-supervised learning is extremely interesting and is gaining a lot of attention. Let me show the famous cake by Yann LeCun, saying that if learning is a cake, supervised learning is the icing on the cake, reinforcement learning is the cherry on the cake, and unsupervised or self-supervised learning is basically the cake itself, meaning that we believe this modality is definitely a key ingredient to develop intelligent systems.

Okay, but what about the audio and speech field? As I mentioned before, there is a growing number of research works going in the direction of self-supervised learning for audio and speech, and we have seen many of them even at the last Interspeech. Here let me just highlight a few of them. In my opinion, the first work that clearly showed the potential of self-supervised learning for audio and speech is contrastive predictive coding, by Aaron van den Oord, back in 2018. This work is mostly about predicting the future given the past.

More recently we have seen another very good work by Facebook, wav2vec 2.0, where they were able to show impressive results with an approach which employs some kind of masking technique. I also contributed to this field with the problem-agnostic speech encoder, in which, as we will see later, we explore multi-task self-supervised learning.

However, self-supervised learning on speech is really challenging. Why? First of all, because speech is characterised by high dimensionality: we typically have long sequences of samples that can be of variable length. Last but not least, speech inherently entails a complex hierarchical structure that might be very difficult to infer without being guided by a strong supervision. Speech, in fact, is characterised by samples; we can combine these samples to create phonemes; from phonemes we can create higher levels such as syllables and words; and finally we have the meaning of the sentence. Inferring all this kind of structure might be extremely difficult.

On my side, I started working on self-supervised learning when I started my postdoc at Mila, almost three years ago. At that time, people at Mila were doing research on self-supervised learning approaches based on mutual information, and I got so excited that I decided to study self-supervised learning approaches with mutual information for learning speech representations. That led to the development of a technique called local info max, which I will describe in the next slides. After that, we extended this technique using a multi-task self-supervised learning approach, and that led to the development of the problem-agnostic speech encoder, PASE, that we presented at Interspeech 2019. We also extended PASE with an improved system called PASE+, and we recently presented this work at ICASSP.

Okay, let's start from the mutual information based approach. What is mutual information? The mutual information is defined as the KL divergence between the joint distribution of two random variables and the product of their marginals. Why is this important? Because with mutual information we can capture complex nonlinear relationships between random variables. If the two random variables are independent, the mutual information is zero, while if there is some kind of dependency between these variables, then the mutual information is greater than zero. This is very attractive. The issue is that mutual information is difficult to compute in high-dimensional spaces, and this has limited a lot its applicability to deep learning.
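Written out, the definition just given reads as follows. The symbols X, Y and p are notation introduced here for illustration; the talk states the definition only in words.

```latex
I(X;Y) \;=\; D_{\mathrm{KL}}\!\left( p(x,y) \,\|\, p(x)\,p(y) \right)
       \;=\; \mathbb{E}_{p(x,y)}\!\left[ \log \frac{p(x,y)}{p(x)\,p(y)} \right] \;\ge\; 0,
\qquad
I(X;Y) = 0 \iff p(x,y) = p(x)\,p(y).
```

The equality case is exactly the statement that independent variables have zero mutual information.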

However, one recent work called MINE, mutual information neural estimation, found that it is possible to maximize or minimize mutual information within a framework that closely resembles that of GANs. How does it work? Let's imagine that we can somehow draw some samples from the joint distribution; we call them positive samples, and we will explain later how we can do that in practice. Let's also assume that we can draw some examples from the product of the marginal distributions; we call these negative samples. Then we can feed these positive and negative samples to a special neural network whose cost function is the Donsker-Varadhan bound, which is a lower bound of the mutual information. If we train this network to maximize this lower bound, we finally converge to an estimate of the mutual information.
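For reference, the Donsker-Varadhan lower bound used by this kind of estimator can be written as below, where T_theta denotes the neural network that scores a pair of samples. The notation is an assumption made here for illustration and is not taken from the slides.

```latex
I(X;Y) \;\ge\; \sup_{\theta}\;
\mathbb{E}_{p(x,y)}\!\left[ T_{\theta}(x,y) \right]
\;-\; \log \mathbb{E}_{p(x)\,p(y)}\!\left[ e^{T_{\theta}(x,y)} \right].
```

The first expectation is estimated on positive samples and the second on negative samples, which is why the two kinds of samples are needed.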

Inspired by this approach, I started thinking about mutual information based approaches designed specifically for speech. I had this idea for a new technique called local info max, which works in this way.

We employ a sampling strategy that draws positive and negative samples in this way. We start by choosing a random chunk from a random sentence, and we call it c1. Then we choose another random chunk from the same sentence, and we call it c2. Finally, we choose another random chunk from another sentence, and we call it c_rand.

With these chunks we can do some interesting things. For instance, we can process c1, c2 and c_rand with an encoder, which provides hopefully higher-level representations z1, z2 and z_rand. Then we can create positive and negative samples. If we concatenate z1 and z2, we create a sample from the joint distribution, a positive sample. It is positive because we expect some kind of relation between these random variables, since they are extracted from the same signal. We can also create a negative sample by concatenating z1 and z_rand, and this can be seen as a sample from the product of the marginal distributions.
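The chunk sampling can be sketched as follows, assuming each sentence is a 1-D NumPy array of raw samples. The function name `sample_triplet`, the chunk length and the toy data are hypothetical choices for illustration only.

```python
import numpy as np

def sample_triplet(sentences, chunk_len, rng):
    """Draw (c1, c2, c_rand): c1 and c2 from one sentence, c_rand from another."""
    i, j = rng.choice(len(sentences), size=2, replace=False)  # two distinct sentences

    def random_chunk(signal):
        start = int(rng.integers(0, len(signal) - chunk_len + 1))
        return signal[start:start + chunk_len]

    c1 = random_chunk(sentences[i])
    c2 = random_chunk(sentences[i])      # same sentence as c1
    c_rand = random_chunk(sentences[j])  # different sentence
    return c1, c2, c_rand

rng = np.random.default_rng(0)
sentences = [rng.standard_normal(16000) for _ in range(4)]  # toy "sentences"
c1, c2, c_rand = sample_triplet(sentences, chunk_len=3200, rng=rng)
```

After encoding, the pair built from `c1` and `c2` would play the role of the positive sample and the pair built from `c1` and `c_rand` the negative one.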

After that, we employ a discriminator, which is fed with positive or negative samples. The discriminator should basically figure out whether it receives a positive or a negative example, in this case whether the representations come from the same sentence or from different ones. In this system, the discriminator loss is set to maximize the mutual information. Moreover, the encoder and the discriminator are jointly trained from scratch, and this results in a cooperative game, not an adversarial game like in GANs. In this case, the encoder and the discriminator should cooperate to learn hopefully high-level representations.
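One simple way to train such a discriminator is a binary cross-entropy objective over positive and negative pairs. The sketch below assumes that choice and assumes the discriminator outputs a probability in (0, 1); the talk does not spell out the exact loss at this point, and the function name `bce_discriminator_loss` is hypothetical.

```python
import numpy as np

def bce_discriminator_loss(pos_scores, neg_scores, eps=1e-7):
    """Scores are discriminator outputs in (0, 1) for positive and negative pairs."""
    pos = np.clip(pos_scores, eps, 1 - eps)
    neg = np.clip(neg_scores, eps, 1 - eps)
    # Positive pairs should score close to 1, negative pairs close to 0.
    return -(np.log(pos).mean() + np.log(1 - neg).mean())

good = bce_discriminator_loss(np.array([0.9, 0.8]), np.array([0.1, 0.2]))
bad = bce_discriminator_loss(np.array([0.5, 0.5]), np.array([0.5, 0.5]))
```

Because the encoder and the discriminator both reduce the same loss, the gradient pushes them in the same direction, which is the cooperative aspect of the game.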

A good question here is: okay, but what do we learn by playing this game? With this game we basically learn speaker identities, or rather speaker embeddings. Why? Because this approach is based on randomly sampling within the same sentence, and if we randomly sample within the same sentence, a reliable factor that the system can disentangle is definitely the speaker identity. Moreover, we assume that we have a dataset that is large enough, with a large variability of speakers, so if we randomly sample two sentences, the probability of finding the same speaker is very low. So overall, this can be seen as a system for learning speaker embeddings without providing to the system any explicit label on the speaker identity.

The encoder is fed with the raw speech samples directly. In the first layer of the encoder architecture we use SincNet, which makes learning from raw samples much easier. In fact, instead of using standard convolutional filters, we use parameterized band-pass filters that only learn the cutoff frequencies of each filter. This makes learning from the raw samples easier, and it is useful not only in supervised learning but also in this self-supervised context. I encourage you to read the reference paper if you would like to hear more about SincNet.
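The idea of a band-pass filter described only by its two cutoff frequencies can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the actual layer: the function name `sinc_bandpass`, the kernel size and the Hamming window are choices made here, and in a trainable layer the two cutoffs would be the learned parameters.

```python
import numpy as np

def sinc_bandpass(f_low, f_high, kernel_size=251):
    """Band-pass FIR kernel defined only by two cutoffs (in cycles per sample)."""
    n = np.arange(kernel_size) - (kernel_size - 1) / 2
    # The difference of two ideal low-pass filters gives a band-pass filter.
    kernel = 2 * f_high * np.sinc(2 * f_high * n) - 2 * f_low * np.sinc(2 * f_low * n)
    return kernel * np.hamming(kernel_size)  # window to smooth the truncation

kernel = sinc_bandpass(f_low=0.10, f_high=0.20)
# Magnitude response; bin k corresponds to k / 1024 cycles per sample.
response = np.abs(np.fft.rfft(kernel, 1024))
```

A standard convolutional filter of this length would have 251 free parameters, whereas this one has two, which is where the reduction in what must be learned comes from.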

What are the strengths and issues of local info max? One strength is that with local info max we are able to learn high-quality speaker representations, which are competitive with the ones learned in a standard supervised way. Moreover, local info max is very simple and also computationally efficient: because we only use local information, we can save a lot of computation. The limitation is that the representations are very task specific. As we have seen before, with local info max we can learn speaker identities, but what about the other forms of information that are embedded in the speech signal, like phonemes, emotions and many other things?

When I saw these results I asked myself: are we really sure that a single task is enough? Actually, most of the works out there try to use self-supervised learning by solving a single task, but my experience suggests that one single task is not enough, because with a single task we only capture little information on the signal, while we might want more.

Based on this observation, we decided to start a new project called problem-agnostic speech encoder (PASE), where we wanted to learn more general representations by jointly solving multiple self-supervised tasks. In PASE we have an ensemble of neural networks that must cooperate to discover good speech representations.

So what is the intuition behind that? If we jointly solve multiple self-supervised tasks, we can expect that each task brings a different view of the speech signal, and if we put together different views of the same signal, we might have higher chances to have a more general and complete description of the signal itself. Moreover, a consensus across all these tasks is needed, and this imposes some kind of soft constraint on the representation. In this way we can improve its robustness. So with this approach we were actually able to learn general, robust and transferable features, thanks to jointly solving multiple tasks. Let me explain in the next slides more details on how the system works.

PASE is based on an encoder that transforms the raw samples into a higher-level representation. The encoder is based on SincNet, followed by seven convolutional blocks. Here again we start from the raw samples, because we want to start from the lowest possible speech representation. After the encoder we have a bunch of workers, where each worker solves a different self-supervised task. One thing to remark is that the workers are very small. Why? Because if the workers are very simple, small neural networks, we force the encoder to provide a much more robust and higher-level representation.

There are actually two types of workers: regression workers, which solve a regression task, and binary workers, which solve a binary classification task. The binary workers are similar to the ones that we have seen before for local info max.

As for the regression tasks, we have some workers that estimate some kind of known speech representation. For instance, we have one worker estimating the waveform back, in an autoencoder fashion; we estimate the log power spectrum; we estimate the mel-frequency cepstral coefficients (MFCCs); and we also have prosodic features such as voicing probability, zero-crossing rate and so on.
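Two of these regression targets are simple enough to compute directly from the waveform. The sketch below, assuming 16 kHz audio with 25 ms frames and a 10 ms hop, derives a log power spectrum and a zero-crossing rate per frame; the function names and frame settings are hypothetical and are not taken from the talk.

```python
import numpy as np

def frame_signal(signal, frame_len=400, hop=160):
    """Slice a 1-D signal into overlapping frames of shape (n_frames, frame_len)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]

def regression_targets(signal):
    frames = frame_signal(signal)
    # Log power spectrum of each windowed frame.
    spectrum = np.fft.rfft(frames * np.hamming(frames.shape[1]), axis=1)
    log_power = np.log(np.abs(spectrum) ** 2 + 1e-10)
    # Zero-crossing rate: fraction of adjacent samples whose sign differs.
    signs = np.signbit(frames)
    zcr = np.mean(signs[:, 1:] != signs[:, :-1], axis=1)
    return log_power, zcr

rng = np.random.default_rng(0)
log_power, zcr = regression_targets(rng.standard_normal(16000))
```

Each worker would then regress one of these targets from the encoder output, so the targets again require no human labelling.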

So why do we do something like that? Because in this way we inject some kind of prior knowledge that can be very helpful in self-supervised learning. In particular, in the speech community we are well aware that there are some features that are very helpful, like MFCCs or prosodic features, so why not try to take advantage of that? Why not try to inject this information inside our neural network?

In parallel to the regressors we also have binary classification tasks, which work similarly to what we have described for the mutual information approaches. Basically, we sample three speech chunks, an anchor, a positive and a negative, according to some kind of predefined strategy. We then process all these chunks with the PASE encoder, and we feed a discriminator, which is trained with binary cross-entropy and should figure out whether we have a positive or a negative sample. So it is very similar to the approach we described before. The only difference is the adopted sampling strategy, because with different sampling strategies we can highlight different features.

One sampling strategy that we adopt is the one proposed in local info max which, as we have seen before, is able to learn speaker embeddings and, in general, speaker identities. Together with that we have another, similar strategy called global info max. Here we play basically the same game, but we use larger chunks, and with larger chunks we hope to highlight some kind of complementary information, which hopefully is more global.

Finally, we propose another interesting task called sequence predicting coding. With this task we are hopefully able to capture some kind of information on the order of the sequence. It works in this way: we choose a random chunk from a random sentence, and we call it the anchor chunk; we choose another random chunk from the future of the same sentence, and this is the positive one; and then we choose another random chunk from the past of the same sentence, which is the negative one. If we play this game, we are hopefully able to capture a little bit better how the sequence evolves, and to capture some kind of longer context information that we were not able to capture with the previous tasks.

Sequence predicting coding is similar to the contrastive predictive coding proposed by Aaron van den Oord. The main difference is that in our work the negative samples, actually all the samples, are derived from the same sentence and not from other ones, because in this case we would like to focus only on how the sequence evolves. We don't want to capture other kinds of information, such as the speaker identity, that we already capture with other tasks.
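The sampling for this task can be sketched as below, with all three chunks drawn from one sentence. The function name `sample_spc_triplet` is hypothetical, and forcing the chunks not to overlap is an assumption made here to keep the example simple; the talk does not state it.

```python
import numpy as np

def sample_spc_triplet(sentence, chunk_len, rng):
    """Return anchor, positive (future) and negative (past) chunks plus their starts."""
    # Leave room for one full chunk before and one full chunk after the anchor.
    a = int(rng.integers(chunk_len, len(sentence) - 2 * chunk_len + 1))
    p = int(rng.integers(a + chunk_len, len(sentence) - chunk_len + 1))  # future
    n = int(rng.integers(0, a - chunk_len + 1))                          # past

    def cut(start):
        return sentence[start:start + chunk_len]

    return cut(a), cut(p), cut(n), (a, p, n)

rng = np.random.default_rng(0)
sentence = np.arange(16000)  # toy sentence; values equal their own positions
anchor, positive, negative, (a, p, n) = sample_spc_triplet(sentence, 1600, rng)
```

Since every chunk comes from the same speaker and recording, the only cue left to separate positive from negative is temporal order.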

Okay, but how can we use PASE inside a speech processing pipeline? Step one is unsupervised training: we take the architecture that we have described before and we train it. In particular, we jointly train the encoder and the workers using standard SGD, by optimising a loss which is computed as the average of each worker's cost. In our experiments we tried different alternatives, but we found that averaging the costs is the best approach we could find.
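The combination rule is just an unweighted mean of the worker costs. A minimal sketch, where the worker names and the numeric values are hypothetical placeholders for one training batch:

```python
def total_loss(worker_losses):
    """Average the per-worker costs into the single loss that is optimised."""
    return sum(worker_losses.values()) / len(worker_losses)

# Hypothetical per-worker costs for one training batch.
losses = {"waveform": 0.8, "log_power_spectrum": 1.2, "mfcc": 1.0, "local_info_max": 0.6}
loss = total_loss(losses)
```

Because every worker backpropagates through the same encoder, minimising this mean asks the encoder for a representation that serves all tasks at once.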

Once we have trained our architecture without using any label, we can go to step two, which is supervised fine-tuning. In this case we get rid of all the workers and we plug our encoder into a supervised classifier, which is trained with a small amount of supervised data. Actually, here there are a couple of possible alternatives. Number one is to use PASE as a standard feature extractor; in this case we freeze PASE during the supervised fine-tuning phase. Another approach is to just pre-train PASE with the unsupervised parameters and fine-tune it during the supervised fine-tuning phase. This second approach is the one that usually yields the best performance.

It is very important to remark that step number one, the unsupervised training, should be done only once. In fact, we have seen that the PASE features are general enough that they can be used for a large variety of speech tasks, like speech recognition, speaker recognition, speech enhancement and many others. And if you don't want to train the self-supervised extractor by yourself, you can use the pre-trained parameters that we share.

But this is not all about PASE. In fact, encouraged by the good results achieved with the original version, we decided to spend some time to further revise the architecture and improve it. We took the opportunity of the JSALT 2019 workshop, organized by the Johns Hopkins University, to set up a team working on improving PASE, and as a result we came up with a new architecture called PASE+, where we introduced different types of improvements.

First of all, we couple PASE with on-the-fly data augmentation. Here we use speech contamination techniques like additive noise and reverberation, but we also add some kind of random zeros in the time waveform, and we filter the signal with some kind of random band-stop filters in order to introduce zeros in the frequency domain. Data augmentation turned out to be very important, because it gives the system some kind of robustness against noise, reverberation and other environmental artifacts.

A nice thing is that, since everything is done on the fly, every time we contaminate the sentence with a different distortion. Also, the workers are based on clean labels, extracted from the clean version of the signal, so in this way we implicitly ask our system to perform some kind of denoising.
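The interplay between a distorted input and clean targets can be sketched as follows. Only two of the distortions are shown (additive noise and a random run of zeros in time); the function name `contaminate`, the white-noise model, the SNR and the mask width are assumptions chosen for illustration.

```python
import numpy as np

def contaminate(clean, rng, snr_db=10.0, max_mask=1600):
    """Return a distorted copy of `clean`; a new distortion is drawn on every call."""
    noisy = clean.copy()
    # Additive white noise at the requested signal-to-noise ratio.
    noise_power = np.mean(clean ** 2) / (10 ** (snr_db / 10))
    noisy += rng.standard_normal(len(clean)) * np.sqrt(noise_power)
    # Random run of zeros in the time waveform.
    width = int(rng.integers(1, max_mask + 1))
    start = int(rng.integers(0, len(clean) - width + 1))
    noisy[start:start + width] = 0.0
    return noisy

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
encoder_input = contaminate(clean, rng)  # what the encoder sees
target_source = clean                    # worker targets still come from the clean signal
```

Since the targets are derived from `target_source` rather than from `encoder_input`, solving the worker tasks requires undoing the distortion, which is the implicit denoising mentioned above.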

Then we also revised the encoder. We still have SincNet and the convolutional layers, but we also have a quasi-recurrent neural network, which is an efficient way to introduce some kind of recurrence in the architecture, and we also added some skip connections that help gradient backpropagation. Then we improved a lot the workers. We noticed that the more workers, the better it is, so we definitely introduced a lot of workers that estimate, for instance, new types of features over different context lengths, et cetera. Overall, we could improve a lot the performance of the system on different speech tasks.

What do we learn with PASE? Here we show some t-SNE plots. They show that the PASE features capture speaker identities pretty well: you can clearly recognise that there are pretty well-defined clusters for the speakers. Here instead we show some t-SNE plots for phonemes. You can see that here everything is less well defined, but you can still detect some phonemes which are pretty well clustered, meaning that we are actually learning some kind of phonetic representation even without any phoneme label.

We tried PASE+ on different speech tasks, and you can refer to the paper to see all the results. Here let me just discuss some of the numbers that we have achieved on noisy ASR tasks, which highlight, I think, a little bit the robustness of the proposed approach. Let me first say that we trained PASE on LibriSpeech without using the labels and, very interestingly, we noticed that we don't need a lot of data to train PASE: we just need fifty or one hundred hours of LibriSpeech, and these are enough to generate the numbers that you see in the tables. This is quite interesting, because usually standard self-supervised approaches rely on a lot of data. In our case we think that somehow we are more data efficient, because we employ a lot of workers trying to extract a lot of information from our speech signals.

On the left you can see the results on DIRHA, which is a challenging task characterised by speech recorded in a domestic environment and corrupted by noise and reverberation. You can see here that PASE+ significantly outperforms traditional features and also combinations of traditional speech features. On the right you can see the results on CHiME-5. CHiME-5 is probably the most challenging task available, where the speech is corrupted by noise, reverberation and a lot of other disturbances such as overlapped speech, and even in this pretty challenging scenario we are able to slightly outperform the standard features.

Actually, the representations learned with PASE are quite general, robust and transferable, and we have successfully applied them to different tasks. Right now we have seen speech recognition, but you can use them for speaker recognition, for speech enhancement and for emotion recognition. I am also aware of some works that try to use PASE for transfer learning across languages, where you train PASE on English and you test on another language, and it seems to show some kind of surprising robustness in this transfer. You can find the code and the pre-trained model on GitHub, and I encourage you to go there and play with PASE as well.

Let me conclude this part with some thoughts on self-supervised learning and the role that it can play in the future. As I mentioned in the first part of the presentation, I think that the key ingredient of intelligent machines is the combination of different learning modalities. We can combine supervised learning with unsupervised, imitation, reinforcement and contrastive learning, and so on. So I think there is a huge space here for research in this direction, where we basically combine, in a simple and elegant way, different learning modalities. One of them could be self-supervised learning, but not only.

This is very important these days, because standard supervised learning, used as a stand-alone approach, is starting to show some kind of limitations, and these limitations will be even more evident in the next years. Supervised learning is too data demanding and too computationally demanding, and if we keep going in this direction, only a few companies in the world will be able to train state-of-the-art systems. I think that combining different learning modalities could mitigate this issue, and especially self-supervised learning, because, as we have seen in this presentation, self-supervised learning can be extremely useful in the transfer learning area. With self-supervised learning we have the chance to learn a representation which is general enough that it can be used for several downstream tasks, and this is a really big advantage in terms of computational complexity.

So I think the future paradigm will be similar to the first popular approaches of deep learning, where we were able to initialize our neural network using unsupervised or self-supervised learning approaches, and then fine-tune it with the supervised data that we have. I think this could be pretty much the future paradigm for speech, where self-supervised learning and transfer learning play a major role in the pipeline. This is similar to what we have seen in the past. The difference is that the first systems we were using for unsupervised learning were based on restricted Boltzmann machines, while right now we are using much more sophisticated techniques. But the idea is the same, and it could play a major role in speech processing and, more in general, in machine learning in the near future.

If you are interested in this topic and you would like to read more on self-supervised learning for audio and speech, you can take a look at the ICML workshop on self-supervised learning for audio and speech that we have recently organized. You can go to the website, see all the presentations and read all the papers, which I think is a kind of interesting initiative. Let me also highlight that there will be a similar initiative this year, so I would like to encourage you to participate in that as well.

Alright, since I have a few more minutes, I'm very happy to update you on another very exciting project that I am leading these days, which is called SpeechBrain. SpeechBrain will be an open-source, all-in-one toolkit entirely based on PyTorch. Our goal is to build a toolkit that can significantly speed up research and development of speech and audio processing techniques. So we are building a toolkit which will be efficient, flexible, modular and, very importantly, easy to use. The main difference with the other existing toolkits is that SpeechBrain is specifically designed to address multiple speech tasks at the same time. For instance, SpeechBrain supports speech recognition, speaker recognition, speech enhancement, speech separation, emotion recognition, multi-microphone signal processing, speaker diarization and many other things.

Typically, all these tasks share the same underlying technology, which is deep learning, and there is no real reason why we need different repositories for different kinds of speech applications. What we want is something like our brain: a single toolkit that is able to process several speech applications at the same time.

The main issue with the other toolkits, with most of them, is that they are designed for a single task. For instance, you can use Kaldi for speech recognition, and Kaldi is a great toolkit that can be extremely efficient, but it is still mainly focused on speech recognition. Other toolkits are very good for speaker recognition. I think that a toolkit explicitly designed for multiple different tasks still does not exist.

and people when they how to implement complex pipeline involving

different technologies lie like speech enhancement last

speech recognition

speech recognition speaker recognition

they are stuck because they have to jump from one toolkit to another

and of course jumping from one toolkit to another is very demanding because

there are different programming languages, different coding styles, et cetera

and the

one other issue is that

if we have different toolkits it is very hard to combine the systems together uniformly

in a single system just fully range just

a very important use this we declare

so we are actually working on that and we are trying to do our best

to do not always will allow users to

actually a couple the next

a speech point one

in an easy way

what about the timeline? actually we have worked a lot this year on that

we haven't email

a lot of people working on that a lot of interest

and we are very close to a first release that

will happen, we estimate, within a couple of months, so i strongly encourage you to

stay tuned and then

try

SpeechBrain

in the future and give us your feedback

speaker in

as quickly the project is how would be as well people

we have lower while the

twenty delaware as last having solar raiders you have all sources sounds will all be

ones and so the project is getting bigger and we go to have also the

product

all the speech community

technical the store

saying it be right to my

collaborator

the guys year are being

this ain't is that working on there

all these are the other works lots of the what's happening

and here you can see

the team that is currently working on SpeechBrain, and i would really like to thank them because

i think together we are working very well and

soon you'll see the result of our hard work

thank you very much

for everything

and i'm very happy now to reply to your questions

many thanks Mirco for this nice presentation

i already have a

a set of questions for you

so as to what is wrong using both ukrainian but at so complex the first

question was from nicole rubber

and the only the we i have to you england

it a weight on a holiday is less computationally demanding men so that is known

actually is nothing but the best and then i'm

i think i can take this opportunity to clarify this a little bit better

there are a couple of things to consider

first of all, with PASE

we're trying to learn not a task-specific representation but a general representation

and this means that you can train your self-supervised network just

once, right, and then you can use just a small amount of supervised data to

train the system

so this naturally leads to computational advantages, because you have to train

the big thing only once

and then you fine-tune

when you have

some

tasks that you address with standard supervised learning, and usually

if you have a good representation the supervised learning part is going to be much

easier

and the other, i think, good thing about PASE

that i didn't remark too much in the presentation, but it is better to remark

here a little bit,

is that PASE is pretty data efficient, right

we found very good results even just using something like fifty hours of speech so

very little compared to

what we see these days

even in self-supervised learning, where people are using thousands and thousands of hours

of speech

and we are data efficient because with the multiple workers

somehow we try to extract as much information as possible from the signal

we are trying to do our best to be also data efficient and extract everything we

can from the signal

so there are two things here

first

the fact that we are learning a general representation, right, so you

can train PASE only one time and use it for multiple tasks, and then also

the data efficiency part, which allows you to

learn a reasonable representation

even with

a relatively small amount of unlabeled data

an eco are you are you k do you have other

comments on the part

okay

five is very bad because you really question is on the sides of anyway and

try my best

i haven't quite a you have a question from don't combo well as you could

become a common and remote with this also is supervised learning and this ideal conditions

actually we increased a lot the robustness of PASE when we revised

it with PASE+

and as i mentioned before, in PASE+ we combine basically self-supervised learning with

on-the-fly data augmentation

on-the-fly means that every time we have a new sentence we

contaminate it with a different sequence of noise and a different reverberation, such that the system

every time sees a different

sentence,

or at least a different contamination, and in the output

our workers are not extracting their labels from the noisy signal but from the

original clean one
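A minimal sketch of this on-the-fly contamination idea, under stated assumptions: the decaying impulse response, the Gaussian noise, the SNR range and the names `contaminate` and `training_pair` are all hypothetical stand-ins chosen for illustration, and the worker target is simplified to the clean waveform itself. It is not the actual PASE+ pipeline; it only shows that a fresh distortion is drawn on every call while the target stays clean.

```python
import math
import random


def convolve(signal, impulse_response):
    """Causal convolution truncated to len(signal): a toy reverberation."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(impulse_response):
            if n - k < 0:
                break
            acc += h * signal[n - k]
        out.append(acc)
    return out


def power(x):
    """Average power of a sequence of samples."""
    return sum(v * v for v in x) / len(x)


def contaminate(clean, rng, snr_db_range=(0.0, 20.0)):
    """Draw a new random reverberation and noise sequence for this call."""
    # Hypothetical impulse response: exponential decay with a random rate.
    decay = rng.uniform(0.3, 0.9)
    impulse_response = [decay ** k for k in range(8)]
    reverbed = convolve(clean, impulse_response)
    # Hypothetical noise: white Gaussian noise scaled to a random SNR.
    snr_db = rng.uniform(*snr_db_range)
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    scale = math.sqrt(power(reverbed) / (power(noise) * 10 ** (snr_db / 10)))
    return [r + scale * n for r, n in zip(reverbed, noise)]


def training_pair(clean, rng):
    """Encoder input is contaminated; the worker target comes from the clean signal."""
    return contaminate(clean, rng), list(clean)


rng = random.Random(0)
clean = [math.sin(0.1 * n) for n in range(200)]
noisy_a, target_a = training_pair(clean, rng)
noisy_b, target_b = training_pair(clean, rng)  # same sentence, new contamination
```

Calling `training_pair` twice on the same sentence gives two different inputs with identical targets, which is what pushes the encoder towards denoised features.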

so somehow our system is forced

to

denoise the features

and this leads to the robustness we have seen before; we actually tried

it on challenging tasks, like the DIRHA and CHiME data, and it was really

great to see this increased robustness with respect to standard approaches

good thank you same really sure

okay

you ask some questions

bayes rule has also a question about the competition between the workers in PASE

and i don't we should be visible but he or within them or when

LIM and GIM could consider some segments

within the same utterance

one as a positive example and another as a negative

some people

and ask you what to expect been able to learn in this case

actually the set of workers that we tried is not random, right; we took the

opportunity of JSALT for instance to do a lot of experiments

and we just came out with a set of workers, the subset of workers,

the subset of ideas

that actually works for us

so actually one of our concerns was: okay, how is it possible to put together

regression tasks, which are based on the mean squared error for instance as loss, with

binary tasks, which are based on other kinds of loss like binary cross-entropy

how can we learn things together, and we thought that this was

a big issue, but we realised that actually it is not, just by doing experiments, doing

some kind of ablation of the workers, so we noticed that if we put

together more workers the better it is
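As a rough illustration of how regression workers and binary workers can share one training objective, here is a small sketch. The worker names and the unweighted average in `total_loss` are assumptions made for the example, not a statement of the exact weighting used in the paper.

```python
import math


def mse(pred, target):
    """Mean squared error, the kind of loss used by a regression worker."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def bce(prob, label, eps=1e-7):
    """Binary cross-entropy, the kind of loss used by a binary worker."""
    total = 0.0
    for p, y in zip(prob, label):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(prob)


def total_loss(worker_losses):
    """Combine all worker losses; an unweighted average is assumed here."""
    return sum(worker_losses.values()) / len(worker_losses)


# Hypothetical worker names and toy predictions.
losses = {
    "waveform_regressor": mse([0.1, 0.4], [0.0, 0.5]),
    "binary_discriminator": bce([0.9, 0.2], [1, 0]),
}
objective = total_loss(losses)
```

Because each worker contributes a scalar, adding or removing a worker for an ablation only changes the entries of the dictionary.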

and the same happens for LIM and GIM

which are different actually, because LIM is based on small

chunks of speech

and we that the will there are not and meeting not carry information

while GIM is the same game but played with larger

chunks of one second, one second and a half

and with that you learn, hopefully,

higher-level representations, so we found that

using the two at the same time is helpful, even though

they are clearly correlated tasks, right
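The sampling game behind these two workers can be sketched as follows, assuming the usual setup in which the anchor and the positive chunk come from the same sentence and the negative chunk from a different one. The function name `sample_triplet` and the chunk lengths are hypothetical; the only difference between the local and the global variant in this sketch is the chunk length.

```python
import random


def sample_triplet(sentences, chunk_len, rng):
    """Return (anchor, positive, negative) chunks of chunk_len samples.

    Anchor and positive are drawn from the same sentence, the negative
    from a different, randomly chosen sentence.
    """
    pos_idx, neg_idx = rng.sample(range(len(sentences)), 2)

    def random_chunk(signal):
        start = rng.randrange(len(signal) - chunk_len + 1)
        return signal[start:start + chunk_len]

    anchor = random_chunk(sentences[pos_idx])
    positive = random_chunk(sentences[pos_idx])
    negative = random_chunk(sentences[neg_idx])
    return anchor, positive, negative


rng = random.Random(1)
# Toy "sentences": constant signals so the origin of each chunk is visible.
sentences = [[0.0] * 100, [1.0] * 100, [2.0] * 100]
local_triplet = sample_triplet(sentences, 5, rng)    # short chunks, local variant
global_triplet = sample_triplet(sentences, 50, rng)  # longer chunks, global variant
```

Nothing prevents the anchor and the positive from overlapping in this sketch, which is one reason the two workers end up correlated.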

and cuban equation is coming from one channel

and the nist you is you have to the right including to provide us to

pay

and she's really thinking about the five but they is not explicitly thing within speaker

variability

so none of the task is forcing and then he's from different from those from

within speaker could be seen you know we shall work

use it for you and known problem in adding some supervised five little ones you're

where you have always easy

well, first of all, including supervised tasks totally makes sense, honestly; one can play

with that and do

semi-supervised learning, of course, instead of self-supervised learning in this case, and i

think some people already did that; i saw some recent papers that

were actually trying to do that

in this paper, for PASE, we preferred to stay

on the self-supervised side only, to make sure to actually check

what we can achieve with a pure

self-supervised learning approach

so as for PASE for speaker recognition and the within-speaker variability, maybe yes, it is

not specifically designed for that

so it is not optimal, but we anyway learn some kind of

speaker identity

actually

we didn't test it too much, but we are confident that what we learn

can be quite competitive with standard systems; actually maybe we have

to revise the architecture a little bit for speaker recognition applications, because these days

i also see

numbers which are impressive in terms of equal error rate for VoxCeleb

but

the same idea, i mean, could be, i think, extended and

redesigned to specifically learn better speaker embeddings; actually our main target was

more general, so we wanted

to learn a pretty general representation and see if this somehow works

reasonably well for multiple tasks

thank you very is a nicely with the next question from o coming from the

we're not

which tries to use

if you common than the five about the things that you system is no need

to give or speaker restitution and ten information you can really

as you are using examples positive examples coming from within a this a single you

five

actually what we do is to do this on-the-fly augmentation

for sure, right

so if we have sentence one

one time sentence one is

contaminated with some kind of channel, some kind of reverberation effect, and the next time

it is contaminated with another one, so maybe with this approach we try to limit a

little bit this effect, but

there might be this issue, it's true

do you mean to you thing but that the motivation you use it would take

decision problem of internal run by itself

so maybe not tackling the full problem but at least

minimizing it, right, or

reducing it, right

on the other hand, we don't have many alternatives

if we would like to stay in the

self-supervised domain, right; we don't have speaker labels, so we cannot say okay,

let's jump to another signal from the same speaker, because in that case

we would have

to use

the labels, so

the best we can do is to contaminate the sentence,

i mean,

change a little bit the reverberation and noise effects, and

hope

to learn more the speaker and less the channel

fine i we moved to a question from and you can turn

hasn't that i model can use form from two perspectives and dealing extraction and more

than ornament pre-training

both for this we don't should be effective

well but which one may be built for speaker verification so

language and then

we take a look again

okay

i think PASE could be used for both, you're right, it can be used for

feature extraction or embedding extraction, or basically for pre-training

my experience is that

this works very well in a pre-training scenario, so it is designed basically

to

pre-train your network in a self-supervised way and then

fine-tune it with a small amount of supervised data

this is

basically the main application we have in mind for PASE, but

we also tried it as a standard feature extractor

or embedding extractor,

not for speaker recognition but for speech recognition

and it works quite well, so if you freeze the encoder, right, and you plug in

just the features that you have there, the supervised classifier works well, but

it works better if you jointly fine-tune the encoder and the classifier during the

supervised phase

thank you and we'll come back to the grid also no with a question about

the

temporal

sequence walker also can you would avoid more on the minimum detection walker but focused

on the right sequence

this is for that the

maybe some cases the segment from the few to and but would reasonably contain the

thing then

with some problem with this walkers you know some comments

definitely, that's a very nice question; actually, currently this worker

is the one that has the least important effect on the performance

so, as i mentioned, with a lot of ablation we tried to figure

out the effect of each task, and this one was working well, it improves,

but less than other workers that were more important, like the regressors

and LIM and GIM

and mm

actually there is an important risk: when you sample

from the past or sample from the future you have to make sure

you are not sampling within the receptive field of your convolutional neural network

otherwise the task becomes

too easy

so what we have done is to make sure that the future sample

is not too close, right, to the anchor one, and not

too far, because if it is too close the risk is to learn nothing basically,

and if it is too far

the risk is that there isn't any reasonable correlation anymore between the two

so it's not easy to design this task

and them

we did the

you didn't as weights are we

we were able to sample the past and the future representations within some reasonable range
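The "not too close, not too far" constraint can be sketched as a bounded random offset. The function name `sample_future_index` and the parameters `min_gap` and `max_gap` are hypothetical; the assumption is that `min_gap` is chosen larger than the receptive field of the encoder (in frames) and `max_gap` small enough that the two frames are still correlated.

```python
import random


def sample_future_index(anchor, num_frames, min_gap, max_gap, rng):
    """Pick a future frame at least min_gap and at most max_gap frames after the anchor."""
    lo = anchor + min_gap
    hi = min(anchor + max_gap, num_frames - 1)
    if lo > hi:
        raise ValueError("no valid future frame for this anchor")
    return rng.randint(lo, hi)  # randint includes both end points


rng = random.Random(2)
# Hypothetical numbers: 400 frames, gap between 10 and 100 frames.
future = sample_future_index(anchor=50, num_frames=400, min_gap=10, max_gap=100, rng=rng)
```

A past sample can be drawn the same way by mirroring the offsets around the anchor.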

it could be interesting to write i believe that you hide traces

but are we move to another question from i in one but still were asked

to you being to write the all the same it does

known from will lead to four speakers for extracting speaker-specific information

well, this paper, the PASE paper, actually is not only about

speaker recognition, so the filters that we learn are actually not that

far away from standard

mel filters, where we basically try to allocate

more filters in the lower part of the spectrum and fewer filters in

the higher part of the spectrum
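To make the mel allocation concrete, here is a small sketch that places filter centre frequencies uniformly on the mel scale, using the common formula mel = 2595 log10(1 + f / 700). The filter count and frequency range are example values, not taken from the talk.

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to the mel scale (common 2595/700 variant)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(num_filters, f_min, f_max):
    """Centre frequencies in Hz, uniformly spaced on the mel scale."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (num_filters + 1)
    return [mel_to_hz(m_min + step * (i + 1)) for i in range(num_filters)]


# Example values: 40 filters between 0 Hz and 8 kHz.
centers = mel_center_frequencies(40, 0.0, 8000.0)
below_4k = sum(1 for c in centers if c < 4000.0)
```

With these example values most of the centres fall below 4 kHz, which is the "more filters in the lower part of the spectrum" behaviour mentioned above.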

with LIM,

local info max, the technique that was designed basically to work only for speaker recognition, the

filters we learn have some areas, right, with more filters, and these areas

are the most common ones for the pitch and the formants

so similar to what we have seen

using SincNet

with the supervised approach

but with PASE we do not allocate

more filters in the pitch region; we are more or less the same as the

standard mel filter bank

i we have

we are also conclusion i don't have more open question i just have one or

be possible

a i would like to see you

to the explaining more well i use a about unsupervised training

used and the it composes of as the training

it's of my feeling what an issue as

b s

a more easy to find if you have some

during a supervised training because you're some information on the data meta information of video

and we each with the unsupervised training seems to me that

you have less information but you have no reason to have a list yes in

the

they

the figure sure

okay and

the reason is that the

if you train your representation with supervised data, your representation could be biased towards

the task, right; specifically, for instance, if you train

a representation with speaker recognition, right, your representation is not good for speech

recognition, and so it has a bias towards speaker recognition

with self-supervised learning, at least in the way we are trying to do

it, with multiple tasks et cetera, this risk is reduced because you have

the same representation that is good for both speech recognition

and speaker recognition

communist and that i

really the want to thank you again and we are we will be the over

the official but before to close the position i will be the microphone to get

a the only those two

wants to you to also i think you actually

thank you are service right

yes i stepped off state of a very wide of the top integerization and then

s l obtained in this session

so as dataset now do you think that something to us to decisions

but

system

one of the stuff can show this not

and the second

yes

if your best guess okay just to heal i you the that token decision that

the but there is something that sequences changes that's thanks for inviting me that was

really great thank you

okay that's a tennis together again

and test on a distance for a

a sentence

and you lucille tomorrow a same time this time i in a and ten

definitely

so as you can just

of that by time

Towards Unsupervised Learning of Speech Representations

Keynotes

Dr. Mirco Ravanelli, Université de Montréal, Canada