0:00:01 Two... yes... OK, the test seems to work well.
0:00:11 Thank you. First, I want to thank you again for organising all of this; your work on it is very appreciated. You did very well, and I'm sure we will take advantage of your organisational skills again, if you allow it.
0:00:31 Secondly, I'm really happy to have the occasion to introduce the first speaker. 0:00:41 I will be short, because I'm quite sure that all of you already know him, so he does not really need a long introduction; 0:00:58 still, according to me at least, he deserves a few words.
0:01:04 So: you got your master's degree and then went to the University of Trento, about ten years ago. 0:01:30 There you were a PhD student, and you defended your PhD thesis on deep learning for distant speech recognition in 2017. 0:01:42 You then moved to Mila; maybe it is not even useful to introduce it, it is a lab that all of us know very well. There you started working as a postdoc, working closely with Yoshua Bengio. 0:02:01 You have worked on several topics, mainly on learning representations, for speech but not only; 0:02:08 and recently you also became one of the core founders of the SpeechBrain initiative, building a new open toolkit for speech and speaker recognition.
0:02:23 So, without taking more time: you already have a long list of talks on these topics, and I know you have prepared a very nice overview for us. But before giving you the floor, let me explain how the session will work. 0:02:54 We will first listen to a pre-recorded video by Mirco. 0:03:00 During the video, if you have questions, please write them in the question-and-answer box: 0:03:10 think about your questions and write them as soon as possible, so we can give Mirco time to prepare complete answers. 0:03:19 Then we will have a fifteen-minute live question-and-answer session with Mirco. 0:03:29 During this session you can use both the question-and-answer box 0:03:34 or raise your hand, and if you raise your hand you will be able to ask your question live during the session.
0:03:44 So, Mirco, if you want to say some words before we start...
0:03:47 Thank you very much for the introduction, and hello everybody! 0:03:52 I hope the audio within the video will be fine, but in the worst case you might have to increase the volume a little bit; let's see how it goes. 0:04:04 OK, cool, I think we can really go now.
0:04:49 Sorry, we have a small technical problem: we don't have the audio. 0:04:57 Before, it was working, so it is better to switch back to the previous presentation. 0:05:05 You can't hear anything, right? 0:05:10 Yes... let me adjust a little. 0:05:17 OK, trying again.
0:05:41 Hi everyone, I am Mirco Ravanelli, and I am very happy to be here today, at least virtually, to give this keynote for the speech community, 0:06:01 entitled "Towards unsupervised learning of speech representations". 0:06:07 Self-supervised learning is a key topic in the machine learning field, and of course it is gaining popularity within the speech community as well. 0:06:18 So today I would like to share with you the experience that I gained after working for two or three years on this topic.
0:06:32 OK, but before diving into self-supervised learning, let me remind you of some of the limitations of supervised learning, which is the dominant paradigm these days. 0:06:48 You can see deep learning as a way to learn hierarchical representations, where we start from low-level concepts, we combine them, and we create higher-level concepts. 0:07:03 This learning, in the most general case, is implemented through deep neural networks that are often trained in a supervised way using large annotated corpora. 0:07:22 This is clearly not the only approach: despite the great success of deep learning in many practical applications, it is clear today that this paradigm has some limitations.
0:07:39 What are these issues? For example: we need data, and not generic data, but annotated data, and annotating data at scale can be an issue, since it is expensive and time-consuming and often requires domain experts. 0:08:01 Moreover, supervised learning is data-hungry and also computationally demanding: 0:08:09 of course, these days, to reach state-of-the-art performance in machine learning we need a lot of data, and a lot of data requires a lot of computation, 0:08:21 limiting de facto access to this technology to a bunch of well-resourced users.
0:08:36 Moreover, if we train a system in a supervised way, the representations that it learns might be biased towards a specific application. 0:08:49 For instance, if we train a system for speaker identification, the representations discovered there would not work well for speech recognition. 0:09:00 So we might want to learn instead some kind of general representation that makes transfer learning much easier and better across tasks.
0:09:14 The third limitation is actually more speculative, and it is that our brain does not use only supervised learning: it cleverly combines different learning modalities. 0:09:27 I'm pretty sure that combining different learning modalities is crucial to reach higher levels of artificial intelligence: we can combine supervised learning with contrastive learning, with imitation learning, with reinforcement learning, and of course with self-supervised learning.
0:09:55 So, what is self-supervised learning? Self-supervised learning is a type of unsupervised learning where we do have a supervision, but the supervision is extracted from the signal itself. 0:10:13 In self-supervised learning we thus don't have humans that have to create the labels: the labels are created basically for free, and we can create tons of them without effort. 0:10:31 Normally, in self-supervised learning, we apply some kind of known transformation to the input signal and use the resulting outcomes as labels, as targets.
0:10:46 Well, let me clarify this with some examples derived from the computer vision community, which was the first one putting effort into this approach. 0:10:58 In the computer vision community, actually, they noticed quite early, earlier than the others, that by solving some kind of simple task we are able to train a neural network that learns some kind of meaningful representation. 0:11:17 For instance, you can ask your neural network to solve some kind of relative-positioning task, where you have small patches of an image and you have to decide the relative position between them; you can ask your neural network to put the right colours inside an image; or to find the correct rotation of an image. 0:11:39 All of these tasks are relatively easy, but if we design a system, a neural network, that learns how to solve them, 0:11:49 we inherently require our system to have some kind of semantic knowledge of the world, or at least semantic knowledge of the image, that can be reused to derive, hopefully, high-level, robust representations.
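The rotation pretext task just described is easy to sketch in code: the labels come for free from the transformation itself, with no human annotation. A minimal illustration with toy random arrays standing in for real images (all sizes and numbers here are invented for the sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "images": random 8x8 arrays standing in for real pictures.
images = rng.standard_normal((4, 8, 8))

def make_rotation_task(images, rng):
    """Create (rotated_image, label) pairs: label k means the image
    was rotated by k * 90 degrees. The supervision is derived from
    the known transformation, not from human labels."""
    labels = rng.integers(0, 4, size=len(images))
    rotated = np.stack([np.rot90(img, k) for img, k in zip(images, labels)])
    return rotated, labels

rotated, labels = make_rotation_task(images, rng)
```

A network trained to predict `labels` from `rotated` must implicitly learn something about image content, which is the whole point of the pretext task.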
0:12:10 And yes, self-supervised learning is extremely interesting and is gaining a lot of attention. 0:12:19 Let me show you the famous cake analogy by Yann LeCun, saying that, if intelligence is a cake, 0:12:29 then supervised learning is the icing on the cake, reinforcement learning is the cherry on the cake, and unsupervised, or self-supervised, learning is the cake itself; 0:12:40 meaning that we believe this modality is definitely a key ingredient to develop intelligent systems.
0:12:54 OK, but what about the audio and speech field? As I mentioned before, there is a growing number of research works that go in the direction of self-supervised learning in audio and speech, and we have seen many of them even at this Interspeech. 0:13:17 Here let me just highlight a few of them. 0:13:22 In my opinion, the first work that clearly showed the potential of self-supervised learning in audio and speech is the contrastive predictive coding work by Aaron van den Oord, back in 2018; this work is mostly about predicting the future given the past. 0:13:43 More recently we have seen another very good work by Facebook, called wav2vec 2.0, where they were able to show impressive results with their approach, which employs some kind of masking technique similar to the one used in BERT. 0:14:05 I also contributed to this field with the problem-agnostic speech encoder (PASE), which, as we will see later, explores multi-task self-supervised learning.
0:14:18 However, self-supervised learning on speech is really challenging. Why? 0:14:27 First of all, because speech is characterised by high-dimensional data: we typically have long sequences of samples that can be of variable length. 0:14:43 Last, but not least, speech inherently entails a complex hierarchical structure that might be very difficult to infer without being guided by a strong supervision. 0:14:58 Speech, in fact, is characterised by samples; we can combine the samples to form phonemes; from phonemes you can create syllables, then words, and finally we have the meaning of a sentence. 0:15:16 Inferring all this kind of structure might be extremely difficult.
0:15:25 On my side, I started studying self-supervised learning when I started my postdoc at Mila, almost three years ago. 0:15:37 At that time, people at Mila were doing research on self-supervised learning approaches based on mutual information, 0:15:46 and I got so excited that I decided to study self-supervised approaches with mutual information for learning speech representations. 0:15:56 That led to the development of a technique called Local Info Max that I will describe in the next slides. 0:16:05 After that, we further extended this technique using a multi-task self-supervised learning approach, and that led to the development of the problem-agnostic speech encoder, PASE, that we presented at Interspeech 2019. 0:16:22 And we recently extended this with another technique, a significantly improved system called PASE+, and we presented this work at ICASSP.
0:16:37 OK, let's start from the mutual-information-based approach. What is mutual information? 0:16:44 Mutual information is defined as the Kullback–Leibler divergence between the joint distribution of two random variables and the product of their marginals. 0:16:58 Why is this important? Because with mutual information we can capture complex nonlinear relationships between random variables: 0:17:10 if the two random variables are independent, the mutual information is zero, while if there is some kind of dependency between the variables, the mutual information is greater than zero.
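This definition, and the independence property, can be checked numerically on a toy discrete pair; the joint table below is invented purely for illustration:

```python
import numpy as np

# Hypothetical 2x2 joint distribution of two binary random variables
# (illustrative numbers, not from the talk).
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1, keepdims=True)   # marginal of X
p_y = p_xy.sum(axis=0, keepdims=True)   # marginal of Y

# MI = KL( p(x,y) || p(x)p(y) ) = sum_{x,y} p(x,y) log p(x,y)/(p(x)p(y))
mi = float(np.sum(p_xy * np.log(p_xy / (p_x * p_y))))

# An independent joint (the product of the same marginals) has zero MI.
p_indep = p_x * p_y
mi_indep = float(np.sum(p_indep * np.log(p_indep / (p_x * p_y))))
```

For this dependent table the MI is strictly positive (in nats), while the independent table gives exactly zero, matching the property stated in the talk.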
0:17:26 This is very attractive; the issue is that mutual information is difficult to compute in high-dimensional spaces, and this has limited a lot its applicability in practical machine learning. 0:17:47 However, one recent work, called Mutual Information Neural Estimation (MINE), found that it is possible to maximise or minimise mutual information within a framework that closely resembles that of GANs.
0:18:05 How does it work? Let's imagine we can somehow draw some samples from the joint distribution; we call them positive samples (we will explain later how we can do that for speech). 0:18:20 Let's also assume we can draw some samples from the product of the marginal distributions, and we call these negative samples. 0:18:32 Then we can feed these positive and negative samples to a special neural network whose cost function is the Donsker–Varadhan bound, which is a mutual-information lower bound; 0:18:49 and if we train this neural network to maximise this lower bound, we finally converge to an estimate of the mutual information.
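The Donsker–Varadhan bound mentioned here can be sanity-checked numerically on the same kind of toy discrete pair. The tables and the perturbation below are invented for the sketch; MINE would instead parameterise the statistic T with a trained network:

```python
import numpy as np

# Donsker–Varadhan lower bound on MI:
#   MI(X;Y) >= E_{p(x,y)}[T] - log E_{p(x)p(y)}[exp(T)]   for ANY T.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1, keepdims=True)
p_y = p_xy.sum(axis=0, keepdims=True)
p_prod = p_x * p_y

true_mi = float(np.sum(p_xy * np.log(p_xy / p_prod)))

# The bound is tight for T* = log p(x,y)/(p(x)p(y)); perturbing T
# away from T* can only decrease it.
t_opt = np.log(p_xy / p_prod)
t_bad = t_opt + np.array([[0.3, -0.2], [0.1, 0.0]])  # arbitrary perturbation

def dv_bound(t):
    return float(np.sum(p_xy * t) - np.log(np.sum(p_prod * np.exp(t))))
```

Maximising `dv_bound` over T (with a neural network, as in MINE) therefore pushes the estimate up towards the true mutual information from below.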
0:19:01 Inspired by this approach, I started thinking about mutual-information-based approaches specifically for speech, 0:19:12 and I designed a technique called Local Info Max (LIM), that works in this way. 0:19:22 First of all, we employ a sampling strategy that provides positive and negative samples, and it works like this: we choose a random chunk from a random sentence, and we call it c1; 0:19:37 then we choose another chunk of speech from the same sentence, and we call it c2; 0:19:45 and finally we choose another random chunk from another sentence, and we call it c_rand. 0:19:53 With these samples, with these chunks, we can do some interesting things. 0:20:00 For instance, we can process c1, c2 and c_rand with an encoder, which provides hopefully higher-level representations z1, z2 and z_rand. 0:20:14 Then we can compose positive and negative samples from them: 0:20:21 if we concatenate z1 and z2, we create a sample from the joint distribution, a positive sample; 0:20:30 it is a positive sample because we expect some kind of relation between these random variables, since they are extracted from the same signal. 0:20:43 Then we can also create a negative sample by concatenating z1 and z_rand, and this can be seen as a sample from the product of the marginal distributions.
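The LIM sampling and pair construction can be sketched as follows. The waveforms are random noise and `encoder` is a trivial stand-in for the real SincNet-based network; chunk lengths and dimensions are invented for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_chunk(sentence, chunk_len, rng):
    """Pick a random fixed-length chunk from a waveform."""
    start = rng.integers(0, len(sentence) - chunk_len + 1)
    return sentence[start:start + chunk_len]

# Two hypothetical "sentences" (toy waveforms, not real speech).
sent_a = rng.standard_normal(16000)
sent_b = rng.standard_normal(16000)
chunk_len = 3200

c1 = random_chunk(sent_a, chunk_len, rng)      # anchor
c2 = random_chunk(sent_a, chunk_len, rng)      # same sentence
c_rand = random_chunk(sent_b, chunk_len, rng)  # different sentence

# Stand-in encoder: per-frame summary statistics instead of a network.
def encoder(chunk):
    frames = chunk.reshape(-1, 160)
    return np.concatenate([frames.mean(axis=1), frames.std(axis=1)])

z1, z2, z_rand = encoder(c1), encoder(c2), encoder(c_rand)

# Positive pair: sample from the joint distribution (same sentence).
positive = np.concatenate([z1, z2])
# Negative pair: sample from the product of marginals (different sentences).
negative = np.concatenate([z1, z_rand])
```

Both pairs share the same anchor representation z1; only the second half distinguishes joint from marginal samples.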
0:20:56 After that, we employ a discriminator, which is fed with positive or negative samples, and the discriminator should figure out, basically, if it received positive or negative examples: in this case, whether the representations come from the same sentence or from another one. 0:21:22 In this system, the discriminator loss is set up to maximise the mutual information. 0:21:30 Moreover, the encoder and the discriminator are jointly trained from scratch, and this results in a cooperative game, not in an adversarial game like in GANs: 0:21:46 in this case, the encoder and the discriminator must cooperate to learn good and, hopefully, high-level representations.
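A minimal stand-in for the discriminator game: a logistic discriminator trained with binary cross-entropy to tell dependent pairs from independent ones. The toy data and the elementwise-product features are invented for the sketch; the real discriminator is a small neural network on encoder outputs, with a cost tied to a mutual-information bound:

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 8

# Toy "encoder outputs": positives have dependent halves (same sentence),
# negatives have independent halves (different sentences).
z = rng.standard_normal((256, dim))
pos = z * (z + 0.1 * rng.standard_normal((256, dim)))  # dependent halves
neg = z * rng.standard_normal((256, dim))              # independent halves
# Elementwise products act as simple pair-interaction features, so even
# a linear read-out can detect the dependency.
x = np.vstack([pos, neg])
y = np.concatenate([np.ones(256), np.zeros(256)])

w = np.zeros(dim)
b = 0.0

def bce_loss(x, y, w, b):
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
    return float(-np.mean(y * np.log(p + 1e-12)
                          + (1 - y) * np.log(1 - p + 1e-12)))

loss_start = bce_loss(x, y, w, b)
for _ in range(300):  # plain gradient descent on the convex BCE loss
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
    w -= 0.1 * (x.T @ (p - y)) / len(y)
    b -= 0.1 * float(np.mean(p - y))
loss_end = bce_loss(x, y, w, b)
```

In LIM the encoder is trained jointly with this discriminator, so both cooperate: the encoder is pushed to produce representations that make the positive/negative decision easy.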
0:21:57 A good question here is: OK, but what do we learn when we play this game? 0:22:00 With this game, we basically learn speaker identities; we learn speaker embeddings. 0:22:15 Why? Because this approach is based on randomly sampling within the same sentence, and if we randomly sample within the same sentence, a reliable shared factor that the system can disentangle is definitely the speaker identity. 0:22:34 Moreover, if we assume that we have a dataset that is large enough, with a large variability of speakers, and we randomly sample two sentences, the probability of finding the same speaker is very low. 0:22:49 So, overall, this can be seen as a system for learning speaker embeddings without providing to the system any explicit label on the speaker identity.
0:23:06 The encoder is fed with the raw speech samples directly. 0:23:12 In the first layer of the convolutional architecture, we just use SincNet, which makes the problem of learning from the raw samples much easier. 0:23:20 In fact, instead of using the standard convolutional filters, we use band-pass parameterised filters that only learn the cutoff frequencies; 0:23:32 this makes learning from the raw signal easier, 0:23:38 and it is not only useful in supervised learning, but we also found it useful in this self-supervised context. 0:23:44 I encourage you to read the reference paper if you would like to hear more about SincNet.
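The band-pass parameterisation can be sketched as in the SincNet idea: a filter defined by just two cutoff frequencies, built as the difference of two sinc low-pass filters. The sample rate, tap count, window choice and cutoffs below are assumptions for the sketch:

```python
import numpy as np

# A SincNet-style band-pass filter: the only (learnable) parameters
# are the two cutoff frequencies f1 < f2, here fixed by hand.
sr = 16000                      # sample rate in Hz (assumed)
f1, f2 = 1000.0, 2000.0         # pass band in Hz (assumed)
taps = 251
t = (np.arange(taps) - taps // 2) / sr

# Difference of two low-pass sinc filters gives an ideal band-pass,
# smoothed here with a Hamming window.
bp = 2 * f2 * np.sinc(2 * f2 * t) - 2 * f1 * np.sinc(2 * f1 * t)
bp *= np.hamming(taps)
bp /= np.abs(bp).sum()          # illustrative normalisation

# Frequency response: strong inside the band, weak outside.
H = np.abs(np.fft.rfft(bp, n=4096))
freqs = np.fft.rfftfreq(4096, d=1 / sr)
gain_in = H[np.argmin(np.abs(freqs - 1500))]   # band centre
gain_out = H[np.argmin(np.abs(freqs - 5000))]  # stop band
```

Because only f1 and f2 are learned, the first layer has very few parameters and is forced to behave like a bank of interpretable band-pass filters.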
0:23:53 What are the strengths and the issues of Local Info Max? 0:23:58 Once we train this system, we are able to learn high-quality speaker representations, which are competitive with the ones learned in a standard supervised way. 0:24:14 Moreover, Local Info Max is very simple and also computationally efficient: because we only use local information, we can parallelise a lot of the computations. 0:24:26 The limitation, with that said, is that the representations are very task-specific: as we have seen before, with LIM we can learn speaker embeddings, 0:24:39 but what about all the other kinds of information that are embedded in a speech signal, like phonemes, emotions, and many other things?
0:24:51 So, when I saw these results, I asked myself: are we really sure that a single task is enough? 0:25:00 Actually, most of the efforts that try to use self-supervised learning do so by solving a single task, 0:25:07 but my experience suggests that one single task is not enough, because with a single task we can only capture a little part of the information in the signal that we might want. 0:25:25 Well, based on this observation, we decided to start a new project, called the problem-agnostic speech encoder (PASE), where we wanted to learn more general representations by jointly tackling multiple self-supervised tasks. 0:25:46 In PASE we have an ensemble of neural networks that must operate together to discover good speech representations.
0:25:58 So, what is the intuition behind that? 0:26:01 If we jointly solve multiple self-supervised tasks, we can expect that each task brings a different view on the speech signal; 0:26:13 and if we put together different views on the same signal, we might have higher chances of getting a more general and complete description of the signal itself. 0:26:28 Moreover, a consensus across all these views is needed, and this imposes some kind of soft constraint on the representation that, we believe, can improve its robustness.
0:26:44 So, with this approach, we were actually able to learn general, robust and transferable features, thanks to jointly solving multiple tasks. 0:26:56 Let me explain in the next slides more details on how the system works. 0:27:05 PASE is based on an encoder that transforms the raw samples into a higher-level representation. 0:27:14 The encoder is based on SincNet, followed by seven convolutional blocks and a final linear layer. 0:27:22 As you can see, we start from the raw signal: we always want to start from the lowest possible speech representation. 0:27:32 After the encoder, we have a bunch of workers, where each worker solves a different self-supervised task. 0:27:41 One thing to remark is that the workers are very small ones, 0:27:47 because, if the workers are very simple and small neural networks, we force the encoder to provide a much more robust and, hopefully, higher-level representation.
0:28:01 There are actually two types of workers: regression workers, that solve a regression task, and binary workers, that solve a binary classification task; 0:28:14 the binary workers are similar to the ones that we have seen before with the mutual-information approaches. 0:28:23 As for the regression tasks, we have some workers that estimate some kind of known speech representation: 0:28:33 for instance, we have one worker estimating the waveform back, in an autoencoder fashion; we estimate the log power spectrum; we estimate the mel-frequency cepstral coefficients (MFCCs); and we also have prosodic features, such as voicing probability, zero-crossing rate, and energy. 0:28:54 So, why do we do something like that? Because in this way we inject into the encoder some kind of prior knowledge that can be very helpful in self-supervised learning. 0:29:07 In particular, in the speech community we are well aware that there are some features that are very helpful, like MFCCs for instance; so why not try to take advantage of that? Why not try to inject this information inside our neural network?
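Two of the regression targets mentioned here, zero-crossing rate and energy, are cheap to compute from the waveform; a minimal sketch on a synthetic tone (the frame sizes and the 441 Hz tone are invented for illustration):

```python
import numpy as np

sr = 16000
wav = np.sin(2 * np.pi * 441 * np.arange(sr) / sr)  # 1 s synthetic tone
frame, hop = 400, 160  # 25 ms frames with a 10 ms hop at 16 kHz

starts = np.arange(0, len(wav) - frame + 1, hop)
frames = np.stack([wav[s:s + frame] for s in starts])

# Zero-crossing rate: fraction of consecutive samples that change sign.
zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
# Log energy, with a small floor for numerical stability.
log_e = np.log(np.sum(frames ** 2, axis=1) + 1e-10)
```

Per-frame vectors like `zcr` and `log_e` are exactly the kind of cheap, label-free regression targets a worker can be asked to predict from the encoder output.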
0:29:29 In parallel to the regressors, we also have binary classification tasks. 0:29:35 The binary classification tasks work in a way similar to what we have described for the mutual-information approaches: 0:29:43 basically, we sample three speech chunks, an anchor, a positive and a negative, according to some kind of predefined sampling strategy; 0:29:53 we then process all these chunks with the PASE encoder; 0:29:59 and then we have a discriminator, trained with a binary cross-entropy loss, that should figure out whether we have a positive or a negative pair. 0:30:08 So it is very similar to the LIM approach we described before; 0:30:14 the only difference is the underlying sampling strategy, because with different sampling strategies we can highlight different features. 0:30:24 One sampling strategy that we adopt is the one proposed in Local Info Max that, as we have seen before, is able to learn speaker embeddings, and in general the speaker identity. 0:30:38 Together with that, we have another similar strategy, called Global Info Max: here we play basically the same game, but we use larger chunks, 0:30:49 and with larger chunks we hope to learn some kind of complementary information which is hopefully more global.
0:31:01 Finally, we propose another interesting task, called sequence predictive coding. 0:31:07 With this task we are hopefully able to capture some kind of information on the order of the sequence. 0:31:16 It works in this way: we choose a random chunk from a random sentence, called the anchor chunk; 0:31:24 then we choose another random chunk from the future of the same sentence, and this is the positive one; 0:31:31 and then we choose another random chunk from the past of the same sentence, and this is the negative one. 0:31:37 If we play this game, we are hopefully able to capture a little bit better how the sequence can evolve, and thus capture some kind of longer-context information that we were not able to capture with the previous tasks. 0:31:56 This sequence predictive coding is similar to the contrastive predictive coding proposed by van den Oord; 0:32:03 the main difference is that, in our work, the negative samples, actually all the samples, are derived from the same sentence, not from other ones, 0:32:14 because in this case we would like to only focus on how the sequence evolves: we don't want to capture other kinds of global information, such as the speaker identity, that we already capture with the other tasks.
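The anchor/future/past sampling for sequence predictive coding reduces to index arithmetic over one sentence; a sketch in which the chunk length, the gap, and the sentence length are invented (the exact offsets used in the paper may differ):

```python
import numpy as np

rng = np.random.default_rng(3)

def spc_triplet(n_samples, chunk_len, gap, rng):
    """Sample (anchor, future-positive, past-negative) chunk start
    indices from a single sentence. `gap` keeps the positive and
    negative chunks from overlapping the anchor (illustrative layout)."""
    anchor = int(rng.integers(chunk_len + gap,
                              n_samples - 2 * chunk_len - gap))
    future = int(rng.integers(anchor + chunk_len + gap,
                              n_samples - chunk_len + 1))
    past = int(rng.integers(0, anchor - chunk_len - gap + 1))
    return anchor, future, past

anchor, future, past = spc_triplet(n_samples=48000, chunk_len=3200,
                                   gap=1600, rng=rng)
```

All three chunks come from the same sentence, so factors like speaker identity cancel out and only the temporal order is left to discriminate.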
0:32:30 OK, but how can we use PASE inside a speech classifier? 0:32:39 Well, step one is the self-supervised training: we can take the architecture that we have seen before and train it; in particular, we can jointly train the encoder and the workers using standard SGD, 0:32:57 by optimising a loss which is computed as the average of each worker's cost. 0:33:05 In our experiments, by the way, we tried different alternatives, but we found that averaging the costs is the best approach we were able to find.
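The overall pre-training objective is then just the unweighted mean of the per-worker costs; a trivial sketch with invented loss values (real worker losses would be MSE for the regressors and binary cross-entropy for the binary workers):

```python
import numpy as np

# Hypothetical per-worker costs, purely for illustration.
worker_losses = {
    "waveform": 0.82,
    "log_spectrum": 0.54,
    "mfcc": 0.61,
    "prosody": 0.47,
    "lim": 0.69,
    "gim": 0.71,
    "spc": 0.66,
}

# The total pre-training loss is the plain average of the worker costs;
# its gradient flows back through every worker into the shared encoder.
total_loss = float(np.mean(list(worker_losses.values())))
```

The unweighted average keeps every self-supervised view equally influential on the shared encoder, which is the design choice described in the talk.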
0:33:18 Once we have trained our architecture, which doesn't use any label, we can go to step two, which is the supervised fine-tuning. 0:33:28 In this case, we get rid of all the workers and we plug our encoder into a supervised classifier, which is trained with the little amount of supervised data available. 0:33:41 Actually, here there are a couple of possibilities. Number one is to use PASE as a standard feature extractor: in this case, we freeze PASE during the supervised training phase. 0:33:57 Another approach is to just use it for pre-training: we initialise with the self-supervised parameters and fine-tune PASE during the supervised training phase. 0:34:08 Of these several approaches, the latter is usually the one that performs best.
0:34:14the best for four
0:34:17it is very important
0:34:19true mar
0:34:20that is
0:34:21step number one this unsupervised three
0:34:24can
0:34:25should be done only once
0:34:27in fact we have seen
0:34:29there is a dish variance phase
0:34:32are generally now that can use for large are righty
0:34:37all speech tasks like
0:34:39speech recognition speaker recognition speaker speech enhancement
0:34:43and min six
0:34:45and you even don't wanna
0:34:47three by yourself
0:34:49that's a supervised extractor you can use
0:34:52and three
0:34:54parameters that share
0:34:55but the i were proposed
0:35:00 Well, this is not all about PASE. 0:35:04 In fact, encouraged by the good results achieved with the original version, we decided to spend some time to go further, revising the architecture and improving it; 0:35:18 and we took the opportunity of the JSALT 2019 workshop, organised by the Johns Hopkins University, to set up a team working on improving PASE. 0:35:30 As a result, we came up with a new architecture, called PASE+, where we introduced different types of improvements. 0:35:41 First of all, we equipped PASE with an on-the-fly data augmentation. 0:35:47 Here we use speech contamination techniques, like adding noise and reverberation, 0:35:53 but we also add some kind of random zeros in the time-domain waveform, and we also filter the data with some kind of random band-stop filters, in order to add, this time, zeros in the frequency domain.
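The two less common distortions, random time-domain zeros and random band-stop filtering, can be sketched as follows; the FFT-masking implementation and all lengths and ranges are invented stand-ins for the actual PASE+ augmentation:

```python
import numpy as np

rng = np.random.default_rng(4)
sr = 16000
wav = rng.standard_normal(sr)  # toy 1 s waveform, not real speech

# 1) Random zeros in the time domain: blank out a random chunk.
aug_time = wav.copy()
z_len = 800                    # illustrative length (50 ms)
z_start = int(rng.integers(0, len(wav) - z_len))
aug_time[z_start:z_start + z_len] = 0.0

# 2) Random band-stop filtering: zero a random frequency band via the
# FFT (a simple stand-in for a true random band-stop filter).
spec = np.fft.rfft(aug_time)
freqs = np.fft.rfftfreq(len(aug_time), d=1 / sr)
f_lo = float(rng.uniform(500, 6000))
band = (freqs >= f_lo) & (freqs < f_lo + 500.0)
spec[band] = 0.0
aug = np.fft.irfft(spec, n=len(aug_time))

# The workers still regress targets computed from the CLEAN `wav`,
# so the encoder implicitly learns to denoise.
```

Because the chunk position and the band are re-drawn every time, each presentation of a sentence is distorted differently, as described in the talk.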
0:36:10 Why is this data augmentation very important? Because it gives to the system some kind of robustness against noise, reverberation and other environmental artifacts. 0:36:23 A nice thing is that, since everything is done on the fly, every time we contaminate the sentences with a different distortion. 0:36:32 And, also, the workers regress the clean labels, labels extracted from the clean version of the signal; so, in this way, we implicitly ask our system to perform some kind of denoising. 0:36:50 Then, we also made the encoder more robust: we still have SincNet and the convolutional layers, but we have also added a quasi-recurrent neural network, which is an efficient way to introduce some kind of recurrence and longer context, 0:37:05 and we also added some skip connections that help the gradient in the backward pass.
0:37:14 Then, we improved a lot the workers. 0:37:17 We have noticed that the more workers, the better it is; 0:37:25 and yes, we definitely introduced a lot of workers, including workers that estimate, for instance, different types of features over different context windows, et cetera. 0:37:36 Overall, we improved a lot the performance of the system on different speech tasks. 0:37:46 What does PASE learn? Here we show some kind of t-SNE plot of the learned embeddings. 0:37:55 We show that PASE learns pretty well the speaker identities: you can clearly recognise that there are pretty well-defined clusters for the speakers.
0:38:11 Here, instead, we show some other t-SNE plots for phonemes, and you can see that not everything is clustered well: 0:38:21 there is some confusion for some phonemes, sure; 0:38:29 but you can also detect some kinds of phonemes which form pretty clear clusters, meaning that we are actually learning some kind of phoneme representation, even without any phoneme label.
0:38:45 OK, we tried PASE on different speech tasks, and you can refer to the paper to see all the results. 0:38:55 But here, briefly, let me just discuss some of the numbers that we achieved on noisy ASR tasks, to highlight a little bit the robustness of the proposed approach. 0:39:10 Furthermore, let me say that we have pre-trained PASE on LibriSpeech, without using the labels; 0:39:18 and, very interestingly, we have noticed that we don't need a lot of data to train PASE: around fifty to one hundred hours of LibriSpeech are enough to achieve very competitive numbers. 0:39:36 This is quite interesting, because usually standard self-supervised approaches rely on a lot, a lot of data; 0:39:43 in our case, we think that somehow we are more data-efficient, because we employ a lot of workers trying to extract a lot of information from our speech signals.
0:39:58 On the left, you can see the results when we train on DIRHA. DIRHA is a challenging task, characterised by speech recorded in a domestic environment and corrupted by noise and reverberation. 0:40:12 You can see here that PASE+ clearly outperforms traditional features, and also combinations of traditional speech features. 0:40:23 On the right, you can see the results on CHiME-5. 0:40:28 CHiME-5 is probably the most challenging task available, where distant speech is corrupted by noise and reverberation and a lot, a lot of other disturbances, such as overlapped speech. 0:40:41 And even in this pretty challenging scenario, we are able to slightly outperform the standard systems based on traditional features.
0:40:58actually do representations of other with them
0:41:02a is
0:41:03are quite a general or boston transferable
0:41:06and we have successfully applied
0:41:09them to different tasks
0:41:10here we have seen speech recognition but you can use it
0:41:14for speaker recognition
0:41:16for speech enhancement
0:41:17for emotion recognition and i'm also aware of some works trying to
0:41:23use
0:41:24PASE for transfer learning across languages you train the PASE encoder on
0:41:32english you test on another language and it seems to
0:41:36show some kind of surprising robustness to this
0:41:39change
0:41:42you can find the code in the PASE repository
0:41:45on github and i encourage you to
0:41:47go there and play with PASE as well
0:41:53but let me conclude this part with some thoughts on self-supervised learning and the role
0:41:59that it can play
0:42:01in the future
0:42:04as i mentioned in the first part of the presentation i think the key
0:42:10to intelligent machines is the combination of different learning modalities
0:42:15we can combine supervised learning
0:42:17with unsupervised learning imitation learning reinforcement learning and so on
0:42:25so i think there is a huge space here for research in this direction where we
0:42:30basically
0:42:33combine
0:42:34in a simple and elegant way
0:42:36different
0:42:38learning modalities
0:42:40one of them
0:42:43could be
0:42:45self-supervised learning but not only
0:42:49this is
0:42:49very important these days because
0:42:53standard supervised learning is the dominant approach but we are starting to see
0:42:58some kind of limitation and this limitation might become even clearer
0:43:03in the next
0:43:04years supervised learning is demanding too much data and too much computation to
0:43:09learn
0:43:09and if we keep going in this direction
0:43:11only a few players a few companies in the world will be able
0:43:15to train state-of-the-art systems
0:43:18and i think exploring different learning modalities is crucial
0:43:23and especially self-supervised learning because as we have seen
0:43:29in this presentation
0:43:30self-supervised learning can
0:43:32be extremely useful in the transfer learning area
0:43:36with self-supervised learning we have a chance to learn a representation which is
0:43:42general and
0:43:44can be used
0:43:46for several downstream tasks
0:43:50and this is
0:43:52a really big advantage
0:43:55in terms of computational complexity and costs
0:43:59so i think
0:44:01the future paradigm
0:44:03will be i think similar to the first popular approach of deep
0:44:10learning where we were
0:44:12able to initialize a deep
0:44:15neural network
0:44:16using
0:44:18unsupervised learning approaches or self-supervised learning approaches
0:44:22and then fine-tune it on the task we need so
0:44:25i think this
0:44:26could be
0:44:29pretty much
0:44:31a future paradigm for speech where
0:44:33pre-training and transfer learning will play
0:44:36a very major
0:44:38role in the pipeline
0:44:40and yes
0:44:42this is somehow similar to what we have seen in the past the difference is that
0:44:46the first systems we were using for unsupervised or self-supervised learning were
0:44:52based on restricted boltzmann machines
0:44:54while right now as we have seen we are using
0:44:56much more sophisticated techniques
0:44:58but the idea is the same and it
0:45:01could be
0:45:03quite dominant in speech processing and more in general
0:45:07in machine learning in the near future
0:45:11if you're interested in this topic and you would like to read
0:45:15more on self-supervised
0:45:17learning in speech you can take a look
0:45:19into the ICML workshop on
0:45:22self-supervised learning in speech that we have
0:45:26recently
0:45:27organized
0:45:28you can go to the website see all the presentations and read all the papers
0:45:34which i think is a
0:45:35kind of interesting initiative
0:45:37and i would also highlight that
0:45:39there will be
0:45:41a similar initiative
0:45:42at interspeech i think so i would encourage
0:45:48you also to participate
0:45:49in that
0:45:53alright since i have a few more minutes
0:45:56i'm very happy to update you
0:45:59on another very exciting project i'm leading these days which is called
0:46:04speechbrain
0:46:06speechbrain will be an open-source all-in-one toolkit
0:46:10entirely based on pytorch
0:46:12whose main goal
0:46:13is to significantly speed up
0:46:16research and development of speech and audio processing techniques
0:46:22so we are building a
0:46:23toolkit which will be efficient flexible
0:46:27modular and very important easy to use
0:46:34the main difference with the other existing toolkits is that speechbrain is specifically designed to
0:46:41address
0:46:42multiple speech tasks
0:46:44at the same time
0:46:46with speechbrain you can do speech recognition speech enhancement emotion recognition multi-microphone
0:46:55signal processing speaker diarization
0:46:58and many other things
0:47:00so
0:47:00typically all these tasks share the underlying technology which is deep learning
0:47:08and there is no
0:47:10real reason why we need different repositories for
0:47:16different kinds of speech applications so what we want
0:47:20is something like our brain
0:47:22we have a single brain that is able
0:47:24to process several speech applications at the same time
0:47:32the main issue with the other toolkits
0:47:34at least most of them is that they
0:47:37are designed for a single task
0:47:40for instance you can use kaldi for i don't know speech recognition
0:47:44kaldi
0:47:47is
0:47:49built with the idea of creating a toolkit that can be extremely useful for
0:47:54doing speech recognition
0:47:56but it is less let's say
0:47:59good for
0:48:00speaker recognition
0:48:02i think
0:48:04a toolkit that is explicitly designed to
0:48:07address
0:48:08different tasks still does not exist
0:48:12and when people have to implement complex pipelines involving
0:48:18different technologies like speech enhancement plus
0:48:22speech recognition
0:48:23or
0:48:24speech recognition plus speaker recognition
0:48:26they have to jump between toolkits
0:48:32and of course jumping from one toolkit to another is very demanding because
0:48:38there are different programming languages different conventions different errors et cetera et cetera
0:48:44and
0:48:45another issue is that
0:48:48if we have different toolkits it is very hard to combine the systems together uniformly
0:48:54in a single system and train it fully end-to-end which is
0:48:57a very important feature that we desire
0:49:00so we are actually working on that and we are trying to make speechbrain
0:49:07modular in a way that will allow users to
0:49:12actually connect the various
0:49:14speech components
0:49:16in an easy way
0:49:19what about the timeline actually we have worked a lot this year on that
0:49:23we have
0:49:25a lot of people working on that and a lot of interest
0:49:29and we are very close to a first release that
0:49:33will happen we estimate within a couple of months so i strongly encourage you to
0:49:40stay tuned and then
0:49:43try
0:49:45speechbrain
0:49:46in the future and give us your feedback
0:49:51speaking
0:49:52briefly about the project and how it is supported
0:49:57we have now
0:50:00around twenty developers as well as several collaborators and sponsors supporting us
0:50:08and so the project is getting bigger and we hope to have also the
0:50:13support
0:50:15of all the speech community
0:50:18to conclude the talk
0:50:21i want to say a big thanks to my
0:50:23collaborators
0:50:24the people here are
0:50:28fantastic and are working hard on this
0:50:32they are doing a lot of the work behind what is happening
0:50:36and here you can see
0:50:39the team that is currently working on speechbrain and i'm really happy about that because
0:50:45i think together we are working very well and
0:50:52well soon you'll see the result of our hard work
0:50:57thank you very much
0:50:58for your attention
0:51:00and i'm very happy now to reply to your questions
0:51:15many thanks mirco for a very nice presentation
0:51:20we already have a
0:51:21set of questions for you
0:51:25so i will ask them trying to combine the ones that are similar the first
0:51:31question is this
0:51:38and the question i will read to you is
0:51:43whether it is true that this approach is less computationally demanding than the standard known
0:51:50ones
0:51:52actually nothing comes for free but
0:51:55i think i can take this opportunity to clarify this a little bit because
0:52:01there are a couple of things to consider
0:52:05first of all with PASE
0:52:08we're trying to learn not a task specific representation but a general representation
0:52:15and this means that you can train your self-supervised network just
0:52:21once right and then you can use just a little amount of supervised data to
0:52:27train the system
0:52:29so this naturally leads to computational advantages because you have to train
0:52:36the big thing only once
0:52:38and then you don't have to retrain it
0:52:42when you have
0:52:44a new
0:52:45task which is a difference compared to standard supervised learning and usually
0:52:51if you have a good representation the supervised learning part is going to be much
0:52:57easier
0:52:59and the other i think good thing about PASE
0:53:03that i didn't remark too much in the presentation but it is better to remark
0:53:07here a little bit
0:53:10is that PASE is pretty data efficient right
0:53:13we found very good results even just using something like fifty hours of speech so
0:53:19very little compared to
0:53:21what we see these days
0:53:24even in self-supervised learning where people are using thousands and thousands of hours
0:53:27of speech
0:53:29and we are data efficient because with the multiple workers
0:53:35somehow we try to extract as much as possible of the information from the signal
0:53:41we are trying to do our best to be also data efficient extracting everything we
0:53:47can from the signal
0:53:50so there are two things here
0:53:53the data efficiency
0:53:55and the fact that we are learning a general representation right so you
0:53:59can train PASE only one time and use it for multiple tasks and there is also
0:54:04the late fusion part that allows you to
0:54:08learn a reasonable representation
0:54:10even with
0:54:12a relatively small amount of unlabeled data
0:54:17okay and do you have other
0:54:22comments on this part
0:54:33so
0:54:34okay
0:54:36the audio is very bad because i could not really hear the question but anyway i
0:54:41tried my best
0:54:43we have quite a few questions there is a question asking if you could
0:54:48comment on the robustness of this self-supervised learning in non-ideal conditions
0:54:56actually we increased a lot the robustness of PASE when we revised
0:55:03it with PASE+
0:55:06and as i mentioned before in PASE+ we combine basically self-supervised learning with
0:55:16on-the-fly data augmentation
0:55:19what does this mean it means that every time we have a new sentence we
0:55:23contaminate it with a different sequence of noise and a different reverberation such that the system
0:55:30every time it looks at the same
0:55:33sentence
0:55:34sees a different contamination and in the output
0:55:39our workers are not extracting the labels from the noisy signal but from the
0:55:45original clean one
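The on-the-fly contamination just described can be sketched in a few lines. This is a toy illustration, not the PASE+ code: `contaminate` and `training_pair` are made-up helpers, white noise stands in for the recorded noises and room impulse responses used in practice:

```python
import random

# Every time a sentence is drawn it is corrupted with a *different* random
# noise, while the workers' targets are still computed from the clean signal.

def contaminate(clean, snr_db):
    """Add white noise at an approximate signal-to-noise ratio (in dB)."""
    power = sum(x * x for x in clean) / len(clean)
    noise_power = power / (10 ** (snr_db / 10))
    scale = noise_power ** 0.5
    return [x + random.gauss(0.0, scale) for x in clean]

def training_pair(clean, rng=random):
    """Noisy input for the encoder, clean signal for the workers' targets."""
    snr_db = rng.uniform(0, 15)          # a fresh contamination every call
    return contaminate(clean, snr_db), clean

clean = [0.5, -0.25, 0.1, 0.8]
noisy1, target1 = training_pair(clean)
noisy2, target2 = training_pair(clean)
# the two noisy views differ, the targets stay clean
```

Because the input is noisy while the targets are clean, minimizing the worker losses implicitly forces the encoder to denoise, which is the robustness mechanism described above.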
0:55:47so somehow our system is forced
0:55:52to
0:55:53denoise the features
0:55:55and this leads to the robustness we have seen before we actually tried
0:56:01it on challenging tasks like DIRHA and CHiME data and it was really
0:56:10clear that this increases robustness over standard approaches
0:56:18good thank you mirco
0:56:22okay
0:56:24let's move to the next questions
0:56:26there is also a question about the competition between the workers in PASE
0:56:34and whether they could conflict with each other for instance
0:56:40LIM and GIM could consider some segments
0:56:43from the same utterance
0:56:45once as a positive example and once as a negative
0:56:50so the person
0:56:51asks what do you expect PASE to be able to learn in this case
0:56:57actually the set of workers that we tried is not random right we took the
0:57:03opportunity of this work for instance to do a lot of experiments
0:57:08and we just came out with a set of workers
0:57:14a subset of ideas
0:57:16that actually works for us
0:57:19so actually one of our concerns was okay how is it possible to put together
0:57:27regression tasks which are based on the mean squared error loss for instance with
0:57:33binary tasks which are based on other kinds of losses like binary cross entropy
0:57:38how can we learn these things together and we thought that this could be
0:57:45a big issue but we realised that actually it is not just by doing experiments doing
0:57:50some kind of ablation of the workers we noticed that if we put
0:57:54together more workers the better it is
0:57:58and the same holds for LIM and GIM
0:58:02which are different actually because LIM is based on small
0:58:09chunks of speech
0:58:11and we think that it will learn more local information
0:58:16while GIM plays the same game but with larger
0:58:21chunks of one second one second and a half
0:58:26and we think it will learn hopefully
0:58:29higher level representations so we found that
0:58:35they work well together and at the same time they are helpful even though
0:58:39they are clearly correlated tasks right
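The difference between LIM-style and GIM-style sampling described above can be sketched like this; the chunk lengths and the sampler are illustrative assumptions, not the exact values used in PASE:

```python
import random

# LIM compares short chunks from the same utterance, GIM plays the same game
# with longer chunks (around one second or more), aiming at higher-level cues.

def sample_pair(signal, chunk_len, rng=random):
    """Two random chunks from the same signal form a positive pair."""
    max_start = len(signal) - chunk_len
    a = rng.randrange(max_start + 1)
    b = rng.randrange(max_start + 1)
    return signal[a:a + chunk_len], signal[b:b + chunk_len]

SR = 16000                      # samples per second
signal = [0.0] * (3 * SR)       # a fake 3-second waveform

lim_a, lim_b = sample_pair(signal, chunk_len=SR // 10)   # ~100 ms chunks
gim_a, gim_b = sample_pair(signal, chunk_len=SR)         # ~1 s chunks
```

Negative pairs would be drawn from a different utterance in the same way, and each worker then classifies same-utterance versus different-utterance pairs.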
0:58:46the next question is
0:58:50about the intuitions behind PASE
0:58:58the person is really enthusiastic about PASE but PASE is not explicitly handling within-speaker
0:59:04variability
0:59:06so none of the tasks is forcing embeddings from different sessions of the same
0:59:10speaker to be close do you see a way it could work
0:59:13for this problem and do you see a problem in adding some supervised tasks for instance ones
0:59:19where you have speaker labels
0:59:24well first of all including supervised tasks totally makes sense honestly one can play
0:59:32with
0:59:34semi-supervised learning of course mixing self-supervised in this case with supervised and i
0:59:39think people already did that i saw some recent papers that are
0:59:45actually trying to do that
0:59:48in this paper for PASE we preferred to stay
0:59:52on the self-supervised side only to make sure to actually check
0:59:58what the output is when it is a pure
1:00:02self-supervised learning approach
1:00:06so as for PASE for speaker recognition and the within-speaker variability yes PASE is
1:00:14not specifically designed for that
1:00:18so it is not the optimal choice but we anyway learn some kind of
1:00:25speaker identity
1:00:29actually
1:00:34we didn't test too much whether what we learn
1:00:39can be quite competitive with standard systems actually maybe we have
1:00:45to revise a little bit the architecture for these speaker recognition applications because these days
1:00:52we see
1:00:54numbers which are impressive in terms of equal error rate on voxceleb so
1:00:58but
1:00:59the same idea i mean could be i think extended and
1:01:05redesigned to specifically learn better speaker embeddings actually but our main target
1:01:15was more general so we wanted
1:01:18to learn a pretty general representation and see if it somehow works
1:01:25reasonably well for multiple tasks
1:01:29thank you this fits nicely with the next question
1:01:36which asks
1:01:38if you can comment on the possibility that your system learns
1:01:43to encode speaker session and channel information more than you would really want
1:01:50as you are using positive examples coming from within a single utterance
1:01:58actually what we do is to do the contamination on the fly as i mentioned
1:02:05before right
1:02:06so if we have sentence one
1:02:10one time sentence one is
1:02:13contaminated with some kind of channel or some kind of reverberation effect and the next time
1:02:18it is contaminated with another one so maybe with this approach we try to limit a
1:02:25little bit this effect but
1:02:29there might be there might be this issue it's true
1:02:34so do you mean that the contamination you use would tackle this
1:02:39problem of channel variation by itself
1:02:43so maybe not tackling the full problem but at least
1:02:50minimizing or
1:02:51reducing it right
1:02:54yes i think so but on the other hand we don't have many other options because
1:02:57we would like to stay in the
1:03:00self-supervised domain right so we don't have speaker labels so we cannot say okay
1:03:05let's jump to another signal from the same speaker because in that case
1:03:10we would
1:03:11be using
1:03:13the labels so
1:03:14the best we can do is to contaminate the sentence
1:03:17i mean
1:03:18change a little bit some acoustic conditions like the reverberation and noise effects and
1:03:22hope
1:03:24to learn something less dependent on the channel
1:03:29fine let's move to a question from another attendee
1:03:34the PASE model can be used from two perspectives embedding extraction and neural
1:03:41network pre-training
1:03:44both of these should be effective
1:03:47but which one may be better for speaker verification
1:03:53let me read it again or
1:03:57you can take a look at it again
1:04:04okay
1:04:06i think PASE could be used both ways you're right it can be used for
1:04:13feature extraction or embedding extraction or basically for pre-training
1:04:21my experience is that
1:04:25it works very well in a pre-training scenario so it is designed basically
1:04:33to pre-train your neural network in a self-supervised way and then
1:04:38fine-tune it with the small supervised dataset
1:04:44this is
1:04:46basically the main application we have in mind for PASE but
1:04:51we also tried it as a standard feature extractor
1:04:55or embedding extractor
1:04:57not for speaker recognition but for speech recognition
1:05:02and it works quite well so if you freeze the encoder right and you
1:05:06just use the features that you have there and train a supervised model on top it works well but
1:05:11it works better if you jointly fine-tune the encoder and the classifier during the
1:05:18supervised phase
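The two usage modes just contrasted, frozen feature extraction versus joint fine-tuning, amount to choosing which parameter groups the supervised optimizer updates. A schematic, framework-agnostic sketch (the names are made up; in PyTorch one would toggle `requires_grad` on the encoder instead):

```python
# Toy illustration of frozen vs jointly fine-tuned pre-trained encoders.

def trainable_parameters(encoder_params, classifier_params, finetune_encoder):
    """Return the parameter groups updated during the supervised phase."""
    groups = [classifier_params]          # the classifier is always trained
    if finetune_encoder:                  # joint fine-tuning worked better here
        groups.append(encoder_params)
    return groups

enc = {"enc.w": 0.1}   # pretend pre-trained encoder weights
clf = {"clf.w": 0.2}   # pretend downstream classifier weights

frozen = trainable_parameters(enc, clf, finetune_encoder=False)
joint  = trainable_parameters(enc, clf, finetune_encoder=True)
```

The frozen mode treats the encoder as a fixed feature extractor; the joint mode lets the supervised loss adapt the pre-trained weights, which is the configuration the speaker reports as strongest.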
1:05:21thank you and we'll come back to the chat now with a question about
1:05:25the
1:05:27temporal
1:05:29prediction workers can you elaborate more on the workers focused
1:05:36on predicting future samples
1:05:38the point is that
1:05:40maybe in some cases the segment from the future would reasonably contain the
1:05:47same thing
1:05:49so
1:05:51were there some problems with these workers do you have some comments
1:05:55definitely that's a very nice question actually the future prediction worker
1:06:02is one of those with an important impact on the performance
1:06:07as i mentioned with a lot of ablation studies we tried to figure
1:06:12out the effect of each task and see which was working well and which
1:06:19less some workers were more important like the regressors
1:06:24and LIM and GIM
1:06:27and
1:06:29actually there is an important risk here when you sample
1:06:33from the past and sample from the future you have to make sure
1:06:38you are not sampling within the receptive field of your convolutional neural networks
1:06:43otherwise the task becomes
1:06:45too easy
1:06:47so what we have done is to make sure that the future sample
1:06:52is not too close right to the anchor one and not
1:06:58too far because if it is too close the risk is to learn nothing basically
1:07:03and if it is too far
1:07:06the risk is that there isn't any reasonable correlation between the two
1:07:12so it's not easy to design this task
1:07:17and
1:07:17we did it
1:07:18in such a way that
1:07:20we were able to sample the past and the future representations within some reasonable range
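The sampling constraint described here, not too close (outside the CNN receptive field) and not too far (still correlated with the anchor), can be sketched as follows; the bounds are illustrative assumptions, not the values used in PASE:

```python
import random

# Pick a "future" frame index for the prediction worker: strictly beyond the
# receptive field of the encoder (otherwise the task is trivially easy) but
# within a maximum lag (otherwise anchor and target become uncorrelated).

def sample_future_index(anchor, receptive_field, max_lag, total_len, rng=random):
    lo = anchor + receptive_field + 1     # just beyond the receptive field
    hi = min(anchor + max_lag, total_len - 1)
    if lo > hi:
        raise ValueError("no valid future sample in range")
    return rng.randrange(lo, hi + 1)

idx = sample_future_index(anchor=100, receptive_field=30, max_lag=200, total_len=1000)
```

Sampling the "past" sample follows the same logic mirrored backwards in time.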
1:07:27it could be interesting to try what you highlighted i believe
1:07:33but let's move to another question a person asks
1:07:40whether the learned filters are similar to those
1:07:43known from the literature for speakers for extracting speaker-specific information
1:07:50well this paper the PASE paper actually is not only about
1:07:57speaker recognition and the filters that we learn are actually not that
1:08:04far away from the standard
1:08:07mel filters we basically try to allocate
1:08:10more filters in the lower part of the spectrum and fewer filters in
1:08:16the higher part of the spectrum
1:08:18while with
1:08:20SincNet the technique that was designed basically to work only for speaker recognition the
1:08:27filters we learn show some harmonics right with more filters in the areas
1:08:34that are more informative for the speaker like the pitch and the formants
1:08:37so similar to what we have seen
1:08:42using SincNet
1:08:44with a supervised approach
1:08:48but with PASE we are not allocating
1:08:52more filters in the speech regions we are more or less the same as the
1:08:57standard mel filter scale
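For reference, the standard mel scale mentioned here allocates filters densely at low frequencies and sparsely at high ones. A small sketch computing mel-spaced center frequencies with the common 2595*log10(1 + f/700) formula (the filter count and band edges are arbitrary illustration values):

```python
import math

# Mel-spaced filter center frequencies: equal steps on the mel axis translate
# into growing gaps on the Hz axis, i.e. denser filters at low frequencies.

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_centers(n_filters, f_min=0.0, f_max=8000.0):
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_filters + 1)
    return [mel_to_hz(m_min + step * (i + 1)) for i in range(n_filters)]

centers = mel_centers(20)
gaps = [b - a for a, b in zip(centers, centers[1:])]  # strictly increasing
```

PASE's learned filters, per the answer above, end up close to this allocation, whereas supervised SincNet concentrates extra filters around pitch and formant regions.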
1:09:04we are
1:09:05approaching the conclusion i don't have more open questions i just have one more if
1:09:11possible
1:09:12i would like you
1:09:15to explain a bit more why you insist on unsupervised training
1:09:21instead of supervised training
1:09:25it's my feeling that an issue such as
1:09:29bias
1:09:30is more easy to find if you have some
1:09:34control during a supervised training because you have some information on the data meta information indeed
1:09:41while with the unsupervised training it seems to me that
1:09:46you have less information but you have no reason to have less bias in
1:09:51the
1:09:52data
1:09:53could you comment on that
1:09:55okay
1:09:58the reason is that
1:10:00if you train your representation with supervised data your representation could be biased towards
1:10:07the task right specifically for instance if you train
1:10:11a speech representation with speaker recognition right your representation is not good for speech
1:10:19recognition because it has a bias towards speaker recognition
1:10:24with self-supervised learning at least in the way we are trying to do
1:10:28it with the multitask approach et cetera this risk is reduced because you have
1:10:32the same representation that is good for both speech recognition
1:10:36and speaker recognition
1:10:41thank you mirco
1:10:43i really want to thank you again we are over
1:10:50the official time but before closing the session i will give the microphone to the
1:10:57other organizers if anyone
1:11:00wants to add something i thank you again
1:11:06thank you for the very nice presentation
1:11:09yes i really appreciated the very wide overview of the topic and the
1:11:15discussion we had in this session
1:11:17so as a last question [inaudible]
1:11:45and the second part [inaudible]
1:11:51yes
1:11:52i can give you my best guess [inaudible] but anyway thanks for inviting me that was
1:12:09really great thank you
1:12:11okay let's thank mirco together again
1:12:13with a round of applause at a distance
1:12:21and see you all tomorrow at the same time
1:12:31definitely
1:12:33so
1:12:36goodbye everyone