Speech Transcript - Towards Unsupervised Learning of Speech Representations

0:00:01	two
0:00:02	yes
0:00:04	so interviews the allow testing well
0:00:11	thank you on the first i want to thank you again for of anything this
0:00:16	water in it is very
0:00:20	and if you could the a moment and if you could your but you did
0:00:25	very well and i'm sure we will we will take advantage of you organisation an
0:00:29	optional
0:00:31	secondly equal i'm really be too
0:00:34	have you location to introduce you musical of any will be whole first the speaker
0:00:41	i will be sure because i'm quite sure but quite all of japanese
0:00:46	no you would really so i will not even introduction they know you even if
0:00:53	you are still you with a true
0:00:56	as
0:00:58	you go according to me at least we will see you was a really a
0:01:02	few buttons
0:01:03	no
0:01:04	so we go you us your master in the second and hearing that went to
0:01:10	university the frozen even
0:01:13	so about twenty ten years ago
0:01:16	and the you went to be a is to use a student rental
0:01:24	with the wheel invisible trying to one be from the sunni a remote case the
0:01:30	your phd thesis was on deep learning for distant speech recognition in two thousand seventeen
0:01:39	and then you joined mila
0:01:42	maybe not the useful to introduce them you know that a
0:01:46	i'm also true but they will all of us so awful you know very well
0:01:53	and you started to work as a postdoc working closely with yoshua
0:01:59	bengio
0:02:01	you work on several topics mainly on representation learning for speech but not
0:02:07	only
0:02:08	and you are also one of the cofounders of the speechbrain
0:02:14	initiative for building an
0:02:17	open toolkit for speech and speaker recognition
0:02:23	so
0:02:24	even us to
0:02:27	we use it to form you use you already have a long list of speakers
0:02:32	in the topics and i know you we need to as a very nice a
0:02:36	or a now and i don't even for you but before two
0:02:44	do they do you say it will be wall of introduction if you want before
0:02:48	we go to the video i will explain how the session will work
0:02:54	we will play a pre-recorded video by mirco
0:03:00	during the video if you want to ask some questions you can use the question and answer box
0:03:07	please or a few
0:03:10	think about question and i haven't integration is possible now see we give you what
0:03:16	you do need good complete variances
0:03:19	and then we will have fifteen minutes of
0:03:23	live
0:03:23	question and answers with mirco during this session of fifteen minutes
0:03:29	you could use both the question and answer box
0:03:33	well
0:03:34	be a raise your hand the so we raise your hand with the you know
0:03:38	that to a i-th question in i
0:03:42	during position
0:03:44	so mirco do you want to say some words before we go to the video
0:03:47	just thank you very much for the introduction hello
0:03:52	i hope the volume within the video will be fine but
0:03:56	in the worst case you guys probably have to increase it a little bit
0:04:00	but
0:04:01	let's see how it goes
0:04:04	it can be cool i think we give a really do know
0:04:49	sorry we have a small technical problem we don't
0:04:53	have you do you the
0:04:57	before it was working so it's better to does which the previews
0:05:03	present
0:05:03	annotation we're
0:05:05	can't hear nothing alright
0:05:09	yes a
0:05:10	can and have a little stuff
0:05:17	okay training
0:05:41	hi everyone i'm mirco ravanelli
0:05:43	and i'm very happy
0:05:45	to give this keynote here today
0:05:47	at odyssey
0:05:49	so let me first thank the organizers
0:05:53	for inviting me to share it
0:05:57	with the
0:05:57	speech community
0:06:01	my keynote is entitled towards unsupervised learning
0:06:04	of speech representations
0:06:07	well self-supervised learning is gaining a lot of popularity in the machine learning
0:06:12	field
0:06:13	and of course it is getting trendy
0:06:16	within the speech community as well
0:06:18	so today i would like to share the experience
0:06:24	that i gained after working for
0:06:26	two or three years
0:06:28	on this topic
0:06:32	okay but before diving into self-supervised learning let me remark some of
0:06:39	the limitations
0:06:41	of supervised learning
0:06:43	which is the dominant paradigm these days
0:06:48	well you can see deep learning
0:06:51	as a way to learn hierarchical representations where we start from low-level concepts
0:06:58	we combine them
0:06:59	and we create
0:07:00	higher-level concepts
0:07:03	so deep learning
0:07:05	is a very general paradigm that
0:07:08	is implemented through deep neural networks
0:07:13	that are often
0:07:14	trained
0:07:15	in a supervised way
0:07:17	using large annotated corpora
0:07:22	even though this is the dominant approach
0:07:26	and it has achieved great
0:07:27	success
0:07:28	in many practical applications
0:07:33	it is clear today
0:07:34	that this paradigm
0:07:36	has some limitations
0:07:39	what are
0:07:40	these issues
0:07:42	for example
0:07:44	we
0:07:45	need data and not
0:07:48	general data
0:07:49	but annotated data and of course this can be an issue because annotation is expensive time-consuming and
0:07:56	requires human annotators
0:08:01	moreover supervised learning is data and
0:08:04	also computationally demanding
0:08:07	why
0:08:09	because these days to reach state-of-the-art performance
0:08:14	in machine learning
0:08:15	we need a lot of data
0:08:17	and a lot of data requires a lot of computations
0:08:21	limiting in fact the access
0:08:24	to
0:08:25	supervised learning
0:08:27	technology to a
0:08:32	restricted
0:08:32	subset of
0:08:33	users
0:08:36	moreover
0:08:38	if we
0:08:39	train a system in a
0:08:40	supervised way the representations that we learn
0:08:44	might be biased
0:08:46	towards a specific application
0:08:49	for instance if we train a system for speaker identification
0:08:54	the representations that we learn would not be good for
0:08:58	speech recognition
0:09:00	so we might want to learn some kind of general representation that allows
0:09:07	transfer learning
0:09:08	much easier and better
0:09:11	density
0:09:14	the third limitation is actually more an observation
0:09:17	and is that our brain
0:09:20	does not use
0:09:21	only supervised learning
0:09:24	it combines different modalities
0:09:27	and i'm
0:09:28	pretty sure
0:09:29	that
0:09:30	combining
0:09:32	different learning modalities is crucial
0:09:35	to reach higher levels
0:09:37	of artificial intelligence
0:09:39	we can combine
0:09:41	supervised learning
0:09:42	with
0:09:45	contrastive learning
0:09:46	with imitation learning
0:09:49	with reinforcement learning and of course
0:09:52	with self-supervised learning
0:09:55	so what is self-supervised learning
0:09:59	self-supervised learning is a type of unsupervised learning
0:10:04	where we have a supervision
0:10:07	but the supervision
0:10:08	is extracted
0:10:09	from the signal itself
0:10:13	in self-supervised learning we
0:10:16	don't have
0:10:16	humans that have to create labels we don't have humans
0:10:21	but the labels
0:10:23	are created basically
0:10:25	for free
0:10:25	we can create
0:10:27	tons of them without effort
0:10:31	normally in self-supervised learning
0:10:34	we apply some kind of
0:10:36	known transformations to the input signal
0:10:39	and use the resulting outcomes
0:10:42	as labels as targets
0:10:46	well let me clarify this with some examples derived from the computer vision community which
0:10:52	was the first one
0:10:54	to investigate
0:10:56	this
0:10:56	approach
0:10:58	in the
0:10:59	computer vision community actually they
0:11:03	noticed quite early at least earlier than the others that by solving some
0:11:09	kind of simple task we are able
0:11:11	to train a neural network that learns some kind of meaningful
0:11:14	representation
0:11:17	for instance you can ask your neural network to solve some kind of relative positioning task
0:11:22	where you have small patches of an image
0:11:25	and you have to decide the relative position
0:11:27	between them
0:11:29	you can ask your neural network
0:11:31	to find the right colours
0:11:32	of an image
0:11:33	or to find the correct
0:11:35	rotation of an image
0:11:39	all of these tasks are relatively
0:11:41	easy but if we design a system a neural network that learns to solve these
0:11:48	tasks
0:11:49	we inherently require the system to have some kind of semantic knowledge of the
0:11:55	world or at least semantic knowledge of the image
0:11:58	that can be really very helpful to learn
0:12:04	representations hopefully high level
0:12:06	robust representations
0:12:10	and yes
0:12:11	self-supervised learning is extremely
0:12:15	interesting and is gaining a lot of popularity
0:12:19	let me show the famous
0:12:21	cake analogy by
0:12:22	yann lecun
0:12:24	saying that you know if intelligence is a cake
0:12:29	supervised learning is the icing on the cake reinforcement learning is the cherry on the cake
0:12:34	and unsupervised
0:12:36	or self-supervised learning is basically the cake itself
0:12:40	meaning that
0:12:42	we believe this modality is
0:12:44	definitely
0:12:47	a key ingredient
0:12:48	to
0:12:49	develop intelligent systems
0:12:54	okay but what about the audio and speech field
0:12:59	as i mentioned before
0:13:02	there is a growing number of research works focused in the direction
0:13:07	of self-supervised learning in audio and speech
0:13:12	and we have seen many of them even
0:13:15	at interspeech
0:13:17	but here let me just highlight a few of them
0:13:22	in my opinion the first work that clearly showed the potential of self-supervised learning in
0:13:29	audio and speech
0:13:30	is the contrastive predictive coding work by aaron van den oord back in
0:13:35	two thousand eighteen
0:13:38	this work is mostly about
0:13:40	predicting
0:13:41	the future
0:13:42	given the past
0:13:45	more recently we have seen
0:13:47	another
0:13:48	very good work by facebook which is called wav2vec two point zero where they
0:13:54	were able to show impressive results with an approach
0:13:57	which employs some kind of masking technique coupled
0:14:02	with quantization
0:14:05	i also contributed
0:14:07	to this field with the problem agnostic speech encoder in which as we will see
0:14:12	later we explore
0:14:15	multi-task self-supervised learning
0:14:18	however
0:14:20	self-supervised learning on speech
0:14:22	is really challenging
0:14:26	why
0:14:27	first of all because speech is characterised by high dimensional data
0:14:32	we have typically long sequences
0:14:35	of samples that can be of variable length
0:14:40	and last
0:14:41	but not least
0:14:43	speech inherently entails
0:14:47	a complex hierarchical structure that might be very difficult to infer
0:14:53	without being guided
0:14:55	by a strong
0:14:57	supervision
0:14:58	speech in fact
0:15:00	is characterised by samples we can combine
0:15:03	the raw samples to get
0:15:05	phonemes
0:15:07	and from phonemes you can create higher levels like syllables words and
0:15:11	finally
0:15:13	we have the meaning
0:15:14	of the sentence
0:15:16	and inferring
0:15:17	all this kind of structure
0:15:20	might be
0:15:21	extremely difficult
0:15:25	on my side i started working on self-supervised learning when i started my
0:15:31	postdoc
0:15:31	at mila almost
0:15:33	three years ago
0:15:35	at that time
0:15:37	people at mila were doing research on self-supervised learning
0:15:42	approaches based on mutual information
0:15:46	and i got so excited that
0:15:48	i decided to study self-supervised learning
0:15:51	approaches with mutual information
0:15:53	for learning
0:15:55	speech representations
0:15:56	and that led to the development
0:15:58	of a technique called
0:15:59	local info max that i will describe in the next slides
0:16:05	after that we further extended
0:16:08	this technique using a multi-task self-supervised learning approach
0:16:13	and that led to the development
0:16:15	of the problem agnostic speech encoder pase
0:16:18	that we presented
0:16:19	at interspeech two thousand nineteen
0:16:22	and also we extended
0:16:24	pase with another technique
0:16:26	an improved system called pase plus
0:16:30	and we recently presented this work
0:16:32	at icassp
0:16:37	okay let's start from the mutual information based approach
0:16:42	what is mutual information
0:16:44	the mutual information is defined as the kl divergence
0:16:48	between
0:16:49	the joint distribution of two random variables
0:16:53	and the product of their marginals
0:16:58	why
0:16:59	is this important
0:17:01	because with mutual information we can capture complex nonlinear relationships
0:17:06	between
0:17:07	random variables
0:17:10	if the
0:17:12	two random variables are independent the mutual information is zero
0:17:17	while if there is some kind of dependency between these variables then the
0:17:22	mutual information is greater than zero
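The definition just given can be checked numerically on a small discrete example. The sketch below is an illustration added to the transcript, not something shown in the talk: the function name `mutual_information` and the two toy probability tables are assumptions, and the value is reported in nats.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (in nats) of a discrete joint probability table."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()              # normalise, in case counts were passed
    px = joint.sum(axis=1, keepdims=True)    # marginal of the row variable
    py = joint.sum(axis=0, keepdims=True)    # marginal of the column variable
    product = px * py                        # product of the marginals
    mask = joint > 0                         # 0 * log(0) is treated as 0
    # KL divergence between the joint and the product of the marginals
    return float(np.sum(joint[mask] * np.log(joint[mask] / product[mask])))

# Hypothetical toy tables
independent = np.outer([0.5, 0.5], [0.3, 0.7])   # joint equals product of marginals
dependent = np.array([[0.5, 0.0], [0.0, 0.5]])   # the two variables always agree

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
```

For the independent table the result is zero up to rounding, while for the fully dependent one it equals log 2, which matches the two cases described by the speaker.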
0:17:26	this is very attractive
0:17:29	the issue is that mutual information is difficult to compute in high dimensional spaces
0:17:36	and this limited
0:17:38	a lot
0:17:40	its applicability
0:17:41	in
0:17:42	practical machine learning problems
0:17:47	however a recent work called mine mutual information
0:17:52	neural estimator
0:17:54	found that it is possible
0:17:56	to maximize or minimize mutual information
0:17:59	within a framework that closely resembles
0:18:03	that of gans
0:18:05	how does it work
0:18:07	let's imagine that we can sample somehow
0:18:11	some samples from the joint distribution
0:18:13	we call them
0:18:14	positive samples
0:18:16	we will explain later
0:18:18	how we can do that
0:18:20	let's also assume we can
0:18:22	sample
0:18:24	some examples from the product of the marginal distributions and we call
0:18:28	these negative samples
0:18:32	then we can feed
0:18:33	these positive and negative samples
0:18:36	into a special neural network whose cost function
0:18:40	is the donsker-varadhan
0:18:42	bound of the mutual information
0:18:44	the donsker-varadhan bound is a lower bound of the mutual information
0:18:49	and if we train
0:18:50	this discriminator to maximize
0:18:54	this lower bound
0:18:55	we finally converge to an estimate of the mutual information
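For reference, the lower bound described in this passage is usually written as follows. The notation is an assumption added here and does not come from the slides: $T_\theta$ stands for the network that scores a pair of samples, $P_{XY}$ for the joint distribution (positive samples) and $P_X \otimes P_Y$ for the product of the marginals (negative samples).

```latex
I(X;Y) \;=\; D_{\mathrm{KL}}\!\left(P_{XY} \,\|\, P_X \otimes P_Y\right)
\;\ge\; \sup_{\theta}\;
\mathbb{E}_{P_{XY}}\!\left[T_\theta(x,y)\right]
\;-\; \log \mathbb{E}_{P_X \otimes P_Y}\!\left[e^{T_\theta(x,y)}\right]
```

Maximizing the right-hand side over the parameters $\theta$ tightens the bound, which is why training the network on positive and negative samples yields an estimate of the mutual information.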
0:19:01	and inspired by this approach i started
0:19:04	thinking about
0:19:06	mutual information based approaches specifically
0:19:09	for speech
0:19:12	i had an idea and introduced a technique called local info max that works
0:19:18	in this way
0:19:20	so
0:19:22	first of all we employ a sampling strategy
0:19:25	that derives
0:19:26	positive and negative samples
0:19:28	in this way
0:19:29	we start
0:19:30	by choosing a random chunk
0:19:33	from a random sentence and we call it
0:19:35	c one
0:19:37	then
0:19:37	we choose another random chunk from the same sentence
0:19:41	and we call it
0:19:42	c two
0:19:45	and finally we choose another random chunk from another sentence
0:19:49	that is c random
0:19:53	with these
0:19:54	samples with these chunks we can
0:19:57	do some kind of interesting things
0:20:00	for instance we can process
0:20:03	c one c two and c random with an encoder
0:20:08	which provides
0:20:09	hopefully higher level information
0:20:14	then
0:20:15	we can create positive and negative samples
0:20:19	if we
0:20:21	concatenate
0:20:22	z one and z two we create
0:20:25	a sample from the joint distribution
0:20:28	a positive sample
0:20:30	which is a positive sample because we expect some kind of relation between
0:20:36	these random variables because they are extracted
0:20:39	from the same
0:20:40	signal
0:20:43	then we can also create
0:20:44	a negative sample by concatenating z one and z random
0:20:49	and this can be seen
0:20:51	as a sample from the product of marginal distributions
0:20:56	after that
0:20:57	we employ a discriminator which is
0:21:01	fed with positive
0:21:02	or negative samples
0:21:04	and the discriminator
0:21:06	should figure out
0:21:07	basically
0:21:08	if
0:21:09	it gets positive or negative examples in this case
0:21:14	if the representations
0:21:15	come from the same
0:21:17	or from different sentences
0:21:22	in this system the discriminator loss is
0:21:25	set
0:21:26	to maximize the mutual information
0:21:30	moreover the encoder and the discriminator
0:21:33	are jointly trained from scratch
0:21:37	and this
0:21:38	results in
0:21:39	a cooperative
0:21:41	game not in an adversarial game like in gans
0:21:44	in this case
0:21:46	the encoder and the discriminator should cooperate to learn
0:21:52	good and hopefully high level
0:21:55	representations
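The sampling game described above can be sketched in a few lines. This is only an illustration under stated assumptions: `toy_encoder` is a hypothetical stand-in for the real neural encoder, the waveforms are random noise, and the chunk length of 3200 samples is an arbitrary choice, none of which comes from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_chunk(sentence, chunk_len, rng):
    """Cut one chunk of chunk_len samples at a random position of a waveform."""
    start = rng.integers(0, len(sentence) - chunk_len + 1)
    return sentence[start:start + chunk_len]

def toy_encoder(chunk):
    """Hypothetical stand-in for the real encoder: a few summary statistics."""
    return np.array([chunk.mean(), chunk.std(), np.abs(chunk).max()])

def sample_pairs(sentences, chunk_len, rng):
    """Return one positive pair (label 1) and one negative pair (label 0)."""
    i, j = rng.choice(len(sentences), size=2, replace=False)  # two different sentences
    z1 = toy_encoder(random_chunk(sentences[i], chunk_len, rng))      # from c1
    z2 = toy_encoder(random_chunk(sentences[i], chunk_len, rng))      # from c2, same sentence
    z_rand = toy_encoder(random_chunk(sentences[j], chunk_len, rng))  # from c_rand, other sentence
    positive = np.concatenate([z1, z2])       # sample from the joint distribution
    negative = np.concatenate([z1, z_rand])   # sample from the product of marginals
    return (positive, 1), (negative, 0)

# Toy "dataset": three random waveforms of different lengths.
sentences = [rng.standard_normal(n) for n in (16000, 24000, 20000)]
(pos, pos_label), (neg, neg_label) = sample_pairs(sentences, chunk_len=3200, rng=rng)
```

A discriminator would then be trained to tell `pos` from `neg`; in the approach described by the speaker it is trained jointly with the encoder rather than on top of fixed features as in this sketch.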
0:21:57	a good question here is okay
0:22:00	but what do we learn by playing this game
0:22:05	with this game we basically learn speaker identities or rather we learn speaker embeddings
0:22:15	why
0:22:16	because this approach is based on randomly
0:22:18	sampling
0:22:19	within the same sentence
0:22:21	and if we randomly sample within the same sentence
0:22:25	a reliable factor that the system can disentangle
0:22:32	is definitely the speaker identity
0:22:34	moreover
0:22:36	we assume that we have a dataset that is large enough with a
0:22:40	large variability of speakers and if we randomly sample two sentences
0:22:45	the probability of finding
0:22:46	the same speaker is very low
0:22:49	so overall
0:22:50	this can be seen
0:22:51	as a system for learning
0:22:54	speaker embeddings without
0:22:57	providing to the system the explicit
0:22:59	label
0:23:02	on the speaker identity
0:23:06	the encoder is fed by the raw speech samples directly
0:23:12	in the first layer of the architecture we use sincnet that makes
0:23:17	learning from raw samples much easier
0:23:20	in fact instead of using the standard convolutional filters we use band-pass parameterized
0:23:26	filters that only learn the
0:23:29	cutoff frequencies of the filters
0:23:32	this makes
0:23:35	learning from the raw samples easier
0:23:38	and this is useful not only in supervised learning but also in this
0:23:42	self-supervised context and
0:23:44	i encourage you to read the reference paper
0:23:47	if you would like to hear more about
0:23:51	sincnet
0:23:53	what are the strengths and issues of local info max
0:23:58	one strength is that
0:23:59	we are able
0:24:00	with local info max we are able to learn
0:24:03	high quality
0:24:05	speaker representations which are competitive
0:24:07	with the ones
0:24:09	learned in a standard supervised way
0:24:12	moreover
0:24:14	local info max is very simple and also computationally efficient
0:24:19	because we only use the local information thanks to that we can parallelize a lot
0:24:23	the computations
0:24:26	the limitation of that
0:24:27	is that the representations are very task specific
0:24:32	as we have seen before with local info max we can
0:24:36	learn
0:24:37	speaker identities
0:24:39	but what about the other
0:24:42	information that is embedded in the speech signal like phonemes
0:24:46	emotions
0:24:47	and many other things
0:24:51	so when i got these results i asked myself
0:24:55	are we really sure that a single task is enough
0:25:00	actually most of the works are trying to use self-supervised learning by solving a single
0:25:05	task
0:25:07	but
0:25:08	my experience suggests that one single task is not enough because
0:25:13	we
0:25:14	with a single task we only capture
0:25:18	little information
0:25:20	on the signal and we might want more
0:25:25	well based on this observation we decided to start a new project called problem agnostic
0:25:32	speech encoder where we wanted to learn
0:25:37	more general representations by jointly solving multiple
0:25:43	self-supervised tasks
0:25:46	in pase we have an ensemble of neural networks that must cooperate together
0:25:52	to discover good speech representations
0:25:58	so what is the intuition behind that
0:26:01	if we jointly solve multiple self-supervised tasks
0:26:05	we can expect that each task brings a different view
0:26:11	on the speech signal
0:26:13	and if we
0:26:13	put together
0:26:15	different views on the same signal
0:26:17	we might have higher chances
0:26:20	to
0:26:21	have a more general and complete
0:26:23	description
0:26:24	of the signal
0:26:28	moreover
0:26:30	a consensus across all these views is needed
0:26:33	and this imposes some kind of
0:26:35	soft constraint on the representation
0:26:38	in this way we can improve
0:26:41	its robustness
0:26:44	so with this approach we were actually able
0:26:47	to learn
0:26:48	general robust
0:26:50	and transferable features
0:26:52	thanks to
0:26:53	jointly solving multiple tasks
0:26:56	and let me explain in the next slides in more detail how
0:27:01	the system works
0:27:05	pase is based on an encoder
0:27:08	that transforms raw samples into a higher level representation
0:27:14	the encoder is based on sincnet followed by seven convolutional blocks
0:27:19	and as said earlier
0:27:22	here again we start from the raw samples because we
0:27:26	want to start from the lowest possible speech representation
0:27:32	after the encoder we have a bunch of workers where each worker solves a different self-supervised
0:27:39	task
0:27:41	one thing to remark is that the
0:27:44	workers are very small
0:27:46	why
0:27:47	because if the workers are very simple small neural networks
0:27:51	we force the encoder to provide
0:27:54	a much more robust and higher level
0:27:58	representation
0:28:01	there are actually two types of workers that we
0:28:03	studied
0:28:04	regression workers that solve
0:28:08	a regression task and binary
0:28:12	workers that solve
0:28:12	a binary classification task
0:28:14	the binary workers are similar to the one
0:28:17	that we have seen for
0:28:21	local info max
0:28:23	as for the regression task
0:28:26	we have some workers that estimate some kind of known speech representation
0:28:33	for instance we have one worker estimating the waveform back
0:28:37	in an autoencoder fashion
0:28:40	we estimate the log power spectrum
0:28:42	we estimate the mel
0:28:43	frequency cepstral coefficients mfccs and we also have prosody features such as
0:28:49	voicing probability zero crossing rate and so on
0:28:54	so why do we do something like that
0:28:56	because in this way we inject some kind of
0:29:01	prior knowledge that can be very helpful
0:29:04	in
0:29:05	self-supervised learning
0:29:07	in particular in the speech community we are well aware that there are some
0:29:12	features that are very helpful
0:29:15	like mfccs
0:29:16	or prosody features
0:29:17	why not
0:29:19	try to take advantage of that
0:29:21	why
0:29:22	are we not trying to inject
0:29:24	this information inside our
0:29:26	neural network
0:29:29	in parallel to the regressors we also have
0:29:32	binary classification tasks
0:29:35	binary classification tasks work similarly to what we have described for the mutual
0:29:41	information approaches
0:29:43	basically we sample three
0:29:45	speech chunks
0:29:47	an anchor a positive and a negative according to some kind of predefined strategy
0:29:53	we then process all these chunks
0:29:56	with our pase encoder
0:29:59	and then we feed a discriminator
0:30:01	which is trained with binary cross entropy and should figure out if
0:30:05	we have a positive or negative sample
0:30:08	so very similar to
0:30:10	the local info max approach we described before
0:30:14	the only difference
0:30:15	is the particular sampling strategy
0:30:18	because with different sampling strategies we can
0:30:21	highlight
0:30:22	different features
0:30:24	one sampling strategy that we adopt
0:30:27	is the one proposed in local info max that as we have seen before
0:30:31	is able to learn
0:30:33	speaker embeddings and in general speaker identities
0:30:38	together with that we have another similar strategy called global info max
0:30:43	here we play basically the same game but we use
0:30:48	larger chunks
0:30:49	and with larger chunks
0:30:51	we hope to highlight
0:30:54	some kind of
0:30:54	complementary information which hopefully is more
0:30:57	global
0:31:01	finally we propose another interesting task called sequence predictive coding
0:31:07	with this task we hopefully are able to capture some kind of
0:31:12	information on the order
0:31:14	of
0:31:15	the sequence
0:31:16	it works in this way we choose a random chunk from
0:31:20	a random sentence
0:31:22	called the anchor chunk
0:31:24	we choose another random chunk in the future
0:31:27	of the same sentence and this is the positive one
0:31:31	and then we choose another random chunk in the
0:31:34	past of the same sentence
0:31:37	so if we
0:31:39	play this game
0:31:41	we are
0:31:42	hopefully able to capture a little bit better how
0:31:46	the sequence evolves and to capture some kind of longer context information than we were
0:31:53	able to capture with the previous tasks
0:31:56	this sequence predictive coding is similar
0:31:59	to contrastive predictive coding proposed by aaron van den oord
0:32:03	the main difference is that in our case
0:32:07	the negative samples actually all the samples are derived from the same sentence not from
0:32:12	other ones because
0:32:14	in this case we would like to only focus on how
0:32:17	the sequence evolves we don't want to capture
0:32:20	other kinds of information such as the speaker identity that we already capture
0:32:25	with other tasks
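The sampling rule for this ordering task, with every chunk drawn from a single sentence, might be sketched as below. Several points are assumptions made for the illustration and are not stated in the talk: the chunks are kept non-overlapping, the chunk from the past is treated as the negative example, and the function name `sample_spc_triplet` is hypothetical.

```python
import numpy as np

def sample_spc_triplet(sentence, chunk_len, rng):
    """Pick anchor, future (positive) and past (negative) chunks from one sentence."""
    n = len(sentence)
    if n < 3 * chunk_len:
        raise ValueError("sentence too short for three non-overlapping chunks")
    # Leave room for one full chunk before and one full chunk after the anchor.
    a = int(rng.integers(chunk_len, n - 2 * chunk_len + 1))   # anchor start
    f = int(rng.integers(a + chunk_len, n - chunk_len + 1))   # future chunk start
    p = int(rng.integers(0, a - chunk_len + 1))               # past chunk start

    def cut(start):
        return sentence[start:start + chunk_len]

    return {"anchor": (a, cut(a)), "positive": (f, cut(f)), "negative": (p, cut(p))}

rng = np.random.default_rng(0)
sentence = np.arange(16000, dtype=float)   # toy waveform; values equal sample positions
triplet = sample_spc_triplet(sentence, chunk_len=1600, rng=rng)
```

Because no chunk comes from another sentence, speaker identity cannot help the discriminator here, which is the point the speaker makes about focusing on how the sequence evolves.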
0:32:30	okay but how can we use
0:32:33	pase
0:32:34	inside a speech processing pipeline
0:32:39	well
0:32:39	step one is unsupervised training so we can take the architecture
0:32:44	that we have
0:32:45	seen before
0:32:47	and train it in particular we can jointly train encoder and workers using standard sgd
0:32:57	by optimising a loss which is computed as the average
0:33:03	of each worker cost
0:33:05	in our experiments
0:33:07	we tried different
0:33:09	alternatives
0:33:10	but we found that
0:33:11	averaging
0:33:12	the costs is
0:33:14	the best approach we could find
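The averaging of worker costs mentioned here can be illustrated with a toy computation. Everything concrete in this sketch is an assumption: the three worker names, the numbers fed to them, and the choice of mean squared error for regression workers and binary cross-entropy for the binary worker.

```python
import numpy as np

def mse_loss(prediction, target):
    """Regression worker cost: mean squared error."""
    return float(np.mean((prediction - target) ** 2))

def bce_loss(probability, label):
    """Binary worker cost: binary cross-entropy for one predicted probability."""
    eps = 1e-12  # avoids log(0)
    p = np.clip(probability, eps, 1.0 - eps)
    return float(-(label * np.log(p) + (1 - label) * np.log(1 - p)))

def total_loss(worker_losses):
    """Total cost: the plain average of the individual worker costs."""
    return sum(worker_losses.values()) / len(worker_losses)

# Hypothetical worker outputs and targets, for illustration only.
worker_losses = {
    "waveform": mse_loss(np.array([0.1, 0.2]), np.array([0.0, 0.2])),
    "mfcc": mse_loss(np.array([1.0, 2.0]), np.array([1.0, 1.0])),
    "local_info_max": bce_loss(0.5, 1),
}
loss = total_loss(worker_losses)
```

In an actual training loop this single scalar would be backpropagated through all workers and the shared encoder at once, so each worker contributes equally to the encoder update.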
0:33:18	once we have trained
0:33:19	our architecture without using
0:33:22	any label
0:33:23	we can go to step two which is supervised fine-tuning
0:33:28	in this case
0:33:29	we get rid of all the workers and
0:33:31	plug our encoder into
0:33:34	a supervised classifier which is trained with a little
0:33:37	amount of supervised data
0:33:41	actually here there are a couple of possible alternatives number one
0:33:47	is to use
0:33:48	pase as a standard
0:33:50	feature extractor in this case
0:33:53	we freeze
0:33:53	pase during the supervised fine-tuning phase
0:33:57	another approach
0:33:59	is to initialize pase with the unsupervised
0:34:02	parameters
0:34:03	and fine-tune it
0:34:05	during
0:34:06	the
0:34:08	supervised fine-tuning phase and this second approach is the one that usually yields
0:34:14	the best performance
0:34:17	it is very important
0:34:19	to remark
0:34:20	that
0:34:21	step number one the unsupervised pre-training
0:34:24	can
0:34:25	should be done only once
0:34:27	in fact we have seen
0:34:29	that the features learned with pase
0:34:32	are general enough that they can be used for a large variety
0:34:37	of speech tasks like
0:34:39	speech recognition speaker recognition speech enhancement
0:34:43	and many others
0:34:45	and if you don't want to
0:34:47	train by yourself
0:34:49	the self-supervised extractor you can use
0:34:52	the pre-trained
0:34:54	parameters that we share
0:34:55	in our repository
0:35:00	well this is not all about pase
0:35:04	in fact
0:35:05	encouraged by the good results achieved with the original version
0:35:10	we decided
0:35:11	to
0:35:13	spend some time to further
0:35:15	revise the architecture and improve it
0:35:18	and we took the opportunity of the jsalt two thousand nineteen workshop
0:35:23	organized by the johns hopkins university to set up a team
0:35:27	working on improving
0:35:29	pase
0:35:30	and as a result we came up with a new architecture called
0:35:34	pase plus where we introduced
0:35:37	different types of improvements
0:35:41	first of all we coupled
0:35:43	pase with on-the-fly data augmentation
0:35:47	here we use speech contamination techniques like additive noise and reverberation
0:35:53	but we also add
0:35:55	some kind of random zeros in the time waveform and also we filter the
0:36:00	signal with some kind of random band-stop filters in order
0:36:05	to introduce
0:36:07	zeros
0:36:07	in the frequency domain
0:36:10	data augmentation turned out to be very important because
0:36:13	it gives to the system some kind of robustness against noise and
0:36:19	reverberation and other environmental artifacts
0:36:23	a nice thing is that
0:36:24	since everything is on the fly
0:36:26	every time we contaminate the sentence with a different distortion
0:36:32	and also
0:36:33	the workers are based on the clean
0:36:37	labels extracted from the clean version of the signal so we
0:36:42	implicitly ask
0:36:43	in this way
0:36:44	our system to
0:36:46	perform some kind of
0:36:47	denoising
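The idea of distorting the encoder input on the fly while computing the worker targets from the clean signal can be sketched as follows. This is a reduced illustration under assumptions: only additive noise and random zeros in time are shown (reverberation and frequency-domain filtering are left out), the log-energy target is a hypothetical stand-in for the real worker targets, and the 16 kHz sine wave is a toy input.

```python
import numpy as np

def contaminate(clean, rng, noise_std=0.05, max_zero_len=400):
    """Return a distorted copy of `clean`; a fresh distortion is drawn on every call."""
    noisy = clean + noise_std * rng.standard_normal(len(clean))   # additive noise
    zero_len = int(rng.integers(1, max_zero_len + 1))             # length of the zeroed span
    start = int(rng.integers(0, len(clean) - zero_len + 1))
    noisy[start:start + zero_len] = 0.0                           # random zeros in time
    return noisy

def log_energy(signal, frame_len=400):
    """Toy regression target: log energy of consecutive non-overlapping frames."""
    usable = len(signal) // frame_len * frame_len
    frames = signal[:usable].reshape(-1, frame_len)
    return np.log(np.sum(frames ** 2, axis=1) + 1e-8)

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)   # one second of a 220 Hz tone
encoder_input = contaminate(clean, rng)      # what the encoder sees
worker_target = log_energy(clean)            # what the worker must predict
```

Since the input is corrupted but the target is clean, a worker can only succeed if the encoder output has removed the distortion, which is the implicit denoising effect the speaker describes.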
0:36:50	and then we also revised the encoder
0:36:53	we still have sincnet and convolutional layers but we have also a
0:36:58	quasi-recurrent neural network that is
0:37:01	an efficient way to introduce some kind of recurrence in the architecture
0:37:05	and we also have
0:37:08	some skip connections that help the gradient backpropagation
0:37:14	then we have improved a lot our workers
0:37:17	we have noticed that
0:37:19	the more workers
0:37:22	the better it is
0:37:25	and yes
0:37:26	we definitely have introduced
0:37:30	a lot of workers that estimate for instance new types of features on
0:37:34	different
0:37:36	context lengths et cetera and overall
0:37:39	we can improve a lot the performance
0:37:42	of the system on different speech tasks
0:37:46	what do we learn with pase
0:37:48	here we show some kind of t-sne plots
0:37:51	assuming that's
0:37:53	here
0:37:55	we show that pase is able to learn pretty well speaker identities and you can
0:38:00	clearly recognise
0:38:01	that
0:38:03	there are pretty well defined clusters
0:38:06	for
0:38:08	the speakers
0:38:11	here instead we show some
0:38:14	i'll
0:38:17	t-sne plots
0:38:18	for phonemes
0:38:19	and you can see here
0:38:21	that everything is less well defined but
0:38:24	you have some phonemes
0:38:26	like it is
0:38:27	sure
0:38:27	right
0:38:29	but you can also detect some kind of phonemes which are
0:38:32	pretty well clustered meaning that
0:38:34	we are actually learning
0:38:36	some kind of phonetic
0:38:38	representation
0:38:39	even
0:38:40	without
0:38:41	any
0:38:41	phoneme label
0:38:45	okay we tried pase plus on different
0:38:49	speech tasks and you can refer to the paper to see all the results
0:38:55	but here we just discuss some of the numbers that we have achieved
0:38:59	on noisy asr tasks to highlight
0:39:03	a little bit the robustness
0:39:06	of the proposed approach
0:39:10	furthermore let me say that we have trained
0:39:12	our
0:39:13	pase on librispeech
0:39:16	without using the labels and
0:39:18	very interestingly
0:39:20	we have noticed that we don't need
0:39:22	a lot of data to train pase we just need
0:39:26	one hundred fifty a wire one hundred that was really the speech
0:39:29	and these are enough to
0:39:31	generate the numbers that you see in the tables
0:39:36	this is quite interesting because
0:39:38	usually standard self-supervised approaches rely on a lot of data
0:39:43	in our case we think that
0:39:45	somehow we are more data efficient because we employ a lot of workers
0:39:51	trying to extract a lot of information
0:39:54	from our speech signals
0:39:58	on the left you can see the results that we obtained on dirha
0:40:03	which is a challenging task characterised by speech recorded in a domestic environment
0:40:09	and corrupted by noise and reverberation
0:40:12	you can see here
0:40:14	that pase significantly outperforms
0:40:18	traditional features and also combinations of traditional speech features
0:40:23	on the right you can see the results on chime five
0:40:27	chime five
0:40:28	is probably the most challenging
0:40:31	task available
0:40:32	where the speech is corrupted by noise reverberation
0:40:37	and a lot of disturbances such as overlapped speech
0:40:41	and even in this
0:40:42	pretty challenging scenario we are able
0:40:45	to slightly outperform
0:40:48	the standard
0:40:51	features
0:40:53	of the current baseline
0:40:58	actually the representations discovered with
0:41:02	PASE
0:41:03	are quite general robust and transferable
0:41:06	and we have successfully applied
0:41:09	them to different tasks
0:41:10	right now we have seen speech recognition but you can use it
0:41:14	for speaker recognition
0:41:16	for speech enhancement
0:41:17	was learning and emotion recognition and i am also aware of some works that try to
0:41:23	use
0:41:24	PASE for transfer learning across languages where you train PASE for instance on
0:41:32	english and use it for a task in another language and it seems to
0:41:36	show some kind of surprising robustness to this
0:41:39	transformation
0:41:42	you can find the code and the pretrained model
0:41:45	on GitHub and i encourage you to
0:41:47	go there and play with PASE as well
0:41:53	but let me conclude this part with some thoughts on self-supervised learning and the role
0:41:59	that it can play
0:42:01	in the future
0:42:04	as i mentioned in the first part of the presentation i think that the key
0:42:10	for intelligent machines is the combination of different learning modalities
0:42:15	we can combine supervised learning
0:42:17	with unsupervised learning imitation learning reinforcement learning contrastive learning and so on
0:42:25	so i think there is a huge space here for research in this direction where we
0:42:30	basically
0:42:33	combine
0:42:34	in a simple and elegant way
0:42:36	different
0:42:38	learning modalities
0:42:40	one of them
0:42:43	could be
0:42:45	self-supervised learning but not only
0:42:49	this is
0:42:49	very important these days because
0:42:53	standard supervised learning is the dominant approach but we are starting to see
0:42:58	some kind of limitations and these limitations will be even clearer
0:43:03	in the next
0:43:04	years supervised learning is too data demanding and too computationally
0:43:09	demanding
0:43:09	and if we keep going in this direction
0:43:11	only a few companies in the world will be able
0:43:15	to train state-of-the-art systems
0:43:18	and i think combining different learning modalities can mitigate this issue
0:43:23	and especially self-supervised learning because as we have seen
0:43:29	in this presentation
0:43:30	self-supervised learning can be
0:43:32	extremely useful in a transfer learning scenario
0:43:36	so with self-supervised learning we have the chance to learn a representation which is
0:43:42	general enough
0:43:44	that it can be used
0:43:46	for several downstream tasks
0:43:50	and this is
0:43:52	a really big advantage
0:43:55	in terms of computational complexity of course
0:43:59	so i think
0:44:01	the future paradigm
0:44:03	will be funnily enough similar to the first popular approach of deep
0:44:10	learning where we were
0:44:12	able to initialize our
0:44:15	neural network
0:44:16	using
0:44:18	unsupervised learning approaches or self-supervised learning approaches
0:44:22	and then we can fine-tune it with supervised data
0:44:25	i think this
0:44:26	could be
0:44:29	pretty much
0:44:31	the future paradigm also for speech where
0:44:33	pre-training and transfer learning will play
0:44:36	a major
0:44:38	role in the pipeline
0:44:40	and yes
0:44:42	that is similar to what we have seen in the past the difference is that
0:44:46	in the first systems the techniques we were using for unsupervised learning were
0:44:52	based on restricted boltzmann machines
0:44:54	right now we are using
0:44:56	much more sophisticated techniques
0:44:58	but the idea is the same and it
0:45:01	could be
0:45:03	quickly and the measurable in speech processing and more in general
0:45:07	in machine learning in the near future
0:45:11	if you're interested in this topic and you would like to read
0:45:15	more on self-supervised
0:45:17	learning in audio and speech you can take a look
0:45:19	at the ICML workshop
0:45:22	on self-supervised learning in audio and speech that we have
0:45:26	recently
0:45:27	organized
0:45:28	and you can go to the website see all the presentations and read all the papers
0:45:34	which i think is
0:45:35	a kind of interesting initiative
0:45:37	and let me also highlight
0:45:39	that there will be
0:45:41	a similar initiative
0:45:42	this year at NeurIPS so i would like to encourage
0:45:48	you also to participate
0:45:49	in that
0:45:53	alright since i have a few more minutes
0:45:56	i'm very happy to update you
0:45:59	on another very exciting project i am leading these days which is called
0:46:04	SpeechBrain
0:46:06	SpeechBrain will be an open-source all-in-one toolkit
0:46:10	entirely based on PyTorch
0:46:12	our goal
0:46:13	is to build a tool that can significantly speed up
0:46:16	research and development of speech and audio processing techniques
0:46:22	so we are building
0:46:23	a toolkit which will be efficient flexible
0:46:27	moreover and very important we'd i hu
0:46:34	the main difference with the other existing toolkits is that SpeechBrain is specifically designed to
0:46:41	address
0:46:42	multiple speech tasks
0:46:44	at the same time
0:46:46	for instance SpeechBrain supports speech recognition speaker recognition emotion recognition multi-microphone
0:46:55	signal processing speaker diarization
0:46:58	and many other things
0:47:00	so
0:47:00	typically all these tasks share the same underlying technology which is deep learning
0:47:08	and there is no
0:47:10	reason why we need different repositories for
0:47:16	different kinds of speech applications so what we want
0:47:20	is something like our brain
0:47:22	where we have a single system that is able
0:47:24	to process several speech applications at the same time
0:47:32	the main issue with the other toolkits
0:47:34	or most of them is that they
0:47:37	are designed for a single task
0:47:40	for instance you can use Kaldi for speech recognition and i
0:47:44	don't know colour the is
0:47:47	is
0:47:49	we the idea creating can show that can be extremely is that still on
0:47:54	meeting speech recognition
0:47:56	standard v is yes
0:47:59	very good or
0:48:00	speaker recognition
0:48:02	i think
0:48:04	a toolkit explicitly designed to
0:48:07	address
0:48:08	different tasks still does not exist
0:48:12	and when people have to implement complex pipelines involving
0:48:18	different technologies like speech enhancement plus
0:48:22	speech recognition
0:48:23	or
0:48:24	speech recognition plus speaker recognition
0:48:26	they are like because the captain john
0:48:32	and of course jumping from one toolkit to another is very demanding because there can be
0:48:38	different programming languages different coding styles et cetera
0:48:44	and
0:48:45	one other issue is that
0:48:48	if we have different toolkits it is very hard to combine systems together and unify them
0:48:54	in a single system that is fully differentiable
0:48:57	which is very important these days with deep learning
0:49:00	so we are actually working on that and we are trying to do our best
0:49:07	to develop a toolkit that will allow users to
0:49:12	actually couple the different
0:49:14	speech pipelines
0:49:16	in an easy way
0:49:19	what about the timeline actually we have worked a lot this year on that
0:49:23	we have a team
0:49:25	with a lot of people working on that and a lot of interest
0:49:29	and we are very close to a first release that
0:49:33	will happen we estimate within a couple of months so i strongly encourage you to
0:49:40	stay tuned and then
0:49:43	to try
0:49:45	SpeechBrain
0:49:46	in the future and give us your feedback
0:49:51	speaker in
0:49:52	as quickly the project is how would be as well people
0:49:57	we have lower while the
0:50:00	twenty delaware as last having solar raiders you have all sources sounds will all be
0:50:08	ones and so the project is getting bigger and we go to have also the
0:50:13	product
0:50:15	all the speech community
0:50:18	technical the store
0:50:21	saying it be right to my
0:50:23	collaborator
0:50:24	the guys year are being
0:50:28	this ain't is that working on there
0:50:32	all these are the other works lots of the what's happening
0:50:36	and here you can see
0:50:39	the team that is currently working on SpeechBrain and i would really like to thank them because
0:50:45	i think together we are working very well and
0:50:52	soon you will see the result of our hard work
0:50:57	thank you very much
0:50:58	for everything
0:51:00	and i'm very happy now to reply to your questions
0:51:15	many thanks mirco for this nice presentation
0:51:20	i already have a
0:51:21	a set of questions for you
0:51:25	so as to what is wrong using both ukrainian but at so complex the first
0:51:31	question was from nicole rubber
0:51:38	and the only the we i have to you england
0:51:43	it a weight on a holiday is less computationally demanding men so that is known
0:51:50	in
0:51:52	actually is nothing but the best and then i'm
0:51:55	i think i can take this opportunity to clarify a little bit better these
0:52:01	things there are a couple of things to consider
0:52:05	first of all with PASE
0:52:08	we're trying to learn not a task-specific representation but a general representation
0:52:15	and this means that you can train your self-supervised network just
0:52:21	once right and then you can use just a little amount of supervised data to
0:52:27	train the system
0:52:29	so this naturally leads to computational advantages because you have to train
0:52:36	the big thing only once
0:52:38	and a menu don't things
0:52:42	when you have
0:52:44	some
0:52:45	things which are and we have to the standard supervised learning and usually
0:52:51	if you have a good representation the supervised learning part is going to be much
0:52:57	easier
0:52:59	and the other i think good thing about PASE
0:53:03	that i didn't remark too much in the presentation but it is better to remark
0:53:07	here a little bit
0:53:10	is that PASE is pretty data efficient right
0:53:13	we found very good results even just using something like fifty hours of speech so
0:53:19	very little compared to
0:53:21	what we see these days
0:53:24	even in self-supervised learning where people are using thousands and thousands of hours
0:53:27	of speech
0:53:29	and we are data efficient because with the multiple workers
0:53:35	somehow we try to extract as much information as possible from our signal
0:53:41	we are trying to do our best to be also data efficient and extract everything we
0:53:47	can from the signal
0:53:50	so there are two things here
0:53:53	the day
0:53:55	the fact that we are learning a general representation right so you
0:53:59	can train PASE only one time and use it for multiple tasks and then also
0:54:04	the data efficiency part that allows you to
0:54:08	learn a reasonable representation
0:54:10	even with
0:54:12	a relatively small amount of unlabeled data
0:54:17	an eco are you are you k do you have other
0:54:22	comments on the part
0:54:33	so
0:54:34	okay
0:54:36	five is very bad because you really question is on the sides of anyway and
0:54:41	try my best
0:54:43	i haven't quite a you have a question from don't combo well as you could
0:54:48	give a comment on the robustness of this self-supervised learning in non-ideal conditions
0:54:56	actually we increased a lot the robustness of PASE when we revised
0:55:03	it with PASE+
0:55:06	and as i mentioned before in PASE+ we combine basically self-supervised learning with
0:55:16	on-the-fly data augmentation
0:55:19	on-the-fly means that every time we have a new sentence we
0:55:23	contaminate it with a different sequence of noise and a different reverberation such that the system
0:55:30	every time sees a different
0:55:33	sentence
0:55:34	or at least a different contamination and at the output
0:55:39	our workers are not extracting their labels from the noisy signal but from the
0:55:45	original clean one
0:55:47	so somehow our system is forced
0:55:52	to
0:55:53	denoise the features
0:55:55	and this leads to the robustness we have seen before we actually tried
0:56:01	it on challenging tasks like DIRHA and CHiME-5 and it was really
0:56:10	great at increasing robustness with respect to standard approaches
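
The on-the-fly contamination described in this answer can be sketched roughly as follows. This is only a minimal illustration and not the PASE+ code: the names `contaminate`, `make_training_pair`, `noise_bank` and `noise_gain` are hypothetical, only additive noise is shown (reverberation is left out), and the clean signal itself stands in for the features that a worker would actually regress.

```python
import random

def contaminate(clean, noise_bank, rng, noise_gain=0.3):
    """Return a noisy copy of `clean` using a randomly drawn noise segment."""
    noise = rng.choice(noise_bank)
    start = rng.randrange(len(noise) - len(clean) + 1)
    return [s + noise_gain * noise[start + i] for i, s in enumerate(clean)]

def make_training_pair(clean, noise_bank, rng):
    """Encoder input is contaminated; worker targets come from the clean signal."""
    noisy_input = contaminate(clean, noise_bank, rng)
    clean_target = list(clean)  # stand-in for features computed on clean speech
    return noisy_input, clean_target

rng = random.Random(0)
clean = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]
noise_bank = [[rng.uniform(-1, 1) for _ in range(32)] for _ in range(4)]

# The same sentence receives a freshly drawn contamination at every epoch,
# while the target stays tied to the clean signal.
pairs = [make_training_pair(clean, noise_bank, rng) for _ in range(3)]
```

Because the target never depends on the drawn noise, the encoder has to produce features from which the clean information can be recovered, which is the denoising pressure mentioned above.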
0:56:18	good thank you same really sure
0:56:22	okay
0:56:24	you ask some questions
0:56:26	bayes rule has also a question about the competition between the walkers in days
0:56:34	and i don't we should be visible but he or within them or when
0:56:40	LIM and GIM could consider some segments
0:56:43	within the same utterance
0:56:45	one as a positive example and another as a negative
0:56:50	sample
0:56:51	and he asks what you expect to be able to learn in this case
0:56:57	actually the set of workers that we tried is not random right we took the
0:57:03	opportunity of the JSALT workshop for instance to do a lot of experiments
0:57:08	and we just came out with a set of workers a subset of workers
0:57:14	a subset of ideas
0:57:16	that actually works for us
0:57:19	so actually one of our concerns was okay how is it possible to put together
0:57:27	regression tasks which are based on the mean squared error loss for instance with
0:57:33	binary tasks which are based on another kind of loss like binary cross-entropy
0:57:38	how can we learn these things together and we thought that there was
0:57:45	a big issue but we realised that actually there is not just by doing experiments doing
0:57:50	some kind of ablation of the workers so we noticed that if we put
0:57:54	together more workers the better it is
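
As a rough illustration of how regression workers and binary workers can share one objective, the sketch below simply adds a mean squared error term and a binary cross-entropy term. The unweighted sum and the names `mse_loss`, `bce_loss` and `total_loss` are assumptions made for this example; the talk does not say how the individual losses are weighted.

```python
import math

def mse_loss(pred, target):
    """Mean squared error for a regression worker."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def bce_loss(prob, label, eps=1e-7):
    """Binary cross-entropy for a worker that separates positive/negative pairs."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))

def total_loss(regression_outputs, binary_outputs):
    """Unweighted sum of all worker losses (an assumption of this sketch)."""
    loss = 0.0
    for pred, target in regression_outputs:
        loss += mse_loss(pred, target)
    for prob, label in binary_outputs:
        loss += bce_loss(prob, label)
    return loss

regression_outputs = [([0.1, 0.4], [0.0, 0.5])]  # e.g. one feature regressor
binary_outputs = [(0.9, 1), (0.2, 0)]            # e.g. a positive and a negative pair
loss = total_loss(regression_outputs, binary_outputs)
```

Both terms are differentiable with respect to the shared encoder, so a single backward pass can update it from all workers at once.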
0:57:58	and the same holds for LIM and GIM
0:58:02	which are different actually because LIM is based on small
0:58:09	chunks of speech
0:58:11	and we that the will there are not and meeting not carry information
0:58:16	while with GIM it is the same game but played with larger
0:58:21	chunks of one second or one second and a half
0:58:26	and we think it will learn hopefully
0:58:29	higher level representations so we found that
0:58:35	using the two at the same time is anyway helpful even though
0:58:39	they are clearly correlated tasks right
0:58:46	and cuban equation is coming from one channel
0:58:50	and the nist you is you have to the right including to provide us to
0:58:56	pay
0:58:58	and she's really thinking about the five but they is not explicitly thing within speaker
0:59:04	variability
0:59:06	so none of the task is forcing and then he's from different from those from
0:59:10	within speaker could be seen you know we shall work
0:59:13	use it for you and known problem in adding some supervised five little ones you're
0:59:19	where you have always easy
0:59:24	well first of all including supervised tasks totally makes sense honestly one can play
0:59:32	with
0:59:34	semi-supervised learning of course instead of self-supervised learning in this case and i
0:59:39	think people are already doing that i saw some recent papers that
0:59:45	were actually trying to do that
0:59:48	in this paper for PASE we preferred to stay
0:59:52	on the self-supervised side only to make sure to actually check
0:59:58	what we can achieve with a pure
1:00:02	self-supervised learning approach
1:00:06	as for PASE for speaker recognition and within-speaker variability yes PASE is
1:00:14	not specifically designed for that
1:00:18	so it is not optimal but we anyway learn some kind of
1:00:25	speaker identity
1:00:29	actually
1:00:34	we didn't test too much but we are confident that PASE
1:00:39	can be quite competitive with standard systems actually maybe we have
1:00:45	to revise a little bit the architecture for speaker recognition applications because these days
1:00:52	we also see
1:00:54	numbers which are impressive in terms of equal error rate for VoxCeleb
1:00:58	but
1:00:59	the same idea i mean could be i think extended and
1:01:05	redesigned to specifically learn better speaker embeddings actually our main target
1:01:15	was more general so we wanted
1:01:18	to learn a pretty general representation and see if this somehow works
1:01:25	reasonably well for multiple tasks
1:01:29	thank you very is a nicely with the next question from o coming from the
1:01:34	we're not
1:01:36	which tries to use
1:01:38	if you common than the five about the things that you system is no need
1:01:43	to give or speaker restitution and ten information you can really
1:01:50	as you are using examples positive examples coming from within a this a single you
1:01:56	five
1:01:58	actually what we do is to do this on-the-fly data augmentation
1:02:05	for sure right
1:02:06	so if we have sentence one
1:02:10	one time sentence one is
1:02:13	contaminated with some kind of channel some kind of reverberation effect the next time
1:02:18	it is contaminated with another one so maybe with this approach we try to limit a
1:02:25	little bit this effect but
1:02:29	there might be this issue it's true
1:02:34	do you mean to you thing but that the motivation you use it would take
1:02:39	decision problem of internal run by itself
1:02:43	so maybe not tackling the full problem but at least
1:02:50	minimizing it or
1:02:51	reducing it right
1:02:54	i think so and on the other hand we don't have many alternatives
1:02:57	if we would like to stay in the
1:03:00	self-supervised domain right so we don't have speaker labels so we cannot say okay
1:03:05	let's jump to another signal from the same speaker because in that case
1:03:10	we have
1:03:11	to use
1:03:13	the labels so
1:03:14	the best we can do is to contaminate the sentence to
1:03:17	i mean
1:03:18	change a little bit the reverberation and noise effects and
1:03:22	hope
1:03:24	to learn more the speaker and less the channel
1:03:29	fine i we moved to a question from and you can turn
1:03:34	the PASE model can be used from two perspectives embedding extraction and model
1:03:41	pre-training
1:03:44	both of these should be effective
1:03:47	but which one may be better for speaker verification
1:03:53	language and then
1:03:55	we take a look again
1:03:57	i
1:04:04	okay
1:04:06	i think PASE could be used for both you're right it can be used for
1:04:13	feature extraction or embedding extraction or basically for pre-training
1:04:21	my experience is that
1:04:25	it works very well in a pre-training scenario so it is designed basically
1:04:30	to
1:04:33	pre-train your network in a self-supervised way and then
1:04:38	fine-tune it with a small amount of supervised data
1:04:44	this is
1:04:46	basically the main application we have in mind for PASE but
1:04:51	we also tried it as a standard feature extractor
1:04:55	or embedding extractor
1:04:57	not for speaker recognition but for speech recognition
1:05:02	and it works quite well so if you freeze the encoder right and you plug in
1:05:06	just the features that you have there you can train a supervised classifier and it works well but
1:05:11	it works better if you jointly fine-tune the encoder and the classifier during the
1:05:18	supervised phase
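
The two ways of using a pre-trained encoder mentioned here, frozen feature extraction versus joint fine-tuning, can be contrasted with a toy example. Everything below is hypothetical and far simpler than a real encoder: a single weight `w_enc` plays the role of the pre-trained encoder, a logistic unit plays the role of the supervised classifier, and the `freeze_encoder` flag decides whether gradients also update the encoder.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(data, w_enc, freeze_encoder, lr=0.1, epochs=50):
    """Train a logistic classifier on top of a one-weight 'encoder'."""
    w_cls, bias = 0.5, 0.0
    for _ in range(epochs):
        for x, y in data:
            h = w_enc * x                   # encoder output (the "feature")
            p = sigmoid(w_cls * h + bias)   # classifier prediction
            err = p - y                     # gradient of cross-entropy w.r.t. the logit
            grad_enc = err * w_cls * x      # gradient reaching the encoder weight
            w_cls -= lr * err * h
            bias -= lr * err
            if not freeze_encoder:
                w_enc -= lr * grad_enc      # joint fine-tuning of the encoder
    return w_enc, w_cls, bias

data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
pretrained_w_enc = 0.3  # stands in for weights from self-supervised pre-training

frozen = train(data, pretrained_w_enc, freeze_encoder=True)
finetuned = train(data, pretrained_w_enc, freeze_encoder=False)
```

In the frozen case only the classifier adapts to the labels, while in the fine-tuned case the encoder weight moves as well, which corresponds to the second option described in the answer.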
1:05:21	thank you and we'll come back to the grid also no with a question about
1:05:25	the
1:05:27	temporal
1:05:29	sequence walker also can you would avoid more on the minimum detection walker but focused
1:05:36	on the right sequence
1:05:38	this is for that the
1:05:40	maybe some cases the segment from the few to and but would reasonably contain the
1:05:47	thing then
1:05:49	so
1:05:51	with some problem with this walkers you know some comments
1:05:55	definitely that's a very nice question actually curiously the SPC worker
1:06:02	is the one that has the least important impact on the performance
1:06:07	so as i mentioned with a lot of ablation we tried to figure
1:06:12	out the effect of each task and this one was working well it improved things
1:06:19	but less than other workers that were more important like the regressors
1:06:24	and LIM and GIM
1:06:27	and
1:06:29	actually there is an important risk when you build when you sample
1:06:33	from the past and sample from the future you have to make sure
1:06:38	you are not sampling within the receptive field of your convolutional neural network
1:06:43	otherwise the task becomes
1:06:45	too easy
1:06:47	so what we have done is to make sure that the future sample
1:06:52	is not too close right to the anchor one and not
1:06:58	too far because if it is too close the risk is to learn nothing basically
1:07:03	if it is too far
1:07:06	the risk is that there isn't any more a reasonable correlation between the two
1:07:12	so it's not easy to design this task
1:07:17	and
1:07:17	we did the
1:07:18	we did it in such a way that
1:07:20	we were able to sample the past and the future representations within some reasonable range
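
The sampling constraint described in this answer (not inside the receptive field, not too far from the anchor) might look roughly like this. The function names, the frame-based units and the specific numbers are assumptions for illustration only; the talk does not give the actual ranges used.

```python
import random

def sample_future_index(anchor, receptive_field, max_distance, num_frames, rng):
    """Pick a future frame beyond the receptive field but within max_distance."""
    lo = anchor + receptive_field + 1
    hi = min(anchor + max_distance, num_frames - 1)
    if lo > hi:
        raise ValueError("no valid future frame for this anchor")
    return rng.randint(lo, hi)

def sample_past_index(anchor, receptive_field, max_distance, rng):
    """Pick a past frame beyond the receptive field but within max_distance."""
    lo = max(anchor - max_distance, 0)
    hi = anchor - receptive_field - 1
    if lo > hi:
        raise ValueError("no valid past frame for this anchor")
    return rng.randint(lo, hi)

rng = random.Random(0)
anchor, receptive_field, max_distance, num_frames = 100, 15, 50, 200

future = sample_future_index(anchor, receptive_field, max_distance, num_frames, rng)
past = sample_past_index(anchor, receptive_field, max_distance, rng)
```

The lower bound keeps the sampled frame from overlapping with what the encoder already saw for the anchor, and the upper bound keeps some correlation between the two frames.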
1:07:27	it could be interesting to write i believe that you hide traces
1:07:33	but are we move to another question from i in one but still were asked
1:07:40	to you being to write the all the same it does
1:07:43	known from will lead to four speakers for extracting speaker-specific information
1:07:50	well the PASE paper actually is not only about
1:07:57	speaker recognition so the filters that we learn are actually not that
1:08:04	far away from standard
1:08:07	mel filters where we basically allocate
1:08:10	more precisely more filters in the lower part of the spectrum and fewer filters in
1:08:16	the higher part of the spectrum
1:08:18	with LIM
1:08:20	local info max the technique that was designed basically to work only for speaker recognition the
1:08:27	filters we see there have some areas right where there are more filters in the areas
1:08:34	that are more common for the pitch and the formants
1:08:37	so similar to what we have seen in
1:08:41	in
1:08:42	using SincNet
1:08:44	with the supervised approach
1:08:48	but with PASE we are not allocating
1:08:52	more filters in the pitch region we are more or less the same as the
1:08:57	standard mel filter scale
1:09:04	i we have
1:09:05	we are also conclusion i don't have more open question i just have one or
1:09:11	be possible
1:09:12	a i would like to see you
1:09:15	to the explaining more well i use a about unsupervised training
1:09:21	used and the it composes of as the training
1:09:25	it's of my feeling what an issue as
1:09:29	b s
1:09:30	a more easy to find if you have some
1:09:34	during a supervised training because you're some information on the data meta information of video
1:09:41	and we each with the unsupervised training seems to me that
1:09:46	you have less information but you have no reason to have a list yes in
1:09:51	the
1:09:52	they
1:09:53	the figure sure
1:09:55	okay and
1:09:58	the reason is that
1:10:00	if you train your representation with supervised data your representation could be biased to
1:10:07	the task right specifically for instance if you train
1:10:11	a speaker representation with speaker recognition right your representation is not good for speech
1:10:19	recognition and it has a bias towards speaker recognition
1:10:24	with self-supervised learning at least in the way we are trying to do
1:10:28	it with multitask et cetera this risk is reduced because you have
1:10:32	the same representation that is good for both speech recognition
1:10:36	and speaker recognition
1:10:41	communist and that i
1:10:43	really the want to thank you again and we are we will be the over
1:10:50	the official but before to close the position i will be the microphone to get
1:10:57	a the only those two
1:11:00	wants to you to also i think you actually
1:11:06	thank you are service right
1:11:09	yes i stepped off state of a very wide of the top integerization and then
1:11:15	s l obtained in this session
1:11:17	so as dataset now do you think that something to us to decisions
1:11:23	but
1:11:26	system
1:11:28	one of the stuff can show this not
1:11:45	and the second
1:11:51	yes
1:11:52	if your best guess okay just to heal i you the that token decision that
1:12:03	the but there is something that sequences changes that's thanks for inviting me that was
1:12:09	really great thank you
1:12:11	okay that's a tennis together again
1:12:13	and test on a distance for a
1:12:19	a sentence
1:12:21	and you lucille tomorrow a same time this time i in a and ten
1:12:31	definitely
1:12:33	so as you can just
1:12:36	of that by time

Towards Unsupervised Learning of Speech Representations

Keynotes

Dr. Mirco Ravanelli, Université de Montréal, Canada