0:00:13 So, to give you a bit of the motivation I have behind this work: one complaint we have all had when it comes to audio is that all the models we use demand a lot from the user. You always have these constraints you put in, that the sounds are harmonic, or that the noise is stationary. That is a lot of contribution from the user, and I am slightly allergic to that idea. I don't want to have to feed a lot of information into a system; I want the system to learn that information itself.
0:00:48 The other thing that motivates me a lot is that when you see a lot of work in audio, we always have this term at the end, a "plus n(t)", which basically has to absorb any kind of interference. And of course, because we are not comfortable with the math otherwise, we assume it is Gaussian, which makes life easy. But if you have two people speaking at the same time, the second person is not going to be just a Gaussian signal; it is something much more complicated. So a lot of that work does not really carry over well.
0:01:14 And the third motivating point is that, especially nowadays, when you have a lot of data you can have some very, very simple, very, very stupid algorithms outperform very complicated systems, which is a very humbling experience. So these are the things I want to keep in mind during this talk. There is also going to be a fourth point, which is very important, but that is going to come later.
0:01:39 The first observation I am going to make is that, as far as I am concerned, a lot of the fun stuff you can do with audio has to do with the magnitude spectrum. Whether you want to do classification or separation or whatever else, what you really care about is the magnitude spectrum, because it so happens that our ear responds to that much more than to, say, phase. Some of you may want to argue with me about that, but let it go for now. The other point I am going to make is that normalizing spectra does not really change things: if I speak twice as loud, I am still going to be saying the same thing; it is not going to make a difference.
0:02:16 So here is the notation I am going to be using throughout this talk, and I hope the people in the back can see it. Whenever you see a spectrum with a little hat on it, that means it is a spectrum that has been normalized: we divide it by the sum of its elements, so it basically sums to one. What that means is that it starts to live in a rather strange space, because the entire space of normalized spectra is going to lie on a simplex, a subspace of the overall space of spectra.
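As a concrete aside (my own sketch, not something from the talk), the hat operation is just a per-frame normalization; the column-per-frame array layout here is an assumption:

```python
import numpy as np

def normalize_spectra(S, eps=1e-12):
    """Map each magnitude spectrum (one column per frame) onto the
    simplex by dividing it by the sum of its elements."""
    S = np.abs(S)                                    # shape: (n_freqs, n_frames)
    return S / (S.sum(axis=0, keepdims=True) + eps)  # each column now sums to ~1
```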
0:02:52 Here is what that looks like. Let's assume we have a spectrum that only has three frequencies. If we know it is normalized, we can represent it with the simplex shown here. Each vertex of the simplex corresponds to a frequency: on the lower left we have the low frequency, on the top we have the middle frequency, and on the lower right we have the high frequency. Within that simplex we can represent any kind of spectrum that has only three frequencies, assuming you don't care about amplitude. So, for example, all the low-frequency sounds are going to end up in this region, all the high-frequency sounds will be in this region, and any point in the middle is going to be a wideband component that uses all of the frequencies at the same time. So it gives us a sort of simple mixing model.
0:03:37 The contrast up here is not great, but you get the idea.
0:03:41 So let's talk about a very, very basic sound model based on that representation. I can go out and record five hours of speech of somebody speaking. Then, every time I am given a new recording of that person, what I can do is go through the normalized spectra of the new recording; for example, I would pick the spectrum over here. Then I do simple matching: I try to figure out which spectrum out of all the training data is the closest to it. It is a simple nearest neighbor operation; I am just trying to find the spectrum that has more or less the same look to it. There is nothing special about doing something like that.
0:04:19 Just to give an example to get oriented, here is what the input was. [plays sound] And here is what it ends up being approximated as, if we just swap every spectrum with the closest-looking spectrum from the training data. [plays sound] It is not a great representation, but it sort of gets the gist of it.
0:04:47 What happens in the geometric domain we are thinking in is this: the red points are the training data, and we are given some blue points, which are the spectra of the sound we are trying to analyze. We are always trying to find the closest point to the point we are observing right now, and we swap that into our representation. So there is nothing super special going on.
0:05:07 There is one point I want to make, though: this is not a Euclidean space. Because we are constrained to that simplex, things get a little funny, so we can't really use Euclidean distances in this setting; that would ignore a lot of the properties that make this space unique. I won't dwell too much on the details, but basically, if you are working on a simplex you can't assume just anything; what you have there is something like a Dirichlet distribution. And what that means is that a proper distance measure in that space is going to be the cross-entropy between the spectra. So when we are doing that nearest neighbor search, we are looking at the cross-entropy between the normalized spectra, not something like the L2 distance.
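To make that concrete, here is a rough sketch of the nearest neighbor step with cross-entropy as the distance (my illustration; the talk does not give code):

```python
import numpy as np

def cross_entropy(p, Q, eps=1e-12):
    """Cross-entropy between one normalized spectrum p, shape (n_freqs,),
    and every column of a dictionary Q, shape (n_freqs, n_atoms)."""
    return -(p[:, None] * np.log(Q + eps)).sum(axis=0)

def nearest_spectrum(p, train):
    """Nearest neighbor on the simplex: index of the training spectrum
    with the lowest cross-entropy to the observed spectrum p."""
    return int(np.argmin(cross_entropy(p, train)))
```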
0:05:43 Now, the whole point of this talk is to analyze mixtures, because we don't want to have that "plus n(t)" in there. So here is how I am going to start. I am going to make a small assumption to begin with: whenever you have a mixture of two sounds, and here we have an adding operation, I am going to assume that the magnitude spectrogram of that mixture is equal to the sum of the magnitude spectrograms of the individual sounds, had we been able to observe them on their own. This is not exactly correct, because there is a little nonlinearity added by taking the magnitude, but on average it is a fine assumption. The other thing is that mixing is not necessarily a process that is Markovian in any way, so we can just look at one vector at any point in time in that mixture. So what we are saying here is that this particular spectrum we are observing of the mixture is going to be a sum of the corresponding spectra of the original sources at the same time. A very simple idea.
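How rough is that additivity assumption? Here is a quick way to check it empirically (my sketch; the random signals are just stand-ins for two sources):

```python
import numpy as np
from scipy.signal import stft

rng = np.random.default_rng(0)
x1 = rng.standard_normal(16000)   # stand-in for source 1
x2 = rng.standard_normal(16000)   # stand-in for source 2

_, _, X1 = stft(x1)
_, _, X2 = stft(x2)
_, _, Xm = stft(x1 + x2)

# The gap between |STFT(x1 + x2)| and |STFT(x1)| + |STFT(x2)| is the error
# introduced by assuming magnitudes add; it is nonzero because of phase
# interaction, but tolerable on average.
gap = np.abs(np.abs(Xm) - (np.abs(X1) + np.abs(X2))).mean()
print(gap)
```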
0:06:41 And guess what happens when you look at this statement in the space we are in. We are going to have, again, a three-frequency simplex, and we will be observing a point which is a mixture of our two sources. That point will have to lie on the line segment that connects the two points of the spectra that combined to create that mixture. So in the previous example we had these two spectra, which is what the clean sounds looked like; those are going to be represented by these two points. Any spectrum that lies on this segment between those two points would be a plausible blend of these two spectra, and how far along that line you are tells you how much each spectrum is contributing.
0:07:24 So now that we have this model, we can have a slightly updated version of the nearest neighbor idea. What I am going to have now is a mixture of sounds, and that is the only thing I have; I don't know exactly what the original sounds are. I can go to my database and say, well, it sounds like in this mixture I have somebody speaking and a whole bunch of chirping birds. I can get a gigabyte of speech and I can get a gigabyte of chirping birds, no big deal nowadays. And what I have to do is, for every spectrum that I am observing in the mixture, try to find one spectrum from the speech database and one spectrum from the chirping bird database that combine well together in order to approximate what I am observing. So it becomes this humongous search. And the assumption is that if I do that and find these two spectra, they will be good approximations of the spectra I would have observed of the original clean sources. What that means, again, in the space we are in, is this: we have a spectrum simplex; there is going to be a subspace where you have, say, the red source, because it has its particular spectral character, and there is going to be a neighborhood, a subspace, that has the blue source. I am going to try to find all the lines that connect a blue point and a red point, and figure out which one passes the closest to the mixture spectrum I am observing.
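Written out naively (my sketch, not the talk's implementation), the search is a loop over both dictionaries and a grid of blend weights, which also makes the cost painfully obvious:

```python
import numpy as np

def best_pair(m, A, B, n_weights=11, eps=1e-12):
    """Exhaustive mixture search: find one column of A (source 1), one
    column of B (source 2), and a weight lam such that the convex blend
    lam*a + (1-lam)*b has the lowest cross-entropy to the observed
    normalized mixture spectrum m. Cost is O(|A| * |B| * n_weights)."""
    best, best_idx = np.inf, None
    for lam in np.linspace(0.0, 1.0, n_weights):
        for i in range(A.shape[1]):
            blend = lam * A[:, [i]] + (1.0 - lam) * B           # (n_freqs, |B|)
            d = -(m[:, None] * np.log(blend + eps)).sum(axis=0)  # fit per column of B
            j = int(np.argmin(d))
            if d[j] < best:
                best, best_idx = d[j], (i, j, lam)
    return best_idx
```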
0:08:40 At this point you are probably thinking I must be nuts, because this is a horrible search problem. Just to give you some numbers: if you have ten minutes of training data, which is not a lot of data, we are talking about seventy-five thousand spectra per source, each of course with something like two thousand dimensions. A ten-second mixture is going to be about twelve hundred spectra, and that comes down to about five and a half billion searches for every spectrum of our input. So it is not going to happen if you do it by brute force.
0:09:11 But there is a way to relax the problem and make it more of a continuous optimization problem. I won't get into much detail, because it is extremely boring, but the way I would briefly describe it is this: we are going to use all of our training data as one huge basis set, so every spectrum in our training data will end up being a basis vector, and we sort of concatenate all that data together. Our goal is to find how to combine this overcomplete basis in such a way that I am only using one spectrum from each of the two sources. If I state it that way, it sounds like a sparse coding problem, and it is not particularly hard to solve; I won't go into it too much. It is an approximate solution, but it is a lot faster.
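As a rough illustration of that relaxation (my sketch; the talk does not spell out the algorithm, and the hard one-spectrum-per-source constraint is not enforced here), one can fit nonnegative weights over the concatenated dictionary with the standard multiplicative updates for the KL/cross-entropy objective, as in NMF with a fixed basis:

```python
import numpy as np

def decompose(m, W, n_iter=200, eps=1e-12):
    """Explain a mixture spectrum m as W @ h, where the columns of W are
    all the training spectra of both sources concatenated and h holds
    nonnegative weights. Because every column of W sums to one, the usual
    normalizing denominator of the KL update is 1 and drops out."""
    h = np.full(W.shape[1], 1.0 / W.shape[1])
    for _ in range(n_iter):
        h *= W.T @ (m / (W @ h + eps))
    return h   # ideally dominated by one training spectrum per source
```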
0:09:56 Let me give you an example of how this behaves when you have a mixture. We have the two original sounds at the top; these are the sounds I never get to observe. Let me just play one of them. [plays sound] TIMIT stuff, you will all have heard of that. And then I have a mixture of the two speakers. [plays sound] I also have a lot of training data of these two particular speakers. Now what I can do is run this huge search, our optimization, and try to approximate every spectrum of the mixture as a superposition of two spectra from those two speakers' data. If I do that, I can reconstruct the two sources; here is, for example, one of them. [plays sound]
0:10:41 Now, here is the thing. I see a lot of familiar faces, and you are probably thinking: what the hell, people do this so much better these days, so what was the point of doing it this way? And this is going to be my fourth point. The whole point of this representation is that we don't necessarily want to separate sounds. I can't for the life of me think why anybody would want to separate a sound, because the only reason you want to separate a sound is that you want to do speech recognition, or pitch detection, or something else afterwards. Separation by itself is pretty useless; in fact, that is not why we do it. The whole point of this representation is that we have a very nice semantic way of describing the mixture: by saying that we have these two clean spectra that come together to approximate our mixture, you get the ability to do a lot more smart processing, because those spectra can carry semantic information with them.
0:11:33 So here is one quick example. Suppose we have a mixture of two instruments playing, which is the one up here, and I have some training data of those two instruments in isolation. It is very easy for me to use a pitch tracker and pitch-tag all of the spectra in the training data, so every spectrum I have there is going to be associated with a pitch value. By doing this kind of decomposition, I am basically explaining every spectrum in the mixture as a superposition of two spectra from my training data, and each of those will have a pitch label attached to it. So that means that at that point I know exactly what the two pitches are that are sounding at that particular time in the mixture.
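Reading the labels off the decomposition is then a matter of bookkeeping; a sketch, assuming the weight vector h from the decomposition sketch above and hypothetical per-spectrum pitch label arrays:

```python
import numpy as np

def tag_pitches(h, n_source_a, labels_a, labels_b):
    """Take the strongest training spectrum from each source's block of
    weights in h and return its pitch tag. labels_a / labels_b hold the
    pitch value attached to each training spectrum by the pitch tracker."""
    i = int(np.argmax(h[:n_source_a]))   # best-matching spectrum, source A
    j = int(np.argmax(h[n_source_a:]))   # best-matching spectrum, source B
    return labels_a[i], labels_b[j]
```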
0:12:10 And what is nice about it is that we did experiments starting from a solo instrument, so basically just doing a nearest neighbor search, all the way to having five wind instruments playing at the same time. Here are the results in terms of mean error and standard deviation in Hertz. You can see that we go from an average error of about four Hertz for the solo case to forty-two Hertz when we have five instruments playing. That is not something we could do if we just had a monophonic pitch tracking algorithm. Because we are decomposing mixtures in terms of things we have already labeled, we get this extra ability to obtain labels for mixtures.
0:12:53 Another example of that is phonetic tagging. If I have a mixture of two speakers, and I can associate it with spectra from the clean recordings, I also have a lot of labels that come with those spectra: I know what phoneme corresponds to each spectrum; I can maybe do some emotion recognition and know what the emotional state is; I know who the speaker was, or what pitch that speaker was speaking at. And what happens is that we only see a very mild degradation when we try to analyze mixtures. In this case, just to give some simple numbers: for one speaker, if we just do a nearest neighbor search and a little bit of smoothing, we can get a phoneme error of forty-five percent. When you have two speakers, you get a phoneme error of fifty-four percent, which is a fairly mild increase in the error, even though the problem is considerably harder. We get about eight to ten percent worse results every time we add a new speaker, so it degrades gracefully in the mixture case.
0:13:49 So, to wrap this up: this is just a simple geometric way to look at mixtures. The point I am trying to make is that you really have to incorporate the idea that sounds mix into your model. You can't just say, well, I am going to have a clean model and hopefully people will figure out a way to deal with the extra sources. This is a model that starts from the idea that things are going to be mixed together, and that is what we really care about. How am I doing on my schedule? OK.
0:14:23 The whole idea of decomposing this way is based on an interesting concept: we see it a lot in the computer vision literature, we see it a lot in the text mining literature, but we don't really see it much in the audio space. If you have lots and lots and lots of recordings, you should be able to explain pretty much everything that comes in. My dream is that at some point our speech databases are going to be so big that you just do a nearest neighbor search on a sentence and it will give you the sentence back, and you won't have to do all this other processing. It has been done in text: if I search for a question on the web, somebody has already asked it. So it is only a matter of time before we can do it with speech.
0:14:58 And the other thing is that thinking in terms of separation is really the wrong emphasis; there is always something else we have to do after separation. If there is any message I can leave you with, it is that being able to analyze a mixture in some smart way, figuring out what is in the mixture, does not necessarily mean having to extract that information out of it. That is most of what I have, so thank you. [applause]
0:15:26 Q: We have time for some questions. OK, one comment: in things like music, we actually do want to separate the sounds, because we are interested in remixing. I just wanted to mention that. A: My answer to that is: if you want to remix, then you just want to remix the music; you don't want to extract the sources, because you are going to put them back in the mix anyway.
0:15:54 Q: We could talk about remixing offline. The other thing I just wanted to ask: you made this provocative statement that you can't use the Euclidean distance, and then two slides later you said you can use the L2 norm. A: Did I say that? If I did, it was a mistake. Q: I could have sworn you did. You said you could use the L2 norm to do something to enforce sparsity. A: Ah, yes. Q: I was just curious what that meant.
0:16:28 A: So, whenever we talk about sparsity, what you want to do is optimize the L1 norm of a signal; that is what the compressive sensing literature is all about. It turns out that since we are dealing with normalized spectra, they already sum to one, so there is no way we can optimize the L1 norm: for a nonnegative bunch of numbers that sum to one, it is always going to be one, so you are sort of screwed if you try to optimize that. But because those numbers are between zero and one, the closer one is to zero, the smaller its square is going to be. So by optimizing the L2 norm of a normalized vector that sums to one, you are essentially enforcing sparsity, because you are saying: I want all of the elements to be really, really close to zero, and only one of them to be close to one. Does that make sense?
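A tiny numerical illustration of that answer (mine, not the speaker's):

```python
import numpy as np

# Two normalized spectra: both sum to one, so their L1 norms are both
# exactly 1 and the L1 norm cannot distinguish them...
sparse = np.array([0.90, 0.05, 0.05])
flat   = np.array([1/3, 1/3, 1/3])

# ...but squaring shrinks the small entries much more, so the sparser
# vector has the larger L2 norm. Maximizing the L2 norm on the simplex
# therefore favors sparse solutions.
print(np.sum(sparse ** 2))   # 0.815
print(np.sum(flat ** 2))     # ~0.333
```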
0:17:15 Q: Hi. The two-speaker example that you played was artificially made by adding two signals. I am just curious how this actually works on real signals, where you have reverberation, which smears the speakers in time. A: Yes. If you have a long reverberation, it is not a big deal, because the echoes of the reverberation are going to be spectra that correspond to some of the previous utterances of the speaker, and those get incorporated in the model, so it is fairly resistant to that. The place where you get into trouble is if you have very strange phase effects in the reverberation that actually change the spectrum. So if you have, say in a bathroom, one of those very ugly echoes that is very short, it creates an odd resonance in your spectrum, and that is going to bias your testing data to be very different from the training data, so the fit is not going to be as good. But as long as the spectral character stays the same, reverberation is not a big deal.
0:18:18 Q: Thank you for the talk. My question is about representing the sources: it seems that we need a lot of data to represent each source, and you said the search is huge. But audio data lives on a kind of manifold, a low-dimensional manifold, so it seems we don't need that much data to represent the data itself. Do you think that introducing a manifold representation could help with the queries and the training?
0:18:50 A: You do need a lot of data in order to properly represent the manifold that every source is lying on, and that is the only reason for all that data. It becomes a problem very similar to supervised learning: if your data is dense enough to represent your input, you are fine, and that could mean you only have about a dozen data points, or it could mean you need five million. So, as in that case, if you are dealing with simple sources without much variation, you can get away with little data. If you want to do something bigger, if you want to model everybody's voice at all different pitches and all different phonemes in all sorts of languages, then obviously you need a pretty good representation of all the possible cases, and that makes a big database. Q: Thank you.
0:19:31 Q: Paris, my comment, I guess, is about your overall philosophy. We have seen throughout these talks a lot of modeling, different ways of modeling the sound, and I was wondering what your philosophy is; maybe you went over it a bit quickly. You said you have got a mixture, and you are going to take two examples, or many examples, of its components, so it is like you know the components of the mixture already, and then you figure out how they were put together. So what are you saying for a more complicated situation, where maybe you don't know what the components are? Or, if it is more complicated and the database size grows, as you were showing earlier, the search can grow very big; you have done some clever stuff to make the search small, but what is your philosophy overall?
0:20:15 A: The overall dream is that eventually you have a database that has pretty much every sound ever played. You do a search over that, and chances are that whatever you are analyzing has been repeated in the past. So that is what we are driving at. One case where I can offer a sort of defense is the quintet example here: even when I had a solo recording, I basically had recordings of five instruments and had to pick the one spectrum that was closest, and in the duet case I again had the five instruments and had to bet on two of them. So yes, I did know which instruments were in there, and they were part of the database, but I sort of think of this more as a logistics problem, in that I had to get some data somewhere. Ultimately, you just want to have this humongous database of everything and pick through that.