0:00:13 So, to give you a bit of the motivation I have behind this work: one complaint we have all had when it comes to audio is that all the models we use demand a lot from the user. You always have these constraints you put in, that the sounds are harmonic, or that the noise is stationary. That is a lot of contribution from the user, and I am slightly allergic to that idea. I don't want to have to feed a lot of information into a system; I want the system to learn that information itself.
0:00:48 The other thing that motivates me a lot is that when you see a lot of work in audio, we always have this term at the end, a "plus n(t)", which basically has to absorb any kind of interference. And of course, because we are not comfortable with the math otherwise, we assume it is Gaussian, which makes life easy. But if you have two people speaking at the same time, the second person is not going to be just a Gaussian signal; it is something much more complicated. So a lot of that work does not really carry over well.
0:01:14 And the third motivating point is that, especially nowadays, when you have a lot of data you can have some very, very simple, very, very stupid algorithms outperform very complicated systems, which is a very humbling experience. So these are the things I want to keep in mind during this talk. There is also going to be a fourth point, which is very important, but that is going to come later.
0:01:39 The first observation I am going to make is that, as far as I am concerned, a lot of the fun stuff you can do with audio has to do with the magnitude spectrum. Whether you want to do classification or separation or whatever else, what you really care about is the magnitude spectrum, because it so happens that our ear responds to that much more than to, say, phase. Some of you may want to argue with me about that, but let it go for now. The other point I am going to make is that normalizing spectra does not really change things: if I speak twice as loud, I am still going to be saying the same thing; it is not going to make a difference.
0:02:16 So here is the notation I am going to be using throughout this talk, and I hope the people in the back can see it. Whenever you see a spectrum with a little hat on it, that means it is a spectrum that has been normalized: we divide it by the sum of its elements, so it basically sums to one. What that means is that it starts to live in a rather strange space, because the entire space of normalized spectra is going to lie on a simplex, a subspace of the overall space of spectra.
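As a concrete aside (my own sketch, not something from the talk), the hat operation is just a per-frame normalization; the column-per-frame array layout here is an assumption:

```python
import numpy as np

def normalize_spectra(S, eps=1e-12):
    """Map each magnitude spectrum (one column per frame) onto the
    simplex by dividing it by the sum of its elements."""
    S = np.abs(S)                                    # shape: (n_freqs, n_frames)
    return S / (S.sum(axis=0, keepdims=True) + eps)  # each column now sums to ~1
```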
0:02:52 Here is what that looks like. Let's assume we have a spectrum that only has three frequencies. If we know it is normalized, we can represent it with the simplex shown here. Each vertex of the simplex corresponds to a frequency: on the lower left we have the low frequency, on the top we have the middle frequency, and on the lower right we have the high frequency. Within that simplex we can represent any kind of spectrum that has only three frequencies, assuming you don't care about amplitude. So, for example, all the low-frequency sounds are going to end up in this region, all the high-frequency sounds will be in this region, and any point in the middle is going to be a wideband component that uses all of the frequencies at the same time. So it gives us a sort of simple mixing model.
0:03:37 The contrast up here is not great, but you get the idea.
0:03:41 So let's talk about a very, very basic sound model based on that representation. I can go out and record five hours of speech of somebody speaking. Then, every time I am given a new recording of that person, what I can do is go through the normalized spectra of the new recording; for example, I would pick the spectrum over here. Then I do simple matching: I try to figure out which spectrum out of all the training data is the closest to it. It is a simple nearest neighbor operation; I am just trying to find the spectrum that has more or less the same look to it. There is nothing special about doing something like that.
0:04:19 Just to give an example to get oriented, here is what the input was. [plays sound] And here is what it ends up being approximated as, if we just swap every spectrum with the closest-looking spectrum from the training data. [plays sound] It is not a great representation, but it sort of gets the gist of it.
0:04:47 What happens in the geometric domain we are thinking in is this: the red points are the training data, and we are given some blue points, which are the spectra of the sound we are trying to analyze. We are always trying to find the closest point to the point we are observing right now, and we swap that into our representation. So there is nothing super special going on.
0:05:07 There is one point I want to make, though: this is not a Euclidean space. Because we are constrained to that simplex, things get a little funny, so we can't really use Euclidean distances in this setting; that would ignore a lot of the properties that make this space unique. I won't dwell too much on the details, but basically, if you are working on a simplex you can't assume just anything; what you have there is something like a Dirichlet distribution. And what that means is that a proper distance measure in that space is going to be the cross-entropy between the spectra. So when we are doing that nearest neighbor search, we are looking at the cross-entropy between the normalized spectra, not something like the L2 distance.
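To make that concrete, here is a rough sketch of the nearest neighbor step with cross-entropy as the distance (my illustration; the talk does not give code):

```python
import numpy as np

def cross_entropy(p, Q, eps=1e-12):
    """Cross-entropy between one normalized spectrum p, shape (n_freqs,),
    and every column of a dictionary Q, shape (n_freqs, n_atoms)."""
    return -(p[:, None] * np.log(Q + eps)).sum(axis=0)

def nearest_spectrum(p, train):
    """Nearest neighbor on the simplex: index of the training spectrum
    with the lowest cross-entropy to the observed spectrum p."""
    return int(np.argmin(cross_entropy(p, train)))
```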
0:05:43 Now, the whole point of this talk is to analyze mixtures, because we don't want to have that "plus n(t)" in there. So here is how I am going to start. I am going to make a small assumption to begin with: whenever you have a mixture of two sounds, and here we have an adding operation, I am going to assume that the magnitude spectrogram of that mixture is equal to the sum of the magnitude spectrograms of the individual sounds, had we been able to observe them on their own. This is not exactly correct, because there is a little nonlinearity added by taking the magnitude, but on average it is a fine assumption. The other thing is that mixing is not necessarily a process that is Markovian in any way, so we can just look at one vector at any point in time in that mixture. So what we are saying here is that this particular spectrum we are observing of the mixture is going to be a sum of the corresponding spectra of the original sources at the same time. A very simple idea.
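How rough is that additivity assumption? Here is a quick way to check it empirically (my sketch; the random signals are just stand-ins for two sources):

```python
import numpy as np
from scipy.signal import stft

rng = np.random.default_rng(0)
x1 = rng.standard_normal(16000)   # stand-in for source 1
x2 = rng.standard_normal(16000)   # stand-in for source 2

_, _, X1 = stft(x1)
_, _, X2 = stft(x2)
_, _, Xm = stft(x1 + x2)

# The gap between |STFT(x1 + x2)| and |STFT(x1)| + |STFT(x2)| is the error
# introduced by assuming magnitudes add; it is nonzero because of phase
# interaction, but tolerable on average.
gap = np.abs(np.abs(Xm) - (np.abs(X1) + np.abs(X2))).mean()
print(gap)
```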
0:06:41 And guess what happens when you look at this statement in the space we are in. We are going to have, again, a three-frequency simplex, and we will be observing a point which is a mixture of our two sources. That point will have to lie on the line segment that connects the two points of the spectra that combined to create that mixture. So in the previous example we had these two spectra, which is what the clean sounds looked like; those are going to be represented by these two points. Any spectrum that lies on this segment between those two points would be a plausible blend of these two spectra, and how far along that line you are tells you how much each spectrum is contributing.
0:07:24 So now that we have this model, we can have a slightly updated version of the nearest neighbor idea. What I am going to have now is a mixture of sounds, and that is the only thing I have; I don't know exactly what the original sounds are. I can go to my database and say, well, it sounds like in this mixture I have somebody speaking and a whole bunch of chirping birds. I can get a gigabyte of speech and I can get a gigabyte of chirping birds, no big deal nowadays. And what I have to do is, for every spectrum that I am observing in the mixture, try to find one spectrum from the speech database and one spectrum from the chirping bird database that combine well together in order to approximate what I am observing. So it becomes this humongous search. And the assumption is that if I do that and find these two spectra, they will be good approximations of the spectra I would have observed of the original clean sources. What that means, again, in the space we are in, is this: we have a spectrum simplex; there is going to be a subspace where you have, say, the red source, because it has its particular spectral character, and there is going to be a neighborhood, a subspace, that has the blue source. I am going to try to find all the lines that connect a blue point and a red point, and figure out which one passes the closest to the mixture spectrum I am observing.
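Written out naively (my sketch, not the talk's implementation), the search is a loop over both dictionaries and a grid of blend weights, which also makes the cost painfully obvious:

```python
import numpy as np

def best_pair(m, A, B, n_weights=11, eps=1e-12):
    """Exhaustive mixture search: find one column of A (source 1), one
    column of B (source 2), and a weight lam such that the convex blend
    lam*a + (1-lam)*b has the lowest cross-entropy to the observed
    normalized mixture spectrum m. Cost is O(|A| * |B| * n_weights)."""
    best, best_idx = np.inf, None
    for lam in np.linspace(0.0, 1.0, n_weights):
        for i in range(A.shape[1]):
            blend = lam * A[:, [i]] + (1.0 - lam) * B           # (n_freqs, |B|)
            d = -(m[:, None] * np.log(blend + eps)).sum(axis=0)  # fit per column of B
            j = int(np.argmin(d))
            if d[j] < best:
                best, best_idx = d[j], (i, j, lam)
    return best_idx
```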
0:08:40 At this point you are probably thinking I must be nuts, because this is a horrible search problem. Just to give you some numbers: if you have ten minutes of training data, which is not a lot of data, we are talking about seventy-five thousand spectra per source, each of course with something like two thousand dimensions. A ten-second mixture is going to be about twelve hundred spectra, and that comes down to about five and a half billion searches for every spectrum of our input. So it is not going to happen if you do it by brute force.
0:09:11 But there is a way to relax the problem and make it more of a continuous optimization problem. I won't get into much detail, because it is extremely boring, but the way I would briefly describe it is this: we are going to use all of our training data as one huge basis set, so every spectrum in our training data will end up being a basis vector, and we sort of concatenate all that data together. Our goal is to find how to combine this overcomplete basis in such a way that I am only using one spectrum from each of the two sources. If I state it that way, it sounds like a sparse coding problem, and it is not particularly hard to solve; I won't go into it too much. It is an approximate solution, but it is a lot faster.
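As a rough illustration of that relaxation (my sketch; the talk does not spell out the algorithm, and the hard one-spectrum-per-source constraint is not enforced here), one can fit nonnegative weights over the concatenated dictionary with the standard multiplicative updates for the KL/cross-entropy objective, as in NMF with a fixed basis:

```python
import numpy as np

def decompose(m, W, n_iter=200, eps=1e-12):
    """Explain a mixture spectrum m as W @ h, where the columns of W are
    all the training spectra of both sources concatenated and h holds
    nonnegative weights. Because every column of W sums to one, the usual
    normalizing denominator of the KL update is 1 and drops out."""
    h = np.full(W.shape[1], 1.0 / W.shape[1])
    for _ in range(n_iter):
        h *= W.T @ (m / (W @ h + eps))
    return h   # ideally dominated by one training spectrum per source
```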
0:09:56 Let me give you an example of how this behaves when you have a mixture. We have the two original sounds at the top; these are the sounds I never get to observe. Let me just play one of them. [plays sound] TIMIT stuff, you will all have heard of that. And then I have a mixture of the two speakers. [plays sound] I also have a lot of training data of these two particular speakers. Now what I can do is run this huge search, our optimization, and try to approximate every spectrum of the mixture as a superposition of two spectra from those two speakers' data. If I do that, I can reconstruct the two sources; here is, for example, one of them. [plays sound]
0:10:41 Now, here is the thing. I see a lot of familiar faces, and you are probably thinking: what the hell, people do this so much better these days, so what was the point of doing it this way? And this is going to be my fourth point. The whole point of this representation is that we don't necessarily want to separate sounds. I can't for the life of me think why anybody would want to separate a sound, because the only reason you want to separate a sound is that you want to do speech recognition, or pitch detection, or something else afterwards. Separation by itself is pretty useless; in fact, that is not why we do it. The whole point of this representation is that we have a very nice semantic way of describing the mixture: by saying that we have these two clean spectra that come together to approximate our mixture, you get the ability to do a lot more smart processing, because those spectra can carry semantic information with them.
0:11:33 So here is one quick example. Suppose we have a mixture of two instruments playing, which is the one up here, and I have some training data of those two instruments in isolation. It is very easy for me to use a pitch tracker and pitch-tag all of the spectra in the training data, so every spectrum I have there is going to be associated with a pitch value. By doing this kind of decomposition, I am basically explaining every spectrum in the mixture as a superposition of two spectra from my training data, and each of those will have a pitch label attached to it. So that means that at that point I know exactly what the two pitches are that are sounding at that particular time in the mixture.
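Reading the labels off the decomposition is then a matter of bookkeeping; a sketch, assuming the weight vector h from the decomposition sketch above and hypothetical per-spectrum pitch label arrays:

```python
import numpy as np

def tag_pitches(h, n_source_a, labels_a, labels_b):
    """Take the strongest training spectrum from each source's block of
    weights in h and return its pitch tag. labels_a / labels_b hold the
    pitch value attached to each training spectrum by the pitch tracker."""
    i = int(np.argmax(h[:n_source_a]))   # best-matching spectrum, source A
    j = int(np.argmax(h[n_source_a:]))   # best-matching spectrum, source B
    return labels_a[i], labels_b[j]
```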
0:12:10 And what is nice about it is that we did experiments starting from a solo instrument, so basically just doing a nearest neighbor search, all the way to having five wind instruments playing at the same time. Here are the results in terms of mean error and standard deviation in Hertz. You can see that we go from an average error of about four Hertz for the solo case to forty-two Hertz when we have five instruments playing. That is not something we could do if we just had a monophonic pitch tracking algorithm. Because we are decomposing mixtures in terms of things we have already labeled, we get this extra ability to obtain labels for mixtures.
0:12:53 Another example of that is phonetic tagging. If I have a mixture of two speakers, and I can associate it with spectra from the clean recordings, I also have a lot of labels that come with those spectra: I know what phoneme corresponds to each spectrum; I can maybe do some emotion recognition and know what the emotional state is; I know who the speaker was, or what pitch that speaker was speaking at. And what happens is that we only see a very mild degradation when we try to analyze mixtures. In this case, just to give some simple numbers: for one speaker, if we just do a nearest neighbor search and a little bit of smoothing, we can get a phoneme error of forty-five percent. When you have two speakers, you get a phoneme error of fifty-four percent, which is a fairly mild increase in the error, even though the problem is considerably harder. We get about eight to ten percent worse results every time we add a new speaker, so it degrades gracefully in the mixture case.
0:13:49 So, to wrap this up: this is just a simple geometric way to look at mixtures. The point I am trying to make is that you really have to incorporate the idea that sounds mix into your model. You can't just say, well, I am going to have a clean model and hopefully people will figure out a way to deal with the extra sources. This is a model that starts from the idea that things are going to be mixed together, and that is what we really care about. How am I doing on my schedule? OK.
0:14:23 The whole idea of decomposing this way is based on an interesting concept: we see it a lot in the computer vision literature, we see it a lot in the text mining literature, but we don't really see it much in the audio space. If you have lots and lots and lots of recordings, you should be able to explain pretty much everything that comes in. My dream is that at some point our speech databases are going to be so big that you just do a nearest neighbor search on a sentence and it will give you the sentence back, and you won't have to do all this other processing. It has been done in text: if I search for a question on the web, somebody has already asked it. So it is only a matter of time before we can do it with speech.
0:14:58 And the other thing is that thinking in terms of separation is really the wrong emphasis; there is always something else we have to do after separation. If there is any message I can leave you with, it is that being able to analyze a mixture in some smart way, figuring out what is in the mixture, does not necessarily mean having to extract that information out of it. That is most of what I have, so thank you. [applause]
0:15:26 Q: We have time for some questions. OK, one comment: in things like music, we actually do want to separate the sounds, because we are interested in remixing. I just wanted to mention that. A: My answer to that is: if you want to remix, then you just want to remix the music; you don't want to extract the sources, because you are going to put them back in the mix anyway.
0:15:54 Q: We could talk about remixing offline. The other thing I just wanted to ask: you made this provocative statement that you can't use the Euclidean distance, and then two slides later you said you can use the L2 norm. A: Did I say that? If I did, it was a mistake. Q: I could have sworn you did. You said you could use the L2 norm to do something to enforce sparsity. A: Ah, yes. Q: I was just curious what that meant.
0:16:28 A: So, whenever we talk about sparsity, what you want to do is optimize the L1 norm of a signal; that is what the compressive sensing literature is all about. It turns out that since we are dealing with normalized spectra, they already sum to one, so there is no way we can optimize the L1 norm: for a nonnegative bunch of numbers that sum to one, it is always going to be one, so you are sort of screwed if you try to optimize that. But because those numbers are between zero and one, the closer one is to zero, the smaller its square is going to be. So by optimizing the L2 norm of a normalized vector that sums to one, you are essentially enforcing sparsity, because you are saying: I want all of the elements to be really, really close to zero, and only one of them to be close to one. Does that make sense?
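A tiny numerical illustration of that answer (mine, not the speaker's):

```python
import numpy as np

# Two normalized spectra: both sum to one, so their L1 norms are both
# exactly 1 and the L1 norm cannot distinguish them...
sparse = np.array([0.90, 0.05, 0.05])
flat   = np.array([1/3, 1/3, 1/3])

# ...but squaring shrinks the small entries much more, so the sparser
# vector has the larger L2 norm. Maximizing the L2 norm on the simplex
# therefore favors sparse solutions.
print(np.sum(sparse ** 2))   # 0.815
print(np.sum(flat ** 2))     # ~0.333
```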
0:17:15 Q: Hi. The two-speaker example that you played was artificially made by adding two signals. I am just curious how this actually works on real signals, where you have reverberation, which smears the speakers in time. A: Yes. If you have a long reverberation, it is not a big deal, because the echoes of the reverberation are going to be spectra that correspond to some of the previous utterances of the speaker, and those get incorporated in the model, so it is fairly resistant to that. The place where you get into trouble is if you have very strange phase effects in the reverberation that actually change the spectrum. So if you have, say in a bathroom, one of those very ugly echoes that is very short, it creates an odd resonance in your spectrum, and that is going to bias your testing data to be very different from the training data, so the fit is not going to be as good. But as long as the spectral character stays the same, reverberation is not a big deal.
0:18:18 Q: Thank you for the talk. My question is about representing the sources: it seems that we need a lot of data to represent each source, and you said the search is huge. But audio data lives on a kind of manifold, a low-dimensional manifold, so it seems we don't need that much data to represent the data itself. Do you think that introducing a manifold representation could help with the queries and the training?
0:18:50 A: You do need a lot of data in order to properly represent the manifold that every source is lying on, and that is the only reason for all that data. It becomes a problem very similar to supervised learning: if your data is dense enough to represent your input, you are fine, and that could mean you only have about a dozen data points, or it could mean you need five million. So, as in that case, if you are dealing with simple sources without much variation, you can get away with little data. If you want to do something bigger, if you want to model everybody's voice at all different pitches and all different phonemes in all sorts of languages, then obviously you need a pretty good representation of all the possible cases, and that makes a big database. Q: Thank you.
0:19:31 Q: Paris, my comment, I guess, is about your overall philosophy. We have seen throughout these talks a lot of modeling, different ways of modeling the sound, and I was wondering what your philosophy is; maybe you went over it a bit quickly. You said you have got a mixture, and you are going to take two examples, or many examples, of its components, so it is like you know the components of the mixture already, and then you figure out how they were put together. So what are you saying for a more complicated situation, where maybe you don't know what the components are? Or, if it is more complicated and the database size grows, as you were showing earlier, the search can grow very big; you have done some clever stuff to make the search small, but what is your philosophy overall?
0:20:15 A: The overall dream is that eventually you have a database that has pretty much every sound ever played. You do a search over that, and chances are that whatever you are analyzing has been repeated in the past. So that is what we are driving at. One case where I can offer a sort of defense is the quintet example here: even when I had a solo recording, I basically had recordings of five instruments and had to pick the one spectrum that was closest, and in the duet case I again had the five instruments and had to bet on two of them. So yes, I did know which instruments were in there, and they were part of the database, but I sort of think of this more as a logistics problem, in that I had to get some data somewhere. Ultimately, you just want to have this humongous database of everything and pick through that.