So, you get a bit of the motivation I have behind this work. One complaint we've all had when it comes to audio is that the models all require a lot of user input: you always have these constraints you put in, that the sounds be harmonic, or that the noise be stationary. That's a lot of contribution from the user, and I'm slightly allergic to that idea. I don't want to have to input a lot of information myself; I want the system to learn that information.

The other thing that motivates me a lot is that when you see a lot of work in audio, we always have this term at the end, "+ n(t)", which is supposed to stand for any kind of interference. And of course, because we're not that comfortable with the math otherwise, we assume it's Gaussian; it makes life easy. But if you have two people speaking at the same time, the second person is not going to be just a Gaussian signal; it's something much more complicated. So a lot of that work doesn't really carry over well.

And the third motivating point, which we see especially nowadays, is that when you have a lot of data, you can have some very, very simple, very, very stupid algorithms that outperform very complicated systems, which is a humbling experience. These are the things I want to keep in mind during this talk. There's going to be a fourth point, which is also very important, but that's going to come later.

The first observation I'm going to make is that, as far as I'm concerned, a lot of the fun stuff you can do with audio has to do with the magnitude spectrum. If you want to do classification, separation, or other fun things with sound, what you really care about is the magnitude spectrum, because it so happens our ear likes that much more than, say, phase. I'm not going to be talking about phase for now.

The other assumption I'm going to make is that we can normalize spectra, and that doesn't really change things: if I speak twice as loud, I'm still going to be saying the same thing; it's not going to make a difference. So here's the notation I'm going to be using throughout this talk, and I hope the people in the back can see it: whenever you see a spectrum with a bar on top, that means the spectrum is normalized. We divide it by the sum of its elements, so basically its elements sum to one.

What that means is that our spectra start living in a strange space, because the space of all normalized spectra is going to be a simplex, a subspace of the overall space. Here's what that looks like. Let's assume we have a spectrum that only has three frequencies. If we know it's normalized, we can represent it on the simplex shown here, and each vertex of the simplex corresponds to one frequency: on the lower left we have the low frequency, on the top we have the middle frequency, and on the lower right we have the high frequency. Within that simplex we can represent any kind of spectrum that has only three frequencies, assuming you don't care about amplitude.
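Just to make the notation concrete, here is a minimal sketch of that normalization in Python; the function name and the random stand-in for an STFT magnitude are illustrative, not from the talk:

```python
import numpy as np

def normalize_spectra(S, eps=1e-12):
    """Divide each column (one magnitude spectrum per frame) by its sum,
    so every frame becomes a point on the probability simplex."""
    return S / (S.sum(axis=0, keepdims=True) + eps)

# Toy case with three frequencies, matching the simplex picture:
# the vertices (1,0,0), (0,1,0), (0,0,1) are the pure low, middle,
# and high frequencies; every other spectrum lies inside the triangle.
S = np.abs(np.random.randn(3, 5))    # stand-in for |STFT| magnitudes
S_bar = normalize_spectra(S)
print(S_bar.sum(axis=0))             # each column sums to 1
```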
So, for example, all the low-frequency sounds are going to end up in this region, all the high-frequency sounds will be in this region, and any point in the middle is going to be a wideband sound that's using all of the frequencies at the same time. It's a sort of simple mixing model. The contrast on the slide isn't great, but you get the idea.

Now let's talk about a very, very basic sound model based on that representation. I can go out and record five hours of speech of somebody speaking, and then, every time I'm given a new recording of that person, what I can do is go through the normalized spectra of the new recording. Say, for example, I pick the spectrum over here; I do a simple matching and try to figure out which spectrum, out of all the training data, is the closest to it. It's a simple nearest-neighbor operation: I'm just trying to find the training spectrum that has more or less the same look to it.

There's nothing special about doing something like that; I just give it as an example to get oriented. Here's what the input was [plays sound], and here's what it ends up being approximated as if we just swap all the spectra with the closest-looking spectrum from the training data [plays sound]. It's not a good representation, but it's sort of getting the gist of it. What happens, in the geometry we were just thinking about, is this: the red points are the training data, we're given some blue points, which are the spectra of the sound we're trying to analyze, and we always try to find the red point closest to the point we're observing right now and swap that into our representation.

Nothing super special, but there's one point I want to make: this is not a Euclidean space. Because we're constrained to that simplex, things get a little funny, so we can't really use the Euclidean distance here; it would ignore a lot of the properties that make this space unique. I won't dwell too much on the details, but basically, if you're working on a simplex, you can't assume things are Gaussian; instead you have something like a Dirichlet distribution, and what that means is that a proper distance measure in that space is the cross-entropy between the spectra. So when we do that nearest-neighbor search, we look at the cross-entropy between the normalized spectra, not something like the L2 distance.
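As a sketch of that matching step, assuming the training data is stored as a matrix of normalized spectra, one per column (all names here are illustrative), the nearest-neighbor search with the cross-entropy distance might look like this:

```python
import numpy as np

def cross_entropy(p, Q, eps=1e-12):
    """Cross-entropy -sum_f p[f] * log(Q[f, n]) between one observed
    normalized spectrum p and every column n of the training matrix Q."""
    return -(p[:, None] * np.log(Q + eps)).sum(axis=0)

def nearest_spectrum(p, train):
    """Index and column of the training spectrum closest to p
    under cross-entropy (rather than Euclidean distance)."""
    idx = int(np.argmin(cross_entropy(p, train)))
    return idx, train[:, idx]
```

Swapping every frame of a new recording for its nearest training frame under this distance is exactly the crude resynthesis played in the talk.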
Now, the whole point of this talk is to analyze mixtures, because we don't want to have that "+ n(t)" in there. So here's how I'm going to start. I'll make a small assumption to begin with: whenever you have a mixture of two sounds, and here we have an adding operation, I'm going to assume the magnitude spectrogram of that mixture is equal to the sum of the magnitude spectrograms of the individual sounds, had we been able to observe them on their own. This is not exactly correct, because there's a little nonlinearity introduced by taking the magnitude, but on average it's a fine assumption. The other assumption is that a mixture is not necessarily a process that's Markovian in any way, so we can just look at one vector at any point in time in that mixture. So what we're saying is that this particular spectrum we observe of the mixture is the sum of the corresponding spectra of the original sources at the same time. A very simple idea.

And guess what happens when you look at this statement in the space we're in. We have, again, our three-frequency simplex. We observe a point which is a mixture of our two sources, and that point has to lie on the line segment connecting the two points of the spectra that combined to create that mixture. So in the previous example, we had these two spectra, which are what the clean sounds looked like; those are represented by these two points, and any spectrum that lies on the segment between those two points is a plausible blend of these two spectra. And how far along that line you are tells you how much each spectrum is contributing.

Now that we have this model, we can make a slightly updated version of the nearest-neighbor idea. What I have now is a mixture of sounds, and that's the only thing I have; I don't know exactly what the original sounds are. I can go to my database and say, well, it sounds like in this mixture I have somebody speaking and a whole bunch of chirping birds. I can get a gigabyte of speech and a gigabyte of chirping birds; no big deal these days. And what I have to do is, for every spectrum I observe in the mixture, find one spectrum from the speech database and one spectrum from the chirping-bird database that combine well together to approximate what I'm observing. So it becomes this humongous search, and the hope is that if I do that and find these two spectra, they will be good approximations of the spectra I would have observed of the original clean sources. What that means, again, in the space we're in, is that on the spectrum simplex there's a subspace where you find, say, the red source, because it has its particular spectral character, and there's a neighborhood or subspace that holds the blue source. I'm going to try all the lines that connect a blue point and a red point, and figure out which one passes the closest to each of my mixture spectra.

At this point you're probably thinking I must be nuts, because this is a horrible search problem. Just to give you some numbers: with ten minutes of training data, which is not a lot of data, we're talking about seventy-five thousand spectra per source, each with something like two thousand dimensions. A ten-second mixture is about twelve hundred spectra, and that comes down to about five and a half billion searches for every spectrum of our input. So it's not going to happen, no matter how long you wait.

But there's a way to relax the problem and make it more of a convex optimization problem. I won't get into much detail, because it's extremely boring, but the way I would roughly describe it is this: we use all our training data as one huge basis set. Every spectrum of our training data ends up being a basis vector; we essentially concatenate all that data together, and our goal is to figure out how to combine this overcomplete basis in such a way that we only use one spectrum from each of the two sources. Stated that way, it sounds like a sparse coding problem, and it's not particularly hard to solve.
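The talk deliberately skips the details of the actual relaxation, so the following is only a stand-in that shows the structure, not the speaker's algorithm: every training spectrum becomes a column of one concatenated dictionary, a nonnegative solver assigns a weight to each, and the strongest atom in each source's block is taken as that source's explanation.

```python
import numpy as np
from scipy.optimize import nnls

def explain_mixture_frame(x, A_speech, A_birds):
    """Approximate one normalized mixture spectrum x as a nonnegative
    combination of training spectra (columns of the two dictionaries),
    then keep the single strongest atom from each source's block."""
    D = np.hstack([A_speech, A_birds])   # concatenated overcomplete basis
    w, _ = nnls(D, x)                    # nonnegative weight per atom
    n = A_speech.shape[1]
    i = int(np.argmax(w[:n]))            # best-explaining speech atom
    j = int(np.argmax(w[n:]))            # best-explaining bird atom
    return (i, w[i] * A_speech[:, i]), (j, w[n + j] * A_birds[:, j])
```

A plain NNLS solve won't scale to seventy-five thousand atoms per source; a real implementation would need a sparsity-regularized solver, but the shape of the relaxation is the same: one weight per training spectrum, and the winners name the sources.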
Again, I won't go into it too much; what we use is an approximate solution, but it's a lot faster. Let me give an example of how this thing behaves when you have a mixture. We have the two original sounds at the top; these are the sounds I never get to observe. I'll just play one of them [plays sound]; it's TIMIT material, you'll have heard that before. Then I have a mixture of the two speakers [plays sound]. I also have a lot of training data of these two particular speakers. Now what I can do is run this huge search, our optimization, and approximate every spectrum of the mixture as a superposition of two spectra from those two speakers' data. If I do that, I can reconstruct the two sources, and cue up, for example, one of them [plays sound].

Now, I see a lot of familiar faces, so you're probably thinking: what the hell, there are papers that do this so much better, so what was the point of doing this? And this is going to be my fourth point. The whole point of this representation is that we don't necessarily want to separate sounds. I can't, for the life of me, think why anybody would want to separate sounds as an end in itself. You only want to separate sounds because you want to do speech recognition, or pitch detection, or something else afterwards; separation by itself is pretty useless. So the whole point of this representation is that we get a very nice semantic way of describing the mixture: by saying that we have these two clean spectra that come together to approximate it, we gain the ability to do a lot more smart processing, because those spectra carry semantic information.

Here's one quick example. Suppose we have a mixture of two instruments playing, the spectrogram of which you see here, and I have some training data of those two instruments in isolation. It's very easy for me to run a pitch tracker and pitch-tag all of the spectra in the training data, so every spectrum I have there is associated with a pitch value. By doing this kind of decomposition, I'm explaining every spectrum in the mixture as a superposition of two spectra from my training data, and each of those has a pitch label attached to it. That means that at that point I know exactly what the two pitches are that are sounding at that particular time in the mixture.

What's nice about this is how it scales. We did experiments starting from a solo instrument, so basically just doing a nearest-neighbor search, all the way to having five instruments, all wind instruments, playing at the same time. Here are the results in terms of error and standard deviation in Hertz: we go from an average error of about four Hertz for the solo case to forty-two Hertz when we have five instruments playing. That's not something we could do with a monophonic pitch-tracking algorithm; because we decompose mixtures in terms of things we already have labeled, we get this extra ability to attach labels to mixtures.
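Reading the labels off the decomposition is then mechanical. Continuing the hypothetical sketch from above, and assuming each dictionary column has a pitch label stored at the same index:

```python
def transcribe_mixture(X, A1, A2, labels1, labels2):
    """For every frame of the normalized mixture spectrogram X, pick one
    atom per source (explain_mixture_frame from the earlier sketch) and
    read off the pitch label attached to each chosen training spectrum."""
    pitches = []
    for t in range(X.shape[1]):
        (i, _), (j, _) = explain_mixture_frame(X[:, t], A1, A2)
        pitches.append((labels1[i], labels2[j]))  # two pitches at frame t
    return pitches
```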
Another example of that is phonetic tagging. If I have a mixture of two speakers, and I can associate it with spectra from clean recordings, I also get all the labels that come with those spectra: I know what phoneme corresponds to each spectrum, I can maybe do some emotion recognition and know what the emotional state is, I know who the speaker is, and I know what pitch that speaker was speaking at. And what happens is that we only see a very mild degradation when we try to analyze mixtures. In this case, just some simple numbers: for one speaker, if we just do a nearest-neighbor search and a little bit of smoothing, we get a phoneme error of forty-five percent. With two speakers, you go up to a phoneme error of fifty-four percent, which is a fairly mild increase, even though the problem is considerably harder. We get about eight to ten percent worse results every time we add a new speaker, so it degrades gracefully in the mixture case.

So, to wrap this up: this was just a simple geometric way to look at mixtures. The point I'm trying to make is that you really have to incorporate the idea that sounds mix into your model. You can't just say, "I'm going to have a clean model and hopefully people will figure out a way to deal with extra sources"; it was crucial that this model starts from the idea that things are going to be mixed together. The whole idea of recomposing from examples is based on an interesting concept that we see a lot in the computer-vision literature and in the text-mining literature, but not so much in the audio space: if you have lots and lots of recordings, you should be able to explain pretty much everything that comes in. My dream is that at some point our speech databases will be so big that a nearest-neighbor search on a sentence will just give you back the sentence, and you won't have to do all this processing. It's been done in text: if I search for a question on the web, somebody has already asked it. So it's only a matter of time before we do it with speech. And the other thing is that thinking in terms of separation is really the wrong viewpoint; there's always something else we have to do after separation, and if there's any message I can leave you with, it's that being able to analyze mixtures in some smart way, figuring out what's in the mixture, doesn't necessarily mean you have to extract that information out of it. And with that, thank you; I think we have time for some questions.

Q: One comment: in music, we actually do want to separate the sounds, because we're interested in remixing.

A: I was waiting for somebody to mention that. My answer is that if you want to remix, then you just want to mix the music; you don't want to extract the sources, because you're going to put them right back in the mix. But we can talk about remixing offline.

Q: The other thing I wanted to ask: you made this provocative statement that you can't use the Euclidean distance, and then two slides later you said you can use the L2 norm.

A: Oh, did I say that? If I did, it was a mistake.

Q: I could have sworn you did; you said you could use the L2 norm to do something to enforce sparsity.

A: Ah, yes, that.

Q: I was just curious what that meant.

A: So, whenever we talk about sparsity, what you usually want to do is optimize the L1 norm of a signal; that's what the compressive-sensing literature is about. But it turns out that when we're dealing with normalized spectra, they already sum to one, so there's no way to optimize the L1 norm: it's a nonnegative bunch of numbers that sum to one, so the L1 norm is always going to be one, and you're stuck if you try to optimize that. But because those numbers are between zero and one, the closer one of them is to zero, the smaller its square gets; so by maximizing the L2 norm of a normalized vector that sums to one, you're essentially enforcing sparsity, because you're saying you want all of the elements to be really, really close to zero and only one of them to be close to one. Does that make sense?
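A tiny check of that argument (my illustration, not from the talk): on the simplex the L1 norm is pinned at one, while the L2 norm peaks exactly at the sparse vertices.

```python
import numpy as np

flat = np.ones(3) / 3                 # maximally spread point on the simplex
peaked = np.array([1.0, 0.0, 0.0])    # a vertex: maximally sparse

print(np.linalg.norm(flat, 1), np.linalg.norm(peaked, 1))  # 1.0 and 1.0
print(np.linalg.norm(flat, 2), np.linalg.norm(peaked, 2))  # ~0.577 vs 1.0
```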
Q: The two-speaker example that you played was artificially made by adding two signals. I'm curious how this actually works with real signals, where you have reverberation that smears the speakers in time.

A: If you have a long reverberation, it's actually not a big deal, because the echoes of the reverberation are spectra that correspond to previous utterances of the speaker, and those are easily incorporated in the model, so it's fairly resistant to that. The place where you get into trouble is when you have very strange phase effects in the reverberation that actually change the spectrum. So if you're, say, in a bathroom, and you have these very ugly, very short echoes, that creates resonances in your spectrum, and that's going to bias your testing data to be very different from the training data, and the fit won't be as good. But as long as the spectral character stays the same, reverberation is not a big deal.

Q: Thank you for the talk. My question is about the data needed to represent a source: it seems we need a lot of it. But when you do the search, the audio data lies on a kind of manifold, so maybe we don't need that much data, just the manifold itself. Do you think a different representation could cut down how much training data you need?

A: You need an adequate amount of data in order to properly represent the manifold that every source is lying on, and that's the only requirement. It becomes a very similar problem to supervised learning: if your data is dense enough to represent your input, you're fine. That could mean you only need about a thousand data points, or it could mean you need five million. If you're dealing with simple sources without much variation, you can get away with a little data; if you want to model everybody's voice, at all different pitches and all different phonemes in all sorts of languages, then obviously you need a pretty good representation of all the possible cases, and that makes for a big database.

Q: Paris, my question, or comment I guess, is about what you're doing overall. In this session we've seen a lot of modeling, and different ways of modeling sound, and I was wondering what your philosophy is; maybe you went over it a bit quickly. You've got a mixture and you take examples, or many examples, of its sources, so it's as if you know the components of the mixture already, and then you figure out how they're put together. So what happens if you have a more complicated situation, where maybe you don't know what the components are, or the database size grows? You were showing earlier that the search can grow very big, even though you've done some clever stuff to make it smaller. What is your overall philosophy?
A: The overall dream is that eventually you have a database that contains pretty much every sound ever played. You do a search over that, and chances are that whatever you're analyzing has been played before, and you find it fast. That's what we're driving at. One case I can use as a small defense point is the quintet example here: even when I had a solo recording, my training data was basically recordings of five instruments, and I had to pick the one spectrum that was closest; in the duet case I again had all five instruments in the database and had to bet on two of them. So yes, I did know which instruments were in there, and they were part of the database, but I think of that more as a logistics problem: I had to get some data somehow. Ultimately you just want to have this humongous database of everything, and pick through that.