Speech Transcript - AN APPROACH TO SEQUENTIAL GROUPING IN COCHANNEL SPEECH

All right, so we now change gears a little bit. In the last presentation, the work was really about separating speech from nonspeech interference. This paper deals with the case where the interference is itself speech, so obviously it is an entirely different ball game, with different techniques. This is joint work with my student Ke Hu.

Here is the outline. I will first briefly describe the background; the main idea is to perform unsupervised sequential grouping. In this work I will first deal with sequential grouping of voiced speech, then deal with sequential grouping of unvoiced speech, and then present some evaluation results.

So this is a standard explanation of what the sequential organization problem is. This problem has been extensively discussed in the literature of auditory scene analysis as well as computational auditory scene analysis. You can imagine that we have a mixture here, and this is basically a time-frequency representation of the mixture. Typically this mixture is first processed through a segmentation stage. This stage gives us a set of time-frequency segments, and each segment is a contiguous region in the time-frequency plane. The next stage is simultaneous grouping. This stage tries to group time-frequency segments across frequency, or you can think of it as vertical grouping, because the vertical axis is frequency. Through this process we end up with a set of what we call simultaneous streams. Simultaneous streams here are illustrated by different colors. Note that each region of the same color does not have to correspond to a contiguous region
because now segments have been grouped across frequency.

Sequential grouping, which brings us to the main part, is what this paper deals with. Sequential grouping takes simultaneous streams as the input, and through a process of grouping we would like to come up with segregated speech streams. In the case of cochannel speech, each stream should correspond to a single speaker. So that is the overall picture.

In this work we are not dealing with simultaneous grouping; we assume that simultaneous grouping has already been dealt with. We just use a recent approach that we have developed, called the tandem algorithm. It actually does two things simultaneously, in tandem: one is pitch tracking, and the second is using the detected pitch to do simultaneous grouping. It is designed to generate voiced speech streams.

With the simultaneous streams generated this way, sequential organization faces a number of issues. One is that simultaneous streams consist of what we may call spectrally incomplete frames, because over a particular time frame a subset of time-frequency units belongs to one source and the remaining units belong to the other source. So we do not have a whole frame. [unclear] Simultaneous streams are also rather short in duration, which precludes the use of speaker identification. Otherwise a standard technique, if you know about the speakers in cochannel speech, is to use speaker identity to group simultaneous streams. But because the streams are too short, speaker identification does not do much. When it comes to unvoiced speech, things are even more difficult, because it does not have a harmonic structure and
is weak in energy compared to the rest of the speech.

Because of these challenges, previous work has used model-based methods. With model-based methods you have a lot of leverage, as you can use pretrained speaker models, but this also comes with a cost. The cost is that when you face a mixture, you first have to assume that the underlying speakers in the mixture are speakers for which you already have the corresponding trained speaker models. The second cost, which is not a trivial task, is that you actually have to do speaker identification on this cochannel speech, which is itself a challenging problem.

So the idea is that we would like to address this from an unsupervised perspective: we want to apply unsupervised clustering. This is an illustration of what we would like to accomplish. We already have simultaneous streams, and we project them into a relatively new feature space called GFCC, which stands for gammatone frequency cepstral coefficients. These are recently developed features that have been shown to be effective for speaker identification. We use GFCCs as the feature space, and we would like these simultaneous streams to be somehow naturally clustered into two groups. If it turns out that one group corresponds to one speaker and the other group corresponds to the second speaker, then the problem is solved, and it is solved without training speaker models. So that is the whole idea.

To apply clustering we have to define an objective function. What should be an appropriate objective function in cochannel speech? I guess that is the question. As we said, we assume we already have simultaneous streams. Take a grouping hypothesis, and call it G. Here,
because we have a set of simultaneous streams, G is actually a binary vector, with one corresponding to one speaker and zero corresponding to the other speaker. With this hypothesis we can measure two things. One is called the within-group scatter matrix, S_W here. As you probably know, a scatter matrix is basically just a summation of outer products of differences between feature vectors and the mean vector; that is what it is. And since we have a hypothesis here, we can also measure the so-called between-group scatter matrix; this is S_B. With these two measures defined, a widely used technique in clustering is to use the ratio of between-group to within-group scatter to measure the grouping hypothesis. Since these are matrices, you invert one of them and take the trace of the product; that is a pretty widely used measure of the ratio.

Now, because we are dealing with cochannel speech, there is a nice constraint that we can apply, and we incorporate it as a penalty term. Simultaneous streams with overlapping pitch contours should not be assigned to the same speaker, because one speaker cannot generate two pitch points at the same time. I think there should not be any controversy here. We incorporate this as a penalty term rather than a hard rule because the pitch tracker itself is not perfect: we cannot simply say that no overlapping pitch is allowed at all, because the tracker actually makes mistakes. So we just use a continuous function, basically a sigmoid function. The sigmoid function is parameterized by the number of frames where you have overlapping pitch contours within the same
speaker, or within the same group.

Putting these two together, we define what we call a constrained objective function. The first term is basically the group difference, and the second term is the penalty, which we subtract. Lambda is a constant: you have two terms, so you have to pick a trade-off between them, and this is a system parameter.

Given the objective function, the task becomes this: we have a set of simultaneous streams, and we need a single binary vector corresponding to the sequential grouping. We approach the problem by searching over possible groupings and finding the one that gives the highest score of the objective function. Of course this is an optimization problem, and there are many methods that can be applied to solve it. Because we are dealing with binary vectors, one natural choice is to use a genetic algorithm. We basically treat each binary vector, which is a single sequential grouping, as a chromosome in the genetic algorithm, and the objective function corresponds to the fitness score. It actually runs pretty fast, even though GA has a reputation for being slow. [unclear] The hypothesis with the highest fitness in the last generation is taken as the solution.

So now we have basically given a solution to sequential organization of voiced speech. What about unvoiced speech? Unvoiced speech poses a particularly difficult problem. Unvoiced speech corresponds to some of the consonants. We have done a statistical analysis of how much of the sound in spoken English corresponds to unvoiced speech. [unclear] Basically there are about eight
unvoiced phonemes: three unvoiced stops, four unvoiced fricatives, and one unvoiced affricate. Unvoiced speech is particularly difficult to deal with here, because you do not have the option of extracting speech features and hoping that the noise does not satisfy them: speech is what you are dealing with in cochannel speech. And unvoiced speech, as we know, taken alone is really just noise. If you listen to it out of its proper context, it is just noise.

All right, so how do we do unvoiced speech sequential organization? We apply our auditory segmentation, which was described in previous work and is based on onset and offset analysis. It gives us segments of both voiced speech and unvoiced speech. Since we have already segregated voiced speech by this time, the portions of segments that overlap with segregated voiced speech can be removed, and the remaining ones correspond to unvoiced segments. So this is a bunch of regions in this figure, where each unique color corresponds to a unique segment. The task of sequential grouping again becomes assigning binary labels to each one of these colored regions.

How do we do that? The key idea is simple: we leverage what has already been done. We now have two streams that correspond to voiced speech, and we group unvoiced speech by using complementary masks of the voiced speech that has already been segregated. Say this is the segregated stream for one speaker, speaker A, where white corresponds to one and black to zero. Flip the ones and zeros, so that white becomes black and black becomes white, and the complementary mask would correspond to the other speaker. Because this is a two-speaker segregation problem, the complement really just
corresponds to the other speaker.

There are complications with this approach, and we actually have to deal with some of the issues caused by errors made in voiced speech segregation. Basically, once we get the complementary masks, we remove the time frames that contain no pitch or two pitch points. Two pitch points obviously correspond to two voiced streams, so there should not be any unvoiced speech there, and those frames can be safely removed. If a frame has no pitch, both speakers may be unvoiced in that time frame; I will come back to that in the next slide. So we remove all of those frames, and what is left is what we call the complementary masks. Take the complementary masks; then you have a single unvoiced segment, and that unvoiced segment is compared with both masks. We see which one has more overlap with the segment, and the segment should be assigned to that speaker. So that is the basic idea.

As you can see, the way we do this is very simple, and it has a clear limitation, which is that it cannot deal with unvoiced-unvoiced portions of the mixture. Our evaluation corpus is the speech separation challenge corpus, or SSC. We did an analysis on that corpus, and such frames account for about ten percent of unvoiced speech. So this method is applicable to about ninety percent of unvoiced speech, but the remaining ten percent we cannot deal with. This is one limitation of what we can do.

All right, so I do not have a lot of time; let me go through the evaluation results pretty quickly. We evaluated on one hundred 0 dB cochannel mixtures from
the SSC corpus. The evaluation measure is the signal-to-noise ratio, where we use the ideal binary mask as the ground truth. We compare with a model-based method. [unclear]

Here let us just look at the results with estimated simultaneous streams from the tandem algorithm. Obviously it works better if the two talkers are of different gender, or DG. In the DG case the unsupervised approach gives you a 6.8 dB SNR improvement, because the original signal is at 0 dB, and in the same-gender case the improvement is about 3.7 dB. Overall the improvement is 5.2 dB, which is 1.7 dB away from the ideal case, where you do not make any mistake in sequential grouping.

All right, so for the unvoiced speech segregation evaluation, here we only evaluate over the unvoiced intervals, and we measure the signal-to-noise ratio gain. In this case again we just look at the overall results with estimated simultaneous streams plus voiced segregation. The method gives you a pretty decent improvement for both the model-based and the unsupervised approach.

For the overall results, you put them together, and this is actually what we wish to look at: both voiced and unvoiced segregation, with sequential grouping of both voiced speech and unvoiced speech. This gives you a 5.7 dB improvement. [unclear]

And this is an illustration here. This is one mixture [unclear], and this is the result. So there
is a mixture of a male and a female speaker. [unclear]

Okay, to conclude: we have proposed a novel unsupervised approach to sequential organization of cochannel speech. Sequential grouping of voiced speech is carried out through clustering, and unvoiced speech is grouped through the use of complementary masks. Our evaluation shows that the unsupervised method performs even a little better than a comparable model-based method. Thanks very much.

Do we have a short question or comment?

I would like to know how much training is needed before you can start the separation.

It is unsupervised, so there is not much. The simultaneous streams are generated by the tandem algorithm, which has a little bit of training, but really not a whole lot. Once you have these simultaneous streams, this is a purely unsupervised method.

But you have to find the averages of the groups and the total. Would there be some preprocessing before that? [unclear]

Yes, so the procedure is that you take a mixture, and the tandem algorithm generates a set of simultaneous streams. For each simultaneous stream you then basically extract the GFCC features, and with those features you just run the clustering.

So it is an unsupervised approach. A question: have you considered reverberation, as in a meeting? [unclear]

We have not considered that; we have not worked on it. Cochannel speech alone is hard enough, so we would like to solve the problem for cochannel speech without reverberation first, and then look at reverberation.

Do you expect it to work in the reverberant case? [unclear]

One thing I would like to
comment on is that reverberation actually is a big problem for localization-based methods, compared to monaural methods, because pitch-based grouping is very robust, and onset and offset analysis is also pretty robust to reverberation. Reverberation creates something like fake talkers from all source locations. [unclear] That is basically my take.

Okay, thank you.
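
The constrained objective described in the talk (the trace of the inverse within-group scatter times the between-group scatter, minus lambda times a sigmoid penalty on overlapping pitch frames) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the sigmoid's `slope` and `offset` parameters, the toy 2-D features standing in for GFCCs, and the pitch frame sets are all hypothetical. It also swaps the genetic algorithm mentioned in the talk for an exhaustive search, which is only practical for a handful of streams.

```python
import itertools
import numpy as np

def scatter_ratio(features, g):
    """Return trace(pinv(S_W) @ S_B) for streams split into two groups by g."""
    all_x = np.vstack(features)
    mean_all = all_x.mean(axis=0)
    d = all_x.shape[1]
    s_w = np.zeros((d, d))  # within-group scatter
    s_b = np.zeros((d, d))  # between-group scatter
    for k in (0, 1):
        x = np.vstack([f for f, label in zip(features, g) if label == k])
        mean_k = x.mean(axis=0)
        centered = x - mean_k
        s_w += centered.T @ centered
        diff = (mean_k - mean_all)[:, None]
        s_b += len(x) * (diff @ diff.T)
    return float(np.trace(np.linalg.pinv(s_w) @ s_b))

def overlap_frames(pitch_frames, g):
    """Count pitched frames shared by streams placed in the same group."""
    count = 0
    for k in (0, 1):
        seen = set()
        for frames, label in zip(pitch_frames, g):
            if label == k:
                count += len(seen & frames)
                seen |= frames
    return count

def objective(features, pitch_frames, g, lam=1.0, slope=1.0, offset=3.0):
    """Group difference minus lam times a sigmoid of the overlap count.

    slope and offset are assumed shape parameters, not values from the talk.
    """
    n = overlap_frames(pitch_frames, g)
    penalty = 1.0 / (1.0 + np.exp(-slope * (n - offset)))
    return scatter_ratio(features, g) - lam * penalty

def best_grouping(features, pitch_frames, **kwargs):
    """Exhaustive search over binary labelings (stand-in for the GA)."""
    best, best_score = None, -np.inf
    # Fix the first stream's label to 0: swapping the labels is equivalent.
    for rest in itertools.product((0, 1), repeat=len(features) - 1):
        g = (0,) + rest
        if sum(g) == 0:  # skip the labeling that leaves one group empty
            continue
        score = objective(features, pitch_frames, g, **kwargs)
        if score > best_score:
            best, best_score = g, score
    return best, best_score

# Toy data: four streams, two per hypothetical speaker, 20 frames each.
rng = np.random.default_rng(0)
features = [rng.normal(0.0, 0.5, (20, 2)), rng.normal(0.0, 0.5, (20, 2)),
            rng.normal(5.0, 0.5, (20, 2)), rng.normal(5.0, 0.5, (20, 2))]
pitch_frames = [set(range(0, 20)), set(range(30, 50)),
                set(range(10, 30)), set(range(50, 70))]
best, score = best_grouping(features, pitch_frames)
print(best, round(score, 2))
```

In this toy setup the first and third streams overlap in time, so placing them in the same group incurs the pitch penalty, while the scatter ratio rewards grouping streams whose features cluster together.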

Speech Enhancement

Presented by: DeLiang Wang, Author(s): Ke Hu, DeLiang Wang, The Ohio State University, United States