0:00:13 Alright, so we now change gears a bit. In the last presentation, the work addressed separating speech from non-speech interference. This paper deals with co-channel speech, where the interference is itself speech, so obviously it is a different ball game and calls for different techniques. That is what this presentation is about, and this is joint work with my colleague.
0:00:51 So here is the outline. I will first briefly describe the background. The main idea of this work is to perform unsupervised sequential grouping. I will first deal with the grouping of voiced speech, then deal with the segregation of unvoiced speech, and then present some evaluation results.
0:01:15 This is a fairly standard explanation of what the sequential organization problem is. The problem has been extensively discussed in the literature on auditory scene analysis as well as computational auditory scene analysis. Imagine that we have a mixture; this figure is basically a time-frequency representation of that mixture.
0:01:43 Typically, the mixture is first processed through a segmentation stage. This stage gives us a set of time-frequency segments, and each segment is a contiguous region in the time-frequency plane. The next stage is simultaneous grouping, which tries to group the segments across frequency; you can think of it as vertical grouping, because the vertical axis is frequency. Through this process we end up with a set of what we call simultaneous streams. Each stream here is illustrated by a different color. Note that each region of one color now corresponds to a contiguous region, because the segments have been grouped across frequency.
0:02:42 Sequential grouping is the main part, and it is what this paper deals with. What sequential grouping does is take the simultaneous streams as the input and, through a process of grouping across time, come up with segregated speech streams, so that in the case of co-channel speech each stream corresponds to a single speaker. That is the task.
0:03:06 In this work we do not deal with simultaneous grouping; we assume simultaneous grouping has already been taken care of. We simply use a recent approach we have developed, called the tandem algorithm. The tandem algorithm actually does two things at once, in tandem: one is pitch tracking, and the other is using the detected pitch to do simultaneous grouping, in order to generate the simultaneous streams of voiced speech.
0:03:36 With the simultaneous streams generated, sequential organization faces a number of issues. The first is that the simultaneous streams consist of incomplete spectral content in many frames: within a particular time frame, a subset of the frequency channels may belong to one source while the remaining channels belong to the other, so we do not have whole frames to work with. In addition, simultaneous streams are often short in duration, which is a real obstacle to the use of speaker identification. Otherwise, a standard technique would be to train models for the speakers, since for co-channel speech one could hope to use speaker identity to group the simultaneous streams. But because the streams are too short, the accuracy of speaker identification basically drops quite a bit.
0:04:32 When it comes to unvoiced speech, it is even trickier, because unvoiced speech does not have harmonic structure and is weak in energy compared to the rest of the speech signal. Because of these challenges, existing work tends to use model-based methods. With model-based methods you have a lot of leverage, because you can pre-train speaker models, but that also comes with a cost. The cost is twofold. First, when facing a mixture, you have to assume that the underlying speakers in the mixture are known, in other words that the speakers have already been enrolled and the corresponding speaker models trained. Second, and this is not a trivial task, you actually have to perform speaker identification on the co-channel speech, which is itself a challenging problem.
0:05:24 So the idea is that we would like to address this from an unsupervised perspective; we want to apply unsupervised clustering. The situation is that we already have the simultaneous streams, and we propose to do the clustering in a relatively new feature space called GFCC, which stands for gammatone frequency cepstral coefficients. These are auditory-based features that have been shown to be effective for speaker identification, and we perform the clustering using the GFCCs as the feature space.
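As an aside, here is a minimal sketch of how GFCC-style features might be computed from a precomputed cochleagram (gammatone filterbank energies). The cubic-root compression and the number of retained coefficients are assumptions for illustration, not necessarily the exact recipe used in this work.

```python
import numpy as np
from scipy.fftpack import dct

def gfcc_from_cochleagram(cochleagram, n_coeff=23):
    """GFCC-style features from a cochleagram.

    cochleagram: (n_frames, n_channels) gammatone filterbank energies,
                 assumed to be precomputed elsewhere.
    Returns:     (n_frames, n_coeff) cepstral features.
    """
    # Cubic-root compression of channel energies (an assumed, common choice).
    compressed = np.cbrt(np.maximum(cochleagram, 1e-12))
    # DCT across the frequency channels, analogous to the final MFCC step.
    cepstra = dct(compressed, type=2, norm="ortho", axis=1)
    # Keep the lower-order coefficients as the per-frame feature vector.
    return cepstra[:, :n_coeff]
```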
0:06:03 We would like the simultaneous streams to somehow gather naturally into two groups. If that works out, with one group corresponding to one speaker and the other group corresponding to the other speaker, then the problem is solved, and it is solved without training any speaker model. That is the whole idea. To apply unsupervised clustering we have to define an objective function, so the question becomes: what would be a good objective function for co-channel speech?
0:06:36 As we said, we assume we already have the simultaneous streams. Let us denote a grouping hypothesis by g. Because we have a set of simultaneous streams, g is simply a binary vector, where a one corresponds to one speaker and a zero corresponds to the other speaker. With a hypothesis g, we can measure two things. One is called the within-group scatter matrix, S_W. As the name suggests, the within-group scatter matrix is basically a sum of outer products of the differences between the feature vectors and their group mean vector. Since we have the hypothesis g, we can also measure its counterpart, the between-group scatter matrix, S_B. With these two measures defined, a standard technique in clustering, as in linear discriminant analysis, is to use the ratio of the between-group and within-group scatter to measure group separation. That is what we do: we take the two matrices, multiply one by the inverse of the other, and take the trace of the product, which provides a scalar measure of the ratio.
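To make the two measures concrete, here is a minimal sketch, under the assumption that each simultaneous stream is summarized by a single feature vector, of the within-group and between-group scatter matrices and the trace-ratio criterion just described.

```python
import numpy as np

def fisher_ratio(features, labels):
    """Trace-ratio criterion tr(S_W^{-1} S_B) for a binary grouping hypothesis.

    features: array-like of shape (n_streams, d), one feature vector per
              simultaneous stream (e.g. the stream's mean GFCC vector).
    labels:   binary vector of length n_streams (0/1 = hypothesized speaker).
    """
    X = np.asarray(features, dtype=float)
    y = np.asarray(labels)
    mu = X.mean(axis=0)                      # global mean
    d = X.shape[1]
    S_w = np.zeros((d, d))                   # within-group scatter
    S_b = np.zeros((d, d))                   # between-group scatter
    for g in (0, 1):
        Xg = X[y == g]
        if len(Xg) == 0:
            continue
        mu_g = Xg.mean(axis=0)
        diff = Xg - mu_g
        S_w += diff.T @ diff
        m = (mu_g - mu)[:, None]
        S_b += len(Xg) * (m @ m.T)
    # Larger values mean compact, well-separated groups.
    return float(np.trace(np.linalg.pinv(S_w) @ S_b))
```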
0:07:57 Now, because we are dealing with co-channel speech, there is a nice constraint we can apply, and we decided to apply it as a penalty term: two simultaneous streams with overlapping pitch contours should not be assigned to the same speaker, because a speaker obviously cannot generate two pitch points at the same time. I do not think there should be any controversy there. The reason we cast it as a penalty rather than a hard constraint is that the pitch tracking itself makes mistakes; if the pitch were error-free, we could simply impose the rule that no overlapping pitch is allowed within a group. Because pitch detection has errors, we instead use a continuous function, basically a sigmoid, whose argument is parameterized by the number of frames that have overlapping pitch contours within the same group. Adding this, we can define a constrained objective function, which is basically the group separation, the first term, minus the penalty, the second term. The lambda there is the standard trade-off parameter between the two terms, which has to be chosen beforehand.
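A minimal sketch of how the pitch-overlap penalty and the constrained objective could be put together; the sigmoid slope alpha, the trade-off lambda, and the exact form of the penalty are assumptions (fisher_ratio is the sketch given earlier).

```python
import numpy as np

def pitch_overlap_penalty(pitch_tracks, labels, alpha=1.0):
    """Sigmoid penalty on frames where two streams assigned to the SAME
    speaker both carry a pitch point (one speaker cannot produce two pitches).

    pitch_tracks: list of boolean arrays, one per stream, True where pitched;
                  all of the same length (number of frames in the mixture).
    labels:       binary grouping vector, one entry per stream.
    alpha:        assumed sigmoid slope (a free parameter).
    """
    labels = np.asarray(labels)
    n_conflict = 0
    for g in (0, 1):
        tracks = [p for p, lab in zip(pitch_tracks, labels) if lab == g]
        if len(tracks) < 2:
            continue
        pitched = np.sum(np.stack(tracks), axis=0)   # pitched streams per frame
        n_conflict += int(np.sum(pitched >= 2))      # frames with a conflict
    # Continuous (rather than hard) penalty, to tolerate pitch-tracking errors.
    return 1.0 / (1.0 + np.exp(-alpha * n_conflict))

def constrained_objective(features, pitch_tracks, labels, lam=1.0):
    # Group separation (fisher_ratio above) minus the weighted pitch penalty.
    return fisher_ratio(features, labels) - lam * pitch_overlap_penalty(pitch_tracks, labels)
```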
0:09:19 Now, given the objective function, the task becomes: we have a set of simultaneous streams, and we need to come up with a single binary vector corresponding to the optimal grouping. We attack the problem by searching over possible groupings and figuring out the one that gives the highest score of the objective function. Posed this way, this is of course an optimization problem, and many algorithms could be applied to solve it. Because what we deal with are binary vectors, one natural choice is a genetic algorithm, and that is basically what we adopt: we treat each binary vector, which encodes one sequential grouping, as a chromosome in the genetic algorithm, and the objective function value corresponds to the fitness score. You will see that it actually converges pretty fast, even though genetic algorithms have a reputation of being computationally expensive. The chromosome with the highest fitness in the last generation is taken as the solution. So at this point we basically have a solution to the sequential organization of voiced speech.
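For a handful of streams the search can even be done exhaustively; the sketch below does that with the constrained_objective sketch above, whereas the talk uses a genetic algorithm to keep the search efficient for larger problems.

```python
import itertools
import numpy as np

def best_grouping(features, pitch_tracks, lam=1.0):
    """Search over all binary grouping vectors and keep the one that maximizes
    the constrained objective. Exhaustive search is shown for clarity; a
    genetic algorithm would explore the same space more efficiently."""
    n = len(features)
    best_score, best_labels = -np.inf, None
    for bits in itertools.product((0, 1), repeat=n):
        labels = np.asarray(bits)
        score = constrained_objective(features, pitch_tracks, labels, lam)
        if score > best_score:
            best_score, best_labels = score, labels
    return best_labels, best_score
```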
0:10:35 So what about unvoiced speech? Unvoiced speech poses a particular challenge. Unvoiced speech corresponds to certain consonants; if you look at the phoneme inventory of spoken English, roughly eight phonemes are unvoiced, namely three unvoiced stops, four unvoiced fricatives, and one unvoiced affricate. Unvoiced speech is a challenge even in the case of non-speech interference, because it has no harmonic structure; there, at least, you can try to extract speech characteristics and hope that the noise does not share them. But here we are dealing with co-channel speech, where the interference is also speech. What is worse, unvoiced speech, taken in isolation, really just sounds like noise: if you listen to a fricative out of its phonetic context, it is just noise.
0:11:38 Alright, so how do we do unvoiced speech sequential organization? We apply onset/offset-based segmentation, which we published previously. Onset/offset analysis produces a set of segments that cover both voiced speech and unvoiced speech. Since we have already segregated voiced speech by this time, the portions of the segments that overlap with the segregated voiced speech can be removed, and the remaining ones correspond mainly to unvoiced segments. So now we have a bunch of time-frequency regions, shown in this picture, where each colored region corresponds to one unvoiced segment. The task of sequential grouping again becomes assigning binary labels, one to each of these colored regions.
0:12:30 How do we do that? The key idea is very simple: we leverage what has already been accomplished, namely that we now have two streams corresponding to the voiced speech of the two speakers, and we deal with the unvoiced speech by using the complementary masks of that already segregated voiced speech. So here is the binary mask of the voiced stream for one speaker, say speaker one, where the ones correspond to that speaker. If we flip the ones and the zeros, we obtain the complementary mask, which, having dropped the first speaker, corresponds to the other speaker plus whatever has not yet been segregated. Since the voiced speech of both speakers has already been segregated, we actually have two such complementary masks, and we do have to deal with some uncertainty in them because of the errors made by voiced speech segregation. Once we obtain a complementary mask, we remove the time frames that contain either two pitch points or no pitch. Frames with two pitch points obviously correspond to the already segregated voiced streams; there should not be any unvoiced speech there, so they can be safely removed. If a frame has no pitch at all, there could be unvoiced speech from either talker in that frame; I will come back to that case in the next slide. So basically, after removing those frames, what is left is a trimmed binary mask corresponding to each complementary mask. Then, for every unvoiced segment, we see which complementary mask it overlaps with more, and the segment is assigned to the speaker indicated by that mask. That is the basic idea.
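A minimal sketch of the complementary-mask labeling just described; the per-frame voting and the tie-breaking are my reading of the idea, not necessarily the exact rule used in this work.

```python
import numpy as np

def label_unvoiced_segment(segment_mask, voiced_mask_1, voiced_mask_2):
    """Assign an unvoiced T-F segment to one of two speakers using the
    complementary masks of the already-segregated voiced streams.

    All masks are boolean arrays of shape (n_channels, n_frames).
    """
    pitched_1 = voiced_mask_1.any(axis=0)      # frames where speaker 1 is voiced
    pitched_2 = voiced_mask_2.any(axis=0)      # frames where speaker 2 is voiced
    one_pitch = pitched_1 ^ pitched_2          # keep frames with exactly one pitch

    # In a one-pitch frame, segment units falling in the pitched speaker's
    # complementary mask are attributed to the other speaker.
    votes_2 = np.sum(segment_mask & ~voiced_mask_1 & (one_pitch & pitched_1)[None, :])
    votes_1 = np.sum(segment_mask & ~voiced_mask_2 & (one_pitch & pitched_2)[None, :])
    return 1 if votes_1 >= votes_2 else 2
```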
0:14:28 Am I alright on time, Peter? Okay, so I will speed up a little.
0:14:35 As you can see, the way we deal with unvoiced speech is very simple, and it has a clear limitation, which is that it cannot deal with the unvoiced-unvoiced portions of the mixture, that is, frames where both talkers are unvoiced. Our evaluation corpus is the speech separation challenge (SSC) corpus, and we ran an analysis on it: such frames account for about ten percent of the unvoiced speech frames. So this method is applicable to roughly ninety percent of the unvoiced speech; the remaining ten percent or so we cannot deal with, and that is one limitation of the current approach.
0:15:14 Alright, I do not have a lot of time, so let me go through the evaluation results pretty quickly. We evaluated on 0 dB co-channel mixtures from the speech separation challenge corpus. The evaluation measure is the signal-to-noise ratio, where the ground-truth signal is the speech resynthesized from the ideal binary mask.
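For reference, a minimal sketch of this kind of SNR evaluation, assuming the reference signal is speech resynthesized from the ideal binary mask (the resynthesis itself is not shown).

```python
import numpy as np

def snr_db(reference, estimate):
    """SNR of an estimate against a reference signal, in dB."""
    noise = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

def snr_gain(reference, segregated, mixture):
    """Improvement of the segregated signal over the unprocessed mixture,
    both measured against the same reference (here assumed to be speech
    resynthesized from the ideal binary mask)."""
    return snr_db(reference, segregated) - snr_db(reference, mixture)
```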
0:15:35 We compare against a model-based method, which, as far as we know, is the best-performing model-based approach reported on this corpus. Here I will first give the voiced speech results, looking at the results obtained with the estimated simultaneous streams from the tandem algorithm.
0:16:03 These are the results when the two talkers are of different gender (DG) and of the same gender. In the different-gender case, the unsupervised approach gives a 6.8 dB SNR improvement over the original mixture, which is at 0 dB. In the same-gender case, the improvement is about 3.7 dB, whereas with ideal sequential grouping it would be 5.2 dB, so we are still some distance from the ideal result, in which the grouping makes no mistakes.
0:16:36 Alright, now the unvoiced speech segregation evaluation. Here we evaluate only over the unvoiced intervals and measure the SNR gain. Again, looking at the results obtained with the estimated simultaneous streams plus the voiced segregation, the method gives a pretty decent improvement compared to the model-based approach.
0:17:10 Here are the overall results with everything put together, and this is really what we want to look at: the overall segregation, that is, sequential organization of the voiced speech combined with segregation of the unvoiced speech. This gives a 5.7 dB improvement, which is a bit better than what the model-based method scores, so it is a very nice result.
0:17:35 Let me also play a demo here. This is one mixture, in English, of a male and a female talker. [plays the mixture] This is the segregated female talker. [plays] And this is the segregated male talker, compared with the ideal binary mask result. [plays]
0:18:11 Okay, to conclude: we have proposed a novel unsupervised approach to the sequential organization of co-channel speech. The sequential organization of voiced speech is carried out through clustering, and unvoiced speech is handled through the use of complementary masks. Our evaluation results show that this unsupervised approach performs even a little better than a competing model-based approach. Thank you very much.
0:18:40 [Session chair] Do we have a short question or comment?

0:18:45 [Audience] I would like to know how much training is needed before you can start the separation.

0:18:52 [Presenter] There is no training; that is the whole point. Once you have the simultaneous streams, we do not train any speaker model, so there really is no training at all.

0:19:15 [Audience] But you have to find the averages of the groups and the total scatter, as before. Wouldn't that require some pre-processing, some speaker-wise averages?

0:19:20 [Presenter] The way it works is that you are given a mixture, and from it you generate a set of simultaneous streams. For each simultaneous stream you extract the GFCC features, and with those features you simply assign the stream to one of the two groups.
0:19:53 [Session chair] Another question?

0:19:57 [Audience] Have you considered the meeting scenario, with reverberation?

0:20:02 [Presenter] We have not considered that; we are not working on reverberant co-channel speech at the moment. The problem we address first is co-channel speech without reverberation; dealing with reverberation would come after that.

0:20:29 [Audience] Do you expect it to still work in the reverberant case?

0:20:36 [Presenter] We would expect it to work to some extent, though with some degradation.
0:20:46 [Presenter] One more comment I would like to make: reverberation is actually a big problem for localization-based approaches, but less so for an approach like this one, because pitch-based grouping is very robust to reverberation, and onset/offset-based analysis is also pretty robust. For localization-based methods, reverberation creates reflections, so the sound effectively arrives from many source locations, and that is a basic limitation for them.

0:21:23 [Session chair] Okay, thank you. Thank you.