0:00:13 This paper is on model-based compressive sensing for multiparty speech recognition. This is joint work with Hervé Bourlard and Volkan Cevher.
0:00:22 What we focus on is the problem of competing speech sources, which is common and one of the most challenging scenarios in many speech applications. The goal of our work is basically to perform speech separation prior to recognition.
0:00:40 The scenario that we are considering is the case where the number of microphones is less than the number of sources, so it is actually an underdetermined speech separation problem: the number of measurements is even less than the number of unknown sources.
0:00:58 Sparse component analysis is one of the most promising approaches to deal with this problem.
0:01:05 The idea is that we cast the underdetermined speech separation problem as sparse recovery, where we leverage compressive sensing theory to solve it. In other words, we propose to integrate sparse component analysis into the front-end processing of speech recognition systems to make them robust to overlapping speech.
0:01:32 I will start with a very brief introduction to compressive sensing to help put the work into context, and then I will explain the details of our method, which is blind source separation via model-based compressive sensing. Then I will present the experimental setup and the speech recognition results, and the conclusions.
0:01:53 First of all, compressive sensing, in a nutshell, is sensing via dimensionality reduction. The idea is that a sparse signal x may be very high dimensional, but the dimensionality of the signal is somewhat misleading: its true information content lies in only very few of its components. The information content of a sparse signal like this can be captured by a dimensionality-reducing measurement, which we denote here by y = Φx, that is, with very few measurements. So in cases where this kind of dimensionality reduction happens naturally, we can leverage compressive sensing theory to recover the signal.
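In the standard compressive sensing notation (generic dimensions, not the specific ones used later in the talk), the measurement model being described is

    y = \Phi x, \qquad \Phi \in \mathbb{C}^{M \times N}, \quad M \ll N, \quad \|x\|_0 \le k,

so an M-dimensional observation y captures an N-dimensional signal x that has only k non-zero entries.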
0:02:55 Compressive sensing theory relies on three ingredients. The first is a sparse representation: we have to come up with a representation of the signal which is sparse, meaning that very few of the coefficients keep most of the energy of the signal. From a geometric perspective, if the signal lives in R^N, most of that space is in fact empty, and the sparse signals live only on the hyperplanes aligned with the coordinate axes, so the information content of a signal like this can be captured with very few measurements. The second ingredient is an incoherent measurement, for example a random projection of the sparse vector, meaning that the information carried by the sparse vectors, their pairwise distances, is preserved in the observation domain. With these two key ingredients, a sparse representation and incoherent measurements, compressive sensing guarantees recovery of the unknown signal directly from the observations, by searching for the sparsest solution which matches the observations; the third ingredient, the recovery algorithm, I will come back to.
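A compact way to state the incoherence requirement is the standard restricted-isometry form (this exact inequality is not quoted on the slides):

    (1-\delta)\,\|x_1 - x_2\|_2^2 \;\le\; \|\Phi x_1 - \Phi x_2\|_2^2 \;\le\; (1+\delta)\,\|x_1 - x_2\|_2^2

for all k-sparse x_1, x_2 and some small \delta, i.e. the measurements approximately preserve pairwise distances between sparse signals.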
0:04:06 But in practice we do not have an exactly sparse representation in most cases. However, for many natural signals such as images and speech, a kind of sparse representation, which we call compressible, can be obtained through some transformation. In the case of speech, such a transformation is the Gabor transform; it is a kind of spectrographic representation of the speech, as illustrated here, and you see that very few of the coefficients of the spectrographic representation have large values. If we sort the coefficients of the signal, the sorted coefficients show a very rapid decay, which follows a power law. A signal like this is called compressible, and it fits in the framework of compressive sensing.
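As a small illustration of what compressible means here, one can sort the magnitudes of the STFT coefficients and look at their decay; this is only a sketch, and the random waveform below is a stand-in for a real speech recording.

```python
import numpy as np
from scipy.signal import stft

fs = 16000
x = np.random.randn(fs)                    # stand-in for one second of real speech

# Gabor / short-time Fourier representation (the spectrogram shown on the slide).
_, _, X = stft(x, fs=fs, nperseg=256)

# Sorted coefficient magnitudes; for compressible signals such as speech this
# profile decays roughly like a power law, so a few coefficients carry most
# of the energy.
coeffs = np.sort(np.abs(X).ravel())[::-1]
decay = coeffs / coeffs[0]
```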
0:05:05 Moreover, you can even see in the spectrographic map that there is an underlying structure to the sparse coefficients: for instance, here you see that most of the large coefficients are clustered together. We can further leverage this structure underlying the coefficients to improve the recovery performance and to further reduce the number of required observations.
0:05:31 After this very brief introduction to compressive sensing, I will now explain the details of our method, blind source separation via model-based sparse recovery, which from now on I will just call BSS-MSR.
0:05:47 In fact we take our inspiration from the very rich literature in the context of sparse component analysis, and I have listed only very few of the papers here; the list is in fact much longer. The papers of Özgür Yılmaz and Scott Rickard were very much inspiring for us and gave us the intuition that sparse component analysis could indeed help speech recognition systems in overlapping conditions. In many of the sparse component analysis methods, spatial cues have been used for clustering and recovering the unknown sources. Also in this context, the work of Volkan Cevher and colleagues published at IPSN motivated us; it formulated source localisation as a sparse recovery problem.
0:06:42 Finally, our BSS-MSR is nothing but sparse component analysis taken a step further: it provides a joint framework for source localisation and separation. What is new about BSS-MSR is that we exploit the model underlying the sparse coefficients, we deal with convolutive mixtures, and we use a new, efficient and accurate recovery algorithm.
0:07:10 Coming back to compressive sensing, the first ingredient is to come up with a sparse representation of the unknown signal that we desire to recover. The idea here is that we discretize the planar area of the room into a grid of cells. For this characterization the assumption is that each of the speakers occupies an exclusive grid cell. So if, for example, three speakers are competing, only three of the grid cells are active and all the rest have absolutely zero energy. That is the kind of spatial sparse representation that we can obtain for simultaneous speech sources.
0:07:54 For the spectral sparsity we use the short-time Fourier transform, a spectro-temporal representation. Now we entangle these two representations, the spatial and the spectral, together, and that yields the spatio-spectral representation of our unknown, which we denote X here. Each component of it is in fact the signal coming from one grid cell of the room, and inside are the spectral components attributed to that cell.
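Written out, my reading of this unknown is a block-structured vector (the stacking order is an assumption on my part):

    X = \big[\,X_1^{\mathsf T}, X_2^{\mathsf T}, \dots, X_G^{\mathsf T}\,\big]^{\mathsf T}, \qquad X_g \in \mathbb{C}^{F},

where G is the number of grid cells, F the number of frequency bins, and X_g holds the spectral components attributed to cell g; only the few cells actually occupied by speakers carry non-zero blocks.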
0:08:27 The second ingredient that we need is the incoherent measurements. A recent paper recognised a kind of natural manifestation of compressive sensing measurements through Green's function projections, and that inspired us to model our measurement matrix using the image model, a technique which had already been proposed by Allen and Berkley.
0:08:57 The idea of the image model is that when the room has walls and I am speaking here, it is not only me: the images of me with respect to all of the walls speak together. We can model this with the Green's function, which has this particular form in the frequency domain: each component is attenuated with respect to the distance of the corresponding image to the sensor, and delayed accordingly, which gives the channel between the sensor and the speaker.
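A sketch of the kind of frequency-domain Green's function being described, with the exact constants and the bookkeeping of reflection coefficients possibly differing from the paper:

    \Phi_{m,g}(\omega) \;=\; \sum_{r} \frac{\iota_r}{4\pi\, d_{mg}^{(r)}}\; e^{-\,j\,\omega\, d_{mg}^{(r)}/c},

where d_{mg}^{(r)} is the distance from the r-th image of grid cell g to microphone m, \iota_r collects the wall reflection coefficients along that path, and c is the speed of sound: each term is attenuated by the distance and delayed by the corresponding propagation time.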
0:09:28 Using this model we can find the projection associated with each sensor for each cell of the grid in the room, and then we pile up all of these projections to construct our measurement matrix Φ, whose size is the number of microphones by the dimension of the grid.
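A minimal sketch of assembling such a matrix at a single frequency, keeping only the direct path (the method described in the talk sums over the image sources as well; the function and argument names are illustrative):

```python
import numpy as np

def measurement_matrix(mic_xy, grid_xy, freq_hz, c=343.0):
    """M x G matrix of free-field Green's function projections at one frequency.

    mic_xy  : (M, 2) microphone positions in metres
    grid_xy : (G, 2) grid-cell centres in metres
    """
    # Distance from every microphone to every grid-cell centre.
    d = np.linalg.norm(mic_xy[:, None, :] - grid_xy[None, :, :], axis=-1)  # (M, G)
    k = 2.0 * np.pi * freq_hz / c            # wavenumber
    # Amplitude falls off with distance; the phase term encodes the delay.
    return np.exp(-1j * k * d) / (4.0 * np.pi * d)
```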
0:09:47 Now, with the sparse representation X, which is our unknown, the observations at the microphones are y = ΦX. So we suppose that we have M microphones and the measurement matrix built with the image model, and the goal is to recover X from the very few measurements y = ΦX.
0:10:13 The challenge is that Φ has a non-trivial null space, and any source component lying in that null space gives the same observation. So, according to linear algebra, such a system does not have a unique solution. However, the solution we want is the sparsest solution, and this is where sparse recovery helps us: it gives us enough prior information to overcome the ill-posedness of our inverse problem.
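Stated as the usual optimization problem (standard form rather than a formula from the slides):

    \hat{X} \;=\; \arg\min_{X}\; \|X\|_0 \quad \text{subject to} \quad y = \Phi X,

or, with noisy observations, subject to \|y - \Phi X\|_2 \le \epsilon.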
0:10:45 What we do here is use a sparse recovery algorithm that was presented in a session yesterday on learning low-dimensional signal models. It belongs to the family of iterative hard thresholding methods. The idea is that, since projecting the signal onto the null space and finding the sparsest solution is NP-hard, a combinatorial problem, iterative hard thresholding approximates the sparse solution in an iterative manner by keeping only the largest-valued coefficients and discarding all the rest. The model-based modification that we apply is that we keep only the blocks with the largest energy and discard the rest of the blocks.
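A minimal sketch of the block-thresholded iteration being described, under the assumption that the coefficients of each grid cell form one contiguous block; the step size, stopping rule and exact thresholding used in the actual algorithm may differ.

```python
import numpy as np

def block_iht(y, Phi, k, block_len, n_iter=50, step=1.0):
    """Iterative hard thresholding with block (model-based) support selection.

    y         : (M,) observations
    Phi       : (M, N) measurement matrix, with N a multiple of block_len
    k         : number of blocks to keep (e.g. number of active speakers)
    block_len : coefficients per block (e.g. frequency bins per grid cell)
    """
    N = Phi.shape[1]
    x = np.zeros(N, dtype=complex)
    for _ in range(n_iter):
        # Gradient step on the data-fit term ||y - Phi x||^2.
        x = x + step * (Phi.conj().T @ (y - Phi @ x))
        # Keep only the k blocks with the largest energy, zero out the rest.
        blocks = x.reshape(-1, block_len)
        energy = np.sum(np.abs(blocks) ** 2, axis=1)
        keep = np.argsort(energy)[-k:]
        mask = np.zeros((blocks.shape[0], 1))
        mask[keep] = 1.0
        x = (blocks * mask).reshape(-1)
    return x
```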
0:11:48 Now let me move on to our experiments and the setup. For the speech corpus we used Aurora 2, which is not overlapping, but we overlapped it with interfering speech selected randomly. We discretized the planar area of the room into a grid of fifty-by-fifty-centimetre cells, and the reverberation time was two hundred milliseconds. We tested our method when two or three competing speakers were active: the target speech comes from one cell, and in the first scenario interferences one and two are active, while in the second scenario interferences three and four are active as well, so all of them are competing.
0:12:34 The results in the case of stereo recording and separation, when three sources are competing, are the following. Aurora 2 is a digit recognition task which provides training in two conditions: in one the HMM has been trained using only clean utterances, and in the other, multi-condition, noisy utterances are used to train the acoustic model. The baseline on the overlapping speech is fifty-nine percent word accuracy with clean-condition training and sixty-one percent with multi-condition training. After our separation we perform speech recognition, and we achieve up to ninety-two percent with multi-condition training, so about eighty percent relative improvement has been obtained.
0:13:22 Then, in the second scenario, five sources were active. A nice aspect of this work is that we are very flexible with respect to the microphone array geometry that we can use. We considered two cases: in one we use only two microphones, and in the second case we use four microphones; we separate the speech and then perform speech recognition, and the word accuracy rates are provided in the table. We achieve up to ninety-four percent if four microphones are used for the source separation, and the relative improvement is then up to eighty-five percent.
0:14:11 The key message that I would like you to take home is that the information-bearing components for speech recognition are indeed sparse, and the results presented here are some compelling evidence that sparse component analysis is a promising approach to deal with the problem of overlapping speech in realistic applications of speech recognition. Moreover, by using a kind of model-based sparse recovery we showed that we can go beyond plain sparsity.
0:14:39 [Question, partly inaudible] Did you reconstruct the audio and listen to the separated sources?
0:14:56 We did listen to the separated audio, and we could also provide some kind of quantitative evaluation using measures such as the SIR and the other measures that have been proposed for source separation. But our goal was ultimately speech recognition, so we just showed the speech recognition results, which give the best final evaluation of how the system would work for speech recognition.
0:15:26 [Question, inaudible]
0:16:04 We have listened to some of the files, and there are some cases of overlap where you can still hear the background, but it is not like the kind of musical noise that we would expect from binary masking, because the sparse recovery can in some sense even be looked at as a kind of soft masking, so it avoids that kind of artifact.
0:16:33 [Question, inaudible]
0:16:50 The measurement matrix depends on the array geometry, on the environment, on the inter-element spacing, on many factors that have been considered in great detail in the paper. In our case the incoherence was in fact satisfied; we also applied a kind of preconditioning by orthogonalization, and I can send you the details. But in theory, for instance for very specific acoustic conditions, we could still show that a restricted isometry property holds.