0:00:12 First, let me introduce Mike Seltzer from Microsoft. He has been there since 2003, and his interests are noise-robust speech recognition, microphone array processing, acoustic model adaptation, and speech enhancement. In 2007 he received the Best Young Author paper award from the IEEE Signal Processing Society. From 2006 to 2008 he was a member of the Speech and Language Technical Committee, and he was also the editor-in-chief of its electronic newsletter, so many of us received emails from him whenever the newsletter came out. He is also an associate editor of the IEEE Transactions on Speech and Audio Processing. The title of his talk is "Robust Speech Recognition: More Than Just a Lot of Noise." So, here's Michael.
0:01:14 Good afternoon. Thanks for that great introduction, George, and thanks to the organizing committee for inviting me to give this talk; it's really an honor to be here with all of you. I hope to keep things interesting in this slot right after lunch, and I hope the food coma doesn't set in too badly. So let's get started.
0:01:38 So, I've been in the field for about ten years or so, and from where I sit it really seems like we're in what I'll call almost a golden age of speech recognition. As we all know, there are a number of mainstream products that everyone uses that involve speech recognition; of course there's a huge proliferation of mobile phones and data plans and voice search and things like that.

Speech is also widely deployed now in automobiles. In fact, I'd like to point to the Ford Sync as one example of a system that took speech in cars from a high-end add-on for large luxury automobiles to a low-end feature in standard packages for low-end Ford models, sort of moderately priced cars. And most recently there's the launch of the Kinect for Xbox, which has gesture and voice input.
0:02:39 In addition to these three examples of technologies that are out there, we're also in many ways swimming in, you know, drowning in data. There's the proverbial fire hose of data coming at us, and with cloud-based systems all of that data is being logged on the servers. Having data in many cases is not the problem; knowing what to do with that data is the challenge these days.

And finally, on a personal note, I think what's most meaningful for me is that all of this is happening, so that I no longer have to struggle to explain to my mother what it is I do on a daily basis. I'm not sure she's so happy that I used her face on the slide; I won't tell her if you won't.
0:03:25 Nevertheless, in spite of all this success, there are lots of challenges still out there. There are new applications; as an example, this is the virtual receptionist by Dan Bohus at MSR, part of a project on situated interaction and multiparty engagement. There are always new devices; this is a particularly interesting one from Tanja Schultz's group that uses an electromyography interface, where electrodes on the skin actually measure muscle activity as the input. I have another colleague who was working on speech input using microphone arrays inside a space helmet, for systems that are deployed for spacewalks. And of course, as Thomas Friedman wrote, the world is becoming flatter and flatter, and there are always new languages and new cultures that our systems come in contact with.
0:04:12 So addressing these challenges takes data, and data is time-consuming and expensive to collect. As an alternative, I'd like to propose that we can extract additional utility from the data we already have. The idea is that by reusing and recycling the existing data, we can potentially reduce the need to actually collect new data; you can think of it informally as making better use of our resources. So what I'd like to focus my talk on today is how we can take ideas from this process to help speech recognition go green. The symbol for my talk today will be how speech recognition goes green through reduce, recycle, and reuse of information, like the recycling logo you see a lot.
0:04:58 Okay, so first I'm going to talk about one aspect of this, in the sense of reduce. We know our systems suffer, because they are just statistical pattern classifiers, when there is mismatch between the acoustic models we have and the data that we see at runtime, and one of the biggest sources of this mismatch is environmental noise.

The best solution, of course, is to retrain with matched data, and this is either expensive or impossible, depending on what your definition of matched data is. If matched data means "I'm in a car on a highway," then it's reasonable to collect it, just a little bit time-consuming. If matched data means "I'm going to be this speaker, in this model of car, on this particular road, with this noise," then of course it's impossible. So as an alternative we have standard adaptation techniques that are tried and true and part of the standard toolkits, things like MAP or MLLR adaptation. They're really great because they're generic and computationally efficient; the only downside is that they need sufficient data in order to be trained properly. As an alternative, what I'd like to discuss here is the way we can exploit a model of the environment. By doing this, the estimation in the adaptation method gets some structure imposed on it, and as a result we get a lot of efficiency in the adaptation process.
0:06:03 So before we go into the details, let's take a quick look at the effect of noise on speech through the feature processing chain. I'm showing the processing chain for MFCCs; for similar features like PLPs or LPCs it's much the same. We know that in the waveform domain speech and noise are additive; that's not too bad to handle. Once you're in the power domain it's again additive, and there's an additional cross term here, which is a kind of correlation between speech and noise; we generally assume speech and noise are uncorrelated, so we kind of ignore that term and hide it away.

Now things get a little bit trickier once you go through the mel filterbank and the log operation, because in the log domain we get this bit of a nasty relation, which says that the noisy features, y here, can be described as the clean speech features plus some nonlinear function of the clean speech and the noise. Then of course that goes through a linear transform, the DCT, and we get a vector version of the same equation. For the purposes of this talk I'm going to back up to before the DCT, because it's easier to visualize things in one or two dimensions rather than thirty-nine.
0:07:05 So let's talk about this equation here. Because speech and noise enter symmetrically, x plus n is n plus x, we can swap the positions of x and n in this equation; if we do that and bring common terms together, we get a slightly different expression. What's interesting is that what you have inside the log is basically a function of two signal-to-noise ratios: something that in speech enhancement and signal processing is called the a posteriori SNR, which is the SNR of the observed noisy speech compared to the noise, and the a priori SNR, which is the unknown clean speech compared to the noise.
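For reference, this is the relationship being described, written per log mel filterbank channel; the notation here (x for clean speech, n for noise, y for noisy speech) is mine, and the cross term is dropped as described above:

```latex
% Log mel filterbank domain, per channel:
y = x + \log\!\left(1 + e^{\,n - x}\right) = n + \log\!\left(1 + e^{\,x - n}\right)

% Subtracting n from both sides relates the two SNRs:
\underbrace{y - n}_{\text{a posteriori SNR}} \;=\; \log\!\left(1 + e^{\,\overbrace{x - n}^{\text{a priori SNR}}}\right)
```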
0:07:45 And so if we plot this relationship, we get a curve like this, and the curve makes a lot of intuitive sense if you look at different points along it. Up here in the upper right of the curve we have high SNR: noise is basically not a factor, and the noisy speech is equal to the clean speech. In a similar way, at the other end of the curve we have low SNR, and it doesn't matter what the clean speech is; it's completely dominated by the noise, so y equals n in that case. And of course the million-dollar question is how we handle things in the middle; the nonlinearity is something that needs to be dealt with.
0:08:24 There's an added complication, which is that earlier we swept this cross-correlation between speech and noise under the rug. It turns out that, yes, it is zero in expectation, but it's not actually zero; it's a distribution with non-negligible variance. If we plot data on this curve and see how well it matches, you see the trend is that the data lies along that line, but there's actually significant spread around it. What that means is that even if we're given the exact value of the clean speech and the exact value of the noise in the feature domain, we still can't predict exactly what the noisy feature will be; we can only predict its distribution. And this additional uncertainty makes things we want to do, like model adaptation, even more complicated.
0:09:09 So there have been a number of ways to transfer this equation into the model domain. If we do that, again this nonlinearity presents some pretty great challenges. At the extremes of the curve it's quite straightforward: at high SNRs, if we do an adaptation, the noisy distribution is going to be exactly the same as the clean distribution; if we go over to the low SNRs, the lower left of the curve, the noisy speech distribution is just the noise distribution. The real trick is how we handle the area in the middle, because even if we assume that the speech and the noise are Gaussian, once we put them through this nonlinear relationship, what comes out is definitely not Gaussian. But of course this is speech recognition, and if it's not Gaussian we're just going to assume it's Gaussian anyway, and so there are various approximations that are made to do this.
0:10:02 So the most famous example of doing this kind of noise adaptation is to simply take a linear approximation, to linearize around an expansion point; this is the famous vector Taylor series (VTS) algorithm by Pedro Moreno. The idea is that you have an expansion point given by the mean of the Gaussian you're trying to adapt and the mean of the noise, and you simply linearize the nonlinear function around that point. Once you have a linear function, the adaptation is very straightforward; you know how to transform Gaussians subject to a linear transformation. Now, the trick here is that the transformation is determined only by the means, and the size of the variances of the clean speech and the noise model will determine the accuracy of the linearization: if the variance is very broad, a very wide bell, then the linearization will not be very accurate, because you're subject to more of the nonlinearity.
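As a concrete illustration of the linearization just described, here is a minimal sketch of a first-order VTS update for a single diagonal-covariance Gaussian in the log mel filterbank domain; the function name and interface are illustrative, and the channel term and the dynamic features are ignored:

```python
import numpy as np

def vts_adapt(mu_x, var_x, mu_n, var_n):
    """First-order VTS adaptation of one diagonal Gaussian (sketch).

    All arguments are per-channel vectors in the log mel filterbank domain:
    clean-speech mean/variance and noise mean/variance.
    """
    # Mismatch function evaluated at the expansion point (the two means)
    g = np.log1p(np.exp(mu_n - mu_x))
    mu_y = mu_x + g

    # Jacobian of y w.r.t. x at the expansion point; dy/dn = 1 - G
    G = 1.0 / (1.0 + np.exp(mu_n - mu_x))

    # First-order propagation of the (diagonal) variances
    var_y = (G ** 2) * var_x + ((1.0 - G) ** 2) * var_n
    return mu_y, var_y
```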
0:10:51 A refinement of that idea that we looked into is something called linear spline interpolation, which is sort of predicated on the notion that if one line works well, many lines must work better. The idea is to take this function and approximate it with a linear spline: you have a series of knots, which are basically the places where you see the dots in this figure, and between the dots the linear approximation is quite accurate. And in fact, because each segment is a simple linear regression, you can associate a variance, an error, with it when you learn the model, and that will account for the spread of the data around the curve. Then, when you figure out what to do at runtime, you can use all of the splines, weighted by the pdf, rather than having to pick a single one determined by the mean: essentially, depending on how much probability mass falls under each of the segments, that tells you how much that segment's linearization contributes to your final approximation. So you're doing the linearization based on the entire distribution rather than just the mean. The nice thing is that the spline parameters can be trained from stereo data, or they can be trained in an integrated way using maximum likelihood in an HMM framework.
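A minimal sketch of the weighting idea just described, assuming the spline is defined over the prior SNR u = x - n and that all we want back is an effective slope and intercept; this is only illustrative and is not the maximum-likelihood training mentioned above:

```python
import numpy as np
from scipy.stats import norm

def spline_weighted_linearization(knots_u, knots_g, mu_u, var_u):
    """Combine per-segment linearizations of g(u) = log(1 + exp(u)), weighted by
    the probability mass a Gaussian over u places on each segment.

    knots_u, knots_g: arrays of knot locations and function values.
    mu_u, var_u: mean and variance of the Gaussian over the prior SNR.
    """
    slopes = np.diff(knots_g) / np.diff(knots_u)
    intercepts = knots_g[:-1] - slopes * knots_u[:-1]

    # Probability mass of the Gaussian falling in each spline segment
    cdf = norm.cdf(knots_u, loc=mu_u, scale=np.sqrt(var_u))
    mass = np.diff(cdf)
    mass = mass / mass.sum()      # renormalize over the spline's support

    a = np.sum(mass * slopes)     # effective slope
    b = np.sum(mass * intercepts) # effective intercept
    return a, b
```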
0:12:06 Those were two examples of a linearization approach; another approach is the sampling-based methods. The most famous example of this is the data-driven PMC work of Mark Gales from '96, but that method requires tens of thousands of samples for every Gaussian you're trying to adapt, so it's completely infeasible, although it is a good upper bound. The unscented transform is a very elegant way to do clever sampling. The idea is that you take certain sigma points, and again, because we assume things are Gaussian, there's a simple recipe for what these sampling points are: you take a small set of points, typically fewer than a hundred, pass them through the nonlinear function you know to be true under your model, and then you can compute the moments and basically estimate p(y).

Again, how spread out the variance of the distribution you're trying to adapt is will determine how accurate this adaptation is, so there has been a refinement of this method proposed recently called the unscented Gaussian mixture filter. In this case you take a very broad Gaussian and chop it up into a Gaussian mixture where, within each component, the variance is small, the simple approximation works quite well, and the sampling works quite efficiently; then you recombine all the Gaussians on the other side.
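A small sketch of the sigma-point idea, using the simplest symmetric sigma-point set over the stacked clean-speech and noise variables and diagonal covariances only; the function name, the kappa setting, and the interface are illustrative choices of mine:

```python
import numpy as np

def unscented_adapt(mu_x, var_x, mu_n, var_n, kappa=1.0):
    """Unscented-transform adaptation of one diagonal Gaussian (sketch)."""
    mu = np.concatenate([mu_x, mu_n])
    var = np.concatenate([var_x, var_n])
    d = mu.size

    # 2d+1 sigma points: the mean plus/minus scaled axes of the covariance
    scale = np.sqrt((d + kappa) * var)
    points = [mu] + [mu + scale * e for e in np.eye(d)] \
                  + [mu - scale * e for e in np.eye(d)]
    weights = np.array([kappa / (d + kappa)] + [1.0 / (2.0 * (d + kappa))] * (2 * d))

    # Push each sigma point through the mismatch function y = x + log(1 + exp(n - x))
    k = mu_x.size
    ys = np.array([p[:k] + np.log1p(np.exp(p[k:] - p[:k])) for p in points])

    mu_y = weights @ ys
    var_y = weights @ (ys - mu_y) ** 2   # diagonal covariance only
    return mu_y, var_y
```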
0:13:27 So those are just a few examples; there are a handful of others out there in the literature. One thing I've tried to convey here is that, in contrast to standard adaptation, you'll notice I didn't talk at all about data or observations; we talked about how to adapt the model, and all we had were the HMM parameters, the clean-speech model, and the noise model. So what's nice about these methods, excuse me, is that basically all you need is an estimate of what the noise is in the signal, and given that, we can adapt every single Gaussian in the system, because of the structure imposed on the adaptation process. In fact, if we can sniff out what the environment is before we even see any speech, we can use this in the first pass, which is very nice, and of course you can refine it in a second pass by doing EM-type updates of your noise parameters.
0:14:14 So of course, under this model, the accuracy of the technique is largely determined by the accuracy of the approximation being used; those were the four examples I showed earlier, and essentially people who work in this area are trying to come up with better approximations to that nonlinear function. Other alternatives also focus on more explicitly modeling the uncertainty between the speech and the noise, which accounts for the spread in the data we saw in the earlier figure.
0:14:40 Just to give a sense of how these things work, this is Aurora 2, which is a standard noise-robustness task; it's a noisy connected-digits task. For people who care, it's the complex back end trained on clean speech, which is sort of the best baseline you can create with this data. You can see that doing standard things like CMN is not great, and when you do CMLLR on a single utterance you may not have enough data to do the adaptation correctly, so you get a small gain, but not a huge one.

The ETSI advanced front end shown there is, I guess, representative of the state of the art in front-end signal-processing approaches, where no models are used; you treat this as a noisy signal and enhance it in the front end. If you run the plain VTS algorithm, ignoring that correlation between speech and noise, that spread of the data, you get about the same performance. Now, if you actually account for that variance in the data by tuning a weight in your update, which I won't get into the details of, you get a pretty significant gain. That's a really nice result; the problem is that the value of the weight that is actually optimal is theoretically implausible and breaks your entire model, so that part is a little bit unsatisfying, and in addition that value often doesn't generalize across corpora. And then we see that you get about the same result if you use the spline interpolation method, where the linear regression model accounts for the spread in a more natural way. Again, all of these are similar first-pass numbers; they could be refined further with a second pass.
0:16:18 So while this shows we can get nice gains by adapting with this structure, there's a little bit of dirty laundry I've been trying to cover up, which is that the environmental model is completely dependent on the assumption that the HMM is trained on clean speech. As you all know, clean speech is kind of an artificial construct; it's something we can collect in the lab, but it's not very generic. It also means that if we deploy a system out in the world, the data that comes in is extremely valuable for updating and refining our system, but if it's noisy and our system can only take clean data, we can't use that data, and we have a problem.
0:16:56 So a solution to that problem has been proposed, referred to as noise adaptive training, also proposed as joint adaptive training. The idea is basically a little brother or little sister to speaker adaptive training: in the same way that speaker adaptive training tries to remove speaker variability from your acoustic model by having some other transform absorb it, we want the same kind of operation to absorb the environmental variability. What this allows you to do is train a single model that incorporates data from different sources. This is helpful if you think about a multi-style model: we can take all kinds of data from all different conditions and mix it together, and the model will model the noisy speech correctly, but it will have a lot of variance that is just modeling the fact that the data comes from different environments, and that's not going to help with phonetic classification. If you're in a data-scarce scenario, this can become very important.
0:17:53 So again, just to make it a little more explicit, here's the general flow for speaker adaptive training: you have some multi-speaker data and a speaker-independent HMM; that goes into a process where you iteratively update your HMM and a set of speaker transforms, most commonly using CMLLR, and this process goes back and forth until convergence, and what's left at the end is a speaker-adapted HMM.

In noise adaptive training the exact same process happens, except the goal is to remove the environmental variability from multi-style, multi-environment data. So what we have here is, I guess you could call it an environment-independent model, although that's not quite what it is; for a parallel structure let's call it that; essentially it's data from lots of environments. Then, in your iterative process, you're basically trying to model and account for the noise or channel distortion that's in all of your data with the other parameters, so that the HMM is free to model the phonetic variability. In this case, typically, the noise and environmental parameters are updated on a per-utterance basis rather than a per-speaker basis; because there are few parameters, you're able to estimate them reliably. What comes out is a noise-adapted HMM, and the nice thing here, again, is that because you can do this potentially in the first pass, you don't need to keep the initial environment-independent or noise-independent model around like you do in speaker adaptive training; you can operate on the noise-adapted HMM all the time.
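A high-level sketch of the alternating procedure just described; all of the callables passed in (estimate_noise, adapt, accumulate, update_hmm) are hypothetical hooks standing in for whatever adaptation scheme, such as VTS, your toolkit provides:

```python
def noise_adaptive_training(utterances, hmm, estimate_noise, adapt, accumulate, update_hmm, n_iters=4):
    """Sketch of noise adaptive training with per-utterance noise estimates.

    estimate_noise(features, hmm)         -> per-utterance noise model
    adapt(hmm, noise)                     -> noise-adapted HMM (e.g. via VTS)
    accumulate(features, transcript, hmm) -> sufficient statistics
    update_hmm(hmm, stats)                -> re-estimated "pseudo-clean" HMM
    """
    for _ in range(n_iters):
        stats = []
        for utt in utterances:
            noise = estimate_noise(utt.features, hmm)   # few parameters, per utterance
            adapted = adapt(hmm, noise)                 # impose the environmental structure
            stats.append(accumulate(utt.features, utt.transcript, adapted))
        hmm = update_hmm(hmm, stats)                    # phonetic variability stays in the HMM
    return hmm
```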
0:19:18 Here are some results with noise adaptive training. In this analysis, with noisy multi-style training data, you can see this is the result for CMN, just cepstral mean normalization. Now, if we try to apply the VTS algorithm, which assumes the model is clean, that assumption is broken here, so you do improve over the baseline, but the results are not nearly as good; then we get back to nice gains when we actually do this adaptive training. We see similar behavior on the Aurora 3 task, and an interesting thing there is that because it's real data collected in a car, there is no clean data to train on, so you actually need an approach like this to run successfully on a corpus like that.
0:20:02 So, to summarize the first prong of the triangle, reduce: model adaptation, as you all know, can reduce environmental mismatch. When you impose this environmental structure determined by the model, the adaptation is incredibly data efficient. If you think about it, in general you need an estimate of the noise mean, an estimate of the noise variance, and potentially a channel mean; that's basically 39 plus 39 plus 39, around 120 parameters to estimate, which is really very little. And if you assume, for example, that your noise is stationary, then you can eliminate even the delta and delta-delta components of your noise model and only worry about the static features, so that's even fewer parameters.

Doing the adaptation, unfortunately, is computationally quite a chore; adapting every Gaussian in your system is probably overkill on an utterance-by-utterance basis, but you can improve the efficiency by using regression classes, as has been shown in prior work. The other thing is that we can reduce the environmental variability in the final model by doing this noise adaptive training, which is helpful in scenarios where there's not much data to work with.

The other considerations, as reminders, are that although I've shown maximum-likelihood systems, these methods can be integrated with discriminative training, and that there is a huge parallel literature where the same exact algorithms are used in the front end: you replace the HMM with a GMM and do this as a front-end feature enhancement scheme; it's basically the same operation, with the goal of generating an enhanced version of the cepstra. Those methods use the exact same mathematical framework, and the nice thing there is that if the data you work with is noisy, you can also do the same adaptive training technique on the front-end GMM and still use those techniques.
0:21:54 So now I want to move on from reduce to recycle, and in this case I'll change gears from the noise to the channel and talk about how we can recycle narrowband data. I think it's not a very controversial statement to say that voice over data is replacing voice over the wire. Especially in speech applications, when you speak to a smartphone your voice is not making a telephone call anymore; it's going over the data network to some server. When you do that, you can capture the audio at whatever bandwidth is possible: you may be subject to bandwidth constraints or latency constraints, but you can basically capture arbitrary bandwidth, and in that case wideband data is preferable.

The gains do vary when you build equivalent systems with narrowband versus wideband data, but they are consistent. For example, if you look at the car, the gains are larger in that context, because a lot of the noise in cars is at low frequencies; the rumble of the highway and the tires creates a lot of low-frequency noise, so having the high-frequency energy in the plosives and affricates is really helpful for discriminability. And of course wideband is also becoming the standard for plain human communication: there are wideband codecs, AMR wideband is the European standard, and Skype has now gone to a wideband or even an ultra-wideband codec. So the fact that people prefer it also implies that machines would probably prefer it.
0:23:27 That said, there are existing stockpiles of narrowband data from all the systems we've been building over the years, and for many low-resource languages in the developing world, mobile phones are still prevalent and I don't think they're going away soon, so we want the ability to do something useful with that data. So what I'd like to propose is that there's a way to use the narrowband data to augment some wideband data we have in data-scarce scenarios, to build a better wideband acoustic model. The inspiration for this came from the signal processing literature: maybe ten or fifteen years ago people proposed bandwidth extension speech processing. It comes from the fact that we know people prefer wideband speech; it turns out it's not any more intelligible, unless you're looking at isolated phones; both are equally intelligible, but things like listener fatigue and just personal preference come out much higher for wideband speech.
0:24:26 So the way these algorithms operated was that they basically learned correlations between the low- and high-frequency spectrum of the signal. Here's a poor first-grader's drawing of a spectrum; I'd like to say that my four-year-old did this, but I did it myself. So this is sort of the telephone band, with a couple of formants, and if I ask you to predict what's on the other side of the line, you'd probably predict something like that; it seems pretty reasonable. You might draw a different slope, or put the formant in a slightly different location, but it's not, for example, going to go up; you would doubt that it would.

So what we can do is basically use something like a Gaussian mixture model to learn Gaussian-dependent mappings from low-band to high-band spectra. Then the simplest thing we could do is just generate wideband features from narrowband features. If you're familiar with the missing-feature literature, that work says: I have some components of my features that are too corrupted by noise, so I'm going to remove them and try to fill them in from the surrounding reliable data; this is like doing missing features with a deterministic mask given by the telephone channel. You simply take some amount of wideband data and some potentially large amount of narrowband data, you convert that narrowband data into pseudo-wideband features, and you go train an acoustic model that way.
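A rough sketch of the GMM-based mapping just described, using scikit-learn; the function names and interfaces are illustrative, and a real system would at least guard against numerical underflow in the responsibilities and typically use stereo or state-conditioned data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_bwe_gmm(wideband_feats, n_components=64):
    """Learn a joint GMM over [low-band | high-band] log mel features
    from the available wideband data. Rows of wideband_feats are frames."""
    gmm = GaussianMixture(n_components=n_components, covariance_type='full')
    gmm.fit(wideband_feats)
    return gmm

def extend_frame(gmm, low, n_low):
    """MMSE estimate of the missing high band given the observed low band
    (first n_low dimensions)."""
    est = np.zeros(gmm.means_.shape[1] - n_low)
    resp = np.zeros(gmm.n_components)
    for k in range(gmm.n_components):
        mu_l = gmm.means_[k, :n_low]
        S_ll = gmm.covariances_[k][:n_low, :n_low]
        S_hl = gmm.covariances_[k][n_low:, :n_low]
        diff = low - mu_l
        # Responsibility of component k given only the low band
        resp[k] = gmm.weights_[k] * np.exp(-0.5 * diff @ np.linalg.solve(S_ll, diff)) \
                  / np.sqrt(np.linalg.det(2 * np.pi * S_ll))
        # Conditional mean of the high band under component k
        est = est + resp[k] * (gmm.means_[k, n_low:] + S_hl @ np.linalg.solve(S_ll, diff))
    return est / resp.sum()
```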
0:25:53 So this actually works pretty well, and here's an example. On the left is a wideband log mel spectrogram, and this is the same speech through a telephony channel; you can see that the information below 300 Hz and above 3400 Hz has gone missing, so to speak, and the idea of bandwidth extension in the feature domain is to ask whether we can fill it back in. In this particular case it's not perfect, but a lot of what you see in the other picture is there; we're capturing the gross features of the data, and we could use that to train our system.

This is good, but the downside is that if you do it this way, in the feature domain, you end up with a point estimate of what your wideband feature should be, and if that estimate is poor, or it's wrong, you have no way of telling the model during training to trust that data less than other estimates that may be more reliable. So to get this to work you have to do somewhat ad hoc things like corpus weighting: okay, we have a little bit of wideband data, and I'm going to count those statistics much more heavily than the statistics of my narrowband data, which I've extended and therefore don't trust quite as much. So it's not theoretically optimal.
0:27:09 As a result, a better approach is to incorporate this into the EM algorithm directly. When you train an HMM, the state sequence is the hidden variable; you can think of this as doing the exact same thing, but adding additional hidden variables for all the missing frequency components that you don't have in the telephone channel. If you do this, you get something that looks like this, where the narrowband data goes directly into the training procedure with the wideband data, you have this expanded EM algorithm, and what comes out is a wideband HMM. Now, I'm not going to go into too many details, and I'm trying to keep equations to a minimum, but I want to point out a few notable things about the variance update equation.

First of all, I should mention that the notation I've adopted here is from the missing-feature literature, so o is something you observe and m is missing: you can consider o to be the telephone-band frequency components and m to be the missing high-frequency components you're trying to model in your HMM.
0:28:13 The second thing is that the posterior computation is carried out only over the low band that you have; you've marginalized out the components you don't have over all your models, and therefore erroneous estimates made in this process don't corrupt your posterior calculations, because you're only computing posteriors based on reliable information that you know is correct. The other interesting thing is that rather than having an estimate that's global across all your data, you have a state-conditional estimate, where the estimate of the wideband feature is determined by the observation at time t as well as the state you're in. So the extended wideband feature can be a function of both the data I see and whether I'm in a vowel or a fricative or a plosive, for example. And finally, there's this variance piece at the end, which says, for this particular Gaussian, how much overall uncertainty there is in trying to do this mapping. Maybe I'm in a case where the mapping is really hard because there's very little correlation for that time-frequency patch; then we'll have high variance there, so the model can reflect the fact that the estimates we're using may be poor.
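For reference, here is a sketch of the standard missing-data EM statistics that match this description (posteriors over the observed band only, a state-conditional estimate of the missing band, and an added conditional-covariance term); the exact update in the published work may differ in detail:

```latex
% Posteriors use only the observed (telephone-band) components:
\gamma_t(s) \propto p(s)\,\mathcal{N}\!\left(o_t;\, \mu_o^{(s)},\, \Sigma_{oo}^{(s)}\right)

% State-conditional estimate of the missing band and its uncertainty:
\hat{m}_{t}^{(s)} = \mu_m^{(s)} + \Sigma_{mo}^{(s)}\left(\Sigma_{oo}^{(s)}\right)^{-1}\!\left(o_t - \mu_o^{(s)}\right),
\qquad
\Sigma_{m|o}^{(s)} = \Sigma_{mm}^{(s)} - \Sigma_{mo}^{(s)}\left(\Sigma_{oo}^{(s)}\right)^{-1}\Sigma_{om}^{(s)}

% Variance update for the missing band of state s (narrowband frames):
\hat{\Sigma}_{mm}^{(s)} =
\frac{\sum_t \gamma_t(s)\left[\left(\hat{m}_{t}^{(s)} - \hat{\mu}_m^{(s)}\right)\left(\hat{m}_{t}^{(s)} - \hat{\mu}_m^{(s)}\right)^{\!\top} + \Sigma_{m|o}^{(s)}\right]}
     {\sum_t \gamma_t(s)}
```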
0:29:26 So if we look at the performance here, we've taken a Wall Street Journal task, taken the training data, and partitioned it into a wideband set and a narrowband set at some proportion. If you look at the performance of the all-wideband system, that's the lower line, at about ten percent word error rate; if you take the entire training set and "telephone-ize" it all, you end up with the upper, purple curve, the narrowband system. The goal is to say: given some wideband data, and extending the rest of the narrowband data, how much of that gap can we close?

So we're comparing the results of the feature-domain version and the model-domain version, and we can see that with an eighty-twenty split the performance is about the same, and in that case, why go through all the extra computation; the feature version works quite well. Interestingly, once you go to a more extreme case where only ten percent of the training set is wideband and the rest is narrowband, the feature version actually does worse than just training an entirely narrowband system, because there's a lot of uncertainty in the extension done in the front end which is not reflected in the model at all. But if we do the training in this integrated framework, we end up with performance that again is better than or equal to all-narrowband.
0:30:39 So, to talk about the second prong of the triangle, recycle: potentially, narrowband data can be recycled for use as wideband data. This may allow us to use the existing piles of legacy data we have, and for an initial system, narrowband data may be easier to collect and can be supplemented with just a small amount of wideband data. You can do this in the front end, or you can use this sort of integrated training framework, and like in the noise-robustness case, the front-end version has its own advantages: if you do this in the front end you can use whatever features you want; you can post-process the output, use bottleneck features, stack a bunch of frames and do LDA, so you have a bit more flexibility in what you do downstream of this process.
0:31:34 The other interesting thing is that the same technology can be used in the reverse scenario, where the input is narrowband and the model is actually wideband. You may wonder where this would happen, but it happens in real systems as soon as someone puts on a Bluetooth headset: you could have a wideband-deployed system, somebody decides they want to be safe and go hands-free and puts on a Bluetooth headset, and now what comes into your system is narrowband. If you don't do something about it, well, you won't get killed, since you're going hands-free, but your performance is going to suffer. So one option would be to maintain two models on your server; the other idea is that you can do bandwidth extension in the front end and process that with your wideband recognizer. The nice thing there is that you don't have to be as good as true wideband performance; you just have to be better than, or as good as, what the narrowband performance would have been, and then it's worth doing.
0:32:33 So finally I'd like to move on to the last component here, reuse, and talk about the reuse of speaker transforms.

One of the things that we've found is that the utterances in the applications being deployed commercially now are really short. In Seattle, obviously, people say "Starbucks" quite a bit, or "movie showtimes", or in the living-room scenario "Xbox, play movie" may be all that you get. In addition, these aren't really rich interactive dialog systems; they're sort of one-shot things where you speak, you get a result, and you're done. The combination of these two things makes it really difficult to obtain sufficient data for doing conventional speaker adaptation from a single session of use, so doing things like MLLR or CMLLR becomes quite difficult in the single-utterance case. An obvious solution is to say, well, let's just accumulate the data over time, across sessions; we have users making multiple queries to the system, so let's aggregate it all together, and then we'll have enough data to build a transform.
0:33:49 The difficulty comes in because these are applications on mobile phones, which means the people are obviously mobile too, and across all these different uses they're actually in different environments; that creates additional variability in the data that we accumulate over time.

So, as a metaphor, let's imagine a user calls the system, and the observation comes in as Y; that's some combination of the phonetic content, which I'm showing as a white box, some speaker-specific information, shown as a blue box, and some environmental background information, shown as the purple box. The system gets the speech and says, okay, we'll run our speaker adaptation and store away the transform, so the next time this user calls it will be loaded up and ready to go. Sure enough, some time later the user calls back; the phonetic content may or may not be the same, the speaker is the same, but now he or she is in a different location or a different environment, so the observation is now green instead of purple. As a result, if we do adaptation on the model using the stored transform, the mismatch persists, and this is not optimal.
0:35:04 So what we would like is a solution where the variability, when we do something like adaptation, can be separated or pulled apart, so that we can say: let's hold on to the part that's related to the speaker and throw away the part that's related to the environment; or we can even store the part that's for the environment, so that if a different user calls back from that same environment we can use that as well. In order to do this sort of factorization, or separation, of the different sources of variability, you need an explicit way to do joint compensation; it's very hard to separate these things if you don't have a model that explicitly represents them as individual sources of variability.
0:35:49 To do this, there are several pieces of work that have been proposed; it's sort of like being at a diner where you choose one from column A and one from column B: you can take all your favorite speaker adaptation algorithms and all the things people apply for environmental adaptation, pick one from each, combine them, and then you have a usable model. The interesting thing is that this was proposed ten years ago, but as far as I can tell, with the exception of joint factor analysis in 2005, there has not been much work on it since. Now it seems to have come onto the scene again, which is good; I think it's an area that deserves more people working on it.
0:36:33 So, of all the possible combinations of methods that can do this joint compensation, I'll talk about one particular instance that uses CMLLR transforms, mostly because I've already talked about how VTS is used and I'm trying to show several different ways you can go about doing compensation for noise. In this case the idea is that you can use a cascade of CMLLR transforms: one that captures environmental variability and one that captures speaker variability. The nice thing about using transforms like this is that, while we give up the benefit of all the structure we had with an environmental model like VTS, we gain much more flexibility, meaning there's no restriction on what features we use or what data the transforms are trained from, and we don't have to do adaptive training schemes like noise adaptive training.

The idea is quite simply to find the transforms that maximize the likelihood: a set of environmental transforms and a set of speaker transforms, given a set of training or adaptation data. Now, of course, it's not hard to see that this cascade of linear transforms is itself a linear transform, and as a result you can factor a linear transform into two separate transforms in an arbitrary number of ways, many of which will not be meaningful. The way we get around this is to borrow heavily from what I think is the key idea in joint factor analysis from speaker recognition, which is to learn the transformations on partitions of the training data where we're able to isolate the variability that we want.
0:38:13 So, pictorially, this is a bit busy and might give you a headache, but the idea is that you basically group the data by speaker, and given those groups you update your speaker transforms; then you repartition the data by environment, keep the speaker transforms fixed, and update your environment transforms, and go back and forth in this manner.

Of course, doing this operation assumes that you have a sense of what your speaker clusters and your environment clusters are. There are cases where it's reasonable to assume the labels are given to you: for example, on a mobile phone with a data plan you can have a caller ID or a user ID or the hardware address, so you can have high confidence that you know who the speaker is. Similarly, for certain applications like the Xbox in the living room, we can say this thing probably isn't driving along in a car at sixty miles an hour; it's probably in the living room, so we can assume the environment. And if we don't have this information, you can actually run environment clustering or speaker clustering algorithms.
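A high-level sketch of the alternating estimation just described; estimate_cmllr, apply_cmllr, and identity are hypothetical hooks into your own toolkit, and each utterance is assumed to carry speaker and environment labels (or cluster assignments):

```python
def train_factored_transforms(utts, hmm, estimate_cmllr, apply_cmllr, identity, n_iters=3):
    """Alternating estimation of speaker and environment CMLLR transforms (sketch).

    estimate_cmllr(pairs, hmm)       -> a CMLLR transform from (features, transcript) pairs
    apply_cmllr(T, features)         -> transformed features
    identity()                       -> an identity transform to initialize with
    """
    spk_T = {u.speaker: identity() for u in utts}
    env_T = {u.env: identity() for u in utts}

    for _ in range(n_iters):
        # Update speaker transforms with environment transforms held fixed.
        for spk in spk_T:
            group = [u for u in utts if u.speaker == spk]
            normed = [(apply_cmllr(env_T[u.env], u.features), u.transcript) for u in group]
            spk_T[spk] = estimate_cmllr(normed, hmm)
        # Update environment transforms with speaker transforms held fixed.
        for env in env_T:
            group = [u for u in utts if u.env == env]
            normed = [(apply_cmllr(spk_T[u.speaker], u.features), u.transcript) for u in group]
            env_T[env] = estimate_cmllr(normed, hmm)
    return spk_T, env_T
```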
0:39:18 And so, just to show some results here: the idea is that you take the training data, which comes from a variety of environments and a variety of speakers, and estimate a set of environment transforms on that training data. To do that, of course, you have to estimate the speaker transforms as well, but in this case the speakers in training and test are distinct, so those speaker transforms are not useful for us in the reuse scenario.

So what we've done here is estimate the speaker transform given data from a single environment, in this case the subway. We take that transform, estimated either in this factored way, where the sources of variability are separated, or using the conventional CMLLR approach, and apply it to data from the same speaker in six different environments: three that were seen in training and three that were not. You can see that in both cases you get a benefit from having the additional transform in there to absorb the variability from the noise, so that the speaker transform can focus on just the variability that comes from the speaker you care about. There's a gain over doing CMLLR alone, and that comes from the fact that the plain CMLLR transform is presumably learning a mapping of the environment plus the speaker, whereas this is ideally learning a transform for just the speaker alone.
0:40:46 So, in scenarios where speaker data is scarce, reuse is important for adaptation. If each utterance were ten or fifteen or twenty seconds long, these techniques would not be nearly as important, but when you only have a second or two of data, you want to be able to aggregate all of it and build a model for that speaker.

When the data comes from places where there's a high degree of variability from other sources, the problem becomes more challenging, and this can be environments or devices: you have some data with the phone held up like this, then you have far-field data, then you have additional data captured four feet away on your couch. All these different microphones and sources are things that blur the speaker transform you're trying to learn, and you want to isolate them in order to reuse the speaker transform. Doing things in this style allows a secondary transform to absorb the unwanted variability, and there are various ways of doing it. Obviously, if you have transforms that explicitly model different things, it will be easier to get the separation; if you're using two generic linear transforms, then you need to resort to these data-partitioning schemes, which makes things a little more difficult.
0:42:12 So here I've tried to touch on three aspects of speech recognition going green, in this reduce, reuse, recycle framework. Before I conclude, I want to quickly touch on something else: as someone who has worked, I guess, in robustness and these ideas, I've noticed there are maybe three personalities that people take on, and you may find yourself recognizing which of these personas you are.
0:42:44 So I want to address each of those. I think there are people who are the believers, there are people who are the sceptics, and there are the people I'd call the willing, who are the ones who say, oh well, maybe I'll give this a try. I think about the resurgence in neural-net acoustic modeling as a good example of this, and maybe auditory-inspired signal processing is another: there were true believers in acoustic models using neural nets, then there were sceptics who said they can't beat an HMM, so put that aside; then results improved, people gave it a try again and moved from being sceptics to the willing, and now that they've got good results they're all believers again.
0:43:32 So I want to talk to each of these groups very briefly. To the sceptics I would say this: one thing I think is interesting is that research in robustness for speech recognition has been going on for a long time; there are lots of sessions, lots of papers, lots of workshops. But if you look at the tasks that have become standard for robust speech recognition, like the ones I talked about today, they're all very small vocabulary compared to today's state-of-the-art systems, compared to things like Switchboard and GALE and meeting recognition, and in those very large-scale systems robustness techniques are not really a part of the puzzle. So I think it's very fair to ask whether these methods are really necessary in any actually deployed system.

To that I would just say: it depends, and I want to give a few very anecdotal examples to motivate why. If you think of the big production-quality systems that do have all the bells and whistles that everyone knows about, that are common in large-scale systems, we see in things like voice search that the gains are in fact small, so it's not really a huge win to employ these techniques; there it's a fair critique to say we don't need robustness.
0:44:36 As you move to something like the car, it turns out that the gains are actually pretty big, and the system becomes much more usable by incorporating some elements of noise robustness into it. Finally, I would say that with the Xbox and Kinect, these systems would actually be unusable otherwise: if I consider robustness to be the entire audio processing front end plus whatever happens in the recognizer, and we threw all of that away and said let's just use this microphone to listen and do everything in the model space, the systems would actually be unusable. So there actually is a large place for this technology in certain scenarios.
0:45:15 Turning now to the willing: if someone asks what's the easiest way to try this, the first thing to try, the biggest bang for the buck, is what Li Deng called noise adaptive training in the feature space. The idea is very simple: you have some training data, and you believe you have some way to enhance the data at runtime; you take the training data, pass it through that same exact process, and retrain your acoustic model. This is basically very akin to doing feature-space speaker adaptive training: you're updating your features before you retrain your model. It turns out that if you do this, you get performance that is generally far superior to trying to compensate noisy speech and recognize it with a clean-trained HMM.
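A minimal sketch of that recipe; enhance, train_hmm, and decode are hypothetical stand-ins for your own front end, trainer, and decoder:

```python
def feature_space_nat(train_utts, test_utts, enhance, train_hmm, decode):
    """Feature-space noise adaptive training (sketch): run the same enhancement
    on the training data that will be used at test time, retrain, then decode
    enhanced test data with the matched model."""
    enhanced_train = [(enhance(u.features), u.transcript) for u in train_utts]
    hmm = train_hmm(enhanced_train)          # model now matches the enhanced features
    return [decode(hmm, enhance(u.features)) for u in test_utts]
```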
0:46:02 And if you are going to try this, I think the standard algorithms are fine, things like spectral subtraction; the fancier ones are great, but the incremental improvements are small, and I think getting the basics working is what matters. The important thing is that you need to optimize the right objective function. I've talked to people who say, oh, we got a spectral subtraction component from my friend in the speech enhancement part of our lab, I just tried it, and it didn't work at all; the reason is that these things are optimized for completely different objectives. You do need to understand the details and nuances of what's happening, but generally there's a whole set of parameters and floors and weights, and those things can all be tuned; you can tune them to minimize word error rate, and that would be great, and you can do it in a greedy way, just sweeping a whole bunch of parameters until you get the best result.

You can also use something called PESQ, which is a computational proxy; it stands for Perceptual Evaluation of Speech Quality and is basically a model of what human listeners would say. It turns out that PESQ scores are quite correlated with speech recognition performance, so if you can maximize that, or if your signal processing buddies have some algorithm that maximizes PESQ, that's a good place to start. And it turns out that optimizing things like SNR is about the worst thing you can do; it creates all kinds of distortion artifacts.
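A minimal sketch of that kind of parameter sweep; enhance and score_wer are hypothetical hooks into your own front end and recognizer, and the floor/alpha knobs are just examples of the parameters a spectral-subtraction front end typically exposes (a PESQ-style scorer could replace WER here):

```python
import itertools

def tune_enhancement(grid, enhance, score_wer, dev_utts):
    """Grid sweep of enhancement parameters against WER on a dev set (sketch).

    grid: dict of parameter lists, e.g. {'floor': [...], 'alpha': [...]}
    enhance(features, floor, alpha) -> enhanced features
    score_wer(pairs)                -> WER over (features, transcript) pairs
    """
    best_params, best_wer = None, float('inf')
    for floor, alpha in itertools.product(grid['floor'], grid['alpha']):
        pairs = [(enhance(u.features, floor, alpha), u.transcript) for u in dev_utts]
        wer = score_wer(pairs)
        if wer < best_wer:
            best_params, best_wer = (floor, alpha), wer
    return best_params, best_wer
```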
0:47:20 So with that, I just want to conclude and say that we've proposed that there is potentially goodness to be had by making use of the data we already have, and we've put this under the banner of going green. I've tried to provide one example each of the ways we can reduce, recycle, and reuse the data we have, whether from an environmental-mismatch point of view, a bandwidth point of view, or a speaker-adaptation point of view. There are many other ways to do this; I've just talked about a few, and of course there's more work to be done. And so, with that, I'll thank you.

0:47:58 Let's thank the speaker.
0:48:05oh we have plenty of time for questions
0:48:11 So, Mike, thanks; great talk. I was wondering if you could address some other problems in the robustness area. For example, there are many cases with really gross nonlinear distortions that get applied to the signal, strange things in the communication channel, beyond what you talked about. The transform techniques could obviously be applied to anything, but I'm wondering if you have any comments on what you do when there are gross nonlinear distortions of the signal, where the signal is still basically intelligible but it doesn't fit any of the classical speech-plus-noise models.
0:48:55 Well, the one thing I would say is that that is a hard problem; I think you would agree with that. That feature-space adaptive training technique is generic across any kind of distortion, so if you actually know what that coding or distortion is and can model it somehow, you can pass your training data through it; that's probably the best way to handle it. It's not very fancy or elegant, but I think it would work.

The other thing is that a lot of these things are bursty, and I'd like to believe that you can actually just detect them, by building whatever classifier, and at that point you can, for example, compute your decoder scores while just giving up on those frames, saying there's no content here; that's another way to do it. I think trying to have an explicit model for the occasional garbage that comes in over the channel won't really work. I'd also like to believe that we can extend the linear transformation schemes to nonlinear transformations, some kind of MLP-style MLLR kind of thing, but that remains to be seen, and again it doesn't really get at the occasional gobbledygook that comes in; I don't think it would really address that. So I think those other two techniques are the way to go.
0:50:21 I think one thing that's interesting is the correlation between how people speak and the noise background, a kind of "what is the noise doing to the talker" rather than just the additive noise. So the Lombard effect has the obvious loudness-of-speech component, which we're pretty well able to compensate for; we normalize the gain. But there's also the Lombard spectral tilt: the louder the noise is, the more vocal effort there is, and the more tilt there is to the spectrum and that sort of thing. How do the techniques you're talking about address that? It's a whole different kind of problem, because the environment model by itself really doesn't capture it, unless you know the signal-to-noise ratio directly.
0:51:07 Right. So I think what's interesting about those is that they are speaker effects that are brought out by the environment, and so, like you said, having environment models is not going to capture that at all. I don't know that I have the exact answer, although I would think that having an environment-informed speaker transform would be useful: potentially your choice of, say, VTLN warp parameters could be affected by what you perceive in the environment and the level of speaker effort you detect.

The other thing, of course, the poor man's answer, is that I'm not sure how much of this can already be modeled by existing speaker adaptation techniques. Again, I think a lot of the effect is going to be nonlinear, so it's hard to say how much can be swept under the rug with an MLLR transform. But I think that comes at it from the opposite direction of what I was talking about: I was talking about orthogonalization of the speech and the noise, and you're actually suggesting the opposite, a jointly informed transform, which I think is a very enticing area. I don't imagine there's been too much work on it.
0:52:31 Might there be greener features, features that come in that are themselves insensitive to some of this?

Absolutely. Well, if I agree with you then I'm through, my whole talk is out the window; maybe at the coffee break I can agree with you. No, I think that's true, and I think a lot of this comes with the biologically inspired kinds of features. In fact, the work that Oriol and Suman did kind of shows that, if I remember correctly: they trained a deep net on Aurora and got a high degree of noise robustness just by running the network, which potentially learned some kind of noise-invariant features. So I think that's right; I think that's true. The only problem is, where we are right now, it's hard to come up with a one-size-fits-all scheme.

So there's one other thing, about the bandwidth extension.
0:53:34 Using GMMs to extend the data: as far as I understand, the GMM you mentioned in that specific example was trained unsupervised; it basically doesn't consider the transcripts, in the GMM case.

Right, but you could also do an HMM version of it.

Well, that's easy if you can use the transcriptions, like phone-level transcriptions; can you improve on that?

Absolutely, yeah; that's what was shown, the model-based technique compared with the pure speech-feature-based technique.
0:54:11 Yeah, well, yes, but I think you don't necessarily need a very strong model. You could, for example, have a phone-loop HMM in the front end; that is using a model-based technique, but getting the state sequence right is actually a problem in the feature technique as well: if you don't put constraints on the search space, you can have it skipping around states within a phone, and you get inconsistent hypotheses for what the missing band is. You can avoid that to some extent if you do a sort of cheap decoding in the front end, with a phone HMM and a phone language model; the benefit of the model there is really constraining your state space to plausible sequences of phones. Once you have that, whether you use it to enhance features or work in the model domain, both are options.
0:55:11 Yeah, I mean, I also agree; I think the model domain will be optimal. But if you start saying, well, my system runs with eleven stacked frames and HLDA and all this other stuff, it becomes a little harder to do that; you can still assume it's going to be a blind transform like MLLR, but if you want to put structure into the transform that maps the low to the high frequencies, that gets a little more difficult.
0:55:36 Okay, let's thank the speaker again.