Speech Transcript - Robust Speech Recognition: more than just a lot of noise

0:00:12	first let me introduce mike seltzer from microsoft
0:00:16	he has been there since two thousand and three
0:00:19	and his interests are in noise robust speech recognition microphone array processing acoustic model adaptation speech enhancement
0:00:28	in two thousand and seven he received the best young author award from the ieee signal processing society
0:00:35	and from two thousand and six to two thousand and eight he was a member of the S L
0:00:39	T C and he was also the editor in chief of the
0:00:43	electronic
0:00:46	newsletter so many of us used to receive emails from him whenever the newsletter came out
0:00:52	and he is also an associate editor of the ieee transactions on speech and audio processing
0:00:58	and the title of his talk is robust speech recognition more than just a lot of noise
0:01:03	so
0:01:04	these by michael
0:01:14	good afternoon thanks for that great introduction george and thanks to the
0:01:20	asru two thousand eleven committee for inviting me to give this talk it's really an honour to be here with all of you
0:01:28	and i hope to have something interesting to say in the session after lunch and hopefully the food coma
0:01:33	won't set in too badly
0:01:35	so let's get started
0:01:38	so
0:01:39	i've been in the field for about oh ten years or so and right now to me it really seems
0:01:45	like
0:01:46	we're in what i'll call almost a golden age of speech recognition
0:01:52	these days as we all know there's a number of mainstream products that everyone out there is
0:01:57	using that involve speech recognition of course there's a huge proliferation of mobile phones and data plans and voice search
0:02:04	and things like that
0:02:07	speech is also widely deployed now in automobiles
0:02:11	in fact i'd like to point to the ford sync as one example of a system that took
0:02:16	speech in cars from a high end
0:02:18	add on for luxury automobiles to kind of a low end feature on
0:02:24	standard
0:02:25	packages for low end ford models and sort of democratized
0:02:28	that functionality and then most recently there's the launch of the kinect add on for X
0:02:35	box
0:02:36	which has gesture and voice input
0:02:39	in addition to these three examples of technologies that are out there we're also in many ways swimming
0:02:45	or you know drowning in the data that we have there's sort of the proverbial
0:02:51	fire hose of data coming at us and with cloud based systems
0:02:56	now all the data is being logged on the system and so having data in many cases is not a
0:03:00	problem anymore it's what do we do with this data that is sort of the challenge these days
0:03:04	and finally on a personal note
0:03:06	i think what is most meaningful for me is the fact that with all this happening i no longer
0:03:11	have a tough time explaining to my mother what it is i do on a daily basis
0:03:16	i'm a
0:03:19	i'm not sure she'd be so happy that i used her face on the slide
0:03:23	i won't tell if you won't
0:03:25	nevertheless in spite of all the success there are lots of challenges that are still out there there's new applications
0:03:31	and as an example here this is the virtual receptionist by dan bohus at msr sort of a project
0:03:36	in situated interaction and multiparty engagement
0:03:39	there's always new devices this is a particularly interesting one from tanja schultz that uses an electromyography interface where
0:03:46	it sort of actually measures
0:03:48	energy in your skin as the input i have another colleague a ardent thing who's working on speech input using microphone
0:03:54	arrays inside a space helmet for us
0:03:59	you know systems that are deployed for spacewalks
0:04:01	and of course as thomas friedman wrote the world is becoming flatter and flatter and there's always new languages
0:04:07	and new cultures that our systems come in contact with
0:04:12	so addressing these challenges takes data and data is
0:04:16	time consuming to collect in some ways and it's also expensive to collect
0:04:20	as an alternative i'd like to propose
0:04:23	that we can extract additional utility from the data we already have
0:04:26	the idea is that by reusing and recycling the existing data we have
0:04:30	potentially we can reduce the need to actually collect new data
0:04:34	and you can think of it informally as making better use of our resources so what i'd like to focus
0:04:38	my talk on today is
0:04:40	how we can take ideas from a process to help speech recognition go green
0:04:45	so the theme for my talk today
0:04:48	will be how speech recognition goes green through reduce recycle and reuse
0:04:53	of information
0:04:56	like you like to go see a lot
0:04:58	okay so first i'm gonna talk about the one aspect of this in the sense of reduce
0:05:03	so
0:05:04	we know our systems suffer because these are just statistical pattern classifiers when there's mismatch between the acoustic models we
0:05:10	have and the data that we see at runtime one of the biggest sources of this mismatch is environmental noise
0:05:16	so the best solution of course is to retrain with matched data and this is either expensive or impossible depending
0:05:21	on what your definition of matched data is
0:05:23	if matched data is you know i'm in a car on a highway then that's reasonable to collect it's just
0:05:27	a little bit time consuming if matched data is i'm gonna be in this particular model of car on
0:05:32	this particular road with this noise then of course it's impossible to do
0:05:35	so as an alternative we have standard adaptation techniques that are tried and true and part of standard toolkits
0:05:40	things like map or mllr adaptation they're really great because they're generic and computationally efficient the only downside is
0:05:47	they need sufficient data in order to be trained properly
0:05:50	as an alternative what i'd like to discuss here is the way we can exploit a model of the environment
0:05:54	and by doing this
0:05:56	the adaptation method gets some sort of structure imposed on it and as a result we
0:06:00	get lots of data efficiency in the process
0:06:03	so before we go into the details let's just take a quick look
0:06:06	at the effect of noise on speech through the processing chain i'm showing the processing chain for mfccs for
0:06:12	similar features like plps or lpcs it's similar so we know that in the spectral domain or the
0:06:19	waveform domain speech and noise are additive
0:06:22	that's not too bad to handle
0:06:23	once you're in the power domain it's again additive but there's this additional cross term here which is kind of the correlation between
0:06:29	speech and noise we generally assume speech and noise are uncorrelated so we kind of ignore that term and sort of
0:06:34	hide it away
0:06:35	now things get a little bit trickier once you go through the mel filterbank and the log operation because in the
0:06:41	log domain you start to get this little bit of a nasty relationship which says that the noise
0:06:47	the noisy features y can be described as the clean speech features plus some nonlinear function of the
0:06:52	clean speech and the noise
0:06:53	and then that goes through a linear transform and we get a vector version of the same equation
0:06:58	for the purpose of this talk i'm gonna back up to before the dct 'cause it's easier to visualise things in one or
0:07:02	two dimensions rather than thirty nine dimensions
0:07:05	and so let's talk about this equation here
0:07:07	so because speech and noise are it's a symmetric relationship so X plus N is N plus X we can
0:07:13	swap the positions of X and N in this equation here if we do that and we sort
0:07:18	of bring common terms to each side of the equation you get a slightly different expression
0:07:23	and what's interesting here is that what you have
0:07:26	because this is a log domain operation is basically a function of two signal-to-noise ratios something that in
0:07:33	speech enhancement and signal processing is called the a posteriori snr which is the snr of the observed speech compared to the
0:07:38	noise and the a priori snr which is the snr of the unknown clean speech and the noise
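
The swap described here can be written out explicitly. The following is a sketch for a single log mel filterbank channel, assuming the cross term between speech and noise is ignored; the symbols x, n and y for the clean speech, noise and noisy speech log energies are notation assumed for this note, not taken from the slides.

```latex
\begin{align}
e^{y} &= e^{x} + e^{n} \\
y &= x + \log\left(1 + e^{\,n - x}\right) \\
y &= n + \log\left(1 + e^{\,x - n}\right) \\
\underbrace{y - n}_{\text{a posteriori SNR}} &= \log\left(1 + e^{\,\overbrace{x - n}^{\text{a priori SNR}}}\right)
\end{align}
```

The second and third lines are the same equation with the roles of x and n exchanged, and the last line is the one that relates the two log domain signal-to-noise ratios.
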
0:07:45	and so if we look at what this relationship is along this function
0:07:50	we have a curve like this
0:07:52	and this curve makes a lot of intuitive sense if we look at different points along the curve and so
0:07:55	we can say for example up here in the upper right of the curve
0:07:59	we have high snr basically noise is not much of a factor and the noisy speech is equivalent
0:08:03	to the clean speech
0:08:05	in a similar way
0:08:07	at the other end of the curve
0:08:09	we have a low snr
0:08:11	and it doesn't matter what the clean speech is it's completely dominated by the noise
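
The two ends of the curve can be checked numerically. This is a minimal sketch for one log mel channel with the cross term ignored; the function name `noisy_log_mel` and the test values are illustrative assumptions.

```python
import math

def noisy_log_mel(x, n):
    """Noisy log mel energy from clean speech x and noise n (cross term ignored)."""
    # Numerically stable form of log(exp(x) + exp(n)).
    m = max(x, n)
    return m + math.log(math.exp(x - m) + math.exp(n - m))

# High SNR, equal power, and low SNR, all with the noise fixed at n = 0.
for x, n in [(10.0, 0.0), (0.0, 0.0), (-10.0, 0.0)]:
    y = noisy_log_mel(x, n)
    print(f"a priori SNR {x - n:+6.1f}  ->  a posteriori SNR {y - n:.5f}")
```

At high SNR the output stays within a small fraction of x, and at low SNR it stays within a small fraction of n, which is the behaviour described for the upper right and lower left of the curve.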
0:08:15	so y equals n in this case and of course you know the million dollar question is
0:08:19	how do we handle things in the middle the nonlinearity is something that needs to be dealt with
0:08:24	there's an added complication to this which is earlier we sort of swept this cross correlation between speech and noise
0:08:29	under the rug
0:08:31	but it turns out
0:08:32	that yes it is zero in expectation but it's actually not zero you know it's a distribution that has nonnegligible
0:08:39	variance if we plot data on this curve and see how well it matches the curve
0:08:43	you see the trend is that the data lies on that line but there's actually significant spread around that line
0:08:49	what that means is
0:08:51	that even if we're given the exact value of clean speech and the exact value of the noise in the feature
0:08:57	domain we actually can't predict exactly what the noisy feature will be we can just predict what the distribution
0:09:03	will be
0:09:03	and adding this additional uncertainty makes things like model adaptation even more complicated
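
A toy simulation makes the spread visible. This sketch assumes a single frequency bin whose cross term is `2|X||N|cos(theta)` with a uniformly random phase difference `theta`; a real mel filterbank averages over many bins, so the spread there is smaller than in this simplified setup. All names and values are illustrative.

```python
import math
import random

def noisy_log_power(x, n, cos_theta):
    """Log power of speech plus noise for one frequency bin, keeping the cross term."""
    power = math.exp(x) + math.exp(n) + 2.0 * cos_theta * math.exp(0.5 * (x + n))
    return math.log(max(power, 1e-12))  # guard against exact cancellation

random.seed(0)
x, n = 0.0, 0.0  # equal speech and noise power in this toy setup
no_cross_term = math.log(math.exp(x) + math.exp(n))
samples = [noisy_log_power(x, n, math.cos(random.uniform(0.0, 2.0 * math.pi)))
           for _ in range(10000)]
mean_power = sum(math.exp(y) for y in samples) / len(samples)

print("prediction without cross term:", no_cross_term)
print("min / max with random phase:  ", min(samples), max(samples))
print("mean power with random phase: ", mean_power)  # close to exp(x) + exp(n)
```

The cross term averages out in the power domain, yet individual noisy values scatter widely around the curve, so fixed x and n give a distribution over y rather than a single value.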
0:09:09	so
0:09:10	there've been a number of ways to look at transferring this equation into the model domain
0:09:16	if we do that again this nonlinearity and the uncertainty create challenges so if we again look at the extremes of
0:09:21	the curve it's quite straightforward at high snrs if we do an adaptation the noisy distribution is gonna be exactly
0:09:27	the same as the clean distribution
0:09:29	if we go over here to the low snrs the lower left of the curve
0:09:32	the noisy speech distribution is just the noise distribution
0:09:36	and the real trick is how do we handle this area in the middle even if we
0:09:40	assume that the speech and the noise are gaussian we put these things through this nonlinear relationship and what comes out
0:09:45	is definitely not gaussian
0:09:47	but of course this is speech recognition and
0:09:51	darn it if it's not gaussian we're just gonna assume it's gaussian anyway and so there
0:09:56	are various approximations that are made to do this because
0:10:00	we just you know six
0:10:02	so
0:10:04	so the most famous example of how to do this adaptation to noise is to simply take a linear
0:10:08	approximation to this to linearize around this is the famous vector taylor series algorithm by pedro moreno
0:10:14	the idea here is you simply have an expansion point
0:10:17	that's given by the mean of the gaussian you're trying to adapt and the mean of your noise
0:10:21	and you simply linearize that nonlinear function around that point and once you have a
0:10:25	linear function now doing adaptation is very straightforward we all know how to transform gaussians subject to a
0:10:30	linear transformation
0:10:32	now the trick here is that the transformation is only determined by the means and the size
0:10:38	of the variance of the clean speech and the noise models will determine the
0:10:42	accuracy of the linearisation so if the variance is very broad if it's a very wide bell then the linearisation
0:10:48	will not be very accurate 'cause you'll be subject to more of the nonlinearity
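
The linearisation can be sketched for a single log mel dimension. This is a simplified scalar illustration of a first-order vector taylor series expansion, assuming independent gaussian speech and noise and ignoring the cross term and the channel; `vts_adapt` and its arguments are hypothetical names, not an implementation from the talk.

```python
import math

def vts_adapt(mu_x, var_x, mu_n, var_n):
    """First-order VTS adaptation of one scalar Gaussian in the log mel domain."""
    # Expansion point: the clean speech mean and the noise mean.
    g = math.log1p(math.exp(mu_n - mu_x))        # nonlinear term at the expansion point
    jac_x = 1.0 / (1.0 + math.exp(mu_n - mu_x))  # dy/dx at the expansion point
    jac_n = 1.0 - jac_x                          # dy/dn at the expansion point
    mu_y = mu_x + g
    # A linear function of independent Gaussians is Gaussian with this variance.
    var_y = jac_x ** 2 * var_x + jac_n ** 2 * var_n
    return mu_y, var_y

print(vts_adapt(20.0, 1.0, 0.0, 1.0))   # high SNR: close to the clean Gaussian
print(vts_adapt(-20.0, 1.0, 0.0, 1.0))  # low SNR: close to the noise Gaussian
print(vts_adapt(0.0, 1.0, 0.0, 4.0))    # middle of the curve: a blend of both
```

Only the means enter the expansion point, so a Gaussian with a wide variance is adapted with the same slope as a narrow one, which is the accuracy limitation described above.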
0:10:51	so
0:10:52	as a refinement of that idea we looked into something called linear spline interpolation which is sort of predicated on
0:10:59	well if one line works well many lines must work better and so the idea is to simply take this
0:11:05	function and approximate it using a linear spline which is the idea that you have
0:11:10	a series of knots which are basically the dots in this figure
0:11:15	and between the dots you have a linear approximation that is quite accurate
0:11:19	and in fact because it's a simple linear regression you have a variance or error associated with that
0:11:24	linear model and that'll account for that spread of the data around the curve
0:11:29	and then when you figure out what to do at runtime
0:11:36	you can use all the spline segments weighted by the pdf rather than just having to pick a single one
0:11:40	determined by the mean and so essentially how much mass of the probability is under each of the segments
0:11:45	tells you how much contribution of that linearisation you're gonna use in your final approximation
0:11:51	so you're doing the linearisation based on the entire distribution rather than just the mean and the nice
0:11:57	thing is the spline parameters can be trained from stereo data or they can also be trained in an integrated
0:12:01	way using sort of maximum likelihood in an hmm framework
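
The idea of weighting each linear segment by the probability mass that falls on it can be sketched in one dimension. This is not the published spline interpolation algorithm: it assumes fixed, evenly spaced knots on the function `log(1 + exp(u))` with `u = x - n`, a gaussian `u`, and no per-segment regression variance or trained parameters. All names are hypothetical.

```python
import math

def phi(z):   # standard normal pdf
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):   # standard normal cdf
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def softplus(u):  # log(1 + exp(u)), i.e. y - n as a function of u = x - n
    return max(u, 0.0) + math.log1p(math.exp(-abs(u)))

def spline_segments(knots):
    """(lo, hi, slope, intercept) pieces of a linear spline through softplus at the knots."""
    segs = [(-math.inf, knots[0], 0.0, 0.0)]        # low SNR asymptote: y - n ~ 0
    for lo, hi in zip(knots, knots[1:]):
        slope = (softplus(hi) - softplus(lo)) / (hi - lo)
        segs.append((lo, hi, slope, softplus(lo) - slope * lo))
    segs.append((knots[-1], math.inf, 1.0, 0.0))    # high SNR asymptote: y - n ~ x - n
    return segs

def expected_mismatch(mu_u, var_u, knots):
    """Mean of the spline under u ~ N(mu_u, var_u), summed segment by segment."""
    sigma = math.sqrt(var_u)
    total = 0.0
    for lo, hi, slope, intercept in spline_segments(knots):
        a, b = (lo - mu_u) / sigma, (hi - mu_u) / sigma
        mass = Phi(b) - Phi(a)                                  # probability on this segment
        partial_mean = mu_u * mass + sigma * (phi(a) - phi(b))  # integral of u * p(u) there
        total += slope * partial_mean + intercept * mass
    return total

knots = [float(k) for k in range(-8, 9)]
print(expected_mismatch(0.0, 4.0, knots))   # broad distribution straddling the bend
print(expected_mismatch(20.0, 1.0, knots))  # high SNR: close to the mean of u
```

Adding the noise mean to the returned value gives an estimate of the noisy mean, and segments that carry little probability mass contribute correspondingly little to it.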
0:12:06	those are just two examples of a linearisation approach another approach is the sampling based method
0:12:13	and i think the most famous example of this is the data-driven pmc
0:12:18	work from mark gales
0:12:21	in ninety six but that method requires you know tens of thousands of samples for every gaussian you're trying
0:12:26	to adapt it's completely
0:12:27	infeasible but it is a good upper bound on what you can do
0:12:30	but the unscented transform is a very elegant way to sort of do clever sampling the idea is you just take certain
0:12:36	sigma points and again because we can assume things are gaussian there's a simple recipe for what these sampling points
0:12:41	are you take a small set of points in this case it's typically less than a hundred points
0:12:47	pass them through the nonlinear function you know to be true under your model and then you can compute the
0:12:54	moments and basically estimate p of y
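
The sigma point recipe can be sketched for the two-dimensional case of one clean speech and one noise log mel value. This toy version assumes independent gaussians and uses the basic unscented transform with a single tuning parameter `kappa` (set to 1 here as an assumption), which gives five points rather than the larger sets needed for full feature vectors.

```python
import math

def unscented_adapt(mu_x, var_x, mu_n, var_n, kappa=1.0):
    """Estimate the mean and variance of y = log(exp(x) + exp(n)) from sigma points."""
    d = 2                      # dimension of the joint (x, n) vector
    scale = d + kappa
    # Centre point plus a symmetric pair along each axis of the diagonal covariance.
    points = [(mu_x, mu_n, kappa / scale)]
    for dx, dn in [(math.sqrt(scale * var_x), 0.0), (0.0, math.sqrt(scale * var_n))]:
        for sign in (1.0, -1.0):
            points.append((mu_x + sign * dx, mu_n + sign * dn, 0.5 / scale))
    # Pass every sigma point through the nonlinear mismatch function.
    ys = [(max(x, n) + math.log1p(math.exp(-abs(x - n))), w) for x, n, w in points]
    # Weighted sample moments of the transformed points.
    mu_y = sum(w * y for y, w in ys)
    var_y = sum(w * (y - mu_y) ** 2 for y, w in ys)
    return mu_y, var_y

print(unscented_adapt(20.0, 1.0, 0.0, 1.0))   # high SNR: close to the clean Gaussian
print(unscented_adapt(0.0, 4.0, 0.0, 4.0))    # broad Gaussians in the bend of the curve
```

Unlike the linearisation, the variances of the speech and noise models decide where the points are placed, so the shape of the nonlinearity around the mean affects the result.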
0:12:56	again
0:12:58	depending on how spread out the variance of the distribution you're trying to adapt is that will determine
0:13:04	how accurate this adaptation is so there's a further refinement of this method
0:13:08	proposed recently called the unscented gaussian mixture filter in this case you take a very broad gaussian and simply chop it
0:13:14	up into a gaussian mixture where within each gaussian
0:13:18	the variance is small and a simple linear approximation works quite well
0:13:22	and the sampling works quite efficiently and then you combine all the gaussians back on the other side
0:13:26	here
0:13:27	so those are just four examples there are a handful of others out there in the literature
0:13:33	but one thing i've tried to convey here is in contrast to standard adaptation you'll notice i didn't talk
0:13:37	at all about data
0:13:39	and observations when i talked about how to adapt the model all we had was the hmm parameters for
0:13:44	X and the noise model for N
0:13:46	so what's nice about these systems is that
0:13:50	excuse me
0:13:52	is that basically all we need is an estimate of what the noise is in the signal and given that we can
0:13:57	actually adapt every single gaussian in the system because of the structure that's imposed on the adaptation process
0:14:02	and in fact if we can sort of sniff what the environment is before we even see any speech we
0:14:06	can do this in the first pass which is very nice and of course you can refine this in a second
0:14:10	pass by doing you know an em type algorithm to update your noise parameters
0:14:14	so of course
0:14:15	under this model the accuracy of the technique is largely due to the accuracy of the approximation you're using so those
0:14:21	are the four examples i showed earlier and essentially people who work in this area are basically trying to come up with better
0:14:27	approximations to that nonlinear function other alternatives also focus on more explicitly modeling
0:14:33	that uncertainty between the speech and noise that accounts for that spread in the data that was in the earlier
0:14:39	figure
0:14:40	so just to give a sense of how these things work this is aurora two which is a standard noise robustness
0:14:45	task it's a noisy connected digit task
0:14:49	for people who care it's the complex back-end so a maximum likelihood trained system
0:14:55	it's sort of the best baseline you can create with this data and you can
0:14:58	see that doing standard things like C M N is not great when you do cmllr again because it's
0:15:05	one utterance you may not have enough data to do the adaptation correctly you get a
0:15:08	small gain but not a huge one
0:15:13	the etsi advanced front-end shown there is sort of the
0:15:16	i guess representative of the state of the art in the front end signal processing approach to doing this this is
0:15:21	where no models are used you just treat this as a noisy signal and enhance it in the front end and
0:15:26	if you do vts
0:15:27	the plain algorithm ignoring
0:15:31	that correlation between speech and noise that spread of the data you get about the same performance
0:15:36	and now if you actually account for that variance in the data by tuning a weight in your
0:15:42	update which i won't get into the details of you get a pretty significant gain
0:15:45	that's a really nice result the problem with that is that the value that is actually optimal is
0:15:50	theoretically implausible and breaks your entire model so that part is a little bit unsatisfying
0:15:57	in addition that value often does not generalise across corpora and then we
0:16:03	see that you get about the same results if you use the spline interpolation method where you have
0:16:07	the linear regression model that does account for the spread in sort of a more natural way
0:16:12	and again all these numbers are first pass numbers they could be refined further with a second pass
0:16:18	so
0:16:20	while this shows we can have nice
0:16:23	gains by exploiting the structure there's been a little bit of dirty laundry i was trying to cover up
0:16:29	which is that the environmental model is completely dependent on the assumption that the hmm is trained on clean speech and
0:16:36	as you all know clean speech is kind of an artificial construct it's something we can collect in the
0:16:41	lab but is not very generic it also means that if we deploy a system out in the world and we
0:16:46	collect the data that comes in that data is extremely valuable for updating our system and refining our system
0:16:51	but if it's noisy and our system can only take in clean data we can't use that data so we
0:16:55	have a problem
0:16:56	so
0:16:57	a solution to that problem has been proposed and referred to as
0:17:04	noise adaptive training also proposed as joint adaptive training
0:17:07	and the idea is basically it's sort of a little brother or little sister to speaker adaptive training
0:17:15	in the same way that speaker adaptive training tries to remove speaker variability from your acoustic model by having some
0:17:20	other transform absorb the speaker variability we wanna have the same kind of operation happen to absorb the environmental variability
0:17:27	what this allows you to do is actually incorporate training data from different sources
0:17:32	into a single model this is helpful if you think about a multi-style model where we can take
0:17:36	all kinds of data from all different conditions and mix it all together
0:17:40	the model will model the noisy speech correctly but it'll have a lot of variance that is just modeling the fact
0:17:43	that the data are coming from different environments
0:17:45	that's not gonna help you with phonetic classification
0:17:48	and if you are in a data scarce scenario this could become very important
0:17:53	so again just to make it a little bit more explicit here's the general flow for speaker
0:17:59	adaptive training you have some multi-speaker data and a speaker independent hmm
0:18:03	that then goes into a process where you iteratively update your hmm and some speaker transforms
0:18:09	most commonly using cmllr and this process goes back and forth until convergence and what's left is a
0:18:14	speaker adapted hmm
0:18:16	so in noise adaptive training the exact same process happens
0:18:19	except the goal is to remove the environmental variability from multi-style multi environment data
0:18:25	so what happens here is we have again i guess you could call it an
0:18:30	environment independent model
0:18:32	that's what it is and for parallel structure we'll call it that and essentially data from lots of environments
0:18:38	and then in your iterative process you're basically trying to model and account for the noise or channel distortion that's
0:18:43	in all of your data
0:18:46	with other parameters so that the hmm is free to model the phonetic variability and in this case typically what's most
0:18:51	often done is the noise or environmental parameters are updated on a per utterance basis rather than a
0:18:56	per speaker basis because there are few parameters and so you're able to estimate those
0:19:00	well and what comes out is a noise adapted hmm and the nice thing here again is because you can
0:19:06	do this potentially in the first pass you don't need to keep the environment independent or noise independent model
0:19:12	around like you do in speaker adaptive training you can directly operate all the time on the noise adapted hmm
0:19:18	here are some results with noise adaptive training
0:19:21	this is with noisy multi-style training data you can see this is the result for cmn just cepstral mean normalisation
0:19:28	now we try to apply the vts algorithm which assumes the model is clean in this case now the assumption
0:19:34	is broken and so we still
0:19:37	improve over the baseline but the results are not nearly as good and then we get back
0:19:41	to getting nice gains when we actually do this adaptive training and we see similar performance on the aurora three
0:19:46	task the interesting thing there is actually because that's real data collected in a car
0:19:51	there is no clean data to train on and so you actually need an approach like this to
0:19:56	run a technique like this successfully on a
0:19:59	corpus like this
0:20:02	so
0:20:04	to summarise the first
0:20:06	side of the triangle reduce
0:20:10	so model adaptation as you all know can reduce environmental mismatch
0:20:14	when you impose this environmental structure determined by the model the adaptation is incredibly data efficient if you think
0:20:20	about it in general you need
0:20:22	an estimate of the noise an estimate of your
0:20:25	noise mean your noise variance and potentially your channel mean so that's basically thirty nine plus thirty nine you
0:20:30	know it's
0:20:31	a hundred and twenty parameters to estimate which is really very little and you know you could even for example
0:20:36	if you assume that your noise is stationary then you can actually eliminate even the delta and delta delta features
0:20:42	of your noise
0:20:44	and you're only estimating static features so that's even fewer parameters
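
The parameter count mentioned here can be checked with simple arithmetic. This sketch assumes that the noise mean, noise variance and channel mean are each one vector of the feature dimension, and that the 39 dimensions are 13 static cepstra plus their deltas and delta deltas; the talk states only the totals, so this layout is an assumption.

```python
FEATURE_DIM = 39  # assumed: 13 static + 13 delta + 13 delta-delta coefficients
STATIC_DIM = 13   # assumed: static coefficients only

def environment_parameter_count(dim):
    """Total size of the noise mean, noise variance and channel mean vectors."""
    return {"noise mean": dim, "noise variance": dim, "channel mean": dim}

full = sum(environment_parameter_count(FEATURE_DIM).values())
static_only = sum(environment_parameter_count(STATIC_DIM).values())
print(full)         # 117, roughly the hundred and twenty quoted in the talk
print(static_only)  # 39 if every vector is restricted to static coefficients
```

Either count is tiny next to the number of gaussian parameters in the acoustic model, which is why a single utterance can be enough to estimate the environment.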
0:20:46	doing the adaptation unfortunately is computationally quite a burden i mean adapting every gaussian in your
0:20:55	system is probably overkill to do on an utterance-by-utterance basis but you can improve the performance by using regression classes as shown
0:21:03	by i think well as work
0:21:06	the other thing is that we can reduce environmental variability in the final model we have
0:21:10	by doing this noise adaptive training and this is helpful when we're in scenarios where there's not much data to
0:21:15	work with
0:21:15	some other considerations although i've shown ml systems this can be integrated with discriminative training
0:21:21	and there's a huge sort of parallel literature to this where the same exact algorithms are used in the front-end
0:21:27	where you replace the hmm with a gmm you do this as a front-end feature enhancement scheme it's basically
0:21:32	the same exact operation with the goal of generating an enhanced
0:21:35	version of the cepstra
0:21:37	and
0:21:38	those algorithms use the exact same sort of mathematical framework and the nice thing there is that
0:21:43	if the data that you work with is noisy you can also do the same adaptive training technique
0:21:48	on the front-end gmm and
0:21:51	still use those techniques
0:21:54	so well
0:21:59	now i wanna move on from reduce to recycle
0:22:02	and in this case what i'm gonna talk about is
0:22:05	changing gears from noise to channel
0:22:07	and talk about how we can recycle narrowband data that we have
0:22:11	i think it's not a very controversial statement to say
0:22:16	that now voice over data is replacing voice over the wire
0:22:20	and when you do this now because you know especially in speech applications you're speaking to some
0:22:26	smart phone your voice is not you know making a telephone call anymore it's going over the data network
0:22:30	to some server
0:22:31	when you do that there's no constraint on the capture bandwidth so you can basically capture you know subject to
0:22:36	bandwidth constraints or
0:22:38	latency constraints you can basically capture arbitrary bandwidth and in this case where
0:22:43	possible wideband data is preferable
0:22:46	gains do vary you know when you build an equivalent system with narrowband or wideband data but they are consistent
0:22:52	for example if you look at a car
0:22:56	the gains you get are larger in that context because a lot of the noise in cars is
0:22:59	at low frequencies sort of the rumble of the highway and the tires creates a lot of low
0:23:03	frequency noise so having the high frequency energy in the plosives and affricates is really helpful for discriminability
0:23:09	and of course it's also sort of becoming the standard for just human communication wideband codecs
0:23:17	a m r is the european standard and skype now is going to a wideband codec or even an ultra wideband
0:23:22	codec so the fact that people prefer it sort of also implies that machines would probably prefer it
0:23:27	as well
0:23:29	that said there are existing stockpiles of narrowband data from all the systems we've been building over the years and for many
0:23:35	low resource languages in the developing world mobile phones still are prevalent and i don't think they're gonna go
0:23:40	away that soon so we want the ability to do something useful with that data
0:23:46	so what i'd like to propose is a way to use the narrowband data to help augment
0:23:53	some wideband data we have in data scarce scenarios to build a better wideband acoustic model and the inspiration for this
0:24:00	came from the signal processing literature maybe ten or fifteen years ago people proposed bandwidth extension for speech processing
0:24:06	sort of again it comes from the fact that we know
0:24:09	that people prefer
0:24:11	wideband speech it turns out it's not any more intelligible unless you're looking at isolated phones it's actually
0:24:16	both are equally intelligible but things like listener fatigue and just personal preference come across as
0:24:23	much better for wideband
0:24:26	speech and so the way these algorithms operated
0:24:29	was that they basically learned correlations between the low and high frequency spectrum of the signal so here's
0:24:35	just
0:24:36	a poor first grade drawing version
0:24:39	of a spectrum i'd like to say that my four year old did this but i did it myself
0:24:45	so this is sort of you know what a vowel looks like about like this i was going for that with a
0:24:48	couple of formants so if i ask you guys to predict
0:24:51	what is sort of on the other side of the line
0:24:55	you know you'd maybe predict something like that it seems pretty reasonable probably you know you may get a
0:24:58	different slope or a formant maybe in a different location but it's not for example gonna go up you
0:25:03	know you would doubt that it would
0:25:05	and so what we can do is basically use something like a gaussian mixture model to learn
0:25:09	the gaussian dependent mappings from low to high band spectra
0:25:14	and then a simple thing we could do is to say let's just generate wideband features from narrowband features
0:25:18	and if you're familiar with the missing feature literature this is basically like missing features
0:25:25	you say i have some
0:25:27	components of my features that are too corrupted by noise i'm just going to remove them and then try to fill them
0:25:32	in from the surrounding reliable data this is like doing missing features with a deterministic mask given by the
0:25:37	telephone
0:25:39	you're simply taking some amount of wideband data
0:25:43	some potentially large amount of narrowband data you're trying to convert that narrowband data into pseudo wideband features and going on
0:25:50	to train an acoustic model that way
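
The mapping from low band to high band features can be sketched with a small joint gaussian mixture. This is a generic gmm regression in one dimension per band, offered as an illustration rather than the system from the talk; the two components and all of their parameter values are made up.

```python
import math

# Hypothetical two-component joint GMM over (low band, high band) log energies.
# Each entry: (weight, mean_low, mean_high, var_low, cov_low_high). Values are made up.
GMM = [
    (0.5, 8.0, 3.0, 1.0, 0.5),  # "vowel-like": strong low band, weak high band
    (0.5, 3.0, 7.0, 1.0, 0.3),  # "fricative-like": weak low band, strong high band
]

def gauss(v, mean, var):
    return math.exp(-0.5 * (v - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def predict_high_band(low):
    """Point estimate of the missing high band given the observed low band."""
    # Component posteriors are computed from the observed low band only.
    likes = [w * gauss(low, m_lo, v_lo) for w, m_lo, _, v_lo, _ in GMM]
    total = sum(likes)
    estimate = 0.0
    for like, (_, m_lo, m_hi, v_lo, cov) in zip(likes, GMM):
        # Conditional mean of the high band within this component.
        conditional_mean = m_hi + cov / v_lo * (low - m_lo)
        estimate += like / total * conditional_mean
    return estimate

for low in (8.0, 5.5, 3.0):
    print(low, predict_high_band(low))
```

The output is a single value per frame with no indication of how reliable it is, so a model trained on such pseudo wideband features treats good and poor estimates alike.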
0:25:53	so this actually works okay works pretty well and here's an example
0:26:00	this is a wideband log mel-spectrogram
0:26:03	on the left and this is that same speech but through a telephony channel you can see obviously the information below
0:26:09	three hundred hz and above thirty four hundred hz
0:26:11	has gone missing so to speak and the idea of this bandwidth extension in the feature domain is to say
0:26:16	can we do something to fill it back in
0:26:19	and in this particular case of course it's not perfect
0:26:21	but you know where there's red it's generally red in the other picture so we're
0:26:25	capturing
0:26:26	the gross features of the data and we could use that then to train our system
0:26:31	so this is good but the downside is that if you do it this way in the feature domain you
0:26:34	end up with a point estimate of what your wideband feature should be and if that estimate is poor or it's
0:26:39	wrong you know
0:26:42	things like that you really have no way of informing the model during training to not use that data as
0:26:48	much as maybe other estimates that may be more reliable and so to get this to work you have to do
0:26:53	some ad hoc things like corpus weightings to say okay we have a little bit of
0:26:57	wideband data but i'm gonna count those statistics much more heavily than the
0:27:01	statistics of my narrowband data which i have extended and therefore don't trust quite as much so it's not
0:27:06	theoretically optimal
0:27:09	and as a result you know a better idea would be you know we could incorporate
0:27:14	this into the em algorithm directly so normally when you train an hmm
0:27:18	the state sequence is the hidden variable so you can think of this as doing the exact same
0:27:22	thing but you're adding additional hidden variables for all the missing frequency components that you don't have in the telephone
0:27:27	channel
0:27:30	so if you do this you get something that looks like this where the narrowband data goes directly into
0:27:34	the training procedure with the wideband data you have this expanded em algorithm and what comes out is
0:27:39	a wideband hmm now i'm not gonna try to go into too many details and i really try to keep
0:27:43	equations to a minimum but i just want to point out
0:27:46	a few notable things this is the variance update equation and a few things that are interesting i think
0:27:51	about this update equation are
0:27:55	first of all if you look at the sorry i should mention the notation i've adopted here is from the
0:28:00	missing feature literature so O is something that you would observe and M is something that's missing so you consider O
0:28:05	to be the
0:28:06	telephone band frequency components and M to be the missing high-frequency components you're trying to model in your
0:28:12	hmm
0:28:13	the second thing is that the posterior computation is only computed over
0:28:17	the low band that you have over the observed bands you've actually marginalised out the components you don't have over all
0:28:22	your models and so therefore erroneous estimates that you make in this process don't corrupt your posterior calculations because you're
0:28:28	only computing posteriors based on reliable information that you know is
0:28:31	there
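A minimal sketch of computing posteriors from the observed band only, assuming diagonal-covariance gaussians so that marginalising out the missing dimensions amounts to dropping them (with full covariances one would take the observed sub-block instead). The component parameters and the function names are hypothetical, not taken from the talk.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at x."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def posteriors_from_observed(obs, components, n_obs):
    """Component posteriors computed from the observed band only.

    components: list of (prior, mean, var) over the full band, where the
                first n_obs dimensions are observed and the rest are missing.
    """
    logs = [math.log(prior) + log_gauss_diag(obs, mean[:n_obs], var[:n_obs])
            for prior, mean, var in components]
    top = max(logs)  # subtract the max for numerical stability
    unnorm = [math.exp(value - top) for value in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Two hypothetical components: 2 observed low-band dims, 1 missing high-band dim.
components = [
    (0.5, [0.0, 0.0, 5.0], [1.0, 1.0, 1.0]),
    (0.5, [2.0, 2.0, -5.0], [1.0, 1.0, 4.0]),
]
obs = [1.8, 2.1]
post = posteriors_from_observed(obs, components, n_obs=2)
```

Because the missing-band parameters never enter the computation, a poor estimate of the high band cannot change which component a frame is assigned to.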
0:28:32	the other interesting thing is that
0:28:34	rather than having an estimate that's global across all your data you actually have a state-conditional estimate
0:28:41	where the estimate of the wideband feature is determined by the observation at time T as well as the state you're
0:28:46	in and so this says
0:28:48	the extended wideband feature i have here can be a function of both the data i see as well
0:28:53	as whether i'm in a vowel or a fricative or a plosive
0:28:57	for example
0:28:58	and finally there's this variance piece at the end here which then says in general for this particular gaussian
0:29:06	how much uncertainty overall is there in trying to do this mapping so maybe in a case where
0:29:10	doing this mapping is really hard because there's very little correlation between the low and high frequencies we will have high
0:29:16	variance there so that the model itself can reflect the fact that
0:29:22	the estimates that we're using may be poor
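The state-conditional estimate and its variance term can be sketched with the textbook conditional of a joint gaussian. This assumes one observed dimension and one missing dimension per state; the talk does not show its equations in the transcript, so the parameter values, state names and function name below are hypothetical.

```python
def conditional_missing(o, mu_o, mu_m, var_o, var_m, cov_om):
    """Mean and variance of the missing component m given the observed o.

    Standard Gaussian conditioning for a joint Gaussian over (o, m):
        E[m | o]   = mu_m + cov_om / var_o * (o - mu_o)
        Var[m | o] = var_m - cov_om ** 2 / var_o
    """
    gain = cov_om / var_o
    mean = mu_m + gain * (o - mu_o)
    var = var_m - gain * cov_om
    return mean, var


# Hypothetical per-state parameters (made-up numbers for illustration).
states = {
    # strong low/high correlation: confident estimate, small variance
    "vowel": dict(mu_o=2.0, mu_m=1.0, var_o=1.0, var_m=1.0, cov_om=0.9),
    # no correlation: the estimate falls back to the state mean, full variance
    "fricative": dict(mu_o=0.0, mu_m=3.0, var_o=1.0, var_m=2.0, cov_om=0.0),
}

o_t = 3.0  # the same observation gives a different estimate in each state
estimates = {name: conditional_missing(o_t, **p) for name, p in states.items()}
```

The variance returned for the uncorrelated state stays at its prior value, which is the case the speaker describes: the model carries the uncertainty instead of trusting a poor point estimate.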
0:29:26	so if we look at the performance here we've taken a wall street journal task we basically took the
0:29:30	training data and partitioned it into a wideband set and a narrowband set at some proportion
0:29:36	and so the idea is that if you look at the performance of the wideband data that's the lower line
0:29:40	it's about ten percent
0:29:42	and if you take the entire system and sort of telephonize it all you end up with the upper
0:29:46	curve the purple curve that's sort of the narrowband system the goal of this is to say given
0:29:52	some wideband data and extending the rest of the narrowband data how much can we close that gap
0:29:57	so this is comparing the results of the feature version and the model domain version and
0:30:01	so we can see that we have a split of at twenty
0:30:04	the performance is about the same and so in that case you know why go through all the extra computation
0:30:08	the feature
0:30:09	version works quite well interestingly once you go to a more extreme case where only ten percent of the training set
0:30:14	is actually wideband and the rest is narrowband doing the feature version is actually worse than just
0:30:19	training an entire narrowband system
0:30:22	because there's lots of uncertainty in the extension that you do in the front end which is not reflected in
0:30:26	your model at all but if we do the training in this integrated framework
0:30:31	we end up with you know a performance that again is better than or equal to all narrowband
0:30:39	so
0:30:44	to talk about this last part of the second prong of the triangle here recycle
0:30:51	potentially narrowband data can be recycled for use as wideband data this may allow us to use the existing piles
0:30:58	of legacy data we have
0:30:59	and for an initial system where we want to build a wideband system narrowband data may be easier to collect
0:31:05	and it may be simpler to just collect a small amount of wideband data
0:31:08	you can do this in the front end or we can come up with this sort of integrated training framework
0:31:13	and
0:31:14	like the noise robustness case there is a front-end version that i talked about and there are advantages to that
0:31:19	i should sort of mention
0:31:21	so it allows you if you do this in the front end you can use
0:31:25	whatever features you want you can then take the post-processed features and use bottleneck features
0:31:29	stack a bunch of frames and do hlda and so you have a little bit more flexibility in what you wanna
0:31:32	do downstream from this process
0:31:34	and the other interesting thing is that the same technology can be used in the reverse scenario where the input
0:31:41	may be narrowband and the model is actually wideband
0:31:45	you may wonder where this happens but this actually happens in systems a lot as soon as someone puts
0:31:50	on a bluetooth headset
0:31:51	you could have a wideband deployed system somebody decides that they wanna you know be safe and hands-free and they put
0:31:57	on a bluetooth headset and all of a sudden what comes into your system is
0:31:59	narrowband if you don't do something about it you're gonna get killed
0:32:05	sorry your performance is gonna get killed but anyway your performance is gonna suffer
0:32:12	and so you know one option is to maintain two models on your server the other idea is you can
0:32:17	actually do bandwidth extension in the front end and process that with a wideband recognizer the nice thing there
0:32:23	is you don't have to be as good as true wideband performance
0:32:27	you just have to be better than or as good as what the narrowband performance would be and
0:32:31	then it's worth it to do that
0:32:33	so
0:32:34	finally i'd like to move on to the last component here reuse
0:32:44	and talk about the reuse of speaker transforms
0:32:49	so
0:32:50	one of the things that we found
0:32:52	is that
0:32:54	the utterances in the applications that are being deployed commercially now are really short
0:32:59	and so
0:33:00	you know in seattle obviously people search for starbucks quite a bit
0:33:05	or movie show times or in the living room scenario xbox play may be all that you know the only thing
0:33:10	you get
0:33:11	in addition to that these are generally not rich dialogue interactive systems these are sort of one shot
0:33:16	things where you speak you get a result and you're done
0:33:19	so the combination of these two things
0:33:22	makes it really difficult to obtain sufficient data for doing conventional speaker adaptation from a single session of use
0:33:28	so doing things like mllr or cmllr becomes quite difficult in a single utterance
0:33:33	case and so
0:33:35	an obvious solution to this is to say well let's just accumulate the data over time across sessions we have
0:33:39	users who are you know making multiple queries to the system
0:33:43	let's aggregate it all together and then we'll have enough data to build a transform
0:33:49	the
0:33:50	difficulty comes in because these are now largely applications on mobile phones it means the people are obviously mobile
0:33:57	too
0:33:58	and across all these different uses they're actually in different environments
0:34:02	that creates additional variability in the data that we accumulate over time and so in my
0:34:10	by numbers i guess i would say or you know a metaphor here let's imagine a user
0:34:15	calls the system and the observation comes in as Y and that's some combination of the phonetic content which i'm
0:34:21	showing as a white box
0:34:22	some speaker-specific information shown as a blue box and
0:34:27	some you know environmental background information as the red
0:34:31	so the system gets the speech and says oh okay well let me perform adaptation and store away the transform
0:34:37	so the next time this user calls you know it will be loaded up and ready to go
0:34:41	so sure enough some time later the user calls back
0:34:45	and the phonetic content you know may or may not be the same
0:34:47	the speaker
0:34:49	is the same
0:34:50	but now you know he or she is in a different location or different environment and so the observation is
0:34:56	now green instead of purple and as a result we can do adaptation on the model using the stored transform
0:35:01	but mismatch persists this is not optimal
0:35:04	and so what we would like is
0:35:09	a solution where
0:35:10	the variability when we do something like adaptation can be separated or teased apart
0:35:16	so that we can say let's just hold onto the part that's related to the speaker and sort of throw away
0:35:21	the part that's the environment or better yet store the part that's for the environment so that when we see a different user call
0:35:27	back from that same environment we can actually use that as well
0:35:30	so in order to do this sort of factorisation or separation of the different sources of variability
0:35:38	you actually need an explicit way to do joint compensation it's very hard to separate these things if you
0:35:43	don't have a model that explicitly models them
0:35:46	as
0:35:46	individual sources of variability
0:35:49	and so to do this there's
0:35:52	several pieces of work that have been proposed it's sort of like being at a diner and you sort of
0:35:57	get to choose one from column A and one from column B you can sort of take all the you know
0:36:01	all your favourite speaker adaptation algorithms and you can take
0:36:04	all the schemes that apply for environmental adaptation pick one from each and combine them and then you
0:36:09	can have a usable model the interesting thing is that this was sort of proposed
0:36:15	ten years ago
0:36:17	but as far as i can tell with the exception of joint factor analysis in two thousand five
0:36:22	there's not been that much work on it since and now it sort of seems to have sort of come
0:36:25	on the scene again which is good i think it's good to have more people
0:36:31	working on this you know
0:36:33	is
0:36:33	so
0:36:35	of all these possible combinations of methods that can do this
0:36:40	joint compensation i'm gonna talk about one particular instance
0:36:45	of using cmllr transforms mostly because i've already talked about how vts is used and there are
0:36:52	several different
0:36:54	ways you can go about doing compensation for noise
0:36:57	so in this case we're gonna talk about the idea that you can use a cascade of cmllr transforms
0:37:02	one that captures environmental variability and one that captures speaker variability
0:37:06	the thing about using transforms like this is that we give up the benefit of all the structure we
0:37:11	had in an environmental model using solutions like vts
0:37:14	but we get the ability to have much more flexibility meaning that we have no restriction on what
0:37:20	features we can use or what data the system is trained from and we don't need to do these
0:37:24	adaptive training schemes like noise adaptive training
0:37:29	the idea is quite simply to find the transforms a set of environmental transforms and a set
0:37:35	of speaker transforms that maximise the likelihood given a sample of training or adaptation data
0:37:40	now of course you know it's not hard to see that this cascade of linear transforms is itself a linear
0:37:45	transform
0:37:46	and as a result you can take a linear transform and factor it into two separate transforms in an arbitrary
0:37:52	number of ways many of which will
0:37:55	not be meaningful and so the way that we're gonna get around this is to borrow heavily from the key
0:38:01	idea i think in joint factor analysis from speaker recognition which is to say let's learn the transformations on partitions
0:38:08	of the training data where we're able to sort of isolate the variability that we want
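The point about the cascade collapsing into one linear transform, and about the factorisation not being unique, can be checked with a small sketch. It assumes affine feature transforms of the form A x + b with the environment transform applied first; that ordering, the matrix values and the helper names are assumptions made for illustration.

```python
def matvec(a, x):
    """Multiply matrix a (list of rows) by vector x."""
    return [sum(a[i][j] * x[j] for j in range(len(x))) for i in range(len(a))]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def affine(a, b, x):
    """Apply the affine transform a x + b."""
    return [v + bi for v, bi in zip(matvec(a, x), b)]


def compose(a_spk, b_spk, a_env, b_env):
    """Return (A, b) such that A x + b == A_spk (A_env x + b_env) + b_spk."""
    a = matmul(a_spk, a_env)
    b = [v + bi for v, bi in zip(matvec(a_spk, b_env), b_spk)]
    return a, b


# Hypothetical environment and speaker transforms on 2-D features.
a_env, b_env = [[1.0, 0.5], [0.0, 1.0]], [1.0, -1.0]
a_spk, b_spk = [[2.0, 0.0], [1.0, 1.0]], [0.5, 0.0]
x = [3.0, 4.0]

cascaded = affine(a_spk, b_spk, affine(a_env, b_env, x))
a_all, b_all = compose(a_spk, b_spk, a_env, b_env)
single = affine(a_all, b_all, x)

# A different factorisation of the same overall transform: move a factor of 2
# from the speaker transform into the environment transform.
a_env2 = [[2.0 * v for v in row] for row in a_env]
b_env2 = [2.0 * v for v in b_env]
a_spk2 = [[0.5 * v for v in row] for row in a_spk]
a_all2, b_all2 = compose(a_spk2, b_spk, a_env2, b_env2)
```

Both factorisations give the same overall transform, so the likelihood alone cannot tell them apart; this is why the estimation has to be tied to partitions of the data.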
0:38:13	so pictorially
0:38:14	it's a bit of a busy slide apologies if it
0:38:17	gives you a headache but
0:38:19	you can think about the idea that you're basically gonna group the data by speaker and given those groups
0:38:25	you can update your speaker transforms
0:38:27	then you're gonna repartition your data by environment keep the speaker transforms fixed and update your environment transforms and then go
0:38:32	back and forth in this manner now of course
0:38:35	doing this operation assumes that you have a sense of what your speaker clusters are and what your environment
0:38:41	clusters are
0:38:43	there are some cases where it sounds reasonable to assume the labels are given to you so for example
0:38:48	if it's a
0:38:49	phone you know a mobile phone data plan scenario you can have a caller id or a user id
0:38:55	or the hardware address and so you can have high confidence that you know who the speaker is similarly for
0:39:00	certain applications
0:39:02	like the xbox in the living room we can certainly say okay this thing is
0:39:06	not driving in a car at sixty miles an hour it's probably in the living room so we can assume
0:39:10	the environment in that case or if we don't have this information you can really do environment clustering algorithms or
0:39:16	speaker clustering
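The back-and-forth estimation can be sketched with a deliberately simplified stand-in. The talk estimates cmllr matrices by maximising likelihood under the acoustic model; the toy below instead assumes scalar features, bias-only transforms (y is roughly a speaker offset plus an environment offset) and least-squares updates, with speaker and environment labels given. All names and numbers are hypothetical.

```python
from collections import defaultdict


def alternate_bias_estimation(data, n_iters=20):
    """Alternately estimate per-speaker and per-environment offsets.

    data: list of (speaker, environment, y) tuples with scalar y.
    """
    spk = defaultdict(float)
    env = defaultdict(float)
    for _ in range(n_iters):
        # Group by speaker, hold environment offsets fixed, update speaker offsets.
        groups = defaultdict(list)
        for s, e, y in data:
            groups[s].append(y - env[e])
        for s, vals in groups.items():
            spk[s] = sum(vals) / len(vals)
        # Repartition by environment, hold speaker offsets fixed, update them.
        groups = defaultdict(list)
        for s, e, y in data:
            groups[e].append(y - spk[s])
        for e, vals in groups.items():
            env[e] = sum(vals) / len(vals)
    return dict(spk), dict(env)


# Made-up data: each speaker is seen in each environment.
data = [
    ("alice", "car", 1.5), ("alice", "cafe", 0.5),
    ("bob", "car", -0.5), ("bob", "cafe", -1.5),
]
spk, env = alternate_bias_estimation(data)
residual = max(abs(y - spk[s] - env[e]) for s, e, y in data)
```

Once the offsets are separated, the speaker part can be stored and reused when the same speaker shows up in a different environment, which is the reuse scenario described in the talk.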
0:39:18	and so
0:39:19	yeah just to show some results here
0:39:24	the idea is you can again take
0:39:28	take the training data that's let's say from a variety of environments and a variety of speakers
0:39:32	and estimate some environment transforms on the training data
0:39:35	to do that of course you have to estimate the speaker transforms as well but in this case
0:39:40	the speakers in training and test are distinct and so the speaker transforms are not useful for us in the
0:39:45	reuse scenario
0:39:46	and so what we've tried here is to say let's estimate the speaker transform
0:39:50	given data from a single environment in this case it's the subway
0:39:54	we can take that
0:39:56	transform and either estimate it in this way where the sources of variability are factored
0:40:00	or estimate it using the sort of conventional cmllr approach and apply it to data from the same speaker in six different environments
0:40:09	three of which are environments that were seen in training and three of which
0:40:12	are not seen
0:40:14	and you can see in both cases you get a benefit by having the additional transform in there to absorb
0:40:21	the variability from the noise so that the speaker transform can focus on just the variability
0:40:27	that comes from the speaker that you care about and so you can see there's a gain over doing cmllr alone and
0:40:32	that comes again from the fact that this transform is presumably not
0:40:38	learning the mapping of the environment plus the speaker it's ideally learning the transform of just the speaker alone
0:40:46	so
0:40:49	in
0:40:50	scenarios where speaker data is scarce
0:40:54	reuse is important for adaptation
0:40:59	now if this is a case where each utterance is you know ten or fifteen or twenty seconds these techniques
0:41:04	are not nearly as important but in the case where you only have a second or two of data you wanna
0:41:07	be able to aggregate all this data and build a model for that speaker
0:41:11	and if that data comes from
0:41:14	places
0:41:15	where there's a high degree of variability from other sources
0:41:18	the problem becomes a little more challenging
0:41:20	and this can be environments it can be devices
0:41:24	you know you have data that's
0:41:27	being held up like this and then you have far field data then you have additional data that's four
0:41:32	feet away on your couch
0:41:33	all these things are all different different microphones all these sources are things that are basically blurring
0:41:39	the speaker transform you're trying to learn and you wanna be able to isolate those in order to reuse the speaker transform
0:41:44	so doing this basically allows a secondary transform to absorb this unwanted variability
0:41:50	and
0:41:52	there are various ways of doing it you know obviously if you have sort of
0:41:57	transforms that are specifically modeling different things explicitly it'll be easier to get the separation if you use things
0:42:02	like
0:42:03	two linear transforms then you need to sort of resort to these data partitioning schemes
0:42:09	which you know
0:42:10	makes things a little bit more difficult
0:42:12	so
0:42:14	here i've just tried to hit a little bit on you know three aspects of speech recognition
0:42:19	going green in this reduce reuse recycle framework before i conclude i just wanted to sort of touch on i think
0:42:27	you know
0:42:28	as someone who's worked you know pretty strongly i guess in robustness and these ideas i sorta wanna
0:42:34	talk about there seem to be i guess three personalities that people sort of take on and
0:42:39	so i wanna sort of address
0:42:41	and you may find yourself thinking i'm one of these personalities in turn
0:42:44	and so i wanna sort of address each of those and so i think there's people who are the believers
0:42:49	there's people who are
0:42:50	the sceptics
0:42:51	and those people who i'll call the willing which are sort of the people who say oh well maybe
0:42:55	i'll give this a try and you know i think
0:43:00	i think about sort of the resurgence in neural net acoustic modeling as a good example of
0:43:04	this and maybe some auditory inspired signal processing is another example where
0:43:08	there were true believers in sort of acoustic models using neural nets then they found that they couldn't beat
0:43:13	an hmm
0:43:14	you know and put that aside and then you know things kind of improved and people said i would give
0:43:19	this a try again they moved from being sceptics to the willing
0:43:22	now they got good results and they're all believers again and so i think i wanna sort of talk about
0:43:28	these very briefly so to the sceptics i would sort of say yes you know one thing
0:43:32	that i think is interesting is there's been research in robustness in speech recognition going on for a long time
0:43:38	lots of sessions
0:43:40	lots of papers
0:43:43	but if you look at the tasks that have become standard for robust speech recognition like the ones i talked
0:43:47	about today they're all very small
0:43:49	by the standards of today's state-of-the-art systems compared to things like switchboard and gale and meeting recognition
0:43:54	and in these very large scale systems like switchboard and gale and meetings
0:43:58	robustness techniques are not really a part of the puzzle there and so i think it's very fair to say
0:44:03	are all these methods really necessary in any sort of
0:44:06	deployed system and to that i would just say yes it depends and i sorta wanna
0:44:10	give a few very anecdotal examples to sort of motivate why i think this is
0:44:16	if you think of production quality systems that do have all the bells and whistles that
0:44:22	everyone knows about that are common in large scale systems
0:44:26	we see in things like voice search you know in fact the gains are small and so you know it's
0:44:30	not really a huge win to employ these techniques and so it's a fair critique to say we don't need
0:44:35	we don't need robustness
0:44:36	as you move to something like the car it turns out that actually the gains are pretty big
0:44:41	and you know systems can be much more usable by incorporating some elements of
0:44:47	noise robustness
0:44:49	into your system
0:44:51	finally i would actually say with the xbox and kinect
0:44:54	it turns out that actually i would say these systems are actually unusable
0:44:57	if you know if i consider robustness as the entire sort of audio processing front-end plus whatever happens
0:45:03	in the recognizer
0:45:04	if we
0:45:05	throw all that away and just use a single microphone to listen and do everything in the model space the systems are
0:45:09	actually unusable
0:45:11	and so there actually is a large place for this
0:45:13	technology in certain scenarios
0:45:15	skipping
0:45:16	now to the willing so if someone says well you know what's the easiest thing to try
0:45:22	the easiest thing to try is noise adaptive training and sort of the biggest bang for the buck is what
0:45:27	i would say li deng called noise adaptive training in the feature space
0:45:32	the idea is very simple that you have some training data
0:45:35	you believe you have some way to enhance the data at run time you need to take the training data and just
0:45:39	pass it through the same exact process and retrain your acoustic model you know you can think that this is
0:45:44	basically very akin to doing cmllr for speaker adaptive training you're basically updating your features
0:45:50	before you retrain your model it turns out that if you do this you get performance
0:45:55	that generally is far superior to trying to compensate noisy speech to recognise with a clean trained hmm
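A toy sketch of feature-space noise adaptive training as described here: the same enhancement is applied to the training data and to the test data, and the model is retrained on the enhanced features. Everything in it is an assumption made for illustration: the one-dimensional features, the nearest-mean classifier standing in for the acoustic model, and the hypothetical `enhance` front end that removes the noise offset but also attenuates the speech.

```python
def add_noise(x):
    """Hypothetical environment: adds a constant offset to the clean feature."""
    return x + 3.0


def enhance(x):
    """Hypothetical imperfect front end: removes the offset but halves the signal."""
    return 0.5 * (x - 3.0)


def train(features, labels):
    """Stand-in acoustic model: one mean per class."""
    means = {}
    for label in set(labels):
        vals = [f for f, l in zip(features, labels) if l == label]
        means[label] = sum(vals) / len(vals)
    return means


def accuracy(means, features, labels):
    """Classify each feature by its nearest class mean and score against labels."""
    hits = sum(1 for f, l in zip(features, labels)
               if min(means, key=lambda c: abs(f - means[c])) == l)
    return hits / len(labels)


train_clean = [-0.5, 0.0, 0.5, 3.5, 4.0, 4.5]
train_labels = [0, 0, 0, 1, 1, 1]
test_clean = [-0.25, 0.25, 3.5, 4.5]
test_labels = [0, 0, 1, 1]

# At run time the recognizer only ever sees enhanced noisy speech.
test_enhanced = [enhance(add_noise(x)) for x in test_clean]

# Clean-trained model applied to enhanced test features (mismatched).
clean_model = train(train_clean, train_labels)
acc_clean = accuracy(clean_model, test_enhanced, test_labels)

# Feature-space noise adaptive training: same front end on the training data.
train_enhanced = [enhance(add_noise(x)) for x in train_clean]
nat_model = train(train_enhanced, train_labels)
acc_nat = accuracy(nat_model, test_enhanced, test_labels)
```

In this toy the retrained model absorbs the distortion the front end introduces, while the clean-trained model is thrown off by it.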
0:46:02	and if you are gonna try this i think you know the standard algorithms are fine things like spectral subtraction i
0:46:07	mean
0:46:08	the fancier ones are great but the improvement is small i think getting the basics working is important
0:46:14	but the important thing is you need to sort of optimize the right objective function i've had you know
0:46:19	talked to people who say oh we got you know a spectral subtraction component from
0:46:23	my friend who's in the speech enhancement part of our lab and i just tried it and it you
0:46:26	know it didn't work at all and the reason is that these things are optimized completely differently and so
0:46:31	we need to really you know
0:46:32	you do need to understand all the details and nuances of what's happening but generally there's a whole
0:46:36	set of parameters and floors and weights
0:46:39	and things
0:46:40	and those things can all be tuned and you can tune them you know to
0:46:44	minimize word error rate and that would be great you can do that in a greedy way let's just
0:46:48	sweep a whole bunch of parameters until we get the best
0:46:50	you can also use something called pesq which is a computational proxy it stands for the perceptual evaluation of speech
0:46:56	quality it's basically like a model of what
0:46:59	human listeners would say it turns out that that's actually quite correlated to speech recognition performance and so if
0:47:05	you can maximise that or if your signal processing buddies have some algorithm that maximizes pesq that's
0:47:12	a good place to start and it turns out that doing things like maximizing snr is often the worst thing you can
0:47:16	do it creates all kinds of
0:47:17	distortion
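A minimal sketch of the parameter sweep described here, assuming a basic power spectral subtraction with an over-subtraction factor `alpha` and a spectral floor `beta`. The objective is a stand-in: a real sweep would score each setting by word error rate on a development set or by pesq, as the talk suggests, whereas this sketch uses squared error against a known clean spectrum only so that it runs on its own. All names and numbers are hypothetical.

```python
def spectral_subtract(noisy_power, noise_power, alpha, beta):
    """Per-bin power subtraction with over-subtraction alpha and floor beta."""
    return [max(y - alpha * n, beta * y) for y, n in zip(noisy_power, noise_power)]


def sweep(noisy_power, noise_power, objective, alphas, betas):
    """Exhaustively try every (alpha, beta) pair; lower objective is better."""
    best = None
    for alpha in alphas:
        for beta in betas:
            score = objective(spectral_subtract(noisy_power, noise_power, alpha, beta))
            if best is None or score < best[0]:
                best = (score, alpha, beta)
    return best


# Made-up power spectrum for one frame.
clean = [4.0, 0.0, 9.0, 1.0]
noisy = [c + 1.0 for c in clean]   # true noise power is 1.0 per bin
noise_estimate = [0.5] * 4         # the noise tracker underestimates the noise


def stand_in_objective(enhanced):
    """Placeholder for word error rate or negated pesq on real data."""
    return sum((e - c) ** 2 for e, c in zip(enhanced, clean))


best_score, best_alpha, best_beta = sweep(
    noisy, noise_estimate, stand_in_objective,
    alphas=[0.5, 1.0, 2.0, 3.0], betas=[0.0, 0.1])
```

The best setting depends entirely on the objective, which is the speaker's point: a component tuned for a speech enhancement criterion is not necessarily tuned for recognition.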
0:47:20	so
0:47:21	with that i just want to conclude and say that we propose that potentially there's goodness to be
0:47:27	had by using existing data and you know we sort of put this under the mantra of going green
0:47:34	i've just tried in this case to provide you know at least one example of the way that
0:47:38	we can reduce recycle and reuse
0:47:41	the data that we have either from an environmental mismatch point of view a bandwidth point of view or a speaker
0:47:47	adaptation point of view there's many other
0:47:50	ways to do this i've just talked about a few and of course there's more work to be done
0:47:54	and so with that i will thank you
0:47:58	let's thank the speaker
0:48:05	oh we have plenty of time for questions
0:48:11	so mike thanks
0:48:13	great talk i was wondering if you can address
0:48:16	some other problems in the robustness area for example
0:48:22	there are many cases where there are gross nonlinear distortions that are applied to the
0:48:29	signal or strange things in the communication channel and what you talked about i mean
0:48:36	the transform techniques could obviously work on that but i'm wondering if you have any comments on what
0:48:42	to do when there are
0:48:43	gross nonlinear distortions of the signal where the signal is still basically intelligible but it doesn't fit any
0:48:51	of the classical speech plus noise models
0:48:55	well
0:48:56	the one thing i would say is
0:48:59	that
0:49:03	is a hard problem
0:49:04	i
0:49:07	thank you don't even agreed on it
0:49:11	so that feature space adaptive training technique
0:49:15	is actually generic across any kind of distortion so if you actually have the ability if you know what
0:49:19	that coding is or can model it somehow you can pass data through that that's probably the best way to
0:49:24	model it sort of
0:49:25	it's not very fancy but i think it'll work
0:49:30	the thing is a lot of these things are bursty
0:49:33	and i find that you can actually just detect them
0:49:36	by building you know whatever classifier and at that point you know you can for example say i'm
0:49:41	gonna you know compute my decoder score by just giving up on these frames and saying there's no content here that's
0:49:46	another way you can do it
0:49:49	i think sort of trying to have a model for you know
0:49:52	number in your garden in your
0:49:54	i think by won't work
0:49:55	and then like to believe that there is some you know that we can extend the linear transformation scheme to
0:50:00	nonlinear transformations like some kind of an L P
0:50:03	mllr kind of thing but you know that's remains to be seen and that that's again it does really quite
0:50:07	get it is sort
0:50:09	i think and are we talking of this or this occasional
0:50:11	gobbledygook that comes and i don't think that would really just that so i think those two other techniques are
0:50:15	so
0:50:21	i think the one thing that's interesting is the correlation between how people speak and the noise background and or
0:50:28	a kind of
0:50:29	what does that adding
0:50:31	right noise rather than so the lombard effect has the obvious
0:50:36	loud speech thing which is pretty easy to compensate for
0:50:40	you know we normalize stuff
0:50:42	but there's the lombard spectral tilt
0:50:45	which means that the louder the noise is the more vocal effort there is the more
0:50:51	tilt to the spectrum and all that sort of thing how do the techniques you're talking about address that
0:50:57	is the whole kind of different problem because
0:51:00	the environment
0:51:01	really doesn't
0:51:02	yeah unless you know the signal to noise ratio
0:51:06	straight
0:51:07	right so i think
0:51:11	what's interesting about those is those are
0:51:15	speaker
0:51:16	effects that are manifested by the environment
0:51:19	and so like you said having an environment model is not gonna capture that at all it's more like maybe
0:51:24	you may want to have some kind of you know so i don't know i don't have the exact
0:51:28	answer although i would think that having an environment informed
0:51:33	speaker
0:51:34	transform kind of thing would be useful so you know potentially
0:51:38	your choice of you know vtln warp parameters for example could be affected by what you perceive in the environment
0:51:44	and the level of speaker effort
0:51:45	you
0:51:46	you detect
0:51:49	and the other thing of course is sort of the
0:51:51	the poor man's answer would be you know i'm not sure how much of this can be modelled again it
0:51:56	by exist existing speaker adaptation techniques you know
0:52:00	again i think a lot of the effects end up being nonlinear
0:52:04	and so it's hard to
0:52:05	sweep under the rug with an mllr transform
0:52:08	but it's so i think i think that comes at it as you know incidents on that i was trying
0:52:11	to talk about
0:52:12	orthogonalisation of this
0:52:14	the speech and the noise and i think you're actually the opposite which is actually a jointly informed
0:52:20	transform which i think is a very enticing area
0:52:22	i don't imagine a way of too much work
0:52:31	you
0:52:32	might be greener features that came in were themselves insensitive
0:52:37	to some is just absolutely
0:52:40	so that would that would that would well if i if i agree with you that i'm through email talks
0:52:43	i can agree with you now
0:52:45	maybe at the coffee break i can agree with you but no i think that that's that that's
0:52:50	that's true right there's the whole and i think a lot of this comes with the biologically inspired kind of
0:52:56	features and i think that's true and i think actually in fact the work that
0:53:02	or elan's human kind of did kind of shows that if i
0:53:06	remember correctly they trained a deep net on aurora and got you know a high degree of noise robustness
0:53:11	just from the network
0:53:13	potentially learning some kind of
0:53:15	noise invariant
0:53:16	features
0:53:17	you know i think
0:53:19	right is right and so no i think that's true i don't know no i problems i think right now
0:53:23	where we are
0:53:24	it's hard to come up with sort of a one size fits all
0:53:28	scheme so there's one other thing
0:53:32	but that's about it
0:53:34	using gmms to the government data to what the specific example you
0:53:40	the basically as long as i understand the gmm you mentioned was trained and supplied basically doesn't consider the entrance
0:53:48	i
0:53:49	in the gmm case
0:53:51	right but you could also do an hmm for
0:53:54	well but that's is easy you can see the transcriptions like phone level transcriptions
0:54:00	can you improve that signal absolutely yeah that's what is shown so only small with a technique
0:54:07	all possible the pure speech feature based technique
0:54:11	yeah and what well that was a good gosh well yes but i think you don't necessarily need a very
0:54:17	strong model
0:54:18	so you know i guess you could for example have a phone-loop hmm
0:54:25	in the front end that is using a model based technique but
0:54:29	but you know getting the state sequence right is actually a problem in the feature technique as you guess
0:54:33	if you don't put constraints on the search space you
0:54:38	can have it skipping around states
0:54:41	you have inconsistent hypotheses for what the missing band is
0:54:45	and you can alleviate that to some extent if you do sort of a cheap
0:54:48	decoding in the front end where there's a phone hmm with a phone language model
0:54:53	and you could do that just to i think what you have you know
0:54:56	the benefit is the model is actually
0:54:58	constraining your state space to sort of possible sequences of phones
0:55:03	once you have that i think whether you use that to enhance features or do it in the model domain
0:55:09	you know both are options
0:55:11	yeah i mean it's only also agree i think
0:55:13	the model domain is
0:55:15	will be optimal
0:55:16	i think if you start saying well my system runs with
0:55:19	a eleven frames of hlda and all this other stuff it becomes a little harder to
0:55:24	to do that you know you can sort of just say it's gonna be a blind transform like mllr
0:55:28	but if you wanna put structure in the transform
0:55:30	that maps the low to high frequency that gets a little more difficult
0:55:36	okay
0:55:37	is that the speaker again

Robust Speech Recognition: more than just a lot of noise

Invited Speakers

Michael Seltzer (Microsoft Research)