Speech Transcript - Ouch - Outing Unfortunate Characteristics of HMMs (Used for Speech Recognition)

0:00:15	so i'm gonna talk about project ouch but thank you for having me here
0:00:22	i
0:00:23	i enjoyed my time in the czech republic and learned many czech words as well
0:00:29	so thank you
0:00:31	so
0:00:33	project ouch ouch stands for outing unfortunate characteristics of hmms
0:00:39	there are three
0:00:41	truthfully there were three phases the sort of initial work that we did
0:00:47	on this was a project that larry gillick and i dreamed up when we
0:00:52	were at nuance
0:00:54	and truthfully it also had its antecedents in work that we were doing at
0:00:59	voice signal
0:01:01	but darpa funded a very small pilot study and iarpa funded a
0:01:09	larger but still small follow-on the students who worked with me
0:01:14	were dan gillick
0:01:17	hari
0:01:18	parthasarathi who was actually a postdoc and shuo-yiin chang who is currently a student at berkeley
0:01:24	and larry gillick nelson morgan
0:01:26	and myself were the senior people
0:01:29	so project ouch
0:01:31	what we're trying to do
0:01:33	is our goal is to sort of develop a quantitative understanding of
0:01:41	how the current formalism fails
0:01:44	and you know surprisingly there's been very little work
0:01:48	in this direction in the forty year history
0:01:51	of speech recognition
0:01:54	there's been some but it's been isolated and sporadic
0:01:59	and
0:02:00	you know progress in speech recognition has been very erratic and
0:02:05	in my view largely because we've been proceeding
0:02:09	via trial-and-error and so the claim is
0:02:12	that by gaining a deeper understanding
0:02:16	of how our algorithms succeed and fail
0:02:19	rather than just measuring word error and if we get an improvement
0:02:23	in word error we keep it
0:02:26	and if it doesn't improve we
0:02:28	we just drop it it should enable more efficient and steady progress and i claim that this
0:02:34	should be embedded in our standard sort of research methodology not necessarily the techniques that i'm
0:02:41	gonna talk about okay but just this
0:02:43	notion that when you have a model
0:02:46	you know it doesn't fit the data you should try to gain some
0:02:50	understanding of how the model differs from the data and how that data model residual
0:02:57	impacts
0:02:58	the classification errors
0:03:01	so the main questions that project ouch was interested in
0:03:08	the main way you could think about this is what do the models
0:03:12	find surprising about data what is it about speech data that the models find surprising
0:03:17	and how does that surprise translate into errors
0:03:22	so
0:03:23	i'm gonna talk today about quantifying the two major
0:03:28	hmm assumptions and their impact on the error rates of course the two major assumptions
0:03:33	are the very strong independence assumptions the model makes
0:03:38	and also
0:03:40	an equally strong assumption about what the form of the marginal distribution of the frames
0:03:45	is typically we assume that they are gaussian mixture models of course nowadays people
0:03:50	are using multi layer perceptrons but again you make some sort of formal
0:03:56	assumption about what it looks like
0:04:00	also which of these incorrect assumptions is discriminative training mpe or mmi
0:04:08	which of these assumptions is
0:04:11	this process compensating for in the maximum likelihood model
0:04:16	and
0:04:17	do these results change when you move from matched training and
0:04:22	test
0:04:22	to the mismatched case
0:04:26	so the early sort of work that we did was on the switchboard and the
0:04:30	wall street journal corpora later on we moved to the icsi meeting corpus
0:04:35	you can recast
0:04:36	this sort of question about how do these results change in the mismatched case
0:04:43	in the form of why is asr so brittle
0:04:47	we go
0:04:48	any time you bring up
0:04:51	a new recognizer on a problem whether in
0:04:54	the same language or across languages you always have to start it seems almost from
0:04:59	scratch you always have to collect a bunch of data that's closely related to the
0:05:05	to the task that you
0:05:06	you have and
0:05:08	it hardly ever works the first time you try it it's the reason that most
0:05:12	of us in this room have
0:05:14	have jobs so it's sort of a good thing but it's incredibly
0:05:19	frustrating right it's like
0:05:23	it's a miracle when anything works the first time
0:05:27	so the iarpa project mainly was interested in studying
0:05:32	these
0:05:33	these questions on the icsi meeting corpus where there's a near field channel and
0:05:38	a far-field channel i'll talk a little bit more about that and we wanted to
0:05:43	understand when you train models on the near field condition
0:05:47	what happens when you recognise far field data
0:05:51	and so in this context
0:05:54	is the brittleness of asr solely due to the model's inability to account for
0:06:00	the statistical dependence that occurs in real data
0:06:04	and you know when i started this particular project
0:06:07	i thought
0:06:08	that it was just gonna be the independence assumptions so
0:06:12	and i was very surprised
0:06:15	when we actually started doing the work
0:06:19	and in fact it once like so
0:06:23	and so i say i just sort of funny but
0:06:26	but in the matched case basically
0:06:29	the inability of the model to account for statistical dependence that occurs in real
0:06:34	data is basically the whole problem
0:06:37	but when you move to the mismatched case
0:06:39	all of a sudden something else rears its head
0:06:43	and it's a big problem and so i'll describe what this
0:06:47	problem is
0:06:49	it has to do with the lack of invariance of the front end
0:06:53	so
0:06:55	i'm gonna spend a little bit of time
0:06:57	talking about the sort of methodology we use so the way we explore this
0:07:03	question is we create
0:07:07	we fabricate data
0:07:09	we use simulation and a novel sampling process
0:07:15	that uses real data
0:07:17	to probe the models and the data that we create
0:07:21	is either completely simulated so that it satisfies all the model assumptions
0:07:26	or it's real data
0:07:28	that we sample in a way that gives it properties that we understand and so
0:07:34	by feeding in this data we can sort of probe the models and see their response
0:07:41	to this data and we observe recognition accuracy
0:07:47	so here's an example
0:07:49	so this is an example of what of course according to the average estimate seventy
0:07:55	high miss rate by counts capital markets report
0:07:59	so this is an example of course what we expect speech to sound like this
0:08:03	is from wall street journal so this is a fabricated version of this that essentially
0:08:09	agrees with all the model assumptions
0:08:13	according to different estimates to construct the attachments capital markets report
0:08:19	you can speculate syllable rhymes two point five percent that's model
0:08:25	so
0:08:26	so you know it's highly amusing but it's intelligible obviously and it obviously you know
0:08:32	it's from a model that was constructed from a hundred different speakers and it reflects
0:08:37	the sort of structure
0:08:39	so what we're trying to quantify
0:08:41	is
0:08:43	what's the difference between these two extremes in terms of recognition accuracy
0:08:50	so the basic idea of data fabrication is simple
0:08:56	we follow the hmm's generation mechanism so to do that we first generate
0:09:04	an underlying state sequence consistent with the transcript the dictionary and the state transitions
0:09:12	of the underlying hmm you know the hidden markov model
0:09:15	then we walk down this
0:09:19	this sequence and we emit a frame at each point
0:09:22	so here's a picture a nice picture that describes the sort of structure
0:09:27	parts of it are actually a graphical model
0:09:32	this of course is an hmm
0:09:34	but basically we unpack so if we have a transcript we unpack the words
0:09:41	we get the corresponding pronunciations
0:09:45	the phones in context
0:09:47	then determine which hmm we use so this is the hidden state each of these
0:09:51	states emits observations according to whatever mixture model we're actually using right
0:09:59	and so if you're not so familiar with hmms i assume pretty much everyone
0:10:04	in the room is but this sort of highlights the independence assumptions right well it
0:10:10	highlights two things one
0:10:12	the frames are emitted according to a rule and the rule is the
0:10:17	form that we get for the marginal distribution of frames
0:10:21	and then of course this also says that these frames are independent so every
0:10:26	time i emit
0:10:28	a frame from say state three it is independent from the previous frame that was
0:10:33	emitted from state three so that's a very strong assumption
0:10:37	but in addition
0:10:38	it is also independent from any of the frames that were emitted previously from
0:10:43	the other states so these are very strong assumptions
0:10:46	but okay again to generate observations we just follow this rule and basically once
0:10:54	we know the sequence of states
0:10:57	once i have a sequence of states i just walk down that
0:11:02	sequence of states and i do a draw
0:11:04	from
0:11:06	whatever the distribution is
0:11:08	whether it be empirical or parametric
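
The generation procedure described above (pick a state sequence, then walk down it and draw one frame per state) can be sketched in a few lines. This is only an illustrative sketch, not the speakers' code: the three-state left-to-right topology, the self-loop probabilities, and the one-dimensional gaussian emissions are all made-up assumptions, and the names `SELF_LOOP`, `EMISSION`, `sample_state_sequence`, and `emit_frames` are hypothetical.

```python
import random

# Hypothetical 3-state left-to-right HMM; every number here is invented for illustration.
SELF_LOOP = {0: 0.6, 1: 0.7, 2: 0.5}                        # probability of staying in a state
EMISSION = {0: (-1.0, 0.5), 1: (0.0, 1.0), 2: (2.0, 0.8)}   # (mean, std) of a 1-D gaussian


def sample_state_sequence(rng):
    """Walk left to right, repeating each state while its self-loop fires."""
    states = []
    for state in sorted(SELF_LOOP):
        states.append(state)                  # every state is visited at least once
        while rng.random() < SELF_LOOP[state]:
            states.append(state)
    return states


def emit_frames(states, rng):
    """Draw one frame per state; each draw depends only on the current state."""
    return [rng.gauss(*EMISSION[s]) for s in states]


rng = random.Random(0)
states = sample_state_sequence(rng)
frames = emit_frames(states, rng)
```

Because each call to `rng.gauss` looks only at the current state, the frames are conditionally independent given the state sequence, which is exactly the independence assumption the talk goes on to probe.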
0:11:13	so
0:11:14	so for simulation
0:11:16	you know it's easy to simulate from mixture models not a big
0:11:21	deal right
0:11:23	but what about this sort of novel sampling process that'll allow us to get at
0:11:30	the independence assumptions well for this
0:11:33	we adapt a formalism
0:11:36	from efron's bootstrap so i talked a little bit about the bootstrap in the
0:11:41	paper and the poster
0:11:45	people in the field don't seem to be terribly familiar with it i'm not
0:11:51	sure this will help very much but i will try
0:11:54	so the basic idea is
0:11:57	suppose you have an unknown population right so you've got some population distribution and
0:12:02	you compute a statistic that's meant to summarize this population
0:12:08	then you want to know how good is the statistic so i want to construct
0:12:12	a confidence interval for the statistic to give me a sense of how well i've estimated it
0:12:17	from
0:12:18	a sample
0:12:20	so how do i do that if i don't know what the population is
0:12:23	i mean i'm trying to
0:12:25	you know i'm trying to derive properties of this population
0:12:29	and so in particular i don't know anything about it really except the sample
0:12:34	that i've drawn from this population
0:12:37	so
0:12:37	before efron's bootstrap procedure people would usually make some parametric assumptions about the
0:12:44	population typically you'd assume it's normal or gaussian
0:12:49	and then compute
0:12:51	a confidence interval using that structure
0:12:54	well of course that's sort of crazy you know why would you do that you know
0:12:58	especially if you're trying to say
0:12:59	is this population distribution gaussian or not well it's crazy to assume
0:13:04	then that the population distribution is gaussian to compute this confidence interval
0:13:09	so this was a big problem in the late seventies when computers became sort of
0:13:14	usable
0:13:15	by statisticians
0:13:18	efron came up with this sort of formalism and
0:13:21	and so the name comes from pulling oneself up by the bootstraps lots of
0:13:26	people use the bootstrap for various sorts of terminology allegedly everyone attributes this
0:13:33	to the story in the
0:13:36	adventures of baron munchausen where he
0:13:40	used his bootstraps to get out so he pulls himself up
0:13:44	by his bootstraps out of the swamp but of course if you read the
0:13:49	adventures of baron
0:13:50	munchausen that's not what happens
0:13:52	in fact you within a small
0:13:55	on forcing use trying to get out of this one
0:13:57	so instead pulled himself out what is okay
0:14:01	so maybe instead we collected daily
0:14:06	similarly a little bit whiter i thought that was very
0:14:10	so
0:14:12	so the way the bootstrap works
0:14:16	is you take the empirical distribution so you treat
0:14:19	so you have the sample
0:14:21	so this sample is representative of the true population distribution so if it's big
0:14:26	enough it should be a pretty good representative
0:14:29	and so
0:14:30	instead of fitting a parametric model to this you treat this as an empirical distribution
0:14:35	and you sample from that empirical distribution
0:14:39	sampling from the empirical distribution turns out to be equivalent to just doing a random
0:14:45	draw with replacement from the sample itself
0:14:48	hence the name resampling
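
As a rough illustration of the resampling idea just described, the sketch below builds a percentile bootstrap confidence interval for a median. It is a minimal sketch under stated assumptions: the toy exponential sample, the choice of the median as the statistic, the percentile method for the interval, and the function name `bootstrap_ci` are all illustrative choices that do not come from the talk.

```python
import random
import statistics


def bootstrap_ci(sample, statistic, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample with replacement, recompute the statistic, read off quantiles."""
    rng = random.Random(seed)
    n = len(sample)
    estimates = []
    for _ in range(n_resamples):
        resample = rng.choices(sample, k=n)      # random draw with replacement from the sample
        estimates.append(statistic(resample))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy, deliberately non-gaussian data (an assumption for illustration only).
data_rng = random.Random(1)
sample = [data_rng.expovariate(1.0) for _ in range(200)]
low, high = bootstrap_ci(sample, statistics.median)
```

No parametric form is assumed for the population anywhere in this sketch; the sample itself stands in for the population distribution.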
0:14:50	so we're gonna adapt this
0:14:52	formalism to the problem at hand so you know
0:14:57	when we train our models right imagine we're doing viterbi training
0:15:02	here's a
0:15:04	you know
0:15:05	well i'll have another picture but basically we're gonna resample the frames that are
0:15:11	assigned to a particular state during training and that's how it works
0:15:16	and we can do this for various types of segments
0:15:21	so here
0:15:22	it's a really crappy picture but which i have to do a better job but
0:15:26	this is that i here again that
0:15:29	so the you know these again see
0:15:32	but so we have the true population distribution you know if we fit say a
0:15:40	gaussian to this it's not a particularly good representative and instead if we've
0:15:45	drawn enough data from it this histogram estimates the distribution
0:15:52	so basically
0:15:55	the important part of this slide is
0:15:58	resampling is gonna fabricate data
0:16:02	that satisfies the independence assumptions of the hmm because i'm gonna do random draws with replacement
0:16:08	from the distribution
0:16:10	but
0:16:11	the data we create are gonna deviate from the hmm's parametric
0:16:19	distributional assumptions to exactly the same degree that real data
0:16:24	do because it's real data
0:16:26	and it's all data
0:16:28	from the training
0:16:30	so here's a better picture which can sort of
0:16:34	describe a little bit
0:16:36	about what we do
0:16:38	and
0:16:39	so imagine we have training data and we're actually doing viterbi training so if
0:16:44	we're doing viterbi training we get a forced alignment and for all the states
0:16:49	we just accumulate all the frames
0:16:52	for that state and then we fit a gmm to them right
0:16:57	but instead of doing that in the bootstrap formalism we accumulate
0:17:03	the frames and we stick 'em in urns
0:17:06	that are labeled with that state
0:17:09	so training is just like you know viterbi training you know
0:17:14	you just accumulate all the frames associated with the state
0:17:17	but instead of forgetting about them you keep track of what they are rather than just using them to
0:17:22	compute the parameters
0:17:23	and so when it comes time to generate pseudo data you have an
0:17:27	alignment or some state sequence that you've got however
0:17:32	you have a state sequence and when you walk down it to generate the frames if
0:17:36	i was generating the frames in simulation i would do a random draw
0:17:41	from a distribution now instead i do a random draw with replacement from a bucket
0:17:47	or an urn of frames okay
0:17:49	so the frames again are independent because i'm doing random draws with replacement
0:17:55	and they deviate from the distributional assumptions to the same degree
0:18:00	that real data do 'cause they are real data
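
The urn procedure described above can be sketched as follows. This is a hedged illustration rather than the project's implementation: the state labels (`ah_1` and so on), the one-dimensional "frames", and the helper names `fill_urns` and `resample_frames` are hypothetical, and a real system would store feature vectors from a forced alignment.

```python
import random
from collections import defaultdict


def fill_urns(aligned_frames):
    """aligned_frames: iterable of (state_label, frame) pairs, e.g. from a forced alignment."""
    urns = defaultdict(list)
    for state, frame in aligned_frames:
        urns[state].append(frame)         # keep the real frame instead of only fitting a gmm
    return urns


def resample_frames(state_sequence, urns, rng):
    """For each state in the sequence, draw one real frame with replacement from that state's urn."""
    return [rng.choice(urns[state]) for state in state_sequence]


# Toy alignment; labels and values are made up for illustration.
alignment = [("ah_1", 0.1), ("ah_1", 0.3), ("ah_2", 1.2), ("ah_2", 1.0), ("ah_3", 2.5)]
urns = fill_urns(alignment)
rng = random.Random(0)
pseudo = resample_frames(["ah_1", "ah_1", "ah_2", "ah_3"], urns, rng)
```

Each output frame is a real training frame, so the marginal distribution per state is the empirical one, while the independent draws enforce the hmm's frame-level independence.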
0:18:03	so sorry to belabor this but then i can also
0:18:08	i can
0:18:10	do
0:18:11	sequences so i can sample state trajectories phone trajectories and word trajectories
0:18:18	because
0:18:19	so here
0:18:20	this is the sequence of frames associated to a state
0:18:25	so i can stick that whole sequence into the urn
0:18:29	likewise i can take a whole phone sequence and put it in here and when i
0:18:34	draw from the urns
0:18:35	instead of getting individual frames i get segments
0:18:39	so the important thing is
0:18:41	no matter what so if i have segments in the utterance
0:18:45	when i draw the segments between segments they are independent but they inherit the
0:18:53	dependence that exists in real data within that segment so we have
0:18:59	between segment independence
0:19:02	within segment dependence so this is the way that we can control the sort of
0:19:07	degree of statistical dependence that's in the data
0:19:12	this is quite powerful
0:19:15	so this sort of just
0:19:17	sort of summarises this
0:19:19	but the and you can see
0:19:21	could even stickler hundred and your
0:19:24	but so the point is that segment level resampling
0:19:31	relaxes frame level independence to segment level independence
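
Segment level resampling differs from the frame level version only in what goes into each urn: whole runs of consecutive frames rather than single frames. The sketch below is an assumption-laden illustration, with made-up labels and values and hypothetical helper names (`fill_segment_urns`, `resample_segments`). It uses `itertools.groupby` to cut the alignment into runs, which means two adjacent segments with the same label would be merged; a real implementation would use explicit segment boundaries.

```python
import random
from collections import defaultdict
from itertools import groupby


def fill_segment_urns(aligned_frames):
    """Group consecutive frames sharing a label into segments and file each segment under its label."""
    urns = defaultdict(list)
    for label, run in groupby(aligned_frames, key=lambda pair: pair[0]):
        urns[label].append([frame for _, frame in run])
    return urns


def resample_segments(label_sequence, urns, rng):
    """Draw one whole segment per label (with replacement) and concatenate the frames."""
    frames = []
    for label in label_sequence:
        frames.extend(rng.choice(urns[label]))
    return frames


# Toy alignment; labels and values are invented for illustration.
alignment = [("ah", 0.1), ("ah", 0.2), ("t", 1.0),
             ("ah", 0.4), ("ah", 0.5), ("ah", 0.6), ("t", 1.1), ("t", 1.2)]
segment_urns = fill_segment_urns(alignment)
rng = random.Random(0)
pseudo_utterance = resample_segments(["ah", "t"], segment_urns, rng)
```

Frames inside a drawn segment keep whatever dependence the real data had, while separate draws are independent of each other, so the choice of segment unit (state, phone, word) acts as the control on how much dependence is reintroduced.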
0:19:39	so here's a sort of picture
0:19:43	the models response to fabricate so this is i didn't for that
0:19:49	okay so
0:19:54	i don't know how much i wanna spend on this but
0:19:59	so here what we have is simulated
0:20:04	data and the real error rate and as i gradually reintroduce dependence in
0:20:10	the data the word error rate starts to increase rather dramatically
0:20:16	so the point is
0:20:18	let's look at the simulated word error rate so you can think of this as
0:20:21	think of this as you've got some sort of knob where you're re
0:20:24	introducing dependence in the data and as i reintroduce dependence in the data the error
0:20:30	rate
0:20:32	becomes quite high this is
0:20:33	this is icsi meeting data this is
0:20:37	with unimodal models
0:20:39	the same sort of phenomenon happens when you use mixture models with you know like
0:20:43	say eight component mixtures
0:20:46	so here the simulated error rate is around two percent a little bit less than
0:20:51	two percent
0:20:52	when i do frame level resampling the error rate increases just a little bit it's a
0:20:56	very small increase it does increase but by very little
0:21:01	now when i reintroduce
0:21:04	within state dependence
0:21:06	all of a sudden the error rate becomes around twelve percent so the error rate has
0:21:10	increased by a factor of six
0:21:12	when i introduce
0:21:14	within phone dependence
0:21:17	the error rate increases again by about a factor of two
0:21:23	and then when i go to words it increases
0:21:27	again almost by a factor of two this typically is the largest jump on
0:21:31	the corpora that we've worked with
0:21:33	when you move from frame
0:21:35	to state it typically increases by about a factor of six
0:21:39	so you think about this and you make an argument and the argument is that
0:21:44	the distributional assumption that we make with gmms
0:21:53	it's not such a big deal i mean it's important but it's not such a
0:21:57	big deal
0:21:57	the biggest single factor is this reintroduction of dependence so it's the dependence in the data
0:22:04	that the models are finding surprising i mean you know
0:22:08	you know everybody knew the independence assumptions were wrong i mean i'm not
0:22:13	saying that's surprising but i personally
0:22:18	was really surprised and it took a long time to come around
0:22:24	to the fact that really it is the model the errors are due to the independence
0:22:29	assumption and we tend to work around this by other sorts of things
0:22:35	so this is a summary of the matched case results so the claim is
0:22:40	basically that when we have matched training and test
0:22:43	it's the independence assumptions that are the big deal
0:22:46	it's the model's inability to account for dependence in the data that is
0:22:52	derailing things
0:22:53	the marginal distributions
0:22:55	not so much
0:22:57	so surprisingly also in a different you know later study
0:23:01	we also
0:23:03	applied this formalism to ask the question so what is discriminative training doing you
0:23:08	know say you start with the maximum likelihood model and you apply mmi
0:23:13	what's happening here so you apply this formalism and you see that in fact
0:23:20	mmi is actually compensating for these independence assumptions in a
0:23:28	way that i don't completely understand i have hypotheses about how this might work
0:23:34	but
0:23:36	so here you have a
0:23:39	really complicated procedure that's a little hokey
0:23:42	that took people twenty years many people in this room it took twenty years to
0:23:48	get to work right
0:23:49	and once it was shown to work on large vocabulary it took many
0:23:54	labs an additional five years to get it to work in their lab
0:23:58	you know now it's pretty routine to do this but you know
0:24:02	it was a struggle to get this to work and my point is all that it's
0:24:07	doing is compensating for the independence assumptions we know the independence assumptions are a problem
0:24:13	i'm not saying that it's gonna be easy to find a model that relaxes
0:24:17	the independence assumptions
0:24:19	but perhaps that twenty years of effort
0:24:21	would be better spent
0:24:23	attacking that problem
0:24:26	so what about mismatched training
0:24:30	so the icsi meeting corpus
0:24:32	we have near field data
0:24:37	collected from
0:24:40	you know head mounted microphones and there was a microphone array of some sort
0:24:47	but the meeting room was quiet it was small it had a normal amount of
0:24:52	reverb the kind of reverb you would expect
0:24:55	in a room
0:24:56	if you listen to these two channels you can tell that they're different
0:25:01	but it's not like the far-field channel is radically different when you listen to
0:25:07	it it sounds a little different but it's perfectly intelligible
0:25:13	so we explore
0:25:15	train and test with near field train and test with far field and this mismatched condition where we
0:25:20	train on near field data and test on far field
0:25:24	so
0:25:26	i'll just say that it's hard it's not
0:25:30	hard but you have to be careful and you have to think about what you're trying
0:25:33	to do when you run these types of experiments in particular
0:25:38	there were a lot of issues that we went through
0:25:43	to get the near field channel and the far-field channel exactly parallel so that
0:25:49	we were actually measuring
0:25:51	what we wanted to measure it's like a somewhat
0:25:55	intricate lab setup
0:25:57	and so
0:26:01	so the paper that we wrote for icassp i don't know how well
0:26:05	it describes it but it attempted to describe it and on
0:26:10	the icsi website there's a technical report that's reasonably good
0:26:14	that describes a lot of this stuff so i'm not gonna belabor this but there was
0:26:19	a lot of effort that we had to go through
0:26:23	so here here's of the bottom line is that we're
0:26:26	so first let's look at the green and the red curve satanic again i'm almost
0:26:31	so
0:26:33	the green and the red curves are the matched near field and far-field and notice
0:26:38	that they track each other pretty well the difference
0:26:40	the far field data is obviously harder
0:26:43	but interestingly look down here at the simulated and the frame
0:26:48	error rates
0:26:49	they're still really low you know
0:26:52	the matched far field is higher it's worse but it's still really low and in
0:26:58	particular these error rates are around two percent right so
0:27:06	let's think about that now notice before we think about that the mismatched simulation
0:27:12	rate
0:27:12	you know so this is where we want to concentrate so this is what
0:27:17	we want to think about right
0:27:19	so the simulated
0:27:21	we don't need to worry about this other stuff it's the simulated thing that we're
0:27:25	gonna concentrate on
0:27:26	so
0:27:29	when you simulate data from near field models and you recognise it with near
0:27:34	field models the error rate is essentially nil
0:27:37	so that means that problem is essentially separable
0:27:43	again when i take the far field models and i simulate data from the far-field
0:27:48	models
0:27:49	and i
0:27:51	and i recognise it with the far-field models
0:27:53	i get essentially no error
0:27:55	again that means that problem is essentially separable
0:27:59	so in these two individual spaces you know so the frames so in
0:28:06	the signal processing the mfccs that are generated in the matched cases they're essentially separable
0:28:13	problems but all of a sudden when i take
0:28:18	the near field models and look at the far field data it's
0:28:23	dramatically not separable
0:28:26	so that means that the transformation that takes place between the near field data and
0:28:32	the far field data is not
0:28:35	the front end is not invariant under this transformation and
0:28:41	that lack of invariance
0:28:43	is what's causing this huge increase in error
0:28:47	so again it's not surprising that the front end is not invariant to this
0:28:52	transformation there's a little bit of reverb there's a little bit of noise but what's
0:28:57	remarkable is
0:28:58	that it's
0:29:00	solely that problem that causes
0:29:03	this huge degradation in error rate
0:29:06	and that is actually fairly remarkable
0:29:10	so
0:29:13	a
0:29:14	so there are many more results
0:29:17	involving mixture models so we reran all of these results with i think eight
0:29:23	component mixture models and we see the same sort of behaviour
0:29:27	we've reproduced all the discriminative training results we asked
0:29:32	does discriminative training somehow magically sort of alleviate
0:29:38	the mismatched case and the answer is no
0:29:41	and as i think morgan raised a natural question is how
0:29:46	does mllr work in this setting we talked about that with mllr you
0:29:51	can reduce
0:29:52	some of the errors as you would expect
0:29:54	but mllr is a simple linear transformation and whatever transformation between these two channels is
0:30:00	happening
0:30:01	it's some peculiar nonlinear transformation right so it's unreasonable
0:30:06	to expect mllr to do
0:30:08	that well but this is a really good test harness
0:30:13	for evaluating
0:30:14	you know how invariant to these transformations front ends are
0:30:20	and so we've explored that a little bit
0:30:23	and it's not so encouraging
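
For readers unfamiliar with the adaptation method mentioned above, the usual form of mllr mean adaptation is written out below to make the "simple linear transformation" remark concrete. The notation is the standard textbook one, not something shown in the talk: each gaussian mean \mu_m is mapped through a shared matrix A and bias b, both estimated from adaptation data.

```latex
\hat{\mu}_m = A\,\mu_m + b = W\,\xi_m,
\qquad
\xi_m = \begin{bmatrix} 1 \\ \mu_m \end{bmatrix},
\qquad
W = \begin{bmatrix} b & A \end{bmatrix}
```

Since every adapted mean is an affine function of the original mean, a channel change that acts nonlinearly on the features cannot be fully undone this way, which is consistent with the speaker's point.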
0:30:26	alright well so i think
0:30:31	i've sort of blathered on enough i think i'll turn it over to jordan and
0:30:37	you will
0:30:37	he will
0:30:38	give a higher level overview and then we'll have
0:30:42	questions after that
0:30:54	so what you what presented in
0:31:02	i
0:31:18	okay one two three
0:31:20	alright so it turns out there were two parts of this project
0:31:26	steve told you about the technical stuff but we also thought that we'd like to
0:31:30	figure out
0:31:31	you've been hearing a lot about how wonderful speech recognition is during this meeting and
0:31:35	we thought we would actually like to understand what the community actually thought about what
0:31:40	speech recognition was like
0:31:42	so we also ran a survey and i called a bunch of people many of you
0:31:48	were called by me
0:31:50	and this is called the rats right
0:31:59	and what we wanted to do is just see what people thought about how speech recognition
0:32:03	really worked we were hoping that we would find some evidence to persuade
0:32:09	the government maybe to put in some money and fund some speech recognition research which
0:32:14	we haven't seen in a long time
0:32:17	but really we just wanted to find out what was going on
0:32:20	and so we put together a little survey team
0:32:24	jen into jamieson worked with me she's an analyst that's been in speech for a very
0:32:29	long time and we engaged frederick okay and he's a specialist at doing surveys
0:32:36	and we designed a snowball survey
0:32:40	a snowball survey is very interesting
0:32:44	it says you start with a small group of people that you know and you
0:32:47	ask them the questions and then you ask them who else to ask
0:32:51	and you just follow your nose and what that means is although it's
0:32:56	not entirely unbiased it's as unbiased as you can be if you don't know what the
0:33:00	sampling population's going to be
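
The snowball procedure just described (start from people you know, ask each one who else to ask, and keep following the referrals) amounts to a breadth-first walk over a referral graph. The sketch below is purely illustrative: the names, the referral table, the `max_people` cap, and the function `snowball_sample` are hypothetical and are not taken from the survey itself.

```python
from collections import deque


def snowball_sample(seeds, referrals, max_people=100):
    """Breadth-first walk: interview each person once, then queue everyone they refer."""
    interviewed = []
    seen = set(seeds)
    queue = deque(seeds)
    while queue and len(interviewed) < max_people:
        person = queue.popleft()
        interviewed.append(person)               # ask this person the survey questions
        for name in referrals.get(person, []):   # "who else should we ask?"
            if name not in seen:
                seen.add(name)
                queue.append(name)
    return interviewed


# Hypothetical referral table for illustration.
referrals = {"a": ["b", "c"], "b": ["c", "d"], "d": ["a"]}
order = snowball_sample(["a"], referrals)
```

The sample depends on the seed group and on who refers whom, which is why the speaker describes the result as not entirely unbiased.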
0:33:06	so we wanted to know what was going on what do people think are the
0:33:10	failures and what remedies have people tried and how did they work
0:33:17	so we did this snowball sampling
0:33:19	here's the questionnaire i don't wanna spend a lot of time on this but just
0:33:23	take a look
0:33:25	the interesting questions are
0:33:28	the last one on the slide where has the current technology failed
0:33:33	and the first one on the slide what do you think is broken
0:33:36	and then questions about sort of what you do about what was going on and
0:33:41	then if there's other stuff
0:33:45	the survey participants tended to be old
0:33:49	i think
0:33:50	that's sort of how our snowball worked not terribly old but there's not a lot of
0:33:54	young people in this so ages were thirty five to seventy
0:33:58	we spoke to about eighty five people
0:34:03	and they have an interesting mix of jobs most of them were in research some were
0:34:09	in development some were both
0:34:11	there were a small number of management people and then some people referred to
0:34:17	their jobs as something more detailed
0:34:22	but mostly these are r and d people or managers doing speech research or language research of one
0:34:30	sort or another
0:34:35	so here's what you told us
0:34:39	there's a
0:34:42	natural language is the real problem and acoustic modeling is a real problem
0:34:47	and everything else that we do was broken more or less
0:34:51	so i think the community sort of has this feeling not the people trying to
0:34:55	sell speech recognition to the management but the people trying to make it work have
0:35:00	a feeling that all is not really well in the technology
0:35:05	so lots of people when you point fingers they're pointing fingers at the language
0:35:11	itself and at acoustic modeling
0:35:14	and there's the third guy which this says not robust let's say this what steven
0:35:20	and stuff
0:35:21	we were able
0:35:22	so there's something going on with this technology that makes it not work very well
0:35:27	and when we asked people what they tried
0:35:30	to fix things the answer is everything
0:35:34	people have mucked around with the training some people have tried all kinds of different
0:35:38	because i just of their system
0:35:40	a
0:35:42	i
0:35:43	i know
0:35:50	some piece trying to calm
0:35:54	alright anyway
0:35:58	one of the interesting things that people tried to do
0:36:02	many of us have tried to fix pronunciations either in dictionaries or in rules for
0:36:07	pronunciation and to a man everyone has found that this is a waste
0:36:12	it's pretty interesting so that's not a way to fix the systems that we
0:36:16	currently build so we tried all kinds of stuff
0:36:21	and so i think
0:36:22	our takeaway from the survey is that people
0:36:27	actually don't believe that the technology is very solid and we try a lot of things
0:36:31	to fix it and then we looked a little bit at the literature there are
0:36:35	literature surveys in the icsi report which you can go read but the comments
0:36:40	we found in the literature look sort of like this this is from a review by
0:36:43	fruity
0:36:45	and it's a
0:36:48	lvcsr is far from being solved background noise channel distortion foreign accents
0:36:52	casual disfluent speech unexpected topic changes cause automatic systems to make egregious
0:36:57	errors and that's what everybody said anybody who's looked at it says well this
0:37:02	technology is okay sometimes but it fails a lot
0:37:08	so what we concluded was
0:37:10	the technology is old i point out that the models that most of us use
0:37:14	the hidden markov models that most of us use are you know the thing that
0:37:18	was written down apply my for john a canadian sixty nine
0:37:22	so maybe that's i think the kernel of one of our issues here
0:37:29	so when these systems fail they degrade not gracefully like your for your role but
0:37:35	catastrophically and quickly
0:37:40	speech recognition performance is substantially behind how humans do in almost every circumstance
0:37:48	and
0:37:49	they're not robust
0:37:51	so i wanted that to sort of be my overall overview of what the survey was
0:37:57	and it's available on the icsi website in the in the program but i wanted
0:38:03	to add a couple a personal comments about my analysis of what's happening
0:38:08	so these are not i'm not representing the government are actually i want to talk
0:38:13	to you about my own personal analysis
0:38:17	so here's i there's three points first point
0:38:21	if you have a model and you spend a lot of time hill
0:38:24	climbing to the optimum performance
0:38:26	and it doesn't perform optimally at that spot
0:38:29	you got the wrong model
0:38:32	hidden markov models were proved to converge by baum petrie soules and weiss at ida
0:38:37	in nineteen sixty nine
0:38:39	that proof has two parts
0:38:41	one is it says you can always make a better model
0:38:45	two it says you get the optimal parameters if the data came from the model
0:38:51	that second part is
0:38:54	absolutely not true in our speech recognition systems and we're climbing on data that doesn't
0:39:00	match the model and we're not gonna find the answer that way
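
The two parts of the convergence proof described here can be written compactly. This is a paraphrase in standard notation, not the notation of the original proof. The symbols are assumptions introduced for illustration: $x_{1:T}$ for the observed frames, $\theta^{(k)}$ for the parameters after $k$ re-estimation steps, and $\theta^\star$ for the parameters that would have generated the data. The second statement is only a rough sketch and ignores regularity conditions and local maxima.

```latex
% Part one: a re-estimation step never lowers the likelihood of the training data.
\log p\left(x_{1:T} \mid \theta^{(k+1)}\right) \;\ge\; \log p\left(x_{1:T} \mid \theta^{(k)}\right)

% Part two: requires that the data really were generated by the model.
x_{1:T} \sim p(\,\cdot \mid \theta^\star)
\quad\Longrightarrow\quad
\hat{\theta}_T = \arg\max_{\theta} \log p\left(x_{1:T} \mid \theta\right) \;\to\; \theta^\star
\quad \text{as } T \to \infty
```

The speaker's objection is to the premise of part two: if speech data do not come from the model, hill climbing still raises the likelihood, but nothing guarantees that the parameters it reaches are the right ones.
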
0:39:04	so we spent a lot of time
0:39:06	trying to account trying to adapt for the problem that we've got the wrong model
0:39:13	this is a personal bond
0:39:15	if you use sixty four gaussians applied to some distribution you have no idea what
0:39:19	the distribution is
0:39:21	the original
0:39:23	multi gaussian distributions were done with a single mean and i understand that but that's not
0:39:29	weird
0:39:30	and so my corollary i think speaks for itself
0:39:37	and finally if the system you build fails for fifty percent of the population entirely
0:39:43	and then for the people who it works for as soon as they walk into a reverberant
0:39:46	environment or noisy place it fails
0:39:48	it's broken
0:39:51	and i believe speech recognition is terribly broken
0:39:55	so i think what we really wanted to do i want to draw an
0:39:59	analogy so i want to draw an analogy between
0:40:03	transcription and transportation
0:40:06	and for transportation man this is what i want something that's slick and slowly and
0:40:12	easy to use and doesn't break
0:40:15	and what we built is this
0:40:20	it runs on two wheels it will get you there eventually you spend almost all your
0:40:24	time dealing with problems that have nothing to do with the transportation part
0:40:28	and so i believe that that's what we've done with speech recognition
0:40:32	and it's time for new models and
0:40:35	i urge you to think about models
0:40:38	and not so much about the data
0:40:54	and tape
0:40:56	generate okay
0:40:58	i assume that is going to generate a lot of discussion and a lot of questions
0:41:02	if it doesn't then something is wrong with us
0:41:06	this sds community would be done broken
0:41:10	okay who's the first over there
0:41:20	a question about the resampling
0:41:24	as i think about this you have a sort of sequence of random variables and
0:41:27	you're turning a knob on the independence between them
0:41:30	and
0:41:31	one of the things that turning that knob does is
0:41:35	as things become more dependent there's
0:41:37	less information
0:41:40	what i'm wondering is how much of the word error rate degradation you see
0:41:44	might be associated simply with the fact that there's just less information
0:41:48	in streams that are more dependent
0:41:54	this working
0:41:56	so i guess i don't understand the question
0:41:59	i mean i
0:42:02	so you're right so here is an answer and you can tell me if i'm
0:42:07	close to understanding the model assumes that each frame has an independent amount of information
0:42:15	but we know that the frames do not have independent amounts of information the
0:42:20	amount of information
0:42:22	going from frame to frame varies enormously
0:42:25	but the model treats every single one of those frames as independent and that's
0:42:31	an egregious violation of these
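
The frame independence assumption mentioned in this answer is usually written as a factorization of the acoustic likelihood. The notation is an assumption for illustration and is not taken from the talk: $x_t$ is the feature vector at frame $t$ and $s_t$ is the hidden state at frame $t$.

```latex
% Given the state sequence, every frame is scored on its own:
p\left(x_{1:T} \mid s_{1:T}\right) = \prod_{t=1}^{T} p\left(x_t \mid s_t\right)
```

Under this factorization each frame contributes its own term to the product, whether or not it is nearly identical to its neighbours, which is the mismatch the speaker is describing.
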
0:42:34	so that
0:42:37	i guess i was thinking about was
0:42:39	if i ask you to say a word ten times versus i ask ten people to
0:42:42	say the word once
0:42:43	and i'm trying to figure out what's the word
0:42:45	it's likely that the ten people saying it might actually provide more information in the data
0:42:49	itself
0:42:51	and i just wondering if that might at all
0:42:53	contribute to why there's more
0:42:57	information as you sample from
0:42:59	from or more disparate parts of the train database
0:43:07	well i think i think what you're actually saying is the you your works
0:43:15	explaining
0:43:17	why
0:43:18	so the model
0:43:20	i think
0:43:21	many people this is a question they have so when you have
0:43:26	all the frames and they're independent when you do frame resampling the frames come from
0:43:31	all sorts of different speakers and when you line them up you know
0:43:35	like the one i played they come from all sorts of different speakers but then
0:43:40	as soon as i start
0:43:43	increasing the segment size then each one of those segments is gonna come from one
0:43:49	speaker right is this sort of along the lines of what you're thinking well
0:43:53	the notion of speaker is part of the dependence in the data right the fact
0:43:59	that each one of these frames came
0:44:01	from a single speaker that's dependence
0:44:05	and so that's interframe dependence
0:44:07	which the model knows nothing about
0:44:09	and so if that's causing a problem or not that that's as we're Q your
0:44:14	data
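
The contrast drawn in this answer, between resampling single frames and resampling longer segments, can be sketched in a few lines. This is a toy illustration under stated assumptions, not the procedure used in the study: the corpus, the `(state, speaker, value)` frame layout, and the function names `runs`, `build_pools`, `resample_frames` and `resample_segments` are all hypothetical.

```python
import random
from itertools import groupby

# Toy corpus (hypothetical): each utterance is a list of (state, speaker, value) frames.
corpus = [
    [("a", "spk1", 0.1), ("a", "spk1", 0.2), ("b", "spk1", 0.9)],
    [("a", "spk2", 0.3), ("b", "spk2", 0.8), ("b", "spk2", 0.7)],
    [("a", "spk3", 0.0), ("a", "spk3", 0.1), ("b", "spk3", 1.0)],
]

def runs(utterance):
    """Split an utterance into runs of consecutive frames that share a state."""
    return [list(group) for _, group in groupby(utterance, key=lambda f: f[0])]

def build_pools(corpus):
    """Collect, per state, every frame and every same-state run in the corpus."""
    frame_pool, segment_pool = {}, {}
    for utterance in corpus:
        for run in runs(utterance):
            state = run[0][0]
            frame_pool.setdefault(state, []).extend(run)
            segment_pool.setdefault(state, []).append(run)
    return frame_pool, segment_pool

def resample_frames(utterance, frame_pool, rng):
    """Replace each frame by an independent draw from the pool for its state."""
    return [rng.choice(frame_pool[state]) for state, _, _ in utterance]

def resample_segments(utterance, segment_pool, rng):
    """Replace each same-state run by one whole run drawn from the pool."""
    out = []
    for run in runs(utterance):
        out.extend(rng.choice(segment_pool[run[0][0]]))
    return out

rng = random.Random(0)
frame_pool, segment_pool = build_pools(corpus)
by_frame = resample_frames(corpus[0], frame_pool, rng)
by_segment = resample_segments(corpus[0], segment_pool, rng)
```

In `by_frame`, neighbouring frames can come from different speakers, so no speaker dependence survives between frames. In `by_segment`, every frame inside a run comes from one utterance and therefore one speaker, which is the kind of dependence the model knows nothing about.
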
0:44:22	of course all of us
0:44:23	you know as you said all of us have been aware of this for
0:44:26	a long time and i think there has been a lot of effort at trying
0:44:29	to undo it
0:44:31	it's kind of when we say the model there's an independence assumption that's
0:44:37	sort of half true
0:44:39	because the features that we use
0:44:42	go over several frames so of course they're not actually independent you know when you
0:44:47	synthesise it's not clear what you really synthesise 'cause you have to synthesise something that
0:44:51	has
0:44:52	may have an independent value but it has to have a derivative that matches the
0:44:56	previous thing and so on but
0:44:58	but we've all tried things like segmental models
0:45:02	which don't have that independence assumption
0:45:04	right we take a segment
0:45:07	a whole phoneme so you're
0:45:09	skipping the state independence assumption and the frame independence assumption and just going straight
0:45:15	to the context dependent phoneme
0:45:18	and now you're picking a sample from the one distribution for that context dependent phoneme
0:45:24	and that always works worse
0:45:28	maybe you can do something with that or combine it with the hidden markov model
0:45:32	and gain half a point but by itself it always works a lot
0:45:36	worse
0:45:38	unless you cripple the hidden markov model by saying i'm only gonna
0:45:43	use context independent models then this one might work better but
0:45:48	so the question is
0:45:49	it's not that we haven't tried
0:45:51	people have tried to make models that avoid those things and almost all of those
0:45:56	things got worse the flip side of that is you said mpe or mmi
0:46:00	and all these things are meant
0:46:01	to
0:46:02	avoid
0:46:04	that assumption but they don't reduce the error by a factor of four
0:46:08	they reduce the error by
0:46:10	ten percent fifteen percent relative
0:46:13	basically a small amount it's similar to any of the other
0:46:18	tricks we do so do you have any comment on those two observations
0:46:21	well i mean
0:46:23	i i'm not sure what
0:46:25	so a natural question which i think is the first part of what
0:46:29	you're saying is so why have many people tried and failed to beat hmms with
0:46:36	models that take into account
0:46:40	the dependence structure in the data so why hasn't that worked
0:46:45	well
0:46:47	i would say that
0:46:49	that
0:46:50	i do not believe that anyone has any quantitative notion of why these things here
0:46:57	in the data
0:46:59	i'm not saying that we should go back to these methods maybe we should but
0:47:04	well i will give you an example of something you know twenty years ago people
0:47:09	gave up on neural networks
0:47:11	and all of a sudden you know neural networks
0:47:16	are
0:47:16	are the new
0:47:18	the new
0:47:20	come
0:47:21	i don't know what the right biblical phrase is but hallelujah so what it
0:47:28	takes is somebody who believes in something and tries hard to do it and i
0:47:35	think that here is the problem
0:47:37	we should be i don't know what the solution is i honestly don't know what
0:47:41	the solution is but i will say also that the mmi thing no and i
0:47:46	don't believe anyone would be the mmi it was not designed to overcome independence
0:47:53	you know we knew that the maximum likelihood solution to this problem was not the
0:47:58	right solution so we found an alternative model selection procedure that puts us in a
0:48:04	different place
0:48:05	again if the model were correct we wouldn't have to do that
0:48:16	coming back to the results this is this simulation results you presented
0:48:20	i think these are highly suggestive because
0:48:24	by changing the data to fulfil your assumptions
0:48:29	the error rates you get are not the error rates we
0:48:32	expect from the real data
0:48:35	because you fit
0:48:36	the problem to your assumptions but we have to go the other way around so
0:48:40	what error rates we really can expect if we
0:48:45	improve our modeling is still that's an open question isn't it
0:48:48	exactly that's absolutely right in no way am i claiming
0:48:54	that if we could model dependence in the data that we would be seeing these
0:48:58	error rates the frame resampling error rates that that's absolutely correct
0:49:04	i mean so
0:49:05	presumably we do we repeat do better the other point though is i think that
0:49:12	a lot of the
0:49:17	this sort of brittleness that we experience
0:49:20	in our models this is a conjecture is due to this very
0:49:25	sort of poor fit to the temporal structure
0:49:31	and you know one way of thinking
0:49:35	of these results you know the frame resampling results is that it says if you
0:49:40	forget about the temporal structure in the data the models work really well but as soon
0:49:46	as you introduce real temporal structure in the data the models start failing
0:49:51	and so for speech i think temporal structure is important
0:49:57	i think
0:50:04	here is the my
0:50:10	by a shock i see how a
0:50:15	speechless
0:50:16	or thai interested party
0:50:19	yes the line
0:50:25	i don't think
0:50:27	a
0:50:28	i when you please independence assumptions is not
0:50:34	in the sticks more mixing to not extract information you can speech doesn't necessarily track
0:50:41	you know to work
0:50:44	i mean i can build the proposed system that satisfy
0:50:49	independence assumption
0:50:51	so i don't think
0:50:52	you know
0:50:53	really follows that
0:50:55	for my models really see
0:50:58	the models and so
0:51:01	i think you don't want thinking about extracting
0:51:06	getting the right information the problem this over account the information
0:51:10	it's a question of this represent information
0:51:15	and so if you misrepresented what are more or less than in the process
0:51:19	i was the misrepresentation
0:51:21	so that the false alarms
0:51:25	three
0:51:28	something like
0:51:29	some work
0:51:31	have you might have
0:51:34	but works if that's not right
0:51:38	work land farm
0:51:41	i rate is
0:51:44	just done the same tendency
0:51:47	these days
0:52:26	but
0:52:27	but
0:52:34	i like
0:52:45	when you know all
0:52:55	one thing that works really poorly
0:52:58	is if you have a mismatched representation
0:53:01	so think about some model that is representing text okay
0:53:07	you can represent it as raster scan text
0:53:09	or you could represent it as fonts
0:53:13	and if you change the size of the image
0:53:16	the two things are very different for the font
0:53:20	it's an actual easy representation change and for the raster scan it just changes
0:53:25	the whole thing
0:53:27	so you have to ask yourself is the problem that we're seeing
0:53:31	the fact that we have a representation for the problem that doesn't match
0:53:37	that i think is the realisation
0:53:40	mm this tell us something a common
0:53:43	as you go from state to phones and phones to
0:53:48	segments
0:53:49	the data is becoming more and more speaker-dependent so maybe the problem is your
0:53:54	models and not i mean
0:53:57	i mean if you made your models more speaker-dependent would we have seen
0:54:02	such a difference
0:54:03	that has nothing to do with frame dependent sampling well like what
0:54:08	i was trying to say before is that is a form of dependence
0:54:13	that
0:54:14	that
0:54:15	the model knows nothing about
0:54:17	this form of dependence
0:54:19	you know there are many forms of dependence in data knowing what independence
0:54:24	is is a hard thing for a human to understand right
0:54:28	but that form of dependence is precisely there and it may be causing the problem
0:54:36	so there were there were a number of speakers so there are relatively few speakers
0:54:42	in this corpus and so we had to sort of cap them so that there
0:54:46	wasn't a single dominant speaker
0:54:50	which i mean i think that would be the last
0:54:56	so let me you sort of continue with work was asking again
0:55:02	we know the model is wrong
0:55:05	models are always wrong
0:55:08	and so
0:55:11	the way your
0:55:13	you can argue that the model is wrong mathematically or you can argue that it's
0:55:17	wrong because it doesn't you know match human performance what we think
0:55:22	of as human performance i think we may overestimate human performance a little bit but
0:55:26	it clearly doesn't match it
0:55:29	but in fact you know if you look at all the research that all of
0:55:32	us do
0:55:34	it at least feels like we're attacking those problems so we say we're gonna use
0:55:39	font models to use your analogy we allow our models to we scale
0:55:45	them like fonts right we say we're going to estimate a scale
0:55:49	factor and that scale factor is not a simple
0:55:52	it can be a simple one or it can be a matrix you know much
0:55:54	more complicated than what you do with the font and we constrain it to be
0:55:58	the same we say the speaker's the same for the whole sentence
0:56:01	we do speaker adaptive training so we try to remove the differences
0:56:07	we tried to normalize all the speakers to the same place and then insert the
0:56:11	properties of the new speaker again right
0:56:14	that's sort of like the analogy of a font
0:56:16	we tried to do all of these things we certainly try to model channels
0:56:23	we do all of these with linear models and nonlinear models
0:56:28	and
0:56:29	we get small improvements
0:56:31	so my question let me turn the question around
0:56:34	the model is wrong
0:56:36	what's the right model
0:56:38	not what is the do but what is the right model
0:56:42	so
0:56:43	i think we all don't know the answer to that question but let me tell
0:56:47	you something other phenomena that i would like to see as making
0:56:52	unless you've been following particle physics but
0:56:56	in particle physics
0:56:58	when you measure particle interactions the statistics of the interactions are governed by
0:57:03	basically by feynman diagrams
0:57:05	and so to compute a for particle interaction like using the super collider to compute
0:57:11	a cross sectional area for one of the interactions takes a big computer about
0:57:15	a week to look at all the feynman diagrams
0:57:19	a couple of the physics guys just discovered a geometric object
0:57:24	enforce days and in the geometric object it turns out that each
0:57:28	little area has
0:57:32	an area that is exactly the solution
0:57:34	to that problem of computing the cross sectional area
0:57:39	and you can now do the computations
0:57:43	in about five minutes with a pencil and paper
0:57:47	so
0:57:48	there's a place where the difference in the model has a huge effect
0:57:54	on making things work so i don't believe the model lies in
0:57:59	the kinds of things that we've all always been doing
0:58:02	i think we need to have some radical reinterpretation of the way we look
0:58:06	at the data the way we look at the world
0:58:09	maybe which on the lines in one place
0:58:11	maybe
0:58:14	i took a degree in linguistics because i thought speech wasn't an easy problem from
0:58:18	an engineering point of view and i learned to distrust everything a linguist said
0:58:24	maybe which most of them to but
0:58:26	maybe there's something different that we should be doing
0:58:28	so i would love us to again look outside this place that we've been exploring

Ouch - Outing Unfortunate Characteristics of HMMs (Used for Speech Recognition)

4th Day

Jordan Cohen (Spelamode) , Steven Wegmann (ICSI)