0:00:57 Okay, so you will notice that this talk has some overlap with Bill's presentation; a couple of the slides are similar, and I'll try to point out the differences. I'll start with the motivation, then describe the data and how we prepared it. We actually use more than just a cepstral system in this work, so I'll explain what the systems are. Then I'll describe the results and close with some conclusions, which should really be formulated as questions for future work.
0:01:44 So, as we all know, SRE has a long history in which telephone speech was all there was. However, as our keynote presentation yesterday discussed, the evaluations have since added interviews with microphone-recorded speech. Yet up to and including the 2010 SRE, the data was still being distributed as if it were telephone speech: 8 kHz sampling rate and mu-law coding, which is a lossy coding scheme that works well enough for telephone speech.
0:02:25 So one of the things we are going to look at is the effect of those two factors: what happens when you remove, or relax, the constraints on how the data is encoded.
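To make concrete what the mu-law loss is, here is a minimal sketch of G.711-style mu-law companding in Python (my own illustration, not code from any of the systems described; real G.711 uses a piecewise-linear approximation of this curve). The round trip through 8-bit codes quantizes the waveform, and that quantization noise is exactly what the lossy-coding condition carries:

```python
import numpy as np

MU = 255.0  # G.711 mu-law parameter

def mulaw_encode(x):
    """Compress samples in [-1, 1] to 8-bit mu-law codes."""
    y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    return np.round((y + 1) / 2 * 255).astype(np.uint8)  # 256 levels -> lossy

def mulaw_decode(code):
    """Expand 8-bit mu-law codes back to samples in [-1, 1]."""
    y = code.astype(np.float64) / 255 * 2 - 1
    return np.sign(y) * ((1 + MU) ** np.abs(y) - 1) / MU

# Round trip: the reconstruction error is the lossy part of the coding.
t = np.arange(8000) / 8000.0
x = 0.5 * np.sin(2 * np.pi * 300 * t)  # 300 Hz tone, one second at 8 kHz
err = x - mulaw_decode(mulaw_encode(x))
print("max round-trip error:", np.abs(err).max())
```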
0:02:37 This is the part where there is overlap with Bill's talk from Tuesday, and I should point out a difference: we did not modify any of the system front ends. For acoustic modeling we always use a telephone-band front end, and we still see some interesting differences. That is the part that differs from what Bill studied.
0:03:05 Then we look at the second variable, which is how much better you can get if you actually use a better ASR system. Some of our systems use speech recognition, and of course the quality of the speech recognition is itself partly a function of the quality of the audio; that is the second variable we studied.
0:03:30 The history of this work is that during the 2010 SRE development cycle, which was of course based partly on the SRE-08 data, we noticed that no matter how tightly the recording of the interviews is controlled, the audio was delivered using only 8 kHz sampling and mu-law coding, and the compression artifacts were a problem, at least for our systems. So we started playing around with the effects of different audio encodings.
0:04:23 Then we got lucky, because another project, independent of SRE, was going on at SRI at the time, which gave us access to the full-bandwidth original recordings of a portion of the Mixer data that was the basis of the SRE-08 interview condition. So we created a wideband version of the SRE-08 interview data, reran our systems, and that pointed to some interesting results, which we reported at the workshop following that evaluation.
0:05:00 But those results were based on a limited data set; it wasn't the complete data set, because there was still microphone data that was not available to us in full bandwidth. So we set the topic aside, and then last year the complete SRE-08 interview and microphone data, actually including the phone calls recorded over microphones, was released at 16 kHz. At that point we decided: now we can look at this properly, on the complete eval set, and actually get solid results.
0:05:42 So we have two data sets. The first is the SRE-08 data: we took the original SRE-08 data and repartitioned it. One portion we used for training purposes, such as adding to the background data and the intersession variability training data, and we held out a set of forty-eight female and thirty-four male speakers for development testing. On that held-out data we formed all possible trials. Remember that the data is classified into short and long conversations, so we have those two conditions; the long conversations were actually truncated, to match the corresponding evaluation condition.
0:06:31 These are the numbers of trials that resulted from that strategy. Again, this was the development set that we used to develop and tune the systems.
0:06:47 The SRE-10 data is the newly released version, and we use the extended trial set, so a large number of trials, with wideband versions of both the phone calls recorded over microphones and the interviews. Of all the evaluation conditions we only look at those that involve microphones in both training and test, which gives a total of five conditions.
0:07:26 In this presentation, to keep the flood of numbers down, I will focus on the EER results. The paper also has the DCF results; only in a few places do the results differ qualitatively, and I will point those out. So let's just say one number and stick to it.
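For reference, the EER is the operating point where the miss rate equals the false-alarm rate. A minimal way to estimate it from target and impostor score lists (an illustration, not the actual scoring tools) looks like this:

```python
import numpy as np

def eer(target_scores, impostor_scores):
    """Equal error rate: sweep thresholds, find where miss ~= false alarm."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    miss = np.array([(target_scores < t).mean() for t in thresholds])
    fa = np.array([(impostor_scores >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(miss - fa))  # closest crossing of the two curves
    return (miss[i] + fa[i]) / 2

rng = np.random.default_rng(0)
tgt = rng.normal(2.0, 1.0, 300)    # toy target scores
imp = rng.normal(0.0, 1.0, 3000)   # toy impostor scores
print(f"EER ~ {eer(tgt, imp):.3f}")
```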
0:07:55 Okay, so here is how we prepared the data. The first, baseline condition is exactly how the data was delivered to us: 8 kHz with mu-law coding. Then we created ourselves a version of the data where we took the wideband data and downsampled it to 8 kHz, but left out the mu-law coding, to avoid that loss. To build these conditions we used SoX.
0:08:35 One thing we noticed about SoX is that it can do a lot of things: there are actually several different downsampling algorithms provided, and you should try them to see how they interact with your actual task. But you have to be very careful with SoX, because versions are not backward compatible: the latest version of SoX will not give the same output as the one you might have used five years ago. So be very careful about keeping older versions around; in fact, to make sure our data matched, we had to use an older version of SoX to get consistent results, because some things had changed when we tried the newer one.
0:09:20 So I am just warning you about SoX — maybe a little less harshly: I am not saying that you shouldn't use it at all, but you should use it with great care.
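As a sketch of the kind of conversion pipeline involved (the pinned binary path and file names are hypothetical; the point is simply to fix one SoX version and one resampling method for all conditions):

```python
import subprocess

# Pin one specific SoX binary: results differ across SoX versions, so the
# same binary must produce every condition. The path here is hypothetical.
SOX = "/opt/sox-14.3.2/bin/sox"

def downsample(src, dst, rate_hz):
    """Resample src to rate_hz, written as 16-bit signed linear PCM (no mu-law)."""
    subprocess.run(
        [SOX, src, "-b", "16", "-e", "signed-integer", dst, "rate", str(rate_hz)],
        check=True,
    )

downsample("interview_44k.wav", "interview_16k.wav", 16000)
downsample("interview_44k.wav", "interview_8k.wav", 8000)
```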
0:09:28 And then we have the third condition, the 16 kHz version, again without the mu-law coding.
0:09:36 Just a little more detail: we had a portion of the data available to us in flat, linear PCM encoding, actually at 44 kHz. We downsampled that to 16 kHz and used the segmentation tables to extract the segments for the development set, with some spot checking to make sure we ended up with a dataset that exactly matches the SRE-08 sets. That is how the data was prepared.
0:10:16 Now, in our system, the first thing we do with all the microphone data is run a Wiener filter for noise reduction. It comes from the front end that was developed for the Aurora evaluations by ICSI and OGI, and we run it at the full bandwidth.
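The ordering is the point: denoise at the full sample rate first, then downsample. A rough sketch, with SciPy's generic Wiener filter standing in for the actual Aurora-style front end (file names hypothetical):

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import resample_poly, wiener

rate, audio = wavfile.read("mic_16k.wav")  # hypothetical input file
audio = audio.astype(np.float64)

# Noise reduction at the full bandwidth, *before* any downsampling,
# so the filter also sees (and removes) noise energy above 4 kHz.
clean = wiener(audio, mysize=29)

# Only then downsample 16 kHz -> 8 kHz for the telephone-band front end.
clean_8k = resample_poly(clean, up=1, down=2)
wavfile.write("mic_8k_denoised.wav", rate // 2, clean_8k.astype(np.int16))
```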
0:10:42 Then we used a speech activity detection method that is not without its problems but seemed to do reasonably well; it uses a combination of an HMM-based speech activity detector and the segmentation information provided with the data.
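As a toy version of the HMM idea only (not the detector actually used): two states, nonspeech and speech, scored by frame log-energy under a Gaussian per state, with sticky transitions so that Viterbi decoding smooths the frame-level decisions:

```python
import numpy as np

def viterbi_sad(log_energy, means=(-2.0, 2.0), sigma=1.0, stay=0.98):
    """Two-state Viterbi (0 = nonspeech, 1 = speech) over frame log-energies."""
    ll = np.stack([-0.5 * ((log_energy - m) / sigma) ** 2 for m in means])
    logA = np.log([[stay, 1 - stay], [1 - stay, stay]])  # sticky transitions
    T = ll.shape[1]
    delta = ll[:, 0].copy()             # best log-score ending in each state
    back = np.zeros((2, T), dtype=int)  # backpointers
    for t in range(1, T):
        scores = delta[:, None] + logA  # scores[i, j]: prev state i -> next state j
        back[:, t] = np.argmax(scores, axis=0)
        delta = scores[back[:, t], [0, 1]] + ll[:, t]
    path = np.zeros(T, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(T - 2, -1, -1):
        path[t] = back[path[t + 1], t + 1]
    return path  # 1 marks speech frames
```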
0:11:05 More important than the particular detector, though, is that we did not want to introduce segmentation as a confounding variable in our comparison. So we took the original segmentations, which were derived from the mu-law data, and kept that same segmentation fixed across all the different audio conditions, so that no condition gets a better result simply because of better segmentation.
0:11:38 Okay, the next ingredients are the ASR systems, which feed some of the speaker recognition systems described later. We have two recognizers. The first one, our baseline recognizer, is a conversational telephone speech recognizer that has been used for the last two SRE evaluations; it is based on telephone data only. It has two stages: the second stage uses the hypotheses from the first for unsupervised adaptation. We measured the word error rate on some SRE-06 microphone data that we had transcribed ourselves, and it was below thirty percent.
0:12:31 But of course, since we now have the wideband version of this data, we have the opportunity to improve the recognition. We have a second system with a very similar structure in terms of the algorithms, the acoustic modeling, and the type of language models, so it is quite comparable to the first baseline system, but it was trained on meeting data instead, that is, on wideband data. Furthermore, the meeting data includes far-field microphones, which is important because the majority of the speech in the interview condition also comes from a far-field microphone. So we figured this would be a reasonable match to the interview data.
0:13:18 When we ran it on the interview data from SRE-10, we found that it output twenty-one percent more word tokens than the old recognizer, and because our recognizer tends simply to delete words when it has a poor acoustic match, that is a pretty good indication that the output is substantially more accurate.
0:13:37 We did not have any transcribed SRE-10 interview data, so what we did instead was compare the ASR accuracy on meeting data of a similar character. We used far-field meeting data from one of the NIST RT evaluation sets — I don't remember which one.
0:13:59 We found that the original CTS recognizer had a very high error rate. The first stage of our meeting recognizer — and this is important, it still uses 8 kHz models, so the system actually performs a kind of cross-adaptation between different kinds of acoustic models — uses narrowband models trained on the meeting data, and it already had much better accuracy, at just over forty percent error. The second stage, with 16 kHz models and unsupervised adaptation, brought the error rate down further. So clearly a big improvement in speech recognition accuracy, and consistent with the observation that we get more word tokens out.
0:14:45 Okay, now to the systems. There are three systems overall; they are drawn from the larger combination of systems used in our official SRE 2010 submission. The first system is our main standard cepstral system: it uses telephone-band analysis with twenty cepstral coefficients and 1K Gaussians, and we did not even bother to retrain the eigenchannel and eigenspeaker matrices for this study — we simply took those from the original evaluation system. It is trained on the telephone background data and performs T-norm. It is a pretty run-of-the-mill system, not using i-vectors, but as of 2010 it was pretty standard state of the art.
0:15:42system uses a few days
0:15:46model performs some
0:15:48some
0:15:49feature normalisation
0:15:51oh yeah i a total of sixteen transforms
0:15:56that come out by crossing rate for classes
0:16:01and the two genders so we have made a specific reference models and female specific
0:16:06right reference models with what they're always applied to both male data so yeah sixty
0:16:11different transforms
0:16:13okay the model that almost twenty five feature
0:16:16features you the right now but then used as you know
0:16:22i forgot to put in here that you don't perform now for
0:16:26session
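The feature construction is essentially "flatten the adaptation transforms". Assuming, say, 39-dimensional front-end features, each MLLR transform is a 39 x 40 affine matrix [A | b], and sixteen of them stacked give 24,960 dimensions, consistent with the "almost twenty-five thousand" figure. A hypothetical sketch:

```python
import numpy as np

FEAT_DIM = 39       # front-end feature dimension (an assumption)
N_TRANSFORMS = 16   # 8 phone classes x 2 gender-specific reference models

def mllr_supervector(transforms):
    """Stack all MLLR transforms [A | b], each FEAT_DIM x (FEAT_DIM + 1),
    into one fixed-length vector for the SVM."""
    assert len(transforms) == N_TRANSFORMS
    return np.concatenate([w.ravel() for w in transforms])

# 16 x 39 x 40 = 24,960 features, i.e. "almost twenty-five thousand".
identity_like = [np.eye(FEAT_DIM, FEAT_DIM + 1) for _ in range(N_TRANSFORMS)]
print(mllr_supervector(identity_like).shape)  # (24960,)
```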
0:16:30 Then there is the word n-gram system. It sounds like a very bad idea, but it captures some idiolect information. It consists of relative frequency features collected for the top thousand unigrams, bigrams, and trigrams in the background data, again modeled using an SVM.
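A rough scikit-learn rendering of that kind of idiolect feature (my own illustration; the vocabulary selection and SVM training in the real system differ in detail):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Vocabulary: the most frequent word 1/2/3-grams, chosen on background data;
# features: their relative frequencies in each speaker's ASR transcript.
vectorizer = CountVectorizer(ngram_range=(1, 3), max_features=1000)

docs = [
    "yeah so i mean you know we did the thing",   # placeholder transcripts
    "well basically what happened was you know",
]
X = vectorizer.fit_transform(docs).astype(float)
X = X.multiply(1.0 / X.sum(axis=1)).tocsr()  # counts -> relative frequencies

svm = LinearSVC().fit(X, [1, 0])  # toy labels: target vs. impostor examples
```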
0:16:52 Okay, here is the first interesting comparison. We ran the cepstral system under the three different data conditions on the SRE-08 data, in the short and long conversation conditions. You can see clearly that the largest gain, about twelve percent relative, comes from dropping the lossy mu-law coding.
0:17:21 On top of that there is a small additional gain from switching from 8 to 16 kHz sampling. You might wonder: the GMM front end operates at 8 kHz, so what could possibly improve by switching to 16? The answer is that the noise filtering happens at the full bandwidth, so the spectral subtraction works better when you operate at 16 kHz and only then downsample.
0:17:57 So this was an interesting result for us, because it requires fairly minimal changes to the system — the GMM models themselves don't change at all.
0:18:10 Now the same system on SRE-10 data. That is a lot of numbers, so let me summarize: basically you get pretty substantial gains, on the order of ten percent relative, and the largest EER reduction is for the low vocal effort condition. That is very suggestive, because I would think that low vocal effort speech is especially affected by the lossy coding — but be careful, those vocal-effort trial sets are very small. I should also point out, as shown in the paper, that the relative improvements in DCF are somewhat lower than the roughly ten percent we see in EER.
0:19:03 Okay, now more numbers. For the MLLR system the benefit is much larger, and here we have two different contrast conditions. The acoustic modeling always uses telephone speech, so we don't have to retrain anything; the old telephone background data can still be used for the background model. However, the audio processed before the final downsampling step can be the 16 kHz audio, so you get the benefit of not having the lossy coding and of doing the noise filtering at full bandwidth. Then you have two choices for the recognition hypotheses: you can use the first recognition stage, which comes from the 8 kHz models, or the second stage, which comes from the 16 kHz models and which, as we saw, does substantially better. We have both of these here, and of course the second one is consistently better, if only by a small margin on some of the data conditions — the second stage gives more accurate hypotheses. Overall we see very substantial gains of about twenty percent.
0:20:26 I know this is a lot of numbers — it is a two-dimensional space of contrasts — but roughly: about two thirds of the gain comes from the switch from 8 kHz to 16 kHz audio while still using the worse, 8 kHz ASR hypotheses, and then when you feed in the better ASR hypotheses you get another, smaller gain on top. This is pretty consistent across conditions.
0:21:00 Okay, just to round out the results: the word n-gram system operates at a much worse operating point, and the relative gains there are much smaller. I could speculate about why, but I will just remark that overall it shows the same trends.
0:21:30 So, to conclude: with the recent changes in the evaluation and development data, the question arose of how to make use of the full audio bandwidth that is now available to us, and this applies both to cepstral systems and to systems that use ASR. Having studied this on two datasets, SRE-08 and SRE-10, we can draw a few conclusions. There are substantial gains to be had from processing the full 16 kHz bandwidth. Avoiding the lossy mu-law encoding is a big plus — probably the biggest single plus for the cepstral system — but you also get a small additional gain by doing the noise filtering at the full bandwidth. The ASR-based systems get a significant boost from using better ASR, and for that you of course need to find a matched ASR system; we were quite successful using a meeting recognizer that had been trained for the NIST RT evaluations on far-field data. And note that we have not actually changed the analysis bandwidth of any of the acoustic models — we are still using telephone-band front ends for both the cepstral and the ASR models.
0:22:49 As for future work, there is quite a bit. Obviously the three systems are complementary, so the next step is score combination, which we haven't done here; it is an open question how much of the gain survives combination. What will require quite a bit of work is to also redo the prosodic systems, which are very nicely complementary, on the wideband data. And then there are the questions of whether we can do better by retraining our acoustic models on wideband data, or alternatively whether we can come up with clever ways of using the wideband data as is, such as bandwidth extension methods, or simply modeling the bandwidth mismatch as one more dimension of variability.
0:23:50 [Audience question, partly inaudible.]
0:24:21 Well, then you have to use the telephone data for that.
0:24:32 Well, first of all, it was a bigger set, and we also felt that having a large number of speakers would make the results more reliable.
0:24:56 [Audience question, partly inaudible.]
0:25:04 We looked at the spectral shaping when the data is downsampled; I think after that we didn't change it.