Okay, so this talk: you will notice that some of it overlaps with a previous presentation, namely Bill's talk, and I'll try to point out the differences. I'll start with the motivation, then describe what the data is and how we prepared it. We actually use more than just a cepstral system, so I'll explain what the systems are; then I'll describe the results, some open questions, and future work.
So, as we all know, the SRE used to be just telephone speech; that was all there was. However, as our keynote presentation yesterday described, recent evaluations added interview speech recorded with microphones. And up to and including the 2010 SRE, NIST was still distributing that data as if it were telephone speech: at an 8 kHz sampling rate, with μ-law coding, which is a lossy coding scheme that works well for telephone speech. So one of the things we are looking at is the effect of those two factors when you remove them, that is, when you relax the constraints on how the data is encoded. And this is the part where there is overlap with Bill's talk from Tuesday.
One difference I should point out: we did not modify the system front ends. The acoustic modeling stays the same; we always use a telephone-band front end, and we still see some interesting differences. That is the part that differs from what Bill studied. The second variable we look at is how much better the systems get if you use a better ASR system. Some of our systems use speech recognition, and of course the quality of the speech recognition is partly a function of the quality of the audio; that is the second variable we study.
The history of this work is that during the 2010 SRE development cycle, which was of course based on the SRE08 data, we noticed that it really matters how the recording of the interviews is handled. The interview audio had been distributed using only the 8 kHz μ-law coding, and the lossy compression was a problem for our systems, at least for some of them. So we started playing around with the effects of different audio encodings.
And then we got lucky, because another project, independent of SRE, was going on at SRI at the time, which basically gave us access to the fullband original recordings of a portion of the Mixer data that was the basis for the SRE08 interviews. So we created a wideband version of that part of the SRE08 interview data, which pointed to some interesting results that we reported at a workshop afterwards. But those results were on a limited data set; it was not the complete data set, because not all of the microphone data was available to us in full bandwidth. So we set this aside, and then last year the complete SRE08 interview and microphone data, including the phone calls recorded over microphone, was released at 16 kHz. At that point we decided we could finally look at this properly: with the complete eval set we could get definitive results.
So we have two data sets. The first is the SRE08 data. We repartitioned the original SRE08 data: one portion we use for training purposes, such as adding to the background data and the intersession variability training data, and we held out a set of 48 female and 34 male speakers for development testing. We then formed all possible trials on that held-out data. Remember that the data is classified into short and long conversations; we kept those two conditions, and the long conversations were actually truncated to match the corresponding evaluation condition. These are the numbers of trials that resulted from that strategy; again, this was the development set that we used while developing the systems.
The second data set is the SRE10 data, in its released version, using the extended trial set, so a very large number of trials. Here we have wideband versions of both the phone calls recorded over microphone and the interview data. We only look at the conditions that involve microphones in both training and test, a total of five conditions. In this presentation I will focus on the EER results; the paper also has the DCF results, and while the numbers differ, they do not differ qualitatively, so I will stick to one set of numbers.
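Since the comparisons below are all in terms of EER, here is a minimal sketch, not the evaluation tooling actually used, of how an equal error rate can be computed from target and impostor scores:

```python
import numpy as np

def eer(target_scores, impostor_scores):
    """Equal error rate: the point where miss rate equals false-alarm rate."""
    scores = np.concatenate([target_scores, impostor_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(impostor_scores))])
    order = np.argsort(scores)          # sweep thresholds from low to high
    labels = labels[order]
    # Misses: targets scored at or below threshold; false alarms: impostors above.
    miss = np.cumsum(labels) / labels.sum()
    fa = 1.0 - np.cumsum(1 - labels) / (1 - labels).sum()
    idx = np.argmin(np.abs(miss - fa))  # closest crossing point
    return (miss[idx] + fa[idx]) / 2

# Hypothetical scores: higher means more target-like.
tgt = np.array([2.0, 1.5, 0.9, 1.8])
imp = np.array([0.1, 0.5, 1.1, 0.2, 0.3, 0.4])
print(round(eer(tgt, imp), 3))
```

With finite score lists the miss and false-alarm curves rarely cross exactly, so the sketch averages the two rates at the nearest crossing.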
Okay, so here is how we prepared the data. The first condition is the baseline condition: the data exactly as it was delivered to us, at 8 kHz with μ-law coding. Then we made ourselves a version of the data where we took the 16 kHz data and downsampled it to 8 kHz, but without the μ-law coding, to avoid the lossy compression.
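As an illustration of why the μ-law constraint matters: μ-law is an 8-bit companding scheme (G.711 uses μ = 255), so decoding does not recover the original samples exactly. A minimal sketch of the continuous companding formula with 8-bit quantization:

```python
import numpy as np

MU = 255.0  # G.711 mu-law parameter

def mulaw_encode(x):
    """Compress samples in [-1, 1] logarithmically and quantize to 8 bits."""
    y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    return np.round((y + 1) / 2 * 255).astype(np.uint8)

def mulaw_decode(code):
    """Invert the companding; the quantization error remains."""
    y = code.astype(np.float64) / 255 * 2 - 1
    return np.sign(y) * ((1 + MU) ** np.abs(y) - 1) / MU

x = np.linspace(-1, 1, 101)
err = np.max(np.abs(mulaw_decode(mulaw_encode(x)) - x))
print(err)  # nonzero: the coding is lossy
```

The companding gives fine resolution near zero and coarse resolution at large amplitudes, which is exactly the kind of distortion a lossless distribution format avoids.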
To build that condition we used sox for the downsampling. One thing we noticed about sox is that it can do a lot of things; in particular it provides several different downsampling algorithms, and you should try them to see how they interact with your actual task. But one thing you have to be very careful about with sox is that versions are not backward compatible: if you take the latest version of sox, it will not give the same results as the one you might have used five years ago. So be very careful with keeping older versions around; to make sure our results stayed consistent, we had to use an older version of sox. There were some things that changed between the versions we tried, so I am just warning you about sox, maybe a little less harshly: I am not saying that you should not use it at all, but you should use it with great care. And then the third condition is the 16 kHz data, again without the μ-law coding.
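The downsampling step itself can be sketched without sox; as noted above, the exact resampler (and the tool version) affects the result. Here is a plain windowed-sinc anti-alias filter plus decimation, one reasonable choice rather than the method the talk's pipeline used:

```python
import numpy as np

def downsample_2x(x, numtaps=101):
    """Halve the sample rate: windowed-sinc anti-alias low-pass, then decimate."""
    n = np.arange(numtaps) - (numtaps - 1) / 2
    h = 0.5 * np.sinc(0.5 * n)      # ideal low-pass at fs/4 (the new Nyquist)
    h *= np.hamming(numtaps)        # window to make the filter finite
    h /= h.sum()                    # unity gain at DC
    filtered = np.convolve(x, h, mode="same")
    return filtered[::2]            # keep every second sample

fs = 16000
t = np.arange(fs) / fs
x = 0.5 * np.sin(2 * np.pi * 440 * t)  # 440 Hz tone, well below the new 4 kHz Nyquist
y = downsample_2x(x)
print(len(y))  # 8000
```

Different filter lengths, windows, and cutoffs all give slightly different audio, which is exactly why pinning the resampler version matters for reproducibility.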
Just a little bit more detail: we had a portion of the original recordings available to us in FLAC encoding, which is lossless, at 44 kHz. We downsampled those to 16 kHz, and we used the original segmentation tables to extract the segments for the development data, with some spot checking to make sure we ended up with a data set that exactly matches the original 8 kHz version.
Now, in our system, the first thing we do with all the microphone data is apply a Wiener filter. This noise filter was not developed by us; it goes back to earlier robust-recognition evaluations and work by ICSI and OGI, among others. Then we use a speech activity detection method that we developed some time ago and that seems to do reasonably well; it combines an HMM-based speech activity detector with the segmentation provided with the data. But the important thing is that we did not want to introduce segmentation as a confounding variable in our comparison. So we took the original segmentations, which were derived from the 8 kHz μ-law data, and we kept that same segmentation fixed across all the different audio conditions. That way, if one condition does better, it is not because it got a better segmentation.
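The fixed-segmentation point can be made concrete: keep the segment boundaries in seconds and convert them to sample indices per condition, so every audio condition is scored on exactly the same time spans. A sketch with a hypothetical, invented table format (not NIST's):

```python
# Hypothetical segment table: (start_sec, end_sec) spans of detected speech.
segments = [(0.50, 2.25), (3.10, 7.80)]

def to_samples(segments, rate):
    """Convert second-based spans to sample-index spans at a given sample rate."""
    return [(round(s * rate), round(e * rate)) for s, e in segments]

# The same time spans are applied to the 8 kHz and 16 kHz conditions,
# so segmentation cannot explain any difference between them.
seg_8k = to_samples(segments, 8000)
seg_16k = to_samples(segments, 16000)
dur_8k = sum(e - s for s, e in seg_8k) / 8000
dur_16k = sum(e - s for s, e in seg_16k) / 16000
print(dur_8k == dur_16k)  # True: identical speech duration in both conditions
```

Deriving the spans once, from the baseline data, and only converting units per condition is what keeps the comparison controlled.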
Okay, so that is the audio processing that forms the basis for all systems. Some systems additionally use ASR, as I will describe later, so we have two recognizers. The first, our baseline recognizer, is a conversational telephone speech recognizer that has been used in the last two NIST ASR evaluations; that is why it is based on telephone data only. It has two stages: the second stage takes the hypotheses from the first one for unsupervised adaptation. We measured its word error rate on some SRE06 microphone data that we had transcribed ourselves; it is below thirty percent.
But of course, since we now have the wideband version of this data, we have the opportunity to improve the recognition. We have a different system with a very similar structure in terms of the algorithms for acoustic modeling and the type of language models, so it is quite comparable to the baseline system, but it was trained on meeting data, that is, on wideband data, and furthermore the meeting data includes far-field microphone recordings. That is important because the majority of the speech in the interview condition also comes from far-field microphones. So we expected this to be a reasonable match to the interview data. And when we ran it on the interview data from SRE10, we found that it output 21 percent more word tokens than the old recognizer; and because our recognizer tends to delete words when it has a poor acoustic match, that is a pretty good indication that the new recognizer is a substantially better match to this data.
We did not have any transcribed SRE10 interview data ourselves, so instead we compared ASR accuracy on meeting data of similar character: far-field meeting data from one of the NIST RT evaluation sets. We found that the original CTS recognizer had a very high word error rate. The first stage of our meeting recognizer, which importantly still uses 8 kHz models (the system performs a kind of cross-adaptation between different kinds of acoustic models, and the first stage uses narrowband models trained on meeting data), already had much better accuracy, just over forty percent; and the second stage, with 16 kHz models and unsupervised adaptation, brought the error rate down further still. So clearly a big improvement in speech recognition accuracy, and probably consistent with the observation that we get more word tokens.
Okay, now to the systems. There were three systems overall; they come from the larger combination of systems that was used in our official SRE 2010 submission. The first system is our main cepstral system. It uses telephone-band analysis with twenty cepstral coefficients and 1K Gaussians, and we did not even bother to retrain the speaker and channel subspaces for this; we take those from the original evaluation system. It is trained on the background data and performs t-norm. So it is a pretty run-of-the-mill system, not using i-vectors, but as of 2010 it was pretty standard state of the art.
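The t-norm step mentioned above normalizes each trial score using statistics obtained by scoring the same test segment against a cohort of impostor models. A minimal sketch, with invented cohort scores:

```python
import numpy as np

def t_norm(raw_score, cohort_scores):
    """Test-normalize a trial score: shift and scale by the mean and standard
    deviation of the scores the same test segment gets against cohort models."""
    mu = np.mean(cohort_scores)
    sigma = np.std(cohort_scores)
    return (raw_score - mu) / sigma

cohort = np.array([0.2, -0.1, 0.4, 0.0, 0.1])  # hypothetical impostor-cohort scores
print(round(t_norm(1.3, cohort), 3))
```

The normalization makes scores from different test segments comparable, so a single decision threshold can be used across trials.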
Okay, the two systems that use ASR. The first one is our MLLR system. It uses the adaptation transforms of the recognizer as features and performs feature normalization, namely rank normalization. There is a total of sixteen transforms, which come from crossing eight transform classes with the two genders: we have male-specific reference models and female-specific reference models, and both are always applied to all the data, male and female, so you get sixteen different transforms. That gives on the order of 25,000 features, which are rank-normalized and then used in an SVM. I forgot to put on the slide that we also perform NAP for session variability compensation.
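To make the MLLR-feature idea concrete: each transform is an affine map that moves reference-model means toward a speaker's data, and the stacked entries of the sixteen transforms become the feature vector. A heavily simplified, hypothetical sketch that estimates one affine transform by least squares (real MLLR maximizes likelihood under the HMM, which this does not do):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical reference mean vectors (one row per Gaussian) and the
# speaker-adapted means we want to reach: Y ≈ X @ A.T + b.
dim, n_gauss = 3, 50
X = rng.normal(size=(n_gauss, dim))
A_true = np.eye(dim) + 0.1 * rng.normal(size=(dim, dim))
b_true = rng.normal(size=dim)
Y = X @ A_true.T + b_true

# Least-squares estimate of the extended transform W = [A.T; b].
X_ext = np.hstack([X, np.ones((n_gauss, 1))])  # append 1 for the bias term
W, *_ = np.linalg.lstsq(X_ext, Y, rcond=None)  # shape (dim + 1, dim)
features = W.T.flatten()                       # transform entries -> feature vector
print(features.shape)  # (12,) = dim * (dim + 1) entries for this one transform
```

In the real system the feature vector concatenates all sixteen such transforms, which is how the dimensionality reaches tens of thousands.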
Then there is the word n-gram system. It sounds very simple, but it gives some useful complementary information. It consists of relative frequency features collected over the top thousand unigrams, bigrams, and trigrams, selected from the background data, and again modeled with an SVM.
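A minimal sketch of that feature extraction: pick a top-N vocabulary of unigrams, bigrams, and trigrams from background text, then represent each conversation side by the relative frequencies of those entries (the SVM step is omitted, and N is tiny here instead of 1000):

```python
from collections import Counter
from itertools import chain

def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def all_ngrams(words):
    return chain(ngrams(words, 1), ngrams(words, 2), ngrams(words, 3))

# Hypothetical background text selects the top-N vocabulary.
background = "i mean you know i mean it is you know".split()
top_n = [g for g, _ in Counter(all_ngrams(background)).most_common(5)]

def rel_freq_features(words, vocab):
    """Relative frequency of each vocabulary n-gram in one conversation side."""
    counts = Counter(all_ngrams(words))
    total = sum(counts.values()) or 1
    return [counts[g] / total for g in vocab]

print(rel_freq_features("you know i mean you know".split(), top_n))
```

The ASR transcript supplies the words, which is why this system's performance depends on recognition quality at all.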
Okay, here is the first interesting comparison. We ran the cepstral system on the SRE08 data, in the short and the long data conditions, under the three different audio conditions. You can see clearly that the largest gain, about twelve percent relative, comes from dropping the lossy μ-law coding, and that there is a small additional gain from switching from 8 to 16 kHz sampling. Now, you might think: the GMM front end operates at 8 kHz, so what could possibly improve by switching to 16 kHz? The answer is that the noise filtering happens at the full bandwidth: the spectral subtraction works better when you operate at 16 kHz and only downsample afterwards. This was an interesting result for us, because it requires fairly minimal changes to the system; the GMMs themselves do not change at all.
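The full-band filtering argument can be illustrated with plain magnitude-domain spectral subtraction, a simplification of the Wiener filtering actually used: estimate the noise spectrum from noise-only frames, subtract it per frame, floor the result, and resynthesize, all before any downsampling:

```python
import numpy as np

def spectral_subtract(x, frame=256, noise_frames=10):
    """Magnitude-domain spectral subtraction with a spectral floor.
    The noise spectrum is estimated from the first frames, assumed speech-free."""
    n = len(x) // frame
    spec = np.fft.rfft(x[:n * frame].reshape(n, frame), axis=1)
    mag, phase = np.abs(spec), np.angle(spec)
    noise_mag = mag[:noise_frames].mean(axis=0)          # average noise magnitude
    clean_mag = np.maximum(mag - noise_mag, 0.05 * mag)  # subtract, keep a floor
    return np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame, axis=1).ravel()

rng = np.random.default_rng(0)
fs, frame = 16000, 256
noise = 0.1 * rng.normal(size=2 * fs)
tone = np.concatenate([np.zeros(10 * frame),  # leading noise-only stretch
                       np.sin(2 * np.pi * 300 * np.arange(2 * fs - 10 * frame) / fs)])
y = spectral_subtract(tone + noise)
err = np.var(y - tone[:len(y)])
print(err < np.var(noise))  # enhancement leaves less residual than the injected noise
```

Done at 16 kHz, the noise estimate covers the full spectrum; only then is the enhanced signal reduced to 8 kHz for the front end.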
Now we run the same system on the SRE10 data. There are a lot of numbers here, so to summarize: you get pretty substantial gains, on the order of ten percent relative, and the largest EER reduction is for the low vocal effort condition. That is very suggestive, because I think low vocal effort speech is especially affected by the lossy coding; but be careful drawing conclusions, because those subsets are very small. I should also point out, as shown in the paper, that the relative improvements in DCF are somewhat lower than the roughly ten percent we see in EER, but the gains are still there.
Okay, now more numbers. The MLLR system benefits much more, and here we have two different contrast conditions. The acoustic modeling always uses telephone speech, so we do not have to retrain anything; the old telephone background data can still be used for the background model. However, the audio that you process before the final downsampling step is the 16 kHz audio, so you get the benefit of not having the lossy coding and of doing the noise filtering at the full bandwidth. And then you have two choices for the ASR hypotheses: you can use the first recognition stage, which comes from the 8 kHz models, or you can use the second stage, which comes from the 16 kHz models and which, as we saw, does substantially better. We have both of these here, and of course the second one is consistently better, though sometimes by a very small margin, especially on the smaller data conditions; the second stage simply gives more accurate hypotheses. Overall we see very substantial gains, on the order of twenty percent.
This slide has a lot of numbers, but it is really a two-dimensional breakdown: on one axis you have the audio condition, on the other the ASR condition. You see that roughly two thirds of the gain comes from the switch from 8 kHz to 16 kHz audio while still using the old recognizer, and then when you add the better ASR you get roughly the remaining third on top. And this is pretty consistent across conditions.
Okay, now quickly through the remaining results. The word n-gram system operates at a much worse operating point to begin with, and the relative gains are much smaller. I could only speculate as to why, so let me just remark on it. And the prosodic system, which I will come back to in a second, overall behaves the same way.
So, to conclude. Given the recent changes in the evaluation data, the question was how to make use of the full-bandwidth audio that is now available to us, and this applies both to cepstral systems and to systems that use ASR. Studying this on two data sets, SRE08 and SRE10, we can draw a few conclusions. There are substantial gains to be had. Lossless encoding is a big plus, probably the biggest plus for the cepstral system, but you also get a small additional gain by doing the noise filtering at the full bandwidth. The ASR-based systems get a significant gain from using better ASR, and of course you need to find a matched ASR system; we were quite successful using a meeting recognizer that was trained for the NIST RT evaluations, including far-field data. And note what we have not changed: we have not changed the analysis bandwidth of the acoustic models at all. We are still using telephone-band front ends, both for the cepstral system and for ASR.
As for future work, there is quite a bit. Obviously, the three systems should be combined; we have not done that yet, and it is an open question how much of the gain survives the combination. Something that will require quite a bit of work is to also redo the prosodic system, which is usually very nicely complementary, on the wideband data. And then there are the questions of whether we can do better by retraining our acoustic models on wideband data, or alternatively whether we can come up with clever ways to use the wideband data with the existing models, such as bandwidth extension methods, or simply modeling the bandwidth mismatch as one dimension of variability.
Q: [inaudible]

A: Well, then you have to use the telephone data; that is the constraint.
Q: [inaudible]

A: Well, first of all, it was a bigger set, and we also wanted a large number of speakers, as the results depend on that.

Q: [inaudible]

A: We did look at the spectral shaping when we downsampled; I think beyond that we did not change anything.