Okay, so in this talk you will notice that some of the content overlaps with a presentation given earlier in the workshop; I'll try to point out the differences.


I'll start with the motivation, then describe what the data is and how we prepared it. We actually use more than just a cepstral system, so I'll explain what the systems are. Then I'll describe the results, and close with some open questions and future work.

So, as we all know, the SRE used to be all telephone speech — this came up in the keynote presentation yesterday — but in recent years it has added interview speech recorded over microphones.

Up to and including the 2010 SRE, NIST was still distributing this data as if it were telephone speech: 8 kHz sampling rate, and mu-law coding, which is a lossy coding scheme that works well for telephone speech. So one of the things we are going to look at is the effect of those two factors — what happens when you remove, or relax, the constraints on how the data is encoded.
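To make the "lossy" point concrete, here is a rough sketch of mu-law companding with 8-bit quantization. This is the continuous companding curve, not the exact segmented bit layout of G.711; the constants and helper names are illustrative only.

```python
import math

MU = 255.0  # mu-law parameter used in North American telephony

def ulaw_encode(x):
    """Compand a sample in [-1, 1] and quantize to an 8-bit code."""
    sign_bit = 0 if x >= 0 else 0x80
    mag = math.log1p(MU * abs(x)) / math.log1p(MU)  # compand to [0, 1]
    return sign_bit | int(round(mag * 127))          # 7 bits of magnitude

def ulaw_decode(code):
    """Invert the companding; the quantization error is not recoverable."""
    sign = -1.0 if code & 0x80 else 1.0
    mag = (code & 0x7F) / 127.0
    return sign * math.expm1(mag * math.log1p(MU)) / MU

x = 0.3
y = ulaw_decode(ulaw_encode(x))
# y is close to x but not equal: the 8-bit quantization is lossy.
```

The round trip never returns the original sample exactly, which is the loss the experiments below try to avoid.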

This is the part where there is overlap with Bill's talk from Tuesday, and I should point out a difference: we did not vary the system front ends. For acoustic modeling we always use a telephone-band front end, and we still see some interesting differences — that is the part that distinguishes this study.

Then we look at the second variable, which is how much better you can get if you actually use a better ASR system. Some of our systems use speech recognition, and of course the quality of the recognition output is itself partly a function of the quality of the audio; that is the second variable we study.

So the history of this work is that during the 2010 SRE development cycle, which of course was based on the SRE-08 data, we noticed that the way the interview recordings had been processed really mattered: the audio had been downsampled and passed through mu-law coding, and the resulting compression loss was a problem, at least for our systems. So we started digging into the effects of the different encodings.


And then we got lucky, because we had another project, independent of SRE, going on at SRI at the time, which basically gave us access to the full-band original recordings of a portion of the Mixer data that was the basis of the SRE-08 interview test. So we created a wideband version of the SRE-08 interview data, ran on that, and got some interesting results, which we reported at the workshop following SRE-08.


But the results were based on a limited data set — it wasn't the complete data set, because there was still microphone data that was not available to us in full bandwidth. So we set the study aside, and then last year the complete SRE-10 interview and microphone data, including the phone calls recorded over microphones, was released at 16 kHz. At that point we decided: now we can look at this properly, on the complete eval set, and actually get solid results.

So we have two data sets. The first is the SRE-08 data: we repartitioned the original SRE-08 data, using one portion for training purposes — adding to the background data and the intersession-variability training data — and holding out a set of 48 female and 34 male speakers for development testing. That held-out portion is the data reported here, and we formed all possible trials from it. Remember that the SRE-08 data is classified into short and long conversations; we kept those two conditions, with the long conversations truncated. These are the numbers of trials that resulted from that strategy — again, this is the development set.
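The trial-building strategy just described — hold out a set of speakers, then form all possible enrollment/test pairs, with same-speaker pairs as targets and cross-speaker pairs as impostors — can be sketched as follows. The speaker-to-segment map is a toy stand-in, not the actual SRE-08 partition.

```python
from itertools import product

def all_trials(speaker_segments):
    """speaker_segments: dict mapping speaker id -> list of segment ids.
    Pair every segment with every other segment; same-speaker pairs are
    target trials, cross-speaker pairs are impostor trials."""
    targets, impostors = [], []
    items = list(speaker_segments.items())
    for (spk_a, segs_a), (spk_b, segs_b) in product(items, repeat=2):
        for enroll, test in product(segs_a, segs_b):
            if enroll == test:
                continue  # never score a segment against itself
            (targets if spk_a == spk_b else impostors).append((enroll, test))
    return targets, impostors

# Toy held-out set: two speakers with two segments each.
tgt, imp = all_trials({"f01": ["a1", "a2"], "m01": ["b1", "b2"]})
```

With realistic held-out sets (dozens of speakers, many segments each) this "all possible trials" strategy is what produces the large trial counts on the slide.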

The second data set is the SRE-10 data, in the as-released version, using the extended trial set — so a very large number of trials. Here we work with the wideband versions of both the phone calls recorded over microphone channels and the interview recordings. We only look at the conditions that involve microphones on either side: a total of five conditions.


In this presentation, since there are a lot of numbers, I'll focus on the EER throughout. The paper also has the DCF results; they differ quantitatively but not qualitatively, so I'll stick to the one metric.
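Since all the results that follow are reported as EER, here is a minimal sketch of how an equal error rate can be computed from target and impostor score lists — a simple threshold sweep for illustration, not the official NIST scoring tool.

```python
def eer(target_scores, impostor_scores):
    """Find the threshold where the false-accept and false-reject rates
    are closest, and return the mean of the two rates at that point."""
    best_gap, best_rate = None, None
    for t in sorted(set(target_scores) | set(impostor_scores)):
        fr = sum(s < t for s in target_scores) / len(target_scores)
        fa = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        if best_gap is None or abs(fa - fr) < best_gap:
            best_gap, best_rate = abs(fa - fr), (fa + fr) / 2
    return best_rate

# Toy scores: higher means "same speaker".
rate = eer([2.0, 3.0, 4.0, 5.0], [0.0, 1.0, 2.0, 3.0])
```

The DCF differs from the EER only in weighting the two error types asymmetrically at a fixed operating point, which is why the two metrics usually agree qualitatively.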

Okay, so here is how we prepared the data. The first condition is the baseline: exactly the 8 kHz, mu-law-coded audio, which is how the data was delivered to us. Then we created our own version of the data where we took the wideband originals and downsampled them to 8 kHz, but left out the mu-law coding, to avoid the coding loss. That gives us the middle condition.

One thing I should say about SoX, which is what we used: SoX can do a lot of things, and it actually provides several different downsampling algorithms — you should try them and see how they interact with your actual task. And one thing you have to be very careful about with SoX is that versions are not backward compatible: the latest version of SoX will not produce the same output as the one you might have used five years ago. So be very careful about keeping older versions around; to make sure our data matched, we in fact had to use an older version of SoX ourselves. There were some surprises when we tried it. So, just a warning — I'm not saying you shouldn't use it at all, but you should use it with great care.
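The warning about resampler differences is easy to demonstrate: decimating without a proper anti-alias low-pass filter folds high-frequency energy back into the band, and different resamplers differ exactly in how they handle this. A toy example in pure Python, with a made-up test tone:

```python
import math

def tone(freq_hz, rate_hz, n):
    """n samples of a sine wave at freq_hz, sampled at rate_hz."""
    return [math.sin(2 * math.pi * freq_hz * i / rate_hz) for i in range(n)]

# A 6 kHz tone sampled at 16 kHz, "downsampled" to 8 kHz by naive
# 2:1 decimation with no low-pass filtering first.
decimated = tone(6000, 16000, 1600)[::2]

# At 8 kHz the Nyquist limit is 4 kHz, so 6 kHz folds to 8 - 6 = 2 kHz:
# the decimated signal equals a 2 kHz tone (with inverted phase).
alias = tone(2000, 8000, 800)
max_diff = max(abs(d + a) for d, a in zip(decimated, alias))
```

A real resampler low-pass filters before decimating, and the shape of that filter is precisely the kind of thing that changes between tool versions.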

And then the 16 kHz data, in a little more detail. We had a portion of the Mixer recordings available to us in FLAC (lossless) encoding at 44 kHz, so we downsampled that to 16 kHz, and we used the segmentation tables to extract the segments for the development set, with some spot checking to make sure we had a dataset that exactly matches the official test set. That is how the data was prepared.

Now, on to our systems. The first thing we do with all the microphone data is run it through a Wiener filter for noise reduction. It comes from the Qualcomm-ICSI-OGI Aurora front end, which has been run in evaluations for years, so it is a well-tested component.
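I'm not reproducing the actual Aurora front end here, but the core idea of Wiener filtering is a per-frequency-bin gain derived from signal and noise power estimates — a minimal sketch with made-up PSD values:

```python
def wiener_gain(signal_psd, noise_psd):
    """Classic Wiener gain G = S / (S + N) per frequency bin: bins where
    the noise estimate dominates are attenuated, clean bins pass through."""
    return [s / (s + n) if s + n > 0 else 0.0
            for s, n in zip(signal_psd, noise_psd)]

# Three toy bins: clearly speech, borderline, and noise-only.
gains = wiener_gain([3.0, 1.0, 0.0], [1.0, 1.0, 2.0])
```

The point that matters for this talk: the gain is computed per frequency bin, so running the filter on the full 16 kHz band gives it more spectrum to estimate from than running it on already-downsampled 8 kHz audio.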


Then we use a speech activity detection method that was never published but seems to do reasonably well; it is based on a combination involving an HMM-based speech activity detector. But the important thing is that we did not want to introduce segmentation as a confounding variable in our comparison, so we took the original segmentations, which were derived from the mu-law data, and kept that same segmentation fixed across all the audio conditions. That way no condition gets an advantage from a better segmentation.


Okay, the next ingredient is ASR, which some of the systems rely on. We have two recognizers. The first, our baseline, is a conversational telephone speech recognizer that has been used in our last two ASR evaluation systems; it is trained on telephone data only. It has two stages: the second stage takes hypotheses from the first for unsupervised adaptation. We measured its word error rate on some SRE microphone data that we had transcribed ourselves, and it was below thirty percent. But of course, since we now have the wideband version of this data, we have the opportunity to improve the recognition.

We have a second system with a very similar structure in terms of the acoustic modeling algorithms and the type of language models, so it is broadly comparable to the baseline system, but it was trained on meeting data — that is, on wideband data — and the meeting data includes far-field microphones. That is important because the majority of the speech in the interview condition is also from far-field microphones, so we figured this would be a reasonable match to the interview data.
When we ran it on the interview data from SRE-10, we found that it output 21 percent more word tokens than the old recognizer, and because our recognizer tends to delete words when it has a poor acoustic match, that is a pretty good indication that it is substantially more accurate. We did not have any transcribed SRE-10 interview data ourselves, so we simply compared ASR accuracy on meeting data of similar acoustic character: far-field meeting data from one of the NIST RT evaluation sets — I don't remember which one.

We found that the original CTS recognizer had a very high error rate on that data. The first stage of our meeting recognizer — and this is important — still uses 8 kHz models; the system performs a kind of cross-adaptation between different kinds of acoustic models, and because those first-stage narrowband models were trained on meeting data, they already gave much better accuracy, at around forty percent error. The second stage, with 16 kHz models and unsupervised adaptation, brought the error rate down further. So clearly a big improvement in speech recognition accuracy, and consistent with the observation that the new recognizer outputs more word tokens.

Okay, now to the systems. There were three; all of them come from the larger combination of systems that was used in our official SRE 2010 submission. The first is our mainstay cepstral system: it uses telephone-band analysis, cepstral coefficients, and 1K-Gaussian models, and we did not even bother to retrain the channel compensation for this study — we take it from the original telephone system. It performs T-norm, and it is a pretty run-of-the-mill system — not using i-vectors, but as of 2010 it was pretty standard state of the art.
Now the two systems that use ASR. The first is our MLLR system: it uses the adaptation transforms estimated against the recognizer's models, after feature normalization, for a total of sixteen transforms — these come from crossing eight transform classes with the two genders, since we have male-specific and female-specific reference models and always apply both, regardless of the gender of the data. The transform coefficients form the feature vector, which after rank normalization goes into an SVM; I forgot to put score normalization on the slide, but it is performed as well.

Then there is the word n-gram system. It sounds very basic, but it gives us some complementary information. It consists of relative-frequency features collected for the top thousand unigrams, bigrams, and trigrams, selected on the background data, and again it uses an SVM.
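A sketch of the kind of feature extraction such a word n-gram system uses, with a toy vocabulary standing in for the top-1000 n-grams selected on background data (the function name and vocabulary are illustrative, not the actual system):

```python
from collections import Counter

def ngram_features(tokens, vocab):
    """Relative frequency of each vocabulary n-gram in the token stream."""
    counts, total = Counter(), 0
    for n in (1, 2, 3):  # unigrams, bigrams, trigrams
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
            total += 1
    return [counts[g] / total if total else 0.0 for g in vocab]

vocab = [("a",), ("b",), ("a", "b")]
feats = ngram_features(["a", "b", "a"], vocab)
```

Each recording becomes one fixed-length vector over the shared vocabulary, which is what makes an SVM applicable.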


Okay, now for the interesting part: results. We ran our cepstral system with the three different audio conditions on the SRE-08 data, in both the short and the long data conditions. You can see clearly that the largest gain, about twelve percent relative, comes from dropping the lossy mu-law coding, and that there is a small additional gain from switching from 8 to 16 kHz.


You might think: the GMM front end operates at 8 kHz, so what could possibly improve by switching to 16 kHz? The answer is that the noise filtering happens at the full bandwidth: the spectral Wiener filtering works better when you operate at 16 kHz and do the downsampling afterwards. This was an interesting result for us, because it requires fairly minimal changes to the system — the GMMs themselves don't change at all.


Now the same system on SRE-10 data. There are a lot of numbers, so let me summarize: you get pretty substantial gains, on the order of ten percent relative, and the largest EER gain is on the vocal-effort conditions. That is suggestive, because I think low-vocal-effort speech in particular suffers from the lossy coding, though one should be careful drawing conclusions from fairly small differences. I should also point out, as shown in the paper, that the relative improvement in DCF is somewhat lower than the roughly ten percent in EER, but you still get gains.


Okay, now more numbers. Our MLLR system benefits much more, and here we have two different contrast dimensions. For the MLLR system the acoustic modeling always uses telephone-band speech, so we don't have to retrain anything — the old telephone background data can still serve for the background model. But there are two choices. First, the audio that you process before the final downsampling step: with 16 kHz audio you get the benefit of not having the lossy coding and of doing the noise filtering at the full bandwidth. Second, the ASR hypotheses: you can take them from the first recognition stage, which comes from the 8 kHz models, or from the second stage, which comes from the 16 kHz models and which, as we saw, does better. We have both of these here, and of course the second one is consistently better — sometimes by a very small margin, depending on the data condition — because the second stage produces more accurate hypotheses. Overall we see very substantial gains, of about twenty percent.

This slide has a lot of numbers; it is a two-dimensional layout, with the audio condition on one axis and the ASR condition on the other. Roughly two thirds of the gain comes from the switch from 8 kHz to 16 kHz audio while still using the original recognition hypotheses, and then when you improve the ASR as well you get the remaining third. This is pretty consistent across the conditions.

Okay, quickly through the remaining results. The word n-gram system operates at a much worse operating point to begin with, and the relative gains here are much smaller. I would speculate that this is because it depends only on the recognized words, so only the ASR improvement can help it; qualitatively, though, it shows the same trends.

So, to summarize: with the recent change in how the evaluation data is released, the question was how to make use of the full bandwidth now available to us, and this applies both to cepstral systems and to systems that instead use ASR.


Studying this on two data sets, SRE-08 and SRE-10, we can draw a few conclusions. There are substantial gains to be had from moving to the 16 kHz audio. Avoiding the lossy encoding is a big plus — probably the biggest plus for the cepstral system — but you also get a small additional gain by doing the noise filtering at the full bandwidth. The ASR-based systems get a significant boost from using better ASR; for that you need to find a matched ASR system, and we were quite successful using a meeting recognizer that had been trained for the NIST RT evaluations on far-field data.

And note what we have not changed: we have not touched the analysis bandwidth of the speaker-recognition acoustic models at all — that is, we are still using telephone-band front ends for the cepstral system and for the speaker modeling throughout.

There is quite a bit of future work. Obviously the three systems shown here are only part of our full submission, so some open questions remain. The next step would be to combine them all — we haven't done that yet — and the question is how much of the gain survives the combination. What will require quite a bit of work is to also redo the prosodic systems, which in our experience are very nicely complementary, on the wideband data. And then there are the questions of whether we can do better by retraining our acoustic models on wideband data, or alternatively whether we can come up with some clever ways of using the wideband data, such as bandwidth-extension methods, or simply modeling the bandwidth mismatch as one dimension of variability.




Q: (inaudible)

A: Well, then you would have to use the telephone data, that is... Well, first of all, it was a bigger data set, and we also felt that the larger number of speakers made the results more reliable.

Q: (inaudible question about spectral shaping when downsampling)

A: We did look at the spectral shaping when we downsampled; I think after the initial choice we didn't change it. There might be a substantially higher local optimum than the settings we used from the beginning — something like that would be worth trying.