Okay, so this talk: you will notice that some of it overlaps with a previous presentation, namely Bill's talk, and I'll try to point out the differences. I'll start with the motivation, then describe what the data is and how we prepared it. We actually use more than just a cepstral system, so I'll explain what the systems are; then I'll describe the results, some open questions, and future work.
So, as we all know, the SRE used to be just telephone speech; that was all there was. However, as our keynote presentation yesterday described, recent evaluations added interview speech recorded with microphones. And up to and including the 2010 SRE, NIST was still distributing that data as if it were telephone speech: at an 8 kHz sampling rate, with μ-law coding, which is a lossy coding scheme that works well for telephone speech. So one of the things we are looking at is the effect of those two factors when you remove them, that is, when you relax the constraints on how the data is encoded. And this is the part where there is overlap with Bill's talk from Tuesday.
One difference I should point out: we did not modify the system front ends. The acoustic modeling stays the same; we always use a telephone-band front end, and we still see some interesting differences. That is the part that differs from what Bill studied. The second variable we look at is how much better the systems get if you use a better ASR system. Some of our systems use speech recognition, and of course the quality of the speech recognition is partly a function of the quality of the audio; that is the second variable we study.
The history of this work is that during the 2010 SRE development cycle, which was of course based on the SRE08 data, we noticed that it really matters how the recording of the interviews is handled. The interview audio had been distributed using only the 8 kHz μ-law coding, and the lossy compression was a problem for our systems, at least for some of them. So we started playing around with the effects of different audio encodings.
And then we got lucky, because another project, independent of SRE, was going on at SRI at the time, which basically gave us access to the fullband original recordings of a portion of the Mixer data that was the basis for the SRE08 interviews. So we created a wideband version of that part of the SRE08 interview data, which pointed to some interesting results that we reported at a workshop afterwards. But those results were on a limited data set; it was not the complete data set, because not all of the microphone data was available to us in full bandwidth. So we set this aside, and then last year the complete SRE08 interview and microphone data, including the phone calls recorded over microphone, was released at 16 kHz. At that point we decided we could finally look at this properly: with the complete eval set we could get definitive results.
So we have two data sets. The first is the SRE08 data. We repartitioned the original SRE08 data: one portion we use for training purposes, such as adding to the background data and the intersession variability training data, and we held out a set of 48 female and 34 male speakers for development testing. We then formed all possible trials on that held-out data. Remember that the data is classified into short and long conversations; we kept those two conditions, and the long conversations were actually truncated to match the corresponding evaluation condition. These are the numbers of trials that resulted from that strategy; again, this was the development set that we used while developing the systems.
The second data set is the SRE10 data, in its released version, using the extended trial set, so a very large number of trials. Here we have wideband versions of both the phone calls recorded over microphone and the interview data. We only look at the conditions that involve microphones in both training and test, a total of five conditions. In this presentation I will focus on the EER results; the paper also has the DCF results, and while the numbers differ, they do not differ qualitatively, so I will stick to one set of numbers.
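Since the comparisons below are all in terms of EER, here is a minimal sketch, not the evaluation tooling actually used, of how an equal error rate can be computed from target and impostor scores:

```python
import numpy as np

def eer(target_scores, impostor_scores):
    """Equal error rate: the point where miss rate equals false-alarm rate."""
    scores = np.concatenate([target_scores, impostor_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(impostor_scores))])
    order = np.argsort(scores)          # sweep thresholds from low to high
    labels = labels[order]
    # Misses: targets scored at or below threshold; false alarms: impostors above.
    miss = np.cumsum(labels) / labels.sum()
    fa = 1.0 - np.cumsum(1 - labels) / (1 - labels).sum()
    idx = np.argmin(np.abs(miss - fa))  # closest crossing point
    return (miss[idx] + fa[idx]) / 2

# Hypothetical scores: higher means more target-like.
tgt = np.array([2.0, 1.5, 0.9, 1.8])
imp = np.array([0.1, 0.5, 1.1, 0.2, 0.3, 0.4])
print(round(eer(tgt, imp), 3))
```

With finite score lists the miss and false-alarm curves rarely cross exactly, so the sketch averages the two rates at the nearest crossing.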
Okay, so here is how we prepared the data. The first condition is the baseline condition: the data exactly as it was delivered to us, at 8 kHz with μ-law coding. Then we made ourselves a version of the data where we took the 16 kHz data and downsampled it to 8 kHz, but without the μ-law coding, to avoid the lossy compression.
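As an illustration of why the μ-law constraint matters: μ-law is an 8-bit companding scheme (G.711 uses μ = 255), so decoding does not recover the original samples exactly. A minimal sketch of the continuous companding formula with 8-bit quantization:

```python
import numpy as np

MU = 255.0  # G.711 mu-law parameter

def mulaw_encode(x):
    """Compress samples in [-1, 1] logarithmically and quantize to 8 bits."""
    y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    return np.round((y + 1) / 2 * 255).astype(np.uint8)

def mulaw_decode(code):
    """Invert the companding; the quantization error remains."""
    y = code.astype(np.float64) / 255 * 2 - 1
    return np.sign(y) * ((1 + MU) ** np.abs(y) - 1) / MU

x = np.linspace(-1, 1, 101)
err = np.max(np.abs(mulaw_decode(mulaw_encode(x)) - x))
print(err)  # nonzero: the coding is lossy
```

The companding gives fine resolution near zero and coarse resolution at large amplitudes, which is exactly the kind of distortion a lossless distribution format avoids.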
To build that condition we used sox for the downsampling. One thing we noticed about sox is that it can do a lot of things; in particular it provides several different downsampling algorithms, and you should try them to see how they interact with your actual task. But one thing you have to be very careful about with sox is that versions are not backward compatible: if you take the latest version of sox, it will not give the same results as the one you might have used five years ago. So be very careful with keeping older versions around; to make sure our results stayed consistent, we had to use an older version of sox. There were some things that changed between the versions we tried, so I am just warning you about sox, maybe a little less harshly: I am not saying that you should not use it at all, but you should use it with great care. And then the third condition is the 16 kHz data, again without the μ-law coding.
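The downsampling step itself can be sketched without sox; as noted above, the exact resampler (and the tool version) affects the result. Here is a plain windowed-sinc anti-alias filter plus decimation, one reasonable choice rather than the method the talk's pipeline used:

```python
import numpy as np

def downsample_2x(x, numtaps=101):
    """Halve the sample rate: windowed-sinc anti-alias low-pass, then decimate."""
    n = np.arange(numtaps) - (numtaps - 1) / 2
    h = 0.5 * np.sinc(0.5 * n)      # ideal low-pass at fs/4 (the new Nyquist)
    h *= np.hamming(numtaps)        # window to make the filter finite
    h /= h.sum()                    # unity gain at DC
    filtered = np.convolve(x, h, mode="same")
    return filtered[::2]            # keep every second sample

fs = 16000
t = np.arange(fs) / fs
x = 0.5 * np.sin(2 * np.pi * 440 * t)  # 440 Hz tone, well below the new 4 kHz Nyquist
y = downsample_2x(x)
print(len(y))  # 8000
```

Different filter lengths, windows, and cutoffs all give slightly different audio, which is exactly why pinning the resampler version matters for reproducibility.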
Just a little bit more detail: we had a portion of the original recordings available to us in FLAC encoding, which is lossless, at 44 kHz. We downsampled those to 16 kHz, and we used the original segmentation tables to extract the segments for the development data, with some spot checking to make sure we ended up with a data set that exactly matches the original 8 kHz version.
Now, in our system, the first thing we do with all the microphone data is apply a Wiener filter. This noise filter was not developed by us; it goes back to earlier robust-recognition evaluations and work by ICSI and OGI, among others. Then we use a speech activity detection method that we developed some time ago and that seems to do reasonably well; it combines an HMM-based speech activity detector with the segmentation provided with the data. But the important thing is that we did not want to introduce segmentation as a confounding variable in our comparison. So we took the original segmentations, which were derived from the 8 kHz μ-law data, and we kept that same segmentation fixed across all the different audio conditions. That way, if one condition does better, it is not because it got a better segmentation.
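The fixed-segmentation point can be made concrete: keep the segment boundaries in seconds and convert them to sample indices per condition, so every audio condition is scored on exactly the same time spans. A sketch with a hypothetical, invented table format (not NIST's):

```python
# Hypothetical segment table: (start_sec, end_sec) spans of detected speech.
segments = [(0.50, 2.25), (3.10, 7.80)]

def to_samples(segments, rate):
    """Convert second-based spans to sample-index spans at a given sample rate."""
    return [(round(s * rate), round(e * rate)) for s, e in segments]

# The same time spans are applied to the 8 kHz and 16 kHz conditions,
# so segmentation cannot explain any difference between them.
seg_8k = to_samples(segments, 8000)
seg_16k = to_samples(segments, 16000)
dur_8k = sum(e - s for s, e in seg_8k) / 8000
dur_16k = sum(e - s for s, e in seg_16k) / 16000
print(dur_8k == dur_16k)  # True: identical speech duration in both conditions
```

Deriving the spans once, from the baseline data, and only converting units per condition is what keeps the comparison controlled.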
Okay, so that is the audio processing that forms the basis for all systems. Some systems additionally use ASR, as I will describe later, so we have two recognizers. The first, our baseline recognizer, is a conversational telephone speech recognizer that has been used in the last two NIST ASR evaluations; that is why it is based on telephone data only. It has two stages: the second stage takes the hypotheses from the first one for unsupervised adaptation. We measured its word error rate on some SRE06 microphone data that we had transcribed ourselves; it is below thirty percent.
But of course, since we now have the wideband version of this data, we have the opportunity to improve the recognition. We have a different system with a very similar structure in terms of the algorithms for acoustic modeling and the type of language models, so it is quite comparable to the baseline system, but it was trained on meeting data, that is, on wideband data, and furthermore the meeting data includes far-field microphone recordings. That is important because the majority of the speech in the interview condition also comes from far-field microphones. So we expected this to be a reasonable match to the interview data. And when we ran it on the interview data from SRE10, we found that it output 21 percent more word tokens than the old recognizer; and because our recognizer tends to delete words when it has a poor acoustic match, that is a pretty good indication that the new recognizer is a substantially better match to this data.
We did not have any transcribed SRE10 interview data ourselves, so instead we compared ASR accuracy on meeting data of similar character: far-field meeting data from one of the NIST RT evaluation sets. We found that the original CTS recognizer had a very high word error rate. The first stage of our meeting recognizer, which importantly still uses 8 kHz models (the system performs a kind of cross-adaptation between different kinds of acoustic models, and the first stage uses narrowband models trained on meeting data), already had much better accuracy, just over forty percent; and the second stage, with 16 kHz models and unsupervised adaptation, brought the error rate down further still. So clearly a big improvement in speech recognition accuracy, and probably consistent with the observation that we get more word tokens.
Okay, now to the systems. There were three systems overall; they come from the larger combination of systems that was used in our official SRE 2010 submission. The first system is our main cepstral system. It uses telephone-band analysis with twenty cepstral coefficients and 1K Gaussians, and we did not even bother to retrain the speaker and channel subspaces for this; we take those from the original evaluation system. It is trained on the background data and performs t-norm. So it is a pretty run-of-the-mill system, not using i-vectors, but as of 2010 it was pretty standard state of the art.
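The t-norm step mentioned above normalizes each trial score using statistics obtained by scoring the same test segment against a cohort of impostor models. A minimal sketch, with invented cohort scores:

```python
import numpy as np

def t_norm(raw_score, cohort_scores):
    """Test-normalize a trial score: shift and scale by the mean and standard
    deviation of the scores the same test segment gets against cohort models."""
    mu = np.mean(cohort_scores)
    sigma = np.std(cohort_scores)
    return (raw_score - mu) / sigma

cohort = np.array([0.2, -0.1, 0.4, 0.0, 0.1])  # hypothetical impostor-cohort scores
print(round(t_norm(1.3, cohort), 3))
```

The normalization makes scores from different test segments comparable, so a single decision threshold can be used across trials.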
Okay, the two systems that use ASR. The first one is our MLLR system. It uses the adaptation transforms of the recognizer as features and performs feature normalization, namely rank normalization. There is a total of sixteen transforms, which come from crossing eight transform classes with the two genders: we have male-specific reference models and female-specific reference models, and both are always applied to all the data, male and female, so you get sixteen different transforms. That gives on the order of 25,000 features, which are rank-normalized and then used in an SVM. I forgot to put on the slide that we also perform NAP for session variability compensation.
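To make the MLLR-feature idea concrete: each transform is an affine map that moves reference-model means toward a speaker's data, and the stacked entries of the sixteen transforms become the feature vector. A heavily simplified, hypothetical sketch that estimates one affine transform by least squares (real MLLR maximizes likelihood under the HMM, which this does not do):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical reference mean vectors (one row per Gaussian) and the
# speaker-adapted means we want to reach: Y ≈ X @ A.T + b.
dim, n_gauss = 3, 50
X = rng.normal(size=(n_gauss, dim))
A_true = np.eye(dim) + 0.1 * rng.normal(size=(dim, dim))
b_true = rng.normal(size=dim)
Y = X @ A_true.T + b_true

# Least-squares estimate of the extended transform W = [A.T; b].
X_ext = np.hstack([X, np.ones((n_gauss, 1))])  # append 1 for the bias term
W, *_ = np.linalg.lstsq(X_ext, Y, rcond=None)  # shape (dim + 1, dim)
features = W.T.flatten()                       # transform entries -> feature vector
print(features.shape)  # (12,) = dim * (dim + 1) entries for this one transform
```

In the real system the feature vector concatenates all sixteen such transforms, which is how the dimensionality reaches tens of thousands.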
Then there is the word n-gram system. It sounds very simple, but it gives some useful complementary information. It consists of relative frequency features collected over the top thousand unigrams, bigrams, and trigrams, selected from the background data, and again modeled with an SVM.
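A minimal sketch of that feature extraction: pick a top-N vocabulary of unigrams, bigrams, and trigrams from background text, then represent each conversation side by the relative frequencies of those entries (the SVM step is omitted, and N is tiny here instead of 1000):

```python
from collections import Counter
from itertools import chain

def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def all_ngrams(words):
    return chain(ngrams(words, 1), ngrams(words, 2), ngrams(words, 3))

# Hypothetical background text selects the top-N vocabulary.
background = "i mean you know i mean it is you know".split()
top_n = [g for g, _ in Counter(all_ngrams(background)).most_common(5)]

def rel_freq_features(words, vocab):
    """Relative frequency of each vocabulary n-gram in one conversation side."""
    counts = Counter(all_ngrams(words))
    total = sum(counts.values()) or 1
    return [counts[g] / total for g in vocab]

print(rel_freq_features("you know i mean you know".split(), top_n))
```

The ASR transcript supplies the words, which is why this system's performance depends on recognition quality at all.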
Okay, here is the first interesting comparison. We ran the cepstral system on the SRE08 data, in the short and the long data conditions, under the three different audio conditions. You can see clearly that the largest gain, about twelve percent relative, comes from dropping the lossy μ-law coding, and that there is a small additional gain from switching from 8 to 16 kHz sampling. Now, you might think: the GMM front end operates at 8 kHz, so what could possibly improve by switching to 16 kHz? The answer is that the noise filtering happens at the full bandwidth: the spectral subtraction works better when you operate at 16 kHz and only downsample afterwards. This was an interesting result for us, because it requires fairly minimal changes to the system; the GMMs themselves do not change at all.
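The full-band filtering argument can be illustrated with plain magnitude-domain spectral subtraction, a simplification of the Wiener filtering actually used: estimate the noise spectrum from noise-only frames, subtract it per frame, floor the result, and resynthesize, all before any downsampling:

```python
import numpy as np

def spectral_subtract(x, frame=256, noise_frames=10):
    """Magnitude-domain spectral subtraction with a spectral floor.
    The noise spectrum is estimated from the first frames, assumed speech-free."""
    n = len(x) // frame
    spec = np.fft.rfft(x[:n * frame].reshape(n, frame), axis=1)
    mag, phase = np.abs(spec), np.angle(spec)
    noise_mag = mag[:noise_frames].mean(axis=0)          # average noise magnitude
    clean_mag = np.maximum(mag - noise_mag, 0.05 * mag)  # subtract, keep a floor
    return np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame, axis=1).ravel()

rng = np.random.default_rng(0)
fs, frame = 16000, 256
noise = 0.1 * rng.normal(size=2 * fs)
tone = np.concatenate([np.zeros(10 * frame),  # leading noise-only stretch
                       np.sin(2 * np.pi * 300 * np.arange(2 * fs - 10 * frame) / fs)])
y = spectral_subtract(tone + noise)
err = np.var(y - tone[:len(y)])
print(err < np.var(noise))  # enhancement leaves less residual than the injected noise
```

Done at 16 kHz, the noise estimate covers the full spectrum; only then is the enhanced signal reduced to 8 kHz for the front end.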
Now we run the same system on the SRE10 data. There are a lot of numbers here, so to summarize: you get pretty substantial gains, on the order of ten percent relative, and the largest EER reduction is for the low vocal effort condition. That is very suggestive, because I think low vocal effort speech is especially affected by the lossy coding; but be careful drawing conclusions, because those subsets are very small. I should also point out, as shown in the paper, that the relative improvements in DCF are somewhat lower than the roughly ten percent we see in EER, but the gains are still there.
Okay, now more numbers. The MLLR system benefits much more, and here we have two different contrast conditions. The acoustic modeling always uses telephone speech, so we do not have to retrain anything; the old telephone background data can still be used for the background model. However, the audio that you process before the final downsampling step is the 16 kHz audio, so you get the benefit of not having the lossy coding and of doing the noise filtering at the full bandwidth. And then you have two choices for the ASR hypotheses: you can use the first recognition stage, which comes from the 8 kHz models, or you can use the second stage, which comes from the 16 kHz models and which, as we saw, does substantially better. We have both of these here, and of course the second one is consistently better, though sometimes by a very small margin, especially on the smaller data conditions; the second stage simply gives more accurate hypotheses. Overall we see very substantial gains, on the order of twenty percent.
This slide has a lot of numbers, but it is really a two-dimensional breakdown: on one axis you have the audio condition, on the other the ASR condition. You see that roughly two thirds of the gain comes from the switch from 8 kHz to 16 kHz audio while still using the old recognizer, and then when you add the better ASR you get roughly the remaining third on top. And this is pretty consistent across conditions.
Okay, now quickly through the remaining results. The word n-gram system operates at a much worse operating point to begin with, and the relative gains are much smaller. I could only speculate as to why, so let me just remark on it. And the prosodic system, which I will come back to in a second, overall behaves the same way.
So, to conclude. Given the recent changes in the evaluation data, the question was how to make use of the full-bandwidth audio that is now available to us, and this applies both to cepstral systems and to systems that use ASR. Studying this on two data sets, SRE08 and SRE10, we can draw a few conclusions. There are substantial gains to be had. Lossless encoding is a big plus, probably the biggest plus for the cepstral system, but you also get a small additional gain by doing the noise filtering at the full bandwidth. The ASR-based systems get a significant gain from using better ASR, and of course you need to find a matched ASR system; we were quite successful using a meeting recognizer that was trained for the NIST RT evaluations, including far-field data. And note what we have not changed: we have not changed the analysis bandwidth of the acoustic models at all. We are still using telephone-band front ends, both for the cepstral system and for ASR.
As for future work, there is quite a bit. Obviously, the three systems should be combined; we have not done that yet, and it is an open question how much of the gain survives the combination. Something that will require quite a bit of work is to also redo the prosodic system, which is usually very nicely complementary, on the wideband data. And then there are the questions of whether we can do better by retraining our acoustic models on wideband data, or alternatively whether we can come up with clever ways to use the wideband data with the existing models, such as bandwidth extension methods, or simply modeling the bandwidth mismatch as one dimension of variability.
Q: [inaudible]

A: Well, then you have to use the telephone data; that is the constraint.
Q: [inaudible]

A: Well, first of all, it was a bigger set, and we also wanted a large number of speakers, as the results depend on that.

Q: [inaudible]

A: We did look at the spectral shaping when we downsampled; I think beyond that we did not change anything.