Hello, I will be talking about the Lombard effect in the UT-Scope database, which captures large-vocabulary content. In particular, I will discuss how the Lombard effect impacts speech production parameters in continuous speech and how it affects ASR.

The presentation has three parts. In the first part I will introduce the UT-Scope database and present the results of the analyses of the speech production parameters in its read-speech content. In the second part I will propose a modified version of the RASTA filter, which is very popular in ASR, and also a combination of this modified RASTA with a normalization we proposed at ICASSP 2009, called QCN. Finally, I will present an evaluation of these methods side by side with other cepstral normalizations.

So first, what is the Lombard effect? It refers to the phenomenon where speakers in noisy conditions try to maintain intelligible communication: they increase their vocal effort and adjust a number of other things so that people can still understand them. The effect shows up in a number of parameters, such as vocal effort and pitch, formant frequency locations, spectral slope, and other variations. This affects ASR because the acoustic models are typically trained on neutral speech, so these variations in the speech parameters introduce a mismatch between the acoustic models and the incoming features.

Previous studies that looked at the Lombard effect in the context of ASR usually focused on small-vocabulary tasks. So a contribution of this study is that we look at how the Lombard effect affects large-vocabulary ASR, using continuous, large-vocabulary speech.

So first I would like to introduce the UT-Scope database. The database contains speech under cognitive and physical stress, emotion, and the Lombard effect; here we will be looking only at the Lombard portion of the data. It contains fifty-eight subjects; of those, the native speakers of US English, both female and male, are the ones used in this study, so that we isolate the Lombard effect from the effects of foreign accent.

In the database, each subject produced neutral speech and speech in simulated noisy conditions. In the Lombard case, the subjects were exposed to noise played through headphones, which lets us still collect relatively clean, high-SNR speech on the microphone channel.

Three types of noise were used to induce the Lombard effect. The first is car noise, recorded while driving on a highway at sixty-five miles per hour. We also have large-crowd noise and pink noise. The noise was presented to the subjects at several levels: in the case of the car and large-crowd noise at seventy, eighty, and ninety dB SPL; in the case of pink noise it was lowered to sixty-five to eighty-five dB, because the subjects complained that the noise was disturbing them at the original levels.

The speech was recorded in a sound booth, which also ensures a high SNR, with three microphone channels, including a throat microphone and a close-talk microphone. In this study we use the close-talk microphone channel, because it provides a high SNR and, unlike the throat microphone, it is more broadband.

As for the content of the sessions: for each speaker, in the neutral condition, with no noise in the headphones, one hundred read sentences were produced. In the noisy conditions, the subjects read a smaller set of sentences per noise scenario, at three levels of noise. They also read digit strings, and there was a spontaneous-speech part where they described the content of a picture.

For this study we are using only the read sentences, for several reasons. We do not use the digit strings because digit recognition is a very constrained task that would not really benefit from language modeling, so we leave the digit strings out. We also do not use the spontaneous speech, because it was quite difficult to make the subjects speak naturally: the speech tends to be abrupt, with laughing and long pauses, which would be hard to deal with at this stage of the research, so we are not using that part for now.

In the speech production analysis part, we first analyze the SNR. We do this because it relates to the vocal intensity: since the surrounding background noise in the recording can be considered approximately constant across a session, changes in vocal intensity are directly reflected in changes in the SNR. This way we do not actually need to know the absolute sound level, or how the recorded signal level relates to the intensity, because we cannot account for the microphone gain settings during the recording; using the SNR avoids that problem.
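The talk does not spell out how the SNR was estimated, so as a minimal illustration only, a simple energy-based estimate on a clean close-talk recording might look like the sketch below; the frame length, hop, and the quietest/loudest-10%-of-frames heuristic are assumptions made here for illustration.

```python
import numpy as np

def estimate_snr(signal, frame_len=400, hop=200):
    """Rough energy-based SNR estimate (illustrative only).

    signal: 1-D numpy array of samples. Frames the signal, treats the
    quietest frames as 'noise' and the loudest frames as 'speech', and
    returns the ratio of their mean powers in dB.
    """
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len, hop)]
    power = np.sort([np.mean(np.asarray(f, dtype=float) ** 2) for f in frames])
    n = len(power)
    noise_p = np.mean(power[: max(1, n // 10)])    # quietest 10% of frames
    speech_p = np.mean(power[-max(1, n // 10):])   # loudest 10% of frames
    return 10.0 * np.log10(speech_p / max(noise_p, 1e-12))
```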

Besides the SNR, we also analyze the fundamental frequency, formant frequencies, and durations, and then we look at the cepstral distributions, which are a bit further from direct speech production parameters but are important for the ASR part later. We used standard speech analysis tools to extract these parameters.

The first figure here shows the SNR. The solid line is for neutral speech, where no noise was played; you can see that in this case the mean SNR is the lowest compared to all the other conditions. The figure shows just the curves for highway noise, presented at seventy, eighty, and ninety dB. We can see that with an increasing level of noise the SNR increases, which basically means that the subjects increased their vocal intensity. This is intuitive, and it has been reported by many previous studies of the Lombard effect.

If we look at something called the Lombard function, it is basically the relation between the noise level and the speech intensity, both in dB. In our case, if we fit regression lines, we observe slopes of roughly 0.2 to 0.3 dB per dB. For pink noise the subjects reacted somewhat erratically, while for highway and crowd noise the behavior was more consistent, and a slope around 0.3 dB per dB is fairly typical of what has been seen in previous studies.

Next is the fundamental frequency. I am not showing the distributions this time; rather, since we have three levels of noise, that gives us a chance to evaluate the correlation between the level of noise the subjects are exposed to and the changes in the mean F0. In the table there are two rows, one for females and one for males; the first column is the slope of the regression line, the second is the correlation coefficient, and the third is the mean squared error. You can see that, especially for highway and crowd noise, the correlation coefficient is really high, very close to one. That is partly because we use just the mean values over all the recordings at a given noise level, but you can also see that the mean squared errors are very low. So there is a very strong linear relationship between the presentation level and the mean F0. In some previous studies this relationship was reported on a mel or semitone scale, but here, for our data, it appears linear directly in hertz.
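To make the quantities in the table concrete, the slope, correlation coefficient, and mean squared error can be obtained with an ordinary least-squares fit as sketched below; the presentation levels and mean-F0 values here are made-up placeholders, not the UT-Scope measurements.

```python
import numpy as np

# Hypothetical presentation levels (dB) and per-level mean F0 values (Hz);
# the real numbers come from the UT-Scope data and are not reproduced here.
levels = np.array([70.0, 80.0, 90.0])
mean_f0 = np.array([210.0, 228.0, 247.0])

slope, intercept = np.polyfit(levels, mean_f0, 1)            # regression line
r = np.corrcoef(levels, mean_f0)[0, 1]                       # correlation coefficient
mse = np.mean((slope * levels + intercept - mean_f0) ** 2)   # mean squared error

print(f"slope = {slope:.2f} Hz/dB, r = {r:.3f}, MSE = {mse:.3f}")
```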

When we look at the formants, we are looking at the vowels in the F1/F2 space. The solid line refers to neutral speech, and the other ones are for highway noise at seventy to ninety dB. We estimated the phone boundaries using forced alignment, so the segmentation is not perfectly accurate, but it should be consistent, because all the recordings are processed the same way; so if some trend appears, it is really happening in the data. The error bars are the standard deviation intervals. You can see there is a quite consistent shift in the vowel formant space with the level of noise.

Next we look at vowel duration; again we use forced alignment to estimate the vowel boundaries. Some previous studies reported time compression or expansion for different phone classes. We see something similar: for some vowels there is a slight reduction with increasing noise level, but mostly the vowels tend to be prolonged. Unfortunately, given the amount of data here, the confidence intervals are quite wide, so even though we can see what look like consistent trends, the changes are not statistically significant and we cannot make any definite conclusions about this.

Finally, we look at the cepstral distributions. This gives us an idea of how the acoustic models will be affected and what kind of mismatch we can expect. In the figure, the solid line is for the TIMIT training data that we later use for training the acoustic models; the other lines are for the UT-Scope conditions. You can see there is a mismatch: if you look at c0, which roughly represents the local energy, and c1, which reflects the spectral slope, there are big differences in the distributions. So we can expect this will affect the ASR in a negative way.
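The figure itself comes from the authors' own tooling, but as a rough sketch of how such a comparison could be produced, one can pool cepstral frames from each corpus and compare the per-coefficient statistics or histograms; librosa is used here only as one possible front-end, and the file lists are placeholders.

```python
import numpy as np
import librosa  # one possible MFCC extractor; any cepstral front-end would do

def pooled_mfcc(wav_paths, n_mfcc=13):
    """Stack MFCC frames from a list of wav files into one (frames x coeffs) array."""
    feats = []
    for path in wav_paths:
        y, sr = librosa.load(path, sr=None)
        feats.append(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T)
    return np.vstack(feats)

# Placeholder file lists -- the real comparison used TIMIT vs. UT-Scope sessions.
timit = pooled_mfcc(["timit_example.wav"])
utscope = pooled_mfcc(["utscope_example.wav"])

for c in (0, 1):  # c0 ~ local energy, c1 ~ spectral slope
    print(f"c{c}: TIMIT mean/std = {timit[:, c].mean():.1f}/{timit[:, c].std():.1f}, "
          f"UT-Scope mean/std = {utscope[:, c].mean():.1f}/{utscope[:, c].std():.1f}")
```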

So now I would like to move on and describe the Lombard-effect compensation method we are proposing. Let us start with RASTA, which is a very popular normalization method. It is used either on the log filter-bank energies or in the cepstral domain, which is basically the same thing: it is a band-pass filtering process that suppresses slowly varying signal components and rapidly varying cepstral components, both of which are believed to be unrelated to speech. RASTA has been shown to increase robustness to noise and channel mismatch, and also to some inter-speaker variation.
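For reference, a commonly cited form of the RASTA temporal filter applies an IIR band-pass along time to each cepstral or log-energy trajectory; the pole value differs between implementations (0.94 and 0.98 are both common), so treat the coefficients below as a representative sketch rather than the speaker's exact filter, and note that the warm-up handling some implementations use for the first few frames is omitted.

```python
import numpy as np
from scipy.signal import lfilter

def rasta_filter(cepstra, pole=0.94):
    """Apply a classic RASTA band-pass filter along the time axis.

    cepstra: array of shape (num_frames, num_coeffs).
    The numerator is a 5-tap ramp (a smoothed derivative); the single
    pole turns it into a band-pass with a low cut-off around 1 Hz.
    """
    b = np.array([0.2, 0.1, 0.0, -0.1, -0.2])  # FIR ramp numerator
    a = np.array([1.0, -pole])                 # one-pole IIR denominator
    return lfilter(b, a, cepstra, axis=0)
```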

One of the slight drawbacks of the original RASTA filter is that, because we want a band-pass response, the filter ends up being of a relatively high order, and it also introduces transient distortions in the time domain: if there are abrupt changes in the underlying signal, it takes some time for the filter output to settle down. So we tried to modify RASTA and improve this a little bit.

The idea is that the band-pass filter can be replaced by two separate blocks. The first block is a mean normalization, which helps us get rid of the DC component and some of the slowly varying components; how much also depends on the length of the normalization segment or window, but the DC component will definitely be gone, and for our purposes that is fine. The second block is a low-pass filter that suppresses the fast changes in the signal. This way the low-pass filter can be of a very low order and can have a nice smooth response, as I will show on the next slide.
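A minimal sketch of this two-block idea is given below, assuming a generic low-order Butterworth low-pass in place of the speaker's actual filter design, which is not specified in the talk; the cut-off frequency and filter order are illustrative assumptions.

```python
import numpy as np
from scipy.signal import butter, lfilter

def mean_norm_plus_lowpass(cepstra, frame_rate=100.0, cutoff_hz=12.0, order=2):
    """Replace the RASTA band-pass by (1) mean subtraction and (2) a
    low-order low-pass along time, applied per cepstral coefficient.

    cepstra: (num_frames, num_coeffs); frame_rate in frames per second.
    """
    # Block 1: mean normalization removes DC and the slowest components.
    normalized = cepstra - cepstra.mean(axis=0, keepdims=True)
    # Block 2: low-order low-pass suppresses the fast-varying components.
    b, a = butter(order, cutoff_hz / (frame_rate / 2.0), btype="low")
    return lfilter(b, a, normalized, axis=0)
```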

This scheme also gives us the chance to replace the DC separation, that is, the mean normalization, by some more sophisticated distribution normalization that does not necessarily align the features by their means, for example a quantile-based normalization.

so

you to in this figure we can see the original or a band pass filter

as a solid line

and also the newly proposed filter that the dashed fine

but just what pass

so you you see it kind of uh or eliminates the residual

cycle

and the height of frequencies that we can see "'em" original rasta

And here is an example. The first panel shows an original c0 contour from an utterance. The bottom panel shows the original RASTA applied to the c0 track: you can see some very strong transients at certain places, indicated by the dashed line. The other curve is what we get when we combine the quantile normalization, QCN, with the newly proposed low-pass filter: the transient effects are gone, which is what we would like.

So now we combine this newly proposed low-pass filter with our compensation method called QCN, quantile-based cepstral normalization.

QCN is somewhat similar to cepstral mean and variance normalization, but we observed that if you have a noisy signal, or Lombard-effect speech, the skewness of the cepstral distributions tends to change. If the distributions have different skewness, then aligning them by their means may not align them very well: the dynamic ranges differ, and the bulk of the samples, say ninety percent of them, can still end up poorly aligned. So instead, we pick a low and a high quantile estimated from the histograms, say the five-percent and ninety-five-percent quantiles; we then normalize the cepstral samples using these quantile bounds and the interval between them, in place of the mean and variance. We found, and it has been shown in previous studies, that this helps a lot, especially under the Lombard effect and in noise. So we propose combining this, instead of CMN, with the new low-pass filter.
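A minimal sketch of this combination is given below, using the 5%/95% quantiles mentioned above; the exact scaling convention and the low-pass design are assumptions for illustration rather than the published formulation.

```python
import numpy as np
from scipy.signal import butter, lfilter

def qcn_lowpass(cepstra, q_low=0.05, q_high=0.95,
                frame_rate=100.0, cutoff_hz=12.0, order=2):
    """Quantile-based normalization followed by a low-order low-pass.

    Each cepstral dimension is shifted and scaled so that its q_low/q_high
    sample quantiles are aligned across utterances, instead of aligning
    means and variances as in CMVN.
    """
    lo = np.quantile(cepstra, q_low, axis=0)
    hi = np.quantile(cepstra, q_high, axis=0)
    centered = cepstra - (lo + hi) / 2.0             # center of the quantile interval
    scaled = centered / np.maximum(hi - lo, 1e-8)    # scale by the quantile range
    b, a = butter(order, cutoff_hz / (frame_rate / 2.0), btype="low")
    return lfilter(b, a, scaled, axis=0)
```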

Finally, I will present the evaluation. The system used triphone HMM acoustic models with Gaussian mixture output densities, and the models were trained on clean TIMIT. A language modeling toolkit was used to build the language model. Because there is a channel and microphone mismatch between TIMIT and our data, we chose several sessions and used them for acoustic model adaptation with MLLR and MAP; these adaptation sessions were of course excluded from the evaluation.

In the evaluation sets we have both neutral and Lombard speech, first as clean, high-SNR signals, and then we also mix those recordings with the car noise, to see how robust the methods are to both the Lombard effect and the noise.

Regarding the baseline performance on the neutral test set: the MFCC and PLP front-ends gave similar word error rates, so in the following we report results for just one of them. Also, we did not use language modeling in the rest of the experiments, because we wanted to see how the acoustic models themselves are affected by the Lombard effect and the noise, and we did not want a really strong language model to mask the benefits of the individual normalizations.

This is just the baseline evaluation with the cepstral mean and variance normalization system. You can see that for neutral speech there is some baseline performance, and for each noise type, as we increase the noise level presented in the headphones, the Lombard effect gets stronger and the word error rate grows. Note that the recordings are clean in all of these cases, with high SNR.

Then we compare all the normalization methods: cepstral mean normalization, cepstral variance normalization, RASTA filtering, cepstral gain normalization, and histogram equalization, for which we took the distributions of the TIMIT training data as the reference; and we compare these with QCN and with QCN combined with the new low-pass filter. These are the results.
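Since histogram equalization toward a reference corpus may be less familiar, here is a minimal per-coefficient sketch of the idea, assuming pooled TIMIT frames as the reference; the quantile-mapping implementation below is a generic one, not necessarily the exact variant used in the talk.

```python
import numpy as np

def histogram_equalize(cepstra, reference, n_points=100):
    """Map each cepstral dimension of `cepstra` so that its empirical
    distribution matches that of `reference` (e.g. pooled TIMIT frames).

    Both inputs have shape (num_frames, num_coeffs).
    """
    probs = np.linspace(0.005, 0.995, n_points)
    out = np.empty_like(cepstra, dtype=float)
    for c in range(cepstra.shape[1]):
        src_q = np.quantile(cepstra[:, c], probs)    # source quantile function
        ref_q = np.quantile(reference[:, c], probs)  # reference quantile function
        # CDF of the value under the source, then inverse CDF under the reference
        out[:, c] = np.interp(np.interp(cepstra[:, c], src_q, probs), probs, ref_q)
    return out
```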

The table on the left-hand side shows the overall results across all conditions on the clean recordings, that is, the neutral and Lombard sessions with no noise mixed in, at high SNR. You can see that RASTA actually does not work very well here; it is still better than using no normalization at all, but the other methods do much better. Among the best performing normalizations here are QCN, cepstral gain normalization, and histogram equalization.

The numbers after QCN indicate the quantile setting used: for example, QCN4 uses the four-percent and ninety-six-percent quantiles. For different tasks and databases it actually helps to tune this choice of quantiles.

On the right-hand side we picked the best performing normalizations, together with the baseline, and compared them on the noisy recordings, where the car noise was mixed in. You can see that the ranking of the normalizations unfortunately changes almost completely; there is no single normalization that works best everywhere, which is somewhat disappointing. But what is nice to see is that if you use the newly proposed low-pass RASTA filter, it consistently improves the performance of the QCN normalization, both on the clean recordings and on the noisy recordings.

We have now submitted a paper to Interspeech in which we show that using QCN together with the new low-pass RASTA filter consistently outperforms standard RASTA for both PLP and MFCC, even when used in other front-end configurations. It seems quite promising, and it is very simple. That is basically it; the conclusions just summarize what I have already discussed, so I am not going to go through them. Thank you.

We have time for just one quick question while the next speaker sets up.
