Hello, my name is [inaudible], and I will be talking about the Lombard effect in the UT-Scope database, which captures large vocabulary content. I will be talking about how the Lombard effect affects speech parameters in continuous speech and how it affects ASR.
The presentation will have three parts. In the first part, I will introduce the UT-Scope database and present the results of the analysis of the speech parameters in the UT-Scope Lombard content. In the second part, I propose a modified version of RASTA filtering, which is very popular in ASR, and I also introduce a combination of this modified RASTA with a normalization we proposed at ICASSP 2009, called QCN. And finally, I will present an evaluation of these side by side with other cepstral normalizations.
So first, what is the Lombard effect? It refers to the phenomenon when speakers talk in noisy conditions and try to maintain intelligible communication. So they increase the vocal effort, and they do a lot of other things so that people can understand them. The Lombard effect is reflected in a number of parameters, like vocal effort and pitch; formant frequencies shift their locations, the spectral slope changes, and there are other variations we can observe. This affects ASR because the acoustic models we are using are typically trained on neutral speech, so all of these variations in speech parameters cause some kind of mismatch between the acoustic models and the incoming features.
The previous studies that looked at the Lombard effect in the context of ASR usually focused on small vocabulary tasks. So, and this is kind of the contribution of this study, we look at how the Lombard effect affects large vocabulary ASR.
a kind of a mental bill talk is because it's
and to make that speech so i mean
and that large vocabulary
but
So first I would like to introduce the UT-Scope database. The database covers speech under cognitive and physical stress, emotion, and Lombard effect. We will be looking just at the Lombard portion of the data. It contains fifty-eight subjects; of those, thirty-one are native speakers of US English, twenty-five females and six males. We are using just the native speakers in this study, so that we eliminate the effect of foreign accent.
The database contains, for each subject, neutral speech and also speech produced in simulated noisy conditions. In that case the subjects are exposed to noise presented through headphones, so you can still collect relatively clean speech, with high SNR, in the microphone channel.
We used three types of noise to induce the Lombard effect. One was car noise that was recorded while driving on a highway at sixty-five miles per hour, and we also have large crowd noise and pink noise. We presented the noises to the subjects at three levels. In the case of car and large crowd noise, these were seventy, eighty, and ninety dB SPL. In the case of pink noise, it was all shifted down to sixty-five to eighty-five, because the subjects kind of complained that the noise was disturbing them at the original levels.
The speech was recorded in a sound booth, also to ensure high SNR, with three microphone channels: throat microphone, close-talk, and far-field. In this study we are looking at the close-talk microphone because it provides high SNR and, unlike the throat microphone, it is more broadband.
So, the content of the sessions. For each speaker, in the neutral condition, where there was no noise, they would produce a hundred TIMIT-like sentences that they read. In the noisy conditions, they would read, for each scenario, some sentences at three levels of noise. They also read digit strings, and there was also spontaneous speech, where they would be describing the content of a picture.
For this study we are using just the TIMIT-like sentences, for several reasons. I don't use the digit strings because we are interested in large vocabulary recognition, and maybe eventually in using language modeling, so the digit strings would just not help with that.
And I don't use the spontaneous speech because it was kind of difficult to make the subjects talk naturally, so the speech would be kind of abrupt, they would be laughing, or there would be long pauses. It would be kind of hard to deal with this type of speech at this stage of the research, so we are just not using it this time.
So in the speech production analysis part, we will be analyzing SNR.
second whoosh
no sure
We do this because it kind of relates to the vocal intensity. Since the surrounding background noise can be considered kind of constant in the sound booth, the changes in the vocal intensity are directly reflected in the changes in the SNR. So we don't really need to know the actual absolute level, or how the amplitude of the signal actually relates to the intensity, because we kind of adjusted the microphone gain during the recording, so that would be a problem.
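The argument about SNR and vocal intensity can be checked with a few lines of arithmetic. This is only a sketch: the power values are made up for illustration and are not taken from the database, and it assumes that the microphone gain scales the speech and the background noise in the channel equally.

```python
import numpy as np

def snr_db(speech_power, noise_power):
    """SNR in dB from mean speech power and mean noise power (linear units)."""
    return 10.0 * np.log10(speech_power / noise_power)

# Hypothetical numbers for illustration only.
noise_power = 1e-4                                # assumed constant background noise
neutral_power = 1e-2                              # speech power in the neutral condition
lombard_power = neutral_power * 10 ** (6.0 / 10)  # same speaker, 6 dB more intense

# With constant noise, the SNR change equals the intensity change in dB.
delta_snr = snr_db(lombard_power, noise_power) - snr_db(neutral_power, noise_power)

# An unknown microphone gain scales speech and noise alike, so SNR is unchanged.
gain = 3.7
snr_with_gain = snr_db(gain * lombard_power, gain * noise_power)
snr_without_gain = snr_db(lombard_power, noise_power)
```

Here `delta_snr` comes out as the 6 dB intensity increase, and the two SNR values with and without gain agree, which is why the absolute level does not need to be known.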
We also analyze F0, vowel formant frequencies, and duration, and then we will look at cepstral distributions, which are a little bit far from direct or primary speech production parameters, but they are important for the ASR later. We used WaveSurfer and some other tools to extract these parameters.
So the first figure here is SNR. The continuous line is for neutral speech, where no noise was presented. You can see that in this case the mean SNR is always the lowest compared to all the other conditions. This figure is just showing the plots for highway noise, presented at seventy, eighty, and ninety dB. We can see that with increasing level of noise the SNR is increasing, which basically means that the vocal intensity was increased by the subjects. It is kind of intuitive, and it was reported by many previous studies on the Lombard effect.
So if we look at something called the Lombard function, it should basically be the relation between the noise level and the speech intensity, with both given in dB. In our case, if we fit regression lines, we would be observing slopes from approximately zero to zero point three. For pink noise, the subjects reacted kind of randomly; for car and crowd noise, the trend is more consistent. And the slope of zero point three was kind of typical, as seen in previous studies.
The next thing is fundamental frequency. I'm not showing the distributions this time. I'm rather focusing on this: since we have three levels of noise, that gives us a kind of chance to look at the correlation between the level of the noise that the subjects are listening to and the changes in the mean F0. So you can see in the table there are two rows, one for females and one for males: first the slope of the regression line, then the correlation coefficient and the mean square error. You can see that, especially for highway and crowd noise, the correlation coefficient is really high, very close to one. Well, it is partly because we use just the mean values of all the recordings at that level of noise, but you can also see that the mean square errors are very low. So there is a very strong linear relationship between the presentation level and the mean F0, actually F0 in hertz. You could see some previous studies that would report a linear relationship where F0 would be on a log scale, in semitones, but here, for us, it is actually on a linear scale.
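The three quantities mentioned for the table (slope of the regression line, correlation coefficient, mean square error) can be computed as in the sketch below. The noise levels match the seventy, eighty, and ninety dB mentioned in the talk, but the mean F0 values are made up purely for illustration, and `fit_line` is a hypothetical helper name.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares line through (x, y); returns slope, intercept, r, and MSE."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)   # degree-1 polynomial fit
    predicted = slope * x + intercept
    r = np.corrcoef(x, y)[0, 1]              # Pearson correlation coefficient
    mse = np.mean((y - predicted) ** 2)      # mean square error of the fit
    return slope, intercept, r, mse

# Noise presentation levels in dB and hypothetical mean F0 values in Hz.
noise_levels_db = [70.0, 80.0, 90.0]
mean_f0_hz = [210.0, 222.0, 233.0]           # illustrative values, not measured data

slope, intercept, r, mse = fit_line(noise_levels_db, mean_f0_hz)
```

With only three mean values per noise type, a correlation coefficient close to one is easy to obtain, which is the caveat raised in the talk; the mean square error gives a second check on how well the line fits.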
When we are looking at the formants, we are looking at the F1-F2 space of vowels. The continuous line refers to neutral speech, and the other ones are for highway noise, seventy to ninety. We estimated the phone boundaries using forced alignment, so it is not perfectly accurate, but it should be kind of consistent across the recordings that are processed, so it gives some kind of idea of what is happening there. The error bars are actually the standard deviation intervals. You can see there is a very consistent shift in the formant vowel space with the level of noise.
When we are looking at the vowel duration, again we used forced alignment to estimate the boundaries of the vowels. Some previous studies reported that there would be some time contraction or expansion for different phone classes. You see something similar here: for some vowels there would be some slight reduction with increasing level of noise, but mostly the vowels tend to be prolonged. Unfortunately, given the amount of data here, the confidence intervals are quite wide, so although you do see kind of consistent trends here, the changes are not statistically significant, and we cannot make any definite conclusions out of this.
And finally, we are looking at cepstral distributions. This gives us a kind of idea of how the acoustic models would be affected and what kind of mismatch you can expect. Here I am also plotting the TIMIT training database, the solid line, which we were using later for training the models; the other ones are for the UT-Scope conditions. You can see there is a mismatch. If you look at C0, which kind of represents the log energy, and C1, which kind of reflects the spectral tilt, there are big differences in the distributions. So we can expect this will affect the ASR in a negative way.
So I would like to move on and describe the RASTA-style filter we are proposing. RASTA is a very popular normalization method. It is used either on log filter-bank energies or in the cepstral domain, which is basically the same thing. It is band-pass filtering, and it basically suppresses very slowly varying signal components and also fast-varying signal components that we believe are not related to speech. It has been shown to increase robustness to noise, channel mismatch, and also inter-speaker variation.
But one of the slight drawbacks of the original RASTA filter is that it is a very low-order filter, kind of on purpose, and it also introduces some kind of transient distortion in the time domain, because if there are some abrupt changes in the cepstral signal, it takes some time for the filter to settle down. So we tried to, let's say, improve it a little bit.
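The settling behaviour can be reproduced with a small experiment. The sketch below uses the commonly cited form of the RASTA filter, a five-tap differentiator numerator `0.1 * [2, 1, 0, -1, -2]` followed by a single pole; the pole value of 0.98 is an assumption here (0.94 is also common in implementations), and the talk does not say which variant was used.

```python
import numpy as np

def rasta_filter(track, pole=0.98):
    """Causal RASTA-style band-pass: y[n] = sum_k b[k] x[n-k] + pole * y[n-1]."""
    b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])   # commonly cited numerator
    x = np.asarray(track, dtype=float)
    y = np.zeros_like(x)
    for n in range(len(x)):
        acc = 0.0
        for k in range(len(b)):
            if n - k >= 0:
                acc += b[k] * x[n - k]                 # FIR (differentiator) part
        if n >= 1:
            acc += pole * y[n - 1]                     # recursive part, source of the long tail
        y[n] = acc
    return y

# An abrupt change in a cepstral track: 50 frames at 0, then 250 frames at 1.
step = np.concatenate([np.zeros(50), np.ones(250)])
response = rasta_filter(step)

peak = response.max()
after_100_frames = response[154]   # 100 frames after the initial peak region
final_value = response[-1]
```

The numerator coefficients sum to zero, so the constant part of the input is eventually removed, but the pole close to one makes the output decay slowly: in this example a visible residue is still present a hundred frames after the step.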
So you can really replace the RASTA filter by two separate blocks. The first one would be cepstral mean normalization, and that will help us get rid of the DC component and also some of the slowly varying components. How much, of course, depends on the length of the window, of the segment, but the DC component will definitely be gone, and maybe that is just fine. And then the second one would be a low-pass filter that will be suppressing the fast changes in the signal. This way the low-pass filter can be of a very low order and can be kind of nice and smooth, as I will show on the next slide.
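The two-block scheme can be sketched as mean subtraction followed by a short smoothing filter. The filter below is a hypothetical five-tap window chosen only for illustration; the actual low-pass design proposed in the talk is not given in the transcript.

```python
import numpy as np

def cmn_lowpass(track, taps=5):
    """Mean-normalize one cepstral track, then smooth it with a short FIR low-pass."""
    track = np.asarray(track, dtype=float)
    centered = track - track.mean()        # block 1: CMN removes the DC component
    kernel = np.hanning(taps + 2)[1:-1]    # hypothetical window, zero end points dropped
    kernel = kernel / kernel.sum()         # unit gain at DC
    return np.convolve(centered, kernel, mode="same")   # block 2: low-pass

# Toy track: a constant offset plus a component that flips sign every frame.
frames = np.arange(200)
track = 5.0 + 0.5 * (-1.0) ** frames
smoothed = cmn_lowpass(track)
```

In this toy case the offset is removed by the mean subtraction and the frame-to-frame alternation is removed by the smoothing, so the interior of `smoothed` is essentially zero. Because the FIR filter has no feedback, the effect of an abrupt change lasts only as many frames as the filter has taps.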
Also, this kind of scheme opens the chance to replace the DC separation by some more sophisticated distribution normalization that does not necessarily normalize the distributions to their means, like cepstral mean normalization does.
So in this figure we can see the original RASTA band-pass filter as a solid line, and also the newly proposed filter as the dashed line, which is just low-pass. You see it kind of eliminates the residual side lobe at the higher frequencies that we can see in the original RASTA.
And here is an example. The first figure at the top would be the original C0 from an utterance, as some kind of example. The middle one would be RASTA applied to the C0 track. You see there are some very strong transients at some stages, emphasized by the dashed line. And the bottom one is when we combine some normalization, QCN, with the newly proposed low-pass filter. You see all of the transient effects are gone, which is kind of nice.
So now we combine this newly proposed low-pass filter with our compensation method called QCN, quantile-based cepstral dynamics normalization. It is kind of similar to cepstral mean and variance normalization, but we observed that if you have a noisy signal, or if you have Lombard effect, the skewness of the distributions tends to change. If you have distributions that have different skewness, then aligning them by their means may not be very efficient, because the dynamic ranges, let's say the ninety percent of the samples, can be badly aligned. So what we do instead is pick some low and high quantiles and estimate them from the histograms, so let's say five and ninety-five percent. So we know this interval bounds ninety percent of the samples, and we align these intervals instead of the mean and variance. And we found, and have shown in previous studies, that it helps a lot, especially for Lombard effect and noise. So we propose combining this, instead of CMN, with the low-pass RASTA.
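The quantile-based alignment described here can be sketched as follows. This is a reading of the spoken description, not the published definition: centering on the midpoint of the two quantiles and scaling the interval to unit width are assumptions, and `qcn` is a hypothetical function name. It also assumes the two quantiles differ in every dimension, so the division is well defined.

```python
import numpy as np

def qcn(cepstra, q=5.0):
    """Align the [q, 100-q] percentile interval of each cepstral dimension.

    cepstra: array of shape (frames, coefficients).
    After normalization the q-th percentile sits at -0.5 and the
    (100-q)-th percentile at +0.5 in every dimension.
    """
    cepstra = np.asarray(cepstra, dtype=float)
    low = np.percentile(cepstra, q, axis=0)
    high = np.percentile(cepstra, 100.0 - q, axis=0)
    center = 0.5 * (low + high)            # midpoint of the quantile interval
    return (cepstra - center) / (high - low)

# Skewed toy features (exponential), three cepstral dimensions.
rng = np.random.default_rng(0)
features = rng.exponential(size=(500, 3))
normalized = qcn(features, q=5.0)

# The same features after a gain and offset change normalize to the same values.
shifted = qcn(3.0 * features + 7.0, q=5.0)
```

Unlike mean and variance normalization, the alignment here depends only on where the bulk of the samples lies, so a change in skewness moves the mean without moving the aligned interval. The `q` parameter corresponds to the quantile setting discussed later in the talk (for example nine or four percent).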
So finally, I will present the evaluation. The system was a triphone HMM system with thirty-two mixtures, and we were training the models on clean TIMIT. We used the SRI Language Modeling Toolkit for language modeling. And because there is a channel mismatch and a microphone mismatch between TIMIT and our data, we chose several sessions and used them for acoustic model adaptation, so we used MLLR and MAP, and of course we did not use these adaptation sessions in the evaluation later on.
the in the oceans we had the neutral
and and by speech
but also of a clean signals was i and that
and then you'll also makes those recordings is the
a a is the car noise
to see how how the methods will be robust and
and to effect and and
so the base and performance
uh
and you to test set
i C C and
and i to C D and
but was like a person's what are rate and the P
a similar so than we just the other of are are much
We didn't use language modeling after this because we wanted to just see how the acoustic models are affected by Lombard effect and noise, and we didn't want to have a really strong language model that would blur a little bit the benefits of the individual normalizations.
So this is just a baseline evaluation. For this system you see, for neutral speech, some baseline performance, and for each noise type, as we are increasing the noise level in the headphones, the Lombard effect gets stronger and also the word error rate grows. Just a reminder that the recordings are clean in all cases here, with high SNR.
So then we are comparing several normalization methods: cepstral mean normalization, variance normalization, cepstral gain normalization, RASTA filtering, Gaussianization, and histogram equalization, where we took the TIMIT training data distributions as the reference point. And then we compare them to QCN and QCN with the low-pass RASTA.
So the table on the left-hand side shows the overall results across all conditions in clean recordings, that is, a set of the neutral and Lombard ones with no noise mixed in, so high SNR. You see RASTA actually doesn't work very well here. It is still better to use it than nothing, but not by much in this case.
And the best performing normalizations here would be QCN, cepstral gain normalization, Gaussianization, and histogram equalization. The numbers behind QCN just show the setting, the choice of quantiles we use: if it is nine, we used the nine percent and ninety-one percent quantiles, and in QCN4 we used four percent and ninety-six percent. So for different tasks and databases, it actually helps to tune this choice of the quantiles.
On the right side, you see we just picked the best performing normalizations and the baseline one and compared them on the noisy recordings, where the car noise was mixed in. And you see the ranking of the normalizations unfortunately completely changes, so there isn't any normalization that would work best everywhere, which is kind of disappointing. But what is nice is that if you use the newly proposed low-pass RASTA filter, it consistently improves the performance of the QCN normalization, both on clean recordings and on noisy recordings.
And now we have submitted a paper to Interspeech where we are showing that using QCN and the new RASTA filter always outperforms RASTA for PLP or MFCC, even if you use it in [inaudible]-based schemes. So, yeah, it seems kind of promising, and it is very simple. So that's basically it. The conclusions would just be a repetition, so I'm not going to do that.
So, thank you for your attention.
There is time for just one quick question while the other speaker gets ready. Yeah?
huh
i
i
right