0:00:17 Hello, my name is [inaudible], and I will be talking about the Lombard effect in the UT-Scope database, which captures large-vocabulary content: how the Lombard effect impacts speech parameters in continuous speech, and how it affects ASR.
0:00:39 The presentation will have three parts. In the first part I will introduce the UT-Scope database and present the results of our analysis of the speech parameters in the large-vocabulary content. In the second part I will propose a modified version of the RASTA filter, which is very popular in ASR, and also a combination of this modified RASTA filter with a normalization we proposed at ICASSP 2009, called QCN. Finally, I will present an evaluation of these approaches side by side with other cepstral normalizations.
0:01:22 So first, what is the Lombard effect? It refers to the phenomenon where people speaking in noisy conditions try to maintain intelligible communication: they increase their vocal effort and do a lot of other things so that the people around them can understand them. The Lombard effect is reflected in a number of parameters, such as vocal effort, fundamental frequency, formant locations, and spectral slope, and there are other variations we can observe. This affects automatic speech recognition because the acoustic models in ASR are typically trained on neutral speech, so all these variations in the speech parameters cause a mismatch between the acoustic models and the incoming features.
0:02:12 The previous studies that looked at the Lombard effect in the context of ASR usually focused on small-vocabulary tasks, and this is the contribution of this study: we look at how the Lombard effect affects large-vocabulary ASR, on continuous read speech rather than isolated words or digits.
0:02:39 So first I would like to introduce the UT-Scope database. The database covers speech under cognitive and physical stress, emotion, and the Lombard effect; here we will be looking only at the Lombard portion of the data. It contains 58 subjects, a number of whom are native speakers of US English, both females and males, and we are using just the native speakers in this study, so that we eliminate the effects of foreign accent.
0:03:13 In the database, each subject produced neutral speech and speech in simulated noisy conditions. In the noisy case, the subjects were exposed to noise played through headphones, so we could still collect relatively clean, high-SNR speech on the close-talk microphone channel.
0:03:38 We used three types of noise to induce the Lombard effect. The first was car noise, recorded while driving on a highway at 65 miles per hour; then we have large-crowd noise and pink noise. We presented the noise to the subjects at several levels: in the case of the car and large-crowd noise at 70, 80, and 90 dB SPL. In the case of the pink noise it was offset lower, 65 to 85 dB, because the subjects complained that the noise was disturbing them at the original levels.
0:04:13 The speech was recorded in a sound booth, which also assures a high SNR, with three microphone channels: a throat microphone, a close-talk microphone, and a far-field microphone. In this study we are looking at the close-talk microphone, because it provides a high SNR, unlike, let's say, the throat microphone, whose signal is spectrally limited.
0:04:35 As for the content of the sessions: for each speaker, in the neutral condition, where they didn't hear any noise, they produced one hundred TIMIT-like sentences that they read. In the noisy conditions they read a set of sentences in each noise scenario, at each of the three levels of noise. They also read digit strings, and there was also spontaneous speech, where they would describe the content of a picture. For this study we are using just the TIMIT-like sentences, for several reasons. We don't use the digit strings because they present a very constrained recognition task, and since we will be using language modeling in the evaluations, the digit strings would not benefit from it. And we don't use the spontaneous speech because it was difficult to make the subjects talk naturally: the speech would be kind of abrupt, they would be laughing, there would be long pauses, so it would be hard to deal with this type of speech at this stage of the research, and we are just not using it for now.
0:05:39 So, in the speech production analysis part, we first analyze SNR. We do this because it relates to the vocal intensity: since the surrounding background noise in the sound booth can be considered approximately constant, changes in the vocal intensity are directly reflected in changes in the SNR. This way we don't need to know the actual sound pressure level, or how the amplitude of the signal relates to the intensity, which would otherwise be a problem because we cannot account for the microphone gain used during the recording.
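The SNR-as-intensity-proxy idea above can be illustrated with a rough utterance-level estimator. This is only a sketch under assumed settings (energy sorting with an assumed silence fraction), not the estimator actually used in the study:

```python
import math

def estimate_snr_db(frame_energies, silence_fraction=0.2):
    """Rough utterance-level SNR from per-frame energies.

    Assumes the quietest `silence_fraction` of frames represent the
    (constant) booth noise floor and the loudest ones represent speech.
    With a fixed recording gain, a rise in this SNR tracks a rise in
    vocal intensity, which is the quantity of interest here.
    """
    xs = sorted(frame_energies)
    k = max(1, int(len(xs) * silence_fraction))
    noise = sum(xs[:k]) / k      # mean energy of the quietest frames
    speech = sum(xs[-k:]) / k    # mean energy of the loudest frames
    return 10.0 * math.log10(speech / noise)
```

For example, an utterance whose frames split into energies 1.0 (silence) and 100.0 (speech) yields an estimate of 20 dB.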
0:06:21 We also analyze fundamental frequency, vowel formant frequencies, and durations, and then we look at cepstral distributions, which is a little removed from direct speech production parameters but is important for the ASR later. We used standard tools to extract these parameters.
0:06:43 The first figure here shows SNR. The continuous line is for neutral speech, where there was no noise presented; you can see that in this case the mean SNR is lowest compared to all the other conditions. This figure shows the results for highway noise, presented at 70, 80, and 90 dB. You can see that with an increasing level of noise the SNR is increasing, which basically means that the vocal intensity of the subjects increased. This is quite intuitive, and it was reported by many previous studies of the Lombard effect. If we look at something like a Lombard function, it is basically the relation between the noise level and the speech intensity.
0:07:33 Both would be expressed in dB. In our case, if we fit regression lines, we observe slopes of roughly 0.2 to 0.3 dB per dB. For pink noise the subjects reacted somewhat randomly, while for highway and crowd noise the reaction is more consistent, and the roughly 0.3 dB-per-dB slope we see there is typical of what was seen in previous studies.
0:08:02 Next is the fundamental frequency. I'm not showing the distributions this time; rather, since we have three levels of noise, that gives us a chance to look at the correlation between the level of the noise the subjects are exposed to and the changes in the mean F0. You can see in the table there are rows for females and for males, showing first the slope of the regression line, then the correlation coefficient, and the mean square error. You can see that, especially for highway and crowd noise, the correlation coefficient is really high, very close to one. That is partly because we use just the mean values over all the recordings at a given level of noise, but you can also see that the mean square errors are very low. So there is a very strong linear relationship between the presentation level and the mean F0. In previous studies you could see similar linear relationships, but there F0 would be on a log scale, in semitones, whereas here it is actually on a linear scale.
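The slopes, correlation coefficients, and mean square errors in the table come from ordinary least-squares fits of the mean parameter values against the noise presentation level. A minimal sketch of that computation; the values passed in below are purely illustrative, not data from UT-Scope:

```python
import numpy as np

def lombard_regression(noise_levels_db, mean_values):
    """Fit a line relating noise presentation level to a mean speech
    parameter (e.g. SNR or F0); report slope, correlation, and MSE."""
    x = np.asarray(noise_levels_db, dtype=float)
    y = np.asarray(mean_values, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    corr = float(np.corrcoef(x, y)[0, 1])
    mse = float(np.mean((slope * x + intercept - y) ** 2))
    return slope, corr, mse

# Illustrative: a mean SNR rising 0.25 dB per dB of presentation level.
slope, corr, mse = lombard_regression([70, 80, 90], [18.0, 20.5, 23.0])
```

With perfectly linear inputs like these, the slope is 0.25 dB/dB, the correlation is 1, and the MSE is essentially zero, mirroring the near-unity correlations reported in the table.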
0:09:20 When we look at the formants, we are looking at the F1/F2 space. The continuous line refers to the neutral speech, and the other ones are for highway noise at 70 to 90 dB. We estimated the phone boundaries using forced alignment, so it is not perfectly accurate; there may be some errors, but they should be consistent, because all the recordings are processed the same way, so any bias would affect all conditions similarly. The error bars are the standard deviation intervals. You can see there is a very consistent shift in the vowel formant space with the level of noise.
0:10:06 Next we looked at vowel durations; again we used forced alignment to estimate the boundaries of the vowels. Some previous studies reported time compression or expansion for different vowel classes. We see something similar: for some vowels there is a slight reduction with an increasing level of noise, but for most of them there seems to be prolongation. Unfortunately, given the amount of data here, the confidence intervals are quite wide, so while we do see some consistent trends, the changes are not statistically significant and we cannot make any definitive conclusions about this.
0:10:48 And finally, we are looking at the cepstral distributions. That gives us an idea of how the acoustic models will be affected, and what kind of mismatch we can expect. Here the solid line is for the TIMIT training data that we later use for training the models; the other ones are for the UT-Scope conditions. You can see there is a mismatch: if you look at c0, which represents the overall energy, and c1, which reflects the spectral slope, there are big differences in the distributions, and we can expect this will affect the ASR in a negative way.
0:11:31 So now I would like to move on and describe the Lombard effect compensation we are proposing. We start from RASTA, which is a very popular normalization method. It is used either on the log filter-bank energies, or it can be used in the cepstral domain; it is basically the same thing. It is a band-pass filtering process: it suppresses the slowly varying cepstral components and also the fast-varying ones, which are believed to be unrelated to speech. RASTA has been shown to increase robustness to noise, channel mismatch, and to some extent inter-speaker variation.
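For reference, classic RASTA is an IIR band-pass filter run along the time trajectory of each log filter-bank or cepstral coefficient. A minimal sketch using the commonly cited coefficients (a 5-point FIR ramp numerator and a single pole at 0.98); this is the textbook causal form, not necessarily the exact implementation used in this work:

```python
def rasta_filter(track):
    """Classic RASTA band-pass filtering of one coefficient track:

        y[n] = 0.98*y[n-1] + 0.2*x[n] + 0.1*x[n-1] - 0.1*x[n-3] - 0.2*x[n-4]

    The numerator taps sum to zero, so constant (DC) input decays away;
    the pole at 0.98 suppresses the fastest-varying components.
    """
    b = [0.2, 0.1, 0.0, -0.1, -0.2]  # FIR ramp (high-pass part)
    y = []
    for n in range(len(track)):
        acc = 0.98 * y[n - 1] if n >= 1 else 0.0
        for k, bk in enumerate(b):
            if n - k >= 0:
                acc += bk * track[n - k]
        y.append(acc)
    return y
```

On a constant input the output decays geometrically toward zero: that is the DC suppression, but it is also the source of the slow transient settling that motivates the modification discussed here.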
0:12:19 One slight drawback of the original RASTA filter is the way it suppresses the near-DC components: it introduces transient distortions in the time domain, because when there are abrupt changes in the cepstral signal, it takes some time for the filter output to settle down. So we tried to modify RASTA to improve this a little bit.
0:12:52 The idea is that the original filter can be replaced by two separate blocks. The first would be a simple mean normalization, which helps us get rid of the DC and much of the slowly varying components; how much also depends on the length of the window, or of the segment, over which the mean is estimated, but the DC component will definitely be gone, and maybe that is just fine. The second block would then be a low-pass filter that suppresses the fast changes in the signal. This way the low-pass filter can be designed freely and can have a nice, smooth response, as I will show on the next slide.
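A minimal sketch of the two-block scheme just described: utterance mean subtraction for the DC and slow components, followed by a separate low-pass smoother for the fast transients. The first-order exponential smoother and its constant are stand-in assumptions; the actual proposed low-pass filter is a smoother, purpose-designed response shown on the slide:

```python
def modified_rasta(track, alpha=0.9):
    """Two-stage RASTA replacement: (1) mean normalization, (2) low-pass.

    `alpha` sets the smoothing strength of the stand-in low-pass stage
    (an assumed value, not taken from the paper).
    """
    mean = sum(track) / len(track)
    centered = [x - mean for x in track]   # stage 1: DC/slow removal
    y, state = [], centered[0]
    for x in centered:                     # stage 2: smooth fast changes
        state = alpha * state + (1.0 - alpha) * x
        y.append(state)
    return y
```

Because the DC is removed by subtraction rather than by an IIR high-pass branch, an abrupt level change no longer produces the long settling transient of the original filter.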
0:13:37 This scheme also gives us a chance to replace the DC suppression with some more sophisticated distribution normalization that does not necessarily normalize the features to their means, like the quantile-based normalization I will describe in a moment. In this figure you can see the original RASTA band-pass filter as a solid line, and the newly proposed low-pass filter as the dashed line. You can see that it eliminates the residual at the higher frequencies that is visible in the original RASTA response.
0:14:17you you uh the
0:14:19first figure that to prosper and
0:14:22would be
0:14:23or all C zero from an if she's
0:14:25some kind of example
0:14:26and the
0:14:27but the total bill would be
0:14:29a the rest of was to apply to the caesar or C zero track
0:14:33see there some kind of very strong transients
0:14:36at some stages
0:14:37and size by the dashed line and
0:14:39but are one is when we combine some uh
0:14:43minimization in
0:14:45you C and is the newly proposed a pass filter
0:14:48you see also of the transient effects are gone
0:14:52which will be like nice
0:14:54 So now we can combine this newly proposed low-pass filter with the compensation method called QCN, applied to mel-based cepstra. QCN is similar to cepstral mean and variance normalization, but we observed that if you have a noisy signal, or Lombard-effect speech, the skewness of the cepstral distributions tends to change. If distributions have different skewness, then aligning them by their means may not be very effective, because the dynamic ranges may differ; maybe, let's say, ninety percent of the samples will not be well aligned. So what we do instead: we pick some low and high quantiles estimated from the histograms, let's say the five percent and ninety-five percent quantiles, so we know this interval bounds a given portion of the samples, and we align these intervals instead of the mean and variance. We found, and have shown in previous studies, that this helps a lot, especially for the Lombard effect and noisy speech. So here we propose combining this, instead of cepstral mean normalization, with the low-pass filter.
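The quantile-alignment idea can be sketched as follows: instead of the mean and variance, each cepstral dimension is shifted by the midpoint of a low/high quantile interval and scaled by the interval width. The 5%/95% defaults and the simple sorted-index quantile estimator are illustrative only; as noted later in the talk, the quantile choice is tuned per task:

```python
def qcn(track, q_low=0.05, q_high=0.95):
    """Quantile-based normalization sketch for one cepstral dimension.

    Aligns the [q_low, q_high] sample-quantile interval across
    utterances: subtract the interval midpoint, divide by its width.
    """
    xs = sorted(track)
    n = len(xs)
    lo = xs[min(n - 1, max(0, round(q_low * (n - 1))))]
    hi = xs[min(n - 1, max(0, round(q_high * (n - 1))))]
    mid = 0.5 * (lo + hi)
    width = (hi - lo) or 1.0   # guard against degenerate tracks
    return [(x - mid) / width for x in track]
```

Unlike mean/variance normalization, heavy distribution tails outside the chosen quantile interval do not pull the alignment around, which is the point when skewness changes under the Lombard effect.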
0:16:10 So finally, I will present the evaluations. The system was a triphone HMM system with Gaussian mixture states, and we trained the models on clean TIMIT; we used a standard toolkit for the language modeling. Because there is a channel and microphone mismatch between TIMIT and our data, we chose several sessions and used them for acoustic model adaptation, using MLLR and MAP, and of course we excluded these adaptation sessions from the evaluation. In the evaluations we had the neutral and Lombard speech, where the actual signals were clean and high-SNR, and then we also mixed those recordings with the car noise, to see how robust the methods are to the combined Lombard effect and noise.
0:17:10 The baseline performance on the neutral test set was similar in word error rate for the MFCC and PLP features. We didn't use a strong language model in these experiments, because we wanted to see how the acoustic models are affected by the Lombard effect and the noise; a really strong language model would mask the benefits of the individual normalizations.
0:17:44 So this is just the baseline evaluation. For neutral speech you see some baseline performance, and for each noise type, as we increase the noise level in the headphones, the Lombard effect gets stronger and the ASR performance deteriorates. Note that the recordings are clean in all cases here, with a high SNR; the degradation comes purely from the Lombard effect.
0:18:14 Then we compared several normalization methods: cepstral mean normalization, cepstral mean and variance normalization, cepstral gain normalization, RASTA filtering, and histogram equalization, for which we took the TIMIT training data distributions as the reference; and then we compared these to QCN and to QCN with the low-pass filter. Here is the set of results. The table on the left-hand side shows the overall results across all conditions with the clean recordings, that is, neutral and Lombard speech with no noise added and high SNR. You can see that RASTA actually doesn't work very well here; it is still better than using nothing at all, but the other methods do much better in this setting. The best performing normalizations here are QCN with the low-pass filter, cepstral gain normalization, and histogram equalization. The numbers behind QCN show the quantile settings: in QCN9 we used the 9 and 91 percent quantiles, and in QCN4 the 4 and 96 percent quantiles. For different tasks and databases it actually helps to tune this choice of quantiles.
0:19:38 On the right side, we picked the best performing normalizations and the baseline and compared them on the noisy recordings, where the car noise was mixed in. The result, as you see, is that the ranking of the normalizations unfortunately changes almost completely, so there is not one normalization that works best everywhere, which is kind of disappointing. But what is nice to see is that the newly proposed low-pass RASTA filter consistently improves the performance of the QCN normalization, both on the clean recordings and on the noisy recordings.
0:20:18 We have now submitted a paper to Interspeech where we show that using QCN with the low-pass RASTA filter consistently outperforms standard RASTA, for both PLP and MFCC features, even when used in more elaborate front-end schemes. It seems quite promising, and it is very simple. So that is basically it; the conclusions just summarize what I have said, so I am not going to go through them. Thank you.
0:20:52 (Session chair) We have time for just one quick question while the next speaker sets up.