My name is [unclear], and I will be talking about Lombard effect in the UT-Scope database, which captures large-vocabulary content. I will be talking about how Lombard effect affects speech parameters in continuous speech and how it affects ASR.

The presentation will have three parts. In the first part, I will introduce the UT-Scope database and present the results of the analysis of the speech parameters in the large-vocabulary content. In the second part, I will propose a modified version of RASTA filtering, which is very popular in ASR, and also a combination of this modified RASTA with a normalization we proposed at ICASSP 2009, called QCN. Finally, I will present an evaluation of these methods side by side with other cepstral normalizations.

So first, what is Lombard effect? It refers to the phenomenon where speakers are in noisy conditions and try to maintain intelligible communication, so they increase their vocal effort and do a lot of other things so that people can understand them. Lombard effect is reflected in a number of parameters: vocal effort, pitch, formant frequencies shifting in their locations, spectral slope changes, and there are other variations.

This also affects ASR, because the acoustic models we are usually using are typically trained on neutral speech, so all of these variations in speech parameters cause a mismatch between the acoustic models and the incoming features. The previous studies that looked at Lombard effect in the context of ASR usually focused on small-vocabulary tasks, and this is the contribution of this study: we look at how Lombard effect affects large-vocabulary ASR. It is TIMIT-like speech, so, I mean, not that large a vocabulary. So first I would like to introduce the UT-Scope database.
The database covers speech under cognitive and physical stress, emotion, and Lombard effect. We will be looking just at the Lombard portion of the database. It contains fifty-eight subjects. Of those, thirty-one are native speakers of US English, twenty-five females and six males, and we are using just the native speakers in this study so that we eliminate the effect of foreign accent.

For each subject, the database contains neutral speech and also speech produced in simulated noisy conditions. In that case the subjects are exposed to noise presented through headphones, so we can still collect relatively clean speech with high SNR in the microphone channel.

We used three types of noise to induce Lombard effect: car noise that was recorded while driving on a highway at sixty-five miles per hour, large crowd noise, and pink noise. We presented the noises to the subjects at three levels. In the case of car and large crowd noise it was 70, 80, and 90 dB SPL. In the case of pink noise it was a little lower, 65 to 85, because the subjects complained that the noise was disturbing them at the original levels.

The speech was recorded in a sound booth, which also ensures high SNR, with three microphone channels: throat microphone, close-talk, and far-field. In this study we are looking at the close-talk microphone, because it provides high SNR and, unlike the throat microphone, it is more broadband.

The content of the sessions for each speaker: in the neutral condition, where they did not hear any noise, they would produce a hundred TIMIT-like sentences, which they read. In the noisy conditions, for each scenario they would read some sentences at the three levels of noise. They would also read digit strings, and there was also spontaneous speech, where they would describe the content of a picture. For this study we are
using just the TIMIT-like sentences, for several reasons. I did not use the digit strings because we are targeting large-vocabulary recognition and may eventually use language modeling, so the digit strings would not fit that. And I did not use the spontaneous speech because it was kind of difficult to make the subjects talk naturally: the speech would be kind of abrupt, they would be laughing, or there would be long pauses, so it would be hard to deal with this type of speech at this stage of the research. So we are just not using it at this moment.

In the speech production analysis part, we will be analyzing SNR. We do this because it relates to the vocal intensity: since the surrounding background noise can be considered constant in the sound booth, changes in the vocal intensity are directly reflected in changes in the SNR. This way we do not really need to know the absolute level, or how the amplitude of the recorded signal actually relates to the intensity. That would be a problem, because we kept adjusting the microphone gain during the recording.

We also analyze F0, vowel formant frequencies, and duration, and then we will look at cepstral distributions, which are a little bit far from direct or primary speech production parameters, but they are important for the ASR later. We used WaveSurfer and some other tools to extract these parameters.

The first figure here is SNR. The continuous line is for neutral speech, where there was no noise presented, and you can see that in this case the mean SNR is the lowest compared to all other conditions. This figure is just showing the plots for highway noise, presented at 70, 80, and 90 dB. We can see that with increasing level of noise the SNR is increasing, which basically means that the vocal intensity was increased by the subjects. It is kind of intuitive, and that was
reported by many previous studies on Lombard effect. We also looked at something called the Lombard function, which is basically the relation between the noise level and the speech intensity, both in dB. In our case, if we use regression lines, we would be observing slopes between zero and 0.3. For pink noise the subjects reacted kind of randomly, while for highway and crowd noise it was more consistent, around 0.3, which is kind of typical, as seen in previous studies.

The next thing is fundamental frequency. I am not showing the distributions this time. I am rather focusing on the fact that, since we have three levels of noise, it gives us a chance to look at the correlation between the level of the noise the subjects are exposed to and the changes in the mean F0. In the table there are two rows, one for females and one for males, with the slope of the regression line, the correlation coefficient, and the mean square error. You can see that especially for highway and crowd noise the correlation coefficient is really high, very close to one. That is partly because we use just the mean values over all the recordings at that level of noise, but you can also see that the mean square errors are very low. So there is a very strong linear relationship between the presentation level and the mean F0 in hertz. You could see some previous studies that would find a linear relationship when F0 was on a log scale, in semitones, but here for us it is actually on a linear scale.

When we are looking at the formants, we are looking at the F1-F2 vowel space. The continuous line refers to the neutral speech and the other ones are for highway noise, 70 to 90. We estimated the phone
boundaries using forced alignment, so it is not perfectly accurate, but the error should be kind of consistent across the recordings that are processed, so it gives us some idea of what is happening there. The error bars are actually the standard deviation intervals. You can see there is a very consistent shift in the vowel space with the level of noise.

When looking at the vowel duration, again we used forced alignment to estimate the boundaries of the vowels. Some previous studies reported that there would be some time contraction or expansion for different phone classes. We saw something similar: for some vowels there would be a slight reduction with increasing level of noise, but most of the vowels tend to be prolonged. Unfortunately, given the amount of data here, the confidence intervals are quite wide, so although we see kind of consistent trends here, the changes are not statistically significant and we cannot make any definite conclusions from this analysis.

Finally, we are looking at cepstral distributions, which gives us an idea of how the acoustic models will be affected and what kind of mismatch we can expect. The solid line here is for the TIMIT train set that we were using later for training the models, and the other ones are for the UT-Scope conditions. You can see there is a mismatch. If you look at c0, which represents the log energy, or c1, which reflects spectral tilt, there are big differences in the distributions, so we can expect this will affect the ASR in a negative way.

So I would like to move on and describe the modified RASTA filter we are proposing. RASTA is a very popular normalization method. It is used either on log filter-bank energies or in the cepstral domain, which is basically the same thing: it is band-pass filtering.
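For orientation, here is a minimal Python sketch of a band-pass RASTA filter applied to one feature trajectory. The talk gives no coefficients, so the transfer function below is an assumption: it is the commonly cited form H(z) = 0.1 (2 + z^-1 - z^-3 - 2 z^-4) / (1 - 0.98 z^-1), written causally, with the pole value (0.98 here, 0.94 in some implementations) and the 10 ms frame shift both assumed. The names `rasta_filter` and `gain` are illustrative only.

```python
import cmath
import math

# Assumed coefficients (not taken from the talk):
#   H(z) = 0.1 * (2 + z^-1 - z^-3 - 2 z^-4) / (1 - POLE * z^-1)
NUMERATOR = [0.2, 0.1, 0.0, -0.1, -0.2]
POLE = 0.98  # assumed; 0.94 is another frequently used value


def rasta_filter(track, numerator=NUMERATOR, pole=POLE):
    """Band-pass filter one feature trajectory (one value per frame)."""
    out = []
    prev_out = 0.0
    for n in range(len(track)):
        # FIR part: weighted differences of the current and past frames.
        fir = sum(b * track[n - k] for k, b in enumerate(numerator) if n - k >= 0)
        # IIR part: a single pole that makes the filter "remember" the past.
        prev_out = fir + pole * prev_out
        out.append(prev_out)
    return out


def gain(mod_freq_hz, frame_rate_hz=100.0, numerator=NUMERATOR, pole=POLE):
    """Magnitude response at a modulation frequency (assumes 10 ms frame shift)."""
    z_inv = cmath.exp(-2j * math.pi * mod_freq_hz / frame_rate_hz)
    num = sum(b * z_inv ** k for k, b in enumerate(numerator))
    return abs(num / (1.0 - pole * z_inv))
```

With these assumed coefficients the numerator sums to zero, so a constant (DC) offset is removed, while modulation frequencies of a few hertz pass almost unchanged. Feeding in a step such as `[0.0] * 10 + [1.0] * 200` shows the long decay caused by the pole: the output jumps at the step and then needs many frames to settle back toward zero.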
RASTA basically suppresses very slowly varying signal components and also fast-varying signal components, which we believe are not related to speech, and it has been shown to increase robustness to noise, to channel mismatch, and also to some variation introduced by the speaker.

But one slight drawback of the original RASTA filter is that it is of a fairly high order, because we want a band-pass characteristic, so it also introduces some transient distortion in the time domain: if there are some abrupt changes in the input signal, it takes some time for the filter to settle down. So we tried to look at that and improve it a little bit.

We realized you can replace the band-pass filter by two separate blocks. The first would be cepstral mean normalization, which helps us get rid of the DC component and much of the slowly varying components. How much, of course, depends on the length of the segment or of the window, but the DC component will definitely be gone, and maybe that is just fine. The second block would be a low-pass filter that suppresses the fast changes in the signal. This way the low-pass filter can be of very low order and can be nice and smooth, as I will show on the next slide.

This scheme also gives us the chance to replace the DC suppression by some more sophisticated distribution normalization that does not necessarily normalize the distributions to their means, like the [unclear] normalization or QCN.

In this figure we can see the original RASTA band-pass filter as the solid line, and the newly proposed filter as the dashed line, which is just low-pass. You can see it eliminates the residual side lobes at the higher frequencies that we can see in the original RASTA.

And here is an example. The first figure, at the top, would be the original c0 from MFCCs,
just as some kind of example. The middle one would be RASTA applied to the c0 track. You can see there are some very strong transients at some stages, emphasized by the dashed line. And the bottom one is when we combine normalization, in this case QCN, with the newly proposed low-pass filter. You can see all of the transient effects are gone, which is nice.

Now we combine this newly proposed low-pass filter with our compensation method called QCN, quantile-based cepstral dynamics normalization. It is kind of similar to cepstral mean and variance normalization, but we observed that if you have a noisy signal, or Lombard effect, the skewness of the distributions tends to change. If you have distributions of different skewness, then aligning them by their means may not be very efficient, because the dynamic ranges, bounding let's say ninety percent of the samples, can be badly aligned. So what we do instead is pick a low and a high quantile, estimated from the histograms, let's say 5 and 95 percent, so we know this interval bounds ninety percent of the samples, and we align these intervals instead of the mean and variance. We found, and showed in previous studies, that it helps a lot, especially under Lombard effect and noise. So we propose combining QCN with the low-pass RASTA.

Finally, I will present the evaluation. The system was a triphone HMM system, with models with thirty-two mixtures, and we were training the models on clean TIMIT. We used the SRI language modeling toolkit for language modeling. Because there is a channel and microphone mismatch between TIMIT and our data, we chose several sessions and used them for acoustic model adaptation, with MLLR and MAP. We did not use these adaptation sessions, of course, in
the evaluation later on. In the evaluations we had the neutral and Lombard speech, both as clean signals with high SNR, and then we also mixed those recordings with the car noise to see how robust the methods would be to both Lombard effect and noise. The baseline performance on the neutral test set with MFCC and [unclear] was about [unclear] percent word error rate, and PLP was similar.

We did not use language modeling after this, because we wanted to see just how the acoustic models are affected by Lombard effect and noise. If we had a really strong language model, it would blur a little bit the benefits of the individual normalizations.

This is just the baseline evaluation with the MFCC CVN system. You see that for neutral speech there is some baseline performance, and for each noise type, as we increase the noise level in the headphones, the Lombard effect gets stronger and the word error rate grows as well. Just a reminder that the recordings are clean, so in all cases here we have high SNR.

Then we are comparing several normalization methods: cepstral mean normalization, variance normalization, cepstral gain normalization, RASTA filtering, Gaussianization, and histogram equalization, where we took the TIMIT train data distributions as the reference point. Then we compare them to QCN and to QCN with the low-pass RASTA.

And these are the results. The table on the left-hand side shows the overall results across all conditions in the clean recordings, so all sets, the neutral ones and the Lombard ones, with no noise added, at high SNR. You can see RASTA actually does not work very well here. It is still better to use it than nothing, but not much better in this case. In any case, the best-performing normalizations here would be QCN, cepstral gain normalization, Gaussianization, and histogram equalization. The numbers behind QCN
just show the setting for the quantiles we use: if it is 9, we use the 9th and the 91st percentile, and in QCN4 we use the 4th and the 96th percentile. For different tasks and databases it actually helps to tune this choice of the quantiles.

On the right side, we just picked the best-performing normalizations and the baseline one and compared them on the noisy recordings, where the car noise was mixed with the data. You can see the order, I mean the ranking of the normalizations, unfortunately completely changes, so I did not find a normalization that would work best everywhere, which is kind of disappointing, but yeah.

What is nice, though, is that if you use the newly proposed low-pass RASTA filter, it consistently improves the performance of the QCN normalization, both on clean recordings and on noisy recordings. We have now submitted a paper to Interspeech where we are showing that using QCN and the new RASTA filter always outperforms RASTA, for PLP or MFCC, even if you use it in [unclear]-based schemes and so on. So it seems kind of promising, and it is very simple. So that's basically it. I could just show the conclusions now, but I am not going to do that.

(Session chair) Time for just one quick question while the other speaker [unclear].
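To make the QCN plus low-pass combination from the talk more concrete, here is a rough Python sketch for a single cepstral trajectory. Everything specific in it is an assumption rather than the speaker's exact method: the normalization is one plausible reading of "align these intervals" (subtract the midpoint of the low and high quantiles, divide by their distance), and the low-pass stage is a plain moving average standing in for the proposed filter, whose coefficients are not given in the talk. The names `quantile`, `qcn`, `low_pass`, and `qcn_lp` are hypothetical.

```python
def quantile(sorted_vals, p):
    """Linear-interpolation quantile of an already sorted list, p in [0, 1]."""
    pos = p * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1.0 - frac) + sorted_vals[hi] * frac


def qcn(track, percent=4.0):
    """Align the [percent, 100 - percent] quantile interval to [-0.5, 0.5].

    Assumed formula: (c - (q_lo + q_hi) / 2) / (q_hi - q_lo).
    """
    vals = sorted(track)
    q_lo = quantile(vals, percent / 100.0)
    q_hi = quantile(vals, 1.0 - percent / 100.0)
    width = q_hi - q_lo
    if width == 0.0:
        return [0.0] * len(track)  # degenerate (constant) trajectory
    center = 0.5 * (q_lo + q_hi)
    return [(c - center) / width for c in track]


def low_pass(track, taps=5):
    """Moving average with edge clamping; a stand-in for the proposed filter."""
    half = taps // 2
    last = len(track) - 1
    out = []
    for n in range(len(track)):
        window = [track[min(max(n + k, 0), last)] for k in range(-half, half + 1)]
        out.append(sum(window) / len(window))
    return out


def qcn_lp(track, percent=4.0, taps=5):
    """Quantile-based normalization followed by low-pass smoothing."""
    return low_pass(qcn(track, percent), taps)
```

In this sketch `percent=4.0` corresponds to the QCN4 setting mentioned above and `percent=9.0` to QCN9. Because the quantile interval rather than the mean is aligned, a trajectory and a shifted, rescaled copy of it (for example `c` and `3 * c + 10`) map to the same normalized values, and a long tail on one side of the distribution has little influence on where the bulk of the samples ends up.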