Right, thanks. My name is Kornel, and this is joint work with my colleague, who is sitting up there; she is happy to take all of your questions afterwards.

Before we begin, I just want to lay down the definitions that I'm going to be using. This is my first time at this meeting, so I may be saying things very wrong, and I apologise for that in advance. I conceive of all features that you could compute from speech as falling into these four areas: on the left-hand side I consider coarse spectral features, and on the right-hand side fine spectral features; the top two panels contain things that characterise a single frame of speech, whereas the bottom contains things that characterise a trajectory, modelling dependencies across frames. All the features that you're probably familiar with can be placed in this space, and prosodic features tend to be those that either model the fine structure in the spectrum, on the right, or that model long-term dependencies, at the bottom. But in this paper we are going to look only at the so-called instantaneous prosodic features, namely those that characterise a single frame, and in particular at pitch.

Pitch is estimated using a pitch detector, which typically produces a best estimate of pitch; but that estimate is usually so noisy that a pitch detector is typically expected to produce an n-best estimate, and then a dynamic programming approach is used to narrow that down to a single best estimate per frame. I'm going to refer to these two components together as pitch estimation. The best estimate per frame that comes out of this can then be linearly or nonlinearly smoothed, can be normalised based on proximity to some kind of landmark, and then different kinds of features can be extracted from it and modelled. These things at the bottom here are, I assume, what you have been calling high-level feature computation, or high-level features, in this session. I hope I'm not disappointing anyone: we're actually going to be looking at this point, which is as low-level
as it gets; in that sense, we're going to claim that these features are as low-level as MFCCs.

Okay, so if we look at this box a little more closely: typically pitch estimation, or pitch detection, is a two-step process, where the source representation we start from is an FFT. The first step is the computation of what I'm going to be calling a transform domain — and there are lots of alternatives here, one example being the autocorrelation spectrum — and the second step is simply finding the argmax. A lot of effort has gone into this process, and typically the effort is spent only on the first step, because the second step is so elementary that nobody really questions it; most of the work on improving pitch detection has gone into making sure this transform domain is as noise-free and as easy to search as possible. What we're going to claim in this work is that you should just throw away the whole second step, and that you should model the entire transform domain — and that's what this talk is about. There are four parts to the talk: I'll describe what I'm calling the harmonic structure transform, present some experiments and some additional analysis, and I will conclude in three slides.

Okay, so the particular pitch detection algorithm that we're going to look at was proposed by Schroeder in 1968, and it involves producing a new spectrum, the sigma spectrum, where at each frequency we have the sum over all the frequencies in the original FFT that are integer multiples of some candidate fundamental frequency. Very quickly after he proposed this came harmonic compression, which is a distinctly nonlinear operation; I want to demonstrate it over here on the right. Basically, what ends up happening is that the spectrum is conceptually compressed by integer factors and then added. The problem with harmonic compression is that it has led people to look for implementations of this algorithm in exactly this way.
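The sigma spectrum itself is simple to state in code. Here is a minimal sketch; the nearest-bin lookup, the fixed number of harmonics, and all names are my own choices, not Schroeder's formulation:

```python
import numpy as np

def schroeder_sum_spectrum(mag, df, f0_candidates, n_harm=5):
    """Sigma spectrum: for each candidate F0, sum the FFT magnitude found at its
    first n_harm integer multiples (nearest-bin lookup; a reconstruction of the
    idea, not the original algorithm)."""
    sigma = np.zeros(len(f0_candidates))
    for i, f0 in enumerate(f0_candidates):
        for k in range(1, n_harm + 1):
            b = int(round(k * f0 / df))   # nearest FFT bin to the k-th harmonic
            if b < len(mag):
                sigma[i] += mag[b]
    return sigma
```

For a harmonic magnitude spectrum, the argmax of this sigma spectrum over the candidate range recovers the fundamental.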
Implementing it via compression — first compress, then add — turns out to have occupied people for about twenty years of the last century. A much better way to do this is to not do any compression at all, but to comb filter: you just add whatever is at whatever frequency you want, without first having to compress it towards the harmonic or fundamental frequency that you're interested in. When you do this there are of course no compression difficulties — filtering is linear. In this work we're going to be defining all of our filters over the range of three hundred hertz to eight thousand hertz. If you have lots of such comb filters, then you have a filterbank, and in this work we're going to have nominally four hundred filters in this filterbank, ranging from fifty to four hundred fifty hertz, spaced one hertz apart. So far this is in continuous frequency; of course we want discrete-frequency filters, because we have discrete FFTs. There are lots of ways to do this, and I always like citing the work by colleagues that actually influenced me here, though it's probably not the first such work. What we're going to do in this work is a little bit different: we're going to say that each tooth of the comb is triangular, and then we're going to simply Riemann-sample it, such that the discrete comb filter actually ends up looking like this — and as you can see, it doesn't look harmonic at all. So what do you do with this? If you have a set of such discrete comb filters, then they implement a filterbank that has a matrix representation H, and it's very simple to use: you just matrix-multiply it with the FFT that you have. We're also going to take the logarithm of the output of that filterbank, the same way that's done for the mel filterbank.
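A rough sketch of such a discrete comb filterbank — triangular teeth Riemann-sampled onto FFT bins, collected into a matrix H — might look like this; the tooth width, the peak-one teeth, and the exact sampling are my assumptions, not the paper's values:

```python
import numpy as np

def comb_filterbank(f0_lo=50.0, f0_hi=450.0, f0_step=1.0,
                    f_lo=300.0, f_hi=8000.0, n_bins=512, fs=16000.0):
    """One row per candidate F0; triangular teeth at the harmonics that fall
    inside [f_lo, f_hi], sampled at the FFT bin centres."""
    df = fs / (2.0 * n_bins)              # bin spacing of a one-sided FFT
    bin_f = np.arange(n_bins) * df        # centre frequency of each bin
    f0s = np.arange(f0_lo, f0_hi + f0_step, f0_step)
    H = np.zeros((len(f0s), n_bins))
    for i, f0 in enumerate(f0s):
        half = f0 / 2.0                   # assumed tooth half-width
        for k in range(1, int(f_hi / f0) + 1):
            fc = k * f0
            if fc < f_lo or fc > f_hi:    # only harmonics inside the band
                continue
            tooth = np.clip(1.0 - np.abs(bin_f - fc) / half, 0.0, None)
            H[i] = np.maximum(H[i], tooth)  # overlay the triangular teeth
    return f0s, H
```

Applying the filterbank to a frame's magnitude spectrum is then just `np.log(H @ mag + 1e-10)`, mirroring what is done for the mel filterbank.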
Finally, from the energy found at integer multiples of a specific candidate fundamental frequency, we're going to subtract the energy found everywhere else. To do that we form this complementary transform, H-tilde, which I can demonstrate over here: if this is the column vector for a particular comb filter, then we just form its unity complement, and that gives us this here — the corresponding column vector of H-tilde. What this implements, of course, is a variant of the harmonic-to-noise ratio, which is known to correlate with hoarseness or roughness of voice, typically in pathological speech. Normally the harmonic-to-noise ratio is computed only at the fundamental frequency, once that is known; what we're doing is computing it for all possible candidate fundamental frequencies, and then using that vector as a feature vector.

Okay, so the elements of this vector are still correlated, and we decorrelate them in the way that anybody else would: we subtract the global mean, we form a decorrelation matrix, and then, after applying that matrix, we truncate, retaining only those dimensions that have a positive eigenvalue. We're going to call the output of this harmonic structure cepstral coefficients, HSCC, for lack of a better term; it is simply a decorrelation of the logarithm of the output of the filterbank, minus a normalisation term, which is our H-tilde here. We actually explore two different options for the decorrelation, PCA and LDA, which you probably know more about than I do.

I claimed that this is at the level of MFCCs, and I would like to try to convince you here that it's nearly identical from a functional point of view. The mel filterbank can also be implemented as a matrix, and if that's M, you can see that at least the inside here is approximately the same: it's a matrix multiplication followed by a logarithm. The decorrelating transforms are of course different.
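Putting the pieces together, the full HSCC computation as I've described it — log harmonic energy minus log complementary energy, global mean subtraction, PCA decorrelation, truncation to positive-eigenvalue dimensions — can be sketched like this; the epsilon flooring, the complement as 1 − H, and the eigenvalue threshold are assumed numerical details:

```python
import numpy as np

def hscc(frames_mag, H, n_keep=None, eps=1e-10):
    """Sketch of the HSCC pipeline: frames_mag is (n_frames, n_bins) magnitude
    spectra, H is the (n_filters, n_bins) comb filterbank with entries in [0, 1]."""
    H_tilde = 1.0 - H                    # unity complement of each comb filter
    A = frames_mag @ H.T                 # energy at the candidate's harmonics
    B = frames_mag @ H_tilde.T           # energy found everywhere else
    F = np.log(A + eps) - np.log(B + eps)  # log harmonic-to-noise ratio vector
    F = F - F.mean(axis=0)               # subtract the global mean
    C = np.cov(F, rowvar=False)
    w, V = np.linalg.eigh(C)             # PCA decorrelation
    order = np.argsort(w)[::-1]
    w, V = w[order], V[:, order]
    keep = (w > 1e-12) if n_keep is None else slice(0, n_keep)
    return F @ V[:, keep]                # truncated, decorrelated features
```

Swapping the PCA step for an LDA estimated against class labels gives the second variant mentioned above.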
And somewhat importantly — and in our case unfortunately — our decorrelating matrix is data-dependent, whereas the MFCC one is not. But to compare M and H: the columns of M smear energy across frequencies that are related by adjacency, whereas the columns of H, the matrix that we're proposing here, smear energy across frequencies that are related by harmonicity.

I also want to just say that this derives quite directly from our previous work using a representation called fundamental frequency variation, FFV, which models the instantaneous change in fundamental frequency without actually computing the fundamental frequency. What we're doing here in the current work is: we take a frame of speech, we take its FFT, and then we take a bunch of idealised FFTs, which are the comb filters — the columns of our capital H — and we form the dot product of the frame that we're currently looking at with every one of these. The locus of these dot products defines a trajectory, which is a function of an index i that corresponds to candidate fundamental frequency. In contrast, in the FFV work that we've done before, we take two frames: the current frame, the same as here, but also the previous frame, and we dilate the previous frame by logarithmic factors — by a range of them — and then again we take the dot product of the dilated previous frame with the current frame. The locus of those dot products gives us another trajectory, which is also a function of an index i, where i here is the logarithmic dilation factor. So the location of the peak here expresses the fundamental frequency, nominally in hertz, and the location of the peak there expresses the rate of change of fundamental frequency.
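The FFV construction just described can be sketched as follows — dilate the previous frame's spectrum by a factor 2**rho, correlate with the current frame, and read off the peak over rho; the interpolation scheme and the cosine-style normalisation are my own choices, not the original algorithm's:

```python
import numpy as np

def ffv(prev_mag, cur_mag, rhos):
    """For each logarithmic dilation factor rho, resample the previous frame's
    magnitude spectrum dilated by 2**rho and correlate it with the current frame."""
    n = len(prev_mag)
    bins = np.arange(n, dtype=float)
    out = np.empty(len(rhos))
    for i, rho in enumerate(rhos):
        # dilating by 2**rho means reading prev_mag at bins / 2**rho
        src = bins / (2.0 ** rho)
        dil = np.interp(src, bins, prev_mag, left=0.0, right=0.0)
        denom = np.linalg.norm(dil) * np.linalg.norm(cur_mag)
        out[i] = (dil @ cur_mag) / denom if denom > 0 else 0.0
    return out
```

If the fundamental rises from 100 Hz to 110 Hz between frames, the trajectory peaks near rho = log2(1.1), i.e. the peak location reads out the rate of F0 change.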
Okay, so now I'm going to describe the experiments that we did to see whether this makes any sense at all. The data that we used is Wall Street Journal data, mostly coming from the WSJ corpus. We have one hundred and two female speakers and ninety-five male speakers, and we evaluate closed-set classification for each gender separately. We used ten-second trials; we had enough data to allot five minutes of training data per speaker, plus several minutes of development data and three minutes of test data, corresponding to approximately fourteen hundred and eighteen hundred trials of ten seconds apiece. All of the data comes from a single microphone, and it's what we're calling matched multi-session, which means that for the majority of speakers the data in both the training set and the test set is drawn from all sessions that are available for that speaker.

Something that is not in the paper — we did this afterwards, but I thought that you might appreciate it — is that we built a system based just on pitch, extracting pitch using a standard signal processing tool in its default settings. The comparison isn't quite fair, because any current pitch tracker actually uses dynamic programming, so this pitch system is actually using long-term constraints, whereas our system does not, because it treats frames independently. We ignore unvoiced frames and we transform voiced frames into the log domain. What we see is that this system achieves an accuracy of approximately eighteen percent for females and approximately twenty-seven percent for males. (I get the feeling that my microphone is louder at some times than at others — is that true, does it bother anyone? Okay, sorry.)

Okay, so the system that we're proposing here, to explore this idea of modelling the entire transform-domain signal: we don't perform any preemphasis, partly because we're not using frequencies below three hundred hertz — since we throw those away there's no DC component, so we decided not to bother with it. We have seventy-five percent frame
overlap, with thirty-two-millisecond frames, and we use the Hann window instead of the Hamming window that is in ubiquitous use. The number of dimensions and the number of Gaussians in the gender-dependent models I still need to discuss, in a coming slide. We don't use a universal background model, and we don't use any speech activity detection.

In optimising the number of dimensions, what we have done is to create the most laconic model you can invent — a single Gaussian with diagonal covariance. We train our PCA and LDA transforms on the training set, and we select the number of dimensions which maximises accuracy on the development set. We can see here that the PCA transform yields an accuracy of about forty percent for the top components, and an accuracy of about eighty-five percent for the top LDA discriminants, for females — slightly better for males, but that's in the same ballpark. The lighter colours in these two plots represent longer trial durations; we decided not to report sixty-second and thirty-second trials, which is what we started out with, because the numbers were too high and it was difficult to compare them.

So this table summarises the performance of the HSCC system that I just described, once the number of Gaussians has been set to optimise dev-set accuracy — and that number happened to be two hundred fifty-six in our experiments. What we see here is that if you take the representation that a pitch tracker is exposed to and you spend the time looking for the argmax in it, you achieve the eighteen and twenty-seven percent from before; but if you don't bother doing that, and instead you just throw everything that that representation has into a model, then you achieve almost one hundred percent. So the claim here, based on these experiments, is that there is speaker-discriminant information beyond the argmax in these representation vectors.
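Incidentally, the "most laconic model" used for the dimension search above — a single diagonal-covariance Gaussian per class, with trials scored by average frame log-likelihood — can be sketched generically; this is my sketch of such a classifier, not the authors' exact recipe, and the variance floor is my addition:

```python
import numpy as np

class DiagGaussian:
    """A single diagonal-covariance Gaussian fit to a stack of frame vectors."""
    def fit(self, X):
        self.mu = X.mean(axis=0)
        self.var = X.var(axis=0) + 1e-6      # floor to avoid division by zero
        self.log_norm = -0.5 * np.sum(np.log(2 * np.pi * self.var))
        return self
    def loglik(self, X):
        z = (X - self.mu) ** 2 / self.var
        return self.log_norm - 0.5 * z.sum(axis=1)

def classify(models, X):
    """Pick the class whose Gaussian gives the highest mean frame log-likelihood."""
    scores = [m.loglik(X).mean() for m in models]
    return int(np.argmax(scores))
```

Because frames are scored independently, no long-term constraints enter anywhere, which is exactly the contrast drawn earlier with dynamic-programming pitch trackers.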
Discarding everything but the argmax, of course, yields performance that is not really comparable; spending time improving argmax estimation appears unnecessary — and of course argmax estimation here is pitch estimation.

Okay, so we also constructed a contrastive MFCC system, which is not really standard in the way that you would probably build one, but we tried to retain as many similarities with the very simple HSCC system as we could. We did apply preemphasis and a Hamming window, because that just happens to be the standard front-end feature processing in our ASR systems. We retain twenty of the lowest-order MFCCs, but then we also don't use a universal background model or any speech activity detection, so in this respect the two systems are most comparable. What we see when we compare these two systems is that essentially in every case — at least for this data, for these experiments that we did here — the HSCC representation outperforms the MFCC representation; but we're happy just saying that they're comparable in magnitude. We've also, just to be safe, applied LDA to the MFCC system. This is also not really a fair thing to do, because we haven't actually truncated or discarded any dimensions after that — we take twenty dimensions and we rotate them — and it leads to a negligible improvement. If we combine the HSCC and MFCC systems, we get improvements in every case, except for the dev set for males, where MFCCs don't seem to help; but other than that, combination with MFCCs helps in general.

Okay, so given these results, I'm going to describe a couple of analyses of perturbations, because we were interested in seeing how lucky we were in simply guessing at the parameters that actually drive the behaviour of our system. We considered three different kinds of perturbation: one was changing the frequency range to which the
filterbank is exposed, one was changing the number of comb filters in the filterbank, and the other was ablating — throwing out — the so-called spectral-envelope information that is contained in MFCCs. We ran a very simple version of this analysis, where we used only a single diagonal-covariance Gaussian per speaker, and we only show numbers on one set, because we found them sufficiently similar in granularity to the other set's numbers that we didn't bother with the latter. As before, I'm going to plot accuracy as a function of the number of dimensions.

So the first perturbation has to do with modifying the low-order cutoff. As I said, the HSCC system looks at frequencies between three hundred hertz and eight kilohertz, and it is interesting to see what happens if you choose a different value for this low-frequency cutoff. The results here — for females on the left and males on the right — indicate that the three-hundred-hertz cutoff we had chosen happens to correspond to the best performance. That is to say, if we expose the algorithm also to frequencies between zero and three hundred hertz, then for females we actually lose approximately four percent absolute; the drop is much smaller for males. Moving the cutoff further up has a smaller effect, but it's also worse than keeping what we have.

The second perturbation that we analysed was changing the upper limit. As I said, we had three hundred to eight thousand hertz to begin with, but it's interesting to see what happens if you cut it off at four thousand or two thousand hertz — and this latter configuration in particular corresponds approximately to upsampled eight-kilohertz telephone audio. So here again, results for males on the right and for females on the left: what we see is that for males, reducing
the number of high-frequency components that you're looking at in the FFT has a more drastic effect than for females. For females, going down to four thousand hertz costs a drop of less than one percent absolute, but dropping it further, you see drops of approximately three percent. I want to state that even under these sort of ridiculous ablation conditions, this still significantly outperforms a pitch tracker — although it is not known how well the pitch tracker would operate on three-hundred-to-two-thousand-hertz audio.

The third perturbation is in the transform domain. As I said at the very beginning, we have four hundred filters spaced one hertz apart, and we're at liberty to choose however many filters we want. So it's interesting to see what happens if you double that number and space them half a hertz apart, or halve that number and space them two hertz apart. What the results show — for females on the left again, and for males on the right — is that increasing the resolution of the candidate fundamental frequencies with which you construct the filterbank actually leads to significant improvements: almost two percent absolute for females, slightly smaller for males. Decreasing the resolution has a similarly sized negative impact for both.

Finally, the fact that the MFCC and HSCC features combine to improve performance in three out of four cases suggests that the two feature streams are complementary — but there was actually no proof of that until, sort of, now. So what we're going to do here is take the source-domain FFT and lifter it, by transforming it into the real cepstrum, throwing out the low-order cepstral coefficients, and then transforming it back into the spectral domain. And I want to say here that the low-order real cepstral coefficients correspond approximately to the low-order MFCC coefficients.
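The liftering perturbation just described — go to the real cepstrum, zero out the low-order coefficients, come back — can be sketched as follows, assuming the cepstrum is taken over the full symmetric magnitude spectrum; this is a reconstruction, not the authors' code:

```python
import numpy as np

def lifter_out_envelope(mag, n_lifter=13, eps=1e-10):
    """Remove spectral-envelope information from a frame's magnitude spectrum by
    zeroing the lowest n_lifter real-cepstral coefficients (and their mirror)."""
    log_mag = np.log(mag + eps)
    cep = np.fft.ifft(log_mag).real           # real cepstrum of the frame
    cep[:n_lifter] = 0.0                      # low quefrencies = envelope
    if n_lifter > 1:
        cep[-(n_lifter - 1):] = 0.0           # symmetric counterpart
    return np.exp(np.fft.fft(cep).real)       # back to the magnitude domain
```

Slowly varying (low-quefrency) spectral shape is flattened away, while rapid harmonic ripple — the structure HSCC cares about — passes through untouched.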
So ablating real cepstral coefficients — which are computed without a filterbank — is very similar to removing exactly the information that is captured by MFCCs. The HSCC system whose performance you saw in the table doesn't actually do any liftering, but we could lifter out, say, the first thirteen low-order cepstral coefficients, which corresponds approximately to what people typically use in ASR systems, or the first twenty, which is what we used in the MFCC baseline we saw earlier. What ends up happening, as you can see — a surprising outcome, for females — is that removing the spectral-envelope information actually improves performance here. If we throw away the information contained in the first thirteen cepstral coefficients, we get an improvement of about two percent absolute, meaning that the spectral-envelope information that is modelled in MFCCs actually hurts here, for females. It's also the case that if we throw out twenty of them we still do better than not throwing out any, but already not as well as throwing out only thirteen, which suggests that the cepstral coefficients found between thirteen and twenty are useful. For males, doing any kind of ablation — sorry, liftering — seems to hurt, but the pain is smaller: if you throw away the first thirteen, the loss is negligible — I believe it's one trial, and I have no idea whether that's statistically significant.

So the findings of this analysis are that this representation appears to be robust to perturbations of various sorts — there is play of approximately five percent absolute — and that the performance for female speakers seems to be more sensitive to these perturbations than for males, in both pleasing and displeasing directions. It's again important to say that even under these perturbed conditions, the performance of these systems is vastly superior.
Vastly superior, that is, to the performance you would achieve by spending a lot of time finding the argmax in the representation that pitch trackers are exposed to — and we don't know how a pitch tracker would perform under these perturbations.

So the summary of this talk — I still have three slides — is that the information that is available to a standard pitch tracker, because it is computed by that pitch tracker and then subsequently discarded, is valuable for speaker recognition. The three points that I would like to pay specific attention to are: the performance achieved with these HSCC features is comparable to that achieved with MFCC features; the information contained in these HSCC features is complementary to the information in MFCCs; and HSCC modelling appears to be at least as easy as MFCC modelling. So this evidence suggests, as I have probably said too often now, that improving the estimation of pitch — detecting or finding the argmax in this representation — seems like an endeavour that doesn't warrant further time investment: it's possible to simply model the entire transform domain, and do better. If pitch is required for other, high-level kinds of features — which of course we're ignoring here, because we're not doing any long-term feature computation — then at least this information should not be discarded, even if it is not used to estimate pitch. And if these ideas generalise to other data types and other tasks, then there is some chance that this will lead to some form of paradigm shift in the way that prosody is modelled for speech.

So I want to close with a couple of caveats. We don't actually know how these features compare to other instantaneous prosody vectors; it's possible that if you had a vector that contains pitch, and maybe harmonic-to-noise ratio, and maybe some other things that are computable instantaneously, per frame, the difference would be much
smaller — we don't know that at the current time. We also don't know how this representation performs under various mismatched conditions, for example channel, or session, or distance from the microphone, or vocal effort; these are things that need to be explored. It's also quite possible that there are other classifiers that may be better suited to this; in particular, the performance — which wasn't bad — achieved with a single diagonal-covariance Gaussian suggests that maybe SVMs would do much better. But the feature vectors are large, which presents some problems. Existing prosody systems of course focus a lot on long-term features, and we haven't attempted that here at all. A simple thing to try would be to simply stack features from temporally adjacent frames, or to stack differences of features, but I think that probably the best thing to do is to compute the modulation spectrum over this harmonic-structure "spectrogram". And, probably most importantly, we would really like to have a data-independent feature rotation which allows us to compress the feature space. This would significantly improve understanding, because right now we just have this huge bag of numbers; it would allow us to apply the normal things that people apply, like universal background models, and it would allow us to deploy this in other, larger tasks. Thank you.

(Question) Thank you. Could you perhaps just help me understand your last point — please explain to me why you see some difficulty in applying your method the way you described. Is it because your feature vectors are very large?

(Answer) Well, yes: in the system that we described most extensively here, the feature vector has four hundred numbers, and I have found that to be painful — that's four hundred numbers every ten milliseconds. Does that answer your question? Actually, let me say something more: we actually
found that if you're looking at different kinds of mismatch, you need to do some homomorphic processing, which actually increases the size of the feature vector, and so it becomes even more painful — and that's basically because we don't really know how to properly model this with a data-independent transform. Okay, thanks.

(Question) It seems like it would be very worthwhile to try that; it would be nice to think of ways to make it possible.

(Answer) Definitely — and if any of you have any suggestions, I would like to hear them.

(Question) Do you have any thoughts on how this might behave on mismatched data?

(Answer) We have some thoughts, but not really the correct kinds of thoughts yet. Note that the other dataset that we've been playing with most recently, after doing this work, is a far-field dataset — everything in it is far-field — so there is a big change in what happens, and we actually don't really know exactly where the change is; I guess we're now in the process of thinking about finding different data. But let me try to bring up this table. This is on something called the Mixer 5 dataset, which contains lots of different channels, but the nine channels that we use are all far-field channels. What we have there are two evaluation sets: one has session match and the other session mismatch. We build models for data from every channel and apply them to that same channel — that's the matched-channel condition — and we also apply those models to data from every other channel — that's the mismatched-channel condition. So the mismatched-channel condition consists of an average over, I think, eight-times-nine numbers, and the matched-channel condition is an average over nine. What we see is that in session-matched and channel-matched conditions we're doing something uninteresting there, but
it is here that session mismatch is more painful than channel mismatch, and there is a clear reversal here in the ordering between the MFCC system and the HSCC system that we reported in this work. By the way, this HSCC line is the system that I just described; this table comes from something that we submitted this summer elsewhere, which we hope will be accepted. But the point is that the numbers in this row are always smaller than the numbers in that row. I don't know if that answers your question; I can probably talk a little bit about the magnitude of these numbers if you're not happy with this. I can also say — I don't recall exactly, but I think we did the combination here, and the combination leads to approximately a ten percent absolute increase over this MFCC number, on average over all conditions.

(Question, partly inaudible) … could you hold your microphone a little closer? Sorry. … It seems to me that the features of your proposal are band-limited harmonic-to-noise ratios; and I think that, in addition to harmonic-to-noise ratios, these two things are used in coding — mixed excitation. Is that what you have in mind?

(Answer) Well, I'm not sure that I got all of the things that you said, but if you're saying that there is something very similar, I would really like to talk with you about it; we can do that offline, or you can use this time.

(Question) Thank you. Just to come back to the four-hundred-
dimensional features: did you reduce that feature dimensionality before your modelling stage?

(Answer) Sorry, can you repeat the very beginning of your question?

(Question) Your features are four hundred dimensional?

(Answer) Yep.

(Question) So did you use an LDA to reduce the dimensionality before your, say, Gaussian model?

(Answer) Yes.

(Question) To what dimensionality did you reduce?

(Answer) It turns out that it differs for males and females: it was fifty-two and fifty-three. I don't remember which gender was which — it doesn't matter, it's close enough.

(Question) Okay, because then it would seem to be not a practical problem anymore: we typically use sixty-dimensional features with a UBM that has two thousand components, so that's doable, right?

(Answer) Right, but the problem is that we would need to compute the LDA or PCA transform — these transforms are global, right? — so we would need to compute a PCA transform over the full four-hundred-dimensional features for the entire, I don't know, UBM training set.

(Question, partly inaudible) Do I understand you correctly — did you say that the transform you estimate will depend on which data it comes from?

(Answer) No — if I gave that impression, I didn't intend to.

(Question, partly inaudible) Then it is hard to see what would be a problem in using a universal background model with these features …

(Answer) Basically, yes: we start from the training set, estimate the transforms based on it, and then apply them to the development and test sets. We just haven't gotten that far; due to some limitations — a couple of them — it's not
difficult in principle; it's just that we haven't gotten around to getting that far. And like I said, with feature vectors on the order of four hundred, or eight hundred — which has been shown to be better — or two thousand and forty-eight after some homomorphic processing, we haven't gotten around to even estimating how much disk space we would need on a particular cluster. So that's essentially the correct answer; but my thought always was that we were going to attack this problem by making the feature vector smaller first, rather than addressing the infrastructure problem by buying more disks. Okay — but I think right now I…