Uh-huh. So there was some overlap between this and the previous talk. I used to say that presentations are a lot like painting: it usually takes a couple of coats, and for some people it never sticks — and you know who you are.

So, the purpose of this work is to investigate techniques for forming subspaces over acoustic model parameters. The obvious appeal is that we get more efficient characterisations of sources of variability by representing those sources in low-dimensional spaces. In speaker adaptation and speaker verification there are clear advantages to describing global variability by forming subspaces over supervectors, which are the concatenated means of continuous-density Gaussian mixture HMMs or GMMs. Examples go back to the eigenvoice approaches in speaker adaptation, which showed the benefit of forming these global subspaces, and extend into speaker verification, where things like joint factor analysis also benefit from low-dimensional subspaces formed over these giant supervectors.

One of the more interesting recent developments is the idea of extending these supervector-based techniques to modelling state-level variability, and we do that basically by forming multiple subspaces rather than the single subspace we form over supervectors. That is the subspace Gaussian mixture model proposed by Povey and colleagues in 2010. What we are presenting is an experimental study comparing the ASR performance of both types of approach.

It's important to note again that we're forming subspaces over model parameters, and here is a simple example where we form a subspace over supervectors for a speaker adaptation problem. In this little example we have a population of N speakers, we have data from each of those speakers, and I have a little plot, in some arbitrary two-dimensional feature space, of the feature vectors for those speakers. We can train a mixture model for each speaker, so we describe each speaker's data in terms of model parameters — a mixture of Gaussians in this two-dimensional space. Then, say, we form a one-dimensional subspace that describes the variation over these model parameters, whose dimension can be quite large because it is the concatenation of the mixture mean vectors. So we build speaker-dependent supervectors by concatenating the means of each of these distributions, and we identify the subspace projection using something like principal component analysis or maximum-likelihood estimation over those supervectors.

Having done that, we can do adaptation. We have some adaptation data, we have speaker-independent model parameters, which we might have trained from multi-speaker training data, and we have a subspace projection matrix trained on our supervectors. Adaptation then amounts to estimating a vector that describes the position of our model, with respect to the adaptation data, in this low-dimensional space.
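To make the supervector idea concrete, a minimal sketch of that training-time step might look like the following, assuming per-speaker GMMs have already been trained; the array shapes, the use of scikit-learn's PCA, and the function names are illustrative choices, not the exact recipe used in this work.

```python
import numpy as np
from sklearn.decomposition import PCA

def build_supervector(gmm_means):
    """Concatenate one speaker's GMM component means into a single
    long supervector of length n_components * n_dims."""
    return np.concatenate([np.ravel(m) for m in gmm_means])

def train_supervector_subspace(per_speaker_means, subspace_dim=1):
    """Stack the supervectors of all training speakers and find a
    low-dimensional subspace of speaker variability with PCA."""
    supervectors = np.stack([build_supervector(m) for m in per_speaker_means])
    pca = PCA(n_components=subspace_dim)
    pca.fit(supervectors)
    # pca.mean_ acts as the speaker-independent supervector;
    # pca.components_ spans the speaker subspace.
    return pca.mean_, pca.components_
```

At adaptation time, one would estimate the low-dimensional coordinate of a new speaker in this subspace from a small amount of data and reconstruct the adapted means from it.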
That's basically what these subspace adaptation procedures are when we define the subspace over supervectors. The good thing about them, in speaker adaptation for example, is that we can adapt with seconds of speech rather than the minutes of speech you might need with regression-based approaches. So that's all well and good, and these techniques are well known. (And there's our model equation on the slide — I'd forgotten about that — so that's all fine.)

The idea in the subspace GMM is that we move from forming subspace models of global variability to modelling state-level variability. Stop me if you've heard this one, but there are a number of generalisations when we go from the supervector-based approaches to the subspace GMM. The first, as Dan pointed out, is that we form our state-dependent observation probabilities in terms of a shared pool of full-covariance Gaussians, which we call a universal background model, following the terminology of speaker verification. We have on the order of hundreds of these — anywhere from a couple of hundred to a thousand — shared full-covariance Gaussians over which we define our distributions. The next generalisation is that we form one subspace projection for each of the Gaussians in the pool, so we have multiple projections rather than the single projection of the supervector-based approach. The final generalisation is that the state-dependent means and the state-dependent weights are now obtained as projections within these subspaces.

So the mean vector for the j-th state and the i-th mixture component is obtained from a projection, where M_i is the subspace projection matrix and v_j is the state vector Dan mentioned before. (We describe it here as an offset from the mean of the universal background model, but that's not terribly important.) The state-dependent weights are obtained from weight projection vectors — the w_i here — again applied to the state-dependent v vectors. These weights are normalised so they sum to one, and the exponential makes the expression look vaguely like multiclass logistic regression; but in fact, as Dan probably pointed out, the exponential is really important when it comes to optimising the objective function with the expectation-maximisation algorithm.

One of the really interesting aspects of this is that we have just a really small number of parameters — the single state vector v_j — to represent a state, whereas in a traditional continuous-density HMM essentially all of the parameters are state-dependent; you've just got a big pile of Gaussians per state. Here, the subspace projection matrices and the pool of full-covariance Gaussians are shared across all states.
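For reference, the state-level projections being described here can be written out as follows; this is the standard SGMM formulation from the 2010 proposal, written without the substate index and without the UBM-mean offset mentioned above.

```latex
% Mean of the i-th shared Gaussian in state j: a projection of the
% low-dimensional state vector v_j through the shared matrix M_i.
\mu_{ji} = \mathbf{M}_i \, \mathbf{v}_j

% State-dependent mixture weights: a normalised exponential of the
% shared weight-projection vectors w_i applied to the same v_j.
w_{ji} = \frac{\exp\!\left(\mathbf{w}_i^{\top} \mathbf{v}_j\right)}
              {\sum_{i'=1}^{I} \exp\!\left(\mathbf{w}_{i'}^{\top} \mathbf{v}_j\right)}

% Observation density for state j, with shared full covariances \Sigma_i.
p(\mathbf{x} \mid j) = \sum_{i=1}^{I} w_{ji}\,
    \mathcal{N}\!\left(\mathbf{x};\, \mu_{ji},\, \Sigma_i\right)
```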
We can extend this further by adding the notion of substates: instead of having a single state vector v per state, we can now have any number of them, which just gives us more room to play with the parameterisation. So now we have a weighted combination of substate densities per state, and the means and weights are indexed both by the substate and by the state j.

Somebody asked about the number of parameters in the model. The interesting thing is that the total is dominated by shared parameters — it depends on the parameterisation, of course, but you generally have five or more times as many shared parameters as state-dependent parameters, and in the example system I'll show that ratio is closer to ten to one. That's a little bit extreme, but it's not unusual.

Okay, issues of training. We do maximum-likelihood training with the EM algorithm. The maximum-likelihood estimation of the subspace parameters and the state vectors is a simple, straightforward extension of the global subspace case, where we trained subspaces over supervectors. But, as Dan said, that doesn't work very well unless you also have the weight vectors; those additional degrees of freedom are important. They add an extra component to the ML auxiliary function, the solution no longer has a unique optimum, and so you have to be very careful about how you optimise that auxiliary function.

As far as initialisation goes, we start by initialising the state context for the SGMM from a phonetic-context-clustered continuous-density HMM. We initialise the means and full covariance matrices of the UBM Gaussians using unsupervised GMM training. And rather than initialising the remaining parameters of the system directly, we initialise the joint state and mixture-component posteriors as the product of the state posteriors from the initial CD-HMM and the mixture-component posteriors from the UBM.
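A minimal sketch of that posterior initialisation step might look like the following; the variable names, and the assumption that per-frame state posteriors and UBM component posteriors are already available, are illustrative rather than details given in the talk.

```python
import numpy as np

def initial_joint_posteriors(state_post, ubm_post):
    """Initialise the joint state/mixture-component posteriors for one frame
    as the product of the CD-HMM state posteriors (length J) and the UBM
    component posteriors (length I), giving a J x I matrix that the first
    SGMM accumulation pass can use in place of true posteriors."""
    return np.outer(state_post, ubm_post)

# Toy usage: 3 HMM states, 4 UBM Gaussians, a single frame.
gamma_state = np.array([0.7, 0.2, 0.1])      # from the initial CD-HMM
gamma_ubm = np.array([0.4, 0.3, 0.2, 0.1])   # from the unsupervised UBM
gamma_joint = initial_joint_posteriors(gamma_state, gamma_ubm)
assert np.isclose(gamma_joint.sum(), 1.0)
```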
So, basically, we're doing an experimental study to compare the performance of the subspace GMM with unsupervised subspace speaker adaptation, and we're doing that on the Resource Management database. We'll acknowledge that this is a fairly small corpus collected under constrained conditions: about four hours of training data and about a hundred speakers. The advantage for us, though, is that it's not very amenable to various other adaptation approaches — regression-based adaptation, VTLN and so on don't do much on it — so the effects of the adaptation schemes we study here don't get confounded with other possible normalisation strategies. Our baseline system has seventeen hundred context-clustered states and about six Gaussians per state, which is pretty typical, and on the speaker-independent evaluation task it gives 4.9 percent word error rate, which is pretty much in line with the state of the art.

There's another point here about the allocation of parameters in the SGMM. For this particular setup, the continuous-density HMM system has about eight hundred thousand parameters, and essentially all of them are state-dependent. For the first row of this table — the subspace GMM with a single substate per state — roughly ninety percent of the parameters are shared across all states: the shared parameters correspond to the subspace projection matrices and the full-covariance Gaussians, and there are about 630K of those, but only about 60K state-dependent parameters, which correspond to the v vectors. So that system has about the same total number of parameters as the continuous-density HMM — the overall size of the parameterisation is not that much different — but the allocation is heavily biased towards shared parameters.

The first result here is that the subspace GMM basically matches the 4.9 percent word error rate of the baseline continuous-density HMM, and the best performance we obtained was about 3.9 percent. That's about a twenty percent relative improvement, which is pretty substantial, and it's consistent with Dan's earlier comment that with a small amount of training data the improvement tends to be around twenty percent. The second and third rows compare different means of initialisation: the scheme I described for initialising the posterior probabilities of the SGMM gives a small but statistically significant improvement over a flat start on this data.

For the comparison of the SGMM with the supervector adaptation approaches, which are fairly well known these days, what we did was estimate a subspace projection matrix E, defined over the supervectors of our continuous-density HMM — that's the first equation on the slide — and then adapt by estimating a speaker-dependent vector u from a single unlabelled, untranscribed test utterance. The subspace dimension there is twenty. What we found is roughly a nine or ten percent relative improvement from this supervector-based adaptation, which is not as big as what we got from the SGMM. On this corpus we also tried the speaker subspace within the SGMM model, and we did not get a statistically significant improvement from it; I suspect the additional degrees of freedom in the speaker-dependent weights described in the earlier talk might have an impact there.
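As a rough illustration of that supervector-based adaptation step, here is a simplified sketch that estimates the speaker factor u by ridge-regularised least squares against an observed mean-offset statistic; the fixed-alignment assumption, the regularisation, and all names are simplifications of mine, not the exact estimator used in these experiments.

```python
import numpy as np

def adapt_supervector(mu_si, E, obs_means, reg=1e-3):
    """Estimate a low-dimensional speaker factor u so that the adapted
    supervector is mu_si + E @ u, then return the adapted supervector.

    mu_si:     speaker-independent supervector, shape (D,)
    E:         subspace projection matrix, shape (D, K), e.g. K = 20
    obs_means: mean statistics from the adaptation utterance, shape (D,)
    """
    # Least-squares fit of the observed offset onto the columns of E,
    # with a small ridge term to keep u stable when data is scarce.
    gram = E.T @ E + reg * np.eye(E.shape[1])
    u = np.linalg.solve(gram, E.T @ (obs_means - mu_si))
    return mu_si + E @ u
```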
This last slide is an anecdotal example of the distribution of these state vectors — the first two dimensions of them — from an SGMM trained on the Spanish-language CallHome corpus. We've restricted the plot to a scatter diagram of the centre states of the five Spanish vowels, with the ellipses indicating the locations of the cluster centroids. It's very similar to a plot that Burget and co-workers showed for another language and corpus. The thing I found really interesting about it is that you see this very nice clustering of the state-dependent vectors for the different vowels. That's something you just don't see in a continuous-density HMM: you certainly can't look at the means of the densities in a continuous-density HMM and see any kind of visible structure. So it's very interesting how this structure is discovered automatically by the SGMM, and there really are some other interesting uses for it.

So, to summarise — I guess I'm out of time — we got a rather substantial reduction in word error rate compared to the CD-HMM; I said twenty percent before, but it keeps shrinking as the talk goes on, so call it eighteen percent. The SGMM also did better than unsupervised supervector-based speaker adaptation. And there's the more anecdotal observation that these state-level parameters seem to uncover underlying structure in the data. To take advantage of that structure, we've started looking at a speech therapy application that could exploit this way of describing phonetic variability, and also, like a number of other people at this conference, at multilingual acoustic modelling applications. Any questions?

[Audience question, partly inaudible.] I think ours is probably similar — right — I believe we initialise the first column of the M matrices to the means of the UBM, and the v vectors are initialised to unity. No more questions? In that case, thank you very much for this talk.