Okay, I'm going to talk about the work we did in the scope of the last NIST speaker recognition evaluation. This is the outline of my presentation: I will try to motivate and introduce the problem we wanted to face and solve; since not everybody here may be familiar with connectionist speech recognition, I will have a look at its very basics; then I will introduce the novel features obtained in this work, which I call adaptation-transform features for speaker recognition; and then we will go to the experiments, some conclusions, and some future work ideas.

The main motivation was that we wanted to participate in the NIST evaluation, and we saw that the best systems were combining a large number of subsystems. I am not going to mention all of them, but among the many possible subsystems I was particularly attracted by what are usually called high-level features, which are in close relation with the previous session. Basically, those systems use the speaker adaptation transforms employed in ASR systems as speaker detection features, proposed as alternatives to the cepstral features that are the most commonly used ones. The work ours is most closely related to takes the weights of MLLR transforms, concatenates them into a vector, and uses these coefficients to model speakers with support vector machines.

So what is the problem? In our group we have always been working with hybrid systems, that is, neural-network-based recognisers, and I will show some of their characteristics later. The main problem, and the motivation for this work, is that we cannot use the typical adaptation methods, like MLLR, that are used in Gaussian-based approaches. So what I tried to do in this work was, at the very beginning, to see whether I could do something similar to the MLLR transforms for hybrid systems; then whether we could use it to obtain speaker information inside a speaker recognition system; and finally to compare it with some baseline systems on the NIST telephone-only condition.

Now the basics of connectionist speech recognition, for those who are not very familiar with it. We have been working on it for some applications, mainly broadcast news transcription, but also telephone applications and some other languages. The way these systems work is that we replace the Gaussians with a neural network, in all our cases a multilayer perceptron, and we use its probability estimates as the (scaled) emission likelihoods of the single-state HMMs. Usually we have relatively few outputs, just phonemes or some other sub-word units, not many more.
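To make that last point concrete, the posterior-to-likelihood conversion in hybrid ANN/HMM systems is the standard scaled-likelihood rule; this equation is not on the slides, I am adding it from the standard hybrid literature. Here \(x_t\) is the acoustic frame, \(q_k\) an HMM state, and \(P(q_k)\) the class prior estimated from the training alignments:

```latex
p(x_t \mid q_k) \;=\; \frac{P(q_k \mid x_t)\, p(x_t)}{P(q_k)}
\qquad\Longrightarrow\qquad
\tilde p(x_t \mid q_k) \;\propto\; \frac{P(q_k \mid x_t)}{P(q_k)}
```

since \(p(x_t)\) is common to all states, it drops out of the Viterbi search, and the posterior divided by the prior can be used directly as the emission score.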
The main characteristics are that these networks are usually considered very good classifiers, and that they adapt easily to multiple feature streams, as we will see in a moment. The problems we have are with context modelling, and also with adaptation: there are no adaptation methods as well established as the ones used in Gaussian systems.

This is the block diagram of our broadcast news transcription system. You can see several streams with different features, PLP features, modulation spectrogram features and so on, and for each of them a different multilayer perceptron, trained on the transcriptions. We merge the streams with a simple per-frame rule over the posterior probabilities, and these posteriors are then used by the decoder together with the language model, the lexicon, and the definitions of the HMMs, that is, the relation between the network outputs and the phonemes; from all that, the decoder provides the most likely word sequence.
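About that merging step: the talk only calls it a simple rule over the posteriors, so the sketch below assumes a normalised product (equivalently, an average of log-posteriors), which is a common choice in multi-stream hybrid systems; the function name and shapes are mine:

```python
import numpy as np

def merge_streams(posteriors, eps=1e-10):
    """Combine per-stream MLP posteriors frame by frame.

    posteriors: list of arrays, each (n_frames, n_phones), rows summing to 1.
    Uses a normalised product rule (average of log-posteriors), one common
    choice for multi-stream hybrid systems.
    """
    log_post = sum(np.log(p + eps) for p in posteriors) / len(posteriors)
    merged = np.exp(log_post)
    return merged / merged.sum(axis=1, keepdims=True)

# e.g. plp_post, msg_post: outputs of the per-stream MLPs for one utterance
# merged = merge_streams([plp_post, msg_post])
```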
Some figures for the system: it runs in less than one times real time, with a word error rate around seventeen percent; the units are phonemes plus some other phonetic units; we train on about forty hours of data; the language model is a bigram built as an interpolation of the acoustic transcripts and written text from newspapers; and the vocabulary is relatively small.

For the evaluation data I needed to train a new speech recogniser, and I must say it is a very weak system, because I don't have access to conversational telephone speech: basically, what I did was to train new multilayer perceptrons on downsampled broadcast news data. There are some other differences from the system I just described: I used a somewhat different front-end and features, and only monophone units. I did some very informal evaluations, just to see for myself how it was working on conversational telephone data, and the error rates were very high. Anyway, this recogniser is used for two purposes: to generate a phonetic alignment from the transcriptions provided by NIST, and to train the speaker adaptation transformations.

So how can we adapt a neural network to a speaker? There are several approaches, but basically two families. The first one starts from the speaker-independent MLP and runs a few more iterations of the backpropagation algorithm: instead of starting from random weights, we start from the speaker-independent weights and adapt them with the adaptation data; a variant is to adapt only some of the weights, for instance the ones that go from the last hidden layer to the output layer. Something more interesting, perhaps, is to modify the structure of the network while trying not to modify the speaker-independent component. At the phonetic level, that would be to add some kind of transformation at the output of the network and adapt it to the speaker characteristics; at the acoustic level, we can try to do the same with the input features, adapting them to map the characteristics of the speaker onto the speaker-independent system.

I had already tried this last solution, just to verify that it could work for our ASR application; it works, and it turned out to be the best option there, so I decided to try it also for speaker recognition. Here we have a typical MLP with just one hidden layer, and we add a linear layer at the input: the linear input network. How do we train this adaptation transform? Basically, we incorporate the new linear layer at the beginning and apply the backpropagation algorithm as usual: we have data with labels, we make the forward propagation, compute the output of the network, compute the error, and backpropagate it; but when it comes to updating the weights, we update only those of the linear input network, and we keep the speaker-independent component frozen.
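A minimal PyTorch sketch of that loop, under my own assumptions: si_mlp is the frozen speaker-independent network returning log-posteriors, the linear input network is a bias-free linear layer initialised to the identity (a common choice for LIN; the talk's exact initialisation is garbled in the recording), and for simplicity a single full matrix is trained over the whole stacked input. The tied per-frame variant actually used, described next, would apply one small shared matrix to each frame of the context window:

```python
import torch
import torch.nn as nn

def train_lin(si_mlp, frames, labels, epochs=5, lr=1e-3):
    """Train a linear input network (LIN) for one speaker/segment.

    si_mlp : frozen speaker-independent MLP mapping stacked input frames
             to phone log-posteriors (log-softmax output assumed).
    frames : (n_frames, input_dim) tensor of stacked feature frames.
    labels : (n_frames,) tensor of phone indices from the alignment.
    """
    input_dim = frames.shape[1]
    lin = nn.Linear(input_dim, input_dim, bias=False)
    nn.init.eye_(lin.weight)            # start from the identity transform

    for p in si_mlp.parameters():       # freeze the speaker-independent part
        p.requires_grad_(False)

    opt = torch.optim.SGD(lin.parameters(), lr=lr)
    loss_fn = nn.NLLLoss()              # cross-entropy against the alignment
    for _ in range(epochs):             # fixed number of epochs (five in the talk)
        opt.zero_grad()
        log_post = si_mlp(lin(frames))  # forward through LIN + frozen MLP
        loss = loss_fn(log_post, labels)
        loss.backward()                 # gradients reach only the LIN weights
        opt.step()
    return lin                          # its weights become the speaker features
```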
Let me say something about this transformation seen as a normalisation. It is intended to map the incoming speech toward the representation that maximises the MLP performance, so it can be considered a kind of speaker normalisation, but with some special characteristics, because we are not imposing any restriction on the adaptation process: there is no target speaker toward which we normalise the data. According to previous work, it also seems to depend on the architecture: if we train the transformation network against one speaker-independent network and then place it in front of a different speaker-independent network, one with two hidden layers instead of one, say, it doesn't work anymore; so it has some kind of dependence on the network behind it. And when we trained transforms on different data from the same speaker, what we found is that the resulting transforms remain consistent for that speaker, and that is why I thought they could be useful for speaker recognition.

So, to extract the features, I obtain the phonetic alignment with the NIST transcriptions and train a speaker adaptation transformation for every segment. A particular thing I do is to remove long segments of silence, to avoid background and channel effects leaking into the resulting features. Instead of the cross-validation stopping that is usually done in MLP training, I just apply a fixed number of epochs, five. Another important characteristic is that, instead of training a full matrix over the whole input (the input of an MLP is composed of the current frame and its context, so the full square matrix would have the number of features times the width of the context on each side), I train a tied network that is applied to each frame independently of its position in the context. This reduces the size of the transformation, and it also makes the transforms comparable with each other. In addition to the transform weights, I also stack into the feature vector the input feature means and variances, because it is usual to apply mean and variance normalisation at the input of the MLP. I do this for each of the feature streams, PLP, modulation spectrogram and the others.

For the modelling I use support vector machines: I take the speaker's feature vector as the positive example, and a set of impostor vectors as the negative examples. I use LIBSVM with a linear kernel, and I apply rank normalisation to the input.
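A small sketch of that modelling step, with scikit-learn standing in for LIBSVM and with my reading of "rank normalisation" (each dimension mapped to its rank within the impostor set, scaled to [0, 1]); the function names are hypothetical and details may differ from the original setup:

```python
import numpy as np
from sklearn.svm import SVC

def rank_normalize(background, x):
    """Map each dimension of x to its rank within the background data,
    scaled to [0, 1] -- one common reading of 'rank normalisation'."""
    ranks = np.array([
        np.searchsorted(np.sort(background[:, d]), x[..., d])
        for d in range(background.shape[1])
    ]).T
    return ranks / float(background.shape[0])

def train_speaker_svm(target_vec, impostor_vecs):
    """One positive example versus an impostor set, linear kernel."""
    X = rank_normalize(impostor_vecs,
                       np.vstack([target_vec[None, :], impostor_vecs]))
    y = np.r_[1, np.zeros(len(impostor_vecs))]
    return SVC(kernel="linear").fit(X, y)

def score_trial(clf, impostor_vecs, test_vec):
    """Signed distance to the separating hyperplane as the detection score."""
    return clf.decision_function(
        rank_normalize(impostor_vecs, test_vec[None, :]))[0]
```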
Let's go to the experiments. I used the SRE data, the short2-short3 task, only the tel-tel condition, and two baseline systems to verify the usefulness, or not, of this approach. The first is a quite simple GMM-UBM based on cepstral features: I remove non-speech frames based on an alignment of the log energy, and then do the typical things in a GMM-UBM system; the UBM is trained on data from previous SRE evaluations, and I also applied score normalisation. In addition to that, a more competitive system, a supervector system, GSV-SVM: I derive the supervectors from the MAP-adapted speaker models, and for the negative set I use data from previous SRE evaluations. There I didn't apply score normalisation, because I didn't see much improvement; probably there is some kind of problem in my configuration.

For calibration I used gender-dependent linear logistic regression, with an existing toolkit, in two steps: first a calibration for every single system, and later another linear logistic regression in case of fusion of more than one system. I did it with cross-validation on the same evaluation set, because I didn't have a separate development set for calibration.

And here we have some results. The curves in blue are the DET curves of the individual transformation-network systems based on the different features, PLP, modulation spectrogram and so on. You also have the minimum detection cost function (I should say that the cost is the one proposed for SRE 2008, not the new one from 2009) and the equal error rate. The first remark I want to make about this is simply that it worked; I wasn't sure it would when I started. Among the individual systems we can see that one of the feature streams performs better than the others, but I don't have a good explanation for that: perhaps because its feature size is bigger, or simply because that network is a better classifier.

Then I did two other experiments: the first was to fuse, with linear logistic regression, the four individual systems; even better is to concatenate the four transformation-network feature vectors and train a single SVM on the whole vector. We can see a nice improvement using the concatenated transformation-network feature vector.

Moving to the next slide, this compares the different baseline systems together with the newly proposed one, the transformation network with SVM (TN-SVM). With respect to the GMM-UBM, it performs better close to the minimum-cost operating point, but it seems to get worse as we move toward the equal-error-rate point. With respect to the supervector system, we have a slightly worse performance close to the minimum-cost point, and again the picture changes in the other region of the curve. What I think is important in these results is that we achieve more or less similar performance to the baseline systems: in some cases a bit worse, in some cases a bit better, but not dramatically different.

So the final purpose was, in fact, to try to use these features to improve the baseline systems, and these are the results of the combinations. You can see several different combinations; these are the two baselines, and this is the minimum cost obtained. We can see that when we incorporate the transformation-network feature system we get some improvement, and that happens for all the combinations shown here.
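Since both the per-system calibration and these score-level fusions rely on linear logistic regression, here is a minimal numpy sketch of the idea; it is my own simplified implementation, not the toolkit used in the talk, and it omits refinements such as prior-weighting of target and non-target trials:

```python
import numpy as np

def llr_fusion(scores, labels, iters=500, lr=0.1):
    """Linear logistic regression fusion/calibration.

    scores : (n_trials, n_systems) raw scores from the subsystems.
    labels : (n_trials,) 1 for target trials, 0 for non-targets.
    Returns (w, b) so that scores @ w + b behaves as a calibrated LLR.
    """
    X = np.hstack([scores, np.ones((len(scores), 1))])  # append bias column
    w = np.zeros(X.shape[1])
    for _ in range(iters):              # plain gradient descent on logistic loss
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - labels) / len(labels)
    return w[:-1], w[-1]

# Two-step use, as in the talk: first calibrate each system alone (n_systems=1),
# then fuse the calibrated scores of several systems with a second regression.
```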
Okay, the conclusions. What I wanted to do in this work is to show that features derived from neural network adaptation techniques can be used for speaker recognition, in a very similar way to how MLLR transforms are used with Gaussian systems. I have used one adaptation technique, the linear input network, and built feature vectors based on the resulting transforms and on the mean and variance of the input features, and this seems to perform reasonably well: with respect to the baselines we saw a relatively good performance, better at some operating points of the DET curve and worse at others, but more or less similar. And we could verify that it provides some complementary speaker information, given that we can improve our baseline systems by fusing with it.

With respect to future work, there are several directions we are exploring with these features. We need to assess them with a better speech recogniser, because mine has a very high word error rate; that matters both for the alignment and for the adaptation itself, because with a better recogniser we would probably obtain more meaningful features. We did almost all the tuning of the networks according to ASR criteria, and we should probably do something more oriented to speaker recognition. We also want to study the relation between the architecture of the speaker-independent network and the resulting features, and even to try other adaptation methods; I did not try the adaptation at the output of the network, at the phonetic level. We would also like to apply some intersession compensation, like NAP. And another thing we are interested in is something similar to what has been done for language identification: to use several MLP networks from different languages, train these transformation networks for each of those languages without a phonetic alignment, and then concatenate everything into a single feature vector; this way the approach would not need the ASR transcriptions, and it would also become language-independent. And that's all. Questions?

[Question about score normalisation of the features.] No; I just did rank normalisation of the input of the SVM modelling, also when I was doing testing, but I didn't apply any other normalisation to the feature vectors.

[Question about whether some features matter more than others in the SVM.] That's true, but the SVM, being linear, will itself weight the features that are more important. I didn't treat the features differently depending on whether they come from PLP or from another stream; I just let the SVM learn what it thought was better, and I didn't do anything else in that direction.

[Question about retraining the network instead, given that it is trained on much more data.] You mean retraining the whole MLP starting from a random initialisation. The speaker-independent network does have a softmax output layer, yes, but the linear input network itself is purely linear: I am not applying any kind of nonlinearity at that point, there is no softmax there. I don't know if I am answering your question.

[Question about the input of the transformation.] No, it's just speech features, PLP for instance: the current frame and its context, nothing else.

[Question about the baselines: how many MAP iterations?] I did five MAP iterations, yes. We had verified on the basic GMM-UBM that with five iterations we got better results, even a slight improvement as we went further; but we didn't verify it again when we moved to the supervector system, and later I realised that it was probably not a good idea there: it probably explains, at least with the configuration I had, the poor performance of the supervector system.
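For reference on that exchange, the mean-only relevance-MAP update that such iterations repeat is the standard one below; a numpy sketch under the usual diagonal-covariance assumption, my own implementation rather than anything from the talk:

```python
import numpy as np

def map_adapt_means(X, weights, means, variances, r=16.0, iters=5):
    """Mean-only relevance-MAP adaptation of a diagonal-covariance GMM.

    X : (n_frames, dim) features; weights/means/variances: UBM parameters,
    shapes (n_mix,), (n_mix, dim), (n_mix, dim).
    r : relevance factor; iters: number of MAP iterations (five in the talk).
    """
    means = means.copy()
    for _ in range(iters):
        # log N(x | m_k, diag(v_k)) for every frame and mixture
        log_lik = -0.5 * (np.log(2 * np.pi * variances).sum(axis=1)
                          + (((X[:, None, :] - means[None]) ** 2)
                             / variances[None]).sum(axis=2))
        log_post = np.log(weights)[None] + log_lik
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)       # (n_frames, n_mix)
        n_k = post.sum(axis=0)                         # soft occupation counts
        xbar = post.T @ X / np.maximum(n_k, 1e-8)[:, None]
        alpha = (n_k / (n_k + r))[:, None]
        means = alpha * xbar + (1 - alpha) * means     # relevance-MAP update
    return means
```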
[Question about how much the supervector system improves over the GMM-UBM.] It improves, yes, but not too much; probably there are configuration problems on my side, because I have seen that people get very nice improvements with it, although I don't know whether that holds on the tel-tel condition only. I tried different dimensionalities and it improves, but it was not moving from, say, 6.59 down to 3; it was less than that.

[Question about the SVM scores.] I'm not sure; if something is not right there, it would be because I didn't verify it carefully. I am currently using the SVM probability estimates, and I think that is not a good idea; changing that in both systems based on SVM, I mean in the supervector baseline and also in my proposal, I think I could improve things, because the probability estimate is not that good a score for the speaker detection task. Okay, thank you.