0:00:06 So this is work that we did at SRI, and it's in fact our foray into language recognition. It's great, because most of the background techniques were already covered in plenty of detail in the previous three talks, so I won't go into that again. I will start with some preliminaries and then tell you about our experimental setup. The focus here will be on phonotactic, or I should not say phonotactic but phone-based, recognition. I say that because it encompasses both phonotactic modelling, such as what we've heard about before, as well as MLLR-based modelling, which of course is also based on phone models, hence the commonality. I will also look at two different techniques, comparing two different ways of doing the phonotactic modelling, and then conclude with some pointers to future work and conclusions.
0:01:08 Because we haven't participated in past language recognition evaluations, we didn't actually have access to any data after LRE 05. Since this work was done, the LDC has actually released LRE 07, but we didn't have a chance to process that yet. So we're dealing with, I apologise, a rather outdated task, which is the seven-language task from LRE 05. It's conversational speech, so none of that Voice of America stuff. The test data consists of about thirty-six hundred test segments, and we only look at the thirty-second condition. The training data is from the same seven languages, and the duration, after we perform our automatic segmentation into speech and nonspeech, boils down to about fifty-six hours for the first three languages, and on the order of six hours for the remaining ones that you see here.
0:02:05 I'll be reporting the equal error rate averaged over all languages. That's just what we chose; maybe it's not the best choice, but that's what you'll see. We performed two-fold cross-validation: to do calibration and fusion, we split the data in halves, estimate on the first half and test on the second, and vice versa, and combine the results. This is again because we didn't have an independent evaluation set; we were only working with LRE 05.
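To make the metric and the protocol concrete, here is a minimal sketch of the average-EER computation, assuming per-language detection scores are already available (array and function names are illustrative):

```python
import numpy as np

def eer(target_scores, nontarget_scores):
    """Equal error rate of one language detector via a threshold sweep."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(nontarget_scores))])
    labels = labels[np.argsort(scores)]           # sort by ascending score
    miss = np.cumsum(labels) / labels.sum()       # targets at/below threshold
    fa = 1.0 - np.cumsum(1 - labels) / (1 - labels).sum()  # nontargets above
    i = np.argmin(np.abs(miss - fa))              # point where the rates cross
    return (miss[i] + fa[i]) / 2.0

def average_eer(scores_by_language):
    """scores_by_language: dict language -> (target, nontarget) score arrays."""
    return float(np.mean([eer(t, n) for t, n in scores_by_language.values()]))
```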
0:02:39 Okay, you've heard all this: the two mainstream techniques are really the cepstral GMM, and lately there's been the incorporation of session variability compensation with JFA. We implemented that using a kind of standard framework, and just for reference, it gives you something like 2.87 percent average equal error rate on this data.
0:03:08 The alternative popular technique, of course, is the PRLM technique, and you've heard all about that already, so I won't repeat it here. What's popular about it is that you can combine multiple language-specific systems by fusion at the score level and get much better results, in the so-called parallel PRLM.
0:03:29 For calibration and fusion we also didn't attempt anything out of the ordinary. In fact, we haven't even tried to incorporate Gaussian backend modelling yet, so we just use the multiclass FoCal toolkit, which is based on multiclass logistic regression, and we use it both for fusion and for calibration.
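This is not the FoCal toolkit itself, but a minimal sketch of the same multiclass logistic regression idea using scikit-learn: per-system score vectors are stacked into one feature vector per trial, and a single regression does fusion and calibration in one step (all names here are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_fusion(system_scores, labels):
    """system_scores: list of (n_trials, n_languages) score matrices, one per
    subsystem; labels: true language index for each training trial."""
    X = np.hstack(system_scores)                 # concatenate subsystem scores
    return LogisticRegression(max_iter=1000).fit(X, labels)

def fused_scores(model, test_system_scores):
    """Log posteriors serve as the calibrated, fused language scores."""
    return model.predict_log_proba(np.hstack(test_system_scores))
```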
0:03:51 So, the first section is about phonotactic language modelling. This again is a standard technique by now, as you saw before: instead of doing one-best phone decoding, we do lattice decoding. This was actually adopted twice: LIMSI proposed it for language ID, and SRI proposed it for speaker ID.
0:04:17 In both cases it showed pretty dramatic improvements. And actually, I wanted to respond to something that was said in a previous talk: I do not think that lattice decoding increases your variability; in fact, I think it reduces your variability, because you're not making hard decisions. Whereas with a one-best hypothesis the recogniser might decide between an n-gram having a frequency of one or, you know, 0.999, in the lattice approach you represent both the one-best and the later hypotheses, so you have all the hypotheses represented, and they just differ by small numerical values. I think that actually gives you a more robust feature, and that was actually demonstrated by Andy Hatch in the original paper. The other reason why it works, of course, is that it gives you more granularity in the feature.
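A minimal sketch of the soft-count idea, approximating the lattice by an N-best list with posterior weights (real systems accumulate expected counts directly on the lattice): every hypothesis contributes its n-grams weighted by its posterior, and one-best counting is just the special case of a single hypothesis with weight 1.0.

```python
from collections import Counter

def expected_ngram_counts(nbest, n=3):
    """nbest: list of (posterior, phone_sequence) pairs whose posteriors
    sum to roughly one. Returns expected (soft) n-gram counts."""
    counts = Counter()
    for posterior, phones in nbest:
        for i in range(len(phones) - n + 1):
            counts[tuple(phones[i:i + n])] += posterior
    return counts

# One-best counting as a special case:
# expected_ngram_counts([(1.0, best_hypothesis)], n=3)
```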
0:05:18 So much for how we do the feature extraction. We didn't have time to really develop new phone recognisers for this, so we just took three phone recognition systems that we had lying around: one is for American English, one is for Spanish, and the third is for Levantine Arabic. You can see here that the phone sets differ in their sizes, and furthermore the amount of training data is vastly different: for English we have basically as much data as we want, and for that reason we do gender-dependent modelling; for the other two languages, with less training data, we do gender-independent modelling. But other than that, they all use the same kind of standard ASR recipe: a PLP front end with vocal tract length normalisation, HLDA for dimensionality reduction, and crossword triphones in the acoustic model training. The decoding, of course, is done without phonotactic constraints, but following the results from LIMSI we do use the context-dependent triphones in decoding. And also, again following a very nice result from a couple of years ago, we use CMLLR adaptation in decoding. By the way, we tried regular MLLR as well, and it didn't perform as well, and I guess that's in agreement with another of the previous talks.
0:06:41 Okay. So the first idea we would like to propose is to get rid of, or largely get rid of, all these different parallel language-specific phone decoders. Instead, we can define a universal phone set that covers several languages; in our case we made up such a set of fifty-two phones. What you do is map your individual language-specific dictionaries to a common shared phone set, and then you retrain your acoustic models using the mapped dictionaries. And of course the language models, if you perform decoding with a language model, the phonotactic models, should also be retrained. The phone recognition accuracies, as measured on the individual languages, are very close to what you get with the universal phone set; you're not really sacrificing much in terms of accuracy.
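A minimal sketch of the dictionary-mapping step; the phone labels below are made up, since the actual fifty-two-phone set isn't listed here (and, as comes up in the questions at the end, the real mapping was designed by hand):

```python
# Hypothetical language-specific -> universal phone mappings.
PHONE_MAP = {
    "english": {"aa": "a", "ae": "a", "th": "T", "dh": "D"},
    "spanish": {"a": "a", "rr": "r", "x": "h"},
}

def map_dictionary(lexicon, lang):
    """Rewrite each pronunciation in a language-specific lexicon into the
    shared universal phone set; unmapped phones pass through unchanged."""
    table = PHONE_MAP[lang]
    return {word: [table.get(p, p) for p in pron]
            for word, pron in lexicon.items()}

# Acoustic models (and any phonotactic LM used during decoding) are then
# retrained on the remapped pronunciations.
```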
0:07:37 These are the language-specific sets that we combined and mapped in this fashion. We took American English data, a mix of native and nonnative speakers, and because we know that in much of what we do, both native and nonnative speakers occur, not at their natural frequencies but with more of a nonnative focus, we actually weighted them so that they contribute roughly equal amounts of data. Then Spanish and Egyptian Arabic; note that we used Egyptian Arabic here because that happened to be a dataset where we had phone-level transcriptions, so we could actually perform this phone mapping in a pretty straightforward way. Also, the two datasets that have very little data, Spanish and Egyptian Arabic, are weighted more heavily, to even things out in terms of the overall model.
0:08:33 This next point might be a detail that's known to everybody, but since we do it, I thought I'd point it out. When we do the log likelihood ratio scoring, we actually do not use all the languages in the denominator, but only the languages that are not the target language, and that gives you a slightly better result.
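In symbols, with $\mathcal{L}$ the set of languages and $x$ a test segment, the rule as stated would be:

```latex
\mathrm{LLR}_t(x) = \log p(x \mid L_t)
  - \log\left[ \frac{1}{|\mathcal{L}|-1} \sum_{l \in \mathcal{L}\setminus\{t\}} p(x \mid L_l) \right]
```

That is, the target language $L_t$ itself is excluded from the denominator average.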
0:08:57 Okay, so here are the results using the PRLM approach. We have the three individual PRLM systems, based on the American English, Levantine Arabic, and Spanish recognisers. The American English system, because it has the most training data, gives you the best individual result. It might also help that English has a relatively high number of distinct phones, which gives you a lot of resolution in your decoding. Then, when you do the standard parallel PRLM, with first two recognisers and then three recognisers, you get progressive improvement: overall, from the single best, which is the American English, to the three-way parallel PRLM, it's an improvement of about thirty-four percent relative.
0:09:50 And then the single decoder that uses only the multilingual recogniser gives you 3.01, which is very close to the combined result, and of course vastly simpler and faster. And if you combine them all, so you have a four-way parallel PRLM now, you get another nice improvement: you go from the previous result with the three language-specific systems to a four-way PRLM with a pretty significant twenty-four percent additional reduction. Usually, when you add more and more of these language-specific systems, the improvements kind of peter out, as you might expect, but if you add the multilingual system you get another big gain.

Just some details; again, this might all be common knowledge, but we did find that there was actually no gain from 4-grams; trigrams were the best in terms of the overall accuracy. So somehow the 4-grams are too sparse, or the models are not adequate to capture the information in 4-grams. And it's actually good to do a fairly suboptimal smoothing in terms of language model performance: the simplest smoothing works best, better than doing fancy things.

Okay. So next we did something which is very easily done in a system like ours: we augmented the standard cepstral front end with MLP features, multilayer perceptron, neural network features, which work very well when we do word recognition and other tasks. We had also shown that a front end that is trained on, say, English, to perform English phone discrimination, actually generalises to other languages: you can train the neural net to discriminate English phones at the frame level, then use that trained front end to train, say, a Mandarin recogniser, and you see a nice gain. So this gave us confidence that this front end, although it is trained on only one language, would actually generalise to other languages, which is exactly what we want for the language recognition task. And indeed, we found that across the board, for all languages, we get a small but consistent improvement in recognition accuracy at the phone level.
0:12:22 And now we're going to throw this at the multilingual PRLM. We can augment the multilingual PRLM with this MLP feature front end, and you see an improvement here: you go from the 3.01 to 2.81. And if you do this combination with the other language-specific PRLM systems, you get a further improvement from the 2.09, so a nice improvement from adding the MLP features, as others have seen; but we wanted to verify it for this framework with the multilingual recogniser.
0:13:07 Okay, so now we're going to try something different. Another thing that we have used with some success in speaker identification is MLLR transforms, so why shouldn't we be able to do this for language recognition? The idea, which you've probably seen talked about at this workshop, is that you have a language-independent set of phone models, and you use MLLR adaptation: you estimate a transform to move certain phone classes from their language-independent locations, or speaker-independent, or whatever the independence is that you care about, to a location that is specific to a subset of your data, such as a language or a speaker. You then use the transform coefficients as features and model them with an SVM. In our case we have eight phone classes, each feature vector has thirty-nine components, and the affine transforms are thirty-nine by forty matrices, so we get about twelve and a half thousand raw features. We perform rank normalisation, as we do in our speaker ID systems, and that's our feature vector. Then we do support vector machine training with linear kernels; the hyperplane is really the model for the language in this case, and the language detection score is the distance of your test sample from it.
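A minimal sketch of the feature construction and modelling as described; the MLLR estimation itself is abstracted away, and the rank normalisation is a simplified training-set version:

```python
import numpy as np
from sklearn.svm import LinearSVC

N_CLASSES = 8                      # phone classes, each with a 39x40 transform

def mllr_feature_vector(transforms):
    """transforms: list of N_CLASSES arrays of shape (39, 40) from MLLR
    adaptation of the reference models to one utterance or segment."""
    assert len(transforms) == N_CLASSES
    return np.concatenate([t.reshape(-1) for t in transforms])  # 12,480 dims

def rank_normalize(X):
    """Per-dimension rank normalisation over a data matrix, as in the
    speaker-ID systems (simplified: test data would reuse training ranks)."""
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)
    return ranks / float(len(X) - 1)

# One linear SVM per target language (one-vs-rest); the separating hyperplane
# is the model for that language, and the signed distance of a test vector
# from it is the detection score:
#   svm = LinearSVC().fit(rank_normalize(X_train), is_target_language)
#   score = svm.decision_function(x_test.reshape(1, -1))
```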
0:14:31 Here are the results, and this is a very crude system, but bear with me. We tried this first with an English MLLR reference model: we used female English speakers only in our reference model, and we get some reasonable results. We can play this game where we actually combine male and female transforms, and we get a better result, consistent with what we see in speaker ID work. But when we use a single gender-independent multilingual MLLR reference model, we do much better. So this just goes to show, first, that it works in principle, and secondly that, again, the multilingual phone models work better than the language-specific ones.
0:15:17 Now we want to get this result down to be more competitive with our standard cepstral GMM. First of all, we can use a little trick: the training conversations are actually pretty long compared to the test segments in the test set, so we can split our training conversations into thirty-second segments and get many more data points for the SVM training. We can also optimise the number of Gaussians in our reference models to be smaller; that forces the MLLR to do more adaptation work in the transform, as opposed to just using different regions of your Gaussian mixture. And finally, we can do NAP to try to project out within-language variability. So that's all done kind of incrementally, and you see that the average equal error rate goes down from around seven to just below four. So we're not quite there yet as far as the baseline of the cepstral GMM goes, but it's much more competitive.
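A minimal sketch of the NAP step under its usual formulation, assuming the nuisance directions are the top principal directions of within-language variation in the training vectors:

```python
import numpy as np

def nap_nuisance_basis(X, langs, k):
    """X: (n_samples, dim) feature matrix; langs: language label per row.
    Returns the top-k within-language variability directions."""
    centered = np.vstack([X[langs == l] - X[langs == l].mean(axis=0)
                          for l in np.unique(langs)])
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T                       # (dim, k) nuisance subspace V

def nap_project(X, V):
    """Apply (I - V V^T) to each row, removing the nuisance directions."""
    return X - (X @ V) @ V.T

# V = nap_nuisance_basis(X_train, train_langs, k=64)   # k is a tuning knob
# X_train, X_test = nap_project(X_train, V), nap_project(X_test, V)
```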
0:16:26 Okay, now another incremental improvement: we augment the PLP front end of the MLLR system with the twenty-five MLP features. So the per-class feature count goes from thirty-nine by forty to that plus another block-diagonal component that accounts for adapting the MLP features, which is twenty-five by twenty-six. Overall, the feature dimension increases from the roughly twelve and a half thousand (8 × 39 × 40 = 12,480) to just under eighteen thousand (8 × (39 × 40 + 25 × 26) = 17,680), and the performance improves, with about a thirteen percent relative reduction.
0:17:12 Okay, so now I want to go back to phonotactic modelling. As we've seen, hardly anybody uses language models anymore; people use SVMs for phonotactic modelling. So we wanted to do the same and see if what we saw before still holds. We had found, many years ago, that in speaker ID, SVM models applied to phone n-gram features work better than language models. So here we want to apply this to the multilingual phone recogniser's output, and we use the TFLLR kernel due to Campbell, and we do not perform any rank normalisation here.
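A minimal sketch of the TFLLR scaling, following Campbell's formulation as best it can be reconstructed here (treat the details as assumptions): each n-gram's relative frequency in the utterance is divided by the square root of its frequency in the background training data, and plain dot products of the resulting vectors realise the kernel.

```python
import numpy as np

def tfllr_vector(utt_counts, background_freqs, ngram_index):
    """utt_counts: dict ngram -> (soft) count in one utterance or segment.
    background_freqs: dict ngram -> relative frequency over all training data.
    ngram_index: dict ngram -> position in the feature vector."""
    total = sum(utt_counts.values())
    x = np.zeros(len(ngram_index))
    for gram, count in utt_counts.items():
        if gram in ngram_index:
            p = count / total                     # utterance n-gram probability
            x[ngram_index[gram]] = p / np.sqrt(background_freqs[gram])
    return x   # dot products of these vectors implement the TFLLR kernel
```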
0:18:01 Again, we play this game where we split our training conversation sides into segments that match the length of the test data, and that gives us more training samples for the SVM.
0:18:16 So this was our baseline, using a language model over the phone n-grams, which you saw before. When we do an SVM with the same feature space, trigrams, we do slightly worse. But whereas previously we did not get a gain with 4-grams, we now do gain with 4-grams, and with the additional features the result actually gets better than the LM. Finally, we can fuse the two phonotactic systems, the LM-based and the SVM-based system, and we get another gain. So apparently the SVM is a better tool when it comes to modelling very sparse features, and that's why we see a gain in going from trigrams to 4-grams. We also tried using NAP here, but got no gain from that; we were replicating something we had tried in speaker ID, and it didn't work. However, we haven't tried the dimensionality reduction techniques like those proposed in the previous talk, so that's certainly something to do.

Okay, and just as kind of the grand finale, we put everything together. This is our single best system, the multilingual phone recognition phonotactic SVM, with the result you see over there, and this is our other baseline, the cepstral GMM. Then we can incrementally add the phonotactic, or rather phone-based, systems; all the combinations start with the cepstral GMM. First of all, we see that doing MLLR-type modelling on cepstral features does combine with the cepstral GMM. The multilingual PRLM system is the best thing to combine with the baseline; we see a whopping reduction there. And then, adding all the others on top of these two, you go down by another twenty percent relative. So we essentially halved the error rate from the 2.87, which looks like a pretty nice reduction.

Well, I've really only told you the highlights: the fact that the two different kinds of phonotactic modelling actually combine, and the fact that the two types of cepstral modelling combine. And the interesting thing here is that, once you have the multilingual PRLM, adding multiple language-specific phonotactic models does not help anymore: with all these other things in place, it's no longer useful to actually have the language-specific phone recognisers.
0:21:05 Okay, just a quick rundown of some of our future directions. Obviously, we want to verify these results with more recent LRE datasets; in particular, we want to try the language and dialect recognition tasks. The SVM approach, as you've seen in some of the previous talks, can also be pursued in parallel with multiple language-specific phone sets. More interestingly, I think we should retrain the MLP features to actually be well matched to the multilingual phone set that we're using now at the back end; that should buy an additional improvement. We could also, though it's not very interesting, use MLP features for all the language-specific phone recognisers; we might not really pursue that, because we're trying to get rid of the language-specific phone recognisers. And we can of course then go to more high-level features that we've tried and that worked well in speaker ID, such as prosodic features and constrained cepstral features.
0:22:09 Okay, so here are the take-home messages. We tried various phone-based systems for language ID, using techniques that we had previously seen to work well in ASR and also in speaker ID. For the first time, to our knowledge, we tried using MLLR-SVM modelling for the language recognition task. The biggest takeaway is that the multilingual phone model approach works better, and is simpler, than using a combination of language-dependent recognisers in parallel, and it still gives you some gains if you combine it with language-specific phone recognition. The MLP front end gives improvements here too: what others found, that the MLP front-end gain in word recognition carries over, holds for the two techniques that we explored here. And the MLLR and the cepstral GMM approaches to cepstral modelling also combine quite well. Well, I said the rest already, so that's it.
0:23:24 Any questions?

[Question] Thank you very much for the nice talk. At the beginning you said that the multilingual phone recogniser works approximately the same as the language-dependent ones. Do you have, I mean, numbers?

[Answer] No, they're not here; I have them somewhere, but I didn't think it was really relevant, because when we measure phone recognition accuracy we usually apply a phonotactic model, whereas for language ID purposes we throw away the phonotactic model, because we want to be very sensitive to the particulars of the language.

[Question] Maybe this was at the very beginning of the talk, but could you give more details about the discriminative MLP features that you're feeding into the phone recogniser? Do you take the posteriors and do some postprocessing?

[Answer] Yes, they're actually quite classic. By the way, we didn't train anything in particular for this language task; it's something that we have used in word recognition, basically optimised for word recognition on conversational English telephone speech. We take PLP features over a nine-frame window and then perform the usual kind of MLP training with those input features. We also use the HATS features, which are kind of a derivative of the TRAP features, going back to Hermansky's work; those capture longer-term critical-band energies. Then we combine the posteriors from these two MLPs into a single set of posterior vectors, and then we reduce that to twenty-five dimensions.
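A minimal sketch of that Tandem-style pipeline; the posterior combination and the final projection are not spelled out in the talk, so simple averaging and a PCA basis stand in here as assumptions:

```python
import numpy as np

def tandem_features(post_plp, post_hats, pca_basis, eps=1e-10):
    """post_plp, post_hats: (n_frames, n_phones) posterior matrices from the
    two MLPs (nine-frame PLP input and HATS input). pca_basis: (n_phones, 25)
    projection estimated on training data."""
    combined = (post_plp + post_hats) / 2.0   # assumed: simple averaging
    logp = np.log(combined + eps)             # log domain, usual for Tandem
    return logp @ pca_basis                   # 25-dim features per frame
```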
0:25:32 Any questions?

[Question] A follow-up on the MLP setup: what are you training it to predict?

[Answer] It's an English phone set, so it has on the order of forty-five categories, and it's performing frame-level classification.

[Question] So you're trying to predict the phone at each frame?

[Answer] The English phone at each frame, regardless of the language. As I said, we did not train language-specific or even multilingual MLPs; we just reused the English-specific MLP that we had.

[Question] Have you tried retraining it?

[Answer] You know, that's what I put in the future work as one of the obvious improvements: you could actually generalise the concept to cover all languages and then retrain the MLP.

[Question, largely inaudible; evidently about how the universal phone mapping was derived]

[Answer] It's a mapping designed by a phonetician.

[Question, inaudible]

[Answer] We plan to do that, but we haven't yet.

I see. Right.