0:00:15 So first, thank you very much to the Odyssey conference for giving us the chance to present our language recognition system. My name is Raymond, from the University of Sheffield and the Chinese University of Hong Kong.
0:00:30 Our language recognition system follows a fairly standard design, so the motivation of the paper, and of the talk today, is basically to go through the key points of the setup of the core systems, as well as some system enhancements and the calibration.
0:00:53 A bit of background: language recognition is about recognising the language from a speech segment. If we go through the classical methods of language recognition, we can see researchers working with acoustic or phonotactic features. Then there are shifted delta cepstral features, which take a longer temporal span of the signal and help language recognition. Recently, i-vectors, DNNs, and combinations of all these methods have proved to be useful in language recognition.
0:01:23 We submitted a combination of three systems to the NIST language recognition evaluation last year. The first one is a standard i-vector system, the second is a phonotactic system, and the third is a frame-based DNN system. After the evaluation we got a little bit of enhancement by combining the bottleneck features with the i-vector system; we will go through the details of that later.
0:01:47 This is just a brief recap of the training data and also the target languages. We have the Switchboard data as telephone speech training data, and also some multilingual LRE training data from past evaluations. There are twenty languages in the language recognition evaluation, divided into six language clusters, and the task is to identify languages within the clusters, so the languages are closely related.
0:02:18 The training data of the language recognition evaluation comes as a raw set of files, about seven to eight hundred hours. To start with, we run voice activity detection. To train our voice activity detector, we use the transcriptions that come with the speech: having trained our Switchboard phone tokenizer, we run a forced alignment on the data, and then we simply treat the silence labels as non-speech and the non-silence labels as speech. We also take some of the transcribed training data from Voice of America broadcast speech to train the voice activity detector for that channel; for this data we just take the raw speech/non-speech labels.
0:03:02 The table shows the amount of voiced and unvoiced speech in the different corpora.
0:03:11 We train a two-layer DNN for the VAD. It takes filter-bank features, with feature splicing of fifteen frames on the left and the right. The output of the DNN is two neurons, giving the voiced and unvoiced posterior probabilities. We then do sequence smoothing using a two-state HMM, forcing a minimum duration of twenty frames for voiced and unvoiced segments, and on top of that we have a heuristic to bridge the non-speech gaps which are shorter than two seconds.
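The minimum-duration and gap-bridging heuristic just described could look something like the following sketch (a toy run-length version; the function names and the 100 frames-per-second assumption are ours, not from the talk):

```python
# Hypothetical sketch of the VAD post-processing described above: enforce a
# minimum duration on voiced/unvoiced runs, then bridge short non-speech
# gaps (< 2 s, i.e. 200 frames at an assumed 100 frames/s).

def runs(labels):
    """Collapse a per-frame label sequence into [label, length] runs."""
    out = []
    for lab in labels:
        if out and out[-1][0] == lab:
            out[-1][1] += 1
        else:
            out.append([lab, 1])
    return out

def smooth_vad(labels, min_dur=20, max_gap=200):
    """labels: per-frame 0/1 speech decisions; returns smoothed labels."""
    segs = runs(labels)
    # Enforce the minimum duration: absorb short runs into the previous run.
    merged = []
    for lab, n in segs:
        if n < min_dur and merged:
            merged[-1][1] += n
        else:
            merged.append([lab, n])
    # Bridge non-speech gaps shorter than max_gap that follow speech.
    bridged = []
    for lab, n in merged:
        if lab == 0 and n < max_gap and bridged and bridged[-1][0] == 1:
            bridged[-1][1] += n          # treat the short gap as speech
        else:
            bridged.append([lab, n])
    # Re-merge adjacent runs with the same label and expand back to frames.
    final = []
    for lab, n in bridged:
        if final and final[-1][0] == lab:
            final[-1][1] += n
        else:
            final.append([lab, n])
    return [lab for lab, n in final for _ in range(n)]

frames = [1]*30 + [0]*50 + [1]*40    # a 0.5 s pause inside speech
print(sum(smooth_vad(frames)))       # the short pause is bridged -> 120
```

A real implementation would work on the HMM output lattice rather than hard per-frame decisions, but the run-length view captures the heuristic.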
0:03:47 For the results: on the Switchboard test data we have miss and false alarm rates of around two percent, but for the VOA data, the broadcast data, the error rates are much higher. We did an aural inspection, and we believe it is down to inaccuracies in the reference data, so we settled on this first system and continued building the language recognition system.
0:04:15 We defined training sets in the course of the system development. These are the two sets we used: V1 and V3. The V1 data is an early version of the training data: we directly take the VAD results and extract the segments whose durations lie between twenty and forty-five seconds, and we train specifically for the thirty-second condition. So in the development, from the very beginning, we divided the test and training data into three-second, ten-second and thirty-second durations; we are not sure whether this was the right decision.
0:04:58 For the V3 data, we actually ran the tokenizer all over again on the whole training set, and with that we refined the V1 segmentation, so we have shorter segments for decoding, which speeds up the decoding process in the first round. Then we ran re-segmentation with different silence thresholds, and we derived three training sets for the evaluation, of thirty seconds, ten seconds and three seconds. These are nested data sets with a little bit of overlap.
0:05:30 For the data partitions of each of the sets, we have eighty percent of the data for training and ten percent for development, and in the early parts of the experiments we report internal test results on the ten percent internal test set.
0:05:45 This is the system diagram of our language recognition system. On the left you can see the i-vector system, and there is the phonotactic system; the phonotactic system generates bottleneck features which feed into the DNN system, the frame-based language recognition system.
0:06:03 The i-vector system follows the standard Kaldi recipe: mean normalization of the features, shifted delta cepstra, and frame-based VAD. To start with, we trained a 2048-component GMM-UBM and the total variability matrix to extract 600-dimensional i-vectors. We tried two language classifiers, a support vector machine and logistic regression. The focus of the study here is to compare the use of different data sets in the training of the UBM, the total variability matrix and the language classifier, and also to compare global and cluster-dependent classifiers. By a global classifier I mean a classifier which classifies all twenty languages in one go.
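As an aside, the shifted delta cepstra mentioned above stack deltas computed at several shifted offsets. A minimal sketch with the common 7-1-3-7 configuration follows; the talk does not state the exact parameters, so N=7, d=1, P=3, k=7 are assumptions:

```python
import numpy as np

def sdc(cep, d=1, P=3, k=7):
    """Shifted delta cepstra: cep is a (T, N) matrix of cepstral frames;
    returns (T, N*k) SDC features.  d is the delta spread, P the shift
    between blocks, k the number of stacked delta blocks."""
    T = cep.shape[0]
    blocks = []
    for i in range(k):
        # delta computed around the frame shifted by i*P, clipped at edges
        plus = np.clip(np.arange(T) + i * P + d, 0, T - 1)
        minus = np.clip(np.arange(T) + i * P - d, 0, T - 1)
        blocks.append(cep[plus] - cep[minus])
    return np.hstack(blocks)

feats = np.random.randn(200, 7)     # 7 cepstral coefficients per frame
print(sdc(feats).shape)             # (200, 49)
```

The long temporal span comes from the k blocks reaching i*P frames ahead, which is what gives SDC its advantage for language recognition over plain deltas.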
0:06:54 We have four configurations here. From condition A to condition B, we increase the amount of data for UBM and total variability matrix training; from B to C, we replace the SVM with a logistic regression classifier; and from C to D, we further increase the amount of training data for the logistic regression classifier. The bar chart on the right shows the minimum average cost for the different configurations of the i-vector system; the results are reported on the internal V1 test data, which has thirty-second duration.
0:07:32 When we look at the two red bars here in the middle, we can see the comparison between using a smaller and a larger amount of training data for the UBM, and the larger amount gives some improvement. We also see some difference between having a global classifier and within-cluster classifiers. We did not manage to try all the combinations listed here, just because of time constraints.
0:08:02 What we conclude from this set of experiments is that we tend to use the full set of raw training data for the training of the UBM and the total variability matrix, and also that within-cluster classifiers outperform the global one.
0:08:20 Then, as our training progressed, we moved to the V3 data.
0:08:26 We have similar conclusions as I just mentioned. We then tried different amounts of training data for the logistic regression classifier, as shown by the three red bars here. The left bar uses a small amount of training data, only one hundred hours; the middle bar uses three hundred hours; and for D we use the raw set of data, which comprises about eight hundred hours. This shows a trade-off between using more data and whether the data are well structured and segmented, and we ended up using three hundred hours of segmented data to train the logistic regression classifier.
0:09:09 The two red bars on the far left and right are about the use of the SVM versus logistic regression; again, this shows the advantage of using the logistic regression classifier.
0:09:31 That brings us to our second system, the phonotactic language recognition system. There are two components in the phonotactic system: first a phone tokenizer, and second the language classifier. The phone tokenizer is based on a standard Kaldi setup: we have LDA, MLLT and fMLLR speaker adaptation, and then a DNN with six layers, each layer containing around two thousand neurons. We use a phone bigram language model with a very low grammar scale factor of 0.5; we tried a higher scale factor of two, and the low factor gives better results on our internal test sets.
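To illustrate what the grammar scale factor does (toy numbers, not from the system): the language model log-probability is weighted before being combined with the acoustic score, so a low factor like 0.5 lets the acoustics dominate, which keeps the decoded phone sequence faithful to the audio, exactly what a phonotactic system wants.

```python
# Toy illustration of the grammar scale factor in decoding.  Each
# hypothesis is (phone string, acoustic log-prob, LM log-prob); the
# decoder picks the highest combined score.  Numbers are made up.

def best_path(hyps, grammar_scale=0.5):
    return max(hyps, key=lambda h: h[1] + grammar_scale * h[2])[0]

hyps = [("a b a", -10.0, -2.0),   # better acoustics, unlikely bigram
        ("a b b", -11.0, -0.5)]   # worse acoustics, likely bigram

print(best_path(hyps, 0.5))   # acoustics dominate -> "a b a"
print(best_path(hyps, 4.0))   # LM dominates      -> "a b b"
```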
0:10:12 Optionally, we tried to run sequence training on the Switchboard training data, but bear in mind this is English training data, so we were not sure whether discriminative training would give an over-trained network; we will see that in the results.
0:10:28 For the language classifier, we designed SVM classifiers which are trained on TF-IDF statistics of the phone n-grams; we tried bigrams and trigrams. The reason we back off to bigrams is that we trained on position-dependent phones and ended up with roughly five million dimensions of trigram statistics, and we were worried that there may be sparsity issues.
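A hypothetical sketch of the TF-IDF phone-bigram statistics follows; the exact weighting variant used in the system is not specified in the talk, so the tf (relative frequency) and idf (log inverse document frequency) choices here are assumptions:

```python
import math
from collections import Counter

def bigram_tfidf(docs):
    """docs: list of phone sequences; returns one {bigram: tf-idf}
    dict per document."""
    counts = [Counter(zip(d, d[1:])) for d in docs]
    df = Counter()                        # document frequency per bigram
    for c in counts:
        df.update(c.keys())
    vecs = []
    for c in counts:
        total = sum(c.values())
        vecs.append({g: (n / total) * math.log(len(docs) / df[g])
                     for g, n in c.items()})
    return vecs

docs = [["a", "a", "b"], ["a", "b", "b"], ["b", "b", "b"]]
vecs = bigram_tfidf(docs)
print(sorted(vecs[0]))   # [('a', 'a'), ('a', 'b')]
```

In the real system each dictionary would be expanded into a fixed five-million-dimensional (trigram) or smaller (bigram) sparse vector before SVM training, which is where the sparsity concern comes from.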
0:10:55 This is the performance on the internal test sets with the different setups. As expected, the trigram model gives better performance in terms of a lower minimum average cost. This is valid for the thirty-second data, but you may see in a while that it may break when it comes to very short duration segments. The purple bars are the results with the discriminatively trained DNN phone tokenizer; again, they show that the DNN here is over-trained, and it gives a higher word error rate, sorry, a higher minimum average cost, I mean.
0:11:36 The third system is the frame-based DNN system for language recognition. We take the sixty-four-dimensional bottleneck features from the Switchboard tokenizer, and the features are spliced with four frames on the left and four frames on the right. The DNN is a four-layer DNN with seven hundred neurons per layer.
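The ±4-frame context splicing can be sketched as follows (the edge-padding-by-repetition strategy is our assumption):

```python
import numpy as np

def splice(feats, left=4, right=4):
    """Stack each frame with its context: (T, D) -> (T, D*(left+1+right))."""
    padded = np.pad(feats, ((left, right), (0, 0)), mode="edge")
    T = feats.shape[0]
    return np.hstack([padded[i:i + T] for i in range(left + 1 + right)])

bn = np.random.randn(100, 64)        # 64-dim bottleneck features
print(splice(bn).shape)              # (100, 576): 64 dims x 9 frames
```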
0:11:58 We have prior normalization: we multiply the output probability by the inverse of the language prior, and the decision of the language recognition system comes by averaging the frame-based language posterior probabilities.
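The prior normalization and frame averaging just described amount to something like this sketch (all numbers are made up; whether the renormalisation per frame is applied exactly this way in the system is an assumption):

```python
import numpy as np

def segment_decision(frame_post, priors):
    """frame_post: (T, L) per-frame language posteriors, priors: (L,)
    training-set language priors.  Divide by the prior, renormalise per
    frame, average over frames, and return the winning language index."""
    scaled = frame_post / priors          # undo the training prior
    scaled /= scaled.sum(axis=1, keepdims=True)
    return int(scaled.mean(axis=0).argmax())

post = np.array([[0.7, 0.2, 0.1],
                 [0.6, 0.3, 0.1]])
priors = np.array([0.5, 0.3, 0.2])
print(segment_decision(post, priors))   # 0
```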
0:12:17 This is a summary of the frame-based language recognition system on different test sets. There are two trends we observed. First, quite obviously, when the duration is shorter, the average cost is higher. Second, the error here is generally higher than for the phonotactic system and the i-vector system, but the system becomes comparatively more robust when it comes to very short durations.
0:12:51 After the evaluation we built an enhanced system, which we call the bottleneck i-vector system; it is a variant of the basic system. We take the bottleneck features from the Switchboard tokenizer, replace the MFCCs in the i-vector system with the bottleneck features, and build another system for language recognition.
0:13:11 A bit of the details: we take the sixty-four-dimensional bottleneck features; there is no VTLN, no normalization and no shifted delta cepstra, but there is frame-based VAD.
0:13:25 This is a side-by-side comparison between the i-vector system and the bottleneck system, where the MFCC features are replaced by the bottleneck features. We can see a relative improvement of roughly fifteen to twenty-five percent from using the bottleneck features.
0:13:45 For system calibration and fusion, we train target-language-dependent Gaussian backends; each Gaussian mixture has sixteen components, and these are trained on the thirty-second training data. Then, for system fusion, we run logistic regression, which comprises the log-likelihood ratio conversion and the system combination.
0:14:15 We applied that separately to the three systems: the i-vector system, the DNN system and the phonotactic system. We found that the Gaussian backend did not work for the i-vector system, so we did not use it there in the final evaluation; for the DNN and phonotactic systems, the technique gives a significant improvement.
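A minimal sketch of the linear logistic-regression fusion idea follows. The real system presumably used an established multiclass calibration trainer; everything below, including the synthetic two-system scores and the plain gradient-descent training, is illustrative only:

```python
import numpy as np

def train_fusion(scores, labels, lr=0.1, iters=500):
    """scores: (N, S) per-system scores, labels: (N,) 0/1 targets.
    Learns weights and a bias mapping system scores to a fused score
    by gradient descent on the logistic loss."""
    w = np.zeros(scores.shape[1])
    b = 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(scores @ w + b)))
        grad = p - labels                 # dLoss/dz per example
        w -= lr * scores.T @ grad / len(labels)
        b -= lr * grad.mean()
    return w, b

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 200)
# two noisy "systems" whose scores correlate with the true label;
# system 0 is less noisy than system 1
scores = np.stack([labels + rng.normal(0, 1.0, 200),
                   labels + rng.normal(0, 2.0, 200)], axis=1)
w, b = train_fusion(scores, labels)
print(w[0] > w[1])    # the cleaner system gets the larger weight
```

The same learned affine map performs both the score combination and the calibration to log-likelihood ratios, which is why the talk treats them as one step.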
0:14:38 This is the fusion result on our internal test set. For thirty-second data, the i-vector system gives the best results among the three submitted systems; the DNN and phonotactic systems have roughly the same performance. System fusion gives a noticeable performance improvement on our internal test set. The bottleneck system alone did not give better results, but when we incorporate all four systems, we get the best results we have.
0:15:12 When it comes down to three seconds, as I said, the phonotactic system behaves much worse here; that may be because of the sparsity issues with the particular setup of our n-gram statistics.
0:15:31 Comparing the i-vector system and the bottleneck system, we see a significant improvement for the bottleneck system, and a further improvement with fusion.
0:15:41 Here we show the results on the formal evaluation. The i-vector system, the phonotactic system and the DNN system perform roughly as expected, and the bottleneck system again has more than ten percent relative improvement on top of the i-vector system; the system fusion gives a marginal improvement on top of the best system here.
0:16:10 Finally, I am going to show a bit about the pairwise system contribution, to see the individual contribution of each component system in our language recognition setup. You now see clusters of bars here; for each cluster, the leftmost bar is a single system, for example the i-vector system. We then fuse this system with one of the other systems, and the order is that we first fuse with the worst system, then with the second worst, and so on.
0:16:46 The interesting thing here is that, in general, apart from fusion with the DNN system, which is the worst single system, pairwise fusion works in every case. Maybe you can argue that we may be in a different operating region than the error region, and that may be why it ceases to work there.
0:17:07 Another interesting thing is that the performance of the fused system is basically in proportion to the performance of the single systems, which means that when we fuse the better systems, we get the better results.
0:17:22 As a summary, we introduced the three language recognition component systems submitted to the NIST LRE 2015, with a description of the segmentation, the data selection and the classifier training. We then have an enhanced bottleneck i-vector system, which demonstrates a performance improvement. For future work, we want to work a bit more on data selection and augmentation, as other teams did. We are also interested in multilingual neural networks, adaptation, and maybe some unsupervised training as well, to improve the bottleneck features, and also in some variability compensation to deal with the huge mismatch between the training and development data and the evaluation data set. Any suggestions or maybe collaborations are all welcome. Thank you very much for your attention.
0:18:20 We have time for questions.
0:18:34 [Question] Thanks. When you were talking about the language clusters, are the clusters defined according to linguists? In a small experiment of ours, we compared linguistically defined clusters with clusters derived from the data. Have you tried clustering the languages based on features, and comparing that with the clusters made by linguists?
0:19:24 Yes, I think that is a scientific and interesting question. We follow the language clusters basically by a narrow definition, simply following what the NIST language recognition evaluation told us to do. You are absolutely right that there are some cases where, in training, it just becomes a distinction between dialects or other unwanted factors which are not directly related to the language clusters at all. So yes, definitely this is something we want to look at, particularly for some dialects we are interested in, for example for Chinese data; we are interested and we want to do more.
0:20:06 Any other questions?
0:20:11 [Question] One quick question. In an LRE, most teams typically use sixty percent of the data for training, maybe going to seventy percent to use a little bit more; you went to eighty percent. So my question is: once you had done your development, when you actually submitted the final results, did you retrain with all the data, or did you just stick with the original eighty-percent system?
0:20:38 We trained the original system with eighty percent, and we now doubt whether that was the right choice. We may also have lost a little bit because, even in the very early stage, we divided the data into three-second, ten-second and thirty-second portions, which again reduced the training set size, and perhaps that was not the right decision either. If you have more suggestions on the data, I think we could work a bit more on the data segmentation and selection side.
0:21:16 Any other questions? If not, let's thank the speaker again.