| 0:00:06 | morning |
|---|
| 0:00:07 | uh |
|---|
| 0:00:08 | what i would like to present |
|---|
| 0:00:09 | here today is uh |
|---|
| 0:00:11 | our |
|---|
| 0:00:14 | language recognition |
|---|
| 0:00:15 | two thousand nine |
|---|
| 0:00:16 | submission |
|---|
| 0:00:18 | and |
|---|
| 0:00:19 | we did |
|---|
| 0:00:20 | a lot of work |
|---|
| 0:00:21 | after the evaluations to figure out |
|---|
| 0:00:23 | what happened |
|---|
| 0:00:25 | during the evaluation because actually |
|---|
| 0:00:27 | we saw a big difference |
|---|
| 0:00:29 | in the performance |
|---|
| 0:00:30 | on our |
|---|
| 0:00:31 | development data and |
|---|
| 0:00:34 | on the actual evaluation data |
|---|
| 0:00:36 | so |
|---|
| 0:00:37 | uh |
|---|
| 0:00:39 | first |
|---|
| 0:00:41 | i will try to explain |
|---|
| 0:00:42 | what was new in the |
|---|
| 0:00:44 | uh language recognition |
|---|
| 0:00:46 | what happened |
|---|
| 0:00:47 | in the year two thousand nine |
|---|
| 0:00:49 | that means some new data |
|---|
| 0:00:51 | then i will go through |
|---|
| 0:00:53 | a very quick and brief description of our |
|---|
| 0:00:55 | whole system |
|---|
| 0:00:56 | and then i will |
|---|
| 0:00:58 | try to concentrate on the |
|---|
| 0:01:00 | uh issues of the calibration and data selection |
|---|
| 0:01:03 | and uh how we resolved |
|---|
| 0:01:05 | problems with our original development set |
|---|
| 0:01:08 | then i will |
|---|
| 0:01:09 | try to |
|---|
| 0:01:09 | conclude |
|---|
| 0:01:10 | our work |
|---|
| 0:01:12 | so |
|---|
| 0:01:13 | in |
|---|
| 0:01:14 | two thousand nine |
|---|
| 0:01:15 | what was new |
|---|
| 0:01:16 | that |
|---|
| 0:01:17 | the new uh source of the data came into the language recognition |
|---|
| 0:01:21 | actually these data are |
|---|
| 0:01:22 | broadcast |
|---|
| 0:01:23 | the voice of america |
|---|
| 0:01:25 | and |
|---|
| 0:01:26 | we found a big archive |
|---|
| 0:01:28 | of about |
|---|
| 0:01:29 | forty three languages and |
|---|
| 0:01:31 | uh the data |
|---|
| 0:01:33 | out of this archive |
|---|
| 0:01:34 | was used |
|---|
| 0:01:35 | and actually only the detected telephone calls |
|---|
| 0:01:38 | and um |
|---|
| 0:01:40 | which |
|---|
| 0:01:41 | this data brought |
|---|
| 0:01:42 | a big variability |
|---|
| 0:01:44 | compared to the original cts data we always used for the training of our language id systems |
|---|
| 0:01:50 | so it |
|---|
| 0:01:51 | okay |
|---|
| 0:01:51 | it brought some |
|---|
| 0:01:53 | new problems with calibration and channel compensation |
|---|
| 0:01:57 | so |
|---|
| 0:01:58 | uh |
|---|
| 0:02:01 | these are the languages |
|---|
| 0:02:03 | uh which |
|---|
| 0:02:04 | are present |
|---|
| 0:02:05 | i would have to check if they are still present in the |
|---|
| 0:02:08 | voice of |
|---|
| 0:02:08 | america archive |
|---|
| 0:02:10 | as you can see |
|---|
| 0:02:13 | the |
|---|
| 0:02:13 | amount of languages |
|---|
| 0:02:15 | here is very huge and |
|---|
| 0:02:17 | it brought a |
|---|
| 0:02:18 | very very nice |
|---|
| 0:02:19 | dataset |
|---|
| 0:02:20 | to test our systems on and |
|---|
| 0:02:22 | the ability to improve the language recognition |
|---|
| 0:02:25 | systems |
|---|
| 0:02:26 | to |
|---|
| 0:02:27 | uh actually classify |
|---|
| 0:02:29 | more languages so |
|---|
| 0:02:31 | for the |
|---|
| 0:02:32 | two thousand nine |
|---|
| 0:02:33 | nist lre |
|---|
| 0:02:35 | these are the |
|---|
| 0:02:36 | twenty three |
|---|
| 0:02:37 | uh |
|---|
| 0:02:38 | target languages |
|---|
| 0:02:39 | and the bold ones |
|---|
| 0:02:41 | are the languages |
|---|
| 0:02:42 | uh where we had |
|---|
| 0:02:47 | only uh data coming from |
|---|
| 0:02:50 | this voice |
|---|
| 0:02:51 | of america archive |
|---|
| 0:02:53 | so there was no cts data for training on these languages |
|---|
| 0:02:56 | for the other languages we also had |
|---|
| 0:03:00 | normal |
|---|
| 0:03:01 | conversational telephone |
|---|
| 0:03:02 | speech data recorded by |
|---|
| 0:03:04 | ldc |
|---|
| 0:03:05 | in previous times |
|---|
| 0:03:07 | and also for the |
|---|
| 0:03:08 | two thousand nine |
|---|
| 0:03:10 | uh evaluation |
|---|
| 0:03:11 | so we had to deal with this issue |
|---|
| 0:03:13 | and |
|---|
| 0:03:15 | uh |
|---|
| 0:03:17 | and |
|---|
| 0:03:18 | uh do the |
|---|
| 0:03:19 | proper calibration |
|---|
| 0:03:21 | and channel compensation |
|---|
| 0:03:22 | so |
|---|
| 0:03:23 | what |
|---|
| 0:03:24 | motivated us |
|---|
| 0:03:25 | after the evaluation to do this work and work |
|---|
| 0:03:28 | again |
|---|
| 0:03:28 | on our development set and to do a lot of experiments was |
|---|
| 0:03:32 | that we saw a huge difference |
|---|
| 0:03:34 | between the performance |
|---|
| 0:03:36 | on our |
|---|
| 0:03:36 | original |
|---|
| 0:03:37 | development set |
|---|
| 0:03:38 | and uh |
|---|
| 0:03:40 | the eval set |
|---|
| 0:03:43 | which was uh |
|---|
| 0:03:44 | created by |
|---|
| 0:03:45 | nist |
|---|
| 0:03:47 | so all of the |
|---|
| 0:03:48 | numbers you will see here |
|---|
| 0:03:50 | will be the |
|---|
| 0:03:52 | average detection cost |
|---|
| 0:03:53 | defined by nist |
|---|
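A sketch of this metric as standardly defined for NIST LRE (our rendering, not quoted from the talk; in LRE 2009 the parameters were $C_{miss} = C_{FA} = 1$ and $P_{target} = 0.5$, with $N$ target languages):

$$
C_{avg} = \frac{1}{N} \sum_{L_T} \left[ C_{miss}\, P_{target}\, P_{miss}(L_T) \;+\; \frac{1 - P_{target}}{N - 1} \sum_{L_N \neq L_T} C_{FA}\, P_{FA}(L_T, L_N) \right]
$$

where $P_{miss}(L_T)$ is the miss rate for target language $L_T$ and $P_{FA}(L_T, L_N)$ is the rate of falsely detecting $L_T$ on segments of non-target language $L_N$.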
| 0:03:55 | and |
|---|
| 0:03:57 | uh |
|---|
| 0:04:00 | yeah uh |
|---|
| 0:04:02 | at the |
|---|
| 0:04:03 | language recognition workshop |
|---|
| 0:04:05 | there were a lot of discussions about the |
|---|
| 0:04:07 | crafting of |
|---|
| 0:04:08 | uh a |
|---|
| 0:04:09 | development set |
|---|
| 0:04:10 | for the systems |
|---|
| 0:04:11 | so |
|---|
| 0:04:12 | uh |
|---|
| 0:04:12 | some |
|---|
| 0:04:13 | some people created a rather small and |
|---|
| 0:04:16 | very clean dev set |
|---|
| 0:04:17 | we we had a |
|---|
| 0:04:19 | actually a very very huge |
|---|
| 0:04:20 | development set containing a lot of data |
|---|
| 0:04:23 | which brought some computational issues |
|---|
| 0:04:25 | to train the systems but |
|---|
| 0:04:27 | uh we decided to go |
|---|
| 0:04:29 | with this development |
|---|
| 0:04:30 | set |
|---|
| 0:04:31 | the big one |
|---|
| 0:04:33 | and |
|---|
| 0:04:33 | in the end it didn't show |
|---|
| 0:04:35 | to be maybe the best |
|---|
| 0:04:37 | decision but |
|---|
| 0:04:39 | we |
|---|
| 0:04:40 | had to |
|---|
| 0:04:41 | deal with that |
|---|
| 0:04:42 | so |
|---|
| 0:04:43 | and |
|---|
| 0:04:44 | the |
|---|
| 0:04:44 | presentation of our |
|---|
| 0:04:46 | system |
|---|
| 0:04:47 | is what we had in the |
|---|
| 0:04:49 | in the |
|---|
| 0:04:49 | submission so |
|---|
| 0:04:51 | we had two types of uh |
|---|
| 0:04:53 | uh front ends |
|---|
| 0:04:54 | the first |
|---|
| 0:04:55 | are acoustic frontends which are based |
|---|
| 0:04:58 | on the gmm modelling and the features are |
|---|
| 0:05:00 | mfcc derived actually |
|---|
| 0:05:02 | these are the |
|---|
| 0:05:03 | uh popular shifted delta cepstral features |
|---|
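A minimal sketch of how shifted delta cepstra are typically computed, assuming the common 7-1-3-7 configuration (N cepstra, delta spread d, shift P, k blocks); the talk does not state the exact settings used here:

```python
import numpy as np

def sdc(cepstra, N=7, d=1, P=3, k=7):
    """Shifted delta cepstra: k delta blocks, each computed with spread d
    and taken at shifts of P frames (assumed 7-1-3-7 configuration)."""
    T = len(cepstra)
    C = np.asarray(cepstra)[:, :N]                 # keep first N cepstra
    pad = np.pad(C, ((d, d + (k - 1) * P), (0, 0)), mode="edge")
    blocks = []
    for i in range(k):                             # delta taken at frame t + i*P
        s = i * P
        blocks.append(pad[2 * d + s: 2 * d + s + T] - pad[s: s + T])
    return np.hstack(blocks)                       # (T, N*k) feature matrix
```

Under this configuration each frame yields a 49-dimensional SDC vector, which is what the GMM subsystems would consume, usually stacked with the static cepstra.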
| 0:05:06 | and |
|---|
| 0:05:07 | for the first system we had |
|---|
| 0:05:08 | there |
|---|
| 0:05:08 | the jfa |
|---|
| 0:05:11 | we tried a new feature extraction |
|---|
| 0:05:13 | based on the rdlt |
|---|
| 0:05:15 | and then we had the rdlt trained with the |
|---|
| 0:05:20 | maximum |
|---|
| 0:05:20 | mutual information criterion |
|---|
| 0:05:22 | and using the channel compensated features |
|---|
| 0:05:26 | also |
|---|
| 0:05:27 | we tried the normal |
|---|
| 0:05:28 | gmm with rdlt features without any channel compensation |
|---|
| 0:05:32 | we performed the |
|---|
| 0:05:33 | vocal tract length normalisation |
|---|
| 0:05:35 | cepstral mean and |
|---|
| 0:05:37 | and variance normalisation |
|---|
| 0:05:39 | and we are doing the voice activity detection using our |
|---|
| 0:05:42 | hungarian phoneme recogniser |
|---|
| 0:05:44 | where we |
|---|
| 0:05:45 | map all of the |
|---|
| 0:05:46 | phonemes to |
|---|
| 0:05:48 | speech and nonspeech |
|---|
| 0:05:49 | to decide |
|---|
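A toy sketch of this phoneme-to-speech mapping; the non-speech label set below is an assumption for illustration, not the actual Hungarian recogniser inventory:

```python
# assumed non-speech units; the real recogniser has its own label set
NONSPEECH = {"pau", "int", "spk"}   # pause, intermittent noise, speaker noise

def vad_from_phonemes(segments):
    """Relabel decoded (label, start, end) phoneme segments as speech/non-speech."""
    return [(start, end, "nonspeech" if label in NONSPEECH else "speech")
            for label, start, end in segments]
```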
| 0:05:55 | yeah thanks |
|---|
| 0:05:56 | then |
|---|
| 0:05:56 | it's a standard jfa based system |
|---|
| 0:05:59 | uh |
|---|
| 0:06:00 | as you can see |
|---|
| 0:06:01 | uh sorry |
|---|
| 0:06:02 | but this time of course without |
|---|
| 0:06:04 | the eigenvoices there is only a |
|---|
| 0:06:07 | uh channel |
|---|
| 0:06:08 | variability present |
|---|
| 0:06:10 | so |
|---|
| 0:06:10 | we had |
|---|
| 0:06:11 | some super vector |
|---|
| 0:06:12 | of gmm means for every speech segment |
|---|
| 0:06:15 | and which is |
|---|
| 0:06:16 | then uh |
|---|
| 0:06:17 | channel dependent |
|---|
| 0:06:19 | this |
|---|
| 0:06:19 | uh channel loading matrix was trained using the |
|---|
| 0:06:22 | em algorithm and |
|---|
| 0:06:24 | the five hundred |
|---|
| 0:06:26 | sessions for every language were used to train |
|---|
| 0:06:29 | uh the |
|---|
| 0:06:31 | the channel loading matrix |
|---|
| 0:06:32 | and uh |
|---|
| 0:06:33 | the language dependent uh super vectors |
|---|
| 0:06:36 | were |
|---|
| 0:06:38 | map adapted using the |
|---|
| 0:06:40 | relevance map and also trained |
|---|
| 0:06:43 | using the five |
|---|
| 0:06:44 | hundred segments |
|---|
| 0:06:45 | per |
|---|
| 0:06:46 | language |
|---|
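In equation form, the model described here is (our notation, a sketch rather than the exact recipe): the GMM mean supervector of a segment of language $\ell$ is

$$
\mathbf{s} = \mathbf{m}_{\ell} + \mathbf{U}\,\mathbf{x},
$$

where $\mathbf{m}_{\ell}$ is the language-dependent supervector obtained by relevance-MAP adaptation, $\mathbf{U}$ is the channel loading matrix trained by EM on the five hundred sessions per language, and $\mathbf{x}$ is a segment-specific channel factor; with no eigenvoice term, $\mathbf{U}\mathbf{x}$ absorbs the within-language session variability.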
| 0:06:49 | actually this |
|---|
| 0:06:50 | is the core acoustic system here |
|---|
| 0:06:53 | because |
|---|
| 0:06:54 | uh |
|---|
| 0:06:55 | it uses also the rdlt features and |
|---|
| 0:06:57 | as you will see |
|---|
| 0:06:58 | later on we decided to drop the rdlt features and use |
|---|
| 0:07:02 | just the jfa |
|---|
| 0:07:03 | system |
|---|
| 0:07:04 | fed with the shifted delta cepstra |
|---|
| 0:07:11 | yeah we tried |
|---|
| 0:07:12 | a new discriminative technique to derive our features |
|---|
| 0:07:15 | uh this is a technique |
|---|
| 0:07:17 | based on the |
|---|
| 0:07:19 | region dependent linear transforms this is a technique |
|---|
| 0:07:22 | uh |
|---|
| 0:07:22 | which was introduced in the speech recognition where it is known as |
|---|
| 0:07:26 | fmpe |
|---|
| 0:07:27 | the idea is that |
|---|
| 0:07:28 | we have some |
|---|
| 0:07:29 | you know transformations |
|---|
| 0:07:31 | which will take our features |
|---|
| 0:07:33 | and |
|---|
| 0:07:34 | then |
|---|
| 0:07:35 | we take the linear combinations of the transformations to |
|---|
| 0:07:42 | form a new |
|---|
| 0:07:43 | uh feature which would be which should |
|---|
| 0:07:46 | uh be discriminatively |
|---|
| 0:07:47 | trained so |
|---|
| 0:07:49 | here is a |
|---|
| 0:07:50 | picture and i will try to |
|---|
| 0:07:52 | at least |
|---|
| 0:07:53 | very briefly |
|---|
| 0:07:55 | uh describe what is going on so |
|---|
| 0:07:58 | at the start |
|---|
| 0:07:59 | we are |
|---|
| 0:07:59 | having |
|---|
| 0:08:00 | some linear transformations |
|---|
| 0:08:02 | in the beginning they are initialised |
|---|
| 0:08:04 | to |
|---|
| 0:08:05 | create just the shifted delta cepstral features |
|---|
| 0:08:08 | we have some |
|---|
| 0:08:09 | gmm which is trained on all |
|---|
| 0:08:11 | over all languages |
|---|
| 0:08:13 | and which is |
|---|
| 0:08:14 | uh |
|---|
| 0:08:16 | supposed |
|---|
| 0:08:17 | to |
|---|
| 0:08:18 | select |
|---|
| 0:08:20 | the transformations in every step |
|---|
| 0:08:23 | it actually provides the weights |
|---|
| 0:08:25 | we are uh then we are combining |
|---|
| 0:08:28 | these |
|---|
| 0:08:28 | transformation |
|---|
| 0:08:29 | so for every twenty one frames |
|---|
| 0:08:32 | we |
|---|
| 0:08:33 | we take the twenty one frames of |
|---|
| 0:08:36 | mfcc put it into the gmm |
|---|
| 0:08:38 | then we take the top scoring |
|---|
| 0:08:40 | gaussian components |
|---|
| 0:08:42 | which provide us the weights |
|---|
| 0:08:44 | and |
|---|
| 0:08:44 | we will combine |
|---|
| 0:08:46 | the linear transformations according to these weights |
|---|
| 0:08:49 | usually |
|---|
| 0:08:51 | it happened that |
|---|
| 0:08:52 | only one |
|---|
| 0:08:53 | or three |
|---|
| 0:08:55 | a gaussian |
|---|
| 0:08:55 | components |
|---|
| 0:08:56 | for these twenty one frames |
|---|
| 0:08:58 | were nonzero so |
|---|
| 0:09:00 | uh not all of these transformations were linearly combined all the other |
|---|
| 0:09:05 | weights are set to zero |
|---|
| 0:09:07 | so |
|---|
| 0:09:08 | then we are taking the linearly |
|---|
| 0:09:10 | combined transformations |
|---|
| 0:09:13 | and |
|---|
| 0:09:13 | summing up |
|---|
| 0:09:15 | and then |
|---|
| 0:09:16 | there is a gmm |
|---|
| 0:09:18 | which will |
|---|
| 0:09:18 | estimate these features and according to the training criterion |
|---|
| 0:09:23 | we will update |
|---|
| 0:09:24 | these |
|---|
| 0:09:25 | linear transform |
|---|
| 0:09:26 | and then we go |
|---|
| 0:09:27 | another |
|---|
| 0:09:28 | twenty one frames further to |
|---|
| 0:09:30 | train the system so |
|---|
| 0:09:31 | here |
|---|
| 0:09:32 | in the end what we have |
|---|
| 0:09:34 | after the training |
|---|
| 0:09:35 | this |
|---|
| 0:09:36 | will be the features |
|---|
| 0:09:37 | we will feed to our |
|---|
| 0:09:39 | jfa |
|---|
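A minimal sketch of the forward pass just described, under our reading of it (per-Gaussian linear transforms mixed by posterior weights over a 21-frame window); the dimensions, the top-3 pruning, and the use of a scikit-learn GMM are assumptions, and the discriminative update of the transforms is omitted:

```python
import numpy as np

class RDLT:
    """Region dependent linear transforms (fMPE-style), forward pass only.
    One linear transform per Gaussian 'region' of a GMM trained over all
    languages; outputs are combined with the GMM posterior weights."""
    def __init__(self, gmm, in_dim, out_dim, seed=0):
        self.gmm = gmm                          # e.g. a fitted sklearn GaussianMixture
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 0.01, (gmm.n_components, out_dim, in_dim))

    def forward(self, window):
        """window: 21 stacked MFCC frames flattened to shape (in_dim,)."""
        post = self.gmm.predict_proba(window[None, :])[0]   # region weights
        top = np.argsort(post)[-3:]             # in practice only ~1-3 survive
        out = np.zeros(self.W.shape[1])
        for g in top:                           # posterior-weighted combination
            out += post[g] * (self.W[g] @ window)
        return out
```

In training, the transforms would be initialised to reproduce the shifted delta cepstra and then updated under the discriminative criterion, as the talk describes.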
| 0:09:43 | the next |
|---|
| 0:09:43 | acoustic system was a gmm |
|---|
| 0:09:46 | with two thousand |
|---|
| 0:09:47 | and forty eight |
|---|
| 0:09:49 | gaussians |
|---|
| 0:09:50 | which was |
|---|
| 0:09:51 | uh discriminatively trained using the mmi |
|---|
| 0:09:54 | uh criterion |
|---|
| 0:09:55 | and |
|---|
| 0:09:58 | uh we used the features which were |
|---|
| 0:10:01 | channel compensated |
|---|
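The MMI criterion in its standard form, as a hedged sketch of what "discriminatively trained" means here (our rendering, not necessarily the exact objective used): maximise the log posterior of the correct language over the training segments,

$$
\mathcal{F}_{\mathrm{MMI}}(\lambda) = \sum_{r} \log \frac{p(X_r \mid \lambda_{\ell_r})\, P(\ell_r)}{\sum_{\ell} p(X_r \mid \lambda_{\ell})\, P(\ell)},
$$

where $X_r$ is the $r$-th training segment, $\ell_r$ its true language, and $\lambda_{\ell}$ the GMM of language $\ell$.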
| 0:10:04 | so that was |
|---|
| 0:10:05 | four acoustic subsystems |
|---|
| 0:10:07 | then |
|---|
| 0:10:08 | now to the phonotactic systems |
|---|
| 0:10:10 | uh |
|---|
| 0:10:13 | the core of our |
|---|
| 0:10:14 | phonotactic systems were of course our |
|---|
| 0:10:17 | phoneme recognisers |
|---|
| 0:10:19 | the first one the english one is a gmm based |
|---|
| 0:10:22 | uh phoneme recogniser which is based |
|---|
| 0:10:25 | on the triphone acoustic models from an |
|---|
| 0:10:28 | lvcsr |
|---|
| 0:10:29 | then with just a |
|---|
| 0:10:31 | uh phonotactic language model |
|---|
| 0:10:33 | the two other uh |
|---|
| 0:10:35 | phonotactic |
|---|
| 0:10:36 | front end phoneme recognisers the russian and hungarian |
|---|
| 0:10:39 | are neural network based |
|---|
| 0:10:40 | where the |
|---|
| 0:10:41 | neural network |
|---|
| 0:10:42 | uh |
|---|
| 0:10:43 | estimates the posterior probabilities |
|---|
| 0:10:46 | of the phonemes and then |
|---|
| 0:10:47 | it feeds them to the hmm for the decoding |
|---|
| 0:10:50 | so |
|---|
| 0:10:51 | these |
|---|
| 0:10:52 | three uh phoneme recognisers were used to build |
|---|
| 0:10:56 | three uh binary decision tree language models |
|---|
| 0:11:01 | and |
|---|
| 0:11:01 | one svm |
|---|
| 0:11:02 | well |
|---|
| 0:11:04 | which was |
|---|
| 0:11:04 | based on the hungarian phoneme recogniser |
|---|
| 0:11:07 | here the fourgrams |
|---|
| 0:11:08 | were used |
|---|
| 0:11:09 | and uh |
|---|
| 0:11:10 | the svm was actually using only the trigram |
|---|
| 0:11:13 | uh |
|---|
| 0:11:14 | lattice counts |
|---|
| 0:11:15 | as features |
|---|
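A hedged sketch of this SVM branch: expected trigram counts harvested from the phoneme lattices become one sparse vector per segment and feed a linear SVM. The vectoriser, the classifier, and the variable names (`segment_counts`, `languages`) are our illustration, not the actual toolchain:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

def trigram_counts(lattice_paths):
    """Expected trigram counts from weighted lattice paths:
    each path is (phoneme sequence, posterior weight)."""
    counts = {}
    for phones, post in lattice_paths:
        for tri in zip(phones, phones[1:], phones[2:]):
            key = " ".join(tri)
            counts[key] = counts.get(key, 0.0) + post
    return counts

# hypothetical inputs: one count dict per segment plus its language label
vec = DictVectorizer()
X = vec.fit_transform(segment_counts)      # sparse segment-by-trigram matrix
svm = LinearSVC(C=1.0).fit(X, languages)   # multi-class one-vs-rest by default
```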
| 0:11:21 | then |
|---|
| 0:11:21 | uh we were doing a fusion |
|---|
| 0:11:23 | um we used |
|---|
| 0:11:24 | niko brummer's |
|---|
| 0:11:25 | multiclass |
|---|
| 0:11:27 | uh logistic regression |
|---|
| 0:11:28 | focal |
|---|
| 0:11:29 | toolkit |
|---|
| 0:11:30 | so for our systems |
|---|
| 0:11:32 | uh |
|---|
| 0:11:34 | the thing is |
|---|
| 0:11:36 | that for the first time we we didn't |
|---|
| 0:11:38 | train three separate backends for |
|---|
| 0:11:41 | each condition |
|---|
| 0:11:42 | we tried |
|---|
| 0:11:42 | to do the |
|---|
| 0:11:45 | duration independent fusion so |
|---|
| 0:11:48 | every system |
|---|
| 0:11:50 | was outputting |
|---|
| 0:11:51 | some |
|---|
| 0:11:52 | raw scores |
|---|
| 0:11:53 | and |
|---|
| 0:11:53 | in addition to these it was outputting also some information about the length |
|---|
| 0:11:57 | of a segment |
|---|
| 0:11:58 | which for the |
|---|
| 0:11:59 | acoustic systems was the |
|---|
| 0:12:01 | number of frames and uh |
|---|
| 0:12:02 | the phonotactic systems provided the |
|---|
| 0:12:05 | number of phonemes |
|---|
| 0:12:07 | then these |
|---|
| 0:12:08 | raw scores for every system |
|---|
| 0:12:10 | were |
|---|
| 0:12:11 | going into |
|---|
| 0:12:12 | uh the gaussian backend |
|---|
| 0:12:14 | we had |
|---|
| 0:12:15 | three gaussian backends |
|---|
| 0:12:17 | per system because we used |
|---|
| 0:12:19 | uh |
|---|
| 0:12:21 | three |
|---|
| 0:12:21 | kinds of |
|---|
| 0:12:22 | uh length normalisation either we |
|---|
| 0:12:25 | divided the scores |
|---|
| 0:12:27 | by the |
|---|
| 0:12:28 | uh by the length |
|---|
| 0:12:30 | or |
|---|
| 0:12:31 | by its square root or |
|---|
| 0:12:33 | we didn't do anything |
|---|
| 0:12:34 | and then |
|---|
| 0:12:35 | we put |
|---|
| 0:12:36 | all of |
|---|
| 0:12:37 | the outputs of |
|---|
| 0:12:38 | uh these gaussian backends into the multiclass |
|---|
| 0:12:41 | uh |
|---|
| 0:12:42 | logistic regression |
|---|
| 0:12:43 | discriminatively trained |
|---|
| 0:12:45 | and the output was |
|---|
| 0:12:46 | the calibrated |
|---|
| 0:12:48 | language |
|---|
| 0:12:48 | log likelihood |
|---|
| 0:12:49 | scores |
|---|
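A compact sketch of the chain just described, shown for a single subsystem's score matrix: three Gaussian backends (one per length normalisation; a per-language Gaussian with shared covariance, which is what LDA computes) feeding a multiclass logistic regression that stands in for the FoCal toolkit. Everything beyond the three normalisation variants is an assumption:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

def fuse(raw, lengths, labels):
    """raw: (n_segments, n_scores) raw scores of one subsystem;
    lengths: frame (or phoneme) counts per segment; labels: language ids."""
    norms = (lambda s: s,                                  # taken as is
             lambda s: s / np.sqrt(lengths)[:, None],      # / sqrt(length)
             lambda s: s / lengths[:, None])               # / length
    feats = []
    for norm in norms:       # one Gaussian backend per normalisation
        gb = LinearDiscriminantAnalysis().fit(norm(raw), labels)
        feats.append(gb.decision_function(norm(raw)))
    X = np.hstack(feats + [np.log(lengths)[:, None]])      # plus duration info
    # discriminatively trained multiclass logistic regression (FoCal stand-in)
    return LogisticRegression(max_iter=1000).fit(X, labels)
```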
| 0:12:51 | so |
|---|
| 0:12:53 | here's |
|---|
| 0:12:53 | scheme of the |
|---|
| 0:12:54 | fusion |
|---|
| 0:12:55 | so again |
|---|
| 0:12:56 | it's the same |
|---|
| 0:12:57 | thing |
|---|
| 0:12:57 | uh the output |
|---|
| 0:12:59 | uh |
|---|
| 0:13:00 | of every system |
|---|
| 0:13:01 | and |
|---|
| 0:13:02 | it's a it's either |
|---|
| 0:13:04 | taken as it is |
|---|
| 0:13:05 | or |
|---|
| 0:13:05 | it's normalised by |
|---|
| 0:13:07 | the square root or |
|---|
| 0:13:08 | divided by the length |
|---|
| 0:13:09 | and |
|---|
| 0:13:09 | uh then the output of the gaussian backends |
|---|
| 0:13:12 | goes |
|---|
| 0:13:13 | also together with the information about the lengths to the |
|---|
| 0:13:16 | discriminatively trained |
|---|
| 0:13:17 | multiclass logistic |
|---|
| 0:13:18 | regression |
|---|
| 0:13:22 | so |
|---|
| 0:13:24 | the |
|---|
| 0:13:26 | the actual |
|---|
| 0:13:26 | core of this |
|---|
| 0:13:28 | paper |
|---|
| 0:13:29 | was to |
|---|
| 0:13:30 | was |
|---|
| 0:13:31 | to go |
|---|
| 0:13:32 | uh through our development set and decide |
|---|
| 0:13:35 | where |
|---|
| 0:13:36 | the problem was |
|---|
| 0:13:37 | here we would |
|---|
| 0:13:38 | like to thank |
|---|
| 0:13:39 | uh our friends |
|---|
| 0:13:40 | in torino |
|---|
| 0:13:43 | who provided us with their development set |
|---|
| 0:13:47 | so we were able to do |
|---|
| 0:13:48 | this analysis |
|---|
| 0:13:49 | actually |
|---|
| 0:13:50 | in torino they had |
|---|
| 0:13:52 | a much uh |
|---|
| 0:13:53 | smaller development set than we had |
|---|
| 0:13:56 | it contained about uh |
|---|
| 0:13:58 | if i remember correctly |
|---|
| 0:14:00 | ten thousand segments of |
|---|
| 0:14:01 | thirty three or |
|---|
| 0:14:03 | thirty four languages our development set was |
|---|
| 0:14:05 | very huge it contained |
|---|
| 0:14:07 | data from |
|---|
| 0:14:08 | fifty seven languages and about uh |
|---|
| 0:14:12 | sixty thousand |
|---|
| 0:14:13 | segments |
|---|
| 0:14:14 | so we did the experiment |
|---|
| 0:14:16 | we tried to recreate the |
|---|
| 0:14:18 | setup with the whole uh |
|---|
| 0:14:20 | training |
|---|
| 0:14:21 | set and |
|---|
| 0:14:22 | development set |
|---|
| 0:14:24 | and also we had |
|---|
| 0:14:25 | of course our training and development sets and then we |
|---|
| 0:14:27 | did the four |
|---|
| 0:14:28 | types of |
|---|
| 0:14:29 | experiments either we were |
|---|
| 0:14:30 | training |
|---|
| 0:14:31 | our systems |
|---|
| 0:14:32 | on the lpt set and |
|---|
| 0:14:34 | calibrating on the |
|---|
| 0:14:36 | uh |
|---|
| 0:14:37 | politecnico di |
|---|
| 0:14:39 | torino set or |
|---|
| 0:14:40 | we trained |
|---|
| 0:14:42 | on the |
|---|
| 0:14:43 | lpt set and then calibrated it on our set |
|---|
| 0:14:46 | or |
|---|
| 0:14:46 | we trained on |
|---|
| 0:14:48 | our set and calibrated |
|---|
| 0:14:49 | on the lpt set or |
|---|
| 0:14:51 | trained |
|---|
| 0:14:51 | on our set and |
|---|
| 0:14:53 | calibrated it on ours |
|---|
| 0:14:54 | so |
|---|
| 0:14:54 | these |
|---|
| 0:14:55 | white |
|---|
| 0:14:57 | columns |
|---|
| 0:14:58 | are our |
|---|
| 0:14:59 | original scores |
|---|
| 0:15:01 | this analysis of course |
|---|
| 0:15:02 | was done |
|---|
| 0:15:03 | using our |
|---|
| 0:15:06 | uh one |
|---|
| 0:15:06 | acoustic |
|---|
| 0:15:07 | subsystem the jfa system |
|---|
| 0:15:09 | because it would not be |
|---|
| 0:15:10 | very um feasible to run all of the systems |
|---|
| 0:15:13 | again |
|---|
| 0:15:14 | for the training so |
|---|
| 0:15:16 | as you can see |
|---|
| 0:15:17 | we had some |
|---|
| 0:15:18 | serious issues for some languages actually these were the languages |
|---|
| 0:15:21 | uh where only the voice of america |
|---|
| 0:15:25 | uh data were available |
|---|
| 0:15:26 | so |
|---|
| 0:15:27 | bosnian language |
|---|
| 0:15:28 | was an issue you can see a big |
|---|
| 0:15:30 | difference |
|---|
| 0:15:31 | between the |
|---|
| 0:15:32 | lpt set and our set the the blue column |
|---|
| 0:15:35 | is |
|---|
| 0:15:36 | just |
|---|
| 0:15:37 | training on our set |
|---|
| 0:15:38 | and using the |
|---|
| 0:15:39 | lpt |
|---|
| 0:15:41 | development set for calibration so |
|---|
| 0:15:43 | there must have been uh some |
|---|
| 0:15:45 | some bothersome issue |
|---|
| 0:15:48 | in our development set |
|---|
| 0:15:50 | so |
|---|
| 0:15:50 | the problems were the |
|---|
| 0:15:51 | bosnian and |
|---|
| 0:15:52 | farsi |
|---|
| 0:15:56 | and also |
|---|
| 0:15:57 | the final |
|---|
| 0:15:59 | final score |
|---|
| 0:16:00 | we were |
|---|
| 0:16:01 | everywhere |
|---|
| 0:16:03 | getting some |
|---|
| 0:16:04 | performance |
|---|
| 0:16:06 | loss |
|---|
| 0:16:07 | uh |
|---|
| 0:16:08 | so we tried to |
|---|
| 0:16:09 | focus on these languages and find |
|---|
| 0:16:12 | the issues we had in our development set |
|---|
| 0:16:15 | so the first |
|---|
| 0:16:16 | first issue we found was |
|---|
| 0:16:18 | ridiculous |
|---|
| 0:16:19 | we had |
|---|
| 0:16:19 | mislabelled one |
|---|
| 0:16:21 | language in our development set |
|---|
| 0:16:22 | actually there was a |
|---|
| 0:16:24 | label for |
|---|
| 0:16:25 | farsi and a label for |
|---|
| 0:16:26 | persian |
|---|
| 0:16:27 | and we treated them as |
|---|
| 0:16:28 | different languages so we |
|---|
| 0:16:31 | we corrected this error and |
|---|
| 0:16:33 | the problems for the for the language |
|---|
| 0:16:35 | mostly disappeared |
|---|
| 0:16:36 | the next problem |
|---|
| 0:16:38 | we |
|---|
| 0:16:38 | we address was |
|---|
| 0:16:40 | finding the repeating speakers between |
|---|
| 0:16:43 | training and development set because |
|---|
| 0:16:46 | based on the discussions |
|---|
| 0:16:47 | on the |
|---|
| 0:16:48 | language recognition workshop |
|---|
| 0:16:50 | we already |
|---|
| 0:16:52 | suspected this can be a problem |
|---|
| 0:16:54 | for our |
|---|
| 0:16:55 | uh |
|---|
| 0:16:56 | training and development sets |
|---|
| 0:16:57 | so |
|---|
| 0:16:58 | what we did |
|---|
| 0:16:59 | we trained |
|---|
| 0:17:00 | our speaker id |
|---|
| 0:17:01 | system from |
|---|
| 0:17:03 | previous |
|---|
| 0:17:04 | evaluations |
|---|
| 0:17:05 | which is a gmm based |
|---|
| 0:17:07 | speaker id |
|---|
| 0:17:08 | system |
|---|
| 0:17:09 | and |
|---|
| 0:17:11 | uh |
|---|
| 0:17:12 | trained the models for every training |
|---|
| 0:17:14 | segment |
|---|
| 0:17:15 | inside the language and tested |
|---|
| 0:17:17 | against the segments |
|---|
| 0:17:18 | in the |
|---|
| 0:17:19 | uh development set |
|---|
| 0:17:21 | what we ended up with |
|---|
| 0:17:22 | was this |
|---|
| 0:17:23 | uh |
|---|
| 0:17:24 | bimodal uh |
|---|
| 0:17:25 | distribution of |
|---|
| 0:17:27 | scores |
|---|
| 0:17:27 | so |
|---|
| 0:17:28 | uh |
|---|
| 0:17:29 | this part here |
|---|
| 0:17:36 | this part here |
|---|
| 0:17:38 | these are the |
|---|
| 0:17:38 | high |
|---|
| 0:17:39 | speaker id scores |
|---|
| 0:17:40 | and it shows |
|---|
| 0:17:41 | that there are some repeating speakers |
|---|
| 0:17:44 | between the training and the development set |
|---|
| 0:17:46 | so |
|---|
| 0:17:48 | when we looked at these pictures |
|---|
| 0:17:50 | we decided |
|---|
| 0:17:52 | to threshold the data and to discard |
|---|
| 0:17:54 | everything from our development set |
|---|
| 0:17:56 | which has a |
|---|
| 0:17:58 | higher |
|---|
| 0:17:58 | score than |
|---|
| 0:17:59 | for this ukrainian language |
|---|
| 0:18:02 | uh |
|---|
| 0:18:03 | the threshold |
|---|
| 0:18:04 | of the scores |
|---|
| 0:18:04 | twenty |
|---|
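The filtering loop, as a hedged sketch: one GMM per training segment, each development segment scored against the models, and anything above the threshold discarded as a repeating speaker. The model size and the exact score definition (here an average log-likelihood via scikit-learn) are assumptions:

```python
from sklearn.mixture import GaussianMixture

def speaker_filter(train_feats, dev_feats, threshold=20.0):
    """Return indices of dev segments to keep: those whose best
    speaker-ID score against any training segment stays below threshold."""
    models = [GaussianMixture(n_components=64).fit(f) for f in train_feats]
    keep = []
    for i, feats in enumerate(dev_feats):
        best = max(m.score(feats) for m in models)  # avg log-likelihood
        if best <= threshold:                       # high score = same speaker
            keep.append(i)
    return keep
```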
| 0:18:06 | when we did this |
|---|
| 0:18:07 | experiment |
|---|
| 0:18:09 | we discovered that |
|---|
| 0:18:11 | we are we are discarding |
|---|
| 0:18:13 | for some languages |
|---|
| 0:18:14 | yeah we are discarding almost everything from our development set |
|---|
| 0:18:17 | for example bosnian |
|---|
| 0:18:19 | we ended up |
|---|
| 0:18:20 | with the |
|---|
| 0:18:20 | just fourteen |
|---|
| 0:18:21 | fourteen segments in our development set |
|---|
| 0:18:24 | and |
|---|
| 0:18:25 | for the other languages |
|---|
| 0:18:26 | where |
|---|
| 0:18:27 | we were doing the |
|---|
| 0:18:29 | speaker identification |
|---|
| 0:18:30 | filtering |
|---|
| 0:18:31 | we also discarded a lot of the data for example ukrainian only twelve |
|---|
| 0:18:35 | segments |
|---|
| 0:18:36 | (inaudible) |
|---|
| 0:18:39 | so |
|---|
| 0:18:39 | what was the performance change when we did this experiment |
|---|
| 0:18:43 | relabelling |
|---|
| 0:18:45 | or |
|---|
| 0:18:45 | correcting the label |
|---|
| 0:18:47 | already it was easy |
|---|
| 0:18:49 | and it |
|---|
| 0:18:50 | showed |
|---|
| 0:18:52 | some |
|---|
| 0:18:53 | uh |
|---|
| 0:18:54 | improvement |
|---|
| 0:18:55 | and then the |
|---|
| 0:18:55 | speaker id filtering |
|---|
| 0:18:58 | this was |
|---|
| 0:18:58 | quite a huge |
|---|
| 0:18:59 | difference |
|---|
| 0:19:01 | in the performance so |
|---|
| 0:19:04 | these |
|---|
| 0:19:04 | again these are the results for our acoustic |
|---|
| 0:19:07 | subsystem the jfa |
|---|
| 0:19:09 | with two thousand |
|---|
| 0:19:10 | and forty eight gaussians |
|---|
| 0:19:11 | with |
|---|
| 0:19:11 | the rdlt features |
|---|
| 0:19:17 | so |
|---|
| 0:19:18 | when we did this |
|---|
| 0:19:21 | we decided to run |
|---|
| 0:19:22 | the whole fusion on our filtered |
|---|
| 0:19:24 | data |
|---|
| 0:19:25 | note that |
|---|
| 0:19:26 | we we didn't change |
|---|
| 0:19:28 | anything or or we didn't retrain |
|---|
| 0:19:30 | any subsystem we had in the |
|---|
| 0:19:32 | submission |
|---|
| 0:19:33 | for the |
|---|
| 0:19:34 | nist language recognition evaluation we just |
|---|
| 0:19:37 | filtered out |
|---|
| 0:19:38 | the scores |
|---|
| 0:19:39 | from our development set and |
|---|
| 0:19:41 | ran the fusion again |
|---|
| 0:19:43 | and we were |
|---|
| 0:19:44 | gaining |
|---|
| 0:19:45 | some |
|---|
| 0:19:45 | performance |
|---|
| 0:19:46 | improvements |
|---|
| 0:19:47 | quite |
|---|
| 0:19:48 | substantial |
|---|
| 0:19:49 | so for the |
|---|
| 0:19:50 | thirty second condition |
|---|
| 0:19:52 | the c average went from |
|---|
| 0:19:54 | two point three to one point ninety three which is |
|---|
| 0:19:57 | quite a nice |
|---|
| 0:19:58 | improvement and |
|---|
| 0:20:00 | if you look at the table for |
|---|
| 0:20:01 | every duration |
|---|
| 0:20:04 | the improve there is |
|---|
| 0:20:05 | an improvement |
|---|
| 0:20:06 | i think there is no number |
|---|
| 0:20:08 | which deteriorated |
|---|
| 0:20:09 | so |
|---|
| 0:20:10 | it worked |
|---|
| 0:20:11 | over all the conditions and |
|---|
| 0:20:15 | uh over |
|---|
| 0:20:16 | the |
|---|
| 0:20:17 | whole |
|---|
| 0:20:18 | uh |
|---|
| 0:20:18 | set and |
|---|
| 0:20:19 | for every language and |
|---|
| 0:20:21 | for |
|---|
| 0:20:22 | every duration |
|---|
| 0:20:25 | what we also |
|---|
| 0:20:26 | saw here |
|---|
| 0:20:27 | was a little |
|---|
| 0:20:29 | deterioration |
|---|
| 0:20:30 | of the results on our development set |
|---|
| 0:20:33 | yeah it it could be |
|---|
| 0:20:36 | addressed |
|---|
| 0:20:39 | the the cause could be that our systems |
|---|
| 0:20:42 | are overtrained actually to the |
|---|
| 0:20:44 | that is |
|---|
| 0:20:45 | speakers and they can more recognise |
|---|
| 0:20:47 | the speaker than |
|---|
| 0:20:48 | the language |
|---|
| 0:20:50 | for some languages |
|---|
| 0:20:55 | so then |
|---|
| 0:20:56 | we decided to work on our |
|---|
| 0:20:58 | uh |
|---|
| 0:20:59 | acoustic system the jfa |
|---|
| 0:21:02 | rdlt system |
|---|
| 0:21:03 | because |
|---|
| 0:21:04 | we wanted to do also |
|---|
| 0:21:06 | other possible experiments to improve the final |
|---|
| 0:21:09 | final fusion |
|---|
| 0:21:11 | so |
|---|
| 0:21:12 | what we did |
|---|
| 0:21:13 | we |
|---|
| 0:21:13 | just discarded the |
|---|
| 0:21:15 | rdlt features and used |
|---|
| 0:21:17 | the plain shifted delta cepstra |
|---|
| 0:21:19 | train |
|---|
| 0:21:20 | the system |
|---|
| 0:21:21 | and |
|---|
| 0:21:21 | it uh |
|---|
| 0:21:23 | there was some improvement |
|---|
| 0:21:25 | out of this |
|---|
| 0:21:27 | also what we did |
|---|
| 0:21:28 | was to train the jfa |
|---|
| 0:21:30 | using all |
|---|
| 0:21:31 | all the segments |
|---|
| 0:21:32 | per |
|---|
| 0:21:33 | language |
|---|
| 0:21:34 | instead of five hundred segments |
|---|
| 0:21:36 | per language and |
|---|
| 0:21:38 | this |
|---|
| 0:21:38 | uh brought |
|---|
| 0:21:39 | so some |
|---|
| 0:21:40 | nice improvement |
|---|
| 0:21:41 | so when we |
|---|
| 0:21:42 | did the |
|---|
| 0:21:44 | final fusion |
|---|
| 0:21:46 | we |
|---|
| 0:21:47 | discarded the |
|---|
| 0:21:48 | rdlt jfa system |
|---|
| 0:21:50 | and |
|---|
| 0:21:50 | replaced it with the normal |
|---|
| 0:21:52 | jfa |
|---|
| 0:21:53 | system |
|---|
| 0:21:54 | and the mmi |
|---|
| 0:21:55 | system |
|---|
| 0:21:56 | still remained in the fusion |
|---|
| 0:21:57 | and instead of |
|---|
| 0:21:58 | all other |
|---|
| 0:22:00 | uh binary trees and |
|---|
| 0:22:01 | that one svm we |
|---|
| 0:22:02 | we put there |
|---|
| 0:22:03 | actually a lot |
|---|
| 0:22:04 | of the svm |
|---|
| 0:22:06 | systems which are phonotactic |
|---|
| 0:22:07 | based and |
|---|
| 0:22:09 | uh |
|---|
| 0:22:10 | they are based on |
|---|
| 0:22:11 | our |
|---|
| 0:22:12 | uh |
|---|
| 0:22:12 | all of |
|---|
| 0:22:13 | all of our |
|---|
| 0:22:14 | uh phoneme recognisers and our colleague will have a talk |
|---|
| 0:22:18 | at |
|---|
| 0:22:19 | two pm |
|---|
| 0:22:20 | and he will explain |
|---|
| 0:22:22 | more |
|---|
| 0:22:23 | about this |
|---|
| 0:22:24 | system |
|---|
| 0:22:25 | when we did this the final fusion went from |
|---|
| 0:22:29 | one point nine |
|---|
| 0:22:30 | as we saw previously |
|---|
| 0:22:31 | two |
|---|
| 0:22:33 | uh one point |
|---|
| 0:22:34 | fifty seven |
|---|
| 0:22:34 | which is |
|---|
| 0:22:35 | a very competitive result |
|---|
| 0:22:37 | of course |
|---|
| 0:22:38 | it's a post evaluation result |
|---|
| 0:22:44 | so what are the conclusions |
|---|
| 0:22:46 | of this work |
|---|
| 0:22:48 | we have to really care about our development data and rather than |
|---|
| 0:22:52 | creating a huge |
|---|
| 0:22:53 | huge development set it is |
|---|
| 0:22:55 | better to |
|---|
| 0:22:56 | pay attention and |
|---|
| 0:22:58 | and |
|---|
| 0:22:59 | have a |
|---|
| 0:23:00 | smaller but filtered and |
|---|
| 0:23:02 | clean |
|---|
| 0:23:03 | uh development set we actually did experiments with |
|---|
| 0:23:05 | trying to give more data |
|---|
| 0:23:07 | to our systems |
|---|
| 0:23:09 | it didn't help us |
|---|
| 0:23:10 | the problem of the repeating speakers |
|---|
| 0:23:13 | between |
|---|
| 0:23:14 | between the training and the development set was |
|---|
| 0:23:17 | quite |
|---|
| 0:23:18 | large |
|---|
| 0:23:19 | and |
|---|
| 0:23:20 | we should pay attention when we are |
|---|
| 0:23:22 | doing the next evals |
|---|
| 0:23:23 | so that's |
|---|
| 0:23:24 | all |
|---|
| 0:23:26 | so thank you |
|---|
| 0:23:27 | and |
|---|
| 0:23:33 | (audience question, mostly inaudible, about repeating speakers in the torino development set) |
|---|
| 0:24:02 | uh we we looked at this |
|---|
| 0:24:04 | and |
|---|
| 0:24:04 | yeah we |
|---|
| 0:24:05 | we talked with them |
|---|
| 0:24:06 | at the workshop and they were |
|---|
| 0:24:08 | they were doing the speaker filtering stuff |
|---|
| 0:24:11 | but we didn't uh filter their set according to our uh training set |
|---|
| 0:24:15 | uh so |
|---|
| 0:24:16 | even if they filtered |
|---|
| 0:24:18 | uh the repeating speakers |
|---|
| 0:24:19 | what remained there |
|---|
| 0:24:21 | we just used as it was |
|---|
| 0:24:26 | oh |
|---|
| 0:24:27 | right |
|---|
| 0:24:30 | do you know if the |
|---|
| 0:24:31 | same speakers are also in the evaluation set |
|---|
| 0:24:33 | i see |
|---|
| 0:24:36 | well |
|---|
| 0:24:37 | we we don't know that and we didn't check it |
|---|
| 0:24:40 | we we just |
|---|
| 0:24:41 | uh |
|---|
| 0:24:41 | wanted to treat our evaluation set |
|---|
| 0:24:44 | as |
|---|
| 0:24:44 | an evaluation set and we didn't look at it yeah |
|---|
| 0:24:47 | yeah |
|---|
| 0:24:48 | (partly inaudible) |
|---|
| 0:24:50 | you could probably get a little uh |
|---|
| 0:24:53 | well i think that there are not |
|---|
| 0:24:55 | so many speakers repeating in the eval set because |
|---|
| 0:24:58 | as i understood nist |
|---|
| 0:25:00 | was using some uh |
|---|
| 0:25:02 | previously recorded data |
|---|
| 0:25:04 | and that |
|---|
| 0:25:05 | it is probably much less likely that |
|---|
| 0:25:08 | that there will be the repeating speakers again |
|---|
| 0:25:11 | of course it can happen |
|---|
| 0:25:13 | for |
|---|
| 0:25:14 | some of them but we we didn't actually check if those |
|---|
| 0:25:17 | are there |
|---|
| 0:25:19 | (partly inaudible question) |
|---|
| 0:25:22 | it seems |
|---|
| 0:25:26 | that the simpler systems actually |
|---|
| 0:25:28 | would be better than the more complex ones |
|---|
| 0:25:31 | yeah |
|---|
| 0:25:31 | yeah |
|---|
| 0:25:33 | yeah it is like that |
|---|
| 0:25:34 | uh |
|---|
| 0:25:35 | we were |
|---|
| 0:25:36 | making a lot of effort to |
|---|
| 0:25:38 | try this new rdlt technique and |
|---|
| 0:25:40 | uh |
|---|
| 0:25:40 | which didn't work |
|---|
| 0:25:41 | also what was working was |
|---|
| 0:25:43 | combining scores of |
|---|
| 0:25:44 | many phonotactic systems as |
|---|
| 0:25:46 | as you did in your |
|---|
| 0:25:48 | submission and |
|---|
| 0:25:48 | yeah |
|---|
| 0:25:49 | very easily combining |
|---|
| 0:25:51 | thirteen pca based |
|---|
| 0:25:52 | svm |
|---|
| 0:25:53 | systems |
|---|
| 0:25:53 | based on our phoneme recognisers |
|---|
| 0:25:55 | was actually |
|---|
| 0:25:57 | uh |
|---|
| 0:25:57 | very |
|---|
| 0:25:59 | very nice |
|---|
| 0:26:00 | the results |
|---|
| 0:26:01 | are quite comparable if if you will |
|---|
| 0:26:03 | and the number one point seventy eight |
|---|
| 0:26:05 | just |
|---|
| 0:26:05 | these |
|---|
| 0:26:06 | svm systems |
|---|
| 0:26:07 | were better than |
|---|
| 0:26:08 | our |
|---|
| 0:26:09 | final submission even after the filtering and |
|---|
| 0:26:13 | the recalibration |
|---|
| 0:26:16 | (inaudible) |
|---|