0:00:15 | so |

0:00:17 | good morning, I am going to present our work on language recognition |

0:00:21 | this paper describes our submission to the twenty fifteen NIST |

0:00:26 | language recognition i-vector challenge |

0:00:29 | and the paper describes the work we did in preparing |

0:00:32 | our submission |

0:00:37 | okay, this is the outline of my presentation; first I will give a |

0:00:41 | brief overview of the i-vector challenge from the perspective of the participants, which is different |

0:00:46 | from the perspective of the organiser |

0:00:51 | okay |

0:00:52 | and then I will describe our out-of-set (OOS) detection strategies |

0:00:57 | which constitute the major part of our work in the i-vector challenge |

0:01:02 | and then we move on to the description of the subsystems, because the final submission |

0:01:07 | is in fact a fusion of multiple systems |

0:01:11 | and after this I will present the experimental results |

0:01:14 | and then the conclusions |

0:01:19 | okay so |

0:01:24 | the i-vector challenge, of course, consists of i-vectors extracted from fifty target |

0:01:30 | languages |

0:01:31 | plus some unknown languages, and |

0:01:35 | all these i-vectors are derived from |

0:01:38 | conversational telephone speech as well as broadcast speech |

0:01:43 | and from the perspective of the participants there are three major challenges |

0:01:48 | the first one being that |

0:01:51 | it is an open-set language identification task |

0:01:54 | so in addition to the fifty target languages we have to model an |

0:01:58 | additional class |

0:02:00 | to detect the out-of-set languages |

0:02:02 | and on top of this, the set of OOS languages is unknown |

0:02:07 | and it has to be learned |

0:02:08 | from the unlabeled |

0:02:10 | development data |

0:02:12 | and |

0:02:14 | one difficulty here is that the unlabeled development data consists of |

0:02:18 | both the target languages as well as the OOS languages, so we have |

0:02:24 | to select quite carefully the OOS languages from the unlabeled development |

0:02:30 | set |

0:02:36 | okay, so these are the three datasets that are provided to the participants |

0:02:43 | the first one is the training set; this is labeled, and it consists of |

0:02:47 | fifteen thousand |

0:02:49 | i-vectors |

0:02:50 | covering the fifty target languages |

0:02:53 | so we have about three hundred i-vectors per language |

0:02:57 | the next is the development set, which is unlabeled |

0:03:01 | and consists of both target and non-target languages |

0:03:04 | so most of the work in fact consists of how to select those |

0:03:09 | OOS i-vectors |

0:03:10 | from this development set |

0:03:12 | and finally there is the test set, which consists of six thousand five hundred i-vectors, split |

0:03:17 | into |

0:03:18 | thirty and seventy percent |

0:03:22 | for the progress and evaluation sets |

0:03:28 | okay, NIST provided a baseline |

0:03:31 | which is the i-vector cosine scoring baseline consisting of three steps |

0:03:35 | the first one is whitening |

0:03:37 | followed by length normalization; the whitening parameters are estimated |

0:03:41 | from the unlabeled development set |

0:03:45 | because this is unsupervised training, we just need the mean and the covariance matrix |

0:03:50 | and after that is the cosine scoring, so we have w-bar-k |

0:03:54 | here |

0:03:56 | which is the average, the mean |

0:04:00 | of the three hundred i-vectors for a specific target language |

0:04:05 | and k runs from |

0:04:06 | one up to fifty here |

0:04:10 | while w is the i-vector of the test segment |

0:04:13 | so the cosine score is given by this equation, and of course after the length normalization |

0:04:18 | the two norms would be equal to one |

0:04:21 | and in the case of language identification, what we have to do is |

0:04:25 | select the language, that is, the class that gives the highest score |
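The baseline pipeline just described (whitening, length normalization, cosine scoring against per-language class means, pick the argmax) can be sketched in a few lines. This is a minimal pure-Python illustration with made-up toy vectors, not the challenge's actual code, and the whitening here uses a simplified diagonal covariance.

```python
import math

def whiten(v, mu, sigma_diag):
    """Simplified whitening: subtract the mean and scale per dimension.
    (The baseline estimates a full covariance on the unlabeled development
    set; a diagonal one keeps this sketch short.)"""
    return [(x - m) / s for x, m, s in zip(v, mu, sigma_diag)]

def length_norm(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine_score(w_test, w_class_mean):
    # after length normalization both norms are 1, so the dot product
    # equals the cosine similarity
    a, b = length_norm(w_test), length_norm(w_class_mean)
    return sum(x * y for x, y in zip(a, b))

def identify(w_test, class_means):
    """Pick the language whose class-mean i-vector scores highest."""
    scores = {k: cosine_score(w_test, m) for k, m in class_means.items()}
    return max(scores, key=scores.get)

# toy 3-d "i-vectors"; the real ones are 400-d with 50 class means
means = {"lang_a": [1.0, 0.0, 0.1], "lang_b": [0.0, 1.0, 0.1]}
best = identify([0.9, 0.1, 0.0], means)
```

Note that, exactly as in the talk, there is no OOS class in this baseline scorer; it always answers with one of the target languages.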

0:04:31 | so as we can see from here |

0:04:33 | in the i-vector cosine scoring baseline |

0:04:38 | there is no |

0:04:39 | OOS class; the OOS class is not included here |

0:04:42 | okay, so |

0:04:44 | as we will see later, if we include an additional class for the OOS |

0:04:48 | the performance gets quite a significant |

0:04:52 | improvement compared to the baseline |

0:04:56 | okay, now we evaluate the |

0:04:58 | the cost |

0:04:59 | which is defined as the average identification error rate |

0:05:04 | across the fifty target languages and the OOS class |

0:05:08 | where the error rate is defined as one minus |

0:05:10 | the fraction of correct trials |

0:05:12 | but now if you |

0:05:13 | if you put this k of fifty, which is the number of target languages |

0:05:17 | along with the weight of the OOS, into this formula |

0:05:21 | well, we can see that the weight |

0:05:24 | given to the OOS error rate |

0:05:27 | in detecting the |

0:05:29 | the OOS, failing to detect OOS, is much higher |

0:05:33 | compared |

0:05:34 | to the target classes |

0:05:35 | so this means |

0:05:37 | the cost emphasizes that |

0:05:40 | OOS detection is a very important |

0:05:45 | thing to do |

0:05:46 | to reduce the cost |
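The weighting argument can be made concrete. The sketch below assumes the published form of the challenge cost, with the target classes sharing weight (1 - p_oos) and the OOS class carrying p_oos = 0.23 (my reading of the evaluation plan); with fifty targets, each target error rate then carries weight 0.77/50 = 0.0154, roughly fifteen times less than the OOS error rate.

```python
def challenge_cost(target_error_rates, oos_error_rate, p_oos=0.23):
    """Average identification cost: the n target classes share weight
    (1 - p_oos) equally, while the single OOS class carries p_oos."""
    n = len(target_error_rates)
    return (1.0 - p_oos) / n * sum(target_error_rates) + p_oos * oos_error_rate

# with 50 targets: per-target weight (1 - 0.23) / 50 = 0.0154 versus 0.23
# for the OOS class, so failing to detect OOS is penalized far more
cost_all_oos_wrong = challenge_cost([0.0] * 50, 1.0)
```

This asymmetry is exactly why the talk spends most of its effort on OOS detection.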

0:05:52 | so that is what most of this talk is about |

0:05:55 | okay, so |

0:05:57 | to investigate different strategies to perform OOS detection, we designed a |

0:06:04 | so-called unlabeled set |

0:06:07 | from the labeled training data we have |

0:06:11 | so the labeled training data consists of fifteen thousand |

0:06:14 | i-vectors for the fifty target languages |

0:06:18 | so what we did was actually a forty-ten split |

0:06:22 | so we have |

0:06:24 | forty target languages |

0:06:26 | and we assume that the other ten act as OOS languages, and this is |

0:06:30 | a random selection |

0:06:32 | with no particular |

0:06:35 | preference for any of the languages |

0:06:37 | so these |

0:06:39 | i-vectors |

0:06:41 | are used as the |

0:06:44 | OOS languages in our experiments |

0:06:47 | and we select, for the other forty languages, their three hundred i-vectors |

0:06:50 | as the unlabeled |

0:06:53 | target languages in the unlabeled data set |

0:06:56 | okay |

0:06:58 | and of course we performed LDA to reduce the dimension so that we could investigate |

0:07:03 | different strategies in the |

0:07:06 | reduced space |

0:07:12 | okay, so basically we investigated two strategies; in fact we investigated many strategies |

0:07:18 | and these two we found pretty useful; the first one we call the least-fit target |

0:07:25 | and the second one is the best-fit OOS; so the least-fit target means |

0:07:29 | that we train a classifier |

0:07:32 | on the target languages, and those i-vectors that least fit the |

0:07:38 | target classes are taken as the OOS i-vectors |

0:07:42 | whereas for the |

0:07:46 | best-fit OOS, we train a classifier with |

0:07:49 | forty-plus-one or fifty-plus-one classes |

0:07:54 | that is, we have one additional class for the OOS, and we select those |

0:07:59 | that best fit the OOS class |

0:08:03 | "'kay" so no we this is the radius of philosophy okay so what happens that |

0:08:09 | we to the a target languages that we train a multi-class svm soul for the |

0:08:14 | case of our seamlet the unlabeled a development set we have forty classes therefore the |

0:08:21 | actual we have fifty classes |

0:08:23 | so what is that |

0:08:25 | we train a multi-class svm |

0:08:26 | and then be scores look in those i-vectors in the unlabeled development set |

0:08:32 | so |

0:08:32 | what have like what is of the i-vectors where have one probably the posterior probabilities |

0:08:39 | for each of the classes |

0:08:41 | then we take the max |

0:08:43 | a amount of k classes we have which if these |

0:08:48 | so then this okay will be those i-vectors having the posterior probability |

0:08:56 | nasa then a given |

0:08:58 | try show |

0:09:00 | right |
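The least-fit-target rule reduces to a simple thresholding of the maximum class posterior. The posteriors below are toy numbers standing in for the multi-class SVM outputs; the threshold value is illustrative, not the one used in the submission.

```python
def least_fit_target(posteriors, threshold=0.5):
    """posteriors: one list of per-class posteriors per unlabeled i-vector.
    An i-vector whose maximum posterior falls below the threshold fits no
    target class well, so it is flagged as a candidate OOS i-vector."""
    return [i for i, p in enumerate(posteriors) if max(p) < threshold]

toy = [
    [0.90, 0.05, 0.05],  # fits one target class well -> keep as target
    [0.40, 0.35, 0.25],  # fits no class well -> candidate OOS
    [0.20, 0.60, 0.20],  # fits a class reasonably -> keep as target
]
oos_idx = least_fit_target(toy, threshold=0.5)
```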

0:09:01 | whereas for the case of the best-fit OOS |

0:09:05 | we train |

0:09:06 | a K-plus-one |

0:09:08 | multi-class SVM, with K plus one classes |

0:09:11 | so now the question is how |

0:09:13 | we are going to get the additional class |

0:09:17 | given that we don't have the labels; so what we did is |

0:09:23 | we have the fifty target languages |

0:09:26 | and then we throw in |

0:09:27 | the unlabeled development set |

0:09:30 | assuming that |

0:09:32 | all those unlabeled development set i-vectors are OOS |

0:09:37 | and we train the multi-class SVM with that |

0:09:41 | of course there are target i-vectors inside the unlabeled development set, but |

0:09:47 | well, it is like this |

0:09:49 | using the multi-class SVM trained in this manner |

0:09:52 | we compute the posterior probability |

0:09:54 | with respect to the OOS class, and then we select those i-vectors with |

0:10:01 | the highest |

0:10:02 | probability |

0:10:05 | which is what we call the best-fit OOS |

0:10:08 | so in this way we are actually discarding |

0:10:10 | those target i-vectors |

0:10:13 | in the unlabeled development set |
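The selection step of the best-fit-OOS strategy is just a ranking on the OOS-class posterior. The numbers and the selection size below are illustrative; in the real system the posteriors come from the (K+1)-class SVM trained with the whole unlabeled set labeled as OOS.

```python
def best_fit_oos(oos_posteriors, n_select):
    """oos_posteriors: the posterior of the OOS class for each unlabeled
    i-vector, from a (K+1)-class model whose extra class was trained on all
    unlabeled data treated as OOS. Keeping only the n_select highest-ranked
    i-vectors discards the target i-vectors hiding in the unlabeled set."""
    ranked = sorted(range(len(oos_posteriors)),
                    key=lambda i: oos_posteriors[i], reverse=True)
    return ranked[:n_select]

# toy OOS posteriors for four unlabeled i-vectors
picked = best_fit_oos([0.2, 0.9, 0.6, 0.1], n_select=2)
```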

0:10:22 | [brief aside with the session chair, largely inaudible] |

0:10:30 | okay, so this is a comparison of the two methods |

0:10:38 | the best-fit OOS |

0:10:42 | and the least-fit target |

0:10:45 | this is the precision versus recall, and we can see here that the best-fit |

0:10:50 | OOS, so the best-fit OOS |

0:10:52 | gives a better |

0:10:54 | precision for all recall values |

0:10:56 | compared to the least-fit target |

0:11:00 | and these |

0:11:02 | diagrams illustrate |

0:11:03 | this on a two-dimensional |

0:11:06 | graph |

0:11:09 | so clearly the best-fit OOS |

0:11:11 | can actually give a geometrically cleaner |

0:11:15 | detection |

0:11:16 | it better detects the OOS i-vectors |

0:11:20 | from the unlabeled development set |

0:11:28 | okay, so |

0:11:31 | on top of the |

0:11:33 | best-fit OOS, we then do an iterative purification step |

0:11:40 | to improve the OOS detection; what happens is that, based on the scores |

0:11:46 | based on the |

0:11:47 | detections that we have |

0:11:50 | from the |

0:11:51 | best-fit OOS |

0:11:53 | we rank |

0:11:54 | the i-vectors from top to bottom |

0:11:57 | the top will be the |

0:11:59 | most likely to be OOS |

0:12:02 | while the bottom would be the most likely to be target |

0:12:05 | and then we have these two subsets of i-vectors |

0:12:09 | then we take the mean |

0:12:10 | and then we score against all the unlabeled i-vectors |

0:12:15 | and then we form the ranking again |

0:12:18 | and we enlarge the two subsets |

0:12:22 | at each iteration |

0:12:25 | and we repeat this; so if you do this iteratively |

0:12:31 | then the best result that we can have is |

0:12:35 | that we increase the recall |

0:12:39 | to seventy-one point three percent |

0:12:41 | a jump of about forty percent |

0:12:43 | measured against the six thousand or so |

0:12:45 | unlabeled i-vectors that we have |
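The purification loop can be sketched as follows. This is a minimal pure-Python illustration with made-up toy vectors: it grows only the OOS side of the ranking (the talk grows both an OOS and a target subset), and the growth step and iteration count are arbitrary.

```python
import math

def cos_sim(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def mean_vec(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def purify(unlabeled, oos_seed, n_iters=2, grow=1):
    """Iterative purification sketch: rescore all unlabeled i-vectors
    against the mean of the current OOS set, re-rank, and enlarge the set
    by `grow` i-vectors per iteration."""
    oos = list(oos_seed)
    for _ in range(n_iters):
        m = mean_vec([unlabeled[i] for i in oos])
        ranked = sorted(range(len(unlabeled)),
                        key=lambda i: cos_sim(unlabeled[i], m), reverse=True)
        oos = ranked[:len(oos) + grow]
    return oos

# toy data: three vectors near [1, 0] (pretend OOS), two near [0, 1]
unlabeled = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [0.0, 1.0], [0.1, 0.9]]
purified = purify(unlabeled, [0], n_iters=2, grow=1)
```

Starting from a single seed, the loop pulls in the other two vectors of the same cluster while leaving the opposite cluster out, which is the behaviour the recall improvement in the talk relies on.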

0:12:55 | okay, so the final submission is in fact a fusion of multiple classifiers |

0:12:59 | it consists of pretty simple and standard classifiers; among the classifiers we have |

0:13:06 | the first one is the Gaussian backend followed by multi-class logistic regression |

0:13:10 | and then we have two versions of SVMs |

0:13:13 | one is based on what we call the polynomial expansion |

0:13:17 | and the other one is based on an empirical kernel map |

0:13:20 | then we also investigated using a multilayer perceptron |

0:13:26 | to expand the i-vectors non-linearly, and we end with an SVM |

0:13:32 | on top of this one; we also have a DNN classifier that takes |

0:13:36 | the i-vector as input, and the output is fifty-one classes |

0:13:43 | fifty target |

0:13:43 | languages and one out-of-set |

0:13:48 | class |

0:13:50 | and the fusion itself is something very simple: it is linear |

0:13:54 | fusion, so |

0:13:57 | the way we learned the weights |

0:13:58 | is by submitting the results of the individual systems |

0:14:01 | observing the scores on the progress set, and then |

0:14:04 | adjusting the weights accordingly |
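Linear fusion itself is just a weighted sum of the subsystem score vectors; the sketch below passes the weights in as arguments, whereas in the talk they were tuned by hand against progress-set feedback.

```python
def linear_fuse(system_scores, weights):
    """Weighted sum of score vectors from several subsystems.
    system_scores: one score vector (one score per class) per subsystem."""
    fused = [0.0] * len(system_scores[0])
    for w, scores in zip(weights, system_scores):
        for i, s in enumerate(scores):
            fused[i] += w * s
    return fused

# toy example: two subsystems, two classes, hypothetical weights
fused = linear_fuse([[1.0, 0.0], [0.0, 1.0]], [0.7, 0.3])
```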

0:14:11 | okay, so for the first |

0:14:14 | classifier |

0:14:16 | what we did is we trained a Gaussian |

0:14:20 | distribution for each of the target languages |

0:14:22 | so for the case of the fifty target languages we trained fifty |

0:14:26 | Gaussian distributions |

0:14:28 | and here the means are |

0:14:31 | estimated separately |

0:14:33 | whereas for the covariance matrices, we actually compute a global covariance matrix |

0:14:39 | and then use smoothing |

0:14:41 | with a smoothing factor of zero point one, to adapt it to the individual target classes |
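The covariance smoothing can be sketched as an interpolation between each per-class covariance and the global one. Two caveats: the 0.1 smoothing factor and the direction of the interpolation are my reading of the talk, and diagonal covariances (plain lists) are used purely to keep the sketch small.

```python
def smooth_cov(global_cov, class_cov, alpha=0.1):
    """Pull each per-class covariance toward the shared global covariance.
    alpha weights the per-class estimate; small alpha means the class
    covariance stays close to the global one (helpful with only ~300
    i-vectors per language)."""
    return [alpha * c + (1.0 - alpha) * g
            for g, c in zip(global_cov, class_cov)]

smoothed = smooth_cov([1.0, 2.0], [3.0, 4.0], alpha=0.1)
```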

0:14:48 | then |

0:14:51 | we add another Gaussian backend, but in the score space, where we |

0:14:55 | include the OOS cluster |

0:14:57 | as one additional class |

0:15:00 | okay, this is quite standard in language recognition |

0:15:04 | and this is followed by score calibration using multi-class logistic regression |

0:15:09 | and of course we used the multi-class logistic regression so we could convert the |

0:15:14 | log-likelihoods into posteriors |

0:15:16 | and with this we can actually control |

0:15:19 | the priors of the trials, so we can, you know, perhaps put more |

0:15:24 | prior onto the OOS class, because |

0:15:27 | as we have seen, OOS detection is |

0:15:30 | very important in reducing the cost |

0:15:35 | okay |

0:15:38 | okay, next the polynomial SVM |

0:15:41 | that we have |

0:15:43 | used |

0:15:46 | so we do a simple polynomial expansion |

0:15:50 | up to the second order |

0:15:51 | so this expands the four hundred dimensional i-vector into about eighty K dimensions, which are |

0:15:56 | suitably scaled |

0:15:57 | then what we did is, as usual, centre it to a global |

0:16:01 | mean and normalize it to unit norm |

0:16:04 | and perform a rank-reduction projection |

0:16:06 | where the rank is actually quite small compared to the |

0:16:11 | dimensionality we have |

0:16:13 | okay, and then |

0:16:15 | to include the OOS class, we have fifty-one classes, so we used two |

0:16:20 | strategies: one is one-versus-all |

0:16:22 | and the other one is a pairwise strategy, so |

0:16:25 | the final score is a combination of these two strategies used to |

0:16:29 | train the SVM |
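The second-order expansion and the dimensionality figure it produces can be sketched directly; the usual scaling of the cross terms is omitted here to keep the sketch minimal.

```python
def poly2_expand(v):
    """Second-order polynomial expansion: the vector itself followed by all
    pairwise products v_i * v_j for i <= j. A 400-d i-vector becomes
    400 + 400*401/2 = 80,600 dimensions, the "eighty K" in the talk."""
    out = list(v)
    for i in range(len(v)):
        for j in range(i, len(v)):
            out.append(v[i] * v[j])
    return out

# structure for a 2-d input: [v1, v2, v1*v1, v1*v2, v2*v2]
expanded = poly2_expand([1.0, 2.0])
```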

0:16:33 | okay, so |

0:16:35 | the other one is what we call the empirical kernel map |

0:16:38 | so what we did is we use the polynomial vectors that we have |

0:16:43 | then we construct what we call a basis matrix |

0:16:47 | using all the training data we have |

0:16:50 | as well as the OOS |

0:16:53 | i-vectors we detected |

0:16:55 | then, for each of the i-vectors that we are going to score |

0:17:01 | we do a mapping |

0:17:02 | by simply multiplying it with the matrix we have |

0:17:05 | so we are actually converting, or transforming, the polynomial vectors into the |

0:17:11 | score space |

0:17:13 | the outputs we call score vectors |

0:17:16 | and this is followed by |

0:17:18 | centering to the global mean and normalizing to unit norm |

0:17:22 | and the same SVM strategy applies |

0:17:24 | so we have two kernels that we use: one, the polynomial expansion; the second, the empirical |

0:17:30 | kernel map, both with SVM |
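The empirical kernel map amounts to one inner product per basis vector, turning an input vector into a vector of scores. The sketch below shows only the mapping and the unit-norm step; the global-mean centering and the basis construction from training plus detected-OOS vectors are described in the talk but replaced here by a made-up toy basis.

```python
import math

def length_norm(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def empirical_kernel_map(x, basis):
    """Map a (polynomial) vector into score space: one inner product per
    basis vector, so the output dimension equals the basis size."""
    scores = [sum(a * b for a, b in zip(x, b_vec)) for b_vec in basis]
    return length_norm(scores)

# toy basis of three 2-d vectors; the real basis collects training and
# detected-OOS polynomial vectors
basis = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mapped = empirical_kernel_map([2.0, 0.0], basis)
```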

0:17:36 | here are the results |

0:17:37 | first of all, we would like to compare how the polynomial |

0:17:42 | vectors and the score vectors perform compared to the raw i-vectors |

0:17:46 | so the first line is the baseline |

0:17:48 | that is, i-vectors followed by cosine scoring |

0:17:52 | at zero point three nine five nine |

0:17:54 | and if we just simply change the cosine scoring to an SVM |

0:17:58 | then what we get is about seven point eight percent improvement compared to the baseline |

0:18:03 | and then if we change to the polynomial expansion of the i-vectors, then we get |

0:18:08 | about zero point three four, which is a fourteen percent improvement |

0:18:11 | and if we go from the polynomial vectors to the empirical kernel |

0:18:16 | map of the score vectors, we get a sixteen percent improvement |

0:18:22 | okay, so next we examine the OOS detection strategies, where we |

0:18:28 | compare the least-fit target and the best-fit OOS |

0:18:31 | for both the polynomial SVM and the empirical kernel map |

0:18:40 | so this is what we get, you know, when we include |

0:18:46 | no OOS class at all |

0:18:50 | this fourteen percent is due to the classifier, compared to the baseline |

0:18:54 | if we use the least-fit target |

0:18:58 | we get a thirty-two percent improvement; then the best-fit OOS does better |

0:19:05 | and if we run, on top of the best-fit OOS |

0:19:08 | the iterative purification |

0:19:12 | we get a forty-five percent improvement |

0:19:14 | and similarly for the case of the empirical kernel map |

0:19:21 | alright, so this is how the final submission performs |

0:19:27 | we get about a fifty-five percent improvement on the progress set |

0:19:33 | and a fifty-four percent improvement |

0:19:36 | compared to the baseline |

0:19:37 | on the evaluation set, so |

0:19:39 | the improvements essentially come from a better classifier |

0:19:44 | besides the SVM and multi-class logistic regression, we used the DNN and the MLP |

0:19:49 | but I think the most part of the contribution is from the OOS |

0:19:53 | detection strategies, which |

0:19:55 | give us around forty percent of the improvement |

0:19:58 | compared to the baseline |

0:20:02 | okay, now let us examine the submissions that |

0:20:04 | we have made on the progress set |

0:20:09 | the number of OOS detected is about one thousand seven hundred; I think this |

0:20:13 | is much |

0:20:14 | higher than the |

0:20:17 | real number of OOS segments, or i-vectors, in the test set |

0:20:22 | but given the way the cost function weighs the errors |

0:20:26 | if you |

0:20:28 | miss the detection of an OOS |

0:20:29 | you are going to lose much more in terms of the cost, so |

0:20:33 | it is better to say an i-vector is OOS |

0:20:37 | than to miss an OOS |

0:20:41 | okay, so this is how our cost progressed |

0:20:44 | across |

0:20:44 | the evaluation period |

0:20:46 | so from the baseline system |

0:20:48 | then we have, you know, the better |

0:20:51 | classifiers |

0:20:52 | then we |

0:20:54 | found that the |

0:20:56 | least-fit target |

0:20:57 | is a good strategy for the OOS detection, and we get a boost in performance |

0:21:02 | and then the best-fit OOS strategy improves on that |

0:21:06 | and then the iterative purification brings a further improvement |

0:21:10 | and then finally we have the fusion, which gets us to zero point one seven |

0:21:15 | in terms of the cost |

0:21:20 | okay, so |

0:21:21 | in conclusion, we have obtained about |

0:21:24 | fifty-five percent improvement compared to the baseline, with the |

0:21:28 | major contributions from the fusion of multiple classifiers |

0:21:31 | and the OOS detection strategies |

0:21:35 | and the following OOS detection strategies were found to be useful |

0:21:39 | namely the least-fit target, the best-fit OOS, and the iterative purification |

0:21:44 | however, we were not actually able to find a good strategy to |

0:21:49 | extract |

0:21:51 | useful target i-vectors from the unlabeled development set |

0:21:55 | so we believe that if we |

0:21:57 | had a better strategy for doing that |

0:22:00 | it would give us a further improvement |

0:22:11 | okay we have time for some questions |

0:22:22 | [audience] thank you, thanks |

0:22:25 | for the OOS detection, did you observe that it is |

0:22:30 | not very useful to model the out-of-set as just one class |

0:22:36 | because |

0:22:37 | it is |

0:22:38 | distributed between different languages? based on this observation, did you try |

0:22:43 | maybe not K plus one but K plus two or more, and then choose |

0:22:50 | the OOS posteriors accordingly |

0:22:53 | well |

0:22:54 | ah, for that comment, frankly we didn't try, because, you know, during the |

0:22:59 | evaluation we did not know |

0:23:01 | how many other languages there are |

0:23:03 | in the OOS class |

0:23:06 | it may be one, it may be two |

0:23:08 | so |

0:23:09 | we did not have any idea of how many languages are in the class |

0:23:13 | so we did not actually explore that option; what we did observe is, if we take the |

0:23:17 | max from the rejected ones, we can see that some of the rejected i-vectors |

0:23:22 | lie close to certain targets, say Japanese |

0:23:25 | so we can say, okay, some of these OOS are close to, say, the Italian family |

0:23:30 | and could be grouped in the way we show |

0:23:35 | [audience] maybe one could have used a language tree for the grouping |

0:23:39 | and the second question: did you look at the confusion matrix |

0:23:44 | to see which languages were more confusable, somehow pooled |

0:23:52 | so |

0:23:53 | the thing we have is |

0:23:56 | not exactly what you say, but maybe, you know, I take this opportunity to actually talk about |

0:24:02 | the error analysis |

0:24:06 | you know, overall, what we did for the i-vector challenge was mostly OOS |

0:24:10 | detection, of course among a lot of other aspects that we explored |

0:24:15 | and, for example, you know, the target detection is actually not very good: if you |

0:24:19 | if you look at our final submission, even though we gained a lot |

0:24:23 | overall |

0:24:23 | over fifty percent improvement compared to the baseline |

0:24:26 | the target detection in fact got worse compared to the baseline |

0:24:30 | if you see what I mean |

0:24:34 | thank you thank you |

0:24:44 | [audience] |

0:24:45 | a question on this study |

0:24:51 | this i-vector challenge |

0:24:54 | preceded the |

0:24:56 | NIST LRE |

0:24:59 | twenty fifteen evaluation, right? so |

0:25:01 | how much of this work was leveraged |

0:25:05 | in that evaluation, in your LRE twenty fifteen submission |

0:25:11 | ah, for the most part, no, because the |

0:25:13 | LRE twenty fifteen is a closed-set identification task |

0:25:17 | while here we have an open-set verification problem |

0:25:20 | where the OOS becomes very important; so for LRE twenty fifteen that is kind of |

0:25:26 | not applicable; but we did in fact use what we called the empirical |

0:25:31 | kernel map |

0:25:32 | there, however |

0:25:34 | but of course for the LRE |

0:25:37 | what was actually important was the use of the bottleneck features |

0:25:41 | compared to |

0:25:43 | you know, we previously used to use SDC |

0:25:46 | and once you replace SDC with the bottleneck features, we get around fifty percent |

0:25:52 | improvement without doing anything else |

0:25:54 | so the focus moved more to the |

0:25:57 | feature level |

0:25:59 | for LID now |

0:26:02 | because for the i-vector challenge, the focus, on the other hand, was about OOS detection |

0:26:06 | as I have presented in this talk |

0:26:15 | okay, I think we're out of time, so let's thank the speaker |

0:26:18 | again |