0:00:23okay thank you very much so the brno two seventy six system was a collaborative work
0:00:28between the brno university of technology
0:00:31agnitio and politecnico di torino
0:00:36so let's start as already introduced we had twenty four languages to deal with
0:00:42we had a new metric
0:00:44the list of the languages somehow
0:00:47screwed the
0:00:49and so we have the new metric the two seventy six language pairs that's
0:00:54where the name comes from
0:00:56and we had to select the twenty four worst pairs in terms of mindcf and
0:01:01then compute the average
0:01:03actual dcf
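A minimal sketch of the metric as described here, under stated assumptions: with twenty four languages there are 24 * 23 / 2 = 276 unordered pairs, the twenty four pairs with the highest minimum DCF are selected, and the actual DCF is averaged over them. The function name `worst_pairs_average` and the arrays `min_dcf` and `act_dcf` are hypothetical, and the values are random placeholders, not evaluation results.

```python
from itertools import combinations

import numpy as np


def worst_pairs_average(min_dcf, act_dcf, n_worst=24):
    """Average the actual DCF over the n_worst pairs ranked by minimum DCF."""
    min_dcf = np.asarray(min_dcf, dtype=float)
    act_dcf = np.asarray(act_dcf, dtype=float)
    # Indices of the pairs with the highest minimum DCF (the hardest pairs).
    worst = np.argsort(min_dcf)[::-1][:n_worst]
    return float(act_dcf[worst].mean()), worst


n_languages = 24
pairs = list(combinations(range(n_languages), 2))  # unordered language pairs

# Placeholder costs only; a real run would compute these from system scores.
rng = np.random.default_rng(0)
min_dcf = rng.uniform(0.0, 0.3, size=len(pairs))
act_dcf = min_dcf + rng.uniform(0.0, 0.05, size=len(pairs))

avg, worst = worst_pairs_average(min_dcf, act_dcf)
print(len(pairs), len(worst))
```

The pair count is what gives the system its name in the talk.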
0:01:05in order to be able to deal with those languages we
0:01:10had to collect our data so this is basically the list of data that we
0:01:14used in the past
0:01:16evaluations there are some callfriend fisher
0:01:19oh mixer data from the sre evaluations
0:01:24previous lre evaluations ogi data of foreign accented english
0:01:28some speechdat data for the east european languages
0:01:33i was and switchboard and some broadcast
0:01:37data from the voice of america and radio free europe
0:01:40and then some
0:01:42iraqi arabic conversational speech
0:01:45arabic broadcast
0:01:48oh as doc showed
0:01:51there were some languages for which we didn't have enough data
0:01:54so what we did is that we added this additional radio data from public
0:02:00sources so we used radio free europe radio free asia
0:02:05some czech broadcast
0:02:08and there's a list of languages that we covered czech farsi laotian and
0:02:14punjabi
0:02:15and again i say arabic maghrebi
0:02:18mandarin ukrainian and i guess there were a couple more
0:02:23so what we did is that we did phone call detection so we were
0:02:27detecting the parts of
0:02:30broadcasts where
0:02:31there were some
0:02:34conversations would be greater
0:02:37and for each language we ran
0:02:40automatic speaker labeling
0:02:43because we didn't want the
0:02:47train and test sets
0:02:49to have overlapping speakers
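A minimal sketch of a speaker-disjoint split, assuming each segment already carries an automatic speaker label. The function name, the segment identifiers, and the test fraction are hypothetical; the talk does not say how the split was actually implemented.

```python
import random
from collections import defaultdict


def speaker_disjoint_split(segments, test_fraction=0.3, seed=0):
    """Split (segment_id, speaker_label) tuples so no speaker is in both sets."""
    by_speaker = defaultdict(list)
    for seg_id, spk in segments:
        by_speaker[spk].append(seg_id)

    # Shuffle whole speakers, not segments, so a speaker stays on one side.
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)
    n_test = max(1, int(round(test_fraction * len(speakers))))
    test_speakers = speakers[:n_test]
    train_speakers = speakers[n_test:]

    train = [s for spk in train_speakers for s in by_speaker[spk]]
    test = [s for spk in test_speakers for s in by_speaker[spk]]
    return train, test


# Toy data: 50 segments spread over 7 hypothetical speaker labels.
segments = [(f"seg{i}", f"spk{i % 7}") for i in range(50)]
train, test = speaker_disjoint_split(segments)
```

Splitting by speaker rather than by segment is what keeps the test scores from being inflated by speakers already seen in training.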
0:02:52this is the scheme for our development set so we used the
0:02:57lre eleven development data
0:02:59so we made two sets actually one D one was the trusted data which was
0:03:04actually based on the nist thirty second segments
0:03:10we also ran automatic speaker labeling
0:03:14and we split it into non-overlapping
0:03:17training and
0:03:18test parts
0:03:20and then we took the entire conversations and in the thirty seconds excerpts
0:03:25so we split them into thirty second segments
0:03:30and all splits from one conversation side went to either the train or the test set
0:03:37again there was some
0:03:39speaker automatic speaker labeling
0:03:42of course we had more data but it was less reliable as we'll see in our
0:03:46contrastive system
0:03:49this doing this
0:03:51helped a little bit
0:03:53having the all the P V their position
0:04:01to give a little bit of statistics on our
0:04:05dataset so we had the train set which are sixty six sre six two thousand
0:04:11and it was based on all kinds of sources
0:04:16and we had the test set
0:04:17which was thirty eight thousand segments
0:04:20was based basically on a previous lre evaluations
0:04:24and then be and test sets with service
0:04:26on the other you
0:04:28evaluations comprising one
0:04:35so a little overview of our systems we had a submission of three
0:04:39systems one primary and two contrastive
0:04:42so the primary system consisted of one acoustic system which was based on the i-vectors
0:04:47you'll see the descriptions later
0:04:51and then we had three phonotactic subsystems so that we would
0:04:56have a diverse system so we had
0:04:58a binary decision tree system based on the english tokenizer then we had a pca
0:05:05system based on the russian systems
0:05:08and then we had some multinomial subspace
0:05:11i-vector space but
0:05:17the first contrastive system was the same as the primary what we did
0:05:21is that we excluded the P two
0:05:23that means the entire conversations
0:05:28we'll see the results
0:05:30and then the contrastive two system was just a
0:05:33fusion of the two best systems the acoustic and the english
0:05:37the problem of the english was that on the development set it gave
0:05:42very good results but as we'll see there was some kind of problem i think
0:05:48so there is a little diagram
0:05:50of our system
0:05:52in the first
0:05:56at the very left we have the front end so we have the
0:06:00acoustic i-vector the phonotactic i-vector and the pca
0:06:04which basically convert the input of some form
0:06:09into fixed-length vectors either i-vectors for the acoustic i-vector extractor or we
0:06:14also call them i-vectors for the phonotactic i-vector extractor and pca
0:06:19after which we had to do some scoring
0:06:24and then we used the binary decision tree model which was based basically
0:06:29on a log likelihood evaluation
0:06:31of the n-gram counts
0:06:33itself so we already got the scores
0:06:36which could then go to pre-calibration
0:06:40both the scoring and pre-calibration are based on logistic regression
0:06:45and then the fusion is also based on logistic regression out of which we get twenty four scores
0:06:50log likelihoods and then we compute the pair-wise log likelihood ratio for
0:06:55each of the pairs
0:06:58this is just to show how the data described in
0:07:02the previous section were used so the train database was used for
0:07:06the front-end training and for the scoring
0:07:10classifier training
0:07:13and the dev and test databases were used for the
0:07:17back end and the pre-calibration
0:07:24so for the acoustic system we used the hungarian phoneme recognizer based vad
0:07:30basically to kill the
0:07:33silence then we used vtln
0:07:35i'll features dithering
0:07:38cepstral mean
0:07:41and variance normalisation with rasta processing basically
0:07:46something similar to what
0:07:48was presented previously
0:07:50and the modeling was based on a full covariance ubm with two thousand forty eight components
0:07:55and the i-vector size was example
0:08:00for the phonotactic systems
0:08:04we used a diversity of techniques and
0:08:08sources of tokenization
0:08:10so the pca feature extraction
0:08:14was based on the
0:08:17hungarian tokenizer
0:08:20so what we do is that we replace the counts with the square root of
0:08:23the counts
0:08:25then pca on top of that we reduced the dimension to six hundred
0:08:30and we basically used it in the same way as the acoustic i-vector
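A minimal sketch of the square root of the counts followed by PCA, as described for this subsystem. The function name, the toy matrix sizes, and the ten output dimensions are assumptions for illustration only (the talk mentions six hundred dimensions), and PCA is done here with a plain SVD, which may differ from what the authors used.

```python
import numpy as np


def sqrt_pca_features(counts, n_dims):
    """Take the square root of n-gram counts, centre, and project with PCA."""
    x = np.sqrt(np.asarray(counts, dtype=float))
    mean = x.mean(axis=0)
    xc = x - mean
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(xc, full_matrices=False)
    basis = vt[:n_dims].T
    return xc @ basis, mean, basis


# Toy data: 50 utterances, 200 hypothetical n-gram types.
rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(50, 200))
feats, mean, basis = sqrt_pca_features(counts, n_dims=10)
print(feats.shape)
```

A new utterance would be projected with the stored `mean` and `basis`, giving a fixed-length vector that can be scored like an i-vector.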
0:08:35and we had multinomial subspace modeling of the trigram counts which was based on the
0:08:39russian tokenizer
0:08:41so this is something slightly newer and something that marshall
0:08:50it is basically modeling the n-gram counts
0:08:56in a subspace
0:08:57of the simplex
0:09:00the output
0:09:01of such an approach is also
0:09:04an i-vector like
0:09:06feature
0:09:07which we then again process the same way as the i-vectors
0:09:10and then we had the binary decision tree which is basically a novel technique where
0:09:15the decision trees are used to cluster the n-gram counts
0:09:18and the likelihood is used
0:09:21as the score
0:09:25so the scoring for the acoustic i-vector and the two phonotactic
0:09:31i-vector systems was as follows
0:09:34the input was usually the i-vector either six hundred dimensional or one thousand dimensional
0:09:40we performed length normalization
0:09:43we ultimately performed within class covariance normalization
0:09:47and after that as the
0:09:49S R
0:09:51classifier we used regularized multiclass logistic regression
0:09:55with the cross entropy objective function
0:09:57the regularizer was the L two regularizer but the penalty was chosen without cross validation
0:10:06and it was trained on the train database
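A minimal sketch of the two normalization steps named here, length normalization followed by within class covariance normalization (WCCN), under stated assumptions. The function names and the toy data are hypothetical, the within class covariance is taken as the unweighted mean of the per-class covariances, and the logistic regression classifier that follows in the talk is not included.

```python
import numpy as np


def length_normalize(x):
    """Scale every row vector to unit Euclidean length."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def within_class_cov(x, labels):
    """Unweighted mean of the per-class covariance matrices."""
    classes = np.unique(labels)
    covs = [np.cov(x[labels == c], rowvar=False, bias=True) for c in classes]
    return sum(covs) / len(classes)


def wccn_transform(x, labels):
    """Return B with B @ B.T = inv(W), so that x @ B has identity W."""
    w = within_class_cov(x, labels)
    return np.linalg.cholesky(np.linalg.inv(w))


# Toy data: 3 classes, 40 vectors each, 5 dimensions (placeholders).
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 40)
class_means = rng.normal(scale=2.0, size=(3, 5))
raw = class_means[labels] + rng.normal(size=(120, 5))

normed = length_normalize(raw)
b = wccn_transform(normed, labels)
projected = normed @ b
```

After the projection the within class covariance of the training vectors is the identity, which is the usual motivation for WCCN before a linear classifier.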
0:10:09the output
0:10:13what we did with
0:10:16each set of twenty four scores is that we did pre-calibration
0:10:20of each system so that was a full affine transform
0:10:27and we used regularized logistic regression
0:10:30which was trained on the test
0:10:32and dev databases
0:10:37and in the end
0:10:40we fused the four systems
0:10:45with the constrained affine transform so instead of assigning each of the twenty
0:10:50four scores of each of the systems an individual scale constant
0:10:53we had one scale constant for one system
0:10:56and we had a vector of offsets
0:11:01and this logistic regression was also regularized
0:11:07and was trained on the test
0:11:09and dev databases
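A minimal sketch of the constrained affine fusion described here: one scale per system plus one offset per language. The function name and all parameter values are hypothetical placeholders; in the talk the parameters are trained with logistic regression, which is not shown.

```python
import numpy as np


def fuse(scores, scales, offsets):
    """Constrained affine fusion.

    scores:  (n_systems, n_trials, n_languages) pre-calibrated scores
    scales:  (n_systems,) one scalar per system
    offsets: (n_languages,) one offset per language
    """
    scores = np.asarray(scores, dtype=float)
    scales = np.asarray(scales, dtype=float)
    # Weighted sum over systems, then a shared per-language offset.
    return np.tensordot(scales, scores, axes=1) + np.asarray(offsets, dtype=float)


n_systems, n_trials, n_languages = 4, 10, 24
rng = np.random.default_rng(0)
scores = rng.normal(size=(n_systems, n_trials, n_languages))
scales = np.array([0.9, 0.4, 0.3, 0.2])     # placeholder values
offsets = rng.normal(scale=0.1, size=n_languages)

fused = fuse(scores, scales, offsets)
n_params = n_systems + n_languages          # 4 scales + 24 offsets
print(fused.shape, n_params)
```

With four systems and twenty four languages this has 28 free parameters, against 96 scales plus 24 offsets if every score of every system had its own scale.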
0:11:13as i said the decisions were done using the log-likelihood ratios that came out
0:11:18of this
0:11:20the
0:11:22two seventy six scores were
0:11:24converted from the twenty four scores
0:11:27as log likelihood ratios
0:11:30among all those pairs
0:11:36this is just a
0:11:38the subtraction of course
0:11:40we give a score
0:11:42it is
0:11:44and the decisions were
0:11:48all made at a threshold of zero
0:11:51just a little comment that these decisions are invariant to scaling of
0:11:55the log likelihoods
0:11:58you relations by the heart language pairs
0:12:02does not sell
0:12:03just the calibration
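A minimal sketch of how twenty four log likelihoods turn into two hundred seventy six pairwise log likelihood ratios by subtraction, with decisions at a threshold of zero. The function name and the random log likelihoods are placeholders; the last lines illustrate the remark that the decisions do not change when the log likelihoods are scaled by a positive constant.

```python
from itertools import combinations

import numpy as np


def pairwise_llrs(loglik):
    """Return all unordered language pairs and their log likelihood ratios."""
    loglik = np.asarray(loglik, dtype=float)
    pairs = list(combinations(range(loglik.shape[0]), 2))
    # LLR for pair (i, j) is the difference of the two log likelihoods.
    llr = np.array([loglik[i] - loglik[j] for i, j in pairs])
    return pairs, llr


rng = np.random.default_rng(1)
loglik = rng.normal(size=24)            # placeholder log likelihoods

pairs, llr = pairwise_llrs(loglik)
decisions = llr > 0.0                   # True means the first language of the pair

# Positive scaling changes the LLR values but not their signs.
_, llr_scaled = pairwise_llrs(3.0 * loglik)
decisions_scaled = llr_scaled > 0.0
```

Only the sign of each difference matters for the decision, so a common positive scale leaves every decision untouched, while per-language offsets can flip them.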
0:12:07so the analysis that we used to
0:12:10we so we
0:12:13we fixed that the
0:12:15denotes the use when designing the system is that
0:12:18we fixed the twenty four worst pairs on our thirty second evaluations
0:12:24and we compared
0:12:25three different numbers the actual dcf
0:12:30the minimum dcf
0:12:31start dcf which was based
0:12:33on the
0:12:34on that may cause recipe and mentioned monday
0:12:39it's based on the
0:12:41likelihood pre-calibration
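A minimal sketch of the difference between actual and minimum DCF for one language pair. It assumes an equal-weight cost, the average of the miss and false alarm rates, which is an assumption made for this illustration and is not stated in the talk. The function names and toy scores are hypothetical.

```python
import numpy as np


def pair_dcf(llr_a, llr_b, threshold=0.0):
    """Cost for one pair: llr_a are scores of language A trials, llr_b of B.

    A trial is decided as A when its log likelihood ratio exceeds threshold.
    """
    p_miss = np.mean(np.asarray(llr_a, dtype=float) <= threshold)
    p_fa = np.mean(np.asarray(llr_b, dtype=float) > threshold)
    return 0.5 * (p_miss + p_fa)


def min_pair_dcf(llr_a, llr_b):
    """Lowest cost over all thresholds, i.e. with ideal calibration."""
    candidates = np.concatenate([llr_a, llr_b, [-np.inf]])
    return min(pair_dcf(llr_a, llr_b, t) for t in candidates)


# Toy scores: well separated but shifted, so the zero threshold is wrong.
llr_a = np.array([2.0, 3.0, 4.0])
llr_b = np.array([0.5, 1.0, 1.5])

actual = pair_dcf(llr_a, llr_b)          # decisions at threshold zero
minimum = min_pair_dcf(llr_a, llr_b)     # best achievable threshold
```

The gap between the two numbers is the calibration loss: the scores in the toy example separate the languages perfectly, but the fixed zero threshold does not sit between them.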
0:12:45so we will compare the development
0:12:50and the evaluation sets
0:12:53we will show the comparison of eight systems so we will have the
0:12:58individual systems hungarian phonotactic russian phonotactic
0:13:04acoustic i-vectors and then we will present four fusions
0:13:09so the primary
0:13:10the first contrastive the second contrastive
0:13:13and we also have a three system fusion which excluded
0:13:17the english phonotactic system which somehow
0:13:20misbehaved
0:13:21as we'll see
0:13:24so this is the result for three seconds where we fixed the
0:13:28pairs on thirty seconds
0:13:32as we see this is the
0:13:35misbehavior of the system so the left bars are the dev
0:13:42set and the right are
0:13:45the evaluation set and we see the trend
0:13:48going like this in the dev set but
0:13:52the english phonotactic system
0:13:55oh of that vector
0:14:04this is for the ten seconds
0:14:11so this contrastive one system is the system where
0:14:16we excluded the dft to data which
0:14:21comprise the entire segments so we see that there is a slight hit
0:14:25compared to
0:14:26the primary system so these two systems to remind you are
0:14:29the very same except that
0:14:32in the calibration and scoring
0:14:34there are some data left out
0:14:39again the english
0:14:42the english misbehaved here
0:14:45we see
0:14:48the difference between the blue one and the red one
0:14:52which is the difference between the minimum and actual
0:14:56we see that
0:14:57the miscalibration was
0:15:02no i mean
0:15:03it was not a tragedy but
0:15:06we didn't do the calibration very well which is even more visible on the thirty
0:15:10second stuff so we see that
0:15:12the miscalibration is much
0:15:15much more visible
0:15:17especially for the fusions
0:15:21here on the contrastive one versus the primary
0:15:24we see that excluding the data
0:15:29really hurt
0:15:33that's one thing
0:15:36with the english system excluded so the three system fusion is equivalent to the primary but
0:15:41excluding the english system
0:15:43we see that on the development set it didn't do much
0:15:47in fact the system is slightly worse excluding the english
0:15:50system because the english performed very well
0:15:52on the
0:15:54development set
0:15:55but it was hurting
0:15:57really hard on the evaluation
0:16:00after the evaluation
0:16:07so again just to summarise the observation is that there was
0:16:11a big deterioration between the mindcf for development and
0:16:20evaluation
0:16:22then there were no calibration disasters but on the thirty seconds as i pointed out
0:16:27we could have done better
0:16:29the binary tree system was kind of screwed and
0:16:34what we found out later is that if we apply
0:16:38similar dimensionality-reduction plus scoring techniques
0:16:41even to the english tokens the system was good again so
0:16:46so it was it was
0:16:49due to the
0:16:50the plane landed evaluation
0:16:53and acoustic outperforms phonotactic almost everywhere though there were a couple
0:16:58of language pairs where
0:17:01the phonotactic was better
0:17:03what we did is that we did an analysis of
0:17:07the brno versus mit system since mit was the best
0:17:10there was a weak correlation between sites in the difficulty of pairs
0:17:15the mean mindcf
0:17:17was very similar
0:17:19the mindcf
0:17:22for the worst twenty four pairs was slightly worse for us than for mit
0:17:28and in actual dcf
0:17:30we had a big calibration
0:17:34there's an interesting plot here which compares some of the
0:17:39selected arabic dialects versus slavic languages where
0:17:45somehow we knew that mit had more data for the arabic dialects
0:17:49so we see that we do very poorly
0:17:51on some of the pairs
0:17:54okay arabic versus pashto
0:17:57et cetera mostly due to the lack of data while
0:18:02on the
0:18:03slavic languages
0:18:06we do better
0:18:08on some
0:18:08selected pairs
0:18:10so this is just to show that
0:18:14the amount of data really
0:18:19this is just a correlation between some of the best of
0:18:23unlike the end and be useful is the
0:18:26but axes and use them mit excess
0:18:30we see that if we did the same thing
0:18:33all the points would be aligned
0:18:36on the
0:18:37diagonal but we see that some pairs are really
0:18:41off
0:18:42we did very differently
0:18:44on some of the pairs
0:18:47and this is just to show the
0:18:50worst mindcf
0:18:53versus actual dcf for the mit and the
0:18:57brno system
0:19:01these are the average
0:19:03points so we see that on average mit is better
0:19:09so mit
0:19:11these points are more on a single line while our
0:19:14systems are more scattered around here so this shows again the
0:19:20calibration hit
0:19:24so to conclude we built several systems but we only selected four
0:19:28for the primary fusion
0:19:32again the acoustic outperforms the phonotactic
0:19:35for the phonotactic we tried different back ends and we saw that the
0:19:39dimension reduction really helps
0:19:42we had
0:19:44a big hit
0:19:45for the english phonotactic systems where there was a
0:19:49i forgot to delete the we did not know why
0:19:52we already is the
0:19:57and probably we could use special detectors for selected pairs
0:20:01that is it
0:20:41the unique and the shifted
0:20:43yeah we use that
0:20:48so we used six mfccs plus C zero yeah shifted deltas
0:20:58oh yeah
0:21:14so we used
0:21:18L two regularisation in our scoring and in our pre-calibration
0:21:23so a little L two regularization