Speech Transcript - Description and analysis of the Brno276 system for LRE2011

0:00:15	you
0:00:23	okay thank you very much so the brno two seventy six system was a collaborative work
0:00:28	between the brno university of technology
0:00:31	medial and a little
0:00:36	so as was already introduced we had twenty four languages to deal with
0:00:42	we had a new metric
0:00:44	the list of the languages somehow
0:00:47	screwed the
0:00:49	and so we have the new metric the two seventy six language pairs that's
0:00:54	where the name comes from
0:00:56	and we had to select the twenty four worst pairs in terms of mindcf and
0:01:01	then compute the average
0:01:03	actual dcf
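
A minimal Python sketch of the pair-based metric as it is described here: 24 languages give 24 * 23 / 2 = 276 unordered pairs, the 24 worst pairs are picked by minimum DCF, and the actual DCF is averaged over them. The function name, the placeholder language names and the per-pair cost dictionaries are illustrative assumptions; this is not the official NIST scoring tool.

```python
from itertools import combinations


def average_worst_pair_cost(min_dcf, act_dcf, n_worst=24):
    """Select the n_worst pairs by minimum DCF, then average their actual DCF.

    min_dcf, act_dcf: dicts mapping a language pair to a cost (hypothetical inputs).
    """
    worst = sorted(min_dcf, key=min_dcf.get, reverse=True)[:n_worst]
    return sum(act_dcf[pair] for pair in worst) / len(worst)


# Placeholder language names, only used to count the pairs.
languages = [f"lang{i:02d}" for i in range(24)]
pairs = list(combinations(languages, 2))
print(len(pairs))  # 276
```
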
0:01:05	in order to be able to deal with those languages we
0:01:10	had to collect our data so this is basically the list of data that we
0:01:14	used in the past
0:01:16	evaluations there are some callfriend fisher
0:01:19	oh mixer data from the sre evaluations
0:01:24	previous lre evaluations ogi data of foreign accented english
0:01:28	some speechdat data for the eastern european languages
0:01:33	i was and switchboard and some broadcasts though
0:01:37	data from the voice of america and radio free europe
0:01:40	and then some
0:01:42	iraqi arabic conversational speech
0:01:45	arabic broadcast
0:01:46	speech
0:01:48	oh as doc showed
0:01:51	there were some languages for which we didn't have enough data
0:01:54	so what we did is that we added this additional radio data from public
0:02:00	sources so we used radio free europe radio free asia
0:02:05	some czech broadcast
0:02:06	america
0:02:08	and there's a list of languages which we covered czech farsi laotian and
0:02:14	punjabi
0:02:15	and again i say arabic knackered be
0:02:18	mandarin in training and i and i guess that were couple
0:02:23	so what we did is that we did a phone call detection so we were
0:02:27	detecting the parts of
0:02:30	broadcasts where
0:02:31	there were some
0:02:32	and
0:02:33	telephone
0:02:34	conversations would be greater
0:02:37	and for each language we ran
0:02:40	automatic speaker labeling
0:02:43	because we didn't want the
0:02:46	speakers in the
0:02:47	train and test sets
0:02:49	to overlap
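
The speaker-disjoint split mentioned here can be sketched as follows. This is only an illustration under assumptions: the talk does not say how the automatic speaker labels were produced or what split ratio was used, so the function name, the `test_fraction` parameter and the input format are hypothetical.

```python
import random
from collections import defaultdict


def speaker_disjoint_split(segments, test_fraction=0.5, seed=0):
    """Split segments so that no speaker label appears in both parts.

    segments: list of (segment_id, speaker_label) tuples, where the labels
    are assumed to come from some automatic speaker labeling step.
    Returns (train_ids, test_ids).
    """
    by_speaker = defaultdict(list)
    for seg_id, speaker in segments:
        by_speaker[speaker].append(seg_id)

    # Shuffle whole speakers, not segments, to keep the parts disjoint.
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)
    n_test = int(round(len(speakers) * test_fraction))

    test = [s for spk in speakers[:n_test] for s in by_speaker[spk]]
    train = [s for spk in speakers[n_test:] for s in by_speaker[spk]]
    return train, test
```

Assigning whole speakers to one side is the design choice that matters: shuffling individual segments would let the same voice leak into both sets.
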
0:02:52	oh this is this is the scheme for a development set so we use the
0:02:57	lre eleven development data
0:02:59	so we made two sets actually one was the trusted data which was
0:03:04	actually based on nist so thirty seconds so cost
0:03:09	definition
0:03:10	we also ran automatic speaker labeling
0:03:14	and we split it into non-overlapping
0:03:17	training and
0:03:18	test parts
0:03:20	and then we took the entire conversations and in the thirty seconds excerpts
0:03:24	results
0:03:25	so was presented thirty second segments
0:03:30	and all splits from one conversation side to be trained and test set
0:03:37	again there was some
0:03:39	speaker automatic speaker labeling
0:03:42	of course we had more data but it was less reliable that's all seen our
0:03:46	contrastive system
0:03:47	oh
0:03:49	this doing this
0:03:51	helped a little bit
0:03:53	having the all the P V their position
0:03:56	oh
0:03:59	so
0:04:01	to give a little bit of statistics on our
0:04:05	dataset so we get the train set which are sixty six sre six two thousand
0:04:10	segments
0:04:11	and it was based of all kinds of sources
0:04:16	and we had the test set
0:04:17	which was thirty eight thousand segments
0:04:20	was based basically on a previous lre evaluations
0:04:24	and then be and test sets with service
0:04:26	on the other you
0:04:28	evaluations comprising one
0:04:35	so a little overview of our systems we had a submission of three
0:04:39	systems one primary and two contrastive
0:04:42	so the primary system consisted of one acoustic system which was based on the i-vectors
0:04:47	you will see the descriptions later
0:04:51	and then we had three phonotactic subsystems so that would
0:04:56	yeah diver system so we had
0:04:58	a binary decision tree system based on the english tokenizer then we had a pca
0:05:04	reduction
0:05:05	system based on the russian tokenizer
0:05:08	and then we had a multinomial subspace
0:05:11	i-vector system based on the
0:05:14	hungarian
0:05:15	tokenizer
0:05:17	the first contrastive system was the same as the primary what we did
0:05:21	is that we excluded the P two
0:05:23	that means that the entire conversations
0:05:28	we'll see the results
0:05:29	later
0:05:30	and then the contrastive two system was just a
0:05:33	fusion of the two best systems the acoustic and the english
0:05:37	the problem of the english was that on the development data it gave
0:05:42	very good results but as we will see it had kind of a problem i think
0:05:48	so there is a little diagram
0:05:50	of our system
0:05:52	in the in the first
0:05:56	at the very left we have the front end so we have the
0:06:00	acoustic i-vector the phonotactic i-vector and the pca
0:06:04	which basically convert the input of some form
0:06:09	into fixed-length vectors either i-vectors for the acoustic i-vector extractor or what we
0:06:14	also call i-vectors for the phonotactic i-vector extractor and pca
0:06:19	and after which we had to do some scoring
0:06:24	and then we used the binary decision tree model which was based basically
0:06:29	on a log likelihood evaluation
0:06:31	of the n-gram counts
0:06:33	itself so we already got the scores
0:06:36	which could then go to pre-calibration
0:06:40	both the scoring and pre-calibration are based on logistic regression
0:06:45	and the fusion is also based on logistic regression out of which we get twenty four log
0:06:50	likelihoods and then we do pairwise log likelihood ratios for
0:06:55	each of the pairs
0:06:58	this is just to show how the data described in
0:07:02	the previous section were used so the train database was used for the
0:07:06	front-end training and for the scoring
0:07:10	classifier training
0:07:13	and the dev and test databases were used for the
0:07:17	back end the pre-calibration and
0:07:19	fusion
0:07:24	so for acoustic system we use the hungarian phoneme recognizer based vad
0:07:30	oh basically to kill the
0:07:33	silence then we use the vtln
0:07:35	i'll features dithering
0:07:38	cepstral mean
0:07:41	and variance normalisation with rasta processing basically
0:07:46	what similar to that
0:07:48	right later previously
0:07:50	and the modeling was based on full covariance ubm with two thousand forty eight components
0:07:55	and the i-vector size was example
0:08:00	for the phonotactic systems
0:08:04	we used we used a diversity of techniques to
0:08:08	sources are tokenization
0:08:10	so that pca for each feature extraction
0:08:13	oh
0:08:14	was based on is something you
0:08:17	hungarian tokenizer
0:08:20	so what we do is that we take the counts then the square root of
0:08:23	the counts
0:08:25	then pca on top of that we reduced the dimension to six hundred
0:08:30	and we basically used it in the same way as the acoustic i-vector
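
The square-root step applied to the n-gram counts before PCA can be sketched like this. It is a simplified illustration: hard trigram counts from a single phone string stand in for whatever counts the tokenizer actually produced, the PCA step itself is left out, and the function name and fixed `vocabulary` argument are assumptions.

```python
import math
from collections import Counter


def sqrt_trigram_features(phones, vocabulary):
    """Return square-root compressed trigram counts in a fixed vocabulary order.

    phones: list of phone labels from a tokenizer (assumed one-best output).
    vocabulary: list of trigram tuples defining the feature dimensions.
    """
    counts = Counter(zip(phones, phones[1:], phones[2:]))
    # The square root reduces the dominance of very frequent trigrams
    # before a linear projection such as PCA is applied.
    return [math.sqrt(counts[trigram]) for trigram in vocabulary]
```
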
0:08:35	and we had a multinomial subspace modeling of the trigram counts which was based on the
0:08:39	russian tokenizer
0:08:41	so this is something a slightly newer and something that marshall
0:08:47	one
0:08:49	features
0:08:50	it is basically modeling the n-gram counts
0:08:56	in a subspace
0:08:57	of the simplex
0:09:00	the output of
0:09:01	such an approach is also
0:09:04	an i-vector like
0:09:06	feature
0:09:07	which we then again process the same way as the i-vectors
0:09:10	and then we had the binary decision tree which is basically a novel technique where
0:09:15	the decision trees are used to cluster the n-gram counts
0:09:18	and the log likelihood is used
0:09:21	as the score
0:09:25	so the scoring for the acoustic i-vector and the two phonotactic
0:09:31	i-vector systems was the
0:09:34	the input was usually the i-vector gonna six hundred dimensional or one thousand dimensional case
0:09:39	pca
0:09:40	we perform length normalization
0:09:43	then we performed within class covariance normalization
0:09:47	and after that as the
0:09:49	S R
0:09:51	classifier we used regularized multiclass logistic regression
0:09:55	with cross entropy objective function
0:09:57	the regularizer was the L two regularizer but the penalty was chosen without cross validation
0:10:06	and it was trained on the train database
0:10:09	the output
0:10:10	was
0:10:11	scores
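
A minimal sketch of two pieces of this scoring chain: length normalization of the input vector, and the objective of a multiclass logistic regression with cross-entropy loss and an L2 penalty. Within class covariance normalization and the actual optimizer are left out, and the function names, the `penalty` parameter and the data layout are illustrative assumptions, not the system's code.

```python
import math


def length_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned as is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def log_softmax(scores):
    """Numerically stable log of the softmax over class scores."""
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - lse for s in scores]


def objective(W, b, data, penalty):
    """Mean cross-entropy plus an L2 penalty on the weights.

    W: one weight row per class, b: one bias per class,
    data: list of (vector, class_index) pairs, penalty: L2 weight (assumed value).
    """
    cross_entropy = 0.0
    for x, y in data:
        x = length_normalize(x)
        scores = [sum(wi * xi for wi, xi in zip(w, x)) + bk for w, bk in zip(W, b)]
        cross_entropy -= log_softmax(scores)[y]
    l2 = sum(wi * wi for w in W for wi in w)
    return cross_entropy / len(data) + 0.5 * penalty * l2
```

With all weights and biases at zero the objective equals log K for K classes, which is a quick sanity check before any training.
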
0:10:13	what we did with
0:10:16	each set of twenty four scores is that we did pre-calibration
0:10:20	of each system so that was a full affine transform
0:10:27	and we used regularized logistic regression
0:10:30	which was trained on the test
0:10:32	and dev databases
0:10:37	and in the end we use the
0:10:40	we used the four systems
0:10:43	two
0:10:44	oh
0:10:45	with the constrained affine transform so instead of assigning each of the twenty
0:10:50	four scores of each of the systems an individual scale constant
0:10:53	we had one scale constant for one system
0:10:56	and we had a vector of offsets
0:11:01	and this logistic regression was also bayesian logistic regression stumps regularized
0:11:07	was trained on the test
0:11:09	and dev databases
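
The constrained affine fusion described here, one scale per system plus a shared vector of per-language offsets, can be sketched as below. The function name and argument layout are assumptions, and the scales and offsets would in practice be trained with logistic regression, which is not shown.

```python
def fuse(system_scores, scales, offsets):
    """Fuse per-language scores from several systems.

    system_scores: list over systems, each a list of per-language scores.
    scales: one scalar per system (the constrained part of the transform).
    offsets: one offset per language, shared by all systems.
    """
    fused = list(offsets)
    for scores, scale in zip(system_scores, scales):
        for k in range(len(fused)):
            fused[k] += scale * scores[k]
    return fused
```

For four systems and twenty four languages this gives 4 + 24 = 28 trainable parameters, compared with 4 * 24 + 24 = 120 if every score had its own scale.
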
0:11:13	as i said the decisions were done using the log likelihood ratios that came out
0:11:18	these are
0:11:20	the decisions for the
0:11:22	two seventy six scores were
0:11:24	converted from the twenty four scores
0:11:27	as log likelihood ratios
0:11:30	among all those pairs
0:11:33	oh
0:11:34	for
0:11:36	this is just a
0:11:38	the subtraction of course
0:11:39	we
0:11:40	we give a score
0:11:42	it is
0:11:44	and the decisions
0:11:46	were
0:11:48	all set to a threshold of zero
0:11:51	just a little comment that these decisions are invariant to scaling of
0:11:55	the log likelihoods
0:11:56	and
0:11:58	you relations by the heart language pairs
0:12:02	does not sell
0:12:03	just the calibration
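
A small sketch of how the pair scores follow from the per-language log likelihoods: each pairwise log likelihood ratio is a subtraction, and a decision at threshold zero only depends on the sign, so multiplying all log likelihoods by a positive constant leaves the decisions unchanged. The function names and the toy numbers are illustrative assumptions.

```python
from itertools import combinations


def pairwise_llrs(loglikes):
    """Map each unordered language pair (a, b) to loglike[a] - loglike[b]."""
    return {(a, b): loglikes[a] - loglikes[b]
            for a, b in combinations(sorted(loglikes), 2)}


def decide(llrs, threshold=0.0):
    """For each pair (a, b), pick a if the ratio exceeds the threshold, else b."""
    return {pair: (pair[0] if llr > threshold else pair[1])
            for pair, llr in llrs.items()}


# Toy log likelihoods for one test segment (made-up values).
loglikes = {"czech": -1.2, "slovak": -1.5, "polish": -3.0}
decisions = decide(pairwise_llrs(loglikes))
scaled = decide(pairwise_llrs({k: 0.3 * v for k, v in loglikes.items()}))
print(decisions == scaled)  # True
```
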
0:12:07	so the analysis that we used to
0:12:10	we so we
0:12:13	we fixed that the
0:12:15	denotes the use when designing the system is that
0:12:18	we fixed the twenty four worst pairs on our thirty second evaluations
0:12:24	and we compared
0:12:25	three different numbers the actual dcf
0:12:30	minimum dcf
0:12:31	start dcf which was based
0:12:33	on the
0:12:34	on that may cause recipe and mentioned monday
0:12:39	it's based on the
0:12:41	likelihood pre-calibration
0:12:45	so we will compare the development
0:12:50	and the evaluation sets
0:12:53	we will show the comparison of eight systems so we will have the four
0:12:58	individual systems hungarian phonotactic russian phonotactic english phonotactic
0:13:04	acoustic i-vectors and then we will present four fusions
0:13:09	so one is the primary
0:13:10	the first contrastive the second contrastive
0:13:13	and we also have a three system fusion which excluded the
0:13:17	english phonotactic system which somehow
0:13:20	misbehaved
0:13:21	as we will see
0:13:24	so this is the result for three seconds where we fixed the
0:13:28	pairs on thirty seconds
0:13:31	oh
0:13:32	as we see this is the this is the
0:13:35	this is the misbehavior of the system so the left bars are with the dev
0:13:42	set and the right ones with
0:13:45	the evaluation set and we see that the trend is
0:13:48	going like this in the dev set but
0:13:51	the
0:13:52	the english phonotactic system
0:13:54	speech
0:13:55	oh of that vector
0:13:58	yeah
0:14:04	this is for the for the ten seconds
0:14:11	so this contrastive one system is the system
0:14:16	where we excluded the dft to data which
0:14:21	comprise the entire segments so we see that there is a slight hit
0:14:25	compared to
0:14:26	the primary system so these two systems to remind you are
0:14:29	the very same except that
0:14:32	in the in the calibration and scoring
0:14:34	there are there are some data left out
0:14:39	oh a again the english
0:14:42	the english is very behaved here
0:14:45	oh we see that though
0:14:48	that the difference between the balloon the and the right one
0:14:52	well which is the difference between the minimum and actual is actually
0:14:56	we see that
0:14:57	the miscalibration was
0:15:02	no i mean
0:15:03	it was not a tragedy but
0:15:06	we didn't do very well the calibration which is even more simple on the thirty
0:15:10	second stuff so if we see that
0:15:12	that the miscalibration is much
0:15:15	much more reasonable
0:15:17	especially fusions
0:15:21	here on the on the contrastive one versus the primary
0:15:24	we see that excluding the data
0:15:29	really hurts
0:15:31	so
0:15:33	that's one thing
0:15:34	oh
0:15:36	on the evaluation so the three system fusion is equivalent to the primary but
0:15:41	excluding the english system
0:15:43	we see that on the development set it didn't do much
0:15:47	in fact the system is slightly worse excluding the english
0:15:50	system because the english performed very well
0:15:52	on the
0:15:54	development set
0:15:55	but excluding it
0:15:57	really helped on the evaluation
0:16:00	after the evaluation
0:16:07	so again just to summarise the observation is that there was
0:16:11	a big deterioration between the mindcf for development and
0:16:20	we differ
0:16:22	then there was no calibration disasters but on the thirty seconds as i pointed out
0:16:27	well we could have done better
0:16:29	the binary tree system was kind of screwed and
0:16:34	what we found out later is that if we apply
0:16:38	similar dimensionality reduction plus scoring techniques
0:16:41	even to the english tokens the system was good again so
0:16:46	so it was it was
0:16:49	due to the
0:16:50	the plane landed evaluation
0:16:53	and acoustic outperforms phonotactic almost everywhere that there were a couple of systems a couple
0:16:58	of language pairs where the where the acoustic
0:17:01	where the phonotactic was better
0:17:03	what we did is that we did an analysis of the
0:17:07	brno versus mit system since mit was the best
0:17:10	there was a weak correlation between sites in difficulty of pairs
0:17:15	the mean mindcf
0:17:17	was very similar
0:17:19	the mindcf
0:17:22	for the worst twenty four pairs was slightly worse for us than for mit
0:17:28	and in actual dcf
0:17:30	we had a big calibration hit
0:17:34	there's an interesting plot here which compares some of the
0:17:39	selected arabic dialects versus slavic languages where
0:17:45	somehow we knew that mit had more data for the arabic dialects
0:17:49	so we see that we do very poor
0:17:51	and some of the some of the pairs
0:17:54	iraqi arabic versus pashto
0:17:57	et cetera mostly due to the lack of data while on the
0:18:02	on the
0:18:03	the slavic languages
0:18:06	we had we do better
0:18:08	and some
0:18:08	selected pairs
0:18:10	so this is just two are just to show that
0:18:13	oh
0:18:14	the amount of data really
0:18:16	matters
0:18:19	this is just a correlation between some of the best of
0:18:23	unlike the end and be useful is the
0:18:26	but axes and use them mit excess
0:18:28	oh
0:18:30	we see that if we did the same thing
0:18:33	all the points we would be aligned
0:18:36	as the
0:18:37	the ability but the we see that some errors are really
0:18:41	of the
0:18:42	we did very differently
0:18:44	some of the past
0:18:47	and this is just to show all the worst
0:18:50	the worst the mindcf
0:18:53	versus actual dcf for the mit and but
0:18:57	brno system
0:18:59	so
0:19:00	oh
0:19:01	these are the average
0:19:03	points so we see that on average mit is better
0:19:09	so mit
0:19:11	and mit's points are more on a single line while our
0:19:14	systems are more scattered around here so this shows again
0:19:19	the
0:19:20	calibration hit
0:19:24	so to conclude we built several systems but we only selected four
0:19:28	for the primary fusion
0:19:32	again the acoustic outperforms the phonotactic
0:19:35	for the phonotactic we tried three different back ends and we saw that the
0:19:39	dimension reduction really helps
0:19:42	we had
0:19:44	a big hit
0:19:45	for the english phonotactic systems where there was a
0:19:49	i forgot to delete the we did not know why
0:19:52	we already is the
0:19:54	scoring
0:19:57	and probably we could use special detectors for selected pairs
0:20:01	that is that is
0:20:25	yes
0:20:26	oh
0:20:27	yeah
0:20:30	yes
0:20:33	oh
0:20:38	sorry
0:20:40	well
0:20:41	the unique and the shifted
0:20:43	yeah we use that
0:20:45	right
0:20:47	oh
0:20:48	so we use the six mfccs plus C zero yeah shifted again
0:20:55	okay
0:20:56	sorry
0:20:58	i
0:20:58	oh yeah
0:21:12	so
0:21:14	for the which so we use
0:21:18	we used the regularisation in our scoring and in our pre-calibration
0:21:23	so a little L two regularization
0:21:27	okay
0:21:32	oh
0:21:53	i
0:21:53	i
0:21:59	i
0:22:18	oh
0:22:31	so

SESSION 07: Language Recognition Evaluation

Ondrej Glembek