0:00:06 Good morning. What I would like to present here today is our Language Recognition Evaluation 2009 submission. We did a lot of work after the evaluation to figure out what happened, because we actually saw a big difference between the performance on our development data and the performance on the actual evaluation data.
0:00:36 So, first I will try to explain what was new in language recognition in two thousand nine, where some new data appeared. Then I will go through a very quick and brief description of our whole system, and then I will concentrate on the issues of calibration and data selection, and on how we resolved the problems with our original development set. Then I will try to conclude our work.
0:01:12 So, what was new in two thousand nine: a new source of data came into language recognition. These data are broadcasts from the Voice of America. There is a big archive of these broadcasts covering many languages, and from this archive only the detected telephone calls were actually used. These data brought a big variability on top of the original CTS data we had always used for training our language ID systems, so they brought some new problems with calibration and channel compensation.
0:01:57 These are the languages which are present; I would have to check whether they are all still present in the Voice of America archive. As you can see, the number of languages here is very large, and it gave us a very nice dataset to test our systems on and the ability to improve language recognition systems so that they can classify many more languages.
0:02:31 For the two thousand nine NIST LRE, these are the twenty-three target languages. The bold ones are the languages for which we had only data coming from the Voice of America archive, so there was no CTS data for training on these languages. For the other languages we also had normal conversational telephone speech data recorded by the LDC for previous evaluations and also for the two thousand nine evaluation. So we had to deal with this issue and do proper calibration and channel compensation.
0:03:22 What motivated us after the evaluation to do this work, to go back to our development set and to run a lot of experiments, was that we saw a huge difference between the performance on our original development set and on the evaluation set collected by NIST. All of the numbers you will see here will be the average detection cost defined by NIST.
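For orientation, here is my own simplified rendering of that cost over the N closed-set target languages (the official LRE 2009 definition also includes an out-of-set term, omitted here); the values quoted later in the talk appear to be this quantity times one hundred:

C_{avg} = \frac{1}{N}\sum_{L_T}\Big[ C_{miss}\, P_{tar}\, P_{miss}(L_T) \;+\; \frac{C_{fa}\,(1-P_{tar})}{N-1}\sum_{L_N \neq L_T} P_{fa}(L_T, L_N) \Big], \qquad C_{miss}=C_{fa}=1,\quad P_{tar}=0.5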
0:03:55 At the language recognition workshop there were a lot of discussions about the crafting of development sets for the systems. Some people created rather small and very clean development sets; we had actually a very huge development set containing a lot of data, which brought some computational issues when training the systems, but we decided to go with this big development set. In the end it did not turn out to be the best decision, but we had to live with that.
0:04:42 So, a presentation of our system, of what we had in the submission. We had two types of front ends. The first are acoustic front ends, which are based on GMM modelling, and the features are MFCC-derived; actually these are the popular shifted delta cepstral features. Among the acoustic systems we had a JFA system, we tried a new feature extraction based on RDLT, and we had a GMM trained with the maximum mutual information criterion using channel-compensated features; we also tried a plain GMM with the RDLT features without any channel compensation. We performed vocal tract length normalisation, cepstral mean and variance normalisation, and we did the voice activity detection using our Hungarian phoneme recogniser, where we mapped all of the phonemes to speech and non-speech classes to make the decision.
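As an illustration only, not the actual feature code of this submission, here is a minimal sketch of shifted delta cepstra, assuming the common 7-1-3-7 configuration:

import numpy as np

def sdc(mfcc, N=7, d=1, P=3, k=7):
    # Shifted delta cepstra over a (T, >=N) matrix of cepstral coefficients,
    # assuming the common N-d-P-k = 7-1-3-7 configuration.
    c = mfcc[:, :N]
    T = len(c)
    cp = np.pad(c, ((d, k * P + d), (0, 0)), mode="edge")  # pad so all shifted indices stay in range
    deltas = []
    for t in range(T):
        blocks = [cp[t + d + i * P + d] - cp[t + d + i * P - d] for i in range(k)]
        deltas.append(np.concatenate(blocks))
    return np.hstack([c, np.asarray(deltas)])  # static cepstra plus k*N delta dimensions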
0:05:55 Then, it is a standard JFA system, as you can see, but this time of course without the eigenvoices; there is only the channel variability present. So we have a supervector of GMM means for every speech segment, which is then channel dependent. The channel loading matrix was trained using the EM algorithm, and five hundred sessions per language were used to train it. The language-dependent supervectors were adapted using relevance MAP, also trained using the five hundred segments per language.
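In equations, as my own compact summary of the standard language-ID flavour of JFA described here (not copied from the slides), the GMM mean supervector of a segment from language L is modelled as

\mathbf{s} = \mathbf{m}_L + \mathbf{U}\,\mathbf{x}, \qquad \mathbf{x} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),

where U is the channel loading matrix trained by EM and x is the segment's channel factor. The language supervectors come from relevance MAP adaptation of the UBM supervector, per Gaussian c:

\mathbf{m}_{L,c} = \alpha_c\,\tilde{\mathbf{m}}_{L,c} + (1-\alpha_c)\,\mathbf{m}_{0,c}, \qquad \alpha_c = \frac{n_{L,c}}{n_{L,c} + \tau},

with n_{L,c} the amount of data assigned to Gaussian c, \tilde{\mathbf{m}}_{L,c} the corresponding data mean, and \tau the relevance factor.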
0:06:49 This is actually the core acoustic system here, because it also uses our RDLT features, and as you will see later on, we decided to drop the RDLT features and use just the JFA system with the plain shifted delta cepstra.
0:07:11 We tried a new discriminative technique to derive our features. This technique is based on region dependent linear transforms; it was introduced in speech recognition, where it is known as fMPE. The idea is that we have some linear transformations which take our features, and we then take a linear combination of these transformations to form a new feature which is discriminatively trained. I borrowed a picture and I will try to describe, at least very briefly, what is going on. At the start we have some linear transformations; in the beginning they are initialised so that they create just the shifted delta cepstral features. We have a GMM which is trained over all languages and which is supposed to select the transformations in every step; it actually provides the weights with which we combine the transformations. So for every twenty-one frames, we take the twenty-one frames of MFCCs, put them into the GMM, take the most likely Gaussian components, which provide us the weights, and we linearly combine the transformations according to these weights. Usually it happened that only one to three Gaussian components were non-zero for these twenty-one frames, so not all of the transformations were combined; all the other weights are set to zero. Then we take the linearly combined transformations and sum them up, and there is a GMM which evaluates these features; according to the training criterion we update the linear transforms, and then we move on to the next twenty-one frames and keep training the system. In the end, after the training, these are the features we feed into our JFA.
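A minimal sketch of how such region dependent linear transforms might be applied at run time, as my own illustration with hypothetical shapes (the discriminative update of the transforms during training is not shown):

import numpy as np

def rdlt_features(context, means, covs, weights, transforms, top_k=3):
    # context:    stacked features of one window (e.g. 21 frames of MFCCs), shape (D,)
    # means, covs, weights: diagonal-covariance GMM with R components over such windows
    # transforms: one linear transform per GMM component ("region"), shape (R, F, D)
    # Returns one F-dimensional discriminatively derived feature vector.
    ll = -0.5 * (((context - means) ** 2) / covs + np.log(2 * np.pi * covs)).sum(axis=1)
    ll += np.log(weights)
    post = np.exp(ll - ll.max())
    post /= post.sum()
    keep = np.argsort(post)[-top_k:]            # only the few most likely regions survive
    gamma = np.zeros_like(post)
    gamma[keep] = post[keep] / post[keep].sum()
    return sum(gamma[r] * (transforms[r] @ context) for r in keep)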
0:09:43 The next acoustic system was a GMM which was discriminatively trained using the maximum mutual information criterion, and we used features which were channel compensated.
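As a reminder, in my own notation rather than a formula from the talk, MMI training of the language GMMs \lambda maximises the posterior of the correct language for every training segment:

\mathcal{F}_{\mathrm{MMI}}(\lambda) = \sum_{s} \log \frac{p_\lambda(\mathbf{X}_s \mid L_s)\, P(L_s)}{\sum_{L} p_\lambda(\mathbf{X}_s \mid L)\, P(L)},

where X_s are the (channel-compensated) features of segment s and L_s its true language label.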
0:10:04 So that was it for the acoustic subsystems; now some comments on the phonotactic ones. The core of our phonotactic systems were of course our phoneme recognisers. The first one, the English one, is a GMM-based phoneme recogniser built on the triphone acoustic models from an LVCSR system, but with just a simple language model. The two other phoneme recognisers for the phonotactic systems, the Russian and the Hungarian ones, are neural network based: the neural network estimates the posterior probabilities of the phonemes and then feeds them to the HMM for decoding. These phoneme recognisers were used to build three binary decision tree language models and one SVM system, which was based on the Hungarian phoneme recogniser. Here four-grams were used, and the SVM was actually using only the trigram lattice counts as features.
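A minimal sketch of the phonotactic SVM idea, as my own illustration: it uses plain 1-best phoneme strings, whereas the system described here used expected trigram counts collected from lattices.

import numpy as np
from itertools import product

def trigram_count_vector(phones, inventory):
    # Normalised trigram counts of one decoded phoneme string.
    index = {t: i for i, t in enumerate(product(inventory, repeat=3))}
    v = np.zeros(len(index))
    for t in zip(phones, phones[1:], phones[2:]):
        v[index[t]] += 1.0
    return v / max(v.sum(), 1.0)

# Hypothetical usage with scikit-learn, one linear SVM per target language:
#   from sklearn.svm import LinearSVC
#   X = np.vstack([trigram_count_vector(p, inventory) for p in decoded_segments])
#   svms = {lang: LinearSVC().fit(X, [l == lang for l in labels]) for lang in set(labels)}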
0:11:21 Then we were doing the fusion. We used the multi-class logistic regression from the FoCal toolkit. The new thing is that, for the first time, we did not train three separate backends, one for each duration condition; we tried to do a duration-independent fusion. Every system was outputting some raw scores, and in addition it was also outputting some information about the length of a segment, which for the acoustic systems was the number of frames, while the phonotactic systems provided the number of phonemes. These raw scores from every system were then going into a Gaussian backend. We had three Gaussian backends per system, because we used three kinds of length normalisation: either we divided the scores by the length, or by its square root, or we did not do anything. Then we put all of the log-likelihoods from these Gaussian backends into the multi-class logistic regression, which is discriminatively trained and outputs the calibrated language log-likelihood scores.
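A minimal sketch of this backend and fusion stage, as my own illustration with hypothetical array shapes (the actual submission used the FoCal toolkit for the multi-class logistic regression):

import numpy as np
from sklearn.linear_model import LogisticRegression

def length_normalise(raw, length, mode):
    # The three variants mentioned in the talk: none, divide by length, divide by its square root.
    return {"none": raw, "len": raw / length, "sqrt": raw / np.sqrt(length)}[mode]

def gaussian_backend(scores, labels):
    # Per-language Gaussians with a shared covariance over raw score vectors;
    # returns a function mapping one score vector to per-language log-likelihoods.
    labels = np.asarray(labels)
    langs = sorted(set(labels))
    mu = np.array([scores[labels == l].mean(axis=0) for l in langs])
    centred = np.vstack([scores[labels == l] - mu[i] for i, l in enumerate(langs)])
    icov = np.linalg.inv(np.cov(centred.T) + 1e-6 * np.eye(scores.shape[1]))
    return lambda x: -0.5 * np.einsum("ij,jk,ik->i", x - mu, icov, x - mu)

# Hypothetical final stage: stack the backend log-likelihoods for the three length
# normalisations plus the segment lengths, then calibrate with multi-class logistic regression:
#   fused = np.hstack([ll_none, ll_len, ll_sqrt, lengths[:, None]])
#   calibrator = LogisticRegression(max_iter=1000).fit(fused, labels)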
0:12:51 So here is a scheme of the fusion. Again, each system outputs its score, and the score is either taken as it is, or normalised by the square root of the length, or divided by the length; the outputs of the Gaussian backends then go, together with the information about the lengths, into the discriminatively trained multi-class logistic regression.
0:13:22 So, the actual core of this paper was to go through our development set and decide what the problem was and how to address it. We were lucky that our friends in Torino provided us with their development set, so we were able to do this analysis. Actually, in Torino they had a much smaller development set than we had: if I remember correctly, it contained about ten thousand segments from thirty-three or thirty-four languages. Our development set was very huge: it contained data from fifty-seven languages and about sixty thousand segments.
0:14:14 So we did an experiment where we tried to recreate the calibration using the whole training and development sets; of course we had both our own training and development data and the Torino data, and we did four types of experiments: either we were training our system and calibrating it on the Politecnico di Torino development set, or we trained on the LPT set and then calibrated on our set, or we trained on our set and calibrated on the LPT set, or we trained on our set and calibrated on our set. The violet columns are our original scores. This analysis was of course done using only one acoustic subsystem, the JFA system, because it would not have been feasible to run all of the systems through the training again.
0:15:16 As you can see, we had some serious issues for some languages; actually, these were the languages for which only the Voice of America data were available. So Bosnian was an issue: you can see a big difference between the LPT set and our set. The blue column is training on our set and using the Torino development set for calibration, so there must have been some bothersome issue in our development set. The problem languages were Bosnian and Farsi, and also in the final score we were losing some performance everywhere.
0:16:08 So we tried to focus on these languages and find what issues we had in our development set. The first issue we found was ridiculous: we had mislabelled one language in our development set; actually there was one label for Farsi and another for Persian, and we treated them as different languages. We corrected this, and the problems for that language mostly disappeared. The next problem we addressed was finding the repeating speakers between the training and the development sets, because based on the discussions at the language recognition workshop we already suspected this could be a problem for our training and development data.
0:16:58 So what we did: we took our speaker ID system from previous evaluations, which is a GMM-based speaker ID system, trained a model for every training segment within a language, and tested it against the segments in the development set. What we ended up with was this bimodal distribution of scores. This part here, the high speaker ID scores, shows that there are some recurring speakers between the training and the development sets.
0:17:46 When we looked at these pictures, we decided to threshold the data and to discard from our development set everything with a speaker ID score higher than the threshold, which for this Ukrainian language, for example, was twenty. When we did this experiment, we discovered that for some languages we were discarding almost everything from our development set; for example, for Bosnian we ended up with just fourteen segments in our development set. For the other languages where we were doing the speaker identification filtering we also discarded a lot of the data; for example for Ukrainian only twelve segments remained.
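A minimal sketch of this filtering step, as my own illustration; speaker_id_score is a hypothetical stand-in for the GMM speaker-ID scoring used by the group, and the threshold of twenty is the one quoted for Ukrainian:

def filter_dev_set(train_segments, dev_segments, speaker_id_score, threshold=20.0):
    # Keep only development segments whose best speaker-ID score against any
    # training segment of the same language stays below the threshold.
    kept = []
    for dev in dev_segments:
        same_lang = [t for t in train_segments if t.language == dev.language]
        best = max((speaker_id_score(t, dev) for t in same_lang), default=float("-inf"))
        if best <= threshold:  # low score: most likely a new speaker, keep the segment
            kept.append(dev)
    return kept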
0:18:39 So what was the performance change when we did this? Merely correcting the label already showed some improvement, and then the speaker ID filtering made quite a huge difference in the performance. Again, these are the results for our acoustic subsystem, the JFA system with the RDLT features.
0:19:17 When we did this, we decided to run the whole fusion on our filtered data. Note that we did not change anything else; we did not retrain any subsystem we had in the submission for the NIST language recognition evaluation, we just filtered out scores from our development set and ran the fusion again. And we were gaining some quite substantial performance improvements. For the thirty-second condition, the C average went from two point three to one point ninety-three, which is quite a nice improvement, and if you look at the table, for every duration there is an improvement; I think there is no number which deteriorated. So it worked over all the conditions, over the whole set, for every language and for every duration.
0:20:25 What we also saw here was a slight deterioration of the results on our own development set. The cause could be that our system was actually partly trained on the speakers, and for some languages it can recognise the speaker better than the language.
0:20:55 So then we decided to work on our acoustic system, the JFA RDLT system, because we wanted to do other experiments to improve the final fusion. What we did was simply discard the RDLT features, use the plain shifted delta cepstra, and retrain the system, and there was some improvement out of this. Also, we trained the JFA using all the segments per language instead of five hundred segments per language, and this brought some nice improvement.
0:21:41 So when we did the final fusion, we discarded the RDLT JFA and replaced it with the normal JFA system; the MMI system still remained in the fusion, and instead of all the other binary trees and that one SVM we put in quite a lot of SVM systems which are phonotactic based, built on all of our phoneme recognisers. There will be a talk at two p.m. which will explain more about this system. When we did this, the final fusion went from one point nine, as we saw previously, to one point fifty-seven, which is a very competitive result; of course, it is a post-evaluation result.
0:22:44 So what are the conclusions of this work? We really have to care about our development data, and rather than creating a huge development set it is better to pay attention and have a smaller but filtered and clean development set. We actually did experiments with giving more data to our systems, and it did not help us. The problem of the repeating speakers between the training and the development sets was quite large, and we should pay attention to it when we are preparing the next evaluations, so that this is handled well. So, thank you.
0:23:33 [Audience question, largely inaudible; apparently about the Torino development set and whether it was filtered for repeating speakers.]
0:24:02 We looked at this, and we talked with them at the workshop; they were doing the speaker filtering as well. But we did not filter their set against our own training set, so even if some repeating speakers remained in there, we just used it as it was.
0:24:26 [Audience question, partly inaudible; whether the same speakers also appear in the evaluation set.]
0:24:36 We do not know that, and we did not check it. We just wanted to treat our evaluation set as an evaluation set, so we did not look into it.
0:24:48 [Audience follow-up, partly inaudible.]
0:24:53 Well, I think there are not that many repeating speakers in the evaluation set, because as I understood it, NIST was using some previously recorded data, and it is much less likely that there will be repeating speakers there again. For some of them it can of course happen, but we did not actually check it.
0:25:19 [Audience question, partly inaudible; about which of the systems actually turned out to work best.]
0:25:33 Yeah, it is like that. We were making a lot of effort to try this new RDLT technique of ours, and it did not work. What was working was combining scores of many phonotactic systems, as you did in your submission: just combining thirteen PCA-based SVM systems built on our phoneme recognisers gave actually very nice results, quite comparable, with the number at one point seventy-eight. These SVM systems alone were better than our final submission, even after the filtering of the calibration data.
0:26:16 [Inaudible exchange.]