0:00:15 | So for this part I'll be presenting the MIT-LL side of the presentation for the submission. We have a number of slides kind of focused more on the systems, but I'll also add a little bit of analysis, and even some listening on top of that. |
0:00:34 | So in general I basically want to talk a little bit about the kinds of systems that we looked at, and then I'm going to talk about which ones we ended up using on the evaluation itself, the primary submission. I'll talk a bit about the development data we had, how we ended up using that data, and the things we looked at to try to augment that data. Then I'll show some of the evaluation results we had; like everybody else, we were kind of surprised when we first saw what happened on the evaluation compared to what we had seen on the development set. Then I'm going to end with some of the lessons and conclusions. |
0:01:06 | So in terms of systems, we looked at well over ten systems. Not surprisingly, the systems we looked at were either i-vector systems or DNN and bottleneck systems in some way. On the more conventional i-vector subset of systems, we had an SDC-type system with cepstra, and then we also had a system that was basically the same system with pitch added to it. |
0:01:35 | Then we had our set of DNN systems, with bottleneck features and DNN posteriors. Then we even looked at things like an MMI system, kind of a similar system, but in that case we were using bottleneck features instead of the conventional features we used in the past. |
0:01:55 | For the open task, and let me kind of emphasize this quite a bit, we also tried the multilingual system, and we used five of the Babel languages. And we also had a few other systems that were maybe on the slightly more experimental side: we had a kind of unit-discovery system, along the lines of what was described earlier, and we also had this DNN-counts multinomial model system, which is something that I think is going to be talked about a bit more in a later talk. |
0:02:25 | And it turns out that for calibration we really didn't do anything new relative to what we've done over the last years, maybe the last couple of evaluations, so there wasn't really anything new on that side. |
0:02:37 | Next I want to talk a little bit about the development data. As you have probably heard by now, we had the six language clusters. We did look at a wide variety of ways to augment that data, and in the end there wasn't really a whole lot that worked on the side of augmenting the data. We basically ended up with some duplication of the data, where we had the full segments, I mean the full utterances, plus segments we derived from that same data; so we kind of used the data twice, except we saw some duration variability. A lot of other things we tried, like doing some warping, pitch changes, changing the spectrum, looking at things like rate, in the end didn't really seem to help performance. |
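The duplication scheme described above, keeping each full utterance and also deriving shorter cuts from the same audio to introduce duration variability, can be sketched roughly as follows. This is a minimal illustration; the function name, chunk-length bounds, and segment representation are assumptions, not the actual MIT-LL tooling.

```python
import random

def augment_with_segments(utterances, min_len=3.0, max_len=30.0, seed=0):
    """Return the full utterances plus one randomly cut sub-segment each,
    so the training data covers more duration variability.
    utterances: list of (utterance_id, duration_seconds)."""
    rng = random.Random(seed)
    augmented = []
    for utt_id, duration in utterances:
        augmented.append((utt_id, 0.0, duration))  # keep the full cut
        if duration > min_len * 2:
            # derive a shorter cut from the same audio
            seg_len = rng.uniform(min_len, min(max_len, duration))
            start = rng.uniform(0.0, duration - seg_len)
            augmented.append((utt_id, start, start + seg_len))
    return augmented
```

Very short files are kept only once, since a derived cut would add no duration diversity.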
0:03:26 | One thing we did: we did not retrain our whole systems, but we did retrain the backend stage. So basically we kept our system specs from the data we had during development, but we did retrain the backend with basically a hundred percent of the data. That was mainly, by the way, for the fixed set. |
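The idea of retraining only the backend while leaving the front-end untouched can be sketched as below. The talk doesn't specify which backend was used, so this stands in a simple shared-diagonal-covariance Gaussian classifier over fixed i-vectors as an assumed example:

```python
import math

class GaussianBackend:
    """Minimal Gaussian backend over pre-extracted i-vectors.
    Only this stage is (re)trained; the i-vector extractor is frozen."""

    def fit(self, ivectors, labels):
        dim = len(ivectors[0])
        self.means, counts = {}, {}
        for x, y in zip(ivectors, labels):
            m = self.means.setdefault(y, [0.0] * dim)
            for i, v in enumerate(x):
                m[i] += v
            counts[y] = counts.get(y, 0) + 1
        for y, m in self.means.items():
            self.means[y] = [v / counts[y] for v in m]
        # shared diagonal variance, pooled across all classes
        var = [0.0] * dim
        for x, y in zip(ivectors, labels):
            for i, v in enumerate(x):
                d = v - self.means[y][i]
                var[i] += d * d
        self.var = [max(v / len(ivectors), 1e-6) for v in var]
        return self

    def log_likelihoods(self, x):
        """Per-language Gaussian log-likelihoods for one i-vector."""
        scores = {}
        for y, m in self.means.items():
            ll = 0.0
            for i, v in enumerate(x):
                d = v - m[i]
                ll += -0.5 * (d * d / self.var[i]
                              + math.log(2 * math.pi * self.var[i]))
            scores[y] = ll
        return scores
```

Refitting this object on 100% of the dev data is cheap compared to retraining the extractor, which is presumably why only the backend was refreshed.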
0:03:48 | So for the open set, of course, like everybody else, we did look at all of the sources we had available, and of course there were plenty of sources there. In the end, the only system that really benefited from having this additional data was the multilingual one. So basically most of the systems we used on the open set were as we had developed them on the fixed condition, except of course the multilingual, which needed all the extra data. |
0:04:15 | One thing I want to talk a little bit more about, and get into some specifics, is that during that development we did notice that using all the data we had available actually did not help performance. So after doing some early experiments we decided to only add data in a few of the languages, and I'm going to talk about whether that was the best decision or not. |
0:04:39 | So at least on the dev set we did see that adding data to those languages helped the performance. |
0:04:46 | In terms of the development results we saw, this addresses both the cluster averages in detail and what happened between the fixed set and the open set. This is what we saw: for the most part, Chinese and Iberian were kind of the most challenging ones on the dev set, but the performance in general seemed reasonable, so we were pretty happy with it; on average we were somewhere in the 0.10 neighborhood. |
0:05:15 | The other observation here, I think, is that on the open set we did see that we got a little bit of improvement over the fixed condition. Maybe we didn't see as much as we could have expected, but we saw some improvement, so that was also reassuring. |
0:05:34 | Now I want to talk a little bit about the evaluation results. On the evaluation results, the bad part is that we got a big discrepancy between what we saw on the dev set and what we saw on the evaluation set, right away a big jump, and some others also mentioned similar things here. We ended up submitting a five-way fusion of systems: we had the unit-discovery system, we had the counts system, we had the bottleneck features, and we had the pitch plus conventional system that we trained. The performance that we obtained was, on the C average, a little bit under 0.18. |
0:06:22 | And of course, as was also shown here, there is this issue of what happened with the French cluster, contrasting both the performance we had as a whole and the performance had we not had the French cluster. |
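For reference, the C-average numbers quoted throughout are in the spirit of the following simplified per-language average of miss and false-alarm rates. The actual evaluation metric has specific priors and cluster weighting; this is only an illustrative sketch over hard decisions:

```python
def simple_avg_cost(trials, languages):
    """trials: list of (true_lang, hypothesized_lang).
    For each language, average the miss and false-alarm rates with
    equal weight, then average the per-language costs."""
    costs = []
    for lang in languages:
        targets = [t for t in trials if t[0] == lang]
        nontargets = [t for t in trials if t[0] != lang]
        p_miss = (sum(1 for t in targets if t[1] != lang) / len(targets)
                  if targets else 0.0)
        p_fa = (sum(1 for t in nontargets if t[1] == lang) / len(nontargets)
                if nontargets else 0.0)
        costs.append(0.5 * (p_miss + p_fa))
    return sum(costs) / len(costs)
```

A dev-set C average near 0.10 versus an eval C average near 0.18 is the discrepancy being discussed.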
0:06:34 | One other observation here is that, like everybody else, we ended up using all the systems, and we had a greedy approach to fusion, kind of along the lines of what you saw in the last presentation. Then we sorted it out after looking at a big, long evaluation of all the fusions of n-way systems and five-way systems, and we ended up with this five-way fusion system. And it turns out that for the most part we were not necessarily that far off from the best performance we could have obtained; when we look at what our best selection would have been, we actually would have been very little off from kind of the oracle system. |
0:07:13 | Other than that, one of the observations is that the best single system we had, in terms of performance, was the bottleneck feature system, closely followed by the others. |
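The greedy fusion selection mentioned above, starting from the best single system and repeatedly adding whichever system most improves the fused dev-set cost, can be sketched like this. Fusion is simplified to an equal-weight score average; an actual submission would presumably use trained fusion weights:

```python
def greedy_fusion(system_scores, labels, cost_fn):
    """system_scores: {name: [score, ...]} aligned with labels.
    Greedily grow the fused system set while the cost keeps improving."""
    def fuse(names):
        n = len(system_scores[names[0]])
        return [sum(system_scores[s][i] for s in names) / len(names)
                for i in range(n)]

    remaining = set(system_scores)
    chosen, best_cost = [], float("inf")
    while remaining:
        cand, cand_cost = None, best_cost
        for name in remaining:
            c = cost_fn(fuse(chosen + [name]), labels)
            if c < cand_cost:           # strict improvement required
                cand, cand_cost = name, c
        if cand is None:                # no system helps the fusion: stop
            break
        chosen.append(cand)
        remaining.remove(cand)
        best_cost = cand_cost
    return chosen, best_cost
```

Run on dev scores, this reproduces the "keep adding systems until fusion stops improving" procedure that led to the five-way submission.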
0:07:26 | Of course, and this is something that has been talked about quite a bit by now, there was this issue with the French cluster, and there were really two main things that came up that we talked about. The first one was that it seems like we were really building a channel detector, which is kind of what was mentioned; and then there were other things that we heard from LDC at the workshop, that there might be other issues, not only channel. So before I forget, I want to draw a connection to the earlier discussion: we did do a lot of analysis on the channel issue in 2009. One thing, and maybe I can say something different from what everybody else has said, is that what we analyzed in 2009 was mainly based on the language, and here we are kind of at the cluster level, which may or may not add to the discussion of why we're seeing that this seems to be pointing at the channel side, even though apparently, when people listened to it, those differences might not be there. |
0:08:39 | Going into more detail on this issue with the French cluster, we did see here that things do seem to line up by channel, and the big factor, I think it was mentioned earlier, had to do with the fact that for one of the languages we just did not have any data on that channel. So it seemed like the channel that we saw on the eval, which was not available during the dev set, pushed the system toward going for the actual channel instead of the language. And you can see here that there doesn't seem to be a big difference in being able to tell the language classes apart; it seems to be more on the channel element. |
0:09:18 | One thing we did do: we said, well, maybe it's just the nature of the problem. So we looked at a different cluster, the Slavic cluster, which was Polish and Russian, and when we looked at that cluster we didn't seem to observe the same issue with this kind of channel alignment. So even though there is a channel element here too, we were able, to some extent, to tell the classes apart a lot better than we were able to do on the French cluster. |
0:09:50 | Now, moving to the open condition: the main difference here, like I said, was that we had this multilingual bottleneck feature system, and that actually replaced the bottleneck system we had on the fixed condition. And once again, the performance here was a little bit better; not substantially better than on the fixed condition, but a little bit better. And like I said, the multilingual bottleneck seemed to be the one thing that made a difference, that was actually different in this case. |
0:10:27 | Like I said earlier, one thing that came up, and we were a little bit surprised by it, was the fact that using extra data did not seem to help on the development set. Here you're looking at what happened in the case of Arabic: we added Arabic data in a number of ways, and you can see, on the lower right corner there, that for the most part it didn't seem like it made a big difference. There's only one particular scenario where we got a little bit of an improvement, but it's not like there's a way of adding data where we seemed to consistently be able to get improvements. |
0:10:59 | One thing that also came into play was what happened when we looked at this after the evaluation, and I think others have also addressed some of this: even though we did not see any improvements by adding data on the development set, we would have gotten substantial improvements had we carried all that data into the eval. Of course, one issue is that a lot of that has to do with this labeled data that had that particular channel in it, and whether there was some data in there that happened to be the same data or not; we didn't go in and look precisely at whether it's exactly the same collection, and of course we're expecting that maybe it's not necessarily the same. But it would have substantially changed our performance, maybe on the order of thirty to forty percent. |
0:11:50 | Another thing we did a bit of after the eval was to keep looking at these multilingual bottleneck features, and note that these numbers are scored on our dev set. One thing we saw was that we also got some improvements with the multilingual bottleneck feature system as we changed the diversity of languages in it. This is not completely linear, meaning it doesn't mean that as we go from five to seven to ten to fifteen languages it's always improving; it's still something we're getting a better handle on. But it seems like there is obviously some relation between the diversity of the languages we used to train the bottleneck and the performance, and once again, at this point we've probably seen as much as ten to fifteen percent improvement. |
0:12:38 | Another thing, and this idea actually puzzled people: I tried to listen to the languages that I know, so Spanish and English. The idea was, for our system, and once again this is my assessment and I'm not a linguist: if I listened to some of the errors we had, is there anything I could hear that seems to be systematic? Once again, this was for our submission. We had a number of errors; I think for the whole eval we had on the order of two thousand errors or so. So what if I just randomly picked fifty in each of these two languages, listened to them, and figured out whether there is anything that seems to be somewhat systematic? |
0:13:20 | In the Spanish case there were two things that seemed common. The first one, which was a little bit surprising: it seemed like we had quite a problem with Puerto Rican Spanish. Once again, I don't know why, necessarily, but one idea that comes to mind is maybe it was somewhat underrepresented in the training. And by the way, when I say Spanish errors, I mean Spanish errors: I took all of the errors from the Iberian cluster, among the three classes in there. |
0:13:52 | One other point is that I saw error examples across all durations: I probably listened to maybe a handful of forty-second cuts, maybe ten or so on the order of ten seconds, and maybe like seventy percent of the cuts were in the low twenties or on the three-second side of the range, and that applied to both cases, actually. |
0:14:17 | So one thing that I also want to mention: we actually had, within the Spanish side, between five and seven cuts that had either nonspeech on them, or things like laughter or something, so how much you should be able to detect language from that, I'm not quite sure. Obviously, having five such cuts in there seems like it might be a big number, but whether that would usefully extend to the whole set of errors we had is not clear; at least that's the observation on this limited set of data that I listened to. On the English side we also had a similar issue: basically empty, nonspeech files. Most of them were on the three-second side, but even on some of the ten-second cuts we would have this nominal ten seconds of speech, and then you'll see that the person comes in at first and maybe speaks for a second, then there's nothing left, and that gets detected, and then they come back again with maybe laughter or something. So there was a little bit of that. Once again, I guess to some extent that's reality, but it's something peculiar that I wanted to bring up, or bring to your attention. |
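Screening for the nearly-empty cuts described above, nominal ten-second files where the speaker says a word and then goes silent, could be done with a crude energy-based check like the following. The frame length and thresholds are arbitrary placeholders for illustration, not anything used in the evaluation:

```python
def speech_fraction(samples, frame_len=160, energy_thresh=1e-3):
    """Fraction of frames whose mean energy exceeds a threshold;
    a crude proxy for how much actual speech a cut contains."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    if not frames:
        return 0.0
    voiced = sum(1 for f in frames
                 if sum(x * x for x in f) / len(f) > energy_thresh)
    return voiced / len(frames)

def flag_near_empty(samples, min_fraction=0.1):
    """Flag cuts where almost no frames look voiced."""
    return speech_fraction(samples) < min_fraction
```

Running such a filter over the error cuts would quantify how many "errors" are really files with almost no usable speech.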
0:15:33 | The other thing was that, once again on this limited sample, on the English side it seemed like most of the errors I saw were between British English and American English. There were maybe five or so cuts I listened to that in one way or another involved Indian English, but most of them, maybe like I said on the order of eighty percent or so, were actually confusions between British English and American English, and I think that's actually a peculiar one. |
0:16:08 | We're running a bit long, so let me quickly go through the conclusions. We did see that there was a little bit of improvement with the multilingual features. Needless to say, bottlenecks and DNN-based i-vectors dominated. We're still kind of parsing out this issue with the French cluster; I actually saw a presentation yesterday where I think they folded some of the data across, like training with some of the eval data, and it seemed like they got really big improvements by using a little bit of that data for training. So it does seem like having the channel represented in the training would improve performance quite a bit. And there was also this issue of adding more data to a system and it not helping; like everything else, hindsight is twenty-twenty, so now we know. And once again, I guess the general question for the future is: should we focus on some particular conditions, or think about it in terms of robustness? |
0:17:11 | Right now we have time for some questions. |
0:17:24 | So we were commenting, just a week ago, that probably these errors that we have in the Spanish cluster could also be due to the dialects of the speakers, because it raises the question of whether the Spanish you get from a speaker from the south of Spain is closer to Caribbean Spanish or to the regular Spanish from Spain. |
0:17:55 | I mean, in my personal experience, I find that people, I think from Andalucía for example, sound very close to the people in Puerto Rico, way closer than people from Madrid or anywhere else. And so I saw a lot of those errors, like people that I would hypothesize were from the south of Spain. Once again, at least for our system, it seems like this Caribbean confusion was something that actually came up. But absolutely, in my limited understanding and knowledge about this, I would say I would have expected that, because the way that people from, say, Andalucía sound to me, they would kind of drop the last syllables, and it seems like that is precisely the way people from Puerto Rico would do it. |
0:18:49 | Questions? |
0:18:55 | Thank you for your presentation. One of your slides said that for your open-set task you didn't use all the data sets for training this out-of-set model, right? |
0:19:07 | Right, so if I recall correctly, what I said was that we only used the open data for the multilingual model, no? |
0:19:24 | A lot of open-set data, yeah. |
0:19:27 | Once again, if I remember, not necessarily all the data; the multilingual was trained on the five well-labeled Babel languages. |
0:19:36 | Ah, sorry. And you have mentioned here that adding more data did not solve the problem. Which data did you add to your set? Was it a per-language addition or just blind? |
0:19:50 | No, absolutely, it was selective. Like I showed, I think on that Arabic slide, that is just an example of one. In this case we basically added more Arabic data and basically looked at what we were getting on the test set. So obviously, like everything else, you're doing the best you can on the dev set and hoping it makes a good prediction of what you're going to see on the eval set. What we had observed here was that adding more training data, in this just one example, did not seem to help. So for the training problem we had, which at first included all the data that was in there from all the sources we had available and didn't actually seem to work, we backtracked and only added training data in some of the languages. Once again, we didn't necessarily go back and redo all of this on the eval data systematically. I mean, we did the analysis of: if we had done all the systems with the whole training data, that would have been better, mostly because of the French cluster, because we would have had labeled data that represented that channel. But we don't know what would have happened had we done it systematically; I don't know, on this language it would have helped, on that language it would have hurt. |
0:21:08 | Okay, thanks. |
0:21:14 | Any other questions? |
0:21:19 | I'm going to ask the question on the slide that you have here, the one with the four cuts with the laughter or whatever. I'm going to ask whether it's the speech assignment there, because, you know, when we did our test, if you threw away all the speech and just used what you thought was silence, you still got about five percent. |
0:21:41 | I'm just... I'm not sure, right; I guess it may be channel dependent. |
0:21:51 | Other questions? |
0:21:53 | People are usually good, I think. |
0:21:57 | Okay, let's thank the speaker again, from MIT Lincoln Laboratory. |