0:00:13 Hi. I am going to talk today about an approach to the selection of high-order n-grams for phonotactic language recognition.
0:00:23 Lunch time is about to arrive, so we will try to go quickly.
0:00:30 I will do a short introduction, then show you the feature selection method that we present, then show you the experimental setup and results, and I will finish with a short summary of the work.
0:00:45 So, the motivation. The main reason is that in phonotactic language recognition, higher-order n-grams are expected to carry more discriminative information from the language, I mean, more language-specific information.
0:01:00 The problem is that the number of n-grams grows exponentially as N increases, so there are computational issues, and that is why many systems usually limit the order to three or four.
0:01:20 And we cannot directly apply dimensionality reduction techniques like PCA, due to such a huge space.
0:01:28 There have been some works related to feature selection; I would mention just two of them.
0:01:37 The first one applied what the authors called a filter method, which was used to select the most discriminative n-grams based on SVM weights; those n-grams were ranked to get a subset of n-grams.
0:02:00 The second is quite a similar work: they used the same filter method, but with two discriminative criteria. The first was an SVM-based criterion, basically the same as in the previous work, and second, they also used a chi-square measure.
0:02:20 The fact is that in both cases there was no improvement, or even some degradation, when higher than 4-grams were used.
0:02:30 We also faced quite a similar problem in a previous work, where we did phonotactic language recognition using cross-decoder co-occurrences of phone n-grams.
0:02:47 In that case the feature space was very big, and the key idea we used to do a frequency-based feature selection was to build the sparse vectors of counts using only the most frequent units.
0:03:05 But the problem is that for higher-order n-grams the feature space is really huge, so even a simple frequency-based selection can be a challenge.
0:03:18 So, let's do frequency-based feature selection. Let K be the number of phonetic units of an acoustic decoder, and consider all possible n-grams up to a given order.
0:03:30 If the number of units of an acoustic decoder were, say, forty, and seven were the order of the n-grams, the feature space would be really huge.
0:03:40 But we must take into account that most of those features will not be seen in the training data, so we can forget them. And even most of the seen features will have very low counts, so we can forget them too and simply select the most frequent ones.
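To make the size of that space concrete, here is a quick sketch (the forty units and order seven are just illustrative assumptions):

```python
def ngram_space_size(k: int, n: int) -> int:
    """Number of possible n-grams up to order n over k phonetic units:
    k^1 + k^2 + ... + k^n."""
    return sum(k ** i for i in range(1, n + 1))

# Up to trigrams the space is still manageable...
print(ngram_space_size(40, 3))   # 65,640 candidate features
# ...but up to 7-grams it explodes into the hundreds of billions.
print(ngram_space_size(40, 7))   # 168,041,025,640 candidate features
```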
0:04:03 The problem is that with such a huge feature space, we cannot even store all the cumulative counts of all the features seen in the training set.
0:04:23 So, instead of storing all the cumulative counts directly, we must estimate them.
0:04:32 The idea is quite simple. In one pass over the full training set, we want to dynamically build a table with all the cumulative counts of the seen features, but keeping it pruned when it grows too much.
0:04:45 So, every T collected counts, we retain only those entries with counts higher than a given threshold tau; all the other entries are discarded, that is, their counts are set to zero.
0:05:03 Both T and tau could be constants, but the parameters must be tuned so that the tables do not grow too big; in our experiments we usually get intermediate tables around ten times bigger than the desired final size.
0:05:26 The proposed algorithm works as follows. We start with an empty table; C will be the counter that tells how many cumulative counts we have aggregated since the last update.
0:05:42 For every training sentence, we accumulate its counts in the table, and then, when the counter C is higher than the parameter T, we update the table, pruning all the entries with counts lower than tau.
0:06:00 At the end we must do a final update, to ensure that the final size of the table is not much bigger than the desired size.
0:06:10 Then we take the table and just keep the most frequent units.
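The loop just described can be sketched as follows (a minimal sketch: the function and parameter names are mine, and tau is kept fixed here, whereas in the experiments the parameters are tuned to keep the table around ten times the target size):

```python
from collections import Counter

def count_and_prune(sentences, T, tau, top_n):
    """Single-pass n-gram counting with periodic pruning: every time
    more than T counts have accumulated since the last update, entries
    with counts at or below tau are discarded (reset to zero)."""
    table = Counter()
    since_update = 0
    for sentence in sentences:       # each sentence yields its n-grams
        for ngram in sentence:
            table[ngram] += 1
            since_update += 1
        if since_update > T:
            # update: retain only entries with counts higher than tau
            table = Counter({g: c for g, c in table.items() if c > tau})
            since_update = 0
    # final update, then keep only the most frequent units
    table = Counter({g: c for g, c in table.items() if c > tau})
    return [g for g, _ in table.most_common(top_n)]
```

Frequent units survive every pruning step, while units whose counts stay at or below tau between updates are repeatedly dropped, so the table size stays bounded.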
0:06:16 The experimental setup we use is quite a common approach in phonotactic language recognition: we use phone lattices, from which we estimate expected counts; we use SVM-based language models with backoff n-grams; and we apply a Gaussian backend and linear fusion.
0:06:34 The training, development and test corpora were those defined for the NIST 2007 Language Recognition Evaluation.
0:06:43 We used ten conversations per language for development, that is, for fusion and calibration, and those conversations were further split into 30-second segments.
0:06:57 The evaluation was carried out on the core condition, closed-set, 30 seconds.
0:07:04 The Brno University of Technology phone decoders for Czech, Hungarian and Russian were used for evaluation.
0:07:14 The lattices were obtained with HTK using the Brno recognizers; the SVM modeling was done using LIBLINEAR, which is quite fast; and the backend and the fusion were done using the FoCal toolkit from Niko Brummer.
0:07:34 Before doing the decodings with the Brno decoders, we removed the non-speech segments from the training segments, and all the non-phonetic units were mapped to silence.
0:07:51 We do not use 1-best decodings; we use lattices, so we used the phone decoders only to get estimates of the expected n-gram counts.
0:08:02 The target languages were modeled by means of support vector machines, using the expected counts of backoff n-grams and using the standard background probability weighting. The training was done one-versus-all.
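In SVM-based phonotactic systems, the "standard background probability weighting" usually refers to the TF-LLR weighting of Campbell et al., which scales each n-gram's relative frequency by one over the square root of its background probability. Assuming that is the weighting meant here, a minimal sketch:

```python
import math

def tfllr_vector(counts, background_probs):
    """Map the (expected) n-gram counts of one utterance to a sparse
    TF-LLR-weighted feature vector: p_i / sqrt(q_i), where p_i is the
    n-gram's relative frequency in the utterance and q_i its
    probability over the whole background (training) data."""
    total = sum(counts.values())
    return {g: (c / total) / math.sqrt(background_probs[g])
            for g, c in counts.items()}
```

Each target language's SVM is then trained one-versus-all on these sparse vectors; rare but informative units get larger weights.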
0:08:23 So, let's jump from 3-grams to 4-grams. If we take all the 4-grams seen in training, we get only around two million features, with somewhat different numbers for each decoder.
0:08:43 So there is no need for the estimation algorithm: with two million features we can exactly count them and select the most frequent of them.
0:08:52 Even if we used the full two million features, in fact not all the features would be really needed: the average size of the sparse count vectors was found to be only about seventy thousand.
0:09:11 Using the full 4-grams, we got improvements in both average cost and equal error rate.
0:09:18 But we should take into account that the average vector size when we used the full 4-grams was quite a bit bigger than in the 3-gram baseline system.
0:09:28 That would be a problem if we got a lot more data, for example for the 2009 evaluation, where the database was much bigger.
0:09:37 So the first step was just to select the most frequent units from the full 4-grams.
0:09:45 In this figure you can see what happens as we select fewer and fewer units: obviously, the average vector size gets smaller.
0:09:59 The equal error rate grows, not monotonically but with some oscillations; that is why we prefer the average cost for evaluation, because it is somehow more significant than the equal error rate, since small perturbations around the operating point can lead to quite different equal error rates.
0:10:22 So we marked two selection points: the first one was 100,000 units and the second was 30,000.
0:10:29 The first is interesting because 100,000 features is more or less the same number of features as in the full 3-gram system.
0:10:36 And the second one, 30,000, was selected because its average vector size is more or less equivalent to that of the 3-gram case, so the computational cost of the 30,000-unit 4-gram system was more or less the same as that of the 3-gram system.
0:10:54 So, let's try to jump from 4-grams to higher orders.
0:10:59 Here we used fixed T and tau values, chosen to ensure at least two million features at the end of the algorithm.
0:11:08 Just to note that the T value is equivalent to more than two hours of voice, so the features that get removed are those whose counts remain really low even after two hours of speech.
0:11:26 Also, as N increases, as we use higher and higher orders, the number of new n-grams decreases. In this table we can see how many new n-grams we find as we change the order from three up to seven, and you can see that when we take the most frequent units, the new 7-grams are only around twelve thousand.
0:11:52 So we selected seven as the highest order.
0:11:58 In this table you can see the raw results, average cost and equal error rate, for the two selection limits, 100,000 and 30,000 units, as the order goes from three up to seven.
0:12:15 As you can see, the error rates improve with 4-grams, and it seems that with orders five, six and seven the results do not differ much from the 4-gram system.
0:12:35 Anyway, a good point was that even when the results were not better, they were not worse either: they are somehow stable, even though quite a big number of higher-order units, up to eighteen thousand, are collected.
0:13:00 So, I will try to finish my presentation. A dynamic frequency-based feature selection method has been proposed, which allows us to perform phonotactic SVM-based language recognition with higher-order n-grams.
0:13:21 Performance improvements with regard to the baseline trigram SVM system have been reported in experiments on the NIST 2007 Language Recognition Evaluation database, when applying the proposed approach to select the most frequent units up to orders four, five, six and seven.
0:13:38 The best performance was obtained when selecting the 100,000 most frequent units up to 5-grams, which yielded a relative error rate improvement of eleven percent with regard to the 3-gram system.
0:13:52 We are currently working on the evaluation of smarter selection criteria under the same approach.
0:14:00 That's all, thank you.
0:14:33 [Audience] I think what we noticed was that with each higher-order n-gram you had a different dynamic range. I was wondering if you tried to scale them differently, or weight them separately, or something like that?
0:15:03 [Speaker] No, we just leave them all in the same vector.
0:15:14 [The remaining question-and-answer exchanges are largely inaudible; they touch on iterating the selection procedure and on comparison and fusion with an acoustic baseline system, but the details could not be recovered.]