| 0:00:06 | Okay, I'm going to talk about the work we did in the scope of the last NIST Speaker Recognition Evaluation. |
|---|
| 0:00:17 | This is the outline of my presentation: first I will try to motivate and introduce the problem we wanted to face and to solve. |
|---|
| 0:00:29 | Since my work has always been related to connectionist speech recognition, I will have a look at its very basics. |
|---|
| 0:00:39 | Then I will introduce the novel features obtained in this work, which we call transformation network features, for speaker recognition. |
|---|
| 0:00:47 | And then we will go to the experiments, the conclusions, and some future work ideas. |
|---|
| 0:00:56 | The main motivation was that we wanted to participate in the NIST evaluation, and we saw that the best systems were combining a large number of different subsystems. |
|---|
| 0:01:11 | I'm not going to mention them all, but there are many possible subsystems, and among all of them I was particularly attracted by what are usually called high-level features, which is in close relation with the previous session. |
|---|
| 0:01:27 | Basically, these systems use the speaker adaptation transforms employed in ASR systems as features for speaker detection, and they are proposed as alternatives to the cepstral features that are the most commonly used ones. |
|---|
| 0:01:42 | We build on the previous work on MLLR-based features; in fact the work presented here is very closely related, although it was developed in another framework, with some differences of course. |
|---|
| 0:01:54 | Basically, what is done in that work is to use the weights derived from MLLR transforms to produce high-dimensional feature vectors, concatenate them, and use these coefficients to model the speakers with support vector machines. |
|---|
| 0:02:13 | So, what is the problem? We have always been working with hybrid HMM/ANN speech recognition systems, based on neural networks. |
|---|
| 0:02:29 | I will show some of their characteristics later, but the main problem, and the motivation for this work, is that we cannot use the typical adaptation methods like MLLR that are usually used in Gaussian approaches. |
|---|
| 0:02:45 | so | 
|---|
| 0:02:45 | what i try | 
|---|
| 0:02:46 | to doing this work at the very beginning | 
|---|
| 0:02:49 | it began with | 
|---|
| 0:02:50 | to see if i can do something similar to the motor transformation for | 
|---|
| 0:02:53 | i've read | 
|---|
| 0:02:54 | uh | 
|---|
| 0:02:55 | um | 
|---|
| 0:02:56 | systems and if we can use it | 
|---|
| 0:02:57 | uh to obtain the speaker | 
|---|
| 0:02:59 | information into china | 
|---|
| 0:03:01 | a speaker discussion system | 
|---|
| 0:03:03 | and it with the farthest 'em | 
|---|
| 0:03:04 | some | 
|---|
| 0:03:05 | baseline systems in that in that | 
|---|
| 0:03:07 | it's very with us tonight | 
|---|
| 0:03:08 | uh telephone bill for condition | 
|---|
| 0:03:12 | Let me first review the basics of hybrid HMM/ANN recognition, for those to whom they are not very familiar. We have been working on this for some applications, mainly broadcast news transcription, but also telephone applications, and for some other languages. |
|---|
| 0:03:38 | The way these hybrid systems work is that we replace the Gaussian mixtures with a neural network, in our case a multilayer perceptron, and we use its probability estimates as the posterior probabilities of the single-state HMMs. |
|---|
| 0:04:01 | Usually we have relatively few outputs, just phonemes or some other sub-word units, but not many more. |
|---|
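As an aside, in a hybrid HMM/MLP decoder the network posteriors are usually turned into so-called scaled likelihoods by dividing by the class priors. This is a minimal sketch of that standard trick, not the speaker's actual code; all numbers below are made up for illustration:

```python
import numpy as np

# Hypothetical posteriors P(q|x) from an MLP over four phone classes,
# and the class priors P(q) estimated from the training alignments.
posteriors = np.array([0.70, 0.15, 0.10, 0.05])
priors     = np.array([0.40, 0.30, 0.20, 0.10])

# In a hybrid HMM/MLP system the HMM emission score is the "scaled
# likelihood" p(x|q)/p(x) = P(q|x)/P(q), used by the Viterbi decoder
# in place of a Gaussian-mixture likelihood.
scaled_likelihoods = posteriors / priors
log_scores = np.log(scaled_likelihoods)
```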
| 0:04:12 | The main characteristics are that these networks are usually considered very good classifiers, that they make it easy to combine several feature streams, and that they perform pretty well, as we will see. |
|---|
| 0:04:25 | On the other hand, we have some problems with context modelling, and also with adaptation: the adaptation methods are not as well established as in Gaussian systems. |
|---|
| 0:04:37 | This is a block diagram of our recognition system. You can see several parallel streams with different features: PLP features, PLP with RASTA, and modulation spectrogram features. |
|---|
| 0:05:02 | For each stream there is a different multilayer perceptron, trained with transcriptions, and we merge the streams with a simple product rule. |
|---|
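The product-rule merging of the per-stream posteriors can be sketched as follows. This is a toy illustration with made-up posteriors; in the real system the per-frame MLP outputs are merged before decoding:

```python
import numpy as np

# Hypothetical frame posteriors from three feature streams
# (e.g. the PLP, RASTA-PLP and modulation-spectrogram MLPs) over 3 classes.
streams = np.array([
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.7, 0.2, 0.1],
])

# Simple product rule: multiply the per-stream posteriors
# (a log-domain sum, for numerical safety) and renormalise.
log_prod = np.sum(np.log(streams), axis=0)
combined = np.exp(log_prod - log_prod.max())
combined /= combined.sum()
```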
| 0:05:18 | These posterior probabilities are then used by the decoder, together with a language model, the lexicon, and the definitions of the HMMs, which give the relation between each network output and a phonetic unit, to produce the most likely word string or sentence. |
|---|
| 0:05:38 | Some characteristics of the system: it runs in less than one times real time, with an accuracy of around seventy percent, and it uses phonemes and some other phonetic units. |
|---|
| 0:05:58 | It is trained with about forty hours of speech, with a bigram language model that is an interpolation of the transcripts and written text from newspapers, and a relatively small vocabulary of about four thousand words. |
|---|
| 0:06:14 | For the evaluation data I needed to train a new speech recogniser, and I have to say it is a very, very weak system, because of the limited conversational telephone speech I had access to. |
|---|
| 0:06:31 | Basically, what I did was to train new multilayer perceptrons with the conversational data; there are some other differences in the system I use in this work, with different features and a different set of phonetic units. |
|---|
| 0:06:49 | I did some very informal evaluations, just to see for myself how it was working on conversational telephone data, and I got a very high word error rate. |
|---|
| 0:07:02 | But anyway, this recogniser is used for two purposes: first, to generate a phonetic alignment with the transcriptions provided by NIST, and also for training the speaker adaptation transformations. |
|---|
| 0:07:20 | So, how can we adapt a hybrid MLP network to the speaker information? There are several approaches, but basically two of them. |
|---|
| 0:07:32 | The first one would be, starting from a speaker-independent MLP network, to run additional iterations of the backpropagation algorithm on the adaptation data: we start with the trained network instead of random weights, and we update the weights. |
|---|
| 0:07:53 | The other thing we can do is to adapt only some of the weights, for instance the ones that go from the last hidden layer to the output layer. |
|---|
| 0:08:03 | Perhaps something more interesting is to modify the structure of the MLP network, trying not to modify the speaker-independent component. |
|---|
| 0:08:15 | For instance, at the phonetic level, we can add some kind of transformation at the output of the network and try to adapt it to the speaker characteristics; on the other hand, we can try to do the same at the acoustic level, adapting the input features to the characteristics of the speaker-independent system. |
|---|
| 0:08:41 | I did some tests of this last solution for ASR, just to verify that it could work, and it works; I found it was the best one for our application, so I decided to try it also for speaker recognition. |
|---|
| 0:08:58 | Here we have a typical MLP network with just one hidden layer: the input layer, the hidden layer, and the output layer. How can we train this adaptation transformation? |
|---|
| 0:09:13 | Basically, we incorporate a new linear layer at the input, and we apply the backpropagation algorithm as usual: we have data with labels, we make the forward propagation, compute the output of the network, compute the error, and backpropagate it. |
|---|
| 0:09:35 | But when it comes to updating the weights, we only update those of the linear input network, and we keep the speaker-independent component frozen. |
|---|
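The scheme just described, backpropagation through a frozen speaker-independent MLP where only the new linear input layer is updated, can be sketched as follows. This is a toy illustration with made-up sizes and random data, not the actual system; the real network is far larger and is trained on acoustic features:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

# Hypothetical small speaker-independent MLP (frozen): 4 inputs,
# 8 hidden units, 3 phone classes.
d, h, k = 4, 8, 3
W1, b1 = rng.normal(0, 0.5, (h, d)), np.zeros(h)
W2, b2 = rng.normal(0, 0.5, (k, h)), np.zeros(k)

# Linear input network (LIN): a d x d matrix initialised to the
# identity; this is the only component updated during adaptation.
A = np.eye(d)

def adapt_step(x, y_idx, lr=0.1):
    """One backprop step that updates only the LIN weights."""
    global A
    z = A @ x                        # linear input transform
    hid = sigmoid(W1 @ z + b1)       # frozen hidden layer
    p = softmax(W2 @ hid + b2)       # frozen output layer
    # Cross-entropy gradient, propagated back through the frozen
    # layers down to the LIN; W1 and W2 themselves are not touched.
    dlogits = p.copy(); dlogits[y_idx] -= 1.0
    dhid = (W2.T @ dlogits) * hid * (1.0 - hid)
    dz = W1.T @ dhid
    A -= lr * np.outer(dz, x)
    return -np.log(p[y_idx])

# A few fixed epochs over the speaker's aligned frames, as in the talk.
frames = [(rng.normal(size=d), rng.integers(k)) for _ in range(20)]
losses = [np.mean([adapt_step(x, y) for x, y in frames]) for _ in range(5)]
```

After adaptation, A has moved away from the identity, and its coefficients are what later become the speaker features.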
| 0:09:50 | What can we say about this transformation? As I see it, it is intended to map the input of the current speaker to the representation in which the MLP performs best, so it can be considered a kind of speaker normalisation. |
|---|
| 0:10:08 | But it has some special characteristics, because we are not imposing any restriction on the adaptation process; I mean, we don't have a target speaker towards which we try to normalise the data. |
|---|
| 0:10:22 | And according to previous works, it seems that it is also architecture dependent: if we train the transformation network with one speaker-independent network behind it, and then we change to a speaker-independent network that has two hidden layers instead of one, it doesn't work anymore. So it has some kind of dependence on the architecture of the network. |
|---|
| 0:10:43 | We train it starting from an identity matrix, and when we use data of the same speaker, what we hope is that the transform captures the differences specific to that speaker, ending up with a kind of speaker model. I thought that this could be useful for speaker recognition. |
|---|
| 0:11:06 | so | 
|---|
| 0:11:06 | there so i stuck exactly the features | 
|---|
| 0:11:09 | i'd in the phonetic alignment with a nice | 
|---|
| 0:11:12 | the stations | 
|---|
| 0:11:13 | and | 
|---|
| 0:11:14 | train a speaker additions estimation for every segment | 
|---|
| 0:11:17 | and it's um | 
|---|
| 0:11:19 | a special things that they do is to remove | 
|---|
| 0:11:21 | long | 
|---|
| 0:11:22 | segments of silence to to avoid background and channel effect | 
|---|
| 0:11:25 | in the resulting features | 
|---|
| 0:11:27 | and i then just thinking of 'cause what edition that that this | 
|---|
| 0:11:30 | that is usually don't in the market | 
|---|
| 0:11:31 | in mlp training | 
|---|
| 0:11:33 | i just | 
|---|
| 0:11:34 | a place um | 
|---|
| 0:11:35 | fix the number five books and | 
|---|
| 0:11:36 | already said that this was that | 
|---|
| 0:11:38 | base and already sticks that they | 
|---|
| 0:11:39 | from the | 
|---|
| 0:11:41 | Another thing is that, instead of training a full matrix over the whole input, I tie the network. The input of the MLP is usually composed of the current frame and its context, so a full square matrix would have (number of features times context size) squared coefficients; by applying the same transform to each frame, independently of its position in the context, I reduce the size of the transformation considerably. |
|---|
| 0:12:23 | In addition to the coefficients of this tied transform, the final feature vector also stacks the feature mean and variance, because it is very usual to apply mean and variance normalisation at the input of the MLP. |
|---|
| 0:12:40 | And I do this for the different streams that we have: PLP, PLP with RASTA, modulation spectrogram, and MFCC. |
|---|
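Putting the pieces together, the per-segment feature vector is the tied input transform flattened, with the segment's per-dimension mean and variance appended, concatenated across streams. A sketch with made-up dimensionalities (the real streams and sizes differ):

```python
import numpy as np

def tn_features(A, frames):
    """Transformation-network feature vector for one stream: the tied
    per-frame LIN matrix flattened, plus the segment's per-dimension
    feature mean and variance (the normalisation statistics that are
    part of the MLP front end)."""
    return np.concatenate([A.ravel(), frames.mean(axis=0), frames.var(axis=0)])

# Hypothetical example: two streams with 3- and 2-dimensional frames;
# the real system uses four streams of much higher dimensionality.
rng = np.random.default_rng(1)
plp = rng.normal(size=(100, 3)); msg = rng.normal(size=(100, 2))
A_plp, A_msg = np.eye(3), np.eye(2)   # the adapted LINs would go here

vec = np.concatenate([tn_features(A_plp, plp), tn_features(A_msg, msg)])
# 3*3 + 3 + 3 = 15 dims for the first stream, 2*2 + 2 + 2 = 8 for the second
```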
| 0:12:49 | For modelling I use support vector machines: I train the speaker model with the speaker's feature vector as the positive example, and a background impostor set used as negative examples. I use libSVM with a linear kernel, and I apply rank normalisation to the input features. |
|---|
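The rank normalisation mentioned here is a common recipe in SVM speaker systems: each feature dimension is replaced by its relative rank within a background population. This is my own illustrative sketch, with a made-up Gaussian background set:

```python
import numpy as np

def rank_norm(x, background):
    """Rank-normalise vector x: each dimension becomes its rank within
    the background population for that dimension, scaled to [0, 1]."""
    out = np.empty_like(x, dtype=float)
    for j in range(background.shape[1]):
        col = np.sort(background[:, j])
        out[j] = np.searchsorted(col, x[j]) / len(col)
    return out

# Hypothetical background of 1000 impostor feature vectors, 5 dims.
rng = np.random.default_rng(2)
bg = rng.normal(size=(1000, 5))

z = rank_norm(np.zeros(5), bg)   # values near the N(0,1) median map near 0.5
```

The same normalisation is applied to the vectors used for SVM training and to the test vectors, so that all inputs live on a comparable scale.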
| 0:13:09 | So, let's go to the experiments. I used the NIST SRE 2008 data, only the tel-tel condition, and I used two competitive systems to verify the usefulness, or not, of this approach. |
|---|
| 0:13:25 | The first one is a quite simple GMM-UBM system based on cepstral features. I remove nonspeech frames based on the log energy, I do mean and variance normalisation over a short window, and all the typical things in GMM-UBM systems. |
|---|
| 0:13:52 | The background data set is the one used in previous SRE evaluations, and I also apply score normalisation. |
|---|
| 0:14:00 | In addition to that, a GMM supervector system, that is, a system that models the GMM supervectors with an SVM. For the negative set I derived the supervectors from speaker models trained on data from previous SRE evaluations. |
|---|
| 0:14:25 | I didn't apply score normalisation here, because I didn't see much improvement; probably there is some kind of problem in my configuration. |
|---|
| 0:14:36 | For calibration I used a gender-dependent calibration with the FoCal toolkit, and I did it in two steps: first a calibration for every single system, and later on another linear logistic regression in the case of fusing more than one system. |
|---|
| 0:14:58 | I did it with k-fold cross-validation on the same evaluation set, because I didn't have a separate development set for calibration, which is not ideal. |
|---|
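The calibration and fusion step, a linear logistic regression over subsystem scores in the style of the FoCal toolkit, can be sketched as follows. This is a simplified stand-in with plain gradient descent and synthetic scores, not the toolkit's actual algorithm:

```python
import numpy as np

def train_llr_fusion(scores, labels, epochs=2000, lr=0.5):
    """Learn weights w and offset b so that w . s + b acts as a
    calibrated log-likelihood-ratio score for each trial. The real
    toolkit uses a smarter optimiser and an effective-prior weighting."""
    n, m = scores.shape
    w, b = np.zeros(m), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(scores @ w + b)))
        g = p - labels                      # d(cross-entropy)/d(logit)
        w -= lr * (scores.T @ g) / n
        b -= lr * g.mean()
    return w, b

# Hypothetical trial scores from two subsystems; label 1 = target trial.
rng = np.random.default_rng(3)
tar = rng.normal(1.5, 1.0, size=(200, 2))
non = rng.normal(-1.5, 1.0, size=(200, 2))
S = np.vstack([tar, non]); y = np.r_[np.ones(200), np.zeros(200)]
w, b = train_llr_fusion(S, y)
fused = S @ w + b
```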
| 0:15:15 | And here we have already some results. In blue you can see the curves of the individual transformation network systems, based on the different features: PLP, PLP with RASTA, modulation spectrogram, and MFCC. |
|---|
| 0:15:30 | Here you have the minimum detection cost function; I have to say that I use the cost proposed for SRE 2008, not the new one of 2009, and this is the equal error rate. |
|---|
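For reference, the minimum detection cost mentioned here, with the SRE 2008 parameters (C_miss = 10, C_fa = 1, P_target = 0.01), can be computed from a set of target and non-target scores like this (synthetic scores, for illustration only):

```python
import numpy as np

def min_dcf(tar_scores, non_scores, p_tar=0.01, c_miss=10.0, c_fa=1.0):
    """Minimum detection cost with the SRE 2008 parameters, swept over
    all thresholds and normalised by the best trivial system."""
    thresholds = np.sort(np.r_[tar_scores, non_scores])
    costs = []
    for t in thresholds:
        p_miss = np.mean(tar_scores < t)
        p_fa = np.mean(non_scores >= t)
        costs.append(c_miss * p_tar * p_miss + c_fa * (1 - p_tar) * p_fa)
    return min(costs) / min(c_miss * p_tar, c_fa * (1 - p_tar))

rng = np.random.default_rng(4)
dcf = min_dcf(rng.normal(2.0, 1.0, 500), rng.normal(-2.0, 1.0, 5000))
```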
| 0:15:46 | The first remark I want to make about this is that, well, it is not great, but it worked, and anyway I wasn't sure of that when I started. |
|---|
| 0:15:56 | Among the individual systems, we can see that the performance of the MFCC features is probably the best, but I don't have a good explanation: maybe because the feature size is bigger, although I'm not sure, or simply because that network, or its classifier, is better. |
|---|
| 0:16:12 | Then I did two other experiments: first, to fuse with linear logistic regression the four individual systems; and, even better, to concatenate the four weight-based feature vectors and to train a single SVM on the concatenated feature vector. |
|---|
| 0:16:33 | And we can see a nice improvement using the complete concatenated feature vector. |
|---|
| 0:16:43 | Moving to the next one, this is the comparison of the different baseline systems, together with the newly proposed transformation network system, the TN-SVM. |
|---|
| 0:16:56 | With respect to the GMM-UBM, it performs better close to the minimum-cost operating point, but it seems that it gets worse as we go closer to the equal error rate point. |
|---|
| 0:17:14 | With respect to the supervector system, we have a slightly worse performance close to the minimum-cost point, and it works better in the other direction, towards the equal error rate point. |
|---|
| 0:17:30 | What I think is important from these results is that I can achieve performance more or less similar to the baseline systems I am comparing to: in some cases a bit worse, in some cases a bit better, but not dramatically different. |
|---|
| 0:17:45 | The final purpose of this work was, in fact, trying to use it to improve the baseline systems, and these are the results of the combination. You can see several different combinations; these are the two baselines, and this is the minimum cost obtained. |
|---|
| 0:18:04 | And we can see that when we incorporate the transformation network features system, we have some improvement, in all the combinations here. |
|---|
| 0:18:28 | So, the conclusions. What I wanted to show in this work is that features derived from ANN speaker adaptation techniques can be used for speaker recognition, in a way very similar to how MLLR transforms are used with Gaussian systems. |
|---|
| 0:18:48 | I have used an adaptation technique, the transformation network, and feature vectors based on the coefficients of these adaptation transforms, together with the mean and variance of the input features, and they seem to perform quite well. |
|---|
| 0:19:07 | With respect to the baselines, we could see a relatively good performance: in some operating points of the curve it was better, in others it was worse, but more or less similar performances. |
|---|
| 0:19:23 | And we could verify that it provides some complementary speaker information to the one we have in our baseline systems. |
|---|
| 0:19:33 | With respect to current lines and future work, we are going on with these features. We need to assess a better classifier, or tweak our SVM setup, because we have very high-dimensional feature vectors and very few training examples. |
|---|
| 0:19:50 | We also want to improve the speech recogniser itself, both for the alignments and for the adaptation, because probably with a better speech recognition system we will have more meaningful features. |
|---|
| 0:20:02 | We did almost no tuning of the network architecture and its characteristics, and we probably should do something more in-depth: to understand the relation between the architecture of the speaker-independent network and the resulting features, or even to test other adaptation methods; I did not try the adaptation at the output of the network, at the phonetic level. |
|---|
| 0:20:28 | We are also studying some dimensionality reduction of these features, and applying inter-session variability compensation, like NAP, and other techniques that are known to work well in similar systems. |
|---|
| 0:20:42 | and to do something similar to | 
|---|
| 0:20:44 | what is done, um, for language identification | 
|---|
| 0:20:47 | that is | 
|---|
| 0:20:48 | to use | 
|---|
| 0:20:49 | several mlp | 
|---|
| 0:20:50 | uh networks from different languages and | 
|---|
| 0:20:53 | train the transformation networks for every one of these languages without | 
|---|
| 0:20:56 | phonetic alignment, and then | 
|---|
| 0:20:58 | to get a single feature vector | 
|---|
| 0:21:01 | and this way making the approach | 
|---|
| 0:21:04 | uh not need the asr transcriptions and finally | 
|---|
| 0:21:07 | making it also language independent | 
|---|
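The multilingual idea sketched above (several per-language MLP front-ends, each transforming the acoustic frames, concatenated into one language-independent feature vector per frame) might look roughly like the following. This is an editor's illustration: the network shapes, random weights, and feature sizes are made-up assumptions, not the actual system.

```python
import numpy as np

def mlp_forward(frames, w1, b1, w2, b2):
    """One per-language MLP front-end: sigmoid hidden layer,
    linear output taken as the transformed features."""
    h = 1.0 / (1.0 + np.exp(-(frames @ w1 + b1)))
    return h @ w2 + b2

rng = np.random.default_rng(0)
n_frames, n_in, n_hid, n_out = 100, 39, 50, 20

# Hypothetical per-language networks (random weights for illustration).
languages = {}
for lang in ("english", "spanish", "mandarin"):
    languages[lang] = (rng.normal(size=(n_in, n_hid)), np.zeros(n_hid),
                       rng.normal(size=(n_hid, n_out)), np.zeros(n_out))

frames = rng.normal(size=(n_frames, n_in))  # e.g. PLP frames plus context

# Concatenate the outputs of all language networks frame by frame:
# no phonetic alignment or ASR transcription is needed.
stacked = np.hstack([mlp_forward(frames, *languages[l])
                     for l in sorted(languages)])
print(stacked.shape)  # (100, 60): one stacked vector per frame
```

The stacked vectors could then feed the same SVM back-end as the monolingual features.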
| 0:21:09 | and | 
|---|
| 0:21:11 | that's all | 
|---|
| 0:21:13 | okay | 
|---|
| 0:21:24 | okay | 
|---|
| 0:21:24 | questions | 
|---|
| 0:21:30 | (audience question, largely inaudible: something about normalisation of the features) | 
|---|
| 0:22:13 | no um | 
|---|
| 0:22:14 | just rank normalisation of the input of the svm | 
|---|
| 0:22:18 | modelling | 
|---|
| 0:22:19 | uh i did it | 
|---|
| 0:22:20 | for modelling and also when i was doing testing, but i didn't do | 
|---|
| 0:22:24 | any other normalisation to the | 
|---|
| 0:22:26 | feature vectors | 
|---|
| 0:22:27 | they are in the range zero to one | 
|---|
| 0:22:28 | i think it's conventional | 
|---|
| 0:22:30 | in these | 
|---|
| 0:22:31 | uh support vector machine approaches | 
|---|
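The rank normalisation described in this answer (mapping each SVM input dimension to its empirical rank within a background set, so every feature ends up in the range zero to one) can be sketched as follows; the background data and dimensionality are made up for illustration.

```python
import numpy as np

def rank_normalize(background, x):
    """Map each dimension of x to the fraction of background values
    below it, giving a value in [0, 1] per dimension."""
    # background: (n_bg, dim) matrix of background feature vectors
    # x: (dim,) feature vector to normalise
    n_bg = background.shape[0]
    return (background < x).sum(axis=0) / n_bg

bg = np.random.default_rng(1).normal(size=(1000, 5))  # made-up background
x = np.zeros(5)
z = rank_normalize(bg, x)
print(z)  # each entry near 0.5 for a standard-normal background
```

The same background ranks would be applied at both modelling and testing time, as the answer notes.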
| 0:22:33 | (audience comment, partly inaudible: about whether some features were selected or treated differently) | 
|---|
| 0:22:42 | well, i understand, but it was up to the the svm | 
|---|
| 0:22:45 | we | 
|---|
| 0:22:46 | go with this; i mean, it will select | 
|---|
| 0:22:48 | the | 
|---|
| 0:22:49 | features that are more important, i mean | 
|---|
| 0:22:51 | i didn't treat them in a different way than | 
|---|
| 0:22:53 | if they were just coming from plp | 
|---|
| 0:22:55 | i just let the svm learn | 
|---|
| 0:22:58 | what it thought was better | 
|---|
| 0:23:00 | i didn't do anything | 
|---|
| 0:23:02 | in this way | 
|---|
| 0:23:08 | (audience question, partly inaudible: about the data used to train the ubm, and whether the neural network needs much more data) | 
|---|
| 0:23:32 | one more thing | 
|---|
| 0:23:34 | (question partly inaudible) | 
|---|
| 0:23:41 | uh, i think i didn't get it very well | 
|---|
| 0:23:45 | you're talking about | 
|---|
| 0:23:45 | whether we did it with a random initialisation of the mlp network | 
|---|
| 0:23:50 | for training | 
|---|
| 0:23:52 | oh | 
|---|
| 0:23:53 | that was the last layer | 
|---|
| 0:23:56 | the softmax | 
|---|
| 0:23:57 | yeah | 
|---|
| 0:23:58 | yeah | 
|---|
| 0:23:59 | (follow-up question, partly inaudible: about whether there is only the one softmax output) | 
|---|
| 0:24:18 | i have | 
|---|
| 0:24:18 | a softmax | 
|---|
| 0:24:19 | output here | 
|---|
| 0:24:21 | yeah | 
|---|
| 0:24:21 | and i don't have any other softmax output | 
|---|
| 0:24:24 | anyway, this is | 
|---|
| 0:24:25 | the linear input network and | 
|---|
| 0:24:27 | i'm not doing any kind of nonlinearity at this point | 
|---|
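The linear input network (LIN) adaptation being described here, a purely linear layer prepended to the frozen speaker-independent MLP with no extra nonlinearity and a single softmax at the output, can be sketched as below; the layer sizes and random weights are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(2)
dim, n_hid, n_phones = 39, 100, 40

# Frozen speaker-independent MLP (random weights for illustration).
w_hid = rng.normal(size=(dim, n_hid)); b_hid = np.zeros(n_hid)
w_out = rng.normal(size=(n_hid, n_phones)); b_out = np.zeros(n_phones)

# Speaker-dependent LIN: initialised to identity, trained per speaker.
# It is linear only -- no nonlinearity is inserted at this point.
lin = np.eye(dim)

x = rng.normal(size=(1, dim))            # one input frame (plus context)
h = sigmoid((x @ lin) @ w_hid + b_hid)   # frozen hidden layer
p = softmax(h @ w_out + b_out)           # the single softmax output layer
print(p.shape)  # (1, 40): one phone-posterior vector
```

During adaptation only `lin` would be updated by backpropagation; the MLP weights stay fixed.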
| 0:24:32 | it's a | 
|---|
| 0:24:34 | hmm, i think so | 
|---|
| 0:24:35 | but | 
|---|
| 0:24:35 | the | 
|---|
| 0:24:37 | no, there's no | 
|---|
| 0:24:38 | nonlinearity there; i dunno if i'm answering you | 
|---|
| 0:24:43 | i | 
|---|
| 0:24:43 | oh | 
|---|
| 0:24:47 | no, they didn't have; the input is just | 
|---|
| 0:24:50 | uh | 
|---|
| 0:24:51 | uh speech features | 
|---|
| 0:24:52 | yeah, plp or | 
|---|
| 0:24:54 | similar, or | 
|---|
| 0:24:55 | and | 
|---|
| 0:24:56 | sorry, and it's | 
|---|
| 0:24:57 | uh the | 
|---|
| 0:24:58 | the current frame and its context | 
|---|
| 0:25:00 | not only one frame; we use a context of | 
|---|
| 0:25:02 | a few frames | 
|---|
| 0:25:04 | but uh | 
|---|
| 0:25:05 | but it's speech features | 
|---|
| 0:25:11 | yeah | 
|---|
| 0:25:12 | so, would you go back to slide forty | 
|---|
| 0:25:19 | the table | 
|---|
| 0:25:20 | um | 
|---|
| 0:25:21 | uh | 
|---|
| 0:25:23 | as a baseline | 
|---|
| 0:25:24 | just, how much is it, | 
|---|
| 0:25:26 | uh | 
|---|
| 0:25:27 | how many | 
|---|
| 0:25:27 | map iterations | 
|---|
| 0:25:29 | ah, i did five, also probably for the supervector extraction | 
|---|
| 0:25:33 | um, but i think it was | 
|---|
| 0:25:35 | right | 
|---|
| 0:25:36 | this | 
|---|
| 0:25:36 | yeah | 
|---|
| 0:25:37 | yes, i did five map iterations | 
|---|
| 0:25:39 | yes, so i appreciate that, but | 
|---|
| 0:25:40 | uh, you you did | 
|---|
| 0:25:42 | five map iterations before | 
|---|
| 0:25:44 | doing | 
|---|
| 0:25:45 | your svm | 
|---|
| 0:25:47 | yeah | 
|---|
| 0:25:48 | so | 
|---|
| 0:25:49 | we found | 
|---|
| 0:25:50 | that | 
|---|
| 0:25:52 | one | 
|---|
| 0:25:53 | (rest of the comment partly inaudible) | 
|---|
| 0:25:59 | yeah uh | 
|---|
| 0:25:59 | but we verified | 
|---|
| 0:26:01 | well | 
|---|
| 0:26:02 | i'm not contradicting that, but | 
|---|
| 0:26:03 | uh in the basic gmm-ubm | 
|---|
| 0:26:05 | with five we got better; even if we go farther away | 
|---|
| 0:26:09 | we got | 
|---|
| 0:26:10 | uh | 
|---|
| 0:26:11 | a slight improvement | 
|---|
| 0:26:12 | but uh we hadn't verified it when we moved to the supervector | 
|---|
| 0:26:16 | system, uh | 
|---|
| 0:26:17 | and i realise, as you say, that | 
|---|
| 0:26:20 | this was not a good idea | 
|---|
| 0:26:22 | probably | 
|---|
| 0:26:22 | well that's | 
|---|
| 0:26:23 | probably um | 
|---|
| 0:26:25 | with the configuration i have | 
|---|
| 0:26:27 | uh | 
|---|
| 0:26:28 | okay | 
|---|
| 0:26:28 | i'm sure | 
|---|
| 0:26:29 | that the performance is poorer in the supervector | 
|---|
| 0:26:32 | system | 
|---|
| 0:26:33 | sure, i realise that | 
|---|
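The point being debated, that a single MAP iteration away from the UBM is the usual choice before supervector extraction, while running five iterations drifts further from the UBM, can be sketched roughly as below. The toy two-dimensional GMM, the relevance factor value, and means-only adaptation are an editor's assumptions for illustration.

```python
import numpy as np

def map_adapt_means(means, weights, var, data, r=16.0, n_iter=1):
    """Iterative MAP adaptation of diagonal-GMM means (means only),
    with relevance factor r; one iteration is the common choice
    before supervector extraction."""
    for _ in range(n_iter):
        # E-step: responsibilities of each mixture for each frame.
        d2 = ((data[:, None, :] - means[None]) ** 2 / var).sum(-1)
        log_g = np.log(weights) - 0.5 * d2 - 0.5 * np.log(var).sum(-1)
        log_g -= log_g.max(axis=1, keepdims=True)
        g = np.exp(log_g)
        g /= g.sum(axis=1, keepdims=True)
        # M-step with MAP smoothing toward the current means.
        n = g.sum(axis=0)                                  # soft counts
        ex = (g[:, :, None] * data[:, None, :]).sum(0) \
             / np.maximum(n, 1e-10)[:, None]
        alpha = (n / (n + r))[:, None]
        means = alpha * ex + (1 - alpha) * means
    return means

rng = np.random.default_rng(3)
ubm_means = rng.normal(size=(4, 2))          # toy 4-mixture UBM
w, v = np.full(4, 0.25), np.ones((4, 2))
data = rng.normal(loc=1.0, size=(200, 2))    # made-up speaker frames

m1 = map_adapt_means(ubm_means, w, v, data, n_iter=1)
m5 = map_adapt_means(ubm_means, w, v, data, n_iter=5)
print(m1.shape, m5.shape)  # adapted means, e.g. stacked into supervectors
```

Stacking the adapted means into one long vector would give the supervector fed to the SVM.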
| 0:26:35 | fig | 
|---|
| 0:26:51 | oh | 
|---|
| 0:26:51 | on the loss | 
|---|
| 0:26:53 | right | 
|---|
| 0:26:56 | X | 
|---|
| 0:26:57 | yeah | 
|---|
| 0:26:57 | okay | 
|---|
| 0:26:58 | oh | 
|---|
| 0:26:59 | sure | 
|---|
| 0:26:59 | uh_huh | 
|---|
| 0:27:00 | you | 
|---|
| 0:27:01 | yeah | 
|---|
| 0:27:01 | so | 
|---|
| 0:27:02 | how much | 
|---|
| 0:27:04 | oh | 
|---|
| 0:27:05 | hmmm | 
|---|
| 0:27:08 | oh | 
|---|
| 0:27:08 | and | 
|---|
| 0:27:09 | well | 
|---|
| 0:27:09 | it improves | 
|---|
| 0:27:10 | right | 
|---|
| 0:27:11 | the | 
|---|
| 0:27:12 | yeah | 
|---|
| 0:27:12 | no no, i didn't like it; it wasn't | 
|---|
| 0:27:14 | uh, too much; but probably there are | 
|---|
| 0:27:16 | oh | 
|---|
| 0:27:16 | configuration problems, because i see that people | 
|---|
| 0:27:19 | do get | 
|---|
| 0:27:20 | very nice improvements with nap | 
|---|
| 0:27:21 | no | 
|---|
| 0:27:22 | i don't know if it's because of the telephone-only | 
|---|
| 0:27:24 | uh data that they use | 
|---|
| 0:27:26 | yeah, they get the improvement; mine is not | 
|---|
| 0:27:28 | so, let's say, i tried with | 
|---|
| 0:27:30 | uh | 
|---|
| 0:27:30 | different dimensionalities | 
|---|
| 0:27:32 | and | 
|---|
| 0:27:34 | it improves | 
|---|
| 0:27:35 | yeah | 
|---|
| 0:27:35 | but it was not moving from | 
|---|
| 0:27:37 | i don't know how much i had here | 
|---|
| 0:27:39 | it wasn't moving from | 
|---|
| 0:27:42 | six point five nine to three | 
|---|
| 0:27:45 | it was less | 
|---|
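Nuisance attribute projection (NAP), the compensation discussed in this exchange, removes the leading directions of within-speaker (session) variability from the supervectors before SVM training. A rough sketch with made-up data; the dimensionality and corank here are illustrative, not the values used in the system:

```python
import numpy as np

rng = np.random.default_rng(4)
dim, corank = 50, 5

# Made-up supervectors: 20 speakers x 10 sessions each.
speakers = rng.normal(size=(20, 1, dim))
sessions = speakers + 0.5 * rng.normal(size=(20, 10, dim))

# Within-speaker (session) variation: subtract each speaker's mean.
within = (sessions - sessions.mean(axis=1, keepdims=True)).reshape(-1, dim)

# Leading right singular vectors of the within-speaker scatter
# span the nuisance subspace.
_, _, vt = np.linalg.svd(within, full_matrices=False)
u = vt[:corank].T                      # (dim, corank) nuisance directions

# NAP projection P = I - U U^T, applied to every supervector.
project = lambda x: x - (x @ u) @ u.T
compensated = project(sessions.reshape(-1, dim))
print(compensated.shape)               # (200, 50)
```

Varying `corank` corresponds to the "different dimensionalities" tried in the answer.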
| 0:27:46 | (question, partly inaudible: about the svm scores and probability estimates) | 
|---|
| 0:28:25 | if i did something not right, it would be because i didn't verify this since then | 
|---|
| 0:28:30 | okay, fine | 
|---|
| 0:28:31 | and | 
|---|
| 0:28:31 | um, i'm not sure | 
|---|
| 0:28:33 | i'm not sure; i'm just currently revising the | 
|---|
| 0:28:35 | uh svm, because probably | 
|---|
| 0:28:37 | i think it was using the probability estimation, and i suspect it's not a good idea | 
|---|
| 0:28:42 | but i was using it | 
|---|
| 0:28:42 | in both | 
|---|
| 0:28:43 | in both systems based on svm, i mean | 
|---|
| 0:28:46 | in the baseline and also in my proposal, so | 
|---|
| 0:28:48 | i think i can improve both in that way | 
|---|
| 0:28:51 | more or less what you were mentioning | 
|---|
| 0:28:52 | because they are doing the | 
|---|
| 0:28:54 | they are doing this kind of comparison with the background using | 
|---|
| 0:28:58 | the distance to the hyperplane | 
|---|
| 0:29:00 | and i think that | 
|---|
| 0:29:02 | the probability estimate | 
|---|
| 0:29:03 | is not as good as the score for the | 
|---|
| 0:29:05 | for the speaker identification | 
|---|
| 0:29:07 | task | 
|---|
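The closing point, that the raw SVM margin (distance to the separating hyperplane) is usually preferable to a Platt-scaled probability estimate as a verification score, can be illustrated with scikit-learn, assuming it is installed; the toy data and parameters are made up.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
# Two made-up classes standing in for target vs background supervectors.
X = np.vstack([rng.normal(-1, 1, size=(50, 2)),
               rng.normal(+1, 1, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

# probability=True enables Platt scaling via internal cross-validation.
clf = SVC(kernel="linear", probability=True).fit(X, y)

x = np.array([[0.2, 0.1]])
score = clf.decision_function(x)[0]   # signed distance to the hyperplane
proba = clf.predict_proba(x)[0, 1]    # Platt-scaled probability estimate
# Platt scaling squashes the margin into [0, 1]; for verification
# scoring the raw margin is often the better-behaved quantity.
print(float(score), float(proba))
```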
| 0:29:08 | but | 
|---|
| 0:29:09 | well | 
|---|
| 0:29:11 | okay | 
|---|