Speech Transcript - Speaker Characterization Using TDNN, TDNN-LSTM, TDNN-LSTM-Attention based Speaker Embeddings for NIST SRE 2019

0:00:14	i variable
0:00:16	today we present the paper speaker characterization using tdnn
0:00:22	tdnn-lstm
0:00:25	and tdnn-lstm-attention based
0:00:27	speaker embeddings for nist sre two thousand nineteen
0:00:31	the right
0:00:33	my and gently one
0:00:38	basically nine
0:00:40	first we like that you a large
0:00:45	my boss range
0:00:47	and that we use the used a
0:00:50	five
0:00:51	and we the tree
0:00:56	about the punch
0:00:58	the network based speaker dataset
0:01:02	and three demonstrate very what also
0:01:06	and
0:01:07	because the mainstream mixture
0:01:10	different
0:01:11	i for one thing works structure what
0:01:14	oops
0:01:15	such as convolution one they work
0:01:21	i did you walk
0:01:23	here
0:01:26	the lowest eer
0:01:27	a vectorized that's or
0:01:31	in speaker embedding
0:01:34	area cordoned sartre soccer but it to freeze
0:01:38	a pension
0:01:39	in that picture
0:01:41	so
0:01:42	we can is t
0:01:44	use a better to better talk of these two to speaker recognition
0:01:52	this paper proposes speaker characterization
0:01:56	using active they only work
0:01:58	don't
0:01:59	sure that
0:02:00	neural network and attention to get a robust representation
0:02:06	of the speaker
0:02:10	and the
0:02:12	well
0:02:13	right dependability
0:02:15	used
0:02:17	is are
0:02:18	the variation that that's the
0:02:21	the nist speaker recognition evaluation
0:02:27	conducted by the
0:02:29	us national institute of standards
0:02:32	and technology
0:02:34	since
0:02:35	nineteen ninety six
0:02:40	for real application different sure i'm sorry features
0:02:46	but what
0:02:47	right feature
0:02:48	it makes the speech
0:02:51	the nist sre ten show
0:02:58	i will take years but wasn't makes the
0:03:03	mastery power
0:03:05	right proposed the first neural network based
0:03:09	speaker embedding
0:03:11	i also has brought before
0:03:15	feature errors
0:03:17	final by a couple of its the
0:03:24	neural network based speaker embedding
0:03:28	is the
0:03:29	mainstream or coded
0:03:32	speaker recognition
0:03:34	and thus
0:03:36	first speaker
0:03:37	speaker mister a
0:03:40	tdnn based structure
0:03:43	the neural network structure
0:03:46	has
0:03:47	two parts
0:03:48	first
0:03:49	the speech will be processed
0:03:53	at frame level
0:03:55	representation
0:03:56	followed by the
0:03:58	statistics pooling
0:04:02	been
0:04:03	there are two
0:04:04	second but
0:04:05	therefore
0:04:07	tends to who
0:04:10	and you're we
0:04:12	is true first
0:04:13	there
0:04:14	the combined than for others
0:04:17	speaker very
0:04:20	in this study
0:04:22	i for the
0:04:23	well
0:04:25	we praise
0:04:25	the
0:04:26	second it so there
0:04:28	you can with their
0:04:30	robust they're
0:04:31	according to
0:04:36	work
0:04:37	structure
0:04:41	in addition
0:04:43	i also used
0:04:46	attention layer to
0:04:48	weight
0:04:50	the statistics pooling layer
0:04:53	accordingly
0:04:55	what structure press at the receiver tension
0:05:00	speaker but
0:05:07	in this study
0:05:09	but i australian feature extraction are
0:05:13	based k to find a good features
0:05:15	for speaker embedding
0:05:18	through acoustic features there are trendy for all go far
0:05:22	the first is mel frequency cepstral coefficient feature
0:05:27	i cory and three
0:05:29	basically
0:05:30	okay recognition
0:05:33	you know
0:05:36	the second is
0:05:37	mel-scale filter bank feature
0:05:42	p
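Editor's sketch: the features mentioned above (mel frequency cepstral coefficients and mel-scale filter banks) both rest on the mel frequency scale. A minimal illustration of the widely used conversion formula and of how filter center frequencies can be spaced evenly on that scale. The function names, the 20 Hz to 3700 Hz range and the filter count are illustrative assumptions, not settings from the talk:

```python
import math


def hz_to_mel(hz):
    """Convert frequency in Hz to mels (common 2595 * log10 form)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def filter_center_frequencies(num_filters, low_hz, high_hz):
    """Center frequencies (Hz) of filters spaced evenly on the mel scale."""
    low_mel = hz_to_mel(low_hz)
    high_mel = hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (num_filters + 1)
    return [mel_to_hz(low_mel + step * (i + 1)) for i in range(num_filters)]


# Hypothetical configuration for narrowband telephone speech.
centers = filter_center_frequencies(23, 20.0, 3700.0)
```

Even spacing in mels gives filters that are dense at low frequencies and sparse at high frequencies.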
0:05:46	to me
0:05:47	could be well it backwards with your check
0:05:51	four kinds of data augmentation
0:05:54	are used
0:05:55	took it seven
0:05:56	you cultural for each of the top
0:06:00	the you're saying and data points that if the
0:06:03	current to wrap
0:06:05	the original audio file
0:06:07	which each but between
0:06:10	no
0:06:12	utterance
0:06:14	no problems
0:06:18	in this thing
0:06:20	is the simulated impulse response
0:06:24	i used to cover all reaching or
0:06:27	right column
0:06:29	okay
0:06:31	right in aspects problems
0:06:34	so it
0:06:35	speech vision
0:06:38	try to one for speech
0:06:40	two
0:06:41	like that's
0:06:44	well just as a
0:06:46	original reach
0:06:49	the last
0:06:50	the that you a patient
0:06:52	original
0:06:53	what if i
0:06:55	gail
0:06:56	which the training data
0:06:58	very approach or four
0:07:00	but you advantage future or right
0:07:04	by using
0:07:06	such for kernel in addition
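Editor's sketch: the passage above discusses augmenting the original audio to enlarge the training data. One common form of augmentation is mixing a noise signal into clean speech at a chosen signal-to-noise ratio. The following is a minimal illustration of that idea only. Whether the talk's recipe used exactly this step is an assumption, and the function name `mix_at_snr` is hypothetical:

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so the result has the requested SNR in dB.

    speech, noise: lists of float samples; noise must be at least as
    long as speech and is truncated to the same length.
    """
    if not speech:
        raise ValueError("speech is empty")
    if len(noise) < len(speech):
        raise ValueError("noise is shorter than speech")
    noise = noise[: len(speech)]
    speech_power = sum(x * x for x in speech) / len(speech)
    noise_power = sum(x * x for x in noise) / len(noise)
    if noise_power == 0.0:
        raise ValueError("noise has zero power")
    # Scale noise so that speech_power / scaled_noise_power == 10^(snr/10).
    scale = math.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return [s + scale * n for s, n in zip(speech, noise)]
```

Each augmented copy keeps the speaker label of the clean utterance, so the training set grows without new recordings.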
0:07:10	there are
0:07:11	seven corpus
0:07:14	origin
0:07:14	that are it
0:07:22	thus are train artificial
0:07:26	instead
0:07:27	nist sre
0:07:29	switchboard
0:07:30	bonastre
0:07:31	it aspect
0:07:33	that was therefore it after
0:07:35	do correctly for
0:07:37	q
0:07:38	we should okay first and sit
0:07:42	i for one clean speech
0:07:45	for our molding
0:07:48	one utterances from eighty six summon speaker
0:07:52	but i
0:07:54	it's a huge amount of it
0:07:59	well you material should it also nist sre sound and eight
0:08:04	it is i two so that night in a heartbeat
0:08:09	the most
0:08:10	available training data which
0:08:13	because the state yes
0:08:16	it can be expressed are all speech
0:08:18	you know in speech
0:08:21	only
0:08:21	well do you or but to me but
0:08:26	and i
0:08:27	so
0:08:28	it for me for feature extraction
0:08:34	right we are sure
0:08:40	a couple minutes the i it weights
0:08:43	there
0:08:43	national institute of standards
0:08:46	and technology matched speaker recognition evaluation task
0:08:52	sre it was sort of a start to that night
0:08:59	experimental results showed that the cost structure their decision cost function
0:09:07	well the
0:09:08	going segment
0:09:09	two
0:09:10	and
0:09:10	zero point
0:09:13	see
0:09:13	right
0:09:14	two
0:09:17	which the nist
0:09:18	sre two thousand eighteen
0:09:21	and sre two thousand nineteen evaluation sets respectively
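Editor's sketch: the results above are reported in terms of a decision cost function. A minimal illustration of the normalized detection cost in its standard form, a weighted sum of miss and false-alarm rates divided by the cost of the best trivial system. The target prior of 0.01 and unit costs are illustrative defaults, not the evaluation's official parameters, and the function names are hypothetical:

```python
def error_rates(target_scores, nontarget_scores, threshold):
    """Miss and false-alarm rates when accepting scores >= threshold."""
    p_miss = sum(s < threshold for s in target_scores) / len(target_scores)
    p_fa = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
    return p_miss, p_fa


def normalized_dcf(p_miss, p_fa, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Detection cost divided by the cost of always accepting or rejecting."""
    dcf = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
    default_cost = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return dcf / default_cost


def min_dcf(target_scores, nontarget_scores, **params):
    """Smallest normalized cost over all candidate thresholds."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    thresholds.append(thresholds[-1] + 1.0)  # threshold that rejects everything
    return min(
        normalized_dcf(*error_rates(target_scores, nontarget_scores, t), **params)
        for t in thresholds
    )
```

A value of 1.0 means the system is no better than rejecting every trial, so lower is better.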
0:09:30	this figure this table
0:09:33	chaudhari
0:09:36	well allows you know that
0:09:39	the best performance
0:09:42	there are fixed
0:09:46	i compare the first and second
0:09:50	segment variable speed but it
0:09:53	they also come
0:09:55	see if a feature
0:09:58	well all we can
0:10:00	fun
0:10:02	filled up in
0:10:03	these each feature
0:10:06	awful
0:10:06	you know this the feature
0:10:11	we also
0:10:14	so i
0:10:15	the first
0:10:16	segment i
0:10:18	speaker big be weighted a second
0:10:22	the speaker but something the second their speaker at
0:10:29	for the first their speaker
0:10:32	i
0:10:34	result
0:10:35	so
0:10:36	i think
0:10:37	both the speaker
0:10:40	first bears a bit
0:10:42	they for dimension of the image
0:10:46	we can use the score fusion
0:10:49	okay vector itself
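Editor's sketch: score fusion, mentioned above, combines the trial scores of several systems into one score per trial. A minimal illustration using a weighted sum with equal weights by default. The function name and the weighting scheme are assumptions. The talk does not say how its fusion weights were chosen:

```python
def fuse_scores(system_scores, weights=None):
    """Weighted sum of per-trial scores from several systems.

    system_scores: list of score lists, one per system, aligned by trial.
    weights: one weight per system; defaults to equal weights.
    """
    num_systems = len(system_scores)
    if num_systems == 0:
        raise ValueError("need at least one system")
    num_trials = len(system_scores[0])
    if any(len(scores) != num_trials for scores in system_scores):
        raise ValueError("all systems must score the same trials")
    if weights is None:
        weights = [1.0 / num_systems] * num_systems
    if len(weights) != num_systems:
        raise ValueError("need one weight per system")
    return [
        sum(w * scores[i] for w, scores in zip(weights, system_scores))
        for i in range(num_trials)
    ]
```

In practice the scores of each system are usually calibrated or normalized to a comparable range before they are combined.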
0:10:58	since
0:10:59	i file
0:11:00	filter bank feature was a feature vector function
0:11:07	and also be noted that the cost fifty and draws attention c
0:11:12	and eighty dollars
0:11:15	we so what role
0:11:20	extent also mention i'll for sure
0:11:23	the next frame
0:11:24	therefore it should
0:11:28	what are trained based on
0:11:30	the pen
0:11:32	each feature
0:11:34	this type of show
0:11:36	we can find
0:11:37	by using white
0:11:39	role
0:11:40	for in
0:11:41	they will refer to ensure
0:11:43	we can pick the performance
0:11:51	finally
0:11:52	well so that all call
0:11:55	and ninety six and
0:11:57	by using expensive but it is that file and feature and then it is the
0:12:02	back-end scoring
0:12:05	why final submission
0:12:08	that is
0:12:10	where it is
0:12:12	much
0:12:14	each year suspension
0:12:17	bic we wish the
0:12:19	so q two
0:12:21	one two cards
0:12:24	once your feet it's
0:12:28	do you got but not for right
0:12:33	for
0:12:35	pretty much are you
0:12:38	this table show
0:12:40	by the final file for this site tools on it
0:12:44	it is i thought it right
0:12:48	you deterioration
0:12:55	that we show that a portion
0:13:01	this paper to use that system so
0:13:04	to a
0:13:05	next slide so that night
0:13:08	ct has task
0:13:09	i'm scroll neural network
0:13:12	structure
0:13:13	which operates on india and at a at least
0:13:17	and you know extra tight shot
0:13:20	it showed up and have your
0:13:23	and you may speak at
0:13:24	there and sixty you know the lp and feature analysis
0:13:30	i used
0:13:32	channel that's k
0:13:33	we did
0:13:34	feature
0:13:36	mixer six sre
0:13:38	so which what a watch therefore
0:13:41	that one
0:13:42	be a huge
0:13:44	six
0:13:46	no prior for
0:13:48	because our compensation is that what we
0:13:52	be well in that the of available training there
0:13:57	the proposed mixer shooter it should
0:14:01	this year
0:14:02	score
0:14:03	you or initially suitable for
0:14:07	to zero
0:14:09	contrary nine five
0:14:12	the
0:14:12	next
0:14:13	nist sre two thousand eighteen and sre two thousand nineteen evaluation datasets
0:14:22	thank you
0:14:23	thank you very much


Speech Application

Chien-Lin Huang