| 0:00:15 | i | 
|---|
| 0:00:20 | she had your dark suit in greasy wash water all year | 
|---|
| 0:00:25 | so, session number eight, on features for speaker recognition; we've got five | 
|---|
| 0:00:31 | papers | 
|---|
| 0:00:32 | that will be presented in the session | 
|---|
| 0:00:34 | there is a bit of time | 
|---|
| 0:00:37 | before we need to leave | 
|---|
| 0:00:38 | for this evening's event, so | 
|---|
| 0:00:44 | we can actually run a little bit over afterwards for | 
|---|
| 0:00:46 | discussion | 
|---|
| 0:00:48 | so the first talk is on feature extraction using two-dimensional autoregressive models for | 
|---|
| 0:00:54 | speaker recognition, from the Johns Hopkins group | 
|---|
| 0:00:57 | who will be presenting the paper | 
|---|
| 0:01:02 | oh | 
|---|
| 0:01:04 | i think that you want to ask once you got constraints also i'm just sitting | 
|---|
| 0:01:08 | here watching | 
|---|
| 0:01:10 | so that neither idea of the last but not what they are | 
|---|
| 0:01:15 | is that i want to use this also to start, if | 
|---|
| 0:01:20 | possible, some discussion about features in general for speaker recognition | 
|---|
| 0:01:24 | because i think we started that yesterday and i came to realize that we have some | 
|---|
| 0:01:29 | issues, just as in the mainstream | 
|---|
| 0:01:31 | so i have a few slides at the beginning which perhaps | 
|---|
| 0:01:34 | will be a bit more general than what i want to talk about later | 
|---|
| 0:01:39 | and then i'll come back to the paper again | 
|---|
| 0:01:43 | i always like it if you have questions during the presentation | 
|---|
| 0:01:47 | please ask me immediately, don't feel shy | 
|---|
| 0:01:50 | i mean, if we don't get through all the slides | 
|---|
| 0:01:53 | that's fine; the worst case is that everybody sits here and you don't know what i'm talking about | 
|---|
| 0:01:58 | so just keep asking questions or something | 
|---|
| 0:02:02 | so the story is the following | 
|---|
| 0:02:07 | we have speech | 
|---|
| 0:02:09 | and speech carries several streams of information | 
|---|
| 0:02:14 | there is the speaker | 
|---|
| 0:02:18 | there is | 
|---|
| 0:02:19 | of course the environment | 
|---|
| 0:02:21 | and there is the message | 
|---|
| 0:02:23 | and this is what | 
|---|
| 0:02:25 | speech | 
|---|
| 0:02:27 | is primarily carrying, and typically | 
|---|
| 0:02:29 | each of us is after | 
|---|
| 0:02:33 | one of them | 
|---|
| 0:02:35 | so these are really the three sources | 
|---|
| 0:02:37 | if you are after the speaker, the other two, the environment and the message | 
|---|
| 0:02:44 | can be considered a | 
|---|
| 0:02:47 | nuisance | 
|---|
| 0:02:47 | for the speaker task | 
|---|
| 0:02:49 | oh | 
|---|
| 0:02:50 | for speaker recognition there are a number of things | 
|---|
| 0:02:54 | which we may consider as disturbing, as noise | 
|---|
| 0:02:59 | only the speaker matters; the message and the environment are just annoying | 
|---|
| 0:03:07 | sources of information which you would like to be invariant to | 
|---|
| 0:03:15 | they are a part of the signal, but you do not want this information | 
|---|
| 0:03:21 | right | 
|---|
| 0:03:22 | you | 
|---|
| 0:03:23 | and it involves | 
|---|
| 0:03:25 | the analysis, the features | 
|---|
| 0:03:27 | and the classifier | 
|---|
| 0:03:30 | the analysis is the part which we knew | 
|---|
| 0:03:35 | before | 
|---|
| 0:03:37 | we saw the data | 
|---|
| 0:03:38 | this is based on what we learned in school or whatever we got from previous experience | 
|---|
| 0:03:44 | with the data | 
|---|
| 0:03:46 | and then there is a classifier, and the classifier is typically trained | 
|---|
| 0:03:50 | now the distinction between analysis and classification is somehow blurring, because we now also train the feature | 
|---|
| 0:03:58 | extraction | 
|---|
| 0:03:59 | so | 
|---|
| 0:04:01 | that is | 
|---|
| 0:04:03 | some you know like | 
|---|
| 0:04:05 | but as i said, this is exactly what we | 
|---|
| 0:04:10 | did before | 
|---|
| 0:04:12 | also | 
|---|
| 0:04:14 | the outcome of this whole process should be, in our case, the identity of the speaker | 
|---|
| 0:04:19 | right, so the goal of this process | 
|---|
| 0:04:23 | is somehow alleviating the unwanted sources | 
|---|
| 0:04:26 | of information | 
|---|
| 0:04:28 | yeah | 
|---|
| 0:04:29 | and stressing the information about the speaker, so you would like to see an analysis | 
|---|
| 0:04:34 | which somehow suppresses the message | 
|---|
| 0:04:37 | and the influence of the environment and so on, and enhances the information | 
|---|
| 0:04:43 | about who is speaking | 
|---|
| 0:04:46 | but of course what we also learned over the years in speech research | 
|---|
| 0:04:52 | is that very often | 
|---|
| 0:04:54 | what is used is whatever is already around, because that is what | 
|---|
| 0:05:00 | you have or what you can easily get | 
|---|
| 0:05:03 | and so on | 
|---|
| 0:05:05 | yeah | 
|---|
| 0:05:05 | and in speaker recognition we ended up much the same way as we ended up in | 
|---|
| 0:05:12 | speech recognition | 
|---|
| 0:05:13 | we know how to process speech, that is | 
|---|
| 0:05:19 | you take the signal, you warp the frequency axis nonlinearly, and you get | 
|---|
| 0:05:26 | a sequence of vectors, each of them describing the signal in different frequency sub-bands | 
|---|
| 0:05:30 | you more or less | 
|---|
| 0:05:32 | ignore the phase there, and you do it somehow, you know, in quotes | 
|---|
| 0:05:38 | like hearing does, because people believe that hearing is to some extent the first thing that processes | 
|---|
| 0:05:42 | the signal, and its properties, some of its properties, might be useful | 
|---|
| 0:05:48 | and | 
|---|
| 0:05:50 | so that is how we process the signal; the analysis is very much like that | 
|---|
| 0:05:57 | right here | 
|---|
| 0:05:59 | so this typically starts | 
|---|
| 0:06:02 | with the filter bank | 
|---|
| 0:06:03 | and then some modifications, depending on the school of thought | 
|---|
| 0:06:08 | there can be different modifications | 
|---|
| 0:06:11 | the PLP people do different modifications than the MFCC people, and so on, so there are different | 
|---|
| 0:06:18 | schools | 
|---|
| 0:06:19 | and | 
|---|
| 0:06:22 | then we take a cosine transform in all cases, because | 
|---|
| 0:06:27 | after the modifications there is most likely some compression, some kind of logarithm | 
|---|
| 0:06:32 | something happens there, and the cosine transform approximately decorrelates the features | 
|---|
| 0:06:37 | and you get the cepstrum | 
|---|
| 0:06:39 | and the cepstrum is what we are using | 
|---|
| 0:06:42 | both in speech and in speaker recognition, and all of this is inherited | 
|---|
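A minimal sketch of the generic pipeline described above (filter bank on the power spectrum, a compression step, then a cosine transform); the triangular mel filter bank and all parameter values here are illustrative assumptions, not the exact PLP or MFCC recipe:

```python
import numpy as np
from scipy.fft import dct

def mel(f_hz):
    # standard mel warping of the frequency axis
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_filterbank(n_filters, n_fft, sr):
    # triangular filters equally spaced on the mel axis (illustrative layout)
    edges_mel = np.linspace(0.0, mel(sr / 2.0), n_filters + 2)
    edges_hz = 700.0 * (10.0 ** (edges_mel / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * edges_hz / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, mid):
            fb[m - 1, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):
            fb[m - 1, k] = (hi - k) / max(hi - mid, 1)
    return fb

def cepstra(signal, sr, n_fft=512, hop=160, n_filters=24, n_ceps=13):
    # frame the signal, take sub-band energies (phase is discarded),
    # compress with a log and decorrelate with a cosine transform
    fb = mel_filterbank(n_filters, n_fft, sr)
    feats = []
    for start in range(0, len(signal) - n_fft, hop):
        frame = signal[start:start + n_fft] * np.hamming(n_fft)
        power = np.abs(np.fft.rfft(frame)) ** 2
        energies = fb @ power
        feats.append(dct(np.log(energies + 1e-10), type=2, norm='ortho')[:n_ceps])
    return np.array(feats)
```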
| 0:06:47 | so that's because | 
|---|
| 0:06:50 | we borrowed this representation | 
|---|
| 0:06:53 | the speaker recognition people borrowed it from speech recognition | 
|---|
| 0:06:59 | the speech recognition people in turn borrowed the representation from the speech coding people | 
|---|
| 0:07:06 | and so on, so basically we are standing | 
|---|
| 0:07:08 | on the shoulders of giants | 
|---|
| 0:07:11 | right so much as i mentioned briefly at the work site you | 
|---|
| 0:07:17 | i | 
|---|
| 0:07:18 | yeah data is actually a slight | 
|---|
| 0:07:24 | online | 
|---|
| 0:07:25 | so what were the sources of variability at that time | 
|---|
| 0:07:31 | so the sources were different channels | 
|---|
| 0:07:35 | what you | 
|---|
| 0:07:36 | right most in one | 
|---|
| 0:07:39 | interspeech case | 
|---|
| 0:07:40 | so we use a set of points | 
|---|
| 0:07:42 | about the speech sound | 
|---|
| 0:07:44 | that's why we shouldn't | 
|---|
| 0:07:48 | what conditions | 
|---|
| 0:07:50 | you | 
|---|
| 0:07:51 | this information which is of course | 
|---|
| 0:07:56 | the design | 
|---|
| 0:07:58 | the most suitable | 
|---|
| 0:08:00 | function | 
|---|
| 0:08:01 | and so forth | 
|---|
| 0:08:03 | this is something you | 
|---|
| 0:08:05 | of course is you don't live which i or typically work will be first thing | 
|---|
| 0:08:11 | this | 
|---|
| 0:08:12 | the but you just changing channels down | 
|---|
| 0:08:17 | a lot of course also the goal | 
|---|
| 0:08:20 | and high basically space exposed | 
|---|
| 0:08:25 | so this is the formation | 
|---|
| 0:08:27 | which i feel E you will be are not speak | 
|---|
| 0:08:32 | yeah so yeah pretty funny "'cause" it's a little late and of course | 
|---|
| 0:08:38 | say | 
|---|
| 0:08:40 | briefly, speaker recognition techniques like the universal background model | 
|---|
| 0:08:46 | joint factor analysis and so on can match speakers | 
|---|
| 0:08:50 | in some cases embarrassingly well | 
|---|
| 0:08:56 | doesn't exist by from the G | 
|---|
| 0:08:59 | so | 
|---|
| 0:09:01 | probably doesn't is not sure that | 
|---|
| 0:09:08 | i | 
|---|
| 0:09:11 | now let's see how much this machinery learns, i mean, from these data | 
|---|
| 0:09:17 | i | 
|---|
| 0:09:20 | so | 
|---|
| 0:09:22 | i | 
|---|
| 0:09:35 | i | 
|---|
| 0:09:40 | i | 
|---|
| 0:09:49 | all | 
|---|
| 0:09:53 | this is so this is like i think that | 
|---|
| 0:09:57 | yeah exactly | 
|---|
| 0:10:01 | my | 
|---|
| 0:10:03 | you know, this is a spectrogram, so it's not the cepstrum, it's the spectrum | 
|---|
| 0:10:07 | so that is because we copied some as well as far as a very fast | 
|---|
| 0:10:15 | where | 
|---|
| 0:10:16 | yeah | 
|---|
| 0:10:19 | firstly | 
|---|
| 0:10:20 | yeah that's | 
|---|
| 0:10:22 | so i think that it might be worthwhile looking back into these | 
|---|
| 0:10:28 | the basic analysis | 
|---|
| 0:10:29 | because we have much more data and very fancy processing techniques; one may | 
|---|
| 0:10:36 | want to know how much | 
|---|
| 0:10:38 | variability there is, yeah, exactly how much variability | 
|---|
| 0:10:43 | i | 
|---|
| 0:10:44 | i | 
|---|
| 0:10:54 | i | 
|---|
| 0:10:55 | i | 
|---|
| 0:11:03 | so | 
|---|
| 0:11:10 | i | 
|---|
| 0:11:11 | i | 
|---|
| 0:11:15 | yes | 
|---|
| 0:11:17 | and the techniques which you can be physical for recognizing speaker actually very much bigger | 
|---|
| 0:11:23 | than | 
|---|
| 0:11:24 | that is | 
|---|
| 0:11:26 | maybe | 
|---|
| 0:11:27 | is it is misleading because we use | 
|---|
| 0:11:31 | when you | 
|---|
| 0:11:32 | speaker dependent on | 
|---|
| 0:11:34 | i | 
|---|
| 0:11:36 | yeah | 
|---|
| 0:11:36 | i | 
|---|
| 0:11:39 | and | 
|---|
| 0:11:40 | what are you want | 
|---|
| 0:11:41 | ask | 
|---|
| 0:11:42 | or maybe sets it they pay cisco phase right | 
|---|
| 0:11:47 | somebody | 
|---|
| 0:11:49 | and | 
|---|
| 0:11:51 | but at the same time i decided that this work | 
|---|
| 0:11:54 | the work on other sources and methods applied | 
|---|
| 0:11:59 | to speech; there might be features more specific for speaker recognition | 
|---|
| 0:12:04 | but that would be another story, so the results | 
|---|
| 0:12:07 | this | 
|---|
| 0:12:08 | i talk about are based on deriving the spectrum, or are focused on it | 
|---|
| 0:12:15 | normally you take a short segment of the signal, you know, a few hundredths of a second | 
|---|
| 0:12:22 | and after some preprocessing you fit an autoregressive model, i mean | 
|---|
| 0:12:28 | and what we get is the log spectrum, and from the model | 
|---|
| 0:12:35 | a spectral | 
|---|
| 0:12:39 | envelope | 
|---|
| 0:12:41 | right | 
|---|
| 0:12:42 | as a sequence, as a function of time | 
|---|
| 0:12:44 | you can also do it differently, and this is what | 
|---|
| 0:12:49 | we are presenting here | 
|---|
| 0:12:51 | you take a relatively long segment of the signal | 
|---|
| 0:12:55 | and do exactly the same thing | 
|---|
| 0:12:58 | but you do it on the cosine transform of the signal | 
|---|
| 0:13:02 | then you are able to derive the model in a particular | 
|---|
| 0:13:09 | frequency band | 
|---|
| 0:13:12 | band by band, and you end up with a time-frequency representation | 
|---|
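A minimal numerical sketch of that idea as I read it: linear prediction applied to the cosine transform of a long segment, band by band, giving a smooth temporal envelope in each sub-band. The band layout, model order and number of output points below are illustrative assumptions, not the paper's recipe (real frequency-domain linear prediction implementations typically use overlapping, tapered windows over the DCT coefficients rather than the rectangular bands used here):

```python
import numpy as np
from scipy.fft import dct
from scipy.linalg import solve_toeplitz

def lpc(x, order):
    # autocorrelation-method linear prediction
    r = np.correlate(x, x, mode='full')[len(x) - 1:len(x) + order]
    a = solve_toeplitz(r[:order], r[1:order + 1])       # predictor coefficients
    gain = r[0] - a @ r[1:order + 1]                     # residual energy
    return np.concatenate(([1.0], -a)), gain

def fdlp_envelopes(segment, n_bands=8, order=40, n_points=400):
    # cosine transform of the whole (long) segment of signal
    c = dct(np.asarray(segment, dtype=float), type=2, norm='ortho')
    band_len = len(c) // n_bands
    envelopes = []
    for b in range(n_bands):
        sub = c[b * band_len:(b + 1) * band_len]         # DCT coefficients of one band
        a, gain = lpc(sub, order)
        # the AR "spectrum" of frequency-domain coefficients is a smooth
        # estimate of the squared temporal envelope of that sub-band
        w = np.linspace(0.0, np.pi, n_points)
        denom = np.abs(np.polyval(a[::-1], np.exp(-1j * w))) ** 2
        envelopes.append(gain / (denom + 1e-12))
    return np.array(envelopes)    # (n_bands, n_points): a time-frequency representation
```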
| 0:13:18 | just like you for this is that you know i sometimes like all overlay this | 
|---|
| 0:13:25 | is this is a very rich people whose second level or when they do this | 
|---|
| 0:13:30 | i | 
|---|
| 0:13:30 | spectral | 
|---|
| 0:13:32 | and this is maybe more the way hearing is working, because the ear first | 
|---|
| 0:13:37 | splits | 
|---|
| 0:13:38 | the signal, speech or anything else, into | 
|---|
| 0:13:41 | frequency components, and only then | 
|---|
| 0:13:45 | looks at how they evolve in time; so this is the way, if what is important for | 
|---|
| 0:13:50 | you is to somehow get a | 
|---|
| 0:13:53 | global picture this way | 
|---|
| 0:13:55 | start | 
|---|
| 0:13:56 | this | 
|---|
| 0:13:57 | well i enough not be possible at you know which we can see if i | 
|---|
| 0:14:02 | was | 
|---|
| 0:14:03 | which one | 
|---|
| 0:14:05 | if you just look at the picture you might believe me | 
|---|
| 0:14:08 | okay | 
|---|
| 0:14:09 | yeah | 
|---|
| 0:14:10 | so this is what we call frequency domain linear prediction; my graduate students | 
|---|
| 0:14:16 | were calling it FDLP | 
|---|
| 0:14:16 | as opposed to time domain prediction | 
|---|
| 0:14:18 | or to perceptual linear prediction, so this can be set side by side with it | 
|---|
| 0:14:24 | but i think there is quite a bit of perception in it too | 
|---|
| 0:14:28 | it's | 
|---|
| 0:14:30 | as the | 
|---|
| 0:14:31 | so here is one example | 
|---|
| 0:14:34 | we have a signal | 
|---|
| 0:14:36 | you take a frequency band | 
|---|
| 0:14:38 | fit an all-pole model | 
|---|
| 0:14:41 | of its envelope | 
|---|
| 0:14:43 | and you also have the carrier | 
|---|
| 0:14:45 | that is, what is left after | 
|---|
| 0:14:48 | the envelope is taken out | 
|---|
| 0:14:50 | and you do that in different frequency bands, so that | 
|---|
| 0:14:54 | the time domain signal is split into bands | 
|---|
| 0:14:59 | and each frequency band into an envelope and a carrier | 
|---|
| 0:15:05 | so one can resynthesize speech from the envelopes only, and one can also resynthesize speech from the carriers | 
|---|
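A small illustration of that envelope/carrier split, using a generic band-pass filter and the Hilbert transform; the band edges and filter order are arbitrary choices for the sketch, and the talk's actual decomposition comes from the all-pole FDLP model rather than from scipy's Hilbert transform:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def envelope_and_carrier(signal, sr, band=(1000.0, 2000.0)):
    # isolate one frequency band, then split it into a slowly varying
    # envelope and a carrier (what is left after the envelope is taken out)
    sos = butter(4, band, btype='bandpass', fs=sr, output='sos')
    sub = sosfiltfilt(sos, np.asarray(signal, dtype=float))
    env = np.abs(hilbert(sub))          # temporal (Hilbert) envelope
    carrier = sub / (env + 1e-12)       # roughly unit-envelope carrier
    return env, carrier
```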
| 0:15:12 | yeah | 
|---|
| 0:15:16 | so if you | 
|---|
| 0:15:18 | the signal | 
|---|
| 0:15:21 | oh | 
|---|
| 0:15:22 | oh search | 
|---|
| 0:15:24 | i | 
|---|
| 0:15:24 | i table you | 
|---|
| 0:15:30 | yeah | 
|---|
| 0:15:32 | i | 
|---|
| 0:15:34 | oh | 
|---|
| 0:15:38 | oh search | 
|---|
| 0:15:41 | i just don't | 
|---|
| 0:15:46 | yeah | 
|---|
| 0:15:49 | and | 
|---|
| 0:15:55 | i | 
|---|
| 0:15:59 | if you where | 
|---|
| 0:16:02 | well i | 
|---|
| 0:16:04 | i | 
|---|
| 0:16:08 | i | 
|---|
| 0:16:09 | that is to send messages because then | 
|---|
| 0:16:14 | thus | 
|---|
| 0:16:15 | speech | 
|---|
| 0:16:17 | but the bottom line here is that | 
|---|
| 0:16:19 | what we thought should not be useful for the speaker actually can be | 
|---|
| 0:16:24 | oh | 
|---|
| 0:16:24 | i | 
|---|
| 0:16:25 | a four or is that actually you know | 
|---|
| 0:16:29 | in some ways | 
|---|
| 0:16:31 | one is some there is a whole | 
|---|
| 0:16:36 | components | 
|---|
| 0:16:37 | yeah | 
|---|
| 0:16:39 | formation | 
|---|
| 0:16:40 | well | 
|---|
| 0:16:41 | also | 
|---|
| 0:16:43 | shen | 
|---|
| 0:16:43 | for | 
|---|
| 0:16:44 | speech | 
|---|
| 0:16:50 | here is that since a young | 
|---|
| 0:16:53 | a simple point here is that you get some | 
|---|
| 0:16:59 | robustness, so you know it's | 
|---|
| 0:17:03 | and you have a representation | 
|---|
| 0:17:06 | yeah | 
|---|
| 0:17:07 | in | 
|---|
| 0:17:07 | so if you have some problem here | 
|---|
| 0:17:13 | say some noise with | 
|---|
| 0:17:15 | high energy in one band, we can see | 
|---|
| 0:17:20 | oh is assumed | 
|---|
| 0:17:25 | so | 
|---|
| 0:17:26 | i mentioned in | 
|---|
| 0:17:30 | so | 
|---|
| 0:17:31 | well as a whole | 
|---|
| 0:17:33 | which i | 
|---|
| 0:17:35 | since so you'll find the right | 
|---|
| 0:17:38 | as i | 
|---|
| 0:17:40 | so if you before | 
|---|
| 0:17:41 | or if you | 
|---|
| 0:17:43 | yeah | 
|---|
| 0:17:45 | different S is divided by S | 
|---|
| 0:17:49 | and this is just a to see this somehow | 
|---|
| 0:17:54 | that's easily different frequencies | 
|---|
| 0:17:57 | depending on the frequency | 
|---|
| 0:17:59 | well | 
|---|
| 0:18:00 | channel, and you can, if you like | 
|---|
| 0:18:03 | look at one of the spectra | 
|---|
| 0:18:07 | to see that it is just a change in the | 
|---|
| 0:18:11 | overall gain of the all-pole model | 
|---|
| 0:18:14 | and that is what, like C0, we essentially just ignore | 
|---|
| 0:18:19 | in this model | 
|---|
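If I read that passage right, the point is that a fixed channel mostly scales a sub-band envelope by a constant gain, and in the log/cepstral domain that only moves the zeroth coefficient, so ignoring the model gain (or c0) removes it. A tiny generic demonstration of that log-domain fact, not code from the paper:

```python
import numpy as np
from scipy.fft import dct

# a fixed channel scales a sub-band envelope by an unknown constant gain g;
# in the log domain that is an additive constant, which only moves the zeroth
# cepstral coefficient, so dropping the gain (or c0) removes the channel effect
env = np.abs(np.random.randn(200)) + 1.0      # some positive envelope
g = 3.7                                        # unknown channel gain
c_clean = dct(np.log(env), type=2, norm='ortho')
c_chan = dct(np.log(g * env), type=2, norm='ortho')
print(np.allclose(c_clean[1:], c_chan[1:]))    # True: only c0 differs
```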
| 0:18:20 | so | 
|---|
| 0:18:22 | thus | 
|---|
| 0:18:23 | well you right side or depending on | 
|---|
| 0:18:27 | oh | 
|---|
| 0:18:28 | oh by the | 
|---|
| 0:18:31 | the signal is you and i think this task to say | 
|---|
| 0:18:36 | then | 
|---|
| 0:18:37 | so i eight | 
|---|
| 0:18:41 | also | 
|---|
| 0:18:42 | oh or similar | 
|---|
| 0:18:44 | you | 
|---|
| 0:18:46 | more | 
|---|
| 0:18:46 | more robust in the presence of additive | 
|---|
| 0:18:49 | noise, that's right | 
|---|
| 0:18:53 | is just a mess | 
|---|
| 0:18:55 | well | 
|---|
| 0:18:56 | then | 
|---|
| 0:18:58 | i | 
|---|
| 0:19:00 | so basically we so people | 
|---|
| 0:19:05 | if you look at more than me importance | 
|---|
| 0:19:11 | well | 
|---|
| 0:19:14 | and | 
|---|
| 0:19:15 | so how many that is more | 
|---|
| 0:19:19 | first thing is that | 
|---|
| 0:19:21 | speech | 
|---|
| 0:19:22 | you | 
|---|
| 0:19:23 | and | 
|---|
| 0:19:24 | these be different frequency ranges | 
|---|
| 0:19:28 | E | 
|---|
| 0:19:29 | try to find | 
|---|
| 0:19:32 | i don't know | 
|---|
| 0:19:34 | well | 
|---|
| 0:19:36 | and also different | 
|---|
| 0:19:39 | this is a state | 
|---|
| 0:19:40 | and then we want to be able to use the standard speaker recognition techniques which | 
|---|
| 0:19:47 | everybody | 
|---|
| 0:19:48 | knows and so on | 
|---|
| 0:19:50 | but for that we need | 
|---|
| 0:19:52 | something which is small | 
|---|
| 0:19:54 | that is, cepstra basically | 
|---|
| 0:19:57 | so we take the cepstrum with respect to frequency | 
|---|
| 0:20:02 | the spectral cepstrum | 
|---|
| 0:20:04 | and also with respect to time, so that we cover all of it | 
|---|
| 0:20:09 | oh | 
|---|
| 0:20:09 | yeah, and then you do this | 
|---|
| 0:20:13 | over time | 
|---|
| 0:20:14 | on this time-frequency representation | 
|---|
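One plausible reading of "cepstra with respect to frequency and with respect to time" is a two-dimensional cosine transform of the log time-frequency envelope; a tiny illustrative sketch under that assumption (the truncation sizes are arbitrary, and this is not necessarily the paper's exact feature):

```python
import numpy as np
from scipy.fft import dct

def two_d_cepstrum(log_tf, n_freq_ceps=13, n_time_ceps=10):
    # log_tf: log time-frequency envelope, shape (n_bands, n_frames)
    c_freq = dct(log_tf, type=2, norm='ortho', axis=0)[:n_freq_ceps]      # cepstrum along frequency
    c_2d = dct(c_freq, type=2, norm='ortho', axis=1)[:, :n_time_ceps]     # cepstrum along time (modulations)
    return c_2d.flatten()    # one small feature vector for the segment
```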
| 0:20:17 | i | 
|---|
| 0:20:17 | this is me | 
|---|
| 0:20:21 | here we already removed | 
|---|
| 0:20:23 | okay some | 
|---|
| 0:20:28 | you | 
|---|
| 0:20:31 | that is | 
|---|
| 0:20:32 | then | 
|---|
| 0:20:33 | yeah | 
|---|
| 0:20:34 | yeah | 
|---|
| 0:20:36 | she is much longer | 
|---|
| 0:20:40 | responsible rule | 
|---|
| 0:20:42 | very short | 
|---|
| 0:20:46 | the communication | 
|---|
| 0:20:48 | so it's yeah | 
|---|
| 0:20:51 | oh that's out | 
|---|
| 0:20:53 | i | 
|---|
| 0:20:53 | style | 
|---|
| 0:20:57 | which i theses so you know | 
|---|
| 0:21:01 | so our | 
|---|
| 0:21:02 | first | 
|---|
| 0:21:08 | yeah i | 
|---|
| 0:21:11 | yeah i think that both my main street | 
|---|
| 0:21:14 | one | 
|---|
| 0:21:16 | yeah | 
|---|
| 0:21:18 | and we also | 
|---|
| 0:21:20 | a false one | 
|---|
| 0:21:31 | i | 
|---|
| 0:21:33 | i | 
|---|
| 0:21:34 | performance | 
|---|
| 0:21:38 | and | 
|---|
| 0:21:39 | oh | 
|---|
| 0:21:41 | this is | 
|---|
| 0:21:42 | both | 
|---|
| 0:21:44 | i | 
|---|
| 0:21:56 | so | 
|---|
| 0:22:01 | again | 
|---|
| 0:22:02 | right | 
|---|
| 0:22:07 | yeah | 
|---|
| 0:22:10 | right | 
|---|
| 0:22:12 | this | 
|---|
| 0:22:14 | so | 
|---|
| 0:22:16 | oh | 
|---|
| 0:22:18 | i | 
|---|
| 0:22:23 | i | 
|---|
| 0:22:26 | i know i was also | 
|---|
| 0:22:30 | i have some | 
|---|
| 0:22:32 | the task | 
|---|
| 0:22:36 | yeah i | 
|---|
| 0:22:38 | same i | 
|---|
| 0:22:39 | i | 
|---|
| 0:22:42 | but that's a | 
|---|
| 0:22:44 | you know | 
|---|
| 0:22:47 | i | 
|---|
| 0:22:48 | yeah | 
|---|
| 0:22:53 | oh | 
|---|
| 0:22:55 | you | 
|---|
| 0:22:56 | right | 
|---|
| 0:23:00 | well | 
|---|
| 0:23:01 | i | 
|---|
| 0:23:04 | yeah | 
|---|
| 0:23:04 | i | 
|---|
| 0:23:08 | oh | 
|---|
| 0:23:09 | oh | 
|---|
| 0:23:13 | i | 
|---|
| 0:23:14 | i | 
|---|
| 0:23:15 | and | 
|---|
| 0:23:16 | well | 
|---|
| 0:23:16 | where | 
|---|
| 0:23:18 | oh | 
|---|
| 0:23:19 | yeah | 
|---|
| 0:23:20 | i can't | 
|---|
| 0:23:22 | yeah | 
|---|
| 0:23:23 | i | 
|---|
| 0:23:24 | i was hoping that are supposed to be | 
|---|
| 0:23:28 | based | 
|---|
| 0:23:30 | oh yeah probably get a degree without so maybe somebody | 
|---|
| 0:23:36 | i think this there's function is expressed here | 
|---|
| 0:23:40 | but at the same time | 
|---|
| 0:23:42 | the features and the classifier, a classifier for speaker recognition | 
|---|
| 0:23:49 | uses all the knowledge in the data; it can take advantage of the fact that different areas | 
|---|
| 0:23:55 | of speech sounds | 
|---|
| 0:23:57 | are handled by different parts of the model and so on and so on | 
|---|
| 0:24:00 | it is interesting | 
|---|
| 0:24:02 | whether the front end takes advantage of that too, as somebody was pointing out | 
|---|
| 0:24:09 | so yeah that is | 
|---|
| 0:24:12 | i | 
|---|
| 0:24:21 | oh | 
|---|
| 0:24:22 | oh | 
|---|
| 0:24:25 | oh | 
|---|
| 0:24:27 | i | 
|---|
| 0:24:30 | i | 
|---|
| 0:24:32 | i | 
|---|
| 0:24:48 | no | 
|---|
| 0:24:51 | it's such that every utterance is about a sentence, so we just take the whole | 
|---|
| 0:24:57 | utterance | 
|---|
| 0:24:58 | if we have a lot of speech it could be chopped into segments of, say, | 
|---|
| 0:25:03 | five seconds, and then, whatever the length of the segment, we always build the | 
|---|
| 0:25:08 | model over the whole segment, right, and we expect | 
|---|
| 0:25:14 | the model to cover that segment of the signal | 
|---|
| 0:25:19 | by the way, if you take the signal, you remove its mean | 
|---|
| 0:25:23 | typically as the first step | 
|---|
| 0:25:26 | because otherwise the model doesn't behave well, as you can check | 
|---|
| 0:25:31 | so we center the signal first | 
|---|
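A minimal sketch of that preprocessing as described in the answer: chop a long recording into roughly five-second segments and remove the mean of each before any modelling (the exact segment length and any further normalization are my assumptions here). Each such segment could then be passed to something like the fdlp_envelopes sketch shown earlier.

```python
import numpy as np

def segments(signal, sr, seg_seconds=5.0):
    # chop a long recording into roughly five-second pieces and centre each one;
    # one model is then built per segment
    seg_len = int(seg_seconds * sr)
    for start in range(0, len(signal) - seg_len + 1, seg_len):
        seg = np.asarray(signal[start:start + seg_len], dtype=float)
        yield seg - seg.mean()            # remove the mean (DC) before modelling
```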
| 0:25:35 | this is | 
|---|
| 0:25:37 | i | 
|---|
| 0:25:40 | i | 
|---|
| 0:25:48 | vol | 
|---|
| 0:25:51 | i didn't say exactly | 
|---|
| 0:25:54 | what i said was that what might be interesting for the speaker is to use the residual | 
|---|
| 0:25:57 | you run | 
|---|
| 0:25:59 | this process, which uses the all-pole model to split the signal into different components, and yeah | 
|---|
| 0:26:05 | the envelope was the one which was used here | 
|---|
| 0:26:09 | and the carrier component was the one which was discarded | 
|---|
| 0:26:11 | but what i found interesting | 
|---|
| 0:26:14 | was that it sounded like the information about the message | 
|---|
| 0:26:21 | was pretty much gone, and what it still had was | 
|---|
| 0:26:25 | just | 
|---|
| 0:26:26 | some information, some information about the speaker | 
|---|
| 0:26:30 | i don't think anybody would assign it to the original | 
|---|
| 0:26:34 | the original speaker though | 
|---|
| 0:26:36 | the other component, the envelope, is used as it is for speech recognition | 
|---|
| 0:26:41 | that component | 
|---|
| 0:26:43 | so we just use it as a speech signal, as an utterance | 
|---|
| 0:26:47 | our phoneme recognizers were getting, what was it, fifty-five or so percent | 
|---|
| 0:26:53 | fifty percent accuracy | 
|---|
| 0:26:55 | so you can run the same machinery on the two | 
|---|
| 0:26:59 | with respect to recognizing phonemes | 
|---|
| 0:27:02 | somebody i you know | 
|---|
| 0:27:17 | i mean, what is happening: at the top the detail is lost and all the formants are gone | 
|---|
| 0:27:24 | and everything is flat | 
|---|
| 0:27:26 | and it is | 
|---|
| 0:27:28 | it's a bit | 
|---|
| 0:27:31 | i way that you don't | 
|---|
| 0:27:32 | the | 
|---|
| 0:27:38 | the only point is that it still turns out to be useful | 
|---|
| 0:27:43 | for telling something about the speaker | 
|---|
| 0:27:48 | i | 
|---|
| 0:27:49 | i | 
|---|
| 0:27:51 | oh of course i mean i see that course yeah so that might be right | 
|---|
| 0:27:58 | i | 
|---|
| 0:27:59 | a | 
|---|
| 0:28:02 | or | 
|---|
| 0:28:03 | i | 
|---|
| 0:28:10 | i | 
|---|
| 0:28:12 | of course, you know, in all these cases you | 
|---|
| 0:28:17 | ask about fusion, right, gluing things together; i once tried | 
|---|
| 0:28:22 | to publish a paper, as a matter of fact it was on speaker recognition | 
|---|
| 0:28:26 | which was called towards decreasing error rates | 
|---|
| 0:28:29 | and one of the reviewers said | 
|---|
| 0:28:32 | if he uses it here and he fuses it | 
|---|
| 0:28:35 | it is not decreasing the error rate on its own | 
|---|
| 0:28:38 | the paper was rejected; so what i am saying is, you know, if | 
|---|
| 0:28:43 | you are working on something new | 
|---|
| 0:28:46 | and of course if you use it on its own it is very likely that your | 
|---|
| 0:28:50 | performance | 
|---|
| 0:28:51 | compared to the others | 
|---|
| 0:28:52 | degrades; that's why our paper was towards decreasing error rates | 
|---|
| 0:28:56 | but now since we have these huge systems | 
|---|
| 0:28:58 | and that was fifteen years ago, people started working on fusion | 
|---|
| 0:29:02 | if you just go for a different source of information, if you have a | 
|---|
| 0:29:07 | different source of information, you are very likely to see the improvement after fusion; you will see | 
|---|
| 0:29:13 | that; that's why, as researchers, when you do things | 
|---|
| 0:29:18 | after the fusion you are very unlikely to increase the error rates, so it will be all right | 
|---|
| 0:29:22 | you want to do something, it doesn't work on its own, and you fuse it with what | 
|---|
| 0:29:26 | works | 
|---|
| 0:29:27 | and you can present it at the conference | 
|---|
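The fusion being discussed is usually just a score-level combination of systems; a minimal sketch with a fixed weighted sum (the weights and the example scores are made up for illustration, not taken from any system in the talk):

```python
import numpy as np

def fuse_scores(scores_a, scores_b, w_a=0.5, w_b=0.5):
    # weighted sum of the per-trial scores of two systems
    return w_a * np.asarray(scores_a) + w_b * np.asarray(scores_b)

baseline = np.array([2.1, -0.7, 1.5])    # e.g. cepstral system scores (made up)
new_feat = np.array([0.4, -1.2, 0.9])    # e.g. new-feature system scores (made up)
print(fuse_scores(baseline, new_feat, w_a=0.7, w_b=0.3))
```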
| 0:29:31 | seven | 
|---|
| 0:29:33 | others | 
|---|