0:00:06okay um
0:00:08as he said my name is laura so i'm a PhD student at the university of california berkeley
0:00:14and i also work at the international
0:00:16computer science
0:00:17institute or icsi um as many of you know
0:00:20and i would like to
0:00:20just uh
0:00:21also acknowledge my co-author george doddington
0:00:24who was uh
0:00:25a fundamental part of of this work
0:00:28um
0:00:30so i just uh
0:00:31quick overview pretty standard start out with
0:00:35what we're trying to do and why
0:00:37uh then discuss
0:00:38um related work
0:00:40um and go over our approach
0:00:42to the problem uh give you the results
0:00:45uh
0:00:45do a little bit additional analysis
0:00:47and conclude um
0:00:49with a summary and
0:00:51future work
0:00:55so i think we can
0:00:56all agree that uh automatic speaker recognition
0:00:59system performance
0:01:00depends on a number of factors
0:01:02uh one of which is intrinsic speaker characteristics
0:01:07um
0:01:07so there's no surprise that
0:01:10you know as humans we notice that certain speakers sound more alike
0:01:13similarly there's
0:01:14uh
0:01:16and you have that
0:01:17uh
0:01:17system's automatic systems will perform better or worse
0:01:21for different speakers
0:01:22um
0:01:23so the goal of this work
0:01:25is to predict which
0:01:27speaker pairs
0:01:28will be difficult for automatic
0:01:30speaker recognition systems to distinguish
0:01:32uh we did some preliminary work um
0:01:35which showed that
0:01:37speaker pairs that are hard for one system are also hard for others
0:01:41it's and
0:01:42um
0:01:43and of course you can use the system and
0:01:45select speaker pairs and you'll probably do really well
0:01:48but uh we wanted to
0:01:49step away from using any one system and instead um
0:01:53have a general approach
0:01:55and just use features that will hopefully capture uh
0:01:59oh
0:02:00degree of
0:02:00uh
0:02:01speaker similarity
0:02:03um
0:02:04the motivation
0:02:05uh
0:02:06besides
0:02:07being an interesting task
0:02:09it is
0:02:09to potentially better focus
0:02:11uh
0:02:12the research and reduce the amount of data needed to estimate
0:02:15system performance but
0:02:21so there's a couple types of
0:02:22related work um the first has to do with the idea of different speakers uh causing different problems
0:02:28um the infamous
0:02:30uh
0:02:30george doddington
0:02:31zoo paper um
0:02:33categorised speakers based on system performance
0:02:37so you have
0:02:39goats
0:02:40um
0:02:41who cause a large number of false rejections as target speakers
0:02:46you have lambs
0:02:47uh who cause a large number of false acceptances as target speakers
0:02:53wolves who cause a large number of false acceptances as impostor speakers
0:02:58and finally your default well behaved
0:03:01sheep
0:03:02um
0:03:03in this work
0:03:04uh we don't actually distinguish between lambs and wolves
0:03:07uh
0:03:08since we're looking at speaker pairs
0:03:09um
0:03:11but
0:03:11we went with wolves in the title because
0:03:13hunting for lambs didn't
0:03:15didn't sound so good
0:03:18um a couple other
0:03:20there's been other work done on uh dealing with these speakers
0:03:24that may be difficult
0:03:25um there's been work that's shown that
0:03:28there are performance differences
0:03:30between high and low pitch
0:03:31speakers
0:03:32um and then there's been some uh work done
0:03:35that
0:03:36uh tried
0:03:37to
0:03:37uh
0:03:38develop methods to deal with these
0:03:40problem speakers
0:03:44the other element of related work
0:03:46uh that's relevant
0:03:47is
0:03:47um
0:03:48the whole set of
0:03:49features that are used
0:03:50to
0:03:51describe speakers or characterise speakers
0:03:53so you can draw these from a lot of different types of work obviously uh
0:03:57speaker recognition approaches have used a variety of features
0:04:01uh certainly not an exhaustive list here but things like pitch and energy distributions or dynamics
0:04:06um
0:04:07prosodic statistics
0:04:09uh jitter and shimmer
0:04:11and
0:04:12in looking at
0:04:13perceptual speaker characterisation or discrimination you'll find
0:04:16a lot of formant frequencies and bandwidths and dynamic features
0:04:20and um
0:04:23other acoustic parameters that influence voice individuality include the pitch frequency contour and fluctuation
0:04:29again the formant frequencies and long term average spectrum
0:04:36so our approach
0:04:37is
0:04:38fairly straightforward
0:04:39um basically we compute feature values over some speech data
0:04:44uh
0:04:44corresponding to every speaker
0:04:46and then using these feature values compute a measure of similarity uh for all speaker pairs
0:04:53and
0:04:54then looking at these measures look at the uh
0:04:58speaker pairs
0:04:59that have the highest and lowest
0:05:01um
0:05:02values
0:05:03in terms of these
0:05:04uh similarity measures
0:05:05and compare performance on those speakers
0:05:08uh to all
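The selection step described above can be sketched roughly as follows. This is a minimal illustration, not the code used in the talk: the function name `select_extreme_pairs`, the dictionary layout, and the toy pitch values are all assumptions made for the example.

```python
from itertools import combinations

def select_extreme_pairs(features, distance, fraction):
    """Rank all speaker pairs by a distance and return both extremes.

    features: dict mapping speaker id -> feature value (hypothetical layout)
    distance: function of two feature values, smaller means more similar
    fraction: share of pairs to keep at each end, e.g. 0.01 for one percent
    """
    # Score every unordered speaker pair, then sort from most to least similar.
    scored = sorted(
        (distance(features[a], features[b]), a, b)
        for a, b in combinations(sorted(features), 2)
    )
    n_keep = max(1, round(fraction * len(scored)))
    most_similar = [(a, b) for _, a, b in scored[:n_keep]]
    least_similar = [(a, b) for _, a, b in scored[-n_keep:]]
    return most_similar, least_similar

# Toy example: mean pitch in Hz for four made-up speakers.
pitch = {"spk1": 110.0, "spk2": 112.0, "spk3": 180.0, "spk4": 240.0}
most, least = select_extreme_pairs(pitch, lambda x, y: abs(x - y), 0.2)
```

Trials involving the pairs in `most` and `least` would then be scored separately and compared with performance on all pairs.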
0:05:14so the features
0:05:15we consider here uh first of all
0:05:17pitch
0:05:18statistics
0:05:19um the mean median
0:05:21range and mean average slope
0:05:24much we
0:05:25you know
0:05:25okay
0:05:26um
0:05:27jitter and shimmer the relative
0:05:30average perturbation version of jitter
0:05:32and a five point amplitude perturbation
0:05:34quotient version of shimmer
0:05:39uh formant frequency statistics
0:05:41the mean and median of the first three formants
0:05:43um
0:05:44all the data that we work with is at
0:05:46eight kilohertz
0:05:47so um
0:05:48although
0:05:48higher formants
0:05:49might be useful we we didn't calculate them here
0:05:55uh mean and median energy
0:06:00uh long term average
0:06:02spectrum energy statistics
0:06:03uh including the mean standard deviation
0:06:06range
0:06:07slope
0:06:07and local peak height
0:06:10um
0:06:11and we did a fourteenth order lpc analysis
0:06:14and uh found the frequencies
0:06:17from
0:06:17the coefficient roots
0:06:19uh both with and without a minimum magnitude requirement which is essentially limiting the bandwidth
0:06:26and uh then we took all the frequencies and
0:06:28put them in a
0:06:29histogram
0:06:32and finally we have B mode and median
0:06:35spectral
0:06:41so we have
0:06:42all these features well what measures do we use
0:06:45um for the scalar features which
0:06:47almost all of them are
0:06:49uh we simply took the absolute or percent difference
0:06:54um also we in addition to using the formant frequencies individually we looked at some of the formant frequencies
0:07:02and we also looked at vectors of
0:07:04formant frequencies and
0:07:05the euclidean distance between the vectors
0:07:08and finally for the histograms of frequencies
0:07:11uh
0:07:12we calculated the correlation
0:07:15as a measure
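The three kinds of measure mentioned here can be sketched as below. The talk does not say how the percent difference is normalised, so dividing by the mean of the two values is an assumption, and the function names are made up for illustration.

```python
import math

def percent_difference(x, y):
    """Absolute difference as a percentage of the mean of the two values.
    The normalisation by the mean is an assumption, not stated in the talk."""
    return 100.0 * abs(x - y) / ((abs(x) + abs(y)) / 2.0)

def euclidean_distance(u, v):
    """Euclidean distance between two equal-length vectors, e.g. (F1, F2, F3)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def pearson_correlation(h1, h2):
    """Pearson correlation between two histograms with the same bins."""
    n = len(h1)
    m1, m2 = sum(h1) / n, sum(h2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(h1, h2))
    var1 = sum((a - m1) ** 2 for a in h1)
    var2 = sum((b - m2) ** 2 for b in h2)
    return cov / math.sqrt(var1 * var2)

# Toy values: mean formant vectors in Hz for two made-up speakers.
d = euclidean_distance([500.0, 1500.0, 2500.0], [503.0, 1504.0, 2500.0])
```

Note the direction differs between measures: a small difference or distance means a similar pair, while a high histogram correlation means a similar pair.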
0:07:16so there's there's
0:07:17two different ways you can compute a single measure for a speaker pair
0:07:21uh the first is
0:07:23to take for every speaker
0:07:24take all their uh feature values over the
0:07:27conversation sides that are available
0:07:29and just
0:07:30get an average feature value over the conversations
0:07:33and then compute the measure between these average values for each speaker
0:07:38um the other approach is to take
0:07:40on a conversation by conversation basis between two speaker pairs
0:07:44for two speakers
0:07:45uh compute the distance measure first and then average over the conversation pairs
0:07:51and um
0:07:52in the results tables i just present whichever method
0:07:56gave
0:07:56better
0:07:59larger differences
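The two ways of getting a single number per speaker pair can be sketched like this, assuming each speaker has a list of per-conversation scalar feature values. The function names and the toy values are illustrative only.

```python
from itertools import product
from statistics import mean

def distance_of_averages(values_a, values_b, distance):
    """First approach: average each speaker's values, then compare the averages."""
    return distance(mean(values_a), mean(values_b))

def average_of_distances(values_a, values_b, distance):
    """Second approach: compare every conversation pair, then average the distances."""
    return mean(distance(x, y) for x, y in product(values_a, values_b))

def abs_diff(x, y):
    return abs(x - y)

# Made-up per-conversation mean pitch values (Hz) for two speakers.
spk_a = [100.0, 140.0]
spk_b = [120.0, 120.0]
first = distance_of_averages(spk_a, spk_b, abs_diff)
second = average_of_distances(spk_a, spk_b, abs_diff)
```

In this toy case the first approach gives 0 and the second gives 20, which shows that the two orderings of "average" and "compare" are not interchangeable when a speaker varies across conversations.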
0:08:05so the data that we use
0:08:07um
0:08:08for the feature measure calculation and
0:08:11speaker pair selection
0:08:12uh we use the two thousand eight nist
0:08:15followup evaluation data
0:08:17um so this is all interview data
0:08:20um
0:08:21which is recorded on microphones we limit it to just
0:08:24the lavalier microphone
0:08:26for quality purposes
0:08:28and uh just uh as a side note um almost all the speakers have four conversations available
0:08:33um there's a handful with
0:08:35three or five
0:08:36but
0:08:37um
0:08:38because this is
0:08:39that's multiple conversation
0:08:41one
0:08:42and then once we have the speaker pairs selected um
0:08:46we evaluate performance using the uh data from the
0:08:50nist two thousand eight evaluation short2-short3 condition
0:08:54uh so this
0:08:55data varies from the
0:08:57uh other did the pilot data in
0:08:59a couple respects
0:09:01um
0:09:02in addition to
0:09:02possibly being an interview
0:09:04uh it can also be
0:09:06speech from a telephone conversation
0:09:09and in addition to having uh the lavalier microphone
0:09:12channel there are other microphones
0:09:14and as well the telephone
0:09:22so um
0:09:23available we had uh
0:09:26submissions that were shared by participating sites
0:09:29and so thank you to everyone who shared their submissions
0:09:32um
0:09:33the short2-short3 condition originally had i think maybe around ninety thousand trials or so um
0:09:39i had to remove the trials that
0:09:41correspond to speakers that weren't in the selection data
0:09:44and that leaves you with
0:09:46about fifty five thousand trials
0:09:49and then furthermore when you just
0:09:51sub select
0:09:51and only keep trials corresponding to some percentage of speaker pairs
0:09:56uh you get around four thousand or eleven thousand trials left
0:10:00i know
0:10:02and
0:10:03we only keep target trials for speakers that show up in one of the
0:10:11so how do we evaluate the system performance
0:10:14um
0:10:15there are various metrics you can use what we look at here are the uh
0:10:19minimum detection cost function
0:10:21and
0:10:21which of course is the
0:10:23a weighted
0:10:24sum of uh miss and false alarm rates
0:10:26with relative weights
0:10:27for errors
0:10:29uh this is all done with the um
0:10:31two thousand eight cost
0:10:32so it's not the low false alarm like the two thousand ten evaluation
0:10:37and then since we're looking at
0:10:38impostor speaker pairs uh we look at
0:10:41the false alarm rate which of course is
0:10:43simply
0:10:44for a given decision threshold the number of false alarm errors
0:10:49that occur out of the
0:10:50total number of possible
0:10:52nontarget trials
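As a rough sketch of the two metrics: the detection cost is a weighted sum of the miss and false alarm rates, and the minimum is taken over decision thresholds. The default weights below (miss cost 10, false alarm cost 1, target prior 0.01) are the standard NIST 2008 evaluation parameters; the function names, the unnormalised form, and the toy scores are assumptions for illustration.

```python
def error_rates(target_scores, nontarget_scores, threshold):
    """Miss and false alarm rates when trials with score >= threshold are accepted."""
    misses = sum(s < threshold for s in target_scores)
    false_alarms = sum(s >= threshold for s in nontarget_scores)
    return misses / len(target_scores), false_alarms / len(nontarget_scores)

def min_dcf(target_scores, nontarget_scores, c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Minimum (unnormalised) detection cost over all candidate thresholds."""
    candidates = sorted(set(target_scores) | set(nontarget_scores))
    candidates.append(candidates[-1] + 1.0)  # threshold that rejects every trial
    best = float("inf")
    for t in candidates:
        p_miss, p_fa = error_rates(target_scores, nontarget_scores, t)
        cost = c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa
        best = min(best, cost)
    return best

# Made-up scores for one system.
targets = [2.0, 3.0, 4.0, 5.0]
nontargets = [0.0, 1.0, 2.5, 3.5]
rates_at_3 = error_rates(targets, nontargets, 3.0)
cost = min_dcf(targets, nontargets)
```

With a target prior this low, false alarms dominate the cost, so in the toy example the best threshold is the one that removes every false alarm even though it misses half the targets.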
0:10:56so
0:10:58for every system submission that we have
0:11:01we first
0:11:01uh
0:11:03just looking at the trials for the most
0:11:05or least similar speaker pairs
0:11:07we compute the change in dcf relative to
0:11:10what it is for all speaker pairs
0:11:13and then uh
0:11:15take all these
0:11:16um
0:11:17system
0:11:18differences and average over the systems
0:11:20so the results are just
0:11:22to
0:11:23a typical overall try
0:11:24and
0:11:25from the system
0:11:26um
0:11:27and then with the false alarm rate we
0:11:29uh for all speakers we look at a decision threshold that
0:11:33uh generates a false alarm rate of one
0:11:35percent
0:11:35and then at the same decision threshold um
0:11:39see what the false alarm rate is for the most and least
0:11:41similar speaker pairs
0:11:44and of course if we're actually taking uh if more similar speaker pairs
0:11:48actually corresponds to
0:11:50more difficult to distinguish speaker pairs
0:11:52then we expect these changes
0:11:54in the dcf
0:11:55and
0:11:56false alarm rate to be
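The comparison procedure above can be sketched as follows: fix the threshold on all impostor trials, reuse it on the selected subset, and average the relative change of a metric over systems. The helper names and the toy scores are hypothetical, and accepting scores strictly above the threshold is a simplifying assumption.

```python
from statistics import mean

def threshold_for_fa_rate(nontarget_scores, fa_rate):
    """Threshold t such that accepting scores > t gives at most the
    requested false alarm rate on the given nontarget scores."""
    ranked = sorted(nontarget_scores, reverse=True)
    allowed = int(fa_rate * len(ranked) + 1e-9)  # number of false alarms allowed
    allowed = min(allowed, len(ranked) - 1)
    return ranked[allowed]

def fa_rate_at(nontarget_scores, threshold):
    """False alarm rate at a fixed threshold (scores > threshold are accepted)."""
    return sum(s > threshold for s in nontarget_scores) / len(nontarget_scores)

def mean_relative_change(overall_values, subset_values):
    """Percent change of a metric (subset versus all pairs), averaged over systems."""
    return mean(100.0 * (sub - ref) / ref
                for ref, sub in zip(overall_values, subset_values))

# Made-up impostor scores for one system: all pairs, and a "most similar" subset.
all_scores = [float(i) for i in range(100)]
subset_scores = [99.0, 98.0, 60.0, 10.0]
t = threshold_for_fa_rate(all_scores, 0.01)
fa_all = fa_rate_at(all_scores, t)
fa_subset = fa_rate_at(subset_scores, t)

# Made-up minimum DCF values for two systems: all pairs, then the subset.
avg_change = mean_relative_change([0.05, 0.04], [0.06, 0.05])
```

A positive average change on the most similar pairs, and a negative one on the least similar pairs, is the pattern one would hope to see if the similarity measure tracks difficulty.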
0:12:02so
0:12:03here are the results uh when you look at
0:12:05one
0:12:05percent of speaker pairs
0:12:07uh in each case
0:12:08the
0:12:09top row
0:12:10corresponds
0:12:11to uh
0:12:12the
0:12:13least similar speaker pairs
0:12:15so
0:12:16performance is improving
0:12:18and this row
0:12:19um
0:12:20corresponds to most
0:12:22similar speaker pairs so that
0:12:24the
0:12:24performance is getting worse
0:12:26um
0:12:27so we notice that we are able to find uh
0:12:30features and and measures that
0:12:33will select
0:12:34uh
0:12:35speaker pairs
0:12:36with the desired
0:12:37um
0:12:37tendency
0:12:39um
0:12:41if we then
0:12:42compare uh performance on one percent
0:12:45to performance on five percent you can see that it's less
0:12:49pronounced when you include
0:12:50more speaker pairs
0:12:52um
0:12:53in some cases you
0:12:55you have these negative
0:12:56or
0:12:57opposite trends from what you'd
0:12:58expect
0:13:04oh i i
0:13:05i pretty much mentioned all these points the only thing to note is that
0:13:08um
0:13:09changes in performance are not uniform
0:13:12across site submissions
0:13:13so that is
0:13:14uh
0:13:14one issue
0:13:20um okay so here's a det curve for one system um when we use
0:13:25uh the euclidean distance between
0:13:28uh vectors of the first three formants
0:13:31uh to select
0:13:32speaker pairs
0:13:33um
0:13:34the solid lines
0:13:36correspond to uh the most similar speaker pairs being used
0:13:40and the dashed lines are least similar
0:13:43uh red is one percent and green is five percent
0:13:47and the black line is
0:13:48the
0:13:49uh case for all speaker pairs
0:13:52um
0:13:54what we notice is
0:13:57that
0:13:58uh in this particular instance
0:14:00um
0:14:00there's
0:14:02a bigger difference
0:14:03uh when looking at
0:14:04the
0:14:05uh
0:14:06least
0:14:06similar
0:14:07speaker pairs
0:14:08and the most similar speaker pairs
0:14:11uh are much closer to
0:14:12performance over all speaker pairs
0:14:15um
0:14:16although that doesn't happen all of the time it is
0:14:19uh
0:14:20certainly the general tendency
0:14:22uh
0:14:22to have this larger
0:14:23larger gap in this direction
0:14:29and
0:14:30here's another
0:14:31example that
0:14:31shows
0:14:33that it doesn't always hold
0:14:34um
0:14:36and this is
0:14:36uh a different system and a different feature measure that's the percent difference of median energy in this case
0:14:43and um
0:14:48you you get better separation here
0:14:50uh it there is and how much separation there is
0:14:53and
0:14:56across
0:14:58so
0:14:58we've been able to do some stuff but we expect we could probably do even better
0:15:03if we use uh more knowledge of
0:15:05speaker recognition systems
0:15:06so
0:15:08um
0:15:09we decided to just
0:15:10simply use gmms
0:15:12since uh they show up obviously in a lot of
0:15:15um systems
0:15:16um so we adapted uh
0:15:19speaker specific gmms
0:15:21um and then calculated the
0:15:24uh
0:15:25KL divergence
0:15:26between them as the measure of speaker similarity
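The KL divergence between two Gaussian mixture models has no closed form, and the talk does not say which approximation was used. One commonly used option, sketched here purely as an assumption, applies when both speaker models are adapted from the same background model with only the means changed: the divergence is then bounded above by a weighted sum of Mahalanobis distances between corresponding component means. The function name and the diagonal-covariance layout are illustrative.

```python
def kl_upper_bound(weights, variances, means_a, means_b):
    """Upper bound on the KL divergence between two mean-adapted GMMs.

    Assumes both models share the mixture weights and diagonal covariances
    of a common background model and differ only in their component means.

    weights:   list of K mixture weights
    variances: list of K lists of per-dimension variances
    means_a, means_b: lists of K mean vectors, one per component
    """
    total = 0.0
    for w, var, mu_a, mu_b in zip(weights, variances, means_a, means_b):
        # Squared Mahalanobis distance between the two means of this component.
        maha = sum((a - b) ** 2 / v for a, b, v in zip(mu_a, mu_b, var))
        total += 0.5 * w * maha
    return total

# Toy two-component, two-dimensional models.
weights = [0.5, 0.5]
variances = [[1.0, 4.0], [1.0, 1.0]]
speaker_a = [[0.0, 0.0], [1.0, 1.0]]
speaker_b = [[1.0, 2.0], [1.0, 1.0]]
bound = kl_upper_bound(weights, variances, speaker_a, speaker_b)
```

A larger value means the two speaker models are further apart, so pairs with the smallest values would be treated as the most similar.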
0:15:30when we do that
0:15:31not surprisingly we get uh
0:15:33better results
0:15:35um
0:15:36the previous
0:15:37charts all had a scale from negative fifty percent to fifty percent
0:15:40so you can see already that
0:15:42there are larger differences
0:15:44which is as i said what we would expect
0:15:49um
0:15:50here's
0:15:51a det curve for a system using the KL divergence
0:15:55um
0:15:57again you can see that these are larger differences from the all performance
0:16:02and we again
0:16:04uh see this asymmetry where
0:16:07uh
0:16:09a bigger gap
0:16:10for the dissimilar pairs than for the similar
0:16:20so as i mentioned uh we
0:16:23tend to be more successful at selecting easy to distinguish speaker pairs
0:16:27uh
0:16:28and possibly because these pairs may be easier to
0:16:31find
0:16:32um
0:16:32one possible explanation would be that
0:16:35if you have a a speaker pair
0:16:37that is very dissimilar
0:16:39in um
0:16:40terms of pitch or formant frequencies
0:16:42that
0:16:43you know a big difference is probably going to mean that
0:16:46the system is not going to
0:16:47confuse them
0:16:48um but on the flip side if you're trying to figure out what makes the speaker pair
0:16:52difficult um just like you know any single feature may not be enough
0:16:56to capture
0:16:58um
0:16:59that information
0:17:06so using the KL divergence measure
0:17:10we took a closer look at
0:17:11uh the speaker pairs that are selected
0:17:14and
0:17:15as being the most or least similar so in addition to like you know the one percent and five
0:17:19percent groups
0:17:20uh also looked at
0:17:21three percent ten percent twenty percent
0:17:24and um in this data there were a hundred fifty speakers overall
0:17:29uh
0:17:29leading to uh uh eighteen hundred
0:17:32unique speaker pairs
0:17:34for same sex
0:17:35first
0:17:38and uh one thing we noted
0:17:40is that
0:17:41in
0:17:42in the groups of the
0:17:44uh least similar speaker pairs
0:17:47if you look at a group
0:17:49with the larger values
0:17:50of the divergence
0:17:52um
0:17:53which we would expect
0:17:54to be easier to distinguish
0:17:56the majority of them are male
0:17:57um
0:17:58but
0:17:59if you look at any one group about seventy five percent of
0:18:01speaker pairs in the group will be
0:18:03uh male
0:18:04on average
0:18:10uh to a lesser extent we notice the opposite tendency when we look at
0:18:14uh
0:18:16more similar speaker pairs which
0:18:18uh somewhat tend to be more female
0:18:21um
0:18:21the
0:18:23one
0:18:24and
0:18:24three percent
0:18:25still have more male pairs
0:18:26but
0:18:28the other groups
0:18:29have more female
0:18:34um
0:18:35this
0:18:36you know may be part of the reason why uh
0:18:38system performance
0:18:39is typically better
0:18:40on males
0:18:41um
0:18:42and it suggests that you know males may may
0:18:45exhibit a greater range of differences
0:18:48between them
0:18:49so that
0:18:50there are likely to be more
0:18:51dissimilar
0:18:52male speaker pairs
0:18:59and finally so looking at these groups
0:19:02um we notice that there is a tendency to find two types of
0:19:05speakers
0:19:06uh there are speakers who frequently appear as members of difficult to distinguish
0:19:11uh
0:19:11speaker pairs
0:19:12and speakers who occur frequently as members of
0:19:15easy to distinguish speaker pairs
0:19:18um
0:19:20in fact there are fifteen speakers who never appear in the most
0:19:23similar group
0:19:25and twenty four speakers who never appear in the most
0:19:28dissimilar group
0:19:30um
0:19:32i forgot
0:19:32but i think this is
0:19:33these twenty four speakers there's ten male and fourteen female
0:19:40so this
0:19:41uh tends to support the idea that there are these wolves and lambs
0:19:45uh
0:19:45speakers who are
0:19:47are more difficult
0:19:48um
0:19:48or more similar to other speakers
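The per-speaker analysis described here amounts to counting how often each speaker shows up in a selected group of pairs. A minimal sketch, with hypothetical function names and made-up speaker labels:

```python
from collections import Counter

def appearance_counts(pairs):
    """Count how many selected pairs each speaker belongs to."""
    counts = Counter()
    for a, b in pairs:
        counts[a] += 1
        counts[b] += 1
    return counts

def never_appearing(all_speakers, pairs):
    """Speakers that are not a member of any selected pair."""
    seen = {speaker for pair in pairs for speaker in pair}
    return sorted(set(all_speakers) - seen)

# Toy "most similar" group over five made-up speakers.
speakers = ["s1", "s2", "s3", "s4", "s5"]
most_similar = [("s1", "s2"), ("s1", "s3"), ("s1", "s4")]
counts = appearance_counts(most_similar)
absent = never_appearing(speakers, most_similar)
```

Here `s1` dominates the similar group, which is the kind of speaker the talk describes as frequently hard to tell apart from others, while `s5` never appears in it.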
0:19:55so just a summary of what i mentioned
0:19:57uh first of all it is possible to
0:19:59predict
0:20:00uh which speaker pairs will be difficult for a
0:20:03typical speaker recognition system
0:20:04to distinguish
0:20:06um
0:20:07for the features
0:20:08that we considered here like pitch
0:20:10formant frequencies the best ones seem to be uh the the euclidean distance
0:20:14between the first
0:20:15uh three formant frequencies
0:20:18um but the best measure overall was the more uh complex
0:20:21uh
0:20:22KL divergence measure between
0:20:24uh
0:20:25speaker specific
0:20:25gmms
0:20:27um i mentioned of course that we're typically more successful at identifying dissimilar speaker pairs
0:20:33and that in addition to
0:20:35to being able to um
0:20:37you know finding speaker pairs
0:20:39uh
0:20:40using these measures can provide potentially useful information
0:20:43about a speaker's tendency to be
0:20:45similar or dissimilar to others
0:20:52so future work
0:20:53um
0:20:55one thing to try is testing combinations of multiple feature measures
0:21:00as the method for selecting similar speaker pairs
0:21:03um i did a little bit of work on this
0:21:06uh where i just
0:21:07basically
0:21:08assigned a rank
0:21:09according to each
0:21:10uh feature measure and then summed the ranks
0:21:12over the speaker pairs and and
0:21:14did selection that way
0:21:16and and that improved
0:21:18results
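The rank combination mentioned here can be sketched as below: rank the pairs separately under each feature measure, sum the ranks per pair, and select on the summed rank. The data layout, the function name `rank_sum`, and the toy distances are assumptions for illustration, and ties are broken arbitrarily by sort order.

```python
def rank_sum(distances_by_measure):
    """Sum each speaker pair's rank across several distance measures.

    distances_by_measure: dict mapping measure name -> dict mapping
    speaker pair -> distance (smaller means more similar).
    Returns a dict mapping speaker pair -> summed rank (0 is most similar).
    """
    totals = {}
    for scores in distances_by_measure.values():
        ordered = sorted(scores, key=scores.get)  # most similar pair first
        for rank, pair in enumerate(ordered):
            totals[pair] = totals.get(pair, 0) + rank
    return totals

# Made-up distances for three pairs under two measures.
measures = {
    "pitch": {("a", "b"): 2.0, ("a", "c"): 70.0, ("b", "c"): 68.0},
    "formants": {("a", "b"): 5.0, ("a", "c"): 300.0, ("b", "c"): 100.0},
}
totals = rank_sum(measures)
most_similar_pair = min(totals, key=totals.get)
```

Using ranks rather than raw values sidesteps the fact that the measures are on different scales, at the cost of discarding how large the differences are.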
0:21:19uh another extension is to um instead of focusing on impostor speaker pairs
0:21:24see if you can find uh figure out what
0:21:27target speakers will be difficult
0:21:28uh for the system to correctly recognise
0:21:32and one thing that um certainly needs to be investigated is
0:21:36uh
0:21:37the lack
0:21:38of consistency uh
0:21:39in behaviour for the same set of speakers across
0:21:42um
0:21:43different
0:21:44uh
0:21:45systems
0:21:46um
0:21:46we may be able to find potential trends
0:21:49in behaviour across classes or types of systems
0:21:52of course with
0:21:53uh
0:21:54the
0:21:55site submissions that we used here uh
0:21:59almost all of the submissions are in fact
0:22:01fusions of multiple systems
0:22:03so might need to do more of a breakdown
0:22:05uh to
0:22:07um
0:22:08really get at
0:22:09that
0:22:10sure
0:22:13okay
0:22:14that's all i have thank you
0:22:16hmmm
0:22:24sh
0:22:38thank you larry
0:22:39presentation
0:22:41i have a question about your
0:22:43formant extraction
0:22:44uh
0:22:45do you have
0:22:46don
0:22:47nation
0:22:48for all the old volumes
0:22:50or
0:22:51did you controls that's your extraction
0:22:53uh
0:22:54to the east
0:22:55you you you
0:22:56does volume
0:22:57or
0:22:58four
0:22:59different
0:22:59the
0:23:00you do
0:23:01one type of volume
0:23:03because you know
0:23:04it's
0:23:04my question is
0:23:06it is uh
0:23:07this
0:23:07use the volume
0:23:09or you you do
0:23:11um extraction according to the volume
0:23:14no so uh we didn't it was just over the entire file so it's it's definitely you could probably
0:23:19get much better estimates
0:23:21and what we what we actually did
0:23:23because
0:23:24the the the problem is that
0:23:26uh
0:23:27uh you have a lot of disturbance according to the volume for now
0:23:30so
0:23:31uh but i think that
0:23:33uh
0:23:34i think that it is more the sample
0:23:36and
0:23:36no
0:23:37phonological information
0:23:39that's value
0:23:40yeah
0:23:40the speaker information
0:23:42so of course uh yeah of course this is
0:23:44convolving the the phonetic the
0:23:46phonetic
0:23:47with
0:23:47with the speaker
0:23:48okay thank you
0:23:55oh
0:23:57oh
0:23:58oh
0:23:59hmmm
0:24:01oh
0:24:05oh
0:24:06oh
0:24:07oh
0:24:08two
0:24:11what
0:24:12oh
0:24:13hmmm
0:24:14oh
0:24:16oh
0:24:17oh
0:24:18oh
0:24:19oh
0:24:20i
0:24:21pairs of what
0:24:22oh
0:24:23extracts or just
0:24:25oh
0:24:26uh_huh
0:24:28oh
0:24:29oh
0:24:31oh
0:24:32oh
0:24:33sure
0:24:36oh
0:24:37fig
0:24:39oh
0:24:39two
0:24:40oh
0:24:41oh
0:24:42oh
0:24:42hmmm
0:24:43four
0:24:44oh
0:24:45oh
0:24:47or
0:24:48oh
0:24:52oh
0:24:54sure
0:24:56sure
0:24:59thanks
0:25:04hmmm
0:25:06no
0:25:08we will
0:25:08we should
0:25:09uh
0:25:11four
0:25:12the
0:25:12the
0:25:13the
0:25:14yeah
0:25:16uh
0:25:17you
0:25:19uh
0:25:19uh
0:25:22uh
0:25:24right
0:25:25uh
0:25:25good
0:25:27oh
0:25:28i'm sorry
0:25:30are you talking about
0:25:32i
0:25:33figure
0:25:33paper
0:25:34and i think you mean
0:25:36uh
0:25:38oh
0:25:40cool
0:25:42yeah
0:25:44yeah i know but it was definitely the case but it was
0:25:48oh
0:25:49right
0:25:52hmmm
0:25:53hmmm
0:25:58oh