0:00:07 Right, thanks. My name's Kornel, and this is joint work with my colleague, who is sitting up there; she's happy to take all of your questions afterwards.
0:00:18 Before we begin, I just want to lay down the definitions that I'm going to be using. This is my first time at this meeting, so I may be saying things very wrong, and I apologise for that in advance. I conceive of all the features that you could compute from speech as falling into these four areas, where on the left-hand side I consider coarse spectral features, and on the right-hand side fine spectral features; the top two panels contain things that characterise a single frame of speech, whereas at the bottom are things that characterise the trajectory, that model things across frames.
0:00:53 So all the features that you are probably familiar with can be placed in this space, and prosodic features tend to be those that either model the fine structure in the spectrum, or that model long-term dependencies, at the bottom. But we in this paper are going to look only at the so-called instantaneous prosodic features, namely those that characterise a single frame, and in particular we are looking at pitch.
0:01:19 Okay, so pitch is estimated using a pitch detector, which would ideally produce a single best estimate for pitch, but the signal is usually so noisy that a pitch detector is typically expected to produce an n-best list of estimates per frame, and then a dynamic programming approach is used to collapse that to a single best estimate per frame. I'm going to refer to these two components together as pitch estimation.
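As a rough sketch of that second component, here is a minimal dynamic-programming pass over per-frame n-best pitch candidates; the function name, the log-frequency transition cost, and all parameters are my own illustration, not anything specified in the talk:

```python
import numpy as np

def smooth_pitch_track(candidates, scores, transition_weight=1.0):
    """Pick one pitch candidate per frame via dynamic programming.

    candidates : list of per-frame sequences of candidate F0s (Hz).
    scores     : list of per-frame local goodness values (higher = better).
    A transition cost penalises large jumps in log-F0 between frames.
    """
    n_frames = len(candidates)
    # cost[t][k]: best cumulative cost ending at candidate k of frame t
    cost = [-(np.asarray(scores[0], dtype=float))]
    back = []
    for t in range(1, n_frames):
        prev_f0 = np.asarray(candidates[t - 1], dtype=float)
        cur_f0 = np.asarray(candidates[t], dtype=float)
        # pairwise jump penalty |log f0_cur - log f0_prev|, shape (prev, cur)
        jump = np.abs(np.log(cur_f0)[None, :] - np.log(prev_f0)[:, None])
        total = cost[-1][:, None] + transition_weight * jump
        best_prev = np.argmin(total, axis=0)
        cost.append(total[best_prev, np.arange(len(cur_f0))]
                    - np.asarray(scores[t], dtype=float))
        back.append(best_prev)
    # backtrace the single best path
    k = int(np.argmin(cost[-1]))
    path = [k]
    for bp in reversed(back):
        k = int(bp[k])
        path.append(k)
    path.reverse()
    return [float(candidates[t][path[t]]) for t in range(n_frames)]
```

Note how the long-term constraint can override a frame's locally best candidate when it would create an implausible pitch jump.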
0:01:42 The best estimate per frame that comes out of this can be linearly or nonlinearly smoothed, can be normalised based on proximity to some kind of landmark, and then different kinds of features can be extracted from it. Now, these things at the bottom here I assume are what you would call high-level feature computation, or high-level features. In this talk, and I hope I'm not disappointing anyone, we're actually going to be looking at this point, which is as low-level as it gets in this session; we're going to claim that these features are as low-level as MFCCs.
0:02:17 Okay, so if we look at this right box a little more closely: typically pitch estimation, or pitch detection, is a two-step process, where the source domain we start from is an FFT. The first step is the computation of what I'm going to be calling a transform domain, and there are lots of alternatives here; let's say this is the autocorrelation spectrum. The second step is simply finding the argmax. A lot of effort has gone into this process, and typically the effort is spent only on this first step, because the second step is so elementary that nobody really questions it: most of the work on improving pitch detection has gone into making sure that this transform is shaped such that the argmax is optimal, or most robust, for whatever we consider.
0:03:08 What we're going to claim in this work is that you should just throw away this whole second step, and that you should model the entire transform domain; that's what this talk is about. There are four parts to this talk: I'll first describe what I'm calling the harmonic structure transform, then present some experiments and some additional analysis, and I will conclude with three slides.
0:03:36 Okay, so the particular pitch detection algorithm that we're going to look at was proposed by Schroeder in 1968, and it involves producing a new spectrum, the sigma spectrum, where at each frequency we have the sum of the energy at all the frequencies that are integer multiples of that candidate fundamental frequency in the original FFT.
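In code, a minimal version of this harmonic summation might look as follows; the parameter names and the rfft-style bin-spacing assumption are mine, not Schroeder's:

```python
import numpy as np

def sigma_spectrum(power_spec, sample_rate, f0_grid):
    """Schroeder-style harmonic sum: for each candidate F0, sum the
    power found at its integer multiples in an FFT power spectrum."""
    n_bins = len(power_spec)
    bin_hz = sample_rate / (2.0 * (n_bins - 1))  # assumes rfft-style bins
    sigma = np.zeros(len(f0_grid))
    for i, f0 in enumerate(f0_grid):
        harmonics = np.arange(f0, sample_rate / 2.0, f0)  # f0, 2*f0, ...
        bins = np.round(harmonics / bin_hz).astype(int)
        sigma[i] = power_spec[np.clip(bins, 0, n_bins - 1)].sum()
    return sigma
```

One classic weakness of the plain sum is that subharmonics of the true F0 collect the same harmonic peaks, which is part of what motivates subtracting off-harmonic energy later in the talk.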
0:03:58 Very quickly after he proposed this came harmonic compression, which is a distinctly nonlinear operation. I want to demonstrate it over here on the right: basically, what ends up happening is that the spectrum is compressed, conceptually, by integer factors and then summed. The problem with harmonic compression is that it led people to actually look for implementations of this algorithm in exactly this way, so first compress and then sum, and it turns out that this occupied people for about twenty years, a lot of last century.
0:04:34 A much better way to do this is to not do any compression at all, but to comb filter: you just add up whatever is at whatever frequency you want, without first having to compress it towards the harmonic frequency, or fundamental frequency, that you're interested in. When you do this there are of course no compression difficulties, and comb filtering is linear. We in this work are going to be defining all of our filters over the range of three hundred hertz to eight thousand hertz.
0:05:05 If you have lots of such comb filters, you have a filter bank, and in this work we're going to have nominally four hundred filters in this filterbank, which range from fifty to four hundred fifty hertz, spaced one hertz apart. This is in continuous frequency; of course we want a discrete-frequency filter, because we have discrete FFTs.
0:05:29 There are lots of ways to do this, and I always like citing the work by colleagues from Lindsay, because this is actually work that influenced me, but it's probably not the first such work. What we're going to do in this work is a little bit different: we're going to say that each tooth of the comb is triangular, and then we're going to simply Riemann-sample it, such that the discrete comb filters in the filterbank actually end up looking like this; as you can see, it doesn't look harmonic at all.
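A toy construction of such a discretised comb filterbank matrix might look like this; the triangular teeth are sampled at the FFT bin frequencies, but the half-width rule and the max-accumulation of overlapping teeth are my assumptions, not the talk's exact Riemann sampling:

```python
import numpy as np

def comb_filterbank(fft_freqs, f0_lo=50.0, f0_hi=450.0, f0_step=1.0,
                    band=(300.0, 8000.0), tooth_halfwidth=None):
    """Build a matrix H whose columns are comb filters with triangular
    teeth at integer multiples of each candidate F0, sampled at the
    discrete FFT bin frequencies.  Bins outside `band` are zeroed."""
    f0s = np.arange(f0_lo, f0_hi + 0.5 * f0_step, f0_step)
    H = np.zeros((len(fft_freqs), len(f0s)))
    for j, f0 in enumerate(f0s):
        hw = tooth_halfwidth if tooth_halfwidth is not None else f0 / 2.0
        harmonics = np.arange(f0, band[1] + hw, f0)
        for h in harmonics:
            # triangular tooth centred on harmonic h, half-width hw
            tooth = np.clip(1.0 - np.abs(fft_freqs - h) / hw, 0.0, None)
            H[:, j] = np.maximum(H[:, j], tooth)
        H[(fft_freqs < band[0]) | (fft_freqs > band[1]), j] = 0.0
    return f0s, H
```

With teeth this wide, a single column sampled on a coarse FFT grid indeed loses any obvious harmonic appearance, which is the point being made on the slide.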
0:06:03 So what do you do with this now? If you have a set of such discrete comb filters, then they actually implement a filter bank that has a matrix representation H, and it's very simple to use: you just do a matrix multiplication with the FFT that you have in hand. We're also going to take the logarithm of the output of that filter bank, the same way that's done for the mel-frequency filterbank.
0:06:27 Finally, from the energy that is found at the integer multiples of a specific candidate fundamental frequency, we're going to subtract the energy found everywhere else. To do that we're going to form this complementary transform, H-tilde. I can demonstrate it over here: this is the column vector for a particular comb filter, and we just form its unity complement, which gives us this here, the corresponding column vector of H-tilde. What this implements, of course, is a frame-wise form of the harmonics-to-noise ratio, which is known to correlate with hoarseness, or roughness of voicing, typically in pathological speech. Typically what's done is that the harmonics-to-noise ratio is computed only at the fundamental frequency, once that is known; what we're doing is computing it for all possible candidate fundamental frequencies, and then using that vector as a feature vector.
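Given a filterbank matrix H, the per-frame vector of candidate harmonics-to-noise ratios can be sketched as follows; the small eps floor is my addition to keep the logarithms finite, not something from the talk:

```python
import numpy as np

def hnr_vector(power_spec, H, eps=1e-10):
    """Frame-wise HNR-like vector: for every candidate F0 (column of H),
    log energy passed by the comb filter minus log energy passed by its
    unity complement H_tilde = 1 - H."""
    H_tilde = 1.0 - H
    harm = H.T @ power_spec      # energy at the candidate's harmonics
    noise = H_tilde.T @ power_spec  # energy everywhere else
    return np.log(harm + eps) - np.log(noise + eps)
```

A candidate whose harmonics line up with the spectral peaks gets a large value; all other candidates get small or negative values, so the whole vector carries far more than its argmax.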
0:07:34 Okay, so the elements of this vector are still correlated, and we decorrelate them in the way that anybody else would: we subtract the global mean, we form a decorrelation matrix, and then, after applying that matrix, we truncate away all those dimensions that do not have a positive eigenvalue. We're going to call the output of this harmonic structure cepstral coefficients, for lack of a better term, and it is simply a decorrelation of the logarithm of the output of the filter bank, minus a normalisation term, which is our H-tilde here. We actually explore two different options for this decorrelation, PCA and LDA, which you probably know more about than I do.
0:08:13i've claim that this is at the level of
0:08:15mfccs but
0:08:16i would like to try to convince you hear that it's nearly identical from a functional point of view
0:08:20um
0:08:21the mel filterbank
0:08:22kind of
0:08:23also be implemented as a matrix
0:08:25and so if it's if that's and
0:08:27you can see that at least
0:08:28inside here
0:08:29is approximately the same it's a matrix multiplication of the lombard
0:08:33the decorrelating transforms of course
0:08:35different
0:08:36and
0:08:37sort of
0:08:37important and in our case unfortunate that
0:08:40article or decorrelating matrix is data
0:08:42pendant where is the mfcc one is not but
0:08:45um
0:08:46to compare
0:08:47here 'em in H
0:08:49these are
0:08:49essentially the columns of and
0:08:51that is to say
0:08:53they
0:08:53smear energy across frequencies that are related by
0:08:56jason
0:08:57see
0:08:57where is the columns H the matrix that we're proposing here
0:09:01smear energy across frequencies that are related by harmonicity
0:09:06 I also want to say that this is a fairly direct follow-on from our previous work using a representation called fundamental frequency variation, which models the instantaneous change in fundamental frequency without actually computing the fundamental frequency. What we're doing here in the current work is: we take a frame of speech, we take its FFT, and then we take a bunch of idealised FFTs, which are the comb filters, the columns of capital H; we form the dot product of the frame that we are currently looking at with every one of these, and the locus of these dot products of course defines a trajectory, which is a function of the filter index, which corresponds to candidate fundamental frequency.
0:09:51 In contrast, in the FFV work that we've done before, we take two frames: we take the current frame, the same as here, but we also take the previous frame, and we dilate the previous frame by a range of logarithmic factors; then again we take the dot product of the dilated previous frame with the current frame, and the locus of those dot products gives us another trajectory, which is also a function of an index, where the index here is the logarithmic dilation factor. So the location of the peak there nominally expresses the fundamental frequency in hertz, and the location of the peak here expresses the rate of change of fundamental frequency, in octaves per second.
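A minimal numerical sketch of that FFV idea follows, with `np.interp` standing in for whatever dilation filtering the real implementation uses; all names and parameters here are illustrative, not from the talk:

```python
import numpy as np

def ffv_curve(prev_mag, cur_mag, freqs, rhos):
    """Fundamental-frequency-variation sketch: dilate the previous
    frame's magnitude spectrum by factors 2**rho (rho in octaves) and
    dot each dilated copy with the current frame's spectrum.  The peak
    over rho indicates the frame-to-frame pitch change."""
    curve = np.zeros(len(rhos))
    for i, rho in enumerate(rhos):
        factor = 2.0 ** rho
        # resample prev_mag onto a frequency axis stretched by `factor`
        dilated = np.interp(freqs, freqs * factor, prev_mag,
                            left=0.0, right=0.0)
        curve[i] = float(np.dot(dilated, cur_mag))
    return curve
```

When the pitch rises between two frames, the best-matching dilation factor, not any explicit F0 estimate, encodes the rate of change.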
0:10:39 Okay, so now I'm going to describe the experiments that we did to see whether this makes any sense at all. The data that we use is Wall Street Journal data, mostly coming from one portion of this corpus. We have one hundred and two female speakers and ninety-five male speakers, and we perform closed-set classification for each gender separately. We used ten-second trials, and we had enough data to have five minutes of training data, plus development and test data of three minutes each, corresponding to approximately fourteen hundred to eighteen hundred trials of ten seconds apiece. All of the data comes from a single microphone, and it's what we're calling matched multi-session, which means that for the majority of speakers, the data in the training set, the development set, and the test set is drawn from all sessions that are available for that speaker.
0:11:32 Something that is not in the paper, because we did it afterwards, but that I thought you might appreciate, is that we built a system based just on pitch, and we extract pitch using a standard sound-processing tool in its default settings. The comparison isn't quite fair, because any current pitch tracker actually uses dynamic programming, so this pitch-based system is actually using long-term constraints, whereas our system does not, because it treats frames independently. We ignore unvoiced frames, and we normalise the pitch of voiced frames. What we see is that this system achieves accuracies of approximately eighteen percent and approximately twenty-seven percent, for females and males respectively.
0:12:17 I get the feeling that my microphone is louder at some times and quieter at others; is that true? Does it bother anyone? Okay, sorry.
0:12:27 Okay, so the system that we're proposing here, to explore this idea of modelling the entire transform-domain signal: we don't perform any preemphasis, partly because we are not using frequencies below three hundred hertz, and since we throw those away there's no DC component, so we decided not to bother with it. We have seventy-five percent frame overlap with thirty-two-millisecond frames, and we use the Hann window instead of the Hamming window that is in ubiquitous use. The number of dimensions and the number of Gaussians in the gender-dependent models are things I still need to discuss, in the coming slides. We don't use a universal background model, and we don't use any speech activity detection.
0:13:08 In optimising the number of dimensions, what we have done is create the most laconic model you can invent, a single Gaussian with diagonal covariance; we train our PCA and LDA transforms on the training set, and we select the number of dimensions which maximises accuracy on the development set. So we can see here that for the PCA transform we achieve an accuracy of about forty percent for the first principal components, and an accuracy of about eighty-five percent for the first LDA components, for females, and slightly better for males, but in approximately the same ballpark. The lighter colours in these two plots represent longer trial durations; we decided not to do the sixty-second and thirty-second trials that we started out with, because the numbers were too high and it was difficult to compare.
0:14:04 This table summarises the performance of the HSCC system that I just described, once the number of Gaussians has been set to optimise dev-set accuracy; that number happened to be two hundred fifty-six in our experiments. What we see here is that if you take the representation that a pitch tracker is exposed to, and you spend the time looking for the argmax in it, you achieve the eighteen and twenty-seven percent we saw earlier; but if you don't bother doing that, and instead throw everything that that representation has into the model, then you achieve almost a hundred percent. So the claim here, based on these experiments, is that there is speaker-discriminative information beyond the argmax in these representation vectors, and of course discarding it yields performance that is not really comparable. Spending time improving argmax estimation appears unnecessary, and of course argmax estimation here is pitch estimation.
0:15:03 Okay, so we also constructed a contrastive MFCC system, which is not really standard in the way that you would probably build one, but we tried to retain as many similarities with the very simple HSCC system as we could. We did apply preemphasis and a Hamming window, because that just happens to be the standard front-end feature processing in our ASR systems. We retain twenty of the lowest-order MFCCs, but we also don't use a universal background model or any speech activity detection, so in this respect the two systems are most comparable. What we see when we compare these two systems is that essentially in every case, at least for this data and for the experiments that we did here, the HSCC representation outperforms the MFCC representation; but we're happy just saying that they're comparable in magnitude.
0:15:57 We've also, just to be safe, applied an LDA to the MFCC system. This is also not a fair thing to do, because we actually haven't truncated or discarded any dimensions after it: we take twenty dimensions and we rotate them. It leads to a negligible improvement. If we combine the HSCC and MFCC systems, we get improvements in every case except for the dev set for males, where MFCCs don't seem to help; but other than that, in general, HSCCs gain at least from combination with MFCCs.
0:16:35 Okay, so given these results, I'm going to describe a couple of analyses, or an analysis of a couple of perturbations, because we were interested in seeing how lucky we were in just guessing at the parameters that actually drive the evaluation of our system. We considered three different kinds of perturbations: one was changing the frequency range to which the filterbank is exposed; one was changing the number of comb filters in the filterbank; and the other was ablating, that is, throwing out, the so-called spectral-envelope information which is contained in MFCCs. We ran a very simple version of this analysis, where we used only a single diagonal-covariance Gaussian per speaker, and we only show numbers on the dev set, because we found them sufficiently similar in granularity to the eval-set numbers that we didn't actually bother computing those.
0:17:37 As before, I'm going to plot accuracy as a function of the number of dimensions.
0:17:43 So the first perturbation has to do with modifying the low-order cutoff. As I said, the HSCC system looks at frequencies between three hundred hertz and eight kilohertz, and it is interesting to see what happens if you choose a different value for this low-order, or low-frequency, cutoff. The results here, for females on the left and males on the right, indicate that the three-hundred-hertz cutoff we had chosen just happens to correspond to the best performance. If we expose the algorithm also to frequencies between zero and three hundred hertz, then for females we actually lose approximately four percent absolute; the drop is much smaller for males. Moving the cutoff further up has a smaller effect, but it's also worse than keeping what we have.
0:18:35 The second perturbation that we analysed was changing the upper limit. As I said, we had three hundred to eight thousand hertz to begin with, but it's interesting to see what happens if you cut it off at four thousand hertz, or at two thousand; this latter configuration in particular corresponds approximately to upsampled eight-kilohertz telephone audio. So here again, results for males on the right and females on the left: what we see is that for males, reducing the number of high-frequency components that you look at in the FFT has a more drastic effect than for females. For females, actually, going down to four thousand is only a drop of less than one percent absolute, but dropping it further, you see drops of approximately three percent. I want to state that even under these sort of ridiculous ablation conditions, this significantly outperforms a pitch tracker, although it is not known how well a pitch tracker would operate on three-hundred-to-two-thousand-hertz audio.
0:19:37 The third perturbation is in the transform domain. As I said at the very beginning, we have four hundred filters spaced one hertz apart, and we are at liberty to choose however many filters we want; so it's interesting to see what happens if you double that number and space them half a hertz apart, or halve that number and space them two hertz apart. What the results show, for females on the left again and males on the right, is that increasing the resolution of the candidate fundamental frequencies with which you construct the filter bank actually leads to significant improvements, of almost two percent absolute for females, and slightly smaller for males; but then decreasing the resolution has a similarly sized negative impact for both.
0:20:27the fact that the
0:20:28that the mfcc an H assisi features
0:20:30combine to improve performance in three out of four cases
0:20:34suggests that the
0:20:35that the to surf feature streams are complementary
0:20:38um
0:20:39but there is actually no proof of that until sort of now
0:20:43so what we're gonna do here is we're gonna take that
0:20:45the the source domain um
0:20:47fft
0:20:48and we're going to uh
0:20:50lifter it by transforming it into the real cepstrum and then throwing out the low order cepstral coefficients and then
0:20:55transforming it back into the
0:20:57into the spectrum
0:20:58coming
0:20:59um
0:20:59and i wanna say here that the the lower order
0:21:02real cepstral coefficients correspond approximately
0:21:05to the low order mfcc coefficients right so
0:21:08um
0:21:08ablating
0:21:10real cepstral coefficients
0:21:11is a
0:21:12which are working
0:21:13computed without a filterbank
0:21:15is a
0:21:16very similar to removing
0:21:17exactly that information that's captured by
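That liftering operation can be sketched as follows, operating on a log power spectrum over rfft bins; this is a minimal illustration under my own conventions, not the exact processing used in the experiments:

```python
import numpy as np

def lifter_out_envelope(log_power_spec, n_drop=13):
    """Remove spectral-envelope information from a log power spectrum:
    go to the real cepstrum, zero the `n_drop` lowest-order cepstral
    coefficients, and transform back to the log-spectral domain."""
    cep = np.fft.irfft(log_power_spec)      # real cepstrum
    cep[:n_drop] = 0.0
    if n_drop > 1:
        cep[-(n_drop - 1):] = 0.0           # symmetric counterpart
    return np.fft.rfft(cep).real
```

Low quefrencies carry the smooth envelope and high quefrencies carry the fine harmonic ripple, so zeroing the low-order coefficients leaves only the ripple behind.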
0:21:21 So, for the HSCC system whose performance you saw in the table, we actually don't do any liftering, but we could lifter out, say, the first thirteen low-order cepstral coefficients, which corresponds approximately to what people typically use in an ASR system, or the first twenty, which is what we used in our MFCC baseline that we saw earlier. What ends up happening, as you can see, and a curious comment on females, is that removing the spectral-envelope information actually improves performance here: if we throw away the information contained in the first thirteen cepstral coefficients, we get an improvement of about two percent absolute, meaning that the spectral-envelope information that's modelled in MFCCs actually hurts here for women. It's also the case that if we throw out twenty of them, we still do better than not throwing out any, but it's already not as good as throwing out only thirteen, which suggests that the cepstral coefficients found between order thirteen and twenty are useful. For males, doing any kind of liftering seems to hurt, but the pain is smaller: if you throw away the first thirteen, the loss is negligible, I believe it's one trial, and I have no idea whether it's statistically significant.
0:22:35 So the findings of this are that the representation appears to be robust to perturbations of various sorts; there is play of approximately five percent absolute. The performance for female speakers seems to be more sensitive to these perturbations than for males, in both pleasing and displeasing directions. And it's again important to say that even under these perturbed conditions, the performance of these systems is vastly superior to the performance that would be achieved if you spent a lot of time finding the argmax in the representation that pitch trackers are exposed to; though we don't know how a pitch tracker would perform here.
0:23:16 So the summary of this talk, and I still have three slides, is that the information that is available to a standard pitch tracker, because it is computed by that pitch tracker and then subsequently discarded, is valuable for speaker recognition. The three points that I would like to draw specific attention to are: the performance achieved with these HSCC features is comparable to that achieved with MFCC features; the information contained in these HSCC features appears to be complementary to the information in MFCCs; and HSCC modelling appears to be at least as easy as MFCC modelling. So this evidence suggests, as I have probably said too often by now, that improving the estimation of pitch, that is, finding the argmax in this representation, which is essentially what pitch trackers do, seems like an endeavour that doesn't warrant further time investment; it's possible to simply model the entire transform domain and do better. If pitch is required for other, high-level kinds of features, which of course we're ignoring here because we're not doing any long-distance feature computation, then at least that information should not be discarded, even if it is not used to estimate pitch. If these ideas generalise to other data types and other tasks, then there is some chance that this will lead to some form of paradigm shift in the way that prosody is modelled in speech.
0:24:54 So I want to close with a couple of caveats. We don't actually know how these features compare to other instantaneous prosody vectors; it's possible that if you had a vector that contained pitch, and maybe harmonics-to-noise ratio, and maybe some other things that are computable instantaneously, per single frame, the difference would be much smaller. We don't know that at the current time. We also don't know how this representation performs under various mismatched conditions, for example channel, or session, or distance from microphone, or vocal effort; these are things that need to be explored. It's also quite possible that there are other classifiers that may be better suited to this; in particular, the performance, which wasn't bad, with a single diagonal-covariance Gaussian suggests that maybe SVMs would do much better, but the feature vectors are large, so this presents some problems. Existing prosody systems of course focus a lot on long-term features, and we haven't attempted that here at all; a simple thing to try would be to stack features from temporally adjacent frames, or to stack their differences, but I think that probably the best thing to do is to simply compute the modulation spectrum over this, over the HSCC spectrogram. And, probably most importantly, we would really like to have a data-independent feature rotation which allows us to compress the feature space: this would significantly improve understanding, because right now we just have this huge bag of numbers; it would allow us to apply some of the normal things that people apply, like universal background models; and it would allow us to deploy it in other, larger tasks. Thank you.
0:26:51thank you
0:27:06 [Audience member:] Could you perhaps just help me understand your last point? Please explain to me why you see some difficulty in applying your method to larger tasks; is that because your feature vectors are very large?
0:27:27 Well, the first thing, yes: in the system that we describe most extensively here, the feature vector has four hundred numbers, and I have found it to be painful; that's four hundred numbers every ten milliseconds. Does that answer your question?
0:27:44 Can I say something more: we have actually found that if you are looking at different kinds of mismatch, you need to do some homomorphic processing, which actually increases the size of this feature vector, and so it becomes even more painful. And it's basically because we don't really know how to properly model this with a data-independent transform.
0:28:06okay thanks
0:28:08those seem like they would be very worthwhile to try based on those
0:28:12yeah
0:28:13[inaudible]
0:28:15it would be nice to think of ways to
0:28:17improve the proposal
0:28:18definitely and if any of you have any suggestions
0:28:20i would like to take them
0:28:30do you have any thoughts on why this
0:28:32might be hurt on mismatched data
0:28:38i do
0:28:39i have we have some thoughts
0:28:40so
0:28:41but we we don't have
0:28:42really the correct
0:28:42kinds of thoughts
0:28:43so
0:28:46note also that the problem is that the other dataset that we've been playing with most recently after doing this
0:28:52is a far field dataset
0:28:54and so everything is far field
0:28:55so there is a big change in what happens and we actually don't really know
0:28:59exactly where the change is so i guess we're now in the process of thinking about finding
0:29:02different data but
0:29:03let's try to remember this table um
0:29:06so
0:29:06this is on something called the mixer five dataset
0:29:09which is
0:29:10which which contains lots of
0:29:11different channels but
0:29:12the nine channels that we use are all far field channels
0:29:16and um
0:29:17we we what we have there is we have uh
0:29:20two evaluation sets
0:29:23one has
0:29:24session match and the other has session mismatch and then
0:29:27we
0:29:29we build models
0:29:30for data from every channel and apply them to that same channel that's the matched channel condition
0:29:34and we also apply those models to data from every other channel and that's the mismatched channel condition so that
0:29:39the mismatched channel condition consists of
0:29:42i think it's an average of eight times nine numbers and the matched channel condition is
0:29:46an average of nine
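the bookkeeping described here, per-channel models scored on the same channel versus every other channel, amounts to averaging the diagonal versus the off-diagonal of a 9x9 error matrix; a small sketch with synthetic numbers (the error values are made up, not results from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
# err[i, j]: error of the model trained on channel i, tested on channel j
err = rng.uniform(5.0, 20.0, size=(9, 9))

matched = np.diag(err).mean()              # average of 9 on-diagonal numbers
off_diag = err[~np.eye(9, dtype=bool)]     # the 8 * 9 = 72 cross-channel cells
mismatched = off_diag.mean()               # average of eight times nine numbers
```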
0:29:48so
0:29:48what we see is that in channel
0:29:50matched and channel
0:29:51so in session matched and channel matched conditions
0:29:54we we're we're doing something uninteresting there
0:29:58but
0:29:59you can see here
0:30:01that session mismatch
0:30:02is is more painful than channel mismatch
0:30:04right
0:30:05and um
0:30:07there there is a there is a clear reversal here in the ordering between the mfcc system and the
0:30:13system that we reported in this work
0:30:15um
0:30:16so
0:30:17oh yeah by the way
0:30:18so uh
0:30:20this
0:30:20this line here is the system that i just described
0:30:24and this other one is something that we
0:30:27submitted this summer that has just been accepted which is where this
0:30:30this table comes from
0:30:32but uh
0:30:34but the point is that um
0:30:35these numbers in this row
0:30:37are on average always smaller
0:30:44than the numbers in in in this row
0:30:46so
0:30:47uh
0:30:48i don't know if that answers your questions i can probably talk a little bit about the magnitude of these
0:30:52numbers but
0:30:53are you happy with this then
0:31:02but
0:31:03but i can see actually here that um
0:31:06i don't recall exactly but i think that
0:31:08we did the combination here and the combination leads to approximately ten percent
0:31:13absolute improvement
0:31:14over this mfcc number
0:31:17on average over all conditions right
0:31:20and
0:31:21on the asr you know processing side
0:31:23sorry
0:31:24yeah on the asr you know processing side
0:31:26i think this proposal suggests to
0:31:28[inaudible]
0:31:37could you hold your microphone a little bit closer sorry
0:31:43okay
0:31:45um i think these features
0:31:47this proposal suggests to me the band limited
0:31:50uh harmonic to noise ratios
0:31:52no
0:31:52i think in addition to harmonic to noise ratios [inaudible]
0:32:01which is useful
0:32:03[inaudible]
0:32:05in decoding so mixed excitation
0:32:08and
0:32:09so
0:32:10do you have anything in mind
0:32:11well i i'm not sure that i got all of the things that you said
0:32:16but if you said that there is something that's very similar to this
0:32:19yeah i would really like to talk with you about your system and and we can do that offline
0:32:24or or you can use this
0:32:26right
0:32:27thank you
0:32:32just to come back to the
0:32:33four hundred dimensional features i think you
0:32:36[inaudible]
0:32:39did you reduce that
0:32:40feature dimensionality before your modelling stage
0:32:44i'm sorry uh can you just repeat the very beginning of your question
0:32:49your
0:32:49your your features are
0:32:50four hundred dimensional yep
0:32:52so
0:32:53did you use an lda to reduce the dimensionality before your
0:32:58modelling stage
0:33:00yeah
0:33:00so to what dimensionality did you reduce
0:33:03sorry i i uh
0:33:17so it turns out that
0:33:19it differed for males and females it was fifty two and fifty three
0:33:22i i i don't remember which gender it doesn't matter
0:33:24it's close enough
0:33:25okay because then it would seem it's
0:33:28probably not a practical problem anymore
0:33:30uh
0:33:31we we typically use sixty dimensional features
0:33:34uh with the with the
0:33:37ubm that has two thousand components
0:33:40so that's doable
0:33:41right but the problem is that we would need to invert
0:33:44uh you know we we need to compute the lda or pca transform
0:33:47over
0:33:48because these transforms are global
0:33:50right
0:33:51so we would need to compute
0:33:52a pca transform over
0:33:54two thousand features for the entire
0:33:57i don't know
0:33:58ubm training set if you will
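to illustrate why such a transform is data-dependent, here is a minimal PCA sketch: the rotation must be estimated once over the pooled (ubm-style) training frames before it can be applied to any frame. the dimensions echo numbers from the talk (400 in, roughly 52 out), but the data is synthetic:

```python
import numpy as np

def fit_pca(train_frames, keep):
    """Estimate a global PCA rotation from pooled training frames."""
    mu = train_frames.mean(axis=0)
    cov = np.cov(train_frames - mu, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)            # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:keep]         # indices of the largest ones
    return mu, vecs[:, top]

def apply_pca(frames, mu, basis):
    """Project centered frames onto the learned basis."""
    return (frames - mu) @ basis

train = np.random.randn(5000, 400)              # pooled training frames
mu, basis = fit_pca(train, keep=52)
low = apply_pca(np.random.randn(10, 400), mu, basis)   # (10, 52)
```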
0:34:01do do do i understand you correctly
0:34:04oh
0:34:04did you say um
0:34:07that the transform you estimate will depend on [inaudible]
0:34:14no
0:34:15i
0:34:16what i meant
0:34:17if i gave that impression i didn't intend to
0:34:21so it's it's [inaudible] dimensional so much
0:34:25yes
0:34:26um [inaudible] so it's hard to see what would be um
0:34:30a problem when using a universal background model
0:34:33on some of these features
0:34:35but you would use [inaudible]
0:34:38[inaudible] dimensions [inaudible]
0:34:41right so
0:34:48i guess
0:35:02i see
0:35:03training and
0:35:05and extracting the features and
0:35:08[inaudible]
0:35:13yeah right
0:35:18basically yeah so we start from
0:35:20[inaudible]
0:35:24train it and then
0:35:26based on this
0:35:28[inaudible] and that's the test set
0:35:33[inaudible]
0:35:40due to some limitations
0:35:42it's not difficult in principle
0:35:45mainly it's just time
0:35:48i mean we just haven't gotten around to getting that far
0:35:50and like i said
0:35:52with
0:35:53feature vectors of the order of
0:35:55four hundred or eight hundred which has been shown to be better
0:35:57and
0:35:58two thousand and forty eight after
0:36:00some homomorphic
0:36:02processing
0:36:04we just haven't gotten around to even estimating how much disk space we would need on a particular corpus
0:36:09so
0:36:12that's essentially the the
0:36:14correct answer there but my
0:36:15thought always was that we were gonna attack this problem by making the feature vector smaller first
0:36:19rather than addressing the
0:36:21the
0:36:22the
0:36:22infrastructure problem
0:36:23and buying more disks right
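a back-of-the-envelope sketch of that disk-space estimate, assuming one feature vector every 10 ms stored as 4-byte floats (the 100-hour corpus size is an arbitrary assumption, not a figure from the talk):

```python
def storage_gb(dims, hours, frame_rate_hz=100, bytes_per_value=4):
    """GB needed to store per-frame feature vectors for `hours` of audio."""
    frames = hours * 3600 * frame_rate_hz
    return frames * dims * bytes_per_value / 1e9

# dimensionalities mentioned in the talk: 400, 800, and 2048
sizes = {d: storage_gb(d, hours=100) for d in (400, 800, 2048)}
```

at 400 dims this already comes to tens of gigabytes per hundred hours, which is why shrinking the feature vector first looks attractive.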
0:36:25so
0:36:25okay
0:36:26okay
0:36:32okay but
0:36:33i think
0:36:33right now
0:36:35i