0:00:15 Okay. So, this is a talk about neural networks, primarily recurrent neural networks, for text-dependent speaker verification.
0:00:31 On paper at least, this is a very natural fit between a model and a task, and it's something that Google has got to work very successfully. We tried it; unfortunately, we came to the conclusion that we were a couple of orders of magnitude short in the amount of background data we would need. I'll walk you through what we did and explain why it didn't work.
0:01:04 I would recommend that you read this paper; I've suggested it could be regarded as a survey article, and I think it's worth reading on those grounds. But I'm not going to spend the whole period talking about this particular problem.
0:01:21 I'd like to explain what our plans are for getting these neural networks to work — I'm talking specifically about speaker-discriminant neural networks — getting them to work in text-independent speaker recognition. My student's thesis project will be specifically on getting convolutional neural networks to work, and I personally am particularly interested in what the right backend architecture is for this type of problem.
0:01:58 So what I plan to do, given that I have only negative results to present, is spend maybe five or ten minutes talking about why this is a difficult problem, but why the difficulties are not insuperable. If possible, I'll explain what we're hoping to do by way of a system for the NIST evaluation based on speaker-discriminant neural networks — all this in the hope of provoking a discussion. I would be particularly interested in the plans of other people who might be trying to do something similar.
0:02:43 Okay. So, the background on this task: the problem was to use neural networks to extract utterance-level features which could be used to characterize speakers, in the context of a classical text-dependent speaker recognition task where you have a pass phrase and the phonetic variability is partially nailed down.
0:03:11 The easiest way to do this is using an ordinary feed-forward deep neural network, but we were particularly interested in trying to get this to work with recurrent neural networks, largely inspired by recent work in machine translation, which I'll describe briefly.
0:03:36 So here's the problem. I'll just mention at the outset that we were specifically interested in the case of getting this to work with a modest amount of background data. Most of us working in text-dependent speaker recognition are confronted by a very hard constraint: if we're lucky, we will be able to get data from one hundred speakers, whereas if you read the Google paper you will see that they have literally tens of millions of recordings, all instances of the same phrase.
0:04:13 So, what you would do in designing a deep neural network for this purpose: you would just feed a three-hundred-millisecond window into a classical feed-forward neural network with a softmax on the output, where you have one node for each speaker in your development population, and train it up with a classical cross-entropy criterion. You would then get utterance-level features simply by averaging the activations of the last hidden layer over all frames. This was implemented successfully by Google; they called it the d-vector approach.
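That recipe — train with a softmax over development speakers, then average the last hidden activation over all frames of an utterance at test time — can be sketched as follows. This is a toy illustration of the idea, not Google's implementation; the layer sizes are arbitrary and the weights are random placeholders standing in for a trained network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder weights for two hidden layers of a trained network
# (the softmax output layer is discarded at feature-extraction time).
W1, b1 = rng.standard_normal((256, 120)) * 0.1, np.zeros(256)
W2, b2 = rng.standard_normal((256, 256)) * 0.1, np.zeros(256)

def last_hidden(frame_window):
    """Forward one stacked-frame window through the hidden layers (ReLU)."""
    h = np.maximum(W1 @ frame_window + b1, 0.0)
    return np.maximum(W2 @ h + b2, 0.0)

def d_vector(utterance_windows):
    """Average the last hidden activation over all frames of the utterance."""
    return np.mean([last_hidden(w) for w in utterance_windows], axis=0)

# One 'utterance' of 50 stacked-frame windows, each 120-dimensional.
utt = rng.standard_normal((50, 120))
dvec = d_vector(utt)
print(dvec.shape)  # (256,)
```

The resulting fixed-dimensional vector is what gets scored by the backend, regardless of the utterance's duration.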
0:05:03 It works fairly well on our task as well, although it's not competitive with a plain GMM-UBM. Well, this is just the classical feed-forward architecture; I don't think it needs any further comment.
0:05:23 What was, I think, most remarkable about the RNN architecture, which I'll describe next, is that Google managed to get it to work as an end-to-end speaker recognition system — not merely a feature extractor, but one which could make a binary decision concerning a trial, as to whether it's a target trial or a non-target trial. This has been seen as a sort of pot of gold at the end of the rainbow in our field for a very long time.
0:06:01 People have been able to get it to work with i-vectors, but a direct approach to that problem has generally been resistant to our best efforts. Google got it to work with their RNN system. You'll see that they used an awful lot of data: that figure of twenty-two million recordings is not a misprint.
0:06:30 So, the RNN architecture — the diagrams in the slides refer just to a classical memory module, not an LSTM, just a plain memory module — where, in addition to an input vector at each time step, you also have a hidden layer that encodes the past history. What the neural network does at each time step is append the input to the hidden activation, then squash the dimension back down: the concatenation is fed through a nonlinearity to produce the new hidden activation, so you keep on updating a memory of the history of the utterance. That's a very natural sort of model for data with a left-to-right structure, as in classical text-dependent speaker recognition, or even machine translation.
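The memory update just described — concatenate the input with the previous hidden activation, project back down to the hidden dimension, and pass through a nonlinearity — is the classical (Elman-style) recurrent cell. A minimal sketch, with random placeholder weights in place of trained ones:

```python
import numpy as np

rng = np.random.default_rng(1)
input_dim, hidden_dim = 40, 128

# Placeholder weights; in practice these are learned by backpropagation.
W = rng.standard_normal((hidden_dim, input_dim + hidden_dim)) * 0.1
b = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    """One time step: append input to history, squash dimension, nonlinearity."""
    concat = np.concatenate([x_t, h_prev])
    return np.tanh(W @ concat + b)

h = np.zeros(hidden_dim)
for x_t in rng.standard_normal((30, input_dim)):  # 30 frames of features
    h = rnn_step(x_t, h)  # h accumulates a memory of the utterance so far
print(h.shape)  # (128,)
```

The final `h` is the fixed-capacity memory of the whole sequence, which is exactly why very long sequences eventually overwhelm it.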
0:07:37 Okay, so that is the classical RNN architecture. There was an extraordinary paper on machine translation published in 2014 which showed that it was possible to train a neural network for the French-to-English translation problem using an RNN architecture with a very special feature: there was a single softmax. In what they call the encoder, the network reads French-language sentences, and it was trained in such a way that the hidden activation at the last time step was capable of memorizing the entire French sentence — so that all the information you needed in order to do machine translation from French to English was summarized in the hidden activation at the last word of the sentence.
0:08:44 To get this to work they had to use four layers of LSTM units. It wasn't easy, but they were able to get state-of-the-art results on a machine translation task with sentences of about thirty words. Obviously that must eventually break down: you cannot memorize sentences of indefinite duration this way, just because the memory has a finite capacity. But Google figured that if it works for machine translation, it is definitely going to work for text-dependent speaker recognition: it will be possible to memorize a speaker's utterance of a few hundred frames.
0:09:35 There are various ways the basic approach has been improved on. An obvious thing to do, instead of using the activation at the last time step to memorize an utterance, would be to average the activations over all time steps. But once again, you would be taking the average activation and feeding it into a single softmax to do the memorization — it's not one softmax per frame.
0:10:07 There was a bit of controversy, as you can imagine, in the machine translation field as to whether this really was the right way to memorize entire sentences, and that led to a flurry of activity on something called attention modeling. The argument was that if you're going to translate from French to English, then in the course of producing the English translation, as you proceed word by word, you want to direct your attention to the appropriate place in the French utterance — and that correspondence is not necessarily going to be monotonic, because word ordering can change as you go from one language to the other. A model was developed along these lines which, I think, remains the state-of-the-art in automatic machine translation.
0:11:09 What Google set out to do was to take that idea and, using this sort of attention mechanism to weight the individual frames in the utterance, learn an optimal summary of a speaker's production of the pass phrase. And that was the thing that actually worked best for them.
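Attention-style summarization of frame-level activations amounts to a learned weighted average — softmax weights computed from each frame, in place of taking the last time step or a plain mean. A minimal sketch (the attention parameter `v` is a random placeholder for a learned vector):

```python
import numpy as np

rng = np.random.default_rng(2)
hidden_dim = 128
v = rng.standard_normal(hidden_dim)  # placeholder learned attention parameter

def attentive_summary(H):
    """H: (num_frames, hidden_dim) activations -> (hidden_dim,) summary."""
    scores = H @ v                        # one relevance score per frame
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax over frames
    return weights @ H                    # attention-weighted average

H = rng.standard_normal((50, hidden_dim))  # fake frame activations
summary = attentive_summary(H)
print(summary.shape)  # (128,)
```

Setting all scores equal recovers the plain average; a one-hot weighting recovers last-time-step pooling, so both earlier schemes are special cases.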
0:11:40 So, this describes the task: a fairly classical text-dependent speaker recognition task. The language was German; the data was provided for us by an industrial partner.
0:11:57 The results, in the event — well, the standard tricks worked as advertised: rectified linear units, dropout, and so on each gave an incremental improvement in performance, but we were not able to match the performance of a GMM-UBM. And of course the same thing happened with RNNs: doing intelligent summaries of the data helped, but the results were ultimately disappointing.
0:12:39 And the reason was quite clear: with just one hundred development speakers, we were going to hopelessly overfit to the data. So these methods are not going to work unless we have very large amounts of data.
0:13:03 Very large amounts of data may be on the way — there was a suggestion just this morning that there might be the possibility of getting such data — and in that case this sort of thing could seriously become a viable, plausible solution. But it's clear that my student isn't going to get a thesis out of this before that problem is solved.
0:13:32 But he's been bitten by the neural network bug, so his task will be to try to get convolutional neural networks working — convolutional neural networks trained to discriminate between speakers, working as feature extractors for text-independent speaker recognition. What I would like to do now is just talk about what our plans are for that.
0:14:07 What I thought I would do was first of all explain why this is a difficult problem — why we cannot expect out-of-the-box solutions already existing in the neural network literature to work for us — and why, nonetheless, it's not an insuperably difficult problem, so we ought to be able to do something about it. We are presently committed to getting this to work. We are going to submit some sort of system for the NIST evaluation, but I think it's going to take a bit longer to actually iron all the kinks out of this.
0:14:51 It seems to me that in approaching this problem there are two fundamental questions that we need to be able to answer, and how we answer them is probably going to dictate which direction we actually take. The question about the backend, which I'm particularly interested in, is actually of secondary importance.
0:15:20 So the first question, as I see it, is this: if we look at the successes in fields like face recognition, which is a very similar biometric pattern recognition problem — I'm thinking in particular of DeepFace — why is it that this has worked so spectacularly for them, while we still haven't been able to get it to work?
0:15:47 A second question would be this: if we look at the current state-of-the-art in text-independent speaker recognition — where we have a neural network trained to discriminate between senones, collecting Baum-Welch statistics for an i-vector extractor, in a cascade — why is it that, if we simply train a neural network to discriminate between speakers on the NIST data, we haven't been able to get that architecture to work satisfactorily in speaker recognition?
0:16:34 To my knowledge, several people have tried this but haven't yet obtained even a publishable result. I may be wrong about that — I'd be happy to be proven wrong — but I believe that this is where things stand at present.
0:16:53 So, if we look at the DeepFace architecture: what these guys did at Facebook was take a population of four thousand development subjects, with one thousand images per subject. They trained a convolutional neural network to discriminate between the subjects in the development population and used it as a feature extractor; at runtime they just fed the output into a cosine distance classifier. Their output was a few thousand dimensions, but Google later showed that you could do this with one hundred and twenty-eight dimensions — the same order of magnitude that we have found to be appropriate for characterizing speakers in text-independent speaker recognition.
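The cosine-distance backend mentioned above is about as simple as a classifier gets: score a trial by the cosine similarity of the two embeddings and threshold it. A minimal sketch (the threshold value and the random embeddings are placeholders; in a real system the embeddings come from the trained network):

```python
import numpy as np

def cosine_score(e1, e2):
    """Cosine similarity between two embeddings (face or speaker)."""
    return float(e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2)))

def decide(enroll, test, threshold=0.5):
    """Binary trial decision: target if the embeddings are close enough."""
    return cosine_score(enroll, test) >= threshold

rng = np.random.default_rng(3)
e = rng.standard_normal(128)   # a 128-dimensional embedding, as in FaceNet-style systems
print(decide(e, e))            # an embedding trivially matches itself
```

All of the discriminative work is pushed into the feature extractor; the backend itself has essentially nothing to train, which is part of what makes the approach attractive.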
0:17:55 Of course, the fact that they have one thousand instances per subject obviously does make life a lot easier; the most we can hope for is maybe ten on average. But some people have raised a more fundamental concern: in our case we're not really trying to extract features from something analogous to static images, because of the time dimension. We're confronted not only with utterances of variable duration, rather than a fixed dimension, but also with the order of phonetic events, which is a nuisance for us: we need a representation that's invariant under permutations of the order of phonetic events.
0:18:54 I think a convolutional neural network should be able to solve both of these problems in principle, because it will produce a representation that's invariant under permutations in the time dimension, and in principle it will be able to handle utterances of variable duration. If you look at automatic segmentation in image processing, you'll see that they do use convolutional neural networks with images of variable size.
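Both properties claimed here — a fixed-size output for any duration, and tolerance to reordering of events in time — come from global pooling over the time axis after convolution. A toy sketch with random placeholder filters (not a trained network):

```python
import numpy as np

rng = np.random.default_rng(4)
feat_dim, num_filters, width = 20, 32, 5
# Placeholder 1-D convolution filters spanning `width` frames each.
filters = rng.standard_normal((num_filters, width, feat_dim)) * 0.1

def conv_embed(X):
    """X: (num_frames, feat_dim) -> fixed-size embedding, any duration."""
    T = X.shape[0]
    # Valid 1-D convolution along the time axis for each filter.
    acts = np.array([[np.sum(filters[f] * X[t:t + width])
                      for t in range(T - width + 1)]
                     for f in range(num_filters)])
    # Global max pool over time: output dimension no longer depends on T,
    # and reordering distant stretches of the utterance leaves it unchanged.
    return acts.max(axis=1)

short = rng.standard_normal((30, feat_dim))   # a short utterance
long_ = rng.standard_normal((100, feat_dim))  # a much longer one
print(conv_embed(short).shape, conv_embed(long_).shape)  # both (32,)
```

Average pooling over time would serve the same purpose; either way the pooling is what removes the dependence on duration and on the position of phonetic events.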
0:19:28 So I don't think it's hopeless. But this would be my answer to the question of why senone-discriminant neural networks work and speaker-discriminant neural networks don't: I think trying to discriminate between speakers on very short time scales is going to be a very hard problem, and we should just stay away from it for the time being. The reason is very simple: the variability in the signal at short time scales is necessarily phonetic variability, not speaker variability. If it were speaker variability rather than phonetic variability, then speech recognition would not be possible.
0:20:17 So what happens if we take the same architecture as is used in senone-discriminant neural networks — a ten-millisecond frame advance and a three-hundred-millisecond window? We're just going to get swamped by the problem of phonetic variability.
0:20:38 It's actually quite easy to get neural networks working as feature extractors if you use whole utterances as the input: if you just encode the utterance as an i-vector, you will get a bottleneck feature that does a very good job of discriminating between speakers. If you feed in whole utterances, the problem is solvable, but it's actually too easy to be interesting — you're not going to get away from i-vectors that way. If you go down to ten milliseconds, I think you're just going to get killed by the problem of phonetic variability.
0:21:13 The sweet spot for the short term, I think, should be something like ten seconds — that has been the mark in language recognition, and you'll see several papers in these proceedings showing that neural networks are good at extracting features for language recognition if you give them utterances of three seconds or ten seconds or whatever. But I would say that the particular problem of getting down to short time scales is one that we should eventually be able to solve, and we shouldn't give up on it.
0:21:50 I think if you want to use neural networks as feature extractors, not merely for speaker recognition but also for speaker diarization, then you are going to have to confront this problem: you can't have a window of more than, say, five hundred milliseconds in speaker diarization, or you're going to miss speaker turns. So we are eventually going to have to confront the problem of how to normalize for phonetic variability in utterances of short duration if we're to train neural networks to discriminate between speakers.
0:22:32 I'll just mention the paper that Themos will be presenting, which attempts to deal with that problem using factor analysis. The idea — and I think this is going to work eventually — is that we should think of phonetic content as a short-term channel effect. When I say short-term, I mean maybe five or ten frames. In the normal way we think about channels, this would be sort of hopeless: we can model channel effects that we assume to be persistent over entire utterances, but not at the level of, say, ten milliseconds. However, we do have the benefit of supervision — which could be supplied by something like a senone-discriminant neural network — that tells you at each time step what the probable phonetic content is. So it is actually possible to model phonetic content as a short-lived channel effect, and you can do that using factor analysis methods.
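A crude way to see how frame-level phonetic supervision makes this tractable — this is not the factor-analysis model itself, just an illustration of the idea — is to subtract from each frame the expected phonetic offset implied by the senone posteriors. The per-phone mean vectors below are random placeholders; a real system would learn factor loadings instead of fixed means.

```python
import numpy as np

rng = np.random.default_rng(5)
feat_dim, num_phones = 20, 10
# Placeholder per-phone offsets (stand-ins for learned phonetic factors).
phone_means = rng.standard_normal((num_phones, feat_dim))

def normalize(frames, posteriors):
    """Remove the expected short-lived phonetic 'channel' offset per frame.

    frames:     (T, feat_dim) acoustic features
    posteriors: (T, num_phones) per-frame phone posteriors from a senone DNN
    """
    expected_offset = posteriors @ phone_means  # posterior-weighted offset
    return frames - expected_offset

T = 50
frames = rng.standard_normal((T, feat_dim))
post = rng.random((T, num_phones))
post /= post.sum(axis=1, keepdims=True)  # each row is a proper posterior
print(normalize(frames, post).shape)  # (50, 20)
```

The supervision is what changes the picture: without the posteriors, a channel effect that changes every ten frames would be unidentifiable.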
0:24:01 That was the topic of Themos's presentation. It's just a first experiment, but I think the solution of that particular problem is going to be a key element in getting neural networks to discriminate between speakers at short time scales. Okay, so that's all I have to say about that.
0:24:52 [Audience question] Okay, so I think you said that you want to reduce the channel variability and learn the speaker variability. Are you thinking about a softmax over the target speakers? For example, I can tell you what we are interested in working on: trying to learn the cosine similarity between speakers. We have a Siamese network trying to say whether this is the same speaker or a different speaker, by learning some cosine similarity and trying to push the clusters apart.
0:25:30 Well, my view about this — and this is just an opinion — is that in order to get neural networks to work in speaker recognition, in the long run we are going to have to combine them with a generative model. The way I see it working is that, analogously to the DeepFace architecture, we can hope to get neural networks working as feature extractors: trained to discriminate between speakers in the development set, but used as feature extractors at runtime. I would expect we would have these neural networks output things like i-vectors at regular intervals as you go through an utterance, and the interesting problem, I believe, is how to design a backend to deal with that. It may in fact involve modeling counts, which will be the topic of your presentation.
0:26:47 Although I believe there are other models which are just waiting to be used — I'm thinking particularly of latent Dirichlet allocation, which is the analogue, for count data, of eigenvoices for continuous data. You can build an i-vector extractor using latent Dirichlet allocation for count data, and if you can do eigenvoices, you can also do an analogue of PLDA. It will behave very differently from the PLDA we know, because it won't have Gaussian assumptions; it won't even have the assumption of statistical independence between speaker effects and channel effects. That's a whole other theory. And you can actually do the Bayesian thing of training the model with unlabeled data — you can do that with latent Dirichlet allocation. So there's actually a very big field here waiting to be explored.
0:28:02 The only question is whether we want to go in the direction of discriminative training with a softmax, or in the direction of generative representations. Personally — and this is just one opinion — I believe neural networks on their own are not going to do our task. We could never hope to train them on unlabeled data, and a neural network trained that way cannot discriminate between speakers it doesn't know anything about at runtime. So I think they will need to be complemented by a backend which is waiting to be developed — not the backends that we have at present.