0:00:15 I'm going to be representing the University of Science and Technology of China,
0:00:19 the National Engineering Laboratory of Speech and Language
0:00:24 Information Processing.
0:00:26 This is a paper by Ma Jin, my master's student, and some other
0:00:32 collaborators. We asked him to
0:00:34 build his own CNN, which he did, and then we asked him to try using
0:00:38 it for something, which he did.
0:00:40 So what I'm going to do is present
0:00:42 what came out when he tried that.
0:00:46 We've got four stages: an introduction to how this works for language ID and the structure
0:00:51 in general,
0:00:52 the proposed method, some experiments and analysis, and then
0:00:56 maybe we'll end with a bit of thought on some future work.
0:01:00 Well, the first thing to ask is: what is language identification?
0:01:06 It's just the task of taking a piece of speech and extracting language identification information
0:01:12 from it. That comes at different levels, as we know,
0:01:14 and we can say that that's all acoustic information or phonetic information. It's quite hard
0:01:21 to dissociate that from the characteristics of the speaker, as we'll see
0:01:26 in a little while,
0:01:27 and we're fighting the tendency to do
0:01:30 speaker recognition instead.
0:01:32 State of the art? Well, probably,
0:01:36 maybe this will change shortly, I don't know, but state of the art is really GMM i-vectors,
0:01:42 and we've seen great gains, but everybody's, you know, trying to find what's next.
0:01:48 Deep learning in particular allows us to take some of the advantages of supervised
0:01:55 training to be able to extract
0:01:58 discriminative information out of
0:02:01 the data that we have. Especially when we have small amounts of training
0:02:04 data, we can use transfer learning methods
0:02:08 to train something which may well be discriminative
0:02:11 on a related task, and infer language ID from it.
0:02:16 Some of these we're seeing
0:02:18 recently: they take a bottleneck-network-based i-vector representation;
0:02:24 that's Yan Song, a collaborator.
0:02:26 This was,
0:02:27 I think it was last year at Interspeech. There's also a poster yesterday, which you
0:02:31 may have missed; the paper should be in the proceedings. We've seen DNN-based neural network
0:02:37 approaches here
0:02:40 doing great things; that's Transactions on ASLP, this one.
0:02:44 Then there are some approaches which are end-to-end methods, and we can look
0:02:50 at some of
0:02:53 the way the
0:02:55 state of the art has flowed through that:
0:02:58 deep neural networks
0:03:00 here,
0:03:01 and that was, I guess,
0:03:05 long short-term memory
0:03:07 RNNs here,
0:03:11 also at Interspeech.
0:03:13 So this is really extracting at a frame level
0:03:17 and gathering sufficient statistics over an utterance in order to
0:03:22 pull out
0:03:22 language-specific
0:03:25 identifiers. There's a recent approach
0:03:28 using a convolutional neural network aimed at short
0:03:34 utterances, and it's using the power of a CNN
0:03:37 to pull out
0:03:38 the information from these short utterances,
0:03:41 and it seems to get over some of the problems in terms
0:03:45 of utterance length.
0:03:47 We have a different method.
0:03:48 We also think that using, say, stacked MFCCs with a large context may be
0:03:56 introducing
0:03:57 too much information that a CNN or DNN then has to remove.
0:04:00 So what we'd have to do there is
0:04:03 use some of our precious training data to remove information that probably
0:04:08 shouldn't have been included in the first place, if we had a magic wand,
0:04:11 in terms of input features.
0:04:15 So what we're doing
0:04:16 is slightly different.
0:04:20 We take a convolutional neural network, but
0:04:24 rather than using the CNN to extract frame-level information per se, what we're actually doing,
0:04:32 in this very wide, long,
0:04:36 end-to-end type system, is starting off with PLP input features,
0:04:43 and we're using a bottleneck
0:04:46 DNN, just a standard bottleneck network,
0:04:49 taking the bottleneck features here,
0:04:52 adding what could be quite a lot of context to the bottleneck features, and
0:04:57 then feeding that into a CNN
0:05:00 here, so three layers,
0:05:03 and finally a fully connected output, and what we're getting is a language label
0:05:07 directly at the output from this.
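To make that data flow concrete, here is a minimal sketch, assuming a PyTorch-style implementation; the layer sizes, activation choices and helper names are illustrative placeholders rather than the exact configuration from the paper, and the pooling step is simplified to a single bin here (it is expanded in a later sketch):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sizes only (not the paper's exact values).
N_IN, BN_DIM, CTX_BN, N_LANGS = 48 * 21, 50, 21, 6

# Half of a pre-trained ASR DNN, kept up to the bottleneck layer and frozen.
bottleneck_dnn = nn.Sequential(
    nn.Linear(N_IN, 1024), nn.Sigmoid(),
    nn.Linear(1024, 1024), nn.Sigmoid(),
    nn.Linear(1024, BN_DIM),                       # per-frame bottleneck output
)

# Small CNN over context-expanded bottleneck features, trained for LID.
lid_cnn = nn.Sequential(
    nn.Conv1d(BN_DIM * CTX_BN, 1024, kernel_size=1), nn.ReLU(),
    nn.Conv1d(1024, 512, kernel_size=1), nn.ReLU(),
    nn.Conv1d(512, 128, kernel_size=1), nn.ReLU(),
)
classifier = nn.Linear(128, N_LANGS)               # fully connected language output

def splice(frames: torch.Tensor, ctx: int) -> torch.Tensor:
    """Stack each frame with its neighbours: (T, D) -> (T, D * ctx), zero-padded at the edges."""
    pad = ctx // 2
    padded = F.pad(frames.t(), (pad, pad)).t()
    return torch.cat([padded[i:i + frames.size(0)] for i in range(ctx)], dim=1)

def predict_language(plp_frames: torch.Tensor) -> torch.Tensor:
    """plp_frames: (T, N_IN) spliced PLP features; returns language logits for the utterance."""
    with torch.no_grad():                          # the ASR-trained front end stays fixed
        bn = bottleneck_dnn(plp_frames)            # (T, BN_DIM)
    x = splice(bn, CTX_BN).t().unsqueeze(0)        # (1, BN_DIM * CTX_BN, T)
    h = lid_cnn(x)                                 # (1, 128, T)
    pooled = F.adaptive_max_pool1d(h, 1).flatten(1)  # utterance-level pooling (simplified)
    return classifier(pooled)                      # (1, N_LANGS)
```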
0:05:09 So you can see why this is sort of attractive in terms of a system-
0:05:14 level implementation, but to me it's
0:05:17 kind of counterintuitive,
0:05:19 because we tend to use CNNs
0:05:21 to extract
0:05:22 front-end information, I mean, in the related tasks that we've been trying;
0:05:27 they tend to work well for that.
0:05:30 I mean, we did try things like stacks of MFCCs as input features to
0:05:37 a CNN directly, and it doesn't seem to, maybe somebody else can do better than
0:05:41 us, but it doesn't seem to work that well.
0:05:43 So what we did was we have a DNN
0:05:46 followed by a CNN, and we'll see how that works.
0:05:50 And in essence,
0:05:52 to sum up: what the DNN does is transform acoustic features to a compact representation.
0:05:56 We do that frame by frame, then add a context of multiple frames and feed the bottleneck features, with
0:06:02 context, into the CNN,
0:06:03 and we come out with something which should be discriminative in terms of language.
0:06:10 Okay, so this is what we call the LID features.
0:06:15 I mean, we think that the general acoustic features at the input, like I
0:06:19 said, do contain too much information,
0:06:21 so we're trying to reduce the amount of
0:06:24 information we put on the
0:06:26 trained system that
0:06:28 follows,
0:06:31 and, given the limited amount of training data, we don't really want to waste that.
0:06:36 We know that we can have a deep neural network which is trained on senones,
0:06:43 and that will be phonetic information;
0:06:47 the beginning of it is acoustic information.
0:06:49 Somewhere in the middle of that network is a transformation,
0:06:53 effectively from the acoustic to the phonetic. We take the
0:06:57 bottleneck features, which
0:06:59 we hope are a compact
0:07:03 representation of the required information.
0:07:06 I'm not sure that's entirely true, because there are plenty of approaches that take
0:07:10 information from
0:07:12 both the centre and the end of the DNN and
0:07:15 seem to work well, especially with fusion.
0:07:18 Anyway,
0:07:19 what we're doing that's just
0:07:20 kind of different is we're using spatial pyramid pooling at
0:07:24 the output of the CNN, and
0:07:29 what this allows us to do, there, is it allows us to take the front-
0:07:33 end information and to span the utterance level,
0:07:38 which
0:07:38 provides us with an
0:07:41 utterance-length-invariant,
0:07:44 fixed-dimension vector at this point.
0:07:50 So it has to deal with arbitrary input sizes. We just take the method;
0:07:53 spatial pyramid pooling is from the paper by Kaiming He,
0:07:56 that's ECCV, computer vision, two thousand and fourteen, and it's designed to solve
0:08:01 the problem of making the feature dimension invariant to the input size. This is a problem
0:08:07 we face often, and it's a problem that certain
0:08:10 areas of image processing also face.
0:08:14 I think what's happening is we've got
0:08:16 a kind of feedback where the speech technology goes into the image processing and
0:08:20 then
0:08:20 comes back to the speech field, and then it cycles around.
0:08:24 So this is really inspired by a bag-of-words approach,
0:08:26 and it comes through
0:08:31 into the spatial pyramid pooling, which uses a power-of-two
0:08:34 stack of max-pooled features.
0:08:37 Okay, so it changes resolution by powers of two,
0:08:41 and we can control quite finely how many features we want at the output of that.
0:08:47 So it's attractive in that it works well; we like it.
0:08:50 The information on that is actually in the paper.
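Roughly, spatial pyramid pooling over the time axis can be sketched as follows; this is a minimal illustration in PyTorch, assuming pyramid levels of 1, 2 and 4 bins as an example of the power-of-two stack, not necessarily the exact configuration used in the paper:

```python
import torch
import torch.nn.functional as F

def spatial_pyramid_pool(feature_map: torch.Tensor, levels=(1, 2, 4)) -> torch.Tensor:
    """Pool a variable-length CNN output (N, C, T) into a fixed-size vector.

    For each pyramid level L the time axis is split into L bins and max-pooled,
    so the result has C * sum(levels) dimensions regardless of the utterance length T.
    """
    pooled = [F.adaptive_max_pool1d(feature_map, bins) for bins in levels]
    return torch.cat([p.flatten(start_dim=1) for p in pooled], dim=1)

# A 3-second and a 30-second utterance give vectors of the same dimension:
short_utt = torch.randn(1, 128, 300)     # roughly 3 s of frames after the CNN
long_utt = torch.randn(1, 128, 3000)     # roughly 30 s of frames after the CNN
assert spatial_pyramid_pool(short_utt).shape == spatial_pyramid_pool(long_utt).shape  # (1, 128 * 7)
```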
0:08:54 So how do we do that? How do we put all this stuff together?
0:08:57 Well, as shown in the diagram on the right here, what we're doing is we're
0:09:01 taking a
0:09:02 six-layer DNN which is trained with large-scale
0:09:07 Switchboard
0:09:09 data,
0:09:10 and we're taking the half of the network up to the bottleneck layer and feeding
0:09:14 that into a system that now is trained for language ID, using LID training data.
0:09:21 And
0:09:23 now, if we take that information and we feed it directly into
0:09:27 a CNN,
0:09:28 given the training data that we were using, well, it will not converge to
0:09:32 anything sensible,
0:09:33 if at all;
0:09:35 it just doesn't work. So what we had to do there was to build
0:09:37 the network,
0:09:38 the CNN, layer by layer.
0:09:41 So the DNN is already trained; that's fixed, that's great.
0:09:43 And then you start to build the CNN by having first one convolutional layer, and
0:09:47 then the second, and then the third; each one takes spatial pyramid pooling and
0:09:52 a fully connected layer at the output,
0:09:55 to give us the direct language labels.
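A rough sketch of that layer-by-layer construction, again in PyTorch-flavoured pseudocode; `train_lid_stage`, the layer widths and the pooling size are assumed placeholders rather than details taken from the paper:

```python
import torch.nn as nn

def build_lid_cnn_greedily(bottleneck_dnn, train_lid_stage,
                           widths=(1024, 512, 128), n_langs=6, in_ch=50 * 21):
    """Grow the LID CNN one convolutional layer at a time on top of a frozen DNN.

    After each new conv layer, a fresh pyramid-pooling + fully connected head is
    attached and trained on the LID data before the next layer is added.
    """
    for p in bottleneck_dnn.parameters():
        p.requires_grad = False                  # the ASR-trained DNN stays fixed

    conv_layers = []
    head = None
    for width in widths:
        conv_layers += [nn.Conv1d(in_ch, width, kernel_size=1), nn.ReLU()]
        in_ch = width
        head = nn.Linear(width * 7, n_langs)     # after a (1, 2, 4)-bin pyramid pool
        # Caller-supplied training step: runs LID training for the current stack.
        train_lid_stage(bottleneck_dnn, nn.Sequential(*conv_layers), head)
    return nn.Sequential(*conv_layers), head
```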
0:09:59 And that works all right. We can see that later when we look at the results
0:10:03 layer by layer,
0:10:05 as to
0:10:05 how
0:10:08 the accuracy improves with
0:10:10 the number of layers and with the size of the layers.
0:10:14 It's quite interesting to see that.
0:10:17 The DNN is pretty standard: it's
0:10:20 forty-eight features, fifteen PLPs with delta and delta-delta,
0:10:24 plus pitch,
0:10:25 and with a context size of
0:10:28 twenty-one frames;
0:10:30 layers of 1024,
0:10:32 1024, 50, 1024, 1024, and 3,020
0:10:35 senones at the output.
0:10:38 And we'll look at the structure of the
0:10:40 CNN in a little while.
0:10:42 It is worth mentioning at this point, because it's a problem,
0:10:46 that we create separate networks for the tasks of thirty-second, ten-second and
0:10:52 three-second data.
0:10:55 I mean, we would like to combine these, and we're trying to,
0:10:58 but they are separately trained
0:10:59 networks at the moment.
0:11:02 Our baselines are a GMM i-vector and a bottleneck-DNN i-vector with LDA and WCCN,
0:11:10 pretty much as we published previously.
0:11:14 So let's look at how this works
0:11:16 and just try to visualise some of these layers.
0:11:19 What we have here is we've got the
0:11:23 post-pooling, fully connected layer
0:11:28 information.
0:11:31 Note this diagram comes from the paper. What we've done is we've taken
0:11:37 these, on test data,
0:11:41 over some utterances, and we've compared different languages, just visually.
0:11:47 So what we've done is just taken thirty-five randomly selected features from that stack,
0:11:53 plotted here for two languages.
0:11:57 Right:
0:11:58 on the left it's Dari,
0:12:00 on the right it's Farsi,
0:12:02 which I'm told are very similar languages.
0:12:06 The top and the bottom are different
0:12:09 segments
0:12:10 from utterances.
0:12:12 So what we're looking at between top and bottom is intra-language difference; what we're looking at
0:12:16 between left and right is inter-language difference. So top and bottom
0:12:21 is intra,
0:12:22 left and right is inter.
0:12:23 So we would hope to see that there is large variability between languages and small variability within
0:12:29 languages, and that's what we get.
0:12:31 It gives us
0:12:32 visual evidence
0:12:33 to think that
0:12:35 these statistics might well be discriminative for languages.
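The comparison just described is straightforward to reproduce once the pooled utterance-level vectors are available; a minimal sketch, where `dari_seg_a`, `farsi_seg_a`, etc. are hypothetical NumPy arrays standing in for those vectors:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Stand-ins for the pooled utterance-level vectors (replace with real features).
dari_seg_a, dari_seg_b, farsi_seg_a, farsi_seg_b = (rng.standard_normal(896) for _ in range(4))
dims = rng.choice(len(dari_seg_a), size=35, replace=False)   # 35 randomly selected features

panels = [(dari_seg_a, "Dari, segment A"), (farsi_seg_a, "Farsi, segment A"),
          (dari_seg_b, "Dari, segment B"), (farsi_seg_b, "Farsi, segment B")]
fig, axes = plt.subplots(2, 2, sharey=True)
for ax, (vec, title) in zip(axes.ravel(), panels):
    ax.bar(range(len(dims)), vec[dims])      # same 35 dimensions in every panel
    ax.set_title(title)
# Within a column (same language, different segments) the bars should look alike;
# across columns (different languages) they should differ.
plt.show()
```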
0:12:40 Just moving along a bit further
0:12:43 down here, what we're getting here is frame-level information,
0:12:48 and we like to call these LID senones; maybe this is not the best terminology,
0:12:54 but,
0:12:55 just to
0:12:56 explain how we get to that sort of conclusion:
0:13:01 if we look at this information, what we're seeing on the right,
0:13:05 and do notice the scales on some of these,
0:13:11 are the LID senones coming out of the system at
0:13:18 frame level, with context, for a piece of
0:13:22 speech,
0:13:25 another piece of speech there,
0:13:27 a transition region between
0:13:29 two parts of speech here,
0:13:32 and a non-speech region just here.
0:13:35 So what we tend to see when we visualise this is we see different
0:13:40 LID senones activating and deactivating
0:13:44 as we go through an utterance or go between utterances,
0:13:48 and we believe that there is language discrimination information in there.
0:13:54 If you look at the scale,
0:13:56 the y-axis scale that's used,
0:13:58 we can see that when there are non-speech regions, around here, we get
0:14:01 all sorts of things activating, but the level,
0:14:05 the amplitude of activation, is quite low.
0:14:08 That gives
0:14:09 evidence to the fact that probably you have something here which is language-specific, at least.
0:14:18 So we also do something called a
0:14:22 hybrid temporal evaluation. So we send thirty-second, ten-second and three-second data into
0:14:26 separate networks;
0:14:28 we train them independently, and we do, well, we don't do quite the same degree
0:14:32 of augmentation as others, but we do try to augment by cutting the thirty-
0:14:37 second speech into ten-second
0:14:38 and three-second regions.
0:14:40 So what we're doing is we're
0:14:41 trying to make up for the fact that the three-second information is woefully
0:14:45 inadequate in terms of statistics, probably,
0:14:48 and hoping that has a lot more effect.
0:14:50 We'll see how that works
0:14:52 in terms of the
0:14:54 performance of each.
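A simple version of that cutting-based augmentation might look like the following sketch; the segment lengths, the 100 frames-per-second assumption and the variable names are illustrative only:

```python
def cut_into_segments(frames, segment_seconds, frames_per_second=100):
    """Cut a long utterance (a sequence of frames) into fixed-length pieces.

    For example, a 30 s utterance yields three 10 s segments or ten 3 s segments,
    which can then be added to the 10 s / 3 s training sets.
    """
    seg_len = int(segment_seconds * frames_per_second)
    return [frames[i:i + seg_len]
            for i in range(0, len(frames) - seg_len + 1, seg_len)]

# Hypothetical usage: augment the short-duration training sets from 30 s material.
# train_10s.extend(cut_into_segments(utt_30s, 10))
# train_3s.extend(cut_into_segments(utt_30s, 3))
```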
0:14:58 Unfortunately we only have data here from
0:15:01 NIST LRE 2009, and from that we only use the six
0:15:05 most confusable languages.
0:15:07 It's a subset; it's a much quicker subset to do analysis on and to run experiments on,
0:15:13 and if you look at our papers over the last few years,
0:15:16 we tend to publish with these six languages first
0:15:20 and then extend later.
0:15:22 It seems worthwhile.
0:15:24 It's about a hundred and fifty hours of
0:15:26 training data, Voice of America radio broadcasts plus CTS, conversational telephone speech,
0:15:31 and we split it up into the three different
0:15:35 durations. We're looking at two baseline systems and our proposed network,
0:15:39 and more
0:15:40 on the fusion of those later;
0:15:42 everybody wants to do fusion
0:15:44 at the end.
0:15:47 So let's look at the ways that this
0:15:52 structure can be adapted, because there are so many different parameters that we could change
0:15:57 in here.
0:15:58 The first thing we wanted to do was look at
0:16:00 the size of the context
0:16:02 at the output of the
0:16:04 DNN layers,
0:16:06 and we're changing N,
0:16:08 if you can make it out, just here,
0:16:12 lower-case n.
0:16:13 So what we're doing is we're
0:16:15 keeping the same
0:16:16 bottleneck
0:16:17 network,
0:16:18 but we're stacking more of the bottleneck features.
0:16:22 And we can see from the results for thirty seconds, ten seconds and three seconds,
0:16:25 in EER,
0:16:27 that the bigger the context, in general, the better the results.
0:16:31 Now bear in mind that we already have some context at the input here;
0:16:37 that's also got context, twenty-one frames to be precise. So we're adding more
0:16:43 context at this point, and we're seeing benefit.
0:16:48 And it turns out that for the ten-second and three-second tasks, a context of
0:16:52 twenty-one,
0:16:53 just here,
0:16:54 tends to work better;
0:16:55 for the
0:16:56 thirty-second task, an even longer context works better, probably because the data is longer.
0:17:02 I think that the
0:17:04 problem is the three-second and ten-second data tends to saturate. I mean,
0:17:09 we just cannot physically get enough information out of that data,
0:17:12 no matter how much context
0:17:14 we introduce.
0:17:17 Moving on a little bit further,
0:17:19 we can also experiment with
0:17:21 how
0:17:22 deep and how wide the CNN is,
0:17:26 and we do that down here with
0:17:31 basically three different experiments, one of which is the LID net with
0:17:36 a 1024-
0:17:40 sized convolutional input layer,
0:17:44 a single layer,
0:17:45 then feeding into the spatial pyramid pooling and the fully connected system.
0:17:50 We train that system up and we get about
0:17:54 nine percent to sixteen percent
0:17:57 performance on the three different scales. If we add another layer, so we have
0:18:01 two convolutional layers, then we bring that down by a reasonable amount for the three seconds,
0:18:06 not quite so much for the thirty seconds,
0:18:09 and we're looking at 128, 256 or 512
0:18:14 sizes on the second layer.
0:18:17 In the CNN's
0:18:18 third layer
0:18:21 we check out 64 and 128, and we can see that basically, with
0:18:24 increasing complexity, the results tend to improve: less so for the thirty seconds, more for the
0:18:29 others.
0:18:31 For the hybrid temporal evaluation, what we're actually doing here is we're using
0:18:35 the thirty-second network
0:18:38 to evaluate thirty-second data,
0:18:40 the ten-second network to evaluate thirty-second and ten-second data,
0:18:43 and the three-second network to evaluate everything.
0:18:46 And the performance,
0:18:49 unsurprisingly, of the three-second network is better for the three-second data, which you can
0:18:54 only use it for; the ten-second network is better for the ten-second data; but for the thirty-second
0:18:59 data,
0:19:03 however,
0:19:04 it's actually better using the ten-second network. So this means that
0:19:08 perhaps these networks themselves are operating at different scales of information, so we fuse
0:19:13 them together to get the results at the bottom,
0:19:16 and we have a slight improvement there.
0:19:20 But you will notice that we can only improve on the baseline system for the
0:19:25 thirty-second result.
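The fusion at the bottom of that table can be read as a simple combination of the three duration-specific networks' outputs. A minimal sketch of score-level fusion, assuming equal weights purely for illustration; the actual fusion method and weights used in the work may differ:

```python
import torch

def fused_language_scores(utterance_features, nets, weights=None):
    """Combine the 30 s, 10 s and 3 s networks by averaging their log-posteriors.

    `nets` is a list of duration-specific models returning language logits;
    `weights` allows a calibrated fusion in place of the equal weighting here.
    """
    if weights is None:
        weights = [1.0 / len(nets)] * len(nets)
    scores = [w * torch.log_softmax(net(utterance_features), dim=-1)
              for w, net in zip(weights, nets)]
    return torch.stack(scores).sum(dim=0)
```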
0:19:29 One more thing before we conclude: the i-vector system uses both zeroth- and first-order statistics,
0:19:36 but this effectively only uses zeroth-order statistics.
0:19:39 So
0:19:41 pretty much our future work will be looking at how we can incorporate more
0:19:45 statistics,
0:19:47 and whether we can build a comprehensive network that uses
0:19:50 all scales and handles all scales simultaneously. So that's it; that's our weird and wonderful
0:19:56 DNN-CNN hybrid. Thank you.
0:20:06 We have time for questions.
0:20:13 So,
0:20:15 thanks very much, that was very interesting, and I've got a question about
0:20:20 the training of the network. So,
0:20:24 as far as I understood, you
0:20:28 did some incremental training, and so once you, once you train a part of
0:20:34 the network and then you extend the network, the parameters of the first part,
0:20:38 they stay fixed; you don't adapt them?
0:20:41 We have that fixed, so we fix that and we build on it. And
0:20:46 again, this is what you get when you ask a student to try different things; I
0:20:50 probably wouldn't have done this myself, but it tends to work
0:20:53 quite well.
0:21:00 So it's fixed, you mean the network is trained and you just change
0:21:07 the last layer,
0:21:08 and then do you retrain the whole system? No, we don't; when we train our system, we
0:21:12 fix it up to the back end and we just train the last layer.
0:21:17 Thank you.
0:21:21 I think we have another question; we've got lots of time. So, you
0:21:25 spoke a lot about the information flow through the neural network. So, if you've
0:21:32 read some of Geoff Hinton's stuff on neural networks, he will, he
0:21:39 will tell you again and again
0:21:41 that there is more information,
0:21:44 in our case, in the speech than in the labels, so
0:21:48 he's advocating for the use of generative models rather than discriminative ones, whereas, as far as
0:21:54 I can see, yours is purely discriminative. So I'd just like to hear any opinion
0:22:00 you have on that matter.
0:22:02 So actually it's interesting that you bring that up, because I was looking at some
0:22:05 of the comments Hinton was making recently, and
0:22:10 he was talking about using,
0:22:12 he was talking about the benefits of having a two-stage process where we have one
0:22:18 front end which is very good at picking out
0:22:20 the most useful data from a large-scale dataset, and then a back end which is
0:22:26 very good at using that, and that these two tasks are complementary, so seldom can we
0:22:32 use one system that excels in both tasks; he believes that both of them can
0:22:36 be trained, and we seem to have done that, but we've done it,
0:22:40 okay, the opposite way around to the way I would have imagined.
0:22:48 Okay, thank you very much for the speech.