0:00:15 Hi, I'm with MIT Lincoln Laboratory, and I'll talk fairly quickly about channel compensation using DNNs and PLDA.
0:00:26 This is a brief overview of our work on multichannel speaker recognition with the Mixer corpora.
0:00:32 The baseline system is an i-vector system trained on telephone data only.
0:00:38 There are two approaches we're looking at. One adapts the PLDA parameters from telephone data to microphone data.
0:00:45 The other approach is to compensate the features coming into the system and retrain the system, so it forms a sort of hybrid system. I'll give results along the way.
0:00:57 So the basic idea is that we have a system trained on Switchboard data, and it works pretty well when the data we test on is also conversational telephone speech. But as is well known, if you try to evaluate microphone trials on that same system, performance falls apart.
0:01:15 Two approaches people have taken to deal with this: one is adaptation of the PLDA, trying to use the microphone data to move the PLDA subspace parameters toward the microphone condition.
0:01:31 We also tried feature enhancement, which is a different process: we use a neural network to do the compensation.
0:01:47 That's actually not new in general. I should mention that for the ASpIRE challenge a lot of people used this technique, and it works very well for speech recognition on that test, which had microphone data as well.
0:01:58 So for the first technique, we're taking i-vectors from a telephone-trained system and adapting to the microphone data. To do that, we take the within-class and across-class covariance parameters used in PLDA scoring, and we adapt those parameters toward the microphone data using relevance MAP, which is just a lambda interpolation.
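The lambda interpolation mentioned here can be sketched in a few lines. This is a minimal numpy illustration under stated assumptions — the function name is hypothetical, and the telephone and microphone covariance estimates are assumed to have been computed from their respective i-vector sets; it is not the actual system's code.

```python
import numpy as np

def adapt_plda_covariances(Sw_tel, Sb_tel, Sw_mic, Sb_mic, lam=0.5):
    """Relevance-MAP-style lambda interpolation of the PLDA within-class
    (Sw) and across-class (Sb) covariance matrices, moving the
    telephone-trained parameters toward estimates made from microphone
    i-vectors.  lam=0 keeps the telephone parameters; lam=1 replaces
    them entirely with the microphone estimates."""
    Sw = (1.0 - lam) * Sw_tel + lam * Sw_mic
    Sb = (1.0 - lam) * Sb_tel + lam * Sb_mic
    return Sw, Sb
```

With lam=0.5 (the value used in the talk), each adapted matrix is simply the average of the telephone and microphone estimates.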
0:02:21 We found some calibration issues with that: we do pretty well at the EER operating point, where we get a nice gain, but for minDCF we don't see much improvement. On the other hand, it is a very simple technique: you don't change your system, you just retrain these two covariance parameters with existing i-vectors, or extract new i-vectors from the microphone data. The system itself doesn't change.
0:02:42 The DNN approach requires more work, in that you have to train the DNN. The DNN is trained on parallel data: it takes noisy data and tries to clean it up, to reconstruct the clean signal given a noisy representation of the same data. That's actually a very robust technique and it works quite well, but it does mean you want to retrain your system with that new front end.
0:03:07 For this work we're using three datasets. One is Switchboard 1 and 2, which we used for training the baseline system; all the i-vector parameters are trained with just that data.
0:03:17 Then there's Mixer 1 and 2, a multi-microphone collection from 2004. It had a clean telephone channel and then eight microphones in the room, so the data is parallel, for about 240 speakers and up to six sessions, I think. It was collected in 2004, and that dataset actually has not been released yet.
0:03:38 And then there's Mixer 6. For that one they did the same type of collection, but for more speakers, in different rooms, and with fourteen microphones, plus the telephone channel.
0:03:48 For the SRE they focused a lot on the interview condition from that collection, where the interviewer is in the room with the interviewee and you have to separate the two. To avoid dealing with that issue, we just took the other portion of the sessions, which is a conversation the person is having over the phone. So it's the same room and collection, but it's conversational data, and that matches the Mixer 1 and 2 style. So these are disjoint collections, Mixer 1 and 2 and Mixer 6.
0:04:15 We use Mixer 1 and 2 for developing the system, either for training the DNN or for adapting our parameters, and Mixer 6 we use for testing, to see how well it works.
0:04:26 Just to give you an idea of what these collections comprise: Mixer 1 and 2 was collected over eight microphones, and Mixer 6 over fourteen. We found it generated a huge dataset if we used all fourteen, so we just selected six of them based on the distance from the speaker. The Mixer 6 collection comes with documentation about where the microphones were positioned, and that's what we used here.
0:04:49 Mixer 1 and 2 was available to us, but we've actually given it to the LDC, and I gather they're planning on making a release if people want to work with this data, so it should probably be available fairly soon.
0:05:02 I should mention we're only evaluating on same-mic trials: in the Mixer 6 condition, the trials always have the target speaker and the non-target speakers on the same mic.
0:05:13 The baseline system is exactly what everybody else is doing with an i-vector system. We start with a UBM trained on Switchboard 1 and 2, extract zeroth- and first-order statistics to create a supervector, and then take the MAP point estimate to get the i-vector — a 600-dimensional i-vector.
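The MAP point estimate mentioned here has a standard closed form, given the zeroth- and first-order Baum–Welch statistics and a trained total-variability matrix. The following numpy sketch illustrates that formula only; the names, shapes, and toy inputs are illustrative assumptions, not the system described in the talk.

```python
import numpy as np

def ivector_point_estimate(N, F, T, Sigma):
    """MAP point estimate of the i-vector w.

    N     : (C,) zeroth-order statistics, one per UBM mixture
    F     : (C*D,) centered first-order statistics, stacked
    T     : (C*D, R) total-variability matrix (R = i-vector dim, 600 in the talk)
    Sigma : (C*D,) diagonal of the UBM covariances

    Closed form:  w = (I + T' Sigma^-1 N T)^-1  T' Sigma^-1 F
    """
    C = N.shape[0]
    D = Sigma.shape[0] // C
    R = T.shape[1]
    Nexp = np.repeat(N, D)               # expand N to one entry per stat dimension
    TtSinv = T.T / Sigma                 # T' Sigma^-1 (broadcast over columns)
    L = np.eye(R) + (TtSinv * Nexp) @ T  # posterior precision of w
    return np.linalg.solve(L, TtSinv @ F)
```

When the zeroth-order stats are zero, the precision reduces to the identity and the estimate is just the projected first-order stats, which makes the formula easy to sanity-check.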
0:05:32 The whitening is done with Switchboard 2 data as well for the DNN system. For the MAP-adapted case, we actually did the whitening using the Mixer 2 microphone data, and then the within-class and across-class covariance matrices are what get adapted in the PLDA adaptation.
0:05:53 So, starting with the baseline results. The first result in the table is on the SRE telephone data; that's the out-of-domain task: the system is trained on Switchboard, and the evaluation data is the SRE and Mixer data, so there's no Mixer data in the training of the system. That's about a 5.7% equal error rate and a 0.62 minDCF.
0:06:16 Then you take that system and evaluate it with the Mixer 6 trials, the microphone trials, and you can see the equal error rate goes up by a factor of two or so, and minDCF really takes a hit as well.
0:06:29 The first number there is the average: just taking the EER for each channel and then averaging. That's kind of unrealistic, because typically you'd have to pick one threshold for everything, so the pooled number, I think, is the more practical metric, and that one is even worse; it takes a bigger hit because of the calibration problem. For the remaining results I'll report the pooled numbers, since I think they're more practical.
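The averaged-versus-pooled distinction can be made concrete with a small sketch. The scores below are hypothetical (not the paper's data): one channel is given a calibration offset, and pooling all trials under a single threshold then yields a worse EER than averaging per-channel EERs, which is exactly the effect described above.

```python
import numpy as np

def eer(tar, non):
    """Approximate equal error rate: sweep thresholds over all observed
    scores and take the best operating point where the larger of the
    miss and false-alarm rates is minimized."""
    best = 1.0
    for t in np.sort(np.concatenate([tar, non])):
        miss = np.mean(tar < t)   # false-reject rate at threshold t
        fa = np.mean(non >= t)    # false-accept rate at threshold t
        best = min(best, max(miss, fa))
    return best

rng = np.random.default_rng(0)
# two hypothetical microphone channels; channel 1's scores are offset by +2,
# i.e. it is miscalibrated relative to channel 0
tar = {0: rng.normal(2, 1, 1000), 1: rng.normal(4, 1, 1000)}
non = {0: rng.normal(0, 1, 1000), 1: rng.normal(2, 1, 1000)}

avg_eer = np.mean([eer(tar[ch], non[ch]) for ch in (0, 1)])
pooled_eer = eer(np.concatenate([tar[0], tar[1]]),
                 np.concatenate([non[0], non[1]]))
# pooled_eer comes out clearly worse than avg_eer, because one threshold
# has to serve both miscalibrated channels at once
```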
0:06:55 First are the MAP-adapted results, and here you can see that minDCF really doesn't improve very much, although you do get a pretty big improvement in EER: it goes down by about 31%. So that part is nice, but you'd really like to see minDCF come down a little as well.
0:07:10 I should mention that for lambda we used 0.5, and the reason is that I did a sweep: you can see there are nice curves at EER, because that's where I get a gain, and 0.5 looks fairly optimal across microphones. The 3-D plot shows, for each microphone, the EERs versus the lambdas we used for the MAP adaptation, and around 0.5 is where we see a sweet spot.
0:07:38 But if you look at minDCF, it doesn't really change very much, and that's where we were seeing the problem with this technique.
0:07:44 So moving on to the enhancement idea: we're training a neural network to try to reconstruct a clean signal given a noisy version of it. We have the person talking into the telephone, and the telephone is our clean version; we also have microphones around the room collecting the microphone-corrupted versions.
0:08:02 We just train it as a regression; it's a very simple thing. We have a windowed set of feature vectors coming into the DNN and the same clean vector it's trying to reconstruct, and we just train over all the samples.
0:08:13 One key thing, and I think this is important, is that we include the clean samples as well: we'd really like this neural network not to change the clean data, but also to improve the noisy data to make it look more like the clean data.
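The data pairing just described — windowed noisy inputs regressed onto the clean telephone frame, with clean-to-clean pairs included so the network learns to pass clean data through unchanged — might be sketched as follows. This is a hypothetical numpy illustration of the pairing only, not the actual training code; the 21-frame window matches the context described later in the talk.

```python
import numpy as np

def make_training_pairs(clean, noisy_by_mic, context=10):
    """Build (input, target) pairs for the denoising regression.

    clean        : (T, 40) MFCC+delta matrix from the telephone channel
    noisy_by_mic : list of parallel (T, 40) matrices from the room mics
    context      : frames of context on each side (10 -> 21-frame window)

    Inputs are flattened 21-frame windows; the target is always the
    clean center frame.  The clean channel itself is also fed in as an
    input, so the network sees clean -> clean pairs."""
    X, Y = [], []
    for feats in list(noisy_by_mic) + [clean]:  # noisy mics, then clean->clean
        for t in range(context, feats.shape[0] - context):
            X.append(feats[t - context:t + context + 1].ravel())
            Y.append(clean[t])
    return np.array(X), np.array(Y)
```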
0:08:27 Just to give you some idea of how this data was collected: the LDC did these parallel collections, a couple of rounds in maybe one or two rooms, which is not really a lot of rooms. This is how it works: people come in and sit down, they have the microphones around them, and they have all the equipment running.
0:08:43 One of the problems is that if you realize later that you want one more microphone, it's really hard to come back and collect more data. So what people really do, especially on the ASR side, is generate synthetic parallel datasets, using room impulse responses available online and point noise sources, and just generate tons of parallel data.
0:09:02 We've actually been working on that more recently; there's another paper at Interspeech on that, and it works quite well too. I think long term that's the way we want to do it, but we had this corpus available and wanted to start with it for this work.
0:09:17 So this is the hybrid system, where you have that channel-compensating neural network on the front, and then you have the i-vector system, the baseline from before. We just retrain the pipeline: after we train the denoising neural network, we retrain the i-vector system on the Switchboard data.
0:09:35 For the DNN system we use all the Mixer 2 data for training, of course. We use 40 MFCCs, and that's the dimensionality of the output of the neural net; we're trying to reconstruct 40 MFCCs, and that includes 20 deltas, which may seem kind of counterintuitive, but it was actually important to include the delta coefficients. We use a five-layer neural network with 2048 nodes per layer and a 21-frame input context, mainly because that's what we used for bottleneck features before; we just adapted that system to this problem.
0:10:08 And then we have the one clean channel and the eight noisy ones coming in. You can see we get a pretty big gain in minDCF: almost a 30% gain, which is a nice result, and a 50% gain in EER. So this is really doing what we were hoping: getting an improvement at minDCF as well as at EER.
0:10:29 So that was a nice gain. I should mention we tried a number of different things. Initially we tried to see if we could do this with log mel-frequency filter banks; some of the work that's been done on the enhancement side tries to enhance the filter banks, and you can then synthesize cepstra from the cleaned-up filter banks. But what we found is that the deltas were actually important, so going to MFCCs plus deltas gave us a bigger gain than using filter banks.
0:10:59 It's also critical, and other people have mentioned this as well, that you do some type of mean and variance normalization on the data before training the neural net, just to get it to converge.
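The normalization step is straightforward; here is a minimal sketch. Global normalization over a feature matrix is assumed (the talk doesn't specify whether it was per utterance or global), and inverting the normalization on the network output is one plausible way to recover cepstra, also an assumption here.

```python
import numpy as np

def cmvn(feats, eps=1e-8):
    """Mean/variance normalization of a (T, D) feature matrix, applied
    before training the denoising net to help it converge.  Returns the
    normalized features plus the statistics needed to undo it."""
    mu = feats.mean(axis=0)
    sigma = feats.std(axis=0)
    return (feats - mu) / (sigma + eps), mu, sigma

def un_cmvn(norm, mu, sigma, eps=1e-8):
    """Invert the normalization, e.g. on the network's output."""
    return norm * (sigma + eps) + mu
```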
0:11:09 We also found the architecture had a pretty big impact. I'm reporting results on the 2048-node DNN; you can see we take a bit of a hit if we go down to 1024 nodes, and with an even smaller network we take a bigger hit.
0:11:24 But honestly, the 2048-node DNN takes a long time to train; it took weeks to train that one, and that's maybe our fault, because we don't have a parallel training mechanism. That was a problem.
0:11:37 It's worth seeing what the telephone performance is: you don't just want a system that is robust to microphone data; it should also work well on telephone data. And this was actually kind of a nice surprise: with the 2048-node DNN we get a small relative gain on the telephone task as well.
0:11:50 The MAP-adapted PLDA, on the other hand, falls apart when you apply it to telephone data: you've moved all those parameters toward the microphone set, so they no longer match the telephone data. So there's a trade-off there.
0:12:04 So to summarize: we see a nice gain using this DNN channel compensation technique, and it wasn't a loss on the telephone data, so you don't need to do any kind of channel detection to switch back and forth. The MAP-adapted PLDA unfortunately hasn't worked well for us so far: it does give a nice EER, but the minDCF doesn't really change very much. It is really easy to implement, though: if you have an existing i-vector system, you just run it on the new data to retrain the parameters.
0:12:32 The other issue is that we've been using real parallel data for this, which is not really very practical, so synthetic parallel corpora make a lot of sense.
0:12:40 And lastly, I think it's worth looking into using recurrent networks. We've been doing a lot with feedforward networks, and the big input context helps with that, but I think RNNs may be the way to go.
0:12:56 [inaudible]
0:13:02 [inaudible question from the audience]
0:13:37 Q: Thinking about the size of the input window — you used twenty-one frames. Do you have some ideas about that? Do you think that for channel compensation, for example, you need a longer window? And would what you were doing work only for speaker recognition, or...?
0:13:58 A: I would really recommend looking at the ASpIRE papers — I think it was from maybe ASRU, I'm not sure; one of the speech recognition workshops. There is one that actually separately trained the denoising network; I think they worked with the FFT outputs, the power spectrum, and they had a really long window, something like three hundred frames, something huge like that, and they trained a giant network.
0:14:28 They had very impressive results, and I've been meaning to see if I can recreate that, but it would take me forever to train. So I think we want a faster training algorithm first, but I would encourage looking at those results, and in particular at the other ASpIRE systems.
0:14:42 They did, I think, a nice comparison of doing joint training of the whole system versus the way I was doing it, where you do multi-style — sorry, multi-condition — training with a whole bunch of data where your targets are always the clean signals. Some people tried to decouple it, so the ASR system was trained independently, and then they trained the denoising network and just used those features.
0:15:03 One issue I haven't addressed here is the idea of not retraining the i-vector system: could you actually do okay if the features were coming from the denoising network but you were still using the same i-vector system? I wasn't worried about that right now, but I think it's worth testing.
0:15:31 Q: Before I start, could you go back to one of your earlier slides, where you were highlighting the different microphones between Mixer 2 and Mixer 6?
0:15:41 A: Yes, sure.
0:15:43 Q: So I was looking at Mixer 1 and 2, and I'm a little concerned: I guess channel number five is the cell-phone one, and there's also the ear-bud one. Am I right that you used channels five and six from Mixer 1 and 2?
0:16:04 A: For Mixer 1 and 2 we used all the data, all of it.
0:16:15 Q: Then I'm thinking that when you have two mics that are actually configured right around the ear like that, I imagine you're going to have some interference between the two. So maybe — I don't know, it's an honest question: did you check that?
0:16:34 Q: Okay, so the main question I was going to ask is this: when you're looking at the MAP adaptation and the denoising enhancement piece, and you look across the different mics, going from one mic to another, some mics are closer in terms of their characteristics than others. Did you see any benefit in moving from one to the other? I guess what we're asking is whether you could subset a set of unique mics.
0:17:02 A: Right, and we haven't; that's a really good question. I think, moving forward, real data is kind of nice because you can reality-check against it, but moving toward the synthetic data you can get to very different room conditions; this corpus was collected in exactly two rooms, so it's not very diverse.
0:17:22 Q: I'm just thinking about the geometry of all the mics; you could kind of look at your solutions to see whether, going from one mic to another, when they're closer, one solution does better than another.
0:17:34 A: That's actually an analysis we could try to do: we could try to see which features look closer across the parallel data sets.
0:17:40 Q: I was basically thinking about asking you to burn more compute cycles on it. It's a nice question.
0:17:55 A: That's a good point. I couldn't find placement information for Mixer 1 and 2; it probably exists somewhere, but I ran out of luck trying to find it. Mixer 6 has a lot of information.
0:18:09 Mixer 1 and 2 was collected at three locations, I think; I'd have to ask the LDC, but I think there were three. And Mixer 6, I think, was two. I believe that's right.
0:18:25 [inaudible]
0:18:34 Q: A question on the denoising network. When we applied that kind of thing, we found it was important to apply a VAD first and then train the network, because if we send the silence frames to it, it gets confused by the low-energy values, which are basically just zeros, and that propagates through the rest of the network. What did the network do with the silence?
0:18:58 A: That's actually a good point. We ran a VAD; we limited it to the speech marks, that's right. I think we might have run the VAD on the clean channel for training and applied it to the other ones for decoding; we always ran it on whatever the data was. We tried to optimize for a realistic condition, but for training I think we might have done the VAD on the telephone data, which matched our system, and then used those speech marks across the channels.
0:19:32 Any more questions?
0:19:35 Okay, let's thank the speaker.