0:00:32C nine i can't wait to present
0:00:41the first is
0:00:55and the for this call
0:00:57i was
0:00:59all right
0:01:01application speaker
0:01:03maybe may not be as well this work is a well known
0:01:07you may not know whether you want to pay attention to this thing so let
0:01:09me just summarise it
0:01:11oh gee by
0:01:13proof speaker
0:01:17one of the interested in that have to pay attention
0:01:22okay so let me
0:01:26in this
0:01:33speech processing
0:01:42five channel
0:01:50some knowledge about the
0:01:51kinds of characteristics present in the acoustic signal could be helpful for improving performance of such
0:02:00recognition systems in our case we are interested in speaker recognition systems
0:02:04and the svm already before that using such information speech
0:02:08we also very informational side information in the fusion calibration and how well
0:02:15most of the recent approaches
0:02:18like what people do for nist evaluations usually
0:02:24use independent detectors for various
0:02:28information various detectors various
0:02:31estimators like estimators of signal to noise ratio reverberation detectors of language and
0:02:38so on
0:02:39and what we propose in this work is to detect
0:02:43this acoustic condition directly from i-vectors everybody has i-vectors nowadays and we try to show
0:02:48that you can just take the i-vectors and
0:02:51detect all kinds of different
0:02:54acoustic condition information or nuisance conditions or whatever you call it from the i-vector and
0:03:03then use this information in quite a simple way
0:03:08fusion speaker
0:03:11just some
0:03:14maybe the
0:03:15most important previous work on making use of
0:03:19acoustic condition detection
0:03:22in the past you may still remember feature mapping where features are compensated based on
0:03:28detecting the acoustic condition or
0:03:30more specifically the channel of the signal and then
0:03:35the detectors for the channels were just based on
0:03:39channel specific gaussian mixture models but then we thought that we
0:03:43don't need to detect the channel condition or acoustic conditions anymore because we got these
0:03:48wonderful things joint factor analysis and i-vectors and we thought that you don't have
0:03:53to really explicitly detect the condition that the channel compensation scheme will account for this
0:04:00variability that is intersession variability directly but again we saw that using some side information
0:04:07in calibration or fusion was actually helping even though we are compensating using
0:04:14subspace methods
0:04:19again the side information that we have been using as i already mentioned was
0:04:24extracted by language identification systems and also signal-to-noise ratio estimators and we collect all kinds
0:04:31of information about the signal like that and we try to make use of it
0:04:35to improve the speaker identification system and there are several different ways
0:04:42of using such information so they probably because this thing was
0:04:46i just a bill
0:04:48evaluation condition specific fusion or calibration so
0:04:53you would train a different calibration for each specific condition like a specific duration or
0:04:58english only trials
0:05:02or trials spoken in different languages
0:05:04or something that niko actually started with and
0:05:09i think introduced in his focal toolkit was bilinear score
0:05:15and side information combination where there was a bilinear form modelling the
0:05:21interaction between the scores
0:05:24and the side information itself
0:05:30all speech so i mean
0:05:31mention when using
0:05:36information about try itself rather than
0:05:40the in digital recording integral
0:05:44segmental speech so that was
0:05:46this side information will be is this child actually
0:05:51could be different recordings in the trial come from different languages or other people's of
0:05:56these short duration of all these long duration something that we have tried
0:06:00in two thousand and politicians was finally to get side information from individual segments and
0:06:08combine these side information from individual segments in certain way
0:06:12to improve again calibration fusion this is something that people i'm using but we would
0:06:17be a four in the morning this
0:06:20in this talk
0:06:21and let me maybe
0:06:24splendid more closely
0:06:28so what is our approach to detecting
0:06:33these acoustic conditions
0:06:35as i said we are going to use
0:06:38i-vectors as an input to a classifier to detect a predefined set of various audio characteristics
0:06:45in our case we use just a simple linear gaussian classifier
0:06:49which is a similar thing to what people are using for i-vector based language identification
0:06:59the way we are going to represent the audio characteristics of the signal would be
0:07:03just a vector of posterior probabilities of these individual
0:07:08classes and we would like to show that we can use this vector of posteriors
0:07:12as side information for
0:07:15fusion and calibration of a speaker recognition system and get quite substantial
0:07:21improvement in performance
0:07:23and in this work we are actually using exactly the same i-vectors for both
0:07:28characterization of the audio segment and speaker recognition and
0:07:33the justification for this
0:07:37the reasoning that we have is that
0:07:39the nuisance characteristics included in an i-vector itself affect speaker id performance so if we
0:07:45can detect those characteristics from the i-vector itself those should be the ones most important for
0:07:53improving speaker recognition performance for
0:07:56compensating for the effects that are still there in i-vectors
0:08:03before i get into more details on what exactly we do
0:08:09let me introduce the development and evaluation sets used in this work which
0:08:14should give you some idea what kind of variability and what kind of conditions we have
0:08:21so the data that we have used is the prism evaluation set which is something that
0:08:26we have presented
0:08:27i during the last nist
0:08:32decided that you have
0:08:34collected and pretty well collected by the database was something that we therefore
0:08:39best project and we see this comparison of the design for actress
0:08:45many datasets so it would be data that comes from fisher sre evaluations
0:08:52and switchboard so these are basically the data that everybody uses for
0:08:56training systems for nist evaluations but we tried to build an evaluation set that accounts
0:09:02for different kinds of variabilities
0:09:04so there was a huge investment in normalizing all the metadata
0:09:09information of all the files and we tried to include as many trials and create
0:09:14as many trials as we could
0:09:16we tried to create trials for specific types of variability so we defined lots
0:09:20of evaluation conditions for the specific types of variability
0:09:24like different speech styles different vocal effort language variability usually the
0:09:31conditions isolate a specific type of variability we didn't try to
0:09:35really mix different types
0:09:38the different
0:09:39and so on
0:09:40right now results and see what is the different types of variability courses
0:09:46what degradation
0:09:48it's and i never explicitly always tried to break more trials compared to what has
0:09:55been defined for installations
0:09:58we also tried to introduce new types of variability in this set specifically
0:10:04noise variability and reverberation so we artificially added noise and reverberation to the data
0:10:11a few more details on this on the next slide
0:10:14they should be also duration for condition
0:10:18for each mixing
0:10:22the prism set consists of two parts that elements to be better at the
0:10:27moment we have a one thousand speakers around thirty K thirty thousand audio files
0:10:34for more than seven
0:10:36seventy million five so that was so this is actually relations that
0:10:41and then you have a tree
0:10:44sixty thousand speakers
0:10:48comes from
0:10:49a hundred thousand
0:10:55designing this task just to give you some idea wasn't really as easy as
0:10:59saying that we use fisher switchboard and sre four for training and the rest
0:11:05for testing we really had to pay attention to distribute the data from different
0:11:12sets between training and testing so for example to get some
0:11:16language portability we have used for evolution data from five
0:11:20the same are trained for different
0:11:27oh yeah
0:11:28for all i
0:11:30we try to use them for
0:11:31eventually for testing but at the same time we wanted to cover some of the
0:11:35channels that are some of the microphones and i two thousand they don't they also
0:11:40training also so
0:11:43they be related to
0:11:46pay attention
0:11:48splitting the database is very like this
0:11:52last number of trials
0:11:55you see
0:11:56straight patients
0:11:58i will just try to quickly summarise the procedure for designing the
0:12:03noisy and reverberant data what exactly we did so that
0:12:09we define the way how to add the noise and
0:12:14we tried to use open source tools and the noises that we added to
0:12:19the data so that if other people are interested in adding new
0:12:24noises it should be straightforward to just follow the recipe that we
0:12:28have designed and add new types of noises and also new reverberations
0:12:34so i mean
0:12:36just the in the blue box it pretty much summarizes the additive noise but also
0:12:41use this file for
0:12:45just adding the
0:12:48noise to the data at a specific
0:12:50signal to noise ratio and we have used different noises
0:12:54of course we used different kinds of noises for
0:13:00training data for enrollment segments and for test segments and tried to
0:13:06make sure that we never train and test on the same noise
0:13:10not even noise taken from the same
0:13:13file or not even at exactly the same time
0:13:16so if i say that was a cocktail party
0:13:19i noise to be noise from restaurant noise from our for so different kind of
0:13:25pop noises and make sure that really makes this
0:13:28very similar
0:13:32and we have added the noise to the data at different snrs specifically
0:13:37twenty fifteen and eight db
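The noise-adding step described here can be sketched as follows. This is a minimal illustration and not the tool used in the talk: the function name `add_noise_at_snr` is hypothetical, the signals are plain NumPy arrays, and power is measured over the whole signal (the actual corpus recipe may measure speech level differently).

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power equals snr_db, then add it."""
    noise = np.resize(noise, speech.shape)      # repeat or truncate noise to the speech length
    p_speech = np.mean(speech ** 2)             # average speech power
    p_noise = np.mean(noise ** 2)               # average noise power before scaling
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise
```

Calling it with 20, 15 and 8 dB on the same clean file would give three noisy versions of that file, one per noise level.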
0:13:41the noise was actually added to clean data for these data are wrong
0:13:47thousand and four
0:13:52the data should be
0:13:54right before
0:13:59similarly we defined a reverberant subset of the data which again uses a
0:14:05tool which is open source
0:14:06for simulating impulse responses of a rectangular room and then added
0:14:14reverberation at different
0:14:17reverberation times to the data and again paid attention not to use the same reverberation for
0:14:25training and test data
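Adding reverberation amounts to convolving the clean signal with a room impulse response. The sketch below does not simulate a rectangular room as in the talk. As a stand-in it uses exponentially decaying noise with a chosen reverberation time, which is a simpler, commonly used approximation. The names `synthetic_rir` and `reverberate` and the 8 kHz sample rate are assumptions for illustration only.

```python
import numpy as np

def synthetic_rir(rt60, sample_rate=8000, seed=0):
    """Decaying-noise impulse response whose envelope drops 60 dB after rt60 seconds."""
    rng = np.random.default_rng(seed)
    n = int(rt60 * sample_rate)
    t = np.arange(n) / sample_rate
    envelope = np.exp(-np.log(1000.0) * t / rt60)   # amplitude factor 10^-3 at t = rt60
    rir = rng.normal(size=n) * envelope
    rir[0] = 1.0                                     # direct path
    return rir / np.sqrt(np.sum(rir ** 2))           # normalise to unit energy

def reverberate(speech, rir):
    """Convolve speech with the impulse response and keep the original length."""
    return np.convolve(speech, rir)[: len(speech)]
```

Using a different random seed (or a different simulated room) for training and test data keeps the two sets from sharing the same impulse response.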
0:14:33okay so we get the time how to the sounds like not be
0:14:39and you more details on be characterization six
0:14:42so the system itself
0:14:44is based as i said on i-vectors the i-vector extractor
0:14:47is pretty much the standard i-vector extractor that everybody is using nowadays a ubm based on
0:14:54gaussian train and that's
0:14:57you mean and variance
0:14:59a extract actually si substrate exactly the same as for speaker identification string only one
0:15:07speech frames are trained on
0:15:12but it's quite possible that for detecting
0:15:15it is
0:15:16you just features like may be applied
0:15:27the six hundred dimensional i-vectors are extracted from the total variability space that
0:15:33is assumed
0:15:35or expected to contain information about both speaker
0:15:39and acoustic conditions in this case we didn't do any of the eight
0:15:45for the speaker information so use the i-vector
0:15:52channel and the length normalization for this
0:16:01so as a classifier we use a linear gaussian classifier trained on these i-vectors about
0:16:09what is trained for classify a
0:16:14conditions is that i'm going to show on the next slide and a final diarization
0:16:21characterization present this paper
0:16:23posterior probabilities of the specified classes so in fact
0:16:27obtaining this vector is
0:16:28as simple as a nonlinear function of the i-vector and
0:16:32this is just a simple
0:16:34affine transformation of the i-vector followed by a softmax function
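A linear gaussian classifier of this kind can be sketched as below: class means plus one covariance shared by all classes, which makes the class posteriors an affine transform of the input followed by a softmax. This is a minimal sketch, assuming maximum-likelihood estimates and class priors taken from the training counts. The function names and the small ridge term are illustrative choices, not details from the talk.

```python
import numpy as np

def train_linear_gaussian(X, y, n_classes):
    """Fit class means and one shared covariance; return the affine parameters (W, b)."""
    dim = X.shape[1]
    means = np.stack([X[y == k].mean(axis=0) for k in range(n_classes)])
    centered = X - means[y]                       # remove each sample's own class mean
    cov = centered.T @ centered / len(X)          # shared within-class covariance
    cov += 1e-6 * np.eye(dim)                     # small ridge for numerical stability
    priors = np.bincount(y, minlength=n_classes) / len(y)
    W = np.linalg.solve(cov, means.T)             # column k is cov^{-1} mu_k
    b = -0.5 * np.sum(means.T * W, axis=0) + np.log(priors)
    return W, b

def condition_posteriors(X, W, b):
    """Affine transform of the i-vectors followed by a softmax over classes."""
    logits = X @ W + b
    logits -= logits.max(axis=1, keepdims=True)   # avoid overflow in exp
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)
```

Each row returned by `condition_posteriors` is the vector of class posteriors that serves as side information for one recording.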
0:16:39take this
0:16:43this summarizes the whole system works
0:16:48and as well as you can see that is
0:16:54such as
0:16:54is that
0:16:56as you can see
0:16:57the same training data variance
0:17:02train a ubm training the subspace matrix and also
0:17:18based on this actually
0:17:21our system for
0:17:23so we try to distinguish between
0:17:26telephone data
0:17:27microphone data
0:17:29noisy data in this case
0:17:33those kinds of noisy data are actually noise added to the originally
0:17:38clean microphone data and we distinguish three different conditions which are noise at eight db
0:17:44fifteen db and twenty db snr
0:17:46and the reverberant conditions where we define the condition according to reverberation time
0:17:54three five
0:17:55zero point
0:17:58you can see how much data used for training data
0:18:06right and as you can see there is always the same number of training
0:18:10and test files for
0:18:14because those are actually
0:18:15the same files
0:18:17with noise added at different levels
0:18:23the way to those classes because we just the vector posture only use of those
0:18:28classes defined
0:18:29we assume that the classes are mutually exclusive
0:18:33which is exactly the case
0:18:35with our evaluation data because this is exactly how our evaluation set was designed we
0:18:42never have reverberation and noise in the same recording
0:18:47but of course this is unrealistic
0:18:49in relatively you can
0:18:52reverberation of the army
0:18:55background i
0:18:58still be viewed that using this paper would be useful for such conditions because this
0:19:05all the vectors of posteriors can account for
0:19:07or something just conditions in the data also
0:19:21this is
0:19:22i animation that they
0:19:26where if you have reported that comes from my
0:19:29comes from
0:19:30all then we do this estimation you probably get all through that somehow reflects the
0:19:35that i
0:19:36there is
0:19:38stands for my
0:19:39yeah probably
0:19:45what how much
0:19:51but of course we can go for a more principled way we can even build
0:19:55independent classifiers for these independent types of characteristics
0:20:02classifiers for speech and noise for the kind of reverberation and level of reverberation but it would still
0:20:07have to be trained on data which contains a mix of
0:20:13such conditions
0:20:18table summarizes
0:20:20what performance we obtained in terms of detecting these conditions
0:20:25so the table shows the true classes and the detected classes
0:20:34and i know that i'm supposed to be pressing enter
0:20:40so they
0:20:42if we had a perfect classification we should see the number hundred on the diagonal and
0:20:49this is just a confusion matrix normalized in such a way that you
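A confusion matrix with a hundred on the diagonal for perfect classification can be computed as in this sketch. The transcript does not say which way the table is normalized, so the code assumes each row (true class) sums to 100 percent. The function name `confusion_percent` is hypothetical.

```python
import numpy as np

def confusion_percent(true_labels, detected_labels, n_classes):
    """Confusion matrix in percent: rows are true classes, columns are detected classes."""
    counts = np.zeros((n_classes, n_classes))
    np.add.at(counts, (true_labels, detected_labels), 1)   # count each (true, detected) pair
    row_totals = counts.sum(axis=1, keepdims=True)
    return 100.0 * counts / np.maximum(row_totals, 1)      # guard against empty classes
```

With this convention an off-diagonal entry reads as the percentage of recordings from the row's class that were detected as the column's class.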
0:21:00i can see that this didn't really have a what
0:21:03what is
0:21:05what we were pleased with was that we could actually see that at least
0:21:11recognition of microphone and telephone data is almost
0:21:15right and here for microphone we get some confusion
0:21:20here with
0:21:22twenty db
0:21:24noisy data and as i told you we actually created the noisy data by taking
0:21:29these microphone data and adding noise to
0:21:32them at exactly this twenty db snr and if you listen to those it is clear that some
0:21:37of the data actually contained some noise so it's quite natural that some of the
0:21:41data from the clean microphone
0:21:44are detected as twenty db which is not
0:21:47kind of like
0:21:49in a year
0:21:51also if you look at the different noise levels we see quite reasonable
0:21:58performance
0:22:00reasonably large numbers on the diagonal again like nicely twenty db recognised but there
0:22:05is some confusion
0:22:07but this is again something that would be expected especially the snr levels
0:22:11which are close to each other should be actually
0:22:16one thing that we have actually seen was that most of the confusion comes
0:22:21with some types of noise that don't really affect the i-vector much i think this
0:22:25type of noise resulted in almost
0:22:28exactly the same i-vector and something that was also naturalness you get
0:22:35where we don't do very well are
0:22:41these conditions where we try to detect the reverberation time
0:22:46and we see that those detections are actually all
0:22:51over the place they are really confused
0:22:54for noisy at a party
0:22:58the main reason that we believe this thing is happening is that defining
0:23:02conditions by reverberation time is not actually a good thing to do because
0:23:09if you played the reverberated recordings then you could actually hear that
0:23:15one type of reverberation at one reverberation time was much more similar perceptually
0:23:20to another recording which had a completely different reverberation time so the reverberation time is probably
0:23:25not the right criterion but as we will see using these data for improving speaker recognition
0:23:32actually improves the speaker recognition performance so it actually looks like the classification itself does
0:23:38a good job in terms of
0:23:39putting things into the right classes
0:23:43which allows us to degrade
0:23:46so finally how to use this information about acoustic condition for the calibration for improving
0:23:55the speaker recognition system
0:23:57so we use this approach that niko actually proposed
0:24:03when we did the A B C system for nist sre two thousand and
0:24:09you also and i believe this is the thing that is implemented in
0:24:12in both source to that
0:24:14three available
0:24:16so the idea is to be just one calibration if we review this and tara
0:24:24people obtain standard linear combination where you know a some bias and some multiply the
0:24:29experiments with
0:24:31switching from be touches
0:24:33oh nickel
0:24:35and that's of the device is the
0:24:39wavelet multiply the scores
0:24:41but you can see that we actually add some bias term which is just a bilinear form
0:24:46between the vector of posteriors from the first and the second
0:24:50segment that are in the
0:24:54trial and some matrix so this bilinear form is
0:24:59vectors in there
0:25:04just bias and this is that the final score this is to go next
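The calibration with side information described here can be written as a standard linear calibration plus a bilinear term between the two posterior vectors of a trial. The sketch below only evaluates the formula. The names `scale`, `offset` and `Q` are illustrative, and in practice such parameters would typically be trained by logistic regression on development trials, which is not shown.

```python
import numpy as np

def calibrate(score, q_enroll, q_test, scale, offset, Q):
    """Calibrated score: scale * s + offset + q_enroll^T Q q_test.

    q_enroll, q_test: condition posterior vectors of the two segments in the trial.
    Q: square matrix of condition interactions (a trained parameter in a real system).
    """
    bilinear = q_enroll @ Q @ q_test      # condition-dependent bias term
    return scale * score + offset + bilinear
```

Choosing `Q` symmetric makes the calibrated score independent of which segment is treated as enrollment and which as test. Setting `Q` to zero recovers the plain linear calibration.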
0:25:12so we just mentioned before
0:25:18the same
0:25:24you're just list conditions
0:25:26running times and let me just
0:25:29say briefly that we are presenting results on one list of conditions that a subset
0:25:34of all conditions summarising and these are these
0:25:38we know that no problem of that a lot of entry
0:25:44microphone different vocal effort different languages in the recordings different noises
0:25:52in the recording room reverberation and the system that we use for speaker id
0:25:58used exactly this
0:25:59as i say
0:26:01once invited me
0:26:03right normalization lda are used as a
0:26:07train the presence of and
0:26:10this slide just two
0:26:12results and you can see that you are actually nice
0:26:16on so these are
0:26:18once in principle
0:26:19the dcf and eer
0:26:23can see so maybe less
0:26:25we just are relevant for
0:26:32from this condition which are actually the condition of the conversation but recorded over a
0:26:38microphone somewhere our prediction actually does very job
0:26:43oh surprisingly get some improvements in
0:26:46also used it all comes
0:26:49from the single condition that you have a probably again can do some two one
0:26:53detecting voice is that you have a does a good job on detecting on the
0:26:58noise condition
0:27:00from different
0:27:01noise levels
0:27:03thus reasonable job on room reverberation actually in
0:27:10so only conditions but they are quite
0:27:13anyway for speaker identification proposed we don't get any problem at all the that comes
0:27:19from just one is to be don't have conditioned and then tell us to improve
0:27:24the thing we do not get improvements for language and a common words that
0:27:31again we didn't really have condition
0:27:35and the next slide actually just showing the same thing that we also still pretty
0:27:40much the same gains you can be fused with
0:27:43system cepstral prosody just
0:27:45so this is just to say
0:27:47suspecting summarizes
0:27:50the conclusions are summarized in practice
0:28:14so well what is that it doesn't classify people are
0:28:18issues that you have in training and test
0:28:22actually is if i say i reverberation is rubber reverberation time domain is different for
0:28:28reverberation but they come from not only are artistry of the actual five
0:28:33and he defined the reverberation time just cost is
0:28:37what we have seen is that if you listen to the recordings in test you can
0:28:43find two recordings that sound similar perceptually but they come from different classes so
0:28:49the way we defined the classes probably wasn't the right thing
0:28:53how the classes are defined is probably not correct there would be a more
0:28:57natural clustering that would account for the type of reverberation i
0:29:03mean you and regression that you if you flat there is nothing for about that
0:29:07kind of you late reverberation order reverberation that spread over all the time
0:29:12which will affect the speech and
0:29:16what else
0:29:18would be in our case may be considered to be come from the same part
0:29:21so then you probably get some classes which are related to speaker recognition performance and
0:29:27it helps at the end in the in the
0:29:29and the speaker recognition performance even for that
0:29:32the classification that would but the classification is not because we define