0:00:14 but the way as in addition on a big the or at the moment variable
0:00:19 and together energy profiles for this assumption
0:00:23 please call per week the nist model or something
0:00:29 presentational by a discussion of speech data or regression
0:00:34 then twenty thousand eleven a short almost is ugly but also a unity it will
0:00:41 be really didn't you feel
0:00:44 and that in addition to
0:00:45 no two thousand three
0:00:48 in order to discuss different all come from each at least one but you
0:00:53 estimate of the system finally
0:00:56 so in this work we will be
0:01:00 the baseline system and the other site by d
0:01:05 but is if only a little training
0:01:09 and then vol challenge in
0:01:11 one seventeen leading system and i just
0:01:15 so we will be
0:01:17 formulated as stand alone and we consider a
0:01:21 and are the natural just you know speech
0:01:26 and then otherwise you will be system which is there
0:01:29 is a story or something what are you hungry if it is only genuine coming
0:01:34 as an actual speech
0:01:36 comparison
0:01:39 in this will be a excluding concentrating on the speech production
0:01:46 so that there is a little or lossless wealthy and single where they not be
0:01:51 honest you know
0:01:53 a little variations the
0:01:56 has basically three aspects lately ready one
0:02:00 environment according to one
0:02:02 so when we also for smart phones five or not using my
0:02:08 i quality that all speaker b
0:02:10 one and conical
0:02:13 the different
0:02:14 experiments i well for these recordings which is
0:02:19 well only is also one in which all
0:02:23 and this is a
0:02:27 no
0:02:28 we will discuss the modelling all a list of
0:02:31 it in the list movies model you one is a new one next even that's
0:02:36 a strange
0:02:37 during this whole signal s b is not an estimation of the impulse response
0:02:43 of the
0:02:45 the recording device
0:02:47 microphone
0:02:48 i mean what
0:02:49 plus the known model
0:02:51 you don't
0:02:52 so this imposes was a copy of this series convolutional we build upon so on
0:03:00 then the recording device and as a speakers characteristics
0:03:04 that is multimedia speakers and the and
0:03:08 so relating the blissful means also in signal characteristics differences of iterations models
0:03:16 there were involved only in the
0:03:18 jane speech
0:03:21 to his audibly distorted data so that sources speech you can see that
0:03:27 this is a
0:03:29 actually lower than the worst
0:03:31 this is an actual speech and e g
0:03:35 or
0:03:37 so then what went on according will only
0:03:40 especially
0:03:45 and then only isn't getting or
0:03:51 or a roller or
0:03:54 and that was one instead of just one here is the presentation
0:04:00 so one of the important characteristic that is here is that it should also gonna
0:04:05 some
0:04:05 in the differences between genuine and it
0:04:11 speech
0:04:11 and are the ones in a distribution because we can see there
0:04:16 most of the day
0:04:18 these changes are data
0:04:21 it is the distortion that is being nervous in the high frequency regions
0:04:26 because of illegal immigrants
0:04:28 and of just like expected because the their transmissions at their sticks on the acoustic
0:04:34 characteristics of gonna the phone
0:04:36 and women
0:04:37 is expected to be bandpass region because
0:04:40 only one can be is the error or because you becomes a
0:04:44 we'll have more stands opinion is
0:04:47 i don't have been system is responsible must therefore
0:04:51 okay was characteristics in the in the previously
0:04:55 in your dynamo in but it just a model of speech function
0:05:00 a we can see that maybe a speech has basically by first order statistics on
0:05:07 the right you can is on which involves only
0:05:12 these concealment speech coding speech
0:05:16 no in this work addressing the stars in addition there me
0:05:22 concentrate on a weekly
0:05:25 on a new data
0:05:27 additional industry on
0:05:31 nolan available in you
0:05:33 the idea is there you know a companion but we discuss the fundamentals of do
0:05:38 you wanna do not you so well before that
0:05:43 the initial requires the basis accent but instead of discontent signal x and then it's
0:05:49 previous that unveiling something minus one
0:05:52 an additional next congestion something that is
0:05:56 we find that it is your data in the next experiment and minus one in
0:06:03 so that you and then because of this on the amount of structured utilizes the
0:06:07 desire was
0:06:08 but their own meeting the
0:06:10 but
0:06:11 in boston immediate future in the presence
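[Editor's sketch] The passage above seems to describe an energy measure built from three samples of the signal: the current sample x(n), the previous sample x(n-1), and the next sample x(n+1). One well-known operator of that shape is the discrete Teager energy operator, x(n)^2 - x(n-1)x(n+1). That the speaker means exactly this operator is an assumption, since the recording is not clear enough to confirm it. The function name `three_sample_energy` and the `lag` parameter are hypothetical and chosen only for illustration.

```python
import math

def three_sample_energy(x, lag=1):
    """Return x[n]**2 - x[n-lag] * x[n+lag] for every n that has both neighbours.

    With lag=1 this is the classic discrete Teager energy operator.
    The lag parameter is an illustrative assumption, not taken from the talk.
    """
    if lag < 1:
        raise ValueError("lag must be a positive integer")
    return [x[n] ** 2 - x[n - lag] * x[n + lag]
            for n in range(lag, len(x) - lag)]

# For a pure tone A*cos(omega*n) the output is constant:
# A**2 * sin(lag*omega)**2, so it tracks both amplitude and frequency.
amplitude, omega = 2.0, 0.3
tone = [amplitude * math.cos(omega * n) for n in range(50)]
energy = three_sample_energy(tone)
```

Because each output value needs only a sample and its two neighbours, the measure reacts almost instantly to changes in the signal, which may be the property the speaker is pointing at.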
0:06:14 however within a is actually an actual speech signal catches the dependencies
0:06:20 in the signal is also has a signal and these different independent signal is not
0:06:24 a lie i can be your like having something minus one or the u v
0:06:30 mobile
0:06:32 well in the context of speech production and perception we know little or no control
0:06:37 recognition and i for this
0:06:40 no one she
0:06:41 so mostly for something or an introduction cognition perception
0:06:45 well whatever you want to thank you speech
0:06:50 so well motivated by this kind of
0:06:53 okay statistics of the natural speech we also exploit a pu you know if available
0:06:59 in one
0:07:01 while i mean
0:07:02 then we consider only the a initial clustering just and we consider
0:07:08 the i se but in the past and i in which
0:07:12 sure
0:07:12 and as this one and it is not really meant that is a mathematical
0:07:17 in addition
0:07:18 all the sinusoidal previous i and the previous section seven
0:07:22 and a we can see that use basically the new location as explained minus plus
0:07:28 one
0:07:29 excluding and less time based on and
0:07:32 but i score as defined in
0:07:35 one is in this because it captures of bananas yours was
0:07:39 based on the
0:07:40 reynolds
0:07:42 so this is and everything the
0:07:43 we begin this is the pu and this is that it wouldn't exist
0:07:47 consider
0:07:48 there is only just where you're being nor do i even it is
0:07:52 in this isn't the video games or two
0:07:56 it is because there isn't dependency structure because of the pu for these kinds of
0:08:00 and in this case
0:08:02 you the minutes in this is
0:08:05 i feel that it was a good why don't we discussed in
0:08:10 we can see that you also used the described next sure you domain and then
0:08:16 a justice of you in the netherlands
0:08:21 not in this but are we extend our recently proposed remote the actual and b
0:08:26 c basically women because not more
0:08:30 that is used in these easy we have an input speech
0:08:35 preemphasis problem and yet been investigated the and then be cleaning everything more than fifty
0:08:42 one from a nation
0:08:45 so miserably you'll be explained remedial the reason is there anyone you know
0:08:50 well actually better sticks all basically and dependencies and sequence of both genders speech and
0:08:57 then how did better than the
0:08:59 there is a question that only
0:09:01 a in this is a screen
0:09:04 so for example this is the
0:09:06 two can assume one all basically the speech
0:09:10 you know various acoustic and one that we discuss they can control spending analysis and
0:09:15 can see that the view point has got the ones which we discuss not be
0:09:20 a and b
0:09:22 their ability to just really the speech forty one
0:09:26 the final and one
0:09:29 no this is that he's for the initial clusters of similar to speech that we
0:09:33 'cause there's not as is that in the component other
0:09:36 was is trained using that as convolutional physically
0:09:41 impulse response will be these the resulting was it is obvious cases of this
0:09:48 all signals are inverse discrete cosine and sinus
0:09:50 that is themselves layers
0:09:52 we examine the impostors
0:09:55 the man digging out of the impostors and
0:09:58 and weddings an option
0:10:00 we can see that the pu provider maintains the high energy pulses and an additional
0:10:07 okay
0:10:08 the there's characteristics within that will use in which their children adaptation transforms used by
0:10:15 one morning or anything they're also that also for the natural speech in the u
0:10:19 a visiting then more only
0:10:23 so that it is which means you think that in cases in a considerable so
0:10:27 that it almost
0:10:29 with a single moment
0:10:31 earlier
0:10:32 which is basically in the next to the model mice is channel factors something more
0:10:38 than one indicating in the morning shows that are running and you changed
0:10:43 speech production that which should also direct relation
0:10:47 i in this study we are really
0:10:51 this
0:10:52 characteristics
0:10:53 this decoder just basically
0:10:57 speech
0:10:57 only the achievable when the anyone who have a variable are you gonna show a
0:11:03 well fine
0:11:05 for actually than that of speech shown basically well why a beautiful place
0:11:11 corresponding this is a to be
0:11:13 but in this feature a fight for
0:11:17 the weather the rest of your own et al
0:11:21 i just t
0:11:22 and the different for different values of the difference in the next layer for example
0:11:27 for this
0:11:29 and allows
0:11:30 and the elements in this is one is to show all three and one as
0:11:34 well you know that for different is the next
0:11:38 five
0:11:40 when using basically here with additional features and then there's each one woman and an
0:11:45 actual speech and hence we consider this s
0:11:48 secondly
0:11:50 better a discriminative you
0:11:52 for using the
0:11:54 and b a value in the pca projection
0:11:58 so these differences are also clearly better for the prior knowledge of multiple files are
0:12:03 used for the natural and you than the one of the financial speech and
0:12:08 but in order to utilize probability speech and that or distinctions this
0:12:13 not be quite and we plan and text i is differences
0:12:18 for doing that she
0:12:21 this is innovation
0:12:23 slightly distribution it's a function
0:12:26 for
0:12:27 okay well i mean i speech and the speech bin z there and the standard
0:12:31 batteries but this in figure you're to be figure in the world melodies
0:12:37 for spectral
0:12:38 i suspect and this is because
0:12:41 ten milliseconds each other ones for an s and h
0:12:46 we can see clearly there at the start with just here in both cases are
0:12:51 lower there are one get better result shows you are working together and it is
0:12:55 features really able to see what
0:12:58 focused
0:12:59 and high resolution of formant structure an overall distortion
0:13:04 one an active speech
0:13:05 and the signal doesn't features which are no
0:13:08 can be captured but only
0:13:10 well known as features in the residual so which is being unity
0:13:15 namely
0:13:16 during speech
0:13:19 and this is in profile the textual this profile always be there for the various
0:13:25 values of an index that is
0:13:27 well human speech
0:13:28 we used to using only the energy based vad point five we used and is
0:13:33 a ribbons in next
0:13:34 thus the phone recognition as well
0:13:37 and then be seen that are really
0:13:40 we see their four one and can see that for the various different as in
0:13:46 this distribution as producing features and testing a each altogether as it is there is
0:13:54 measured for different values of you
0:13:56 a one different messages consider
0:13:59 most of the five one solution to capture some features general capturing the traditional table
0:14:06 for that this is the t i one
0:14:10 in this thing using the standard statistically meaningful
0:14:13 is longer than surrounding wasn't two database
0:14:16 and it is that you one i in this work you the initial search
0:14:22 and in experiment is a little difference you the
0:14:24 these are not i logistic thing and assisting different no matter just like
0:14:30 for each of these features are going fourteen on the cross gender engine is varying
0:14:35 from one twenty thirty nine ninety
0:14:37 the motivation z-norm mixture component gmms okay and ninety five one
0:14:43 and we use basically different ones
0:14:45 in this work is in gmm simple gmms
0:14:50 this is a for successful results using a it is interesting
0:14:57 with that is for refinement is a dependence you next
0:14:59 you can see that basically
0:15:02 and anything that's
0:15:04 forty eight was the one they but it is my final and basically we consider
0:15:09 forty eight to five they are used for six point five significant and is a
0:15:15 twenty five percent
0:15:16 or represented as
0:15:19 which in the usual significant improvement in
0:15:23 and has fewer can be a to find an optimal choice of measurements index for
0:15:27 this experiment
0:15:29 and this is basically the
0:15:31 locatable score retire fungus you gmm and was you sure well based on all distributions
0:15:38 of the solutions
0:15:40 all sequences e
0:15:43 and this is an analysis e
0:15:44 and is mfcc and matrix
0:15:47 you can see that for you just distribution has to be well signal estimation whereas
0:15:52 for residual different for a gmm
0:15:59 this on the development
0:16:01 no you're not experiments for basically these features are like a combination of them but
0:16:06 you forty eight one
0:16:08 well and then you mfcc and the n six z
0:16:12 if it also there is used a list of the unlike this is from not
0:16:17 just like mfcc
0:16:20 i this is e
0:16:21 and six is significant performance improvement then
0:16:24 we both models going able to model well basically smooth and phone it is easy
0:16:29 features we can
0:16:30 so it is easy
0:16:31 then we'll is used in the ecstasy
0:16:33 and m c and we also there is really can strategy
0:16:37 this result is as you we just use this almost indicating that
0:16:42 but with features it also captures complementary information
0:16:46 then the baseline of the challenge can and
0:16:50 is systems on t
0:16:52 wasn't retirees wasn't one iteration
0:16:55 so
0:16:56 this is the and already you know
0:16:59 and then we also show the performance using a detection error tradeoff curve so we
0:17:03 can also they're the performance of the det calls for a way to one this
0:17:08 is one is basically
0:17:10 mfcc then security
0:17:12 this is one
0:17:13 and this is an existing data for me from clean the proposed features and screams
0:17:19 and
0:17:19 and similar training actually almost or with
0:17:24 also features are only one
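[Editor's sketch] The speaker mentions showing performance with a detection error tradeoff (DET) curve. As a minimal illustration of how such a curve is obtained from detector scores, the code below sweeps a threshold and records the miss rate and false-alarm rate at each setting. It assumes that a higher score means "more likely genuine"; the function names `det_points` and `approx_eer` are hypothetical, the scores are made-up toy values rather than results from the talk, and the equal-error-rate summary is added here only as one common way to read such a curve.

```python
def det_points(genuine_scores, spoof_scores):
    """Return (threshold, miss_rate, false_alarm_rate) triples.

    A trial is accepted as genuine when its score is >= threshold.
    """
    thresholds = sorted(set(genuine_scores) | set(spoof_scores))
    points = []
    for t in thresholds:
        # Genuine trials rejected at this threshold.
        miss = sum(s < t for s in genuine_scores) / len(genuine_scores)
        # Spoofed trials accepted at this threshold.
        false_alarm = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        points.append((t, miss, false_alarm))
    return points

def approx_eer(points):
    """Average the two error rates at the point where they are closest."""
    _, miss, false_alarm = min(points, key=lambda p: abs(p[1] - p[2]))
    return (miss + false_alarm) / 2

# Toy scores, invented for illustration only.
genuine = [2.0, 3.0, 4.0]
spoof = [0.0, 1.0, 2.5]
points = det_points(genuine, spoof)
eer = approx_eer(points)
```

Plotting miss rate against false-alarm rate for all points gives the DET curve; a curve closer to the origin indicates a better detector.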
0:17:27 so
0:17:28 however the fuses well formed elements are defined in and function indicating there
0:17:34 you models there is resigning
0:17:38 but are trained using the that the justice to perform better than the engine just
0:17:43 the decision features
0:17:45 i don't use and
0:17:49 here is an analysis or physically and one or more efficient well money well mauritius
0:17:55 physically model issues
0:17:57 and i saw in one additional from the perspective
0:18:02 so that reducing the problem first final one is okay
0:18:06 new the bar e
0:18:09 e here is for the natural and this a three different this from the different
0:18:13 characteristics like benefit
0:18:15 a high quality classes
0:18:18 three one and you'll be playing on the only problem
0:18:22 a message in which
0:18:24 so we can see their this is the sum of implementation and fast implementation and
0:18:29 they are very real distinct and is a weighting
0:18:32 involve a harmonic structure is or
0:18:34 in an actual speech but there is no result obviously
0:18:38 you need for
0:18:39 this is definitely a cost you difference between the natural and the
0:18:43 i
0:18:46 finally we evaluated the sickly
0:18:48 using this costings you can see that
0:18:53 the views different contributions like environment acoustic environment that voice recording ways
0:18:59 and we can see their own but for the proposed features to meet is even
0:19:03 da was to find the list equal and
0:19:06 existing we just like dimensions using consisting and
0:19:09 so this is showing me for an answer was you just on different conditions
0:19:15 find it was always in this work we take your batteries exploiting question
0:19:20 the idea was features to d c you know okay the menu
0:19:26 movies easy and everything but
0:19:29 and this is only on better for different decomposition of a controversial but was not
0:19:35 affected by the owners of different one
0:19:37 number of channels but she was adamant use this for most beneficial for the two
0:19:43 streams
0:19:43 this forms as a
0:19:45 well
0:19:46 on the final experimentation
0:19:48 i don't know we need was actually impulse response of random should be my acoustic
0:19:53 environment
0:19:53 we should definitely a landing on the nist is immensely challenging
0:19:58 things as well
0:19:59 with this knowledge yet results using line data and in as you gonna condition a
0:20:05 one time someone colours audio research
0:20:09 we also kind of the organisers of recognition workshop twenty and challenges of this is
0:20:15 what we also want to challenge
0:20:17 really and also
0:20:19 indeed it was made available but not from in this experiment be
0:20:23 sarcastically meaningful system not
0:20:26 and finally the citizens just
0:20:29 and we i
0:20:30 on the phone and h