0:00:14she
0:00:16good thing under need for a nice introduction
0:00:19actually anthony what's in my life for several years a four years
0:00:25during the
0:00:28twenty nine ten leading not introduce him
0:00:33i can also be very well
0:00:35okay so first of all i would like to confirm that actually indeed
0:00:41speaker and language characterization splc is the most vibrant special interest
0:00:48group we know in isca
0:00:49and not that the all be cased i don't
0:00:53presence of myself i was also don't things and then the past of the vice
0:00:59president of isca and then
0:01:01jean francois a past president of isca
0:01:05all come to show the support
0:01:09and also like the thing release and
0:01:14at a dual due to
0:01:16to have
0:01:17brought odyssey to spain i believe that many of us
0:01:23have wanted to come to bilbao and visit the basque country for a long
0:01:27time and this gives an excuse for all of us
0:01:31to come to this beautiful and
0:01:33harry oppression
0:01:36a year ago
0:01:40i do not all relevant to me doing but i mean yes that the extend
0:01:43the invitation
0:01:45to ask me
0:01:46to talk about anti-spoofing
0:01:49i thought this would be a topic that
0:01:55is very close to what we are discussing in speaker recognition
0:01:59and it also made me work very hard the past few days to put
0:02:02together the slides this is really the first time i give a presentation on this topic
0:02:06it is actually the topic of my
0:02:08phd student
0:02:10and session who
0:02:14graduated two years ago now he told me that he is working at apple
0:02:17computer
0:02:19he's not here it's is are you hear no
0:02:22and are like to start with our like to thank the global for people to
0:02:27ten
0:02:28at all ni
0:02:31nick you've and same opportunity
0:02:33for sharing with me a set of their presentation slides that saved me a
0:02:37lot of time they gave a tutorial at apsipa the asia
0:02:42pacific signal and information processing association annual summit and conference
0:02:46in
0:02:48hong kong last december i attended
0:02:53the tutorial and then they gave me the set of slides and i extracted quite a
0:02:58number of them from
0:03:00their presentations i just want to say thanks to them
0:03:03and also
0:03:05a also thing my student on another student show higher to
0:03:10prepare some experiments just to make my talk complete
0:03:18wonderful so
0:03:20so my topic will be on anti-spoofing i understand that anti-spoofing is
0:03:24actually not a scientific discipline it is a kind of application that goes with a
0:03:30speaker recognition system
0:03:32and also because it is not yet an established discipline that is why i
0:03:36do not think there is a definition of what anti-spoofing is
0:03:41it is anything that protects the security of a speaker recognition system
0:03:45that is what we
0:03:46think about so today i will only share with you some of the
0:03:50experience that we have had that i will touch upon perhaps that experience can
0:03:56spark further discussion during the
0:04:00workshop
0:04:01voice biometrics is actually just another name for speaker
0:04:06recognition in our community in twenty twelve there was a report saying that eighteen of
0:04:14the top banks in the world have adopted speaker recognition systems actually now
0:04:22the number has increased tremendously
0:04:24i just a month ago many of my tingles announcement density per in addition to
0:04:29a launch voice authentication system for a
0:04:35for
0:04:37call center services and
0:04:39for somalia we also part of this project and just
0:04:43turn people that
0:04:44for the first time we are paid to become a hacker of the system
0:04:49so we just
0:04:51evaluate the security features of the speaker recognition system to
0:04:57the point
0:04:59this is a projection by
0:05:01try to come the market size one or
0:05:07all kinds of biometrics
0:05:10used in
0:05:11banking and finance and of course maybe other areas
0:05:16and you can see that voice biometrics is actually one of the growth areas
0:05:22no i both the colour with them for my laptop screen but it just the
0:05:30the last
0:05:32however it shows that so it is
0:05:34we see a tremendous growth which is almost as good as fingerprint because fingerprint
0:05:38is still the mature technology at this time
0:05:42when we talk to customers i am working in an institute where we face a lot of
0:05:48our industry partners who need to deploy speaker recognition systems
0:05:53the question they ask is not so much how accurate the system is because they
0:05:57see that as kind of a given because the system
0:05:59must work well the question they usually ask is
0:06:03how secure the system is in face of attacks and
0:06:07and
0:06:09i know using the other things like that
0:06:11so
0:06:13recently actually two or three years ago we deployed a technology to
0:06:18the lenovo smartphone if you get a lenovo smartphone the screen unlocking
0:06:23is likely
0:06:25to include voice authentication it is our technology and of course they
0:06:30also asked for anti-spoofing
0:06:34mechanisms in order
0:06:35to guard against replay attack
0:06:40which i will talk about
0:06:42so today i am going to talk about four
0:06:46main items one is spoofing attacks then i will talk about
0:06:51voice conversion then the artifacts that we may discover in the voice
0:06:56and lastly
0:06:59asvspoof the automatic speaker verification
0:07:02anti-spoofing campaign of last year
0:07:05i do not want to go through the details of the evaluation campaign but i will talk
0:07:10about
0:07:11some of some of the observations i suppose a different a start
0:07:17okay so typically a speaker verification system takes a voice as input and then
0:07:22makes a decision to accept the identity claim or to reject it
0:07:28most of the time we assume that the voice input is actually from a
0:07:33live person live speech
0:07:36in reality it may not be true
0:07:39we can categorise the possible attacks into these four types impersonation
0:07:46is like getting a person to mimic
0:07:50to impersonate your voice
0:07:51and
0:07:52replay means you record somebody's voice and you play it back to the system
0:07:57and speech synthesis and voice conversion these are the
0:08:01technological means of creating speech
0:08:07well there could be some other new methods that people invent but for now i
0:08:14suppose we know that the ways of attack can be categorized into
0:08:18these four areas
0:08:20this table summarizes the
0:08:23accessibility the effectiveness and the risk to the system and
0:08:28the availability of countermeasures accessibility means how easily
0:08:34you have access to this technology to spoof a system
0:08:38there are studies on impersonation basically you get a person to
0:08:43act as another person
0:08:46this is actually one of the very old performing arts you try
0:08:52to learn to mimic
0:08:53somebody's voice
0:08:55and
0:08:57studies show that some people may be able to
0:09:02mimic another person very well to the human ears but actually the voice may not be
0:09:06a very
0:09:16very strong as an attack because the computer listens differently
0:09:21from human ears
0:09:23and it is also difficult to train a person to mimic somebody's
0:09:28voice so basically it has low accessibility and
0:09:32it does not present a strong risk to a speaker verification system
0:09:39a replay attack is basically to have somebody's
0:09:43voice recorded while talking and then you play it back to the system
0:09:47which is low tech and it is a risk
0:09:50usually in the context of text-dependent verification
0:09:53if it is text independent and you just have somebody's voice
0:09:57recording and
0:09:58you edit the voice and input it back then basically it falls into the
0:10:02speech synthesis and voice conversion categories so
0:10:07for replay attacks
0:10:09we
0:10:11evaluate the risk
0:10:14mostly in the context of text-dependent speaker verification
0:10:20when we talk about the voice unlocking screen of the lenovo phone
0:10:25we developed a system that is kind of
0:10:29taking the unique features of the voice
0:10:34to stop replay attack
0:10:35we know that
0:10:37a human
0:10:40vocal system cannot repeat exactly the same
0:10:43voice twice so if you are able to record all
0:10:47the voices and then when a test comes in
0:10:50you compare the incoming voice with the data in the storage if they are exactly
0:10:55the same at all times
0:10:56this is a replay attack
0:10:58so we have a mechanism to do this but there could also be other ways to
0:11:01do this for example
0:11:04there are studies
0:11:05she by idea those group on the some years ago on the on the
0:11:12detecting replay attacks basically the
0:11:14idea is that when you replay
0:11:16actually it is a replay of a recording and the recording usually is
0:11:20taken from a far-field microphone
0:11:22so there is a level of
0:11:23additive noise and reverberation
0:11:26and the acoustic effect of the room
0:11:29if you are able to characterise it
0:11:33you are able to detect the replay here is an example so this is the original speech
0:11:40for the works
0:11:50no i
0:11:53well i
0:11:56so you hear the reverberation and the noise level and this is
0:12:00to some extent a
0:12:03unique characteristic of far-field microphone recordings so if we detect this of course
0:12:07you can accept or reject a recorded voice but this is
0:12:12very difficult because room acoustics change from place to place it is very difficult to
0:12:17build just one model
0:12:19that can identify all the room acoustics
0:12:24another technique that i just mentioned is called audio fingerprinting
0:12:29the idea is that if we can
0:12:32keep
0:12:33the voice
0:12:33in the storage
0:12:35for all the samples that are presented to the system
0:12:38of course in order to do that we do not keep the recording as a whole
0:12:42we keep away from storing the whole
0:12:44think of how we do fingerprint recognition actually the system does not keep
0:12:48the picture
0:12:49of the fingerprint
0:12:51you keep only the key points those key points of
0:12:55the fingerprints
0:12:56the same for audio
0:13:00there is software that is called
0:13:03shazam or something you know you can
0:13:05record a piece of music and then you retrieve
0:13:08the information of the audio from the
0:13:13system it is the same technology you have a voice recording
0:13:18converted into a spectrogram
0:13:19and then you kind of discretise the spectrogram into pixels and you only remember the
0:13:25key points the points
0:13:27of high energy or high contrast
0:13:30and actually you only need like
0:13:33forty bytes
0:13:35to keep a recording of
0:13:38five seconds
0:13:40so
0:13:40practically you can store an almost unlimited number of entries in the
0:13:45system
0:13:46so when the test speech comes you just compare
0:13:49one by one and if there is an exact match you just reject it because
0:13:53no one can produce an identical voice
0:13:57twice
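The fingerprint check described here can be sketched roughly as follows. This is a minimal illustration and not the deployed system: the function names, the choice of keeping the two strongest bins per frame, and the overlap threshold are all assumptions made for the example, and the spectrogram is taken as an already computed list of frames.

```python
def fingerprint(spectrogram, top_k=2):
    """Keep only the key points: the top_k strongest bins of every frame.

    spectrogram: list of frames, each frame a list of bin energies.
    (top_k and this peak-picking rule are illustrative assumptions.)
    """
    keypoints = set()
    for t, frame in enumerate(spectrogram):
        strongest = sorted(range(len(frame)), key=lambda k: frame[k], reverse=True)[:top_k]
        for k in strongest:
            keypoints.add((t, k))
    return frozenset(keypoints)


def is_replay(test_spectrogram, stored_fingerprints, min_overlap=0.95):
    """Flag the test utterance if its key points almost coincide with a stored entry."""
    test_fp = fingerprint(test_spectrogram)
    for stored in stored_fingerprints:
        # Jaccard overlap between the two sets of key points.
        overlap = len(test_fp & stored) / max(len(test_fp | stored), 1)
        if overlap >= min_overlap:
            return True
    return False
```

Only the key points are stored, never the recording itself, which is why each entry stays small. A fresh utterance from the same speaker should produce a different set of key points, while a played-back copy reproduces them almost exactly.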
0:14:02then it comes to speech synthesis and voice conversion these two share
0:14:07many common
0:14:09properties for example they need a vocoder to
0:14:12generate the voice they rely on statistical models to generate the features
0:14:17et cetera so
0:14:19i lump them together so today our focus will be on voice conversion
0:14:23but of course
0:14:25many of the techniques also apply to
0:14:27speech synthesis detection
0:14:29when we do speaker verification
0:14:34we work on robust features
0:14:38features of course have to be reliable have to be robust
0:14:42and so we see this is chip in good
0:14:46well we start a fine okay
0:14:49features of both
0:14:51properties but
0:14:52most of us use the short-term spectral features because they are easy to extract
0:14:57and are actually quite reliable
0:14:59and robust against noise against
0:15:03ageing health state
0:15:05channel
0:15:08variations so that is where our focus is
0:15:11there are typically two types of features one is based on the
0:15:14voice production system like lpc features you consider the vocal system as
0:15:18an excitation followed by a resonance filter right so you model
0:15:24the excitation the source and the filter this way you kind
0:15:29of simulate the
0:15:30production system there is another type of thinking that tries to follow the
0:15:35peripheral auditory system we perceive sound but we do not hear all of
0:15:39it so things like you can see
0:15:43in the cochlea we have the basilar membrane right
0:15:47it
0:15:49acts as bandpass filters to get the signals
0:15:52and we try to
0:15:53derive features that
0:15:55kind of follow
0:15:58bandpass filters at different scales the mel scale
0:16:02and this set of parameters is called auditory
0:16:06features things like mfcc and
0:16:08many others like the auditory transform
0:16:11et cetera
0:16:15unfortunately most of them focus on robustness we try to extract people's
0:16:21unique characteristics speaker characteristics
0:16:24we consider the rest as noise and try to accommodate it
0:16:28so as a result the more robust the speaker recognition system is it also means
0:16:32it is more vulnerable to attack because when we synthesise the voice you have
0:16:37all kinds of variations and if the features are very good at
0:16:41overcoming
0:16:43all kinds of noise actually your system becomes very vulnerable to attack so we
0:16:48have like contradicting
0:16:50requirements for the system on one hand we want to detect the synthetic voice which
0:16:54is
0:16:55unwanted and on the other hand we want to be
0:16:57robust and these two things
0:16:59are not in the same direction therefore we cannot have one system that does both
0:17:04usually we have one system for synthetic speech detection in the
0:17:08front as a filter so when
0:17:10we detect that
0:17:13this is not a synthetic voice then the signal is passed to the speaker
0:17:16verification system
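The two-stage arrangement described here, with a synthetic speech detector in front of the verifier, can be sketched as below. The scoring functions, thresholds and return strings are hypothetical placeholders for illustration; the talk does not specify them.

```python
def verify(utterance, claimed_speaker, spoof_score, speaker_score,
           spoof_threshold=0.5, speaker_threshold=0.5):
    """Accept only if the utterance looks natural AND matches the claimed speaker.

    spoof_score(utterance) -> higher means more likely synthetic (assumed convention).
    speaker_score(utterance, claimed_speaker) -> higher means better match.
    """
    # Stage 1: the countermeasure sits in front and acts as a filter.
    if spoof_score(utterance) >= spoof_threshold:
        return "reject: synthetic speech suspected"
    # Stage 2: ordinary speaker verification on whatever passes the filter.
    if speaker_score(utterance, claimed_speaker) >= speaker_threshold:
        return "accept"
    return "reject: speaker mismatch"
```

Keeping the two stages separate lets each one use features suited to its own goal: the detector can be sensitive to artifacts while the verifier stays robust to nuisance variation.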
0:17:20next i am going to talk about voice conversion voice conversion actually now
0:17:25is very accessible you can even go to amazon dot com and you can buy
0:17:30a box ninety nine point i five dollars
0:17:33ready for
0:17:34we those k
0:17:36and it actually allows you to change your voice to masquerade your voice
0:17:42i mean to change your identity from one to another or to
0:17:47kind of a you can use that
0:17:49step here
0:17:55so basically you can modify in the system the formants the pitch
0:18:01you can try to use this
0:18:04box to kind of spoof
0:18:06a speaker verification system
0:18:08clearly in your
0:18:10in your room so
0:18:12so
0:18:13it will cost a
0:18:15if we understand well how voice conversion is done maybe we can build a system to detect
0:18:21synthetic voice a voice conversion system has basically three parts
0:18:26at all of this like formant judge in the slides i believe that
0:18:30distance voice with must you with my student
0:18:33voices very different from a from his voice at one time this analysis and you
0:18:37can be a system that combine the voice one another this must be very strong
0:18:41the
0:18:42voice conversion system
0:18:44so basically there are three modules
0:18:46to analyze to convert the features and
0:18:49to synthesise
0:18:51why analyze because
0:18:53it is very
0:18:54hard to deal with the time-domain signal so you convert it to
0:18:58a domain that you can
0:19:00manipulate
0:19:01usually the frequency domain
0:19:03and then you convert the features
0:19:05you manipulate them the way you want then you put them back
0:19:09synthesise and generate the voice of another person
0:19:13when we do this there is vocoding actually it is
0:19:17a concept that was very well studied
0:19:20in communications in the early days people wanted to transmit signals with coding they
0:19:26wanted to compress the signal
0:19:28they wanted to
0:19:29multiplex the signal they wanted to encrypt the signals
0:19:32with
0:19:34codes et cetera
0:19:35so they analyze the signal into features into parameters then they do what
0:19:40they want they transmit them over the narrowband channel and at the end they
0:19:44make sure that the signal can be put back
0:19:48using the parameters so this was a traditional framework
0:19:53in communications and
0:19:55today actually we replace the transmission channel with a
0:19:59feature conversion that
0:20:01allows us to do voice conversion
0:20:04there are all kinds of vocoders that do this
0:20:09we just group them broadly into two categories that people use
0:20:12widely in speech synthesis and that work very well one is called
0:20:15sinusoidal vocoders basically the idea is
0:20:20to generate a signal that pleases our ears
0:20:24that matches how humans hear the voice so to generate something which
0:20:29sounds very natural to human ears which is good
0:20:33so the idea is
0:20:35to decompose the
0:20:37sounds into a collection of
0:20:40harmonics and then of course to include
0:20:44the modulated noise components so you have the noise which represents the fricatives
0:20:49and the harmonic components that represent the vowels and put these two together you can
0:20:54regenerate the sound
0:20:57this
0:20:58kind of vocoder
0:20:59in the studies
0:21:02people
0:21:03evaluated and found that they are actually very natural but they have some issues
0:21:10some of the issues are like
0:21:13because you decompose into these harmonic components the number of parameters that they
0:21:19need to describe the signal varies with
0:21:23the signal itself
0:21:25like fundamental frequency like sampling rate et cetera they affect the number so for every frame
0:21:29we have a different number of parameters
0:21:32this presents a problem when you want to model it
0:21:34in a
0:21:36statistical model we need the same number of parameters to model
0:21:40of course there are ways to overcome this so the studies
0:21:44on this are focusing on how to manage the number of features
0:21:48and on the other hand how to manage the noise because
0:21:52harmonics as you know are
0:21:54good at describing periodic signals but not very good at describing noise
0:21:59another type of vocoder is called the source-filter model which
0:22:05i think i mentioned earlier you can think of the vocal production system
0:22:10you have the
0:22:10source excitation then you have the resonance filter and then you try to model
0:22:14both
0:22:15and the good thing about this is
0:22:18the parameters for example if you use linear predictive coding
0:22:23actually you can fix the number of parameters
0:22:26and that helps the statistical modelling
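To illustrate the point about a fixed number of parameters, here is a small sketch of linear prediction using the autocorrelation method and the Levinson-Durbin recursion. It is a textbook formulation written for this example, not code from the talk; the function names and the test signal are assumptions.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of one frame."""
    return [sum(frame[t] * frame[t - lag] for t in range(lag, len(frame)))
            for lag in range(max_lag + 1)]


def lpc(frame, order):
    """Levinson-Durbin recursion.

    Returns `order` predictor coefficients a[1..order] such that
    x[t] is approximated by sum(a[k] * x[t - k] for k in 1..order).
    """
    r = autocorrelation(frame, order)
    a = [0.0] * (order + 1)
    error = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[k] * r[i - k] for k in range(1, i))
        reflection = acc / error
        new_a = a[:]
        new_a[i] = reflection
        for k in range(1, i):
            new_a[k] = a[k] - reflection * a[i - k]
        a = new_a
        error *= (1.0 - reflection * reflection)
    return a[1:]
```

Whatever the frame length or the pitch, `lpc(frame, order)` always returns exactly `order` coefficients, which is the property that makes this representation convenient for statistical modelling. For a decaying signal such as `[0.5 ** t for t in range(50)]`, the first coefficient comes out close to 0.5.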
0:22:29of course this also has a problem compared to the sinusoidal
0:22:33coding so you know
0:22:35in some of the studies like music synthesis the phase vocoders
0:22:39there
0:22:40allow you to scale in both time and frequency domain so they can
0:22:44actually
0:22:45control the phase of the signal the magnitude and phase for the source-filter model
0:22:51this filter has to be
0:22:55stable and causal so you have all the roots
0:23:01within the
0:23:02unit circle
0:23:04because of this it follows a minimum phase
0:23:07strategy when we reconstruct the signal and that actually causes artifacts
0:23:12which is good for
0:23:16artifact detection synthetic speech detection
0:23:20so this is a
0:23:21very simple study which was done a few years ago and it
0:23:26does a very simple test you have a number of vocoders and you do
0:23:31copy synthesis you do not convert you just
0:23:34simply analyse into the features and recompose the signals
0:23:37and see whether you can detect the synthetic voice
0:23:41and the result shows that with the modified group delay cepstral coefficients you can
0:23:46do very well in detecting the synthetic voice so there are artifacts in the
0:23:51data and a lot of the artifacts cannot be analytically visualised but
0:23:57with proper features
0:23:58you can actually detect them
0:24:00so now
0:24:01after talking about the vocoder we talk about voice conversion
0:24:06so
0:24:07in voice conversion basically you want to convert the
0:24:15spectrum from one person to another well the
0:24:18things that characterise people's
0:24:21voice are quite a number of things the main items are the formants
0:24:27the formants first of all carry the content they tell
0:24:31which vowel it is
0:24:34but they also have the personal
0:24:36side they also represent the vocal tract structure and different people have different
0:24:41formant structures and
0:24:43maybe formant tracks
0:24:45of course you also have the fundamental frequency which is the pitch and also the
0:24:49intensity
0:24:50the energy envelope all these are very difficult to manipulate individually what
0:24:56we usually do is
0:24:58spectral conversion we convert the
0:25:01spectral envelope of one person's voice to another's as a kind of transform
0:25:06a typical example is
0:25:18so what we usually do is that you have a
0:25:20so-called parallel corpus
0:25:23you have samples of the same content and you do alignment you can just
0:25:27do a dtw alignment and then you come up with pairs of
0:25:34features right
0:25:35and then you
0:25:37derive a conversion function from
0:25:40these pairs
0:25:41you have all the pairs supposedly enough to cover all the mappings
0:25:47and then you do the conversion at run time
0:25:49at run time you have the
0:25:51source features you apply the conversion function and then you come up with the output
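The alignment step on a parallel corpus can be sketched with a plain dynamic time warping routine. This is a minimal version for scalar features with an absolute-difference local cost, chosen only to keep the example short; real systems align spectral feature vectors, and the function name is an assumption.

```python
def dtw_align(source, target):
    """Align two 1-D feature sequences; return the list of (i, j) frame pairs."""
    n, m = len(source), len(target)
    inf = float("inf")
    # cost[i][j]: cheapest cumulative cost of aligning source[:i] with target[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(source[i - 1] - target[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [(cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1)]
        _, i, j = min(candidates)
    path.reverse()
    return path
```

For example, `dtw_align([1, 2, 3], [1, 2, 2, 3])` returns `[(0, 0), (1, 1), (1, 2), (2, 3)]`: the second source frame is paired with both repeated target frames. The resulting pairs are the training material from which a conversion function is then derived.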
0:25:56so the m many techniques and you are not this is not for reading this
0:26:00is presented by children as a sparse the web or for the progress of the
0:26:04this research it is to say that there are
0:26:08many
0:26:10conversion techniques
0:26:12using samples
0:26:14and linear regression
0:26:16which is a linear function to convert source to target and also nonlinear methods to
0:26:20do it and then this one is a kind of transfer learning so you know
0:26:25how
0:26:28people transform from one person's voice to another's and you learn the transform
0:26:33matrices from many pairs of people and now you only have very few samples
0:26:37and you leverage
0:26:39the knowledge that you have learned from other
0:26:44people in this way you are able to
0:26:49convert so that allows you to use less data and you need to estimate a smaller
0:26:54number of parameters to achieve the same goal
0:26:57so i will just
0:27:00touch upon a few
0:27:04basic approaches one is called codebook mapping basically the same thing you do alignment
0:27:10you get the pairs and you do vector quantization for the pairs
0:27:15and these are kept in pairs so at run time we only have one
0:27:20sample
0:27:21for example the source is the right column
0:27:24and the green is the target the green you
0:27:28do not have so you quantize the source into these vectors you get all those
0:27:32codewords and then you
0:27:35string
0:27:35the green ones together and generate the target voice
0:27:39of course this is a very elementary technique
0:27:42to do this
0:27:44imagine when you do this
0:27:46you focus very much on the pairs
0:27:49you know the source and target match but you do not care too much about the
0:27:53continuity in the target
0:27:54therefore there is a lot of discontinuity in the generated voice
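Codebook mapping at run time can be sketched as follows, assuming the paired source and target codebooks have already been built from aligned training frames. The function names and the squared Euclidean distance are illustrative choices.

```python
def nearest(vector, codebook):
    """Index of the codeword closest to `vector` (squared Euclidean distance)."""
    def dist(codeword):
        return sum((a - b) ** 2 for a, b in zip(vector, codeword))
    return min(range(len(codebook)), key=lambda k: dist(codebook[k]))


def codebook_convert(source_frames, source_codebook, target_codebook):
    """Quantize each source frame, then emit the paired target codeword.

    source_codebook[k] and target_codebook[k] are assumed to be a trained pair.
    """
    return [target_codebook[nearest(frame, source_codebook)] for frame in source_frames]
```

Each frame is converted independently of its neighbours, so nothing in this procedure keeps consecutive output frames close to each other. That is the source of the discontinuity mentioned above.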
0:28:00another technique is to convert this into a continuous space
0:28:04and if you do this then you can have a formula like
0:28:09this you have x the input as the source and y the output and this
0:28:13is a linear transformation
0:28:15think of it this way
0:28:17the previous one is kind of a codebook and this
0:28:21is a continuous version of it
0:28:23right a continuous version of it
0:28:26and then of course this one generates a slightly smoother
0:28:30voice
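The formula on the slide is not captured in the transcript. As a stand-in, the simplest continuous mapping of this kind is a single affine function y = a * x + b fitted by least squares on aligned frame pairs, sketched below for one feature dimension. The function name is hypothetical, and a mixture-based converter would blend several such maps with soft weights rather than use just one.

```python
def fit_affine(pairs):
    """Least-squares fit of y = a * x + b from aligned (x, y) scalar pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b


def convert(x, a, b):
    """Apply the fitted map to one source value."""
    return a * x + b
```

Because the output varies continuously with the input, neighbouring source frames map to neighbouring target frames, which is why this family of methods sounds smoother than a hard codebook lookup.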
0:28:32another technique the previous two are kind of remembering the samples
0:28:37right
0:28:37in this one we do away with remembering the samples we remember the
0:28:43conversion
0:28:44the warping functions you know that between the
0:28:46source speaker and the target speaker if you have enough samples we can derive
0:28:50a
0:28:51warping function
0:28:53between them and we only remember these warping functions
0:28:56at run time when the test data comes you apply the right
0:28:59warping function to generate the target
0:29:04another technique is called frame selection
0:29:08frame selection
0:29:08the codebook approach we talked about
0:29:11basically does not care too much about the continuity in the target
0:29:14this one takes that into consideration
0:29:17so you have certain frames in the training data that are close to
0:29:21each other with a similar pitch a similar
0:29:26phonetic context and they tend to go together so we have a kind of
0:29:29selection process not just by
0:29:32source target distance but also by target to target frame distance to ensure the
0:29:37continuity this one
0:29:39gives us a little bit smoother
0:29:41output
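Frame selection with both costs can be sketched as a small dynamic programming search. This is an illustrative formulation with scalar features; the cost definitions, the `weight` parameter and the function name are assumptions, not the exact method from the talk.

```python
def frame_selection(source, pairs, weight=1.0):
    """Pick one training target frame per source frame with dynamic programming.

    pairs: list of (source_feature, target_feature) scalars from aligned training data.
    Total cost = |source - paired source|            (source to target match)
               + weight * |target - previous target| (continuity in the target)
    """
    n_cand = len(pairs)
    # best[k]: lowest total cost of any path ending in candidate k at the current frame.
    best = [abs(source[0] - s) for s, _ in pairs]
    back = []
    for x in source[1:]:
        new_best, pointers = [], []
        for s, t in pairs:
            prev = min(range(n_cand), key=lambda p: best[p] + weight * abs(t - pairs[p][1]))
            new_best.append(best[prev] + weight * abs(t - pairs[prev][1]) + abs(x - s))
            pointers.append(prev)
        best = new_best
        back.append(pointers)
    # Trace back the cheapest path.
    k = min(range(n_cand), key=lambda p: best[p])
    chosen = [k]
    for pointers in reversed(back):
        k = pointers[k]
        chosen.append(k)
    chosen.reverse()
    return [pairs[k][1] for k in chosen]
```

With `pairs = [(1.0, 10.0), (3.0, 12.0), (3.0, 40.0)]` and `source = [1.0, 3.0]`, the second frame has two equally good matches on the source side. The continuity term breaks the tie in favour of 12.0, the candidate closer to the previously selected target frame.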
0:29:43and of course there is the unit selection technique this is very well known in
0:29:47the field
0:29:48of speech synthesis
0:29:49where you have a
0:29:53sufficient sample maybe you have twenty utterances or fifty utterances of a target speaker
0:29:58you can actually break them down into elementary components
0:30:01and then
0:30:03at run time when you want to compose something you just pull the samples together you concatenate them
0:30:07into one piece and play it back
0:30:10this is actually one of the common ways of building
0:30:16a speech synthesis system but think of this if you do this there is
0:30:20discontinuity
0:30:22between the units
0:30:25both in magnitude and phase and this could be the artifacts we can detect
0:30:31so to summarise
0:30:35actually in voice conversion and in speech synthesis
0:30:41studies we have
0:30:43subjective evaluation and objective evaluation
0:30:46and actually neither of them addresses the spoofing quality of a synthetic voice
0:30:53for instance in one of the examples you hear
0:30:56you see that
0:30:57there is
0:31:00a sample that you cannot even understand but it is
0:31:04actually a very strong
0:31:07attacking voice for a speaker verification system so these analyses are
0:31:14good for
0:31:15kind of
0:31:18perceptual quality evaluation
0:31:20but when it comes to spoofing attack i believe that there is an effort for us to
0:31:24define what the best ways are to analyze the converted voice in terms of its
0:31:30strength to attack a system and last year's asvspoof evaluation campaign
0:31:37in my view is providing an objective way that allows us to evaluate the strength of
0:31:42synthetic voices and converted voices
0:31:47okay so next let me talk about
0:31:52the effects
0:31:55the artifacts
0:32:00that reside in the synthetic voice that possibly we can detect
0:32:04we know that we cannot visualise the artifacts it is very difficult to see them
0:32:07actually i
0:32:08i
0:32:09got my students to try to
0:32:12show spectrograms to see the differences
0:32:14and
0:32:16there is no direct way to measure them but there are indirect ways of
0:32:21modelling them for example
0:32:23if you know that the signal is discontinuous of course you can use features that
0:32:27represent this kind of discontinuity in speech both in
0:32:32magnitude and phase to model
0:32:36the data
0:32:39there are two things that we should look into one is the magnitude
0:32:43and the other is the phase i mean this is like the standard signal
0:32:47processing textbook
0:32:49what is important is
0:32:51in most of the speech recognition and speech synthesis research
0:32:55we pay much more attention to the magnitude than to the phase
0:33:01for the simple reason that
0:33:03magnitude is easier to manage it is easier to
0:33:07visualise
0:33:09and the phase is much more difficult
0:33:14to
0:33:15describe to associate the parameters with a physical meaning
0:33:20and
0:33:21actually there is a lot of research in the literature on phase features for speech
0:33:27recognition and that provides
0:33:29kind of a basis for us
0:33:33to start this research
0:33:35so in terms of magnitude
0:33:39we know that to analyze the speech signal we need to do a short
0:33:43time fourier transform
0:33:45whether you use
0:33:47sinusoidal coding or you use the source-filter vocoder you need to do this short time
0:33:55time-frequency analysis
0:33:56and this presents artifacts you know that when we do an fft
0:34:01you use a fixed window length
0:34:04and then you have
0:34:06spectral
0:34:09leakage and you have
0:34:10windowing effects all these are artifacts
0:34:14produced by the system in the process
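Spectral leakage from a fixed-length window can be shown with a plain discrete Fourier transform of one frame. The sketch below uses a rectangular window and a direct DFT for clarity rather than an FFT library; the function names and the leakage measure are assumptions made for the example.

```python
import cmath
import math


def dft_magnitude(frame):
    """Magnitude spectrum of one frame (plain DFT, rectangular window)."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
            for k in range(n // 2 + 1)]


def leakage(frame, peak_bin):
    """Fraction of spectral energy that falls outside `peak_bin`."""
    energy = [m * m for m in dft_magnitude(frame)]
    return 1.0 - energy[peak_bin] / sum(energy)


n = 32
# A sinusoid that completes a whole number of cycles in the window: no leakage.
on_bin = [math.cos(2 * math.pi * 4 * t / n) for t in range(n)]
# A sinusoid half a bin away: its energy spreads over neighbouring bins.
off_bin = [math.cos(2 * math.pi * 4.5 * t / n) for t in range(n)]
```

`leakage(on_bin, 4)` is essentially zero, while `leakage(off_bin, 4)` is large because the window length no longer matches the period of the signal. Speech is never perfectly matched to the window, so every analysis and resynthesis pass carries some of this effect.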
0:34:18and then you have this smoothing effect you know that when we do
0:34:22statistical modelling almost all models use
0:34:26maximum likelihood estimation right what is maximum likelihood
0:34:32estimation trying to do
0:34:34it tries to give you the average over everything
0:34:36because the average always has
0:34:39the higher
0:34:42probability right
0:34:44and that causes a problem
0:34:45the limited dynamic range of the
0:34:48signals that are generated and this could be an artifact that we
0:34:52can detect
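The loss of dynamic range under maximum likelihood generation can be illustrated with a toy model in which each state is a Gaussian and generation emits the state mean. The state labels, data and function names below are made up for the example; the variance ratio is one possible cue, not a detector described in the talk.

```python
from statistics import mean, pvariance


def ml_generate(training_frames_per_state, state_sequence):
    """Emit the per-state mean, the most likely output of a Gaussian state."""
    means = {state: mean(values) for state, values in training_frames_per_state.items()}
    return [means[state] for state in state_sequence]


def variance_ratio(test, natural):
    """Variance of a test trajectory relative to natural speech (below 1 means flattened)."""
    return pvariance(test) / pvariance(natural)


# Toy data: two states, two natural frames each (values are illustrative).
training = {"a": [1.0, 3.0], "b": [5.0, 9.0]}
natural = [1.0, 3.0, 5.0, 9.0]
generated = ml_generate(training, ["a", "a", "b", "b"])
```

Here `generated` is `[2.0, 2.0, 7.0, 7.0]`: the variation inside each state has been averaged away, so `variance_ratio(generated, natural)` is below 1. A trajectory whose variance is consistently lower than that of natural speech is one indirect sign of statistical generation.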
0:34:53the same for phase
0:34:55with phase it is a bigger problem
0:34:57often what we do as i said is that when we do synthesis or
0:35:00recognition we use the magnitude features and we actually
0:35:04tend to ignore the phase i mean we still think that phase continuity
0:35:09is important but we do not think that modeling the phase is as important
0:35:13as the magnitude it also presents an opportunity for us to
0:35:18detect artifacts if we can model the patterns of
0:35:23phase
0:35:25distribution seen in
0:35:26natural speech then we are able to detect synthetic speech
0:35:31next are some examples
0:35:35i just want to say that in short time frequency
0:35:39analysis you use a fixed length window to analyze the
0:35:43signal
0:35:46and you have
0:35:49the interference between
0:35:52frequency bins
0:35:54of the energies across the frequency axis
0:35:58and at the same time because you shift window by window with overlap-add
0:36:02you also have this smearing effect along the time axis so
0:36:07you have the interference on the time axis and you also have
0:36:13this smearing effect on the frequency
0:36:16axis
0:36:18if we are able to
0:36:20detect this then this could be
0:36:22something like
0:36:24a signature of synthetic voice
0:36:29vocoders
0:36:31most of the vocoders actually try to
0:36:33kind of
0:36:34smooth the waveform as a result of this
0:36:39short time
0:36:44effect and
0:36:48people say that actually you are using
0:36:50one artifact to correct another artifact so you have short-term
0:36:55spectral
0:36:58leakage which causes you problems and then you use another smoothing
0:37:02method to try to smooth everything out so you have a quality factor corrigan
0:37:06out if you have to a different significant
0:37:09and you can
0:37:09kind of extract the signal but interestingly with this smoothing effect because you use human
0:37:15ears to judge the quality actually after the smoothing the evaluation
0:37:21says that
0:37:23the sound quality is better
0:37:25but i believe that there are artifacts inside that you can detect i also just now mentioned
0:37:31that we use statistical models
0:37:32either in voice conversion
0:37:35or in
0:37:39hidden markov model synthesis
0:37:42and then we try to estimate
0:37:47how to generate the parameters using the maximum likelihood
0:37:50criterion they always give you the average well not always you have other ways
0:37:56some of you might disagree with me there are other ways to model the
0:38:00dynamics but
0:38:02in general the systems give you kind of an
0:38:05average
0:38:06signal
0:38:08that is the limited dynamic range of synthetic and converted speech
0:38:12in this example i just plot the spectrogram of the natural speech and the copy
0:38:18synthesis speech and you can see that
0:38:21actually
0:38:23there are differences in the spectral
0:38:25domain
0:38:27and these are the pitch patterns of the
0:38:32hmm based synthetic
0:38:35voice we know that in human speech
0:38:38actually the pitch pattern is not so stable as in synthetic voice in the
0:38:44paper by what you have
0:38:48in two thousand five this chart shows the
0:38:54synthetic voice which has a very straight pitch pattern
0:38:58this is the autocorrelation of the time
0:39:02domain signals
0:39:03and you see that natural speech actually has something like you know
0:39:08when you believing loosing you have this but broughton
0:39:11the periodic modulation on top of the pitch contour
0:39:14and also some pitch level
0:39:16and the synthetic voice is rather straight
0:39:20because of this if we believe that
0:39:23there is a lack of dynamic range in the synthetic voice and converted voice then
0:39:28the dynamic range of the spectrogram can be used as a feature also
0:39:34one paper by tomi's group that we talked about only uses the
0:39:37delta dynamic features of
0:39:40of
0:39:42spectrograms as the features ignoring the static features to detect synthetic voice
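The delta dynamic features mentioned here are usually computed with a regression over neighbouring frames. A minimal sketch, assuming the common regression formula with a half-width of two frames and edge padding; the paper's exact configuration is not stated in the talk.

```python
import numpy as np

def delta(features, width=2):
    """Regression-based delta over +/- `width` frames.

    features: array of shape (n_frames, n_dims), for example log spectra.
    Returns an array of the same shape holding the local slope over time.
    """
    n = len(features)
    # Repeat the first and last frame so every frame has neighbours
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    numerator = sum(
        k * (padded[width + k : width + k + n] - padded[width - k : width - k + n])
        for k in range(1, width + 1)
    )
    return numerator / (2 * sum(k * k for k in range(1, width + 1)))
```

Dropping the static features and keeping only `delta(features)` (and possibly `delta(delta(features))`) leaves a detector with nothing but the frame-to-frame dynamics, which is where over-smoothed synthetic speech is expected to differ.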
0:39:49there are also techniques to
0:39:51model the temporal modulation features you know when we have feature frames
0:39:57usually frame by frame with a ten millisecond
0:40:02shift in this case you
0:40:04cut a piece of the signal say fifty frames and you pass it through
0:40:10a temporal filter using the temporal filter to model it and then use
0:40:15this to
0:40:16form a supervector like this to model
0:40:21the modulation of the magnitude
0:40:27features and phase based features and it works
0:40:32well as a complementary feature
0:40:35in synthetic voice detection
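One way to read the temporal modulation idea is sketched below: cut the feature trajectory into blocks of fifty frames and take a Fourier transform along time in each block, then stack the result into one long vector. The use of a plain FFT as the temporal analysis, the mean removal and the name `modulation_features` are assumptions for illustration only.

```python
import numpy as np

def modulation_features(features, block=50):
    """Modulation spectrum of each feature dimension over blocks of frames.

    features: array of shape (n_frames, n_dims).
    Returns shape (n_blocks, (block // 2 + 1) * n_dims), one supervector per block.
    """
    n_blocks = len(features) // block
    blocks = features[: n_blocks * block].reshape(n_blocks, block, -1)
    # Remove the per-block mean so the result reflects modulation, not level
    blocks = blocks - blocks.mean(axis=1, keepdims=True)
    mod = np.abs(np.fft.rfft(blocks, axis=1))
    return mod.reshape(n_blocks, -1)
```

With a ten millisecond frame shift, a block of fifty frames spans half a second, so each supervector summarises how the magnitude or phase features fluctuate over that stretch of speech.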
0:40:38phase is something that we want
0:40:43us to pay attention to
0:40:46why people don't use phase features is mostly because it's difficult to
0:40:51describe it because it has
0:40:55many unique properties for example we have this wrapping effect when you want to see it
0:40:59you have to unwrap it this is the wrapped phase of a
0:41:06recorded signal you cannot see any patterns
0:41:08but actually
0:41:09if you think about it you have a real
0:41:13signal and then we do the fourier transform you have
0:41:17the real part and the imaginary part right and then
0:41:20the magnitude comes from these two parts and the phase also comes
0:41:24from these two parts and
0:41:26then
0:41:27by right they should present similar patterns like this many people have
0:41:31shown that if you
0:41:33do the unwrapping properly with proper normalization you see similar patterns
0:41:37the phase feature and the magnitude feature look about the same
0:41:44and this gives us opportunities i mean another new feature to look into
0:41:49in synthetic voice and converted voice people do not pay enough
0:41:53attention to
0:41:55phase and the phase feature becomes very useful for synthetic voice detection
0:42:01a to have to be too
0:42:03there are many papers on these techniques one of them is instantaneous
0:42:09frequency which is the time derivative of the phase so basically you have
0:42:13two frames
0:42:15and the magnitudes look very similar but the
0:42:19phase features
0:42:20could be very the phase sorry the phase
0:42:24could be very different
0:42:26could be very different
0:42:27so
0:42:28by
0:42:28by taking their
0:42:30difference
0:42:32as a feature
0:42:33you're able to extract it remember one thing
0:42:37the phase is wrapped at every sample in the time domain actually there is this two pi shift
0:42:42of the signal to maintain the continuity so if we want to do this you
0:42:45have to kind of unwrap it
0:42:47because usually we shift the window by ten milliseconds or twenty milliseconds not by every sample
0:42:52right
0:42:52so when you take this you want to make sure that the
0:42:55features are
0:42:56kind of the phases are continuous you have to do
0:42:59kind of a normalization
0:43:01ross a little bit
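A minimal sketch of instantaneous frequency as the frame-to-frame phase difference. Because the window moves many samples at a time, the raw difference is ambiguous by multiples of two pi, so the sketch uses the standard phase-vocoder correction: subtract the phase advance expected at each bin centre, wrap the remainder, then add the expected advance back. This is one common normalization and is assumed here; the talk does not say which one was used. Window, hop and the 1010 Hz test tone are hypothetical.

```python
import numpy as np

def frame_phase(signal, n_fft, hop):
    """Wrapped STFT phase, shape (n_frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.angle(np.fft.rfft(frames, axis=1))

def instantaneous_frequency(phase, n_fft, hop, fs):
    """Time derivative of the phase in Hz, one value per frame pair and bin."""
    bins = np.arange(phase.shape[1])
    expected = 2 * np.pi * hop * bins / n_fft        # advance of each bin centre per hop
    deviation = np.diff(phase, axis=0) - expected
    deviation = (deviation + np.pi) % (2 * np.pi) - np.pi   # wrap to [-pi, pi)
    return (expected + deviation) * fs / (2 * np.pi * hop)

fs, n_fft, hop = 16000, 400, 160
t = np.arange(fs) / fs
phase = frame_phase(np.sin(2 * np.pi * 1010 * t), n_fft, hop)
if_hz = instantaneous_frequency(phase, n_fft, hop, fs)
```

In the bin closest to the tone, `if_hz` recovers the true frequency rather than the bin centre, which shows that the phase carries information that the magnitude spectrum alone does not.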
0:43:02and then there is the group delay feature which is the frequency derivative of the phase
0:43:09you know we have a signal like this
0:43:11and you have the power spectrum which shows the two resonance peaks here
0:43:15you see and then the
0:43:18group delay also shows something like this
0:43:20and these features involve a rather complex
0:43:23mechanism but at least this shows you the initial step
0:43:27a similar pattern as the magnitude feature
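Group delay can be computed without explicit phase unwrapping by using a second transform of the signal weighted by its sample index. The sketch below uses that well-known identity and the usual sign convention (the negative of the frequency derivative of the phase); the FFT size and the small constant `eps` are assumptions for the example.

```python
import numpy as np

def group_delay(frame, n_fft=512, eps=1e-10):
    """Group delay in samples for one frame.

    Uses tau(w) = Re(X(w) * conj(Y(w))) / |X(w)|^2, where X is the spectrum of
    x[n] and Y is the spectrum of n * x[n], so no phase unwrapping is needed.
    """
    n = np.arange(len(frame))
    X = np.fft.rfft(frame, n_fft)
    Y = np.fft.rfft(n * frame, n_fft)
    return (X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + eps)
```

For a frame that is a single impulse delayed by five samples, the function returns a group delay of five at every frequency. On real speech the denominator becomes very small near spectral zeros, which is the motivation for the modified group delay variants mentioned in the talk.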
0:43:31these are different plots or spectrograms of phase features that were developed by my
0:43:39student group in last year's
0:43:42anti-spoofing challenge and if you compare them you see that the log magnitude spectrum
0:43:49magnitude
0:43:50and the group delay if you unwrap it properly actually you see
0:43:55similar patterns
0:43:56to the magnitude
0:43:58and you have many other things like modified group delay and instantaneous frequency and
0:44:04the other features you see the paper but specimen to print
0:44:08in this
0:44:09allpass also but features to do that the detection
0:44:14finally we come to last year's anti-spoofing evaluation
0:44:19it shows that
0:44:21this is
0:44:22the performance on the data of speaker recognition if you just use the standard
0:44:27gmm system and then once you attack it with the spoofing voice the synthetic voice
0:44:34the performance at twelve o a missile
0:44:41okay in the evaluation there were five
0:44:48synthetic voices
0:44:49which are used as training and development data this is called the known attacks
0:44:53you have access to the training data of the synthesizers
0:44:58and there are another five that you don't have access to they don't tell you how
0:45:03they are generated
0:45:04and then you are only given the evaluation data and you are supposed to detect the synthetic voice
0:45:09for all of them so you typically use the five
0:45:13voices to train your system and
0:45:17use the system to test
0:45:19across the ten evaluation
0:45:21voices and this is a brief summary of the results you can see that for
0:45:28the known attacks the average performance is kind of a
0:45:34for unknown attacks it goes to like four times higher error rates
0:45:38so of course this is kind of
0:45:42known beforehand you know
0:45:44if we know the signals of course you can do something you can train
0:45:48the detector using the samples and then you detect that
0:45:54actually there is one particular attack which uses a synthesizer which is
0:46:00kind of an outlier of the system
0:46:03you know
0:46:04all the systems did pretty well these are the sixteen submissions of the systems
0:46:08ranked by the performance
0:46:10most of them did very well for attack one for example they did very
0:46:15well for
0:46:18all of them
0:46:20without s ten which is the unit selection synthesizer right
0:46:25and they did pretty reasonably well
0:46:28and then when it comes to s ten the equal error rate is very high
0:46:32so basically all the features kind of failed
0:46:35for s ten
0:46:36so what is s ten
0:46:37s ten is the tts
0:46:40using
0:46:42unit selection let me play a
0:46:44sound clip to show you how it is this is s ten
0:46:59if i should
0:47:02if i should
0:47:08actually
0:47:10so we say it's night so here okay thank you can really hear this a
0:47:14so s ten presents the strongest
0:47:20attack to the speaker recognition system i believe that is because it is unit selection
0:47:24most of the samples are natural frames because we do frame-by-frame
0:47:28and the frames are natural voice except
0:47:31at the connection points which represent a minority of the
0:47:36frames
0:47:40nowadays everything must have a little bit of a deep neural network so i also
0:47:44include a neural network in my presentation
0:47:49so this is a
0:47:51very simple neural network this is not deep
0:47:55there's one layer anyway the neural network takes the speech as the input
0:48:00takes the features as input for type of features
0:48:02and then generates the output
0:48:04so
0:48:04the score that is closer to
0:48:06the right side is more like natural speech and the left side is
0:48:13more synthetic voice and you can see that s ten
0:48:18overlaps with natural speech very much s ten and natural speech give you very
0:48:22similar scores
0:48:23that means the features that we have find it difficult to differentiate them
0:48:27so in another recent research this is very recent work
0:48:31we take one hundred frames as the input to a
0:48:34convolutional neural network so you have a hundred frames and you do pooling and
0:48:39all this allows you to get a wider range of samples
0:48:44a hundred frames actually can cover kind of one
0:48:47second to make sure that in the one second there are
0:48:50some transitions of
0:48:53of
0:48:54acoustic units between
0:48:56their junctions
0:48:59in it
0:49:00we can see that s ten and natural speech kind of have a
0:49:05good separation
0:49:08and
0:49:09as a
0:49:10post evaluation study i read quite a number of papers one of them is
0:49:15a multiple of the things is given to
0:49:20features that did
0:49:21the best in the evaluation which is the so-called auditory transform basically the idea
0:49:26is in canada here you have this
0:49:29you have this
0:49:33you have this
0:49:34possible to member different few
0:49:38filters with different bandwidths and different center frequencies so you're kind of
0:49:42trying to use filters for different frequency bands
0:49:48with different bandwidths to get the coefficients
0:49:52this is actually not new the mel-scale cepstral coefficient already
0:49:58is doing this but what is different is
0:50:01in this
0:50:03set of filters it is similar to kind of a wavelet kind of
0:50:07a
0:50:07idea for low frequency you have longer windows
0:50:12but for high frequency the filter
0:50:15has a shorter
0:50:19response function in this way you get
0:50:21different resolutions for
0:50:23different frequency bands
0:50:27this is a paper this is a
0:50:30slight so there is given to the need via by any device and a big
0:50:34they
0:50:36just got a very impressive result that he's going to present in all this is
0:50:40why don't
0:50:40one to jump from one
0:50:42so i try to share with us so
0:50:44is the effect of see that this is
0:50:48spectral where that is shown that using
0:50:52constant q cepstral coefficients which is a similar concept to the auditory transform
0:50:57at the
0:50:58low frequencies there is better frequency resolution but poorer time resolution
0:51:04poorer time resolution allows
0:51:07us to
0:51:07have bigger windows in terms of time it has a bigger range
0:51:11to cover
0:51:13you know the
0:51:15the discontinuity of the features
0:51:18at the higher frequencies there is better time resolution
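The trade-off described here follows directly from keeping the ratio of centre frequency to bandwidth (the Q factor) constant. A small sketch of the resulting bin layout; the lowest frequency, the number of bins per octave and the sampling rate are hypothetical values, not the configuration used in the work being discussed.

```python
import numpy as np

def constant_q_bins(f_min=15.0, f_max=8000.0, bins_per_octave=12, fs=16000):
    """Centre frequencies (Hz) and window lengths (samples) of a constant-Q analysis."""
    # The same quality factor is used for every bin
    q = 1.0 / (2 ** (1.0 / bins_per_octave) - 1)
    n_bins = int(np.floor(bins_per_octave * np.log2(f_max / f_min))) + 1
    # Geometrically spaced centre frequencies
    freqs = f_min * 2 ** (np.arange(n_bins) / bins_per_octave)
    # Window length is inversely proportional to frequency:
    # long windows at low frequencies, short windows at high frequencies
    win_lengths = np.ceil(q * fs / freqs).astype(int)
    return freqs, win_lengths

freqs, win_lengths = constant_q_bins()
```

The long low-frequency windows give fine frequency resolution and span many frames in time, while the short high-frequency windows give fine time resolution, in contrast to the single fixed window of the short-time Fourier transform.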
0:51:22it is equal costs them to collect
0:51:26need to ship would you littering his one presentation
0:51:30so
0:51:31with these techniques they got very impressive results in the evaluation the best result
0:51:36was an equal error rate of eight point
0:51:38five percent
0:51:40and
0:51:40then with these
0:51:43they achieved like a one percent equal error rate this is really impressive
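For reference, the equal error rate quoted here is the operating point where the detector wrongly accepts spoofed speech as often as it wrongly rejects natural speech. A minimal sketch, assuming higher scores mean "more natural" and using a simple threshold sweep over the observed scores; evaluation toolkits typically interpolate more carefully.

```python
import numpy as np

def equal_error_rate(genuine_scores, spoof_scores):
    """Approximate EER from detection scores (higher score = more natural)."""
    genuine = np.asarray(genuine_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)
    thresholds = np.sort(np.concatenate([genuine, spoof]))
    # False acceptance: spoofed speech scored at or above the threshold
    far = np.array([(spoof >= t).mean() for t in thresholds])
    # False rejection: natural speech scored below the threshold
    frr = np.array([(genuine < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2
```

Perfectly separated scores give an equal error rate of zero, and fully overlapping scores give fifty percent, so the drop from about eight and a half percent to about one percent reflects a much cleaner separation of the two score distributions.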
0:51:48okay to sum up
0:51:51so
0:51:54spoofing attack has many challenges and opportunities
0:51:59most systems assume
0:52:01that the input speech is actually natural speech i don't know whether this is an opportunity or
0:52:05a challenge it depends on whether you are the attacker or you want to
0:52:08develop a
0:52:11more robust speaker verification system
0:52:14many systems are very vulnerable to attacks and we need to take special
0:52:21care to address this issue and
0:52:26and
0:52:27speech perceptual quality doesn't equal
0:52:30fewer artifacts actually in speech synthesis
0:52:34my impression is that we try to adjust
0:52:38the output signal just to please the human ears but actually in the spectrogram
0:52:42or in the phase domain it has a lot of artifacts
0:52:47that are yet to be discovered
0:52:49machines and humans listen differently
0:52:52machines
0:52:53mostly listen to frame-by-frame features i remember the days when i was in the
0:52:58company we wanted to keep those ten will be a single ip what demonstration for
0:53:03tts system to have a dialog speech recognition system
0:53:07every time i gave this demo
0:53:09people were very happy
0:53:12and
0:53:12people thought that this was a
0:53:15magical demonstration but to me it was a safe demonstration because of the lpc features on top of
0:53:20lpc features every time they get the echoes get the was correct if i talked
0:53:25to the system some something kimmy role model without leaves acoustically ninety five percent
0:53:30so machines and humans listen to different things and we need to discover this
0:53:36more
0:53:38and the studies also show from the last two years'
0:53:43publications that features are more important than the classifier
0:53:47or maybe we have not reached the level of having good features so there is a lot
0:53:52more study to be done on the features than on the classifier
0:53:55this way
0:53:57thank you
0:54:04so you how do for this presentation so we have time for a couple of
0:54:08questions
0:54:09then when you want your judgements to start
0:54:18anyone
0:54:27we get videos from terrorists
0:54:29who obviously have had the voice pitch shifted using a pitch stretch or pitch bending algorithm so
0:54:36it sounds like they speak with a very low voice either to disguise the voice or
0:54:41to sound more threatening
0:54:44the question is is there a way
0:54:47of inferring the degree of change that has been made to the pitch the pitch
0:54:51change can affect either just the fundamental frequency or formants and fundamental frequency or just formants so
0:54:58we are at a loss for any way of knowing whether and
0:55:03to
0:55:04what extent it is possible to infer
0:55:07the degree of change that has been made in order to
0:55:10change it back
0:55:14well
0:55:16we don't have
0:55:18i think for forensics you need kind of visual tools that allow you to
0:55:22do analysis i believe that
0:55:26the features that we just talked about things like instantaneous frequency the
0:55:32group delay modified group delay
0:55:35cepstral coefficients the constant q cepstral coefficients those are wonderful tools for you to
0:55:44do
0:55:45comparison let me show you just hold on a second
0:55:50so actually a
0:55:57so in the past when we
0:56:01analyzed the features
0:56:02we did
0:56:04observe some features for example this is the so-called
0:56:09relative phase shift this is natural speech this is synthetic speech and you
0:56:14can see you cannot hear
0:56:17the difference because they are all very natural you cannot really
0:56:22hear any differences but
0:56:24the phasegram actually tells you something so i believe that
0:56:28maybe it can be used as a tool
0:56:30i don't think anybody has really built it into a system for practical use yet
0:56:38you is just
0:56:51so very nice talk i was very happy to see the breadth of work in the field
0:56:56and even our work has actually been covered by some of you folks here i wanted
0:57:01to make one comment i think one of the fundamental challenges when you look at
0:57:04voice conversion most of that research is really focused on humans being able to
0:57:11assess the quality it's usually for human consumption not necessarily for speaker recognition systems
0:57:18so if you look at voice conversion technologies most end up focusing on making sure
0:57:24that the prosody is correct because that's something it's pretty easy to kind of assess
0:57:28with things like fundamental frequency and so forth so i think in the mid
0:57:33nineties we did some work where what we did was we took
0:57:38the output of natural speech and segment based speech synthesis and fed it into auditory hair
0:57:45cell models and looked at the hair cell firing characteristics on the output and what we
0:57:51saw was that in regular normal speech
0:57:54there's a natural production evolution that takes place in the articulators
0:57:58the corresponding hair cell firing characteristics also have a natural variation
0:58:03but on the synthetic side in segment based synthesis
0:58:07when you look at the hair cell firing characteristics they don't necessarily behave the same way
0:58:12so we found that was actually a very interesting way to kind of bring kind of
0:58:17the signal processing side of the hearing into the speaker assessment side
0:58:23you could actually have really high quality speech synthesis
0:58:26but the hair cell firing characteristics would be able to pick up the differences there
0:58:32certainly i think just now the example s ten the unit selection voice is quite an
0:58:37example you cannot hear anything but actually it is the strongest spoofing voice
0:58:48yes one last question and then after that we can break
0:58:55so part of what you went through in your talk are the different aspects to try
0:59:00to detect whether the voice was modified
0:59:02right and so the things in there were
0:59:06looking at the pitch the phase and so forth but in reality when
0:59:11speaker
0:59:13verification is used because of the bandwidth of the handset being delivered
0:59:17most speech coming into the systems is already gonna go through some form of vocoder
0:59:23so is that by its nature going to start to
0:59:27you know you're gonna start detecting a lot of these artifacts that are
0:59:30really gonna be natural artifacts of the communication system itself and
0:59:34i think from what i looked at it looks like most of this is based
0:59:38on inputs that are happening right from the lips into the system
0:59:43so i think that's a very good question so i think the challenge now is to
0:59:51model different types of artifacts the artifacts that the system is susceptible to for example
0:59:55you said that
0:59:57if the speech is going through a communication channel there is digital coding there
1:00:01already
1:00:03but most of them do not really manipulate the
1:00:07parameters they just try to recover the signal as much as possible
1:00:12so
1:00:14and
1:00:18at the moment the research is focused very much on
1:00:22the features they are able to surf a store scientific
1:00:25exactly if we have good features we can tell we can model them
1:00:30effectively
1:00:31and of by this is a mostly also telephone channel two different channel you mentioned
1:00:37that and ten telephone channel it is also
1:00:40you know
1:00:41analog channels digital channels and all kinds of things
1:00:46so i
1:00:50they could be issue but they also asked me about this when we do not
1:00:53this when we kind of doing this analysis the or digital
1:00:58the by actually this process the complete to analogue an income packaging
1:01:01what the effects on the data are
1:01:03we have not really studied
1:01:09okay thank you i think we have to stick to the schedule so thanks
1:01:13again purpose only how do