Speech Transcript - Many-to-Many Voice Conversion Using Cycle-Consistent Variational Autoencoder with Multiple Decoders

0:00:13	welcome to the speaker odyssey twenty twenty workshop
0:00:20	i'm dongsuk yook from korea university
0:00:23	and i'll be presenting our recent research effort on many-to-many voice conversion using cycle vae
0:00:31	with multi decoders
0:00:36	this is joint work with seong-gyun leem
0:00:42	keonnyeong lee
0:00:46	and in-chul yoo
0:00:49	we are all from the ai lab at korea university
0:00:54	i'll first define the problem that we want to solve
0:00:57	which is voice conversion among multiple speakers
0:01:02	and describe the key idea of the proposed method called cycle vae with
0:01:07	multi decoders to solve the problem
0:01:11	doctor in-chul yoo is going to explain the details of the cycle vae
0:01:16	with multi decoders
0:01:19	and show you some experimental results of the proposed method
0:01:24	followed by some concluding remarks
0:01:30	voice conversion is the task of converting the speaker-related voice characteristics in an utterance while
0:01:37	maintaining the linguistic information
0:01:40	for example a female speaker may sound like a male speaker using a voice conversion
0:01:45	technique
0:01:48	voice conversion can be applied to data augmentation
0:01:52	for example for the training of automatic speech recognition systems
0:01:57	various voice generation for a text-to-speech system
0:02:02	speaking assistance for foreign languages through accent conversion
0:02:08	speech enhancement by improving the comprehensibility of the converted voice
0:02:13	and personal information protection through speaker de-identification
0:02:21	if we have parallel training data
0:02:23	which contains pairs of same-transcription utterances spoken by different speakers
0:02:30	we can just train a simple neural network by providing the source speaker's utterances as
0:02:37	the input of the network
0:02:39	and the target speaker's utterances as the target of the network after the proper time
0:02:45	alignment of the parallel utterances
0:02:48	however building a parallel corpus is a highly expensive task sometimes even impossible
0:02:56	which motivates the strong need for a voice conversion method that does not require parallel
0:03:02	training data
0:03:06	therefore recent voice conversion approaches attempt to use non-parallel training data
0:03:14	one of such approaches is using a variational autoencoder
0:03:19	or vae in short
0:03:22	which was originally developed as a generative model for image generation
0:03:28	a vae is composed of an encoder and a decoder
0:03:33	the encoder produces a set of parameters for the posterior distribution of a latent variable z
0:03:40	given the input data x
0:03:44	whereas the decoder generates a set of parameters for the posterior distribution of the output data
0:03:50	x given the latent variable z
0:03:54	after being trained using the variational lower bound as its objective function
0:04:00	the vae can be used to generate samples of x by feeding random latent variables
0:04:06	to the decoder
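
[editor's sketch] The objective described above can be illustrated with a small Python example. It is only a sketch under stated assumptions, not the speakers' implementation: the prior on z is assumed to be a standard normal, the networks are assumed to output a mean and a log-variance, and `toy_encoder` and `toy_decoder` are hypothetical stand-ins for trained networks.

```python
import math
import random


def kl_to_standard_normal(mean, log_var):
    """Closed-form KL( N(mean, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mean, log_var)
    )


def gaussian_nll(x, mean, log_var):
    """Negative log-likelihood of x under a diagonal Gaussian."""
    return 0.5 * sum(
        math.log(2.0 * math.pi) + lv + (xi - m) ** 2 / math.exp(lv)
        for xi, m, lv in zip(x, mean, log_var)
    )


def reparameterize(mean, log_var, rng):
    """Draw z = mean + sigma * eps with eps ~ N(0, I)."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mean, log_var)
    ]


def negative_lower_bound(x, encoder, decoder, rng):
    """One-sample estimate of the negative variational lower bound."""
    z_mean, z_log_var = encoder(x)      # parameters of q(z | x)
    z = reparameterize(z_mean, z_log_var, rng)
    x_mean, x_log_var = decoder(z)      # parameters of p(x | z)
    reconstruction = gaussian_nll(x, x_mean, x_log_var)
    return reconstruction + kl_to_standard_normal(z_mean, z_log_var)


# Hypothetical stand-ins for trained networks (identity mean, unit variance).
def toy_encoder(x):
    return list(x), [0.0] * len(x)


def toy_decoder(z):
    return list(z), [0.0] * len(z)


loss = negative_lower_bound([0.3, -1.2], toy_encoder, toy_decoder, random.Random(0))
```

Minimizing this quantity is equivalent to maximizing the variational lower bound; the reparameterized sample keeps the estimate differentiable with respect to the encoder outputs in a real framework.
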
0:04:08	a vae can be applied to voice conversion by providing speaker identity to the decoder
0:04:15	together with the latent variable
0:04:18	the vae is trained to reconstruct the input speech from the latent variable z and
0:04:24	the source speaker identity x
0:04:27	here lowercase letter x represents an utterance
0:04:32	and uppercase letter x represents speaker identity
0:04:38	to convert a speech from a source speaker to a target speaker
0:04:42	the source speaker identity x is replaced with a target speaker identity y
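
[editor's sketch] The identity swap described above can be shown as a toy data flow in Python. Everything here is hypothetical: `SPEAKER_OFFSETS`, `toy_encoder`, and `toy_decoder` are illustrative stand-ins, and the toy encoder is told the source speaker only so that the example is exactly invertible.

```python
# Hypothetical per-speaker offsets standing in for learned voice characteristics.
SPEAKER_OFFSETS = [[1.0, 1.0], [-2.0, 0.5]]


def one_hot(index, size):
    """Speaker identity encoded as a one-hot vector."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def speaker_offset(speaker_code):
    """Mix the per-speaker offsets according to the identity vector."""
    dims = len(SPEAKER_OFFSETS[0])
    return [
        sum(c * offsets[d] for c, offsets in zip(speaker_code, SPEAKER_OFFSETS))
        for d in range(dims)
    ]


def toy_encoder(x, source_code):
    """Toy encoder: strips the source speaker's offset to obtain z."""
    return [xi - oi for xi, oi in zip(x, speaker_offset(source_code))]


def toy_decoder(z, speaker_code):
    """Toy decoder: adds the offset of whichever identity it is given."""
    return [zi + oi for zi, oi in zip(z, speaker_offset(speaker_code))]


def convert(x, source_id, target_id, num_speakers=2):
    z = toy_encoder(x, one_hot(source_id, num_speakers))
    return toy_decoder(z, one_hot(target_id, num_speakers))


utterance = [3.0, 2.0]
reconstructed = convert(utterance, source_id=0, target_id=0)  # identity kept
converted = convert(utterance, source_id=0, target_id=1)      # identity replaced
```

Training only ever exercises the first call (self-reconstruction); the second call is the conversion path that, as the talk goes on to say, is never trained explicitly in the conventional method.
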
0:04:49	however due to the absence of an explicit training process for the conversion path between
0:04:55	the source speaker and the target speaker
0:04:59	the vae based voice conversion methods generally produce poor quality voice
0:05:05	that is the conventional method trains the model with the self-reconstruction objective
0:05:11	function only
0:05:12	not considering the conversion path from a source speaker to a target speaker
0:05:19	in order to solve this problem we propose the cycle consistent variational autoencoder or cycle
0:05:27	vae in short with multi decoders
0:05:31	it uses the cycle consistency loss and multiple decoders for explicit conversion path training as
0:05:38	follows
0:05:40	when the speech x is fed into the network it passes through the encoder and is
0:05:46	compressed into the latent variable z
0:05:50	the reconstruction error is computed using the reconstructed speech x prime by the speaker x
0:05:58	decoder
0:05:59	up to this point the loss function is similar to the vanilla vae except that
0:06:05	it does not require the speaker identity because the decoder is used exclusively for
0:06:12	speaker x
0:06:14	the same input speech x goes through the encoder and the speaker y decoder
0:06:20	as well
0:06:21	to generate the converted speech x prime from speaker x to speaker y which
0:06:28	has the same linguistic contents as the original input speech x but in speaker y's
0:06:34	voice
0:06:36	then the converted speech x prime goes through the encoder and the speaker x
0:06:42	decoder to generate the converted back speech x double prime which should recover the first
0:06:49	input speech x
0:06:53	the cyclic conversion encourages the explicit training of the voice conversion path from speaker y
0:07:00	to speaker x without parallel training data
0:07:06	the cycle consistency loss of the multi decoder vae for two speakers given the input speech
0:07:13	x is defined as follows
0:07:17	again the uppercase letters x and y represent speaker identities
0:07:24	now given the input speech x the loss function of the cycle vae
0:07:29	for two speakers is the weighted sum of the above two losses as
0:07:34	follows
0:07:36	where lambda is the weight of the cycle consistency loss
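
[editor's sketch] The equations shown on the slides are not captured in this transcript. The following LaTeX block is one plausible written form that is consistent with the spoken description, not a copy of the slide. The symbols are assumptions: $q_\phi$ is the shared encoder, $p_{\theta_X}$ and $p_{\theta_Y}$ are the speaker-specific decoders, $p(z)$ is the prior, and $x_{X \to Y}$ is the speech converted from speaker X to speaker Y. Only the last line, the weighted sum with weight $\lambda$ on the cycle consistency loss, is stated directly in the talk.

```latex
\begin{align}
\mathcal{L}_{\mathrm{VAE}}(x)
  &= -\,\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_{\theta_X}(x \mid z)\big]
     + D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big) \\
\mathcal{L}_{\mathrm{cycle}}(x)
  &= -\,\mathbb{E}_{q_\phi(z' \mid x_{X \to Y})}\big[\log p_{\theta_X}(x \mid z')\big]
     + D_{\mathrm{KL}}\big(q_\phi(z' \mid x_{X \to Y})\,\|\,p(z)\big),
     \qquad x_{X \to Y} \sim p_{\theta_Y}(\,\cdot \mid z) \\
\mathcal{L}(x)
  &= \mathcal{L}_{\mathrm{VAE}}(x) + \lambda\,\mathcal{L}_{\mathrm{cycle}}(x)
\end{align}
```
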
0:07:41	similarly input speech y is used to train the conversion path from speaker x to
0:07:48	speaker y explicitly as well as the self-reconstruction path for speaker y
0:07:55	it can be easily extended for more than two speakers by summing over all pairs
0:08:01	of the training speakers
0:08:03	the loss function of the cycle vae for more than two speakers can be
0:08:08	computed as follows
0:08:10	where the second summation is usually over a mini batch
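
[editor's sketch] The extension to more than two speakers can be sketched in Python as a sum over ordered speaker pairs and over the utterances in a mini-batch. This is a simplified illustration under stated assumptions: the KL terms are omitted, a squared error stands in for the reconstruction likelihood, and `total_loss`, `toy_decoders`, and `toy_batches` are hypothetical names.

```python
def total_loss(batches, encoder, decoders, recon_loss, lam):
    """Sum self-reconstruction and cycle losses over all speaker pairs.

    batches:  dict mapping speaker name -> list of utterances (mini-batch)
    decoders: dict mapping speaker name -> that speaker's dedicated decoder
    lam:      weight of the cycle consistency term
    """
    total = 0.0
    for src, utterances in batches.items():
        for tgt in decoders:
            if tgt == src:
                continue
            for x in utterances:                      # sum over the mini-batch
                z = encoder(x)
                self_recon = decoders[src](z)         # x -> z -> x'
                converted = decoders[tgt](z)          # x -> z -> target voice
                cycled = decoders[src](encoder(converted))  # back to source
                total += recon_loss(x, self_recon) + lam * recon_loss(x, cycled)
    return total


def squared_error(a, b):
    return (a - b) ** 2


# Hypothetical scalar "networks", chosen only so the arithmetic is easy to follow.
toy_decoders = {"X": lambda z: z + 1.0, "Y": lambda z: z * 2.0}
toy_batches = {"X": [0.0], "Y": [2.0]}
loss = total_loss(toy_batches, lambda x: x, toy_decoders, squared_error, lam=0.5)
```

With N speakers the inner loop visits N times (N minus 1) ordered pairs, which is one reason the cyclic path adds training time without adding parameters.
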
0:08:15	the sound quality can be improved
0:08:18	since each decoder learns its own speaker's voice characteristics
0:08:23	by the additional conversion path training
0:08:27	while the conventional vae must handle multiple speakers with only a single decoder
0:08:32	by self-reconstruction training
0:08:37	at this point i'd like to hand over the microphone to doctor yoo
0:08:41	who is going to explain the details of the proposed cycle vae with multi
0:08:46	decoders
0:08:47	thank you
0:08:48	i'm in-chul yoo from the ai lab at korea university
0:08:53	i'll be explaining the details of the proposed cycle vae with multi
0:08:57	decoders
0:08:58	experimental results
0:09:00	and conclusions
0:09:02	the generative adversarial network or gan in short can be applied to cycle vae
0:09:06	to increase the quality of the resulting speech
0:09:10	the reconstructed speech x prime is retrieved from the cycle vae
0:09:14	by feeding the speech x from speaker x
0:09:17	and the speaker identity of x
0:09:20	the discriminator is trained to distinguish the reconstructed speech from the original
0:09:24	speech
0:09:27	the cyclic converted speech x double prime is also retrieved from the cycle
0:09:32	vae
0:09:34	it first converts the speech x to speaker y's voice
0:09:38	by feeding the speaker identity of y with the latent variable z
0:09:43	the converted speech is then converted back to speaker x's voice
0:09:48	by feeding the speaker identity of x with the latent variable
0:09:53	the discriminator is also trained to distinguish the resulting x double prime speech from the
0:09:58	original speech
0:10:01	the cycle vae gan can be further extended to use multi decoders
0:10:07	in similar fashion to the cycle vae with multi decoders
0:10:12	each speaker uses dedicated decoder and discriminator networks
0:10:17	since there are multiple gans
0:10:19	the previous equation of the gan is modified
0:10:23	the modified parts are marked in red
0:10:28	in this work we used wasserstein gans or wgan in short instead of the
0:10:33	vanilla gan
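
[editor's sketch] For readers unfamiliar with the Wasserstein variant, its two objectives can be sketched in Python as below. This is the generic textbook form, not the speakers' code. The talk does not say how the critic is constrained, so the weight clipping shown here (from the original WGAN formulation) is an assumption, and `critic` is a hypothetical scoring function.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def critic_loss(critic, real_batch, fake_batch):
    """Wasserstein critic loss; minimizing it widens score(real) - score(fake)."""
    return mean(critic(x) for x in fake_batch) - mean(critic(x) for x in real_batch)


def generator_loss(critic, fake_batch):
    """The generator (here: the decoder) tries to raise the critic's score."""
    return -mean(critic(x) for x in fake_batch)


def clip_weights(weights, limit=0.01):
    """Assumed Lipschitz control: clamp every critic weight to [-limit, limit]."""
    return [max(-limit, min(limit, w)) for w in weights]


# Hypothetical linear critic on scalar "speech" values.
toy_critic = lambda x: 2.0 * x
d_loss = critic_loss(toy_critic, real_batch=[1.0, 3.0], fake_batch=[0.0, 1.0])
g_loss = generator_loss(toy_critic, fake_batch=[0.0, 1.0])
```

In the multi decoder setting described in the talk, one such critic would exist per speaker, which is where the extra parameters discussed later come from.
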
0:10:35	we designed the architecture of our models based on the paper by kameoka et al two thousand
0:10:40	nineteen
0:10:42	all encoders decoders and discriminators use a fully convolutional architecture with gated linear
0:10:48	units or glu in short
0:10:51	the source identity vector is broadcast to each glu layer that is the source identity
0:10:58	vector c is appended to the output of the previous glu layer
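
[editor's sketch] The gating and the broadcast of the identity vector can be illustrated in plain Python. This is a simplified sketch under stated assumptions: the convolutions are left out, the gated linear unit splits its input channels in half (a common formulation), and a feature map is represented as a list of channels, each a list over time frames. The function names are hypothetical.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def glu(feature_map):
    """Gated linear unit: first half of the channels gated by the second half."""
    half = len(feature_map) // 2
    linear, gate = feature_map[:half], feature_map[half:]
    return [
        [a * sigmoid(b) for a, b in zip(linear_channel, gate_channel)]
        for linear_channel, gate_channel in zip(linear, gate)
    ]


def append_speaker_code(feature_map, speaker_code):
    """Broadcast each element of the identity vector over time and append it
    as an extra channel to the output of the previous layer."""
    num_frames = len(feature_map[0])
    broadcast = [[value] * num_frames for value in speaker_code]
    return feature_map + broadcast


previous_output = glu([[2.0, 4.0], [0.0, 0.0]])                # one channel, two frames
next_input = append_speaker_code(previous_output, [1.0, 0.0])  # plus two code channels
```

Because the code is appended at every layer rather than only at the input, each layer sees the speaker identity directly instead of relying on it surviving through earlier layers.
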
0:11:02	since we assume gaussian distribution with diagonal covariance matrices for the encoder and the decoder
0:11:09	the outputs of the encoder and the decoder are pairs of mean and variance
0:11:15	the decoder architecture is similar to that of the encoder
0:11:19	the target speaker identity vectors
0:11:22	are not used for the multi decoder cycle vae and multi decoder cycle vae wgan
0:11:29	and this is the architecture of the discriminator network
0:11:33	as in the decoder the target speaker identity vectors are not used for
0:11:38	the multi decoder cycle vae and multi decoder cycle vae wgan
0:11:43	now i will show some experimental results of the proposed method and concluding remarks
0:11:50	here are the details of the experimental setup
0:11:54	we used the voice conversion challenge two thousand eighteen dataset which consists of six female and
0:12:00	six male speakers
0:12:02	we used a subset of two female speakers and two male speakers
0:12:07	each speaker has one hundred sixteen utterances
0:12:10	we use seventy two utterances for training
0:12:13	nine utterances for validation and thirty five utterances for testing
0:12:19	we used three sets of features
0:12:21	thirty six mel-cepstral coefficients or mcc in short fundamental frequency and aperiodicities
0:12:29	we use the following hyper parameters
0:12:32	and it's more there was books a from all five on the train the beam
0:12:37	order
0:12:39	we analyze time and space complexity of the algorithm
0:12:43	the time complexity is measured by the average training time per epoch in seconds using the
0:12:48	geforce rtx two thousand eighty ti gpu machine
0:12:51	the space complexity is measured by the number of model parameters
0:12:56	by comparing vae and cycle vae with a single decoder
0:13:00	we can see that adding cycle consistency increases the training time to four times but
0:13:05	the number of parameters is identical
0:13:09	the same can be seen by comparing vae wgan and cycle vae wgan with the
0:13:14	single decoder
0:13:16	using multiple decoders considerably increases space complexity
0:13:22	especially when the wgan is applied since they need separate discriminators
0:13:27	for each speaker as well
0:13:30	the global variance or gv in short of the mel-cepstral coefficients
0:13:34	can be used to measure the degree of oversmoothing the higher gv
0:13:38	values correlate with the sharpness of the spectra
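
[editor's sketch] A minimal Python illustration of the global variance: for one utterance, it is the variance over time of each coefficient. The frame layout (a list of frames, each a list of coefficients) and the use of the population variance are assumptions made for this example.

```python
def global_variance(frames):
    """Per-coefficient variance over all frames of one utterance."""
    num_frames = len(frames)
    dims = len(frames[0])
    means = [sum(frame[d] for frame in frames) / num_frames for d in range(dims)]
    return [
        sum((frame[d] - means[d]) ** 2 for frame in frames) / num_frames
        for d in range(dims)
    ]


natural = [[1.0, 0.0], [3.0, 0.0]]    # toy utterance: two frames, two coefficients
smoothed = [[1.5, 0.0], [2.5, 0.0]]   # same mean, flatter trajectory
gv_natural = global_variance(natural)
gv_smoothed = global_variance(smoothed)
```

The flatter trajectory gives a smaller variance in the first coefficient, which is how a low value signals oversmoothed spectra.
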
0:13:43	we measured the global variance for each of the mel-cepstral coefficient indices for the
0:13:48	source speech and the converted speech by the conventional vae and
0:13:53	the proposed cycle vae
0:13:55	the gv values of the conventional vae and the proposed cycle vae for
0:14:00	lower indices and the gv values of the source were similar
0:14:03	the gv values of the cycle vae for higher indices were better than
0:14:08	those of the conventional vae
0:14:12	in the case that the target and the converted speech utterances contain the same
0:14:16	linguistic information
0:14:18	the difference between the mel-cepstra of the two speech utterances should be small
0:14:23	we measured the mel-cepstral distortion or mcd for various algorithms
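
[editor's sketch] The mel-cepstral distortion is commonly computed per frame as (10 / ln 10) times the square root of twice the summed squared coefficient differences, then averaged over frames. The Python sketch below follows that common definition. It assumes the two utterances are already time-aligned frame by frame and that the energy coefficient has been dropped; the talk does not state which conventions were used.

```python
import math


def mel_cepstral_distortion(reference, converted):
    """Average MCD in dB between two aligned sequences of coefficient frames."""
    if len(reference) != len(converted):
        raise ValueError("sequences must be time-aligned to the same length")
    scale = 10.0 / math.log(10.0)
    per_frame = [
        scale * math.sqrt(2.0 * sum((r - c) ** 2 for r, c in zip(ref_frame, conv_frame)))
        for ref_frame, conv_frame in zip(reference, converted)
    ]
    return sum(per_frame) / len(per_frame)


target = [[0.0, 1.0], [2.0, 3.0]]
same = mel_cepstral_distortion(target, target)
shifted = mel_cepstral_distortion([[0.0, 0.0]], [[1.0, 0.0]])
```

Lower values mean the converted speech is spectrally closer to the target.
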
0:14:28	by comparing vae and vae wgan we can see that the wgan
0:14:34	generally improves the vae
0:14:38	by comparing vae wgan and cycle vae wgan with a single decoder
0:14:42	we can see the effectiveness of adding cycle consistency
0:14:48	by comparing cycle vae with single decoder and multi decoder
0:14:52	and cycle vae wgan with single decoder and multi decoder
0:14:56	we can see that the multi decoder further improves the performance
0:15:01	one interesting thing to note is that the cycle vae with multi decoders outperforms
0:15:06	cycle vae wgan with multi decoders
0:15:09	we suspect that the multi decoder cycle consistency loss is sufficient to learn
0:15:14	the conversion path explicitly so that the additional wgan for the conversion path may not be
0:15:19	necessary
0:15:21	we conducted two subjective evaluations a naturalness test and a similarity test
0:15:28	for the naturalness test we measured the mean opinion score or mos
0:15:33	ten listeners evaluated the naturalness of the forty eight utterances on a scale of
0:15:38	one
0:15:39	to five
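
[editor's sketch] The mean opinion score is the arithmetic mean of the listeners' ratings. A minimal Python sketch with a range check for the one to five scale follows; the function name and the sample ratings are made up for illustration and are not results from the talk.

```python
def mean_opinion_score(ratings):
    """Average of listener ratings on the 1 to 5 naturalness scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(rating < 1 or rating > 5 for rating in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    return sum(ratings) / len(ratings)


# Made-up ratings for a single utterance, for illustration only.
example_score = mean_opinion_score([5, 4, 3])
```

In a test like the one described, the score for a method would be averaged over every listener and every utterance for that method.
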
0:15:41	on average the proposed multi decoder cycle vae has shown slightly higher naturalness scores than
0:15:47	the conventional vae
0:15:50	it can also be seen that the proposed cycle vae method has shown
0:15:54	relatively stable performance regardless of the conversion pair
0:15:58	for similarity test we conducted the following experiment
0:16:02	using forty eight utterances and ten participants as in the naturalness test
0:16:07	a target speaker's utterance was played first
0:16:11	then the two converted utterances by the two methods were played in random
0:16:15	order
0:16:16	listeners were asked to select the most similar utterance to the target speaker's speech or
0:16:22	fair if they could not decide
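
[editor's sketch] Tallying such a preference test is a matter of counting votes per option. The Python sketch below uses hypothetical option labels ("proposed", "conventional", "fair") and made-up votes; it does not reproduce any numbers from the talk.

```python
from collections import Counter


def preference_rates(votes, options=("proposed", "conventional", "fair")):
    """Fraction of votes received by each option of the similarity test."""
    if not votes:
        raise ValueError("at least one vote is required")
    counts = Counter(votes)
    unknown = set(counts) - set(options)
    if unknown:
        raise ValueError(f"unexpected options: {sorted(unknown)}")
    return {option: counts[option] / len(votes) for option in options}


# Made-up votes, for illustration only.
rates = preference_rates(["proposed", "proposed", "proposed", "fair"])
```

Reporting the "fair" share alongside the two methods keeps undecided listeners visible instead of forcing a choice.
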
0:16:25	results show that the proposed multi decoder cycle vae based vc outperformed the conventional
0:16:31	vae based vc significantly
0:16:34	now we show some examples of voice conversion
0:16:38	this is the sound of the source speaker
0:16:41	because men groping in the arctic darkness
0:16:44	had found a yellow metal
0:16:46	and the target speaker
0:16:48	because men groping in the arctic darkness had found a yellow metal
0:16:53	these are the sounds of the converted speeches
0:16:57	because men groping in the arctic darkness had found a yellow metal
0:17:02	because men groping in the arctic darkness had found a yellow metal
0:17:07	because men groping in the arctic darkness had found a yellow metal
0:17:11	because men groping in the arctic darkness had found a yellow metal
0:17:16	because men groping in the arctic darkness had found a yellow metal
0:17:21	because men groping in the arctic darkness had found a yellow metal
0:17:26	here is another example of voice conversion
0:17:32	this is the sound of the source speaker
0:17:34	the proper course to pursue is to offer your name and address
0:17:39	and the target speaker
0:17:41	the proper course to pursue is to offer your name and address
0:17:45	these are the sounds of the converted speeches
0:17:49	the proper course to pursue is to offer your name and address
0:17:54	the proper course to pursue is to offer your name and address
0:17:58	the proper course to pursue is to offer your name and address
0:18:03	the proper course to pursue is to offer your name and address
0:18:08	the proper course to pursue is to offer your name and address
0:18:13	the proper course to pursue is to offer your name and address
0:18:17	now here are some concluding remarks
0:18:20	the variational autoencoder based voice conversion can learn many-to-many voice conversion without
0:18:25	parallel training data
0:18:27	however it has low quality due to the absence of an explicit training process for the
0:18:32	conversion path
0:18:34	in this work we improved the quality of vae based voice conversion by using cycle
0:18:39	consistency and multiple decoders
0:18:42	the use of cycle consistency enables the network to explicitly learn the conversion paths and
0:18:48	the use of multiple decoders enables the network to learn individual
0:18:52	target speakers' voice characteristics
0:18:56	for future work
0:18:57	we are currently running experiments using a larger corpus consisting of more than
0:19:02	a hundred speakers
0:19:04	to find out how the proposed method scales up for a large number of speakers
0:19:09	the proposed method can be further extended by utilizing multiple encoders
0:19:13	for example using a dedicated encoder for each source speaker
0:19:18	also replacing the vocoder with a more powerful neural vocoder such as wavenet
0:19:23	or wavernn can increase the performance as well
0:19:28	thank you for watching our presentation

Many-to-Many Voice Conversion Using Cycle-Consistent Variational Autoencoder with Multiple Decoders

Voice Conversion and Synthesis

Dongsuk Yook, Seong-Gyun Leem, Keonnyeong Lee, In-Chul Yoo