Hello, and welcome to our talk at the Odyssey 2020 Speaker and Language Recognition Workshop.

I am Dongsuk Yook from Korea University,

and I will be presenting our recent research on many-to-many voice conversion using cycle-consistent variational autoencoders with multiple decoders.

This is joint work with Seong-Gyun Leem, Hyung-Min Lee, and In-Chul Yoo.

We are all from Korea University.

I will first define the problem that we want to solve, which is voice conversion among multiple speakers,

and describe the key idea of the proposed method, called the CycleVAE with multiple decoders, to solve the problem.

Dr. In-Chul Yoo is then going to explain the details of the CycleVAE with multiple decoders

and show some experimental results of the proposed method, followed by some concluding remarks.

voice conversion is the task of converting the speaker-related voice characteristics in an utterance while

maintaining the linguistic information

For example, a female speaker may sound like a male speaker after voice conversion.


Voice conversion can be applied to data augmentation, for example, for the training of automatic speech recognition systems;

diverse voice generation for text-to-speech systems;

speaking assistance for foreign-language learners through accent conversion;

speech enhancement by improving the comprehensibility of the converted voice;

and personal information protection through speaker de-identification.

If we have parallel training data, which contain pairs of utterances with the same transcription spoken by different speakers,

we can simply train a neural network by providing the source speaker's utterances as the input of the network

and the target speaker's utterances as the target of the network, given a proper time alignment of the parallel utterances.

However, building a parallel corpus is a highly expensive task, sometimes even impossible,

which creates a strong demand for voice conversion methods that do not require parallel training data.

therefore recent voice conversion approaches attempt to use non-parallel training data

One such approach uses a variational autoencoder, or VAE for short,

which was originally developed as a generative model for image generation.

A VAE is composed of an encoder and a decoder.

The encoder produces a set of parameters for the posterior distribution of a latent variable z given the input data x,

whereas the decoder generates a set of parameters for the posterior distribution of the output data x given the latent variable z.

After being trained using the variational lower bound as its objective function,

the VAE can be used to generate samples of x by feeding random latent variables to the decoder.
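The variational lower bound mentioned here is the standard VAE objective; written out (this is the textbook formula, not a transcription of the slide):

```latex
\mathcal{L}_{\mathrm{VAE}}(x) =
  \mathbb{E}_{q_{\phi}(z \mid x)}\!\left[\log p_{\theta}(x \mid z)\right]
  - \mathrm{KL}\!\left(q_{\phi}(z \mid x)\,\|\,p(z)\right)
```

where the encoder parameterizes q_φ(z|x), the decoder parameterizes p_θ(x|z), and p(z) is a standard Gaussian prior.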

A VAE can be applied to voice conversion by providing the speaker identity to the decoder together with the latent variable.

The VAE is trained to reconstruct the input speech from the latent variable z and the source speaker identity X.

Here, the lowercase letter x represents an utterance, and the uppercase letter X represents a speaker identity.

To convert speech from a source speaker to a target speaker, the source speaker identity X is replaced with the target speaker identity Y.

However, due to the absence of an explicit training process for the conversion path between the source speaker and the target speaker,

the VAE-based voice conversion methods generally produce poor-quality voice.

That is, the conventional method trains the model with the self-reconstruction objective function only,

not considering the conversion path from a source speaker to a target speaker.

To solve this problem, we propose the cycle-consistent variational autoencoder, or CycleVAE for short, with multiple decoders.

It uses the cycle consistency loss and multiple decoders for explicit conversion path training, as follows.


When the input speech x is fed into the network, it passes through the encoder and is compressed into the latent variable z.

The reconstruction error is computed using the reconstructed speech x′ produced by speaker X's decoder.


Up to this point, the loss function is similar to that of the vanilla VAE, except that it does not require the speaker identity, because the decoder is used exclusively for speaker X.

The same input speech x also goes through the encoder and speaker Y's decoder,

to generate the converted speech x′ from speaker X to speaker Y, which has the same linguistic contents as the original input speech x but in speaker Y's voice.


Then the converted speech x′ goes through the encoder and speaker X's decoder to generate the converted-back speech x″, which should recover the first input speech x.

The cyclic conversion encourages the explicit training of the voice conversion path from speaker Y to speaker X without parallel training data.
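As a rough sketch of the data flow just described, here is a toy version in Python; simple linear maps stand in for the actual encoder and decoder networks, and all names and sizes are illustrative, not taken from the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT_DIM, LATENT_DIM = 36, 16  # e.g. 36 cepstral coefficients per frame

# Toy linear "encoder" and per-speaker "decoders" with random weights.
# Real models are neural networks; this only illustrates the data flow.
W_enc = rng.standard_normal((LATENT_DIM, FEAT_DIM)) * 0.1
W_dec = {
    "X": rng.standard_normal((FEAT_DIM, LATENT_DIM)) * 0.1,  # speaker X's decoder
    "Y": rng.standard_normal((FEAT_DIM, LATENT_DIM)) * 0.1,  # speaker Y's decoder
}

def encode(x):
    return W_enc @ x

def decode(z, speaker):
    return W_dec[speaker] @ z

x = rng.standard_normal(FEAT_DIM)       # a speech frame from speaker X
x_rec = decode(encode(x), "X")          # self-reconstruction path
x_conv = decode(encode(x), "Y")         # conversion path: X's content in Y's voice
x_cyc = decode(encode(x_conv), "X")     # cycle: convert back to X, should recover x

recon_loss = float(np.mean((x - x_rec) ** 2))   # stands in for the VAE lower bound
cycle_loss = float(np.mean((x - x_cyc) ** 2))   # cycle consistency term
lam = 1.0                                        # weight of the cycle term
total_loss = recon_loss + lam * cycle_loss
```

In the actual model, both terms are variational lower bounds with KL regularizers rather than plain squared errors, but the routing of x through the encoder and the two dedicated decoders is the same.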

The cycle consistency loss of the two-decoder CycleVAE for two speakers, given the input speech x, is defined as follows.
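The formula on the slide is not captured in this recording; schematically, the cycle consistency loss is a variational lower bound evaluated along the convert-and-convert-back path, something like the following (a paraphrase, not the exact slide formula):

```latex
\mathcal{L}_{\mathrm{cyc}}(x; X \!\to\! Y \!\to\! X) =
  \mathbb{E}_{q(\hat{z} \mid x')}\!\left[\log p(x \mid \hat{z}, X)\right]
  - \mathrm{KL}\!\left(q(\hat{z} \mid x')\,\|\,p(\hat{z})\right)
```

where x′ denotes the speech converted into speaker Y's voice.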

Again, the uppercase letters X and Y represent speaker identities.

Now, given the input speech x, the loss function of the CycleVAE for two speakers is the weighted sum of the above two losses, as follows,
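Reconstructing the slide's equation in spirit (not a transcription of the slide):

```latex
\mathcal{L}(x; X, Y) =
  \mathcal{L}_{\mathrm{rec}}(x; X)
  + \lambda\, \mathcal{L}_{\mathrm{cyc}}(x; X \!\to\! Y \!\to\! X)
```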


where λ is the weight of the cycle consistency loss.

Similarly, the input speech y is used to explicitly train the conversion path from speaker X to speaker Y, as well as the self-reconstruction path for speaker Y.

It can be easily extended to more than two speakers by summing over all pairs of the training speakers.

The loss function of the CycleVAE for more than two speakers can be computed as follows,
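The slide's equation is missing from the recording; in spirit it is a double sum over speaker pairs and utterances, something like:

```latex
\mathcal{L} =
  \sum_{\substack{X,\,Y \\ X \neq Y}} \; \sum_{x \in \mathcal{B}_X}
  \mathcal{L}(x; X, Y)
```

where 𝓑_X denotes speaker X's utterances.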

where the second summation is usually over a mini-batch.

The sound quality can be improved, since each decoder focuses on its own speaker's voice characteristics through the additional conversion path training,

whereas the conventional VAE must handle multiple speakers with only a single decoder trained by self-reconstruction alone.

At this point, I would like to hand the microphone over to Dr. In-Chul Yoo, who is going to explain the details of the proposed CycleVAE with multiple decoders.


thank you

I am In-Chul Yoo from Korea University.

I will explain the details of the proposed CycleVAE with multiple decoders, the experimental results, and the conclusions.

The generative adversarial network, or GAN for short, can be applied to the CycleVAE to improve the quality of the resulting speech.

The reconstructed speech x′ is retrieved from the CycleVAE by feeding the speech x from speaker X and the speaker identity of X.

The discriminator is trained to distinguish the reconstructed speech from the original speech.


For the cyclic conversion, the converted-back speech x″ is also retrieved from the CycleVAE.

The speech x is first converted to speaker Y's voice by feeding the speaker identity of Y with the latent variable z.

The converted speech is then converted back to speaker X's voice by feeding the speaker identity of X with the latent variable.

The discriminator is also trained to distinguish the resulting converted-back speech from the original speech.

The CycleVAE-GAN can be further extended to use multiple decoders.

In a similar fashion to the CycleVAE with multiple decoders, each speaker uses dedicated decoder and discriminator networks.

Since there are multiple GANs, the previous equations are modified accordingly; the modified parts are marked in red on the slide.

In this work, we used Wasserstein GANs, or WGANs for short, instead of vanilla GANs.

We designed the architecture of our models based on the paper by Kaneko and Kameoka.

All encoders, decoders, and discriminators use convolutional architectures with gated linear units, or GLUs.

The source identity vector is broadcast at each GLU layer; that is, the source identity vector is appended to the output of the previous GLU layer.
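The broadcasting step can be illustrated with a small NumPy sketch; the shapes and names here are illustrative, not taken from the actual implementation:

```python
import numpy as np

NUM_SPEAKERS, CHANNELS, TIME = 4, 8, 100  # illustrative sizes

def append_speaker_id(features, speaker_index, num_speakers=NUM_SPEAKERS):
    """Broadcast a one-hot speaker identity vector along the time axis and
    append it to the channel dimension of a (channels, time) feature map."""
    one_hot = np.zeros(num_speakers)
    one_hot[speaker_index] = 1.0
    tiled = np.repeat(one_hot[:, None], features.shape[1], axis=1)
    return np.concatenate([features, tiled], axis=0)

feats = np.random.default_rng(0).standard_normal((CHANNELS, TIME))  # previous GLU layer's output
out = append_speaker_id(feats, speaker_index=2)
```

Each GLU layer thus sees the speaker identity at every time step alongside the learned features.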

Since we assume Gaussian distributions with diagonal covariance matrices for the encoder and the decoder,

the outputs of the encoder and the decoder are pairs of means and variances.

The decoder architecture is similar to that of the encoder,

except that the target speaker identity vectors are not used for the multi-decoder CycleVAE and the multi-decoder CycleVAE-WGAN.

This is the architecture of the discriminator network.

As in the decoder, the target speaker identity vectors are not used for the multi-decoder CycleVAE and the multi-decoder CycleVAE-WGAN.

Now I will show some experimental results of the proposed method and concluding remarks.

Here is the description of the experimental setup.

We used the Voice Conversion Challenge 2018 dataset, which consists of six female and six male speakers.

We used a subset of two female speakers and two male speakers.

Each speaker has 116 utterances; we used 72 utterances for training, 9 for validation, and 35 for testing.

we used three sets of features

36 mel-cepstral coefficients, or MCCs for short, the fundamental frequency, and aperiodicities.

We trained the models with the hyperparameters shown on the slide.


We analyzed the time and space complexity of the algorithms.

The time complexity is measured by the average training time per epoch in seconds, using an RTX 2080 GPU machine.

The space complexity is measured by the number of model parameters.

By comparing the VAE and the CycleVAE with a single decoder, we can see that adding the cycle consistency increases the training time up to four times, but the number of parameters stays identical.

The same can be seen by comparing the VAE-WGAN and the CycleVAE-WGAN with a single decoder.

Using multiple decoders considerably increases the space complexity, especially when the WGAN is used, since it needs a separate discriminator for each speaker as well.

The global variance, or GV for short, of the MCCs can be used to measure the degree of over-smoothing: higher GV values correlate with sharper spectra.
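The GV is simply the per-dimension variance of the coefficient trajectories over time; a minimal sketch with illustrative names and synthetic data:

```python
import numpy as np

def global_variance(mcc):
    """Per-dimension variance of MCC trajectories over time.

    mcc has shape (frames, dims).  Over-smoothed converted speech shows
    lower GV than natural speech."""
    return np.var(mcc, axis=0)

# Toy demonstration: smoothing a trajectory lowers its global variance.
rng = np.random.default_rng(0)
natural = rng.standard_normal((200, 36))   # stand-in for natural-speech MCCs
kernel = np.ones(5) / 5.0                  # 5-frame moving average
smoothed = np.apply_along_axis(
    lambda c: np.convolve(c, kernel, mode="same"), 0, natural)

gv_natural = global_variance(natural)
gv_smoothed = global_variance(smoothed)
```

Converted speech that is over-smoothed will show lower GV than natural speech, which is what the comparison on the slide visualizes.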

We measured the GV for each of the MCC dimensions of the original source speech and of the speech converted by the conventional VAE and the proposed CycleVAE.

For the lower dimensions, the GVs of the conventional VAE and the proposed CycleVAE were similar.

However, the GVs of the CycleVAE for the higher MCC dimensions were noticeably better than those of the vanilla VAE.

If the converted and the target speech utterances contain the same linguistic information, the difference between the MCCs of the two utterances should be small.

We measured the mel-cepstral distortion, or MCD for short, for the various algorithms.
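A minimal sketch of the usual MCD computation, assuming the two sequences are already time-aligned (for example by dynamic time warping); names and data here are illustrative:

```python
import numpy as np

def mel_cepstral_distortion(mcc_ref, mcc_conv):
    """Average mel-cepstral distortion in dB between two time-aligned MCC
    sequences of shape (frames, dims); the 0th (energy) coefficient is
    conventionally excluded."""
    diff = mcc_ref[:, 1:] - mcc_conv[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))

reference = np.ones((50, 36))   # stand-in for target-speaker MCCs
converted = reference.copy()
converted[:, 1:] += 0.1         # a small constant error in every coefficient

mcd_identical = mel_cepstral_distortion(reference, reference)
mcd_shifted = mel_cepstral_distortion(reference, converted)
```

Lower MCD indicates that the converted speech is spectrally closer to the target.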

By comparing the VAE and the VAE-WGAN, we can see that the WGAN generally improves the vanilla VAE.

By comparing the VAE-WGAN and the CycleVAE with a single decoder, we can see the effectiveness of adding the cycle consistency.

By comparing the CycleVAE with a single decoder and with multiple decoders,

and the CycleVAE-WGAN with a single decoder and with multiple decoders,

we can see that the multiple decoders further improve the performance.

One interesting point to note is that the multi-decoder CycleVAE outperforms the multi-decoder CycleVAE-WGAN.

We suspect that the multi-decoder cycle consistency loss is sufficient for learning the conversion paths explicitly, so that the additional WGANs for the conversion paths may not be necessary.

We conducted two subjective evaluations: a naturalness test and a similarity test.

For the naturalness test, we measured the mean opinion score, where ten listeners evaluated the naturalness of the 48 utterances on a scale of one to five.

On average, the proposed multi-decoder CycleVAE achieved slightly higher naturalness scores than the conventional VAE.

It can also be seen that the proposed CycleVAE method showed relatively stable performance across the conversion pairs.

For the similarity test, we conducted the following experiment using 48 utterances and ten participants, as in the naturalness test.

A target speaker's utterance was played first; then the two converted utterances produced by the two methods were played in random order.

Listeners were asked to select the one more similar to the target speaker's speech, or "fair" if they could not tell the difference.

The results show that the proposed multi-decoder CycleVAE-based method outperformed the conventional VAE significantly.

Now we show some examples of voice conversion.

This is the sound of the source speaker:

because man groping in the arctic darkness

the found the elemental

And this is the target speaker:

because my well being in arctic darkness and found no matter

These are the sounds of the converted speeches:

we present and grabbing in the arctic darkness and found the elemental

five nine running our preparedness and finding a no

because and then grabbing in the arctic darkness the funny elemental

because nine broken in the arctic darkness the founding elemental

because an island hopping in the arctic darkness the founding elemental

as in and problem of the art darkness the finding a no

Here is another example of voice conversion.

this is the sound of the source speaker

the proper course to pursue is to offer your name and address

And this is the target speaker:

the proper course to pursue is to offer your name address

These are the sounds of the converted speeches:

the proper course to pursue is to offer your name inventor's

the proportion issue is to offer your main interest

the proper course to pursue is to offer your name and managers

the proper course to pursue is to offer your name and address

the proper course to pursue is to offer your name and address

the proper corresponds to is often in a way to address

Now let me give some concluding remarks.

The variational autoencoder-based voice conversion can learn many-to-many voice conversion without parallel training data.

However, it suffers from low quality due to the absence of an explicit training process for the conversion paths.

In this work, we improved the quality of the VAE-based voice conversion by using cycle consistency and multiple decoders.

The use of cycle consistency enables the network to explicitly learn the conversion paths, and the use of multiple decoders enables the network to focus on the individual target speakers' voice characteristics.

For future work, we are currently running experiments using a larger corpus consisting of more than a hundred speakers, to find out how the proposed method scales with the number of speakers.

The proposed methods can be further extended by utilizing multiple encoders, for example, using a dedicated encoder for each source speaker.

Also, replacing the vocoder with a more powerful neural vocoder, such as WaveNet or WaveRNN, may further increase the quality of the converted speech.

thank you for watching our presentation