Hello everyone. This is the presentation of our paper, "Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-Parallel Training Data", from the National University of Singapore and the Singapore University of Technology and Design.
Here is the outline of this presentation. First, I will give an introduction to emotional voice conversion and the related work. Then I will talk about our contributions, the proposed framework, the experiments, and the conclusion.
Emotional voice conversion is a voice conversion technique. It aims to convert the emotion in speech from a source emotion to a target emotion, while the speaker identity and the linguistic information are preserved. As you can see in this figure, the same utterance is spoken by the same speaker, but the emotion has been changed from one state to another. This technique has many applications in human-computer interaction, such as personalized text-to-speech and expressive conversational agents.
Emotion in speech is expressed through multiple acoustic cues, such as the spectrum and the prosody. Moreover, emotion is supra-segmental and hierarchical in nature, which makes it more difficult to convert the emotion in speech. We note that early studies only focus on spectrum conversion and have not paid enough attention to the prosody, which we think is not sufficient.
Most previous work learns the mapping from the source emotion to the target emotion with parallel training data. But in practice, parallel data is expensive and difficult to collect, which also limits the scope of applications. To eliminate the need for parallel training data, we propose to use CycleGAN to find the mappings of both spectrum and prosody.
CycleGAN was originally proposed for image translation and has shown remarkable performance on non-parallel tasks. Researchers have since successfully applied it to voice conversion and speech synthesis.
As we know, CycleGAN has three losses: the adversarial loss, the cycle-consistency loss, and the identity-mapping loss. With these three losses, the generator learns to generate the features of the target domain without relying on parallel data.
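As a minimal toy sketch of these three loss terms, the following uses scalar affine maps standing in for the real generator networks; the loss weights and "networks" here are illustrative assumptions, not the model trained in this work.

```python
import numpy as np

def adversarial_loss(d_fake):
    # Generator side of a least-squares GAN loss: push D(G(x)) toward 1.
    return float(np.mean((d_fake - 1.0) ** 2))

def cycle_consistency_loss(x, x_cycled):
    # L1 distance between x and F(G(x)): the round trip should recover x.
    return float(np.mean(np.abs(x - x_cycled)))

def identity_loss(y, y_mapped):
    # G applied to a sample already in the target domain should leave it unchanged.
    return float(np.mean(np.abs(y - y_mapped)))

# Toy "generators": invertible affine maps between the two domains.
G = lambda x: 2.0 * x + 1.0    # source -> target
F = lambda y: (y - 1.0) / 2.0  # target -> source

x = np.array([0.0, 1.0, 2.0])
l_cyc = cycle_consistency_loss(x, F(G(x)))   # 0.0 here: F inverts G exactly
l_id = identity_loss(G(x), G(G(x)))          # nonzero: this G is not an identity map
l_adv = adversarial_loss(np.array([0.8, 0.9, 1.0]))
total = l_adv + 10.0 * l_cyc + 5.0 * l_id    # loss weights are assumptions
```

Minimizing the cycle-consistency term is what allows training without parallel data: the two generators only need to be mutually consistent, not frame-aligned with a parallel target.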
Another challenge of emotional voice conversion is the prosody modeling. Prosody carries many layers of information. The fundamental frequency, which we also call F0, is considered the main factor of the intonation.
Previous studies convert F0 with a linear transformation. But as we all know, F0 varies from the micro-prosody level of short segments to the level of the whole utterance. Such simple modeling is insufficient to characterize the prosody of speech.
Some researchers propose to model F0 with the continuous wavelet transform. The continuous wavelet transform is a signal processing technique that decomposes a signal into different time scales. It can describe a signal at different temporal resolutions, and we think it is suitable for modeling hierarchical signals such as F0.
This figure shows how the continuous wavelet transform works. We use the wavelet transform to decompose F0 into ten scales. The two utterances shown here have the same linguistic content and are spoken by the same speaker. We assume that the lower scales capture the short-term variations and the higher scales capture the long-term variations. As we can see, the two utterances differ in both short-term and long-term variations, even though they are spoken by the same speaker with the same linguistic content. These variations reflect the emotional variance at the different time scales.
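A rough sketch of such a decomposition is below. It assumes a Mexican-hat (Ricker) mother wavelet and dyadic scales, which is a common choice for CWT-based F0 modeling; the exact wavelet, scale spacing, and contour values here are assumptions, not this paper's settings.

```python
import numpy as np

def ricker(points, a):
    # Mexican-hat (Ricker) mother wavelet sampled on `points` samples, width `a`.
    t = np.arange(points) - (points - 1) / 2.0
    amp = 2.0 / (np.sqrt(3.0 * a) * np.pi ** 0.25)
    return amp * (1.0 - (t / a) ** 2) * np.exp(-(t ** 2) / (2.0 * a ** 2))

def cwt_f0(f0, n_scales=10, base_scale=1.0):
    # One coefficient vector per scale; scales one octave apart, so the
    # low scales track short-term and the high scales long-term variation.
    coeffs = np.zeros((n_scales, len(f0)))
    for i in range(n_scales):
        a = base_scale * 2 ** i
        w = ricker(min(10 * int(a) + 1, len(f0)), a)
        coeffs[i] = np.convolve(f0, w, mode="same")
    return coeffs

# Toy interpolated log-F0 contour over 200 frames (hypothetical values).
f0 = np.log(150 + 30 * np.sin(np.linspace(0, 6, 200)))
scales = cwt_f0(f0)   # shape: (n_scales, n_frames) = (10, 200)
```

Each row of `scales` is then a per-frame feature, so the ten-scale representation can be treated like any other frame-level acoustic feature during training.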
So, in this paper, we propose a parallel-data-free emotional voice conversion framework. We convert both spectrum and prosody through CycleGAN training, and we also investigate different training strategies for spectrum and prosody conversion, such as separate training and joint training. Finally, the experimental results show that we outperform the baseline approaches and achieve good quality in the converted speech samples.
This is the training phase of our proposed framework. In the training phase, we train two CycleGANs for spectrum and prosody separately. We use the WORLD vocoder to extract the spectral features and F0 from the source and target utterances. The spectral features are encoded into 24-dimensional Mel-cepstral coefficients (MCEPs), and the continuous wavelet transform is used to decompose F0 into ten different scales. We then train the two CycleGANs to learn the mappings between the source and target acoustic features.
In the conversion phase, we use the trained CycleGANs to convert the spectral and prosody features, and we use the WORLD vocoder to synthesize the converted utterances. We also investigate two different training strategies in our proposed framework.
The first one is CycleGAN-joint. In this framework, we concatenate the MCEPs with the wavelet-based F0 features and feed them into a single CycleGAN. The second one is CycleGAN-separate. In this framework, we train two separate CycleGANs for spectrum and F0.
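The difference between the two strategies can be sketched in terms of the network inputs; the frame count below is a made-up example, while the 24 MCEP and 10 CWT dimensions follow the feature setup described above.

```python
import numpy as np

# Per-frame features for one utterance (200 frames is a hypothetical length).
mceps = np.random.randn(200, 24)    # 24-dimensional spectral features (MCEPs)
f0_cwt = np.random.randn(200, 10)   # 10 CWT scales of F0 per frame

# CycleGAN-joint: one model sees the concatenated 34-dimensional features.
joint_input = np.concatenate([mceps, f0_cwt], axis=1)

# CycleGAN-separate: two models, each trained on one feature stream only.
spectrum_input, prosody_input = mceps, f0_cwt
```

In the joint case one generator must model spectrum and prosody together; in the separate case each generator only models the statistics of its own stream.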
In this work, we compare three frameworks. The first one uses CycleGAN to convert only the spectrum and uses a linear transformation to convert F0; we call this framework the baseline.
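Such a linear F0 conversion is commonly implemented as a log-Gaussian normalized transformation, which matches the log-F0 mean and variance of the target emotion. The sketch below assumes that recipe, and the statistics in the example are made-up numbers.

```python
import numpy as np

def linear_f0_transform(f0_src, mu_src, std_src, mu_tgt, std_tgt):
    # Log-Gaussian normalized linear F0 conversion: shift and scale the
    # log-F0 of voiced frames to the target-emotion statistics.
    out = f0_src.copy()
    voiced = f0_src > 0               # unvoiced frames (F0 == 0) stay as-is
    lf0 = np.log(f0_src[voiced])
    out[voiced] = np.exp(mu_tgt + (lf0 - mu_src) * std_tgt / std_src)
    return out

# Hypothetical contour and statistics (not measured from any corpus).
f0 = np.array([0.0, 120.0, 130.0, 0.0, 140.0])
conv = linear_f0_transform(f0, mu_src=np.log(130.0), std_src=0.1,
                           mu_tgt=np.log(200.0), std_tgt=0.15)
```

This transform can only shift and stretch the whole contour, which is exactly the limitation discussed earlier: it cannot reshape the short-term and long-term variations independently.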
CycleGAN-joint and CycleGAN-separate refer to the two different training strategies that we talked about in the last slide.
We use an emotional speech corpus recorded by a professional American actress, and we conduct experiments from neutral to angry, sad, and surprise. For each emotion pair, we use non-parallel utterances, around three minutes of speech, for training, and ten utterances for evaluation.
For the objective evaluation, we calculate the Mel-cepstral distortion (MCD) to assess the spectrum conversion, and the Pearson correlation coefficient (PCC) of F0 to assess the performance of the prosody conversion.
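These two metrics can be sketched as follows, using their standard definitions; whether this work excludes the 0th (energy) coefficient from MCD or restricts the PCC to voiced frames is an assumption here.

```python
import numpy as np

def mel_cepstral_distortion(mc_ref, mc_conv):
    # Average frame-wise MCD in dB over time-aligned MCEP sequences,
    # excluding the 0th (energy) coefficient.
    diff = mc_ref[:, 1:] - mc_conv[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))

def f0_pcc(f0_ref, f0_conv):
    # Pearson correlation coefficient between two F0 contours.
    return float(np.corrcoef(f0_ref, f0_conv)[0, 1])

mc = np.ones((50, 25))   # identical sequences give an MCD of 0 dB
```

Lower MCD means the converted spectrum is closer to the reference, while a PCC closer to 1 means the converted F0 contour follows the shape of the reference contour.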
From these two tables, we can see that our proposed CycleGAN-separate framework outperforms both the baseline and CycleGAN-joint for all emotion pairs.
We further conduct a subjective evaluation to assess the emotion similarity. In this experiment, we conduct preference tests. As shown in these two figures, our proposed framework consistently outperforms the baseline and CycleGAN-joint. From Figure 6, we can see that most of the listeners prefer our CycleGAN-separate framework over CycleGAN-joint.
From these results, we observe that separate training is much better than joint training. We think this is because the continuous wavelet transform describes F0 over different time scales, and these different time scales make the F0 features quite different in nature from the spectral features. Joint training has to estimate the wavelet coefficients and the spectral features at the same frame, and this training strategy assumes that all of this information is correlated. With a limited number of training samples, for example the three minutes of speech in our experiments, the jointly trained CycleGAN model cannot generalize well to the emotion mapping when it encounters unseen components at run-time inference. So we think that may be the reason why the separate training is much better than the joint training in our experiments.
In conclusion, in this paper we observe that separate training of the spectrum and prosody conversion achieves better performance than joint training. The experimental results also show that our proposed emotional voice conversion framework achieves better performance than the baseline without relying on parallel training data. That is all of our presentation. Thank you for your attention.