Speech Transcript - Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-Parallel Training Data

0:00:16	Welcome to the presentation of our Speaker Odyssey paper, "Transforming Spectrum and Prosody
0:00:22	for Emotional Voice Conversion
0:00:24	with Non-Parallel Training Data",
0:00:26	from the National University of Singapore
0:00:29	and the Singapore University of Technology and Design.
0:00:38	This is the outline of this presentation.
0:00:41	First, I'd like to give an introduction to emotional voice conversion
0:00:45	and some related work.
0:00:46	Then I will talk about our contributions,
0:00:49	proposed framework,
0:00:51	experiments,
0:00:52	and conclusion.
0:00:55	Emotional voice conversion is a voice conversion technique.
0:00:59	It aims to convert the emotion in the speech
0:01:02	from the source emotion to the target emotion.
0:01:06	In the meantime, the speaker identity and linguistic information should be preserved.
0:01:13	As you can see in this figure,
0:01:15	the same utterance is spoken by the same speaker,
0:01:19	but the
0:01:20	emotion has been changed. This technique enables many applications in human-
0:01:28	computer interaction,
0:01:30	such as personalized
0:01:32	text-to-speech,
0:01:33	social robots, and conversational agents.
0:01:39	As we all know,
0:01:40	emotion is conveyed through multiple signal attributes, which can be classified
0:01:46	as spectrum and
0:01:48	prosody.
0:01:49	Moreover,
0:01:50	emotion is also supra-segmental
0:01:53	and hierarchical in nature, which makes it more difficult
0:01:57	to convert the emotion in the speech. We note that early studies only
0:02:02	focus on the spectrum conversion
0:02:05	and haven't given much attention to the prosody.
0:02:09	We think this is not sufficient.
0:02:11	And most previous work requires
0:02:14	parallel training data
0:02:15	from the source and the target emotions.
0:02:19	But in practice, parallel data is usually difficult to collect,
0:02:25	which also limits the scope of applications.
0:02:31	To address these limitations
0:02:33	and eliminate the need for parallel training data, we propose to use CycleGAN
0:02:39	to find the mappings
0:02:40	of spectrum and prosody.
0:02:43	CycleGAN was proposed for
0:02:46	image translation and has
0:02:49	achieved remarkable performance
0:02:50	on non-parallel tasks.
0:02:53	Researchers
0:02:54	have successfully applied it to voice conversion and speech synthesis.
0:02:59	CycleGAN has three losses:
0:03:02	adversarial loss,
0:03:04	cycle-consistency loss,
0:03:05	and identity-mapping loss.
0:03:08	With these three losses,
0:03:09	CycleGAN can learn the mapping between the features of the source and target domains
0:03:13	without any parallel data.
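
The three losses named above can be illustrated with a small sketch. It is not the implementation from the talk: the function names, the toy generators and discriminators, and the weights `lambda_cyc` and `lambda_id` are assumptions made for illustration, and the adversarial term is written in the common negative-log form.

```python
import math

def l1(a, b):
    """Mean absolute difference between two equal-length feature vectors."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)

def cyclegan_generator_losses(x, y, G_xy, G_yx, D_x, D_y,
                              lambda_cyc=10.0, lambda_id=5.0):
    """Generator-side losses for one source frame x and one target frame y.

    G_xy / G_yx map source->target and target->source features.
    D_x / D_y return the probability that a frame is real in each domain.
    The loss weights are illustrative assumptions, not values from the talk.
    """
    fake_y = G_xy(x)
    fake_x = G_yx(y)
    # Adversarial loss: converted frames should fool the discriminators.
    adversarial = -math.log(D_y(fake_y)) - math.log(D_x(fake_x))
    # Cycle-consistency loss: converting there and back should recover the input.
    cycle = l1(G_yx(fake_y), x) + l1(G_xy(fake_x), y)
    # Identity-mapping loss: a frame already in the output domain should stay unchanged.
    identity = l1(G_xy(y), y) + l1(G_yx(x), x)
    total = adversarial + lambda_cyc * cycle + lambda_id * identity
    return {"adversarial": adversarial, "cycle": cycle,
            "identity": identity, "total": total}

# Toy stand-ins for the networks (hypothetical, for illustration only).
def to_target(frame):
    return [v + 1.0 for v in frame]

def to_source(frame):
    return [v - 1.0 for v in frame]

def undecided(frame):
    return 0.5  # a discriminator that cannot tell real from converted

losses = cyclegan_generator_losses([0.0, 1.0], [1.0, 2.0],
                                   to_target, to_source, undecided, undecided)
```

In this toy setting the two generators are exact inverses, so the cycle term vanishes while the identity term does not. Neither term needs paired source and target frames, which is why no parallel data is required.
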
0:03:20	Another challenge in emotion conversion is prosody modeling.
0:03:25	[unclear] prosody information [unclear].
0:03:31	Fundamental frequency, which we also call
0:03:34	F0,
0:03:35	is a
0:03:36	main factor
0:03:37	of intonation.
0:03:39	Early studies
0:03:41	convert F0
0:03:42	through a linear transformation,
0:03:44	but as we all know,
0:03:46	F0 varies from the micro-prosody level, over syllables and words
0:03:51	and phrases,
0:03:53	to the whole utterance level.
0:03:56	Such modeling is too simple to characterize
0:04:00	the speech prosody variations.
0:04:02	Some researchers propose to
0:04:04	model F0 with the
0:04:06	continuous wavelet transform.
0:04:08	The continuous wavelet transform
0:04:11	is a signal processing technique
0:04:13	which is used to
0:04:14	decompose the signal
0:04:16	into different time domains.
0:04:20	It can describe the prosody
0:04:22	at different time resolutions,
0:04:25	and we think
0:04:26	it is suitable for modeling hierarchical signals
0:04:30	such as F0.
0:04:36	This figure shows how the
0:04:38	continuous wavelet transform works.
0:04:41	We use the wavelet transform
0:04:43	to decompose the F0
0:04:45	into ten scales.
0:04:46	These two utterances have the same linguistic content and are spoken by the same speaker, and we assume
0:04:52	that lower scales
0:04:54	can capture the short-term variations and higher scales can capture the long-term variations.
0:05:01	As we can see, these two utterances
0:05:03	vary from the short-term to the long-term variations,
0:05:06	even though they are spoken by the same speaker with the same linguistic content.
0:05:12	These variations reflect the emotional variance
0:05:16	at the different time scales of the utterance.
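
The ten-scale decomposition described above can be sketched as follows. This is a minimal illustration, not the setup from the talk: the Mexican hat mother wavelet, the one-octave spacing between scales, the `base_scale` value, and the toy contour are all assumptions.

```python
import math

def mexican_hat(t):
    """Mexican hat mother wavelet (normalized second derivative of a Gaussian)."""
    norm = 2.0 / (math.sqrt(3.0) * math.pi ** 0.25)
    return norm * (1.0 - t * t) * math.exp(-t * t / 2.0)

def cwt_decompose(signal, num_scales=10, base_scale=1.0):
    """Decompose a 1-D contour into num_scales components by direct convolution.

    Scale i uses a wavelet of width base_scale * 2**(i + 1) frames, so small
    indices follow short-term variations and large indices follow long-term ones.
    The octave spacing and base_scale are assumptions for this sketch.
    """
    n = len(signal)
    components = []
    for i in range(num_scales):
        scale = base_scale * 2.0 ** (i + 1)
        row = []
        for b in range(n):  # position of the wavelet along the contour
            acc = 0.0
            for k in range(n):
                acc += signal[k] * mexican_hat((k - b) / scale)
            row.append(acc / math.sqrt(scale))
        components.append(row)
    return components

# Toy contour: a slow phrase-level movement plus a fast syllable-level ripple.
contour = [0.5 * math.sin(2 * math.pi * k / 60) + 0.1 * math.sin(2 * math.pi * k / 6)
           for k in range(60)]
components = cwt_decompose(contour)  # 10 rows, one per time scale
```

Each row of `components` has the same length as the contour, so every frame ends up with a ten-dimensional prosody feature. In practice the F0 contour is usually interpolated over unvoiced regions and normalized before the transform; that step is omitted here.
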
0:05:22	So in this paper,
0:05:23	we propose a parallel-data-free emotional voice conversion framework.
0:05:28	And we also show the effect of prosody
0:05:32	on emotional voice conversion.
0:05:35	We convert the spectral and prosody features with CycleGAN effectively, and we
0:05:41	also
0:05:42	investigate
0:05:43	different
0:05:44	training strategies
0:05:46	for spectrum and prosody conversion,
0:05:48	such as
0:05:49	separate training and joint training.
0:05:52	In addition,
0:05:53	experimental results
0:05:56	show that we outperform the baseline approaches and achieve high-quality converted
0:06:03	speech samples.
0:06:07	This is the training phase of our proposed framework.
0:06:11	In the training phase,
0:06:12	we train two CycleGANs for spectrum and prosody separately.
0:06:17	We use the WORLD vocoder
0:06:19	to extract the spectral features and F0 from the source and target utterances.
0:06:25	Then we
0:06:26	encode the spectral features into 24-dimensional MCEPs
0:06:31	and use the wavelet transform to decompose F0 into ten different scales.
0:06:37	And we train these two CycleGANs for spectrum and prosody separately to learn
0:06:42	the mappings between the source and target acoustic features.
0:06:50	In the conversion phase,
0:06:52	we use
0:06:53	the two trained CycleGANs
0:06:55	to convert the spectrum and prosody.
0:06:58	We use the WORLD vocoder to synthesize the converted utterances.
0:07:02	We also investigate two different training strategies for our proposed framework.
0:07:08	The first one is
0:07:10	CycleGAN-Joint.
0:07:11	In this framework,
0:07:12	we concatenate the MCEPs
0:07:14	with the CWT-based F0 features
0:07:18	as the input to the CycleGAN.
0:07:21	The second one is CycleGAN-Separate.
0:07:24	In this framework,
0:07:26	we train two separate CycleGANs for spectrum and prosody.
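
The difference between the two strategies is mainly in how the network input is assembled, which a short sketch can make concrete. The 24 spectral and 10 prosody dimensions come from the talk; the function names and the frame-by-frame list layout are assumptions.

```python
def joint_inputs(mcep_frames, f0_cwt_frames):
    """Joint strategy: one feature vector per frame, spectrum and prosody concatenated."""
    if len(mcep_frames) != len(f0_cwt_frames):
        raise ValueError("both feature streams must have the same number of frames")
    return [list(m) + list(c) for m, c in zip(mcep_frames, f0_cwt_frames)]

def separate_inputs(mcep_frames, f0_cwt_frames):
    """Separate strategy: two independent streams, one per CycleGAN."""
    return [list(m) for m in mcep_frames], [list(c) for c in f0_cwt_frames]

# Placeholder features for a 5-frame utterance (values are dummies).
num_frames = 5
mceps = [[0.0] * 24 for _ in range(num_frames)]   # 24-dimensional spectral features
f0_cwt = [[0.0] * 10 for _ in range(num_frames)]  # ten CWT scales of F0

joint = joint_inputs(mceps, f0_cwt)                              # 34 values per frame
spectrum_stream, prosody_stream = separate_inputs(mceps, f0_cwt)  # 24 and 10 per frame
```

With the joint layout a single model has to predict all 34 values of a frame together, whereas the separate layout lets each model see only its own stream.
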
0:07:34	In this work, we compare three frameworks.
0:07:38	In the first one, we use CycleGAN to convert the spectrum
0:07:41	and use a
0:07:43	linear transformation to convert the prosody.
0:07:46	This framework we call the baseline.
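
The talk does not spell out the linear transformation used for the baseline's prosody conversion. A minimal sketch, assuming the commonly used log-Gaussian normalized form (matching the mean and standard deviation of log-F0 between source and target), could look like this; the function names and the example statistics are hypothetical.

```python
import math
import statistics

def log_f0_stats(f0_values):
    """Mean and standard deviation of log-F0 over voiced frames (F0 > 0)."""
    logs = [math.log(f) for f in f0_values if f > 0]
    return statistics.mean(logs), statistics.pstdev(logs)

def convert_f0_linear(f0_source, source_stats, target_stats):
    """Shift and scale log-F0 so its statistics match the target emotion.

    Unvoiced frames (F0 == 0) are passed through unchanged.
    """
    mu_s, sd_s = source_stats
    mu_t, sd_t = target_stats
    converted = []
    for f in f0_source:
        if f <= 0:
            converted.append(0.0)
        else:
            log_f = (math.log(f) - mu_s) * sd_t / sd_s + mu_t
            converted.append(math.exp(log_f))
    return converted

# Hypothetical statistics: the target emotion is higher and more variable in pitch.
source_stats = (math.log(100.0), 0.1)
target_stats = (math.log(200.0), 0.2)
converted = convert_f0_linear([0.0, 100.0, 110.0], source_stats, target_stats)
```

The transformation applies one global shift and scale to every frame, so it cannot change the shape of the contour at individual time scales.
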
0:07:49	And
0:07:51	CycleGAN-Joint and CycleGAN-Separate refer to the two different training strategies of
0:07:56	CycleGAN
0:07:57	that we talked about in the last slides.
0:08:00	We use an
0:08:02	emotional
0:08:03	speech corpus, which is recorded by a
0:08:06	professional American actress.
0:08:09	We conduct experiments from neutral to angry, sad, and surprise.
0:08:14	For each emotion combination,
0:08:16	we use [unclear] non-parallel utterances,
0:08:20	around three minutes, for training and ten utterances for evaluation.
0:08:27	For the objective
0:08:29	evaluation,
0:08:30	we calculate the Mel-cepstral distortion to measure the spectrum distortion,
0:08:34	and we calculate the RMSE and PCC of F0 to assess
0:08:38	the performance of the prosody conversion.
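
The three objective measures mentioned above have standard definitions, sketched below. The sketch assumes that converted and target frames are already time-aligned and uses the usual MCD constant; the function names and the toy values are illustrative.

```python
import math

def mel_cepstral_distortion(converted, target):
    """Average Mel-cepstral distortion in dB over time-aligned frames."""
    const = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for c, t in zip(converted, target):
        total += const * math.sqrt(sum((a - b) ** 2 for a, b in zip(c, t)))
    return total / len(converted)

def rmse(a, b):
    """Root mean squared error between two equal-length F0 contours."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)) / len(a))

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length F0 contours."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)

# Toy values: lower MCD and RMSE are better, PCC closer to 1 is better.
mcd = mel_cepstral_distortion([[0.0, 0.0]], [[1.0, 0.0]])
f0_rmse = rmse([0.0, 0.0], [3.0, 4.0])
f0_pcc = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

RMSE penalizes absolute F0 differences, while PCC only rewards contours that move in the same direction, so the two measures are complementary.
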
0:08:40	From these two tables,
0:08:42	we can see
0:08:43	that our proposed
0:08:46	CycleGAN-Separate framework
0:08:48	outperforms the baseline and the CycleGAN-Joint framework for all emotion pairs.
0:08:56	We also further conduct
0:08:59	a subjective evaluation to assess the emotion similarity. In this experiment, we conduct
0:09:05	the preference test.
0:09:08	As we can see from these two figures,
0:09:09	our proposed framework consistently outperforms the baseline and CycleGAN-Joint.
0:09:18	And from Figure 6, we can see that most of the listeners
0:09:22	choose our
0:09:24	CycleGAN-
0:09:25	Separate framework
0:09:26	rather than CycleGAN-Joint.
0:09:34	So,
0:09:34	from the results,
0:09:35	separate training is better;
0:09:39	why is the separate training much better than the joint training?
0:09:43	We think it is because the wavelet analysis decomposes F0 into different time scales,
0:09:48	and these different time scales make the prosody
0:09:52	[unclear] content-independent.
0:09:56	Note that
0:09:57	joint training
0:09:59	estimates the wavelet transform coefficients with the spectral
0:10:04	features at the frame level,
0:10:06	and
0:10:07	this training strategy assumes that prosody is content-dependent.
0:10:13	So,
0:10:13	with a limited number of training samples,
0:10:17	for example, three minutes of speech
0:10:19	in our experiments,
0:10:21	with joint training
0:10:22	the CycleGAN model
0:10:24	cannot
0:10:25	generalize very well to the emotion mapping
0:10:28	with unseen content at
0:10:30	run-time
0:10:31	inference.
0:10:34	So we think that's the main reason why the separate training is much
0:10:38	better than the joint training in our experiments.
0:10:45	In this
0:10:47	paper, we show that
0:10:49	separate training of spectrum and prosody
0:10:51	can achieve
0:10:52	better performance than joint training,
0:10:55	and the experimental results also
0:10:58	show that our proposed emotional voice conversion framework can achieve better performance than the baseline
0:11:06	without parallel data.
0:11:09	And
0:11:11	this is all
0:11:12	for our presentation. Thanks for your attention.

Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-Parallel Training Data

Voice Conversion and Synthesis

Kun Zhou, Berrak Sisman, Haizhou Li