0:00:14 Hello, my name is [inaudible]. I received my PhD degree from Inner Mongolia University. Today I will give a presentation about our paper on TTS. This is a joint work between Inner Mongolia University, the National University of Singapore, and the Singapore University of Technology and Design. The title of the paper is "WaveTTS: Tacotron-Based TTS with Joint Time-Frequency Domain Loss".
0:00:42 This is a quick outline of what I am going to talk about. We will now come to the first section.
0:00:51 Text-to-speech aims to convert text into human-like speech. With the development of deep learning, end-to-end TTS has many advantages over conventional TTS techniques. Tacotron-based TTS consists of two modules: the first one is the feature prediction network, and the second one is waveform generation. The main task of the feature prediction network is to predict frequency-domain acoustic features, while that of the waveform generation module is to convert the frequency-domain acoustic features into a time-domain waveform.
0:01:29 However, the current implementation of Tacotron with Griffin-Lim for phase reconstruction only uses a loss function derived from the spectrogram in the frequency domain, that is, a loss function that does not take the waveform into consideration during optimization. As a result, there exists a mismatch between the Tacotron optimization and the quality of the generated waveform.
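To make that mismatch concrete: a loss computed on magnitude spectrograms is blind to phase, so two noticeably different waveforms can incur almost zero frequency-domain loss. A minimal numpy sketch; the 500 Hz tone, sampling rate, and frame length are illustrative choices, not values from the talk:

```python
import numpy as np

sr, n = 16000, 1024
t = np.arange(n) / sr
x1 = np.sin(2 * np.pi * 500 * t)             # reference waveform (500 Hz sits exactly on an FFT bin)
x2 = np.sin(2 * np.pi * 500 * t + np.pi / 2) # same magnitude spectrum, shifted phase

# The magnitude spectra are numerically identical...
m1, m2 = np.abs(np.fft.rfft(x1)), np.abs(np.fft.rfft(x2))
freq_loss = np.mean((m1 - m2) ** 2)  # ~0: a frequency-domain loss sees no error

# ...while the waveforms themselves differ substantially.
time_loss = np.mean((x1 - x2) ** 2)  # ~1.0: a time-domain loss does see the error
```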
0:02:00 In this paper, we propose to add a time-domain loss function to the Griffin-Lim-based Tacotron TTS model at training time. In other words, we use both the frequency-domain loss and the time-domain loss for the training of the feature prediction model. In addition, we use SI-SDR, that is, scale-invariant signal-to-distortion ratio, to measure the quality of the time-domain waveform.
0:02:31 Now I would like to introduce the related work. The overall architecture of the Tacotron model includes a feature prediction model, which contains an encoder, an attention-based decoder, and the Griffin-Lim algorithm for waveform reconstruction. The encoder consists of two components: a CNN module that has three convolutional layers, and a bidirectional LSTM layer. The decoder consists of four components: a two-layer pre-net, two LSTM layers, a linear projection layer, and a five-convolutional-layer post-net.
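The layer counts just mentioned can be jotted down as a small descriptive sketch; only the counts come from the talk, and hyperparameters such as channel widths or unit sizes are deliberately left out rather than guessed:

```python
# Descriptive sketch of the feature prediction model, not a runnable network.
ENCODER = {
    "conv_layers": 3,              # CNN module with three convolutional layers
    "recurrent": "bidirectional LSTM",
}
DECODER = {
    "prenet_layers": 2,            # two-layer pre-net
    "lstm_layers": 2,              # two LSTM layers
    "projection": "linear",        # linear projection layer
    "postnet_conv_layers": 5,      # five-convolutional-layer post-net
}
WAVEFORM_RECONSTRUCTION = "Griffin-Lim"
```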
0:03:20 During training, we optimize the feature prediction model to minimize the frequency-domain loss between the generated mel-spectral features and the target natural mel-spectral features. This loss function is derived only from the frequency-domain acoustic features, so it fails to directly control the quality of the generated time-domain waveform. In other words, the frequency-domain loss function does not take the waveform into consideration in the optimization process.
0:04:00 To address this mismatch problem, we propose a novel training scheme for Tacotron-based TTS. The main contributions of this paper are summarized as follows. First, we study a time-domain loss for speech synthesis. Second, we improve the Tacotron-based TTS framework by proposing a novel training scheme based on a joint time-frequency domain loss. Third, we propose to use the SI-SDR metric to measure the distortion of the time-domain waveform.
0:04:36 This section looks at the framework of our proposed method. Based on the issues discussed above, we propose a time-domain loss function for Tacotron-based TTS by applying a novel training scheme that takes into account both time- and frequency-domain loss functions. In this way, we effectively reduce the mismatch between the frequency-domain features and the time-domain waveform, and improve the output speech quality. The proposed framework is called WaveTTS hereafter.
0:05:16 We shall now discuss the proposed training scheme in detail. In WaveTTS, we define two objective functions during training. The first one is the frequency-domain loss, denoted as Loss_F, which is computed over the mel-spectral features, similarly to the original Tacotron model. The second one is the proposed time-domain loss, denoted as Loss_T, which is obtained at the waveform level with the help of the Griffin-Lim iteration that predicts the time-domain signal from the mel-spectral features. Loss_F ensures that the generated mel-spectra are close to the natural reference mel-spectra, while Loss_T minimizes the error at the waveform level. We add a weighting factor to balance the two losses; the total loss function of the whole model is defined as the weighted sum shown in this equation.
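As a rough illustration of such a joint objective, here is a minimal numpy sketch; the exact form of each term and the weighting value `lam` are illustrative assumptions, not the paper's definitions:

```python
import numpy as np

def joint_loss(mel_pred, mel_ref, wav_pred, wav_ref, lam=1.0):
    """Weighted sum of a frequency-domain and a time-domain loss (illustrative forms)."""
    # Loss_F: error between predicted and reference mel-spectral features
    loss_f = np.mean((mel_pred - mel_ref) ** 2)
    # Loss_T: error at the waveform level (waveforms obtained e.g. via Griffin-Lim)
    loss_t = np.mean(np.abs(wav_pred - wav_ref))
    # lam balances the two losses, as described in the talk
    return loss_f + lam * loss_t
```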
0:06:25 The diagram also shows the complete training process of our proposed WaveTTS. WaveTTS first predicts the mel-spectral features from the given input text, then converts both the predicted and the target mel-spectra to time-domain signals using the Griffin-Lim algorithm. Finally, the joint loss function is used to optimize the WaveTTS model.
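For reference, Griffin-Lim iteratively re-estimates the phase so that the signal's magnitude spectrogram matches a target magnitude. A self-contained numpy sketch; the frame size, hop, and window are illustrative choices, and a real system would first invert the mel filterbank to a linear magnitude spectrogram, a step omitted here:

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Complex STFT with a Hann window, shape (freq, frames)."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.stack([np.fft.rfft(f) for f in frames], axis=1)

def istft(S, n_fft=512, hop=128):
    """Least-squares overlap-add inverse of the STFT above."""
    win = np.hanning(n_fft)
    n = hop * (S.shape[1] - 1) + n_fft
    x, norm = np.zeros(n), np.zeros(n)
    for f in range(S.shape[1]):
        x[f * hop:f * hop + n_fft] += np.fft.irfft(S[:, f], n_fft) * win
        norm[f * hop:f * hop + n_fft] += win ** 2
    return x / np.maximum(norm, 1e-3)

def griffin_lim(mag, n_iter=64, n_fft=512, hop=128):
    """Start from random phase and iteratively re-impose the target magnitude."""
    phase = np.exp(2j * np.pi * np.random.rand(*mag.shape))
    for _ in range(n_iter):
        S = stft(istft(mag * phase, n_fft, hop), n_fft, hop)
        phase = S / np.maximum(np.abs(S), 1e-8)
    return istft(mag * phase, n_fft, hop)
```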
0:06:58 We also employ SI-SDR, that is, scale-invariant signal-to-distortion ratio, to measure the distance between the generated waveform and the target natural speech. We note that SI-SDR is adopted only during training and is not required at runtime during inference.
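For reference, the standard SI-SDR computation projects the estimate onto the target to discard scale, then takes the energy ratio in decibels. A small numpy sketch; the mean removal and the epsilon guard are common conventions assumed here, not details from the talk:

```python
import numpy as np

def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (higher is better)."""
    # remove DC so the projection is on zero-mean signals
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # project the estimate onto the target to remove any scaling
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    s_target = alpha * target
    e_noise = estimate - s_target
    return 10 * np.log10(np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps))
```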
0:07:27 Now I would like to move on to the experiment part. We conducted TTS experiments on the LJSpeech database. We developed four systems for a comparative study. The first one is Tacotron-GL: this system has only a frequency-domain loss function, and the Griffin-Lim algorithm is used to generate the waveform at runtime. The second one is Tacotron-WaveRNN: this system also has only a frequency-domain loss function, but the WaveRNN neural vocoder is used to generate the waveform at runtime. The third one is WaveTTS-GL: the proposed WaveTTS model is trained with the joint time-frequency domain loss, and the Griffin-Lim algorithm is used both during training and for runtime synthesis. The last one is WaveTTS-WaveRNN: the proposed WaveTTS model is trained with the joint time-frequency domain loss, the Griffin-Lim algorithm is used during training, and the pretrained WaveRNN vocoder is used to synthesize speech at runtime. We also compare these systems with the ground-truth speech, denoted as GT. At runtime, Tacotron-GL and the proposed WaveTTS-GL use the Griffin-Lim algorithm with 64 iterations.
0:08:57 We conducted listening experiments to evaluate the quality of the synthesized speech. We first evaluated the sound quality of the synthesized speech in terms of mean opinion score, reported in Figure 1. We compare Tacotron-GL with WaveTTS-GL to validate the effect of the joint time-frequency domain loss. We believe that this is a fair comparison, as both frameworks use the Griffin-Lim algorithm for waveform generation during training and at runtime. As can be seen in Figure 1(a), WaveTTS-GL outperforms Tacotron-GL.
0:09:44 We then compare Tacotron-WaveRNN and WaveTTS-WaveRNN to investigate how well the predicted mel-spectral features work with a neural vocoder. We observe that, since WaveTTS is trained with the time-domain loss, it performs better when the WaveRNN vocoder is used at runtime.
0:10:10 We also compare WaveTTS-GL and WaveTTS-WaveRNN in terms of voice quality. We note that both frameworks are trained under the same conditions, while WaveTTS-WaveRNN uses the WaveRNN neural vocoder for waveform generation at runtime. As expected, WaveTTS-WaveRNN outperforms WaveTTS-GL.
0:10:37 We also conducted an A/B preference test to assess the speech quality of the proposed frameworks. Figure 2 shows that our proposed WaveTTS outperforms the baseline systems with both the Griffin-Lim and the WaveRNN vocoder at runtime.
0:11:00 We further conducted another A/B preference test to examine the effect of the number of Griffin-Lim iterations on WaveTTS training. For rapid turnaround, we applied only one and two Griffin-Lim iterations for phase reconstruction and investigated the effect in terms of voice quality. We observe that a single iteration of Griffin-Lim presents better performance than two iterations.
0:11:43 Now to conclude this paper. We proposed a novel Tacotron-based TTS implementation, called WaveTTS. We propose to use scale-invariant signal-to-distortion ratio as the time-domain loss. The proposed WaveTTS framework outperforms the baseline and achieves high-quality synthesized speech.
0:12:14 Thank you so much for taking the time to listen to this presentation. If interested, please check our web page for the speech samples. Thank you for your attention.