0:00:14 Hello, my name is [inaudible]. I received my PhD degree from Inner Mongolia University. Today I will give a presentation about our paper on TTS. This is a joint work between Inner Mongolia University, the National University of Singapore, and the Singapore University of Technology and Design. The title of the paper is "WaveTTS: Tacotron-Based TTS with Joint Time-Frequency Domain Loss".
0:00:42 This is a quick outline of what I am going to talk about. We will now come to the first section.
0:00:51 Text-to-speech aims to convert text into human-like speech. With the development of deep learning, end-to-end TTS has many advantages over conventional TTS techniques. Tacotron-based TTS consists of two modules: the first one is the feature prediction network, and the second one is waveform generation. The main task of the feature prediction network is to predict frequency-domain acoustic features, while that of the waveform generation module is to convert the frequency-domain acoustic features into a time-domain waveform.
0:01:29 However, the current implementation of Tacotron with Griffin-Lim for phase reconstruction only uses a loss function derived from the spectrogram in the frequency domain, that is, a loss function that does not take the waveform into consideration during optimization. As a result, there exists a mismatch between the Tacotron optimization and the quality of the generated waveform.
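To make that mismatch concrete: a loss computed on magnitude spectrograms is blind to phase, so two noticeably different waveforms can incur almost zero frequency-domain loss. A minimal numpy sketch; the 500 Hz tone, sampling rate, and frame length are illustrative choices, not values from the talk:

```python
import numpy as np

sr, n = 16000, 1024
t = np.arange(n) / sr
x1 = np.sin(2 * np.pi * 500 * t)             # reference waveform (500 Hz sits exactly on an FFT bin)
x2 = np.sin(2 * np.pi * 500 * t + np.pi / 2) # same magnitude spectrum, shifted phase

# The magnitude spectra are numerically identical...
m1, m2 = np.abs(np.fft.rfft(x1)), np.abs(np.fft.rfft(x2))
freq_loss = np.mean((m1 - m2) ** 2)  # ~0: a frequency-domain loss sees no error

# ...while the waveforms themselves differ substantially.
time_loss = np.mean((x1 - x2) ** 2)  # ~1.0: a time-domain loss does see the error
```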
0:02:00 In this paper, we propose to add a time-domain loss function to the Griffin-Lim-based Tacotron TTS model at training time. In other words, we use both the frequency-domain loss and the time-domain loss for the training of the feature prediction model. In addition, we use SI-SDR, that is, scale-invariant signal-to-distortion ratio, to measure the quality of the time-domain waveform.
0:02:31 Now I would like to introduce the related work. The overall architecture of the Tacotron model includes a feature prediction model, which contains an encoder, an attention-based decoder, and the Griffin-Lim algorithm for waveform reconstruction. The encoder consists of two components: a CNN module that has three convolutional layers, and a bidirectional LSTM layer. The decoder consists of four components: a two-layer pre-net, two LSTM layers, a linear projection layer, and a five-convolutional-layer post-net.
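The layer counts just mentioned can be jotted down as a small descriptive sketch; only the counts come from the talk, and hyperparameters such as channel widths or unit sizes are deliberately left out rather than guessed:

```python
# Descriptive sketch of the feature prediction model, not a runnable network.
ENCODER = {
    "conv_layers": 3,              # CNN module with three convolutional layers
    "recurrent": "bidirectional LSTM",
}
DECODER = {
    "prenet_layers": 2,            # two-layer pre-net
    "lstm_layers": 2,              # two LSTM layers
    "projection": "linear",        # linear projection layer
    "postnet_conv_layers": 5,      # five-convolutional-layer post-net
}
WAVEFORM_RECONSTRUCTION = "Griffin-Lim"
```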
0:03:20 During training, we optimize the feature prediction model to minimize the frequency-domain loss between the generated mel-spectral features and the target natural mel-spectral features. This loss function is derived only from the frequency-domain acoustic features, so it fails to directly control the quality of the generated time-domain waveform. In other words, the frequency-domain loss function does not take the waveform into consideration in the optimization process.
0:04:00 To address this mismatch problem, we propose a novel training scheme for Tacotron-based TTS. The main contributions of this paper are summarized as follows. First, we study a time-domain loss for speech synthesis. Second, we improve the Tacotron-based TTS framework by proposing a novel training scheme based on a joint time-frequency domain loss. Third, we propose to use the SI-SDR metric to measure the distortion of the time-domain waveform.
0:04:36 This section looks at the framework of our proposed method. Based on the issues discussed above, we propose a time-domain loss function for Tacotron-based TTS by applying a novel training scheme that takes into account both time- and frequency-domain loss functions. In this way, we effectively reduce the mismatch between the frequency-domain features and the time-domain waveform, and improve the output speech quality. The proposed framework is called WaveTTS hereafter.
0:05:16 We shall now discuss the proposed training scheme in detail. In WaveTTS, we define two objective functions during training. The first one is the frequency-domain loss, denoted as Loss_F, which is computed over the mel-spectral features, similarly to the original Tacotron model. The second one is the proposed time-domain loss, denoted as Loss_T, which is obtained at the waveform level with the help of the Griffin-Lim iteration that predicts the time-domain signal from the mel-spectral features. Loss_F ensures that the generated mel-spectra are close to the natural reference mel-spectra, while Loss_T minimizes the error at the waveform level. We add a weighting factor to balance the two losses; the total loss function of the whole model is defined as the weighted sum shown in this equation.
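As a rough illustration of such a joint objective, here is a minimal numpy sketch; the exact form of each term and the weighting value `lam` are illustrative assumptions, not the paper's definitions:

```python
import numpy as np

def joint_loss(mel_pred, mel_ref, wav_pred, wav_ref, lam=1.0):
    """Weighted sum of a frequency-domain and a time-domain loss (illustrative forms)."""
    # Loss_F: error between predicted and reference mel-spectral features
    loss_f = np.mean((mel_pred - mel_ref) ** 2)
    # Loss_T: error at the waveform level (waveforms obtained e.g. via Griffin-Lim)
    loss_t = np.mean(np.abs(wav_pred - wav_ref))
    # lam balances the two losses, as described in the talk
    return loss_f + lam * loss_t
```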
0:06:25 The diagram also shows the complete training process of our proposed WaveTTS. WaveTTS first predicts the mel-spectral features from the given input text, then converts both the predicted and the target mel-spectra to time-domain signals using the Griffin-Lim algorithm. Finally, the joint loss function is used to optimize the WaveTTS model.
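For reference, Griffin-Lim iteratively re-estimates the phase so that the signal's magnitude spectrogram matches a target magnitude. A self-contained numpy sketch; the frame size, hop, and window are illustrative choices, and a real system would first invert the mel filterbank to a linear magnitude spectrogram, a step omitted here:

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Complex STFT with a Hann window, shape (freq, frames)."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.stack([np.fft.rfft(f) for f in frames], axis=1)

def istft(S, n_fft=512, hop=128):
    """Least-squares overlap-add inverse of the STFT above."""
    win = np.hanning(n_fft)
    n = hop * (S.shape[1] - 1) + n_fft
    x, norm = np.zeros(n), np.zeros(n)
    for f in range(S.shape[1]):
        x[f * hop:f * hop + n_fft] += np.fft.irfft(S[:, f], n_fft) * win
        norm[f * hop:f * hop + n_fft] += win ** 2
    return x / np.maximum(norm, 1e-3)

def griffin_lim(mag, n_iter=64, n_fft=512, hop=128):
    """Start from random phase and iteratively re-impose the target magnitude."""
    phase = np.exp(2j * np.pi * np.random.rand(*mag.shape))
    for _ in range(n_iter):
        S = stft(istft(mag * phase, n_fft, hop), n_fft, hop)
        phase = S / np.maximum(np.abs(S), 1e-8)
    return istft(mag * phase, n_fft, hop)
```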
0:06:58 We also employ SI-SDR, that is, scale-invariant signal-to-distortion ratio, to measure the distance between the generated waveform and the target natural speech. We note that SI-SDR is adopted only during training and is not required at runtime during inference.
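For reference, the standard SI-SDR computation projects the estimate onto the target to discard scale, then takes the energy ratio in decibels. A small numpy sketch; the mean removal and the epsilon guard are common conventions assumed here, not details from the talk:

```python
import numpy as np

def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (higher is better)."""
    # remove DC so the projection is on zero-mean signals
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # project the estimate onto the target to remove any scaling
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    s_target = alpha * target
    e_noise = estimate - s_target
    return 10 * np.log10(np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps))
```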
0:07:27 Now I would like to move on to the experiment part. We conducted TTS experiments on the LJSpeech database. We developed four systems for a comparative study. The first one is Tacotron-GL: this system has only a frequency-domain loss function, and the Griffin-Lim algorithm is used to generate the waveform at runtime. The second one is Tacotron-WaveRNN: this system also has only a frequency-domain loss function, but the WaveRNN neural vocoder is used to generate the waveform at runtime. The third one is WaveTTS-GL: the proposed WaveTTS model is trained with the joint time-frequency domain loss, and the Griffin-Lim algorithm is used both during training and for runtime synthesis. The last one is WaveTTS-WaveRNN: the proposed WaveTTS model is trained with the joint time-frequency domain loss, the Griffin-Lim algorithm is used during training, and the pretrained WaveRNN vocoder is used to synthesize speech at runtime. We also compare these systems with the ground-truth speech, denoted as GT. At runtime, Tacotron-GL and the proposed WaveTTS-GL use the Griffin-Lim algorithm with 64 iterations.
0:08:57 We conducted listening experiments to evaluate the quality of the synthesized speech. We first evaluated the sound quality of the synthesized speech in terms of mean opinion score, reported in Figure 1. We compare Tacotron-GL with WaveTTS-GL to validate the effect of the joint time-frequency domain loss. We believe that this is a fair comparison, as both frameworks use the Griffin-Lim algorithm for waveform generation during training and at runtime. As can be seen in Figure 1(a), WaveTTS-GL outperforms Tacotron-GL.
0:09:44 We then compare Tacotron-WaveRNN and WaveTTS-WaveRNN to investigate how well the predicted mel-spectral features work with a neural vocoder. We observe that, since WaveTTS is trained with the time-domain loss, it performs better when the WaveRNN vocoder is used at runtime.
0:10:10 We also compare WaveTTS-GL and WaveTTS-WaveRNN in terms of voice quality. We note that both frameworks are trained under the same conditions, while WaveTTS-WaveRNN uses the WaveRNN neural vocoder for waveform generation at runtime. As expected, WaveTTS-WaveRNN outperforms WaveTTS-GL.
0:10:37 We also conducted an A/B preference test to assess the speech quality of the proposed frameworks. Figure 2 shows that our proposed WaveTTS outperforms the baseline systems with both the Griffin-Lim and the WaveRNN vocoder at runtime.
0:11:00 We further conducted another A/B preference test to examine the effect of the number of Griffin-Lim iterations on WaveTTS training. For rapid turnaround, we applied only one and two Griffin-Lim iterations for phase reconstruction and investigated the effect in terms of voice quality. We observe that a single iteration of Griffin-Lim presents better performance than two iterations.
0:11:43 Now to conclude this paper. We proposed a novel Tacotron-based TTS implementation, called WaveTTS. We propose to use scale-invariant signal-to-distortion ratio as the time-domain loss. The proposed WaveTTS framework outperforms the baseline and achieves high-quality synthesized speech.
0:12:14 Thank you so much for taking the time to listen to this presentation. If interested, please check our web page for the speech samples. Thank you for your attention.