0:00:18 hello everyone
0:00:20 my name is Berrak Sisman from the Singapore University of Technology and Design
0:00:25 today I will be talking about generative adversarial networks for singing voice conversion, with and without parallel training data
0:00:32 we have conducted this research together with our collaborators from the National University of Singapore
0:00:42 the basic definition of singing voice conversion is that we aim to convert one singer's voice to sound like that of another singer, without changing the lyrical content
0:00:53 you can see an illustration of it here
0:00:56 we have a source singer singing a song [audio example]
0:01:01 we apply singing voice conversion and we change the singer identity
0:01:07 so that it sounds like this lady is singing the same song [audio example]
0:01:13 I would like to highlight that singing carries lexical and emotional information, all of which is being transferred from the source to the target singer
0:01:26 so in this paper we propose novel solutions to singing voice conversion based on generative adversarial networks, with and without parallel training data
0:01:37 now let's briefly talk about singing voice conversion
0:01:43 singing voice conversion is not an easy task, because singing in itself is not easy, and to mimic someone's singing is even more difficult
0:01:54 professional singers are trained to control and vary their vocal timbre, but they are bounded by the physical limits of their own voice production system
0:02:03 singing voice conversion provides an extension to one's own voice: the ability to control the voice beyond those physical limits and to express oneself in a versatile way
0:02:16 singing voice conversion has lots of applications, and some of them are listed here, such as singing synthesis and the dubbing of soundtracks, among others
0:02:26 there is also a challenge here that I would like to highlight
0:02:30 singing is a fine art, and any distortion of the timbre of the singing voice cannot be tolerated
0:02:38 as we discuss singing voice conversion, you may think of speech voice conversion; what is the difference between singing voice conversion and traditional voice conversion? well, they share a similar motivation
0:02:50 conventional speech voice conversion is also called speaker identity conversion
0:02:55 singing voice conversion differs from speech voice conversion in many ways, which are listed here
0:03:01 to start with, in traditional speech voice conversion, speech prosody, which includes speech dynamics and durational aspects, characterizes speaker individuality; therefore we need to transform it from the source to the target speaker
0:03:17 in singing voice conversion, the melody of the singing is governed by the musical score itself, so it is considered speaker independent
0:03:27 therefore in singing voice conversion, only the characteristics of the voice identity, such as the spectrum, are converted from the source to the target singer
0:03:38 so in this paper we will only focus on the spectral conversion aspect of singing voice conversion
0:03:46 before starting to talk about our proposed singing voice conversion model, I would like to briefly talk about generative adversarial networks and why we use them
0:03:56 a traditional generative adversarial network performs generative and discriminative training, as you may already know
0:04:03 generative adversarial networks have recently been shown to be effective in many fields, listed below: image generation, image translation, speech enhancement, language identification, text-to-speech synthesis, and speech voice conversion
0:04:20 in this paper, we propose to use generative adversarial networks for singing voice conversion, with and without parallel training data
0:04:35 I would like to list the contributions here; to start with, we propose a singing voice conversion framework that is based on generative adversarial networks
0:04:44 it achieves high-quality singing voice conversion without any external module, such as a speech recognizer, which is not easy to train
0:04:53 with CycleGAN, singing voice conversion can be achieved with non-parallel training data, which the baselines cannot use
0:05:00 last but not least, we reduce the reliance on a large amount of data, for both the parallel and non-parallel training scenarios
0:05:09 we would like to note that this paper reports the first successful attempt to use generative adversarial networks, CycleGAN in particular, for singing voice conversion
0:05:22 previous singing voice conversion frameworks require parallel training data, and statistical methods such as Gaussian mixture models were reported to be successful for singing voice conversion
0:05:34 we have listed some of these works here; they propose great ideas, but they do not use deep learning most of the time, and we believe deep learning has a positive impact in many fields, with singing voice conversion being no exception
0:05:49 in this paper, we propose to use deep learning to learn the essential differences between the source singing and the original target singing through a discriminative training process
0:06:00 and we further study GAN-based processing as part of the proposed solutions to singing voice conversion in a comparative study
0:06:13 now let's study the training phase of the framework; there are three main steps, provided here
0:06:20 the first step is to perform vocoder analysis to obtain the spectral and prosodic features, as provided here with the WORLD vocoder
0:06:32 the second step is to use the dynamic time warping algorithm for temporal alignment of the source and target singing spectral features; it is also provided here with the blue color
0:06:42 from this step we obtain the aligned features that we can use in training
0:06:48 the last step is to train the generative adversarial network by using the aligned source and target singing features
0:06:54 I would like to highlight one more time that we have data from the source and target singers, and they are singing the same songs
0:07:03 this is referred to as parallel training data for singing voice conversion
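the three training steps just described can be sketched in code; the following is a minimal, self-contained illustration (my own sketch, not the paper's implementation), where toy one-dimensional frames stand in for the Mel-cepstral features that a vocoder such as WORLD would extract, and dynamic time warping aligns the two parallel utterances:

```python
def dtw_align(source, target,
              dist=lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))):
    """Dynamic time warping: return index pairs aligning two feature sequences."""
    n, m = len(source), len(target)
    INF = float("inf")
    # cost[i][j] = cumulative cost of aligning source[:i] with target[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = dist(source[i - 1], target[j - 1]) + min(
                cost[i - 1][j],      # insertion
                cost[i][j - 1],      # deletion
                cost[i - 1][j - 1],  # match
            )
    # backtrack from the end to recover the warping path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        i, j = min(
            (cost[i - 1][j - 1], (i - 1, j - 1)),
            (cost[i - 1][j], (i - 1, j)),
            (cost[i][j - 1], (i, j - 1)),
        )[1]
    return path[::-1]

# toy "spectral" frames: the target is a time-stretched copy of the source
src = [[0.0], [1.0], [2.0], [3.0]]
tgt = [[0.0], [1.0], [1.0], [2.0], [3.0]]
pairs = dtw_align(src, tgt)
aligned_src = [src[i] for i, _ in pairs]   # frames duplicated where needed
aligned_tgt = [tgt[j] for _, j in pairs]
```

the aligned frame pairs are what would then be fed to the GAN training in step three.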
0:07:08 I would also like to highlight that previous studies have pointed out that in singing voice conversion it is not always necessary to transform the pitch values from the source to the target singer, assuming both singers sing in a similar key
0:07:24 and the conversion of the fundamental frequency usually has a small effect on the perceived identity of the singing voice
0:07:31 therefore, in this paper we only transform the spectral feature vectors to achieve the target singing voice identity
0:07:42 at run-time conversion, we again have three main steps, provided here
0:07:46 the first step is to extract the source singing features using vocoder analysis
0:07:52 the second step is to generate the converted singing spectral features by using the generator, which was already trained during the training phase
0:08:01 and last but not least, we generate the converted singing waveform by using the vocoder
0:08:07 I would like to highlight that in this paper, motivated by the previous studies, we do not transform F0 in the intra-gender singing voice conversion experiments
0:08:18 for cross-gender singing voice conversion experiments, we perform F0 conversion
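a common recipe for F0 conversion in the voice conversion literature is log-Gaussian normalization of the pitch contour; whether the paper uses exactly this transform is my assumption, but a sketch looks like this:

```python
import math

def convert_f0(f0_src, src_stats, tgt_stats):
    """Log-Gaussian normalized F0 transform: map the source singer's
    log-F0 distribution onto the target singer's. src_stats/tgt_stats are
    (mean, std) of log-F0 over the voiced frames of each singer."""
    mu_s, sigma_s = src_stats
    mu_t, sigma_t = tgt_stats
    out = []
    for f in f0_src:
        if f <= 0:  # unvoiced frame: keep it unvoiced
            out.append(0.0)
        else:
            out.append(math.exp((math.log(f) - mu_s) / sigma_s * sigma_t + mu_t))
    return out
```

a frame at the source mean pitch is mapped to the target mean pitch, which is why this transform is typically only needed in the cross-gender case.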
0:08:24 and this applies to all the experiments in this paper
0:08:33 so this was all about the parallel data case; what about singing voice conversion without parallel training data?
0:08:42 before we discuss singing voice conversion with CycleGAN, I would like to highlight something
0:08:49 CycleGAN learns from non-parallel training data
0:08:51 as also cited here, it was shown that CycleGAN for speech voice conversion provides a solution to model the spectral mapping
0:09:02 to the best of our knowledge, CycleGAN had not been studied for singing voice conversion before
0:09:08 in this paper, CycleGAN tries to find an optimal pseudo pair from the unpaired singing data of the singers for singing voice conversion
0:09:20 it uses an adversarial loss and a cycle-consistency loss, and we also incorporate an identity-mapping loss, as demonstrated here
0:09:31 this allows us to preserve the lyrical content of the source speaker, sorry, source singer
0:09:37 in the next slides we will be discussing very briefly why we need these loss functions
0:09:43 let's start with why we need the adversarial loss
0:09:47 in singing voice conversion, we aim to make the distribution of the converted singing features as close as possible to the distribution of the target singing features
0:09:56 as the distribution of the converted data comes closer to that of the target singing data, the converted voice sounds more like the target speaker, and we can achieve high speaker similarity in singing voice conversion
0:10:08 so why do we need the cycle-consistency loss?
0:10:12 the reason is that the adversarial loss only tells us whether the converted data follows the target singing data distribution, and it does not help to preserve the contextual information
0:10:24 with the cycle-consistency loss, we can maintain the contextual information between the source and target pair
0:10:32 with the adversarial and cycle-consistency losses, CycleGAN learns the forward and inverse mappings; however, this does not suffice to guarantee that the mappings always preserve the lyrical content
0:10:45 so an explicit identity-mapping loss is also incorporated, as demonstrated here
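the three losses just motivated can be written out explicitly; below is a minimal sketch with scalar stand-ins, where in a real system G (source-to-target) and F (target-to-source) are neural networks and the features are Mel-cepstral frames. The function names and toy feature representation are my own, not the paper's:

```python
import math

def l1(xs, ys):
    """Mean absolute error between two feature sequences."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)

def adversarial_loss(d_real, d_fake):
    """Discriminator objective: maximize log D(y) + log(1 - D(G(x))),
    given real/fake scores in (0, 1); returned as a loss to minimize."""
    return -(sum(math.log(p) for p in d_real) / len(d_real)
             + sum(math.log(1 - p) for p in d_fake) / len(d_fake))

def cycle_consistency_loss(x, F_G_x, y, G_F_y):
    """||F(G(x)) - x||_1 + ||G(F(y)) - y||_1: preserves contextual content."""
    return l1(F_G_x, x) + l1(G_F_y, y)

def identity_mapping_loss(y, G_y, x, F_x):
    """||G(y) - y||_1 + ||F(x) - x||_1: G should leave target features
    unchanged, which helps preserve the lyrical content."""
    return l1(G_y, y) + l1(F_x, x)
```

the full training objective is a weighted sum of these three terms; the weights are hyperparameters.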
0:10:54 now let's look at the experiments
0:10:57 in this paper we perform objective and subjective evaluations on a singing database that consists of audio recordings of English songs by professional singers
0:11:11 we report results for both the parallel and the non-parallel training data settings
0:11:15 in all the experiments, we use separate singing data for training and testing
0:11:23 we extract twenty-four Mel-cepstral coefficients, the logarithmic fundamental frequency, and aperiodicities
0:11:29 we normalize the source and target Mel-cepstra to zero mean and unit variance by using the statistics of the training data
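the normalization step can be illustrated with a short sketch; the key point is that the per-dimension statistics are computed on the training data only and then applied to the features (the toy two-dimensional frames below are my own placeholders, not corpus data):

```python
def zscore_stats(frames):
    """Per-dimension mean and standard deviation over the training frames."""
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / len(frames) for d in range(dims)]
    stds = [(sum((f[d] - means[d]) ** 2 for f in frames) / len(frames)) ** 0.5
            for d in range(dims)]
    return means, stds

def normalize(frames, means, stds):
    """Zero-mean, unit-variance normalization using training statistics."""
    return [[(v - m) / s for v, m, s in zip(f, means, stds)] for f in frames]

train = [[0.0, 10.0], [2.0, 30.0]]      # two toy 2-dim "Mel-cepstral" frames
means, stds = zscore_stats(train)
normed = normalize(train, means, stds)
```

at run-time the converted features are de-normalized with the target singer's training statistics before waveform synthesis.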
0:11:36 now let's look at the objective evaluation here
0:11:41 we report the Mel-cepstral distortion between the target singer's natural singing and the converted singing; as you may know, a lower Mel-cepstral distortion value indicates smaller spectral distortion
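Mel-cepstral distortion has a standard closed form; a sketch of how it is typically computed over aligned frames is below (the convention of excluding the energy coefficient c0 before the call is an assumption on my part):

```python
import math

def mel_cepstral_distortion(ref_frames, conv_frames):
    """Average MCD in dB between aligned Mel-cepstral sequences:
    MCD = (10 / ln 10) * sqrt(2 * sum_d (c_ref,d - c_conv,d)^2),
    averaged over frames; c0 is assumed to be excluded already."""
    const = 10.0 / math.log(10.0)
    total = 0.0
    for ref, conv in zip(ref_frames, conv_frames):
        total += const * math.sqrt(
            2.0 * sum((r - c) ** 2 for r, c in zip(ref, conv)))
    return total / len(ref_frames)
```

identical sequences give 0 dB, and larger spectral differences give larger values, which is why a lower MCD means a better conversion.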
0:11:53 in table one, we report all the frameworks
0:11:57 if you are interested in how we trained these networks, please note that all the models and experimental conditions are provided in the paper, so you can just go and check; for each framework we provide a one-paragraph explanation of how we trained it
0:12:13 we report male-to-male and female-to-male conversion results
0:12:18 for the DNN and the GAN, we use parallel training data from each speaker
0:12:26 if you check the DNN and the GAN, the GAN always outperforms the DNN
0:12:32 this shows that, when we have parallel training data, the GAN is a much better solution than the DNN for singing voice conversion
0:12:41 the CycleGAN case is more challenging, because we are doing non-parallel singing voice conversion, which means the lyrical content is different during the training and the data is not aligned
0:12:56 still, CycleGAN achieves comparable results to the GAN, and to the GMM and DNN baselines, even though the baselines use parallel data
0:13:06 all these results show that CycleGAN performs really well even though we do not rely on any parallel training data, and it achieves comparable or even better results than those of the DNN
0:13:24 in the next slides we report the subjective evaluation; we have more experiments in the paper, but in the interest of time I will only present some of them here in the presentation
0:13:35 we start with the mean opinion score
0:13:37 fifteen subjects participated in the listening test, and each subject listened to the converted singing voices
0:13:44 the DNN and the GAN are trained on parallel data, while CycleGAN is trained on non-parallel training data
0:13:51 if you look at the DNN and the GAN, you observe that the GAN outperforms the DNN, even though they use the same amount of training data
0:14:03 these results show that the GAN outperforms the DNN and should be preferred for singing voice conversion
0:14:10 if you look at CycleGAN, it is trained on the same amount of training data, but the data is not parallel, which makes the task more challenging
0:14:19 and for this more challenging task, CycleGAN achieves a very similar performance to that of the GAN, where the GAN uses parallel training data
0:14:29 so we believe the performance of CycleGAN is remarkable, considering that it uses non-parallel training data
0:14:38 in another experiment, we wanted to compare CycleGAN and the GAN for speaker similarity
0:14:45 in this experiment, reported here, we perform a preference test of speaker similarity between CycleGAN with non-parallel training data and the GAN with parallel training data
0:14:59 this experiment shows that CycleGAN, trained with non-parallel singing data, achieves comparable results to the GAN trained with parallel singing data
0:15:10 the subjects choose its output as the better sample 48.1 percent of the time
0:15:15 which we believe is remarkable, because learning without parallel training data is a much more challenging task than learning with a parallel training dataset
0:15:24 so we believe that CycleGAN achieves a remarkable performance in singing voice conversion when we have non-parallel training data
0:15:33 to summarize, in this paper we propose novel solutions based on generative adversarial networks for singing voice conversion, with and without parallel training data
0:15:43 the proposed CycleGAN framework, which does not need any parallel training data, does not require any external module to align the source and target singers
0:15:54 even with non-parallel training data, we show that it works really well
0:16:00 furthermore, we also show that the proposed framework performs better with less training data than the DNN, which we believe is remarkable
0:16:09 to conclude, with or without parallel training data available, generative adversarial networks achieve high-quality singing voice conversion
0:16:19 thank you for listening