0:00:18 hello everyone
0:00:20 my name is Berrak Sisman from the Singapore University of Technology and Design
0:00:25 today I will be talking about generative adversarial networks for singing voice conversion, with and without parallel training data
0:00:32 we have conducted this research together with our collaborators from the National University of Singapore
0:00:42 the basic definition of singing voice conversion is that we aim to convert one singer's voice to sound like that of another singer, without changing the lyrical content
0:00:53 you can see an illustration of it here
0:00:56 we have a source singer singing a song [audio example]
0:01:01 we apply singing voice conversion and we change the singer identity
0:01:07 so that it sounds like this lady is singing the same song [audio example]
0:01:13 I would like to highlight that singing carries lexical and emotional information, all of which is being transferred from the source to the target singer
0:01:26 so in this paper we propose novel solutions to singing voice conversion based on generative adversarial networks, with and without parallel training data
0:01:37 now let's briefly talk about singing voice conversion
0:01:43 singing voice conversion is not an easy task, because singing in itself is not easy, and to mimic someone's singing is even more difficult
0:01:54 professional singers are trained to control and vary their vocal timbre, but they are bounded by the physical limits of their own voice production system
0:02:03 singing voice conversion provides an extension to one's own voice: the ability to control the voice beyond those physical limits and to express oneself in a versatile way
0:02:16 singing voice conversion has lots of applications, and some of them are listed here, such as singing synthesis and the dubbing of soundtracks, among others
0:02:26 there is also a challenge here that I would like to highlight
0:02:30 singing is a fine art, and any distortion of the timbre of the singing voice cannot be tolerated
0:02:38 as we discuss singing voice conversion, you may think of speech voice conversion; what is the difference between singing voice conversion and traditional voice conversion? well, they share a similar motivation
0:02:50 conventional speech voice conversion is also called speaker identity conversion
0:02:55 singing voice conversion differs from speech voice conversion in many ways, which are listed here
0:03:01 to start with, in traditional speech voice conversion, speech prosody, which includes speech dynamics and durational aspects, characterizes speaker individuality; therefore we need to transform it from the source to the target speaker
0:03:17 in singing voice conversion, the melody of the singing is governed by the musical score itself, so it is considered speaker independent
0:03:27 therefore in singing voice conversion, only the characteristics of the voice identity, such as the spectrum, are converted from the source to the target singer
0:03:38 so in this paper we will only focus on the spectral conversion aspect of singing voice conversion
0:03:46 before starting to talk about our proposed singing voice conversion model, I would like to briefly talk about generative adversarial networks and why we use them
0:03:56 a traditional generative adversarial network performs generative and discriminative training, as you may already know
0:04:03 generative adversarial networks have recently been shown to be effective in many fields, listed below: image generation, image translation, speech enhancement, language identification, text-to-speech synthesis, and speech voice conversion
0:04:20 in this paper, we propose to use generative adversarial networks for singing voice conversion, with and without parallel training data
0:04:35 I would like to list the contributions here; to start with, we propose a singing voice conversion framework that is based on generative adversarial networks
0:04:44 it achieves high-quality singing voice conversion without any external module, such as a speech recognizer, which is not easy to train
0:04:53 with CycleGAN, singing voice conversion can be achieved with non-parallel training data, which the baselines cannot use
0:05:00 last but not least, we reduce the reliance on a large amount of data, for both the parallel and non-parallel training scenarios
0:05:09 we would like to note that this paper reports the first successful attempt to use generative adversarial networks, CycleGAN in particular, for singing voice conversion
0:05:22 previous singing voice conversion frameworks require parallel training data, and statistical methods such as Gaussian mixture models were reported to be successful for singing voice conversion
0:05:34 we have listed some of these works here; they propose great ideas, but they do not use deep learning most of the time, and we believe deep learning has a positive impact in many fields, with singing voice conversion being no exception
0:05:49 in this paper, we propose to use deep learning to learn the essential differences between the source singing and the original target singing through a discriminative training process
0:06:00 and we further study GAN-based processing as part of the proposed solutions to singing voice conversion in a comparative study
0:06:13 now let's study the training phase of the framework; there are three main steps, provided here
0:06:20 the first step is to perform vocoder analysis to obtain the spectral and prosodic features, as provided here with the WORLD vocoder
0:06:32 the second step is to use the dynamic time warping algorithm for temporal alignment of the source and target singing spectral features; it is also provided here with the blue color
0:06:42 from this step we obtain the aligned features that we can use in training
0:06:48 the last step is to train the generative adversarial network by using the aligned source and target singing features
0:06:54 I would like to highlight one more time that we have data from the source and target singers, and they are singing the same songs
0:07:03 this is referred to as parallel training data for singing voice conversion
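the three training steps just described can be sketched in code; the following is a minimal, self-contained illustration (my own sketch, not the paper's implementation), where toy one-dimensional frames stand in for the Mel-cepstral features that a vocoder such as WORLD would extract, and dynamic time warping aligns the two parallel utterances:

```python
def dtw_align(source, target,
              dist=lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))):
    """Dynamic time warping: return index pairs aligning two feature sequences."""
    n, m = len(source), len(target)
    INF = float("inf")
    # cost[i][j] = cumulative cost of aligning source[:i] with target[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = dist(source[i - 1], target[j - 1]) + min(
                cost[i - 1][j],      # insertion
                cost[i][j - 1],      # deletion
                cost[i - 1][j - 1],  # match
            )
    # backtrack from the end to recover the warping path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        i, j = min(
            (cost[i - 1][j - 1], (i - 1, j - 1)),
            (cost[i - 1][j], (i - 1, j)),
            (cost[i][j - 1], (i, j - 1)),
        )[1]
    return path[::-1]

# toy "spectral" frames: the target is a time-stretched copy of the source
src = [[0.0], [1.0], [2.0], [3.0]]
tgt = [[0.0], [1.0], [1.0], [2.0], [3.0]]
pairs = dtw_align(src, tgt)
aligned_src = [src[i] for i, _ in pairs]   # frames duplicated where needed
aligned_tgt = [tgt[j] for _, j in pairs]
```

the aligned frame pairs are what would then be fed to the GAN training in step three.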
0:07:08 I would also like to highlight that previous studies have pointed out that in singing voice conversion it is not always necessary to transform the pitch values from the source to the target singer, assuming both singers sing in a similar key
0:07:24 and the conversion of the fundamental frequency usually has a small effect on the perceived identity of the singing voice
0:07:31 therefore, in this paper we only transform the spectral feature vectors to achieve the target singing voice identity
0:07:42 at run-time conversion, we again have three main steps, provided here
0:07:46 the first step is to extract the source singing features using vocoder analysis
0:07:52 the second step is to generate the converted singing spectral features by using the generator, which was already trained during the training phase
0:08:01 and last but not least, we generate the converted singing waveform by using the vocoder
0:08:07 I would like to highlight that in this paper, motivated by the previous studies, we do not transform F0 in the intra-gender singing voice conversion experiments
0:08:18 for cross-gender singing voice conversion experiments, we perform F0 conversion
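a common recipe for F0 conversion in the voice conversion literature is log-Gaussian normalization of the pitch contour; whether the paper uses exactly this transform is my assumption, but a sketch looks like this:

```python
import math

def convert_f0(f0_src, src_stats, tgt_stats):
    """Log-Gaussian normalized F0 transform: map the source singer's
    log-F0 distribution onto the target singer's. src_stats/tgt_stats are
    (mean, std) of log-F0 over the voiced frames of each singer."""
    mu_s, sigma_s = src_stats
    mu_t, sigma_t = tgt_stats
    out = []
    for f in f0_src:
        if f <= 0:  # unvoiced frame: keep it unvoiced
            out.append(0.0)
        else:
            out.append(math.exp((math.log(f) - mu_s) / sigma_s * sigma_t + mu_t))
    return out
```

a frame at the source mean pitch is mapped to the target mean pitch, which is why this transform is typically only needed in the cross-gender case.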
0:08:24 and this applies to all the experiments in this paper
0:08:33 so this was all about the parallel data case; what about singing voice conversion without parallel training data?
0:08:42 before we discuss singing voice conversion with CycleGAN, I would like to highlight something
0:08:49 CycleGAN learns from non-parallel training data
0:08:51 as also cited here, it was shown that CycleGAN for speech voice conversion provides a solution to model the spectral mapping
0:09:02 to the best of our knowledge, CycleGAN had not been studied for singing voice conversion before
0:09:08 in this paper, CycleGAN tries to find an optimal pseudo pair from the unpaired singing data of the singers for singing voice conversion
0:09:20 it uses an adversarial loss and a cycle-consistency loss, and we also incorporate an identity-mapping loss, as demonstrated here
0:09:31 this allows us to preserve the lyrical content of the source speaker, sorry, source singer
0:09:37 in the next slides we will be discussing very briefly why we need these loss functions
0:09:43 let's start with why we need the adversarial loss
0:09:47 in singing voice conversion, we aim to make the distribution of the converted singing features as close as possible to the distribution of the target singing features
0:09:56 as the distribution of the converted data comes closer to that of the target singing data, the converted voice sounds more like the target speaker, and we can achieve high speaker similarity in singing voice conversion
0:10:08 so why do we need the cycle-consistency loss?
0:10:12 the reason is that the adversarial loss only tells us whether the converted data follows the target singing data distribution, and it does not help to preserve the contextual information
0:10:24 with the cycle-consistency loss, we can maintain the contextual information between the source and target pair
0:10:32 with the adversarial and cycle-consistency losses, CycleGAN learns the forward and inverse mappings; however, this does not suffice to guarantee that the mappings always preserve the lyrical content
0:10:45 so an explicit identity-mapping loss is also incorporated, as demonstrated here
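the three losses just motivated can be written out explicitly; below is a minimal sketch with scalar stand-ins, where in a real system G (source-to-target) and F (target-to-source) are neural networks and the features are Mel-cepstral frames. The function names and toy feature representation are my own, not the paper's:

```python
import math

def l1(xs, ys):
    """Mean absolute error between two feature sequences."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)

def adversarial_loss(d_real, d_fake):
    """Discriminator objective: maximize log D(y) + log(1 - D(G(x))),
    given real/fake scores in (0, 1); returned as a loss to minimize."""
    return -(sum(math.log(p) for p in d_real) / len(d_real)
             + sum(math.log(1 - p) for p in d_fake) / len(d_fake))

def cycle_consistency_loss(x, F_G_x, y, G_F_y):
    """||F(G(x)) - x||_1 + ||G(F(y)) - y||_1: preserves contextual content."""
    return l1(F_G_x, x) + l1(G_F_y, y)

def identity_mapping_loss(y, G_y, x, F_x):
    """||G(y) - y||_1 + ||F(x) - x||_1: G should leave target features
    unchanged, which helps preserve the lyrical content."""
    return l1(G_y, y) + l1(F_x, x)
```

the full training objective is a weighted sum of these three terms; the weights are hyperparameters.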
0:10:54 now let's look at the experiments
0:10:57 in this paper we perform objective and subjective evaluations on a singing database that consists of audio recordings of English songs by professional singers
0:11:11 we report results for both the parallel and the non-parallel training data settings
0:11:15 in all the experiments, we use separate singing data for training and testing
0:11:23 we extract twenty-four Mel-cepstral coefficients, the logarithmic fundamental frequency, and aperiodicities
0:11:29 we normalize the source and target Mel-cepstra to zero mean and unit variance by using the statistics of the training data
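the normalization step can be illustrated with a short sketch; the key point is that the per-dimension statistics are computed on the training data only and then applied to the features (the toy two-dimensional frames below are my own placeholders, not corpus data):

```python
def zscore_stats(frames):
    """Per-dimension mean and standard deviation over the training frames."""
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / len(frames) for d in range(dims)]
    stds = [(sum((f[d] - means[d]) ** 2 for f in frames) / len(frames)) ** 0.5
            for d in range(dims)]
    return means, stds

def normalize(frames, means, stds):
    """Zero-mean, unit-variance normalization using training statistics."""
    return [[(v - m) / s for v, m, s in zip(f, means, stds)] for f in frames]

train = [[0.0, 10.0], [2.0, 30.0]]      # two toy 2-dim "Mel-cepstral" frames
means, stds = zscore_stats(train)
normed = normalize(train, means, stds)
```

at run-time the converted features are de-normalized with the target singer's training statistics before waveform synthesis.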
0:11:36 now let's look at the objective evaluation here
0:11:41 we report the Mel-cepstral distortion between the target singer's natural singing and the converted singing; as you may know, a lower Mel-cepstral distortion value indicates smaller spectral distortion
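Mel-cepstral distortion has a standard closed form; a sketch of how it is typically computed over aligned frames is below (the convention of excluding the energy coefficient c0 before the call is an assumption on my part):

```python
import math

def mel_cepstral_distortion(ref_frames, conv_frames):
    """Average MCD in dB between aligned Mel-cepstral sequences:
    MCD = (10 / ln 10) * sqrt(2 * sum_d (c_ref,d - c_conv,d)^2),
    averaged over frames; c0 is assumed to be excluded already."""
    const = 10.0 / math.log(10.0)
    total = 0.0
    for ref, conv in zip(ref_frames, conv_frames):
        total += const * math.sqrt(
            2.0 * sum((r - c) ** 2 for r, c in zip(ref, conv)))
    return total / len(ref_frames)
```

identical sequences give 0 dB, and larger spectral differences give larger values, which is why a lower MCD means a better conversion.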
0:11:53 in table one, we report all the frameworks
0:11:57 if you are interested in how we trained these networks, please note that all the models and experimental conditions are provided in the paper, so you can just go and check; for each framework we provide a one-paragraph explanation of how we trained it
0:12:13 we report male-to-male and female-to-male conversion results
0:12:18 for the DNN and the GAN, we use parallel training data from each speaker
0:12:26 if you check the DNN and the GAN, the GAN always outperforms the DNN
0:12:32 this shows that, when we have parallel training data, the GAN is a much better solution than the DNN for singing voice conversion
0:12:41 the CycleGAN case is more challenging, because we are doing non-parallel singing voice conversion, which means the lyrical content is different during the training and the data is not aligned
0:12:56 still, CycleGAN achieves comparable results to the GAN, and to the GMM and DNN baselines, even though the baselines use parallel data
0:13:06 all these results show that CycleGAN performs really well even though we do not rely on any parallel training data, and it achieves comparable or even better results than those of the DNN
0:13:24 in the next slides we report the subjective evaluation; we have more experiments in the paper, but in the interest of time I will only present some of them here in the presentation
0:13:35 we start with the mean opinion score
0:13:37 fifteen subjects participated in the listening test, and each subject listened to the converted singing voices
0:13:44 the DNN and the GAN are trained on parallel data, while CycleGAN is trained on non-parallel training data
0:13:51 if you look at the DNN and the GAN, you observe that the GAN outperforms the DNN, even though they use the same amount of training data
0:14:03 these results show that the GAN outperforms the DNN and should be preferred for singing voice conversion
0:14:10 if you look at CycleGAN, it is trained on the same amount of training data, but the data is not parallel, which makes the task more challenging
0:14:19 and for this more challenging task, CycleGAN achieves a very similar performance to that of the GAN, where the GAN uses parallel training data
0:14:29 so we believe the performance of CycleGAN is remarkable, considering that it uses non-parallel training data
0:14:38 in another experiment, we wanted to compare CycleGAN and the GAN for speaker similarity
0:14:45 in this experiment, reported here, we perform a preference test of speaker similarity between CycleGAN with non-parallel training data and the GAN with parallel training data
0:14:59 this experiment shows that CycleGAN, trained with non-parallel singing data, achieves comparable results to the GAN trained with parallel singing data
0:15:10 the subjects choose its output as the better sample 48.1 percent of the time
0:15:15 which we believe is remarkable, because learning without parallel training data is a much more challenging task than learning with a parallel training dataset
0:15:24 so we believe that CycleGAN achieves a remarkable performance in singing voice conversion when we have non-parallel training data
0:15:33 to summarize, in this paper we propose novel solutions based on generative adversarial networks for singing voice conversion, with and without parallel training data
0:15:43 the proposed CycleGAN framework, which does not need any parallel training data, does not require any external module to align the source and target singers
0:15:54 even with non-parallel training data, we show that it works really well
0:16:00 furthermore, we also show that the proposed framework performs better with less training data than the DNN, which we believe is remarkable
0:16:09 to conclude, with or without parallel training data available, generative adversarial networks achieve high-quality singing voice conversion
0:16:19 thank you for listening