hello and welcome to our talk today about comparison of speech representations for automatic quality estimation in multi-speaker text-to-speech synthesis

it's a pleasure to present to you today and i also want to thank my co-authors joanna rownicka, pilar oplustil and simon king

the problem that we want to solve in this research is how to develop a neural network that will output a mean opinion score given some synthetic speech input

so the motivations for this work are to speed up the tts development cycle, to save the time and money spent on listening tests, and to predict tts quality automatically

there have been several attempts to develop an automatic MOS estimation system. P.563 was developed for coded speech and it has no correlation with text-to-speech. PESQ was also developed for coded speech; it requires a high-quality reference and it is based on comparing degraded speech to natural speech, so tts errors are not always captured in this approach. AutoMOS is interesting, but it was limited to only the google tts systems, and its ground truth is based on multiple MOS tests conducted over a period of time

Quality-Net was introduced in two thousand eighteen; this was for speech enhancement and it is limited to the TIMIT dataset

in two thousand nineteen MOSNet was introduced. this is for estimating the quality of voice conversion systems. we found that the provided pre-trained models do not generalize well to text-to-speech

the two main contributions of this paper are, first, that we retrain the original MOSNet framework using tts data and explore frame-based weighting in the loss function, and second, that we train a new low-capacity CNN architecture on the tts dataset and compare different types of speech representations

finally, we characterize MOS prediction performance based on speaker-level and global rankings

our approach is basically in two phases. in phase one we develop speech representations, train MOS prediction networks, retrain the original MOSNet using tts data, and determine what the best representation and model for tts is. in phase two we train a new multi-speaker tts system, conduct a small MOS listening test, and apply the trained model from phase one to analyze generalization. the question we want to answer in phase two is: does our best model developed in phase one generalize to a new tts system?

one of our main contributions is that we explore different types of speech representations in a low-capacity CNN architecture, which we developed to handle these new representations

we have five different types of x-vector extractors. these are just like regular x-vectors; however, during training of the extractor the targets are adjusted, so the target is not necessarily speaker id. we have one version of the x-vector extractor where the target is in fact the speaker id. then we have another type of x-vector where the target is a categorical variable representing room size from the ASVspoof dataset. we have another type of x-vector that models oracle values for the T60 reverb time. another one models distance, basically the attacker-to-talker distance, and finally another one models the replay device quality

we also have deep spectrum features, which are the output of an imagenet-pretrained model that operates on the entire spectrogram as an image, so this is a very high-dimensional representation

finally we have the acoustic model embeddings, and those come from an asr system; again those are five hundred twelve dimensions, the same as the x-vectors

lastly there is the original MOSNet, which uses frame-based features, and we are retraining it on the LA dataset

when we train these different types of environment and attack extractors, we have as targets: speakers, room size, the T60 reverb time, distances, and the replay device quality

so the purpose is to use the extractors to model the different types of environments and attacks that are labeled in the ASVspoof two thousand nineteen physical access (PA) dataset. we are getting labels for free from the physical access data, and we want to use those to model speech degradation. these labels do not transfer directly, but we use them as a way of modeling the degradation in text-to-speech

next we train our low-capacity CNN and the retrained MOSNet. here we are working on the logical access (LA) dataset from the ASVspoof challenge; specifically, the evaluation portion of the LA dataset

it consists of forty-eight unique VCTK speakers. importantly, there are thirteen different tts and vc systems in this dataset, so we get a wide range of quality of tts and voice conversion; notably, most of the systems are in fact text-to-speech

also importantly, they were evaluated with human judgements, and the ground truth judgement is on a one-to-ten-point scale for mean opinion score, all from the same MOS rating task. so unlike AutoMOS, where the ground truth came from different MOS tests over time, this alleviates that problem

we used a speaker-disjoint train, dev and test split
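
a speaker-disjoint split can be made by grouping on speaker id; this is a minimal scikit-learn sketch with made-up toy data:

    from sklearn.model_selection import GroupShuffleSplit

    utterances = ["u1", "u2", "u3", "u4", "u5", "u6"]   # toy data
    mos        = [3.2, 3.0, 4.1, 4.0, 2.5, 2.7]
    speakers   = ["s1", "s1", "s2", "s2", "s3", "s3"]

    # hold out one of the three speakers entirely
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.33, random_state=0)
    train_idx, test_idx = next(splitter.split(utterances, mos, groups=speakers))
    # no speaker appears on both sides, so the model cannot latch onto identity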

and in table one of the paper that we reference here, we can see these systems, labeled A07 through A19; there are thirteen of them, and they have different characteristics. for example, they use different types of vocoders: some use griffin-lim, while others use high-quality neural vocoders such as wavenet or wavernn

so again, we explore two types of MOS prediction neural networks. in the first case we have all of the machinery that comes with the original MOSNet, which has different types of architectures: for example, there is one version that has a bidirectional LSTM, another version that has a CNN, and another version that has a CNN-BLSTM combination. using their code, we retrain all of the architectures and explore all of the different hyperparameters that they explored

in addition to the original MOSNet, we introduce our low-capacity CNN, which we use to operate on our different representations, such as the x-vectors, the deep spectrum features, and the acoustic model embeddings
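
to give a feel for the scale involved, here is a minimal sketch of a low-capacity CNN regressor over a fixed 512-dimensional representation; the exact layers and sizes in our model differ, so treat these as assumptions:

    import torch
    import torch.nn as nn

    class LowCapacityCNN(nn.Module):
        # Small CNN that maps one utterance-level embedding to a scalar MOS.
        def __init__(self, input_dim=512):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv1d(1, 16, kernel_size=5, stride=2), nn.ReLU(),
                nn.Conv1d(16, 16, kernel_size=5, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool1d(1),
            )
            self.out = nn.Linear(16, 1)

        def forward(self, x):               # x: (batch, input_dim)
            h = self.conv(x.unsqueeze(1))   # treat the embedding as a 1-D signal
            return self.out(h.squeeze(-1))  # one predicted MOS per utterance

    mos_hat = LowCapacityCNN()(torch.randn(8, 512))  # batch of 8 embeddings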

so now we're going to talk about some of the findings from our experiments

first, we used four different correlation metrics. each of them has a different range and different tradeoffs, and might be useful for different types of problems; we also wanted to stay comparable with previous work that used these. in addition, we introduce the kendall tau rank correlation. to start, we have the linear correlation coefficient (LCC), also known as pearson's r; it is a value that ranges between negative one and positive one, with one being highly correlated. next is the spearman rank correlation coefficient; its benefit is that it is non-parametric, and the values again range from negative one to positive one. we also use mean squared error (MSE), though it is not ideal, as it fails to capture distributional information such as outliers. and lastly we have the kendall tau rank correlation coefficient, which is useful for this task because it captures ordinal ratings and is a little bit more robust to error sensitivity than the spearman rank correlation coefficient
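
all four metrics are available off the shelf; this is a minimal sketch with made-up scores:

    import numpy as np
    from scipy.stats import pearsonr, spearmanr, kendalltau

    true_mos = np.array([1.8, 2.5, 3.1, 3.9, 4.6])  # toy ground-truth MOS
    pred_mos = np.array([2.2, 2.4, 3.5, 3.6, 4.1])  # toy predictions

    lcc,  _ = pearsonr(true_mos, pred_mos)    # linear correlation (Pearson's r)
    srcc, _ = spearmanr(true_mos, pred_mos)   # non-parametric rank correlation
    ktau, _ = kendalltau(true_mos, pred_mos)  # ordinal, robust to rank errors
    mse     = np.mean((true_mos - pred_mos) ** 2)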

so here is the table with our first set of results

now, this is the correlation between the ground truth MOS scores from the LA dataset and the predicted MOS scores from our different systems, aggregated in two different ways: one at the system level, and the other at the speaker level. in this work we are particularly interested in how different speakers contribute to the overall quality of a tts system, so we focus our discussion on the speaker-level results

from left to right we have the different systems and the different representations, starting with this first column: the pretrained voice conversion CNN. this is the pretrained model that comes with the original MOSNet; note that it is trained on voice conversion data, and here we have applied it to the LA dataset. what we see is that there is almost no correlation between the pretrained model's predictions and the tts data

when we retrained the MOSNet CNN architecture on the LA dataset and then evaluated it again on a held-out portion of the LA data, we got much higher correlation. we can see that the MOSNet architecture itself is fine; it just needed to be retrained on tts data

and then we have our other different representations: we compare our set of x-vectors, as well as the deep spectrum features and acoustic model embeddings. these were run on our low-capacity CNN, which we trained from scratch, so there are no pretrained models in this experiment

what we find, when we consider all the different correlation metrics at the speaker-level aggregation, is that x-vector 5 comes out as the best representation. recall that x-vector 5 models device quality, so it does make some intuitive sense. it is also worth mentioning that the MOSNet CNN architecture retrained on the LA data performs quite well

so here we want to characterize some of the best and worst tts systems. for example, using the ground truth, we identified that system A08 is the poorest quality system: it has a mean MOS of one point seven five, and it is an hmm-based tts system, so it makes sense that this might be the worst performing system. we then identified the best performing system as high quality, having a higher mean MOS of five point five eight, and that is in fact the wavernn-based tts system

so now let's listen to some examples of what this speech sounds like. what we see here in the plot is that the ground truth MOS labels have quite a spread, between one and six, while what is being predicted by our systems lies in a very narrow band: we have this range from about two point five to three point five, so it is very narrow

[audio sample: "dialogue is the key"]

okay, that was the wavernn, and here is the hmm

[audio sample: "today will tell"]

it's got a little bit more dullness to it

so next we also want to characterize the best and worst speakers

and here is where things get a little bit tricky. we have the best system and the worst system, which we just heard: the wavernn and the hmm. and we also have the best speaker and the worst speaker, which we identified: the best speaker in the LA dataset, based on the ground truth, is the speaker labeled 0048, and the worst is 0040

now, if we look at the ground truth MOS in terms of best system with worst speaker, and worst system with best speaker, the true MOS has quite a big gap between the pairs. however, our predicted mean opinion score from the model shows a much narrower difference, and the ordinal ranking is even reversed. let's listen to some examples

[audio sample: "the culture [...] has changed dramatically in the past five or six years"]

so that was the best system with the worst speaker

[audio sample: "today will tell"]

and that was the worst system with the best speaker. notice that they sound somewhat close; let's listen to them again

[audio sample: "the culture [...] has changed dramatically in the past five or six years"]

[audio sample: "today will tell"]

okay, so the fact that we are hearing some closeness may correspond to the narrow range of scores predicted by our system

next, and importantly, we want to talk about a post-hoc analysis that we did: how well does the MOSNet that we trained generalize to a completely held-out tts system and its data?

so for this we used the libritts dataset, which is audiobook data. the full set is large: it has five hundred eighty-five hours and over two thousand speakers, and it did undergo some cleaning from google. we trained our tts system on a small subset: sixty hours of male and female speech from forty-five speakers, balanced across the two genders, amounting to roughly thirty-seven thousand utterances

the system that we trained is DCTTS, otherwise known as deep convolutional tts, with one-hot speaker codes incorporated into the system

this tts system consists of a text-to-mel network, but it also has a spectrogram super-resolution network, and audio is generated with the griffin-lim algorithm, so we will hear the griffin-lim in the next slides
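
here is a rough sketch of that three-stage pipeline; the text2mel and ssrn callables are placeholders standing in for the two trained DCTTS networks, while the griffin-lim call is librosa's real implementation:

    import numpy as np
    import librosa

    def synthesize(text, speaker_id, text2mel, ssrn, n_speakers=45):
        # one-hot speaker code conditions the text-to-mel network
        speaker_code = np.eye(n_speakers)[speaker_id]
        coarse_mel = text2mel(text, speaker_code)  # text -> coarse mel spectrogram
        full_mag = ssrn(coarse_mel)                # super-resolution to linear magnitude
        # griffin-lim iteratively estimates phase and inverts the magnitude spectrogram
        return librosa.griffinlim(full_mag, n_iter=60)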

we then applied the MOSNets to the speech synthesized by this system and aggregated at the speaker level

what we do see, again, is that the best representation as far as the correlation metrics go is x-vector 5, which is the device quality one, the same as before. however, the correlation overall is quite poor, so we cannot say that the MOSNet is working very well on this dataset, even though we have identified a representation that is better to use compared to the others

so even though the MOSNet does not generalize well to this new system, when we use our best performing representation, x-vector 5, we can still capture some relative speaker rankings

[audio sample plays]

so that would be the worst speaker synthesized by our system. now a mid-range speaker:

[audio sample plays]

and here is the best:

[audio sample plays]

now let's look at them side by side: our DCTTS trained on libritts, and the wavernn system from the LA data

what we see is that the speakers in each system contribute differently to the overall performance of the system. there are some speakers that do outstandingly well in both systems, and some speakers that are generally much worse. now let's take the worst performing speaker from our DCTTS trained on libritts and the worst performing speaker from the LA data, put them side by side, and listen to that

[audio sample plays]

that was the DCTTS case, and now the other:

[audio sample plays]

okay, so both are actually quite poor. what we find is that by selectively choosing the speakers with which to evaluate a system, one could artificially lower or boost the overall system score: selecting only the speakers that are performing very well would boost the overall system score artificially
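
a tiny numeric example of the effect, with hypothetical per-speaker scores:

    # hypothetical per-speaker MOS for one system
    per_speaker_mos = {"spk1": 4.5, "spk2": 4.2, "spk3": 2.1, "spk4": 1.9}

    honest = sum(per_speaker_mos.values()) / len(per_speaker_mos)  # 3.175
    cherry_picked = (4.5 + 4.2) / 2                                # 4.35
    # the same system looks far better if only its strong speakers are evaluated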

so in conclusion, what we determined is that the overall approach for doing MOS prediction is sound, but the correlation between true and predicted scores could be improved. the MOS prediction model trained on the LA dataset does not generalize well to a held-out tts system and data. and we did find that some representations are better suited for this task, while other representations are simply not well suited to it

we have made two tools available on github. the first is the MOS estimation low-capacity CNN using the x-vector 5 device quality extractor, and we provide our pretrained model. the second tool is the original MOSNet architecture with a pretrained model that has been retrained on the LA dataset; so where the original MOSNet provided a pretrained model for voice conversion, we are providing a pretrained model for tts

some of the future directions are to look at predicting speaker similarity. we also think it would be interesting to use distances in x-vector space to predict the MOS score. we think it would be important to reformulate this task as a MUSHRA or A/B preference test. and finally, we would like to incorporate automatic MOS estimation into the tts training process

thank you very much for listening to the talk, and we hope you enjoy the paper