| 0:00:13 | Thank you very much for the introduction. |
|---|
| 0:00:16 | My name is [inaudible], and I come from [inaudible]. |
|---|
| 0:00:20 | Today I will present our work on feature compensation for short utterance spoken language identification. |
|---|
| 0:00:31 | I will organize this presentation as follows. |
|---|
| 0:00:35 | First, we introduce the short utterance language identification (LID) task. |
|---|
| 0:00:42 | Then I will show the neural network based embedding technique, the x-vector extractor, and how the x-vectors are used for the LID task. |
|---|
| 0:00:53 | After that, the feature compensation learning will be introduced. |
|---|
| 0:00:58 | Then I will show our experimental setup and results, and finally the summary and conclusions. |
|---|
| 0:01:10 | Language identification techniques are typically used as a pre-processing stage in multilingual speech recognition and translation systems. |
|---|
| 0:01:22 | For real-time speech processing systems, improving the performance on the short utterance task is important, |
|---|
| 0:01:31 | because it can help to reduce the real-time factor and the latency of the overall system. |
|---|
| 0:01:40 | One of the state-of-the-art methods is the i-vector based method; this method is very effective with relatively long utterances. |
|---|
| 0:01:52 | Recently, most researchers have focused on neural network based approaches, because LID is a classification task, and therefore a neural network model can be directly used for classification. |
|---|
| 0:02:10 | The embeddings have shown good performance on the short utterance LID task. The x-vector was initially proposed for the speaker verification task, and in recent studies it was also successfully used for the LID task. |
|---|
| 0:02:28 | In this work, we focus on the x-vector based method. |
|---|
| 0:02:36 | The x-vector is a neural network based embedding representation. Note that it was originally proposed for speaker recognition, but it can also be applied to language identification. |
|---|
| 0:02:51 | The network for extracting x-vectors consists of three modules: a frame-level feature extractor, a statistics pooling layer, and utterance-level representation layers. |
|---|
| 0:03:11 | The frame-level feature extractor module outputs frame-level representations of the utterance; it is applied over a sequence of acoustic features. For this module, a time delay neural network (TDNN) or a convolutional neural network is usually used. |
|---|
| 0:03:34 | Then, the statistics pooling layer converts the variable-length frame-level features into a fixed-dimensional vector by using their mean and standard deviation. |
|---|
| 0:03:53 | Finally, fully connected layers are used to process the utterance-level representations, and the final softmax layer outputs the posterior probabilities over the target languages. |
|---|
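Putting the three modules together, the statistics pooling step can be sketched as follows. This is a minimal numpy illustration, not the speakers' code; the 512-dimensional frame features and frame counts are hypothetical sizes chosen for the example.

```python
import numpy as np

def statistics_pooling(frames: np.ndarray) -> np.ndarray:
    """Map a variable-length (T, D) sequence of frame-level features
    to a fixed 2*D vector of per-dimension mean and standard
    deviation, as in the x-vector architecture described above."""
    mean = frames.mean(axis=0)
    std = frames.std(axis=0)
    return np.concatenate([mean, std])

# Utterances of different lengths map to the same output size,
# which is what lets the utterance-level layers be fixed-size.
short = np.random.randn(50, 512)    # few frames (hypothetical)
long_ = np.random.randn(1000, 512)  # many frames (hypothetical)
assert statistics_pooling(short).shape == statistics_pooling(long_).shape == (512 * 2,)
```

The fixed-dimensional output of this pooling is what the fully connected utterance-level layers then process.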
| 0:04:16 | The x-vector extractor is mostly used for the speaker verification task. In the verification task, the x-vector extractor works as the front-end, that is, it is used to extract utterance-level representations, and in the back-end, PLDA or cosine similarity can be used to compare them. |
|---|
| 0:04:40 | For the LID task, the front-end and back-end approach can also be used; back-ends such as logistic regression have become widely used for the classification task. |
|---|
| 0:04:54 | Since LID is a closed-set classification task, we can also directly use the network outputs for classification. |
|---|
| 0:05:05 | This work targets the short utterance LID task. When the test utterances become shorter, the performance also decreases. |
|---|
| 0:05:18 | The degradation is mainly because the limited information in a short duration causes a large variation in the representations of short utterances. |
|---|
| 0:05:29 | To reduce the variation of short utterances, normalization methods using the corresponding long utterances were investigated for i-vectors. Since the x-vector is a neural network based embedding, it is natural that we can also apply a similar idea to the x-vector extractor. |
|---|
| 0:05:55 | Therefore, we think that a similar idea can improve the LID performance when using the x-vector network. |
|---|
| 0:06:07 | The feature compensation is done by reducing the difference between the representations obtained from the long duration and the short duration inputs. |
|---|
| 0:06:21 | Here, S_s is the representation of the short utterance, and S_l is the representation of the corresponding long utterance in the x-vector space. |
|---|
| 0:06:38 | This equation can be rewritten as this one. |
|---|
| 0:06:44 | For training, we train the x-vector network by using the long duration encodings as targets for the short duration inputs, to model the compensation mapping function, considering the difference between the long and the short utterances. |
|---|
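A minimal sketch of this compensation objective follows. The variable names and the simple squared-error form are assumptions on my part (the exact loss appears only on the slide): the idea is that the embedding of a short utterance is pulled toward the fixed embedding of its corresponding long utterance.

```python
import numpy as np

def compensation_loss(s_short: np.ndarray, s_long: np.ndarray) -> float:
    """Mean squared difference between the representation of a short
    utterance (s_short) and of its corresponding long utterance
    (s_long). The long-utterance embedding is treated as a fixed
    target, so minimizing this pulls s_short toward s_long."""
    return float(np.mean((s_short - s_long) ** 2))

# Toy embeddings: the closer the short-utterance representation is
# to the long one, the smaller the penalty.
s_long = np.array([0.5, -1.0, 2.0])
s_short = np.array([0.4, -0.8, 2.5])
assert compensation_loss(s_short, s_long) > 0.0
assert compensation_loss(s_long, s_long) == 0.0
```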
| 0:07:08 | Short utterances contain very limited information. Therefore, to improve the performance on short utterances, both high-level language information and local phonetic information are important. |
|---|
| 0:07:25 | We suppose that the mean component of the vector captures the language information, and the variance component describes the information related to the local phonetic information. |
|---|
| 0:07:37 | Based on this consideration, we propose to normalize only the mean component of the vector towards the representation of the long utterance, while leaving the variance components free to retain the frame-level phonetic information, which provides discriminative features for language identification. |
|---|
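Assuming the pooled vector is the concatenation [mean; std] produced by statistics pooling, a mean-only variant of the compensation penalty could be sketched like this. This is an illustration under that assumption, not the exact formulation from the slide.

```python
import numpy as np

def mean_only_loss(v_short: np.ndarray, v_long: np.ndarray) -> float:
    """Penalize only the first half (the mean statistics) of the
    pooled [mean; std] vector. The std half is left free, so the
    frame-level phonetic information it carries is not normalized
    away."""
    d = v_short.shape[0] // 2
    return float(np.mean((v_short[:d] - v_long[:d]) ** 2))

# The std halves differ, but only the mean halves contribute:
v_long = np.array([1.0, 2.0, 0.3, 0.4])   # [mean; std]
v_short = np.array([1.0, 2.0, 0.9, 0.1])  # same means, different stds
assert mean_only_loss(v_short, v_long) == 0.0
```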
| 0:08:09 | The cost function of the proposed method is shown on this slide, where the representation of the utterance is obtained from the neural network. |
|---|
| 0:08:28 | In the proposed method one, we use the x-vector embedding as the representation, |
|---|
| 0:08:39 | and in the proposed method two, we use the ResNet with global average pooling to obtain the representation. |
|---|
| 0:08:52 | We evaluated the proposed method on the NIST Language Recognition Evaluation (LRE) 2017 set. |
|---|
| 0:09:03 | As the training data, we used the training and development data provided for LRE 2017, together with additional telephone data. |
|---|
| 0:09:22 | For the test set, we used the standard NIST evaluation set. |
|---|
| 0:09:38 | We also prepared 1, 1.5, and 2 second test segments. |
|---|
| 0:09:47 | As the input features, we used 60-dimensional filterbank features. |
|---|
| 0:09:55 | The average cost (Cavg) defined by NIST was used as the evaluation metric. |
|---|
| 0:10:03 | For the analysis, we compared the ResNet system and the x-vector system. |
|---|
| 0:10:10 | The ResNet system uses a residual network followed by global average pooling and fully connected layers with a softmax output. |
|---|
| 0:10:27 | For the x-vector system, a TDNN was used as the frame-level feature extractor. |
|---|
| 0:10:37 | For the training examples, the long examples are segments of between five and ten seconds, and the short utterance examples are truncated to two seconds. |
|---|
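The example preparation described here can be sketched as a simple cropping step. The 5-10 second and 2 second lengths are from the talk; the frame rate and the choice to cut the short segment from inside the long one are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_example_pair(frames: np.ndarray, fps: int = 100):
    """From one recording's (T, D) feature matrix, cut a long
    training segment of 5-10 s and a 2 s short segment from inside
    it. `fps` is a hypothetical 100 frames per second."""
    long_len = int(rng.uniform(5, 10) * fps)
    start = rng.integers(0, frames.shape[0] - long_len + 1)
    long_seg = frames[start:start + long_len]
    short_start = rng.integers(0, long_len - 2 * fps + 1)
    short_seg = long_seg[short_start:short_start + 2 * fps]
    return long_seg, short_seg

utt = np.random.randn(3000, 60)  # 30 s of 60-dim features (toy data)
long_seg, short_seg = make_example_pair(utt)
assert 500 <= long_seg.shape[0] <= 1000  # 5-10 s
assert short_seg.shape[0] == 200         # 2 s
```

Pairing each short segment with a long segment that contains it is what makes the compensation target well defined during training.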
| 0:10:53 | On this slide, we show the results of the baseline systems for comparison. We also list the results for different durations of test utterances. |
|---|
| 0:11:07 | We can see that the x-vector system is more effective on long duration utterances, while on short utterances the ResNet system showed better performance. |
|---|
| 0:11:27 | Because of the duration mismatch, the model trained with long duration samples performed worse on the short duration test data. |
|---|
| 0:11:42 | This slide shows the results with the feature compensation methods. |
|---|
| 0:11:48 | In this table, the baseline is the x-vector network trained with the short examples. |
|---|
| 0:11:56 | The results of the mean and variance compensation learning and of the two proposed methods are listed in this table. |
|---|
| 0:12:10 | For further evaluation, we give a figure to compare the baseline, the mean and variance compensation, and the proposed method. |
|---|
| 0:12:22 | From the results, we can say that the compensation by using both the mean and the variance could improve the performance, but not on all of the utterances. |
|---|
| 0:12:38 | In contrast, the proposed compensation by using the mean only significantly improved the performance, yielding the best results. |
|---|
| 0:13:00 | To conclude: in this work, we investigated an improvement of the neural network based embedding technique, the x-vector, for the short utterance LID task. |
|---|
| 0:13:13 | We compared the compensation using both the mean and the variance with the proposed method, which compensates the mean only. |
|---|
| 0:13:26 | The mean is expected to capture high-level, abstract language information, while our method leaves the variance components free, because they retain the phonetic information of the short utterances. |
|---|
| 0:13:40 | The results show that the proposed method is more effective for the short utterance LID task. |
|---|
| 0:13:51 | That's all. Thank you for your attention. |
|---|