Thank you. My presentation today is about improving the ROVER system. Here is the outline of my talk: first I will present the ROVER system, then outline our proposed approach to improve this system, followed by some experimental results, and I will finish with a summary and future directions.

The motivation of our work is the widespread use of large-vocabulary continuous speech recognition systems. With the abundance of applications using this type of system, there is an increasing requirement for higher accuracy and robustness in speech decoders. Some of the common solutions to these problems are enhancing the feature extraction, combining the speech features coming from different front ends, and combining the decoder outputs; we focus on this last solution.

For output combination, some of the common approaches are Recognizer Output Voting Error Reduction (the ROVER system), confusion network combination, and minimum time frame word error rate combination. The idea is to combine the different outputs coming from different speech decoders into one single composite output that hopefully will lead to a reduced word error rate.

We are going to focus on ROVER and try to improve this system. ROVER was developed by Jonathan Fiscus in 1997 within NIST, and its goal is to produce a composite ASR output with a reduced word error rate. This technique is now considered a baseline: any new technique for combining decoder outputs is compared to ROVER.

The ROVER process is a two-step process. It starts by creating a composite word transition network from the outputs of the different speech decoders; then this network is browsed by a voting algorithm that tries to select the best word at each slot of the word transition network. To do this, the original 1997 paper presents three voting schemes: one of them uses only the frequency of occurrence of the words at each slot of the network, while the two others also make use of word confidence values.

This is the main scoring equation of the ROVER system: the score of a word w at slot i is a weighted combination of its frequency of occurrence and its confidence, Score(w, i) = alpha * N(w, i) / Ns + (1 - alpha) * C(w, i), where N(w, i) is the number of decoders proposing word w at slot i, Ns is the number of combined systems, and C(w, i) is the confidence of w. In our work we set alpha to one, so only the frequency is used for the voting, and we do not need to worry about confidence values for now.
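(To make the frequency-only voting concrete, here is a minimal sketch in Python. The toy word-transition-network representation, the function name rover_vote, and the example data are illustrative assumptions, not the NIST implementation.)

```python
from collections import Counter

# Minimal sketch of ROVER-style voting with alpha = 1 (frequency only).
# The composite word transition network (WTN) is represented as a list of
# slots; each slot holds the aligned hypotheses from the different decoders,
# with None standing for a null transition (no word from that decoder).

def rover_vote(wtn):
    """Pick the most frequent hypothesis at every slot of the composite WTN."""
    output = []
    for slot in wtn:
        counts = Counter(slot)               # frequency of occurrence N(w, i)
        best, _ = counts.most_common(1)[0]   # hypothesis with the most votes
        if best is not None:                 # a winning null transition emits nothing
            output.append(best)
    return output

# Toy example: three decoders aligned into a four-slot network.
wtn = [
    ["the", "the", "the"],
    ["cat", "cat", "cap"],
    ["sat", "sad", "sat"],
    [None,  "down", None],
]
print(rover_vote(wtn))  # -> ['the', 'cat', 'sat']
```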
Some of the shortcomings of the ROVER system are the following. The voting mechanism only works if the errors coming from the different decoders are different from each other; otherwise, even if we combine the outputs, we do not gain anything, because we end up with the same errors. Also, the construction of the composite word transition network does not guarantee an optimal result: combining outputs A and B does not give the same network as combining B and A. Moreover, two of the three voting mechanisms rely on confidence values, which are still not reliable in speech recognition. Finally, only the one-best sequence from each recognizer is used, and the voting fails when the errors of the ASR systems are correlated, that is, when only one single ASR outputs the correct sequence of words.

Several works have tried to tackle these problems, especially through the use of machine learning techniques in the voting mechanism, but the performance of the system is still hard to improve and it is very difficult to further reduce the word error rate.

Our proposed approach is to inject a contextual word analysis before the voting mechanism, to try to filter out and remove the errors from the composite word transition network before applying the voting: we build the composite network, we remove the errors, and then we apply the usual ROVER voting.

Let us start by presenting the error detection technique. We first define a few terms. The neighborhood of a word is the set of words in its left and right context. The PMI, the pointwise mutual information, of two words is nothing else than the log of the probability of these two words occurring together divided by the product of their individual probabilities; we can estimate these probabilities from a large corpus, as the number of occurrences of a word divided by the total number of words. Once we have the PMI values, we can compute a semantic coherence score for each word by aggregating the PMI between that word and the words of its neighborhood, for example using the harmonic mean, the maximum, or the summation of these PMI values.

The error detection then works as follows. Given a sentence and a neighborhood size, we compute the PMI scores for all the pairs of words in that sentence; we compute the semantic coherence score of each word as shown before, using either the harmonic mean, the maximum, or the summation; we then compute the average of all these scores, and a word is tagged as an error if its semantic coherence score falls below this average by more than a threshold, otherwise it is tagged as a correct output.

The second part of the approach is integrating this within the ROVER process. We start by building the composite word transition network from the decoder outputs; then we run the error classifier on this network: every word that is tagged as an error is removed and replaced by a null transition in the word transition network; and finally we apply the voting algorithm.
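(As a rough illustration of how the PMI, the coherence score, and the error tagging fit together, here is a sketch in Python. The function names, the use of summation as the aggregator, and the exact thresholding rule are illustrative assumptions on my part; the counts are assumed to come from some large background corpus, and the paper also mentions harmonic-mean and maximum aggregation.)

```python
import math

# Sketch of the PMI-based contextual error detector described above.
# unigram_counts, bigram_counts and total_words are assumed to be extracted
# beforehand from a large background corpus; only the summation aggregator
# is shown here.

def pmi(w1, w2, unigram_counts, bigram_counts, total_words):
    """Pointwise mutual information: log of P(w1, w2) / (P(w1) * P(w2))."""
    c12 = bigram_counts.get((w1, w2), 0) + bigram_counts.get((w2, w1), 0)
    c1, c2 = unigram_counts.get(w1, 0), unigram_counts.get(w2, 0)
    if c12 == 0 or c1 == 0 or c2 == 0:
        return 0.0                      # unseen pair: treat as no association
    p12 = c12 / total_words
    p1, p2 = c1 / total_words, c2 / total_words
    return math.log(p12 / (p1 * p2))

def coherence_scores(sentence, unigram_counts, bigram_counts, total_words, k=3):
    """Aggregate (sum) the PMI between each word and its +/- k-word neighborhood."""
    scores = []
    for i, w in enumerate(sentence):
        neighbors = sentence[max(0, i - k):i] + sentence[i + 1:i + 1 + k]
        scores.append(sum(pmi(w, n, unigram_counts, bigram_counts, total_words)
                          for n in neighbors))
    return scores

def flag_errors(sentence, unigram_counts, bigram_counts, total_words, k=3, margin=1.0):
    """Tag a word as an error when its coherence falls below the sentence average
    by more than a margin (one plausible reading of the rule described above)."""
    scores = coherence_scores(sentence, unigram_counts, bigram_counts, total_words, k)
    avg = sum(scores) / len(scores) if scores else 0.0
    return [score < avg - margin for score in scores]
```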
Now some experimental results. For the experimental framework we used two speech decoders: the latest release of one recognizer, and the CMU open-source Sphinx-4 Java-based speech decoder. We tried three decoder configurations: the first decoder with its language model, and Sphinx-4 with two different language models. As I mentioned before, we need word probabilities, so we had to use a huge corpus: a large open corpus of about seventeen million unigrams and over three hundred million bigrams, from which we extracted those frequencies.

The measures we used are the word error rate, which is the number of deletions, substitutions, and insertions divided by the number of words in the reference; the precision and recall of the error classifier; the F-measure, which is the harmonic mean of precision and recall; and the negative predictive value.
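(For clarity, here are the standard formulas for these measures written out in Python; this is textbook scoring code, not the tooling actually used for the experiments.)

```python
# Standard definitions of the measures just mentioned.

def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    # Dynamic-programming edit distance between the two word sequences.
    n, m = len(reference), len(hypothesis)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[n][m] / n if n else 0.0

def detection_measures(tp, fp, tn, fn):
    """Precision, recall, F-measure and negative predictive value of the error classifier."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    npv = tn / (tn + fn) if tn + fn else 0.0
    return precision, recall, f_measure, npv
```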
Let us first look at the assessment of the error classifier on its own, before integrating it within ROVER. We measured how the classifier behaves as we vary the threshold, which controls how aggressive the filtering of errors is. We also looked, for the different aggregation schemes, at the standard deviation of the PMI coherence scores, and we noticed that most of the aggregation schemes behave similarly, with some of them giving slightly better detection rates. We also report precision and recall, since what we are targeting are the words tagged as incorrect.

Now for the assessment of C-ROVER itself, we have done two experiments: we applied the error detection on all the words, and then on all the words except the stop words, which we removed; I will explain why in a moment. With these settings we can see that we obtain up to a 1.5 percent word error rate reduction. We noticed that the results improve when we remove the stop words, because by definition a stop word is a word that lacks semantic meaning: the PMI tries to see whether each word is an outlier, but for a stop word it is very difficult to tell whether it is an outlier in the sentence or not. Note also that we only add an error classifier on top of ROVER; we do not fine-tune the ASR outputs themselves.

To summarize, we have proposed in this paper an approach to improve the ROVER system, which we call C-ROVER. We added a contextual word analysis through the use of an error classifier, and we obtained up to a 1.5 percent word error rate reduction. As future directions, we can use other error classifiers, for example based on latent semantic analysis; we can also combine classifiers to compensate for the low precision rate; and we need to evaluate the additional computational complexity of C-ROVER and the scalability of the system. Are there any questions?

Question: In your presentation, did you try to use the confidence scores computed on the words?

That is a good question. In this paper we only used one of the voting schemes, the frequency one, because most of the confidence values, if not all of them, coming from the speech decoders we used are fixed: when you have a sentence, all the words have a confidence value of one, so we basically cannot use them and see the impact of using the confidence values with this technique. But this approach can be applied whatever the voting mechanism is: we are not touching that part, we are only trying to remove errors and then go back to the original ROVER, so it does not affect the voting. ROVER provides three voting mechanisms and you can choose whichever you like; for now we do not have a good confidence measure, so we do not use it, we use the first one. Any other questions?

Question: Could you please comment on the computational complexity of the C-ROVER system?

Yes. For C-ROVER, because we have to run this error detection classifier, we do need more computation, and we have to have a huge corpus to be able to extract those probabilities. How this affects the system in terms of memory and in terms of CPU power is something we still have to measure and quantify. Still, we can expect better results, because what we are actually doing is helping the voting mechanism: if we remove the errors and we are confident about them, then the voting mechanism will for sure perform better; the open question is whether it is too expensive. If there are no more questions, let us thank the speaker.