Speech Transcript - RUNNING THE EXAMPLE SCRIPTS

I think this is going to be the official theme tune. Okay, so I'm going to be talking about how to run Kaldi. Basically, we have two recipes: one is Resource Management and the other is Wall Street Journal. We're going to go through some of the Resource Management recipe, and we'll have a few digressions to explain as much of the internals of Kaldi as you need to know to understand it.

First, the installation process. I'll describe the Unix one because that's the one most people will use, but there is also a Visual Studio (Windows) one. The scripts are all in bash, again for popularity reasons. Kaldi is kind of agnostic about the shell; there's nothing in it that's really specific to any one shell.

Now suppose you want to download Kaldi and run it. You'd probably first go to this location, kaldi.sourceforge.net (kaldi.sf.net will also work). We have a page of documentation that explains pretty much everything Kaldi-related. We use a source control program called Subversion; the command name is svn. It will typically be installed on most of the systems you will have. SVN is a little bit like CVS, but it's a more modern implementation. So to check out Kaldi you would just type this command, and you'll see a lot of stuff go by on the screen. It will check out a bunch of directories: the code, the scripts, the installation instructions, the documentation source and so on.

For the installation instructions, you just look at the INSTALL file. The installation is pretty simple: change directory to here, run install.sh, then cd to here, run configure, run make. There is a rather nonzero probability that something will go wrong, because it does kind of hope that certain things are installed, but we provide instructions for the common cases. If it doesn't install, please ask me and I'll try to help you get it to install. There is a directory of external tools,
and we try to have a script that downloads and builds all of these external tools, so that you don't have to worry about that yourself. These include sph2pipe, to read SPHERE files; IRSTLM, which is a language modeling toolkit (as I mentioned before, we chose this because it has a relatively open license, although very limited features); OpenFst, and such.

So you've done that: you've checked it out and you've tried to install it. In the future we're going to have version numbers and everything. Currently, because we haven't yet come to version 1.0, we just have trunk, which is the version control term for whatever your current code is. Inside there you'll find tools, which is the place where we download and compile various external tools. You'll find src, the source, which is where all of our source code is, including the source for our documentation. These are the subdirectories in there; these are the names of the things that I showed you on that funny slide with the rectangles, so these are all the subdirectories of code. And egs is the directory of example scripts; it contains the Resource Management and Wall Street Journal scripts. You can probably see there's an element of hubris in the naming scheme here: we went for a deep naming scheme because we believe that eventually there will be tons of scripts in those directories.

So you've done the installation in tools, which is just one script. Then you go to the source and run configure. Sometimes configure scripts are these vast scripts that are generated by things like automake or whatever it is, but this one is just a hand-generated one. It tries to find where your ATLAS library or CLAPACK libraries are, and if it finds them it compiles with that. It also detects certain systems, like Cygwin and Mac OS, that have particular setups that are common, and then it handles those
separately. It's good to type -j 4 when you make the code, because there are a lot of tools and the code is rather templated, so the compilation is a little bit slow; this makes it in parallel. You can type make test: that goes to all the subdirectories and runs all the programs that end with -test. We have a lot of testing programs. They're mostly unit tests, to make sure that all of the code is working. Things like: you have a matrix multiplication or something, you do the multiplication and you verify that the answer was right. You can also type make valgrind; it runs a program called valgrind to check for memory errors. I mean, right now there are no errors, but that would detect things like unallocated memory.

So suppose you've done that: you typed make, you typed make test and nothing went wrong. So you cd to egs/rm/s1, and this is where the example scripts are. This assumes that you're a member of the LDC, or what have you, and you have access to this corpus. I think for members that's like three hundred dollars; a lot of sites will have it already. The Resource Management corpus is a really simple corpus, but we use it because it's really fast to run and because it's kind of an LVCSR-like task. It's really medium vocabulary, but because it contains words and has a lexicon and everything, it behaves like a typical LVCSR system even though it's small.

So that will be in some directory, and you have to figure out what that directory is; at some point you have to pass it to one of the scripts. There's a bunch of commands here that you're supposed to run. You're not really expected to run this directly; it will just exit on you if you do that. It's just a sequence of commands you're expected to run by hand, because there's a high enough probability that any given one of them will fail that we thought it wasn't good to be over-optimistic and just
make it a single script. I mean, the failures are going to be due to simple things, like maybe the wrong directory or something. Anyway, I'm going to go through what this run.sh does.

The first thing is data preparation. You'll see that there's a directory called data_prep; you cd to there. You need to know the directory of your Resource Management data, the LDC data, and you give it that. It'll just do a bunch of stuff, basically converting whatever format this corpus has into a format that Kaldi likes, and creating lists of file names and the like. Then you cd out. These are just a few of the things that are created by this; there's a bunch of stuff in this directory that it creates.

Here's an example scp file. It contains the utterance ID, and then here's a pipe command. This command is going to be run whenever some program tries to read this. Now, an scp file: the concept that Kaldi has is not really quite the same as HTK's notion of an scp file. I'll be explaining later exactly what that is.

Another thing that's created here is a decoding graph in FST format. In some other scripts, like the Wall Street Journal script, this stage wouldn't be creating any FSTs; it would just create an ARPA file. Because RM doesn't use an ARPA, we do it like this. Oh yeah, this is created in that directory too; it comes from stuff that's in the Resource Management distribution. It'll create a lexicon for you; this format is pretty obvious, and we'll turn it into an FST. The Kaldi tools don't deal with this directly; we deal with FSTs, so the lexicon that you give to Kaldi is going to be an OpenFst-format FST.

There are also the speaker maps. This is the utterance ID, this is the speaker ID. These files are how Kaldi maps utterances to speakers and vice versa; there's no notion of masks or anything like that. So, yeah.
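The utterance-to-speaker maps mentioned above can be illustrated with a minimal Python sketch. It assumes the usual plain-text layout of one "utterance-id speaker-id" pair per line for the utterance-to-speaker map, and one "speaker-id utt1 utt2 ..." line per speaker for the inverse map. The function names and the sample IDs are hypothetical and are not taken from the recipe itself.

```python
from collections import OrderedDict


def parse_utt2spk(lines):
    """Parse 'utt-id spk-id' lines into an ordered utterance -> speaker map."""
    utt2spk = OrderedDict()
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2:
            raise ValueError("expected 'utt-id spk-id', got: %r" % line)
        utt, spk = fields
        utt2spk[utt] = spk
    return utt2spk


def invert_to_spk2utt(utt2spk):
    """Group utterance IDs by speaker, keeping the original utterance order."""
    spk2utt = OrderedDict()
    for utt, spk in utt2spk.items():
        spk2utt.setdefault(spk, []).append(utt)
    return spk2utt


def format_spk2utt(spk2utt):
    """Render the speaker -> utterances map as 'spk-id utt1 utt2 ...' lines."""
    return ["%s %s" % (spk, " ".join(utts)) for spk, utts in spk2utt.items()]


# Hypothetical sample data, for illustration only.
example = ["spk1_utt001 spk1", "spk1_utt002 spk1", "spk2_utt001 spk2"]
spk2utt_lines = format_spk2utt(invert_to_spk2utt(parse_utt2spk(example)))
```

The explicit list is the point: nothing is inferred from filenames, so the speaker of an utterance is whatever the map says it is.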
The concept of the utterance ID is quite important, and in Kaldi there's never any notion of parsing a filename and saying the last element is the utterance ID or whatever; you have to have an explicit list. All of these scp files and archives are indexed by this utterance ID that you have to decide on.

That script also creates a text format of the transcriptions, but we'll convert this into an integer format, just so that Kaldi doesn't need to have, for all of the programs, some kind of map between the text and integer forms of the words. So this is the text format of the transcriptions.

Next step: after the data prep stuff, we prepare the graphs. This is going to be a bunch of OpenFst commands and scripts to convert from the lexicon to the FST format. The lexicon FST actually contains the silence; the script adds it, and it's not something that's very deeply embedded in Kaldi.

These little files, if you've ever used the AT&T toolkit or OpenFst you'll know what these are: they're symbol tables. So this is the text form of zero, the text form of one, et cetera. And eps, for epsilon, is always zero. This is a common thing in FST toolkits: epsilon always has ID zero. So this is why phones are one-based, because zero is always reserved for epsilon.

OpenFst does have a capability to put symbol tables on the FSTs, so the FSTs would kind of know what the words were. We haven't used that, because it quickly becomes very difficult once you decide to have symbol tables on the FSTs; we basically use integer format throughout.

It outputs these files. This is G, the grammar, or it could be the language model, used for decoding. L is the lexicon, and L_disambig is the lexicon with the disambiguation symbols. If anyone has read the papers of Mohri et al., they describe the
standard recipe for FST-based speech recognition, with these little symbols like #1 and #2 that they put on the lexicon at the ends of words to ensure determinizability. But I'm not going to go into that in more detail or it's going to suck up the entire time of the talk.

Preparing integer lists of silence and nonsilence phones: we create little files with things like this. This is needed later on by the scripts, because occasionally a script will need to know what the IDs of the silence phones are, and because the Kaldi tools work with integer formats, it's going to need that as a list of integers.

Computing MFCCs. Okay, so this is just a command to invoke another script that computes the MFCCs, and here is the command. It basically writes the MFCCs to some disk, and then this is going to be a text file that contains, on each line, the utterance ID and then the filename, colon, integer offset, so it can directly go to that part of the file using fseek.

Okay, I think this is what I just said. This is the script format that I mentioned. Of course the script format is very generic: this whole thing doesn't have to be of this form. It's any one of our extended filenames, which might include a real file, something of this form, a pipe, whatever.

I'm showing you what the archive format looks like. That really is binary data, as you can see, but in some cases it would be text. You can give it the option to write in text, and very often it'll be a nice line-by-line format.

I think I mentioned this before: this key is an important concept, because there's this concept of a collection of objects indexed by key, in this case a string. Think of it a little bit like an STL map, where it would be a map from string to whatever object. So the archives and the script files both make use of this concept.
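The script format described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each line is a key followed by an extended filename, a trailing pipe symbol marks a command whose output is read, and a trailing colon plus integer is a byte offset into a file. The function names, paths and sample lines are hypothetical; this is not Kaldi's own parsing code.

```python
def parse_script_line(line):
    """Split one script-file line into (key, extended_filename)."""
    fields = line.strip().split(None, 1)
    if len(fields) != 2:
        raise ValueError("expected '<key> <extended-filename>', got: %r" % line)
    return fields[0], fields[1].strip()


def classify_for_reading(xfilename):
    """Classify an extended filename used for input.

    Returns (kind, target, offset), where kind is 'pipe', 'offset' or 'file'.
    """
    if xfilename.endswith("|"):
        # Command whose standard output is read through a pipe.
        return ("pipe", xfilename[:-1].strip(), None)
    path, sep, tail = xfilename.rpartition(":")
    if sep and path and tail.isdigit():
        # Plain file plus a byte offset to seek to before reading.
        return ("offset", path, int(tail))
    return ("file", xfilename, None)


def load_script(lines):
    """Build a key -> classified extended filename map from script lines."""
    table = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        key, xfilename = parse_script_line(line)
        if key in table:
            raise ValueError("duplicate key: %r" % key)
        table[key] = classify_for_reading(xfilename)
    return table


# Hypothetical script-file lines, for illustration only.
EXAMPLE = [
    "utt1 /data/raw_mfcc.ark:12345",
    "utt2 sph2pipe -f wav /corpus/utt2.sph |",
    "utt3 /data/utt3.wav",
]
table = load_script(EXAMPLE)
```

A plain dictionary plays the role of the "map from string to object" here; the object is only located, not loaded, which is what makes the script format cheap to read.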
I cover that concept in a little more detail on the next slide, but the script format is: the key, and then some kind of extended filename, blah blah blah. I think I mentioned this before, but the types of extended filenames include: an actual file; a command with a pipe, where for output it's the pipe symbol then the command, and for input it's the command then the pipe symbol; and, only for input, an offset into a file, which is useful where you want to write a big archive but have random access into it.

This might seem like a very minor thing, but I think it's important, if you ever want to use Kaldi, to understand how this works. So there's the concept of a table. This table doesn't really correspond to any concrete object or class; it's a generic concept. The idea is a collection of objects of some known type, all indexed by key, which is a string. We define a key as a non-empty, space-free string. We have to make it space-free, otherwise we get all kinds of issues.

So there are three templated classes that relate to tables: the TableWriter, the SequentialTableReader and the RandomAccessTableReader. So there are three ways you can do something with a table. You can write a table: what you do with this is you say "write me something with this key and this object", it writes it to the table, and you keep doing that. You can read a table sequentially, which means repeatedly "give me the next key and give me the next object". Or you can do random access on a table, which means "do you have an object with this key, yes or no; if so, give me the object". That's how you interact with it.

What these templates are templated on, I'm going to describe on the next slide. They're not actually templated on the object. It would be most natural to template on the object that's in the table, but the problem is that that
doesn't work very well with fundamental types like integers and so on. The normal Kaldi objects have a Read function and a Write function with a particular behavior that's common to all of them, but we can't just assume that everything we want to read and write will have that form. How would we write a double? How would we write an STL vector? And it would be ridiculous, and a pain, to somehow have to derive a class that's an integer; the integer is not a class. So we template on what we call a Holder. A Holder class is a class that has certain Read and Write functions, and it has a typedef T inside it that is the actual type that the table holds.

If I lost you because you don't know C++, it really doesn't matter. This is just how the internals of this mechanism work, and you don't need to know this to understand how the whole thing works.

So here is an example of how, at the C++ level, you use the table concept. We introduce some terminology here that may seem a bit annoying but eventually becomes clarifying. An rspecifier is a string that tells the table code how to read a table object. Here's an example of one: "ark", colon, this filename. The table code is going to parse this, and when it reads this it says: okay, you're telling me that this is an archive, a thing that has key, object, key, object; and this is an extended filename that tells you how to open a pipe or open a stream.

So now this is the type name, SequentialTableReader, templated on this Holder type; this is if we're reading something of type int32. This is the object name, and we initialize it by giving it this string. As soon as you initialize the object, it's opening the pipe; it says we're going to read from this. Now we use this object in a loop, which
is what this code is doing: it gets each key and object in turn from the SequentialTableReader. Since this is the SequentialTableReader, that's what this object expects us to do. The point is that there may be errors, right? Some of the objects may not be there, sometimes something may go wrong. The templated code is going to handle that, so your user-level code just sees it as sequential access.

I think this is stuff that I've already told you. There are some things that the table code has to do that are a little bit tricky. One of these is that very often you want to do random access on objects that are in an archive, and that archive may be in a pipe; we use a lot of pipes. The problem is: suppose for some reason you query a key that was not in the archive. In order to tell you that it wasn't in the archive, it's going to have to read each one in the archive, go to the end of the pipe and then say no. But it doesn't know that you're not going to ask for something else too, so it has to store all of that stuff in memory. In order to stop it from having to do this, you can specify in the rspecifier little comma-s, comma-cs options that tell it "this archive is sorted on key" or "we're going to call this archive in sorted order". That gives the code enough information to know that it doesn't have to store all the stuff in memory and it can still be correct.

I'm going to go a little bit faster. I think we went through this: computing MFCCs.

Monophone training. You would invoke this script, and we're going to go through the script a little bit. It sets some variables in bash: the directory where you are doing your experiment, the features. One of the strings for this is an rspecifier, which I mentioned before. This part tells it that we're going to interpret this stream as an archive; this tells it how to open the stream, and of course this
is a pipe. There's another colon here: the command has its own specifiers. Sometimes it can even be nested, but beyond one level of nesting the shell escaping would become too painful. This is a pipe; in fact the output always goes on the right. So what this says is that it's reading in this script file that says where the features are, and it's outputting an archive on the standard output. Then this says that this whole thing is a pipe. So this part gets interpreted by the program that is given it. You get used to it.

So, back to the monophone training script. We create a topology file that specifies the HMM topology to Kaldi. I mean, this file is fairly self-explanatory. There are the three states, and then this is the final state; the last state always has an exit probability of one.

This is a command to initialize the GMM. It's initialized with the dimension, thirty-nine. It outputs the model here, and it also outputs a tree, a very trivial tree that doesn't really have any splits. That's how we handle a monophone system: even a monophone system has a decision tree, just so that all the code is unified.

Creating decoding graphs for training. All of the training scripts have a command of this form. It creates an archive that has all of the FSTs, one for each utterance. We do this as a separate command because otherwise it would be too slow; if we did it on each iteration it would be a little bit too slow. It takes the initial model, the lexicon FST, the tree and the transcriptions in integer format, and the output goes to this pipe that just gzips it and puts it in a file.

This is just the format of the .tra file. It's just an integer
transcription where we've converted all of the strings to integer numbers.

Okay, so the very first stage of monophone training is the flat start, where you divide the utterance equally according to the number of phones, or whatever, and create an alignment that corresponds to that. The output of this program is something called an alignment, which is basically, for each utterance, a vector of integers. Each of those integers is an ID that I touched on earlier, which we call a transition-id. It's something that behaves roughly similarly to the pdf index, or pdf-id, but it has a little bit more information: you know the phone, you know what the transition was, so it contains sufficient information to do the update.

So we put that into this program, gmm-acc-stats-ali. The suffix "ali" means that it reads alignments, because there are different versions of this program: ones that read alignments, ones that read posteriors, Gaussian-level posteriors and different things. It takes the model; it takes the features, which is the shell variable from before; it reads the alignments from the standard input and outputs the stats.

By the way, whenever something has "ark:" on it, or "scp:", that's an rspecifier or wspecifier. That means there's a collection of objects being passed around, indexed by key. But if you don't see that, like here, it's just a file; it's just a single stream and there's no notion of an index there.

I think I covered this. This is the GMM update: it takes the original model and the stats and outputs the new model.

This is the Viterbi stage of training. What we do during training is, on selected iterations, we redo the alignment. We don't necessarily do that every iteration, simply because this is the thing that takes most of the time, and because, if you have multiple Gaussians, this is not the only thing that's going on in training, so it makes sense to
not do it every iteration. I think this is pretty obvious. You give it the beam, the model; this is the stream that has all the FSTs on it; the features; and it's going to write the alignments. Oh, that's an option: I mentioned briefly options on these rspecifiers, or in this case a wspecifier. It says to write in text format. The default is binary, but you could do comma-b if you want to emphasize that.

In monophone training we realign on almost every iteration, because I think I found that that worked better or something, or maybe it's because you usually have single Gaussians at that point. I think later on, once the system is better trained, you don't have to realign so often. So typically during triphone training we'd only realign three or four times.

Mixing up, to increase the number of Gaussians, is maybe slightly against the whole Kaldi philosophy, but it's just an option to the update program. The way we allocate Gaussians: we don't have a constant number of Gaussians per state. It's a power law; it's proportional to the count to some power. And this number shouldn't be that, by the way, it should be 0.2; I don't know why it's that way. It's just slightly better than having a constant number.

This is just the schedule we use to allocate the Gaussians. Typically you start from a certain number, you linearly increase, and then it levels out. It would probably be more natural to increase with the log, or a power law or something, but it just didn't work as well as doing a linear increase.

Okay, so triphone training. The first stage is that we align all of the data. For the monophone system we use a subset, because there's no point using everything for such a small system. So we realign all of the data and output alignments. We accumulate a special kind of stats for training the decision tree. What this is: for each unique triphone context, in this case, it's going to accumulate a
single Gaussian, well, the stats for a single Gaussian, and this is going to allow us to train the tree the standard way. There's some stuff in the script that does automatic clustering to produce questions. We don't use hand-derived questions, as it's a hassle to find them. These steps produce various files that will be read later. So a lot of the actual control of how the tree gets set up is at the script level.

Building the tree: this is the command to build the tree. What that actually does is it goes up to fifteen hundred leaves, and then it clusters it down a little bit, but by a non-predictable amount, because the threshold it uses to do the clustering after the initial splitting is the same as that of the last successful split. So you can't quite predict how big it'll be, but normally it shrinks by twenty percent or so from what you give it.

So you initialize the model for this tree. The tree-building program doesn't know if it's going to be a GMM or an SGMM or whatever you're going to create, so there's a separate program to initialize the model.

There's a nice feature of the whole alignment concept: you can actually take an alignment produced for one model and convert it to be valid for another model. That means you can avoid a certain amount of regenerating alignments.

Okay, so if you want to decode you have to build the decoding graph. This is how we do the graph generation. The thing that it's doing first is to compose L with G; it determinizes and minimizes, and you get LG. There's some stuff for disambiguation symbols going on; I'm not going to go through that. Then you have to compose in the context FST. It expands the phones to context-dependent phones, and there's a kind of dynamic generation of the context FST going on that happens within memory in here. I'm not going to go into more detail on what's going on here. And then this last one makes
the H transducer. So this is the H FST, which basically expands the HMMs. On the right you've got the context-dependent phones, on the left you've got the pdfs, and it adds all that structure into the network. So this is creating the FST that does that. The last step is just to compose H with CLG; it's determinized and minimized. We did that without self-loops. This is just to make it more memory-efficient: we wait till the very end to add the self-loops.

Now the decoding script. We create a shell variable that tells it what the features will be. Then we invoke this program. There are three decoders, and this affects the command: it could be gmm-decode-simple or gmm-decode-faster, for example; this one is the kind of medium one. gmm-decode-simple is mainly there for debugging, because it's so simple that there can't be anything wrong with it; we just compare the two.

So, decoding. The beam is twenty. This is kind of the language model scale, except that we only scale the acoustics, not the language model. This is just to get more human-readable output. The model, the FST, the features. This is a wspecifier that specifies how to write the transcriptions; it says to do it in text format, but they're going to be integers, so we're going to have to change them to text before scoring. And this is the alignment. This isn't really useful if you're just decoding, but you might want to do adaptation later using that decoding as the supervision, so we just always produce it because it doesn't cost anything.

Okay, so I think that basically comes to the end of this talk. I've given you a very vague idea of how the scripts work and what's in them. Obviously there are a lot more details that you'd need to find out before using them, but a lot of that stuff is in the documentation. You have to kind of dig around the documentation; I've been told that it's not very clear where to start, but there
is a lot of it there, if you're just willing to read it all. Also, it's heavily cross-referenced, so if you find something that's kind of related to what you need, there will usually be a link that you can click on that will take you to what you need.

Okay, so that's it. Any questions?

Kaldi never really deals directly with any of those symbols, because it's all integers, so it really doesn't matter what they are. So yeah, I assume you could use UTF-8. Well, those have to contain no white space, so it's not going to work unless the UTF-8 has no white space in it. I think it never checks that it's actually ASCII. But I mean, with UTF-8, if it's not ASCII it's never going to be white space, because it's always more than a hundred and twenty-eight. So it should be okay. I don't really think it's a good idea to put UTF-8 in those things, but maybe I'm old-fashioned.

I think it should work, but you're going to be concerned about the shell, because it could be that the shell is doing some kind of manipulation on the lines if it finds weird characters. I don't know whether that will work, but it should be easily changeable to handle that if it really becomes an issue.

I believe it can; I don't recall that it's ever been tested. Well, the FST can have those probabilities, and I think the Perl script that creates the lexicon actually has a flag that is supposed to accept them, or at least it did at one point. But it's just a few-line Perl script. Kaldi doesn't really care whether the lexicon has probabilities; it's just an FST.

Well, they can get very large. I mean, the ones that we've built were like one
and a half gig; I think that was with a somewhat pruned trigram LM. They do get very big, but at some point we're going to create decoders that don't suffer from that problem. I mean, I think the FST framework is great because it's easy to debug and it's simple, but the memory is something we're going to work on.

If there are no more questions, I guess we can call it a day. Oh, one more, I guess. [inaudible] Oh, yeah, yeah.
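The point made in the answer about UTF-8 can be checked with a short Python sketch. It assumes the only requirement on a key or symbol is the one stated in the talk (non-empty and free of white space), and that "white space" means ASCII white space at the byte level; the function names are hypothetical. Every byte of a multi-byte UTF-8 sequence is 128 or above, so such bytes can never be mistaken for an ASCII space, tab or newline.

```python
# ASCII white space bytes: space, tab, newline, carriage return, VT, FF.
ASCII_WHITESPACE = frozenset(b" \t\n\r\x0b\x0c")


def is_valid_key(text):
    """True if text is non-empty and its UTF-8 bytes contain no ASCII white space."""
    data = text.encode("utf-8")
    return len(data) > 0 and not any(b in ASCII_WHITESPACE for b in data)


def non_ascii_bytes_all_high(text):
    """True if every byte encoding a non-ASCII character is >= 128."""
    return all(
        b >= 0x80
        for ch in text
        if ord(ch) > 127
        for b in ch.encode("utf-8")
    )
```

This says nothing about what a shell might do with such characters, which is the separate concern raised in the answer.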

Kaldi Workshop

Presented by: Dan Povey, Author(s): Dan Povey