0:00:26a a a i i i
0:00:32i you know what
0:00:33a i today
0:00:35and people
0:00:36and some people would be you know a to up to have an another's with
0:00:40they on if an sup of for the rest of the live
0:00:43I'm thinking that those prophecies have been just slightly misinterpreted,
0:00:47and the event that they were referring to is this.
0:00:49a wonderful speech to okay
0:00:53a that have in of actually that
0:00:55you know it almost buys review thing it
0:00:57so uh
0:00:59i do i think that have anything and their estimates the significance of this that
0:01:04a that's okay
0:01:06So first, just about the name:
0:01:08it's some kind of coffee reference, hence the little coffee bean.
0:01:13But that's just whatever name we thought of.
0:01:21The structure of this whole presentation is: first I'm going to talk for about fifteen or twenty minutes,
0:01:28just giving an overview, from all sides, of this toolkit.
0:01:32Then we're going to allow people to escape, in case they don't want to know more details than that, and have a short break.
0:01:39And then Arnab and Ondrej are going to talk about some more low-level stuff:
0:01:45Arnab is going to talk about some of the acoustic modeling code,
0:01:48and Ondrej will talk about the matrix library, which is kind of independently useful.
0:01:55And then after that I'm going to go through some example scripts, to try to give people a sense of how to use it.
0:02:07On the next slide:
0:02:10an important aspect of the project is that
0:02:13it's licensed under Apache v2.0, which is the
0:02:17kind of license that basically allows you to do anything you want with it.
0:02:21There is only an acknowledgement clause, which says you have to acknowledge that the code came from there, but that's it.
0:02:31It's one of the most open of the standard licenses.
0:02:36The project is currently hosted on SourceForge, which is the standard place for these kinds of open-source projects.
0:02:45Some toolkits are very closely associated with a particular institution;
0:02:49our intention is for it to be more of a kind of thing that lives in the cloud, out there on SourceForge.
0:02:56I shouldn't have used that word; that's just gratuitous.
0:03:00But yeah, the idea is for it not to just be the pet project of some particular little group,
0:03:07but to represent the best of what's out there. We welcome participants: as long as you can contribute code under this license, then that's great.
0:03:17It's basically a C++ toolkit.
0:03:19The code compiles natively on Windows and on the common Unix platforms.
0:03:22We're not claiming that it compiles on every other platform, but it compiles fine on the normal ones.
0:03:34We have some documentation, not as much as HTK,
0:03:37and we have example scripts.
0:03:40These example scripts are currently just for Resource Management and Wall Street Journal,
0:03:46but we're going to have more.
0:03:50They basically run from the LDC disks,
0:03:52so once you have the disks you can just point the scripts to them
0:03:56and get an idea of how it works.
0:04:04Oh no, now I realize that we didn't book a large enough room.
0:04:08I think we just advertised this thing too aggressively.
0:04:12if these were not guy
0:04:20Okay, so now I'm going to go through the kinds of things that it supports. This is just the current features;
0:04:25we're intending to add a lot more.
0:04:27You can build a standard context-dependent LVCSR system with tree clustering,
0:04:34and it's been written in such a way that it supports arbitrary context sizes, so you can go to
0:04:39quinphone or whatever and it will work without pain.
0:04:45The training and decoding are both FST-based;
0:04:49our code compiles against OpenFst.
0:04:52For those of you who don't know, OpenFst is
0:04:55kind of like the AT&T toolset, but it's open source.
0:04:59It's a project from people at Google and some other places.
0:05:06We currently only have maximum likelihood training,
0:05:09and we haven't yet done lattice generation, but the
0:05:12timeline for adding discriminative training and lattice generation
0:05:16is roughly this summer.
0:05:21We support all kinds of linear and affine transforms you can imagine.
0:05:25Not all of these necessarily involve the regression tree version,
0:05:30where you have multiple regression classes.
0:05:34That's just because we are trying to avoid very complicated frameworks that would make the code difficult to use,
0:05:41so a lot of these just support a single transform.
0:05:45All of these things also have example scripts,
0:05:48so it's not just something that's in the code that we know works;
0:05:51it's something that you can also get to run.
0:05:57I didn't want to diss other toolkits, so a little disclaimer here:
0:06:02we're not claiming that other toolkits don't have any of these advantages too.
0:06:07But we're aiming for clean code and modular design.
0:06:13And by modular we probably mean something a little bit stronger than you would normally imagine.
0:06:19It's written in such a way that
0:06:22it's not only easy to combine the various things that are in there,
0:06:25but it's also easy to extend arbitrarily.
0:06:29We have avoided the kind of code where,
0:06:32when you add something,
0:06:34a bunch of other bits of code have to know about what you added, and then you have to modify all
0:06:38kinds of other things.
0:06:44The license is a big advantage; I know that not a lot of
0:06:50toolkits have such a completely free license.
0:06:53It's not that we really anticipate this being used for commercial purposes;
0:06:58our understanding is that a lot of research groups,
0:07:02as a matter of principle, won't
0:07:04use stuff that has a non-commercial license, because they say
0:07:07is this research can the commercial by the
0:07:10and now
0:07:11or of the license will
0:07:15have example scripts which were which were uh
0:07:19standing documentation
0:07:22Then this whole community-building thing: the people involved in Kaldi are currently
0:07:28a group of people, mostly volunteers,
0:07:30who were involved in the previous two workshops:
0:07:33myself, Arnab, a bunch of guys from Brno,
0:07:36and a few others.
0:07:40But we're open to new participants.
0:07:43What we're hoping for mainly is not just people who contribute a line or two of code,
0:07:50but people who really want to understand the whole thing
0:07:53and can contribute a significant amount.
0:07:59Kaldi is especially good for stuff that involves a lot of linear algebra.
0:08:03It has a very good matrix library, which Ondrej is going to talk about,
0:08:07so it suits you if you want to do stuff that involves a lot of matrix and vector operations.
0:08:15Also, of course, we compile against the OpenFst library, so
0:08:20you can do FST stuff at the code level.
0:08:24It's built in a scalable way.
0:08:30It doesn't explicitly interact with any parallelization libraries;
0:08:36it doesn't interact with MapReduce or MPI or anything,
0:08:40because we felt that that would just lock it into particular kinds of system.
0:08:44But it's been written in such a way that it should still work efficiently when everything is very large scale
0:08:53and you have a lot of data.
0:08:55Our intention is to add all of the state-of-the-art methods
0:09:00for LVCSR: things like
0:09:01discriminative training
0:09:03and all of the standard adaptation methods.
0:09:09But, as I think I say on the next slide,
0:09:13something that we're not doing in the immediate future
0:09:17is things like online decoding,
0:09:19by which I mean
0:09:21where the data is coming in, say, from a microphone or telephone
0:09:25and it's some kind of interactive application.
0:09:27You could use it to do that, and building a decoder isn't that hard in this framework,
0:09:32but our basic target audience is
0:09:37speech recognition researchers who want to work
0:09:40on the speech recognition itself,
0:09:42rather than those who
0:09:47oh have a mock i was learning what everyone was looking at a multiscale to enter the room um and
0:09:52disrupted that's all right if very present
0:09:56oh i
0:10:04so i
0:10:06I see some people lately have, uh,
0:10:10it's become popular recently to do a kind of Python wrapper for C++ code,
0:10:15the idea being that you can more easily write your scripts.
0:10:19However, we've avoided that approach,
0:10:22partly because it's a hassle to do the wrapping and nobody ever understands how the wrapping works,
0:10:28partly because it just forces people to learn a new language,
0:10:32and partly because you can't assume that everyone knows Python.
0:10:40We support that kind of flexibility and configurability in different ways,
0:10:48and I think it'll become clear later, so perhaps we'll leave that till later.
0:10:56We don't have forward-backward training in there, and there are no immediate plans to do it.
0:11:01I think some people like forward-backward for kind of religious reasons,
0:11:05but I don't believe anyone has demonstrated that Viterbi is worse.
0:11:10And it's just so convenient to use Viterbi,
0:11:14because you can write the alignments to disk compactly
0:11:17and then
0:11:27really interesting
0:11:29But even a lattice, as opposed to
0:11:32just a single hypothesis,
0:11:34makes it difficult.
0:11:37So we'll have to think about that. I mean, it's not like it's really hard to do,
0:11:40but it just wasn't something that we had planned.
0:11:49Oh, okay.
0:11:50Well, it's at the state level,
0:11:53but it's not really the state, I mean the pdf index.
0:11:57To be a little bit more precise: if
0:12:00you just write out the state sequence, it's fine for model training, but then
0:12:04if you want to work out the phone sequence, depending on how the tree works,
0:12:08it might not be implied by the state sequence. So we have these identifiers that also contain the phone
0:12:13and the transition.
0:12:15So it's a list of integers, but those integers
0:12:21are not quite the states; they're
0:12:22something that can be mapped to the state and also to the phone.
0:12:30I'm just going to describe how this
0:12:33came to be. We had this workshop in 2009,
0:12:36where a lot of the focus was on SGMMs.
0:12:42At that workshop we were working with some guys from Brno University of Technology,
0:12:47including Ondrej and others.
0:12:50They built this infrastructure
0:12:54for training SGMMs that was written in C++ but relied on HTK,
0:12:59and I also built an FST-based decoder,
0:13:02so that we could decode in our own C++ code with access to the matrix library.
0:13:08We were kind of calling that proto-Kaldi.
0:13:12And we wanted to release that
0:13:15in some kind of open-source way, but we realized that
0:13:19the recipe was just too hard to encapsulate, because it had HTK, it had our stuff,
0:13:24and it had a lot of scripts.
0:13:26So we wanted to create something that
0:13:29would support this stuff and was easy to encapsulate.
0:13:37The next summer we wrote an entirely new toolkit;
0:13:43we wanted everything to be clean and unified
0:13:45and to have a nice shiny C++
0:13:48speech recognition toolkit.
0:13:51I think that's on the
0:13:55slides somewhere.
0:13:57In 2010 we had another workshop, in Brno,
0:14:00where we did a lot of coding.
0:14:03The vision at that time, which I now realize was very unrealistic,
0:14:07was that we would have a complete working system with example scripts
0:14:12by the end of the summer.
0:14:14But that kind of didn't really materialize; we had a lot of pieces,
0:14:18but we didn't really have a complete working system.
0:14:22So after that I felt kind of obligated to
0:14:27finish the system, and we had help from others as well,
0:14:31doing a lot of coding after that.
0:14:40Going to the next slide:
0:14:42it's only been officially released something like last week.
0:14:46That's when we actually got all the legal approvals and
0:14:49put it up on SourceForge.
0:14:51This is just a list of the people; I don't think I'm going to go through all the names.
0:14:55This is the list of all the people who have written code specifically for Kaldi,
0:14:59and that's the list of people who have done various other things or helped in various ways.
0:15:07I would describe exactly what for each one, but I'm kind of scared I've left someone off one of these lists,
0:15:15so I'll just let you read it.
0:15:20A lot of these people
0:15:21have some connection to Brno University of Technology,
0:15:25or to other places
0:15:28like that.
0:15:30So this is a rather messy diagram.
0:15:36I wanted to give you some idea of what the dependency structure of Kaldi was, but I decided to put
0:15:40side information into it too, so
0:15:42the area of these rectangles is roughly proportional to how many lines of code
0:15:49there are.
0:15:51These are the things that we compile against:
0:15:54OpenFst is the C++ library,
0:15:59and ATLAS/CLAPACK refers to the math libraries that we compile against.
0:16:06The rough dependency structure is that things on top depend on the things below them, but
0:16:10it's very approximate.
0:16:13For instance, here are various
0:16:14FST algorithms that we've extended OpenFst with;
0:16:19stuff relating to tree clustering for decision trees;
0:16:24stuff relating to HMM topology;
0:16:27decoders;
0:16:29language modeling, which is a small box because really all it does is
0:16:34compile an ARPA-format language model into an FST.
0:16:37i two that
0:16:41Util: this is mostly I/O stuff, with various frameworks for I/O that
0:16:46will be explained later on, after the break,
0:16:49so that we can allow people to escape first.
0:16:51This is the matrix library.
0:16:54A lot of this is just wrappers for stuff that's here.
0:16:57I don't know if any of you are familiar with
0:16:59CLAPACK and BLAS and those things,
0:17:02but they're C libraries that,
0:17:04for a C++ programmer, are slightly painful to work with, because they have all of these arguments like the
0:17:09rows and the columns,
0:17:11and the thing you want to do is this very long line of code.
0:17:16There's no notion of a matrix as an object.
0:17:18So this kind of adds that abstraction, and it is significantly easier to use
0:17:23than those libraries.
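To make the contrast concrete, here is a minimal sketch of the two styles. It is not the toolkit's matrix library: `RawMatVec` and `SimpleMatrix` are hypothetical names, and the raw routine only imitates the argument style of a BLAS-like C call (pointer, rows, columns, stride).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// C-style routine in the spirit of BLAS: y = A * x, where A is a raw
// row-major pointer and the caller passes every dimension by hand.
void RawMatVec(const double *a, std::size_t rows, std::size_t cols,
               std::size_t stride, const double *x, double *y) {
  for (std::size_t r = 0; r < rows; ++r) {
    double sum = 0.0;
    for (std::size_t c = 0; c < cols; ++c) sum += a[r * stride + c] * x[c];
    y[r] = sum;
  }
}

// Hypothetical wrapper: the matrix owns its data and knows its own shape.
class SimpleMatrix {
 public:
  SimpleMatrix(std::size_t rows, std::size_t cols)
      : rows_(rows), cols_(cols), data_(rows * cols, 0.0) {}

  double &operator()(std::size_t r, std::size_t c) {
    return data_[r * cols_ + c];
  }
  std::size_t NumRows() const { return rows_; }
  std::size_t NumCols() const { return cols_; }

  // Returns (*this) * x; dimensions are checked instead of passed in.
  std::vector<double> Times(const std::vector<double> &x) const {
    assert(x.size() == cols_);
    std::vector<double> y(rows_);
    RawMatVec(data_.data(), rows_, cols_, cols_, x.data(), y.data());
    return y;
  }

 private:
  std::size_t rows_, cols_;
  std::vector<double> data_;
};
```

With the wrapper, a caller writes `m.Times(x)` and never touches the stride or dimension arguments directly, which is the kind of abstraction the talk describes.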
0:17:27this is feed sure
0:17:29preprocessing and you know
0:17:30going from a web file to mfcc that's that's fair
0:17:34uh gaussian mixture models a diagonal and full
0:17:39subspace gaussian mixture models this is
0:17:41the reason might talk
0:17:44linear transforms
0:17:45things like fmllr M L R S T C
0:17:50things of that nature
0:17:52vtln is in here to kind of the
0:17:54linear form of vtln
0:17:57All of these things up here are directories that contain
0:18:01command-line programs. That tells you a bit about the structure of the toolkit, which is that we have
0:18:06really more than a hundred command-line programs,
0:18:09and each one does a fairly specific thing.
0:18:12We wanted to avoid this phenomenon where you have a program that allegedly does one thing
0:18:17but really is controlled by a million options,
0:18:20and has rather complicated behavior depending on which options you give it.
0:18:24So this is part of the mechanism that we use to ensure
0:18:28that everything is configurable and easy to understand:
0:18:31there's no Python layer, but there are a lot of
0:18:34programs with simple functions.
0:18:37And on top of this
0:18:38are the
0:18:39shell scripts.
0:18:40To do actual system building, a recipe:
0:18:45what our example scripts currently do is, it's a bash script
0:18:48that has a bunch of variables in bash to keep track of iterations and stuff,
0:18:53and it runs the jobs
0:18:55by invoking the programs
0:18:56from the command line.
0:18:57There are different ways you could do this: if you love Perl or Python or whatever, you
0:19:01can use that,
0:19:02but that's how our scripts work.
0:19:04Something that
0:19:06I haven't included on this diagram, but that is kind of part of the
0:19:10dependency structure, is some
0:19:12tools that we rely on.
0:19:17For language modeling,
0:19:18by default we use IRSTLM, just because of license issues, but probably you'll want to use
0:19:23SRILM if you
0:19:25want to do a lot of language modeling stuff.
0:19:27There are things like sph2pipe,
0:19:29to read the data from the LDC disks,
0:19:34and so on.
0:19:35We actually have an
0:19:38installation script that will automatically obtain those things, so the scripts can run
0:19:43without you having to manually install stuff
0:19:46on your system.
0:19:48I'm just going to
0:19:49briefly summarize the matrix library. Ondrej will be talking more about it later, but the plan was
0:19:55to allow people to escape after this initial segment,
0:19:58in case they're not so devoted that they want to hear about this stuff.
0:20:03As I said, it's a C++ wrapper for BLAS and CLAPACK.
0:20:07And we've, well, I should say Ondrej has gone to a lot of trouble to ensure that it can compile
0:20:12in the various
0:20:13different configurations of what
0:20:16libraries you have on your system.
0:20:18So it can either work from BLAS plus CLAPACK,
0:20:21or from ATLAS, or using
0:20:23Intel's MKL.
0:20:25The reason is that on some systems you might have one but not the other.
0:20:29ATLAS is an implementation of BLAS that's
0:20:32kind of optimized to the specific hardware,
0:20:37and is generally more efficient.
0:20:40The code that we've wrapped covers
0:20:42generic matrices, like square matrices,
0:20:46and also packed symmetric matrices, where you
0:20:50have a symmetric matrix and only store the lower triangle,
0:20:53so it's stored like this, this, this,
0:20:57and packed triangular matrices.
0:20:59There are other formats that BLAS and CLAPACK support, but these are the ones that we felt were
0:21:04applicable to
0:21:06speech processing; we don't use a lot of sparse matrices
0:21:10and the like.
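The packed symmetric storage mentioned here can be sketched as follows. This is an illustrative class, not the toolkit's own: the name `PackedSymmetric` and the row-by-row ordering of the lower triangle are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical packed symmetric matrix: only the lower triangle is stored,
// row by row, so an n x n matrix needs n * (n + 1) / 2 values.
class PackedSymmetric {
 public:
  explicit PackedSymmetric(std::size_t n)
      : n_(n), data_(n * (n + 1) / 2, 0.0) {}

  void Set(std::size_t r, std::size_t c, double v) { data_[Index(r, c)] = v; }
  double Get(std::size_t r, std::size_t c) const { return data_[Index(r, c)]; }
  std::size_t Dim() const { return n_; }
  std::size_t NumStored() const { return data_.size(); }

 private:
  // (r, c) and (c, r) map to the same slot; row r starts at r * (r + 1) / 2.
  static std::size_t Index(std::size_t r, std::size_t c) {
    if (r < c) std::swap(r, c);
    return r * (r + 1) / 2 + c;
  }
  std::size_t n_;
  std::vector<double> data_;
};
```

Writing element (0, 2) and reading element (2, 0) touch the same stored value, so symmetry holds by construction and the storage is roughly halved.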
0:21:14This library also includes things like SVD and FFT.
0:21:18FFT isn't supplied by any of those libraries, but
0:21:22we got permission from Rico Malvar of Microsoft
0:21:25to
0:21:26use his code,
0:21:27so it has a good one.
0:21:32Something about the matrix library: even if you don't buy into the whole toolkit,
0:21:36if you need a C++ matrix library, it's
0:21:40probably quite good. In fact, it's surprising that there doesn't seem to be a lot out there
0:21:45that fills this niche. There's Boost,
0:21:48but that's a rather weird library, and I don't think a lot of people like it.
0:21:57Okay, a few words about OpenFst.
0:22:00I assume everyone knows what FSTs are.
0:22:03AT&T had this command-line toolkit,
0:22:06but I don't believe they ever released the source.
0:22:08So when some of those guys went to Google, they decided to have one that was
0:22:12open source, and it's Apache-licensed.
0:22:15That's partly the reason we went for the Apache license,
0:22:18because we figured that,
0:22:20if we're going to use OpenFst, there's no real point in having a
0:22:23different license, because it just gives the lawyers a headache.
0:22:26So we went for the same one.
0:22:31We compile against it, so for instance the decoder
0:22:35doesn't use a special decoding-graph format;
0:22:38it uses the same in-memory structures as OpenFst.
0:22:43And by the way, OpenFst has a lot of templates and stuff, so there
0:22:47is not just one FST format; there are a lot of them.
0:22:49So if you wanted to, you could
0:22:52template your decoder on some fancy format that would be, let's say, compact or dynamically expanded or something.
0:22:59We're not going to go into that in detail today.
0:23:02We actually implemented various extensions to OpenFst.
0:23:07Some of our recipes are perhaps not totally in the spirit of OpenFst, because
0:23:12those guys have a particular recipe that they use,
0:23:15and ours is just a little bit different.
0:23:20Later on I can explain why;
0:23:21I feel that there are good reasons for it, though I don't know if those guys would agree.
0:23:31A few words about I/O.
0:23:33It was a controversial decision among the group to use C++ streams.
0:23:38In the end we decided to do it, partly because OpenFst also does it.
0:23:43I know a lot of people prefer C-based I/O,
0:23:46but we do this.
0:23:48We support binary and text-mode formats, a little bit like HTK, so each
0:23:53object in the toolkit
0:23:55has a function that will
0:23:57write, and it takes a little argument, binary or text,
0:24:00so it'll just
0:24:01put its data out to the stream in binary or text mode.
0:24:05And each object also has a read function that does the same thing.
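A minimal sketch of that convention, assuming a hypothetical `FloatVector` type with `Write` and `Read` methods that take a stream and a binary-or-text flag. The on-disk layout used here (a 32-bit size followed by raw floats, or space-separated text) is invented for the example and is not the toolkit's format.

```cpp
#include <cassert>
#include <cstdint>
#include <iostream>
#include <sstream>
#include <vector>

// Hypothetical object following the "Write(stream, binary)" convention.
struct FloatVector {
  std::vector<float> values;

  void Write(std::ostream &os, bool binary) const {
    std::int32_t size = static_cast<std::int32_t>(values.size());
    if (binary) {
      // Raw bytes: size first, then the floats themselves.
      os.write(reinterpret_cast<const char *>(&size), sizeof(size));
      os.write(reinterpret_cast<const char *>(values.data()),
               static_cast<std::streamsize>(sizeof(float) * values.size()));
    } else {
      // Human-readable: "<size> v1 v2 ...".
      os << size;
      for (float v : values) os << ' ' << v;
      os << '\n';
    }
  }

  void Read(std::istream &is, bool binary) {
    std::int32_t size = 0;
    if (binary) {
      is.read(reinterpret_cast<char *>(&size), sizeof(size));
      values.resize(static_cast<std::size_t>(size));
      is.read(reinterpret_cast<char *>(values.data()),
              static_cast<std::streamsize>(sizeof(float) * values.size()));
    } else {
      is >> size;
      values.resize(static_cast<std::size_t>(size));
      for (float &v : values) is >> v;
    }
  }
};
```

The same object can then be dumped in text mode for inspection or in binary mode for speed, and the caller only flips one flag.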
0:24:11As is the standard thing in many toolkits, we extend filenames in various ways.
0:24:15Like, this can mean the standard input or standard output;
0:24:18this is just a command,
0:24:20and this is how it knows that it's a command;
0:24:24and this is
0:24:25an offset into a file, meaning
0:24:28it will open the file and fseek to that position.
0:24:31This is useful for reasons that will be described later.
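The slide's actual notation is not captured in the transcript, so the following is only a sketch under assumed syntax: `-` for standard input, a trailing `|` for a command whose output is read, and `path:offset` for a byte offset into a file. `ClassifyInput` and `InputKind` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class InputKind { kStandardInput, kPipe, kFileWithOffset, kPlainFile };

// Classifies an extended input "filename". The concrete syntax is an
// assumption for this sketch, not taken from the talk.
InputKind ClassifyInput(const std::string &name) {
  if (name == "-") return InputKind::kStandardInput;
  if (!name.empty() && name.back() == '|') return InputKind::kPipe;
  // "path:digits" is treated as a file plus a byte offset to seek to.
  std::size_t colon = name.rfind(':');
  if (colon != std::string::npos && colon + 1 < name.size() &&
      name.find_first_not_of("0123456789", colon + 1) == std::string::npos)
    return InputKind::kFileWithOffset;
  return InputKind::kPlainFile;
}
```

A program that opens its inputs through one such function gets pipes, standard input and offsets for free, without any per-program logic.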
0:24:39This archive format is
0:24:41quite a fundamental part of the way
0:24:44Kaldi works.
0:24:46I'm going to describe this more later, in another talk, but the basic concept is:
0:24:51you have a collection of objects, let's imagine that they're matrices,
0:24:55and they are indexed by a string,
0:24:58where the string might be, let's say, an utterance ID.
0:25:01So you want to have some way to
0:25:04access this collection of
0:25:06strings and matrices.
0:25:09There might be a couple of different ways you could do that: you might want to go sequentially
0:25:12through them,
0:25:13as in an accumulation of some kind, or
0:25:15you might want to do random access.
0:25:17So there's a whole framework for doing this.
0:25:21Basically, the reason is so that
0:25:23most of the Kaldi code doesn't have to worry about
0:25:27things like opening files and error conditions.
0:25:30There doesn't have to be a lot of logic about that in the command-line programs, because it's
0:25:34all handled by some
0:25:36generic framework.
0:25:37But apart from this, we tried to avoid
0:25:39generic frameworks.
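The idea of a keyed collection with both sequential and random access can be sketched with a toy text archive. The line format (`<key> <count> <values...>`) and the function names below are assumptions for illustration, and float vectors stand in for the matrices mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <map>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Toy text archive: one entry per line, "<key> <count> <v1> ... <vN>".
void WriteEntry(std::ostream &os, const std::string &key,
                const std::vector<float> &value) {
  os << key << ' ' << value.size();
  for (float v : value) os << ' ' << v;
  os << '\n';
}

// Sequential access: reads the next entry; returns false at end of archive.
bool ReadEntry(std::istream &is, std::string *key, std::vector<float> *value) {
  std::size_t count = 0;
  if (!(is >> *key >> count)) return false;
  value->resize(count);
  for (float &v : *value) is >> v;
  return static_cast<bool>(is);
}

// Random access: this sketch simply loads the remaining entries into a map.
std::map<std::string, std::vector<float>> LoadAll(std::istream &is) {
  std::map<std::string, std::vector<float>> table;
  std::string key;
  std::vector<float> value;
  while (ReadEntry(is, &key, &value)) table[key] = value;
  return table;
}
```

A program that accumulates statistics would loop over `ReadEntry`, while one that needs a particular utterance would look its key up in the loaded table; neither has to open files or handle errors itself.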
0:25:44The tree-building and clustering code:
0:25:46it's based on
0:25:47very generic
0:25:49clustering mechanisms.
0:25:51i guess hard to model whatever they call it
0:25:54So that internal code doesn't assume a lot about what your tree is.
0:25:59It is able to build decision trees in different ways, including
0:26:02things like sharing the tree roots
0:26:04and asking questions about the central phone,
0:26:07stuff like that.
0:26:10It's very scalable to wide context, for example quinphone.
0:26:16It's hard to write code that will scale to quinphone, because if you have to enumerate all of
0:26:20the contexts,
0:26:22that's hard to do.
0:26:25But we basically avoid ever enumerating those contexts.
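A small worked example shows why enumeration stops scaling. The phone-set size of 40 is a hypothetical figure, and counting the windows that occur in data is only an illustration of the gap; it is not a description of how the toolkit itself avoids enumeration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Number of distinct contexts if every combination were enumerated:
// num_phones raised to the context width.
std::uint64_t NumPossibleContexts(std::uint64_t num_phones, int width) {
  std::uint64_t n = 1;
  for (int i = 0; i < width; ++i) n *= num_phones;
  return n;
}

// Number of distinct windows of `width` phones that actually occur in data.
std::size_t NumObservedContexts(const std::vector<int> &phones,
                                std::size_t width) {
  std::set<std::vector<int>> seen;
  for (std::size_t i = 0; i + width <= phones.size(); ++i)
    seen.insert(std::vector<int>(phones.begin() + i,
                                 phones.begin() + i + width));
  return seen.size();
}
```

With 40 phones, triphones give 64,000 possible contexts while quinphones give 102,400,000, whereas the number of windows observed in any training set is bounded by its length.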
0:26:29As an example of
0:26:30how we make use of this generality:
0:26:33in the Wall Street Journal recipe
0:26:35we increase the phone set, so that we're asking about the phone position and the stress.
0:26:41I know HTK supports this, because I think there was
0:26:44a paper
0:26:45about doing that.
0:26:48But if the phone set is much larger than that, probably
0:26:52an approach based on enumeration of contexts would start to break down.
0:26:57you don't think so no i mean like it was a thousand thousand keep this day
0:27:04okay well i
0:27:10Okay, HMM and transition modeling code.
0:27:15We try to have an approach where
0:27:17a piece of code only needs to know
0:27:20the minimum it needs to know.
0:27:21So the HMM and transition modeling code doesn't really have any notion of what a pdf is;
0:27:27it purely does what it needs to do,
0:27:30and the rest is separate.
0:27:32This is probably a pretty standard approach:
0:27:36you specify a prototype topology,
0:27:38a topology for each phone, which says how many states there are and what the transitions are.
0:27:44And we make the transitions
0:27:46separate depending on the pdf,
0:27:50so that if the pdfs in two states are different, then the transitions out of those
0:27:54states are separately estimated.
0:27:56This is just the most
0:27:58specifically that you can estimate the transitions without having your
0:28:02decoding graph blow up.
0:28:04It's not really clear that this matters, but
0:28:08we just felt that we should do the best we could on that.
0:28:13There are mechanisms for turning these HMMs into FSTs, because
0:28:17all of the training and decoding is FST-based, so you kind of have to have an FST representation of these.
0:28:28and i are F S T so what you would normally imagine is that the F it has input symbols
0:28:32that are the
0:28:33the pdf so some symbol the represents the P D and the output symbols of the word
0:28:39but the problem with that is let's suppose you uh
0:28:43you want to find out what the phone sequence
0:28:45it's all well well and good if your
0:28:48if if your phone had separate tree
0:28:51so that so that it was could for each state which phone it belong
0:28:55but but what if you had a larger phone set and you wanted to have a shared tree room
0:28:59and that wasn't you know one to one mappings
0:29:01oh there was in the mapping you need so
0:29:04so we have a input labels on the fsts the encoded bit more information
0:29:10and this is also useful in training the transitions because
0:29:12sometimes just the pdf labels wouldn't you of you quite enough information
0:29:17the train the transition
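One way to picture such labels is as indices into a table of tuples. The sketch below is hypothetical: the struct fields, the class name `TransitionTable`, and the method names are invented here, and only the idea (an integer label that maps both to a pdf and to a phone) comes from the talk.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// What one FST input label stands for in this sketch.
struct TransitionInfo {
  int phone;      // phone this label belongs to
  int hmm_state;  // state within that phone's topology
  int pdf;        // index of the output distribution (may be shared)
  int arc;        // which transition out of the state
};

class TransitionTable {
 public:
  // Registers a tuple and returns its integer label. Labels start at 1,
  // because label 0 is reserved for epsilon in FSTs.
  int Add(const TransitionInfo &info) {
    table_.push_back(info);
    return static_cast<int>(table_.size());
  }
  int LabelToPdf(int label) const { return table_[Index(label)].pdf; }
  int LabelToPhone(int label) const { return table_[Index(label)].phone; }

 private:
  static std::size_t Index(int label) {
    return static_cast<std::size_t>(label - 1);
  }
  std::vector<TransitionInfo> table_;
};
```

Two phones that share a pdf through a shared tree root get different labels, so an alignment written as a list of these integers still determines the phone sequence.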
0:29:21There's a couple of different ways to create decoding graphs.
0:29:24For training purposes, you have to create a lot of these things at the same time,
0:29:29and combining the FST algorithms using scripts
0:29:32would be quite inefficient, because you have the overhead of process creation.
0:29:37So we
0:29:39call the OpenFst algorithms at the C++ level and combine them together,
0:29:43so that
0:29:45you can create your decoding graphs for training efficiently.
0:29:50And we typically put them in one of these archives,
0:29:54basically a big file, concatenated together with little keys in it,
0:29:58on disk,
0:29:59so that you don't have the I/O of
0:30:01accessing hundreds of little files.
0:30:03Training uses the Viterbi path through
0:30:05these graphs.
0:30:08For test time,
0:30:09we didn't use this approach of C++, because there's just no point.
0:30:14It's basically scripts, and I'm going to go through the scripts later for those who are interested.
0:30:23At least the scripts that create the decoding graph call some OpenFst tools, but also some of our own,
0:30:28and that relates partly to a difference in recipes,
0:30:31but
0:30:32I'll talk more about that later,
0:30:34after the break.
0:30:37Arnab is going to talk later about some of the acoustic modeling code;
0:30:41I'm just going to give a brief summary.
0:30:43Our GMM code is
0:30:45very simple; it's not part of some big framework.
0:30:47It's kind of like an
0:30:49object that has the means and the variances,
0:30:52and it can evaluate likelihoods if you give it the features.
0:30:55But it doesn't
0:30:56inherit from some
0:30:58generic acoustic model class, and it doesn't attempt
0:31:01to know about things like linear transforms; it just sits there,
0:31:04and things like linear transforms
0:31:07kind of have to access the model and do what they want.
0:31:10The reason for that is that if
0:31:12the GMM knows too much,
0:31:14then whatever you do that's fancy,
0:31:16you have to then change the GMM code,
0:31:19and it's just
0:31:21not a nice situation.
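In that spirit, a diagonal-covariance GMM can be little more than its parameters plus one likelihood function. The following is a minimal sketch with a hypothetical name (`DiagGmmSketch`); it is not the toolkit's class, and it does no input validation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical diagonal-covariance GMM: it holds parameters and evaluates
// log-likelihoods, and knows nothing about transforms, states or HMMs.
struct DiagGmmSketch {
  std::vector<double> weights;                 // one per component
  std::vector<std::vector<double>> means;      // [component][dimension]
  std::vector<std::vector<double>> variances;  // [component][dimension]

  double LogLikelihood(const std::vector<double> &x) const {
    const double kLog2Pi = std::log(2.0 * 3.14159265358979323846);
    std::vector<double> logs(weights.size());
    double max_log = -std::numeric_limits<double>::infinity();
    for (std::size_t i = 0; i < weights.size(); ++i) {
      double l = std::log(weights[i]);
      for (std::size_t d = 0; d < x.size(); ++d) {
        double diff = x[d] - means[i][d];
        l -= 0.5 * (kLog2Pi + std::log(variances[i][d]) +
                    diff * diff / variances[i][d]);
      }
      logs[i] = l;
      if (l > max_log) max_log = l;
    }
    // Log-sum-exp over components, shifted by the maximum for stability.
    double sum = 0.0;
    for (double l : logs) sum += std::exp(l - max_log);
    return max_log + std::log(sum);
  }
};
```

Anything fancier, such as estimating a transform, would read these members from outside rather than requiring changes to the class.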
0:31:24So,
0:31:26yeah, we have a separate class for GMM stats accumulation
0:31:29and doing the update.
0:31:32For a collection of GMMs, like in a GMM-HMM system,
0:31:36we have a class that pretty much behaves like a vector of GMMs.
0:31:41So it's
0:31:42a fairly simple thing.
0:31:43There's no notion of the name of a state; it is just an integer.
0:31:47Really, we've avoided having
0:31:50names for things in the code.
0:31:57Oh, this lowercase vector just refers to the STL vector,
0:32:01but there is an uppercase Vector too,
0:32:03which is something in the matrix library.
0:32:08Well, the code is case-sensitive,
0:32:12even on Windows.
0:32:21We've got quite a lot of linear transform code.
0:32:26LDA, HLDA.
0:32:28Again, I'm sitting on the fence with regard to the naming of this technique;
0:32:32I don't want to offend anyone.
0:32:36It's another of these multi-name things, okay.
0:32:38A linear version of VTLN: I mean, we tried regular VTLN, and
0:32:42everyone knows that it's kind of tricky to get it to work;
0:32:45it was the linear one that worked better in the end.
0:32:47And this is something new;
0:32:50it's a kind of replacement for VTLN that works a little bit better.
0:32:53I'm going to
0:32:54explain what it is at a later date.
0:32:59A lot of this:
0:33:02when these transforms are global, the way we handle them is that
0:33:06it just becomes part of the feature space.
0:33:09So it's just
0:33:09stored as a matrix on disk, and we
0:33:12use a lot of pipes, so the way it actually works is that this matrix
0:33:15is multiplied by the features as part of a pipe.
0:33:18It might seem like, obviously, a silly way to do it from a computational point of view, but
0:33:23it just makes the scripts really convenient.
0:33:29So when I say they're applied in a unified way, what I mean is that the code that
0:33:33estimates any of these transforms
0:33:35really outputs just a matrix.
0:33:37So
0:33:38there's no, like,
0:33:40special transform object.
0:33:43That's just,
0:33:44well, okay, there is one for the regression tree case,
0:33:48but for the global one it's just a matrix.
0:33:52I mean, it was a point of contention among us as to whether to do it this way,
0:33:56but
0:33:57some of us felt that it was important to keep the simple cases simple,
0:34:01and to avoid having a
0:34:03framework
0:34:04for the cases where one wasn't necessary.
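If a global transform is just a matrix, applying it is a single loop over frames. The sketch below assumes the affine transform is stored as one d x (d+1) matrix whose last column is the offset; that layout and the name `ApplyAffine` are assumptions for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::vector<std::vector<double> > Matrix;  // [row][column]

// Applies y = A x + b to every frame (row) of `feats`, where `transform`
// holds [A b]: each row has d linear terms followed by one offset term.
Matrix ApplyAffine(const Matrix &transform, const Matrix &feats) {
  Matrix out;
  out.reserve(feats.size());
  for (std::size_t t = 0; t < feats.size(); ++t) {
    const std::vector<double> &x = feats[t];
    std::vector<double> y(transform.size());
    for (std::size_t r = 0; r < transform.size(); ++r) {
      assert(transform[r].size() == x.size() + 1);
      double sum = transform[r][x.size()];  // offset term b[r]
      for (std::size_t c = 0; c < x.size(); ++c) sum += transform[r][c] * x[c];
      y[r] = sum;
    }
    out.push_back(y);
  }
  return out;
}
```

Because the estimation code only has to emit such a matrix, a separate small program can apply it to the features inside a pipe, whichever technique produced it.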
0:34:10okay decoders
0:34:11well all of the decoders that we currently have use
0:34:14fully expanded FSTs, i mean when i say fully expanded i mean it's down to the
0:34:18HMM state level, with
0:34:20self loops represented as uh
0:34:23actual you know FST arcs
0:34:26i know there's a lot of ways to do this and initially
0:34:28one of the thoughts we had
0:34:29would be that
0:34:31you know we wouldn't have the self loops or we might even have
0:34:34representations of the states, and then it was just so much simpler to do it this way
0:34:38this is what we have now
0:34:41we have three decoders, but by decoder we mean the uh
0:34:44C plus plus code that does decoding
0:34:47it's not necessarily the same thing as a command line decoding program
0:34:50we have three decoders on the spectrum from simple to fast
0:34:53and the reason for this is that
0:34:54once you have a complicated fast decoder it's almost impossible to debug
0:34:59so if something goes wrong you can always just run the simple one
0:35:02you know and you can find out if it's a decoder issue
0:35:08we wanted to make it so the decoder doesn't assume too much about what your model is
0:35:13so again the decoder has no idea of GMMs, HMMs, it doesn't even know about features
0:35:18all that
0:35:20all the decoder knows about is give me the likelihood or
0:35:24score
0:35:25for this
0:35:26uh frame index
0:35:28and this pdf index
0:35:30so the interface that the decoder sees is almost like a matrix
0:35:35a matrix of uh
0:35:37of floats, but it's not represented that way because you want to
0:35:41you know you want to have it on demand
0:35:45so yeah this is the decodable interface, an interface that the
0:35:49it's a very simple interface that says give me the likelihood for this you know time frame and
0:35:54how many time frames are there
0:35:57and how many pdf indexes are there, that's almost all the interface is
0:36:01but this is the interface that the decoder requires, so the idea was you implement you know
0:36:06your own fantastic model
0:36:08and you
0:36:11i mean it doesn't really matter what the interface of that model is
0:36:13you create a small object that satisfies the decodable interface
0:36:17and knows how to get the likelihoods from your own fantastical model
0:36:21and then you uh
0:36:23you instantiate the decoder with that or you give it that
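The decodable idea described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the toolkit's C++ interface: the class and method names (`Decodable`, `log_likelihood`, `num_frames`, `num_pdfs`), the toy one-dimensional model, and the `best_pdf_per_frame` consumer are all hypothetical.

```python
from abc import ABC, abstractmethod


class Decodable(ABC):
    """Everything a decoder may ask: a score per (frame, pdf index), on demand."""

    @abstractmethod
    def log_likelihood(self, frame, pdf_index):
        ...

    @abstractmethod
    def num_frames(self):
        ...

    @abstractmethod
    def num_pdfs(self):
        ...


class ToyModel:
    """Hypothetical stand-in for 'your own fantastic model' (1-D, unit variance)."""

    def __init__(self, means):
        self.means = means

    def score(self, pdf_index, feature):
        return -0.5 * (feature - self.means[pdf_index]) ** 2


class DecodableToy(Decodable):
    """Small wrapper object: knows the model and the features, hides both."""

    def __init__(self, model, features):
        self.model = model
        self.features = features

    def log_likelihood(self, frame, pdf_index):
        # Computed lazily, so the full frames-by-pdfs matrix is never stored.
        return self.model.score(pdf_index, self.features[frame])

    def num_frames(self):
        return len(self.features)

    def num_pdfs(self):
        return len(self.model.means)


def best_pdf_per_frame(decodable):
    """Toy consumer, NOT a real decoder: best-scoring pdf index for each frame."""
    return [
        max(range(decodable.num_pdfs()), key=lambda j: decodable.log_likelihood(t, j))
        for t in range(decodable.num_frames())
    ]


decodable = DecodableToy(ToyModel([0.0, 5.0]), [0.1, 4.8, 5.2, -0.3])
print(best_pdf_per_frame(decodable))
```

The consumer only touches the three interface methods, so swapping in a different model type means writing a different small wrapper and leaving the consumer alone.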
0:36:27so uh
0:36:31the gmm wrapping okay
0:36:34yeah so our command line decoding programs are very simple, we don't have like multi-pass or anything
0:36:39we don't have uh
0:36:42we don't attempt to support multiple types of model
0:36:46an example decoding program is
0:36:48decode with a GMM
0:36:51but no
0:36:52with no multiple class adaptation
0:36:54yeah so it does the simple thing
0:36:55and then if you want to support let's say multi-class
0:36:58MLLR or fMLLR
0:37:00we uh have a separate command line program
0:37:03yeah the idea is that
0:37:04there might be people coming into the project who might want to be able to understand that command line program
0:37:09and we don't want to make the barrier to entry too high
0:37:12we've got to
0:37:13support the overhead of having to maintain two parallel decoders
0:37:17but it keeps it relatively simple to understand any given one
0:37:24we support the standard types of features
0:37:27MFCC and PLP, the features are quite similar to
0:37:30the HTK ones
0:37:32we put in a reasonable range of configurability but
0:37:35i mean being realistic with respect to how much people are really working on this stuff i mean i think
0:37:40most people who are doing research on this would probably be coming up with their own features
0:37:44so we don't support every possible
0:37:46combination of it
0:37:47for every possible change
0:37:49we only read the wav format because our reasoning is
0:37:53you can always find an external program to convert it and
0:37:57do it as part of a pipe
0:38:07well we can read HTK features, and apart from that there's no more that we support
0:38:15i mean
0:38:16our basic concept is to have people use the system
0:38:19as a complete system
0:38:21because once you start supporting model you know conversion it just gets to be work
0:38:26but yeah there's the HTK features as a special case
0:38:31we typically will write features and other large objects to a single very large file, this relates to this archive format
0:38:37so the format of the file is a key, space, then your object
0:38:41and another key, a space, then the object
0:38:43and uh
0:38:45we have efficient mechanisms to read such files
0:38:48the two normal cases are firstly sequential access
0:38:51where you want to iterate over the things in an archive
0:38:54secondly random access, and there are different ways to do that, one is
0:38:58you can write a separate file that has little
0:39:00pointers into the file
0:39:02another is that
0:39:03you can kind of simulate random access even though you're really going sequentially
0:39:07if you know that the keys are sorted
0:39:10uh and another way is if the file isn't that big
0:39:13you can do random access by just having the code go through the whole file
0:39:18and store the objects in memory
0:39:19that's not exactly scalable but
0:39:21for a lot of uh
0:39:23types of archive it really doesn't matter
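The archive idea (key, space, object, repeated, plus an optional side file of pointers) can be sketched in a few lines of Python. This is a simplified text layout invented for illustration, one object per line holding a list of floats; it is not the toolkit's actual on-disk format, and the function names are hypothetical.

```python
import io


def write_archive(stream, items):
    """Write 'key<space>object' records; return an index of key -> byte offset.

    The returned index plays the role of the separate pointer file: each
    offset points at the start of the object, just past the key.
    Assumption: keys contain no spaces.
    """
    index = {}
    for key, values in items:
        stream.write(key.encode() + b" ")
        index[key] = stream.tell()
        stream.write(" ".join(repr(v) for v in values).encode() + b"\n")
    return index


def read_sequential(stream):
    """Sequential access: iterate over every (key, object) pair in file order."""
    stream.seek(0)
    for line in stream:
        key, _, rest = line.decode().partition(" ")
        yield key, [float(x) for x in rest.split()]


def read_random(stream, index, key):
    """Random access: jump straight to one object using the pointer index."""
    stream.seek(index[key])
    return [float(x) for x in stream.readline().decode().split()]


archive = io.BytesIO()
pointers = write_archive(archive, [("utt1", [1.0, 2.5]), ("utt2", [3.0])])
print(list(read_sequential(archive)))
print(read_random(archive, pointers, "utt2"))
```

The other two random-access strategies mentioned in the talk (relying on sorted keys, or loading the whole file into memory) would replace `read_random` without changing the writer.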
0:39:27oh yeah so the feature
0:39:29feature level processing like adding deltas, fMLLR
0:39:32typically each one of those is a separate program so you have like a sequence of programs in a pipe
0:39:37and again that's a bit inefficient but
0:39:39it's not like it's really consuming more than ten percent of your CPU so
0:39:43you just don't care that much, this has been written with
0:39:46ease of use in mind
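As a rough analogy for "each processing step is a separate program in a pipe", the Python sketch below chains two generator stages over (key, frames) pairs. The stage names are hypothetical, and the delta computation is a deliberately simplified central difference, not the regression-window deltas a real front end would typically use.

```python
def subtract_mean(utterances):
    """Stage 1: per-utterance mean normalization of each feature dimension."""
    for key, frames in utterances:
        dim = len(frames[0])
        mean = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
        yield key, [[f[d] - mean[d] for d in range(dim)] for f in frames]


def append_simple_deltas(utterances):
    """Stage 2: append a simplified delta (half the next-minus-previous frame).

    Edge frames reuse themselves as the missing neighbour.
    """
    for key, frames in utterances:
        out = []
        for t, frame in enumerate(frames):
            prev = frames[max(t - 1, 0)]
            nxt = frames[min(t + 1, len(frames) - 1)]
            delta = [(n - p) / 2.0 for n, p in zip(nxt, prev)]
            out.append(frame + delta)
        yield key, out


raw = [("utt1", [[1.0], [2.0], [3.0]])]
# Stages are chained like programs in a shell pipe; each sees one utterance at a time.
for key, frames in append_simple_deltas(subtract_mean(raw)):
    print(key, frames)
```

Each stage does one thing, so rearranging or dropping a step is a change to how the stages are chained rather than to any stage's code.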
0:39:50like i said there's a lot of command line tools, this is an example of uh
0:39:54a command line, and these backslashes are just for
0:39:57the shell
0:39:58so uh
0:40:01this is one of the many programs
0:40:03the PLP would be a separate command line
0:40:06this is just you know
0:40:07an option
0:40:08and there are two command line arguments in this uh
0:40:11i'm gonna be explaining later on more about what these mean, but this is
0:40:14directing it to write these things to
0:40:16an archive, in the
0:40:18key object key object format
0:40:21and then
0:40:22i don't know this is the input
0:40:23we have to read it
0:40:25and then this is telling it to write an archive and also
0:40:28an scp file that
0:40:30kind of has little pointers into the archive
0:40:32so that you can efficiently access the features by random access
0:40:36um um
0:40:36um um
0:40:39so yeah another example of another feature of this is that there's only one option here, we have
0:40:44no more than a few options on any given command line
0:40:47i mean it's a local program i support
0:40:49less the channel
0:40:51it's not very configuration driven, it's more driven by how you combine these programs
0:40:58oh yeah something else about this whole archive uh formalism is that
0:41:02the C plus plus level code in the individual command line tools
0:41:06doesn't have to worry too much about I/O uh
0:41:10you can just treat
0:41:11the uh
0:41:13when you get something like this there's
0:41:15there's very short uh
0:41:17statements in the C plus plus that will iterate over it
0:41:20so it doesn't have to
0:41:22think too much about the error conditions
0:41:26but yep
0:41:32FST generation
0:41:35okay that's another part of the talk later on
0:41:45for training
0:41:47there's a command line program that will
0:41:49kind of do the FST generation for you and generate lots of little FSTs, one for each
0:41:54yeah so for testing
0:41:56it's a script that calls OpenFst programs and our versions of OpenFst programs
0:42:05i'm gonna go through that script later on in another part
0:42:09uh this slide, it's not obvious you know how to understand the script
0:42:13but this is just to give people some idea
0:42:16of uh
0:42:17of how we do training
0:42:19so you know this is a bash script, it's doing a loop over the iterations
0:42:24uh and this one is estimating MLLT
0:42:29i suppose this script review the bias and sorry man i
0:42:33but as that we are is the colour i've yet
0:42:35so a
0:42:38so if it's one of the iterations where we do MLLT
0:42:42then uh
0:42:44so we have on disk
0:42:45some uh alignments, these are like state level alignments
0:42:49it's in an archive
0:42:51format that i mentioned
0:42:52so this converts them to posteriors
0:42:54just in a very trivial way by saying that each
0:42:57each one has a posterior of one
0:43:00this takes the
0:43:01this gives a zero weight to the silence, and
0:43:03that would be a
0:43:04this would be a variable in bash
0:43:08yeah so this takes away the uh
0:43:10you know the silences' posteriors
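The two steps just described (turn a hard alignment into posteriors of one, then give silence zero weight) can be sketched as below. This is a simplified illustration, not the toolkit's programs: the function names are hypothetical, and integer state ids stand in directly for whatever ids the real alignments contain, with the set of silence ids supplied by the caller.

```python
def ali_to_post(alignment):
    """Trivial conversion: each frame's aligned state gets posterior 1.0."""
    return [[(state, 1.0)] for state in alignment]


def weight_silence(posteriors, silence_states, weight=0.0):
    """Scale the posteriors of silence states; drop entries that become zero."""
    weighted = []
    for frame in posteriors:
        scaled = [(s, p * weight if s in silence_states else p) for s, p in frame]
        weighted.append([(s, p) for s, p in scaled if p != 0.0])
    return weighted


alignment = [1, 1, 7, 8, 1]   # one state id per frame; id 1 is assumed to be silence
post = ali_to_post(alignment)
print(weight_silence(post, silence_states={1}))
```

With a weight of zero, silence frames contribute nothing to whatever statistics are accumulated downstream, which is the effect the talk describes.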
0:43:12and this is an accumulation program
0:43:14that uh
0:43:16this would be the model, and the features are a bash variable that would be
0:43:21defined elsewhere
0:43:23uh this
0:43:24uh hmm
0:43:25i think that refers to the standard input
0:43:28meaning that it's reading an archive from the standard input, and the
0:43:30output by the
0:43:32you know this
0:43:32means that it's writing an archive to standard output
0:43:35yeah the output of these programs is passed by a pipe
0:43:42all of the error and logging output goes to the standard error uh
0:43:45because we've kind of used up the standard output for this type of stuff
0:43:50so we're just redirecting the logging output
0:43:53so then this is a separate program that does the MLLT estimation
0:43:58it takes in uh
0:43:59let me see
0:44:01uh it's computing some kind of matrix
0:44:04and then uh
0:44:06because it's MLLT yeah
0:44:08after you compute the transform
0:44:10you have to change the means of your model so
0:44:13we have a separate, we like to keep everything separate
0:44:16so you know transforming the means is a separate operation so we have a separate program for that
0:44:21and then
0:44:22we have to compose the MLLT transform with the previous one
0:44:26so this is another little program that does that
0:44:29so this was setting another bash variable to make
0:44:32the uh features correspond now to the
0:44:35new uh MLLT features
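Composing a newly estimated transform with the previous one, and applying the new transform to model means, both reduce to matrix products. The Python sketch below illustrates this under the assumption that both transforms are square linear matrices and that the previous transform is applied to the features first; the helper names are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def matvec(m, v):
    """Matrix times vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def compose(new, previous):
    """One matrix equivalent to applying `previous` first, then `new`."""
    return matmul(new, previous)


previous = [[1.0, 1.0], [0.0, 1.0]]
new = [[2.0, 0.0], [0.0, 3.0]]
composed = compose(new, previous)

feature = [1.0, 2.0]
# Applying the two transforms in sequence matches applying the composed one.
print(matvec(new, matvec(previous, feature)), matvec(composed, feature))

# The model means live in the old feature space, so they are moved with `new`.
old_mean = [0.5, -1.0]
print(matvec(new, old_mean))
```

Storing only the composed matrix keeps the "one matrix per feature pipeline" convention intact after every update.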
0:44:40so as you can see this is a variable in bash
0:44:43and it's
0:44:43this would be passed as a command line argument to one of the programs
0:44:47and it's a command involving a pipe that actually involves
0:44:51calling two separate Kaldi uh
0:44:55programs, each with their own arguments
0:44:57so obviously you can guess from the names of those programs what they're doing
0:45:01and then of course uh it seems to have features sub
0:45:04oh yeah i think we were estimating the MLLT on a subset of features
0:45:08so this is like the same as this but it's
0:45:10it's using less
0:45:12of the data
0:45:15so i think i
0:45:17i spoke about these issues before
0:45:21oh yeah so uh
0:45:24we have example scripts for resource management and wall street journal, and these run from the LDC
0:45:29distributed disks
0:45:31uh now we found in the literature just some uh
0:45:35some uh baselines
0:45:37these numbers are just the basic context dependent system
0:45:42with i think uh mean normalization
0:45:44we have of course more advanced things but
0:45:47you know because it's hard to find in the literature the same thing
0:45:50we're just giving you the unadapted
0:45:53so it's a
0:45:54slightly better than this number will can right someone a two thousand
0:45:58and the HTK paper from ninety four
0:46:01has a slightly better number but this was a gender dependent system
0:46:05so uh
0:46:05so i think basically we're doing the same as
0:46:08you'd expect given the same algorithms
0:46:11i mean
0:46:13i was hoping you know at the start of this whole project that the results would be better uh
0:46:17for issues relating to the tree and phone context and stuff but
0:46:20you know that in we give a senate
0:46:22it's working, there's no major bug
0:46:26uh did of the
0:46:28okay next slide
0:46:32uh just a note on speed and decoding issues
0:46:35we use bigram numbers because the baselines were bigram numbers
0:46:38we can't yet decode with the full
0:46:41with the full uh trigram language model that's
0:46:43distributed with the wall street journal corpus
0:46:46because the FSTs uh
0:46:48they get too large
0:46:50we have of course decoded with the pruned trigram
0:46:52but that's why we're quoting the bigram numbers
0:46:56hopefully by the summer we're gonna
0:46:58there's a couple of things we can do that we're both working on, one is to have a decoder that
0:47:01does some kind of on the fly
0:47:03expansion so that we can uh
0:47:05decode directly with that
0:47:07and the other is to have lattice generation so we can rescore
0:47:11the decoding speed for these wall street journal numbers is about twice as fast as real time
0:47:16and of course that's on a good machine
0:47:18so i mean this is tuned so that you don't get more than zero point one degradation
0:47:23versus a wide beam
0:47:26the wall street journal script takes a few hours on a single machine using
0:47:30we parallelize onto three CPUs
0:47:32this is just an example script, we didn't want to include things like qsub in the example script
0:47:37because then it wouldn't run on uh everyone's machine
0:47:40of course it would be faster if you were doing it in parallel
0:47:54if it in member it well as well
0:47:58but ten gig
0:47:59i i mean
0:48:00i mean you know everyone knows that FST compilation tends to blow up a bit
0:48:04it's not like
0:48:06if you have the size of the model you can just about compiler
0:48:12i don't recall, that's the trigram one for wall street journal
0:48:15i i
0:48:17and then we go how many was but i think
0:48:19i don't think that our stuff is any worse than you know normal FST
0:48:23setups that fully expand everything
0:48:27oh yeah okay resource management results
0:48:29this is a
0:48:31these are HTK results
0:48:33taken uh
0:48:35from uh
0:48:36this is i think this is basically the RMHTK recipe but these results were
0:48:40taken from a paper of mine like in ninety nine or something
0:48:44i just couldn't find in the readme file from RMHTK results on all of the test sets
0:48:49and as you can see the average is the same
0:48:53with the same algorithms we're getting the same result as HTK
0:48:59yeah and the decoding speed on this setup is about zero point one times real time
0:49:16yeah the test set are quite
0:49:20oh yeah
0:49:21it's a very small test set, a handful of words that are of
0:49:25uh this page is mainly
0:49:27just to give you some idea of the kinds of things that are in our example scripts, we have a
0:49:31bunch of
0:49:32different configurations, this is the standard configuration
0:49:35well this is the standard configuration because this is what was in the HTK baseline
0:49:42adding MLLT doesn't really seem to help
0:49:45sorry adding STC
0:49:47doesn't really help
0:49:51well i think nine frames plus LDA actually makes it worse but then
0:49:56when you do
0:49:57uh STC on top of that
0:50:00it's actually better than here, and so this was the kind of, this was the IBM
0:50:04recipe
0:50:07sorry this was the IBM recipe, so i guess there must have been some interaction between these
0:50:11two parts of the recipe
0:50:13that somehow made it work
0:50:14i don't know if it's gonna generalize to other test sets
0:50:17we're gonna find out
0:50:19uh that's splice nine frames plus HLDA
0:50:22triple deltas plus HLDA
0:50:24triple deltas plus
0:50:26LDA plus MLLT
0:50:28this is
0:50:29quite good
0:50:30uh SGMM system, these are all unadapted
0:50:33i have a separate slide for uh adapted experiments
0:50:39if is doing it
0:50:41and that's
0:50:41it's stated otherwise, oh yeah okay so this is per utterance adaptation, this is per speaker
0:50:48this was four point five i think before uh
0:50:51adaptation so it really doesn't help if you do it per utterance and that's because there's too many
0:50:57parameters to model uh
0:50:58this is doing the same thing per speaker, it gets a lot better uh
0:51:01exponential transform is
0:51:03again i'm not gonna describe what it is, it's something like VTLN
0:51:07uh and it gets quite a bit better uh
0:51:09this is VTLN, kind of the linear version of VTLN i believe
0:51:14it is that thing to improve quite a lot
0:51:16and of course the improvement is more pronounced on the per utterance level because
0:51:21you know it's just like a constrained form of fMLLR, so the only point is
0:51:26to do it
0:51:27to do it when you have less data
0:51:30splice nine frames for cell day sex to transform
0:51:34a from well thing i from a lot
0:51:36we only did some of these per speaker because it wouldn't help of the
0:51:41as you can see there's a whole lot of different combinations, this is SGMM including the
0:51:45speaker offsets sets the and thumbs if you member
0:51:48and it does help so
0:51:50so uh i think rick was saying that that wasn't working for him but it seems to be working
0:51:54for us
0:51:55three point one five goes to uh
0:51:59where is it, two point six eight
0:52:01i must have uh forgotten to fill this line in
0:52:04this is SGMM plus fMLLR
0:52:07but no speaker vectors
0:52:10uh per speaker
0:52:13i think i have these numbers but i think i must not have put them in, i think the best number was
0:52:17like two point four
0:52:20point three
0:52:23so general plug for Kaldi
0:52:27i believe it's easy to use, i mean i hope the scripts didn't scare you guys off, the
0:52:31attraction is that once you understand them
0:52:34everything becomes quite simple
0:52:37it kind of does assume that you understand how speech works, like if you're someone who just
0:52:42is randomly
0:52:43moving the script you know changing configurations
0:52:46you're not uh
0:52:47it's not gonna work
0:52:48it doesn't like
0:52:50it doesn't automatically know that the features you have are not compatible with your model
0:52:56so you kind of have to know what you're doing from a speech science point of view
0:53:01it's quite uh
0:53:02it's easy to use that the C plus plus
0:53:04flash to
0:53:06software engineer
0:53:08it's easy to extend and modify
0:53:10you can redistribute your changes or give them back to
0:53:13the Kaldi group
0:53:15we're open to including other people's
0:53:18so that may give you more citations
0:53:21so this
0:53:21is really
0:53:23the end of this first part so
0:53:26you can get up and have a drink, and after a few minutes
0:53:32yeah there's documentation at kaldi.sourceforge.net
0:53:36uh okay it's not as good as HTK's and probably being realistic will never be
0:53:41what we'll do
0:53:42is we'll
0:53:43use the vocabulary that HTK has used and point people to the HTK documentation
0:53:48so then about eight and say that have he had then
0:53:51yeah i know me
0:53:55i use C
0:53:58see i
0:54:02but okay
0:54:03we can have a short break, you can have a drink
0:54:06and disappear if you're not uh
0:54:09that committed to it
0:54:10and then
0:54:12uh uh we've have a gonna talk up to what
0:54:14the fact