[Opening remarks, partly inaudible.] You know, some people expected those prophecies to be fulfilled in their lifetimes. I'm thinking that those prophecies have been just slightly misinterpreted, and the event that they were referring to is this: a wonderful speech, today. I don't think anyone can overestimate the significance of this. Okay.
So, first, just about the name: it's some kind of coffee reference, hence the little coffee bean in the logo. But it's just whatever name we settled on.
The structure of this whole presentation is: first, I'm going to talk for about fifteen or twenty minutes, just giving you a kind of overview, from all sides, of this toolkit. Then we're going to allow people to escape, in case they don't want to know more details than that, and have a short break. After that, Arnab and Ondřej are going to talk about some more low-level stuff: some of the acoustic modeling code, and the matrix library, which is kind of independently useful, even outside of speech. And then after that, I'm going to go through some example scripts that we have, to try to give people a sense of how to use it. Now, on to the next slide.
So, some important aspects of the project. It's licensed under Apache v2.0, which is a BSD-style license that basically allows you to do anything you want with it. There is only an acknowledgement clause, which says you have to acknowledge that the code came from the project; that's about it. It's one of the most open of the standard licenses. The project is currently hosted on SourceForge, which is the standard place for these kinds of open-source projects. Although it's very closely associated with a particular institution, our intention is for it to be more of a thing that lives in the clouds, or on SourceForge (I shouldn't have used that word; that was just gratuitous). The point is, it's very important for it not to just be the pet project of some particular little group, but to represent the best of what's out there; and anyone can be a participant, as long as they can contribute code under this license, and then that's great.
uh
it's basically a C plus plus to at
the code compiles it a native windows and
and the common units but fun like can we're not claiming that a compile once on or you know
other we're problem but but it compiled from on the normal one
a
you have some documentation not as much as takes T K
and and and we have example script
these example scripts and not uh
there just for results also as one and and uh
wall street journal
but we're gonna have more to
they
they basically run from ldc that's
so once you have the this you can kind of point them to the disk
and just
get an idea of how it work
Oh no, I now realize that we didn't allow a large enough row on this slide; I think we just cropped this thing too aggressively. Anyway.
Okay, so now I'm going to go through the kinds of things that it supports; this is just the current feature set, and obviously we're intending to add a lot more. You can build a standard context-dependent LVCSR system, you know, with tree clustering. And it's been written in such a way that it supports arbitrary context sizes, so you can go to quinphones or whatever, and it will work, without pain. The training code is FST-based; our code compiles against OpenFst. For those of you who don't know, OpenFst is kind of like the AT&T tools, except it's open source; it's a project from Google and some others. Currently we only have maximum likelihood training; we haven't yet done lattice generation, but the timeline for adding discriminative training and lattice generation is this summer, more or less. We support all kinds of linear and affine transforms you can imagine. Not all of these necessarily involve the regression-tree version, where you have multiple regression classes; that's just because we are trying to avoid very complicated frameworks that would make the toolkit difficult to use, so a lot of these just support a single transform. All of these things also have example scripts, so it's not just something that's in the code, that we know works; it's something that you can also get to.
Now, comparing with other toolkits: I did want to add a little disclaimer here, that we're not claiming other toolkits don't have any of these advantages too. But we're aiming for clean code and modular design. And by modular we probably mean something a little bit stronger than you would normally imagine: it's written in such a way that it's not only easy to combine the various things that are in there, but it's easy to kind of extend it arbitrarily. And we have avoided the kind of code where, when you add something, a bunch of other bits of code have to know about what you added, and then you have to modify all kinds of other things. The license is a big advantage, I think; not a lot of toolkits have such a completely free license. It's not that we really anticipate this being used for commercial purposes; but our understanding is that a lot of research groups, as a matter of principle, won't use stuff that has a noncommercial license, because they say, what if this research can be commercialized down the line?
Now, beyond the license: we have example scripts, which serve as a kind of standing documentation. And then there's this whole community-building thing. The people involved in Kaldi currently: it's a group of people mostly involved in the previous two workshops, so myself, a bunch of guys from BUT, and a few others. But we're open to new participants. And what we're hoping for, mainly, is not just people who come to contribute a line or two of code, but people who really want to understand the whole thing and can contribute a significant amount.
The toolkit is especially good for stuff that involves a lot of linear algebra; it has a very good matrix library, which Ondřej is going to talk about. So it's a good fit if you want to do stuff that involves a lot of matrix and vector operations. Also, of course, we compile against the OpenFst library, so you can do FST stuff with the code. And it's built in a scalable way. Now, it doesn't explicitly interact with any parallelization layer: it doesn't interact with queueing systems, or MPI, I think, because we felt that would just lock it into particular kinds of systems. But it's been written in such a way that it should still work efficiently when everything is very large scale and you have a lot of data.
Our intention is to add all of the state-of-the-art methods for LVCSR: things like discriminative training, all of the standard adaptation techniques; but I think I say that on the next slide. Something that we're not planning to do in the immediate future is things like online decoding. What I mean by that is settings where the data is coming in, say, from a microphone or a telephone, in some kind of interactive application. It's not that you couldn't use it to do that; building such a decoder isn't that hard in this framework. But our basic target audience is speech recognition researchers who want to work on speech recognition itself, rather than those who... [interruption as someone enters the room] ...that's all right. Okay.
So, as some of you will have noticed, it's become popular recently to make a kind of Python wrapper for C++ code, the idea being that you can more easily write your scripts. However, we've avoided that approach: partly because it's a hassle to do the wrapping, and nobody ever understands how the wrappers work; partly because it just forces people to learn a new language; and partly because we prefer the shell, which everyone knows. So we support that kind of flexibility and configurability in different ways. I think it'll become clear later, so perhaps we'll leave that to the questions.
So: we don't have Baum-Welch training, and there are no immediate plans to add it. I think some people like forward-backward for kind of religious reasons, but I don't believe anyone has demonstrated that Viterbi is worse, and it's just so convenient to use Viterbi, because you can write the alignments to disk compactly. [Audience comment: even so, keeping just a single hypothesis...] Okay, we'll have to think about that. I mean, it's not like it's really hard to do; it just wasn't something that we had planned.
[Audience question: at what level are the alignments stored?] Well, it's at the state level; but I should be a little bit more precise, because it's not exactly the state, or the pdf index. If you just write out the state sequence, that's fine for model training; but then if you want to work out the phone sequence, depending on how the tree works, it might not be implied by the state sequence. So we have these identifiers that also encode the phone and the transition. An alignment is a list of integers, but those integers are not quite the states; they're something that can be mapped to the state, and also to the phone.
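The idea of those identifiers can be illustrated with a minimal sketch. This is hypothetical code, not Kaldi's actual classes: each alignment entry is an integer id that indexes a table of (phone, HMM-state, pdf) tuples, so both the pdf sequence and the phone sequence can be recovered even when pdfs are shared across phones.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch: one table entry per (phone, HMM-state, pdf) combination.
struct TransitionEntry {
  int phone;      // which phone this state belongs to
  int hmm_state;  // state index within the phone's HMM topology
  int pdf_id;     // index of the (possibly shared) pdf after tree clustering
};

class TransitionTable {
 public:
  // Registering a combination returns the integer id used in alignments.
  int Add(int phone, int hmm_state, int pdf_id) {
    entries_.push_back({phone, hmm_state, pdf_id});
    return static_cast<int>(entries_.size()) - 1;
  }
  int PhoneOf(int id) const { return entries_[id].phone; }
  int PdfOf(int id) const { return entries_[id].pdf_id; }

 private:
  std::vector<TransitionEntry> entries_;
};
```

Two different phones can map to the same pdf, so a pdf sequence alone would be ambiguous; the richer ids are not.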
So, I'm just going to describe how this came to be. We had this workshop in 2009 where a lot of the focus was on SGMMs. We were working with some guys from Brno University of Technology, including Ondřej, Lukáš, and others. They had built an infrastructure for training SGMMs; it was written in C++, but it relied on an HTK system. And we had also built OpenFst-based decoding code, so that we could decode with our own C++ code, with access to the matrix library. We were kind of calling that 'proto-Kaldi', and we wanted to release that recipe, you know, in some kind of open-source way. But we realized that the recipe was just too hard to encapsulate, because it had HTK, it had our stuff, and a lot of scripts. So we wanted to create something that could support this stuff and was easy to encapsulate, and the next summer we wrote an entirely new toolkit. We wanted everything to be clean and unified, and to have a nice, shiny C++ speech recognizer. I think that's on a later slide somewhere.
In 2010 we had another workshop, in Brno, where we did a lot of coding. The vision at that time, which I now realize was very unrealistic, was that we'd have a complete working system, with example scripts, by the end of the summer. That kind of didn't really materialize: we had a lot of pieces, but we didn't really have a complete working system. So after that, I felt kind of obligated to, you know, finish the system, and we had help from others, and did a lot of coding after that. So, going to the next slide: it's only been officially released since something like last week; that's when we actually got all the legal approvals and put it up on SourceForge.
This is just a list of the people; I don't think I'm going to go through all the names. This is the list of all the people who have written code specifically for Kaldi, and that's the list of the people who have done various other things, or helped out in various ways. I won't describe exactly what each one did, because I'm kind of scared I've left someone off one of these lists, so I'll just let you read it. A lot of these people have some connection to Brno University of Technology, or were people at the workshops, or things like that.
So, this is a rather messy diagram. I just wanted to give you some idea of what the dependency structure of Kaldi is, but I decided to put size information into it too, so the area of each of these rectangles is roughly proportional to how many lines of code there are. The things at the bottom are the things that we compile against: OpenFst is a C++ library, and the other box refers to the math libraries that we compile against. The rough dependency structure, with things on top depending on things below them, is very approximate. So, for instance, there are the various FST algorithms that we've extended OpenFst with; stuff relating to tree clustering, for the decision trees; stuff relating to HMM topology; the decoders; and the language modeling piece, which is a small box because really all it does is compile an ARPA language model into an FST. The 'util' directory is mostly I/O stuff: various frameworks for I/O that will be explained later, after the break, so that we can allow people to escape.
This is the matrix library, and a lot of it is just wrappers for the stuff down here. I don't know if any of you are familiar with CLAPACK and BLAS and those things, but they're C libraries that are slightly painful for a C++ programmer to work with, because they have all of these arguments, like the rows, the columns, the stride, and the thing you want to do becomes this very long line of code; there's no notion of a matrix as an object. So this layer adds that abstraction, and it is significantly easier to use than the raw libraries.
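To illustrate the kind of abstraction meant here, this is a minimal sketch (not the real Kaldi matrix library): the raw pointer, rows, columns, and stride that BLAS-style C APIs expect are wrapped in a class, so callers never pass dimensions by hand.

```cpp
#include <cassert>
#include <vector>

// Minimal sketch of a matrix object wrapping raw row-major storage.
class Matrix {
 public:
  Matrix(int rows, int cols) : rows_(rows), cols_(cols), data_(rows * cols, 0.0) {}
  int NumRows() const { return rows_; }
  int NumCols() const { return cols_; }
  double &operator()(int r, int c) { return data_[r * cols_ + c]; }
  double operator()(int r, int c) const { return data_[r * cols_ + c]; }

  // In a real library this one call would forward to an optimized BLAS
  // routine such as dgemm; here it is naive loops, for self-containedness.
  static Matrix Multiply(const Matrix &a, const Matrix &b) {
    assert(a.cols_ == b.rows_);
    Matrix out(a.rows_, b.cols_);
    for (int i = 0; i < a.rows_; i++)
      for (int k = 0; k < a.cols_; k++)
        for (int j = 0; j < b.cols_; j++)
          out(i, j) += a(i, k) * b(k, j);
    return out;
  }

 private:
  int rows_, cols_;
  std::vector<double> data_;  // row-major; stride == cols_ in this sketch
};
```

The point is only the interface: a multiply is one short line of user code instead of a long argument list of dimensions and pointers.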
This is the feature code: preprocessing, you know, going from a WAV file to MFCCs; that's there. Gaussian mixture models, diagonal and full covariance. Subspace Gaussian mixture models, which are the subject of my other talk. Linear transforms: things like fMLLR, MLLT (or STC), HLDA, things of that nature. VTLN is in here too, in the kind of linear form of VTLN. All of these things up here are, you know, directories that contain command-line programs, and that tells you a bit about the structure of the toolkit, which is that we have well more than a hundred command-line programs, and each one does a fairly specific thing. We wanted to avoid the phenomenon where you have a program that allegedly does one thing, but really it's controlled by an option and has rather complicated behavior depending on which options you give it. So this is part of the mechanism that we use to ensure that everything is configurable and easy to understand.
There's no Python layer, but a lot of the programs are like simple functions. And on top of all of this sit the shell scripts. So, to do an actual system-building recipe: what our example scripts currently do is, it's a bash script, and it has a bunch of variables in bash to keep track of iterations and things, and it runs the jobs by invoking the programs from the command line. There are different ways you could do this; if you love Perl or Python or whatever, that's fine too. But that's how our scripts work. And something that I haven't really included on this diagram, but that's kind of part of the dependency structure, is some tools that we rely on.
For language modeling, I believe we use IRSTLM, just because of license issues, but probably you'd want to use SRILM if you want to do a lot of language modeling stuff. There are things like sph2pipe, to extract the data from the LDC sphere files, and so on. We actually have an installation script that will automatically obtain those things, so that the scripts can run without you having to manually install stuff on your system.
So, I'm just going to briefly summarize the matrix library; Ondřej will be talking more about it later. The plan was to allow people to escape after this initial segment, in case they're not so devoted that they want to hear about this stuff. As I said, it's a C++ wrapper for BLAS and CLAPACK, and, well, Ondřej, I should say, has really gone to a lot of trouble to ensure that it can compile in the various different configurations of whatever math libraries you have on your system. So it can work either from BLAS plus CLAPACK, or from ATLAS, or using Intel's MKL; the reason is that on some systems you might have one but not the other. ATLAS is an implementation of BLAS that's kind of optimized to your specific hardware, automatically; it's generally a lot faster.
The code that we've wrapped includes generic matrices, like rectangular matrices; also packed symmetric matrices, where you have a symmetric matrix and only store the lower triangle, laid out in row order; and packed triangular matrices. There are other formats that BLAS and CLAPACK support, but these are the ones that we felt were most applicable to speech processing; for instance, we don't use a lot of sparse matrices, or tridiagonal ones.
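The packed layout described above can be sketched in a few lines (hypothetical code, not Kaldi's actual class): only the lower triangle is stored, row by row, so an n x n symmetric matrix needs n*(n+1)/2 values instead of n*n.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Sketch of packed symmetric storage: lower triangle, row-major.
class PackedSymmetricMatrix {
 public:
  explicit PackedSymmetricMatrix(int n) : n_(n), data_(n * (n + 1) / 2, 0.0) {}
  double Get(int r, int c) const {
    if (r < c) std::swap(r, c);         // symmetry: (r,c) == (c,r)
    return data_[r * (r + 1) / 2 + c];  // index into packed lower triangle
  }
  void Set(int r, int c, double v) {
    if (r < c) std::swap(r, c);
    data_[r * (r + 1) / 2 + c] = v;
  }
  std::size_t StorageSize() const { return data_.size(); }

 private:
  int n_;
  std::vector<double> data_;
};
```

This nearly halves the memory for the covariance-style matrices that come up constantly in speech processing.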
The matrix library also includes things like SVD and FFT. The FFT isn't supplied by any of those libraries, but we got permission from Rico Malvar of Microsoft to use his code; he has a good one. And something about the matrix library: even if you don't buy into the whole toolkit, if you need a C++ matrix library, this one is probably quite good. In fact, it's surprising that there doesn't seem to be a lot out there that fills this niche. There's uBLAS, but that's a rather weird library, and I don't think a lot of people like it.
Okay, a few words about OpenFst. I assume everyone knows what FSTs are. AT&T had their command-line toolkit, but I don't believe they ever released the source. So when some of those guys went to Google, they decided to make one that was fully open source, and it's Apache-licensed. That's partly the reason we made ours Apache-licensed too: we figured that, since we use OpenFst, there's no real point in having a different license, because it would just give the lawyers a headache, so we went for the same one. So, yeah, we compile against it; the main thing that uses it is the decoder. It doesn't use some special decoding-graph format; it uses the same in-memory structures as OpenFst. And by the way, OpenFst uses a lot of templates and such, so there isn't just one FST type; there are a lot of them. So, if you wanted to, you could kind of template your decoder on some fancy format that would be, let's say, compact, or dynamically expanded, or something like that.
we're not gonna go into that in detail today
We actually implemented various extensions to OpenFst. Some of our recipes are perhaps not totally in the spirit of OpenFst, because those guys have a particular recipe that they follow, and ours is just a little bit different. Later on I can explain why; I feel that there are good reasons for it, though I don't know if those guys would agree.
A few words about I/O. It was a somewhat controversial decision among the group to use C++ streams; in the end we decided to do it, partly because OpenFst also does, although, you know, a lot of people prefer C-based I/O. We support binary- and text-mode formats, a little bit like HTK, so that each object in the toolkit has a Write function that takes a boolean argument, 'binary', and will just put its data out to the stream in binary or text mode; and each object also has a Read function that works the same way.
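As a minimal sketch of that convention (hypothetical class, not Kaldi's actual code): every object writes itself to a stream in binary or text mode depending on a boolean argument, and can read either form back.

```cpp
#include <cassert>
#include <sstream>
#include <vector>

// Hypothetical example object following the Write/Read convention.
struct ExampleVector {
  std::vector<float> data;

  void Write(std::ostream &os, bool binary) const {
    int size = static_cast<int>(data.size());
    if (binary) {
      os.write(reinterpret_cast<const char *>(&size), sizeof(size));
      os.write(reinterpret_cast<const char *>(data.data()),
               size * sizeof(float));
    } else {
      os << size;
      for (float f : data) os << ' ' << f;
      os << '\n';
    }
  }
  void Read(std::istream &is, bool binary) {
    int size = 0;
    if (binary) {
      is.read(reinterpret_cast<char *>(&size), sizeof(size));
      data.resize(size);
      is.read(reinterpret_cast<char *>(data.data()), size * sizeof(float));
    } else {
      is >> size;
      data.resize(size);
      for (float &f : data) is >> f;
    }
  }
};
```

The same pair of functions round-trips the object through either mode, which keeps every tool agnostic about how its inputs were produced.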
There's also the standard thing that many toolkits have, where '-' on the command line can mean the standard input or the standard output; that's how a program knows it's part of a pipe. And there's a notation consisting of a filename with an offset, meaning it will open the file and seek to that position; this is useful for reasons that will be described later.
So, this archive format is a quite fundamental part of the way Kaldi works, and I'm going to describe it more later, in another talk; but the basic concept is this. You have a collection of objects; let's imagine that they're matrices. And they are indexed by a string, where the string might be, let's say, an utterance ID. So you want to have some way to access this collection of strings and matrices, and there are a couple of different ways you might want to do that: you might want to go sequentially through it, as in an accumulation of statistics, or you might want to do random access. So there's a whole framework for doing this. Basically, the reason is so that most of the Kaldi code doesn't have to worry about things like opening files and error conditions; there doesn't have to be a lot of logic about that in the command-line programs, because it's all handled by one generic framework. But apart from this, we've tried to avoid generic frameworks.
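The sequential-access half of that idea can be sketched in miniature (text-mode only, hypothetical code, with a single string token standing in for each matrix): the archive is "key value key value ...", and a reader walks the (key, object) pairs so callers never touch the stream directly.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Sketch of a sequential archive reader over "key value key value ..." input.
class SequentialReader {
 public:
  explicit SequentialReader(std::istream &is) : is_(is) { Next(); }
  bool Done() const { return done_; }
  const std::string &Key() const { return key_; }
  const std::string &Value() const { return value_; }
  void Next() { done_ = !(is_ >> key_ >> value_); }  // advance one pair

 private:
  std::istream &is_;
  std::string key_, value_;
  bool done_ = false;
};
```

A program accumulating statistics would just loop `for (; !reader.Done(); reader.Next())`, with all file handling hidden behind the reader.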
Now, the tree-building and clustering code. It's based on very generic clustering mechanisms, so the internal code doesn't assume a lot about what your trees look like. It's suitable for building decision trees in different ways, including things like sharing the roots of the trees, and asking questions about the central phone, things like that. And it's very scalable to wide contexts, for example quinphones. It's hard to write this kind of code in a way that scales to quinphones if you have to enumerate all of the contexts, because there are just too many of them; but we basically avoid ever enumerating the contexts.
As an example of how we make use of this generality: in the Wall Street Journal recipe, we increased the phone set so that we're asking questions about the phone's position and its stress. I believe HTK supports this too; I think there was a paper about doing that. But if the phone set were much larger than that, then an approach based on enumerating the contexts would probably start to break down. [Audience comment.] You don't think so? No, I mean, it was on the order of a thousand phones in that case. Right. Okay.
Okay: the HMM and transition-modeling code. We've tried to have an approach where a piece of code only needs to know the minimum it needs to know. So the HMM and transition-modeling code doesn't really have any notion of a pdf; it purely does what it needs to do, and the rest is kept separate. This is probably a pretty standard approach: you specify a prototype topology for each phone, that is, how many states it has and what the transitions are. And we make the transition parameters separate depending on the pdf, so that if the pdfs in two states are different, then the transitions out of those states are separately estimated. This is just the most specific way that you can estimate the transitions without having your decoding graph blow up. It's not at all clear that this matters, but we just felt that we should do the best we could. And there are mechanisms for expanding these HMMs into FSTs, because all of the training and decoding is FST-based, so you kind of have to have an FST representation of these things.
This relates to something I touched on earlier. In an 'H' FST, what you would normally imagine is that the FST has input symbols that are the pdfs, some symbol that represents the pdf, and output symbols that are the words. But the problem with that is: suppose you want to find out the phone sequence. That's all well and good if each phone had a separate tree, so that you could tell, for each state, which phone it belonged to. But what if you had a larger phone set and you wanted to have a shared tree root, and there wasn't a one-to-one mapping, or any mapping you could use? So we have input labels on the FSTs that encode a bit more information. And this is also useful in training the transitions, because sometimes the pdf labels alone wouldn't give you quite enough information to train them.
There are a couple of different ways we create decoding graphs. For training purposes, you have to create a lot of these things at the same time, and combining the FST algorithms using scripts would be quite inefficient, because you have the overhead of process creation. So we call the OpenFst algorithms at the C++ level and combine them together, so that you can create your decoding graphs for training efficiently. And we typically put them in one of these archives: basically a big file where everything is concatenated together, with little keys in it, on disk, so that you don't have the I/O cost of accessing hundreds of little files. Training uses the Viterbi path through these graphs. For test time, we didn't use this C++ approach, because there's just no point; it's basically scripts, and I'm going to go through the scripts later, for those interested. The scripts that create the decoding graph call some OpenFst tools, but also some of our own, and that relates partly to a difference in recipes, which I'll talk more about later, after the break.
So: Arnab is going to talk later about some of the acoustic modeling code; I'm just going to give a brief summary. Our GMM code is very simple; it's not part of some big framework. It's basically just an object that has, you know, the means and the variances; it can evaluate likelihoods if you give it the features. But it doesn't inherit from some generic acoustic-model class, and it doesn't pretend to know about things like linear transforms: it just sits there, and things like transform estimation have to access the model and do what they want with it. The reason for that is that if the GMM knows too much, then whatever fancy thing you do, you then have to change the GMM code, and that's just not a nice situation. We have a separate class for GMM statistics accumulation, and for doing the update. And for a collection of GMMs, like a GMM-based system, we have a class that pretty much behaves like a vector of GMMs. So it's a fairly simple thing: there's no notion of the name of a state, it's just an integer; and generally we've avoided having names for things in the code.
[Audience question about 'vector'.] Oh, this lower-case 'vector' just refers to the STL vector; there is an upper-case Vector too, but that's something in the matrix library. [Question about portability.] That's never been an issue with this code; it compiles even on Windows. Okay.
We've got quite a lot of linear-transform code: LDA, HLDA; and MLLT, or STC, and again I'm sitting on the fence with regard to the naming of that technique, because I don't want to offend anyone. Another of these with multiple names is the linear version of VTLN. I mean, we tried regular VTLN, and everyone knows it's kind of tricky to get it to work; it was the linear one that worked better in the end. And there's something new that's kind of a replacement for VTLN, that works a little bit better; I'm going to explain what it is at a later date. And MLLR, and a lot of this kind of thing.
So, when these transforms are global, the way we handle them is that the transform just becomes part of the feature space: it's just stored as a matrix on disk. And we use a lot of pipes, so the way it actually works is that this matrix is multiplied by the features as part of a pipe. It may seem, and all right, obviously it is, a silly way to do it from a computational point of view, but it just makes the scripts really convenient. So when I say the transforms are applied in a unified way, what I mean is that the code that estimates any of these transforms really just outputs a matrix. There's no, like, 'MLLR transform object'. Well, okay, there is one for the regression-tree case; but for the global case, it's just a matrix. I mean, this was a point of contention among us, whether to do it this way, but some of us felt that it was important to keep the simple cases simple, and to avoid having a framework for the cases where one wasn't necessary.
Okay, decoders. All of the decoders that we currently have use fully expanded FSTs; by 'fully expanded' I mean it's down to the HMM-state level, with self-loops represented as actual FST arcs. Now, there are a lot of ways to do this, and initially one of the thoughts we had was that we wouldn't have the self-loops, or we might not even have explicit representations of the states; but it was just so much simpler to do it this way, and this is what we have now. We have three decoders; by 'decoder' we mean the C++ code that does decoding, which is not necessarily the same thing as a command-line decoding program. The three decoders are on a spectrum from simple to fast, and the reason for this is that once you have a complicated, fast decoder, it's almost impossible to debug. So if something goes wrong, you can always just run the simple one, and you can find out whether it's a decoder issue.
We wanted to make it so that the decoder doesn't assume too much about what your model is. So, again, the decoder has no idea of GMMs or HMMs; it doesn't even know about features. All the decoder knows about is: give me the likelihood, or score, really, for this frame index and this pdf index. So the interface that the decoder sees is almost like a matrix, a matrix of floats; but it's not represented that way, because you want to be able to compute the values on demand. This is the 'decodable' interface: a very simple interface that says, give me the likelihood for this time and this index; how many time frames are there; and how many pdf indices are there. That's almost all the interface is, and it's the interface the decoder requires. So the idea is: you implement, you know, your fantastic new model, and no matter what the interface of that model is, you create a small object that satisfies the decodable interface and knows how to get the likelihoods from your fantastic model, and then you instantiate the decoder with that, or you pass it in.
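A minimal sketch of that idea (hypothetical interface and names, not the exact Kaldi declarations): the decoder only ever asks for a score by (frame, pdf index), so any model can be plugged in behind the interface.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Sketch of a "decodable" interface: scores indexed by frame and pdf.
class DecodableInterface {
 public:
  virtual ~DecodableInterface() {}
  virtual float LogLikelihood(int frame, int pdf_id) const = 0;
  virtual int NumFrames() const = 0;
  virtual int NumPdfs() const = 0;
};

// Trivial implementation backed by a precomputed matrix of scores; a real
// model would compute likelihoods on demand instead of storing them all.
class MatrixDecodable : public DecodableInterface {
 public:
  explicit MatrixDecodable(std::vector<std::vector<float>> scores)
      : scores_(std::move(scores)) {}
  float LogLikelihood(int frame, int pdf_id) const override {
    return scores_[frame][pdf_id];
  }
  int NumFrames() const override { return static_cast<int>(scores_.size()); }
  int NumPdfs() const override {
    return scores_.empty() ? 0 : static_cast<int>(scores_[0].size());
  }

 private:
  std::vector<std::vector<float>> scores_;
};

// Stand-in for a decoder: picks the best pdf per frame through the
// interface, knowing nothing about GMMs, HMMs, or features.
std::vector<int> BestPath(const DecodableInterface &d) {
  std::vector<int> path;
  for (int t = 0; t < d.NumFrames(); t++) {
    int best = 0;
    for (int p = 1; p < d.NumPdfs(); p++)
      if (d.LogLikelihood(t, p) > d.LogLikelihood(t, best)) best = p;
    path.push_back(best);
  }
  return path;
}
```

Swapping in a new acoustic model means writing one small adapter class; the decoder itself never changes.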
So the command-line decoding programs are very simple; we don't have, like, multi-pass decoding or anything, and we don't intend for a single program to support multiple types of model. An example decoding program is 'decode with a GMM system, with no adaptation'; it does the simple thing. Then, if you want to support, let's say, multi-class MLLR or fMLLR, we have a separate command-line program for that. The idea is that people coming into the project might want to be able to understand a given command-line program, and we don't want to make the barrier to entry too high. So we accept the overhead of having to maintain two parallel decoding programs, and keep each given one relatively simple to understand.
We support the standard types of features; our MFCC and PLP features are quite similar to HTK's. We've put in a reasonable range of configurability, but, being realistic about how people really work on this stuff, I think most people doing research on features would be coming up with their own, so we don't support every possible combination of every possible change. We only read WAV format, because our reasoning is that you can always find an external program to convert your audio, and do it as part of a pipe. [Audience question about other formats.] Well, we can read HTK features; beyond that, there's nothing more that we support. I mean, the basic concept is to have people use the system as a complete system, because once you start supporting model conversion and such, it just gets awkward. But yes, we read HTK features as a special case.
we typically will right features another large objects to a single very large file of relates to this archive format
so the form of the file as a key space then your object
and another key a space that object
and uh
we have efficient mechanisms to read such files
the two normal cases are, firstly, sequential access
where we want to iterate over the things in an archive
and secondly random access, and there are different ways to do that; one is
you can write a separate file that has little
pointers into the archive file
another is that
you can kind of simulate random access, even though you're really going sequentially
if you know that the keys are sorted
uh, and another way, if the file isn't that big, is
you can do random access by just having the code go through the whole file
and store the objects in memory
that's not really scalable, but
for a lot of
object types it really doesn't matter
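to make that archive idea concrete, here is a toy sketch in Python (not the actual Kaldi I/O code; the names and text encoding are my own) of an archive of key/object records, with an index of byte offsets playing the role of the separate pointer file:

```python
import io

def write_archive(pairs):
    """Write (key, object) records; return (archive_bytes, index) where
    index maps each key to its byte offset, like the separate pointer file."""
    buf = io.BytesIO()
    index = {}
    for key, obj in pairs:
        index[key] = buf.tell()                 # remember where this record starts
        buf.write(f"{key} {obj}\n".encode())    # record = key, space, object
    return buf.getvalue(), index

def read_sequential(data):
    """First access pattern: iterate over every record in order."""
    for line in data.decode().splitlines():
        key, obj = line.split(" ", 1)
        yield key, obj

def read_random(data, index, key):
    """Second pattern: seek straight to one record via the index."""
    buf = io.BytesIO(data)
    buf.seek(index[key])
    line = buf.readline().decode()
    return line.split(" ", 1)[1].rstrip("\n")

data, idx = write_archive([("utt1", "feats-A"), ("utt2", "feats-B")])
```

the third pattern from the talk, loading the whole file into a dict, is just `dict(read_sequential(data))`: simple, but only sensible when everything fits in memory.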
oh yeah, so the
feature-level processing, like adding deltas and so on
typically each of those is a separate program, so you have a sequence of programs that you apply
and again that's a bit inefficient, but
it's not like it's really consuming more than ten percent of your CPU, so
you just don't care that much; this has been written with
ease of use in mind
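as a sketch of that pipeline idea (toy code, not the real Kaldi programs): each stage is a small filter over the feature matrix, and chaining function calls stands in for piping one program into the next:

```python
def add_deltas(feats):
    """Toy delta stage: append the difference from the previous frame."""
    out = []
    prev = [0.0] * len(feats[0])
    for frame in feats:
        delta = [a - b for a, b in zip(frame, prev)]
        out.append(frame + delta)       # original dims plus delta dims
        prev = frame
    return out

def mean_normalize(feats):
    """Toy mean-normalization stage: subtract the per-dimension mean."""
    dim = len(feats[0])
    mean = [sum(f[d] for f in feats) / len(feats) for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in feats]

# chaining the stages mimics "program1 | program2" in the shell
feats = [[1.0], [3.0]]
processed = add_deltas(mean_normalize(feats))
```

each stage is independent, so you can reorder or drop stages by editing the chain, which is the point of the one-small-program-per-step design.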
uh
like I said, there are a lot of command-line tools; this is an example of
a command line, and these backslashes are for
the shell
so uh
this is one of the many programs
the PLP one would be a separate command line
this is just, you know
an option
there are two command-line arguments in this, uh
I'm going to explain later on what these mean; this is
directing it to write these things to
an archive, in the
key, object, key, object format
and then
this is the input
where to read from
and then this is telling it to write an archive, and also
an scp file that
kind of has little pointers into the archive
so that you can efficiently access the features by random access
um
so
yeah, another feature of this is that there's only one option here; we have
no more than a few options on any given command
I mean, each individual program supports
less functionality
it's not very configurable; it's more driven by how you combine these programs
oh, something else about this whole archive formalism is that
the C++ level code in the individual command-line tools
doesn't have to worry too much about I/O
you can just treat
the, uh
when you get something like this, there are
very short
statements in the C++ that will iterate over the
stuff
so it doesn't have to
think too much about the error conditions
but yep
FST generation
okay, that's in another part of the talk later on
well
for training
there's a command-line program that will
do the FST generation for you and generate lots of FSTs, one for each
file
yeah, so for testing
it's a script that calls the OpenFst programs, and our versions of the OpenFst tools
so
I'm going to go through that script later, in another part of the
talk
now, I realize this slide is not obvious; you're not going to understand the script
but this is just to give people some idea
oh, of uh
of how we do training
so, you know, this is a bash script; it's doing a loop over the iterations
uh, and this one is estimating MLLT
I suppose this script will be a bit hard to follow, sorry
but this is what we have at the moment
so
at each of the iterations where we do MLLT
then, uh
we have on disk
some alignments; this is a state-level alignment
it's in a Kaldi archive
of the format that I mentioned
so this converts them to posteriors
in a very trivial way, by saying that each
aligned state has a posterior of one
this takes the, this
this gives a zero weight to the silence, and
that would be a
this would be a bash variable
uh
yeah, so this takes away the
posteriors of the silence frames
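a toy version of those two steps (hypothetical state numbering; the real programs work on Kaldi archives): convert a state-level alignment to trivial posteriors of one, then zero out the silence states:

```python
SILENCE_STATES = {0}  # hypothetical: which state ids count as silence

def ali_to_post(alignment):
    """Each aligned state gets a trivial posterior of 1.0 for its frame."""
    return [[(state, 1.0)] for state in alignment]

def weight_silence_post(post, weight=0.0):
    """Scale the posterior of silence states; weight 0.0 removes their
    contribution from any downstream accumulation."""
    return [[(s, p * weight if s in SILENCE_STATES else p)
             for s, p in frame] for frame in post]

post = weight_silence_post(ali_to_post([0, 4, 4, 7]))
```

the output of such a stage would then be piped into the accumulation program, which is why keeping each step as its own small filter works so naturally here.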
and this is an accumulation program
that, uh
this would be the model; this is the list of the features, as a bash variable that would be
defined elsewhere
uh, this
a, hmm
I think that refers to the standard input
meaning it's reading an archive from the standard input
and, uh
this one
means it's writing an archive to the standard output
so
yeah, the output of these programs is passed via a pipe
uh
all of the error and logging output goes to the standard error
because we've kind of used up the standard output for this type of stuff
so
we just redirect the logging output
there
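that convention is easy to show in miniature; assuming a toy program in the pipeline, data goes to stdout and diagnostics to stderr, so the two streams never mix:

```python
import sys

def emit(key, obj):
    """Write one archive record for the next program in the pipe."""
    sys.stdout.write(f"{key} {obj}\n")        # data: stdout only
    # logging goes to stderr, because stdout carries the archive
    sys.stderr.write(f"LOG: wrote {key}\n")

emit("utt1", "feats")
```

with this split, `prog1 | prog2 2>log` keeps the archive stream clean while the log still ends up somewhere useful.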
so then this is a separate program that does the MLLT estimation
it takes in, uh
let me see
it's computing some kind of matrix
and then, uh
because with MLLT, yeah
when you apply the transform
you have to change the means of your model, so
we have a separate, we like to keep everything separate
so, you know, transforming the means is a separate operation, and we have a separate program for that
and then
we have to compose the MLLT transform with the previous one
so this is another little program that does that
so this here was setting another bash variable, to make
the features correspond now to the
new MLLT features
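those two model-update steps can be sketched as plain matrix operations (toy Python, not the Kaldi matrix library): composing the new transform with the previous one is a matrix product, and updating the model multiplies every mean by the new transform:

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transform_means(transform, means):
    """Apply the estimated transform to every Gaussian mean."""
    return [[sum(transform[i][k] * m[k] for k in range(len(m)))
             for i in range(len(transform))] for m in means]

mllt = [[0.0, 1.0], [1.0, 0.0]]      # newly estimated transform (a swap, for illustration)
previous = [[2.0, 0.0], [0.0, 3.0]]  # e.g. an earlier transform of the features
composed = matmul(mllt, previous)    # what now gets applied to the raw features
new_means = transform_means(mllt, [[1.0, 2.0]])
```

keeping "transform the means" and "compose the transforms" as separate little operations is exactly the one-program-per-step philosophy the talk describes.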
so
as you can see, this is a variable in bash
and
it would be passed as a command-line argument to one of the programs
and it's a command involving a pipe, which actually involves
calling two separate Kaldi
programs
each with their own arguments
so hopefully you can guess from the names of those programs what they're doing
and then of course, uh, this one says "feats sub"
oh yeah, I think we were estimating the MLLT on a subset of the features
so this is the same as this, but it's
using less of
the data
so I think I
I spoke about these issues before
oh yeah, so uh
we have example scripts for Resource Management and Wall Street Journal, and these run from the LDC-distributed
data
uh, now, we found in the literature just some, uh
some baselines
these numbers are just for the basic context-dependent system
with, I think, mean normalization
we have of course more advanced things, but
because we had to find the same thing in the literature
we're just giving you the unadapted numbers
so it's
slightly better than this number, from a paper from around two thousand
and the HTK paper from ninety-four
has a comparable number, but that was a gender-dependent system
so, uh
I think basically we're doing the same as
you'd expect, given the same algorithms
I mean
uh
I was hoping, you know, at the start of this project that the results would be better, uh
for issues relating to the trees and context-dependent phones and stuff, but
you know, in the end we get the same
so
it's working; there are no major bugs
uh
okay next slide
uh, just a note on speed: the decoding here
used bigram numbers, because the baselines were all bigram numbers
we can't yet decode with the full
with the full, uh, trigram language model that's
distributed with the Wall Street Journal corpus
because the FSTs, uh
get too large
we could of course do it with a pruned trigram
but that's why we're quoting the bigram numbers
so
hopefully by the summer we're going to
there are a couple of things we can do, that we're both working on; one is to have a decoder that
does some kind of on-the-fly
composition, so that we can
decode directly with that
and the other is to have lattice generation, so we can rescore
the decoding speed for these Wall Street Journal numbers is about twice as fast as real time
and of course that's on a good machine
so, I mean, this is tuned so that you don't get more than zero point one degradation
versus exact Viterbi
uh, the Wall Street Journal script takes a few hours on a single machine
we parallelize onto three CPUs
this is just an example script; we didn't want to include things like qsub in the example scripts
because then it wouldn't run on everyone's machine
but of course it would be faster if you were doing it in parallel
yeah
uh-huh
if I remember, it was, well
well
about ten gig
I mean
as, you know, everyone knows, FST compilation tends to blow up a bit
it's not like
with that size of model you can just about compile it
I don't recall whether that's the trigram one for Wall Street Journal
I
and then how many words, but I think
I don't think our stuff is any worse than, you know, a normal FST
setup that fully expands everything
oh yeah, okay, Resource Management results
these are
the HTK results
er, taken, uh
from, uh
this is, I think, basically the HTK RM recipe, but these results were
taken from a paper of mine from ninety-nine or something
because
I just couldn't find all of the test results in the README file from HTK
and the average, as you can see, the average is the same
so
with the same algorithms we're getting the same results as HTK
okay
uh
yeah, and the decoding on this setup is about zero point one times real time
yeah
yeah, the test sets are quite
oh yeah
it's a very small test set, a handful of words that are, uh
uh, this page is mainly
just to give you some idea of the kinds of things that are in our example scripts; we have a
bunch of
different configurations of, of the standard configuration
well, this is the standard configuration, because this is what's in the HTK baseline
uh
um
adding MLLT doesn't seem to help
sorry, adding STC
see
doesn't seem to help
I
uh, it's, well, I think nine frames plus LDA actually makes it worse, but then
when you do
uh, STC on top of that
it gets better than here, and so this was the kind of, this was the IBM
recipe
so
sorry, this was the IBM recipe, so I guess there must have been some interaction between these
two parts of the recipe
that somehow made it work
I don't know if it generalizes to other test sets
we're going to find out
uh, that's splice nine frames plus HLDA
triple deltas plus HLDA
triple deltas plus
LDA plus MLLT
this is
quite good
uh, SGMM systems; these are all unadapted
I have a separate slide for the adapted experiments
um
this is doing it
per utterance
unless it's stated otherwise; oh yeah, okay, so this is per-utterance adaptation, this is per-speaker
so
this was four point five, same as, uh, before
adaptation; so it really doesn't help if you do it per utterance, and that's because there are too many
parameters
to estimate
this is doing the same thing per speaker, which helps a lot, uh
the exponential transform
again, I'm not going to describe what it is; it's something VTLN-like
uh, and it gets quite a bit better, uh
this is, this VTLN is a kind of linear version of VTLN, I believe
it does seem to improve things quite a lot
and the improvement is more pronounced at the per-utterance level, because
uh
you know, it's just like a constrained form of fMLLR, so the only point of it is
to do it
when you have less data
uh
splice nine frames plus LDA plus exponential transform
fMLLR, uh, fMLLR plus MLLR
we only did some of these per speaker, because it wouldn't help otherwise
uh
as you can see, there are a lot of different combinations; this is SGMM including the
speaker offsets, the N terms, if you remember
and it does help, so
so, uh, I think Rick was saying that it wasn't working for him, but it seems to be working
for us
three point one five goes to, uh
where is it, two point six eight
I must have, uh, forgotten to fill in this line
it's SGMM plus fMLLR
but no speaker vectors
yeah
per speaker
yeah
I think I have these numbers, but I must not have put them in; I think the best number was
like two point four
or two point three
uh
so, a general plug for Kaldi
uh
I believe it's easy to use; I mean, I hope the scripts didn't scare you guys off; the
attraction is that once you understand them
everything becomes quite simple
but
it does kind of assume that you understand how speech recognition works; if someone who doesn't
just randomly
modifies the scripts, you know, changing configurations and
the like
you're, uh
it's not going to work
it doesn't, like
it doesn't automagically know that the features you have are not compatible with your model
so you do have to know what you're doing from a speech science point of view
but
it's quite, uh
it's easy to use at the C++ level, too
for a
software engineer
uh
it's easy to extend and modify
and you can release your code changes or give them back to
the Kaldi group
uh
we're open to including other people's
stuff
so that maybe gives you some motivation
so this
is really
the end of this first part, so
you can get up and have a drink, and after a few minutes
well
yeah, it has documentation: kaldi dot sourceforge dot net
uh, okay, it's not as good as HTK's, and probably, being realistic, never will be
what we'll do
is we'll, uh, take advantage of what HTK has, and point people to the HTK documentation
so they can read about it there
yeah, I know, I mean
I see
but okay
we can have a short break, we can have a drink
and just disappear if you're not, uh
that committed to it
and then
uh, we're going to talk about
the rest