
I'm thinking that those prophecies have been just slightly misinterpreted, and the event that they were referring to is this: a wonderful speech toolkit.

I don't think anyone should underestimate the significance of this. OK.

So first, just a word about the name: it's some kind of coffee reference, hence the little coffee bean in the logo. It's just whatever name we settled on.

The structure of this whole presentation is: first I'm going to talk for about fifteen or twenty minutes, just giving you a kind of view from all sides of this toolkit, and then we're going to allow people to escape, in case they don't want any more detail than that, with a short break.

And then Ondřej is going to talk about some more low-level stuff, Arnab is going to talk about some of the acoustic modeling code, and we'll talk about the matrix library, which is kind of independently useful outside of speech. After that I'm going to go through some example scripts that we have, to try to give people a sense of how to use it.

Now, on to the next slide.

So, some important aspects of the project. It's licensed under Apache 2.0, which is a permissive license that basically allows you to do anything you want with it. There is only an acknowledgement clause, which says you have to acknowledge that the code came from the project, but that's it; it's one of the most open of the standard licenses.

The project is currently hosted on SourceForge, which is the standard place for these kinds of open-source projects. We don't want it to be too closely associated with any particular institution; our intention is for it to be more of a thing that lives in the clouds, out on SourceForge — and I shouldn't have used that word, that's just gratuitous. But it's important for it not to just be the pet project of some particular little group; it's meant to represent the best of what's out there, and anyone can be a participant, as long as they can contribute code under this license — then that's great.

It's basically a C++ toolkit. The code compiles on native Windows and on the common Unix platforms; we're not claiming that it compiles on every other platform you could name, but it compiles on the normal ones.

We have some documentation — not as much as HTK — and we have example scripts. The example scripts are currently just for Resource Management and Wall Street Journal, but we're going to have more. They basically run from the LDC-distributed disks, so once you have those, you can just point the scripts at the disks and get an idea of how it works.

Oh no — I now realize that we didn't book a large enough room. I think we just advertised this thing too aggressively.

OK, so now I'm going to go through the kinds of things that it supports — these are just the current features; obviously we're intending to add a lot more.

You can build a standard context-dependent LVCSR system, you know, with tree clustering. And it's been written in such a way that it supports arbitrary context sizes, so you can go to quinphones or whatever and it will work, without pain.

The training code is all FST-based; our code compiles against OpenFst. For those of you who don't know, OpenFst is kind of like the AT&T FSM toolkit, except it's open source; it's a project from Google and some others.

We currently only have maximum likelihood training; we haven't yet done lattice generation. The timeline for adding discriminative training and lattice generation is roughly this summer.

We support all kinds of linear and affine transforms you can imagine. Not all of these necessarily involve the regression-tree version, where you have multiple regression classes; that's just because we're trying to avoid very complicated frameworks that would make the toolkit difficult to use, so a lot of these just support a single global transform.

All of these things also have example scripts, so it's not just something that's in the code that we know works — it's something that you can also get at.

So — I did want to just acknowledge the other toolkits, as a little disclaimer here: we're not claiming that the other toolkits don't have any of these advantages.

But we're aiming for clean code and modular design — and by modular we probably mean something a little bit stronger than you would normally imagine. It's written in such a way that it's not only easy to combine the various things that are in there, but it's easy to extend it arbitrarily. And we have avoided the kind of code where, when you add something, a bunch of other bits of code have to know about what you added, and then you have to modify all kinds of other things.

I think the license is a big advantage; not a lot of toolkits have such a completely free license. It's not that we really anticipate this being used for commercial purposes, but our understanding is that a lot of research groups, as a matter of principle, won't use stuff that has a non-commercial license, because they say this research could never be commercialized later.

And we have example scripts, which serve as a kind of standing documentation.

And there's this whole community-building thing. The people involved in Kaldi currently — it's a group of people mostly involved in the previous two workshops: myself, Arnab, a bunch of guys from BUT, and a few others. But we're open to new participants. What we're hoping for, mainly, is not just people who come to contribute a line or two of code, but people who really want to understand the whole thing and can contribute a significant amount.

Kaldi is especially good for stuff that involves a lot of linear algebra: it has a very good matrix library, which Ondřej is going to talk about. So if you want to do stuff that involves a lot of matrix and vector operations, it's a good choice. Also, of course, we compile against the OpenFst library, so you can do FST stuff using the code.

It's built in a scalable way. It doesn't explicitly interact with any parallelization layer — it doesn't interact with things like MapReduce or, um, MPI — because we felt that would just lock it into particular kinds of systems. But it's been written in such a way that it should still work efficiently when everything is very large scale and you have a lot of data.

Our intention is to add all of the state-of-the-art methods for LVCSR — things like discriminative training and all of the standard adaptation — but I think I say that on the next slide.

Something that we're not doing in the immediate future is things like online decoding — by which I mean the case where the data is coming in, say, from a microphone or a telephone, in some kind of interactive application. You could use it to do that, and building such a decoder isn't that hard in this framework, but our basic target audience is speech recognition researchers who want to work on speech recognition itself, rather than those who want to build applications.

Some people lately — this has become popular recently — make a kind of Python wrapper for C++ code, the idea being that you can more easily write your scripts. However, we've avoided that approach: partly because it's a hassle to do the wrapping, and nobody ever understands how the wrappers work; partly because it just forces people to learn a new language; and partly because, for those who just want to call the tools, everyone knows the shell already. We support that kind of flexibility and configurability in different ways. I think it'll become clear later, so perhaps we'll leave that to the later talks.

We don't have Baum-Welch training, and there are no immediate plans to do it. I think some people like forward-backward for kind of religious reasons, but I don't believe anyone has demonstrated that Viterbi is worse. And it just so happens that we need to use Viterbi, because you can write the alignments to disk compactly, and then —

[A question from the audience.] Really? OK, that's interesting. Even so, it's just a single hypothesis... OK, we'll have to think about that; I mean, it's not like it's really hard to do, but it just wasn't something that we had planned.

[In answer to another question:] Well, it's at the state level — but it's not really the state; I mean the PDF index, but a little bit more precise than that. If you just write out the state sequence, that's fine for model training, but then if you want to work out the phone sequence, depending on how the tree works, it might not be implied by the state sequence. So we have these identifiers that also encode the phone and the transition. It's a list of integers, but those integers are not quite the states; they're something that can be mapped to the state and also to the phone.

So I'm just going to describe how this came to be. We had this workshop in 2009 where a lot of the focus was on SGMMs. The setup we were using there came from some guys from Brno University of Technology, including Ondřej and others; they had built this infrastructure for training SGMMs, which was written in C++ but relied on the HTK system. I had also built an OpenFst-based decoder, so that we could decode from our own C++ code with access to the matrix library.

So we were kind of calling that proto-Kaldi, and we wanted to release that recipe in some kind of open-source way, but we realized the recipe was just too hard to encapsulate, because it had HTK, it had our stuff, and a lot of scripts. So we wanted to create something that could support this stuff and was easy to encapsulate. The next summer we wrote an entirely new toolkit: we wanted everything to be clean and unified, and to have a nice, shiny C++ speech recognizer.

I think that's on the next slide — yes, this one. In 2010 we had another workshop, in Brno, where we did a lot of coding. The vision at that time — which I now realize was very unrealistic — was that we would have a complete working system with example scripts by the end of the summer. That kind of didn't really materialize: we had a lot of the pieces, but we didn't really have a complete working system. So after that I felt kind of obligated to, you know, finish the system, and we had help from others, and did a lot of coding after that.

So — when we go to the next slide — it's only been officially released something like last week; that's when we actually got all the legal approvals and put it up on SourceForge.

This is just a list of the people — I don't think I'm going to go through all the names. That's the list of all the people who have written code specifically for Kaldi, and that's the list of people who have done various other things or helped in various ways. I would describe exactly what each one did, but I'm kind of scared I've left someone off one of these lists, so I'll just let you read it. A lot of these people have some connection to Brno University of Technology, or to the workshops — people like that.

So this is a rather messy diagram. I just wanted to give you some idea of what the dependency structure of Kaldi is, but I decided to put size information in too, so the area of these rectangles is roughly proportional to how many lines of code there are.

These at the bottom are the things that we compile against: OpenFst is the C++ FST library, and the ATLAS/CLAPACK box refers to the math libraries that we compile against. The rough dependency structure is that things on top depend on the things below them, but it's very approximate.

So, for instance, there are various FST algorithms that we've extended OpenFst with; stuff relating to tree clustering for decision trees; stuff relating to HMM topology; decoders; the language modeling thing — this is a small box, because really all it does is compile an ARPA language model into an FST.

The 'util' box is mostly I/O stuff — various frameworks for I/O that will be explained later on, kind of after the break, so that we can allow people to escape.

This is the matrix library. A lot of this is just wrappers for stuff that's down here. I don't know if any of you are familiar with CLAPACK and BLAS and those things, but they're C libraries that are slightly painful for a C++ programmer to work with, because they have all of these arguments like the rows, the columns, the stride, and the thing you want to do becomes this very long line of code; there's no notion of a matrix as an object. So this adds that abstraction, and it is significantly easier to use than those libraries directly.
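To make that concrete, here is a minimal sketch of the kind of difference an object wrapper makes over a raw CBLAS call. This is not Kaldi's actual code; the class and method names are hypothetical (though in the same spirit), and it assumes you link against some CBLAS implementation.

```cpp
#include <cassert>
#include <cblas.h>  // any CBLAS implementation (ATLAS, OpenBLAS, MKL, ...)
#include <vector>

// A toy matrix wrapper, just to illustrate hiding the row/column/stride
// bookkeeping of BLAS behind an object.  Names are hypothetical.
struct SimpleMatrix {
  int rows, cols;
  std::vector<double> data;  // row-major storage
  SimpleMatrix(int r, int c) : rows(r), cols(c), data(r * c, 0.0) {}
  double *Data() { return data.data(); }
  const double *Data() const { return data.data(); }

  // this = alpha * A * B + beta * this
  void AddMatMat(double alpha, const SimpleMatrix &A,
                 const SimpleMatrix &B, double beta) {
    assert(rows == A.rows && cols == B.cols && A.cols == B.rows);
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                A.rows, B.cols, A.cols, alpha,
                A.Data(), A.cols,     // lda: stride of A
                B.Data(), B.cols,     // ldb: stride of B
                beta, Data(), cols);  // ldc: stride of this
  }
};

// Usage: C.AddMatMat(1.0, A, B, 0.0) — instead of spelling out the
// fourteen-argument cblas_dgemm call at every call site.
```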

This box is feature preprocessing — you know, going from a wav file to MFCCs; that's there. Gaussian mixture models, diagonal and full. Subspace Gaussian mixture models — that relates to my own work. Linear transforms: things like fMLLR, MLLT (STC), HLDA, things of that nature; VTLN is in here too — kind of the linear form of VTLN.

All of these things up here are kind of directories that contain command-line programs, and that tells you a bit about the structure of the toolkit, which is that we have really more than a hundred command-line programs, and each one does a fairly specific thing. We wanted to avoid the phenomenon where you have a program that allegedly does one thing but is really controlled by about a million options, and has rather complicated behavior depending on which options you give it. So this is part of the mechanism that we use to ensure that everything is configurable and easy to understand.

There is no Python layer; rather, there are a lot of programs that act as simple functions, and on top of this are the shell scripts. So to do actual system building — a recipe — what our example scripts currently do is: it's a bash script that has a bunch of variables, in bash, to keep track of iterations and stuff, and it runs the jobs by invoking the programs from the command line. There are different ways you could do this — if you prefer Perl or Python or whatever, you could use that — but that's how our scripts work.

And something that I haven't really included on this diagram, but that is kind of part of the dependency structure, is some tools that we rely on. For language modeling we use IRSTLM, just because of license issues, but probably you'd want to use SRILM if you want to do a lot of language modeling work. There are things like sph2pipe, to convert the sphere-format data from the LDC, and so on. We actually have an installation script that will automatically obtain those things, so that the scripts can run without you having to manually install stuff on your system.

So I'm just going to briefly summarize the matrix library; Ondřej will be talking more about it later — the plan was to allow people to escape after this initial segment, in case they're not that devoted and don't want to hear about this stuff.

As I said, it's a C++ wrapper for BLAS and CLAPACK, and Ondřej in particular has gone to a lot of trouble to ensure that it can compile in various different configurations, depending on what libraries you have on your system. So it can work either from BLAS plus CLAPACK, or from ATLAS, or using Intel's MKL; the reason is that on some systems you might have one but not the other. ATLAS is an implementation of BLAS that's kind of optimized to your specific hardware automatically, and it's generally more efficient.

The code that we've written includes generic matrices, also packed symmetric matrices — where you have a symmetric matrix and only store the lower triangle, in this packed order — and packed triangular matrices. There are other formats that BLAS and CLAPACK support, but these are the ones that we felt were most applicable to speech processing; we don't traditionally do a lot with sparse matrices, for example.
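To make the packed layout concrete, here is the conventional lower-triangular packed scheme as a small self-contained sketch (the class is illustrative, not the toolkit's actual type; the exact storage conventions in the library may differ in details):

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Row-by-row packed storage of the lower triangle of a symmetric matrix:
// row i contributes i+1 elements, so element (i, j) with j <= i lives at
// offset i*(i+1)/2 + j.  A 4x4 symmetric matrix needs only 10 numbers.
struct PackedSymmetric {
  std::size_t dim;
  std::vector<float> data;  // size dim*(dim+1)/2

  explicit PackedSymmetric(std::size_t d)
      : dim(d), data(d * (d + 1) / 2, 0.0f) {}

  float &operator()(std::size_t i, std::size_t j) {
    assert(i < dim && j < dim);
    if (j > i) std::swap(i, j);  // symmetry: (i, j) == (j, i)
    return data[i * (i + 1) / 2 + j];
  }
};
```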

The matrix library also includes things like SVD and an FFT. ATLAS doesn't supply those, but we got permission from Rico Malvar of Microsoft to use his code, so thanks to him for that.

Something about the matrix library: even if you don't buy into the whole toolkit, if you need a C++ matrix library it's probably quite good. In fact it's surprising that there doesn't seem to be a lot out there that fills this niche — there's Boost, but that's a rather weird library and I don't think a lot of people like it.

OK, a few words about OpenFst. I assume everyone knows what FSTs are. AT&T had a command-line toolkit, but I don't believe they ever released the source. So when some of those guys went to Google, they decided to make one that was properly open source, and it's Apache-licensed. That's part of the reason we made ours the Apache license: we figured that, since we use OpenFst, there's no real point in having a different license, because it just gives the lawyers a headache, so we went for the same one.

So, yes, we compile against it, and among other things that's used for the decoder. The decoder doesn't use some special decoding-graph format; it uses the same in-memory structures as OpenFst. And by the way, OpenFst has a lot of templates and such, so there isn't just one FST type — there are a lot of them. So if you wanted to, you could kind of template your decoder on some fancy format that would be, let's say, compact, or dynamically expanded, or something like that. We're not going to go into that in detail today.

We actually implemented various extensions to OpenFst. Some of the recipes are perhaps not totally in the spirit of OpenFst, because those guys have a particular recipe that they follow, and ours is just a little bit different. Later on I can explain why; I feel that there are good reasons for it, though I don't know if those guys would agree.

So, a few things about I/O. It was a somewhat controversial decision among the group to use C++ streams; in the end we decided to do it, partly because OpenFst also does it — a lot of people prefer C-based I/O, but this is what we do.

We support binary and text-mode formats, a little bit like HTK, so that each object in the toolkit has a Write function that takes a boolean argument, "binary", and puts its data out to the stream in binary or text mode; and each object also has a Read function that does the same thing.
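Here is a toy sketch of that object I/O convention — every object exposes Write(stream, binary) and Read(stream, binary). This is not the toolkit's actual code, just the general shape of the interface.

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

class ToyVector {
 public:
  void Write(std::ostream &os, bool binary) const {
    if (binary) {
      int32_t size = static_cast<int32_t>(data_.size());
      os.write(reinterpret_cast<const char *>(&size), sizeof(size));
      os.write(reinterpret_cast<const char *>(data_.data()),
               size * sizeof(float));
    } else {  // human-readable text mode
      os << data_.size() << " ";
      for (float f : data_) os << f << " ";
      os << "\n";
    }
  }
  void Read(std::istream &is, bool binary) {
    int32_t size = 0;
    if (binary) {
      is.read(reinterpret_cast<char *>(&size), sizeof(size));
      data_.resize(size);
      is.read(reinterpret_cast<char *>(data_.data()), size * sizeof(float));
    } else {
      is >> size;
      data_.resize(size);
      for (int32_t i = 0; i < size; i++) is >> data_[i];
    }
  }
 private:
  std::vector<float> data_;
};
```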

It's the standard thing in many toolkits that filenames can be specified in various ways: '-' can mean the standard input or standard output; a filename can actually be a command, i.e. a pipe — and that's how it knows what to open; and there's also an offset into a file, meaning it will open the file and seek to that position, which is useful for reasons that will be described later.

This archive format is quite a fundamental part of the way Kaldi works, and I'm going to describe it more later, in another talk, but the basic concept is: you have a collection of objects — let's imagine they're matrices — and they're indexed by a string, where the string might be, let's say, an utterance ID. So you want some way to access this collection of strings and matrices, and there are a couple of different ways you might want to do that: you might want to go sequentially through them, say for accumulating some statistics, or you might want to do random access. So there's a whole framework for doing this.

Basically, the reason is so that most of the Kaldi code doesn't have to worry about things like opening files and error conditions; there doesn't have to be a lot of logic about that in the command-line programs, because it's all handled by one generic framework. Apart from this, though, we've tried to avoid generic frameworks.
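As a self-contained illustration of the two access patterns just described, here is a toy program that uses an in-memory map as a stand-in for a real on-disk archive; the point is what the calling code looks like, not the storage itself (the real table code handles files, pipes, and error conditions behind a similar-looking interface).

```cpp
#include <iostream>
#include <map>
#include <string>
#include <vector>

// One "feature matrix" per utterance: one inner vector per frame.
using Features = std::vector<std::vector<float> >;

int main() {
  // Pretend this map was read from an archive: utterance-id -> features.
  std::map<std::string, Features> archive;
  archive["utt1"] = Features(3, std::vector<float>(13, 0.0f));
  archive["utt2"] = Features(5, std::vector<float>(13, 0.0f));

  // Sequential access, e.g. for accumulating statistics over all utterances.
  for (const auto &kv : archive) {
    const std::string &utt = kv.first;
    const Features &feats = kv.second;
    std::cout << utt << " has " << feats.size() << " frames\n";
  }

  // Random access, e.g. looking up the features for one utterance by id.
  auto it = archive.find("utt2");
  if (it != archive.end())
    std::cout << "utt2 found with " << it->second.size() << " frames\n";
  return 0;
}
```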

The tree-building and clustering code is based on very generic clustering code. That internal code doesn't assume a lot about what your tree is, so it's suitable for building decision trees in different ways, including things like sharing the tree roots and asking questions about the central phone — things like that.

It's very scalable to wide context, for example quinphones. In a lot of toolkits it's hard to write code that scales to quinphones, because if you have to enumerate all of the contexts, that gets hard to cope with; we basically avoid ever enumerating those contexts.

As an example of how we make use of this generality: in the Wall Street Journal recipe, we increased the phone set so that we were asking questions about the phone's word position and its stress. I believe HTK supports this too — I think there was a paper about doing that. But if the phone set were much larger than that, an approach based on enumeration of contexts would probably start to struggle. [In response to a comment:] You don't think so? I mean, with a phone set in the thousands... OK, well.

OK — the HMM and transition modeling code. We've tried to have an approach where a piece of code only needs to know the minimum it needs to know. So the HMM and transition modeling code doesn't really have any notion of a PDF; it purely does what it needs to do, and the rest is kept separate.

This is probably a pretty standard approach: you specify a prototype topology for each phone — that is, how many states there are and what the transitions are. And we make the transitions separate depending on the PDF, so that if the PDFs in two states are different, then the transitions out of those states are separately estimated. This is just the most specific you can make the transition modeling while still being able to estimate the transitions without having your decoding graph blow up. It's not really clear that this matters, but we just felt that we should do the best we could on it.
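A minimal sketch of what such a prototype topology might look like as data; the struct names and fields here are illustrative only, assuming a standard 3-state left-to-right layout with self-loops, and are not the toolkit's actual definitions.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// One state of a prototype HMM topology: which PDF slot it uses and
// the allowed outgoing transitions with their initial probabilities.
struct TopologyState {
  int32_t pdf_class;  // PDF slot for this state (-1 for a non-emitting state)
  std::vector<std::pair<int32_t, float> > transitions;  // (next state, prob)
};

// A prototype topology shared by a set of phones.
struct PhoneTopology {
  std::vector<int32_t> phones;
  std::vector<TopologyState> states;
};

// Example: a 3-state left-to-right topology with self-loops.
PhoneTopology MakeThreeStateTopology(const std::vector<int32_t> &phones) {
  PhoneTopology topo;
  topo.phones = phones;
  for (int32_t s = 0; s < 3; s++) {
    TopologyState st;
    st.pdf_class = s;
    st.transitions.push_back(std::make_pair(s, 0.5f));      // self-loop
    st.transitions.push_back(std::make_pair(s + 1, 0.5f));  // forward arc
    topo.states.push_back(st);
  }
  TopologyState final_state;
  final_state.pdf_class = -1;  // non-emitting final state
  topo.states.push_back(final_state);
  return topo;
}
```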

There are mechanisms for expanding these HMMs into FSTs, because all of the training and decoding is FST-based, so you kind of have to have an FST representation of these.

This relates to something I touched on earlier. In a normal FST setup, what you would imagine is that the FST has input symbols that are the PDFs — some symbol that represents the PDF — and output symbols that are the words. The problem with that is: suppose you want to find out what the phone sequence was. That's all well and good if each phone had a separate tree, so that you could tell for each state which phone it belonged to; but what if you had a larger phone set and you wanted a tree shared across phones, so that there wasn't a one-to-one mapping? So we have input labels on the FSTs that encode a bit more information. This is also useful in training the transitions, because sometimes just the PDF labels wouldn't give you quite enough information to train the transitions.
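Conceptually, each input label packs together more than just the PDF. The sketch below mirrors that idea (in the spirit of the identifiers described above); the exact fields and code are illustrative, not the toolkit's actual data structures.

```cpp
#include <cstdint>
#include <vector>

// Each integer input label on the graph indexes a record that remembers not
// just the PDF, but also the phone, the HMM state within that phone, and
// which outgoing transition it is.  Given the label you can recover the
// phone sequence even when trees are shared across phones, and you can
// accumulate per-transition statistics.
struct TransitionRecord {
  int32_t phone;             // phone identity
  int32_t hmm_state;         // state index within that phone's topology
  int32_t pdf_id;            // which PDF (e.g. which GMM) this state uses
  int32_t transition_index;  // which outgoing arc of that state
};

class TransitionTable {
 public:
  // Label 0 is reserved (epsilon), so real labels start at 1.
  int32_t AddRecord(const TransitionRecord &rec) {
    records_.push_back(rec);
    return static_cast<int32_t>(records_.size());  // the new label
  }
  const TransitionRecord &Lookup(int32_t label) const {
    return records_[label - 1];
  }
 private:
  std::vector<TransitionRecord> records_;
};
```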

There are a couple of different ways to create decoding graphs. For training purposes you have to create a lot of these things at the same time, and combining the FST algorithms using scripts would be quite inefficient, because you have the overhead of process creation. So we call the OpenFst algorithms at the C++ level and combine them together, so that you can create your decoding graphs for training efficiently. And we typically put them in one of these archives — basically a big file on disk with everything concatenated together, with little keys in it — so that you don't have the I/O cost of accessing hundreds of little files. Training then uses the Viterbi path through these graphs.

For test time we didn't use this C++ approach, because there's just no point; it's basically scripts — and I'm going to go through those scripts later, for those who stay. The scripts that create the decoding graph call some OpenFst tools, but also some of our own, and that relates partly to a difference in recipes which I'll talk more about later, after the break.

So — Arnab is going to talk later about some of the acoustic modeling code; I'm just going to give a brief summary. Our GMM code is very simple; it's not part of some big framework. It's kind of just an object that has, you know, the means and the variances, and it can evaluate likelihoods for you if you give it a feature vector. But it doesn't inherit from some generic acoustic model class, and it doesn't pretend to know about things like linear transforms — it just sits there, and things like the transform-estimation code have to access the model and do what they want with it. The reason for that is that if the GMM knows too much, then whatever fancy thing you do, you have to then go and change the GMM code, and that's not a nice situation. We have a separate class for GMM statistics accumulation, and for doing the update.

For a collection of GMMs — like a GMM-based acoustic model — we have a class that pretty much behaves like a vector of GMMs. It's a fairly simple thing: there's no notion of the name of a state, it's just an integer. Generally we've avoided having names for things in the code; everything is indexed by integers.
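A rough sketch of that "dumb GMM" flavor — a diagonal-covariance model object that only stores parameters and evaluates a score, with statistics accumulation kept in a separate class. These are illustrative toy classes (using a crude max-over-components approximation), not the toolkit's actual ones.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct DiagonalGmm {
  std::vector<float> log_weights;             // one per component
  std::vector<std::vector<float> > means;     // [component][dim]
  std::vector<std::vector<float> > inv_vars;  // [component][dim]

  float LogLikelihood(const std::vector<float> &x) const {
    const double kPi = 3.14159265358979323846;
    double best = -1e30;  // max over components instead of a full log-sum
    for (std::size_t m = 0; m < means.size(); m++) {
      double ll = log_weights[m];
      for (std::size_t d = 0; d < x.size(); d++) {
        double diff = x[d] - means[m][d];
        ll += 0.5 * std::log(inv_vars[m][d] / (2.0 * kPi))
              - 0.5 * diff * diff * inv_vars[m][d];
      }
      if (ll > best) best = ll;
    }
    return static_cast<float>(best);
  }
};

// Accumulation lives elsewhere: the model object above knows nothing about it.
struct DiagonalGmmAccumulator {
  std::vector<double> occupancy;                 // per-component counts
  std::vector<std::vector<double> > mean_stats;  // sum of gamma * x
  std::vector<std::vector<double> > var_stats;   // sum of gamma * x^2
  // Accumulation and update functions would go here.
};
```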

[In answer to a question:] Oh, this lower-case 'vector' just refers to the STL vector; there is an upper-case Vector as well, but that's something in the matrix library. And the code compiles fine, even on Windows.

We've got quite a lot of linear transform code: LDA, HLDA — and I'm sitting on the fence with regard to the naming of this technique; I don't want to annoy anyone, and a number of these have multiple names. A linear version of VTLN: I mean, we tried regular VTLN, and as everyone knows it's kind of tricky to get it to work; it was the linear one that worked better in the end. ET is something new — kind of a replacement for VTLN that works a little bit better; I'm going to explain what it is at a later date. MLLR, fMLLR — a lot of this kind of thing.

When these transforms are global, the way we handle them is that the transform just becomes part of the feature space. So it's just stored as a matrix on disk, and — we use a lot of pipes — the way it actually works is that this matrix is multiplied by the features as part of a pipe. That may seem like an obviously silly way to do it from a computational point of view, but it just makes the scripts really convenient.
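For a global transform of this kind the per-frame operation really is just a matrix applied to each feature vector, with a 1 appended to handle the offset. A minimal sketch (illustrative function, assuming a d x (d+1) affine transform matrix):

```cpp
#include <cstddef>
#include <vector>

// y = W * [x ; 1], where W has d rows and d+1 columns and x has d elements.
// This is the whole operation a "transform features" pipe stage performs,
// frame by frame.
std::vector<float> ApplyAffineTransform(
    const std::vector<std::vector<float> > &W,  // d rows, d+1 columns
    const std::vector<float> &x) {              // d-dimensional frame
  std::size_t d = x.size();
  std::vector<float> y(W.size(), 0.0f);
  for (std::size_t i = 0; i < W.size(); i++) {
    float sum = W[i][d];  // last column multiplies the appended 1 (offset)
    for (std::size_t j = 0; j < d; j++) sum += W[i][j] * x[j];
    y[i] = sum;
  }
  return y;
}
```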

So when I say they're applied in a unified way, what I mean is that the code that estimates any of these transforms really just outputs a matrix. There's no "MLLR transform object" — well, OK, there is for the regression-tree one, but for the global one it's just a matrix. That was a point of contention among us, whether to do it this way, but some of us felt that it was important to keep the simple cases simple, and to avoid having a framework for the cases where one isn't necessary.

OK, decoders. All of the decoders that we currently have use fully expanded FSTs — and when I say fully expanded, I mean down to the HMM-state level, with self-loops represented as actual FST arcs. I know there are a lot of ways to do this, and initially one of the thoughts we had was that we wouldn't have the self-loops, and we might not even have explicit representations of the states, but it was just so much simpler to do it this way, and this is what we have now.

We have three decoders — and by "decoder" we mean the C++ code that does decoding, which is not necessarily the same thing as a command-line decoding program. The three decoders sit on a spectrum from simple to fast, and the reason for this is that once you have a complicated, fast decoder it's almost impossible to debug; so if something goes wrong you can always just run the simple one, and you can find out whether it's a decoder issue.

We wanted to make it so that the decoder doesn't assume too much about what your model is. So, again, the decoder has no idea about GMMs or HMMs; it doesn't even know about features. All the decoder knows about is: give me the likelihood, or score, for this frame index and this PDF index. So the interface that the decoder sees is almost like a matrix — a matrix of floats — but it's not represented that way, because you want to be able to compute it on demand.

So this is the "decodable" interface — a very simple interface that says: give me the likelihood for this frame and this index; how many time frames are there; and how many PDF indices are there. That's almost all the interface is. This is the interface that the decoder requires, so the idea is: if you implement, you know, your fantastic new model, it doesn't matter what the interface of that model is — you create a small object that satisfies the decodable interface and knows how to get the likelihoods from your fantastic model, and then you instantiate the decoder with that, or you pass it in.
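A sketch of what such an interface looks like, modeled on the idea just described; the method names here are illustrative and the toolkit's actual declarations may differ.

```cpp
#include <cstdint>

// The abstraction the decoder is written against: it can ask for a score
// for a (frame, pdf-index) pair and how many of each exist, and nothing
// else.  Any acoustic model can be plugged in by wrapping it in a small
// adaptor class that implements this interface.
class DecodableInterface {
 public:
  virtual float LogLikelihood(int32_t frame, int32_t pdf_index) = 0;
  virtual int32_t NumFrames() const = 0;
  virtual int32_t NumPdfs() const = 0;
  virtual ~DecodableInterface() {}
};

// An adaptor for, say, a GMM system would hold a reference to the model and
// the features for one utterance, and compute LogLikelihood on demand.
```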

So there's a GMM wrapper of that kind, OK. Our command-line decoding programs are very simple: we don't have multi-pass decoding or anything, and we don't attempt to support multiple types of model in one program. An example decoding program is "decode with a GMM, but with no multi-class adaptation" — it does the simple thing. Then, if you want to support, let's say, multi-class MLLR or fMLLR, we have a separate command-line program. The idea is that there might be people coming into the project who want to be able to understand a given command-line program, and we don't want to make the barrier to entry too high. We accept the overhead of having to maintain parallel decoding programs in order to keep any given one relatively simple to understand.

We support the standard types of features; our MFCC and PLP features are quite similar to HTK's. We've put in a reasonable range of configurability, but — being realistic with respect to how much people are really working on this stuff — most people doing research on features would probably be computing their own features anyway, so we don't support every possible combination or every possible variant. We only read the wav format, because our reasoning is that you can always find an external program to convert other formats, and do it as part of a pipe.

[In answer to a question:] Sorry — well, we can read HTK features, but there's no other format that we support. I mean, the basic concept is for people to use the system as a complete system, because once you start supporting model conversion and so on, it just gets complicated. But yes, HTK features are supported as a special case.

We typically write features and other large objects to a single very large file; this relates to the archive format. The form of the file is: a key, a space, then your object; another key, a space, that object; and so on. And we have efficient mechanisms for reading such files. The two normal cases are, firstly, sequential access, where you want to iterate over the things in an archive; and secondly, random access. There are different ways to do random access: one is that you can write a separate file that has little pointers into the archive; another is that you can kind of simulate random access even though you're really reading sequentially, if you know that the keys are sorted; and another, if the file isn't that big, is to do random access by just having the code go through the whole file and store the objects in memory. That's not scalable, but for a lot of types of object it really doesn't matter.

Oh yes — feature-level processing, like adding deltas or fMLLR: typically each one of those is a separate program, so you have a sequence of programs in a pipe. Again that's a bit inefficient, but it's not like it's really consuming more than ten percent of your CPU, so you just don't care that much; this has been written with ease of use in mind.

Like I said, there's a lot of command-line tools. This is an example of a command line — the backslashes are just shell line continuations. This is one of the many programs; PLP extraction would be a separate command-line program. This here is just an option. Then there are the two command-line arguments — I'm going to explain later on exactly what these mean — this one is directing it to write these things to an archive on disk: key, object, key, object. This one is the input, telling it what to read; and this is telling it to write an archive and also an 'scp' file that kind of has little pointers into the archive, so that you can efficiently access the features by random access.

Another thing this example shows is that there's only one option here; we have no more than a few options on any given command — I mean, a typical program supports less than ten. It's not a very configuration-driven toolkit; it's more driven by how you combine these programs.
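To give a feel for that shape — a usage message, at most a few options, and positional arguments naming where to read and write — here is an entirely hypothetical toy tool; real tools would delegate the actual reading and writing to the generic table code sketched earlier.

```cpp
#include <cstring>
#include <iostream>
#include <string>

int main(int argc, char *argv[]) {
  bool binary = true;  // the single option this toy tool supports
  int arg = 1;
  for (; arg < argc && std::strncmp(argv[arg], "--", 2) == 0; arg++) {
    std::string opt = argv[arg];
    if (opt == "--binary=false") binary = false;
    else if (opt == "--binary=true") binary = true;
    else { std::cerr << "Unknown option " << opt << "\n"; return 1; }
  }
  if (argc - arg != 2) {  // exactly two positional arguments expected
    std::cerr << "Usage: toy-copy-feats [--binary=true|false] "
                 "<features-in> <features-out>\n";
    return 1;
  }
  std::string rspecifier = argv[arg], wspecifier = argv[arg + 1];
  std::cout << "Would copy " << rspecifier << " to " << wspecifier
            << (binary ? " (binary)" : " (text)") << "\n";
  return 0;
}
```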

Something else about this whole archive formalism is that the C++ code in the individual command-line tools doesn't have to worry too much about I/O. There are very short statements in the C++ that will iterate over the stuff, so it doesn't have to think too much about the error conditions.

FST generation — OK, that's in another part of the talk, later on. For training, there's a command-line program that will do the FST generation for you and generate lots of these FSTs, one for each utterance. For testing, it's a script that calls the OpenFst programs, and our own versions of some of them. I'm going to go through that script later, in another part of the talk.

Now, this slide — I don't expect you to understand the script; this is just to give people some idea of how we do training. So this is the bash script; it's doing a loop over the iterations, and this part is estimating MLLT.

So, if it's one of the iterations on which we do MLLT, then — we have on disk some alignments; these are state-level alignments, in an archive like I mentioned. This converts them to posteriors, in a trivial way, just by saying that each entry has a posterior of one. This one gives a zero weight to the silence frames — that would be a bash variable holding the silence phones. And this is an accumulation program: this would be the model — the current model — and the features are a bash variable defined elsewhere.

This '-' refers to the standard input, meaning it's reading an archive from the standard input — the output piped from the previous program — and this one means it's writing an archive to the standard output. So the output of these programs is passed along by a pipe. All of the error and logging output goes to the standard error, because we've kind of used up the standard output for the pipe stuff, so we just redirect the logging there.

Then this is a separate program that does the MLLT estimation; it takes in — let me see — it's computing some kind of transform matrix. And then, because with MLLT, when you update the transform you have to change the means of your model — and we like to keep everything separate, so transforming the means is a separate operation and we have a separate program for that. And then we have to compose the MLLT transform with the previous transform, so this is another little program that does that.

So this bit is setting another bash variable, to make the features now correspond to the new MLLT features. As you can see, this is a variable in bash, and it would be passed as a command-line argument to one of the programs; it's a command involving a pipe that actually involves calling two separate Kaldi programs, each with their own arguments. You can probably guess from the names of those programs what they're doing. And then of course there's a 'sub' version of the features — oh yes, I think we were estimating the MLLT on a subset of the features, so this is the same as that, but using less of the data.

So, I think I spoke about these issues before: we have example scripts for Resource Management and Wall Street Journal, and these run from the LDC-distributed disks.

Now, we found in the literature some baseline numbers. These numbers are for just the basic context-dependent triphone system, with, I think, cepstral mean normalization. We have more advanced things, of course, but because we had to find the same configuration in the literature, we're just giving you the unadapted numbers here. So we're slightly better than this number from around 2000, and the HTK paper from ninety-four has a comparable number, but that was for a gender-dependent system. So I think we're basically doing the same as you'd expect, given the same algorithms. I mean, I was hoping at the start of this whole project that the results would be better, for reasons relating to the tree building and quinphones and stuff, but, you know, in the end we get about the same. So it's working; there are no major bugs.

OK, next slide — just a note on speed. In decoding we use bigram numbers here, because the baselines were bigram numbers. We can't yet decode with the full trigram language model that's distributed with the Wall Street Journal corpus, because the FSTs get too large; we do have results with a pruned trigram, but that's why we're quoting the bigram numbers. Hopefully by the summer — there are a couple of things we're working on: one is to have a decoder that does some kind of on-the-fly expansion, so that we can decode directly with the full trigram; and the other is to have lattice generation, so that we can rescore.

The decoding speed for these Wall Street Journal numbers is about twice as fast as real time, and that's on a good machine. And this is tuned so that you don't get more than about 0.1% degradation versus exact Viterbi decoding. The Wall Street Journal script takes a few hours on a single machine; we parallelize onto three CPUs. This is just an example script — we didn't want to include things like queue submission in the example scripts, because then it wouldn't run on everyone's machine; of course it would be faster if you ran it in parallel on a cluster.

[In answer to a question about memory:] Well, about ten gigabytes. I mean, everyone knows that FST compilation tends to blow up a bit; if you halve the size of the model, you can just about compile it. I don't recall whether that was for the trigram for Wall Street Journal. But I don't think our stuff is any worse than a normal FST setup that fully expands everything.

Oh yes, OK — Resource Management results. The HTK results here are taken from — I think this is basically the standard HTK RM recipe, but the results are taken from a paper of mine from around ninety-nine or something, because I just couldn't find results on all of the test sets in the README file. As you can see, the average is the same; so with the same algorithms we're getting the same results as HTK. And the decoding we run on this setup is about 0.1 times real time. [In answer to a question:] Yeah, the test sets are quite small — it's a very small test set.

This page is mainly just to give you some idea of the kinds of things that are in our example scripts; we have a bunch of different configurations. This is the standard configuration — well, it's the standard configuration because it's what's in the HTK baseline. Adding MLLT doesn't seem to help — sorry, adding STC; see, I'm sitting on the fence about the name.

Splicing nine frames plus LDA actually makes it a bit worse, but then when you do MLLT on top of that it gets better than the baseline — this was kind of the IBM-style recipe — so I guess there must have been some interaction between those two parts of the recipe that somehow made it work; I don't know if that generalizes to other test sets, we're going to find out. Then there's splicing nine frames plus HLDA; triple deltas plus HLDA; triple deltas plus LDA plus MLLT — this one is quite good. And the SGMM systems. These are all unadapted; I have a separate slide for the adapted experiments.

Unless stated otherwise — oh yes, OK: this column is per-utterance adaptation and this one is per-speaker. This system was four point five before adaptation, so fMLLR really doesn't help if you do it per utterance, and that's because there are too many parameters to estimate. Doing the same thing per speaker helps a lot.

The exponential transform — again, I'm not going to describe what it is; it's something VTLN-like — and this VTLN is the kind of linear version of VTLN, I believe. The ET thing improves things quite a lot, and the improvement is more pronounced at the per-utterance level, because it's just a constrained form of fMLLR, so the only point of doing it is when you have less data. Then there's splicing nine frames plus LDA plus the ET transform, plus fMLLR, and so on; we only did some of these per speaker, because otherwise it wouldn't help. As you can see, there are a lot of different combinations.

This one is the SGMM including the speaker offsets — the speaker vectors, if you remember them — and it does help; I think Rick was saying that it wasn't working for him, but it seems to be working for us: three point one five goes to, where is it, two point six eight. I must have forgotten to fill in this line; that's SGMM plus fMLLR but no speaker vectors. [In answer to a question:] Per speaker, yeah. I think I have those numbers but must not have put them in; I think the best number was like two point four or two point three.

So, a general plug for Kaldi. I believe it's easy to use — I mean, I hope the scripts didn't scare you guys off; my contention is that once you understand them, everything becomes quite simple. It does kind of assume that you understand how speech recognition works: if you're someone who just randomly modifies the scripts, changing configurations, it's not going to work — it doesn't automagically know that the features you have don't match your model. So you do have to know what you're doing from a speech-science point of view. But it's easy to use if you're a C++-slash-shell-script kind of software engineer. It's easy to extend and modify, and you can contribute your changes back to the Kaldi group; we're open to including other people's stuff.

So that's really the end of this first part — you can get up and have a drink in a few minutes.

Oh yes — it has documentation, at kaldi.sourceforge.net. OK, it's not as good as HTK's, and, being realistic, it probably never will be; what we will do is avoid duplicating the effort that HTK has put in, and point people to the HTK documentation where that makes sense.

OK — we can have a short break, we can have a drink, and you can just disappear if you're not, uh, that committed to it, and then we'll have Ondřej's talk after that.