Okay, so let me say a few words before the final talk of the conference. The practical remarks were already given this morning, and, as you know, there isn't a closing ceremony in the afternoon as such. Compared with last year the numbers seem to have come down a little, so I would like all the more to thank you all for being here. The only other thing to mention, now that the slides are done, is a small practical problem that will probably still need to be sorted out; we have been working on it for a while. And now we will have the last plenary lecture, given by Michael Jordan from the University of California, Berkeley. Let me hand over for the introduction of our speaker.
Thank you very much. So, it is a great honour for me to introduce Michael Jordan as the last keynote speaker of this conference. As you all know, he is a legend in the fields of artificial intelligence and machine learning, and he has supervised some of the students I mentioned earlier. His work over the last many years has been fundamental, and he has made absolutely incredible contributions to the field. The topic he is going to talk about today is nonparametric Bayesian methods. He is of course best known for his work on graphical models, which have found application in speech and in many areas of statistical signal processing, as well as in other areas like natural language processing and statistical genetics. His work has of course been recognised with tremendous honours, both in statistics and in engineering: for instance, he was recently elected a member of the National Academy of Sciences and of the National Academy of Engineering, and he is a fellow of the American Association for the Advancement of Science, among other distinctions. He has also received numerous honours in statistics and in machine learning, has given a number of distinguished lectures, and holds a number of special distinctions. Before that, let me just read some words from students. Michael has a large number of students, who now hold extremely successful positions. Here is what I read: Michael Jordan is, if you can imagine it, the Michael Jordan of artificial intelligence and machine learning. Please join me in welcoming our speaker.

Thank you.
I'm delighted to be here, and I thank the organisers very much for inviting me. So the main goal of the talk is to tell you a little bit about what the title means: what is nonparametric Bayes? Just to anticipate a little bit: first, "Bayesian" just means that you use probability pretty seriously, and this is already a community where probability has been used, probably more than in any other applied community I know of, for a long, long time, so that's the easy part of my story. "Nonparametric" doesn't mean there are no parameters; it's just the opposite: it means there's a growing number of parameters. So as you get more and more data, a Bayesian nonparametric model will have more and more parameters. The number grows at some rate, slower than the number of data points, but it grows. I think that's kind of a key modern perspective on statistics and signal processing, and the Bayesian has a particular take on it.
It really is a toolbox, and the way I'm going to try to do the talk is to give you an idea of what this toolbox is. You can use it to solve applied problems, it has a beautiful mathematical structure, and in some sense it's really just getting going, just getting started. So I'm going to try to convince you that you too can contribute here, both on the fundamental methodological side and on the applied side. A big thanks to my collaborators; their names are at the bottom and will appear throughout the talk.
Now, I have one slide on sort of a historical philosophy. I'm in a computer science department, but this could equally well be electrical engineering. So all these fields in the forties and fifties were somehow together, you know, people like Kolmogorov and Turing and so on, and then they separated, because the problems got really hard, I believe. Computer science went off and started looking at data structures and algorithms, and it didn't focus much on uncertainty; statistics and signal processing focused on uncertainty, and didn't focus too much on algorithms and data structures. And so maybe Bayesian nonparametrics is one venue where these two things are coming together, and mathematically it really amounts to using stochastic processes instead of classical parametric distributions. You use a growing number of parameters, so you have distributions indexed by large spaces, and those are just called stochastic processes. So you're going to see, as I go through the talk, some of the stochastic processes we've been looking at.
To put this in a little bit more of a classical statistical context: if you pick up a book on Bayesian analysis, you will see the posterior written as proportional to the likelihood times the prior, and it's usually written out in a parametric way, with theta indexing the parameter space. So there's the prior, there's the likelihood, and there's your posterior. In this talk we don't want theta to be a finite-dimensional object; we want it to be an infinite-dimensional object, so instead of writing theta we'll write G. P of X given G is still going to be a likelihood; in this talk it will mainly be a hidden Markov model, actually. So G will be the structure and all the parameters of a hidden Markov model, the state space and so on, and this is just the usual HMM, and this will be some kind of structural prior on the components of the HMM. I'll be talking mostly about this and less about that.
All right, so the mathematical story is just that instead of having a classical prior distribution, you have a distribution on an infinite-dimensional space, and that's not so strange mathematically: that is what a stochastic process is. So we have a prior stochastic process, and it's going to be multiplied by some kind of fairly classical likelihood, because once G is fixed it doesn't matter what the size of the space is; G is just a fixed object, so the probability of the data given G can typically be specified easily. And when we multiply this prior by this likelihood, we get ourselves a posterior stochastic process.
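Written out, the contrast is just this (a sketch in the notation used on the slide):

```latex
% Classical parametric Bayes: \theta lives in a finite-dimensional space
p(\theta \mid x) \;\propto\; p(x \mid \theta)\, p(\theta)

% Nonparametric Bayes: G is an infinite-dimensional object (a draw from a stochastic process)
p(G \mid x) \;\propto\; p(x \mid G)\, p(G)
```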
To give some more concrete idea of what these things are: we'll be talking a little bit about distributions on trees of unbounded depth and unbounded fan-out; you get these in genetics and in natural language processing quite a bit. Stochastic processes on partitions arise a lot; we'll talk about clustering models where the number of clusters is not known a priori. We can put distributions on grammars, on sparse matrices, and, something I find quite appealing, we can put distributions on distributions with this tool as well, so you can get the kind of recursion that you see in computer science. I'm going to talk about one particular large class of stochastic processes, the random measures; we'll get to what those are. They are one of the main ingredients of the toolbox I'm going to try to talk about.
Okay, so here is a familiar problem. I learned about it from my colleagues at ICSI, we've worked on it a bit, and I'm going to use it as a concrete way to show you how some of the methods I'm talking about can be rolled out and applied to a domain. I think everyone here knows what this problem is: there's a single microphone, there's a meeting going on, here's the waveform, and the ground truth is like this: Bob spoke for a while, then John, then Jill, then Bob, and so on. We'd like to infer this from this, and we don't know how many people are in the room, and we don't know anything about the spectral characteristics of their speech.
Here's the other problem I'll talk about; it's a little less traditional for you guys, I think, but it's a segmentation problem, a multivariate segmentation, and a multiple time series segmentation problem. Okay, so someone came into a room, and this is motion capture: we have sensors on their body, and I think it's about sixty-dimensional, so we have a sixty-dimensional time series for this person. And you can see it's segmentable: they did a little bit of jumping jacks, a little bit of splits, a little bit of twists, and so on and so forth. But we're not supposed to know that a priori: we don't know what the library of exercise routines was, and we don't know, for this particular person, which routines out of the big library they decided to use, how long they lasted, where the break points were, and all that. We'd like to infer all of that. Moreover, this is just one person's sixty-dimensional time series; we have these for many different people. So each person is going to come in, and we'll get a sixty-dimensional time series for each one of them, and there will be some overlap: some people do twists, some people bend over to touch their toes, and so on and so forth. Not everyone will do all the routines, but there will be some overlap. We'd like to find that and then exploit it, so that if I learn a little bit about what twists look like for one person, I'd like that to be available for segmenting the time series data for some other person. So it's a joint inference problem over multiple high-dimensional time series.
Okay. So everyone in this audience knows what HMMs are; it's a pleasant audience to talk to for that reason. Here are three of the diagrams that are often used; I like the graphical-model one here, where these are the states, there's a Markov chain on multinomial random variables, you have emissions coming off of those, and here are the parameters sitting on them. The core of it, of course, is the transition matrix: it's a K-by-K matrix, and each row of it is a next-state transition probability distribution. So given your current state, say state two, you have a transition distribution out of it, and we represent it as this little discrete measure here: it's just a bunch of atoms at locations which are the integers, with masses associated with them, and those are the transition probabilities going out from state two to all the next states. There are K of these, so it's a measure with finite support, and we'll be talking about measures with infinite support here pretty soon. So pi two is one measure, and pi one might be a different measure that is perhaps sparse, with two non-zero atoms, and so on for pi three and so on. Okay, so those are the representations: the lattice, the graphical model, and these measures representing the next-state transitions that we'll be talking about in the talk.
All right, so there are lots of issues with HMMs that are still, in some sense, unresolved. How many states do we use? For the diarisation problem we don't know the number of speakers, so we have to infer it in some way. In the segmentation problem we don't know the number of behaviours: I've got twists and jumping jacks and touch-your-toes and so on, some unknown number of behaviours. And, I think more interestingly, that latter problem raises issues about the structure of the state space. How do I encode the notion that a particular time series makes use of a particular subset of the states? That's a combinatorial notion, and I don't know how to encode it in a classical HMM, which is just a set of states. How do I share states among time series? I don't know how to think about that: there is structure, a sort of structured state space, that we don't have a classical way to express. So that's going to go into my P of G: I'm going to put structural information into my prior, and given a particular choice, I'll have that information available for inference at the HMM level. Bayesian nonparametrics solves these problems, I think, in a very elegant way, and lots of other problems too, so I'm going to try to show you what we do to do that. Okay, so I'm going to be talking about random measures here, and I'm going to skip this slide.
There's not really much being said on it; let me go right to an example of a random measure. Everyone here knows what a measure is: you put a set in, you get a number out, and the sets come from some sigma algebra. We now want to talk about random measures: if we're going to be probabilistic and do inference on measures, we've got to have random measures. So first of all, let's start with something really easy: let's put a random distribution on the integers, which just go from one to infinity. A distribution on the integers is a set of numbers, countably many of them, that sum to one. We want a random distribution, so we need a set of random numbers that sum to one. How do we do that? They have to be coupled in some way so that they can sum to one, and there are many ways you could think of doing it. Here's a way that turns out to have beautiful combinatorial properties, properties that allow us to develop inference algorithms based on this idea.
It's called stick-breaking, and it's old in probability theory. What you do is take a collection of beta random variables, an infinite number of them, drawn independently, so beta sub k is a beta random variable. It has two parameters; normally I'm going to pin the first one to one, and the other will be free. Beta random variables live between zero and one, and if you take alpha-nought to be bigger and bigger, the density tilts up to the left, so most of the mass is near the origin and less of the mass is near one; think of getting lots of small numbers out of this beta draw. So what we do is take a stick that goes from zero to one and break off the first fraction of it according to beta one, and call that pi one. The remainder of the stick is one minus beta one, and I then take a fraction of that, beta two; this little part right here is that amount, beta two times the remainder, and I call that pi two. Here's pi three, pi four, pi five, and so on. We keep breaking pieces of the stick, and the total stick has mass one, so as we do this out to infinity the pi's will eventually sum to one. It's actually easy to prove, if you want to prove it, that these pi's sum to one under this procedure.
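Here is a minimal sketch of that stick-breaking procedure in Python, with a finite truncation K purely for illustration (this is just the construction written out, not the code we used):

```python
import numpy as np

def stick_breaking(alpha0, K, rng=None):
    """Return the first K stick-breaking weights pi_1..pi_K.

    beta_k ~ Beta(1, alpha0);  pi_k = beta_k * prod_{j<k} (1 - beta_j).
    The weights sum to one only in the limit K -> infinity; for finite K
    the leftover mass is 1 - sum(pi).
    """
    rng = np.random.default_rng() if rng is None else rng
    betas = rng.beta(1.0, alpha0, size=K)                        # fractions broken off
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    return betas * remaining                                     # the pi_k

pi = stick_breaking(alpha0=5.0, K=1000)
print(pi[:5], pi.sum())   # a few sizeable weights; total mass close to one
```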
All right, so that's a way to get a random distribution: we now have these pi's; for any fixed draw they are a distribution, and they are random, so we have a random distribution on the integers. Having learned how to do that, we can promote these up to distributions on arbitrary spaces using the same tool, and here's how that's done. You use the same pi's as before as weights in a little mixture, an infinite mixture, and the mixture components are these delta functions: Dirac deltas, unit masses at locations phi k. The phi k are independent draws from a distribution G-nought on some space, and this space can be an arbitrary space: it doesn't have to be a Euclidean space, it can be a function space, it can be just about anything. So these atoms live on that space, and they're weighted by these mixing proportions pi k. Each one of them has unit mass, and they're weighted by numbers that sum to one, so this total object G is a spiky object; it's a measure, with total mass one, and it lives on some space. It's random in two ways: because the pi's come from my stick-breaking, and because the phi k were drawn from some distribution G-nought, which is the source of the things I'm calling atoms. So G-nought is the source of the atoms, and G is these weighted atoms on some space. That is a random measure: if I take G of some set A, it gives me a number, so it's a measure; it's additive and so on; and it's random in two ways, so it's a random measure. This is a very general way of getting random measures. If I obtain the pi's by stick-breaking, this particular object G has a name, it's called a Dirichlet process, and we usually write it like this, with two parameters: the stick-breaking parameter and the source of the atoms. But we can break sticks in different ways and get the atoms in different ways, so this is actually a very useful tool.
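In symbols, the object just described is usually written as follows (GEM denotes the stick-breaking distribution on the weights):

```latex
G \;=\; \sum_{k=1}^{\infty} \pi_k\, \delta_{\phi_k},
\qquad \phi_k \overset{iid}{\sim} G_0,
\qquad \pi \sim \mathrm{GEM}(\alpha_0),
\qquad \text{written compactly as } G \sim \mathrm{DP}(\alpha_0, G_0).
```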
All right, so we can use this as a component of various kinds of statistical models. Here's something called a Dirichlet process mixture model, and it's probably just what you'd think it is. We use G, a draw from the Dirichlet process, drawn in the way I've described, and it lives on some space; let's say the space here is the real line, just so I can draw it. It has atom locations given by draws from some underlying distribution G-nought, and it has heights on those atoms given by stick-breaking with the parameter alpha-nought. So if I draw a specific G, I get an object looking like that, and the way it's used in a mixture model is simply as a distribution: it's not your typical Gaussian, it's this distribution, and it's random. I draw from it, so I might draw this atom in the middle here, the one with high probability, the big height, and having drawn it, that's now a parameter for some underlying likelihood. I wrote it down here: X i given theta i comes from some distribution indexed by theta. I do that again and again, that's the plate, that box, and that gives me a mixture model. In fact, there's some probability that I get the same atom on successive draws of theta, and those indices i would then be coming from the same atom; we'd think of those as belonging to the same cluster.
All right, so that's a Dirichlet process mixture model. Here's some data drawn from one of these things, and I draw more and more data as I go across; you see the blue dots appear as more and more data come in. The parameters, the thetas, in this case are means and covariances of Gaussians, so theta is a whole big long vector, and you see that the number of distinct ones is growing: there are only a few distinct ones here, and you get more and more, and they grow at some rate. In fact that rate turns out to be logarithmic: the logarithm of the number of data points. So the number of parameters in this system is growing: we have a small number of parameters here, and as we keep drawing from G again and again and again, we get more and more parameters, and like I said, it grows at a rate log n. We're nonparametric here: we don't have a fixed parameter space no matter how much data we have; as you give me more and more data, I'm going to give you more and more parameters.
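A hedged sketch of that generative process, using a truncated stick-breaking prior and a unit-variance Gaussian likelihood standing in for the generic F(theta); all the specific numbers here are illustrative:

```python
import numpy as np

def draw_dp_mixture_data(n, alpha0=2.0, K=500, rng=None):
    """Generate n points from a truncated DP mixture of 1-D Gaussians.

    G_0 puts atoms at theta ~ Normal(0, 3^2); each atom is the mean of a
    unit-variance Gaussian component.  Returns the data and the component
    label of each point.
    """
    rng = np.random.default_rng() if rng is None else rng
    betas = rng.beta(1.0, alpha0, size=K)
    pi = betas * np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    pi /= pi.sum()                               # renormalise the truncation
    atoms = rng.normal(0.0, 3.0, size=K)         # draws from G_0
    labels = rng.choice(K, size=n, p=pi)         # repeated draws can hit the same atom
    data = rng.normal(atoms[labels], 1.0)        # x_i ~ F(theta_{z_i})
    return data, labels

x, z = draw_dp_mixture_data(200)
print("distinct components used:", len(np.unique(z)))   # grows roughly like log n
```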
Okay, so let's go back to this little slide here. As I've been alluding to, with some probability I picked theta one equal to this atom here in the middle; with some probability theta two picks that exact same atom, and then those two data points come from the same cluster. You can ask what kind of combinatorial structure is induced by this procedure: how often does that happen? What's the probability that theta two is equal to theta one, no matter where theta one occurred, the probability that lightning strikes twice in the same place? And what is that probability not just for this particular G, but averaged over all possible choices of G under the procedure I outlined, this P of G? You might think that would be a really hard problem to solve, it sounds complicated, but it turns out to be really easy to solve, in terms of what the answer is.
To understand the answer you need to understand something called the Chinese restaurant process; it's another stochastic process, and this one lives on partitions, not on parameters. I have a little drawing of it down here. You have customers coming into a restaurant with an infinite number of tables; these are the round tables of the Chinese restaurant, that's why they're round. The first customer sits here with probability one. The second customer either joins them, with probability a half, or starts a new table, with probability a half, and in general the rule is that you sit at a table with probability proportional to the number of people already at the table. That's often called preferential attachment: you start to get a few big clusters emerging, and some small clusters after that, and you can easily prove from this little setup that the number of occupied tables grows at rate log n. Okay, and it's a beautiful mathematical fact that if you do the integral I was talking about for the Dirichlet process, this turns out to be the marginal probability, under the Dirichlet process, of who sits with whom. So it's another perspective: it's the marginal under that other big probability measure, P of G. You can make this into an explicit clustering model, effectively a mixture model, but now phrased in terms of clustering rather than mixture components. All you do is say that the first person to sit at a table draws a parameter for that table from some prior, which we'll call G-nought, the same G-nought as before actually, and everybody who sits at that table inherits that same parameter vector. So if phi one here is the mean and variance of a Gaussian, then the data points for all the customers around this table come from the same Gaussian; they share the same mean and covariance. So this is, in fact, exactly a marginal of the Dirichlet process mixture model.
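A minimal simulation of that seating rule, assuming concentration parameter alpha (the log-n growth in the number of tables is easy to check empirically):

```python
import numpy as np

def chinese_restaurant_process(n, alpha=1.0, rng=None):
    """Seat n customers; return the table index chosen by each customer.

    Customer i joins an existing table with probability count/(i + alpha)
    and opens a new table with probability alpha/(i + alpha).
    """
    rng = np.random.default_rng() if rng is None else rng
    tables = []                        # tables[k] = number of customers at table k
    seating = []
    for i in range(n):
        probs = np.array(tables + [alpha], dtype=float) / (i + alpha)
        k = rng.choice(len(probs), p=probs)
        if k == len(tables):
            tables.append(1)           # open a new table
        else:
            tables[k] += 1
        seating.append(k)
    return np.array(seating)

z = chinese_restaurant_process(1000, alpha=1.0)
print("occupied tables:", z.max() + 1)   # grows roughly like alpha * log n
```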
Okay, so that was a little bit of tutorial. This material is, you know, about forty years old, the Dirichlet process mixture model, and it's slowly getting taken up in applied communities, but very, very slowly. Now, to use this in a richer class of problems than just mixture models and clustering, you have to face the fact that you have multiple estimation problems: not just one set of data, but multiple sets of data. In the HMM case that arises because the current state indexes the next state: we have a bunch of distributions, the next-state distributions for all of the current states. So we don't have one distribution, we have a whole collection of distributions, all the rows of the transition matrix, and we need to estimate all those rows. In statistics we often have this situation: multiple estimation problems, each with a parameter and data based on that parameter, and we often want to tie these problems together, because there might be very little data over here and a lot of data over here, and if these are related problems it makes sense to couple them. That's called a hierarchical model, and it's one of the main reasons to be Bayesian: it's very easy to build these kinds of hierarchical models that tie things together. What you do is assume that the parameters for all these subproblems, these theta i's, are related by coming from an underlying parameter: you draw them randomly from some theta-nought. And now all the data over here flows up this tree and back down to here, and the posterior estimate of theta i depends on all the data; it's a convex combination of all the data. So that arises from the hierarchical Bayesian perspective, and here's a picture of it: here's the tree I just described, and here's the graphical-model representation, using these boxes to represent replication, so the outer box represents the m replications down below.
Okay, so we can do the same thing with the Dirichlet process, and this is a paper we published a few years ago, the hierarchical Dirichlet process. This is, I think, a really very useful tool: it's just the hierarchy idea applied to the Dirichlet process. We have the same kind of setup as before, but now we don't just have one G, we have G one, G two, G three, G four, for the different estimation problems, the different groups, and if you want to be concrete, these are the different rows of the HMM transition matrix. Each one of them is a random measure, and we have a set of measures, one for each row; we want to tie them together so that they share the same next-state space that they all transition into. So what we do is neither draw these things independently nor lump them all together: we want different transition probabilities, but we don't want completely separate state spaces; they've got to be tied together in some way. The way you do it is to add an extra layer to this graph: you first draw, from some source of atoms, a mother distribution G-nought; it's now random, instead of being a fixed parameter like before, and it becomes the base measure used for each of the children: they draw their atoms from G-nought, and they reweight them according to their own stick-breaking process. Okay.
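Compactly, the hierarchical Dirichlet process is the two-level specification (standard notation, with gamma and alpha-nought the two concentration parameters and H the base measure over parameters):

```latex
G_0 \mid \gamma, H \;\sim\; \mathrm{DP}(\gamma, H),
\qquad
G_j \mid \alpha_0, G_0 \;\sim\; \mathrm{DP}(\alpha_0, G_0), \qquad j = 1, \dots, J.
```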
So this ties together a set of random measures: it makes them cooperate on which atoms they're all going to use and on what weights they're going to use. This also has a restaurant sitting underneath it, which is a really easy way to understand these ideas. We now don't have just one restaurant, we have multiple restaurants, a franchise of restaurants, and again, in my application to HMMs, these will be the different current states, the rows of the transition matrix. So if you're in restaurant number two, you go in and you sit at a table with probability proportional to the number of people already at the table, like before; that gives you sharing, or clustering, within a restaurant. And then, if I'm the first person to sit at a table, I go up to a global menu of dishes for the entire franchise, and I pick a dish from that menu, maybe I pick, um, the chicken up here, and I bring it down to my table, and everybody who joins me at my table has to eat that same dish: we all share the same parameter. But I also put a check mark next to the dish, and if you are in some other restaurant, and you're the first person at your table, you go up to this menu and you pick a dish with probability proportional to the number of check marks next to it. So there's a rich-get-richer effect: some dishes become popular across all of the restaurants, not just within one restaurant, and they get transferred between the restaurants according to this kind of preferential attachment.
Okay, so back in the HMM setting, these will be the rows of the transition matrix. The number of possible transitions grows as you get more and more data, so instead of having a fixed K states I have a growing number of states, and the states are shared among all the different rows of the transition matrix according to this restaurant franchise. Now, I have a really nice application of this that we published this year, which I'm going to skip in the interest of time, but there's a paper on it, and I think it's kind of a killer app for this framework.
here's kinds of the angles that you get in proteins
and if you put all the data for a particular a know so on top of each other you get
kind of these fuzzy diagrams where they're not repress side
and so what you really wanna do is not just have one estimation problem of the density of angles
i you wanna break them up according to the context like in
you wanna not just have one
a distribution one have a bunch of them depending on the context
when you break them up according the context or over you have a sparse data problem many of the context
every little data
as a very similar kind of problem lots of settings including signal process in speech
um and so we want to have models like this at that that are depending on the context of the
neighbouring an amino acids
so this setup is exactly that the groups or the neighbourhood
so for each context you of a group of data
and you don't treat them separately you don't want them together read you have them cooperate right this tree
and everyone that racially is looks kind like a virtually process like that's synthetic data should you earlier
but you really have
twenty amino mean as a left when twenty the right we get four hundred estimation problem
or or context
as we have four hundred diagrams like this
and we get them to share
atoms according to this procedure so
uh i guess skip over some the results but they are um they're they're impressive this is kind of log
a lot probably improve on test set for all the to and you know acids
here's no improvement in this is over what they did probably due in in the protein folding literature
so it's really quite massive pro
um okay let's go back to hidden markov models uh where are that is a is a hidden markov model
and our prior now is the structural
machine or in telling you about
um okay so now we have a hidden markov model we don't have a fixed number of states we call
this a H P M
we have an infinite number of states so here's time here's the state space
and so when you in one of these current state you have a distribution on next states
you get they get a like a trying as rest are pasta here's table one table two table three table
four
and as a a a a growing number of tape
i i do that for table for every time i hit state number two
um
then and that's great i can get a growing number of tables but i wanna share those tables between state
number two and three and four and so on
so is i been talking about i need this
hierarchical dirichlet process to tie together all those transitions
um okay so i'm get kind of go quickly to slides like again details are what i'm trying to convey
here just kind of but at this very
and and and uh
briefly what you do is you draw a mother transition distribution for all states
and then everybody a for each
um current state is a kind is a perturbation of of that so here's spike three
it takes this in kind of every weights all these atoms
i i will for
and so one so for
okay
alright right so um
i hope that was signed of kind of clear this is a way of doing hmms were you don't know
the number states a priori
and actually are putting a prior distribution on the number of states and it can grow you get more more
day
a kind of a nice solution to the a classical problem in hmm lan
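As a hedged sketch of the generative side of this, here is a weak-limit (finite truncation) approximation of the HDP-HMM: a shared mother distribution beta over L candidate states, and each transition row drawn as a Dirichlet reweighting of beta. The truncation level and parameter values are assumptions purely for illustration, and this is the generative model, not the sampler we actually used:

```python
import numpy as np

def sample_truncated_hdp_hmm(T, L=20, gamma=3.0, alpha=4.0, rng=None):
    """Generate a state sequence and emissions from a weak-limit HDP-HMM.

    beta ~ Dirichlet(gamma/L, ..., gamma/L) is the shared next-state measure;
    each row pi_j ~ Dirichlet(alpha * beta) reweights the same atoms.
    """
    rng = np.random.default_rng() if rng is None else rng
    beta = rng.dirichlet(np.full(L, gamma / L))                 # mother distribution
    Pi = np.vstack([rng.dirichlet(alpha * beta + 1e-6)          # tiny floor, numerical safety
                    for _ in range(L)])
    means = rng.normal(0.0, 5.0, size=L)                        # simple Gaussian emissions
    z = np.zeros(T, dtype=int)
    z[0] = rng.choice(L, p=beta)
    for t in range(1, T):
        z[t] = rng.choice(L, p=Pi[z[t - 1]])
    x = rng.normal(means[z], 1.0)
    return z, x

z, x = sample_truncated_hdp_hmm(500)
print("states actually visited:", len(np.unique(z)))
```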
So we implemented this. There is a simple sampling procedure, we actually used a slightly nonstandard sampler, but there's a whole family of procedures that can be used to do posterior inference here. Given some emissions, some data, you infer the most probable HMM, the trajectory, the Viterbi path, and everything else you want, all the parameters of the model. You can do this pretty easily on a computer with fairly standard methodology; it's an HMM, after all.
So that's what we did: we applied it to the diarisation data, and we did okay, but not particularly well. We identified one problem we needed to solve before we could start to get satisfactory performance, and that was that we had a little bit too much state-splitting going on. Here's a little synthetic problem: here's time; there were three states in our synthetic data, as you can see, so this state was on for a few time frames, and then this state, and so on. Here was the data, and you can sort of see that there are three states, but it's pretty noisy, and it would be kind of hard to tell just by looking at it. Here is the output of our inference system, our computer program for the HDP-HMM. It did find three states, but then it actually found a fourth state, and the problem was that the parameters for state three and state four, the emission distributions, happened to be just about the same. That can happen, and it happened here. So when you were in state three or four, from the outside world's point of view the emission probabilities were basically the same; the system was doing perfectly well, it had high likelihood, and it was flickering between states three and four, and that also has high likelihood, so why not? There's nothing that prevents it from doing that. But we don't want that, of course, because now we do badly on the diarisation problem: it thought there were four people in the room when there were really only three.
Okay, so we're being Bayesian in this problem, and so it's natural to put in a little bit more prior knowledge; this is where we put the Bayes in. The extra bit of knowledge is that people don't tend to talk for millisecond-scale time intervals; they tend to talk for seconds. So we add one extra parameter, a self-transition probability in the HMM: the diagonal of this infinite transition matrix is going to be treated as special and gets an extra boost. So we have one extra parameter, which is the extra boost for self-transitions, something like a semi-Markov HMM. We call this the sticky HDP-HMM: it has exactly the transition distributions as before, plus one parameter which boosts the self-transitions. So if this is the distribution you had before, we add a little bit to it, a little bit of boost, for the self-transition. And that parameter, since we're being Bayesian, is again a parameter to be inferred given the data: it's random, it has a posterior distribution.
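In symbols, roughly, this is the standard way the sticky construction is written (beta is the shared mother distribution over states, delta_j a point mass at the current state j, and kappa the extra self-transition mass):

```latex
\pi_j \mid \alpha, \kappa, \beta \;\sim\;
\mathrm{DP}\!\left(\alpha + \kappa,\;
\frac{\alpha\,\beta + \kappa\,\delta_j}{\alpha + \kappa}\right).
```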
Okay, so we put that into our system, and now we're ready to report some results. We went back to the speaker diarisation problem and ran this on data; I think this was the 2007 NIST Rich Transcription evaluation data. We didn't compete in that evaluation; we simply took the data and compared to what had been done by people who did compete. So this is diarisation error rate, and the ICSI results had been the state of the art for this at that time; I don't remember whether they still are, but they were the best result, by a wide margin, at the time. This is over the twenty-one meetings, and these are comparisons to ICSI, just to give a flavour. Our sticky results are basically comparable to the red ones, which are the ICSI results, if you just scan through here. The green ones are the non-sticky HDP-HMM, and you can see they're worse; that was a sign to us that we were onto something reasonable, that something had not been quite working before. So head to head, we're basically comparable with the ICSI results at this point. I do want to say that my goal in showing these numbers is not to say that we beat anybody or even that we're competitive; this was 2007, and we're not speech people, we didn't try to compete. But we got results that are in fact comparable with a state-of-the-art system, and the main point I want to make is that this was done by Emily Fox, a grad student who visited my group for the summer: she learned about the HDP-HMM, she implemented it, and she did all of this as a one-summer project. So this is not that hard to do: it's a tool, just like the HMM, that you can learn and use, and you can get competitive results doing that.
We have not pursued this, and I know that the ICSI team has since done a discriminative method that gets better numbers. But all these numbers are getting pretty good, and I think this particular approach, if pursued, can be among the best of what's out there. I recall when HMMs first came about, I was actually a grad student at the time, and the very first paper on HMMs gave numerical results compared to dynamic time warping; they were okay, but not greatly better. But of course that was enough, and it set the whole field off heading in that direction. I would like to think that this could also be the case for these Bayesian nonparametric methods. They're easy to implement; they're just a small twist on what you're used to: they're just HMMs with a little bit extra, you can move to a new table in the restaurant, and there's a little bit of extra bookkeeping for multiple restaurants. They really are implementable and robust: one student can implement it, and it works well. So I do think they're going to play a role.
Here are some examples of actual meetings. These are ones where the diarisation error rates for both of us were quite small; you can see that here we basically solved the problem. And here's a meeting where things were a little bit worse. There was one meeting on which we did particularly badly; it turned out that it wasn't that the model was bad, it was that the mixing of the MCMC algorithm we were using was quite slow on that meeting, and we hadn't mixed by the time we stopped. So that is still an open issue: how to make sure the Markov chains mix, if we're going to use Markov chain Monte Carlo.
Okay, so that's all I want to say about diarisation. Let me just briefly mention a couple of other applications of the HDP; it has been used on many other kinds of problems, not just HMMs. In fact you can use it for PCFGs, for problems with context-free grammars. In this case you have a parse tree, and the rules are like clusters when you're doing statistical NLP: you grow the number of rules as you see more and more data. And the same rule can appear in multiple places in the parse tree; that's why you have multiple restaurants in the Chinese restaurant franchise, for exactly the same reason: you want to share these distributions across different locations of the parse tree. So we've built a system that does that, and we're able to infer the most probable rule sets from data, for parsing, to build a parser.
Okay, so that was the Dirichlet process, and I could keep talking about Dirichlet processes a lot longer, a lot of people are working on them, but I want to move on and tell you about some of the other stochastic processes that are in the toolbox. One I'm particularly interested in these days is called the beta process. With the Dirichlet process, when you go into the restaurant you have to sit at one and only one table; the beta process allows you to go into the restaurant and sit at multiple tables. So you should think about the tables now not as clusters but as features, a bit-vector description of an entity. If you sit at tables one, three, and seventeen, and I sit at three, seventeen, and thirty-five, we have a bit of overlap with each other, and the bit vectors can overlap in interesting ways. That's what the beta process allows us to do, and the Dirichlet process does not. Okay, now, beyond the beta process (I'll tell you a bit more about the beta process in a moment) let me tell you briefly about the general framework.
Kingman had a very important paper, in nineteen sixty-seven or sixty-eight, called "Completely random measures". The beta process is an example of a completely random measure, and these random measures are really simple to talk about and to work with. All they are: they are random measures on a space, just like before, but what's new is that they assign independent mass to non-overlapping subsets of the space. Here's a picture of it. We've got Omega here, some arbitrary space; here's a set A and here's a set B, and this red spiky thing is a random measure, a discrete measure, on the space. So there's a random amount of mass that falls into set A, and a random amount that falls into B, and if those random masses are independent, because the sets are non-overlapping, then the random measure is called completely random. It's a really nice concept: it leads to divide-and-conquer algorithms; it's basically a computational concept. Okay, and there are lots of examples: you know what Brownian motion is, and Brownian motion turns out to be a special case of this; gamma processes are, and all kinds of other interesting objects. The Dirichlet process is not a completely random process, but it is a normalized gamma process.
Okay, so now, Kingman had a very beautiful result characterising completely random processes, and it turns out that they can all be derived from Poisson point processes; the Poisson process lies behind all of this. It's a really beautiful construction, and I'll tell you about it briefly. What we're trying to do, remember, is this: we have this space Omega and we're trying to put random measures on it. I already put Dirichlet process measures on spaces, and now I want to be more general and ask how to put all kinds of other random measures on a space. What you do is take the original space Omega and cross it with the real line, so you look at the product space Omega cross R, and I'm going to put a Poisson process on that product space. How do you put a Poisson process on things? Well, you have to tell me the rate function. You're probably familiar with the homogeneous Poisson process, where you have a flat rate, a constant rate function, and then the number of points falling in a little interval, or in this case a little set, is a Poisson random variable with some rate. Here we let the rate vary, so the Poisson rate for a little set is the integral of a rate function over that set. So you have to write down the rate function, here it is, and I integrate that rate function to get Poisson rates for every small set of this space. And here's a draw from that Poisson process: you can see the rate function was tilted up on the left, so I get more points down here and fewer up here. Now, having drawn from this Poisson process with this rate function, I forget all the machinery and look at this red thing here: I take each point and drop a line from it down to the Omega axis, and the resulting object is a measure on the Omega space. It's a discrete measure, it's random, and it's a completely random measure, because any mass that falls in some set A here and some set B here will be independent, because of the underlying Poisson process. The beautiful fact is that all completely random measures can be obtained this way: one direction is trivial, but the other direction, that the construction is complete, is quite nontrivial. So if you like completely random measures, and I do, this theorem says you can reduce the study of these measures to the study of the Poisson process, and just of the rate functions of the Poisson process. I think that's a real tool, and I think that in this field, for me and others, we will be studying rate measures for Poisson processes as a way to specify combinatorial structures on things.
Okay, so the particular example of the beta process has this as its rate function: it's a function of two arguments, the Omega part and the real part. The Omega part is just some measure that gives you the prior on atoms, just as it was G-nought before; now it's called B-nought. And then here, for the beta process, is the beta density, in fact an improper beta density, which gives you an infinite collection of points when you draw from the Poisson process. That's what we want: we don't want a finite number, that would be a parametric prior; we want a nonparametric prior, so we need this thing to be an improper density. Okay, that's probably too much math; let me just draw the picture. Here is that rate function: it has a singularity at the origin, it really tilts up sharply at the origin, and it breaks off, it stops, at one. So when we draw the Poisson point process, we get lots and lots of points that are very near the origin, and we now turn this into a measure by dropping down from the y-coordinate onto the x-axis: p i is the height of an atom and omega i is the location of the atom. You take that infinite sum, and that is now a random measure. It's another way of getting a random measure, beyond the stick-breaking from before, and it's much more general. In particular, these p i do not sum to one: they're between zero and one, but they do not sum to one, and they're independent.

So I like to think of these p i as coin-tossing probabilities, and I like to think of this as an infinite collection of coins, where most of the coins have probability nearly zero. If I toss all these coins, I'll get a few ones and lots of zeros, an infinite set of zeros and a few ones. If I do it again, I'll again get a few ones and a lot of zeros, and if I keep doing it, there will be a few places where I get lots of ones, a lot of overlap, and lots of other places where there's less.

Here's a picture showing that. Here I've drawn from the beta process, that blue thing right there; it's between zero and one and it's mostly nearly zero. And then here are a hundred draws. Think of these as coins, and think of these heights as the probability of coming up heads: where the draw has a big probability, there are relatively many ones as you go down the column in that region, and out here, where the probabilities are nearly zero, quantising gives almost all zeros. So think of a row of this matrix as a sparse, binary, infinite-dimensional random object; think of it as a feature vector for a bunch of entities. Here are a hundred entities, and if you think of this like a Chinese restaurant, entity number one hundred came into the restaurant and didn't just sit at one table: it sat at this table and this table and this table and this table and this table, five different tables. The next person comes in, number ninety-nine, and didn't sit at the first table but sat at the fourth table, and so on and so forth. So this captures the sitting pattern of a hundred people in this restaurant and all the tables they sat at, and the total number of tables, as you keep doing this, is going to grow; the matrix gets denser and denser as you fill it out, but it grows, again, at a slow rate. And for different settings of the parameters of this process you can get different growth rates and sharing patterns.

That's probably way too abstract for you to appreciate why you'd want to do it. Let me just say that there is also a restaurant metaphor here: there is something called the Indian buffet process, which captures this sitting pattern, the matrix, directly, without talking about the underlying beta process, just as the Chinese restaurant process captures the sitting pattern for the Dirichlet process without talking about the underlying Dirichlet process: it is literally the marginal probability under the beta process. But I'm going to skip that; I want to move on.
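As a concrete illustration, here is a small simulation of that seating pattern, the Indian buffet process, assuming a mass parameter alpha; the binary matrix Z is the feature matrix just described:

```python
import numpy as np

def indian_buffet_process(n, alpha=5.0, rng=None):
    """Return a binary matrix Z: row i lists the dishes (features) of customer i.

    Customer i+1 takes each existing dish k with probability m_k/(i+1),
    where m_k is how many earlier customers took it, then samples
    Poisson(alpha/(i+1)) brand-new dishes.
    """
    rng = np.random.default_rng() if rng is None else rng
    rows, counts = [], []              # counts[k] = customers who took dish k so far
    for i in range(n):
        old = [1 if rng.random() < m / (i + 1) else 0 for m in counts]
        new = rng.poisson(alpha / (i + 1))
        for k, took in enumerate(old):
            counts[k] += took
        counts.extend([1] * new)
        rows.append(old + [1] * new)
    K = len(counts)
    Z = np.zeros((n, K), dtype=int)
    for i, r in enumerate(rows):
        Z[i, :len(r)] = r
    return Z

Z = indian_buffet_process(100)
print(Z.shape)             # total number of dishes grows roughly like alpha * log n
print(Z.sum(axis=1)[:5])   # each customer picks a handful of features
```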
Let's go back to this problem, so I can make things concrete again. Remember the problem of the multiple time series: I have these people coming into the room and doing exercise routines, and I don't know how many routines there are in the library, I don't know who does which routine, and each person doesn't do just one routine, they do a subset of routines. So how am I going to model that? Well, it's a time series, and it has segments, and I'm going to use an HMM as my basic likelihood; but now I've got to put some structure around that to capture all this combinatorial structure in my problem.
The way I'm going to do that is to use the beta process. I have a slide on it, and I'm going to try to say it in English; you can stare at the slide and the math if you care to, but I encourage you to just listen to me tell you how it works. Okay, so suppose everybody in this room is going to come up here on the stage and do their exercise routines, and every one of you is going to do your own subset. There's an infinite library up there of possible routines you could do, and you choose, before you come up on stage, which subset of that library you're good at; you actually pick it out: maybe I'm going to do jumping jacks and twists, and that's all. Now, up in heaven there is an infinite-by-infinite transition matrix, the one that God possesses and we don't get to possess, and I pick out twists and jumping jacks from that infinite matrix: I pick out a little two-by-two submatrix (maybe twists is column number thirty-seven and jumping jacks is number forty-two), so I pick out those columns and the corresponding rows, I get a little two-by-two matrix, I bring it down from the infinite matrix, and I instantiate a classical HMM. It's actually an autoregressive HMM, so there can be a little bit of oscillation, because the emissions are autoregressive. And now I run a classical algorithm: for my realisation of my exercise routine, my emissions are the sixty-dimensional vector of positions, and it's just an HMM. I run the forward-backward algorithm, I get an update to the parameters, the local ones, and I go back up to the infinite matrix, and that little two-by-two submatrix I brought down gets the updates from Baum-Welch. Now here comes the next person up onto the stage, and he also takes, say, column thirty-five, but he also did one-oh-one and one-seventeen, so he pulls out a three-by-three subset of the matrix, he runs his HMM on his data, gets Baum-Welch updates, and then goes back to the infinite matrix and changes that three-by-three submatrix. And as we all keep doing that, we're changing overlapping subsets of that infinite matrix. And that is the beta process AR-HMM.
So I hope you got the spirit of that: it is just an HMM, one HMM for each of the people doing the exercise routines, and the beta process is the machine up here which gives us a feature vector: which subset of states, out of the infinite set of states, each person actually uses. So this may look a little bit complicated, but it's not; it's actually very easy to put on the computer, and again, this is something Emily came and implemented in a second summer with us, and we have the software to do it. So anyway, it's just an HMM, with a beta process prior on the way the parameters are structured.
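A very rough sketch of that generative structure, using a finite approximation to the beta process, Gaussian rather than autoregressive emissions, and no Baum-Welch updates; everything here is simplified for illustration only:

```python
import numpy as np

def bp_hmm_sketch(n_people, T, K_global=10, alpha=3.0, rng=None):
    """Each person activates a subset of a global state library, then runs an
    ordinary HMM restricted to that subset (Gaussian emissions stand in for
    the autoregressive emissions of the real model)."""
    rng = np.random.default_rng() if rng is None else rng
    feature_prob = rng.beta(alpha / K_global, 1.0, size=K_global)  # finite beta-process approximation
    library_means = rng.normal(0.0, 5.0, size=K_global)            # the shared "library" of behaviours
    sequences = []
    for _ in range(n_people):
        active = np.flatnonzero(rng.random(K_global) < feature_prob)
        if active.size == 0:
            active = np.array([rng.integers(K_global)])             # ensure at least one behaviour
        m = active.size
        Pi = rng.dirichlet(np.ones(m), size=m)                      # person-specific m-by-m transitions
        z = np.zeros(T, dtype=int)
        z[0] = rng.integers(m)
        for t in range(1, T):
            z[t] = rng.choice(m, p=Pi[z[t - 1]])
        x = rng.normal(library_means[active[z]], 1.0)               # emissions tied to the shared library
        sequences.append((active, x))
    return sequences

for active, x in bp_hmm_sketch(n_people=3, T=200):
    print("active behaviours:", active)
```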
Now, this actually really worked. So here are motion capture results, and this is a nontrivial problem: lots of methods don't do well at all on it. It's a bit qualitative how well we're doing, but I think if you look you'll see we're doing well. Here is the first feature: if you pick that state, that feature, and just hold it fixed in an HMM (these are autoregressive HMMs, so it oscillates; the oscillations are the arms going up and down), it has picked out the jumping jacks. This one, if you take that feature, put it in an HMM, and don't let any transitions happen, you get the knees wobbling back and forth. Here you get some kind of twisting motion of the hips; here's more wobbling of something or other; here's where the arms go in circles. Down near the bottom you start to get a little more subdivision of the states than maybe we might like, although Emily thinks there are actually good kinematic reasons why these are different. But this really did nail the problem: it took in these sixty-dimensional time series, multiple of them, did not know the number of segments or where the segments occurred, and it jointly segmented them across all the different users of the sensors.
All right. How am I doing on time? Okay, so I'm about out of time; I'm going to skip some slides here and just say something about this model.
This is something I worked on a number of years ago: it's an exchangeable, bag-of-words model for text, latent Dirichlet allocation, and it's fairly widely known. It's extremely simple; it's really too simple for lots of real-world phenomena, and so we've been working on making it more interesting. What the LDA model does is take a bag-of-words representation of text and re-represent the texts in terms of what are called topics. A topic is a probability distribution on words, and a given document can express a subset of topics, maybe sports and travel, so all the words in the document come from the sports topic or the travel topic; you get mixtures, what are called admixtures, of topics within a single document. The problem with this approach (one of many problems) is that words like function words tend to occur in every single topic, because if I have a document only about travel and the function words don't occur in that topic, I can't generate function words in that document, and I get a very low-probability document. So we'd like to separate out things like function words, and other kinds of abstract words, from more concrete words; we'd like an abstraction hierarchy. We've done that; this was in a paper that appeared in the JACM this year, and in some sense I'm prouder of this than of LDA; I view it as a kind of path forward for this field.
We call this the nested Chinese restaurant process. So this is a whole Chinese restaurant up here, and here's another Chinese restaurant; all of these are Chinese restaurants, and now they're organised in a tree. When you go to a Chinese restaurant you pick a table, like before, and that tells you which branch to take to get to the next restaurant. Think about it as your first night here in Prague: you pick some restaurant, and then that tells you which restaurant you go to on the second night, and then the third night, and so on, as you keep going through the nights of the conference. All right, so a document comes in and picks a path down this tree; another document comes in and picks another path, and the paths overlap, so you get these overlapping branching structures. And now you put a topic at every node of the tree, each node has a distribution on words, and a document has a path down the tree, which gives it a set of topics it can draw from, and it draws its words from that set of topics. Now, the node at the top is used by all documents, so it makes a lot of sense to put the function words up there, whereas this node down here is used by only a small set of the documents, so you might well put more concrete words down there. And that's what the statistics does when we fit this to data: it develops topics at the top of the tree which are more abstract, and more concrete topics as you go down the tree.
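A minimal sketch of how documents choose paths in the nested Chinese restaurant process, with the depth fixed at three for illustration (the node ids and parameter names are just for this sketch):

```python
import numpy as np

def nested_crp_paths(n_docs, depth=3, gamma=1.0, rng=None):
    """Assign each document a path of node ids down an infinitely wide tree.

    At each node, a document follows an existing child with probability
    proportional to how many earlier documents went through it, or opens
    a new child with probability proportional to gamma.
    """
    rng = np.random.default_rng() if rng is None else rng
    children = {}                     # node id -> list of [child id, count]
    next_id = 1                       # node 0 is the root, shared by every document
    paths = []
    for _ in range(n_docs):
        node, path = 0, [0]
        for _ in range(depth - 1):
            kids = children.setdefault(node, [])
            weights = np.array([c for _, c in kids] + [gamma], dtype=float)
            k = rng.choice(len(weights), p=weights / weights.sum())
            if k == len(kids):
                kids.append([next_id, 1])
                node, next_id = next_id, next_id + 1
            else:
                kids[k][1] += 1
                node = kids[k][0]
            path.append(node)
        paths.append(path)
    return paths

print(nested_crp_paths(10))   # overlapping prefixes = shared higher-level topics
```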
So, here's my last slide, actually; this is the result, and you'll have to turn your head. This is, I think, from Psych Review: these are abstracts from a particular journal in psychology, and this is the most probable tree. We've put a distribution on trees, everything is a distribution here, so this is the highest-probability tree, and at each node these are the highest-probability words for the topic at that node. So these are the high-probability words at the root, and we didn't have to strip away the function words by hand: in this case they just popped up at the root. At the next level we get a model-memory topic, a self-social-psychology topic, a motion-vision-binocular topic, and a drive-food-brain topic, so this looks kind of like social psychology, cognitive psychology, physiological psychology, and visual psychology. And it keeps going down: this is actually an infinite tree, and we're showing the first three levels of it.
So anyway, I hope that gives you a bit more of the flavour of the toolbox I've shown here: we took restaurants and nested them together, and with an object that puts distributions on trees we could suddenly do things like abstraction hierarchies and reason about them. So I'm done. This was just a tour of a few highlights of a literature; there are probably about a hundred people worldwide working actively on this topic. There's a conference, called Bayesian Nonparametrics, which will be held in Veracruz later this year (it's held every two years), and the field is really just getting going, so for the younger people in the audience especially, I highly encourage you to look at this: there's been a little bit of work bringing it into speech and signal processing, but there is going to be a whole lot more. On my publication page there are two papers which I'd point you to if you enjoyed the talk; they're written for the non-expert and give you lots of pointers to more of the literature. Thank you very much.
There's time for questions. Yes, thank you very much. So we do have time for some questions; please go ahead.
Good. So I have two questions. First of all, thank you very much for bringing this to what is possibly still an unfamiliar community for you. The first question I have is about scale: since this needs Monte Carlo simulation, doesn't that make it really, really difficult for large-scale problems?

You know, I don't believe that at all, and really, no one yet knows how these methods go to large scale.
So, you know, the EM algorithm: does it apply to large-scale problems or not? Yes and no. I mean, for some problems things really settle down quickly, after a very small number of iterations, maybe like two iterations of EM. The algorithms we used here are just Gibbs samplers, and it's essentially EM with one extra step: you do literally what EM did before, except you might change an indicator from this to this, or move to a brand-new table. So we actually don't know whether posterior inference at large scale is going to mix; maybe it mixes more quickly than we're used to from small data. And the last point to make is that MCMC is not essential; it's just what we happened to use, because we had three months to do the project. There are other procedures, split-and-merge algorithms, variational methods, and so on, for posterior inference that can be used here. So, you know, it's an open issue.
Now, the second question I have is that many people in this field really rely on a small set of models; for speech, in effect, we select a model we know how to use, where inference is easy and the model is easy to correct. So I just want to know: to what extent does this framework generalise?
Yeah, okay, that's a great question, I like it very much, and that's what those two papers I mentioned try to address: they try to show the toolbox, the range of other kinds of models we can consider. The Dirichlet process, for example, doesn't have power-law behaviour; you might want power-law behaviour, and there's something else, called the Pitman-Yor process, which gives you power laws, so you could do a Pitman-Yor version of the HDP-HMM. And so there are many, many generalisations: there are inverse-gamma forms of the weights, and so on and so forth. So, you know, this really is a toolbox. And I think also, to the other point: I'm a statistician, and we're used to models which, if you find the right cartoon, can really work surprisingly well. So the HMM, yes, it's not the right model for speech, and people have said that, and it still was the vehicle that got speech all the way to where it is, maybe badly, maybe wrongly, right? It is very useful to make cartoons that have nice statistical and computational properties and can be generalised. So I think the HMM by itself was too much of a box to be generalised easily, and I think that these methods go beyond that.
Once again, staying with the same problem... Let me see if I can summarise. So the question is about overfitting: why aren't you getting killed by overfitting in the nonparametric world?
Well, you know, it's easy for me to say that Bayesians don't have such problems with overfitting. I'm not always a Bayesian, but when I am, that's one of the reasons. So, to first order, we don't have overfitting troubles with these systems, even on pretty large-scale problems. In fact, we compared this, for example in our context-free grammar work, to the parser of Petrov et al., which uses a split-merge EM algorithm: there they had a big overfitting problem, which they dealt with, eventually, in a pretty effective way, but we didn't have to think about it; it just didn't come up. That's the first order. At second order, you have a couple of hyperparameters, and you do have to get them in the right range; outside the right range you can get some overfitting, but we're very robust to that. And you're right, it didn't totally work just out of the box: we had a little overfitting there, there was one more state than there should have been, but that's not too bad, and it was not too hard to do a little bit of engineering and think about the problem a little more: there was a time-scale prior that needed to be put in. So I think that's fine; we're sort of artists and engineers here. We're not trying to build a box that you can hand to a high-schooler and it's done; there will always be a little bit of thinking and a lot of engineering. But really (and I'm not always a Bayesian; I go back and forth every single day of my life) for a lot of these really high-dimensional, hard inference problems, where you have multiple things which need to collaborate, the hierarchy just gives you, from the get-go, a lot of control over those sorts of issues.
Next question: for speech we use mixture models within each state; is there a way to decide when to create a new state versus when to add a new mixture component?

Yeah, excellent question; I should have made that clear. The state-specific emission distribution here is not a single Gaussian, for reasons you guys know extremely well, and it's not a finite mixture of Gaussians either: it's a Dirichlet process mixture of Gaussians, and it turned out that was critical for us in getting this to work. Absolutely, we needed more, just like the classical GMM: a single Gaussian per state won't do it. But we don't want a fixed number of component distributions there; we want it to grow, so that as you get more and more data you get more and more mixture components in that emission distribution, and that arises when we put in the hierarchical Dirichlet process; it's a hierarchical Dirichlet process there too. And that, I hope, helps make the point about the toolbox better.
Well, we should take one more question.

We actually use the Bayesian hidden Markov model quite a bit for inference, and it's very good; I recommend everybody take a look. But one of the major problems is in the prior evolution, because we have to estimate a lot of hyperparameters; sometimes the number of hyperparameters is more than the number of parameters. Now you are talking about inferring the number of parameters, so do we also have to worry about inferring the number of hyperparameters?

No, we don't, and that's really critical. The classical Bayesian HMM had a lot of hyperparameters because it had a fixed K states, so the number of hyperparameters scaled with K, and then you used AIC or something else like that to choose K. We're not doing any of that: we have a distribution on the number of states, and the number of hyperparameters is constant, and small. There's sharing here: the sharing happens through the choices at the top, at the franchise level, so the hyperparameters all live at the top level, at the menu, that one there, and there's a very small number of them.

So there is a prior evolution according to how you set up the restaurants?

Correct. There's a very small number of hyperparameters here; in some sense it's almost too small, you can't quite believe it, and it probably has to be grown a little bit. But this is not your classical Bayesian HMM where you have the number of states fixed: we're inferring the number of states. It's random; it's the seating pattern in the restaurant; it grows under the prior like log n, and under the posterior it's simply random.

Okay. Thank you, thank you.
Well, I would like to ask us to thank the speaker once again. And now we have the coffee break outside.