okay so let me say a few words about the conference. the practical remarks were made this morning, and i guess there isn't any closing ceremony in the afternoon. [inaudible] so i would like to thank you all for being here

[inaudible] and actually the only thing, after having gone through the slides, is that [inaudible]. so, we will have the last plenary lecture, given by michael jordan, from the university of california, berkeley. [inaudible] will introduce our speaker.

thank you very much. it is a great honour to introduce michael jordan for this last plenary lecture, and to give you a view of his work. he is a legend in the fields of artificial intelligence and machine learning, and i should mention that his work is in fact [inaudible]. for the last [inaudible] years he has made fundamental, absolutely incredible contributions. the topic he is going to talk about today is nonparametric bayesian methods.

uh

in general, it should be stressed that he is quite famous for his work on graphical models, which have found application in speech and in many areas of statistical signal processing, and in other areas like natural language processing and statistical genetics.

his work has of course been recognised tremendously, both in statistics and in engineering. for instance, he has been elected to national academies [inaudible], and he is a fellow of the american association for the advancement of science, among other distinctions. he has also given distinguished lectures in statistics [inaudible], as well as received other special distinctions.

before that, i just want to read [inaudible] (and no, not the one who played basketball). i also found some words from students; michael has a large number of students, who have moved on to extremely successful positions. here's what i read: "michael jordan [inaudible] artificial intelligence [inaudible]". so please join me in welcoming him. thank you.

i'm delighted to be here and i thank the organisers very much for inviting me

uh so the main goal of the talk is to tell you a little bit about what the title means: what is bayesian nonparametrics? just to anticipate a little bit: first, bayesian just means you use probability pretty seriously, and this is already a community where probability has been used, probably more than in any other applied community i know of, for a long, long time. so that's the easy part of my story.

nonparametric doesn't mean there are no parameters; it's just the opposite: it means there's a growing number of parameters. so as you get more and more data, a bayesian nonparametric model will have more and more parameters. it grows at some rate slower than the number of data points, but it grows. i think that's kind of a key modern perspective

on statistics and signal processing, and the bayesian has a particular take on it. there really is a toolbox here, and what i'm going to try to do in the talk is give you an idea of what this toolbox is and how you can use it to solve applied problems. it has a beautiful mathematical structure, and in some sense it's really just getting going, just getting started, so i'm going to try to convince you that you too can contribute here, both on the fundamental mathematical side and on the applied side. a big thanks to my collaborators; their names will appear throughout the talk.

now, i have one slide on sort of historical philosophy. i put computer science here, but this could equally well be electrical engineering. all of these fields in the forties and fifties were somehow together, you know, with people like kolmogorov and turing and so on, and then they separated, because the problems got really hard, i believe. computer science went off and started looking at data structures and algorithms, and it didn't focus much on uncertainty. statistics remained focused on uncertainty, and didn't focus too much on algorithms and data structures. and so maybe bayesian nonparametrics is one venue where these two things are coming together, and mathematically it really amounts to using stochastic processes instead of classical parametric distributions. if you use a growing number of parameters, you have distributions indexed by large spaces, and those are just called stochastic processes.

so you're going to see, throughout the talk, some of the stochastic processes we've been looking at.

to put this in a little bit more of a classical statistical context: if you pick up a book on bayesian analysis, you will see the posterior written as proportional to the likelihood times the prior, usually written out in a parametric way, with theta indexing the parameter space. so this is the prior, this is the likelihood, and there's your posterior. in this talk we don't want theta to be a finite-dimensional object; we want it to be an infinite-dimensional object. so instead of writing p of theta, we write p of G. p of x given G is still going to be a likelihood; in this talk it's mainly going to be a hidden markov model, actually. so G will be the structure and all the parameters of a hidden markov model, the state space and so on, and this is just the usual hmm; and this will be a prior on some structural component of hmms. so i'll be talking mostly about this, and less about that.

all right, so the mathematical story is just that, instead of having a classical prior distribution, you have a distribution on an infinite-dimensional space, and that's not so strange mathematically: that is what a stochastic process is. so we have a prior stochastic process, and it's going to be multiplied by some kind of fairly classical likelihood, because once G is fixed, it doesn't matter what the size of the space is; G is just a fixed object, so the probability of data given G can typically be specified easily. and as we multiply this prior by this likelihood, we get ourselves a posterior stochastic process.

just to get some more concrete idea of what these things are: we'll be talking a little bit about distributions on trees of unbounded depth and unbounded fan-out; you get these in genetics and natural language processing quite a bit. stochastic processes on partitions arise a lot; we'll talk about clustering models where the number of clusters is not known a priori. we can put distributions on grammars, on sparse matrices; and, somewhat self-referentially, we can put distributions on distributions with these tools as well, so you can get the kind of recursion that you see in computer science. i'm going to talk about one particular large class of stochastic processes, the random measures; we'll get to what those are. they are one of the main ingredients of the toolbox i'm trying to convey in this talk.

okay, so here's the familiar problem. i learned about this from my colleagues at icsi; we've worked on this a bit, and i'm going to use it as a concrete way to show you how some of the methods i'm talking about can be rolled out and applied to a domain. i think everyone here knows what this problem is: there's a single microphone, there's a meeting going on, here's the waveform, and the ground truth is like this: bob spoke for a while, then john, then jill, then bob, and so on. we'd like to infer this from this, and we don't know how many people are in the room, and we don't know anything about the spectral characteristics of their speech.

here's the other problem i'll talk about; it's a little less traditional for you guys, i think, but it's a segmentation problem: a multivariate, multiple-time-series segmentation problem.

okay, so someone came into a room, and this is motion capture: we have sensors on their body, and i think it's about sixty-dimensional, so we have a sixty-dimensional time series for this person. and you can see it's segmentable: they did a little bit of jumping jacks, a little bit of splits, and so on and so forth. but we're not supposed to know that a priori: we don't know what the library of exercise routines was, and we don't know, for this particular person, which routines out of the big library they decided to use, how long they lasted, where the break points were, all of that. we'd like to infer all of that. moreover, this is just one person's sixty-dimensional time series; we have it for many different people. so each person is going to come in, and we'll get a sixty-dimensional time series for each one of them, and there will be some overlap: some people do twists, some people do touch-your-toes, and so forth. not everyone will do all the routines, but there will be some overlap, and we'd like to find that and then exploit it: if we learn a little bit about what twists look like for one person, we'd like that to be available to be used to segment the time series data for some other person. so it's a joint inference problem over multiple high-dimensional time series.

okay

okay, so everyone in this audience knows what hmms are; that's a plus of an audience to talk to, for that reason. so here are three of the diagrams that are often used; i like the graphical-model one here. here are the states; there's a markov chain on multinomial random variables; you have emissions coming off of those, and here are the parameters sitting on there. the core of it, of course, is the transition matrix. it's a K-by-K matrix, and each row of it is a next-state transition probability distribution. so given your current state, say three, or two, you have a transition out of that, and we represent that as this little discrete measure here: it's just a bunch of atoms, whose locations are the integers, and the masses associated with them are the transition probabilities going out from state two, say, to all the next states. there are K of these things, so it's a measure that has finite support, and we'll be talking about measures with infinite support here pretty soon. so there would be a pi two; pi one might be a different measure, maybe a sparse one with just two non-zero atoms; and so on for pi three and so on. okay, so those are the representations: the lattice, the graphical model, and this representation of the next-state transitions that we'll be talking about in the talk.

all right, so there are lots of issues with hmms that are still, in some sense, unresolved. how many states do we use? for the diarisation problem we don't know the number of speakers, so we have to infer that in some way, and in the segmentation problem we don't know the number of behaviours: we've got twists and jumping jacks and touch-your-toes and all; there's some unknown number of behaviours. i think even more interesting is this latter problem, which raises issues about the structure of the state space. how do you encode the notion that a particular time series makes use of a particular subset of the states? it's a combinatorial notion, and i don't know how to encode that in a classical hmm with just a set of states. how do you share states among time series? i don't know how to think about that: there's a structured, subset-structured state space that we don't have classical tools for. so that's going to go into my p of G: i'm going to put structural information in my prior, and, given a particular choice, i'm going to have that information available for inference at the hmm level.

all right, so bayesian nonparametrics solves these, i think, in a very elegant way, and lots of other problems too, so i'm going to try to show you what we do.

okay, so i'm again going to be talking about random measures here. i'm going to kind of skip this slide; there's not really much being said here; let's go right to an example of a random measure. everyone here knows what a measure is: you take a set in, and you get a number out, and the sets come from some sigma-algebra. we want to talk now about random measures: if we're going to be probabilistic and do inference on measures, we've got to have random measures. so, first of all, let's start with something really easy: let's put a random distribution on the integers, which can just go from one to infinity. a distribution on the integers is a set of numbers, countably many of them, that sum to one. we want a random distribution, so there should be a set of random numbers that sum to one. how do we do that? they have to be constrained in some way so that they can sum to one, and there are many ways you can think of doing that. here's a way that turns out to have beautiful combinatorial properties that allow us to develop inference algorithms based on this idea.

this is called stick-breaking; it's old in probability theory. so what you do is take a number of beta random variables; i'll take an infinite number of them, and the draws are independent: beta one, beta two, beta three. a beta distribution has two parameters normally, but i'll pin the first one to one, and the other will be free. beta random variables live between zero and one, and if you take alpha nought to be bigger and bigger, this thing tilts up to the left, so most of the mass is near the origin and less of the mass is near one; so think of getting lots of small numbers out of these beta draws. so what you do is take a stick that goes from zero to one, and we break off the first fraction of it according to beta one, and we'll call that pi one. the remainder of the stick is one minus beta one, and i then take a fraction of that, beta two; this little part right here is the amount beta two times the remainder, and i'll call that pi two. here's pi three, pi four, pi five, and so on. we keep breaking pieces of the stick, and the total stick has mass one, and so as we take this out to infinity, the pi's will eventually sum to one. it's actually easy to prove, if you want to prove it: these pi's will sum to one under this procedure.
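the stick-breaking recipe just described can be sketched in a few lines of python; the function and parameter names are illustrative, not from the talk:

```python
import random

def stick_breaking(alpha, num_weights, seed=0):
    """Sample the first `num_weights` stick-breaking weights pi_k.

    Each beta_k ~ Beta(1, alpha); pi_k = beta_k * prod_{j<k} (1 - beta_j),
    i.e. a fraction beta_k of whatever stick remains. The infinite
    sequence of pi_k sums to one with probability one.
    """
    rng = random.Random(seed)
    weights = []
    remaining = 1.0  # length of stick not yet broken off
    for _ in range(num_weights):
        beta_k = rng.betavariate(1.0, alpha)
        weights.append(beta_k * remaining)
        remaining *= 1.0 - beta_k
    return weights

pi = stick_breaking(alpha=2.0, num_weights=200)
# the truncated weights nearly exhaust the unit stick
print(sum(pi))
```

larger `alpha` makes each break smaller, so the mass spreads over more atoms, which is exactly the concentration-parameter behaviour described above.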

all right, so that's a way to get a random distribution: these pi's are a distribution for any fixed draw, and they're random, so we have a random distribution on the integers. having learned how to do that, we can now promote this up to distributions on arbitrary spaces, using that tool. here's how that's done.

you use these same pi's as before as weights in a little mixture model, an infinite mixture, and the mixture components are these delta functions; these are dirac deltas, they're unit masses at locations phi k, and the phi k are independent draws from a distribution G nought on some space. this can be some arbitrary space: it doesn't have to be a euclidean space; it can be a function space; it can be just about anything. so these atoms are living on that space, and they're being weighted by these mixing proportions pi k. each one of these has unit mass, and they're being weighted by these things, which sum to one, so this total object G is a spiky object: it's a measure of total mass one, and it lives on some space. it's random in two ways: because the pi's came from my stick-breaking, and because these phi k were drawn from some distribution G nought, which is the source of these things i'm calling atoms. so G nought is the source of the atoms, and G is these weighted atoms on some space.

all right, so that is a random measure: if i take G of some set a, it gives me a number, so it's a measure; it's additive and so on; and it's random, as we said, so it's a random measure. this is a very general way of getting atoms, and if i get the pi's according to stick-breaking, this particular object G has a name: it's called a dirichlet process, and we usually write it like this. it has these two parameters: the stick-breaking parameter and the source of atoms. but we can break sticks in different ways, and get the atoms in different ways, so this is actually a very useful general tool.

all right, so we can use this as a component of various kinds of statistical models. here's something called the dirichlet process mixture model, and it's probably just what you think it is. we use a draw G from a dirichlet process, drawn the way i described; it's on some space, but the space here is the real line, just so i can draw it. it has atom locations given by draws from some underlying distribution G nought, and heights on these atoms given by stick-breaking with this parameter alpha nought. so if i draw a specific G, i get an object looking like that, and the way it's used in a mixture model is just as a distribution. it's not your typical gaussian; it's a discrete distribution, and it's random. and i draw from it: so i might draw this atom in the middle there, which, you know, has a big height, so high probability; and having drawn that, that's a parameter now for some underlying likelihood. i wrote it down here: x i given theta i comes from some distribution indexed by theta. and i do that again and again; that's this plate here, that box; and that gives me a mixture model. in fact, there's some probability i'll get the same atom on successive draws of theta, and so those indices i would be coming from the same atom; we would think of those as belonging to the same cluster.

all right, so that's a dirichlet process mixture model. here's some data drawn from one of these things: as i draw more and more data, going across, you start seeing the blue dots, more and more data. the parameters, the thetas, in this case are means and covariances of gaussians, so theta is a whole big long vector. and you see that the number of distinct ones is growing: there are only a few distinct ones at first, and you get more and more, and they grow at some rate; in fact, that rate turns out to be the logarithm of n, the logarithm of the number of data points. so the number of parameters we have in this system is growing: we have a small number of parameters here, and as we keep drawing from G again and again and again, we get more and more parameters, and, like i said, it grows at a rate log n. so we're nonparametric here: we don't have a fixed parameter space no matter how much data we have. as you give me more and more data, i'm going to give you more and more parameters.

okay, so let's go back to this little slide here. as i've been alluding to, with some probability i picked theta one equal to this atom here in the middle, and with some probability theta two will equal that exact same atom, and so those two data points will come from the same cluster. you can ask what kind of combinatorial structure is induced by this kind of procedure: how often does that happen? what's the probability that theta two is equal to theta one, no matter where theta one occurred, the probability that lightning strikes twice in the same place? and what is that probability, in fact, not for this particular G, but over all possible choices of G under the procedure that i outlined, this p of G? you might think that would be a really hard problem to solve, that it sounds complicated, and it turns out it's really easy to solve, in terms of the answer.

to understand that, you need another stochastic process, called the chinese restaurant process. this one is on partitions, not on parameters. here i have a little drawing of it down here: you have customers coming into a restaurant with an infinite number of tables. these are the round tables of the chinese restaurant; that's why they're round. the first customer sits here with probability one; the second customer either joins them, with probability a half, or starts a new table, with probability a half; and in general the choice is to join a table proportional to the number of people already at the table. that's often called preferential attachment: you tend to get a few big clusters emerging and some small clusters after that. and you can easily prove, from this little setup here, that the number of occupied tables grows at rate log n.
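the seating rule just described can be simulated directly; a minimal sketch (names mine), which also lets you check the log-n table growth empirically:

```python
import math
import random

def chinese_restaurant_process(n, alpha, seed=0):
    """Seat n customers at a CRP with concentration alpha.

    Customer i joins an occupied table with probability proportional to
    its occupancy, or starts a new table with probability prop. to alpha.
    Returns the table assignment of each customer and the table sizes.
    """
    rng = random.Random(seed)
    counts = []        # counts[t] = number of people at table t
    assignments = []
    for i in range(n):
        # total unnormalized weight is (people seated so far) + alpha = i + alpha
        r = rng.random() * (i + alpha)
        acc = 0.0
        for t, c in enumerate(counts):
            acc += c
            if r < acc:
                counts[t] += 1
                assignments.append(t)
                break
        else:
            counts.append(1)                 # open a new table
            assignments.append(len(counts) - 1)
    return assignments, counts

assignments, counts = chinese_restaurant_process(n=10000, alpha=1.0)
# the number of occupied tables is close to alpha * log(n)
print(len(counts), round(math.log(10000), 2))
```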

okay, so it's a beautiful mathematical fact that, if you do the integrals i was talking about for the dirichlet process, this turns out to be the marginal probability, under the dirichlet process, of who sits with whom. so it's another perspective: it's the marginal under that other big probability measure, p of G.

you can make this into an explicit clustering model, effectively a mixture model, but now thought of in terms of clustering rather than mixture components. all you do is have the first person to sit at a table draw a parameter for that table from some prior, which we'll call G nought, the same G nought as before, actually; and everybody who sits at that table inherits that same parameter vector. so if phi one here is the mean and variance of a gaussian, then all the data points for the customers around this table will all come from the same gaussian: they have the same mean and covariance. okay, so this is actually exactly a marginal of the dirichlet process mixture model.

okay, so that's kind of a little tutorial. this has been available for, you know, forty years; the dirichlet process mixture model has been slowly getting taken up in applied communities, but very, very slowly.

now, to use this in a richer class of problems than just mixture models and clustering, you have to face the problem that you have multiple estimation problems: you have not just one set of data but multiple sets of data. in the hmm case that's going to come up because we have the current state, which indexes the next state: we have a bunch of distributions, the next-state distributions for all of the current states. so we don't have one distribution; we have a whole bunch of distributions, for all the rows of the transition matrix, and we need to estimate all those rows. in statistics we often have this arise: we have multiple estimation problems, with some parameter and data based on it, and another parameter and data based on that, and we often want to tie these things together, because there might be very little data over here and a lot of data over here, and if these are related problems it makes sense to tie them together. that's called a hierarchical model, and it's one of the main reasons to be a bayesian: it's very easy to make these kinds of hierarchical models that tie things together.

and so what you do is assume that the parameters for all these subproblems, these theta i's, are related by coming from an underlying parameter: you draw them randomly from some theta nought. and now all the data over here kind of goes up this tree and back down to here, and the posterior estimate of theta i depends on all the data; it's a convex combination of all the data.
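the "convex combination" remark can be made concrete with the textbook normal-normal hierarchy; this is a sketch with known variances (an assumption of mine, not something the talk specifies):

```python
def shrunk_mean(group_mean, n, sigma2, mu0, tau2):
    """Posterior mean of one group's parameter in a normal-normal hierarchy.

    With y_ij ~ N(theta_i, sigma2) and theta_i ~ N(mu0, tau2), the
    posterior mean of theta_i is a convex combination of the group's own
    sample mean and the shared mean mu0. Groups with little data are
    pulled harder toward the shared mean.
    """
    precision_data = n / sigma2      # information from this group's n points
    precision_prior = 1.0 / tau2     # information from the shared level
    w = precision_data / (precision_data + precision_prior)
    return w * group_mean + (1.0 - w) * mu0

# a data-rich group barely moves; a data-poor group shrinks toward mu0 = 0
print(shrunk_mean(5.0, n=100, sigma2=1.0, mu0=0.0, tau2=1.0))  # close to 5
print(shrunk_mean(5.0, n=1,   sigma2=1.0, mu0=0.0, tau2=1.0))  # pulled to 2.5
```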

so that arises out of the hierarchical bayesian perspective, and here's kind of a picture of it: here's the tree i just showed, and here's a graphical-model representation, using these boxes to represent replication; the outer box represents these m replications down below.

okay, so we can do that now with the dirichlet process, and this is a paper we published a few years ago, the hierarchical dirichlet process. this is, i think, really a very useful tool; it's just the hierarchy applied to dirichlet processes. we have the same kind of setup as before, but now we don't just have one G: we have G one, G two, G three, G four, for the different estimation problems, the different groups; and if you want to be concrete about it, these are the different rows of the hmm transition matrix. each one of them is a random measure, and we have a set of measures, all the different rows, and we want to tie them together so that they have the same next-state space that they all transition into.

so what we do is not draw these things independently, but we don't want to lump them all together either: we want different transition probabilities, but we don't want them to have completely separate state spaces; they've got to be tied together in some way. the way you do this is to have an extra layer in this graph: you first of all draw, from some source of atoms, a mother distribution G nought. it's now random, instead of being a fixed parameter like before, and it's the base measure that you use going into each of the children: they draw their atoms from G nought, and they reweight them according to their own stick-breaking process. okay, so this ties together a set of random measures: it makes them cooperate on what atoms they're all going to use and what weights they're going to use.

this also has a restaurant underneath it, which is a really easy way to understand these ideas. we now don't have just one restaurant; we have multiple restaurants, a franchise in general; and again, in my hmm application these will be the different current states, the rows of the transition matrix. so if you're in restaurant number two, you go in and you sit at a table proportional to the number of people already at the table, like before; that gives you sharing, or clustering, within a restaurant. and then, if i'm the first person to sit at a table, i go up to a global menu of dishes for the entire franchise, and i pick a dish from this menu; maybe i pick, um, the chicken up here. i bring it down to my table, and everybody who joins me at my table has to eat that same dish; we all share the same parameter. but i also put a check mark next to the dish, and if you are in some other restaurant, and you're the first person at your table, you go to this menu and you pick a dish proportional to the number of check marks next to the dish. so you get some dishes that appear that are popular across all of the restaurants, not just within one restaurant, and they get transferred between the restaurants according to this same kind of preferential attachment.
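the restaurant-franchise story above can be sketched as a small simulation; names and parameter values are illustrative, and this is a conceptual sketch rather than the inference machinery the talk refers to:

```python
import random

def franchise(customers_per_restaurant, alpha, gamma, seed=0):
    """Chinese restaurant franchise sketch.

    Within each restaurant, customers are seated as in a CRP with
    concentration alpha. Each newly opened table picks a franchise-wide
    dish proportional to how many tables already serve it (the "check
    marks"), with weight gamma for a brand-new dish. Returns, per
    restaurant, the dish index eaten by each customer, plus the global
    per-dish table counts.
    """
    rng = random.Random(seed)
    dish_table_counts = []               # check marks per dish, franchise-wide
    dishes_per_restaurant = []
    for n in customers_per_restaurant:
        table_counts, table_dishes, dishes = [], [], []
        for i in range(n):
            r = rng.random() * (i + alpha)
            acc, t = 0.0, None
            for j, c in enumerate(table_counts):
                acc += c
                if r < acc:
                    t = j
                    break
            if t is None:                # open a new table: choose its dish
                r2 = rng.random() * (sum(dish_table_counts) + gamma)
                acc2, dish = 0.0, None
                for k, m in enumerate(dish_table_counts):
                    acc2 += m
                    if r2 < acc2:
                        dish = k
                        break
                if dish is None:         # brand-new dish on the global menu
                    dish_table_counts.append(0)
                    dish = len(dish_table_counts) - 1
                dish_table_counts[dish] += 1
                table_counts.append(0)
                table_dishes.append(dish)
                t = len(table_counts) - 1
            table_counts[t] += 1
            dishes.append(table_dishes[t])
        dishes_per_restaurant.append(dishes)
    return dishes_per_restaurant, dish_table_counts

per_rest, menu = franchise([200, 200, 200], alpha=1.0, gamma=1.0)
# dishes (atoms) are shared across restaurants, not private to each one
print(len(menu), set(per_rest[0]) & set(per_rest[1]))
```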

okay, so back in the hmm setting, these will be the rows of the transition matrix. the number of possible transitions grows as you get more and more data, so instead of just having K states i have a growing number of states, and the states will be shared among all the different rows of the transition matrix according to this restaurant franchise. now, i have a really nice application of this that we published this year, which i'm going to skip in the interest of time in this talk, but there's a paper on it; i think it's kind of a killer app for this method.

you're basically trying to model some of the geometry of proteins. here are the kinds of angles that you get in proteins, and if you plot all the data for a particular amino acid on top of each other, you get kind of these fuzzy diagrams, which are not representative. so what you really want to do is not have just one estimation problem for the density of angles: you want to break them up according to the context, the neighbouring amino acids; not just one distribution, but a bunch of them, depending on the context. but when you break them up according to context, you have a sparse data problem: many of the contexts have very little data. it's a very similar kind of problem to lots of settings, including signal processing and speech. and so we want models like this that depend on the context of the neighbouring amino acids, and this setup is exactly that: the groups are the neighbourhoods. so for each context you have a group of data, and you don't treat them separately, and you don't lump them together; you have them cooperate via this tree. and every one of those dirichlet processes looks kind of like the dirichlet process on the synthetic data i showed you earlier, but you really have twenty amino acids on the left and twenty on the right, so you get four hundred estimation problems, four hundred contexts, so we have four hundred diagrams like this, and we get them to share atoms according to this procedure. i'll skip over some of the results, but they're impressive: this is the log probability improvement on a test set for all of the amino acids, the improvement over what had previously been done in the protein folding literature. so it's really quite a massive improvement.

okay, let's go back to hidden markov models, where the likelihood is a hidden markov model and our prior is now the structural machinery i've been telling you about. so now we have a hidden markov model where we don't have a fixed number of states; we call this the hdp-hmm. we have an infinite number of states: here's time, and here's the state space.

so when you're in one of these current states, you have a distribution on next states, and you get a kind of chinese restaurant process: here's table one, table two, table three, table four, and a growing number of tables; i grow that table list every time i hit state number two. and that's great: i can get a growing number of tables, but i want to share those tables between state number two and three and four and so on. so, as i've been talking about, i need this hierarchical dirichlet process to tie together all those transition distributions.

okay, so i'm going to go kind of quickly through these slides; again, the details aren't what i'm trying to convey here, just the idea, very briefly. what you do is draw a mother transition distribution over all states, and then each row, for each current state, is a kind of perturbation of that: so here's pi three; it takes this and kind of reweights all these atoms; here's pi four; and so on and so forth.

okay

all right, so i hope that was kind of clear: this is a way of doing hmms where you don't know the number of states a priori; you're actually putting a prior distribution on the number of states, and it can grow as you get more and more data. it's kind of a nice solution to a classical problem in hmm land.
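the "mother distribution plus per-row reweighting" construction can be sketched with a truncation; this is an illustrative sketch with made-up parameter names, not the actual sampler the talk goes on to mention (the real object is infinite, and the truncation is my simplification):

```python
import random

def hdp_hmm_rows(num_rows, trunc, gamma_conc, alpha, seed=0):
    """Truncated sketch of the HDP-HMM prior on transition rows.

    beta: the 'mother' next-state distribution over a truncation of the
    infinite state space, from stick-breaking with concentration
    gamma_conc. Each row pi_j ~ Dirichlet(alpha * beta): a reweighting
    of the same atoms, so every row transitions into one shared state
    space instead of inventing its own.
    """
    rng = random.Random(seed)
    beta, remaining = [], 1.0
    for _ in range(trunc):
        b = rng.betavariate(1.0, gamma_conc)
        beta.append(b * remaining)
        remaining *= 1.0 - b
    beta[-1] += remaining            # fold leftover stick into the last atom
    rows = []
    for _ in range(num_rows):
        # Dirichlet(alpha * beta) via normalized Gamma draws
        g = [rng.gammavariate(alpha * bk, 1.0) if bk > 0 else 0.0
             for bk in beta]
        total = sum(g)
        rows.append([x / total for x in g])
    return beta, rows

beta, rows = hdp_hmm_rows(num_rows=4, trunc=20, gamma_conc=3.0, alpha=5.0)
# every row is a proper distribution over the same shared 20 states
print([round(sum(r), 6) for r in rows])
```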

so we implemented this. there is a simple sampling procedure; we did a slightly non-standard sampler, but there's a whole bunch of procedures that can be used to do posterior inference with this. given some emissions, some data, you infer the most probable hmm, the trajectory, the viterbi path, and everything else you want, all the parameters. you can do this pretty easily on a computer, with kind of standard methodology. it's hmm inference.

all right, so what we did is apply this to the diarisation data, and we did okay, but not particularly well. we identified one problem we needed to solve before we could start to get performance that was satisfactory, and that was that we had a little bit too much state-splitting going on. so here's a little synthetic problem, in which there were three states in our synthetic data, as you can see here: this state was on for a few time frames, and then this state, and so on. here was the data, and you can sort of see that there are three states here, but, you know, it's pretty noisy, and it would be kind of hard to tell just by looking at it. here is the output of our inference system, our computer program, for the hdp-hmm, and it did find three states here, but then it actually had a fourth state, and the problem was that the parameters for states three and four, the emission distributions, happened to be just about the same. that can happen, and it happened here. so when you were in state three or four, from the outside world's point of view the emission probabilities were basically the same, so the system was doing perfectly well: it had high likelihood, and it was flickering between states three and four, and again it had high likelihood that way, so why not? there's nothing that prevents it from doing that. but we don't want that, of course, because now we'd do badly on the diarisation problem: we'd think there were four people in the room when there are really only three.

okay, so we're being bayesian in this problem, and so why don't we put in a little bit more prior

knowledge; this is where we put the bayes in at this point, so let's put in a little bit more, which

is that people don't tend to talk

for microsecond or millisecond time intervals

they tend to talk for seconds

so we have one extra parameter, a self-transition probability in the HMM: the diagonal of this infinite matrix

is going to be treated as special and get an extra boost

right, so we have one extra parameter, which is the extra boost for self-transitions, something like a,

um,

a semi-markov,

a semi-markov HMM

okay

and so we call that the sticky HDP-HMM; it just has the transition

distributions as before

plus one parameter which boosts the self-transitions

so if this is the distribution you had before,

then we add a little bit to that, to, um,

give a little bit of boost

for the, uh, self-transition

okay

that parameter, since we're being bayesian, is again

a parameter to be inferred given the data; it's random

it has a posterior distribution
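The one-parameter change just described can be sketched using the common weak-limit (truncated) form of the sticky prior: each transition row gets extra Dirichlet mass kappa on its own diagonal entry. The values of alpha, kappa, and the truncation level here are illustrative assumptions, not taken from the talk.

```python
import numpy as np

rng = np.random.default_rng(1)

def transition_rows(beta, alpha=2.0, kappa=0.0):
    """Draw rows pi_j ~ Dir(alpha * beta + kappa * delta_j): the usual
    HDP-HMM transition prior plus extra mass kappa on the diagonal."""
    L = len(beta)
    rows = []
    for j in range(L):
        conc = alpha * beta.copy()
        conc[j] += kappa          # the single extra "sticky" parameter
        rows.append(rng.dirichlet(conc))
    return np.vstack(rows)

beta = rng.dirichlet(np.full(8, 2.0))        # shared top-level weights
pi_sticky = transition_rows(beta, kappa=10.0)
pi_plain = transition_rows(beta, kappa=0.0)

# The boost concentrates prior mass on self-transitions.
sticky_diag = np.diag(pi_sticky).mean()
plain_diag = np.diag(pi_plain).mean()
```

Setting kappa to zero recovers the ordinary HDP-HMM, which is why the non-sticky model appears as a special case in the comparisons that follow.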

okay, so uh, we put that into our system, and um, we're now ready to report some results

okay, so we went back to the speaker diarization problem

and um

implemented this on data; uh, i think this was the two thousand seven, uh, NIST RT competition

data; we didn't compete in this, we simply took the data and compared to what had been done

by people who do compete

um

okay, so uh, this is diarization error rate

and the icsi results had been the state of the art, uh, for this at that time, and i don't remember if they

still are, um, but they were the best result,

and by a wide margin, at the time

um

and this is over the twenty-one meetings

and these are comparisons to the icsi results, just to get a flavour

um, so our sticky results are basically comparable to the red ones, which are the icsi's

so if you just kind of scan through here,

the green ones are the non-sticky HDP-HMM, and you can see they're worse, and that was a sign to us

that we actually were reasoning about this well; something was not quite working without the stickiness

so if you do a head-to-head,

basically we were comparable with the icsi results at this time

i do want to say that

my goal in showing these numbers is not to say that we

beat anybody or that we're competitive; this was two thousand seven

we're not, we're not

speech people, we didn't try to compete

um, but we got results that are in fact, you know, comparable with a state-of-the-art system, and the

main point i want to make is that this was done by

emily fox, a grad student who came to visit my group in the summer, and she learned about

the HDP-HMM, and she implemented this and did all of that in one summer project

so this is not that hard to do; it's a tool, just like the HMM: you can learn it and you

can use it and you can get,

you know, competitive results doing that

um, we have not pursued this, and you know, i know that the icsi team actually did a discriminative method

after that that has better numbers

um

but all these numbers are getting pretty good, and i think that this particular approach, if, um,

if pursued, can be,

you know, one of the best that's out there

i recall when HMMs first came about; i was actually a grad student at the

time

um, the very first paper on HMMs gives numerical results compared to dynamic time warping

and they were okay, but not greatly better

um, but of course that was enough, and that set the whole field off

heading in that direction; i would like to think that this could be the case also for these

bayesian nonparametric methods

they're easy to implement

they're just a small twist on what you're used to; they're just HMMs with a little bit extra: you

can move to a new table in the restaurant,

a little bit of extra indexing for multiple restaurants; that's really all there is to implement

they're robust; you know, one student can just implement it and it works

well

so

i do think they're going to play a role

um

okay, so here are some examples of actual meetings; these are the diarization error rates, and for both of these they're

quite small: you can see here we basically solved the problem

and here's a meeting where things were a little bit worse

there was one meeting on which we did particularly badly

it turned out that it wasn't the model that was bad; it turned out that the mixing of the mcmc algorithm

we used for this was quite slow on that meeting

it hadn't mixed by the time we stopped

and um, so that's still an ongoing issue: how to

make sure the markov chains mix,

if we're going to use markov chains

okay, so that's almost all i want to say, but let me, um,

uh,

uh, just

briefly mention a couple of other applications of the HDP; it's been used in many other kinds

of problems, not just for HMMs

in fact you can use these for PCFGs, for problems with context-free grammars

and in this case you have a parse tree

and the number of rules;

rules are like the clusters when you're doing statistical NLP

and you grow the number of clusters as you see more and more data

and the same rule can appear in multiple places in the parse tree

that's why you have multiple restaurants in the chinese restaurant franchise, for exactly the same reason:

you want to share

these rules

in different locations of the parse tree, so we have

built a system that does that and

um

are able to then infer the most probable

rule sets

uh, from data,

parse,

build parsers

um, okay, so that was the dirichlet process

and um, i could keep talking about dirichlet processes a lot more, and a lot of people work on them, but

i want to move on

and tell you about some of the other, uh, stochastic processes that are in the toolbox

so one i'm particularly interested in these days is called the beta process

whereas with the dirichlet process, when you go to the restaurant, you have to sit at one and one only,

and only one, table,

the beta process allows you to go to the restaurant and sit at multiple tables

so you want to think about the tables now not as clusters but as features, a bit-vector description

of entities

so if you sit at tables one, three, and seventeen,

and i sit at three, seventeen, and thirty-five, we're able to have a bit of overlap with each other

so bit vectors can overlap in interesting ways

and so that's what the beta process allows us to do that the dirichlet process does not

um

okay, now, beyond the beta process: i'll tell you a bit more about the beta process in a minute, but

let me tell you briefly about the general framework

um, so kingman had a very important paper, uh, in nineteen sixty-seven or sixty-eight, uh, called

"completely random measures"

the beta process is an example of a completely random measure

and completely random measures are really simple to talk about and to work with

all they are: they are measures, they're random measures on a space, just like before

but what's new here is that they assign independent mass to non-overlapping subsets

of the space; here's a picture of this

on the next slide,

here we've got, here, some arbitrary space

here's a set A and here's a set B

and here, this red object is a random measure, a discrete measure on this space

and so there's a random amount of mass that fell into set A, and a random amount that fell into

B

and if those random variables,

um,

uh, the random masses here,

are independent,

because these are non-overlapping sets,

then this random measure is called completely random

so it's a really nice concept; it leads to divide-and-conquer algorithms; it's basically,

um, you know,

a computational concept

okay, so uh, there are lots and lots of examples; you know what brownian motion is, and

brownian motion, it turns out, is a special case of this

gamma processes, and all kinds of other interesting objects

the dirichlet process is not a completely random process, but it's a normalized gamma process

okay, so now,

um, kingman had a very beautiful result, um,

characterising completely random processes

and it turns out that they can be derived from a poisson point process

so the poisson process lies behind all of this

and it's a really beautiful construction, so let me tell you about it really briefly; what we're trying to do, remember,

is we have this space omega

we're trying to put random measures on it; our whole job is to put dirichlet-process-like measures on spaces,

and i now want to be more general and put all kinds of other random measures on spaces

what you do is you take the original space omega and you cross it with the real

line; you look at the product space, omega cross the reals

okay

i'm going to put a poisson process on that product space

how do you put a poisson process on things?

well, you have to tell me the rate function

uh, you're probably familiar with the homogeneous poisson process, where you have a flat rate, a constant rate function

and then for every little interval, or in this case every little set,

um, the number of points falling there is a poisson random variable

uh, with some rate

uh, and in general the rate can vary, so now the poisson, uh, the poisson rate is an integral

of a rate function over that little set

so you have to write down the rate function; here it is

and so i integrate that rate function to get, um,

a poisson rate for every little small set of this space

and here's a draw from that poisson point process; you see, the rate function was tilted up on the left, so

i got more points down here and a few more up here

now, having drawn from this, uh, poisson process with this rate function, i then forget all this machinery and look

at this red thing here

i take each X, and i drop a line from that X down to the omega axis

and now the resulting object is a measure on the omega space; it's a discrete measure

and it's random

and it's a completely random measure, because any, uh, mass that falls in some set A here

and some set B here will be independent,

because of the underlying poisson process

the beautiful fact is that all completely random measures are gotten this way

okay, so one direction is trivial; the other direction, completeness, is quite nontrivial

so if you like completely random measures, and i do,

uh, then this

theorem says you can reduce the study of

these measures to the study of the poisson process,

and just the rate functions of the poisson process

so i think that's a tool

i think that

in this field, for me and others, we will be studying rate measures for poisson processes

as a way to specify

combinatorial structures on things
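As a sketch of the construction just described: draw a Poisson process on the product space Omega x (0, infinity), then drop each point to its Omega coordinate to obtain a discrete random measure. To keep the simulation finite, this sketch uses an integrable (proper) rate function; the beta process itself uses an improper rate with infinitely many atoms, so this illustrates the mechanism, not the beta process. The constants are made up.

```python
import numpy as np

rng = np.random.default_rng(2)

# Poisson process on the product space Omega x (0, inf), with Omega = [0, 1].
# For illustration use the integrable rate nu(domega, dp) = c * exp(-p),
# so the total rate is c and the draw has finitely many points.
c = 50.0
n_points = rng.poisson(c)                    # how many Poisson points
omegas = rng.uniform(0.0, 1.0, n_points)     # Omega coordinates (locations)
weights = rng.exponential(1.0, n_points)     # heights p, density exp(-p)

def measure(lo, hi):
    """Mass the resulting discrete random measure assigns to [lo, hi)."""
    inside = (omegas >= lo) & (omegas < hi)
    return weights[inside].sum()

# Masses of disjoint sets come from disjoint sets of Poisson points,
# hence are independent: the measure is completely random.
mass_A = measure(0.0, 0.3)
mass_B = measure(0.5, 0.9)
```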

okay, so the particular example of the beta process has this as its rate function; it's a function of

two arguments, the omega part and the real part

the omega part is just, uh, some measure that gives you, um,

the prior on atoms,

just like it was G-naught before; now it's called B-naught

and then here, for the beta process, is the, uh, the beta density,

an improper beta density

um

which gives you an infinite collection of points when you draw from the poisson process, which is what we want; we

don't want a finite number, that would be a parametric

prior; we want a nonparametric prior, so we need this thing to be

improper

okay, so that's probably too much math; let me just draw a picture

here is that rate function; it has a singularity at the origin, so it really tilts up sharply at the origin

and it breaks off, it stops at one

and so when you, uh, draw from the poisson point process you get lots and lots

of points that are very near the origin

and you now take this, um, by dropping down from the, uh,

uh,

Y coordinate down to the X axis,

and then it looks like this:

p_i is the height of an atom, and omega_i is the location of the atom

and you take that infinite sum, and that now is a random measure; this is another way of getting a

random measure; i told you about stick breaking before,

and this is another way of getting a random measure which is much more general

um, and in particular these p_i do not sum to one; they're between zero and one,

but they do not sum to one

and they're independent

so i like to think of these p's as coin-tossing probabilities, and i like to think that this is

now an infinite collection of coins

uh, and most of the coins have probability nearly zero; so i toss all of these coins,

and i'll get a few ones and lots of zeros

i'll get an infinite set of zeros and a few ones

and if i do that again, i'll again get a few ones and a lot of zeros

and if i keep doing that, there will be a few places where i'm going to get lots of ones,

a lot of overlap,

and lots of other places where there's less

so here's a picture showing that

so here i've drawn from the beta process; that's that blue thing right there; it's between zero and one, and it's mostly

nearly zero

and then here are a hundred draws:

think of these as coins, uh, and think of these heights as the probability of getting a head

when i draw, some atoms have a big probability, so here there are lots of ones, relatively lots of

ones as you go down the column,

and in some regions over here,

uh, where the probability is nearly zero, the column is, uh, almost all zeros

okay, so think of a row of this matrix now as a sparse binary,

infinite-dimensional random object

i think of this as a feature vector for a bunch of entities

so here are a hundred entities, and if you think of this like a chinese restaurant, entity number

one hundred came to the restaurant

and didn't just sit at one table; it sat at this table, this table, this table, and

this table: several different tables

the next person comes in, number ninety-nine, and didn't sit at the first table but sat at the

fourth table, and so on and so forth

right, so this captures the sitting pattern of a hundred people in this restaurant and all the

tables they sat at

and the total number of tables, as you keep doing this, is going to grow; it gets denser and denser as it fills out,

but it grows again at a slow rate
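The coin-tossing picture can be sketched with a standard finite approximation to the beta process (a beta-Bernoulli model with K coins, K large); the specific constants here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

K, n, gamma = 1000, 100, 5.0
# Finite approximation: K coins with probabilities Beta(gamma/K, 1); as
# K grows this approaches a beta process with mass gamma.  Most of the
# coin probabilities land very near zero.
p = rng.beta(gamma / K, 1.0, size=K)
# Each of n entities tosses every coin: a sparse binary feature matrix.
Z = rng.random((n, K)) < p
features_per_entity = Z.sum(axis=1)          # a few ones, lots of zeros
```

Each row of `Z` is one entity's sparse binary feature vector, and columns with large `p` are the "popular tables" where many rows overlap.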

okay, so um, and

you know, for different settings of the parameters of this process you can get different growth rates, depending on

the settings

so that's probably way too abstract for you to appreciate, um, why

you'd want to do that

let me just say that there is also a restaurant metaphor here

um, there's something called the indian buffet process, which captures the sitting pattern directly on the matrix,

not talking about the underlying beta process, just like the chinese restaurant process

captures the

sitting pattern for the dirichlet process without talking about the underlying dirichlet process; it's literally the marginal probability

under the beta process
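A sketch of that marginal process, the Indian buffet process: customer n takes each previously tried dish k with probability m_k/n, then samples a Poisson(gamma/n) number of new dishes. The mass parameter and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

def indian_buffet(n_customers, gamma=5.0):
    """Sample the binary sitting-pattern matrix of the IBP directly:
    customer n takes each existing dish k with probability m_k / n,
    then tries Poisson(gamma / n) brand-new dishes."""
    dish_counts = []                 # m_k for each dish sampled so far
    rows = []
    for n in range(1, n_customers + 1):
        row = [rng.random() < m / n for m in dish_counts]
        n_new = rng.poisson(gamma / n)
        row += [True] * n_new
        dish_counts = [m + t for m, t in zip(dish_counts, row)]
        dish_counts += [1] * n_new
        rows.append(row)
    Z = np.zeros((n_customers, len(dish_counts)), dtype=bool)
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row
    return Z

Z = indian_buffet(100, gamma=5.0)
# The total number of dishes grows slowly, on the order of gamma * log(n).
```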

so i'm going to skip that; um, i want to move to,

um,

oh, here we go: let's go back to this problem so i can make it concrete again

okay, so remember this problem of the multiple time series: i have these people coming into the room and

doing these exercise routines

and um, and i don't know how many

routines there are in the library, and i don't know who does what routine

and each person doesn't do just one routine; they do a subset of routines

right, so how am i going to model that?

well, it's time series and it has segments, so i'm going to use an hmm as my basic block,

but now i've got to put some structure around that to capture all this combinatorial structure of my problem

right

so the way i'm going to do that is i'm going to use the beta process

and uh, i have a slide on this, i think, and i'm going to try to say this in english; what you

see on the slide,

um, is the math, if you care to look,

but

uh, i encourage you to just listen to me tell you how it works

um, okay, so let's suppose everybody in this room is going to come up here on the stage and do their exercise

routines, so every one of you is going to do your own subset

there's an infinite library up there of possible routines you could do

you choose, before you come up on stage, which subset of that library you're good at; you actually pick them

out

so maybe i'm going to do jumping jacks and twists; that's my subset

right

now

up in heaven there is an infinite-by-infinite

transition matrix

um

that god possesses; we don't get to possess it

and uh, i pick out,

um, twist

and jumping jacks

from that infinite matrix; so i'm going to pick a little two-by-two submatrix; maybe twist is, uh,

column number thirty-seven,

and jumping jacks is number forty-two

so i pick out those columns and the corresponding rows; i get a little two-by-two

matrix, and i bring it down out of the infinite matrix

and i instantiate a classical hmm

it's actually an autoregressive hmm, since there's a little bit of oscillation in these movements

and now i run a classical hmm forward on my data stream, my exercise routine; my emissions are the

six-dimensional

vector of positions

it's just an hmm

right, now i run the forward-backward algorithm and i get an EM update to the parameters

but i don't keep them in my local copy; i go back up to the infinite matrix, to that little two-by-two

submatrix, and put the updates from the baum-welch back up there

right, now here comes the next person up onto the stage, and he takes, say, also column thirty-five,

but also, say, one-oh-one and one-seventeen,

so he pulls out a three-by-three subset of the matrix

he runs his hmm on his data,

gets a baum-welch update, and then goes back to the infinite matrix and changes that three-by-three submatrix

up there

and as we all keep doing that, we're going to be changing overlapping subsets of that infinite matrix in heaven

okay

and so that's the beta process

AR-HMM

so i hope that you kind of got the spirit of that: it is just an hmm, one hmm

for each of the people doing the exercise routines,

and then the beta process is this machinery up here which gives us a, a,

feature vector: which subset of states, out of the infinite set of states, do i actually use

so this maybe looks a little bit complicated; it's not, it's actually very easy to put on the

computer

and again, um, this is actually... again, emily came and implemented this in her second summer with us

um

and

built the software to do this

um

okay, so uh, anyway, it's just an hmm, but with the beta process prior on the way

the parameters are structured

now, this actually really worked

so um,

here are motion capture results

uh

and this is a nontrivial problem; lots of methods don't do, uh, well at all on this

problem

um, it's a bit qualitative as to how well we're doing, but i think you'll see, if you look at these,

that we're doing well; so here is the first feature

if you pick out that state, i.e. feature,

and you just

held that fixed, the hmm (it's an autoregressive hmm) will oscillate

the oscillations: the arms go up and down

this kind of picked out the jumping jacks

um

really

this one, if you take that feature and you put it in the hmm and you don't let any transitions happen,

you get the knees wobbling back and forth

here you get some kind of twisting motion of the hips

here's more wobbling of something or other; here's where the arms go in circles

so when you're near the bottom you start to get a little bit more subdivision of the states than maybe we

might like, although emily thinks that there are kind of

good kinematic reasons why these are actually different

but this really kind of nailed the problem: it took in these sixty-dimensional time series, multiple ones, did not know

about the number of segments and where the segments occur,

and jointly segmented them across all the different users

of the data

alright

um, how am i doing on time?

um

okay, so i'm about out of time, so i'm going to, uh, skip some slides here

and just say something about, um,

this model

uh

so this is something i worked on a number of years ago; it's, um, an exchangeable model, a bag-of-words

model for text, latent dirichlet allocation; it's fairly widely known

it's extremely simple; it's too simple for lots of real-world phenomena

and so we've been working to make it more interesting

so

what the lda model does is it takes a bag-of-words representation of text

and it re-represents texts in terms of a set of topics

a topic is a probability distribution on words

and so a given document can express a subset of topics, maybe sports and travel,

and so all the words in the document come from

the sports topic

or the travel topic; you get mixtures, so-called admixtures,

um, of topics within a single document
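The generative process just described can be sketched in a few lines; the tiny vocabulary and the two topic distributions here are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)

vocab = ["match", "score", "team", "flight", "hotel", "visa"]
# Two made-up topics: probability distributions over the vocabulary.
topics = np.array([
    [0.40, 0.30, 0.25, 0.02, 0.02, 0.01],   # "sports"
    [0.01, 0.02, 0.02, 0.35, 0.30, 0.30],   # "travel"
])
alpha = np.array([0.5, 0.5])

def generate_document(n_words):
    """LDA's generative story: draw a per-document topic mixture, then
    draw each word from its assigned topic's word distribution."""
    theta = rng.dirichlet(alpha)
    z = rng.choice(len(topics), size=n_words, p=theta)
    words = [vocab[rng.choice(len(vocab), p=topics[k])] for k in z]
    return theta, words

theta, doc = generate_document(10)
```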

um, the problem with this approach (it has many problems, but one of them) is that

words like,

uh, function words,

uh, tend to occur in every single topic

because if i have a document only about travel, and the function words don't occur in that topic,

i can't get function words in that document,

and i get a very low-probability document

right, so we'd like to separate out things like function words, and other kinds of abstract words, from more concrete words,

and make an abstraction hierarchy

right, so we have done that; there was a paper on this, and

in some sense i'm more proud of this than of lda, as a kind

of a

path forward

for this field

uh, we call this the nested chinese restaurant process

and so there's a whole chinese restaurant up here, and here's another chinese restaurant; all of these are

chinese restaurants, and they're now organised in a tree

when you go to a chinese restaurant, you pick a table like before, and that tells you which branch to

take

to go to the next restaurant

so you can think about this as: on the first night here in prague,

you pick some restaurant, and then the table you pick tells you, you know, what restaurant you need

to go to on the second night,

and then the third night, and so on, as you keep going through

the nights of the conference

alright, so a document comes down here and picks a path down this tree

another document comes in and picks another path, and they may or may not overlap

and so you get these

overlapping branching structures

alright, and now you put a topic at every node of this tree; it has a distribution on words

and then a document has a path down the tree; that gives it a set of topics it can draw from,

and it draws its words from that set of topics

right

now, the node at the top has been used by all documents

so it makes a lot of sense to put the function words up there

whereas this node down here is only being used by a small subset of the documents, so you might

well put some more concrete words down there

and that's what the statistics does when we fit this to data:

it develops topics at the top of the tree which are more abstract, and more concrete topics as

you go down the tree
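A sketch of path sampling under the nested CRP: each node runs an ordinary CRP over its children, so a document picks an existing branch with probability proportional to its count, or opens a new branch with probability proportional to gamma. The depth and parameters here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)

def ncrp_paths(n_docs, depth=3, gamma=1.0):
    """Sample document paths down a tree via the nested CRP: each node
    keeps CRP counts over its children, and a document either reuses an
    existing child (prob. count / (n + gamma)) or opens a new branch
    (prob. gamma / (n + gamma)) at every level."""
    counts = {}                    # node path (tuple) -> child counts
    paths = []
    for _ in range(n_docs):
        node = ()
        for _level in range(depth):
            kids = counts.setdefault(node, [])
            total = sum(kids) + gamma
            probs = [c / total for c in kids] + [gamma / total]
            choice = rng.choice(len(kids) + 1, p=probs)
            if choice == len(kids):
                kids.append(1)     # open a new restaurant / branch
            else:
                kids[choice] += 1
            node = node + (choice,)
        paths.append(node)
    return paths

paths = ncrp_paths(50)
# Documents share path prefixes, giving overlapping branching structure.
```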

so, yep

and here's my last slide, actually; this is a result (you may need to turn your head)

uh, of, um, fitting this to psychology data; the data are abstracts

from

psych review, uh; these are abstracts from a particular journal in psychology

and this is the most probable tree; we put a distribution on trees,

so everything is, uh,

a distribution here

and this is the highest-probability tree, and these are the highest-probability topics,

um,

or the high-probability words in the topic at each node; so at the root we get "a", "and", "of", "is": the high-

probability words at the root

so we didn't have to strip away the function words; in this case they just popped

right up to the root

um, at the next level we get a model of memory, uh, self and social psychology, motion, vision, binocular,

uh, drive, food, brain

so this looks kind of like social psychology, cognitive psychology, uh, physiological psychology, and visual psychology

and so on as you go down; this is actually an infinite tree, and we're showing the first three levels of it

um, so anyway, i hope that gives you more of a flavour of, you know, the toolbox notion here:

we

put restaurants together, and uh,

with other objects which give us distributions on trees, and we can then do things like abstraction hierarchies

and reason about them

um, so i'm done; uh, this was just kind of a tour of a few highlights of a literature; um,

there are probably about a hundred people worldwide working on this topic

uh, actively; we have a community, because there's a conference called bayesian nonparametrics, uh, which is

held every two years

um

uh

and it is just getting going, so especially for the younger people in the audience, uh,

i highly encourage you to look at this; there's been a little bit of work

bringing it into speech and signal processing, but there's going to be a whole lot more

um, so on my publication page there are two papers which i'd point you to, if you enjoyed the talk,

that are written for,

uh, a non-expert, and give you lots of pointers to, um, more

literature

thank you very much

there's time for questions

yes

thank you, and thanks again for the talk

so we do have time for

some questions, yes

good, so i have two questions; uh, first of all, thank you very much for this talk; it's

very useful for, uh, this

community

the first question i have, um:

isn't this, uh,

hopeless for large-scale problems,

since it depends on, uh,

markov chain monte carlo simulation,

which,

okay,

you would need to run to make it

really work on

real data?

you know, i don't believe that at all, and really no one knows how these go to large scale

uh, so,

you know, does the em algorithm apply to large-scale problems or not?

well, yes and no

i mean, for some problems,

things

really quickly settle down after a very small number of iterations, maybe like two iterations of em,

for some problems

these algorithms

that we use here are just gibbs samplers

and so it's like em, but with one extra step:

you do a little em-like sweep like before, uh, and you might change an indicator from this to this,

or, or go to a brand-new table

right, so we actually don't know whether, uh, posterior inference algorithms at large scale are going to mix

maybe they mix more quickly than we're used to from small data

and the last point to make is that this is not tied to mcmc; that's just what we happened

to use, because we had three months for the project

other procedures, split-and-merge algorithms, variational methods and so on, for posterior inference, can

be used here

um, so um,

it's still open

now, the second question i have is that many people in this hall

really,

um, doubt the use of, uh... well, the hmm is

not a really good model

for speech,

and we know this, but

we still select it, we use it, because inference is easy

and training is

easy; correct? so i just want to know your take:

to what extent

do these techniques

generalise?

yeah, okay, that's a great question; i like it very much; uh, so those two

papers i mentioned

answer your question; they try to show the toolbox, a range of other kinds of models we

can consider

uh, the dirichlet process, for example, doesn't have

power-law behavior

you might want power-law behavior;

there's something else, called the pitman-yor process, which gives you power laws

you could do a pitman-yor version of the HDP-HMM

um

and so there are many, many generalisations; there are inverse-gamma forms of the weights, and so on and so forth

so, um,

yeah, uh, you know,

this really is a toolbox

and i think also the point is, you know, i'm a statistician; we're used to models which, if you

find the right cartoon, can really work surprisingly well

and so the hmm, yes, it's not the right model for speech, but that's the least of it

i

mean, it still was, you know, the model that got speech all the way to where it

is,

maybe badly, maybe wrongly,

right

um

you know, but uh, it is very useful to make cartoons that have nice statistical and

computational properties and can be generalised

so i think the hmm as a box could be generalised easily, and i think that these

methods go beyond that

once again,

still staying with that problem,

can we back it up?

oh, you have to do that from the back?

do i have control over that?

or, another question:

i can summarise it this way

yeah, so the question is about overfitting: why aren't you getting killed by overfitting in the nonparametric world?

well, you know, i...

uh,

it's easy for me to say that bayesians don't have such problems with overfitting

i'm not always a bayesian, but when i am a bayesian,

um, you know, that's one of the reasons i am

um, so you know, to first order, we don't have overfitting troubles with these systems, even on

pretty large-scale problems,

in fact

um, so you know, we compared this, for example in our context-free grammars work, to

a parser that was fit with a split-merge em algorithm

and there they had

a big overfitting problem; they dealt with it,

eventually, in a pretty effective way

um, but we didn't have to think about that

it just didn't,

you know,

come up

um

you know, so that's the first order; at second order, you have a couple of hyperparameters, and if you don't get

them in the right range,

you can have some overfitting, but we're very robust to that

and so you're right, yeah, the inference didn't totally work; we just talked about that:

we had a little overfitting; there was one more state than there should have been

but that's not too bad

and it was not too hard: you do a little bit of engineering and think about the problem a little bit

more; there's a little bit of a time scale there that, you know, a prior needs to be put

on

so i think that's fine; we're sort of artists and engineers; we're not trying to build a box

that you can give to a high-schooler and it's done

there will always be a little bit of thinking

and a lot of engineering,

no doubt about it

um, but really, you know, i'm not always a bayesian; i've often wanted to be a non-bayesian;

i go back and forth every single day of my life

um

you know, but for a lot of these really high-dimensional, hard

inference problems, where you have multiple things which need to kind of

collaborate in a hierarchy,

bayes just gives you, you know, from the get-go, a lot of control over those sorts of things

next question

so for speech we use mixture models within each state; is there a way to

decide when to create a new state versus when to add a new mixture component?

yeah, excellent question; i should have

made that clear

so

the state-specific emission distribution here is not a single gaussian,

for reasons you guys know extremely well

is it a fixed mixture of gaussians?

no:

it's a dirichlet process mixture of gaussians

right, and it turned out that was critical for us to get this to work

absolutely; we needed more, just like with the classical gmm:

a single gaussian for speech won't do

and we don't want to have a fixed

L distributions there; we want it to grow: as you get more vocalisations, there are more and more mixture components in that

emission distribution, and that arises because we put in the hierarchical dirichlet process

it's a hierarchical dirichlet process

that helps to make the point about the toolbox even better

well, we should, uh, take one more question; one last one

uh, we actually use quite a bit of, uh,

the hidden markov model framework,

and it's very good, so many thanks; i mean, i'd encourage everybody to

take a look; but one of the major problems is in the

prior evolution, because we have to estimate a lot

of, uh, hyperparameters, um

so sometimes

the number of hyperparameters is more than

the number of parameters themselves, so

now, are you talking about

an infinite number of

parameters,

so will we also have an infinite number of hyperparameters? no, we don't, and that's really critical; so

the classical bayesian hmm had a lot of hyperparameters, because it had

a fixed K states, yes, and the number of hyperparameters scaled with the size K

right, and you had AIC or something else like that to choose K

oh,

we're not doing any of that

we have a distribution on the number of states, and the number of hyperparameters is constant,

and it's small

so there's sharing here; there's sharing via these, uh, choices

in the franchise:

the hyperparameters all live at the top level

okay, so at the end you have only a small number of them

a very small number; so there is, uh, a prior evolution according to how you give

the hyperparameters; you're correct, there's a very small number of hyperparameters here; in some sense almost

too small; if you can't believe it, it probably has to grow a little bit, to...

uh, that is very good; but the point is that

this is not your classical

bayesian hmm, where you have the number of states fixed

we infer the number of states; it's random

okay, it's the sitting pattern in the restaurant

it grows, under the prior, like log n, and the posterior is just random

thank you, thank you

well, i would like to ask us to thank him

once again for his talk

and now we have the, uh, coffee break outside