[Session chair:] Okay, so before the last plenary let me make a few practical remarks. The conference does not have a closing ceremony this afternoon, so I would like to thank you all for being here now. The only other thing to mention, after getting the slides set up, is that we had a technical problem that we have been working on for a while. We will now have the last plenary lecture, given by Michael Jordan from the University of California, Berkeley, and my colleague will introduce our speaker.

[Second chair:] Thank you very much. It is a great honour to introduce Michael Jordan. His work in artificial intelligence and machine learning is legendary, as are the students he has mentored. For many years he has worked on fundamental problems in machine learning and statistics; he is known in particular for his work on graphical models, which have found application in speech and in many areas of statistical signal processing, as well as in natural language processing and statistical genetics. His work has of course been recognised with tremendous distinctions, both in statistics and in engineering: he was recently elected to the National Academy of Sciences and the National Academy of Engineering, he is a Fellow of the American Association for the Advancement of Science, and he holds further distinctions and lectureships in statistics and machine learning. Michael also has a large number of former students who hold extremely successful positions, and I read some words from them before this session; they speak of him in the highest terms. Please join me in welcoming him.

[Michael Jordan:] Thank you, I'm delighted to be here, and I thank the organisers very much for inviting me. The main goal of the talk is to tell you a little bit about what the title means: what is Bayesian nonparametrics? Just to anticipate a little: first, "Bayesian" just means that you use probability pretty seriously, and speech is already a community where probability has been used, probably more than in any other applied community I know of, for a long, long time, so that's the easy part of my story. "Nonparametrics" doesn't mean there are no parameters; it's just the opposite. It means there is a growing number of parameters: as you get more and more data, a Bayesian nonparametric model has more and more parameters, growing at some rate with the number of data points. I think that's a key modern perspective on statistics and signal processing, and the Bayesian approach has a particular take on it. Bayesian nonparametrics really is a toolbox.
The way I'm going to try to do the talk is to give you an idea of what this toolbox is. You can use it to solve applied problems, it has a beautiful mathematical structure, and in some sense it's really just getting started, so I'm going to try to convince you that you too can contribute here, both on the fundamental mathematical side and on the applied side. My collaborators are listed at the bottom of the slide, and their names will appear throughout the talk; the work I'll show was done with them.

Now, I have one slide of historical philosophy. I happen to be in a computer science department, but this could equally well be electrical engineering. All of these fields in the forties and fifties were somehow together; you had people like Kolmogorov and Turing and so on. Then the fields separated, because the problems got really hard, I believe. Computer science went off and started looking at algorithms and data structures and didn't focus much on uncertainty; the other side focused on uncertainty and didn't focus much on algorithms and data structures. Maybe nonparametric Bayes is one venue where these two things are coming back together. Mathematically, it really amounts to using stochastic processes instead of classical parametric distributions: you use a growing number of parameters, so you have distributions indexed by large spaces, and those are just called stochastic processes. So throughout the talk you're going to see some of the stochastic processes we've been looking at.

To put this in a little more of a classical statistical context: if you pick up a book on Bayesian analysis you will see the posterior written as proportional to the likelihood times the prior, and it's usually written out in a parametric way, with the data indexed by a parameter theta living in a finite-dimensional parameter space. So there's the prior, there's the likelihood, and there's the posterior. In this talk we don't want theta to be a finite-dimensional object; we want it to be an infinite-dimensional object, so we write not P of theta but P of G. The likelihood, P of x given G, is still an ordinary likelihood; in this talk it is mainly going to be a hidden Markov model, actually. G will be the structure and all the parameters of a hidden Markov model, the state space and so on; one factor is just the usual HMM likelihood, and the other is a prior on the structural components of HMMs. I'll be talking mostly about the prior and less about the likelihood.

All right, so the mathematical story is simply that instead of a classical prior distribution you have a distribution on an infinite-dimensional space, and that's not so strange mathematically; that is exactly what a stochastic process is. We have a prior stochastic process, and it gets multiplied by a fairly classical likelihood, because once G is fixed it doesn't matter how big the space is; G is a fixed object and the probability of the data given G can typically be specified easily. We multiply this prior by this likelihood and we get ourselves a posterior stochastic process.
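To restate that in symbols (my notation, not a quote from the slides), with the infinite-dimensional object G in place of a finite-dimensional theta:

\[
p(G \mid x_1, \dots, x_n) \;\propto\; p(x_1, \dots, x_n \mid G)\, p(G),
\]

where p(G) is the prior stochastic process, p(x_1, ..., x_n | G) is an ordinary likelihood once G is fixed, and the left-hand side is the posterior stochastic process.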
Just to give a more concrete idea of what these objects are: we'll be talking a little about distributions on trees of unbounded depth and unbounded fan-out, which come up in genetics and in natural language processing quite a bit. Stochastic processes on partitions arise a lot; we'll talk about clustering models where the number of clusters is not known a priori. We can put distributions on grammars, on sparse matrices, on combinatorial objects; we can even put distributions on distributions with this toolbox, so you get the kind of recursive aspect that you see in computer science. I'm going to talk about one particular large class of stochastic processes, random measures; we'll get to what those are, and they are one of the main ingredients of the toolbox. So let me start.

Okay, here is a familiar problem. I learned about it from my colleagues at ICSI, we've worked on it a bit, and I'm going to use it as a concrete way to show you how some of the methods I'm talking about can be rolled out and applied to a domain. I think everyone here knows the problem: there's a single microphone, there's a meeting going on, here's the waveform, and the ground truth is like this: Bob spoke for a while, then John, then Jill, then Bob again, and so on. We would like to infer this from the waveform, and we don't know how many people are in the room, and we don't know anything about the spectral characteristics of their speech.

Here's the other problem I'll talk about, which is a little less traditional for you, I think: it's a multivariate segmentation problem, and in fact a multiple time series segmentation problem. Someone came into a room wearing motion capture sensors on their body, and the signal is about sixty-dimensional, so we have a sixty-dimensional time series for this person, and you can see it is segmentable: they did a little bit of jumping jacks, a little bit of another routine, and so on and so forth. But we're not supposed to know that a priori: we don't know what the library of exercise routines was, and we don't know, for this particular person, which routines out of that big library they decided to use, how long they lasted, or where the break points were. We would like to infer all of that. Moreover, this is just one person's sixty-dimensional time series; we have one for many different people. Each person comes in and we get a sixty-dimensional time series for each of them, and there will be some overlap: some people do twists, some people touch their toes, and so forth. Not everyone does every routine, but there will be some overlap, and we'd like to find it and then exploit it, so that if we learn a little about what a twist looks like for one person, that knowledge is available for segmenting the time series of some other person. So it's a joint inference problem over multiple high-dimensional time series.

Okay, everyone in this audience knows what HMMs are, which makes this a pleasant audience to talk to. Here are three of the diagrams that are often used; I like the graphical-model one, where here are the states, there's a Markov chain on them, these are multinomial random variables, you have emissions coming off of them, and here are the parameters sitting on top. The core of it, of course, is the transition matrix: it's a K by K matrix, and each row is a next-state transition probability distribution. So given your current state, say state two, you have a transition distribution out of that state, and we can represent it as this little discrete measure: a bunch of atoms at locations which are the integers, with masses attached, and the masses are the transition probabilities from state two to each of the next states. There are K of them, so it's a measure with finite support, and we'll soon be talking about measures with infinite support. Pi one might be a different measure, maybe a sparse one with only two non-zero atoms, and similarly pi three, and so on.
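In symbols (my restatement), row j of the transition matrix, viewed as a measure on the state labels, is

\[
\pi_j \;=\; \sum_{k=1}^{K} \pi_{jk}\, \delta_k ,
\]

where delta_k is a unit mass at state k and pi_{jk} is the probability of moving from state j to state k; the nonparametric versions later in the talk simply let this sum run over an infinite set of atoms.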
Okay, so those are the representations: the lattice, the graphical model, and this view of the next-state transitions as measures, which is the one I'll use throughout the talk.

All right, there are lots of issues with HMMs that are still, in some sense, unresolved. How many states should we use? For the diarization problem we don't know the number of speakers, so we have to infer it somehow, and in the segmentation problem we don't know the number of behaviours; I talked about twists and jumping jacks and touching your toes, and we don't know how many behaviours there are. Even more interesting, I think, is the latter problem, because it is really about the structure of the state space: how do you encode the notion that a particular time series makes use of a particular subset of the states? That's a combinatorial notion, and I don't know how to encode it in a classical HMM, which just has a set of states. How do you share states among time series? I don't know how to think about that classically; there's a structured, combinatorial state space that we don't have a classical way to handle. So that is going to go into my P of G: I'm going to put that structural information into my prior, and given a particular choice of prior, that information will be available for inference at the HMM level. Bayesian nonparametrics solves these problems, I think in a very elegant way, along with lots of other problems, so let me try to show you how we do that.

Okay, so I'm going to be talking about random measures. I'll skip this slide, there isn't much on it, and go right to an example of a random measure. Everyone here knows what a measure is: it's a set function, you put a set in and you get a number out, and the sets form some sigma algebra. We now want random measures, because we're going to be probabilistic and do inference on measures. First let's start with something really easy: let's put a random distribution on the integers. The integers go from one to infinity, and a distribution on them is a countable set of numbers that sum to one. We want a random distribution, so we need a set of random numbers that sum to one. How do we do that? They have to be dependent in some way so that they can sum to one, and there are many ways you might think of doing it. Here's a way that turns out to have beautiful combinatorial properties, properties that allow us to develop inference algorithms based on the idea; it's called stick breaking, and it's old in probability theory.

What you do is take a collection of beta random variables, an infinite number of them, drawn independently. A beta random variable has two parameters; I'm going to pin the first one to one and leave the other one free. Beta random variables live between zero and one, and if you make the free parameter bigger and bigger, the density tilts up to the left, so most of the mass is near the origin and less of it is near one; think of getting lots of small numbers out of these beta draws. So here's what we do. We have a stick that goes from zero to one. We break off a first fraction of it according to beta one and call that pi one; the remainder of the stick has length one minus beta one. Then we take a fraction of that remainder, beta two times the remainder, and call it pi two. Here's pi three, pi four, pi five, and so on; we keep breaking off pieces of the stick, and the total stick has mass one.
As we carry this out to infinity, the pi's will eventually sum to one; it's actually easy to prove, if you want to, that these pi's sum to one under this procedure. So now we have these pi's: for any fixed draw they are a distribution, and they are random, so we have a random distribution on the integers.

Having learned how to do that, we can promote it to distributions on arbitrary spaces using the same tool. Here's how. You use the same pi's as weights in a little mixture model, an infinite mixture, and the mixture components are delta functions: Dirac deltas, unit masses at locations phi k, where the phi k are independent draws from a distribution G0 on some space. That space can be arbitrary; it doesn't have to be Euclidean, it can be a function space, it can be a combinatorial space, it can be just about anything. So the atoms live on that space and are weighted by the mixing proportions pi k, which sum to one. The total object G is a spiky object; it's a measure, it has total mass one, it lives on that space, and it's random in two ways: the pi's came from my stick breaking, and the phi k were drawn from some distribution G0, which is the source of the things I'm calling atoms. So we have weighted atoms on some space. That is a random measure: if I evaluate G on some set A, I get a number, so it's a measure, countably additive and so on, and it's random. This is a very general way of getting random measures. If I generate the pi's by stick breaking, this particular object G has a name, the Dirichlet process, and we usually write it like this, with two parameters: the stick-breaking parameter and the source of atoms G0. But we can break sticks in different ways and get the atoms in different ways, so this is a genuinely useful general tool.

All right, we can use this as a component of various kinds of statistical models. Here's something called a Dirichlet process mixture model, and it's just what you'd think it is. We use a draw G from a Dirichlet process; it lives on some space, and here the space is the real line just to make it easy to draw. It has atoms at locations given by draws from some underlying distribution G0, and heights on those atoms given by stick breaking with parameter alpha0. If I draw a specific G, I get an object that looks like that. The mixture model then simply uses G as a distribution: it's not your typical Gaussian, it's a distribution that is itself random, and I draw from it. So I might draw this atom in the middle, which has a big height and therefore high probability, and that draw becomes the parameter of some underlying likelihood; I've written here that x i given theta i comes from some distribution indexed by theta. I do that again and again, which is what the plate, the box, means, and that gives me a mixture model. In fact, there's some probability that I get the same atom on several draws of theta, and those indices would then share the same atom; we would think of them as belonging to the same cluster. All right, so that's a Dirichlet process mixture model.
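A minimal sketch of the stick-breaking construction and the Dirichlet process mixture just described (my own illustration, not the speaker's code; the standard normal base measure, the Gaussian likelihood, and the truncation level are illustrative choices):

```python
# Stick breaking: pi_k = beta_k * prod_{j<k} (1 - beta_j), with beta_k ~ Beta(1, alpha).
# The atoms theta_k are independent draws from a base measure G0 (here N(0, 1)).
import numpy as np

rng = np.random.default_rng(0)

def dp_draw(alpha, n_atoms=1000):
    """Approximate draw G ~ DP(alpha, G0) via truncated stick breaking."""
    betas = rng.beta(1.0, alpha, size=n_atoms)                 # stick fractions
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    weights = betas * remaining                                # pi_k, summing to ~1
    atoms = rng.normal(0.0, 1.0, size=n_atoms)                 # theta_k ~ G0
    return weights, atoms

def dpmm_sample(alpha, n, obs_sd=0.1):
    """Draw n observations from a DP mixture: theta_i ~ G, x_i ~ N(theta_i, obs_sd^2)."""
    weights, atoms = dp_draw(alpha)
    idx = rng.choice(len(atoms), size=n, p=weights / weights.sum())
    thetas = atoms[idx]                                        # repeated atoms = shared clusters
    x = rng.normal(thetas, obs_sd)
    return x, thetas

x, thetas = dpmm_sample(alpha=5.0, n=200)
print("distinct clusters among 200 draws:", len(np.unique(thetas)))
```

The truncation at a finite number of atoms is only an approximation; the untruncated process has infinitely many atoms, almost all with negligible weight.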
Here's some data drawn from one of these models. As I draw more and more data, going across here, you start seeing more and more blue dots, and here are the parameters; the thetas in this case are means and covariances of Gaussians, so theta is a big long vector. You can see that the number of distinct parameters grows: there are only a few distinct ones at first, then more and more, and they grow at some rate. That rate turns out to be logarithmic in n, the number of data points. So the number of parameters in the system is growing: we start with a small number, and as we keep drawing from G again and again we get more and more parameters, and as I said, the number grows at rate log n. We're nonparametric here; we don't have a fixed parameter space, and no matter how much data you have, if you give me more data I will give you more parameters.

Okay, let's go back to this little slide. As I've been alluding to, with some probability I picked theta one equal to this atom in the middle, and with some probability theta two picked that exact same atom, so those two data points would come from the same cluster. You can ask what kind of combinatorial structure is induced by this procedure: how often does that happen? What's the probability that theta two equals theta one, wherever theta one happened to land, the probability that lightning strikes twice in the same place? And what is that probability not for one particular G but averaged over all possible choices of G under the prior P of G that I outlined? You might think that would be a really hard problem, but it turns out to be easy, in the sense that the answer is known. To understand the answer you need another stochastic process, called the Chinese restaurant process. It is a distribution on partitions, not on parameters, and I've drawn it down here: customers come into a restaurant with an infinite number of tables. The tables are round because it's a Chinese restaurant. The first customer sits at a table with probability one; the second customer joins them with probability one half or starts a new table with probability one half; and in general a new customer joins an existing table with probability proportional to the number of people already sitting there. That's often called preferential attachment: you get a few big clusters emerging, plus some small clusters. And you can easily prove from this little setup that the number of occupied tables grows at rate log n.

It's a beautiful mathematical fact that if you do the integral I was just describing for the Dirichlet process, the Chinese restaurant process turns out to be the marginal probability, under the Dirichlet process, of who sits with whom; it's the marginal of that big probability measure P of G. You can make this into an explicit clustering model, effectively a mixture model, but phrased in terms of clustering rather than mixture components. All you do is say that the first person to sit at a table draws a parameter for that table from some prior, which we'll call G0, the same G0 as before, and everybody who sits at that table inherits the same parameter vector. So if phi one is the mean and variance of a Gaussian, then all the data points for the customers around that table come from the same Gaussian, with the same mean and covariance. This is actually exactly a marginal of the Dirichlet process mixture model.
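A minimal sketch of the Chinese restaurant process seating rule just described (my illustration; alpha plays the role of the stick-breaking parameter, and the number of occupied tables grows roughly like alpha times log n):

```python
# Customer n+1 joins an occupied table with probability proportional to its occupancy,
# or opens a new table with probability proportional to alpha.
import numpy as np

def crp(n_customers, alpha, seed=0):
    rng = np.random.default_rng(seed)
    table_counts = []                       # number of customers at each table
    assignments = []
    for _ in range(n_customers):
        probs = np.array(table_counts + [alpha], dtype=float)
        probs /= probs.sum()                # normalize over existing tables + new table
        choice = rng.choice(len(probs), p=probs)
        if choice == len(table_counts):     # open a new table
            table_counts.append(1)
        else:
            table_counts[choice] += 1
        assignments.append(choice)
    return assignments, table_counts

assignments, table_counts = crp(n_customers=1000, alpha=2.0)
print("occupied tables after 1000 customers:", len(table_counts))
print("order of growth (alpha * log n):", 2.0 * np.log(1000))
```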
Okay, so that was a little tutorial; the Dirichlet process mixture model is about forty years old, and it is slowly, very slowly, getting taken up in applied communities.

Now, to use this in a richer class of problems than mixture models and clustering, you have to face the fact that you have multiple estimation problems: not just one data set, but multiple sets of data. In the HMM case that arises because the current state indexes the next state, so we have a whole collection of next-state distributions, one for each current state; these are the rows of the transition matrix, and we need to estimate all of them. In statistics this situation arises all the time: we have multiple estimation problems, each with a parameter and some data based on it, and we often want to tie them together, because there might be very little data for one problem and a lot of data for another, and if the problems are related it makes sense to couple them. That's called a hierarchical model, and it's one of the main reasons to be a Bayesian: it is very easy to build hierarchical models that tie things together. What you do is assume that the parameters of the subproblems are related by coming from a common underlying parameter: you draw them randomly from a shared hyperparameter. Now the data from one problem flows up through the tree and back down to the others, and the posterior estimate of theta i depends on all of the data; it's a convex combination of all the data. That's what comes out of the hierarchical Bayesian perspective, and here's a picture of it: here's what I just described, and here's the graphical-model representation, using plates, the boxes, to represent replication.

Okay, we can do the same thing with the Dirichlet process, and this is a paper we published a few years ago on the hierarchical Dirichlet process. I think it is a genuinely useful tool; it is just the hierarchy idea applied to the Dirichlet process. We have the same setup as before, but now instead of one G we have G1, G2, G3, G4 for the different estimation problems, the different groups; to be concrete, these are the different rows of the HMM transition matrix. Each of them is a random measure, and we want to tie them together so that they all share the same next-state space that they transition into. We don't want to draw them independently, and we don't want to lump them all into one either: we want different transition probabilities per row, but not completely separate state spaces, so they have to be coupled somehow. The way you do it is to add an extra layer to the graph: you first draw, from some source of atoms, a mother distribution G0, which is now random rather than a fixed parameter, and then G0 is the base measure used for each of the children; the children draw their atoms from G0 and reweight them according to their own stick-breaking processes. This ties the set of random measures together: it makes them cooperate on which atoms they all use, while each keeps its own weights.
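A minimal truncated sketch of that tying (my illustration): one shared set of atoms and top-level weights, with each group drawing its own reweighting of those same atoms.

```python
# Truncated hierarchical Dirichlet process: beta ~ GEM(gamma) at the top level,
# then each group draws pi_j ~ Dirichlet(alpha * beta) over the same shared atoms.
import numpy as np

rng = np.random.default_rng(1)
K, gamma, alpha, n_groups = 50, 5.0, 10.0, 4          # K = truncation level

# top level: atoms theta_k ~ H (here N(0,1)) and weights beta via stick breaking
atoms = rng.normal(0.0, 1.0, size=K)
sticks = rng.beta(1.0, gamma, size=K)
beta = sticks * np.concatenate(([1.0], np.cumprod(1.0 - sticks)[:-1]))
beta /= beta.sum()                                      # renormalize the truncation

# group level: same atoms, new weights for every group
group_weights = rng.dirichlet(alpha * beta, size=n_groups)

for j, pi in enumerate(group_weights):
    top = np.argsort(pi)[::-1][:3]
    print(f"group {j}: heaviest shared atoms {top}, weights {np.round(pi[top], 2)}")
```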
This also has a restaurant metaphor underneath it, which is a really easy way to understand the ideas. We now don't have just one restaurant; we have multiple restaurants, a franchise of restaurants. Again, in my HMM application the restaurants are the different current states, that is, the rows of the transition matrix. If you're in restaurant number two, you walk in and sit at a table with probability proportional to the number of people already there, just like before; that gives you sharing, or clustering, within a restaurant. But if you're the first person to sit at a table, you go up to a global menu of dishes for the entire franchise and pick a dish from it, maybe a chicken dish, bring it down to your table, and everybody who later joins your table eats that same dish; you all share the same parameter. You also put a check mark next to the dish, and when someone in another restaurant is the first person at their table, they go to the same menu and pick a dish with probability proportional to the number of check marks next to it. What happens is that some dishes become popular across all of the restaurants, not just within one, and they get propagated between restaurants by the same kind of preferential attachment. So, again, in the HMM setting these restaurants are the rows of the transition matrix; the number of possible transitions grows as you get more data, so instead of a fixed K states I have a growing number of states, and the states get shared among all the rows of the transition matrix through this franchise.

Now, I have a really nice application of this that we published this year, which I'm going to skip in the interest of time, but there is a paper on it and I think it's something of a killer app for the method. You are trying to model some of the geometry of proteins: these are the kinds of angles you get in proteins, and if you overlay all the data for a particular amino acid you get fuzzy diagrams that aren't very informative. What you really want is not one estimation problem for the density of angles but many, broken up according to the context of the neighbouring amino acids; you don't want one distribution, you want a bunch of them, one per context. But once you break things up by context you have a sparse-data problem, because many contexts have very little data; it's a very similar situation to lots of settings in signal processing and speech. So we want models that depend on the context of the neighbouring amino acids, and this setup is exactly that: the groups are the neighbourhoods. For each context you have a group of data, and you neither treat the groups separately nor lump them together; you let them cooperate through the tree. Each of the Dirichlet processes looks like the synthetic picture I showed you earlier, but there are twenty amino acids on the left and twenty on the right, so you get four hundred estimation problems, four hundred diagrams like this, and they share atoms according to this procedure. I'll skip over the results, but they are impressive: this is the log-probability improvement on a test set for each of the twenty amino acids, here is the line of no improvement, and the comparison is with what had previously been done in the protein-folding literature, so it's really quite a large improvement.

Okay, let's go back to hidden Markov models, where the likelihood is a hidden Markov model and the prior is now the structural machinery I've been telling you about.
So now we have a hidden Markov model in which we don't have a fixed number of states; we call it the HDP-HMM, and it has an infinite number of states. Here's time, here's the state space, and when you're in a given current state you have a distribution on next states; you get something like a Chinese restaurant process, with table one, table two, table three, table four, and a growing number of tables, and I get one of those every time I land in, say, state number two. That's fine, I can get a growing number of tables that way, but I want to share those tables between state two and state three and state four and so on, and as I've been saying, that is what the hierarchical Dirichlet process does: it ties together all of those transition distributions.

I'm going to go through these slides quickly, because the details aren't what I'm trying to convey, but briefly, what you do is draw a mother transition distribution over the whole state space, and then the row for each current state is a kind of perturbation of it. Here's pi three: it takes the mother distribution and reweights all of its atoms, and so on for the other rows. So I hope that was reasonably clear: this is a way of doing HMMs where you don't know the number of states a priori. You are actually putting a prior distribution on the number of states, and it can grow as you get more and more data; it's a rather nice solution to a classical problem in HMM land.

So we implemented this. There is a simple sampling procedure; we used a slightly non-standard sampler, but there is a whole family of procedures that can be used to do posterior inference here. Given some emissions, some data, you can infer the most probable HMM, the state trajectory, the Viterbi path, and everything else you want, all of the parameters. You can do this pretty easily on a computer with fairly standard machinery; it's an HMM.
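Putting the pieces together in symbols (standard notation for this model, my restatement rather than a slide; GEM denotes the stick-breaking weights described earlier):

\[
\beta \sim \mathrm{GEM}(\gamma), \qquad \pi_k \mid \beta \sim \mathrm{DP}(\alpha, \beta), \qquad \theta_k \sim H,
\]
\[
z_t \mid z_{t-1} \sim \pi_{z_{t-1}}, \qquad x_t \mid z_t \sim F(\theta_{z_t}),
\]

where beta is the mother distribution over the infinite state space, each row pi_k is a reweighting of beta, H is the prior on emission parameters, and F is the emission distribution.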
All right, we applied this to the diarization data, and we did okay, but not particularly well. We identified one problem that we needed to solve before we could get satisfactory performance, which was that we had a bit too much state splitting going on. Here's a little synthetic example in which there were three states in the synthetic data, as you can see here: this state was on for a few time frames, then this one, and so on. Here is the data; you can sort of see that there are three states, but it's pretty noisy, and it would be hard to tell just by looking. Here is the output of our inference program for the HDP-HMM. It did find these three states, but it also created a fourth one, and the problem was that the emission parameters for state three and state four happened to be just about the same. That can happen, and it happened here, so from the outside world's point of view the emission probabilities of states three and four were essentially identical, and the system achieved high likelihood while flickering between them; nothing prevents it from doing that. But we don't want it, because now we do badly on the diarization problem: the model thought there were four people in the room when there were really only three.

We're being Bayesian in this problem, so we can put in a little more prior knowledge, and this is where we really put the Bayes in. The extra knowledge is that people don't tend to talk for millisecond-length intervals; they tend to talk for seconds. So we add one extra parameter, a self-transition bonus: the diagonal of this infinite transition matrix is treated as special and gets an extra boost, which gives the model something of a semi-Markov flavour. We call the result the sticky HDP-HMM: it has the same transition distributions as before, plus one parameter that boosts self-transitions, so whatever distribution a row had before, we add a little extra mass to its self-transition. And since we're being Bayesian, that parameter is itself inferred from the data: it's random and it has a posterior distribution.
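A minimal truncated sketch of that self-transition boost (my illustration; in this finite approximation the boost simply adds kappa to the diagonal entry of the Dirichlet base for each row, and kappa = 0 recovers the non-sticky rows):

```python
# Each transition row is a reweighting of a shared "mother" distribution beta,
# with extra mass kappa on its own state so that self-transitions are favoured.
import numpy as np

rng = np.random.default_rng(2)
K, gamma, alpha, kappa = 20, 3.0, 6.0, 30.0        # K = truncation level

sticks = rng.beta(1.0, gamma, size=K)
beta = sticks * np.concatenate(([1.0], np.cumprod(1.0 - sticks)[:-1]))
beta /= beta.sum()                                  # shared next-state distribution

def transition_rows(kappa):
    rows = np.empty((K, K))
    for k in range(K):
        boost = np.zeros(K)
        boost[k] = kappa                            # extra mass on the self-transition
        rows[k] = rng.dirichlet(alpha * beta + boost)
    return rows

plain = transition_rows(kappa=0.0)
sticky = transition_rows(kappa=kappa)
print("mean self-transition, plain :", np.round(np.diag(plain).mean(), 3))
print("mean self-transition, sticky:", np.round(np.diag(sticky).mean(), 3))
```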
Okay, we put that into the system, and then we were ready to report results. We went back to the speaker diarization problem and ran this on, I believe, the 2007 Rich Transcription meeting data. We didn't enter the evaluation; we simply took the data and compared with what had been done by the people who did compete.

This is diarization error rate, and the ICSI results were the state of the art for this task at the time, by a wide margin; I don't remember whether they still are. This is over the twenty-one meetings, and these are comparisons with the ICSI numbers, just to give a flavour. Our sticky results are basically comparable to the red ones, the ICSI results, as you can see if you scan through; the green ones are the non-sticky HDP-HMM, and you can see they are worse, which was a sign to us that the sticky modification was the right thing to add. So head to head, we were roughly comparable with the ICSI system at that time. I do want to say that my goal in showing these numbers is not to claim that we beat anybody or even that we were competitive; this was 2007, we're not speech people, and we didn't try to compete. But we got results comparable with a state-of-the-art system, and the main point I want to make is that this was done by Emily Fox, a graduate student who visited my group for the summer; she learned about the HDP-HMM, implemented it, and did all of this as one summer project. So it's not that hard to do. It's a tool, just like the HMM: you can learn it, use it, and get competitive results with it. We haven't pursued this further, and I know the ICSI team later developed a discriminative method with better numbers, and all of these numbers keep improving, but I think this particular approach, if pursued, could be up there with the best of what's out there. I recall when HMMs first appeared, and I was actually a graduate student at the time: the very first papers on HMMs gave numerical results compared with dynamic time warping, and they were okay, not dramatically better, but of course that was enough, and it set the whole field off in that direction. I'd like to think the same could happen with these Bayesian nonparametric methods. They are easy to implement; they're a small twist on what you're used to, just HMMs with a little extra, the possibility of moving to a new table in the restaurant plus some bookkeeping for multiple restaurants. They're easy to implement and robust; one student can implement them and they work well, so I do think they're going to play a role.

Here are some examples of actual meetings. These are meetings where the diarization error rates for both systems were quite small, and you can see we basically solved the problem; here's a meeting where things were a bit worse. There was one meeting on which we did particularly badly, and it turned out the model wasn't the problem: we were using MCMC for inference, the chain mixed very slowly on that meeting, and it hadn't mixed by the time we stopped. So that remains an open issue: how to make sure the Markov chains mix, if we're going to use Markov chains.

That's all I want to say about diarization; let me briefly mention a couple of other applications of the HDP. It has been used in many other kinds of problems, not just HMMs. You can use it for PCFGs, for problems involving context-free grammars: there you have a parse tree, and the rules are like clusters if you're doing statistical NLP, so the number of rules grows as you see more data, and the same rule can appear in multiple places in the parse tree. That is exactly why you want multiple restaurants, the Chinese restaurant franchise: you want to share distributions across different locations in the parse tree. We've built a system that does that and can infer most-probable rule sets from data and build parsers.

Okay, that was the Dirichlet process; I could keep talking about it, and a lot of people have worked on it, but I want to move on and tell you about some of the other stochastic processes in the toolbox. One I'm particularly interested in these days is called the beta process. With the Dirichlet process, when you walk into the restaurant you sit at one and only one table; the beta process lets you walk into the restaurant and sit at multiple tables. So you should think of the tables now not as clusters but as features, a bit-vector description of an entity. If you sit at tables one, three and seventeen, and I sit at three, seventeen and thirty-five, our bit vectors overlap; bit vectors can overlap in interesting ways, and that is what the beta process allows and the Dirichlet process does not.

Before going further with the beta process, let me tell you briefly about the general framework. Kingman had a very important paper in 1967 on completely random measures, and the beta process is an example of a completely random measure. Completely random measures are really simple to talk about and to work with. They are random measures on a space, just as before, but what's new is that they assign independent mass to non-overlapping subsets of the space. Here's a picture: here's some arbitrary space Omega, here's a set A and a set B, and this red object is a random discrete measure on the space. A random amount of mass falls into A and a random amount falls into B, and if those random amounts are independent whenever the sets don't overlap, the random measure is called completely random. It's a really nice concept: it leads to divide-and-conquer algorithms, so it's basically a computational concept.
There are lots of examples. You know what Brownian motion is; it turns out to be a special case of this family, gamma processes are, and all kinds of other interesting objects. The Dirichlet process is not completely random, but it is a normalized gamma process.

Now, Kingman proved a very beautiful result characterising completely random measures: they can all be derived from Poisson point processes. The Poisson process lies behind all of this, and the construction is really beautiful, so let me describe it briefly. Remember that we're trying to put random measures on this space Omega; I've told you how to put Dirichlet process measures on spaces, and now I want to be more general and get all kinds of other random measures. What you do is take the original space Omega and cross it with the real line, so you look at the product space, and you put a Poisson process on that product space. How do you specify a Poisson process? You give a rate function. You're probably familiar with the homogeneous Poisson process, which has a flat, constant rate function, and where the number of points falling in a little interval, or here a little set, is a Poisson random variable whose rate, which is also its variance, is the rate constant times the size of the set. In general the Poisson rate for a small set is the integral of a rate function over that set, so you write down a rate function, here it is, and integrating it gives the Poisson rate for every small subset of the product space. Here's a draw from that Poisson point process; you can see the rate function was tilted up on the left, so there are more points down here and fewer up there. Now, having drawn from the Poisson process with this rate function, I forget all the machinery and just look at the red points: I take each point and drop a line from it down to the Omega axis, and the resulting object is a measure on Omega. It's discrete, it's random, and it's completely random, because the mass that falls into a set A here and a set B there is independent, thanks to the underlying Poisson process. The beautiful fact is that essentially all completely random measures arise this way; one direction is trivial, and the converse, the completeness of the characterisation, is quite nontrivial. So if you like completely random measures, and I do, this theorem says you can reduce their study to the study of Poisson processes and their rate functions. I think that's a real tool: in this field, I and others will be studying rate measures for Poisson processes as a way to specify combinatorial structure.

The particular example of the beta process has this as its rate function. It's a function of two arguments, which I'll separate into the Omega part and the real part. The Omega part is just some measure giving the prior on atoms, playing the role that G0 played before; here it's called B0. The other factor, for the beta process, is a beta density, in fact an improper beta density, which is what gives you an infinite collection of points when you draw from the Poisson process. That's what we want: a finite number of atoms would be a parametric prior, and we want a nonparametric prior, so we need the density to be improper. That's probably too much math, so let me just draw the picture: here is the rate function, it has a singularity at the origin, it tilts up very sharply there, and it stops at one.
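Written out (my restatement in one standard parameterisation, with c a concentration parameter and B0 the base measure just mentioned), the rate function lives on the product space Omega x [0, 1]:

\[
\nu(d\omega, dp) \;=\; c\, p^{-1} (1 - p)^{c-1}\, dp \; B_0(d\omega).
\]

Drawing the points \(\{(\omega_i, p_i)\}\) of a Poisson process with this intensity and dropping them down to Omega gives the completely random measure

\[
B \;=\; \sum_{i=1}^{\infty} p_i\, \delta_{\omega_i},
\]

which is the beta process; the p^{-1} singularity at the origin makes the number of atoms infinite while keeping the total mass finite.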
So when you draw the Poisson point process with this rate function, you get lots and lots of points very near the origin, and you take the draw, drop each point down from its height to the Omega axis, and it looks like this: p i is the height of an atom and omega i is its location, and the infinite sum of those weighted atoms is a random measure. So this is another way of getting random measures; before it was stick breaking, and this Poisson construction is much more general. In particular, these p i do not sum to one: they lie between zero and one, but they do not sum to one, and they are independent. So I like to think of the p's as coin-tossing probabilities, an infinite collection of coins, most of which have probability nearly zero. If I toss all of these coins I get a few ones and an infinite set of zeros; if I toss them again I get another few ones and lots of zeros; and if I keep doing this, there will be a few places where I get lots of ones, lots of overlap, and many other places where I get very few. Here's a picture showing that. This blue thing is a draw from the beta process; it's between zero and one and mostly nearly zero. And here are a hundred rows: think of the atoms as coins, with the heights as the probability of a head. This atom has a big probability, so there are relatively many ones as you go down its column, and over in a region where the probabilities are nearly zero, the columns are almost all zeros.

So think of a row of this matrix as a sparse, binary, infinite-dimensional random feature vector for an entity. Here are a hundred entities, and you can read it like a Chinese restaurant: entity number one hundred came into the restaurant and didn't sit at just one table; it sat at this table, and this one, and this one, several different tables. The next entity, number ninety-nine, didn't sit at the first table but sat at the fourth, and so on. So the matrix captures the sitting pattern of a hundred customers, all the tables each of them sat at, and the total number of tables keeps growing as you add customers; the matrix gets denser as you fill it in, but the number of tables again grows at a slow rate, and different settings of the parameters of the process give you different rates.

That is probably too abstract for you to see yet why you'd want it, but let me just say that there is a restaurant metaphor here too, called the Indian buffet process, which captures the sitting pattern, the matrix, directly, without reference to the underlying beta process, just as the Chinese restaurant process captures the sitting pattern for the Dirichlet process without reference to the underlying Dirichlet process. It is literally the marginal probability under the beta process. I'll skip the details.
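A minimal sketch of the Indian buffet process seating scheme (my illustration; customer n samples each previously used dish with probability proportional to its popularity, then takes a Poisson number of brand-new dishes):

```python
# Builds the sparse binary feature matrix directly, without the underlying beta process.
import numpy as np

def ibp(n_customers, alpha, seed=0):
    rng = np.random.default_rng(seed)
    dish_counts = []                       # how many customers have taken each dish so far
    rows = []
    for n in range(1, n_customers + 1):
        row = [1 if rng.random() < m / n else 0 for m in dish_counts]
        new_dishes = rng.poisson(alpha / n)          # brand-new dishes for this customer
        row.extend([1] * new_dishes)
        dish_counts = [m + r for m, r in zip(dish_counts, row)] + [1] * new_dishes
        rows.append(row)
    K = len(dish_counts)
    Z = np.zeros((n_customers, K), dtype=int)        # pad earlier rows with zeros
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row
    return Z

Z = ibp(n_customers=100, alpha=5.0)
print("feature matrix shape:", Z.shape)
print("dishes per customer (first 5):", Z.sum(axis=1)[:5])
```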
Let me go back to the earlier problem to make this concrete again. Remember the multiple time series problem: people come into the room and do exercise routines, and I don't know how many routines there are in the library, I don't know who does which routine, and each person doesn't do just one routine, they do some subset of the routines. So how am I going to model that? It's a time series with segments, so I'm going to use an HMM as my basic likelihood, but now I have to put some structure around it to capture all of the combinatorial structure in the problem, and the way I'm going to do that is with the beta process. I have a slide with the math, and you can stare at it if you care to, but I encourage you to just listen to me describe how it works in English.

Suppose everybody in this room is going to come up on stage and do their exercise routines, and each of you will do your own subset. There is an infinite library up there of possible routines, and before you come up on stage you choose which subset of that library you will perform; maybe I decide to do jumping jacks and twists. Up there, also, is an infinite-by-infinite transition matrix, which the gods possess and we don't. I pick out twists and jumping jacks from that infinite matrix: maybe twists is column number thirty-seven and jumping jacks is number forty-two, so I pick out those columns and the corresponding rows, I get a little two-by-two matrix, and I bring it down from the infinite matrix and use it to instantiate a classical HMM. It's actually an autoregressive HMM, because these movements have some oscillation in them. Now I run a classical HMM for my exercise routine; my emissions are the sixty-dimensional vector of positions, and it's just an HMM. I run the forward-backward algorithm, I get a Baum-Welch update for my parameters locally, and then I go back up to the infinite matrix, to that little two-by-two block, and put the updates back. Now the next person comes up on stage; he also takes column thirty-seven, say, but instead of forty-two he picks one-oh-one and one-seventeen, so he pulls out a three-by-three sub-matrix, runs his HMM on his data, gets his Baum-Welch update, and writes it back into that three-by-three block of the infinite matrix. As we all keep doing this, we are updating overlapping subsets of the infinite matrix.

That is the beta process AR-HMM, and I hope you got the spirit of it. It is just an HMM, one HMM per person doing the exercise routines, and the beta process is the machinery up top that gives each person a feature vector saying which subset of the infinite set of states they actually use. It may look a little complicated, but it isn't; it's very easy to put on a computer, and again this is something Emily Fox implemented during a second summer with us, and the software for it is available. So anyway, it's just an HMM, with a beta process prior on how the parameters are structured.
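A deliberately simplified sketch of that structure (my illustration, not the actual software): a shared library of autoregressive behaviours, a binary feature vector per sequence saying which library entries it uses, and an ordinary sticky HMM over that active subset. One-dimensional AR(1) dynamics and hand-picked feature vectors stand in for the sixty-dimensional model and the beta process prior.

```python
import numpy as np

rng = np.random.default_rng(3)

# global library of behaviours: AR(1) coefficient and innovation scale for each
library = [(0.95, 0.05), (-0.9, 0.05), (0.5, 0.3), (0.99, 0.01)]

def generate_sequence(active, T=300, self_prob=0.95):
    """Run an ordinary sticky HMM restricted to the behaviours listed in `active`."""
    K = len(active)
    trans = np.full((K, K), (1.0 - self_prob) / max(K - 1, 1))
    np.fill_diagonal(trans, self_prob)
    trans = trans / trans.sum(axis=1, keepdims=True)   # normalize rows (safe for K = 1)
    z = rng.integers(K)
    x_prev, xs, zs = 0.0, [], []
    for _ in range(T):
        a, sd = library[active[z]]                     # look up the shared behaviour
        x_prev = a * x_prev + rng.normal(0.0, sd)      # AR(1) emission
        xs.append(x_prev)
        zs.append(active[z])
        z = rng.choice(K, p=trans[z])
    return np.array(xs), np.array(zs)

# two "people" using overlapping subsets of the library; the full model would draw
# these feature vectors from the beta process / Indian buffet process instead
x1, z1 = generate_sequence(active=[0, 2])
x2, z2 = generate_sequence(active=[2, 3])
print("person 1 used behaviours:", sorted({int(v) for v in z1}))
print("person 2 used behaviours:", sorted({int(v) for v in z2}))
```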
Now, this actually really worked. Here are motion capture results, and this is a nontrivial problem; lots of methods don't do well on it at all. It's somewhat qualitative how well we're doing, but I think if you look you'll agree it's doing well. Here is the first feature: if you take that state, that feature, hold it fixed and run the HMM (these are autoregressive HMMs, so it oscillates), the oscillation is the arms going up and down; this feature picked out the jumping jacks. This one, if you take that feature and run the HMM without letting any transitions happen, gives you the knees wobbling; back here you get a kind of twisting motion of the hips; here's more wobbling of something or other; and here the arms are going in circles. Down at the bottom you start to get a bit more subdivision of the states than we might like, although Emily thinks there are good kinematic reasons why these really are different. But this essentially nailed the problem: it took these sixty-dimensional time series, several of them, did not know the number of segments or where the segments occurred, and jointly segmented them across all the different users.

How am I doing on time? Okay, I'm about out of time, so I'm going to skip some slides and just say something about one last model. This is something I worked on a number of years ago, an exchangeable, bag-of-words model for text, probabilistic topic modelling, latent Dirichlet allocation; it's fairly widely known. It's extremely simple, too simple for lots of real-world phenomena, and so we've been working to make it more interesting. What the LDA model does is take a bag-of-words representation of text and re-represent documents in terms of topics, where a topic is a probability distribution on words. A given document can express a subset of topics, maybe sports and travel, and all the words in the document come from the sports topic or the travel topic, so you get mixtures, admixtures, of topics within a single document. One of the problems with this approach is that function words end up in every single topic, because if I have a document that is only about travel and the function words don't occur in the travel topic, the document can't contain function words at all and gets very low probability. We would like to separate out function words, and other kinds of abstract words, from the more concrete words, and build an abstraction hierarchy. We have done that; there is a paper on it that appeared in the JACM this year, and in some sense I view it as more of a path forward for this field than LDA itself.

We call it the nested Chinese restaurant process. There's one Chinese restaurant up here and another down here; all of these are Chinese restaurants, and they are organised in a tree. When you go to a restaurant you pick a table, as before, and the table tells you which branch to follow to get to the next restaurant. You can think of it this way: on your first night here in Prague you pick some restaurant, and whichever table you pick there tells you which restaurant you eat at on the second night, and then the third night, and so on through the nights of the conference. So one document comes in and picks a path down this tree, another document comes in and picks another path, and the paths can overlap, so you get overlapping branching structures. Now put a topic, a distribution on words, at every node of the tree; a document has a path down the tree, the path gives it a set of topics it can draw from, and it draws its words from those topics.
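A minimal sketch of the path sampling just described (my illustration): a tree of restaurants, each node running its own Chinese restaurant process over its children, with every document following one path down to a fixed depth.

```python
# Documents share prefixes of their paths, which is where the shared "abstract"
# topics near the root come from.
import numpy as np

rng = np.random.default_rng(4)

def crp_choice(counts, alpha):
    probs = np.array(counts + [alpha], dtype=float)
    return rng.choice(len(probs), p=probs / probs.sum())

def ncrp_paths(n_docs, depth, alpha):
    tree = {}                                # node (path tuple) -> child table counts
    paths = []
    for _ in range(n_docs):
        node, path = (), []
        for _ in range(depth):
            counts = tree.setdefault(node, [])
            k = crp_choice(counts, alpha)
            if k == len(counts):
                counts.append(1)             # open a new branch below this node
            else:
                counts[k] += 1
            path.append(k)
            node = node + (k,)
        paths.append(tuple(path))
    return paths, tree

paths, tree = ncrp_paths(n_docs=50, depth=3, alpha=1.0)
print("distinct level-1 branches:", len(tree[()]))
print("example document paths:", paths[:5])
```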
Now, the topic at the root is used by every document, so it makes a lot of sense to put the function words there, whereas a node far down the tree is used by only a small subset of the documents, so you might as well put more concrete words down there; and that is exactly what the statistics does. When we fit this to data, it develops topics at the top of the tree that are more abstract, and more concrete topics as you go down.

So here, and this is my last slide, is a result; you'll have to turn your head. These are abstracts from a particular journal in psychology, Psychological Review, and this is the most probable tree: everything here is a distribution, including the distribution on trees, so this was the highest-probability tree, with the high-probability words of the topic shown at each node. At the root we had hoped the function words would be stripped away, and indeed they simply popped up there. At the next level we get a topic on memory, one on self and social psychology, one on motion, vision, binocular, and one on drive, food, brain; so it looks like cognitive psychology, social psychology, visual psychology and physiological psychology. And so on as you go down; it's actually an infinite tree, and we are showing the first three levels.

So I hope that gives you more of the flavour of the toolbox: we took restaurants and knitted them together with an object that puts distributions on trees, and suddenly we could do things like abstraction hierarchies and reason about them. So, I'm done. This was a tour of a few highlights of a literature. There are probably about a hundred people worldwide working actively on this topic; there is a conference called Bayesian Nonparametrics, held every two years, the next one later this year in Veracruz, and the field is really just getting going. So, especially for the younger people in the audience, I highly encourage you to look at this: there has been a little bit of work bringing it into speech and signal processing, but there is going to be a whole lot more. On my publication page there are two papers I would point you to if you enjoyed the talk; they are written for the non-expert and give lots of pointers into the literature. Thank you very much.

[Chair:] Thank you. We have time for some questions.

[Question:] I have two questions. First of all, thank you very much; this is very useful and accessible for our community. The first question: for someone facing a large-scale problem, this setup requires Monte Carlo simulation. Doesn't that make it really difficult to apply?

[Jordan:] You know, I don't believe that at all, though no one really knows yet how these methods behave at very large scale. Think of the EM algorithm: does it apply to large-scale problems or not? Yes and no. For some problems things settle down after a very small number of iterations, maybe two iterations of EM, and for some of our problems the algorithms are just Gibbs samplers, which are a little like EM with one extra step: instead of doing exactly what EM did before, you might flip an indicator from this value to that, or open a brand-new table. So we actually don't know whether posterior inference at large scale will mix badly; maybe these chains mix more quickly than we're used to from the small-data setting.
The last point to make is that this is not tied to MCMC; that is just what we happened to use, because we had three months for the project. Other procedures, split-and-merge algorithms, variational methods and so on, can be used for the posterior inference as well.

[Question:] My second question: many people in this community are well aware that the hidden Markov model is not really a good model for speech; we use it because it makes inference and training easy and tractable. So I'd like to know to what extent these tools can be generalised.

[Jordan:] That's a great question, and I like it very much. The two papers I mentioned try to answer it by showing the toolbox, a range of other kinds of models you can consider. The Dirichlet process, for example, does not have power-law behaviour; if you want power laws there is something else called the Pitman-Yor process, and you could build a Pitman-Yor version of the HDP-HMM. There are many, many generalisations: inverse-gamma forms for the weights, and so on and so forth. So this really is a toolbox. I'd also make the point that I'm a statistician, and we're used to models which, if you find the right cartoon, can work surprisingly well. Yes, the HMM is not the right model for speech, but it was the cartoon that carried speech all the way to where it is, maybe badly, maybe wrongly; it is very useful to have cartoons with nice statistical and computational properties that can be generalised. I think the classical HMM was too much of a box, hard to generalise, and these methods go beyond it while still giving you something you can control.

[Chair:] Another question, which I can summarise this way: the question is about overfitting. Why don't you get killed by overfitting in the nonparametric world?

[Jordan:] Well, it's easy for me to say that Bayesians don't have such problems with overfitting; I'm not always a Bayesian, but when I am, that's one of the reasons. To first order, we do not have overfitting trouble with these systems, even on pretty large-scale problems. For example, we compared our context-free grammar work with the Berkeley split-merge EM parser, and they had a significant overfitting problem, which they eventually dealt with in a pretty effective way; for us it just never came up. At second order, you have a couple of hyperparameters, and you do have to get them into the right range; if you're far outside it you can see some overfitting, but we're very robust to that. And you're right, the data didn't behave perfectly in the example I showed: we had a little overfitting, one more state than there should have been. But that's not too bad, and it wasn't hard to do a little engineering and think about the problem a bit more; there is a time scale in the problem, and a prior that reflects it needs to be put in. I think that's fine. We are part artists and part engineers; we're not trying to build a box you can hand to a high-school student and be done. There will always be a little thinking and a little engineering on top.
But really, for a lot of these high-dimensional, hard inference problems, where multiple things have to collaborate, the hierarchy gives you a lot of control over those issues from the get-go. And as I say, I'm not always a Bayesian; I go back and forth between being a Bayesian and a non-Bayesian every single day of my life.

[Question:] For speech we use mixture models within each state. Is there a way here to decide when to create a new mixture component versus reuse an existing one?

[Jordan:] Excellent question, and I should have made this clear. The state-specific emission distribution here is not a single Gaussian, for reasons you know extremely well; it is a Dirichlet process mixture of Gaussians, and that turned out to be critical for getting this to work. Absolutely: just as in the classical case a single Gaussian per state doesn't do it for speech, here we don't want to fix some number L of components; we want the number to grow, so that as you get more data you get more and more components in the emission distribution. That arises because we put a hierarchical Dirichlet process in there as well, which I think makes the point about the toolbox even better.

[Chair:] We have time for one more question.

[Question:] We actually use the hidden Markov model quite a bit for inference, and it's very good; I recommend everybody take a look. But one of the major problems is in choosing the priors, because we have to estimate a lot of hyperparameters; sometimes the number of hyperparameters is larger than the number of parameters. Now you're talking about an infinite number of parameters, so do we have to worry about an infinite number of hyperparameters?

[Jordan:] No, and that's really important. The classical Bayesian HMM had a lot of hyperparameters because it had a fixed number K of states, the number of hyperparameters scaled with K, and then you used BIC or something like it to choose K. We don't do any of that: we have a distribution on the number of states, and the number of hyperparameters is constant, and small. There is sharing: the choices are inherited from the top level, from that global menu, so there is a very small number of them. So yes, how you set the priors depends on how you do the inference, but the correct answer is that there is a very small number of hyperparameters here, in some sense almost too small, so small you can hardly believe it, and in practice it probably has to grow a little. But this is not your classical Bayesian HMM where you have hyperparameters for every one of a fixed set of states; the number of states isn't fixed, the seating pattern in the restaurant grows, the prior number of states grows like log n, and the posterior is simply inferred.

[Chair:] Okay, thank you. I would like to ask you to thank the speaker once again, and now we have the coffee break outside.