Speech Transcript - BIOLOGICAL PATHWAY INFERENCE USING MANIFOLD EMBEDDING

Thank you, first of all, to the organizers. I'm going to speak on biological pathway inference, and what I'm going to do is describe an approach to this problem that combines, in this case, gene expression data, that is, the continuously valued abundance of messenger RNA as measured on a microarray chip, with ontological data, which describes something about gene function, in particular the products that are generated or induced by gene expression.

OK, so the standard approach to this kind of data analysis is to be data driven: ignore any kind of functional or biological-system type of priors, and then, after you've done your analysis and pulled out particular gene factors, let's say, from the data, you go and try to validate it, or to make some inferences on what's actually going on. What are the functional relationships between these genes, and can you somehow incorporate them into a functional pathway? Here we're going to do it simultaneously: clustering, variable selection, and functional annotation.

I think everybody here has at least some vague notion, at a minimum, of the fact that a gene is a segment of DNA that codes for protein. Of course, not all of the nucleotides on the DNA code for proteins, but the genes in particular are the ones that biologists understand and can ascribe some sort of function to. These functions, which are primarily the production of proteins through the process of translation, are organized into what biologists call pathways. Pathways are sequences of activation of different genes or protein products that lead from one state to another. So, you know, a pathway for the inflammatory response starts with
some infectious agent or some insult to the immune system, and ends up with the production of cytokines that basically induce inflammation. There's a very complicated sequence of gene expression associated with that process.

One of the principal problems is the discovery of how these pathways become perturbed or deregulated under, for example, disease states. The principal fact of the matter is that these functions are not just expressed by a particular gene expression factor; they're expressed over time and over space. That's what we're going to talk about, in particular in the context of the cellular response to infection, and I'm going to show some data later on from applying these techniques as a fusion of expression data and gene ontology data.

This just shows a typical picture that you'll find in the literature, in this particular case from Nature Reviews in 2005, which describes the molecular biologist's understanding of how a cell responds to infection in terms of protein production. In the nucleus you have the process of transcription and replication, which generates proteins that are located at particular regions within the cell: close to binding sites, receptor sites, production of cytokines, interferon, and so forth. So this is a very complicated diagram. Pathways can be characterized as these sequences of events that lead, for example, to cell death, programmed cell death, apoptosis, which is an immune response.

The point, though, is that we want to somehow compress all of this complicated and relatively vague information in this picture into some kind of topological constraint on how, say, two genes can be related or not. This shows
what's called a gene ontology semantic graph. It captures the function of different genes; in particular it captures one of the three gene ontology classifications, which is the cellular location of the protein that's produced by a particular gene. So you have here, for example, the membrane, the cellular membrane, versus the protein complex, or the nucleus down here. You'll have different genes that are associated with each particular location in terms of the proteins that they produce. The larger the circle, the more genes are in that particular functional body for this particular process, this particular pathway. OK.

So this is the diagram that we're basically going to use to merge with the raw expression data. This comes from the literature: Gene Ontology is a database that collects, from different databases, experimental and validated results on, in this case, the cellular location of protein production. We're going to take this semantic description of relations between genes that are attached to a particular component, and we're going to use that to sort of precondition the clustering of the gene expression. That's, in a nutshell, what we're doing.

This shows how it's put together in a more graphic context. We have a gene microarray here, with genes expressed, say, over different treatments: class one, which might be the healthy subjects, and class two, the tuberculosis subjects. These would be different genes along the rows. We take these ontological features that I showed on the previous slide, so this one might be the nucleus, this one might be the cytoplasm, and so on, and that then gives us something like a prior on how closely related these genes are in
terms of function.

Going from clusters to functional pathways is a very difficult problem, and the problem is that genes with similar expression do not necessarily have similar function. If we use correlation to correlate gene expression from two different genes, and say that they have the same function simply because they seem to have the same shape over their temporal expression profile, that may be completely spurious; they may not share function. How do you incorporate function as additional information? You use gene ontology.

In order to capture this ontological functional relationship between two genes, we're going to use a manifold learning type of approach, a Laplacian eigenmaps approach, which is basically going to embed the genes into a lower dimensional manifold, where distances in that manifold reflect the ontological similarity between those two genes. So if the genes live in common locations within the cell in terms of protein production, then the similarity W_ij between the two genes i and j will be larger, and that will be used as a weighting in this Laplacian eigenmaps clustering procedure, which will give us a lower dimensional, graph-Laplacian-induced embedding of the data.

OK, now where does the gene expression come in? It comes in in a very weak sense in this embedding. You can look at the embedding as being driven by the ontology, driven by function, similar function in this case being similar locations within the cell for these genes; that's our W_ij. But the gene expression controls the neighborhood: we're basically going to assign zero weight if the expression profiles are too dissimilar. OK. So it's a way of conditioning the embedding, which would otherwise be based purely on ontology, based
on the similarity of the gene expression profiles. I'm not going to go through the Laplacian eigenmaps; Belkin and Niyogi published a very nice paper on it in 2002. Suffice it to say that you're going to find coordinates y_i and y_j in some lower dimensional space, maybe two dimensions to visualize, such that you basically preserve distance in the ontological space. Now, of course, we prune neighbors if their expression profiles are too dissimilar.

OK, so this embedding has been applied to a particular data set that's out there, called the Young data set, looking at, in this case, in vitro tuberculosis infection of macrophage cells, which are then assayed using microarrays and RT-PCR to produce that gene map, that heat map actually. There are eight time points; they want to basically look at how these dendritic and macrophage cells respond after the tuberculin has been introduced. Here's the data for the control group, and here's the data for the tuberculosis group, again over time. What we're going to be trying to do is determine changes, and associate those changes from the control to the tuberculin group with functional protein production.

OK, so first of all we're going to take a difference between the control and tuberculosis, so that we have a baseline, which is the control, and then we're going to embed those expression profiles, as differential expression profiles, into a two dimensional space using this Laplacian eigenmap.

I'm first going to show you the standard functional PCA embedding, which simply applies a singular value decomposition to a spline basis interpolation over time of those expression profiles I showed on the previous slide, and then afterwards applies a Gaussian mixture
model clustering to try to associate these different genes, hopefully, with different pathways or different functions. What you see is this classic concentration of measure here, where the Gaussian mixture models are basically trying to match a two dimensional domain: they're trying to simultaneously capture clusters that might actually be two dimensional, but there are also clusters that are probably just one dimensional. So it's a very heterogeneous picture.

If you go and look after you've clustered, you wouldn't expect these clusters to do very well, because for this cluster, obviously, the centroid isn't even near any particular gene. We can quantify how good this clustering is simply by looking at the percentage of genes in a given cluster that are in the same location in the cell, and over fifty percent of all the genes in any cluster are not co-located in the cell. That indicates, again, that gene expression profiles over time do not discriminate accurately between genes with different function.

On the other hand, if you use the manifold embedding that I described, you get a much nicer spread of these clusters into well defined groups. These are four different clusters. You drop to a much, much lower percentage of impurity: many more of the genes within the cluster groups labeled green, blue, turquoise, and so forth are close to each other within the cell, which is a sign, of course, that the method we implemented is actually capturing this co-location ontology.

This just shows some clustering indices; I'm not going to
bother dwelling on that, as I'm running out of time, but they indicate that we have improved performance, not surprisingly, because we're using gene ontology data to condition these clusters. If you just use functional PCA, you get cluster indices for these various pathways that have much lower quality than if you use our method, which you can see just by comparing these numbers. This is a quality index between zero and one that, again, indicates the purity of each one of these clusters in terms of the number of genes that lie within the same class.

OK, so in conclusion, I've described this method, which deals with the combined embedding of both expression data and functional data for genes, in terms of their expression under, in this particular case, tuberculosis infection. I've basically described how we do this using a Laplacian eigenmap. It allows us to, if you like, couple the variable selection, clustering, and functional annotation in one package, and as a result we can improve pathway analysis by doing that.

Question: Could you elaborate a little bit on the distance in terms of the GO terms? Are you using the hierarchical structure of the GO terms as well?

Answer: Yes, so for that distance we have to go back to this equation. I skipped over this, partially because there's a typo. This distance is defined in terms of the number of GO terms that are common to the two genes and the number of GO terms that are, you know, marginal. What that's actually doing, if you look at this graph here, is saying: if I have two genes and I look at where they lie, let's say you have two genes that lie within this particular organelle membrane co-location, well then you go back and look at the
parents, the parent statistics, which give you this topological mapping. Now granted, it's only the parents; we don't look at the grandparents and great-grandparents and so forth. But we are taking account of the topology, at least in that first sense.

Question: Thank you. What can you do if you have only a partial knowledge of the ontology?

Answer: Which is actually what we have, because, you know, one can't believe all published results. We don't have, but would like to have, measures of confidence in terms of the degree to which the ontology can be relied upon. That just doesn't exist yet. I think it's one of the main deficiencies of the functional annotations we have today that there is no figure of merit, no figure of confidence, that allows you to know in a systematic way what kind of weighting you need to apply to that ontological information in order to balance the uncertainty of the ontology against the uncertainty of gene expression. So the answer is not a satisfactory one; unfortunately, I can't give you a better answer there.

Question: A related question: isn't it possible that, with disease, the location of the gene expression would change?

Answer: Yes, yes, absolutely.

Question: Then how do you handle that? I mean, it means that you would locate a gene, using the ontology, in a location that is maybe not the one corresponding to the disease.

Answer: Yeah, that's an excellent question, and it's related, of course, to the fact that the ontology really should be a temporal database. It's not; it collapses the entire time course of functional activation into a summary. The only way that we account for that is through the fact that the ontological annotation of each one of these genes is not unique; it's not just a one dimensional quantity. So a gene may belong simultaneously to several
locations, if the protein production that it's responsible for in two different phases is, say, in the nucleus in one phase and over in the membrane in another phase. But again, the ontology is not rich enough yet to be able to capture that temporal information. If it were, we could obviously do much better, and we could really start talking about pathways that are temporally modulated. Excellent question.

Question: Basically, with the slides we saw, could the approach be changed to handle change in the data over time, the nonstationarity?

Answer: Right, right. I'd like to agree with you there and say that we've done that, but we haven't. We could do that if we computed the distance, a sort of short-time distance, over a window between gene expression profiles, but we actually compute the distance over the entire temporal period that's collected. So time, again, is collapsed. But if we truly had ontological data that was temporally specific, we could develop a much more sophisticated model, and frankly a more useful one. But this is the beginning.

Right, exactly. So, yeah, let's thank Al again.
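
The embedding described in the talk can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the Jaccard overlap used for W_ij, the Euclidean threshold max_dist used to prune neighbors, the function names, and the six toy genes are all assumptions made for the example. It assumes NumPy is available, that the profiles are already differential (treated minus control), and that the resulting similarity graph is connected.

```python
import numpy as np


def go_similarity(terms_a, terms_b):
    """Jaccard overlap of two GO-term sets (an assumed stand-in for W_ij)."""
    union = terms_a | terms_b
    return len(terms_a & terms_b) / len(union) if union else 0.0


def build_weights(go_terms, profiles, max_dist):
    """Ontology-driven weights, set to zero when expression profiles are too dissimilar."""
    n = len(go_terms)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # Expression only controls the neighborhood (hypothetical Euclidean threshold).
            if np.linalg.norm(profiles[i] - profiles[j]) <= max_dist:
                W[i, j] = W[j, i] = go_similarity(go_terms[i], go_terms[j])
    return W


def laplacian_eigenmap(W, dim=2):
    """Embed nodes with the first nontrivial generalized eigenvectors of L y = lambda D y."""
    deg = W.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    d_inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}.
    L_sym = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)  # eigenvalues come back in ascending order
    # Skip the trivial eigenvector, then map back to generalized eigenvectors.
    return d_inv_sqrt[:, None] * vecs[:, 1:dim + 1]


# Toy data (hypothetical): three "nucleus" genes and three "membrane" genes.
go_terms = [
    {"nucleus"}, {"nucleus", "protein complex"}, {"nucleus"},
    {"membrane"}, {"membrane", "protein complex"}, {"membrane"},
]
profiles = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0],
    [2.0, 2.0, 2.0],
    [1.0, 2.0, 1.0],
    [1.5, 1.5, 1.5],
    [2.0, 1.0, 2.0],
])

W = build_weights(go_terms, profiles, max_dist=2.0)
Y = laplacian_eigenmap(W, dim=2)
print(np.round(W, 2))
print(Y.shape)
```

In this toy example genes 0 and 2 carry the same GO term, but their profiles are farther apart than max_dist, so their weight is pruned to zero. The first embedding coordinate should place the nucleus genes and the membrane genes on opposite sides of zero, which is the kind of co-location structure the talk reports recovering.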

BIOLOGICAL PATHWAY INFERENCE USING MANIFOLD EMBEDDING

Systems Biology

Presented by: Alfred Hero, Author(s): Arvind Rao, Carnegie Mellon University, United States; Alfred O. Hero III, University of Michigan Ann Arbor, United States