Hi, I'm Wojciech Samek from the machine learning group at Technical University of Berlin, and I will present our recent work on stationary common spatial patterns. This is joint work with Carmen Vidaurre and Motoaki Kawanabe.

Here is an overview: I will start with an introduction, then tell you something about the common spatial patterns (CSP) method and our stationary common spatial patterns (sCSP) method, then I will show some results and conclude with a summary.

Our target application is brain-computer interfacing. A brain-computer interface (BCI) system aims to translate the intent of a subject, measured from brain activity, in this case by EEG, into a control command for a computer application. So in this case you measure EEG and you want to control a game, here a pinball game, but you can also think of other applications like controlling a wheelchair or a neuroprosthesis.

A very popular paradigm for BCI is motor imagery. In motor imagery the subject imagines movements, for example of the right hand, the left hand, or the feet, and these different imagined movements lead to different patterns in the EEG. If your system is able to extract and classify these patterns, then you can convert them into a computer command and control an application.

There are still some challenges. The EEG signal is usually high-dimensional, it has a low spatial resolution, which means you have a volume conduction effect, and it is noisy and non-stationary. By non-stationary I mean that the signal properties change over time. So what people usually do in BCI is apply a spatial filtering method, for example CSP, in order to reduce the dimensionality. The goal is to combine electrodes, that is, to project the signal onto a subspace, and thereby increase the spatial resolution and hopefully the signal-to-noise ratio, and simplify the learning problem.

But the problem of CSP is that it is prone to overfitting, it can be negatively affected by artifacts, and it does not tackle the non-stationarity issue. That means if you compute features by applying CSP, the features may still change quite a bit, whereas your classifier usually assumes a stable distribution. In machine learning you usually assume that the training data and the test data come from the same distribution, and if the distribution changes too much, the classifier will not work optimally. Therefore we extend the CSP method to extract more stationary features.

Non-stationarities, that is, changes of the signal properties over time, may have very different sources and time scales. For example, you may have changes at the electrode level, when an electrode gets loose or the gel between the scalp and the electrode dries out. You may also have muscular activity and eye movements that lead to artifacts in the data. And you usually also have changes in task involvement, when subjects get tired, or differences between sessions, for example there is no feedback in the calibration session, whereas in the feedback session feedback is provided. Basically, all those non-stationarities are bad for you because they negatively affect your classifier. There are two ways to deal with this: one way is to extract better features, to make your features more robust and more invariant to these changes.
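To make the spatial filtering step mentioned above concrete, here is a minimal sketch (my own illustration, not the speaker's code) of projecting a high-dimensional EEG trial onto a low-dimensional subspace with a filter matrix W; all sizes, names, and the random data are assumptions.

```python
# Illustrative sketch: spatial filtering as a linear projection.
# Shapes and the random data are placeholders, not the real recordings.
import numpy as np

n_channels, n_samples = 68, 1000              # assumed trial dimensions
X = np.random.randn(n_channels, n_samples)    # one band-pass filtered EEG trial
W = np.random.randn(n_channels, 6)            # 6 spatial filters (e.g. learned by CSP)

Z = W.T @ X                                   # 6 surrogate channels instead of 68
print(Z.shape)                                # (6, 1000)
```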
This is the way we take in our paper. The other way is to do adaptation, so you can adapt the classifier to the distribution change.

Okay, so the common spatial patterns method is, as I said, very popular in brain-computer interfacing. It maximizes the variance of one class while minimizing the variance of the other class. If you have two conditions, say the imagination of a right hand movement and a left hand movement, you see that these two filters down here maximize the variance of the projected signal in the right hand condition while minimizing it in the left hand condition, and the two filters at the top do exactly the opposite: they maximize the variance in the left hand condition but minimize it in the right hand condition.

Why do we want to do this? In BCI your goal is to discriminate between mental states, and you know that the variance of a band-pass filtered signal is equal to the band power in that frequency band. So you can discriminate mental states by looking at the power in specific frequency bands, and with CSP you can easily detect changes between the conditions, because effectively you are looking at the band power in one specific frequency band.

CSP can be solved as a generalized eigenvalue problem, because you can formulate it as a Rayleigh coefficient: you want to maximize the projected variance of one condition, w^T Sigma_+ w, while minimizing the variance of the other condition; equivalently, you can also write it here so that you minimize the variance of the other condition, Sigma_-. So we can solve this very easily.

But our idea is that we do not only want a projection which has these properties, we also want the projection to provide stationary features. So we want to penalize non-stationary projection directions, and we introduce a penalty term P(w) into the denominator of the Rayleigh coefficient. We add this P(w) here, and then the final goal is to maximize the projected variance of one condition while minimizing the variance of the other condition and minimizing this penalty term.

The penalty term measures the non-stationarities: we want to measure the deviation from the average case. Here Sigma_c is the average covariance matrix of all trials from condition c, and Sigma_c^(k) is the covariance matrix of the k-th chunk, where a chunk may consist of one trial or of several trials from the same class. So the penalty is P(w) = sum over classes c and chunks k of |w^T (Sigma_c^(k) - Sigma_c) w|, and you want to minimize the deviation of each chunk from the average case. This is an intra-class term, because you want to be stationary for each class separately, so you compute it for each class.

The problem is that if you add this quantity to the denominator, you do not get the quadratic form anymore, because you cannot take the w outside the sum due to the absolute value function, so you cannot solve it as a generalized eigenvalue problem anymore.
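As a minimal sketch of the CSP step just described (assuming SciPy; the function and variable names are mine, not the speaker's), the Rayleigh coefficient is maximized by the top generalized eigenvectors of the two class covariance matrices:

```python
# Sketch of CSP as a generalized eigenvalue problem:
# eigh(A, B) solves A w = lambda B w with eigenvalues in ascending order,
# so the last columns maximize w' Sigma_plus w / w' Sigma_minus w and the
# first columns do the opposite.
import numpy as np
from scipy.linalg import eigh

def csp_filters(sigma_plus, sigma_minus, n_filters=3):
    _, eigvecs = eigh(sigma_plus, sigma_minus)
    return np.hstack([eigvecs[:, -n_filters:],   # high variance in class "+"
                      eigvecs[:, :n_filters]])   # high variance in class "-"
```

Once the penalty is rewritten as a quadratic form w^T Delta w, sCSP keeps exactly this structure, because a penalty matrix can simply be added to the denominator matrix; see the sketch after the chunk-size discussion at the end.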
So what do we do about this? We add a related quantity instead: we take the w vector outside the sum, but introduce an operator F which makes this difference matrix positive definite, because we are only interested in the magnitude of the variation, and we treat deviations in both directions in the same way. For example, we do not care whether this one is bigger or that one is bigger; we are only interested in the difference. After projection, that is what the absolute value does; here we do kind of the same, but before projecting, because we take the w outside the sum. We can also show that this quantity gives an upper bound on the quantity we actually want to minimize, so it makes sense to use it, and we put it into the Rayleigh coefficient of our objective function.

Now to the data set. We compare CSP and sCSP on a data set of eighty subjects performing motor imagery; they were new to BCI, so they did it for the first time. For each user we selected the best binary task combination and the parameters on the calibration data, and we tested on a test session with feedback, with three hundred trials. We recorded EEG from sixty-eight selected electrodes, used log-variance features and an LDA classifier, and used the error rate to measure performance. We used a fixed number of filters per class, selected the trade-off parameter with cross-validation, and also tried different chunk sizes and selected the best one, again by cross-validation on the calibration data.

Here are some performance results. You see scatter plots when using three CSP directions per class or one CSP direction per class; on the x-axis is the error rate of CSP, and on the y-axis the error rate of our approach. You can see that especially the subjects which fail when using CSP, like these here, become really better with our method, and the same can be seen here. We computed a test statistic, and the changes are significant; our method works better especially for the subjects which have an error rate larger than thirty percent, so we can improve in those cases where CSP fails. This is somehow clear, because if CSP works well, then your patterns are probably really good and the signal-to-noise ratio is good, so you do not have a lot of room for improvement.

So the question is: why does sCSP perform better? Basically, we know that CSP may fail to extract the correct patterns when affected by artifacts, and as you saw, stationary CSP is more robust to artifacts because it treats artifacts as non-stationarities and reduces the non-stationarities in the features. CSP is also known to overfit, while sCSP overfits less and produces fewer changes in the features. For example, here you see the result of a subject performing left and right hand motor imagery. You see that both methods are able to extract the correct left hand pattern, with activity over the right hemisphere, which is the pattern for left hand motor imagery. But in the case of the right hand, the CSP method fails, probably because at this electrode there is an artifact, or the signal is noisy or somehow non-stationary there. sCSP is also a bit affected by the artifact at this electrode, but it is still able to extract a more or less correct pattern for the right hand.
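Here is a minimal sketch of the evaluation pipeline mentioned above, log-variance features plus an LDA classifier; scikit-learn's LDA is my substitution for whatever implementation was actually used, and all shapes and names are assumptions.

```python
# Sketch: log band-power features from spatially filtered trials, then LDA.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def log_variance_features(trials, W):
    """trials: (n_trials, n_channels, n_samples); W: (n_channels, n_filters)."""
    projected = np.einsum('ck,tcs->tks', W, trials)  # apply the spatial filters
    return np.log(projected.var(axis=2))             # variance = band power

# Hypothetical usage, with W fitted on the calibration data only:
# clf = LinearDiscriminantAnalysis().fit(log_variance_features(X_calib, W), y_calib)
# error_rate = 1.0 - clf.score(log_variance_features(X_test, W), y_test)
```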
You also see it here when you look at the distributions of the training features and the test features, the training features being the triangles and the test features the circles. You see that the training distribution of CSP looks like this here, but it changes a lot when you go to the test phase; when you look at the test features, the distribution is completely different. But when we use sCSP, we extract more stable, more stationary features, so the distributions in the training and test phases are more or less the same, and in this case you can classify a lot better. Here is the decision boundary, and you see that in the other case you really fail to classify correctly here.

Okay, so in summary: we extended the popular CSP method to extract stationary features. sCSP significantly increases the classification accuracy, especially for subjects who perform badly with CSP. Unlike other methods such as invariant CSP, we are completely data-driven; we do not require additional recordings or models of the expected changes. We also showed, although it was not presented in this paper, that the combination of stationary features and unsupervised adaptation can further improve classification performance. I want to thank you for your attention.

Question: Can you explain in more detail that function F in your penalty term?

Answer: Yes, so this function F is kind of a heuristic: it makes this difference matrix positive definite, which means it flips the sign of all the negative eigenvalues. Why do you want to do this? Because we want to sum up positive values: here, for example, you sum up positive deviations, and you kind of want to do the same there. So you make the difference matrix positive definite, and then we can show that this is an upper bound on the other quantity.

Question: So you apply the operation, the flipping of the sign, to the whole eigendecomposition?

Answer: We compute the difference matrix, then we do an eigendecomposition, and then flip the sign of all negative eigenvalues.

Question: So you keep the positive ones untouched?

Answer: Exactly.

Question: And the eigenvectors, are the directions also flipped somehow?

Answer: When you have an eigenvector with a negative eigenvalue, you simply flip the sign of the eigenvalue; you do not change a lot, you only flip it, because you are only interested in positive contributions.

Question: Okay, thank you.
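A minimal sketch of the flipping operator F described in the exchange above (my own code): eigendecompose the symmetric difference matrix and take the absolute values of the eigenvalues, leaving the eigenvectors untouched.

```python
# Sketch of F: make a symmetric matrix positive (semi-)definite by
# flipping the sign of its negative eigenvalues.
import numpy as np

def flip_negative_eigenvalues(M):
    eigvals, eigvecs = np.linalg.eigh(M)                  # M must be symmetric
    return eigvecs @ np.diag(np.abs(eigvals)) @ eigvecs.T
```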
Question: How did you define the chunks? Do you follow some protocol, or could you also use clustering to find similar trials?

Answer: You can simply use a chunk size of one, which means that each trial enters as its own chunk. So you can do it trial-wise, or you can put subsequent trials from the same class together in one chunk. We do not apply any clustering; we only put subsequent trials together, or we treat each trial separately.

Question: Was the data recorded in several sessions at different times?

Answer: No, this was only one test session.

Question: About the choice of the chunk sizes: if you use a chunk size which is larger than one, wouldn't you average out part of the non-stationarity?

Answer: Yes, and this was exactly the idea of using different chunk sizes: if you use a chunk size of one, then you detect changes on a small time scale; if you take larger chunk sizes, then the time scale will also be bigger, because you average out changes which only occur, for example, in one trial. So we tried different chunk sizes and selected the best one using cross-validation.
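Putting the pieces together, here is a hedged end-to-end sketch of how the chunk covariances, the penalty matrix, and the penalized generalized eigenvalue problem could look. Chunking by consecutive trials follows the answers above, while the function names, the two-class assumption, and the exact placement of the trade-off parameter alpha are my illustrative choices, not the authors' implementation.

```python
# Sketch of sCSP: class-average covariances, chunk-wise deviations made
# positive definite with F, summed into a penalty matrix, and a penalized
# Rayleigh coefficient solved as a generalized eigenvalue problem.
import numpy as np
from scipy.linalg import eigh

def flip_negative_eigenvalues(M):
    vals, vecs = np.linalg.eigh(M)
    return vecs @ np.diag(np.abs(vals)) @ vecs.T

def scsp_filters(trials_plus, trials_minus, chunk_size=1, alpha=1.0, n_filters=3):
    """trials_*: (n_trials, n_channels, n_samples) arrays, one per class."""
    sigma, penalty = [], 0.0
    for trials in (trials_plus, trials_minus):
        covs = np.array([t @ t.T / t.shape[1] for t in trials])  # trial covariances
        sigma.append(covs.mean(axis=0))                          # class average
        for k in range(0, len(covs), chunk_size):                # consecutive chunks
            delta = covs[k:k + chunk_size].mean(axis=0) - sigma[-1]
            penalty = penalty + flip_negative_eigenvalues(delta)
    s_plus, s_minus = sigma
    # Top eigenvectors maximize w'S+w / (w'S-w + alpha * w'Pw), and vice versa.
    _, v_plus = eigh(s_plus, s_minus + alpha * penalty)
    _, v_minus = eigh(s_minus, s_plus + alpha * penalty)
    return np.hstack([v_plus[:, -n_filters:], v_minus[:, -n_filters:]])
```

Note how the chunk_size argument directly controls the time scale of the non-stationarities being penalized, matching the answer above; both it and alpha would be selected by cross-validation on the calibration data.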