
So, I am from a university in Finland, and this work is a collaboration with colleagues from Brown University and from Nokia Research Center.


I'm not sure whether I'm presenting this work at the correct conference, because unlike in speaker and language recognition, what we are interested in here is exactly the nuisance variation that we usually try to get rid of in speaker recognition.


We are interested, basically, in inferring the mobile user's context based on audio signals. By context, in this work, we mean the physical location, the user's activity, or the particular environment.

If we have such information, we can use it for many purposes, for example social networking applications or adapting the acoustic models, and so on.



In this study I'm going to focus only on environment recognition based on acoustic cues. More precisely, we consider nine different contexts that we encounter in everyday life, such as the office and the car, and we also have an option for an out-of-set case, for audio that does not match any of the predefined contexts.


As we all know, modern smartphones offer a variety of sensors, such as GPS, accelerometers, light sensors, and so on. In some of the early studies, accelerometer data, for instance, has been used for context identification, but we consider using the audio signal.

One obvious reason for that is that we don't need any dedicated extra hardware: any phone with a microphone can be used to recognize the context. The other reason is that we don't depend on a network infrastructure, if we compare for instance with GPS or WiFi signals.


Actually, in some cases, for instance if we want to tell apart different vehicles, like a normal car and a bus, then based on GPS data alone it would be quite difficult to tell whether it's a car or a bus; but if we had the audio, we could make this discrimination. In fact, there is some recent evidence that the audio cues can be more helpful in some cases when we are trying to recognize the user's context.

Here are a couple of examples of what the data looks like in the different contexts. The first one is probably familiar to all of us, the office environment; these are three-second segments, so they are fairly short. And here is the car environment.


And on the right we see three more short samples. These are examples of how we can get quite different acoustics depending on which user, or what type of device, has been used to collect the data. This is representative of the intra-class variability that we are facing in this problem.


Then there is another example from the same user, and the funny sound that you can hear is probably because the user had the phone in his pocket, close to the microphone.



So this gives an idea of what the problem is about. We consider it as a supervised, closed-set identification task, where we train a context model for each of our ten classes. And, okay, that's probably not the most correct way of doing it, but we also trained an explicit model for the out-of-set class, rather than trying to treat it as a rejection problem.

Quickly, about the feature extraction: we use a typical MFCC front-end, the same as we see in speaker recognition, with thirty-millisecond frames and 16 kHz audio.

The two differences from speaker and language recognition are, first, that we don't include any feature normalization here, because we believe that the channel bias and such properties of the mobile devices also contain information that is useful for the context; and second, that the frame rate is much reduced, so we are actually skipping frames and using non-overlapping frames, because there are real-time requirements here.
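To make the frame-rate point concrete, here is a minimal framing sketch in Python. The 30 ms frame length and 16 kHz sampling rate come from the talk; the three-second segment length and the 50%-overlap baseline are my illustrative assumptions.

```python
import numpy as np

def frame_signal(x, sr=16000, frame_ms=30, overlap=False):
    """Split a waveform into fixed-length analysis frames.

    With overlap=False the hop equals the frame length, i.e. the frames
    are non-overlapping, which roughly halves the per-second frame count
    (and hence the computation) versus a conventional 50%-overlap front-end.
    """
    frame_len = int(sr * frame_ms / 1000)            # 480 samples at 16 kHz, 30 ms
    hop = frame_len if not overlap else frame_len // 2
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])

# A three-second segment at 16 kHz:
x = np.zeros(3 * 16000)
print(frame_signal(x, overlap=False).shape)  # (100, 480): non-overlapping
print(frame_signal(x, overlap=True).shape)   # (199, 480): 50% overlap
```

The MFCC computation itself would then run on each frame; only the framing policy differs from the usual speaker-recognition front-end.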

So let's look at the classifier back-end, which is the focus of this work. I have tried to summarize on this slide the approaches that I could find in the literature, and, to make this more accessible, some analogies to how they might be related to speaker and language recognition.

Quite a number of authors have used very simple distance-based classification, such as k-nearest neighbor and VQ. Others have used Gaussian mixture models or support vector machines; SVMs as well have actually been studied in this field, usually not using GMM supervector kernels or anything like that, but training directly on the individual MFCC frames. And then of course there are HMMs, to try to model the temporal trajectories of the acoustic contexts.

A couple of authors have also used acoustic event detection: basically, you have a discrete set of event detectors, for events such as laughing and cheering, and then you construct a histogram of their outputs, similar to high-level speaker recognition.


As we know, the advantage of the simple methods is that they are light in computation and we don't need so many development datasets and so on, but they are limited in the sense that we don't have any modeling of frame dependence; the more complicated models can capture the temporal aspects, but involve more tuning and more data.

The application scenario that we consider is recognition from very short test segments, and we want to keep the computational cost, both for training and testing, as low as possible. The other factor is that we don't really have access to datasets similar to what we have in the NIST evaluations, at least not at a comparable level, for this purpose.

So, for these reasons, we focus here on relatively simple methods.

Our contribution here is basically to see how the familiar tools that we use in speaker and language recognition work for this task. The other thing is that in the previous studies, the data has usually been collected using the same microphone in fixed locations, but in this study we have a large collection of test samples collected by different mobile users, and with a couple of different mobile phone models as well, so there is a lot of variability with respect to the device and the user that collected the data. And we also wanted to compare a number of different classifiers.

Okay, here you see a couple of familiar and unfamiliar abbreviations. I'm not going to explain the classifiers in this study in detail, because they should be familiar to the audience. Basically, we have six different methods: the distance-based k-nearest neighbor and VQ, Gaussian mixture models trained with maximum-likelihood training and also with discriminative (MMI) training, and then two supervector systems, using GMM supervectors and the generalized linear discriminant sequence (GLDS) kernel. And there are some control parameters that we considered for each.

One note about the two simplest classifiers: because kNN would require storing the whole training set, which is not feasible, we use VQ codebooks to approximate the training set.
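As a rough illustration of this idea, here is a minimal VQ-codebook classifier sketch in Python. The codebook size, the plain k-means training, and the class names are illustrative assumptions, not the exact setup used in the study.

```python
import numpy as np

def train_codebook(frames, k=64, iters=20, seed=0):
    """Toy k-means codebook: approximates a class's training frames with
    k centroids, so scoring need not store all training frames (as kNN would)."""
    rng = np.random.default_rng(seed)
    codebook = frames[rng.choice(len(frames), k, replace=False)]
    for _ in range(iters):
        # assign each frame to its nearest centroid, then re-estimate centroids
        d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                codebook[j] = frames[labels == j].mean(0)
    return codebook

def vq_score(test_frames, codebook):
    """Average quantization distortion of the test segment; lower is better."""
    d = ((test_frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.min(1).mean()

def classify(test_frames, codebooks):
    """Pick the context whose codebook quantizes the segment with least distortion."""
    return min(codebooks, key=lambda c: vq_score(test_frames, codebooks[c]))
```

In use, one codebook would be trained per context class from that class's pooled MFCC frames, and a test segment is assigned to the class with the smallest average distortion.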



Here is an overview of the data. I'm not going to go through all the numbers in detail, but look at the last row and the last column of this table, which show the number of samples for the different classes and users. We can see that there is a massive imbalance in the data, which actually causes some problems for the classifier construction: some of the users didn't collect any data for certain classes, and some of them have been more active in collecting data. The most popular class seems to be the office, so many people have collected data at the office and didn't feel too enthusiastic about doing this everywhere else.

Regarding the users, most of the samples come from the city of Tampere, but there is one user who has actually collected data samples in Helsinki as well, so two different cities. And here you can see the different phone models that were included in the comparisons.

When we evaluated the classifiers, we used leave-one-user-out cross-validation, which means that when we test on user number one, we train the classifiers using the remaining five users, and then we repeat this over all the users. Also, we prefer to report the average class-specific accuracy rather than the overall identification accuracy, because the latter would be very much biased towards the office class; we want to see, on average, how we are doing per class.
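The evaluation protocol above can be sketched in a few lines of Python; the data layout (a list of `(user, label, features)` tuples) and the pluggable `train_fn`/`predict_fn` interface are my assumptions for illustration.

```python
import numpy as np

def leave_one_user_out(samples, train_fn, predict_fn):
    """Leave-one-user-out evaluation over (user, label, features) tuples:
    for each user, train on everyone else's data and test on theirs."""
    users = sorted({u for u, _, _ in samples})
    y_true, y_pred = [], []
    for held_out in users:
        train = [(lab, x) for u, lab, x in samples if u != held_out]
        test = [(lab, x) for u, lab, x in samples if u == held_out]
        model = train_fn(train)
        for lab, x in test:
            y_true.append(lab)
            y_pred.append(predict_fn(model, x))
    return y_true, y_pred

def average_class_accuracy(y_true, y_pred):
    """Average class-specific accuracy: per-class recall, then the mean,
    so a dominant class such as 'office' cannot inflate the figure."""
    classes = sorted(set(y_true))
    accs = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        accs.append(np.mean([y_pred[i] == y_true[i] for i in idx]))
    return float(np.mean(accs))

# A classifier that labels everything 'office' scores 80% overall accuracy
# on an 8-office / 2-car test set, but only 50% average class accuracy:
yt = ["office"] * 8 + ["car"] * 2
yp = ["office"] * 10
print(average_class_accuracy(yt, yp))  # 0.5
```

This is why the per-class average is the fairer figure for such an imbalanced dataset.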

So here are the results for the two simplest classifiers. On the x-axis you can see the codebook size that we used for VQ, and K for kNN, and on the y-axis the identification rate. As you can see, the best that we can achieve here is around forty percent, for kNN, and perhaps surprisingly, we get the best result when using just a single nearest neighbor. I have some possible explanations for that, but they are really just speculation. And then, not so surprisingly, we find that VQ scoring outperforms the best kNN configuration, generally when we use more than 256 code vectors.

Here are the results for the Gaussian mixture models, that is, frame-level Gaussian mixtures. The accuracy in general saturates when we use more than 512 Gaussians, and in these numbers we don't yet see the maximum benefit from the discriminative training; I will give a couple more details about that later.

The GMM-SVM system was actually the most confusing one for us, because we couldn't find any of the typical trends: when we increase the number of Gaussians, or the relevance factor, it is difficult to find any meaningful patterns here. We actually tried two different SVM optimizers, which took some time, to verify that the implementation is correct, but the results are still confusing. One reason could be that we are dealing with short data segments, rather than the roughly 2.5-minute segments we typically have in NIST speaker recognition.


When we trained the universal background model here, we didn't pay attention to data balancing, so we suspect that the poor GMM-SVM results could be partly caused by the UBM. The reason why we didn't do the balancing is that it would mean reducing the number of samples per class to the smallest amount available; as you can see, the smallest class has fewer than three thousand samples, so we didn't want to reduce the data that much. There could also be an issue with the SVM training itself: one of my colleagues suggested that maybe we should also try to balance the number of training examples per class, so these could be the reasons why we see these results.



For the GLDS kernel classifier, here are the results for three different monomial expansion orders, one, two, and three. Here you can see the number of elements in the corresponding supervector. It seems that we get the best accuracy, as a compromise, with the second-order polynomial expansion, around a thirty-five percent correct rate. The case of order one basically corresponds to training the SVM directly on the raw MFCC vectors, which is what has actually been used in many of the previous studies on context classification; so we do better than that.
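To make the supervector sizes concrete, here is a sketch of the monomial expansion underlying the GLDS kernel: each frame is mapped to all monomials of its elements up to the given order (including a bias term), and the segment is represented by the mean expanded vector. The 12-dimensional feature vector in the example is my assumption, and any normalization used in the actual system is omitted here.

```python
import numpy as np
from itertools import combinations_with_replacement

def glds_supervector(frames, order=2):
    """GLDS-style segment representation: average of per-frame monomial
    expansions up to `order` (the empty index tuple yields the bias term 1)."""
    dim = frames.shape[1]
    # index tuples for monomials, e.g. (), (0,), (0, 1), (1, 1), ...
    idx = [c for r in range(order + 1)
           for c in combinations_with_replacement(range(dim), r)]
    expanded = np.array([[np.prod(x[list(c)]) for c in idx] for x in frames])
    return expanded.mean(axis=0)

# For 12-dimensional features, order 2 gives 1 + 12 + 78 = 91 elements:
v = glds_supervector(np.random.randn(50, 12), order=2)
print(v.shape)  # (91,)
```

With order 1 the supervector is just the bias plus the mean MFCC vector, which is why that case reduces to training the SVM on the raw frames.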


So here is an overall comparison of the classifiers, where we set the control parameters to their best values. If you look at the results, there is not much difference between the methods; for the GMM we see some improvement from the discriminative training, and for the SVMs, as we already saw, the results fall behind the simpler frame-based methods.

Here is some more detail if we look at the results per class. On the left you can see the name of each class and the number of test samples for that class. Obviously, the office environment seems to be the easiest to recognize, most likely because here we have the largest number of training samples, and also because it appears to be very much the same office facilities: many of the users here are employees of Nokia Research Center, and some of them might even have been attending the same meetings. So there is some bias here.

Also, somewhat surprisingly, the other class, the out-of-set class, gets very good accuracy, even though it is not easy to model the acoustics of everything else except the known contexts. I would say this is much more difficult than training the UBM in speaker recognition, because we cannot predict all the possible audio that we might see.


The GMM-SVM is curious: it is almost like a Dirac delta function centered on the office class. As you can see, we get about one hundred percent recognition of the office environment, but then for some classes all the test samples are misclassified, so there is something funny with this one.

And if you look at the GMM systems, we see that the discriminative training helps in about half of the cases; again, this is speculation, but it might be because of the imbalance between certain classes here.

We were also interested to look at whether there is any user effect visible, because we can think that different users have different preferences; maybe they go to different restaurants, and so on. So we looked at this, and most of the users are quite similar, at around forty to fifty percent accuracy, except for the one user who has data samples from Helsinki. Remember that the models were trained using leave-one-user-out, which means that the city of Helsinki has basically never been seen in training; so it seems, or at least it suggests, that there is a bias due to the city.

We also made an attempt to look at the effect of the particular mobile device, but this is a somewhat problematic analysis, because too many things are changing at the same time.

These two cases are the ones with the most training data, so we would expect the highest accuracy on them, but that is not what we see in the test data; that goes against the expectation. And these two devices were basically never seen in training, because this is the only user with data from these devices, but for instance the VQ classifier performs okay for one of them and, again, not so well for the other. So this analysis is a little bit limited; it would probably need to be done better before we can conclude anything from it, and ideally we should have parallel recordings to see the device effect in isolation. From this analysis, though, it seems that the user, the city, and that kind of higher-level factors have a stronger impact than the device.


Let me conclude. I think that this task is very interesting, but it also appears to be quite challenging. The highest average class-specific identification rate that we obtained was around forty-four percent, after all the parameter tuning. None of the six classifiers really outperformed the others, at least among the four simple classifiers. The MMI training helped for some of the classes, but not systematically for all of them, so I think it is worth looking into further. And on this dataset we couldn't get really good results using the sequence-kernel SVMs, but perhaps there could be some different way of constructing the sequence kernel, not using the GMM.

So, I think that's what I wanted to say. Thank you.







Okay, if I understood the question right: it was actually constructed the same way as for the GMM systems.








Yeah, but I think it has very much to do with this.

Yes, we used it to do that, and I believe that is the case, that we did. Okay, I'm not completely sure.



















No, that's a separate topic. The data was collected with mobile phones, but we did all the simulations on a PC. There have been a couple of studies on the power consumption as well; it depends on the complexity of the classifier, and one might for instance prefer the simple frame-level methods because of that.

Yeah, right, exactly, especially if we need to continuously monitor the audio. And I don't have a clue what the exact figures would be.