0:00:13 So we have the panel up here for the talks: our panelist from Cornell, and our panelist from CMU. These questions are just meant to seed the discussion, but you're more than welcome to reply if you have any other opinion on them, or to come up with other questions.
0:00:34 So the first question would be: what is the most promising approach that you can imagine so far? The theme was audio-visual detection of non-linguistic vocal outbursts. So, what is the most promising approach to this? Any opinions? Who wants to comment first?
0:01:00 So, this really isn't exactly my area; my PhD topic is a little bit to the side of this.
0:01:11 But I think that in this sort of non-linguistic, extra-linguistic area of expression, voice quality plays a very big role. Unfortunately, a lot of the features that people have used for voice quality tend not to be robust enough, and that is the reason why people don't use them. I think that more robust features that can measure different dimensions of voice quality could be beneficial to the recognition.
0:01:43 Thank you. Anyone else, any opinions?
0:01:56 Yeah, I think a lot can be done on the classification side, for example with different fusion techniques. So far we have seen early fusion and a stepwise approach. Of course, you could also feed the different modalities to different classifiers, a late fusion approach, and estimate the confidences of the different classifiers, because for some problems, when for example the visual modality does not deliver the right confidence, you could decide on audio only, or something like this.
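The late-fusion idea described in this answer can be sketched as follows. This is a minimal illustration, not any panelist's actual system: the confidence threshold, the weighted-average rule, and all the numbers are made-up assumptions.

```python
# Hypothetical sketch of confidence-weighted late fusion: each modality has
# its own classifier producing class posteriors plus a confidence score.
# If one modality's confidence is too low, decide on the other alone;
# otherwise take a confidence-weighted average of the posteriors.

def late_fusion(audio_probs, video_probs, audio_conf, video_conf, min_conf=0.3):
    """Combine per-class posteriors from an audio and a video classifier."""
    if video_conf < min_conf:
        fused = audio_probs                      # video too unreliable
    elif audio_conf < min_conf:
        fused = video_probs                      # audio too unreliable
    else:
        total = audio_conf + video_conf
        fused = [
            (audio_conf * a + video_conf * v) / total
            for a, v in zip(audio_probs, video_probs)
        ]
    # Winning class index plus the fused distribution.
    best = max(range(len(fused)), key=lambda i: fused[i])
    return best, fused

# Example: audio strongly favours class 1; video falls below the threshold,
# so the decision is made on audio only.
label, probs = late_fusion([0.2, 0.7, 0.1], [0.4, 0.3, 0.3],
                           audio_conf=0.9, video_conf=0.2)
print(label)  # 1
```

The same function covers both behaviours the speaker mentions: falling back to a single modality when the other cannot deliver a usable confidence, and blending when both can.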
0:02:29 So from the audio-visual part, maybe someone can comment on what is missing?
0:02:38 I can't currently say that any one approach is clearly the most promising; I mean, there are so many problems we are given. Of course the main problem, the real question, is that there is a huge variability of expressions, and a lot of confusion between the different classes. The variability within the same class of expressions is very high: high because the expression itself varies, because of head pose and movements, and also because of different cultures. So it's not one problem; there are several approaches, and which is the most promising, I'm really not sure.
0:03:38 So, any comments from the audience?
0:03:48 So what I'm asking myself is: is this localisation and detection of these outbursts so much different from what we did so far in many other recognition applications? I remember that in the late nineties, more than ten years ago, we were working on broadcast news, and we were basically using the same kinds of efforts for distinguishing between, let's say, speech, non-speech, music parts and silence. So I'm really asking, maybe some of you: do you think there is a difference between what we did then, say, twelve years ago, for this kind of similar problem, or do you believe that this kind of task brings genuinely new challenges which will lead to different approaches?
0:04:39 Thank you. Go ahead.
0:04:50 I don't know exactly what kinds of problems you're referring to. I'm sure that there are lots of problems that are similar to the problems being tackled these days, but I think it's fair to say that there are lots of new problems that were not considered in the nineties that people are looking at now. That said, for the problems that are in common, I think people generally don't do the comparative studies to demonstrate what in fact is new about the new techniques. Sometimes the data goes away, and sometimes the people that did the work go away. There's not enough effort being put in to demonstrate that novel new mathematical techniques are in fact worth doing. That's my opinion.
0:05:44 Maybe just to add on this, because you mentioned you don't know exactly what was done: what was done then is quite similar to what you do today. People were basically training neural networks on both classes, and then they were using a sliding window, going over the streams, looking at the outputs, and then assigning those segments to those classes. So that was my motivation to say, well, look, there is a similarity to what is being done now. Although the details differ, and we maybe used somewhat different types of neural networks, it's not so different. That was the motivation behind my previous question, actually.
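The sliding-window scheme the speaker describes can be sketched roughly as follows. The `classify_window` function is a made-up stand-in for the trained network, and the window/hop sizes and threshold are illustrative assumptions only.

```python
# Sketch of sliding-window stream segmentation: a classifier scores
# overlapping windows of a feature stream, and each frame is assigned the
# majority label of the windows covering it.

def classify_window(window):
    # Toy stand-in for a trained network: label 1 ("speech") if the mean
    # feature value exceeds 0.5, else 0 ("non-speech").
    return 1 if sum(window) / len(window) > 0.5 else 0

def sliding_window_labels(stream, win=4, hop=2):
    """Assign a label to every frame by majority vote over window decisions."""
    votes = [[] for _ in stream]
    for start in range(0, len(stream) - win + 1, hop):
        label = classify_window(stream[start:start + win])
        for i in range(start, start + win):
            votes[i].append(label)
    # Majority vote per frame; frames never covered by a window keep label 0.
    return [max(set(v), key=v.count) if v else 0 for v in votes]

# A stream whose second half looks like "speech".
stream = [0.1, 0.2, 0.1, 0.9, 0.8, 0.9, 0.9, 0.1]
print(sliding_window_labels(stream))
```

Contiguous runs of identical frame labels then form the segments assigned to each class, which matches the workflow described for the broadcast-news work.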
0:06:24 Okay, thanks for the comment on that. 0:06:29 So maybe we can move to the second question, which was: which other problems are the most similar to this one? And are there any specific problems, maybe in this context, which differ from those?
0:06:46 Well, this is an old question, but I think that in emotion recognition some people have been starting to study how to do this across cultures. Certain vocal outbursts carry over across cultures, but some do not, and I think more effort in this area could pay off; you could actually gain some knowledge from this kind of task. Thank you.
0:07:24 I'll take this. In my view, and I mostly do speech, but as far as I can see, one of the organisational problems is that speech processing, as somebody has pointed out, is more developed than visual processing. Yet sites that claim to do multimodal processing are normally rather stronger on the video side than on the audio side: in such cases video can be rather developed while audio is not. So it might be one of the organisational problems, and that refers to funding and so on: to bring together real specialists from both sides, and not to have vision people doing some audio on the side, and vice versa.
0:08:32 And my second point refers to terminology, which might seem to be a rather minor problem, but I'll tell you a story. I don't like this taxonomy into non-verbal and so on, and here is why. Back in the nineties, in the German Verbmobil project, we had a convention that pauses were conceived of as non-linguistic stuff, like breathing and coughing and so on, so all of it was marked with the same brackets. You can say, well, what's the problem? The problem was this: the people implementing the ASR engine just threw all this "garbage" away and didn't take it into account. Later on, when our linguistic models wanted to have a look at these hesitations and wanted to use them, these guys from the ASR engine said, well, that's not in our project, we can't do it, we can't reimplement it because that takes some weeks. So we had to do our own ASR tests, because in the beginning it had been defined as non-linguistic. You can say nowadays ASR is more intelligent, but I doubt it. So I think we really should take care of even these problems that seem to be minor, from the very beginning.
0:10:09 Thank you. Any other comments, maybe from the audience, on problems in this respect?
0:10:22 It looks like most of the work has been done using, say, facial expressions and audio for a single speaker, in a kind of clean setting without noise. Is there any work on a meeting kind of setting, where you have many people, and one camera monitoring different people, so that a person's facial expression is very blurred, and things like that? What is the quality of these non-verbal cues there?
0:11:01 So I'm looking at a panelist who works on meetings. Actually, I couldn't quite hear everything that you said, but I believe you asked how people working on meetings deal with the same problems that are treated here, is that right?
0:11:57 I don't know very much about it, but I can tell you that it leads to an explosion of sensors. There have been quite a lot of projects, actually at the EU level, that have dealt with instrumenting meeting rooms and seminar rooms, et cetera. I find that research kind of exclusive, 'cause there are only a couple of groups that can do research where the number of video cameras and microphones that are necessary, fixed inside a fixed-geometry room, is possible. On the video side I don't really know, but on the speech side there are a lot of problems being tackled that are essentially the same as in the single-speaker setting, after, for example, beamforming the audio from a microphone array or a set of microphones.
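The beamforming front end mentioned here can be illustrated with a toy delay-and-sum beamformer. This is a sketch only: the per-channel delays are assumed to be known in advance (in practice they are estimated from the array geometry and speaker position), and the signals are made-up sample lists.

```python
# Toy delay-and-sum beamformer: each microphone signal is advanced by its
# known delay (in samples) and the aligned channels are averaged, so the
# target speaker's signal adds up coherently across microphones.

def delay_and_sum(signals, delays):
    """signals: equal-length sample lists, one per mic;
    delays: samples by which to advance each channel."""
    n = len(signals[0])
    out = []
    for t in range(n):
        acc, cnt = 0.0, 0
        for sig, d in zip(signals, delays):
            if 0 <= t + d < n:          # skip samples shifted out of range
                acc += sig[t + d]
                cnt += 1
        out.append(acc / cnt if cnt else 0.0)
    return out

# Two channels carrying the same pulse; the second mic hears it 2 samples late.
ch1 = [0, 0, 1, 0, 0, 0]
ch2 = [0, 0, 0, 0, 1, 0]
print(delay_and_sum([ch1, ch2], delays=[0, 2]))
```

After alignment, the pulse from both microphones lands on the same output sample, which is exactly the property that lets single-speaker speech methods run on the beamformed stream.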
0:13:01 Yeah, most of the works I have seen on meetings, if they want to use facial expressions, basically use one camera per participant, so there is one camera looking at the face of each person. Now, in the setting you mentioned there is only one camera. As far as I know there are some works; I think there is a group, at Idiap I believe, but they do gesture recognition, or, let's say, use not very sophisticated features, because it's a really hard problem when you have only one camera, in order to detect, for instance, who is the most dominant person in a meeting, and things like that. But I don't think there are many works using only one camera. I think also Professor Nick Campbell has done some work using a 360-degree camera on the table. But there are not many works on that, and one of the reasons is that it's not easy: you are not sure what features to extract, and there are so many problems.
0:14:23 Okay, so the next question would be: how can we better integrate audio and video? How can we best fuse these? And that should be looked at from both sides, from the video side and from the audio side. It was mentioned that there are these gaps in between when looking at it from one side or the other. So maybe someone can weigh in on this: how can we best integrate audio and video in the future?
0:14:59 Well, one possibility is not to use both video and audio information only for early or late fusion on the very same problem, but to look at the context, and then try to adjust the priors for the phenomena I'm interested in. May I come back to our data: it seems that if we are interested in these interactive phenomena, then we can have a very close look at the body movements of the subjects, and when those make a phenomenon likely, we can just reset the priors for the phenomena we are interested in; and this holds vice versa, of course. Well, coming from speech, I can mostly imagine the speech problems we are interested in, but it's the job of the video guys to do it the other way round. So that's maybe just another type of approach that we could try, and my neighbour just told me that the same ideas are coming from the video side, so it might really be worthwhile.
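The prior-adjustment idea in this turn can be sketched with plain Bayes rule: keep the acoustic likelihoods fixed, but rescale the class priors when contextual evidence (say, visible body movement) makes a phenomenon such as laughter more likely. All class names and numbers below are illustrative assumptions, not from the panel.

```python
# Sketch of context-dependent prior adjustment: the same likelihoods are
# combined with different priors depending on the observed context.

def posterior(likelihoods, priors):
    """Bayes rule: normalised product of per-class likelihoods and priors."""
    joint = {c: likelihoods[c] * priors[c] for c in likelihoods}
    z = sum(joint.values())
    return {c: p / z for c, p in joint.items()}

# Acoustic evidence is mildly ambiguous between laughter and speech.
likelihoods = {"laughter": 0.4, "speech": 0.6}

# Neutral context vs. a context where body movement suggests laughter.
neutral = posterior(likelihoods, {"laughter": 0.2, "speech": 0.8})
moving  = posterior(likelihoods, {"laughter": 0.5, "speech": 0.5})
print(neutral["laughter"] < moving["laughter"])  # True
```

The point of the example: the audio model is untouched; only the prior shifts with the video context, which is the "adjust the priors" strategy rather than fusing features or decisions directly.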
0:16:30 So, maybe someone from the video side?
0:16:35 I think that in audio-visual work, decision fusion and feature-level fusion seem to be the most popular approaches at the moment, and there are also some other approaches, like multi-stream HMMs. But I think this is not unique to audio and video; fusion of different modalities in general is a key problem. And so far these methods have been known for years, and it seems nothing better has come out, so I guess people keep trying to find better fusion methods. Obviously, for example, in feature-level fusion, a simple way is just to match the frame rates between audio and video, but there is no guarantee that this is the optimal way to do it. So yeah, I think there is an increasing need for new fusion methods. Now, what is this going to be? I don't think anyone knows the answer yet. I'm pretty sure a lot of people are working on fusion of different modalities, so at some point there will be something better than what we have now.
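The "match the frame rates" remark can be made concrete with a minimal sketch of feature-level fusion: slower video features are upsampled by repetition to the audio frame rate and concatenated frame by frame. The rates (100 fps audio, 25 fps video) and feature values are illustrative assumptions.

```python
# Sketch of feature-level fusion with frame-rate matching: repeat each video
# feature vector until the video stream runs at the audio frame rate, then
# concatenate audio and video vectors per frame.

def upsample_repeat(frames, factor):
    """Repeat each feature vector `factor` times."""
    return [f for f in frames for _ in range(factor)]

def feature_level_fusion(audio_frames, video_frames, rate_ratio):
    video_up = upsample_repeat(video_frames, rate_ratio)
    n = min(len(audio_frames), len(video_up))   # trim to the shorter stream
    return [audio_frames[i] + video_up[i] for i in range(n)]

audio = [[0.1], [0.2], [0.3], [0.4]]   # e.g. 100 fps audio features
video = [[1.0]]                        # e.g. 25 fps video features
fused = feature_level_fusion(audio, video, rate_ratio=4)
print(fused)  # [[0.1, 1.0], [0.2, 1.0], [0.3, 1.0], [0.4, 1.0]]
```

As the speaker notes, repetition is only the simplest choice; interpolation or learned alignment could replace `upsample_repeat`, and nothing guarantees any of them is optimal.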
0:19:35 Actually, that was the fourth and last question on the slide: what applications do you see for all these types of non-linguistic vocalisations or vocal outbursts? So, where would they be used? Each of you, please take the microphone and tell us your favourite non-linguistic vocalisation application.
0:20:00 Prediction of turn-taking, in real time.
0:20:10 Spoken dialogue systems in particular; improved speech synthesis, probably, not just for non-linguistic vocalisations but also for short elements like "uh": unit selection from a large corpus. If in synthesis you can attach the right type of "yeah" or other short outbursts, that might provide more natural-sounding synthetic speech.
0:20:42 I think he already said everything.
0:20:46 Yeah, from my point of view, an interesting application is personality analysis, for example assessment of leadership qualities. For instance, you could think that if someone is making a lot of hesitations, they are probably not a good speaker, and so such classifications could be quite key.
0:21:08 Globally, I mean, our main goal was to have a more user-friendly interface. That's why we were mostly concentrating on laughter, but also on other vocalisations. For example, when you interact with an interface and it can detect that you laugh, this usually means that you are happy, that you enjoy it. Of course laughter can have a different meaning: it can also mean "I don't like it", and at the moment I think it's very hard to discriminate between these two types. But for us it was just about making the communication more user-friendly, because if you use just speech... I think if you take an example of an interaction between two people and you remove all these non-linguistic vocalisations, you will see that the interaction does not seem so natural. So our idea, our main motivation, was a more user-friendly interface, and that's how we began this work.
0:22:11 I think we could aim not only at assessing or monitoring personality, but at assessing and monitoring changes in personality over certain time spans. Just imagine, to come back to this leadership topic, somebody taking part in a management course; at the beginning and at the end we assess their personality and have a look at whether changes took place. The same holds for any interactive behaviour; it could be extended to well-being and things like that. But not with single-instance judgements, because those are too erroneous; rather with a combination of many observations.
0:23:05 A question: what about the dangers of setting things like these aside from the linguistic stuff? Out of all these dream applications, I think linguistics certainly has to do with speech and with interaction, so what is the benefit of treating these phenomena as non-linguistic?
0:23:58 I think that's mostly a pragmatic question. It's just that linguistics has dominated this community's general interest for a long time, and to actually get something going that is non-linguistic, you need to exert a lot of effort, and at first you may sound like you want to disintegrate it from linguistics; but I don't think that's the general aim. That's just my take.
0:24:28 Yeah, I would totally agree with that, and I think the future is probably to have it combined: acoustics with linguistics, on top of audio and video. But with this we have actually reached the end of the session, and we are out of time. I would like to thank you all very much, the speakers and the audience, and I hope we will have more of this discussion in the future. Thank you.