So we have on the panel the speakers from the talks you have just heard, among them colleagues from Cornell and from CMU. The questions on the slide are just meant to seed the discussion, but you are more than welcome to reply if you have another opinion on them, or to come up with other questions. The first question would be: what is the most promising approach that you can imagine so far? The theme of the session was audio-visual detection of non-linguistic vocal outbursts, so what is the most promising approach to this topic? Would any of the panelists like to comment first?

So, this really isn't my area; my PhD topic is a little bit to the side of this. But I think that in this sort of non-linguistic, extra-linguistic area, voice quality plays a very big role. Unfortunately, a lot of the features that people have used for voice quality tend not to be robust enough, and that is the reason why people don't use them. I think that more robust features that can measure different dimensions of voice quality could be beneficial to the recognition.

Thank you. Any other opinions?

Yeah, I think a lot can be done on the classification side, for example with different fusion techniques. So far we have seen early fusion and a stepwise approach; of course you could also feed the different modalities into different classifiers in a late-fusion approach and estimate the confidences of the different classifiers, because for some problems, for example when the visual modality does not give the right confidence, you could decide on audio only, or something like this. On the audio-visual part maybe my colleague can comment on what is missing.

Honestly, I don't currently see anything that strikes me as singularly promising.
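The confidence-weighted late fusion described in this exchange can be sketched roughly as follows. This is a toy illustration only, not any specific system discussed here; the class posteriors, confidence values, and function name are all invented for the example:

```python
import numpy as np

def late_fusion(p_audio, p_video, conf_audio, conf_video):
    """Combine per-modality class posteriors, weighting each
    classifier by an externally estimated confidence score.
    A modality with very low confidence (e.g. the visual channel
    during occlusion) then contributes little to the decision."""
    w_a = conf_audio / (conf_audio + conf_video)
    w_v = conf_video / (conf_audio + conf_video)
    fused = w_a * np.asarray(p_audio) + w_v * np.asarray(p_video)
    return fused / fused.sum()  # renormalise to a distribution

# Example: the visual classifier is unreliable (confidence 0.1),
# so the audio posterior dominates the fused decision.
fused = late_fusion([0.8, 0.2], [0.3, 0.7], conf_audio=0.9, conf_video=0.1)
# fused is [0.75, 0.25]: class 0 wins, driven by the audio stream.
```

How the confidence scores themselves are estimated is the open part of the suggestion; this sketch simply assumes they are given.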
I mean, there are so many problems. The main problem, the real question, is that there is a huge variability of expressions, and it is not easy to group them into well-defined classes. The within-class variability is very high: the expression varies because of head and posture movements, and also because of different cultures. So it is an open problem; there are several approaches that look promising, but I am not sure which one is best.

Maybe a question from the audience. What I am asking myself is whether this localisation and detection of these outbursts is so much different from what we have done before in many other recognition applications. I remember that in the late nineties, more than ten years ago, we were working on broadcast news, and we were using basically the same kinds of methods for distinguishing between, let's say, speech, non-speech, music parts and silence. So I am really asking: do you think there is a big difference between what we did then, some twelve years ago, on this kind of similar problem, or do you believe that this task brings extremely new challenges which will lead to different approaches? Thank you.

I don't know exactly what kinds of problems you're referring to. I'm sure that there are lots of problems that are similar to the problems being tackled these days, but I think it's fair to say that there are also lots of new problems that were not considered in the nineties that people are looking at now. That said, for the problems that are in common, I think people generally don't do the comparative studies to demonstrate what in fact is new about the new techniques. So sometimes the data goes away, and sometimes the people that did the work go away, and there's not enough effort being put in to demonstrate that novel mathematical techniques are in fact worth doing. That's my opinion.
Maybe just to add on this, because you mentioned you don't know exactly what was done then. What was done was in a way similar to what people do today: people were basically training neural networks on both classes, and then they were using sliding windows, going over the streams, looking at the outputs, and then assigning those segments to the different classes. That was part of my motivation to say: look, there is a similarity to what is being done now. It is a bit different, of course; we use maybe some different types of neural networks, but not so much. I think that was the point of my previous question, actually.

Okay, thank you for that comment. So maybe we can move on to the second question, which is: what are the most important open problems in this context? Anybody from the panel?

Well, this is an old question, but in emotion recognition some people have started studying how to do this cross-culturally. It seems that certain vocal outbursts carry over across cultures, but some do not, and I think more effort in this area could be useful; you could actually gain some knowledge from this kind of task.

I'll take this one. In my view, and I mostly do speech, but as far as I can see, one of the organisational problems is that speech processing, as somebody has pointed out, is more developed than visual processing, and sites that claim to do multimodal processing are normally either more on the video side or more on the audio side; of course there are always exceptions. So in such a case video can be rather developed while audio is less so, or the other way round.
So this might be one of the organisational problems, and it also refers to funding and so on: we need to bring together real specialists from both sides, and not have a vision paper doing some audio on the side, or vice versa. My second point refers to terminology, which might seem to be a rather minor problem, but let me tell you a story. I don't like this taxonomy into non-verbal and so on, and here is why. Back in the nineties, in the German Verbmobil project, we had a convention that pauses were conceived as non-linguistic stuff, like breathing and coughing and so on, so they were marked with special brackets. You could say, well, what's the problem? The problem was that the people implementing the ASR engine chose to treat all of this as garbage and didn't take it into account. Later on, when we wanted linguistic models and wanted to look at and use hesitations, these people from the ASR engine said: well, that's not in our project, we can't do it, we can't reimplement it because that takes some weeks. So we had to run our own ASR tests, because in the beginning these phenomena had been defined as non-linguistic. You can say that nowadays ASR is more intelligent, but I doubt it. So I think we really should take care of even these problems that seem minor, from the very beginning.

Thank you. Anybody else, or any comments from the audience, on open problems in this respect?

Yes. It looks like most of the work has been done using acted facial expressions, all of it in a single-speaker kind of setting. I wanted to know whether there is any work on, say, a meeting kind of scenario, where you have many people and you have cameras monitoring different people, so that a person's facial expression is recorded at low resolution, with occlusions, and where the audio comes from non-lapel microphones, those kinds of things.

Let me turn to our colleague who works in the meeting domain.
So, actually, I couldn't quite fully hear everything that you said, but if I understood, you asked how people working on meetings deal with the same problems that are treated here. I don't know very much about it, but I can tell you that it leads to an explosion of sensors. There have been quite a lot of projects, actually at the EU level, that have dealt with instrumenting meeting rooms and seminar rooms, et cetera. I find that research kind of exclusive, because there are only a couple of groups for whom the number of video cameras and the number of microphones that are necessary, fixed inside a fixed-geometry room, are possible. On the video side I don't really know, but on the speech side there are a lot of problems that are being tackled in essentially the same way as in the single-speaker setting, for example beamforming the audio from a microphone array or a set of microphones.

In most of the works I have seen on meetings, basically what they do if they want to use facial expressions is use one camera per participant, so that you have one camera looking at the face of each person. Now, if, as you mentioned, there is only one camera, then as far as I know there are some works, and I think there is a group doing this, but they do gesture recognition or, let's say, not very sophisticated features, because it's a really hard problem with one camera, and they use it in order to detect, say, who is the most dominant person in a meeting and things like that. But I don't think there are many works using only one camera. I think also Professor Nick Campbell has done some work using a 360-degree camera on the table. Still, there are not many works like that, and one of the reasons is that you are not sure what features to extract; there are so many open problems.
Okay, so the next question would be: how can we better integrate audio and video? How can we better fuse them? This should be looked at from both sides, the video side and the audio side; it was mentioned that there are these well-known gaps between looking at it from one side or the other. So maybe someone can weigh in on this: how can we best integrate audio and video in the future?

One possibility is to use both video and audio information not only for early or late fusion on the very same problem, but to look at the context and then try to adjust the prior probabilities for the phenomena I am interested in. If I may come back to our data: it seems that if we are interested in these interactive phenomena, laughter and the like, then we can have a very close look at the body movements of the subjects, and when they are lively, we can simply reset the priors for the phenomena we are interested in. And this holds vice versa, of course. Well, coming from speech, I can mostly imagine the speech problems we are interested in, but it's the job of the video people to do it the other way round. So that is maybe just another type of approach that we can try, and my colleague told me that the same ideas are coming from the video side, so it might really be worthwhile.

For audio-visual work, the most popular approaches at the moment seem to be decision fusion and feature-level fusion, and also some other options like multi-stream HMMs. But I think this is not unique to audio and video; fusion of different modalities is a generic problem. These methods have been known for years, and it seems that nothing better has come out, so I guess people keep trying to find better fusion methods.
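To make the feature-level fusion mentioned in this answer concrete, here is a toy sketch under the common assumption that the audio feature rate is an integer multiple of the video frame rate (e.g. 100 fps audio features against 25 fps video features); the function name and feature dimensions are purely illustrative:

```python
import numpy as np

def early_fusion(audio_feats, video_feats):
    """Naive feature-level fusion: repeat each video frame so the
    two streams share a common rate, then concatenate per frame.
    Assumes len(audio_feats) is an integer multiple of
    len(video_feats); real systems interpolate instead."""
    ratio = len(audio_feats) // len(video_feats)
    video_up = np.repeat(video_feats, ratio, axis=0)[:len(audio_feats)]
    return np.concatenate([audio_feats, video_up], axis=1)

# 100 audio frames x 13 MFCCs, 25 video frames x 6 shape features
audio = np.zeros((100, 13))
video = np.zeros((25, 6))
fused = early_fusion(audio, video)  # shape (100, 19)
```

Frame repetition is exactly the "simple way" of matching frame rates that the panel flags as having no optimality guarantee; it merely makes the concatenation well-defined.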
Obviously, for example, in feature-level fusion a simple way is just to match the frame rates between audio and video, but there is no guarantee that this is the optimal way to do it. So yes, I think there is an increasing need for new fusion methods. Now, what will they be? I don't think anyone knows the answer yet, but I'm pretty sure a lot of people work on fusion of different modalities, so at some point there will be something better than what we have now.

So, that brings us to the fourth and last question on the slide: what applications do you see for all these types of non-linguistic vocalisations or vocal outbursts? Maybe each of you could take the microphone and name an application.

Prediction of turn-taking and its timing, and improved spoken dialogue systems; in particular, improved speech synthesis with proper prosody. It's not just non-linguistic items but also short words like "uh-huh" and "yeah". In unit selection synthesis from a large corpus, if you can get the right type of laugh or other short outburst into the selection, that provides much more natural synthetic speech.

I think my colleague said everything. From my point of view, an interesting area of application is personality analysis, for example assessment of leadership qualities. For instance, you could think that if someone is making a lot of hesitations, he is probably not a good speaker, so such classifications could be quite useful, at least at a global level.

For us the main goal was to have a more user-friendly interface; that's why we were mostly concentrating on laughter, but also on its localisation. For example, when you interact with an interface and it can detect that you laugh, then usually this means that you are happy; it means you are enjoying it.
But of course laughter can have a different meaning: it can be ironic. At the moment I think it's very hard to discriminate between these two types, but for us the point is just to make communication more user-friendly. If you take an example of an interaction between two people and you remove all these non-linguistic vocalisations, you will see that the interaction no longer seems natural. So our idea was that handling these cues leads to a more user-friendly interface, and that's how we began this work.

I think we could aim not only at assessing or monitoring personality, but at assessing and monitoring changes in personality over certain time spans. Just imagine, coming back to this leadership topic, somebody taking part in a management course: at the beginning and at the end we assess their personality and look at whether changes took place. The same holds for any interactive behaviour; it could be extended to whether somebody has become a better speaker, and things like that. But not single-instance judgements, because those are too error-prone; rather a combination of many of them.

A question: how about the dangers of setting things like these aside from linguistic stuff? Judging from these dream applications, I think paralinguistics has a lot to do with speech and with language in interaction, so what is the benefit of labelling all of this non-linguistic?

I think that's mostly a pragmatic question. It's just that linguistics has dominated this community's general interest for a long time, and to actually get something going that is non-linguistic, you need to exert a lot of effort at first. It's not that anyone actually wants to disintegrate it from linguistics; I don't think that's the general aim. That's just my view. Yeah.
I would totally agree with that, and I think the future is probably to have it combined with linguistics, alongside audio and video. With this we have actually reached the end of the session, and we are out of time. I would like to thank you all very, very much, the speakers and all of the audience, and I hope that we will have more fora for discussion in the future. Thank you.