0:00:14 Thank you. So, as was just mentioned, my name is Stavros, and I am going to present our work on the audiovisual classification of vocal outbursts. This is joint work with colleagues from our group and from the group at TUM. Okay.
0:00:36 So, what are vocal outbursts? They are non-linguistic vocalisations, which are usually accompanied by facial expressions as well. Examples include laughter, which is probably the most common, as well as coughing and breathing, but there are many other types of vocalisations too. We may not realise it,
but these vocalisations play an important role in spontaneous conversations. For example, it has been shown that laughter punctuates speech, which means that we tend to laugh at places where punctuation would be placed. Another example: when two participants in a conversation laugh simultaneously, it is very likely that this indicates the end of the topic they discuss, and that a new topic will start.
0:01:27 Apart from laughter, which is probably the most widely studied vocalisation, most of the other vocalisations are used as a feedback mechanism during interaction. As I said before, they are very common in real conversations, although we don't realise it.
0:01:48 There have been several works on laughter recognition and classification from audio only, and also a few works on audiovisual classification of laughter, but works on recognising or discriminating between different vocalisations are limited compared to laughter. One of the main reasons is the lack of data.
0:02:08 So what was our goal in this work? We would like to discriminate between different vocalisations, since we did have access to a dataset that contains such vocalisations, using not only audio features but also visual features. The idea here is that, since most of the time there is a facial expression involved in the production of a vocalisation, this information can be captured by visual features and can improve the performance when it is added to the audio information.
0:02:42 Okay, so the database we used was the audiovisual interest corpus from TUM, which contains twenty-one subjects and 3,901 turns, recorded in a dyadic interaction scenario: a presenter and a subject. During the interaction there are several vocalisations. The partitioning we used is the same one that has been used for this dataset in the speech paralinguistics challenge, with the same split into training, development, and test sets.
0:03:25 Unfortunately, on this slide the development column is missing. There are four classes of non-linguistic vocalisations: breathing, consent (utterances like "yes"), hesitation, and laughter. There is also another class, garbage, which contains other noises.
0:03:46 For the experiments I'm going to show you, we have excluded the breath class, because most of the time it is barely audible in this dataset, and most of the time there is no visible facial expression associated with it.
0:04:03 Okay, so let me just show you a few examples. This is an example of laughter from the database.
0:04:23 As you can see, although there is a camera pointed directly at the face of the subject, there is still significant head movement, which is quite common in natural interaction. For example, when you build a database where subjects watch a funny video clip and you record their reaction, the setting is static (they just watch something), so there is little head movement. In this case, as in real cases in general, the head movement is always there.
0:05:01 Okay, so let me now show an example of hesitation. As you can see, it is pretty subtle. And here is an example of consent: basically, in this one there was not much of an expression, just some head movement.
0:05:40 Okay, so we use this database, and the task is the classification of the vocalisations into four classes: the three vocalisations (consent, hesitation, and laughter) plus garbage. This slide is just an overview; I will explain each step in the next slides. We extracted audio and visual features; the visual features were upsampled to match the frame rate of the audio features and then were concatenated for feature-level fusion. Classification was performed with two different approaches: one was SVMs, and the other was long short-term memory recurrent neural networks.
0:06:25 Okay, so the frame rate for the visual features is twenty-five frames per second, which is a common frame rate. There are two types of visual features: shape features, which are based on a point distribution model, and appearance features, which are based on PCA on image gradient orientations.
0:06:45 So, in the beginning we track twenty points on the face: points on the mouth, the chin, the eyes, and the eyebrows. Here you can see an example of a subject laughing, with the points tracked on the face. As you may see, tracking errors do happen; you should not expect perfect tracking. So we initialise the points, and then the tracker follows these twenty points.
0:07:19 Now, the main problem, and this is present for both shape and appearance features, is that we want to decouple head pose from facial expressions. How do we do this? Basically, we use a point distribution model. Each point has two coordinates, x and y, so if we concatenate all the coordinates we end up with a forty-dimensional vector for each frame. If we now stack these vectors from all frames into a matrix (with K frames we end up with a K-by-40 matrix), we can then apply PCA to this matrix. It is well known that the greatest variance of the data lies in the first few principal components. This means that, since there are significant head movements, most of that variance will be captured in the first principal components, whereas facial expressions, which account for a smaller variance, will be encoded in the lower-order components.
0:08:20 So in this case we found that the first four components correspond to head movements, and the remaining ones, from five to ten, to facial expressions. Of course, this depends on the dataset; in other datasets with even stronger head movement, we may find that the first five or six components correspond to head movement. So the shape features are very simple: they are just the projection of the forty-dimensional coordinate vector of each frame onto the principal components that correspond to facial expressions.
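The shape-feature computation just described can be sketched in code as follows (a minimal illustration on synthetic points; the assumption that the first four components encode head pose is, as noted, dataset-dependent):

```python
import numpy as np

def expression_features(points, n_pose_components=4, n_components=10):
    """points: (K, 40) array of x,y coordinates of 20 tracked facial
    points, one row per frame. Returns the projection of each frame
    onto the principal components assumed to encode expressions."""
    mean = points.mean(axis=0)
    centred = points - mean
    # PCA via SVD of the K x 40 data matrix
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    # The first components capture the largest variance (head movement);
    # the later ones are assumed to capture facial expressions.
    expression_basis = vt[n_pose_components:n_components]   # (6, 40)
    return centred @ expression_basis.T                     # (K, 6)

# Synthetic example: 100 frames of 20 points
rng = np.random.default_rng(0)
pts = rng.normal(size=(100, 40))
feats = expression_features(pts)
print(feats.shape)  # (100, 6)
```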
0:08:59 Let me show you an example. This one is not from this database, but it gives an idea of how these principal components behave. On the top left you see the video stream; on the top right, the actual tracked points; on the bottom left, the reconstruction based on the principal components that correspond to head movements; and on the bottom right, the reconstruction based on the principal components associated with facial expressions. You can see that when the subject turns their head, the bottom right always remains frontal and shows only the expressions, whereas the bottom left follows the head pose.
0:10:02 It is very simple, but it works. Okay, so also for the appearance features we want to remove the head pose, and in this case it is harder. What we do is the common approach in computer vision: we use a reference frame, which contains the neutral expression of the subject with the head in frontal view, and we compute the affine transformation between each frame and the reference frame. By affine transformation we mean that we scale, rotate, and translate the face so that it comes to a frontal pose.
0:10:48 You can see a very simple example at the bottom: on the left, the head is slightly rotated, and after applying scaling, translation, and rotation, the face becomes frontal. Then we crop an area of the face, and finally we apply PCA to the image gradient orientations.
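The registration step just described could be sketched as follows (a minimal least-squares version on point coordinates only; in practice the estimated transform would be used to warp the image itself, and the point values here are made up):

```python
import numpy as np

def estimate_affine(src, dst):
    """Estimate the 2x3 affine transform A mapping src points onto dst:
    dst ~ src @ A[:, :2].T + A[:, 2]. src, dst: (N, 2), N >= 3."""
    n = src.shape[0]
    # Design matrix for the 6 affine parameters
    X = np.hstack([src, np.ones((n, 1))])        # (N, 3)
    # Solve X @ P = dst in the least-squares sense; P is (3, 2)
    P, *_ = np.linalg.lstsq(X, dst, rcond=None)
    return P.T                                    # (2, 3): scale/rotation + translation

# Hypothetical reference (neutral, frontal) points and a rotated, translated frame
ref = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
theta = np.deg2rad(10)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
frame = ref @ R.T + np.array([0.3, -0.2])

A = estimate_affine(frame, ref)
registered = frame @ A[:, :2].T + A[:, 2]
print(np.allclose(registered, ref))  # True: the frame is mapped back to frontal
```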
0:11:18 I will not go into the details of the vision part; you can find more information in this paper. The main idea is that it is quite common to apply PCA directly to the pixel intensities, but this approach, as discussed in that paper, has some advantages, for example it is more robust to illumination, and that's why we decided to use it.
0:11:40 Now, the audio features were computed with openSMILE, which is a toolkit provided by TUM. The audio frame rate is one hundred frames per second, which is why we need to upsample the visual features, which are extracted at twenty-five frames per second. We use some standard audio features: the first five PLP coefficients, energy, loudness, fundamental frequency, and probability of voicing, together with their first- and second-order delta coefficients. These are pretty standard features.
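The upsampling and feature-level fusion mentioned earlier can be sketched like this (a minimal version that repeats each 25 fps visual frame four times to reach 100 fps; simple repetition and the feature dimensionalities are assumptions for illustration):

```python
import numpy as np

def fuse(audio, visual, ratio=4):
    """audio: (T_a, Da) at 100 fps; visual: (T_v, Dv) at 25 fps.
    Upsample the visual stream by repetition and concatenate per frame."""
    visual_up = np.repeat(visual, ratio, axis=0)   # (T_v * ratio, Dv)
    T = min(len(audio), len(visual_up))            # guard against off-by-a-few frames
    return np.hstack([audio[:T], visual_up[:T]])

audio = np.zeros((400, 30))   # e.g. 4 s of audio features at 100 fps
visual = np.ones((100, 6))    # 4 s of shape features at 25 fps
fused = fuse(audio, visual)
print(fused.shape)  # (400, 36)
```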
0:12:16 For classification, the first approach was to use long short-term memory recurrent neural networks, which Felix described in the previous talk, so I'm not going to spend a slide on them; they are suited to dynamic classification of sequences. For static classification, the main problem is that the utterances have different lengths. So, in order to extract features which do not depend on the length of the utterance, we simply compute some statistics of these low-level features over the entire utterance, for example the mean of a feature over the entire utterance, or the maximum value, or the range. This converts each utterance into a feature vector of fixed size, and classification is then performed for the entire utterance using support vector machines.
0:13:13 In this diagram you can see the two approaches with the same features: the appearance features, PLP, energy, F0, loudness, and probability of voicing. In the static case, we compute the statistics over the entire utterance, feed them to an SVM, and get one label for the sequence. In the second case, when we use the LSTM networks, we simply feed the low-level features (there is no need to compute functionals) to the LSTM network, which provides a label for each frame; then we simply take the majority vote to label the sequence.
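The two labelling schemes can be sketched as follows (hypothetical feature values; the functionals, mean, maximum and range, follow the ones mentioned above):

```python
import numpy as np
from collections import Counter

def utterance_functionals(llf):
    """llf: (T, D) low-level features of one utterance. Returns a
    fixed-size vector regardless of T: mean, max and range of each
    feature over the whole utterance (for the SVM)."""
    return np.concatenate([llf.mean(axis=0),
                           llf.max(axis=0),
                           llf.max(axis=0) - llf.min(axis=0)])

def majority_vote(frame_labels):
    """Label the whole sequence with the most frequent frame-level
    label (for the LSTM, which outputs one label per frame)."""
    return Counter(frame_labels).most_common(1)[0][0]

llf = np.random.default_rng(1).normal(size=(57, 12))       # utterance of 57 frames
print(utterance_functionals(llf).shape)                    # (36,)
print(majority_vote(["laughter", "garbage", "laughter"]))  # laughter
```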
0:13:55 So now the results. As you can see, in terms of weighted average, SVMs lead to better performance, whereas in terms of unweighted average, the LSTMs lead to better performance. What this means is that the SVMs are good at classifying the largest class, which in this case is hesitation and contains more than a thousand examples, but they are not so good at recognising the other classes, whereas the LSTMs do a better job at recognising all the classes. You can see that their unweighted average values are usually much higher.
0:14:38uh usually much higher and waited of but it's a values
0:14:42something also which is also interesting
0:14:44is that
0:14:45to compare the performance of for the oh
0:14:47and with the audio visual approach
0:14:49the close of for do for example here
0:14:51you see that it's sixty four point six percent
0:14:54now when we add appearance basically lead goes down
0:14:57this may sound a bit uh surprising
0:15:00and because especially for visual speech recognition peons just consider the state-of-the-art
0:15:05but there are two reasons first of all we use information from the entire face
0:15:09and so basically these a lot of down information which can made
0:15:13we possible to get the performance
0:15:15and and a second reason is a scenes
0:15:18uh is this sort you before these significant head movement
0:15:21a although we do this registration step
0:15:23to convert all expressions to frontal pose
0:15:28still this is but not perfect and especially when there are out of plane rotations which means that uh subject
0:15:34is not looking at the common a but is looking somewhere else
0:15:37then we this approach is impossible
0:15:39uh to reconstruct the front of a a you you
0:15:42and the
0:15:44and it is known but the appearance features are are much more sensitive
0:15:48to a stationary or stop than shape features
0:15:50it's uh
0:15:53so this could be it's a reasonable explanation of the but performance when adding the P where when we had
0:15:59a shape information
0:16:00we should that is uh a significant gain from sixty four point six to seven two percent
0:16:05 Now, if we look at the confusion matrices to see the result per class (the one on the left is the result when using the audio information only, and the other one when using audio plus shape; this is for the LSTM networks), we see that for consent and laughter there is a significant improvement, from 47 to 66 percent and from 63 to 79 percent respectively, whereas for hesitation the performance goes down when we add this extra visual information.
0:16:50 So, to summarise: we saw that shape features improve the performance for consent and laughter, whereas appearance features do not seem to do so; only when combined with shape do they seem to help, in the case of consent, and even then the improvement is negligible, from roughly 59 to 59.4 percent. When we combine all the features together, there is more improvement. So much for the appearance features. Comparing the LSTM networks with the SVMs, the LSTMs basically do a better job of recognising the different vocalisations, whereas the SVMs mostly recognise the largest class, which was hesitation.
0:17:40 As for future work: as Felix said, in these experiments we have used presegmented sequences, which means we know the start and the end, we extract the sequence, and we do classification. A much harder problem is to do spotting of these non-linguistic vocalisations, that is, to work on a continuous stream where we do not know the beginning and the end; that is actually our goal. Especially when adding visual information, this can be a challenging task, because there are cases where the face may not be visible, in which case it is likely that we would have to turn off, for example, the visual system. And I think this is it; if you want to take a look, these are the websites of the two groups. Thank you very much.
0:18:29 [Session chair] Thank you very much. We have time for a couple of questions.
0:18:50 [Audience] As far as I could see, but you can correct me, the illumination was pretty much okay, so I'm just wondering: when you move to more realistic recordings, where the illumination changes and the registration deteriorates, would you expect to get the same amount of improvement for consent and laughter, or not? Have you done any experiments on this?
0:19:18 [Speaker] Well, in this case, okay, the appearance features are definitely influenced by illumination, they are sensitive to it, whereas the shape features are not. The question is whether a difference in illumination can affect the tracker. If the tracker works fine even with changes in illumination, then even shape alone can provide useful information, because the points will still be tracked correctly. But basically this is an open problem; it has not been solved, and, as you know, computer vision is still behind audio processing; these are problems to which nobody knows the answer. That's why most applications use a controlled environment: for example, in audiovisual speech recognition the subject looks directly at the camera and it is always a frontal view of the face. Quite recently there have been some approaches trying to apply these methods to more realistic scenarios, but for applying them in cases like you said, in a real environment, at least I don't know of an approach that would work well at the moment.
0:20:36 [Session chair] Any other questions?
0:20:48 [Speaker] Basically, all the features were upsampled to the same frame rate for both cases, although for the SVMs it may not have been necessary, since we extract the functionals anyway.
0:21:04 [Audience] Actually, I'm talking about instance upsampling, if the data set is imbalanced.
0:21:05 [Speaker] Okay, only feature upsampling.
0:21:12 [Session chair] Any other questions?
0:21:19 [Audience] You showed us a couple of examples of the vocalisations. Are there any classes in this class definition that are visually very close to each other? For example, for hesitation I could imagine more or less the same facial expression as for other classes. Would you still be able to separate them in training?
0:21:52 [Speaker] That is a good point. I only showed you one example of each, but if you look at all the examples, you will see that sometimes there is a big difference and sometimes not. If you watch the video without listening to the audio, it is very likely that even a human would be confused between the different vocalisations, particularly when hesitation comes in. So yes, there is variance; in particular for laughter, there are around three hundred examples, and the variance is high.
0:22:29 [Audience] And how was it partitioned into training and test sets?
0:22:34 [Speaker] I think this is the official partitioning; maybe the TUM colleagues can say more.
0:22:37 [Audience] What was the criterion for deciding training and testing?
0:22:43 [From the audience] Actually, it was done to be very transparent, similar to the AMI corpus: split by speaker.
0:22:53 [Speaker] So yes, there is variance, and, as I said before, sometimes, if you turn off the audio, you cannot discriminate between those two.
0:23:01 [Audience] I think there is confusion between the different classes; that was my main point, and I think there is actually also another issue.
0:23:28 What I was asking with my question about covariance was whether there was covariance between, for example, the audio and the visual features. I intended the question as: if there was high covariance, could that explain, for example, that you didn't get much improvement for hesitation, because the covariance was high between the combined features?
0:23:52 [Speaker] It could be, yes. I mean, some expressions are similar but come from different classes. I'm not sure; it could be, because these are spontaneous expressions and they are all different. If you look at all of them, you will not find two that are exactly the same.
0:24:28okay thank you
0:24:31oh the question
0:24:33okay so thank you again that