0:00:14good morning everyone welcome to the second day of asru two thousand eleven
0:00:19i hope you're enjoying it as much as i am
0:00:21it's my pleasure to introduce professor david forsyth of the university of illinois at urbana champaign
0:00:28and came there from exhibit
0:00:31i'm going to skip some of the bio here but he has published more than a hundred and thirty papers
0:00:37he's very active in the ieee community as well
0:00:40he was a program co-chair for ieee C V P R twice two thousand and two thousand
0:00:46and one
0:00:46and he was a general co-chair for C V P R in two thousand six
0:00:51he is also active in the siggraph community
0:00:55here is that
0:00:56ieee technical achievement
0:00:59i
0:01:01became an I Q
0:01:04is an textbook
0:01:06yeah
0:01:09well i don't
0:01:12couple years ago
0:01:13cypress
0:01:16yeah
0:01:20thank you for those kind words
0:01:22so i was a little bit
0:01:27maybe we could
0:01:32nine out
0:01:33i
0:01:35being a vision
0:01:36talking
0:01:37speech
0:01:39can i
0:01:40identify the next
0:01:43and i just
0:01:45right
0:01:46it's in
0:01:47well
0:01:52i'm gonna talk probably about it
0:01:56a lot of cali
0:02:00yeah
0:02:02my colleagues
0:02:04them are a little from
0:02:08and correspond to X
0:02:11one form or not
0:02:12ollie hardy and dress i mean so that the teaching
0:02:19yeah
0:02:20oh
0:02:25oh
0:02:33and reconstruction is essentially you make a model
0:02:37from pictures or video or other kinds
0:02:41and the recognition
0:02:43i think of recognition as being
0:02:44what is
0:02:46oh
0:02:47and it's
0:02:48oh
0:02:49it's gone
0:02:50being
0:02:50you know
0:02:53yeah patient of a small number of people
0:02:55to a very successful
0:02:58i don't
0:02:59and the massive applications we have a standard problem of academic fields which is whenever something really works and generates
0:03:07money
0:03:07we say that's not really what we do and ignore it
0:03:10but there are a whole bunch of those things that have spun out
0:03:13and we'll see some of those
0:03:16i'm not gonna talk very much about reconstruction but i want to mention the state of the art
0:03:23in the state of the art if you have multiple cameras
0:03:26you can get
0:03:27really astonishing results at huge geometric scale
0:03:31so if you walk around for example a quadrangle with lots of big buildings
0:03:37waving a video camera at those buildings you can reconstruct the geometry to two centimetres or less
0:03:45and there are reconstructions of whole cities
0:03:47that have been prepared using those methods the error gets a little bit bigger and it's very largely automatic
0:03:54and furthermore you can put ten you can trying to pretend
0:03:58that a bunch of scattered images of the way i like that it's slightly harder something's it looks like you
0:04:05and so on but that's kind of and
0:04:08if you have a single picture
0:04:11it's much more difficult to reconstruct
0:04:13but you can get some progress actually in the recognition stuff i'm gonna talk about you'll see some of this
0:04:19so some of the things that tell you about the shape of the world might include the symmetry of objects
0:04:24in the world the stylised shape so later on we're gonna pretend that every room is a box
0:04:29and that turns out to be a very useful assumption
0:04:32contour information texture information shading information
0:04:36can all tell us something about
0:04:38i'm gonna show you a reconstruction it's maybe about seven years old now
0:04:42but it gives you some sense of the state of the art
0:04:44the state of the art is like this but bigger
0:04:46here's a movie all my cultural thing
0:04:50somewhere out there in the well and it's being video from a bunch of the right
0:04:56all this helps a lot from you can reconstruct an enormous number of points lying on the op
0:05:02and well all of those cameras the view that way
0:05:05i have render all the midi might have random order
0:05:09because that would make the rendering
0:05:11but you can see where the cameras went and where the points are and that's by standard methods this is the complete system
0:05:17you can join those points up and in a second we're gonna do that
0:05:21to make a mesh
0:05:22and the mesh will give you some idea about how
0:05:25the geometry
0:05:26points that look that
0:05:28yeah we have a nice
0:05:29and i wish i and has them into the mess you can really see we've got a tremendous amount of
0:05:34information about that you know
0:05:36the difference in the last seven years between what i'm showing you now and what people do
0:05:40is that
0:05:41okay now that sort of thing is a quadrangle full of buildings or a city or something of that form
0:05:47here it's a subtract the law
0:05:49and of course we can texture that mesh and then it's really sweet
0:05:55we have a very cool
0:05:57physical reconstruction of what it's like which we could show to other people and we could use in augmented
0:06:03reality applications you should see other applications here as well so if you wanna blow up downtown los angeles the permissions are
0:06:10a bit difficult to get but you can fly a helicopter over it build a model and blow the model up in
0:06:15a movie on your phone
0:06:16and if you want to join in a movie sequence some real live action
0:06:24to blowing up a model action you need to know what the camera path was
0:06:27and we can do that as well
0:06:29so there are tremendous applications lurking behind this
0:06:33that's it for reconstruction i'm gonna talk mainly about recognition
0:06:37why do we care about visual object recognition
0:06:40the answer is if you want to act in the world you have to draw distinctions
0:06:44and those distinctions could be of a very simple kind or a very complex kind
0:06:49so if you were building a robot
0:06:52you have this great advantage of vision that it can predict the future
0:06:57you can look ahead of you and you can see things you haven't done yet and figure out what would
0:07:01happen
0:07:02is the ground soft
0:07:04if it is maybe out oh my god is that person doing something dangerous
0:07:08does it matter if i run that object over
0:07:11which end of that object is the sharp end
0:07:14and these are really important questions when you
0:07:18now for information systems
0:07:20it's just really valuable to be able to search for pictures
0:07:24cluster pictures order pictures to understand what they tell you
0:07:28and all of those are recognition functions you might not need really good recognition but you need to
0:07:34build descriptions of what's going on to support them
0:07:37and of course there are general engineering applications which i'll demonstrate in a second
0:07:42there is this universal fact about vision systems pretty much any animal that has vision has recognition
0:07:49they are often pretty lousy so if you look at it male horseshoe crabs identify female horseshoe crabs visually
0:07:57but what they're looking for is a dark square
0:08:00if you build the right kind of dark square and leave it lying on the floor of the ocean a
0:08:06line of amorous male horseshoe crabs will build up behind it because the vision system just isn't up to the job
0:08:12well you might not have great recognition but if you've got vision you've got recognition
0:08:17okay as an example of a more general engineering application of vision
0:08:21and i believe shri narayanan will talk about this on thursday as well probably in more detail
0:08:28imagine you watch a whole bunch of people
0:08:31and you measure a bunch of stuff as well so you could look at physiological markers you
0:08:36could listen to the sounds and speech and you could watch them behaving naturally
0:08:41then what you could do is a bunch of things the first thing is
0:08:44if they behave in a way you don't want you could feed
0:08:48the other thing is you could screen
0:08:51so for example autism spectrum disorder is an affliction where if you catch it in children very early
0:09:00you sometimes have better chances with interventions it would be really nice to screen children very early
0:09:07in life and it would be very nice to screen everyone
0:09:10what you'd like to be able to do is to say this child needs to see someone who knows what
0:09:15to do and this child doesn't and you'd like to do that in a very low skill way
0:09:20well maybe what you could do is observe them behaving and say gee they need to see someone who can
0:09:25tell whether they're really
0:09:27and it turns out that with models like this you can apply that story to in home care
0:09:32to care for demented patients to care for stroke recovery
0:09:38building design and so on models like this look as though they're gonna be really fat
0:09:42N S F has put a bunch of money into this sort of thing under the expeditions program and we
0:09:47hope good things will come
0:09:49here's another example you might want to take pictures and simply predict word tags
0:09:54why would you like to predict word tags well people like to search for pictures with words lots of pictures
0:09:59don't come with words attached what you might do is look at the picture and say based on various classification
0:10:05machinery and on what i know about how words are correlated
0:10:09and so on give me a bunch of word tags to associate with the picture that would be useful
0:10:14and the state-of-the-art in this activity is moderately advanced you get we have very good experimental methods
0:10:21we're getting
0:10:23if you actually retrieve images based on predicted word tags you can get estimates in the thirty percents
0:10:29which may not sound all that impressive but ten years ago they were in the three percents so you know
0:10:35it's an order of magnitude which is wonderful and this is genuinely useful
0:10:41but words and pictures affect one another in much more complex ways
0:10:45so there are many interesting problems that are just sort of emerging
0:10:49from the presence of word and picture datasets this is example due to tamara the you'll see in my these
0:10:55approaches from catalogues and their descriptions underneath them all the things in the picture
0:11:00oh there are another two existing vision mechanisms for saying that the thing in the picture is an adorable people
0:11:09telecom
0:11:10but we just don't know how to do that
0:11:12the first interesting problem that arises from that is if you had a whole bunch of catalogues you might actually
0:11:17be able to fish phrases out of the text
0:11:22fish descriptions out of the pictures and build classifiers that could predict adorable people
0:11:27there's something else going on if you read these descriptions
0:11:30they're fairly comprehensive descriptions of the object
0:11:34but they don't tell you what colour they are
0:11:36and furthermore gonna tell you what colour the session on the breast
0:11:40and the reason they don't do that is it's blindingly obvious from the picture so there's no point
0:11:46but from our perspective if we're looking for things or searching for things or doing things like recommending things to
0:11:52customers
0:11:54being able to push
0:11:56information jointly or i check and a description might add real data
0:12:04okay
0:12:05so getting to the end of the kind of summary of vision and i'll show you some stuff about recognition
0:12:11i was asked to describe just recently what every vision person should know and it's useful 'cause it gives you
0:12:16a flavour of the discipline
0:12:18the big thing is that vision is really useful it's really hard and it's still really poorly understood
0:12:24it's very helpful to know a bunch of
0:12:27it's also very helpful to have a bunch of scepticism in hard poorly understood disciplines there is always somebody who comes
0:12:33along
0:12:34with a revolutionary new solution and those come along every five years or so and then they go away so
0:12:40a moderate degree of scepticism is valuable
0:12:45opportunism a simple
0:12:47right so vision is difficult because you need to know a lot of stuff
0:12:52and there's a lot of evidence that the knowledge of any one thing doesn't seem to help much
0:12:56there really are a lot of different ideas that are just sort of bolted together and we'll see some
0:13:02however the main thing is to know the general principles of it
0:13:05and those you can deduce from evolutionary examples and what has been successful in computer vision i've put them on
0:13:12the slide that's to come next
0:13:15there aren't any
0:13:16well it's not a subject that has general principles
0:13:20it's just one of those things
0:13:22anybody who offers you a general principle is either a fool or a liar and you can make
0:13:27your
0:13:29so now i'm gonna set up a series of discussions about the state of the art in recognition i like to do this
0:13:35with a conclusion 'cause then we know where we're going so the first thing is object recognition is subtle but
0:13:41we actually have really strong methods that work really quite well
0:13:45based on classifiers
0:13:47so rather loosely we could believe this about object recognition
0:13:51the object categories are fixed and known this is a cat that's a cow that's a motor car every object belongs
0:13:59to one category and there are K of them
0:14:02and you can get good training data so i've got a hundred pictures of cats a hundred pictures of cows a
0:14:07hundred pictures of motor cars
0:14:09and then object recognition sort of turns into K way classification
0:14:14and it will turn out that detection turns into lots of classification tasks
0:14:18in that belief space which has been very valuable there's an actual programme of research you get i'd say you
0:14:25bang together a bunch of features you do better fitting with classifiers and you produce a represent
0:14:32and that strategy has been amazingly back it's very
0:14:36we could it features
0:14:37so the summary of about ten years work in features is two really important principles
0:14:44one is features need to be illumination invariant so when the light gets brighter the features shouldn't change all
0:14:51that much and there's an easy way to do that which is to look at the orientations of image gradients
0:14:56a second big principle is the object is never quite where you think it is in the image
0:15:02it's always shifted around a little bit and that means if you look at the image gradient at a particular
0:15:07point
0:15:08you're not gonna do well
0:15:09instead you want to look at local pools of image gradients
0:15:13or histograms of orientation
0:15:16and it turns out if you take those two
0:15:18principles
0:15:20and you combine them in a fairly natural fashion
0:15:24then you get hog and sift features
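The two principles described above (use gradient orientations so that brightness changes matter little, and pool them over a small neighbourhood so that small shifts matter little) can be illustrated with a minimal pure-Python sketch. This is not the actual HOG or SIFT implementation: the function name `orientation_histograms`, the cell size, the number of bins, the central-difference gradient and the per-cell normalisation are all assumptions chosen for brevity.

```python
import math


def orientation_histograms(image, cell=4, bins=8):
    """Pool gradient orientations into one histogram per cell x cell block.

    `image` is a list of rows of grey levels. Returns rows x cols x bins lists.
    """
    h, w = len(image), len(image[0])
    rows, cols = h // cell, w // cell
    hists = [[[0.0] * bins for _ in range(cols)] for _ in range(rows)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Central differences approximate the image gradient.
            gx = image[y][x + 1] - image[y][x - 1]
            gy = image[y + 1][x] - image[y - 1][x]
            mag = math.hypot(gx, gy)
            if mag == 0.0:
                continue
            # Unsigned orientation in [0, pi), quantised into `bins` bins.
            theta = math.atan2(gy, gx) % math.pi
            b = min(int(theta / math.pi * bins), bins - 1)
            r, c = y // cell, x // cell
            if r < rows and c < cols:
                hists[r][c][b] += mag  # pool locally, weighted by strength
    # Normalise each cell so a global brightness scaling cancels out.
    for row in hists:
        for hist in row:
            norm = math.sqrt(sum(v * v for v in hist))
            if norm > 0.0:
                hist[:] = [v / norm for v in hist]
    return hists
```

Because only differences between pixels are used and each cell is normalised, adding a constant to the image or multiplying it by a positive factor leaves the histograms essentially unchanged, and moving an edge by a pixel inside a cell changes nothing at all.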
0:15:26i've shown them here for a series of different pictures on the left on the one side sorry i
0:15:32get left and right mixed up you'll see a woman with a bicycle and then shown next to it is a
0:15:37hog feature style representation each of those little balls is basically a histogram of gradient orientation in a little ball
0:15:45so what we're saying is at the top of that image
0:15:49the gradient orientations go in pretty much every direction locally
0:15:54but then when we get down to the sides of the woman
0:15:56there are lots of gradients
0:15:58that there are lots of kampala along the side of the
0:16:03the gradients of
0:16:04yeah
0:16:05and adaptive contrast around the bicycle
0:16:08and again in this one with the traffic you can see in the tree the gradients go in all
0:16:13directions
0:16:14but a round the colour
0:16:16they have
0:16:18again in this picture with the bicycle down the bottom
0:16:21you see the rough structure of the wheels and the frame reflected in those patterns of orientation
0:16:27and essentially what we do is take this information and bung it in a classifier
0:16:31when we do this we get really quite good results
0:16:34we're rather good at
0:16:37okay this kind of K way classification running up to K of a couple of hundred
0:16:41when we get into the tens of thousands things get very interesting but
0:16:45you know we'll set
0:16:47and there are standard datasets for investigating methods and features you can take caltech one O one for example it's
0:16:54a set of pictures of a hundred different categories one hundred one different categories
0:16:59picked somewhat at random from a selection of useful looking categories and the main thing here is the error
0:17:06rate the number of classification attempts
0:17:08that fail is now about twenty percent
0:17:13if you stick a picture of an isolated object in the caltech one O one list of objects into a
0:17:19good modern method you're likely to get the right answer
0:17:23and if the collection of categories you know about is somewhat bigger you are not as likely to get the
0:17:28right answer out so the accuracy runs up to the fifties if one's very lucky
0:17:32and has lots of training examples but you've still got a really good chance of getting the right answer
0:17:38so there are some problems we can do quite well
0:17:41and this machinery extends to
0:17:43really very complicated and non obvious judgements
0:17:47so you can extend these features to work in space time
0:17:51and then what people do now is take movies
0:17:56they get the script of the movie that's marked up with time codes by enthusiasts on the
0:18:01internet
0:18:02they time align these two and then say okay here's
0:18:06an action description in the script look for some features around there that are distinctive in the movie train a classifier
0:18:12like that
0:18:13and then run it on something
0:18:15and you can get really quite effective action spotters like that for complex actions like answering the phone
0:18:22getting out of the car hugging kissing sitting down
0:18:25on the top row there are a bunch of true positives
0:18:28on the second row a bunch of true negatives
0:18:31on the third row some false positives so if you look at the answer phone false positive for example the
0:18:36guy on the bed leaning to the side
0:18:38looks as though he could be sitting on a bed answering a phone he just doesn't actually have a
0:18:43phone and
0:18:45right and then of course there are also like
0:18:47people also in the fine in unusual circumstances where distance
0:18:51so this machinery extends to really quite complicated judgements
0:18:55this machinery can also be used for detection
0:18:58so the way you detect with a classifier is imagine i have a picture with some interesting things in it
0:19:03that i want to detect
0:19:05what i'm gonna do is take a window out of that image
0:19:09correct illumination and estimate orientation and then bung that window into a classifier and say yes or no
0:19:16and then i'll go to the next window and i'll say yes or no and i'll keep doing that
0:19:20i then find the best detection responses and if they're good enough i'll report them
0:19:25if i want to find a big one i'll make the image smaller and search it with a fixed
0:19:30sized window again
0:19:32if i wanna find a small one i'll look at a very high resolution version of the image
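The detection recipe just described (score a fixed-size window at every position, then shrink the image and repeat so that larger objects fit the same window) can be sketched as follows. This is a minimal illustration under stated assumptions, not the detector from the talk: `score_window` is a hypothetical stand-in for a trained classifier, the illumination correction and orientation estimation steps are omitted, the pyramid simply halves the resolution at each level, and there is no suppression of overlapping detections.

```python
def downsample(image):
    """Halve the resolution by averaging each 2 x 2 block of pixels."""
    h, w = len(image) // 2, len(image[0]) // 2
    return [[(image[2 * y][2 * x] + image[2 * y][2 * x + 1]
              + image[2 * y + 1][2 * x] + image[2 * y + 1][2 * x + 1]) / 4.0
             for x in range(w)] for y in range(h)]


def detect(image, score_window, win=4, stride=2, threshold=0.5):
    """Slide a win x win window over an image pyramid.

    Returns (score, x, y, size) tuples in original-image coordinates,
    best score first.
    """
    detections = []
    scale = 1
    while len(image) >= win and len(image[0]) >= win:
        for y in range(0, len(image) - win + 1, stride):
            for x in range(0, len(image[0]) - win + 1, stride):
                window = [row[x:x + win] for row in image[y:y + win]]
                score = score_window(window)  # the yes-or-no judgement
                if score >= threshold:
                    detections.append((score, x * scale, y * scale, win * scale))
        # A fixed window on a smaller image corresponds to a bigger object.
        image = downsample(image)
        scale *= 2
    return sorted(detections, reverse=True)
```

A detection found at a coarser pyramid level is reported with its position and size multiplied by the scale factor, so all results refer to the original image.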
0:19:37this recipe again is amazingly successful we are really quite good at detecting moderately complicated objects
0:19:44the standard detector has
0:19:46some
0:19:47additional complexity attached to this description
0:19:50the additional complexity is these little yellow boxes
0:19:54if you look at these columns each column displays the behaviour of the standard detector on a different category
0:20:01so the first column sorry i'm getting my rows mixed
0:20:05the first row is human detection the second row is bottle detection and the third row is car detection
0:20:11in the first row you'll see that the guy about to step in front of the train
0:20:15has had a little blue box placed on top of him
0:20:19with yellow so
0:20:20then there is a big group of people which has been incorrectly counted one of them is missed
0:20:26but most of them have boxes on top of them and we know that they are people
0:20:31in the third column of the first row
0:20:34you see somebody hiding behind a bush
0:20:37he's had a box placed on top of him the obvious monty python joke is so obvious it's not worth
0:20:42making
0:20:44as my colleague rubber cholesky is the site it's claudia cutting edge the detectors on perfect and that she has
0:20:52been marked as a pet
0:20:54in the second row you'll see modern best bottle detectors on the go we're pretty good at detecting bottles we
0:21:01can find them even if they're in people's hands or on tables but we get bottles and people mixed up
0:21:07for quite good reasons detectors really like
0:21:11strong
0:21:12identifiable high contrast curves
0:21:14people have them around the head and shoulders so do bottles and they tend to look the same
0:21:19right so humans and bottles often get mixed up
0:21:22we're also very good at detecting cars and occasionally they get mixed up with buses which is no
0:21:27not
0:21:28the detector i referred to here is sort of the standard technology you can download and run the code it's
0:21:34all very established and it's widely used
0:21:39a problem with the belief space about recognition that i described is that it is beginning to come apart at the
0:21:44seams because most of the beliefs are obvious nonsense
0:21:47right they're just not true
0:21:49objects belong to multiple categories and good training data might be very hard to get and that presents serious problems
0:21:57C has one example i think is all mine
0:22:05well i you like what it's is usually got into
0:22:07i know
0:22:11no i went to that audience is usually going to vapour lock some roundabout this point because they know i'm
0:22:17gonna get them from the side but they don't know which side i'm gonna get
0:22:20okay so if you look at them depending on what you please the could it might easily could not the
0:22:25first one is in fact a mighty size the second and the fourth isn't i
0:22:31right the but it is in fact the monkey i had to check this i'm not that good on primate
0:22:35taxonomy but most of these are apes
0:22:38and the one on the bottom row in the second column is a little plastic toy
0:22:42right so the whole point about categorization here is the concept okay
0:22:48i think things can belong to more than one category at the same time perfectly
0:22:53so what we've inherited from the point of view i described to you
0:22:56is a tremendous amount of information about feature computation and construction
0:23:01we're really good at building and managing and using classifiers
0:23:05and a lot of practise it improves
0:23:08but there are really evil subtleties here and the next thing is to describe some of the efforts to
0:23:13deal with them
0:23:14so the big questions the really big questions of computer vision that are in play right now
0:23:20what signal representations should we use
0:23:23this is sort of at the early level before you get to the classifiers and learning stuff
0:23:28to some extent models what aspects of the world should we represent and how should we represent them
0:23:34and then the other which is what should we say about pictures
0:23:37and those three questions are really very difficult in the
0:23:41so let's start looking and
0:23:43the coming technologies on the nasty problem
0:23:46one big issue is the unfamiliar
0:23:49the recipe i described to you really just doesn't deal with the unfamiliar
0:23:53let me show you a little movie of somebody doing something
0:23:56almost certainly you've never seen people doing this before it doesn't happen every day
0:24:01and at the same time it doesn't really present you with any problem
0:24:04right you might not have a word to describe it but you know what's going on and that's
0:24:09fine
0:24:11here's another more extreme example of something where you really don't see this every day
0:24:16but you can still watch it and it's just obvious what's going on
0:24:21and even that at this point even the donkeys
0:24:25accustomed to
0:24:26i mean done
0:24:27treat this as
0:24:29because you don't have training
0:24:31you can deal with the unfamiliar in satisfactory ways and you probably have put together in your mind a little
0:24:37narrative of what's going on and why they're doing what they're doing and it's all over and they can get
0:24:41on that
0:24:43now that's a really baffling thing
0:24:45from the perspective i described to you we just have no approach
0:24:50there are methods that you can use so you can take
0:24:54the stuff i described and rewire it
0:24:56here's an architecture that people are using quite a lot i take a picture i do some feature selection
0:25:03and stuff and instead of building classifiers that say bird
0:25:07i build a bunch of classifiers that say the picture has a beak and it's got an eye and
0:25:12it's got fur and it's got a
0:25:16the reason i would do that is if i ran into something else i might not know what it was
0:25:21but i could say oh okay it's got feathers it might be a feather duster or a bird but it's got
0:25:26feathers so i can say something useful about it
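The architecture described above replaces a single category classifier with a bank of attribute classifiers, so that something useful can be said about an object whose category is unknown. A minimal sketch follows. The attribute names, the feature names and the lambda "classifiers" are all made up for illustration; in a real system each entry would be a trained classifier applied to image features.

```python
def describe(features, attribute_classifiers, threshold=0.5):
    """Return the names of all attributes whose classifier fires."""
    return sorted(name for name, classify in attribute_classifiers.items()
                  if classify(features) >= threshold)


# Hypothetical stand-ins for trained attribute classifiers: each one just
# reads a single made-up feature and returns it as a score in [0, 1].
toy_classifiers = {
    "has beak": lambda f: f["pointed_front"],
    "has eye": lambda f: f["dark_round_spot"],
    "has feathers": lambda f: f["fine_texture"],
    "has wheel": lambda f: f["large_circle"],
}

# An object from no known category can still be described by its attributes.
unfamiliar = {"pointed_front": 0.1, "dark_round_spot": 0.2,
              "fine_texture": 0.9, "large_circle": 0.0}
print(describe(unfamiliar, toy_classifiers))  # prints ['has feathers']
```

The point of the design is that the output vocabulary is the set of attributes rather than the set of categories, so the description degrades gracefully when the category is unfamiliar.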
0:25:30this is kind of neat because you can then build systems that can make predictions for objects they've never seen
0:25:36before
0:25:36where they haven't even seen that
0:25:38degree of that type of object
0:25:40on the slide the little yellow boxes are the spatial basis of the predictions in the image a and underneath
0:25:47them are prediction so that rather baffled look man here but there it's reported is having a kid having an
0:25:53yeah having a snout having a nose and having a man
0:25:57we've been able to say something useful about something we'd never seen
0:26:02it's hard to get these predictions right
0:26:05you can see on that one for example it's got a tail it's got a snout it's got a leg
0:26:11it also says it's got text on it and it's made of plastic
0:26:15and it says it's got text on it because text is characterized by little dark and white stripes next to each
0:26:20other and plastic is characterized wonderful bright
0:26:23so these predictions are hard to make but you can make them
0:26:28the other neat thing about this architecture is if you happened to have seen lots of them
0:26:34it's relatively straightforward to add something else that says okay this really is a bird
0:26:38and again that's in the whole recipe of classification that i described
0:26:43if i say that i can also look at that list of attributes and say well gee it's a bird but
0:26:49something's missing or something's extra
0:26:51some known objects things that i know about whose names i know could be unfamiliar by being different from the
0:26:58typical
0:26:59and if they are different from the typical it's worth mentioning
0:27:03we can build systems that do that as well essentially if we're really sure it's the object and we're really
0:27:08sure it has a missing attribute or an extra attribute we can say it
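The rule stated above (only report a missing or extra attribute when both the category and the attribute judgement are very confident) can be written down directly. The sketch below is an illustration only: the function name `unusual_attributes`, the single `sure` threshold and the idea of a fixed set of typical attributes per category are assumptions, not details given in the talk.

```python
def unusual_attributes(category_score, attribute_scores, typical, sure=0.9):
    """Report attributes that make a known object atypical.

    category_score   -- confidence in [0, 1] that the object is the category
    attribute_scores -- dict of attribute name -> confidence in [0, 1]
    typical          -- set of attributes the category normally has
    Returns (missing, extra), both sorted lists of attribute names.
    """
    if category_score < sure:
        return [], []  # not sure what it is, so say nothing
    # Missing: expected for this category, and confidently absent.
    missing = sorted(a for a in typical
                     if a in attribute_scores
                     and attribute_scores[a] <= 1.0 - sure)
    # Extra: not expected for this category, and confidently present.
    extra = sorted(a for a, s in attribute_scores.items()
                   if a not in typical and s >= sure)
    return missing, extra
```

For example, with `typical = {"has beak", "has eye", "has wing"}` and scores of 0.03 for "has wing" and 0.92 for "has text", a confidently recognised object would be reported as missing a wing and having text as an extra, while attributes with middling scores are left unmentioned.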
0:27:12so i have a bunch of examples from one recent system the semantics of attributes are all messed
0:27:18up so that one down there is reported as not having a tail not because there's compelling evidence
0:27:25that there isn't a tail on this one but because we can't see it that little detail hasn't
0:27:30been sorted out
0:27:31that aeroplane is reported as not having a jet engine
0:27:34and gloriously but this on the friday she had done like that sheep is reported as not having wool
0:27:41when it has in fact been shorn
0:27:44and you can report extra stuff as well again you know there was a two questions the semantics that need
0:27:49to be sorted out here that the in the little yellow box on the end there is reported as having
0:27:54an extra lee
0:27:55no but is never have actually so one should
0:27:58have some more complex interpretation sitting on top but there's a bicycle with whole on an aeroplane with a big
0:28:06and a bus with a fine
0:28:08well within the sort of extra special features of the object and we can report
0:28:14now one nice thing about this is
0:28:17just recently there are technologies emerging that say some regions in images actually would like to be objects
0:28:25so if the region would like to be an object then what we can do is take the attribute machinery
0:28:30attach it to the region that would like to be an object and report a description and that sort of stuff
0:28:35is being discussed in the hallways but doesn't yet
0:28:40the second interesting and disturbing thing about modern vision is what we call a visual phrase
0:28:45so meaning comes in clumps
0:28:48i talked about object recognition as something where you spot individual objects
0:28:53but it's really hard to talk sense about what it means to be an object
0:28:58so if you look at this can honestly you could think about that as an object because if you fish
0:29:03around in your head you could come up with a single word to describe that's a flat
0:29:08but it isn't one thing or two
0:29:10what should we cut her off the slack and then sort of think of the person as a person than
0:29:15the slate is a slight that way lies madness because we can also kind of a head inside sick a
0:29:20kind of a jacket and say it's a jacket kind of issues inside shoes and so on
0:29:25so what we might want to do is just sort of accept
0:29:28that there is a chunk of meaning here represented by what many people would think of as at least
0:29:33two or
0:29:36as a precedent for this a common notion in vision is that of a scene
0:29:41so it's a likely stage where particular kinds of object or particular kinds of activity might occur things
0:29:49like box rooms or greenhouses or playgrounds or bedrooms
0:29:54and we're really quite good at classifying these
0:29:56so you can use the procedure i described previously you get a bunch of labeled images of scenes you
0:30:03compute some features you bung them in a classifier and it turns out you can be really good at saying that's
0:30:08a picture of a bathroom that's a picture of a boring that's a picture of a clock
0:30:14and the advantage of doing that is you have some idea of the kind of things that might happen
0:30:19so we've known since the early nineteen
0:30:22that if you get the scene right
0:30:25you can predict where to look for objects
0:30:28and although you can't get it right because so i've sent to examples you have from the rubber stuff
0:30:33one is an outdoor scene where you know we predict on the top row the buildings are sort of on
0:30:40the top and street is on the bottom and trees are vertical and they might be in front of you
0:30:45and the sky tends to be on the top and the cars will tend to be on the side of
0:30:49the middle and so on
0:30:51i'm not sure that all of these predictions are right there aren't any cars
0:30:55but they tell you where to look for cars if they were there
0:30:59and that seems to be helpful
0:31:01so thinking about scenes currently we talk about meaning as coming in clumps at two scales
0:31:07one scale is the scene the whole image
0:31:10and the other is individual objects although we're a little unsure as to what it means to be an object
0:31:14and it turns out very recently that there's
0:31:17some good practical evidence that there might exist useful clumps of meaning between the scene and the
0:31:24object and these are referred to as visual phrases
0:31:27composites
0:31:29so composites where the composite is easier to recognise than its parts
0:31:34so one useful visual phrase is a person drinking from a bottle
0:31:38it turns out it's much easier to detect a person drinking from a bottle than it is to detect a
0:31:44person or to detect a bottle
0:31:46because people who drink from bottles do special things
0:31:50right they hold them up they adopt special configurations and so on
0:31:55the same goes for things like a person riding a bicycle it's much easier to detect a whole person riding
0:32:00a bicycle than it is to detect the person and the bicycle and then reason about spatial relations
0:32:06because the appearance is constrained by the relation
0:32:10so when you have this observation then you get into a serious mess about what to report
0:32:17about an image
0:32:18so we might build a person detector we might build a horse detector and we might also build a person
0:32:24riding a horse detector
0:32:26we have to figure out which if any of them is right if we're really lucky the person riding a
0:32:30horse detector will report in the same place
0:32:33as the person detector and the horse detector and we have to figure out just how many people just how
0:32:37many horses and just how many people riding horses
0:32:40so what we do is rack up a whole bunch of detectors
0:32:44and then go through a second phrase which is currently right phase which is currently referred to as decoding where
0:32:49we say
0:32:51based on all of the evidence of the detectors i'm willing to believe you and you
0:32:57and that judgement is again a discriminant a judgement we essentially take the responses of nearby detectors report them to
0:33:05the current detector and construct a second classify which is should we believe you
0:33:10and you can get quite good ounces of that procedure
0:33:14it turns out they help quite a lot
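The decoding step described above can be sketched roughly as follows. This is a minimal illustration, not the speaker's implementation: the `Detection` class, the overlap test, the feature layout and the hand-set weights are all assumptions made for the example. In practice the weights of the second classifier would be learned from labelled data.

```python
from dataclasses import dataclass
from math import exp


@dataclass
class Detection:
    label: str    # e.g. "person", "horse", "person_riding_horse"
    box: tuple    # (x0, y0, x1, y1) in pixels
    score: float  # raw detector confidence in [0, 1]


def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def context_features(det, detections, labels, min_overlap=0.1):
    """Own score, then for each label the strongest overlapping neighbour score."""
    feats = [det.score]
    for label in labels:
        best = 0.0
        for other in detections:
            if other is det or other.label != label:
                continue
            if iou(det.box, other.box) >= min_overlap:
                best = max(best, other.score)
        feats.append(best)
    return feats


def belief(det, detections, labels, weights, bias):
    """Second-stage logistic classifier: should we believe this detection?"""
    feats = context_features(det, detections, labels)
    z = bias + sum(w * f for w, f in zip(weights, feats))
    return 1.0 / (1.0 + exp(-z))


# Hypothetical weights: a nearby "person_riding_horse" response supports a person.
labels = ["person", "horse", "person_riding_horse"]
weights = [2.0, 0.0, 0.5, 1.5]
bias = -2.0
```

With these illustrative weights, a weak person response that overlaps a strong person-riding-horse response is believed more than the same response in isolation, which is the effect the talk describes.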
0:33:16so if you look at
0:33:19the top row of pictures these are detector responses without any decoding without a global view of what's going on you
0:33:26can see a sofa and a bunch of people on the sofa
0:33:31if one then says okay i'm gonna look at the totality of detector responses which includes more than that
0:33:37and try and find a consistent selection that makes sense then you get a sofa because you've got a sofa and
0:33:43there's a fair amount of evidence that you've got a dog lying on the sofa because you've got something that
0:33:47looks a bit like a person but a bit like a dog and you've got a dog lying on the sofa and
0:33:51that's also a dog
0:33:53you can significantly improve detection procedures by this kind of global view
0:34:00another thing that gives a global view that significantly improves detection performance and scene understanding is geometry
0:34:08so if we know something about the geometry
0:34:11we can really improve detectors so on the one side with the blue line on it i have an
0:34:18image with the horizon and
0:34:19i want to build a pedestrian detector you can see the boxes around pedestrians
0:34:24and cars
0:34:25now the thing about the horizon
0:34:27is in perspective cameras things that get closer to the horizon from below must be smaller
0:34:34otherwise they're bigger in three D
0:34:36what that means is if i want to detect
0:34:39a pedestrian and i think it's a big one it has to be lower in the image
0:34:44and the small ones have to be higher
0:34:47furthermore if i get some pedestrian detector responses
0:34:51i can look at them and say well the big ones are here and the small ones are there and that helps
0:34:55me estimate the horizon
0:34:57and if i estimate the horizon and my reports jointly
0:35:02i can get much better responses
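The relation between detection size and the horizon can be sketched under some simplifying assumptions that are not stated in the talk: a roughly level camera, a flat ground plane, image rows that increase downward, and hypothetical values for pedestrian height (1.7 m) and camera height (1.6 m). Under those assumptions the pixel height of an upright object grows linearly with how far its feet sit below the horizon row, so a line fitted to the detections gives a horizon estimate.

```python
def expected_pixel_height(foot_row, horizon_row,
                          object_height_m=1.7, camera_height_m=1.6):
    """Pixel height predicted for an upright object standing on the ground."""
    return object_height_m * (foot_row - horizon_row) / camera_height_m


def fit_horizon(detections):
    """Fit pixel_height = a * foot_row + b by least squares.

    `detections` is a list of (foot_row, pixel_height) pairs with at least
    two distinct foot rows. The horizon is the row where the fitted height
    reaches zero.
    """
    n = len(detections)
    mean_r = sum(r for r, _ in detections) / n
    mean_h = sum(h for _, h in detections) / n
    var = sum((r - mean_r) ** 2 for r, _ in detections)
    cov = sum((r - mean_r) * (h - mean_h) for r, h in detections)
    a = cov / var
    b = mean_h - a * mean_r
    return -b / a


def is_plausible(foot_row, pixel_height, horizon_row, tolerance=0.5):
    """Reject detections above the horizon or of badly mismatched size."""
    if foot_row <= horizon_row:
        return False  # a pedestrian hovering in the sky
    expected = expected_pixel_height(foot_row, horizon_row)
    return abs(pixel_height - expected) <= tolerance * expected
```

A joint method would estimate the horizon and the detections together; this sketch only shows the two directions of the constraint, from detections to horizon and from horizon back to which detections are believable.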
0:35:05so for example on the top row are the local detections the yellow ones are pedestrians the green ones
0:35:11are cars and we just tested against a threshold and there are all sorts of pedestrians hovering in
0:35:17the sky
0:35:19but from that and other detector information we can estimate a horizon
0:35:23pedestrians have their feet on the ground most of the time and that just rules out all those false positives
0:35:29up there and it rules in some small detections close to the horizon because they're about the right size
0:35:36similarly if we go looking at a scene with cars and people in it
0:35:41you'll notice this is the one on the bottom
0:35:44by estimating the horizon
0:35:46several detector responses the little dotted red ones for the pedestrians
0:35:51have gotten back in
0:35:52because we know that even though the image data didn't look all that great it really is at the right
0:35:59size in the right place to be a pedestrian and that gives us just a little bit more confidence
0:36:05now geometry is wonderful stuff there are all sorts of geometric estimates that are making detection better right now
0:36:12one thing is you can pretend that the room is a box
0:36:15then using a variety of standard methods
0:36:18you can then estimate the box even if the room isn't exactly a box
0:36:22you can then estimate the box and when you estimate the box you can get some idea of where the floor is
0:36:27so over there we've got a room with a box painted on it you can see the box isn't quite right
0:36:32nonetheless because you've got the box we can figure out what the walls look like what the floor looks like
0:36:37and what the ceiling looks like
0:36:38so the red is one wall there's another wall the yellow is another wall the green is the floor
0:36:46the blue is the ceiling and the purple is stuff that is none of the above what we call
0:36:52clutter
0:36:53things that you might bump into and such
0:36:56another way that you could benefit from this so firstly we gave an account of free space
0:37:02but another thing you could do is you could take that and you could say well because i know the
0:37:06box
0:37:07i can use standard methods to ask
0:37:09what would the faces of boxes inside the room look like
0:37:15if i looked at them frontally
0:37:17so if i want to build a bed detector and it turns out the people who did this varsha hedau
0:37:21and
0:37:22colleagues actually have the world's best bed detector which sounds like sort of a slightly eccentric thing to have but
0:37:27there's a principle here and you'll see it being useful in a second if i want to build a good
0:37:31bed detector if i just look at images
0:37:34i have to deal with the fact that the bed might appear at different orientations
0:37:39and because it appears at different orientations it's going to look different
0:37:43but if i know the box of the room i can say beds are axis aligned
0:37:48they typically have one face against the wall of the room
0:37:51therefore i'm gonna rectify using the box of the room so the faces of the bed are frontal
0:37:57and i can now remove some
0:37:59source of ambiguity in my features and build a better detector
0:38:04now the thing that's nice about that
0:38:06is when you know where the bed is you know something about where the room is
0:38:10because beds do not penetrate the walls of rooms
0:38:14so what i can
0:38:15do is estimate the room and the beds simultaneously and come up with quite good estimates as to where furniture
0:38:23is in free pictures of rooms so over here at the top you see an estimated box
0:38:29in the middle you see a bed that's estimated without re-estimating the box
0:38:36and at the bottom you see a joint estimate of bed and room box
0:38:40and that joint estimate is somewhat better it's sort of three or four percent but it's way
0:38:47oh the nice thing about boxes
0:38:49is you can do other things with them as well
0:38:51so very recently kevin karsch has shown
0:38:54that if you know the box of a room you can figure out where the lights are
0:38:59and you can figure out what the albedo is on the sides of the room whether it's black
0:39:02or white
0:39:04or green or red
0:39:06and if that's the case and you know where the floor is you can stick other stuff into the room
0:39:12so i'll go backwards and forwards
0:39:13we put some pieces of computer graphics junk in the room and you'll notice that the statue is behind the ottoman
0:39:21and as a result is occluded and the lighting is wrong
0:39:25the other thing about this which is kind of fun is if you can do it for a static thing
0:39:29you can do it for moving stuff
0:39:31so here's a picture of a billiard room from
0:39:34flickr and you can just play billiards on the table
0:39:39here's another picture from flickr and so everything i'm showing you comes from a single picture but
0:39:43here's another picture from flickr and a little glowing ball managed to get into the picture and is going
0:39:48to explore it
0:39:49you'll notice it gets reflected in the mirror
0:39:52it casts shadows the way it should
0:39:55and when it flies under the table is more like twitch
0:39:59so these kinds of simple geometric inference
0:40:02can support amazing functions the usefulness of this is pretty obvious you can stick furniture into pictures of your
0:40:09living room
0:40:10if you're inclined to do such things you can shoot aliens in your
0:40:15dining room on a computer
0:40:19so let's look at the last sort of big puzzling principle that's kind of emerging in modern vision
0:40:24and that's selection
0:40:26what should we say
0:40:28so a couple of years ago julia hockenmaier went out collected a whole bunch of images
0:40:34and then sent them to mechanical turk got people who were prequalified english speakers
0:40:40this is kind of important otherwise things get a bit funny and asked them to write a sentence about the
0:40:44picture
0:40:45and then what you do is you get multiple sentences about a single picture
0:40:50and you look at the sentences
0:40:52and the startling thing about those sentences is that they're consistent
0:40:57people presented with this picture talk about two girls sitting and talking one of them is holding something they're
0:41:04chatting they're wearing jeans but they don't talk about the step
0:41:08they don't talk about the specular reflections in the window at the back of the image they don't talk about
0:41:13the two people in that window they don't talk about the chewing gum on the ground
0:41:18they're capable of looking at this thing and saying this is important
0:41:22this is what's worth mentioning and this isn't
0:41:25and the moderate beacons
0:41:27now understanding that is terribly important and the reason it's important is
0:41:30pictures are all about
0:41:32and if your model is to record every object in the picture then you're dead because your report is
0:41:37too big
0:41:37so we need to know what's what
0:41:41we can do some of this
0:41:42there's a fair amount of work on predicting sentence level descriptions of images or video
0:41:48so for example abhinav gupta and colleagues took video of baseball games
0:41:54and they used methods similar to the discriminative methods i described to identify who's pitching who was catching who was
0:42:01running
0:42:02now they also built a little
0:42:05generative model of baseball essentially you can do this and then that once you've done this this could happen or
0:42:11that could happen and all that
0:42:13and you can think of it as being represented by a tree of events and some structural rules that allow
0:42:19you to rearrange the tree and then what you do is you say okay i've got these detector responses
0:42:25these are the structural rules of the game let me generate a structure that explains those responses and
0:42:31of course if i can generate that structure i can generate things that you know without close inspection look like
0:42:38descriptions of a game
0:42:40no sportscaster would emit something like pitcher pitches the ball before batter hits batter hits and then simultaneously batter
0:42:48runs to base and fielder runs towards the ball fielder catches the ball it's not the
0:42:53way people talk
0:42:55at the same time it's
0:42:57a description of what's going on
0:42:59that you could use to produce something that scene
0:43:03and it's a fairly detailed description of what's
0:43:07we can generate sentences for over three pictures although it's still a bit rough and ready there are methods that
0:43:14essentially say i go from an image space to some sort of intermediate space of detector responses
0:43:21and then i'll go from a sentence space
0:43:23to some intermediate space of detector responses and then i would try and align sentences and images in that
0:43:30space and report the best matching set
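Matching images and sentences in a shared intermediate space can be sketched as a nearest-neighbour lookup. Everything concrete here is an assumption for illustration: the axes of the intermediate space, the hand-written vectors, and the use of cosine similarity as the matching score. The methods referred to in the talk learn both mappings rather than writing them by hand.

```python
from math import sqrt

# Hypothetical axes of the intermediate space of detector responses.
AXES = ("animal", "person", "train", "ground", "sleep", "stand")


def cosine(u, v):
    """Cosine similarity between two vectors in the intermediate space."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def best_sentence(image_vec, sentence_vecs):
    """Return the sentence whose vector lies closest to the image's vector."""
    return max(sentence_vecs, key=lambda s: cosine(image_vec, sentence_vecs[s]))


# Illustrative data: detector responses for one image, and two candidate
# sentences already mapped into the same space.
image_vec = [0.9, 0.1, 0.0, 0.8, 0.7, 0.2]
sentence_vecs = {
    "an animal sleeps on the ground": [1, 0, 0, 1, 1, 0],
    "a man stands next to a train": [0, 1, 1, 0, 0, 1],
}
chosen = best_sentence(image_vec, sentence_vecs)
```

The point of the design is that neither side needs to know about the other: images and sentences only have to agree on the intermediate representation.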
0:43:33the kind of results one gets are shown here so on that top picture the detectors are saying pet sleep
0:43:41ground animal sleep ground goat stand ground and the kind of sentences one sees generated are see something
0:43:48unexpected
0:43:50"'kay" so people remark things up into account the thing to say the least
0:43:53but you might also get cow in the grass field which is not that it's a sheep but you know it's not
0:43:58bad going
0:44:00the third one down
0:44:03a man stands next to a train on a cloudy day it looks like a wonderful
0:44:08if you look closely at it it's actually a woman
0:44:10so you know you can make minor mistakes because sentences are really compact
0:44:15sources of information and sometimes you make howlers so this is not in fact a laptop connected to a
0:44:22black dell monitor there really isn't all that much black on the on the four
0:44:27more recently tamara berg enriched this significantly by joining this machinery to machinery about attributes
0:44:36and was able to produce again you know we're not doing sentence generation you know that should be fairly obvious
0:44:41from this end
0:44:43descriptions of pictures that look like this
0:44:45there are two aeroplanes the first shiny aeroplane is near the second
0:44:50again we're not in sentence generation
0:44:52but if you did do sentence generation you might see there's enough meaning that's been extracted from the image that
0:44:57you could turn it into a reasonable form
0:45:01there are one dining table one chair and two windows the wooden dining table is by the wooden chair and against the
0:45:06first window
0:45:08the kind of objection you would raise to that is too much information and not enough selection as opposed to
0:45:15it's wrong
0:45:17okay now i'm gonna show you a movie to
0:45:21illustrate how far this idea of selection seems to go in human vision it's a fairly wrenching movie so the first
0:45:27thing is just to warn you that nobody was hurt
0:45:30watch once
0:45:32and then we'll think about it so it's clearly a surveillance movie on a train platform
0:45:47and that's where it gets interesting
0:45:50okay the question how many adults were on the platform and what were they doing
0:45:56right you don't know audiences always give sort of a variety of answers somewhere
0:46:03in the two to seven range it's just not important
0:46:07you look at that thing and it is clear what's important and it's clear what's not important and you're really
0:46:11good at homing in on the important stuff
0:46:14and the important stuff looks like what outcome do we expect how do other people feel
0:46:19this feeling thing is not just because we're nice people and we care about what other people feel it's because
0:46:24it gives you a really good idea of what they're gonna do next which matters
0:46:28a what we like
0:46:30and of course what's gonna happen to the baby
0:46:32again actually the whole sequence
0:46:35nobody was hurt the child was not hurt it says something about how good baby carriers can be
0:46:41but it's a lot
0:46:57and the trying times
0:46:59but i wouldn't show that if the child had been hurt it's quite a well known video
0:47:03the baby carrier ended up upside down and was pushed along the child was annoyed but not seriously damaged
0:47:10if you look at this your ability to predict the behaviour of that woman who just nearly threw herself in
0:47:17front of the train
0:47:18is pretty good she's gonna react in kind of a strange way for the next ten
0:47:23what you don't is you look at this you identify what simple shape well what we're going to notice this
0:47:29guy because he is an important
0:47:31and you build a little narrative around it and you focus on that
0:47:36we don't know how to do that we are trying to but we don't know how to do that yeah
0:47:39so here are a couple of crucial open questions as we move towards the end
0:47:46one is dataset bias
0:47:48so
0:47:49a distinctive feature of vision is that frequencies in data
0:47:53misrepresent applications
0:47:55for a whole bunch of reasons the labels are wrong
0:47:58the things that are chosen to get labelled are not uniform people collect things in very specific ways
0:48:05and this is not a charge nobody goes out there and does wicked things with data collection but it's
0:48:10a real issue
0:48:12so the bias is pervasive and we know it's a big deal in vision datasets because antonio torralba and
0:48:18alyosha efros produced this wonderful paper that
0:48:22proved a good classifier can tell which dataset an image comes from
0:48:28which is very scary news
0:48:31and a smart vision researcher can do it very quickly so you have a little
0:48:35test there which dataset does this picture come from people run about sixty to seventy percent classifiers are a little
0:48:41bit weaker
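The "which dataset does this picture come from" test can be sketched with toy data. This is an illustration under stated assumptions, not the experiment from the paper mentioned above: the feature vectors are synthetic Gaussians, the classifier is a plain nearest-centroid rule, and the dataset names "A" and "B" are placeholders. The idea is only that accuracy well above chance at naming the source dataset signals that the datasets differ systematically.

```python
import random


def make_dataset(center, n, rng, spread=1.0):
    """Draw n synthetic feature vectors scattered around `center`."""
    return [[rng.gauss(c, spread) for c in center] for _ in range(n)]


def centroid(rows):
    """Mean feature vector of a list of rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def nearest(x, centroids):
    """Name of the dataset whose centroid is closest to x."""
    def squared_distance(name):
        return sum((a - b) ** 2 for a, b in zip(x, centroids[name]))
    return min(centroids, key=squared_distance)


def name_that_dataset_accuracy(train, test):
    """Fraction of test vectors whose source dataset is guessed correctly."""
    centroids = {name: centroid(rows) for name, rows in train.items()}
    total = sum(len(rows) for rows in test.values())
    correct = sum(nearest(x, centroids) == name
                  for name, rows in test.items() for x in rows)
    return correct / total


rng = random.Random(0)
# Two "datasets" whose features are shifted relative to each other (biased).
biased_train = {"A": make_dataset([0, 0], 50, rng), "B": make_dataset([4, 4], 50, rng)}
biased_test = {"A": make_dataset([0, 0], 200, rng), "B": make_dataset([4, 4], 200, rng)}
# Two "datasets" drawn from the same distribution (no bias between them).
same_train = {"A": make_dataset([0, 0], 50, rng), "B": make_dataset([0, 0], 50, rng)}
same_test = {"A": make_dataset([0, 0], 200, rng), "B": make_dataset([0, 0], 200, rng)}
```

When the two sources share a distribution the accuracy should hover near chance; when they are shifted it climbs towards one, which is the sense in which a classifier "can tell" the datasets apart.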
0:48:43size doesn't make bias go away
0:48:45if you get a really big dataset that doesn't mean it's an unbiased dataset and it might make it worse
0:48:51because you might become
0:48:52so if you look at this i collected these pictures from google there are i don't know twenty three million
0:48:59pictures of lions here are the top
0:49:02i don't know however many
0:49:04and you might think they're unbiased but have a close look so the kinds of things you could deduce from
0:49:09these pictures are lions riding horses is fairly ordinary
0:49:13there were two pictures of lions on horseback
0:49:15there's a lion lying down with the lamb
0:49:17there's another one
0:49:19having a person putting a hand on it
0:49:23and is aligned with i'm that way
0:49:25that's on the first
0:49:27so if you use that as your source of lion information you'd be in serious trouble that's just not lions
0:49:35this is an effect of editorial bias people are more interested in weird pictures of lions than
0:49:41common ones
0:49:43the problem is this blows huge holes in what we know about machine learning
0:49:47so machine learning is based on a form of induction that says the future is going to be like the
0:49:52past
0:49:54if you can't make the future like the past then you've got a problem
0:49:58and current machinery just doesn't sort of go to this
0:50:01place
0:50:04there's good reason to believe that this issue is pervasive in object recognition that the world cannot be like
0:50:10the training dataset because many things are rare that's why unfamiliar things are common and we can deal with them
0:50:15of course if many things are rare then this exaggerates bias
0:50:19so gang wang produced a little histogram
0:50:22that said okay of all the objects in a marked up dataset that's common in vision how many instances are there
0:50:30and there's a small number of objects that have you know four thousand five hundred instances or so but very quickly you're
0:50:36down in that
0:50:38and after that most objects appear two or three times in this dataset so most objects are rare
0:50:44this should be a fairly familiar phenomenon
0:50:46but it wasn't really an issue in vision until recently
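The histogram of instances per object category can be sketched with a toy label list. The labels and counts below are made up for illustration and are not the dataset from the talk; the threshold of three instances for "rare" is also an assumption.

```python
from collections import Counter


def instance_histogram(labels):
    """Count how many instances of each object category appear."""
    return Counter(labels)


def fraction_rare(counts, max_instances=3):
    """Fraction of categories with at most `max_instances` instances."""
    rare = sum(1 for c in counts.values() if c <= max_instances)
    return rare / len(counts)


# Toy annotation list: a couple of frequent categories and a long tail.
labels = (["person"] * 50 + ["car"] * 20
          + ["otter", "kite", "anvil"] * 2 + ["zither"])
counts = instance_histogram(labels)
rare_fraction = fraction_rare(counts)
```

Even in this tiny example most categories are rare although most instances belong to the frequent ones, which is the long-tail shape the talk describes.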
0:50:50there are several things you might do about bias
0:50:53you could think about appropriate feature representations and what i described about illumination invariance is one form of doing that
0:51:00if your features are invariant to illumination then the fact that your dataset is biased with illumination just doesn't matter
0:51:07another thing you might do is build appropriate intermediate representations
0:51:11so that from those intermediate representations you might be able to make unbiased estimates of classifiers even though the objects
0:51:19are rare
0:51:20and that's one way of interpreting this attribute
0:51:23and the other thing is if you have good representations of things like geometry
0:51:28you just might be able to escape the effects of dataset bias
0:51:34so a last conclusion and then we're almost done
0:51:39object recognition links to utility in complex ways that are not terribly well understood yet
0:51:44so
0:51:45the biggest question in computer vision right now is what should we actually say about visual data
0:51:52a picture or a video goes into a recognition system question what should come out
0:51:59one answer is a list of everything that's in the picture that's a silly answer there are too many things in
0:52:03the picture
0:52:04if i look at this room in front of me it's silly to be describing the nut on the bolt
0:52:09that holds the emergency exit
0:52:12thing to get that still
0:52:14so another answer is well a useful representation of reasonable size which is a lousy answer because
0:52:19we don't know what it means to be useful and we don't know how to make the size reasonable
0:52:25it seems that object categories depend on utility
0:52:28so when i talked about that monkey
0:52:31it could also be a plastic toy but the other category it can occupy is irrelevant
0:52:37it really just doesn't matter we're not that interested in it so why care
0:52:41if you look at this little fellow who turned up in my person drinking from a bottle example recently somebody pointed out
0:52:48that that's a beer bottle so you know you could think of that as a person or a child
0:52:52or a beer drinker
0:52:53or a be a drink each other a tourist or a hotline like a or an obstacle or potential the
0:52:58rights
0:52:59you know you see that you can write something right or around
0:53:03so just depending on what you're doing that object occupies a wide range of different potential categories
0:53:10so what i talked about suggests
0:53:13the emergence of a new belief space about object recognition we're sort of heading in this direction and it
0:53:18looks as though it's gonna be interesting when we get there
0:53:21and the belief space is that categories are really fluid
0:53:25they're opportunistic devices to aid generalisation they're affected by your problems and by utility
0:53:31things can belong to many categories
0:53:33some people would refer to this as a cellphone or a smartphone if i fling it into the audience it
0:53:38would turn into a projectile immediately
0:53:41and in fact the fact that it was a smartphone would have nothing to do
0:53:45with whether it was a projectile
0:53:47so at the same time the same instance can belong to different categories or sorry at different times it can
0:53:52belong to different
0:53:54categories of course when we talk about objects as being special within the category that's meaningful
0:54:01it's not like all birds are the same
0:54:03some are interesting because they're missing tails others are interesting because they have special feathers other birds
0:54:10are interesting because they're inside this room flying around we had two just before the talk
0:54:15many categories seem to be rare
0:54:18and many categorisations might be personal
0:54:20i might think about some things differently than you and if we don't talk about it it really just doesn't matter
0:54:27and in turn that suggests that recognition is
0:54:29not really just discrimination it's constantly coping with the unfamiliar
0:54:34in the presence of massive and unreasonable bias
0:54:38and we need new tools and machinery to do that
0:54:42so i'm done i've run through my major points
0:54:45and it remains only to point out that if you want more information you can get it
0:54:50but if somebody tries to sell you the one with the brown cover then they're a villain because that's
0:54:55the first edition and that's ten years old the second edition appeared physically in november
0:55:01so they do exist and they're around and it has fairly up to date information about the state of recognition
0:55:07and thanks what i described has been supported by numerous agencies and organisations including the office of naval research
0:55:15the national science foundation
0:55:18and we don't
0:55:36oh
0:55:37just a quick question about size
0:55:39so the issue when the person was misrecognized as a bottle
0:55:43you know this is a misrecognition where we go well that's something just at the wrong scale
0:55:49so is size really difficult is it difficult to tell how big something is
0:55:53oh okay
0:55:56with so many vision
0:55:58yes no the same
0:56:01we know that people are amazingly good at making
0:56:04so
0:56:06the main literature about this i
0:56:09that describing things that they get wrong in attendance at my house
0:56:14we don't know how they do it
0:56:15and we don't have methods right now in vision in computer vision
0:56:19that can do size estimates satisfactorily
0:56:25one reasonable resolution to the person bottle problem is you know bottles are just a lot smaller than people
0:56:32but how do you know how big the thing you see is
0:56:35in an absolute sense well one answer is i look at some kind of big scale geometric context
0:56:42around that i use it to make some estimate of the camera and where things are and that tells me
0:56:48something about the size and if i get really gross size mismatches then i can say no that isn't gonna
0:56:53work
0:56:55right now nobody can do it in a satisfactory way i would regard that as something that's sort of in
0:57:01the air coming
0:57:02wouldn't
0:57:05i would think in three or four years time we might do
0:57:08factor of two size judgements moderately well
0:57:12more detailed size judgements i think are still very mysterious
0:57:15they do require putting together a whole bunch of contextual machinery because of the scaling effect of perspective
0:57:22what looks like a small bottle in an image might just be a massive one a long way away so
0:57:26you need some notion of the space that it occupies
0:57:29and that's one of the attractions by the way i showed you that fun movie of the things
0:57:34moving around in the room
0:57:35one of the attractions of that movie is
0:57:38when you have that degree of understanding of space you probably can make size predictions
0:57:44and that you could use them to drive recognition but as far as i know there's nothing right yeah that's
0:57:54so i
0:57:55just
0:57:56i
0:57:57the datasets are biased unfairly biased towards things that are interesting
0:58:02and i'm wondering why in computer vision we don't use the datasets as the vocabulary from which to describe
0:58:09do women
0:58:10the bias is obviously something that people are drawn to and it seems that the data itself
0:58:17could be the vocabulary with which you describe that is
0:58:21you describe an image in terms of its representation in this huge dataset
0:58:26so on
0:58:28i think that when i mean this is just a
0:58:31setting right because
0:58:35different agenda react to this very different
0:58:38so if you think about computer vision as something you do when you stick a camera
0:58:43on your head and you walk around well
0:58:45then the lion dataset i showed you is just wrong
0:58:48but if you think about computer vision as something where what i do is use google images to interpret
0:58:54more
0:58:55the whole issue of bias is just not an issue right because
0:58:59one is a fair sample of the other
0:59:02the there is very little explicit writing about what you're referring to what is a lot of what that implicitly
0:59:09takes into account
0:59:11so much of what i've talked about in recognition actually
0:59:16involves
0:59:18some interesting use of iconography
0:59:22which is a way of talking about what you're talking about
0:59:25we don't have a good enough
0:59:28understanding of that issue to be able to talk about it
0:59:31clearly so you know there are two kinds of convention one is the lion is interesting
0:59:36this one's got that one's riding a horse that solution
0:59:40and the other is we really tinted photograph lines handle
0:59:44you want C will that many pictures or
0:59:47you know a line photograph
0:59:49three quarters with the shoulder dominating action
0:59:52and it seems like one possible iconographic convention is different from another one
0:59:56one of them if you like is interesting this in terms of properties and also it's a semantic stuff
1:00:02and the other is
1:00:04characters
1:00:07just don't have the language to separate those rates and talk about them sensibly
1:00:11yeah again i think it's very much on the agenda because of these separating three
1:00:16you know if you really want to learn about the world from google actions
1:00:20you're gonna have trouble
1:00:21and
1:00:22we know that we don't really have
1:00:26so it's of a coincidence the but the best i can do so
1:00:32a comment on what you said about the utility of what matters in a
1:00:38picture what matters in the picture depends on the utility i'm of that view
1:00:43but yet it seems like when you gave an image today the image of the two girls
1:00:49to several people they came up with pretty much the same description so there seems to be a sort of
1:00:55baseline utility which is sort of context independent was wondering if you could comment on that and i think
1:01:02you're right
1:01:03so there's a fair amount of experiment
1:01:07one
1:01:08people select dimension
1:01:11the situation is a little bit because
1:01:14it's how to do the experiments exactly right and it's to be precise
1:01:18but this some evidence suggests that kind of things that we dispose people to mention thing
1:01:23oh
1:01:24the really interested in people begin
1:01:27and you can explain that because people have the potential to affect you when you've got a right and left
1:01:32yes
1:01:33the sort of always interesting kind of baseline
1:01:38that thing is that all begin should tend to be mentioned
1:01:44i
1:01:44things that are unusual you know if you have a small rhinoceros in the downtown street view people are gonna
1:01:51say gee you don't see that very often therefore
1:01:54and those seem to be rough
1:01:56principles for baseline utility but
1:02:01we do not again yet have
1:02:03the class of understanding required to say well okay there's a baseline utility and then there's also a component that's linked to
1:02:09the immediate task
1:02:11well i would guess that
1:02:13that's a situation
1:02:14if one wanted to take a very extreme point of view you could say
1:02:18the right way to do vision is with reinforcement learning because that's the way nature did it you just shoot every
1:02:24vision system in the head if it doesn't do everything right
1:02:27the downside of that one is it takes an awfully long time
1:02:31and you know peeling open these utility issues and getting better understanding of the principles seems to be important
1:02:41again sorry surveillance the understanding
1:02:48question
1:02:49so i mean
1:02:51obviously we all kind of in our heads sort of compare how vision people do their stuff and how
1:02:56speech people do their stuff
1:02:58and the two things that kinda make speech recognition work in my view at a very abstract level is one
1:03:05that we model how the various units that we're trying to recognise change in context for instance you know phones
1:03:13depend
1:03:14greatly in how they're realised on what other phones they occur next to and then we really use in a
1:03:20massive way
1:03:21this what you called joint modeling you know we model how phones occur together how words occur together how
1:03:28higher level units like topics and other linguistic units at various levels all interact and have co-occurrence statistics
1:03:39that can inform the units within them
1:03:42so this joint modeling that you just touched on is really massively important for speech recognition
1:03:49and so these two aspects the modeling of how things change as a function of context and then
1:03:55modeling the context itself
1:03:57and its statistics do you see that as being
1:04:02still having a long ways to go or is it just not something that works as well in
1:04:07the vision domain can you draw some comparisons there you've put your finger on a really nasty mess
1:04:13we know about context we've been talking about context since the eighties
1:04:18and then the question is sort of how what and why and under what circumstances and what you get the
1:04:24contextual statistics and all that jazz
1:04:26and there is
1:04:29a tremendous amount of work on that topic
1:04:32the
1:04:35i guess a reasonable summary is
1:04:39clever use of contextual information
1:04:43often improves
1:04:45a particular function just a little bit
1:04:48but there is no example anyone knows where context just hits the issue out of the park
1:04:53and i'm using the word context in the broadest possible sense of various kinds of co-occurrence too
1:05:00so the geometric stuff so for example you can make pedestrian detectors a little bit better by knowing
1:05:08about geometry and the little bit is worth having like that's one person who doesn't get run over or whatever
1:05:14but i know of no example in vision where
1:05:19things get a lot better by heavy duty contextual information now you could argue that a bunch of ways
1:05:25and people do argue it two ways one argument is
1:05:29well you're not using enough contextual information if you use much richer contextual models and more detail and the like
1:05:36things will get better and there are whole research programs based on that
1:05:41hypothesis
1:05:42the other argument is well those elaborate structures
1:05:47become increasingly subject to issues of bias issues of variance in estimation and all that jazz
1:05:53and basically what you win with one hand you lose with the other one and you're sort of back where
1:05:58you were
1:05:59i would say the jury's just out on this question it's very firmly on the agenda it's
1:06:04very aggressively studied
1:06:07and my own bet would be contextual information really matters
1:06:12but it also really matters which contextual information you use and which you don't
1:06:17and that second choice is pretty
1:06:21we don't really have the machinery that says this is the good stuff this is the bad
1:06:27i
1:06:28now i know it's not easy to sort of meaningfully contrast vision and speech they're just different activities different
1:06:36communities that do different things
1:06:38but i would say
1:06:41we have a bafflingly rich selection of potential contexts to use
1:06:47everything from camera geometry to geometric context
1:06:51to spatial properties of texture and the like or co-occurrence statistics of objects or object scene co-occurrences and the like and
1:07:00one possible source of the difficulties is we just don't know what to select
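The co-occurrence statistics of objects mentioned in this answer can be illustrated with a minimal sketch. It is not a method from the talk: the scene labels, the function names and the choice of a simple conditional-probability estimate are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(scenes):
    """Count, per scene, how often each label and each unordered label pair appears."""
    single = Counter()
    pair = Counter()
    for labels in scenes:
        unique = sorted(set(labels))          # ignore repeated labels within one scene
        single.update(unique)
        pair.update(combinations(unique, 2))  # pairs are stored in sorted order
    return single, pair

def conditional_prob(a, b, single, pair):
    """Estimate P(a present | b present) from the scene counts."""
    if single[b] == 0:
        return 0.0
    if a == b:
        return 1.0
    key = tuple(sorted((a, b)))
    return pair[key] / single[b]

# Hypothetical toy annotations, one list of object labels per image.
scenes = [
    ["car", "road", "pedestrian"],
    ["car", "road"],
    ["boat", "water"],
    ["pedestrian", "road"],
]
single, pair = cooccurrence_counts(scenes)
p_car_given_road = conditional_prob("car", "road", single, pair)
p_boat_given_road = conditional_prob("boat", "road", single, pair)
```

With so few scenes the estimates are very noisy, which is the variance-in-estimation concern raised above for richer contextual models.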
1:07:12this is addressed both to you and to jeff
1:07:17the mechanism that i don't know if you heard jeff's talk yesterday morning on
1:07:25these segmental conditional random fields right and the idea he's proposing which is you know basically to model you know
1:07:34speech at you know it's eight eighty basically incorporating information from multiple detectors
1:07:41using the segmental random fields i mean i actually don't know enough to know whether that was inspired by the
1:07:48vision world and migrating to speech or vice versa but i was wondering if
1:07:54both of you you know could comment as to what the commonalities you see between those two approaches
1:08:02and whether there is anything you know you think you might have seen in jeff's talk or jeff whether you see
1:08:09anything here you know based on what you heard from david for a little bit of cross pollination between the
1:08:17two areas
1:08:18so i think
1:08:19yeah
1:08:21and i guess jeff is next to a microphone and i think from my perspective there are strong resonances and harmonies
1:08:28and one of 'em here is an idea that's pervasive in vision which is
1:08:33if you can carve up a picture into pieces that make sense
1:08:38you can get
1:08:39much more information about the pieces
1:08:41because you've got spatial support over which to pool features and the like
1:08:46there
1:08:48i'd be
1:08:49most serious vision people believe that if you could do a good job of
1:08:54carving up an image
1:08:56everything will get better
1:08:58i use the word believe because there's no evidence to support that view
1:09:03and
1:09:05it's reasonable to say that the people who believe it simply say that all tested unsupported belief of the wrong
1:09:11statistic any so you know we're sort of in a position where smart people think it should work out
1:09:18but right now none of the best
1:09:20detection or classification methods takes any account of spatial support they just look at the image as a whole
1:09:26i think that will change
1:09:29i will go to my grave believing that if it hasn't changed we've done something wrong and we'll come right
1:09:35later on
1:09:36but it hasn't changed yet and that's a very disturbing feature of the vision land
1:09:41so i think there's potential there but nobody's demonstrated it yet would be my reaction
1:09:46i've got the light in my eyes so i can't see if that's jeff oh yes
1:09:52i
1:09:53and yeah so i was i thought that was very interesting and that
1:09:57it was i think there are many points of commonality two things struck me one of them was in
1:10:06the vision case
1:10:07it seemed that the attributes were much clearer
1:10:11than we have in the speech case for example
1:10:16has wings has a beak has wheels
1:10:20those are high level attributes
1:10:23that we can sort of write down
1:10:26just by thinking about the problem and i'm not sure that we have the same attributes
1:10:31available to us
1:10:33looking at the spectrogram or the speech signal
1:10:37and the other thing that occurred to me was
1:10:40that perhaps in the vision case
1:10:43there's an interesting extension to what we're dealing with in the speech case which has to do with the sequential
1:10:50aspect of things
1:10:52for example if you're working instead of with a fixed image with a video where you have a sequence of
1:10:58scenes and you might wanna segment that into segments using some of the attributes that exist within the segments
1:11:08so
1:11:09the
1:11:10responding to one
1:11:11this should discussion and
1:11:16what attributes
1:11:18in the niger talk like this one summarizes about that
1:11:22but
1:11:26it's easy to write down a couple of hundred
1:11:29it's not clear that they're independent of each other and it's not clear that covers the gamut by any manner
1:11:34we don't really have a story about what you do if you don't know what the natural attributes are
1:11:40the story that people currently use is if you can come up with something that's discriminative it's gonna
1:11:45be an attribute one way or another and what colour attribute going like
1:11:49but there's actually
1:11:52a moderately interesting vision problem where we sort of know we don't have attributes and would like to and that
1:11:59question of developing attributes for things for which it's hard to write down a list is a big deal for us and
1:12:06if we can learn about it we would be pleased to learn
1:12:11time help segmentation
1:12:15but it
1:12:16again
1:12:17spatio-temporally segmented videos
1:12:21the
1:12:22doesn't seem to be much better than anything we know how to do with the non
1:12:27spatio-temporally segmented videos
1:12:30people like
1:12:31i you know i'd say most of the serious people in vision believe that's because we're understanding something wrong
1:12:37but we don't know what it is and we don't know how to make fine
1:12:44just on what you said there is a section of the community that does believe in feature detectors like articulatory feature detectors
1:12:55you know in terms of whether i'm not saying it's right or wrong but there was
1:13:02that part of the community that looks at
1:13:04speech recognition from that viewpoint which is a little more similar one thing i wasn't sure this is just a
1:13:10clarification on all that in your talk is when you produce these features i presume are these
1:13:18hard features that are being produced that is either it's there or not or are these all soft decisions
1:13:24that are extracted so is there like a set of ten billion possible things
1:13:31and is the probability thresholded or do you make a decision here it's a potato it's a septic tank et
1:13:38cetera et cetera
1:13:41well the nice thing about
1:13:42you like
1:13:44and you make a list or you know a bunch
1:13:48potential
1:13:49something about a paper about any combination
1:13:53usually what people do is report
1:13:59one alternative
1:14:01you know it's a pedestrian not a pedestrian it's a cat or not but there's
1:14:05a fair amount of interest in for example the top five
1:14:08there are a bunch of applications where
1:14:11as long as you get a ranking that's good and you get the right thing close to the top of the ranking then
1:14:15you're okay and people are very interested in that one there's another class of activity which is look if i
1:14:22build these detectors i can actually think of the output
1:14:26as being features and what i'm gonna do is i'm gonna pretend i'm building detectors and then i'll look at
1:14:31the responses
1:14:33and pretend that they're features and use them for a completely different activity so essentially all the alternatives you describe appear
1:14:41in someone's paper somewhere
1:14:43and i wouldn't say there's any consensus about what the best thing is
1:14:47which is unfortunately not so you know you do this you're okay this is not really
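Two of the alternatives described in this answer, checking whether the right label lands in the top five of a ranking and reusing detector responses as a feature vector, can be sketched as follows. This is only an illustration: the function names, the score dictionary and the detector list are hypothetical and do not come from the talk.

```python
def top_k_correct(scores, true_label, k=5):
    """Return True if true_label is among the k highest-scoring labels."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return true_label in ranked[:k]

def detector_feature_vector(scores, detector_names):
    """Treat raw detector responses as features, in a fixed detector order.

    Detectors with no response for this image contribute 0.0.
    """
    return [scores.get(name, 0.0) for name in detector_names]

# Hypothetical soft detector responses for one image.
scores = {"pedestrian": 0.4, "cat": 0.9, "dog": 0.1}

hit_at_1 = top_k_correct(scores, "pedestrian", k=1)  # only the single best label counts
hit_at_2 = top_k_correct(scores, "pedestrian", k=2)  # a good ranking is enough
features = detector_feature_vector(scores, ["pedestrian", "cat", "dog", "wheel"])
```

The feature vector keeps the soft responses rather than thresholding them, so a later classifier for a different task can decide how much weight each detector deserves.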
1:14:56a difference between speech and images is that all the images that seem to be in these datasets
1:15:03seem to be sort of high quality good images no one seems to post their crappy pictures on the web
1:15:08and so how well do some of these techniques work when the pictures are
1:15:12poor quality blurry or overexposed or underexposed
1:15:15'cause in speech we have a lot more of a sort of
1:15:19variability it seems like of quality which affects the performance of our systems
1:15:26so
1:15:27i mean this is what was what was also with it
1:15:31i
1:15:33the fact is there's an awful lot of pretty cruddy pictures and cruddy videos out there and half an hour
1:15:38on youtube will reassure you on this point
1:15:41and some things are hard
1:15:49we
1:15:52the things that make feature computations
1:15:56a very
1:15:58the acoustical phenomena that make
1:16:01your feature computations give you problems but there are some points of contact
1:16:07we benefit quite a lot from time so for example just one moderately good example if you're interested in human
1:16:17activity recognition
1:16:20if you think about things like a soccer field
1:16:23a long view of a soccer field with a player running across the field you really just can't resolve the arms
1:16:29and legs
1:16:29you've got motion blur to worry about they're about one pixel across anyhow it's just a mess
1:16:35but if you look over a longer time scale you can get a fairly good picture of what's going on by
1:16:41just looking at the sequence of pixels on the motion and pixels
1:16:45so i think
1:16:47some of the losses of resolution might not be as destructive as some of the acoustic effects that you encounter
1:16:53but i'm not sure that that's true
1:16:55there are a whole series of things in vision that are basically dead in the water as a result of
1:17:02interreflections of light
1:17:05where i think yeah multipath acoustic distortion probably isn't the biggest thing in your life there are other things to worry
1:17:10about
1:17:11so it's an it depends kind of situation
1:17:15there's a lot of interest in low resolution pictures that agencies care about or pictures that come out of
1:17:23forward looking infrared sensors for example
1:17:25for
1:17:26somewhat alarming reasons
1:17:33i
1:17:34yeah