I am building systems that can understand what they see. In this day and age, the necessary hardware is easily accessible since a digital camera and a computer can now be purchased for well under € 100. It is the software that is the real challenge.
The trick is to interpret the perspective projections of light into a camera in terms of the behavior of objects in the world. Some of these objects even have their own goals and dreams. Interpreting the intentions of moving objects from the continuous change of light distributions in optical projections is no small feat.
Luckily we have proof that it is possible. We are proof that it is possible. We humans, and numerous other species of animals, are clearly capable of “optically guided potential behavior”.
I want to focus here on understanding events. An example of an event would be “One of my colleagues enters our common room, pours herself a cup of coffee, sits down on the couch, and falls asleep”. This may be an unlikely event (and therefore a memorable one). But note that an event is more than an activity like drinking coffee or sleeping. It is a (mini-)story with a beginning, middle, and end.
My present goal is to build systems that can understand events, or more realistically, can categorize events from videos. The goal is to turn the continuous motion of objects in a scene (sampled in frames of a video recording) into a discrete event.
In order to keep the problem manageable, I focus on cars and their behavior in traffic. There are several advantages to picking this domain. Cars have relatively simple shapes. They do not have extremities like the limbs of humans that need to be included in the interpretation of their behavior. The world of cars is also relatively simple, since it is constrained by roads and traffic rules.
My R&D relies heavily on the inspiring psychological research by Jeffrey Zacks and Barbara Tversky on human perception of event structure. They discovered that observers largely agree on the moments in time at which an event begins and ends. These “breakpoints” or “event boundaries” are mentally organized in a hierarchy. At the beginning of a new event, the breakpoints correspond to a new action, a new object or actor, or a new setting. In a bottom-up sense these breakpoints correspond to the greatest physical change. Top-down, they are linked to completing actions or reaching goals.
If you want to read more I can recommend the original paper by Zacks & Tversky (2001) and the review paper by Tversky & Zacks (2013).
You can imagine that I am particularly enthralled by the idea that event boundaries correspond to “the greatest physical change”. I envision a 3D vision system that detects the motion of cars, estimating their 3D position and 3D attitude in every frame from the camera. These estimates can be represented as continuous trajectories that can be segmented at the points of extremal linear and angular acceleration.
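To make this concrete, here is a minimal sketch of that segmentation step, under assumptions the post does not specify: positions are 2D samples at a fixed frame interval, derivatives are taken by central differences, and a breakpoint is a local peak of acceleration magnitude above a hypothetical threshold. The function names and the threshold value are my own illustration, not the actual system.

```python
def finite_diff(samples, dt):
    """Central-difference derivative of a list of (x, y) samples.

    Drops one sample at each end, since the central difference
    needs both a left and a right neighbour.
    """
    out = []
    for i in range(1, len(samples) - 1):
        dx = (samples[i + 1][0] - samples[i - 1][0]) / (2 * dt)
        dy = (samples[i + 1][1] - samples[i - 1][1]) / (2 * dt)
        out.append((dx, dy))
    return out


def breakpoints(positions, dt, threshold):
    """Indices into `positions` where acceleration magnitude peaks locally.

    Differentiating twice drops two samples at each end, hence the
    index offset of 2 when mapping back to the original trajectory.
    """
    velocity = finite_diff(positions, dt)
    acceleration = finite_diff(velocity, dt)
    magnitude = [(ax ** 2 + ay ** 2) ** 0.5 for ax, ay in acceleration]
    peaks = []
    for i in range(1, len(magnitude) - 1):
        is_peak = magnitude[i] >= magnitude[i - 1] and magnitude[i] >= magnitude[i + 1]
        if is_peak and magnitude[i] > threshold:
            peaks.append(i + 2)  # map acceleration index back to position index
    return peaks
```

For a car driving at constant speed and then braking to a stop, the detected peaks cluster around the braking moment, which is exactly the kind of physical change Zacks and Tversky's breakpoints suggest. A real system would of course work on noisy 3D pose estimates and would need smoothing before differentiation.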
The system needs to be taught to interpret a class of trajectory patterns as a certain event with a particular label. For example, “A car is driving through the street”, or “A car has parked in spot 5”. Crucially, the system should clearly indicate “I don’t know” when it encounters a class of object motion that it cannot categorize. The maintainer of the system then has two options: widen the scope of an existing event category, or create a novel event category.
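One simple way to get this “I don’t know” behavior is a classifier with a reject option. The sketch below is purely illustrative: it assumes events are summarized as fixed-length feature vectors, uses a nearest-centroid rule, and rejects anything farther than a hypothetical distance threshold from every known category. Widening a category corresponds to retraining it with more examples; creating a category corresponds to adding a new centroid.

```python
import math


class EventCategorizer:
    """Nearest-centroid event categorizer with a reject option.

    Inputs that lie farther than `reject_distance` from every known
    category centroid are answered with "I don't know".
    """

    def __init__(self, reject_distance):
        self.centroids = {}  # label -> mean feature vector
        self.reject_distance = reject_distance

    def add_category(self, label, examples):
        """Create (or widen) a category from example feature vectors."""
        n, dim = len(examples), len(examples[0])
        self.centroids[label] = [sum(e[i] for e in examples) / n for i in range(dim)]

    def categorize(self, features):
        """Return the label of the nearest centroid, or reject."""
        best_label, best_dist = None, float("inf")
        for label, centroid in self.centroids.items():
            d = math.dist(features, centroid)
            if d < best_dist:
                best_label, best_dist = label, d
        if best_dist > self.reject_distance:
            return "I don't know"
        return best_label
```

The design choice that matters here is not the nearest-centroid rule itself but the explicit reject threshold: without it, the system would always force a familiar label onto unfamiliar motion, and the maintainer would never learn that a new category is needed.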
I consider my exploration into automatic event recognition to be part of a very long R&D event that started in the 1990s. I am still right in the middle of it, with no end in sight.