Multiple Object Tracking

Multiple Object Tracking demos to accompany Liu et al. (2005)

  • Liu, G., Austen, E. L., Booth, K. S., Fisher, B., Rempel, M. I., & Enns, J. T. (2005). Multiple object tracking is based on scene not retinal coordinates. Journal of Experimental Psychology: Human Perception & Performance, 31, 235-247.

Introduction
Multiple object tracking is the ability to follow objects based only on their spatial-temporal visual history. The present study examined whether multiple object tracking is accomplished using a retinotopic or allocentric (scene-based) frame of reference.

Caution: These demonstrations provide only a flavor of the displays used in the experiments. They have some artifacts because they are digital video recordings of the original displays. On some LCD screens, the contrast of the objects against the grid floor and marking circles may appear quite low. During actual testing, all objects were clearly visible and moved smoothly, and viewing distances and object speeds were carefully controlled.

At the beginning of each trial, 2, 4, or 6 objects are indicated with circles. All objects then undergo a period of motion lasting 10 seconds, after which one object is again indicated with a circle. The observer’s task is to determine whether this object is one of the objects originally targeted for tracking. In the demonstrations that follow, only 4 objects are marked for tracking. Object speed is indicated in degrees of visual angle per second (deg/s).
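
A speed stated in deg/s corresponds to a particular on-screen speed only at a given viewing distance, which is one reason distance was controlled during testing. As a rough illustration (not the authors' code), the sketch below converts a speed in degrees of visual angle per second into an on-screen speed, assuming a hypothetical 60 cm viewing distance and a 38 pixels-per-cm display.

  import math

  def deg_per_s_to_px_per_s(deg_per_s, viewing_distance_cm=60.0, px_per_cm=38.0):
      # On-screen extent (cm) subtended by deg_per_s degrees of visual angle
      # at the given viewing distance, then converted to pixels.
      cm_per_s = 2 * viewing_distance_cm * math.tan(math.radians(deg_per_s) / 2)
      return cm_per_s * px_per_cm

  # Hypothetical setup: the two speeds used in the demos.
  print(round(deg_per_s_to_px_per_s(1.0)))   # ~40 px/s for the 1 deg/s condition
  print(round(deg_per_s_to_px_per_s(6.0)))   # ~239 px/s for the 6 deg/s condition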

Experiment 1: Baseline Tracking Performance
The tracking accuracy for objects moving within a 2D rectangle (4 objects, 1 deg/s) was compared with that for objects moving within a depicted 3D box (4 objects, 1 deg/s). Objects could be tracked equally well in both situations, with a slight tendency for tracking to be even more accurate in the 3D display. As object speed increased (4 objects, 6 deg/s), tracking accuracy declined.

Experiment 2: Tracking during a “Wild Ride”
Tracking accuracy for objects in the 3D box was measured while the box underwent a “wild ride” consisting of dynamic and simultaneous translation, rotation, and zoom (4 objects, 6 deg/s, fast wild ride). Even though the added motion of the box led to an increase in the speed and variability of the retinal motions of the objects within the box, this additional variation had no measurable influence on tracking ability. Tracking was affected only by the speed of the objects relative to the box. This is consistent with tracking being accomplished using an allocentric frame of reference.
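
To make the allocentric/retinotopic distinction concrete, here is a minimal 2D sketch with invented parameters (not the actual display code): an object moves with a constant, simple velocity in box (scene) coordinates, while the box itself translates, rotates, and zooms each frame, so the same object traces a far more variable path in screen (retinal) coordinates.

  import numpy as np

  def box_to_screen(p_box, t):
      # Hypothetical "wild ride": the box simultaneously translates,
      # rotates, and zooms as a function of time t (seconds).
      angle = 0.5 * np.sin(0.7 * t)                 # box rotation (radians)
      zoom = 1.0 + 0.3 * np.sin(0.4 * t)            # looming / receding
      shift = np.array([40 * np.sin(0.3 * t),       # box translation (pixels)
                        25 * np.cos(0.5 * t)])
      rot = np.array([[np.cos(angle), -np.sin(angle)],
                      [np.sin(angle),  np.cos(angle)]])
      return zoom * (rot @ p_box) + shift

  dt = 1 / 60                                       # one frame at 60 Hz
  p_box = np.array([0.0, 0.0])
  v_box = np.array([10.0, 0.0])                     # constant box-relative velocity
  for frame in range(5):
      t = frame * dt
      p_box = p_box + v_box * dt                    # allocentric motion: constant speed
      p_screen = box_to_screen(p_box, t)            # retinal motion: speed and direction vary
      print(frame, p_box.round(2), p_screen.round(2))

In these terms, the result of Experiment 2 is that tracking performance depended on the box-relative velocity, not on the compound frame-to-frame screen velocity.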

Experiment 3: Structure of the “Wild Ride”
Most of the pictorial support for the 3D box was removed in order to see how perception of a stable scene depended on the wire frame and grid floor that had been used to convey the layout of the scene (4 objects, 6 deg/s, fast wild ride without frame). Tracking was unaffected by removal of these 3D cues. This suggests that the movement of the objects themselves within the confines of the 3D box is sufficient to provide the necessary ‘structure from motion’ to perceive the layout of the 3D scene.

Experiment 4: Simultaneously tracking objects moving at two speeds
In an effort to reduce the coherence of the 3D scene, the objects moved at two different speeds (4 objects, 2 speeds, fast wild ride). When the box was moving, the simultaneous presence of two object speeds was detrimental to tracking accuracy. This indicates that the constant speed of objects relative to one another is an important cue to the perceived structure of the 3D space.

Experiment 5: Tracking Objects in a Non-rigid 3D Space
We also reduced scene coherence by projecting the image of the scene onto a convex corner (corner projection). This manipulation was designed to create a failure in shape constancy when the scene was viewed from more than one vantage point (i.e., displayed on two different planes). Even though the retinal projection for the observer was the same as in conditions where tracking accuracy was high (Experiments 2 and 3), tracking here was impaired.

Conclusion
These results converge on the conclusion that multiple object tracking is accomplished using an allocentric spatial reference rather than a retinal frame of reference.