IntPhys: A Benchmark for Visual Intuitive Physics Reasoning


  • 2020-02-26

    The leaderboard is now available, with detailed results per condition and per number of objects.

  • 2020-01-27

    We corrected two bugs in the evaluation process:

    • Occluded and visible scenes were swapped during evaluation, which produced mislabeled results (scores displayed as occluded were in fact visible, and vice versa).
    • Relative error for null submissions was 0.0 instead of 0.5.

    Both bugs have been fixed on the leaderboard and in the provided starting kit.


In order to reach human performance on complex visual tasks, artificial systems need to incorporate a significant amount of understanding of the world in terms of macroscopic objects, movements, forces, etc. This raises the challenge of how to evaluate intuitive physics in such artificial systems, especially if these systems are not constructed directly with intuitive physics as an objective function (unsupervised or weakly supervised learning). Drawing inspiration from research in developmental psychology, this benchmark provides a set of diagnostic tests for increasingly difficult aspects of intuitive physics.

The IntPhys benchmark can be applied to any vision system, engineered or trained, provided it can output a scalar for each video clip that reflects how physically plausible the clip is. Our test set contains well-matched videos of possible versus impossible events, and the metric measures how well the vision system can tell the possible events apart from the impossible ones.

Our benchmark is therefore:

  • task neutral: it can be applied across very different systems that have been trained on a variety of tasks such as Visual Question Answering, 3D reconstruction, or motor planning.
  • model neutral: it only requires models to output a physical plausibility score over an entire video.
  • bias free: because the test is synthetic (constructed with a game engine), it enables careful control, which makes it free of the usual biases present in more realistic datasets.
  • diagnostic: it is NOT intended to provide a loss function for training a system’s parameters. Its purpose is to diagnose a system on increasingly complex sub-problems of intuitive physics (object individuation, kinematics, object interactions, etc.). The dev set is therefore small (intended only for tuning the plausibility scalar).
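
To make the idea of telling possible from impossible events concrete, here is a minimal sketch of a pairwise relative error. It is an illustration under stated assumptions, not the official evaluation code: the function name `relative_error`, the one-to-one pairing of clips, and the rule that a tie counts as half an error are all assumptions. The tie rule is chosen so that a null submission (the same score for every clip) gets 0.5, in line with the bug fix noted above.

```python
from typing import Sequence


def relative_error(possible: Sequence[float], impossible: Sequence[float]) -> float:
    """Fraction of matched pairs in which the impossible clip is not ranked
    below the possible one. Hypothetical helper, not the official metric.

    possible[k] and impossible[k] are the plausibility scores a system gave
    to the two clips of the k-th matched pair (assumed pairing).
    """
    if len(possible) != len(impossible):
        raise ValueError("each possible clip needs one matched impossible clip")
    if len(possible) == 0:
        raise ValueError("at least one pair is required")

    errors = 0.0
    for p, i in zip(possible, impossible):
        if p < i:
            errors += 1.0   # impossible clip judged more plausible: full error
        elif p == i:
            errors += 0.5   # tie: assumed to count as half an error
    return errors / len(possible)


# Illustrative scores only (not real results).
possible_scores = [0.9, 0.7, 0.4, 0.6]
impossible_scores = [0.2, 0.8, 0.4, 0.1]
print(relative_error(possible_scores, impossible_scores))  # 0.375
```

Under these assumptions, a lower value is better: 0.0 means every possible clip was scored above its impossible counterpart, and 0.5 corresponds to chance or to a constant score.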

We intend to release this benchmark in a series of yearly challenges, each with increasing difficulty.