Benchmark Description
---------------------

Design principles
~~~~~~~~~~~~~~~~~

Our benchmark is organized around three design principles:

- **conceptually pure blocks**: We test one basic principle of
  intuitive physics at a time (see Test_Blocks for details).
- **minimal set design**: Possible and impossible videos are
  constructed in carefully matched sets, in order to avoid bias.
- **parametric design**: Task difficulty is controlled by manipulating
  the duration of occlusion and the number and movement of the
  objects.

We illustrate the minimal set design with Block 1 (*object
permanence*), which tests the principle that macroscopic objects
persist through time: they do not flip in and out of existence, even
when we do not look at them. This principle is very important in
everyday life (does the car hidden by the truck still exist?).

.. image:: ../../data/exampleA.png
   :width: 400 px
   :align: center

To test for object permanence, we construct four movies. The two
possible movies correspond to following the green arrows: the movie
starts with one (or two) object(s) presented on an empty background;
then a screen is raised, totally occluding the object(s) for a period
of time; then the screen is lowered, revealing the object(s) in
its/their original position(s). The two impossible movies correspond
to following the red arrows: the movie starts with one object and
ends with two, or vice versa. The two possible and two impossible
movies are therefore composed of exactly the same frames and pixels;
the only difference between them is the particular sequence of
events, which is compatible with the principle of object permanence
in the first two cases and incompatible with it in the last two.

.. image:: ../../data/test_1.gif
   :width: 160 px

.. image:: ../../data/test_2.gif
   :width: 160 px

.. image:: ../../data/test_3.gif
   :width: 160 px

.. image:: ../../data/test_4.gif
   :width: 160 px

The test samples therefore come as quadruplets: two possible cases and
two impossible ones. They are organized into 18 conditions resulting
from the combination of 2 visibilities (visible vs occluded), 3
numbers of objects (0 vs 1, 1 vs 2, 2 vs 3) and 3 movement types
(static, dynamic 1, dynamic 2), as follows (still for Block 1):

.. image:: ../../data/designbloc1.png
   :width: 90%
   :align: center

In total, for each block, the test set consists of 3.6K clips (200
clips for each of the 18 conditions). The dev set is much smaller (12
clips per condition, 216 in total); it is only intended for tuning
the plausibility function, not for detailed model comparison. The
files in the dev set are provided with metadata and the evaluation
code. The test set is provided with raw videos (RGBD) only, without
metadata or evaluation code.
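
The factorial structure of the conditions and the resulting clip
counts can be checked with a short sketch (the labels below are
illustrative, not the benchmark's actual file identifiers):

```python
from itertools import product

# Illustrative labels for the three crossed factors of one block.
visibilities = ["visible", "occluded"]              # 2 visibilities
object_counts = ["0 vs 1", "1 vs 2", "2 vs 3"]      # 3 numbers of objects
movements = ["static", "dynamic 1", "dynamic 2"]    # 3 movement types

# Full factorial crossing: 2 x 3 x 3 = 18 conditions per block.
conditions = list(product(visibilities, object_counts, movements))

print(len(conditions))        # 18 conditions
print(200 * len(conditions))  # 3600 test clips per block
print(12 * len(conditions))   # 216 dev clips per block
```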

In order to have their test results evaluated, participants should
package their scores in a zip file and create a DOI for it in the
Zenodo intphys1.0 group. The evaluation will then be run
automatically and added to the benchmark leaderboard, together with
the date and the description of the file and group.

.. _evaluation:

Evaluation metric
~~~~~~~~~~~~~~~~~

Given a movie :math:`x`, a system must return a plausibility score
:math:`P(x)`. Because the test movies are structured in :math:`N`
matched sets of possible and impossible movies
:math:`S_i = \{Pos^1_i .. Pos^{n_i}_i , Imp^1_i .. Imp^{n_i}_i\}`, we
derive two different metrics. The relative error rate :math:`L_R`
computes a score within each set: it requires only that, within a
set, the possible movies are more plausible than the impossible ones.

.. math::

   L_{R}=\frac{1}{N}\sum_{i}{\mathbb{I}_{\sum_{j}P(Pos_{i}^{j}) < \sum_{j}P(Imp_{i}^{j})}}


The absolute error rate :math:`L_A` requires that, globally, the
scores of the possible movies are higher than the scores of the
impossible movies. It is computed as:

.. math::

   L_{A}=1-AUC(\{i,j; P(Pos_{i}^{j})\}, \{i,j; P(Imp_{i}^{j})\})


where :math:`AUC` is the area under the ROC curve, which plots the
true positive rate against the false positive rate at various
threshold settings.
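
As an illustration, both metrics can be computed from raw
plausibility scores with a minimal sketch. The set structure and the
scores below are invented, and the AUC is computed directly from its
pairwise (Mann-Whitney) definition rather than with a library:

```python
# Toy plausibility scores for N = 2 matched sets, each with two
# possible (Pos) and two impossible (Imp) movies.  In the benchmark
# these scores would come from a model's plausibility function P(x).
sets = [
    {"pos": [0.9, 0.8], "imp": [0.3, 0.4]},  # correctly ordered set
    {"pos": [0.5, 0.4], "imp": [0.6, 0.7]},  # misordered set
]

def relative_error_rate(sets):
    """L_R: fraction of sets in which the summed plausibility of the
    possible movies falls below that of the impossible movies."""
    errors = sum(1 for s in sets if sum(s["pos"]) < sum(s["imp"]))
    return errors / len(sets)

def absolute_error_rate(sets):
    """L_A = 1 - AUC, pooling all scores across sets.  The AUC is the
    probability that a randomly chosen possible movie scores strictly
    higher than a randomly chosen impossible one (ties count 1/2)."""
    pos = [p for s in sets for p in s["pos"]]
    imp = [p for s in sets for p in s["imp"]]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in imp)
    return 1.0 - wins / (len(pos) * len(imp))

print(relative_error_rate(sets))  # 0.5 (one of the two sets misordered)
print(absolute_error_rate(sets))  # 0.28125
```

Note that :math:`L_R` only compares scores within a matched set, so a
model can have a perfect :math:`L_R` while its scores are not
comparable across sets; :math:`L_A` is the stricter, pooled measure.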

Human and baseline results
~~~~~~~~~~~~~~~~~~~~~~~~~~

Below are the baseline results and human performance (evaluated
through Amazon Mechanical Turk) on blocks O1, O2 and O3 of the
benchmark:

.. image:: ../../data/newres.png
   :width: 80%
   :align: center


The source code for the baseline systems can be found at
https://github.com/rronan/IntPhys-Baselines.