A major assumption in modern computer vision is that you have to track points on surfaces in order to see in 3D. You can use two images from two static cameras (“stereo”), or two images from one moving camera (“structure from motion”).
The hard part is finding the point in each image that corresponds to the same point in the world. If you succeed, basic geometry (triangulation) gives you the 3D position of that point. Finding those matches is not trivial, which is why it is called the “correspondence problem”.
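To make the “basic geometry” concrete, here is a minimal sketch of the simplest case: two rectified cameras side by side, where a solved correspondence yields a pixel disparity and depth follows from Z = f·B/d. The focal length, baseline, and pixel coordinates below are illustrative assumptions, not values from the article.

```python
# Sketch: depth from one solved correspondence in a rectified stereo pair.
# Assumes pinhole cameras with parallel optical axes; f, B, and the pixel
# coordinates are made-up example values.

def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Classic rectified-stereo relation: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point in front of both cameras")
    return focal_px * baseline_m / disparity_px

# The same world point appears at x = 400 px in the left image
# and x = 360 px in the right image:
disparity = 400 - 360                              # 40 px
depth = depth_from_disparity(800.0, 0.1, disparity)
print(depth)  # 2.0 metres in front of the cameras
```

The formula also shows why correspondence errors matter: a mismatch of a few pixels changes the disparity, and the depth error grows rapidly for distant points where the disparity is small.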
ARKit on an iPhone establishes motion correspondence by tracking “feature points”, the yellow dots in the image.
It is often assumed that our eyes and brain solve the correspondence problem in a similar way. But that is exactly what it is: an assumption, and an incorrect one.