The distinction between image processing and computer vision is a difficult and subjective one. We reserve the name computer vision for those parts of the field where we interpret visual information in terms of the entities in the 3D world that are depicted in the image.
We start this part with a description of the pinhole camera, the model for most of our imaging devices. The pinhole camera models the standard photo camera, the eye, the video camera, etc. It will turn out to be a surprisingly simple model (just one matrix multiplication in homogeneous coordinates).
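As a preview of that single matrix multiplication, the sketch below projects a 3D point with a 3x4 pinhole camera matrix in homogeneous coordinates. The focal length and principal point are made-up example values, and the camera is placed at the origin looking down the positive Z axis; this is an illustration of the model, not a calibrated camera.

```python
import numpy as np

# Illustrative intrinsics (assumed values, not from the notes):
# focal length f in pixels, principal point (cx, cy).
f, cx, cy = 800.0, 320.0, 240.0

# The 3x4 pinhole camera matrix for a camera at the origin.
P = np.array([[f, 0, cx, 0],
              [0, f, cy, 0],
              [0, 0,  1, 0]])

def project(X):
    """Project a 3D point to 2D pixel coordinates via the pinhole model."""
    Xh = np.append(X, 1.0)   # homogeneous coordinates
    x = P @ Xh               # the single matrix multiplication
    return x[:2] / x[2]      # perspective divide

print(project(np.array([0.1, 0.2, 2.0])))  # a point 2 m in front of the camera
```

Note that the perspective divide by the third coordinate is what makes the mapping a projection rather than an ordinary linear map of the plane.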
In computer graphics the camera matrix is most often set by the programmer (user) when rendering a virtual scene on the computer screen. In computer vision, on the other hand, the camera is real, and its parameters (geometrical and optical) as well as its position and orientation in space have to be estimated from measurements. Estimating these unknown parameters is called camera calibration, and it is not a trivial task to do correctly and accurately. In these notes we sketch the basic math needed to understand what calibration is all about.
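To give a flavour of what calibration estimates, the sketch below recovers a camera matrix from known 3D points and their measured 2D projections using the Direct Linear Transform (a linear least-squares solve via the SVD). This is only one standard approach, shown on noise-free synthetic data; the ground-truth matrix and point values are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented ground-truth camera matrix (for generating synthetic data).
P_true = np.array([[700.0,   0.0, 300.0, 10.0],
                   [  0.0, 700.0, 200.0, 20.0],
                   [  0.0,   0.0,   1.0,  1.0]])

X = rng.uniform(-1, 1, size=(6, 3))   # six 3D calibration points
X[:, 2] += 4.0                        # push them in front of the camera
Xh = np.hstack([X, np.ones((6, 1))])
xh = (P_true @ Xh.T).T
x = xh[:, :2] / xh[:, 2:3]            # "observed" pixel coordinates

# Build the homogeneous system A p = 0: two rows per 3D-2D correspondence.
rows = []
for Xi, (u, v) in zip(Xh, x):
    rows.append(np.concatenate([Xi, np.zeros(4), -u * Xi]))
    rows.append(np.concatenate([np.zeros(4), Xi, -v * Xi]))
A = np.array(rows)

# The solution is the right singular vector of the smallest singular value.
_, _, Vt = np.linalg.svd(A)
P_est = Vt[-1].reshape(3, 4)
P_est /= P_est[2, 3]                  # fix the arbitrary scale to compare
print(np.allclose(P_est, P_true / P_true[2, 3], atol=1e-6))
```

With noisy real measurements one would use more points and typically refine this linear estimate with a nonlinear optimization, which is part of what makes calibration non-trivial in practice.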
The pinhole camera projects a 3D scene onto a 2D retina, and in such a projection information is lost: all points on a straight line through the optical center of the camera are projected onto the same point on the retina. We thus cannot recover depth from a single point (pixel) in one image. Using its knowledge of the 3D world around us, the human visual brain is nevertheless capable of inferring depth from what is seen.
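The depth ambiguity is easy to verify numerically: scaling a 3D point along its ray through the optical center leaves the projected pixel unchanged. The intrinsics below are again made-up example values.

```python
import numpy as np

# Illustrative intrinsic matrix K (assumed values).
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])

def pixel(X):
    """Project a 3D point with a camera at the origin: x = K [I|0] X."""
    x = K @ X
    return x[:2] / x[2]

X = np.array([0.2, -0.1, 1.0])
print(pixel(X), pixel(3.0 * X))  # the same pixel at both depths
```

Any positive multiple of X gives the same pixel, which is exactly why one image alone cannot tell us how far away the point is.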
We need at least two eyes/cameras to measure depth. We will look at the setup with two cameras (stereo vision), model the way the two images are related, and show how depth can be recovered from the two views.
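For the simplest stereo setup, two identical parallel cameras a baseline b apart, the relation between depth and the horizontal shift (disparity) of a point between the two images reduces to d = f b / Z, so depth follows as Z = f b / d. The sketch below uses illustrative focal-length and baseline values; the general two-view geometry treated later is what justifies this rectified special case.

```python
# Depth from disparity in a rectified stereo pair (a minimal sketch).
f = 500.0   # focal length in pixels (assumed value)
b = 0.1     # baseline between the two cameras in metres (assumed value)

def disparity(Z):
    """Horizontal pixel shift of a point at depth Z between the two views."""
    return f * b / Z

def depth(d):
    """Invert the disparity relation to recover depth."""
    return f * b / d

Z = 2.0
d = disparity(Z)
print(d, depth(d))  # a point at 2 m gives 25 px disparity, and back again
```

Note the inverse relation: nearby points have large disparities and are measured accurately, while far-away points have disparities near zero, which is why stereo depth estimates degrade with distance.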