The Magic Behind the Click - Image Capture Fundamentals
In our increasingly digital world, the ability to capture and recreate reality is paramount. From virtual reality experiences to robotic navigation, understanding how we perceive and model our 3D environment is key. This post dives into two fundamental concepts: how we capture 2D images, and how we can elevate them into rich 3D models using depth information.
Every photo you take, every frame of video, starts with light. But how does that light become a digital image?
- Light & Lens: It all begins with light reflecting off objects in a scene. A lens (or a system of lenses) in a camera collects this light and focuses it onto a sensor. The crucial role of the lens is to form a sharp, inverted image on the sensor plane. Think of the pinhole camera model as the simplest analogy: a tiny hole projects an inverted image onto a surface behind it. Real lenses are far more complex, correcting for aberrations and allowing control over focus and depth of field.
- The Sensor (CCD/CMOS): At the heart of a digital camera lies an image sensor, typically a CCD (Charge-Coupled Device) or CMOS (Complementary Metal-Oxide-Semiconductor) chip. This sensor is a grid of millions of tiny light-sensitive elements called photodiodes or pixels. When photons (light particles) strike a photodiode, they generate an electrical charge. The brighter the light, the greater the charge accumulated.
- Pixels & Resolution: Each photodiode corresponds to one pixel in the final image. The number of pixels determines the image's resolution. More pixels generally mean more detail can be captured.
- Capturing Color - The Bayer Filter: Most sensors are inherently monochrome; they only measure light intensity. To capture color, a Color Filter Array (CFA), most commonly a Bayer filter, is placed over the sensor. This filter assigns a color (Red, Green, or Blue) to each pixel. A common Bayer pattern has 50% green, 25% red, and 25% blue filters, mimicking the human eye's higher sensitivity to green.
- Demosaicing & Image Processing: The raw sensor data is a mosaic of R, G, and B intensity values. An algorithm called demosaicing (or debayering) interpolates the missing color values at each pixel to produce a full-color image (a toy sketch of this interpolation appears below). Further Image Signal Processing (ISP) steps like white balance, noise reduction, and sharpening are then applied to produce the final JPEG, while a RAW file stores the sensor data before these steps.
Essentially, image capture is a process of perspective projection: mapping a 3D world onto a 2D plane, driven by optics and semiconductor physics.
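To make the demosaicing step above concrete, here is a minimal sketch of bilinear demosaicing for an RGGB Bayer mosaic. It is an illustrative toy under simplifying assumptions (an ideal RGGB layout, intensities normalised to [0, 1]); a real camera ISP uses far more sophisticated, edge-aware interpolation.

```python
import numpy as np
from scipy.ndimage import convolve

def demosaic_bilinear(raw):
    """Bilinear demosaicing of an RGGB Bayer mosaic (H x W, values in [0, 1])
    into an H x W x 3 RGB image."""
    h, w = raw.shape

    # Boolean masks marking which sensor sites measured each colour (RGGB layout).
    r_mask = np.zeros((h, w), dtype=bool)
    b_mask = np.zeros((h, w), dtype=bool)
    r_mask[0::2, 0::2] = True      # red on even rows, even columns
    b_mask[1::2, 1::2] = True      # blue on odd rows, odd columns
    g_mask = ~(r_mask | b_mask)    # green everywhere else (50% of sites)

    # Bilinear interpolation kernels: green has two samples per 2x2 block,
    # red and blue have one, hence the different neighbour weightings.
    k_g = np.array([[0, 1, 0], [1, 4, 1], [0, 1, 0]], dtype=float) / 4.0
    k_rb = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]], dtype=float) / 4.0

    rgb = np.zeros((h, w, 3))
    for ch, (mask, kernel) in enumerate([(r_mask, k_rb), (g_mask, k_g), (b_mask, k_rb)]):
        sparse = np.where(mask, raw, 0.0)          # keep only this channel's samples
        rgb[..., ch] = convolve(sparse, kernel, mode="mirror")
    return np.clip(rgb, 0.0, 1.0)
```

Each colour plane keeps only the samples its Bayer sites actually measured and fills in the rest by averaging neighbours; the ISP steps described above (white balance, noise reduction, sharpening) would then run on the result.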
Pinhole Camera Modeling
Fig 1: Calculations connecting the image plane to the real-world 3D scene in the simple pinhole camera setup. The following assumptions are made for the setup shown above: (i) the camera centre (i.e., optical centre) is at the origin (0,0,0) of the Euclidean coordinate system; (ii) the principal axis is the line perpendicular to the image plane that passes through the optical centre; (iii) the principal axis is the Z-axis; (iv) the image plane (i.e., focal plane) sits at Z = -f (the sign does not matter in our calculations); (v) the principal plane is parallel to the image plane and passes through the centre of projection.
A pinhole camera is the simplest form of camera, essentially a light-proof box with a tiny hole (the "pinhole" or aperture) on one side and a light-sensitive surface (like film or a digital sensor) on the opposite side. The pinhole model serves as the bedrock upon which more sophisticated camera models are built. Lens distortion, principal point offsets, and other real-world camera characteristics can be added as extensions to the basic pinhole model. For a wide range of computer vision tasks (e.g., 3D reconstruction, object tracking, pose estimation), the pinhole camera model provides sufficient accuracy. The effects of lenses in many standard cameras can be well-approximated by the pinhole model, especially for central portions of the image.
Here is a summary of how a pinhole camera works:
- Rectilinear Propagation of Light: The fundamental principle behind a pinhole camera is that light travels in straight lines. When light rays from an object strike the pinhole, they continue in a straight line through the hole.
- Image Formation: Because the pinhole is extremely small, only a very narrow beam of light rays from each point on the object can pass through it. These rays then project onto the light-sensitive surface on the back of the box.
- Inversion: Due to the straight-line travel of light, the image formed on the screen is inverted both horizontally and vertically. Light from the top of the object will strike the bottom of the image plane, and light from the left of the object will strike the right of the image plane.
- Sharpness vs. Brightness:
- Smaller Pinhole: A smaller pinhole allows fewer light rays to pass through, resulting in a dimmer image but a sharper one. This is because the "circle of confusion" (the small blur created by the pinhole's finite size) is smaller.
- Larger Pinhole: A larger pinhole lets in more light, leading to a brighter image, but it also increases the "circle of confusion," making the image blurrier.
- Infinite Depth of Field: Unlike cameras with lenses, a pinhole camera effectively has infinite depth of field. All objects, regardless of their distance from the camera, appear equally "in focus" (or equally out of focus, depending on the pinhole size and diffraction). The blur is primarily due to the pinhole's size, not the object's distance.
Figure 1 shows the calculations for the 3D-to-2D projection at the core of image formation, using the simple pinhole camera setup (with the assumptions listed in the caption). The relationship between the 2D image coordinates and the real-world 3D coordinates can be derived using similar triangles, and the transformation is captured succinctly by writing the equations in matrix form using homogeneous coordinates.
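Writing out the similar-triangles result the figure illustrates, a world point (X, Y, Z) lands on the image plane at

$$x = \frac{fX}{Z}, \qquad y = \frac{fY}{Z},$$

so image coordinates scale with the focal length f and shrink with the depth Z.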
A brief intro to homogeneous coordinates: for a 2D point (x,y), its homogeneous representation is (x,y,w), where w is a non-zero scaling factor, usually set to 1. To convert back to Cartesian coordinates, you divide by w: (x/w, y/w). Similarly, a 3D point (X,Y,Z) becomes (X,Y,Z,W) in homogeneous coordinates.
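For example, the 2D point (2, 3) becomes (2, 3, 1); any non-zero multiple such as (4, 6, 2) represents the same point, since dividing by w = 2 gives back (2, 3).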
Homogeneous coordinates have a few advantages, such as:
- Unified Representation of Transformations:
- Translation: In standard Cartesian coordinates, translation is an addition: x′ = x+tx, y′ = y+ty. This cannot be represented as a 2×2 matrix multiplication.
- Rotation and Scaling: These can be represented as matrix multiplications.
- Homogeneous Coordinates to the Rescue: By adding an extra dimension (the w component), all affine transformations (translation, rotation, scaling, shear) can be expressed as single matrix multiplications.
- Concatenation of Transformations:
- Since all transformations become matrix multiplications, a sequence of transformations (e.g., rotate, then translate, then scale) can be combined into a single composite transformation matrix by simply multiplying the individual transformation matrices.
- This is incredibly efficient. Instead of applying each transformation individually to every point, we pre-multiply the matrices once to get a final transformation matrix, and then apply that single matrix multiplication to all points.
This consistency is crucial for building complex transformation pipelines.
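As a small illustration of both points, a 2D translation by (tx, ty) becomes a single 3×3 matrix in homogeneous coordinates:

$$\begin{pmatrix} x' \\ y' \\ 1 \end{pmatrix} = \begin{pmatrix} 1 & 0 & t_x \\ 0 & 1 & t_y \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = \begin{pmatrix} x + t_x \\ y + t_y \\ 1 \end{pmatrix}$$

and a pipeline such as "rotate, then translate, then scale" collapses into one matrix M = S·T·R, applied as p′ = Mp (the transform written closest to the point acts first).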
The mapping of a 3D world point (X, Y, Z) to a 2D image point (x, y), in homogeneous coordinates, can be written as follows:
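$$\begin{pmatrix} x \\ y \\ 1 \end{pmatrix} \sim \begin{pmatrix} fX \\ fY \\ Z \end{pmatrix} = \underbrace{\begin{pmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{pmatrix}}_{K} \underbrace{\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}}_{[\,I \,|\, 0\,]} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}$$

Here "∼" denotes equality up to a non-zero scale factor; dividing by the third component recovers (x, y) = (fX/Z, fY/Z).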
This can be viewed as a “projection” operation (K[I | 0]) applied to the 3D world coordinates. The projection in this simple formulation has only one degree of freedom, namely the focal length “f”, which determines the “K” matrix.
For reasons that will become apparent in the next section, the projection matrix “P” is often viewed as a product of two matrices. “K” is called the “intrinsic matrix” (because it depends on the internal parameters of the camera; in the simple case, just the focal length “f”). The matrix “[I | 0]” is referred to as the “extrinsic matrix” (because it depends on how the camera is placed in the world; in this simple case, the pinhole is at the origin, the principal axis aligns with the Z-axis, and the principal point is at the origin of the image plane).
When we relax these assumptions, the intrinsic and extrinsic matrices will change.
Relaxing Assumptions in Simple Pinhole Camera Model
Arbitrary origin at image plane
Let’s relax the assumption that the principal point is at the origin of the image plane, and instead consider the general case where the principal point is at (px, py) in the image plane with respect to some arbitrary origin.
We know that the 3D world point (X, Y, Z) maps to the 2D image coordinates (fX/Z, fY/Z) with respect to the principal point in the image plane. Therefore, with respect to the new image-plane origin, the 2D coordinates of the same point will be (fX/Z + px, fY/Z + py). This can be written as the following equation in homogeneous coordinates:
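$$\begin{pmatrix} x \\ y \\ 1 \end{pmatrix} \sim \begin{pmatrix} fX + Z\,p_x \\ fY + Z\,p_y \\ Z \end{pmatrix} = \underbrace{\begin{pmatrix} f & 0 & p_x \\ 0 & f & p_y \\ 0 & 0 & 1 \end{pmatrix}}_{K} \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}$$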
Note that the intrinsic matrix “K” has been updated to have more degrees of freedom (three in total now: f, px, py). This is also intuitive: a change in the image-plane coordinate system is internal to the camera, so only the “intrinsic matrix” is affected and the rest of the transformation remains the same.
Camera translation and rotation
So far in the discussion, we have assumed that the camera optical centre is at the origin (0,0,0) and that the principal axis aligns with the Z-axis. This may not hold in general: the coordinate system may be defined with respect to the “world”, with the “camera” placed at an arbitrary location in it.
In the above figure, we assume that the camera is placed at location “C” with respect to the origin of the world, and that the camera has a rotation “R” with respect to the world axes (here, “R” represents a sequence of rotations around the world X-axis, Y-axis, and Z-axis). Then, any point “Q” (say (X, Y, Z) in the world coordinate system) has the coordinates “R⁻¹(Q-C)” with respect to the camera. Key points to note here are:
- Each individual rotation matrix about an axis is orthonormal, and therefore the composed rotation matrix “R” is also orthonormal.
- To get the point “Q” with respect to the camera centre “C”, we have to “invert or reverse” the rotations around each of the axes, which is why we have “R⁻¹(Q-C)”. Since “R” is orthonormal, “R⁻¹” is simply the transpose of “R”.
- Note that the camera rotation matrix refers to “R⁻¹”, i.e., the rotation with respect to the camera coordinate axes (and not with respect to the world coordinate axes).
Once we know the point with respect to the camera coordinate axes, we can apply the projection matrix to its homogeneous coordinates:
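$$\begin{pmatrix} x \\ y \\ 1 \end{pmatrix} \sim K\,R^{-1}\,[\,I \,|\, -C\,] \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}, \qquad \text{i.e.,} \quad P = K\,R^{-1}\,[\,I \,|\, -C\,]$$

since [I | -C] applied to the homogeneous point gives Q-C, and R⁻¹ then rotates it into the camera frame. This is often written as P = K[R′ | t] with R′ = R⁻¹ and t = -R⁻¹C.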
The above equation gives us the projection transform for a camera placed at an arbitrary location (not necessarily at the origin) with respect to the world coordinate axes.
What I find elegant about this methodology is how each assumption we relax changes either the intrinsic matrix (“K”, which gains more degrees of freedom: f → f, px, py) or the extrinsic matrix (“[I | 0]”, which gains more degrees of freedom as the camera moves in the world and “R” and “C” are added). This is a common theme: as we model more and more complexity, these matrices change further.
Another example of these matrices changing as we model more complexity is when the image sensor has different pixel densities along the X and Y axes (say dx and dy). In such a case, the measured image coordinate along the X-axis scales with the pixel density along X, and similarly for the Y-axis. This leads to a “K” matrix of the form:
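$$K = \begin{pmatrix} f\,d_x & 0 & p_x \\ 0 & f\,d_y & p_y \\ 0 & 0 & 1 \end{pmatrix}$$

(taking dx and dy as the number of pixels per unit length along each axis, so that fx = f·dx and fy = f·dy are the focal lengths expressed in pixels, and px, py are likewise measured in pixels).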
In the above equation, notice that the effective focal length along both the X and Y axes changes because of the different pixel densities on each axis. Also note that there could be a version of the intrinsic matrix where image formation along one axis is influenced by the focal length of the other axis; this is referred to as “skew” and is controlled by “sx, sy” (it may happen when the lens is not perfectly aligned with the image plane).
Outro: Back-projection and 3D reconstruction
While the projection matrix has some really cool properties (which can be read about in the references), an immediate consequence is this: if we can infer the depth Z (up to a scaling factor), along with the image coordinates (x, y) in the 3D-to-2D projection equation, then we can solve the resulting system of linear equations for the 3D world coordinates X and Y (up to the same scaling factor). Since we also have Z, we essentially have the 3D point cloud of the scene, derived solely from the image and depth. This is great because the literature on monocular depth estimation is ever-growing, and we now have some excellent models like DepthAnythingV2.
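As a minimal sketch of this back-projection, assuming a pinhole intrinsic matrix with focal lengths fx, fy and principal point (px, py), and a depth map aligned pixel-for-pixel with the image (the function and argument names here are illustrative, not from any particular library):

```python
import numpy as np

def backproject_depth(depth, fx, fy, px, py):
    """Back-project a depth map (H x W) into an (H*W) x 3 point cloud in the
    camera frame, by inverting x = fx*X/Z + px and y = fy*Y/Z + py."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates (u right, v down)
    Z = depth
    X = (u - px) * Z / fx
    Y = (v - py) * Z / fy
    return np.stack([X, Y, Z], axis=-1).reshape(-1, 3)
```

If the depth comes from a monocular model such as DepthAnythingV2, Z is typically only known up to scale, so the recovered point cloud is determined up to that same scale, exactly as noted above.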
We will use this idea next to reconstruct some 3D scenes (like the one shown below; these are compressed and low resolution, while the actual videos are higher resolution)!