Single Camera Model
We assume the camera adheres to the “pin-hole” model, where points in space project along straight lines through the camera aperture (the origin of the “scene” coordinate system) and intersect an image plane at “image points”. This image plane represents the idealized physical location of the image sensor, and contains a 2D projection of the 3D scene points.
The diagram below describes the model:
Where:
- image point: $(x_i, y_i, f)$, which lies on the image plane
- scene point: $(x_s, y_s, z_s)$
We can compute image points from scene points via:

$$x_i = f\frac{x_s}{z_s}, \quad y_i = f\frac{y_s}{z_s}$$

If you delay the division by $z_s$, the equations become linear:

$$\begin{bmatrix} x_i z_s \\ y_i z_s \\ z_s \end{bmatrix} = \begin{bmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_s \\ y_s \\ z_s \end{bmatrix}$$

$x_i$ and $y_i$ can be recovered by dividing through by $z_s$:

$$x_i = \frac{x_i z_s}{z_s}, \quad y_i = \frac{y_i z_s}{z_s}$$
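As a minimal numpy sketch of the two forms above (the focal length and point are made-up values, not from any real camera):

```python
import numpy as np

f = 8.0                                  # focal length (arbitrary units)
p_s = np.array([0.1, -0.2, 1.5])         # scene point (x_s, y_s, z_s)

# Linear (homogeneous) form: delay the division by z_s
K_f = np.diag([f, f, 1.0])
p_h = K_f @ p_s                          # [x_i*z_s, y_i*z_s, z_s]

# Recover the image point by dividing through by z_s
x_i, y_i = p_h[0] / p_h[2], p_h[1] / p_h[2]
```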
Transformation from Image Points to Pixel Coordinates
We assume there is a simple offset and scaling relationship between image points $(x_i, y_i)$ and pixel coordinates $(x_p, y_p)$.
The diagram below describes this transformation:
Where:
- image center: $(x_o, y_o)$
- scaling factors: $s_x$, $s_y$
The image center accounts for the fact that the image sensor might be shifted from the optical center, and the scaling factors account for the physical size of the pixels.
This leads to:

$$x_p = s_x x_i + x_o, \quad y_p = s_y y_i + y_o$$

Again, delaying the division by $z_s$ results in a linear equation:

$$\begin{bmatrix} x_p z_s \\ y_p z_s \\ z_s \end{bmatrix} = A \begin{bmatrix} x_s \\ y_s \\ z_s \end{bmatrix}$$

where:

$$A = \begin{bmatrix} \alpha_x & 0 & x_o \\ 0 & \alpha_y & y_o \\ 0 & 0 & 1 \end{bmatrix}, \quad \alpha_x = f s_x, \quad \alpha_y = f s_y$$
In an ideal world, the model would be complete and all we would need to do to characterize the camera “intrinsics” would be to find the $A$ matrix. But, in the real world, there are distortions due to the camera sensor and lens. These include radial distortion introduced by lens manufacturing defects, the camera sensor not being perfectly orthogonal to the lens axis, etc., and are discussed in the next section.
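A short sketch of mapping a scene point directly to pixel coordinates with the $A$ matrix (the intrinsic values here are made up for illustration):

```python
import numpy as np

# alpha_x = f*s_x, alpha_y = f*s_y; (x_o, y_o) is the image center
alpha_x, alpha_y, x_o, y_o = 1000.0, 1000.0, 320.0, 240.0
A = np.array([[alpha_x, 0.0, x_o],
              [0.0, alpha_y, y_o],
              [0.0, 0.0, 1.0]])

p_s = np.array([0.1, -0.2, 1.5])   # scene point
p_h = A @ p_s                      # [x_p*z_s, y_p*z_s, z_s]
x_p, y_p = p_h[:2] / p_h[2]        # divide by z_s to get pixel coordinates
```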
Lens/Camera Distortion Model
Two popular lens/camera distortion models are heikkila97 and wang08. In my opinion, wang08 makes more sense intuitively than heikkila97, although the overall performance of the two is about the same. Since heikkila97 is used in many toolboxes (e.g. Bouguet’s toolbox), I’ll discuss it here for the sake of example.
First, we must make a distinction between undistorted image points:

$$(x_i, y_i)$$

and distorted image points:

$$(\bar{x}_i, \bar{y}_i)$$

We assume there are radial and tangential lens distortions:

$$\bar{x}_i = x_i(1 + k_1 r^2 + k_2 r^4) + 2 p_1 x_i y_i + p_2(r^2 + 2 x_i^2)$$

$$\bar{y}_i = y_i(1 + k_1 r^2 + k_2 r^4) + p_1(r^2 + 2 y_i^2) + 2 p_2 x_i y_i$$

Where:

$$r^2 = x_i^2 + y_i^2$$

$k_1$ and $k_2$ are the radial distortion coefficients, and $p_1$ and $p_2$ are the tangential distortion coefficients.

All together:

$$\bar{x}_p = s_x \bar{x}_i + x_o, \quad \bar{y}_p = s_y \bar{y}_i + y_o$$
The following diagram describes this model:
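A sketch of the heikkila97-style radial plus tangential distortion, following the equations above (the function name is mine, not from any toolbox):

```python
import numpy as np

def distort(x, y, k1, k2, p1, p2):
    # radial term: (1 + k1*r^2 + k2*r^4)
    r2 = x**2 + y**2
    radial = 1.0 + k1 * r2 + k2 * r2**2
    # tangential terms follow the equations above
    x_d = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x**2)
    y_d = y * radial + p1 * (r2 + 2.0 * y**2) + 2.0 * p2 * x * y
    return x_d, y_d
```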
We now have most of the camera model addressed. For camera calibration, however, a calibration pattern with known geometry is used. This geometry is described in coordinates relative to the calibration board, known as “world” coordinates, and is discussed in the next section.
World Coordinates to Pixel Coordinates
First, the transformation from world to scene coordinates is a simple rigid body translation and rotation, shown below:

$$\begin{bmatrix} x_s \\ y_s \\ z_s \end{bmatrix} = R \begin{bmatrix} x_w \\ y_w \\ z_w \end{bmatrix} + t$$

Where:

- $R$ = rotation matrix, $t$ = translation vector
- world point: $(x_w, y_w, z_w)$

This leads to:

$$\begin{bmatrix} x_s \\ y_s \\ z_s \end{bmatrix} = \begin{bmatrix} R & t \end{bmatrix} \begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix}$$

If we left-multiply both sides by $A$:

$$\begin{bmatrix} x_p z_s \\ y_p z_s \\ z_s \end{bmatrix} = A \begin{bmatrix} R & t \end{bmatrix} \begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix}$$

This is a linear equation which transforms world coordinates to pixel coordinates (recall: $x_p = \frac{x_p z_s}{z_s}$ and $y_p = \frac{y_p z_s}{z_s}$).
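A sketch of the full chain from world coordinates to (distorted) pixel coordinates, reusing the `distort` sketch from the previous section ($R$, $t$, $A$, and the distortion coefficients are assumed given):

```python
import numpy as np

def world_to_pixel(p_w, R, t, A, k1=0.0, k2=0.0, p1=0.0, p2=0.0):
    x_s, y_s, z_s = R @ p_w + t               # world -> scene (rigid body)
    x, y = x_s / z_s, y_s / z_s               # perspective division
    x_d, y_d = distort(x, y, k1, k2, p1, p2)  # lens distortion (see above)
    p_h = A @ np.array([x_d, y_d, 1.0])       # apply intrinsics
    return p_h[:2]                            # (x_p, y_p)
```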
Now, we have our entire model. In order to characterize the camera “intrinsics”, we need to compute $A$ and the distortion coefficients. The next section describes a numerical implementation to do this.
Single Camera Calibration Numerical Implementation
We now have a full camera model to work with. Some things:
- Camera calibration involves using a calibration target with known world coordinates ($x_w$, $y_w$, and $z_w$). We take pictures of this calibration target, find these known coordinates in pixel coordinates, and then find the “intrinsic” and “extrinsic” parameters, through linear and non-linear optimization, which make:

$$\left( \bar{x}_p^{model}, \bar{y}_p^{model} \right) \approx \left( \bar{x}_p^{measured}, \bar{y}_p^{measured} \right)$$

i.e. we’re trying to make the model fit our measurements.
- We re-parameterized things to be linear in the previous section so that linear optimization can be used. This is possible up until lens/camera distortion is involved. So, an initial guess can be made (usually assuming zero lens distortion, a reasonable assumption for most applications) with linear optimization, and then it is followed up with non-linear refinement with the full camera model.
- For non-linear optimization, the rotation matrix needs to be re-parameterized so that it has 3 parameters. This is because the rotation matrix has 9 elements, but only 3 degrees of freedom. I use Euler angles here, but Rodrigues’ formula is also popular.
- If the calibration target is flat, then $z_w = 0$ and:

$$\begin{bmatrix} x_p z_s \\ y_p z_s \\ z_s \end{bmatrix} = A \begin{bmatrix} r_1 & r_2 & t \end{bmatrix} \begin{bmatrix} x_w \\ y_w \\ 1 \end{bmatrix} = H \begin{bmatrix} x_w \\ y_w \\ 1 \end{bmatrix}$$

where $r_1$ and $r_2$ are the first two columns of $R$. $H$ is known as a homography and acts as a projection operator for homogeneous coordinates. Often, it is found up to a scale factor.
For this article, I’m assuming a 2D calibration board is used for calibration. The next sections explain how calibration (i.e. estimation of “intrinsic” and “extrinsic” parameters) is done for this case.
Computing Homographies
If we have two sets of points, for example a set of points on our calibration board in “world coordinates” and the corresponding set of points in pixel coordinates (i.e. we have detected and localized these points on the calibration board image), then we can approximate the transformation between these two sets of points using a homography. The assumption here is we’re using the pin-hole camera model and there is very little lens distortion.
Let:

$$\left\{ \left( x_w^{(j)}, y_w^{(j)} \right) \right\}_{j=1}^{N} \quad \text{and} \quad \left\{ \left( x_p^{(j)}, y_p^{(j)} \right) \right\}_{j=1}^{N}$$

be the world and pixel point sets, respectively. Then:

$$\begin{bmatrix} x_p^{(j)} z_s^{(j)} \\ y_p^{(j)} z_s^{(j)} \\ z_s^{(j)} \end{bmatrix} = H \begin{bmatrix} x_w^{(j)} \\ y_w^{(j)} \\ 1 \end{bmatrix}$$

Let:

$$H = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix}$$

Then:

$$x_p^{(j)} = \frac{h_{11} x_w^{(j)} + h_{12} y_w^{(j)} + h_{13}}{h_{31} x_w^{(j)} + h_{32} y_w^{(j)} + h_{33}}, \quad y_p^{(j)} = \frac{h_{21} x_w^{(j)} + h_{22} y_w^{(j)} + h_{23}}{h_{31} x_w^{(j)} + h_{32} y_w^{(j)} + h_{33}}$$

We form homogeneous equations by cross-multiplying:

$$x_w^{(j)} h_{11} + y_w^{(j)} h_{12} + h_{13} - x_p^{(j)} x_w^{(j)} h_{31} - x_p^{(j)} y_w^{(j)} h_{32} - x_p^{(j)} h_{33} = 0$$

$$x_w^{(j)} h_{21} + y_w^{(j)} h_{22} + h_{23} - y_p^{(j)} x_w^{(j)} h_{31} - y_p^{(j)} y_w^{(j)} h_{32} - y_p^{(j)} h_{33} = 0$$

Then, we put them in matrix form, where each point contributes two rows to $L$:

$$L h = 0, \quad h = \begin{bmatrix} h_{11} & h_{12} & h_{13} & h_{21} & h_{22} & h_{23} & h_{31} & h_{32} & h_{33} \end{bmatrix}^T$$
To solve:

$$\min_h \|L h\|^2 \quad \text{subject to} \quad \|h\| = 1$$

The solution is the eigenvector of $L^T L$ associated with the smallest eigenvalue. To compute it, use the SVD:

$$L = U S V^T$$

Assuming the singular values are sorted in descending order, the last column of $V$ will contain the least eigenvector.

Note that since this is a homogeneous equation, once we solve for $h$, what we will actually find is $\lambda h$, i.e. a scaled version of $h$. It’s important to keep this in mind.
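A sketch of the linear (unnormalized) fit: build $L$, take the SVD, and reshape the least eigenvector into $H$ (`pts_w` and `pts_p` are assumed to be `(N, 2)` arrays):

```python
import numpy as np

def homography_dlt(pts_w, pts_p):
    L = []
    for (xw, yw), (xp, yp) in zip(pts_w, pts_p):
        # two homogeneous equations per point correspondence
        L.append([xw, yw, 1, 0, 0, 0, -xp * xw, -xp * yw, -xp])
        L.append([0, 0, 0, xw, yw, 1, -yp * xw, -yp * yw, -yp])
    _, _, Vt = np.linalg.svd(np.asarray(L, dtype=float))
    return Vt[-1].reshape(3, 3)   # least (right) singular vector, up to scale
```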
Also, there is a normalized version of the above which is more stable and accounts for the fact that $(x_w, y_w)$ and $(x_p, y_p)$ may have very different scales, which can make $L$ ill-conditioned.
The normalization step does two things:
- translates both sets of points so their centroids are at the origin
- both sets of points are scaled so the average distance from the origin is equal to $\sqrt{2}$, which I believe is done to ensure the x and y components are close to 1.
A “normalization matrix” is used:

$$T_p = \begin{bmatrix} s_p & 0 & -s_p \mu_{x_p} \\ 0 & s_p & -s_p \mu_{y_p} \\ 0 & 0 & 1 \end{bmatrix}$$

where:

$\mu_{x_p}$ = mean of the $x_p$’s
$\mu_{y_p}$ = mean of the $y_p$’s
$s_p$ = $\sqrt{2}$ divided by the average distance of the points from their centroid

The same matrix is also constructed for the set of points in world coordinates, which we call $T_w$.
Note:

- Original homography (with $p_w = \begin{bmatrix} x_w & y_w & 1 \end{bmatrix}^T$ and $p_p = \begin{bmatrix} x_p z_s & y_p z_s & z_s \end{bmatrix}^T$):

$$p_p \sim H p_w$$

- Preconditioned homography:

$$T_p p_p \sim \tilde{H} T_w p_w$$

where:

$$\tilde{H} = T_p H T_w^{-1}$$

If we do the math to see the updated equations, with $\tilde{p}_p = T_p p_p$ and $\tilde{p}_w = T_w p_w$:

$$\tilde{p}_p \sim \tilde{H} \tilde{p}_w$$

Then:

$$\tilde{x}_p = \frac{\tilde{h}_{11} \tilde{x}_w + \tilde{h}_{12} \tilde{y}_w + \tilde{h}_{13}}{\tilde{h}_{31} \tilde{x}_w + \tilde{h}_{32} \tilde{y}_w + \tilde{h}_{33}}, \quad \tilde{y}_p = \frac{\tilde{h}_{21} \tilde{x}_w + \tilde{h}_{22} \tilde{y}_w + \tilde{h}_{23}}{\tilde{h}_{31} \tilde{x}_w + \tilde{h}_{32} \tilde{y}_w + \tilde{h}_{33}}$$

We form homogeneous equations and put them in matrix form exactly as before, using the normalized points:

$$\tilde{L} \tilde{h} = 0$$

We can now solve for $\tilde{h}$ using the SVD. But, we must recover the original homography after we have computed the preconditioned homography; to recover it, note that:

$$\tilde{H} = T_p H T_w^{-1}$$

Which implies:

$$H = T_p^{-1} \tilde{H} T_w$$
So, to use the preconditioned approach:

- Use $\tilde{L}$ and the SVD method to solve for $\tilde{H}$
- Recover $H$ with $H = T_p^{-1} \tilde{H} T_w$
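A sketch of the preconditioned version, reusing `homography_dlt` from the previous sketch:

```python
import numpy as np

def norm_matrix(pts):
    # translate centroid to origin, scale mean distance to sqrt(2)
    c = pts.mean(axis=0)
    s = np.sqrt(2.0) / np.mean(np.linalg.norm(pts - c, axis=1))
    return np.array([[s, 0, -s * c[0]],
                     [0, s, -s * c[1]],
                     [0, 0, 1]])

def homography_dlt_normalized(pts_w, pts_p):
    Tw, Tp = norm_matrix(pts_w), norm_matrix(pts_p)
    def apply(T, pts):
        ph = (T @ np.column_stack([pts, np.ones(len(pts))]).T).T
        return ph[:, :2] / ph[:, 2:]
    H_tilde = homography_dlt(apply(Tw, pts_w), apply(Tp, pts_p))
    return np.linalg.inv(Tp) @ H_tilde @ Tw   # recover the original H
```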
So now we have a linear method for computing homographies. Another refinement you can do is a nonlinear optimization of the homography using the linear method as the initial guess. This is more or less optional but included for completeness.
The first step is to divide the initial linear homography guess by $h_{33}$ so that $h_{33} = 1$. Homographies have 8 degrees of freedom and this is a typical constraint added to parameterize the homography for nonlinear optimization.
Next, we derive the Jacobian so we can perform Gauss-Newton iterations (note the initial guess is typically very good and the optimization is well behaved, so in my experience more robust methods like Levenberg-Marquardt aren’t necessary).
We start with:

$$x_{p,i} = \frac{h_{11} x_w + h_{12} y_w + h_{13}}{h_{31} x_w + h_{32} y_w + 1}, \quad y_{p,i} = \frac{h_{21} x_w + h_{22} y_w + h_{23}}{h_{31} x_w + h_{32} y_w + 1}$$

Let:

$$h = \begin{bmatrix} h_{11} & h_{12} & h_{13} & h_{21} & h_{22} & h_{23} & h_{31} & h_{32} \end{bmatrix}^T$$

Set the cost function to:

$$C(h) = \sum_{j=1}^{N} \left[ \left( x_{p,m}^{(j)} - x_{p,i}^{(j)} \right)^2 + \left( y_{p,m}^{(j)} - y_{p,i}^{(j)} \right)^2 \right]$$

where i = ideal, m = measured, and p = pixel.
To compute the Jacobian, we start off with the residuals:

$$r_x^{(j)} = x_{p,m}^{(j)} - x_{p,i}^{(j)}, \quad r_y^{(j)} = y_{p,m}^{(j)} - y_{p,i}^{(j)}$$

Take the partial derivative (with $w = h_{31} x_w + h_{32} y_w + 1$):

$$\frac{\partial x_{p,i}}{\partial h} = \begin{bmatrix} \frac{x_w}{w} & \frac{y_w}{w} & \frac{1}{w} & 0 & 0 & 0 & -\frac{x_{p,i} x_w}{w} & -\frac{x_{p,i} y_w}{w} \end{bmatrix}$$

$$\frac{\partial y_{p,i}}{\partial h} = \begin{bmatrix} 0 & 0 & 0 & \frac{x_w}{w} & \frac{y_w}{w} & \frac{1}{w} & -\frac{y_{p,i} x_w}{w} & -\frac{y_{p,i} y_w}{w} \end{bmatrix}$$

The Jacobian has the form:

$$J = \begin{bmatrix} \frac{\partial x_{p,i}^{(1)}}{\partial h} \\ \frac{\partial y_{p,i}^{(1)}}{\partial h} \\ \vdots \\ \frac{\partial x_{p,i}^{(N)}}{\partial h} \\ \frac{\partial y_{p,i}^{(N)}}{\partial h} \end{bmatrix}$$

And you can perform Gauss-Newton updates using the pseudo inverse (although you really should use a direct linear solver):

$$h \leftarrow h + (J^T J)^{-1} J^T r$$

Until some convergence condition is met.
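A sketch of the Gauss-Newton refinement with $h_{33}$ fixed to 1 (a fixed iteration count stands in for a proper convergence check):

```python
import numpy as np

def refine_homography(H, pts_w, pts_p, iters=10):
    h = (H / H[2, 2]).ravel()[:8]          # 8 free parameters, h33 = 1
    for _ in range(iters):
        J, r = [], []
        for (xw, yw), (xp, yp) in zip(pts_w, pts_p):
            w = h[6] * xw + h[7] * yw + 1.0
            u = (h[0] * xw + h[1] * yw + h[2]) / w   # ideal x_p
            v = (h[3] * xw + h[4] * yw + h[5]) / w   # ideal y_p
            J.append([xw / w, yw / w, 1 / w, 0, 0, 0, -u * xw / w, -u * yw / w])
            J.append([0, 0, 0, xw / w, yw / w, 1 / w, -v * xw / w, -v * yw / w])
            r.extend([xp - u, yp - v])               # measured - ideal
        # Gauss-Newton step via a least squares solve (not the pseudo inverse)
        h += np.linalg.lstsq(np.asarray(J), np.asarray(r), rcond=None)[0]
    return np.append(h, 1.0).reshape(3, 3)
```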
Now, we have a method to determine a fully optimized homography given two sets of points.
Initial Guess of Intrinsic Parameters
After homographies for each image of a calibration board are computed, we need to compute an initial guess for the intrinsic parameters.
We can only find an initial linear guess for the parameters in $A$: $\alpha_x$, $\alpha_y$, $x_o$, and $y_o$. We assume that the lens distortion parameters are zero, which is reasonable for most camera/lens combinations.
Furthermore, we assume the following:

$$x_o = \frac{\text{image width} + 1}{2}, \quad y_o = \frac{\text{image height} + 1}{2}, \quad \alpha_x = \alpha_y = \alpha$$

We assume the first two things because the principal point estimation is usually difficult and unstable, so we assume it’s at the middle of the image (this also assumes the first pixel has a position of (1,1); i.e. one-based indexing). It’s also a good assumption for most cameras that $s_x = s_y$ (assuming the pixels are square) and hence $\alpha_x = \alpha_y = \alpha$. These constraints will allow us to get a better initial guess for $\alpha$.
Let:

$$H = \lambda A \begin{bmatrix} r_1 & r_2 & t \end{bmatrix}$$

(recall the homography is found up to a scale factor $\lambda$). Now, let:

$$h_1' = A^{-1} h_1, \quad h_2' = A^{-1} h_2$$

where $h_1$ and $h_2$ are the first two columns of $H$. Then:

$$h_1' = \lambda r_1, \quad h_2' = \lambda r_2$$

Then, since $r_1$ and $r_2$ are orthogonal:

$$h_1'^T h_2' = 0$$

With the assumed form of $A$, we have $h_j' = \begin{bmatrix} \frac{h_{1j} - x_o h_{3j}}{\alpha} & \frac{h_{2j} - y_o h_{3j}}{\alpha} & h_{3j} \end{bmatrix}^T$. Letting $\hat{h}_j = \begin{bmatrix} h_{1j} - x_o h_{3j} \\ h_{2j} - y_o h_{3j} \end{bmatrix}$ and $b = \frac{1}{\alpha^2}$, we arrive at the following constraint:

$$\left( \hat{h}_1^T \hat{h}_2 \right) b = -h_{31} h_{32}$$

If you repeat the same steps for $\|h_1'\| = \|h_2'\|$ (since $r_1$ and $r_2$ are unit vectors), you get the additional constraint:

$$\left( \hat{h}_1^T \hat{h}_1 - \hat{h}_2^T \hat{h}_2 \right) b = -\left( h_{31}^2 - h_{32}^2 \right)$$

We put these constraints into matrix form:

$$v b = u$$

where $v$ stacks the left-hand coefficients and $u$ stacks the right-hand sides over all calibration board images. Then, you can solve for $b$ and take the inverse and square root to recover $\alpha = \frac{1}{\sqrt{b}}$.
Some things:
- You can normalize $v_1$, $u_1$, $v_2$, and $u_2$ (the coefficients coming from each of the two constraints) by dividing each by their respective norm before solving for $b$. This does not affect the orthogonality constraint and prevents issues in case one of them is very large.
- You can solve for $b$ using least squares. However, you can also solve for the $b$ that makes the projection of $vb$ onto $u$ equal to $u$:

$$b = \frac{u^T u}{v^T u}$$

This is what Bouguet does in his Matlab toolbox and from my tests it appears to work slightly better than the least squares result. I have a feeling it might be because you obtain a larger $b$ when $u$ and $v$ are not collinear, which in turn results in a smaller $\alpha$ and prevents overestimates, but beyond that I’m not really sure. A sketch of this initial guess is given below.
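A sketch of the initial $\alpha$ estimate from a list of homographies and an assumed principal point, following the two constraints derived above (the per-constraint normalization mentioned in the first note is omitted for brevity):

```python
import numpy as np

def init_alpha(Hs, x_o, y_o):
    v, u = [], []
    for H in Hs:
        h = [H[:, 0], H[:, 1]]   # first two columns of H
        # remove the principal point: hat_h_j = first two rows of alpha * A^-1 h_j
        hat = [np.array([hj[0] - x_o * hj[2], hj[1] - y_o * hj[2]]) for hj in h]
        # orthogonality constraint: b * (hat_h1 . hat_h2) = -h31 * h32
        v.append(hat[0] @ hat[1])
        u.append(-h[0][2] * h[1][2])
        # equal-norm constraint: b * (|hat_h1|^2 - |hat_h2|^2) = -(h31^2 - h32^2)
        v.append(hat[0] @ hat[0] - hat[1] @ hat[1])
        u.append(-(h[0][2]**2 - h[1][2]**2))
    v, u = np.asarray(v), np.asarray(u)
    b = (u @ u) / (v @ u)        # Bouguet-style projection solution (see note)
    return 1.0 / np.sqrt(b)      # alpha = 1 / sqrt(b)
```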
Initial Guess of Extrinsic Parameters
The next step is to get an initial guess of the extrinsic parameters per calibration board image.
Since we now have an initial guess for $A$:

$$A^{-1} H = \lambda \begin{bmatrix} r_1 & r_2 & t \end{bmatrix}$$

This results in the following:

$$A^{-1} h_1 = \lambda r_1, \quad A^{-1} h_2 = \lambda r_2, \quad A^{-1} h_3 = \lambda t$$

Since $r_1$ and $r_2$ are supposed to be unit vectors:

$$\lambda = \frac{\|A^{-1} h_1\| + \|A^{-1} h_2\|}{2}$$

We can then solve for the extrinsics:

$$r_1 = \frac{A^{-1} h_1}{\lambda}, \quad r_2 = \frac{A^{-1} h_2}{\lambda}, \quad r_3 = r_1 \times r_2, \quad t = \frac{A^{-1} h_3}{\lambda}$$
A few things:
- The parameterization of the homography is only unique because we constrained $\lambda$ to be positive. To see the effect this has, look at $H = \lambda A \begin{bmatrix} r_1 & r_2 & t \end{bmatrix}$; in particular $h_{33}$:

$$h_{33} = \lambda t_z$$

$h_{33}$ is 1 (recall we set this constraint when we computed the homography). So, if $\lambda$ is constrained to be positive, the effect this has is to ensure $t_z$ is positive. This is correct, as the calibration board is in the positive z direction.
- $\|A^{-1} h_1\|$ and $\|A^{-1} h_2\|$ should be equal, but will probably be slightly off due to some combination of model misfit, noise, etc., so we take the average of the two when computing $\lambda$.
- $r_1$ and $r_2$ should be orthogonal, but again, due to model misfit, noise, etc., will probably be slightly off. To make $R$ an orthogonal matrix, we can compute the best orthogonal approximation (see the sketch after this list):

Take the SVD of $R$:

$$R = U S V^T$$

Then:

$$\hat{R} = U V^T$$

Solves:

$$\min_{\hat{R}} \|\hat{R} - R\|_F^2 \quad \text{subject to} \quad \hat{R}^T \hat{R} = I$$

I won’t give a proof, but it should be somewhat intuitive.
- You can (optionally) do nonlinear refinement of just the extrinsics (on a per calibration board image basis) before optimizing all parameters, but this only requires modifying the global optimization problem slightly, which I’ll talk about next.
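A sketch of recovering an initial $R$ and $t$ from a homography and the intrinsic matrix, including the best-orthogonal-approximation step described above:

```python
import numpy as np

def init_extrinsics(H, A):
    A_inv = np.linalg.inv(A)
    h1, h2, h3 = A_inv @ H[:, 0], A_inv @ H[:, 1], A_inv @ H[:, 2]
    lam = (np.linalg.norm(h1) + np.linalg.norm(h2)) / 2.0  # average the norms
    r1, r2, t = h1 / lam, h2 / lam, h3 / lam
    R = np.column_stack([r1, r2, np.cross(r1, r2)])        # r3 = r1 x r2
    U, _, Vt = np.linalg.svd(R)                            # best orthogonal
    return U @ Vt, t                                       # approximation of R
```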
Nonlinear Refinement of All Parameters (Single Camera)
Note that by now we have initial guesses for all intrinsic and extrinsic parameters. We can now define a cost function and optimize it across all intrinsic and extrinsic parameters.
But, before we do that, it makes things easier to parameterize the camera model with “normalized” coordinates.
Let:

$$x_n = \frac{x_s}{z_s}, \quad y_n = \frac{y_s}{z_s}$$

Then:

$$\begin{bmatrix} x_p \\ y_p \\ 1 \end{bmatrix} = A \begin{bmatrix} x_n \\ y_n \\ 1 \end{bmatrix}$$

Let’s parameterize our camera model with normalized coordinates, including distortion:

$$\bar{x}_p = \alpha_x \left[ x_n \left( 1 + k_1 r^2 + k_2 r^4 \right) + 2 p_1 x_n y_n + p_2 \left( r^2 + 2 x_n^2 \right) \right] + x_o$$

Doing the same for $\bar{y}_p$, we obtain:

$$\bar{y}_p = \alpha_y \left[ y_n \left( 1 + k_1 r^2 + k_2 r^4 \right) + p_1 \left( r^2 + 2 y_n^2 \right) + 2 p_2 x_n y_n \right] + y_o$$

Where:

$$r^2 = x_n^2 + y_n^2$$

Let:

$a$ = vector of intrinsic parameters, $\begin{bmatrix} \alpha_x & \alpha_y & x_o & y_o & k_1 & k_2 & p_1 & p_2 \end{bmatrix}^T$
$e$ = parameterization of extrinsic parameters (expanded on soon)
Now, take the partial derivatives of $\bar{x}_p$:

$$\frac{\partial \bar{x}_p}{\partial a}, \quad \frac{\partial \bar{x}_p}{\partial e} = \frac{\partial \bar{x}_p}{\partial x_n} \frac{\partial x_n}{\partial e} + \frac{\partial \bar{x}_p}{\partial y_n} \frac{\partial y_n}{\partial e}$$

Do the same for $\bar{y}_p$:

$$\frac{\partial \bar{y}_p}{\partial a}, \quad \frac{\partial \bar{y}_p}{\partial e} = \frac{\partial \bar{y}_p}{\partial x_n} \frac{\partial x_n}{\partial e} + \frac{\partial \bar{y}_p}{\partial y_n} \frac{\partial y_n}{\partial e}$$

Note that the partial derivatives with respect to the intrinsics, and with respect to $x_n$ and $y_n$, follow directly from the model equations above.

Now we must find $\frac{\partial x_n}{\partial e}$ and $\frac{\partial y_n}{\partial e}$.
But, before we can do that, we must parameterize rotation. For that, I’ve elected to use Euler angles; I specify an X-Y-Z rotation sequence:

$$R = R_z(\theta_z) R_y(\theta_y) R_x(\theta_x)$$

Note that the inverse is:

$$\theta_y = -\sin^{-1}(r_{31}), \quad \theta_x = \operatorname{atan2}(r_{32}, r_{33}), \quad \theta_z = \operatorname{atan2}(r_{21}, r_{11})$$

We take the Jacobian:

$$\frac{\partial R}{\partial (\theta_x, \theta_y, \theta_z)}$$

Each entry of $R$ is a product of sines and cosines of the Euler angles, so this is straightforward to write out.
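A sketch of this Euler parameterization and its inverse, assuming the X-Y-Z sequence means rotating about x, then y, then z (conventions vary, so treat this ordering as an assumption):

```python
import numpy as np

def euler_to_R(tx, ty, tz):
    cx, sx = np.cos(tx), np.sin(tx)
    cy, sy = np.cos(ty), np.sin(ty)
    cz, sz = np.cos(tz), np.sin(tz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def R_to_euler(R):
    # inverse of the above (away from the gimbal-lock singularity)
    ty = -np.arcsin(R[2, 0])
    tx = np.arctan2(R[2, 1], R[2, 2])
    tz = np.arctan2(R[1, 0], R[0, 0])
    return tx, ty, tz
```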
Now, we can compute the full Jacobian by chaining:

$$\frac{\partial x_n}{\partial e} = \frac{\partial x_n}{\partial (R, t)} \frac{\partial (R, t)}{\partial e}$$

Where, from the world-to-scene transformation:

$$x_n = \frac{r_{11} x_w + r_{12} y_w + r_{13} z_w + t_x}{r_{31} x_w + r_{32} y_w + r_{33} z_w + t_z}$$

and:

$$y_n = \frac{r_{21} x_w + r_{22} y_w + r_{23} z_w + t_y}{r_{31} x_w + r_{32} y_w + r_{33} z_w + t_z}$$

so the partial derivatives with respect to the elements of $R$ and $t$ are simple quotient-rule expressions, which are then chained with $\frac{\partial R}{\partial (\theta_x, \theta_y, \theta_z)}$ from above.
We can now define our cost function for single camera calibration:

$$C = \sum_{k=1}^{M} \sum_{j=1}^{N} \left[ \left( \bar{x}_{p,m}^{(k,j)} - \bar{x}_{p,i}^{(k,j)} \right)^2 + \left( \bar{y}_{p,m}^{(k,j)} - \bar{y}_{p,i}^{(k,j)} \right)^2 \right]$$

Where:

- M = number of calibration board images
- N = number of points on the calibration board

We put all of our parameters (intrinsic + extrinsic) in a single vector, $q$:

$$q = \begin{bmatrix} a^T & e_1^T & \cdots & e_M^T \end{bmatrix}^T$$

Which results in $8 + 6M$ total parameters.
To compute the Jacobian, we start off with the residuals:

$$r^{(k,j)} = \begin{bmatrix} \bar{x}_{p,m}^{(k,j)} - \bar{x}_{p,i}^{(k,j)} \\ \bar{y}_{p,m}^{(k,j)} - \bar{y}_{p,i}^{(k,j)} \end{bmatrix}$$

Take the partial derivative with respect to $q$. The Jacobian has the form:

$$J = \begin{bmatrix} * & * & 0 & \cdots & 0 \\ * & 0 & * & \cdots & 0 \\ \vdots & \vdots & & \ddots & \vdots \\ * & 0 & 0 & \cdots & * \end{bmatrix}$$

Where the first block column corresponds to the intrinsics $a$ and the remaining block columns correspond to $e_1, \ldots, e_M$; the $*$’s represent filled blocks and $0$’s are empty blocks. This makes intuitive sense, as the intrinsic parameters will affect the residuals for all calibration board images. For extrinsics, only the extrinsic parameters for a specific calibration board image will affect its residuals.
To solve this, I’d recommend using the Levenberg-Marquardt method, since the nonlinear optimization is not well-behaved under certain circumstances (in my tests, if very few calibration boards are provided, the Gauss-Newton method can fail):

$$q \leftarrow q + \left( J^T J + \lambda I \right)^{-1} J^T r$$
As usual, it’s recommended to use a sparse linear solver instead of directly taking the inverse.
Some things:
- All initial guesses are used from the previous sections. $k_1$, $k_2$, $p_1$, and $p_2$ are all initialized to zero.
- $\lambda$ is usually initialized to 0.01 and will either grow or shrink every iteration depending on whether the mean squared error (MSE) increases or decreases. If MSE decreases, then $\lambda$ is decreased and we proceed to the next iteration. If MSE increases, then we increase $\lambda$ until MSE decreases. Increasing $\lambda$ has two effects: 1) it makes the step closer to the gradient direction and 2) it reduces the step size. Decreasing $\lambda$ makes the step closer to the Gauss-Newton step, which is an aggressive optimization. Using small steps in the direction of the gradient is incredibly robust, but slightly slow (see the sketch after this list).
- If there is only a single calibration board image, the principal point is removed from the optimization (simply remove the columns from the Jacobian corresponding to those two parameters, and then do not update them during the iterations), as it is very unstable.
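A sketch of the Levenberg-Marquardt schedule described above; `compute_residuals` and `compute_jacobian` are hypothetical placeholders for functions implementing the full camera model, and a fixed iteration count stands in for a proper convergence check:

```python
import numpy as np

def lm_optimize(q, compute_residuals, compute_jacobian, iters=100, lam=0.01):
    r = compute_residuals(q)                # residuals = measured - model
    for _ in range(iters):
        J = compute_jacobian(q)
        while True:
            # damped normal equations (use a sparse solver in practice)
            dq = np.linalg.solve(J.T @ J + lam * np.eye(len(q)), J.T @ r)
            r_new = compute_residuals(q + dq)
            if r_new @ r_new < r @ r:       # MSE decreased: accept the step,
                q, r = q + dq, r_new        # relax toward Gauss-Newton
                lam /= 10.0
                break
            lam *= 10.0                     # MSE increased: damp more, retry
    return q
```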
We now have enough information to fully calibrate and optimize a camera using a calibration board.
Stereo Vision
The first thing we need to consider is how to incorporate two cameras into the model. We do this by simply adding the transformation between the two cameras as follows:
$$\begin{bmatrix} x_{s_2} \\ y_{s_2} \\ z_{s_2} \end{bmatrix} = R_s \begin{bmatrix} x_{s_1} \\ y_{s_1} \\ z_{s_1} \end{bmatrix} + t_s$$

This results in:

$$R_2 = R_s R_1, \quad t_2 = R_s t_1 + t_s$$

We can solve for $R_s$ and $t_s$:

$$R_s = R_2 R_1^T, \quad t_s = t_2 - R_s t_1$$
In order to optimize the stereo model, we do the following:
- We get initial guesses for the intrinsics, distortion coefficients, $R$’s, and $t$’s by doing single camera optimization on each camera separately.
- Instead of optimizing $R_1$, $t_1$, $R_2$, and $t_2$ for each calibration board image, we instead optimize $R_1$, $t_1$, $R_s$, and $t_s$. This reduces the number of extrinsic parameters from 12M to 6M+6, where M is the number of calibration board image pairs.
- To get an initial guess for $R_s$ and $t_s$, we do the following (see the sketch after this list). For each calibration board image pair $k$:

$$R_s^{(k)} = R_2^{(k)} \left( R_1^{(k)} \right)^T$$

and:

$$t_s^{(k)} = t_2^{(k)} - R_s^{(k)} t_1^{(k)}$$

If we put these in matrix form (stacking over all $M$ pairs):

$$R_s \begin{bmatrix} R_1^{(1)} & \cdots & R_1^{(M)} \end{bmatrix} = \begin{bmatrix} R_2^{(1)} & \cdots & R_2^{(M)} \end{bmatrix}$$

and:

$$\begin{bmatrix} I \\ \vdots \\ I \end{bmatrix} t_s = \begin{bmatrix} t_2^{(1)} - R_s t_1^{(1)} \\ \vdots \\ t_2^{(M)} - R_s t_1^{(M)} \end{bmatrix}$$

which can be solved in a least squares sense (followed by the best orthogonal approximation for $R_s$).
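A sketch of this initial guess, averaging over board image pairs and re-orthogonalizing $R_s$ with the SVD trick from the single camera section (`R1s`, `t1s`, `R2s`, `t2s` are assumed lists of per-board poses):

```python
import numpy as np

def init_stereo(R1s, t1s, R2s, t2s):
    # R_s @ R1_k ~= R2_k for every pair, so average R2_k @ R1_k^T over pairs...
    R_s = sum(R2 @ R1.T for R1, R2 in zip(R1s, R2s)) / len(R1s)
    U, _, Vt = np.linalg.svd(R_s)   # ...then take the best orthogonal
    R_s = U @ Vt                    # approximation of the averaged matrix
    # t2_k = R_s @ t1_k + t_s, so average the implied t_s over all pairs
    t_s = sum(t2 - R_s @ t1 for t1, t2 in zip(t1s, t2s)) / len(t1s)
    return R_s, t_s
```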
We now have a model for stereo vision and a way to provide an initial guess for this model.
Some things:
- $\frac{\partial \bar{x}_{p_1}}{\partial a_1}$, $\frac{\partial \bar{y}_{p_1}}{\partial a_1}$, $\frac{\partial \bar{x}_{p_2}}{\partial a_2}$, and $\frac{\partial \bar{y}_{p_2}}{\partial a_2}$ have the same form as single camera calibration.
- $\frac{\partial \bar{x}_{p_1}}{\partial e_1}$ and $\frac{\partial \bar{y}_{p_1}}{\partial e_1}$ have the same form as single camera calibration.
- $\frac{\partial \bar{x}_{p_2}}{\partial e_1}$, $\frac{\partial \bar{y}_{p_2}}{\partial e_1}$, $\frac{\partial \bar{x}_{p_2}}{\partial e_s}$, and $\frac{\partial \bar{y}_{p_2}}{\partial e_s}$ are new quantities that we need to derive (where $e_s$ is the parameterization of $R_s$ and $t_s$).

We need to compute the Jacobians of camera 2’s residuals through the composed transformation $R_2 = R_s R_1$, $t_2 = R_s t_1 + t_s$. The only quantities we haven’t derived yet are:

$$\frac{\partial (R_2, t_2)}{\partial e_1} \quad \text{and} \quad \frac{\partial (R_2, t_2)}{\partial e_s}$$

The rest of the Jacobians have already been derived in the single camera calibration section.
Now, we’re ready to define our cost function for stereo calibration:

$$C = \sum_{c=1}^{2} \sum_{k=1}^{M} \sum_{j=1}^{N} \left[ \left( \bar{x}_{p_c,m}^{(k,j)} - \bar{x}_{p_c,i}^{(k,j)} \right)^2 + \left( \bar{y}_{p_c,m}^{(k,j)} - \bar{y}_{p_c,i}^{(k,j)} \right)^2 \right]$$

Where:

- M = number of calibration board image pairs
- N = number of points on the calibration board

We put all of our parameters (intrinsic + extrinsic) in a single vector, $q$:

$$q = \begin{bmatrix} a_1^T & a_2^T & e_s^T & e_1^T & \cdots & e_M^T \end{bmatrix}^T$$

Which results in $22 + 6M$ total parameters (8 intrinsics per camera, 6 for $e_s$, and 6 per calibration board pose).
To compute the Jacobian, we start off with the residuals, one set per camera per calibration board image pair. Take the partial derivative with respect to $q$. The Jacobian has the form:

$$J = \begin{bmatrix} * & 0 & 0 & * & \cdots & 0 \\ 0 & * & * & * & \cdots & 0 \\ \vdots & \vdots & \vdots & & \ddots & \vdots \end{bmatrix}$$

Where the block columns correspond to $a_1$, $a_2$, $e_s$, and $e_1, \ldots, e_M$: camera 1 residuals depend only on $a_1$ and the board poses, while camera 2 residuals depend on $a_2$, $e_s$, and the board poses. Like before, the $*$’s represent filled blocks and $0$’s are empty blocks.
To solve this, again, I’d recommend using the Levenberg-Marquardt method since the optimization is not well-behaved under certain circumstances:

$$q \leftarrow q + \left( J^T J + \lambda I \right)^{-1} J^T r$$

Until some convergence condition is met.
Some things:
- $\lambda$ is set and used as described in the single camera calibration section.
- If there is only a single calibration board image pair, the principal point is again removed from the optimization, similarly to the single camera calibration section.
Now, we can perform camera calibration for stereo cameras using a checkerboard calibration target.
Multi-Camera
Coming soon…