Single Camera Model
We assume the camera adheres to the “pinhole” model. With this model, there are certain quantities we need to find to characterize the “intrinsics” of the system:
 position of image center
 focal length
 scaling factors to convert height and width of pixels to physical units
 lens distortion
Note that 2) and 3) are typically combined as, mathematically, it’s not necessary to know the focal point explicitly. This is talked about in the next section.
The diagram below describes the model:
Where:
 image point: (, )
 scene point: (, , )
We can compute image points from scene points via:
If you delay the division by , the equations become linear:
and can be recovered by:
Transformation from physical units in the image plane to pixels
The diagram below describes this transformation:
Where:
 image center: (, )
 scaling factors: ,
This leads to:
Again, delaying the division of results in a linear equation:
where:
As discussed in the previous section, , , and are typically not found separately; instead, and are found.
Camera Coordinates to World Coordinates
The diagram below describes this transformation:
Where:

 world point: (, , )
This leads to:
If we leftmultiply both sides by :
This is a linear equation which transforms world coordinates to pixel coordinates (recall: and ). However, this does not yet take lens distortion into account.
Lens Distortion
We model lens distortion to occur in the image plane; we now must make a distinction between distorted and undistorted coordinates:
and:
We assume there are radial and tangential lens distortions:
Where:
All together:
The following diagram describes the lens distortion:
Single Camera Calibration Numerical Implementation
We now have a full camera model to work with. Some things:
 Camera calibration involves using a calibration target with known world coordinates (, , and ). We take pictures of this calibration target, find these known coordinates in “pixel space”, and then find the “intrinsic” and “extrinsic” parameters, through linear and nonlinear optimization, which make:
i.e. we’re trying to make the model fit our measurements.
 We reparameterized things to be linear in the previous section so that linear optimization can be used. This is possible up until lens distortion is involved. So, an initial guess can be made (usually assuming zero lens distortion, a reasonable assumption for most applications) with linear optimization, and then it is followed up with nonlinear refinement with the full camera model.
 For nonlinear optimization, the rotation matrix needs to be reparameterized so that it has 3 parameters. This is because the rotation matrix has 9 elements, but only 3 degrees of freedom. I use Euler angles here, but Rodriguez’s formula is also popular.
 If the calibration target is flat, then and:
is known as a homography and acts as a projection operator for homogeneous coordinates. Often, it is found up to a scale factor.
For this article, I’m assuming a 2D checkerboard pattern is used for calibration. The next sections explain how calibration (i.e. estimation of “intrinsic” and “extrinsic” parameters) is done for this case.
Subpixel corner detection
This is based on the fact that at a corner, a vector with the tail at the corner and pointing in a region around the corner should always have a zero dot product with the intensity gradient at the tip of the vector:
 in flat regions:
 in boundary regions:
The dot product is zero in flat regions because the gradient is zero in that region. In boundary regions, it’s zero because and are orthogonal:
You can use this constraint to set up a linear set of equations to solve for by sampling points around an initial guess of :
One way to get an initial guess for is to get the four corners of the entire checkerboard, then compute a homography, and then reproject the world coordinates of the checkerboard into pixel space. If the lens distortion is low, this should be a reasonable initial guess for , which can then be refined with this method.
Here’s the math:
Reformed as matrices:
So now can be solved for using a linear solver.
Some things you can do to improve this algorithm:
 Apply a Gaussian kernel to the sampling points around to give less weight to points further away from the corner.
 Apply multiple iterations (will require interpolation) until convergence is reached (i.e. position of moves less than a certain threshold)
Note that large and large will influence a linear least squares solver more, so if there just happens to be a large gradient far away in the sampling grid, the solver will try to make more orthogonal to that, than a closer gradient with smaller intensity (maybe due to blur), which may not be desirable. Using a Gaussian kernel should help mitigate this potential problem, but it is still an important consideration to keep in mind.
Computing Homographies
If we have two sets of points, for example a set of corner points on our checkerboard in “world coordinates” (i.e. in real units with respect to the checkerboard) and the corresponding set of points in “pixel space” (i.e. we have detected these corner points in an image we’ve taken of the checkerboard), then we can approximate the transformation from these two sets of points using a homography. The assumption here is we’re using the pinhole camera model and there is very little lens distortion.
Let:
Then:
Let:
Then:
We form homogeneous equations:
Then, we put them in matrix form:
To solve:
The solution is the eigenvector of associated with the smallest eigenvalue. To compute it, use the SVD:
Assuming the eigenvalues are sorted in descending order, then the last column of will contain the least eigenvector.
Note that since this is a homogeneous equation, once we solve for , what we will actually find is , i.e. a scaled version of . It’s important to keep this in mind.
Also, there is a normalized version of the above which makes it more stable and accounts for the fact that (,) and (,) may have very different scales, which may make ill conditioned.
The normalization step does two things:
 translates both sets of points so their centroids are at the origin
 both sets of points are scaled so the average distance from the origin is equal to , which I believe is done to ensure the x and y components are close to 1.
A “normalization matrix” is used:
where:
= mean of
= mean of
The same matrix is also constructed for the set of points in world coordinates, which we call .
Note:
 Original homography:
 Preconditioned homography:
where:
If we do the math to see the updated :
Then:
We form homogeneous equations:
Then, we put them in matrix form:
We can now solve for using . But, we must recover the original homography after we have computed the preconditioned homography; to recover it:
This means:
Which implies:
So, to use the preconditioned approach:
 Use and the SVD method to solve for
 Recover with
So now we have a linear method for computing homographies. Another refinement you can do is a nonlinear optimization of the homography using the linear method as the initial guess. This is more or less optional but included for completeness.
The first step is to divide the initial linear homography guess by so that . Homographies have 8 degrees of freedom and this is a typical constraint added to parameterize the homography for nonlinear optimization.
Next, we derive the Jacobian so we can perform GaussNewton iterations (note the initial guess is typically very good and the optimization is well behaved, so in my experience more robust methods like LevenbergMarquardt aren’t necessary).
We start with:
Let:
Set the cost function to:
where i = ideal, m = measured, and p = pix.
To compute the Jacobian, we start off with the residuals:
Take the partial derivative:
The Jacobian has the form:
Where:
And you can perform GaussNewton updates using the pseudo inverse (although you really should use a direct linear solver):
Until some convergence condition.
Now, we have a method to determine a fully optimized homography given two sets of points.
Initial guess of intrinsic parameters
After homographies for each image of a calibration board are computed, we need to compute an initial guess for the intrinsic parameters.
We can only find an initial linear guess for the parameters in : , , , and . We assume that the lens distortion parameters are zero, which is reasonable for most camera/lens combinations.
Furthermore, we assume the following:
We assume the first two things because the principle point estimation is usually difficult and unstable, so we assume it’s at the middle of the image (this also assumes the first pixel has a position of (1,1); i.e. onebased indexing). It’s also a good assumption for most cameras that = (assuming the pixels are square) and hence = = . These constraints will allow us to get a better initial guess for .
Let:
Now, let:
Then:
Then:
We arrive at the following constraint:
If you repeat the same steps for: , you get the additional constraint:
We put these constraints into matrix form:
Let:
where:
Then, you can solve for and take the inverse and square root to recover .
Some things:
 You can normalize , , , and by dividing each with their respective norm before solving for . This does not effect the orthogonality constraint and prevents issues in case one of the ‘s are very large.
 You can solve for using least squares. However, you can also solve the Â that makes the projection of on equal to :
This is what Bouguet does in his Matlab toolbox and from my tests appears to work slightly better than the least squares results. I have a feeling it might be because you obtain a larger when and are not collinear, which in turn results in a smaller and prevents over estimates, but beyond that I’m not really sure.
Initial guess of extrinsic parameters
The next step is to get an initial guess of the extrinsic parameters per calibration board image.
Since we now have an initial guess for :
This results in the following:
Since and are supposed to be unit vectors:
We can then solve for the extrinsics:
A few things:
 The parameterization of the homography is only unique because we constrained to be positive. To see the effect this has, look at ; in particular :
is 1 (recall we set this constraint when we computed the homography). So, if is constrained to be positive, the effect this has is to ensure is positive. This is correct, as the calibration board is in the positive z direction.
 and should be equal, but will probably be slightly off due to some combination of model misfit, noise, etc… So we take the average of the two when computing .
 and should be orthogonal, but again, due to model misfit, noise, etc… will probably be slightly off. To make an orthogonal matrix, we can compute the best orthogonal approximation:
Take the SVD of :
Then:
Solves:
Â subject to Â
I won’t give a proof, but it should be somewhat intuitive.
 You can (optionally) do nonlinear refinement of just the extrinsics (on a per calibration board image basis) before optimizing all parameters, but this only requires modifying the global optimization problem slightly, which I’ll talk about next.
Nonlinear Refinement of All Parameters (Single Camera)
Note that by now we have initial guesses for all intrinsic and extrinsic parameters. We can now define a cost function and optimize it across all intrinsic and extrinsic parameters.
But, before we do that, it makes things easier to parameterize the camera model with “normalized” coordinates.
Let:
Then:
Lets parameterize our camera model with normalized coordinates:
Doing the same for , we obtain:
Where:
Let:
= parameterization of extrinsic parameters (expanded on soon)
Now, take the partial derivatives of :
Do the same for :
Note that:
Now we must find and .
But, before we can do that, we must parameterize rotation. For that, I’ve elected to use Euler angles; I specify a XYZ rotation sequence:
Note that the inverse is:
We take the Jacobian:
Now, we can compute the full Jacobian:
Where:
and:
We can now define our cost function for single camera calibration:
Where:
 M = number of calibration board images
 N = number of points on calibration board
We put all of our parameters (intrinsic + extrinsic) in a single vector, :
Which results in 8+6M total parameters.
To compute the Jacobian, we start off with the residuals:
Take the partial derivative:
The Jacobian has the form:
Where:
Where the ‘s represent filled blocks and ‘s are empty blocks. This makes intuitive since, as the intrinsic parameters will affect the residuals for all calibration board images. For extrinsics, only the extrinsic parameters for a specific calibration board image will affect its residuals.
To solve this, I’d recommend using the LevenbergMarquardt method since the nonlinear optimization is not wellbehaved under certain circumstances (In my tests, if very few calibration boards are provided, it can cause the GaussNewton method to fail):
As usual, it’s recommended to use a sparse linear solver instead of directly taking the inverse.
Some things:
 All initial guesses are used from the previous sections. , , , and are all initialized to zero.
 is usually initialized to 0.01 and will either grow or shrink every iteration depending on whether the mean squared error (MSE) increases or decreases. If MSE decreases, then is decreased and we proceed to the next iteration. If MSE inceases, then we increase until MSE descreases. Increasing lambda has the two effects: 1) it makes the step closer to the gradient and 2) it reduces the step size. Decreasing lambda has the effect of making the step closer to the GaussNewton method, which is an agressive optimization. Using small steps in the direction of the gradient is incredibly robust, but slightly slow.
 If there is only a single calibration board image, the principle point is removed from the optimization (simply remove the columns from the Jacobian corresponding to those two parameters, and then do not update them during the iterations), as it is very unstable.
We now have enough information to fully calibrate and optimize a camera using a checker board calibration target.
Stereo Vision
The first thing we need to consider is how to incorporate two cameras into the model. We do this by simply adding the transformation between the two cameras as follows:
This results in:
We can solve for and :
In order to optimize the stereo model, we do the following:
 We get initial guesses for the intrinsics, , , and by doing single camera optimization on each camera separately.
 Instead of optimizingÂ , , , and for each calibration board image, we instead optimize , , , and . This reduces the number of extrinsic parameters from 12M to 6M+6, where M is the number of calibration board image pairs.
 To get an initial guess for and , we do the following:
Then:
and:
If we put these in matrix form:
and:
We now have a model for stereo vision and a way to provide an initial guess for this model.
Some things:
, , , have the same form as single camera calibration.
and have the same form as single camera calibration.
, , , are new quantities that we need to derive.
We need to compute the Jacobians:
The only quantities we haven’t derived yet are:
and
The rest of the Jacobians have already been derived in the single camera calibration section.
Now, we’re ready to define our cost function for stereo calibration:
Where:
 M = number of calibration board image pairs
 N = number of points on calibration board
We put all of our parameters (intrinsic + extrinsic) in a single vector, :
Which results in 22+6M total parameters.
To compute the Jacobian, we start off with the residuals:
Take the partial derivative:
The Jacobian has the form:
Where:
Where, like before, the ‘s represent filled blocks and ‘s are empty blocks.
To solve this, again, I’d recommend using the LevenbergMarquardt method since it is not wellbehaved under certain circumstances:
Until some convergence condition.
Some things:
 is set and used as described in the single camera calibration section.
 If there is only a single calibration board image, the principle point is again removed from the optimization similarly to the single camera calibration section.
Now, we can perform camera calibration for stereo cameras using a checkerboard calibration target. The next article will go over how to do this in Matlab.