Augmenting From Scratch Part 1: Light, Cameras And Images

This is the first in a series of posts that will go all the way through the process of inserting a virtual object into a live video of the real world. We’re going to start from the very beginning.  How does a camera form an image of the real world?

You may have heard of something called a pinhole camera. A pinhole camera is essentially the simplest camera imaginable, so that seems like a reasonable place for us to start. Think of a box with a pinhole in it and a piece of film inside the box some distance away from the pinhole. A pinhole camera can be thought of as being made up of just three things:

  1. Camera center
  2. Image plane
  3. Focal length

The camera center is the pinhole: the point through which all light passes to “get inside” the pinhole camera. In an everyday camera this would be the aperture.

The image plane is the film or photographic plate that "records" the light as it strikes it.

The focal length is simply the distance between the camera center and the image plane. Make sense?

If we put all of this together we can see how a camera forms an image of a 3D object. In this animation the blue lines show both the light rays coming from the pyramid and the final image recorded on the image plane (film).

A pinhole camera always records its images upside-down, which is confusing to our meat computer. So instead we are going to consider the image plane to be in front of the camera center rather than behind it. It sits at the same focal length as before, and since this change doesn’t affect how light passes through the camera center, we get the same image, just the right way up. Think of this as just adding a minus sign to a number somewhere, in exchange for a picture that is much more intuitive to us.
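To make this concrete, here is a minimal sketch of pinhole projection in Python. It assumes the camera center sits at the origin looking down the +Z axis, with the image plane in front of it at a distance of one focal length. The function names, the pyramid coordinates and the focal length of 2.0 are made up for illustration.

```python
def project(point, focal_length):
    """Project a 3D point (X, Y, Z) onto an image plane in front of the camera.

    Assumes the camera center is at the origin, looking down +Z, with the
    image plane at Z = focal_length.
    """
    X, Y, Z = point
    if Z <= 0:
        raise ValueError("point must be in front of the camera (Z > 0)")
    # Similar triangles: the ray from the point to the camera center crosses
    # the image plane at a fraction focal_length / Z of the way out.
    return (focal_length * X / Z, focal_length * Y / Z)


def project_behind(point, focal_length):
    """Same projection with the image plane behind the pinhole.

    This is the 'minus sign': both image coordinates are negated, which is
    why a physical pinhole camera records its image upside-down.
    """
    x, y = project(point, focal_length)
    return (-x, -y)


# Hypothetical pyramid: four base corners and an apex, 4 to 6 units away.
pyramid = [(-1, -1, 4), (1, -1, 4), (1, -1, 6), (-1, -1, 6), (0, 1, 5)]
image = [project(p, focal_length=2.0) for p in pyramid]
flipped = [project_behind(p, focal_length=2.0) for p in pyramid]

for point, pixel in zip(pyramid, image):
    print(point, "->", pixel)
```

Notice that corners further from the camera (larger Z) land closer to the middle of the image, which is exactly the perspective effect we are used to seeing in photos.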

Now we’ve seen how a 3D object makes an image using a pinhole camera, but what happens if we go in the other direction? What if we know what’s on the image but we don’t know the 3D object that formed it?

This animation shows the light rays travelling backward, from the camera center through the image plane and on infinitely.

Here we can see the original pyramid, and the light rays pass through the right points as they should. However, they travel on infinitely. What does that mean for us?

You can see here that a smaller pyramid closer to our camera and a bigger pyramid further from our camera produce the same image. This means that if we just have a single image of the pyramid and no other information, then we can’t figure out its size. What if we had another image of the pyramid from a different angle and we knew the difference between the two camera positions?
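We can check this size ambiguity with a few lines of Python. This is only a sketch under the same assumptions as the pinhole model described so far (camera center at the origin, looking down +Z); the pyramid coordinates and the scale factor of 3 are invented for the example.

```python
import math


def project(point, focal_length):
    """Pinhole projection with the camera center at the origin, looking down +Z."""
    X, Y, Z = point
    return (focal_length * X / Z, focal_length * Y / Z)


def scale_about_camera(points, factor):
    """Make an object 'factor' times bigger and 'factor' times further away."""
    return [(factor * X, factor * Y, factor * Z) for X, Y, Z in points]


# Hypothetical small pyramid close to the camera...
small_pyramid = [(-1, -1, 4), (1, -1, 4), (1, -1, 6), (-1, -1, 6), (0, 1, 5)]
# ...and a pyramid three times bigger, three times further away.
big_pyramid = scale_about_camera(small_pyramid, 3.0)

small_image = [project(p, 2.0) for p in small_pyramid]
big_image = [project(p, 2.0) for p in big_pyramid]

# Every corner lands on the same spot in both images.
same_image = all(
    math.isclose(a[0], b[0]) and math.isclose(a[1], b[1])
    for a, b in zip(small_image, big_image)
)
print("Images identical:", same_image)
```

The scale factor cancels out in the division by Z, so a single image alone cannot tell the two pyramids apart.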

This animation shows us running the light rays in reverse again, for both pinhole cameras. You can see that each light ray passing through one image meets its matching light ray from the other camera at exactly one point. These points are the original points in three dimensions on our pyramid! So from only two pictures of the pyramid we were able to recover some of its original three-dimensional structure.
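Here is a rough sketch of that idea in Python: shoot one reverse ray from each camera and find where they meet. To keep it short it assumes both cameras point the same way (no rotation between them) and differ only in position. Because measured rays rarely cross perfectly, it returns the midpoint of the closest approach between the two rays, which is one simple choice among several. All names and numbers here are made up for illustration.

```python
def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def add(a, b):
    return tuple(x + y for x, y in zip(a, b))


def scale(a, s):
    return tuple(x * s for x in a)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def reverse_ray(image_point, focal_length):
    """Direction of the ray from the camera center through an image point.

    Assumes the camera looks down +Z with the image plane at Z = focal_length.
    """
    x, y = image_point
    return (x, y, focal_length)


def triangulate(center1, dir1, center2, dir2):
    """Midpoint of the closest approach between two rays."""
    w = sub(center1, center2)
    a, b, c = dot(dir1, dir1), dot(dir1, dir2), dot(dir2, dir2)
    d, e = dot(dir1, w), dot(dir2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel, so they never meet")
    s = (b * e - c * d) / denom  # distance along ray 1
    t = (a * e - b * d) / denom  # distance along ray 2
    p1 = add(center1, scale(dir1, s))
    p2 = add(center2, scale(dir2, t))
    return scale(add(p1, p2), 0.5)


# Hypothetical setup: two cameras 2 units apart, both with focal length 2.
focal_length = 2.0
camera1, camera2 = (0.0, 0.0, 0.0), (2.0, 0.0, 0.0)

# Where one pyramid corner at (0.5, 1, 5) shows up in each image.
pixel1 = (0.2, 0.4)
pixel2 = (-0.6, 0.4)

corner = triangulate(
    camera1, reverse_ray(pixel1, focal_length),
    camera2, reverse_ray(pixel2, focal_length),
)
print("Recovered 3D point:", corner)
```

Note that this only works because we already know which point in the first image matches which point in the second.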

Now we understand what a pinhole camera is, and how capturing images of a 3D object with a pinhole camera lets us figure out some 3D information about the object we’re photographing. But what good is this to us? We want to insert virtual objects, and we’re going to be using a regular camera, not a pinhole camera.

First things first: think about what we knew and what we didn’t know in the previous animation. We knew we had two images of the pyramid, and we knew how far apart the two cameras that took them were and how they were oriented, but we didn’t know anything about the 3D structure of the pyramid. We were able to figure that out from what we did know. Well, what if we knew some 3D information about an object in the scene we are photographing, and we also had the images we’ve captured, but we didn’t know how the camera had moved between each photo? Look at the previous animation: can you imagine moving the cameras around (as if we didn’t already know their correct positions) until each camera position produces an image consistent with the 3D pyramid? If we can do that, we can figure out how the camera moves just by studying the images and knowing about a 3D object in the scene. Now, what if we had a virtual object in the scene, and when the camera moved right the object moved left, and when the camera moved down the object moved up? If we can figure out how the camera is moving, we can make virtual objects appear real, because they react to the camera’s movement the way the real objects around them do.
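The "move the camera around until the image matches" idea can be sketched quite literally in Python. The toy example below tries a grid of candidate camera positions and keeps the one whose predicted image is closest to the observed one. It is a brute-force illustration, not how a practical system would do it: it assumes the camera only translates (no rotation), that the focal length is known, and that the true position happens to lie on the search grid. All names and numbers are hypothetical.

```python
import itertools


def project(point, camera_position, focal_length):
    """Pinhole projection for a camera at camera_position, looking down +Z."""
    X = point[0] - camera_position[0]
    Y = point[1] - camera_position[1]
    Z = point[2] - camera_position[2]
    return (focal_length * X / Z, focal_length * Y / Z)


def image_mismatch(points, observed, camera_position, focal_length):
    """Sum of squared distances between predicted and observed image points."""
    total = 0.0
    for point, (u, v) in zip(points, observed):
        x, y = project(point, camera_position, focal_length)
        total += (x - u) ** 2 + (y - v) ** 2
    return total


def find_camera(points, observed, focal_length, candidates):
    """Return the candidate position whose image best matches the observation."""
    return min(
        candidates,
        key=lambda c: image_mismatch(points, observed, c, focal_length),
    )


# The 3D object we know about (hypothetical pyramid corners).
pyramid = [(-1, -1, 4), (1, -1, 4), (1, -1, 6), (-1, -1, 6), (0, 1, 5)]
focal_length = 2.0

# Pretend this image came from a camera whose position we do not know.
true_position = (0.5, -0.25, 1.0)
observed = [project(p, true_position, focal_length) for p in pyramid]

# Try every position on a coarse grid from -2 to 2 in steps of 0.25.
steps = [i * 0.25 for i in range(-8, 9)]
candidates = list(itertools.product(steps, steps, steps))

estimate = find_camera(pyramid, observed, focal_length, candidates)
print("Estimated camera position:", estimate)
```

Once we have the camera position for each frame, drawing a virtual object is just a matter of running it through the same project function with that position.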

Okay, we’ve come a long way, but we’ve skipped over something. We don’t use pinhole cameras. And even if we did, how do we make a pinhole so small that only a single ray of light passes through? Well, what if we just guessed? What if we try to find the pinhole camera which best represents a given real camera with all of its lens characteristics, film characteristics, aperture size, etc.? If we could do that, then all of the things we’ve gone through would work for real cameras too. Finding this pinhole camera which best matches our real camera is the process of camera calibration.

If you think about our simple pinhole camera there are really two main things that define it: the focal length, and how the image plane is oriented. Camera calibration essentially boils down to seeing what parallel lines look like when captured by our real camera, and using this information to determine the focal length and image plane for our “best guess” pinhole camera.

In the animation we show a picture of a checkerboard so the parallel lines are obvious. The vanishing point is where the parallel lines appear to meet in our image. Since they are parallel they shouldn’t really meet, so something about how the image is being formed is making them appear non-parallel.  So by finding a bunch of these vanishing points, we can figure out a focal length and image plane orientation that would best suit them.  Still with me? Don’t worry if this isn’t clear at this stage, just remember that we are able to find a pinhole camera that suits our real camera by just analyzing the images it takes.
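As a rough illustration of how vanishing points can pin down a focal length, here is a small Python sketch. It finds a vanishing point by intersecting two image lines, then uses a standard relationship: if two sets of parallel lines are at right angles to each other in the real world (like the two directions on a checkerboard), their vanishing points v1 and v2 satisfy (v1 - c) · (v2 - c) = -f², where c is the point where the camera's viewing axis hits the image. The sketch assumes square pixels, no lens distortion, and that c is already known (the image center is a common guess). The pixel coordinates are synthetic values made up for the example, and real calibration tools do considerably more than this.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def line_through(p, q):
    """Line through two image points, in homogeneous form (a, b, c)."""
    return cross((p[0], p[1], 1.0), (q[0], q[1], 1.0))


def vanishing_point(line_a, line_b):
    """Image point where two lines meet."""
    x, y, w = cross(line_a, line_b)
    if abs(w) < 1e-12:
        raise ValueError("lines are parallel in the image too; no finite vanishing point")
    return (x / w, y / w)


def focal_from_vanishing_points(v1, v2, center):
    """Focal length (in pixels) from vanishing points of two perpendicular directions."""
    dot = (v1[0] - center[0]) * (v2[0] - center[0]) + (v1[1] - center[1]) * (v2[1] - center[1])
    if dot >= 0:
        raise ValueError("vanishing points are not consistent with perpendicular directions")
    return math.sqrt(-dot)


# Synthetic checkerboard edges, as pairs of pixel coordinates (made-up values).
# Two edges running in one board direction...
v1 = vanishing_point(
    line_through((320, 40), (480, 80)),
    line_through((520, 40), (640, 80)),
)
# ...and two edges running in the perpendicular board direction.
v2 = vanishing_point(
    line_through((320, 40), (160, 80)),
    line_through((520, 40), (320, 80)),
)

center = (320.0, 240.0)  # assumed: middle of a 640x480 image
focal_length = focal_from_vanishing_points(v1, v2, center)
print("Vanishing points:", v1, v2)
print("Estimated focal length in pixels:", focal_length)
```

With noisy real images the lines will not meet so neatly, which is why we collect a bunch of vanishing points and look for the pinhole camera that best suits all of them.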

Now we know how light forms an image and we have a really simple way of describing any camera. But when we have pictures of objects from different camera positions, how do we know where to draw the light rays in reverse through the image plane to figure out the 3D structure of an object (and in turn the camera motion)? If you look closely at what we’ve gone through, we’ve skipped a step. We need some way of “matching” the pictures of an object taken from different positions so we know how to draw the light rays in reverse through the image plane. With our pyramid example that was easy: we knew just to match up the pyramid vertices and see where each pair of reverse light rays met up in 3D. But how do we find these pairs of reverse light rays for real objects? That is where we will start next time.

-Matt M