
With respect to computer vision (CV), I always hear these three terms used almost interchangeably:

  • Structure from motion (SfM)
  • 3D reconstruction
  • Stereo vision/processing

However, from what I've read, these concepts are similar yet distinct, and nowhere can I find a clear definition of how they differ.

According to Wikipedia:

SfM:

...refers to the process of estimating three-dimensional structures from two-dimensional image sequences

3D reconstruction:

...is the process of capturing the shape and appearance of real objects

Stereo processing:

...is a term that is most often used to refer to the perception of depth and 3-dimensional structure obtained on the basis of visual information deriving from two eyes by individuals with normally developed binocular vision

The best-in-show open source libraries in this arena seem to be Bundler and PMVS/CMVS.

Bundler is advertised as SfM software, whereas PMVS/CMVS are touted as 3D reconstructors. These tools seem to be intended to be used with one another, but could technically be used separately. So clearly there is a difference between SfM and 3D reconstruction at the very least, but I just can't see the forest for the trees here.

So I ask: What is the difference between these three concepts, how are they related, and how do they complement each other?


2 Answers


Stereo Vision (or Binocular vision) is basically the method by which humans (and other binocular animals) perceive depth. The difference in an object's position along the X axis between two cameras is called Disparity. The farther away the observed object is, the lower its disparity. This property is what allows us to sense depth and better estimate distances and object structure from visual cues.
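To make the disparity-depth relationship concrete, here is a minimal sketch using the standard relation Z = f·B/d for a rectified camera pair. The focal length and baseline values are made up purely for illustration:

```python
import numpy as np

# Depth from disparity: Z = f * B / d, where f is the focal length
# in pixels, B the baseline (distance between the two cameras), and
# d the disparity in pixels. All values here are illustrative.
focal_px = 700.0      # assumed focal length in pixels
baseline_m = 0.1      # assumed 10 cm baseline between the cameras

disparity_px = np.array([70.0, 35.0, 7.0])   # larger disparity = closer
depth_m = focal_px * baseline_m / disparity_px
print(depth_m)  # → [ 1.  2. 10.]
```

Note how a tenfold drop in disparity corresponds to a tenfold increase in depth, which is why distant objects are hard to range accurately.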

In general, any time you have two (or more) photos of a stationary object, and you know the positions and properties of the cameras that took them, you can calculate the distance from each pixel to one of the cameras, which allows you to reconstruct the 3D shape of whatever is in view.
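As a rough illustration of that idea, here is a minimal triangulation sketch using the direct linear transform (DLT). The intrinsics, camera placement, and 3D point below are all made up, not taken from any real rig:

```python
import numpy as np

# Triangulating one 3D point from two views with known cameras.
# P1, P2 are 3x4 projection matrices: x ~ P X for homogeneous X.
K = np.array([[500.0, 0, 320],
              [0, 500.0, 240],
              [0, 0, 1]])    # assumed shared intrinsics
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])               # camera 1 at origin
P2 = K @ np.hstack([np.eye(3), np.array([[-0.2], [0], [0]])])   # camera 2, shifted

X_true = np.array([0.5, 0.3, 4.0, 1.0])   # ground-truth point (homogeneous)

def project(P, X):
    x = P @ X
    return x[:2] / x[2]

x1, x2 = project(P1, X_true), project(P2, X_true)

# DLT: stack the linear constraints from both views and solve by SVD.
A = np.vstack([
    x1[0] * P1[2] - P1[0],
    x1[1] * P1[2] - P1[1],
    x2[0] * P2[2] - P2[0],
    x2[1] * P2[2] - P2[1],
])
X_est = np.linalg.svd(A)[2][-1]
X_est /= X_est[3]
print(X_est[:3])  # recovers [0.5, 0.3, 4.0] up to numerical error
```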

With Stereo Vision (Multi View Stereo), you usually have control over your camera rig. For instance, you may choose two or more cameras with known, identical intrinsic properties (focal length, aperture, resolution, etc.), and you calibrate them so that you can estimate their position and rotation relative to each other. Once the cameras are calibrated, you have fixed some of the variables and can compute the disparity. The algorithm picks a pixel in the first image and searches for its equivalent in the second image to measure how far it has moved. In practice, pixels move along epipolar lines, which are computed from each camera's intrinsic properties and the extrinsic configuration of the stereo pair.
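A toy sketch of that disparity search, assuming an already rectified pair so the epipolar lines are simply the image rows (the "image" here is a synthetic 1D row, and the window size and search range are arbitrary):

```python
import numpy as np

# Toy disparity search on a rectified pair: a pixel in the left image
# is matched by sliding a small window along the same row of the
# right image and minimizing the sum of absolute differences (SAD).
rng = np.random.default_rng(0)
row_left = rng.random(64)
true_disp = 5
row_right = np.roll(row_left, -true_disp)   # scene shifted left by 5 px

win = 3                                     # half window size
x = 30                                      # pixel to match in the left row
patch = row_left[x - win: x + win + 1]

# Evaluate candidate disparities 0..15 along the scanline.
costs = [np.abs(patch - row_right[x - d - win: x - d + win + 1]).sum()
         for d in range(16)]
best = int(np.argmin(costs))
print(best)  # → 5
```

Real stereo matchers do this for every pixel, with extra machinery (cost aggregation, smoothness constraints) to handle textureless and occluded regions.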

As for Structure from Motion (SFM), the same principle applies, except that we no longer have a controlled, calibrated camera pair. We have multiple photos, taken from different cameras, possibly from the web, and the positions of those cameras are unknown to us. The first complication is that, unlike with calibrated pairs, we don't have the epipolar constraints up front, so we need a thorough search for correspondences across images. This process is slower and more error-prone. We usually rely on feature detectors and descriptors for this task, such as SURF, SIFT, and ORB. Once we have matched features, we estimate the Fundamental Matrix, which encodes the epipolar geometry of the two views and, together with the camera intrinsics, yields their relative positions and rotations. This step is also error-prone: the estimate only works well when there is a good baseline between the two cameras, and it is usually computed with RANSAC or similar robust estimation algorithms, which may or may not return good results. Once we have the camera extrinsics, we can proceed as we would with MVS to extract point clouds and estimate shape.
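For a sense of how the Fundamental Matrix step works, here is a minimal sketch of the 8-point algorithm on synthetic matches. The cameras and points are made up, the coordinates are normalized (calibrated), and real pipelines would add Hartley normalization plus RANSAC, both omitted here:

```python
import numpy as np

# 8-point algorithm sketch: each match (x1, x2) gives one linear
# equation x2^T F x1 = 0 in the nine entries of F. The matches are
# synthesized from two made-up cameras; a real SfM pipeline would
# obtain them from SIFT/ORB matching and wrap this in RANSAC.
rng = np.random.default_rng(1)
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])                  # first camera
P2 = np.hstack([np.eye(3), np.array([[-0.3], [0.1], [0.0]])])  # translated camera

X = np.vstack([rng.uniform(-1, 1, (2, 12)),   # 12 random 3D points in view
               rng.uniform(3, 6, (1, 12)),
               np.ones((1, 12))])
x1 = P1 @ X; x1 /= x1[2]
x2 = P2 @ X; x2 /= x2[2]

# Build the 12x9 design matrix and take F as its null vector (SVD).
A = np.array([[u2*u1, u2*v1, u2, v2*u1, v2*v1, v2, u1, v1, 1.0]
              for (u1, v1), (u2, v2) in zip(x1[:2].T, x2[:2].T)])
F = np.linalg.svd(A)[2][-1].reshape(3, 3)

# Every match should satisfy the epipolar constraint x2^T F x1 ≈ 0.
residuals = np.abs(np.einsum('ij,jk,ik->i', x2.T, F, x1.T))
print(residuals.max())  # tiny for exact synthetic matches
```

With noisy, mismatched real features the residuals are far from zero, which is exactly why RANSAC is needed to separate inliers from outliers.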

Other techniques that allow you to obtain 3D structure include (non-exhaustive):

  • Structured Light Scanning, where a known light pattern (usually laser) is projected onto the surface of an object and a photo is taken; the deformation of the pattern gives an estimate of the surface normals and therefore the shape

  • Shape from Shading: if you happen to know the position of the light source and assume a Lambertian surface (no shiny, specular, or transparent materials), the light intensity at each pixel gives you an estimate of the surface normal

  • LIDAR and Time of Flight scanning (https://en.wikipedia.org/wiki/Lidar)

  • Structure from Polarization (Read http://web.media.mit.edu/~achoo/polar3D/)
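To illustrate the Lambertian assumption behind the shape-from-shading bullet, here is a tiny sketch (the light direction, normals, and unit albedo are made up): observed intensity I = n · l constrains the angle between the surface normal n and the light direction l.

```python
import numpy as np

# Lambertian shading: with unit albedo, intensity equals the dot
# product of the unit surface normal and the unit light direction,
# clamped at zero for surfaces facing away from the light.
light = np.array([0.0, 0.0, 1.0])   # assumed light along the viewing axis

normals = np.array([[0.0, 0.0, 1.0],                               # facing the light
                    [np.sin(np.pi / 3), 0.0, np.cos(np.pi / 3)]])  # tilted 60 degrees
intensity = np.clip(normals @ light, 0.0, None)
print(intensity)  # → [1.  0.5]
```

Inverting this relation per pixel (intensity → normal, then normals → surface) is the core of shape-from-shading methods.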

  • I always had the impression that "structure from motion" implied... motion, ie the way I have done it is by moving a single camera around an object. In this case you can use a calibrated camera but, while in the case of a stereo rig you know the relative distance/orientation of the two cameras, this information is not available in SFM (unless of course you can get that information in other ways, eg markers or sensors). But you can analyse things like video and quickly collect hundreds of images, and have a better view moving the camera all around the object (sideways and up and down) – thedayofcondor Apr 24 '19 at 22:07
    You're right: If you use the same camera to take the photos, but the camera is moving (hence the name SFM) it's almost the same as if the photos are taken from different cams. I should reword the answer to highlight this fact. – Francois Zard Apr 30 '19 at 01:12

Imagine 50 still cameras arranged in a circle (something like this, though those cameras are Go-Pros). You have a dog jump into the middle of the circle, and all 50 cameras take a two-dimensional picture at the same instant. Now you have 50 pictures of the dog, all of the way around the dog.

SFM will take those 50 pictures and allow you to print a physical reconstruction of the dog on a 3D printer. That kind of information is useful to robots, because it allows them to reason about the three-dimensional form of the reconstructed object.

Structure from motion is one form of 3d reconstruction.

Stereo processing is what happens when you look at an object with your two eyes and perceive depth. Because the images from your two eyes are slightly different, those differences allow you (or a robot) to do things like estimate the distance to the object.

  • Ok, thanks @Robert Harvey (+1) - starting to make sense now. A few quick followups if you don't mind: it sounds like you are saying that *stereo processing* is purely about *depth perception*, whereas *SFM* is *one type* of 3D reconstruction method. (1) If that's true, then how are Bundler and PMVS/CMVS different? If PMVS/CMVS already do 3D reconstruction, then why do they need Bundler to do SFM for them? And (2) Besides SFM, what other types of 3D reconstruction exist (just curious)? Thanks again! – smeeb Aug 26 '16 at 17:24
  • Stereo processing is specifically about *two eyes.* I'm not an expert, so I can't give you a comprehensive list of 3D reconstruction types. There are almost certainly more techniques than just SFM, however. – Robert Harvey Aug 26 '16 at 17:38