Stereo Vision (or binocular vision) is the method by which humans (and other binocular animals) perceive depth. The difference in the horizontal (X-axis) position of an object between the two views is called disparity. The farther away the observed object is, the lower its disparity. This property is what allows us to sense depth and better estimate distances and object structure from visual cues.
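To make the inverse relationship concrete, here is a tiny sketch using the standard pinhole relation Z = f·B/d for a rectified stereo pair. The focal length and baseline values are made-up numbers for illustration, not taken from anything above.

```python
# A minimal sketch of the disparity/depth relationship for a rectified stereo pair.
# f (pixels) and B (meters) are assumed, illustrative values.
f = 700.0   # focal length in pixels (assumed)
B = 0.12    # baseline between the two cameras in meters (assumed)

for disparity_px in [70, 35, 7]:
    depth_m = f * B / disparity_px   # Z = f * B / d
    print(f"disparity {disparity_px:>3} px  ->  depth {depth_m:.2f} m")
# Larger disparity -> closer object; as disparity shrinks, depth grows quickly.
```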
In general, any time you have 2 (or more) photos of a stationary object, and you know the positions and properties of the cameras that took them, you can calculate the distance of each pixel to one of the cameras, which lets you reconstruct the 3D shape of whatever is in view.
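As a sketch of that idea, the snippet below triangulates a couple of matched pixels from two views with known camera matrices using OpenCV. The intrinsics, the 0.12 m baseline, and the pixel correspondences are all assumed, illustrative values.

```python
# A minimal sketch of triangulating 3D points from two views with known cameras.
# K, the baseline, and the matched pixel coordinates are illustrative only.
import numpy as np
import cv2

K = np.array([[700., 0., 320.],
              [0., 700., 240.],
              [0., 0., 1.]])                       # assumed shared intrinsics

# Camera 1 at the origin, camera 2 shifted 0.12 m along X (assumed rig).
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.12], [0.], [0.]])])

# Matched pixel coordinates of the same physical points in both images (made up),
# given as 2 x N arrays: row 0 = x coordinates, row 1 = y coordinates.
pts1 = np.array([[300., 350.], [240., 260.]], dtype=np.float64)
pts2 = np.array([[265., 330.], [240., 260.]], dtype=np.float64)

X_h = cv2.triangulatePoints(P1, P2, pts1, pts2)    # 4 x N homogeneous points
X = (X_h[:3] / X_h[3]).T                           # N x 3 Euclidean points
print(X)   # each row is the estimated 3D position relative to camera 1
```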
With Stereo Vision (and Multi-View Stereo, MVS), you usually have control over your camera rig. For instance, you may choose 2 or more cameras with the same known intrinsic properties (focal length, aperture, resolution, etc.) and calibrate them so that you can estimate their position and rotation relative to each other. Once you have calibrated cameras, you have fixed some of the variables, and you can compute the disparity. The algorithm picks a pixel in the first image and tries to find its equivalent in the second image to measure how much it has moved. In reality, corresponding pixels lie along epipolar lines, which are computed from the intrinsic properties of each camera and the extrinsic configuration of the stereo pair.
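Here is a minimal sketch of that matching step with OpenCV's semi-global block matcher, assuming the pair has already been calibrated and rectified (so epipolar lines are horizontal). The filenames are placeholders and the matcher parameters are typical starting values, not tuned ones.

```python
# A minimal sketch of computing a disparity map from a rectified stereo pair.
import cv2

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)    # placeholder filenames
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Semi-Global Block Matching: searches along the (horizontal) epipolar lines of
# the rectified pair for the best-matching block around each pixel.
matcher = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=128,     # search range in pixels, must be divisible by 16
    blockSize=5,
    P1=8 * 5 * 5,           # smoothness penalties for small / large disparity jumps
    P2=32 * 5 * 5,
)
disparity = matcher.compute(left, right).astype("float32") / 16.0  # fixed-point -> pixels

# With the reprojection matrix Q from stereo rectification, the disparity map can
# be turned into a 3D point cloud, e.g. cv2.reprojectImageTo3D(disparity, Q).
```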
As for Structure from Motion (SfM), the same principle applies, except now we don't have a controlled and calibrated camera pair. We have multiple photos, taken from different cameras, possibly pulled from the web, and the positions of those cameras are unknown to us. The first complication is that, unlike with calibrated pairs, we don't have the epipolar constraints up front, so we need a much more thorough search for correspondences across images; the process is slower and more prone to errors. We usually resort to feature detectors and descriptors for this task, such as SIFT, SURF and ORB. Once we have matched features, we try to estimate the Fundamental Matrix, which encodes the epipolar geometry, i.e. the relative position and rotation of the cameras (up to scale) together with their intrinsics. This step is also prone to errors: the estimate is only reliable when there is a good baseline between the two cameras, and it is usually computed with RANSAC or a similar robust estimation scheme, which may or may not return a good result. Once we have the camera extrinsics, we can proceed as we would with MVS to extract point clouds and estimate shape.
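A minimal sketch of those first SfM steps with OpenCV is below: detect and match ORB features between two uncalibrated photos, then robustly estimate the fundamental matrix with RANSAC. The filenames are placeholders and the thresholds are illustrative defaults, not tuned values.

```python
# A minimal sketch of feature matching + fundamental matrix estimation.
import cv2
import numpy as np

img1 = cv2.imread("photo1.jpg", cv2.IMREAD_GRAYSCALE)   # placeholder filenames
img2 = cv2.imread("photo2.jpg", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=4000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Brute-force Hamming matching with cross-check to discard asymmetric matches.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

# RANSAC repeatedly fits F to random minimal subsets and keeps the model with
# the most inliers (points within ~1 px of their epipolar line).
F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.999)
print("inliers:", int(inlier_mask.sum()), "of", len(matches))

# If the intrinsics K were known, cv2.findEssentialMat + cv2.recoverPose would
# give the relative rotation and (unit-scale) translation of the second camera.
```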
Other techniques that allow you to obtain 3D structure include (non-exhaustive):
- Structured Light Scanning, where a known light pattern (usually laser) is projected onto the surface of an object and a photo is taken. The deformation of the pattern gives an estimate of the surface normal and therefore the shape.
- Shape from Shading: if you happen to know the position of the light source and assume a Lambertian material (no shiny, specular or transparent surfaces), the light intensity at each pixel gives you an estimate of the surface normal (see the sketch after this list).
- LIDAR and Time-of-Flight scanning (https://en.wikipedia.org/wiki/Lidar)
- Structure from Polarization (read http://web.media.mit.edu/~achoo/polar3D/)
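On the shading point: Shape from Shading proper works from a single image and is quite ill-posed, so the sketch below shows the underlying Lambertian model in the closely related photometric-stereo setting, where several images under different known lights make the per-pixel normal a simple least-squares solve. All the light directions, normals and albedo values are synthetic, illustrative numbers.

```python
# A minimal sketch of the Lambertian model behind shading-based shape recovery,
# in a photometric-stereo setting with several known light directions.
import numpy as np

# Known unit light directions, one per image (assumed, 3 lights here).
L = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.866],
              [0.0, 0.5, 0.866]])

# True (unknown) surface normal and albedo at one pixel, used to synthesize data.
n_true = np.array([0.2, -0.1, 0.97]); n_true /= np.linalg.norm(n_true)
albedo = 0.8

# Lambertian image formation: intensity = albedo * max(0, n . l) for each light.
I = albedo * np.clip(L @ n_true, 0.0, None)

# Recover albedo * n by least squares, then split magnitude and direction.
g, *_ = np.linalg.lstsq(L, I, rcond=None)
albedo_est = np.linalg.norm(g)
n_est = g / albedo_est
print("estimated normal:", n_est, " albedo:", albedo_est)
```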