I did almost exactly the same thing about 7 years ago. It was called the "video squeezer": it took two analog video streams and output a single stream, except that in my case the two images were placed left/right instead of top/bottom. The cameras were triggered simultaneously.
Doing left/right has the advantage of needing very little buffering (as long as the two streams are reasonably in sync), while with top/bottom you need to buffer at least one whole frame. The FPGA's embedded memory was plenty for buffering a couple of lines, but I don't think it would be enough for a full frame (at least not with a cheap FPGA).
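To put rough numbers on that, here is a back-of-the-envelope sketch. The figures are my assumptions for illustration, not measurements from the actual design: 720 active pixels per line, 576 active lines per frame (PAL), and 2 bytes per pixel (8-bit YCbCr 4:2:2).

```python
# Rough buffer sizing: a couple of lines vs. a full frame.
# All constants are assumptions for illustration (PAL, 8-bit YCbCr 4:2:2).
ACTIVE_PIXELS = 720     # active pixels per line
ACTIVE_LINES = 576      # active lines per frame (PAL)
BYTES_PER_PIXEL = 2     # 4:2:2 averages 16 bits per pixel


def line_bytes(pixels=ACTIVE_PIXELS, bpp=BYTES_PER_PIXEL):
    """Bytes needed to hold one active video line."""
    return pixels * bpp


def frame_bytes(lines=ACTIVE_LINES):
    """Bytes needed to hold one full active frame."""
    return line_bytes() * lines


def kbit(n_bytes):
    """Convert bytes to kilobits (1 Kbit = 1024 bits)."""
    return n_bytes * 8 / 1024


if __name__ == "__main__":
    print(f"two lines: {kbit(2 * line_bytes()):.1f} Kbit")   # 22.5 Kbit
    print(f"one frame: {kbit(frame_bytes()):.1f} Kbit")      # 6480.0 Kbit
```

Under these assumptions a couple of lines fit in a few tens of kilobits, while a full frame needs several megabits, which is the gap between on-chip block RAM and external memory on a small FPGA.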
The design basically used two ADV7180 decoders to digitize the incoming video streams. Their digital outputs went to an FPGA (Spartan-3AN), which subsampled and combined the two digital streams into one, and the result was sent to an ADV7171 encoder to generate the analog output.
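As a software model of the subsample-and-combine step, here is a minimal sketch. It is not the FPGA logic itself: the function names are hypothetical, pixels are treated as plain values, and it simply drops every other pixel. A real implementation would likely low-pass filter before decimating and would have to keep the 4:2:2 chroma pairs intact.

```python
# Toy model of a left/right "squeeze": halve each line horizontally,
# then place the two halves side by side. Names are hypothetical.

def subsample(line):
    """Keep every other pixel (2:1 horizontal decimation, no filtering)."""
    return line[::2]


def squeeze_line(left, right):
    """Combine one line from each source into a single output line."""
    if len(left) != len(right):
        raise ValueError("both sources must have the same line length")
    return subsample(left) + subsample(right)


def squeeze_frame(left_frame, right_frame):
    """Apply squeeze_line to corresponding lines of two frames."""
    return [squeeze_line(l, r) for l, r in zip(left_frame, right_frame)]


if __name__ == "__main__":
    left = [1, 2, 3, 4]
    right = [5, 6, 7, 8]
    print(squeeze_line(left, right))  # [1, 3, 5, 7]
```

Note that each output line depends only on the same line from each source, which is why line-level buffering is enough when the cameras are in sync.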