The simple answer is "no, there is no single-chip solution."
The reasons are already mentioned by Chris Stratton:
- The two signals are not in sync, so at least one of the frames needs to be re-buffered (typically both will need to be, to be able to do the blend/overlay).
- HDMI in a living room typically carries encrypted signals, so you also need to handle HDCP (HDMI content protection) negotiation.
- The resolutions may be different.
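To make the buffering and resolution points concrete, here is a minimal sketch of the blend step, assuming both frames have already been captured into memory and scaled to the same size. The frame layout (lists of rows of 8-bit RGB tuples) and the names `blend_pixel` and `blend_frames` are made up for illustration; real hardware would do this on a pixel pipeline, not in software.

```python
def blend_pixel(base, overlay, alpha):
    """Mix two 8-bit RGB pixels; alpha (0.0 to 1.0) is the overlay weight."""
    return tuple(round(alpha * o + (1.0 - alpha) * b)
                 for b, o in zip(base, overlay))


def blend_frames(base, overlay, alpha):
    """Blend two buffered frames of identical resolution, pixel by pixel."""
    same_rows = len(base) == len(overlay)
    if not same_rows or any(len(rb) != len(ro) for rb, ro in zip(base, overlay)):
        # Mismatched resolutions must be fixed by a scaler before this step.
        raise ValueError("frames must have the same resolution before blending")
    return [[blend_pixel(b, o, alpha) for b, o in zip(rb, ro)]
            for rb, ro in zip(base, overlay)]
```

The point is that every output pixel needs the matching pixel from both inputs at the same instant, which is why unsynchronized sources have to go through a frame buffer first.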
Additionally:
- HDMI is a pretty high data rate signal, especially if you go above ATSC to 1080p60/30-bit etc. That's roughly half a gigabyte per second per stream per direction, so to capture two streams (two buffer writes), then read/sum two streams (two buffer reads), then output one stream, you end up with about 2.5 GB/s in pure data traffic (and more for overhead).
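The arithmetic behind those numbers can be checked with a quick back-of-the-envelope calculation. This sketch assumes 1920x1080 active pixels only (no blanking intervals) and counts five stream passes: two captures, two reads, one output. The variable names are just for illustration.

```python
WIDTH, HEIGHT = 1920, 1080   # active pixels of a 1080p frame
FPS = 60                     # frames per second
BITS_PER_PIXEL = 30          # 10 bits per colour channel

# Payload of one stream in one direction, in bytes per second.
bytes_per_stream = WIDTH * HEIGHT * FPS * BITS_PER_PIXEL // 8

# Two captured streams written, two read back for blending, one sent out.
STREAM_PASSES = 2 + 2 + 1
total_bytes = bytes_per_stream * STREAM_PASSES

print(f"per stream: {bytes_per_stream / 1e9:.2f} GB/s")
print(f"total:      {total_bytes / 1e9:.2f} GB/s")
```

That comes out slightly under the rounded figures above (about 0.47 GB/s per stream and 2.33 GB/s total), before counting blanking or memory-controller overhead.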
None of this is impossible. Broadcast video equipment does similar things just fine. But it's a cost question.
Actually, I did a Digi-Key search, and this chip came up:
Analog Devices ADV8003 http://www.analog.com/static/imported-files/data_sheets/ADV8003.pdf
At $70 it's not super cheap, and the datasheet is ludicrously empty (it doesn't even specify a chip form factor), so you'd have to work with AD application engineers to actually get anything done with it. It also doesn't do the actual HDMI capture; you have to do that using separate chips. But it's the closest that a simple search could find.