As a long-time embedded software engineer, I have to say that your assumption that debouncing will take meaningful processing power away from your application is simply incorrect. That will never be true of any competently-written firmware.
Naturally, debouncing will require some processing. However, the processing is trivial, and for user input it happens at such a low update rate as to be utterly negligible. If you needed to debounce inputs with update rates in the tens of kHz, perhaps the processing would be significant, but a human pressing buttons does not need anything like that kind of resolution. In your case, 100Hz sampling would easily be fast enough, and you could almost certainly drop as low as 10Hz sampling without seriously affecting your user interaction.
If you're trying to do input processing in a main control loop running at tens of kHz, of course it'll suck processing power. The correct solution, though, is to write firmware which doesn't do it that way, not to reach for a hardware solution to fix a software anti-pattern. Appropriate use of timers and interrupt priorities will give you what you need.
You can optimise the processing by making sure the read-back is all on one I/O port. Assuming you're setting levels on columns and reading back the rows, you bit-AND, bit-shift and bit-OR to build up a 16-bit value for the 16 pins. XOR this with the previous 16-bit value; if the result is non-zero then something changed. A simple debounce algorithm is: reload a counter whenever the pins change state; if they held their state, decrement the counter while it's non-zero, and accept the new state once it reaches zero.
You do need to check that only one button is pressed, of course. If you've got an ARM processor, GCC and Clang provide a population-count intrinsic (`__builtin_popcount`) which compiles down to a very short sequence for counting how many bits are set, which is ideal for this. Just mentioning for a further optimisation.