This is only half an answer, since the approach is dirty and not very efficient. Then again, you never stated the baud rate you need, so it may well be efficient enough.
Use a ground wire and a signal wire. The signal wire connects to an ADC input of every Atmega and, through a resistor, to a digital output pin of every Atmega.
The output pin is not wired straight to the signal wire: a 1 kΩ resistor sits between each digital output pin and the signal wire.
In ASCII art it looks something like this:
     _______A_______
    |               |
    |   PF0(ADC0)   |<--------|
    |               |         |
    |   Atmega128   |         |
    |               |         |
    |      PD0      |->-1kΩ---|
    |_______________|         |
                              |
                              |
     _______B_______          |
    |               |         |
    |   PC0(ADC0)   |<--------|
    |               |         |
    |    Atmega8    |         |
    |               |         |
    |      PD0      |->-1kΩ---|
    |_______________|         |
                              |
                              |
     _______C_______          |
    |               |         |
    |   PC0(ADC0)   |<--------|
    |               |         |
    |    Atmega8    |         |
    |               |         |
    |      PD0      |->-1kΩ---|
    |_______________|         .
                              .
                              .
Then you assign a particular frequency to every Atmega. If A wants to output a logical "1", it outputs a square wave on PD0 at, say, 10 kHz. If A wants to output a logical "0", it stays quiet (holds PD0 low).
If B wants to talk, it outputs a square wave at some other frequency, say 20 kHz, anything but 10 kHz. Same thing with C: it gets its own frequency to talk on.
Then every Atmega performs an FFT on its ADC samples. It doesn't have to be 8192 points or some other unnecessarily high number; a 32-point FFT is easily doable. The ADC does have to sample faster than twice the highest frequency in use. And the fewer points you have, the more each tone will smear into the neighbouring bins and overlap with the others. Life is not ideal.
You won't be putting pure sine waves out through PD0, so you will get harmonics. In A's case you will also see smaller components at 30 kHz, 50 kHz, 70 kHz and so on, because they are part of the 10 kHz square wave. So avoid giving another Atmega a frequency that lands on one of those odd harmonics.
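For reference, this is the standard Fourier series of an ideal 50 % duty square wave that switches between 0 and \$V\$ at a fundamental frequency \$f_0\$. Real pins have finite edge rates, so treat it as an approximation:

$$x(t) = \frac{V}{2} + \frac{2V}{\pi}\sum_{k=1,3,5,\dots}\frac{1}{k}\sin\left(2\pi k f_0 t\right)$$

With \$f_0 = 10\$ kHz the terms sit at 10, 30, 50, 70 kHz and so on, and the amplitude falls off as \$1/k\$, so the 30 kHz component is one third of the fundamental.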
Also, since everything is just connected through resistors, each output counts for less the longer your chain is. Say you have 4 Atmegas connected, so there are 4 resistors going to the signal wire. If one of them outputs a high level and the rest a low level, the wire sits at \$5 \times 0.25 = 1.25\$ V. That means that when you read ADC0 (10-bit ADC), you will read \$\frac{1.25 \times 1024}{5} = 256\$. That's pretty okay. With more, say 20 Atmegas, you will read \$5 \times \frac{1}{20} \times \frac{1024}{5} \approx 51\$. Hmm, that's not too bad either. But it means you will need 20 different frequencies as well, spaced far enough apart that they don't overlap in your FFT.
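The same arithmetic as a small helper, in case you want to play with the numbers. This is only a sketch: `expected_adc_count` and `ADC_FULL_SCALE` are names I made up, and it assumes equal resistors, a 5 V reference, idle nodes driving low, and an ADC input that draws no current.

```c
#include <assert.h>
#include <stdint.h>

#define ADC_FULL_SCALE 1024u /* 10-bit ADC, same scale as the numbers in the text */

/* Expected ADC count on the shared wire when `driving_high` of `nodes`
 * outputs are high and the rest are low. With equal resistors the wire
 * voltage is simply Vcc * driving_high / nodes. */
static uint16_t expected_adc_count(uint8_t nodes, uint8_t driving_high)
{
    assert(nodes > 0 && driving_high <= nodes);
    return (uint16_t)((uint32_t)ADC_FULL_SCALE * driving_high / nodes);
}
```

`expected_adc_count(4, 1)` gives the 256 from above and `expected_adc_count(20, 1)` gives 51. When several nodes are high at the same instant their contributions simply add up.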
So the more frequencies you use the more points of the FFT you have to calculate.
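To make the link between frequencies and FFT points concrete: an N-point FFT at sample rate fs has bins spaced fs/N apart, and a tone only stays in one bin if it sits exactly on such a multiple. A rough sketch, where `bin_for_tone` is a hypothetical helper and the 64 kHz sample rate used below is my assumption (check the datasheet for what the ADC can really do):

```c
#include <assert.h>
#include <stdint.h>

/* Returns the DFT bin index a tone falls on, or -1 if the tone is at or
 * above the Nyquist limit or does not sit exactly on a bin (in which case
 * its energy leaks into neighbouring bins). */
static int bin_for_tone(uint32_t tone_hz, uint32_t sample_rate_hz, uint32_t n_points)
{
    assert(sample_rate_hz > 0 && n_points > 0);
    if (tone_hz == 0 || (uint64_t)2 * tone_hz >= sample_rate_hz) {
        return -1; /* DC, or too fast to be sampled unambiguously */
    }
    uint64_t scaled = (uint64_t)tone_hz * n_points;
    if (scaled % sample_rate_hz != 0) {
        return -1; /* falls between two bins */
    }
    return (int)(scaled / sample_rate_hz);
}
```

With 32 points at an assumed 64 kHz the bins are 2 kHz apart, so 10 kHz lands on bin 5 and 20 kHz on bin 10, while 15 kHz falls between bins and is rejected. Only bins below N/2 are usable, which is why more talkers means more points.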
In other words what you will be doing is like having each Atmega pluck a guitar string, and then you say:
- "Oh, that's an A tone, okay so Atmega #2 is sending me a '1' right now"
- "Hmm, that's a C tone I believe, then that has to be Atmega #1 that's sending me a '1' right now"
- "There's an absence of the tone A, so Atmega #2 is sending me a '0', got ya!"
Regarding your Atmega8s: if they are never going to talk to each other, you can have them listen only to the Atmega128 by calculating the DFT at a single frequency, namely that of the square wave the Atmega128 broadcasts. Or you can have each Atmega8 listen on a different frequency, so that you can talk to every Atmega8 simultaneously. For that, though, the Atmega128 has to output several tones at once, which takes some kind of DAC rather than a single square wave.
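One common way to calculate the DFT at a single frequency is the Goertzel algorithm. That is my pick for illustration, not the only option. Below is a minimal sketch in plain C; `goertzel_power` is a made-up name, and the coefficient `2*cos(2*pi*k/N)` is assumed to be precomputed offline for the bin `k` you care about. An AVR has no FPU, so on the real chip you would probably want a fixed-point version of this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Goertzel algorithm: squared magnitude of one DFT bin over n samples.
 * coeff must be 2*cos(2*pi*k/n) for the wanted bin k, precomputed offline
 * (e.g. about 1.1111405 for k = 5, n = 32). */
static float goertzel_power(const uint16_t *samples, size_t n, float coeff)
{
    assert(samples != NULL);
    float s1 = 0.0f; /* state from one sample ago  */
    float s2 = 0.0f; /* state from two samples ago */
    for (size_t i = 0; i < n; i++) {
        float s0 = (float)samples[i] + coeff * s1 - s2;
        s2 = s1;
        s1 = s0;
    }
    /* |X(k)|^2 from the last two states, no complex arithmetic needed */
    return s1 * s1 + s2 * s2 - coeff * s1 * s2;
}
```

Each Atmega8 would fill a buffer of ADC readings, call this once for the Atmega128's bin, and compare the result against a threshold: above it means "1", below it means "0". The threshold has to be found by experiment, since it depends on how many nodes share the wire.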
I can continue, but I believe you get the gist of it. Remember, it's not the most efficient solution.