Your questions are valid and are the path to a proper understanding of what the theory means ;-).
To the question of how more bandwidth means a higher bit rate: an explanation may look simple but be bad at the same time.
Here is a "bad" explanation which looks OK. It is still a start towards understanding why more bandwidth means more data.
Suppose that I have a first WiFi channel (number 1) running at 1Mb/s given the power and encoding conditions. Then I take another WiFi channel (number 2) which has the same bandwidth, power and encoding conditions. It is also running at 1Mb/s. When I sum the two together, I have doubled the bandwidth (two different channels) and doubled the data throughput (2x1Mb/s).
If you think this looks like a perfect explanation, you forget that we also doubled the power. So is the doubled data throughput due to the doubled power or due to the doubled bandwidth? It is a bit of both, actually.
If I maintain the same total power while doubling the bandwidth, I need to compare a first WiFi channel running at 1Mb/s with the sum of two other WiFi channels, each running at half the received power.
I am not going to check the datasheets of WiFi modems, but it would be an interesting exercise to compare them with the following theoretical approach.
Shannon helps us in predicting, more or less, what will happen if the encoding adapts itself to the power levels (which is the case for WiFi). If the encoding does not adapt, the data rate remains constant until the reception level is too low, at which point it drops to 0.
So Shannon says: C = B*log2(1 + S/N). When keeping the total power constant but doubling the bandwidth, the noise power doubles along with the bandwidth, so the SNR halves: C2 = 2*B*log2(1 + (S/2)/N), where C2 is the potential data rate.
Filling in actual numbers, we could suppose that S = 2xN, so that log2(1+2) ≈ 1.58 and log2(1+1) = 1. So C = 1.58*B and C2 = 2*B. In other words, when my signal level in the doubled bandwidth equals the noise level, the potential data rate is about 26% higher than with the same total power emitted in half the bandwidth.
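To make the arithmetic concrete, here is a quick Python check of the two capacities, using the same S = 2xN assumption as above (units are arbitrary):

```python
from math import log2

B = 1.0   # reference bandwidth (arbitrary units)
N = 1.0   # noise power within the reference bandwidth B
S = 2.0   # total signal power, chosen so that S = 2*N as in the example

# All power S in bandwidth B
C = B * log2(1 + S / N)                # = B * log2(3) ≈ 1.58 * B

# Same total power in bandwidth 2B: noise power doubles with bandwidth,
# so the effective SNR is (S/2)/N
C2 = 2 * B * log2(1 + (S / 2) / N)     # = 2 * B * log2(2) = 2 * B

print(C, C2, (C2 - C) / C)             # ≈ 1.58, 2.0, 0.26 (about 26% gain)
```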
So theoretically, ultra narrow band cannot be more efficient than ultra wide band, based on Shannon's theorem.
And doubling the bandwidth at the same total power level does not double the data rate as our WiFi example suggested. But the data rate is higher. If we can neglect the "1" term in the log2 of the Shannon expression, then you can easily see that it is more interesting to increase the bandwidth than to increase the power (which sits inside a log2, lowering its impact on the data rate).
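A small numeric sketch of that trade-off, at an arbitrarily chosen high SNR where the "1" term barely matters:

```python
from math import log2

B, S, N = 1.0, 100.0, 1.0   # high SNR (chosen for illustration)

C_base       = B * log2(1 + S / N)            # reference capacity
C_double_pow = B * log2(1 + 2 * S / N)        # doubling power: about +1 bit/s/Hz
C_double_bw  = 2 * B * log2(1 + S / (2 * N))  # doubling bandwidth (noise scales with B)

print(C_base, C_double_pow, C_double_bw)
# ≈ 6.66, 7.65, 11.34 -> widening the band beats raising the power
```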
However, as I mentioned, the encoding must adapt; it must be optimised for the actual power and bandwidth available.
If the encoding stays the same, I simply go from operational to dysfunctional.
Switching to your second question: if I have an FSK signal changing at 30Hz with two frequencies, then I can only emit at 30bps, because I am emitting 30 symbols per second, each corresponding to a bit of 1 or 0. If I introduce 4 states (= 4 frequencies) by adding two frequencies in between the previous ones because my noise level allows it, then each symbol carries log2(4) = 2 bits and I emit at 2x30bps = 60bps. With FSK, I do not think that the bandwidth remains constant when increasing the number of states this way, but one can surely find a way to keep it more or less constant (considering the 3dB limits, because the theoretical frequency spectrum is unlimited).
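The symbol rate versus bit rate relation in a few lines of Python (the symbol rate is the one from the example above; the state counts beyond 4 are just for illustration):

```python
from math import log2

symbol_rate = 30   # the signal changes 30 times per second

for n_states in (2, 4, 8):
    print(n_states, "states ->", symbol_rate * log2(n_states), "bps")
# 2 states -> 30 bps, 4 states -> 60 bps, 8 states -> 90 bps
```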
Why use a square wave for the "modulating" signal? This is a choice in this encoding which makes it "easier" to decode: on the receiver side you simply need a bandpass filter for each frequency. You are still emitting "sine waves" - if you are emitting only "1" values, you have just one frequency. However, the frequency shifts imply the presence of "harmonics" that allow/accompany these frequency shifts. Other encodings have other advantages and disadvantages. For instance, Direct Sequence Spread Spectrum allows having a signal below the noise level (and therefore has lower antenna power requirements than many other encodings for a similar bit rate), but it is more difficult to decode (and hence requires more (compute) power and complexity in the decoding circuit).
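To illustrate the "one detector per frequency" idea, here is a toy 2-FSK demodulator that compares the signal energy near each tone, symbol by symbol. All parameters (sample rate, tone frequencies, symbol rate) are made up for the example, and the correlator stands in for the bandpass filters:

```python
import numpy as np

fs = 8000            # sample rate in Hz (illustrative)
f0, f1 = 1000, 2000  # the two FSK tones (illustrative)
baud = 100           # symbols per second
n = fs // baud       # samples per symbol

rng = np.random.default_rng(0)
bits = rng.integers(0, 2, 20)
t = np.arange(n) / fs

# Transmit one sine burst per bit (phase-discontinuous FSK, kept simple)
tx = np.concatenate([np.sin(2 * np.pi * (f1 if b else f0) * t) for b in bits])
rx = tx + 0.5 * rng.standard_normal(tx.size)   # add some noise

def tone_energy(x, f):
    # Correlate with a complex exponential at frequency f;
    # the magnitude approximates the energy a bandpass filter would pass
    return np.abs(np.sum(x * np.exp(-2j * np.pi * f * np.arange(x.size) / fs)))

decoded = [1 if tone_energy(rx[k*n:(k+1)*n], f1) > tone_energy(rx[k*n:(k+1)*n], f0)
           else 0 for k in range(bits.size)]
print("bit errors:", int(np.sum(np.array(decoded) != bits)))   # expected: 0
```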
Whatever the chosen encoding is, it has to respect the Shannon theorem, which fixes the upper limit. You cannot just apply Shannon to an encoding like FSK if you do not adjust the power level, number of states, and other parameters of the FSK signal as the noise level or signal level (distance) changes. Shannon allows you to check the absolute minimum power for a given bandwidth and data rate. The encoding method will raise this minimum power limit. And when the power level exceeds this limit, the bit rate will simply remain constant. Applying Shannon there is simply incorrect if you want to explain that more bandwidth means a higher bit rate. The WiFi example might very well apply in practice as an explanation there, but it is not the general answer based on Shannon's theorem.
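That minimum power check follows directly from inverting the Shannon formula; a minimal sketch (the 30bps/15Hz numbers anticipate the Nyquist discussion below):

```python
from math import log2

def min_snr(bit_rate, bandwidth):
    # Invert C = B*log2(1 + S/N) for S/N: the floor any real encoding must exceed
    return 2 ** (bit_rate / bandwidth) - 1

print(min_snr(30, 15))   # S/N >= 3 (about 4.8 dB); a real encoding needs more
```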
Edit: rereading your question, "In the second case the bit rate will be a maximum of 660bps".
Actually, I do not fully understand how you get to 660bps, as your frequency changes only 30 times each second and you encode on two frequencies, which is 1 bit. Hence my 30bps above.
This encoding allows one full period at 30Hz and 22 full periods at 660Hz for each symbol. But 22 periods do not change the fact that there is only one symbol.
It looks like something is missing or the reasoning is wrong.
Edit2: I got it - you are comparing with the Nyquist limit.
This Nyquist limit tells you the upper limit of the data rate given a bandwidth and the number of states per symbol.
Here, the selected FSK encoding is not optimal. You are using 30Hz and 660Hz.
The Nyquist limit says that 30bps = 2*B*log2(2); therefore, the bandwidth must be at least B = 15Hz. Without checking in detail, it says more or less that setting the FSK frequencies to 645Hz and 660Hz would be a good optimisation of the bandwidth (if FSK is otherwise an optimal encoding, and without checking the precise bandwidth due to harmonics - the 15Hz may be too low for FSK).
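The same Nyquist rearrangement in code form, a small sketch using the numbers from this answer and from your question:

```python
from math import log2

def nyquist_min_bandwidth(bit_rate, n_states):
    # From C = 2*B*log2(M): the smallest bandwidth that can carry bit_rate
    return bit_rate / (2 * log2(n_states))

print(nyquist_min_bandwidth(30, 2))    # 15.0 Hz, as derived above
print(nyquist_min_bandwidth(660, 2))   # 330.0 Hz for the claimed 660bps with 2 states
```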
Edit 3 - Explanation following further analysis, to clarify the source of confusion with the other answer and the original question.
- The Nyquist formula is based on the sampling theorem, which says that a signal with a bandwidth B is perfectly reconstructed from precisely 2B samples per second.
- Hence the 2B samples can each represent a symbol (the intensity can determine which symbol).
- A signal with a bandwidth of 300Hz can be reconstructed from 600 symbols per second - no more, no less.
- This is why "aliasing" exists - bandwidth limitation can make two different signals look the same after sampling.
- If each symbol only represents 2 states, then only 600 bps is possible.
- The FSK from 30Hz to 330Hz can represent more than 600bps, but then you need to consider more than 2 states per symbol. That is no longer FSK demodulation, though, because one can then not consider only the frequency (see the sketch below).
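Putting those bullets into numbers, a small sketch (the state counts above 2 are hypothetical, since they would need amplitude or phase information on top of frequency):

```python
from math import log2

B = 330 - 30            # bandwidth of the 30..330Hz band, in Hz
symbols_per_s = 2 * B   # sampling theorem: 2B samples per second fully describe it

for n_states in (2, 4, 16):
    print(n_states, "states ->", symbols_per_s * log2(n_states), "bps")
# 2 states -> 600 bps; more states exceed 600 bps but are no longer plain FSK
```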