As far as I know, in DSP we want to remove or reduce the effect of undesired signals by separating their frequency spectrum from the spectrum of the whole signal (our main signal plus the noise). This is done with a rectangular window, right? I thought the window function lived in the frequency domain and was multiplied by the frequency spectrum of the signal being filtered. That is, I thought we took an FFT of the main signal, multiplied it by the filter window, which looks like a rectangle in the frequency domain (each sample of the window multiplies the corresponding sample of the signal's FFT), and then took an inverse FFT to get the filtered main signal back. I suspect this would not work very well, since our main signal is not periodic (speech, for example) and we do not have ALL of its samples when we take the FFT, so the filtered main signal might not be good.
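To make the procedure I have in mind concrete, here is a minimal sketch in Python with NumPy. The sample rate, the two test tones, the cutoff, and the helper name `brickwall_lowpass` are all assumptions I made up for illustration, not anything taken from a textbook.

```python
import numpy as np

def brickwall_lowpass(x, fs, cutoff_hz):
    """Hypothetical helper: FFT, multiply by a rectangular mask, inverse FFT."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    # Rectangle in the frequency domain: 1 below the cutoff, 0 above it.
    mask = (freqs <= cutoff_hz).astype(float)
    return np.fft.irfft(spectrum * mask, n=len(x))

fs = 1000.0                                   # assumed sample rate in Hz
n = 1000                                      # one second of samples, 1 Hz per FFT bin
t = np.arange(n) / fs
wanted = np.sin(2 * np.pi * 50 * t)           # assumed "main" signal
noise = 0.5 * np.sin(2 * np.pi * 300 * t)     # assumed undesired signal

filtered = brickwall_lowpass(wanted + noise, fs, cutoff_hz=100.0)
residual = np.max(np.abs(filtered - wanted))
print("largest error after filtering:", residual)
```

In this toy case both tones complete a whole number of cycles inside the block, so each one sits exactly on an FFT bin and the mask removes the 300 Hz tone cleanly. That is exactly the condition a non-periodic signal such as speech does not satisfy, which is why I doubt the approach in general.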
My confusion comes from learning that the window function has its origins in the time domain, not the frequency domain, and that the window function is convolved in time with our main signal to filter it! (In the frequency domain the window function turns into a mess of weird-looking lobes, which is another thing I do not understand.) Why don't we filter in the frequency domain by taking an FFT, multiplying it by a window, and then doing an inverse FFT?
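From what I can gather about the window method of FIR design, the window multiplies a truncated ideal impulse response (a sinc) in time, and the resulting filter taps are what get convolved with the signal. Here is a minimal sketch of that reading, in case I have it wrong. The tap count, the Hamming window, the cutoff, the test tones, and the helper name `windowed_sinc_lowpass` are my own assumptions.

```python
import numpy as np

def windowed_sinc_lowpass(num_taps, cutoff_hz, fs, window):
    """Hypothetical helper: ideal low-pass impulse response times a window."""
    n = np.arange(num_taps) - (num_taps - 1) / 2.0   # sample index centred on 0
    fc = cutoff_hz / fs                              # cutoff in cycles per sample
    ideal = 2 * fc * np.sinc(2 * fc * n)             # truncated ideal response
    taps = ideal * window                            # window applied in the TIME domain
    return taps / np.sum(taps)                       # normalise to unity gain at DC

fs = 1000.0                                          # assumed sample rate in Hz
num_taps = 101                                       # assumed filter length
taps = windowed_sinc_lowpass(num_taps, 100.0, fs, np.hamming(num_taps))

t = np.arange(2000) / fs
x = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 300 * t)
y = np.convolve(x, taps, mode="same")                # filtering = convolution in time

# Magnitude response of the taps on a 1000-point grid (1 Hz per bin here).
response = np.abs(np.fft.rfft(taps, 1000))
print("gain at 50 Hz:", response[50], "gain at 300 Hz:", response[300])
```

If this is right, the rectangle still exists, but only as the ideal response the sinc comes from, and the window decides how the truncation of that sinc shows up as lobes in the frequency domain.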
Apparently a rectangular window is bad because its side lobes are not small enough, and there is something called "power leakage" in the spectrum, so we do not use a rectangular window. It's all confusing me.
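To see what "side lobes not small enough" means in numbers, I tried a small sketch that measures the highest side lobe of a window's spectrum. The window length, the zero-padding factor, the choice of a Hann window for comparison, and the helper name `peak_sidelobe_db` are assumptions of mine.

```python
import numpy as np

def peak_sidelobe_db(window, pad=64):
    """Hypothetical helper: highest side lobe relative to the main-lobe peak, in dB."""
    n = len(window)
    # Zero-pad heavily so the lobes of the spectrum are finely sampled.
    mag = np.abs(np.fft.rfft(window, n * pad))
    mag = mag / mag[0]                     # main-lobe peak is at DC
    # Walk down the main lobe until the magnitude starts rising again.
    k = 1
    while k < len(mag) - 1 and mag[k + 1] < mag[k]:
        k += 1
    # Everything from the first null onward is side lobes.
    return 20 * np.log10(np.max(mag[k:]))

n = 64                                     # assumed window length
rect_db = peak_sidelobe_db(np.ones(n))     # rectangular window
hann_db = peak_sidelobe_db(np.hanning(n))  # Hann window, for comparison

print("rectangular peak side lobe (dB):", rect_db)
print("Hann peak side lobe (dB):", hann_db)
```

The rectangular window's first side lobe comes out at roughly -13 dB, against roughly -31 dB for the Hann window. As I understand it, that higher side lobe is what lets energy from one frequency leak into distant bins.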