There are basically two questions here: i) why \$\mathbf{x}\$ is considered a RV, and ii) why it is said that in the presence of gaussian noise (\$\mathbf{z}\$) the capacity is maximised when \$\mathbf{x}\$ is also gaussian.
i) Firstly, the scenario you have given as an example is correct; the only issue is that you are considering transmission over a noiseless channel. Let's extend your example and suppose we are transmitting a sample from the set \$\mathcal{X} = \{0,\Delta, ..., (A-\Delta), A\}\$. The hypothetical transmitter picks a symbol \$x\$ uniformly from \$\mathcal{X}\$ and sends it to the receiver.
In a noisy channel, the received symbol will not be \$x\$, it will be (\$x\$ + noise). The noise \$z\$ is a RV chosen from the set \$\mathcal{Z} = \{0,\Delta, ..., (A-\Delta), A\}\$. If \$z\$ is also picked uniformly from this set, it is clear that the capacity will be zero. In other words, there will be no way of recovering \$x\$ from \$y\$, where \$y = x + z\$.
If, on the other hand, \$z\$ is uniformly distributed only over the set \$\mathcal{Z} = \{0, \Delta, 2\Delta\}\$, then the channel has some capacity. This capacity can be shown to be achieved when the transmitted symbol is uniformly distributed over the subset \$\mathcal{X}_s = \{0,3\Delta, ..., (A-3\Delta), A\}\$ and zero elsewhere. In other words, we just separated the symbols in \$\mathcal{X}\$ by \$3\Delta\$, so any symbol \$x\$ we transmit is perfectly recoverable despite the presence of the noise \$z\$.
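To see why separating the symbols by \$3\Delta\$ works, here is a minimal sketch (the values \$\Delta = 1\$ and \$A = 27\$ are assumptions made for illustration): a receiver that simply rounds \$y\$ down to the nearest multiple of \$3\Delta\$ always recovers \$x\$.

```python
import numpy as np

# Assumed toy values: Delta = 1, codewords spaced 3*Delta apart, up to A = 27
Delta = 1.0
codewords = np.arange(0.0, 27.0 + 3 * Delta, 3 * Delta)    # {0, 3, 6, ..., 27}

for x in codewords:
    for z in (0.0, Delta, 2 * Delta):                      # every possible noise value
        y = x + z                                          # received symbol
        x_hat = 3 * Delta * np.floor(y / (3 * Delta))      # round down to nearest codeword
        assert x_hat == x                                  # x is always recovered exactly

print("every codeword is recovered without error")
```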
Our capacity per symbol is the number of bits we get per symbol received. If the number of symbols in the set \$\mathcal{X}_s\$ from which we pick samples uniformly is \$M\$, then a \$\log_2{M}\$-bit number is the shortest bit sequence that can uniquely represent each of the symbols in \$\mathcal{X}_s\$. Our capacity per symbol is thus \$\log_2{M}\$. \$M\$ is called the cardinality of \$\mathcal{X}_s\$ and is denoted \$|\mathcal{X}_s|\$. The value \$\log_2{|\mathcal{X}_s|}\$ is also the entropy of a uniformly distributed RV over the set \$\mathcal{X}_s\$. The analogous formula for a continuous distribution with density \$p(x)\$ is the differential entropy \$h_{p(x)} = - \int p(x) \log p(x)\$.
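As a quick sanity check of the discrete case (a minimal sketch, with \$M = 16\$ as an assumed example value), the entropy of a uniform distribution over \$M\$ symbols comes out to exactly \$\log_2 M\$:

```python
import numpy as np

M = 16                        # assumed example: |X_s| = 16 symbols
p = np.full(M, 1.0 / M)       # uniform pmf over the M symbols
H = -np.sum(p * np.log2(p))   # entropy in bits: -sum p log2 p

print(H, np.log2(M))          # both print 4.0
```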
If a transmitter has the capability to transmit a symbol from a discrete set \$\mathcal{X}\$, but has to do so over a noisy channel which adds uniform noise from a discrete set \$\mathcal{Z}\$, then the zero-error capacity per sample in this case is
$$
C_s \text{ [bits]} = \max_{\mathcal{X}_s \subseteq \mathcal{X}} \left\{ \log_2(\text{Total Number Of Samples in } \mathcal{X}_s) \right\}
$$
So you basically pick the largest subset \$\mathcal{X}_s\$ of \$\mathcal{X}\$ whose symbols are still perfectly recoverable at the receiver in the presence of the noise \$\mathcal{Z}\$. The distribution \$p(x)\$ over \$\mathcal{X}_s\$ at which this capacity is achieved is the uniform distribution, as we have discussed.
\begin{align}
C_s \text{ [bits]} & = \max_{\mathcal{X}_s \subseteq \mathcal{X}} \left\{ \log_2(\text{Total Number Of Samples in } \mathcal{X}_s) \right\} \\
& = \log_2 |\mathcal{X}_s| \\
& = \log_2 \frac{|\mathcal{X}|}{|\mathcal{Z}|} \\
& = \log_2 |\mathcal{X}| - \log_2 |\mathcal{Z}| \\
& = h(\mathcal X) - h(\mathcal Z)
\end{align}
In other words, the zero-error capacity per sample is \$C_s = \max \left( h(X) - h(X | Y) \right) = \max \left( h(\mathcal{X}) - h(\mathcal{Z}) \right)\$. This relationship can be shown to also hold in the general case, and the quantity \$\left( h(X) - h(X | Y) \right)\$ is called the mutual information, usually denoted \$I(X;Y)\$.
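Here is a small numerical sketch of the chain of equalities above, with assumed values \$\Delta = 1\$ and \$A = 29\$ chosen so that \$|\mathcal{X}|\$ is an exact multiple of \$|\mathcal{Z}|\$:

```python
import numpy as np

Delta, A = 1.0, 29.0                         # assumed toy values
X = np.arange(0.0, A + Delta, Delta)         # X = {0, Delta, ..., A},  |X| = 30
Z = np.array([0.0, Delta, 2 * Delta])        # noise set {0, Delta, 2*Delta},  |Z| = 3

X_s = X[::len(Z)]                            # keep every 3rd symbol: spacing 3*Delta, |X_s| = 10

C = np.log2(len(X_s))                        # zero-error bits per symbol
print(C, np.log2(len(X)) - np.log2(len(Z)))  # log2|X_s| equals log2|X| - log2|Z|  (~3.32 bits)
```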
Now that we have covered some basics, we move on to the question of what distribution \$p(x)\$ over \$X_s\$ achieves the capacity if i) the noise \$\mathcal Z\$ is not uniformly distributed but is instead gaussian distributed, and ii) the average power per symbol to be transmitted is limited to \$P\$. The average power is \$E(X^2) = \Sigma x^2p(x) = \sigma ^2 = P\$, which implies that the variance of \$p(x)\$ is fixed at a value of \$\sigma^2\$. This brings us to your second question.
ii) Let \$p(x)\$ be a zero-mean gaussian distribution over the domain \$x \in (-\infty, +\infty)\$ with variance \$\sigma ^2 = P\$. We will show that there is no other distribution over \$x \in (-\infty, +\infty)\$ with the same fixed variance \$\sigma ^2 = P\$ that has a higher entropy than the gaussian distribution. This will prove that if we want to maximise \$\left( h(\mathcal{X}) - h(\mathcal{Z}) \right)\$ over \$x \in (-\infty, +\infty)\$ with a distribution of fixed variance \$\sigma ^2 = P\$, then there is no distribution better than the gaussian distribution.
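For reference (a standard result, not strictly needed for the proof below), plugging the zero-mean gaussian density into \$h_{p(x)} = - \int p(x) \log p(x)\$ gives
\begin{align}
h_{p(x)} = \frac{1}{2} \log_2 \left( 2 \pi e \sigma^2 \right) \text{ bits,}
\end{align}
so the claim is that no other density with variance \$\sigma^2\$ can exceed this value.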
Proof:
The proof relies on two other facts. The first is that for any two probability distributions \$p(x)\$ and \$g(x)\$, the quantity below, called the relative entropy, is always greater than or equal to zero:
\begin{align}
D(g(x) || p(x)) = \int_{-\infty}^{+\infty}{g(x) \log \frac{g(x)}{p(x)}} \geq 0
\end{align}
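A quick way to see this (Gibbs' inequality, a consequence of Jensen's inequality applied to the concave \$\log\$ function) is
\begin{align}
-D(g(x) || p(x)) = \int g(x) \log \frac{p(x)}{g(x)} \leq \log \int g(x) \frac{p(x)}{g(x)} = \log \int p(x) = \log 1 = 0
\end{align}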
and the second is that if \$p(x)\$ is the gaussian and \$g(x)\$ has the same variance as \$p(x)\$ (i.e. if \$\int x^2 g(x) = \int x^2 p(x) = \sigma^2\$), then
\begin{align}
\int g(x) \log p(x) = \int p(x) \log p(x)
\end{align}
To prove the equation above, we can simply evaluate the difference of the two integrals
\begin{align}
\int g(x) \log p(x) - \int p(x) \log p(x) &= \int (g(x) - p(x)) \log p(x)
\end{align}
Now if \$p(x)\$ is a zero-mean gaussian, then \$p(x)\$ is of the form \$Ae^{Bx^2}\$ where \$A = 1 / \sqrt{2\pi \sigma ^2}\$ and \$B = -1/(2\sigma^2)\$. So \$\log p(x)\$ is of the form \$\log_2 (Ae^{Bx^2}) = (\ln A + Bx^2)/ \ln(2) = C + Dx^2\$, where \$C\$ and \$D\$ are constants.
Our integral is then
\begin{align}
\int (g(x) - p(x)) \log p(x) &= \int (g(x) - p(x)) (C + Dx^2) \\
&= C\int g(x) - C\int p(x) + D\int x^2 g(x) - D\int x^2p(x) \\
&= C(1) - C(1) + D\sigma^2 - D\sigma^2 \\
&= 0
\end{align}
This proves that
\begin{align}
\int g(x) \log p(x) = \int p(x) \log p(x)
\end{align}
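As an aside, here is a quick numerical sanity check of this identity (a sketch; the zero-mean Laplace density used for \$g(x)\$ and the value \$\sigma^2 = 2\$ are arbitrary choices, the only requirement being that \$g(x)\$ has the same variance as \$p(x)\$):

```python
import numpy as np
from scipy.integrate import quad

sigma2 = 2.0                                   # common variance (arbitrary choice)
b = np.sqrt(sigma2 / 2.0)                      # Laplace scale so that var = 2*b^2 = sigma2

p = lambda x: np.exp(-x**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)  # gaussian
g = lambda x: np.exp(-np.abs(x) / b) / (2 * b)                            # Laplace, same variance

lhs, _ = quad(lambda x: g(x) * np.log2(p(x)), -np.inf, np.inf)   # int g log p
rhs, _ = quad(lambda x: p(x) * np.log2(p(x)), -np.inf, np.inf)   # int p log p
print(lhs, rhs)                                # both are approximately -2.547
```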
Now, going back to the relative entropy inequality,
\begin{align}
0 &\leq \int_{-\infty}^{+\infty}{g(x) \log \frac{g(x)}{p(x)}}\\
& = \int g(x) \log g(x) - \int g(x) \log p(x) \\
& = \int g(x) \log g(x) - \int p(x) \log p(x) \text{ (from the earlier proof)} \\
& = -h_{g(x)}(\mathcal X) + h_{p(x)}(\mathcal X) \\
\implies & h_{p(x)}(\mathcal X) \geq h_{g(x)}(\mathcal X)
\end{align}
And because the capacity per sample is \$C_s = \max \left( h(\mathcal{X}) - h(\mathcal{Z}) \right)\$, the distribution that maximises \$h(\mathcal X)\$ will maximise the capacity. The proof above is then sufficient to show that one of the distributions with variance \$\sigma^2 = P\$ that maximises the capacity over \$x \in (-\infty, +\infty)\$ is the gaussian distribution.
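As a final numerical illustration of this conclusion (a sketch comparing the differential entropy, in bits, of three zero-mean densities all scaled to variance 1; the gaussian should come out on top):

```python
import numpy as np
from scipy.integrate import quad

def diff_entropy_bits(pdf, lo=-np.inf, hi=np.inf):
    """Differential entropy -int p(x) log2 p(x) dx over [lo, hi]."""
    return quad(lambda x: -pdf(x) * np.log2(pdf(x)), lo, hi)[0]

# Three zero-mean densities, all with variance 1
gauss   = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
laplace = lambda x: np.exp(-np.abs(x) * np.sqrt(2)) / np.sqrt(2)  # var = 2*b^2 = 1 with b = 1/sqrt(2)
a = np.sqrt(3.0)                                                  # uniform on [-a, a] has var a^2/3 = 1
uniform = lambda x: 1.0 / (2 * a)

print(diff_entropy_bits(gauss))            # ~2.047 bits
print(diff_entropy_bits(laplace))          # ~1.943 bits
print(diff_entropy_bits(uniform, -a, a))   # ~1.792 bits -> the gaussian has the largest entropy
```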
For a wireless channel, an expression for the capacity when the noise is gaussian can thus be calculated as shown in the answer HERE.