Can someone explain bit-interleaved coded modulation (BICM) in plain language, focusing on the practical mechanisms and properties rather than the deep analytical details? I found a few papers and articles on the subject, but they were all pitched at people who already have a deep familiarity with these types of schemes. My background is information security and I'm a self-taught electronics hobbyist, so my knowledge of the underpinning concepts is largely restricted to the practical.
To help reduce the scope of an answer, I'll explain my understanding so far.
I'm familiar with the high-level concept of block coding: you split your data into message blocks of some fixed size \$k\$, then you feed each message block into a coding function \$C(m)\$. The coding function is (as far as I know) deterministic, and has an inverse function \$C'(m)\$ for decoding, i.e. \$C'(C(m)) = m\$ for all \$m\$. From what I've read, it seems that the coding is generally referred to as a mapping into an alphabet \$\Sigma^k\$, where \$k\$ indicates the size of that alphabet, from which one could draw a parallel to invertible S-boxes in cryptography. Practically, \$\Sigma^k\$ could be a lookup table (LUT), although if \$k\$ is large it's probably necessary (due to practical constraints) to derive some linear algebraic function that describes \$\Sigma^k\$ instead, and perform those operations - this same issue comes up in cryptography.
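To check that I've got the mapping idea right, here's a rough Python sketch of a block code written purely as a lookup table, using the trivial \$(3,1)\$ repetition code (the names are mine, just for illustration):

    # A toy coding function C(m) expressed as an explicit lookup table:
    # the (3,1) repetition code maps each 1-bit message to a 3-bit codeword.
    ENCODE = {0b0: 0b000, 0b1: 0b111}
    DECODE = {codeword: message for message, codeword in ENCODE.items()}

    for m in (0, 1):
        assert DECODE[ENCODE[m]] == m   # C'(C(m)) = m for every message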
However, I've also seen notation that uses the form \$(n,k,d)_q\$ to describe the properties of a block code, where the alphabet is \$\Sigma^q\$ and \$k\$ is the length of the codeword, which indicates to me that there must be block codes that cannot be expressed as a simple mapping over some alphabet \$\Sigma^k\$. My guess is that block codes where \$C(m)\$ can be expressed simply as mappings over an alphabet (i.e. \$k=q\$) are what I've seen referred to as linear block codes, whereas those which cannot are some other type of block code - perhaps something stateful, with feedback?
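On the "linear algebraic function" point, my (possibly naive) understanding is that for binary linear block codes the encoder really is just a matrix multiplication over GF(2). A rough sketch using the well-known \$(7,4)\$ Hamming code in one common systematic form (numpy used purely for convenience):

    import numpy as np

    # One systematic generator matrix G = [I_4 | P] for the (7,4) Hamming code.
    # Encoding a 4-bit message m is just the product m*G over GF(2).
    G = np.array([[1, 0, 0, 0, 1, 1, 0],
                  [0, 1, 0, 0, 1, 0, 1],
                  [0, 0, 1, 0, 0, 1, 1],
                  [0, 0, 0, 1, 1, 1, 1]])

    def encode(message_bits):            # message_bits: length-4 sequence of 0/1
        return (np.array(message_bits) @ G) % 2

    print(encode([1, 0, 1, 1]))          # -> [1 0 1 1 0 1 0], a 7-bit codeword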
I understand that coding functions are designed such that the coded blocks (codewords) have various desirable properties, such as producing a nearly equal number of 1 and 0 bits in \$C(m)\$ for any \$m\$, to provide DC balance - an example would be 8b/10b coding. I also understand that error detection and correction over noisy channels can be achieved by building redundancy (e.g. parity bits) into the coding.
I know that the rate \$R\$ of a block code is the ratio between the size of the message block and the size of the codeword, i.e. \$R=k/n\$. The lower the rate, the more coded data you must send per unit of message, making it less efficient. There is, then, a balance to be struck between the rate of the block code and the practical properties that can be achieved at a given rate. For example, we could regard an 8-E-1 configuration on a serial port as a block code \$m_{[0\dots7]} \to [0, m, m_0\oplus m_1 \oplus \dots \oplus m_7, 1]\$, i.e. an \$(n,k) = (11,8)\$ code with a rate of \$8/11 \approx 0.727\$, and say that it can detect any single-bit error. Intuitively, the rate can't exceed 1.
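To make that concrete to myself, here's a rough Python sketch of the 8-E-1 frame as a block code (data bits taken LSB-first, as a UART would send them; the function name is my own):

    # 8-E-1 as an (11,8) block code: start bit, 8 data bits, even-parity bit, stop bit.
    def frame_8e1(byte):
        data = [(byte >> i) & 1 for i in range(8)]
        parity = sum(data) % 2            # even parity = XOR of all data bits
        return [0] + data + [parity, 1]   # 11-bit codeword

    codeword = frame_8e1(0xA5)
    print(codeword, "rate =", 8 / len(codeword))   # rate = 8/11 ≈ 0.727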
I've seen the concept of minimum distance \$d\$ described, which is basically the minimum Hamming distance between any two distinct codewords \$C(m_1)\$ and \$C(m_2)\$, evaluated over all pairs of messages (and I guess also all states, if the scheme is stateful). I generally understand the relevance of this metric: the larger the minimum distance, the more bits you'd have to change in any one codeword to produce another valid codeword, so the more bit errors you can detect in a message. From this I presume that a general-case performance metric for any given block code might be \$R\times d\$, since you want to maximise both rate and distance in order to achieve the maximum error-free throughput on a given channel.
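For small codes, I assume the minimum distance can simply be brute-forced; a quick Python check on a toy code (4 data bits plus one even-parity bit) that should give \$d = 2\$:

    from itertools import combinations

    # Brute-force the minimum distance of a tiny single-parity-check code.
    def encode_parity(m):
        bits = [(m >> i) & 1 for i in range(4)]
        return bits + [sum(bits) % 2]     # append even-parity bit

    codewords = [encode_parity(m) for m in range(16)]
    d_min = min(sum(a != b for a, b in zip(c1, c2))
                for c1, c2 in combinations(codewords, 2))
    print("minimum distance:", d_min)     # a single parity bit gives d = 2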
I very roughly grasp the concept of bounds in this context, at least in terms of them being useful metrics for comparing families of block codes, but I'm definitely not familiar with all of the analytical methods that are applied in this area. I suspect that this may be irrelevant for just understanding the actual practical mechanisms in use with BICM.
In terms of the physical layer, it seems that BICM, LDPC, etc. are often applied over QAM, where you have two carriers 90° out of phase with each other and the amplitude of the modulated signal on each carrier can be represented as an axis on an X/Y plot. The in-phase carrier is called \$I\$ and the phase-shifted (quadrature) carrier is called \$Q\$, so it's called an IQ plot. The magnitudes of \$I\$ and \$Q\$ effectively form Cartesian coordinates, although I suppose you could also use polar coordinates instead. If discrete points are chosen on an IQ plot, QAM can be used to transmit the coordinates of those points. If symbols are assigned to those points (e.g. 16 points with values 0 to 15) then the scheme can "point" to symbols. As I understand it, these are called constellations, and the scheme is generally referred to as QAM16 or QAM256 or whatever based on the number of defined points in that constellation.
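Here's how I picture a square 16-QAM constellation in Python, with a Gray-coded 4-bit label on each point (this particular labelling is just one common choice I've seen, not necessarily what any given standard uses):

    # Square 16-QAM: 4 amplitude levels per axis, one 4-bit label per point.
    # Gray coding each axis means adjacent points differ in only one bit.
    levels = [-3, -1, 1, 3]
    gray2 = [0b00, 0b01, 0b11, 0b10]            # 2-bit Gray code

    constellation = {}                          # label -> (I, Q)
    for i_idx, i_amp in enumerate(levels):
        for q_idx, q_amp in enumerate(levels):
            label = (gray2[i_idx] << 2) | gray2[q_idx]
            constellation[label] = (i_amp, q_amp)

    print(constellation[0b0000])                # corner point (-3, -3)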
The more symbols you cram into a constellation, the more information you can convey at a given symbol rate, but this necessitates a closer distance between points and therefore increases the likelihood of an error being conveyed. Apply a block coding scheme with error correction on top of this and you can improve the error rate, albeit at the cost of the block code's rate. The performance of the overall system then becomes largely a matter of balancing/optimising the Euclidean distance between points on the QAM constellation, the \$\frac {dV} {dt}\$ (and thus bandwidth and Nyquist rate) generated by the maximum change in \$I\$ and \$Q\$ magnitudes in a given symbol period, the Hamming distance between codewords, the error detection and correction capabilities of the block code, the rate of the block code, and other properties that affect electrical and RF performance on the physical layer.
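To convince myself of the "closer points" part, a quick comparison of nearest-neighbour spacing for QPSK versus 16-QAM when both are scaled to the same average power (my own normalisation, purely for illustration):

    import itertools, math

    # Minimum Euclidean distance between points, normalised to unit average energy.
    def min_distance(points):
        scale = math.sqrt(sum(i*i + q*q for i, q in points) / len(points))
        return min(math.dist((i1/scale, q1/scale), (i2/scale, q2/scale))
                   for (i1, q1), (i2, q2) in itertools.combinations(points, 2))

    qpsk  = [(i, q) for i in (-1, 1) for q in (-1, 1)]
    qam16 = [(i, q) for i in (-3, -1, 1, 3) for q in (-3, -1, 1, 3)]
    print(min_distance(qpsk), min_distance(qam16))   # ≈ 1.414 vs ≈ 0.632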
I've seen it described that QAM schemes are beneficial because they allow the effective information rate of the link to be multiplied by the number of bits carried per symbol (i.e. \$\log_2\$ of the number of points in the constellation), thus making for faster data transmission at a particular Nyquist frequency. I've also seen the term "capacity-approaching code" used to refer to block codes that approach the Shannon limit, i.e. the theoretical maximum rate at which information can be transported over a link. The concept seems simple enough but I'm not sure how this is achieved in practice.
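If I've understood it correctly, for an additive-white-Gaussian-noise channel that limit is \$C = B\log_2(1 + S/N)\$ for bandwidth \$B\$ and signal-to-noise ratio \$S/N\$, so for example a 1 MHz channel at 20 dB SNR tops out at roughly \$10^6 \times \log_2(101) \approx 6.66\$ Mbit/s no matter what modulation and coding is used - and "capacity-approaching" presumably means getting the usable rate close to that ceiling.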
I've also read about parity-check matrices, which I guess are just convenient ways of modelling the parity rules of some given block code? So if we pack the codeword bits \$x_i \in \{0,1\}\$ into an integer \$\sum_i x_i 2^i\$, then each row in a parity-check matrix defines which bit positions must sum (mod 2) to zero, i.e. have even parity. Or, programmatically (Python):
    def is_valid_codeword(value, parity_check_rows):
        # Each row is a bitmask selecting the bit positions that must have even parity.
        for row in parity_check_rows:
            if bin(value & row).count("1") % 2 != 0:
                return False
        return True
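For instance, with the first code bit taken as the most significant bit (my own ordering, purely for illustration), the three parity checks of the \$(7,4)\$ Hamming code become:

    # Parity-check rows of the (7,4) Hamming code, written as 7-bit masks.
    hamming_rows = [0b1101100, 0b1011010, 0b0111001]

    print(is_valid_codeword(0b1011010, hamming_rows))   # True: a valid codeword
    print(is_valid_codeword(0b1011011, hamming_rows))   # False: one bit flipped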
The example on the Wikipedia page seems reasonably straightforward. I presume that these come in handy if you're trying to build some kind of combinatorial logic in an FPGA or ASIC that needs to quickly distinguish between valid and invalid symbols, e.g. via a LUT.
At this point we start to reach the limits of my understanding. I've seen reference to turbo codes, LDPC, convolutional codes, recursive codes, iterative decoding, Tanner graphs, belief propagation, non-uniform constellations (NUCs), factor graphs, maximum likelihood decoding, decoder complexity, trellis-coded modulation, and Rayleigh fading channels. I'm quickly getting lost in the minutiae and am struggling to attain a big-picture understanding. The most readable paper on BICM that I could find was "Bit-Interleaved Coded Modulation" (2008) by Guillén i Fàbregas, Martinez, and Caire, but this quickly shot into the realm of formal hieroglyphics and left me behind.
Given this foundation of understanding, can someone explain how BICM works (and some key points about what makes it useful) without going off the deep end with unfamiliar notation and terms?