The standard is well designed, and it has subtle details that ease implementation. For example, when rounding, a carry out of the mantissa can simply overflow into the exponent, and integer comparisons can be used for floating-point compares...
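To make those two tricks concrete, here is a minimal sketch on binary32 bit patterns. The helper names (`float_bits`, `bits_float`, `ordered_key`) are made up for this illustration, and NaNs and the -0.0 versus +0.0 case are deliberately left out.

```python
import struct

def float_bits(x: float) -> int:
    """Return the binary32 bit pattern of x as an unsigned 32-bit integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_float(b: int) -> float:
    """Return the binary32 value encoded by the 32-bit pattern b."""
    return struct.unpack("<f", struct.pack("<I", b))[0]

def ordered_key(x: float) -> int:
    """Map x to an integer whose ordering matches the float ordering.

    Sign-magnitude is turned into a monotonic unsigned key: negative values
    have all bits flipped, non-negative values get the sign bit set.
    Assumption of this sketch: no NaNs, and -0.0 sorts just below +0.0
    although the standard treats them as equal.
    """
    b = float_bits(x)
    return b ^ 0xFFFFFFFF if b & 0x80000000 else b | 0x80000000

# Rounding carry: 0x3FFFFFFF has an all-ones mantissa (just below 2.0).
# Adding 1 to the integer pattern carries into the exponent field and
# lands exactly on 2.0, with no special case needed.
just_below_two = 0x3FFFFFFF
after_carry = bits_float(just_below_two + 1)

# The same carry takes the largest finite value to infinity.
after_overflow = bits_float(0x7F7FFFFF + 1)

# Integer compare: sorting by the integer key gives the float order,
# including denormals (1e-40 is denormal in binary32) and infinity.
samples = [2.5, -1.0, 1e-40, float("inf"), -3.5, 0.0, 1.0, -1e-40]
sorted_by_key = sorted(samples, key=ordered_key)
```

For positive values the raw bit patterns already compare correctly as integers; the key transformation is only needed to handle the sign bit.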
But an FPU is a big heap of combinational mess. Besides adding, multiplying and dividing, there are barrel shifters to align mantissas, leading-zero counters, rounding logic, flags (inexact, overflow, ...), and NaNs and denormals, which need additional hardware for calculations (particularly for mul/div) or must at least trigger an exception for software emulation.
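As a rough illustration of how much happens even in the easy case, here is a toy binary32 adder in software. It is only a sketch under heavy assumptions: both inputs are positive normal numbers, the result does not overflow, and the only rounding mode is round-to-nearest-even. The function names are invented for this example; real hardware also has to handle signs, cancellation (where the leading-zero counter comes in), zeros, denormals, infinities, NaNs and flags.

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def unpack32(x: float):
    """Split a binary32 value into (biased exponent, 23-bit fraction)."""
    b = struct.unpack("<I", struct.pack("<f", x))[0]
    return (b >> 23) & 0xFF, b & 0x7FFFFF

def add_positive_normals(a: float, b: float) -> float:
    """Toy adder: positive normal binary32 inputs, round to nearest even."""
    ea, fa = unpack32(a)
    eb, fb = unpack32(b)
    assert 0 < ea < 255 and 0 < eb < 255, "sketch handles normal numbers only"
    # Restore the implicit leading 1 and append guard, round, sticky bits.
    sa = (fa | 0x800000) << 3
    sb = (fb | 0x800000) << 3
    if ea < eb:
        ea, sa, eb, sb = eb, sb, ea, sa
    # Alignment (the barrel shifter): shift the smaller operand right and
    # OR every bit that falls off into the sticky bit.
    d = ea - eb
    if d > 0:
        lost = sb & ((1 << d) - 1)
        sb = (sb >> d) | (1 if lost else 0)
    s, e = sa + sb, ea
    # Normalization: a same-sign add can carry out by at most one bit.
    if s >> 27:
        s = (s >> 1) | (s & 1)
        e += 1
    # Rounding on the three extra bits (4 means exactly half an ulp).
    low = s & 7
    s >>= 3
    if low > 4 or (low == 4 and (s & 1)):
        s += 1
        if s >> 24:  # rounding carried out of the mantissa
            s >>= 1
            e += 1
    assert e < 255, "overflow is not handled in this sketch"
    bits = (e << 23) | (s & 0x7FFFFF)
    return struct.unpack("<I".replace("I", "f"), struct.pack("<I", bits))[0]
```

For example, `add_positive_normals(1.0, 2.0**-24)` is an exact tie and rounds to the even neighbour `1.0`, while `add_positive_normals(16777215.0, 1.0)` carries out and renormalizes to `16777216.0`.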
Most FPUs also need to do conversions to/from integer and between formats (float, double). That conversion hardware can mostly reuse the existing floating-point datapath, but it requires additional multiplexers and special cases...
Then there is pipelining. Depending on the transistor budget and target frequency, either add/sub/mul all have the same throughput, or double precision is slower, which adds complexity to the pipeline. Modern FPUs now have a pipelined (fused) multiply-add operator.
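One reason the multiply-add operator is more than a multiplier glued to an adder: in the fused form the product is kept at full width and rounded only once, after the addition. The sketch below emulates that single rounding in software with exact rationals; `fma_exact` is a hypothetical helper written for this example, not a hardware model or a library function.

```python
from fractions import Fraction

def fma_exact(a: float, b: float, c: float) -> float:
    """Compute a*b + c exactly, then round once to the nearest double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

# Exact product is 1 - 2**-60, which does not fit in a double.
unfused = a * b + c          # product rounds to 1.0 first, so the sum is 0.0
fused = fma_exact(a, b, c)   # single rounding keeps the tiny residue -2**-60
```

The unfused version loses the result entirely, which is why keeping the wide intermediate product is worth the extra datapath width.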
Division is always iterative: it can be a separate unit, or it can reuse the multiplier-adder for Newton-Raphson or Goldschmidt iterations. And while you are busy making a divider, you look for ways to tweak it for square roots...
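Here is a minimal sketch of the Newton-Raphson approach, using only multiplies and adds to refine an estimate of the reciprocal. The function name, the linear initial estimate and the iteration count are choices made for this illustration (hardware typically starts from a small lookup table), and the result is not guaranteed to be correctly rounded, which is exactly the hard part in a real design.

```python
import math

def nr_divide(n: float, d: float, iterations: int = 4) -> float:
    """Approximate n / d for finite d > 0 via a Newton-Raphson reciprocal.

    Sketch only: no handling of zero, negative, infinite or NaN divisors,
    and no final correction step to get the last bit exactly right.
    """
    # Scale the divisor into [0.5, 1): d = m * 2**e.
    m, e = math.frexp(d)
    # Linear initial estimate of 1/m on [0.5, 1), relative error <= 1/17.
    x = 48.0 / 17.0 - (32.0 / 17.0) * m
    for _ in range(iterations):
        # Each step is a multiply and a multiply-add; the relative error
        # is roughly squared every iteration.
        x = x * (2.0 - m * x)
    # Undo the scaling: n / d = n * (1/m) * 2**-e.
    return math.ldexp(n * x, -e)

quotient = nr_divide(1.0, 3.0)
```

Starting from an error of about 1/17, four iterations bring it below double precision, so `quotient` agrees with `1.0 / 3.0` to within a few units in the last place.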
Validation is complex because there are many corner cases. There are a few systematic test suites with test patterns for "interesting" cases across all the rounding modes, but things like fast multipliers or dividers are too complex to test exhaustively.
Iterative dividers can have non-obvious bugs (for example, the famous Pentium bug in its SRT radix-4 divider), and multiplicative (Newton) dividers are difficult to test for exact rounding (there were some such bugs in old IBM computers).
Formal methods are now used to prove these parts.
Modern FPUs also implement SIMD hardware, where FP operators are instantiated several times for parallel processing.
There is also the case of the x87 and MC68881/2 FPUs, which can perform decimal conversions and compute hyperbolic and trigonometric functions. These operations are microcoded and built from the basic FP operators; they are not directly implemented in hardware.