Metrology-grade calibration standards are machined to very tight tolerances to provide precise and repeatable results when performing an SOLT calibration. At lower frequencies (a few GHz and below) this matters less, but as the measurement frequency increases it becomes increasingly important, particularly for the phase accuracy of the measurement.
Consider what an SOLT calibration is fundamentally doing. If you're not familiar with it, research the 12-term error model for vector network measurements. The SOLT calibration method allows for the mathematical derivation of these 12 error terms, assuming it is possible to measure an Open, Short, Load and Thru whose behavior is exactly known at the reference plane of your DUT. For the math to work out, each standard is measured, and the raw measurement of the standard is compared to the standard's defined characteristic (for the open/short, this is typically described by capacitance/inductance coefficients of a polynomial). A perfect physical standard will have exactly the response described by the polynomial; since this response is known, the raw measurement can be compared to the known value, and the difference is used to determine the error terms of the test setup.
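To make that bookkeeping concrete, here's a minimal sketch (Python/NumPy; the function names and the simplified one-port formulation are mine, not any instrument's actual implementation) of the reflection-only subset of the 12-term model. The three error terms for one port (directivity, source match and reflection tracking) fall out of a per-frequency linear solve using the three reflection standards:

```python
import numpy as np

def solve_one_port_terms(gamma_known, gamma_raw):
    """Solve directivity (e00), source match (e11) and reflection tracking
    (e10*e01) from the Open, Short and Load at one frequency point.

    gamma_known : (3,) complex -- modelled reflection of the standards
                  (from their polynomial/offset definitions)
    gamma_raw   : (3,) complex -- uncorrected measurements of the same standards
    """
    a, m = np.asarray(gamma_known), np.asarray(gamma_raw)
    # Error model: m = e00 + e11*a*m - delta*a, with delta = e00*e11 - e10*e01,
    # which is linear in the unknowns (e00, e11, delta).
    A = np.column_stack([np.ones(3), a * m, -a])
    e00, e11, delta = np.linalg.solve(A, m)
    return e00, e11, e00 * e11 - delta      # last value is the tracking term

def correct_reflection(gamma_raw, e00, e11, tracking):
    """Move a raw DUT reflection measurement to the calibrated reference plane."""
    return (gamma_raw - e00) / (tracking + e11 * (gamma_raw - e00))
```

The full two-port, 12-term procedure adds the Thru (and optionally isolation) measurements for the transmission terms, but the underlying comparison of raw measurement against defined behavior is the same idea.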
Assuming everything is perfect, your reference plane will be well defined once the error terms are applied to the raw measurement, and the measurement you acquire will be referenced to the DUT plane. Of course, nothing is perfect, and there are tolerances in calibration standards. For an SOLT calibration, a "perfect" open and short will have exactly their specified offset lengths. Any imprecision in the distance between the cable reference plane and the actual open/short termination will introduce uncertainty in the standard measurement, which results in a poorly-defined reference plane. Similarly, if the load standard isn't close to the specified system impedance (i.e. it has a poor match), uncertainty in the error terms will be introduced. All of these uncertainties arise whenever the manufactured connector strays from its designed values.
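For reference, a typical cal-kit definition (the exact form varies by vendor, so treat this as a representative sketch rather than any specific kit's model) describes the open as an offset line terminated in a frequency-dependent fringing capacitance, with an analogous inductance polynomial for the short; an error \$\delta l\$ in the machined offset then shows up directly as a reflection phase error of roughly \$2\beta\,\delta l\$:

$$\Gamma_{\mathrm{open}}(f) \approx e^{-2(\alpha + j\beta)\,l_{\mathrm{offset}}}\cdot\frac{1 - j\omega C(f)\,Z_0}{1 + j\omega C(f)\,Z_0},\qquad C(f) = C_0 + C_1 f + C_2 f^2 + C_3 f^3$$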
Especially for high frequency measurements, "an open is an open is an open" is not at all a true statement. Quick example: consider a coaxial RF connector interface with a velocity factor of \$0.9\$, and suppose we want to measure the response of our DUT at \$f_0 = 40\ \mathrm{GHz}\$.
Our wavelength ends up being:
$$ \lambda = \frac{v_p}{f_0} = \frac{0.9\cdot3\times10^{8}\ \frac{\mathrm{m}}{\mathrm{s}}}{40\times10^{9}\ \mathrm{s}^{-1}} = 6.75\ \mathrm{mm}$$
Divide that by 360 and each \$18.75\ \mathrm{\mu m}\$ of physical length equates to 1 degree of phase. So for every 20 microns your open standard's connector is off from its specified offset length, you're adding over a degree of phase error to the measurement of the calibration standard at the frequency of interest. This isn't even considering the loss of the cable and the standard's connection interface (which adds uncertainty to the magnitude of the measurement). Now consider that the short and the load standards have their own uncertainties, and all of these uncertainties compound.
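A rough simulation, reusing the one-port model sketched above, ties these pieces together (NumPy; the error-term values, the \$20\ \mathrm{\mu m}\$ slip and the \$-35\ \mathrm{dB}\$ load are illustrative assumptions, not data from any real kit): calibrate against the ideal standard definitions while the "physical" standards are slightly off, then correct a DUT whose true reflection is known and look at the leftover error.

```python
import numpy as np

f0 = 40e9                             # frequency from the example above, Hz
vf = 0.9                              # velocity factor from the example above
beta = 2 * np.pi * f0 / (vf * 3e8)    # propagation constant, rad/m

def measure_raw(gamma_actual, e00=0.05, e11=0.10, tracking=0.9):
    """Forward one-port error model: the raw reflection the receiver sees."""
    return e00 + tracking * gamma_actual / (1 - e11 * gamma_actual)

def solve_terms(gamma_assumed, gamma_raw):
    """Per-frequency linear solve for (e00, e11, e10*e01) from three standards."""
    a, m = np.asarray(gamma_assumed), np.asarray(gamma_raw)
    A = np.column_stack([np.ones(3), a * m, -a])
    e00, e11, delta = np.linalg.solve(A, m)
    return e00, e11, e00 * e11 - delta

def correct(gamma_raw, e00, e11, tracking):
    """Apply the solved terms to a raw DUT measurement."""
    return (gamma_raw - e00) / (tracking + e11 * (gamma_raw - e00))

# "Physical" standards: open/short sit 20 um farther out than specified (the
# reflected wave traverses that extra length twice), load is -35 dB, not perfect.
dl = 20e-6
actual = np.array([+1 * np.exp(-2j * beta * dl),    # open
                   -1 * np.exp(-2j * beta * dl),    # short
                   10 ** (-35 / 20)])               # load
assumed = np.array([+1.0, -1.0, 0.0])               # what the cal-kit file claims

terms = solve_terms(assumed, measure_raw(actual))

# Correct a DUT whose true reflection we know, and look at what is left over.
gamma_true = 0.5 * np.exp(1j * np.deg2rad(45))
gamma_corr = correct(measure_raw(gamma_true), *terms)
print("magnitude error:", abs(abs(gamma_corr) - abs(gamma_true)))
print("phase error (deg):", np.angle(gamma_corr / gamma_true, deg=True))
```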
In addition to calibration accuracy, it is important for a calibration standard to have repeatable results. If a VNA is calibrated with the same standards day-in and day-out, and each calibration results in wildly different calculated error terms, the measurements that come out of that instrument will not be very useful if accurate characterization of the DUT is the goal. Metrology-grade standards, in addition to being accurate to their specified values, are also typically built with tight tolerances to ensure repeatable characteristics after many connect/disconnect cycles (assuming they are used properly, i.e. torqued correctly, kept clean, etc.).
In short, all of the extra machining precision required to provide accurate and repeatable calibration standards, combined with the relatively small market (i.e. low economy of scale) for metrology-grade components, equates to an expensive device that is seemingly very "simple."
Some decent further reading is the following presentation from Maury Microwave that describes some of the important metrics for metrology-grade connectors (pin depth, concentricity, repeatability, etc.): https://www.maurymw.com/Support/pres/connectors-precision_or_not.pdf
As @mkeith alluded to in the comments to your question, it's certainly possible to make one's own standards (the load is the easiest, since if it has a reasonably good return loss it won't affect the calibration very much). It really boils down to a trade-off of how much accuracy is required. Ballpark area on a Smith chart for a low-frequency passive device? Probably fine. Trying to characterize e.g. the impedance of a high-frequency transistor for end-use in a multi-stage power-amplifier design? Not so fine.
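To put a rough number on "decent return loss": when the load is assumed ideal during calibration, whatever reflection the real load has gets absorbed into the directivity term, and that residual roughly sets the floor for measuring small reflections afterwards. A quick back-of-the-envelope sketch (Python; the return-loss figures are illustrative assumptions, not specs of any particular product):

```python
# Rough rule of thumb: a load standard assumed to be ideal leaves its own
# reflection behind as residual directivity, limiting how small a DUT
# reflection can be trusted after calibration. Values below are illustrative.
loads_return_loss_db = {"metrology-grade load": 45, "decent homemade load": 30}

for name, rl_db in loads_return_loss_db.items():
    gamma = 10 ** (-rl_db / 20)   # linear reflection-coefficient magnitude
    print(f"{name}: |Gamma| ~ {gamma:.4f}, reflection floor around -{rl_db} dB")
```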