The general approach to measuring these figures is:
- Establish a test plan with sufficient coverage.
- Execute the formal test plan (automated or manual tests), and record the failed tests and, where applicable, the bug reports issued after root cause analysis.
- Compare these figures with the KLOC counts, which can be computed automatically from the source code (a minimal sketch of this bookkeeping follows the list).
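
To make that concrete, here is a minimal Python sketch of the bookkeeping; the `Release` class and `defect_density` function are invented for illustration, not taken from any standard tool:

```python
# Minimal sketch of the bugs-per-KLOC bookkeeping; all names here are
# illustrative, not taken from any standard tool.
from dataclasses import dataclass

@dataclass
class Release:
    name: str
    confirmed_bugs: int   # bugs confirmed after root cause analysis of failed tests
    kloc: float           # thousands of lines of code, per your chosen counting rule

def defect_density(release: Release) -> float:
    """Bugs per KLOC for a single release."""
    return release.confirmed_bugs / release.kloc

releases = [
    Release("v1.0", confirmed_bugs=42, kloc=120.0),
    Release("v1.1", confirmed_bugs=35, kloc=128.5),
]
for r in releases:
    print(f"{r.name}: {defect_density(r):.2f} bugs/KLOC")
```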
Needless to say: with a manual, ad-hoc test approach you won't get consistent bug numbers; as you've mentioned, many bugs aren't discovered immediately. However, formal test plans with unit, integration and acceptance tests are very common for larger and mission-critical software. TDD puts even more emphasis on the tests, providing very detailed unit tests that can check and diagnose the promised functionality and all the invariants that your code is supposed to respect.
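
To give an idea of what such an invariant-checking unit test looks like, here is a small sketch; the `Account` class and its tests are entirely invented for illustration:

```python
# Hypothetical TDD-style tests checking both the promised behaviour and an
# invariant; the Account API below is invented for this example.
import unittest

class Account:
    """Toy account whose invariant is a non-negative balance."""
    def __init__(self, balance: float = 0.0):
        self.balance = balance

    def withdraw(self, amount: float) -> None:
        if amount < 0 or amount > self.balance:
            raise ValueError("invalid withdrawal")
        self.balance -= amount

class AccountTest(unittest.TestCase):
    def test_withdraw_reduces_balance(self):
        acc = Account(100.0)
        acc.withdraw(30.0)
        self.assertEqual(acc.balance, 70.0)          # promised functionality

    def test_overdraft_is_rejected_and_invariant_holds(self):
        acc = Account(10.0)
        with self.assertRaises(ValueError):
            acc.withdraw(50.0)
        self.assertGreaterEqual(acc.balance, 0.0)    # invariant: never negative

if __name__ == "__main__":
    unittest.main()
```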
There's also the question of whether the results of preventive tests run by a developer before submitting their code for integration should be counted or not. The same question applies to issues discovered in peer reviews.
The definition of a bug is also an issue. The word is overused in everyday language, and the boundary is not clear: is it a non-compliance of the code with the specification? Or does it also cover issues caused by unclear requirements? Here, standards with precise definitions, such as ISO 9126, can really help.
Finally, the KLOC is a concept that was introduced at a time when the dominant languages were line-oriented (e.g. Fortran, COBOL). So it's a real question nowadays what should count as a LOC: empty lines? Comment lines? Conditionally compiled lines? Active lines, or active instructions? Etc.
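
To make the counting question concrete, here is a small sketch; the blank/comment/code classification below is just one possible convention, not a standard:

```python
# Rough sketch showing how the chosen counting rule changes the LOC figure;
# this blank/comment/code split is one possible convention among many.
def count_lines(source: str, comment_prefix: str = "#") -> dict:
    blank = comment = code = 0
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            blank += 1
        elif stripped.startswith(comment_prefix):
            comment += 1
        else:
            code += 1
    return {"blank": blank, "comment": comment, "code": code}

sample = """\
# configuration module

TIMEOUT = 30  # seconds
RETRIES = 3
"""
print(count_lines(sample))   # {'blank': 1, 'comment': 1, 'code': 2}
```

Depending on whether you count only the "code" bucket, or code plus comments, or everything, the same file yields quite different LOC figures.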
All this being said, your absolute figures will of course vary depending on your precise definitions and methodology. But if you remain consistent, interesting facts may emerge when you look at the evolution of these metrics rather than at the absolute figures.
There are companies that keep statistics on a huge number of software projects and have developed predictive models that estimate the bug rate based on the evolution of the metric over the course of a project. They then use this prediction when deciding whether or not to release to market (I think I read a paper from HP on this some years ago, but I couldn't find it again). Such predictions of course only have statistical value: the fact that the model is meaningful in general doesn't prevent a particular project from completely contradicting it.
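
As a purely illustrative and much-simplified sketch of that kind of trend-based prediction (the real models are certainly far more elaborate than a straight line), assuming per-release defect densities as input:

```python
# Illustrative only: fit a least-squares line to past defect densities and
# extrapolate one release ahead. Not a real industrial prediction model.
def predict_next(densities: list[float]) -> float:
    """Project the defect density of the next release from past releases."""
    n = len(densities)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(densities) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, densities))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept + slope * n   # projected density for the next release

history = [0.35, 0.31, 0.28, 0.26]   # bugs/KLOC over the last four releases
print(f"projected for next release: {predict_next(history):.2f} bugs/KLOC")
```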
Personally, I'm not sure these prediction methods still make sense in an era of agile and TDD, where bugs are prevented in early stages and on the fly. However, I have to admit that introducing such structured metrics on subcontracted projects (i.e. software built according to well-specified requirements) made it possible to quickly spot and address reliability issues with some contractors.