The COCOMO and NASA datasets are widely used in most systematic literature reviews about software effort estimation, and I want to know why. I have read many papers that use these two datasets, but no author gives any justification for the choice. COCOMO/NASA are publicly available in the PROMISE repository; is there any other valid reason? Thanks
3 Answers
Occasionally I will attempt to answer the questions that come up on this site asking about academic research to back up theory X or Y.
What I overwhelmingly find is a huge gap between academic studies and common knowledge in the software industry, both in the approach to testing and building theories and in the results of those experiments.
I would hazard a guess that 99.999% of companies simply don't have a consistent enough approach, or enough accumulated data, to even attempt a scientific study of their estimation approach.
A common thing I see in academic papers now is to look at open source projects, as these are perhaps more representative of modern software development and have tonnes of publicly accessible data on commits, bugs, releases, etc.
There is no real incentive for commercial companies to release this kind of information.
So overall I would guess that you are correct. These studies are often referenced simply because they are some of the few publicly accessible.
From my experience I would also add that they are almost certainly wrong. You can see that COCOMO, for example, boils down to lines of code, whereas I think most programmers would agree that the number of meetings, and the delays waiting for meetings, has a far greater effect on completion time.

The key benefit of the COCOMO and the NASA datasets is that they are significant estimation datasets that are publicly available.
I will provocatively claim that it’s the only benefit. My key arguments here are that:
- COCOMO is a model based on statistical correlations between a couple of software attributes and SLOC (source lines of code), which is assumed to be representative of development effort. But SLOC is not a relevant metric for effort estimation in the age of OOP.
- The datasets used are relatively old (from the 1970s and 1980s, as far as I know), and no longer representative of current software challenges, development tools, and practices.
- The COCOMO estimation model assumes the requirements are known in advance, which doesn't fit, for example, modern agile approaches.
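To see how strongly the model leans on SLOC, here is a minimal sketch of the Basic COCOMO formulas in Python. The coefficients are the published Basic COCOMO values for "organic" projects; the 32 KLOC figure is just an illustrative input, not from the question.

```python
# Basic COCOMO, "organic" project class: effort and schedule are pure
# functions of KLOC (thousands of source lines of code) -- nothing else.

def basic_cocomo_effort(kloc: float, a: float = 2.4, b: float = 1.05) -> float:
    """Estimated effort in person-months for `kloc` thousand SLOC."""
    return a * kloc ** b

def basic_cocomo_schedule(effort_pm: float, c: float = 2.5, d: float = 0.38) -> float:
    """Estimated development time in months, derived from effort alone."""
    return c * effort_pm ** d

effort = basic_cocomo_effort(32)        # a hypothetical 32 KLOC project
months = basic_cocomo_schedule(effort)
print(f"effort ~ {effort:.1f} person-months, schedule ~ {months:.1f} months")
```

Note that team culture, requirements churn, and coordination overhead appear nowhere: size in lines of code is the only input.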
There are not many alternative datasets of comparable size, because there are not many big organisations that have a sufficient number of big projects, that are able to apply systematic measurement criteria in a consistent manner, that have a homogeneous project population not skewed by differences in individual team cultures, ... and that are willing to share such sensitive information (which could be used for competitive analysis).

I studied the COCOMO technique in college. Outside of maths exercises I have never seen it used anywhere since, even on projects following a waterfall model.
Adding to what Christopher wrote, the number of lines of code nowadays can be a function of a code formatter, which makes the metric meaningless. For example, compare autopep8 output with a 79-column limit versus a 119-column limit.
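As a toy illustration of the formatter effect (the function call here is made up), the same statement wrapped for a narrow line limit counts as several "lines of code", while the wide-limit version counts as one:

```python
# The same call, once wrapped as a formatter might for a 79-column limit,
# once on a single line as allowed by a 119-column limit.
narrow = """result = compute_totals(orders,
                        discounts,
                        taxes,
                        shipping)"""

wide = "result = compute_totals(orders, discounts, taxes, shipping)"

print(len(narrow.splitlines()), "lines vs", len(wide.splitlines()), "line")
```

Identical behaviour, yet a SLOC-based model would rate one version as four times the "size" of the other.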
I suggest ignoring COCOMO.
