Is 500 million lines of code even remotely possible?

Question

The New York Times is reporting that the Healthcare.gov website contains "about 500 million lines of software code." This number, attributed to "one specialist" and widely repeated across the interwebs, seems incredibly far-fetched (even assuming a large fraction of that number includes standard libraries). If this is an accurate estimate, it would truly be staggering (as this fascinating infographic vividly reveals). I realize StackExchange:Programmers isn't Snopes.com, but I'd like to find out if anyone here believes this is even remotely possible. I'd like to know if there is a plausible system of accounting (using examples from publicly available data, if possible) that could lead someone to conclude that such an estimate is within the realm of reason. How could a codebase (by any measure) sum up to such an exorbitant number of code lines?

Maybe? What gets included in the line count? Just compilable/interpretable application code? What about SQL DDL files, and DML files used to build and populate the database? Would an XSLT file count as code, even though it's usually treated as a resource file? Are resource/config files counted? If so, and if you have copies of environment-specific files, does each copy count separately? Would generated code count? Generated code isn't always line-count efficient but maybe it shouldn't be counted since it's not "written", but (re)generated when necessary... — FrustratedWithFormsDesigner, Nov 04 '13 at 22:35
Hmm... y'know, maybe that supposed code bloat is really where the NSA is hiding all their snooping software. ;) — FrustratedWithFormsDesigner, Nov 04 '13 at 22:42
@FrustratedWithFormsDesigner: Go ahead -- count it all. I still think getting all the way to 500M LOC (over 10x the size of Windows 7) seems preposterous. (Not to mention impossible). — kmote, Nov 04 '13 at 22:44
Given a good code generator I can produce any amount of perfectly valid code lines you like. I charge a dollar per line, discounts possible. Deal? ;-) — JensG, Nov 04 '13 at 22:48
You might wish to look at [SLOC](http://en.wikipedia.org/wiki/Source_lines_of_code) "Another increasingly common problem in comparing SLOC metrics is the difference between auto-generated and hand-written code. Modern software tools often have the capability to auto-generate enormous amounts of code with a few clicks of a mouse." -- I had an Apache Axis tool that generated a Java class file that was 4 megabytes in size from one rather small WSDL. — , Nov 04 '13 at 22:56
It could be a typo, as they add "By comparison, a large bank’s computer system is typically about one-fifth that size." 100 million LOC is hardly "typical", even at a "large bank" (a bank with _that_ inefficient a codebase would probably cease to remain "large" very soon) — gnat, Nov 04 '13 at 22:57
Why not - 1 million is very common, and you will find programmers here working on 10 million. I work on 5 million SLOC on one system, and our company has five major systems (not including 'standard' MIS stuff like email and SharePoint). Programming is not our core business. Ask why 500M would not be possible - I can find no compelling reason not to believe it. — mattnz, Nov 05 '13 at 00:58
You cannot count machine generated code in SLOC (but the media would). That's like counting 1 C++ source line as 1000 SLOC because it created 1000 assembly instructions. — mattnz, Nov 05 '13 at 01:00
It's actually very easy to get that large - someone just used #import MouseGenome. Some people just don't understand the value of compiling from source. — BrianH, Nov 05 '13 at 05:04

score 3 · Answer 1 · answered Nov 04 '13 at 23:06

I'm inclined to believe it. For a very generous definition of "the Healthcare.gov website."

The software I work on has almost 1.1 million lines checked in to trunk (according to Subversion's stats), and that's with just 4 in-house developers. The largest single chunk of that (about a quarter of a million lines) is simply auto-generated code from including a reference to eBay's web service. Add another 150k for the various other autogenerated web services together.

Our database is relatively small, and despite my best efforts, the large majority of it is still using DBF tables. The portion of it that's using EntityFramework is another 11k lines. The web database's Linq2Sql project weighs in at 28k. The sum total of all the JavaScript is somewhere around 46k (including both minified and unminified versions in that total).
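To make the accounting above easier to follow, here is a minimal sketch that adds up the figures quoted in this answer. The numbers are the approximate ones given in the text; the dictionary labels and variable names are just illustrative, and the "remainder" is simply whatever the answer does not itemize, not a claim that it is all hand-written.

```python
# Approximate line counts as quoted in the answer.
TOTAL_LINES = 1_100_000  # "almost 1.1 million lines" checked in to trunk

# Chunks the answer itemizes (labels are illustrative, not official names).
itemized_chunks = {
    "eBay web service reference (auto-generated)": 250_000,
    "other auto-generated web services": 150_000,
    "EntityFramework portion": 11_000,
    "Linq2Sql web database project": 28_000,
    "JavaScript (minified + unminified)": 46_000,
}

itemized = sum(itemized_chunks.values())
remainder = TOTAL_LINES - itemized  # everything the answer does not break out
share = itemized / TOTAL_LINES

print(f"itemized:  {itemized:,}")
print(f"remainder: {remainder:,}")
print(f"itemized share of trunk: {share:.0%}")
```

On these rough numbers, a bit under half of the trunk is accounted for by the itemized chunks, most of it auto-generated web service code.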

Again, this is 4 developers over something like 10 years (although it only really started exploding a few years ago). It doesn't include much in the way of unit tests, database scripting (we prefer code), redundancy, or really fancy HTML5 graphical effects.

Add 3-5 subcontractors, each with their own external references, included 3rd party libraries, and 10-50 times the developers we have, and include all the database scripting we avoid, and so on, and I can easily see it getting that big. Especially if you start including documentation and/or heavily commenting the code. I interviewed for an FAA contractor once where they told me that their comment-to-code ratio was ideally 1:1.
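As a back-of-envelope illustration of that extrapolation, the sketch below scales the 1.1 million line figure by the ranges mentioned. It rests on several loud assumptions that are not in the original numbers: that line count grows linearly with developer count, that each subcontractor has 10-50 times our team size (the sentence could also be read as 10-50 times in total), and that a 1:1 comment ratio adds one extra line per existing line. The function name `linear_estimate` is made up for this example.

```python
BASE_LINES = 1_100_000  # roughly what 4 developers have in trunk


def linear_estimate(dev_multiplier, teams, comment_ratio=0.0):
    """Crude line-count estimate.

    Assumes (hypothetically) that lines scale linearly with developers,
    that every team is `dev_multiplier` times the size of ours, and that
    comments add `comment_ratio` extra lines per existing line.
    """
    code_lines = BASE_LINES * dev_multiplier * teams
    return int(code_lines * (1 + comment_ratio))


low = linear_estimate(10, 3)    # 3 subcontractors, 10x the developers each
high = linear_estimate(50, 5)   # 5 subcontractors, 50x the developers each
high_commented = linear_estimate(50, 5, comment_ratio=1.0)  # plus 1:1 comments

print(f"low:            {low:,}")
print(f"high:           {high:,}")
print(f"high, 1:1 docs: {high_commented:,}")
```

Under these assumptions the range runs from about 33 million to 275 million lines, and only the top end with 1:1 commenting crosses 500 million. Passing `teams=1` gives the more conservative reading, where the 10-50x applies to the whole project.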

And yes, I realize that software doesn't scale linearly as the project gets larger. Just because *our* project grew that way doesn't mean another will. But it's probably the best way to try and get a handle on that number. — Bobson, Nov 04 '13 at 23:10
