39

I have often wondered why strict parsing was not chosen when creating HTML. For most of the Internet's history, browsers have accepted any kind of markup and tried their best to parse it. The process degrades performance, permits people to write gibberish, and makes it difficult to discontinue obsolete features.

Is there a specific reason why HTML is not strictly parsed?

Shubham
  • 713
  • 7
  • 17
  • 8
    You might find Joel's article, [Martian Headsets](http://www.joelonsoftware.com/items/2008/03/17.html), to be of interest. Also of special note is [RFC 793: Robustness Principle](http://tools.ietf.org/html/rfc793#page-13), which explicitly states that TCP implementations should try their best to parse rubbish. This principle has since been applied to browsers. – Brian Jun 07 '13 at 19:30
  • 27
    @Brian: Robustness means you should not fall over when you receive crap. It does not mean you have to make sense of crap. – Marjan Venema Jun 07 '13 at 20:10
  • 2
    XHTML does use strict parsing. – user16764 Jun 07 '13 at 22:07
  • 3
    Is it just me, or are none of these answers very satisfying? – gsgx Jun 14 '13 at 23:01
  • @user16764 only if served as XML. Served as text, it just got lumped into a general "standards mode" that basically rendered all the doctypes completely pointless. Which is why there is now only one (which, see my answer, can be served as XML). – Erik Reppen Jun 14 '13 at 23:24
  • HTML is written by lots of web amateurs; why would you require them to write perfect code? It doesn't compile, so why the strictness? It's not a programming language. – Czarek Tomczak Jun 15 '13 at 07:49
  • 2
    @gsingh2011 None of the answers are satisfying, but my answer is the truth. Some of us here were active on the net that long ago :-) But yeah, it's astonishing how much junk we're left with for such simple reasons. – Ross Patterson Apr 26 '17 at 23:48
  • Worth a read: [Martian Headsets](https://www.joelonsoftware.com/2008/03/17/martian-headsets/) on Joel on Software blog. – mouviciel Oct 11 '17 at 11:10
  • You may want to read up on SGML. XML came out later and the strictness there was a direct response to the leniency of SGML. – Thorbjørn Ravn Andersen Apr 10 '22 at 23:47
  • @gsingh2011 They are not satisfying because they all address the time when the web was starting up, not why nothing was ever done about it. Not why HTML5 was such a huge disappointment. They could have cleaned it up then, first by making it proper XML, thus making it easier on automated tools. This would have made all the "crappy stuff" fade away in a couple of years. But no, it was hyped years before release, raising expectations for a couple of years and it ultimately dragged along all the nonsense in the new standard, rendering the whole thing pretty much pointless. – Martin Maat Apr 11 '22 at 06:48

7 Answers

41

The reason is simple: At the time of the first graphical browsers, NCSA Mosaic and later Netscape Navigator, almost all HTML was written by hand. The browser authors (Netscape was built by ex-Mosaic folks) recognized quickly that refusing to render incorrect HTML would be held against them by the users, and voilà!

Ross Patterson
  • 10,277
  • 34
  • 43
  • 7
    +1 yes, that's how it all started, in vi or notepad. With most pages being copied from bad example code, it never got better. Plus the WWW boomed, so anyone who could type became a web developer and it was all about getting it done fast. – jqa Jun 08 '13 at 03:32
  • 1
    Apparently, this answer in conjunction with @Jukka's comment gives the best possible explanation – Shubham Jun 08 '13 at 19:19
  • I'd just add the words "*very reasonably*" before "be held against them by the users". Users don't buy a browser to promote a perfect web ecosystem, they buy a browser to help them personally read the web. – bdsl Apr 12 '22 at 14:58
  • @bdsl 30 years ago, the HTML-authoring population was mostly geeks, and very supportive of standards-compliance. The creators of the few browsers that existed then would have been forgiven. They just didn't go there. Honestly, they should have. – Ross Patterson Apr 13 '22 at 18:40
  • The browser's customers were and are primarily HTML readers, not HTML authors. Why would HTML readers choose a browser that refused to attempt to render noncompliant web pages? – bdsl Apr 13 '22 at 18:52
35

Because making best guesses is the right thing to do, from a browser-maker's perspective. Consider the situation: ideally, the HTML you receive is completely correct and to spec. That's great. But the interesting part is what happens when the HTML is not correct; since we're dealing with input from a source that we have no influence on, really, we have to be prepared for this. Now when that happens, what could we do? We have two options: a) fail, and b) make a best effort to recover from the error. If we fail, the user has nothing but a useless error message, and there is nothing they can do about it, because they don't control the server. If we make a best effort, the user has at least what we could make of the page, and often the guess is mostly right.

The only real problem with this is when you need the error messages, which is typically in a development situation - you want to make sure the HTML you generate is correct, and since "works in browser X" is not equivalent to "correct", we can't simply run it through a browser and see if it works: we can't tell the difference between correct HTML and incorrect HTML that the browser has fixed for you. This is a solvable problem though; there are browser plugins that report standards violations, there's the W3C validator, and lots of other similar tools.
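
A minimal sketch of that last point, using Python's stdlib `html.parser` as a stand-in for a lenient browser (an assumption for illustration only): correct and broken markup both "work", so the parse result alone cannot tell you which one you wrote.

```python
# Feed correct and incorrect HTML to a lenient tag-soup parser
# (html.parser, standing in for a browser): neither run reports a
# problem, so "it rendered" says nothing about correctness.
from html.parser import HTMLParser

class TagDumper(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("  open ", tag)
    def handle_endtag(self, tag):
        print("  close", tag)

correct = "<p>Hello <b>world</b></p>"
broken  = "<p>Hello <b>world</p>"        # <b> is never closed

for label, doc in (("correct", correct), ("broken", broken)):
    print(label + ":")
    TagDumper().feed(doc)                 # both parse without complaint
```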

tdammers
  • 52,406
  • 14
  • 106
  • 154
  • 7
    Well, I don't think anyone would serve up HTML that throws up errors. Why do you think a compiler assuming code is any different from a browser assuming HTML? – Shubham Jun 08 '13 at 12:09
  • 1
    I agree with Shubham here - "since we're dealing with input from a source that we have no influence on" is false, the influence is indirect but some web sites still support IE6 because of that influence. –  Jun 08 '13 at 17:22
  • 2
    @Shubham: A compiler is different because its purpose is not to transform machine-readable source code into a human-digestible form, but to transform human-readable source code into something that is more convenient for a computer (machine code or some intermediate format). With the compiler, you fix the input and you're glad the code didn't make it into production. With the browser, you curse the browser maker or the website author, but either way, you don't get to see the page. – tdammers Jun 08 '13 at 20:45
  • 3
    @Shubham: Generally the user of a compiler will have control over the source code being compiled. That is generally not the case with web pages. – supercat Sep 02 '14 at 16:28
17

HTML authors and authoring tools produce crappy markup. Browsers do their best with it for competitive reasons: a browser that fails to render most web pages in any reasonable way will be rejected by users, who won’t care in the least whose fault it is.

It’s rather different from what programming language implementations do. Compilers and interpreters work on code that can be assumed to be written by a programmer, whereas everyone and his brother can write HTML with minimal training, or without. HTML markup is code, in a sense, but it’s data rather than programming language instructions, and the (good) tradition in software is to be tolerant with data.

XHTML in principle imposes strict (XML) parsing rules, so that an XHTML document served with an XML content type will be displayed only if it is well-formed in the XML sense – otherwise, only the first error is communicated to the user. This never became popular in web authoring – almost all of the “XHTML” around is served as text/html and processed as traditional tag soup in a very liberal way, just with some new eccentricities.
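
As a rough illustration of that XML-mode behaviour, here is Python's stdlib XML parser used as a stand-in for a browser's XHTML mode (an assumption for illustration, not how browsers are implemented): a document with several problems yields exactly one report, for the first error, and no tree to display at all.

```python
# Draconian XML error handling: parsing aborts at the first error,
# so a document with multiple problems produces a single message
# and no output whatsoever.
import xml.etree.ElementTree as ET

xhtml = "<p>unclosed <b>bold<br>and a stray </i> too</p>"
try:
    ET.fromstring(xhtml)
except ET.ParseError as err:
    print("parsing stopped at the first error:", err)
```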

Jukka K. Korpela
  • 2,387
  • 17
  • 14
  • 1
    mmm tag soup, SE is my favorite flavor... – Jimmy Hoffa Jun 07 '13 at 19:50
  • 18
    `HTML authors and authoring tools produce crappy markup.` - they do because browsers accept it. If from the beginning browsers didn't accept it - then these tools & authors wouldn't have been able to get away with producing crappy markup – user93353 Jun 08 '13 at 03:13
  • 1
    @user93353, you miss the point. Browsers *needed* to be able to cope with bad or simply different markup or they would have lost market share to those that did. – GrandmasterB Jun 08 '13 at 06:49
  • 4
    @GrandmasterB - I think you miss the point - even when there was just one browser in the market, it didn't do strict parsing. – user93353 Jun 08 '13 at 07:14
  • 3
    Funny note: you say that if a browser is unable to parse an invalid site, it will lose market share. But just look at IE: however bad it is, it doesn't lose market share. It just forces poor developers to write dirty hacks using old APIs... And don't get me started on its versioning scheme... – Max Jun 08 '13 at 10:20
  • 3
    In the beginning, browsers were written hastily to deal with a markup language that wasn’t finalized and had no official specification – there were no strict parsing rules. (HTML 2.0, in 1995, was nominally SGML-based, but it was too late to have that actually implemented.) – Jukka K. Korpela Jun 08 '13 at 11:09
  • 2
    IE has actually lost quite a lot of its market share. But this probably has little if anything to do with strict parsing. IE, with its oddities, ruled the web long enough to force other browsers to largely imitate its oddities, because so many pages would otherwise fall apart. – Jukka K. Korpela Jun 08 '13 at 11:11
  • @Max - just because a factor influencing the success of a browser isn't the only factor, doesn't make it invalid. IE was and still is included as standard with Windows, exploiting the near-monopoly of Windows on the desktop. That doesn't mean other browser makers could say "but Microsoft gets away with it - it's not fair!". –  Jun 08 '13 at 17:28
11

The short of it would be that HTML was based on another, non-hyperlinked markup language called SGML, often used for documentation, manuals, and the like.

From an article about the history of HTML:

Tim had mentioned that some of the early HTML documents were based on an old SGML language that CERN was already using:- We have included in HTML some tags from the SGML tagset used at and once supported at CERN [...] **The HTML parser will ignore tags which it does not understand, and will ignore attributes which it does not understand of CERN-SGML tags.**

[...] most of the early HTML tags were actually taken from the CERN SGMLGuid language, which itself was a variant of AAP (an early SGML language). For example, title, hn, p, ol and so on are all apparently taken from this language. The only radical change was the addition of the all-important anchor (<a>) link, without which the WWW wouldn't have taken off.

Taking note of the part I've bolded, basically, they implemented a subset of the tags available in the SGML system they were familiar with, adding the new anchor <a> tag, and choosing to ignore any of the many tags they didn't care about or wish to support for whatever reason (such as tags for bibliography lists, xmp for "example", the "box" tag to draw a box around a block of text, etc). So the simplest way to do that was to be forgiving of markup the parser didn't know and to ignore unknown markup as best as possible, regardless of whether the cause was a user typing bad markup or someone taking the quickest, easiest route to convert an existing SGML document to this new HTML format by adding some hyperlinks and leaving whatever tags weren't supported or implemented to be ignored.
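
A toy sketch of that "ignore what you don't understand" rule, using Python's `html.parser` purely as a tokenizer; the supported tag set below is an assumed subset chosen for illustration, mimicking an early renderer that knows a handful of elements and silently drops the rest.

```python
# Render only the tags we know about; unknown tags (e.g. leftover
# SGML ones like <xmp>) are dropped, but their text content still
# comes through, just as the quoted parser rule describes.
from html.parser import HTMLParser

KNOWN_TAGS = {"title", "h1", "p", "ol", "li", "a"}   # assumed subset

class ForgivingRenderer(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag in KNOWN_TAGS:
            print("render start of", tag)
        # anything else is silently ignored
    def handle_data(self, data):
        if data.strip():
            print("render text:", data.strip())

ForgivingRenderer().feed(
    "<xmp>old SGML example</xmp><p>New <a href='#'>hyperlinked</a> text</p>"
)
```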

Jessica Brown
  • 424
  • 3
  • 9
  • The HTML syntax was indeed based on the SGML Reference Concrete Syntax for the **form** of its markup. But SGML itself did not have **elements** for marking up documents that HTML could borrow. The HTML element set actually resembles that of [IBM's GML](https://en.wikipedia.org/wiki/IBM_Generalized_Markup_Language) document markup language, transliterated into the SGML RCS. – Ross Patterson Apr 27 '17 at 00:04
5

This is partly a historical remnant of the browser wars.

IE and Netscape were competing to take over the market and kept releasing new features that became more and more "awesome", and each was forced to accept the pages designed for the other browser.

This means that browsers accept and silently ignore unknown tags. After the committees started getting involved... well, you have a committee designing stuff, and as a result a lot of different versions (with some ambiguously worded specs) that browsers want to support most of. Creating a separate parser for each version would be enormous bloat, so it is (relatively) easier to use a single parser with different modes.

For another part, Netscape and IE wanted HTML to be accessible to the common man (as was the fad in those days), which meant trying to do what the author wanted rather than what they literally wrote, instead of tripping over every dangling tag.

Making the problem worse is that there are also several "tutorial" sites teaching the wrong thing and thinking they are right because what they teach works.

Ultimately this means that if you now create a browser with only strict HTML parsing, 99% of the sites out there will just not work.

ratchet freak
  • 25,706
  • 2
  • 62
  • 97
  • 6
    Even before IE came into the market, Netscape never did strict parsing. I remember Netscape from early 1997. – user93353 Jun 08 '13 at 04:10
  • Even if there were clear standards, it would be difficult for a browser to distinguish between tags which were legitimately defined after the browser was released, versus tags which have never been and never would be legitimate. If "optional" tags which enhanced a document but were not required for its semantic correctness included the version number of the standard which implemented them, then a browser which implemented version 23 of the standard could silently ignore an `` tag but balk at an ``, but such a design would have impaired the "human-readable" aspect of HTML. – supercat Sep 02 '14 at 16:27
3

It was not a deliberate decision. The HTML standards didn't explicitly state how to handle invalid HTML, so early browsers did what was simplest to implement: Skip over any unrecognized syntax and ignore all problems.

But first let's clarify what we are talking about when we call HTML parsing non-strict, because there are a few separate issues at play.

HTML supports a number of syntax shortcuts which may give the impression that the HTML parser is not very strict, but which actually follow precise and unambiguous rules. These shortcuts are inherited from SGML (which has even more syntax shortcuts) in order to cut down on boilerplate and make it simpler to write by hand. For example (a short sketch follows the list):

  • Some elements are implied by context and can therefore be left out. For example the <html>, <head> and <body> tags can be left out.
  • Some elements are automatically closed. For example, the <p> element is automatically closed when another block level element starts. The first HTML spec didn't even have the </p> closing tag, since it was unnecessary.
  • Attributes with predefined values can be shortened, e.g. instead of disabled="disabled" you can just write disabled.
  • Attributes only need to be quoted if they contain non-alphanumeric characters. <input type=text> is fine.
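
A quick check of the attribute shortcuts above with Python's stdlib `html.parser`, used here only as a convenient tokenizer: the unquoted value and the minimized boolean attribute come back as perfectly ordinary attributes, not as errors.

```python
# Minimized and unquoted attributes are legal shortcuts, not mistakes.
from html.parser import HTMLParser

class AttrDumper(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print(tag, attrs)

AttrDumper().feed('<input type=text disabled>')
# prints: input [('type', 'text'), ('disabled', None)]
```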

Furthermore, unknown elements and attributes are ignored. This is a deliberate decision to make the language extensible in a backwards-compatible way. Early browsers implemented this by just skipping any token they didn't recognize.

But in addition to this, there is the more controversial question of what to do with outright illegal structure like a <td> outside of a <table>, overlapping elements like <i>foo<b>bar</i>baz</b> or mangled syntax like <p <p>. For a long time it was not specified how browsers should handle such malformed HTML. Browser makers therefore just did what was simplest to implement. The first browsers were very simple pieces of software. They didn't have a DOM or anything like that; they basically processed tags as a linear sequence of formatting codes. HTML like <i>foo<b>bar</i>baz</b> can easily be processed in a linear way as [italic on]foo[bold on]bar[italic off]baz[bold off] even though it cannot be parsed into a valid DOM tree.
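
A sketch of that linear, DOM-less processing, with Python's `html.parser` standing in for an early browser: tags are treated as formatting toggles in the order they appear, so the overlapping example above poses no problem even though it cannot form a valid tree.

```python
# Process tags as a flat stream of "formatting codes" instead of
# building a tree; overlapping elements are then unremarkable.
from html.parser import HTMLParser

class LinearFormatter(HTMLParser):
    NAMES = {"i": "italic", "b": "bold"}

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        self.out.append(f"[{self.NAMES.get(tag, tag)} on]")

    def handle_endtag(self, tag):
        self.out.append(f"[{self.NAMES.get(tag, tag)} off]")

    def handle_data(self, data):
        self.out.append(data)

f = LinearFormatter()
f.feed("<i>foo<b>bar</i>baz</b>")
print("".join(f.out))   # [italic on]foo[bold on]bar[italic off]baz[bold off]
```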

Browsers were not able to validate HTML up front due to incremental rendering. It was an important feature to be able to render HTML as soon as it was received, since the internet was slow. If a browser received HTML like <h1>Hello</h2>, it would render Hello in H1 style before the </h2> end tag was received. Since the invalid end tag is only detected after the headline is rendered, it doesn't make much sense to throw an error at that point. The simplest option is just to ignore the unmatched end tag.
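
The ordering problem can be mimicked with `html.parser`'s chunked feed() calls; this is only an illustration of why errors arrive too late, not a model of any real browser's internals.

```python
# The markup arrives in pieces and events are emitted as soon as
# possible, so the heading is already "rendered" before the
# mismatched end tag is even seen.
from html.parser import HTMLParser

class IncrementalRenderer(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("start rendering", tag)
    def handle_data(self, data):
        print("render text:", data)
    def handle_endtag(self, tag):
        print("end tag seen:", tag, "(too late to complain now)")

r = IncrementalRenderer()
r.feed("<h1>Hello")   # first network chunk: heading already rendered
r.feed("</h2>")       # second chunk: the mismatched end tag arrives
```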

Since unknown attributes should be ignored, the parser just skipped any unknown token in attribute position. In <p <p>, the second <p would be interpreted as an unknown attribute token and skipped. This turned out to be useful when XML-style HTML syntax became fashionable, since you could write <br /> and the trailing slash would just be ignored by HTML parsers.

There is a persistent rumor that the "lenient" parsing of HTML was a deliberate feature in order to make it easier for beginner or non-technical authors. I believe this is an after-the-fact justification, not the real reason. The behavior of HTML parsers is much better explained by implementers always just choosing what was simplest to implement. For example, one of the most common syntax errors is mismatched quotes, like <p align="left>. In this case the parser will just scan until the next quote, regardless of how much content it has to skip. Large blocks of content, perhaps the entire rest of the document, may disappear without a trace. This is clearly not very helpful. It would be much more helpful if an unescaped > terminated an attribute. But scanning until the next matching quote was the simplest to implement, so this is what browsers did.
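
A toy version of that "scan until the next quote" rule (my own simplification, not any browser's actual code), showing how an unterminated value swallows everything up to the next double quote:

```python
# Simplest possible attribute-value scanner: read until the next '"',
# no matter how much document content that consumes.
def read_quoted_value(src, start):
    """start points just past the opening quote"""
    end = src.find('"', start)
    return src[start:end], end + 1

doc = '<p align="left>Most of this paragraph</p><p title="x">rest</p>'
i = doc.index('"') + 1
value, _ = read_quoted_value(doc, i)
print(repr(value))   # 'left>Most of this paragraph</p><p title='
```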

Unfortunately, these early design decisions were hard to undo when the web took off, because it turns out to be impossible to make parsing stricter. The problem is that if one browser introduces stricter parsing, some pages will break in that browser while working fine in other browsers, and then people will just abandon the browser which "doesn't work". For example, Netscape initially didn't render a table at all if the closing </table> was missing. But Internet Explorer did render the table, which forced Netscape to change their parsing to also render the table.

Early implementation decisions came back to bite the browser developers. For example, in the beginning it was simplest to just allow overlapping elements, but when the DOM was introduced, browsers had to implement complex rules for how to represent overlapping elements in a DOM tree. Different handling of invalid HTML became a source of incompatibilities between browsers. Eventually the authors of the HTML standard bit the bullet and specified in excruciating detail how to parse any form of invalid HTML. This specification is enormously complex, but at least a source of incompatibility is eliminated.

XHTML was an attempt to improve the situation by providing a strictly parsed version of HTML. But the attempt largely failed because XHTML didn't provide any significant benefit to authors compared to HTML. More importantly, browser vendors were not enthusiastic about the effort.

But why are browser vendors not enthusiastic about a strictly parsed version of HTML? Surely it would make their work easier? The problem is that browsers would still need to be able to parse all the existing pages on the internet which are not going to be fixed any time soon. Supporting strict parsing in addition would just add a new mode to the HTML parser which is complex enough as it is, and for no significant benefit to the user.

JacquesB
  • 57,310
  • 21
  • 127
  • 176
2

Well, we tried to establish a nice strict option in the 2000s, but it didn't pan out because people who followed "best practices" blindly blamed the browsers when their incorrect markup went to pieces in strict mode. And the browser vendors didn't like being blamed.

They claimed it was because they wanted the web to be more accessible to non-professionals, but nobody was being stopped from using HTML 4 in its most lenient form.

That said, you can still serve HTML5 as XML if you want strict-style layout. IMO it can be a good way to reap the benefits of doing layout or UI work in a stricter mode before you pass it on to other people who may or may not want it that strict, without any real risk (barring them ripping the doctype out because they actually favor quirks mode - in 2017 (the time of this edit) they should be shot). So it's still there, basically, but do some research. I seem to recall there being some caveats that we didn't have with XHTML, which didn't really impact layout work. Just don't spread the word that it's "the only way to do it right", or the twits who buy into that kind of talk will dogpile the idea, blame the browsers again, and take the teeth out of the only strict alternative we have left. (2017 edit: I have no idea whether this still works - gave up.)

http://mathiasbynens.be/notes/xhtml5
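
As a rough sketch of the mechanics (my own minimal example, not taken from the linked note; the handler, page and port are arbitrary), it is the application/xhtml+xml content type that switches browsers to their strict XML parser, so any well-formedness error aborts rendering:

```python
# Serve a well-formed page as application/xhtml+xml so the browser
# parses it with its XML parser instead of the tag-soup HTML parser.
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = (b'<?xml version="1.0" encoding="utf-8"?>\n'
        b'<!DOCTYPE html>\n'
        b'<html xmlns="http://www.w3.org/1999/xhtml">'
        b'<head><title>strict</title></head>'
        b'<body><p>parsed as XML</p></body></html>')

class StrictHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "application/xhtml+xml; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), StrictHandler).serve_forever()
```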

Erik Reppen
  • 6,243
  • 31
  • 34