It was not a deliberate decision. The HTML standards didn't explicitly state how to handle invalid HTML, so early browsers did what was simplest to implement: skip over any unrecognized syntax and ignore all problems.
But first let's clarify what we are talking about when we call HTML parsing non-strict, because there are a few separate issues at play.
HTML supports a number of syntax shortcuts which may give the impression that the HTML parser is not very strict, but which actually follow precise and unambiguous rules. These shortcuts are inherited from SGML (which has even more of them) and exist to cut down on boilerplate and make HTML simpler to write by hand. For example (a complete document using these shortcuts follows the list):
- Some elements are implied by context and can therefore be left out. For example, the `<html>`, `<head>` and `<body>` tags can all be omitted.
- Some elements are closed automatically. For example, a `<p>` element is closed as soon as another block-level element starts. The first HTML spec didn't even have a `</p>` closing tag, since it was unnecessary.
- Attributes with predefined values can be shortened: instead of `disabled="disabled"` you can just write `disabled`.
- Attribute values only need to be quoted if they contain non-alphanumeric characters. `<input type=text>` is fine.
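
Taken together, these shortcuts allow a conforming document to be surprisingly terse. A small illustration (the markup below is valid HTML as written, not an error case; the content is made up for demonstration):

```html
<!DOCTYPE html>
<title>Shortcut demo</title>
<p>The html, head and body elements are all implied.
<p>This paragraph automatically closes the previous one.
<p><input type=text disabled>
```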
Furthermore, unknown elements and attributes are ignored. This is a deliberate decision to make the language extensible in a backwards-compatible way. Early browsers implemented this by just skipping any token they didn't recognize.
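For instance, the names in the snippet below are made up, but any browser handles them the same way: the unknown attribute is dropped, and the unknown element contributes no formatting while the text inside it is still rendered.

```html
<p made-up-attribute="whatever">The attribute is silently ignored.</p>
<made-up-element>The tag is ignored, but this text still shows up.</made-up-element>
```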
But in addition to this, there is the more controversial question of what to do with outright illegal structure, like a `<td>` outside of a `<table>`, overlapping elements like `<i>foo<b>bar</i>baz</b>`, or mangled syntax like `<p <p>`. For a long time it was not specified how browsers should handle such malformed HTML, so browser makers just did whatever was simplest to implement. The first browsers were very simple pieces of software. They didn't have a DOM or anything like it; they essentially processed tags as a linear sequence of formatting codes. HTML like `<i>foo<b>bar</i>baz</b>` can easily be processed linearly as `[italic on]foo[bold on]bar[italic off]baz[bold off]`, even though it cannot be parsed into a valid DOM tree.
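Under that linear interpretation each tag simply toggles a formatting state, so foo comes out italic, bar bold italic, and baz bold. The same formatting, written as non-overlapping runs (an illustration of the rendered result, not the output of any particular parser):

```html
<!-- Original, overlapping markup: -->
<i>foo<b>bar</i>baz</b>

<!-- The same formatting expressed as non-overlapping runs: -->
<i>foo</i><i><b>bar</b></i><b>baz</b>
```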
Browsers were also not able to validate HTML up front, because of incremental rendering. It was an important feature to render HTML as soon as it was received, since the internet was slow. If a browser received HTML like `<h1>Hello</h2>`, it would render `Hello` in `<h1>` style before the `</h2>` end tag was even received. Since the invalid end tag is only detected after the headline has been rendered, it doesn't make much sense to throw an error at that point. The simplest option is to just ignore the unmatched end tag.
Since unknown attributes should be ignored, the parser just skipped any unknown token in attribute position. In `<p <p>`, the second `<p` would be interpreted as an unknown attribute token and skipped. This turned out to be useful when XML-style syntax became fashionable, since you could write `<br />` and the trailing slash would simply be ignored by HTML parsers.
There is a persistent rumor that the "lenient" parsing of HTML was a deliberate feature intended to make it easier for beginner or non-technical authors. I believe this is an after-the-fact justification, not the real reason. The behavior of HTML parsers is much better explained by implementers always just choosing what was simplest to implement. For example, one of the most common syntax errors is a mismatched quote, like `<p align="left>`. In this case the parser just scans until the next quote character, regardless of how much content it has to skip. Large blocks of content, perhaps the entire rest of the document, may disappear without a trace. This is clearly not very helpful. It would be much more helpful if an unescaped `>` terminated the attribute, but scanning until the next matching quote was the simplest thing to implement, so that is what browsers did.
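To get a feel for how much can vanish, consider a page with a single missing quote (the class names here are just placeholders). Roughly speaking, everything up to the next `"` character is swallowed into the attribute value:

```html
<p class="intro>
All of this text, and any markup inside it, ends up inside the class
attribute of the first paragraph instead of on the page.
<p class="next">Parsing only recovers around this next quote character.</p>
```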
Unfortunately, these early design decisions were hard to undo once the web took off, because it turned out to be impossible to make parsing stricter. The problem is that if one browser introduces stricter parsing, some pages that work fine in other browsers will break in that browser, and people will simply abandon the browser that "doesn't work". For example, Netscape initially didn't render a table at all if the closing `</table>` tag was missing. But Internet Explorer did render the table, which forced Netscape to change their parsing to also render it.
Early implementation decisions came back to bite the browser developers. For example, in the beginning it was simplest to just allow overlapping elements, but when the DOM was introduced, browsers had to implement complex rules for how to represent overlapping elements in a DOM tree. Different handling of invalid HTML became a source of incompatibilities between browsers. Eventually the authors of the HTML standard bit the bullet and specified in excruciating detail how to parse any form of invalid HTML. This specification is enormously complex, but at least one source of incompatibility has been eliminated.
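For the overlapping example from earlier, the now-standardized algorithm resolves the conflict by splitting the `<b>` element so that the result is a proper tree. Serialized back to markup, the resulting DOM looks roughly like this (and, notably, it renders the same as the old linear interpretation did):

```html
<!-- Input: -->
<i>foo<b>bar</i>baz</b>

<!-- Resulting tree, serialized back to markup: -->
<i>foo<b>bar</b></i><b>baz</b>
```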
XHTML was an attempt to improve the situation by providing a strictly parsed version of HTML. But the attempt largely failed because XHTML didn't provide any significant benefit to authors compared to HTML. More importantly, browser vendors were not enthusiastic about the effort.
But why are browser vendors not enthusiastic about a strictly parsed version of HTML? Surely it would make their work easier? The problem is that browsers still need to be able to parse all the existing pages on the internet, which are not going to be fixed any time soon. Supporting strict parsing in addition would just add a new mode to the HTML parser, which is complex enough as it is, for no significant benefit to the user.