What are the common techniques to handle user-generated HTML modified differently by different browsers?

Question

I am developing a website updater. The front end uses HTML, CSS and JavaScript, and the backend uses Python.

The way it works is that <p/>, <b/> and some other HTML elements can be updated by the user. To enable this, I load the webpage and, with jQuery, convert all those elements to <textarea/> elements. Once the content of a text area is changed, I apply the change to the original element and send it to a Python script to store the new content.

The problem is that I'm finding that different browsers change the original HTML.

How do you get around this issue?
What Python libraries do you use?
What techniques or application designs do you use to avoid or overcome this issue?

The problems I found are:

IE removes the quotes around class and id attributes. For example, <img class='abc'/> becomes <img class=abc/>.
Firefox removes the backslash from the line breaks: <br \> becomes <br>.
Some websites have very specific display requirements, so inserting a simple "\n" (which IE does) can affect how the site is displayed. Example: changing <img class='headingpic' /><div id="maincontent"> to <img class='headingpic'/>\n <div id="maincontent"> inserts a vertical gap in IE.

The things I have unsuccessfully tried to overcome these issues:

Using either jQuery or Python to remove all >\n< occurrences, <br> etc. But this fails because I get different patterns in IE, sometimes a ∙\n, sometimes a \n∙∙∙.
In Python, parse the new HTML, extract the new text/content, and insert it into the old HTML so the elements and format never change, just the content. This is very difficult and seems to be overkill.

Is the doctype of your pages set to a restrictive mode? Setting it to XHTML 1.1 forces a strict way of handling the document by every browser which might resolve your issue — Carlo Kuip, Oct 26 '11 at 19:35

score 2 · Answer 1 · answered Oct 13 '11 at 18:30

One of the first rules of web development is to never trust the client. A malicious user or buggy client could bypass anything you do in JavaScript and feed your server-side Python malformed and possibly harmful HTML, so your server-side Python needs to standardize and clean up whatever it gets.

As long as you have to do some of the work server-side, why not do everything server-side, bypassing web browsers' vagaries completely? I'd recommend just sending the textarea contents to the server and cleaning it up on the server with BeautifulSoup.

You can still do the textarea-to-HTML conversion client-side if you want to, to show the user a preview of their changes, and just submit the textareas' contents to the server.
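As a rough illustration of the "submit only the textarea contents" idea, here is a minimal sketch. It assumes the textareas hold plain text rather than markup, and it uses only the standard library's html.escape instead of BeautifulSoup so that it runs without dependencies. The EDITABLE_TAGS set and the render_element helper are hypothetical names, not part of any library.

```python
from html import escape

# Hypothetical allow-list of elements the user may edit.
EDITABLE_TAGS = {"p", "b", "i", "h1", "h2", "li"}


def render_element(tag, text):
    """Build the element's markup on the server from submitted plain text.

    The browser never serializes the markup, so its quoting and
    whitespace quirks cannot reach the stored page.
    """
    if tag not in EDITABLE_TAGS:
        raise ValueError(f"tag not editable: {tag}")
    # Normalize line endings, then trim stray leading/trailing whitespace.
    text = text.replace("\r\n", "\n").replace("\r", "\n").strip()
    # Escape &, < and > so the text cannot introduce new tags.
    return f"<{tag}>{escape(text, quote=False)}</{tag}>"
```

With this approach the server decides which tag wraps the text, so the stored markup looks the same whichever browser submitted the form.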

score 1 · Answer 2 · answered Nov 12 '11 at 21:30

What is your goal?

First, why are you doing it? If you are giving your customers a way to modify the content of a web page, there are two cases:

Your customers have enough technical background, like users of Stack Exchange. In this case, why not use Markdown, which is much more user-friendly and much easier to type?
Your customers don't have enough technical background. In this case, asking them to edit HTML by hand is like your ISP saying that, in order to have an internet connection, you first have to do your own wiring from your home to their center, then build your own router which matches their protocols, and then do all the configuration yourself. That's how your customers will perceive your business.

Remember, it's HTML

If you still have a valid reason to offer direct HTML editing, you must remember that you're dealing with HTML, which means no string.replace and no regular expressions (I put it in bold, but imagine I put it in Arial Black 200 bold blinking red).

You need to parse the input.

First of all, you need to parse it in order to normalize the formatting. This is where you remove your line breaks, replace <br> with <br /> (you mention a backslash in your question; is this a typo?), etc.

Also, you need to be sure that it's valid HTML. What if the user adds a </div> which doesn't match any opening tag? Yes, it will probably break the layout of your page.
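To make the "parse, then normalize" idea concrete, here is a minimal sketch built on the standard library's html.parser. The Normalizer class, the normalize function and the VOID set are illustrative names, and the whitespace rule (drop text that is only whitespace and contains a newline) is an assumption that fits the <img>/<div> gap from the question but may be wrong for other layouts. Comments and doctypes are silently dropped.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, for illustration).
VOID = {"br", "img", "hr", "input", "meta", "link"}


def format_attrs(attrs):
    """Re-emit attributes in one canonical form: name="escaped value"."""
    parts = []
    for name, value in attrs:
        safe = escape(value or "", quote=True)
        parts.append(f' {name}="{safe}"')
    return "".join(parts)


class Normalizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_tags = []
        self.errors = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            self.out.append(f"<{tag}{format_attrs(attrs)} />")
        else:
            self.open_tags.append(tag)
            self.out.append(f"<{tag}{format_attrs(attrs)}>")

    def handle_startendtag(self, tag, attrs):
        # "<br />" and "<br>" end up identical; "<p/>" becomes "<p></p>".
        self.handle_starttag(tag, attrs)
        if tag not in VOID:
            self.handle_endtag(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if tag not in self.open_tags:
            self.errors.append(f"unmatched </{tag}>")
            return
        # Close anything left open inside the element being closed.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        # Assumption: newline-bearing whitespace between tags is noise.
        if not data.strip() and "\n" in data:
            return
        self.out.append(escape(data, quote=False))


def normalize(fragment):
    """Return (canonical_html, list_of_problems) for an HTML fragment."""
    parser = Normalizer()
    parser.feed(fragment)
    parser.close()
    while parser.open_tags:
        parser.out.append(f"</{parser.open_tags.pop()}>")
    return "".join(parser.out), parser.errors
```

Because the output is rebuilt from parsed tags rather than patched with string replacements, an unquoted class=abc from one browser and a quoted class='abc' from another should serialize to the same text, and a stray </div> is reported instead of being written into the page.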

Remember, it's user input

The last, and most important, reason to parse the input as HTML: you need to validate it.

What if I add a <script/> tag with some nasty JavaScript code? What if I want to break your layout just to annoy you? Or to redirect the users from your website to my own? What if...

If you open your HTML code to modification by an untrusted source, be prepared for it to hurt sooner or later, both you (it's never pleasant to find that you've been dropped from Google's results because your page content contains viruses) and your customers, who will never return.
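One common way to do this validation is an allow-list: keep only the tags and attributes you have explicitly approved and drop everything else. The sketch below uses the standard library's html.parser; the Sanitizer class, the sanitize function and the specific allow-lists are assumptions chosen for illustration. It does not balance tags, and a production site would more likely rely on a maintained sanitizing library than on hand-written code like this.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical policy: adjust to what your pages actually need.
ALLOWED_TAGS = {"p", "b", "i", "em", "strong", "br", "a"}
ALLOWED_ATTRS = {"a": {"href"}}
SAFE_SCHEMES = ("http://", "https://", "mailto:")
DROP_CONTENT = {"script", "style"}  # remove these tags and their contents
VOID = {"br"}


class Sanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return  # unknown tags are dropped, their text is kept
        kept = []
        for name, value in attrs:
            value = value or ""
            if name not in ALLOWED_ATTRS.get(tag, ()):
                continue  # drops onclick, style, and so on
            if name == "href" and not value.strip().lower().startswith(SAFE_SCHEMES):
                continue  # drops javascript: and other unexpected schemes
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append("<" + tag + "".join(kept) + ">")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag not in ALLOWED_TAGS or tag in VOID:
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(fragment):
    """Return the fragment with only allow-listed tags and attributes."""
    parser = Sanitizer()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.out)
```

The important design choice is that the policy lists what is permitted rather than what is forbidden, so a tag or attribute nobody thought about is rejected by default.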
