Cross-site scripting (XSS) is an increasingly common problem for web applications, and there was a discussion about it on the phpsec mailing list. Since we have been trying to prevent XSS in our BxCMS software for quite some time, I wrote a blog entry (and a little update) about it and put a test page online. Quite a few people tested some common exploits, and I further improved our script.
In this article, I will explain the different exploits the script tries to prevent, along with some other remarks.
What is Cross-Site Scripting (XSS)?
I'm far too lazy to write something about that myself; see Chris Shiflett's article for a very good overview and introduction to XSS.
About our script
We wanted to allow some HTML in our comment fields, but just allowing everything isn't a good idea. So it started with a few simple regular expressions for the most commonly abused constructs. As with all regex approaches to XML-style input, it soon got quite complex. It's far from perfect, but it does prevent most attacks (at least those known to me). It sometimes cleans too much, but if someone tries to input exploitable code, it doesn't matter if some not-so-bad stuff gets removed as well, IMHO. It shouldn't remove anything from valid input.
I also recommend applying tidy before, and maybe even after, the script. tidy itself fixes some badly written HTML, which makes the regexes easier. The script doesn't rely on tidy, but I nevertheless recommend it.
The script also does not prevent "CSS hacking", meaning you can quite easily change the layout of a page with properties like background-color. This is not dangerous, just annoying.
If you don't want any HTML in the input at all, there are much easier ways than this approach: just use regular PHP functions like strip_tags and htmlentities.
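For the no-HTML-at-all case, the idea is simple enough to sketch. Below is a rough Python rendering of what the PHP pair strip_tags() plus htmlentities() achieves; the function name and the naive tag regex are my own illustration, not the original code:

```python
import html
import re

def strip_all_html(text: str) -> str:
    """Disallow HTML entirely: drop tags, then escape what's left
    (rough equivalent of PHP's strip_tags() followed by htmlentities())."""
    without_tags = re.sub(r'<[^>]*>', '', text)  # naive tag stripper
    return html.escape(without_tags)

print(strip_all_html('<b onmouseover="alert(1)">hi</b> & bye'))  # hi &amp; bye
```

Escaping after stripping means any leftover angle brackets or ampersands end up as harmless entities.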
safehtml is a very good solution. It cleans even more than my script, and since it uses htmlsax, it can, for example, check individual attribute values far better and doesn't rely entirely on regex magic. It's also no slower than my solution.
There's nothing that speaks against those solutions; they are much older, better tested, and therefore almost certainly more secure than mine. When I first implemented my method, I didn't know about them. Now I do, and I certainly wouldn't write my own again. But I was curious what people would find and how the script could be improved.
The script dissected
I'll now explain each regex and what it tries to prevent.
Entities in input
A very common approach to circumvent XSS cleaners is to use entities instead of plain text, so the first thing we do is replace those entities with their UTF-8 equivalents (we assume the input is UTF-8 here).
Before doing that, there's one entity style that html_entity_decode doesn't catch. An entity can have whitespace before the closing ; (at least browsers support this). Something like
is treated exactly as an ä. We remove this with:
Furthermore, numeric entities don't need a trailing semicolon (very stupid, IMHO) to be recognized by browsers. The following line takes care of that:
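The original PHP lines aren't reproduced here, but the three entity quirks just described (whitespace before the semicolon, missing semicolons on numeric entities, and decoding everything to UTF-8) can be sketched in Python; the exact patterns are my own and only handle decimal numeric entities:

```python
import html
import re

def normalize_entities(text: str) -> str:
    # Browsers tolerate whitespace between an entity and its ';'
    # ("&#228 ;" still renders), so collapse it first.
    text = re.sub(r'(&#?\w+)\s+;', r'\1;', text)
    # Numeric entities are recognized even without a trailing ';';
    # add it so the decoder sees them (decimal entities only here).
    text = re.sub(r'(&#\d+)(?![\d;])', r'\1;', text)
    # Finally decode everything to literal UTF-8 characters.
    return html.unescape(text)

print(normalize_entities('&#106 ;avascript'))  # javascript
print(normalize_entities('&#106avascript'))   # javascript
```

After this normalization, later regexes only have to deal with literal characters, not with every possible entity spelling of them.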
One of the easiest ways to do XSS is to use one of the on* attributes, like onclick or onload. With these, you can easily execute a script without the user even having to do anything (with onload etc.), or just by making them click on or hover over something.
We just remove them all with
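The original preg_replace pattern isn't shown here, but a minimal Python sketch of the idea, assuming attribute values are quoted strings or unquoted tokens, looks like this:

```python
import re

def strip_event_handlers(tag_html: str) -> str:
    """Remove all on* event-handler attributes (onclick, onload, ...)."""
    return re.sub(
        r'\son\w+\s*=\s*(?:"[^"]*"|\'[^\']*\'|[^\s>]+)',
        '',
        tag_html,
        flags=re.IGNORECASE,
    )

print(strip_event_handlers('<img src="x.png" onerror="alert(1)">'))
# <img src="x.png">
```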
The second kind of attributes we remove here are namespace attributes. If we only want HTML as input, we don't need them. There are some funny exploitable things with XBL or with namespacing XHTML nodes in Mozilla (though that doesn't work if the site is delivered as text/html).
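Stripping namespace declarations can be sketched in Python like this; the pattern is illustrative, not the original:

```python
import re

def strip_namespace_attrs(tag_html: str) -> str:
    """Drop xmlns declarations; plain HTML doesn't need any of them."""
    return re.sub(
        r'\sxmlns(?::\w+)?\s*=\s*(?:"[^"]*"|\'[^\']*\'|[^\s>]+)',
        '',
        tag_html,
        flags=re.IGNORECASE,
    )

print(strip_namespace_attrs('<html xmlns:xbl="http://www.mozilla.org/xbl">'))
# <html>
```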
If you write your own filter, be aware that browsers also parse something like this:
There are more "dangerous" protocols you should get rid of, like:
about, wysiwyg, data, view-source, ms-its, mhtml, shell, lynxexec, lynxcgi, hcp, ms-help, help, disk, vnd.ms.radio, opera, res, resource, chrome, mocha, livescript
leads to an endless loop of alerts... url() expressions for loading background images can also contain *script: protocol handlers. The following removes them:
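A Python sketch of protocol-handler filtering, using the protocol list above plus the obvious *script ones; replacing the handler with a harmless made-up marker (here removed:) defuses the URL without mangling the rest of the attribute value. A real filter should also anchor on word boundaries, which this sketch skips:

```python
import re

# Assumed superset: the protocols listed above plus javascript/vbscript.
DANGEROUS_PROTOCOLS = (
    'javascript', 'vbscript', 'livescript', 'mocha', 'about', 'wysiwyg',
    'data', 'view-source', 'ms-its', 'mhtml', 'shell', 'lynxexec',
    'lynxcgi', 'hcp', 'ms-help', 'help', 'disk', 'vnd.ms.radio',
    'opera', 'res', 'resource', 'chrome',
)

_PROTO_RE = re.compile(
    '(?:' + '|'.join(re.escape(p) for p in DANGEROUS_PROTOCOLS) + r')\s*:',
    re.IGNORECASE,
)

def neutralize_protocols(value: str) -> str:
    """Replace dangerous protocol handlers (in attributes or CSS url())
    with a harmless marker instead of trying to repair the URL."""
    return _PROTO_RE.sub('removed:', value)

print(neutralize_protocols('<div style="background:url(javascript:alert(1))">'))
# <div style="background:url(removed:alert(1))">
```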
As mentioned above, the script sometimes removes too much. This regex is an example: it simply clears all attributes after the style attribute. But so what; bad input is bad input, and I don't care if it removes too much in such a situation (the safehtml solution mentioned above handles such cases better, btw).
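The coarse behaviour just described (everything from the style attribute to the end of the tag gets cut) could look like this in Python; it deliberately over-removes, just like the original:

```python
import re

def drop_style_onward(tag_html: str) -> str:
    """Coarse cleanup: cut everything from a style attribute to the end
    of the tag, deliberately discarding attributes that follow it."""
    return re.sub(r'\sstyle\s*=[^>]*>', '>', tag_html, flags=re.IGNORECASE)

print(drop_style_onward('<div style="width:expression(alert(1))" id="a">x</div>'))
# <div>x</div>
```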
There are more CSS tricks possible with instructions like behavior, include-source (NN4 only), -moz-binding, content, and absolute/fixed positioning (only bad for re-positioning, not for executing JS).
We removed all namespace declarations above; here we remove all elements that have a prefix, since they are not needed in HTML.
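Removing prefixed elements can be sketched like this in Python (the pattern is my own, not the original):

```python
import re

def strip_prefixed_elements(text: str) -> str:
    """Remove tags whose element names carry a namespace prefix."""
    return re.sub(r'</?\w+:\w+[^>]*>', '', text)

print(strip_prefixed_elements('a <xbl:binding id="x">b</xbl:binding> c'))
# a b c
```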
There are quite a few elements in HTML that you definitely don't want in something like user comments. We remove them with:
The reason for the while loop is that stuff like
is completely removed. Again, applying tidy before passing the string to this script would have prevented such input in the first place.
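The while loop matters because a single substitution pass can reassemble a forbidden tag out of its own fragments. A Python sketch, with a plausible (assumed, not the original) list of forbidden elements:

```python
import re

# Assumed list of forbidden elements; the original's exact set isn't shown.
DANGEROUS_ELEMENTS = ('script', 'object', 'embed', 'applet', 'iframe',
                      'frame', 'frameset', 'layer', 'meta', 'base', 'style')

_ELEM_RE = re.compile(
    r'<\s*/?\s*(?:' + '|'.join(DANGEROUS_ELEMENTS) + r')[^>]*>',
    re.IGNORECASE,
)

def remove_dangerous_elements(text: str) -> str:
    """Loop until nothing matches: one pass would turn
    '<scr<script>ipt>' into '<script>' and let it slip through."""
    while _ELEM_RE.search(text):
        text = _ELEM_RE.sub('', text)
    return text

print(remove_dangerous_elements('<scr<script>ipt>alert(1)</scr</script>ipt>'))
# alert(1)
```

With one pass, removing the inner tags from '<scr<script>ipt>' leaves '<script>'; looping until nothing matches closes that hole.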
The script seems to do its job, but I don't claim it's perfect. Use it at your own risk and combine it with other methods like tidy and strip_tags.
If you find further holes, please report them to me so we can improve the script.
The following people sent me input that helped improve the cleaning further. I'd like to thank them a lot:
Links