August 20, 2006 - 16:53 UTC - Tags: sql injection, XSS, input validation
Many web sites have SQL-injection and XSS (Cross Site Scripting) vulnerabilities, and security articles often mention lack of input validation as the reason for these problems. This isn't necessarily correct.
The metacharacter problem
Both SQL injection and XSS are metacharacter problems. A metacharacter is a control character used in a part of the system to control the display or flow of data. These problems occur every time a system communicates with a system of a different flavour, be it a browser, a database or a legacy system.
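To make the metacharacter idea concrete, here is a minimal sketch in Python using the standard sqlite3 module. The comments table, its columns and the find_comments_unsafe function are made up for illustration and are deliberately vulnerable. The point is that the quote character in the input is read by the database as part of the SQL syntax rather than as data.

```python
import sqlite3

# Hypothetical in-memory table, used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (name TEXT, body TEXT)")
conn.execute("INSERT INTO comments VALUES ('Joe Smith', 'First!')")
conn.execute("INSERT INTO comments VALUES ('Jane Doe', 'Nice post')")


def find_comments_unsafe(name):
    # Deliberately vulnerable: the name is pasted straight into the SQL text,
    # so a quote in the name ends the string literal and the rest becomes SQL.
    query = "SELECT name, body FROM comments WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


normal = find_comments_unsafe("Joe Smith")
# The quote after "nobody" closes the literal; OR '1'='1' is then always true.
injected = find_comments_unsafe("nobody' OR '1'='1")

print(len(normal), len(injected))
```

The first call matches a single row, while the second matches every row in the table, even though no commenter is called "nobody".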
Why input validation can fail to solve these problems
Consider a blogging system that allows users to post comments on entries. While this is a simple system, it has enough functionality to illustrate the point. The blogging system has a comment form containing the fields: name, e-mail, headline and comment body.
Let's start out with the name field. To avoid XSS or SQL injection, input validation needs to block any characters that are not used in a name. Following best practice, we create a white list of allowed characters. Creating this white list would be easy if all names had the same format ("Joe Smith"), but a problem arises when Conan O'Brien comes along. The ' in his name is a SQL control character and can also delimit strings in JavaScript or attribute values in HTML tags. So how do we handle this using input validation? We have to allow this character. We also have to allow foreign names which do not necessarily follow standard formats.
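The name problem can be sketched with a small white list check in Python. The pattern and the is_valid_name function below are assumptions made for illustration, not a recommended rule for real names. Any white list that accepts a name like O'Brien must let the apostrophe through, so the validated value still contains a SQL metacharacter.

```python
import re

# Hypothetical white list: groups of letters (including non-ASCII letters)
# separated by a single space, hyphen or apostrophe.
NAME_PATTERN = re.compile(r"[^\W\d_]+(?:[ '\-][^\W\d_]+)*")


def is_valid_name(name):
    # Accept only names of at most 100 characters that match the white list.
    return len(name) <= 100 and NAME_PATTERN.fullmatch(name) is not None


print(is_valid_name("Joe Smith"))      # an ordinary name
print(is_valid_name("Conan O'Brien"))  # must be allowed, apostrophe included
print(is_valid_name("<script>"))       # markup is rejected
```

The validation does its job of rejecting obvious junk, but the apostrophe in an accepted name reaches the rest of the application untouched.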
Next, consider the comment field. If a user wants to post some code to show how a piece of JavaScript is written, maybe the blog should allow that. In that case input validation cannot reject markup or quotes, so it cannot prevent XSS, or SQL injection for that matter.
So what do people do? During input validation, many people escape the HTML in the data, or escape the quote characters (', "). But is this really the best solution? Why should a comment exist in escaped format inside the application? Java, C#, or whatever language you are using does not require data to be escaped while it is held in a string. You can quickly run into trouble when you start displaying data from multiple sources. What data is escaped, and what data isn't? Is all data escaped? What is it escaped for? HTML? SQL? Both? Some weird legacy system?
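Here is a small Python sketch of the trouble that escaping on input can cause, using html.escape from the standard library. The variable names and the e-mail subject line are made up for illustration. Once the data is stored in escaped form, every later consumer has to know that, and anything that escapes again or is not HTML at all gets it wrong.

```python
import html

raw_name = "Conan O'Brien"

# Escaping at input time: the application now holds an HTML-encoded name.
stored = html.escape(raw_name)

# A template layer that escapes by default encodes the value a second time.
page_fragment = html.escape(stored)

# A plain-text destination (a hypothetical notification e-mail) shows the
# HTML entity to the reader instead of the apostrophe.
email_subject = "New comment from " + stored

print(stored)         # Conan O&#x27;Brien
print(page_fragment)  # Conan O&amp;#x27;Brien
print(email_subject)  # New comment from Conan O&#x27;Brien
```

Neither destination ends up with the name the user typed. In the browser, the double-escaped value renders as the literal text O&#x27;Brien.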
The solution
Don't get me wrong. I still think input validation should be present in every application. But to avoid metacharacter problems, data needs to be escaped when it leaves the system, not when it enters it. This means that the web application needs to escape data just before sending it to the database (preferably by using prepared statements) or to a legacy system. Data presented on an HTML page needs to be escaped when it is written to the page. And best practice for escaping should be "escaping by default", which means you need a reason if you are printing unescaped data.
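As a rough sketch of escaping at the point where data leaves the application, the Python example below uses sqlite3 parameter binding on the way to the database and html.escape on the way to the page. The table layout and the save_comment and render_comments functions are hypothetical and stand in for whatever your framework provides.

```python
import html
import sqlite3

# Hypothetical in-memory table, used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (name TEXT, body TEXT)")


def save_comment(name, body):
    # Outgoing to the database: placeholders keep the data out of the SQL text.
    conn.execute("INSERT INTO comments (name, body) VALUES (?, ?)", (name, body))


def render_comments():
    # Outgoing to the browser: every value is HTML-escaped as it is written.
    rows = conn.execute(
        "SELECT name, body FROM comments ORDER BY rowid"
    ).fetchall()
    return "\n".join(
        "<p><b>{}</b>: {}</p>".format(html.escape(name), html.escape(body))
        for name, body in rows
    )


save_comment("Conan O'Brien", "<script>alert('xss')</script>")

# The database holds exactly what the user typed, with nothing escaped.
stored = conn.execute("SELECT name, body FROM comments").fetchone()
page = render_comments()
print(stored)
print(page)
```

Inside the application and the database the comment stays in its raw form, so it can later be sent to any other destination with the escaping that destination needs.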
In the figure on the right, I have marked where input validation (blue) or output escaping (red) should be performed in a web application.
For more information on the metacharacter problem, please check out "Innocent Code" by Sverre Huseby.