Hacker News Re-Imagined

Don’t try to sanitize input, escape output (2020)

  • 128 points
  • 6 days ago

  • @maple3142
  • Created a post
  • 124 comments



@gumby 6 days

Replying to @maple3142 🎙

Since you don't know where your output will end up, how could you possibly know the syntax to escape it?

And how can the consumer of an arbitrary string trust that every input will have been properly escaped?



@whoopdedo 6 days

Replying to @maple3142 🎙

Sounds like a restatement of Postel's robustness principle[1]. Did it go out of style to "be conservative in what you send, be liberal in what you accept" and we need to relearn it again?

Well, perhaps it did. History has shown the dangers of not handling malformed input well. Postel's principle has received scrutiny[2] for reinforcing those mistakes by creating a mistaken belief in robustness. More recent recommendations have been to be stricter in handling of inputs[3].

But I think there is some confusion between robustness and defensiveness. "Be liberal in what you accept" may be confused with "don't sanitize your inputs" when not sanitizing is the less liberal action. Robustness means the program should not fail if it receives input it didn't expect. A program that crashes, hangs, executes unintended shell code, mangles the data, changes the thermostat, or exhibits other undefined behavior is not being robust. To prevent that from happening, data must be sanitized at input so that it can be processed without those side effects. The examples of programs failing at robustness have been cases where they were insufficiently defensive.

The bigger issue is that robustness doesn't scale easily. You may know how your bit of code will deal with malformed data, but what about every other library you use? Or other systems you communicate with? It becomes a backstage problem, where once someone has gained access to a restricted area it's assumed they are authorized to be there. The further down the tech stack you go the less likely the code will be defensive. That puts a burden on the public-facing sanity checks to anticipate how relaxed they can be about the input.

If you change the definition of output to include internal outputs, then Postel's principle gets new life. That is, try not to program the entire system and ecosystem at once, but treat each software component as an island. Be liberal not only with the data you receive from the end user, but also with return values from functions. Be conservative and escape not only your generated HTML, but also the SQL statements you dispatch to the backend. This is what input sanitizing is actually about: keeping the promise to the other parts of your program that your code isn't going to give them bad data. That's also what the linked article is saying, because the HTML being generated is itself one component in a chain of programs that includes the end user's browser.

[1] https://en.wikipedia.org/wiki/Robustness_principle

[2] https://programmingisterrible.com/post/42215715657/postels-p...

[3] https://datatracker.ietf.org/doc/html/draft-iab-protocol-mai...



@gkoberger 6 days

Replying to @maple3142 🎙

This solution doesn't match the problem. Even the SQL injection example shows him sanitizing the input, which is at odds with the title of the post. Log4J is a more recent example of it being too late/useless to escape the output.



@wnoise 6 days

Replying to @maple3142 🎙



@ffhhj 6 days

Replying to @maple3142 🎙

sanitize (client side) => confirm with user => trim+escape (server side) => insert



@1970-01-01 6 days

Replying to @maple3142 🎙

¿Por qué no los dos?



@chriswarbo 6 days

Replying to @maple3142 🎙

The fundamental problem is attempting to conflate a bunch of semantically-distinct things, just because they might happen to (sometimes) be represented in memory by similar byte sequences.

Such 'byte coincidences' lead to lazy, nonsensical operations, like "append this user-provided name to that SQL statement", implemented by munging together a bunch of bytes without thought for how they'll be interpreted.

A much better solution is to ignore whether things might just-so-happen to be represented in a similar way in memory, and instead keep things distinct if they have different semantic meanings (like "name", "SQL statement", "HTML source", "shell command", "form input", etc.). That way, if we try to do nonsensical things like appending user input to HTML, we'll get an informative error message that there is no such operation.

This isn't hard, but it requires more careful thought about APIs. Unfortunately many languages (and now frameworks) have APIs littered with "String", ignoring any distinctions between values and hence allowing anything to be plugged into anything else (AKA injection vulnerabilities).
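A minimal sketch of what keeping semantically distinct values apart could look like in Python. The `Html` class and the `text` helper are hypothetical names made up for illustration, not part of any framework; the only library call is the standard `html.escape`.

```python
from dataclasses import dataclass
import html


@dataclass(frozen=True)
class Html:
    """Markup considered safe to send to a browser (hypothetical type)."""
    source: str

    def __add__(self, other: "Html") -> "Html":
        # Only Html + Html is defined; a plain str is rejected.
        if not isinstance(other, Html):
            return NotImplemented
        return Html(self.source + other.source)


def text(value: str) -> Html:
    """Turn untrusted text into Html by escaping it."""
    return Html(html.escape(value))


name = "<script>alert(1)</script>"
page = Html("<p>Hello, ") + text(name) + Html("</p>")
print(page.source)

try:
    Html("<p>Hello, ") + name  # raw user input: no such operation
except TypeError as exc:
    print("rejected:", exc)
```

Here the mistake of appending a raw string to markup surfaces as a `TypeError` instead of silently producing an injectable page. Nothing stops a caller from writing `Html(name)` directly, so the constructor is where review attention would need to go.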



@Sohcahtoa82 6 days

Replying to @maple3142 🎙

Every time this topic comes up, the comments are full of people talking past each other because they're operating under different definitions of "sanitize", "input", and "escape".

And now in this case, we add "output" to the confusion.

Is the SQL query you send to your DB input or output?



@dang 6 days

Replying to @maple3142 🎙

Discussed at the time:

Don’t try to sanitize input – escape output - https://news.ycombinator.com/item?id=22431022 - Feb 2020 (280 comments)



@parhamn 6 days

Replying to @maple3142 🎙

It's cool to see how these posts are becoming less and less important in the wake of today's frameworks/tools protecting devs by default.

From ORMs escaping SQL, to FE frameworks escaping HTML/JS, to browsers starting to default to SameSite=Lax, it feels like we've slowly pulled ourselves out of OWASP hell. Pretty nice to see!

Obviously it's still important (see log4j) to know it all, especially when it's not so clear cut, but still good progress.



@nostrademons 6 days

Replying to @maple3142 🎙

I think a better way to think of this may be in terms of canonicalization. Inside your application, you should decide on a single canonical way to represent data, one which fits the type of processing and expected use of the application. For example, you might decide that all strings should be UTF-8, and should be interpreted (and stored) as whatever the user initially wrote. You might decide that any structured data should be parsed and then stored as protobufs in a BigTable. Or you might decide that an RDBMS is your native datastore and use whatever the native string encoding is for it, as well as parse and normalize data into tables upon input.

Then, whenever you take input, your job is to validate and encode it. If you get a Windows-1252 string, you should re-encode it to UTF-8 for further storage. If it contains byte sequences that are invalid UTF-8, you should either strip them, substitute a replacement character, or notify the user with a validation failure. Same with structured data that fails your normalization rules: you should usually notify the user.

And when you send output, you should escape based on the intended output device. If you're putting it in an HTML page, HTML-escape it. If it's a URL, url-encode it. If it's a database query, SQL escape it. If it's a CSV, quote it.

Thinking in these terms keeps the internal logic of your application simple (there are no format conversions except at system boundaries), and it also gives you a lot of flexibility to preserve the user's intent and add new output formats later.
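A rough sketch of that split, using only the Python standard library. The function names (`canonicalize`, `to_html`, `to_url_component`, `to_csv_row`) are hypothetical, and the canonical form is assumed to be a decoded Python `str`; a real application would pick its own canonical representation.

```python
import csv
import html
import io
from urllib.parse import quote


def canonicalize(raw: bytes, declared_encoding: str = "utf-8") -> str:
    """Input boundary: decode to the canonical form or fail validation."""
    # errors="strict" raises UnicodeDecodeError on invalid byte sequences;
    # errors="replace" would substitute U+FFFD instead.
    return raw.decode(declared_encoding, errors="strict")


def to_html(value: str) -> str:
    """Output boundary: text placed inside an HTML element."""
    return html.escape(value)


def to_url_component(value: str) -> str:
    """Output boundary: text placed in a URL path segment or query value."""
    return quote(value, safe="")


def to_csv_row(values: list[str]) -> str:
    """Output boundary: let the csv module handle quoting."""
    buf = io.StringIO()
    csv.writer(buf).writerow(values)
    return buf.getvalue()


# A Windows-1252 submission is decoded once, at the edge.
comment = canonicalize(b"Tom \x93the cat\x94 & <Jerry>", "cp1252")

print(to_html(comment))
print(to_url_component(comment))
print(to_csv_row([comment, 'say "hi"']), end="")
```

The stored value stays exactly what the user wrote; each output format applies its own escaping at the moment of use, so adding a new format later means adding one more function at the boundary rather than touching stored data.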



@taneq 6 days

Replying to @maple3142 🎙

I think escaping output is making the same mistake as sanitizing input. What we should really be saying is "stop using string interpolation/concatenation to process generic user data".

By default, text should only ever be treated as a blob. Yes, there are circumstances where it needs to be treated otherwise but they should be seen as a giant flashing 'danger' sign indicating the need to go back to sanitizing etc.
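For SQL, one common way to treat text as an opaque blob is a parameterized statement. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the `users` table and the sample name are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")

name = "Robert'); DROP TABLE users;--"

# The statement text is fixed; the value travels separately as a parameter,
# so it is never parsed as SQL.
conn.execute("INSERT INTO users (name) VALUES (?)", (name,))

stored = conn.execute("SELECT name FROM users").fetchone()[0]
row_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(stored == name, row_count)
```

No escaping function is called anywhere: the user's text is stored byte for byte, and the table is still there afterwards.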



@scotty79 6 days

Replying to @maple3142 🎙

I'm really surprised by the discussion here. It's so obviously true, and I realized this when the correct PHP function to escape a string for SQL was named mysql_real_escape_string.



@joering2 6 days

Replying to @maple3142 🎙

Every online form where a user can interact and send data back to a server is a nightmare in terms of security. I do use mod_secure, but for my next project I have an idea: base64-encode everything in the client's browser via JavaScript, send it to the server, and check on the backend that the content is valid base64. Is that a good concept?
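One way to probe this idea is to sketch the backend check and see what it accepts. The snippet below is only an illustration in Python (the `is_valid_base64` helper is a hypothetical name); it uses the standard `base64` module with `validate=True`.

```python
import base64
import binascii


def is_valid_base64(payload: str) -> bool:
    """Return True if the payload is well-formed base64."""
    try:
        base64.b64decode(payload, validate=True)
        return True
    except (binascii.Error, ValueError):
        return False


hostile = "<script>alert(1)</script>"

# What a client (or an attacker with curl) would send over the wire.
wire = base64.b64encode(hostile.encode("utf-8")).decode("ascii")

# The check passes, and decoding gives back the original text unchanged.
decoded = base64.b64decode(wire, validate=True).decode("utf-8")
print(is_valid_base64(wire), decoded == hostile)
```

Under these assumptions the check only says something about the transport encoding. The decoded content is whatever the sender chose, so it would still need escaping or parameterization wherever it is used.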



@blibble 6 days

Replying to @maple3142 🎙

guess I'll just put that 2gb "first name" directly into my database then



@AtNightWeCode 6 days

Replying to @maple3142 🎙

No: garbage in, garbage out. Sure, things like log or SQL injection should not be solved by sanitizing alone; you solve them by separating data and code. But a lot of the time you really do want to store data in a structured, canonical way. Usernames, for instance: it is bad if Unicode trickery lets you create multiple usernames that look the same. Or product descriptions: it is bad if your ML needs to handle HTML, and so on.
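A small sketch of canonicalizing usernames before checking uniqueness, using Python's standard `unicodedata` module. The `canonical_username` and `register` functions and the in-memory `taken` set are hypothetical; NFKC plus `casefold()` is one assumed policy, not the only reasonable one.

```python
import unicodedata


def canonical_username(name: str) -> str:
    # NFKC folds compatibility characters (fullwidth letters, ligatures);
    # casefold() then removes case distinctions.
    return unicodedata.normalize("NFKC", name).casefold()


taken: set[str] = set()


def register(name: str) -> bool:
    """Accept the name only if its canonical form is not already in use."""
    key = canonical_username(name)
    if key in taken:
        return False
    taken.add(key)
    return True


results = [
    register("admin"),
    register("\uff21\uff44\uff4d\uff49\uff4e"),  # fullwidth "Admin"
    register("ADMIN"),
]
print(results)
```

This catches compatibility and case variants only. Look-alikes from other scripts, such as a Cyrillic "а" in place of a Latin "a", normalize to different strings and would need a separate confusables check.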



@hamilyon2 6 days

Replying to @maple3142 🎙

Sanitizing inputs is not what you realistically want. You should prohibit certain types of input. Whitelisting strings is what I would call it.

You should escape outputs, of course (not that anyone in 2022 thinks otherwise).

Escaping outputs alone won't work because user input will be stored in some database, and you can't realistically predict how, when, or where it will be used years in the future. A user name could one day be used as a filename, opening up the possibility of a shell-based exploit. It could trigger a little-known spreadsheet formula vulnerability when exported for analysis. Novel, interesting XSS attacks are produced every day. The vulnerable code might not even be yours, but code your client or partner organisation runs. You just never know.

One common defence is user names (and other freeform fields) should not be allowed to be arbitrary bytes.

That is defence in depth, an established practice.
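As a sketch of what prohibiting input (rather than rewriting it) might look like, here is an allowlist check in Python. The policy itself (1 to 32 characters, ASCII letters, digits, underscore and hyphen, starting with a letter or digit) and the names `USERNAME_RE` and `validate_username` are assumptions chosen for illustration.

```python
import re

# Hypothetical policy: 1-32 characters, must start with a letter or digit,
# then letters, digits, underscore or hyphen only.
USERNAME_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9_-]{0,31}")


def validate_username(value: str) -> str:
    """Return the value unchanged if allowed, otherwise raise ValueError."""
    if not USERNAME_RE.fullmatch(value):
        raise ValueError("username not allowed by policy")
    return value


for candidate in ["alice_01", "=HYPERLINK(1)", "../etc/passwd", "a" * 33]:
    try:
        validate_username(candidate)
        print("accepted:", candidate)
    except ValueError:
        print("rejected:", candidate[:20])
```

The value is either accepted as written or refused with an error the user can act on; nothing is silently stripped. Output escaping is still needed downstream, since this layer only narrows what can reach it.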



@swlkr 6 days

Replying to @maple3142 🎙

A strong content security policy also helps with XSS.



@iou 6 days

Replying to @maple3142 🎙

Do both pls.



@ipaddr 6 days

Replying to @maple3142 🎙

If you skip sanitizing input, you create an unsafe datastore that might be used by other applications later. Sanitize as soon as possible.



@ncc-erik 6 days

Replying to @maple3142 🎙

I think what makes this hard for folks is tracking what the expected form of data is at each step of its lifecycle, especially for people working with new and unfamiliar codebases or splitting focus across multiple projects.

There are some frameworks that try using types to solve the problem. Alternatively, the developers could throw in a comment that looks something like:

// client == submits raw data ==> web_server == inserts raw data (param. sql stmt) ==> db_server ==> returns query with raw data ==> our_function == returns html-escaped data ==> client



@billpg 6 days

Replying to @maple3142 🎙

Shameless plug: NEVER Sanitize Your Inputs (by me, 2013) https://billpg.com/never-sanitize-your-inputs/



@Sebb767 6 days

Replying to @maple3142 🎙

> The parallel for SQL injection might be if you’re building a data charting tool that allows users to enter arbitrary SQL queries. You might want to allow them to enter SELECT queries but not data-modification queries. In these cases you’re best off using a proper SQL parser [...] to ensure it’s a well-formed SELECT query – but doing this correctly is not trivial, so be sure to get security review.

If you are ever in this situation, you should actually use a dedicated read-only user that can only access the relevant data. If you need to hide columns, use views. Trying to parse SQL can easily go very wrong, especially when someone (ab-)uses the edge cases of your DB.
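The same idea can be tried out locally with SQLite, which has no user accounts but can open a database file in read-only mode. This is only an analogue of a dedicated read-only database user, sketched with Python's `sqlite3` module; the `sales` table and file name are made up.

```python
import os
import sqlite3
import tempfile
from pathlib import Path

path = Path(tempfile.mkdtemp()) / "charts.db"

# Set-up with a normal read-write connection.
rw = sqlite3.connect(os.fspath(path))
rw.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
rw.execute("INSERT INTO sales VALUES ('north', 10)")
rw.commit()
rw.close()

# User-supplied queries run on a connection opened read-only via a URI.
ro = sqlite3.connect(path.as_uri() + "?mode=ro", uri=True)

rows = ro.execute("SELECT region, amount FROM sales").fetchall()

try:
    ro.execute("DELETE FROM sales")  # stands in for a hostile user query
    write_blocked = False
except sqlite3.OperationalError:
    write_blocked = True

ro.close()
print(rows, write_blocked)
```

The application never inspects the query text; the database itself refuses the write. On a server database the equivalent would be an account granted only SELECT on the relevant tables or views.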


