How to allow newline characters but still prevent CRLF attack? - c#

I've run a security scan on my server and got a CRLF exploitation warning.
So, as recommended, I've sanitized all my query parameter inputs as shown below.
var encodedStringSafeFromCRLF = Server.UrlDecode(Request.QueryString["address"])
.Replace("\r", string.Empty)
.Replace("%0d", string.Empty)
.Replace("%0D", string.Empty)
.Replace("\n", string.Empty)
.Replace("%0a", string.Empty)
.Replace("%0A", string.Empty);
Let's say, a genuine user is sending an address to me via "address" query parameter.
Example -
https://mywebsite.com/details?instId=151711&address=24%20House%20Road%0aSomePlace%0aCountry
Since "%0A" will be stripped from the above string, the address would now become
'24 House RoadSomePlaceCountry', which is not what I expected.
How should I handle this?
If I make code changes for CRLF, this changes how the input is interpreted.
If the input string is not sanitized, it would open my server to a CRLF attack.
Any suggestions here?

If you really need the user to supply data with CRLF sequences, then I would not filter those. As always, never trust user-supplied data in any way: do not use it to generate HTTP headers, responses or write to log files.
In general, it's safer to filter the other way around: specify all the characters you are willing to accept, and filter out everything else.
If you need to write the data to a log, you could for example URL encode the data first, so that "naked" CR LF are never written there.
I might internally specify that I use just \n as the newline, and convert all \r, \n, and \r\n into just one representation \n internally. So the rest of the code does not have to handle all versions.
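A minimal sketch of the ideas above, not a definitive implementation. The class name `AddressInput` and the particular allowlist are assumptions made for illustration; adjust the allowed characters to whatever your addresses really need. It normalizes every newline variant to \n, keeps only allowlisted characters, and percent-encodes the value before it is written to a log.

```csharp
using System;
using System.Linq;
using System.Net;

// Hypothetical helper, not part of ASP.NET.
public static class AddressInput
{
    // Collapse "\r\n" and lone "\r" into "\n" so later code sees one newline form.
    public static string NormalizeNewlines(string input)
    {
        return input.Replace("\r\n", "\n").Replace("\r", "\n");
    }

    // Allowlist (an assumption for this sketch): letters, digits, space,
    // "\n", and a few punctuation marks that are common in addresses.
    public static bool IsAllowed(char c)
    {
        return char.IsLetterOrDigit(c) || c == ' ' || c == '\n' || ",.-/#'".IndexOf(c) >= 0;
    }

    // Normalize newlines first, then drop every character not on the allowlist.
    public static string Sanitize(string input)
    {
        string normalized = NormalizeNewlines(input);
        return new string(normalized.Where(IsAllowed).ToArray());
    }

    // Percent-encode before logging so a raw CR or LF never reaches the log file.
    public static string ForLog(string value)
    {
        return WebUtility.UrlEncode(value);
    }
}
```

The sanitized value still contains \n, so it must never be copied into an HTTP header; that rule from the answer is what actually prevents response splitting.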

Related

C# Equivalent to CRLF not working in comparison (==)

I am using a string comparison to get rid of a "\r\n" which is essentially a CRLF.
if (somevalue != "\r\n")
{
}
I have seen a few suggestions/variations of this on SO but not this exactly. How do you check if a string is equal to "\r\n"? When I do it, it's literally looking for those characters in the text.
In my particular case I was incorrectly parsing an XML file. The key point, however, noted by @Prix was this:
Technically you can match \r\n against a newline however different
systems will write a newline differently. You further mention CRLF
which is specifically \r\n. So assuming your string is EXACTLY \r\n it
will match the way you're trying to, but assuming you're receiving it
from some XML data and you have the possibility of having more data
attached to it, its improbable it would match.
I am re-evaluating how I am reading in my XML file based on all the helpful links everyone posted in comments, but I wanted to summarize the outcome in case someone else tries to do a string comparison in XML and runs into it not working.
Thank you to all that commented on this question.
Also, you can use this:
if (somevalue != Environment.NewLine) {
//your code
}
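Keep in mind that Environment.NewLine is platform dependent ("\r\n" on Windows, "\n" on Unix-like systems), so it only matches CRLF on some machines. As a rough sketch of a more tolerant check (the class and method names here are made up for illustration), you can test whether the value is nothing but whitespace instead of comparing against one exact sequence:

```csharp
using System;

// Hypothetical helper for illustration.
public static class NewlineCheck
{
    // True when the value is null, empty, or consists only of whitespace,
    // which covers "\r\n", "\n", "\r", and newlines padded with spaces or tabs.
    public static bool IsBlank(string value)
    {
        return string.IsNullOrWhiteSpace(value);
    }
}
```

With this, `NewlineCheck.IsBlank("\r\n  ")` is true, while `"\r\n  " == "\r\n"` is false because of the trailing spaces.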

Semicolon in url as a separator for query strings

I keep hearing that W3C recommends to use ";" instead of "&" as a query string separator.
We recommend that HTTP server implementors, and in particular, CGI
implementors support the use of ";" in place of "&" to save authors
the trouble of escaping "&" characters in this manner.
Can somebody please explain why ";" is recommended instead of "&"?
Also, I tried using ";" instead of "&" (example: .com?str1=val1;str2=val2). When reading Request.QueryString["str1"] I get "val1;str2=val2". So if ";" is recommended, how do we read the query strings?
As the linked document says, ; is recommended over & because
the use of the "&" character to separate form fields interacts with its use in SGML attribute values to delimit character entity references.
For example, say you want your URL to be ...?q1=v1&q2=v2
There's nothing wrong with & there. But if you want to put that query into an HTML attribute, <a href="...?q1=v1&q2=v2">, it breaks because, inside an HTML attribute, & represents the start of a character entity. You have to escape the & as &amp;, giving <a href="...?q1=v1&amp;q2=v2">, and it'd be easier if you didn't have to.
; isn't overloaded like this at all; you can put one in an HTML attribute and not worry about it. Thus it'd be much simpler if servers recognised ; as a query parameter separator.
However, by the look of things (based on your experiment), ASP.Net doesn't recognise it as such. How to get it to? I'm not sure you can.
In short, HTML is a big mess (due to its leniency), and using semicolons helps to simplify this a LOT.
In order to use semicolons as the separator, I don't know if .NET allows this customization or whether we developers need to write our own methods to process the QueryString. .NET does give us access to the raw QueryString, and we can run with it from there. This is what I did. I wrote my own methods, which wasn't too hard, but it took a lot of testing time and debugging, some of which was Microsoft's fault for not even conforming to web standards when dealing with surrogate pairs. I made sure my implementation works with the full range of Unicode characters including the Multilingual plane (thus for Chinese and Japanese characters, etc.).
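For illustration only, here is a rough sketch of what such a hand-written parser might look like. It is not the code described above: the class name `SemicolonQuery` is hypothetical, it accepts both ';' and '&' as separators, and it makes no attempt to deal with the surrogate-pair issues mentioned. In ASP.NET the raw query could come from something like Request.Url.Query (an assumption about your setup).

```csharp
using System;
using System.Collections.Generic;
using System.Net;

// Hypothetical helper, not part of .NET.
public static class SemicolonQuery
{
    // Parses "?k1=v1;k2=v2" (or "&"-separated) into a dictionary.
    // If a key appears more than once, the last value wins.
    public static Dictionary<string, string> Parse(string rawQuery)
    {
        var result = new Dictionary<string, string>();
        string query = rawQuery.TrimStart('?');

        foreach (string pair in query.Split(new[] { ';', '&' }, StringSplitOptions.RemoveEmptyEntries))
        {
            int eq = pair.IndexOf('=');
            string key = eq < 0 ? pair : pair.Substring(0, eq);
            string value = eq < 0 ? string.Empty : pair.Substring(eq + 1);

            // Decode only after splitting, so an encoded %3B or %26 stays data.
            result[WebUtility.UrlDecode(key)] = WebUtility.UrlDecode(value);
        }
        return result;
    }
}
```

Calling `SemicolonQuery.Parse("?str1=val1;str2=val2")["str1"]` gives "val1" rather than "val1;str2=val2".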
Before adding my own findings, I also want to confirm and include the great info that Rawling, Jeevan, and BeniBela have pointed out in Rawling's answer and the comments on that answer: it is incorrect in HTML to not escape ampersands, but it usually works, and only because parsers are so tolerant. With that, I also explain why such improper encoding can lead to bugs (which most developers probably fall victim to).
One cannot depend on this leniency toward improperly encoded ampersands in QueryStrings, and sometimes this leniency leads to nasty bugs. Let's say, for instance, a QueryString passes a random ASCII string (or user input) that is not properly encoded. Then an 'amp;' which follows '&' gets decoded, and the unexpected consequence is that 'amp;' is essentially 'swallowed'. (By swallowed, I mean it gets 'eaten' or goes missing.) A practical scenario is when the user is asked for input that goes into a database and the user inputs HTML (like here at StackOverflow); because it is not posted correctly, nasty bugs develop.
The real advantage of the ';' separator is simplicity: proper encoding of ampersand-separated QueryStrings takes two steps for URL strings in an HTML page (and in XML too). First the keys and values should be URL encoded and concatenated, and then the whole QueryString or URL should be HTML encoded (or for XML, encoded with a very similar encoding). Also don't forget that HTML encoding and URL encoding are different processes, and it's important that they are different. A developer needs to be careful to distinguish the two. And since they are similar, it's not uncommon to see them mixed up by novice programmers.
A good example of a potential problematic URL is when passing two name/values in a QueryString:
a = 'me & you', and
b = 'you & me'.
Here, using '&' as a separator, '?a=me+%26+you&b=you+%26+me' is a proper querystring BUT it should also be HTML encoded before being written to HTML source code. This is important to be bug free. Most developers aren't careful to do this two-step process of first URL encoding the keys and values and then HTML encoding the full URL in the HTML source. It's no wonder why, when I had to sit down and seriously think this process through and test my conclusions thoroughly. Imagine when the name/value pair is 'year=año', or far more complex, when we need Chinese or Japanese characters that use surrogate pairs to represent them!
For the same key/value pairs a and b, when using ';' as the separator, the process is MUCH simpler. As a matter of fact, the ampersand separator makes the process more than twice as complex as using the semicolon separator! Here's the same info represented using ';' as a separator: '?a=me+%26+you;b=you+%26+me'. We notice that the only difference is that there's no '&' in the string. But using this ';' separator means that no second step of HTML encoding the URL or QueryString is needed. Now imagine if I were writing HTML and wanted correct HTML and needed to write the HTML to explain all this! All this HTML encoding with '&' really adds a lot of complication (and for many developers, quite a lot of confusion too).
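The two-step process can be sketched as follows. This is only an illustration using WebUtility from System.Net; the class `LinkBuilder` and its method names are invented for this example.

```csharp
using System.Net;

// Hypothetical helper for illustration.
public static class LinkBuilder
{
    // Step 1: URL-encode each value, then join the pairs with the chosen separator.
    public static string BuildQuery(string a, string b, char separator)
    {
        return "?a=" + WebUtility.UrlEncode(a) + separator + "b=" + WebUtility.UrlEncode(b);
    }

    // Step 2: HTML-encode the finished query before writing it into HTML source.
    public static string ForHtmlAttribute(string query)
    {
        return WebUtility.HtmlEncode(query);
    }
}
```

With '&' as the separator, step 1 yields "?a=me+%26+you&b=you+%26+me" and step 2 turns it into "?a=me+%26+you&amp;b=you+%26+me". With ';' as the separator, step 1 yields "?a=me+%26+you;b=you+%26+me" and step 2 leaves this particular string unchanged, which is the simplification being described.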
Novice developers would simply not HTML encode the QueryString or URL, which is CORRECT when ; is the separator. But it leaves room for bugs when the ampersand is improperly encoded. So '?someText=blah&blah' would need proper encoding.
Also in .NET, we can write XML documentation for our methods. Well, just today, I wrote a little explanation that used the above 'a=me+%26+you&b=you+%26+me' example. And in my XML, I had to manually type all those &amp; character entities. XML documentation is picky, so one must correctly encode ampersands. But the leniency in HTML adds to the ambiguity.
Perhaps this wasn't too confusing. But all the confusion or difficulty is due to using as the separator a character which should be HTML encoded, thus '&' is the culprit. And the semicolon relieves all that complication.
One last consideration: with how much more complicated the '&' separator makes this process, it's no wonder to me why the Microsoft implementation of surrogate pairs in QueryStrings still does not follow the official specifications. And if you write your own methods, you MUST account for Microsoft's incorrect use of percent-encoding surrogate pairs. The official specs forbid percent-encoding of surrogate pairs in UTF-8. So anyone who writes their own methods which also handle the full range of Unicode characters, beware of this.

How to Secure a string from UserInput

We have a Website where a user can reject something from another user with a reason. The reason is sent to the other user via email.
What characters can be sent to the other user?
I am not going to do a regex search for unwanted characters but rather only want to keep "potentially" "safe" characters in the reason.
For example the reason:
"Hello <b>Dear User B"
Would be transformed to:
"Hello bDear User B"
Currently I'm just doing a "Where" on the char array and defining my "safe" conditions via
char.IsLetterOrDigit || char.IsPunctuation || char.IsWhiteSpace
Are there any better techniques?
You could perfectly well use HTML in the email body. This allows for prettier formatting. You could run the AntiXSS library on the user input before sending it by email to filter dangerous things out of the HTML (things like <script> tags, for example).
You could try HTML-encoding your string, for example with HttpUtility.HtmlEncode (System.Web) or WebUtility.HtmlEncode (System.Net).
This will turn characters such as < and > into their HTML entities (&lt;, &gt;, etc.), returning a safe string.
Though it will look ugly with all the &lt; and similar entities if the email is shown as plain text, you could see it as a potentially safe workaround.
If you are inserting the comments into a database as well I wouldn't recommend using this method, as it's not sql-injection proof. In that case you can use the htmlencode for sending emails and use parameters with inserting to the database.
If you think characters with ASCII values 32 to 126 are safe for you, then you can use
bool bad = value.Any(c => c < 32 || c > 126);
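To compare the two approaches discussed in this thread, here is a small sketch; the class `ReasonText` and its method names are hypothetical. One method applies the character filter from the question, the other HTML-encodes with WebUtility.HtmlEncode from System.Net.

```csharp
using System.Linq;
using System.Net;

// Hypothetical helper for illustration.
public static class ReasonText
{
    // The filter from the question: keep letters, digits, punctuation, whitespace.
    public static string KeepSafeChars(string reason)
    {
        return new string(reason
            .Where(c => char.IsLetterOrDigit(c) || char.IsPunctuation(c) || char.IsWhiteSpace(c))
            .ToArray());
    }

    // Encoding keeps the user's text intact but makes it inert as HTML.
    public static string EncodeForHtmlEmail(string reason)
    {
        return WebUtility.HtmlEncode(reason);
    }
}
```

For "Hello <b>Dear User B", the filter returns "Hello bDear User B" and the encoder returns "Hello &lt;b&gt;Dear User B". Note that char.IsPunctuation accepts characters such as & and the double quote, so the filter on its own is not a substitute for encoding when the text ends up inside HTML.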

multiline textbox to string

I have a multiline textbox that I wish to convert to a string,
I found this
string textBoxValue = textBox1.Text.Replace(Environment.NewLine,"TOKEN");
But I don't understand TOKEN. What is TOKEN? Whitespace or a \n newline?
If this is the incorrect approach, please let me know the correct way of doing this.
Thanks
In the code snippet you gave, "TOKEN" is any value you wish to insert, such as an HTML <br /> tag, more Environment.NewLines for formatting, or just some random delimiter that will later allow you to split the text on it.
A very simple example:
string text = textBox1.Text.Replace(Environment.NewLine, "^"); // a random token
string[] lines = text.Split('^'); // split the text back into lines on the token
If you are handling input from a textbox available on the web, you also need to take into account XSS (http://en.wikipedia.org/wiki/Cross-site_scripting). Also, in a real scenario I would split on a more complex token and make sure to handle multiple carriage returns in the input value.
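If the goal is just to get the individual lines, one alternative to a replacement token is to split directly on every newline variant. A minimal sketch (the class name `TextLines` is made up for this example):

```csharp
using System;

// Hypothetical helper for illustration.
public static class TextLines
{
    // Splits on "\r\n", "\n", or "\r" and drops empty lines, so repeated
    // carriage returns in the input do not produce blank entries.
    public static string[] SplitLines(string text)
    {
        return text.Split(new[] { "\r\n", "\n", "\r" }, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

For example, `TextLines.SplitLines(textBox1.Text)` would work regardless of which line ending the pasted text uses, and no token can collide with the user's input.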
EDIT: now that I see your actual requirements, this code may do what you need:
// replace newlines with a single whitespace
string text = textBox1.Text.Replace(Environment.NewLine, " ");
EDIT #2:
further I need to enter this data into SQLite and rewrite his whole application. The company does not wish to have information from the previous application inputted to the new database; there are hyperlinks etc. embedded in the content, so if there is a way I can make the text box only accept RAW data this would be the best.
Regular Expressions are the way to go for something like this, unless the data is structured enough to load into an XML or HTML DOM and process. You can build regular expressions in a variety of tools (do a Google search for a free online tester and you will find many). Once you have determined the expressions you need, you can use the Regex object in C# to match, replace, etc.
http://msdn.microsoft.com/en-us/library/ms228595(VS.80).aspx
http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.replace(v=VS.100).aspx
As it stands, "TOKEN" is just a meaningless string, unless it is used elsewhere in your code. You can replace "TOKEN" with any text you like.
Edit:
Okay, so you say you're removing NewLine's from your client's text. So you would do it like this. Paste their text into a textBox called textBox2, then use the following:
textBox2.Text = textBox2.Text.Replace(Environment.NewLine, string.Empty);

How to split a user-generated string which may contain the delimiter?

I'd like to String.Split() the following string using a comma as the delimiter:
John,Smith,123 Main Street,212-555-1212
The above content is entered by a user. If they enter a comma in their address, the resulting string would cause problems to String.Split() since you now have 5 fields instead of 4:
John,Smith,123 Main Street, Apt 101,212-555-1212
I can use String.Replace() on all user input to replace commas with something else, and then use String.Replace() again to convert things back to commas:
value = value.Replace(",", "*");
However, this can still be fooled if a user happens to use the placeholder delimiter "*" in their input. Then you'd end up with extra commas and no asterisks in the result.
I see solutions online for dealing with escaped delimiters, but I haven't found a solution for this seemingly common situation. What am I missing?
EDIT: This is called delimiter collision.
This is a common scenario — you have some arbitrary string values that you would like to compose into a structure, which is itself a string, but without allowing the values to interfere with the delimiters in structure around them.
You have several options:
Input restriction: If it is acceptable for your scenario, the simplest solution is to restrict the use of delimiters in the values. In your specific case, this means disallow commas.
Encoding: If input restriction is not appropriate, the next easiest option would be to encode the entire input value. Choose an encoding that does not have delimiters in its range of possible outputs (e.g. Base64 does not feature commas in its encoded output)
Escaping delimiters: A slightly more complex option is to come up with a convention for escaping delimiters. If you're working with something mainstream like CSV it is likely that the problem of escaping is already solved, and there's a standard library that you can use. If not, then it will take some thought to come up with a complete escaping system, and implement it.
If you have the flexibility to not use CSV for your data representation this would open up a host of other options. (e.g. Consider the way in which parameterised SQL queries sidestep the complexity of input escaping by storing the parameter values separately from the query string.)
This may not be an option for you, but would it not be easier to use a very uncommon character, say a pipe |, as your delimiter and not allow this character to be entered in the first place?
If this is CSV, the address should be surrounded by quotes. CSV parsers are widely available that take this into account when parsing the text.
John,Smith,"123 Main Street, Apt. 6",212-555-1212
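To show how the quoting convention works, here is a deliberately small sketch of a single-line CSV reader and writer. The class `SimpleCsv` is hypothetical, and it does not handle fields that span multiple lines; for real data, an established CSV library is the safer choice.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration; single-line records only.
public static class SimpleCsv
{
    // Wraps a field in quotes when it contains a comma, quote, or newline,
    // doubling any quotes inside it.
    public static string QuoteField(string field)
    {
        if (field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) < 0) return field;
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }

    // Splits one line on commas, treating commas inside double quotes as data.
    public static List<string> ParseLine(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c == '"' && i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"');
                    i++; // skip the second quote of the escaped pair
                }
                else if (c == '"') inQuotes = false;
                else current.Append(c);
            }
            else if (c == '"') inQuotes = true;
            else if (c == ',')
            {
                fields.Add(current.ToString());
                current.Clear();
            }
            else current.Append(c);
        }
        fields.Add(current.ToString());
        return fields;
    }
}
```

Parsing the line above yields four fields, with "123 Main Street, Apt. 6" as the third one.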
One foolproof solution would be to convert the user input to base64 and then delimit with a comma. It will mean that you will have to convert back after parsing.
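A short sketch of the Base64 idea, assuming UTF-8 text; the class `Base64Fields` and its method names are invented for this example. Because the Base64 alphabet contains no comma, the encoded fields can be joined and split safely.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper for illustration.
public static class Base64Fields
{
    // Encode each field as Base64 (UTF-8 bytes) and join with commas.
    public static string Join(params string[] fields)
    {
        return string.Join(",", fields.Select(f => Convert.ToBase64String(Encoding.UTF8.GetBytes(f))));
    }

    // Split on commas and decode each part back to the original text.
    public static string[] Split(string record)
    {
        return record.Split(',')
            .Select(p => Encoding.UTF8.GetString(Convert.FromBase64String(p)))
            .ToArray();
    }
}
```

The trade-off is that the stored record is no longer human readable and grows by roughly a third.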
You could try putting quotes, or some other begin and end delimiters, around each of the user inputs, and ignore any special character between a set of quotes.
This really comes down to a situation of cleansing user inputs. You should only allow desired characters in the user input and reject/strip invalid inputs from the user. This way you could use your asterisk delimiter.
The best solution is to define valid characters, reject invalid characters somehow, and then use an invalid character (which will not appear in the input since it is "banned") as your delimiter.
Don't allow the user to enter the character that you are using as a delimiter. I personally feel this is the best way.
Funny solution (works if the address is the only field that can contain a comma):
Split the string by comma. The first two pieces will be the first name and last name; the last piece is the telephone number. Take those away. Join the rest back together with commas, and that is the address ;)
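That idea might look like this in code. It is only a sketch, with a hypothetical class name, and it relies on the stated assumption that the address is the only field that can contain commas.

```csharp
using System;

// Hypothetical helper for illustration.
public static class AddressRecord
{
    // Returns { firstName, lastName, address, phone }.
    public static string[] Parse(string record)
    {
        string[] parts = record.Split(',');
        if (parts.Length < 4)
            throw new FormatException("Expected at least 4 comma-separated parts.");

        // Everything between the first two pieces and the last piece is the address.
        string address = string.Join(",", parts, 2, parts.Length - 3);
        return new[] { parts[0], parts[1], address, parts[parts.Length - 1] };
    }
}
```

For "John,Smith,123 Main Street, Apt 101,212-555-1212" the third element comes back as "123 Main Street, Apt 101".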
In a sense, the user is already "escaping" the comma with the space afterward.
So, try this:
string[] values = Regex.Split(value, ",(?![ ])"); // split on commas not followed by a space
The user can still break this if they don't put a space, and there is a more foolproof method (using the standard CSV method of quoting values that contain commas), but this will do the trick for the use case you've presented.
One more solution: provide an "Address 2" field, which is where things like apartment numbers would traditionally go. The user can still break it if they are lazy, though what they'll actually break is the fields after address2.
Politely remind your users that properly-formed street addresses in the United States and Canada should NEVER contain any punctuation whatsoever, perhaps?
The process of automatically converting corrupted data into useful data is non-trivial without heuristic logic. You could try to outsource the parsing by calling a third-party address-formatting library to apply the USPS formatting rules.
Even USPS requires the user to perform much of the work, by having components of the address entered into distinct fields on their address "canonicalizer" page (http://zip4.usps.com/zip4/welcome.jsp).
