I have re-coded to avoid embedded null characters in C# strings,... but was wondering why the following calls gave no warning or exception for string parameters with an embedded null character, and whether this was a bug in StringBuilder.ToString(), a bad practice in general for C#, or at worst a vulnerability in .NET.
For background I have a WPF application that was parsing through an XPath to create nodes and attributes within an XmlDocument when needed. The StringBuilder class let me replace a path delimiter with a null character, e.g.: xpathtonode[i] = '\0';
Though this is allowed, if it were a bad practice I would hope to receive an exception or at least a warning.
The call to xpathtonode.ToString() would correctly return the string up to the null terminating character, except when the null character was the last character; then the null character would be included in the string returned by ToString(). Thus the string's Length property would be longer than the intended string value.
If StringBuilder.ToString() recognized the null character at the end of the string and excluded it, the following issue would not have occurred. Maybe this is just a bug in the StringBuilder class...
The subsequent call to XmlDocument.CreateAttribute(...), or even a call to exclude the embedded null character, xpathtonode.ToString().Substring(offset, length), would exit the thread of execution without error or exception. My program and the debugger would continue to operate as if the call had never occurred,...
I doubt that this would be an OS style buffer overflow vulnerability,... but it is creepy to have the flow of execution interrupted and continue without any indication.
Bug? Bad Practice? Vulnerability?
In your problem statement, you said,
The StringBuilder class let me replace a path delimiter with a null character, e.g.:
xpathtonode[i] = '\0';
Though this is allowed, if it were a bad practice I would hope to receive an exception or at least a warning.
U+0000 (ASCII NUL) is a perfectly legal Unicode control character and a perfectly legal character in a .NET string: .NET strings aren't null-terminated; they carry a length specifier around with them.
You might use a more appropriate Unicode/ASCII control character for this:
U+001C (FS) is File Separator.
U+001D (GS) is Group Separator.
U+001E (RS) is Record Separator.
U+001F (US) is Unit Separator.
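For illustration, here is a minimal sketch (the variable names and sample path are made up) showing both that an embedded U+0000 survives in a .NET string and how U+001F could serve as the delimiter instead:

using System;
using System.Text;

class SeparatorDemo
{
    static void Main()
    {
        // A NUL is a legal character in a .NET string; Length counts it.
        var sb = new StringBuilder("books/fiction");
        sb[5] = '\0';
        Console.WriteLine(sb.ToString().Length);   // 13 - the '\0' is still there

        // Using U+001F (Unit Separator) as the delimiter instead.
        var path = new StringBuilder("books/fiction");
        path[5] = '\u001F';
        string[] parts = path.ToString().Split('\u001F');
        Console.WriteLine(parts[0] + " | " + parts[1]);   // books | fiction
    }
}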
Back in the old days (history lesson coming), when men were men, data was persisted to paper tape or punch cards.
In particular, on paper tape, fields within a file record would be separated with US, the unit separator. Groups of fields (e.g., repeating fields or a group of related fields) might be delimited with GS (group separator). Individual records within a file would be delimited with RS (record separator) and individual files on the tape with FS the file separator.
Punch cards were a little different since cards were discrete things. Each record was often (but not always!) on a single punch card. And a "file" might be 1 or more boxes of punch cards.
Bug? Bad Practice? Vulnerability?
Specific to .NET's XmlDocument object, since you mention that calls to CreateAttribute(...) or xpathtonode.ToString().Substring(offset,length) cause the thread to be exited without error or exception, this appears to be a small bug. It would be bad practice for you to include the null character in any code because of this quirk.
However, this can also be classed as a vulnerability if you are constructing the path from user input, as a malicious user could include the null character on purpose to change the execution path of your code. It is good practice anyway to sanitize any user-controlled or external data in XPath queries, as otherwise your code would be vulnerable to XPath Injection:
XPath Injection attacks occur when a web site uses user-supplied information to construct an XPath query for XML data. By sending intentionally malformed information into the web site, an attacker can find out how the XML data is structured, or access data that he may not normally have access to.
There are a few ways to avoid XPath Injection in .NET.
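For example, one simple approach (a sketch only, and not the only option) is to whitelist-validate any user-supplied value before splicing it into the expression; the names SafeXPath, doc and userInput below are hypothetical:

using System;
using System.Text.RegularExpressions;
using System.Xml;

static class SafeXPath
{
    public static XmlNode FindUserNode(XmlDocument doc, string userInput)
    {
        // Allow only letters, digits, underscores and hyphens; this rejects quotes,
        // slashes, brackets and embedded null characters that could alter the query.
        if (!Regex.IsMatch(userInput, @"^[A-Za-z0-9_-]+$"))
            throw new ArgumentException("Invalid characters in XPath value.");

        return doc.SelectSingleNode("//user[@name='" + userInput + "']");
    }
}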
Regarding null bytes in general and your StringBuilder example, it appears that this may be a type of off-by-one error. If the StringBuilder does any string processing with user input, it may be possible for an attacker to provide a null-terminated string and access the value of a character that they normally wouldn't have access to. It might also be possible for a user to supply a null-terminated string and cause the program to discard whatever would normally follow in the string. These attacks rely on the null value being persisted from the initial input location rather than the pipeline consistently terminating the input at the null byte; it is any inconsistency that is the problem.
For example, if one component treats the string 12345\06789 as 123456789 during validation, and another component treats the string as 12345 when the value is actually used then this is a problem. This was the cause of several PHP null byte related issues where PHP would read the null byte, but any system functions that were written in C classed them as a termination character. This made it possible to smuggle various strings past the PHP validation code and then enable the operating system to execute things it wasn't meant to as an aid to the attacker.
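In .NET terms the inconsistency looks like this (a small sketch; the values are the ones from the example above):

using System;

class NullByteInconsistency
{
    static void Main()
    {
        string s = "12345\06789";
        Console.WriteLine(s.Length);         // 10 - .NET keeps the embedded null
        Console.WriteLine(s.IndexOf('\0'));  // 5

        // A validator that ignores the null sees "123456789", but a downstream
        // consumer that stops at '\0' (e.g. native code reached via P/Invoke)
        // would see only "12345" - that mismatch is what an attacker exploits.
    }
}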
However, as .NET is a managed language this is unlikely to lead to a buffer overflow vulnerability. It might be worth further investigating if it is possible to do any of these by injecting null bytes from user input.
Bad practice, because the \0 character can be interpreted differently by various features/functions of .NET, giving you strange/unpredictable results. The real question is why you would purposely use that character.
Here is a similar question/response: Why is there no Char.Empty like String.Empty?
So:
The C# compiler outputs the (line,column) style location.
The Roslyn API expects a sequential text location
How to map the former to the latter?
The C# code could be UTF-8 with or without the BOM, or even UTF-16. It could contain all kinds of characters in the form of comments or embedded strings.
Let us assume we know the encoding and have the respective Encoding object handy. I can convert the file bytes to char[]. The problem is that some chars may contribute zero to the final sequential position. I know that the BOM character does. I have no idea if others may too.
Now, if we know for sure that BOM is the only character that contributes 0 to the length, then I can skip it and count the characters and my question becomes trivial. This is what I do today - I just assume that the BOM is the only "bad" player.
But maybe there is a better way? Maybe the Roslyn API contains some hidden gem that knows how to accept a (line,column) pair and spit out the sequential position? Or maybe some of the Microsoft.Build libraries?
EDIT 1
As per the accepted answer the following gives the location:
var srcText = SourceText.From(File.ReadAllText(err.FilePath));
int location = srcText.Lines[err.Line - 1].Start + err.Column - 1;
You have uncovered the reason that the SourceText type exists in the Roslyn APIs. Its entire purpose is to handle encoding of strings and perform calculations of lines, columns, and spans.
Due to the way .NET handles Unicode, and depending on which code pages are installed in your OS, there could be cases where SourceText does not do what you need. It has generally proven "good enough" for our purposes though.
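For completeness, here is a small sketch of going in both directions with the same API (the sample source text is made up; note that LinePosition is zero-based, whereas the compiler's line/column values are one-based, hence the -1 in the snippet above):

using System;
using Microsoft.CodeAnalysis.Text;

class PositionMapping
{
    static void Main()
    {
        var text = SourceText.From("int x;\nint y = x;\n");

        // (line, column) -> absolute character position.
        var linePos = new LinePosition(1, 4);                // the 'y' on the second line
        int position = text.Lines.GetPosition(linePos);
        Console.WriteLine(position);                         // 11

        // ...and back again.
        LinePosition roundTrip = text.Lines.GetLinePosition(position);
        Console.WriteLine(roundTrip.Line + "," + roundTrip.Character);   // 1,4
    }
}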
I keep hearing that the W3C recommends using ";" instead of "&" as a query string separator.
We recommend that HTTP server implementors, and in particular, CGI
implementors support the use of ";" in place of "&" to save authors
the trouble of escaping "&" characters in this manner.
Can somebody please explain why ";" is recommended instead of "&"?
Also, I tried using ";" instead of "&" (example: .com?str1=val1;str2=val2). When reading Request.QueryString["str1"] I get "val1;str2=val2". So if ";" is recommended, how do we read the query strings?
As the linked document says, ; is recommended over & because
the use of the "&" character to separate form fields interacts with its use in SGML attribute values to delimit character entity references.
For example, say you want your URL to be ...?q1=v1&q2=v2
There's nothing wrong with & there. But if you want to put that query into an HTML attribute, <a href="...?q1=v1&q2=v2">, it breaks because, inside an HTML attribute, & represents the start of a character entity. You have to escape the & as &amp;, giving <a href="...?q1=v1&amp;q2=v2">, and it'd be easier if you didn't have to.
; isn't overloaded like this at all; you can put one in an HTML attribute and not worry about it. Thus it'd be much simpler if servers recognised ; as a query parameter separator.
However, by the look of things (based on your experiment), ASP.Net doesn't recognise it as such. How to get it to? I'm not sure you can.
In short, HTML is a big mess (due to its leniency), and using semicolons helps to simplify this a LOT.
In order to use semicolons as the separator, I don't know if .NET allows this customization or whether we developers need to write our own methods to process the QueryString. .NET does give us access to the raw QueryString, and we can run with it from there. This is what I did. I wrote my own methods, which wasn't too hard, but it took a lot of testing time and debugging, some of which was Microsoft's fault for not even conforming to web standards when dealing with surrogate pairs. I made sure my implementation works with the full range of Unicode characters, including characters outside the Basic Multilingual Plane (thus Chinese and Japanese characters, etc.).
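For what it's worth, a rough sketch of such a method (not the exact code I wrote, and not tested against every edge case) that accepts either ';' or '&' as the separator could look like this:

using System;
using System.Collections.Generic;
using System.Web;

static class QueryParser
{
    // rawQuery is the query string without the leading '?', e.g. "a=1;b=2&c=3".
    public static Dictionary<string, string> Parse(string rawQuery)
    {
        var result = new Dictionary<string, string>();
        foreach (string pair in rawQuery.Split(new[] { ';', '&' },
                                               StringSplitOptions.RemoveEmptyEntries))
        {
            string[] parts = pair.Split(new[] { '=' }, 2);
            string key = HttpUtility.UrlDecode(parts[0]);
            string value = parts.Length > 1 ? HttpUtility.UrlDecode(parts[1]) : string.Empty;
            result[key] = value;
        }
        return result;
    }
}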
Before adding my own findings, I also want to confirm and include the great info that Rawling, Jeevan, and BeniBela have pointed out in Rawling's answer and their comments on that answer: it is incorrect in HTML not to escape the ampersands, though it usually works, but only because parsers are so tolerant. With that, I also explain why this can lead to bugs with such improper encoding (which probably most developers fall victim to).
One cannot depend on this leniency of improperly encoding ampersands in QueryStrings, and sometimes this leniency leads to nasty bugs. Let's say for instance a QueryString passes a random ASCII string (or user input) and it is not properly encoded. Then 'amp;' which follows '&' gets decoded, and the unexpected consequence is that 'amp;' is essentially 'swallowed'. (By swallowed, I mean it gets 'eaten' or goes missing.) A practical scenario is when the user is asked for input that goes into a database and the user inputs HTML (like here at StackOverflow); because it is not posted correctly, nasty bugs develop.
The real advantage of the ';' separator is in simplicity: proper encoding of ampersand-separated QueryStrings takes two steps of complication for URL strings in an HTML page (and in XML too). First, keys and values should be URL encoded and then all concatenated, and then the whole QueryString or URL should be HTML encoded (or for XML, encoded with a very similar encoding to HTML encoding). Also don't forget that the encoding processes for HTML encoding and URL encoding are different, and it's important that they are different. A developer needs to be careful between the two. And since they are similar, it's not uncommon to see them mixed up by novice programmers.
A good example of a potential problematic URL is when passing two name/values in a QueryString:
a = 'me & you', and
b = 'you & me'.
Here, using '&' as a separator, '?a=me+%26+you&b=you+%26+me' is a proper QueryString, BUT it should also be HTML encoded before being written to HTML source code. This is important to be bug-free. Most developers aren't careful to do this two-step process of first URL encoding the keys and values and then HTML encoding the full URL in the HTML source. It's no wonder why; I had to sit down and seriously think this process through and test out my conclusions thoroughly. Imagine when the name value is 'year=año', or far more complex when we need Chinese or Japanese characters that use surrogate pairs to represent them!
For the same key/value pairs for a and b above, when using ';' as the separator, the process is MUCH simpler. As a matter of fact, the ampersand separator makes the process more than twice as complex as using the semicolon separator! Here's the same info represented using ';' as the separator: '?a=me+%26+you;b=you+%26+me'. We notice that the only difference, though, is that there's no '&' in the string. But using the ';' separator means that no second process of HTML encoding the URL or QueryString is needed. Now imagine if I were writing HTML and wanted correct HTML and needed to write the HTML to explain all this! All this HTML encoding with '&' really adds a lot of complication (and for many developers, quite a lot of confusion too).
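Here is the two-step process in code (a sketch using HttpUtility with the a/b values above):

using System;
using System.Web;

class TwoStepEncoding
{
    static void Main()
    {
        string a = "me & you", b = "you & me";

        // Step 1: URL-encode each value, then join with the separator.
        string query = "?a=" + HttpUtility.UrlEncode(a) + "&b=" + HttpUtility.UrlEncode(b);
        Console.WriteLine(query);                           // ?a=me+%26+you&b=you+%26+me

        // Step 2: HTML-encode the whole thing before writing it into HTML source.
        Console.WriteLine(HttpUtility.HtmlEncode(query));   // ?a=me+%26+you&amp;b=you+%26+me

        // With ';' as the separator, step 2 is unnecessary.
        Console.WriteLine("?a=" + HttpUtility.UrlEncode(a) + ";b=" + HttpUtility.UrlEncode(b));
    }
}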
Novice developers would simply not HTML encode the QueryString or URL, which is CORRECT when ';' is the separator. But it leaves room for bugs when an ampersand is improperly encoded. So '?someText=blah&blah' would need proper encoding.
Also in .NET, we can write XML documentation for our methods. Well, just today, I wrote a little explanation that used the above 'a=me+%26+you&b=you+%26+me' example. And in my XML, I had to manually type all those &amp; character entities. XML documentation is picky, so one must correctly encode ampersands. But the leniency in HTML adds to ambiguity.
Perhaps this wasn't too confusing. But all the confusion or difficulty is due to using a character which should be HTML encoded as the separator; thus '&' is the culprit, and the semicolon relieves all that complication.
One last consideration: with how much more complicated the '&' separator makes this process, it's no wonder to me why the Microsoft implementation of surrogate pairs in QueryStrings still does not follow the official specifications. And if you write your own methods, you MUST account for Microsoft's incorrect use of percent-encoding surrogate pairs. The official specs forbid percent-encoding of surrogate pairs in UTF-8. So anyone who writes their own methods which also handle the full range of Unicode characters, beware of this.
I searched SO and found similar questions, but none compared all three. That surprised me, so if someone knows of one, please point me to it.
There are a number of different ways to parse the query string of a request... the "correct" way (IMO) should handle null/missing values, but also decode parameter values as appropriate. Which of the following would be the best way to do both?
Method 1
string suffix = Request.QueryString.Get("suffix") ?? "DefaultSuffix";
Method 2
string suffix = Request.QueryString["suffix"] ?? "DefaultSuffix";
Method 3
NameValueCollection queryParams = HttpUtility.ParseQueryString(Request.RawUrl);
string suffix = queryParams.Get("suffix") ?? "DefaultSuffix";
Method 4
NameValueCollection queryParams = HttpUtility.ParseQueryString(Request.RawUrl);
string suffix = queryParams["suffix"] ?? "DefaultSuffix";
Questions:
Would Request.QueryString["suffix"] return a null if no suffix was specified?
(Embarrassingly basic question, I know)
Does HttpUtility.ParseQueryString() provide any extra functionality over accessing Request.QueryString directly?
The MSDN documentation lists this warning:
The ParseQueryString method uses query strings that might contain user input, which is a potential security threat. By default, ASP.NET Web pages validate that user input does not include script or HTML elements. For more information, see Script Exploits Overview.
But it's not clear to me if that means ParseQueryString() should be used to handle that, or is exposed to security flaws because of it... Which is it?
ParseQueryString() uses UTF8 encoding by default... do all browsers encode the query string in UTF8 by default?
ParseQueryString() will comma-separate values if more than one is specified... does Request.QueryString do that as well, and what happens if it doesn't?
Which of those methods would correctly decode "%2b" to be a "+"?
Showing my Windows development roots again... and I would be a much faster developer if I didn't wonder about these things so much... : P
Methods #1 and #2 are the same thing, really. (I think the .Get() method is provided for language compatibility.)
ParseQueryString returns you something that is the functional equivalent of Request.QueryString. You would usually use it when you have a raw URL and no other way to parse the query string parameters from it. Request.QueryString does that for you, so in this case, it's not needed.
You can't leave off "suffix". You either have to pass a string or an index number. If you leave off the [] entirely, you get the whole NameValueCollection. If you mean what happens if "suffix" is not one of the QueryString values, then yes: you would get null if you called Request.QueryString["suffix"].
No. The most likely time you would use it is if you had an external URL and wanted to parse the query string parameters from it (a short sketch at the end of this answer illustrates this).
ParseQueryString does not handle it... neither does pulling the values straight from Request.QueryString. For ASP.NET, you usually handle form values as the values of controls, and that is where ASP.NET usually 'handles' these things for you. In other words: DON'T TRUST USER INPUT. Ever. No matter what framework is doing whatever for you.
I have no clue (I think not). However, I think what you are reading is telling you that ParseQueryString returns UTF-8 encoded text, regardless of whether it was so encoded when it came in.
Again: ParseQueryString returns basically the same thing you get from Request.QueryString. In fact, I think ParseQueryString is used internally to provide Request.QueryString.
They would produce equivalent results; they will all properly decode the values submitted. If you have the URL http://site.com/page.aspx?id=%20Hello and then call Request.QueryString["id"], the return value will be " Hello", because it automatically decodes.
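As a small illustration of the external-URL case mentioned above (the URL here is made up):

using System;
using System.Collections.Specialized;
using System.Web;

class RawUrlParsing
{
    static void Main()
    {
        var uri = new Uri("http://example.com/page.aspx?suffix=abc&other=1");
        NameValueCollection query = HttpUtility.ParseQueryString(uri.Query.TrimStart('?'));
        string suffix = query.Get("suffix") ?? "DefaultSuffix";
        Console.WriteLine(suffix);   // abc
    }
}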
Example 1:
string itsMeString = string.IsNullOrEmpty(Request.QueryString["itsMe"]) ? string.Empty : HttpUtility.UrlDecode(Request.QueryString["itsMe"]);
Straight to your questions:
Not quite sure what you mean by suffix; if you are asking what happens when the key is not present (you don't have it in the QueryString), then yes, it will return null.
My GUESS here is that when constructed, Request.QueryString internally calls the HttpUtility.ParseQueryString() method and caches the NameValueCollection for subsequent access. I think ParseQueryString is only left public so you can use it on a string that is not part of the Request, for example if you are scraping a web page and need to get some arguments from a string you've found in the code of that page. This way you won't need to construct a Uri object but will be able to get just the query string as a NameValueCollection, if you are sure that's all you need. This is a wild guess ;).
This is implemented at the page level, so if you are accessing the QueryString in, say, the Page_Load event handler, you have a valid and safe string (otherwise ASP.NET will throw an exception and will not let the code flow enter Page_Load, so you are protected from storing XSS in your database; the exception will be "A potentially dangerous Request.QueryString value was detected from the client", the same as when a POST variable contains any traces of XSS, except that instead of Request.Form the exception says Request.QueryString). This holds as long as you leave "validateRequest" switched on (by default it is). The ASP.NET pipeline will throw the exception early, so you don't have the chance to save any XSS payloads to your store (database). Switching it off implies you know what you're doing, so you will then need to implement the security yourself (by checking what's coming in).
Probably it is safe to say yes. Anyway, since you will in most cases be generating the QueryString on your own (via JavaScript or server-side code), be sure to use HttpUtility.UrlEncode for backend code and escape for JavaScript. This way the browser will be forced to turn "It's me!" into "It%27s%20me%21". You can refer to this article for more on URL encoding in JavaScript: http://www.javascripter.net/faq/escape.htm.
Please elaborate on that; I couldn't quite get what you mean by "will comma-separate values if more than one is specified".
As far as I remember, none of them will. You will probably need to call HttpUtility.UrlDecode / HttpUtility.HtmlDecode (based on what input you have) to get the string correctly; for the above example with "It's me!" you will do something like Example 1 (shown above, since something's wrong with the code formatting if I put it after the numbered list).
I'd like to String.Split() the following string using a comma as the delimiter:
John,Smith,123 Main Street,212-555-1212
The above content is entered by a user. If they enter a comma in their address, the resulting string would cause problems for String.Split(), since you now have 5 fields instead of 4:
John,Smith,123 Main Street, Apt 101,212-555-1212
I can use String.Replace() on all user input to replace commas with something else, and then use String.Replace() again to convert things back to commas:
value = value.Replace(",", "*");
However, this can still be fooled if a user happens to use the placeholder delimiter "*" in their input. Then you'd end up with extra commas and no asterisks in the result.
I see solutions online for dealing with escaped delimiters, but I haven't found a solution for this seemingly common situation. What am I missing?
EDIT: This is called delimiter collision.
This is a common scenario — you have some arbitrary string values that you would like to compose into a structure, which is itself a string, but without allowing the values to interfere with the delimiters in structure around them.
You have several options:
Input restriction: If it is acceptable for your scenario, the simplest solution is to restrict the use of delimiters in the values. In your specific case, this means disallow commas.
Encoding: If input restriction is not appropriate, the next easiest option would be to encode the entire input value. Choose an encoding that does not have delimiters in its range of possible outputs (e.g. Base64 does not feature commas in its encoded output).
Escaping delimiters: A slightly more complex option is to come up with a convention for escaping delimiters. If you're working with something mainstream like CSV it is likely that the problem of escaping is already solved, and there's a standard library that you can use. If not, then it will take some thought to come up with a complete escaping system, and implement it.
If you have the flexibility to not use CSV for your data representation this would open up a host of other options. (e.g. Consider the way in which parameterised SQL queries sidestep the complexity of input escaping by storing the parameter values separately from the query string.)
This may not be an option for you, but would it not be easier to use a very uncommon character, say a pipe |, as your delimiter and not allow this character to be entered in the first place?
If this is CSV, the address should be surrounded by quotes. CSV parsers are widely available that take this into account when parsing the text.
John,Smith,"123 Main Street, Apt. 6",212-555-1212
One foolproof solution would be to convert the user input to base64 and then delimit with a comma. It will mean that you will have to convert back after parsing.
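A rough sketch of that idea (the field values are the ones from the question):

using System;
using System.Linq;
using System.Text;

class Base64Fields
{
    static void Main()
    {
        string[] fields = { "John", "Smith", "123 Main Street, Apt 101", "212-555-1212" };

        // Base64 output never contains a comma, so the join is unambiguous.
        string record = string.Join(",",
            fields.Select(f => Convert.ToBase64String(Encoding.UTF8.GetBytes(f))));

        // Convert back after parsing.
        string[] decoded = record.Split(',')
            .Select(f => Encoding.UTF8.GetString(Convert.FromBase64String(f)))
            .ToArray();

        Console.WriteLine(decoded[2]);   // 123 Main Street, Apt 101
    }
}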
You could try putting quotes, or some other begin and end delimiters, around each of the user inputs, and ignore any special character between a set of quotes.
This really comes down to a situation of cleansing user inputs. You should only allow desired characters in the user input and reject/strip invalid inputs from the user. This way you could use your asterisk delimiter.
The best solution is to define valid characters and reject invalid characters somehow, then use an invalid character (which will not appear in the input, since it is "banned") as your delimiter.
Don't allow the user to enter the character which you are using as a delimiter. I personally feel this is the best way.
Funny solution (works if the address is the only field with a comma):
Split the string by comma. The first two pieces will be the first and last name; the last piece is the telephone - take those away. Combine the rest back with commas - that would be the address ;)
In a sense, the user is already "escaping" the comma with the space afterward.
So, try this:
string[] values = Regex.Split(value, ",(?![ ])");
The user can still break this if they don't put a space, and there is a more foolproof method (using the standard CSV method of quoting values that contain commas), but this will do the trick for the use case you've presented.
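Applied to the example from the question (a quick check, not an exhaustive test):

using System;
using System.Text.RegularExpressions;

class SplitDemo
{
    static void Main()
    {
        string value = "John,Smith,123 Main Street, Apt 101,212-555-1212";

        // Split on commas that are NOT followed by a space.
        string[] values = Regex.Split(value, ",(?![ ])");

        foreach (string v in values)
            Console.WriteLine(v);   // John / Smith / 123 Main Street, Apt 101 / 212-555-1212
    }
}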
One more solution: provide an "Address 2" field, which is where things like apartment numbers would traditionally go. The user can still break it if they are lazy, though in that case they'll only break the fields after Address 2.
Politely remind your users that properly-formed street addresses in the United States and Canada should NEVER contain any punctuation whatsoever, perhaps?
The process of automatically converting corrupted data into useful data is non-trivial without heuristic logic. You could try to outsource the parsing by calling a third-party address-formatting library to apply the USPS formatting rules.
Even USPS requires the user to perform much of the work, by having components of the address entered into distinct fields on their address "canonicalizer" page (http://zip4.usps.com/zip4/welcome.jsp).
I'm kind of stuck here. I'm developing a custom Pipeline component for Commerce Server 2009, but that has little to do with my problem.
In the setup of the pipe, I give the user a windows form to enter some values for configuration. One of those values is a URL for a SharePoint site. Commerce Server uses C++ components behind all this pipeline stuff, so the entered values are put into an IDictionary and eventually persisted to the DB via the C++ component from Microsoft.
When I read the string in during pipeline execution, it is handed to me in an IDictionary object from C++. My C# code sees that URL suffixed with \0\0. I'm not sure where those are coming from, but my code blows up because it's not a valid URI. I am trimming the string before I save it and trimming it when I read it and still can't get rid of those.
Any ideas what is causing this and how I can get rid of it? I prefer not to have a hack like substring it, but something that gets at the root cause.
Thanks,
Corey
Would this help:
string sFixedUrl = "hello\0\0".Trim('\0');
As the others' posts explained, strings in C are null-terminated. (Notice that C++, however, already provides a string type which doesn't depend on that.)
Your case is just a bit different because you're getting a double-null-terminated string. I'm not an expert here, so anyone should feel free to correct me if I'm wrong. But this looks like a typical string representation for Unicode/i18n-aware applications in Windows which use wide characters. Please take a look at this.
One guess is that the application which is persisting the string into the database is not using a "portable" strategy. For example, it might be persisting the string buffer considering its size in raw bytes instead of its actual length. The former would be counting the extra two zeros in the end (and, consequently, persisting them too) while the latter would discard them.
From this site:
A string in C is simply an array of characters, with the final character set to the NUL character (ASCII/Unicode code point 0). This null terminator is required; a string is ill-formed if it isn't there. The string literal token in C/C++ ("string") guarantees this.
const char str[] = "foo";
is the same as
const char str[] = {'f', 'o', 'o', 0};
So as soon as the C++ component gets your IDictionary, it will add a null terminator to the end of the string. If you want to remove it, you will have to remove the trailing null character(s) before sending back the dictionary. See this post on how to remove a trailing null character. Basically you need to know the exact size and trim it off.
Another technique you can use is an array of characters and the length of the array. An array of characters does not need a terminating null character.
When you pass this data structure, you must pass the length also. The convention for C-style strings is to determine the end of the string by searching for a '\0' (or, for wide/Unicode strings, '\0\0'). Since the array doesn't have the terminating characters, the length is always needed.
A much better solution is to use the std::string. It doesn't append null characters. When you need compatibility, or the C-style format, use the c_str() method. I have to use this technique with my program because the GUI framework has its own string data type that is incompatible with std::string.