Sanitize registration input? - c#

This may be a really silly question but so far the interwebs have failed me, so I'm hoping you good people of SO will shed some light. Essentially I have a website with membership functionality (sign up/login/forgotten password etc.) using the .NET membership providers. Later down the line I take the users' registration data, convert it to XML and then use it elsewhere in logic. Unfortunately I often get issues with the data I have in XML, more often than not "hexadecimal value 0x1C, is an invalid character". I did find a handy blog post on a resolution to this, but it got me thinking: are there any standards on how data should be sanitized? What to let through registration and what not to?

Assuming that you're (manually?) serializing the registration input to XML, you need to encode it before further processing so that characters with special meaning in XML are escaped properly.
Note that there are only 5 of them so it's perfectly reasonable to do this with a manual replace:
< = &lt;
> = &gt;
& = &amp;
" = &quot;
' = &apos;
You could use the built-in .NET function HttpUtility.HtmlEncode(input) to do this for you.
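For example (a minimal sketch; HttpUtility lives in the System.Web namespace/assembly):
// using System.Web;
string raw = "Fish & Chips <\"special\" offer>";
string encoded = HttpUtility.HtmlEncode(raw);
// encoded == "Fish &amp; Chips &lt;&quot;special&quot; offer&gt;"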
UPDATE:
I just realized I didn't really answer your question: you seem to be looking for a way to transform Unicode characters into ASCII-safe HTML entities.
I'm not aware of any built-in functions in .NET that do this, so I wrote a little utility method which should illustrate the concept:
public static class StringUtilities
{
    // Requires: using System.Linq; using System.Text; using System.Web;
    public static string HtmlEncode(string input, Encoding source, Encoding destination)
    {
        // First escape the reserved XML/HTML characters, then walk the result and
        // replace anything the destination encoding cannot represent with a
        // numeric character reference.
        var sourceChars = HttpUtility.HtmlEncode(input).ToArray();
        var sb = new StringBuilder();
        foreach (var sourceChar in sourceChars)
        {
            byte[] sourceBytes = source.GetBytes(new[] { sourceChar });
            char destChar = destination.GetChars(sourceBytes).FirstOrDefault();
            if (destChar != sourceChar)
                sb.AppendFormat("&#{0};", (int)sourceChar);
            else
                sb.Append(sourceChar);
        }
        return sb.ToString();
    }
}
Then, given an input string which has both reserved XML characters and Unicode characters in it, you could use it like this:
string unicode = "<tag>some proӸematic text<tag>";
string escapedASCII = StringUtilities.HtmlEncode(
unicode, Encoding.Unicode, Encoding.ASCII);
// Result: &lt;tag&gt;some pro&#1272;ematic text&lt;tag&gt;
If you need to do this at several places, to clean it up a bit, you could add an extension method for your specific scenario:
public static class StringExtensions
{
    public static string ToEncodedASCII(this string input, Encoding sourceEncoding)
    {
        return StringUtilities.HtmlEncode(input, sourceEncoding, Encoding.ASCII);
    }

    public static string ToEncodedASCII(this string input)
    {
        return StringUtilities.HtmlEncode(input, Encoding.Unicode, Encoding.ASCII);
    }
}
You could then do:
string unicode = "<tag>some proӸematic text<tag>";
// Default to Unicode as input
string escapedASCII1 = unicode.ToEncodedASCII();
// Pass in a different encoding for your input
string escapedASCII2 = unicode.ToEncodedASCII(Encoding.BigEndianUnicode);
UPDATE #2
Since you also asked for advice on adhering to standards, the most I can tell you is that you need to take into consideration where the input text will actually end up.
If the input for a certain user will only ever be displayed to that user (for instance when they manage their profile / account settings in your app), and your database supports Unicode, you could just leave everything as-is.
On the other hand, if the information can be displayed to other users (for instance when users can view each others public profile information) then you need to take into consideration that not all users will be visiting your website on a device/browser that supports Unicode. In that case, UTF-8 is likely to be your best bet.
This is also why you can't really find that much useful information on it. If the world was able to agree on a standard then we would not have to deal with all these encoding variations in the first place. Think about your target group and what they need.
A useful blog post on the subject of encoding: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Related

Fix string encoding issues

Does anyone know of a .NET library (NuGet package preferably) that I can use to fix strings that are 'messed up' because of encoding issues?
I have Excel* files that are supplied by third parties that contain strings like:
TelefÃ³nica UK Limited
ServiÃ§os de ComunicaÃ§Ãµes e MultimÃ©dia
These entries are simply user-error (e.g. someone copy/pasted wrong or something) because elsewhere in the same file the same entries are correct:
Telefónica UK Limited
Serviços de Comunicações e Multimédia
So I was wondering if there is a library/package/something that takes a string and fixes "common errors" like Ã§Ãµ → çõ and Ã³ → ó. I understand that this won't be 100% fool-proof and may result in some false negatives, but it would sure be nice to have some field-tested library to help me clean up my data a bit. Ideally it would 'autodetect' the issue(s) and 'autofix' them as I won't always be able to tell what the source encoding (and destination encoding) was at the time the mistake was made.
* The file type is not very relevant; I may have text from other parties in other file formats that have the same issue...
My best advice is to start with a list of special characters that are used in the language in question.
I assume you're just dealing with Portuguese or other European languages with just a handful of non-US-ASCII characters.
I also assume you know what the bad encoding was in the first place (i.e. the code page), and it was always the same.
(If you can't assume these things, then it's a bigger problem.)
Then encode each of these characters badly, and look for the results in your source text. If any are found, you can treat it as badly encoded text.
var specialCharacters = "çõéó";
var goodEncoding = Encoding.UTF8;
var badEncoding = Encoding.GetEncoding(28591);
var badStrings = specialCharacters.Select(c => badEncoding.GetString(goodEncoding.GetBytes(c.ToString())));
var sourceText = "ServiÃ§os de ComunicaÃ§Ãµes e MultimÃ©dia";
if (badStrings.Any(s => sourceText.Contains(s)))
{
    sourceText = goodEncoding.GetString(badEncoding.GetBytes(sourceText));
}
The first step in fixing a bad encoding is to find out what encoding the text was mis-encoded to; often this is not obvious.
So, start with a bit of text that is mis-encoded, and the corrected version of the text. Here my badly encoded text ends with Ã¤ rather than ä:
var name = "ViistoperÃ¤";
var target = "Viistoperä";
var encs = Encoding.GetEncodings();
foreach (var encodingType in encs)
{
    var raw = Encoding.GetEncoding(encodingType.CodePage).GetBytes(name);
    var output = Encoding.UTF8.GetString(raw);
    if (output == target)
    {
        Console.WriteLine("{0},{1},{2}", encodingType.DisplayName, encodingType.CodePage, output);
    }
}
This will output a number of candidate encodings, and you can pick the most relevant one; Windows-1252 is a better candidate than Turkish in this case.
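Once a candidate is chosen (a small sketch, assuming Windows-1252 turned out to be the culprit), the actual repair is the same round trip used inside the loop, applied to the mis-encoded name variable from above:
var winAnsi = Encoding.GetEncoding(1252);
string repaired = Encoding.UTF8.GetString(winAnsi.GetBytes(name));
// repaired == "Viistoperä"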

How to count unique characters in string [duplicate]

Let's say we have a variable myString = "blabla" or myString = "998769".
myString.Length; // will get you the total length
myString.Count(char.IsLetter); // if you only want the count of letters
How do I get the unique character count? I mean for "blabla" the result must be 3, for "998769" it will be 4. Is there a ready-to-go function? Any suggestions?
You can use LINQ:
var count = myString.Distinct().Count();
It relies on the fact that string implements IEnumerable<char>.
Without LINQ, you can do the same stuff Distinct does internally and use HashSet<char>:
var count = (new HashSet<char>(myString)).Count;
If you handle only ANSI text in English (or characters from the BMP) then 80% of the time, if you write:
myString.Distinct().Count()
You will live happily and won't ever have any trouble. Let me post this answer only for those who really need to handle this in the proper way. I'd say everyone should, but I know it's not true (quote from Wikipedia):
Because the most commonly used characters are all in the Basic Multilingual Plane, handling of surrogate pairs is often not thoroughly tested. This leads to persistent bugs and potential security holes, even in popular and well-reviewed application software (e.g. CVE-2008-2938, CVE-2012-2135)
The problem with our first naïve solution is that it doesn't handle Unicode properly and it also doesn't consider what the user perceives as a character. Let's try "𠀑".Distinct().Count(): your code will wrongly return... 2, because its UTF-16 representation is 0xD840 0xDC11 (by the way, each of them alone is not a valid Unicode character because they're a high and a low surrogate, respectively).
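A minimal repro of that claim (the literal is written with an escape so the source file encoding doesn't matter; Distinct() needs using System.Linq):
string s = "\U00020011";                 // a single code point outside the BMP
Console.WriteLine(s.Length);             // 2 -- two UTF-16 code units (a surrogate pair)
Console.WriteLine(s.Distinct().Count()); // 2 -- Distinct() compares chars, not code points
Console.WriteLine(char.ConvertToUtf32(s, 0).ToString("X")); // 20011 -- the real code point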
Here I won't be very strict about terms and definitions, so please refer to www.unicode.org as a reference. For a (much) broader discussion please read How can I perform a Unicode aware character by character comparison?; encoding isn't the only issue you have to consider.
1) It doesn't take into account that .NET System.Char doesn't represent a character (or, more specifically, a grapheme) but a code unit of UTF-16 encoded text. They often coincide, but not always (for example, with ideographic characters).
2) If you're counting what the user thinks of (or perceives) as a character then this will fail again, because it doesn't check combining characters like ا́ (there are many examples of this in Arabic). There are duplicates that exist for historical reasons: for example é is both a single Unicode code point and a combination (so that code will fail).
3) We're talking about a western/American definition of character. If you're counting characters for end-users you may need to change your definition to what they expect (for example, in Korean the definition of a character may not be so obvious; another example is the Czech text ch, which is always counted as a single character). Finally, don't forget some strange things when you convert characters to upper/lower case (for example, in German ß is SS in upper case; see also this post).
Encoding
C# strings are encoded as UTF-16 (char is two bytes), but UTF-16 isn't a fixed-size encoding, and char should properly be called a code unit. What does that mean? That you may have a string whose Length is 2 but that the user will see (and that actually is) just one character (so the count should be 1).
If you need to handle this properly then you have to make things much more complicated (and slower). Fortunately the Char class has some helpful methods to handle surrogates.
The following code is untested (and for illustration purposes, so absolutely not optimized; I'm sure it can be done much better than this), so take it just as a starting point for further investigation:
int CountCharacters(string text)
{
    HashSet<string> characters = new HashSet<string>();
    string currentCharacter = "";

    for (int i = 0; i < text.Length; ++i)
    {
        if (Char.IsHighSurrogate(text, i))
        {
            // Do not count this, next one will give the full pair
            currentCharacter = text[i].ToString();
            continue;
        }
        else if (Char.IsLowSurrogate(text, i))
        {
            // Our "character" is encoded as previous one plus this one
            currentCharacter += text[i];
        }
        else
        {
            currentCharacter = text[i].ToString();
        }

        if (!characters.Contains(currentCharacter))
            characters.Add(currentCharacter);
    }

    return characters.Count;
}
Note that this example doesn't handle duplicates (cases where the same character may have different codes, or can be either a single code point or a combined character).
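A small sketch of that duplicate problem: é can be a single precomposed code point or a base letter plus a combining accent, and normalizing (here to the default form C) is one way to make the two spellings compare equal:
string precomposed = "\u00E9";  // é as one code point
string combining   = "e\u0301"; // e + combining acute accent
Console.WriteLine(precomposed == combining);                         // False
Console.WriteLine(precomposed.Normalize() == combining.Normalize()); // True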
Combined Characters
If you have to handle combined characters (and of course encoding) then the best way to do it is to use the StringInfo class. You'll enumerate (and then count) both combined and encoded characters:
StringInfo.GetTextElementEnumerator(text).Walk()
.Distinct().Count();
Walk() is a trivial-to-implement extension method that simply walks through all the elements of an IEnumerator (we need it because GetTextElementEnumerator() returns an IEnumerator instead of an IEnumerable).
Please note that after the text has been properly split, it can be counted with our first solution (the point is that the building block isn't a char but a sequence of chars, for simplicity returned here as a string itself). Again, this code doesn't handle duplicates.
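A possible implementation of that Walk() helper (the method name comes from the answer above; the class name and exact shape are my assumption):
public static class EnumeratorExtensions
{
    // using System.Collections; using System.Collections.Generic;
    // Adapts the non-generic IEnumerator returned by GetTextElementEnumerator
    // into an IEnumerable<string> so LINQ operators can be applied.
    public static IEnumerable<string> Walk(this System.Collections.IEnumerator enumerator)
    {
        while (enumerator.MoveNext())
            yield return (string)enumerator.Current;
    }
}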
Culture
There is not much you can do to handle the issues listed at point 3. Each language has its own rules and supporting them all can be a pain. More examples about culture issues are in this longer, more specific post.
It's important to be aware of them (so you have to know a little bit about the languages you're targeting), and don't forget that Unicode and a few translated resx files won't make your application global.
If text processing is important in your application you can solve many issues using specialized DLLs for each locale you support (to count characters, to count words and so on), like word processors do. For example, the issues I listed can be solved using dictionaries. What I usually do is not use the standard .NET functions for strings (also because of some bugs); instead I create a Unicode class with static methods for everything I need (character counting, conversions, comparison) and many specialized derived classes for each supported language. At run-time the static methods use the current thread's culture name to pick the proper implementation from a dictionary and delegate the work to it. A skeleton may be something like this:
abstract class Unicode
{
    public static int CountCharacters(string text)
    {
        return GetConcreteClass().CountCharactersCore(text);
    }

    protected virtual int CountCharactersCore(string text)
    {
        // Default implementation, overridden in derived classes if needed
        return StringInfo.GetTextElementEnumerator(text).Walk()
            .Distinct().Count();
    }

    private static Dictionary<string, Unicode> _implementations;

    private static Unicode GetConcreteClass()
    {
        string cultureName = Thread.CurrentThread.CurrentCulture.Name;

        // Check if concrete class has been loaded and put in dictionary
        ...

        return _implementations[cultureName];
    }
}
If you're using C# then Linq comes nicely to the rescue - again:
"blabla".Distinct().Count()
will do it.

HTML/Url decode on multiple times encoded string

We have a string which is read from a web page. Because browsers are tolerant of unencoded special chars (e.g. the ampersand), some pages use it encoded, some not... so there is a good chance we have stored some data encoded once, and some multiple times...
Is there a clean solution to be sure my string is fully decoded, no matter how many times it was encoded?
Here is what we are using now:
public static string HtmlDecode(this string input)
{
    var temp = HttpUtility.HtmlDecode(input);
    while (temp != input)
    {
        input = temp;
        temp = HttpUtility.HtmlDecode(input);
    }
    return input;
}
and the same approach is used with UrlDecode.
That's probably the best approach honestly. The real solution would be to rework your code so that you only singly encode things in all places, so that you could only singly decode them.
Your code seems to be decoding HTML strings correctly, with multiple passes.
However, if the input HTML is malformed, i.e. not encoded properly, the decoding will be unexpected: bad inputs might not be decoded properly no matter how many times they pass through this method.
A quick check with two encoded strings, one completely encoded and another only partially encoded, yielded the following results:
"<b>" will decode to "<b>"
"<b&gt will decode to "<b&gt"
In case this is helpful to anyone, here is a recursive version for multiple HTML encoded strings (I find it a bit easier to read):
public static string HtmlDecode(string input) {
    string decodedInput = WebUtility.HtmlDecode(input);
    if (input == decodedInput) {
        return input;
    }
    return HtmlDecode(decodedInput);
}
WebUtility is in the System.Net namespace.
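For example, given a string that was HTML-encoded twice, the recursive version unwinds both layers while a single WebUtility.HtmlDecode call does not (a small usage sketch):
string twice = "&amp;lt;b&amp;gt;";              // "<b>" encoded twice
Console.WriteLine(WebUtility.HtmlDecode(twice)); // &lt;b&gt;  (still encoded once)
Console.WriteLine(HtmlDecode(twice));            // <b>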

Encoding "ä" into "%E4"

I'm trying to understand what is the best encoding in C# to fulfil a requirement from a new SMS provider.
The text I want to send is:
Bäste Björn
The encoded text that the provider says it needs is:
B%E4ste+Bj%F6rn
so ä is %E4 and ö is %F6
From this answer, I got that for such a conversion I need to use HttpUtility.HtmlAttributeEncode, as the normal HttpUtility.UrlEncode will output:
B%c3%a4ste+Bj%c3%b6rn
and that outputs weird chars on the mobile phone :/
As several chars are not converted, I tried this:
private string specialEncoding(string text)
{
    StringBuilder r = new StringBuilder();
    foreach (char c in text.ToCharArray())
    {
        string e = System.Web.HttpUtility.UrlEncode(c.ToString());
        if (e.StartsWith("%") && e.ToLower() != "%0a") // %0a == Linefeed
        {
            string attr = System.Web.HttpUtility.HtmlAttributeEncode(c.ToString());
            r.Append(attr);
        }
        else
        {
            r.Append(e);
        }
    }
    return r.ToString();
}
It's verbose so I could breakpoint and test each char, and I found out that:
System.Web.HttpUtility.HtmlAttributeEncode("ä") is actually equal to ä... so there is no %E4 as output...
What am I missing? And is there a simple way to do the encoding, without manipulating the string char by char, that gives the required output?
that the provider says it needs
Ask the provider in which age they are living. According to Wikipedia: Percent-encoding:
The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values. This requirement was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected.
Granted, this RFC talks about "new URI schemes", which HTTP obviously is not, but adhering to this standard prevents headaches like this. See also What is the proper way to URL encode Unicode characters?.
They seem to want you to encode characters according to the Windows-1250 code page (or a comparable one, like ISO-8859-1 or -2; check the alternatives here) instead, as in that code page E4 (228) maps to ä and F6 (246) maps to ö. As @Simon points out in his comment, you should ask the provider which code page exactly they want you to use.
Assuming Windows-1250, you can implement it like this, according to URL encode ASCII/UTF16 characters:
var windows1250 = Encoding.GetEncoding(1250);
var percentEncoded = HttpUtility.UrlEncode("Bäste Björn", windows1250);
The value of percentEncoded is:
B%e4ste+Bj%f6rn
If they insist on using uppercase, see .net UrlEncode - lowercase problem.
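As a quick sketch of that idea (an assumption on my part, not taken from the linked post): UrlEncode emits lowercase hex digits, so you can upper-case just the escape sequences afterwards with a regular expression:
// using System.Text.RegularExpressions;
string upper = Regex.Replace(percentEncoded, "%[0-9a-f]{2}",
    m => m.Value.ToUpperInvariant());
// upper == "B%E4ste+Bj%F6rn"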

ASP.Net URL Encoding

I am implementing URL rewriting in ASP.net and my URLs are causing me a world of problems.
The URL is generated from a database of departments & categories. I want employees to be able to add items to the database with whatever special characters are appropriate without it breaking the site.
I am encoding the data before I construct the URLs.
There are several problems...
IIS decodes the URL before it reaches .net making it impossible to properly parse anything with a "/" in it.
ASP.net gets confused by the url making "~" useless within certain pages
I migrated from the built in test server to my local IIS server (XP machine) and any URL containing an encoded & (%26) gives me a "Bad Request" error.
UrlEncode leaves some breaking characters untouched such as '.'
I did have two other related posts on this subject, at the time I only saw the small problems not the big problem upstream. I've found some registry tricks to solve the "Bad Request" issue but I'm going to be deploying to a shared hosting environment making that useless. I also know that this is a fix for some security issue so I don't want to necessarily bypass it without knowing what can of worms I'm opening.
Rather than trying to force .NET to pass me the raw URL, or override IIS settings, I'd like to make truly safe URLs in the first place.
I'll note I've tried AntiXss.UrlEncode, HttpUtility.UrlEncode and Uri.EscapeDataString. I've even tried stupid things like double URL-encoding. Is there a utility that does what I need, or do I really need to roll my own? I'm even considering doing something hacky like replacing the % with an unusual string of characters. The end result should be at least readable, which was the point of using URL rewriting in the first place.
Sorry for the long post- I just wanted to make sure that I've included all the necessary details. I can't seem to find any relevant information on this, and it seems like it would be a common problem - so maybe I'm missing something big. Thanks for your help, and patience with the long explanation!
Edit for clarity:
When I say the URLs are being built from a database, what I mean is that the directory structure is constructed from the departments and categories in my database.
Some Example URLS -
Mystore/Refrigeration/Bar+Fridge.aspx
Mystore/Cooking+Equipment.aspx
Mystore/Kitchen/Cutting+Boards.aspx
The problems come in when I use a department like "Beverage & Bar" or "Pastry/Decorating" to construct my URL. Despite being encoded first these cause the aforementioned issues.
My handlers are already implemented and working fine except for the special character encoding issues.
You should consider having a table off of your category/department table which has a unique URL for each category. Then you can use a special routine to generate the URLs. This can be a SQL scalar function, or a CLR function, but one of the things it would do is normalize the URL for the web. You can convert "Beverage & Bar" to "Beverage-And-Bar" and "Pastry / Decorating" to "Pastry-Decorating". Mainly, the routine needs to replace all invalid HTTP URL characters with something else. An example is this:
public static class URL
{
    static readonly Regex feet = new Regex(@"([0-9]\s?)'([^'])", RegexOptions.Compiled);
    static readonly Regex inch1 = new Regex(@"([0-9]\s?)''", RegexOptions.Compiled);
    static readonly Regex inch2 = new Regex(@"([0-9]\s?)""", RegexOptions.Compiled);
    static readonly Regex num = new Regex(@"#([0-9]+)", RegexOptions.Compiled);
    static readonly Regex dollar = new Regex(@"[$]([0-9]+)", RegexOptions.Compiled);
    static readonly Regex percent = new Regex(@"([0-9]+)%", RegexOptions.Compiled);
    static readonly Regex sep = new Regex(@"[\s_/\\+:.]", RegexOptions.Compiled);
    static readonly Regex empty = new Regex(@"[^-A-Za-z0-9]", RegexOptions.Compiled);
    static readonly Regex extra = new Regex(@"[-]+", RegexOptions.Compiled);

    public static string PrepareURL(string str)
    {
        str = str.Trim().ToLower();
        str = str.Replace("&", "and");

        str = feet.Replace(str, "$1-ft-");
        str = inch1.Replace(str, "$1-in-");
        str = inch2.Replace(str, "$1-in-");
        str = num.Replace(str, "num-$1");
        str = dollar.Replace(str, "$1-dollar-");
        str = percent.Replace(str, "$1-percent-");
        str = sep.Replace(str, "-");
        str = empty.Replace(str, string.Empty);
        str = extra.Replace(str, "-");
        str = str.Trim('-');

        return str;
    }
}
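For example (note that, as written, the routine lower-cases its output, so the slugs come out in lower case rather than the "Beverage-And-Bar" form mentioned above):
Console.WriteLine(URL.PrepareURL("Beverage & Bar"));      // beverage-and-bar
Console.WriteLine(URL.PrepareURL("Pastry / Decorating")); // pastry-decorating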
You could make this a SQL enhance function, or run URL generation as a separate process. Then to implement mapping, you would map the entire URL directly to a category ID. This approach is better in the long run for several reasons. First, you are not always generating URLs, you do this once and they stay static, you don't have to worry about your procedure changing, and then GoogleBot not being able to find old URLs. Also, if you get a collision, you may notice a potential duplicate category name, because a collision would only be different by special characters. Finally, you can always view your URLs from the database, without having to run the mapping function.
I have a URL rewrite that I implement in global.asax, in the begin/authenticate request stage, as I have some security. This is where I take the raw URL and then do the DB lookup. This then rewrites the path to the aspx page, and all the parameters are passed through the query string. No encoding is necessary.
However, if you are using the URL to actually change data then I can see that you will have huge problems, as you are effectively using an HTTP GET to change the database. That is usually considered a bad idea, and not something I do.
I only use a POST request to do any database manipulation. This keeps the URL clean as all the data is in the page form.
The only issue I had was to set the correct URL on the page form's action, which in most cases is the raw URL.
If it's the category names that are causing the issue then perhaps you should restrict the names to alphanumeric characters only and swap spaces for "-". IIS will throw a wobbly with periods "." as it looks for file names.
P.S.
IIS does not understand the tilde "~"; that is something the compiler understands. So if you use it in an anchor tag it will not work as expected, and you should use the application root instead of the tilde.
Edit:
OK, it looks like the issue is with IIS having trouble with certain characters such as ., / and &. Even if you do URL-encode these, IIS will still try to apply its own meanings.
As such, consider removing them, so:
Beverage & Bar becomes BeverageBar
Pastry / Decorating becomes PastryDecorating
This will keep your URLs clean, but does mean an extra column in the database so you can check the URL against this shortened category name.
I'm having the exact same problem. Thanks for writing it up so nicely. It actually helped me to understand the problem better.
I had some other considerations, however. One of my goals is to support the potential for any characters to be in the URL, which is based on the title of an article. Additionally I want to ensure uniqueness in the encoding and a two-way encode/decode process.
So I did some manual encoding to solve the problem. This won't completely eliminate percent encoding, but will greatly reduce it and keep users from generating an inaccessible URL. My process starts with the Server.UrlEncode function. But this doesn't eliminate the problems in the URL. Because IIS is decoding the URL and then passing it to the application, certain characters will break it with a dangerous request exception. These characters include +, &, /, !, *, ., ( and ). So on those characters, plus other characters I would like to make more readable, I do a double encoding for a more usable URL. Encoding is also hard because of the limited number of characters that are allowed in a URL. So prior to encoding I made all letters capital and then did the encoding with lower case. This keeps it from being totally decodable, but I can easily do a match in the database or in code by making the value I wish to match upper case.
Well, here is my code. Feedback would be appreciated. Oh yeah, this is in VB, but things should transfer over to C# easily enough.
Dim strReturn As String = Trim(strStringToEncode)
strReturn = Server.UrlEncode(strReturn)
strReturn = strReturn.Replace("-", "dash").Replace("+", "-")
strReturn = strReturn.Replace("%26", "and").
Replace("%2f", "or").
Replace("!", "excl").
Replace("*", "star").
Replace("%27", "apos").
Replace("(", "lprn").
Replace(")", "rprn").
Replace("%3b", "semi").
Replace("%3a", "coln").
Replace("%40", "at").
Replace("%3d", "eq").
Replace("%2b", "plus").
Replace("%24", "dols").
Replace("%25", "pct").
Replace("%2c", "coma").
Replace("%3f", "query").
Replace("%23", "hash").
Replace("%5b", "lbrk").
Replace("%5d", "rbrk").
Replace(".", "dot").
Replace("%3e", "gt").
Replace("%3c", "lt")
Return strReturn
I guess you are looking for HttpUtility.UrlEncode and HttpUtility.HtmlDecode
string url = "http://www.google.com/search?q=" + HttpUtility.UrlEncode("Example");
