HTML/Url decode on multiple times encoded string - c#

We have a string that is read from a web page. Because browsers are tolerant of unencoded special characters (e.g. the ampersand), some pages encode them and some don't... so there is a good chance we have stored some data encoded once, and some encoded multiple times...
Is there a clean solution to make sure the string is fully decoded, no matter how many times it was encoded?
Here is what we are using now:
public static string HtmlDecode(this string input)
{
    var temp = HttpUtility.HtmlDecode(input);
    while (temp != input)
    {
        input = temp;
        temp = HttpUtility.HtmlDecode(input);
    }
    return input;
}
We use the same pattern with UrlDecode.
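For reference, the UrlDecode variant looks like this (a sketch only; UrlDecodeFully is just an illustrative name):

public static string UrlDecodeFully(this string input)
{
    // Same decode-until-stable loop, but for URL encoding.
    var temp = HttpUtility.UrlDecode(input);
    while (temp != input)
    {
        input = temp;
        temp = HttpUtility.UrlDecode(input);
    }
    return input;
}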

That's probably the best approach, honestly. The real solution would be to rework your code so that everything is encoded exactly once in all places, so that you only ever need to decode once.

Your code decodes HTML strings correctly, with repeated passes.
However, if the input is malformed, i.e. not encoded properly, the result will be unexpected: bad input may never decode cleanly no matter how many times it passes through this method.
A quick check with two strings, one completely encoded and one only partially encoded, yielded the following results:
"&lt;b&gt;" will decode to "<b>"
"&lt;b&gt" will decode to "<b&gt"

In case this is helpful to anyone, here is a recursive version for multiple HTML encoded strings (I find it a bit easier to read):
public static string HtmlDecode(string input) {
    string decodedInput = WebUtility.HtmlDecode(input);
    if (input == decodedInput) {
        return input;
    }
    return HtmlDecode(decodedInput);
}
WebUtility is in the System.Net namespace.
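For example (a small usage sketch; the literal is assumed to have been HTML-encoded twice):

string doubleEncoded = "Tom &amp;amp; Jerry";
string plain = HtmlDecode(doubleEncoded); // "Tom & Jerry", however many rounds of encoding were applied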

Related

Sanitize registration input?

This may be a really silly question but so far the interwebs has failed me, so I'm hoping you good people of SO will shed some light. Essentially I have a website with membership functionality (sign up/login/forgotten password etc.) using the .NET membership providers. Later down the line I take users' registration data, convert it to XML, and then use it elsewhere in my logic. Unfortunately I often get issues with the data in the XML, more often than not "hexadecimal value 0x1C, is an invalid character". I did find a handy blog post with a resolution to this, but it got me thinking: are there any standards on how data should be sanitized? What should be let through registration and what not?
Assuming that you're (manually?) de-serializing the registration input, you need to encode it as XML before further processing so that characters with special meaning in XML are escaped properly.
Note that there are only 5 of them so it's perfectly reasonable to do this with a manual replace:
<  =  &lt;
>  =  &gt;
&  =  &amp;
"  =  &quot;
'  =  &apos;
You could use the built-in .NET function HttpUtility.HtmlEncode(input) to do this for you.
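For example (a minimal sketch with a made-up input string):

string raw = "Fish & \"Chips\" <Ltd>";
string escaped = HttpUtility.HtmlEncode(raw);
// escaped == "Fish &amp; &quot;Chips&quot; &lt;Ltd&gt;"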
UPDATE:
I just realized I didn't really answer your question; you seem to be looking for a way to transform Unicode characters into ASCII-safe HTML entities.
I'm not aware of any built-in functions in .NET that do this, so I wrote a little utility method which should illustrate the concept:
public static class StringUtilities
{
    public static string HtmlEncode(string input, Encoding source, Encoding destination)
    {
        var sourceChars = HttpUtility.HtmlEncode(input).ToArray();
        var sb = new StringBuilder();
        foreach (var sourceChar in sourceChars)
        {
            byte[] sourceBytes = source.GetBytes(new[] { sourceChar });
            char destChar = destination.GetChars(sourceBytes).FirstOrDefault();
            if (destChar != sourceChar)
                sb.AppendFormat("&#{0};", (int)sourceChar);
            else
                sb.Append(sourceChar);
        }
        return sb.ToString();
    }
}
Then, given an input string which has both reserved XML characters and Unicode characters in it, you could use it like this:
string unicode = "<tag>some proӸematic text<tag>";
string escapedASCII = StringUtilities.HtmlEncode(
    unicode, Encoding.Unicode, Encoding.ASCII);
// Result: &lt;tag&gt;some pro&#1272;ematic text&lt;tag&gt;
If you need to do this at several places, to clean it up a bit, you could add an extension method for your specific scenario:
public static class StringExtensions
{
    public static string ToEncodedASCII(this string input, Encoding sourceEncoding)
    {
        return StringUtilities.HtmlEncode(input, sourceEncoding, Encoding.ASCII);
    }

    public static string ToEncodedASCII(this string input)
    {
        return StringUtilities.HtmlEncode(input, Encoding.Unicode, Encoding.ASCII);
    }
}
You could then do:
string unicode = "<tag>some proӸematic text<tag>";
// Default to Unicode as input
string escapedASCII1 = unicode.ToEncodedASCII();
// Pass in a different encoding for your input
string escapedASCII2 = unicode.ToEncodedASCII(Encoding.BigEndianUnicode);
UPDATE #2
Since you also asked for advice on adhering to standards, the most I can tell you is that you need to take into consideration where the input text will actually end up.
If the input for a certain user will only ever be displayed to that user (for instance when they manage their profile / account settings in your app), and your database supports Unicode, you could just leave everything as-is.
On the other hand, if the information can be displayed to other users (for instance when users can view each others public profile information) then you need to take into consideration that not all users will be visiting your website on a device/browser that supports Unicode. In that case, UTF-8 is likely to be your best bet.
This is also why you can't really find that much useful information on it. If the world was able to agree on a standard then we would not have to deal with all these encoding variations in the first place. Think about your target group and what they need.
A useful blog post on the subject of encoding: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Urlencode large amount of text in .net 4 client C#

What's the best way to urlencode (escape) a large string (50k - 200k characters) in the .net 4 client profile?
System.Net.Uri.EscapeDataString() is limited to 32766 characters.
HttpUtility.UrlEncode is not available in .net 4 client.
The encoded string will be passed as the value of a parameter in an HTTP POST request.
(Also, is there a .net-4-client profile tag on SO?)
Because a URL-encoded string is encoded character by character, if you split a string and encode the two parts separately, you can concatenate them and get the encoded version of the original string.
So simply loop through, URL-encoding 30,000 characters at a time, and then join all those parts together to get your encoded string.
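A sketch of that idea (EscapeLargeDataString is a made-up name; it assumes Uri.EscapeDataString and a chunk size safely under its limit, and note that a split landing in the middle of a surrogate pair would mis-encode that character):

public static string EscapeLargeDataString(string value)
{
    const int chunkSize = 30000; // stay under the ~32,766-character limit
    var sb = new StringBuilder();
    for (int i = 0; i < value.Length; i += chunkSize)
    {
        int length = Math.Min(chunkSize, value.Length - i);
        sb.Append(Uri.EscapeDataString(value.Substring(i, length)));
    }
    return sb.ToString();
}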
I will echo the sentiments of others that you might be better off with a content-type of multipart/form-data. http://www.w3.org/TR/html4/interact/forms.html#h-17.13.4 explains the differences in case you are unaware. Which of these two you choose should make little difference to the destination since both should be fully understood by the target.
I would suggest looking into using a MIME format for posting your data. There's no need to encode (other than maybe a Base64 encoding), and it would keep you under the limitation.
You could manually encode it all using StringBuilder, though it will increase your transfer amount threefold:
string EncodePostData(byte[] data)
{
    var sbData = new StringBuilder();
    foreach (byte b in data)
    {
        sbData.AppendFormat("%{0:x2}", b);
    }
    return sbData.ToString();
}
The standard method, however, is just to supply a MIME type and Content-Length header, then send the data raw.
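A rough sketch of posting the data raw (the URL is a placeholder and data is assumed to be the byte[] payload; HttpWebRequest is available in the client profile):

var request = (HttpWebRequest)WebRequest.Create("http://example.com/upload");
request.Method = "POST";
request.ContentType = "application/octet-stream"; // or whatever MIME type fits the payload
request.ContentLength = data.Length;
using (Stream body = request.GetRequestStream())
{
    body.Write(data, 0, data.Length); // raw bytes, no URL encoding involved at all
}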

Can I add a string converted to Base64 as part of a URL?

I have an encrypted string that I would like to make part of a URL. Is it enough to convert it to Base64, or do I need to do something more?
var going_to_be_part_of_url = System.Convert.ToBase64String(bytOut, 0, i);
Thanks
Yes, but it's not a good idea: Base64 requires that you respect the difference between upper case and lower case, and URLs aren't typically case strict.
Then there's the problem of the special characters in Base64 being converted to their URL-encoded equivalents, making your URLs ugly and less manageable.
You should go with Base36 instead.
You can use a modified Base64 for URL applications, which is just Base64 with a couple of the problem characters replaced.
The easiest way is to take your Base64 string and perform a string replace on the problem characters when building the URL, reversing the process when interpreting the URL.
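A sketch of that replace-and-reverse idea (method names are made up; '+' and '/' are the problem characters, and the '=' padding is dropped too, since it would otherwise be percent-encoded):

public static string ToBase64Url(byte[] bytes)
{
    return Convert.ToBase64String(bytes)
        .Replace('+', '-')
        .Replace('/', '_')
        .TrimEnd('=');
}

public static byte[] FromBase64Url(string s)
{
    string padded = s.Replace('-', '+').Replace('_', '/');
    // Restore the '=' padding so the length is a multiple of 4 again.
    padded = padded.PadRight(padded.Length + (4 - padded.Length % 4) % 4, '=');
    return Convert.FromBase64String(padded);
}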

Converting unicode character to a single hexadecimal value in C#

I am getting a character from an EMF record using Encoding.Unicode.GetString, and the resulting string contains only one character but takes two bytes. I don't know much about encoding schemes and multi-byte character sets. I want to convert that character to its equivalent single hexadecimal value. Can you help me with this?
It's not clear what you mean. A char in C# is a 16-bit unsigned value. If you've got a binary data source and you want to get Unicode characters, you should use an Encoding to decode the binary data into a string, which you can then access as a sequence of char values.
You can convert a char to a hex string by first converting it to an integer, and then using the X format specifier like this:
char c = '\u0123';
string hex = ((int)c).ToString("X4"); // Now hex = "0123"
Now, that leaves one more issue: surrogate pairs. Values which aren't in the Basic Multilingual Plane (U+0000 to U+FFFF) are represented by two UTF-16 code units - a high surrogate and a low surrogate. You can use the char.IsSurrogate* methods to check for surrogate pairs... although it's harder (as far as I can see) to then convert a surrogate pair into a UCS-4 value. If you're lucky, you won't need to deal with this... if you're happy converting your binary data into a sequence of UTF-16 code units instead of strict UCS-4 values, you don't need to worry.
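If you do hit that case, char.ConvertToUtf32 can combine a surrogate pair into a single code point; a small sketch with a made-up character outside the BMP:

string s = "\U0001D11E";                   // MUSICAL SYMBOL G CLEF, stored as a surrogate pair
int codePoint = char.ConvertToUtf32(s, 0); // 0x1D11E
string hex = codePoint.ToString("X");      // "1D11E"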
EDIT: Given your comments, it's still not entirely clear what you've got to start with. You say you've got two bytes... are they separate, or in a byte array? What do they represent? Text in a particular encoding, presumably... but which encoding? Once you know the encoding, you can convert a byte array into a string easily:
byte[] bytes = ...;
// For example, if your binary data is UTF-8
string text = Encoding.UTF8.GetString(bytes);
char firstChar = text[0];
string hex = ((int)firstChar).ToString("X4");
If you could edit your question to give more details about your actual situation, it would be a lot easier to help you get to a solution. If you're generally confused about encodings and the difference between text and binary data, you might want to read my article about it.
Try this:
System.Text.Encoding.Unicode.GetBytes(theChar.ToString())
    .Aggregate("", (agg, val) => agg + val.ToString("X2"));
However, since you don't specify exactly what encoding the character is in, this could fail. Further, you don't make it very clear whether you want the output to be a string of hex chars or bytes. I'm guessing the former, since I'd guess you want to generate HTML. Let me know if any of this is wrong.
I created an extension method to convert a Unicode or non-Unicode string to a hex string.
I'm sharing it for anyone who's interested.
public static class StringHelper
{
    public static string ToHexString(this string str)
    {
        byte[] bytes = str.IsUnicode() ? Encoding.UTF8.GetBytes(str) : Encoding.Default.GetBytes(str);
        return BitConverter.ToString(bytes).Replace("-", string.Empty);
    }

    public static bool IsUnicode(this string input)
    {
        const int maxAnsiCode = 255;
        return input.Any(c => c > maxAnsiCode);
    }
}
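Usage sketch (the second literal contains a character above 255, so it takes the UTF-8 branch):

string asciiHex   = "AB".ToHexString(); // "4142" on an ASCII-compatible default code page
string unicodeHex = "AӸ".ToHexString(); // "41D3B8" via UTF-8 (Ӹ is U+04F8)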
Get thee to StringInfo:
http://msdn.microsoft.com/en-us/library/system.globalization.stringinfo.aspx
http://msdn.microsoft.com/en-us/library/8k5611at.aspx
The .NET Framework supports text elements. A text element is a unit of text that is displayed as a single character, called a grapheme. A text element can be a base character, a surrogate pair, or a combining character sequence. The StringInfo class provides methods that allow your application to split a string into its text elements and iterate through the text elements. For an example of using the StringInfo class, see String Indexing.
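A small sketch of iterating text elements with StringInfo (System.Globalization); the sample string is made up and contains an 'e' plus a combining accent, followed by a surrogate pair:

string text = "e\u0301\U0001D11E";
TextElementEnumerator elements = StringInfo.GetTextElementEnumerator(text);
while (elements.MoveNext())
{
    string grapheme = (string)elements.Current;
    Console.WriteLine(grapheme); // one grapheme per line, even when it spans several chars
}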

Re-encode url from utf-8 encoded to iso-8859-1 encoded

I have file:// links with non-English characters that are URL-encoded as UTF-8. For these links to work in the browser I have to re-encode them.
file://development/H%C3%A5ndplukket.doc
becomes
file://development/H%e5ndplukket.doc
I have the following code which works:
public string ReEncodeUrl(string url)
{
    Encoding enc = Encoding.GetEncoding("iso-8859-1");
    string[] parts = url.Split('/');
    for (int i = 1; i < parts.Length; i++)
    {
        parts[i] = HttpUtility.UrlDecode(parts[i]);      // Decode to string
        parts[i] = HttpUtility.UrlEncode(parts[i], enc); // Re-encode to latin1
        parts[i] = parts[i].Replace('+', ' ');           // Change + to [space]
    }
    return string.Join("/", parts);
}
Is there a cleaner way of doing this?
I think that's pretty clean actually. It's readable and you said it functions correctly. As long as the implementation is hidden from the consumer, I wouldn't worry about squeezing out that last improvement.
If you are doing this operation excessively (like hundreds of executions per event), I would think about taking the implementation out of UrlEncode/UrlDecode and streaming them into each other, removing the need for the string split/join to get a performance improvement. But testing would have to prove that out, and it definitely wouldn't be "clean" :-)
While I don't see any real way of changing it that would make a difference, shouldn't the + to space replace be before you UrlEncode so it turns into %20?
Admittedly ugly and not really an improvement, but you could re-encode the whole thing (avoiding the split/iterate/join) and then .Replace("%2f", "/").
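A rough sketch of that whole-string variant (ReEncodeUrlWholeString is a made-up name; note the colon after "file" comes back encoded as well, so it needs the same treatment):

public string ReEncodeUrlWholeString(string url)
{
    Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
    string decoded = HttpUtility.UrlDecode(url);
    return HttpUtility.UrlEncode(decoded, latin1)
        .Replace("%2f", "/")  // put the path separators back
        .Replace("%3a", ":")  // put the scheme's colon back
        .Replace("+", " ");   // mirror the original method's '+' handling
}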
I don't understand the code wanting to keep a space in the final result - seems like you don't end up with something that's actually encoded if it still has spaces in it?
