I have file:// links with non-English characters that are URL-encoded as UTF-8. For these links to work in a browser, I have to re-encode them.
file://development/H%C3%A5ndplukket.doc
becomes
file://development/H%e5ndplukket.doc
I have the following code which works:
public string ReEncodeUrl(string url)
{
    Encoding enc = Encoding.GetEncoding("iso-8859-1");
    string[] parts = url.Split('/');
    for (int i = 1; i < parts.Length; i++)
    {
        parts[i] = HttpUtility.UrlDecode(parts[i]);      // Decode to string
        parts[i] = HttpUtility.UrlEncode(parts[i], enc); // Re-encode to Latin-1
        parts[i] = parts[i].Replace('+', ' ');           // Change + to [space]
    }
    return string.Join("/", parts);
}
Is there a cleaner way of doing this?
I think that's pretty clean actually. It's readable and you said it functions correctly. As long as the implementation is hidden from the consumer, I wouldn't worry about squeezing out that last improvement.
If you are doing this operation very often (hundreds of executions per event, say), I would think about pulling the implementation out of UrlEncode/UrlDecode and streaming them into each other, thereby removing the need for the string split/join and gaining some performance. Testing would have to prove that out, though, and it definitely wouldn't be "clean" :-)
While I don't see any real way of changing it that would make a difference, shouldn't the +-to-space replacement happen before you UrlEncode, so that it turns into %20?
Admittedly ugly and not really an improvement, but you could re-encode the whole thing (avoiding the split/iterate/join) and then .Replace("%2f", "/").
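A minimal sketch of that whole-string idea, assuming the input is a file:// URL whose escapes are UTF-8 (the method name is made up, and note that ':' gets escaped too, so it needs the same treatment as '/'):
public string ReEncodeWholeUrl(string url)
{
    Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
    string decoded = HttpUtility.UrlDecode(url);               // %xx (UTF-8) -> text
    string reEncoded = HttpUtility.UrlEncode(decoded, latin1); // text -> %xx (Latin-1)
    return reEncoded.Replace("%2f", "/")                       // restore the path separators
                    .Replace("%3a", ":")                       // restore the scheme colon
                    .Replace("+", " ");                        // keep spaces literal, as in the question
}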
I don't understand the code wanting to keep a space in the final result - seems like you don't end up with something that's actually encoded if it still has spaces in it?
I am parsing emails using a regex in a c# VSTO project. Once in a while, the regex does not seem to work (although if I paste the text and regex in regexbuddy, the regex correctly matches the text). If I look at the email in gmail, I see
=E2=80=8B
at the beginning and end of some lines (which I understand is the UTF-8 encoding of the zero-width space); this appears to be what is messing up the regex. It seems to be the only sequence showing up.
What is the easiest way to get rid of this exact sequence? I cannot do the obvious
MailItem.Body.Replace("=E2=80=8B", "")
because those characters don't show up in the c# string.
I also tried
byte[] bytes = Encoding.Default.GetBytes(MailItem.TextBody);
string myString = Encoding.UTF8.GetString(bytes);
But the zero-width spaces just show up as ?. I suppose I could go through the byte array and remove the bytes comprising the zero-width space, but I don't know what the bytes would look like (it does not seem to be as simple as converting E2 80 8B to decimal and searching for that).
As strings in C# are stored as Unicode UTF-16 (not UTF-8), the following might do the trick:
MailItem.Body.Replace("\u200B", "");
As all the Regex.Replace() methods operate on strings, that's not going to be useful here.
The string indexer returns a char, so for want of a better solution (and if you can't predict where these characters are going to be), as long-winded as it seems, you may be best off with:
StringBuilder newText = new StringBuilder();
for (int i = 0; i < MailItem.Body.Length; i++)
{
    if (MailItem.Body[i] != '\u200b')
    {
        newText.Append(MailItem.Body[i]);
    }
}
string cleanedBody = newText.ToString();
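A more compact equivalent of that loop, as a sketch (same logic expressed with LINQ; assumes using System.Linq;):
string cleanedBody = new string(MailItem.Body.Where(c => c != '\u200b').ToArray());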
Use System.Web.HttpUtility.HtmlDecode(string);
Quite simple.
Does anyone know of a .NET library (NuGet package preferably) that I can use to fix strings that are 'messed up' because of encoding issues?
I have Excel* files that are supplied by third parties that contain strings like:
TelefÃ³nica UK Limited
ServiÃ§os de ComunicaÃ§Ãµes e MultimÃ©dia
These entries are simply user-error (e.g. someone copy/pasted wrong or something) because elsewhere in the same file the same entries are correct:
Telefónica UK Limited
Serviços de Comunicações e Multimédia
So I was wondering if there is a library/package/something that takes a string and fixes "common errors" like Ã§Ãµ → çõ and Ã³ → ó. I understand that this won't be 100% fool-proof and may result in some false negatives, but it would sure be nice to have some field-tested library to help me clean up my data a bit. Ideally it would 'autodetect' the issue(s) and 'autofix' them, as I won't always be able to tell what the source encoding (and destination encoding) was at the time the mistake was made.
* The file type is not very relevant; I may have text from other parties in other file formats that have the same issue...
My best advice is to start with a list of special characters that are used in the language in question.
I assume you're just dealing with Portuguese or other European languages with just a handful of non-US-ASCII characters.
I also assume you know what the bad encoding was in the first place (i.e. the code page), and it was always the same.
(If you can't assume these things, then it's a bigger problem.)
Then encode each of these characters badly, and look for the results in your source text. If any are found, you can treat it as badly encoded text.
var specialCharacters = "çõéó";
var goodEncoding = Encoding.UTF8;
var badEncoding = Encoding.GetEncoding(28591);
var badStrings = specialCharacters.Select(c => badEncoding.GetString(goodEncoding.GetBytes(c.ToString())));
var sourceText = "ServiÃ§os de ComunicaÃ§Ãµes e MultimÃ©dia";
if (badStrings.Any(s => sourceText.Contains(s)))
{
    sourceText = goodEncoding.GetString(badEncoding.GetBytes(sourceText));
}
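Wrapped up as a reusable helper, the same idea might look like this (just a sketch; the method name and the exact character list are assumptions, and it still presumes the bad text went through ISO-8859-1):
using System.Linq;
using System.Text;

static string FixMojibake(string text)
{
    var goodEncoding = Encoding.UTF8;
    var badEncoding = Encoding.GetEncoding(28591); // ISO-8859-1
    var specialCharacters = "çõéóáêíú";            // characters expected in the data

    // Build the badly encoded form of each special character, e.g. "ç" -> "Ã§".
    var badStrings = specialCharacters
        .Select(c => badEncoding.GetString(goodEncoding.GetBytes(c.ToString())))
        .ToList();

    // Only reverse the round trip if the text actually looks mis-encoded.
    return badStrings.Any(text.Contains)
        ? goodEncoding.GetString(badEncoding.GetBytes(text))
        : text;
}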
The first step in fixing a bad encoding is to find what encoding the text was mis-encoded to; often this is not obvious.
So, start with a bit of text that is mis-encoded, and the corrected version of the text. Here my badly encoded text ends with Ã¤ rather than ä:
var name = "ViistoperÃ¤";
var target = "Viistoperä";
var encs = Encoding.GetEncodings();
foreach (var encodingType in encs)
{
    var raw = Encoding.GetEncoding(encodingType.CodePage).GetBytes(name);
    var output = Encoding.UTF8.GetString(raw);
    if (output == target)
    {
        Console.WriteLine("{0},{1},{2}", encodingType.DisplayName, encodingType.CodePage, output);
    }
}
This will output a number of candidate encodings, and you can pick the most relevant one; Windows-1252 is a better candidate than Turkish in this case.
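As a reusable sketch (the method name and the choice to return every matching encoding are assumptions, not part of the answer above):
using System.Collections.Generic;
using System.Linq;
using System.Text;

static IEnumerable<Encoding> FindCandidateSourceEncodings(string badText, string goodText)
{
    // Re-interpret the bad text through every installed code page and keep the
    // encodings whose bytes, read back as UTF-8, reproduce the good text.
    return Encoding.GetEncodings()
        .Select(info => Encoding.GetEncoding(info.CodePage))
        .Where(enc => Encoding.UTF8.GetString(enc.GetBytes(badText)) == goodText);
}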
We have a string that is read from a web page. Because browsers are tolerant of unencoded special characters (e.g. a bare ampersand), some pages use them encoded and some don't... so there is a good chance we have stored some data encoded once, and some multiple times...
Is there a clean solution to make sure my string is fully decoded, no matter how many times it was encoded?
Here is what we are using now:
public static string HtmlDecode(this string input)
{
    var temp = HttpUtility.HtmlDecode(input);
    while (temp != input)
    {
        input = temp;
        temp = HttpUtility.HtmlDecode(input);
    }
    return input;
}
and the same, using UrlDecode.
That's probably the best approach honestly. The real solution would be to rework your code so that you only singly encode things in all places, so that you could only singly decode them.
Your code seems to decode HTML strings correctly, with its repeated checks.
However, if the input HTML is malformed, i.e. not encoded properly, the decoding will be unexpected: bad inputs might not be decoded properly no matter how many times they pass through this method.
A quick check with two encoded strings, one completely encoded and one only partially encoded, yielded the following results:
"<b>" will decode to "<b>"
"<b> will decode to "<b>"
In case this is helpful to anyone, here is a recursive version for multiple HTML encoded strings (I find it a bit easier to read):
public static string HtmlDecode(string input) {
    string decodedInput = WebUtility.HtmlDecode(input);
    if (input == decodedInput) {
        return input;
    }
    return HtmlDecode(decodedInput);
}
WebUtility is in the System.Net namespace.
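For example (a quick sketch; the triple-encoded input is made up for illustration):
// "&amp;amp;lt;b&amp;amp;gt;" -> "&amp;lt;b&amp;gt;" -> "&lt;b&gt;" -> "<b>"
Console.WriteLine(HtmlDecode("&amp;amp;lt;b&amp;amp;gt;")); // prints <b>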
I've seen this problem a lot: you have some obscure Unicode character that is somewhat like a certain ASCII character and needs to be converted at run time for whatever reason.
In this case I am trying to export to CSV. Having already used a nasty fix for dash, em dash, en dash and horizontal bar, I have just received a new request for ' ` '. Aside from another nasty fix, is there a better way to do this?
Here's what I have at the moment...
formattedString = formattedString.Replace(char.ConvertFromUtf32(8211), "-");
formattedString = formattedString.Replace(char.ConvertFromUtf32(8212), "-");
formattedString = formattedString.Replace(char.ConvertFromUtf32(8213), "-");
Any Ideas?
It's a rather inelegant problem, so no method will really be deeply elegant.
Still, we can certainly improve things. Just which approach will work best will depend on the number of changes that need to be made (and the size of the string to change, though it's often best to assume this either is or could be quite large).
For a single replacement character, the approach you use so far, .Replace, is superior, though I would replace char.ConvertFromUtf32(8211) with "\u2013". The effect on performance is negligible, but it's more readable, since it's more usual to refer to that character in hexadecimal, as in U+2013, than in decimal notation (char.ConvertFromUtf32(0x2013) would have the same advantage there, but no advantage over just using the char notation). One could also put '–' straight into the code: more readable in some cases, but less so here, where it looks much the same as ‒, — or - to the reader.
I'd also replace the string replace with the marginally faster character replace (in this case at least, where you are replacing a single char with a single char).
Taking this approach to your code it becomes:
formattedString = formattedString.Replace('\u2013', '-');
formattedString = formattedString.Replace('\u2014', '-');
formattedString = formattedString.Replace('\u2015', '-');
Even with as few replacements as three, this is likely to be less efficient than doing all such replacements in one pass (I'm not going to run a test to find how long formattedString would need to be for this; above a certain number of replacements it becomes more efficient to use a single pass even for strings of only a few characters). One approach is:
StringBuilder sb = new StringBuilder(formattedString.Length); // we know the capacity, so initialise with it
foreach (char c in formattedString)
{
    switch (c)
    {
        case '\u2013': case '\u2014': case '\u2015':
            sb.Append('-');
            break;
        default:
            sb.Append(c);
            break;
    }
}
formattedString = sb.ToString();
(Another possibility is to check if (int)c >= 0x2013 && (int)c <= 0x2015, but the reduction in the number of branches is small, and irrelevant if most of the characters you look for aren't numerically close to each other.)
There are various variants on this (e.g. if formattedString is going to be written to a stream at some point, it may be best to do so as each final character is obtained, rather than buffering again).
Note that this approach doesn't deal with multi-char strings in your search, but can with strings in your output, e.g. we could include:
case 'ß':
    sb.Append("ss");
    break;
Now, this is more efficient than the previous, but still becomes unwieldy after a certain number of replacement cases. It also involves many branches, which have their own performance issues.
Let's consider for a moment the opposite problem. Say you wanted to convert characters from a source that was only in the US-ASCII range. You would have only 128 possible characters so your approach could be:
char[] replacements = { /* list of 128 replacement characters */ };
StringBuilder sb = new StringBuilder(formattedString.Length);
foreach (char c in formattedString)
    sb.Append(replacements[(int)c]);
formattedString = sb.ToString();
Now, this isn't practical with Unicode, which has over 109,000 assigned characters in a range running from 0 to 1,114,111. However, chances are the characters you care about are not only far fewer than that (and if you really did care about that many cases, you'd want the approach given just above) but also confined to a relatively restricted block.
Consider also the case where you don't especially care about any surrogates (we'll come to those later). Most characters you just don't care about, so let's consider this:
char[] unchanged = new char[128];
for (int i = 0; i != 128; ++i)
    unchanged[i] = (char)i;
char[] error = new string('\uFFFD', 128).ToCharArray();
// Block 64 covers U+2000 to U+207F; U+2013 to U+2015 sit at offsets 19 to 21 within it.
char[] block0 = (new string('\uFFFD', 19) + "---" + new string('\uFFFD', 106)).ToCharArray();
char[][] blocks = new char[8704][];
for (int i = 1; i != 8704; ++i)
    blocks[i] = error;
blocks[0] = unchanged;
blocks[64] = block0;
/* the above need only happen once, so it could be done with static members of a helper class that are initialised in a static constructor */
StringBuilder sb = new StringBuilder(formattedString.Length);
foreach (char c in formattedString)
{
    int cAsI = (int)c;
    sb.Append(blocks[cAsI / 128][cAsI % 128]);
}
string ret = sb.ToString();
if (ret.IndexOf('\uFFFD') != -1)
    throw new ArgumentException("Unconvertable character");
formattedString = ret;
The balance between whether it's better to test for an unconvertable character in one go at the end (as above) or on each conversion varies according to how likely this is to happen. It's obviously even better if you can be sure (due to knowledge of your data) that it won't happen, and can remove that check, but you have to be really sure.
The advantage here is that while we are using a look-up method, we are only taking up 384 characters' worth of memory to hold the look-up (and some more for the array overhead) rather than 109,000 characters' worth. The best size for the blocks within this varies according to your data, (that is, what replacements you want to make), but the assumption that there will be blocks that are identical to each other tends to hold.
Now, finally, what if you care about a character in the "astral planes" which are represented as surrogate pairs in the UTF-16 used internally in .NET, or if you care about replacing some multi-char strings in a particular way?
In this case, you are probably going to have to at the very least read a character or more ahead in your switch (if using the block method for most cases, you can use an unconvertable case to signal that such work is required). In such a case, it might well be worth converting to and then back from US-ASCII with System.Text.Encoding and a custom implementation of EncoderFallback and EncoderFallbackBuffer, and handling it there. This means that most of the conversion (the obvious cases) will be done for you, while your implementation deals only with the special cases.
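A minimal sketch of that fallback approach, assuming every unmappable character (including a whole surrogate pair) should simply become '-'; the class names here are made up:
using System.Text;

class DashFallback : EncoderFallback
{
    public override int MaxCharCount { get { return 1; } }
    public override EncoderFallbackBuffer CreateFallbackBuffer() { return new DashFallbackBuffer(); }
}

class DashFallbackBuffer : EncoderFallbackBuffer
{
    private bool pending;
    public override int Remaining { get { return pending ? 1 : 0; } }
    public override bool Fallback(char charUnknown, int index) { pending = true; return true; }
    public override bool Fallback(char charUnknownHigh, char charUnknownLow, int index) { pending = true; return true; }
    public override char GetNextChar()
    {
        if (!pending) return '\0';
        pending = false;
        return '-'; // substitute every unmappable character with a plain hyphen
    }
    public override bool MovePrevious() { return false; }
}

// Usage: round-trip through US-ASCII; ASCII characters pass through untouched,
// everything else goes through the fallback above.
// var ascii = Encoding.GetEncoding("us-ascii", new DashFallback(), DecoderFallback.ReplacementFallback);
// formattedString = ascii.GetString(ascii.GetBytes(formattedString));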
You could maintain a lookup table that maps your problem characters to replacement characters. For efficiency you can work on a character array, to prevent the intermediary string churn that would result from using string.Replace.
For example:
var lookup = new Dictionary<char, char>
{
    { '`', '-' },
    { 'இ', '-' },
    // next pair, etc, etc
};
var input = "blah இ blah ` blah";
var result = input.Select(c => lookup.TryGetValue(c, out var r) ? r : c);
string output = new string(result.ToArray());
Or if you want blanket treatment of all characters outside the ASCII range:
string output = new string(input.Select(c => c <= 127 ? c : '-').ToArray());
Unfortunately, given that you're doing a bunch of specific transforms within your data, you will likely need to do these via replacements.
That being said, you could make a few improvements.
If this is common, and the strings are long, storing these in a StringBuilder instead of a string would allow in-place replacements of the values, which could potentially improve things.
You could store the conversion characters, both from and to, in a Dictionary or other structure, and perform these operations in a simple loop.
You could load both the "from" and "to" character at runtime from a configuration file, instead of having to hard-code every transformation operation. Later, when more of these were requested, you wouldn't need to alter your code - it could be done via configuration.
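A sketch of that configuration-driven idea (the file name, its format and the helper names are all assumptions); each line of replacements.txt holds two hex code points, e.g. "2013 002D" to map U+2013 to '-':
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

static Dictionary<char, char> LoadReplacements(string path)
{
    // Parse "from to" pairs of hex code points, one pair per line.
    return File.ReadAllLines(path)
        .Select(line => line.Split(' '))
        .ToDictionary(
            parts => (char)Convert.ToInt32(parts[0], 16),
            parts => (char)Convert.ToInt32(parts[1], 16));
}

static string ApplyReplacements(string input, IDictionary<char, char> map)
{
    var sb = new StringBuilder(input.Length);
    foreach (char c in input)
    {
        char replacement;
        sb.Append(map.TryGetValue(c, out replacement) ? replacement : c);
    }
    return sb.ToString();
}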
If they are all replaced with the same string:
formattedString = string.Join("-", formattedString.Split('\u2013', '\u2014', '\u2015'));
or
foreach (char c in "\u2013\u2014\u2015")
    formattedString = formattedString.Replace(c, '-');