Does anyone know of a .NET library (NuGet package preferably) that I can use to fix strings that are 'messed up' because of encoding issues?
I have Excel* files that are supplied by third parties that contain strings like:
TelefÃ³nica UK Limited
ServiÃ§os de ComunicaÃ§Ãµes e MultimÃ©dia
These entries are simply user-error (e.g. someone copy/pasted wrong or something) because elsewhere in the same file the same entries are correct:
Telefónica UK Limited
Serviços de Comunicações e Multimédia
So I was wondering if there is a library/package/something that takes a string and fixes "common errors" like Ã§Ãµ → çõ and Ã³ → ó. I understand that this won't be 100% foolproof and may result in some false negatives, but it would sure be nice to have some field-tested library to help me clean up my data a bit. Ideally it would 'autodetect' the issue(s) and 'autofix' them, as I won't always be able to tell what the source encoding (and destination encoding) was at the time the mistake was made.
* The file type is not very relevant; I may have text from other parties in other file formats that have the same issue...
My best advice is to start with a list of special characters that are used in the language in question.
I assume you're just dealing with Portuguese or other European languages with just a handful of non-US-ASCII characters.
I also assume you know what the bad encoding was in the first place (i.e. the code page), and it was always the same.
(If you can't assume these things, then it's a bigger problem.)
Then encode each of these characters badly, and look for the results in your source text. If any are found, you can treat it as badly encoded text.
var specialCharacters = "çõéó";
var goodEncoding = Encoding.UTF8;
var badEncoding = Encoding.GetEncoding(28591);
var badStrings = specialCharacters.Select(c => badEncoding.GetString(goodEncoding.GetBytes(c.ToString())));
var sourceText = "ServiÃ§os de ComunicaÃ§Ãµes e MultimÃ©dia";
if(badStrings.Any(s => sourceText.Contains(s)))
{
sourceText = goodEncoding.GetString(badEncoding.GetBytes(sourceText));
}
The first step in fixing a bad encoding is to find what encoding the text was mis-encoded to; often this is not obvious.
So, start with a bit of text that is mis-encoded, and the corrected version of the text. Here my badly encoded text ends with Ã¤ rather than ä
var name = "ViistoperÃ¤";
var target = "Viistoperä";
var encs = Encoding.GetEncodings();
foreach (var encodingType in encs)
{
var raw = Encoding.GetEncoding(encodingType.CodePage).GetBytes(name);
var output = Encoding.UTF8.GetString(raw);
if (output == target)
{
Console.WriteLine("{0},{1},{2}",encodingType.DisplayName, encodingType.CodePage, output);
}
}
This will output a number of candidate encodings, and you can pick the most relevant one. Windows-1252 is a better candidate than Turkish in this case.
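Once a candidate has been identified (Windows-1252 here), the actual repair is just the reverse round-trip. A minimal sketch, reusing the name variable from above:
var windows1252 = Encoding.GetEncoding(1252);
// Re-encode the mangled string with the code page it was wrongly decoded as,
// then decode those bytes as UTF-8 to recover the original text.
byte[] rawBytes = windows1252.GetBytes(name);        // name = "ViistoperÃ¤"
string repaired = Encoding.UTF8.GetString(rawBytes); // "Viistoperä"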
This may be a really silly question but so far the interwebs have failed me, so I'm hoping you good people of SO will shed some light. Essentially I have a website on which there is membership functionality (sign up/login/forgotten password etc.) using the .NET membership providers. Later down the line I am taking users' registration data, converting it to XML, and then using it elsewhere in logic. Unfortunately I often get issues with the data I have in XML; more often than not it's "hexadecimal value 0x1C, is an invalid character". I did find a handy blog post on a resolution to this, but it got me thinking: are there any standards on how data should be sanitized? What to let through registration and what not to?
Assuming that you're (manually?) de-serializing the registration input, you need to encode it as XML before further processing so that characters with special meaning in XML are escaped properly.
Note that there are only 5 of them so it's perfectly reasonable to do this with a manual replace:
< = &lt;
> = &gt;
& = &amp;
" = &quot;
' = &apos;
You could use the built-in .NET function HttpUtility.HtmlEncode(input) to do this for you.
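A minimal sketch of that manual replace (the helper name EscapeXml is mine; the ampersand has to be replaced first so the other entities aren't double-escaped):
static string EscapeXml(string input)
{
    return input
        .Replace("&", "&amp;")   // must come first
        .Replace("<", "&lt;")
        .Replace(">", "&gt;")
        .Replace("\"", "&quot;")
        .Replace("'", "&apos;");
}
HttpUtility.HtmlEncode covers the same reserved characters for you, although depending on the framework version the apostrophe may be encoded as &#39; or left as-is.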
UPDATE:
I just realized I didn't really answer your question; you seem to be looking for a way to transform Unicode characters into ASCII-compatible HTML entities.
I'm not aware of any built-in functions in .NET that do this, so I wrote a little utility method which should illustrate the concept:
public static class StringUtilities
{
public static string HtmlEncode(string input, Encoding source, Encoding destination)
{
var sourceChars = HttpUtility.HtmlEncode(input).ToArray();
var sb = new StringBuilder();
foreach (var sourceChar in sourceChars)
{
byte[] sourceBytes = source.GetBytes(new[] { sourceChar });
char destChar = destination.GetChars(sourceBytes).FirstOrDefault();
if (destChar != sourceChar)
sb.AppendFormat("&#{0};", (int)sourceChar);
else
sb.Append(sourceChar);
}
return sb.ToString();
}
}
Then, given an input string which has both reserved XML characters and Unicode characters in it, you could use it like this:
string unicode = "<tag>some proӸematic text<tag>";
string escapedASCII = StringUtilities.HtmlEncode(
unicode, Encoding.Unicode, Encoding.ASCII);
// Result: &lt;tag&gt;some pro&#1272;ematic text&lt;tag&gt;
If you need to do this in several places, you could clean it up a bit by adding extension methods for your specific scenario:
public static class StringExtensions
{
public static string ToEncodedASCII(this string input, Encoding sourceEncoding)
{
return StringUtilities.HtmlEncode(input, sourceEncoding, Encoding.ASCII);
}
public static string ToEncodedASCII(this string input)
{
return StringUtilities.HtmlEncode(input, Encoding.Unicode, Encoding.ASCII);
}
}
You could then do:
string unicode = "<tag>some proӸematic text<tag>";
// Default to Unicode as input
string escapedASCII1 = unicode.ToEncodedASCII();
// Pass in a different encoding for your input
string escapedASCII2 = unicode.ToEncodedASCII(Encoding.BigEndianUnicode);
UPDATE #2
Since you also asked for advice on adhering to standards, the most I can tell you is that you need to take into consideration where the input text will actually end up.
If the input for a certain user will only ever be displayed to that user (for instance when they manage their profile / account settings in your app), and your database supports Unicode, you could just leave everything as-is.
On the other hand, if the information can be displayed to other users (for instance when users can view each others public profile information) then you need to take into consideration that not all users will be visiting your website on a device/browser that supports Unicode. In that case, UTF-8 is likely to be your best bet.
This is also why you can't really find that much useful information on it. If the world was able to agree on a standard then we would not have to deal with all these encoding variations in the first place. Think about your target group and what they need.
A useful blog post on the subject of encoding: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
I'm trying to understand what is the best encoding from C# that fulfills a requirement of a new SMS provider.
The text I want to send is:
Bäste Björn
The encoded text that the provider says it needs is:
B%E4ste+Bj%F6rn
so ä is %E4 and ö is %F6
From this answer, I understood that for such a conversion I need to use HttpUtility.HtmlAttributeEncode, as the normal HttpUtility.UrlEncode will output:
B%c3%a4ste+Bj%c3%b6rn
and that outputs weird chars on the mobile phone :/
As several chars are not converted, I tried this:
private string specialEncoding(string text)
{
StringBuilder r = new StringBuilder();
foreach (char c in text.ToCharArray())
{
string e = System.Web.HttpUtility.UrlEncode(c.ToString());
if (e.StartsWith("%") && e.ToLower() != "%0a") // %0a == Linefeed
{
string attr = System.Web.HttpUtility.HtmlAttributeEncode(c.ToString());
r.Append(attr);
}
else
{
r.Append(e);
}
}
return r.ToString();
}
It's verbose so I could set a breakpoint and test each char, and I found out that:
System.Web.HttpUtility.HtmlAttributeEncode("ä") is actually equal to ä... so there is no %E4 as output...
What am I missing? And is there a simple way to do the encoding, without manipulating the string char by char, that produces the required output?
that the provider says it needs
Ask the provider in which age they are living. According to Wikipedia: Percent-encoding:
The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values. This requirement was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected.
Granted, this RFC talks about "new URI schemes", which HTTP obviously is not, but adhering to this standard prevents headaches like this. See also What is the proper way to URL encode Unicode characters?.
They seem to want you to encode characters according to the Windows-1250 code page (or a comparable one, like ISO-8859-1 or -2; check the alternatives here) instead, since in that code page E4 (228) maps to ä and F6 (246) maps to ö. As @Simon points out in his comment, you should ask the provider which code page exactly they want you to use.
Assuming Windows-1250, you can implement it like this, according to URL encode ASCII/UTF16 characters:
var windows1250 = Encoding.GetEncoding(1250);
var percentEncoded = HttpUtility.UrlEncode("Bäste Björn", windows1250);
The value of percentEncoded is:
B%e4ste+Bj%f6rn
If they insist on using uppercase, see .net UrlEncode - lowercase problem.
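If uppercase hex digits really are required, one possible approach (a sketch of my own, not the linked answer's code, and it needs using System.Text.RegularExpressions) is to uppercase only the two hex digits that follow each percent sign:
string percentEncodedUpper = Regex.Replace(
    percentEncoded,            // "B%e4ste+Bj%f6rn" from the snippet above
    "%[0-9a-f]{2}",            // match a percent escape with lowercase hex digits
    m => m.Value.ToUpperInvariant()); // -> "B%E4ste+Bj%F6rn"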
I'm reading some CSV files. The files are really simple, because there is always just ";" as the separator and there are no ", ', or anything like that.
So it's possible to read the file line by line and separate the strings. That's working fine. Now people told me: maybe you should check the encoding of the file; it should always be ANSI, and if it's not, your output may be different and corrupted. So non-ANSI files should be flagged somehow.
I just said, okay! But if I think about it: do I really have to check the file's encoding in this case? I just changed the encoding of the file to something else and I'm still able to read the file without any problems. My code is simple:
using (TextReader reader = new StreamReader(myFileStream))
{
string line;
while ((line = reader.ReadLine()) != null)
{
// read the line, separate by ; and other stuff...
}
}
So again: do I really need to check the files for ANSI encoding? Could somebody give me an example of when I could get in trouble, or when I would get corrupted output after reading a non-ANSI file? Thank you!
That particular constructor of StreamReader will assume that the data is UTF-8; that is compatible with ASCII, but can fail if data uses bytes in the 128-255 range for single-byte codepages (you'll get the wrong characters in strings, etc), or could fail completely (i.e. throw an exception) if the data is actually something very different like UTF-7, UTF-32, etc.
In some cases (the minority) you might be able to use the byte-order-mark to detect the encoding, but this is a circular problem: in most cases, if you don't already know the encoding, you can't really detect the encoding (robustly). So a better approach would be: to know the encoding in the first place. Then you can pass in the correct encoding to use via one of the other constructors.
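For example, if you happen to know the files are Windows-1252 (an assumption here; substitute whatever code page your files actually use), the explicit-encoding variant of the question's loop might look like this:
using (TextReader reader = new StreamReader(myFileStream, Encoding.GetEncoding(1252)))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // split on ';' as before
    }
}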
Here's an example of the default UTF-8 assumption failing:
// we'll write UTF-32, big-endian, without a byte-order-mark
File.WriteAllText("my.txt", "Hello world", new UTF32Encoding(true, false));
using (var reader = new StreamReader("my.txt"))
{
string s = reader.ReadLine();
}
You can usually read under the UTF-8 encoding, because UTF-8 has the wonderful property of encoding ASCII characters as a single byte (as you would expect), expanding to multi-byte sequences only when it needs to represent other Unicode characters.
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
I am building auto-correct for string input encoding, and I want to build a regex for the encoding pattern.
For example:
var encoding = "utd-8";
Correct c = new Correct(encoding);
var result = c.Correct();
And the output is utf-8.
I have done most of the work (and am using some open-source code from some great people who wrote beautiful stuff). Can someone help please?
UPDATE
What I need in the end is the regex pattern for the right encoding.
The user inputs an encoding name like iso-8859-1 and it checks whether it's valid.
You shouldn't decide on which technology to use before you have figured out how to solve the problem; are Regular Expressions really necessary?
If I understand your question correctly, you want to check whether the input string looks a lot like one of the supported encodings. Before writing a single line of code, you'll have to figure out:
Which encodings are you supporting? Are you supporting aliases (UTF-16 is the same as Unicode)?
How much is the input string allowed to be different from the chosen encoding (utd-8, utd-9, utd9, td9, 9)?
Given the input string "utf-36", would the output be UTF-16 or UTF-32?
Perhaps you can take a look at one of the string distance algorithms (for example, http://en.wikipedia.org/wiki/Levenshtein_distance) for inspiration on the subject. There are a ton of links in the "see also" section there.
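As a rough sketch of that idea (entirely illustrative code of my own: a plain Levenshtein distance, no alias handling, and it assumes System.Linq), you could pick whichever registered encoding name is closest to the user's input:
static string GuessEncodingName(string input)
{
    // Compare the input against the names .NET knows about and return the closest match.
    return Encoding.GetEncodings()
        .Select(e => e.Name)
        .OrderBy(name => Levenshtein(input.ToLowerInvariant(), name.ToLowerInvariant()))
        .First();
}

static int Levenshtein(string a, string b)
{
    // Classic dynamic-programming edit distance.
    var d = new int[a.Length + 1, b.Length + 1];
    for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
    for (int j = 0; j <= b.Length; j++) d[0, j] = j;
    for (int i = 1; i <= a.Length; i++)
        for (int j = 1; j <= b.Length; j++)
        {
            int cost = a[i - 1] == b[j - 1] ? 0 : 1;
            d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1), d[i - 1, j - 1] + cost);
        }
    return d[a.Length, b.Length];
}
GuessEncodingName("utd-8") should return "utf-8" (edit distance 1).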
I've asked this in a round-about manner before here on Stack Overflow, and want to get it right this time. How do I convert ANSI (Codepage 1252) to UTF-8, while preserving the special characters? (I am aware that UTF-8 supports a larger character set than ANSI, but it is okay if I can preserve all UTF-8 characters that are supported by ANSI and substitute the rest with a ? or something)
Why I Want To Convert ANSI → UTF-8
I am basically writing a program that splits vCard files (VCF) into individual files, each containing a single contact. I've noticed that Nokia and Sony Ericsson phones save the backup VCF file in UTF-8 (without BOM), but Android saves it in ANSI (1252). And God knows in what formats the other phones save them in!
So my questions are
Isn't there an industry standard for vCard files' character encoding?
Which is easier for solving my problem? Converting ANSI to UTF-8 (and/or the other way round), or trying to detect which encoding the input file has and notifying the user about it?
tl;dr
Need to know how to convert the character encoding from (ANSI / UTF8) to (UTF8 / ANSI) while preserving all special characters.
You shouldn't convert from one encoding to the other. You have to read each file using the encoding that it was created with, or you will lose information.
Once you read the file using the correct encoding you have the content as a Unicode string, from there you can save it using any encoding you like.
If you need to detect the encoding, you can read the file as bytes and then look for character codes that are specific for either encoding. If the file contains no special characters, either encoding will work as the characters 32..127 are the same for both encodings.
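A possible sketch of such a detection step (my own assumption of how one might implement it, not part of this answer's code): attempt a strict UTF-8 decode of the raw bytes and fall back to Windows-1252 if it fails.
static Encoding DetectEncoding(byte[] bytes)
{
    // A UTF8Encoding with throwOnInvalidBytes: true throws DecoderFallbackException
    // when it meets byte sequences that are not valid UTF-8 (e.g. a lone 0xE4 from Codepage 1252).
    var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
    try
    {
        strictUtf8.GetString(bytes);
        return strictUtf8;
    }
    catch (DecoderFallbackException)
    {
        return Encoding.GetEncoding(1252);
    }
}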
This is what I use in C# (I've been using it to convert from Windows-1252 to UTF8)
public static String readFileAsUtf8(string fileName)
{
Encoding encoding = Encoding.Default;
String original = String.Empty;
using (StreamReader sr = new StreamReader(fileName, Encoding.Default))
{
original = sr.ReadToEnd();
encoding = sr.CurrentEncoding;
sr.Close();
}
if (encoding == Encoding.UTF8)
return original;
byte[] encBytes = encoding.GetBytes(original);
byte[] utf8Bytes = Encoding.Convert(encoding, Encoding.UTF8, encBytes);
return Encoding.UTF8.GetString(utf8Bytes);
}
VCF is encoded in utf-8 as demanded by the spec in chapter 3.4. You need to take this seriously, the format would be utterly useless if that wasn't cast in stone. If you are seeing some Android app mangling accented characters then work from the assumption that this is a bug in that app. Or more likely, that it got bad info from somewhere else. Your attempt to correct the encoding would then cause more problems because your version of the card will never match the original.
You convert from 1252 to utf-8 with Encoding.GetEncoding(1252).GetString(), passing in a byte[]. Do not ever try to write code that reads a string and whacks it into a byte[] so you can use the conversion method, that just makes the encoding problems a lot worse. In other words, you'd need to read the file with FileStream, not StreamReader. But again, avoid fixing other people's problems.
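A minimal sketch of that conversion (the file paths are placeholders): read the raw bytes rather than a string, decode them as 1252, then write the resulting Unicode string back out as UTF-8.
byte[] raw = File.ReadAllBytes(@"contacts-1252.vcf");                   // placeholder path
string text = Encoding.GetEncoding(1252).GetString(raw);                // bytes -> proper Unicode string
File.WriteAllText(@"contacts-utf8.vcf", text, new UTF8Encoding(false)); // write as UTF-8 without a BOM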
I do it this way:
private static void ConvertAnsiToUTF8(string inputFilePath, string outputFilePath)
{
string fileContent = File.ReadAllText(inputFilePath, Encoding.Default);
File.WriteAllText(outputFilePath, fileContent, Encoding.UTF8);
}
I found this question while working to process a large collection of ancient text files into well-formatted PDFs. None of the files have a BOM, and the oldest of the files contain Codepage 1252 code points that cause incorrect decoding to UTF-8. This happens only some of the time; UTF-8 works the majority of the time. Also, the latest of the text data DOES contain UTF-8 code points, so it's a mixed bag.
So, I also set out "to detect which encoding the input file has" and after reading How to detect the character encoding of a text file? and How to determine the encoding of text? arrived at the conclusion that this would be difficult at best.
BUT, I found The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets in the comments, read it, and found this gem:
UTF-8 has the neat side effect that English text looks exactly the same in UTF-8 as it did in ASCII, so Americans don’t even notice anything wrong. Only the rest of the world has to jump through hoops. Specifically, Hello, which was U+0048 U+0065 U+006C U+006C U+006F, will be stored as 48 65 6C 6C 6F, which, behold! is the same as it was stored in ASCII, and ANSI, and every OEM character set on the planet.
The entire article is short and well worth the read.
So, I solved my problem with the following code. Since only a small amount of my text data contains difficult character code points, I don't mind the performance overhead of the exception handling, especially since this only had to run once. Perhaps there are more clever ways of avoiding the try/catch but I did not bother with devising one.
public static string ReadAllTextFromFile(string file)
{
const int WindowsCodepage1252 = 1252;
string text;
try
{
var utf8Encoding = Encoding.GetEncoding("UTF-8", EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
text = File.ReadAllText(file, utf8Encoding);
}
catch (DecoderFallbackException dfe)//then text is not entirely valid UTF8, contains Codepage 1252 characters that can't be correctly decoded to UTF8
{
var codepage1252Encoding = Encoding.GetEncoding(WindowsCodepage1252, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
text = File.ReadAllText(file, codepage1252Encoding);
}
return text;
}
It's also worth noting that the StreamReader class has constructors that take a specific Encoding object, and as I have shown you can adjust the EncoderFallback/DecoderFallback behavior to suit your needs. So if you need a StreamReader or StreamWriter for finer grained work, this approach can still be used.
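For example (a sketch under the same assumptions as the method above, reusing its file parameter), a reader that throws on invalid UTF-8 instead of silently substituting replacement characters:
var strictUtf8 = Encoding.GetEncoding("UTF-8", EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
using (var reader = new StreamReader(file, strictUtf8))
{
    // Throws DecoderFallbackException if the stream contains bytes that are not valid UTF-8.
    string text = reader.ReadToEnd();
}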
I use this to convert file encoding to UTF-8
public static void ConvertFileEncoding(String sourcePath, String destPath)
{
// If the destination's parent doesn't exist, create it.
String parent = Path.GetDirectoryName(Path.GetFullPath(destPath));
if (!Directory.Exists(parent))
{
Directory.CreateDirectory(parent);
}
// Convert the file.
String tempName = null;
try
{
tempName = Path.GetTempFileName();
using (StreamReader sr = new StreamReader(sourcePath))
{
using (StreamWriter sw = new StreamWriter(tempName, false, Encoding.UTF8))
{
int charsRead;
char[] buffer = new char[128 * 1024];
while ((charsRead = sr.ReadBlock(buffer, 0, buffer.Length)) > 0)
{
sw.Write(buffer, 0, charsRead);
}
}
}
File.Delete(destPath);
File.Move(tempName, destPath);
}
finally
{
File.Delete(tempName);
}
}
Isn't there an industry standard for vCard files' character encoding?
Which is easier for solving my problem? Converting ANSI to UTF-8 (and/or the other way round), or trying to detect which encoding the input file has and notifying the user about it?
How I solved this:
I have a vCard file (*.vcf) - 200 contacts in one file, in the Russian language...
I opened it with the vCardOrganizer 2.1 program, then used Split to divide it into 200 files... and what I saw were contacts with messy symbols; the only thing I could read was the numbers :-) ...
Steps: (when you do these steps be patient; sometimes it takes time)
Open the vCard file (my file size was 3 MB) with Notepad
Then go to Menu - File - Save As... In the window that opens, choose a file name (don't forget to put .vcf), pick the encoding - ANSI or UTF-8... and finally click Save.
I converted filename.vcf (UTF-8) to filename.vcf (ANSI) - nothing lost and perfectly readable Russian... if you have questions write: yoshidakatana#gmail.com
Good Luck !!!