I'm writing a TFS check-in policy which checks whether our source files contain our file header.
My problem is that our file header contains the special character "©", and unfortunately some of our source files are encoded in ANSI.
So if I read these files in the policy, the string looks like this: "Copyright � 2009".
string content = File.ReadAllText(pendingChange.LocalItem);
I tried to change the encoding of the string, but it does not help. So how can I read these files so that I get the correct string "Copyright © 2009"?
Use Encoding.Default:
string content = File.ReadAllText(pendingChange.LocalItem, Encoding.Default);
You should be aware, however, that this reads the file using the system default encoding, which may not be the same as the encoding of the file. There's no single encoding called "ANSI", but usually when people talk about "the ANSI encoding" they mean Windows code page 1252, or whatever their box happens to use.
Your code will be more robust if you can find out the exact encoding used.
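If you do want to pin down the encoding yourself, a minimal sketch that just sniffs for the UTF-8 BOM (assuming UTF-8-with-signature and an ANSI code page are the only two cases in play) could look like this:
using System.IO;
using System.Text;

static Encoding GuessEncoding(string path)
{
    byte[] bom = new byte[3];
    using (FileStream fs = File.OpenRead(path))
    {
        fs.Read(bom, 0, 3); // a short read leaves the array zero-filled
    }

    // EF BB BF is the UTF-8 "signature" Visual Studio writes
    if (bom[0] == 0xEF && bom[1] == 0xBB && bom[2] == 0xBF)
        return Encoding.UTF8;

    return Encoding.Default; // no BOM: assume the system ANSI code page
}
Then string content = File.ReadAllText(path, GuessEncoding(path)); reads both kinds of file correctly.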
It would seem sensible, if you're going to have such policies, to also have a team-agreed standard encoding. To be honest, I can't see why any team would use an encoding other than "Unicode (UTF-8 with signature) - Codepage 65001" (except perhaps for ASPX pages with significant non-Latin static content, but even then I can't see how it would be a big deal to use UTF-8).
Assuming you still want to allow mixed encodings, you next need a way to determine which encoding a file was saved in, so you know which encoding to pass to ReadAllText. It's not easy to determine this from the file itself, but using Encoding.Default is likely to work OK, since most likely you have just two encodings to deal with: the VS one (UTF-8 with signature) and a common ANSI encoding used by your machines (probably Windows-1252).
Hence using
string content = File.ReadAllText(pendingChange.LocalItem, Encoding.Default);
will work. (As I see Jon has already posted.) This works because when the UTF-8 BOM (which is what VS means by the term "signature") is present at the start of the file, the supplied encoding parameter is ignored and UTF-8 is used anyway. Hence where the file is saved using UTF-8 you get correct results, and where ANSI is used you are most likely also to get correct results.
BTW, if you are processing file headers, wouldn't ReadAllLines make things easier?
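For example, a sketch of such a header check (the five-line window and the exact notice text are assumptions):
using System.IO;
using System.Linq;
using System.Text;

// Hypothetical policy check: is the copyright notice somewhere in the
// first five lines of the pending file?
bool hasHeader = File.ReadAllLines(pendingChange.LocalItem, Encoding.Default)
                     .Take(5)
                     .Any(line => line.Contains("Copyright © 2009"));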
I know this is an old question, but I ran into a similar situation and found the accepted answer to be cutting some corners (no disrespect to Jon Skeet's pragmatic short answer, but I'll flesh it out a little more)...
The specs state that the header will contain the encoding directly after {\rtf:
\ansi ANSI (the default)
\mac Apple Macintosh
\pc IBM PC code page 437
\pca IBM PC code page 850, used by IBM Personal System/2 (not implemented in version 1 of Microsoft Word for OS/2)
According to Wikipedia, the "ANSI character set has no well-defined meaning".
For the default ANSI you have the choice of these partially incompatible encodings:
using System.Text;
...
string content = File.ReadAllText(filename, Encoding.GetEncoding("ISO-8859-1"));
or
string content = File.ReadAllText(filename, Encoding.GetEncoding("Windows-1252"));
Using WordPad on Windows 10 to save a file with a euro sign (0x80 in Windows-1252, but 0xA4 in ISO-8859-15; ISO-8859-1 has no euro sign at all) revealed the following:
The header stated the exact encoding after \ansi
{\rtf1\ansi\ansicpg1252\deff0\nouicompat\deflang1043{ ...
And the encoding was not used directly; instead, the character was wrapped in RTF escaping: \'80
according to the specs:
\'hh : A hexadecimal value, based on the specified character set (may be used to identify 8-bit values).
I guess the best thing to do is read the header, if the file starts with {\rtf1\ansi\ansicpg1252 then go for Windows-1252.
But to make things more complicated, the specs also state that there can be mixed encodings... search for '\upr'...
I guess there is no definitive answer; the easiest way to go in your case may be to search (in the un-decoded raw byte array) for all the variations of the encoded copyright sign that you may encounter in your source base.
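For the copyright sign specifically, that raw search is simple, since © is 0xA9 in Windows-1252/ISO-8859-1 and the pair 0xC2 0xA9 in UTF-8 (a sketch; filename is a placeholder):
using System;
using System.IO;

byte[] raw = File.ReadAllBytes(filename);
// © is 0xA9 in Windows-1252/ISO-8859-1 and 0xC2 0xA9 in UTF-8,
// so the single byte 0xA9 covers both encodings here
bool foundCopyright = Array.IndexOf(raw, (byte)0xA9) >= 0;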
In my case I finally decided to cut a few corners as well, but add a small percentage of defensive coding. All files I have seen so far were Windows-1252 so I common-case-optimised for that.
// Default to the common case: Windows-1252
Encoding encoding = Encoding.GetEncoding("Windows-1252", EncoderFallback.ReplacementFallback, DecoderFallback.ReplacementFallback);
using (System.IO.StreamReader reader = new System.IO.StreamReader(filename, encoding))
{
    // The RTF header is on the first line and names the character set
    string header = reader.ReadLine();
    if (!header.Contains("cpg1252"))
    {
        if (header.Contains("\\pca"))
            encoding = Encoding.GetEncoding(850, EncoderFallback.ReplacementFallback, DecoderFallback.ReplacementFallback);
        else if (header.Contains("\\pc"))
            encoding = Encoding.GetEncoding(437, EncoderFallback.ReplacementFallback, DecoderFallback.ReplacementFallback);
        else
            encoding = Encoding.GetEncoding("ISO-8859-1", EncoderFallback.ReplacementFallback, DecoderFallback.ReplacementFallback);
    }
}
string content = System.IO.File.ReadAllText(filename, encoding);
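One caveat if this runs on .NET Core or .NET 5+ rather than the .NET Framework (an assumption about your target; the Framework has these code pages built in): Windows-1252, 437, and 850 only resolve after registering the provider from the System.Text.Encoding.CodePages package:
using System.Text;

// Call once at startup so GetEncoding("Windows-1252"), 437 and 850 work
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);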
I'm working on a C# project in which some data contains characters that are not recognised by the encoding.
They are displayed like this:
"Some text � with special � symbols in it".
I have no control over the encoding process, also data come from files of various origins and various formats.
I want to be able to flag data that contains such characters as erroneous or incomplete. Right now I am able to detect them this way:
if(myString.Contains("�"))
{
//Do stuff
}
While it does work, it doesn't feel quite right to use the weird symbol directly in the Contains call. Isn't there a cleaner way to do this?
EDIT:
After checking back with the team responsible for reading the files, this is how they do it:
var sr = new StreamReader(filePath, true);
var content = sr.ReadToEnd();
Passing true as the second parameter of StreamReader is supposed to detect the encoding from the file's BOM and use it to read the content. It doesn't always work, though, as some files don't carry that information, which is why their data is read incorrectly.
We've made some tests, and using StreamReader(filePath, Encoding.Default) instead appears to work for most if not all of the files we had issues with. Expectedly, files that were working before no longer work, because they do not use the default encoding.
So the best solution for us would be the following: read the file trying to detect its encoding, and if that isn't successful, read it again with the default encoding.
The problem remains the same, though: how do we check, after trying to detect the file's encoding, whether the data has been read incorrectly?
The � character is not a special symbol. It's the Unicode replacement character, U+FFFD. It means the bytes were decoded with the wrong encoding; any byte sequences that were invalid in that encoding were replaced with �.
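Since it is simply U+FFFD, the check in the question can test for the code point instead of pasting the glyph (a minimal sketch):
// U+FFFD is the Unicode replacement character that renders as �
if (myString.IndexOf('\uFFFD') >= 0)
{
    // Flag the data as erroneous or incomplete
}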
The solution is to read the file using the correct encoding. The default encoding used by the File methods and StreamReader is UTF-8. You can pass a different encoding using the appropriate constructor, e.g. StreamReader(Stream, Encoding, Boolean). To use the system locale's codepage, you need to use Encoding.Default:
var sr = new StreamReader(filePath, Encoding.Default);
You can use the StreamReader(Stream, Encoding, Boolean) constructor to autodetect Unicode encodings from the BOM and fall back to a different encoding.
Assuming the files are either some type of Unicode or match your system locale, you can use:
var sr = new StreamReader(filePath, Encoding.Default, true);
StreamReader's source shows that the DetectEncoding method checks the first bytes of a file to determine the encoding. If one is found, it is used instead of the supplied encoding. The operation doesn't cause extra IO, because the method checks the class's internal buffer.
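If you would rather detect a bad decode explicitly than scan the result for U+FFFD afterwards, one option (a sketch, using the fallback overload of GetEncoding) is to make the decoder throw instead of silently substituting the replacement character:
using System;
using System.IO;
using System.Text;

// A UTF-8 decoder that throws on invalid byte sequences instead of
// substituting U+FFFD
Encoding strictUtf8 = Encoding.GetEncoding("utf-8",
    EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);

string content;
try
{
    content = File.ReadAllText(filePath, strictUtf8);
}
catch (DecoderFallbackException)
{
    // Not valid UTF-8; retry with the system ANSI code page
    content = File.ReadAllText(filePath, Encoding.Default);
}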
EDIT
I just realized you can't actually load the raw file into a .NET string and still have full information about the original file.
The project here uses the MLang API, which does a better job by not loading the file into a .NET string before guessing. There is also a related SO question.
I have a file with URLs, one of which is http://en.wikipedia.org/wiki/São_Paulo. Note the 'ã'. When I read the URLs (in C#) and try to print them, this one appears as http://en.wikipedia.org/wiki/S?o_Paulo.
I tried reading the URLs as following:
List<string> urls = System.IO.File.ReadAllLines(wikiURL_FilePath, Encoding.UTF8).ToList();
Note that I have passed the second argument to read it as UTF-8, but the problem persists. How can I read and store the string in its correct form?
The data you have shown is simply not UTF-8, despite having a UTF-8 BOM; the UTF-8 for São is 53-C3-A3-6F, whereas you have 53-E3-6F, which is... the right Unicode code points for Basic Multilingual Plane data, but incorrectly encoded to disk as UTF-8. You probably need to fix the code that wrote this file, or else agree on what the encoding is (it could be a single-byte code page, but you need to agree which, else everything falls apart).
Likely looking encodings (if we take away the BOM):
utf-7
windows-1252
windows-1254
iso-8859-1
iso-8859-4
iso-8859-9
iso-8859-15
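One way to do this kind of byte-level analysis yourself is to dump the start of the file in hex (a sketch; wikiURL_FilePath is the variable from the question):
using System;
using System.IO;

byte[] raw = File.ReadAllBytes(wikiURL_FilePath);
// e.g. "EF-BB-BF-..." shows a UTF-8 BOM; "53-E3-6F" is "São" written
// as raw code points rather than valid UTF-8
Console.WriteLine(BitConverter.ToString(raw, 0, Math.Min(raw.Length, 32)));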
While parsing certain documents, I get character code 146, which is an ANSI code. When writing the char to a text file, nothing is shown. If I write the char as the Unicode number 8217 instead, the character is displayed fine.
Can anyone give me advice on how to convert the ANSI number 146 to the Unicode number 8217 in C#?
reference: http://www.alanwood.net/demos/ansi.html
Thanks
"ANSI" is really a misnomer - there are many encodings often known as "ANSI". However, if you're sure you need code page 1252, you can use:
Encoding encoding = Encoding.GetEncoding(1252);
using (TextReader reader = new StreamReader(filename, encoding))
{
// Read text and use it
}
or
Encoding encoding = Encoding.GetEncoding(1252);
string text = File.ReadAllText(filename, encoding);
That's for reading a file - writing a file is the same idea. Basically when you're converting from binary (e.g. file contents) to text, use an appropriate Encoding object.
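Applied to the character in the question: decoding byte 146 (0x92) with code page 1252 yields U+2019 (decimal 8217), the right single quotation mark. A quick check:
using System;
using System.Text;

byte[] ansiBytes = { 146 };  // 0x92, the right single quote in Windows-1252
string text = Encoding.GetEncoding(1252).GetString(ansiBytes);
Console.WriteLine((int)text[0]);  // prints 8217, i.e. U+2019 (')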
My recommendation would be to read Joel's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)". There's quite a lot involved in your question, and my experience has been that you'll just struggle against the simple answers if you don't understand these basics. It takes around 15 minutes to read.
I am trying to run a regex to locate degree characters (\u00B0|\u00BA) in addition to the other form of apostrophe (\u00B4). I am reading latitude and longitude DMS coordinates like this one: 12º30'23.256547"S
The problem is with the way I am reading the file as I can manually inject a string like the one below (format is latitude, longitude, description):
const string myTestString = @"12º30'23.256547""S, 12º30'23.256547""W, Somewhere";
and my regex is matching as expected. I can also see the º values, whereas when I use the StreamReader I see a � for all unrecognized characters (the º symbol being one of them).
I've tried:
var sr = new StreamReader(dlg.File.OpenRead(), Encoding.UTF8);
var sr = new StreamReader(dlg.File.OpenRead(), Encoding.Unicode);
var sr = new StreamReader(dlg.File.OpenRead(), Encoding.BigEndianUnicode);
in addition to the default ASCII.
Either way I read the file, I end up with these special characters. Any advice would be greatly appreciated!!
You've tried various encodings... but presumably not the right one. You shouldn't just be guessing at encodings - find out what encoding it's really using, and use that. StreamReader itself is absolutely fine. It can deal with any encoding you give it, but it does have to match the encoding used when writing the file out.
Where does the file come from? What has written it out?
If it was written out with Notepad, it may well be using Encoding.Default, which is the system's default encoding (i.e. it will vary from machine to machine). If at all possible, change whatever is creating the file to use a single standard encoding - personally I'm a big fan of UTF-8.
You need to identify what encoding the file was saved in, and use that when you read it with your streamreader.
If it is created using a regular text editor, I'm guessing the default encoding is either Windows-1252 or ISO-8859-1.
The º symbol is 0xBA in ISO-8859-1 and falls outside the 7-bit ASCII table. I don't know how Encoding.ASCII interprets it.
Otherwise, it might be easier to just make sure to save the file as UTF-8 if you have that possibility.
The reason it works when you define the string in code is that .NET always works with strings in its internal encoding (UTF-16), so what StreamReader does is convert the bytes it reads from the file into that internal encoding, using the encoding you specify when you create the StreamReader.
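To see that conversion step in isolation, decoding the single byte 0xBA with two candidate encodings shows why the choice matters (a sketch):
using System;
using System.Text;

byte[] raw = { 0xBA };
// 0xBA is º in ISO-8859-1 (and Windows-1252), but invalid as a lone byte in UTF-8
Console.WriteLine(Encoding.GetEncoding("ISO-8859-1").GetString(raw)); // º
Console.WriteLine(Encoding.UTF8.GetString(raw));                      // �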
You can open the file being read in an editor like Notepad++ to see its encoding, and change it to UTF-8. Then reading as you are doing,
var sr = new StreamReader(dlg.File.OpenRead(), Encoding.UTF8);
will work. I could read the degree symbol by doing this.
C# question here..
I have a UTF-8 string that is being interpreted by a non-Unicode program in C++. This text, which is displayed improperly but as far as I can tell is intact, is then applied as an output filename.
Anyway, in a C# project, I am trying to open this file with a System.Windows.Forms.OpenFileDialog object. The filenames I am getting from this object's .FileNames[] are in Unicode (UTF-16), but the string has been misinterpreted. For example, if the original string was 0xe3 0x81 0x82, a FileName[].ToCharArray() reveals that it is now 0x00e3 0x0081 0x201a. It might seem like the OpenFileDialog object only padded the bytes, but it did not: the third character it produced is different, and I cannot figure out what happened to that byte.
My question is: Is there any way to treat the filenames highlighted in the OpenFileDialog box as UTF-8?
I don't think it's relevant, but if you need to know, the string is in Japanese..
Thanks,
kreb
UPDATE
First of all, thanks to everyone who's offered their suggestions here, they're very much appreciated.
Now, to answer the suggestions to modify the C++ application to handle the strings properly: it doesn't seem to be feasible. It isn't just one application that is doing this to the strings; there are actually a great number of these applications in my company that I have to work with, and it would take a huge amount of manpower and time that simply isn't available. However, sean e's idea would probably be the best choice if I were to take this route.
@Remy Lebeau: I think you hit the nail right on the head; I will try your proposed solution and report back. :) I guess the caveat with your solution is that the default encoding has to be the same in the C# application environment as in the C++ application environment that created the file, which certainly makes sense, as both would have to use the same code page.
@Jeff Johnson: I'm not pasting the filenames from the C++ app to the C# app. I am calling OpenFileDialog.ShowDialog() and getting OpenFileDialog.FileNames on DialogResult.OK. I did try to use Encoding.UTF8.GetBytes(), but as Remy Lebeau pointed out, it won't work because the original UTF-8 bytes are lost.
@everyone else: Thanks for the ideas. :)
kreb
UPDATE
@Remy Lebeau: Your idea worked perfectly! As long as the environment of the C++ app is the same as the environment of the C# app (same locale for non-Unicode programs), I am able to retrieve the correct text. :)
Now I have more problems... Haha. Is there any way to determine the encoding of a string? The code now works for UTF-8 strings that were mistakenly interpreted as ANSI strings, but screws up UCS-2 strings. I need to be able to determine the encoding and process each accordingly. GetEncoding() doesn't seem to be useful, and neither is StreamReader's CurrentEncoding property (it always says UTF-8).
P.S. Should I open this new question in a new post?
0x201A is the Unicode "single low-9 quotation mark" character. 0x82 is the Windows-1252 ("ANSI") encoding of that character. That means the bytes of the filename are being interpreted as plain ANSI instead of as UTF-8, and thus being decoded from ANSI to Unicode accordingly. That is not surprising, as the filesystem has no concept of UTF-8, and Windows assumes non-Unicode filenames use the OS's default ANSI encoding.
To do what you are looking for, you need access to the original UTF-8 encoded bytes so you can decode them properly. One thing you can try is to pass the FileName to the GetBytes() method of System.Text.Encoding.Default (in theory, that is the same encoding that was used to decode the filename, so it should produce the same bytes as the original), and then pass the resulting bytes to the GetString() method of System.Text.Encoding.UTF8.
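A sketch of that round trip, with fileName standing in for one entry of OpenFileDialog.FileNames:
using System.Text;

// Re-create the raw bytes the dialog decoded with the system ANSI code page...
byte[] rawBytes = Encoding.Default.GetBytes(fileName);
// ...then decode those bytes as the UTF-8 they originally were
string utf8Name = Encoding.UTF8.GetString(rawBytes);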
I think your problem is at the beginning:
I have a UTF-8 string that is being interpreted by a non-Unicode program in C++. This text, which is displayed improperly but as far as I can tell is intact, is then applied as an output filename.
If you load a UTF-8 string with a non-Unicode program and then serialize it, it will contain non-Unicode chars.
Is there any way that your C++ program can handle Unicode?
Can you use members of the System.Text namespace (e.g., the UTF8Encoding class) to convert the .NET Framework's internal string representation to/from a byte array containing the text in the encoding of your choice?
If you are sure that the C++ output is fine, then in your C# app you should convert it from UTF-8 to UTF-16 using the .NET encoding class and just work with it in the Windows native format.
If you can modify the C++ app, that might be better: give the C# app input that doesn't need to be re-encoded. In it, the UTF-8 to Unicode translation can be handled via MultiByteToWideChar, using CP_UTF8 for the CodePage parameter; it only works when no flags are set for dwFlags (specify 0). The whole app doesn't need to be Unicode; even though it is not compiled as Unicode, you can make selective use of Unicode APIs.
In answer to your question "is there a way to treat the filenames as utf-8?" Try this code:
List<byte[]> utf8FileNames = new List<byte[]>();
foreach (string fileName in openFileDialog1.FileNames)
{
    utf8FileNames.Add(Encoding.UTF8.GetBytes(fileName));
}
// Each byte array in utf8FileNames is a sequence of utf-8 bytes matching each file name chosen
What do you do with the file names once you have got them from the open file dialog? Can you post that code?