I have an ASP.NET C# page and am trying to read a file that contains the following character ’ and convert it to '. (From slanted apostrophe to straight apostrophe).
FileInfo fileinfo = new FileInfo(FileLocation);
string content = File.ReadAllText(fileinfo.FullName);
//strip out bad characters
content = content.Replace("’", "'");
This doesn't work and it changes the slanted apostrophes into ? marks.
I suspect that the problem is not with the replacement, but rather with the reading of the file itself. When I tried this the naive way (using Word and copy-paste) I ended up with the same results as you, however examining content showed that the .NET framework believed the character was Unicode character 65533, i.e. the "WTF?" character (the Unicode replacement character), before the string replacement ever ran. You can check this yourself by examining the relevant character in the Visual Studio debugger, where it should show the character code:
content[0]; // 65533 '�'
The reason why the replace isn't working is simple - content doesn't contain the string you gave it:
content.IndexOf("’"); // -1
As for why the file reading isn't working properly - you are probably using the wrong encoding when reading the file. (If no encoding is specified then the .NET framework will try to determine the correct encoding for you; however, there is no 100% reliable way to do this, so it can often get it wrong.) The exact encoding you need depends on the file itself; in my case the encoding being used was extended ASCII, and so to read the file I just needed to specify the correct encoding:
string content = File.ReadAllText(fileinfo.FullName, Encoding.GetEncoding("iso-8859-1"));
You also need to make sure that you specify the correct character in your replacement string. When using "odd" characters in code you may find it more reliable to specify the character by its character code rather than as a string literal (which may cause problems if the encoding of the source file changes). For example, the following worked for me:
content = content.Replace("\u0092", "'");
My bet is the file is encoded in Windows-1252. This is almost the same as ISO 8859-1; the difference is that Windows-1252 uses "displayable characters rather than control characters in the 0x80 to 0x9F range" (which is where the slanted apostrophe is located, i.e. 0x92).
//Specify Windows-1252 here
string content = File.ReadAllText(fileinfo.FullName, Encoding.GetEncoding(1252));
//Your replace code will then work as is
content = content.Replace("’", "'");
// This should replace smart single quotes with a straight single quote
// (requires using System.Text.RegularExpressions)
content = Regex.Replace(content, @"(\u2018|\u2019)", "'");
//However the better approach seems to be to read the page with the proper encoding and leave the quotes alone
//(note: fileInfo.OpenRead(), not fileInfo.Create(), which would truncate the file)
var sreader = new StreamReader(fileInfo.OpenRead(), Encoding.GetEncoding(1252));
Note that String (capitalized) is just an alias for string - they are the same UTF-16 type, so switching between them makes no difference to how Unicode is handled. The encoding used to read the file is what matters.
Related
I have some string that contains the following code/value:
"You won\u0092t find a ...."
It looks like that string contains the Right Apostrophe special character.
ref1: Unicode control 0092
ref2: ASCII chart (both 127 + extra extended ascii)
I'm not sure how to display this in the web browser. It keeps displaying the tofu square-box character instead. I'm under the impression that the Unicode (hex) value 0092 can be converted to an HTML entity.
Is my understanding correct?
Update 1:
It was suggested by @Sam-Axe that I HtmlEncode the unicode. That didn't work. Here it is...
Note the ampersand got correctly encoded....
It looks like there's an encoding mix-up. In .NET, strings are normally encoded as UTF-16, and a right apostrophe should be represented as \u2019. But in your example, the right apostrophe is represented as \x92, which suggests the original encoding was Windows code page 1252. If you include your string in a Unicode document, the character \x92 won't be interpreted properly.
You can fix the problem by re-encoding your string as UTF-16. To do so, treat the string as an array of bytes, and then convert the bytes back to Unicode using the 1252 code page:
string title = "You won\u0092t find a cheaper apartment * Sauna & Spa";
byte[] bytes = title.Select(c => (byte)c).ToArray();
title = Encoding.GetEncoding(1252).GetString(bytes);
// Result: "You won’t find a cheaper apartment * Sauna & Spa"
Note: much of my answer is based on guessing and looking at the decompiled code of System.Web 4.0. The reference source looks very similar (identical?).
You're correct that "&#146;" (6 characters) can be displayed in the browser. Your output string, however, contains "\u0092" (1 character). This is a control character, not an HTML entity.
According to the reference code, WebUtility.HtmlEncode() doesn't transform characters between 128 and 160 - all characters in this range are control characters (ampersand is special-cased in the code as are a few other special HTML symbols).
My guess is that because these are control characters, they're output without transformation, since transforming them would change the meaning of the string. (I tried running some examples using LINQPad; this character was not rendered.)
If you really want to transform these characters (or remove them), you'll probably have to write your own function before/after calling HtmlEncode() - there may be something that does this already but I don't know of any.
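For example, here is a minimal sketch of such a function (the helper name is my own, not from any library) that strips the C1 control range before encoding:
using System.Linq;
using System.Net;

// Hypothetical helper: remove C1 control characters (U+0080-U+009F), which
// WebUtility.HtmlEncode passes through untouched, then encode the rest.
static string HtmlEncodeWithoutControlChars(string input)
{
    var cleaned = new string(input.Where(c => c < '\u0080' || c > '\u009F').ToArray());
    return WebUtility.HtmlEncode(cleaned);
}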
Hope this helps.
Edit: Michael Liu's answer seems correct. I'm leaving my answer here because it may be useful in cases when the input encoding of a string is not known.
I'm working on a c# project in which some data contains characters which are not recognised by the encoding.
They are displayed like that:
"Some text � with special � symbols in it".
I have no control over the encoding process; the data also comes from files of various origins and formats.
I want to be able to flag data that contains such characters as erroneous or incomplete. Right now I am able to detect them this way:
if(myString.Contains("�"))
{
//Do stuff
}
While it does work, it doesn't feel quite right to use the weird symbol directly in the Contains call. Isn't there a cleaner way to do this?
EDIT:
After checking back with the team responsible for reading the files, this is how they do it:
var sr = new StreamReader(filePath, true);
var content = sr.ReadToEnd();
Passing true as a second parameter of StreamReader is supposed to detect the encoding from the file's BOM, and use it to read the content. It doesn't always work though, as some files don't bear that information, hence why their data is read incorrectly.
We've made some tests and using StreamReader(filePath, Encoding.Default) instead appears to work for most if not all files we had issues with. Expectedly, files that were working before no longer work, because they do not use the default encoding.
So the best solution for us would be to do the following: read the file trying to detect its encoding, then if it wasn't successful read it again with the default encoding.
The problem remains the same though: how do we check, after trying to detect the file's encoding, if data has been read incorrectly ?
The � character is not a special symbol. It's the Unicode Replacement Character. This means that the file's bytes were decoded using the wrong codepage. Any byte sequences that didn't have a match in the codepage were replaced with �.
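As a cleaner alternative to pasting the glyph into Contains (my suggestion, not part of the original answer), you can reference the character by its escape sequence:
// U+FFFD is the Unicode Replacement Character emitted for undecodable bytes
if (myString.Contains("\uFFFD"))
{
    //Do stuff
}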
The solution is to read the file using the correct encoding. The default encoding used by the File methods or StreamReader is UTF-8. You can pass a different encoding using the appropriate constructor, e.g. StreamReader(Stream, Encoding, Boolean). To use the system locale's codepage, you need to use Encoding.Default:
var sr = new StreamReader(filePath,Encoding.Default);
You can use the StreamReader(Stream, Encoding, Boolean) constructor to autodetect Unicode encodings from the BOM and fallback to a different encoding.
Assuming the files are either some type of Unicode or match your system locale, you can use:
var sr = new StreamReader(filePath,Encoding.Default, true);
StreamReader's source shows that the DetectEncoding method checks the first bytes of a file to determine the encoding. If one is found, it is used instead of the supplied encoding. The operation doesn't cause extra IO because the method checks the class's internal buffer.
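If you also want to flag files where detection failed (the remaining question in the edit above), one option - a sketch of mine, assuming a strict UTF-8 first pass is acceptable - is to decode strictly and fall back on failure:
using System.IO;
using System.Text;

// First pass: strict UTF-8 that throws on invalid bytes instead of emitting �.
// Second pass: reread with the system default (ANSI) encoding and flag the file.
static string ReadWithFallback(string filePath, out bool usedFallback)
{
    var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                                      throwOnInvalidBytes: true);
    try
    {
        usedFallback = false;
        return File.ReadAllText(filePath, strictUtf8);
    }
    catch (DecoderFallbackException)
    {
        usedFallback = true;
        return File.ReadAllText(filePath, Encoding.Default);
    }
}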
EDIT
I just realized you can't actually load the raw file into a .NET string and still be able to have full information about the original file.
The project here uses the MLang API, which does a better job because it doesn't load the file into a .NET string before guessing. There is also a related SO question.
I'm reading some CSV files. The files are really easy, because there is always just ";" as the separator and there are no ", ', or anything like that.
So it's possible to read the file line by line and separate the strings. That's working fine. Now people told me: maybe you should check the encoding of the file, it should always be ANSI; if it's not, your output may be different and corrupted. So non-ANSI files should be marked somehow.
I just said, okay! But if I think about it: do I really have to check the file for encoding in this case? I just changed the encoding of the file to something else and I'm still able to read the file without any problems. My code is simple:
using (TextReader reader = new StreamReader(myFileStream))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        //read the line, separate by ; and other stuff...
    }
}
So again: do I really need to check the files for ANSI encoding? Could somebody give me an example of when I could get in trouble, or when I would get corrupted output after reading a non-ANSI file? Thank you!
That particular constructor of StreamReader will assume that the data is UTF-8; that is compatible with ASCII, but can fail if data uses bytes in the 128-255 range for single-byte codepages (you'll get the wrong characters in strings, etc), or could fail completely (i.e. throw an exception) if the data is actually something very different like UTF-7, UTF-32, etc.
In some cases (the minority) you might be able to use the byte-order-mark to detect the encoding, but this is a circular problem: in most cases, if you don't already know the encoding, you can't really detect the encoding (robustly). So a better approach would be: to know the encoding in the first place. Then you can pass in the correct encoding to use via one of the other constructors.
Here's an example of it failing:
// we'll write UTF-32, big-endian, without a byte-order-mark
File.WriteAllText("my.txt", "Hello world", new UTF32Encoding(true, false));
using (var reader = new StreamReader("my.txt"))
{
    string s = reader.ReadLine();
}
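For contrast, a sketch of the same read with the matching encoding passed in explicitly:
// pass the same encoding that was used to write the file
using (var reader = new StreamReader("my.txt", new UTF32Encoding(true, false)))
{
    string s = reader.ReadLine(); // "Hello world"
}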
You can read under UTF-8 encoding, because UTF-8 has a wonderful property: it supports ASCII characters with a single byte (as you would expect), and expands to multiple bytes only when it needs to represent other Unicode characters.
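A quick illustration of that property (my own example):
using System;
using System.Text;

// ASCII stays single-byte in UTF-8; other characters take more bytes
Console.WriteLine(Encoding.UTF8.GetByteCount("A")); // 1
Console.WriteLine(Encoding.UTF8.GetByteCount("é")); // 2
Console.WriteLine(Encoding.UTF8.GetByteCount("€")); // 3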
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
I am trying to run a regex to locate degree characters (\u00B0|\u00BA, in addition to locating the other form of ' --> \u00B4). I am reading latitude and longitude DMS coordinates like this one: 12º30'23.256547"S
The problem is with the way I am reading the file as I can manually inject a string like the one below (format is latitude, longitude, description):
const string myTestString = @"12º30'23.256547""S, 12º30'23.256547""W, Somewhere";
and my regex is matching as expected - I can also see the º values there. Whereas when I am using the StreamReader, I see a � for all unrecognized characters (the º symbol being included as one of those unrecognized characters).
I've tried:
var sr = new StreamReader(dlg.File.OpenRead(), Encoding.UTF8);
var sr = new StreamReader(dlg.File.OpenRead(), Encoding.Unicode);
var sr = new StreamReader(dlg.File.OpenRead(), Encoding.BigEndianUnicode);
in addition to the default ASCII.
Either way I read the file, I end up with these special characters. Any advice would be greatly appreciated!!
You've tried various encodings... but presumably not the right one. You shouldn't just be guessing at encodings - find out what encoding it's really using, and use that. StreamReader itself is absolutely fine. It can deal with any encoding you give it, but it does have to match the encoding used when writing the file out.
Where does the file come from? What has written it out?
If it was written out with Notepad, it may well be using Encoding.Default, which is the system's default encoding (i.e. it will vary from machine to machine). If at all possible, change whatever is creating the file to use a single standard encoding - personally I'm a big fan of UTF-8.
You need to identify what encoding the file was saved in, and use that when you read it with your streamreader.
If it was created using a regular text editor, I'm guessing the default encoding is either Windows-1252 or ISO-8859-1.
The º symbol is 0xBA in ISO-8859-1 and falls outside the 7-bit ASCII table. I don't know how Encoding.ASCII interprets it.
Otherwise, it might be easier to just make sure to save the file as UTF-8 if you have that possibility.
The reason it works when you define the string in code is that .NET always works with strings in its internal encoding (UTF-16), so what StreamReader does is convert the bytes it reads from the file into that internal encoding, using the encoding that you specify when you create the StreamReader.
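A small sketch of that conversion (my own example bytes):
using System.Text;

// The same raw bytes decode differently depending on the encoding:
// 0x31 0x32 0xBA is "12º" in ISO-8859-1, but 0xBA alone is not valid UTF-8
byte[] raw = { 0x31, 0x32, 0xBA };
string latin1 = Encoding.GetEncoding("ISO-8859-1").GetString(raw); // "12º"
string utf8 = Encoding.UTF8.GetString(raw);                        // "12�"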
You can open the file being read in an editor like Notepad++ to see the encoding of the file, and change it to UTF-8. Then reading as you are doing:
var sr = new StreamReader(dlg.File.OpenRead(), Encoding.UTF8);
will work. I could read the degree symbol by doing this.
I'm writing a TFS checkin policy, which checks whether our source files contain our file header.
My problem is, that our file header contains a special character "©" and unfortunately some of our source files are encoded in ANSI.
So if I read these files in the policy, the string looks like this "Copyright � 2009".
string content = File.ReadAllText(pendingChange.LocalItem);
I tried to change the encoding of the string, but it does not help. So how can I read these files so that I get the correct string "Copyright © 2009"?
Use Encoding.Default:
string content = File.ReadAllText(pendingChange.LocalItem, Encoding.Default);
You should be aware, however, that that reads it using the system default encoding - which may not be the same as the encoding of the file. There's no single encoding called ANSI, but usually when people talk about "the ANSI encoding" they mean Windows Code Page 1252 or whatever their box happens to use.
Your code will be more robust if you can find out the exact encoding used.
It would seem sensible, if you're going to have such policies, that you would also have a team-agreed standard encoding. To be honest, I can't see why any team would use an encoding other than "Unicode (UTF-8 with signature) - Codepage 65001" (except perhaps for ASPX pages with significant non-Latin static content, but even then I can't see how it would be a big deal to use UTF-8).
Assuming you still want to allow mixed encodings, you next need a way to determine which encoding a file was saved in, so you know which encoding to pass to ReadAllText. It's not easy to determine this from the file; however, using Encoding.Default is likely to work OK, since most likely you have just two encodings to deal with: the VS one (UTF-8 with signature) and a common ANSI encoding used by your machines (probably Windows-1252).
Hence using
string content = File.ReadAllText(pendingChange.LocalItem, Encoding.Default);
will work. (As I see Jon has already posted.) This works because when the UTF-8 BOM (which is what VS means by the term "signature") is present at the start of the file, the supplied encoding parameter is ignored and UTF-8 is used anyway. Hence where the file is saved using UTF-8 you get correct results, and where ANSI is used you are most likely also to get correct results.
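A sketch demonstrating that behaviour (the file name and contents are my own example):
using System.IO;
using System.Text;

// Write "©" as UTF-8 *with* a BOM, then read it back while claiming the
// file is Windows-1252. The BOM wins and the text round-trips intact.
File.WriteAllText("header.cs", "// Copyright © 2009", new UTF8Encoding(true));
string text = File.ReadAllText("header.cs", Encoding.GetEncoding(1252));
// text == "// Copyright © 2009"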
BTW if you are processing file headers, wouldn't ReadAllLines make things easier?
I know this is an old question but I ran into a similar situation and found the accepted answer to be cutting some corners (no disrespect to Jon Skeet's pragmatic short answer, but I'll flesh it out a little more)...
The specs state that the header will contain the encoding directly after {\rtf:
\ansi ANSI (the default)
\mac Apple Macintosh
\pc IBM PC code page 437
\pca IBM PC code page 850, used by IBM Personal System/2 (not implemented in version 1 of Microsoft Word for OS/2)
According to Wikipedia the "ANSI character set has no well-defined meaning"
For the default ANSI you have the choice of these partially incompatible encodings:
using System.Text;
...
string content = File.ReadAllText(filename, Encoding.GetEncoding("ISO-8859-1"));
or
string content = File.ReadAllText(filename, Encoding.GetEncoding("Windows-1252"));
Using WordPad on Windows 10 to save a file with a euro sign (0x80 in Windows-1252; the euro does not exist in ISO-8859-1, though it is 0xA4 in ISO-8859-15) revealed the following:
The header stated the exact encoding after \ansi
{\rtf1\ansi\ansicpg1252\deff0\nouicompat\deflang1043{ ...
And the encoding was not directly used, instead it was wrapped in RTF encoding: \'80
according to the specs:
\'hh : A hexadecimal value, based on the specified character set (may be used to identify 8-bit values).
I guess the best thing to do is read the header, if the file starts with {\rtf1\ansi\ansicpg1252 then go for Windows-1252.
But to make things more complicated, the specs also state that there can be mixed encodings... search for '\upr'...
I guess there is no definitive answer, the easiest way to go in your case may be to search (in the un-decoded raw byte array) for all the variations of the encoded copyright signs that you may encounter in your source base.
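A sketch of that raw-byte search (my own example; filename is assumed from the surrounding code):
using System.IO;
using System.Linq;

byte[] raw = File.ReadAllBytes(filename);
// 0xA9 is © in Windows-1252/ISO-8859-1, and it is also the trailing byte of
// the UTF-8 sequence 0xC2 0xA9, so this one check covers both encodings
bool hasCopyright = raw.Contains((byte)0xA9);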
In my case I finally decided to cut a few corners as well, but add a small percentage of defensive coding. All files I have seen so far were Windows-1252 so I common-case-optimised for that.
Encoding encoding = Encoding.GetEncoding("Windows-1252", EncoderFallback.ReplacementFallback, DecoderFallback.ReplacementFallback);
using (System.IO.StreamReader reader = new System.IO.StreamReader(filename, encoding)) {
    string header = reader.ReadLine();
    if (!header.Contains("cpg1252")) {
        if (header.Contains("\\pca"))
            encoding = Encoding.GetEncoding(850, EncoderFallback.ReplacementFallback, DecoderFallback.ReplacementFallback);
        else if (header.Contains("\\pc"))
            encoding = Encoding.GetEncoding(437, EncoderFallback.ReplacementFallback, DecoderFallback.ReplacementFallback);
        else
            encoding = Encoding.GetEncoding("ISO-8859-1", EncoderFallback.ReplacementFallback, DecoderFallback.ReplacementFallback);
    }
}
string content = System.IO.File.ReadAllText(filename, encoding);