Encoding.GetEncoding ucs 2 little endian - c#

I'm trying to find a string or (not so ideally) int representation of UCS-2 little endian to input into Encoding.GetEncoding().
I am looking for this information because I'm using StreamReader to read a content of a file and I want to use Encoding.GetEncoding to construct it.
The reason for that is I'm reading several different files with variable encodings and I need to be able to specify in configuration which encoding to use for what file.

UCS-2 can be considered a subset of UTF-16, and thus any UTF-16 capable decoder should also be able to handle UCS-2; the difference is that UCS-2 doesn't cover the entire range of unicode, and thus there are some additional values that can be expressed in UTF-16 but not in UCS-2. We just shouldn't expect to see those values here, if the file was written by an encoder that only knows UCS-2.
This is pretty much the same as saying that you can use a UTF-8 decoder to read data that was written in pure ASCII (where by "pure" here, I mean 7-bit ASCII, not the extended code-pages that use the 8th bit).
As such, any of:
Encoding direct = Encoding.Unicode;
Encoding byCode = Encoding.GetEncoding(1200);
Encoding byName = Encoding.GetEncoding("Unicode");
should work fine here.

Related

How to read and store string in UTF-8 format in C#?

I have a file with URLs, one of which is http://en.wikipedia.org/wiki/São_Paulo. Note that 'ã'. When I read the URLs (in C#) and try to print it, it appears as http://en.wikipedia.org/wiki/S?o_Paulo.
I tried reading the URLs as following:
List<string> urls = System.IO.File.ReadAllLines(wikiURL_FilePath, Encoding.UTF8).ToList();
Note that I have passed second argument to read it in UTF8 format, but still the problem is not rectified. How can I read and store the string in correct form?
The data you have shown is simply not UTF-8, despite having a UTF-8 BOM; the UTF-8 for São is 53-C3-A3-6F; you have 53-E3-6F, which is... the right unicode code-points for basic multi-lingual plane data, but incorrectly encoded to disk as UTF-8. You probably need to fix the code that wrote this file, or: agree on what the encoding is (it could be a single-byte code-page, but you need to agree which, else everything falls apart).
Likely looking encodings (if we take away the BOM):
utf-7
windows-1252
windows-1254
iso-8859-1
iso-8859-4
iso-8859-9
iso-8859-15

ansi to unicode conversion

While parsing certain documents, I get the character code 146, which is actually an ANSI number. While writing the char to text file, nothing is shown. If we write the char as Unicode number- 8217, the character is displayed fine.
Can anyone give me advice on how to convert the ANSI number 146 to Unicode 8217 in C#.
reference: http://www.alanwood.net/demos/ansi.html
Thanks
"ANSI" is really a misnomer - there are many encodings often known as "ANSI". However, if you're sure you need code page 1252, you can use:
Encoding encoding = Encoding.GetEncoding(1252);
using (TextReader reader = File.OpenText(filename, encoding))
{
// Read text and use it
}
or
Encoding encoding = Encoding.GetEncoding(1252);
string text = File.ReadAllText(filename, encoding);
That's for reading a file - writing a file is the same idea. Basically when you're converting from binary (e.g. file contents) to text, use an appropriate Encoding object.
My recommendation would be to read Joel's "Absolute Minimum Every Software Developer Must Know About Unicode and Character Sets. There's quite a lot involved in your question and my experience has been that you'll just struggle against the simple answers if you don't understand these basics. It takes around 15 minutes to read.

c#, Textbox.Text encoding

I have code:
string text = sampleTextBox.Text;
and I'm wondering in what encoding text is? Is it utf16 (as it is string) or maybe it is my operating system encoding?
It's all Unicode, basically - there's no conversion between the .NET textual types (char/string) and binary going on, so there's no encoding to worry about.
You potentially need to worry about surrogate pairs to get from the UTF-16 textual representation of char and string to full UTF-32, but that's slightly different to the normal encoding issues.
Philosophically, a textbox contains text, not binary data. You should only be thinking about encodings when there's a conversion to a binary format - such as a file.
String variables in .Net are UTF-16 internally. Encoding comes into play when you want to output the string outside your program: a file, a webpage, or over the network in some fashion.

Force C# to use ASCII

I'm working on an application in C# and need to read and write from a particular datafile format. The only issue at the moment is that the format uses strictly single byte characters, and C# keeps trying to throw in Unicode when I use a writer and a char array (which doubles filesize, among other serious issues). I've been working on modifying the code to use byte arrays instead, but that causes a few complaints when feeding them into a tree view and datagrid controls, and it involves conversions and whatnot.
I've spent a little time googling, and there doesn't seem to be a simple typedef I can use to force the char type to use byte for my program, at least not without causing extra complications.
Is there a simple way to force a C# .NET program to use ASCII-only and not touch Unicode?
Later, I got this almost working. Using the ASCIIEncoding on the BinaryReader/Writers ended up fixing most of the problems (a few issues with an extra character being prepended to strings occurred, but I fixed that up). I'm having one last issue, which is very small but could be big: In the file, a particular character (prints as the Euro sign) gets converted to a ? when I load/save the files. That's not an issue in texts much, but if it occurred in a record length, it could change the size by kilobytes (not good, obviously). I think it's caused by the encoding, but if it came from the file, why won't it go back?
The precise problem/results are such:
Original file: 0x80 (euro)
Encodings:
** ASCII: 0x3F (?)
** UTF8: 0xC280 (A-hat euro)
Neither of those results will work, since anywhere in the file, it can change (if an 80 changed to 3F in a record length int, it could be a difference of 65*(256^3)). Not good. I tried using a UTF-8 encoding, figuring that would fix the issue pretty well, but it's now adding that second character, which is even worse.
C# (.NET) will always use Unicode for strings. This is by design.
When you read or write to your file, you can, however, use a StreamReader/StreamWriter set to force ASCII Encoding, like so:
StreamReader reader = new StreamReader (fileStream, new ASCIIEncoding());
Then just read using StreamReader.
Writing is the same, just use StreamWriter.
Interally strings in .NET are always Unicode, but that really shouldn't be of much interest to you. If you have a particular format that you need to adhere to, then the route you went down (reading it as bytes) was correct. You simply need to use the System.Encoding.ASCII class to do your conversions from string->byte[] and byte[]->string.
If you have a file format that mixes text in single-byte characters with binary values such as lengths, control characters, a good encoding to use is code page 28591 aka Latin1 aka ISO-8859-1.
You can get this encoding by using whichever of the following is the most readable:
Encoding.GetEncoding(28591)
Encoding.GetEncoding("Latin1")
Encoding.GetEncoding("ISO-8859-1")
This encoding has the useful characteristic that byte values up to 255 are converted to unchanged to the unicode character with the same value (e.g. the byte 0x80 becomes the character 0x0080).
In your scenario, this may be more useful than the ASCII encoding (which converts values in the range 0x80 to 0xFF to '?') or any of the other usual encodings, which will also convert some of the characters in this range.
If you want this in .NET, you could use F# to make a library supporting this. F# supports ASCII strings, with a byte array as the underlying type, see Literals (F#) (MSDN):
let asciiString = "This is a string"B

How to read an ANSI encoded file containing special characters

I'm writing a TFS Checkin policy, which checks if our source files containing our file header.
My problem is, that our file header contains a special character "©" and unfortunately some of our source files are encoded in ANSI.
So if I read these files in the policy, the string looks like this "Copyright � 2009".
string content = File.ReadAllText(pendingChange.LocalItem);
I tired to change the encoding of the string, but it does not help. So how can I read these files, that I get the correct string "Copyright © 2009"?
Use Encoding.Default:
string content = File.ReadAllText(pendingChange.LocalItem, Encoding.Default);
You should be aware, however, that that reads it using the system default encoding - which may not be the same as the encoding of the file. There's no single encoding called ANSI, but usually when people talk about "the ANSI encoding" they mean Windows Code Page 1252 or whatever their box happens to use.
Your code will be more robust if you can find out the exact encoding used.
It would seem sensible if you going to have such policies that you would also have team agreed standard encoding. To be honest, I can't see why any team would use an encoding other than "Unicode (UtF-8 with signature) - Codepage 65001" (except perhaps for ASPX pages with significant non-latin static content but even then I can't see how it would be a big deal to use UTF-8).
Assuming you still want to allow mixed encodings then you next need a way to determine which encoding a file was save in so you know which encoding to pass to ReadAllText. Its not easy to determine this from the file however using Encoding.Default is likely to work ok. Since its most likely you have just 2 encodings to deal with, the VS (UTF-8 with signature) and a common ANSI encoding used by you machines (probably Windows-1252).
Hence using
string content = File.ReadAllText(pendingChange.LocalItem, Encoding.Default);
will work. (As I see Jon has already posted). This works because when the UTF-8 BOM (which is what VS means by the term "signature") is present at the start of the file the supplied encoding parameter is ignored and UTF-8 is used anyway. Hence where the file is saved using UTF-8 you get correct results and where ANSI is used you are most likely also to get correct results.
BTW if you are processing file headers wouldn't ReadAllLines make things easier?.
I know this is an old question but I ran into a similar situation and found the accepted answer to be cutting some corners (no disregard for Jon Skeet's pragmatic short answer, but I'll flesh it out a little more)...
The specs state that the header will contain the encoding directly after {\rtf:
\ansi ANSI (the default)
\mac Apple Macintosh
\pc IBM PC code page 437
\pca IBM PC code page 850, used by IBM Personal System/2 (not implemented in version 1 of Microsoft Word for OS/2)
According to Wikipedia the "ANSI character set has no well-defined meaning"
For the default ANSI you have the choice of these partially incompatible encodings:
using System.Text;
...
string content = File.ReadAllText(filename, Encoding.GetEncoding("ISO-8859-1"));
or
string content = File.ReadAllText(filename, Encoding.GetEncoding("Windows-1252"));
Using WordPad on windows 10 to save a file with a euro sign (0x80 in Windows-1252 but 0xA4 in ISO-8859-1) revealed the following:
The header stated the exact encoding after \ansi
{\rtf1\ansi\ansicpg1252\deff0\nouicompat\deflang1043{ ...
And the encoding was not directly used, instead it was wrapped in RTF encoding: \'80
according to the specs:
\'hh : A hexadecimal value, based on the specified character set (may
be used to identify 8-bit values).
I guess the best thing to do is read the header, if the file starts with {\rtf1\ansi\ansicpg1252 then go for Windows-1252.
But to make things more complicated, the specs also state that there can be mixed encodings... search for '\upr'...
I guess there is no definitive answer, the easiest way to go in your case may be to search (in the un-decoded raw byte array) for all the variations of the encoded copyright signs that you may encounter in your source base.
In my case I finally decided to cut a few corners as well, but add a small percentage of defensive coding. All files I have seen so far were Windows-1252 so I common-case-optimised for that.
Encoding encoding = Encoding.GetEncoding("Windows-1252", EncoderFallback.ReplacementFallback, DecoderFallback.ReplacementFallback);
using (System.IO.StreamReader reader = new System.IO.StreamReader(filename, encoding)) {
string header= reader.ReadLine();
if (!header.Contains("cpg1252")) {
if(header.Contains("\\pca"))
encoding = Encoding.GetEncoding(850, EncoderFallback.ReplacementFallback, DecoderFallback.ReplacementFallback);
else if (header.Contains("\\pc"))
encoding = Encoding.GetEncoding(437, EncoderFallback.ReplacementFallback, DecoderFallback.ReplacementFallback);
else
encoding = Encoding.GetEncoding("ISO-8859-1", EncoderFallback.ReplacementFallback, DecoderFallback.ReplacementFallback);
}
}
string content = System.IO.File.ReadAllText(filename, encoding);

Categories

Resources