c#, Textbox.Text encoding

I have code:
string text = sampleTextBox.Text;
and I'm wondering what encoding text is in. Is it UTF-16 (as it is a string), or is it my operating system's encoding?

It's all Unicode, basically - there's no conversion between the .NET textual types (char/string) and binary going on, so there's no encoding to worry about.
You potentially need to worry about surrogate pairs to get from the UTF-16 textual representation of char and string to full UTF-32, but that's slightly different to the normal encoding issues.
Philosophically, a textbox contains text, not binary data. You should only be thinking about encodings when there's a conversion to a binary format - such as a file.
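To make the surrogate-pair point concrete, here is a small sketch. The string literal is a hypothetical stand-in for sampleTextBox.Text, and the code is written as a top-level program, which assumes C# 9 or later.

```csharp
using System;

// Hypothetical stand-in for sampleTextBox.Text: U+00E9 (é) followed by U+1F600,
// a character outside the Basic Multilingual Plane.
string text = "\u00E9\U0001F600";

// Length counts UTF-16 code units, so U+1F600 contributes two (a surrogate pair).
int codeUnits = text.Length;

// Count full code points by stepping over the second half of each surrogate pair.
int codePoints = 0;
for (int i = 0; i < text.Length; i++)
{
    if (char.IsSurrogatePair(text, i))
    {
        i++; // skip the low surrogate
    }
    codePoints++;
}

// ConvertToUtf32 combines the pair starting at index 1 into one UTF-32 value.
int secondCodePoint = char.ConvertToUtf32(text, 1);

Console.WriteLine($"{codeUnits} code units, {codePoints} code points, U+{secondCodePoint:X}");
```

No Encoding object appears anywhere here, because nothing is being converted to or from bytes.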

String variables in .NET are UTF-16 internally. Encoding comes into play when you want to output the string outside your program: a file, a webpage, or over the network in some fashion.
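As a rough illustration, the same string produces different bytes depending on which encoding you pick at the moment of output. The sample text is made up, and the code assumes C# 9 or later for top-level statements.

```csharp
using System;
using System.Text;

// Made-up sample text; in the question this would come from the textbox.
string text = "S\u00E3o Paulo";

// The encoding is chosen only when the text is turned into bytes.
byte[] asUtf8 = Encoding.UTF8.GetBytes(text);      // ã takes two bytes
byte[] asUtf16 = Encoding.Unicode.GetBytes(text);  // every code unit takes two bytes

// Decoding with the matching encoding gives the original string back.
string roundTrip = Encoding.UTF8.GetString(asUtf8);

Console.WriteLine($"UTF-8: {asUtf8.Length} bytes, UTF-16LE: {asUtf16.Length} bytes");
Console.WriteLine(roundTrip == text);
```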

Related

Encoding.GetEncoding ucs 2 little endian

I'm trying to find a string or (not so ideally) int representation of UCS-2 little endian to input into Encoding.GetEncoding().
I am looking for this information because I'm using StreamReader to read the contents of a file and I want to use Encoding.GetEncoding to construct it.
The reason for that is I'm reading several different files with variable encodings and I need to be able to specify in configuration which encoding to use for what file.

UCS-2 can be considered a subset of UTF-16, and thus any UTF-16 capable decoder should also be able to handle UCS-2; the difference is that UCS-2 doesn't cover the entire range of unicode, and thus there are some additional values that can be expressed in UTF-16 but not in UCS-2. We just shouldn't expect to see those values here, if the file was written by an encoder that only knows UCS-2.
This is pretty much the same as saying that you can use a UTF-8 decoder to read data that was written in pure ASCII (where by "pure" here, I mean 7-bit ASCII, not the extended code-pages that use the 8th bit).
As such, any of:
Encoding direct = Encoding.Unicode;
Encoding byCode = Encoding.GetEncoding(1200);
Encoding byName = Encoding.GetEncoding("Unicode");
should work fine here.
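For the configuration scenario in the question, a sketch might look like the following. The encodingName variable is a hypothetical config value, and a MemoryStream stands in for the real file so the example is self-contained (C# 9 or later assumed for top-level statements).

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical value read from configuration for one particular file.
string encodingName = "Unicode";
Encoding configured = Encoding.GetEncoding(encodingName);

// Stand-in for the file contents: "Hi" followed by U+00E9, two bytes per character, little endian.
byte[] fileBytes = { 0x48, 0x00, 0x69, 0x00, 0xE9, 0x00 };

string content;
using (var reader = new StreamReader(new MemoryStream(fileBytes), configured))
{
    content = reader.ReadToEnd();
}

Console.WriteLine(content);
Console.WriteLine(configured.CodePage); // same code page as Encoding.Unicode
```

With a real file you would pass the path (or a FileStream) to StreamReader instead of the MemoryStream.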

How to read and store string in UTF-8 format in C#?

I have a file with URLs, one of which is http://en.wikipedia.org/wiki/São_Paulo. Note that 'ã'. When I read the URLs (in C#) and try to print it, it appears as http://en.wikipedia.org/wiki/S?o_Paulo.
I tried reading the URLs as follows:
List<string> urls = System.IO.File.ReadAllLines(wikiURL_FilePath, Encoding.UTF8).ToList();
Note that I have passed second argument to read it in UTF8 format, but still the problem is not rectified. How can I read and store the string in correct form?

The data you have shown is simply not UTF-8, despite having a UTF-8 BOM. The UTF-8 for São is 53-C3-A3-6F; you have 53-E3-6F. The byte E3 matches the Unicode code point of ã (U+00E3), but a lone E3 is not valid UTF-8, so the text was written to disk in some other encoding. You probably need to fix the code that wrote this file, or agree on what the encoding is (it could be a single-byte code page, but you need to agree which, else everything falls apart).
Likely looking encodings (if we take away the BOM):
utf-7
windows-1252
windows-1254
iso-8859-1
iso-8859-4
iso-8859-9
iso-8859-15
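You can check this for yourself with the three bytes quoted above. This is only a sketch: it picks iso-8859-1 from the list as one candidate, not as a claim about what the file really uses, and it assumes C# 9 or later for top-level statements.

```csharp
using System;
using System.Text;

// The bytes quoted above for "São" as found in the file (BOM left out).
byte[] raw = { 0x53, 0xE3, 0x6F };

// Decoding as UTF-8 cannot make sense of the lone E3 and substitutes a replacement character.
string asUtf8 = Encoding.UTF8.GetString(raw);

// Decoding with one of the candidate single-byte encodings recovers the ã.
string asLatin1 = Encoding.GetEncoding("iso-8859-1").GetString(raw);

// For comparison, this is what correct UTF-8 for the same text looks like.
string properUtf8Hex = BitConverter.ToString(Encoding.UTF8.GetBytes("S\u00E3o"));

Console.WriteLine(asUtf8.Contains("\uFFFD")); // replacement character present
Console.WriteLine(asLatin1 == "S\u00E3o");
Console.WriteLine(properUtf8Hex);
```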

String and UTF16

Is the data stored in String object always encoded with UTF16?
I am asking this because my database stores non-English text in a non-Unicode encoding, and I assumed that the data would not be readable because it would be read in the wrong encoding.
Thanks

Internally .NET strings are in UTF-16, yes... but what's important is how the data is transferred between .NET and your database.
So long as the characters can be represented in Unicode, and the driver performs the appropriate conversion, you should be fine. If you're trying to represent text which can't be represented in Unicode, you may well run into some interesting behaviour.

Yes, .NET strings are always encoded in UTF-16. Each char is a 2-byte code unit, so every character takes 2 bytes except those stored as a surrogate pair, which take two chars (4 bytes).

.NET Strings are ALWAYS Unicode. If your database is unicode you are fine, otherwise you will need to convert the text from whatever format it is in to unicode.

The internal storage of characters (and therefore strings) in .NET is done in UTF-16.
You will need to re-encode the string to the encoding used by your database.
See the Encoding class - this is what you can use to convert a string from one encoding to another.
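A minimal sketch of such a conversion with Encoding.Convert follows. It assumes, purely for illustration, that the target encoding is ISO-8859-1; substitute whatever your database actually uses (C# 9 or later assumed for top-level statements).

```csharp
using System;
using System.Text;

// Assumed target encoding, for illustration only.
Encoding target = Encoding.GetEncoding("iso-8859-1");

// The UTF-16 bytes of a sample string, as .NET holds them internally.
byte[] utf16Bytes = Encoding.Unicode.GetBytes("caf\u00E9");

// Encoding.Convert re-encodes a byte array from one encoding to another.
byte[] targetBytes = Encoding.Convert(Encoding.Unicode, target, utf16Bytes);

Console.WriteLine(BitConverter.ToString(utf16Bytes));
Console.WriteLine(BitConverter.ToString(targetBytes));
```

If you already have a string rather than bytes, target.GetBytes(someString) does the same job in one step.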

If you are using ADO.NET with SqlCommand (or another DbCommand type), any required conversion should be handled for you, and you won't need to worry about it.

ansi to unicode conversion

While parsing certain documents, I get the character code 146, which is actually an ANSI number. While writing the char to a text file, nothing is shown. If we write the char as the Unicode number 8217, the character is displayed fine.
Can anyone give me advice on how to convert the ANSI number 146 to Unicode 8217 in C#?
reference: http://www.alanwood.net/demos/ansi.html
Thanks

"ANSI" is really a misnomer - there are many encodings often known as "ANSI". However, if you're sure you need code page 1252, you can use:
Encoding encoding = Encoding.GetEncoding(1252);
using (TextReader reader = new StreamReader(filename, encoding))
{
// Read text and use it
}
or
Encoding encoding = Encoding.GetEncoding(1252);
string text = File.ReadAllText(filename, encoding);
That's for reading a file - writing a file is the same idea. Basically when you're converting from binary (e.g. file contents) to text, use an appropriate Encoding object.
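For the specific value in the question, a sketch of the byte-to-char conversion could look like this. One assumption to flag: on .NET Core and .NET 5 or later, code page 1252 is only available after registering CodePagesEncodingProvider, so the first statement does that; on the classic .NET Framework, code page 1252 is built in and that line can be dropped. Top-level statements assume C# 9 or later.

```csharp
using System;
using System.Text;

// Needed on .NET Core / .NET 5+ so that Encoding.GetEncoding(1252) succeeds.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

Encoding cp1252 = Encoding.GetEncoding(1252);

// Decode the single byte 146 (0x92) as code page 1252.
string decoded = cp1252.GetString(new byte[] { 146 });
int unicodeValue = decoded[0];

// Going the other way: encode the character back to its code page 1252 byte.
byte[] encoded = cp1252.GetBytes("\u2019");

Console.WriteLine(unicodeValue); // the char value as a Unicode number
Console.WriteLine(encoded[0]);
```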

My recommendation would be to read Joel's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)". There's quite a lot involved in your question, and my experience has been that you'll just struggle against the simple answers if you don't understand these basics. It takes around 15 minutes to read.

I'm working on an application in C# and need to read and write from a particular datafile format. The only issue at the moment is that the format uses strictly single byte characters, and C# keeps trying to throw in Unicode when I use a writer and a char array (which doubles filesize, among other serious issues). I've been working on modifying the code to use byte arrays instead, but that causes a few complaints when feeding them into a tree view and datagrid controls, and it involves conversions and whatnot.
I've spent a little time googling, and there doesn't seem to be a simple typedef I can use to force the char type to use byte for my program, at least not without causing extra complications.
Is there a simple way to force a C# .NET program to use ASCII-only and not touch Unicode?
Later, I got this almost working. Using the ASCIIEncoding on the BinaryReader/Writers ended up fixing most of the problems (a few issues with an extra character being prepended to strings occurred, but I fixed that up). I'm having one last issue, which is very small but could be big: In the file, a particular character (prints as the Euro sign) gets converted to a ? when I load/save the files. That's not an issue in texts much, but if it occurred in a record length, it could change the size by kilobytes (not good, obviously). I think it's caused by the encoding, but if it came from the file, why won't it go back?
The precise problem/results are such:
Original file: 0x80 (euro)
Encodings:
** ASCII: 0x3F (?)
** UTF8: 0xC280 (A-hat euro)
Neither of those results will work, since anywhere in the file, it can change (if an 80 changed to 3F in a record length int, it could be a difference of 65*(256^3)). Not good. I tried using a UTF-8 encoding, figuring that would fix the issue pretty well, but it's now adding that second character, which is even worse.

C# (.NET) will always use Unicode for strings. This is by design.
When you read or write to your file, you can, however, use a StreamReader/StreamWriter set to force ASCII Encoding, like so:
StreamReader reader = new StreamReader (fileStream, new ASCIIEncoding());
Then just read using StreamReader.
Writing is the same, just use StreamWriter.

Internally, strings in .NET are always Unicode, but that really shouldn't be of much interest to you. If you have a particular format that you need to adhere to, then the route you went down (reading it as bytes) was correct. You simply need to use System.Text.Encoding.ASCII to do your conversions from string->byte[] and byte[]->string.

If you have a file format that mixes text in single-byte characters with binary values such as lengths or control characters, a good encoding to use is code page 28591 aka Latin1 aka ISO-8859-1.
You can get this encoding by using whichever of the following is the most readable:
Encoding.GetEncoding(28591)
Encoding.GetEncoding("Latin1")
Encoding.GetEncoding("ISO-8859-1")
This encoding has the useful characteristic that byte values up to 255 are converted unchanged to the Unicode character with the same value (e.g. the byte 0x80 becomes the character 0x0080).
In your scenario, this may be more useful than the ASCII encoding (which converts values in the range 0x80 to 0xFF to '?') or any of the other usual encodings, which will also convert some of the characters in this range.
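The difference is easy to check with a round trip over every possible byte value. This is a sketch rather than a drop-in fix for the file format in the question (C# 9 or later assumed for top-level statements).

```csharp
using System;
using System.Linq;
using System.Text;

// Every possible byte value, 0x00 through 0xFF.
byte[] allBytes = Enumerable.Range(0, 256).Select(i => (byte)i).ToArray();

// Latin1: bytes -> string -> bytes returns exactly what went in.
Encoding latin1 = Encoding.GetEncoding(28591);
string asText = latin1.GetString(allBytes);
bool latin1Lossless = latin1.GetBytes(asText).SequenceEqual(allBytes);

// ASCII: the byte 0x80 from the question does not survive the same trip.
byte[] asciiBack = Encoding.ASCII.GetBytes(Encoding.ASCII.GetString(new byte[] { 0x80 }));

Console.WriteLine(latin1Lossless);
Console.WriteLine((int)asText[0x80]);     // char value of the decoded byte 0x80
Console.WriteLine(asciiBack[0].ToString("X2"));
```

Keep in mind that a byte such as 0x80 will not display as the Euro sign when decoded this way; the point is only that it goes back to disk unchanged.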

If you want this in .NET, you could use F# to make a library supporting this. F# supports ASCII strings, with a byte array as the underlying type, see Literals (F#) (MSDN):
let asciiString = "This is a string"B
