Encoding En Dash character on C#

I have a text file that includes the en dash "–" character. When I read the file and write it to the console, the en dash comes out as a ? character. I tried the UTF8, ASCII, UTF32, UTF7, Unicode and Default encodings.
The sample text file contains:
000020a3;LH 10000924 – 000 – 08;Formal;&&&
Here is my code:
static void Main(string[] args)
{
    var a = File.ReadAllText(@"C:\fullfilepath\sample.txt", Encoding.Default);
    Console.WriteLine(a);
    Console.ReadKey();
}
Output:
000020a3;LH 10000924 ? 000 ? 08;Formal;&&&

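The likely cause is an encoding mismatch. On .NET Framework, Encoding.Default is the system's ANSI code page, so a file saved as UTF-8 is decoded incorrectly and the en dash "–" (Unicode code point U+2013) is lost. Read the file with Encoding.UTF8 instead, which can represent every Unicode character, including the en dash. If the string is read correctly but the console still prints ?, the console's output encoding cannot represent the character; setting Console.OutputEncoding = Encoding.UTF8 before writing usually addresses that.
Here is an example that should fix your code: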
static void Main(string[] args)
{
    var a = File.ReadAllText(@"C:\fullfilepath\sample.txt", Encoding.UTF8);
    Console.WriteLine(a);
    Console.ReadKey();
}
Alternatively, you can use Encoding.GetEncoding("UTF-8") or new UTF8Encoding(encoderShouldEmitUTF8Identifier: false) instead of Encoding.UTF8.
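The two steps, decoding the file as UTF-8 and letting the console emit UTF-8, can be sketched as follows. This is a minimal sketch, assuming the sample file really is stored as UTF-8; the class name EnDashDemo, the helper ReadUtf8 and the temporary file are hypothetical stand-ins, not part of the original code.

```csharp
using System;
using System.IO;
using System.Text;

static class EnDashDemo
{
    // Hypothetical helper: decodes the whole file as UTF-8.
    public static string ReadUtf8(string path)
    {
        return File.ReadAllText(path, Encoding.UTF8);
    }

    static void Main()
    {
        // Assumption: the real file is UTF-8. A temp file stands in for sample.txt.
        string path = Path.GetTempFileName();
        File.WriteAllText(path,
            "000020a3;LH 10000924 \u2013 000 \u2013 08;Formal;&&&",
            new UTF8Encoding(encoderShouldEmitUTF8Identifier: false));

        string text = ReadUtf8(path);

        // The string can be correct while the console still shows '?':
        // the console's own output encoding must be able to represent U+2013.
        Console.OutputEncoding = Encoding.UTF8;
        Console.WriteLine(text);

        // Checks the string itself, independent of how the console renders it.
        Console.WriteLine(text.IndexOf('\u2013') >= 0);

        File.Delete(path);
    }
}
```

Checking the string with IndexOf separates the two failure modes: if the en dash is present in the string, any remaining ? comes from the console rather than from reading the file.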

Related

Special (Hungarian and Serbian) characters lost in the string when reading from file

I'm reading both Hungarian and Serbian words from a text document (tab-delimited, exported from Excel), then writing them to the console. When I write them to the screen, characters outside the English alphabet are not displayed correctly.
For example, instead of körte I get kĂśrte, and instead of kruška I get kruĹĄka.
I'm using StreamReader (and later StreamWriter), and I've set the encoding to iso-8859-2 for both of them, as well as for the console output. This encoding includes both sets of characters I need.
Console.OutputEncoding = Encoding.GetEncoding("iso-8859-2");
using(StreamReader sr = new StreamReader(fIN, Encoding.GetEncoding("iso-8859-2"))) {
using(StreamWriter sw = new StreamWriter(fDB, Encoding.GetEncoding("iso-8859-2"))) {
I've tried to see whether it had trouble writing it on the console, so I just tried writing all these characters on the screen, and it displays everything with no problem.
Console.WriteLine("á Á é É í Í ó Ó ö Ö ü Ü ű Ű");
Console.WriteLine("č Č ć Ć đ Đ š Š ž Ž");
//outputs properly
I tried to see whether it had trouble storing these characters, so I've put them in a string and tried to display it, with no problems.
string s13 = "á Á é É í Í ó Ó ö Ö ü Ü ű Ű";
Console.WriteLine(s13);
s13 = "č Č ć Ć đ Đ š Š ž Ž ";
Console.WriteLine(s13);
//outputs properly
I tried to see where the problem is in runtime with debugging, and it seems like when I read the data from file, it is read wrong.
try {
    using(FileStream fs = new FileStream("DB.txt", FileMode.OpenOrCreate)) {
        using(StreamReader sr = new StreamReader(fs, Encoding.GetEncoding("iso-8859-2"))) {
            while(!sr.EndOfStream) {
                string[] s = sr.ReadLine().Split('\t'); // immediately becomes faulty, even if not split
                HuSrb word = new HuSrb(s[0], s[1]);
                bool found = false;
                foreach(Categories c in categories) {
                    if(c.Name == s[2]) {
                        c.Amount++;
                        c.Words.Add(word);
                        found = true;
                        break;
                    }
                }
                if(!found) {
                    Categories category = new Categories(s[2], word);
                    categories.Add(category);
                }
            }
        }
    }
}
catch(Exception) {
    throw;
}
The funny thing is, later I read into a string from file A and write it into a string, then write the contents of that string into file B. Both file A and file B have the characters right, but in the middle, the string doesn't have the characters right.
So basically,
The problem is not with storing the data
The problem is not with printing the data
The problem is not with writing the data into a file.
My assumption is that the problem is when reading from the file, but then I don't understand how it ends up being correct in the other file. Any help?

The problem is probably that the input text file was saved with a different encoding from the one you use to read it.
I tried to read and write your example's content using another encoding, and it works: I saved the input file as UTF-8 and read the content back using Encoding.UTF8.
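The effect of matching and mismatching encodings can be reproduced with a short round trip. This is a minimal sketch: the class RoundTripDemo and the helper WriteUtf8ThenRead are hypothetical, and ISO-8859-1 (code page 28591) is used as the mismatched encoding only because it is available on current .NET without registering an extra encoding provider; the question's ISO-8859-2 behaves the same way in principle.

```csharp
using System;
using System.IO;
using System.Text;

static class RoundTripDemo
{
    // Hypothetical helper: writes text as UTF-8 (no BOM), then reads it back
    // with whatever encoding the caller asks for.
    public static string WriteUtf8ThenRead(string text, Encoding readEncoding)
    {
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllText(path, text, new UTF8Encoding(encoderShouldEmitUTF8Identifier: false));
            return File.ReadAllText(path, readEncoding);
        }
        finally
        {
            File.Delete(path);
        }
    }

    static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;

        // "körte<TAB>kruška", written with escapes so the source file's own encoding does not matter.
        string words = "k\u00F6rte\tkru\u0161ka";

        // Same encoding for writing and reading: the text survives unchanged.
        Console.WriteLine(WriteUtf8ThenRead(words, Encoding.UTF8) == words);

        // Mismatch: each UTF-8 byte is decoded as a separate character, so the text is garbled.
        Console.WriteLine(WriteUtf8ThenRead(words, Encoding.GetEncoding(28591)) == words);
    }
}
```

The second comparison fails because every accented letter occupies two bytes in UTF-8, and a single-byte encoding turns each of those bytes into its own character.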

c# - How to convert a converted UTF8 string to UTF16?

I'm trying to convert a converted UTF-8 string to UTF-16, because I'm going to read a file and it comes like the var strUTF8 below.
For example, the entry would be the string "NÃ£o Ã© possÃvel equipar" and the return I needed is "Não é possível equipar".
static void Main(string[] args)
{
test3();
Console.ReadKey();
}
static void test3()
{
string str = "NÃ£o Ã© possÃvel equipar";
string strUTF16 = Utf8ToUtf16(str);
Console.WriteLine(str);
Console.WriteLine(strUTF16);
}
static string Utf8ToUtf16(string utf8String)
{
byte[] utf8Bytes = Encoding.UTF8.GetBytes(utf8String);
byte[] unicodeBytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, utf8Bytes);
return Encoding.Unicode.GetString(unicodeBytes);
}
I really don't know how to solve this. Any tips?

If you want to read a file then you should read a file. When you read the file, specify the encoding of that file. If I'm not mistaken UTF8 is the default, so reading files encoded with UTF8 doesn't require the encoding to be specified. If you want to save that text to a file with a specific encoding, specify that encoding when saving the file.
var text = File.ReadAllText(filePath, Encoding.UTF8);
File.WriteAllText(filePath, text, Encoding.Unicode);
That will effectively convert a file from UTF8 encoding to UTF16. A more verbose version would be:
var data = File.ReadAllBytes(filePath);
var text = Encoding.UTF8.GetString(data);
data = Encoding.Unicode.GetBytes(text);
File.WriteAllBytes(filePath, data);

Your Utf8ToUtf16() function is effectively a no-op. You are taking an arbitrary UTF-16 string as input, encoding it into UTF-8 bytes, then decoding those bytes as UTF-8 back into UTF-16. So, you effectively end up with the same string value you started with. You may as well have just written the following, the result would be the same:
static string Utf8ToUtf16(string utf8String)
{
    return utf8String;
}
That being said, NÃ£o Ã© possÃvel equipar is what you get when the UTF-8 encoded form of Não é possível equipar is mis-interpreted as Latin (probably ISO-8859-1) or Windows-125x etc, instead of being properly interpreted as UTF-8 to begin with.
If you have a C# string that contains such UTF-8 bytes which were up-scaled as-is to UTF-16 (why???), then you need to down-scale those characters as-is back into 8-bit bytes, and then you can decode those bytes as UTF-8, eg:
static void test3()
{
string str = "NÃ£o Ã© possÃvel equipar";
string strUTF16 = Utf8ToUtf16(str);
Console.WriteLine(str);
Console.WriteLine(strUTF16);
}
static string Utf8ToUtf16(string utf8String)
{
byte[] utf8Bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(utf8String); // or: GetEncoding(28591)
return Encoding.UTF8.GetString(utf8Bytes);
}
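To see both directions of this in one place, the damaged string can be produced programmatically instead of being pasted in. This is a minimal sketch; the class MojibakeDemo and the helpers Garble and Repair are hypothetical names. Building the garbled text in code also avoids one trap: í is encoded in UTF-8 as the bytes C3 AD, and AD read as ISO-8859-1 is the invisible soft hyphen, which is easily lost when the text is copied from a web page.

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // ISO-8859-1 maps bytes 0x00-0xFF one-to-one onto U+0000-U+00FF.
    static readonly Encoding Latin1 = Encoding.GetEncoding(28591);

    // Simulates the damage: the UTF-8 bytes of the text are misread as ISO-8859-1.
    public static string Garble(string text)
    {
        return Latin1.GetString(Encoding.UTF8.GetBytes(text));
    }

    // Reverses the damage: turn each character back into its byte, then decode as UTF-8.
    public static string Repair(string garbled)
    {
        return Encoding.UTF8.GetString(Latin1.GetBytes(garbled));
    }

    static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;

        // "Não é possível equipar", written with escapes.
        string original = "N\u00E3o \u00E9 poss\u00EDvel equipar";

        string garbled = Garble(original);
        Console.WriteLine(garbled);
        Console.WriteLine(Repair(garbled) == original);
    }
}
```

The repair is only lossless when the wrong decoder was a one-to-one byte mapping such as ISO-8859-1; if the text passed through Windows-1252 or a lossy copy step, some bytes may not be recoverable.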

Compare Windows-1252 string to UTF-8 string

My goal is to convert a .NET string (Unicode) into Windows-1252 and, if necessary, store the original UTF-8 string in a Base64 entity.
For example, the string "DJ Doena" converted to 1252 is still "DJ Doena".
However, if you convert the Japanese kanji for tree (木) into 1252, you end up with a question mark.
These are my test strings:
String doena = "DJ Doena";
String umlaut = "äöüßéèâ";
String allIn = "< ä ß á â & 木 >";
This is how I convert the string in the first place:
using (MemoryStream ms = new MemoryStream())
{
    using (StreamWriter sw = new StreamWriter(ms, Encoding.UTF8))
    {
        sw.Write(decoded);
        sw.Flush();
        ms.Seek(0, SeekOrigin.Begin);
        using (StreamReader sr = new StreamReader(ms, Encoding.GetEncoding(1252)))
        {
            encoded = sr.ReadToEnd();
        }
    }
}
Problem is, while debugging string comparison claims that both are indeed identical, so a simple == or .Equals() doesn't suffice.
This is how I try to find out if I need base64 and produce it:
private static String GetBase64Alternate(String utf8Text, String windows1252Text)
{
    Byte[] utf8Bytes;
    Byte[] windows1252Bytes;
    String base64;
    utf8Bytes = Encoding.UTF8.GetBytes(utf8Text);
    windows1252Bytes = Encoding.GetEncoding(1252).GetBytes(windows1252Text);
    base64 = null;
    if (utf8Bytes.Length != windows1252Bytes.Length)
    {
        base64 = Convert.ToBase64String(utf8Bytes);
    }
    else
    {
        for(Int32 i = 0; i < utf8Bytes.Length; i++)
        {
            if(utf8Bytes[i] != windows1252Bytes[i])
            {
                base64 = Convert.ToBase64String(utf8Bytes);
                break;
            }
        }
    }
    return (base64);
}
The first string doena is completely identical and doesn't produce a base64 result
Console.WriteLine(String.Format("{0} / {1}", windows1252Text, base64Text));
results in
DJ Doena /
But the second string, umlaut, already has twice as many bytes in UTF-8 as in 1252 and thus produces a Base64 string even though it does not appear to be necessary:
äöüßéèâ / w6TDtsO8w5/DqcOow6I=
And the third one does what it's supposed to do (no more "木" but a "?", thus base64 needed):
< ä ß á â & ? > / PCDDpCDDnyDDoSDDoiAmIOacqCA+
Any clues how my Base64 getter could be enhanced a) for performance b) for better results?
Thank you in advance. :-)

I'm not sure I completely understood the question. But I tried. :) If I do understand correctly, this code does what you want:
static void Main(string[] args)
{
    string[] testStrings = { "DJ Doena", "äöüßéèâ", "< ä ß á â & 木 >" };
    foreach (string text in testStrings)
    {
        Console.WriteLine(ReencodeText(text));
    }
}
private static string ReencodeText(string text)
{
    Encoding encoding = Encoding.GetEncoding(1252);
    string text1252 = encoding.GetString(encoding.GetBytes(text));
    return text.Equals(text1252, StringComparison.Ordinal) ?
        text : Convert.ToBase64String(Encoding.UTF8.GetBytes(text));
}
I.e. it encodes the text to Windows-1252, then decodes back to a string object, which it then compares with the original. If the comparison succeeds, it returns the original string, otherwise it encodes it to UTF8, and then to base64.
It produces the following output:
DJ Doena
äöüßéèâ
PCDDpCDDnyDDoSDDoiAmIOacqCA+
In other words, the first two strings are left intact, while the third is encoded as base64.

In your first code you are encoding the string using one encoding, then decoding it using a different encoding. That doesn't give you any reliable result at all; it's the equivalent of writing out a number in octal, then reading it as if it was in decimal. It seems to work just fine for numbers up to 7, but after that you get useless results.
The problem with the GetBase64Alternate method is that it's encoding a string to two different encodings, and assumes that the first encoding doesn't support some of the characters if the second encoding resulted in a different set of bytes.
Comparing the byte sequences doesn't tell you whether any of the encodings failed. The sequences will be different if it failed, but it will also be different if there are any characters that are encoded differently between the encodings.
What you want to do is to determine if the encoding actually worked for all characters. You can do that by creating an Encoding instance with a fallback for unsupported characters. There is an EncoderExceptionFallback class that you can use for that, which throws an EncoderFallbackException if it's called.
This code tries to use the Windows-1252 encoding on a string and sets the ok variable to false if the encoding doesn't support all characters in the string:
Encoding e = Encoding.GetEncoding(1252, new EncoderExceptionFallback(), new DecoderExceptionFallback());
bool ok = true;
try {
    e.GetByteCount(allIn);
} catch (EncoderFallbackException) {
    ok = false;
}
As you are not actually going to use the encoded result for anything, you can use the GetByteCount method. It checks how all characters would be encoded without producing the encoded result.
Used in your method it would be:
private static String GetBase64Alternate(string text) {
    Encoding e = Encoding.GetEncoding(1252, new EncoderExceptionFallback(), new DecoderExceptionFallback());
    bool ok = true;
    try {
        // Throws if any character of text has no Windows-1252 representation
        e.GetByteCount(text);
    } catch (EncoderFallbackException) {
        ok = false;
    }
    return ok ? null : Convert.ToBase64String(Encoding.UTF8.GetBytes(text));
}
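A self-contained variant of the exception-fallback check can be run against the three test strings from the question. This is a sketch under one loud assumption: it targets ISO-8859-1 (code page 28591) as a stand-in for Windows-1252, because on current .NET code page 1252 is only available after registering an additional encoding provider. The class name FallbackDemo and the codePage parameter are hypothetical additions.

```csharp
using System;
using System.Text;

static class FallbackDemo
{
    // Returns null when every character fits the target code page,
    // otherwise the Base64 form of the text's UTF-8 bytes.
    public static string GetBase64Alternate(string text, int codePage)
    {
        Encoding strict = Encoding.GetEncoding(
            codePage, new EncoderExceptionFallback(), new DecoderExceptionFallback());
        try
        {
            // Only counts bytes; throws as soon as a character cannot be encoded.
            strict.GetByteCount(text);
            return null;
        }
        catch (EncoderFallbackException)
        {
            return Convert.ToBase64String(Encoding.UTF8.GetBytes(text));
        }
    }

    static void Main()
    {
        // Assumption: ISO-8859-1 stands in for Windows-1252 here.
        const int latin1 = 28591;

        // Plain ASCII and the umlaut string both fit, so no Base64 is produced.
        Console.WriteLine(GetBase64Alternate("DJ Doena", latin1) == null);
        Console.WriteLine(GetBase64Alternate("\u00E4\u00F6\u00FC\u00DF\u00E9\u00E8\u00E2", latin1) == null);

        // U+6728 (the kanji for tree) does not fit, so the Base64 form is returned.
        Console.WriteLine(GetBase64Alternate("< \u00E4 \u00DF \u00E1 \u00E2 & \u6728 >", latin1));
    }
}
```

The two code pages differ for some characters (the euro sign, for instance, exists in Windows-1252 but not in ISO-8859-1), so results for such characters will not match a real 1252 check.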

Encoding special characters into Byte array

A few days ago I asked a question about German special characters.
I can encode and decode characters like ö, ä or ü now. But some characters are left, and I need to encode/decode them too.
For example, characters that fail: ² ³ € µ Ü Ö Ä ~ ´ §
Here is the code:
private static byte[] MyGetBytesArray(string data)
{
    Encoding enc = new UTF8Encoding(true, true);
    return enc.GetBytes(data);
}
private static string MyGetString(byte[] data)
{
    Encoding enc = new UTF8Encoding(true, true);
    return enc.GetString(data);
}
I'm looking for a solution to encode/decode all characters. I'm writing an encrypt/decrypt algorithm, and I don't know what the user will paste into the program. I need to give back exactly the same text.
Thanks for the help, again.
EDIT:
OK, UnicodeEncoding works (I think). The problem is in my encrypt/decrypt algorithm now. I'm still not sure what is going on (I think it is something to do with zeros: when encoding with Unicode, a zero follows every character), but encoding special characters works. At least this test was successful:
string text = File.ReadAllText(opd.FileName, Encoding.Default);
byte[] byt = getBytesArray(text);
string text2 = getString(byt);
if (text2 == text)
{
    MessageBox.Show("OK");
}
else
{
    MessageBox.Show("FAIL");
}
BTW, Encoding.Default is correct, right?

Try UnicodeEncoding instead.
var encoding = new UnicodeEncoding();
return Write(encoding.GetBytes(s));

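Those characters are all Unicode characters, and both UTF8Encoding and UnicodeEncoding can represent every Unicode character, so UTF8Encoding itself should round-trip them. If they are still lost, the likely cause is elsewhere, for example the file being read with the wrong encoding or the bytes being altered between GetBytes and GetString.
You can try the UnicodeEncoding class instead.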
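Whether UTF8Encoding can carry the characters listed in the question is easy to check in isolation. The following is a minimal sketch, with Utf8RoundTripDemo and RoundTrips as hypothetical names; it encodes the characters to bytes and decodes them again without any file or encryption step in between.

```csharp
using System;
using System.Text;

static class Utf8RoundTripDemo
{
    // throwOnInvalidBytes: true makes malformed input fail loudly
    // instead of being silently replaced.
    static readonly Encoding Utf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // True when encoding to bytes and decoding back yields the same string.
    public static bool RoundTrips(string text)
    {
        byte[] bytes = Utf8.GetBytes(text);
        return Utf8.GetString(bytes) == text;
    }

    static void Main()
    {
        // The characters from the question: ² ³ € µ Ü Ö Ä ~ ´ §
        string sample = "\u00B2 \u00B3 \u20AC \u00B5 \u00DC \u00D6 \u00C4 ~ \u00B4 \u00A7";
        Console.WriteLine(RoundTrips(sample));
    }
}
```

If this check passes but the full program still loses characters, the loss happens before GetBytes (for example when the file is read) or in the processing applied to the byte array.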

issue with XML encoding

I tried to phrase this as a generic question but realized I don't know enough, so here is the problem I'm having.
Here is a snippet from a console application:
public void Run()
{
    Run(Console.Out);
}
public void Run(TextWriter writer)
{
    DataTable customers = _quickBooksAdapter.GetTableData("Customer");
    customers.WriteXml(writer);
}
Then I run it from the console and use ">" to redirect the output to a file.
c:\> QuickBooksETL extract US > qb_us.xml
If I try to load the result as I normally would:
var x = XDocument.Load("qb_us.xml");
I get the error:
Invalid character in the given encoding. Line 8, position 26.
So I tried to determine what .NET "thinks" it is using:
string path = @"\\ad1\accounting$\Xml\qb_us.xml";
StreamReader sr = new StreamReader(path);
sr.CurrentEncoding.Dump();
Result:
System.Text.UTF8Encoding
BodyName utf-8
EncodingName Unicode (UTF-8)
HeaderName utf-8
WebName utf-8
WindowsCodePage 1200
IsBrowserDisplay True
IsBrowserSave True
IsMailNewsDisplay True
IsMailNewsSave True
IsSingleByte False
EncoderFallback EncoderReplacementFallback
System.Text.EncoderReplacementFallback
DefaultString �
MaxCharCount 1
DecoderFallback DecoderReplacementFallback
System.Text.DecoderReplacementFallback
DefaultString �
MaxCharCount 1
IsReadOnly True
CodePage 65001
Finally, I find by guessing that it works if I just explicitly say it's ASCII:
string path = @"\\ad1\accounting$\Xml\qb_us.xml";
StreamReader sr = new StreamReader(path, Encoding.ASCII);
var x = XDocument.Load(sr);
Any thoughts on where am I going wrong would be greatly appreciated. I admit I have never taken the "deep dive" on character encodings, but I'm willing to put in the effort to get this right.

The simple answer is not to get the console involved. Write directly to the file from your code:
public void Run(string filename)
{
    DataTable customers = _quickBooksAdapter.GetTableData("Customer");
    customers.WriteXml(filename);
}
or create the TextWriter or Stream yourself and pass that in, e.g.
public void Run(Stream output)
{
    DataTable customers = _quickBooksAdapter.GetTableData("Customer");
    customers.WriteXml(output);
}
Note that by reading it as ASCII, you'll basically be getting question marks for any non-ASCII character in the original data. IIRC, that's the default behaviour of an encoding when it encounters binary data it can't handle.
Using a Stream it should default to writing out in UTF-8, and the XML declaration and the data within the file should match.
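The stream-based approach can be tried end to end without QuickBooks. This is a minimal sketch: the class XmlStreamDemo, the helper RoundTripName, the single "Name" column and the in-memory stream are all hypothetical stand-ins for the real adapter and output file. It writes a table containing non-ASCII text to a stream and loads the result with XDocument.

```csharp
using System;
using System.Data;
using System.IO;
using System.Linq;
using System.Xml.Linq;

static class XmlStreamDemo
{
    // Hypothetical helper: writes a one-row table to a stream as XML,
    // loads it back with XDocument and returns the stored value.
    public static string RoundTripName(string name)
    {
        var customers = new DataTable("Customer"); // WriteXml requires a table name
        customers.Columns.Add("Name", typeof(string));
        customers.Rows.Add(name);

        using (var stream = new MemoryStream())
        {
            // Writing to a stream bypasses the console and its code page entirely.
            customers.WriteXml(stream);
            stream.Position = 0;

            XDocument doc = XDocument.Load(stream);
            return doc.Descendants("Name").First().Value;
        }
    }

    static void Main()
    {
        // "Café Müller – GmbH", written with escapes.
        string name = "Caf\u00E9 M\u00FCller \u2013 GmbH";
        Console.WriteLine(RoundTripName(name) == name);
    }
}
```

In the real program a FileStream opened on the target path would take the place of the MemoryStream.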

In my experience, if your data includes illegal characters (for example, character 12), the XML doesn't round trip unless you read the XML with an XmlTextReader with Normalization = false. I've been using XmlSerializer.Deserialize(), not XDocument.Load(). Still, you might try calling the Load(XmlReader) overload by passing in an XmlTextReader with Normalization = false.
I would add my voice to Jon's in suggesting that you write to your own stream, not Console.Out.
