Arabic Presentation Forms-B support in C#

I was trying to convert a file from UTF-8 to the Arabic Windows-1256 encoding using the Encoding APIs in C#, but I ran into a strange problem: some characters are not converted correctly. For example, "لا" in the statement "ﻣﺣﻣد ﺻﻼ ح عادل" comes out as "ﻣﺣﻣد ﺻ? ح عادل". Some of my friends told me that this is because these characters are from Arabic Presentation Forms-B. I created the file using Notepad++ and saved it as UTF-8.
Here is the code I use:
StreamReader sr = new StreamReader(@"C:\utf-8.txt", Encoding.UTF8);
string str = sr.ReadLine();
StreamWriter sw = new StreamWriter(@"C:\windows-1256.txt", false, Encoding.GetEncoding("windows-1256"));
sw.Write(str);
sw.Flush();
sw.Close();
But I don't know how to handle these presentation forms correctly when converting the file in C#.

Yes, your string contains lots of ligatures that cannot be represented in the 1256 code page. You'll have to decompose the string before writing it. Like this:
str = str.Normalize(NormalizationForm.FormKD);
sw.Write(str);
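Putting this together with the code from the question, a minimal sketch could look like this (same file paths as the question; the key addition is the Normalize call):
using System.IO;
using System.Text;

string str;
using (var sr = new StreamReader(@"C:\utf-8.txt", Encoding.UTF8))
{
    str = sr.ReadLine();
}

// Decompose Presentation Forms ligatures into their "normal" Arabic letters
str = str.Normalize(NormalizationForm.FormKD);

using (var sw = new StreamWriter(@"C:\windows-1256.txt", false, Encoding.GetEncoding("windows-1256")))
{
    sw.Write(str);
}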

To give a more general answer:
The Windows-1256 encoding is an obsolete 8-bit character encoding. It has only 256 characters, of which only 60 are Arabic letters.
Unicode has a much wider range of characters. In particular, it contains:
the “normal” Arabic characters, U+0600 to U+06FF. These are supposed to be used for normal Arabic text, including text written in other languages that use the Arabic script, such as Farsi. For example, “لا” is U+0644 (ل) followed by U+0627 (ا).
the “Presentation Form” characters, U+FB50 to U+FDFF (“Presentation Forms-A”) and U+FE70 to U+FEFF (“Presentation Forms-B”). These are not intended to be used for representing Arabic text. They are primarily intended for compatibility, especially with font-file formats that require separate code points for every different ligated form of every character and ligated character combination. The “لا” ligature is represented by a single codepoint (U+FEFB) despite being two characters.
When encoding into Windows-1256, the .NET encoding for Windows-1256 will automatically convert characters from the Presentation Forms block to “normal text” because it has no other choice (except of course to turn it all into question marks). For obvious reasons, it can only do that with characters that actually have an “equivalent”.
When decoding from Windows-1256, the .NET encoding for Windows-1256 will always generate characters from the “normal text” block.
As we’ve discovered, your input file contains characters that are not representable in Windows-1256. Such characters will turn into question marks (?). Furthermore, those Presentation-Form characters which do have a normal-text equivalent, will change their ligation behaviour, because that is what normal Arabic text does.
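To see the mapping concretely, here is a small illustrative check (not part of the original answer; it needs using System and using System.Text): compatibility normalization turns the Presentation Forms-B ligature back into the two normal-text letters.
string ligature = "\uFEFB"; // ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM
string normal = ligature.Normalize(NormalizationForm.FormKD);
foreach (char c in normal)
    Console.WriteLine("U+{0:X4}", (int)c); // prints U+0644 (ل), then U+0627 (ا)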

First of all, the two characters you quoted are not from the Arabic Presentation Forms block. They are \x0644 and \x0627, which are from the standard Arabic block. However, just to be sure I tried the character \xFEFB, which is the “equivalent” (not equivalent, but you know) character for لا from the Presentation Forms block, and it works fine even for that.
Secondly, I will assume you mean the encoding Windows-1256, which is for legacy 8-bit Arabic text.
So I tried the following:
var input = "لا";
var encoding = Encoding.GetEncoding("windows-1256");
var result = encoding.GetBytes(input);
Console.WriteLine(string.Join(", ", result));
The output I get is 225, 199. So let’s try to turn it back:
var bytes = new byte[] { 225, 199 };
var result2 = encoding.GetString(bytes);
Console.WriteLine(result2);
Fair enough, the Console does not display the result correctly — but the Watch window in the debugger tells me that the answer is correct (it says “لا”). I can also copy the output from the Console and it is correct in the clipboard.
Therefore, the Windows-1256 encoding is working just fine and it is not clear what your problem is.
My recommendation:
Write a short piece of code that exhibits the problem.
Post a new question with that piece of code.
In that question, describe exactly what result you get, and what result you expected instead.

Related

Read multiple files with different encoding, preserving all characters

I am trying to read a text file and write it to a new text file. The input file could be ANSI or UTF-8. I don't care what the output encoding is, but I want to preserve all characters when writing. How can I do this? Do I need to detect the input file's encoding (which seems like a lot of work)?
The following code reads an ANSI file and writes the output as UTF-8, but there are some gibberish characters ("�").
I am looking for a way to read the file no matter which of the two encodings it uses, and write it correctly without knowing the encoding of the input file beforehand.
File.WriteAllText(outputfile, File.ReadAllText(inputfilepath + @"\ST60_0.csv"));
Note that this batch command reads a UTF-8 and ANSI file and writes the output as ANSI with all chars preserved so I'm looking to do this but in C#:
type ST60_0.csv inputUTF.csv > outputBASH.txt
Q: The following code reads an ANSI file and writes the output as UTF-8, but there are some gibberish characters ("�").
A: It would definitely be useful to see the hex values of some of these "gibberish" characters. Perhaps you could install a Hex plugin to Notepad++ and tell us?
Q: It blows my mind that it's so hard to do something in C# that the command prompt can do easily.
A: Typically, it IS easy. There seems to be "something special" written into this particular file.
The difference between C# and other, "simpler" approaches is that C# (unlike C character I/O or .bat files) gives you the flexibility to deal with text that doesn't happen to be "standard ASCII".
ANYWAY:
If the "�" you posted (hex 0xEFBFBD) is a valid example of your actual text, this might explain what's going on:
https://stackoverflow.com/a/25510366/421195
... %EF%BF%BD is the url-encoded version of the hex representation of the 3 bytes (EF BF BD) of the UTF-8 replacement character.
See also:
https://en.wikipedia.org/wiki/Specials_(Unicode_block)
The replacement character � (often displayed as a black rhombus with a white question mark) is a symbol found in the Unicode standard at code point U+FFFD in the Specials table. It is used to indicate problems when a system is unable to render a stream of data to a correct symbol. It is usually seen when the data is invalid and does not match any character.
You might also be interested in this:
https://learn.microsoft.com/en-us/dotnet/standard/base-types/character-encoding
Best-Fit Fallback: When a character does not have an exact match in the target encoding, the encoder can try to map it to a similar character.
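As an illustrative aside (not part of the quoted documentation), you can opt out of both best-fit mapping and "?" substitution by constructing the encoding with exception fallbacks, so an unrepresentable character fails loudly instead of being silently replaced:
// Sketch: a windows-1252 encoding that throws instead of best-fitting or
// substituting '?' when a character cannot be represented.
var strict1252 = Encoding.GetEncoding(
    "windows-1252",
    new EncoderExceptionFallback(),
    new DecoderExceptionFallback());
try
{
    strict1252.GetBytes("The \u2211 character is not in code page 1252.");
}
catch (EncoderFallbackException ex)
{
    Console.WriteLine("Cannot encode U+{0:X4}", (int)ex.CharUnknown);
}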
UPDATE:
The offending character was "»", hex 0xc2bb. This is a "Right Angle Quote", a Guillemet. Angle quotes are the quotation marks used in certain languages with an otherwise roman alphabet, such as French.
One possible solution is to specify "iso-8859-1", vs. the default encoding "UTF-8":
File.WriteAllText(outputfile, File.ReadAllText(inputfilepath + @"\ST60_0.csv", System.Text.Encoding.GetEncoding("iso-8859-1")));
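If the goal is to read either encoding without knowing it beforehand, one possible approach (a sketch, not from the original answer) is to try strict UTF-8 first and fall back to a single-byte code page only when the bytes are not valid UTF-8; "iso-8859-1" is used here simply because it is the fallback suggested above.
// Requires: using System.IO; using System.Text;
// Sketch: detect-by-trying. Valid UTF-8 (including plain ASCII) is read as UTF-8;
// anything else falls back to the legacy single-byte encoding.
static string ReadTextAutoDetect(string path)
{
    byte[] bytes = File.ReadAllBytes(path);
    var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
    try
    {
        return strictUtf8.GetString(bytes);
    }
    catch (DecoderFallbackException)
    {
        return Encoding.GetEncoding("iso-8859-1").GetString(bytes);
    }
}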

How to unescape multibyte unicode in c# [duplicate]

This question already has answers here:
How to unescape unicode string in C#
(2 answers)
Closed 2 years ago.
The following unicode string from a text file encodes a single apostrophe using 3 bytes:
It\u00e2\u0080\u0099s working
This should decode to:
It’s working
How can I decode this string in C#?
For example, when I try the following code:
string test = @"It\u00e2\u0080\u0099s working";
string test2 = System.Text.RegularExpressions.Regex.Unescape(test);
it incorrectly decodes the first byte only:
Itâ\u0080\u0099s awesome
This is UTF-8. Try decoding it as UTF-8:
using System.Text;
using System.Text.RegularExpressions;
string test = "It\u00e2\u0080\u0099s working";
byte[] bytes = Encoding.GetEncoding(28591)
.GetBytes(test);
var converted = Encoding.UTF8.GetString(bytes);//It’s working
Try this to parse the file:
// Requires: using System.Globalization; using System.Text.RegularExpressions;
private static Regex _regex = new Regex(@"\\u(?<Value>[a-zA-Z0-9]{4})", RegexOptions.Compiled);

public string decodeString(string value)
{
    return _regex.Replace(
        value,
        m => ((char)int.Parse(m.Groups["Value"].Value, NumberStyles.HexNumber)).ToString()
    );
}
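Putting the two steps together for text read from the file, where the \uXXXX sequences are literal characters (a sketch that assumes the decodeString helper above is in scope): first turn each \uXXXX into a char, then reinterpret those chars as Latin-1 bytes and decode them as UTF-8.
string raw = @"It\u00e2\u0080\u0099s working";            // as read from the file
string chars = decodeString(raw);                         // "Itâ" followed by U+0080, U+0099, "s working"
byte[] latin1 = System.Text.Encoding.GetEncoding(28591).GetBytes(chars);
string result = System.Text.Encoding.UTF8.GetString(latin1);
Console.WriteLine(result);                                // It’s working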
That is JavaScript Unicode escaping. Use a C# JavaScript deserializer to convert it.
(I don't have enough reputation to comment, so I will write here)
Where did you get those characters from in the first place?
\uXXXX is an escape sequence used by JavaScript and C# (I didn't know about C# until now) to encode 16-bit Unicode characters in string literals. 16 bits is 4 hex digits, hence \uXXXX, with each X representing one hexadecimal digit.
Note that this is used to encode string literals in source code! It is not used to encode the bytes stored in files or memory. It is an older style of escaping: modern source-code editors usually support UTF-8, UTF-16, or some other encoding, so they can store Unicode characters in source files, display the character symbols, and let you type them directly in the editor. So the \uXXXX notation is rarely needed and is going out of style.
That is why I asked where you got the string initially. You wrote in a comment that you read it from a file? What generated the file?
If each \uXXXX is taken by itself as a Unicode character, which is what \uXXXX means, the sequence doesn't make sense: U+00E2 is â (a with a circumflex), while U+0080 and U+0099 are control characters, not printable.
If E2 80 99 are taken together as three single bytes, i.e. dropping the 00-valued first byte of each \u00XX, then they fit as the UTF-8 representation of the Unicode character U+2019, "RIGHT SINGLE QUOTATION MARK".
If that is what you are looking for, then whatever generated that string is not using the escaping correctly. If you end up with those strings and have to evaluate them, then the approach in the comments above by "C# Novice" works, but it may not work in every case.
You could convert string literals that use \uXXXX escapes with a JavaScript script evaluator, or with CSharpScript.RunAsync(), to build a string literal from them, assign it to a variable, and then look at its bytes. I tried that, but because those byte values/characters don't make sense on their own, I don't get anything meaningful from them: I get an â, and CSharpScript refuses to decode the next two and leaves them as-is, because they are control characters when decoded.
Here are three different ways of doing the \uXXXX decoding with libraries available for C#. The first two use the Newtonsoft.Json package, the last uses Roslyn/CSharpScript; both are available from NuGet. Note that none of these print a single apostrophe, for the reason described above. In contrast, if I change the string to "\u3053\u3093\u306B\u3061\u306F\u4E16\u754C!", the debug output window prints this Japanese text: "こんにちは世界!", which Google Translate tells me is the Japanese translation of "Hello World!"
https://translate.google.com/?sl=ja&tl=en&text=%E3%81%93%E3%82%93%E3%81%AB%E3%81%A1%E3%81%AF%E4%B8%96%E7%95%8C!&op=translate
So in summary, whatever generated those strings doesn't seem to be doing anything standard.
// Requires: using System; using System.IO; using Newtonsoft.Json;
//           using Microsoft.CodeAnalysis.CSharp.Scripting; using Microsoft.CodeAnalysis.Scripting;
//           using System.Threading.Tasks;
string test = @"It\u00e2\u0080\u0099s working";

// Using JSON deserialization, since \uXXXX is a valid escape in JavaScript string literals.
// Add starting and ending quotes to make it a string literal definition, then deserialize as string.
var d = Newtonsoft.Json.JsonConvert.DeserializeObject("\"" + test + "\"", typeof(string));
Console.WriteLine(d);
System.Diagnostics.Debug.WriteLine(d);

// Another way of JSON deserialization. If you are reading from a stream (such as a file), this may be better:
TextReader reader = new StringReader("\"" + test + "\"");
JsonTextReader rdr = new JsonTextReader(reader);
rdr.Read();
Console.WriteLine(rdr.Value);
System.Diagnostics.Debug.WriteLine(rdr.Value);

// Lastly, overkill and too heavy: using Roslyn CSharpScript and letting the C# compiler decode the \uXXXX's in a string literal:
ScriptOptions opt = ScriptOptions.Default;
//opt = opt.WithFileEncoding(Encoding.Unicode);
Task<ScriptState<string>> task = Task.Run(async () => await CSharpScript.RunAsync<string>("string str = \"" + test + "\";", opt));
ScriptState<string> s = task.Result;
var ddd = s.Variables[0];
Console.WriteLine(ddd.Value);
System.Diagnostics.Debug.WriteLine(ddd.Value);

How do I read hex sequences like xD0 into a C# string?

I am converting a series of strings that are designed to display correctly using a special font into a unicode version that can be used anywhere. It's just a glorified set of string replaces:
"e]" -> "ἓ"
etc.
I'm reading the text using a StreamReader with the encoding set to UTF-8. That's all working well. But there are some characters used to replace the punctuation marks that just aren't working. I can see them as hex sequences in Notepad++ (encoding set to UTF-8), but when I read them, they all get reduced down to the same character (the 'cannot display' question mark in the black diamond).
StreamReader srnorm = new StreamReader("C:\\Users\\John\\Desktop\\bgt.txt", Encoding.UTF8);
string norm = srnorm.ReadLine();
Should I be reading it as a binary file and working from there or is my encoding very wrong?
When I read that, I get the following:
o]j ouvci. mh. �avpo�la,bh| pollaplasi,ona evn tw/| kairw/| tou,tw| kai. evn tw/| aivw/ni tw/| evrcome,nw| zwh.n aivw,nion�
C# strings use UTF-16; that is how they are stored in memory. Because of this, you should be able to read the string into memory and replace the characters without any issues. You can then write those characters back to a file (UTF-8 is the default character encoding for reading and writing files, if I'm not mistaken). The ?'s just mean that the console you output the string to does not support those characters, or that the bytes are not in a valid encoding.
Here is a good article by Jon Skeet about C#/.NET strings.
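A minimal sketch of the read-replace-write flow described above (assuming the input really is UTF-8; the output file name and the replacement table are illustrative, not from the original question or answer):
// Requires: using System.Collections.Generic; using System.IO; using System.Text;
var replacements = new Dictionary<string, string> { ["e]"] = "ἓ" /* , ... */ };
string text = File.ReadAllText("C:\\Users\\John\\Desktop\\bgt.txt", Encoding.UTF8);
foreach (var pair in replacements)
    text = text.Replace(pair.Key, pair.Value);
File.WriteAllText("C:\\Users\\John\\Desktop\\bgt_converted.txt", text, Encoding.UTF8);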

C# File Method not reading accented character [duplicate]

This question already has answers here:
Using .NET how to convert ISO 8859-1 encoded text files that contain Latin-1 accented characters to UTF-8
(2 answers)
Closed 8 years ago.
I'm using C# to automate an insert into command for a users table, and there's a user whose first name has an accented E, with a grave I believe?
Desirée
Whenever it makes it into the SQL Server table it appears as:
Desir?e
Which data type should I use on this column to ensure that it keeps the accented e?
I've tried varchar and nvarchar, neither seemed to matter.
Code for inserting:
var lines = File.ReadAllLines(users_feed_file);
I believe that there is an encoding issue occurring. When Visual Studio reads my file it reads the name as Desir?e.
So far I've tried the overload of the File method that takes an encoding, using:
Encoding enc = new UTF8Encoding(true, true);
var lines = File.ReadAllLines(users_feed_file,enc);
But this had no effect.
var lines = File.ReadAllLines(users_feed_file, Encoding.UTF8);
Doesn't work either.
SQL Server stores Unicode text as UCS-2/UTF-16, that is, two bytes per character (for characters in the Basic Multilingual Plane). UTF-8 is a variable-length encoding that uses one to four bytes per character as needed. If the character in question (it would be good to post the actual Unicode value) is written as UTF-8 and then read back with a different encoding, the bytes no longer line up with valid, displayable characters, and the result is typically rendered as a question mark. Note that SQL Server is not necessarily storing a question mark; that may just be how whatever text editor you are using renders the garbled character.
Try changing your C# encoding to Encoding.Unicode and see if that helps round-trip the character in question.
The same reasoning applies to characters that ought to fit into one byte but are represented with two bytes by UTF-8. For example, the Unicode value for small e with grave (è) is U+00E8, which UTF-8 encodes as the two bytes C3 A8. Read those two bytes back with a single-byte encoding and you get two unrelated characters instead of the one you stored. A Unicode character table is invaluable when trying to debug extended-character issues.
Note that for the basic Latin (ASCII) range, UTF-8 uses a single byte equal to the character's code point, so those characters round-trip just fine. It's when extended characters come into play that compatibility between the two encodings cannot be assumed.
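A quick illustrative check of why the mismatch garbles things: the same character has different byte representations in UTF-8 and UTF-16, so bytes written with one encoding cannot be read back correctly with the other.
// é (U+00E9) is C3 A9 in UTF-8 but E9 00 in UTF-16 (little-endian).
string name = "Desirée";
Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(name)));    // 44-65-73-69-72-C3-A9-65
Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetBytes(name))); // 44-00-65-00-73-00-69-00-72-00-E9-00-65-00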
Hi, try this code:
var lines = File.ReadAllLines(users_feed_file, Encoding.Unicode);
But note that in Notepad++ you can view the file's encoding; check that first.

ansi to unicode conversion

While parsing certain documents, I get the character code 146, which is actually an ANSI number. When writing the char to a text file, nothing is shown. If we write the char as the Unicode number 8217, the character is displayed fine.
Can anyone give me advice on how to convert the ANSI number 146 to the Unicode number 8217 in C#?
reference: http://www.alanwood.net/demos/ansi.html
Thanks
"ANSI" is really a misnomer - there are many encodings often known as "ANSI". However, if you're sure you need code page 1252, you can use:
Encoding encoding = Encoding.GetEncoding(1252);
using (TextReader reader = new StreamReader(filename, encoding))
{
// Read text and use it
}
or
Encoding encoding = Encoding.GetEncoding(1252);
string text = File.ReadAllText(filename, encoding);
That's for reading a file - writing a file is the same idea. Basically when you're converting from binary (e.g. file contents) to text, use an appropriate Encoding object.
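As a quick illustration of the specific character in the question (a sketch, not part of the original answer): byte 146 in code page 1252 decodes straight to Unicode code point 8217 (U+2019, the right single quotation mark).
byte[] single = { 146 };
string s = Encoding.GetEncoding(1252).GetString(single);
Console.WriteLine((int)s[0]); // 8217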
My recommendation would be to read Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets". There's quite a lot involved in your question, and my experience has been that you'll just struggle against the simple answers if you don't understand these basics. It takes around 15 minutes to read.
