How to read different language files from using c#.net

How to read different language files from using c#.net - c#

We will be given different text files and each file might be in English,arabic,german or French.We have to read the respective files and display the text in UI in the respective file text language.
I am planning to use the below statement to achieve the same. Do I need to do anything in addition here? As far as I know we have ASCII character-set -255 but how about displaying other language characters like Chinese,hindi or german ? Do we need to take special care of these characters?
StreamReader(System.String filepath, System.Text.Encoding.Unicode)

Use UTF-8 or UTF-16 for the encoding, and the text should be displayed as is
StreamReader(System.String filepath, System.Text.Encoding.UTF8)

Related

Read multiple files with different encoding, preserving all characters

I am trying to read a text file and writing to a new text file. The input file could be ANSI or UTF-8. I don't care what the output encoding is but I want to preserve all characters when writing. How to do this? Do I need to get the input file's encoding (seems like alot of work).
The following code reads ANSI file and writes output as UTF-8 but there is some gibberish characters "�".
I am looking for a way to read the file no matter which of the 2 encoding and write it correctly without knowing the encoding of input file before hand.
File.WriteAllText(outputfile,File.ReadAllText(inputfilepath + #"\ST60_0.csv"));
Note that this batch command reads a UTF-8 and ANSI file and writes the output as ANSI with all chars preserved so I'm looking to do this but in C#:
type ST60_0.csv inputUTF.csv > outputBASH.txt

Q: The following code reads ANSI file and writes output as UTF-8 but
there is some giberrish characters "�".
A: It would definitely be useful to see the hex values of some of these "gibberish" characters. Perhaps you could install a Hex plugin to Notepad++ and tell us?
Q: It blows my mind its so hard to do something in C# that command
prompt can do easy
A: Typically, it IS easy. There seems to be "something special" written into this particular file.
The difference between C# and other, "simpler" approaches is that C# (unlike C character I/O or .bat files) gives you the flexibility to deal with text that doesn't happen to be "standard ASCII".
ANYWAY:
If "?" you posted (hex 0xefbfbd) is a valid example of your actual text, this might explain what's going on:
https://stackoverflow.com/a/25510366/421195
... %EF%BF%BD is the url-encoded version of the hex representation of
the 3 bytes (EF BF BD) of the UTF-8 replacement character.
See also:
https://en.wikipedia.org/wiki/Specials_(Unicode_block)
The Replacement character � (often displayed as a black rhombus with a
white question mark) is a symbol found in the Unicode standard at code
point U+FFFD in the Specials table. It is used to indicate problems
when a system is unable to render a stream of data to a correct
symbol.[4] It is usually seen when the data is invalid and does not
match any character
You might also be interested in this:
https://learn.microsoft.com/en-us/dotnet/standard/base-types/character-encoding
Best-Fit Fallback When a character does not have an exact match in the target encoding, the encoder can try to map it to a similar
character.
UPDATE:
The offending character was "»", hex 0xc2bb. This is a "Right Angle Quote", a Guillemet. Angle quotes are the quotation marks used in certain languages with an otherwise roman alphabet, such as French.
One possible solution is to specify "iso-8859-1", vs. the default encoding "UTF-8":
File.WriteAllText(outputfile,File.ReadAllText(inputfilepath + #"\ST60_0.csv", System.Text.Encoding.GetEncoding("iso-8859-1")));

Detect special symbols in c#

I'm working on a c# project in which some data contains characters which are not recognised by the encoding.
They are displayed like that:
"Some text � with special � symbols in it".
I have no control over the encoding process, also data come from files of various origins and various formats.
I want to be able to flag data that contains such characters as erroneous or incomplete. Right now I am able to detect them this way:
if(myString.Contains("�"))
{
//Do stuff
}
While it does work, it doesn't feel quite right to use the weird symbol directly in the Contains function. Isn't there a cleaner way to do this ?
EDIT:
After checking back with the team responsible for reading the files, this is how they do it:
var sr = new StreamReader(filePath, true);
var content = sr.ReadToEnd();
Passing true as a second parameter of StreamReader is supposed to detect the encoding from the file's BOM, and use it to read the content. It doesn't always work though, as some files don't bear that information, hence why their data is read incorrectly.
We've made some tests and using StreamReader(filePath, Encoding.Default) instead appears to work for most if not all files we had issues with. Expectedly, files that were working before not longer work because they do not use the default encoding.
So the best solution for us would be to do the following: read the file trying to detect its encoding, then if it wasn't successful read it again with the default encoding.
The problem remains the same though: how do we check, after trying to detect the file's encoding, if data has been read incorrectly ?

The � character is not a special symbol. It's the Unicode Replacement Character. This means that the code tried to convert ASCII text using the wrong codepage. Any characters that didn't have a match in the codepage were replaced with �.
The solution is to read the file using the correct encoding. The default encoding used by the File methods or StreamReader is UTF8. You can pass a different encoding using the appropriate constructor, eg StreamReader(Stream, Encoding, Boolean). To use the system locale's codepage, you need to use Encoding.Default :
var sr = new StreamReader(filePath,Encoding.Default);
You can use the StreamReader(Stream, Encoding, Boolean) constructor to autodetect Unicode encodings from the BOM and fallback to a different encoding.
Assuming the files are either some type of Unicode or match your system locale, you can use:
var sr = new StreamReader(filePath,Encoding.Default, true);
From StreamReader's source shows that the DetectEncoding method will check the first bytes of a file to determine the encoding. If one is found, it is used instead of the supplied encoding. The operation doesn't cause extra IO because the method checks the class's internal buffer

EDIT
I just realized you can't actually load the raw file into a .NET string and still be able to have full information about the original file.
The project here uses the Mlang api which does a better job at not loading the file into a .NET string before guessing. There is also a related SO question

Open file, read as hex and convert it to ASCII?

Is it possible to read a file hex values into c# and output the corresponding ASCII? I can view the file in a hex editor which I can then see the appropriate ASCII next to the hex but rather than manually copying out the parts I need I imagine there is a way of the machine doing it for me in a c# program?
I did find Converting HEX data in a file to ascii but that didn't really help?

It sounds like you just need:
string text = File.ReadAllText("file.txt");
There's no such thing as "hex values" in a file - they're just bytes which are shown as hex in various editors geared towards editing non-text files.
The above line of code will load a text file, decoding it as UTF-8 - which is compatible with ASCII, so if your file is truly ASCII, it should be fine. If you need to specify a different encoding, you can do it with an overload, e.g.
// Load an ISO-8859-1 file
string text = File.ReadAllText("file.txt", Encoding.GetEncoding(28591));

C#: Load *.txt to RichTextBox and convert into UTF8

I want to open text files and load them into a RichTextBox. This has been going fine so far, but now I'm struggling with an encoding issue.
So I used the GetType() method from this StackOverflow page:
How to find out the Encoding of a File? C#
- and it returns "System.Text.UnicodeEncoding".
My questions now are:
How do I convert Unicode (I guess that's what they are, although I haven't double checked) into UTF8 (and possibly backwards)?
Can I switch the RichTextBox to display Unicode correctly? The following shows awkward results: rtb.LoadFile(aFile, RichTextBoxStreamType.PlainText);
How can I define which encoding a SaveFileDialog should use?

Instead of having the RichTextBox load the file from the disk, load it yourself, while specifying the correct encoding. (By the way, Encoding.Unicode is just a synonym for "UTF-16 little-endian".)
string myText = File.ReadAllText(myFilePath, Encoding.Unicode);
This will take care of the conversion for you. The string you get is encoded "correctly" (i.e. in the format used internally by .NET), so you can just assign it to the Text property of your RichTextBox.
About your third question: The SaveFileDialog is just a tool that lets the user choose a file name. What you do with the file name (like: save some text into it, or encode some string and then save it) has nothing to do with the SaveFileDialog.

The SaveFileDialog just allow you to choose the path where the file will be saved. It doesn't save it for you..
Use Encoding class to convert from an encoding to another.
And read this article for some example on how to convert and write it to a file.

You can also use:
richTextBox.LoadFile(filePath, RichTextBoxStreamType.UnicodePlainText);

OpenFileDialog filename as UTF8

C# question here..
I have a UTF-8 string that is being interpreted by a non-Unicode program in C++.. This text which is displayed improperly, but as far as I can tell, is intact, is then applied as an output filename..
Anyway, in a C# project, I am trying to open this file with an System.Windows.Forms.OpenFileDialog object. The filenames I am getting from this object's .FileNames[] is in Unicode (UCS-2). This string, however, has been misinterpreted.. For example, if the original string was 0xe3 0x81 0x82, a FileName[].ToCharArray() reveals that it is now 0x00e3 0x0081 0x201a .... .. It might seem like the OpenFileDialog object only padded it, but it is not.. In the third character that the OpenFileDialog produced, it is different and I cannot figure out what happened to this byte..
My question is: Is there any way to treat the filenames highlighted in the OpenFileDialog box as UTF-8?
I don't think it's relevant, but if you need to know, the string is in Japanese..
Thanks,
kreb
UPDATE
First of all, thanks to everyone who's offered their suggestions here, they're very much appreciated.
Now, to answer the suggestions to modify the C++ application to handle the strings properly, it doesn't seem to be feasible. It isn't just one application that is doing this to the strings.. There are actually a great number of these applications in my company that I have to work with, and it would take huge amount of manpower and time that simply isn't available. However, sean e's idea would probably be the best choice if I were to take this route..
#Remy Lebeau: I think hit the nail right on the head, I will try your proposed solution and report back.. :) I guess the caveat with your solution is that the Default encoding has to be the same on the C# application environment as the C++ application environment that created the file, which certainly makes sense as it would have to use the same code page..
#Jeff Johnson: I'm not pasting the filenames from the C++ app to the C# app.. I am calling OpenFileDialog.ShowDialog() and getting the OpenFileDialog.FileNames on DialogResult.OK.. I did try to use Encoding.UTF8.GetBytes(), but like Remy Lebeau pointed out, it won't work because the original UTF8 bytes are lost..
#everyone else: Thanks for the ideas.. :)
kreb
UPDATE
#Remy Lebeau: Your idea worked perfectly! As long as the environment of the C++ app is the same as the environment of the C# app is the same (same locale for non-Unicode programs) I am able to retrieve the correct text.. :)
Now I have more problems.. Haha.. Is there any way to determine the encoding of a string? The code now works for UTF8 strings that were mistakenly interpreted as ANSI strings, but screws up UCS-2 strings. I need to be able to determine the encoding and process each accordingly. GetEncoding() doesn't seem to be useful.. =/ And neither is StreamReader's CurrentEncoding property (always says UTF-8)..
P.S. Should I open this new question in a new post?

0x201a is the Unicode "low single comma quotation mark" character. 0x82 is the Latin-1 (ISO-8859-1, Windows codepage 1252) encoding of that character. That means the bytes of the filename are being interpretted as plain Ansi instead of as UTF-8, and thus being decoded from Ansi to Unicode accordingly. That is not surprising, as the filesystem has no concept of UTF-8, and Windows assumes non-Unicode filenames are using the OS's default Ansi encoding.
To do what you are looking for, you need access to the original UTF-8 encoded bytes so you can decode them properly. One thing you can try is to pass the FileName to the GetBytes() method of System.Text.Encoding.Default (in theory, that is using the same encoding that was used to decode the filename, so it should be able to produce the same bytes as the original), and then pass the resulting bytes to the GetString() method of System.Text.Encoding.UTF8.

I think your problem is at the begining:
I have a UTF-8 string that is being
interpreted by a non-Unicode program
in C++.. This text which is displayed
improperly, but as far as I can tell,
is intact, is then applied as an
output filename..
If you load a UTF-8 string with a non-unicode program and then serialize it, it will contain non-unicode chars.
Is there any way that your C++ program can handle Unicode?

Can you use members of the System.Text namespace (e.g., the UTF8Encoding class) to convert the .NET framework's internal string representation to/ from a byte array containing the text in the encoding of your choice?

If you are sure that the C++ output is fine, then in your C# app you should convert it from UTF-8 to UTF-16 using the .NET encoding class and just work with it in the Windows native format.
If you can modify the C++ app, that might be better - give the C# app input that doesn't need to be re-encoded. In it, the UTF8 to Unicode translation can be handled via MultiByteToWideChar, using CP_UTF8 for the CodePage parameter, but it only works when none of the flags are set for dwFlags (specify 0 for dwFlags). The whole app doesn't need to be Unicode. Even though it is not compiled unicode, you can make selective use of Unicode APIs.

In answer to your question "is there a way to treat the filenames as utf-8?" Try this code:
List<byte[]> utf8FileNames = new List<byte[]>();
foreach (string fileName in openFileDialog1.FileNames)
{
utf8FileNames.Add(Encoding.UTF8.GetBytes(fileName));
}
// Each byte array in utf8FileNames is a sequence of utf-8 bytes matching each file name chosen
What do you do with the file names once you have got them from the open file dialog? Can you post that code?

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.