C#: Load *.txt to RichTextBox and convert into UTF8

I want to open text files and load them into a RichTextBox. This has been going fine so far, but now I'm struggling with an encoding issue.
So I used the GetType() method from this StackOverflow page:
How to find out the Encoding of a File? C#
- and it returns "System.Text.UnicodeEncoding".
My questions now are:
How do I convert Unicode (I guess that's what they are, although I haven't double checked) into UTF8 (and possibly backwards)?
Can I switch the RichTextBox to display Unicode correctly? The following shows awkward results: rtb.LoadFile(aFile, RichTextBoxStreamType.PlainText);
How can I define which encoding a SaveFileDialog should use?

Instead of having the RichTextBox load the file from the disk, load it yourself, while specifying the correct encoding. (By the way, Encoding.Unicode is just a synonym for "UTF-16 little-endian".)
string myText = File.ReadAllText(myFilePath, Encoding.Unicode);
This will take care of the conversion for you. The string you get is encoded "correctly" (i.e. in the format used internally by .NET), so you can just assign it to the Text property of your RichTextBox.
About your third question: The SaveFileDialog is just a tool that lets the user choose a file name. What you do with the file name (like: save some text into it, or encode some string and then save it) has nothing to do with the SaveFileDialog.
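For illustration, here is a minimal sketch of that idea; the control name rtb and the choice of UTF-8 are assumptions, not part of the original question:
using System.IO;
using System.Text;
using System.Windows.Forms;

// The dialog only supplies a path; the encoding is decided when you write the file yourself.
using (var dialog = new SaveFileDialog())
{
    if (dialog.ShowDialog() == DialogResult.OK)
    {
        // Save the RichTextBox content as UTF-8; pass Encoding.Unicode here for UTF-16 instead.
        File.WriteAllText(dialog.FileName, rtb.Text, Encoding.UTF8);
    }
}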

The SaveFileDialog just allows you to choose the path where the file will be saved. It doesn't save the file for you.
Use the Encoding class to convert from one encoding to another.
And read this article for some examples of how to convert text and write it to a file.
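For example, a minimal sketch of such a conversion with Encoding.Convert; the file names are placeholders:
using System.IO;
using System.Text;

// Read the raw UTF-16 (Encoding.Unicode) bytes and re-encode them as UTF-8.
byte[] unicodeBytes = File.ReadAllBytes("input.txt");
byte[] utf8Bytes = Encoding.Convert(Encoding.Unicode, Encoding.UTF8, unicodeBytes);
// Note: a UTF-16 BOM at the start of the input is carried over as a UTF-8 BOM.
File.WriteAllBytes("output.txt", utf8Bytes);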

You can also use:
richTextBox.LoadFile(filePath, RichTextBoxStreamType.UnicodePlainText);

Related

Detect special symbols in c#

I'm working on a c# project in which some data contains characters which are not recognised by the encoding.
They are displayed like that:
"Some text � with special � symbols in it".
I have no control over the encoding process; also, the data comes from files of various origins and in various formats.
I want to be able to flag data that contains such characters as erroneous or incomplete. Right now I am able to detect them this way:
if (myString.Contains("�"))
{
    // Do stuff
}
While it does work, it doesn't feel quite right to use the weird symbol directly in the Contains call. Isn't there a cleaner way to do this?
EDIT:
After checking back with the team responsible for reading the files, this is how they do it:
var sr = new StreamReader(filePath, true);
var content = sr.ReadToEnd();
Passing true as the second parameter of StreamReader is supposed to detect the encoding from the file's BOM and use it to read the content. It doesn't always work, though, as some files don't carry that information, which is why their data is read incorrectly.
We've made some tests, and using StreamReader(filePath, Encoding.Default) instead appears to work for most if not all of the files we had issues with. As expected, files that were working before no longer work, because they do not use the default encoding.
So the best solution for us would be to do the following: read the file while trying to detect its encoding, then, if that wasn't successful, read it again with the default encoding.
The problem remains the same, though: how do we check, after trying to detect the file's encoding, whether the data has been read incorrectly?
The � character is not a special symbol. It's the Unicode Replacement Character. It means the code tried to decode the text using the wrong codepage: any bytes that didn't map to a character in that codepage were replaced with �.
The solution is to read the file using the correct encoding. The default encoding used by the File methods and StreamReader is UTF-8. You can pass a different encoding using the appropriate constructor, e.g. StreamReader(Stream, Encoding, Boolean). To use the system locale's codepage, you need to use Encoding.Default:
var sr = new StreamReader(filePath,Encoding.Default);
You can use the StreamReader(Stream, Encoding, Boolean) constructor to autodetect Unicode encodings from the BOM and fallback to a different encoding.
Assuming the files are either some type of Unicode or match your system locale, you can use:
var sr = new StreamReader(filePath,Encoding.Default, true);
StreamReader's source shows that its DetectEncoding method checks the first bytes of the file to determine the encoding. If one is found, it is used instead of the supplied encoding. The operation doesn't cause extra IO because the method checks the class's internal buffer.
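As a sketch of how those two ideas might be combined (not the original poster's code): read with BOM detection and a fallback encoding, then flag suspect data by checking for the replacement character U+FFFD instead of pasting the symbol literally.
using System.IO;
using System.Text;

string content;
// Honour a BOM if one is present; otherwise fall back to the system ANSI codepage.
using (var sr = new StreamReader(filePath, Encoding.Default, detectEncodingFromByteOrderMarks: true))
{
    content = sr.ReadToEnd();
}
// "\uFFFD" is the Unicode replacement character; its presence suggests the text was decoded incorrectly.
bool looksBroken = content.Contains("\uFFFD");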
EDIT
I just realized you can't actually load the raw file into a .NET string and still be able to have full information about the original file.
The project here uses the MLang API, which does a better job because it avoids loading the file into a .NET string before guessing. There is also a related SO question.

Open file, read as hex and convert it to ASCII?

Is it possible to read a file's hex values into C# and output the corresponding ASCII? I can view the file in a hex editor, where I can see the appropriate ASCII next to the hex, but rather than manually copying out the parts I need, I imagine there is a way to have a C# program do it for me?
I did find Converting HEX data in a file to ascii but that didn't really help.
It sounds like you just need:
string text = File.ReadAllText("file.txt");
There's no such thing as "hex values" in a file - they're just bytes which are shown as hex in various editors geared towards editing non-text files.
The above line of code will load a text file, decoding it as UTF-8 - which is compatible with ASCII, so if your file is truly ASCII, it should be fine. If you need to specify a different encoding, you can do it with an overload, e.g.
// Load an ISO-8859-1 file
string text = File.ReadAllText("file.txt", Encoding.GetEncoding(28591));
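If you do want both views of the same data, a small sketch along these lines should work; the file name is a placeholder:
using System;
using System.IO;
using System.Text;

// The "hex" view and the ASCII view are just two presentations of the same bytes.
byte[] bytes = File.ReadAllBytes("file.txt");
string hexView = BitConverter.ToString(bytes);      // e.g. "48-65-6C-6C-6F"
string asciiView = Encoding.ASCII.GetString(bytes); // e.g. "Hello"
Console.WriteLine(hexView);
Console.WriteLine(asciiView);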

Convert file path to UTF-8

I want to get, print, and write to a text file the full path on disk of a file named A&T+X-8_L_R1.png, but when I print it I get A&amp;T+X-8_L_R1.png.
AFAIK I need to change the encoding. I did a search and found this potential solution but it doesn't work:
String filePathString = relativeUri.ToString();
byte[] bytes = Encoding.Default.GetBytes(filePathString);
filePathString = Encoding.UTF8.GetString(bytes);
filePathNode.SetValue(filePathString);
This is the full code of my class: http://pastebin.com/dZLGeS8p
The class searches recursively for *.png files and creates an XML structure from their paths. When I save the XML file the special characters from the paths like & are changed.
Can anyone point me to a solution?
You are writing an XML file, not a plain text file. In XML, an ampersand needs to be escaped to &amp;.
So the result you get is perfectly ok. It's even required to be like this.
I recommend opening the XML file with an application that can properly validate and display XML. It'll be easier to see that the file is correct.
The UTF-8 conversion in your code isn't required. If the XML file is encoded in UTF-8, your XML classes will take care of any required conversions.
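To illustrate, here is a hedged sketch using System.Xml.Linq; the element and file names are invented for the example and are not taken from the linked class:
using System;
using System.Xml.Linq;

var filePathNode = new XElement("filePath", "A&T+X-8_L_R1.png");
var doc = new XDocument(new XElement("files", filePathNode));
doc.Save("files.xml"); // in the saved file the value appears as A&amp;T+X-8_L_R1.png

// Reading it back restores the original, unescaped string.
var reloaded = XDocument.Load("files.xml");
Console.WriteLine(reloaded.Root.Element("filePath").Value); // prints A&T+X-8_L_R1.png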

How to read different language files from using c#.net

We will be given different text files, and each file might be in English, Arabic, German, or French. We have to read each file and display its text in the UI in that file's language.
I am planning to use the statement below to achieve this. Do I need to do anything in addition here? As far as I know we have the ASCII character set (0-255), but what about displaying characters from other languages like Chinese, Hindi, or German? Do we need to take special care of these characters?
var reader = new StreamReader(filePath, System.Text.Encoding.Unicode);
Use UTF-8 or UTF-16 for the encoding, and the text should be displayed as-is:
var reader = new StreamReader(filePath, System.Text.Encoding.UTF8);
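A minimal sketch under the assumption that the files are UTF-8 or carry a BOM; textBox and filePath are placeholder names:
using System.IO;
using System.Text;

string text;
// Fall back to the BOM (if any) so UTF-16 files are also picked up correctly.
using (var reader = new StreamReader(filePath, Encoding.UTF8, detectEncodingFromByteOrderMarks: true))
{
    text = reader.ReadToEnd();
}
textBox.Text = text; // any Unicode-aware control will display Arabic, Chinese, etc. as-is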

OpenFileDialog filename as UTF8

C# question here..
I have a UTF-8 string that is being interpreted by a non-Unicode program in C++. This text, which is displayed improperly but as far as I can tell is intact, is then used as an output filename.
Anyway, in a C# project, I am trying to open this file with a System.Windows.Forms.OpenFileDialog object. The filenames I am getting from this object's .FileNames[] are in Unicode (UCS-2). This string, however, has been misinterpreted. For example, if the original string was 0xe3 0x81 0x82, a FileName[].ToCharArray() reveals that it is now 0x00e3 0x0081 0x201a. It might seem like the OpenFileDialog object only padded it, but it did not: the third character produced by the OpenFileDialog is different, and I cannot figure out what happened to this byte.
My question is: Is there any way to treat the filenames highlighted in the OpenFileDialog box as UTF-8?
I don't think it's relevant, but if you need to know, the string is in Japanese..
Thanks,
kreb
UPDATE
First of all, thanks to everyone who's offered their suggestions here, they're very much appreciated.
Now, to answer the suggestions to modify the C++ application to handle the strings properly: it doesn't seem to be feasible. It isn't just one application that is doing this to the strings. There are actually a great number of these applications in my company that I have to work with, and it would take a huge amount of manpower and time that simply isn't available. However, sean e's idea would probably be the best choice if I were to take this route.
@Remy Lebeau: I think you hit the nail right on the head; I will try your proposed solution and report back. :) I guess the caveat with your solution is that the Default encoding has to be the same in the C# application environment as in the C++ application environment that created the file, which certainly makes sense as it would have to use the same code page.
@Jeff Johnson: I'm not pasting the filenames from the C++ app into the C# app. I am calling OpenFileDialog.ShowDialog() and getting OpenFileDialog.FileNames on DialogResult.OK. I did try to use Encoding.UTF8.GetBytes(), but like Remy Lebeau pointed out, it won't work because the original UTF-8 bytes are lost.
@everyone else: Thanks for the ideas. :)
kreb
UPDATE
@Remy Lebeau: Your idea worked perfectly! As long as the environment of the C++ app is the same as the environment of the C# app (same locale for non-Unicode programs), I am able to retrieve the correct text. :)
Now I have more problems.. Haha.. Is there any way to determine the encoding of a string? The code now works for UTF-8 strings that were mistakenly interpreted as ANSI strings, but screws up UCS-2 strings. I need to be able to determine the encoding and process each accordingly. GetEncoding() doesn't seem to be useful.. =/ And neither is StreamReader's CurrentEncoding property (it always says UTF-8).
P.S. Should I open this new question in a new post?
0x201A is the Unicode single low-9 quotation mark character. 0x82 is the Windows codepage 1252 (often loosely called Latin-1) encoding of that character. That means the bytes of the filename are being interpreted as plain ANSI instead of as UTF-8, and thus being decoded from ANSI to Unicode accordingly. That is not surprising, as the filesystem has no concept of UTF-8, and Windows assumes non-Unicode filenames use the OS's default ANSI encoding.
To do what you are looking for, you need access to the original UTF-8 encoded bytes so you can decode them properly. One thing you can try is to pass the FileName to the GetBytes() method of System.Text.Encoding.Default (in theory, that is using the same encoding that was used to decode the filename, so it should be able to produce the same bytes as the original), and then pass the resulting bytes to the GetString() method of System.Text.Encoding.UTF8.
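A minimal sketch of that round trip, where fileName is assumed to be one entry from OpenFileDialog.FileNames:
using System.Text;

// Re-encode the mis-decoded name back to its ANSI bytes, then decode those bytes as UTF-8.
byte[] ansiBytes = Encoding.Default.GetBytes(fileName);
string utf8FileName = Encoding.UTF8.GetString(ansiBytes);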
I think your problem is at the beginning:
"I have a UTF-8 string that is being interpreted by a non-Unicode program in C++. This text, which is displayed improperly but as far as I can tell is intact, is then used as an output filename."
If you load a UTF-8 string with a non-Unicode program and then serialize it, it will contain non-Unicode characters.
Is there any way that your C++ program can handle Unicode?
Can you use members of the System.Text namespace (e.g., the UTF8Encoding class) to convert the .NET Framework's internal string representation to/from a byte array containing the text in the encoding of your choice?
If you are sure that the C++ output is fine, then in your C# app you should convert it from UTF-8 to UTF-16 using the .NET encoding class and just work with it in the Windows native format.
If you can modify the C++ app, that might be better: give the C# app input that doesn't need to be re-encoded. In it, the UTF-8 to Unicode translation can be handled via MultiByteToWideChar, using CP_UTF8 for the CodePage parameter, but it only works when none of the flags are set for dwFlags (specify 0 for dwFlags). The whole app doesn't need to be Unicode: even though it is not compiled as Unicode, you can make selective use of Unicode APIs.
In answer to your question "is there a way to treat the filenames as utf-8?" Try this code:
List<byte[]> utf8FileNames = new List<byte[]>();
foreach (string fileName in openFileDialog1.FileNames)
{
    utf8FileNames.Add(Encoding.UTF8.GetBytes(fileName));
}
// Each byte array in utf8FileNames is a sequence of UTF-8 bytes matching each file name chosen
What do you do with the file names once you have got them from the open file dialog? Can you post that code?
