Writing, displaying, and storing Japanese characters in C#

I am working on a project which requires lots of Japanese katakana, hiragana, and kanji characters. The original files are Excel files using the "MS Pゴシック" font. The problem I am having seems to be the same as everyone else's with this type of issue and C#. The solutions I have found all seem to start by adding the text inside the C# program itself.
What I am trying to do is read one of my .xls or .txt files into C# and work with the data using normal C# functions such as string comparison. However, when I do this, nothing happens: writing or displaying the data produces "?" marks. Nothing new here.
I tried the same idea with C++ and it works perfectly.
The problem is it has to be C#, not C++, in order to work with the interops for the other software I am utilizing.
Long story short, does C# (System.String) not handle Unicode natively, compared to C++ (C strings)?
I am using Visual Studio C++ 2008 Express and Visual Studio C# 2010 Express.
The files are the same, but it works in C++ and not in C#.
Sorry, I haven't used English in a while.
I have tried various approaches; the code below is the latest, but the output is still "?" marks.
var reader = new StreamReader(File.OpenRead(@"C:\smallerBunShou.txt"), Encoding.UTF8);
while (!reader.EndOfStream)
{
    var line = reader.ReadLine();
    var values = line.Split(',');
    listA.Add(values[0]);
    // listB.Add(values[1]);
    // listC.Add(values[2]);
}
int sizeOflistA = listA.Count();
//using (System.IO.StreamWriter file = new System.IO.StreamWriter(@"C:\WriteLines2.txt"))
using (var file = new StreamWriter(File.OpenWrite(@"C:\WriteLines2.txt"), Encoding.UTF8))
{
    foreach (string line in listA)
    {
        // If the line doesn't contain the word 'Second', write the line to the file.
        if (!line.Contains("Second"))
        {
            file.WriteLine(line);
        }
    }
}
I have also tried Encoding.Unicode, etc.
My computer is a Japanese PC and the software is mostly Japanese. According to one of the answers so far, it is not a Unicode issue: Japanese PCs use Shift-JIS extensively, which is most likely what I need to look into. When I solve this I will post my solution.
Update:
After looking around a bit, I found the Shift-JIS encoding scheme.
Encoding.GetEncoding(932)
This solved my problem! Thank you @EricFalsken for pointing me in the right direction.
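For completeness, this is roughly what the working version can look like - a minimal sketch rather than the exact final code, assuming the input file really is Shift-JIS encoded and using the same paths as above:

var shiftJis = Encoding.GetEncoding(932); // Shift-JIS codepage

// Read the Shift-JIS input and write the filtered lines back out, also as Shift-JIS.
using (var reader = new StreamReader(@"C:\smallerBunShou.txt", shiftJis))
using (var writer = new StreamWriter(@"C:\WriteLines2.txt", false, shiftJis))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        var values = line.Split(',');
        // Skip lines containing the word 'Second', as in the snippet above.
        if (!values[0].Contains("Second"))
            writer.WriteLine(values[0]);
    }
}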

Normal .txt files are not saved in Unicode format. You're going to need to specify the byte format when reading the FileStream by running it through a TextReader with an explicit Encoding, such as Encoding.Unicode.
But note that most Japanese computers and documents do NOT use Unicode; they still use Shift-JIS quite extensively.
I can assure you that all strings in C# support Unicode natively.

Related

UTF-8 results in question marks

Writing and reading text in UTF-8 results in question marks.
I have an application that exports data from my database into CSV files. These files can then be imported again, acting like a backup. Now I'm having a problem encoding Swedish characters such as "åäö". In my first solution I wrote the file using the encoding of the StringWriter (see code). When I then looked at the file in Notepad++, it showed up as UCS-2 Little Endian.
This file looks fine; all special characters are rendered properly. However, I started to run into issues when importing the file again. As I'm using Encoding.UTF8.GetString(), it of course produces the wrong characters because it is the wrong encoding. A solution here is using UCS-2 Little Endian, of course. Problem is, I don't want to use it!
So a thing I tried was changing my original encoding when writing to be UTF-8 instead of result.Encoding, i.e. Encoding.UTF8. However, when I open the file in Excel, åäö is replaced by question marks.
So my question is: how do I write and read with UTF-8 encoding so that special letters work properly in both Excel and Notepad? Maybe it's better to use Unicode?
public void ExportAttendees()
{
    var result = new StringWriter();
    Response.ContentEncoding = result.Encoding; // Encoding.UTF8
}

public ActionResult ImportAttendees(HttpPostedFileBase file)
{
    var content = Encoding.UTF8.GetString(binData).Replace("\0", "").Replace("\r", ""); // Here I could use something else?
    // Code omitted
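(Not part of the original question, just an illustration.) A common way to make Excel recognise a UTF-8 CSV is to write it with a UTF-8 byte-order mark. A minimal sketch, assuming a plain file export rather than the MVC Response above, with a hypothetical path:

// new UTF8Encoding(true) emits a BOM, which Excel uses to detect UTF-8.
var utf8WithBom = new UTF8Encoding(true);
File.WriteAllText(@"C:\temp\attendees.csv", "Namn;Stad\r\nÅsa;Örebro\r\n", utf8WithBom);

// Reading it back: StreamReader/ReadAllText honour the BOM, so UTF-8 round-trips cleanly.
string content = File.ReadAllText(@"C:\temp\attendees.csv", Encoding.UTF8);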

Before reading a file, do I have to check ANSI encoding?

I'm reading some CSV files. The files are really easy, because there is always just ";" as the separator and there are no ", ', or anything like that.
So it's possible to read the file line by line and separate the strings. That's working fine. Now people told me: maybe you should check the encoding of the file; it should always be ANSI, and if it's not, your output may be different and corrupted. So non-ANSI files should be marked somehow.
I just said, okay! But if I think about it: do I really have to check the file for encoding in this case? I just changed the encoding of the file to something else and I'm still able to read the file without any problems. My code is simple:
using (TextReader reader = new StreamReader(myFileStream))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // read the line, separate by ; and other stuff...
    }
}
So again: do I really need to check the files for ANSI encoding? Could somebody give me an example of when I could get in trouble, or when I would get corrupted output after reading a non-ANSI file? Thank you!
That particular constructor of StreamReader will assume that the data is UTF-8; that is compatible with ASCII, but can fail if the data uses bytes in the 128-255 range for single-byte codepages (you'll get the wrong characters in strings, etc.), or could fail completely (i.e. throw an exception) if the data is actually something very different like UTF-7, UTF-32, etc.
In some cases (the minority) you might be able to use the byte-order mark to detect the encoding, but this is a circular problem: in most cases, if you don't already know the encoding, you can't really detect the encoding (robustly). So a better approach would be to know the encoding in the first place. Then you can pass the correct encoding to use via one of the other constructors.
Here's an example of it failing:
// we'll write UTF-32, big-endian, without a byte-order-mark
File.WriteAllText("my.txt", "Hello world", new UTF32Encoding(true, false));
using (var reader = new StreamReader("my.txt"))
{
    string s = reader.ReadLine();
}
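The fix for this contrived example is simply to tell the reader the encoding that was actually used, via one of the other StreamReader constructors:

// same file as above, but now the reader knows the real encoding up front
using (var reader = new StreamReader("my.txt", new UTF32Encoding(true, false)))
{
    string s = reader.ReadLine(); // "Hello world"
}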
You can often get away with UTF-8, because UTF-8 has a wonderful property: it encodes ASCII characters as a single byte each (as you would expect), and only expands to multi-byte sequences when it needs to represent other Unicode characters.
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

How to read an ANSI encoded file containing special characters

I'm writing a TFS check-in policy which checks whether our source files contain our file header.
My problem is that our file header contains a special character "©", and unfortunately some of our source files are encoded in ANSI.
So if I read these files in the policy, the string looks like this "Copyright � 2009".
string content = File.ReadAllText(pendingChange.LocalItem);
I tried to change the encoding of the string, but it does not help. So how can I read these files so that I get the correct string "Copyright © 2009"?
Use Encoding.Default:
string content = File.ReadAllText(pendingChange.LocalItem, Encoding.Default);
You should be aware, however, that this reads the file using the system default encoding - which may not be the same as the encoding of the file. There's no single encoding called ANSI, but usually when people talk about "the ANSI encoding" they mean Windows code page 1252, or whatever their box happens to use.
Your code will be more robust if you can find out the exact encoding used.
It would seem sensible, if you are going to have such policies, that you would also have a team-agreed standard encoding. To be honest, I can't see why any team would use an encoding other than "Unicode (UTF-8 with signature) - Codepage 65001" (except perhaps for ASPX pages with significant non-Latin static content, but even then I can't see how it would be a big deal to use UTF-8).
Assuming you still want to allow mixed encodings, you then need a way to determine which encoding a file was saved in so you know which encoding to pass to ReadAllText. It's not easy to determine this from the file; however, using Encoding.Default is likely to work OK, since it's most likely you have just two encodings to deal with: the VS one (UTF-8 with signature) and a common ANSI encoding used by your machines (probably Windows-1252).
Hence using
string content = File.ReadAllText(pendingChange.LocalItem, Encoding.Default);
will work. (As I see Jon has already posted.) This works because when the UTF-8 BOM (which is what VS means by the term "signature") is present at the start of the file, the supplied encoding parameter is ignored and UTF-8 is used anyway. Hence where the file was saved using UTF-8 you get correct results, and where ANSI was used you are most likely also to get correct results.
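A small illustration of that BOM behaviour (hypothetical paths; the ANSI case only round-trips if the machine's default codepage really is 1252):

// A UTF-8 file with signature, and an "ANSI" (Windows-1252) file:
File.WriteAllText(@"C:\temp\utf8.cs", "// Copyright © 2009", new UTF8Encoding(true));
File.WriteAllText(@"C:\temp\ansi.cs", "// Copyright © 2009", Encoding.GetEncoding(1252));

// Both read back correctly with Encoding.Default on a Windows-1252 machine:
// the BOM overrides the supplied encoding for the first file,
// and Encoding.Default matches the encoding of the second.
string a = File.ReadAllText(@"C:\temp\utf8.cs", Encoding.Default);
string b = File.ReadAllText(@"C:\temp\ansi.cs", Encoding.Default);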
BTW, if you are processing file headers, wouldn't ReadAllLines make things easier?
I know this is an old question, but I ran into a similar situation and found the accepted answer to be cutting some corners (no disrespect to Jon Skeet's pragmatic short answer, but I'll flesh it out a little more)...
The specs state that the header will contain the encoding directly after {\rtf:
\ansi ANSI (the default)
\mac Apple Macintosh
\pc IBM PC code page 437
\pca IBM PC code page 850, used by IBM Personal System/2 (not implemented in version 1 of Microsoft Word for OS/2)
According to Wikipedia, the "ANSI character set" has no well-defined meaning.
For the default ANSI you have the choice of these partially incompatible encodings:
using System.Text;
...
string content = File.ReadAllText(filename, Encoding.GetEncoding("ISO-8859-1"));
or
string content = File.ReadAllText(filename, Encoding.GetEncoding("Windows-1252"));
Using WordPad on Windows 10 to save a file containing a euro sign (0x80 in Windows-1252, a character that ISO-8859-1 does not have at all) revealed the following:
The header stated the exact encoding after \ansi:
{\rtf1\ansi\ansicpg1252\deff0\nouicompat\deflang1043{ ...
And the character was not written directly; instead it was wrapped in an RTF escape: \'80
according to the specs:
\'hh : A hexadecimal value, based on the specified character set (may be used to identify 8-bit values).
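For what it's worth, once a codepage has been chosen, the \'hh escapes themselves are easy to decode. A hypothetical helper, not part of the original answer:

using System;
using System.Text;
using System.Text.RegularExpressions;

static class RtfHex
{
    // \'hh is a single byte in the document's charset, written as two hex digits.
    public static string Decode(string rtf, Encoding codepage) =>
        Regex.Replace(rtf, @"\\'([0-9a-fA-F]{2})", m =>
            codepage.GetString(new[] { Convert.ToByte(m.Groups[1].Value, 16) }));
}

// RtfHex.Decode(@"\'80", Encoding.GetEncoding(1252))         -> "€"
// RtfHex.Decode(@"\'80", Encoding.GetEncoding("ISO-8859-1")) -> a C1 control character, not "€"

This also illustrates why the two "ANSI" candidates above are only partially compatible.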
I guess the best thing to do is read the header: if the file starts with {\rtf1\ansi\ansicpg1252 then go for Windows-1252.
But to make things more complicated, the specs also state that there can be mixed encodings... search for '\upr'...
I guess there is no definitive answer; the easiest way to go in your case may be to search (in the un-decoded raw byte array) for all the variations of the encoded copyright sign that you may encounter in your source base.
In my case I finally decided to cut a few corners as well, but added a small percentage of defensive coding. All files I have seen so far were Windows-1252, so I common-case-optimised for that.
// Default to Windows-1252, the common case in this code base.
Encoding encoding = Encoding.GetEncoding("Windows-1252", EncoderFallback.ReplacementFallback, DecoderFallback.ReplacementFallback);

using (System.IO.StreamReader reader = new System.IO.StreamReader(filename, encoding))
{
    string header = reader.ReadLine();
    if (!header.Contains("cpg1252"))
    {
        // Fall back to the code page implied by the RTF charset keyword.
        if (header.Contains("\\pca"))
            encoding = Encoding.GetEncoding(850, EncoderFallback.ReplacementFallback, DecoderFallback.ReplacementFallback);
        else if (header.Contains("\\pc"))
            encoding = Encoding.GetEncoding(437, EncoderFallback.ReplacementFallback, DecoderFallback.ReplacementFallback);
        else
            encoding = Encoding.GetEncoding("ISO-8859-1", EncoderFallback.ReplacementFallback, DecoderFallback.ReplacementFallback);
    }
}

string content = System.IO.File.ReadAllText(filename, encoding);

OpenFileDialog filename as UTF8

C# question here..
I have a UTF-8 string that is being interpreted by a non-Unicode program in C++. This text, which is displayed improperly but is intact as far as I can tell, is then used as an output filename.
Anyway, in a C# project I am trying to open this file with a System.Windows.Forms.OpenFileDialog object. The filenames I am getting from this object's .FileNames[] are in Unicode (UCS-2). This string, however, has been misinterpreted. For example, if the original string was 0xe3 0x81 0x82, a FileName[].ToCharArray() reveals that it is now 0x00e3 0x0081 0x201a. It might seem like the OpenFileDialog object only padded it, but it didn't: the third character it produced is different, and I cannot figure out what happened to that byte.
My question is: is there any way to treat the filenames highlighted in the OpenFileDialog box as UTF-8?
I don't think it's relevant, but if you need to know, the string is in Japanese.
Thanks,
kreb
UPDATE
First of all, thanks to everyone who's offered their suggestions here, they're very much appreciated.
Now, to answer the suggestions to modify the C++ application to handle the strings properly: it doesn't seem to be feasible. It isn't just one application that is doing this to the strings; there are actually a great number of these applications in my company that I have to work with, and it would take a huge amount of manpower and time that simply isn't available. However, sean e's idea would probably be the best choice if I were to take this route.
@Remy Lebeau: I think you hit the nail right on the head; I will try your proposed solution and report back. :) I guess the caveat with your solution is that the default encoding has to be the same in the C# application environment as in the C++ application environment that created the file, which certainly makes sense as it would have to use the same code page.
@Jeff Johnson: I'm not pasting the filenames from the C++ app to the C# app. I am calling OpenFileDialog.ShowDialog() and getting OpenFileDialog.FileNames on DialogResult.OK. I did try to use Encoding.UTF8.GetBytes(), but like Remy Lebeau pointed out, it won't work because the original UTF-8 bytes are lost.
@everyone else: Thanks for the ideas. :)
kreb
UPDATE
@Remy Lebeau: Your idea worked perfectly! As long as the environment of the C++ app and the environment of the C# app are the same (same locale for non-Unicode programs), I am able to retrieve the correct text. :)
Now I have more problems... Haha. Is there any way to determine the encoding of a string? The code now works for UTF-8 strings that were mistakenly interpreted as ANSI strings, but screws up UCS-2 strings. I need to be able to determine the encoding and process each accordingly. GetEncoding() doesn't seem to be useful... =/ And neither is StreamReader's CurrentEncoding property (it always says UTF-8).
P.S. Should I open this new question in a new post?
0x201a is the Unicode "low single comma quotation mark" character. 0x82 is the Windows codepage 1252 encoding of that character. That means the bytes of the filename are being interpreted as plain ANSI instead of as UTF-8, and thus being decoded from ANSI to Unicode accordingly. That is not surprising, as the filesystem has no concept of UTF-8, and Windows assumes non-Unicode filenames are using the OS's default ANSI encoding.
To do what you are looking for, you need access to the original UTF-8 encoded bytes so you can decode them properly. One thing you can try is to pass the FileName to the GetBytes() method of System.Text.Encoding.Default (in theory, that is using the same encoding that was used to decode the filename, so it should be able to produce the same bytes as the original), and then pass the resulting bytes to the GetString() method of System.Text.Encoding.UTF8.
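A minimal sketch of that round trip (using the openFileDialog1 instance from the question; it only recovers the right bytes when the system ANSI codepage is the one that mangled the name in the first place):

foreach (string fileName in openFileDialog1.FileNames)
{
    // Re-encode with the system ANSI codepage to get back the original raw bytes (in theory)...
    byte[] rawBytes = Encoding.Default.GetBytes(fileName);
    // ...then decode those bytes as the UTF-8 they really are.
    string utf8Name = Encoding.UTF8.GetString(rawBytes);
}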
I think your problem is at the beginning:
"I have a UTF-8 string that is being interpreted by a non-Unicode program in C++. This text which is displayed improperly, but as far as I can tell is intact, is then applied as an output filename."
If you load a UTF-8 string with a non-Unicode program and then serialize it, it will contain non-Unicode chars.
Is there any way that your C++ program can handle Unicode?
Can you use members of the System.Text namespace (e.g., the UTF8Encoding class) to convert the .NET Framework's internal string representation to/from a byte array containing the text in the encoding of your choice?
If you are sure that the C++ output is fine, then in your C# app you should convert it from UTF-8 to UTF-16 using the .NET encoding class and just work with it in the Windows native format.
If you can modify the C++ app, that might be better - give the C# app input that doesn't need to be re-encoded. In the C++ app, the UTF-8 to Unicode translation can be handled via MultiByteToWideChar, using CP_UTF8 for the CodePage parameter, but it only works when none of the flags are set for dwFlags (specify 0 for dwFlags). The whole app doesn't need to be Unicode: even though it is not compiled as Unicode, you can make selective use of Unicode APIs.
In answer to your question "is there a way to treat the filenames as utf-8?" Try this code:
List<byte[]> utf8FileNames = new List<byte[]>();
foreach (string fileName in openFileDialog1.FileNames)
{
    utf8FileNames.Add(Encoding.UTF8.GetBytes(fileName));
}
// Each byte array in utf8FileNames is a sequence of UTF-8 bytes matching each file name chosen
What do you do with the file names once you have got them from the open file dialog? Can you post that code?

I think this is some kind of encoding problem

I have two computers, both running WinXP SP2 (I don't really know how similar they are beyond that). I am running MS Visual C# 2008 Express Edition on both, and that's what I'm currently using to program.
I made an application that loads in an XML file and displays the contents in a DataGridView.
The first line of my xml file is:
<?xml version="1.0" encoding="utf-8"?>
...and really... it's utf-8 (at least according to MS VS C# when I just open the file there).
I compile the code and run it on one computer, and the contents of my DataGridView appear normal. No funny characters. I compile the code and run it on the other computer (or just take the published version from computer #1 and install it on computer #2 - I tried this both ways), and in the DataGridView, wherever there are line breaks/new lines in the XML file, I see funny square characters.
I'm a novice at encoding... so the only thing I really tried in order to troubleshoot was to use that same program to write the contents of my XML to a new XML file (but I'm actually writing it to a text file, with the XML tags in it), since the default when writing to a text file seems to be UTF-8. Then I read this new file back into my program. I get the same results.
I don't know what else to do or how to troubleshoot this or what I might fundamentally be doing wrong in the first place.
-Adeena
This doesn't have to do with UTF-8 or character encodings - this problem has to do with line endings. In Windows, each line of a text file ends in the two characters carriage return (CR) and line feed (LF), which are code points U+000D and U+000A respectively. In ASCII and UTF-8, these are encoded as the two bytes 0D 0A. Most non-Windows systems, including Linux and Mac OS X, on the other hand, use just a newline character to signal end-of-line, so it's not uncommon to see line-ending problems when transferring text files between Windows and non-Windows systems.
However, since you're using just Windows on both systems, this is more of a mystery. One application is correctly interpreting the CRLF combination as a newline, but the other application is confused by the CR. Carriage returns are not printable characters, so it replaces the CR with a placeholder box, which is what you see; it then correctly interprets the line feed as the end-of-line.
The square usually appears when you use different types of newlines.
Linux - (0A) LF
Win - (0D0A) CRLF
Mac - (0D) CR
The file was probably created using one type of newline and the running app is expecting another.
Check out Environment.NewLine
And, you might try this (no guarantees - I don't write much C#):
strInput = Regex.Replace(strInput, @"\r\n|\r|\n", Environment.NewLine);
I'm not sure of the cause of your problem, but one solution would be to just strip out the carriage returns from your strings. For every string you add, just call TrimEnd(null) on it to remove trailing whitespace:
newrow["topic"] = att1.ToString().TrimEnd(null);
If your strings might end in other whitespace (i.e. spaces or tabs) and you want to keep those, then just pass an array containing only the carriage return character to TrimEnd:
newrow["topic" = att1.ToString().TrimEnd(new Char[]{'\r'});
Disclaimer: I am not a C# programmer; the second statement may be syntactically incorrect
@Adam:
Sorry! Missed your earlier statement.
To load the document into the program and display it in the DataGridView, I am currently doing the following (I say "currently" because I tried other things, like using XDocument instead of XElement):
XElement xe1 = XElement.Load(filePath);
DataTable myTable = new DataTable();
myTable = mkTable(); // calls a function that makes the table
var _categories = (from p1 in xe1.Descendants("category") select p1);
int numCat = _categories.Count();
int i = 0;
while (i < numCat)
{
    DataRow newrow;
    newrow = myTable.NewRow();
    if (_categories.ElementAt(i).Parent.Name == "topic")
    {
        string att1 = _categories.ElementAt(i).Parent.Attribute("name").Value.ToString();
        newrow["topic"] = att1.ToString();
    }
    // repeat the above for the different things in my document
    myTable.Rows.Add(newrow);
    i++;
}
myDataSet.Merge(myTable);
bindingSourceIn.DataSource = myDataSet;
myDataGridView.DataSource = bindingSourceIn;
myDataGridView.DataMember = "xmlthing";
(Obviously things are a little abbreviated here; i.e., my binding source, DataGridView, etc. are declared elsewhere... but hopefully this is enough to make sense.)
-Adeena
