How to show the difference for Japanese, Chinese and other Unicode text

I'm looking for a way to programmatically, in C#, show the difference of two chunks of text.
The result, with deletions and additions marked, will be shown in HTML, but that is a second step and an optional part of the answer.
I would prefer not to shell out to the command line if possible, i.e. no calling a third-party diff tool or similar. The platform is Windows.
It must support Asian languages, such as Japanese, Chinese and Korean, meaning that traditional word break characters don't (necessarily) apply.

Have a look at this SO thread. A few choices of diff engine are listed there; perhaps one of them will suit you.
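If none of the listed engines fits, here is a minimal sketch of one possible approach, not taken from any of those libraries: diff by text element (grapheme cluster) instead of by word, so the missing word breaks in Japanese, Chinese or Korean do not matter. It uses a plain longest-common-subsequence table, which needs memory proportional to the product of the two lengths, so it is only reasonable for short chunks. The names TextElementDiff, Split and ToHtml are made up for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Net;
using System.Text;

public static class TextElementDiff
{
    // Splits a string into text elements so that surrogate pairs and
    // combining sequences are never cut in half.
    public static List<string> Split(string s)
    {
        var result = new List<string>();
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
        while (e.MoveNext())
            result.Add(e.GetTextElement());
        return result;
    }

    // Returns HTML where deleted elements are wrapped in <del> and
    // inserted elements in <ins>.
    public static string ToHtml(string oldText, string newText)
    {
        List<string> a = Split(oldText), b = Split(newText);

        // lcs[i, j] = length of the longest common subsequence of a[i..] and b[j..]
        var lcs = new int[a.Count + 1, b.Count + 1];
        for (int i = a.Count - 1; i >= 0; i--)
            for (int j = b.Count - 1; j >= 0; j--)
                lcs[i, j] = a[i] == b[j]
                    ? lcs[i + 1, j + 1] + 1
                    : Math.Max(lcs[i + 1, j], lcs[i, j + 1]);

        // Walk the table from the start, emitting unchanged, deleted or inserted elements.
        var html = new StringBuilder();
        int x = 0, y = 0;
        while (x < a.Count || y < b.Count)
        {
            if (x < a.Count && y < b.Count && a[x] == b[y])
            {
                html.Append(WebUtility.HtmlEncode(a[x]));
                x++; y++;
            }
            else if (y == b.Count || (x < a.Count && lcs[x + 1, y] >= lcs[x, y + 1]))
            {
                html.Append("<del>").Append(WebUtility.HtmlEncode(a[x])).Append("</del>");
                x++;
            }
            else
            {
                html.Append("<ins>").Append(WebUtility.HtmlEncode(b[y])).Append("</ins>");
                y++;
            }
        }
        return html.ToString();
    }

    public static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;
        Console.WriteLine(ToHtml("日本語のテキスト", "日本語の新しいテキスト"));
    }
}
```

For example, ToHtml("abc", "ac") returns a<del>b</del>c. Adjacent changes each get their own tag; merging them into one run is left out to keep the sketch short.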

Related

C# windows form cannot display simplified Chinese characters

Somehow my previous question has been marked as duplicate.
Question:
I have a database with records in Chinese characters. I can take them out, and use them in button.Text.
However, when I use
Console.WriteLine(button.Text);
The output displays every Chinese character as a "?"
Now, why is the question NOT duplicate?
I have THOROUGHLY searched for a solution, not just on Stack Overflow but everywhere I can search (with my limited skills), and read all the related posts. I found two potential solutions:
One:
Console.OutputEncoding = Encoding.Unicode;
(I tried Unicode, UTF8, UTF7 and UTF32.)
Two:
Change my computer's locale in Control Panel to a region with Simplified Chinese. Then reboot and run the solution again.
I have tried both of these suggested solutions, individually and together. Nothing works. The output changes from "?" to complete gibberish, unrecognizable characters.
Does anyone have any idea what to do here?

This is a more complete version of my comment. The way I was able to display Simplified Chinese characters was by changing the language for non-Unicode programs to Chinese.
Then, in the cmd properties, set the font to Consolas.
I didn't even need to set the Console.OutputEncoding. This is the result (these are Chinese characters copy/pasted from the internet):

I think this is a duplicate of How to write Unicode characters to the console?, which indicates that although .NET and Unicode support your characters, the font you are using as the console's output font does not support those Unicode characters.
Your post does not indicate that you have tried adjusting the Console Font.
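To tell whether the data or the console is at fault, it can help to print the code units of the string rather than the glyphs. A minimal sketch, with a made-up helper name (DumpCodeUnits) and a literal standing in for button.Text: if the dump shows the expected values, the string is intact and only the console font or encoding is the problem.

```csharp
using System;
using System.Text;

public static class ConsoleUnicodeCheck
{
    // Formats each UTF-16 code unit as U+XXXX, so the content can be
    // inspected even when the console cannot draw the characters.
    public static string DumpCodeUnits(string s)
    {
        var sb = new StringBuilder();
        foreach (char c in s)
        {
            if (sb.Length > 0) sb.Append(' ');
            sb.Append("U+").Append(((int)c).ToString("X4"));
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // Assumption: this literal stands in for button.Text. Escapes are used
        // so the example does not depend on the source file's encoding.
        string text = "\u4E2D\u6587";

        Console.WriteLine(DumpCodeUnits(text));

        // Whether the next line renders correctly still depends on the
        // console font, as the answers above point out.
        Console.OutputEncoding = Encoding.UTF8;
        Console.WriteLine(text);
    }
}
```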

Microsoft IDEs, source file encodings, BOMs and the Unicode character \uFEFF?

We have parsers for various Microsoft languages (VB6, VB.net, C#, MS dialects of C/C++).
They are Unicode enabled to the extent that we all agree on what Unicode is. Where we don't agree, our lexers object.
Recent MS IDEs all seem to read/write their source code files in UTF-8... I'm not sure this is always true. Is there some reference document that makes it clear how MS will write a source code file? With or without byte order marks? Does it vary from IDE version to version? (I can't imagine that the old VB6 dev environment wrote anything other than an 8-bit character set, and I'd guess it would be in the CP-xxxx encoding established by the locale, right?)
For C# (and I assume other modern language dialects supported by MS), the character code \uFEFF can actually be found in the middle of a file. This code is defined as a zero-width no-break space. It appears to be ignored by VS 2010 when found in the middle of an identifier or in whitespace, but is significant in keywords and numbers. So, what are the rules? Or does MS have some kind of identifier normalization to handle things like composite characters, one that allows different identifier strings to be treated as identical?

This is in a way a non-answer, because it does not tell what Microsoft says but what the standards say. Hope it will be of assistance anyway.
U+FEFF as a regular character
As you stated, U+FEFF should be treated as a BOM (byte order mark) at the beginning of a file. Theoretically it could also appear in the middle of text, since it is actually the character denoting a zero-width non-breaking space (ZWNBSP). In some languages/writing systems all words in a line are joined (written together), and in such cases this character could be used as a separator, just like a regular space in English, except that it does not cause a typographically visible gap. I'm not actually familiar with such scripts, so my view might not be fully correct.
U+FEFF should only appear as a BOM
However, the usage of U+FEFF as a ZWNBSP has been deprecated as of Unicode version 3.2, and currently the purpose of U+FEFF is to act as a BOM. Instead of ZWNBSP as a separator, the U+2060 (word joiner) character is strongly preferred by the Unicode Consortium. Their FAQ also suggests that any U+FEFF occurring in the middle of a file can be treated as an unsupported character that should be displayed as invisible. Another possible solution that comes to mind would be to replace any U+FEFF occurring in the middle of a file with U+2060, or just ignore it.
Accidentally added U+FEFF
I guess the most probable reason for U+FEFF to appear in the middle of text is that it is an erroneous result (or side effect) of a string concatenation. RFC 3629, which incorporated the usage of a BOM, notes that stripping the leading U+FEFF is necessary when concatenating strings. This also implies that the character could just be removed when found in the middle of text.
U+FEFF and UTF-8
U+FEFF as a BOM has no real effect when the text is encoded as UTF-8, since UTF-8 always has the same byte order. A BOM in UTF-8 interferes with systems that rely on the presence of certain leading characters and with protocols that explicitly mandate the encoding or an encoding identification method. Real-world experience has also shown that some applications choke on UTF-8 with a BOM. Therefore the usage of a BOM is generally discouraged when using UTF-8. Removing the BOM from a UTF-8 encoded file should not cause incorrect interpretation of the file (unless there is some checksum or digital signature related to the byte stream of the file).
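The handling described above (drop a leading U+FEFF, strip it when concatenating, remove or ignore it mid-text) could be sketched as follows. This is only an illustration of those suggestions, not what any Microsoft tool does; the class and method names are made up.

```csharp
using System;
using System.Text;

public static class BomCleanup
{
    private const char Bom = '\uFEFF';

    // Drops a single U+FEFF at position 0, where it acts as a BOM.
    public static string StripLeadingBom(string text) =>
        text.Length > 0 && text[0] == Bom ? text.Substring(1) : text;

    // Removes every U+FEFF, wherever it occurs. string.Replace(string, string)
    // compares ordinally, so the zero-width character is matched exactly.
    public static string RemoveAll(string text) =>
        text.Replace("\uFEFF", string.Empty);

    // Concatenates parts after stripping each part's leading U+FEFF, so that
    // no BOM ends up stranded in the middle of the result.
    public static string Concat(params string[] parts)
    {
        var sb = new StringBuilder();
        foreach (string part in parts)
            sb.Append(StripLeadingBom(part));
        return sb.ToString();
    }

    public static void Main()
    {
        string joined = Concat("\uFEFFint x", "\uFEFF = 1;");
        Console.WriteLine(joined.Length); // length of the text without any U+FEFF
    }
}
```

Replacing with "\u2060" instead of string.Empty in RemoveAll would give the word-joiner variant mentioned above.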

On "how MS will write a source code file": VS can save files with and without a BOM, as well as in a whole bunch of other encodings. The default is UTF-8 with BOM. You can try it yourself by going to File -> Save ... as -> click the triangle on the "Save" button and choose "save with encoding".
On usage of FEFF in actual code: I have never seen anyone use it in code... Wikipedia suggests that it should be treated as a zero-width space if it occurs anywhere but the first position ( http://en.wikipedia.org/wiki/Byte_order_mark ).

For C++, the file is either Unicode with BOM, or will be interpreted as ANSI (meaning the system code page, not necessarily 1252). Yes, you can save with whatever encoding you want, but the compiler will choke if you try to compile a Shift-JIS file (Japanese, code page 932) on an OS with 1252 as system code page.
In fact, even the editor will get it wrong. You can save a file as Shift-JIS on a 1252 system, and it will look OK. But close the project and reopen it, and the text looks like junk. So the encoding information is not preserved anywhere.
So that's your best guess: if there is no BOM, assume ANSI. That is what the editor/compiler do.
Also: this applies to VS 2008 and VS 2010; older editors were not too Unicode-friendly.
And C++ has different rules than C# (for C++ the files are ANSI by default, for C# they are UTF-8).
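The "no BOM, assume ANSI" rule can be sketched like this. It is a minimal illustration of the rule as described in this answer, not the actual logic of the VS editor or compiler. The names are hypothetical, and the fallback encoding is passed in by the caller because Encoding.Default is the system ANSI code page only on .NET Framework (on .NET Core and later it is UTF-8).

```csharp
using System;
using System.Text;

public static class SourceEncodingGuess
{
    // Returns the encoding implied by a leading BOM, or the caller's
    // fallback (for example the system ANSI code page) when there is none.
    public static Encoding Detect(byte[] bytes, Encoding fallback)
    {
        if (bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
            return Encoding.UTF8;             // UTF-8 BOM
        if (bytes.Length >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE)
            return Encoding.Unicode;          // UTF-16 little endian BOM
        if (bytes.Length >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF)
            return Encoding.BigEndianUnicode; // UTF-16 big endian BOM
        return fallback;
    }

    public static void Main()
    {
        byte[] withBom = { 0xEF, 0xBB, 0xBF, (byte)'a' };
        byte[] withoutBom = { (byte)'a', (byte)'b' };

        Console.WriteLine(Detect(withBom, Encoding.Default).WebName);
        Console.WriteLine(Detect(withoutBom, Encoding.Default).WebName);
    }
}
```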

Unicode bidi text algorithm in C#?

Is there a C# version of the Unicode algorithm that takes a Unicode string and breaks it into runs that can be correctly rendered? Each run should be either left-to-right or right-to-left.
We understand this is part of the Java ICU4J, but that is a large library, and we're only looking for this specific functionality, to render text correctly.

This is the unicode standard for bidi handling:
UNICODE BIDIRECTIONAL ALGORITHM
Also try: this
Implementations:
JAVA
C++
I'm sure you will be able to convert them to C# fairly simply.

How to fill out a PDF form and support multiple languages in iTextSharp?

I wanted to know if there is a way to support multiple languages when filling out a form field with iTextSharp. We need to support users filling out fields in English, European languages with diacritics, and Asian languages like Chinese and Japanese, but do not know how to support all of these in the same PDF (e.g. the user could have some form fields answered in English and some in Chinese). We have to work with Acrobat forms that are pre-defined, i.e. we cannot create a PDF completely from scratch in our scenario.
Is there a way to accomplish this within iTextSharp? At least to support most European languages and Chinese, and for the form-filling/generation process to know when to use the right font that supports the particular character(s)?

Would it be an option to dynamically generate the PDF based on user inputs from another program, e.g. a windows forms app or a web page? Based on the user's selection from said app you could dynamically generate the PDF (based on a template) and apply the appropriate character sets.

Yes.
The problem you're having (best guess) is that the pre-defined fonts for the fields you're filling use WinAnsiEncoding (or some other mono-byte encoding that doesn't support all the diacritics you need).
And I see that iText does support setting a field's font directly. Excellent.
myAcroFields.SetFieldProperty(fldName, "textfont", myBaseFont, null);
I believe you're required to subset fonts with Chinese encodings, but for the European-encoded fonts you probably want to fully embed the font in question. Fonts in form fields (that can be edited) react poorly when you try to display a missing character... at least they used to, many moons ago. Last time I tried was around Acrobat 5, so it's quite likely that the behavior has improved.

How to make C# show arabic?

I have a problem: when I write C# code, the output sometimes contains words in Arabic, and they appear as strange symbols. How can I make C# read and show Arabic?

I don't know the precise problem you're having, but would suggest you read The Absolute Minimum Every Programmer Should Know About Unicode to give yourself a solid grounding in this often confusing topic.

Arabic Console Output/Input is not possible on Windows Platforms, according to Microsoft:
http://www.microsoft.com/middleeast/msdn/arabicsupp.aspx#12

C#/.NET will display Arabic characters without a problem, as it represents strings internally as UTF-16.
The issue is with how you display the characters.
If you are on the web, you need to ensure that you are including the correct charset encoding header or meta tag for the output.
Please provide more information on where you don't see the characters, and how you are outputting the strings.

Maybe it is a problem with your system language. Go to Control Panel, then to Language options, try changing your system locale to Arabic, and ensure that the language for non-Unicode programs is Arabic.

Please make sure that you have correct fonts installed. If you have them on your system, it could be a fallback mechanism problem.
For web pages (Asp.Net), please make sure that:
You are using (and declaring) correct encoding.
You have correct fonts declared in your style definition.
I know that it sounds strange, but for Internet Explorer, it helps to set language for non-Unicode programs to what you need to support (in my case it was Chinese Simplified on German Windows 2003).
