Visual C# Character Encoding Mix Up

I've been working on a song metadata program for a short while now, and have run into what seems like a simple but challenging issue.
As the program reads pieces of data from songs, it may come across an unusual symbol, such as ℗. Some things happened that I didn't expect, so I stepped through my code line by line and saw that this character was actually being read as "â„—". I tried another character, Ü, to see what the result was, and this time I got "Ãœ".
I did a quick Google search which led me to this page:
http://www.i18nqa.com/debug/utf8-debug.html
From this I would assume that my desired characters are actually being interpreted using the mentioned Windows-1252 encoding, and not UTF-8?
I should add that, to investigate further, I tried writing the Windows-1252 versions into the metadata using my program, and these showed up in Windows itself as the correct ℗ and Ü characters. Can anybody tell me how to ensure that both reading and writing are done using UTF-8, and, if my assumption is wrong, explain what is really happening? Thanks.
I am using C#.
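
The symptom described above can be reproduced in a few lines. The following is a minimal sketch, not the asker's code: it encodes a string as UTF-8, decodes the same bytes as Windows-1252, and then reverses the mistake. The class and method names are made up for illustration, and the provider registration assumes .NET Core 3.0 or later / .NET 5+ (on .NET Framework, code page 1252 is available without it). Characters are written as \u escapes so the example does not depend on the encoding of the source file.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Assumption: running on .NET Core / .NET 5+, where code page 1252
    // has to be made available through CodePagesEncodingProvider first.
    static readonly Encoding Windows1252 = CreateWindows1252();

    static Encoding CreateWindows1252()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1252);
    }

    // The mix-up: UTF-8 bytes decoded as if they were Windows-1252.
    public static string Garble(string text) =>
        Windows1252.GetString(Encoding.UTF8.GetBytes(text));

    // The reverse: recover the UTF-8 bytes and decode them correctly.
    public static string Repair(string garbled) =>
        Encoding.UTF8.GetString(Windows1252.GetBytes(garbled));

    public static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;
        // "\u2117" is the sound recording copyright sign, "\u00DC" is U with diaeresis.
        foreach (string original in new[] { "\u2117", "\u00DC" })
        {
            string garbled = Garble(original);
            Console.WriteLine($"{original} -> {garbled} -> {Repair(garbled)}");
        }
    }
}
```

If the tag bytes in the file really are UTF-8, the fix on the reading side is to decode them with Encoding.UTF8 (or whatever encoding the tag format declares) rather than a default code page, and to use the same encoding when writing.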

Related

µ and é in namespace

We have developed a C# program. The program is distributed in Europe without problems on various hardware configurations. Some of the namespaces in our program contain a 'µ' or an 'é' character. When deploying our program on 'non-European' systems, i.e. Chinese or some US systems, a problem occurs: somewhere in the process the 'µ' is changed into 'Âµ', causing lots of problems. What is causing this problem, and how can we work around it (preferably without changing the name of the namespace)?
edit 2015.08.07
Thanks all for your comments, but to clarify: the source files are not distributed as such. The program is compiled to an exe and then distributed using NSIS. Source control is done using SVN. How can I verify the presence of the BOM in my source files?

Either you or the recipient or both are using a character encoding other than UTF-8.
People shouldn't do that, but they do.
Some tools will default to a legacy encoding unless you include a BOM at the start of each file, so include a BOM at the start of each file.
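
One way to check whether a source file starts with a UTF-8 BOM is to look at its first three bytes, which are EF BB BF when the BOM is present. The following is a minimal sketch; the class and method names are hypothetical, and it only detects the UTF-8 BOM, not the UTF-16 or UTF-32 ones.

```csharp
using System;
using System.IO;

public static class BomCheck
{
    // True when the buffer starts with the UTF-8 byte order mark EF BB BF.
    public static bool HasUtf8Bom(byte[] data) =>
        data.Length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF;

    // Reads at most the first three bytes of the file and checks them.
    public static bool FileHasUtf8Bom(string path)
    {
        var head = new byte[3];
        using (FileStream stream = File.OpenRead(path))
        {
            int read = stream.Read(head, 0, head.Length);
            return read == 3 && HasUtf8Bom(head);
        }
    }

    public static void Main(string[] args)
    {
        // Usage: pass one or more file paths on the command line.
        foreach (string path in args)
        {
            string verdict = FileHasUtf8Bom(path) ? "UTF-8 BOM" : "no UTF-8 BOM";
            Console.WriteLine($"{path}: {verdict}");
        }
    }
}
```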

You are hitting a difference in the character sets used by the different systems. Your software was probably running on systems assuming ISO-8859, most often used in European languages, while the Chinese and US systems you are encountering are probably using the Universal Character Set (ISO/IEC 10646). The mapping between the two is not a simple 1-to-1, so you run into the problems you are having. W3.org has a good article on this topic at http://www.w3.org/International/articles/definitions-characters/
Pay special attention to the sections on "Character sets, coded character sets, and encodings", and "The Document Character Set". If this is a web app, "The HTTP Header" might be particularly useful.

This isn't an answer. However I want to point out that this might be an encoding problem but this can happen on the same system and show up only when running code that (guess here) reads a byte at a time, as opposed to explicitly reading text of a particular encoding.
I have a C program (32-bit, if that matters) which reads in a file using fgetc and saves characters to be used as "illegal" characters in names. It isn't fancy, just meant to prevent a few ASCII characters from coming in accidentally, like an ' (apostrophe) in the name of a thing/object/label. Someone asked me to test µ (mu, which appears as a single character in this interface to Stack Overflow). I generated it using Insert-Symbol in MS Word (without examining the underlying encoding), then cut it from MS Word and pasted it into a text file using Notepad++. In Notepad++ and MS Word, it seems to be the same symbol. BUT fgetc, taking one int or char (however you like to think of it) at a time, produces this in my debug output for a test case:
About to check for illegal characters in =>NameOfItemÂµ<=
Illegal character =>Â<= was found. Illegal characters are: '`Âµ
Illegal character =>µ<= was found. Illegal characters are: '`Âµ
I am compiling with Visual C++ Express 2013.
I'm happy it catches the illegal characters, and hope this isn't just noise to readers of this topic.
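
The output above is consistent with the file being saved as UTF-8, where µ (U+00B5) is stored as the two bytes C2 B5, and a byte-at-a-time reader treating each byte as its own character. The following sketch reproduces that view in C# rather than C; the class and method names are made up for illustration, and ISO-8859-1 stands in for whatever single-byte code page the C program's output was displayed with.

```csharp
using System;
using System.Text;

public static class ByteAtATime
{
    // Decodes every UTF-8 byte on its own as ISO-8859-1, which approximates
    // what a reader such as fgetc sees when it takes one byte per "character".
    public static string[] ReadPerByte(string text)
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        byte[] utf8 = Encoding.UTF8.GetBytes(text);
        var pieces = new string[utf8.Length];
        for (int i = 0; i < utf8.Length; i++)
        {
            pieces[i] = latin1.GetString(utf8, i, 1);
        }
        return pieces;
    }

    public static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;
        // "\u00B5" is the micro sign; its UTF-8 form is two bytes long.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes("\u00B5")));
        Console.WriteLine(string.Join(" | ", ReadPerByte("\u00B5")));
    }
}
```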

How to hex edit an exe file safely?

I am working on a small puzzle/wargame which involves coding Windows Forms in C#.
To get to a certain level I need a password which is stored in an exe. The same exe allows me to send that password to a default person, whose name is stored in a variable. The password is sent by updating the given user's data in a MySQL database.
The challenge is that a user should hex edit the exe and change the default recipient to the desired username. But when I hex edited the file, put in the desired username, and tried to run it, it showed the error "x.exe not a valid win32 application".
Is there a way to safely hex edit a file without encountering this error? Or is there a way to modify the source so that just one variable can be safely edited using a hex editor?

Editing a PE image in hex is going to be difficult, since you will need to update various parts of the PE image if you change the length of a section, and if the EXE is signed you will also invalidate the signature. The PE image spec can be found here if you want to understand all the fields you will need to update. If you want a nice UI, I would use something like CFF Explorer to edit the PE image correctly.
You could also use ildasm, only for .NET assemblies, to disassemble the EXE, edit the IL, and then use ilasm to reassemble and run it. This would eliminate the need to edit the PE image and be safer.

Assuming this is not an illegal alteration of an executable... (It sounds like a challenge in a contest, the way you have it worded.)
Most likely your change caused the program to no longer be able to verify the checksum. If you wish to successfully alter the exe, you need to recalculate the checksum. (This is just one possible explanation for why the exe was corrupted.)
Altering a compiled executable and having it work is tricky to say the least. It's a pretty advanced topic and not likely something that can be answered fully here.

When I was doing something similar before, I remember I had to replace variables with same-length strings for it to work properly, e.g. "someone@example.com" could be replaced with "another@example.net" or "myname@anexample.us". If you're using Gmail this would be easier, because "mynameis@gmail.com" is the same as "my.name.is...+slim.shady@gmail.com".
Though, I think @David Stratton's idea is probably more relevant to exes. I'm pretty sure the files I edited were just data files (it was a long time ago), but I know everything worked for me then as long as I didn't add or remove any bytes in the middle of the file.

When modifying strings inside EXE/DLL files, it is important that the length of the string you are editing stays the same. For example, if I changed "Hello User" to "Welcome User", the new string would be 2 characters longer and would overrun the space reserved for the original.
This will obviously result in an error. For the edit to succeed, the modified string must not be longer than the string you are overwriting.
TL;DR:
If the string you are editing is 12 characters long, you can only change 12 characters in total.
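
The same-length rule can be enforced in code instead of by hand in a hex editor. The following is a minimal sketch that works on an in-memory byte array; the class and method names are hypothetical. It assumes the string is stored as UTF-16LE (Encoding.Unicode), which is how .NET assemblies usually store string literals; a native executable may use a single-byte encoding instead. It does not touch checksums or signatures, so a signed file would still be invalidated.

```csharp
using System;
using System.Text;

public static class SameLengthPatch
{
    // Returns the index of the first occurrence of needle in haystack, or -1.
    public static int IndexOf(byte[] haystack, byte[] needle)
    {
        for (int i = 0; i + needle.Length <= haystack.Length; i++)
        {
            int j = 0;
            while (j < needle.Length && haystack[i + j] == needle[j]) j++;
            if (j == needle.Length) return i;
        }
        return -1;
    }

    // Overwrites oldText with newText in a copy of the image, refusing any
    // replacement whose encoded length differs from the original.
    public static byte[] ReplaceString(byte[] image, string oldText, string newText, Encoding encoding)
    {
        byte[] oldBytes = encoding.GetBytes(oldText);
        byte[] newBytes = encoding.GetBytes(newText);
        if (oldBytes.Length != newBytes.Length)
            throw new ArgumentException("Replacement must have the same encoded length.");

        int offset = IndexOf(image, oldBytes);
        if (offset < 0)
            throw new InvalidOperationException("Original string not found.");

        byte[] patched = (byte[])image.Clone();
        Buffer.BlockCopy(newBytes, 0, patched, offset, newBytes.Length);
        return patched;
    }

    public static void Main()
    {
        // Stand-in for file contents; a real run would use File.ReadAllBytes.
        byte[] image = Encoding.Unicode.GetBytes("..Hello User..");
        byte[] patched = ReplaceString(image, "Hello User", "Howdy User", Encoding.Unicode);
        Console.WriteLine(Encoding.Unicode.GetString(patched));
    }
}
```

A shorter replacement can be made to fit by padding it to the original length, but whether the program tolerates the padding characters depends on how it uses the string.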

Winform character spacing

I am trying to use Graphics.DrawString and TextRenderer.DrawText to lay out strings with a variable number of characters in a fixed rectangle.
However, even using the GDI+ wrapping methods I am not satisfied with the result: I need to control the font kerning (or character spacing) to be able to fit strings with a high number of characters.
I read about FontStretches but I do not know how to use it in WinForms. Another method is Typography.SetKerning, but again I have no idea how to use it.
Can someone help?!
Round 2:
I know it could be hard. The Win32 API has freetype support, which could be the solution to the issue.
In practice, my aim is to do something similar to "http://stackoverflow.com/questions/4582545/kerning-problems-when-drawing-text-character-by-character" in .NET. Note that I am working with pre-formed strings in Arabic, not user character input.
My problem is:
(1) identify which library has the kerning function I want (most probably gdi32.dll), (2) build a safe C# environment for dealing with DLL calls, (3) implement a call to the DLL that works in C#.
Can someone help?
Thank you for answering.

If you look at the documentation, it's quite easy to find out which does what, and how to use it.
The method Typography.SetKerning is a WPF-only thing, so you won't be able to use it in WinForms.
A quick Google search found this article, which shows how to modify kerning values for GDI text.

What type of BarCode is this?

I am starting the process of writing an application, one part of which is to decode bar codes, however I'm off to a bad start. I am no bar code expert and this is not a common bar code type, so I'm in trouble. I cannot figure out what type of bar code this is, which I have to decode.
I have looked on Wikipedia and some other sites with visual descriptions of different types of bar codes (and how to identify them), however I cannot identify it. Please note that I have tried several free bar code decoding programs and they have all failed to decode this.
So here is a picture of that bar code:
http://www.shrani.si/f/2B/4p/4UCVyP72/barcode.jpg
I hope one of you can recognize it. Also if anyone has worked with this before and knows of a library that can decode them (from an image), I'd love to hear about them.
I'm very thankful for any additional pointers I can receive. Thank you.

zbar thinks it's Code 128 but the decoded string is suspiciously different than the barcode's own caption. Maybe it's a charset difference?
~/src/zebra-0.5/zebraimg$ ./zebraimg ~/src/barcode/reader/barcode.jpg
CODE-128:10657958011502540742
scanned 1 barcode symbols from 1 images in 0.04 seconds
My old copy was called zebra but the library is now called zbar. http://sourceforge.net/projects/zbar/

I don't recognize this bar code - but here are a few sites that might help you (libraries etc.) - assuming you use C# and .NET (you didn't specify in your question):
http://www.idautomation.com/csharp/
http://www.bokai.com/barcode.net.htm

It looks a bit like Code 128 but http://www.onlinebarcodereader.com/ does not recognize it as such. Maybe the image quality isn't good enough.

If you are using Java:
http://code.google.com/p/zxing/
Open Source, supports multiple types of barcodes
A list of software can be found here:
http://www.dmoz.org/Computers/Software/Bar_Code/Decoding/

IANABCE (I Am Not A Barcode Expert), but looking at the barcodes here, I'd say this looks closest to the UCC/EAN-128 symbology, character set 'C'.
Do you know what the barcode is used for? What's the application domain?

C# : Characters do not display well when in Console, why?

The picture below explains all:
http://img133.imageshack.us/img133/4206/accentar9.png
The variable textInput comes from File.ReadAllText(path); and characters like ' é è ... are not displayed correctly. When I run my unit test, all is fine and I see them. Why?

The .NET classes (System.IO.StreamReader and the like) take UTF-8 as the default encoding. If you want to read a different encoding you have to pass it explicitly to the appropriate constructor overload.
Also note that there's not one single encoding called “ANSI”. You're probably referring to the Windows codepage 1252 aka “Western European”. Notice that this is different from the Windows default encoding in other countries. This is relevant when you try to use System.Text.Encoding.Default because this actually differs from system to system.
/EDIT: It seems you misunderstood both my answer and my comment:
The problem in your code is that you need to tell .NET what encoding you're using.
The other remark, saying that “ANSI” may refer to different encodings, didn't have anything to do with your problem. It was just a “by the way” remark to prevent misunderstandings (well, that one backfired).
So, finally: The solution to your problem should be the following code:
string text = System.IO.File.ReadAllText("path", Encoding.GetEncoding(1252));
The important part here is the usage of an appropriate System.Text.Encoding instance.
However, this assumes that your encoding is indeed Windows-1252 (but I believe that's what Notepad++ means by “ANSI”). I have no idea why your text gets displayed correctly when read by NUnit. I suppose that NUnit either has some kind of autodiscovery for text encodings or that NUnit uses some weird defaults (i.e. not UTF-8).
Oh, and by the way: “ANSI” really refers to the “American National Standards Institute”. There are a lot of completely different standards that have “ANSI” as part of their names. For example, C++ is (among others) also an ANSI standard.
Only in some contexts is it (imprecisely) used to refer to the Windows encodings. But even there, as I've tried to explain, it usually doesn't refer to a specific encoding but rather to a class of encodings that Windows uses as defaults for different countries. One of these is Windows-1252.
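
To make the difference concrete, here is a minimal sketch that writes a small file as Windows-1252 and reads it back twice, once as UTF-8 and once with the matching encoding. The class and method names are made up for illustration, the temporary file stands in for the asker's file, and the provider registration assumes .NET Core 3.0 or later / .NET 5+ (it is not needed on .NET Framework).

```csharp
using System;
using System.IO;
using System.Text;

public static class ExplicitEncodingRead
{
    // Assumption: on .NET Core / .NET 5+ code page 1252 must be registered first.
    public static Encoding Windows1252()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1252);
    }

    // Writes text to a temporary file with one encoding and reads it back with another.
    public static string RoundTrip(string text, Encoding writeWith, Encoding readWith)
    {
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllText(path, text, writeWith);
            return File.ReadAllText(path, readWith);
        }
        finally
        {
            File.Delete(path);
        }
    }

    public static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;
        Encoding cp1252 = Windows1252();
        string accented = "\u00E9\u00E8"; // e with acute, e with grave

        // Mismatch: the single bytes E9 and E8 are not valid UTF-8 sequences,
        // so the decoder substitutes the replacement character U+FFFD.
        Console.WriteLine(RoundTrip(accented, cp1252, Encoding.UTF8));

        // Match: the text comes back unchanged.
        Console.WriteLine(RoundTrip(accented, cp1252, cp1252));
    }
}
```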

Try setting your console session's output code page using the chcp command. The code pages supported by Windows are here, here, and here. Remember, fundamentally the console is pretty simple: it displays Unicode or DBCS characters by using a code page to determine the glyph that will be displayed.

I do not know why it works with NUnit, but when I open the file with Notepad++ I see ANSI as the format. Now I have converted it to UTF-8 and it works.
I am still wondering why it was working with NUnit and not in the console, but at least it works now.
Update
I do not get why I got downvoted on the question and on this answer, because the question is still valid: why can't I read an ANSI file in a console when I can in NUnit?
