C# compare strings - Different codepage

C# compare strings - Different codepage - c#

I have two strings read in from textfiles to compare and when I try to compare these files with winmerge or pspad, they both show as the same text strings. If I compare them with the following function, it fails:
string string1 = File.ReadAllText(#"c:\file1.txt");
string string2 = File.ReadAllText(#"c:\file2.txt");
bool stringMatch = false;
if (string1.Equals(string2, StringComparison.InvariantCulture)){
stringMatch = true;
}
//stringMatch is false here
After some searching it seems to be that a " and ' are different:
Content of file1.txt: é"'(§è!çà)-
Content of file2.txt: é”’(§è!çà)-
Any way I can properly compare these two strings and match those " & ' characters?

You could convert them both to byte[] using the methods under System.Text.Encoding
and then compare the byte[] arrays

It looks like you want to use the overload which takes StringComparison.
I'd guess given the current senario you want the "Ordinal" value but you may want one of the others depdending on what you are doing.
http://msdn.microsoft.com/en-us/library/system.stringcomparison.aspx

Well, you don't have the .NET strings in WinMerge or pspad, so something could well be going wrong while decoding. You need to explain your exact scenario:
Is the data in a file (hence WinMerge/pspad)?
How are you loading the file in .NET?
How are you loading the file in WinMerge etc?
EDIT: Okay, based on the comment - what is the encoding of the file meant to be? Are you specifying it in WinMerge anywhere? .NET will be using UTF-8 (because you haven't specified any other encoding).

After reading "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" you should be well equipped to solve your problem yourself.

Related

Read specific parts of ASCII file in C#

I am trying to make a FXB file previewer (VST preset banks for those who don't know) for Sylenth1 banks. I have encoded the FXB as an ASCII string and had it print to the console. The preset names show up fine. My issue is that the parameters for the oscillators, filters and effects are encoded as random characters (mainly "?" and fairly big spaces).
Underlined in red: file header (?)
Underlined in blue: preset name (which I want to keep)
Underlined in yellow: osc/FX/filter parameters (which I want to discard from the string)
Here's the code I wrote:
byte[] arr = File.ReadAllBytes(Properties.Resources.pointer); /* pointer is a string in resources I
used to point to the external FXB file for testing */
System.Text.ASCIIEncoding enc = new System.Text.ASCIIEncoding();
string fstr = enc.GetString(arr);
Console.Write(fstr);
Console.ReadKey();
I had written a foreach loop to replace every unwanted character with string.Empty, but it also removes parts of the preset names (e.g. the L from "Lead"), leaves the spaces intact and creates new ones, so I deleted it.
My end goal for those that are curious is this:
Preset 1
Preset 2
Preset 3
Preset 4
...
I'm at a total loss. I've tried different solutions from various websites and Stack Overflow posts, but none gave me the desired result.
(I also noticed that the preset names have almost the same space between them (~ 200 chars apart), can I use the difference to exclude the unwanted parts?)

It looks like a binary file not ascii. Some data in the file is easily readable because it is ASCII encoded, but other data, for example numbers, are encoded in their binary format.
Not all binary data can be converted to printable ASCII characters, so when you print it out like this you get the ???? mess.
It is better to read this file using a binary editor. Visual studio has one, there is probably an extension for vs code, other editors have a binary viewer (e.g. sublime). This will show you data in the file as it is encoded, usually using hex with the ascii in a second column.
But that is just so you can accurately see the content. It does not help you for understanding the meaning or the layout. You might be able to make something work by reverse engineering like this, but chances are it will not work for all cases. Using and API is going to be way easier.
I'm not familiar with these files but did you find this? https://new.steinberg.net/developers/ There is a forum there that might help.

I found the answer to this myself. I basically somewhat reverse engineered the FXB in a hex editor, and proceeded to load specific bytes of the file (31 to be exact) in order to encode those in a string and have that print to the console.
I managed to do so by literally counting how many bytes there are from the beginning to the 1st preset name, then from the end of the preset name (31 bytes) to the beginning of the other preset name, and so on.
For those who are interested, I am going to develop a GUI version of it in the future. But it does (and probably will) support only Sylenth1 v2 soundbanks/FXBs.
Also thanks to the people who reached out. They helped in their own way.

Read a file with unicode characters

I have an asp.net c# page and am trying to read a file that has the following charater ’ and convert it to '. (From slanted apostrophe to apostrophe).
FileInfo fileinfo = new FileInfo(FileLocation);
string content = File.ReadAllText(fileinfo.FullName);
//strip out bad characters
content = content.Replace("’", "'");
This doesn't work and it changes the slanted apostrophes into ? marks.

I suspect that the problem is not with the replacement, but rather with the reading of the file itself. When I tried this the nieve way (using Word and copy-paste) I ended up with the same results as you, however examining content showed that the .Net framework believe that the character was Unicode character 65533, i.e. the "WTF?" character before the string replacement. You can check this yourself by examining the relevant character in the Visual Studio debugger, where it should show the character code:
content[0]; // 65533 '�'
The reason why the replace isn't working is simple - content doesn't contain the string you gave it:
content.IndexOf("’"); // -1
As for why the file reading isn't working properly - you are probably using the wrong encoding when reading the file. (If no encoding is specified then the .Net framework will try to determine the correct encoding for you, however there is no 100% reliable way to do this and so often it can get it wrong). The exact encoding you need depends on the file itself, however in my case the encoding being used was Extended ASCII, and so to read the file I just needed to specify the correct encoding:
string content = File.ReadAllText(fileinfo.FullName, Encoding.GetEncoding("iso-8859-1"));
(See this question).
You also need to make sure that you specify the correct character in your replacement string - when using "odd" characters in code you may find it more reliable to specify the character by its character code, rather than as a string literal (which may cause problems if the encoding of the source file changes), for example the following worked for me:
content = content.Replace("\u0092", "'");

My bet is the file is encoded in Windows-1252. This is almost the same as ISO 8859-1. The difference is Windows-1252 uses "displayable characters rather than control characters in the 0x80 to 0x9F range". (Which is where the slanted apostrophe is located. i.e. 0x92)
//Specify Windows-1252 here
string content = File.ReadAllText(fileinfo.FullName, Encoding.GetEncoding(1252));
//Your replace code will then work as is
content = content.Replace("’", "'");

// This should replace smart single quotes with a straight single quote
Regex.Replace(content, #"(\u2018|\u2019)", "'");
//However the better approach seems to be to read the page with the proper encoding and leave the quotes alone
var sreader= new StreamReader(fileInfo.Create(), Encoding.GetEncoding(1252));

If you use String (capitalized) and not string, it should be able to handle any Unicode you throw at it. Try that first and see if that works.

Convert string to char

I get from another class string that must be converted to char. It usually contains only one char and that's not a problem. But control chars i receive like '\\n' or '\\t'.
Is there standard methods to convert this to endline or tab char or i need to parse it myself?
edit:
Sorry, parser eat one slash. I receive '\\t'

I assume that you mean that the class that sends you the data is sending you a string like "\n". In that case you have to parse this yourself using:
Char.Parse(returnedChar)
Otherwise you can just cast it to a string like this
(string)returnedChar

New line:
string escapedNewline = #"\\n";
string cleanupNewLine = escapedNewline.Replace(#"\\n", Environment.NewLine);
OR
string cleanupNewLine = escapedNewline.Replace(#"\\n", "\n");
Tab:
string escapedTab = #"\\t";
string cleanupTab= escapedTab.Replace(#"\\t", "\t");
Note the lack of the literal string (i.e. i did not use #"\t" because that will not represent a Tab)
Alternatively you could consider Regular Expressions if you need to replace a range of different string patterns.
You should probably write a utility function to encapsulate the common behaviour above for all the possible Escape Sequences
Then you'd write some Unit Tests to cover each of the cases you can think of.
As you encounter any bugs you add more unit tests to cover those cases.
UPDATE
You could represent a tab in the XML with a special character sequence:
see this article
This article applies to SQL Server but may well be relevant to C# also?
To be absolutely sure, you could try generating a string with a tab in it and putting it into some XML (programmatically) and using XmlSerializer to serialize that to a file to see what the output is, then you can be sure that this will faithfully 'round-trip' the string with the tab still in it.

how about using string.ToCharArray()
You can then add the appropriate logic to process whatever was in the string.

char.parse(string); is used to convert string to char and you can do vice versa
char.tostring();
100% solved

I'm working on an application in C# and need to read and write from a particular datafile format. The only issue at the moment is that the format uses strictly single byte characters, and C# keeps trying to throw in Unicode when I use a writer and a char array (which doubles filesize, among other serious issues). I've been working on modifying the code to use byte arrays instead, but that causes a few complaints when feeding them into a tree view and datagrid controls, and it involves conversions and whatnot.
I've spent a little time googling, and there doesn't seem to be a simple typedef I can use to force the char type to use byte for my program, at least not without causing extra complications.
Is there a simple way to force a C# .NET program to use ASCII-only and not touch Unicode?
Later, I got this almost working. Using the ASCIIEncoding on the BinaryReader/Writers ended up fixing most of the problems (a few issues with an extra character being prepended to strings occurred, but I fixed that up). I'm having one last issue, which is very small but could be big: In the file, a particular character (prints as the Euro sign) gets converted to a ? when I load/save the files. That's not an issue in texts much, but if it occurred in a record length, it could change the size by kilobytes (not good, obviously). I think it's caused by the encoding, but if it came from the file, why won't it go back?
The precise problem/results are such:
Original file: 0x80 (euro)
Encodings:
** ASCII: 0x3F (?)
** UTF8: 0xC280 (A-hat euro)
Neither of those results will work, since anywhere in the file, it can change (if an 80 changed to 3F in a record length int, it could be a difference of 65*(256^3)). Not good. I tried using a UTF-8 encoding, figuring that would fix the issue pretty well, but it's now adding that second character, which is even worse.

C# (.NET) will always use Unicode for strings. This is by design.
When you read or write to your file, you can, however, use a StreamReader/StreamWriter set to force ASCII Encoding, like so:
StreamReader reader = new StreamReader (fileStream, new ASCIIEncoding());
Then just read using StreamReader.
Writing is the same, just use StreamWriter.

Interally strings in .NET are always Unicode, but that really shouldn't be of much interest to you. If you have a particular format that you need to adhere to, then the route you went down (reading it as bytes) was correct. You simply need to use the System.Encoding.ASCII class to do your conversions from string->byte[] and byte[]->string.

If you have a file format that mixes text in single-byte characters with binary values such as lengths, control characters, a good encoding to use is code page 28591 aka Latin1 aka ISO-8859-1.
You can get this encoding by using whichever of the following is the most readable:
Encoding.GetEncoding(28591)
Encoding.GetEncoding("Latin1")
Encoding.GetEncoding("ISO-8859-1")
This encoding has the useful characteristic that byte values up to 255 are converted to unchanged to the unicode character with the same value (e.g. the byte 0x80 becomes the character 0x0080).
In your scenario, this may be more useful than the ASCII encoding (which converts values in the range 0x80 to 0xFF to '?') or any of the other usual encodings, which will also convert some of the characters in this range.

If you want this in .NET, you could use F# to make a library supporting this. F# supports ASCII strings, with a byte array as the underlying type, see Literals (F#) (MSDN):
let asciiString = "This is a string"B

OpenFileDialog filename as UTF8

C# question here..
I have a UTF-8 string that is being interpreted by a non-Unicode program in C++.. This text which is displayed improperly, but as far as I can tell, is intact, is then applied as an output filename..
Anyway, in a C# project, I am trying to open this file with an System.Windows.Forms.OpenFileDialog object. The filenames I am getting from this object's .FileNames[] is in Unicode (UCS-2). This string, however, has been misinterpreted.. For example, if the original string was 0xe3 0x81 0x82, a FileName[].ToCharArray() reveals that it is now 0x00e3 0x0081 0x201a .... .. It might seem like the OpenFileDialog object only padded it, but it is not.. In the third character that the OpenFileDialog produced, it is different and I cannot figure out what happened to this byte..
My question is: Is there any way to treat the filenames highlighted in the OpenFileDialog box as UTF-8?
I don't think it's relevant, but if you need to know, the string is in Japanese..
Thanks,
kreb
UPDATE
First of all, thanks to everyone who's offered their suggestions here, they're very much appreciated.
Now, to answer the suggestions to modify the C++ application to handle the strings properly, it doesn't seem to be feasible. It isn't just one application that is doing this to the strings.. There are actually a great number of these applications in my company that I have to work with, and it would take huge amount of manpower and time that simply isn't available. However, sean e's idea would probably be the best choice if I were to take this route..
#Remy Lebeau: I think hit the nail right on the head, I will try your proposed solution and report back.. :) I guess the caveat with your solution is that the Default encoding has to be the same on the C# application environment as the C++ application environment that created the file, which certainly makes sense as it would have to use the same code page..
#Jeff Johnson: I'm not pasting the filenames from the C++ app to the C# app.. I am calling OpenFileDialog.ShowDialog() and getting the OpenFileDialog.FileNames on DialogResult.OK.. I did try to use Encoding.UTF8.GetBytes(), but like Remy Lebeau pointed out, it won't work because the original UTF8 bytes are lost..
#everyone else: Thanks for the ideas.. :)
kreb
UPDATE
#Remy Lebeau: Your idea worked perfectly! As long as the environment of the C++ app is the same as the environment of the C# app is the same (same locale for non-Unicode programs) I am able to retrieve the correct text.. :)
Now I have more problems.. Haha.. Is there any way to determine the encoding of a string? The code now works for UTF8 strings that were mistakenly interpreted as ANSI strings, but screws up UCS-2 strings. I need to be able to determine the encoding and process each accordingly. GetEncoding() doesn't seem to be useful.. =/ And neither is StreamReader's CurrentEncoding property (always says UTF-8)..
P.S. Should I open this new question in a new post?

0x201a is the Unicode "low single comma quotation mark" character. 0x82 is the Latin-1 (ISO-8859-1, Windows codepage 1252) encoding of that character. That means the bytes of the filename are being interpretted as plain Ansi instead of as UTF-8, and thus being decoded from Ansi to Unicode accordingly. That is not surprising, as the filesystem has no concept of UTF-8, and Windows assumes non-Unicode filenames are using the OS's default Ansi encoding.
To do what you are looking for, you need access to the original UTF-8 encoded bytes so you can decode them properly. One thing you can try is to pass the FileName to the GetBytes() method of System.Text.Encoding.Default (in theory, that is using the same encoding that was used to decode the filename, so it should be able to produce the same bytes as the original), and then pass the resulting bytes to the GetString() method of System.Text.Encoding.UTF8.

I think your problem is at the begining:
I have a UTF-8 string that is being
interpreted by a non-Unicode program
in C++.. This text which is displayed
improperly, but as far as I can tell,
is intact, is then applied as an
output filename..
If you load a UTF-8 string with a non-unicode program and then serialize it, it will contain non-unicode chars.
Is there any way that your C++ program can handle Unicode?

Can you use members of the System.Text namespace (e.g., the UTF8Encoding class) to convert the .NET framework's internal string representation to/ from a byte array containing the text in the encoding of your choice?

If you are sure that the C++ output is fine, then in your C# app you should convert it from UTF-8 to UTF-16 using the .NET encoding class and just work with it in the Windows native format.
If you can modify the C++ app, that might be better - give the C# app input that doesn't need to be re-encoded. In it, the UTF8 to Unicode translation can be handled via MultiByteToWideChar, using CP_UTF8 for the CodePage parameter, but it only works when none of the flags are set for dwFlags (specify 0 for dwFlags). The whole app doesn't need to be Unicode. Even though it is not compiled unicode, you can make selective use of Unicode APIs.

In answer to your question "is there a way to treat the filenames as utf-8?" Try this code:
List<byte[]> utf8FileNames = new List<byte[]>();
foreach (string fileName in openFileDialog1.FileNames)
{
utf8FileNames.Add(Encoding.UTF8.GetBytes(fileName));
}
// Each byte array in utf8FileNames is a sequence of utf-8 bytes matching each file name chosen
What do you do with the file names once you have got them from the open file dialog? Can you post that code?

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.