I think this is some kind of encoding problem

I have two computers, both running WinXP SP2 (I don't really know how similar they are beyond that). I am running MS Visual C# 2008 Express Edition on both, and that's what I'm currently using to program.
I made an application that loads in an XML file and displays the contents in a DataGridView.
The first line of my xml file is:
<?xml version="1.0" encoding="utf-8"?>
...and really... it's utf-8 (at least according to MS VS C# when I just open the file there).
I compile the code and run it on one computer, and the contents of my DataGridView appears normal. No funny characters. I compile the code and run it on the other computer (or just take the published version from computer #1 and install it on computer #2 - I tried this both ways) and in the datagridview, where there are line breaks/new lines in the xml file, I see funny square characters.
I'm a novice to encoding... so the only thing I really tried to troubleshoot was to use that same program to write the contents of my xml to a new xml file (but I'm actually writing it to a text file, with the xml tags in it) since the default writing to a text file seems to be utf-8. Then I read this new file back in to my program. I get the same results.
I don't know what else to do or how to troubleshoot this or what I might fundamentally be doing wrong in the first place.
-Adeena

This doesn't have to do with UTF-8 or character encodings; the problem is line endings. In Windows, each line of a text file ends in the two characters carriage return (CR) and line feed (LF), which are code points U+000D and U+000A respectively. In ASCII and UTF-8, these are encoded as the two bytes 0D 0A. Most non-Windows systems, including Linux and Mac OS X, on the other hand, use just a line feed to signal end-of-line, so it's not uncommon to see line-ending problems when transferring text files between Windows and non-Windows systems.
However, since you're using just Windows on both systems, this is more of a mystery. One application is correctly interpreting the CRLF combination as a newline, but the other application is confused by the CR. Carriage returns are not printable characters, so it replaces the CR with a placeholder box, which is what you see; it then correctly interprets the line feed as the end-of-line.
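One way to check whether a stray carriage return is really what the grid is showing is to make the line-ending characters visible before displaying the string. The following is a minimal sketch, not taken from the original program; the class name LineEndingDemo and the helper ShowLineEndings are made up for illustration.

```csharp
using System;
using System.Text;

static class LineEndingDemo
{
    // Hypothetical helper: replaces CR and LF with visible markers
    // so stray line-ending characters can be spotted in a string.
    public static string ShowLineEndings(string text)
    {
        return text.Replace("\r", "<CR>").Replace("\n", "<LF>");
    }

    static void Main()
    {
        string windowsText = "first\r\nsecond";

        // UTF-8 encodes CR and LF as the single bytes 0D and 0A.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes("\r\n"))); // 0D-0A

        Console.WriteLine(ShowLineEndings(windowsText)); // first<CR><LF>second
    }
}
```

Running a value from the XML file through such a helper on both computers would show whether the two machines receive the same characters and only render them differently.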

The square usually appears when you use different types of newlines.
Linux - (0A) LF
Win - (0D0A) CRLF
Mac (classic Mac OS, before OS X) - (0D) CR
The app was probably created using 1 type and the running app is expecting another.
Check out Environment.NewLine
And, you might try this: (no guarantees -- I don't write much C#)
strInput = Regex.Replace(strInput, @"\r\n|\r|\n", Environment.NewLine); // match CRLF, lone CR, or lone LF (the pattern must not match the empty string)

I'm not sure of the cause of your problem, but one solution would be to just strip out the carriage returns from your strings. For every string you add, just call TrimEnd(null) on it to remove trailing whitespace:
newrow["topic"] = att1.ToString().TrimEnd(null);
If your strings might end in other whitespace (e.g. spaces or tabs) and you want to keep those, then just pass an array containing only the carriage return character to TrimEnd:
newrow["topic"] = att1.ToString().TrimEnd(new char[] { '\r' });
Disclaimer: I am not a C# programmer; the second statement may be syntactically incorrect

@Adam:
Sorry! Missed your earlier statement.
To load the document into the program and display it in the DataGridView, I am currently doing the following (I say "currently" because I tried other things, like using XDocument instead of XElement):
XElement xe1 = XElement.Load(filePath);
DataTable myTable = new DataTable();
myTable = mkTable(); // calls a function that makes the table
var _categories = (from p1 in xe1.Descendants("category") select p1);
int numCat = _categories.Count();
int i = 0;
while (i < numCat)
{
    DataRow newrow;
    newrow = myTable.NewRow();
    if (_categories.ElementAt(i).Parent.Name == "topic")
    {
        string att1 = _categories.ElementAt(i).Parent.Attribute("name").Value.ToString();
        newrow["topic"] = att1.ToString();
    }
    // repeat the above for the different things in my document
    myTable.Rows.Add(newrow);
    i++;
}
myDataSet.Merge(myTable);
bindingSourceIn.DataSource = myDataSet;
myDataGridView.DataSource = bindingSourceIn;
myDataGridView.DataMember = "xmlthing";
(obviously things are a little abbreviated here... i.e., my bindingsource/datagridview etc is declared elsewhere.... but hopefully this is enough to make sense)
-Adeena

Related

Read specific parts of ASCII file in C#

I am trying to make a FXB file previewer (VST preset banks for those who don't know) for Sylenth1 banks. I have encoded the FXB as an ASCII string and had it print to the console. The preset names show up fine. My issue is that the parameters for the oscillators, filters and effects are encoded as random characters (mainly "?" and fairly big spaces).
Underlined in red: file header (?)
Underlined in blue: preset name (which I want to keep)
Underlined in yellow: osc/FX/filter parameters (which I want to discard from the string)
Here's the code I wrote:
byte[] arr = File.ReadAllBytes(Properties.Resources.pointer); /* pointer is a string in resources I
used to point to the external FXB file for testing */
System.Text.ASCIIEncoding enc = new System.Text.ASCIIEncoding();
string fstr = enc.GetString(arr);
Console.Write(fstr);
Console.ReadKey();
I had written a foreach loop to replace every unwanted character with string.Empty, but it also removes parts of the preset names (e.g. the L from "Lead"), leaves the spaces intact and creates new ones, so I deleted it.
My end goal for those that are curious is this:
Preset 1
Preset 2
Preset 3
Preset 4
...
I'm at a total loss. I've tried different solutions from various websites and Stack Overflow posts, but none gave me the desired result.
(I also noticed that the preset names have almost the same space between them (~ 200 chars apart), can I use the difference to exclude the unwanted parts?)

It looks like a binary file, not an ASCII file. Some data in the file is easily readable because it is ASCII encoded, but other data, for example numbers, is stored in binary format.
Not all binary data can be converted to printable ASCII characters, so when you print it out like this you get the ???? mess.
It is better to inspect this file with a binary editor. Visual Studio has one, there is probably an extension for VS Code, and other editors have a binary viewer (e.g. Sublime). This will show you the data in the file as it is encoded, usually as hex with the ASCII in a second column.
But that is just so you can accurately see the content. It does not help you understand the meaning or the layout. You might be able to make something work by reverse engineering like this, but chances are it will not work for all cases. Using an API is going to be way easier.
I'm not familiar with these files but did you find this? https://new.steinberg.net/developers/ There is a forum there that might help.

I found the answer to this myself. I basically somewhat reverse engineered the FXB in a hex editor, and proceeded to load specific bytes of the file (31 to be exact) in order to encode those in a string and have that print to the console.
I managed to do so by literally counting how many bytes there are from the beginning to the 1st preset name, then from the end of the preset name (31 bytes) to the beginning of the other preset name, and so on.
For those who are interested, I am going to develop a GUI version of it in the future. But it does (and probably will) support only Sylenth1 v2 soundbanks/FXBs.
Also thanks to the people who reached out. They helped in their own way.
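For readers who want to try the same offset-counting approach, here is a minimal sketch of reading fixed-length names at a regular spacing from a byte array. Only the 31-byte name length comes from the answer above; the class PresetNameReader, the method ReadNames, and every offset and stride value in the demo are hypothetical and must be measured in a hex editor for a real file.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class PresetNameReader
{
    // Extracts fixed-length ASCII names that repeat every `stride` bytes.
    // firstOffset and stride are hypothetical inputs: measure them in a hex editor.
    public static List<string> ReadNames(byte[] data, int firstOffset, int stride, int nameLength)
    {
        if (stride <= 0) throw new ArgumentOutOfRangeException("stride");

        var names = new List<string>();
        for (int pos = firstOffset; pos + nameLength <= data.Length; pos += stride)
        {
            string raw = Encoding.ASCII.GetString(data, pos, nameLength);

            // Assumption: names shorter than the field are padded with NUL bytes or spaces.
            int nul = raw.IndexOf('\0');
            if (nul >= 0) raw = raw.Substring(0, nul);
            names.Add(raw.TrimEnd());
        }
        return names;
    }

    static void Main()
    {
        // Synthetic buffer standing in for a bank file: a 4-byte header, then
        // two 12-byte records (8-byte name field followed by 4 parameter bytes).
        byte[] data = new byte[4 + 2 * 12];
        Encoding.ASCII.GetBytes("Lead").CopyTo(data, 4);
        Encoding.ASCII.GetBytes("Bass").CopyTo(data, 16);

        foreach (string name in ReadNames(data, 4, 12, 8))
            Console.WriteLine(name);
    }
}
```

For a real bank, the byte array would come from File.ReadAllBytes and nameLength would be 31, with firstOffset and stride taken from the hex editor.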

Why does the Notepad++ [NULL] character not paste?

I am new to this site, and I don't know if I am providing enough info - I'll do my best =)
If you use Notepad++, then you will know what I am talking about -- When a user loads a .exe into Notepad++, the NUL / \x0 character is replaced by NULL, which has a black background, and white text. I tried pasting it into Visual Studio, hoping to obtain the same output, but it just pasted some spaces...
Does anyone know if this is a certain key-combination, or something? I would like to put the NULL character in replacement of \x0, just like Notepad++ =)

Notepad++ is a rich text editor, unlike regular Notepad. It can display custom graphics, as is common in modern text editors. While reading a file, whenever Notepad++ encounters the code of a null character, instead of displaying nothing it adds the string "NULL" to the UI, setting the text background colour to black and the text colour to white, which is what you are seeing. You can show any custom style in your own rich text editor too.
NOTE: This is by no means an efficient solution. I'm traversing each read string twice just to take advantage of existing methods. This can be done manually in a single pass. It is just to give a hint about how you can do it. Also, I wrote the code carefully but haven't run it because I don't have the tools at the moment. I apologise for any mistakes; let me know and I'll update it.
Step 1: Read a text file line by line (a line ends at '\n') and replace all instances of the null character in that line with the string "NUL" using String.Replace(). Finally, append the modified text to your RichTextBox.
Step 2: Traverse the read line again using String.IndexOf() to find the start index of each "NUL" word. Using these indexes, select the text in the RichTextBox and then style the selection using RichTextBox.SelectionColor and RichTextBox.SelectionBackColor.
richTextBoxCursor basically just represents the start index of each line in RichTextBox
StreamReader sr = new StreamReader(@"c:\test.txt", Encoding.UTF8);
int richTextBoxCursor = 0;
while (!sr.EndOfStream)
{
    richTextBoxCursor = richTextBox.TextLength;
    string line = sr.ReadLine();
    line = line.Replace(Convert.ToChar(0x0).ToString(), "NUL");
    richTextBox.AppendText(line);
    int i = 0;
    while (true)
    {
        i = line.IndexOf("NUL", i);
        if (i == -1) break;
        // Select(start, length) selects text from a start index over the given number of characters
        // i is the start index of each found "NUL" word in our read line
        // 3 is the length because the word "NUL" has three characters
        richTextBox.Select(richTextBoxCursor + i, 3);
        richTextBox.SelectionColor = Color.White;
        richTextBox.SelectionBackColor = Color.Black;
        i++;
    }
}

Notepad++ may use custom or special fonts to show these particular characters. This behavior may not be appropriate for all text editors, so they don't show them.
If you want to write a text editor that visualizes these characters, you probably need to implement this behavior programmatically. Looking at the Notepad++ source can be helpful if you want.

Text editor
As far as I know in order to make Visual Studio display non printable characters you need to install an extension from the marketplace at https://marketplace.visualstudio.com.
One such extension, which I have neither tried nor recommend (I just did a quick search and this is the first result), is
Invisible Character Visualizer.
Having said that, copy-pasting binaries is a risky business.
You may try Edit > Advanced > View White Space first.
Binary editor
To really see what's going on, you could use Visual Studio's binary editor: File -> Open -> (Open with... option) -> Binary Editor -> OK

To answer your question.
It's a symbolic representation of 00H double byte.
You're copying and pasting the values. Notepad++ is showing you symbols that replace the representation of those values (because you configured it to do so in that IDE).

Reading from text file creates three dots

I am reading a long text file containing a sql query using StreamReader then using StringBuilder to create a string that gets run against a database. Once the string is created I checked the value and three dots ... appear within the string causing the query to fail when I run it against the database. Why is this happening? What can I do to keep it from happening?
string script;
if (File.Exists(path))
{
    using (StreamReader sr = File.OpenText(path))
    {
        StringBuilder sb = new StringBuilder();
        while (!sr.EndOfStream)
        {
            sb.Append(sr.ReadLine());
        }
        script = sb.ToString();
    }
}
UPDATE: I should add that the three dots appear at character position 16384 every time. Not sure of the significance of this
UPDATE: It appears the string is being truncated at runtime. The file contains 48080 characters but is being truncated in the middle at position 16384, making the string variable 32768 characters long. Is this the max character count for a string?

I have a definite answer for you: Microsoft says that what you are experiencing is a bug in Visual Studio 2015. They have released an "Update 2" for Visual Studio 2015 that is reported to correct the issue.
I am trying to get this update installed by my admin, but in the meantime, a feasible workaround is to load the text in the JSON Visualizer instead. It will show an error that it is, of course, not valid JSON since it is SQL, but it will display the whole text of the string.
Download Update 2 from here:
https://www.visualstudio.com/en-us/news/vs2015-update2-vs.aspx
See the bug report here:
https://connect.microsoft.com/VisualStudio/feedback/details/2016177/text-visualizer-misses-corrupts-text-in-long-strings

I had the same problem while I was debugging a very long query string.
It turns out that the Visual Studio debugger (mine is 2015) truncates long strings after a certain number of characters for ease of reading. So even though you see three dots (...) in the Text Visualizer, the actual value doesn't contain them.
To my knowledge, the Visual Studio 2012 debugger doesn't add the three dots. I haven't found a way to turn the feature off in VS2015, but you can use the HTML Visualizer or JSON Visualizer as an alternative.

I have a feeling that you are checking the values in your debugger, where the long query text is being partially shown and ends with an ellipsis (...)
Again, I'm guessing here, but it seems you join your lines of SQL into one single line; if the lines in the file do not end with a whitespace character, the query will get messed up. That is probably why your SQL query does not work.
By the way, you can write the code you have far more succinctly as below
if (File.Exists(path))
    script = string.Join(" ", File.ReadLines(path));
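To illustrate the guess above, here is a small sketch of what appending lines without a separator does to a query. The two-line SQL text and the names JoinLinesDemo and AppendWithoutSeparator are invented for the example and are not from the question.

```csharp
using System;
using System.Text;

static class JoinLinesDemo
{
    // Mimics the loop in the question: each line is appended with nothing in between.
    public static string AppendWithoutSeparator(string[] lines)
    {
        var sb = new StringBuilder();
        foreach (string line in lines)
            sb.Append(line);
        return sb.ToString();
    }

    static void Main()
    {
        string[] lines = { "SELECT name", "FROM users" };

        Console.WriteLine(AppendWithoutSeparator(lines)); // SELECT nameFROM users
        Console.WriteLine(string.Join(" ", lines));       // SELECT name FROM users
    }
}
```

Note that joining with a space still changes the script if it contains "--" line comments, since everything after the comment marker would then sit on the same line. Reading the whole file with File.ReadAllText keeps the original line breaks and avoids both problems.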

Removing text above real content of CSV file

I have a CSV whose author, annoyingly enough, has decided to 'introduce' the file before the contents themselves. So in all, I have a CSV that looks like:
This file was created by XXXXYY and represents the crossover between YY and QQQ.
Additional information can be found through the website GG, blah blah blah...
Jacob, Hybrid
Dan, Pure
Lianne, Hybrid
Jack, Hatchback
So the problem here is that I want to get rid of the first few lines before the 'real content' of the CSV file begins. I'm looking for robustness here, so using Streamreader and removing all content before the 4th line for example, is not ideal (plus the length of the text can vary).
Is there a way in which one can read only what matters and write a new CSV into a directory path?
Regards,
genesis
(edit - I'm looking for C sharp code)

The solution depends on the files you have to parse. You need to look for a reliable pattern that distinguishes data from comment.
In your example, there are some possibilities that might be the same in other files:
there are 4 lines of text. But you say this isn't consistent across files
the text lines may not contain the same number of commas as the data table. But that is unlikely to be reliable for all files.
there is a blank/whitespace only line between the text and the data.
the data appears to be in the form word-comma-word. If this is true it should be easy to identify non data lines (any line which doesn't contain exactly one comma, or has multiple words etc)
You may be able to use a combination of these heuristics to more reliably detect the data.
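A combination of two of these heuristics could look like the sketch below: skip everything up to the first blank line, then keep only lines with exactly one comma. It assumes the file really has a blank line between the introduction and the data; if it does not, nothing is returned. The names CsvPreambleFilter and ExtractDataLines are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class CsvPreambleFilter
{
    // Keeps only lines that come after the first blank line and contain exactly one comma.
    public static List<string> ExtractDataLines(IEnumerable<string> lines)
    {
        return lines
            .SkipWhile(line => !string.IsNullOrWhiteSpace(line)) // skip the introduction text
            .Where(line => line.Count(c => c == ',') == 1)        // drops the blank line and non-data lines
            .ToList();
    }

    static void Main()
    {
        string[] sample =
        {
            "This file was created by XXXXYY.",
            "More information is on the website GG, blah blah blah...",
            "",
            "Jacob, Hybrid",
            "Dan, Pure"
        };

        // Prints the two data lines: "Jacob, Hybrid" and "Dan, Pure"
        foreach (string line in ExtractDataLines(sample))
            Console.WriteLine(line);
    }
}
```

The second introduction line in the sample contains exactly one comma, which is why the comma count alone is not enough here. With real files, the input could come from File.ReadLines and the result could be written out with File.WriteAllLines.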

You could scan by line (looking for the \r\n) and ignore lines that don't have a comma count that matches your CSV.
You should be able to read the file into a string pretty easily unless it is really massive.
e.g.
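// requires: using System; using System.Linq;
var csv = "some test\r\nsome more text\r\na,b,c\r\nd,e,f\r\n";
// '\r\n' is not a valid char literal, so split on the string "\r\n" instead
var lines = csv.Split(new[] { "\r\n" }, StringSplitOptions.RemoveEmptyEntries);
var csvLines = lines.Where(l => l.Count(c => c == ',') == 2);
// now csvLines contains only the lines you are after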

List<string> info = new List<string>();
int counter = 0;
// Open the file to read from.
info = System.IO.File.ReadAllLines(path).ToList();
// Find the lines up until (& including) the empty one
foreach (string s in info)
{
    counter++;
    if (string.IsNullOrEmpty(s))
        break; //exit from the loop
}
// Remove the lines including the blank one.
info.RemoveRange(0, counter);
Something like this should work. You should probably add a check for the case where no blank line is found (counter then equals the number of lines and everything gets removed), plus other tests to handle errors.
You could adapt this code so that it just finds the empty line number using LINQ or something, but I don't like the overhead of LINQ (yeah, ironic considering I'm using C#).
Regards,
Slipoch

Writing, displaying, and storing Japanese characters in c#

I am working on a project which requires lots of japanese katakana, hiragana, and kanji characters. The original files are excel files using the "ＭＳ Ｐゴシック" font. The problem I am having seems to be the same as everyone else with this type of issue and c#. The solutions I have found all seem to start with adding the text within the c# program.
What I am trying to do is read one of my .xls or .txt files into C# and work with the data using normal C# functions such as string compare. However, when I do this, nothing happens. Writing or displaying the data produces "?" marks. Nothing new here.
I tried the same idea with c++ and it works perfectly.
The problem is it has to be c#, not c++ in order to work with the interops for the other software I am utilizing.
Long story short, does C# (System.String) not handle Unicode natively compared to C++ (C string)?
I am using Visual Studio C++ 2008 Express and Visual Studio C# 2010 Express.
Files are the same, but it works in c++ and not in c#.
Sorry, I haven't used english in a while.
I have tried various types, the below is the latest but still "?" marks for output.
var reader = new StreamReader(File.OpenRead(@"C:\smallerBunShou.txt"), Encoding.UTF8);
while (!reader.EndOfStream)
{
    var line = reader.ReadLine();
    var values = line.Split(',');
    listA.Add(values[0]);
    // listB.Add(values[1]);
    // listC.Add(values[2]);
}
int sizeOflistA = listA.Count();
// The using block flushes and closes the writer when it finishes
using (var file = new StreamWriter(File.OpenWrite(@"C:\WriteLines2.txt"), Encoding.UTF8))
{
    foreach (string line in listA)
    {
        // If the line doesn't contain the word 'Second', write the line to the file.
        if (!line.Contains("Second"))
        {
            file.WriteLine(line);
        }
    }
}
I have also tried the Encoding.Unicode, etc.
My computer is a japanese PC, software is mostly japanese. According to one of the answers so far, it is not a unicode issue, Japanese PCs use Shift-JIS which is most likely what I need to look into. When I solve this I will post my solution.
Update:
After looking around a bit, I found the Shift-JIS encoding scheme.
Encoding.GetEncoding(932));
This solved my problem! Thank you @EricFalsken for pointing me in the right direction.

Normal .txt files are not necessarily saved in a Unicode format. You're going to need to specify the encoding when reading the FileStream, by running it through a TextReader with Encoding.Unicode.
But note that most Japanese computers and documents do NOT use Unicode. They still use Shift-JIS quite extensively.
I can assure you that all strings in C# support Unicode natively.
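The following sketch shows why the choice of encoding matters when decoding Shift-JIS bytes. It assumes a modern .NET runtime (.NET Core 3.0 or later), where code page 932 only becomes available after registering the code-pages provider; on the .NET Framework versions discussed in this thread, Encoding.GetEncoding(932) works directly and the registration line would be dropped. The names ShiftJisDemo and GetShiftJis are made up for the example.

```csharp
using System;
using System.Text;

static class ShiftJisDemo
{
    public static Encoding GetShiftJis()
    {
        // Assumption: modern .NET, where legacy code pages must be registered first.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(932); // code page 932 = Shift-JIS
    }

    static void Main()
    {
        Encoding sjis = GetShiftJis();
        string original = "カタカナ";

        // Simulate the bytes of a text file saved on a Japanese PC.
        byte[] bytes = sjis.GetBytes(original);

        string decodedAsSjis = sjis.GetString(bytes);
        string decodedAsUtf8 = Encoding.UTF8.GetString(bytes);

        Console.WriteLine(decodedAsSjis == original); // True
        Console.WriteLine(decodedAsUtf8 == original); // False: the bytes are not valid UTF-8
    }
}
```

The string type itself holds the Japanese text without any trouble; only the byte-to-string step needs the right encoding. The same Encoding object can be passed to StreamReader or File.ReadAllLines when reading the real file.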
