Mysteriously added quotation mark inside a string - c#

I copied and pasted a certain source code into my program with a text editor. I basically need to confirm that the source code begins with "int main()" so I went ahead and compared line with "int main()" but the comparison always returned false.
I decided to strip the string into characters and found something weird.
so string line has "int main()" passed inside it which is the text that has been pasted inside the text editor. You would think a and b would have the same characters, but they don't:
I'm honestly not sure where is that quotation mark in the beginning coming from. The original string didn't contain it, the debugger doesn't show it (It would display "\"int main()\"" otherwise). What is happening here?
Edit: I tried line = line.Trim(). Still that character is not gone. Apparently it's some special unicode character for Zero width no-break space. How can I remove this from my string?

65279 looks like the decimal representation of a UTF-16 BOM (U+FEFF), is it possible that the way you're reading the data into "line" would've failed to remove it?

Could you set line to line.Trim(); It's hard to tell what might be going on without seeing how line is set.
update based on the BOM character: try line.Trim(new char[]{'\uFEFF'}); assuming .NET 4

I've found the solution:
private readonly string BYTE_ORDER_MARK_UTF8 = Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble());
...
if (line.StartsWith(BYTE_ORDER_MARK_UTF8))
line = line.Remove(0, BYTE_ORDER_MARK_UTF8.Length);
That was bizzare...

In that code you have posted, it seems like the line variable begins with a space character. Try line = line.Trim();
Edit:
The reason the string.Trim() method is not working as expected can found on MSDN
Starting with the .NET Framework 4, the method trims all Unicode white-space characters (that is, characters that produce a true return value when they are passed to the Char.IsWhiteSpace method). Because of this change, the Trim method in the .NET Framework 3.5 SP1 and earlier versions removes two characters, ZERO WIDTH SPACE (U+200B) and ZERO WIDTH NO-BREAK SPACE (U+FEFF), that the Trim method in the .NET Framework 4 and later versions does not remove.
(U+FEFF) seems to be the character at the beginning of line, hence why Trim isn't dealing with it.

Related

Is it possible to display (convert?) the unicode hex \u0092 to an unicode html entity in .NET?

I have some string that contains the following code/value:
"You won\u0092t find a ...."
It looks like that string contains the Right Apostrophe special character.
ref1: Unicode control 0092
ref2: ASCII chart (both 127 + extra extended ascii)
I'm not sure how to display this to the webbrowser. It keeps displaying the TOFU square-box character instead. I'm under the impression that the unicode (hex) value 00092 can be converted to unicode (html) ’
Is my understanding correct?
Update 1:
It was suggested by #sam-axe that I HtmlEncode the unicode. That didn't work. Here it is...
Note the ampersand got correctly encoded....
It looks like there's an encoding mix-up. In .NET, strings are normally encoded as UTF-16, and a right apostrophe should be represented as \u2019. But in your example, the right apostrophe is represented as \x92, which suggests the original encoding was Windows code page 1252. If you include your string in a Unicode document, the character \x92 won't be interpreted properly.
You can fix the problem by re-encoding your string as UTF-16. To do so, treat the string as an array of bytes, and then convert the bytes back to Unicode using the 1252 code page:
string title = "You won\u0092t find a cheaper apartment * Sauna & Spa";
byte[] bytes = title.Select(c => (byte)c).ToArray();
title = Encoding.GetEncoding(1252).GetString(bytes);
// Result: "You won’t find a cheaper apartment * Sauna & Spa"
Note: much of my answer is based on guessing and looking at the decompiled code of System.Web 4.0. The reference source looks very similar (identical?).
You're correct that "’" (6 characters) can be displayed in the browser. Your output string, however, contains "\u0092" (1 character). This is a control character, not an HTML entity.
According to the reference code, WebUtility.HtmlEncode() doesn't transform characters between 128 and 160 - all characters in this range are control characters (ampersand is special-cased in the code as are a few other special HTML symbols).
My guess is that because these are control characters, they're output without transformation because transforming it would change the meaning of the string. (I tried running some examples using LinqPad, this character was not rendered.)
If you really want to transform these characters (or remove them), you'll probably have to write your own function before/after calling HtmlEncode() - there may be something that does this already but I don't know of any.
Hope this helps.
Edit: Michael Liu's answer seems correct. I'm leaving my answer here because it may be useful in cases when the input encoding of a string is not known.

C# string length limit

I am trying to understand why VS loses all keyword higlighting after this line of code. The code will compile but it will through exeptions.
the string is a base64string represenation of an image. If you remove the first letter VS recognizes that the characters are a valid string, compiles and no exceptions. Interestingly enough, the string is 32,742. If you add 32,743..its a no-go. I assume it is related to
a limit to how you can initialize a string
a need to use a different data type like char.
Anyone have an idea..I just stumbled upon this and now I am curious.
Bob
string g = "Any string greater that 32,742 characters suddenly disables all keyword highlighting and code will fail......";

Replacing specific Unicode characters in strings read from Excel

I am attempting to replace some undesirable characters in a string retrieved from an Excel spreadsheet. The reason being that our Oracle database is using the WE8ISO8859P1 character set, which does not define several characters that Excel "helpfully" inserts for you in text (curly quotes, em and en dashes, etc.) Since I have no control over the database or how the Excel spreadsheets are created I need to replace the characters with something else.
I retrieve the cell contents into a string thus:
string s = xlRange.get_Range("A1", Missing.Value).Value2.ToString().Trim();
Viewing the string in Visual Studio's Text Visualiser shows the text to be complete and correctly retrieved. Next I try and replace one of the undesirable characters (in this case the right-hand curly quote symbol):
s = Regex.Replace(s, "\u0094", "\u0022");
But it does nothing (Text Visualiser shows it still to be there). To try and verify that the character I want to replace is actually in there, I tried:
bool a = s.Contains("\u0094");
but it returns false. However:
bool b = s.Contains("”");
returns true.
My (somewhat lacking) understanding of strings in .NET is that they're encoded in UTF-16, whereas Excel would probably be using ANSI. So does that mean I need to change the encoding of the text as it comes out of Excel? Or am I doing something else wrong here? Any advice would be greatly appreciated. I have read and re-read all articles I can find about Unicode and encoding but am still none the wiser.
Yes strings in .Net are UTF-16.
You're doing it right; perhaps your hex-math is incorrect.
The character you tested for isn't "\u0094" (Not sure that's what you meant). The following worked for me:
((int)"”"[0]).ToString("X") returns "201D"
"”" == "\u201D" returns true
"\u0094" == "" (right hand side is the empty string) returns false
A lot of UTF-16 characters will seem as an empty string by the text visualizer but they can either be an undisplayable character or part of a surrogate (i.e. Some characters may need to be typed "\UXXXXXXXX" while others you can do with (four digits) "\uXXXX".). My knowledge of this domain is very limited.
References - Jon Skeet's articles on:
Strings
Unicode
You can use NVARCHAR and NTEXT instead of VARCHAR and TEXT for the columns that need to accomodate those characters.
That wayyou don't have to convert the whole database, and you are future proof, because the columns will be Unicode.

.NET string IndexOf unexpected result

A string variable str contains the following somewhere inside it: se\">
I'm trying to find the beginning of it using:
str.IndexOf("se\\\">")
which returns -1
Why isn't it finding the substring?
Note: due to editing the snippet showed 5x \ for a while, the original had 3 in a row.
Your code is in fact searching for 'se\\">'. When searching for strings including backslashes I usually find it easier to use verbatim strings:
str.IndexOf(#"se\"">")
In this case you also have a quote in the search string, so there is still some escaping, but I personally find it easier to read.
Update: my answer was based on the edit that introduced extra slashes in the parameter to the IndexOf call. Based on current version, I would place my bet on str simply not containing the expected character sequence.
Update 2:
Based on the comments on this answer, it seems to be some confusion regarding the role of the '\' character in the strings. When you inspect a string in the Visual Studio debugger, it will be displayed with escaping characters.
So, if you have a text box and type 'c:\' in it, inspecting the Text property in the debugger will show 'c:\\'. An extra backslash is added for escaping purposes. The actual string content is still 'c:\' (which can be verified by checking the Length property of the string; it will be 3, not 4).
If we take the following string (taken from the comment below)
" '<em
class=\"correct_response\">a
night light</em><br
/><br /><table
width=\"100%\"><tr><td
class=\"right\">Ingrid</td></tr></table>')"
...the \" sequences are simply escaped quotation marks; the backslashes are not part of the string content. So, you are in fact looking for 'se">', not 'se\">'. Either of these will work:
str.IndexOf(#"se"">"); // verbatim string; escape quotation mark by doubling it
str.IndexOf("se\">"); // regular string; escape quotation mark using backslash
This works:
string str = "<case\\\">";
int i = str.IndexOf("se\\\">"); // i = 3
Maybe you're not correctly escaping one of the two strings?
EDIT there's an extra couple of \ in the string you are searching for.
Maybe the str variable does not actually contain the backslash.
It may be just that when you mouse over the variable while debugging, the debugger tooltip will show the escape character.
e.g. If you put a breakpoint after this assignment
string str = "123\"456";
the tooltip will show 123\"456 and not 123"456.
However if you click on the visualize icon, you will get the correct string 123"456
Following code:
public static void RunSnippet()
{
string s = File.ReadAllText (#"D:\txt.txt");
Console.WriteLine (s);
int i = s.IndexOf("se\\\">");
Console.WriteLine (i);
}
Gives following output:
some text before se\"> some text after
17
Seems like working to me...
TextBox2.Text = TextBox1.Text.IndexOf("se\"">")
seems to work in VB.
DoubleQuotes within a string need to be specified like "" Also consider using verbatim strings - So an example would be
var source = #"abdefghise\"">jklmon";
Console.WriteLine(source.IndexOf(#"se\"">")); // returns 8
If you are looking for se\">
then
str.IndexOf(#"se\"">")
is less error-prone. Note the double "" and single \
Edit, after the comment: it seems like the string may contain ecaping itself, in which case in se\"> the \" was an escaped quote, so the literal text is simply se"> and the string to use is Indexof("se\">")

Base64 String throwing invalid character error

I keep getting a Base64 invalid character error even though I shouldn't.
The program takes an XML file and exports it to a document. If the user wants, it will compress the file as well. The compression works fine and returns a Base64 String which is encoded into UTF-8 and written to a file.
When its time to reload the document into the program I have to check whether its compressed or not, the code is simply:
byte[] gzBuffer = System.Convert.FromBase64String(text);
return "1F-8B-08" == BitConverter.ToString(new List<Byte>(gzBuffer).GetRange(4, 3).ToArray());
It checks the beginning of the string to see if it has GZips code in it.
Now the thing is, all my tests work. I take a string, compress it, decompress it, and compare it to the original. The problem is when I get the string returned from an ADO Recordset. The string is exactly what was written to the file (with the addition of a "\0" at the end, but I don't think that even does anything, even trimmed off it still throws). I even copy and pasted the entire string into a test method and compress/decompress that. Works fine.
The tests will pass but the code will fail using the exact same string? The only difference is instead of just declaring a regular string and passing it in I'm getting one returned from a recordset.
Any ideas on what am I doing wrong?
You say
The string is exactly what was written
to the file (with the addition of a
"\0" at the end, but I don't think
that even does anything).
In fact, it does do something (it causes your code to throw a FormatException:"Invalid character in a Base-64 string") because the Convert.FromBase64String does not consider "\0" to be a valid Base64 character.
byte[] data1 = Convert.FromBase64String("AAAA\0"); // Throws exception
byte[] data2 = Convert.FromBase64String("AAAA"); // Works
Solution: Get rid of the zero termination. (Maybe call .Trim("\0"))
Notes:
The MSDN docs for Convert.FromBase64String say it will throw a FormatException when
The length of s, ignoring white space
characters, is not zero or a multiple
of 4.
-or-
The format of s is invalid. s contains a non-base 64 character, more
than two padding characters, or a
non-white space character among the
padding characters.
and that
The base 64 digits in ascending order
from zero are the uppercase characters
'A' to 'Z', lowercase characters 'a'
to 'z', numerals '0' to '9', and the
symbols '+' and '/'.
Whether null char is allowed or not really depends on base64 codec in question.
Given vagueness of Base64 standard (there is no authoritative exact specification), many implementations would just ignore it as white space. And then others can flag it as a problem. And buggiest ones wouldn't notice and would happily try decoding it... :-/
But it sounds c# implementation does not like it (which is one valid approach) so if removing it helps, that should be done.
One minor additional comment: UTF-8 is not a requirement, ISO-8859-x aka Latin-x, and 7-bit Ascii would work as well. This because Base64 was specifically designed to only use 7-bit subset which works with all 7-bit ascii compatible encodings.
string stringToDecrypt = HttpContext.Current.Request.QueryString.ToString()
//change to
string stringToDecrypt = HttpUtility.UrlDecode(HttpContext.Current.Request.QueryString.ToString())
If removing \0 from the end of string is impossible, you can add your own character for each string you encode, and remove it on decode.
One gotcha to do with converting Base64 from a string is that some conversion functions use the preceding "data:image/jpg;base64," and others only accept the actual data.

Categories

Resources