0x202A in filename: Why? - c#

I recently needed to do a isnull in SQL on a varbinary image.
So far so (ab)normal.
I very quickly wrote a C# program to read in the file no_image.png from my desktop, and output the bytes as hex string.
That program started like this:
byte[] ba = System.IO.File.ReadAllBytes(#"‪D:\UserName\Desktop\no_image.png");
Console.WriteLine(ba.Length);
// From here, change ba to hex string
And as I had used readallbytes countless times before, I figured no big deal.
To my surprise, I got a "NotSupported" exception on ReadAllBytes.
I found that the problem was that when I right click on the file, go to tab "Security", and copy-paste the object-name (start marking at the right and move inaccurately to the left), this happens.
And it happens only on Windows 8.1 (and perhaps 8), but not on Windows 7.
When I output the string in question:
public static string ToHexString(string input)
{
string strRetVal = null;
System.Text.StringBuilder sb = new System.Text.StringBuilder();
foreach (char c in input)
{
sb.Append(((int)c).ToString("X2"));
}
strRetVal = sb.ToString();
sb.Length = 0;
sb = null;
return strRetVal;
} // End Function ToHexString
string str = ToHexString(#"‪D:\UserName\Desktop\cookie.png");
string strRight = " (" + ToHexString(#"D:\UserName\Desktop\cookie.png") + ")"; // Correct value, for comparison
string msg = str + Environment.NewLine + " " + strRight;
Console.WriteLine(msg);
I get this:
202A443A5C557365724E616D655C4465736B746F705C636F6F6B69652E706E67
(443A5C557365724E616D655C4465736B746F705C636F6F6B69652E706E67)
First thing, when I lookup 20 2A in ascii, it's [space] + *
Since I don't see neither a space nor a star, when I google 20 2A, the first thing I get is paragraph 202a of the german penal code
http://dejure.org/gesetze/StGB/202a.html
But I suppose that is rather an unfortunate coincidence and it is actually the unicode control character 'LEFT-TO-RIGHT EMBEDDING' (U+202A)
http://www.fileformat.info/info/unicode/char/202a/index.htm
Is that a bug, or is that a feature ?
My guess is, it's a buggy feature.

The issue is that the string does not begin with a letter D at all - it just looks like it does.
It appears that the string is hard-coded in your source file.
If that's the case, then you have pasted the string from the security dialog. Unbeknownst to you, the string you pasted begins with the LRO character. This is an invisible character which tales no space, but tells the renderer to render characters from left-to-right, ignoring the usual rendering.
You just need to delete the character.
To do this, position the cursor AFTER the D in the string. Use the Backspace or Delete to Left key <x] to delete the D. Use the key again to delete the invisible LRO character. One more time to delete the ". Now retype the " and the D.
A similar problem could occur wherever the string came from - e.g. from user input, command line, script file etc.
Note: The security dialog shows the filename beginning with the LRO character to ensure that characters are displayed in the left-to-right order, which is necessary to ensure that the hierarchy is correctly understood when using RTL characters. e.g. a filename c:\folder\path\to\file in Arabic might be c:\folder\مسار/إلى/ملف. The "gotcha" is the Arabic parts read in the other direction so the word "path" according to google translate is مسار, and that is the rightmost word, making it appear is if it was the last element of the path, when in fact it is the element immediately after "c:\folder\".
Because security object paths have an hierarchy which is in conflict with the RTL text layout rules, the security dialog always displays RTL text in LTR mode. That means that the Arabic words will be mangled (letters in wrong order) on the security tab. (Imagine it as if it said "elif ot htap"). So the meaning is just about discernable, but from the point of view of security, the security semantics are preserved.

Filenames that contain RLO/LRO overrides are commonly created by malware. Eg. “exe” read backwards spells “malware”. You probably have an infected host, or the origin of the .png is infected.

This question bothered me a lot, how would it be possible that a deterministic function would give 2 different results for identical input? After some testing, it turns out that the answer is simple.
If you look through it in your debugger, you will see that the 'D' char in your #"‪D:\UserName\Desktop\cookie.png" (first use of Hex function) is NOT the same char as in #"D:\UserName\Desktop\cookie.png" (second use).
You must have used some other 'D'-like character, probably by unwanted keyboard shortcut or by messing with your Visual Studio character encoding.
It looks exactly the same, but in reality it's not event a single char 9try to watch the c variable in your toHex function.
if you change to the normal 'D' in your first example, it will work fine.

Related

How do you get emojis to display in a Unity TextMeshPro element?

I can't seem to find any sort of posts or videos online about this topic, so I'm starting to wonder if it's just not possible. Everything about "emojis" in Unity is just a simple implementation of a spritesheet and then manually indexing them with like <sprite=0>. I'm trying to pull tweets from Twitter and then display their text with emojis, so clearly this isn't feasible to do with the 1500+ emojis that unicode supports.
I believe I've correctly created a TMP font asset using the default Windows emoji font, Segoe UI Emoji, and it looks like using some unicode hex ranges I found on an online unicode database, I was able to detect 1505 emojis in the font.
I then set the emoji font as a fall-back font in the Project Settings:
But upon running the game, I still get the same error that The character with Unicode value \uD83D was not found in the [SEGOEUI SDF] font asset or any potential fallbacks. It was replaced by Unicode character \u25A1 in text object
In the console an output of the tweet text looks something like this: #cat #cats #CatsOfTwitter #CatsOnTwitter #pet \nLike & share , Thanks!\uD83D\uDE4F\uD83D\uDE4F\uD83D\uDE4F
From some looking around online and extremely basic knowledge of unicode, I theorize that the issue is that in the tweet body, the emojis are in UTF-16 surrogate pairs or whatever, where \uD83D\uDE4F is one emoji, but my emoji font is in UTF-32, so it's looking for u+0001f64f. So would I need to find a way to get it to read the full surrogate pair and then convert to UTF-32 to get the correct emoji to render?
Any help would be greatly appreciated, I've tried asking around the Unity Discord server, but nobody else knows how to solve this issue either.
Intro
TMPro is natively able to do this, but only with UTF-32 formatted unicode. For example, \U0001F600 is '😀︎'. Your emojis are formatted in what I believe is UTF-8 (correct me if i'm wrong), being \u1F600, which is still '😀︎'. The only difference between these two are the capital U and 3 zeros prepending it. This makes it very easy to convert. Typing the UTF-32 version into TMPro shows the emoji as normal. What you are looking for is converting UTF-16 surrogate pairs into UTF-32, which is included further down.
Luckily, this solution does not require any font modification, the default font is able to do this, and I didn't change any settings in the inspector.
UTF-8 Solution
This solution below is for non-surrogate pair UTF-8 code.
To convert UTF-8 to UTF-32, we just need to change the 'u' to be uppercase and add a few zeros prepending it. To do so, we can use System.RegularExpressions.Regex.Replace.
public string ToUTF32(string input)
{
string output = input;
Regex pattern = new Regex(#"\\u[a-zA-Z0-9]*");
while (output.Contains(#"\u"))
{
output = pattern.Replace(output, #"\U000" + output.Substring(output.IndexOf(#"\u", StringComparison.Ordinal) + 2, 5), 1);
}
return output;
}
input being the string that contains the emoji unicode. The function converts all of the unicode in the string, and keeps everything else as it was.
Explanation
This code is pretty long, so this is the explanation.
First, the code takes the input string, for example, blah blah \u1F600 blah \u1F603 blah, which contains 2 of the unicode emojis, and replaces the unicode with another long string of code, which is the next section.
Secondly, it takes the input and Substrings everything after "\u", 5 characters ahead. It replaces the text with "\U000" + the aforementioned string.
It repeats the above steps until all of the unicode is translated.
This outputs the correct string to do the job.
If anyone thinks the above information is incorrect, please let me know. My vocabulary on this subject is not the best, so I am willing to take corrections.
Surrogate Pairs Solution
I have tinkered for a little while and come up with the function below.
public string ToUTF32FromPair(string input)
{
var output = input;
Regex pattern = new Regex(#"\\u[a-zA-Z0-9]*\\u[a-zA-Z0-9]*");
while (output.Contains(#"\u"))
{
output = pattern.Replace(output,
m => {
var pair = m.Value;
var first = pair.Substring(0, 6);
var second = pair.Substring(6, 6);
var firstInt = Convert.ToInt32(first.Substring(2), 16);
var secondInt = Convert.ToInt32(second.Substring(2), 16);
var codePoint = (firstInt - 0xD800) * 0x400 + (secondInt - 0xDC00) + 0x10000;
return #"\U" + codePoint.ToString("X8");
},
1
);
}
return output;
}
This does basically the same thing as before except it takes in the input that has surrogate pairs in it and translates it.

Why does the Notepad++ [NULL] character not paste?

I am new to this site, and I don't know if I am providing enough info - I'll do my best =)
If you use Notepad++, then you will know what I am talking about -- When a user loads a .exe into Notepad++, the NUL / \x0 character is replaced by NULL, which has a black background, and white text. I tried pasting it into Visual Studio, hoping to obtain the same output, but it just pasted some spaces...
Does anyone know if this is a certain key-combination, or something? I would like to put the NULL character in replacement of \x0, just like Notepad++ =)
Notepad++ is a rich text editor unlike your regular notepad. It can display custom graphics so common in all modern text editors. While reading a file whenever notepad++ encounters the ASCII code of a null character then instead of displaying nothing it adds the string "NULL" to the UI setting the text background colour to black and text colour to white which is what you are seeing. You can show any custom style in your rich text editor too.
NOTE: This is by no means an efficient solution. I'm clearly traversing a read string 2 times just to take benefit of already present methods. This can be done manually in a single pass. It is just to give a hint about how you can do it. Also I wrote the code carefully but haven't ran it because I don't have the tools at the moment. I apologise for any mistakes let me know I'll update it
Step 1 : Read a text file by line (line ends at '\n') and replace all instances of null character of that line with the string "NUL" using the String.Replace(). Finally append the modified text to your RichTextBox.
Step 2 : Re traverse your read line using String.IndexOf() finding start indexes of each "NUL" word. Using these indexed you select text from RichTextBox and then style that selected text using RichTextBox.SelectionColor and RichTextBox.SelectionBackColor
richTextBoxCursor basically just represents the start index of each line in RichTextBox
StreamReader sr = new StreamReader(#"c:\test.txt" , Encoding.UTF8);
int richTextBoxCursor = 0;
while (!sr.EndOfStream){
richTextBoxCursor = richTextBox.TextLength;
string line = sr.ReadLine();
line = line.Replace(Convert.ToChar(0x0).ToString(), "NUL");
richTextBox.AppendText(line);
i = 0;
while(true){
i = line.IndexOf("NUL", i) ;
if(i == -1) break;
// This specific select function select text start from a certain start index to certain specified character range passed as second parameter
// i is the start index of each found "NUL" word in our read line
// 3 is the character range because "NUL" word has three characters
richTextBox.Select(richTextBoxCursor + i , 3);
richTextBox.SelectionColor = Color.White;
richTextBox.SelectionBackColor = Color.Black;
i++;
}
}
Notepad++ may use custom or special fonts to show these particular characters. This behavior also may not appropriate for all text editors. So, they don't show them.
If you want to write a text editor that visualize these characters, you probably need to implement this behavior programmatically. Seeing notepad++ source can be helpful If you want.
Text editor
As far as I know in order to make Visual Studio display non printable characters you need to install an extension from the marketplace at https://marketplace.visualstudio.com.
One such extension, which I have neither tried nor recomend - I just did a quick search and this is the first result - is
Invisible Character Visualizer.
Having said that, copy-pasting binaries is a risky business.
You may try Edit > Advanced > View White Space first.
Binary editor
To really see what's going on you could use the VS' binary editor: File->Open->(Open with... option)->Binary Editor -> OK
To answer your question.
It's a symbolic representation of 00H double byte.
You're copying and pasting the values. Notepad++ is showing you symbols that replace the representation of those values (because you configured it to do so in that IDE).

Other language word/string split/manipulation issue

Highlight the text searched scenario:
Ex: If I have a word RK and 'r' is searched, I have to highlight first occurance of 'r' i.e., RK. In the background it is like
< b >R< /b >K.
Similarly I have to highlight ம in மொ. Hence I am trying to find the position of ம in மொ and performing highlighting operation.
Here I am getting the text after manipulation as
< b >ம< /b >ொ and hence it is displayed as ம ொ
The code that I used for string manipulation and highlighting:
formattedString = string.Empty;
searchStringLength = searchString.Length;
formattedString += inputString.Substring(0, find);
formattedString += "<b>" + inputString.Substring(find, searchStringLength) + "</b>";
formattedString += inputString.Substring(find + searchStringLength);
The example is just for Tamil word, any suggestions to make it work for all other languages other than english?
I do not know Tamil. Looking at your question, the input string should be three letter string.
Probably, you are setting your find variable something like
find = inputString.IndexOf("ம");?
somewhere in your code.
The Tamil word மொ is not being counted as three letter word. Visual Studio is handling it as single letter while மொ.Length returns 2. ToCharArray() also returns array of two characters. That is why, IndexOf is always returning 0.
Your comment on question:
since ம + ொ = மொ, the ம find was returning true always. Now after this solution, ம find will return false and hence I don't have to highlight. Only if மொ is entered to find, it matches exactly and I can highlight.
I do not think problem is in SubString. The IndexOf needs to be handled tactically.

How to get index of any charcter in unicode string

I having a string variable which basically holds value of corresponding English word in the form of Chinese.
String temp = "'%1'不能输入步骤'%2'";
But when i want to know wether the string having %1 in it or not by using IndexOf function
if(temp.IndexOf("%1") != -1)
{
}
I am not getting true even if it contain %1.
So is there any issue due to Chinese charters or any thing else.
Pls suggest me how i can get the index of any charter in above case.
That is because %1 is not equal to %1 What you want to do in this case as workaround is select the symbols out of string you have like
var s = "'%1'不能输入步骤'%2'";
var firstFragment = s.Substring(1, 2); // this should select you %1
and then do
if(temp.IndexOf(first) != -1){
}
Comments gave the answer. Use the same percent character, so instead of:
"%1"
use:
"%1"
Or, if you find that problematic (your source code is in a "poor" code page, or you fear the code is hard to read when it contains full-width characters that resemble ASCII characters), use:
"\uFF051"
or even:
"\uFF05" + "1"
(concatenation will be done by the C# compiler, no extra concatting done at run-time).
Another approach might be Unicode normalization:
temp = temp.Normalize(NormalizationForm.FormKC);
which seems to project the "exotic" percent char into the usual ASCII percent char, although I am not sure if that behavior is guaranteed, but see the Decomposition field on Unicode Character 'FULLWIDTH PERCENT SIGN' (U+FF05).

Find and replace ASCII character with a new line

I am trying to find every occurrence of an ASCII character in a string and replace it with a new line. Here is what I have so far:
public string parseText(string inTxt)
{
//String builder based on the string passed into the method
StringBuilder n = new StringBuilder(inTxt);
//Convert the ASCII character we're looking for to a string
string replaceMe = char.ConvertFromUtf32(187);
//Replace all occurences of string with a new line
n.Replace(replaceMe, Environment.NewLine);
//Convert our StringBuilder to a string and output it
return n.ToString();
}
This does not add in a new line and the string all remains on one line. I’m not sure what the problem is here. I have tried this as well, but same result:
n.Replace(replaceMe, "\n");
Any suggestions?
char.ConvertFromUtf32, whilst correct, is not the simplest way to read a character based on its ASCII numeric value. (ConvertFromUtf32 is mainly intended for Unicode code points that lie outside the BMP, which result in surrogate pairs. This is not something you'd encounter in English or most modern languages.) Rather, you should just cast it using (char).
char c = (char)187;
string replaceMe = c.ToString();
You may, of course, define a string with the required character as a literal in your code: "»".
Your Replace would then be simplified to:
n.Replace("»", "\n");
Finally, on a technical level, ASCII only covers characters whose value lies in the 0–127 range. Character 187 is not ASCII; however, it corresponds to » in ISO 8859-1, Windows-1252, and Unicode, which collectively are by far the most popular encodings in use today.
Edit: I just tested your original code, and found that it actually worked. Are you sure the result remains on one line? It might be an issue with the way the debugger renders strings in single-line view:
Note that the \r\n sequences actually do represent newlines, despite being displayed as literals. You can check this from the multi-line display (by clicking on the magnifying glass):
StringBuilder.Replace returns a new StringBuilder with the changes made. Strange, I know, but this should work:
StringBuilder replaced = n.Replace(replaceMe, Environment.NewLine);
return replaced.ToString();

Categories

Resources