Line Of Hex Into ASCII - C#

I am trying to take a string of hex like
4100200062006C0061006E006B002000630061006E0076006100730020007200650063006F006D006D0065006E00640065006400200066006F007200200046006F007200670065002000650064006900740069006E00670020006F006E006C0079002E
and turn it into ASCII so that it looks like:
A blank canvas recommended for Forge editing only.
The variable for the hex is collected from a file that I opened into the program, reading a specific address like so:
BinaryReader br = new BinaryReader(File.OpenRead(ofd.FileName));
string mapdesc = null;
for (int i = 0x1C1; i <= 0x2EF; i++)
{
    br.BaseStream.Position = i;
    mapdesc += br.ReadByte().ToString("X2");
}
richTextBox1.Text = ("" + mapdesc);
Now that I have the mapdesc, I made it print into the rich text box, and it just looked like a line of hex. I want it to look like readable ASCII.
In a hex editor, the other side, read as ANSI, looks like
A. .b.l.a.n.k. .c.a.n.v.a.s. .r.e.c.o.m.m.e.n.d.e.d. .f.o.r. .F.o.r.g.e. .e.d.i.t.i.n.g. .o.n.l.y
The dots are 00s in the hex view, so I believe that in ASCII they should be nothing, so that I get the readable sentence, which is how the game reads it. What would I have to do to convert mapdesc into ASCII?

To be fair, your output matches the bytes you read exactly; the issue is actually with the input data.
If you look closely, you will notice that every other pair of characters is 00. Using some simple heuristics, we can determine that we have 16-bit words here, 4 hex chars per character.
The problem you are facing, and the reason for the . characters, is that while decoding this as UTF-8 (or ANSI), every other character will be a null.
You have two ways to solve this:
To continue decoding as UTF-8, remove every other null character from the string, i.e. all the 00s.
Or
Decode as UTF-16.
If you choose this option, you still have an issue with your data: the very first word is only 8 bits, which would shift every subsequent byte by one position. To decode as UTF-16, prepend an additional 00 at the beginning of the data blob (or start your loop one position sooner).
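If it helps, here is a minimal sketch of the second option (my own addition, not from the original answer), assuming the region really is big-endian UTF-16 and the missing leading 00 sits at offset 0x1C0; the offsets and the ofd/richTextBox1 names come from the question:
using System.IO;
using System.Text;
// Reads the region starting one byte earlier so the leading 00 is included,
// then decodes it as big-endian UTF-16 and trims any trailing null padding.
static string ReadMapDescription(string path)
{
    using (var br = new BinaryReader(File.OpenRead(path)))
    {
        br.BaseStream.Position = 0x1C0;
        byte[] raw = br.ReadBytes(0x2EF - 0x1C0 + 1);
        return Encoding.BigEndianUnicode.GetString(raw).TrimEnd('\0');
    }
}
// Usage with the question's controls:
// richTextBox1.Text = ReadMapDescription(ofd.FileName);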

Related

How do you get emojis to display in a Unity TextMeshPro element?

I can't seem to find any posts or videos online about this topic, so I'm starting to wonder if it's just not possible. Everything about "emojis" in Unity is just a simple implementation of a spritesheet and then manually indexing them with something like <sprite=0>. I'm trying to pull tweets from Twitter and then display their text with emojis, so clearly this isn't feasible to do with the 1500+ emojis that Unicode supports.
I believe I've correctly created a TMP font asset using the default Windows emoji font, Segoe UI Emoji, and using some Unicode hex ranges I found in an online Unicode database, I was able to detect 1505 emojis in the font.
I then set the emoji font as a fallback font in the Project Settings.
But upon running the game, I still get the same error: "The character with Unicode value \uD83D was not found in the [SEGOEUI SDF] font asset or any potential fallbacks. It was replaced by Unicode character \u25A1 in text object".
In the console an output of the tweet text looks something like this: #cat #cats #CatsOfTwitter #CatsOnTwitter #pet \nLike & share , Thanks!\uD83D\uDE4F\uD83D\uDE4F\uD83D\uDE4F
From some looking around online and extremely basic knowledge of unicode, I theorize that the issue is that in the tweet body, the emojis are in UTF-16 surrogate pairs or whatever, where \uD83D\uDE4F is one emoji, but my emoji font is in UTF-32, so it's looking for u+0001f64f. So would I need to find a way to get it to read the full surrogate pair and then convert to UTF-32 to get the correct emoji to render?
Any help would be greatly appreciated, I've tried asking around the Unity Discord server, but nobody else knows how to solve this issue either.
Intro
TMPro is natively able to do this, but only with UTF-32-formatted Unicode. For example, \U0001F600 is '😀︎'. Your emojis are formatted in what I believe is UTF-8 (correct me if I'm wrong), being \u1F600, which is still '😀︎'. The only difference between these two is the capital U and the three zeros prepended to it. This makes it very easy to convert. Typing the UTF-32 version into TMPro shows the emoji as normal. What you are looking for is converting UTF-16 surrogate pairs into UTF-32, which is covered further down.
Luckily, this solution does not require any font modification; the default font is able to do this, and I didn't change any settings in the inspector.
UTF-8 Solution
The solution below is for non-surrogate-pair codes (what I called UTF-8 above).
To convert these to UTF-32, we just need to change the 'u' to uppercase and prepend a few zeros. To do so, we can use System.Text.RegularExpressions.Regex.Replace.
public string ToUTF32(string input)
{
    string output = input;
    Regex pattern = new Regex(@"\\u[a-zA-Z0-9]*");
    while (output.Contains(@"\u"))
    {
        output = pattern.Replace(output, @"\U000" + output.Substring(output.IndexOf(@"\u", StringComparison.Ordinal) + 2, 5), 1);
    }
    return output;
}
input being the string that contains the emoji unicode. The function converts all of the unicode in the string, and keeps everything else as it was.
Explanation
This code is pretty long, so this is the explanation.
First, the code takes the input string, for example blah blah \u1F600 blah \u1F603 blah, which contains two of the Unicode emoji escapes, and finds the first escape with the regex.
Second, it takes the five characters after "\u" via Substring and replaces the matched escape with "\U000" + that substring.
It repeats the above steps until all of the escapes are translated.
This outputs the correct string to do the job.
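For instance, a hypothetical usage on a string containing two literal escapes (my own example, not from the original post) would look like this:
// The input holds the escapes as literal text, which is what the regex expects.
string raw = @"blah blah \u1F600 blah \u1F603 blah";
string converted = ToUTF32(raw);
// converted is now @"blah blah \U0001F600 blah \U0001F603 blah"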
If anyone thinks the above information is incorrect, please let me know. My vocabulary on this subject is not the best, so I am willing to take corrections.
Surrogate Pairs Solution
I have tinkered for a little while and come up with the function below.
public string ToUTF32FromPair(string input)
{
    var output = input;
    Regex pattern = new Regex(@"\\u[a-zA-Z0-9]*\\u[a-zA-Z0-9]*");
    while (output.Contains(@"\u"))
    {
        output = pattern.Replace(output,
            m => {
                var pair = m.Value;
                var first = pair.Substring(0, 6);
                var second = pair.Substring(6, 6);
                var firstInt = Convert.ToInt32(first.Substring(2), 16);
                var secondInt = Convert.ToInt32(second.Substring(2), 16);
                var codePoint = (firstInt - 0xD800) * 0x400 + (secondInt - 0xDC00) + 0x10000;
                return @"\U" + codePoint.ToString("X8");
            },
            1
        );
    }
    return output;
}
This does basically the same thing as before, except it takes input that contains surrogate pairs and translates them.
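As a quick sanity check (my own example, not from the original post), feeding it the escaped pairs from the tweet text above would give:
// Each \uD83D\uDE4F pair is the folded-hands emoji from the tweet.
string tweet = @"Like & share , Thanks!\uD83D\uDE4F\uD83D\uDE4F\uD83D\uDE4F";
string converted = ToUTF32FromPair(tweet);
// converted is now @"Like & share , Thanks!\U0001F64F\U0001F64F\U0001F64F"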

Decoding UTF16LE with Encoding.Unicode garbles text

I have the following code:
var reader = new StreamReader(inputSubtitle, encoding);
string str;
var list = new List<String>();
try
{
    str = reader.ReadLine();
    while (str != null)
    {
        list.Add(str);
        str = reader.ReadLine();
    }
    return list;
}
The encoding is based on the Byte Order Mark. The charset detector code (I can provide it if necessary) simply looks at the hex value of the first couple of bytes in the file. The files are usually UTF-8, Windows ANSI (code page 1252) or UTF-16LE. The last one currently fails, and I have no clue why.
Previewing the text in Notepad says it's encoded as Unicode (by which it means UTF-16LE, AFAIK), opening it in Firefox says it's UTF-16LE, and the file starts with the BOM bytes FF FE.
Take this example text:
1
00:04:05,253 --> 00:04:11,886
<i>This is the first line</i>
- This is the second line.
I send this file as a FileStream to the charset detector (I use a FileStream as input in the backend), where I added the following debug lines:
byte[] dataFromFileStream = new byte[(input.Length)];
input.Read(dataFromFileStream, 0, (int)input.Length);
Console.WriteLine(BitConverter.ToString(dataFromFileStream));
This produces the following hexcode:
"FF-FE-31-00-(...)"
FF-FE is the Byte Order Mark of UTF-16LE.
Opening this hexcode with the StreamReader and encoding set to Encoding.Unicode turns the data into a single string:
"\u0d00\u0a00  㨀 㐀㨀 㔀Ⰰ㈀㔀㌀ ⴀⴀ㸀   㨀 㐀㨀\u3100\u3100Ⰰ㠀㠀㘀\u0d00\u0a00㰀椀㸀吀栀椀猀 椀猀 琀栀攀 昀椀爀猀琀 氀椀渀攀㰀⼀椀㸀\u0d00\u0a00ⴀ 吀栀椀猀 椀猀 琀栀攀 猀攀挀漀渀搀 氀椀渀攀⸀"
Setting the encoder to Encoding.GetEncoding(1201), i.e. UTF-16BE, opens the file properly and decodes it into 4 lines in the list, as expected.
I first noticed this bug a couple of weeks ago; before then the code worked properly. Is it something that happened in an update? Encoding.Unicode is described as UTF-16LE in the documentation.
I changed my code to use UTF-16BE as the decoder for the moment to make it work again, but that just feels wrong.
Turns out I made a stupid mistake; the Charset detector reads a couple of bytes to determine the encoding of the file. To make sure I have a UTF-16LE file (which starts with FF-FE) and not a UTF-32LE (which starts with FF-FE-00-00) I read the third byte as well. After reading those bytes, however, I did not reset the position of the FileStream back to 0. An earlier version of my code, with a different constructor, did reset the starting position. Adding the code for resetting the position fixed it.
Explanation:
The StreamReader class does not strictly need a BOM in a UTF file, so it starts reading from wherever the charset detector left the stream position. Detecting UTF-8 or ANSI caused no issues, since those use single-byte code units. UTF-16 uses two bytes per code unit, so starting at an odd-numbered byte made every character look byte-swapped to the reader, which is why decoding as UTF-16BE 'fixed' the issue.
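For reference, a minimal sketch of the fix (DetectEncodingFromBom is a placeholder name, not the actual detector code):
// Peek at the first bytes for the BOM, then rewind the stream before handing it
// to the StreamReader; without the rewind, UTF-16 starts one byte off.
byte[] bom = new byte[4];
input.Read(bom, 0, bom.Length);
Encoding encoding = DetectEncodingFromBom(bom);  // placeholder for the charset detector
input.Position = 0;                              // the missing reset
var reader = new StreamReader(input, encoding);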

Is it possible to display (convert?) the Unicode hex \u0092 to a Unicode HTML entity in .NET?

I have some string that contains the following code/value:
"You won\u0092t find a ...."
It looks like that string contains the Right Apostrophe special character.
ref1: Unicode control 0092
ref2: ASCII chart (both 127 + extra extended ascii)
I'm not sure how to display this in the web browser. It keeps displaying the TOFU square-box character instead. I'm under the impression that the Unicode (hex) value 0092 can be converted to the HTML entity &#146;
Is my understanding correct?
Update 1:
It was suggested by @sam-axe that I HtmlEncode the string. That didn't work. Here it is...
Note the ampersand got correctly encoded....
It looks like there's an encoding mix-up. In .NET, strings are normally encoded as UTF-16, and a right apostrophe should be represented as \u2019. But in your example, the right apostrophe is represented as \x92, which suggests the original encoding was Windows code page 1252. If you include your string in a Unicode document, the character \x92 won't be interpreted properly.
You can fix the problem by re-encoding your string as UTF-16. To do so, treat the string as an array of bytes, and then convert the bytes back to Unicode using the 1252 code page:
string title = "You won\u0092t find a cheaper apartment * Sauna & Spa";
byte[] bytes = title.Select(c => (byte)c).ToArray();
title = Encoding.GetEncoding(1252).GetString(bytes);
// Result: "You won’t find a cheaper apartment * Sauna & Spa"
Note: much of my answer is based on guessing and looking at the decompiled code of System.Web 4.0. The reference source looks very similar (identical?).
You're correct that "’" (6 characters) can be displayed in the browser. Your output string, however, contains "\u0092" (1 character). This is a control character, not an HTML entity.
According to the reference code, WebUtility.HtmlEncode() doesn't transform characters between 128 and 160 - all characters in this range are control characters (ampersand is special-cased in the code as are a few other special HTML symbols).
My guess is that because these are control characters, they're output without transformation, since transforming them would change the meaning of the string. (I tried running some examples using LINQPad; this character was not rendered.)
If you really want to transform these characters (or remove them), you'll probably have to write your own function before/after calling HtmlEncode() - there may be something that does this already but I don't know of any.
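If it helps, a rough sketch of such a post-processing step (my own helper, not an existing library call) could look like this:
// Converts C1 control characters (128..159) left alone by HtmlEncode() into
// numeric character references; everything else passes through unchanged.
static string EncodeControlChars(string encoded)
{
    var sb = new StringBuilder(encoded.Length);
    foreach (char c in encoded)
    {
        if (c >= 128 && c <= 159)
            sb.Append("&#").Append((int)c).Append(';');
        else
            sb.Append(c);
    }
    return sb.ToString();
}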
Hope this helps.
Edit: Michael Liu's answer seems correct. I'm leaving my answer here because it may be useful in cases when the input encoding of a string is not known.

How to retrieve the Unicode decimal representation of the chars in a string containing Hindi text?

I am using Visual Studio 2010 with C# for converting text into Unicode values. For example, I have a string abc = "मेरा".
There are 4 characters in this string, and I need the Unicode values of all four characters.
Please help me.
When you write code like string abc = "मेरा";, you already have it as Unicode (specifically, UTF-16), so you don't have to convert anything. If you want to access the individual characters, you can do that using a normal index: e.g. abc[1] is े (DEVANAGARI VOWEL SIGN E).
If you want to see the numeric representations of those characters, just cast them to integers. For example
abc.Select(c => (int)c)
gives the sequence of numbers 2350, 2375, 2352, 2366. If you want to see the hexadecimal representation of those numbers, use ToString():
abc.Select(c => ((int)c).ToString("x4"))
returns the sequence of strings "092e", "0947", "0930", "093e".
Note that when I said numeric representations, I actually meant their encoding in UTF-16. For characters in the Basic Multilingual Plane, this is the same as their Unicode code point. The vast majority of characters in use lie in the BMP, including the 4 Hindi characters presented here.
If you wanted to handle characters in other planes too, you could use code like the following.
byte[] bytes = Encoding.UTF32.GetBytes(abc);
int codePointCount = bytes.Length / 4;
int[] codePoints = new int[codePointCount];
for (int i = 0; i < codePointCount; i++)
    codePoints[i] = BitConverter.ToInt32(bytes, i * 4);
Since UTF-32 encodes all (21-bit) code points directly, this will give you them. (Maybe there is a more straightforward solution, but I haven't found one.)
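A somewhat more direct alternative (my own suggestion, not part of the original answer) is char.ConvertToUtf32, which understands surrogate pairs; a small sketch using the same abc string:
// Walk the string code point by code point; a surrogate pair advances by two chars.
var codePoints = new List<int>();
for (int i = 0; i < abc.Length; i += char.IsSurrogatePair(abc, i) ? 2 : 1)
    codePoints.Add(char.ConvertToUtf32(abc, i));
// For "मेरा" this yields 2350, 2375, 2352, 2366.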
Since a .NET char is a Unicode character (at least for BMP code points), you can simply enumerate all characters in a string:
var abc = "मेरा";
foreach (var c in abc)
{
    Console.WriteLine((int)c);
}
resulting in
2350
2375
2352
2366
Use
System.Text.Encoding.UTF8.GetBytes(abc)
which will return the string's bytes in the UTF-8 encoding (note that these are encoded bytes, not the Unicode code point values).
If you are trying to convert files from a legacy encoding into Unicode:
Read the file, supplying the correct encoding of the source files, then write the file using the desired Unicode encoding scheme.
using (StreamReader reader = new StreamReader(@"C:\MyFile.txt", Encoding.GetEncoding("ISCII")))
using (StreamWriter writer = new StreamWriter(@"C:\MyConvertedFile.txt", false, Encoding.UTF8))
{
    writer.Write(reader.ReadToEnd());
}
If you are looking for a mapping of Devanagari characters to the Unicode code points:
You can find the chart at the Unicode Consortium website here.
Note that Unicode code points are traditionally written in hexadecimal. So rather than the decimal number 2350, the code point would be written as U+092E, and it appears as 092E on the code chart.
If you have the string s = मेरा then you already have the answer.
This string contains four code points in the BMP which in UTF-16 are represented by 8 bytes. You can access them by index with s[i], with a foreach loop etc.
If you want the underlying 8 bytes, you can access them like so:
string str = @"मेरा";
byte[] arr = System.Text.Encoding.Unicode.GetBytes(str);

How to send ctrl+z

How do I convert ctrl+z to a string?
I am sending this as an AT command to a device attached to this computer.
Basically, I just want to put some chars in a string and Ctrl+Z in that string as well.
You can embed any Unicode character with the \u escape:
"this ends with ctrl-z \u001A"
Try the following; it should work for you:
serialPort1.Write("Test message from coded program" + (char)26);
Also try the following, which may work for you:
serialPort1.Write("Test message from coded program");
SendKeys.Send("^(z)");
Also check: http://www.dreamincode.net/forums/topic/48708-sending-ctrl-z-through-serial/
byte[] buffer = new byte[1];
buffer[0] = 26; // ^Z
modemPort.Write(buffer, offset:0, count:1);
It's clear from other responses that Ctrl+Z has ASCII code 26; in general Ctrl+[letter] combinations have ASCII code equal to 1+[letter]-'A' i.e. Ctrl+A has ASCII code 1 (\x01 or \u0001), Ctrl+B has ASCII code 2, etc.
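As a small illustration of that rule (my own helper, not from the thread):
// Maps a letter to its control character: 'A' -> 0x01, ..., 'Z' -> 0x1A (Ctrl+Z).
static char CtrlChar(char letter) => (char)(char.ToUpperInvariant(letter) - 'A' + 1);
// CtrlChar('Z') == '\u001A'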
When sending characters to a device, translation from the internal string representation is needed. This is known as Encoding - an encoder translates the string into a byte array.
Consulting the Unicode Character Name Index, we find the SUBSTITUTE character (U+001A) in the C0 Controls and Basic Latin block.
To add a Ctrl-Z to a C# string, add a Unicode character escape sequence (\u001a):
String ctrlz = "\u001a";
String atcmd = "AT C5\u001a";
Any encoding used for translation before output to the device (for example, output using StringWriter) will translate this to ASCII Ctrl-Z.
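Putting it together, a minimal sketch of sending a Ctrl-Z-terminated string over a serial port (the port name, baud rate, and message body are placeholders):
using System.IO.Ports;
// SerialPort.Write(string) encodes with the port's Encoding (ASCII by default),
// so the \u001A escape goes out as the single byte 26 (Ctrl-Z).
var modemPort = new SerialPort("COM3", 9600);   // placeholder port settings
modemPort.Open();
modemPort.Write("Test message from coded program\u001A");
modemPort.Close();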
