I'm trying to understand which encoding in C# fulfills a requirement from a new SMS provider.
The text I want to send is:
Bäste Björn
The encoded text that the provider says it needs is:
B%E4ste+Bj%F6rn
so ä is %E4 and ö is %F6
From this answer, I gathered that for such a conversion I need to use HttpUtility.HtmlAttributeEncode, as the normal HttpUtility.UrlEncode will output:
B%c3%a4ste+Bj%c3%b6rn
and that shows up as garbled characters on the mobile phone :/
As several chars are not converted, I tried this:
private string specialEncoding(string text)
{
    StringBuilder r = new StringBuilder();
    foreach (char c in text.ToCharArray())
    {
        string e = System.Web.HttpUtility.UrlEncode(c.ToString());
        if (e.StartsWith("%") && e.ToLower() != "%0a") // %0a == Linefeed
        {
            string attr = System.Web.HttpUtility.HtmlAttributeEncode(c.ToString());
            r.Append(attr);
        }
        else
        {
            r.Append(e);
        }
    }
    return r.ToString();
}
It's verbose so that I could set a breakpoint and test each char, and I found out that:
System.Web.HttpUtility.HtmlAttributeEncode("ä") actually returns ä... so there is no %E4 in the output...
What am I missing? And is there a simple way to do the encoding, without manipulating the string char by char, that produces the required output?
that the provider say it needs
Ask the provider in which age they are living. According to Wikipedia: Percent-encoding:
The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values. This requirement was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected.
Granted, this RFC talks about "new URI schemes", which HTTP obviously is not, but adhering to this standard prevents headaches like this. See also What is the proper way to URL encode Unicode characters?.
They seem to want you to encode characters according to the Windows-1250 code page (or a comparable one, like ISO-8859-1 or -2; check alternatives here) instead, as in that code page E4 (228) maps to ä and F6 (246) maps to ö. As @Simon points out in his comment, you should ask the provider which code page exactly they want you to use.
Assuming Windows-1250, you can implement it like this, according to URL encode ASCII/UTF16 characters:
var windows1250 = Encoding.GetEncoding(1250);
var percentEncoded = HttpUtility.UrlEncode("Bäste Björn", windows1250);
The value of percentEncoded is:
B%e4ste+Bj%f6rn
If they insist on using uppercase, see .net UrlEncode - lowercase problem.
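For completeness, here is a minimal end-to-end sketch; the code-page registration (needed on .NET Core / .NET 5+, where Windows code pages aren't available by default) and the Regex-based uppercasing step are my own additions, not something the provider specified:

using System;
using System.Text;
using System.Text.RegularExpressions;
using System.Web;

class Program
{
    static void Main()
    {
        // On .NET Core / .NET 5+, Windows code pages require the
        // System.Text.Encoding.CodePages package plus this registration.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        var windows1250 = Encoding.GetEncoding(1250);
        string encoded = HttpUtility.UrlEncode("Bäste Björn", windows1250);
        // encoded == "B%e4ste+Bj%f6rn" (lowercase hex)

        // Uppercase only the two hex digits of each percent escape.
        string uppercased = Regex.Replace(encoded, "%[0-9a-f]{2}",
            m => m.Value.ToUpperInvariant());

        Console.WriteLine(uppercased); // B%E4ste+Bj%F6rn
    }
}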
Related
I can't seem to find any posts or videos online about this topic, so I'm starting to wonder if it's just not possible. Everything about "emojis" in Unity is just a simple spritesheet implementation where you manually index the sprites with tags like <sprite=0>. I'm trying to pull tweets from Twitter and then display their text with emojis, so that approach clearly isn't feasible with the 1500+ emojis that Unicode supports.
I believe I've correctly created a TMP font asset using the default Windows emoji font, Segoe UI Emoji, and using some Unicode hex ranges I found in an online Unicode database, it looks like I was able to detect 1505 emojis in the font.
I then set the emoji font as a fallback font in the Project Settings.
But upon running the game, I still get the same error: "The character with Unicode value \uD83D was not found in the [SEGOEUI SDF] font asset or any potential fallbacks. It was replaced by Unicode character \u25A1 in text object".
In the console an output of the tweet text looks something like this: #cat #cats #CatsOfTwitter #CatsOnTwitter #pet \nLike & share , Thanks!\uD83D\uDE4F\uD83D\uDE4F\uD83D\uDE4F
From some looking around online and extremely basic knowledge of unicode, I theorize that the issue is that in the tweet body, the emojis are in UTF-16 surrogate pairs or whatever, where \uD83D\uDE4F is one emoji, but my emoji font is in UTF-32, so it's looking for u+0001f64f. So would I need to find a way to get it to read the full surrogate pair and then convert to UTF-32 to get the correct emoji to render?
Any help would be greatly appreciated, I've tried asking around the Unity Discord server, but nobody else knows how to solve this issue either.
Intro
TMPro is natively able to do this, but only with UTF-32-style escapes. For example, \U0001F600 is '😀︎'. Your emojis are written as bare code-point escapes (often mislabeled UTF-8), such as \u1F600, which is still '😀︎'. The only difference between these two is the capital U and the three zeros prepended to the digits. This makes it very easy to convert. Typing the UTF-32 version into TMPro shows the emoji as normal. What you are looking for is converting UTF-16 surrogate pairs into UTF-32, which is included further down.
Luckily, this solution does not require any font modification; the default font is able to do this, and I didn't change any settings in the Inspector.
UTF-8 Solution
The solution below is for non-surrogate-pair escape codes.
To convert such a code to UTF-32, we just need to change the 'u' to uppercase and prepend a few zeros to the digits. To do so, we can use System.Text.RegularExpressions.Regex.Replace.
// requires: using System; using System.Text.RegularExpressions;
public string ToUTF32(string input)
{
    string output = input;
    // Matches a literal "\u" followed by hex digits, e.g. "\u1F600".
    Regex pattern = new Regex(@"\\u[a-zA-Z0-9]*");
    while (output.Contains(@"\u"))
    {
        // Take the 5 hex digits after "\u" and re-emit them as "\U000xxxxx".
        output = pattern.Replace(output, @"\U000" + output.Substring(output.IndexOf(@"\u", StringComparison.Ordinal) + 2, 5), 1);
    }
    return output;
}
input being the string that contains the emoji unicode. The function converts all of the unicode in the string, and keeps everything else as it was.
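For example, a hypothetical call (using the sample text from the explanation below):

string raw = @"blah blah \u1F600 blah \u1F603 blah";
string converted = ToUTF32(raw);
// converted == @"blah blah \U0001F600 blah \U0001F603 blah"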
Explanation
This code is pretty long, so this is the explanation.
First, the code takes the input string, for example blah blah \u1F600 blah \u1F603 blah, which contains 2 Unicode emoji escapes, and finds the first escape with the regex.
Secondly, it Substrings the 5 characters following "\u" and replaces the matched escape with "\U000" + that substring.
It repeats the above steps until all of the escapes are translated.
This outputs the correct string to do the job.
If anyone thinks the above information is incorrect, please let me know. My vocabulary on this subject is not the best, so I am willing to take corrections.
Surrogate Pairs Solution
I have tinkered for a little while and come up with the function below.
public string ToUTF32FromPair(string input)
{
    var output = input;
    // Matches two consecutive "\uXXXX" escapes, i.e. one surrogate pair.
    Regex pattern = new Regex(@"\\u[a-zA-Z0-9]*\\u[a-zA-Z0-9]*");
    while (output.Contains(@"\u"))
    {
        output = pattern.Replace(output,
            m => {
                var pair = m.Value;
                var first = pair.Substring(0, 6);  // e.g. "\uD83D" (high surrogate)
                var second = pair.Substring(6, 6); // e.g. "\uDE4F" (low surrogate)
                var firstInt = Convert.ToInt32(first.Substring(2), 16);
                var secondInt = Convert.ToInt32(second.Substring(2), 16);
                // Standard surrogate-pair decoding: combine the two halves into one code point.
                var codePoint = (firstInt - 0xD800) * 0x400 + (secondInt - 0xDC00) + 0x10000;
                return @"\U" + codePoint.ToString("X8");
            },
            1
        );
    }
    return output;
}
This does basically the same thing as before except it takes in the input that has surrogate pairs in it and translates it.
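As a side note: if the tweet text ever contains real surrogate-pair characters rather than literal "\uXXXX" escape text, .NET can compute the code point directly. A minimal sketch of that variant (my own, untested against the actual Twitter payload):

// requires: using System.Text;
public static string EscapeAstralToUtf32(string input)
{
    var sb = new StringBuilder();
    for (int i = 0; i < input.Length; i++)
    {
        // A high surrogate followed by a low surrogate forms one code point.
        if (char.IsHighSurrogate(input[i]) && i + 1 < input.Length && char.IsLowSurrogate(input[i + 1]))
        {
            int codePoint = char.ConvertToUtf32(input[i], input[i + 1]);
            sb.Append(@"\U").Append(codePoint.ToString("X8"));
            i++; // skip the low surrogate we just consumed
        }
        else
        {
            sb.Append(input[i]);
        }
    }
    return sb.ToString();
}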
This question already has answers here: How to unescape unicode string in C# (2 answers). Closed 2 years ago.
The following unicode string from a text file encodes a single apostrophe using 3 bytes:
It\u00e2\u0080\u0099s working
This should decode to:
It’s working
How can I decode this string in C#?
For example, when I try the following code:
string test = @"It\u00e2\u0080\u0099s working";
string test2 = System.Text.RegularExpressions.Regex.Unescape(test);
it incorrectly decodes the first byte only:
Itâ\u0080\u0099s working
This is UTF-8 that was mis-decoded as Latin-1. ISO-8859-1 (code page 28591) maps every character U+0000–U+00FF to the byte of the same value, so encoding the string back to Latin-1 recovers the original UTF-8 bytes, which can then be decoded properly:
using System.Text;

// A regular (non-verbatim) literal: \u00e2 etc. are already real characters
// here, exactly like the mis-decoded text read from the file.
string test = "It\u00e2\u0080\u0099s working";

byte[] bytes = Encoding.GetEncoding(28591) // ISO-8859-1 / Latin-1
    .GetBytes(test);
var converted = Encoding.UTF8.GetString(bytes); // It’s working
Try this to parse the file:
// requires: using System.Globalization; using System.Text.RegularExpressions;
private static Regex _regex = new Regex(@"\\u(?<Value>[a-zA-Z0-9]{4})", RegexOptions.Compiled);

public string decodeString(string value)
{
    // Replace each literal \uXXXX escape with the corresponding character.
    return _regex.Replace(
        value,
        m => ((char)int.Parse(m.Groups["Value"].Value, NumberStyles.HexNumber)).ToString()
    );
}
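Putting the two answers above together, a hypothetical end-to-end helper for the asker's exact input (the class and method names are illustrative): first turn the literal \uXXXX escapes into characters, then undo the Latin-1/UTF-8 mix-up:

using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

static class MojibakeFixer
{
    static readonly Regex Escapes = new Regex(@"\\u(?<Value>[0-9a-fA-F]{4})", RegexOptions.Compiled);

    public static string Decode(string value)
    {
        // @"It\u00e2\u0080\u0099s working" -> "Itâ\u0080\u0099s working"
        string unescaped = Escapes.Replace(value,
            m => ((char)int.Parse(m.Groups["Value"].Value, NumberStyles.HexNumber)).ToString());

        // Latin-1 maps each char U+0000..U+00FF back to a single byte;
        // those bytes are then reinterpreted as UTF-8.
        byte[] bytes = Encoding.GetEncoding(28591).GetBytes(unescaped);
        return Encoding.UTF8.GetString(bytes); // "It’s working"
    }
}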
That is JavaScript Unicode escaping. Use a C# JavaScript deserializer to convert it.
(I don't have enough reputation to comment, so I will write here)
Where did you get those characters from in the first place?
\uXXXX is an escape used by JavaScript and C# (I didn't know about C# until now) to encode 16-bit Unicode characters in string literals. 16 bits = 4 hex characters, so \uXXXX, each X representing one hexadecimal digit.
Note this is used to encode string literals in source code! It is not used to encode the bytes stored in files or memory or whatnot. It is an older style of escaping, because modern source code editors usually support UTF-8 or UTF-16 or some other encoding and can store Unicode characters in source code files directly; they can also display the Unicode character's symbol and allow it to be typed right in the editor. So \uXXXX typing is not needed, and it is going out of style.
So that is why I asked where you got the string initially. You wrote in one comment that you read it from a file? What generated the file?
If each \uXXXX is taken alone by itself as a Unicode character, which is what \uXXXX means, the sequence doesn't make sense: 00E2 is â (a with a circumflex), while 0080 and 0099 are control characters, not printable.
If E2 80 99 are taken together as three single bytes, i.e. dropping the 00-valued first byte of each (as they are all in the form \u00XX), then they fit as the UTF-8 representation of the Unicode character U+2019, "RIGHT SINGLE QUOTATION MARK".
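A quick way to check that claim:

using System;
using System.Text;

byte[] utf8 = { 0xE2, 0x80, 0x99 };
Console.WriteLine(Encoding.UTF8.GetString(utf8));  // ’
Console.WriteLine(((int)'\u2019').ToString("X4")); // 2019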
Then that is what you are looking for, but this doesn't seem like a correct use of the encoding that generated that string. If you end up with those strings and have to evaluate them, then the answer above by "C# Novice" works, but it may not work in every case.
You could convert string literals that use \uXXXX escapes using a JavaScript script evaluator, or CSharpScript.Run(), to build a string literal from them, assign it to a variable, and then look at its bytes. But I tried that, and because those byte values/characters don't make sense together, I don't get anything meaningful from them: I get an â, and the next two CSharpScript refuses to decode and leaves as-is, because those are control characters when decoded.
Here are three different ways of doing \uXXXX decoding using available C# libraries. The first two use the Newtonsoft.Json package, the last uses Roslyn/CSharpScript, both available from NuGet. Note that none of these print a single apostrophe, due to what I described above. In contrast, if I change the string to "\u3053\u3093\u306B\u3061\u306F\u4E16\u754C!", it prints this Japanese text in the debug output window: "こんにちは世界!", which Google Translate tells me is the Japanese translation of "Hello World!"
https://translate.google.com/?sl=ja&tl=en&text=%E3%81%93%E3%82%93%E3%81%AB%E3%81%A1%E3%81%AF%E4%B8%96%E7%95%8C!&op=translate
So in summary: whatever generated those strings doesn't seem to be doing standard things.
using System;
using System.IO;
using System.Threading.Tasks;
using Microsoft.CodeAnalysis.CSharp.Scripting;
using Microsoft.CodeAnalysis.Scripting;
using Newtonsoft.Json;

string test = @"It\u00e2\u0080\u0099s working";

// Using JSON deserialization, since \uXXXX is valid escaping in JavaScript string literals.
// Have to add starting and ending quotes to make it a string literal definition, then deserialize as string.
var d = JsonConvert.DeserializeObject("\"" + test + "\"", typeof(string));
Console.WriteLine(d);
System.Diagnostics.Debug.WriteLine(d);

// Another way of JavaScript deserialization. If you are using a stream, like reading from a file, this may be better:
TextReader reader = new StringReader("\"" + test + "\"");
JsonTextReader rdr = new JsonTextReader(reader);
rdr.Read();
Console.WriteLine(rdr.Value);
System.Diagnostics.Debug.WriteLine(rdr.Value);

// Lastly, overkill and too heavy: using Roslyn CSharpScript and letting the C# compiler decode the \uXXXX's in a string literal:
ScriptOptions opt = ScriptOptions.Default;
//opt = opt.WithFileEncoding(Encoding.Unicode);
Task<ScriptState<string>> task = Task.Run(async () => await CSharpScript.RunAsync<string>("string str = \"" + test + "\";", opt));
ScriptState<string> s = task.Result;
var ddd = s.Variables[0];
Console.WriteLine(ddd.Value);
System.Diagnostics.Debug.WriteLine(ddd.Value);
This may be a really silly question, but so far the interwebs have failed me, so I'm hoping you good people of SO will shed some light. Essentially I have a website with membership functionality (sign up/login/forgotten password etc.) using the .NET membership providers. Later down the line I take users' registration data, convert it to XML, and then use it elsewhere in logic. Unfortunately I often get issues with the data I have in XML; more often than not it's "hexadecimal value 0x1C, is an invalid character". I did find a handy blog post with a resolution to this, but it got me thinking: are there any standards on how data should be sanitized? What to let through registration and what not to?
Assuming that you're (manually?) de-serializing the registration input, you need to encode it as XML before further processing so that characters with special meaning in XML are escaped properly.
Note that there are only 5 of them so it's perfectly reasonable to do this with a manual replace:
< = &lt;
> = &gt;
& = &amp;
" = &quot;
' = &apos;
You could use the built-in .NET function HttpUtility.HtmlEncode(input) to do this for you.
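For example, a quick sketch:

using System.Web;

string escaped = HttpUtility.HtmlEncode("Fish & Chips <\"cheap\">");
// escaped == "Fish &amp; Chips &lt;&quot;cheap&quot;&gt;"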
UPDATE:
I just realized I didn't really answer your question; you seem to be looking for a way to transform Unicode characters into ASCII-compatible HTML entities.
I'm not aware of any built-in functions in .NET that do this, so I wrote a little utility method which should illustrate the concept:
// requires: using System.Linq; using System.Text; using System.Web;
public static class StringUtilities
{
    public static string HtmlEncode(string input, Encoding source, Encoding destination)
    {
        // Escape the reserved markup characters first.
        var sourceChars = HttpUtility.HtmlEncode(input).ToArray();
        var sb = new StringBuilder();
        foreach (var sourceChar in sourceChars)
        {
            // Round-trip each char through the destination encoding; if it
            // doesn't survive, emit a numeric character reference instead.
            byte[] sourceBytes = source.GetBytes(new[] { sourceChar });
            char destChar = destination.GetChars(sourceBytes).FirstOrDefault();
            if (destChar != sourceChar)
                sb.AppendFormat("&#{0};", (int)sourceChar);
            else
                sb.Append(sourceChar);
        }
        return sb.ToString();
    }
}
Then, given an input string which has both reserved XML characters and Unicode characters in it, you could use it like this:
string unicode = "<tag>some proӸematic text<tag>";
string escapedASCII = StringUtilities.HtmlEncode(
unicode, Encoding.Unicode, Encoding.ASCII);
// Result: &lt;tag&gt;some pro&#1272;ematic text&lt;tag&gt;
If you need to do this at several places, to clean it up a bit, you could add an extension method for your specific scenario:
public static class StringExtensions
{
public static string ToEncodedASCII(this string input, Encoding sourceEncoding)
{
return StringUtilities.HtmlEncode(input, sourceEncoding, Encoding.ASCII);
}
public static string ToEncodedASCII(this string input)
{
return StringUtilities.HtmlEncode(input, Encoding.Unicode, Encoding.ASCII);
}
}
You could then do:
string unicode = "<tag>some proӸematic text<tag>";
// Default to Unicode as input
string escapedASCII1 = unicode.ToEncodedASCII();
// Pass in a different encoding for your input
string escapedASCII2 = unicode.ToEncodedASCII(Encoding.BigEndianUnicode);
UPDATE #2
Since you also asked for advice on adhering to standards, the most I can tell you is that you need to take into consideration where the input text will actually end up.
If the input for a certain user will only ever be displayed to that user (for instance when they manage their profile / account settings in your app), and your database supports Unicode, you could just leave everything as-is.
On the other hand, if the information can be displayed to other users (for instance when users can view each others public profile information) then you need to take into consideration that not all users will be visiting your website on a device/browser that supports Unicode. In that case, UTF-8 is likely to be your best bet.
This is also why you can't really find much useful information on it. If the world had been able to agree on a standard, we would not have to deal with all these encoding variations in the first place. Think about your target group and what they need.
A useful blog post on the subject of encoding: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
I've just stumbled over another question in which someone suggested using new ASCIIEncoding().GetBytes(someString) to convert from a string to bytes. For me it was obvious that it wouldn't work for non-ASCII characters. But as it turns out, ASCIIEncoding happily replaces invalid characters with '?'. I'm very confused about this, because it kind of breaks the rule of least surprise. In Python, it would be u"some unicode string".encode("ascii"), and the conversion is strict by default, so non-ASCII characters would lead to an exception in this example.
Two questions:
How can strings be strictly converted to another encoding (like ASCII or Windows-1252), so that an exception is thrown if invalid characters occur? By the way, I don't want a foreach loop converting each Unicode number to a byte and then checking the 8th bit. This is supposed to be done by a great framework like .NET (or Python ^^).
Any ideas on the rationale behind this default behavior? For me, it makes more sense to do strict conversions by default, or at least to have a parameter for this purpose (Python allows "replace", "ignore", "strict").
.NET offers the option of throwing an exception if the encoding conversion fails. You'll need to use the EncoderExceptionFallback class (which throws an EncoderFallbackException if an input character cannot be converted to an encoded output byte sequence) to create an encoding. The following code is from the documentation for that class:
Encoding ae = Encoding.GetEncoding(
"us-ascii",
new EncoderExceptionFallback(),
new DecoderExceptionFallback());
then use that encoding to perform the conversion:
// The input string consists of the Unicode characters LEFT POINTING
// DOUBLE ANGLE QUOTATION MARK (U+00AB), 'X' (U+0058), and RIGHT POINTING
// DOUBLE ANGLE QUOTATION MARK (U+00BB).
// The encoding can only encode characters in the US-ASCII range of U+0000
// through U+007F. Consequently, the characters bracketing the 'X' character
// cause an exception.
string inputString = "\u00abX\u00bb";
byte[] encodedBytes = new byte[ae.GetMaxByteCount(inputString.Length)];
int numberOfEncodedBytes = 0;
try
{
numberOfEncodedBytes = ae.GetBytes(inputString, 0, inputString.Length,
encodedBytes, 0);
}
catch (EncoderFallbackException e)
{
Console.WriteLine("bad conversion");
}
This MSDN page, "Character Encoding in the .NET Framework" discusses, to some degree, the rationale behind the default conversion behavior. In summary, they didn't want to disturb legacy applications that depend on this behavior. They do recommend overriding the default, though.
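The same pattern works for the Windows-1252 case from the question; a minimal sketch (on .NET Core, the code page additionally requires registering CodePagesEncodingProvider):

using System.Text;

Encoding strict1252 = Encoding.GetEncoding(
    "windows-1252",
    new EncoderExceptionFallback(),
    new DecoderExceptionFallback());

byte[] ok = strict1252.GetBytes("Bäste Björn"); // ä and ö exist in Windows-1252
byte[] boom = strict1252.GetBytes("\u2603");    // snowman: throws EncoderFallbackException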
I have a string in C# initialised as follows:
string strVal = "£2000";
However whenever I write this string out the following is written:
£2000
It does not do this with dollars.
An example bit of code I am using to write out the value:
System.IO.File.AppendAllText(HttpContext.Current.Server.MapPath("/logging.txt"), strVal);
I'm guessing it's something to do with localization, but if C# strings are just Unicode, surely this should just work?
CLARIFICATION: Just a bit more info: Jon Skeet's answer is correct, however I also get the issue when I URL-encode the string. Is there a way of preventing this?
So the URL encoded string looks like this:
"%c2%a32000"
%c2 = Â
%a3 = £
If I encode as ASCII the £ comes out as ?
Any more ideas?
AppendAllText is writing out the text in UTF-8.
What are you using to look at it? Chances are it's something that doesn't understand UTF-8, or doesn't try UTF-8 first. Tell your editor/viewer that it's a UTF-8 file and all should be well. Alternatively, use the overload of AppendAllText which allows you to specify the encoding and use whichever encoding is going to be most convenient for you.
EDIT: In response to your edited question, the reason it fails when you encode with ASCII is that £ is not in the ASCII character set (which is Unicode 0-127).
URL encoding is also using UTF-8, by the looks of it. Again, if you want to use a different encoding, specify it to the HttpUtility.UrlEncode overload which accepts an encoding.
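A short sketch of the two overloads mentioned above (the file path is illustrative):

using System.IO;
using System.Text;
using System.Web;

string strVal = "£2000";
string path = "logging.txt"; // hypothetical path

// Write with an explicit encoding instead of the UTF-8 default.
File.AppendAllText(path, strVal, Encoding.GetEncoding("iso-8859-1"));

// Percent-encode with Latin-1, so £ becomes %a3 rather than %c2%a3.
string encoded = HttpUtility.UrlEncode(strVal, Encoding.GetEncoding("iso-8859-1"));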
The default character set of URLs when used in HTML pages and in HTTP headers is called ISO-8859-1 or ISO Latin-1.
It's not the same as UTF-8, and it's not the same as ASCII, but it does fit into one-byte-per-character. The range 0 to 127 is a lot like ASCII, and the whole range 0 to 255 is the same as the range 0000-00FF of Unicode.
So you can generate it from a C# string by casting each character to a byte, or you can use Encoding.GetEncoding("iso-8859-1") to get an object to do the conversion for you.
(In this character set, the UK pound symbol is 163.)
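A small sketch of both approaches (the pound sign encodes to the single byte 163):

using System;
using System.Text;

string s = "£2000";

// 1) Cast each character to a byte (safe only for chars <= U+00FF).
byte[] manual = Array.ConvertAll(s.ToCharArray(), c => (byte)c);

// 2) Let an Encoding object do the conversion.
byte[] viaEncoding = Encoding.GetEncoding("iso-8859-1").GetBytes(s);

Console.WriteLine(manual[0]);      // 163
Console.WriteLine(viaEncoding[0]); // 163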
Background
The RFC says that unencoded text must be limited to the traditional 7-bit US ASCII range, and anything else (plus the special URL delimiter characters) must be encoded. But it leaves open the question of what character set to use for the upper half of the 8-bit range, making it dependent on the context in which the URL appears.
And that context is defined by two other standards, HTTP and HTML, which do specify the default character set, and which together create a practically irresistible force on implementers to assume that the address bar contains percent-encodings that refer to ISO-8859-1.
ISO-8859-1 is the character set of text-based content sent via HTTP except where otherwise specified. So by the time a URL string appears in the HTTP GET header, it ought to be in ISO-8859-1.
The other factor is that HTML also uses ISO-8859-1 as its default, and URLs typically originate as links in HTML pages. So when you craft a simple minimal HTML page in Notepad, the URLs you type into that file are in ISO-8859-1.
It's sometimes described as "hole" in the standards, but it's not really; it's just that HTML/HTTP fill in the blank left by the RFC for URLs.
Hence, for example, the advice on this page:
URL encoding of a character consists of a "%" symbol, followed by the two-digit hexadecimal representation (case-insensitive) of the ISO-Latin code point for the character.
(ISO-Latin is another name for ISO-8859-1.)
So much for the theory. Paste this into notepad, save it as an .html file, and open it in a few browsers. Click the link and Google should search for UK pound.
<HTML>
<BODY>
<A HREF="http://www.google.com/search?q=%a3">Test</A>
</BODY>
</HTML>
It works in IE, Firefox, Apple Safari, Google Chrome - I don't have any others available right now.
Note that %a3 cannot be encoded in ASCII (7 bit, Basic Latin).
The Pound Sign (down the page) is part of Latin-1 encoding.
I have noticed that this happens only when long strings are used (over 4000 chars). My solution was, upon receiving the parameter in the database, to simply replace the Â sign with nothing.
Be careful: Â may actually be needed, and if that is the case this solution is not appropriate.
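That workaround amounts to something like this (with the caveat above in mind):

string raw = "£2000";
// Strip the stray Â left over when the UTF-8 bytes of £ (C2 A3) are read as Latin-1.
string cleaned = raw.Replace("Â", ""); // "£2000"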