Strict string to byte encoding in C#

I've just stumbled over another question in which someone suggested using new ASCIIEncoding().GetBytes(someString) to convert a string to bytes. It was obvious to me that this shouldn't work for non-ASCII characters, but as it turns out, ASCIIEncoding happily replaces invalid characters with '?'. I'm very confused about this, because it breaks the principle of least surprise. In Python it would be u"some unicode string".encode("ascii"), and the conversion is strict by default, so non-ASCII characters would raise an exception in this example.
Two questions:
How can strings be strictly converted to another encoding (like ASCII or Windows-1252), so that an exception is thrown if invalid characters occur? By the way, I don't want a foreach loop converting each Unicode code point to a byte and then checking the 8th bit. This is supposed to be done by a great framework like .NET (or Python ^^).
Any ideas on the rationale behind this default behavior? For me, it makes more sense to do strict conversions by default or at least define a parameter for this purpose (Python allows "replace", "ignore", "strict").

.NET offers the option of throwing an exception if the encoding conversion fails. You'll need to use the EncoderExceptionFallback class (which throws an EncoderFallbackException if an input character cannot be converted to an encoded output byte sequence) to create an encoding. The following code is from the documentation for that class:
Encoding ae = Encoding.GetEncoding(
    "us-ascii",
    new EncoderExceptionFallback(),
    new DecoderExceptionFallback());
then use that encoding to perform the conversion:
// The input string consists of the Unicode characters LEFT-POINTING DOUBLE
// ANGLE QUOTATION MARK (U+00AB), 'X' (U+0058), and RIGHT-POINTING DOUBLE
// ANGLE QUOTATION MARK (U+00BB).
// The encoding can only encode characters in the US-ASCII range of U+0000
// through U+007F. Consequently, the characters bracketing the 'X' character
// cause an exception.
string inputString = "\u00abX\u00bb";
byte[] encodedBytes = new byte[ae.GetMaxByteCount(inputString.Length)];
int numberOfEncodedBytes = 0;
try
{
    numberOfEncodedBytes = ae.GetBytes(inputString, 0, inputString.Length,
                                       encodedBytes, 0);
}
catch (EncoderFallbackException e)
{
    Console.WriteLine("bad conversion");
}
This MSDN page, "Character Encoding in the .NET Framework", discusses to some degree the rationale behind the default conversion behavior. In summary, they didn't want to disturb legacy applications that depend on this behavior. They do recommend overriding the default, though.
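For the Windows-1252 case mentioned in the question, the same pattern applies. A minimal sketch, assuming code page 1252 is the Windows-1252 you mean (on .NET Core/.NET 5+ you may first need Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)):
// Strict Windows-1252 encoding: any character outside the code page throws
// instead of being silently replaced with '?'.
Encoding strict1252 = Encoding.GetEncoding(
    1252,
    new EncoderExceptionFallback(),
    new DecoderExceptionFallback());
try
{
    byte[] bytes = strict1252.GetBytes("Smiley: \u263A"); // U+263A is not in Windows-1252
}
catch (EncoderFallbackException)
{
    Console.WriteLine("bad conversion");
}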


How to unescape multibyte unicode in c# [duplicate]

This question already has answers here: How to unescape unicode string in C# (2 answers). Closed 2 years ago.
The following Unicode string from a text file encodes a single apostrophe using 3 bytes:
It\u00e2\u0080\u0099s working
This should decode to:
It’s working
How can I decode this string in C#?
For example, when I try the following code:
string test = @"It\u00e2\u0080\u0099s working";
string test2 = System.Text.RegularExpressions.Regex.Unescape(test);
it incorrectly decodes the first byte only:
Itâ\u0080\u0099s working
This is UTF-8 that has been read one byte per character. Round-trip it through code page 28591 (ISO-8859-1), which maps U+0000 through U+00FF one-to-one onto bytes, then decode the result as UTF-8:
using System.Text;
using System.Text.RegularExpressions;
string test = "It\u00e2\u0080\u0099s working";
byte[] bytes = Encoding.GetEncoding(28591)
.GetBytes(test);
var converted = Encoding.UTF8.GetString(bytes);//It’s working
Try this to parse the file:
using System.Globalization;
using System.Text.RegularExpressions;

private static Regex _regex = new Regex(@"\\u(?<Value>[0-9a-fA-F]{4})", RegexOptions.Compiled);
public string decodeString(string value)
{
    return _regex.Replace(
        value,
        m => ((char)int.Parse(m.Groups["Value"].Value, NumberStyles.HexNumber)).ToString()
    );
}
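Note that the regex above only gets you from the literal \uXXXX text to the mojibake form (the characters U+00E2, U+0080, U+0099 around the 's'); to reach the apostrophe you still need the ISO-8859-1/UTF-8 round trip from the answer above. A hedged end-to-end sketch combining the two, assuming the escaped text sits in a file (the file name is made up):
using System.IO;
using System.Text;

string raw = File.ReadAllText("input.txt");                         // contains the literal text It\u00e2\u0080\u0099s working
string mojibake = decodeString(raw);                                 // characters U+00E2, U+0080, U+0099 around the 's'
byte[] utf8Bytes = Encoding.GetEncoding(28591).GetBytes(mojibake);   // back to the raw bytes E2 80 99
string fixedText = Encoding.UTF8.GetString(utf8Bytes);               // "It's working" with a curly apostrophe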
That is JavaScript Unicode escaping. Use a C# JavaScript/JSON deserializer to convert it.
(I don't have enough reputation to comment, so I will write here.)
Where did you get those characters from in the first place?
\uXXXX is an escape form used by JavaScript and C# (I didn't know about C# until now) to encode 16-bit Unicode characters in string literals. 16 bits means 4 hex characters, so \uXXXX, each X representing one hexadecimal digit.
Note this is used to escape characters in string literals in source code! It is not used to encode the bytes stored in files or memory or whatnot. It is an older style of escaping, since modern source code editors usually support UTF-8 or UTF-16 or some other encoding for source files, so they can store Unicode characters directly, display the character symbols, and let you type them right in the editor. So \uXXXX escaping is not needed and is going out of style.
That is why I asked where you got the string initially. You wrote in one comment that you read it from a file? What generated the file?
If each \uXXXX is taken by itself as a Unicode character, which is what \uXXXX means, it doesn't make sense being there: U+00E2 is 'â' (a with a circumflex), and U+0080 and U+0099 are control characters, not printable.
If E2 80 99 are taken together as three single bytes, i.e. dropping the zero-valued first byte of each since they are in the form \u00XX, then they fit as the UTF-8 representation of the Unicode character U+2019, which is "RIGHT SINGLE QUOTATION MARK".
If that is what you are looking for, then this doesn't seem like correct usage of whatever encoding generated the string. But if you end up with those strings and have to evaluate them, then the approach in the comments above by "C# Novice" works, though it may not work in every case.
You could convert string literals that use \uXXXX escapes using a JavaScript script evaluator, or CSharpScript.RunAsync(), to build a string literal from them, assign it to a variable, and then look at its bytes. But I tried that, and because those byte values/characters don't make sense on their own, I don't get anything meaningful from them: I get an 'â', and the next two CSharpScript refuses to decode and leaves as-is, because they are control characters when decoded.
Here are three different ways of doing the \uXXXX decoding using available C# libraries. The first two use the Newtonsoft.Json package, the last uses Roslyn/CSharpScript, both available from NuGet. Note that none of these print a single apostrophe, for the reason described above. In contrast, if I change the string to "\u3053\u3093\u306B\u3061\u306F\u4E16\u754C!", it prints this Japanese text in the debug output window: "こんにちは世界!", which Google Translate tells me is the Japanese translation of "Hello World!"
https://translate.google.com/?sl=ja&tl=en&text=%E3%81%93%E3%82%93%E3%81%AB%E3%81%A1%E3%81%AF%E4%B8%96%E7%95%8C!&op=translate
So in summary, whatever generated those strings doesn't seem to be doing the standard thing.
using System;
using System.IO;
using System.Threading.Tasks;
using Microsoft.CodeAnalysis.CSharp.Scripting;
using Microsoft.CodeAnalysis.Scripting;
using Newtonsoft.Json;

string test = @"It\u00e2\u0080\u0099s working";

// Using JSON deserialization, since \uXXXX is valid escaping in JavaScript/JSON string literals.
// Have to add starting and ending quotes to make it a string literal, then deserialize as string.
var d = JsonConvert.DeserializeObject("\"" + test + "\"", typeof(string));
Console.WriteLine(d);
System.Diagnostics.Debug.WriteLine(d);

// Another way of JSON deserialization. If you are using a stream, like reading from a file, this may be better:
TextReader reader = new StringReader("\"" + test + "\"");
JsonTextReader rdr = new JsonTextReader(reader);
rdr.Read();
Console.WriteLine(rdr.Value);
System.Diagnostics.Debug.WriteLine(rdr.Value);

// Lastly, overkill and too heavy: using Roslyn CSharpScript and letting the C# compiler decode the \uXXXX's in a string literal:
ScriptOptions opt = ScriptOptions.Default;
//opt = opt.WithFileEncoding(Encoding.Unicode);
Task<ScriptState<string>> task = Task.Run(async () => { return CSharpScript.RunAsync<string>("string str = \"" + test + "\".ToString();", opt); }).Result;
ScriptState<string> s = task.Result;
var ddd = s.Variables[0];
Console.WriteLine(ddd.Value);
System.Diagnostics.Debug.WriteLine(ddd.Value);

Encoding "ä" into "%E4"

I'm trying to understand which encoding in C# fulfills a requirement from a new SMS provider.
The text I want to send is:
Bäste Björn
The encoded text that the provider says it needs is:
B%E4ste+Bj%F6rn
so ä is %E4 and ö is %F6
From this answer, I gathered that for such a conversion I need to use HttpUtility.HtmlAttributeEncode, as the normal HttpUtility.UrlEncode will output:
B%c3%a4ste+Bj%c3%b6rn
and that outputs weird chars on the mobile phone :/
As several chars are not converted, I tried this:
private string specialEncoding(string text)
{
    StringBuilder r = new StringBuilder();
    foreach (char c in text.ToCharArray())
    {
        string e = System.Web.HttpUtility.UrlEncode(c.ToString());
        if (e.StartsWith("%") && e.ToLower() != "%0a") // %0a == Linefeed
        {
            string attr = System.Web.HttpUtility.HtmlAttributeEncode(c.ToString());
            r.Append(attr);
        }
        else
        {
            r.Append(e);
        }
    }
    return r.ToString();
}
It's verbose so I could set a breakpoint and test each char, and I found out that:
System.Web.HttpUtility.HtmlAttributeEncode("ä") is actually just equal to ä... so there is no %E4 as output...
What am I missing? And is there a simple way to do the encoding without handling the string char by char, while still getting the required output?
that the provider says it needs
Ask the provider in which age they are living. According to Wikipedia: Percent-encoding:
The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values. This requirement was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected.
Granted, this RFC talks about "new URI schemes", which HTTP obviously is not, but adhering to this standard prevents headaches like this. See also What is the proper way to URL encode Unicode characters?.
They seem to want you to encode characters according to the Windows-1250 code page (or a comparable one, like ISO-8859-1 or -2; check alternatives here) instead, as in that code page 0xE4 maps to ä and 0xF6 maps to ö. As @Simon points out in his comment, you should ask the provider which code page exactly they want you to use.
Assuming Windows-1250, you can implement it like this, according to URL encode ASCII/UTF16 characters:
var windows1250 = Encoding.GetEncoding(1250);
var percentEncoded = HttpUtility.UrlEncode("Bäste Björn", windows1250);
The value of percentEncoded is:
B%e4ste+Bj%f6rn
If they insist on using uppercase, see .net UrlEncode - lowercase problem.
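One hedged way to do that post-processing yourself (a small helper of my own, not anything built into HttpUtility) is to uppercase just the escape sequences:
using System.Text.RegularExpressions;

// Uppercase only the hex digits of each %xx escape produced by UrlEncode,
// leaving everything else (including the '+' used for spaces) untouched.
static string UpperCasePercentEscapes(string encoded)
{
    return Regex.Replace(encoded, "%[0-9a-f]{2}", m => m.Value.ToUpperInvariant());
}

// UpperCasePercentEscapes("B%e4ste+Bj%f6rn") returns "B%E4ste+Bj%F6rn"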

Unexpected conversion of Unicode overline (U+203E) to Shift-JIS

For a customer project, a query is made against a DB and the results are written to a file. The file is required to be in Shift JIS as it is later used as input for another legacy system. The Wikipedia article indicates that:
The single-byte characters 0x00 to 0x7F match the ASCII encoding, except for a yen sign (U+00A5) at 0x5C and an overline (U+203E) at 0x7E in place of the ASCII character set's backslash and tilde respectively.
During some testing, I have verified that while the yen sign (U+00A5) properly becomes 0x5C, the overline (U+203E) becomes 0x3F (question mark) rather than the expected 0x7E.
While the real output is done with a StreamWriter to a file, below is minimal code to reproduce the issue:
static void Test()
{
    // Get Shift-JIS encoder.
    var encoding = Encoding.GetEncoding("shift_jis");

    // Declare overline (U+203E).
    char c = (char) 0x203E;

    // Get bytes when encoded as Shift-JIS.
    var bytes = encoding.GetBytes(c.ToString());

    // Expected 0x7E, but the value returned is 0x3F.
}
Is this behavior correct?
I suppose I could subclass EncoderFallback, but this seems like far more work for something that I would have expected to work from the start.
Upon further investigation, I must conclude that "Shift JIS" is a misnomer here. Rather, this is code page 932. The Unicode Consortium and Microsoft provide a mapping table between this code page and Unicode, and that is apparently what is being used to map the characters. Notice that it contains a mapping for neither (0x5C, U+00A5) nor (0x7E, U+203E).
Note, though, that I wrote in the original question that "I have verified that while the yen sign (U+00A5) properly becomes 0x5C". Apparently the Encoding.GetEncoding(String) method returns an encoding whose fallback is an internal "best fit" fallback (System.Text.InternalEncoderBestFitFallback), which I assume provides additional mappings for some characters that would normally fail. It must contain an additional mapping for the yen sign (U+00A5), but unfortunately nothing for the overline (U+203E). When I replace this with EncoderExceptionFallback, it fails for both characters.
Hence, I conclude that for Shift JIS proper, this is an error. But for code page 932, it is the expected result.
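A minimal sketch of the fallback check described above, assuming code page 932 is available (on .NET Core/.NET 5+ you may need to register CodePagesEncodingProvider first):
var strict932 = Encoding.GetEncoding(
    932,
    new EncoderExceptionFallback(),
    new DecoderExceptionFallback());

// With the best-fit fallback removed, both characters fail to encode, which shows
// that the yen mapping comes from the fallback rather than from the code page table.
foreach (char c in new[] { '\u00A5', '\u203E' })
{
    try
    {
        strict932.GetBytes(c.ToString());
        Console.WriteLine($"U+{(int)c:X4} encoded");
    }
    catch (EncoderFallbackException)
    {
        Console.WriteLine($"U+{(int)c:X4} is not in code page 932");
    }
}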

Which encoding does Alt+Numpad keys generate?

In short:
For this code:
Encoding.ASCII.GetBytes("‚")
I want the output to be 130, but this gives me 63.
I am typing the string using Alt+0130.
On my setup:
Encoding.ASCII.GetBytes("‚"); // 63
Encoding.Default.GetBytes("‚"); // 130
Of course 'default' could very well be environment-dependent...
When you try to encode the string using the ASCII encoding, the character is converted to a question mark, as there is no such character in the ASCII character set. The character code for the question mark is 63.
You need to use an encoding that supports the character to get its actual character code.
One option is to use the Encoding.Default property to get the encoding for the system code page, as David suggested. However, as the system code page can differ between computers, it's not guaranteed to give the same result everywhere.
The Unicode character code is 8218 (U+201A), which you can get by simply converting the character to an int:
int characterCode = (int)'‚';
As this does not depend on any system settings, you should consider whether you can use that instead of the encoded byte value.
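If the byte value 130 itself is what you need, a hedged alternative to Encoding.Default is to name the code page explicitly, assuming the 130 you expect comes from Windows-1252 (where 0x82 is U+201A):
// Alt+0130 inserts the character at position 130 of the system's ANSI code page,
// which on most Western-European setups is Windows-1252, where 0x82 maps to U+201A.
var windows1252 = Encoding.GetEncoding(1252);
byte[] bytes = windows1252.GetBytes("\u201A");
Console.WriteLine(bytes[0]); // 130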

Base64 String throwing invalid character error

I keep getting a Base64 invalid character error even though I shouldn't.
The program takes an XML file and exports it to a document. If the user wants, it will compress the file as well. The compression works fine and returns a Base64 string, which is encoded as UTF-8 and written to a file.
When it's time to reload the document into the program, I have to check whether it's compressed or not; the code is simply:
byte[] gzBuffer = System.Convert.FromBase64String(text);
return "1F-8B-08" == BitConverter.ToString(new List<Byte>(gzBuffer).GetRange(4, 3).ToArray());
It checks the beginning of the string to see if it has GZip's magic number in it.
Now the thing is, all my tests work. I take a string, compress it, decompress it, and compare it to the original. The problem is when I get the string back from an ADO recordset. The string is exactly what was written to the file (with the addition of a "\0" at the end, but I don't think that even does anything; even with it trimmed off, it still throws). I even copied and pasted the entire string into a test method and compressed/decompressed that. Works fine.
The tests pass, but the code fails using the exact same string? The only difference is that instead of declaring a regular string and passing it in, I'm getting one returned from a recordset.
Any ideas on what I am doing wrong?
You say:
The string is exactly what was written to the file (with the addition of a "\0" at the end, but I don't think that even does anything).
In fact, it does do something (it causes your code to throw a FormatException: "Invalid character in a Base-64 string"), because Convert.FromBase64String does not consider "\0" to be a valid Base64 character.
byte[] data1 = Convert.FromBase64String("AAAA\0"); // Throws exception
byte[] data2 = Convert.FromBase64String("AAAA"); // Works
Solution: Get rid of the zero termination. (Maybe call .TrimEnd('\0').)
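For example, a minimal sketch reusing the text variable from the question:
// Strip the trailing NUL before handing the string to the Base64 decoder.
string cleaned = text.TrimEnd('\0');
byte[] gzBuffer = Convert.FromBase64String(cleaned);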
Notes:
The MSDN docs for Convert.FromBase64String say it will throw a FormatException when:
The length of s, ignoring white space characters, is not zero or a multiple of 4.
-or-
The format of s is invalid. s contains a non-base 64 character, more than two padding characters, or a non-white space character among the padding characters.
and that:
The base 64 digits in ascending order from zero are the uppercase characters 'A' to 'Z', lowercase characters 'a' to 'z', numerals '0' to '9', and the symbols '+' and '/'.
Whether the null char is allowed or not really depends on the Base64 codec in question.
Given the vagueness of the Base64 standard (there is no single authoritative, exact specification), many implementations just ignore it as white space, others flag it as a problem, and the buggiest ones wouldn't notice and would happily try decoding it... :-/
But it sounds like the C# implementation does not like it (which is one valid approach), so if removing it helps, that should be done.
One minor additional comment: UTF-8 is not a requirement; ISO-8859-x (aka Latin-x) and 7-bit ASCII would work as well, because Base64 was specifically designed to use only a 7-bit subset that works with all ASCII-compatible encodings.
string stringToDecrypt = HttpContext.Current.Request.QueryString.ToString();
// change to
string stringToDecrypt = HttpUtility.UrlDecode(HttpContext.Current.Request.QueryString.ToString());
If removing the "\0" from the end of the string is impossible, you can add a character of your own to each string you encode and remove it on decode.
One gotcha with converting Base64 from a string is that some conversion functions expect the preceding "data:image/jpg;base64," while others only accept the actual data.
