Encoding UTF-16 to UTF-8 C#

Hello everyone, I have a problem with encoding.
I want to convert UTF-16 to UTF-8. I found a lot of code, but none of it worked.
I hope you can help. Thanks.
This is the text:
'\x04\x1a\x040\x04@\x04B\x040\x00 \x00*\x003\x003\x000\x001\x00:\x00 \x000\x001\x00.\x001\x001\x00.\x002\x000\x002\x002\x00 \x001\x004\x00:\x001\x000\x00,\x00 \x04?\x04>\x04?\x04>\x04;\x04=\x045\x04=\x048\x045\x00 \x003\x003\x00.\x003\x003\x00 \x00T\x00J\x00S\x00.\x00 \x00 \x04\x14\x04>\x04A\x04B\x04C\x04?\x04=\x04>\x00 \x003\x002\x002\x003'
I tried this:
string v = Regex.Unescape(text);
and got a result like this:
♦→♦0♦@♦B♦0 *3301: 01.11.2022 14:10, ♦?♦>♦?♦>♦;♦=♦5♦=♦8♦5 33.33 TJS. ♦¶♦>♦A♦B♦C♦?♦=♦> 3223
Then I tried this:
public static string Utf16ToUtf8(string utf16String)
{
// Get UTF16 bytes and convert UTF16 bytes to UTF8 bytes
byte[] utf16Bytes = Encoding.Unicode.GetBytes(utf16String);
byte[] utf8Bytes = Encoding.Convert(Encoding.Unicode, Encoding.UTF8, utf16Bytes);
// Return UTF8 bytes as ANSI string
return Encoding.Default.GetString(utf8Bytes);
}
That didn't work either.
I need a result like this:
Карта *3301: 01.11.2022 14:10, пополнение 33.33 TJS. Доступно 3223

The code below decodes the text to what you want, but it would be much better to avoid getting into this situation in the first place. If the data is fundamentally text, store it as text in your log files without the extra "convert to UTF-16 then encode that binary data" aspect - that's just causing problems.
The code below "decodes" the text log data into a byte array by treating each \x escape sequence as a single byte (assuming \\ is used to encode backslashes) and treating any other character as a single byte - effectively ISO-8859-1.
It then converts the byte array to a string using big-endian UTF-16. The output is as desired:
Карта *3301: 01.11.2022 14:10, пополнение 33.33 TJS. Доступно 3223
The code is really inefficient - it's effectively a proof of concept to validate the text format you've got. Don't use it as-is; instead, use this as a starting point for improving your storage representation.
using System;
using System.Collections.Generic;
using System.Text;
class Program
{
static void Main()
{
string logText = @"\x04\x1a\x040\x04@\x04B\x040\x00 \x00*\x003\x003\x000\x001\x00:\x00 \x000\x001\x00.\x001\x001\x00.\x002\x000\x002\x002\x00 \x001\x004\x00:\x001\x000\x00,\x00 \x04?\x04>\x04?\x04>\x04;\x04=\x045\x04=\x048\x045\x00 \x003\x003\x00.\x003\x003\x00 \x00T\x00J\x00S\x00.\x00 \x00 \x04\x14\x04>\x04A\x04B\x04C\x04?\x04=\x04>\x00 \x003\x002\x002\x003";
byte[] utf16 = DecodeLogText(logText);
string text = Encoding.BigEndianUnicode.GetString(utf16);
Console.WriteLine(text);
}
static byte[] DecodeLogText(string logText)
{
List<byte> bytes = new List<byte>();
for (int i = 0; i < logText.Length; i++)
{
if (logText[i] == '\\')
{
if (i == logText.Length - 1)
{
throw new Exception("Trailing backslash");
}
switch (logText[i + 1])
{
case 'x':
if (i >= logText.Length - 3)
{
throw new Exception("Not enough data for \\x escape sequence");
}
// This is horribly inefficient, but never mind.
bytes.Add(Convert.ToByte(logText.Substring(i + 2, 2), 16));
// Consume the x and hex
i += 3;
break;
case '\\':
bytes.Add((byte) '\\');
// Consume the extra backslash
i++;
break;
// TODO: Any other escape sequences?
default:
throw new Exception("Unknown escape sequence");
}
}
else
{
bytes.Add((byte) logText[i]);
}
}
return bytes.ToArray();
}
}

This also helped me:
string reg = Regex.Unescape(text2);
byte[] ascii = Encoding.BigEndianUnicode.GetBytes(reg);
byte[] utf8 = Encoding.Convert(Encoding.BigEndianUnicode, Encoding.UTF8, ascii);
Console.WriteLine(Encoding.BigEndianUnicode.GetString(utf8));
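Note that the snippet above only works because every byte in this particular input is below 0x80, so the intermediate UTF-8 step happens to leave the bytes unchanged. A minimal sketch that does not depend on that, assuming the input uses only escapes that Regex.Unescape understands (the class and method names here are made up for illustration):

```csharp
using System.Text;
using System.Text.RegularExpressions;

static class LogTextDecoder
{
    // Decodes text in which every \xNN escape and every plain character
    // stands for exactly one byte of big-endian UTF-16 data.
    public static string Decode(string escapedText)
    {
        // Regex.Unescape turns each \xNN escape into the character U+00NN.
        string oneCharPerByte = Regex.Unescape(escapedText);

        // ISO-8859-1 maps U+0000..U+00FF to the bytes 0x00..0xFF one-to-one,
        // so this recovers the original byte sequence.
        byte[] utf16Bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(oneCharPerByte);

        // Interpret the recovered bytes as big-endian UTF-16.
        return Encoding.BigEndianUnicode.GetString(utf16Bytes);
    }
}
```

Calling LogTextDecoder.Decode with the first six escaped bytes of the log text should give the word Карта.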

Related

Replace() working with hex value

I would like to use the Replace() method with hex values instead of string values.
I have a C# program that writes a text file.
I don't know why, but when the program writes the '°' character (-> Number), it is written as Â° (in hex: C2 B0 instead of B0).
I would just like to patch it to correct this.
Is it possible to use Replace() to replace C2B0 with B0? How can I do this?
Thanks a lot :)

Not sure if this is the best solution for your problem but if you want a replace function for a string using hex values this will work:
var newString = HexReplace(sourceString, "C2B0", "B0");
private static string HexReplace(string source, string search, string replaceWith) {
var realSearch = string.Empty;
var realReplace = string.Empty;
if(search.Length % 2 == 1) throw new Exception("Search parameter incorrect!");
for (var i = 0; i < search.Length / 2; i++) {
var hex = search.Substring(i * 2, 2);
realSearch += (char)int.Parse(hex, System.Globalization.NumberStyles.HexNumber);
}
for (var i = 0; i < replaceWith.Length / 2; i++) {
var hex = replaceWith.Substring(i * 2, 2);
realReplace += (char)int.Parse(hex, System.Globalization.NumberStyles.HexNumber);
}
return source.Replace(realSearch, realReplace);
}

C# strings are Unicode. When they are written to a file, an encoding must be applied. The default encoding used by File.WriteAllText is UTF-8 with no byte order mark.
The two-byte sequence 0xC2 0xB0 is the UTF-8 representation of the degree sign ° (code point U+00B0).
To get rid of the 0xC2 part, apply a different encoding, for example the single-byte Windows-1252 code page (which, like Latin-1, encodes ° as 0xB0):
var windows1252 = Encoding.GetEncoding(1252);
File.WriteAllText(path, text, windows1252);
To address the "hex replace" idea of the question: the best practice for removing the UTF-8 lead byte from existing files would be to do a ReadAllText with UTF-8, followed by a WriteAllText as shown above (or stream chunking if the files are too big to read into memory as a whole).
Single-byte character encodings cannot represent all Unicode characters, so substitution will happen for any such character in your DataTable.
The rendition as Â° must be blamed on the viewer/editor you are using to display the file: it is interpreting the UTF-8 bytes as a single-byte encoding.
Further reading: https://stackoverflow.com/a/17269952/1132334
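The read-as-UTF-8, write-as-single-byte approach for existing files could look roughly like the sketch below. It is only an illustration: the class and method names are hypothetical, and it uses ISO-8859-1 rather than code page 1252 because that encoding is built into every .NET runtime, whereas 1252 may need a code page provider to be registered first on .NET Core and later. Both encode ° as the single byte 0xB0.

```csharp
using System.IO;
using System.Text;

static class FileReencoder
{
    // Reads a UTF-8 file and rewrites its text in a single-byte encoding.
    // Characters the target encoding cannot represent are substituted.
    public static void Utf8ToLatin1(string sourcePath, string targetPath)
    {
        string text = File.ReadAllText(sourcePath, Encoding.UTF8);

        // Assumption: ISO-8859-1 is an acceptable target for the file's consumer.
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        File.WriteAllText(targetPath, text, latin1);
    }
}
```

For files too large to hold in memory, the same idea applies with a StreamReader and StreamWriter pair instead of ReadAllText and WriteAllText.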

Base64 Encoded String to byte array C# Failing

We are receiving a base64-encoded string from an external application in the form body. Below is the code that we use to decode the string into a byte array; however, we are getting an exception.
Input String
PHBheW1lbnRSZXNwb25zZT48cmVzcG9uc2VDb2RlPjAwMDA8L3Jlc3BvbnNlQ29kZT48cmVzcG9uc2VDb2RlVGV4dD4wLVN1Y2Nlc3NmdWw8L3Jlc3BvbnNlQ29kZVRleHQ+PHJlc3BvbnNlU3VtbWFyeT5HUkVFTjwvcmVzcG9uc2VTdW1tYXJ5PjxwYXltZW50RXZlbnRJZGVudGlmaWVyPlRYTiAzNjM5PC9wYXltZW50RXZlbnRJZGVudGlmaWVyPjxMaXN0Pjxjb21wb25lbnRJRD5UWE4gMzYzOTwvY29tcG9uZW50SUQ+PGNsaWVudElEPkdPVERJU0UwNjwvY2xpZW50SUQ+PGJhbmtBdXRoQ29kZT5UOjEyMzQ8L2JhbmtBdXRoQ29kZT48YnV5bmV0VHhuSUQ+Mzc1PC9idXluZXRUeG5JRD48L0xpc3Q+PHBheW1lbnRJbnN0cnVtZW50UmVmPjwvcGF5bWVudEluc3RydW1lbnRSZWY+PG1hc2tlZENhcmROdW1iZXI+KioqKioqKioqKioqOTY4NjwvbWFza2VkQ2FyZE51bWJlcj48Y2FyZFR5cGU+TUFTVEVSQ0FSRDwvY2FyZFR5cGU+PGV4cGlyeURhdGU+MDMvMjAxNzwvZXhwaXJ5RGF0ZT48Y3VzdG9tRGF0YT4mbHQ7IVtDREFUQVsmbHQ7P3htbCB2ZXJzaW9uPSIxLjAiIGVuY29kaW5nPSJVVEYtOCI_Jmd0Ow0KJmx0O1RoaXN0bGVDdXN0b21EYXRhIHhtbG5zOnhzZD0iaHR0cDovL3d3dy53My5vcmcvMjAwMS9YTUxTY2hlbWEiIHhtbG5zOnhzaT0iaHR0cDovL3d3dy53My5vcmcvMjAwMS9YTUxTY2hlbWEtaW5zdGFuY2UiJmd0Ow0KICAmbHQ7T3JkZXJJZCZndDtCVFBQMzYzOSZsdDsvT3JkZXJJZCZndDsNCiAgJmx0O0Ftb3VudCZndDsxMjAmbHQ7L0Ftb3VudCZndDsNCiZsdDsvVGhpc3RsZUN1c3RvbURhdGEmZ3Q7XV0mZ3Q7PC9jdXN0b21EYXRhPjwvcGF5bWVudFJlc3BvbnNlPg
Exception
The input is not a valid Base-64 string as it contains a non-base 64 character, more than two padding characters, or an illegal character among the padding characters.
Code
var inputText = // use the data as shown above
byte[] decodedBytes = Convert.FromBase64String(inputText);
I want to know what is wrong here and why it throws the exception, whereas online converters return the proper result when I try them.

Thank you all for the responses.
I got this working after studying how base64 encoding works.
Below is the code that I used to fix it.
var input = new StreamReader(Request.InputStream).ReadToEnd();
var inputText = "PHBheW1lbnRSZXNwb25zZT48cmVzcG9uc2VDb2RlPjAwMDA8L3Jlc3BvbnNlQ29kZT48cmVzcG9uc2VDb2RlVGV4dD4wLVN1Y2Nlc3NmdWw8L3Jlc3BvbnNlQ29kZVRleHQ+PHJlc3BvbnNlU3VtbWFyeT5HUkVFTjwvcmVzcG9uc2VTdW1tYXJ5PjxwYXltZW50RXZlbnRJZGVudGlmaWVyPlRYTiAzNjM5PC9wYXltZW50RXZlbnRJZGVudGlmaWVyPjxMaXN0Pjxjb21wb25lbnRJRD5UWE4gMzYzOTwvY29tcG9uZW50SUQ+PGNsaWVudElEPkdPVERJU0UwNjwvY2xpZW50SUQ+PGJhbmtBdXRoQ29kZT5UOjEyMzQ8L2JhbmtBdXRoQ29kZT48YnV5bmV0VHhuSUQ+Mzc1PC9idXluZXRUeG5JRD48L0xpc3Q+PHBheW1lbnRJbnN0cnVtZW50UmVmPjwvcGF5bWVudEluc3RydW1lbnRSZWY+PG1hc2tlZENhcmROdW1iZXI+KioqKioqKioqKioqOTY4NjwvbWFza2VkQ2FyZE51bWJlcj48Y2FyZFR5cGU+TUFTVEVSQ0FSRDwvY2FyZFR5cGU+PGV4cGlyeURhdGU+MDMvMjAxNzwvZXhwaXJ5RGF0ZT48Y3VzdG9tRGF0YT4mbHQ7IVtDREFUQVsmbHQ7P3htbCB2ZXJzaW9uPSIxLjAiIGVuY29kaW5nPSJVVEYtOCI_Jmd0Ow0KJmx0O1RoaXN0bGVDdXN0b21EYXRhIHhtbG5zOnhzZD0iaHR0cDovL3d3dy53My5vcmcvMjAwMS9YTUxTY2hlbWEiIHhtbG5zOnhzaT0iaHR0cDovL3d3dy53My5vcmcvMjAwMS9YTUxTY2hlbWEtaW5zdGFuY2UiJmd0Ow0KICAmbHQ7T3JkZXJJZCZndDtCVFBQMzYzOSZsdDsvT3JkZXJJZCZndDsNCiAgJmx0O0Ftb3VudCZndDsxMjAmbHQ7L0Ftb3VudCZndDsNCiZsdDsvVGhpc3RsZUN1c3RvbURhdGEmZ3Q7XV0mZ3Q7PC9jdXN0b21EYXRhPjwvcGF5bWVudFJlc3BvbnNlPg";
inputText = ValidateBase64EncodedString(inputText);
byte[] decodedBytes = Convert.FromBase64String(inputText);
string xml = Encoding.UTF8.GetString(decodedBytes);
private static string ValidateBase64EncodedString(string inputText)
{
string stringToValidate = inputText;
stringToValidate = stringToValidate.Replace('-', '+'); // 62nd char of encoding
stringToValidate = stringToValidate.Replace('_', '/'); // 63rd char of encoding
switch (stringToValidate.Length % 4) // Pad with trailing '='s
{
case 0: break; // No pad chars in this case
case 2: stringToValidate += "=="; break; // Two pad chars
case 3: stringToValidate += "="; break; // One pad char
default:
throw new System.Exception(
"Illegal base64url string!");
}
return stringToValidate;
}
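For context: the input string contains a '_' character and has no trailing '=' padding, both of which are typical of the URL-safe base64 variant ("base64url") that Convert.FromBase64String does not accept. The sketch below packs the same fix into one helper and adds a small diagnostic for locating the offending character; the class and method names are made up for illustration.

```csharp
using System;

static class Base64Url
{
    // Converts base64url text to standard base64, restores padding, and decodes it.
    public static byte[] Decode(string base64Url)
    {
        string s = base64Url.Replace('-', '+').Replace('_', '/');
        int remainder = s.Length % 4;
        if (remainder == 1)
        {
            throw new FormatException("Illegal base64url string length.");
        }
        if (remainder != 0)
        {
            // Pad with '=' up to the next multiple of four characters.
            s = s.PadRight(s.Length + (4 - remainder), '=');
        }
        return Convert.FromBase64String(s);
    }

    // Returns the index of the first character that is not part of the
    // standard base64 alphabet (A-Z, a-z, 0-9, '+', '/', '='), or -1 if none.
    public static int IndexOfFirstNonStandardChar(string text)
    {
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            bool standard = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                || (c >= '0' && c <= '9') || c == '+' || c == '/' || c == '=';
            if (!standard) return i;
        }
        return -1;
    }
}
```

For example, the base64url text "PDw_Pz4-" decodes to the ASCII bytes of "<<??>>", and IndexOfFirstNonStandardChar reports index 3 for it (the '_').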

It's because your custom data is invalid.
In that Base64-encoded string you have a raw [CDATA] section, which is most likely not encoded correctly.
Notice that if I take the first 684 characters of your string, it converts properly to:
<paymentResponse><responseCode>0000</responseCode><responseCodeText>0-Successful</responseCodeText><responseSummary>GREEN</responseSummary><paymentEventIdentifier>TXN 3639</paymentEventIdentifier><List><componentID>TXN 3639</componentID><clientID>GOTDISE06</clientID><bankAuthCode>T:1234</bankAuthCode><buynetTxnID>375</buynetTxnID></List><paymentInstrumentRef></paymentInstrumentRef><maskedCardNumber>************9686</maskedCardNumber><cardType>MASTERCARD</cardType><expiryDate>03/2017</expiryDate><customData>&
But all hell breaks loose when we get to the customData tag. This, I suppose, contains the CDATA XML content, which is probably not base64 encoded.
So either leave the CDATA content out, or encode it properly before adding it to the final string, to fix the issue.

Compare Windows-1252 string to UTF-8 string

My goal is to convert a .NET string (Unicode) into Windows-1252 and, if necessary, store the original UTF-8 string in a Base64 entity.
For example, the string "DJ Doena" converted to 1252 is still "DJ Doena".
However, if you convert the Japanese kanji for tree (木) into 1252, you end up with a question mark.
These are my test strings:
String doena = "DJ Doena";
String umlaut = "äöüßéèâ";
String allIn = "< ä ß á â & 木 >";
This is how I convert the string in the first place:
using (MemoryStream ms = new MemoryStream())
{
using (StreamWriter sw = new StreamWriter(ms, Encoding.UTF8))
{
sw.Write(decoded);
sw.Flush();
ms.Seek(0, SeekOrigin.Begin);
using (StreamReader sr = new StreamReader(ms, Encoding.GetEncoding(1252)))
{
encoded = sr.ReadToEnd();
}
}
}
The problem is that, while debugging, string comparison claims that both are indeed identical, so a simple == or .Equals() doesn't suffice.
This is how I try to find out if I need base64 and produce it:
private static String GetBase64Alternate(String utf8Text, String windows1252Text)
{
Byte[] utf8Bytes;
Byte[] windows1252Bytes;
String base64;
utf8Bytes = Encoding.UTF8.GetBytes(utf8Text);
windows1252Bytes = Encoding.GetEncoding(1252).GetBytes(windows1252Text);
base64 = null;
if (utf8Bytes.Length != windows1252Bytes.Length)
{
base64 = Convert.ToBase64String(utf8Bytes);
}
else
{
for(Int32 i = 0; i < utf8Bytes.Length; i++)
{
if(utf8Bytes[i] != windows1252Bytes[i])
{
base64 = Convert.ToBase64String(utf8Bytes);
break;
}
}
}
return (base64);
}
The first string doena is completely identical and doesn't produce a base64 result
Console.WriteLine(String.Format("{0} / {1}", windows1252Text, base64Text));
results in
DJ Doena /
But the second string, umlaut, already has twice as many bytes in UTF-8 as in 1252 and thus produces a Base64 string even though that does not appear to be necessary:
äöüßéèâ / w6TDtsO8w5/DqcOow6I=
And the third one does what it's supposed to do (no more "木" but a "?", thus base64 needed):
< ä ß á â & ? > / PCDDpCDDnyDDoSDDoiAmIOacqCA+
Any clues how my Base64 getter could be enhanced a) for performance b) for better results?
Thank you in advance. :-)

I'm not sure I completely understood the question. But I tried. :) If I do understand correctly, this code does what you want:
static void Main(string[] args)
{
string[] testStrings = { "DJ Doena", "äöüßéèâ", "< ä ß á â & 木 >" };
foreach (string text in testStrings)
{
Console.WriteLine(ReencodeText(text));
}
}
private static string ReencodeText(string text)
{
Encoding encoding = Encoding.GetEncoding(1252);
string text1252 = encoding.GetString(encoding.GetBytes(text));
return text.Equals(text1252, StringComparison.Ordinal) ?
text : Convert.ToBase64String(Encoding.UTF8.GetBytes(text));
}
I.e. it encodes the text to Windows-1252, then decodes back to a string object, which it then compares with the original. If the comparison succeeds, it returns the original string, otherwise it encodes it to UTF8, and then to base64.
It produces the following output:
DJ Doena
äöüßéèâ
PCDDpCDDnyDDoSDDoiAmIOacqCA+
In other words, the first two strings are left intact, while the third is encoded as base64.

In your first code you are encoding the string using one encoding, then decoding it using a different encoding. That doesn't give you any reliable result at all; it's the equivalent of writing out a number in octal, then reading it as if it was in decimal. It seems to work just fine for numbers up to 7, but after that you get useless results.
The problem with the GetBase64Alternate method is that it encodes a string using two different encodings, and assumes that Windows-1252 doesn't support some of the characters whenever the two encodings produce different sets of bytes.
Comparing the byte sequences doesn't tell you whether any of the encodings failed. The sequences will be different if one failed, but they will also be different if there are any characters that the two encodings encode differently.
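To illustrate the point about byte comparison, here is a small sketch. The helper names are hypothetical, and ISO-8859-1 stands in for Windows-1252 in the usage note because it is available on every .NET runtime without registering a code page provider; the two agree on the characters used here.

```csharp
using System.Linq;
using System.Text;

static class EncodingComparison
{
    // True when both encodings produce exactly the same bytes for the text.
    public static bool SameBytes(string text, Encoding first, Encoding second)
        => first.GetBytes(text).SequenceEqual(second.GetBytes(text));

    // True when encoding and then decoding with the same encoding
    // gives back the original text, i.e. nothing was substituted.
    public static bool RoundTrips(string text, Encoding encoding)
        => encoding.GetString(encoding.GetBytes(text)) == text;
}
```

With UTF-8 and ISO-8859-1, SameBytes is true for "DJ Doena" but false for "ä", even though "ä" round-trips through ISO-8859-1 without loss. Only a string such as "木" fails RoundTrips, and that is the case that actually needs the Base64 copy.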
What you want to do is to determine if the encoding actually worked for all characters. You can do that by creating an Encoding instance with a fallback for unsupported characters. There is an EncoderExceptionFallback class that you can use for that, which throws an EncoderFallbackException if it's called.
This code tries to use the Windows-1252 encoding on a string, and sets the ok variable to false if the encoding doesn't support all characters in the string:
Encoding e = Encoding.GetEncoding(1252, new EncoderExceptionFallback(), new DecoderExceptionFallback());
bool ok = true;
try {
e.GetByteCount(allIn);
} catch (EncoderFallbackException) {
ok = false;
}
As you are not actually going to use the encoded result for anything, you can use the GetByteCount method. It checks how all characters would be encoded without producing the encoded result.
Used in your method it would be:
private static String GetBase64Alternate(string text) {
// The exception fallbacks make the encoding throw instead of substituting '?'.
Encoding e = Encoding.GetEncoding(1252, new EncoderExceptionFallback(), new DecoderExceptionFallback());
bool ok = true;
try {
e.GetByteCount(text);
} catch (EncoderFallbackException) {
ok = false;
}
// Base64 is only needed when Windows-1252 cannot represent the text.
return ok ? null : Convert.ToBase64String(Encoding.UTF8.GetBytes(text));
}

How to overwrite specific bytes in dump file in C#

I have a MySQL dump with some special characters ("Ä, ä, Ö, ö, Ü, ü, ß"). I have to reimport this dump into the latest MySQL version. The import breaks the special characters because of the encoding: the dump is not encoded as UTF-8.
Within this dump there are also some binary attachments, which must not be overwritten; otherwise the attachments will be broken.
I have to overwrite every special character with the bytes that represent it in UTF-8.
I'm currently trying it this way (this changes the ANSI ü into a ü that is readable as UTF-8):
newByteArray[y] = 195;
if (bytesFromLine[i] == 252)
{
newByteArray[y + 1] = 188;
}
newByteArray[y + 2] = bytesFromLine[y + 1];
252 is displayed as 'ü' in Encoding.Default. 195 188 is displayed as 'ü' in Encoding.UTF8.
Now I need help with finding these specific characters in the dump file and overwriting their bytes with the right bytes. I can't replace every '252' with '195 188' because the attachments would then be broken.
Thanks in advance.
Relax

DISCLAIMER: This might corrupt your data. The best way of dealing with this is to get a proper mysqldump from the source database. This solution should only be used when you don't have that option and are stuck with a potentially broken dump file.
Assuming all strings in the dump file are in quotes (using the single quote ') and a quote inside a string is escaped as \':
INSERT INTO `some_table` VALUES (123, 'this is a string', ...
I'm not too sure how binary data is represented. That might need more checks; you need to check your dump file and see whether these assumptions are correct.
const char quote = '\'';
const char escape = '\\';
using (var dumpOut = new FileStream("dump_out.txt", FileMode.Create, FileAccess.Write))
using (var dumpIn = new FileStream("dump_in.txt", FileMode.Open, FileAccess.Read))
{
bool inquotes = false;
byte previousByte = 0;
var stringBytes = new List<byte>();
while (true)
{
int readByte = dumpIn.ReadByte();
if (readByte == -1) break;
var b = (byte) readByte;
if (b == quote && previousByte != escape)
{
if (inquotes) // closing quote
{
var buffer = stringBytes.ToArray();
stringBytes.Clear();
byte[] converted = Encoding.Convert(Encoding.Default, Encoding.UTF8, buffer);
dumpOut.Write(converted, 0, converted.Length);
dumpOut.WriteByte(b);
}
else // opening quote
{
dumpOut.WriteByte(b);
}
inquotes = !inquotes;
continue;
}
previousByte = b;
if (inquotes)
stringBytes.Add(b);
else
dumpOut.WriteByte(b);
}
}
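One thing to watch in the code above: Encoding.Default depends on the runtime. On .NET Framework it is the system ANSI code page, but on .NET Core and later it is UTF-8, in which case the conversion would not do what is intended. A minimal sketch that names the source encoding explicitly, assuming the dump's "ANSI" bytes are ISO-8859-1 (the helper name is hypothetical, and the real dump may use a different code page):

```csharp
using System.Text;

static class AnsiToUtf8
{
    // Re-encodes single-byte "ANSI" text as UTF-8.
    // Assumption: the source bytes are ISO-8859-1; adjust if the dump uses another code page.
    public static byte[] ToUtf8(byte[] ansiBytes)
    {
        Encoding source = Encoding.GetEncoding("ISO-8859-1");
        return Encoding.Convert(source, Encoding.UTF8, ansiBytes);
    }
}
```

Passing the single byte 252 (ü) to ToUtf8 gives the two bytes 195 and 188, which matches the values in the question.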

How can i convert a string of characters into binary string and back again?

I need to convert a string into its binary equivalent and keep it in a string, then convert it back into its ASCII equivalent.

You can encode a string into a byte-wise representation by using an Encoding, e.g. UTF-8:
var str = "Out of cheese error";
var bytes = Encoding.UTF8.GetBytes(str);
To get back a .NET string object:
var strAgain = Encoding.UTF8.GetString(bytes);
// str == strAgain
You seem to want the representation as a series of '1' and '0' characters; I'm not sure why you do, but that's possible too:
var binStr = string.Join("", bytes.Select(b => Convert.ToString(b, 2).PadLeft(8, '0')));
An encoding takes an abstract string (abstract in the sense that it is an opaque representation of a series of Unicode code points) and maps it to a concrete series of bytes. The bytes are meaningless (again, because they're opaque) without the encoding. But, with the encoding, they can be turned back into a string.
You seem to be mixing up "ASCII" with strings; ASCII is simply an encoding that deals only with code points below 128. If you have a string containing an 'é', for example, it has no ASCII representation, and so most definitely cannot be represented using a series of ASCII bytes, even though it can exist peacefully in a .NET string object.
See this article by Joel Spolsky for further reading.

You can use these functions to convert to binary and back:
public static string BinaryToString(string data)
{
List<Byte> byteList = new List<Byte>();
for (int i = 0; i < data.Length; i += 8)
{
byteList.Add(Convert.ToByte(data.Substring(i, 8), 2));
}
return Encoding.ASCII.GetString(byteList.ToArray());
}
and for converting string to binary :
public static string StringToBinary(string data)
{
StringBuilder sb = new StringBuilder();
foreach (char c in data.ToCharArray())
{
sb.Append(Convert.ToString(c, 2).PadLeft(8, '0'));
}
return sb.ToString();
}
Hope this helps.

First convert the string into bytes, as described in my comment and in Cameron's answer; then iterate, convert each byte into an 8-digit binary number (possibly with Convert.ToString, padding appropriately), then concatenate. For the reverse direction, split by 8 characters, run through Convert.ToInt16, build up a byte array, then convert back to a string with GetString.
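Those steps can be sketched as follows. This is a minimal illustration that uses UTF-8 so that non-ASCII characters also survive the round trip; the class and method names are made up for the example.

```csharp
using System;
using System.Linq;
using System.Text;

static class BinaryText
{
    // Encodes the text as UTF-8 and writes every byte as exactly eight '0'/'1' characters.
    public static string ToBinary(string text)
        => string.Concat(Encoding.UTF8.GetBytes(text)
            .Select(b => Convert.ToString(b, 2).PadLeft(8, '0')));

    // Splits the input into groups of eight bits, parses each group as a byte,
    // and decodes the resulting bytes as UTF-8.
    public static string FromBinary(string bits)
    {
        if (bits.Length % 8 != 0)
        {
            throw new ArgumentException("Length must be a multiple of 8.", nameof(bits));
        }
        var bytes = new byte[bits.Length / 8];
        for (int i = 0; i < bytes.Length; i++)
        {
            bytes[i] = Convert.ToByte(bits.Substring(i * 8, 8), 2);
        }
        return Encoding.UTF8.GetString(bytes);
    }
}
```

For instance, BinaryText.ToBinary("A") gives "01000001", and BinaryText.FromBinary turns that back into "A". The padding to eight digits is what makes the reverse direction possible.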
