UTF16 string to normal text

UTF16 string to normal text - c#

I have one column in DB containing UTF16 string and I want to convert the UTF16 string into normal text. How to achieve this in c# ?
For example :
Source : 0645 0631 062D 0628 0627 0020 0627 0644 0639 0627 0644 0645
Convert : مرحبا العالم

I presume that source is simply a string containing the byte values, as this is one thing not quite clear from your question.
You first need to turn that into a byte array. Of course you first need to remove the blanks.
// Initialize the byte array
string sourceNoBlanks = source.Replace(" ", "").Trim();
if ((sourceNoBlanks.Length % 2) > 0)
throw new ArgumentException("The length of the source string must be a multiple of 2!");
byte[] sourceBytes = new byte[source.Length / 2];
// Then, create the bytes
for (int i = 0; i < sourceBytes.Length; i++)
{
string byteString = sourceNoBlanks.Substring(i*2, 2);
sourceBytes[i] = Byte.Parse(byteString, NumberStyles.HexNumber);
}
After that you can easily convert it to string:
string result = Encoding.UTF32.GetString(sourceBytes);
I suggest you read up on the UTF32 encoding to understand little/big endian encoding.

Related

C# utf string conversion, characters which don't display correctly get converted to "unknown character" - how to prevent this?

I've got two strings which are derived from Windows filenames, which contain unicode characters that do not display correctly in Windows (they show just the square box "unknown character" instead of the correct character). However the filenames are valid and these files exist without problems in the operating system, which means I need to be able to deal with them correctly and accurately.
I'm loading the filenames the usual way:
string path = #"c:\folder";
foreach (FileInfo file in DirectoryInfo.EnumerateFiles(path))
{
string filename = file.FullName;
}
but for the purposes of explaining this problem, these are the two filenames I'm having issues with:
string filename1 = "\ude18.txt";
string filename2 = "\udca6.txt";
Two strings, two filenames with a single unicode character plus an extension, both different. This so far is fine, I can read and write these files no problem, however I need to store these strings in a sqlite db and later retrieve them. Every attempt I make to do so results in both of these characters being changed to the "unknown character", so the original data is lost and I can no longer differentiate the two strings. At first I thought this was an sqlite issue, and I've made sure my db is in UTF16, but it turns out it's the conversion in c# to UTF16 that is causing the problem.
If I ignore sqlite entirely, and simply try to manually convert these strings to UTF16 (or to any other encoding), these characters are converted to the "unknown character" and the original data is lost. If I do this:
System.Text.Encoding enc = System.Text.Encoding.Unicode;
string filename1 = "\ude18.txt";
string filename2 = "\udca6.txt";
byte[] name1Bytes = enc.GetBytes(filename1);
byte[] name2Bytes = enc.GetBytes(filename2);
and I then inspect the bytearrays 'name1Bytes' and 'name2Bytes' they are both identical. and I can see that the unicode character in both cases has been converted to a pair of bytes 253 and 255 - the unknown character. and sure enough when I convert back
string newFilename1 = enc.GetString(name1Bytes);
string newFilename2 = enc.GetString(name2Bytes);
the orignal unicode character in each case is lost, and replaced with a diamond question mark symbol. I have lost the original filenames altogether.
It seems that these encoding conversions rely on the system font being able to display the characters, and this is a problem as these strings already exist as filenames, and changing the filenames isn't an option. I need to preserve this data somehow when sending it to sqlite, and when it's sent to sqlite it will go through a conversion process to UTF16, and it's this conversion that I need it to survive without losing data.

If you cast a char to an int, you get the numeric value, bypassing the Unicode conversion mechanism:
foreach (char ch in filename1)
{
int i = ch; // 0x0000de18 == 56856 for the first char in filename1
... do whatever, e.g., create an int array, store it as base64
}
This turns out to work as well, and is perhaps more elegant:
foreach (int ch in filename1)
{
...
}
So perhaps something like this:
string Encode(string raw)
{
byte[] bytes = new byte[2 * raw.Length];
int i = 0;
foreach (int ch in raw)
{
bytes[i++] = (byte)(ch & 0xff);
bytes[i++] = (byte)(ch >> 8);
}
return Convert.ToBase64String(bytes);
}
string Decode(string encoded)
{
byte[] bytes = Convert.FromBase64String(encoded);
char[] chars = new char[bytes.Length / 2];
for (int i = 0; i < chars.Length; ++i)
{
chars[i] = (char)(bytes[i * 2] | (bytes[i * 2 + 1] << 8));
}
return new string(chars);
}

Replace() working with hex value

I would like to use the Replace() method but using hex values instead of string value.
I have a programm in C# who write text file.
I don't know why, but when the programm write the '°' (-> Number) it's wrotten Â° ( in hex : C2 B0 instead of B0).
I just would like to patch it, in order to corect this.
Is it possible to do re place in order to replace C2B0 by B0 ? How doing this ?
Thanks a lot :)

Not sure if this is the best solution for your problem but if you want a replace function for a string using hex values this will work:
var newString = HexReplace(sourceString, "C2B0", "B0");
private static string HexReplace(string source, string search, string replaceWith) {
var realSearch = string.Empty;
var realReplace = string.Empty;
if(search.Length % 2 == 1) throw new Exception("Search parameter incorrect!");
for (var i = 0; i < search.Length / 2; i++) {
var hex = search.Substring(i * 2, 2);
realSearch += (char)int.Parse(hex, System.Globalization.NumberStyles.HexNumber);
}
for (var i = 0; i < replaceWith.Length / 2; i++) {
var hex = replaceWith.Substring(i * 2, 2);
realReplace += (char)int.Parse(hex, System.Globalization.NumberStyles.HexNumber);
}
return source.Replace(realSearch, realReplace);
}

C# strings are Unicode. When they are written to a file, an encoding must be applied. The default encoding used by File.WriteAllText is utf-8 with no byte order mark.
The two-byte sequence 0xC2B0 is the representation of the ° degree sign U+00B0 codepoint in utf-8.
To get rid of the 0xC2 part, apply a different encoding, for example latin-1:
var latin1 = Encoding.GetEncoding(1252);
File.WriteAllText(path, text, latin1);
To address the "hex replace" idea of the question: Best practice to remove the utf-8 leading byte from existing files would be to do a ReadAllText with utf-8, followed by a WriteAllText as shown above (or stream chunking if the files are too big to read to memory as a whole).
Single-byte character encodings cannot represent all Unicode characters, so substitution will happen for any such character in your DataTable.
The rendition as Â° must be blamed on the viewer/editor you are using to display the file.
Further reading: https://stackoverflow.com/a/17269952/1132334

convert hex to byte without conversion

I have a very specific requirement. I have some data. Of which, strings and spaces are to be converted to EBCDIC while numbers to Hexadecimal.
For Example, my string is "Test123"
Test => EBCDIC
123 => Hexadecimal.
What I am trying to do is check every character in string if its number or not, and then based on that doing my conversion.
byte[] dataBuffer = new byte[length];
int i = 0;
if (toEBCDIC)
{
foreach (char c in data)
{
byte[] temp = new byte[1];
if (Char.IsNumber(c))
{
string hexValue = Convert.ToInt32(c).ToString("X");
temp = Encoding.ASCII.GetBytes(hexValue);
dataBuffer[i] = temp[0];
}
else
{
temp = Encoding.GetEncoding("IBM01140").GetBytes(c.ToString());
dataBuffer[i] = temp[0];
}
i++;
}
dataBuffer.CopyTo(array, byteIndex);
The problem comes when i try to convert the number. I need to keep my output in byte array, as i have to write the output to a memory stream and then to a file.
When i get the hex value of number, and then try to convert it to byte, actual conversion happens.
For "1", hexvalue = 31.
Now I want to keep this 31 unchanged in bytes. I mean to say that, when i write it to byte array, it should remain 31 only. But when do GetBytes, it makes byte array, converting 3 and 1 separately to bytes.
Can anyone please help me on this..!!

The problem is here:
ToString("X")
Now it's a hexadecimal string. So in your example, from this point onward, the 3 and the 1 have become separated.
How to fix this: don't convert.
if (Char.IsNumber(c))
{
dataBuffer[i] = (byte)c;
}
Not tested. I think that's what you want. At least, that's what you describe in the last paragraph. That wouldn't make the numbers hexadecimal though - it would make them ASCII, and it's a bit odd to be mixing that with EBCDIC.

You convert the char to its code and then convert that code to string. You don't have to do the second step, instead use the code directly:
if (Char.IsNumber(c))
{
byte hexValue = Convert.ToByte(c);
dataBuffer[i] = hexValue;
}

C# ByteString to ASCII String

I am looking for a smart way to convert a string of hex-byte-values into a string of 'real text' (ASCII Characters).
For example I have the word "Hello" written in Hexadecimal ASCII: 48 45 4C 4C 4F. And using some method I want to receive the ASCII text of it (in this case "Hello").
// I have this string (example: "Hello") and want to convert it to "Hello".
string strHexa = "48454C4C4F";
// I want to convert the strHexa to an ASCII string.
string strResult = ConvertToASCII(strHexa);
I am sure there is a framework method. If this is not the case of course I could implement my own method.
Thanks!

var str = Encoding.UTF8.GetString(SoapHexBinary.Parse("48454C4C4F").Value); //HELLO
PS: SoapHexBinary is in System.Runtime.Remoting.Metadata.W3cXsd2001 namespace

I am sure there is a framework method.
A a single framework method: No.
However the second part of this: converting a byte array containing ASCII encoded text into a .NET string (which is UTF-16 encoded Unicode) does exist: System.Text.ASCIIEncoding and specifically the method GetString:
string result = ASCIIEncoding.GetString(byteArray);
The First part is easy enough to do yourself: take two hex digits at a time, parse as hex and cast to a byte to store in the array. Seomthing like:
byte[] HexStringToByteArray(string input) {
Debug.Assert(input.Length % 2 == 0, "Must have two digits per byte");
var res = new byte[input.Length/2];
for (var i = 0; i < input.Length/2; i++) {
var h = input.Substring(i*2, 2);
res[i] = Convert.ToByte(h, 16);
}
return res;
}
Edit: Note: L.B.'s answer identifies a method in .NET that will do the first part more easily: this is a better approach that writing it yourself (while in a, perhaps, obscure namespace it is implemented in mscorlib rather than needing an additional reference).

StringBuilder sb = new StringBuilder();
for (int i = 0; i < hexStr.Length; i += 2)
{
string hs = hexStr.Substring(i, 2);
sb.Append(Convert.ToByte(hs, 16));
}

How can i convert a string of characters into binary string and back again?

I need to convert a string into it's binary equivilent and keep it in a string. Then return it back into it's ASCII equivalent.

You can encode a string into a byte-wise representation by using an Encoding, e.g. UTF-8:
var str = "Out of cheese error";
var bytes = Encoding.UTF8.GetBytes(str);
To get back a .NET string object:
var strAgain = Encoding.UTF8.GetString(bytes);
// str == strAgain
You seem to want the representation as a series of '1' and '0' characters; I'm not sure why you do, but that's possible too:
var binStr = string.Join("", bytes.Select(b => Convert.ToString(b, 2)));
Encodings take an abstract string (in the sense that they're an opaque representation of a series of Unicode code points), and map them into a concrete series of bytes. The bytes are meaningless (again, because they're opaque) without the encoding. But, with the encoding, they can be turned back into a string.
You seem to be mixing up "ASCII" with strings; ASCII is simply an encoding that deals only with code-points up to 128. If you have a string containing an 'é', for example, it has no ASCII representation, and so most definitely cannot be represented using a series of ASCII bytes, even though it can exist peacefully in a .NET string object.
See this article by Joel Spolsky for further reading.

You can use these functions for converting to binary and restore it back :
public static string BinaryToString(string data)
{
List<Byte> byteList = new List<Byte>();
for (int i = 0; i < data.Length; i += 8)
{
byteList.Add(Convert.ToByte(data.Substring(i, 8), 2));
}
return Encoding.ASCII.GetString(byteList.ToArray());
}
and for converting string to binary :
public static string StringToBinary(string data)
{
StringBuilder sb = new StringBuilder();
foreach (char c in data.ToCharArray())
{
sb.Append(Convert.ToString(c, 2).PadLeft(8, '0'));
}
return sb.ToString();
}
Hope Helps You.

First convert the string into bytes, as described in my comment and in Cameron's answer; then iterate, convert each byte into an 8-digit binary number (possibly with Convert.ToString, padding appropriately), then concatenate. For the reverse direction, split by 8 characters, run through Convert.ToInt16, build up a byte array, then convert back to a string with GetString.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

UTF16 string to normal text - c#

I have one column in DB containing UTF16 string and I want to convert the UTF16 string into normal text. How to achieve this in c# ? For example : Source : 0645 0631 062D 0628 0627 0020 0627 0644 0639 0627 0644 0645 Convert : مرحبا العالم

Related

C# utf string conversion, characters which don't display correctly get converted to "unknown character" - how to prevent this?

Replace() working with hex value

convert hex to byte without conversion

C# ByteString to ASCII String

How can i convert a string of characters into binary string and back again?

Categories

Resources