How should I decode a UTF-8 string

How should I decode a UTF-8 string - c#

I have a string like:
About \xee\x80\x80John F Kennedy\xee\x80\x81\xe2\x80\x99s Assassination . unsolved mystery \xe2\x80\x93 45 years later. Over the last decade, a lot of individuals have speculated on conspiracy theories that ...
I understand that \xe2\x80\x93 is a dash character. But how should I decode the above string in C#?

If you have a string like that, then you have used the wrong encoding when you decoded it in the first place. There is no "UTF-8 string", the UTF-8 data is whent the text is encoded into binary data (bytes). When it's decoded into a string, then it's not UTF-8 any more.
You should use the UTF-8 encoding when you create the string from binary data, once the string is created using the wrong encoding, you can't reliably fix it.
If there is no other alternative, you could try to fix the string by encoding it again using the same wrong encoding that was used to create it, and then decode it using the corrent encoding. There is however no guarantee that this will work for all strings, some characters will simply be lost during the wrong decoding. Example:
// wrong use of encoding, to try to fix wrong decoding
str = Encoding.UTF8.GetString(Encoding.Default.GetBytes(str));

Scan the input string char-by-char and convert values starting with \x (string to byte[] and back to string using UTF8 decoder), leaving all other characters unchanged:
static string Decode(string input)
{
var sb = new StringBuilder();
int position = 0;
var bytes = new List<byte>();
while(position < input.Length)
{
char c = input[position++];
if(c == '\\')
{
if(position < input.Length)
{
c = input[position++];
if(c == 'x' && position <= input.Length - 2)
{
var b = Convert.ToByte(input.Substring(position, 2), 16);
position += 2;
bytes.Add(b);
}
else
{
AppendBytes(sb, bytes);
sb.Append('\\');
sb.Append(c);
}
continue;
}
}
AppendBytes(sb, bytes);
sb.Append(c);
}
AppendBytes(sb, bytes);
return sb.ToString();
}
private static void AppendBytes(StringBuilder sb, List<byte> bytes)
{
if(bytes.Count != 0)
{
var str = System.Text.Encoding.UTF8.GetString(bytes.ToArray());
sb.Append(str);
bytes.Clear();
}
}
Output:
About John F Kennedy’s Assassination . unsolved mystery – 45 years later. Over the last decade, a lot of individuals have speculated on conspiracy theories that ...

Finally I've used something like this:
public static string UnescapeHex(string data)
{
return Encoding.UTF8.GetString(Array.ConvertAll(Regex.Unescape(data).ToCharArray(), c => (byte) c));
}

Related

Encoding of the actual header value for HttpRequestMessage [duplicate]

Is there a class similar to HttpUtility to encode the content of a custom header? Ideally I would like to keep the content readable.

You can use the HttpEncoder.HeaderNameValueEncode Method in the .NET Framework 4.0 and above.
For previous versions of the .NET Framework, you can roll your own encoder, using the logic noted on the HttpEncoder.HeaderNameValueEncode reference page:
All characters whose Unicode value is less than ASCII character 32,
except ASCII character 9, are URL-encoded into a format of %NN where
the N characters represent hexadecimal values.
ASCII character 9 (the horizontal tab character) is not URL-encoded.
ASCII character 127 is encoded as %7F.
All other characters are not encoded.
Update:
As OliverBock point out the HttpEncoder.HeaderNameValueEncode Method is protected and internal. I went to open source Mono project and found the mono's implementation
void HeaderNameValueEncode (string headerName, string headerValue, out string encodedHeaderName, out string encodedHeaderValue)
{
if (String.IsNullOrEmpty (headerName))
encodedHeaderName = headerName;
else
encodedHeaderName = EncodeHeaderString (headerName);
if (String.IsNullOrEmpty (headerValue))
encodedHeaderValue = headerValue;
else
encodedHeaderValue = EncodeHeaderString (headerValue);
}
static void StringBuilderAppend (string s, ref StringBuilder sb)
{
if (sb == null)
sb = new StringBuilder (s);
else
sb.Append (s);
}
static string EncodeHeaderString (string input)
{
StringBuilder sb = null;
for (int i = 0; i < input.Length; i++) {
char ch = input [i];
if ((ch < 32 && ch != 9) || ch == 127)
StringBuilderAppend (String.Format ("%{0:x2}", (int)ch), ref sb);
}
if (sb != null)
return sb.ToString ();
return input;
}
Just FYI
[here ] (https://github.com/mono/mono/blob/master/mcs/class/System.Web/System.Web.Util/HttpEncoder.cs)

For me helped Uri.EscapeDataString(headervalue)

This does the same job as HeaderNameValueEncode(), but will also encode % characters so the header can be reliably decoded later.
static string EncodeHeaderValue(string value)
{
return Regex.Replace(value, #"[\u0000-\u0008\u000a-\u001f%\u007f]", (m) => "%"+((int)m.Value[0]).ToString("x2"));
}
static string DecodeHeaderValue(string encoded)
{
return Regex.Replace(encoded, #"%([0-9a-f]{2})", (m) => new String((char)Convert.ToInt32(m.Groups[1].Value, 16), 1), RegexOptions.IgnoreCase);
}

Mayeb This one ?
UrlEncode Function

sorry its off the top of my head but for your request object there should be a headers object you can add to.
i.e. request.headers.add("blah");
Thats not spot on but it should point you in the right direction.

string values to byte array without converting

I'm trying to put the values of a string into a byte array with out changing the characters. This is because the string is in fact a byte representation of the data.
The goal is to move the input string into a byte array and then convert the byte array using:
string result = System.Text.Encoding.UTF8.GetString(data);
I hope someone can help me although I know it´s not a very good description.
EDIT:
And maybe I should explain that what I´m working on is a simple windows form with a textbox where users can copy the encoded data into it and then click preview to see the decoded data.
EDIT:
A little more code:
(inputText is a textbox)
private void button1_Click(object sender, EventArgs e)
{
string inputString = this.inputText.Text;
byte[] input = new byte[inputString.Length];
for (int i = 0; i < inputString.Length; i++)
{
input[i] = inputString[i];
}
string output = base64Decode(input);
this.inputText.Text = "";
this.inputText.Text = output;
}
This is a part of a windows form and it includes a rich text box. This code doesn´t work because it won´t let me convert type char to byte.
But if I change the line to :
private void button1_Click(object sender, EventArgs e)
{
string inputString = this.inputText.Text;
byte[] input = new byte[inputString.Length];
for (int i = 0; i < inputString.Length; i++)
{
input[i] = (byte)inputString[i];
}
string output = base64Decode(input);
this.inputText.Text = "";
this.inputText.Text = output;
}
It encodes the value and I don´t want that. I hope this explains a little bit better what I´m trying to do.
EDIT: The base64Decode function:
public string base64Decode(byte[] data)
{
try
{
string result = System.Text.Encoding.UTF8.GetString(data);
return result;
}
catch (Exception e)
{
throw new Exception("Error in base64Decode" + e.Message);
}
}
The string is not encoded using base64 just to be clear. This is just bad naming on my behalf.
Note this is just one line of input.
I've got it. The problem was I was always trying to decode the wrong format. I feel very stupid because when I posted the example input I saw this had to be hex and it was so from then on it was easy. I used this site for reference:
http://msdn.microsoft.com/en-us/library/bb311038.aspx
My code:
public string[] getHexValues(string s)
{
int j = 0;
string[] hex = new String[s.Length/2];
for (int i = 0; i < s.Length-2; i += 2)
{
string temp = s.Substring(i, 2);
this.inputText.Text = temp;
if (temp.Equals("0x")) ;
else
{
hex[j] = temp;
j++;
}
}
return hex;
}
public string convertFromHex(string[] hex)
{
string result = null;
for (int i = 0; i < hex.Length; i++)
{
int value = Convert.ToInt32(hex[i], 16);
result += Char.ConvertFromUtf32(value);
}
return result;
}
I feel quite dumb right now but thanks to everyone who helped, especially #Jon Skeet.

Are you saying you have something like this:
string s = "48656c6c6f2c20776f726c6421";
and you want these values as a byte array? Then:
public IEnumerable<byte> GetBytesFromByteString(string s) {
for (int index = 0; index < s.Length; index += 2) {
yield return Convert.ToByte(s.Substring(index, 2), 16);
}
}
Usage:
string s = "48656c6c6f2c20776f726c6421";
var bytes = GetBytesFromByteString(s).ToArray();
Note that the output of
Console.WriteLine(System.Text.ASCIIEncoding.ASCII.GetString(bytes));
is
Hello, world!
You obviously need to make the above method a lot safer.

Encoding has the reverse method:
byte[] data = System.Text.Encoding.UTF8.GetBytes(originalString);
string result = System.Text.Encoding.UTF8.GetString(data);
Debug.Assert(result == originalString);
But what you mean 'without converting' is unclear.

One way to do it would be to write:
string s = new string(bytes.Select(x => (char)c).ToArray());
That will give you a string that has one character for every single byte in the array.
Another way is to use an 8-bit character encoding. For example:
var MyEncoding = Encoding.GetEncoding("windows-1252");
string s = MyEncoding.GetString(bytes);
I'm think that Windows-1252 defines all 256 characters, although I'm not certain. If it doesn't, you're going to end up with converted characters. You should be able to find an 8-bit encoding that will do this without any conversion. But you're probably better off using the byte-to-character loop above.

If anyone still needs it this worked for me:
byte[] result = Convert.FromBase64String(str);

Have you tried:
string s = "....";
System.Text.UTF8Encoding.UTF8.GetBytes(s);

Convert a Unicode string to an escaped ASCII string

How can I convert this string:
This string contains the Unicode character Pi(π)
into an escaped ASCII string:
This string contains the Unicode character Pi(\u03a0)
and vice versa?
The current Encoding available in C# converts the π character to "?". I need to preserve that character.

This goes back and forth to and from the \uXXXX format.
class Program {
static void Main( string[] args ) {
string unicodeString = "This function contains a unicode character pi (\u03a0)";
Console.WriteLine( unicodeString );
string encoded = EncodeNonAsciiCharacters(unicodeString);
Console.WriteLine( encoded );
string decoded = DecodeEncodedNonAsciiCharacters( encoded );
Console.WriteLine( decoded );
}
static string EncodeNonAsciiCharacters( string value ) {
StringBuilder sb = new StringBuilder();
foreach( char c in value ) {
if( c > 127 ) {
// This character is too big for ASCII
string encodedValue = "\\u" + ((int) c).ToString( "x4" );
sb.Append( encodedValue );
}
else {
sb.Append( c );
}
}
return sb.ToString();
}
static string DecodeEncodedNonAsciiCharacters( string value ) {
return Regex.Replace(
value,
#"\\u(?<Value>[a-zA-Z0-9]{4})",
m => {
return ((char) int.Parse( m.Groups["Value"].Value, NumberStyles.HexNumber )).ToString();
} );
}
}
Outputs:
This function contains a unicode character pi (π)
This function contains a unicode character pi (\u03a0)
This function contains a unicode character pi (π)

For Unescape You can simply use this functions:
System.Text.RegularExpressions.Regex.Unescape(string)
System.Uri.UnescapeDataString(string)
I suggest using this method (It works better with UTF-8):
UnescapeDataString(string)

string StringFold(string input, Func<char, string> proc)
{
return string.Concat(input.Select(proc).ToArray());
}
string FoldProc(char input)
{
if (input >= 128)
{
return string.Format(#"\u{0:x4}", (int)input);
}
return input.ToString();
}
string EscapeToAscii(string input)
{
return StringFold(input, FoldProc);
}

As a one-liner:
var result = Regex.Replace(input, #"[^\x00-\x7F]", c =>
string.Format(#"\u{0:x4}", (int)c.Value[0]));

class Program
{
static void Main(string[] args)
{
char[] originalString = "This string contains the unicode character Pi(π)".ToCharArray();
StringBuilder asAscii = new StringBuilder(); // store final ascii string and Unicode points
foreach (char c in originalString)
{
// test if char is ascii, otherwise convert to Unicode Code Point
int cint = Convert.ToInt32(c);
if (cint <= 127 && cint >= 0)
asAscii.Append(c);
else
asAscii.Append(String.Format("\\u{0:x4} ", cint).Trim());
}
Console.WriteLine("Final string: {0}", asAscii);
Console.ReadKey();
}
}
All non-ASCII chars are converted to their Unicode Code Point representation and appended to the final string.

Here is my current implementation:
public static class UnicodeStringExtensions
{
public static string EncodeNonAsciiCharacters(this string value) {
var bytes = Encoding.Unicode.GetBytes(value);
var sb = StringBuilderCache.Acquire(value.Length);
bool encodedsomething = false;
for (int i = 0; i < bytes.Length; i += 2) {
var c = BitConverter.ToUInt16(bytes, i);
if ((c >= 0x20 && c <= 0x7f) || c == 0x0A || c == 0x0D) {
sb.Append((char) c);
} else {
sb.Append($"\\u{c:x4}");
encodedsomething = true;
}
}
if (!encodedsomething) {
StringBuilderCache.Release(sb);
return value;
}
return StringBuilderCache.GetStringAndRelease(sb);
}
public static string DecodeEncodedNonAsciiCharacters(this string value)
=> Regex.Replace(value,/*language=regexp*/#"(?:\\u[a-fA-F0-9]{4})+", Decode);
static readonly string[] Splitsequence = new [] { "\\u" };
private static string Decode(Match m) {
var bytes = m.Value.Split(Splitsequence, StringSplitOptions.RemoveEmptyEntries)
.Select(s => ushort.Parse(s, NumberStyles.HexNumber)).SelectMany(BitConverter.GetBytes).ToArray();
return Encoding.Unicode.GetString(bytes);
}
}
This passes a test:
public void TestBigUnicode() {
var s = "\U00020000";
var encoded = s.EncodeNonAsciiCharacters();
var decoded = encoded.DecodeEncodedNonAsciiCharacters();
Assert.Equals(s, decoded);
}
with the encoded value: "\ud840\udc00"
This implementation makes use of a StringBuilderCache (reference source link)

A small patch to #Adam Sills's answer which solves FormatException on cases where the input string like "c:\u00ab\otherdirectory\" plus RegexOptions.Compiled makes the Regex compilation much faster:
private static Regex DECODING_REGEX = new Regex(#"\\u(?<Value>[a-fA-F0-9]{4})", RegexOptions.Compiled);
private const string PLACEHOLDER = #"#!#";
public static string DecodeEncodedNonAsciiCharacters(this string value)
{
return DECODING_REGEX.Replace(
value.Replace(#"\\", PLACEHOLDER),
m => {
return ((char)int.Parse(m.Groups["Value"].Value, NumberStyles.HexNumber)).ToString(); })
.Replace(PLACEHOLDER, #"\\");
}

To store actual Unicode codepoints, you have to first decode the String's UTF-16 codeunits to UTF-32 codeunits (which are currently the same as the Unicode codepoints). Use System.Text.Encoding.UTF32.GetBytes() for that, and then write the resulting bytes to the StringBuilder as needed,i.e.
static void Main(string[] args)
{
String originalString = "This string contains the unicode character Pi(π)";
Byte[] bytes = Encoding.UTF32.GetBytes(originalString);
StringBuilder asAscii = new StringBuilder();
for (int idx = 0; idx < bytes.Length; idx += 4)
{
uint codepoint = BitConverter.ToUInt32(bytes, idx);
if (codepoint <= 127)
asAscii.Append(Convert.ToChar(codepoint));
else
asAscii.AppendFormat("\\u{0:x4}", codepoint);
}
Console.WriteLine("Final string: {0}", asAscii);
Console.ReadKey();
}

You need to use the Convert() method in the Encoding class:
Create an Encoding object that represents ASCII encoding
Create an Encoding object that represents Unicode encoding
Call Encoding.Convert() with the source encoding, the destination encoding, and the string to be encoded
There is an example here:
using System;
using System.Text;
namespace ConvertExample
{
class ConvertExampleClass
{
static void Main()
{
string unicodeString = "This string contains the unicode character Pi(\u03a0)";
// Create two different encodings.
Encoding ascii = Encoding.ASCII;
Encoding unicode = Encoding.Unicode;
// Convert the string into a byte[].
byte[] unicodeBytes = unicode.GetBytes(unicodeString);
// Perform the conversion from one encoding to the other.
byte[] asciiBytes = Encoding.Convert(unicode, ascii, unicodeBytes);
// Convert the new byte[] into a char[] and then into a string.
// This is a slightly different approach to converting to illustrate
// the use of GetCharCount/GetChars.
char[] asciiChars = new char[ascii.GetCharCount(asciiBytes, 0, asciiBytes.Length)];
ascii.GetChars(asciiBytes, 0, asciiBytes.Length, asciiChars, 0);
string asciiString = new string(asciiChars);
// Display the strings created before and after the conversion.
Console.WriteLine("Original string: {0}", unicodeString);
Console.WriteLine("Ascii converted string: {0}", asciiString);
}
}
}

Calculating info_hash from UrlEncoded query string

Basically, I'm building a small tracker for experimental purposes. I've gotten quite far, and am now working on the announce part.
What I really can't figure out is how I should decode the info_hash query string provided.
From the specification, it is a urlencoded 20-byte SHA1 hash, which made me write this code,
byte[] foo = Encoding.Default.GetBytes(HttpUtility.UrlDecode(infoHash));
string temp = "";
foreach (byte b in foo)
{
temp += b.ToString("X");
}
Which gives 'temp' the following value,
5D3F3F3F3F5E3F3F3F153FE4033683F55693468
The first and last few characters are correct. This is the raw info_hash,
%5d%96%b6%f6%84%5e%ea%da%c5%15%c4%0e%403h%b9Ui4h
And this is what both uTorrent and my own tracker gives me as info_hash when generating it from the torrent file,
5D96B6F6845EEADAC515C40E403368B955693468
What am I doing wrong?

UrlDecode returns a string, but a SHA1 hash doesn't make sense if interpreted as (ANSI) string.
You need to decode the input string directly to an byte array, without the roundtrip to a string.
var s = "%5d%96%b6%f6%84%5e%ea%da%c5%15%c4%0e%403h%b9Ui4h";
var ms = new MemoryStream();
for (var i = 0; i < s.Length; i++)
{
if (s[i] == '%')
{
ms.WriteByte(
byte.Parse(s.Substring(i + 1, 2), NumberStyles.AllowHexSpecifier));
i += 2;
}
else if (s[i] < 128)
{
ms.WriteByte((byte)s[i]);
}
}
byte[] infoHash = ms.ToArray();
string temp = BitConverter.ToString(infoHash);
// "5D-96-B6-F6-84-5E-EA-DA-C5-15-C4-0E-40-33-68-B9-55-69-34-68"

HttpUtility.UrlDecodeToBytes

What's the most efficient way of implementing ReadLine() on a binary stream?

Please feel free to correct me if I am wrong at any point...
I am trying to read a CSV (comma separated values) file using .NET file I/O classes. Now the problem is, this CSV file may contain some fields with soft carriage returns (i.e. solitary \r or \n markers rather than the standard \r\n used in text files to end a line) within some fields and the standard text mode I/O class StreamReader does not respect the standard convention and treats the soft carriage returns as hard carriage returns thus compromising the integrity of the CSV file.
Now using the BinaryReader class seems to be the only option left but the BinaryReader does not have a ReadLine() function hence the need to implement a ReadLine() on my own.
My current approach reads one character from the stream at a time and fills a StringBuilder until a \r\n is obtained (ignoring all other characters including solitary \r or \n) and then returns a string representation of the StringBuilder (using ToString()).
But I wonder: is this is the most efficient way of implementing the ReadLine() function? Please enlighten me.

It probably is. In terms of order, it goes through each char once only, so it would be O(n) (where n is the length of the stream) so that's not a problem. To read a single character a BinaryReader is your best bet.
What I would do is make a class
public class LineReader : IDisposable
{
private Stream stream;
private BinaryReader reader;
public LineReader(Stream stream) { reader = new BinaryReader(stream); }
public string ReadLine()
{
StringBuilder result = new StringBuilder();
char lastChar = reader.ReadChar();
// an EndOfStreamException here would propogate to the caller
try
{
char newChar = reader.ReadChar();
if (lastChar == '\r' && newChar == '\n')
return result.ToString();
result.Append(lastChar);
lastChar = newChar;
}
catch (EndOfStreamException)
{
result.Append(lastChar);
return result.ToString();
}
}
public void Dispose()
{
reader.Close();
}
}
Or something like that.
(WARNING: the code has not been tested and is provided AS IS without warranty of any kind, expressed or implied. Should this program prove defective or destroy the planet, you assume the cost of all necessary servicing, repair or correction.)

You might want to look at using an ODBC/OleDB connection to do this. If you point the data source of an oledb connection to a directory containing csv files, you can then query it as if each CSV was a table.
check http://www.connectionstrings.com/?carrier=textfile>connectionstrings.com for the correct connection string

Here an extension method for BinaryReader class :
using System.IO;
using System.Text;
public static class BinaryReaderExtension
{
public static string ReadLine(this BinaryReader reader)
{
if (reader.IsEndOfStream())
return null;
StringBuilder result = new StringBuilder();
char character;
while(!reader.IsEndOfStream() && (character = reader.ReadChar()) != '\n')
if (character != '\r' && character != '\n')
result.Append(character);
return result.ToString();
}
public static bool IsEndOfStream(this BinaryReader reader)
{
return reader.BaseStream.Position == reader.BaseStream.Length;
}
}
I didn't test in all conditions but this code worked for me.

How about simply preprocessing the file?
Replace the soft carriage returns with something unique.
For the record, CSV files with linefeeds in the data, that's bad design.

You could read a bigger chunk at a time, unencode it to a string using Encoder.GetString and then split into lines using string.Split("\r\n") or even picking out the head of the string using string.Substring(0,string.IndexOf("\r\n")) and leaving the rest for processing of the next line. Remember to add the next read operation to your last line from the previous read.

Your approach sounds fine. One way to improve the efficiency of your method might be to store each line as you're building it in a regular string (i.e. not a StringBuilder), and then append the entire-line-string to your StringBuilder. See this article for a further explanation - StringBuilder is not automatically the best choice here.
It probably will matter little, though.

Here's a faster alternative with encoding support. It extends BinaryReader, so you can use it to do both, read binary chunks and also perform StreamReader like ReadLine directly on a binary stream.
public class LineReader : BinaryReader
{
private Encoding _encoding;
private Decoder _decoder;
const int bufferSize = 1024;
private char[] _LineBuffer = new char[bufferSize];
public LineReader(Stream stream, int bufferSize, Encoding encoding)
: base(stream, encoding)
{
this._encoding = encoding;
this._decoder = encoding.GetDecoder();
}
public string ReadLine()
{
int pos = 0;
char[] buf = new char[2];
StringBuilder stringBuffer = null;
bool lineEndFound = false;
while(base.Read(buf, 0, 2) > 0)
{
if (buf[1] == '\r')
{
// grab buf[0]
this._LineBuffer[pos++] = buf[0];
// get the '\n'
char ch = base.ReadChar();
Debug.Assert(ch == '\n');
lineEndFound = true;
}
else if (buf[0] == '\r')
{
lineEndFound = true;
}
else
{
this._LineBuffer[pos] = buf[0];
this._LineBuffer[pos+1] = buf[1];
pos += 2;
if (pos >= bufferSize)
{
stringBuffer = new StringBuilder(bufferSize + 80);
stringBuffer.Append(this._LineBuffer, 0, bufferSize);
pos = 0;
}
}
if (lineEndFound)
{
if (stringBuffer == null)
{
if (pos > 0)
return new string(this._LineBuffer, 0, pos);
else
return string.Empty;
}
else
{
if (pos > 0)
stringBuffer.Append(this._LineBuffer, 0, pos);
return stringBuffer.ToString();
}
}
}
if (stringBuffer != null)
{
if (pos > 0)
stringBuffer.Append(this._LineBuffer, 0, pos);
return stringBuffer.ToString();
}
else
{
if (pos > 0)
return new string(this._LineBuffer, 0, pos);
else
return null;
}
}
}

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

How should I decode a UTF-8 string - c#

Finally I've used something like this: public static string UnescapeHex(string data) { return Encoding.UTF8.GetString(Array.ConvertAll(Regex.Unescape(data).ToCharArray(), c => (byte) c)); }

Related

Encoding of the actual header value for HttpRequestMessage [duplicate]

string values to byte array without converting

Convert a Unicode string to an escaped ASCII string

Calculating info_hash from UrlEncoded query string

What's the most efficient way of implementing ReadLine() on a binary stream?

Categories

Resources