ASCIIEncoding In Windows Phone 7 - c#

Is there a way to use ASCIIEncoding in Windows Phone 7?
Unless I'm doing something wrong, Encoding.ASCII doesn't exist, and I need it for C# -> PHP encryption (as PHP only uses ASCII in SHA1 encryption).
Any suggestions?

It is easy to implement yourself, Unicode never messed with the ASCII codes:
public static byte[] StringToAscii(string s) {
    byte[] retval = new byte[s.Length];
    for (int ix = 0; ix < s.Length; ++ix) {
        char ch = s[ix];
        if (ch <= 0x7f) retval[ix] = (byte)ch;
        else retval[ix] = (byte)'?';
    }
    return retval;
}

There isn't much detail in your question, so this could be off track, but you are right that Silverlight has no support for the ASCII encoding.
However, I suspect that UTF-8 will in fact do what you need. It's worth bearing in mind that a sequence of single-byte ASCII-only characters and the same set of characters encoded as UTF-8 are identical: the complete ASCII character set is repeated verbatim by the first 128 single-byte code points of UTF-8.
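As a quick illustration of that point (a minimal sketch; the string literal is just an arbitrary example):

string s = "Plain ASCII text";
byte[] utf8Bytes = System.Text.Encoding.UTF8.GetBytes(s);
// For input restricted to code points 0x00-0x7F, these bytes are identical,
// byte for byte, to what Encoding.ASCII.GetBytes(s) would produce on the
// full .NET Framework.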

I have a Silverlight app that writes CSV files, which have to be encoded in ASCII (using UTF-8 causes accented characters to show up wrong when you open the files in Excel).
Since Silverlight doesn't have an Encoding.ASCII class, I implemented one as follows. It works for me, hope it's useful to you as well:
/// <summary>
/// Silverlight doesn't have an ASCII encoder, so here is one:
/// </summary>
public class AsciiEncoding : System.Text.Encoding
{
    public override int GetMaxByteCount(int charCount)
    {
        return charCount;
    }

    public override int GetMaxCharCount(int byteCount)
    {
        return byteCount;
    }

    public override int GetByteCount(char[] chars, int index, int count)
    {
        return count;
    }

    public override byte[] GetBytes(char[] chars)
    {
        return base.GetBytes(chars);
    }

    public override int GetCharCount(byte[] bytes)
    {
        return bytes.Length;
    }

    public override int GetBytes(char[] chars, int charIndex, int charCount, byte[] bytes, int byteIndex)
    {
        // Note: the plain cast keeps only the low 8 bits, so characters above
        // 0xFF are truncated rather than replaced with '?' the way
        // Encoding.ASCII replaces everything above 0x7F.
        for (int i = 0; i < charCount; i++)
        {
            bytes[byteIndex + i] = (byte)chars[charIndex + i];
        }
        return charCount;
    }

    public override int GetCharCount(byte[] bytes, int index, int count)
    {
        return count;
    }

    public override int GetChars(byte[] bytes, int byteIndex, int byteCount, char[] chars, int charIndex)
    {
        for (int i = 0; i < byteCount; i++)
        {
            chars[charIndex + i] = (char)bytes[byteIndex + i];
        }
        return byteCount;
    }
}

I had a similar problem using Xamarin (Mono) for Android, where I'm using a Portable Class Library, and PCLs don't support Encoding.ASCII.
Instead, the only working solution (apart from doing it manually) was this one:
Uri.EscapeDataString(yourString);
See this answer, which provides additional information.
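For completeness, a round trip looks like this (a small sketch; yourString is a placeholder):

string yourString = "a string with spaces & symbols";
string escaped = Uri.EscapeDataString(yourString);   // percent-encodes to ASCII-safe text
string original = Uri.UnescapeDataString(escaped);   // recovers the original string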

I started from Hans Passant's answer and rewrote it with LINQ:
/// <summary>
/// Gets an encoding for the ASCII (7-bit) character set.
/// </summary>
/// <see cref="http://stackoverflow.com/a/4022893/1248177"/>
/// <param name="s">The string to convert.</param>
/// <returns>The ASCII (7-bit) bytes for the given string.</returns>
public static byte[] StringToAscii(string s)
{
    return (from char c in s select (byte)((c <= 0x7f) ? c : '?')).ToArray();
}
You may want to remove the call to ToArray() and return an IEnumerable<byte> instead of byte[], as sketched below.
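That variant might look like this (a sketch of the suggestion above; the method name StringToAsciiLazy is made up, and the same System.Linq using applies):

public static IEnumerable<byte> StringToAsciiLazy(string s)
{
    // Deferred execution: bytes are produced only as the caller enumerates them.
    return from char c in s select (byte)((c <= 0x7f) ? c : '?');
}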

According to this MS forum thread, Windows Phone 7 does not support Encoding.ASCII.


Base-N encoding of a byte array

A couple of days ago I came across this CodeReview question about Base-36 encoding a byte array. However, the answers that followed didn't touch on decoding back into a byte array, or on reusing the answer to perform encodings in different bases (radixes).
The answer to the linked question uses BigInteger, so as far as implementation goes, the base and its digits can be parameterized.
The problem with BigInteger, though, is that we're treating our input as if it were an integer, when our input, a byte array, is really just an opaque series of values.
If the byte array ends in a series of zero bytes, e.g. {0xFF,0x7F,0x00,0x00}, those bytes are lost when using the algorithm in that answer (it would only encode {0xFF,0x7F}).
If the last non-zero byte has its sign bit set, then the zero byte that follows it is consumed, because it is treated as BigInteger's sign byte. So {0xFF,0xFF,0x00,0x00} would encode only as {0xFF,0xFF,0x00}.
How could a .NET programmer use BigInteger to create a reasonably efficient and radix-agnostic encoder, with decoding support, the ability to handle endianness, and the ability to work around the trailing zero bytes being lost?
edit [2020/01/26]: FWIW, the code below, along with its unit tests, lives alongside my open source libraries on GitHub.
edit [2016/04/19]: If you're fond of exceptions, you may wish to change some of the Decode implementation code to throw InvalidDataException instead of just returning null.
edit [2014/09/14]: I've added a 'HACK' to Encode() to handle cases where the last byte in the input is signed (i.e. would be negative if converted to sbyte). The only sane solution I could think of right now is to just Resize() the array by one. Additional unit tests for this case pass, but I didn't rerun the perf code to account for such cases. If you can help it, always have your input to Encode() include a dummy 0 byte at the end, to avoid the additional allocation.
Usage
I've created a RadixEncoding class (found in the "Code" section) which initializes with three parameters:
The radix digits as a string (length determines the actual radix of course),
The assumed byte ordering (endian) of input byte arrays,
And whether or not the user wants the encode/decode logic to acknowledge ending zero bytes.
To create a Base-36 encoding, with little-endian input, and with respect given to ending zero bytes:
const string k_base36_digits = "0123456789abcdefghijklmnopqrstuvwxyz";
var base36_no_zeros = new RadixEncoding(k_base36_digits, EndianFormat.Little, false);
And then to actually perform encoding/decoding:
const string k_input = "A test 1234";
byte[] input_bytes = System.Text.Encoding.UTF8.GetBytes(k_input);
string encoded_string = base36_no_zeros.Encode(input_bytes);
byte[] decoded_bytes = base36_no_zeros.Decode(encoded_string);
Performance
Timed with Diagnostics.Stopwatch, run on an i7 860 @ 2.80GHz. The timing EXE was run by itself, not under a debugger.
The encoding was initialized with the same k_base36_digits string from above, EndianFormat.Little, and with ending zero bytes acknowledged (even though the UTF-8 bytes don't have any extra ending zero bytes):
To encode the UTF-8 bytes of "A test 1234" 1,000,000 times takes 2.6567905 secs
To decode the same string the same number of times takes 3.3916248 secs
To encode the UTF-8 bytes of "A test 1234. Made slightly larger!" 100,000 times takes 1.1577325 secs
To decode the same string the same number of times takes 1.244326 secs
Code
If you don't have a CodeContracts generator, you will have to reimplement the contracts with if/throw code, as sketched below.
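For example, the Contract.Requires<ArgumentNullException> guard in the constructor below would become a plain check (a minimal sketch of the if/throw equivalent):

// Equivalent to: Contract.Requires<ArgumentNullException>(digits != null);
if (digits == null)
    throw new ArgumentNullException("digits");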
using System;
using System.Collections.Generic;
using System.Numerics;
using Contract = System.Diagnostics.Contracts.Contract;

public enum EndianFormat
{
    /// <summary>Least Significant Bit order (lsb)</summary>
    /// <remarks>Right-to-Left</remarks>
    /// <see cref="BitConverter.IsLittleEndian"/>
    Little,
    /// <summary>Most Significant Bit order (msb)</summary>
    /// <remarks>Left-to-Right</remarks>
    Big,
};

/// <summary>Encodes/decodes bytes to/from a string</summary>
/// <remarks>
/// Encoded string is always in big-endian ordering
///
/// <p>The constructor takes an <b>includeProceedingZeros</b> parameter which acts as a work-around
/// for an edge case with our BigInteger implementation.
/// MSDN says BigInteger byte arrays are in LSB->MSB ordering. So a byte buffer with zeros at the
/// end will have those zeros ignored in the resulting encoded radix string.
/// If such a loss in precision absolutely cannot occur, pass true for <b>includeProceedingZeros</b>
/// and for a tiny bit of extra processing it will handle the padding of zero digits (encoding)
/// or bytes (decoding).</p>
/// <p>Note: doing this for decoding <b>may</b> add an extra byte more than what was originally
/// given to Encode.</p>
/// </remarks>
// Based on the answers from http://codereview.stackexchange.com/questions/14084/base-36-encoding-of-a-byte-array/
public class RadixEncoding
{
    const int kByteBitCount = 8;

    readonly string kDigits;
    readonly double kBitsPerDigit;
    readonly BigInteger kRadixBig;
    readonly EndianFormat kEndian;
    readonly bool kIncludeProceedingZeros;

    /// <summary>Numerical base of this encoding</summary>
    public int Radix { get { return kDigits.Length; } }
    /// <summary>Endian ordering of bytes input to Encode and output by Decode</summary>
    public EndianFormat Endian { get { return kEndian; } }
    /// <summary>True if we want ending zero bytes to be encoded</summary>
    public bool IncludeProceedingZeros { get { return kIncludeProceedingZeros; } }

    public override string ToString()
    {
        return string.Format("Base-{0} {1}", Radix.ToString(), kDigits);
    }

    /// <summary>Create a radix encoder using the given characters as the digits in the radix</summary>
    /// <param name="digits">Digits to use for the radix-encoded string</param>
    /// <param name="bytesEndian">Endian ordering of bytes input to Encode and output by Decode</param>
    /// <param name="includeProceedingZeros">True if we want ending zero bytes to be encoded</param>
    public RadixEncoding(string digits,
        EndianFormat bytesEndian = EndianFormat.Little, bool includeProceedingZeros = false)
    {
        Contract.Requires<ArgumentNullException>(digits != null);
        int radix = digits.Length;

        kDigits = digits;
        kBitsPerDigit = System.Math.Log(radix, 2);
        kRadixBig = new BigInteger(radix);
        kEndian = bytesEndian;
        kIncludeProceedingZeros = includeProceedingZeros;
    }

    // Number of characters needed for encoding the specified number of bytes
    int EncodingCharsCount(int bytesLength)
    {
        return (int)Math.Ceiling((bytesLength * kByteBitCount) / kBitsPerDigit);
    }
    // Number of bytes needed to decode the specified number of characters
    int DecodingBytesCount(int charsCount)
    {
        return (int)Math.Ceiling((charsCount * kBitsPerDigit) / kByteBitCount);
    }

    /// <summary>Encode a byte array into a radix-encoded string</summary>
    /// <param name="bytes">byte array to encode</param>
    /// <returns>The bytes encoded into a radix-encoded string</returns>
    /// <remarks>If <paramref name="bytes"/> is zero length, returns an empty string</remarks>
    public string Encode(byte[] bytes)
    {
        Contract.Requires<ArgumentNullException>(bytes != null);
        Contract.Ensures(Contract.Result<string>() != null);

        // Don't really have to do this, our code will build this result (empty string),
        // but why not catch the condition before doing work?
        if (bytes.Length == 0) return string.Empty;

        // if the array ends with zeros, having the capacity set to this will help us know how much
        // 'padding' we will need to add
        int result_length = EncodingCharsCount(bytes.Length);
        // List<> has a(n in-place) Reverse method. StringBuilder doesn't. That's why.
        var result = new List<char>(result_length);

        // HACK: BigInteger uses the last byte as the 'sign' byte. If that byte's MSB is set,
        // we need to pad the input with an extra 0 (ie, make it positive)
        if ((bytes[bytes.Length - 1] & 0x80) == 0x80)
            Array.Resize(ref bytes, bytes.Length + 1);

        var dividend = new BigInteger(bytes);
        // IsZero's computation is less complex than evaluating "dividend > 0"
        // which invokes BigInteger.CompareTo(BigInteger)
        while (!dividend.IsZero)
        {
            BigInteger remainder;
            dividend = BigInteger.DivRem(dividend, kRadixBig, out remainder);
            int digit_index = System.Math.Abs((int)remainder);
            result.Add(kDigits[digit_index]);
        }

        if (kIncludeProceedingZeros)
            for (int x = result.Count; x < result.Capacity; x++)
                result.Add(kDigits[0]); // pad with the character that represents 'zero'

        // orientate the characters in big-endian ordering
        if (kEndian == EndianFormat.Little)
            result.Reverse();
        // If we didn't end up adding padding, ToArray will end up returning a TrimExcess'd array,
        // so nothing wasted
        return new string(result.ToArray());
    }

    void DecodeImplPadResult(ref byte[] result, int padCount)
    {
        if (padCount > 0)
        {
            int new_length = result.Length + DecodingBytesCount(padCount);
            Array.Resize(ref result, new_length); // new bytes will be zero, just the way we want it
        }
    }

    #region Decode (Little Endian)
    byte[] DecodeImpl(string chars, int startIndex = 0)
    {
        var bi = new BigInteger();
        for (int x = startIndex; x < chars.Length; x++)
        {
            int i = kDigits.IndexOf(chars[x]);
            if (i < 0) return null; // invalid character
            bi *= kRadixBig;
            bi += i;
        }

        return bi.ToByteArray();
    }

    byte[] DecodeImplWithPadding(string chars)
    {
        int pad_count = 0;
        for (int x = 0; x < chars.Length; x++, pad_count++)
            if (chars[x] != kDigits[0]) break;

        var result = DecodeImpl(chars, pad_count);
        DecodeImplPadResult(ref result, pad_count);

        return result;
    }
    #endregion

    #region Decode (Big Endian)
    byte[] DecodeImplReversed(string chars, int startIndex = 0)
    {
        var bi = new BigInteger();
        for (int x = (chars.Length - 1) - startIndex; x >= 0; x--)
        {
            int i = kDigits.IndexOf(chars[x]);
            if (i < 0) return null; // invalid character
            bi *= kRadixBig;
            bi += i;
        }

        return bi.ToByteArray();
    }

    byte[] DecodeImplReversedWithPadding(string chars)
    {
        int pad_count = 0;
        for (int x = chars.Length - 1; x >= 0; x--, pad_count++)
            if (chars[x] != kDigits[0]) break;

        var result = DecodeImplReversed(chars, pad_count);
        DecodeImplPadResult(ref result, pad_count);

        return result;
    }
    #endregion

    /// <summary>Decode a radix-encoded string into a byte array</summary>
    /// <param name="radixChars">radix string</param>
    /// <returns>The decoded bytes, or null if an invalid character is encountered</returns>
    /// <remarks>
    /// If <paramref name="radixChars"/> is an empty string, returns a zero length array
    ///
    /// Using <paramref name="IncludeProceedingZeros"/> has the potential to return a buffer with an
    /// additional zero byte that wasn't in the input. So if a 4 byte buffer was encoded, this could
    /// end up returning a 5 byte buffer, with the extra byte being null.
    /// </remarks>
    public byte[] Decode(string radixChars)
    {
        Contract.Requires<ArgumentNullException>(radixChars != null);

        if (kEndian == EndianFormat.Big)
            return kIncludeProceedingZeros ? DecodeImplReversedWithPadding(radixChars) : DecodeImplReversed(radixChars);
        else
            return kIncludeProceedingZeros ? DecodeImplWithPadding(radixChars) : DecodeImpl(radixChars);
    }
};
Basic Unit Tests
using System;
using Microsoft.VisualStudio.TestTools.UnitTesting;

static bool ArraysCompareN<T>(T[] input, T[] output)
    where T : IEquatable<T>
{
    if (output.Length < input.Length) return false;
    for (int x = 0; x < input.Length; x++)
        if (!output[x].Equals(input[x])) return false;

    return true;
}

static bool RadixEncodingTest(RadixEncoding encoding, byte[] bytes)
{
    string encoded = encoding.Encode(bytes);
    byte[] decoded = encoding.Decode(encoded);

    return ArraysCompareN(bytes, decoded);
}

[TestMethod]
public void TestRadixEncoding()
{
    const string k_base36_digits = "0123456789abcdefghijklmnopqrstuvwxyz";
    var base36 = new RadixEncoding(k_base36_digits, EndianFormat.Little, true);
    // false: this encoder ignores trailing zero bytes, matching its name
    var base36_no_zeros = new RadixEncoding(k_base36_digits, EndianFormat.Little, false);

    byte[] ends_with_zero_neg = { 0xFF, 0xFF, 0x00, 0x00 };
    byte[] ends_with_zero_pos = { 0xFF, 0x7F, 0x00, 0x00 };
    byte[] text = System.Text.Encoding.ASCII.GetBytes("A test 1234");

    Assert.IsTrue(RadixEncodingTest(base36, ends_with_zero_neg));
    Assert.IsTrue(RadixEncodingTest(base36, ends_with_zero_pos));
    Assert.IsTrue(RadixEncodingTest(base36_no_zeros, text));
}
Interestingly, I was able to port Kornman's techniques across to Java and got the expected output up to and including base 36. Whereas when running his code from C# using C:\Windows\Microsoft.NET\Framework\v4.0.30319 csc, the output was not as expected.
For example, trying to base16-encode the MD5 hashBytes obtained for the string "hello world" below using Kornman's RadixEncoding Encode, I could see that the groups of two bytes per character had their bytes in the wrong order.
Rather than 5eb63bbbe01eeed093cb22bb8f5acdc3
I saw something like e56bb3bb0ee1...
This was on Windows 7.
const string input = "hello world";

public static void Main(string[] args)
{
    using (System.Security.Cryptography.MD5 md5 = System.Security.Cryptography.MD5.Create())
    {
        byte[] inputBytes = System.Text.Encoding.ASCII.GetBytes(input);
        byte[] hashBytes = md5.ComputeHash(inputBytes);

        // Convert the byte array to a hexadecimal string
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < hashBytes.Length; i++)
        {
            sb.Append(hashBytes[i].ToString("X2"));
        }
        Console.WriteLine(sb.ToString());
    }
}
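The swapped pairs are consistent with the two BigInteger implementations disagreeing on byte order: .NET's System.Numerics.BigInteger interprets a byte array as little-endian, while Java's java.math.BigInteger interprets it as big-endian two's complement. A small sketch of the difference, and of reversing the array to reconcile the two:

byte[] bytes = { 0x01, 0x02 };
var littleEndian = new System.Numerics.BigInteger(bytes); // .NET reads this as 0x0201 == 513
// Java's new BigInteger(new byte[]{ 0x01, 0x02 }) would read 0x0102 == 258 instead.
Array.Reverse(bytes);
var bigEndian = new System.Numerics.BigInteger(bytes);    // now 0x0102 == 258, matching Java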
The Java code is below for anyone interested. As mentioned above, it only works up to base 36.
private static final char[] BASE16_CHARS = "0123456789abcdef".toCharArray();
private static final BigInteger BIGINT_16 = BigInteger.valueOf(16);

private static final char[] BASE36_CHARS = "0123456789abcdefghijklmnopqrstuvwxyz".toCharArray();
private static final BigInteger BIGINT_36 = BigInteger.valueOf(36);

public static String toBaseX(byte[] bytes, BigInteger base, char[] chars)
{
    if (bytes == null) {
        return null;
    }
    final int bitsPerByte = 8;
    double bitsPerDigit = Math.log(chars.length) / Math.log(2);

    // Number of chars to encode specified bytes
    int size = (int) Math.ceil((bytes.length * bitsPerByte) / bitsPerDigit);

    StringBuilder sb = new StringBuilder(size);
    for (BigInteger value = new BigInteger(bytes); !value.equals(BigInteger.ZERO);) {
        BigInteger[] quotientAndRemainder = value.divideAndRemainder(base);
        sb.insert(0, chars[Math.abs(quotientAndRemainder[1].intValue())]);
        value = quotientAndRemainder[0];
    }
    return sb.toString();
}

How to remove specific characters with Encoding.Convert

I'm sending signed XML via WebClient to a gateway. Now I have to ensure that the node values only contain German letters. I have two test words. The first gets converted very well by using:
string foreignString = "Łůj꣥ü";
Encoding utf8 = Encoding.UTF8;
Encoding iso = Encoding.GetEncoding("ISO-8859-1");
byte[] utfBytes = Encoding.Convert(iso, utf8, iso.GetBytes(foreignString));
string result = utf8.GetString(utfBytes);
But the second string contains a character which is also included in the UTF-8 encoding: the
ç (Latin small letter c with cedilla)
After testing a little with other encodings, I always got the same result: the character was always there. Which makes sense, because it is part of the UTF-8 table :)
So my question is: is there a way to mask out all the French, Portuguese and Spanish characters without dropping the German umlauts?
Thanks in advance!
You can create your own Encoding class based on the ISO-8859-1 encoding with your additional special rules:
class GermanEncoding : Encoding {
static readonly Encoding iso88791Encoding = Encoding.GetEncoding("ISO-8859-1");
static readonly Dictionary<Char, Char> charMappingTable = new Dictionary<Char, Char> {
{ 'À', 'A' },
{ 'Á', 'A' },
{ 'Â', 'A' },
{ 'ç', 'c' },
// Add more mappings
};
static readonly Dictionary<Byte, Byte> byteMappingTable = charMappingTable
.ToDictionary(kvp => MapCharToByte(kvp.Key), kvp => MapCharToByte(kvp.Value));
public override Int32 GetByteCount(Char[] chars, Int32 index, Int32 count) {
return iso88791Encoding.GetByteCount(chars, index, count);
}
public override Int32 GetBytes(Char[] chars, Int32 charIndex, Int32 charCount, Byte[] bytes, Int32 byteIndex) {
var count = iso88791Encoding.GetBytes(chars, charIndex, charCount, bytes, byteIndex);
for (var i = byteIndex; i < byteIndex + count; ++i)
if (byteMappingTable.ContainsKey(bytes[i]))
bytes[i] = byteMappingTable[bytes[i]];
return count;
}
public override Int32 GetCharCount(Byte[] bytes, Int32 index, Int32 count) {
return iso88791Encoding.GetCharCount(bytes, index, count);
}
public override Int32 GetChars(Byte[] bytes, Int32 byteIndex, Int32 byteCount, Char[] chars, Int32 charIndex) {
return iso88791Encoding.GetChars(bytes, byteIndex, byteCount, chars, charIndex);
}
public override Int32 GetMaxByteCount(Int32 charCount) {
return iso88791Encoding.GetMaxByteCount(charCount);
}
public override Int32 GetMaxCharCount(Int32 byteCount) {
return iso88791Encoding.GetMaxCharCount(byteCount);
}
static Byte MapCharToByte(Char c) {
// NOTE: Assumes that each character encodes as a single byte.
return iso88791Encoding.GetBytes(new[] { c })[0];
}
}
This encoding is based on the fact that you want to use the ISO-8859-1 encoding with some additional restrictions, where you want to map "non-German" characters to their ASCII equivalents. The built-in ISO-8859-1 encoding knows how to map Ł to L, and because ISO-8859-1 is a single-byte character set, you can do additional mapping on the bytes: each byte corresponds to exactly one character. This is done in the GetBytes method.
You can "clean" a string using this code:
var encoding = new GermanEncoding();
string foreignString = "Łůj꣥üç";
var bytes = encoding.GetBytes(foreignString);
var result = encoding.GetString(bytes);
The resulting string is LujeLAüc.
Note that the implementation is quite simplistic: it uses a dictionary to perform the additional mapping step on the bytes. This might not be efficient, in which case you can consider alternatives like a 256-entry byte mapping array (sketched below). Also, you need to expand the charMappingTable to contain all the additional mappings you want to perform.
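A minimal sketch of that array-based alternative (the names remapTable and BuildRemapTable are made up here; the field would live inside GermanEncoding, declared after charMappingTable, and replace the dictionary lookup in GetBytes):

// Identity-initialized 256-entry table; only bytes that need remapping change.
static readonly byte[] remapTable = BuildRemapTable();

static byte[] BuildRemapTable() {
    var table = new byte[256];
    for (int i = 0; i < 256; i++)
        table[i] = (byte)i;
    foreach (var kvp in charMappingTable)
        table[MapCharToByte(kvp.Key)] = MapCharToByte(kvp.Value);
    return table;
}

// Inside GetBytes, the per-byte step then becomes a plain array index:
//   bytes[i] = remapTable[bytes[i]];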

Best way to find position in the Stream where given byte sequence starts

What do you think is the best way to find the position in a System.Stream where a given byte sequence starts (first occurrence)?

public static long FindPosition(Stream stream, byte[] byteSequence)
{
    long position = -1;

    /// ???

    return position;
}

P.S. The simplest yet fastest solution is preferred. :)
I've reached this solution.
I did some benchmarks with an ASCII file that was 3,050 KB and had 38,803 lines.
With a search byte array of 22 bytes in the last line of the file, I got the result in about 2.28 seconds (on a slow/old machine).
public static long FindPosition(Stream stream, byte[] byteSequence)
{
    if (byteSequence.Length > stream.Length)
        return -1;

    byte[] buffer = new byte[byteSequence.Length];

    using (BufferedStream bufStream = new BufferedStream(stream, byteSequence.Length))
    {
        int i;
        while ((i = bufStream.Read(buffer, 0, byteSequence.Length)) == byteSequence.Length)
        {
            if (byteSequence.SequenceEqual(buffer))
                return bufStream.Position - byteSequence.Length;
            else
                bufStream.Position -= byteSequence.Length - PadLeftSequence(buffer, byteSequence);
        }
    }

    return -1;
}

private static int PadLeftSequence(byte[] bytes, byte[] seqBytes)
{
    int i = 1;
    while (i < bytes.Length)
    {
        int n = bytes.Length - i;
        byte[] aux1 = new byte[n];
        byte[] aux2 = new byte[n];
        Array.Copy(bytes, i, aux1, 0, n);
        Array.Copy(seqBytes, aux2, n);
        if (aux1.SequenceEqual(aux2))
            return i;
        i++;
    }
    return i;
}
If you treat the stream like another sequence of bytes, you can just search it the way you would do a string search. Wikipedia has a great article on that. Boyer-Moore is a good and simple algorithm for this.
Here's a quick hack I put together in Java. It works, and it's pretty close to, if not exactly, Boyer-Moore. Hope it helps ;)
public static final int BUFFER_SIZE = 32;

public static int[] buildShiftArray(byte[] byteSequence) {
    int[] shifts = new int[byteSequence.length];
    int[] ret;
    int shiftCount = 0;
    byte end = byteSequence[byteSequence.length - 1];
    int index = byteSequence.length - 1;
    int shift = 1;

    while (--index >= 0) {
        if (byteSequence[index] == end) {
            shifts[shiftCount++] = shift;
            shift = 1;
        } else {
            shift++;
        }
    }
    ret = new int[shiftCount];
    for (int i = 0; i < shiftCount; i++) {
        ret[i] = shifts[i];
    }
    return ret;
}

public static byte[] flushBuffer(byte[] buffer, int keepSize) {
    byte[] newBuffer = new byte[buffer.length];
    for (int i = 0; i < keepSize; i++) {
        newBuffer[i] = buffer[buffer.length - keepSize + i];
    }
    return newBuffer;
}

public static int findBytes(byte[] haystack, int haystackSize, byte[] needle, int[] shiftArray) {
    int index = needle.length;
    int searchIndex, needleIndex, currentShiftIndex = 0, shift;
    boolean shiftFlag = false;

    index = needle.length;
    while (true) {
        needleIndex = needle.length - 1;
        while (true) {
            if (index >= haystackSize)
                return -1;
            if (haystack[index] == needle[needleIndex])
                break;
            index++;
        }
        searchIndex = index;
        needleIndex = needle.length - 1;
        while (needleIndex >= 0 && haystack[searchIndex] == needle[needleIndex]) {
            searchIndex--;
            needleIndex--;
        }
        if (needleIndex < 0)
            return index - needle.length + 1;
        if (shiftFlag) {
            shiftFlag = false;
            index += shiftArray[0];
            currentShiftIndex = 1;
        } else if (currentShiftIndex >= shiftArray.length) {
            shiftFlag = true;
            index++;
        } else {
            index += shiftArray[currentShiftIndex++];
        }
    }
}

public static int findBytes(InputStream stream, byte[] needle) {
    byte[] buffer = new byte[BUFFER_SIZE];
    int[] shiftArray = buildShiftArray(needle);
    int bufferSize, initBufferSize;
    int offset = 0, init = needle.length;
    int val;

    try {
        while (true) {
            bufferSize = stream.read(buffer, needle.length - init, buffer.length - needle.length + init);
            if (bufferSize == -1)
                return -1;
            if ((val = findBytes(buffer, bufferSize + needle.length - init, needle, shiftArray)) != -1)
                return val + offset;
            buffer = flushBuffer(buffer, needle.length);
            offset += bufferSize - init;
            init = 0;
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return -1;
}
You'll basically need to keep a buffer the same size as byteSequence so that once you've found that the "next byte" in the stream matches, you can check the rest but then still go back to the "next but one" byte if it's not an actual match.
It's likely to be a bit fiddly whatever you do, to be honest :(
I needed to do this myself, had already started, and didn't like the solutions above. I specifically needed to find where the search-byte-sequence ends. In my situation, I need to fast-forward the stream until after that byte sequence. But you can use my solution for this question too:
var afterSequence = stream.ScanUntilFound(byteSequence);
var beforeSequence = afterSequence - byteSequence.Length;
Here is StreamExtensions.cs
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace System
{
static class StreamExtensions
{
/// <summary>
/// Advances the supplied stream until the given searchBytes are found, without advancing too far (consuming any bytes from the stream after the searchBytes are found).
/// Regarding efficiency, if the stream is network or file, then MEMORY/CPU optimisations will be of little consequence here.
/// </summary>
/// <param name="stream">The stream to search in</param>
/// <param name="searchBytes">The byte sequence to search for</param>
/// <returns></returns>
public static int ScanUntilFound(this Stream stream, byte[] searchBytes)
{
// For this class code comments, a common example is assumed:
// searchBytes are {1,2,3,4} or 1234 for short
// # means value that is outside of search byte sequence
byte[] streamBuffer = new byte[searchBytes.Length];
int nextRead = searchBytes.Length;
int totalScannedBytes = 0;
while (true)
{
FillBuffer(stream, streamBuffer, nextRead);
totalScannedBytes += nextRead; //this is only used for final reporting of where it was found in the stream
if (ArraysMatch(searchBytes, streamBuffer, 0))
return totalScannedBytes; //found it
nextRead = FindPartialMatch(searchBytes, streamBuffer);
}
}
/// <summary>
/// Check all offsets, for partial match.
/// </summary>
/// <param name="searchBytes"></param>
/// <param name="streamBuffer"></param>
/// <returns>The amount of bytes which need to be read in, next round</returns>
static int FindPartialMatch(byte[] searchBytes, byte[] streamBuffer)
{
// 1234 = 0 - found it. this special case is already catered directly in ScanUntilFound
// #123 = 1 - partially matched, only missing 1 value
// ##12 = 2 - partially matched, only missing 2 values
// ###1 = 3 - partially matched, only missing 3 values
// #### = 4 - not matched at all
for (int i = 1; i < searchBytes.Length; i++)
{
if (ArraysMatch(searchBytes, streamBuffer, i))
{
// EG. Searching for 1234, have #123 in the streamBuffer, and [i] is 1
// Output: 123#, where # will be read using FillBuffer next.
Array.Copy(streamBuffer, i, streamBuffer, 0, searchBytes.Length - i);
return i; //if an offset of [i], makes a match then only [i] bytes need to be read from the stream to check if there's a match
}
}
return 4;
}
/// <summary>
/// Reads bytes from the stream, making sure the requested amount of bytes are read (streams don't always fulfill the full request first time)
/// </summary>
/// <param name="stream">The stream to read from</param>
/// <param name="streamBuffer">The buffer to read into</param>
/// <param name="bytesNeeded">How many bytes are needed. If less than the full size of the buffer, it fills the tail end of the streamBuffer</param>
static void FillBuffer(Stream stream, byte[] streamBuffer, int bytesNeeded)
{
// EG1. [123#] - bytesNeeded is 1, when the streamBuffer contains first three matching values, but now we need to read in the next value at the end
// EG2. [####] - bytesNeeded is 4
var bytesAlreadyRead = streamBuffer.Length - bytesNeeded; //invert
while (bytesAlreadyRead < streamBuffer.Length)
{
bytesAlreadyRead += stream.Read(streamBuffer, bytesAlreadyRead, streamBuffer.Length - bytesAlreadyRead);
}
}
/// <summary>
/// Checks if arrays match exactly, or with offset.
/// </summary>
/// <param name="searchBytes">Bytes to search for. Eg. [1234]</param>
/// <param name="streamBuffer">Buffer to match in. Eg. [#123] </param>
/// <param name="startAt">When this is zero, all bytes are checked. Eg. If this value 1, and it matches, this means the next byte in the stream to read may mean a match</param>
/// <returns></returns>
static bool ArraysMatch(byte[] searchBytes, byte[] streamBuffer, int startAt)
{
for (int i = 0; i < searchBytes.Length - startAt; i++)
{
if (searchBytes[i] != streamBuffer[i + startAt])
return false;
}
return true;
}
}
}
Bit of an old question, but here's my answer. I've found that reading blocks and then searching in them is extremely inefficient compared to just reading one byte at a time and going from there.
Also, IIRC, the accepted answer would fail if part of the sequence was in one block read and part in another - e.g., given 12345, searching for 23, it would read 12, not match, then read 34, not match, etc... I haven't tried it, though, seeing as it requires .NET 4.0. At any rate, this is way simpler, and likely much faster.
static long ReadOneSrch(Stream haystack, byte[] needle)
{
    int b;
    long i = 0;
    while ((b = haystack.ReadByte()) != -1)
    {
        if (b == needle[i++])
        {
            if (i == needle.Length)
                return haystack.Position - needle.Length;
        }
        else
        {
            // Note: this reset only looks back one byte, so a needle with a
            // repeated prefix (e.g. "aab" in a stream containing "aaab") can be missed.
            i = b == needle[0] ? 1 : 0;
        }
    }

    return -1;
}
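A quick usage sketch for ReadOneSrch above (the sample bytes are arbitrary):

using (var ms = new System.IO.MemoryStream(new byte[] { 1, 2, 3, 4, 5 }))
{
    long pos = ReadOneSrch(ms, new byte[] { 2, 3 });
    Console.WriteLine(pos); // prints 1, the offset where the sequence starts
}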
// Note: requires a seekable stream; it rewinds to the beginning before searching.
// After a failed partial match it resumes from the current position rather than
// from start + 1, so overlapping candidates can be missed.
static long Search(Stream stream, byte[] pattern)
{
    long start = -1;
    stream.Seek(0, SeekOrigin.Begin);
    while (stream.Position < stream.Length)
    {
        if (stream.ReadByte() != pattern[0])
            continue;

        start = stream.Position - 1;
        for (int idx = 1; idx < pattern.Length; idx++)
        {
            if (stream.ReadByte() != pattern[idx])
            {
                start = -1;
                break;
            }
        }

        if (start > -1)
        {
            return start;
        }
    }
    return start;
}

How to convert UTF-8 byte[] to string

I have a byte[] array that is loaded from a file that I happen to know contains UTF-8.
In some debugging code, I need to convert it to a string. Is there a one-liner that will do this?
Under the covers it should be just an allocation and a memcopy, so even if it is not implemented, it should be possible.
string result = System.Text.Encoding.UTF8.GetString(byteArray);
There are at least four different ways of doing this conversion.
Encoding's GetString, but you won't be able to get the original bytes back if those bytes have non-ASCII characters.
BitConverter.ToString. The output is a "-" delimited string, but there's no .NET built-in method to convert the string back to a byte array.
Convert.ToBase64String. You can easily convert the output string back to a byte array by using Convert.FromBase64String. Note: The output string could contain '+', '/' and '='. If you want to use the string in a URL, you need to explicitly encode it.
HttpServerUtility.UrlTokenEncode. You can easily convert the output string back to a byte array by using HttpServerUtility.UrlTokenDecode. The output string is already URL friendly! The downside is that it needs the System.Web assembly if your project is not a web project.
A full example:
byte[] bytes = { 130, 200, 234, 23 }; // A byte array contains non-ASCII (or non-readable) characters
string s1 = Encoding.UTF8.GetString(bytes); // ���
byte[] decBytes1 = Encoding.UTF8.GetBytes(s1); // decBytes1.Length == 10 !!
// decBytes1 not same as bytes
// Using UTF-8 or other Encoding object will get similar results
string s2 = BitConverter.ToString(bytes); // 82-C8-EA-17
String[] tempAry = s2.Split('-');
byte[] decBytes2 = new byte[tempAry.Length];
for (int i = 0; i < tempAry.Length; i++)
decBytes2[i] = Convert.ToByte(tempAry[i], 16);
// decBytes2 same as bytes
string s3 = Convert.ToBase64String(bytes); // gsjqFw==
byte[] decByte3 = Convert.FromBase64String(s3);
// decByte3 same as bytes
string s4 = HttpServerUtility.UrlTokenEncode(bytes); // gsjqFw2
byte[] decBytes4 = HttpServerUtility.UrlTokenDecode(s4);
// decBytes4 same as bytes
A general solution for converting from a byte array to a string when you don't know the encoding:

static string BytesToStringConverted(byte[] bytes)
{
    using (var stream = new MemoryStream(bytes))
    {
        // StreamReader defaults to UTF-8, but honours a byte order mark if one is present.
        using (var streamReader = new StreamReader(stream))
        {
            return streamReader.ReadToEnd();
        }
    }
}
Definition:
public static string ConvertByteToString(this byte[] source)
{
    return source != null ? System.Text.Encoding.UTF8.GetString(source) : null;
}
Using:
string result = input.ConvertByteToString();
Converting a byte[] to a string seems simple, but any kind of encoding is likely to mess up the output string. This little function just works without any unexpected results:

private string ToString(byte[] bytes)
{
    // Note: this maps each byte to a char one-to-one (effectively Latin-1),
    // so multi-byte UTF-8 sequences are not combined into single characters.
    string response = string.Empty;

    foreach (byte b in bytes)
        response += (Char)b;

    return response;
}
I saw some answers in this post that could be considered complete base knowledge, because I have several approaches in C# to solve the same problem. The only thing that needs to be considered is the difference between pure UTF-8 and UTF-8 with a BOM.
Last week, at my job, I needed to develop functionality that outputs some CSV files with a BOM and other CSV files in pure UTF-8 (without a BOM). Each CSV file encoding type would be consumed by a different, non-standardized API: one API reads UTF-8 with a BOM and the other reads it without a BOM. I needed to research the references on this concept, reading the "What's the difference between UTF-8 and UTF-8 without BOM?" Stack Overflow question and the Wikipedia article "Byte order mark", to build my approach.
Finally, my C# code for both UTF-8 encoding types (with BOM and pure) needed to be similar to this example:

// For UTF-8 with a BOM, equal to what Zanoni shared (at top)
string result = System.Text.Encoding.UTF8.GetString(byteArray);

// For pure UTF-8 (without a BOM)
string result = (new UTF8Encoding(false)).GetString(byteArray);
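When reading, you can also detect and skip a BOM yourself. A small sketch (the helper name StripUtf8Bom is made up):

static string StripUtf8Bom(byte[] byteArray)
{
    // The UTF-8 BOM is the three bytes EF BB BF.
    if (byteArray.Length >= 3 &&
        byteArray[0] == 0xEF && byteArray[1] == 0xBB && byteArray[2] == 0xBF)
    {
        return System.Text.Encoding.UTF8.GetString(byteArray, 3, byteArray.Length - 3);
    }
    return System.Text.Encoding.UTF8.GetString(byteArray);
}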
Using b.ToString("x2") for each byte outputs a hex string like b4b5dfe475e58b67:
public static class Ext {
    public static string ToHexString(this byte[] hex)
    {
        if (hex == null) return null;
        if (hex.Length == 0) return string.Empty;

        var s = new StringBuilder();
        foreach (byte b in hex) {
            s.Append(b.ToString("x2"));
        }
        return s.ToString();
    }

    public static byte[] ToHexBytes(this string hex)
    {
        if (hex == null) return null;
        if (hex.Length == 0) return new byte[0];

        int l = hex.Length / 2;
        var b = new byte[l];
        for (int i = 0; i < l; ++i) {
            b[i] = Convert.ToByte(hex.Substring(i * 2, 2), 16);
        }
        return b;
    }

    public static bool EqualsTo(this byte[] bytes, byte[] bytesToCompare)
    {
        if (bytes == null && bytesToCompare == null) return true; // ?
        if (bytes == null || bytesToCompare == null) return false;
        if (object.ReferenceEquals(bytes, bytesToCompare)) return true;
        if (bytes.Length != bytesToCompare.Length) return false;

        for (int i = 0; i < bytes.Length; ++i) {
            if (bytes[i] != bytesToCompare[i]) return false;
        }
        return true;
    }
}
There is also the UnicodeEncoding class, which is quite simple to use (note that it is UTF-16, not UTF-8):

UnicodeEncoding ByteConverter = new UnicodeEncoding();
string stringDataForEncoding = "My Secret Data!";
byte[] dataEncoded = ByteConverter.GetBytes(stringDataForEncoding);

Console.WriteLine("Data after decoding: {0}", ByteConverter.GetString(dataEncoded));
In addition to the selected answer, if you're using .NET 3.5 or .NET 3.5 CE, you have to specify the index of the first byte to decode, and the number of bytes to decode:
string result = System.Text.Encoding.UTF8.GetString(byteArray, 0, byteArray.Length);
Alternatively:
var byteStr = Convert.ToBase64String(bytes);
The BitConverter class can be used to convert a byte[] to a string.

var convertedString = BitConverter.ToString(byteArray);

Documentation for the BitConverter class can be found on MSDN.
To my knowledge none of the given answers guarantees correct behavior with null termination. Until someone shows me differently, I wrote my own static class for handling this, with the following methods:

// Mimics the functionality of strlen() in c/c++
// Needed because neither StringBuilder nor Encoding.*.GetString() handle \0 well
static int StringLength(byte[] buffer, int startIndex = 0)
{
    int strlen = 0;
    while
    (
        (startIndex + strlen + 1) < buffer.Length // Make sure incrementing won't break any bounds
        && buffer[startIndex + strlen] != 0       // The typical null termination check
    )
    {
        ++strlen;
    }
    return strlen;
}

// This is messy, but I haven't found a built-in way in c# that guarantees null termination
public static string ParseBytes(byte[] buffer, out int strlen, int startIndex = 0)
{
    strlen = StringLength(buffer, startIndex);
    byte[] c_str = new byte[strlen];
    Array.Copy(buffer, startIndex, c_str, 0, strlen);
    return Encoding.UTF8.GetString(c_str);
}
The reason for the startIndex parameter is that, in the example I was working on, I specifically needed to parse a byte[] as an array of null-terminated strings. It can be safely ignored in the simple case; a usage sketch follows.
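For instance, walking a buffer of consecutive null-terminated strings might look like this (a sketch; the sample buffer is arbitrary):

byte[] buffer = { (byte)'h', (byte)'i', 0, (byte)'y', (byte)'o', 0 };
int offset = 0;
while (offset < buffer.Length)
{
    int strlen;
    Console.WriteLine(ParseBytes(buffer, out strlen, offset)); // "hi", then "yo"
    offset += strlen + 1; // skip past the null terminator as well
}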
A LINQ one-liner for converting a byte array byteArrFilename read from a file to a pure ASCII C-style zero-terminated string would be this (handy for reading things like file index tables in old archive formats):

String filename = new String(byteArrFilename.TakeWhile(x => x != 0)
                                            .Select(x => x < 128 ? (Char)x : '?').ToArray());

I use '?' as the default character for anything not pure ASCII here, but that can be changed, of course. If you want to be sure you can detect it, just use '\0' instead, since the TakeWhile at the start ensures that a string built this way cannot possibly contain '\0' values from the input source.
Try this console application:

static void Main(string[] args)
{
    //Encoding _UTF8 = Encoding.UTF8;
    string[] _mainString = { "Hello, World!" };
    Console.WriteLine("Main String: " + _mainString[0]); // index into the array, not the array itself

    // Convert a string to UTF-8 bytes.
    byte[] _utf8Bytes = Encoding.UTF8.GetBytes(_mainString[0]);

    // Convert UTF-8 bytes to a string.
    string _stringuUnicode = Encoding.UTF8.GetString(_utf8Bytes);
    Console.WriteLine("String Unicode: " + _stringuUnicode);
}
Here is a solution where you don't have to bother with the encoding. I used it in my network class to send binary objects as strings.

public static byte[] String2ByteArray(string str)
{
    // Each char is copied as its two UTF-16 code-unit bytes
    // (equivalent to Encoding.Unicode.GetBytes for well-formed strings).
    char[] chars = str.ToArray();
    byte[] bytes = new byte[chars.Length * 2];

    for (int i = 0; i < chars.Length; i++)
        Array.Copy(BitConverter.GetBytes(chars[i]), 0, bytes, i * 2, 2);

    return bytes;
}

public static string ByteArray2String(byte[] bytes)
{
    char[] chars = new char[bytes.Length / 2];

    for (int i = 0; i < chars.Length; i++)
        chars[i] = BitConverter.ToChar(bytes, i * 2);

    return new string(chars);
}
string result = ASCIIEncoding.UTF8.GetString(byteArray); // UTF8 is a static member inherited from Encoding, so this is the same as Encoding.UTF8.GetString(byteArray)

System.Text.Encoding.GetEncoding("iso-8859-1") throws PlatformNotSupportedException?

See the subject; note that this question only applies to the .NET Compact Framework. This happens on the emulators that ship with the Windows Mobile 6 Professional SDK as well as on my English HTC Touch Pro (all .NET CF 3.5). iso-8859-1 stands for Western European (ISO), which is probably the most important encoding besides us-ascii (at least if one goes by the number of Usenet posts).
I'm having a hard time understanding why this encoding is not supported, while the following ones are supported (again, on both the emulators and my HTC):
iso-8859-2 (Central European (ISO))
iso-8859-3 (Latin 3 (ISO))
iso-8859-4 (Baltic (ISO))
iso-8859-5 (Cyrillic (ISO))
iso-8859-7 (Greek (ISO))
So, is support for, say, Greek more important than support for German, French and Spanish? Can anyone shed some light on this?
Thanks!
Andreas
I would try using "windows-1252" as the encoding string. According to Wikipedia, Windows-1252 is a superset of ISO-8859-1.
System.Text.Encoding.GetEncoding(1252)
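A small sketch of that fallback idea: try ISO-8859-1 first and drop back to the Windows-1252 code page if the platform doesn't support it.

Encoding enc;
try
{
    enc = Encoding.GetEncoding("iso-8859-1");
}
catch (PlatformNotSupportedException)
{
    // Windows-1252 is a superset of ISO-8859-1
    enc = Encoding.GetEncoding(1252);
}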
This MSDN article says:

The .NET Compact Framework supports character encoding on all devices: Unicode (BE and LE), UTF8, UTF7, and ASCII. There is limited support for code page encoding, and only if the encoding is recognized by the operating system of the device. The .NET Compact Framework throws a PlatformNotSupportedException if a required encoding is not available on the device.
I believe all (or at least many) of the ISO encodings are code-page encodings and fall under the "limited support" rule. UTF8 is probably your best bet as a replacement.
I know it's a bit late, but I made an implementation of the ISO-8859-1 encoding for .NET CF. I hope this helps:
namespace System.Text
{
    public class Latin1Encoding : Encoding
    {
        private readonly string m_specialCharset = (char)0xA0 + @"¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ";

        public override string WebName
        {
            get { return @"ISO-8859-1"; }
        }

        public override int CodePage
        {
            get { return 28591; }
        }

        public override int GetByteCount(char[] chars, int index, int count)
        {
            return count;
        }

        public override int GetBytes(char[] chars, int charIndex, int charCount, byte[] bytes, int byteIndex)
        {
            if (chars == null)
                throw new ArgumentNullException(@"chars", @"null array");
            if (bytes == null)
                throw new ArgumentNullException(@"bytes", @"null array");
            if (charIndex < 0)
                throw new ArgumentOutOfRangeException(@"charIndex");
            if (charCount < 0)
                throw new ArgumentOutOfRangeException(@"charCount");
            if (chars.Length - charIndex < charCount)
                throw new ArgumentOutOfRangeException(@"chars");
            if (byteIndex < 0 || byteIndex > bytes.Length)
                throw new ArgumentOutOfRangeException(@"byteIndex");

            for (int i = 0; i < charCount; i++)
            {
                char ch = chars[charIndex + i];
                int chVal = ch;
                bytes[byteIndex + i] = chVal < 160 ? (byte)ch : (chVal <= byte.MaxValue ? (byte)m_specialCharset[chVal - 160] : (byte)63);
            }
            return charCount;
        }

        public override int GetCharCount(byte[] bytes, int index, int count)
        {
            return count;
        }

        public override int GetChars(byte[] bytes, int byteIndex, int byteCount, char[] chars, int charIndex)
        {
            if (chars == null)
                throw new ArgumentNullException(@"chars", @"null array");
            if (bytes == null)
                throw new ArgumentNullException(@"bytes", @"null array");
            if (byteIndex < 0)
                throw new ArgumentOutOfRangeException(@"byteIndex");
            if (byteCount < 0)
                throw new ArgumentOutOfRangeException(@"byteCount");
            if (bytes.Length - byteIndex < byteCount)
                throw new ArgumentOutOfRangeException(@"bytes");
            if (charIndex < 0 || charIndex > chars.Length)
                throw new ArgumentOutOfRangeException(@"charIndex");

            for (int i = 0; i < byteCount; ++i)
            {
                byte b = bytes[byteIndex + i];
                chars[charIndex + i] = b < 160 ? (char)b : m_specialCharset[b - 160];
            }
            return byteCount;
        }

        public override int GetMaxByteCount(int charCount)
        {
            return charCount;
        }

        public override int GetMaxCharCount(int byteCount)
        {
            return byteCount;
        }
    }
}
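Usage is then the same as with any other Encoding (a quick sketch; the sample string is arbitrary):

Encoding latin1 = new System.Text.Latin1Encoding();
byte[] bytes = latin1.GetBytes("Grüße"); // each char becomes one byte
string round = latin1.GetString(bytes);  // "Grüße" again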
It is odd that 8859-1 isn't supported, but that said, UTF-8 has the ability to represent all of the 8859-1 characters (and more), so is there a reason you can't just use UTF-8 instead? That's what we do internally, and I just dealt with almost this same issue today. The plus side of using UTF-8 is that you get support for Far East and Cyrillic languages without making modifications and without adding weight to the Western languages.
If anyone is getting an exception (.NET Compact Framework) like:
System.Text.Encoding.GetEncoding("iso-8859-1") throws PlatformNotSupportedException
a workaround is to write the bytes with Encoding.Default instead:

FileInfo fileInfo = new FileInfo(FullfileName); // file type: *.text, *.xml
string yourText = (char)0xA0 + @"¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ";
byte[] bytes = Encoding.Default.GetBytes(yourText);

using (FileStream s = fileInfo.OpenWrite())
{
    s.Write(bytes, 0, bytes.Length);
}
