A couple of days ago I came across this CodeReview for Base-36 encoding a byte array. However, the answers that followed didn't touch on decoding back into a byte array, or possibly reusing the answer to perform encodings of different bases (radix).
The answer for the linked question uses BigInteger. So as far as implementation goes, the base and its digits could be parametrized.
The problem with BigInteger though, is that we're treating our input as an assumed integer. However, our input, a byte array, is just an opaque series of values.
If the byte array ends in a series of zero bytes, e.g. {0xFF,0x7F,0x00,0x00}, those bytes will be lost when using the algorithm in the answer (it would only encode {0xFF,0x7F}).
If the last non-zero byte has the sign bit set, then the zero byte that follows it is consumed, as it's treated as the BigInteger's sign delimiter. So {0xFF,0xFF,0x00,0x00} would encode only as {0xFF,0xFF,0x00}.
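Both cases can be reproduced with a plain round trip through BigInteger, without any radix encoding involved. The following is a minimal sketch; the class and method names are made up for illustration and are not part of the linked answer.

```csharp
using System;
using System.Numerics;

public static class BigIntegerRoundTrip
{
    // Hypothetical helper: bytes -> BigInteger -> bytes, with no encoding step in between.
    // BigInteger reads the array least-significant byte first and treats the top bit
    // of the last byte as the sign bit.
    public static byte[] RoundTrip(byte[] bytes)
    {
        return new BigInteger(bytes).ToByteArray();
    }
}
```

Passing {0xFF,0x7F,0x00,0x00} returns a 2-byte array, while {0xFF,0xFF,0x00,0x00} returns a 3-byte array, because one zero byte has to stay to keep the value positive.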
How could a .NET programmer use BigInteger to create a reasonably efficient and radix-agnostic encoder, with decoding support, plus the ability to handle endian-ness, and with the ability to 'work around' the ending zero bytes being lost?
edit [2020/01/26]: FWIW, the code below along with its unit test live alongside my open source libraries on Github.
edit [2016/04/19]: If you're fond of exceptions, you may wish to change some of the Decode implementation code to throw InvalidDataException instead of just returning null.
edit [2014/09/14]: I've added a 'HACK' to Encode() to handle cases where the last byte in the input is signed (if you were to convert to sbyte). Only sane solution I could think of right now is to just Resize() the array by one. Additional unit tests for this case passed, but I didn't rerun perf code to account for such cases. If you can help it, always have your input to Encode() include a dummy 0 byte at the end to avoid additional allocations.
Usage
I've created a RadixEncoding class (found in the "Code" section) which initializes with three parameters:
The radix digits as a string (length determines the actual radix of course),
The assumed byte ordering (endian) of input byte arrays,
And whether or not the user wants the encode/decode logic to acknowledge ending zero bytes.
To create a Base-36 encoding, with little-endian input, and without respect given to ending zero bytes:
const string k_base36_digits = "0123456789abcdefghijklmnopqrstuvwxyz";
var base36_no_zeros = new RadixEncoding(k_base36_digits, EndianFormat.Little, false);
And then to actually perform encoding/decoding:
const string k_input = "A test 1234";
byte[] input_bytes = System.Text.Encoding.UTF8.GetBytes(k_input);
string encoded_string = base36_no_zeros.Encode(input_bytes);
byte[] decoded_bytes = base36_no_zeros.Decode(encoded_string);
Performance
Timed with Diagnostics.Stopwatch, run on an i7 860 @ 2.80GHz. The timing EXE was run by itself, not under a debugger.
Encoding was initialized with the same k_base36_digits string from above, EndianFormat.Little, and with ending zero bytes acknowledged (even though the UTF8 bytes don't have any extra ending zero bytes).
To encode the UTF8 bytes of "A test 1234" 1,000,000 times takes 2.6567905secs
To decode the same string the same amount of times takes 3.3916248secs
To encode the UTF8 bytes of "A test 1234. Made slightly larger!" 100,000 times takes 1.1577325secs
To decode the same string the same amount of times takes 1.244326secs
Code
If you don't have a CodeContracts generator, you will have to reimplement the contracts with if/throw code.
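As a rough sketch of what that if/throw replacement could look like, assuming you only need the null-argument preconditions used in the code below (the Guard class is a hypothetical name, not part of the original code):

```csharp
using System;

public static class Guard
{
    // Hypothetical stand-in for Contract.Requires<ArgumentNullException>(value != null)
    public static void NotNull(object value, string paramName)
    {
        if (value == null)
            throw new ArgumentNullException(paramName);
    }
}
```

Encode would then start with Guard.NotNull(bytes, "bytes") instead of the Contract.Requires call. The Contract.Ensures postcondition has no direct if/throw equivalent at the top of a method and can simply be dropped.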
using System;
using System.Collections.Generic;
using System.Numerics;
using Contract = System.Diagnostics.Contracts.Contract;
public enum EndianFormat
{
/// <summary>Least Significant Bit order (lsb)</summary>
/// <remarks>Right-to-Left</remarks>
/// <see cref="BitConverter.IsLittleEndian"/>
Little,
/// <summary>Most Significant Bit order (msb)</summary>
/// <remarks>Left-to-Right</remarks>
Big,
};
/// <summary>Encodes/decodes bytes to/from a string</summary>
/// <remarks>
/// Encoded string is always in big-endian ordering
///
/// <p>Encode and Decode take a <b>includeProceedingZeros</b> parameter which acts as a work-around
/// for an edge case with our BigInteger implementation.
/// MSDN says BigInteger byte arrays are in LSB->MSB ordering. So a byte buffer with zeros at the
/// end will have those zeros ignored in the resulting encoded radix string.
/// If such a loss in precision absolutely cannot occur pass true to <b>includeProceedingZeros</b>
/// and for a tiny bit of extra processing it will handle the padding of zero digits (encoding)
/// or bytes (decoding).</p>
/// <p>Note: doing this for decoding <b>may</b> add an extra byte more than what was originally
/// given to Encode.</p>
/// </remarks>
// Based on the answers from http://codereview.stackexchange.com/questions/14084/base-36-encoding-of-a-byte-array/
public class RadixEncoding
{
const int kByteBitCount = 8;
readonly string kDigits;
readonly double kBitsPerDigit;
readonly BigInteger kRadixBig;
readonly EndianFormat kEndian;
readonly bool kIncludeProceedingZeros;
/// <summary>Numerical base of this encoding</summary>
public int Radix { get { return kDigits.Length; } }
/// <summary>Endian ordering of bytes input to Encode and output by Decode</summary>
public EndianFormat Endian { get { return kEndian; } }
/// <summary>True if we want ending zero bytes to be encoded</summary>
public bool IncludeProceedingZeros { get { return kIncludeProceedingZeros; } }
public override string ToString()
{
return string.Format("Base-{0} {1}", Radix.ToString(), kDigits);
}
/// <summary>Create a radix encoder using the given characters as the digits in the radix</summary>
/// <param name="digits">Digits to use for the radix-encoded string</param>
/// <param name="bytesEndian">Endian ordering of bytes input to Encode and output by Decode</param>
/// <param name="includeProceedingZeros">True if we want ending zero bytes to be encoded</param>
public RadixEncoding(string digits,
EndianFormat bytesEndian = EndianFormat.Little, bool includeProceedingZeros = false)
{
Contract.Requires<ArgumentNullException>(digits != null);
int radix = digits.Length;
kDigits = digits;
kBitsPerDigit = System.Math.Log(radix, 2);
kRadixBig = new BigInteger(radix);
kEndian = bytesEndian;
kIncludeProceedingZeros = includeProceedingZeros;
}
// Number of characters needed for encoding the specified number of bytes
int EncodingCharsCount(int bytesLength)
{
return (int)Math.Ceiling((bytesLength * kByteBitCount) / kBitsPerDigit);
}
// Number of bytes needed for decoding the specified number of characters
int DecodingBytesCount(int charsCount)
{
return (int)Math.Ceiling((charsCount * kBitsPerDigit) / kByteBitCount);
}
/// <summary>Encode a byte array into a radix-encoded string</summary>
/// <param name="bytes">byte array to encode</param>
/// <returns>The bytes encoded into a radix-encoded string</returns>
/// <remarks>If <paramref name="bytes"/> is zero length, returns an empty string</remarks>
public string Encode(byte[] bytes)
{
Contract.Requires<ArgumentNullException>(bytes != null);
Contract.Ensures(Contract.Result<string>() != null);
// Don't really have to do this, our code will build this result (empty string),
// but why not catch the condition before doing work?
if (bytes.Length == 0) return string.Empty;
// if the array ends with zeros, having the capacity set to this will help us know how much
// 'padding' we will need to add
int result_length = EncodingCharsCount(bytes.Length);
// List<> has a(n in-place) Reverse method. StringBuilder doesn't. That's why.
var result = new List<char>(result_length);
// HACK: BigInteger uses the last byte as the 'sign' byte. If that byte's MSB is set,
// we need to pad the input with an extra 0 (ie, make it positive)
if ( (bytes[bytes.Length-1] & 0x80) == 0x80 )
Array.Resize(ref bytes, bytes.Length+1);
var dividend = new BigInteger(bytes);
// IsZero's computation is less complex than evaluating "dividend > 0"
// which invokes BigInteger.CompareTo(BigInteger)
while (!dividend.IsZero)
{
BigInteger remainder;
dividend = BigInteger.DivRem(dividend, kRadixBig, out remainder);
int digit_index = System.Math.Abs((int)remainder);
result.Add(kDigits[digit_index]);
}
if (kIncludeProceedingZeros)
for (int x = result.Count; x < result.Capacity; x++)
result.Add(kDigits[0]); // pad with the character that represents 'zero'
// orientate the characters in big-endian ordering
if (kEndian == EndianFormat.Little)
result.Reverse();
// If we didn't end up adding padding, ToArray will end up returning a TrimExcess'd array,
// so nothing wasted
return new string(result.ToArray());
}
void DecodeImplPadResult(ref byte[] result, int padCount)
{
if (padCount > 0)
{
int new_length = result.Length + DecodingBytesCount(padCount);
Array.Resize(ref result, new_length); // new bytes will be zero, just the way we want it
}
}
#region Decode (Little Endian)
byte[] DecodeImpl(string chars, int startIndex = 0)
{
var bi = new BigInteger();
for (int x = startIndex; x < chars.Length; x++)
{
int i = kDigits.IndexOf(chars[x]);
if (i < 0) return null; // invalid character
bi *= kRadixBig;
bi += i;
}
return bi.ToByteArray();
}
byte[] DecodeImplWithPadding(string chars)
{
int pad_count = 0;
for (int x = 0; x < chars.Length; x++, pad_count++)
if (chars[x] != kDigits[0]) break;
var result = DecodeImpl(chars, pad_count);
DecodeImplPadResult(ref result, pad_count);
return result;
}
#endregion
#region Decode (Big Endian)
byte[] DecodeImplReversed(string chars, int startIndex = 0)
{
var bi = new BigInteger();
for (int x = (chars.Length-1)-startIndex; x >= 0; x--)
{
int i = kDigits.IndexOf(chars[x]);
if (i < 0) return null; // invalid character
bi *= kRadixBig;
bi += i;
}
return bi.ToByteArray();
}
byte[] DecodeImplReversedWithPadding(string chars)
{
int pad_count = 0;
for (int x = chars.Length - 1; x >= 0; x--, pad_count++)
if (chars[x] != kDigits[0]) break;
var result = DecodeImplReversed(chars, pad_count);
DecodeImplPadResult(ref result, pad_count);
return result;
}
#endregion
/// <summary>Decode a radix-encoded string into a byte array</summary>
/// <param name="radixChars">radix string</param>
/// <returns>The decoded bytes, or null if an invalid character is encountered</returns>
/// <remarks>
/// If <paramref name="radixChars"/> is an empty string, returns a zero length array
///
/// Using <paramref name="IncludeProceedingZeros"/> has the potential to return a buffer with an
/// additional zero byte that wasn't in the input. So a 4 byte buffer was encoded, this could end up
/// returning a 5 byte buffer, with the extra byte being null.
/// </remarks>
public byte[] Decode(string radixChars)
{
Contract.Requires<ArgumentNullException>(radixChars != null);
if (kEndian == EndianFormat.Big)
return kIncludeProceedingZeros ? DecodeImplReversedWithPadding(radixChars) : DecodeImplReversed(radixChars);
else
return kIncludeProceedingZeros ? DecodeImplWithPadding(radixChars) : DecodeImpl(radixChars);
}
};
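The capacity estimate that Encode relies on for its zero padding can be checked in isolation. This sketch repeats the EncodingCharsCount formula as a standalone function; RadixMath is a name chosen here for illustration, and the radix is passed in directly rather than taken from the digit string.

```csharp
using System;

public static class RadixMath
{
    // Same formula as EncodingCharsCount above: ceil(bytes * 8 / log2(radix))
    public static int EncodingCharsCount(int bytesLength, int radix)
    {
        double bitsPerDigit = Math.Log(radix, 2);
        return (int)Math.Ceiling((bytesLength * 8) / bitsPerDigit);
    }
}
```

For the 11 bytes of "A test 1234" in base 36 this gives ceil(88 / 5.17) = 18 characters, and for a 4-byte buffer ceil(32 / 5.17) = 7.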
Basic Unit Tests
using System;
using Microsoft.VisualStudio.TestTools.UnitTesting;
static bool ArraysCompareN<T>(T[] input, T[] output)
where T : IEquatable<T>
{
if (output.Length < input.Length) return false;
for (int x = 0; x < input.Length; x++)
if(!output[x].Equals(input[x])) return false;
return true;
}
static bool RadixEncodingTest(RadixEncoding encoding, byte[] bytes)
{
string encoded = encoding.Encode(bytes);
byte[] decoded = encoding.Decode(encoded);
return ArraysCompareN(bytes, decoded);
}
[TestMethod]
public void TestRadixEncoding()
{
const string k_base36_digits = "0123456789abcdefghijklmnopqrstuvwxyz";
var base36 = new RadixEncoding(k_base36_digits, EndianFormat.Little, true);
var base36_no_zeros = new RadixEncoding(k_base36_digits, EndianFormat.Little, false);
byte[] ends_with_zero_neg = { 0xFF, 0xFF, 0x00, 0x00 };
byte[] ends_with_zero_pos = { 0xFF, 0x7F, 0x00, 0x00 };
byte[] text = System.Text.Encoding.ASCII.GetBytes("A test 1234");
Assert.IsTrue(RadixEncodingTest(base36, ends_with_zero_neg));
Assert.IsTrue(RadixEncodingTest(base36, ends_with_zero_pos));
Assert.IsTrue(RadixEncodingTest(base36_no_zeros, text));
}
Interestingly, I was able to port Kornman's techniques across to Java and got the expected output up to and including base36. However, when running his code from C# using the csc in C:\Windows\Microsoft.NET\Framework\v4.0.30319, the output was not as expected.
For example, trying to base16 encode the MD5 hashBytes obtained for the String "hello world" below using Kornman's RadixEncoding Encode, I could see that each group of two characters per byte had its characters in the wrong order.
Rather than 5eb63bbbe01eeed093cb22bb8f5acdc3
I saw something like e56bb3bb0ee1....
This was on Windows 7.
const string input = "hello world";
public static void Main(string[] args)
{
using (System.Security.Cryptography.MD5 md5 = System.Security.Cryptography.MD5.Create())
{
byte[] inputBytes = System.Text.Encoding.ASCII.GetBytes(input);
byte[] hashBytes = md5.ComputeHash(inputBytes);
// Convert the byte array to hexadecimal string
StringBuilder sb = new StringBuilder();
for (int i = 0; i < hashBytes.Length; i++)
{
sb.Append(hashBytes[i].ToString("X2"));
}
Console.WriteLine(sb.ToString());
}
}
Java code is below for anyone interested. As mentioned above, it only works to base 36.
private static final char[] BASE16_CHARS = "0123456789abcdef".toCharArray();
private static final BigInteger BIGINT_16 = BigInteger.valueOf(16);
private static final char[] BASE36_CHARS = "0123456789abcdefghijklmnopqrstuvwxyz".toCharArray();
private static final BigInteger BIGINT_36 = BigInteger.valueOf(36);
public static String toBaseX(byte[] bytes, BigInteger base, char[] chars)
{
if (bytes == null) {
return null;
}
final int bitsPerByte = 8;
double bitsPerDigit = Math.log(chars.length) / Math.log(2);
// Number of chars to encode specified bytes
int size = (int) Math.ceil((bytes.length * bitsPerByte) / bitsPerDigit);
StringBuilder sb = new StringBuilder(size);
for (BigInteger value = new BigInteger(bytes); !value.equals(BigInteger.ZERO);) {
BigInteger[] quotientAndRemainder = value.divideAndRemainder(base);
sb.insert(0, chars[Math.abs(quotientAndRemainder[1].intValue())]);
value = quotientAndRemainder[0];
}
return sb.toString();
}
Related
So,
I have a string that I want to convert each character to hex values and then put it in a byte array to be sent through a com port.
I can convert the individual characters to the hex that I need to send, but I can't get that array of strings into a byte array correctly.
example:
string beforeConverting = "HELLO";
String[] afterConverting = {"0x48", "0x45", "0x4C", "0x4C", "0x4F"};
should become
byte[] byteData = new byte[]{0x48, 0x45, 0x4C, 0x4C, 0x4F};
I've tried several different things from several different posts but I can't get the right combination of things together. If anyone could point me in the right direction or give me a snippet of example code that would be awesome!
If your final aim is to send byte[], then you can actually skip the middle step and immediately do the conversion from string to byte[] using Encoding.ASCII.GetBytes (provided that you send ASCII char):
string beforeConverting = "HELLO";
byte[] byteData = Encoding.ASCII.GetBytes(beforeConverting);
//will give you {0x48, 0x45, 0x4C, 0x4C, 0x4F};
If you don't send ASCII, you could find the appropriate Encoding type (like Unicode or UTF32), depends on your need.
That being said, if you still want to convert the hex string to a byte array, you could do something like this:
/// <summary>
/// Converts a hex data string (e.g. 0x01455687) to bytes; the "0x" prefix is optional
/// </summary>
/// <param name="hexString">Hex digits, optionally prefixed with 0x or 0X</param>
/// <returns>The parsed bytes, or null if the string cannot be parsed</returns>
public static byte[] HexStringToBytes(string hexString) {
try {
if (hexString.Length >= 3) //must have minimum of length of 3
if (hexString[0] == '0' && (hexString[1] == 'x' || hexString[1] == 'X'))
hexString = hexString.Substring(2);
int dataSize = (hexString.Length + 1) / 2; // round up so an odd-length string gets a leading zero below
int expectedStringLength = 2 * dataSize;
while (hexString.Length < expectedStringLength)
hexString = "0" + hexString; //zero padding in the front
int NumberChars = hexString.Length / 2;
byte[] bytes = new byte[NumberChars];
using (var sr = new StringReader(hexString)) {
for (int i = 0; i < NumberChars; i++)
bytes[i] = Convert.ToByte(new string(new char[2] { (char)sr.Read(), (char)sr.Read() }), 16);
}
return bytes;
} catch {
return null;
}
}
And then use it like this:
byte[] byteData = afterConverting.Select(x => HexStringToBytes(x)[0]).ToArray();
The method above is more general: it can handle an input string like 0x05163782 and give you a byte[4]. For your use, you only need to take the first byte (as the byte[] will always be byte[1]), which is why there is a [0] index in the LINQ Select.
The core method used in the custom method above is Convert.ToByte():
bytes[i] = Convert.ToByte(new string(new char[2] { (char)sr.Read(), (char)sr.Read() }), 16);
To convert just the hexadecimal string to a number, you could use the System.Convert class like so
string hex = "0x3B";
byte b = Convert.ToByte(hex.Substring(2), 16);
// b is now 0x3B
Substring is used to skip the characters 0x
Hi, I need to do some file handling, and for that I use a method that converts a hexadecimal string into a byte array.
public static byte[] StringToByteArray(string hex)
{
return Enumerable.Range(0, hex.Length)
.Where(x => x % 2 == 0)
.Select(x => Convert.ToByte(hex.Substring(x, 2), 16))
.ToArray();
}
My problem is that when I give a small hexadecimal string as a parameter to this function it produces the right output, but when I use a large hexadecimal string as a parameter the output is not what I expect.
To make this clearer:
I used a hexadecimal string that was converted from a byte array of length [26246026].
When I convert that hex string back into a byte array it should return [26246026] bytes, but it returns only part of them, i.e. [262144].
I can't get the exact bytes back from the hex string; how can I get them?
Could someone please help me get the expected output?
My input string for that method contains the hexadecimal string of a 25 MB file. It should return [26246026] bytes, but it returns only [262144].
When I use a small hex string (a small file) it works fine, but when I work on big files I can't get the original file bytes. Please suggest what to do.
My input parameter string content is as follows, as asked in the comments.
It is 524288 characters in length in total.
It looks like this:
3026b2758e66cf11a6d900aa0062ce6c301600000000000008000000010240a4d0d207e3d21197f000a0c95ea850cc0000000000000004001c00530066004f0072006900670069006e0061006c00460050005300000003000400b49204001c0057004d004600530044004b00560065007200730069006f006e00000000001e00310031002e0030002e0036003000300031002e00370030003000300000001a0057004d004600530044004b004e006500650064006500640000000000160030002e0030002e0030002e00300030003000300000000c0049007300560042005200000002000400000000003326b2758e66cf11a6..........................................................................................................................................
d900aa0062ce6c54010000000000001e0000003a00da000000570dcb8b495848cea4609eca906bc24db442394f0ddac5eb0604fb99820bcc30ff0f1736eefd74cd4317a21a369e208c580dbb02f90e888f0a35901e08439ec6087c61d241bc3c476c24d311291a678596a98792a9000b68adf213906e0f00097c8d989e517ee532fcd6cb70e520ec9dd4fad8a1a37668bbd678bea11c1fcf2d187c4c4c6c09c3c2c53d3e64016cfebc34eace85d45a4c08cd78d05d3934e05b72ec194304848165a8c1a585c78423
/// <summary>
/// Parses a continuous hex stream from a string.
/// </summary>
public static byte[] ParseHexBytes(this string s)
{
if (s == null)
throw new ArgumentNullException("s");
if (s.Length == 0)
return new byte[0];
if (s.Length % 2 != 0)
throw new ArgumentException("Source length error", "s");
int length = s.Length >> 1;
byte[] result = new byte[length];
for (int i = 0; i < length; i++)
{
result[i] = Byte.Parse(s.Substring(i * 2, 2), NumberStyles.HexNumber);
}
return result;
}
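For inputs as large as the one in the question, the Substring call allocates one temporary string per output byte. A possible variation, sketched here under the assumption that only the characters 0-9, a-f and A-F are valid (HexParser and NibbleValue are illustrative names), reads the two nibbles of each byte directly:

```csharp
using System;

public static class HexParser
{
    // Maps one hex digit to its value 0..15
    static int NibbleValue(char c)
    {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        throw new FormatException("Invalid hex digit: " + c);
    }

    // Parses a continuous hex string without creating intermediate substrings
    public static byte[] Parse(string s)
    {
        if (s == null) throw new ArgumentNullException("s");
        if (s.Length % 2 != 0) throw new ArgumentException("Source length error", "s");

        var result = new byte[s.Length / 2];
        for (int i = 0; i < result.Length; i++)
        {
            int high = NibbleValue(s[2 * i]);
            int low = NibbleValue(s[2 * i + 1]);
            result[i] = (byte)((high << 4) | low);
        }
        return result;
    }
}
```

The output always has exactly half as many bytes as the input has characters, so a 524288-character string yields 262144 bytes.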
I am trying to achieve the best possible compression for data that consists of just 1s and 0s in a matrix.
To demonstrate what I mean, here's a sample 6 by 6 matrix:
1,0,0,1,1,1
0,1,0,1,1,1
1,0,0,1,0,0
0,1,1,0,1,1
1,0,0,0,0,1
0,1,0,1,0,1
I'd like to compress that into as small a string or byte array as possible. The matrices I will need to compress are bigger though (always 4096 by 4096 1s and 0s).
I suppose it could be compressed quite heavily, but I'm not sure how. I'll mark the best compression as the answer. Performance does not matter.
I assume that you want to compress string into other strings even though your data really is binary. I don't know what the best compression algorithm is (and that will vary depending on your data) but you can convert the input text into bits, compress these and then convert the compressed bytes into a string again using base-64 encoding. This will allow you to go from string to string and still apply a compression algorithm of your choice.
The .NET framework provides the class DeflateStream that will allow you to compress a stream of bytes. The first step is to create a custom Stream that will allow you to read and write your text format. For lack of better name I have named it TextStream. Note that to simplify matters a bit I use \n as the line ending (instead of \r\n).
class TextStream : Stream {
readonly String text;
readonly Int32 bitsPerLine;
readonly StringBuilder buffer;
Int32 textPosition;
// Initialize a readable stream.
public TextStream(String text) {
if (text == null)
throw new ArgumentNullException("text");
this.text = text;
}
// Initialize a writeable stream.
public TextStream(Int32 bitsPerLine) {
if (bitsPerLine <= 0)
throw new ArgumentException();
this.bitsPerLine = bitsPerLine;
this.buffer = new StringBuilder();
}
public override Boolean CanRead { get { return this.text != null; } }
public override Boolean CanWrite { get { return this.buffer != null; } }
public override Boolean CanSeek { get { return false; } }
public override Int64 Length { get { throw new InvalidOperationException(); } }
public override Int64 Position {
get { throw new InvalidOperationException(); }
set { throw new InvalidOperationException(); }
}
public override void Flush() {
}
public override Int32 Read(Byte[] buffer, Int32 offset, Int32 count) {
// TODO: Validate buffer, offset and count.
if (!CanRead)
throw new InvalidOperationException();
var byteCount = 0;
Byte currentByte = 0;
var bitCount = 0;
for (; byteCount < count && this.textPosition < this.text.Length; this.textPosition += 1) {
if (text[this.textPosition] != '0' && text[this.textPosition] != '1')
continue;
currentByte = (Byte) ((currentByte << 1) | (this.text[this.textPosition] == '0' ? 0 : 1));
bitCount += 1;
if (bitCount == 8) {
buffer[offset + byteCount] = currentByte;
byteCount += 1;
currentByte = 0;
bitCount = 0;
}
}
if (bitCount > 0) {
buffer[offset + byteCount] = currentByte;
byteCount += 1;
}
return byteCount;
}
public override void Write(Byte[] buffer, Int32 offset, Int32 count) {
// TODO: Validate buffer, offset and count.
if (!CanWrite)
throw new InvalidOperationException();
for (var i = 0; i < count; ++i) {
var currentByte = buffer[offset + i];
for (var mask = 0x80; mask > 0; mask /= 2) {
if (this.buffer.Length > 0) {
if ((this.buffer.Length + 1)%(2*this.bitsPerLine) == 0)
this.buffer.Append('\n');
else
this.buffer.Append(',');
}
this.buffer.Append((currentByte & mask) == 0 ? '0' : '1');
}
}
}
public override String ToString() {
if (this.text != null)
return this.text;
else
return this.buffer.ToString();
}
public override Int64 Seek(Int64 offset, SeekOrigin origin) {
throw new InvalidOperationException();
}
public override void SetLength(Int64 length) {
throw new InvalidOperationException();
}
}
Then you can write methods for compressing and decompressing using DeflateStream. Note that the uncompressed input is a string like the one you have provided in your question and the compressed output is a base-64 encoded string.
String Compress(String text) {
using (var inputStream = new TextStream(text))
using (var outputStream = new MemoryStream()) {
using (var compressedStream = new DeflateStream(outputStream, CompressionMode.Compress))
inputStream.CopyTo(compressedStream);
return Convert.ToBase64String(outputStream.ToArray());
}
}
String Decompress(String compressedText, Int32 bitsPerLine) {
var bytes = Convert.FromBase64String(compressedText);
using (var inputStream = new MemoryStream(bytes))
using (var outputStream = new TextStream(bitsPerLine)) {
using (var compressedStream = new DeflateStream(inputStream, CompressionMode.Decompress))
compressedStream.CopyTo(outputStream);
return outputStream.ToString();
}
}
To test it I used a method to create a random string (using a fixed seed to always create the same string):
String CreateRandomString(Int32 width, Int32 height) {
var random = new Random(0);
var stringBuilder = new StringBuilder();
for (var i = 0; i < width; ++i) {
for (var j = 0; j < height; ++j) {
if (i > 0 && j == 0)
stringBuilder.Append('\n');
else if (j > 0)
stringBuilder.Append(',');
stringBuilder.Append(random.Next(2) == 0 ? '0' : '1');
}
}
return stringBuilder.ToString();
}
Creating a random 4,096 x 4,096 string has an uncompressed size of 33,554,431 characters. This is compressed to 2,797,056 characters which is a reduction to about 8% of the original size.
Skipping the base-64 encoding would increase the compression ratio even more but the output would be binary and not a string. If you also consider the input as binary you actually get the following result for random data with equal probability of 0 and 1:
Input bytes: 4,096 x 4,096 / 8 = 2,097,152
Output bytes: 2,097,792
Size after compression: 100%
Simply converting to bytes is better than doing that followed by a deflate. However, using random input but with 25% 0 and 75% 1 you get this result:
Input bytes: 4,096 x 4,096 / 8 = 2,097,152
Output bytes: 1,757,846
Size after compression: 84%
How much deflate will compress your data really depends on the nature of the data. If it is completely random you won't be able to get much compression after converting from text to bytes.
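These figures can be sanity-checked against the Shannon entropy of a biased bit source, which bounds the average number of output bits per input bit for any lossless compressor when the bits are independent. A small sketch (BitEntropy is a name chosen here for illustration):

```csharp
using System;

public static class BitEntropy
{
    // Entropy, in bits per symbol, of a source that emits 0 with probability pZero
    public static double Of(double pZero)
    {
        if (pZero <= 0.0 || pZero >= 1.0) return 0.0; // a constant source carries no information
        double pOne = 1.0 - pZero;
        return -(pZero * Math.Log(pZero, 2) + pOne * Math.Log(pOne, 2));
    }
}
```

BitEntropy.Of(0.5) is 1.0, so evenly random bits cannot shrink at all, while BitEntropy.Of(0.25) is roughly 0.811, so the 84% measured with deflate on the 25%/75% input is already close to the best any compressor could do on that kind of data.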
Hmm... as small as possible is not really possible without knowing the problem domain.
Here's the general approach:
Represent the ones and zeros in the array using bits not bytes or characters or whatever.
Compress using a general purpose loss-less compression algorithm. The two most common are:
Huffman encoding and some type of LZW.
Huffman coding is provably optimal among codes that assign a whole number of bits to each symbol independently, given known symbol frequencies; the catch is that in order to decompress the data you also need the Huffman tree, which may be as big as the original data. LZW gives you compression equivalent to Huffman (within a few percent) for most inputs, but performs best on data with repeating segments such as text.
Implementations of the compression algorithms should be easy to come by (GZIP uses DEFLATE, which combines LZ77, a dictionary coder that predates LZW, with Huffman coding).
For a good implementation of modern compression algorithms, go to 7zip.org. It's open source and they have a C API with a DLL, but you'll have to create the .NET interface (unless someone already made one).
The non general approach:
This relies on a known characteristic of the data. For example: if you know most of the data is zeroes you can encode only the coordinates of the ones.
If the data contains patches of ones and zeros they can be encoded with RLE or two dimensional variants of the algorithm.
Trying to create your own algorithm for specifically compressing this data will most likely not yield much.
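To make the RLE idea concrete, here is a minimal one-dimensional sketch. The output format (alternating run lengths that always start with the run of zeros) is an assumption made for this example, not a standard, and RunLength is an illustrative name.

```csharp
using System.Collections.Generic;

public static class RunLength
{
    // Returns alternating run lengths, starting with the length of the leading run
    // of zeros (which is 0 when the data starts with a one).
    public static List<int> Encode(IEnumerable<bool> bits)
    {
        var runs = new List<int>();
        bool current = false; // value of the run being counted
        int length = 0;
        foreach (bool bit in bits)
        {
            if (bit == current)
            {
                length++;
            }
            else
            {
                runs.Add(length); // close the finished run
                current = bit;
                length = 1;
            }
        }
        runs.Add(length); // close the final run
        return runs;
    }
}
```

The bits 0,0,0,1,1,0 become the runs 3,2,1. This only pays off when runs are long; on random data the run lengths take more space than the bits they replace.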
Create a GZipStream with Max CompressionLevel
Run a 4096x4096 loop
- set all 64 bits of a ulong to bits of the array
- when 64 bits are done write the ulong to the compressionstream and start at the first bit again
This will very easily add your cube into a pretty compressed block of memory
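A sketch of those steps, assuming the matrix is held as a bool[,] and that CompressionLevel.Optimal is what is meant by the maximum level (MatrixPacker and its method names are made up for this example):

```csharp
using System.IO;
using System.IO.Compression;

public static class MatrixPacker
{
    // Packs the matrix 64 bits at a time into ulongs and writes them through a GZipStream
    public static byte[] Compress(bool[,] matrix)
    {
        using (var output = new MemoryStream())
        {
            using (var gzip = new GZipStream(output, CompressionLevel.Optimal))
            using (var writer = new BinaryWriter(gzip))
            {
                ulong word = 0;
                int bitCount = 0;
                foreach (bool bit in matrix) // enumerates in row-major order
                {
                    if (bit) word |= 1UL << bitCount;
                    if (++bitCount == 64)
                    {
                        writer.Write(word);
                        word = 0;
                        bitCount = 0;
                    }
                }
                if (bitCount > 0) writer.Write(word); // partial last word, zero padded
            }
            return output.ToArray(); // valid even after the stream has been closed
        }
    }

    // Returns the packed (but uncompressed) bytes, 8 per ulong, little-endian
    public static byte[] Decompress(byte[] compressed)
    {
        using (var input = new MemoryStream(compressed))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            gzip.CopyTo(output);
            return output.ToArray();
        }
    }
}
```

For a 4096 x 4096 matrix the packed form is 2,097,152 bytes before gzip is applied; the reader needs to know the dimensions to turn the bytes back into rows.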
Using Huffman coding you can compress it quite a lot:
0 => 111
1 => 10
, => 0
\r => 1100
\n => 1101
Yields for your example matrix (in bits):
10011101 11010010 01011001 10111101 00111010 01001011 00110110 01110111
01001110 11111001 10111101 00100111 01001011 00110110 01110111 01110111
01011001 10111101 00111010 0111010
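The table above can be applied mechanically. This sketch hard-codes those five codes and produces the bits as a string of '0' and '1' characters so the result is easy to compare by eye; a real encoder would pack them into bytes. MatrixHuffman is a hypothetical name.

```csharp
using System.Collections.Generic;
using System.Text;

public static class MatrixHuffman
{
    // The prefix codes listed in the table above
    static readonly Dictionary<char, string> Codes = new Dictionary<char, string>
    {
        { '0', "111" }, { '1', "10" }, { ',', "0" }, { '\r', "1100" }, { '\n', "1101" },
    };

    // Concatenates the code of every character; throws KeyNotFoundException on any other character
    public static string EncodeToBitString(string text)
    {
        var bits = new StringBuilder();
        foreach (char c in text)
            bits.Append(Codes[c]);
        return bits.ToString();
    }
}
```

Encoding the first row followed by \r\n, "1,0,0,1,1,1\r\n", gives 27 bits, which match the start of the bit listing above.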
If the commas, line feed and carriage return can be excluded then you only need a BitArray to store each value. Although now you need to know the dimension of the matrix when decoding. If you don't then you could store it as an int and then the data itself if you're planning on serializing the data.
Something like:
var input = @"1,0,0,1,1,1
0,1,0,1,1,1
1,0,0,1,0,0
0,1,1,0,1,1
1,0,0,0,0,1
0,1,0,1,0,1";
var values = new List<bool>();
foreach(var c in input)
{
if (c == '0')
values.Add(false);
else if (c == '1')
values.Add(true);
}
var ba = new BitArray(values.ToArray());
then serialize the BitArray. You'd probably need to add the number of padding bits to properly decode the data. (4096 * 4096 is divisible by 8).
The BitArray approach should get you the most compression unless there is a significant amount of repeating patterns in the matrix (yes I'm assuming the data is mostly random).
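One simple way to serialize the BitArray, sketched here with a hypothetical helper name, is BitArray.CopyTo into a byte array sized to hold all the bits:

```csharp
using System.Collections;

public static class BitArraySerializer
{
    // Copies the bits into bytes; bit i ends up in byte i / 8 at bit position i % 8
    public static byte[] ToBytes(BitArray bits)
    {
        var bytes = new byte[(bits.Length + 7) / 8]; // round up to whole bytes
        bits.CopyTo(bytes, 0);
        return bytes;
    }
}
```

For a 4096 x 4096 matrix this produces 2,097,152 bytes with no padding bits, since the bit count is divisible by 8.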
Is there a way to use ASCIIEncoding in Windows Phone 7?
Unless I'm doing something wrong Encoding.ASCII doesn't exist and I'm needing it for C# -> PHP encryption (as PHP only uses ASCII in SHA1 encryption).
Any suggestions?
It is easy to implement yourself, Unicode never messed with the ASCII codes:
public static byte[] StringToAscii(string s) {
byte[] retval = new byte[s.Length];
for (int ix = 0; ix < s.Length; ++ix) {
char ch = s[ix];
if (ch <= 0x7f) retval[ix] = (byte)ch;
else retval[ix] = (byte)'?';
}
return retval;
}
Not really seeing any detail in your question this could be off track. You are right Silverlight has no support for the ASCII encoding.
However, I suspect that in fact UTF8 will do what you need. It's worth bearing in mind that a sequence of single-byte ASCII-only characters and the same set of characters encoded as UTF-8 are identical. That is, the complete ASCII character set is repeated verbatim by the first 128 single-byte code points in UTF-8.
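A quick way to convince yourself of this, assuming the input really contains nothing above 0x7F (AsciiViaUtf8 is just an illustrative wrapper name):

```csharp
using System.Text;

public static class AsciiViaUtf8
{
    // Assumes the caller has already ensured the text is pure ASCII (all chars <= 0x7F);
    // any other character would be encoded as two or more bytes.
    public static byte[] GetBytes(string asciiOnlyText)
    {
        return Encoding.UTF8.GetBytes(asciiOnlyText);
    }
}
```

For "HELLO" this returns the five bytes 0x48, 0x45, 0x4C, 0x4C, 0x4F, the same values an ASCII encoder would produce.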
I have a Silverlight app that writes CSV files, which have to be encoded in ASCII (using UTF-8 causes accented characters to show up wrong when you open the files in Excel).
Since Silverlight doesn't have an Encoding.ASCII class, I implemented one as follows. It works for me, hope it's useful to you as well:
/// <summary>
/// Silverlight doesn't have an ASCII encoder, so here is one:
/// </summary>
public class AsciiEncoding : System.Text.Encoding
{
public override int GetMaxByteCount(int charCount)
{
return charCount;
}
public override int GetMaxCharCount(int byteCount)
{
return byteCount;
}
public override int GetByteCount(char[] chars, int index, int count)
{
return count;
}
public override byte[] GetBytes(char[] chars)
{
return base.GetBytes(chars);
}
public override int GetCharCount(byte[] bytes)
{
return bytes.Length;
}
public override int GetBytes(char[] chars, int charIndex, int charCount, byte[] bytes, int byteIndex)
{
for (int i = 0; i < charCount; i++)
{
bytes[byteIndex + i] = (byte)chars[charIndex + i];
}
return charCount;
}
public override int GetCharCount(byte[] bytes, int index, int count)
{
return count;
}
public override int GetChars(byte[] bytes, int byteIndex, int byteCount, char[] chars, int charIndex)
{
for (int i = 0; i < byteCount; i++)
{
chars[charIndex + i] = (char)bytes[byteIndex + i];
}
return byteCount;
}
}
I had a similar problem using Xamarin (Mono) for Android, where I'm using a Portable Class Library and they don't support Encoding.ASCII.
Instead, the only working solution (except doing it manually) is this one
Uri.EscapeDataString(yourString);
See this answer which provide additional information.
I started from @Hans Passant's answer and I rewrote it with LINQ:
/// <summary>
/// Converts a string to ASCII (7-bit) bytes, replacing any non-ASCII character with '?'.
/// </summary>
/// <see href="http://stackoverflow.com/a/4022893/1248177"/>
/// <param name="s">The string to convert.</param>
/// <returns>One byte per character of <paramref name="s"/>.</returns>
public static byte[] StringToAscii(string s)
{
return (from char c in s select (byte)((c <= 0x7f) ? c : '?')).ToArray();
}
You may want to remove the call to ToArray() and return an IEnumerable<byte> instead of a byte[].
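A lazy variant along those lines might look like the sketch below. The names AsciiLazy and StringToAsciiLazy are made up here; the conversion rule is the same as above (anything above 0x7F becomes '?').

```csharp
using System.Collections.Generic;
using System.Linq;

static class AsciiLazy
{
    // Hypothetical lazy variant: bytes are produced only as the caller enumerates.
    public static IEnumerable<byte> StringToAsciiLazy(string s)
    {
        return s.Select(c => (byte)(c <= 0x7f ? c : '?'));
    }
}
```

The trade-off is that nothing is converted until the sequence is enumerated, and enumerating it twice repeats the work, so callers that need random access should still call ToArray() themselves.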
According to this MS forum thread, Windows Phone 7 does not support Encoding.ASCII.
What do you think is the best way to find the position in a System.IO.Stream where a given byte sequence starts (its first occurrence)?
public static long FindPosition(Stream stream, byte[] byteSequence)
{
long position = -1;
/// ???
return position;
}
P.S. The simplest yet fastest solution is preferred. :)
I've arrived at this solution.
I did some benchmarks with an ASCII file that was 3.050 KB and 38803 lines.
With a 22-byte search array located in the last line of the file, I got the result in about 2.28 seconds (on a slow/old machine).
public static long FindPosition(Stream stream, byte[] byteSequence)
{
if (byteSequence.Length > stream.Length)
return -1;
byte[] buffer = new byte[byteSequence.Length];
using (BufferedStream bufStream = new BufferedStream(stream, byteSequence.Length))
{
int i;
while ((i = bufStream.Read(buffer, 0, byteSequence.Length)) == byteSequence.Length)
{
if (byteSequence.SequenceEqual(buffer))
return bufStream.Position - byteSequence.Length;
else
bufStream.Position -= byteSequence.Length - PadLeftSequence(buffer, byteSequence);
}
}
return -1;
}
private static int PadLeftSequence(byte[] bytes, byte[] seqBytes)
{
int i = 1;
while (i < bytes.Length)
{
int n = bytes.Length - i;
byte[] aux1 = new byte[n];
byte[] aux2 = new byte[n];
Array.Copy(bytes, i, aux1, 0, n);
Array.Copy(seqBytes, aux2, n);
if (aux1.SequenceEqual(aux2))
return i;
i++;
}
return i;
}
If you treat the stream like any other sequence of bytes, you can search it just as you would do a string search. Wikipedia has a great article on that. Boyer-Moore is a good and simple algorithm for this.
Here's a quick hack I put together in Java. It works, and it's pretty close to Boyer-Moore, if not exactly that. Hope it helps ;)
public static final int BUFFER_SIZE = 32;
public static int [] buildShiftArray(byte [] byteSequence){
int [] shifts = new int[byteSequence.length];
int [] ret;
int shiftCount = 0;
byte end = byteSequence[byteSequence.length-1];
int index = byteSequence.length-1;
int shift = 1;
while(--index >= 0){
if(byteSequence[index] == end){
shifts[shiftCount++] = shift;
shift = 1;
} else {
shift++;
}
}
ret = new int[shiftCount];
for(int i = 0;i < shiftCount;i++){
ret[i] = shifts[i];
}
return ret;
}
public static byte [] flushBuffer(byte [] buffer, int keepSize){
byte [] newBuffer = new byte[buffer.length];
for(int i = 0;i < keepSize;i++){
newBuffer[i] = buffer[buffer.length - keepSize + i];
}
return newBuffer;
}
public static int findBytes(byte [] haystack, int haystackSize, byte [] needle, int [] shiftArray){
int index = needle.length;
int searchIndex, needleIndex, currentShiftIndex = 0, shift;
boolean shiftFlag = false;
index = needle.length;
while(true){
needleIndex = needle.length-1;
while(true){
if(index >= haystackSize)
return -1;
if(haystack[index] == needle[needleIndex])
break;
index++;
}
searchIndex = index;
needleIndex = needle.length-1;
while(needleIndex >= 0 && haystack[searchIndex] == needle[needleIndex]){
searchIndex--;
needleIndex--;
}
if(needleIndex < 0)
return index-needle.length+1;
if(shiftFlag){
shiftFlag = false;
index += shiftArray[0];
currentShiftIndex = 1;
} else if(currentShiftIndex >= shiftArray.length){
shiftFlag = true;
index++;
} else{
index += shiftArray[currentShiftIndex++];
}
}
}
public static int findBytes(InputStream stream, byte [] needle){
byte [] buffer = new byte[BUFFER_SIZE];
int [] shiftArray = buildShiftArray(needle);
int bufferSize, initBufferSize;
int offset = 0, init = needle.length;
int val;
try{
while(true){
bufferSize = stream.read(buffer, needle.length-init, buffer.length-needle.length+init);
if(bufferSize == -1)
return -1;
if((val = findBytes(buffer, bufferSize+needle.length-init, needle, shiftArray)) != -1)
return val+offset;
buffer = flushBuffer(buffer, needle.length);
offset += bufferSize-init;
init = 0;
}
} catch (IOException e){
e.printStackTrace();
}
return -1;
}
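For comparison with the C# answers in this thread, here is a minimal sketch of the Boyer-Moore-Horspool variant (a simplification of Boyer-Moore that uses only the bad-character table). It is not a port of the Java code above, and it searches a byte array that is already in memory, so a stream would first have to be read into a buffer. The class name HorspoolSearch is made up.

```csharp
using System;

static class HorspoolSearch
{
    // Returns the index of the first occurrence of needle in haystack, or -1.
    public static int IndexOf(byte[] haystack, byte[] needle)
    {
        if (needle.Length == 0) return 0;
        if (needle.Length > haystack.Length) return -1;

        // Bad-character table: how far the window may slide, keyed by the
        // haystack byte aligned with the last position of the needle.
        int[] shift = new int[256];
        for (int b = 0; b < shift.Length; b++)
            shift[b] = needle.Length;
        for (int k = 0; k < needle.Length - 1; k++)
            shift[needle[k]] = needle.Length - 1 - k;

        int pos = 0;
        while (pos <= haystack.Length - needle.Length)
        {
            // Compare the window against the needle from right to left.
            int j = needle.Length - 1;
            while (j >= 0 && haystack[pos + j] == needle[j])
                j--;
            if (j < 0)
                return pos; // full match

            pos += shift[haystack[pos + needle.Length - 1]];
        }
        return -1;
    }
}
```

The table lets the search skip up to needle.Length bytes per step when the byte under the end of the window does not occur in the needle, which is where the speed-up over a byte-by-byte scan comes from.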
You'll basically need to keep a buffer the same size as byteSequence so that once you've found that the "next byte" in the stream matches, you can check the rest but then still go back to the "next but one" byte if it's not an actual match.
It's likely to be a bit fiddly whatever you do, to be honest :(
I needed to do this myself, had already started, and didn't like the solutions above. I specifically needed to find where the search-byte-sequence ends. In my situation, I need to fast-forward the stream until after that byte sequence. But you can use my solution for this question too:
var afterSequence = stream.ScanUntilFound(byteSequence);
var beforeSequence = afterSequence - byteSequence.Length;
Here is StreamExtensions.cs
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace System
{
static class StreamExtensions
{
/// <summary>
/// Advances the supplied stream until the given searchBytes are found, without advancing too far (consuming any bytes from the stream after the searchBytes are found).
/// Regarding efficiency, if the stream is network or file, then MEMORY/CPU optimisations will be of little consequence here.
/// </summary>
/// <param name="stream">The stream to search in</param>
/// <param name="searchBytes">The byte sequence to search for</param>
/// <returns></returns>
public static int ScanUntilFound(this Stream stream, byte[] searchBytes)
{
// For this class code comments, a common example is assumed:
// searchBytes are {1,2,3,4} or 1234 for short
// # means value that is outside of search byte sequence
byte[] streamBuffer = new byte[searchBytes.Length];
int nextRead = searchBytes.Length;
int totalScannedBytes = 0;
while (true)
{
FillBuffer(stream, streamBuffer, nextRead);
totalScannedBytes += nextRead; //this is only used for final reporting of where it was found in the stream
if (ArraysMatch(searchBytes, streamBuffer, 0))
return totalScannedBytes; //found it
nextRead = FindPartialMatch(searchBytes, streamBuffer);
}
}
/// <summary>
/// Check all offsets, for partial match.
/// </summary>
/// <param name="searchBytes"></param>
/// <param name="streamBuffer"></param>
/// <returns>The amount of bytes which need to be read in, next round</returns>
static int FindPartialMatch(byte[] searchBytes, byte[] streamBuffer)
{
// 1234 = 0 - found it. this special case is already catered directly in ScanUntilFound
// #123 = 1 - partially matched, only missing 1 value
// ##12 = 2 - partially matched, only missing 2 values
// ###1 = 3 - partially matched, only missing 3 values
// #### = 4 - not matched at all
for (int i = 1; i < searchBytes.Length; i++)
{
if (ArraysMatch(searchBytes, streamBuffer, i))
{
// EG. Searching for 1234, have #123 in the streamBuffer, and [i] is 1
// Output: 123#, where # will be read using FillBuffer next.
Array.Copy(streamBuffer, i, streamBuffer, 0, searchBytes.Length - i);
return i; //if an offset of [i], makes a match then only [i] bytes need to be read from the stream to check if there's a match
}
}
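return searchBytes.Length; // no partial match: refill the whole buffer (was hard-coded to 4, the length used in the examples)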
}
/// <summary>
/// Reads bytes from the stream, making sure the requested amount of bytes are read (streams don't always fulfill the full request first time)
/// </summary>
/// <param name="stream">The stream to read from</param>
/// <param name="streamBuffer">The buffer to read into</param>
/// <param name="bytesNeeded">How many bytes are needed. If less than the full size of the buffer, it fills the tail end of the streamBuffer</param>
static void FillBuffer(Stream stream, byte[] streamBuffer, int bytesNeeded)
{
// EG1. [123#] - bytesNeeded is 1, when the streamBuffer contains first three matching values, but now we need to read in the next value at the end
// EG2. [####] - bytesNeeded is 4
var bytesAlreadyRead = streamBuffer.Length - bytesNeeded; //invert
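while (bytesAlreadyRead < streamBuffer.Length)
{
int read = stream.Read(streamBuffer, bytesAlreadyRead, streamBuffer.Length - bytesAlreadyRead);
if (read == 0) // Read returns 0 at end of stream; without this check the loop never terminates
throw new EndOfStreamException("Reached the end of the stream before the byte sequence was found.");
bytesAlreadyRead += read;
}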
}
/// <summary>
/// Checks if arrays match exactly, or with offset.
/// </summary>
/// <param name="searchBytes">Bytes to search for. Eg. [1234]</param>
/// <param name="streamBuffer">Buffer to match in. Eg. [#123] </param>
/// <param name="startAt">When this is zero, all bytes are checked. Eg. If this value 1, and it matches, this means the next byte in the stream to read may mean a match</param>
/// <returns></returns>
static bool ArraysMatch(byte[] searchBytes, byte[] streamBuffer, int startAt)
{
for (int i = 0; i < searchBytes.Length - startAt; i++)
{
if (searchBytes[i] != streamBuffer[i + startAt])
return false;
}
return true;
}
}
}
This is a bit of an old question, but here's my answer. I've found that reading blocks and then searching within them is extremely inefficient compared to just reading one byte at a time and going from there.
Also, IIRC, the accepted answer would fail if part of the sequence was in one block and the rest in another. For example, given 12345 and searching for 23, it would read 12, not match, then read 34, not match, and so on. I haven't tried it, though, seeing as it requires .NET 4.0. At any rate, this is way simpler, and likely much faster.
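static long ReadOneSrch(Stream haystack, byte[] needle)
{
    // Failure table (as in Knuth-Morris-Pratt): fail[k] is the length of the longest
    // proper prefix of needle that is also a suffix of needle[0..k]. Simply resetting
    // the match index to 0 or 1 on a mismatch misses cases such as "aab" in "aaab".
    int[] fail = new int[needle.Length];
    for (int k = 1, j = 0; k < needle.Length; k++)
    {
        while (j > 0 && needle[k] != needle[j])
            j = fail[j - 1];
        if (needle[k] == needle[j])
            j++;
        fail[k] = j;
    }

    int b;
    int i = 0; // number of needle bytes matched so far
    while ((b = haystack.ReadByte()) != -1)
    {
        while (i > 0 && b != needle[i])
            i = fail[i - 1];
        if (b == needle[i])
            i++;
        if (i == needle.Length)
            return haystack.Position - needle.Length;
    }
    return -1;
}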
static long Search(Stream stream, byte[] pattern)
{
    stream.Seek(0, SeekOrigin.Begin);
    while (stream.Position < stream.Length)
    {
        if (stream.ReadByte() != pattern[0])
            continue;
        long start = stream.Position - 1;
        bool matched = true;
        for (int idx = 1; idx < pattern.Length; idx++)
        {
            if (stream.ReadByte() != pattern[idx])
            {
                // Rewind to just after the candidate start so that an overlapping
                // occurrence (e.g. "ab" in "aab") is not skipped.
                stream.Position = start + 1;
                matched = false;
                break;
            }
        }
        if (matched)
            return start;
    }
    return -1;
}