I do not know much about compression algorithms. I am looking for a simple compression algorithm (or code snippet) which can reduce the size of a byte[,,] or byte[]. I cannot make use of System.IO.Compression. Also, the data has lots of repetition.
I tried implementing the RLE algorithm (posted below for your inspection). However, it produces arrays that are 1.2 to 1.8 times larger than the input.
public static class RLE
{
    public static byte[] Encode(byte[] source)
    {
        List<byte> dest = new List<byte>();
        byte runLength;
        for (int i = 0; i < source.Length; i++)
        {
            runLength = 1;
            while (runLength < byte.MaxValue
                && i + 1 < source.Length
                && source[i] == source[i + 1])
            {
                runLength++;
                i++;
            }
            dest.Add(runLength);
            dest.Add(source[i]);
        }
        return dest.ToArray();
    }

    public static byte[] Decode(byte[] source)
    {
        List<byte> dest = new List<byte>();
        byte runLength;
        for (int i = 1; i < source.Length; i += 2)
        {
            runLength = source[i - 1];
            while (runLength > 0)
            {
                dest.Add(source[i]);
                runLength--;
            }
        }
        return dest.ToArray();
    }
}
I have also found a Java LZW implementation that works on strings and integers. I converted it to C# and the results look good (code posted below). However, I am not sure how it works, nor how to make it work with bytes instead of strings and integers.
public class LZW
{
    /* Compress a string to a list of output symbols. */
    public static int[] compress(string uncompressed)
    {
        // Build the dictionary.
        int dictSize = 256;
        Dictionary<string, int> dictionary = new Dictionary<string, int>();
        for (int i = 0; i < dictSize; i++)
            dictionary.Add("" + (char)i, i);

        string w = "";
        List<int> result = new List<int>();
        for (int i = 0; i < uncompressed.Length; i++)
        {
            char c = uncompressed[i];
            string wc = w + c;
            if (dictionary.ContainsKey(wc))
                w = wc;
            else
            {
                result.Add(dictionary[w]);
                // Add wc to the dictionary.
                dictionary.Add(wc, dictSize++);
                w = "" + c;
            }
        }
        // Output the code for w.
        if (w != "")
            result.Add(dictionary[w]);
        return result.ToArray();
    }

    /* Decompress a list of output ks to a string. */
    public static string decompress(int[] compressed)
    {
        int dictSize = 256;
        Dictionary<int, string> dictionary = new Dictionary<int, string>();
        for (int i = 0; i < dictSize; i++)
            dictionary.Add(i, "" + (char)i);

        string w = "" + (char)compressed[0];
        string result = w;
        for (int i = 1; i < compressed.Length; i++)
        {
            int k = compressed[i];
            string entry = "";
            if (dictionary.ContainsKey(k))
                entry = dictionary[k];
            else if (k == dictSize)
                entry = w + w[0];

            result += entry;
            // Add w+entry[0] to the dictionary.
            dictionary.Add(dictSize++, w + entry[0]);
            w = entry;
        }
        return result;
    }
}
Have a look here. I used this code as a basis for compression in one of my work projects. I'm not sure how much of the .NET Framework is accessible in the Xbox 360 SDK, so I'm not sure how well this will work for you.
The problem with that RLE algorithm is that it is too simple. It prefixes every byte with how many times it is repeated, but that does mean that in long ranges of non-repeating bytes, each single byte is prefixed with a "1". On data without any repetitions this will double the file size.
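For example, with the Encode method from the question, the four distinct bytes { 1, 2, 3, 4 } come out as { 1, 1, 1, 2, 1, 3, 1, 4 } -- eight bytes instead of four.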
This can be avoided by using Code-type RLE instead; the 'Code' (also called 'Token') is a byte that can have two meanings: either it indicates how many times the single following byte is repeated, or it indicates how many non-repeating bytes follow that should be copied as they are. The two cases are distinguished by the highest bit, which leaves 7 bits for the value, so the amount to copy or repeat per code can be up to 127.
This means that even in worst-case scenarios, the final size can only be about 1/127th larger than the original file size.
A good explanation of the whole concept, plus full working (and, in fact, heavily optimised) C# code, can be found here:
http://www.shikadi.net/moddingwiki/RLE_Compression
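If you just want to see the shape of the idea before reading that article, here is a rough sketch (my own layout and thresholds, not the article's code, and not optimised): a code byte with the high bit set means "repeat the next byte (code & 0x7F) times", and with the high bit clear it means "copy the next (code) bytes verbatim".

// Sketch only: runs shorter than 3 are cheaper to emit as literals.
public static byte[] EncodeCoded(byte[] source)
{
    var dest = new List<byte>();
    int i = 0;
    while (i < source.Length)
    {
        // measure the run starting at i (capped at 127, the largest 7-bit count)
        int run = 1;
        while (run < 127 && i + run < source.Length && source[i + run] == source[i])
            run++;

        if (run >= 3)
        {
            dest.Add((byte)(0x80 | run));   // high bit set: "repeat" code
            dest.Add(source[i]);
            i += run;
        }
        else
        {
            // collect literals until a run of 3+ starts (or the 127 cap is hit)
            int start = i;
            while (i < source.Length && i - start < 127)
            {
                int ahead = 1;
                while (ahead < 3 && i + ahead < source.Length && source[i + ahead] == source[i])
                    ahead++;
                if (ahead >= 3) break;
                i++;
            }
            dest.Add((byte)(i - start));    // high bit clear: "copy" code
            for (int j = start; j < i; j++)
                dest.Add(source[j]);
        }
    }
    return dest.ToArray();
}

public static byte[] DecodeCoded(byte[] source)
{
    var dest = new List<byte>();
    int i = 0;
    while (i < source.Length)
    {
        byte code = source[i++];
        if ((code & 0x80) != 0)
        {
            int count = code & 0x7F;
            for (int n = 0; n < count; n++) dest.Add(source[i]);
            i++;
        }
        else
        {
            for (int n = 0; n < code; n++) dest.Add(source[i + n]);
            i += code;
        }
    }
    return dest.ToArray();
}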
Note that sometimes the data will end up larger than the original anyway, simply because there are not enough repeating bytes in it for RLE to work. A good way to deal with such compression failures is to add a header to your final data: a single extra byte at the start that is 0 for uncompressed data and 1 for RLE-compressed data. When RLE fails to give a smaller result, you simply save the data uncompressed with the 0 in front, and your final data will be exactly one byte larger than the original. The system at the other side can then read that starting byte and use it to determine whether the following data should be decompressed or just copied.
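A minimal sketch of that header byte, assuming the RLE.Encode/Decode pair from the question (any RLE variant works the same way):

public static byte[] Pack(byte[] source)
{
    byte[] rle = RLE.Encode(source);
    if (rle.Length < source.Length)
    {
        var packed = new byte[rle.Length + 1];
        packed[0] = 1;                                       // 1 = RLE compressed
        Buffer.BlockCopy(rle, 0, packed, 1, rle.Length);
        return packed;
    }
    var stored = new byte[source.Length + 1];
    stored[0] = 0;                                           // 0 = stored as-is (worst case: +1 byte)
    Buffer.BlockCopy(source, 0, stored, 1, source.Length);
    return stored;
}

public static byte[] Unpack(byte[] packed)
{
    var payload = new byte[packed.Length - 1];
    Buffer.BlockCopy(packed, 1, payload, 0, payload.Length);
    return packed[0] == 1 ? RLE.Decode(payload) : payload;
}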
Look into Huffman codes; it's a pretty simple algorithm. Basically, use fewer bits for patterns that show up more often, and keep a table of how each pattern is encoded. Note that the codewords have to be chosen so that no separators are needed to help you decode (i.e. the codes are prefix-free).
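To make that concrete, here is a rough sketch of building just the code table (it uses a plain list re-sort instead of a proper priority queue, and writing the packed bit stream plus the header needed to rebuild the table when decoding is not shown):

// Needs System, System.Collections.Generic and System.Linq.
class Node
{
    public byte Symbol;
    public int Weight;
    public Node Left, Right;
}

static Dictionary<byte, string> BuildHuffmanCodes(byte[] data)
{
    // count how often each byte value occurs
    var freq = new Dictionary<byte, int>();
    foreach (byte b in data)
        freq[b] = freq.TryGetValue(b, out int n) ? n + 1 : 1;

    // one leaf node per distinct byte
    var nodes = freq.Select(kv => new Node { Symbol = kv.Key, Weight = kv.Value }).ToList();
    var codes = new Dictionary<byte, string>();
    if (nodes.Count == 0) return codes;

    // repeatedly merge the two lightest nodes until a single tree remains
    while (nodes.Count > 1)
    {
        nodes.Sort((a, b) => a.Weight.CompareTo(b.Weight));
        var parent = new Node { Weight = nodes[0].Weight + nodes[1].Weight, Left = nodes[0], Right = nodes[1] };
        nodes.RemoveRange(0, 2);
        nodes.Add(parent);
    }

    // walk the tree: left = '0', right = '1'; frequent bytes end up with short codes
    void Walk(Node node, string prefix)
    {
        if (node.Left == null)                                   // leaf
        {
            codes[node.Symbol] = prefix.Length == 0 ? "0" : prefix;
            return;
        }
        Walk(node.Left, prefix + "0");
        Walk(node.Right, prefix + "1");
    }
    Walk(nodes[0], "");
    return codes;
}

Because the resulting codes are prefix-free, the decoder can walk the same tree bit by bit without any separators between codewords.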
I have a video processing application that moves a lot of data.
To speed things up, I have made a lookup table, since many calculations in essence only need to be computed once and can then be reused.
However, I'm at the point where all the lookups now take 30% of the processing time. I'm wondering if it might be slow RAM; either way, I would still like to try to optimize it some more.
Currently I have the following:
public readonly int[] largeArray = new int[3000*2000];
public readonly int[] lookUp = new int[width*height];
I then perform a lookup with a pointer p (which is equivalent to width * y + x) to fetch the result.
int[] newResults = new int[width*height];
int p = 0;
for (int y = 0; y < height; y++) {
    for (int x = 0; x < width; x++, p++) {
        newResults[p] = largeArray[lookUp[p]];
    }
}
Note that I cannot do an entire array copy to optimize. Also, the application is heavily multithreaded.
Some progress was made by shortening the call stack, so there are no getters, just a straight retrieval from a readonly array.
I've tried converting to ushort as well, but it seemed to be slower (as I understand it's due to word size).
Would an IntPtr be faster? How would I go about that?
Attached below is a screenshot of time distribution:
It looks like what you're doing here is effectively a "gather". Modern CPUs have dedicated instructions for this, in particular VPGATHER**. This is exposed in .NET Core 3, and should work something like below; this is the single-loop scenario (you can probably work from here to get the double-loop version):
results first:
AVX enabled: False; slow loop from 0
e7ad04457529f201558c8a53f639fed30d3a880f75e613afe203e80a7317d0cb
for 524288 loops: 1524ms
AVX enabled: True; slow loop from 1024
e7ad04457529f201558c8a53f639fed30d3a880f75e613afe203e80a7317d0cb
for 524288 loops: 667ms
code:
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
static class P
{
static int Gather(int[] source, int[] index, int[] results, bool avx)
{ // normally you wouldn't have avx as a parameter; that is just so
// I can turn it off and on for the test; likewise the "int" return
// here is so I can monitor (in the test) how much we did in the "old"
// loop, vs AVX2; in real code this would be void return
int y = 0;
if (Avx2.IsSupported && avx)
{
var iv = MemoryMarshal.Cast<int, Vector256<int>>(index);
var rv = MemoryMarshal.Cast<int, Vector256<int>>(results);
unsafe
{
fixed (int* sPtr = source)
{
// note: here I'm assuming we are trying to fill "results" in
// a single outer loop; for a double-loop, you'll probably need
// to slice the spans
for (int i = 0; i < rv.Length; i++)
{
rv[i] = Avx2.GatherVector256(sPtr, iv[i], 4);
}
}
}
// move past everything we've processed via SIMD
y += rv.Length * Vector256<int>.Count;
}
// now do anything left, which includes anything not aligned to 256 bits,
// plus the "no AVX2" scenario
int result = y;
int end = results.Length; // hoist, since this is not the JIT recognized pattern
for (; y < end; y++)
{
results[y] = source[index[y]];
}
return result;
}
static void Main()
{
// invent some random data
var rand = new Random(12345);
int size = 1024 * 512;
int[] data = new int[size];
for (int i = 0; i < data.Length; i++)
data[i] = rand.Next(255);
// build a fake index
int[] index = new int[1024];
for (int i = 0; i < index.Length; i++)
index[i] = rand.Next(size);
int[] results = new int[1024];
void GatherLocal(bool avx)
{
// prove that we're getting the same data
Array.Clear(results, 0, results.Length);
int from = Gather(data, index, results, avx);
Console.WriteLine($"AVX enabled: {avx}; slow loop from {from}");
for (int i = 0; i < 32; i++)
{
Console.Write(results[i].ToString("x2"));
}
Console.WriteLine();
const int TimeLoop = 1024 * 512;
var watch = Stopwatch.StartNew();
for (int i = 0; i < TimeLoop; i++)
Gather(data, index, results, avx);
watch.Stop();
Console.WriteLine($"for {TimeLoop} loops: {watch.ElapsedMilliseconds}ms");
Console.WriteLine();
}
GatherLocal(false);
if (Avx2.IsSupported) GatherLocal(true);
}
}
RAM is already one of the fastest things possible. The only memory faster is the CPU caches. So it will be memory bound, but that is still plenty fast.
Of course, at the given sizes this array is 6 million entries in size. That will likely not fit in any cache and will take forever to iterate over. It does not matter what the speed is; this is simply too much data.
As a general rule, video processing is done on the GPU nowadays. GPUs are literally designed to operate on giant arrays, because that is what the image you are seeing right now is: a giant array.
If you have to keep it on the CPU side, maybe caching or lazy initialisation would help? Chances are that you do not truly need every value; you only need the common values. Take an example from dice rolling: if you roll two 6-sided dice, every result from 2 to 12 is possible, but 7 comes up in 6 out of 36 cases, while 2 and 12 come up in only 1 out of 36 cases each. So having the 7 stored is a lot more beneficial than the 2 and 12.
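As a rough sketch of that lazy approach (Calculate here is a hypothetical stand-in for whatever expensive computation currently fills the table; note this naive version is not thread-safe, which matters for a heavily multithreaded app):

private readonly int[] cache = new int[3000 * 2000];
private readonly bool[] isCached = new bool[3000 * 2000];

public int GetValue(int index)
{
    if (!isCached[index])
    {
        cache[index] = Calculate(index);   // hypothetical: the expensive computation
        isCached[index] = true;
    }
    return cache[index];
}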
I want to edit a text file so that each line appears in it only once. Each line is always exactly 10 characters long, and I generally work with 5-6 million lines, so the code I am currently using consumes too much RAM.
My code:
File.WriteAllLines(targetpath, File.ReadAllLines(sourcepath).Distinct())
So how can I make it consume less RAM and less time at the same time?
Taking into account how much memory a string takes in C#, and assuming a length of 10 characters for 6 million records, we get:
size in bytes ~= 20 + (length / 2) * 4
total size in bytes ~= (20 + (10 / 2) * 4) * 6,000,000 = 240,000,000
total size in MB ~= 230
Now, 230 MB of space is not really a problem, even on x86 (a 32-bit system), so you can load all that data into memory.
For this, I would use the HashSet<string> class, which is, as the name says, a hash set that will let you easily eliminate the duplicates by doing a lookup before adding an element.
In terms of big-O notation for time complexity, the average performance of a lookup in a hash set is O(1), which is the best you can get. In total, you would use lookup N times, totalling to N * O(1) = O(N)
In terms of big-O notation for space complexity, you would have O(N) space used, meaning that you use up memory proportional to number of elements, which is also the best you can get.
I'm not sure it is even possible to use up less space if you implement the algorithm in C# and not rely on any external components (that would also use at least O(N))
That being said, you can optimize for some scenarios by reading your file sequentially, line by line, see here.
This would give a better result if you have lots of duplicates, but worst case scenario when all the lines are distinct would consume the same amount of memory.
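A sketch of that sequential approach, reusing sourcepath/targetpath from the question; memory then grows with the number of distinct lines seen so far rather than with the whole file:

// Needs System.IO and System.Collections.Generic.
var seen = new HashSet<string>();
using (var reader = new StreamReader(sourcepath))
using (var writer = new StreamWriter(targetpath))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        if (seen.Add(line))          // Add returns false for a duplicate
            writer.WriteLine(line);
    }
}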
On a final note, if you look at how the Distinct method is implemented, you will see it also uses an implementation of a hash table; although it's not the same class, the performance is still roughly the same. Check out this question for more details.
As ironstone13 corrected me, HashSet is OK, but does store the data.
Then this works fine too:
string[] arr = File.ReadAllLines("file.txt");
HashSet<string> hashes = new HashSet<string>();
for (int i = 0; i < arr.Length; i++)
{
    if (!hashes.Add(arr[i])) arr[i] = null;
}
File.WriteAllLines("file2.txt", arr.Where(x => x != null));
This implementation was motivated by memory use and hash conflicts.
The main idea was to keep just the hashes; of course, it would have to go back to the file to fetch the line whenever it sees a hash conflict/duplicate, to detect which one it really is (that part is not implemented).
class Program
{
static string[] arr;
static Dictionary<int, int>[] hashes = new Dictionary<int, int>[1]
{ new Dictionary<int, int>() }
;
static int[] file_indexes = {-1};
static void AddHash(int hash, int index)
{
for (int h = 0; h < hashes.Length; h++)
{
Dictionary<int, int> dict = hashes[h];
if (!dict.ContainsKey(hash))
{
dict[hash] = index;
return;
}
}
hashes = hashes.Union(new[] {new Dictionary<int, int>() {{hash, index}}}).ToArray();
file_indexes = Enumerable.Range(0, hashes.Length).Select(x => -1).ToArray();
}
static int UpdateFileIndexes(int hash)
{
int updates = 0;
for (int h = 0; h < hashes.Length; h++)
{
int index;
if (hashes[h].TryGetValue(hash, out index))
{
file_indexes[h] = index;
updates++;
}
else
{
file_indexes[h] = -1;
}
}
return updates;
}
static bool IsDuplicate(int index)
{
string str1 = arr[index];
for (int h = 0; h < hashes.Length; h++)
{
int i = file_indexes[h];
if (i == -1 || index == i) continue;
string str0 = arr[i];
if (str0 == null) continue;
if (string.CompareOrdinal(str0, str1) == 0) return true;
}
return false;
}
static void Main(string[] args)
{
arr = File.ReadAllLines("file.txt");
for (int i = 0; i < arr.Length; i++)
{
int hash = arr[i].GetHashCode();
if (UpdateFileIndexes(hash) == 0) AddHash(hash, i);
else if (IsDuplicate(i)) arr[i] = null;
else AddHash(hash, i);
}
File.WriteAllLines("file2.txt", arr.Where(x => x != null));
Console.WriteLine("DONE");
Console.ReadKey();
}
}
Before you write your data, if it is in a list or dictionary, you could run a LINQ query and use GroupBy to group all like keys, then write one entry per group to the output file.
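For example, assuming the lines are already in a List<string> called lines (a hypothetical variable), something like:

// Needs System.IO and System.Linq.
var unique = lines.GroupBy(line => line).Select(g => g.Key);
File.WriteAllLines(targetpath, unique);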
Your question is a little vague as well. Are you creating a new text file every time, and do you have to store the data as text? There are better formats to use, such as XML and JSON.
I have been struggling with the SimHash algorithm for a while. I implemented it according to my understanding in my crawler. However, when I did some tests, it did not seem very reliable to me.
I calculated fingerprints for 200,000 different text documents and saw that some different contents had the same fingerprint, so there is a big possibility of collision.
My implementation code is below.
My question is: if my implementation is right, this algorithm produces a lot of collisions, so how come Google uses it? Otherwise, what is the problem with my implementation?
public long CalculateSimHash(string input)
{
    var vector = GenerateVector(input);

    //5- Generate Fingerprint
    long fingerprint = 0;
    for (var i = 0; i < HashSize; i++)
    {
        if (vector[i] > 0)
        {
            var zz = Convert.ToInt64(1 << i);
            fingerprint += Math.Abs(zz);
        }
    }
    return fingerprint;
}

private int[] GenerateVector(string input)
{
    //1- Tokenize input
    ITokeniser tokeniser = new OverlappingStringTokeniser(2, 1);
    var tokenizedValues = tokeniser.Tokenise(input);

    //2- Hash values
    var hashedValues = HashTokens(tokenizedValues);

    //3- Prepare vector
    var vector = new int[HashSize];
    for (var i = 0; i < HashSize; i++)
    {
        vector[i] = 0;
    }

    //4- Fill vector according to bitset of hash
    foreach (var value in hashedValues)
    {
        for (var j = 0; j < HashSize; j++)
        {
            if (IsBitSet(value, j))
            {
                vector[j] += 1;
            }
            else
            {
                vector[j] -= 1;
            }
        }
    }
    return vector;
}
I can see a couple of issues. First, you're only getting a 32-bit hash, not a 64-bit, because you're using the wrong types. See https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/operators/left-shift-operator
It's also best not to use a signed integer type here, to avoid confusion. So:
// Generate Fingerprint
ulong fingerprint = 0;
for (int i = 0; i < HashSize; i++)
{
    if (vector[i] > 0)
    {
        fingerprint += 1UL << i;
    }
}
Second issue is: I don't know how your OverlappingStringTokenizer works -- so I'm only guessing here -- but if your shingles (overlapping ngrams) are only 2 characters long, then a lot of these shingles will be found in a lot of documents. Chances are that two documents will share a lot of these features even if the purpose and meaning of the documents is quite different.
Because words are the smallest simple unit of meaning when dealing with text, I normally count my tokens in terms of words, not characters. Certainly 2 characters is far too small for an effective feature. I like to generate shingles from, say, 5 words, ignoring punctuation and whitespace.
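For illustration, a crude word-level shingler along those lines (the regex-based splitting and the 5-word window are my assumptions, not a drop-in replacement for the OverlappingStringTokeniser in the question):

// Needs System.Collections.Generic, System.Linq and System.Text.RegularExpressions.
static IEnumerable<string> WordShingles(string text, int size = 5)
{
    var words = Regex.Split(text.ToLowerInvariant(), @"\W+")
                     .Where(w => w.Length > 0)
                     .ToArray();
    for (int i = 0; i + size <= words.Length; i++)
        yield return string.Join(" ", words, i, size);   // e.g. "the quick brown fox jumps"
}

These shingles would then be hashed and fed into the same vector loop as before.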
I'm working on an FFT (fast Fourier transform) function in C#.
I've found the AForge library, but when using it I keep getting the error message:
Incorrect data length.
The data I'm putting in is a list of doubles and the size can vary, depending on the signal I'm sending in.
What to do about this?
This is what my function looks like so far
private void FastFoulierMethod(ISignal signal, List<IMarker> markers)
{
    AForge.Math.Complex[] complex = new AForge.Math.Complex[samples.Count];
    for (int i = 0; i < samples.Count - 1; i++)
    {
        complex[i] = new AForge.Math.Complex(samples[i].GetTimeInSec(), 0);
    }
    AForge.Math.Complex[] complex2 = complex;
    FourierTransform.DFT(complex, FourierTransform.Direction.Backward);
    FourierTransform.FFT(complex2, FourierTransform.Direction.Backward);
}
As per AForge documentation:
The method accepts data array of 2^n size only, where n may vary in the [1, 14] range
So you would need to make sure the input size is correctly padded to a length that is a power of 2, and in the specified range:
double logLength = Math.Ceiling(Math.Log((double)samples.Count, 2.0));
int paddedLength = (int)Math.Pow(2.0, Math.Min(Math.Max(1.0, logLength), 14.0));
AForge.Math.Complex[] complex = new AForge.Math.Complex[paddedLength];

// copy all input samples
int i = 0;
for (; i < samples.Count; i++)
{
    complex[i] = new AForge.Math.Complex(samples[i].GetTimeInSec(), 0);
}

// pad with zeros
for (; i < paddedLength; i++)
{
    complex[i] = new AForge.Math.Complex(0, 0);
}
If you look at the AForge FourierTransform source code you will find the ReorderData method (at the bottom of the file), which is called from the FFT method. In this method there is a check which raises the error:
private static void ReorderData(Complex[] data)
{
    int len = data.Length;

    // check data length
    if ((len < minLength) || (len > maxLength) || (!Tools.IsPowerOf2(len)))
        throw new ArgumentException("Incorrect data length.");

    // rest of code...
}
So the answer to your question is: your array length must be a power of 2, between 2 and 16384.
I know we can append strings using StringBuilder. Is there a way we can prepend strings (i.e. add strings in front of a string) using StringBuilder so we can keep the performance benefits that StringBuilder offers?
Using the insert method with the position parameter set to 0 would be the same as prepending (i.e. inserting at the beginning).
C# example: varStringBuilder.Insert(0, "someThing");
Java example: varStringBuilder.insert(0, "someThing");
It works for both C# and Java.
Prepending a String will usually require copying everything after the insertion point further along in the backing array, so it won't be as quick as appending to the end.
But you can do it like this in Java (in C# it's the same, but the method is called Insert):
aStringBuilder.insert(0, "newText");
If you require high performance with lots of prepends, you'll need to write your own version of StringBuilder (or use someone else's). With the standard StringBuilder (although technically it could be implemented differently) insert requires copying data after the insertion point. Inserting n pieces of text can take O(n^2) time.
A naive approach would be to add an offset into the backing char[] buffer as well as the length. When there is not enough room for a prepend, move the data up by more than is strictly necessary. This can bring performance back down to O(n log n) (I think). A more refined approach is to make the buffer cyclic. In that way the spare space at both ends of the array becomes contiguous.
Here's what you can do if you want to prepend using Java's StringBuilder class:
StringBuilder str = new StringBuilder();
str.insert(0, "text");
You could try an extension method:
/// <summary>
/// kind of a dopey little one-off for StringBuffer, but
/// an example where you can get crazy with extension methods
/// </summary>
public static void Prepend(this StringBuilder sb, string s)
{
    sb.Insert(0, s);
}
StringBuilder sb = new StringBuilder("World!");
sb.Prepend("Hello "); // Hello World!
You could build the string in reverse and then reverse the result.
You incur an O(n) cost instead of an O(n^2) worst case cost.
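A small sketch of that trick: append each piece in reverse, then flip the whole buffer once at the end, so every "prepend" is really a cheap append:

var sb = new StringBuilder();                  // needs System.Text

void PrependReversed(string s)
{
    for (int i = s.Length - 1; i >= 0; i--)
        sb.Append(s[i]);
}

PrependReversed("World!");
PrependReversed("Hello ");

char[] chars = sb.ToString().ToCharArray();
Array.Reverse(chars);
string result = new string(chars);             // "Hello World!"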
If I understand you correctly, the insert method looks like it'll do what you want. Just insert the string at offset 0.
I haven't used it, but Ropes For Java sounds intriguing. The project name is a play on words: use a Rope instead of a String for serious work. It gets around the performance penalty for prepending and other operations. Worth a look, if you're going to be doing a lot of this.
A rope is a high performance replacement for Strings. The datastructure, described in detail in "Ropes: an Alternative to Strings", provides asymptotically better performance than both String and StringBuffer for common string modifications like prepend, append, delete, and insert. Like Strings, ropes are immutable and therefore well-suited for use in multi-threaded programming.
Try using Insert()
StringBuilder MyStringBuilder = new StringBuilder("World!");
MyStringBuilder.Insert(0,"Hello "); // Hello World!
You could create an extension for StringBuilder yourself with a simple class:
namespace Application.Code.Helpers
{
    public static class StringBuilderExtensions
    {
        #region Methods

        public static void Prepend(this StringBuilder sb, string value)
        {
            sb.Insert(0, value);
        }

        public static void PrependLine(this StringBuilder sb, string value)
        {
            sb.Insert(0, value + Environment.NewLine);
        }

        #endregion
    }
}
Then, just add:
using Application.Code.Helpers;
To the top of any class that you want to use the StringBuilder in; any time you use IntelliSense with a StringBuilder variable, the Prepend and PrependLine methods will show up. Just remember that when you use Prepend, you will need to prepend in the reverse order from how you would append.
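For example, to end up with "Hello World!" the pieces are prepended in the opposite order to how they would be appended:

StringBuilder sb = new StringBuilder();
sb.Prepend("World!");    // "World!"
sb.Prepend("Hello ");    // "Hello World!" -- the piece prepended last ends up first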
Judging from the other comments, there's no standard quick way of doing this. Using StringBuilder's .Insert(0, "text") is approximately only 1-3x as fast as using painfully slow String concatenation (based on >10000 concats), so below is a class to prepend potentially thousands of times quicker!
I've included some other basic functionality such as append(), subString() and length() etc. Both appends and prepends vary from about twice as fast to 3x slower than StringBuilder appends. Like StringBuilder, the buffer in this class will automatically increase when the text overflows the old buffer size.
The code has been tested quite a lot, but I can't guarantee it's free of bugs.
class Prepender
{
private char[] c;
private int growMultiplier;
public int bufferSize; // Make public for bug testing
public int left; // Make public for bug testing
public int right; // Make public for bug testing
public Prepender(int initialBuffer = 1000, int growMultiplier = 10)
{
c = new char[initialBuffer];
//for (int n = 0; n < initialBuffer; n++) cc[n] = '.'; // For debugging purposes (used fixed width font for testing)
left = initialBuffer / 2;
right = initialBuffer / 2;
bufferSize = initialBuffer;
this.growMultiplier = growMultiplier;
}
public void clear()
{
left = bufferSize / 2;
right = bufferSize / 2;
}
public int length()
{
return right - left;
}
private void increaseBuffer()
{
int nudge = -bufferSize / 2;
bufferSize *= growMultiplier;
nudge += bufferSize / 2;
char[] tmp = new char[bufferSize];
for (int n = left; n < right; n++) tmp[n + nudge] = c[n];
left += nudge;
right += nudge;
c = new char[bufferSize];
//for (int n = 0; n < buffer; n++) cc[n]='.'; // For debugging purposes (used fixed width font for testing)
for (int n = left; n < right; n++) c[n] = tmp[n];
}
public void append(string s)
{
// If necessary, increase buffer size by growMultiplier
while (right + s.Length > bufferSize) increaseBuffer();
// Append user input to buffer
int len = s.Length;
for (int n = 0; n < len; n++)
{
c[right] = s[n];
right++;
}
}
public void prepend(string s)
{
// If necessary, increase buffer size by growMultiplier
while (left - s.Length < 0) increaseBuffer();
// Prepend user input to buffer
int len = s.Length - 1;
for (int n = len; n > -1; n--)
{
left--;
c[left] = s[n];
}
}
public void truncate(int start, int finish)
{
if (start < 0) throw new Exception("Truncation error: Start < 0");
if (left + finish > right) throw new Exception("Truncation error: Finish > string length");
if (finish < start) throw new Exception("Truncation error: Finish < start");
//MessageBox.Show(left + " " + right);
right = left + finish;
left = left + start;
}
public string subString(int start, int finish)
{
if (start < 0) throw new Exception("Substring error: Start < 0");
if (left + finish > right) throw new Exception("Substring error: Finish > string length");
if (finish < start) throw new Exception("Substring error: Finish < start");
return toString(start,finish);
}
public override string ToString()
{
return new string(c, left, right - left);
//return new string(cc, 0, buffer); // For debugging purposes (used fixed width font for testing)
}
private string toString(int start, int finish)
{
return new string(c, left+start, finish-start );
//return new string(cc, 0, buffer); // For debugging purposes (used fixed width font for testing)
}
}
This should work, although note that it builds a brand-new StringBuilder from a plain string concatenation, so it loses the performance benefit:
aStringBuilder = new StringBuilder("newText" + aStringBuilder);