Is the simhash function that reliable? - C#

I have been struggling with the simhash algorithm for a while. I implemented it in my crawler according to my understanding of it. However, when I ran some tests, it did not seem very reliable to me.
I calculated fingerprints for 200,000 different text documents and found that some different contents produced the same fingerprint, so there is a high probability of collision.
My implementation code is below.
My question is: if my implementation is right, this algorithm produces a lot of collisions. How come Google uses it? Otherwise, what is the problem with my implementation?
public long CalculateSimHash(string input)
{
    var vector = GenerateVector(input);

    // 5 - Generate fingerprint
    long fingerprint = 0;
    for (var i = 0; i < HashSize; i++)
    {
        if (vector[i] > 0)
        {
            var zz = Convert.ToInt64(1 << i);
            fingerprint += Math.Abs(zz);
        }
    }
    return fingerprint;
}
private int[] GenerateVector(string input)
{
    // 1 - Tokenize input
    ITokeniser tokeniser = new OverlappingStringTokeniser(2, 1);
    var tokenizedValues = tokeniser.Tokenise(input);

    // 2 - Hash values
    var hashedValues = HashTokens(tokenizedValues);

    // 3 - Prepare vector
    var vector = new int[HashSize];
    for (var i = 0; i < HashSize; i++)
    {
        vector[i] = 0;
    }

    // 4 - Fill vector according to the bit set of each hash
    foreach (var value in hashedValues)
    {
        for (var j = 0; j < HashSize; j++)
        {
            if (IsBitSet(value, j))
            {
                vector[j] += 1;
            }
            else
            {
                vector[j] -= 1;
            }
        }
    }
    return vector;
}
I can see a couple of issues. First, you're only getting a 32-bit hash, not a 64-bit one, because you're using the wrong types: 1 << i is an int shift, and for int C# masks the shift count to its low 5 bits, so the bits simply wrap around once i reaches 32. See https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/operators/left-shift-operator
It's also best not to use a signed integer type here, to avoid confusion. So:
// Generate fingerprint
ulong fingerprint = 0;
for (int i = 0; i < HashSize; i++)
{
    if (vector[i] > 0)
    {
        fingerprint += 1UL << i;
    }
}
The second issue is: I don't know how your OverlappingStringTokeniser works -- so I'm only guessing here -- but if your shingles (overlapping n-grams) are only 2 characters long, then a lot of these shingles will be found in a lot of documents. Chances are that two documents will share many of these features even if the purpose and meaning of the documents are quite different.
Because words are the smallest simple unit of meaning when dealing with text, I normally count my tokens in terms of words, not characters. Certainly 2 characters is far too small for an effective feature. I like to generate shingles from, say, 5 words, ignoring punctuation and whitespace.
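To make that concrete, here is a rough sketch of a word-based shingle tokeniser. The class and method names are my own (hypothetical), and you would adapt it to whatever your ITokeniser interface actually expects:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical word-shingle tokeniser: emits overlapping shingles of `shingleSize`
// consecutive words, ignoring punctuation, whitespace and case.
public static class WordShingles
{
    public static IEnumerable<string> Tokenise(string input, int shingleSize = 5)
    {
        // Split on anything that is not a letter or digit.
        string[] words = Regex.Split(input.ToLowerInvariant(), @"[^\p{L}\p{Nd}]+")
                              .Where(w => w.Length > 0)
                              .ToArray();

        // Slide a window of `shingleSize` words over the text.
        for (int i = 0; i + shingleSize <= words.Length; i++)
        {
            yield return string.Join(" ", words, i, shingleSize);
        }
    }
}

Feeding shingles like these into HashTokens, instead of 2-character n-grams, should make the resulting feature vectors much more discriminative between unrelated documents.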


Possibilities to improve performance using vectorization for the following function in C#?

I have a function that estimates the correlation between two input arrays.
The input is fed by a dataDict of type Dictionary<string, double[]>, which has 153 keys whose values are double arrays of size 1500.
For each key, I need to estimate its correlation with every other key and store the result in a double[,] of size double[dataDict.Count(), dataDict.Count()].
The following function prepares the two double[] arrays whose correlation needs to be estimated.
public double[,] CalculateCorrelation(Dictionary<string, double?[]> dataDict, string corrMethod = "kendall")
{
    CorrelationLogicModule correlationLogicModule = new CorrelationLogicModule();
    double[,] correlationMatrix = new double[dataDict.Count(), dataDict.Count()];
    for (int i = 0; i < dataDict.Count; i++)
    {
        for (int j = 0; j < dataDict.Count; j++)
        {
            var arrayA = dataDict[dataDict.ElementAt(i).Key].Cast<double>().ToArray();
            var arrayB = dataDict[dataDict.ElementAt(j).Key].Cast<double>().ToArray();
            correlationMatrix[i, j] = correlationLogicModule.Kendall_Formula(arrayA, arrayB);
        }
    }
    return correlationMatrix;
}
The following function (which I found on the internet, here) computes the correlation between two input arrays using Kendall's method.
public double Kendall_Formula(double[] Ticker1, double[] Ticker2)
{
    double NbrConcord, NbrDiscord, S;
    NbrConcord = 0;
    NbrDiscord = 0;
    S = 0;
    for (int i = 0; i < Ticker1.Length - 1; i++)
    {
        for (int j = i + 1; j < Ticker1.Length; j++)
        {
            //Compute the number of concordant pairs
            if (((Ticker1[i] < Ticker1[j]) & (Ticker2[i] < Ticker2[j])) | ((Ticker1[i] > Ticker1[j]) & (Ticker2[i] > Ticker2[j])))
            {
                NbrConcord++;
            }
            //Compute the number of discordant pairs
            else if (((Ticker1[i] > Ticker1[j]) & (Ticker2[i] < Ticker2[j])) | ((Ticker1[i] < Ticker1[j]) & (Ticker2[i] > Ticker2[j])))
            {
                NbrDiscord++;
            }
        }
    }
    S = NbrConcord - NbrDiscord;
    //Proportion with the total pairs
    return 2 * S / (Ticker1.Length * (Ticker1.Length - 1));
}
Done this way, it takes a very long time to calculate the correlations for all the keys.
Is there a way to optimize the performance?
I am new to C#, but I have been using Python for a long time, and in Python, using NumPy and pandas, I am sure the above operation would take seconds to compute. For example, if I had the above data as a pandas DataFrame, then data[[list of columns]].corr('method') would produce the result in seconds. This is because pandas uses NumPy under the hood, which benefits from vectorization. I would like to learn how I can benefit from vectorization to improve the performance of the above code in C#, and whether there are other factors I need to consider. Thank you!
You are using dataDict[dataDict.ElementAt(i).Key] to access the dictionary values in an undefined order. I don't know if that is what you intended, but the following code should give the same results.
If you call dataDict.Values.ToArray(), you will get the dictionary values in the same order as you would when iterating over the dictionary with foreach, which is also the order you get from dataDict[dataDict.ElementAt(i).Key].
Therefore this code should be equivalent, and it should be faster:
public double[,] CalculateCorrelation(Dictionary<string, double?[]> dataDict, string corrMethod = "kendall")
{
    CorrelationLogicModule correlationLogicModule = new CorrelationLogicModule();
    var values = dataDict.Values.Select(array => array.Cast<double>().ToArray()).ToArray();
    double[,] correlationMatrix = new double[dataDict.Count, dataDict.Count];
    for (int i = 0; i < dataDict.Count; i++)
    {
        for (int j = 0; j < dataDict.Count; j++)
        {
            var arrayA = values[i];
            var arrayB = values[j];
            correlationMatrix[i, j] = correlationLogicModule.Kendall_Formula(arrayA, arrayB);
        }
    }
    return correlationMatrix;
}
Note that the .ElementAt() call in your original code is a LINQ extension, not a member of Dictionary<TKey,TValue>. It iterates from the start of the dictionary EVERY TIME you call it, and it also returns items in an unspecified order. From the documentation: "For purposes of enumeration, each item in the dictionary is treated as a KeyValuePair<TKey,TValue> structure representing a value and its key. The order in which the items are returned is undefined."
Also:
You should change the bitwise & to the logical && in your conditions. On bool operands, & always evaluates both sides, so all of the < / > comparisons are performed even when the first condition is already false; && short-circuits and skips them.
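For reference, this is what Kendall_Formula looks like with the short-circuiting operators; the logic is unchanged from the original, only the operators differ:

public double Kendall_Formula(double[] Ticker1, double[] Ticker2)
{
    double NbrConcord = 0;
    double NbrDiscord = 0;
    for (int i = 0; i < Ticker1.Length - 1; i++)
    {
        for (int j = i + 1; j < Ticker1.Length; j++)
        {
            // Concordant pair: both series move in the same direction.
            if ((Ticker1[i] < Ticker1[j] && Ticker2[i] < Ticker2[j]) ||
                (Ticker1[i] > Ticker1[j] && Ticker2[i] > Ticker2[j]))
            {
                NbrConcord++;
            }
            // Discordant pair: the two series move in opposite directions.
            else if ((Ticker1[i] > Ticker1[j] && Ticker2[i] < Ticker2[j]) ||
                     (Ticker1[i] < Ticker1[j] && Ticker2[i] > Ticker2[j]))
            {
                NbrDiscord++;
            }
        }
    }
    double S = NbrConcord - NbrDiscord;
    // Proportion relative to the total number of pairs.
    return 2 * S / (Ticker1.Length * (Ticker1.Length - 1));
}

Since Kendall's tau is symmetric (the value for (A, B) equals the value for (B, A)), you could also compute only the upper triangle of the matrix and mirror it, which roughly halves the work; the cells are independent of each other, so the outer loop could also be parallelised, for example with Parallel.For.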

LBG algorithm - can I use existing k-means algorithm for it?

I'm writing a program that compares how the k-means and LBG algorithms work.
I have a k-means algorithm written which accepts a list of points and the number of clusters, and then draws them on the screen in different colours. It returns a list of clusters, each with a list of points assigned to it.
Now, my question is: can I somehow modify my k-means to get the LBG algorithm? I tried searching online for a step-by-step explanation of LBG, but Wikipedia has only three sentences on it, I found a MATLAB explanation consisting of four lines of code, and the original paper has to be bought. Could someone explain it or point me to a guide?
Thank you.
Edit: Please, no very heavy technical papers; my English is not yet good enough to read them properly.
Edit 2: Here is the code for my k-means class:
Edit2: Here is code for my k-means class:
public class k_means
{
    // oblicz ("compute"): splits the points into klasterctr initial clusters, then keeps
    // moving each point to its nearest cluster until no point moves any more.
    public static List<Punktyzbior> oblicz(Punktyzbior punkty, int klasterctr)
    {
        List<Punktyzbior> wszystkieklastry = new List<Punktyzbior>();
        List<List<Punkt>> wszystkiegrupy = pomocnicze_listy.PodzielListe<Punkt>(punkty, klasterctr);
        foreach (List<Punkt> grupa in wszystkiegrupy)
        {
            Punktyzbior klaster = new Punktyzbior();
            klaster.AddRange(grupa);
            wszystkieklastry.Add(klaster);
        }
        int przejscia = 1; // number of points moved in the last pass
        while (przejscia > 0)
        {
            przejscia = 0;
            foreach (Punktyzbior klaster in wszystkieklastry)
            {
                for (int punktIdx = 0; punktIdx < klaster.Count; punktIdx++)
                {
                    Punkt punkt = klaster[punktIdx];
                    int najblizszyklaster = znajdzNajblizszy(wszystkieklastry, punkt);
                    if (najblizszyklaster != wszystkieklastry.IndexOf(klaster))
                    {
                        if (klaster.Count > 1)
                        {
                            Punkt usunPunkt = klaster.usunPunkt(punkt);
                            wszystkieklastry[najblizszyklaster].dodajPunkt(usunPunkt);
                            przejscia += 1;
                        }
                    }
                }
            }
        }
        return (wszystkieklastry);
    }

    // znajdzNajblizszy ("find nearest"): index of the cluster whose centroid is closest to the point.
    public static int znajdzNajblizszy(List<Punktyzbior> wszystkieklastry, Punkt punkt)
    {
        double minOdl = 0.0;
        int najblizszyCIdx = -1;
        for (int k = 0; k < wszystkieklastry.Count; k++)
        {
            double odl = Punkt.znajdzOdl(punkt, wszystkieklastry[k].c);
            if (k == 0)
            {
                minOdl = odl;
                najblizszyCIdx = 0;
            }
            else if (minOdl > odl)
            {
                minOdl = odl;
                najblizszyCIdx = k;
            }
        }
        return (najblizszyCIdx);
    }
}
Instead of trying to get from k-means to LBG, I would rather try to "translate" this Java code to C#; I think that would be a lot easier.
Try it, and if you run into a specific problem with the implementation, come back and let us know.
https://github.com/internaut/JGenLloydCluster/blob/master/src/net/mkonrad/cluster/GenLloyd.java
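In case it helps, here is my rough understanding of the LBG (generalized Lloyd) idea as a self-contained sketch: start from a single centroid, split every centroid into two slightly perturbed copies, run a few ordinary k-means (Lloyd) passes, and repeat until you have enough clusters. Everything below (the class name, the 2-D double[] point representation, the iteration count) is my own illustration, not taken from the question or the linked Java code, and because the codebook doubles each round it naturally lands on a power of two.

using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, minimal LBG sketch over 2-D points represented as double[2].
public static class LbgSketch
{
    public static List<double[]> BuildCodebook(List<double[]> points, int targetClusters, double epsilon = 0.01)
    {
        // Start with a single centroid: the mean of all points.
        var codebook = new List<double[]> { Mean(points) };

        while (codebook.Count < targetClusters)
        {
            // Split step: replace each centroid c with c*(1+eps) and c*(1-eps).
            codebook = codebook
                .SelectMany(c => new[]
                {
                    new[] { c[0] * (1 + epsilon), c[1] * (1 + epsilon) },
                    new[] { c[0] * (1 - epsilon), c[1] * (1 - epsilon) }
                })
                .ToList();

            // Lloyd / k-means step: re-assign points and re-compute centroids a few times.
            for (int iter = 0; iter < 10; iter++)
            {
                var groups = points.GroupBy(p => Nearest(codebook, p)).ToList();
                foreach (var g in groups)
                {
                    codebook[g.Key] = Mean(g.ToList());
                }
            }
        }
        return codebook;
    }

    private static double[] Mean(List<double[]> pts) =>
        new[] { pts.Average(p => p[0]), pts.Average(p => p[1]) };

    private static int Nearest(List<double[]> codebook, double[] p)
    {
        int best = 0;
        double bestDist = double.MaxValue;
        for (int k = 0; k < codebook.Count; k++)
        {
            double dx = p[0] - codebook[k][0], dy = p[1] - codebook[k][1];
            double d = dx * dx + dy * dy;
            if (d < bestDist) { bestDist = d; best = k; }
        }
        return best;
    }
}

The inner Lloyd step is essentially what your k_means class already does, so one possible route from your existing code to LBG is to wrap your oblicz call inside this kind of split-and-refine loop.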

Calculating adjacency matrix from randomly generated graphs

I have developed a small program which randomly generates several connections between graph nodes (the count could be random too, but for the tests I have defined a constant value; it can be changed to a random value at any time).
The code is C#: http://ideone.com/FDCtT0
(result: Success, time: 0.04s, memory: 36968 kB, returned value: 0)
If you don't know what an adjacency matrix is, see: http://en.wikipedia.org/wiki/Adjacency_matrix
I think my version of the code is rather unoptimized, and I will have to work with large matrices of size 10k x 10k.
1) What are your suggestions on how best to parallelize the calculations in this task? Should I use locking primitives such as semaphores for multi-threaded calculations on large matrices?
2) What are your suggestions for redesigning the architecture of the program? How should I prepare it for large matrices?
3) As you can see above on ideone, I have shown the execution time and the allocated RAM. What is the asymptotic complexity of my program? Is it O(n^2)?
So I would like to hear your advice on how to improve the asymptotic behaviour and how to parallelize the calculations (using semaphores, or maybe a better locking model for threads).
Thank you!
PS:
SO doesn't allow posting a topic without formatted code, so I'm posting it at the end (full program):
/*
    Oleg Orlov, 2012(c), generating randomly adjacency matrix and graph connections
*/
using System;
using System.Collections.Generic;

class Graph
{
    internal int id;
    private int value;
    internal Graph[] links;

    public Graph(int inc_id, int inc_value)
    {
        this.id = inc_id;
        this.value = inc_value;
        links = new Graph[Program.random_generator.Next(0, 4)];
    }
}

class Program
{
    private const int graphs_count = 10;
    private static List<Graph> list;
    public static Random random_generator;

    private static void Init()
    {
        random_generator = new Random();
        list = new List<Graph>(graphs_count);
        for (int i = 0; i < list.Capacity; i++)
        {
            list.Add(new Graph(i, random_generator.Next(100, 255) * i + random_generator.Next(0, 32)));
        }
    }

    private static void InitGraphs()
    {
        for (int i = 0; i < list.Count; i++)
        {
            Graph graph = list[i] as Graph;
            graph.links = new Graph[random_generator.Next(1, 4)];
            for (int j = 0; j < graph.links.Length; j++)
            {
                graph.links[j] = list[random_generator.Next(0, 10)];
            }
            list[i] = graph;
        }
    }

    private static bool[,] ParseAdjectiveMatrix()
    {
        bool[,] matrix = new bool[list.Count, list.Count];
        foreach (Graph graph in list)
        {
            int[] links = new int[graph.links.Length];
            for (int i = 0; i < links.Length; i++)
            {
                links[i] = graph.links[i].id;
                matrix[graph.id, links[i]] = matrix[links[i], graph.id] = true;
            }
        }
        return matrix;
    }

    private static void PrintMatrix(ref bool[,] matrix)
    {
        for (int i = 0; i < list.Count; i++)
        {
            Console.Write("{0} | [ ", i);
            for (int j = 0; j < list.Count; j++)
            {
                Console.Write(" {0},", Convert.ToInt32(matrix[i, j]));
            }
            Console.Write(" ]\r\n");
        }
        Console.Write("{0}", new string(' ', 7));
        for (int i = 0; i < list.Count; i++)
        {
            Console.Write("---");
        }
        Console.Write("\r\n{0}", new string(' ', 7));
        for (int i = 0; i < list.Count; i++)
        {
            Console.Write("{0} ", i);
        }
        Console.Write("\r\n");
    }

    private static void PrintGraphs()
    {
        foreach (Graph graph in list)
        {
            Console.Write("\r\nGraph id: {0}. It references to the graphs: ", graph.id);
            for (int i = 0; i < graph.links.Length; i++)
            {
                Console.Write(" {0}", graph.links[i].id);
            }
        }
    }

    [STAThread]
    static void Main()
    {
        try
        {
            Init();
            InitGraphs();
            bool[,] matrix = ParseAdjectiveMatrix();
            PrintMatrix(ref matrix);
            PrintGraphs();
        }
        catch (Exception exc)
        {
            Console.WriteLine(exc.Message);
        }
        Console.Write("\r\n\r\nPress enter to exit this program...");
        Console.ReadLine();
    }
}
I will start from the end, if you don't mind. :)
3) Of course it is O(n^2), and so is the memory usage.
2) Since sizeof(bool) == 1 byte, not 1 bit, you can optimize memory usage by using bit masks instead of raw bool values; packing 8 bools into each byte makes the matrix 8 times smaller (see the sketch after this list).
1) I don't know C# that well, but a quick search tells me that reads and writes of C#'s small primitive types are atomic, and more importantly the pieces of work here are independent. So this is a very easy multi-threading task: just split your graph nodes across threads and run each thread on its own part of the graph. The threads don't interfere with each other, so you don't need any semaphores, locks, and so on.
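To illustrate point 2, here is a minimal sketch of a bit-packed adjacency matrix built on System.Collections.BitArray; the wrapper class and its name are my own, not something from the question:

using System.Collections;

// Minimal bit-packed adjacency matrix: one bit per cell instead of one byte per bool.
class BitAdjacencyMatrix
{
    private readonly BitArray bits;
    private readonly int n;

    public BitAdjacencyMatrix(int vertexCount)
    {
        n = vertexCount;
        bits = new BitArray(vertexCount * vertexCount); // all false by default
    }

    public bool this[int i, int j]
    {
        get { return bits[i * n + j]; }
        set { bits[i * n + j] = value; }
    }
}

For a 10k x 10k matrix that is roughly 12.5 MB of bits instead of about 100 MB of one-byte bools.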
The thing is that you won't be able to have an adjacency matrix of size 10^9 x 10^9. You just can't store it in memory. But there is another way.
Create an adjacency list for each vertex, containing all the vertices it is connected to. After building those lists from your graph, sort each vertex's list. Then you can answer 'is a connected to b' in O(log(size of the adjacency list of vertex a)) time using binary search, which is really fast for common usage.
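A rough sketch of that idea, with hypothetical class and method names rather than anything from the question:

using System;
using System.Collections.Generic;

// Sorted adjacency lists with binary-search lookup (illustrative sketch).
class AdjacencyLists
{
    private readonly List<int>[] neighbours;

    public AdjacencyLists(int vertexCount)
    {
        neighbours = new List<int>[vertexCount];
        for (int v = 0; v < vertexCount; v++)
            neighbours[v] = new List<int>();
    }

    public void AddEdge(int a, int b)
    {
        neighbours[a].Add(b);
        neighbours[b].Add(a);
    }

    // Call once after all edges have been added.
    public void SortAll()
    {
        foreach (var list in neighbours)
            list.Sort();
    }

    // O(log(degree of a)) thanks to binary search over the sorted neighbour list.
    public bool AreConnected(int a, int b)
    {
        return neighbours[a].BinarySearch(b) >= 0;
    }
}

Building the lists is linear in the number of edges, sorting them is roughly O(E log V) in total, and each lookup is a binary search over a single vertex's neighbours.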
Now, if you want to implement Dijkstra's algorithm really fast, you won't need an adjacency matrix at all, just those lists.
Again, it all depends on the future tasks and constraints. You cannot store a matrix of that size, that's all. You don't need it for Dijkstra or BFS, that's a fact. :) There is no conceptual difference from the graph's side: the graph is the same no matter what data structure it is stored in.
If you really want the matrix, then this is the solution:
We know that the number of connections (1s in the matrix) is much smaller than its maximum, which is n^2. By building those lists we simply store the positions of the 1s (this is also called a sparse matrix), which wastes no memory.

What's wrong with my implementation of the KMP algorithm?

static void Main(string[] args)
{
    string str = "ABC ABCDAB ABCDABCDABDE"; // We should add some text here
                                            // for the performance tests.
    string pattern = "ABCDABD";
    List<int> shifts = new List<int>();
    Stopwatch stopWatch = new Stopwatch();
    stopWatch.Start();
    NaiveStringMatcher(shifts, str, pattern);
    stopWatch.Stop();
    Trace.WriteLine(String.Format("Naive string matcher {0}", stopWatch.Elapsed));
    foreach (int s in shifts)
    {
        Trace.WriteLine(s);
    }
    shifts.Clear();
    stopWatch.Restart();
    int[] pi = new int[pattern.Length];
    Knuth_Morris_Pratt(shifts, str, pattern, pi);
    stopWatch.Stop();
    Trace.WriteLine(String.Format("Knuth_Morris_Pratt {0}", stopWatch.Elapsed));
    foreach (int s in shifts)
    {
        Trace.WriteLine(s);
    }
    Console.ReadKey();
}

static IList<int> NaiveStringMatcher(List<int> shifts, string text, string pattern)
{
    int lengthText = text.Length;
    int lengthPattern = pattern.Length;
    for (int s = 0; s < lengthText - lengthPattern + 1; s++)
    {
        if (text[s] == pattern[0])
        {
            int i = 0;
            while (i < lengthPattern)
            {
                if (text[s + i] == pattern[i])
                    i++;
                else break;
            }
            if (i == lengthPattern)
            {
                shifts.Add(s);
            }
        }
    }
    return shifts;
}

static IList<int> Knuth_Morris_Pratt(List<int> shifts, string text, string pattern, int[] pi)
{
    int patternLength = pattern.Length;
    int textLength = text.Length;
    //ComputePrefixFunction(pattern, pi);
    int j;
    for (int i = 1; i < pi.Length; i++)
    {
        j = 0;
        while ((i < pi.Length) && (pattern[i] == pattern[j]))
        {
            j++;
            pi[i++] = j;
        }
    }
    int matchedSymNum = 0;
    for (int i = 0; i < textLength; i++)
    {
        while (matchedSymNum > 0 && pattern[matchedSymNum] != text[i])
            matchedSymNum = pi[matchedSymNum - 1];
        if (pattern[matchedSymNum] == text[i])
            matchedSymNum++;
        if (matchedSymNum == patternLength)
        {
            shifts.Add(i - patternLength + 1);
            matchedSymNum = pi[matchedSymNum - 1];
        }
    }
    return shifts;
}
Why does my implementation of the KMP algorithm run slower than the naive string matching algorithm?
The KMP algorithm has two phases: first it builds a table, and then it does a search, directed by the contents of the table.
The naive algorithm has one phase: it does a search. It does that search much less efficiently in the worst case than the KMP search phase.
If the KMP is slower than the naive algorithm then that is probably because building the table is taking you longer than it takes to simply search the string naively in the first place. Naive string matching is usually very fast on short strings. There is a reason why we don't use fancy-pants algorithms like KMP inside the BCL implementations of string searching. By the time you set up the table, you could have done half a dozen searches of short strings with the naive algorithm.
KMP is only a win if you have enormous strings and you are doing lots of searches that allow you to re-use an already-built table. You need to amortize away the huge cost of building the table by doing lots of searches using that table.
And also, the naive algorithm only has bad performance in bizarre and unlikely scenarios. Most people are searching for words like "London" in strings like "Buckingham Palace, London, England", and not searching for strings like "BANANANANANANA" in strings like "BANAN BANBAN BANBANANA BANAN BANAN BANANAN BANANANANANANANANAN...". The naive search algorithm is optimal for the first problem and highly sub-optimal for the latter problem; but it makes sense to optimize for the former, not the latter.
Another way to put it: if the searched-for string is of length w and the searched-in string is of length n, then KMP is O(n) + O(w). The Naive algorithm is worst case O(nw), best case O(n + w). But that says nothing about the "constant factor"! The constant factor of the KMP algorithm is much larger than the constant factor of the naive algorithm. The value of n has to be awfully big, and the number of sub-optimal partial matches has to be awfully large, for the KMP algorithm to win over the blazingly fast naive algorithm.
That deals with the algorithmic complexity issues. Your methodology is also not very good, and that might explain your results. Remember, the first time you run code, the jitter has to jit the IL into assembly code. That can take longer than running the method in some cases. You really should be running the code a few hundred thousand times in a loop, discarding the first result, and taking an average of the timings of the rest.
If you really want to know what is going on you should be using a profiler to determine what the hot spot is. Again, make sure you are measuring the post-jit run, not the run where the code is jitted, if you want to have results that are not skewed by the jit time.
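A rough sketch of that kind of measurement, reusing the method and variable names from the question's Main (str, pattern and Knuth_Morris_Pratt are assumed to be defined as above): warm up once so the jitting is excluded, then average a large number of timed runs.

// Warm-up call: after this, the method has been jitted.
const int runs = 100000;
List<int> shifts = new List<int>();
int[] pi = new int[pattern.Length];
Knuth_Morris_Pratt(shifts, str, pattern, pi);

Stopwatch sw = Stopwatch.StartNew();
for (int r = 0; r < runs; r++)
{
    shifts.Clear();
    Array.Clear(pi, 0, pi.Length); // the method fills pi itself, so reset it each run
    Knuth_Morris_Pratt(shifts, str, pattern, pi);
}
sw.Stop();
Console.WriteLine("Average per call: {0} ms", sw.Elapsed.TotalMilliseconds / runs);

Doing the same for NaiveStringMatcher gives you two averages that are actually comparable.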
Your example is too small, and it does not contain enough repetitions of the pattern for KMP's avoidance of backtracking to pay off.
KMP can be slower than the normal search in such cases.
A Simple KMPSubstringSearch Implementation.
https://github.com/bharathkumarms/AlgorithmsMadeEasy/blob/master/AlgorithmsMadeEasy/KMPSubstringSearch.cs
using System;
using System.Collections.Generic;
using System.Linq;

namespace AlgorithmsMadeEasy
{
    class KMPSubstringSearch
    {
        public void KMPSubstringSearchMethod()
        {
            string text = System.Console.ReadLine();
            char[] sText = text.ToCharArray();
            string pattern = System.Console.ReadLine();
            char[] sPattern = pattern.ToCharArray();

            // Build the failure table: tempStorage[i] is the length of the longest
            // proper prefix of the pattern that is also a suffix of pattern[0..i].
            int forwardPointer = 1;
            int backwardPointer = 0;
            int[] tempStorage = new int[sPattern.Length];
            tempStorage[0] = 0;
            while (forwardPointer < sPattern.Length)
            {
                if (sPattern[forwardPointer].Equals(sPattern[backwardPointer]))
                {
                    tempStorage[forwardPointer] = backwardPointer + 1;
                    forwardPointer++;
                    backwardPointer++;
                }
                else
                {
                    if (backwardPointer == 0)
                    {
                        tempStorage[forwardPointer] = 0;
                        forwardPointer++;
                    }
                    else
                    {
                        // Fall back to the previous longest prefix-suffix
                        // (tempStorage[backwardPointer - 1], not tempStorage[backwardPointer],
                        // otherwise patterns such as "aab" loop forever).
                        backwardPointer = tempStorage[backwardPointer - 1];
                    }
                }
            }

            // Scan the text, using the failure table to avoid re-examining characters.
            int pointer = 0;
            int successPoints = sPattern.Length;
            bool success = false;
            for (int i = 0; i < sText.Length; i++)
            {
                if (sText[i].Equals(sPattern[pointer]))
                {
                    pointer++;
                }
                else
                {
                    if (pointer != 0)
                    {
                        int tempPointer = pointer - 1;
                        pointer = tempStorage[tempPointer];
                        i--;
                    }
                }
                if (successPoints == pointer)
                {
                    success = true;
                    break; // the pattern was found; stop scanning
                }
            }

            if (success)
            {
                System.Console.WriteLine("TRUE");
            }
            else
            {
                System.Console.WriteLine("FALSE");
            }
            System.Console.Read();
        }
    }
}

/*
 * Sample Input
 * abxabcabcaby
 * abcaby
 */

C# compress a byte array

I do not know much about compression algorithms. I am looking for a simple compression algorithm (or code snippet) which can reduce the size of a byte[,,] or byte[]. I cannot make use of System.IO.Compression. Also, the data has lots of repetition.
I tried implementing the RLE algorithm (posted below for your inspection). However, it produces arrays 1.2 to 1.8 times larger.
public static class RLE
{
    public static byte[] Encode(byte[] source)
    {
        List<byte> dest = new List<byte>();
        byte runLength;
        for (int i = 0; i < source.Length; i++)
        {
            runLength = 1;
            while (runLength < byte.MaxValue
                && i + 1 < source.Length
                && source[i] == source[i + 1])
            {
                runLength++;
                i++;
            }
            dest.Add(runLength);
            dest.Add(source[i]);
        }
        return dest.ToArray();
    }

    public static byte[] Decode(byte[] source)
    {
        List<byte> dest = new List<byte>();
        byte runLength;
        for (int i = 1; i < source.Length; i += 2)
        {
            runLength = source[i - 1];
            while (runLength > 0)
            {
                dest.Add(source[i]);
                runLength--;
            }
        }
        return dest.ToArray();
    }
}
I have also found a Java, string- and integer-based LZW implementation. I converted it to C# and the results look good (code posted below). However, I am not sure how it works, nor how to make it work with bytes instead of strings and integers.
public class LZW
{
    /* Compress a string to a list of output symbols. */
    public static int[] compress(string uncompressed)
    {
        // Build the dictionary.
        int dictSize = 256;
        Dictionary<string, int> dictionary = new Dictionary<string, int>();
        for (int i = 0; i < dictSize; i++)
            dictionary.Add("" + (char)i, i);

        string w = "";
        List<int> result = new List<int>();
        for (int i = 0; i < uncompressed.Length; i++)
        {
            char c = uncompressed[i];
            string wc = w + c;
            if (dictionary.ContainsKey(wc))
                w = wc;
            else
            {
                result.Add(dictionary[w]);
                // Add wc to the dictionary.
                dictionary.Add(wc, dictSize++);
                w = "" + c;
            }
        }
        // Output the code for w.
        if (w != "")
            result.Add(dictionary[w]);
        return result.ToArray();
    }

    /* Decompress a list of output ks to a string. */
    public static string decompress(int[] compressed)
    {
        int dictSize = 256;
        Dictionary<int, string> dictionary = new Dictionary<int, string>();
        for (int i = 0; i < dictSize; i++)
            dictionary.Add(i, "" + (char)i);

        string w = "" + (char)compressed[0];
        string result = w;
        for (int i = 1; i < compressed.Length; i++)
        {
            int k = compressed[i];
            string entry = "";
            if (dictionary.ContainsKey(k))
                entry = dictionary[k];
            else if (k == dictSize)
                entry = w + w[0];
            result += entry;
            // Add w + entry[0] to the dictionary.
            dictionary.Add(dictSize++, w + entry[0]);
            w = entry;
        }
        return result;
    }
}
Have a look here. I used this code as a basis for compression in one of my work projects. I'm not sure how much of the .NET Framework is accessible in the Xbox 360 SDK, so I'm not sure how well this will work for you.
The problem with that RLE algorithm is that it is too simple. It prefixes every byte with the number of times it is repeated, which means that in long runs of non-repeating bytes, each single byte is prefixed with a "1". On data without any repetitions this will double the file size.
This can be avoided by using Code-type RLE instead; the 'Code' (also called 'Token') will be a byte that can have two meanings; either it indicates how many times the single following byte is repeated, or it indicates how many non-repeating bytes follow that should be copied as they are. The difference between those two codes is made by enabling the highest bit, meaning there are still 7 bits available for the value, meaning the amount to copy or repeat per such code can be up to 127.
This means that even in worst-case scenarios, the final size can only be about 1/127th larger than the original file size.
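As a rough, unoptimised sketch of how such a code-type RLE might look (my own illustration, with hypothetical names, not the code from the article linked below):

using System;
using System.Collections.Generic;

// Sketch of "code-type" RLE: a token byte either announces a run (high bit set,
// low 7 bits = run length, followed by the repeated value) or a block of literals
// (high bit clear, low 7 bits = how many raw bytes follow). Worst-case overhead is
// therefore one extra byte per 127 bytes of input.
public static class CodeRle
{
    public static byte[] Encode(byte[] source)
    {
        var dest = new List<byte>();
        int i = 0;
        while (i < source.Length)
        {
            // Measure the run of identical bytes starting at i (capped at 127).
            int run = 1;
            while (run < 127 && i + run < source.Length && source[i + run] == source[i])
                run++;

            if (run >= 3)
            {
                dest.Add((byte)(0x80 | run)); // repeat token
                dest.Add(source[i]);
                i += run;
            }
            else
            {
                // Collect literals until the next run of 3+ identical bytes (or 127 bytes).
                int start = i;
                int literals = 0;
                while (i < source.Length && literals < 127)
                {
                    if (i + 2 < source.Length && source[i] == source[i + 1] && source[i] == source[i + 2])
                        break;
                    i++;
                    literals++;
                }
                dest.Add((byte)literals); // literal token (high bit clear)
                for (int k = start; k < start + literals; k++)
                    dest.Add(source[k]);
            }
        }
        return dest.ToArray();
    }

    public static byte[] Decode(byte[] source)
    {
        var dest = new List<byte>();
        int i = 0;
        while (i < source.Length)
        {
            byte token = source[i++];
            int count = token & 0x7F;
            if ((token & 0x80) != 0)
            {
                byte value = source[i++]; // run: repeat one value `count` times
                for (int k = 0; k < count; k++)
                    dest.Add(value);
            }
            else
            {
                for (int k = 0; k < count; k++) // literals: copy `count` bytes as-is
                    dest.Add(source[i++]);
            }
        }
        return dest.ToArray();
    }
}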
A good explanation of the whole concept, plus full working (and, in fact, heavily optimised) C# code, can be found here:
http://www.shikadi.net/moddingwiki/RLE_Compression
Note that sometimes, the data will end up larger than the original anyway, simply because there are not enough repeating bytes in it for RLE to work. A good way to deal with such compression failures is by adding a header to your final data. If you simply add an extra byte at the start that's on 0 for uncompressed data and 1 for RLE compressed data, then, when RLE fails to give a smaller result, you just save it uncompressed, with the 0 in front, and your final data will be exactly one byte larger than the original. The system at the other side can then read that starting byte and use that to determine if the following data should be uncompressed or just copied.
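A rough sketch of that header-byte fallback, wrapped around whichever Encode/Decode pair you settle on (here it calls the CodeRle sketch above, but the question's RLE class would work the same way; the Pack/Unpack names are my own):

using System;

public static class RlePacker
{
    // Header byte: 0 = stored uncompressed, 1 = RLE-compressed.
    public static byte[] Pack(byte[] source)
    {
        byte[] compressed = CodeRle.Encode(source);
        bool useCompressed = compressed.Length < source.Length;
        byte[] body = useCompressed ? compressed : source;

        var result = new byte[body.Length + 1];
        result[0] = (byte)(useCompressed ? 1 : 0);
        Buffer.BlockCopy(body, 0, result, 1, body.Length);
        return result;
    }

    public static byte[] Unpack(byte[] packed)
    {
        var body = new byte[packed.Length - 1];
        Buffer.BlockCopy(packed, 1, body, 0, body.Length);
        return packed[0] == 1 ? CodeRle.Decode(body) : body;
    }
}

The worst case is then exactly one byte larger than the original, as described above.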
Look into Huffman codes; it's a pretty simple algorithm. Basically, use fewer bits for the patterns that show up more often, and keep a table of how they are encoded. You also have to account for the fact that your codewords have no separators to help you decode.
