I am coding a C# application that appends, inserts, replaces and finds string content in string called content. Multiple objects called editObjects are performing these actions on the same string called content.
I am currently passing a StringBuilder object to each editObject object, and then each editObject performs the actions on the StringBuilder object.
This question has two parts:
Am I correct in saying that a StringBuilder is the most efficient way to perform multiple actions on a string?
I have found this post from 2013: Fastest search method in StringBuilder, and would like to know if there is some known code that is a more efficient way to find the index of a string in a StringBuilder?
public static class StringBuilderSearching
{
public static bool Contains(this StringBuilder haystack, string needle)
{
return haystack.IndexOf(needle) != -1;
}
public static int IndexOf(this StringBuilder haystack, string needle)
{
if(haystack == null || needle == null)
throw new ArgumentNullException();
if(needle.Length == 0)
return 0;//empty strings are everywhere!
if(needle.Length == 1)//can't beat just spinning through for it
{
char c = needle[0];
for(int idx = 0; idx != haystack.Length; ++idx)
if(haystack[idx] == c)
return idx;
return -1;
}
int m = 0;
int i = 0;
int[] T = KMPTable(needle);
while(m + i < haystack.Length)
{
if(needle[i] == haystack[m + i])
{
if(i == needle.Length - 1)
return m == needle.Length ? -1 : m;//match -1 = failure to find conventional in .NET
++i;
}
else
{
m = m + i - T[i];
i = T[i] > -1 ? T[i] : 0;
}
}
return -1;
}
private static int[] KMPTable(string sought)
{
int[] table = new int[sought.Length];
int pos = 2;
int cnd = 0;
table[0] = -1;
table[1] = 0;
while(pos < table.Length)
if(sought[pos - 1] == sought[cnd])
table[pos++] = ++cnd;
else if(cnd > 0)
cnd = table[cnd];
else
table[pos++] = 0;
return table;
}
}
StringBuilder is faster for most string manipulations - That doesn't mean it is the best for all multiple actions done on a string.
In your case, you are willing to find a string inside the StringBuilder, this requires you to do one of two things:
Doing a standard search at O(n) iterating the whole StringBuilder, which can be done in a pretty optimized way like the code you have posted in the question.
Indexing chars or strings on every addition and removal of data to/from the StringBuilder. Note that indexing means you will need to analyse every string you add or remove to/from the StringBuilder which creates a little overhead for each manipulation action.
You need to do the judging of what your application does more and what will disturb your application more - O(n) search instead of an O(log(n)) or O(n + m) manipulations instead of O(n) ones. (Where m is the overhead for each insertion/removal divided by the average inserted/removedstring length)
Basic indexing illustration:
Indexes which are in between are fine too for balance (maximum 2 chars index/maximum 3 chars/etc...).
I need to find a way to start this string generating algorithm at a specified string instead of the beginning. For example not starting at 'aaaa' but 'baxi' and go through the rest of the string space.
private static String Charset = "abcdefghijklmnopqrstuvwxyz";
/// <summary>
/// Start Brute Force.
/// </summary>
/// <param name="length">Words length.</param>
public static void StartBruteForce(int length)
{
StringBuilder sb = new StringBuilder(length);
char currentChar = Charset[0];
for (int i = 1; i <= length; i++)
{
sb.Append(currentChar);
}
int counter = 0;
ChangeCharacters(0, sb, length, ref counter);
Console.WriteLine(counter);
}
private static void ChangeCharacters(int pos, StringBuilder sb, int length, ref int counter)
{
for (int i = 0; i <= Charset.Length - 1; i++)
{
sb[pos] = Charset[i];
if (pos == length - 1)
{
counter++;
Console.WriteLine(sb.ToString());
}
else
{
ChangeCharacters(pos + 1, sb, length, ref counter);
}
}
}
What you have is pretty close, but it looks like the root of your problem is that ChangeCharacters is written to always start at the first possible string, e.g. the characters at each position always start at the first letter of your alphabet ('a' in your example). You need the first pass at each position in your starting string to start with the letter that's already in place, and subsequent passes would then start with the first character of your generating alphabet.
Thus, with the code that you already have, you need to do make the following changes:
Pass along a flag indicating this is the first pass.
Select your loop's starting point based on the first pass flag.
Pass that first pass flag down your recursive calls.
Reset the flag after the first pass at each position.
Initialize the StringBuilder with your starting string, rather than the current default starting point.
There are some other items worth changing for clarity's sake. None of these are strictly required for correctness, but they do make the code easier to read and understand:
There's no need to pass around length, as it's always the same as sb.Length. Duplicated information at best forces readers to keep track of more information, and at worst can lead to bugs if later code changes break the relationship.
The standard idiom for index loops in languages that use zero-based indexing is "endpoint-exclusive" form, using a 'less-than' (or even 'not-equal') comparator, as this avoids overflow errors at the boundaries, e.g. i < length instead of i <= length - 1. Most code readers understand this idiom instinctively, whereas endpoint-inclusive forms typically force a reader to consider why the anti-idiom is required.
Rather than passing around your counter by reference, simply return the count of strings generated. Methods that don't mutate external state, often referred to as "pure" or "functional", are generally easier to understand. (Note, however, that you also have state in the StringBuilder being passed around.)
As a further refinement, I'd even suggest returning an IEnumerable<string>, which not only removes the need for tracking a count, but it also lets the caller determine what to do with the string, rather than making that decision (e.g. calling Console.WriteLine and incrementing a counter) insider your method.
Taking all of that together, your code becomes this (annotated with comments pointing to which change or suggestion is in play):
public static void StartBruteForce(string start)
{
/*change 5*/ StringBuilder sb = new StringBuilder(start);
/*sugg 3*/ int counter = ChangeCharacters(0, /*change 1*/ true, sb);
Console.WriteLine(counter);
}
private static int ChangeCharacters(int pos, /*change 1*/ bool firstPass , StringBuilder sb)
{
/*sugg 3*/ int counter = 0;
for (int i = /*change 2*/ firstPass ? Charset.IndexOf(sb[pos]) : 0; /*sugg 2*/ i < Charset.Length; i++)
{
sb[pos] = Charset[i];
if (pos == /*sugg 1*/ sb.Length - 1)
{
counter++;
Console.WriteLine(sb.ToString());
}
else
{
/*sugg 3*/ counter += ChangeCharacters(pos + 1, /*change 3*/ firstPass, sb);
/*change 4*/ firstPass = false;
}
}
/*sugg 3*/ return counter;
}
Your recursive algorithm needs to be broken into steps so it can be resumed.
See this factorial algorithm broken into steps so it can be resumed.
Alternatively, you need to know the resume point [which you will need in any case] but ignore all values prior to hitting that point.
This is a (worse-than) naive implementation, and CPU-wise it invalidates the whole idea of resuming the computation -- since it doesn't actually resume, it just doesn't capture/print the earlier values:
private static String Charset = "abcdefghijklmnopqrstuvwxyz";
/// <summary>
/// Start Brute Force.
/// </summary>
/// <param name="length">Words length.</param>
public static void StartBruteForce(int length)
{
StringBuilder sb = new StringBuilder(length);
char currentChar = Charset[0];
for (int i = 1; i <= length; i++)
{
sb.Append(currentChar);
}
int counter = 0;
var resumePoint = 60975;
ChangeCharacters(0, sb, length, ref counter, resumePoint);
Console.WriteLine(counter);
}
private static void ChangeCharacters(int pos, StringBuilder sb, int length, ref int counter, int resumePoint)
{
for (int i = 0; i <= Charset.Length - 1; i++)
{
sb[pos] = Charset[i];
if (pos == length - 1)
{
counter++;
if (counter >= resumePoint)
{
Console.WriteLine(string.Format("{0} : {1}", counter, sb.ToString()));
}
}
else
{
ChangeCharacters(pos + 1, sb, length, ref counter, resumePoint);
}
}
}
static void Main(string[] args)
{
string str = "ABC ABCDAB ABCDABCDABDE";//We should add some text here for
//the performance tests.
string pattern = "ABCDABD";
List<int> shifts = new List<int>();
Stopwatch stopWatch = new Stopwatch();
stopWatch.Start();
NaiveStringMatcher(shifts, str, pattern);
stopWatch.Stop();
Trace.WriteLine(String.Format("Naive string matcher {0}", stopWatch.Elapsed));
foreach (int s in shifts)
{
Trace.WriteLine(s);
}
shifts.Clear();
stopWatch.Restart();
int[] pi = new int[pattern.Length];
Knuth_Morris_Pratt(shifts, str, pattern, pi);
stopWatch.Stop();
Trace.WriteLine(String.Format("Knuth_Morris_Pratt {0}", stopWatch.Elapsed));
foreach (int s in shifts)
{
Trace.WriteLine(s);
}
Console.ReadKey();
}
static IList<int> NaiveStringMatcher(List<int> shifts, string text, string pattern)
{
int lengthText = text.Length;
int lengthPattern = pattern.Length;
for (int s = 0; s < lengthText - lengthPattern + 1; s++ )
{
if (text[s] == pattern[0])
{
int i = 0;
while (i < lengthPattern)
{
if (text[s + i] == pattern[i])
i++;
else break;
}
if (i == lengthPattern)
{
shifts.Add(s);
}
}
}
return shifts;
}
static IList<int> Knuth_Morris_Pratt(List<int> shifts, string text, string pattern, int[] pi)
{
int patternLength = pattern.Length;
int textLength = text.Length;
//ComputePrefixFunction(pattern, pi);
int j;
for (int i = 1; i < pi.Length; i++)
{
j = 0;
while ((i < pi.Length) && (pattern[i] == pattern[j]))
{
j++;
pi[i++] = j;
}
}
int matchedSymNum = 0;
for (int i = 0; i < textLength; i++)
{
while (matchedSymNum > 0 && pattern[matchedSymNum] != text[i])
matchedSymNum = pi[matchedSymNum - 1];
if (pattern[matchedSymNum] == text[i])
matchedSymNum++;
if (matchedSymNum == patternLength)
{
shifts.Add(i - patternLength + 1);
matchedSymNum = pi[matchedSymNum - 1];
}
}
return shifts;
}
Why does my implemention of the KMP algorithm work slower than the Naive String Matching algorithm?
The KMP algorithm has two phases: first it builds a table, and then it does a search, directed by the contents of the table.
The naive algorithm has one phase: it does a search. It does that search much less efficiently in the worst case than the KMP search phase.
If the KMP is slower than the naive algorithm then that is probably because building the table is taking you longer than it takes to simply search the string naively in the first place. Naive string matching is usually very fast on short strings. There is a reason why we don't use fancy-pants algorithms like KMP inside the BCL implementations of string searching. By the time you set up the table, you could have done half a dozen searches of short strings with the naive algorithm.
KMP is only a win if you have enormous strings and you are doing lots of searches that allow you to re-use an already-built table. You need to amortize away the huge cost of building the table by doing lots of searches using that table.
And also, the naive algorithm only has bad performance in bizarre and unlikely scenarios. Most people are searching for words like "London" in strings like "Buckingham Palace, London, England", and not searching for strings like "BANANANANANANA" in strings like "BANAN BANBAN BANBANANA BANAN BANAN BANANAN BANANANANANANANANAN...". The naive search algorithm is optimal for the first problem and highly sub-optimal for the latter problem; but it makes sense to optimize for the former, not the latter.
Another way to put it: if the searched-for string is of length w and the searched-in string is of length n, then KMP is O(n) + O(w). The Naive algorithm is worst case O(nw), best case O(n + w). But that says nothing about the "constant factor"! The constant factor of the KMP algorithm is much larger than the constant factor of the naive algorithm. The value of n has to be awfully big, and the number of sub-optimal partial matches has to be awfully large, for the KMP algorithm to win over the blazingly fast naive algorithm.
That deals with the algorithmic complexity issues. Your methodology is also not very good, and that might explain your results. Remember, the first time you run code, the jitter has to jit the IL into assembly code. That can take longer than running the method in some cases. You really should be running the code a few hundred thousand times in a loop, discarding the first result, and taking an average of the timings of the rest.
If you really want to know what is going on you should be using a profiler to determine what the hot spot is. Again, make sure you are measuring the post-jit run, not the run where the code is jitted, if you want to have results that are not skewed by the jit time.
Your example is too small and it does not have enough repetitions of the pattern where KMP avoids backtracking.
KMP can be slower than the normal search in some cases.
A Simple KMPSubstringSearch Implementation.
https://github.com/bharathkumarms/AlgorithmsMadeEasy/blob/master/AlgorithmsMadeEasy/KMPSubstringSearch.cs
using System;
using System.Collections.Generic;
using System.Linq;
namespace AlgorithmsMadeEasy
{
class KMPSubstringSearch
{
public void KMPSubstringSearchMethod()
{
string text = System.Console.ReadLine();
char[] sText = text.ToCharArray();
string pattern = System.Console.ReadLine();
char[] sPattern = pattern.ToCharArray();
int forwardPointer = 1;
int backwardPointer = 0;
int[] tempStorage = new int[sPattern.Length];
tempStorage[0] = 0;
while (forwardPointer < sPattern.Length)
{
if (sPattern[forwardPointer].Equals(sPattern[backwardPointer]))
{
tempStorage[forwardPointer] = backwardPointer + 1;
forwardPointer++;
backwardPointer++;
}
else
{
if (backwardPointer == 0)
{
tempStorage[forwardPointer] = 0;
forwardPointer++;
}
else
{
int temp = tempStorage[backwardPointer];
backwardPointer = temp;
}
}
}
int pointer = 0;
int successPoints = sPattern.Length;
bool success = false;
for (int i = 0; i < sText.Length; i++)
{
if (sText[i].Equals(sPattern[pointer]))
{
pointer++;
}
else
{
if (pointer != 0)
{
int tempPointer = pointer - 1;
pointer = tempStorage[tempPointer];
i--;
}
}
if (successPoints == pointer)
{
success = true;
}
}
if (success)
{
System.Console.WriteLine("TRUE");
}
else
{
System.Console.WriteLine("FALSE");
}
System.Console.Read();
}
}
}
/*
* Sample Input
abxabcabcaby
abcaby
*/
I'm trying to insert a certain number of indentations before a string based on an items depth and I'm wondering if there is a way to return a string repeated X times. Example:
string indent = "---";
Console.WriteLine(indent.Repeat(0)); //would print nothing.
Console.WriteLine(indent.Repeat(1)); //would print "---".
Console.WriteLine(indent.Repeat(2)); //would print "------".
Console.WriteLine(indent.Repeat(3)); //would print "---------".
If you only intend to repeat the same character you can use the string constructor that accepts a char and the number of times to repeat it new String(char c, int count).
For example, to repeat a dash five times:
string result = new String('-', 5);
Output: -----
If you're using .NET 4.0, you could use string.Concat together with Enumerable.Repeat.
int N = 5; // or whatever
Console.WriteLine(string.Concat(Enumerable.Repeat(indent, N)));
Otherwise I'd go with something like Adam's answer.
The reason I generally wouldn't advise using Andrey's answer is simply that the ToArray() call introduces superfluous overhead that is avoided with the StringBuilder approach suggested by Adam. That said, at least it works without requiring .NET 4.0; and it's quick and easy (and isn't going to kill you if efficiency isn't too much of a concern).
most performant solution for string
string result = new StringBuilder().Insert(0, "---", 5).ToString();
public static class StringExtensions
{
public static string Repeat(this string input, int count)
{
if (string.IsNullOrEmpty(input) || count <= 1)
return input;
var builder = new StringBuilder(input.Length * count);
for(var i = 0; i < count; i++) builder.Append(input);
return builder.ToString();
}
}
For many scenarios, this is probably the neatest solution:
public static class StringExtensions
{
public static string Repeat(this string s, int n)
=> new StringBuilder(s.Length * n).Insert(0, s, n).ToString();
}
Usage is then:
text = "Hello World! ".Repeat(5);
This builds on other answers (particularly #c0rd's). As well as simplicity, it has the following features, which not all the other techniques discussed share:
Repetition of a string of any length, not just a character (as requested by the OP).
Efficient use of StringBuilder through storage preallocation.
Strings and chars [version 1]
string.Join("", Enumerable.Repeat("text" , 2 ));
//result: texttext
Strings and chars [version 2]:
String.Concat(Enumerable.Repeat("text", 2));
//result: texttext
Strings and chars [version 3]
new StringBuilder().Insert(0, "text", 2).ToString();
//result: texttext
Chars only:
new string('5', 3);
//result: 555
Extension way:
(works FASTER - better for WEB)
public static class RepeatExtensions
{
public static string Repeat(this string str, int times)
{
var a = new StringBuilder();
//Append is faster than Insert
( () => a.Append(str) ).RepeatAction(times) ;
return a.ToString();
}
public static void RepeatAction(this Action action, int count)
{
for (int i = 0; i < count; i++)
{
action();
}
}
}
usage:
var a = "Hello".Repeat(3);
//result: HelloHelloHello
Use String.PadLeft, if your desired string contains only a single char.
public static string Indent(int count, char pad)
{
return String.Empty.PadLeft(count, pad);
}
Credit due here
You can repeat your string (in case it's not a single char) and concat the result, like this:
String.Concat(Enumerable.Repeat("---", 5))
I would go for Dan Tao's answer, but if you're not using .NET 4.0 you can do something like that:
public static string Repeat(this string str, int count)
{
return Enumerable.Repeat(str, count)
.Aggregate(
new StringBuilder(str.Length * count),
(sb, s) => sb.Append(s))
.ToString();
}
string indent = "---";
string n = string.Concat(Enumerable.Repeat(indent, 1).ToArray());
string n = string.Concat(Enumerable.Repeat(indent, 2).ToArray());
string n = string.Concat(Enumerable.Repeat(indent, 3).ToArray());
Adding the Extension Method I am using all over my projects:
public static string Repeat(this string text, int count)
{
if (!String.IsNullOrEmpty(text))
{
return String.Concat(Enumerable.Repeat(text, count));
}
return "";
}
Hope someone can take use of it...
I like the answer given. Along the same lines though is what I've used in the past:
"".PadLeft(3*Indent,'-')
This will fulfill creating an indent but technically the question was to repeat a string. If the string indent is something like >-< then this as well as the accepted answer would not work. In this case, c0rd's solution using StringBuilder looks good, though the overhead of StringBuilder may in fact not make it the most performant. One option is to build an array of strings, fill it with indent strings, then concat that. To whit:
int Indent = 2;
string[] sarray = new string[6]; //assuming max of 6 levels of indent, 0 based
for (int iter = 0; iter < 6; iter++)
{
//using c0rd's stringbuilder concept, insert ABC as the indent characters to demonstrate any string can be used
sarray[iter] = new StringBuilder().Insert(0, "ABC", iter).ToString();
}
Console.WriteLine(sarray[Indent] +"blah"); //now pretend to output some indented line
We all love a clever solution but sometimes simple is best.
Surprised nobody went old-school.
I am not making any claims about this code, but just for fun:
public static string Repeat(this string #this, int count)
{
var dest = new char[#this.Length * count];
for (int i = 0; i < dest.Length; i += 1)
{
dest[i] = #this[i % #this.Length];
}
return new string(dest);
}
Print a line with repetition.
Console.Write(new string('=', 30) + "\n");
==============================
For general use, solutions involving the StringBuilder class are best for repeating multi-character strings. It's optimized to handle the combination of large numbers of strings in a way that simple concatenation can't and that would be difficult or impossible to do more efficiently by hand. The StringBuilder solutions shown here use O(N) iterations to complete, a flat rate proportional to the number of times it is repeated.
However, for very large number of repeats, or where high levels of efficiency must be squeezed out of it, a better approach is to do something similar to StringBuilder's basic functionality but to produce additional copies from the destination, rather than from the original string, as below.
public static string Repeat_CharArray_LogN(this string str, int times)
{
int limit = (int)Math.Log(times, 2);
char[] buffer = new char[str.Length * times];
int width = str.Length;
Array.Copy(str.ToCharArray(), buffer, width);
for (int index = 0; index < limit; index++)
{
Array.Copy(buffer, 0, buffer, width, width);
width *= 2;
}
Array.Copy(buffer, 0, buffer, width, str.Length * times - width);
return new string(buffer);
}
This doubles the length of the source/destination string with each iteration, which saves the overhead of resetting counters each time it would go through the original string, instead smoothly reading through and copying the now much longer string, something that modern processors can do much more efficiently.
It uses a base-2 logarithm to find how many times it needs to double the length of the string and then proceeds to do so that many times. Since the remainder to be copied is now less than the total length it is copying from, it can then simply copy a subset of what it has already generated.
I have used the Array.Copy() method over the use of StringBuilder, as a copying of the content of the StringBuilder into itself would have the overhead of producing a new string with that content with each iteration. Array.Copy() avoids this, while still operating with an extremely high rate of efficiency.
This solution takes O(1 + log N) iterations to complete, a rate that increases logarithmically with the number of repeats (doubling the number of repeats equals one additional iteration), a substantial savings over the other methods, which increase proportionally.
Another approach is to consider string as IEnumerable<char> and have a generic extension method which will multiply the items in a collection by the specified factor.
public static IEnumerable<T> Repeat<T>(this IEnumerable<T> source, int times)
{
source = source.ToArray();
return Enumerable.Range(0, times).SelectMany(_ => source);
}
So in your case:
string indent = "---";
var f = string.Concat(indent.Repeat(0)); //.NET 4 required
//or
var g = new string(indent.Repeat(5).ToArray());
Not sure how this would perform, but it's an easy piece of code. (I have probably made it appear more complicated than it is.)
int indentCount = 3;
string indent = "---";
string stringToBeIndented = "Blah";
// Need dummy char NOT in stringToBeIndented - vertical tab, anyone?
char dummy = '\v';
stringToBeIndented.PadLeft(stringToBeIndented.Length + indentCount, dummy).Replace(dummy.ToString(), indent);
Alternatively, if you know the maximum number of levels you can expect, you could just declare an array and index into it. You would probably want to make this array static or a constant.
string[] indents = new string[4] { "", indent, indent.Replace("-", "--"), indent.Replace("-", "---"), indent.Replace("-", "----") };
output = indents[indentCount] + stringToBeIndented;
I don't have enough rep to comment on Adam's answer, but the best way to do it imo is like this:
public static string RepeatString(string content, int numTimes) {
if(!string.IsNullOrEmpty(content) && numTimes > 0) {
StringBuilder builder = new StringBuilder(content.Length * numTimes);
for(int i = 0; i < numTimes; i++) builder.Append(content);
return builder.ToString();
}
return string.Empty;
}
You must check to see if numTimes is greater then zero, otherwise you will get an exception.
Using the new string.Create function, we can pre-allocate the right size and copy a single string in a loop using Span<char>.
I suspect this is likely to be the fastest method, as there is no extra allocation at all: the string is precisely allocated.
public static string Repeat(this string source, int times)
{
return string.Create(source.Length * times, source, RepeatFromString);
}
private static void RepeatFromString(Span<char> result, string source)
{
ReadOnlySpan<char> sourceSpan = source.AsSpan();
for (var i = 0; i < result.Length; i += sourceSpan.Length)
sourceSpan.CopyTo(result.Slice(i, sourceSpan.Length));
}
dotnetfiddle
I didn't see this solution. I find it simpler for where I currently am in software development:
public static void PrintFigure(int shapeSize)
{
string figure = "\\/";
for (int loopTwo = 1; loopTwo <= shapeSize - 1; loopTwo++)
{
Console.Write($"{figure}");
}
}
You can create an ExtensionMethod to do that!
public static class StringExtension
{
public static string Repeat(this string str, int count)
{
string ret = "";
for (var x = 0; x < count; x++)
{
ret += str;
}
return ret;
}
}
Or using #Dan Tao solution:
public static class StringExtension
{
public static string Repeat(this string str, int count)
{
if (count == 0)
return "";
return string.Concat(Enumerable.Repeat(indent, N))
}
}
I do not know much about compression algorithms. I am looking for a simple compression algorithm (or code snippet) which can reduce the size of a byte[,,] or byte[]. I cannot make use of System.IO.Compression. Also, the data has lots of repetition.
I tried implementing the RLE algorithm (posted below for your inspection). However, it produces array's 1.2 to 1.8 times larger.
public static class RLE
{
public static byte[] Encode(byte[] source)
{
List<byte> dest = new List<byte>();
byte runLength;
for (int i = 0; i < source.Length; i++)
{
runLength = 1;
while (runLength < byte.MaxValue
&& i + 1 < source.Length
&& source[i] == source[i + 1])
{
runLength++;
i++;
}
dest.Add(runLength);
dest.Add(source[i]);
}
return dest.ToArray();
}
public static byte[] Decode(byte[] source)
{
List<byte> dest = new List<byte>();
byte runLength;
for (int i = 1; i < source.Length; i+=2)
{
runLength = source[i - 1];
while (runLength > 0)
{
dest.Add(source[i]);
runLength--;
}
}
return dest.ToArray();
}
}
I have also found a java, string and integer based, LZW implementation. I have converted it to C# and the results look good (code posted below). However, I am not sure how it works nor how to make it work with bytes instead of strings and integers.
public class LZW
{
/* Compress a string to a list of output symbols. */
public static int[] compress(string uncompressed)
{
// Build the dictionary.
int dictSize = 256;
Dictionary<string, int> dictionary = new Dictionary<string, int>();
for (int i = 0; i < dictSize; i++)
dictionary.Add("" + (char)i, i);
string w = "";
List<int> result = new List<int>();
for (int i = 0; i < uncompressed.Length; i++)
{
char c = uncompressed[i];
string wc = w + c;
if (dictionary.ContainsKey(wc))
w = wc;
else
{
result.Add(dictionary[w]);
// Add wc to the dictionary.
dictionary.Add(wc, dictSize++);
w = "" + c;
}
}
// Output the code for w.
if (w != "")
result.Add(dictionary[w]);
return result.ToArray();
}
/* Decompress a list of output ks to a string. */
public static string decompress(int[] compressed)
{
int dictSize = 256;
Dictionary<int, string> dictionary = new Dictionary<int, string>();
for (int i = 0; i < dictSize; i++)
dictionary.Add(i, "" + (char)i);
string w = "" + (char)compressed[0];
string result = w;
for (int i = 1; i < compressed.Length; i++)
{
int k = compressed[i];
string entry = "";
if (dictionary.ContainsKey(k))
entry = dictionary[k];
else if (k == dictSize)
entry = w + w[0];
result += entry;
// Add w+entry[0] to the dictionary.
dictionary.Add(dictSize++, w + entry[0]);
w = entry;
}
return result;
}
}
Have a look here. I used this code as a basis to compress in one of my work projects. Not sure how much of the .NET Framework is accessbile in the Xbox 360 SDK, so not sure how well this will work for you.
The problem with that RLE algorithm is that it is too simple. It prefixes every byte with how many times it is repeated, but that does mean that in long ranges of non-repeating bytes, each single byte is prefixed with a "1". On data without any repetitions this will double the file size.
This can be avoided by using Code-type RLE instead; the 'Code' (also called 'Token') will be a byte that can have two meanings; either it indicates how many times the single following byte is repeated, or it indicates how many non-repeating bytes follow that should be copied as they are. The difference between those two codes is made by enabling the highest bit, meaning there are still 7 bits available for the value, meaning the amount to copy or repeat per such code can be up to 127.
This means that even in worst-case scenarios, the final size can only be about 1/127th larger than the original file size.
A good explanation of the whole concept, plus full working (and, in fact, heavily optimised) C# code, can be found here:
http://www.shikadi.net/moddingwiki/RLE_Compression
Note that sometimes, the data will end up larger than the original anyway, simply because there are not enough repeating bytes in it for RLE to work. A good way to deal with such compression failures is by adding a header to your final data. If you simply add an extra byte at the start that's on 0 for uncompressed data and 1 for RLE compressed data, then, when RLE fails to give a smaller result, you just save it uncompressed, with the 0 in front, and your final data will be exactly one byte larger than the original. The system at the other side can then read that starting byte and use that to determine if the following data should be uncompressed or just copied.
Look into Huffman codes, it's a pretty simple algorithm. Basically, use fewer bits for patterns that show up more often, and keep a table of how it's encoded. And you have to account in your codewords that there are no separators to help you decode.