How to remove duplicate lines from a large text file efficiently? - c#

I want to edit a text file so that each line appears in it only once. Each line is exactly 10 characters long, and I am generally working with 5-6 million lines, so the code I am currently using consumes too much RAM.
My code:
File.WriteAllLines(targetpath, File.ReadAllLines(sourcepath).Distinct());
So how can I make it consume less RAM and take less time at the same time?

Taking into account how much memory a string takes in C#, and assuming 10 characters per line for 6 million records, we get:
size in bytes ~= 20 + (length / 2) * 4
total size in bytes ~= (20 + (10 / 2) * 4) * 6,000,000 = 240,000,000
total size in MB ~= 230
Now, 230 MB of space is not really a problem, even on x86 (a 32-bit system), so you can load all that data into memory.
For this, I would use the HashSet<string> class, which is a hash set that lets you easily eliminate duplicates by looking an element up before adding it.
In terms of big-O time complexity, the average performance of a hash-set lookup is O(1), which is the best you can get. In total you would perform the lookup N times, giving N * O(1) = O(N).
In terms of big-O space complexity, you would use O(N) space, meaning memory proportional to the number of elements, which is also the best you can get.
I'm not sure it is even possible to use less space if you implement the algorithm in C# and don't rely on any external components (those would also use at least O(N)).
That being said, you can optimize for some scenarios by reading your file sequentially, line by line, see here.
This gives a better result if you have lots of duplicates, but in the worst case, when all the lines are distinct, it consumes the same amount of memory.
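As a rough sketch of that sequential approach (file names here are placeholders, not the asker's actual paths): File.ReadLines streams the file lazily instead of loading it all at once, and HashSet<string>.Add doubles as the duplicate check, so only the set of distinct lines is held in memory:
using System.Collections.Generic;
using System.IO;

class DedupStream
{
    static void Main()
    {
        var seen = new HashSet<string>();
        using (var writer = new StreamWriter("target.txt"))
        {
            // File.ReadLines yields lines one at a time, so the whole
            // input file is never held in memory at once.
            foreach (var line in File.ReadLines("source.txt"))
            {
                // Add returns false when the line was already seen.
                if (seen.Add(line))
                    writer.WriteLine(line);
            }
        }
    }
}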
On a final note, if you look at how the Distinct method is implemented, you will see that it also uses an implementation of a hash table; it's not the same class, but the performance is still roughly the same. Check out this question for more details.

As ironstone13 corrected me, HashSet is OK, but does store the data.
Then this works fine too:
string[] arr = File.ReadAllLines("file.txt");
HashSet<string> hashes = new HashSet<string>();
for (int i = 0; i < arr.Length; i++)
{
if (!hashes.Add(arr[i])) arr[i] = null;
}
File.WriteAllLines("file2.txt", arr.Where(x => x != null));
This implementation was motivated by memory use and hash collisions.
The main idea is to keep just the hashes; of course it would have to go back to the file to fetch the lines it sees as hash collisions/duplicates, to detect which case it actually is (that part is not implemented).
class Program
{
static string[] arr;
static Dictionary<int, int>[] hashes = new Dictionary<int, int>[1]
{ new Dictionary<int, int>() }
;
static int[] file_indexes = {-1};
static void AddHash(int hash, int index)
{
for (int h = 0; h < hashes.Length; h++)
{
Dictionary<int, int> dict = hashes[h];
if (!dict.ContainsKey(hash))
{
dict[hash] = index;
return;
}
}
hashes = hashes.Union(new[] {new Dictionary<int, int>() {{hash, index}}}).ToArray();
file_indexes = Enumerable.Range(0, hashes.Length).Select(x => -1).ToArray();
}
static int UpdateFileIndexes(int hash)
{
int updates = 0;
for (int h = 0; h < hashes.Length; h++)
{
int index;
if (hashes[h].TryGetValue(hash, out index))
{
file_indexes[h] = index;
updates++;
}
else
{
file_indexes[h] = -1;
}
}
return updates;
}
static bool IsDuplicate(int index)
{
string str1 = arr[index];
for (int h = 0; h < hashes.Length; h++)
{
int i = file_indexes[h];
if (i == -1 || index == i) continue;
string str0 = arr[i];
if (str0 == null) continue;
if (string.CompareOrdinal(str0, str1) == 0) return true;
}
return false;
}
static void Main(string[] args)
{
arr = File.ReadAllLines("file.txt");
for (int i = 0; i < arr.Length; i++)
{
int hash = arr[i].GetHashCode();
if (UpdateFileIndexes(hash) == 0) AddHash(hash, i);
else if (IsDuplicate(i)) arr[i] = null;
else AddHash(hash, i);
}
File.WriteAllLines("file2.txt", arr.Where(x => x != null));
Console.WriteLine("DONE");
Console.ReadKey();
}
}

Before you write your data, if it is in a list or dictionary, you could run a LINQ query and use group by to group all like keys, then write one entry per group to the output file.
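For example, something along these lines (a sketch only, with placeholder file paths):
using System.IO;
using System.Linq;

class GroupByDedup
{
    static void Main()
    {
        var distinctLines = File.ReadAllLines("source.txt")
            .GroupBy(line => line)       // group identical lines together
            .Select(group => group.Key); // keep one line per group

        File.WriteAllLines("target.txt", distinctLines);
    }
}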
Your question is a little vague as well. Are you creating a new text file every time, and do you have to store the data as text? There are better formats to use, such as XML and JSON.

Related

Speed up processing 32 bit numbers in combinations (k from n)

I have a list of 128 32-bit numbers, and I want to know whether there is any combination of 12 numbers such that XORing them all gives the 32-bit number with all bits set to 1.
So I started with the naive approach and took a combinations generator like this:
private static IEnumerable<int[]> Combinations(int k, int n)
{
var state = new int[k];
var stack = new Stack<int>();
stack.Push(0);
while (stack.Count > 0)
{
var index = stack.Count - 1;
var value = stack.Pop();
while (value < n)
{
state[index++] = value++;
if (value < n)
{
stack.Push(value);
}
if (index == k)
{
yield return state;
break;
}
}
}
}
and used it like this (data32 is the array of the given 32-bit numbers):
foreach (var probe in Combinations(12, 128))
{
int p = 0;
foreach (var index in probe)
{
p = p ^ data32[index];
}
if (p == -1)
{
//print out found combination
}
}
Of course it takes forever to check all 23726045489546400 combinations...
So my question is: am I missing any options to speed up the checking process?
Even if I partition the calculation of combinations (e.g. I could start 8 threads, each checking combinations starting with numbers 0..8), or speed up the XORing by reusing the previously calculated combination, it is still slow.
P.S. I'd like it to run in a reasonable time: minutes or hours, not years.
Adding a list of numbers as was requested in one of the comments:
1571089837
2107702069
466053875
226802789
506212087
484103496
1826565655
944897655
1370004928
748118360
1000006005
952591039
2072497930
2115635395
966264796
1229014633
827262231
1276114545
1480412665
2041893083
512565106
1737382276
1045554806
172937528
1746275907
1376570954
1122801782
2013209036
1650561071
1595622894
425898265
770953281
422056706
477352958
1295095933
1783223223
842809023
1939751129
1444043041
1560819338
1810926532
353960897
1128003064
1933682525
1979092040
1987208467
1523445101
174223141
79066913
985640026
798869234
151300097
770795939
1489060367
823126463
1240588773
490645418
832012849
188524191
1034384571
1802169877
150139833
1762370591
1425112310
2121257460
205136626
706737928
265841960
517939268
2070634717
1703052170
1536225470
1511643524
1220003866
714424500
49991283
688093717
1815765740
41049469
529293552
1432086255
1001031015
1792304327
1533146564
399287468
1520421007
153855202
1969342940
742525121
1326187406
1268489176
729430821
1785462100
1180954683
422085275
1578687761
2096405952
1267903266
2105330329
471048135
764314242
459028205
1313062337
1995689086
1786352917
2072560816
282249055
1711434199
1463257872
1497178274
472287065
246628231
1928555152
1908869676
1629894534
885445498
1710706530
1250732374
107768432
524848610
2791827620
1607140095
1820646148
774737399
1808462165
194589252
1051374116
1802033814
I don't know C#, I did something in Python, maybe interesting anyway. Takes about 0.8 seconds to find a solution for your sample set:
solution = {422056706, 2791827620, 506212087, 1571089837, 827262231, 1650561071, 1595622894, 512565106, 205136626, 944897655, 966264796, 477352958}
len(solution) = 12
solution.issubset(nums) = True
hex(xor(solution)) = '0xffffffff'
There are 128C12 combinations, that's 5.5 million times as many as the 2^32 possible XOR values. So I tried being optimistic and only tried a subset of the possible combinations. I split the 128 numbers into two blocks of 28 and 100 numbers and try combinations with six numbers from each of the two blocks. I put all possible XORs of the first block into a hash set A, then go through all XORs of the second block to find one whose bitwise inversion is in that set. Then I reconstruct the individual numbers.
This way I cover 28C6 × 100C6 = 4.5e14 combinations, still over 100000 times as many as there are possible XOR values. So probably still a very good chance to find a valid combination.
Code (Try it online!):
from itertools import combinations
from functools import reduce
from operator import xor as xor_
nums = list(map(int, '1571089837 2107702069 466053875 226802789 506212087 484103496 1826565655 944897655 1370004928 748118360 1000006005 952591039 2072497930 2115635395 966264796 1229014633 827262231 1276114545 1480412665 2041893083 512565106 1737382276 1045554806 172937528 1746275907 1376570954 1122801782 2013209036 1650561071 1595622894 425898265 770953281 422056706 477352958 1295095933 1783223223 842809023 1939751129 1444043041 1560819338 1810926532 353960897 1128003064 1933682525 1979092040 1987208467 1523445101 174223141 79066913 985640026 798869234 151300097 770795939 1489060367 823126463 1240588773 490645418 832012849 188524191 1034384571 1802169877 150139833 1762370591 1425112310 2121257460 205136626 706737928 265841960 517939268 2070634717 1703052170 1536225470 1511643524 1220003866 714424500 49991283 688093717 1815765740 41049469 529293552 1432086255 1001031015 1792304327 1533146564 399287468 1520421007 153855202 1969342940 742525121 1326187406 1268489176 729430821 1785462100 1180954683 422085275 1578687761 2096405952 1267903266 2105330329 471048135 764314242 459028205 1313062337 1995689086 1786352917 2072560816 282249055 1711434199 1463257872 1497178274 472287065 246628231 1928555152 1908869676 1629894534 885445498 1710706530 1250732374 107768432 524848610 2791827620 1607140095 1820646148 774737399 1808462165 194589252 1051374116 1802033814'.split()))
def xor(vals):
    return reduce(xor_, vals)
A = {xor(a)^0xffffffff: a
for a in combinations(nums[:28], 6)}
for b in combinations(nums[28:], 6):
    if a := A.get(xor(b)):
        break
solution = {*a, *b}
print(f'{solution = }')
print(f'{len(solution) = }')
print(f'{solution.issubset(nums) = }')
print(f'{hex(xor(solution)) = }')
Arrange your numbers into buckets based on the position of the first 1 bit.
To set the first bit to 1, you will have to use an odd number of the items in the corresponding bucket....
As you recurse, try to maintain the invariant that the number of leading 1 bits keeps increasing, and select the bucket that will change the next 0 to a 1; this will greatly reduce the number of combinations that you have to try.
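As a small illustration of just the bucketing step (my own sketch, not the answerer's code; the recursive search that maintains the leading-bits invariant is not shown):
using System;
using System.Collections.Generic;
using System.Linq;

static class LeadingBitBuckets
{
    // Index of the highest set bit: 31 for values with the MSB set, 0 for the value 1.
    static int HighestBit(uint value)
    {
        int bit = -1;
        while (value != 0) { bit++; value >>= 1; }
        return bit;
    }

    // Group the inputs by the position of their first 1 bit.
    public static Dictionary<int, List<uint>> Bucket(IEnumerable<uint> numbers)
    {
        return numbers
            .GroupBy(n => HighestBit(n))
            .ToDictionary(g => g.Key, g => g.ToList());
    }
}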
I have found a possible solution, which could work for my particular task.
The main issue with the straightforward approach is the number of combinations, about 2e16.
But if I want to check whether a combination of 12 elements XORs to 0xFFFFFFFF, I can instead check whether two different combinations of 6 elements with opposite (bitwise inverted) XOR values exist.
That reduces the number of combinations to "just" 5e9, which is achievable.
On the first attempt I thought to store all combinations and then find the opposites in the big list, but in .NET I could not find a quick way of storing more than Int32.MaxValue elements.
Taking into account the idea with bits from the comments and the answer, I decided to store at first only the XOR sums with the leftmost bit set to 1; then by definition I only need to check sums with the leftmost bit set to 0, reducing the storage by half.
In the end it turns out that many collisions can appear, so there are many combinations with the same XOR sum.
The current version, which can find such combinations, needs to be compiled in x64 mode (any improvements welcome):
static uint print32(int[] comb, uint[] data)
{
uint p = 0;
for (int i = 0; i < comb.Length; i++)
{
Console.Write("{0} ", comb[i]);
p = p ^ data[comb[i]];
}
Console.WriteLine(" #[{0:X}]", p);
return p;
}
static uint[] data32;
static void Main(string[] args)
{
int n = 128;
int k = 6;
uint p = 0;
uint inv = 0;
long t = 0;
//load n numbers from a file
init(n);
var lookup1x = new Dictionary<uint, List<byte>>();
var lookup0x = new Dictionary<uint, List<byte>>();
Stopwatch watch = new Stopwatch();
watch.Start();
//do not use IEnumerable generator, use function directly to reuse xor value
var hash = new uint[k];
var comb = new int[k];
var stack = new Stack<int>();
stack.Push(0);
while (stack.Count > 0)
{
var index = stack.Count - 1;
var value = stack.Pop();
if (index == 0)
{
p = 0;
Console.WriteLine("Start {0} sequence, combinations found: {1}",value,t);
}
else
{
//restore previous xor value
p = hash[index - 1];
}
while (value < n)
{
//xor and store
p = p ^ data32[value];
hash[index] = p;
//remember current state (combination)
comb[index++] = value++;
if (value < n)
{
stack.Push(value);
}
//combination filled to end
if (index == k)
{
//if the xor has its MSB set, put it into lookup table 1x
if ((p & 0x80000000) == 0x80000000)
{
lookup1x[p] = comb.Select(i => (byte)i).ToList();
inv = p ^ 0xFFFFFFFF;
if (lookup0x.ContainsKey(inv))
{
var full = lookup0x[inv].Union(lookup1x[p]).OrderBy(x=>x).ToArray();
if (full.Length == 12)
{
print32(full, data32);
}
}
}
else
{
//otherwise put it to lookup table 2, but skip all combinations which are started with 0
if (comb[0] != 0)
{
lookup0x[p] = comb.Select(i => (byte)i).ToList();
inv = p ^ 0xFFFFFFFF;
if (lookup1x.ContainsKey(inv))
{
var full = lookup0x[p].Union(lookup1x[inv]).OrderBy(x=>x).ToArray();
if (full.Length == 12)
{
print32(full, data32);
}
}
}
}
t++;
break;
}
}
}
Console.WriteLine("Check was done in {0} ms ", watch.ElapsedMilliseconds);
//end
}

What is the fastest way to do Array Table Lookup with an Integer Index?

I have a video processing application that moves a lot of data.
To speed things up, I have made a lookup table, as many calculations in essence only need to be calculated one time and can be reused.
However, I'm at the point where all the lookups now take 30% of the processing time. I'm wondering if it might be slow RAM. However, I would still like to try to optimize it some more.
Currently I have the following:
public readonly int[] largeArray = new int[3000*2000];
public readonly int[] lookUp = new int[width*height];
I then perform a lookup with a pointer p (which is equivalent to width * y + x) to fetch the result.
int[] newResults = new int[width*height];
int p = 0;
for (int y = 0; y < height; y++) {
for (int x = 0; x < width; x++, p++) {
newResults[p] = largeArray[lookUp[p]];
}
}
Note that I cannot do an entire array copy to optimize. Also, the application is heavily multithreaded.
Some progress was made by shortening the function call stack: no getters, just a straight retrieval from a readonly array.
I've tried converting to ushort as well, but it seemed to be slower (as I understand it's due to word size).
Would an IntPtr be faster? How would I go about that?
(A screenshot of the time distribution was attached to the original question.)
It looks like what you're doing here is effectively a "gather". Modern CPUs have dedicated instructions for this, in particular VPGATHER** . This is exposed in .NET Core 3, and should work something like below, which is the single loop scenario (you can probably work from here to get the double-loop version);
results first:
AVX enabled: False; slow loop from 0
e7ad04457529f201558c8a53f639fed30d3a880f75e613afe203e80a7317d0cb
for 524288 loops: 1524ms
AVX enabled: True; slow loop from 1024
e7ad04457529f201558c8a53f639fed30d3a880f75e613afe203e80a7317d0cb
for 524288 loops: 667ms
code:
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
static class P
{
static int Gather(int[] source, int[] index, int[] results, bool avx)
{ // normally you wouldn't have avx as a parameter; that is just so
// I can turn it off and on for the test; likewise the "int" return
// here is so I can monitor (in the test) how much we did in the "old"
// loop, vs AVX2; in real code this would be void return
int y = 0;
if (Avx2.IsSupported && avx)
{
var iv = MemoryMarshal.Cast<int, Vector256<int>>(index);
var rv = MemoryMarshal.Cast<int, Vector256<int>>(results);
unsafe
{
fixed (int* sPtr = source)
{
// note: here I'm assuming we are trying to fill "results" in
// a single outer loop; for a double-loop, you'll probably need
// to slice the spans
for (int i = 0; i < rv.Length; i++)
{
rv[i] = Avx2.GatherVector256(sPtr, iv[i], 4);
}
}
}
// move past everything we've processed via SIMD
y += rv.Length * Vector256<int>.Count;
}
// now do anything left, which includes anything not aligned to 256 bits,
// plus the "no AVX2" scenario
int result = y;
int end = results.Length; // hoist, since this is not the JIT recognized pattern
for (; y < end; y++)
{
results[y] = source[index[y]];
}
return result;
}
static void Main()
{
// invent some random data
var rand = new Random(12345);
int size = 1024 * 512;
int[] data = new int[size];
for (int i = 0; i < data.Length; i++)
data[i] = rand.Next(255);
// build a fake index
int[] index = new int[1024];
for (int i = 0; i < index.Length; i++)
index[i] = rand.Next(size);
int[] results = new int[1024];
void GatherLocal(bool avx)
{
// prove that we're getting the same data
Array.Clear(results, 0, results.Length);
int from = Gather(data, index, results, avx);
Console.WriteLine($"AVX enabled: {avx}; slow loop from {from}");
for (int i = 0; i < 32; i++)
{
Console.Write(results[i].ToString("x2"));
}
Console.WriteLine();
const int TimeLoop = 1024 * 512;
var watch = Stopwatch.StartNew();
for (int i = 0; i < TimeLoop; i++)
Gather(data, index, results, avx);
watch.Stop();
Console.WriteLine($"for {TimeLoop} loops: {watch.ElapsedMilliseconds}ms");
Console.WriteLine();
}
GatherLocal(false);
if (Avx2.IsSupported) GatherLocal(true);
}
}
RAM is already one of the fastest things possible. The only memory faster is the CPU caches. So it will be memory bound, but that is still plenty fast.
Of course, at the given sizes this array is 6 million entries in size. That will likely not fit in any cache and will take forever to iterate over. It does not matter what the speed is; this is simply too much data.
As a general rule, video processing is done on the GPU nowadays. GPUs are literally designed to operate on giant arrays, because that is what the image you are seeing right now is: a giant array.
If you have to keep it on the CPU side, maybe caching or lazy initialisation would help? Chances are that you do not truly need every value, only the common values. Take an example from dice rolling: if you roll two 6-sided dice, every result from 2 to 12 is possible, but the result 7 happens in 6 out of 36 cases, while 2 and 12 each happen in only 1 out of 36 cases. So having the 7 stored is a lot more beneficial than the 2 and 12.
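As a sketch of that caching idea (names here are placeholders, not from the question): compute a value only the first time its index is requested and keep it in a thread-safe cache, so entries that are never used cost nothing. Note that a dictionary lookup is slower per access than plain array indexing, so this only pays off if most of the table is never actually needed.
using System.Collections.Concurrent;

class LazyLookupCache
{
    private readonly ConcurrentDictionary<int, int> _cache = new ConcurrentDictionary<int, int>();

    // Placeholder for the real, expensive per-entry calculation.
    private int Compute(int index)
    {
        return index * 31;
    }

    // Returns the cached value, computing and storing it on first use.
    public int Get(int index)
    {
        return _cache.GetOrAdd(index, Compute);
    }
}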

How to: Assign a unique number to every entry in a list?

What I want is to make tiles. These tiles (about 30 of them) should have a fixed position in the game, but each time I load the game they should have random numbers that should affect their graphical appearance.
I know how to use the Random method to give a single tile a number to change its appearance, but I'm clueless on how I would use the Random method if I were to make a list storing the position of multiple tiles. How can you assign each entry in a list a unique random number?
I need this for my game where you're in a flat 2D map, generated with random types of rooms (treasure rooms, arena rooms etc.) that you are to explore.
Take a look at the Fisher-Yates shuffle. It's super easy to use and should work well for you, if I read your question right.
Make an array of 30 consecutive numbers, mirroring your array of tiles. Then pick an array-shuffling solution you like from, say, here for instance:
http://forums.asp.net/t/1778021.aspx/1
Then tile[23]'s number will be numberArray[23].
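For instance, a minimal sketch of that idea (30 tiles assumed, as in the question): fill an array with the numbers 0..29 and Fisher-Yates shuffle it, so numberArray[i] becomes the unique random number for tile i.
using System;
using System.Linq;

class TileNumbers
{
    static void Main()
    {
        var random = new Random();
        int[] numberArray = Enumerable.Range(0, 30).ToArray();

        // Fisher-Yates: walk from the end, swapping each slot with a
        // randomly chosen slot at or before it.
        for (int i = numberArray.Length - 1; i > 0; i--)
        {
            int j = random.Next(i + 1); // 0 <= j <= i
            int tmp = numberArray[i];
            numberArray[i] = numberArray[j];
            numberArray[j] = tmp;
        }

        Console.WriteLine(string.Join(" ", numberArray));
    }
}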
if you have something like this:
public class Tile
{
public int Number {get;set;}
...
}
you can do it like this:
var numbers = Enumerable
.Range(1, tilesList.Count) // generates list of sequential numbers
.OrderBy(x => Guid.NewGuid()) // shuffles the list
.ToList();
for (int i = 0; i < tiles.Count; i++)
{
tile[i].Number = numbers[i];
}
I know that Guid is not a Random alternative, but it should fit this scenario.
Update: Since the answer was downvoted, I've written a simple test to check whether Guids are usable for shuffling an array:
var larger = 0;
var smaller = 0;
var start = DateTime.Now;
var guid = Guid.NewGuid();
for (int i = 0; i < 10000000; i++)
{
var nextGuid = Guid.NewGuid();
if (nextGuid.CompareTo(guid) > 0)
{
larger++;
}
else
{
smaller++;
}
guid = nextGuid;
}
Console.WriteLine("larger: {0}", larger);
Console.WriteLine("smaller: {0}", smaller);
Console.WriteLine("took seconds: {0}", DateTime.Now - start);
Console.ReadKey();
What it does is count how many times the next guid is smaller than the current one and how many times it is larger. In the perfect case there should be an equal number of larger and smaller next guids, which would indicate that those two events (current guid and next guid) are independent. I also measured the time, just to make sure that it is not too slow.
And got following result (with 10 million guids):
larger: 5000168
smaller: 4999832
took seconds: 00:00:01.1980686
Another test is direct compare of Fisher-Yates and Guid shuffling:
static void Main(string[] args)
{
var numbers = Enumerable.Range(1, 7).ToArray();
var originalNumbers = numbers.OrderBy(x => Guid.NewGuid()).ToList();
var foundAfterListUsingGuid = new List<int>();
var foundAfterListUsingShuffle = new List<int>();
for (int i = 0; i < 100; i++)
{
var foundAfter = 0;
while (!originalNumbers.SequenceEqual(numbers.OrderBy(x => Guid.NewGuid())))
{
foundAfter++;
}
foundAfterListUsingGuid.Add(foundAfter);
foundAfter = 0;
var shuffledNumbers = Enumerable.Range(1, 7).ToArray();
while (!originalNumbers.SequenceEqual(shuffledNumbers))
{
foundAfter++;
Shuffle(shuffledNumbers);
}
foundAfterListUsingShuffle.Add(foundAfter);
}
Console.WriteLine("Average matching order (Guid): {0}", foundAfterListUsingGuid.Average());
Console.WriteLine("Average matching order (Shuffle): {0}", foundAfterListUsingShuffle.Average());
Console.ReadKey();
}
static Random _random = new Random();
public static void Shuffle<T>(T[] array)
{
var random = _random;
for (int i = array.Length; i > 1; i--)
{
// Pick random element to swap.
int j = random.Next(i); // 0 <= j <= i-1
// Swap.
T tmp = array[j];
array[j] = array[i - 1];
array[i - 1] = tmp;
}
}
By "direct compare" I mean that I produce a shuffled sequence and then keep shuffling until I get the same sequence again, and I assume that the more tries it takes to reproduce the same sequence, the better the randomness (which is not necessarily a mathematically correct assumption; I think it is an oversimplification).
So the results for a small set, with 1000 iterations to reduce the error, were:
Average matching order (Guid): 5015.097
Average matching order (Shuffle): 4969.424
So, Guid performed even better, if my metric is correct :)
with 10000 iterations they came closer:
Average matching order (Guid): 5079.9283
Average matching order (Shuffle): 4940.749
So in my opinion, for the current usage (shuffling room numbers in a game), guids are a suitable solution.

What's wrong with my implementation of the KMP algorithm?

static void Main(string[] args)
{
string str = "ABC ABCDAB ABCDABCDABDE";//We should add some text here for
//the performance tests.
string pattern = "ABCDABD";
List<int> shifts = new List<int>();
Stopwatch stopWatch = new Stopwatch();
stopWatch.Start();
NaiveStringMatcher(shifts, str, pattern);
stopWatch.Stop();
Trace.WriteLine(String.Format("Naive string matcher {0}", stopWatch.Elapsed));
foreach (int s in shifts)
{
Trace.WriteLine(s);
}
shifts.Clear();
stopWatch.Restart();
int[] pi = new int[pattern.Length];
Knuth_Morris_Pratt(shifts, str, pattern, pi);
stopWatch.Stop();
Trace.WriteLine(String.Format("Knuth_Morris_Pratt {0}", stopWatch.Elapsed));
foreach (int s in shifts)
{
Trace.WriteLine(s);
}
Console.ReadKey();
}
static IList<int> NaiveStringMatcher(List<int> shifts, string text, string pattern)
{
int lengthText = text.Length;
int lengthPattern = pattern.Length;
for (int s = 0; s < lengthText - lengthPattern + 1; s++ )
{
if (text[s] == pattern[0])
{
int i = 0;
while (i < lengthPattern)
{
if (text[s + i] == pattern[i])
i++;
else break;
}
if (i == lengthPattern)
{
shifts.Add(s);
}
}
}
return shifts;
}
static IList<int> Knuth_Morris_Pratt(List<int> shifts, string text, string pattern, int[] pi)
{
int patternLength = pattern.Length;
int textLength = text.Length;
//ComputePrefixFunction(pattern, pi);
int j;
for (int i = 1; i < pi.Length; i++)
{
j = 0;
while ((i < pi.Length) && (pattern[i] == pattern[j]))
{
j++;
pi[i++] = j;
}
}
int matchedSymNum = 0;
for (int i = 0; i < textLength; i++)
{
while (matchedSymNum > 0 && pattern[matchedSymNum] != text[i])
matchedSymNum = pi[matchedSymNum - 1];
if (pattern[matchedSymNum] == text[i])
matchedSymNum++;
if (matchedSymNum == patternLength)
{
shifts.Add(i - patternLength + 1);
matchedSymNum = pi[matchedSymNum - 1];
}
}
return shifts;
}
Why does my implementation of the KMP algorithm work slower than the naive string matching algorithm?
The KMP algorithm has two phases: first it builds a table, and then it does a search, directed by the contents of the table.
The naive algorithm has one phase: it does a search. It does that search much less efficiently in the worst case than the KMP search phase.
If the KMP is slower than the naive algorithm then that is probably because building the table is taking you longer than it takes to simply search the string naively in the first place. Naive string matching is usually very fast on short strings. There is a reason why we don't use fancy-pants algorithms like KMP inside the BCL implementations of string searching. By the time you set up the table, you could have done half a dozen searches of short strings with the naive algorithm.
KMP is only a win if you have enormous strings and you are doing lots of searches that allow you to re-use an already-built table. You need to amortize away the huge cost of building the table by doing lots of searches using that table.
And also, the naive algorithm only has bad performance in bizarre and unlikely scenarios. Most people are searching for words like "London" in strings like "Buckingham Palace, London, England", and not searching for strings like "BANANANANANANA" in strings like "BANAN BANBAN BANBANANA BANAN BANAN BANANAN BANANANANANANANANAN...". The naive search algorithm is optimal for the first problem and highly sub-optimal for the latter problem; but it makes sense to optimize for the former, not the latter.
Another way to put it: if the searched-for string is of length w and the searched-in string is of length n, then KMP is O(n) + O(w). The Naive algorithm is worst case O(nw), best case O(n + w). But that says nothing about the "constant factor"! The constant factor of the KMP algorithm is much larger than the constant factor of the naive algorithm. The value of n has to be awfully big, and the number of sub-optimal partial matches has to be awfully large, for the KMP algorithm to win over the blazingly fast naive algorithm.
That deals with the algorithmic complexity issues. Your methodology is also not very good, and that might explain your results. Remember, the first time you run code, the jitter has to jit the IL into assembly code. That can take longer than running the method in some cases. You really should be running the code a few hundred thousand times in a loop, discarding the first result, and taking an average of the timings of the rest.
If you really want to know what is going on you should be using a profiler to determine what the hot spot is. Again, make sure you are measuring the post-jit run, not the run where the code is jitted, if you want to have results that are not skewed by the jit time.
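As an illustration of that measurement discipline (a sketch, not part of the answer above): run the code once so the jitting happens outside the timed region, then time a large number of iterations and report the average.
using System;
using System.Diagnostics;

static class Bench
{
    public static double AverageMilliseconds(Action action, int iterations = 100000)
    {
        action(); // warm-up run: JIT compilation happens here and is excluded from the timing

        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            action();
        stopwatch.Stop();

        return stopwatch.Elapsed.TotalMilliseconds / iterations;
    }
}
This could then be called against the question's methods, for example as Bench.AverageMilliseconds(() => Knuth_Morris_Pratt(new List<int>(), str, pattern, new int[pattern.Length])), and compared with the same call for NaiveStringMatcher.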
Your example is too small and it does not have enough repetitions of the pattern where KMP avoids backtracking.
KMP can be slower than the normal search in some cases.
A Simple KMPSubstringSearch Implementation.
https://github.com/bharathkumarms/AlgorithmsMadeEasy/blob/master/AlgorithmsMadeEasy/KMPSubstringSearch.cs
using System;
using System.Collections.Generic;
using System.Linq;
namespace AlgorithmsMadeEasy
{
class KMPSubstringSearch
{
public void KMPSubstringSearchMethod()
{
string text = System.Console.ReadLine();
char[] sText = text.ToCharArray();
string pattern = System.Console.ReadLine();
char[] sPattern = pattern.ToCharArray();
int forwardPointer = 1;
int backwardPointer = 0;
int[] tempStorage = new int[sPattern.Length];
tempStorage[0] = 0;
while (forwardPointer < sPattern.Length)
{
if (sPattern[forwardPointer].Equals(sPattern[backwardPointer]))
{
tempStorage[forwardPointer] = backwardPointer + 1;
forwardPointer++;
backwardPointer++;
}
else
{
if (backwardPointer == 0)
{
tempStorage[forwardPointer] = 0;
forwardPointer++;
}
else
{
int temp = tempStorage[backwardPointer];
backwardPointer = temp;
}
}
}
int pointer = 0;
int successPoints = sPattern.Length;
bool success = false;
for (int i = 0; i < sText.Length; i++)
{
if (sText[i].Equals(sPattern[pointer]))
{
pointer++;
}
else
{
if (pointer != 0)
{
int tempPointer = pointer - 1;
pointer = tempStorage[tempPointer];
i--;
}
}
if (successPoints == pointer)
{
success = true;
}
}
if (success)
{
System.Console.WriteLine("TRUE");
}
else
{
System.Console.WriteLine("FALSE");
}
System.Console.Read();
}
}
}
/*
* Sample Input
abxabcabcaby
abcaby
*/

C# compress a byte array

I do not know much about compression algorithms. I am looking for a simple compression algorithm (or code snippet) which can reduce the size of a byte[,,] or byte[]. I cannot make use of System.IO.Compression. Also, the data has lots of repetition.
I tried implementing the RLE algorithm (posted below for your inspection). However, it produces arrays 1.2 to 1.8 times larger than the input.
public static class RLE
{
public static byte[] Encode(byte[] source)
{
List<byte> dest = new List<byte>();
byte runLength;
for (int i = 0; i < source.Length; i++)
{
runLength = 1;
while (runLength < byte.MaxValue
&& i + 1 < source.Length
&& source[i] == source[i + 1])
{
runLength++;
i++;
}
dest.Add(runLength);
dest.Add(source[i]);
}
return dest.ToArray();
}
public static byte[] Decode(byte[] source)
{
List<byte> dest = new List<byte>();
byte runLength;
for (int i = 1; i < source.Length; i+=2)
{
runLength = source[i - 1];
while (runLength > 0)
{
dest.Add(source[i]);
runLength--;
}
}
return dest.ToArray();
}
}
I have also found a Java LZW implementation based on strings and integers. I have converted it to C# and the results look good (code posted below). However, I am not sure how it works, nor how to make it work with bytes instead of strings and integers.
public class LZW
{
/* Compress a string to a list of output symbols. */
public static int[] compress(string uncompressed)
{
// Build the dictionary.
int dictSize = 256;
Dictionary<string, int> dictionary = new Dictionary<string, int>();
for (int i = 0; i < dictSize; i++)
dictionary.Add("" + (char)i, i);
string w = "";
List<int> result = new List<int>();
for (int i = 0; i < uncompressed.Length; i++)
{
char c = uncompressed[i];
string wc = w + c;
if (dictionary.ContainsKey(wc))
w = wc;
else
{
result.Add(dictionary[w]);
// Add wc to the dictionary.
dictionary.Add(wc, dictSize++);
w = "" + c;
}
}
// Output the code for w.
if (w != "")
result.Add(dictionary[w]);
return result.ToArray();
}
/* Decompress a list of output ks to a string. */
public static string decompress(int[] compressed)
{
int dictSize = 256;
Dictionary<int, string> dictionary = new Dictionary<int, string>();
for (int i = 0; i < dictSize; i++)
dictionary.Add(i, "" + (char)i);
string w = "" + (char)compressed[0];
string result = w;
for (int i = 1; i < compressed.Length; i++)
{
int k = compressed[i];
string entry = "";
if (dictionary.ContainsKey(k))
entry = dictionary[k];
else if (k == dictSize)
entry = w + w[0];
result += entry;
// Add w+entry[0] to the dictionary.
dictionary.Add(dictSize++, w + entry[0]);
w = entry;
}
return result;
}
}
Have a look here. I used this code as a basis for compression in one of my work projects. I'm not sure how much of the .NET Framework is accessible in the Xbox 360 SDK, so I'm not sure how well this will work for you.
The problem with that RLE algorithm is that it is too simple. It prefixes every byte with how many times it is repeated, but that does mean that in long ranges of non-repeating bytes, each single byte is prefixed with a "1". On data without any repetitions this will double the file size.
This can be avoided by using Code-type RLE instead; the 'Code' (also called 'Token') will be a byte that can have two meanings; either it indicates how many times the single following byte is repeated, or it indicates how many non-repeating bytes follow that should be copied as they are. The difference between those two codes is made by enabling the highest bit, meaning there are still 7 bits available for the value, meaning the amount to copy or repeat per such code can be up to 127.
This means that even in worst-case scenarios, the final size can only be about 1/127th larger than the original file size.
A good explanation of the whole concept, plus full working (and, in fact, heavily optimised) C# code, can be found here:
http://www.shikadi.net/moddingwiki/RLE_Compression
Note that sometimes, the data will end up larger than the original anyway, simply because there are not enough repeating bytes in it for RLE to work. A good way to deal with such compression failures is by adding a header to your final data. If you simply add an extra byte at the start that's on 0 for uncompressed data and 1 for RLE compressed data, then, when RLE fails to give a smaller result, you just save it uncompressed, with the 0 in front, and your final data will be exactly one byte larger than the original. The system at the other side can then read that starting byte and use that to determine if the following data should be uncompressed or just copied.
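As an illustration of that scheme (my own condensed sketch, not the code from the linked page): the high bit of each token selects between "repeat the next byte N times" and "copy the next N bytes verbatim", leaving 7 bits for a count of up to 127, and a one-byte header marks whether the data is RLE-compressed (1) or stored as-is (0) because compression did not pay off.
using System;
using System.Collections.Generic;

public static class CodeRle
{
    public static byte[] Encode(byte[] source)
    {
        var dest = new List<byte> { 1 }; // header: 1 = RLE-compressed
        int i = 0;
        while (i < source.Length)
        {
            // Measure the run starting at i, capped at 127.
            int run = 1;
            while (run < 127 && i + run < source.Length && source[i + run] == source[i])
                run++;

            if (run >= 3)
            {
                // Run token: high bit set, low 7 bits = run length, then the repeated byte.
                dest.Add((byte)(0x80 | run));
                dest.Add(source[i]);
                i += run;
            }
            else
            {
                // Literal token: copy bytes until the next run of 3+ or until 127 bytes.
                int start = i;
                while (i < source.Length && i - start < 127)
                {
                    int lookahead = 1;
                    while (lookahead < 3 && i + lookahead < source.Length && source[i + lookahead] == source[i])
                        lookahead++;
                    if (lookahead >= 3) break;
                    i++;
                }
                dest.Add((byte)(i - start));
                for (int j = start; j < i; j++) dest.Add(source[j]);
            }
        }

        if (dest.Count >= source.Length + 1)
        {
            // Compression failed: store as-is behind a 0 header (one byte of overhead).
            var raw = new byte[source.Length + 1];
            raw[0] = 0; // header: 0 = stored uncompressed
            Array.Copy(source, 0, raw, 1, source.Length);
            return raw;
        }
        return dest.ToArray();
    }

    public static byte[] Decode(byte[] packed)
    {
        var dest = new List<byte>();
        if (packed.Length == 0) return dest.ToArray();
        if (packed[0] == 0)
        {
            // Stored uncompressed: everything after the header byte is the data.
            for (int i = 1; i < packed.Length; i++) dest.Add(packed[i]);
            return dest.ToArray();
        }
        int p = 1;
        while (p < packed.Length)
        {
            byte token = packed[p++];
            int count = token & 0x7F;
            if ((token & 0x80) != 0)
            {
                byte value = packed[p++];
                for (int j = 0; j < count; j++) dest.Add(value); // run
            }
            else
            {
                for (int j = 0; j < count; j++) dest.Add(packed[p++]); // literal copy
            }
        }
        return dest.ToArray();
    }
}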
Look into Huffman codes; it's a pretty simple algorithm. Basically, use fewer bits for the patterns that show up more often, and keep a table of how everything is encoded. You also have to account in your codewords for the fact that there are no separators to help you decode.
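For reference, a minimal sketch of building such a code table over byte frequencies (this assumes .NET 6+ for PriorityQueue; packing the resulting bit strings into bytes and storing the table for the decoder are left out):
using System.Collections.Generic;
using System.Linq;

static class HuffmanSketch
{
    class Node
    {
        public byte Symbol;
        public Node Left, Right;
    }

    public static Dictionary<byte, string> BuildCodes(byte[] data)
    {
        var codes = new Dictionary<byte, string>();
        if (data.Length == 0) return codes;

        // Count how often each byte value occurs.
        var frequencies = data.GroupBy(b => b).ToDictionary(g => g.Key, g => (long)g.Count());

        // Repeatedly merge the two least frequent nodes under a new parent node.
        var queue = new PriorityQueue<Node, long>();
        foreach (var pair in frequencies)
            queue.Enqueue(new Node { Symbol = pair.Key }, pair.Value);
        while (queue.Count > 1)
        {
            queue.TryDequeue(out var a, out var weightA);
            queue.TryDequeue(out var b, out var weightB);
            queue.Enqueue(new Node { Left = a, Right = b }, weightA + weightB);
        }

        // Walk the tree: left edges add '0', right edges add '1'; frequent
        // symbols sit closer to the root and therefore get shorter codes.
        void Walk(Node node, string prefix)
        {
            if (node.Left == null)
            {
                codes[node.Symbol] = prefix.Length == 0 ? "0" : prefix;
                return;
            }
            Walk(node.Left, prefix + "0");
            Walk(node.Right, prefix + "1");
        }
        Walk(queue.Dequeue(), "");
        return codes;
    }
}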
