Search a file for a sequence of bytes (C#)

I'm writing a C# application in which I need to search a file (could be very big) for a sequence of bytes, and I can't use any libraries to do so. So, I need a function that takes a byte array as an argument and returns the position of the byte following the given sequence. The function doesn't have to be fast, it simply has to work. Any help would be greatly appreciated :)

If it doesn't have to be fast you could use this:
int GetPositionAfterMatch(byte[] data, byte[] pattern)
{
    for (int i = 0; i <= data.Length - pattern.Length; i++)
    {
        bool match = true;
        for (int k = 0; k < pattern.Length; k++)
        {
            if (data[i + k] != pattern[k])
            {
                match = false;
                break;
            }
        }
        if (match)
        {
            return i + pattern.Length;
        }
    }
    return -1; // no match found
}
But I really would recommend you use the Knuth-Morris-Pratt algorithm; it's the algorithm commonly used as the basis of IndexOf methods for strings. The algorithm above will perform really slowly, except for small arrays and small patterns.

The straightforward approach as pointed out by Turrau works, and for your purposes is probably good enough, since you say it doesn't have to be fast - especially since for most practical purposes the algorithm is much faster than the worst case O(n*m) (depending on your pattern, I guess).
For an optimal solution you can also check out the Knuth-Morris-Pratt algorithm, which makes use of partial matches and ends up being O(n+m).
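For reference, here is a rough sketch of what a KMP-based version of the function could look like (my illustration, not code from either answer; the method name is made up). It builds a failure table over the pattern, then scans the data once, reusing partial matches instead of rescanning, and returns the index just after the first match, or -1:

static int KmpPositionAfterMatch(byte[] data, byte[] pattern)
{
    if (pattern.Length == 0) return 0;
    // Failure table: fail[k] = length of the longest proper prefix of
    // pattern[0..k] that is also a suffix of it.
    int[] fail = new int[pattern.Length];
    for (int k = 1, len = 0; k < pattern.Length; )
    {
        if (pattern[k] == pattern[len]) { fail[k++] = ++len; }
        else if (len > 0) { len = fail[len - 1]; }
        else { fail[k++] = 0; }
    }
    // Scan the data, never moving the data index backwards.
    for (int i = 0, j = 0; i < data.Length; )
    {
        if (data[i] == pattern[j])
        {
            i++; j++;
            if (j == pattern.Length) return i; // position right after the match
        }
        else if (j > 0) { j = fail[j - 1]; }
        else { i++; }
    }
    return -1; // no match
}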

Here's an extract of some code I used to do a Boyer-Moore-type search. It's meant to work on pcap files, so it operates record by record, but it should be easy enough to modify to suit searching a long binary file. It's sort of extracted from some test code, so I hope I've included everything you need to follow along. Also look up Boyer-Moore searching on Wikipedia, since that is what it's based on.
int[] badMatch = new int[256];
byte[] pattern; // the pattern we are searching for
// badMatch is an array of every possible byte value (defined as static later).
// We use this as a jump table to know how many characters we can skip comparison on,
// so first we prefill every possibility with the length of our search string.
for (int i = 0; i < badMatch.Length; i++)
{
    badMatch[i] = pattern.Length;
}
// Now we need to calculate the individual maximum jump length for each byte that appears in my search string.
for (int i = 0; i < pattern.Length - 1; i++)
{
    badMatch[pattern[i] & 0xff] = pattern.Length - i - 1;
}
// Place the bytes you want to run the search against in the payload variable.
byte[] payload = <bytes>
// The variables below come from the surrounding code that isn't shown;
// they are declared here so the extract reads on its own (adjust to your needs).
int offset = 0, i, j, k;
int end = payload.Length;
bool cont = true;
// Search the packet starting at offset, and try to match the last character.
// When we loop, we increment by whatever our jump value is.
for (i = offset + pattern.Length - 1; i < end && cont; i += badMatch[payload[i] & 0xff])
{
    // If our payload character equals our search string character, continue matching, counting backwards.
    for (j = pattern.Length - 1, k = i; (j >= 0) && (payload[k] == pattern[j]) && cont; j--)
    {
        k--;
    }
    // If we matched every character, then we have a match; add it to the packet list and exit the search (cont = false).
    if (j == -1)
    {
        // We MATCHED!!!
        //i = end;
        cont = false;
    }
}

Related

.NET BitArray cardinality [duplicate]

I am implementing a library where I am extensively using the .Net BitArray class and need an equivalent to the Java BitSet.Cardinality() method, i.e. a method which returns the number of bits set. I was thinking of implementing it as an extension method for the BitArray class. The trivial implementation is to iterate and count the bits set (like below), but I wanted a faster implementation as I would be performing thousands of set operations and counting the answer. Is there a faster way than the example below?
count = 0;
for (int i = 0; i < mybitarray.Length; i++)
{
    if (mybitarray[i])
        count++;
}
This is my solution based on the "best bit counting method" from http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel
public static Int32 GetCardinality(BitArray bitArray)
{
    Int32[] ints = new Int32[(bitArray.Count >> 5) + 1];
    bitArray.CopyTo(ints, 0);
    Int32 count = 0;

    // fix for bits beyond bitArray.Count in the last integer that may have been set to true with SetAll()
    ints[ints.Length - 1] &= ~(-1 << (bitArray.Count % 32));

    for (Int32 i = 0; i < ints.Length; i++)
    {
        Int32 c = ints[i];

        // magic (http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel)
        unchecked
        {
            c = c - ((c >> 1) & 0x55555555);
            c = (c & 0x33333333) + ((c >> 2) & 0x33333333);
            c = ((c + (c >> 4) & 0xF0F0F0F) * 0x1010101) >> 24;
        }

        count += c;
    }
    return count;
}
According to my tests, this is around 60 times faster than the simple foreach loop and still 30 times faster than the Kernighan approach with around 50% bits set to true in a BitArray with 1000 bits. I also have a VB version of this if needed.
You can accomplish this pretty easily with LINQ:
BitArray ba = new BitArray(new[] { true, false, true, false, false });
var numOnes = (from bool m in ba
where m
select m).Count();
BitArray myBitArray = new BitArray(...

int
    bits = myBitArray.Count,
    size = ((bits - 1) >> 3) + 1,
    counter = 0,
    x,
    c;

byte[] buffer = new byte[size];
myBitArray.CopyTo(buffer, 0);

for (x = 0; x < size; x++)
    for (c = 0; buffer[x] > 0; buffer[x] >>= 1)
        counter += buffer[x] & 1;
Taken from "Counting bits set, Brian Kernighan's way" and adapted for bytes. I'm using it for bit arrays of 1,000,000+ bits and it's superb.
If your bit count is not a multiple of 8, you can count the last (partial) byte manually.
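For instance, a rough example of handling that partial byte could be to mask off the padding bits in the last byte before it is counted (my addition, not the answerer's code; it reuses the bits, buffer and size variables from the snippet above):

// Only needed when bits is not a multiple of 8: clear the padding bits in the
// last byte, which CopyTo may have carried over (e.g. after SetAll(true)).
int validBitsInLastByte = bits % 8;
if (validBitsInLastByte != 0)
    buffer[size - 1] &= (byte)((1 << validBitsInLastByte) - 1);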
I had the same issue, but had more than just the one Cardinality method to convert. So, I opted to port the entire BitSet class. Fortunately it was self-contained.
Here is the Gist of the C# port.
I would appreciate if people would report any bugs that are found - I am not a Java developer, and have limited experience with bit logic, so I might have translated some of it incorrectly.
A faster and simpler version than the accepted answer, thanks to the use of System.Numerics.BitOperations.PopCount:
C#
Int32[] ints = new Int32[(bitArray.Count >> 5) + 1];
bitArray.CopyTo(ints, 0);
Int32 count = 0;
for (Int32 i = 0; i < ints.Length; i++) {
    count += BitOperations.PopCount((uint)ints[i]); // PopCount only has unsigned overloads
}
Console.WriteLine(count);
F#
let ints = Array.create ((bitArray.Count >>> 5) + 1) 0u
bitArray.CopyTo(ints, 0)
ints
|> Array.sumBy BitOperations.PopCount
|> printfn "%d"
See more details in Is BitOperations.PopCount the best way to compute the BitArray cardinality in .NET?
You could use LINQ, but it would be useless and slower:
var sum = mybitarray.OfType<bool>().Count(p => p);
There is no faster way using only BitArray - what it comes down to is that you will have to count them. You could use LINQ to do that or write your own loop, but there is no method offered by BitArray, and the underlying data structure is an int[] array (as seen with Reflector) - so this will always be O(n), n being the number of bits in the array.
The only way I could think of making it faster is using reflection to get hold of the underlying m_array field; then you can get around the boundary checks that Get() uses on every call (see below) - but this is kinda dirty, and might only be worth it on very large arrays since reflection is expensive.
public bool Get(int index)
{
    if ((index < 0) || (index >= this.Length))
    {
        throw new ArgumentOutOfRangeException("index", Environment.GetResourceString("ArgumentOutOfRange_Index"));
    }
    return ((this.m_array[index / 0x20] & (((int)1) << (index % 0x20))) != 0);
}
If this optimization is really important to you, you should create your own class for bit manipulation that internally could use BitArray but keeps track of the number of bits set and offers the appropriate methods (mostly delegating to BitArray, but adding methods to get the number of bits currently set) - then of course this would be O(1).
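As an illustration of that wrapper idea, here is a minimal sketch (my code, with made-up names; it only covers the indexer, a real class would forward the other BitArray members too):

using System.Collections;

// Delegates storage to BitArray but keeps a running count of set bits,
// so reading the cardinality is O(1).
public class CountingBitArray
{
    private readonly BitArray _bits;
    public int SetCount { get; private set; }

    public CountingBitArray(int length) { _bits = new BitArray(length); }

    public bool this[int index]
    {
        get => _bits[index];
        set
        {
            if (_bits[index] == value) return; // no change, count stays the same
            _bits[index] = value;
            SetCount += value ? 1 : -1;
        }
    }
}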
If you really want to maximize the speed, you could pre-compute a lookup table where given a byte-value you have the cardinality, but BitArray is not the most ideal structure for this, since you'd need to use reflection to pull the underlying storage out of it and operate on the integral types - see this question for a better explanation of that technique.
Another, perhaps more useful technique, is to use something like the Kernighan trick, which is O(m) for an n-bit value of cardinality m.
static readonly BitArray ZERO = new BitArray(0);
static readonly BitArray NOT_ONE = new BitArray(1).Not();

public static int GetCardinality(this BitArray bits)
{
    int c = 0;
    var tmp = new BitArray(bits);
    for (; tmp != ZERO; c++)
        tmp = tmp.And(tmp.And(NOT_ONE));
    return c;
}
This too is a bit more cumbersome than it would be in, say, C, because there are no operations defined between integer types and BitArrays (tmp &= tmp - 1, for example, to clear the least significant set bit, has been translated to tmp &= (tmp & ~0x1)).
I have no idea if this ends up being any faster than naively iterating for the case of the BCL BitArray, but algorithmically speaking it should be superior.
EDIT: cited where I discovered the Kernighan trick, with a more in-depth explanation
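For comparison, here is what the Kernighan trick looks like on a plain integer (my addition, not part of the answer); this is the operation the BitArray translation above is trying to imitate:

static int PopCountKernighan(uint v)
{
    int count = 0;
    while (v != 0)
    {
        v &= v - 1; // clear the least significant set bit
        count++;
    }
    return count;
}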
If you don't mind copying the code of System.Collections.BitArray into your project and editing it, you can write it as follows:
(I think it's the fastest. And I've tried using BitVector32[] to implement my BitArray, but it's still slow.)
public void Set(int index, bool value)
{
    if ((index < 0) || (index >= this.m_length))
    {
        throw new ArgumentOutOfRangeException("index", "Index Out Of Range");
    }
    SetWithOutAuth(index, value);
}

// When batch-setting values, we need one method that won't check the index range.
private void SetWithOutAuth(int index, bool value)
{
    int v = ((int)1) << (index % 0x20);
    index = index / 0x20;
    bool NotSet = (this.m_array[index] & v) == 0;
    if (value && NotSet)
    {
        CountOfTrue++; // count the true values
        this.m_array[index] |= v;
    }
    else if (!value && !NotSet)
    {
        CountOfTrue--; // count the true values
        this.m_array[index] &= ~v;
    }
    else
        return;
    this._version++;
}

public int CountOfTrue { get; internal set; }

public void BatchSet(int start, int length, bool value)
{
    if (start < 0 || start >= this.m_length || length <= 0)
        return;
    for (int i = start; i < start + length && i < this.m_length; i++)
    {
        SetWithOutAuth(i, value);
    }
}
I wrote my own version after not finding one that uses a look-up table:
private int[] _bitCountLookup;

private void InitLookupTable()
{
    _bitCountLookup = new int[256];
    for (var byteValue = 0; byteValue < 256; byteValue++)
    {
        var count = 0;
        for (var bitIndex = 0; bitIndex < 8; bitIndex++)
        {
            count += (byteValue >> bitIndex) & 1;
        }
        _bitCountLookup[byteValue] = count;
    }
}

private int CountSetBits(BitArray bitArray)
{
    var result = 0;
    var numberOfFullBytes = bitArray.Length / 8;
    var numberOfTailBits = bitArray.Length % 8;
    var tailByte = numberOfTailBits > 0 ? 1 : 0;
    var bitArrayInBytes = new byte[numberOfFullBytes + tailByte];

    bitArray.CopyTo(bitArrayInBytes, 0);

    for (var i = 0; i < numberOfFullBytes; i++)
    {
        result += _bitCountLookup[bitArrayInBytes[i]];
    }

    for (var i = (numberOfFullBytes * 8); i < bitArray.Length; i++)
    {
        if (bitArray[i])
        {
            result++;
        }
    }
    return result;
}
The problem is naturally O(n); as a result, your solution is probably the most efficient.
Since you are trying to count an arbitrary subset of bits, you cannot count the bits as they are set (which would provide a speed boost if you are not setting the bits too often).
You could check to see if the processor you are using has an instruction which will return the number of set bits. For example, a processor with SSE4 could use the POPCNT instruction according to this post. This would probably not work for you since .NET does not allow assembly (because it is platform independent). Also, ARM processors probably do not have an equivalent.
Probably the best solution would be a look-up table (or a switch, if you could guarantee the switch will be compiled to a single jump to currentLocation + byteValue). This would give you the count for the whole byte. Of course BitArray does not give access to the underlying data type, so you would have to make your own BitArray. You would also have to guarantee that all the bits in the byte will always be part of the intersection, which does not sound likely.
Another option would be to use an array of booleans instead of a BitArray. This has the advantage of not needing to extract each bit from the others in its byte. The disadvantage is that the array will take up 8x as much space in memory, meaning not only wasted space, but also more data to push through as you iterate through the array to perform your count.
The difference between a standard array look-up and a BitArray look-up is as follows:
Array:
1. offset = index * indexSize
2. Get memory at location + offset and save to value
BitArray:
1. index = index / indexSize
2. offset = index * indexSize
3. Get memory at location + offset and save to value
4. position = index % indexSize
5. Shift value by position bits
6. value = value AND 1
With the exception of #2 for Arrays and #3 for BitArrays, most of these commands take 1 processor cycle to complete. Some of the commands can be combined into 1 command using x86/x64 processors, though probably not with ARM since it uses a reduced set of instructions.
Which of the two (array or BitArray) performs better will be specific to your platform (processor speed, processor instructions, processor cache sizes, processor cache speed, amount of system memory (RAM), speed of system memory (CAS), speed of connection between processor and RAM) as well as the spread of indexes you want to count (are the intersections most often clustered or are they randomly distributed?).
To summarize: you could probably find a way to make it faster, but your solution is the fastest you will get for your data set using a bit-per-boolean model in .NET.
Edit: make sure you are accessing the indexes you want to count in order. If you access indexes 200, 5, 150, 151, 311, 6 in that order, then you will increase the number of cache misses, resulting in more time spent waiting for values to be retrieved from RAM.

Shuffling a string so that no two adjacent letters are the same

I've been trying to solve this interview problem, which asks to shuffle a string so that no two adjacent letters are identical.
For example,
ABCC -> ACBC
The approach I'm thinking of is to
1) Iterate over the input string and store the (letter, frequency) pairs in some collection
2) Now build a result string by pulling the highest frequency (that is > 0) letter that we didn't just pull
3) Update (decrement) the frequency whenever we pull a letter
4) return the result string if all letters have zero frequency
5) return error if we're left with only one letter with frequency greater than 1
With this approach we can save the more precious (less frequent) letters for last. But for this to work, we need a collection that lets us efficiently query a key and at the same time efficiently sort it by values. Something like this would work except we need to keep the collection sorted after every letter retrieval.
I'm assuming Unicode characters.
Any ideas on what collection to use? Or an alternative approach?
You can sort the letters by frequency, split the sorted list in half, and construct the output by taking letters from the two halves in turn. This takes a single sort.
Example:
Initial string: ACABBACAB
Sort: AAAABBBCC
Split: AAAA+BBBCC
Combine: ABABABCAC
If the count of the most frequent letter exceeds half the length of the string (more precisely, ⌈n/2⌉), the problem has no solution.
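For illustration, here is a rough C# sketch of that sort/split/interleave idea (my code, not the answerer's; it may produce a different, but still valid, arrangement than the example above):

using System.Linq;

static string SortSplitInterleave(string s)
{
    // Sort by descending frequency so the most common letter fills the first half.
    var sorted = s.GroupBy(c => c)
                  .OrderByDescending(g => g.Count())
                  .SelectMany(g => g)
                  .ToArray();
    int n = sorted.Length, mid = (n + 1) / 2; // first half gets the extra letter for odd lengths

    // No solution if one letter occupies more than ceil(n/2) positions.
    if (n > 0 && sorted.TakeWhile(c => c == sorted[0]).Count() > mid) return null;

    var result = new char[n];
    for (int i = 0; i < mid; i++)
    {
        result[2 * i] = sorted[i];                            // first half at even positions
        if (mid + i < n) result[2 * i + 1] = sorted[mid + i]; // second half at odd positions
    }
    return new string(result);
}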
Why not use two Data Structures: One for sorting (Like a Heap) and one for key retrieval, like a Dictionary?
The accepted answer may produce a correct result, but is likely not the 'correct' answer to this interview brain teaser, nor the most efficient algorithm.
The simple answer is to take the premise of a basic sorting algorithm and alter the looping predicate to check for adjacency rather than magnitude. This ensures that the 'sorting' operation is the only step required, and (like all good sorting algorithms) does the least amount of work possible.
Below is a C# example akin to insertion sort, for simplicity (though many sorting algorithms could be similarly adjusted):
string NonAdjacencySort(string stringInput)
{
    var input = stringInput.ToCharArray();
    for (var i = 0; i < input.Length; i++)
    {
        var j = i;
        while (j > 0 && j < input.Length - 1 &&
               (input[j + 1] == input[j] || input[j - 1] == input[j]))
        {
            var tmp = input[j];
            input[j] = input[j - 1];
            input[j - 1] = tmp;
            j--;
        }
        if (input.Length > 1 && input[1] == input[0])
        {
            var tmp = input[0];
            input[0] = input[input.Length - 1];
            input[input.Length - 1] = tmp;
        }
    }
    return new string(input);
}
The major change to standard insertion sort is that the function has to both look ahead and behind, and therefore needs to wrap around to the last index.
A final point is that this type of algorithm fails gracefully, providing a result with the fewest consecutive characters (grouped at the front).
Since I somehow got convinced to expand an off-hand comment into a full algorithm, I'll write it out as an answer, which must be more readable than a series of uneditable comments.
The algorithm is pretty simple, actually. It's based on the observation that if we sort the string and then divide it into two equal-length halves, plus the middle character if the string has odd length, then corresponding positions in the two halves must differ from each other, unless there is no solution. That's easy to see: if the two characters are the same, then so are all the characters between them, which totals ⌈n/2⌉+1 characters. But a solution is only possible if there are no more than ⌈n/2⌉ instances of any single character.
So we can proceed as follows:
Sort the string.
If the string's length is odd, output the middle character.
Divide the string (minus its middle character if the length is odd) into two equal-length halves, and interleave the two halves.
At each point in the interleaving, since the pair of characters differ from each other (see above), at least one of them must differ from the last character output. So we first output that character and then the corresponding one from the other half.
The sample code below is in C++, since I don't have a C# environment handy to test with. It's also simplified in two ways, both of which would be easy enough to fix at the cost of obscuring the algorithm:
If at some point in the interleaving, the algorithm encounters a pair of identical characters, it should stop and report failure. But in the sample implementation below, which has an overly simple interface, there's no way to report failure. If there is no solution, the function below returns an incorrect solution.
The OP suggests that the algorithm should work with Unicode characters, but the complexity of correctly handling multibyte encodings didn't seem to add anything useful to explain the algorithm. So I just used single-byte characters. (In C# and certain implementations of C++, there is no character type wide enough to hold a Unicode code point, so astral plane characters must be represented with a surrogate pair.)
#include <algorithm>
#include <iostream>
#include <string>

// If possible, rearranges 'in' so that there are no two consecutive
// instances of the same character.
std::string rearrange(std::string in) {
  // Sort the input. The function is call-by-value,
  // so the argument itself isn't changed.
  std::string out;
  size_t len = in.size();
  if (in.size()) {
    out.reserve(len);
    std::sort(in.begin(), in.end());
    size_t mid = len / 2;
    size_t tail = len - mid;
    char prev = in[mid];
    // For odd-length strings, start with the middle character.
    if (len & 1) out.push_back(prev);
    for (size_t head = 0; head < mid; ++head, ++tail)
      // See explanatory text
      if (in[tail] != prev) {
        out.push_back(in[tail]);
        out.push_back(prev = in[head]);
      }
      else {
        out.push_back(in[head]);
        out.push_back(prev = in[tail]);
      }
  }
  return out;
}
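For convenience, here is a rough C# translation of the C++ sample above (my untested port, with the same caveat that it does not report failure when no valid rearrangement exists):

using System;
using System.Text;

static string Rearrange(string input)
{
    if (string.IsNullOrEmpty(input)) return input ?? "";
    char[] chars = input.ToCharArray();
    Array.Sort(chars);

    var result = new StringBuilder(chars.Length);
    int mid = chars.Length / 2;
    int tail = chars.Length - mid;
    char prev = chars[mid];
    // For odd-length strings, start with the middle character.
    if ((chars.Length & 1) == 1) result.Append(prev);

    for (int head = 0; head < mid; ++head, ++tail)
    {
        if (chars[tail] != prev)
        {
            result.Append(chars[tail]);
            result.Append(prev = chars[head]);
        }
        else
        {
            result.Append(chars[head]);
            result.Append(prev = chars[tail]);
        }
    }
    return result.ToString();
}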
You can do that by using a priority queue.
Please find the explanation below:
https://iq.opengenus.org/rearrange-string-no-same-adjacent-characters/
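As a sketch of the priority-queue idea (my code, not taken from the linked article), using .NET 6's PriorityQueue as a max-heap by negating the counts: repeatedly append the most frequent remaining character that differs from the last one appended, and report failure when that is impossible.

using System.Collections.Generic;
using System.Linq;
using System.Text;

static string RearrangeWithPriorityQueue(string s)
{
    var pq = new PriorityQueue<char, int>();
    foreach (var g in s.GroupBy(c => c))
        pq.Enqueue(g.Key, -g.Count()); // negate so the largest count comes out first

    var sb = new StringBuilder(s.Length);
    while (pq.TryDequeue(out char c, out int negCount))
    {
        if (sb.Length > 0 && sb[sb.Length - 1] == c)
        {
            // The most frequent char equals the previous one; try the next candidate.
            if (!pq.TryDequeue(out char alt, out int altNeg)) return null; // impossible
            sb.Append(alt);
            if (altNeg + 1 < 0) pq.Enqueue(alt, altNeg + 1);
            pq.Enqueue(c, negCount); // put the first candidate back unchanged
        }
        else
        {
            sb.Append(c);
            if (negCount + 1 < 0) pq.Enqueue(c, negCount + 1);
        }
    }
    return sb.ToString();
}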
Here is a probabilistic approach. The algorithm is:
10) Select a random char from the input string.
20) Try to insert the selected char in a random position in the output string.
30) If it can't be inserted because of proximity with the same char, go to 10.
40) Remove the selected char from the input string and go to 10.
50) Continue until there are no more chars in the input string, or the failed attempts are too many.
public static string ShuffleNoSameAdjacent(string input, Random random = null)
{
    if (input == null) return null;
    if (random == null) random = new Random();
    string output = "";
    int maxAttempts = input.Length * input.Length * 2;
    int attempts = 0;
    while (input.Length > 0)
    {
        while (attempts < maxAttempts)
        {
            int inputPos = random.Next(0, input.Length);
            var outputPos = random.Next(0, output.Length + 1);
            var c = input[inputPos];
            if (outputPos > 0 && output[outputPos - 1] == c)
            {
                attempts++; continue;
            }
            if (outputPos < output.Length && output[outputPos] == c)
            {
                attempts++; continue;
            }
            input = input.Remove(inputPos, 1);
            output = output.Insert(outputPos, c.ToString());
            break;
        }
        if (attempts >= maxAttempts) throw new InvalidOperationException(
            $"Shuffle failed to complete after {attempts} attempts.");
    }
    return output;
}
Not suitable for strings longer than 1,000 chars!
Update: And here is a more complicated deterministic approach. The algorithm is:
Group the elements and sort the groups by length.
Create three empty piles of elements.
Insert each group to a separate pile, inserting always the largest group to the smallest pile, so that the piles differ in length as little as possible.
Check that there is no pile with more than half the total elements, in which case satisfying the condition of not having same adjacent elements is impossible.
Shuffle the piles.
Start yielding elements from the piles, selecting a different pile each time.
When the piles that are eligible for selection are more than one, select randomly, weighting by the size of each pile. Piles containing near half of the remaining elements should be much preferred. For example if the remaining elements are 100 and the two eligible piles have 49 and 40 elements respectively, then the first pile should be 10 times more preferable than the second (because 50 - 49 = 1 and 50 - 40 = 10).
public static IEnumerable<T> ShuffleNoSameAdjacent<T>(IEnumerable<T> source,
Random random = null, IEqualityComparer<T> comparer = null)
{
if (source == null) yield break;
if (random == null) random = new Random();
if (comparer == null) comparer = EqualityComparer<T>.Default;
var grouped = source
.GroupBy(i => i, comparer)
.OrderByDescending(g => g.Count());
var piles = Enumerable.Range(0, 3).Select(i => new Pile<T>()).ToArray();
foreach (var group in grouped)
{
GetSmallestPile().AddRange(group);
}
int totalCount = piles.Select(e => e.Count).Sum();
if (piles.Any(pile => pile.Count > (totalCount + 1) / 2))
{
throw new InvalidOperationException("Shuffle is impossible.");
}
foreach (var pile in piles) Shuffle(pile);
Pile<T> previouslySelectedPile = null;
while (totalCount > 0)
{
var selectedPile = GetRandomPile_WeightedByLength();
yield return selectedPile[selectedPile.Count - 1];
selectedPile.RemoveAt(selectedPile.Count - 1);
totalCount--;
previouslySelectedPile = selectedPile;
}
List<T> GetSmallestPile()
{
List<T> smallestPile = null;
int smallestCount = Int32.MaxValue;
foreach (var pile in piles)
{
if (pile.Count < smallestCount)
{
smallestPile = pile;
smallestCount = pile.Count;
}
}
return smallestPile;
}
void Shuffle(List<T> pile)
{
for (int i = 0; i < pile.Count; i++)
{
int j = random.Next(i, pile.Count);
if (i == j) continue;
var temp = pile[i];
pile[i] = pile[j];
pile[j] = temp;
}
}
Pile<T> GetRandomPile_WeightedByLength()
{
var eligiblePiles = piles
.Where(pile => pile.Count > 0 && pile != previouslySelectedPile)
.ToArray();
Debug.Assert(eligiblePiles.Length > 0, "No eligible pile.");
foreach (var pile in eligiblePiles)
{
    pile.Proximity = ((totalCount + 1) / 2) - pile.Count;
    pile.Score = 1;
}
Debug.Assert(eligiblePiles.All(pile => pile.Proximity >= 0),
"A pile has negative proximity.");
foreach (var pile in eligiblePiles)
{
foreach (var otherPile in eligiblePiles)
{
if (otherPile == pile) continue;
pile.Score *= otherPile.Proximity;
}
}
var sumScore = eligiblePiles.Select(p => p.Score).Sum();
while (sumScore > Int32.MaxValue)
{
foreach (var pile in eligiblePiles) pile.Score /= 100;
sumScore = eligiblePiles.Select(p => p.Score).Sum();
}
if (sumScore == 0)
{
return eligiblePiles[random.Next(0, eligiblePiles.Length)];
}
var randomScore = random.Next(0, (int)sumScore);
int accumulatedScore = 0;
foreach (var pile in eligiblePiles)
{
accumulatedScore += (int)pile.Score;
if (randomScore < accumulatedScore) return pile;
}
Debug.Fail("Could not select a pile randomly by weight.");
return null;
}
}
private class Pile<T> : List<T>
{
public int Proximity { get; set; }
public long Score { get; set; }
}
This implementation can shuffle millions of elements. I am not completely convinced that the quality of the shuffling is as perfect as with the previous probabilistic implementation, but it should be close.
func shuffle(str:String)-> String{
var shuffleArray = [Character](str)
//Sorting
shuffleArray.sort()
var shuffle1 = [Character]()
var shuffle2 = [Character]()
var adjacentStr = ""
//Split
for i in 0..<shuffleArray.count{
if i > shuffleArray.count/2 {
shuffle2.append(shuffleArray[i])
}else{
shuffle1.append(shuffleArray[i])
}
}
let count = shuffle1.count > shuffle2.count ? shuffle1.count:shuffle2.count
//Merge with adjacent element
for i in 0..<count {
if i < shuffle1.count{
adjacentStr.append(shuffle1[i])
}
if i < shuffle2.count{
adjacentStr.append(shuffle2[i])
}
}
return adjacentStr
}
let s = shuffle(str: "AABC")
print(s)

Comparing char arrays method

I've been playing Space Engineers, which has been epic since they added in-game programming. I'm trying to make a GPS auto-pilot navigation script and have to get the block positions, finding the blocks by name by looking for a smaller string within their bigger string name. I wrote this method to find a small string (word) in a larger string (the name of the block):
bool contains(string text, string wordInText)
{
char[] chText = text.ToCharArray();
char[] chWord = wordInText.ToCharArray();
int index = 0;
for(int i = 0 ; i < chText.Length - chWord.Length ; i++)
for(int j = 0; j < chWord.Length;j++,index++)
if (chWord[0] == chText[i])
index = i;
else if (chWord[j] == chText[index]){}
else if (index == chWord.Length-1)
return true;
else break;
return false;
}
Am I even doing it right, or should I be doing it another, shorter way?
This is simple with .Contains() which returns a bool.
text.Contains(wordInText);
If you simply want to check whether a string contains another string, then you can use string.Contains; the string class already provides a bunch of methods for string operations.
As already mentioned, the String class has a Contains method that should give you what you need.
That said, your code doesn't work. I can see where you're going with it, but it's just not going to work. Stepping through it in a proper dev environment would show you why, but since this is more in the way of a script, that's probably not an option.
So... the basic idea is to iterate through the string you're searching in, looking for matches against the string you're searching for. Your outer for statement looks fine for this, but your inner statements are a bit messed up.
Firstly, you're doing the first character check repeatedly. It's wasteful, and misplaced. Do it once per iteration of the outer loop.
Second, your exit condition is going to fire when the first character of wordInText matches the character at index wordInText.Length in text, which is apparently not what you want.
In fact you're all tripped up over the index variable. It isn't actually useful, so I'd drop it completely.
Here's a similar piece of code that should work. It is still much slower than the library String.Contains method, but hopefully it shows you how you might achieve the same thing.
for (int i = 0; i <= chText.Length - chWord.Length; i++)
{
if (chText[i] == chWord[0])
{
int j;
for (j = 0; j < chWord.Length; j++)
{
if (chText[i + j] != chWord[j])
break;
}
if (j == chWord.Length)
return true;
}
}
return false;

Time complexity of a powerset generating function

I'm trying to figure out the time complexity of a function that I wrote (it generates a power set for a given string):
public static HashSet<string> GeneratePowerSet(string input)
{
    HashSet<string> powerSet = new HashSet<string>();
    if (string.IsNullOrEmpty(input))
        return powerSet;

    int powSetSize = (int)Math.Pow(2.0, (double)input.Length);

    // Start at 1 to skip the empty string case
    for (int i = 1; i < powSetSize; i++)
    {
        string str = Convert.ToString(i, 2);
        string pset = str;
        for (int k = str.Length; k < input.Length; k++)
        {
            pset = "0" + pset;
        }

        string set = string.Empty;
        for (int j = 0; j < pset.Length; j++)
        {
            if (pset[j] == '1')
            {
                set = string.Concat(set, input[j].ToString());
            }
        }
        powerSet.Add(set);
    }
    return powerSet;
}
So my attempt is this:
let the size of the input string be n
in the outer for loop, we must iterate 2^n times (because the set size is 2^n).
in the inner for loop, we must iterate 2*n times (at worst).
1. So Big-O would be O((2^n)*n) (since we drop the constant 2)... is that correct?
And n*(2^n) is worse than n^2.
if n = 4 then
(4*(2^4)) = 64
(4^2) = 16
if n = 10 then
(10*(2^10)) = 10240
(10^2) = 100
2. Is there a faster way to generate a power set, or is this about optimal?
A comment:
the above function is part of an interview question where the program is supposed to take in a string, then print out the words in the dictionary whose letters are an anagram subset of the input string (e.g. input: tabrcoz; output: boat, car, cat, etc.). The interviewer claims that an n*m implementation is trivial (where n is the length of the string and m is the number of words in the dictionary), but I don't see how to do it without finding the valid subsets of the given string. It seems that the interviewer is incorrect.
I was given the same interview question when I interviewed at Microsoft back in 1995. Basically the problem is to implement a simple Scrabble playing algorithm.
You are barking up completely the wrong tree with this idea of generating the power set. Nice thought, clearly way too expensive. Abandon it and find the right answer.
Here's a hint: run an analysis pass over the dictionary that builds a new data structure more amenable to efficiently solving the problem you actually have to solve. With an optimized dictionary you should be able to achieve O(nm). With a more cleverly built data structure you can probably do even better than that.
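One possible reading of that hint (my sketch, not necessarily the data structure the answerer had in mind) is to precompute a letter-frequency signature for each dictionary word; a query is then one constant-size comparison per word, giving O(n*m) overall:

using System.Collections.Generic;

// Returns the dictionary words whose letters are an anagram subset of input.
static List<string> FindAnagramSubsets(string input, IEnumerable<string> dictionary)
{
    int[] available = CountLetters(input);
    var results = new List<string>();
    foreach (var word in dictionary)
    {
        int[] needed = CountLetters(word); // in a real version, precompute and cache per word
        bool fits = true;
        for (int i = 0; i < 26 && fits; i++)
            fits = needed[i] <= available[i];
        if (fits) results.Add(word);
    }
    return results;
}

static int[] CountLetters(string s)
{
    var counts = new int[26];
    foreach (char c in s.ToLowerInvariant())
        if (c >= 'a' && c <= 'z') counts[c - 'a']++;
    return counts;
}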
2. Is there a faster way to generate a power set, or is this about optimal?
Your algorithm is reasonable, but your string handling could use improvement.
string str = Convert.ToString(i, 2);
string pset = str;
for (int k = str.Length; k < input.Length; k++)
{
pset = "0" + pset;
}
All you're doing here is setting up a bitfield, but using a string. Just skip this, and use variable i directly.
for (int j = 0; j < input.Length; j++)
{
    if ((i & (1 << j)) != 0)
    {
When you build the string, use a StringBuilder instead of creating multiple strings.
// At the beginning of the method
StringBuilder set = new StringBuilder(input.Length);
...
// Inside the loop
set.Clear();
...
set.Append(input[j]);
...
powerSet.Add(set.ToString());
Will any of this change the complexity of your algorithm? No. But it will significantly reduce the number of extra String objects you create, which will provide you a good speedup.
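Putting both suggestions together, a rough sketch of the revised method might look like this (my illustration; like the original, it assumes input.Length is small enough that 2^n fits in an int):

using System.Collections.Generic;
using System.Text;

public static HashSet<string> GeneratePowerSet(string input)
{
    var powerSet = new HashSet<string>();
    if (string.IsNullOrEmpty(input)) return powerSet;

    int powSetSize = 1 << input.Length;          // 2^n subsets
    var set = new StringBuilder(input.Length);   // reused for every subset
    for (int i = 1; i < powSetSize; i++)         // start at 1 to skip the empty subset
    {
        set.Clear();
        for (int j = 0; j < input.Length; j++)
        {
            if ((i & (1 << j)) != 0)
                set.Append(input[j]);
        }
        powerSet.Add(set.ToString());
    }
    return powerSet;
}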

C# search with resemblance / affinity

Suppose we have an IEnumerable collection with 20,000 Person object items.
Then suppose we have created another Person object.
We want to list all Persons that resemble this Person.
That means, for instance, if the Surname affinity is more than 90%, add that Person to the list.
e.g. ("Andrew" vs "Andrw")
What is the most effective / quick way of doing this?
Iterating through the collection and comparing char by char with affinity determination? Or?
Any ideas?
Thank you!
You may be interested in:
Levenshtein Distance Algorithm
Peter Norvig - How to Write a Spelling Corrector
(you'll be interested in the part where he compares a word against a collection of existing words)
Depending on how often you'll need to do this search, the brute force iterate and compare method might be fast enough. Twenty thousand records really isn't all that much and unless the number of requests is large your performance may be acceptable.
That said, you'll have to implement the comparison logic yourself and if you want a large degree of flexibility (or if you need find you have to work on performance) you might want to look at something like Lucene.Net. Most of the text search engines I've seen and worked with have been more file-based, but I think you can index in-memory objects as well (however I'm not sure about that).
Good luck!
I'm not sure if you're asking for help writing the search given your existing affinity function, or if you're asking for help writing the affinity function. So for the moment I'll assume you're completely lost.
Given that assumption, you'll notice that I divided the problem into two pieces, and that's what you need to do as well. You need to write a function that takes two string inputs and returns a boolean value indicating whether or not the inputs are sufficiently similar. Then you need a separate search that takes a delegate matching that kind of signature.
The basic signature for your affinity function might look like this:
bool IsAffinityMatch(string p1, string p2)
And then your search would look like this:
MyPersonCollection.Where(p => IsAffinityMatch(p.Surname, OtherPerson.Surname));
I provide the source code of that Affinity method:
/// <summary>
/// Compute Levenshtein distance according to the Levenshtein Distance Algorithm
/// </summary>
/// <param name="s">String 1</param>
/// <param name="t">String 2</param>
/// <returns>Distance between the two strings.
/// The larger the number, the bigger the difference.
/// </returns>
private static int Compare(string s, string t)
{
    /* if both strings are not set, they are incomparable, but other fields can still match! */
    if (string.IsNullOrEmpty(s) && string.IsNullOrEmpty(t)) return 0;

    /* if one string has a value and the other one hasn't, it's definitely not a match */
    if (string.IsNullOrEmpty(s) || string.IsNullOrEmpty(t)) return -1;

    s = s.ToUpper().Trim();
    t = t.ToUpper().Trim();

    int n = s.Length;
    int m = t.Length;
    int[,] d = new int[n + 1, m + 1];
    int cost;

    if (n == 0) return m;
    if (m == 0) return n;

    for (int i = 0; i <= n; d[i, 0] = i++) ;
    for (int j = 0; j <= m; d[0, j] = j++) ;

    for (int i = 1; i <= n; i++)
    {
        for (int j = 1; j <= m; j++)
        {
            cost = (t.Substring(j - 1, 1) == s.Substring(i - 1, 1) ? 0 : 1);
            d[i, j] = System.Math.Min(System.Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                                      d[i - 1, j - 1] + cost);
        }
    }
    return d[n, m];
}
That means if 0 is returned, the two strings are identical.
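To connect this back to the question's 90% threshold, here is a rough sketch of an IsAffinityMatch built on that distance (my addition; it assumes the Compare method above is accessible from where it is called):

static bool IsAffinityMatch(string p1, string p2, double threshold = 0.9)
{
    int distance = Compare(p1, p2);
    if (distance < 0) return false;            // one of the strings was empty
    int maxLen = Math.Max(p1.Length, p2.Length);
    if (maxLen == 0) return true;              // both empty: treat as identical
    double similarity = 1.0 - (double)distance / maxLen;
    return similarity >= threshold;
}

// Usage, as in the earlier answer:
// MyPersonCollection.Where(p => IsAffinityMatch(p.Surname, OtherPerson.Surname));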
