Implementing an efficient algorithm to find the intersection of two strings

Implement an algorithm that takes two strings as input and returns the intersection of the two, with each letter represented at most once.
Algorithm (assuming the language is C#):
Convert both strings into char arrays.
Take the smaller array and build a hash table from it, with each character as the key and 0 as the value.
Loop through the other array and increment the count in the hash table if that character is present in it.
Take out all characters from the hash table whose value is > 0.
These are the intersection values.
This is an O(n) solution, but it uses extra space: two char arrays and a hash table.
Can you guys think of a better solution than this?

How about this ...
var s1 = "aabbccccddd";
var s2 = "aabc";
var ans = s1.Intersect(s2);
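For reference, Intersect here is the LINQ Enumerable.Intersect extension method, which treats each string as a sequence of char and returns an IEnumerable<char> containing each common character once. A minimal sketch of turning that back into a string (the class and method names are made up for illustration):

```csharp
using System;
using System.Linq;

public static class LinqIntersection
{
    // Returns the distinct characters that occur in both strings,
    // in the order they first appear in s1.
    public static string CommonChars(string s1, string s2)
    {
        // Enumerable.Intersect yields each common element only once.
        return new string(s1.Intersect(s2).ToArray());
    }
}
```

With the strings from the answer, LinqIntersection.CommonChars("aabbccccddd", "aabc") gives "abc".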

Haven't tested this, but here's my thought:
Sort both strings, so you have two ordered sequences of characters. (In C# strings are immutable, so this means sorting a char[] copy of each one.)
Keeping an index into both strings, compare the "next" character from each string. If they differ, increment the index for the string with the smaller character. If they match, output the character once and increment both indexes.
Continue until you get to the end of one of the strings; whatever is left in the other string can't be part of the intersection.
Apart from the sorted copies, this only needs two integers and an output string (or StringBuilder). As an added bonus, the output values will be sorted too!
Part 2:
This is what I'd write (sorry about the comments, new to stackoverflow):
// Requires: using System; using System.Text;
private static string intersect(string left, string right)
{
    StringBuilder theResult = new StringBuilder();
    // Sort copies of the characters; the strings themselves can't be changed.
    char[] sortedLeft = left.ToCharArray();
    char[] sortedRight = right.ToCharArray();
    Array.Sort(sortedLeft);
    Array.Sort(sortedRight);
    int leftIndex = 0;
    int rightIndex = 0;
    char lastChar = default(char);
    // Walk both sorted sequences until one of them runs out.
    while (leftIndex < sortedLeft.Length && rightIndex < sortedRight.Length)
    {
        char leftChar = sortedLeft[leftIndex];
        char rightChar = sortedRight[rightIndex];
        if (leftChar < rightChar) { leftIndex++; continue; }
        if (leftChar > rightChar) { rightIndex++; continue; }
        // Same character in both strings: output it unless it was just output.
        if (theResult.Length == 0 || lastChar != leftChar)
        {
            theResult.Append(leftChar);
            lastChar = leftChar;
        }
        leftIndex++;
        rightIndex++;
    }
    return theResult.ToString();
}
I hope that makes more sense.

You don't need the two char arrays. The System.String data type has a built-in indexer by position that returns the char at that position, so you could just loop from 0 to (String.Length - 1). If you're more interested in speed than in optimizing storage space, you could make a HashSet from one of the strings, then make a second HashSet that will contain your final result. Then you iterate through the second string, testing each char against the first HashSet, and if it exists, add it to the second HashSet. By the end, you have a single HashSet with all the intersections, and you save yourself the pass through the hash table looking for entries with a non-zero value.
EDIT: I entered this before all the comments on the question about not wanting to use any built-in containers at all.
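The two-HashSet idea could be sketched roughly like this. The class and method names are hypothetical; only HashSet<char> and the string indexer come from the answer.

```csharp
using System;
using System.Collections.Generic;

public static class HashSetIntersection
{
    // Returns the set of characters that occur in both strings.
    public static HashSet<char> Intersect(string first, string second)
    {
        // A string is an IEnumerable<char>, so it can seed the set directly.
        var seen = new HashSet<char>(first);
        var result = new HashSet<char>();

        // Use the string indexer instead of copying to a char array.
        for (int i = 0; i < second.Length; i++)
        {
            char c = second[i];
            if (seen.Contains(c))
            {
                result.Add(c); // Add ignores characters already in the set
            }
        }
        return result;
    }
}
```

Seeding the first set from the shorter string keeps the extra memory down, since the result can never be larger than that set.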

Here's how I would do this. It's still O(N) and it doesn't use a hash table, but instead one int array of length 26 (ideally).
Make an array of 26 integers, one element for each letter of the alphabet, initialized to 0.
Iterate over the first string, decrementing the letter's element by one each time a letter is encountered.
Iterate over the second string and, for each letter you encounter, replace the value at the corresponding index with its absolute value. (edit: thanks to scwagner in comments)
Return all letters whose indexes hold a value greater than 0.
Still O(N), and extra space of only 26 ints.
Of course, if you're not limited to only lowercase or only uppercase characters, your array size may need to change.
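A rough sketch of those steps, assuming the input only matters for the lowercase letters 'a' to 'z' (anything else is skipped here, which is my assumption and not part of the answer); the names are hypothetical:

```csharp
using System;
using System.Text;

public static class LetterTableIntersection
{
    public static string Intersect(string first, string second)
    {
        int[] counts = new int[26]; // one slot per letter, all 0 initially

        // First string: push the slot of every letter seen below zero.
        foreach (char c in first)
        {
            if (c >= 'a' && c <= 'z') counts[c - 'a']--;
        }

        // Second string: a negative slot means the letter was in the first
        // string too, so flip it positive. A slot that is still 0 stays 0.
        foreach (char c in second)
        {
            if (c >= 'a' && c <= 'z') counts[c - 'a'] = Math.Abs(counts[c - 'a']);
        }

        // Collect every letter whose slot ended up greater than 0.
        var result = new StringBuilder();
        for (int i = 0; i < counts.Length; i++)
        {
            if (counts[i] > 0) result.Append((char)('a' + i));
        }
        return result.ToString();
    }
}
```

Letters that only occur in the first string stay negative, and letters that only occur in the second stay at 0, so neither shows up in the output.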

"with each letter represented at most once"
I'm assuming that this means you just need to know the intersections, and not how many times they occurred. If that's so then you can trim down your algorithm by making use of yield. Instead of storing the count and continuing to iterate the second string looking for additional matches, you can yield the intersection right there and continue to the next possible match from the first string.
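One way to read that suggestion is an iterator method that yields each common character as soon as it is found. This is only a sketch; the HashSet and the names are my own choices, not something the answer specifies.

```csharp
using System;
using System.Collections.Generic;

public static class YieldIntersection
{
    public static IEnumerable<char> Intersect(string first, string second)
    {
        var candidates = new HashSet<char>(first);
        foreach (char c in second)
        {
            // Remove returns true only the first time c is found, so each
            // common character is yielded once and no counts are stored.
            if (candidates.Remove(c))
            {
                yield return c;
            }
        }
    }
}
```

Because the method is lazy, a caller that only needs to know whether any intersection exists can stop after the first yielded character.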

Related

how to add a sign between each letter in a string in C#?

I have a task in which I have to write a function called accum, which transforms a given string into something like this:
Accumul.Accum("abcd"); // "A-Bb-Ccc-Dddd"
Accumul.Accum("RqaEzty"); // "R-Qq-Aaa-Eeee-Zzzzz-Tttttt-Yyyyyyy"
Accumul.Accum("cwAt"); // "C-Ww-Aaa-Tttt"
So far I have only converted each letter to uppercase and... Now that I am writing about it, I think it could be easier for me to first multiply the number of each letter and then add a dash there... Okay, well, let's say I have already multiplied them (I will deal with that later) and now I need to add the dash.
I tried several ways to solve this, including for and foreach (and now that I think of it, I can't use foreach if I want to add a dash after multiplying the letters) with String.Join, String.Insert, or something called StringBuilder with Append (which I don't exactly understand), and it does nothing to the string.
One of those loops that I tried was:
for (int letter = 0; letter < s.Length-1; letter += 2) {
    if (letter % 2 == 0) s.Replace("", "-");
}
and
for (int letter = 0; letter < s.Length; letter++) {
    return String.Join(s, "-");
}
The second one returns an "unreachable code" error. What am I doing wrong here, that it does nothing to the string (after the uppercase conversion)? Also, is there any method to copy each letter, in order to increase the number of them?

As you say, string.Join can be used as long as an enumerable is created instead of a foreach loop. Since the string itself is enumerable, you can use the LINQ Select overload that includes an index:
var input = "abcd";
var res = string.Join("-", input.Select((c,i) => Char.ToUpper(c) + new string(Char.ToLower(c),i)));
(Assuming each char is unique or that repeats can be treated the same way, e.g. "aab" would become "A-Aa-Bbb".)
Explanation:
The Select extension method takes a lambda function as its parameter, with c being a char and i its index. The lambda returns an uppercase version of the char (c) followed by a string of the lowercase char repeated index times (new string(char, length)), which is an empty string for the first index. Finally, string.Join concatenates the resulting enumeration with a - between each element.

Use this code.
string result = String.Empty;
for (int i = 0; i < s.Length; i++)
{
    char c = s[i];
    result += char.ToUpper(c);
    result += new String(char.ToLower(c), i);
    if (i < s.Length - 1)
    {
        result += "-";
    }
}
It would be better to use StringBuilder instead of string concatenation, but this code may be a bit clearer.
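For comparison, the StringBuilder variant mentioned above might look like the following sketch. The Accumul class and Accum method names are taken from the question; everything else is an assumption.

```csharp
using System;
using System.Text;

public static class Accumul
{
    public static string Accum(string s)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < s.Length; i++)
        {
            if (i > 0) sb.Append('-');        // dash between groups, not after the last
            sb.Append(char.ToUpper(s[i]));    // first letter of each group is uppercase
            sb.Append(char.ToLower(s[i]), i); // then i lowercase copies (none when i == 0)
        }
        return sb.ToString();
    }
}
```

Append(char, int) repeats the character the given number of times, so no intermediate strings are created inside the loop.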

Strings are immutable, which means that you cannot modify them once you have created them. It means that the Replace function returns a new string that you need to capture somehow:
s = s.Replace("x", "-");
You are currently not assigning the result of the Replace method anywhere, which is why you don't see any results.

For the future, the best way to approach problems like this one is not to search for a code snippet, but to write down a step-by-step algorithm for how you can achieve the expected result in plain English or some other pseudocode, e.g.
Given I have input string 'abcd' which should turn into output string 'A-Bb-Ccc-Dddd'.
Copy first character 'a' from the input to Buffer.
Store the index of the character to Index.
If Buffer has only one character make it Upper Case.
If Index is greater than 1, trail Buffer with Index-1 lower case characters.
Append dash '-' to the Buffer.
Copy Buffer content to Output and clear Buffer.
Copy second character 'b' from the input to Buffer.
...
etc.
Aha moment often happens on the third iteration. Hope it helps! :)

Does there exist an O(n) algorithm for finding the smallest substring from an input string, containing a given character set and count?

Given a string s, and a set of characters with count, find the minimal substring in s that contains all the characters repeating their count number of times.
Example:
charcount = { { 'A', 3 }, { 'B', 1 } };
str = "kjhdfsbabasdadaaaaasdkaaajbajerhhayeom"
---> "aajba"
I know how to do it in O(n^2) time by iterating through all the substrings from smallest to largest.
Possible signature of function:
string SmallestSubstringWithCharacterCount(Dictionary<char,int> chardic, string source)
{
// ...
}
I'm guessing there's some way where you can iterate through str, because once you get to
"kjhdfsbabasdadaaaaasdkaaajbajerhhayeom"
             |
             here
you've found the first string, "kjhdfsbabasda", containing all the characters in the set.

Yes, a linear algorithm does exist.
You need an additional currentcharcount map with initial zero counts, and a GoodCount counter.
Make two index pointers, left and right, and move them through the input string.
If the next char is from the set, increment the count for this char in currentcharcount. If this count becomes equal to the goal value, increment GoodCount.
Move the right index until GoodCount reaches the charcount length; the current substring now contains all needed chars.
Then move the left index, decrementing counts, and decrementing GoodCount when a count drops below its goal. Just before GoodCount is decremented, we have the shortest valid substring ending at the current right index.
After decrementing GoodCount, repeat the process with the right index, and so on, to choose the best of all these shortest substrings.
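The two-pointer procedure can be sketched as follows, using the signature proposed in the question. It assumes every required count is positive, compares characters case-sensitively, and returns null when no substring qualifies; those choices and the MinWindow class name are mine.

```csharp
using System;
using System.Collections.Generic;

public static class MinWindow
{
    public static string SmallestSubstringWithCharacterCount(
        Dictionary<char, int> chardic, string source)
    {
        if (chardic.Count == 0) return string.Empty; // nothing is required

        // Counts of the required characters inside the current window.
        var current = new Dictionary<char, int>();
        foreach (char key in chardic.Keys) current[key] = 0;

        int goodCount = 0;                  // required chars that reached their goal
        int bestStart = -1, bestLength = int.MaxValue;
        int left = 0;

        for (int right = 0; right < source.Length; right++)
        {
            char r = source[right];
            if (chardic.ContainsKey(r))
            {
                current[r]++;
                if (current[r] == chardic[r]) goodCount++;
            }

            // While the window is valid, record it and shrink from the left.
            while (goodCount == chardic.Count)
            {
                int length = right - left + 1;
                if (length < bestLength) { bestStart = left; bestLength = length; }

                char l = source[left];
                if (chardic.ContainsKey(l))
                {
                    if (current[l] == chardic[l]) goodCount--; // about to drop below goal
                    current[l]--;
                }
                left++;
            }
        }
        return bestStart < 0 ? null : source.Substring(bestStart, bestLength);
    }
}
```

Each index is advanced at most once per character of the input, which is where the O(n) bound comes from. For the example string with lowercase keys { 'a', 3 } and { 'b', 1 }, the result is "aajba".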

C# Search array within provided index points

I'm not sure how best to phrase this. I have a text file of almost 80,000 words which I have converted to a string array.
Basically I want a method where I pass it a word and it checks if it's in the word string array. To save it searching all 80,000 words each time, I have indexed the locations where the words beginning with each letter start and end in a two-dimensional array. So wordIndex[0,0] = 0 is where the 'A' words start and wordIndex[1,0] = 4407 is where they end. Then wordIndex[0,1] = 4408 is where the words beginning with 'B' start, etc.
What I would like to know is how I can present this range to a method to have it search for a value. I know I can give an index and length, but is this the only way? Can I say look for x within range y and z?

Look at a trie. It can store many words using little memory and supports quick search. Here is a good implementation.

Basically you could use a for loop to search just a part of the array:
string word = "apple";
int start = 0;
int end = 4407;
bool found = false;
for (int i = start; i <= end; i++)
{
    if (arrayOfWords[i] == word)
    {
        found = true;
        break;
    }
}
But since the description of your index implies that your array is already sorted a better way might be to go with Array.BinarySearch<T>.
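Array.BinarySearch has overloads that take a start index and a length, so the per-letter range can be passed in directly. A small sketch, assuming the range is sorted with ordinal string comparison (the WordLookup class and ContainsInRange method are hypothetical names):

```csharp
using System;

public static class WordLookup
{
    // Searches words[start..end] (both inclusive) for the given word.
    // Assumes that range is sorted according to StringComparer.Ordinal.
    public static bool ContainsInRange(string[] words, int start, int end, string word)
    {
        int length = end - start + 1; // BinarySearch wants a length, not an end index
        int position = Array.BinarySearch(words, start, length, word, StringComparer.Ordinal);
        return position >= 0;         // a negative result means "not found"
    }
}
```

With the index from the question, the 'A' range would be ContainsInRange(arrayOfWords, 0, 4407, word). The comparer passed here has to match the order the file was sorted in, otherwise the search can miss words that are present.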

How to put a specific byte from a byte array into a single byte

I have some code where I set up a byte[1] to fill with a random byte. I then need to get this randomly generated byte from index [0] of the array into a single byte to be able to compare it. (x is either 16 or 32; z always starts at 0.)
byte compareByte = 0x00;
byte[] rndByte = new byte[1];
byte[] buffer = new byte[x];
Random rnd = new Random();
for (int i = 0; i < dalmatinerRound.Length; i++)
{
    while (z != x)
    {
        Application.DoEvents();
        rnd.NextBytes(rndByte);
        compareByte = (byte) rndByte[0];
        if (compareByte == dalmatinerRound[i])
        {
            buffer[z] = compareByte;
            z++;
            if (z == x)
            {
                string str = Encoding.ASCII.GetString(buffer);
                textPass.Text = str;
            }
        }
    }
}
The problem is that compareByte is "A" every time, regardless of how often I try. Even if I use the random byte directly in the comparison, like:
if (rndByte[0] == dalmatinerRound[i])
it also returns "A". I can't get the byte from offset 0x00 of the array into a single byte.
But when I do a test and use:
string str = Encoding.ASCII.GetString(rndByte);
textPass.Text = str;
then it works and I get a different letter every time.
To be more clear: this code should generate a random password with a length of 16 or 32 digits. dalmatinerRound is an array of bytes of length 101 containing alphabetical letters (lower and upper case), 0-9 and also !"§$%&/()=?*+}][{
Thanks

Why not just use (byte)rnd.Next(0, 256)? In any case, rnd.NextBytes works just fine, and you get the first byte by using rndByte[0] just the way you did. In other words, the error must be somewhere else, not in the random generator or reading the byte value. Is this really the code, or have you made some changes? In any case, your code seems incredibly complicated and wasteful. You should probably just use your array of allowable values (no need to have it a byte array, chars are more useful) and use rnd.Next(0, dalmatinerRound.Length); to get a random allowed character.
What your code actually does is that it loops until you get a "random" byte... which is equal to (byte)'A'. Your loops are all wrong. Instead, you can use this:
StringBuilder pwd = new StringBuilder(wantedLength);
for (var i = 0; i < wantedLength; i++)
    pwd.Append(dalmatinerRound[rnd.Next(0, dalmatinerRound.Length)]);
And there you have your random password :)
This expects that dalmatinerRound is an array of strings, which is quite useful anyway, so you should do that.
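Wrapped into a method, the same idea might look like the sketch below. Here the allowed characters are passed as a char[] rather than strings, and the PasswordGenerator class, Generate method and parameter names are hypothetical.

```csharp
using System;
using System.Text;

public static class PasswordGenerator
{
    // Builds a string of wantedLength characters, each picked at random
    // from the allowed array.
    public static string Generate(char[] allowed, int wantedLength, Random rnd)
    {
        var pwd = new StringBuilder(wantedLength);
        for (int i = 0; i < wantedLength; i++)
        {
            // Next(0, n) returns a value in [0, n), so the index is always valid.
            pwd.Append(allowed[rnd.Next(0, allowed.Length)]);
        }
        return pwd.ToString();
    }
}
```

A call such as PasswordGenerator.Generate("ABCabc123!?".ToCharArray(), 16, new Random()) returns a 16-character string. System.Random is not a cryptographically secure generator, so for real passwords System.Security.Cryptography.RandomNumberGenerator is probably the better source of randomness.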

You're looping over the array of eligible characters, I assume starting "ABC...", and for each of those doing something. With your while loop you're trying to generate a count of x bytes in your buffer, but you don't stop until this is done - you never get a chance to increment i until you finish filling up the buffer.
You're also generating a random byte and only adding it to your buffer if it happens to be the current "candidate" character in dalmatinerRound. So the buffer sloooowly fills up with "A" characters. Then once this is full, the next time i is incremented, the while loop immediately exits, so no other characters are tried.
You should instead loop over the character index i in the target buffer, generating one random character in each iteration - just think how you would go about this process by hand.

It looks like your for loop is in the wrong place. Currently, the whole password is generated on the first (and presumably the only) iteration of the for loop. That means your comparison only ever compares the random byte to the first entry in dalmatinerRound, which is presumably the letter A.
You need to put the for loop inside the while loop, so that each random byte gets compared to every element of dalmatinerRound. (Make sure you put the loop after you generate the random byte!)
As a side note, there are much better ways of generating what you need. Since you have an array of all valid characters, you could just pick a random element from that to get a password digit (i.e. generate a random number between 0 and the length of the array).

Shuffle an array without creating any runs

I have an array of repeating letters:
AABCCD
and I would like to put them into pseudo-random order. Simple right, just use Fisher-Yates => done. However there is a restriction on the output - I don't want any runs of the same letter. I want at least two other characters to appear before the same character reappears. For example:
ACCABD
is not valid because there are two Cs next to each other.
ABCACD
is also not valid because the two C's are too close together (CAC), with only one other character (A) between them; I require at least two other characters.
Every valid sequence for this simple example:
ABCADC ABCDAC ACBACD ACBADC ACBDAC ACBDCA ACDABC ACDACB ACDBAC ACDBCA
ADCABC ADCBAC BACDAC BCADCA CABCAD CABCDA CABDAC CABDCA CADBAC CADBCA
CADCAB CADCBA CBACDA CBADCA CDABCA CDACBA DACBAC DCABCA
I used a brute force approach for this small array, but my actual problem is arrays with hundreds of elements. I've tried using Fisher-Yates with some suppression: do a normal Fisher-Yates, and then if you don't like the character that comes up, try X more times for a better one. This generates valid sequences only about 87% of the time and is very slow. I'm wondering if there's a better approach. Obviously this isn't possible for all arrays. An array of just "AAB" has no valid order, so I'd like to fall back to the best available order of "ABA" for something like this.

Here is a modified Fisher-Yates approach. As I mentioned, it is very difficult to generate a valid sequence 100% of the time, because you have to check that you haven't trapped yourself by leaving only AAA at the end of your sequence.
It is possible to create a recursive CanBeSorted method, which tells you whether or not a sequence can be sorted according to your rules. That will be your basis for a full solution, but this function, which returns a boolean value indicating success or failure, should be a starting point.
// Requires: using System; using System.Linq;
public static bool Shuffle(char[] array)
{
    var random = new Random();
    // Count the occurrences of each distinct character.
    var groups = array.GroupBy(e => e).ToDictionary(g => g.Key, g => g.Count());
    char last = '\0';
    char lastButOne = '\0';
    // Fill the array from the back; i > 0 so the final slot is checked too.
    for (int i = array.Length; i > 0; i--)
    {
        // Characters still available that differ from the two just placed.
        var candidates = groups.Keys.Where(c => groups[c] > 0)
            .Except(new[] { last, lastButOne }).ToList();
        if (!candidates.Any())
            return false;
        var @char = candidates[random.Next(candidates.Count)];
        // Find an unplaced copy of the chosen character in array[0..i).
        var j = Array.IndexOf(array, @char, 0, i);
        // Swap.
        var tmp = array[j];
        array[j] = array[i - 1];
        array[i - 1] = tmp;
        lastButOne = last;
        last = @char;
        groups[@char] = groups[@char] - 1;
    }
    return true;
}

Maintain a linked list that keeps track of each letter and its position in the result.
After getting the random number, pick its corresponding character from the input (same as Fisher-Yates), but now search the list to see whether it has already occurred.
If not, insert the letter into the result and also into the linked list, with its position in the result.
If yes, check its position in the result (which you stored in the linked list when you wrote that letter to the result). Compare this location with the current insert location: if abs(currentlocation - previouslocation) is 3 or greater, you can insert the letter into the result; otherwise, choose a random number again.
