I am comparing two lists of data that were generated from a binary file. I have a good idea of why it's running slowly: when there is a significant number of records, it does unneeded, redundant work.
For example, if record 1a matches record 1a, the condition is true. Since 2a != 1a, why even bother checking 1a again? I need to eliminate 1a from being checked again; if I don't, it will still be checking the first record when it gets to the 400,000th record. I thought about making the second for loop a foreach, but I can't remove 1a while iterating through the nested loop.
The number of items in either loop can vary. I don't think a single for loop using 'i' will work, since the match can be anywhere. I'm reading from a binary file.
This is my current code. The program has been running for over an hour and is still going. I removed a lot of my iterating code for readability.
for (int i = 0; i < origItemList.Count; i++)
{
    int modFoundIndex = 0;
    bool foundIt = false;
    for (int g = 0; g < modItemList.Count; g++)
    {
        if ((origItemList[i].X == modItemList[g].X)
            && (origItemList[i].Y == modItemList[g].Y)
            && (origItemList[i].Z == modItemList[g].Z)
            && (origItemList[i].M == modItemList[g].M))
        {
            foundIt = true;
            modFoundIndex = g;
            break;
        }
        else
        {
            foundIt = false;
        }
    }
    if (foundIt)
    {
        /*
         * This runs assuming it finds an X,Y,Z,M
         * coordinate. It then checks the database file.
         */
        //grab the rows where the coordinates match
        DataRow origRow = origDbfFile.dataset.Tables[0].Rows[i];
        DataRow modRow = modDbfFile.dataset.Tables[0].Rows[modFoundIndex];
        //numberMatched indicates how many columns were matched
        int numberMatched = 0;
        //get the number of columns to match in order to detect all changes
        int numOfColumnsToMatch = origDbfFile.datatable.Columns.Count;
        List<String> mismatchedColumns = new List<String>();
        //check each column name for a change
        foreach (String columnName in columnNames)
        {
            //this grabs whatever value is in that field
            String origRowValue = "" + origRow.Field<Object>(columnName);
            String modRowValue = "" + modRow.Field<Object>(columnName);
            //check if they are the same
            if (origRowValue.Equals(modRowValue))
            {
                //if they are the same, increase the number matched by one
                numberMatched++;
            }
            else
            {
                //add the column to the list of columns that don't match
                mismatchedColumns.Add(columnName);
            }
        }
        /* In the event it matches 15/16 columns, show the change */
        if (numberMatched != numOfColumnsToMatch)
        {
            //grab the shapefile in question
            Item differentAttrShpFile = origItemList[i];
            //start blue highlighting
            result += "<div class='turnBlue'>";
            //show where the change was made
            result += "Change Detected at<br/> point X: " +
                differentAttrShpFile.X + ",<br/> point Y: " +
                differentAttrShpFile.Y + ",<br/>";
            result += "</div>"; //end turnBlue div
            foreach (String mismatchedColumn in mismatchedColumns)
            {
                //iterate changes here
            }
        }
    }
}
You're coming at this in totally the wrong way. The loop you have is O(n^2); breaking when you find the match will, on average, cut the time in half for a hit, and that's not enough. If you have a quarter million items in each list, this loop executes 62 billion times, and even if the compiler optimizes out the extra array lookups you're still looking at at least a trillion instructions. You don't do O(n^2) for large n if you can possibly help it!
What you need to do is get rid of the O(n^2) aspect of this. My suggestion (a C# sketch follows the list):
1) Define a hashing function that looks at X, Y, Z and M and comes up with an integer value; my inclination would be one that is the word size of your target platform.
2) Iterate over both lists and compute hashes for everything.
3) Build an index over one of the lists, mapping hash to object. I suspect a dictionary is the best data structure here, but a simple sorted array would also do.
4) Iterate over the list you didn't build the index over, and compare each hash to the entries in the index. With a hash table that's an O(n) task; with a sorted array it's O(n log n).
5) When the hashes match, do a full comparison to confirm the hit is real: you will get the occasional collision even with a good 64-bit hash, and you'll get a decent number of them if your hashes are 32-bit.
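A minimal C# sketch of steps 1 through 5, reusing the Item type and list names from the question; the specific hash function is an assumption (any reasonable word-sized mix of the four fields will do):
// Step 1: a word-sized hash over the four coordinates.
static long Hash(Item it)
{
    return (long)it.X * 73856093L ^ (long)it.Y * 19349663L
         ^ (long)it.Z * 83492791L ^ (long)it.M * 2654435761L;
}

// Steps 2-3: index modItemList by hash; buckets hold indices so that
// collisions survive until the full comparison below.
var index = new Dictionary<long, List<int>>();
for (int g = 0; g < modItemList.Count; g++)
{
    long h = Hash(modItemList[g]);
    if (!index.TryGetValue(h, out var bucket))
        index[h] = bucket = new List<int>();
    bucket.Add(g);
}

// Step 4: probe the index once per original item; O(n) expected overall.
for (int i = 0; i < origItemList.Count; i++)
{
    int modFoundIndex = -1;
    if (index.TryGetValue(Hash(origItemList[i]), out var candidates))
    {
        foreach (int g in candidates)
        {
            // Step 5: full comparison to reject hash collisions.
            if (origItemList[i].X == modItemList[g].X &&
                origItemList[i].Y == modItemList[g].Y &&
                origItemList[i].Z == modItemList[g].Z &&
                origItemList[i].M == modItemList[g].M)
            {
                modFoundIndex = g;
                break;
            }
        }
    }
    // modFoundIndex >= 0 now plays the role of foundIt in the original loop.
}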
This is similar to what Loren said, but below it is in the language of .NET :)
1. Override the GetHashCode method to return a value combined from X, Y, Z and M. Override the Equals method to check that combined value.
2. Iterate over modItemList (a List) and build a HashSet from it before the loop.
3. In the inner loop, first check whether origItemList[i] exists in the HashSet using YourModHashSet.Contains(myObject).
4. If .Contains returns false, carry on; there is no match.
5. If .Contains returns true, iterate through the entire modItemList and apply your current logic of checking X, Y, Z and M across the whole list. Note that you should use the List here, because the hash set may conflate many objects whose hash codes are the same.
Also, I would use foreach instead of for, because I've seen foreach give slightly better results (5 to 30% faster) in cases like this.
Update:
I created a MyObject class like the one below:
public class MyObject
{
    public int X, Y, Z, M;

    public override int GetHashCode()
    {
        return X * 10000 + Y * 100 + Z * 10 + M;
    }

    public override bool Equals(object obj)
    {
        return (obj.GetHashCode() == this.GetHashCode());
    }
}
The GetHashCode method is important here. We don't want many false positives. A false positive occurs when the hash matches for some other combination of X, Y, Z and M. The best way to prevent false positives is to multiply each member so that each one affects its own decimal places in the hash code; for example, X=1, Y=2, Z=3, M=4 gives 1*10000 + 2*100 + 3*10 + 4 = 10234. Note that you should take care not to exceed Int32.MaxValue. If the expected values of X, Y, Z and M are small, you should be good.
set2.Clear();
s1 = DateTime.Now;
MyObject matchingElement;
totalmatch = 0;
foreach (MyObject elem in list2)
    set2.Add(elem);
foreach (MyObject t1 in list1)
{
    if (set2.Contains(t1))
    {
        matchingElement = null;
        foreach (MyObject t2 in list2)
        {
            if (t1.X == t2.X && t1.Y == t2.Y && t1.Z == t2.Z && t1.M == t2.M)
            {
                totalmatch++;
                matchingElement = t2;
                break;
            }
        }
        //Do something with matchingElement if not null
    }
}
Console.WriteLine("set foreach with contains: " + (DateTime.Now - s1).TotalSeconds +
    "\t Total Match: " + totalmatch);
Above is the sample code I was describing in my answer. It should run very fast when few matches are expected.
Related
I need to optimize the code below so it executes faster, whether by using more memory or by running in parallel. It currently takes 2 minutes to process a single record on a Windows 10 64-bit PC with 16 GB of RAM.
data1 list array length = 1000
data2 list array length = 100000
data3 list array length = 100
for (int d1 = 0; d1 < data1.Count; d1++)
{
    if (data1[d1].status == "UNMATCHED")
    {
        for (int d2 = 0; d2 < data2.Count; d2++)
        {
            if (data2[d2].status == "UNMATCHED")
            {
                vMatched = false;
                for (int d3 = 0; d3 < data3.Count; d3++)
                {
                    if (data3[d3].rule == "rule1")
                    {
                        if (data1[d1].value == data2[d2].value)
                        {
                            data1[d1].status = "MATCHED";
                            data2[d2].status = "MATCHED";
                            vMatched = true;
                            break;
                        }
                    }
                    else if (data3[d3].rule == "rule2")
                    {
                        ...
                    }
                    else if (data3[d3].rule == "rule100")
                    {
                        ...
                    }
                }
                if (vMatched)
                    break;
            }
        }
    }
}
First of all, for any kind of performance-oriented programming, avoid using strings; use more appropriate types, like enums or bools, instead. Another recommendation is to profile your code, so you know which parts actually take time.
In the given example only one rule is presented, so the data3 loop could be eliminated by first checking whether that rule exists and only then proceeding with the matching.
The matching between items in data1 and data2 essentially pairs unmatched items with the same value. Whenever problems like this occur, the standard solution is some kind of search structure, like a dictionary, to get better than linear search time. For example:
var data2Dictionary = data2.ToDictionary(d => Tuple.Create(d.value, d.status), d => d);
This should drastically decrease the time needed to find an item with a specific value and status. Keep in mind that the code above will throw if multiple items share the same value and status, and that the dictionary key will not be updated if an item's value or status changes.
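When duplicate (value, status) pairs are possible, a lookup is the safer sketch, since it allows several items per key; the property names below follow the question's code, and this assumes the items are reference types whose status can be mutated in place:
// Group data2 items by value; a lookup tolerates duplicate keys.
var data2ByValue = data2.ToLookup(d => d.value);

foreach (var d1 in data1)
{
    if (d1.status != "UNMATCHED") continue;
    // Among items sharing d1's value, take the first still-unmatched one.
    var partner = data2ByValue[d1.value].FirstOrDefault(d => d.status == "UNMATCHED");
    if (partner != null)
    {
        d1.status = "MATCHED";
        partner.status = "MATCHED";
    }
}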
You can avoid starting the second loop from 0 every time by keeping track of the last index in data2 that is still "UNMATCHED".
That should reduce the complexity.
In the best case, where matches consume data2 roughly in order:
Old: 1000 * 100000 * 100 iterations = 10,000,000,000
New: (1000 + 100000) * 100 iterations = 10,100,000
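A sketch of that idea with the question's names; note the saved index only pays off when matched entries accumulate at the front of data2, and in the general case the inner scan can still degrade toward quadratic:
int firstUnmatched = 0; // first data2 index that may still be "UNMATCHED"
for (int d1 = 0; d1 < data1.Count; d1++)
{
    if (data1[d1].status != "UNMATCHED") continue;
    // Advance past the fully matched prefix of data2 once, not per record.
    while (firstUnmatched < data2.Count && data2[firstUnmatched].status != "UNMATCHED")
        firstUnmatched++;
    for (int d2 = firstUnmatched; d2 < data2.Count; d2++)
    {
        if (data2[d2].status != "UNMATCHED") continue;
        if (data1[d1].value == data2[d2].value)
        {
            data1[d1].status = "MATCHED";
            data2[d2].status = "MATCHED";
            break;
        }
    }
}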
I have two lists of strings. This is not mandatory; I can convert them to any collection (list, dictionary, etc.).
First is "text":
Birds sings
Dogs barks
Frogs jumps
Second is "words":
sing
dog
cat
I need to iterate through "text", and if a line contains any of "words", do one thing; if not, do another.
Important: yes, in my case I need to find partial matches ignoring case, so the text "Dogs" is a match for the word "dog". This is why I use .Contains and .ToLower().
My naive attempt looks like this:
List<string> text = new List<string>();
List<string> words = new List<string>();
foreach (string line in text)
{
    bool found = false;
    foreach (string word in words)
    {
        if (line.ToLower().Contains(word.ToLower()))
        {
            // one thing
            found = true;
            break;
        }
    }
    if (!found)
    {
        // another
    }
}
The problem is size: 8,000 lines in the first list and ~50,000 words in the second. This takes too much time.
How can I make it faster?
I'm assuming that you only want to match on the specific words in your text list: that is, if text contains "dogs", and words contains "dog", then that shouldn't be a match.
Note that this is different to what your code currently does.
Given this, we can construct a HashSet<string> of all of the words in your text list. We can then query this very cheaply.
We'll also use StringComparer.OrdinalIgnoreCase to do our comparisons. This is a better way of doing a case-insensitive match than ToLower(), and ordinal comparisons are relatively cheap. If you're dealing with languages other than English, you'll need to consider whether you actually need StringComparer.CurrentCultureIgnoreCase or StringComparer.InvariantCultureIgnoreCase.
var textWords = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
foreach (var line in text)
{
    var lineWords = line.Split(' ');
    textWords.UnionWith(lineWords);
}

if (textWords.Overlaps(words))
{
    // One thing
}
else
{
    // Another
}
If this is not the case, and you do want to do a .Contains on each, then you can speed it up a bit by avoiding the calls to .ToLower(). Each call to .ToLower() creates a new string in memory, so you're creating two new, useless objects per comparison.
Instead, use:
if (line.IndexOf(word, StringComparison.OrdinalIgnoreCase) >= 0)
As above, you might have to use StringComparison.CurrentCultureIgnoreCase or StringComparison.InvariantCultureIgnoreCase depending on the language of your strings. However, you should see a significant speedup if your strings are entirely ASCII and you use OrdinalIgnoreCase as this makes the string search a lot quicker.
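Applied to the question's inner loop, a minimal sketch of that change:
foreach (string line in text)
{
    bool found = false;
    foreach (string word in words)
    {
        // Ordinal case-insensitive search, with no temporary lowercased strings.
        if (line.IndexOf(word, StringComparison.OrdinalIgnoreCase) >= 0)
        {
            found = true;
            break;
        }
    }
    if (!found)
    {
        // another
    }
}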
If you're using .NET Framework, another thing to try is moving to .NET Core. .NET Core introduced a lot of optimizations in this area, and you might find that it's quicker.
Another thing you can do is check whether you have duplicates in either text or words. If you have a lot, you might be able to save a lot of time. Consider using a HashSet<string> for this, or LINQ's .Distinct() (you'll need to test which is quicker).
You can try using LINQ for the second looping construct.
List<string> text = new List<string>();
List<string> words = new List<string>();
foreach (string line in text)
{
    bool found = words.FirstOrDefault(w => line.ToLower().Contains(w.ToLower())) != null;
    if (found)
    {
        //Do something
    }
    else
    {
        //Another
    }
}
Might not be as fast as you want but it will be faster than before.
You can improve the search algorithm.
public static int Search(string word, List<string> stringList)
{
    string wordCopy = word.ToLower();
    List<string> stringListCopy = new List<string>();
    stringList.ForEach(s => stringListCopy.Add(s.ToLower()));
    stringListCopy.Sort();

    int position = -1;
    int count = stringListCopy.Count;
    if (count > 0)
    {
        int min = 0;
        int max = count - 1;
        int middle = (max - min) / 2;
        int comparisonStatus = 0;
        do
        {
            comparisonStatus = string.Compare(wordCopy, stringListCopy[middle]);
            if (comparisonStatus == 0)
            {
                position = middle;
                break;
            }
            else if (comparisonStatus < 0)
            {
                max = middle - 1;
            }
            else
            {
                min = middle + 1;
            }
            middle = min + (max - min) / 2;
        } while (min <= max); // must be <=, or the last remaining element is never checked
    }
    return position;
}
Inside this method we create a copy of the string list, with all elements lowercased.
After that we sort the copied list in ascending order. This is crucial, because the entire algorithm relies on the list being sorted.
If the word exists in the list, the Search method will return its position in the list; otherwise it will return -1.
How does the algorithm work?
Instead of checking every element in the list, we split the list in half on every iteration.
In every iteration we take the element in the middle and compare the two strings (that element and our word). If our string is the same as the one in the middle, the search is finished. If our string sorts before the string in the middle, then it must be in the first half of the list, because the list is sorted ascending. If our string sorts after the string in the middle, it must be in the second half. We then take the appropriate half and repeat the process.
In the first iteration we take the entire list.
I've tested the Search method using these data:
List<string> stringList = new List<string>();
stringList.Add("Serbia");
stringList.Add("Greece");
stringList.Add("Egypt");
stringList.Add("Peru");
stringList.Add("Palau");
stringList.Add("Slovakia");
stringList.Add("Kyrgyzstan");
stringList.Add("Mongolia");
stringList.Add("Chad");
Search("Serbia", stringList);
This way you can search an entire list of ~50,000 elements in at most 16 iterations.
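For comparison, the framework's built-in List<T>.BinarySearch does the same O(log n) lookup once the list is sorted, and passing a StringComparer avoids the lowercased copies (a sketch, not the answer's original code):
// Sort once up front, then each lookup is O(log n).
stringList.Sort(StringComparer.OrdinalIgnoreCase);
int position = stringList.BinarySearch("Serbia", StringComparer.OrdinalIgnoreCase);
// BinarySearch returns the index when found; otherwise a negative number
// whose bitwise complement is the insertion point.
if (position >= 0)
{
    // found
}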
I always struggle with these types of algorithms. I have a scenario where I have a cubic value for freight and need to split this value into cartons of different sizes. There are 3 sizes available in this instance: 0.12m3, 0.09m3 and 0.05m3. A few examples:
Assume the total is 0.16m3; I need to consume this value with the appropriate cartons.
I will have 1 carton of 0.12m3, which leaves 0.04m3 to consume. This fits into the 0.05m3 carton, so I will have 1 carton of 0.05m3 and consumption is complete. The final answer is 1 x 0.12m3 and 1 x 0.05m3.
Assume the total is 0.32m3; I would end up with 2 x 0.12m3 and 1 x 0.09m3.
I would prefer something in either C# or SQL that would easily return the results to me.
Many thanks for any help.
Cheers
I wrote an algorithm that may be a little messy, but I do think it works. Your problem statement isn't 100% unambiguous, so this solution assumes you want to pick cartons so as to minimize the remaining space, starting the filling from the largest carton.
// List of cartons
var cartons = new List<double>
{
    0.12,
    0.09,
    0.05
};

// Amount of stuff that you want to put into cartons
var stuff = 0.32;

var distribution = new Dictionary<double, int>();

// For this algorithm, I want to sort by descending first.
cartons = cartons.OrderByDescending(x => x).ToList();

foreach (var carton in cartons)
{
    while (stuff > 0)
    {
        if (stuff >= carton)
        {
            // If the amount of stuff is at least the carton size, use one carton, then update stuff
            stuff = stuff - carton;
            distribution.CreateNewOrUpdateExisting(carton, 1);
        }
        else
        {
            // Otherwise, among the remaining cartons, pick the smallest one
            // that still leaves no overflow when the remaining stuff is put in
            var partial = cartons.Where(x => x - stuff >= 0 && x != carton);
            if (partial != null && partial.Count() > 0)
            {
                var min = partial.Min();
                if (min > 0)
                {
                    distribution.CreateNewOrUpdateExisting(min, 1);
                    stuff = stuff - min;
                }
            }
            else
            {
                break;
            }
        }
    }
}
There's an accompanying extension method, which either adds an item to the dictionary or, if the key already exists, increments the value.
public static class DictionaryExtensions
{
    public static void CreateNewOrUpdateExisting(this IDictionary<double, int> map, double key, int value)
    {
        if (map.ContainsKey(key))
        {
            map[key]++;
        }
        else
        {
            map.Add(key, value);
        }
    }
}
EDIT
Found a bug in the case where the initial stuff is smaller than the largest carton, so the code has been updated to fix it.
NOTE
This may still not be a 100% foolproof algorithm, as I haven't tested it extensively. But it should give you an idea of how to proceed.
EDIT EDIT
Changing the condition to while (stuff > 0) fixes the bug mentioned in the comments; the code above reflects this change.
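A usage sketch for the code above with the question's 0.32m3 example (dictionary iteration order is not guaranteed, so the lines may print in any order):
// After the loop has consumed stuff = 0.32:
foreach (var entry in distribution)
{
    Console.WriteLine(entry.Value + " x " + entry.Key + "m3");
}
// Expected, per the question's example:
// 2 x 0.12m3
// 1 x 0.09m3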
I've been trying to solve this interview problem, which asks you to shuffle a string so that no two adjacent letters are identical.
For example,
ABCC -> ACBC
The approach I'm thinking of is to:
1) Iterate over the input string and store the (letter, frequency) pairs in some collection
2) Build the result string by repeatedly pulling the highest-frequency letter (with frequency > 0) that we didn't just pull
3) Decrement the frequency whenever we pull a letter
4) Return the result string once all letters have zero frequency
5) Return an error if we're left with only one letter whose frequency is greater than 1
With this approach we can save the more precious (less frequent) letters for last. But for this to work, we need a collection that lets us efficiently query a key and at the same time efficiently sort it by values. Something like this would work except we need to keep the collection sorted after every letter retrieval.
I'm assuming Unicode characters.
Any ideas on what collection to use? Or an alternative approach?
You can sort the letters by frequency, split the sorted list in half, and construct the output by taking letters from the two halves in turn. This takes a single sort.
Example:
Initial string: ACABBACAB
Sort: AAAABBBCC
Split: AAAA+BBBCC
Combine: ABABABCAC
If the number of letters of highest frequency exceeds half the length of the string, the problem has no solution.
Why not use two data structures: one for sorting (like a heap) and one for key retrieval, like a dictionary?
The accepted answer may produce a correct result, but is likely not the 'correct' answer to this interview brain teaser, nor the most efficient algorithm.
The simple answer is to take the premise of a basic sorting algorithm and alter the looping predicate to check for adjacency rather than magnitude. This ensures that the 'sorting' operation is the only step required, and (like all good sorting algorithms) it does the least amount of work possible.
Below is a C# example akin to insertion sort, chosen for simplicity (though many sorting algorithms could be similarly adjusted):
string NonAdjacencySort(string stringInput)
{
    var input = stringInput.ToCharArray();
    // strings shorter than 2 characters need no shuffling
    // (and input[1] below would throw)
    if (input.Length < 2)
        return stringInput;
    for (var i = 0; i < input.Length; i++)
    {
        var j = i;
        while (j > 0 && j < input.Length - 1 &&
               (input[j + 1] == input[j] || input[j - 1] == input[j]))
        {
            var tmp = input[j];
            input[j] = input[j - 1];
            input[j - 1] = tmp;
            j--;
        }
        if (input[1] == input[0])
        {
            var tmp = input[0];
            input[0] = input[input.Length - 1];
            input[input.Length - 1] = tmp;
        }
    }
    return new string(input);
}
The major change to standard insertion sort is that the function has to both look ahead and behind, and therefore needs to wrap around to the last index.
A final point is that this type of algorithm fails gracefully, providing a result with the fewest consecutive characters (grouped at the front).
Since I somehow got convinced to expand an off-hand comment into a full algorithm, I'll write it out as an answer, which must be more readable than a series of uneditable comments.
The algorithm is pretty simple, actually. It's based on the observation that if we sort the string and then divide it into two equal-length halves, plus the middle character if the string has odd length, then corresponding positions in the two halves must differ from each other, unless there is no solution. That's easy to see: if the two characters are the same, then so are all the characters between them, which totals ⌈n/2⌉+1 characters. But a solution is only possible if there are no more than ⌈n/2⌉ instances of any single character.
So we can proceed as follows:
Sort the string.
If the string's length is odd, output the middle character.
Divide the string (minus its middle character if the length is odd) into two equal-length halves, and interleave the two halves.
At each point in the interleaving, since the pair of characters differ from each other (see above), at least one of them must differ from the last character output. So we first output that character and then the corresponding one from the other half.
The sample code below is in C++, since I don't have a C# environment handy to test with. It's also simplified in two ways, both of which would be easy enough to fix at the cost of obscuring the algorithm:
If at some point in the interleaving, the algorithm encounters a pair of identical characters, it should stop and report failure. But in the sample implementation below, which has an overly simple interface, there's no way to report failure. If there is no solution, the function below returns an incorrect solution.
The OP suggests that the algorithm should work with Unicode characters, but the complexity of correctly handling multibyte encodings didn't seem to add anything useful to explain the algorithm. So I just used single-byte characters. (In C# and certain implementations of C++, there is no character type wide enough to hold a Unicode code point, so astral plane characters must be represented with a surrogate pair.)
#include <algorithm>
#include <iostream>
#include <string>

// If possible, rearranges 'in' so that there are no two consecutive
// instances of the same character.
std::string rearrange(std::string in) {
    // Sort the input. The function is call-by-value,
    // so the caller's argument isn't changed.
    std::string out;
    size_t len = in.size();
    if (in.size()) {
        out.reserve(len);
        std::sort(in.begin(), in.end());
        size_t mid = len / 2;
        size_t tail = len - mid;
        char prev = in[mid];
        // For odd-length strings, start with the middle character.
        if (len & 1) out.push_back(prev);
        for (size_t head = 0; head < mid; ++head, ++tail)
            // See explanatory text
            if (in[tail] != prev) {
                out.push_back(in[tail]);
                out.push_back(prev = in[head]);
            }
            else {
                out.push_back(in[head]);
                out.push_back(prev = in[tail]);
            }
    }
    return out;
}
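Since the question is tagged C#, here is a direct port of the function above (assuming using System and System.Text; it keeps the same simplification of returning an incorrect result, rather than reporting failure, when no solution exists):
static string Rearrange(string input)
{
    var chars = input.ToCharArray();
    var output = new StringBuilder(chars.Length);
    int len = chars.Length;
    if (len > 0)
    {
        Array.Sort(chars);
        int mid = len / 2;
        int tail = len - mid;
        char prev = chars[mid];
        // For odd-length strings, start with the middle character.
        if ((len & 1) == 1) output.Append(prev);
        for (int head = 0; head < mid; head++, tail++)
        {
            // At least one of the pair differs from the last character
            // written; output that one first.
            if (chars[tail] != prev)
            {
                output.Append(chars[tail]);
                output.Append(prev = chars[head]);
            }
            else
            {
                output.Append(chars[head]);
                output.Append(prev = chars[tail]);
            }
        }
    }
    return output.ToString();
}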
You can do this using a priority queue.
Please find the explanation here:
https://iq.opengenus.org/rearrange-string-no-same-adjacent-characters/
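A minimal C# sketch of that idea, assuming .NET 6+ for PriorityQueue<TElement, TPriority>; the method name RearrangeWithHeap and the details are my illustration, not code from the linked page:
static string RearrangeWithHeap(string input)
{
    // Count each character's frequency.
    var counts = input.GroupBy(c => c).ToDictionary(g => g.Key, g => g.Count());
    // PriorityQueue is a min-heap, so enqueue negated counts to pop the most frequent first.
    var heap = new PriorityQueue<char, int>();
    foreach (var kv in counts) heap.Enqueue(kv.Key, -kv.Value);

    var sb = new StringBuilder(input.Length);
    while (heap.Count > 0)
    {
        heap.TryDequeue(out char c, out int negCount);
        if (sb.Length > 0 && sb[sb.Length - 1] == c)
        {
            // Same as the previous character: emit the next-best one instead.
            if (!heap.TryDequeue(out char alt, out int altNegCount))
                throw new InvalidOperationException("No valid arrangement exists.");
            sb.Append(alt);
            if (altNegCount + 1 < 0) heap.Enqueue(alt, altNegCount + 1);
            heap.Enqueue(c, negCount); // put the skipped character back
        }
        else
        {
            sb.Append(c);
            if (negCount + 1 < 0) heap.Enqueue(c, negCount + 1);
        }
    }
    return sb.ToString();
}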
Here is a probabilistic approach. The algorithm is:
10) Select a random char from the input string.
20) Try to insert the selected char in a random position in the output string.
30) If it can't be inserted because of proximity with the same char, go to 10.
40) Remove the selected char from the input string and go to 10.
50) Continue until there are no more chars in the input string, or the failed attempts are too many.
public static string ShuffleNoSameAdjacent(string input, Random random = null)
{
    if (input == null) return null;
    if (random == null) random = new Random();
    string output = "";
    int maxAttempts = input.Length * input.Length * 2;
    int attempts = 0;
    while (input.Length > 0)
    {
        while (attempts < maxAttempts)
        {
            int inputPos = random.Next(0, input.Length);
            var outputPos = random.Next(0, output.Length + 1);
            var c = input[inputPos];
            if (outputPos > 0 && output[outputPos - 1] == c)
            {
                attempts++; continue;
            }
            if (outputPos < output.Length && output[outputPos] == c)
            {
                attempts++; continue;
            }
            input = input.Remove(inputPos, 1);
            output = output.Insert(outputPos, c.ToString());
            break;
        }
        if (attempts >= maxAttempts) throw new InvalidOperationException(
            $"Shuffle failed to complete after {attempts} attempts.");
    }
    return output;
}
Not suitable for strings longer than 1,000 chars!
Update: And here is a more complicated deterministic approach. The algorithm is:
Group the elements and sort the groups by length.
Create three empty piles of elements.
Insert each group into a pile, always adding the largest remaining group to the smallest pile, so that the piles differ in length as little as possible.
Check that no pile holds more than half of the total elements, in which case satisfying the no-same-adjacent condition is impossible.
Shuffle the piles.
Start yielding elements from the piles, selecting a different pile each time.
When more than one pile is eligible for selection, select randomly, weighting by the size of each pile. Piles containing nearly half of the remaining elements should be strongly preferred. For example, if 100 elements remain and the two eligible piles have 49 and 40 elements respectively, then the first pile should be preferred 10 times more than the second (because 50 - 49 = 1 and 50 - 40 = 10).
public static IEnumerable<T> ShuffleNoSameAdjacent<T>(IEnumerable<T> source,
    Random random = null, IEqualityComparer<T> comparer = null)
{
    if (source == null) yield break;
    if (random == null) random = new Random();
    if (comparer == null) comparer = EqualityComparer<T>.Default;

    var grouped = source
        .GroupBy(i => i, comparer)
        .OrderByDescending(g => g.Count());
    var piles = Enumerable.Range(0, 3).Select(i => new Pile<T>()).ToArray();
    foreach (var group in grouped)
    {
        GetSmallestPile().AddRange(group);
    }
    int totalCount = piles.Select(e => e.Count).Sum();
    if (piles.Any(pile => pile.Count > (totalCount + 1) / 2))
    {
        throw new InvalidOperationException("Shuffle is impossible.");
    }
    // arrays have no ForEach extension, so plain foreach loops are used here
    foreach (var pile in piles) Shuffle(pile);

    Pile<T> previouslySelectedPile = null;
    while (totalCount > 0)
    {
        var selectedPile = GetRandomPile_WeightedByLength();
        yield return selectedPile[selectedPile.Count - 1];
        selectedPile.RemoveAt(selectedPile.Count - 1);
        totalCount--;
        previouslySelectedPile = selectedPile;
    }

    List<T> GetSmallestPile()
    {
        List<T> smallestPile = null;
        int smallestCount = Int32.MaxValue;
        foreach (var pile in piles)
        {
            if (pile.Count < smallestCount)
            {
                smallestPile = pile;
                smallestCount = pile.Count;
            }
        }
        return smallestPile;
    }

    void Shuffle(List<T> pile)
    {
        for (int i = 0; i < pile.Count; i++)
        {
            int j = random.Next(i, pile.Count);
            if (i == j) continue;
            var temp = pile[i];
            pile[i] = pile[j];
            pile[j] = temp;
        }
    }

    Pile<T> GetRandomPile_WeightedByLength()
    {
        var eligiblePiles = piles
            .Where(pile => pile.Count > 0 && pile != previouslySelectedPile)
            .ToArray();
        Debug.Assert(eligiblePiles.Length > 0, "No eligible pile.");
        foreach (var pile in eligiblePiles)
        {
            pile.Proximity = ((totalCount + 1) / 2) - pile.Count;
            pile.Score = 1;
        }
        Debug.Assert(eligiblePiles.All(pile => pile.Proximity >= 0),
            "A pile has negative proximity.");
        foreach (var pile in eligiblePiles)
        {
            foreach (var otherPile in eligiblePiles)
            {
                if (otherPile == pile) continue;
                pile.Score *= otherPile.Proximity;
            }
        }
        var sumScore = eligiblePiles.Select(p => p.Score).Sum();
        while (sumScore > Int32.MaxValue)
        {
            foreach (var pile in eligiblePiles) pile.Score /= 100;
            sumScore = eligiblePiles.Select(p => p.Score).Sum();
        }
        if (sumScore == 0)
        {
            return eligiblePiles[random.Next(0, eligiblePiles.Length)];
        }
        var randomScore = random.Next(0, (int)sumScore);
        int accumulatedScore = 0;
        foreach (var pile in eligiblePiles)
        {
            accumulatedScore += (int)pile.Score;
            if (randomScore < accumulatedScore) return pile;
        }
        Debug.Fail("Could not select a pile randomly by weight.");
        return null;
    }
}

private class Pile<T> : List<T>
{
    public int Proximity { get; set; }
    public long Score { get; set; }
}
This implementation can shuffle millions of elements. I am not completely convinced that the quality of the shuffling is as good as in the previous probabilistic implementation, but it should be close.
func shuffle(str: String) -> String {
    var shuffleArray = [Character](str)
    // Sorting
    shuffleArray.sort()
    var shuffle1 = [Character]()
    var shuffle2 = [Character]()
    var adjacentStr = ""
    // Split
    for i in 0..<shuffleArray.count {
        if i > shuffleArray.count / 2 {
            shuffle2.append(shuffleArray[i])
        } else {
            shuffle1.append(shuffleArray[i])
        }
    }
    let count = shuffle1.count > shuffle2.count ? shuffle1.count : shuffle2.count
    // Merge, alternating elements from the two halves
    for i in 0..<count {
        if i < shuffle1.count {
            adjacentStr.append(shuffle1[i])
        }
        if i < shuffle2.count {
            adjacentStr.append(shuffle2[i])
        }
    }
    return adjacentStr
}

let s = shuffle(str: "AABC")
print(s)
I have been programming an object to calculate the Dice-Sørensen distance between two strings. The logic of the operation is not so difficult. You count how many two-letter pairs (bigrams) exist in a string, compare them with a second string, and then perform this equation:
2 * |X ∩ Y| / (|X| + |Y|)
where |X| and |Y| are the numbers of bigram elements in X and Y. A reference can be found here for further clarity: https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient
So I have tried looking up how to write this code online in various places, but every method I have come across uses the 'Intersect' method between two lists, and as far as I am aware this won't work, because if a bigram already exists in the set it won't be added again. For example, with the string
'aaaa'
I would like there to be 3 'aa' bigrams, but the Intersect method will only produce one. If I am incorrect in this assumption please tell me, because I have wondered why so many people use the Intersect method. My assumption is based on the MSDN documentation: https://msdn.microsoft.com/en-us/library/bb460136(v=vs.90).aspx
So here is the code I have made
public static double SorensenDiceDistance(this string source, string target)
{
    // formula 2|X intersection Y|
    //         -------------------
    //             |X| + |Y|

    //create variables needed
    List<string> bigrams_source = new List<string>();
    List<string> bigrams_target = new List<string>();
    int source_length;
    int target_length;
    double intersect_count = 0;
    double result = 0;

    Console.WriteLine("DEBUG: string length source is " + source.Length);

    //base case
    if (source.Length == 0 || target.Length == 0)
    {
        return 0;
    }

    //extract bigrams from string 1
    bigrams_source = source.ListBiGrams();
    //extract bigrams from string 2
    bigrams_target = target.ListBiGrams();

    source_length = bigrams_source.Count();
    target_length = bigrams_target.Count();

    Console.WriteLine("DEBUG: bigram counts are source: " + source_length + " . target length : " + target_length);

    //now we have two sets of bigrams, compare them in a non-distinct loop
    for (int i = 0; i < bigrams_source.Count(); i++)
    {
        for (int y = 0; y < bigrams_target.Count(); y++)
        {
            if (bigrams_source.ElementAt(i) == bigrams_target.ElementAt(y))
            {
                intersect_count++;
                //Console.WriteLine("intersect count is :" + intersect_count);
            }
        }
    }

    Console.WriteLine("intersect line value : " + intersect_count);
    result = (2 * intersect_count) / (source_length + target_length);

    if (result < 0)
    {
        result = Math.Abs(result);
    }
    return result;
}
In the code you can see I call a method named ListBiGrams; this is how it looks:
public static List<string> ListBiGrams(this string source)
{
    return ListNGrams(source, 2);
}

public static List<string> ListTriGrams(this string source)
{
    return ListNGrams(source, 3);
}

public static List<string> ListNGrams(this string source, int n)
{
    List<string> nGrams = new List<string>();
    if (n > source.Length)
    {
        return null;
    }
    else if (n == source.Length)
    {
        nGrams.Add(source);
        return nGrams;
    }
    else
    {
        for (int i = 0; i < source.Length - n; i++)
        {
            nGrams.Add(source.Substring(i, n));
        }
        return nGrams;
    }
}
So my understanding of the code, step by step, is:
1) pass in the strings
2) check for zero length
3) create lists and load the bigrams into them
4) get the length of each bigram list
5) in a nested loop, check the source bigram at position [i] against every bigram in the target string, then increment i until there are no more source bigrams to check
6) perform the equation mentioned above, taken from Wikipedia
7) if the result is negative, Math.Abs it to return a positive result (however, I know the result should already be between 0 and 1; this is what keyed me in to knowing I was doing something wrong)
The source string I used was source = "this is not a correct string" and the target string was target = "this is a correct string".
The result I got was -0.090909090908.
I'm SURE (99%) that what I'm missing is something small, like a miscalculated length or a miscount somewhere. If anyone could point out what I'm doing wrong I'd be really grateful. Thank you for your time!
This looks like homework, yet this similarity metric on strings was new to me, so I took a look.
Algorithm implementations in various languages
As you may notice, the C# version uses a HashSet and takes advantage of its IntersectWith method.
A set is a collection that contains no duplicate elements, and whose
elements are in no particular order.
This resolves your 'aaaa' string puzzle: there is only one distinct bigram there.
My naive implementation on Rextester
If you prefer LINQ then I'd suggest Enumerable.Distinct, Enumerable.Union and Enumerable.Intersect. These should mimic the duplicate-removal behavior of the HashSet very well.
I also found this nice StringMetric framework written in Scala.
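If you do want duplicate bigrams to count (the multiset reading of the question's 'aaaa' example), here is a sketch using counting dictionaries instead of a HashSet. The name SorensenDiceMultiset is mine, and note that unlike the question's ListNGrams it deliberately includes the final bigram (i <= s.Length - 2), so 'aaaa' yields three 'aa' bigrams:
public static double SorensenDiceMultiset(string source, string target)
{
    if (source.Length < 2 || target.Length < 2) return 0;

    // Count every bigram, duplicates included ("aaaa" -> { "aa": 3 }).
    Dictionary<string, int> CountBigrams(string s)
    {
        var counts = new Dictionary<string, int>();
        for (int i = 0; i <= s.Length - 2; i++)
        {
            string bigram = s.Substring(i, 2);
            counts[bigram] = counts.TryGetValue(bigram, out int c) ? c + 1 : 1;
        }
        return counts;
    }

    var sourceCounts = CountBigrams(source);
    var targetCounts = CountBigrams(target);

    // Multiset intersection: for each shared bigram, take the smaller count.
    int intersect = 0;
    foreach (var kv in sourceCounts)
    {
        if (targetCounts.TryGetValue(kv.Key, out int t))
        {
            intersect += Math.Min(kv.Value, t);
        }
    }

    int sourceTotal = source.Length - 1; // number of bigrams in source
    int targetTotal = target.Length - 1; // number of bigrams in target
    return (2.0 * intersect) / (sourceTotal + targetTotal);
}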