Dice Sorensen Distance error calculating Bigrams without using Intersect method - c#

I have been programming an object to calculate the Dice-Sørensen distance between two strings. The logic of the operation is not so difficult: you count the two-letter pairs (bigrams) in each string, compare them, and then apply this equation
2 |X ∩ Y| / (|X| + |Y|)
where |X| and |Y| are the numbers of bigrams in X and Y. Reference can be found here for further clarity: https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient
So I have tried looking up how to do this online in various places, but every method I have come across uses the 'Intersect' method between two lists, and as far as I am aware this won't work, because if a bigram already exists in the result it won't add another copy. For example, if I had the string
'aaaa'
I would like there to be 3 'aa' bigrams, but the Intersect method will only produce one. If I am incorrect in this assumption please tell me, because I wondered why so many people used the Intersect method. My assumption is based on the MSDN documentation: https://msdn.microsoft.com/en-us/library/bb460136(v=vs.90).aspx
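For what it's worth, a quick sketch (my own, not from the linked page) confirms that behaviour, since Enumerable.Intersect produces distinct elements only:

using System;
using System.Collections.Generic;
using System.Linq;

class IntersectCheck
{
    static void Main()
    {
        // "aaaa" contains the bigram "aa" three times, but Intersect works set-wise,
        // so the repeated bigrams collapse to a single element.
        var source = new List<string> { "aa", "aa", "aa" }; // bigrams of "aaaa"
        var target = new List<string> { "aa", "aa" };       // bigrams of "aaa"
        Console.WriteLine(source.Intersect(target).Count()); // prints 1
    }
}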
So here is the code I have made
public static double SorensenDiceDistance(this string source, string target)
{
    // formula 2|X intersection Y|
    //         --------------------
    //              |X| + |Y|

    //create variables needed
    List<string> bigrams_source = new List<string>();
    List<string> bigrams_target = new List<string>();
    int source_length;
    int target_length;
    double intersect_count = 0;
    double result = 0;

    Console.WriteLine("DEBUG: string length source is " + source.Length);

    //base case
    if (source.Length == 0 || target.Length == 0)
    {
        return 0;
    }

    //extract bigrams from string 1
    bigrams_source = source.ListBiGrams();
    //extract bigrams from string 2
    bigrams_target = target.ListBiGrams();

    source_length = bigrams_source.Count();
    target_length = bigrams_target.Count();

    Console.WriteLine("DEBUG: bigram counts are source: " + source_length + " . target length : " + target_length);

    //now we have two sets of bigrams compare them in a non distinct loop
    for (int i = 0; i < bigrams_source.Count(); i++)
    {
        for (int y = 0; y < bigrams_target.Count(); y++)
        {
            if (bigrams_source.ElementAt(i) == bigrams_target.ElementAt(y))
            {
                intersect_count++;
                //Console.WriteLine("intersect count is :" + intersect_count);
            }
        }
    }

    Console.WriteLine("intersect line value : " + intersect_count);
    result = (2 * intersect_count) / (source_length + target_length);

    if (result < 0)
    {
        result = Math.Abs(result);
    }
    return result;
}
In the code you can see I call a method called ListBiGrams, and this is how it looks:
public static List<string> ListBiGrams(this string source)
{
    return ListNGrams(source, 2);
}

public static List<string> ListTriGrams(this string source)
{
    return ListNGrams(source, 3);
}

public static List<string> ListNGrams(this string source, int n)
{
    List<string> nGrams = new List<string>();

    if (n > source.Length)
    {
        return null;
    }
    else if (n == source.Length)
    {
        nGrams.Add(source);
        return nGrams;
    }
    else
    {
        for (int i = 0; i < source.Length - n; i++)
        {
            nGrams.Add(source.Substring(i, n));
        }
        return nGrams;
    }
}
So my understanding of the code step by step is
1) pass in strings
2) 0 length check
3) create list and pass up bigrams into them
4) get the lengths of each bigram list
5) nested loop to check the source bigram at position [i] against every bigram in the target list, then increment i until there are no more source bigrams to check
6) perform the equation mentioned above, taken from Wikipedia
7) if the result is negative, Math.Abs it to return a positive result (however, I know the result should already be between 0 and 1; this is what keyed me into knowing I was doing something wrong)
The source string I used was source = "this is not a correct string" and the target string was target = "this is a correct string".
The result I got was -0.090909090908.
I'm SURE (99%) that what I'm missing is something small, like a miscalculated length or a mis-count somewhere. If anyone could point out what I'm doing wrong I'd be really grateful. Thank you for your time!

This looks like homework, yet this similarity metric on strings is new to me so I took a look.
Algorithm implementation in various languages
As you may notice the C# version uses HashSet and takes advantage of the IntersectWith method.
A set is a collection that contains no duplicate elements, and whose
elements are in no particular order.
This solves your string 'aaaa' puzzle. Only one bigram there.
My naive implementation on Rextester
If you prefer Linq then I'd suggest Enumerable.Distinct, Enumerable.Union and Enumerable.Intersect. These should mimic very well the duplicate removal capabilities of the HashSet.
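For illustration, here is a minimal sketch of that HashSet-based approach (my own names, not the linked implementation; it computes the coefficient over distinct bigrams):

using System;
using System.Collections.Generic;

public static class DiceSketch
{
    // Dice-Sørensen coefficient over distinct bigrams: 2|X ∩ Y| / (|X| + |Y|).
    public static double SorensenDiceCoefficient(string source, string target)
    {
        if (source.Length < 2 || target.Length < 2) return 0;

        var sourceBigrams = new HashSet<string>();
        for (int i = 0; i <= source.Length - 2; i++)
            sourceBigrams.Add(source.Substring(i, 2));

        var targetBigrams = new HashSet<string>();
        for (int i = 0; i <= target.Length - 2; i++)
            targetBigrams.Add(target.Substring(i, 2));

        int sourceCount = sourceBigrams.Count;
        int targetCount = targetBigrams.Count;

        // IntersectWith mutates sourceBigrams into the intersection of the two sets.
        sourceBigrams.IntersectWith(targetBigrams);
        return 2.0 * sourceBigrams.Count / (sourceCount + targetCount);
    }
}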
Also found this nice StringMetric framework written in Scala.

Related

Shuffling a string so that no two adjacent letters are the same

I've been trying to solve this interview problem which asks to shuffle a string so that no two adjacent letters are identical
For example,
ABCC -> ACBC
The approach I'm thinking of is to
1) Iterate over the input string and store the (letter, frequency)
pairs in some collection
2) Now build a result string by pulling the highest frequency (that is > 0) letter that we didn't just pull
3) Update (decrement) the frequency whenever we pull a letter
4) return the result string if all letters have zero frequency
5) return error if we're left with only one letter with frequency greater than 1
With this approach we can save the more precious (less frequent) letters for last. But for this to work, we need a collection that lets us efficiently query a key and at the same time efficiently sort it by values. Something like this would work except we need to keep the collection sorted after every letter retrieval.
I'm assuming Unicode characters.
Any ideas on what collection to use? Or an alternative approach?
You can sort the letters by frequency, split the sorted list in half, and construct the output by taking letters from the two halves in turn. This takes a single sort.
Example:
Initial string: ACABBACAB
Sort: AAAABBBCC
Split: AAAA+BBBCC
Combine: ABABABCAC
If the number of letters of highest frequency exceeds half the length of the string, the problem has no solution.
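If it helps, here is a rough C# sketch of that idea (sort by frequency, then fill even positions from the first half and odd positions from the second, which interleaves the two halves; names and details are my own):

using System;
using System.Linq;

static class InterleaveShuffle
{
    // Returns null when no arrangement without adjacent duplicates exists.
    public static string Rearrange(string input)
    {
        int n = input.Length;
        if (n == 0) return input;

        // Order characters so that all copies of the most frequent letter come first.
        char[] sorted = input
            .GroupBy(c => c)
            .OrderByDescending(g => g.Count())
            .SelectMany(g => g)
            .ToArray();

        // More than ceil(n / 2) copies of one letter cannot be separated.
        int maxFreq = input.GroupBy(c => c).Max(g => g.Count());
        if (maxFreq > (n + 1) / 2) return null;

        // First half fills even positions, second half fills odd positions,
        // i.e. the two halves of the sorted array are taken in turn.
        char[] result = new char[n];
        int half = (n + 1) / 2;
        for (int i = 0; i < half; i++) result[2 * i] = sorted[i];
        for (int i = half; i < n; i++) result[2 * (i - half) + 1] = sorted[i];
        return new string(result);
    }
}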
Why not use two Data Structures: One for sorting (Like a Heap) and one for key retrieval, like a Dictionary?
The accepted answer may produce a correct result, but is likely not the 'correct' answer to this interview brain teaser, nor the most efficient algorithm.
The simple answer is to take the premise of a basic sorting algorithm and alter the looping predicate to check for adjacency rather than magnitude. This ensures that the 'sorting' operation is the only step required, and (like all good sorting algorithms) does the least amount of work possible.
Below is a C# example akin to insertion sort for simplicity (though many sorting algorithms could be similarly adjusted):
string NonAdjacencySort(string stringInput)
{
    var input = stringInput.ToCharArray();
    for (var i = 0; i < input.Length; i++)
    {
        var j = i;
        while (j > 0 && j < input.Length - 1 &&
               (input[j+1] == input[j] || input[j-1] == input[j]))
        {
            var tmp = input[j];
            input[j] = input[j-1];
            input[j-1] = tmp;
            j--;
        }
        if (input[1] == input[0])
        {
            var tmp = input[0];
            input[0] = input[input.Length-1];
            input[input.Length-1] = tmp;
        }
    }
    return new string(input);
}
The major change to standard insertion sort is that the function has to both look ahead and behind, and therefore needs to wrap around to the last index.
A final point is that this type of algorithm fails gracefully, providing a result with the fewest consecutive characters (grouped at the front).
Since I somehow got convinced to expand an off-hand comment into a full algorithm, I'll write it out as an answer, which must be more readable than a series of uneditable comments.
The algorithm is pretty simple, actually. It's based on the observation that if we sort the string and then divide it into two equal-length halves, plus the middle character if the string has odd length, then corresponding positions in the two halves must differ from each other, unless there is no solution. That's easy to see: if the two characters are the same, then so are all the characters between them, which totals ⌈n/2⌉+1 characters. But a solution is only possible if there are no more than ⌈n/2⌉ instances of any single character.
So we can proceed as follows:
Sort the string.
If the string's length is odd, output the middle character.
Divide the string (minus its middle character if the length is odd) into two equal-length halves, and interleave the two halves.
At each point in the interleaving, since the pair of characters differ from each other (see above), at least one of them must differ from the last character output. So we first output that character and then the corresponding one from the other half.
The sample code below is in C++, since I don't have a C# environment handy to test with. It's also simplified in two ways, both of which would be easy enough to fix at the cost of obscuring the algorithm:
If at some point in the interleaving, the algorithm encounters a pair of identical characters, it should stop and report failure. But in the sample implementation below, which has an overly simple interface, there's no way to report failure. If there is no solution, the function below returns an incorrect solution.
The OP suggests that the algorithm should work with Unicode characters, but the complexity of correctly handling multibyte encodings didn't seem to add anything useful to explain the algorithm. So I just used single-byte characters. (In C# and certain implementations of C++, there is no character type wide enough to hold a Unicode code point, so astral plane characters must be represented with a surrogate pair.)
#include <algorithm>
#include <iostream>
#include <string>

// If possible, rearranges 'in' so that there are no two consecutive
// instances of the same character.
std::string rearrange(std::string in) {
  // Sort the input. The function is call-by-value,
  // so the argument itself isn't changed.
  std::string out;
  size_t len = in.size();
  if (in.size()) {
    out.reserve(len);
    std::sort(in.begin(), in.end());
    size_t mid = len / 2;
    size_t tail = len - mid;
    char prev = in[mid];
    // For odd-length strings, start with the middle character.
    if (len & 1) out.push_back(prev);
    for (size_t head = 0; head < mid; ++head, ++tail)
      // See explanatory text
      if (in[tail] != prev) {
        out.push_back(in[tail]);
        out.push_back(prev = in[head]);
      }
      else {
        out.push_back(in[head]);
        out.push_back(prev = in[tail]);
      }
  }
  return out;
}
You can do this by using a priority queue.
Please find the explanation below:
https://iq.opengenus.org/rearrange-string-no-same-adjacent-characters/
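In case the link goes stale, here is a rough sketch of the greedy priority-queue idea in C#, assuming .NET 6's PriorityQueue<TElement, TPriority> (it is a min-heap, hence the negated counts):

using System;
using System.Collections.Generic;
using System.Linq;

static class PriorityQueueShuffle
{
    // Repeatedly emit the most frequent remaining letter that differs from
    // the last one written; returns null when that is impossible.
    public static string Rearrange(string input)
    {
        var heap = new PriorityQueue<char, int>();
        foreach (var group in input.GroupBy(c => c))
            heap.Enqueue(group.Key, -group.Count());

        var result = new System.Text.StringBuilder();
        while (heap.Count > 0)
        {
            heap.TryDequeue(out char first, out int firstPriority);
            if (result.Length == 0 || result[result.Length - 1] != first)
            {
                result.Append(first);
                if (firstPriority + 1 < 0) heap.Enqueue(first, firstPriority + 1);
            }
            else
            {
                // Most frequent letter equals the last one written; use the runner-up.
                if (!heap.TryDequeue(out char second, out int secondPriority))
                    return null; // only one distinct letter left: impossible
                result.Append(second);
                if (secondPriority + 1 < 0) heap.Enqueue(second, secondPriority + 1);
                heap.Enqueue(first, firstPriority); // put the first letter back
            }
        }
        return result.ToString();
    }
}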
Here is a probabilistic approach. The algorithm is:
10) Select a random char from the input string.
20) Try to insert the selected char in a random position in the output string.
30) If it can't be inserted because of proximity with the same char, go to 10.
40) Remove the selected char from the input string and go to 10.
50) Continue until there are no more chars in the input string, or the failed attempts are too many.
public static string ShuffleNoSameAdjacent(string input, Random random = null)
{
    if (input == null) return null;
    if (random == null) random = new Random();
    string output = "";
    int maxAttempts = input.Length * input.Length * 2;
    int attempts = 0;
    while (input.Length > 0)
    {
        while (attempts < maxAttempts)
        {
            int inputPos = random.Next(0, input.Length);
            var outputPos = random.Next(0, output.Length + 1);
            var c = input[inputPos];
            if (outputPos > 0 && output[outputPos - 1] == c)
            {
                attempts++; continue;
            }
            if (outputPos < output.Length && output[outputPos] == c)
            {
                attempts++; continue;
            }
            input = input.Remove(inputPos, 1);
            output = output.Insert(outputPos, c.ToString());
            break;
        }
        if (attempts >= maxAttempts) throw new InvalidOperationException(
            $"Shuffle failed to complete after {attempts} attempts.");
    }
    return output;
}
Not suitable for strings longer than 1,000 chars!
Update: And here is a more complicated deterministic approach. The algorithm is:
Group the elements and sort the groups by length.
Create three empty piles of elements.
Insert each group to a separate pile, inserting always the largest group to the smallest pile, so that the piles differ in length as little as possible.
Check that there is no pile with more than half the total elements, in which case satisfying the condition of not having same adjacent elements is impossible.
Shuffle the piles.
Start yielding elements from the piles, selecting a different pile each time.
When the piles that are eligible for selection are more than one, select randomly, weighting by the size of each pile. Piles containing near half of the remaining elements should be much preferred. For example if the remaining elements are 100 and the two eligible piles have 49 and 40 elements respectively, then the first pile should be 10 times more preferable than the second (because 50 - 49 = 1 and 50 - 40 = 10).
public static IEnumerable<T> ShuffleNoSameAdjacent<T>(IEnumerable<T> source,
    Random random = null, IEqualityComparer<T> comparer = null)
{
    if (source == null) yield break;
    if (random == null) random = new Random();
    if (comparer == null) comparer = EqualityComparer<T>.Default;

    var grouped = source
        .GroupBy(i => i, comparer)
        .OrderByDescending(g => g.Count());

    var piles = Enumerable.Range(0, 3).Select(i => new Pile<T>()).ToArray();
    foreach (var group in grouped)
    {
        GetSmallestPile().AddRange(group);
    }

    int totalCount = piles.Select(e => e.Count).Sum();
    if (piles.Any(pile => pile.Count > (totalCount + 1) / 2))
    {
        throw new InvalidOperationException("Shuffle is impossible.");
    }

    // Arrays have no ForEach instance method, so plain foreach loops are used here.
    foreach (var pile in piles) Shuffle(pile);

    Pile<T> previouslySelectedPile = null;
    while (totalCount > 0)
    {
        var selectedPile = GetRandomPile_WeightedByLength();
        yield return selectedPile[selectedPile.Count - 1];
        selectedPile.RemoveAt(selectedPile.Count - 1);
        totalCount--;
        previouslySelectedPile = selectedPile;
    }

    List<T> GetSmallestPile()
    {
        List<T> smallestPile = null;
        int smallestCount = Int32.MaxValue;
        foreach (var pile in piles)
        {
            if (pile.Count < smallestCount)
            {
                smallestPile = pile;
                smallestCount = pile.Count;
            }
        }
        return smallestPile;
    }

    void Shuffle(List<T> pile)
    {
        for (int i = 0; i < pile.Count; i++)
        {
            int j = random.Next(i, pile.Count);
            if (i == j) continue;
            var temp = pile[i];
            pile[i] = pile[j];
            pile[j] = temp;
        }
    }

    Pile<T> GetRandomPile_WeightedByLength()
    {
        var eligiblePiles = piles
            .Where(pile => pile.Count > 0 && pile != previouslySelectedPile)
            .ToArray();
        Debug.Assert(eligiblePiles.Length > 0, "No eligible pile.");

        foreach (var pile in eligiblePiles)
        {
            pile.Proximity = ((totalCount + 1) / 2) - pile.Count;
            pile.Score = 1;
        }
        Debug.Assert(eligiblePiles.All(pile => pile.Proximity >= 0),
            "A pile has negative proximity.");

        foreach (var pile in eligiblePiles)
        {
            foreach (var otherPile in eligiblePiles)
            {
                if (otherPile == pile) continue;
                pile.Score *= otherPile.Proximity;
            }
        }

        var sumScore = eligiblePiles.Select(p => p.Score).Sum();
        while (sumScore > Int32.MaxValue)
        {
            foreach (var pile in eligiblePiles) pile.Score /= 100;
            sumScore = eligiblePiles.Select(p => p.Score).Sum();
        }
        if (sumScore == 0)
        {
            return eligiblePiles[random.Next(0, eligiblePiles.Length)];
        }

        var randomScore = random.Next(0, (int)sumScore);
        int accumulatedScore = 0;
        foreach (var pile in eligiblePiles)
        {
            accumulatedScore += (int)pile.Score;
            if (randomScore < accumulatedScore) return pile;
        }
        Debug.Fail("Could not select a pile randomly by weight.");
        return null;
    }
}

private class Pile<T> : List<T>
{
    public int Proximity { get; set; }
    public long Score { get; set; }
}
This implementation can shuffle millions of elements. I am not completely convinced that the quality of the shuffling is as perfect as in the previous probabilistic implementation, but it should be close.
func shuffle(str: String) -> String {
    var shuffleArray = [Character](str)
    // Sorting
    shuffleArray.sort()
    var shuffle1 = [Character]()
    var shuffle2 = [Character]()
    var adjacentStr = ""
    // Split
    for i in 0..<shuffleArray.count {
        if i > shuffleArray.count / 2 {
            shuffle2.append(shuffleArray[i])
        } else {
            shuffle1.append(shuffleArray[i])
        }
    }
    let count = shuffle1.count > shuffle2.count ? shuffle1.count : shuffle2.count
    // Merge with adjacent element
    for i in 0..<count {
        if i < shuffle1.count {
            adjacentStr.append(shuffle1[i])
        }
        if i < shuffle2.count {
            adjacentStr.append(shuffle2[i])
        }
    }
    return adjacentStr
}

let s = shuffle(str: "AABC")
print(s)

Cut and take new line String with maxchar

I tried to create a function to print all the articles of a billing using some max-length variables, but when the max length is exceeded an index-out-of-range error appears, or in other cases the total and quantity don't appear on the line:
Max Length variables:
private Int32 MaxCharPage = 36;
private Int32 MaxCharArticleName = 15;
private Int32 MaxCharArticleQuantity = 4;
private Int32 MaxCharArticleSellPrice = 6;
private Int32 MaxCharArticleTotal = 8;
private Int32 MaxCharArticleLineSpace = 1;
This is my current function:
private IEnumerable<String> ArticleLine(Double Quantity, String Name, Double SellPrice, Double Total)
{
    String QuantityChunk = (String)(Quantity).ToString("#,##0.00");
    String PriceChunk = (String)(SellPrice).ToString("#,##0.00");
    String TotalChunk = (String)(Total).ToString("#,##0.00");
    // full chunks with "size" length
    for (int i = 0; i < Name.Length / this.MaxCharArticleName; ++i)
    {
        String Chunk = String.Empty;
        if (i == 0)
        {
            Chunk = QuantityChunk + new String((Char)32, (MaxCharArticleQuantity + MaxCharArticleLineSpace - QuantityChunk.Length)) +
                    Name.Substring(i * this.MaxCharArticleName, this.MaxCharArticleName) +
                    new String((Char)32, (MaxCharArticleLineSpace)) +
                    new String((Char)32, (MaxCharArticleSellPrice - PriceChunk.Length)) +
                    PriceChunk +
                    new String((Char)32, (MaxCharArticleLineSpace)) +
                    new String((Char)32, (MaxCharArticleTotal - TotalChunk.Length)) +
                    TotalChunk;
        }
        else
        {
            Chunk = new String((Char)32, (MaxCharArticleQuantity + MaxCharArticleLineSpace)) +
                    Name.Substring(i * this.MaxCharArticleName, this.MaxCharArticleName);
        }
        yield return Chunk;
    }
    if (Name.Length % this.MaxCharArticleName > 0)
    {
        String chunk = Name.Substring(Name.Length / this.MaxCharArticleName * this.MaxCharArticleName);
        yield return new String((Char)32, (MaxCharArticleQuantity + MaxCharArticleLineSpace)) + chunk;
    }
}

private void AddArticle(Double Quantity, String Name, Double SellPrice, Double Total)
{
    Lines.AddRange(ArticleLine(Quantity, Name, SellPrice, Total).ToList());
}
For example:
private List<String> Lines = new List<String>();

private void Form1_Load(object sender, EventArgs e)
{
    AddArticle(2.50, "EGGS", 0.50, 1.25); // Problem: doesn't show the numbers (Quantity, SellPrice, Total)
    //AddArticle(100.52, "HAND SOAP /w EP", 5.00, 502.60); // OutOfRangeException
    AddArticle(105.6, "LONG NAME ARTICLE DESCRIPTION", 500.03, 100.00);
    //AddArticle(100, "LONG NAME ARTICLE DESCRIPTION2", 1500.03, 150003.00); // OutOfRangeException
    foreach (String line in Lines)
    {
        Console.WriteLine(line);
    }
}
Console Output:
LINE1: EGGS
LINE2:105.6LONG NAME ARTIC 500.03 100.00
LINE3: LE DESCRIPTION
Desired output:
LINE:2.50 EGGS 0.50 1.25
LINE:100. HAND SOAP /w EP 5.00 502.60
LINE: 52
LINE:105. LONG NAME ARTIC 500.03 100.00
LINE: 60 LE DESCRIPTION
LINE:100. LONG NAME ARTIC 1,500. 150,003.
LINE: 00 LE DESCRIPTION2 03 00
You are attempting to take values (string values, ultimately) and extract specified lengths from them (chunks), where each individual chunk of a given group goes on one line, and the next chunk(s) go on the next line(s). There are several challenges in your posted code. One of the biggest is that you may not understand what the / operator does. It is, of course, used for division, but it is integer division, meaning you get a whole number as the result. E.g. 3 / 15 gives you 0, whereas 17 / 15 would give you 1. This means your loop never runs unless the length of the value is greater than the specified chunk limit.
Another possible issue is that you only perform this check against Name, not the other items (though you may have omitted the code for them for brevity reasons).
Your current code is also creating a lot of unnecessary strings, which will lead to performance degradation. Remember that in C# strings are immutable - meaning they cannot be changed. When you "change" the value of a string, you are actually creating another copy of the string.
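As an aside, if you do build each line by repeated concatenation, a StringBuilder avoids those intermediate copies; a tiny sketch with made-up values and widths:

using System;
using System.Text;

class StringBuilderDemo
{
    static void Main()
    {
        // Placeholder values and widths; the real code would use the chunked fields.
        string quantity = "2.50", name = "EGGS", price = "0.50", total = "1.25";

        var sb = new StringBuilder(); // one buffer instead of a new string per '+'
        sb.Append(quantity.PadRight(5));
        sb.Append(name.PadRight(16));
        sb.Append(price.PadLeft(7));
        sb.Append(total.PadLeft(9));
        Console.WriteLine(sb.ToString());
    }
}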
The crux of your requirement is how to "chunk" the values in such a way that the correct output is achieved. One way to do this is to create an extension method that will take a string value and "chunkify" it for you. One such example is found here: Split String Into Array of Chunks. I've modified it slightly to use a List<T>, but an array would work just as well.
public static List<string> SplitIntoChunks(this string toSplit, int chunkSize)
{
    int stringLength = toSplit.Length;
    int chunksRequired = (int)Math.Ceiling((decimal)stringLength / (decimal)chunkSize);
    List<string> chunks = new List<string>();
    int lengthRemaining = stringLength;
    for (int i = 0; i < chunksRequired; i++)
    {
        int lengthToUse = Math.Min(lengthRemaining, chunkSize);
        int startIndex = chunkSize * i;
        chunks.Add(toSplit.Substring(startIndex, lengthToUse));
        lengthRemaining = lengthRemaining - lengthToUse;
    }
    return chunks;
}
Say you have a string named myString. You would use the above method like this: List<string> chunks = myString.SplitIntoChunks(15);, and you would receive a list of 15-character strings (the last chunk may be shorter, depending on the size of the string).
A quick walk-through of the code (as there is not much explanation on the page). The size of the chunk is passed into the extension method. The length of the string is recorded, and the number of chunks for that string is determined using the Math.Ceiling function.
Then a for loop is constructed with the number of chunks required as the limit. Inside the loop, the length of the chunk is determined (using the lower of either the chunk size or the remaining length of the string), the starting index is calculated based on the chunk size and the loop index, and then the chunk is extracted via Substring. Finally remaining length is calculated, and once the loop ends the chunks are returned.
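For example, a quick usage sketch (assuming the extension method above is in scope; the sample string is taken from your test data):

string name = "LONG NAME ARTICLE DESCRIPTION";   // 29 characters
List<string> chunks = name.SplitIntoChunks(15);

// chunks[0] == "LONG NAME ARTIC"  (15 characters)
// chunks[1] == "LE DESCRIPTION"   (14 characters)
foreach (string chunk in chunks)
{
    Console.WriteLine(chunk);
}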
One way to use this extension method in your code would look like this. The extension method will need to be in a separate, static class (I suggest building a library that has a class dedicated solely to extension methods, as they come in quite handy). Note that I haven't had time to test this, and it's a bit kludgy for my tastes, but it should at least get you going in the right direction.
private IEnumerable<string> ArticleLine(double quantity, string name, double sellPrice, double total)
{
    List<string> lines = new List<string>();

    List<string> quantityChunks = quantity.ToString("#,##0.00").SplitIntoChunks(maxCharArticleQuantity);
    List<string> nameChunks = name.SplitIntoChunks(maxCharArticleName);
    List<string> sellPriceChunks = sellPrice.ToString("#,##0.00").SplitIntoChunks(maxCharArticleSellPrice);
    List<string> totalChunks = total.ToString("#,##0.00").SplitIntoChunks(maxCharArticleTotal);

    int maxLines = (new List<int>() { quantityChunks.Count,
                                      nameChunks.Count,
                                      sellPriceChunks.Count,
                                      totalChunks.Count }).Max();

    for (int i = 0; i < maxLines; i++)
    {
        lines.Add(String.Format("{0}{1}{2}{3}",
            quantityChunks.Count > i ?
                quantityChunks[i].PadLeft(maxCharArticleQuantity) :
                String.Empty.PadLeft(maxCharArticleQuantity),
            nameChunks.Count > i ?
                nameChunks[i].PadLeft(maxCharArticleName) :
                String.Empty.PadLeft(maxCharArticleName),
            sellPriceChunks.Count > i ?
                sellPriceChunks[i].PadLeft(maxCharArticleSellPrice) :
                String.Empty.PadLeft(maxCharArticleSellPrice),
            totalChunks.Count > i ?
                totalChunks[i].PadLeft(maxCharArticleTotal) :
                String.Empty.PadLeft(maxCharArticleTotal)));
    }
    return lines;
}
The above code does a couple of things. First, it calls SplitIntoChunks on each of the four variables.
Next, it uses the Max() extension method (you'll need to add a reference to System.Linq if you don't already have one, plus a using directive). to determine the maximum number of lines needed (based on the highest count of the four lists).
Finally, it uses a for loop to build the lines. Note that I use the ternary operator (?:) to build each line. The reason I went this route is so that the lines would be properly formatted, even if one or more of the 4 values didn't have something for that line. In other words:
quantityChunks.Count > i
If the count of items in quantityChunks is greater than i, then there is an entry for that item for that line.
quantityChunks[i].PadLeft(maxCharArticleQuantity, ' ')
If the condition evaluates to true, we use the corresponding entry (and pad it with spaces to keep alignment).
String.Empty.PadLeft(maxCharArticleQuantity, ' ')
If the condition is false, we simply put in spaces for the maximum number of characters for that position in the line.
Then you can print the output like this:
foreach(string line in lines)
{
Console.WriteLine(line);
}
The portions of the code where I check for the maximum number of lines and the use of a ternary operator in the String.Format feel very kludgy to me, but I don't have time to finesse it into something more respectable. I hope this at least points you in the right direction.
EDIT
Fixed the PadLeft syntax and removed the character specified, as space is used by default.

Loop Through Every Possible Combination of an Array

I am attempting to loop through every combination of an array in C# dependent on size, but not order. For example: var states = ["NJ", "AK", "NY"];
Some Combinations might be:
states = [];
states = ["NJ"];
states = ["NJ","NY"];
states = ["NY"];
states = ["NJ", "NY", "AK"];
and so on...
It is also true in my case that states = ["NJ","NY"] and states = ["NY","NJ"] are the same thing, as order does not matter.
Does anyone have any idea on the most efficient way to do this?
The combination of the following two methods should do what you want. The idea is that if the number of items is n then the number of subsets is 2^n. And if you iterate from 0 to 2^n - 1 and look at the numbers in binary you'll have one digit for each item and if the digit is 1 then you include the item, and if it is 0 you don't. I'm using BigInteger here as int would only work for a collection of less than 32 items and long would only work for less than 64.
public static IEnumerable<IEnumerable<T>> PowerSets<T>(this IList<T> set)
{
    var totalSets = BigInteger.Pow(2, set.Count);
    for (BigInteger i = 0; i < totalSets; i++)
    {
        yield return set.SubSet(i);
    }
}

public static IEnumerable<T> SubSet<T>(this IList<T> set, BigInteger n)
{
    for (int i = 0; i < set.Count && n > 0; i++)
    {
        if ((n & 1) == 1)
        {
            yield return set[i];
        }
        n = n >> 1;
    }
}
With that, the following code

var states = new[] { "NJ", "AK", "NY" };
foreach (var subset in states.PowerSets())
{
    Console.WriteLine("[" + string.Join(",", subset.Select(s => "'" + s + "'")) + "]");
}

will give you this output:
[]
['NJ']
['AK']
['NJ','AK']
['NY']
['NJ','NY']
['AK','NY']
['NJ','AK','NY']
You can use backtracking, where at each iteration you either (1) take the item at index i or (2) do not take the item at index i.
Pseudo code for this problem (a C# sketch follows the outline):
Main code:
Define a bool array (e.g. named picked) of the length of states
Call the backtracking method with index 0 (the first item)
Backtracking function:
Receives the states array, its length, the current index and the bool array
Halt condition - if the current index is equal to the length then you just need to iterate over the bool array and for each item which is true print the matching string from states
Actual backtracking:
Set picked[i] to true and call the backtracking function with the next index
Set picked[i] to false and call the backtracking function with the next index
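Here is a minimal C# sketch of that outline (names are my own):

using System;
using System.Collections.Generic;

static class SubsetBacktracking
{
    // At each index we either take the item or skip it; when we run out of
    // items we print the subset described by the 'picked' flags.
    static void Backtrack(string[] states, bool[] picked, int index)
    {
        if (index == states.Length)
        {
            var chosen = new List<string>();
            for (int i = 0; i < states.Length; i++)
                if (picked[i]) chosen.Add(states[i]);
            Console.WriteLine("[" + string.Join(",", chosen) + "]");
            return;
        }

        picked[index] = true;   // (1) take the item at this index
        Backtrack(states, picked, index + 1);
        picked[index] = false;  // (2) do not take the item at this index
        Backtrack(states, picked, index + 1);
    }

    static void Main()
    {
        var states = new[] { "NJ", "AK", "NY" };
        Backtrack(states, new bool[states.Length], 0);
    }
}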

What is the fastest way to split a ':' separated string into a given number of chunks where result/record length is variable

I have a large string accepted from a TCP listener which is in the following format:
"1,7620257787,0123456789,99,0922337203,9223372036,32.5455,87,12.7857,1/1/2012,9223372036:1,7620257787,0123456789,99,0922337203,9223372036,32.5455,87,12.7857:2/1/2012,234234234:3,7620257787,01234343456789,99,0922337203,9223372036,32.5455,87,12.7857,1/1/2012,9223372036:34,76202343457787,012434343456789,93339,34340922337203,9223372036,32.5455,87,12.7857,1/1/2012,9223372036"
You can see that this is a ':' separated string which contains records made up of comma-separated fields.
I am looking for the best (fastest) way to split the string into a given number of chunks, taking care that each chunk contains only full records (i.e. the string up to a ':'),
or, put another way, no chunk should end anywhere other than at a ':' boundary.
E.g. a 20 MB string into 4 chunks of about 5 MB each with complete records (thus the size of each chunk may not be exactly 5 MB but very near to it, and the total of all 4 chunks will be 20 MB).
I hope you can understand my question (sorry for the bad English).
I like the following link, but it does not take care of full records while splitting, and I don't know if it is the best and fastest way.
Split String into smaller Strings by length variable
I don't know how large a 'large string' is, but initially I would just try it with the String.Split method.
The idea is to divide the length of your data by the number of blocks required, then look backwards for the last separator in the current block.
private string[] splitToBlocks(string data, int numBlocks, char sep)
{
    // We return an array of the requested length
    if (numBlocks <= 1 || data.Length == 0)
    {
        return new string[] { data };
    }
    string[] result = new string[numBlocks];
    // The optimal size of each block
    int blockLen = (data.Length / numBlocks);
    int idx = 0; int pos = 0; int lastSepPos = blockLen;
    while (idx < numBlocks)
    {
        // Search backwards for the first sep starting from the lastSepPos
        char c = data[lastSepPos];
        while (c != sep) { lastSepPos--; c = data[lastSepPos]; }
        // Get the block data in the result array
        result[idx] = data.Substring(pos, (lastSepPos + 1) - pos);
        // Reposition for the next block
        idx++;
        pos = lastSepPos + 1;
        if (idx == numBlocks - 1)
            lastSepPos = data.Length - 1;
        else
            lastSepPos = blockLen * (idx + 1);
    }
    return result;
}
Please test it. I have not fully tested for fringe cases.
OK, I suggest a way with two steps:
Split the string into chunks (see below)
Check the chunks for completeness
Splitting the string into chunks with the help of LINQ (the LINQ extension method is taken from Split a collection into `n` parts with LINQ? ):
string tcpstring = "chunk1 : chunck2 : chunk3: chunk4 : chunck5 : chunk6";
int numOfChunks = 4;
var chunks = (from string z in (tcpstring.Split(':').AsEnumerable()) select z).Split(numOfChunks);
List<string> result = new List<string>();
foreach (IEnumerable<string> chunk in chunks)
{
    result.Add(string.Join(":", chunk));
}
.......
static class LinqExtensions
{
    public static IEnumerable<IEnumerable<T>> Split<T>(this IEnumerable<T> list, int parts)
    {
        int i = 0;
        var splits = from item in list
                     group item by i++ % parts into part
                     select part.AsEnumerable();
        return splits;
    }
}
Did I understand your aims correctly?
[EDIT]
In my opinion, if performance is a consideration, the better way is to use the String.Split method for chunking.
It seems you want to split on ":" (you can use the Split method).
Then you have to add ":" after splitting to each chunk that has been split.
(You can then split on "," for all the strings that have been split by ":".)
int index = yourstring.IndexOf(":");
string whatever = yourstring.Substring(0, index);
yourstring = yourstring.Substring(index + 1); //make a new string without the part (and the ':') you just cut out.
This is a general example; all you need to do is establish an iteration that runs while the ":" character is still encountered (see the sketch below); cheers...
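A minimal sketch of that iteration, with a made-up input string:

using System;
using System.Collections.Generic;

class SplitOnColonDemo
{
    static void Main()
    {
        // Toy input; the real string would come from the TCP listener.
        string yourstring = "rec1,a,b:rec2,c,d:rec3,e,f";
        var records = new List<string>();

        int index;
        while ((index = yourstring.IndexOf(":")) != -1)
        {
            records.Add(yourstring.Substring(0, index));   // the record before ':'
            yourstring = yourstring.Substring(index + 1);  // drop that record and its ':'
        }
        if (yourstring.Length > 0) records.Add(yourstring); // trailing record

        foreach (string record in records) Console.WriteLine(record);
    }
}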

Compare each element in an array to each other

I need to compare a 1-dimensional array, in that I need to compare each element of the array with each other element. The array contains a list of strings sorted from longest to the shortest. No 2 items in the array are equal however there will be items with the same length. Currently I am making N*(N+1)/2 comparisons (127.8 Billion) and I'm trying to reduce the number of over all comparisons.
I have implemented a feature that basically says: if the strings differ in length by more than x percent then don't bother, they're not equal, AND the ones below it aren't equal either, so just break the loop and move on to the next element.
I am currently trying to further reduce this by saying that: If element A matches element C and D then it stands to reason that elements C and D would also match so don't bother checking them (i.e. skip that operation). This is as far as I've factored since I don't currently know of a data structure that will allow me to do that.
The question here is: Does anyone know of such a data structure? or Does anyone know how I can further reduce my comparisons?
My current implementation is estimated to take 3.5 days to complete in a time window of 10 hours (i.e. it's too long), and my only options left are either to reduce the execution time, which may or may not be possible, or to distribute the workload across dozens of systems, which may not be practical.
Update: My bad. Replace the word equal with closely matches. I'm calculating the Levenshtein distance.
The idea is to find out if there are other strings in the array which closely matches with each element in the array. The output is a database mapping of the strings that were closely related.
Here is the partial code from the method. Prior to executing this code block there is code that loads items into the database.
public static void RelatedAddressCompute() {
    TableWipe("RelatedAddress");
    decimal _requiredDistance = Properties.Settings.Default.LevenshteinDistance;
    SqlConnection _connection = new SqlConnection(Properties.Settings.Default.AML_STORE);
    _connection.Open();
    string _cacheFilter = "LevenshteinCache NOT IN ('','SAMEASABOVE','SAME')";
    SqlCommand _dataCommand = new SqlCommand(@"
        SELECT
            COUNT(DISTINCT LevenshteinCache)
        FROM
            Address
        WHERE
            " + _cacheFilter + @"
        AND
            LEN(LevenshteinCache) > 12", _connection);
    _dataCommand.CommandTimeout = 0;
    int _addressCount = (int)_dataCommand.ExecuteScalar();
    _dataCommand = new SqlCommand(@"
        SELECT
            Data.LevenshteinCache,
            Data.CacheCount
        FROM
            (SELECT
                DISTINCT LevenshteinCache,
                COUNT(LevenshteinCache) AS CacheCount
            FROM
                Address
            WHERE
                " + _cacheFilter + @"
            GROUP BY
                LevenshteinCache) Data
        WHERE
            LEN(LevenshteinCache) > 12
        ORDER BY
            LEN(LevenshteinCache) DESC", _connection);
    _dataCommand.CommandTimeout = 0;
    SqlDataReader _addressReader = _dataCommand.ExecuteReader();
    string[] _addresses = new string[_addressCount + 1];
    int[] _addressInstance = new int[_addressCount + 1];
    int _itemIndex = 1;
    while (_addressReader.Read()) {
        string _address = (string)_addressReader[0];
        int _count = (int)_addressReader[1];
        _addresses[_itemIndex] = _address;
        _addressInstance[_itemIndex] = _count;
        _itemIndex++;
    }
    _addressReader.Close();
    decimal _comparasionsMade = 0;
    decimal _comparisionsAttempted = 0;
    decimal _comparisionsExpected = (decimal)_addressCount * ((decimal)_addressCount + 1) / 2;
    decimal _percentCompleted = 0;
    DateTime _startTime = DateTime.Now;
    Parallel.For(1, _addressCount, delegate(int i) {
        for (int _index = i + 1; _index <= _addressCount; _index++) {
            _comparisionsAttempted++;
            decimal _percent = _addresses[i].Length < _addresses[_index].Length ? (decimal)_addresses[i].Length / (decimal)_addresses[_index].Length : (decimal)_addresses[_index].Length / (decimal)_addresses[i].Length;
            if (_percent < _requiredDistance) {
                decimal _difference = new Levenshtein().threasholdiLD(_addresses[i], _addresses[_index], 50);
                _comparasionsMade++;
                if (_difference <= _requiredDistance) {
                    InsertRelatedAddress(ref _connection, _addresses[i], _addresses[_index], _difference);
                }
            }
            else {
                _comparisionsAttempted += _addressCount - _index;
                break;
            }
        }
        if (_addressInstance[i] > 1 && _addressInstance[i] < 31) {
            InsertRelatedAddress(ref _connection, _addresses[i], _addresses[i], 0);
        }
        _percentCompleted = (_comparisionsAttempted / _comparisionsExpected) * 100M;
        TimeSpan _estimatedDuration = new TimeSpan((long)((((decimal)(DateTime.Now - _startTime).Ticks) / _percentCompleted) * 100));
        TimeSpan _timeRemaining = _estimatedDuration - (DateTime.Now - _startTime);
        string _timeRemains = _timeRemaining.ToString();
    });
}
InsertRelatedAddress is a function that updates the database, and there are 500,000 items in the array.
OK. With the updated question, I think it makes more sense. You want to find pairs of strings with a Levenshtein distance less than a preset distance. I think the key is that you don't compare every pair of strings; instead, you rely on the properties of Levenshtein distance to search for strings within your preset limit. The answer involves computing the tree of possible changes. That is, compute the possible changes to a given string with distance < n and see if any of those strings are in your set. I suppose this is only faster if n is small.
It looks like the question posted here: Finding closest neighbour using optimized Levenshtein Algorithm.
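To make that concrete, here is a rough sketch of the distance-1 neighbourhood idea (the alphabet and the helper names are my own assumptions; for real address data you would need the full character set, and larger distances mean applying it recursively, which grows very quickly):

using System;
using System.Collections.Generic;

static class EditNeighborhood
{
    // Assumed character set; adjust to whatever appears in the address data.
    const string Alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 ";

    // All strings at Levenshtein distance 0 or 1 from s (may contain duplicates).
    public static IEnumerable<string> EditsWithinOne(string s)
    {
        for (int i = 0; i < s.Length; i++)
            yield return s.Remove(i, 1);                                  // deletions
        for (int i = 0; i < s.Length; i++)
            foreach (char c in Alphabet)
                yield return s.Substring(0, i) + c + s.Substring(i + 1);  // substitutions
        for (int i = 0; i <= s.Length; i++)
            foreach (char c in Alphabet)
                yield return s.Insert(i, c.ToString());                   // insertions
    }

    // Looks the generated neighbours up in a HashSet of all strings
    // instead of comparing every pair.
    public static IEnumerable<string> CloseMatches(string s, HashSet<string> all)
    {
        foreach (string candidate in EditsWithinOne(s))
            if (candidate != s && all.Contains(candidate))
                yield return candidate;
    }
}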
More info required. What is your desired outcome? Are you trying to get a count of all unique strings? You state that you want to see if 2 strings are equal and that if 'they are different in length by x percent then don't bother they not equal'. Why are you checking with a constraint on length by x percent? If you're checking for them to be equal they must be the same length.
I suspect you are trying to do something slightly different from determining an exact match, in which case I need more info.
Thanks
Neil
