Find keys with min difference in dictionary - c#

Say I have this collection, a generic dictionary:
var items = new Dictionary<int, SomeData>
{
    { 1, new SomeData() },
    { 5, new SomeData() },
    { 23, new SomeData() },
    { 22, new SomeData() },
    { 2, new SomeData() },
    { 7, new SomeData() },
    { 59, new SomeData() }
};
In this case the min distance (difference) between keys is 1, for instance between 23 and 22, or between 1 and 2:
23 - 22 = 1 or 2 - 1 = 1
Question: how can I find the min difference between keys in a generic Dictionary? Is there a one-line LINQ solution for this?
Purpose: if there are several matches then I need only one, the smallest; this is needed to fill missing keys (gaps) between items.

I don't know how to do it in one line of LINQ, but here is a multiline solution for this problem.
var items = new Dictionary<int, string>();
items.Add(1, "SomeData");
items.Add(5, "SomeData");
items.Add(23, "SomeData");
items.Add(22, "SomeData");
items.Add(2, "SomeData");
items.Add(7, "SomeData");
items.Add(59, "SomeData");
var sortedArray = items.Keys.OrderBy(x => x).ToArray();
int minDistance = int.MaxValue;
for (int i = 1; i < sortedArray.Length; i++)
{
    var distance = Math.Abs(sortedArray[i] - sortedArray[i - 1]);
    if (distance < minDistance)
        minDistance = distance;
}
Console.WriteLine(minDistance);
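For what it's worth, the same sorted pairwise scan can be squeezed into a single LINQ expression with Zip (a sketch, assuming the dictionary holds at least two keys, since Min throws on an empty sequence):
var sorted = items.Keys.OrderBy(x => x).ToList();
int minDistance = sorted.Zip(sorted.Skip(1), (a, b) => b - a).Min();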

I'm not sure LINQ is the most appropriate, but something (roughly) along these lines should work:
var smallestDiff = (from key1 in items.Keys
                    from key2 in items.Keys
                    where key1 != key2
                    group new { key1, key2 } by Math.Abs(key1 - key2) into grp
                    orderby grp.Key
                    from keyPair in grp
                    orderby keyPair.key1
                    select keyPair).FirstOrDefault();

I won't give you a LINQ query because there already is an answer.
I know this is not what you are asking for, but I want to show you how to solve it in a very fast and easy-to-understand/maintain way, if performance and legibility are of any concern to you.
int[] keys;
int i, d, min;

keys = items.Keys.ToArray();
Array.Sort(keys); // leverage fastest possible implementation of sort

min = int.MaxValue;
for (i = 0; i < keys.Length - 1; i++)
{
    d = keys[i + 1] - keys[i]; // d is always non-negative after sort
    if (d < min)
    {
        if (d == 2)
        {
            return 2; // minimum 1-gap already reached
        }
        else if (d > 2) // ignore non-gap
        {
            min = d;
        }
    }
}
return min; // min contains the minimum key difference that is an actual gap
Because there is only one sort, this non-LINQ solution performs quite quickly. I'm not saying this is the best way, only that you should measure both solutions and compare performance.
EDIT: based on your purpose I've added this piece:
if (d == 2)
{
    return 2; // minimum 1-gap already reached
}
else if (d > 2) // ignore non-gap
{
    min = d;
}
Now what does this mean?
Say the PROBABILITY of having 1-gaps is high; then it is probably faster to check, at every change of min, whether you've already reached that minimum gap. This may happen when you are 1% or 10% of the way through the for loop, depending on the probability. So for very large sets (say, above 1 million or 1 billion items), and once you know the probability to expect, this probabilistic approach may give you huge performance gains.
On the contrary, for small sets, or when the probability of 1-gaps is low, these extra CPU cycles are wasted and you are better off without the check.
As with very large databases (think of probabilistic indexing) probabilistic reasoning becomes relevant.
The problem is that you'll have to estimate beforehand if and when the probabilistic effect kicks in, and that's a pretty complex topic.
EDIT 2: a 1-gap actually has an index difference of 2. Furthermore, an index difference of 1 is a non-gap (there is no room to insert an index in between).
So the previous solution was simply wrong, because as soon as two indices were contiguous (say 34, 35) the minimum would become 1, which is not a gap at all.
Because of this gap problem the internal if() is necessary anyway, and at that point the overhead of the probabilistic approach is nullified: you'll be better off with the correct code and the probabilistic approach!

I think LINQ is simplest.
First, make the difference pairs from your dictionary:
var allPair = items.SelectMany((l) => items.Select((r) => new {l,r}).Where((pair) => l.Key != r.Key));
Then find the pair with the minimum difference:
allPair.OrderBy((pair) => Math.Abs(pair.l.Key - pair.r.Key)).FirstOrDefault();
But you may have multiple pairs with the same difference value, so you may need to use GroupBy before OrderBy and then handle the multiple pairs yourself.
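For example, here is a sketch of that GroupBy variant; the group key is the difference, so the first group after ordering holds every pair that shares the minimum:
var closest = items.Keys
    .SelectMany(l => items.Keys, (l, r) => new { l, r })
    .Where(p => p.l != p.r)
    .GroupBy(p => Math.Abs(p.l - p.r))
    .OrderBy(g => g.Key)
    .First(); // closest.Key is the min difference; enumerate the group for all such pairs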

A one-line solution not listed in the answers:
items.Keys.OrderBy(x => x).Select(x => new { CurVal = x, MinDist = int.MaxValue }).Aggregate((ag, x) => new { CurVal = x.CurVal, MinDist = Math.Min(ag.MinDist, x.CurVal - ag.CurVal) }).MinDist


Shortest list from a two dimensional array

This question is more about an algorithm than actual code, but example code would be appreciated.
Let's say I have a two-dimensional array such as this:
     A  B  C  D  E
    --------------
1 |  0  2  3  4  5
2 |  1  2  4  5  6
3 |  1  3  4  5  6
4 |  2  3  4  5  6
5 |  1  2  3  4  5
I am trying to find the shortest list that would include a value from each row. Currently, I am going row by row and column by column, adding each value to a SortedSet and then checking the length of the set against the shortest set found so far. For example:
Adding cells {1A, 2A, 3A, 4A, 5A} would add the values {0, 1, 1, 2, 1} which would result in a sorted set {0, 1, 2}. {1B, 2A, 3A, 4A, 5A} would add the values {2, 1, 1, 2, 1} which would result in a sorted set {1, 2}, which is shorter than the previous set.
Obviously, adding {1D, 2C, 3C, 4C, 5D} or {1E, 2D, 3D, 4D, 5E} would be the shortest sets, having only one item each, and I could use either one.
I don't have to include every number in the array. I just need to find the shortest set while including at least one number from every row.
Keep in mind that this is just an example array, and the arrays that I'm using are much, much larger. The smallest is 495x28. Brute force will take a VERY long time (28^495 passes). Is there a shortcut that someone knows, to find this in the least number of passes? I have C# code, but it's kind of long.
Edit:
Posting current code, as per request:
// Set up an array of counters; seed the current set so it starts large
int ListsCount = MatrixResults.Count();
int[] Counters = new int[ListsCount];
SortedSet<long> CurrentSet = new SortedSet<long>();
for (long X = 0; X < ListsCount; X++)
{
    Counters[X] = 0;
    CurrentSet.Add(X);
}
while (true)
{
    // Compile sequence list from MatrixResults[]
    SortedSet<long> ThisSet = new SortedSet<long>();
    for (int X = 0; X < ListsCount; X++)
    {
        ThisSet.Add(MatrixResults[X][Counters[X]]);
    }
    // If sequence length is less than the current low, set ThisSet as Current
    if (ThisSet.Count() < CurrentSet.Count())
    {
        CurrentSet.Clear();
        long[] TSI = ThisSet.ToArray();
        for (int Y = 0; Y < ThisSet.Count(); Y++)
        {
            CurrentSet.Add(TSI[Y]);
        }
    }
    // Increment Counters (odometer style)
    int Index = 0;
    bool EndReached = false;
    while (true)
    {
        Counters[Index]++;
        if (Counters[Index] < MatrixResults[Index].Count()) break;
        Counters[Index] = 0;
        Index++;
        if (Index >= ListsCount)
        {
            EndReached = true;
            break;
        }
    }
    // If all counters are fully incremented, then break
    if (EndReached) break;
}
With all computations there is always a tradeoff; several factors are in play, like whether you will get paid for getting it perfect (in this case, for me: no). This is a case of the best being the enemy of the good: how long can we spend on solving a problem, and will the result be close enough to fulfil the use case? When we can get the idea of a key through without hand-painting pixels in UHD resolution, let's!
So my choice is an approach which will get a covering set which is small and, ehem... sometimes will even be the smallest :) To be spot on, one would have to iterate between different strategies and compare the lengths of the sets they produce; for this evening of fun I chose to give one strategy, which I find defensibly close to, or equal to, the minimal set.
The strategy is to view the multidimensional array as a sequence of lists, each reduced to its distinct value set. Then we iteratively shrink the total collection of lists using the smallest list in the remainder, weeding out any unused values in that smallest list once the total set has been reduced in each iteration. This gives a path close enough to the ideal to be effective, as it completes in milliseconds with this approach.
An up-front critique of this approach: the direction in which you pass your minimal list really would have to be varied iteratively to pick the best (left to right, right to left, in position sequences X, Y, Z, ...), because the potential for reduction is not equal. To get close to the ideal, every sequence would have to be tried in each iteration until all combinations were covered, choosing the most-reducing one. Right. But I chose left to right, only!
Now, I chose not to benchmark against your code, because you instantiate your MatrixResults as an array of int arrays, not as a multidimensional array, which your drawing is; so I went by your drawing and couldn't share a data source with your code. No matter, you can make that conversion if you wish. Onwards, to generate sample data:
private int[,] CreateSampleArray(int xDimension, int yDimensions, Random rnd)
{
    Debug.WriteLine($"Created sample array of dimensions ({xDimension}, {yDimensions})");
    var array = new int[xDimension, yDimensions];
    for (int x = 0; x < array.GetLength(0); x++)
    {
        for (int y = 0; y < array.GetLength(1); y++)
        {
            array[x, y] = rnd.Next(0, 4000);
        }
    }
    return array;
}
The overall structure, with some logging; I'm using xUnit to run the code:
[Fact]
public void SetCoverExperimentTest()
{
    var rnd = new Random((int)DateTime.Now.Ticks);
    var sw = Stopwatch.StartNew();
    int[,] matrixResults = CreateSampleArray(rnd.Next(100, 500), rnd.Next(100, 500), rnd);

    // The first requirement is that you must have one element per row, so let's get our unique rows
    var listOfAll = new List<List<int>>();
    List<int> listOfRow;
    for (int y = 0; y < matrixResults.GetLength(1); y++)
    {
        listOfRow = new List<int>();
        for (int x = 0; x < matrixResults.GetLength(0); x++)
        {
            listOfRow.Add(matrixResults[x, y]);
        }
        listOfAll.Add(listOfRow.Distinct().ToList());
    }

    var setFound = new HashSet<int>();
    List<List<int>> allUniquelyRequired = GetDistinctSmallestList(listOfAll, setFound);

    // This set now has all rows that are either distinctly different,
    // or a reordering of distinct values of lists of that length.
    // Our HashSet has the unique value range, meaning any combination of sets
    // with those values, grabbing any one for each set and preferring already
    // chosen ones, should give a covering total set.
    var leastSet = new LeastSetData
    {
        LeastSet = setFound,
        MatrixResults = matrixResults,
    };
    List<Coordinate>? minSet = leastSet.GenerateResultsSet();
    sw.Stop();
    Debug.WriteLine($"Completed in {sw.Elapsed.TotalMilliseconds:0.00} ms");
    Assert.NotNull(minSet);

    // There is one for each row
    Assert.False(minSet.Select(s => s.y).Distinct().Count() < minSet.Count());

    // We took less than 25 milliseconds
    var timespan = new TimeSpan(0, 0, 0, 0, 25);
    Assert.True(sw.Elapsed < timespan);

    // Outputting to the debugger for the fun of it
    var sb = new StringBuilder();
    foreach (var coordinate in minSet)
    {
        sb.Append($"({coordinate.x}, {coordinate.y}) {matrixResults[coordinate.x, coordinate.y]},");
    }
    var debugLine = sb.ToString();
    debugLine = debugLine.Substring(0, debugLine.Length - 1);
    Debug.WriteLine("Resulting set: " + debugLine);
}
Now the more meaty iterative bits
private List<List<int>> GetDistinctSmallestList(List<List<int>> listOfAll, HashSet<int> setFound)
{
    // Our smallest set must be a subset of the distinct sum of all our smallest lists'
    // value ranges, plus unknown
    var listOfShortest = new List<List<int>>();
    int shortest = int.MaxValue;
    foreach (var list in listOfAll)
    {
        if (list.Count < shortest)
        {
            listOfShortest.Clear();
            shortest = list.Count;
            listOfShortest.Add(list);
        }
        else if (list.Count == shortest)
        {
            if (listOfShortest.Contains(list))
                continue;
            listOfShortest.Add(list);
        }
    }
    var setFoundAddition = new HashSet<int>(setFound);
    foreach (var list in listOfShortest)
    {
        foreach (var item in list)
        {
            if (setFound.Contains(item))
                continue;
            if (setFoundAddition.Contains(item))
                continue;
            setFoundAddition.Add(item);
        }
    }
    // Now we can remove all rows with those found; we'll add the smallest later
    var listOfAllRemainder = new List<List<int>>();
    bool foundInList;
    List<int> consumedWhenReducing = new List<int>();
    foreach (var list in listOfAll)
    {
        foundInList = false;
        foreach (int item in list)
        {
            if (setFound.Contains(item))
            {
                // Covered by data from last iteration(s)
                foundInList = true;
                break;
            }
            else if (setFoundAddition.Contains(item))
            {
                consumedWhenReducing.Add(item);
                foundInList = true;
                break;
            }
        }
        if (!foundInList)
        {
            listOfAllRemainder.Add(list); // adding the lists that did not have elements found
        }
    }
    // Remove any values from these smallest-set lists that did not get consumed in the pass above
    if (consumedWhenReducing.Count == 0)
    {
        throw new Exception("Shouldn't be possible to remove the row itself without using one of its values, please investigate");
    }
    var removeArray = setFoundAddition.Where(a => !consumedWhenReducing.Contains(a)).ToArray();
    setFoundAddition.RemoveWhere(x => removeArray.Contains(x));
    foreach (var value in setFoundAddition)
    {
        setFound.Add(value);
    }
    if (listOfAllRemainder.Count != 0)
    {
        // Do the whole thing again until there is no list left
        listOfShortest.AddRange(GetDistinctSmallestList(listOfAllRemainder, setFound));
    }
    return listOfShortest; // Here we will ultimately have the sum of shortest lists per iteration
}
To conclude: I hope to have inspired you; at least I had fun coming up with a best approximation, and should you feel like completing the code, you're very welcome to grab what you like.
Obviously we should really track the sequence in which we go through the shortest lists; after all, it matters whether we start reducing the total distinct lists by the element at position 0 or at 0+N, and which one we reduce with next. We must use one of those values, but each consumed value removes part of the total list, so the range consumption sequence matters to the later iterations: a position we never reached, because no lists were left, could potentially have removed more than some of those that were covered. You get the picture, I'm sure.
And this is just one strategy; one might as well have chosen the largest distinct list, even within the same framework, and if you do not iteratively cover enough strategies, only brute force is left.
Anyway, that is how you'd want an AI to act: just like a human, not contemplating the existence of the universe first; after all, with silicon brains we can reconsider pretty often, as long as we can do so fast.
With any moving object at least, I'd much rather be 90% on target, correcting every second while taking 14 ms to get there, than spend 2 seconds reaching 99% or the illusive 100%. Meaning: we should stop the vehicle before the concrete pillar or the pram, or conversely buy the equity when it is a good time to do so, not figure out that we should have stopped once we are already on the other side of the obstacle, or that we should have bought 5 seconds ago, by which point the spot price has already jumped again...
Thus the defense rests on the notion that it is a matter of opinion whether this solution is good enough or simply incomplete at best :D
I realize it's pretty random, but just to say: although this sketch is not entirely indisputably correct, it is easy to read and maintain, and anyway the question is wrong B-] We will very rarely need the absolute minimal set, and when we do, the answer will be much longer :D
... woopsie, forgot the support classes
public struct Coordinate
{
    public int x;
    public int y;

    public override string ToString()
    {
        return $"({x},{y})";
    }
}

public struct CoordinateValue
{
    public int Value { get; set; }
    public Coordinate Coordinate { get; set; }

    public override string ToString()
    {
        return string.Concat(Coordinate.ToString(), " ", Value.ToString());
    }
}

public class LeastSetData
{
    public HashSet<int> LeastSet { get; set; }
    public int[,] MatrixResults { get; set; }

    public List<Coordinate> GenerateResultsSet()
    {
        HashSet<int> chosenValueRange = new HashSet<int>();
        var chosenSet = new List<Coordinate>();
        for (int y = 0; y < MatrixResults.GetLength(1); y++)
        {
            var candidates = new List<CoordinateValue>();
            for (int x = 0; x < MatrixResults.GetLength(0); x++)
            {
                if (LeastSet.Contains(MatrixResults[x, y]))
                {
                    candidates.Add(new CoordinateValue
                    {
                        Value = MatrixResults[x, y],
                        Coordinate = new Coordinate { x = x, y = y }
                    });
                }
            }
            if (candidates.Count == 0)
                throw new Exception($"OMG Something's wrong! (this row did not have any of derived range [y: {y}])");
            var done = false;
            foreach (var c in candidates)
            {
                if (chosenValueRange.Contains(c.Value))
                {
                    chosenSet.Add(c.Coordinate);
                    done = true;
                    break;
                }
            }
            if (!done)
            {
                var firstCandidate = candidates.First();
                chosenSet.Add(firstCandidate.Coordinate);
                chosenValueRange.Add(firstCandidate.Value);
            }
        }
        return chosenSet;
    }
}
This problem is NP hard.
To show that, we have to take a known NP hard problem, and reduce it to this one. Let's do that with the Set Cover Problem.
We start with a universe U of things, and a collection S of sets that covers the universe. Assign each thing a row, and each set a number. This will fill a different number of columns for each row, so pad the rows out to a rectangle by adding fresh numbers that appear nowhere else.
Now solve your problem.
For each new number in your solution that didn't come from a set in the original problem, we can replace it with another number in the same row that did come from a set.
And now we turn numbers back into sets and we have a solution to the Set Cover Problem.
The transformations from set cover to your problem and back again are both O(number_of_elements * number_of_sets) which is polynomial in the input. And therefore your problem is NP hard.
Conversely if you replace each number in the matrix with the set of rows covered, your problem turns into the Set Cover Problem. Using any existing solver for set cover then gives a reasonable approach for your problem as well.
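To make that last point concrete, here is a sketch (my illustration, not from this answer) of the classic greedy set-cover heuristic, which gives a ln(n)-approximation; valueToRows is a hypothetical "value -> set of row indices it appears in" map built from the matrix:
using System.Collections.Generic;
using System.Linq;

static class GreedySetCover
{
    // Greedy set cover: repeatedly pick the value that covers the most
    // still-uncovered rows. Assumes every row contains at least one value,
    // so the loop always makes progress.
    public static List<int> Cover(Dictionary<int, HashSet<int>> valueToRows, int rowCount)
    {
        var uncovered = new HashSet<int>(Enumerable.Range(0, rowCount));
        var chosen = new List<int>();
        while (uncovered.Count > 0)
        {
            // Pick the value whose row set hits the most uncovered rows.
            var best = valueToRows
                .OrderByDescending(kv => kv.Value.Count(uncovered.Contains))
                .First();
            chosen.Add(best.Key);
            uncovered.ExceptWith(best.Value);
        }
        return chosen;
    }
}
Greedy is not optimal, but it is fast, and its ln(n) bound is essentially the best a polynomial-time algorithm can guarantee for set cover.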
The code is not particularly tidy or optimised, but it illustrates the approach I think @btilly is suggesting in his answer (E&OE), using a bit of recursion (I was going for intuitive rather than ideal for scaling, so you may have to work out an iterative equivalent).
From the rows with their values, make a "values with the rows that they appear in" counterpart. Now pick a value, eliminate all rows in which it appears, and solve again for the reduced set of rows. Repeat recursively, keeping only the shortest solutions.
I know this is not terribly readable (or well explained) and I may come back to tidy it up in the morning, so let me know if it does what you want (and is worth a bit more of my time ;-).
// Setup
var rowValues = new Dictionary<int, HashSet<int>>
{
    [0] = new() { 0, 2, 3, 4, 5 },
    [1] = new() { 1, 2, 4, 5, 6 },
    [2] = new() { 1, 3, 4, 5, 6 },
    [3] = new() { 2, 3, 4, 5, 6 },
    [4] = new() { 1, 2, 3, 4, 5 }
};

Dictionary<int, HashSet<int>> ValueRows(Dictionary<int, HashSet<int>> rv)
{
    var vr = new Dictionary<int, HashSet<int>>();
    foreach (var row in rv.Keys)
    {
        foreach (var value in rv[row])
        {
            if (vr.ContainsKey(value))
            {
                if (!vr[value].Contains(row))
                    vr[value].Add(row);
            }
            else
            {
                vr.Add(value, new HashSet<int> { row });
            }
        }
    }
    return vr;
}

List<int> FindSolution(Dictionary<int, HashSet<int>> rAndV)
{
    if (rAndV.Count == 0) return new List<int>();
    var bestSolutionSoFar = new List<int>();
    var vAndR = ValueRows(rAndV);
    foreach (var v in vAndR.Keys)
    {
        var copyRemove = new Dictionary<int, HashSet<int>>(rAndV);
        foreach (var r in vAndR[v])
            copyRemove.Remove(r);
        var solution = new List<int> { v };
        solution.AddRange(FindSolution(copyRemove));
        if (bestSolutionSoFar.Count == 0 || (solution.Count > 0 && solution.Count < bestSolutionSoFar.Count))
            bestSolutionSoFar = solution;
    }
    return bestSolutionSoFar;
}

var solution = FindSolution(rowValues);
Console.WriteLine($"Optimal solution has values {{ {string.Join(',', solution)} }}");
Output: Optimal solution has values { 4 }

Massive amount number comparison using c#

Comparison of number sets is too slow. What is a more efficient way to solve this problem?
I have two groups of sets; each group has about 5 million sets, each set has 6 numbers, and each number is between 1 and 100. Neither the sets nor the groups are sorted, and a set may contain duplicate numbers.
Following is Example.
No.   Group A               Group B
1     {1,2,3,4,5,6}         {6,2,4,87,53,12}
2     {2,3,4,5,6,8}         {43,6,78,23,96,24}
3     {45,23,57,79,23,76}   {12,1,90,3,2,23}
4     {3,5,85,24,78,90}     {12,65,78,9,23,13}
...   ...
My goal is to compare the two groups and classify Group A by maximum common element count, in 5 hrs on my laptop.
In the example, No. 1 of Group A and No. 3 of Group B have 3 common elements (1, 2, 3).
Also, No. 2 of Group A and No. 3 of Group B have 2 common elements (2, 3). Therefore I will classify Group A as follows.
No.   Group A               Maximum Common Element Count
1     {1,2,3,4,5,6}         3
2     {2,3,4,5,6,8}         3
3     {45,23,57,79,23,76}   1
4     {3,5,85,24,78,90}     2
...
My approach compares every set and every number, so the complexity is (Group A count) * (Group B count) * 6 * 6. Therefore it needs far too much time.
// Keyed by set; keying by count would throw as soon as two sets share a count
Dictionary<List<int>, int> Classified = new Dictionary<List<int>, int>();
foreach (List<int> setA in GroupA)
{
    int maxcount = 0;
    foreach (List<int> setB in GroupB)
    {
        int count = 0;
        foreach (int elementA in setA)
        {
            foreach (int elementB in setB)
            {
                if (elementA == elementB) count++;
            }
        }
        if (count > maxcount) maxcount = count;
    }
    Classified.Add(setA, maxcount);
}
Here is my attempt - using a HashSet<int> and precalculating the range of each set to avoid set-to-set comparisons like {1,2,3,4,5,6} and {7,8,9,10,11,12} (as pointed out by Matt's answer).
For me (running with random sets) it resulted in a 130x speed improvement on the original code. You mentioned in a comment that
Now execution time is over 3 days, so as others said I need parallelization.
and in the question itself that
My goal is compare two groups and classify Group A by maximum common element count in 5hrs on my laptop.
so assuming that the comment means that the execution time for your data exceeded 3 days (72 hours), but you want it to complete in 5 hours, you'd only need something like a 14x speed increase.
Framework
I've created some classes to run these benchmarks:
Range - takes some int values, and keeps track of the minimum and maximum values.
public class Range
{
    private readonly int _min;
    private readonly int _max;

    public Range(IReadOnlyCollection<int> values)
    {
        _min = values.Min();
        _max = values.Max();
    }

    public int Min { get { return _min; } }
    public int Max { get { return _max; } }

    public bool Intersects(Range other)
    {
        if (_min > other._max)
            return false;
        if (_max < other._min)
            return false;
        return true;
    }
}
SetWithRange - wraps a HashSet<int> and a Range of the values.
public class SetWithRange : IEnumerable<int>
{
    private readonly HashSet<int> _values;
    private readonly Range _range;

    public SetWithRange(IReadOnlyCollection<int> values)
    {
        _values = new HashSet<int>(values);
        _range = new Range(values);
    }

    public static SetWithRange Random(Random random, int size, Range range)
    {
        var values = new HashSet<int>();
        // Random.Next(int, int) generates numbers in the range [min, max)
        // so we need to add one here to be able to generate numbers in [min, max].
        // See https://learn.microsoft.com/en-us/dotnet/api/system.random.next
        var min = range.Min;
        var max = range.Max + 1;
        while (values.Count() < size)
            values.Add(random.Next(min, max));
        return new SetWithRange(values);
    }

    public int CommonValuesWith(SetWithRange other)
    {
        // No need to call Intersect on the sets if the ranges don't intersect
        if (!_range.Intersects(other._range))
            return 0;
        return _values.Intersect(other._values).Count();
    }

    public IEnumerator<int> GetEnumerator()
    {
        return _values.GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}
The results were generated using SetWithRange.Random as follows:
const int groupCount = 10000;
const int setSize = 6;
var range = new Range(new[] { 1, 100 });
var generator = new Random();
var groupA = Enumerable.Range(0, groupCount)
    .Select(i => SetWithRange.Random(generator, setSize, range))
    .ToList();
var groupB = Enumerable.Range(0, groupCount)
    .Select(i => SetWithRange.Random(generator, setSize, range))
    .ToList();
The timings given below are for an average of three x64 release build runs on my machine.
For all cases I generated groups with 10000 random sets then scaled up to approximate the execution time for 5 million sets by using
timeFor5Million = timeFor10000 / 10000 / 10000 * 5000000 * 5000000
                = timeFor10000 * 250000
Results
Four foreach blocks:
Average time = 48628ms; estimated time for 5 million sets = 3377 hours
var result = new Dictionary<SetWithRange, int>();
foreach (var setA in groupA)
{
    int maxcount = 0;
    foreach (var setB in groupB)
    {
        int count = 0;
        foreach (var elementA in setA)
        {
            foreach (int elementB in setB)
            {
                if (elementA == elementB)
                    count++;
            }
        }
        if (count > maxcount) maxcount = count;
    }
    result.Add(setA, maxcount);
}
Three foreach blocks with parallelisation on the outer foreach:
Average time = 10305ms; estimated time for 5 million sets = 716 hours (4.7 times faster than original):
var result = new Dictionary<SetWithRange, int>();
Parallel.ForEach(groupA, setA =>
{
    int maxcount = 0;
    foreach (var setB in groupB)
    {
        int count = 0;
        foreach (var elementA in setA)
        {
            foreach (int elementB in setB)
            {
                if (elementA == elementB)
                    count++;
            }
        }
        if (count > maxcount) maxcount = count;
    }
    lock (result)
        result.Add(setA, maxcount);
});
Using HashSet<int> and adding a Range to only check sets which intersect:
Average time = 375ms; estimated time for 5 million sets = 26 hours (130 times faster than original):
var result = new Dictionary<SetWithRange, int>();
Parallel.ForEach(groupA, setA =>
{
    var commonValues = groupB.Max(setB => setA.CommonValuesWith(setB));
    lock (result)
        result.Add(setA, commonValues);
});
Link to a working online demo here: https://dotnetfiddle.net/Kxpagh (note that .NET Fiddle limits execution times to 10 seconds, and that for obvious reasons its results are slower than running in a normal environment).
Fastest I can think of is this:
As all your numbers come from a limited range (1-100), you can express each of your sets as a 100-digit binary number <d1,d2,...,d100> where dn equals 1 iff n is in the set.
Then comparing two sets means a binary AND on the two binary representations and counting the set bits (which can be done efficiently)
In addition to that, this task can be parallelized (your input is immutable, so it's quite straightforward).
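A sketch of that representation (my assumptions: .NET Core 3.0+ for System.Numerics.BitOperations, and values in 1..100, so two ulongs hold the 128 bits):
using System.Collections.Generic;
using System.Numerics; // BitOperations

// A 128-bit mask per set: bit n is 1 iff n is in the set (n in 1..100).
readonly struct Set128
{
    private readonly ulong lo; // bits 0..63
    private readonly ulong hi; // bits 64..127

    public Set128(IEnumerable<int> values)
    {
        ulong l = 0, h = 0;
        foreach (int v in values) // each v is assumed to be in [1, 100]
        {
            if (v < 64) l |= 1UL << v;
            else h |= 1UL << (v - 64);
        }
        lo = l;
        hi = h;
    }

    // Count of common elements: AND the masks, then popcount the result.
    public int CommonWith(Set128 other) =>
        BitOperations.PopCount(lo & other.lo) + BitOperations.PopCount(hi & other.hi);
}
Comparing two sets then costs two ANDs and two popcounts, independent of the set contents, and the masks pack nicely into arrays for the parallel scan.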
You would have to benchmark it with smaller sets, but since you're going to do 5E6 * 5E6 = 25E12 comparisons, you might as well sort the contents of the 5E6 + 5E6 = 10E6 sets first.
Then the set-to-set comparisons become much faster, since you can stop each comparison as soon as you reach the highest number on the first side of the comparison. Minuscule savings per set comparison, but trillions of times over, it adds up.
You could also go further and index the two groups of five million by lowest entry and highest entry. That would cut the number of comparisons down significantly. In the end, that's only 100 * 100 = 10,000 = 1E4 distinct buckets. You would never have to compare sets whose highest number is, for instance, 12 with any set that starts at 13 or more, effectively avoiding a ton of work.
In my mind this is sorting a lot of data, but it pales in comparison to the number of raw set-to-set comparisons you would otherwise have to do. Here, you eliminate work for all the 0s and can abort early when the conditions are right during a comparison.
And as others have said, parallelization...
PS: 5E6 = 5 * 10^6 = 5,000,000 and 25E12 = 25 * 10^12 = 25,000,000,000,000
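As a sketch of that bucketing idea (assumed shape: each set already sorted ascending in a List<int>):
// Bucket Group B by (lowest, highest) element: at most 100 * 100 buckets.
// A set from Group A whose highest element is below a bucket's minimum
// (or whose lowest is above its maximum) skips that whole bucket.
var bucketsB = GroupB
    .GroupBy(s => (Min: s[0], Max: s[s.Count - 1]))
    .ToDictionary(g => g.Key, g => g.ToList());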
The time complexity of any algorithm you come up with is going to be of the same order. HashSets might be a bit faster, but if they are, it won't be by much: 36 direct list comparisons vs 12 hashset lookups isn't going to be a significant difference, if there is one at all, but you'll have to benchmark. Presorting might help a bit, considering each set will be compared millions of times. Just FYI, for loops are faster than foreach loops on a List, and arrays are faster than Lists (for and foreach on an array perform the same), which for something like this might make a decent performance difference. If the No. column is sequential, I would use an array for that instead of a dictionary as well: array lookups are an order of magnitude faster than dictionary lookups.
I think you are generally doing this as quickly as possible, aside from parallelization, with some small gains possible through the above micro-optimizations.
How far off from your target execution time is the current algorithm?
I would use the following:
foreach (List<int> setA in GroupA)
{
    int maxcount = GroupB.Max(x => x.Sum(y => setA.Contains(y) ? 1 : 0));
    Classified.Add(setA, maxcount); // keyed by set, matching the corrected Classified above
}

Other way to solve assigning

Let's say I have a collection of n workers. Let's say there are 3:
John
Adam
Mark
I want to know when they have to clean the office. If I set int cleanDays = 3, it would be something like this:
//Day of month;worker
1;John
2;John
3;John
4;Adam
5;Adam
6;Adam
7;Mark
8;Mark
9;Mark
10;John
11;John
.
.
.
If I set cleanDays = 1 it would be:
1;John
2;Adam
3;Mark
4;John
5;Adam
.
.
.
And so on.
I already managed something like this:
int cleanDays = 6;

for (int day = 1; day < 30; day++) { // for each day
    int worker = (day-1 % cleanDays) % workers.Count; // get current worker (starting from index 0)
    for (int times = 0; times < cleanDays; times++) // worker does the job `cleanDays` times
        Console.WriteLine(day++ + ";" + workers[worker].Name);
}
This is not working properly, because it gives me 34 days. That's because of the day++ in the first loop. But if I delete day++ from the first loop:
for (int day = 1; day < 30;) { // for each day
    int worker = (day-1 % cleanDays) % workers.Count; // get current worker (starting from index 0)
    for (int times = 0; times < cleanDays; times++) // worker does the job `cleanDays` times
        Console.WriteLine(day++ + ";" + workers[worker].Name);
}
It gives output only for the first worker. When I debugged, I saw this:
int worker = (day-1 % cleanDays)%workers.Count;
and worker was equal to 0 every time. That means:
(20-1%6)%3 was equal to 0. Why does that happen?
UPDATE: I just read your question more carefully and realized you were not asking about the actual code at all. Your real question was:
That means: (20-1%6)%3 was equal to 0. Why does that happen?
First of all, it doesn't. (20-1%6)%3 is 1. But the logic is still wrong because you have the parentheses in the wrong place. You meant to write
int worker = (day - 1) % cleanDays % workers.Count;
Remember, multiplication, division and remainder operators are all higher precedence than subtraction. a + b * c is a + (b * c), not (a + b) * c. The same is true of - and %. a - b % c is a - (b % c), not (a - b) % c.
But I still stand by my original answer: you can eliminate the problem entirely by writing a query that represents your sequence operations, rather than a loop with a bunch of complicated arithmetic that is easy to get wrong.
Original answer follows.
Dmitry Bychenko's solution is pretty good but we can improve on it; modular arithmetic is not necessary here. Rather than indexing into the worker array, we can simply select-many from it directly:
var query = Enumerable.Repeat(
        workers.SelectMany(worker => Enumerable.Repeat(worker, cleanDays)),
        1000)
    .SelectMany(workerseq => workerseq)
    .Select((worker, index) => new { Worker = worker, Day = index + 1 })
    .Take(30);

foreach (var x in query)
    Console.WriteLine($"Day {x.Day} Worker {x.Worker}");
Make sure you understand how this query works, because these are core operations of LINQ. We take a sequence of workers,
{A, B, C}
This is projected onto a sequence of sequences:
{ {A, A}, {B, B}, {C, C} }
Which is flattened:
{A, A, B, B, C, C}
We then repeat that a thousand times:
{ { A, A, B, B, C, C },
{ A, A, B, B, C, C },
...
}
And then flatten that sequence-of-sequences:
{ A, A, B, B, C, C, A, A, B, B, C, C, ... }
We then select-with-index into that flattened sequence to produce a sequence of day, worker pairs.
{ {A, 1}, {A, 2}, {B, 3}, {B, 4}, ... }
Then take the first 30 of those. Then we execute the query and print the results.
Now, you might say isn't this inefficient? If we have, say, 4 workers, we put each on 5 days, and then we repeat that sequence 1000 times; that makes a sequence with 5 x 4 x 1000 = 20000 items, but we only need the first 30.
Do you see what is wrong with that logic?
LINQ sequences are constructed lazily. Because of the Take(30) we never construct more than 30 pairs in the first place. We could have repeated it a million times; doesn't matter. You say Take(30) and the sequence construction will stop constructing more items after you've printed 30 of them.
But don't stop there. Ask yourself how you can improve this code further.
The bit with the days as integers seems a bit dodgy. Surely what you want is actual dates.
var start = new DateTime(2017, 1, 1);
And now instead of selecting out numbers, we can select out dates:
...
.Select((worker, index) => new { Worker = worker, Day = start.AddDays(index)})
...
What are the key takeaways here?
Rather than messing around with loops and weird arithmetic, just construct queries that represent the shape of what you want. What do you want? Repeat each worker n times. Great, then there should be a line in your program somewhere that says Repeat(worker, n), and now your program looks like its specification. Now your program is more likely to be correct. And so on.
Use the right data type for the job. Want to represent dates? Use DateTime, not int.
I would use a while loop, and use some tracking variables to keep track of which worker you are at and how many clean-times are left for that worker. Something like this:
const int cleanTime = 3; // or 1 or 6
var workers = new[] { "John", "Adam", "Mark" };
var day = 1;
var currentWorker = 0;
var currentCleanTimeLeft = cleanTime;

while (day <= 30) {
    Console.WriteLine("{0};{1}", day, workers[currentWorker]);
    currentCleanTimeLeft--;
    if (currentCleanTimeLeft == 0) {
        currentCleanTimeLeft = cleanTime;
        currentWorker++;
        if (currentWorker >= workers.Length)
            currentWorker = 0;
    }
    day++;
}
A very basic solution; no division or modulo arithmetic required.
The second loop is unnecessary, it simply messes up your day.
int cleanDays = 6;
for (int day = 1; day <= 30; day++)
{
    int worker = ((day - 1) / cleanDays) % workers.Count;
    Console.WriteLine(day + ";" + workers[worker].Name);
}
Example on Fiddle
The basic idea is to give each individual day a numerical value - DateTime.Now.DayOfYear is a good choice, or just a running count - and map that numerical value to an index in the worker array.
The main logic is in the workerIndex line below:
It takes the day number and divides it by cleanDays. This means that each run of cleanDays days is mapped to the same workerIndex.
It takes the workerIndex and does a modulo operation (%) with the count of workers. This causes the workerIndex to be cyclical, iterating endlessly over all workers.
string[] workers = new string[] { "Mike", "Bob", "Hank" };
int cleanDays = 6;

for (int dayNum = 0; dayNum < 300; dayNum++)
{
    var workerIndex = (dayNum / cleanDays) % workers.Length; // <-- LOGIC!
    Console.WriteLine("Day {0} - Cleaner: {1}", dayNum, workers[workerIndex]);
}
I suggest modulo arithmetics and Linq:
List<Worker> personnel = ...
int days = 30;
int cleanDays = 4;

var result = Enumerable.Range(0, int.MaxValue)
    .SelectMany(index => Enumerable
        .Repeat(personnel[index % personnel.Count], cleanDays))
    .Select((man, index) => $"{index + 1};{man.Name}")
    .Take(days);
Test:
Console.Write(string.Join(Environment.NewLine, result));
Output:
1;John
2;John
3;John
4;John
5;Adam
6;Adam
7;Adam
8;Adam
9;Mark
...
24;Mark
25;John
26;John
27;John
28;John
29;Adam
30;Adam
you could create a sequence function:
public static IEnumerable<string> GenerateSequence(IEnumerable<string> sequence, int groupSize)
{
    var day = 1;
    while (true)
    {
        foreach (var element in sequence)
        {
            for (var i = 0; i < groupSize; ++i)
            {
                yield return $"{day};{element}";
                day++;
            }
        }
    }
}
usage:
var workers = new List<string> { "John", "Adam", "Mark" };
var cleanDays = 3;
GenerateSequence(workers, cleanDays).Take(100).Dump();
I would do something like this:
var cleanDays = 6; // Number of days in each shift
var max = 30; // The amount of days the loop will run for
var count = workers.Count(); // The amount of workers

if (count == 0) return; // Exit if there are no workers
if (count == 1) cleanDays = max; // See '3.' in explanation (*)

for (var index = 0; index < max; index++) {
    var worker = (index / cleanDays) % count;
    var day = index % cleanDays;
    Console.WriteLine(string.Format("Day {0}: {1} cleaned today (Consecutive days cleaned: {2})", index + 1, workers[worker].Name, day));
}
Explanation
By doing index / cleanDays you get the number of whole shifts completed so far. But there may be more shifts than workers, in which case you wrap around with the remainder (shifts % number of workers).
To get how many consecutive days the current worker has worked so far, you take the remainder of the first division above: index % cleanDays.
Finally, as you can see, I get the count of the array before I enter the loop, for 3 reasons:
1. To only read it once, and save some time.
2. To exit the method if the array is empty.
3. To check if there is only one worker left, in which case that worker won't have a break and will be working from day 1 until day 'max'; therefore I set cleanDays to max. *

Getting all combinations of K and less elements in List of N elements with big K

I want to have all combinations of elements in a list, for a result like this:
List: {1,2,3}
1
2
3
1,2
1,3
2,3
My problem is that I have 180 elements, and I want all combinations of up to 5 elements. In my tests with 4 elements, it took a long time (2 minutes) but all went well. But with 5 elements, I get an out-of-memory exception.
My code presently is this:
public IEnumerable<IEnumerable<Rondin>> getPossibilites(List<Rondin> rondins)
{
    var combin5 = rondins.Combinations(5);
    var combin4 = rondins.Combinations(4);
    var combin3 = rondins.Combinations(3);
    var combin2 = rondins.Combinations(2);
    var combin1 = rondins.Combinations(1);

    return combin5.Concat(combin4).Concat(combin3).Concat(combin2).Concat(combin1).ToList();
}
With the function (taken from this question: Algorithm to return all combinations of k elements from n):
public static IEnumerable<IEnumerable<T>> Combinations<T>(this IEnumerable<T> elements, int k)
{
    return k == 0 ? new[] { new T[0] } :
        elements.SelectMany((e, i) =>
            elements.Skip(i + 1).Combinations(k - 1).Select(c => (new[] { e }).Concat(c)));
}
I need to search the list for a combination whose elements add up to a value, within a certain precision, and this for each element of another list. Here is all my code for this part:
var possibilites = getPossibilites(opt.rondins);
possibilites = possibilites.Where(p => p.Sum(r => r.longueur + traitScie) < 144);

foreach (BilleOptimisee b in opt.billesOptimisees)
{
    // FirstOrDefault (rather than ElementAt(0)) so the null check below is meaningful
    var proches = possibilites.Where(p => p.Sum(r => (r.longueur + traitScie)) < b.chute && Math.Abs(b.chute - p.Sum(r => r.longueur)) - (p.Count() * 0.22) < 0.01).OrderByDescending(p => p.Sum(r => r.longueur)).FirstOrDefault();
    if (proches != null)
    {
        foreach (Rondin r in proches)
        {
            opt.rondins.Remove(r);
            b.rondins.Add(r);
            possibilites = possibilites.Where(p => !p.Contains(r));
        }
    }
}
With the code I have, how can I limit the memory taken by my list? Or is there a better way to search a very big set of combinations?
Please, if my question is not good, tell me why and I will do my best to learn and ask better questions next time ;)
Your output list for combinations of 5 elements will have ~1.5*10^9 (that's billion with a b) sublists of size 5. With 32-bit integers, even neglecting list overhead and assuming a perfect list with 0b overhead, that is ~30GB!
You should reconsider whether you actually need to generate the list the way you do; an alternative might be streaming the list of elements, i.e. generating them on the fly.
That can be done by creating a function which takes the last combination as an argument and outputs the next. (To think about how it is done, think of increasing a number by one: you go from the last digit to the first, remembering a "carry over" until you are done.)
A streaming example for choosing 2 out of 4:
start: {4,3}
curr = start {4, 3}
curr = next(curr) {4, 2} // reduce last by one
curr = next(curr) {4, 1} // reduce last by one
curr = next(curr) {3, 2} // cannot reduce more, reduce the first by one, and set the follower to maximal possible value
curr = next(curr) {3, 1} // reduce last by one
curr = next(curr) {2, 1} // similar to {3,2}
done.
Now, you need to figure out how to do it for lists of size 2, then generalize it for arbitrary size, and program your streaming combination generator.
Good Luck!
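For illustration, a sketch of such a streaming generator (my own, in ascending lexicographic order rather than the descending order traced above); it yields k-combinations of the indices {0, ..., n-1}, deriving each from the previous one, so no global list is ever kept:
static IEnumerable<int[]> StreamCombinations(int n, int k)
{
    var c = Enumerable.Range(0, k).ToArray(); // first combination: {0, 1, ..., k-1}
    while (true)
    {
        yield return (int[])c.Clone();
        // Find the rightmost position that can still be incremented.
        int i = k - 1;
        while (i >= 0 && c[i] == n - k + i) i--;
        if (i < 0) yield break; // last combination reached
        c[i]++; // increment it, and reset everything to its right ("carry over")
        for (int j = i + 1; j < k; j++) c[j] = c[j - 1] + 1;
    }
}
Concatenating the streams for k = 1..5 and filtering as you enumerate means only the combinations that pass the sum test are ever materialized.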
Let your precision be defined in the imaginary spectrum.
Use a real index to access the leaf and then traverse the leaf with the required precision.
See PrecisLise # http://net7mma.codeplex.com/SourceControl/latest#Common/Collections/Generic/PrecicseList.cs
While the implementation is not 100% complete as linked you can find where I used a similar concept here:
http://net7mma.codeplex.com/SourceControl/latest#RtspServer/MediaTypes/RFC6184Media.cs
Using this concept I was able to re-order h.264 Access Units and their underlying Network Access Layer Components in what I consider a very interesting way... besides being interesting, it also has the potential to be more efficient while using close to the same amount of memory.
E.g., 0 can be followed by 0.1 or 0.01 or 0.001; depending on the type of the key in the list (double, float, Vector, inter alia) you may have the added benefit of using the FPU, or possibly intrinsics if supported by your processor, thus making sorting and indexing much faster than would be possible on normal sets, regardless of the underlying storage mechanism.
Using this concept allows for very interesting ordering... especially if you provide a mechanism to filter the precision.
I was also able to find several bugs in the bit-stream parser of quite a few well known media libraries using this methodology...
I found my solution; I'm writing it here so that other people with a similar problem have something to work with...
I made a recursive function that searches for a fixed number of possibilities that fit the conditions. When that number of possibilities is found, I return the list, do some calculations with the results, and restart the process. I added a timer to stop the search when it takes too long. Since my condition is based on the sum of the elements, I only try possibilities with distinct values, and search for a small number of possibilities each time (like 1).
So the function returns a possibility with very high precision; I do what I need with it, remove its elements from the original list, and call the function again with the same precision until nothing is returned, so I can continue with another precision. After several precisions are done, only about 30 elements remain in my list, so I can ask for all the possibilities (that still fit the maximum sum), and this part is much easier than the beginning.
Here is my code:
public List<IEnumerable<Rondin>> getPossibilites(IEnumerable<Rondin> rondins, int nbElements, double minimum, double maximum, int instance = 0, double longueur = 0)
{
    if (instance == 0)
        timer = DateTime.Now;

    List<IEnumerable<Rondin>> liste = new List<IEnumerable<Rondin>>();

    // Get all distinct rondins that can fit into the maximal length
    foreach (Rondin r in rondins.Where(r => r.longueur < (maximum - longueur)).DistinctBy(r => r.longueur).OrderBy(r => r.longueur))
    {
        // Check the current length
        double longueur2 = longueur + r.longueur + traitScie;

        // If the current length is under the maximal length
        if (longueur2 < maximum)
        {
            // Get all the possibilities with all rondins except the current one, and add them to the list
            foreach (IEnumerable<Rondin> poss in getPossibilites(rondins.Where(rondin => rondin.id != r.id), nbElements - liste.Count, minimum, maximum, instance + 1, longueur2).Select(possibilite => possibilite.Concat(new Rondin[] { r })))
            {
                liste.Add(poss);
                if (liste.Count >= nbElements && nbElements > 0)
                    break;
            }

            // If the current length is higher than the minimum, add it to the list
            if (longueur2 >= minimum)
                liste.Add(new Rondin[] { r });
        }

        // If we have enough possibilities, we stop the search
        if (liste.Count >= nbElements && nbElements > 0)
            break;

        // If the search is taking too long, stop and return the list
        if (DateTime.Now.Subtract(timer).TotalSeconds > 30)
            break;
    }

    return liste;
}

Get all possible distinct triples using LINQ

I have a List containing these values: {1, 2, 3, 4, 5, 6, 7}, and I want to be able to retrieve unique combinations of three. The result should be like this:
{1,2,3}
{1,2,4}
{1,2,5}
{1,2,6}
{1,2,7}
{2,3,4}
{2,3,5}
{2,3,6}
{2,3,7}
{3,4,5}
{3,4,6}
{3,4,7}
{3,4,1}
{4,5,6}
{4,5,7}
{4,5,1}
{4,5,2}
{5,6,7}
{5,6,1}
{5,6,2}
{5,6,3}
I already have 2 for loops that are able to do this:
for (int first = 0; first < test.Count - 2; first++)
{
    int second = first + 1;
    for (int offset = 1; offset < test.Count; offset++)
    {
        int third = (second + offset) % test.Count;
        if (Math.Abs(first - third) < 2)
            continue;

        List<int> temp = new List<int>();
        temp.Add(test[first]);
        temp.Add(test[second]);
        temp.Add(test[third]);
        result.Add(temp);
    }
}
But since I'm learning LINQ, I wonder if there is a smarter way to do this?
UPDATE: I used this question as the subject of a series of articles starting here; I'll go through two slightly different algorithms in that series. Thanks for the great question!
The two solutions posted so far are correct, but inefficient when the numbers get large. They both use the same algorithm: first enumerate all the possibilities:
{1, 1, 1 }
{1, 1, 2 },
{1, 1, 3 },
...
{7, 7, 7}
And while doing so, filter out any where the second is not larger than the first, and the third is not larger than the second. This performs 7 x 7 x 7 filtering operations, which is not that many, but if you were trying to get, say, permutations of ten elements from thirty, that's 30 x 30 x 30 x 30 x 30 x 30 x 30 x 30 x 30 x 30, which is rather a lot. You can do better than that.
I would solve this problem as follows. First, produce a data structure which is an efficient immutable set. Let me be very clear what an immutable set is, because you are likely not familiar with them. You normally think of a set as something you add items and remove items from. An immutable set has an Add operation but it does not change the set; it gives you back a new set which has the added item. The same for removal.
Here is an implementation of an immutable set where the elements are integers from 0 to 31:
using System.Collections;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System;

// A super-cheap immutable set of integers from 0 to 31;
// just a convenient wrapper around bit operations on an int.
internal struct BitSet : IEnumerable<int>
{
    public static BitSet Empty { get { return default(BitSet); } }
    private readonly int bits;
    private BitSet(int bits) { this.bits = bits; }

    public bool Contains(int item)
    {
        Debug.Assert(0 <= item && item <= 31);
        return (bits & (1 << item)) != 0;
    }

    public BitSet Add(int item)
    {
        Debug.Assert(0 <= item && item <= 31);
        return new BitSet(this.bits | (1 << item));
    }

    public BitSet Remove(int item)
    {
        Debug.Assert(0 <= item && item <= 31);
        return new BitSet(this.bits & ~(1 << item));
    }

    IEnumerator IEnumerable.GetEnumerator() { return this.GetEnumerator(); }

    public IEnumerator<int> GetEnumerator()
    {
        for (int item = 0; item < 32; ++item)
            if (this.Contains(item))
                yield return item;
    }

    public override string ToString()
    {
        return string.Join(",", this);
    }
}
Read this code carefully to understand how it works. Again, always remember that adding an element to this set does not change the set. It produces a new set that has the added item.
OK, now that we've got that, let's consider a more efficient algorithm for producing your permutations.
We will solve the problem recursively. A recursive solution always has the same structure:
Can we solve a trivial problem? If so, solve it.
If not, break the problem down into a number of smaller problems and solve each one.
Let's start with the trivial problems.
Suppose you have a set and you wish to choose zero items from it. The answer is clear: there is only one possible permutation with zero elements, and that is the empty set.
Suppose you have a set with n elements in it and you want to choose more than n elements. Clearly there is no solution, not even the empty set.
We have now taken care of the cases where the set is empty or the number of elements chosen is more than the number of elements total, so we must be choosing at least one thing from a set that has at least one thing.
Of the possible permutations, some of them have the first element in them and some of them do not. Find all the ones that have the first element in them and yield them. We do this by recursing to choose one fewer elements on the set that is missing the first element.
The ones that do not have the first element in them we find by enumerating the permutations of the set without the first element.
static class Extensions
{
    public static IEnumerable<BitSet> Choose(this BitSet b, int choose)
    {
        if (choose < 0) throw new InvalidOperationException();
        if (choose == 0)
        {
            // Choosing zero elements from any set gives the empty set.
            yield return BitSet.Empty;
        }
        else if (b.Count() >= choose)
        {
            // We are choosing at least one element from a set that has
            // a first element. Get the first element, and the set
            // lacking the first element.
            int first = b.First();
            BitSet rest = b.Remove(first);
            // These are the permutations that contain the first element:
            foreach (BitSet r in rest.Choose(choose - 1))
                yield return r.Add(first);
            // These are the permutations that do not contain the first element:
            foreach (BitSet r in rest.Choose(choose))
                yield return r;
        }
    }
}
Now we can ask the question that you need the answer to:
class Program
{
    static void Main()
    {
        BitSet b = BitSet.Empty.Add(1).Add(2).Add(3).Add(4).Add(5).Add(6).Add(7);
        foreach (BitSet result in b.Choose(3))
            Console.WriteLine(result);
    }
}
And we're done. We have generated only as many sequences as we actually need. (Though we have done a lot of set operations to get there, but set operations are cheap.) The point here is that understanding how this algorithm works is extremely instructive. Recursive programming on immutable structures is a powerful tool that many professional programmers do not have in their toolbox.
You can do it like this:
var data = Enumerable.Range(1, 7);
var r = from a in data
        from b in data
        from c in data
        where a < b && b < c
        select new { a, b, c };

foreach (var x in r) {
    Console.WriteLine("{0} {1} {2}", x.a, x.b, x.c);
}
Demo.
Edit: Thanks Eric Lippert for simplifying the answer!
var ints = new int[] { 1, 2, 3, 4, 5, 6, 7 };
var permutations = ints.SelectMany(a => ints.Where(b => (b > a)).
    SelectMany(b => ints.Where(c => (c > b)).
        Select(c => new { a = a, b = b, c = c })));
