Massive amount number comparison using c# - c#

Comparing sets of numbers is too slow. What is a more efficient way to solve this problem?
I have two groups of sets. Each group has about 5 million sets, each set has 6 numbers, and each number is between 1 and 100. The sets and groups are not sorted and may contain duplicates.
The following is an example.
No. Group A Group B
1 {1,2,3,4,5,6} {6,2,4,87,53,12}
2 {2,3,4,5,6,8} {43,6,78,23,96,24}
3 {45,23,57,79,23,76} {12,1,90,3,2,23}
4 {3,5,85,24,78,90} {12,65,78,9,23,13}
... ...
My goal is to compare the two groups and classify Group A by maximum common element count within 5 hours on my laptop.
In the example, No. 1 of Group A and No. 3 of Group B have 3 common elements (1, 2, 3).
Also, No. 2 of Group A and No. 3 of Group B have 2 common elements (2, 3). Therefore I will classify Group A as follows.
No. Group A Maximum Common Element Count
1 {1,2,3,4,5,6} 3
2 {2,3,4,5,6,8} 3
3 {45,23,57,79,23,76} 1
4 {3,5,85,24,78,90} 2
...
My approach is to compare every set and every number, so the complexity is Group A count * Group B count * 6 * 6. Therefore it takes far too much time.
Dictionary<int, List<int>> Classified = new Dictionary<int, List<int>>();
foreach (List<int> setA in GroupA)
{
int maxcount = 0;
foreach (List<int> setB in GroupB)
{
int count = 0;
foreach(int elementA in setA)
{
foreach(int elementB in setB)
{
if (elementA == elementB) count++;
}
}
if (count > maxcount) maxcount = count;
}
Classified.Add(maxcount, setA);
}

Here is my attempt - using a HashSet<int> and precalculating the range of each set to avoid set-to-set comparisons like {1,2,3,4,5,6} and {7,8,9,10,11,12} (as pointed out by Matt's answer).
For me (running with random sets) it resulted in a 130x speed improvement on the original code. You mentioned in a comment that
Now execution time is over 3 days, so as others said I need parallelization.
and in the question itself that
My goal is to compare the two groups and classify Group A by maximum common element count within 5 hours on my laptop.
so assuming that the comment means that the execution time for your data exceeded 3 days (72 hours), but you want it to complete in 5 hours, you'd only need something like a 14x speed increase.
Framework
I've created some classes to run these benchmarks:
Range - takes some int values, and keeps track of the minimum and maximum values.
public class Range
{
private readonly int _min;
private readonly int _max;
public Range(IReadOnlyCollection<int> values)
{
_min = values.Min();
_max = values.Max();
}
public int Min { get { return _min; } }
public int Max { get { return _max; } }
public bool Intersects(Range other)
{
// The ranges are disjoint only if one starts after the other ends
if ( _min > other._max )
return false;
if ( _max < other._min )
return false;
return true;
}
}
SetWithRange - wraps a HashSet<int> and a Range of the values.
public class SetWithRange : IEnumerable<int>
{
private readonly HashSet<int> _values;
private readonly Range _range;
public SetWithRange(IReadOnlyCollection<int> values)
{
_values = new HashSet<int>(values);
_range = new Range(values);
}
public static SetWithRange Random(Random random, int size, Range range)
{
var values = new HashSet<int>();
// Random.Next(int, int) generates numbers in the range [min, max)
// so we need to add one here to be able to generate numbers in [min, max].
// See https://learn.microsoft.com/en-us/dotnet/api/system.random.next
var min = range.Min;
var max = range.Max + 1;
while ( values.Count() < size )
values.Add(random.Next(min, max));
return new SetWithRange(values);
}
public int CommonValuesWith(SetWithRange other)
{
// No need to call Intersect on the sets if the ranges don't intersect
if ( !_range.Intersects(other._range) )
return 0;
return _values.Intersect(other._values).Count();
}
public IEnumerator<int> GetEnumerator()
{
return _values.GetEnumerator();
}
IEnumerator IEnumerable.GetEnumerator()
{
return GetEnumerator();
}
}
The results were generated using SetWithRange.Random as follows:
const int groupCount = 10000;
const int setSize = 6;
var range = new Range(new[] { 1, 100 });
var generator = new Random();
var groupA = Enumerable.Range(0, groupCount)
.Select(i => SetWithRange.Random(generator, setSize, range))
.ToList();
var groupB = Enumerable.Range(0, groupCount)
.Select(i => SetWithRange.Random(generator, setSize, range))
.ToList();
The timings given below are for an average of three x64 release build runs on my machine.
For all cases I generated groups with 10000 random sets then scaled up to approximate the execution time for 5 million sets by using
timeFor5Million = timeFor10000 / 10000 / 10000 * 5000000 * 5000000
= timeFor10000 * 250000
Results
Four foreach blocks:
Average time = 48628ms; estimated time for 5 million sets = 3377 hours
var result = new Dictionary<SetWithRange, int>();
foreach ( var setA in groupA )
{
int maxcount = 0;
foreach ( var setB in groupB )
{
int count = 0;
foreach ( var elementA in setA )
{
foreach ( int elementB in setB )
{
if ( elementA == elementB )
count++;
}
}
if ( count > maxcount ) maxcount = count;
}
result.Add(setA, maxcount);
}
Three foreach blocks with parallelisation on the outer foreach:
Average time = 10305ms; estimated time for 5 million sets = 716 hours (4.7 times faster than original):
var result = new Dictionary<SetWithRange, int>();
Parallel.ForEach(groupA, setA =>
{
int maxcount = 0;
foreach ( var setB in groupB )
{
int count = 0;
foreach ( var elementA in setA )
{
foreach ( int elementB in setB )
{
if ( elementA == elementB )
count++;
}
}
if ( count > maxcount ) maxcount = count;
}
lock ( result )
result.Add(setA, maxcount);
});
Using HashSet<int> and adding a Range to only check sets which intersect:
Average time = 375ms; estimated time for 5 million sets = 26 hours (130 times faster than original):
var result = new Dictionary<SetWithRange, int>();
Parallel.ForEach(groupA, setA =>
{
var commonValues = groupB.Max(setB => setA.CommonValuesWith(setB));
lock ( result )
result.Add(setA, commonValues);
});
Link to a working online demo here: https://dotnetfiddle.net/Kxpagh (note that .NET Fiddle limits execution times to 10 seconds, and that for obvious reasons its results are slower than running in a normal environment).

Fastest I can think of is this:
As all your numbers come from a limited range (1-100), you can express each of your sets as a 100-digit binary number <d1,d2,...,d100> where dn equals 1 iff n is in the set.
Then comparing two sets means a binary AND on the two binary representations and counting the set bits (which can be done efficiently)
In addition to that, this task can be parallelized (your input is immutable, so it's quite straightforward).
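A minimal sketch of the bitmask idea above, assuming .NET Core 3.0 or later for `System.Numerics.BitOperations.PopCount`. The type and method names (`BitSet100`, `BitSetComparer`, `MaxCommonCounts`) are hypothetical and not part of the answer. The 100 possible values are packed into two 64-bit words. Note that a duplicate value inside one set (such as the two 23s in the question's example) is counted once here, unlike the question's nested loops.

```csharp
using System;
using System.Collections.Generic;
using System.Numerics;

// Hypothetical helper: a 128-bit mask where bit (v - 1) is set iff value v is in the set.
public readonly struct BitSet100
{
    private readonly ulong _low;   // values 1..64
    private readonly ulong _high;  // values 65..100

    public BitSet100(IEnumerable<int> values)
    {
        ulong low = 0, high = 0;
        foreach (int v in values)
        {
            if (v < 1 || v > 100)
                throw new ArgumentOutOfRangeException(nameof(values), "Values must be in 1..100.");
            int bit = v - 1;
            if (bit < 64) low |= 1UL << bit;
            else high |= 1UL << (bit - 64);
        }
        _low = low;
        _high = high;
    }

    // Binary AND of the two masks, then count the set bits.
    public int CommonCount(BitSet100 other)
    {
        return BitOperations.PopCount(_low & other._low)
             + BitOperations.PopCount(_high & other._high);
    }
}

public static class BitSetComparer
{
    // For each set in groupA, the largest number of values shared with any set in groupB.
    public static int[] MaxCommonCounts(IReadOnlyList<BitSet100> groupA, IReadOnlyList<BitSet100> groupB)
    {
        var result = new int[groupA.Count];
        for (int i = 0; i < groupA.Count; i++)
        {
            int max = 0;
            for (int j = 0; j < groupB.Count; j++)
            {
                int common = groupA[i].CommonCount(groupB[j]);
                if (common > max) max = common;
            }
            result[i] = max;
        }
        return result;
    }
}
```

The outer loop writes to a distinct array slot per set of Group A, so it could probably be handed to `Parallel.For` without locking, but that would need to be benchmarked on real data.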

You would have to benchmark it with smaller sets but since you're going to have to do 5E6 * 5E6 = 25E12 comparisons, you might as well sort the contents of 5E6 + 5E6 = 10E6 sets first.
Then the set-to-set comparisons become much faster, since you can stop each comparison as soon as you reach the highest number on the first side of the comparison. The savings per set comparison are minuscule but, trillions of times over, they add up.
You could also go further and index the two groups of five million sets by lowest entry and highest entry. That would cut down the number of comparisons significantly. In the end, that's only 100 * 100 = 10,000 = 1E4 distinct collections. You would never have to compare sets that have, for instance, 12 as the highest number with any sets that start with 13 or more, effectively avoiding a ton of work.
In my mind, this is sorting a lot of data, but it pales in comparison to the number of raw set-to-set comparisons you would otherwise have to do. Here, you eliminate the work for all the 0s and are able to abort early, if the conditions are right, when you do a compare.
And as others have said, parallelization...
PS: 5E6 = 5 * 10^6 = 5,000,000 and 25E12 = 25 * 10^12 = 25 * 1,000,000,000,000
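The presorting idea can be sketched as follows. This is only an illustration under the assumption that each set is stored as an `int[]`; the names `SortedSetComparer`, `Normalize` and `CountCommon` are hypothetical. Each set is sorted and de-duplicated once, and a comparison then walks both arrays in a single pass, returning immediately when the value ranges do not overlap.

```csharp
using System;

public static class SortedSetComparer
{
    // Returns a sorted copy of the input with duplicate values removed.
    public static int[] Normalize(int[] values)
    {
        var copy = (int[])values.Clone();
        Array.Sort(copy);
        int n = 0;
        for (int k = 0; k < copy.Length; k++)
        {
            if (k == 0 || copy[k] != copy[k - 1])
                copy[n++] = copy[k];
        }
        Array.Resize(ref copy, n);
        return copy;
    }

    // Counts common values of two arrays produced by Normalize.
    public static int CountCommon(int[] a, int[] b)
    {
        if (a.Length == 0 || b.Length == 0)
            return 0;
        // Lowest/highest entry check: disjoint ranges cannot share a value.
        if (a[a.Length - 1] < b[0] || b[b.Length - 1] < a[0])
            return 0;

        int i = 0, j = 0, count = 0;
        // Merge-style walk; stops as soon as either side is exhausted.
        while (i < a.Length && j < b.Length)
        {
            if (a[i] == b[j]) { count++; i++; j++; }
            else if (a[i] < b[j]) i++;
            else j++;
        }
        return count;
    }
}
```

A comparison then costs at most 12 steps instead of 36, and sets whose ranges do not overlap cost two comparisons. Whether this beats the other approaches in practice would have to be measured.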

The time complexity of any algorithm you come up with is going to be of the same order. HashSets might be a bit faster, but if they are it won't be by much - the overhead of 36 direct list comparisons vs 12 hashset lookups isn't going to be significantly higher, if at all, but you'll have to benchmark. Presorting might help a bit considering each set will be compared millions of times. Just FYI, for loops are faster than foreach loops on a List and arrays are faster than Lists (for and foreach on array is same performance), which for something like this might make a decent performance difference. If the No. column is sequential then I would use an array for that instead of a dictionary as well. Array lookups are an order of magnitude faster than dictionary lookups.
I think you are generally doing this as quickly as possible aside from parallelization though, with some small gains possible through the above micro-optimizations.
How far off from your target execution time is the current algorithm?
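To make the micro-optimizations above concrete, here is a sketch of the question's algorithm using jagged arrays, plain `for` loops, and an `int[]` indexed by the position of the set in Group A instead of a dictionary. The class and method names are hypothetical, and any actual speed-up is an assumption that needs benchmarking. Like the question's loops, it counts a duplicated value once per occurrence.

```csharp
public static class ArrayLoopComparer
{
    // result[a] is the maximum common element count for groupA[a] against all of groupB.
    public static int[] MaxCommonCounts(int[][] groupA, int[][] groupB)
    {
        var result = new int[groupA.Length];
        for (int a = 0; a < groupA.Length; a++)
        {
            int[] setA = groupA[a];
            int max = 0;
            for (int b = 0; b < groupB.Length; b++)
            {
                int[] setB = groupB[b];
                int count = 0;
                for (int i = 0; i < setA.Length; i++)
                {
                    for (int j = 0; j < setB.Length; j++)
                    {
                        if (setA[i] == setB[j]) count++;
                    }
                }
                if (count > max) max = count;
            }
            result[a] = max;
        }
        return result;
    }
}
```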

I would use the following:
foreach (List<int> setA in GroupA)
{
int maxcount = GroupB.Max(x => x.Sum(y => setA.Contains(y) ? 1 : 0));
Classified.Add(maxcount, setA);
}


Split a list into n equal parts

Given a sorted list, and a variable n, I want to break up the list into n parts. With n = 3, I expect three lists, with the last one taking on the overflow.
I expect: 0,1,2,3,4,5, 6,7,8,9,10,11, 12,13,14,15,16,17
If the number of items in the list is not divisible by n, then just put the overflow (mod n) in the last list.
This doesn't work:
static class Program
{
static void Main(string[] args)
{
var input = new List<double>();
for (int k = 0; k < 18; ++k)
{
input.Add(k);
}
var result = input.Split(3);
foreach (var resul in result)
{
foreach (var res in resul)
{
Console.WriteLine(res);
}
}
}
}
static class LinqExtensions
{
public static IEnumerable<IEnumerable<T>> Split<T>(this IEnumerable<T> list, int parts)
{
int i = 0;
var splits = from item in list
group item by i++ % parts into part
select part.AsEnumerable();
return splits;
}
}
I think you would benefit from Linq's .Chunk() method.
If you first calculate how many parts will contain the equal item count, you can chunk list and yield return each chunk, before yield returning the remaining part of list (if list is not divisible by n).
As pointed out by Enigmativity, list should be materialized as an ICollection<T> to avoid possible multiple enumeration. The materialization can be obtained by trying to cast list to an ICollection<T>, and falling back to calling list.ToList() if that is unsuccessful.
A possible implementation of your extension method is hence:
public static IEnumerable<IEnumerable<T>> Split<T>(this IEnumerable<T> list, int parts)
{
var collection = list is ICollection<T> c
? c
: list.ToList();
var itemCount = collection.Count;
// return all items if source list is too short to split up
if (itemCount < parts)
{
yield return collection;
yield break;
}
var itemsInEachChunk = itemCount / parts;
var chunks = itemCount % parts == 0
? parts
: parts - 1;
var itemsToChunk = chunks * itemsInEachChunk;
foreach (var chunk in collection.Take(itemsToChunk).Chunk(itemsInEachChunk))
{
yield return chunk;
}
if (itemsToChunk < itemCount)
{
yield return collection.Skip(itemsToChunk);
}
}
Example fiddle here.
I see two issues with your code. First, the way you're outputting the results, it's impossible to tell the groupings of the values since you're just outputting each one on its own line.
This could be resolved by using Console.Write for each value in a group, and then adding a Console.WriteLine() when the group is done. This way the values from each group are displayed on a separate line. We also might want to pad the values so they line up nicely by getting the length of the largest value and passing that to the PadRight method:
static void Main(string[] args)
{
var numItems = 18;
var splitBy = 3;
var input = Enumerable.Range(0, numItems).ToList();
var results = input.Split(splitBy);
// Get the length of the largest value to use for padding smaller values,
// so all the columns will line up when we display the results
var padValue = input.Max().ToString().Length + 1;
foreach (var group in results)
{
foreach (var item in group)
{
Console.Write($"{item}".PadRight(padValue));
}
Console.WriteLine();
}
Console.Write("\n\nDone. Press any key to exit...");
Console.ReadKey();
}
Now your results look pretty good, except we can see that the numbers are not grouped as we expect:
0 3 6 9 12 15
1 4 7 10 13 16
2 5 8 11 14 17
The reason for this is that we're grouping by the remainder of each item divided by the number of parts. So, the first group contains all numbers whose remainder after being divided by 3 is 0, the second is all items whose remainder is 1, etc.
To resolve this, we should instead divide the index of the item by the number of items in a row (the number of columns).
In other words, 18 items divided by 3 rows will result in 6 items per row. With integer division, all the indexes from 0 to 5 will give a quotient of 0 when divided by 6, all the indexes from 6 to 11 will give a quotient of 1, and all the indexes from 12 to 17 will give a quotient of 2.
However, we also have to be able to handle the overflow numbers. One way to do this is to check if the index is greater than or equal to rows * columns (i.e. it would end up on a new row instead of on the last row). If this is true, then we set it to the last row.
I'm not great at linq so there may be a better way to write this, but we can modify our extension method like so:
public static IEnumerable<IEnumerable<T>> Split<T>(
this IEnumerable<T> list, int parts)
{
int numItems = list.Count();
int columns = numItems / parts;
int overflow = numItems % parts;
int index = 0;
return from item in list
group item by
index++ >= (parts * columns) ? parts - 1 : (index - 1) / columns
into part
select part.AsEnumerable();
}
And now our results look better:
// For 18 items split into 3
0 1 2 3 4 5
6 7 8 9 10 11
12 13 14 15 16 17
// For 25 items split into 7
0 1 2
3 4 5
6 7 8
9 10 11
12 13 14
15 16 17
18 19 20 21 22 23 24
This should work.
So each list should (ideally) have x/n elements,
where x=> No. of elements in the list &
n=> No. of lists it has to be split into
If x isn't divisible by n, then each list should have x/n (rounded down to the nearest integer). Let that no. be 'y'. While the last list should have x - y*(n - 1). Let that no. be 'z'.
What the first for-loop does is it repeats the process of creating a list with the appropriate no. of elements n times.
The if-else block is there to see if the list getting created is the last one or not. If it's the last one, it has z items. If not, it has y items.
The nested for-loops add the list items into "sub-lists" (List<string>) which will then be added to the main list (List<List<string>>) that is to be returned.
This solution is (noticeably) different from your signature and the other solutions offered. I used this approach because the code is (arguably) easier to understand albeit longer. When I used to look for solutions, I used to apply solutions where I could understand exactly what was going on. I wasn't able to fully understand the other solutions to this question (yet to get a proper hang of programming) so I presented the one I wrote below in case you were to end up in the same predicament.
Let me know if I should make any changes.
static class Program
{
static void Main(string[] args)
{
var input = new List<String>();
for (int k = 0; k < 18; ++k)
{
input.Add(k.ToString());
}
var result = SplitList(input, 5);//I've used 5 but it can be any number
foreach (var resul in result)
{
foreach (var res in resul) // iterate the items of the current sub-list, not the outer list
{
Console.WriteLine(res);
}
}
}
public static List<List<string>> SplitList (List<string> origList, int n)
{//"n" is the number of parts you want to split your list into
int splitLength = origList.Count / n; //splitLength is no. of items in each list bar the last one. (In case of overflow)
List<List<string>> listCollection = new List<List<string>>();
for ( int i = 0; i < n; i++ )
{
List<string> tempStrList = new List<string>();
if ( i < n - 1 )
{
for ( int j = i * splitLength; j < (i + 1) * splitLength; j++ )
{
tempStrList.Add(origList[j]);
}
}
else
{
for ( int j = i * splitLength; j < origList.Count; j++ )
{
tempStrList.Add(origList[j]);
}
}
listCollection.Add(tempStrList);
}
return listCollection;
}
}

Shortest list from a two dimensional array

This question is more about an algorithm than actual code, but example code would be appreciated.
Let's say I have a two-dimensional array such as this:
A B C D E
--------------
1 | 0 2 3 4 5
2 | 1 2 4 5 6
3 | 1 3 4 5 6
4 | 2 3 4 5 6
5 | 1 2 3 4 5
I am trying to find the shortest list that would include a value from each row. Currently, I am going row by row and column by column, adding each value to a SortedSet and then checking the length of the set against the shortest set found so far. For example:
Adding cells {1A, 2A, 3A, 4A, 5A} would add the values {0, 1, 1, 2, 1} which would result in a sorted set {0, 1, 2}. {1B, 2A, 3A, 4A, 5A} would add the values {2, 1, 1, 2, 1} which would result in a sorted set {1, 2}, which is shorter than the previous set.
Obviously, adding {1D, 2C, 3C, 4C, 5D} or {1E, 2D, 3D, 4D, 5E} would be the shortest sets, having only one item each, and I could use either one.
I don't have to include every number in the array. I just need to find the shortest set while including at least one number from every row.
Keep in mind that this is just an example array, and the arrays that I'm using are much, much larger. The smallest is 495x28. Brute force will take a VERY long time (28^495 passes). Is there a shortcut that someone knows, to find this in the least number of passes? I have C# code, but it's kind of long.
Edit:
Posting current code, as per request:
// Set an array of counters, Add enough to create largest initial array
int ListsCount = MatrixResults.Count();
int[] Counters = new int[ListsCount];
SortedSet<long> CurrentSet = new SortedSet<long>();
for (long X = 0; X < ListsCount; X++)
{
Counters[X] = 0;
CurrentSet.Add(X);
}
while (true)
{
// Compile sequence list from MatrixResults[]
SortedSet<long> ThisSet = new SortedSet<long>();
for (int X = 0; X < Count4; X ++)
{
ThisSet.Add(MatrixResults[X][Counters[X]]);
}
// if Sequence Length less than current low, set ThisSet as Current
if (ThisSet.Count() < CurrentSet.Count())
{
CurrentSet.Clear();
long[] TSI = ThisSet.ToArray();
for (int Y = 0; Y < ThisSet.Count(); Y ++)
{
CurrentSet.Add(TSI[Y]);
}
}
// Increment Counters
int Index = 0;
bool EndReached = false;
while (true)
{
Counters[Index]++;
if (Counters[Index] < MatrixResults[Index].Count()) break;
Counters[Index] = 0;
Index++;
if (Index >= ListsCount)
{
EndReached = true;
break;
}
Counters[Index]++;
}
// If all counters are fully incremented, then break
if (EndReached) break;
}
With all computations there is always a tradeoff, and several factors are in play, such as whether you will get paid for getting it perfect (in this case, for me, no). This is a case of the best being the enemy of the good: how long can we spend on solving a problem, and will it be sufficient to get close enough to fulfil the use case? In my opinion, when we can solve the problem without hand-painting pixels in UHD resolution to get the key idea across, let's!
So my choice is an approach that yields a covering set which is small and, ahem, sometimes the smallest :) To be spot on, one would have to iterate over different strategies and compare the lengths of the resulting sets. For this evening of fun I chose to present a single strategy, which I find defensible as being close to or equal to the minimal set.
The strategy is to treat the multidimensional array as a sequence of lists, each with its own set of distinct values. We then iteratively reduce the remaining lists using the smallest list among them, weeding out any values of that smallest list that were not used in the reduction. Each iteration shrinks the total set, and we end up with a path that is close enough to the ideal to be useful, and it completes in milliseconds.
A critique of this approach up front is that the direction in which you pass through your minimal list would really have to be varied iteratively to pick the best: left to right, right to left, in position sequences X, Y, Z, and so on, because the amount of potential reduction is not equal. So to get close to the ideal, sequences would have to be tried for each iteration too, until all combinations were covered, choosing the most reducing sequence. Right, but I chose left to right only!
I chose not to compare execution against your code, because your MatrixResults is an array of int arrays rather than a multidimensional array, which is what your drawing shows. I went by your drawing, and so couldn't share a data source with your code. No matter, you can make that conversion if you wish. Onwards to generating sample data:
private int[,] CreateSampleArray(int xDimension, int yDimensions, Random rnd)
{
Debug.WriteLine($"Created sample array of dimensions ({xDimension}, {yDimensions})");
var array = new int[xDimension, yDimensions];
for (int x = 0; x < array.GetLength(0); x++)
{
for(int y = 0; y < array.GetLength(1); y++)
{
array[x, y] = rnd.Next(0, 4000);
}
}
return array;
}
The overall structure with some logging, I'm using xUnit to run the code in
[Fact]
public void SetCoverExperimentTest()
{
var rnd = new Random((int)DateTime.Now.Ticks);
var sw = Stopwatch.StartNew();
int[,] matrixResults = CreateSampleArray(rnd.Next(100, 500), rnd.Next(100, 500), rnd);
//So first requirement is that you must have one element per row, so lets get our unique rows
var listOfAll = new List<List<int>>();
List<int> listOfRow;
for (int y = 0; y < matrixResults.GetLength(1); y++)
{
listOfRow = new List<int>();
for (int x = 0; x < matrixResults.GetLength(0); x++)
{
listOfRow.Add(matrixResults[x, y]);
}
listOfAll.Add(listOfRow.Distinct().ToList());
}
var setFound = new HashSet<int>();
List<List<int>> allUniquelyRequired = GetDistinctSmallestList(listOfAll, setFound);
// This set now has all rows that are either distinctly different
// Or have a reordering of distinct values of that length value lists
// our HashSet has the unique value range
//Meaning any combination of sets with those values,
//grabbing any one for each set, prefering already chosen ones should give a covering total set
var leastSet = new LeastSetData
{
LeastSet = setFound,
MatrixResults = matrixResults,
};
List<Coordinate>? minSet = leastSet.GenerateResultsSet();
sw.Stop();
Debug.WriteLine($"Completed in {sw.Elapsed.TotalMilliseconds:0.00} ms");
Assert.NotNull(minSet);
//There is one for each row
Assert.False(minSet.Select(s => s.y).Distinct().Count() < minSet.Count());
//We took less than 25 milliseconds
var timespan = new TimeSpan(0, 0, 0, 0, 25);
Assert.True(sw.Elapsed < timespan);
//Outputting to debugger for the fun of it
var sb = new StringBuilder();
foreach (var coordinate in minSet)
{
sb.Append($"({coordinate.x}, {coordinate.y}) {matrixResults[coordinate.x, coordinate.y]},");
}
var debugLine = sb.ToString();
debugLine = debugLine.Substring(0, debugLine.Length - 1);
Debug.WriteLine("Resulting set: " + debugLine);
}
Now the more meaty iterative bits
private List<List<int>> GetDistinctSmallestList(List<List<int>> listOfAll, HashSet<int> setFound)
{
// Our smallest set must be a subset the distinct sum of all our smallest lists for value range,
// plus unknown
var listOfShortest = new List<List<int>>();
int shortest = int.MaxValue;
foreach (var list in listOfAll)
{
if (list.Count < shortest)
{
listOfShortest.Clear();
shortest = list.Count;
listOfShortest.Add(list);
}
else if (list.Count == shortest)
{
if (listOfShortest.Contains(list))
continue;
listOfShortest.Add(list);
}
}
var setFoundAddition = new HashSet<int>(setFound);
foreach (var list in listOfShortest)
{
foreach (var item in list)
{
if (setFound.Contains(item))
continue;
if (setFoundAddition.Contains(item))
continue;
setFoundAddition.Add(item);
}
}
//Now we can remove all rows with those found, we'll add the smallest later
var listOfAllRemainder = new List<List<int>>();
bool foundInList;
List<int> consumedWhenReducing = new List<int>();
foreach (var list in listOfAll)
{
foundInList = false;
foreach (int item in list)
{
if (setFound.Contains(item))
{
//Covered by data from last iteration(s)
foundInList = true;
break;
}
else if (setFoundAddition.Contains(item))
{
consumedWhenReducing.Add(item);
foundInList = true;
break;
}
}
if (!foundInList)
{
listOfAllRemainder.Add(list); //adding what lists did not have elements found
}
}
//Remove any from these smallestset lists that did not get consumed in the favour used pass before
if (consumedWhenReducing.Count == 0)
{
throw new Exception($"Shouldn't be possible to remove the row itself without using one of its values, please investigate");
}
var removeArray = setFoundAddition.Where(a => !consumedWhenReducing.Contains(a)).ToArray();
setFoundAddition.RemoveWhere(x => removeArray.Contains(x));
foreach (var value in setFoundAddition)
{
setFound.Add(value);
}
if (listOfAllRemainder.Count != 0)
{
//Do the whole thing again until there in no list left
listOfShortest.AddRange(GetDistinctSmallestList(listOfAllRemainder, setFound));
}
return listOfShortest; //Here we will ultimately have the sum of shortest lists per iteration
}
To conclude: I hope to have inspired You, at least I had fun coming up with a best approximate, and should you feel like completing the code, You're very welcome to grab what You like.
Obviously we should really track the order in which we go through the shortest lists; after all, it matters whether we start by reducing the total distinct lists by the element at position 0 or at position 0+N, and which one we reduce with afterwards. We must have one of those values, but each time we consume a value it removes much of the total list, so what we really produce is a value range, and the order in which that range is consumed matters to the later iterations: a position we never reached because no lists were left could potentially have removed more than some of those that were used. You get the picture, I'm sure.
And this is just one strategy. One may as well have chosen the largest distinct list, even within the same framework, and if you do not iteratively cover enough strategies, there is only brute force left.
Anyway, you'd want an AI to act just like a human, not to contemplate the existence of the universe first; after all, we can reconsider pretty often with silicon brains, as long as we can do so fast.
With any moving object at least, I'd much rather be 90% on target, correcting every second while taking 14 ms to get there, than spend 2 seconds reaching 99% or the elusive 100%. That means we should stop the vehicle before the concrete pillar or the pram, or conversely buy the equity when it is a good time to do so, rather than figuring out that we should have stopped when we are already on the other side of the obstacle, or that we should have bought 5 seconds ago, when the spot price has already jumped again...
Thus the defense rests on the notion that it is opinionated if this solution is good enough or simply incomplete at best :D
I realize it's pretty random, but just to say that although this sketch is not entirely indisputably correct, it is easy to read and maintain and anyways the question is wrong B-] We will very rarely need the absolute minimal set and when we do the answer will be much longer :D
... woopsie, forgot the support classes
public struct Coordinate
{
public int x;
public int y;
public override string ToString()
{
return $"({x},{y})";
}
}
public struct CoordinateValue
{
public int Value { get; set; }
public Coordinate Coordinate { get; set; }
public override string ToString()
{
return string.Concat(Coordinate.ToString(), " ", Value.ToString());
}
}
public class LeastSetData
{
public HashSet<int> LeastSet { get; set; }
public int[,] MatrixResults { get; set; }
public List<Coordinate> GenerateResultsSet()
{
HashSet<int> chosenValueRange = new HashSet<int>();
var chosenSet = new List<Coordinate>();
for (int y = 0; y < MatrixResults.GetLength(1); y++)
{
var candidates = new List<CoordinateValue>();
for (int x = 0; x < MatrixResults.GetLength(0); x++)
{
if (LeastSet.Contains(MatrixResults[x, y]))
{
candidates.Add(new CoordinateValue
{
Value = MatrixResults[x, y],
Coordinate = new Coordinate { x = x, y = y }
}
);
continue;
}
}
if (candidates.Count == 0)
throw new Exception($"OMG Something's wrong! (this row did not have any of derived range [y: {y}])");
var done = false;
foreach (var c in candidates)
{
if (chosenValueRange.Contains(c.Value))
{
chosenSet.Add(c.Coordinate);
done = true;
break;
}
}
if (!done)
{
var firstCandidate = candidates.First();
chosenSet.Add(firstCandidate.Coordinate);
chosenValueRange.Add(firstCandidate.Value);
}
}
return chosenSet;
}
}
This problem is NP hard.
To show that, we have to take a known NP hard problem, and reduce it to this one. Let's do that with the Set Cover Problem.
We start with a universe U of things, and a collection S of sets that covers the universe. Assign each thing a row, and each set a number. This will fill different numbers of columns for each row. Fill in a rectangle by adding new numbers.
Now solve your problem.
For each new number in your solution that didn't come from a set in the original problem, we can replace it with another number in the same row that did come from a set.
And now we turn numbers back into sets and we have a solution to the Set Cover Problem.
The transformations from set cover to your problem and back again are both O(number_of_elements * number_of_sets) which is polynomial in the input. And therefore your problem is NP hard.
Conversely if you replace each number in the matrix with the set of rows covered, your problem turns into the Set Cover Problem. Using any existing solver for set cover then gives a reasonable approach for your problem as well.
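As an illustration of that last point, here is a sketch of the classic greedy heuristic for set cover applied to the matrix: map each value to the rows it appears in, then repeatedly pick the value that covers the most still-uncovered rows. This is an approximation and is not guaranteed to return the smallest possible set. The names `GreedyRowCover` and `Solve`, and the representation of rows as `int[]`, are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class GreedyRowCover
{
    // rows[r] holds the values of row r. Returns a list of values such that
    // every non-empty row contains at least one of them.
    public static List<int> Solve(IReadOnlyList<int[]> rows)
    {
        // Replace each value by the set of row indexes it covers.
        var rowsByValue = new Dictionary<int, HashSet<int>>();
        for (int r = 0; r < rows.Count; r++)
        {
            foreach (int v in rows[r])
            {
                if (!rowsByValue.TryGetValue(v, out var covered))
                {
                    covered = new HashSet<int>();
                    rowsByValue[v] = covered;
                }
                covered.Add(r);
            }
        }

        // Empty rows can never be covered, so they are left out.
        var uncovered = new HashSet<int>(
            Enumerable.Range(0, rows.Count).Where(r => rows[r].Length > 0));
        var chosen = new List<int>();

        while (uncovered.Count > 0)
        {
            int bestValue = 0, bestGain = 0;
            foreach (var pair in rowsByValue)
            {
                int gain = pair.Value.Count(r => uncovered.Contains(r));
                if (gain > bestGain) { bestGain = gain; bestValue = pair.Key; }
            }
            // Some value always covers at least one uncovered row, so bestGain > 0 here.
            chosen.Add(bestValue);
            uncovered.ExceptWith(rowsByValue[bestValue]);
        }
        return chosen;
    }
}
```

For the 5x5 example in the question this picks a single value (4 or 5, depending on dictionary enumeration order), since either one appears in every row.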
The code is not particularly tidy or optimised, but illustrates the approach I think @btilly is suggesting in his answer (E&OE) using a bit of recursion (I was going for intuitive rather than ideal for scaling, so you may have to work out an iterative equivalent).
From the rows with their values make a "values with the rows that they appear in" counterpart. Now pick a value, eliminate all rows in which it appears and solve again for the reduced set of rows. Repeat recursively, keeping only the shortest solutions.
I know this is not terribly readable (or well explained) and may come back to tidy up in the morning, so let me know if it does what you want (is worth a bit more of my time;-).
// Setup
var rowValues = new Dictionary<int, HashSet<int>>
{
[0] = new() { 0, 2, 3, 4, 5 },
[1] = new() { 1, 2, 4, 5, 6 },
[2] = new() { 1, 3, 4, 5, 6 },
[3] = new() { 2, 3, 4, 5, 6 },
[4] = new() { 1, 2, 3, 4, 5 }
};
Dictionary<int, HashSet<int>> ValueRows(Dictionary<int, HashSet<int>> rv)
{
var vr = new Dictionary<int, HashSet<int>>();
foreach (var row in rv.Keys)
{
foreach (var value in rv[row])
{
if (vr.ContainsKey(value))
{
if (!vr[value].Contains(row))
vr[value].Add(row);
}
else
{
vr.Add(value, new HashSet<int> { row });
}
}
}
return vr;
}
List<int> FindSolution(Dictionary<int, HashSet<int>> rAndV)
{
if (rAndV.Count == 0) return new List<int>();
var bestSolutionSoFar = new List<int>();
var vAndR = ValueRows(rAndV);
foreach (var v in vAndR.Keys)
{
var copyRemove = new Dictionary<int, HashSet<int>>(rAndV);
foreach (var r in vAndR[v])
copyRemove.Remove(r);
var solution = new List<int>{ v };
solution.AddRange(FindSolution(copyRemove));
if (bestSolutionSoFar.Count == 0 || solution.Count > 0 && solution.Count < bestSolutionSoFar.Count)
bestSolutionSoFar = solution;
}
return bestSolutionSoFar;
}
var solution = FindSolution(rowValues);
Console.WriteLine($"Optimal solution has values {{ {string.Join(',', solution)} }}");
output Optimal solution has values { 4 }

Find keys with min difference in dictionary

Say I have this collection; it is a generic dictionary:
var items = new Dictionary<int, SomeData>
{
{ 1 , new SomeData() },
{ 5 , new SomeData() },
{ 23 , new SomeData() },
{ 22 , new SomeData() },
{ 2 , new SomeData() },
{ 7 , new SomeData() },
{ 59 , new SomeData() }
};
In this case min distance (difference) between keys = 1, for instance, between 23 and 22 or between 1 and 2
23 - 22 = 1 or 2 - 1 = 1
Question: how do I find the min difference between keys in a generic Dictionary? Is there a one-line LINQ solution for this?
Purpose: if there are several matches then I need only one, the smallest. This is needed to fill missing keys (gaps) between items.
I don't know how to do it in one line of LINQ, but here is a multi-line solution to the problem.
var items = new Dictionary<int, string>();
items.Add(1, "SomeData");
items.Add(5, "SomeData");
items.Add(23, "SomeData");
items.Add(22, "SomeData");
items.Add(2, "SomeData");
items.Add(7, "SomeData");
items.Add(59, "SomeData");
var sortedArray = items.Keys.OrderBy(x => x).ToArray();
int minDistance = int.MaxValue;
for (int i = 1; i < sortedArray.Length; i++)
{
var distance = Math.Abs(sortedArray[i] - sortedArray[i - 1]);
if (distance < minDistance)
minDistance = distance;
}
Console.WriteLine(minDistance);
I'm not sure LINQ is the most appropriate tool, but something (roughly) along these lines should work:
var smallestDiff = (from key1 in items.Keys
from key2 in items.Keys
where key1 != key2
group new { key1, key2 } by Math.Abs (key1 - key2) into grp
orderby grp.Key
from keyPair in grp
orderby keyPair.key1
select keyPair).FirstOrDefault ();
I won't give you a LINQ query because there already is an answer with one.
I know this is not what you are asking for, but I want to show you how to solve it in a way that is very fast and easy to understand and maintain, in case performance and legibility are of any concern to you.
int[] keys;
int i, d, min;
keys = items.Keys.ToArray();
Array.Sort(keys); // leverage fastest possible implementation of sort
min = int.MaxValue;
for (i = 0; i < keys.Length - 1; i++)
{
d = keys[i + 1] - keys[i]; // d is always non-negative after sort
if (d < min)
{
if (d == 2)
{
return 2; // minimum 1-gap already reached
} else if (d > 2) // ignore non-gap
{
min = d;
}
}
}
return min; // min contains the minimum difference between keys
Because there is only one sort, this non-LINQ solution is pretty quick. I'm not saying this is the best way, only that you should measure both solutions and compare performance.
EDIT: based on your purpose I've added this piece:
if (d == 2)
{
return 2; // minimum 1-gap already reached
} else if (d > 2) // ignore non-gap
{
min = d;
}
Now what does this mean?
Say the PROBABILITY of having 1-gaps is high, it is probably faster to check at every change of min if you've reached that minimum gap. This may happen when you are 1% or 10% through the for loop, based on probability. So, for very large sets (say, above 1 million or 1 billion) and once you know the probability to expect, this probabilistic approach may give you huge performance gains.
On the contrary, for small sets or when the probability of 1-gaps is low, these extra CPU cycles are wasted and you are better off without that check.
As with very large databases (think of probabilistic indexing) probabilistic reasoning becomes relevant.
The problem is that you'll have to estimate beforehand if and when the probabilistic effect kicks in, and that's a pretty complex topic.
EDIT 2: a 1-gap actually has an index difference of 2. Furthermore, an index difference of 1 is a non-gap (there is no room to insert an index in between).
So the previous solution was simply wrong, because as soon as two indices are contiguous (say 34, 35) the minimum will be 1, which is not a gap at all.
Because of this gap problem the internal if() is necessary anyway, and at that point the overhead of the probabilistic approach is nullified. You'll be better off with the correct code and the probabilistic approach!
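To make the gap-aware loop from EDIT 2 concrete, here is a minimal self-contained sketch. The names `KeyGaps` and `MinFillableGap` are made up for illustration, and returning null when no fillable gap exists is an assumption of this sketch rather than something the snippet above does. It also assumes key differences do not overflow an int.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class KeyGaps
{
    // Hypothetical helper: smallest difference >= 2 between adjacent sorted keys,
    // or null when every adjacent pair is contiguous (no key can be inserted).
    public static int? MinFillableGap(IEnumerable<int> keys)
    {
        int[] sorted = keys.ToArray();
        Array.Sort(sorted);

        int? min = null;
        for (int i = 0; i < sorted.Length - 1; i++)
        {
            int d = sorted[i + 1] - sorted[i]; // non-negative after the sort
            if (d < 2) continue;               // contiguous keys: not a gap
            if (d == 2) return 2;              // smallest possible gap, stop early
            if (min == null || d < min) min = d;
        }
        return min;
    }
}
```

For the keys in the question (1, 5, 23, 22, 2, 7, 59) the sorted neighbours 5 and 7 differ by 2, so the loop stops there and returns 2.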
I think LINQ is the simplest.
First, build every pair of distinct entries from your dictionary:
var allPair = items.SelectMany((l) => items.Select((r) => new {l,r}).Where((pair) => l.Key != r.Key));
Then find the pair with the minimum difference:
allPair.OrderBy((pair) => Math.Abs(pair.l.Key - pair.r.Key)).FirstOrDefault();
But you may have multiple pairs with the same difference, so you may need to use GroupBy before OrderBy and then handle the multiple pairs yourself.
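A rough sketch of that GroupBy variant, working on the keys only. `ClosestPairs` and `Find` are hypothetical names, and keeping each pair only once (lower key first) is a choice made here for readability. Like the query above it compares all pairs, so it is O(n^2).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ClosestPairs
{
    // Hypothetical helper: every pair of keys whose difference equals the
    // minimum difference, ordered by the lower key. Empty when fewer than two keys.
    public static List<(int Low, int High)> Find(IEnumerable<int> keys)
    {
        int[] k = keys.ToArray();

        var closest = k
            .SelectMany(l => k.Where(r => l < r).Select(r => (Low: l, High: r)))
            .GroupBy(p => p.High - p.Low)   // group pairs by their difference
            .OrderBy(g => g.Key)            // smallest difference first
            .FirstOrDefault();

        return closest == null
            ? new List<(int Low, int High)>()
            : closest.OrderBy(p => p.Low).ToList();
    }
}
```

With the keys from the question this yields the two pairs (1, 2) and (22, 23), and the caller can decide which one to use.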
A one line solution not listed in answers:
items.Keys.OrderBy(x => x).Select(x => new { CurVal = x, MinDist = int.MaxValue }).Aggregate((ag, x) => new { CurVal = x.CurVal, MinDist = Math.Min(ag.MinDist, x.CurVal - ag.CurVal) }).MinDist

Reading out dice

We need to make a small program for school that rolls 5 dice and checks whether you get three of a kind; if so, it increases the points, etc.
The problem isn't reading out the dice. I have the knowledge to get it done, but I want it to be reasonably efficient, not an ugly piece of code that takes up half a page. I have found ways to filter the duplicates out of an array, but not the other way around. It rolls 5 dice, so it's an array with 5 numbers. Is there a built-in function, or a nice, efficient way, to return the number that has been rolled three times, or null if none of the numbers were rolled three times?
I hope someone can push me in the right direction. :)
You can do it easily and succinctly with LINQ:
var diceRolls = new[] {1, 3, 3, 3, 4};
var winningRolls = diceRolls.GroupBy(die => die).Select(groupedRoll => new {DiceNumber = groupedRoll.Key, Count = groupedRoll.Count()}).Where(x => x.Count >= 3).ToList();
What this is doing is grouping the rolls by the roll number ("Key") and the count of occurrences of that roll. Then, it's selecting any rolls that have a count greater than or equal to 3. The result will be a List containing your winning rolls.
One approach is to store a 6-element array containing the count of how many dice have that face. Loop through the 5 dice and increment the appropriate face's total count.
var rolls = new List<Roll>();
// run as many rolls as you want. e.g.:
rolls.Add(new Roll(5));
var threeOfAKindRolls = rolls.Where(r => r.HasThreeOfAKind());
public class Roll
{
public Roll( int diceCount )
{
// Do your random generation here for the number of dice
DiceResults = new int[0]; // your results.
ResultCounts = new int[6]; // assuming 6 sided die
foreach (var diceResult in DiceResults)
{
ResultCounts[diceResult - 1]++; // faces are 1-6, array indexes are 0-5
}
}
public int[] DiceResults { get; private set; }
public int[] ResultCounts { get; private set; }
public bool HasThreeOfAKind()
{
return ResultCounts.Any(count => count >= 3);
}
}
This code can be shortened somewhat if you don't need the result counts to perform other tests on the results:
public Roll( int diceCount )
{
// Do your random generation here for the number of dice
DiceResults = new int[0]; // your results.
}
public bool HasThreeOfAKind()
{
var resultCounts = new int[6]; // local counts, assuming 6 sided die
foreach (var diceResult in DiceResults)
{
// Increment and shortcut if the previous value was 2 (faces 1-6 map to indexes 0-5)
if( (resultCounts[diceResult - 1]++) == 2) return true;
}
return false;
}
From the way you describe the answer you expect, it sounds like you're trying to do a massive comparison. That's the wrong approach.
Pretend it's 20 dice rather than 5, a good answer will work just as well in a larger case.
I would use something like the following:
public int? WinningRoll(IEnumerable<int> rolls)
{
int threshold = rolls.Count() / 2;
var topRollGroup = rolls.GroupBy(r => r)
.SingleOrDefault(rg => rg.Count() > threshold);
if (topRollGroup != null)
return topRollGroup.Key;
return null;
}
This will work with any number of rolls, not just 5, so if you had 10 rolls, if 6 of them were the same value, that value would be returned. If there is no winning roll, null is returned.

C# Calculation of moving median of time series SortedList<DateTime, double> - improve performance?

I have a method that calculates the moving median value of a time series. Like a moving average, it uses a fixed window or period (sometimes referred to as the look-back period).
If the period is 10, it will create an array of the first 10 values (0-9), then find their median. It will repeat this, moving the window forward by 1 step (values 1-10 now) and so on, hence the moving part. This process is exactly the same as for a moving average.
The median value is found by:
Sorting the values of an array
If there is an odd number of values in the array, take the mid value. The median of a sorted array of 5 values would be the 3rd value.
If there is an even number of values in the array, take the two values on each side of the mid and average them. The median of a sorted array of 6 values would be the (3rd + 4th) / 2.
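The three steps above can be sketched as a small helper. `Stats.Median` is a hypothetical name, not part of the implementation shown further down, and sorting a copy (so the caller's array is left untouched) is an assumption of this sketch.

```csharp
using System;

public static class Stats
{
    // Hypothetical helper: median of a non-empty array, computed on a sorted copy.
    public static double Median(double[] values)
    {
        if (values.Length == 0)
            throw new ArgumentException("values must not be empty", nameof(values));

        double[] sorted = (double[])values.Clone();
        Array.Sort(sorted);

        int mid = sorted.Length / 2;              // zero-based index of the upper middle
        return sorted.Length % 2 == 1
            ? sorted[mid]                         // odd count: the single middle value
            : (sorted[mid - 1] + sorted[mid]) / 2.0; // even count: average the two middle values
    }
}
```

For example, `Stats.Median(new double[] { 50, 10, 30, 20, 40 })` gives 30, and adding a sixth value of 0 gives 25, the average of 20 and 30.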
I have created a function that calculates this by populating a List<double>, calling List<>.Sort(), and then finding the appropriate values.
Computationally it is correct, but I was wondering if there was a way to improve the performance of this calculation, perhaps by hand-rolling a sort on a double[] rather than using a list.
My implementation is as follows:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace Moving_Median_TimeSeries
{
class Program
{
static void Main(string[] args)
{
// create a sample test time series of 10 days
DateTime Today = DateTime.Now;
var TimeSeries = new SortedList<DateTime, double>();
for (int i = 0; i < 10; i++)
TimeSeries.Add(Today.AddDays(i), i * 10);
// write out the time series
Console.WriteLine("Our time series contains...");
foreach (var item in TimeSeries)
Console.WriteLine(" {0}, {1}", item.Key.ToShortDateString(), item.Value);
// calculate an even period moving median
int period = 6;
var TimeSeries_MovingMedian = MovingMedian(TimeSeries, period);
// write out the result of the calculation
Console.WriteLine("\nThe moving median time series of {0} periods contains...", period);
foreach (var item in TimeSeries_MovingMedian)
Console.WriteLine(" {0}, {1}", item.Key.ToShortDateString(), item.Value);
// calculate an odd period moving median
int period2 = 5;
var TimeSeries_MovingMedian2 = MovingMedian(TimeSeries, period2);
// write out the result of the calculation
Console.WriteLine("\nThe moving median time series of {0} periods contains...", period2);
foreach (var item in TimeSeries_MovingMedian2)
Console.WriteLine(" {0}, {1}", item.Key.ToShortDateString(), item.Value);
}
public static SortedList<DateTime, double> MovingMedian(SortedList<DateTime, double> TimeSeries, int period)
{
var result = new SortedList<DateTime, double>();
for (int i = 0; i < TimeSeries.Count(); i++)
{
if (i >= period - 1)
{
// add all of the values used in the calc to a list...
var values = new List<double>();
for (int x = i; x > i - period; x--)
values.Add(TimeSeries.Values[x]);
// ... and then sort the list <- there might be a better way than this
values.Sort();
// If there is an even number of values in the array (example 10 values), take the two mid values
// and average them. i.e. 10 values = (5th value + 6th value) / 2.
double median;
if (period % 2 == 0) // is any even number
median = (values[(int)(period / 2)] + values[(int)(period / 2 - 1)]) / 2;
else // is an odd period
// Median equals the middle value of the sorted array, if there is an odd number of values in the array
median = values[(int)(period / 2 + 0.5)];
result.Add(TimeSeries.Keys[i], median);
}
}
return result;
}
}
}
there might be a better way than this
You are right about this: you don't need to sort the whole list if all you want is the median. Follow links from this Wikipedia page for more.
For a list of N items and a period P, your algorithm, which re-sorts the window for every item, is O(N * P lg P). There is an O(N * lg P) algorithm, which uses 2 heaps.
It uses a min-heap which contains P/2 items above the median, and a max-heap with the P-P/2 items less than or equal to it. Whenever you get a new data item, replace the oldest item with the new one, then do a sift-up or sift-down to move it to the correct place. If the new item reaches the root of either heap, compare it to the root of the other and swap and sift-down when needed. For odd P, the median is at the root of the max-heap. For even P, it is the average of both roots.
There is a C implementation in this question. One tricky part in implementing it is tracking the oldest item efficiently. The overhead in that part may make the speed gains insignificant for small P.
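As a simpler middle ground, and explicitly not the two-heap algorithm described above, the window can be kept in a sorted list that is updated incrementally instead of being re-sorted at every step. The sketch below assumes the values contain no NaN; `SlidingMedian.Compute` is a hypothetical name. Each step costs a binary search plus an O(P) shift inside the list, so it is O(N * P) overall, but with a small constant for modest P.

```csharp
using System;
using System.Collections.Generic;

public static class SlidingMedian
{
    // Hypothetical helper: moving median of `values` over a window of `period` items.
    // Returns one median per full window, in order.
    public static List<double> Compute(IList<double> values, int period)
    {
        if (period <= 0) throw new ArgumentOutOfRangeException(nameof(period));

        var result = new List<double>();
        var window = new List<double>(period); // kept sorted at all times

        for (int i = 0; i < values.Count; i++)
        {
            if (i >= period)
            {
                // Drop the value that just left the window; it is always present.
                int old = window.BinarySearch(values[i - period]);
                window.RemoveAt(old);
            }

            // Insert the new value at its sorted position.
            int pos = window.BinarySearch(values[i]);
            if (pos < 0) pos = ~pos; // not found: the complement is the insertion index
            window.Insert(pos, values[i]);

            if (window.Count == period)
            {
                int mid = period / 2;
                result.Add(period % 2 == 1
                    ? window[mid]
                    : (window[mid - 1] + window[mid]) / 2.0);
            }
        }
        return result;
    }
}
```

With the question's sample series, `SlidingMedian.Compute(TimeSeries.Values, 6)` would produce 25, 35, 45, 55, 65. Pairing each median with its date key is left to the caller.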
