Given a data structure of:
class TheClass
{
    public int NodeID;
    public double Cost;
    public List<int> NodeIDs;
}
And a List with data:
27 -- 10.0 -- 1, 5, 27
27 -- 10.0 -- 1, 5, 27
27 -- 10.0 -- 1, 5, 27
27 -- 15.5 -- 1, 4, 13, 14, 27
27 -- 10.0 -- 1, 4, 25, 26, 27
27 -- 15.5 -- 1, 4, 13, 14, 27
35 -- 10.0 -- 1, 4, 13, 14, 35
I want to reduce it to the unique NodeIDs lists
27 -- 10.0 -- 1, 5, 27
27 -- 15.5 -- 1, 4, 13, 14, 27
27 -- 10.0 -- 1, 4, 25, 26, 27
35 -- 10.0 -- 1, 4, 13, 14, 35
Then I'll be summing the Cost column (Node 27 total cost: 10.0 + 15.5 + 10.0 = 35.5) -- that part is straightforward.
What is the fastest way to remove the duplicate rows / find uniques?
The production data set will have NodeIDs lists of 100 to 200 IDs, with about 1,500 items in the List and around 500 of them unique.
I'm 100% focused on speed -- if adding some other data would help, I'm happy to add it (I've tried hashing the lists into a SHA value, but that turned out slower than my current brute-force exhaustive search).
.GroupBy(x => string.Join(",", x.NodeIDs)).Select(x => x.First())
That should be faster for big data than Distinct.
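Applied to the question's class, a sketch of the whole thing (grouping on the joined string, then summing Cost per NodeID over the unique rows) might look like this:
// One representative row per distinct NodeIDs list, keyed on the joined string.
var uniques = nodes
    .GroupBy(n => string.Join(",", n.NodeIDs))
    .Select(g => g.First())
    .ToList();

// Total cost per NodeID over the unique rows, e.g. node 27 -> 10.0 + 15.5 + 10.0 = 35.5.
var costPerNode = uniques
    .GroupBy(n => n.NodeID)
    .ToDictionary(g => g.Key, g => g.Sum(n => n.Cost));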
If you want to remove duplicate objects according to list equality, you could create a custom IEqualityComparer<T> for lists and use it with Enumerable.GroupBy. Then you just need to create a new instance of your class for each group and sum up Cost.
Here is a possible implementation:
public class ListEqualityComparer<T> : IEqualityComparer<List<T>>
{
    public bool Equals(List<T> lhs, List<T> rhs)
    {
        return lhs.SequenceEqual(rhs);
    }

    public int GetHashCode(List<T> list)
    {
        unchecked
        {
            int hash = 23;
            foreach (T item in list)
            {
                hash = (hash * 31) + (item == null ? 0 : item.GetHashCode());
            }
            return hash;
        }
    }
}
and here is a query that selects one (unique) instance per group:
var nodes = new List<TheClass>(); // fill ....
var uniqueAndSummedNodes = nodes
    .GroupBy(n => n.NodeIDs, new ListEqualityComparer<int>())
    .Select(grp => new TheClass
    {
        NodeID = grp.First().NodeID, // just use the first, change accordingly
        Cost = grp.Sum(n => n.Cost),
        NodeIDs = grp.Key
    });
nodes = uniqueAndSummedNodes.ToList();
This implementation uses SequenceEqual, which takes the order and the number of occurrences of each number in the list into account.
Edit: I've only just seen that you don't want to sum up each group's Cost but rather the Cost across all groups; that's simple:
double totalCost = nodes.Sum(n => n.Cost);
If you don't want to sum up the group itself, replace
...
Cost = grp.Sum(n => n.Cost),
with
...
Cost = grp.First().Cost, // presumes that all are the same
I was watching some videos on YouTube of programming interviews. One of the questions was to write a function that returns the n-th smallest element of an array.
In the video, a lady tried to code it in C++ with some recursion, but I thought that, well, in C# it's a one-liner:
var nth = vals.OrderBy(x => x).Take(i).Last();
Then I realised that I have no idea whether this is actually a good solution, since the next question was about time complexity. I went to the docs, and all I found is that the object returned by OrderBy has all the information needed to perform a full deferred sort when it is enumerated.
So I decided to test it and wrote a class MyComparable : IComparable<MyComparable> with a single value and a static counter in the CompareTo method:
class MyComparable : IComparable<MyComparable>
{
    public MyComparable(int val)
    {
        Val = val;
    }

    public static int CompareCount { get; set; }
    public int Val { get; set; }

    public int CompareTo(MyComparable other)
    {
        ++CompareCount;
        if (ReferenceEquals(this, other)) return 0;
        if (ReferenceEquals(null, other)) return 1;
        return Val.CompareTo(other.Val);
    }
}
Then I wrote a loop that finds the n-th element in an array:
static void Main(string[] args)
{
    var rand = new Random();
    var vals = Enumerable.Range(0, 10000)
        // .Reverse()                 // pessimistic scenario
        // .OrderBy(x => rand.Next()) // average scenario
        .Select(x => new MyComparable(x))
        .ToArray();

    for (int i = 1; i < 100; i++)
    {
        var nth = vals.OrderBy(x => x).Take(i).Last();
        Console.WriteLine($"i: {i,5}, OrderBy: {MyComparable.CompareCount,10}, value {nth.Val}");
        MyComparable.CompareCount = 0;

        var my_nth = vals.OrderByInsertion().Take(i).Last();
        Console.WriteLine($"i: {i,5}, Insert : {MyComparable.CompareCount,10}, value {my_nth.Val}");
        MyComparable.CompareCount = 0;

        my_nth = vals.OrderByInsertionWithIndex().Take(i).Last();
        Console.WriteLine($"i: {i,5}, Index : {MyComparable.CompareCount,10}, value {my_nth.Val}");
        MyComparable.CompareCount = 0;

        Console.WriteLine();
        Console.WriteLine();
    }
}
I also wrote two "different" implementations of finding the minimum element, returning it, and removing it from a list:
public static IEnumerable<T> OrderByInsertion<T>(this IEnumerable<T> input) where T : IComparable<T>
{
    var list = input.ToList();
    while (list.Any())
    {
        var min = list.Min();
        yield return min;
        list.Remove(min);
    }
}

public static IEnumerable<T> OrderByInsertionWithIndex<T>(this IEnumerable<T> input) where T : IComparable<T>
{
    var list = input.ToList();
    while (list.Any())
    {
        var minIndex = 0;
        for (int i = 1; i < list.Count; i++)
        {
            if (list[i].CompareTo(list[minIndex]) < 0)
            {
                minIndex = i;
            }
        }
        yield return list[minIndex];
        list.RemoveAt(minIndex);
    }
}
The results really were a surprise to me:
i: 1, OrderBy: 19969, value 0
i: 1, Insert : 9999, value 0
i: 1, Index : 9999, value 0
i: 2, OrderBy: 19969, value 1
i: 2, Insert : 19997, value 1
i: 2, Index : 19997, value 1
i: 3, OrderBy: 19969, value 2
i: 3, Insert : 29994, value 2
i: 3, Index : 29994, value 2
i: 4, OrderBy: 19969, value 3
i: 4, Insert : 39990, value 3
i: 4, Index : 39990, value 3
i: 5, OrderBy: 19970, value 4
i: 5, Insert : 49985, value 4
i: 5, Index : 49985, value 4
...
i: 71, OrderBy: 19973, value 70
i: 71, Insert : 707444, value 70
i: 71, Index : 707444, value 70
...
i: 99, OrderBy: 19972, value 98
i: 99, Insert : 985050, value 98
i: 99, Index : 985050, value 98
Just using LINQ OrderBy().Take(n) is by far the most efficient and fastest, which is what I anticipated, but I would never have guessed the gap would be several orders of magnitude.
So, my question is mostly to interviewers: how would you grade such an answer?
Code:
var nth = vals.OrderBy(x => x).Take(i).Last();
Time complexity:
I don't know the details, but OrderBy probably uses some kind of quicksort, so O(n log n), no matter which n-th element we want.
Would you ask me to implement my own methods like those or would it be enough to use the framework?
EDIT:
So, it turns out that, as the suggested answer below explains, OrderedEnumerable uses a variation of QuickSelect and only sorts the elements up to the one you ask for. On the plus side, it caches the order.
While you are able to find the n-th element a little faster, it's not in a different class, just some percentage faster. Also, each and every C# programmer will understand the one-liner.
I think my answer during the interview would end up somewhere along the lines of: "I will use OrderBy, because it is fast enough and writing it takes 10 seconds. If it turns out to be too slow, we can gain some time with QuickSelect, but implementing it nicely takes a lot of time."
Thanks everyone who decided to participate in this and sorry to everyone who wasted their time thinking this was something else :)
Okay, let's start with the low-hanging fruit:
Your implementation is wrong. You need to take index + 1 elements from the sequence. To understand this, consider index = 0 and reread the documentation for Take.
Your "benchmark comparison" only works because calling OrderBy() on an IEnumerable does not modify the underlying collection. For what we're going to do it's easier to just allow modifications to the underlying array. As such I've taken the liberty to change your code to generate the values from scratch in every iteration.
Additionally Take(i + 1).Last() is equivalent to ElementAt(i). You should really be using that.
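In other words, the one-liner becomes:
var nth = vals.OrderBy(x => x).ElementAt(i); // i is the zero-based index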
Oh, and your benchmark is really not useful, because the more elements of your range you need to consume with Take, the closer these algorithms should come to each other. As far as I can tell, you're correct with your runtime analysis of O(n log n).
There is a solution that has a time-complexity of O(n) though (not O(log n) as I incorrectly claimed earlier). This is the solution the interviewer was expecting.
For whatever it's worth, the code you've written there is not movable to that solution, because you don't have any control over the sort process.
If you had, you could implement Quickselect (which is the goal here), resulting in a theoretical improvement over the LINQ query you propose (especially for high indices). The following is a translation of the pseudocode from the Wikipedia article on quickselect, based on your code:
static T SelectK<T>(T[] values, int left, int right, int index)
    where T : IComparable<T>
{
    if (left == right) { return values[left]; }

    // could select pivot deterministically through median of 3 or something
    var pivotIndex = rand.Next(left, right + 1);
    pivotIndex = Partition(values, left, right, pivotIndex);

    if (index == pivotIndex) {
        return values[index];
    } else if (index < pivotIndex) {
        return SelectK(values, left, pivotIndex - 1, index);
    } else {
        return SelectK(values, pivotIndex + 1, right, index);
    }
}

static int Partition<T>(T[] values, int left, int right, int pivot)
    where T : IComparable<T>
{
    var pivotValue = values[pivot];
    Swap(values, pivot, right);

    var storeIndex = left;
    for (var i = left; i < right; i++) {
        if (values[i].CompareTo(pivotValue) < 0)
        {
            Swap(values, storeIndex, i);
            storeIndex++;
        }
    }

    Swap(values, right, storeIndex);
    return storeIndex;
}
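The code above assumes a rand field and a Swap helper that aren't shown; a minimal sketch of what they could look like:
static readonly Random rand = new Random();

static void Swap<T>(T[] values, int i, int j)
{
    var tmp = values[i];
    values[i] = values[j];
    values[j] = tmp;
}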
A non-representative subsample of a test I've run gives the output:
i: 6724, OrderBy: 52365, value 6723
i: 6724, SelectK: 40014, value 6724
i: 395, OrderBy: 14436, value 394
i: 395, SelectK: 26106, value 395
i: 7933, OrderBy: 32523, value 7932
i: 7933, SelectK: 17712, value 7933
i: 6730, OrderBy: 46076, value 6729
i: 6730, SelectK: 34367, value 6730
i: 6536, OrderBy: 53458, value 6535
i: 6536, SelectK: 18341, value 6536
Since my SelectK implementation uses a random pivot element, there is quite some variation in its output (see, for example, the second run). It's also considerably worse than the highly optimized sorting algorithm implemented in the standard library.
Even then there are cases where SelectK straight up outperforms the standard library even though I didn't put much effort in.
Now, replacing the random pivot with a median of 3[1] (which is a pretty bad pivot selector), we obtain a slightly different SelectK and can race it against OrderBy and the original SelectK.
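The referenced implementation isn't reproduced here; a rough sketch of a median-of-3 pivot choice (my own, reusing the Swap helper from above) could look like this:
// Pick the median of the first, middle and last element as the pivot index.
// Costs up to three extra comparisons per partitioning step.
static int MedianOf3Pivot<T>(T[] values, int left, int right)
    where T : IComparable<T>
{
    int mid = left + (right - left) / 2;
    if (values[mid].CompareTo(values[left]) < 0) Swap(values, mid, left);
    if (values[right].CompareTo(values[left]) < 0) Swap(values, right, left);
    if (values[right].CompareTo(values[mid]) < 0) Swap(values, right, mid);
    return mid; // values[mid] now holds the median of the three
}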
I've been racing these three horses with 1m elements in the array, using the random sort you already had, requesting an index in the last 20% of the array and got results like the following:
Winning counts: OrderBy 32, SelectK 32, MedianOf3 35
Winning counts: OrderBy 26, SelectK 35, MedianOf3 38
Winning counts: OrderBy 25, SelectK 41, MedianOf3 33
Even for 100k elements and without restricting the index to the end of the array this pattern seems to persist, though not quite as pronounced:
--- 100k elements
Winning counts: OrderBy 24, SelectK 34, MedianOf3 41
Winning counts: OrderBy 33, SelectK 33, MedianOf3 33
Winning counts: OrderBy 32, SelectK 38, MedianOf3 29
--- 1m elements
Winning counts: OrderBy 27, SelectK 32, MedianOf3 40
Winning counts: OrderBy 32, SelectK 38, MedianOf3 29
Winning counts: OrderBy 35, SelectK 31, MedianOf3 33
Generally speaking, a sloppily implemented quickselect outperforms your suggestion in the average case two thirds of the time... I'd say that's a pretty strong indicator that it's the better algorithm to use if you want to get into the nitty-gritty details.
Of course your implementation is significantly easier to understand :)
[1] - Implementation taken from this SO answer, incurring 3 comparisons per recursion depth step
I am learning C# and LINQ, so I'm sorry for the basic question.
How do I write a LINQ query that groups the equal elements in an array and then, for each group with a count of at least 2, divides that count by 2 and adds the result to an int total?
What I want the LINQ to do, in pseudocode:
s => s > 2
s /= 2
return s;
My original code is this:
int n = Convert.ToInt32(Console.ReadLine());
string[] userinput = Console.ReadLine().Split(' ');
int[] socks = new int[n];
socks = Array.ConvertAll(userinput, Int32.Parse);

var result = socks.GroupBy(s => s > 2).ToArray(); // This is the line I want help with

int total = 0;
foreach (var group in result)
{
    total += group.Count();
    Console.WriteLine(group.Count());
}
Let's assume we have 10 kinds of socks, which are 10, 20, 30, 40, 50, 60, 70, 80, 90, 100.
On the first line I enter the number of socks I have, for example 5.
On the second line I enter the 5 sock kinds, for example: 10 10 20 20 30.
Now I want the LINQ code to work out that there are 3 keys here, which are 10, 20 and 30. The 10 key has a count of 2, not just 1, and the same goes for 20, but 30 only has a count of 1, so let's forget about it. The count for each of the remaining keys is 2, so divide it by 2 for each one and add the result to the total: 2/2 equals 1 for each, so total = 2 (this is my expected output).
If I understand you right, you want to count pairs of socks:
int total = socks
    .GroupBy(x => x)
    .Sum(chunk => chunk.Count() / 2);
According to your example:
[10, 10, 20, 20, 30]
after grouping socks by their sizes
10: [10, 10] - 2 socks, 1 pair (you've put it as "divide by 2")
20: [20, 20] - 2 socks, 1 pair
30: [30] - 1 sock, 0 pairs
-------------------------------
2 pairs in total (the expected value)
My example (from the comments to the question)
1, 2, 2, 2, 3, 4, 5, 5
should return 2 as well (we have 2 pairs: of size 2 and 5)
In case you want to get pairs:
var pairs = socks
    .GroupBy(x => x)
    .Select(chunk => new
    {
        size = chunk.Key,
        count = chunk.Count() / 2,
    });
    //.Where(pair => pair.count > 0); // you may want to filter out single socks
I have a sequence of objects that each have a sequence number going from 0 to ushort.MaxValue (0-65535). I have at most about 10,000 items in my sequence, so there should not be any duplicates, and the items are mostly sorted due to the way they are loaded. I only need to access the data sequentially; I don't need them in a list, if that helps. It is also something that is done quite frequently, so it cannot have too high a Big-O.
What is the best way to sort this list?
An example sequence could be (in this example, assume the sequence number is a single byte and wraps at 255):
240 241 242 243 244 250 251 245 246 248 247 249 252 253 0 1 2 254 255 3 4 5 6
The correct order would then be
240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 0 1 2 3 4 5 6
I have a few different approaches in mind, including making an array of ushort.MaxValue size and just incrementing the position, but that seems like a very inefficient way, and I have some problems when the data I receive has a jump in the sequence. However, it's O(1) in performance.
Another approach is to order the items normally, then find the split (6-240), and move the first items to the end. But I'm not sure if that is a good idea.
My third idea is to loop the sequence, until I find a wrong sequence number, look ahead until I find the correct one, and move it to its correct position. However, this can potentially be quite slow if there is a wrong sequence number early on.
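For reference, a sketch of the second idea (sort normally, find the split, rotate), under the assumption that the items always span less than half of the ushort range:
static List<ushort> SortWithWrap(IEnumerable<ushort> input)
{
    var sorted = input.OrderBy(x => x).ToList();

    // Gap "around the end", from the largest value back to the smallest.
    int wrapGap = sorted[0] + ushort.MaxValue + 1 - sorted[sorted.Count - 1];

    // If some internal gap is larger than the wrap gap, the window wraps
    // around zero and the output should start just after that internal gap.
    int split = 0, largestGap = wrapGap;
    for (int i = 1; i < sorted.Count; i++)
    {
        int gap = sorted[i] - sorted[i - 1];
        if (gap > largestGap) { largestGap = gap; split = i; }
    }
    return sorted.Skip(split).Concat(sorted.Take(split)).ToList();
}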
Is this what you are looking for?
var groups = ints.GroupBy(x => x < 255 / 2)
                 .OrderByDescending(list => list.ElementAt(0))
                 .Select(x => x.OrderBy(u => u))
                 .SelectMany(i => i)
                 .ToList();
Example
In:
int[] ints = new int[] { 88, 89, 90, 91, 92, 0, 1, 2, 3, 92, 93, 94, 95, 96, 97, 4, 5, 6, 7, 8, 99, 100, 9, 10, 11, 12, 13 };
Out:
88 89 90 91 92 92 93 94 95 96 97 99 100 0 1 2 3 4 5 6 7 8 9 10 11 12 13
I realise this is an old question, but I also needed to do this and would have liked an answer, so...
Use a SortedSet<FileData> with a custom comparer, where FileData contains information about the files you are working with, e.g.:
struct FileData
{
    public ushort SequenceNumber;
    ...
}

internal class Sequencer : IComparer<FileData>
{
    public int Compare(FileData x, FileData y)
    {
        ushort comparer = (ushort)(x.SequenceNumber - y.SequenceNumber);
        if (comparer == 0) return 0;
        if (comparer < ushort.MaxValue / 2) return 1;
        return -1;
    }
}
As you read file information from disk, add the items to your SortedSet.
When you read them out of the SortedSet, they will be in the correct order.
Note that SortedSet uses a red-black tree internally, which should give you a nice balance between performance and memory:
Insertion is O(log n)
Traversal is O(n)
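A minimal usage sketch (the sequence numbers here are made up to show the wrap-around near ushort.MaxValue):
var files = new SortedSet<FileData>(new Sequencer());

// Add items as they arrive, possibly out of order around the wrap point.
foreach (ushort seq in new ushort[] { 2, 65534, 0, 65533, 1, 65535 })
{
    files.Add(new FileData { SequenceNumber = seq });
}

// Enumerates in wrapped order: 65533, 65534, 65535, 0, 1, 2
foreach (var file in files)
{
    Console.WriteLine(file.SequenceNumber);
}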
I want to compare two lists. I want to check if List2 has any of the items in List1. I get unexpected results. Please see my code below.
Test code:
class Program
{
    static void Main(string[] args)
    {
        bool loop = true;
        int textcount = 1;
        while (loop)
        {
            var collection1 = GetCollection();
            var collection2 = GetCollection();

            Console.WriteLine("Test No " + textcount.ToString());
            Console.WriteLine("Collection 1 =" + String.Join(", ", collection1.ToArray()));
            Console.WriteLine("Collection 2 =" + String.Join(", ", collection2.ToArray()));

            System.Diagnostics.Stopwatch watch = new System.Diagnostics.Stopwatch();
            watch.Start();
            var hasitem = collection1.Any(item => collection2.Contains(item));
            watch.Stop();
            Console.WriteLine(hasitem.ToString() + " Found in " + watch.ElapsedTicks.ToString());

            watch.Reset();
            watch.Start();
            var hasAtLeastOne = collection1.Intersect(collection2).Any();
            watch.Stop();
            Console.WriteLine(hasAtLeastOne.ToString() + " With Intersect Found in " + watch.ElapsedTicks.ToString());

            textcount++;
            Console.ReadKey();
        }
    }

    static Random ran = new Random();

    private static IEnumerable<int> GetCollection()
    {
        for (int i = 0; i < 5; i++)
        {
            yield return ran.Next(i, 20);
        }
    }
}
And the results are very annoying; see the last 4 tests.
Test No 1
Collection 1 =10, 8, 18, 6, 11
Collection 2 =3, 12, 18, 13, 6
True Found in 3075
True With Intersect Found in 15297
Test No 2
Collection 1 =18, 13, 7, 18, 5
Collection 2 =12, 18, 8, 3, 5
True Found in 22
True With Intersect Found in 100
Test No 3
Collection 1 =1, 6, 15, 7, 9
Collection 2 =16, 15, 14, 14, 12
True Found in 21
True With Intersect Found in 23
Test No 4
Collection 1 =3, 16, 7, 4, 19
Collection 2 =6, 3, 15, 15, 9
True Found in 21
True With Intersect Found in 56
Test No 5
Collection 1 =18, 18, 9, 17, 10
Collection 2 =17, 12, 4, 3, 11
True Found in 25
True With Intersect Found in 20
Test No 6
Collection 1 =9, 9, 2, 17, 19
Collection 2 =17, 2, 18, 3, 15
False Found in 109
False With Intersect Found in 41
Test No 7
Collection 1 =3, 15, 3, 5, 5
Collection 2 =2, 2, 11, 7, 6
True Found in 22
False With Intersect Found in 15
Test No 8
Collection 1 =7, 14, 17, 14, 18
Collection 2 =18, 4, 7, 18, 16
False Found in 28
True With Intersect Found in 19
Test No 9
Collection 1 =3, 9, 6, 18, 9
Collection 2 =10, 3, 17, 17, 18
True Found in 28
True With Intersect Found in 22
Test No 10
Collection 1 =15, 18, 2, 9, 8
Collection 2 =10, 15, 3, 10, 19
False Found in 135
True With Intersect Found in 128
Test No 11
Collection 1 =6, 2, 17, 18, 18
Collection 2 =14, 16, 14, 6, 4
False Found in 20
False With Intersect Found in 17
The problem is that what you call a "collection" is actually an unstable sequence of items that changes every time it is enumerated. The reason for this is the way you implemented GetCollection. Using yield return basically returns a blueprint for how to create the sequence; it doesn't return the sequence itself.
And so, every time that "blueprint" is enumerated, it is used to create a new sequence.
You can verify this by simply outputting your "collections" twice. You will see that the values are different each time.
And that's the reason why your test yields completely arbitrary results:
You enumerate each collection three times:
The first enumeration happens when you output it to the console.
The second enumeration happens in the test with Any and Contains. Because this starts a new enumeration, new random numbers are generated.
The third enumeration happens in the Intersect test. This creates yet another set of random numbers.
To fix it, create a stable sequence by calling ToArray() on the result of GetCollection.
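For example (a small sketch of the fix):
var collection1 = GetCollection().ToArray(); // materialize once
var collection2 = GetCollection().ToArray();

// Both tests now see exactly the same numbers that were printed above.
var hasitem = collection1.Any(item => collection2.Contains(item));
var hasAtLeastOne = collection1.Intersect(collection2).Any();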
This is your problem:
private static IEnumerable<int> GetCollection()
{
    for (int i = 0; i < 5; i++)
    {
        yield return ran.Next(i, 20);
    }
}
Make it into this:
private static List<int> GetCollection()
{
    return new List<int>
    {
        ran.Next(0, 20),
        ran.Next(1, 20),
        ran.Next(2, 20),
        ran.Next(3, 20),
        ran.Next(4, 20)
    };
}
And your problem will disappear.
The long explanation is that when you write an IEnumerable function, you can expect the iterator to be called repeatedly by various LINQ calls (after all, that's what an IEnumerable does, right?). Since you do a yield return <some random number> on each iterator call, you can expect unstable results. It's best to either save a reference to the .ToArray() result or just return a stable list.
I have a function that receives a power of two value.
I need to convert it to an enum range (0, 1, 2, 3, and so on), and then shift it back to the power of two range.
0 1
1 2
2 4
3 8
4 16
5 32
6 64
7 128
8 256
9 512
10 1024
... and so on.
If my function receives a value of 1024, I need to convert it to 10. What is the best way to do this in C#? Should I just keep dividing by 2 in a loop and count the iterations?
I know I can put it back with (1 << 10).
Just use the base-2 logarithm:
Math.Log(/* your number */, 2)
For example, Math.Log(1024, 2) returns 10.
Update:
Here's a rather robust version that checks if the number passed in is a power of two:
public static int Log2(uint number)
{
    var isPowerOfTwo = number > 0 && (number & (number - 1)) == 0;
    if (!isPowerOfTwo)
    {
        throw new ArgumentException("Not a power of two", "number");
    }
    return (int)Math.Log(number, 2);
}
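Usage (the values here are just illustrative):
int exponent = Log2(1024);    // 10
int original = 1 << exponent; // back to 1024
// Log2(1000) throws ArgumentException, since 1000 is not a power of two.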
The check for number being a power of two is taken from http://graphics.stanford.edu/~seander/bithacks.html#DetermineIfPowerOf2
There are more tricks to find log2 of an integer on that page, starting here:
http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious
This is probably the fastest algorithm when your CPU doesn't have a bit-scan instruction or you can't access that instruction:
unsigned int v; // find the number of trailing zeros in 32-bit v
int r; // result goes here
static const int MultiplyDeBruijnBitPosition[32] =
{
0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};
r = MultiplyDeBruijnBitPosition[((uint32_t)((v & -v) * 0x077CB531U)) >> 27];
See this paper if you want to know how it works; basically, it's just a perfect hash.
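If you want the same trick in C#, a rough translation (my own sketch; only valid for non-zero inputs) could look like this:
static readonly int[] MultiplyDeBruijnBitPosition =
{
    0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
    31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};

// Number of trailing zero bits in v; for a power of two this equals log2(v).
static int TrailingZeroBits(uint v)
{
    uint lowestBit = v & (~v + 1); // isolates the lowest set bit
    uint hash = unchecked(lowestBit * 0x077CB531u);
    return MultiplyDeBruijnBitPosition[hash >> 27];
}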
Use _BitScanForward (a compiler intrinsic in MSVC). It does exactly this.