I have a list of Items, each containing a field of type int.
I want to filter my list to get only the items that match a given list of integers.
The code I have now works but I know it could be optimized.
class Item
{
    int ID;
    //Other fields & methods that are irrelevant here
}
//Selection method
IEnumerable<Item> SelectItems(List<Item> allItems, List<int> toSelect)
{
    return allItems.Where(x => toSelect.Contains(x.ID));
}
The problem I have is that I iterate through allItems, and in each iteration I iterate through toSelect.
I have the feeling it is possible to be much more efficient, but I don't know how to achieve this with LINQ.
This might also be an already-asked question, but I don't know what this is called in English, which makes it hard to formulate a proper search query.
You can use Join, which is more efficient because it uses a set-based approach:
var selectedItems = from item in allItems
                    join id in toSelect
                    on item.ID equals id
                    select item;
return selectedItems;
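For reference, the same join in method syntax; Enumerable.Join hashes the inner sequence's keys, which is what makes it set-based:
var selectedItems = allItems.Join(toSelect,
    item => item.ID,     // outer key: the item's ID
    id => id,            // inner key: the integer itself
    (item, id) => item); // keep the matching item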
Another, more efficient way is to use a HashSet<int> instead of a list:
IEnumerable<Item> SelectItems(List<Item> allItems, HashSet<int> toSelect)
{
    return allItems.Where(x => toSelect.Contains(x.ID));
}
There are two ways to approach this.
Currently you have O(N×M) performance (where N is the size of allItems and M is the size of toSelect).
If you're just trying to reduce it easily, then you could reduce it to O(N)+O(M) by creating a hash-set of toSelect:
var matches = new HashSet<int>(toSelect);
return allItems.Where(x => matches.Contains(x.ID));
However, this is still going to be dominated by N - the size of allItems.
A better long-term approach may be to pre-index the data (and keep it indexed) by ID. So instead of allItems being a List<T>, it could be a Dictionary<int, T>. Note that building the dictionary can be expensive, so you don't want to do this every time you want to search: the key is to do this once at the start (and keep it maintained). Then this becomes O(M) (the size of toSelect, which is usually small), since dictionary lookups are O(1).
IEnumerable<Item> SelectItems(Dictionary<int, Item> allItems, List<int> toSelect)
{
    foreach (var id in toSelect)
    {
        if (allItems.TryGetValue(id, out var found))
            yield return found;
    }
}
(there is no need to pre-hash toSelect since we aren't checking it for Contains)
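For completeness, building that index once is a one-liner; a minimal sketch, assuming the ID values are unique (ToDictionary throws on duplicate keys):
// Build the index once (e.g. at startup) and reuse it for every search.
var itemsById = allItems.ToDictionary(x => x.ID);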
In the foreach loop, I want to add the Products to a List, but I want this List to not contain duplicate Products. I currently have two ideas for solving this.
1/ In the loop, before adding the Product to the List, check whether the Product already exists in the List; if not, add it.
foreach (var product in products)
{
    // code logic
    if (!listProduct.Any(x => x.Id == product.Id))
    {
        listProduct.Add(product);
    }
}
2/ In the loop, add all the Products to the List even if there are duplicates. Then, outside of the loop, use Distinct to remove the duplicate records.
foreach (var product in products)
{
    // code logic
    listProduct.Add(product);
}
listProduct = listProduct.Distinct().ToList();
I wonder which of these two ways is more effective. Or is there another way to add records to the List while avoiding duplication?
I'd go for a third approach: the HashSet<Product>. It has a constructor overload that accepts an IEnumerable<Product>, and that constructor removes duplicates:
If the input collection contains duplicates, the set will contain one
of each unique element. No exception will be thrown.
Source: HashSet<T> Constructor
Usage:
List<Product> myProducts = ...;
var setOfProducts = new HashSet<Product>(myProducts);
After removing duplicates there is no proper meaning of setOfProducts[4].
Therefore a HashSet is not an IList<Product> but an ICollection<Product>: you can Count / Add / Remove, etc., everything you can do with a List. The only thing you can't do is fetch by index. Note also that "duplicate" here means equal according to the element type's equality, so Product needs to compare by Id (via Equals/GetHashCode or a custom IEqualityComparer<Product>) for this to de-duplicate by Id.
First take the elements that are not already in the collection:
var newProducts = products.Where(x => !listProduct.Any(y => x.Id == y.Id));
And then just add them using AddRange:
listProduct.AddRange(newProducts);
Or you can use a foreach loop:
foreach (var product in newProducts)
{
    listProduct.Add(product);
}
One more easy solution, with no need to use Distinct:
var newProductList = products.Union(listProduct).ToList();
But Union does not have good performance.
From what you have included, you are storing everything in memory. If this is the case, or if you are persisting only after you have the list ready, you can consider using BinarySearch:
https://msdn.microsoft.com/en-us/library/w4e7fxsh(v=vs.110).aspx and you also get an ordered list at the end. If ordering is not important, you can use a HashSet, which is very fast and meant specifically for this purpose.
Check also: https://www.dotnetperls.com/hashset
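To illustrate the BinarySearch suggestion, here is a minimal sketch, assuming listProduct is kept sorted by Id at all times; the comparer and the insertion logic are my assumptions, not from the original post:
// Comparer that orders products by Id.
var byId = Comparer<Product>.Create((a, b) => a.Id.CompareTo(b.Id));

foreach (var product in products)
{
    int index = listProduct.BinarySearch(product, byId);
    if (index < 0)                           // negative result: Id not present
        listProduct.Insert(~index, product); // ~index is the correct insertion point
}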
This should be pretty fast and take care of any ordering:
// build a HashSet of your primary keys type (I'm assuming integers here) containing all your list elements' keys
var hashSet = new HashSet<int>(listProduct.Select(p => p.Id));
// add all items from the products list whose Id can be added to the hashSet (so it's not a duplicate)
listProduct.AddRange(products.Where(p => hashSet.Add(p.Id)));
What you might want to consider doing instead, though, is implementing IEquatable<Product> and overriding GetHashCode() on your Product type which would make the above code a little easier and put the equality checks where they should be (inside the respective type):
var hashSet = new HashSet<Product>(listProduct);
listProduct.AddRange(products.Where(hashSet.Add));
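A minimal sketch of such a Product; only the Id property comes from the question, the rest is an assumption:
public class Product : IEquatable<Product>
{
    public int Id { get; set; }

    public bool Equals(Product other) => other != null && Id == other.Id;

    public override bool Equals(object obj) => Equals(obj as Product);

    // Equal products must produce the same hash code, so hash the Id.
    public override int GetHashCode() => Id.GetHashCode();
}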
I'm having a problem sorting a dictionary based on the sum of 1s in lists of integers inside the same Dictionary. So first I want to count the 1s in each list and then sort the dictionary based on the result.
I've found some solutions on Stack Overflow, but they don't answer my question.
The dictionary looks like the following:
Dictionary<int, List<int>> myDic = new Dictionary<int, List<int>>();
List<int> myList = new List<int>();//Should appear third
myList.Add(0);
myList.Add(0);
myList.Add(1);
myDic.Add(0, myList);
myList = new List<int>();//Should appear second
myList.Add(1);
myList.Add(1);
myList.Add(0);
myDic.Add(1, myList);
myList = new List<int>();//Should appear first
myList.Add(1);
myList.Add(1);
myList.Add(1);
myDic.Add(2, myList);
I tried this code but it seems it doesn't do anything.
List<KeyValuePair<int, List<int>>> myList2 = myDic.ToList();
myList2.Sort((firstPair, nextPair) =>
{
    return firstPair.Value.Where(i => i == 1).Sum().CompareTo(nextPair.Value.Where(x => x == 1).Sum());
});
You are sorting list items in ascending order, i.e. items with more 1s will go to the end of the list. You should use descending order. Just compare nextPair to firstPair (or change the sign of the comparison result):
myList2.Sort((firstPair, nextPair) =>
{
    return nextPair.Value.Where(i => i == 1).Sum().CompareTo(
        firstPair.Value.Where(x => x == 1).Sum());
});
This approach has one problem: the sum of 1s in each value will be calculated every time two items are compared. Better to use Enumerable.OrderByDescending. It's simpler to use, and it computes the comparison values (i.e. the keys) only once. Since a Dictionary is an enumerable of KeyValuePairs, you can use OrderByDescending directly on the dictionary:
var result = myDic.OrderByDescending(kvp => kvp.Value.Where(i => i == 1).Sum());
Your sort is backward, which is why you think it's not doing anything. Reverse the firstPair/nextPair values in your lambda and you'll get the result you expect.
Though @Sergey Berezovskiy is correct that you could just use OrderBy, your example code could perhaps benefit from a different pattern overall.
class SummedKV
{
    public KeyValuePair<int, List<int>> Kvp { get; set; }
    public int Sum { get; set; }
}

var myList =
    myDic.Select(kvp => new SummedKV { Kvp = kvp, Sum = kvp.Value.Sum() })
         .OrderByDescending(skv => skv.Sum) // most 1s first
         .ToList();
Maybe something simpler:
var ordered = myList2.OrderByDescending(x => x.Value.Sum());
Your code does do something. It creates a list of the items that used to be in the dictionary, sorted based on the number of 1 items contained in each list. The code that you have correctly creates this list and sorts it as your requirements say it should. (Note that using OrderByDescending would let you do the same thing more simply.)
It has no effect on the dictionary that you pulled the lists out of, of course. Dictionaries are unordered, so you can't "reorder" the items even if you wanted to. If it were some different type of ordered collection then it would be possible to change the order of its items, but just creating a new structure and ordering that wouldn't do it; you'd need to use some sort of operation on the collection itself to change the order of the items.
So I have a couple of different lists that I'm trying to process and merge into one list.
Below is a snippet of code that I want to improve if possible.
The reason I'm asking is that some of these lists are rather large, and I want to see if there is a more efficient way of doing this.
As you can see, I'm looping through a list, and the first thing I do is check whether the CompanyId exists in the list. If it does, then I find the item in the list that I'm going to process.
pList is my processing list. I'm adding the values from my different lists into this list.
I'm wondering if there is a "better way" of accomplishing the Exist and Find.
bool tstFind = false;
foreach (parseAC item in pACList)
{
    tstFind = pList.Exists(x => (x.CompanyId == item.key.ToString()));
    if (tstFind == true)
    {
        pItem = pList.Find(x => (x.CompanyId == item.key.ToString()));
        //Processing done here. pItem gets updated here
        ...
    }
}
Just as a side note, I'm going to be researching a way to use joins to see if that is faster. But I haven't gotten there yet. The above code is my first cut at solving this issue and it appears to work. However, since I have the time I want to see if there is a better way still.
Any input is greatly appreciated.
Time Findings:
My current Find and Exists code takes about 84 minutes to loop through the 5.5M items in the pACList.
Using pList.FirstOrDefault(x => x.CompanyId == item.key.ToString()) takes 54 minutes to loop through the 5.5M items in the pACList.
You can retrieve the item with FirstOrDefault instead of searching for it two times (first to determine whether the item exists, and second to get the existing item):
var pItem = pList.FirstOrDefault(x => x.CompanyId == item.key.ToString());
if (pItem != null)
{
    //Processing done here. pItem gets updated here
}
Yes, use a hashtable so that your algorithm is O(n) instead of O(n*m) which it is right now.
var pListByCompanyId = pList.ToDictionary(x => x.CompanyId);
foreach (parseAC item in pACList)
{
    if (pListByCompanyId.ContainsKey(item.key.ToString()))
    {
        pItem = pListByCompanyId[item.key.ToString()];
        //Processing done here. pItem gets updated here
        ...
    }
}
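One caveat, as an aside: ToDictionary throws on duplicate keys. If pList can contain several entries with the same CompanyId, a Lookup is the safer variant (a sketch, reusing the names above):
// A lookup tolerates duplicate keys and returns an empty sequence for missing ones.
var pListByCompanyId = pList.ToLookup(x => x.CompanyId);
foreach (parseAC item in pACList)
{
    foreach (var pItem in pListByCompanyId[item.key.ToString()])
    {
        // Processing done here. pItem gets updated here.
    }
}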
You can iterate through the filtered list using LINQ:
foreach (parseAC item in pACList.Where(i => pList.Any(x => x.CompanyId == i.key.ToString())))
{
    pItem = pList.Find(x => x.CompanyId == item.key.ToString());
    //Processing done here. pItem gets updated here
    ...
}
Using lists for this type of operation is O(M×N) (M is the count of pACList, N is the count of pList). Additionally, you are searching pList twice (once with Exists and once with Find). To avoid that issue, use pList.FirstOrDefault as recommended by @lazyberezovsky.
However, if possible I would avoid using lists. A Dictionary indexed by the key you're searching on would greatly improve the lookup time.
Doing a linear search on the list for each item in another list is not efficient for large data sets. What is preferable is to put the keys into a Table or Dictionary that can be much more efficiently searched to allow you to join the two tables. You don't even need to code this yourself, what you want is a Join operation. You want to get all of the pairs of items from each sequence that each map to the same key.
Either pull out the implementation of the method below, or change Foo and Bar to the appropriate types and use it as a method.
public static IEnumerable<Tuple<Bar, Foo>> Merge(IEnumerable<Bar> pACList,
                                                 IEnumerable<Foo> pList)
{
    return pACList.Join(pList,
        a => a.Key.ToString(),
        b => b.CompanyId.ToString(),
        (a, b) => Tuple.Create(a, b));
}
You can use the results of this call to merge the two items together, as they will have the same key.
Internally the method will create a lookup table that allows for efficient searching before actually doing the searching.
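A hypothetical usage sketch; Bar and Foo stand in for your actual element types, as in the method above:
foreach (var pair in Merge(pACList, pList))
{
    Bar a = pair.Item1; // the pACList element
    Foo b = pair.Item2; // the matching pList element
    // merge/update a and b here
}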
Convert pList to a HashSet of the CompanyIds, then query pHashSet.Contains(): complexity O(N) + O(n).
Sort pList on CompanyId and do a BinarySearch: O(N log N) + O(n log N).
If the max company id is not prohibitively large, simply create an array where the item with company id i sits at the i-th position. Nothing can be faster.
(Here N is the size of pList and n is the size of pACList.)
If I have two generic lists, List<Place>, and I want to merge all the unique Place objects into one List<Place>, based on the Place.Id property, what's a good method of doing this efficiently?
One list will always contain 50, the other list could contain significantly more.
result = list1.Union(list2, new ElementComparer());
You need to create an ElementComparer that implements IEqualityComparer<Place>.
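A minimal sketch of such a comparer, assuming Place exposes the int Id property described in the question:
class ElementComparer : IEqualityComparer<Place>
{
    public bool Equals(Place x, Place y) =>
        ReferenceEquals(x, y) || (x != null && y != null && x.Id == y.Id);

    // Equal places must hash identically, so hash the Id.
    public int GetHashCode(Place obj) => obj == null ? 0 : obj.Id.GetHashCode();
}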
If you want to avoid having to define your own ElementComparer and just use lambda expressions, you can try the following:
List<Place> listOne = /* whatever */;
List<Place> listTwo = /* whatever */;
List<Place> listMerge = listOne.Concat(
    listTwo.Where(p1 =>
        !listOne.Any(p2 => p1.Id == p2.Id)
    )
).ToList();
Essentially this will just concatenate the Enumerable listOne with the set of all elements in listTwo such that the elements are not in the intersection between listOne and listTwo.
Enumerable.Distinct Method
Note: .NET 3.5 & above. To de-duplicate by Id rather than by reference, you would need the overload that takes an IEqualityComparer<Place>.
If you want to emphasize efficiency, I suggest you write a small method to do the merge yourself:
List<Place> constantList; // always contains 50 elements, no duplicate elements
List<Place> targetList;
var result = new List<Place>();
var dict = new Dictionary<int, Place>();
foreach (var p in constantList)
    dict.Add(p.Id, p);
result.AddRange(constantList);
foreach (var p in targetList)
{
    if (!dict.ContainsKey(p.Id))
        result.Add(p);
}
If speed is what you need, you need to compare using a hashing mechanism. What I would do is maintain a HashSet of the ids that have already been read, and then add an element to the result only if its id hasn't been read yet. You can do this for as many lists as you want, and you can return an IEnumerable instead of a list if you want to start consuming before the merge is over.
public IEnumerable<Place> Merge(params List<Place>[] lists)
{
    HashSet<int> _ids = new HashSet<int>();
    foreach (List<Place> list in lists)
    {
        foreach (Place place in list)
        {
            if (!_ids.Contains(place.Id))
            {
                _ids.Add(place.Id);
                yield return place;
            }
        }
    }
}
The fact that one list has 50 elements and the other one many more has no implication. Unless you know that the lists are ordered...
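A quick usage sketch of the method above (the list names are placeholders):
// The first list passed in wins when two Places share an Id.
List<Place> merged = Merge(listOne, listTwo).ToList();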
I have got two generic lists with 20,000 and 30,000 objects in each list.
class Employee
{
string name;
double salary;
}
List<Employee> newEmployeeList = new List<Employee>() {....}; // contains 20,000 objects
List<Employee> oldEmployeeList = new List<Employee>() {....}; // contains 30,000 objects
Lists can also be sorted by name if it improves the speed.
I want to compare these two lists to find out
employees whose name and salary matching
employees whose name is matching but not salary
What is the fastest way to compare such large data lists with above conditions?
I would sort both the newEmployeeList and oldEmployeeList lists by name - O(n*log(n)). Then you can use a linear algorithm to search for matches. So the total would be O(n + n*log(n)) if both lists are about the same size. This should be faster than the O(n^2) "brute force" algorithm.
I'd probably recommend the two lists be stored in a Dictionary<string, Employee> keyed by the name to begin with; then you can iterate over the keys in one and look up in the other whether the key exists and the salaries match. This would also save the cost of sorting them later or putting them in a more efficient structure.
This is pretty much O(n): linear to build both dictionaries, linear to go through the keys of one and look up in the other, since O(n + m + n) reduces to O(n).
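A minimal sketch of the dictionary approach, assuming names are unique within each list and the name/salary members are accessible:
var oldByName = oldEmployeeList.ToDictionary(e => e.name);

foreach (var emp in newEmployeeList)
{
    if (oldByName.TryGetValue(emp.name, out var old))
    {
        if (old.salary == emp.salary)
        {
            // name and salary match
        }
        else
        {
            // name matches but salary differs
        }
    }
}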
But, if you must use List<T> to hold the lists for other reasons, you could also use the Join() LINQ method, and build a new list with a Match field that tells you whether they were a match or mismatch...
var results = newEmployeeList.Join(
    oldEmployeeList,
    n => n.name,
    o => o.name,
    (n, o) => new
    {
        Name = n.name,
        Salary = n.salary,
        Match = o.salary == n.salary
    });
You can then filter this with a Where() clause for Match or !Match.
Update: I assume (by the title of your question) that the 2 lists are already sorted. Perhaps they're stored in a database with a clustered index or something. This answer, therefore, relies on that assumption.
Here is an implementation that has O(n) complexity, and is also very fast, AND is pretty simple too.
I believe this is a variant of the Merge Algorithm.
Here's the idea:
Start enumerating both lists
Compare the 2 current items.
If they match, add to your results.
If the 1st item is "smaller", advance the 1st list.
If the 2nd item is "smaller", advance the 2nd list.
Since both lists are known to be sorted, this will work very well. This implementation assumes that name is unique in each list.
var comparer = StringComparer.OrdinalIgnoreCase;
var namesAndSalaries = new List<Tuple<Employee, Employee>>();
var namesOnly = new List<Tuple<Employee, Employee>>();
// Create 2 iterators; one for old, one for new:
using (IEnumerator<Employee> A = oldEmployeeList.GetEnumerator()) {
    using (IEnumerator<Employee> B = newEmployeeList.GetEnumerator()) {
        // Start enumerating both:
        if (A.MoveNext() && B.MoveNext()) {
            while (true) {
                int compared = comparer.Compare(A.Current.name, B.Current.name);
                if (compared == 0) {
                    // Names match
                    if (A.Current.salary == B.Current.salary) {
                        namesAndSalaries.Add(Tuple.Create(A.Current, B.Current));
                    } else {
                        namesOnly.Add(Tuple.Create(A.Current, B.Current));
                    }
                    if (!A.MoveNext() || !B.MoveNext()) break;
                } else if (compared < 0) {
                    // Compare can return any negative value, not just -1:
                    // A.Current is smaller, so keep searching A
                    if (!A.MoveNext()) break;
                } else {
                    // B.Current is smaller, so keep searching B
                    if (!B.MoveNext()) break;
                }
            }
        }
    }
}
One of the fastest possible solutions on sorted lists is to use BinarySearch to find an item in the other list.
But as others have mentioned, you should measure it against your project requirements, as performance often tends to be a subjective thing.
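A rough sketch of that idea, assuming both lists are sorted by name with an ordinal comparison; the comparer and the loop are my assumptions, not from the original post:
// Comparer that orders employees by name only.
var byName = Comparer<Employee>.Create(
    (a, b) => string.Compare(a.name, b.name, StringComparison.Ordinal));

foreach (var emp in newEmployeeList)
{
    int idx = oldEmployeeList.BinarySearch(emp, byName);
    if (idx >= 0)
    {
        // name matches; now compare salaries
        bool salaryMatches = oldEmployeeList[idx].salary == emp.salary;
    }
}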
You could create a Dictionary using
var lookupDictionary = list1.ToDictionary(x=>x.name);
That would give you close to O(1) lookup and a close to O(n) behavior if you're looking up values from a loop over the other list.
(I'm assuming here that ToDictionary is O(n), which would make sense with a straightforward implementation, but I have not tested this.)
This would make for a very straight forward algorithm, and I'm thinking going below O(n) with two unsorted lists is pretty hard.
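Putting that together, a minimal sketch (assuming names are unique in list1; list1 and list2 are placeholder names for the two lists):
var lookupDictionary = list1.ToDictionary(x => x.name);

foreach (var e in list2)
{
    if (lookupDictionary.TryGetValue(e.name, out var match))
    {
        // e and match share a name; compare salaries here
    }
}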