Find common objects in N lists - c#

I have N lists of "People". People have 2 properties: Id and Name. I want to find the People that are contained in all N lists. I only want to match on the Id.
Below is my starting point:
List<People> result = new List<People>();
//I think I only need to find items in the first list that are in the others
foreach (People person in peoplesList.First()) {
//then this is the start of iterating through the other full lists
foreach (List<People> list in peoplesList.Skip(1)) {
//Do I even need this?
}
}
I am stuck trying to wrap my head around the middle part. I only want ones that are in each list from peoplesList.Skip(1).

Mathematically speaking, you are looking for the set intersection of all your lists. Luckily, LINQ has an Intersect method, so you can intersect the sets iteratively.
List<List<People>> lists; //Initialize with your data
IEnumerable<People> commonPeople = lists.First();
foreach (List<People> list in lists.Skip(1))
{
commonPeople = commonPeople.Intersect(list);
}
//commonPeople is now an IEnumerable containing the intersection of all lists
To match on the Id only, you will need to implement IEqualityComparer<People> and pass it to Intersect:
IEqualityComparer<People> comparer = new PeopleComparer();
...
commonPeople = commonPeople.Intersect(list, comparer);
The actual implementation of IEqualityComparer<People> is left out since it's pretty simple.
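For completeness, here is a minimal sketch of what such a comparer could look like. The People class shape (public Id and Name properties) and the PeopleLists.Common helper are assumptions made for illustration; only PeopleComparer is named in the answer above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Assumed shape of the question's People type.
public class People
{
    public int Id { get; set; }
    public string Name { get; set; }
}

// Treats two People as equal when their Ids match; Name is ignored.
public sealed class PeopleComparer : IEqualityComparer<People>
{
    public bool Equals(People x, People y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x is null || y is null) return false;
        return x.Id == y.Id;
    }

    // Must agree with Equals: equal Ids give equal hash codes.
    public int GetHashCode(People obj) => obj.Id.GetHashCode();
}

public static class PeopleLists
{
    // Hypothetical helper: intersects the first list with every other list.
    // Throws if 'lists' is empty, because First() has nothing to return.
    public static List<People> Common(IEnumerable<List<People>> lists)
    {
        var comparer = new PeopleComparer();
        IEnumerable<People> common = lists.First();
        foreach (List<People> list in lists.Skip(1))
        {
            common = common.Intersect(list, comparer);
        }
        return common.ToList();
    }
}
```

Intersect keeps the elements of the first sequence, so the returned objects come from the first list, in that list's order.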

Related

Select Items based on a list containing the IDs

I have a list of Items, each containing a field of Type integer.
I want to filter my list to get only the items that match a given list of integers.
The code I have now works but I know it could be optimized.
class Item
{
public int ID;
//Other fields & methods that are irrelevant here
}
//Selection method
IEnumerable<Item> SelectItems(List<Item> allItems, List<int> toSelect)
{
return allItems.Where(x => toSelect.Contains(x.ID));
}
The problem I have is that I iterate through allItems and in each iteration I iterate through toSelect.
I have the feeling it is possible to be much more effective but I don't know how I can achieve this with Linq.
This might also be an already asked question, as I don't know what this is called in English. This feels kind of stupid because I don't know how to formulate it properly in a search engine.
You can use Join which is more efficient because it's using a set based approach:
var selectedItems = from item in allItems
join id in toSelect
on item.ID equals id
select item;
return selectedItems;
Another way which is more efficient is to use a HashSet<int> instead of a list:
IEnumerable<Item> SelectItems(List<Item> allItems, HashSet<int> toSelect)
{
return allItems.Where(x => toSelect.Contains(x.ID));
}
There are two ways to approach this.
Currently you have O(N×M) performance (where N is the size of allItems and M is the size of toSelect).
If you're just trying to reduce it easily, then you could reduce it to O(N)+O(M) by creating a hash-set of toSelect:
var matches = new HashSet<int>(toSelect);
return allItems.Where(x => matches.Contains(x.ID));
However, this is still going to be dominated by N - the size of allItems.
A better long-term approach may be to pre-index the data (and keep it indexed) by ID. So instead of allItems being a List<T>, it could be a Dictionary<int, T>. Note that building the dictionary can be expensive, so you don't want to do this every time you want to search: the key is to do this once at the start (and keep it maintained). Then the lookup becomes O(M) (the size of toSelect, which is usually small), since dictionary lookups are O(1).
IEnumerable<Item> SelectItems(Dictionary<int, Item> allItems, List<int> toSelect)
{
foreach(var id in toSelect)
{
if (allItems.TryGetValue(id, out var found))
yield return found;
}
}
(there is no need to pre-hash toSelect since we aren't checking it for Contains)
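The two approaches can be put side by side in a small sketch. The Item class with a public ID property and the ItemSelection helper names are assumptions for illustration, and BuildIndex assumes IDs are unique.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Assumed shape of the question's Item type.
public class Item
{
    public int ID { get; set; }
}

public static class ItemSelection
{
    // O(N + M): build the set once, then test each item in O(1) on average.
    public static List<Item> SelectWithHashSet(List<Item> allItems, List<int> toSelect)
    {
        var wanted = new HashSet<int>(toSelect);
        return allItems.Where(x => wanted.Contains(x.ID)).ToList();
    }

    // One-time indexing step, meant to be done once and reused.
    // ToDictionary throws ArgumentException if two items share an ID.
    public static Dictionary<int, Item> BuildIndex(IEnumerable<Item> allItems)
    {
        return allItems.ToDictionary(x => x.ID);
    }

    // O(M) per call once the index exists; unknown IDs are skipped.
    public static List<Item> SelectWithIndex(Dictionary<int, Item> index, List<int> toSelect)
    {
        var result = new List<Item>();
        foreach (int id in toSelect)
        {
            if (index.TryGetValue(id, out Item found))
                result.Add(found);
        }
        return result;
    }
}
```

Note that the two methods order their results differently: the HashSet version follows the order of allItems, while the index version follows the order of toSelect.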

nested hashset of lists?

I'm working on one of the project Euler problems, and I wanted to take the approach of creating a list of values, and adding the list to a Hashset, this way I could evaluate in constant time if the list already exists in the hashset, with the end goal to count the number of lists in the hashset for my end result.
The problem I'm having is when I create a list in this manner.
HashSet<List<int>> finalList = new HashSet<List<int>>();
List<int> candidate = new List<int>();
candidate.Add(5);
finalList.Add(candidate);
if (finalList.Contains(candidate) == false) finalList.Add(candidate);
candidate.Clear();
//try next value
Obviously the finalList[0] item is cleared when I clear the candidate and is not giving me the desired result. Is it possible to have a hashset of lists(of integers) like this? How would I ensure a new list is instantiated each time and added as a new item to the hashset, perhaps say in a for loop testing many values and possible list combinations?
Why don't you use a value which is unique for each list as a key or identifier? You could create a HashSet for your keys which will unlock your lists.
You can use a Dictionary instead. The only thing is you have to test to see if the Dictionary already has the list. This is easy to do, by creating a simple class that supports this need.
class TheSimpleListManager
{
private Dictionary<String, List<Int32>> Lists = new Dictionary<String, List<Int32>>();
public void AddList(String key, List<Int32> list)
{
if(!Lists.ContainsKey(key))
{
Lists.Add(key, list);
}
else
{
// list already exists....
}
}
}
This is just a quick sample of an approach.
To fix your clear issue: since it's an object reference, you would have to create a new List and add it to the HashSet.
You can create the new List by passing the old one into its constructor.
HashSet<List<int>> finalList = new HashSet<List<int>>();
List<int> candidate = new List<int>();
candidate.Add(5);
var newList = new List<int>(candidate);
finalList.Add(newList);
if (finalList.Contains(newList) == false) //Not required for HashSet
finalList.Add(newList);
candidate.Clear();
NOTE: HashSet internally does a contains check before adding items. In other words, even if you execute finalList.Add(newList); n times, it would add newList only once. Therefore it is not necessary to do a contains check yourself.
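One caveat worth spelling out: List<int> does not override Equals, so by default a HashSet<List<int>> compares lists by reference, and two separate lists holding the same values both get added. If the goal is to count distinct contents, a custom comparer is one option. The sketch below is an assumption about what the asker wants (same values in the same order count as equal); IntListComparer and CandidateDemo are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical comparer: two lists are equal when they hold
// the same values in the same order.
public sealed class IntListComparer : IEqualityComparer<List<int>>
{
    public bool Equals(List<int> x, List<int> y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x is null || y is null) return false;
        return x.SequenceEqual(y);
    }

    // Order-sensitive hash so that it agrees with Equals.
    public int GetHashCode(List<int> list)
    {
        unchecked
        {
            int hash = 17;
            foreach (int value in list) hash = hash * 31 + value;
            return hash;
        }
    }
}

public static class CandidateDemo
{
    // Counts candidates with distinct contents.
    public static int CountDistinct(IEnumerable<List<int>> candidates)
    {
        var seen = new HashSet<List<int>>(new IntListComparer());
        foreach (List<int> candidate in candidates)
        {
            // Store a copy so that clearing or reusing the source list
            // later does not change what is in the set.
            seen.Add(new List<int>(candidate));
        }
        return seen.Count;
    }
}
```

Lists stored in the set should not be mutated afterwards, since that would change their hash code while they sit in a bucket.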

Using C#, what's an efficient way to compare/merge two generic lists of the same type?

If I have two generic lists, List<Place>, and I want to merge all the unique Place objects into one List<Place>, based on the Place.Id property, what's a good method of doing this efficiently?
One list will always contain 50, the other list could contain significantly more.
result = list1.Union(list2, new ElementComparer());
You need to create ElementComparer to implement IEqualityComparer. E.g. see this
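A minimal sketch of such a comparer, assuming Place has a public integer Id property (the Name property and the PlaceMerge helper are made up for illustration):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Assumed shape of the question's Place type.
public class Place
{
    public int Id { get; set; }
    public string Name { get; set; }
}

// Two Place objects are considered the same element when their Ids match.
public sealed class ElementComparer : IEqualityComparer<Place>
{
    // x?.Id is null when x is null, so two nulls compare equal
    // and a null never equals a real Place.
    public bool Equals(Place x, Place y) => x?.Id == y?.Id;

    public int GetHashCode(Place obj) => obj.Id;
}

public static class PlaceMerge
{
    // Union yields each distinct element once, in first-seen order.
    public static List<Place> Merge(List<Place> list1, List<Place> list2)
    {
        return list1.Union(list2, new ElementComparer()).ToList();
    }
}
```

When both lists contain the same Id, the object from list1 is the one that ends up in the result.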
If you want to avoid having to define your own ElementComparer and just use lambda expressions, you can try the following:
List<Place> listOne = /* whatever */;
List<Place> listTwo = /* whatever */;
List<Place> listMerge = listOne.Concat(
listTwo.Where(p1 =>
!listOne.Any(p2 => p1.Id == p2.Id)
)
).ToList();
Essentially this will just concatenate the Enumerable listOne with the set of all elements in listTwo such that the elements are not in the intersection between listOne and listTwo.
Enumerable.Distinct Method
Note: .NET 3.5 & above.
If you want to emphasize efficiency, I suggest you write a small method to do the merge yourself:
List<Place> constantList; // always contains 50 elements, no duplicate elements
List<Place> targetList;
var result = new List<Place>();
// Index the small list by Id for O(1) lookups
var dict = new Dictionary<int, Place>();
foreach (var p in constantList)
dict[p.Id] = p;
result.AddRange(constantList);
// Add only the elements whose Id is not already present
foreach (var p in targetList)
{
if (!dict.ContainsKey(p.Id))
result.Add(p);
}
If speed is what you need, you need to compare using a Hashing mechanism. What I would do is maintain a Hashset of the ids that you have already read and then add the elements to the result if the id hasn't been read yet. You can do this for as many lists as you want and can return an IEnumerable instead of a list if you want to start consuming before the merge is over.
public IEnumerable<Place> Merge(params List<Place>[] lists)
{
HashSet<int> _ids = new HashSet<int>();
foreach(List<Place> list in lists)
{
foreach(Place place in list)
{
if (!_ids.Contains(place.Id))
{
_ids.Add(place.Id);
yield return place;
}
}
}
}
The fact that one list has 50 elements and the other many more makes no difference, unless you know that the lists are ordered.

Is there a LINQ method to join/concat an unknown number of lists?

I have an object that contains a list of child objects, each of which in turn contains a list of children, and so on. Using that first generation of children only, I want to combine all those lists as cleanly and cheaply as possible. I know I can do something like
public List<T> UnifiedListOfTChildren<T>()
{
List<T> newlist = new List<T>();
foreach (var childThing in myChildren)
{
newlist = newlist.Concat(childThing.TChildren).ToList();
}
return newlist;
}
but is there a more elegant, less expensive LINQ method I'm missing?
EDIT If you've landed at this question the same way I did and are new to SelectMany, I strongly recommend this visual explanation of how to use it. Comes up near the top in google results currently, but is worth skipping straight to.
var newList = myChildren.SelectMany(c => c.TChildren);
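To see what SelectMany does here, consider this small sketch. The ChildThing class with a list of strings is a stand-in for the question's types, which are not shown in full.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the question's child objects.
public class ChildThing
{
    public List<string> TChildren { get; } = new List<string>();
}

public static class Flatten
{
    // Projects each child to its TChildren list and concatenates
    // the results into one flat sequence, in order.
    public static List<string> UnifiedChildren(IEnumerable<ChildThing> myChildren)
    {
        return myChildren.SelectMany(c => c.TChildren).ToList();
    }
}
```

SelectMany is lazily evaluated, so nothing is copied until the result is enumerated (here, by ToList). Empty child lists simply contribute nothing.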

How to compare two sorted large lists efficiently in C#?

I have got two generic lists with 20,000 and 30,000 objects in each list.
class Employee
{
public string name;
public double salary;
}
List<Employee> newEmployeeList = new List<Employee>() {....}; // contains 20,000 objects
List<Employee> oldEmployeeList = new List<Employee>() {....}; // contains 30,000 objects
Lists can also be sorted by name if it improves the speed.
I want to compare these two lists to find out:
employees whose name and salary both match
employees whose name matches but whose salary does not
What is the fastest way to compare such large data lists with above conditions?
I would sort both newEmployeeList and oldEmployeeList by name, which is O(n*log(n)). Then you can use a linear algorithm to search for matches. So the total would be O(n + n*log(n)), that is O(n*log(n)), if both lists are about the same size. This should be faster than the O(n^2) "brute force" algorithm.
I'd probably recommend the two lists be stored in a Dictionary<string, Employee> based on the name to begin with, then you can iterate over the keys in one and lookup to see if they exist and the salaries match in the other. This would also save the cost of sorting them later or putting them in a more efficient structure.
This is pretty much O(n): linear to build both dictionaries, and linear to go through the keys of one and look them up in the other, since O(n + m + n) reduces to O(n).
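A rough sketch of the dictionary idea follows. It indexes only the old list and streams over the new one, which is a slight simplification of the description above. It assumes names are unique within each list (on duplicates the last entry wins) and that names compare case-sensitively; EmployeeCompare is a hypothetical helper name.

```csharp
using System;
using System.Collections.Generic;

// Same shape as the question's Employee, with public fields.
public class Employee
{
    public string name;
    public double salary;
}

public static class EmployeeCompare
{
    // Returns the new employees whose name exists in the old list,
    // split by whether the salary also matches.
    public static (List<Employee> SameSalary, List<Employee> DifferentSalary) Compare(
        IEnumerable<Employee> newEmployees, IEnumerable<Employee> oldEmployees)
    {
        // Build the index once: O(m).
        var oldByName = new Dictionary<string, Employee>(StringComparer.Ordinal);
        foreach (Employee e in oldEmployees)
            oldByName[e.name] = e; // assumption: last one wins on duplicate names

        var same = new List<Employee>();
        var different = new List<Employee>();

        // One O(1) average lookup per new employee: O(n).
        foreach (Employee e in newEmployees)
        {
            if (!oldByName.TryGetValue(e.name, out Employee old))
                continue; // name not present in the old list
            if (e.salary == old.salary) same.Add(e);
            else different.Add(e);
        }
        return (same, different);
    }
}
```

Comparing double values with == mirrors the original code; if salaries are computed rather than stored, a tolerance or the decimal type may be more appropriate.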
But, if you must use List<T> to hold the lists for other reasons, you could also use the Join() LINQ method, and build a new list with a Match field that tells you whether they were a match or mismatch...
var results = newEmployeeList.Join(
oldEmployeeList,
n => n.name,
o => o.name,
(n, o) => new
{
Name = n.name,
Salary = n.salary,
Match = o.salary == n.salary
});
You can then filter this with a Where() clause for Match or !Match.
Update: I assume (by the title of your question) that the 2 lists are already sorted. Perhaps they're stored in a database with a clustered index or something. This answer, therefore, relies on that assumption.
Here is an implementation that has O(n) complexity, and is also very fast, AND is pretty simple too.
I believe this is a variant of the Merge Algorithm.
Here's the idea:
Start enumerating both lists
Compare the 2 current items.
If they match, add to your results.
If the 1st item is "smaller", advance the 1st list.
If the 2nd item is "smaller", advance the 2nd list.
Since both lists are known to be sorted, this will work very well. This implementation assumes that name is unique in each list.
var comparer = StringComparer.OrdinalIgnoreCase;
var namesAndSalaries = new List<Tuple<Employee, Employee>>();
var namesOnly = new List<Tuple<Employee, Employee>>();
// Create 2 iterators; one for old, one for new:
using (IEnumerator<Employee> A = oldEmployeeList.GetEnumerator()) {
using (IEnumerator<Employee> B = newEmployeeList.GetEnumerator()) {
// Start enumerating both:
if (A.MoveNext() && B.MoveNext()) {
while (true) {
int compared = comparer.Compare(A.Current.name, B.Current.name);
if (compared == 0) {
// Names match
if (A.Current.salary == B.Current.salary) {
namesAndSalaries.Add(Tuple.Create(A.Current, B.Current));
} else {
namesOnly.Add(Tuple.Create(A.Current, B.Current));
}
if (!A.MoveNext() || !B.MoveNext()) break;
} else if (compared < 0) {
// Keep searching A
if (!A.MoveNext()) break;
} else {
// Keep searching B
if (!B.MoveNext()) break;
}
}
}
}
}
One of the fastest possible solutions on sorted lists is to use BinarySearch to find an item in the other list.
But as others have mentioned, you should measure it against your project requirements, as performance often tends to be a subjective thing.
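As a sketch of the BinarySearch idea: List<T>.BinarySearch accepts an IComparer<T>, so a probe object carrying only the name can be used for the lookup. The ByName comparer and SortedLookup helper are hypothetical names, and the list must already be sorted with the same comparer or the result is undefined.

```csharp
using System;
using System.Collections.Generic;

// Same shape as the question's Employee, with public fields.
public class Employee
{
    public string name;
    public double salary;
}

// Orders employees by name only, using ordinal string comparison.
public sealed class ByName : IComparer<Employee>
{
    public int Compare(Employee x, Employee y)
    {
        return string.CompareOrdinal(x.name, y.name);
    }
}

public static class SortedLookup
{
    private static readonly ByName Comparer = new ByName();

    // Assumes 'sortedOld' is sorted with ByName. Each call is O(log m),
    // so checking n new employees costs O(n log m) overall.
    public static Employee FindByName(List<Employee> sortedOld, string name)
    {
        var probe = new Employee { name = name };
        int index = sortedOld.BinarySearch(probe, Comparer);
        // BinarySearch returns a negative number when nothing matches.
        return index >= 0 ? sortedOld[index] : null;
    }
}
```

Compared with the merge-style walk, this only needs one of the two lists to be sorted, at the cost of the extra log factor.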
You could create a Dictionary using
var lookupDictionary = list1.ToDictionary(x=>x.name);
That would give you close to O(1) lookup and a close to O(n) behavior if you're looking up values from a loop over the other list.
(I'm assuming here that ToDictionary is O(n), which would make sense with a straightforward implementation, but I have not tested this to be the case.)
This would make for a very straightforward algorithm, and I think going below O(n) with two unsorted lists is pretty hard.
