So I have a couple of different lists that I'm trying to process and merge into one list.
Below is a snippet of code; I want to see if there is a better way of doing this.
The reason I'm asking is that some of these lists are rather large, so I want to see if there is a more efficient way.
As you can see, I'm looping through a list, and the first thing I do is check whether the CompanyId exists in the list. If it does, I find the item in the list that I'm going to process.
pList is my processing list. I'm adding the values from my different lists into this list.
I'm wondering if there is a "better way" of accomplishing the Exists and Find.
bool tstFind = false;
foreach (parseAC item in pACList)
{
    tstFind = pList.Exists(x => (x.CompanyId == item.key.ToString()));
    if (tstFind == true)
    {
        pItem = pList.Find(x => (x.CompanyId == item.key.ToString()));
        //Processing done here. pItem gets updated here
        ...
    }
}
Just as a side note, I'm going to be researching a way to use joins to see if that is faster. But I haven't gotten there yet. The above code is my first cut at solving this issue and it appears to work. However, since I have the time I want to see if there is a better way still.
Any input is greatly appreciated.
Time Findings:
My current Find and Exists code takes about 84 minutes to loop through the 5.5M items in the pACList.
Using pList.FirstOrDefault(x => x.CompanyId == item.key.ToString()); takes 54 minutes to loop through the 5.5M items in pACList.
You can retrieve the item with FirstOrDefault instead of searching twice (once to check whether the item exists, and again to get it):
var pItem = pList.FirstOrDefault(x => x.CompanyId == item.key.ToString());
if (pItem != null)
{
    //Processing done here. pItem gets updated here
}
Yes, use a hash table so that your algorithm is O(n + m) instead of the O(n*m) it is right now.
var pListByCompanyId = pList.ToDictionary(x => x.CompanyId);
foreach (parseAC item in pACList)
{
    // TryGetValue checks for the key and fetches the value in a single lookup
    if (pListByCompanyId.TryGetValue(item.key.ToString(), out var pItem))
    {
        //Processing done here. pItem gets updated here
        ...
    }
}
You can iterate through a filtered list using LINQ (this still scans pList for every item, so it does not change the complexity):
foreach (parseAC item in pACList.Where(i=>pList.Any(x => (x.CompanyId == i.key.ToString()))))
{
pItem = pList.Find(x => (x.CompanyId == item.key.ToString()));
//Processing done here. pItem gets updated here
...
}
Using lists for this type of operation is O(M×N) (M is the count of pACList, N is the count of pList). Additionally, you are searching pList twice for every item. To avoid that, use pList.FirstOrDefault as recommended by @lazyberezovsky.
However, if possible I would avoid using lists. A Dictionary indexed by the key you're searching on would greatly improve the lookup time.
Doing a linear search on the list for each item in another list is not efficient for large data sets. It is preferable to put the keys into a lookup table or Dictionary that can be searched much more efficiently, which lets you join the two lists. You don't even need to code this yourself; what you want is a Join operation: all of the pairs of items, one from each sequence, that map to the same key.
Either pull out the implementation of the method below, or change Foo and Bar to the appropriate types and use it as a method.
public static IEnumerable<Tuple<Bar, Foo>> Merge(IEnumerable<Bar> pACList
, IEnumerable<Foo> pList)
{
return pACList.Join(pList, item => item.Key.ToString()
, item => item.CompanyID.ToString()
, (a, b) => Tuple.Create(a, b));
}
You can use the results of this call to merge the two items together, as they will have the same key.
Internally the method will create a lookup table that allows for efficient searching before actually doing the searching.
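To make the "lookup table, then search" idea concrete, here is a minimal sketch of a hash join written by hand with ToLookup. The class and method names (LookupJoinSketch, HashJoin) are made up for illustration and are not part of LINQ; it only approximates what Join does and is not a replacement for it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LookupJoinSketch
{
    // Pairs every outer item with every inner item that has the same key.
    // Index one side once, then probe the index for each item of the other side.
    public static List<(TOuter Outer, TInner Inner)> HashJoin<TOuter, TInner, TKey>(
        IEnumerable<TOuter> outer,
        IEnumerable<TInner> inner,
        Func<TOuter, TKey> outerKey,
        Func<TInner, TKey> innerKey)
    {
        // One pass over the inner sequence builds the index; duplicate keys are allowed.
        ILookup<TKey, TInner> index = inner.ToLookup(innerKey);

        var pairs = new List<(TOuter Outer, TInner Inner)>();
        foreach (TOuter o in outer)
        {
            // Indexing a lookup with a missing key yields an empty sequence, not an exception.
            foreach (TInner i in index[outerKey(o)])
            {
                pairs.Add((o, i));
            }
        }
        return pairs;
    }
}
```

With the lists from the question, a call might look like LookupJoinSketch.HashJoin(pACList, pList, a => a.key.ToString(), p => p.CompanyId). Unlike ToDictionary, ToLookup does not throw when two items share a key.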
Where N is the size of pList and n is the size of pACList, the options are:
Convert pList to a HashSet of CompanyId values, then query pHashSet.Contains(). Complexity: O(N) + O(n).
Sort pList on CompanyId and use a binary search (List<T>.BinarySearch or Array.BinarySearch). Complexity: O(N log N) + O(n log N).
If the maximum company id is not prohibitively large, simply create an array in which the item with company id i sits at the i-th position. Nothing can be faster.
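The sort-then-binary-search option could be sketched as follows. CompanyRecord is a hypothetical stand-in for whatever pList actually holds, and the ordinal string ordering is an assumption; the search must use the same comparer as the sort.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the items stored in pList.
public sealed class CompanyRecord
{
    public string CompanyId { get; set; }
    public int Total { get; set; }
}

public static class SortedLookupSketch
{
    // Orders records by CompanyId using ordinal string comparison.
    private static readonly Comparer<CompanyRecord> ById =
        Comparer<CompanyRecord>.Create((a, b) => string.CompareOrdinal(a.CompanyId, b.CompanyId));

    // Sort once, O(N log N), before doing any lookups.
    public static void Prepare(List<CompanyRecord> records)
    {
        records.Sort(ById);
    }

    // O(log N) per lookup; returns null when no record has the given id.
    public static CompanyRecord Find(List<CompanyRecord> sortedRecords, string companyId)
    {
        var probe = new CompanyRecord { CompanyId = companyId };
        int index = sortedRecords.BinarySearch(probe, ById);
        return index >= 0 ? sortedRecords[index] : null;
    }
}
```

Call Prepare once, then Find for each item in pACList. If several records share a CompanyId, BinarySearch returns one of them, not necessarily the first.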
Related
I have two lists:
reports count 6000 and
invoices count 6000
I have loop like this:
foreach (var item in reports)
{
item.SettlementProcessStatus =
invoices.Any(t => t.InvoiceId == item.RelatedInvoiceId)
? SettlementProcessStatus.Done
: SettlementProcessStatus.Error;
}
At first this code runs quickly, but the higher the index, the slower it becomes.
What solutions are recommended to optimize it?
The performance problem is that your code has to iterate at least partially through invoices for each of your reports. To avoid this, first create a HashSet with all your invoice IDs and then search in there instead of your list.
Assuming you have integer IDs you could do:
var invoiceIds = new HashSet<int>(invoices.Select(x => x.InvoiceId));
foreach (var item in reports)
item.SettlementProcessStatus = invoiceIds.Contains(item.RelatedInvoiceId) ?
SettlementProcessStatus.Done :
SettlementProcessStatus.Error;
Instead of using a list for invoices, use a dictionary that has InvoiceId as its key.
That avoids iterating through all the list items for each report.
In the foreach loop, I want to add Products to a List, but I don't want this List to contain duplicate Products. Currently I have two ideas.
1/ In the loop, before adding the Product to the List, I check whether the Product already exists in the List; if it doesn't, I add it.
foreach (var product in products)
{
// code logic
if(!listProduct.Any(x => x.Id == product.Id))
{
listProduct.Add(product);
}
}
2/ In the loop, I add all the Products to the List even if there are duplicates. Then, outside the loop, I use Distinct to remove the duplicate records.
foreach (var product in products)
{
// code logic
listProduct.Add(product);
}
listProduct = listProduct.Distinct().ToList();
I wonder which of these two ways is more efficient. Or are there any other ideas for adding records to the List while avoiding duplicates?
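I'd go for a third approach: the HashSet. It has a constructor overload that accepts an IEnumerable. This constructor removes duplicates (as determined by the element type's Equals and GetHashCode, so Product needs value-based equality for this to work):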
If the input collection contains duplicates, the set will contain one
of each unique element. No exception will be thrown.
Source: HashSet<T> Constructor
usage:
List<Product> myProducts = ...;
var setOfProducts = new HashSet<Product>(myProducts);
After removing duplicates, setOfProducts[4] has no proper meaning.
Therefore a HashSet is not an IList<Product> but an ICollection<Product>: you can Count / Add / Remove, etc., almost everything you can do with a List. The only thing you can't do is fetch by index.
First, take the elements that are not already in the collection:
var newProducts = products.Where(x => !listProduct.Any(y => x.Id == y.Id));
Then just add them using AddRange:
listProduct.AddRange(newProducts);
Or you can use a foreach loop too:
foreach (var product in newProducts)
{
listProduct.Add(product);
}
One more easy solution, with no need for Distinct:
var newProductList = products.Union(listProduct).ToList();
However, Union may not perform as well.
From what you have included, you are storing everything in memory. If this is the case, or you are persisting only after you have it ready you can consider using BinarySearch:
https://msdn.microsoft.com/en-us/library/w4e7fxsh(v=vs.110).aspx and you also get an ordered list at the end. If ordering is not important, you can use HashSet, which is very fast, and meant specially for this purpose.
Check also: https://www.dotnetperls.com/hashset
This should be pretty fast and take care of any ordering:
// build a HashSet of your primary keys type (I'm assuming integers here) containing all your list elements' keys
var hashSet = new HashSet<int>(listProduct.Select(p => p.Id));
// add all items from the products list whose Id can be added to the hashSet (so it's not a duplicate)
listProduct.AddRange(products.Where(p => hashSet.Add(p.Id)));
What you might want to consider doing instead, though, is implementing IEquatable<Product> and overriding GetHashCode() on your Product type which would make the above code a little easier and put the equality checks where they should be (inside the respective type):
var hashSet = new HashSet<Product>(listProduct);
listProduct.AddRange(products.Where(hashSet.Add));
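For reference, a minimal sketch of what implementing IEquatable<Product> might look like. The Product shape here (an int Id and a Name) is an assumption, as is the choice that only Id takes part in equality; AddMissing is an illustrative helper name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical Product type; only Id takes part in equality.
public sealed class Product : IEquatable<Product>
{
    public int Id { get; set; }
    public string Name { get; set; }

    public bool Equals(Product other) => other != null && Id == other.Id;
    public override bool Equals(object obj) => Equals(obj as Product);
    public override int GetHashCode() => Id.GetHashCode();
}

public static class ProductMergeSketch
{
    // Appends to listProduct only those products whose Id is not already present.
    public static void AddMissing(List<Product> listProduct, IEnumerable<Product> products)
    {
        var seen = new HashSet<Product>(listProduct);
        // HashSet<T>.Add returns false for an element that is already in the set.
        // ToList() materializes the filter before the list is modified.
        listProduct.AddRange(products.Where(seen.Add).ToList());
    }
}
```

Because the hash code depends on Id, changing Id while a Product sits in a HashSet or Dictionary will break lookups, so treat Id as fixed once assigned.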
Current Code:
For each element in MapEntryTable, check the properties IsDisplayedColumn and IsReturnColumn, and if either is true, add the element to the corresponding list. The running time is O(n). Many elements have both properties set to false, so they will not be added to any of the lists in the loop.
foreach (var mapEntry in MapEntryTable)
{
if (mapEntry.IsDisplayedColumn)
Type1.DisplayColumnId.Add(mapEntry.OutputColumnId);
if (mapEntry.IsReturnColumn)
Type1.ReturnColumnId.Add(mapEntry.OutputColumnId);
}
Following is the Linq version of doing the same:
MapEntryTable.Where(x => x.IsDisplayedColumn == true).ToList().ForEach(mapEntry => Type1.DisplayColumnId.Add(mapEntry.OutputColumnId));
MapEntryTable.Where(x => x.IsReturnColumn == true).ToList().ForEach(mapEntry => Type1.ReturnColumnId.Add(mapEntry.OutputColumnId));
I am converting all such foreach code to LINQ as I am learning it, but my questions are:
Do I get any advantage from the LINQ conversion in this case, or is it a disadvantage?
Is there a better way to do the same using LINQ?
UPDATE:
Consider the case where, out of 1000 elements in the list, 80% have both properties false. Does Where give me the benefit of quickly finding the elements that meet a given condition?
Type1 is a custom type with set of List<int> structures, DisplayColumnId and ReturnColumnId
ForEach isn't a LINQ method. It's a method of List. And not only is it not a part of LINQ, it's very much against the very values and patterns of LINQ. Eric Lippert explains this in a blog post that was written when he was a principal developer on the C# compiler team.
Your "LINQ" approach also:
Completely unnecessarily copies all of the items to be added into a list, which is both wasteful in time and memory and also conflicts with LINQ's goals of deferred execution when executing queries.
Isn't actually a query with the exception of the Where operator. You're acting on the items in the query, rather than performing a query. LINQ is a querying tool, not a tool for manipulating data sets.
Iterates the source sequence twice. This may or may not be a problem, depending on what the source sequence actually is and what the costs of iterating it are.
A solution that uses LINQ only for what it is designed for would look like this:
foreach (var mapEntry in MapEntryTable.Where(entry => entry.IsDisplayedColumn))
    list1.DisplayColumnId.Add(mapEntry.OutputColumnId);
foreach (var mapEntry in MapEntryTable.Where(entry => entry.IsReturnColumn))
    list2.ReturnColumnId.Add(mapEntry.OutputColumnId);
I would say stick with the original foreach loop, since it iterates through the list only once.
Also, your LINQ should look more like this:
list1.DisplayColumnId.AddRange(MapEntryTable.Where(x => x.IsDisplayedColumn).Select(mapEntry => mapEntry.OutputColumnId));
list2.ReturnColumnId.AddRange(MapEntryTable.Where(x => x.IsReturnColumn).Select(mapEntry => mapEntry.OutputColumnId));
The performance of foreach and List<T>.ForEach is almost exactly the same, within nanoseconds of each other, assuming you have the same internal logic in the loop in both versions when testing.
However, a for loop can outperform both, sometimes by a large margin: for (int i = 0; i < count; ++i) doesn't rely on an IEnumerable implementation (and its enumerator overhead). The for loop compiles down to simple index-and-jump code. It maintains an index, and then it's up to you to retrieve the item by its index in the loop.
Using List<T>.ForEach does have a big disadvantage, though: you cannot break out of the loop. If you need to do that, you have to maintain a boolean like "breakLoop = false", set it to true, and have each subsequent call return immediately if breakLoop is true, which performs badly. Secondly, you cannot use continue; instead you use "return".
I never use List<T>.ForEach.
If you are dealing with LINQ, e.g.
List<Thing> things = .....;
var oldThings = things.Where(p => p.DateTime.Year < DateTime.Now.Year);
That will iterate internally (lazily, when oldThings is enumerated) and give you back only the items with a year less than the current year. Cool.
But if I am doing this:
List<Thing> things = new List<Thing>();
foreach(XElement node in Results) {
things.Add(new Thing(node));
}
I don't need to use a ForEach loop. Even if I did...
foreach(var node in thingNodes.Where(p => p.NodeType == "Thing")) {
    if (node.Ignore) {
        continue;
    }
    things.Add(node);
}
even though I could write that more cleanly like
foreach(var node in thingNodes.Where(p => p.NodeType == "Thing" && !p.Ignore)) {
    things.Add(node);
}
There is no real reason I can think of to do this:
things.ForEach(thing => {
//do something
//can't break
//can't continue
return; //<- continue
});
And if I want the fastest loop possible,
for (int i = 0; i < things.Count; ++i) {
var thing = things[i];
//do something
}
Will be faster.
Your LINQ isn't quite right as you're converting the results of Where to a List and then pseudo-iterating over those results with ForEach to add to another list. Use ToList or AddRange for converting or adding sequences to lists.
For example, to overwrite list1 (if it were actually a List<T>):
list1 = MapEntryTable.Where(x => x.IsDisplayedColumn == true)
.Select(mapEntry => mapEntry.OutputColumnId).ToList();
or to append:
list1.AddRange(MapEntryTable.Where(x => x.IsDisplayedColumn == true)
.Select(mapEntry => mapEntry.OutputColumnId));
In C#, to do what you want functionally in one call, you have to write your own partition method. If you are open to using F#, you can use List.partition.
https://msdn.microsoft.com/en-us/library/ee353782.aspx
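A hand-written partition method in C# could be sketched as below. Partition and PartitionSketch are illustrative names, not framework APIs. Note that it sends each item to exactly one of two lists, so it fits an either/or split; two independent flags such as IsDisplayedColumn and IsReturnColumn would still need the plain foreach or two filters.

```csharp
using System;
using System.Collections.Generic;

public static class PartitionSketch
{
    // Splits source into items that satisfy the predicate and items that do not,
    // enumerating the source only once.
    public static (List<T> Matching, List<T> Rest) Partition<T>(
        this IEnumerable<T> source, Func<T, bool> predicate)
    {
        var matching = new List<T>();
        var rest = new List<T>();
        foreach (T item in source)
        {
            (predicate(item) ? matching : rest).Add(item);
        }
        return (matching, rest);
    }
}
```

Usage would be along the lines of var (displayed, others) = MapEntryTable.Partition(e => e.IsDisplayedColumn);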
What I am trying to do is implement a heuristic approach to an NP-complete problem: I have a list of objects (matches), each of which has a double score. I take the first element in the list sorted by score descending and then remove it from the list. Then all elements bound to the first one are to be removed. I iterate through the list until I have no more elements.
I need a data structure which can efficiently solve this problem, so basically it should have the following properties:
1. Generic
2. Is always sorted
3. Has a fast key access
Right now SortedSet<T> looks like the best fit.
The question is: is it the best choice in my case?
List<T> result = new List<T>();
while (sortedItems.Any())
{
var first = sortedItems.First();
result.Add(first);
sortedItems.Remove(first);
foreach (var dependentFirst in first.DependentElements)
{
sortedItems.Remove(dependentFirst);
}
}
What I need is something like sorted hash table.
I assume you're not just wanting to clear the list, but you want to do something with each item as it's removed.
var toDelete = new HashSet<T>();
foreach (var item in sortedItems)
{
if (!toDelete.Contains(item))
{
toDelete.Add(item);
// do something with item here
}
foreach (var dependentFirst in item.DependentElements)
{
if (!toDelete.Contains(dependentFirst))
{
toDelete.Add(dependentFirst);
// do something with dependentFirst here
}
}
}
sortedItems.RemoveAll(i => toDelete.Contains(i));
I think you should use two data structures - a heap and a set - heap for keeping the sorted items, set for keeping the removed items. Fill the heap with the items, then remove the top one, and add it and all its dependents to the set. Remove the second one - if it's already in the set, ignore it and move to the third, otherwise add it and its dependents to the set.
Each time you add an item to the set, also do whatever it is you plan to do with the items.
The complexity here is O(NlogN), you won't get any better than this, as you have to sort the list of items anyway. If you want to get better performance, you can add a 'Removed' boolean to each item, and set it to true instead of using a set to keep track of the removed items. I don't know if this is applicable to you.
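Here is a minimal sketch of that idea. It uses a one-off sort in place of a heap (the order is the same because nothing is inserted after the start), and a HashSet for the removed items. ScoredMatch and its members are hypothetical names modelled on the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical match type: a score plus the matches that conflict with it.
public sealed class ScoredMatch
{
    public string Name { get; set; }
    public double Score { get; set; }
    public List<ScoredMatch> DependentElements { get; } = new List<ScoredMatch>();
}

public static class GreedySelectionSketch
{
    // Takes matches in descending score order, skipping any that were already
    // ruled out by an earlier pick. Sorting dominates, so the cost is
    // O(N log N) plus the total number of dependency links.
    public static List<ScoredMatch> Select(IEnumerable<ScoredMatch> matches)
    {
        var removed = new HashSet<ScoredMatch>(); // reference equality by default
        var result = new List<ScoredMatch>();
        foreach (ScoredMatch m in matches.OrderByDescending(x => x.Score))
        {
            // Add returns false if m was already removed as someone's dependent.
            if (!removed.Add(m)) continue;
            result.Add(m);
            foreach (ScoredMatch dependent in m.DependentElements)
            {
                removed.Add(dependent);
            }
        }
        return result;
    }
}
```

Dependents of a skipped item are deliberately left alone, which mirrors the loop in the question: only the dependents of an item that was actually picked get removed.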
If I'm not mistaken, you want something like this:
var dictionary = new Dictionary<string, int>();
dictionary.Add("car", 2);
dictionary.Add("apple", 1);
dictionary.Add("zebra", 0);
dictionary.Add("mouse", 5);
dictionary.Add("year", 3);
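// Note: Dictionary<TKey, TValue> does not guarantee enumeration order; SortedDictionary<TKey, TValue> keeps its keys sorted.
dictionary = dictionary.OrderBy(o => o.Key).ToDictionary(o => o.Key, o => o.Value);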
I have got two generic lists with 20,000 and 30,000 objects in each list.
class Employee
{
string name;
double salary;
}
List<Employee> newEmployeeList = new List<Employee>() {....} // contains 20,000 objects
List<Employee> oldEmployeeList = new List<Employee>() {....} // contains 30,000 objects
Lists can also be sorted by name if it improves the speed.
I want to compare these two lists to find out
employees whose name and salary matching
employees whose name is matching but not salary
What is the fastest way to compare such large data lists with above conditions?
I would sort both newEmployeeList and oldEmployeeList lists by name - O(n*log(n)). And then you can use linear algorithm to search for matches. So the total would be O(n+n*log(n)) if both lists are about the same size. This should be faster than O(n^2) "brute force" algorithm.
I'd probably recommend the two lists be stored in a Dictionary<string, Employee> based on the name to begin with, then you can iterate over the keys in one and lookup to see if they exist and the salaries match in the other. This would also save the cost of sorting them later or putting them in a more efficient structure.
This is pretty much O(n): linear to build both dictionaries, and linear to go through the keys of one and look them up in the other, since O(n + m + n) reduces to O(n) when the lists are of similar size.
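A minimal sketch of the dictionary approach, assuming Employee exposes public Name and Salary properties (the class in the question has private fields) and that names are unique within the old list. It indexes only the old list and streams over the new one; the method and class names are illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical public shape of Employee.
public sealed class Employee
{
    public string Name { get; set; }
    public double Salary { get; set; }
}

public static class EmployeeCompareSketch
{
    // Returns the new employees whose name and salary both match an old entry,
    // and those whose name matches but whose salary differs.
    public static (List<Employee> SameSalary, List<Employee> DifferentSalary) Compare(
        IEnumerable<Employee> oldEmployees, IEnumerable<Employee> newEmployees)
    {
        // ToDictionary throws ArgumentException if two old employees share a name.
        Dictionary<string, Employee> oldByName = oldEmployees.ToDictionary(e => e.Name);

        var same = new List<Employee>();
        var different = new List<Employee>();
        foreach (Employee current in newEmployees)
        {
            // Employees with no counterpart in the old list are skipped.
            if (!oldByName.TryGetValue(current.Name, out Employee previous)) continue;
            (previous.Salary == current.Salary ? same : different).Add(current);
        }
        return (same, different);
    }
}
```

Comparing doubles with == is exact; if the salaries come from calculations rather than stored values, a tolerance or the decimal type may be a better fit.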
But, if you must use List<T> to hold the lists for other reasons, you could also use the Join() LINQ method, and build a new list with a Match field that tells you whether they were a match or mismatch...
var results = newEmployeeList.Join(
oldEmployeeList,
n => n.Name,
o => o.Name,
(n, o) => new
{
Name = n.Name,
Salary = n.Salary,
Match = o.Salary == n.Salary
});
You can then filter this with a Where() clause for Match or !Match.
Update: I assume (by the title of your question) that the 2 lists are already sorted. Perhaps they're stored in a database with a clustered index or something. This answer, therefore, relies on that assumption.
Here is an implementation that has O(n) complexity, and is also very fast, AND is pretty simple too.
I believe this is a variant of the Merge Algorithm.
Here's the idea:
Start enumerating both lists
Compare the 2 current items.
If they match, add to your results.
If the 1st item is "smaller", advance the 1st list.
If the 2nd item is "smaller", advance the 2nd list.
Since both lists are known to be sorted, this will work very well. This implementation assumes that name is unique in each list.
var comparer = StringComparer.OrdinalIgnoreCase;
var namesAndSalaries = new List<Tuple<Employee, Employee>>();
var namesOnly = new List<Tuple<Employee, Employee>>();
// Create 2 iterators; one for old, one for new:
using (IEnumerator<Employee> A = oldEmployeeList.GetEnumerator()) {
using (IEnumerator<Employee> B = newEmployeeList.GetEnumerator()) {
// Start enumerating both:
if (A.MoveNext() && B.MoveNext()) {
while (true) {
int compared = comparer.Compare(A.Current.name, B.Current.name);
if (compared == 0) {
// Names match
if (A.Current.salary == B.Current.salary) {
namesAndSalaries.Add(Tuple.Create(A.Current, B.Current));
} else {
namesOnly.Add(Tuple.Create(A.Current, B.Current));
}
if (!A.MoveNext() || !B.MoveNext()) break;
} else if (compared < 0) {
// Keep searching A
if (!A.MoveNext()) break;
} else {
// Keep searching B
if (!B.MoveNext()) break;
}
}
}
}
}
One of the fastest possible solutions on sorted lists is to use BinarySearch to find an item in the other list.
But as others have mentioned, you should measure it against your project requirements, as performance often tends to be a subjective thing.
You could create a Dictionary using
var lookupDictionary = list1.ToDictionary(x=>x.name);
That would give you close to O(1) lookup and a close to O(n) behavior if you're looking up values from a loop over the other list.
(I'm assuming here that ToDictionary is O(n), which would make sense with a straightforward implementation, but I have not tested this to be the case.)
This would make for a very straightforward algorithm, and I'm thinking that going below O(n) with two unsorted lists is pretty hard.