Fastest way to match members of two lists

I have two lists, orderHeaders and orderLines. They come from two related tables in the database, but I have to pull them separately as two different lists and then map them to each other afterwards. I have a working solution, but the performance is disappointing given that I have around 400k headers and 1 million+ lines.
Here's my code. Is this the standard way to iterate over two lists and match their members, or is there a more optimized approach in C#?
var OutboundOrderHeaders =
    DbContext.Context.Database.SqlQuery<OutboundOrderDTO>(queryString, parameter);
var OutboundOrderHeadersList = OutboundOrderHeaders.ToList();
var OutboundOrderLine =
    DbContext.Context.Database.SqlQuery<OutboundOrderLineDTO>(queryStringLine, parameter2);
var OutboundOrderLineList = OutboundOrderLine.ToList();
for(var i = 0; i < OutboundOrderHeadersList.Count(); i++)
{
    var LineToAdd = OutboundOrderLineList
        .Where(x => x.OutboundNumber == OutboundOrderHeadersList[i].OutboundNumber)
        .ToList();
    OutboundOrderHeadersList[i].OrderLine = LineToAdd;
}
return OutboundOrderHeadersList;

As noted in comments, I'd really try hard to do this in the database rather than in memory. But to do it in memory, ToLookup is probably the right way to go:
// Note: here I've used outboundOrderLines where you've got OutboundOrderLineList,
// and orderHeaders where you've got OutboundOrderHeadersList, as simpler
// and more conventional variable names.
var linesByOutboundNumber = outboundOrderLines.ToLookup(line => line.OutboundNumber);
foreach (var orderHeader in orderHeaders)
{
    orderHeader.OrderLine = linesByOutboundNumber[orderHeader.OutboundNumber].ToList();
}
This builds a map going from outbound number to "all the lines with that outbound number" by going through outboundOrderLines once, rather than iterating over it for every order header.
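For anyone who wants to try the lookup approach in isolation, here is a minimal self-contained sketch. The OrderHeader and OrderLine classes and the OrderMatcher helper are hypothetical stand-ins for the DTOs in the question, not the asker's real types.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for OutboundOrderLineDTO / OutboundOrderDTO.
public class OrderLine
{
    public string OutboundNumber { get; set; }
    public string Sku { get; set; }
}

public class OrderHeader
{
    public string OutboundNumber { get; set; }
    public List<OrderLine> OrderLine { get; set; } = new List<OrderLine>();
}

public static class OrderMatcher
{
    public static void AttachLines(List<OrderHeader> orderHeaders, List<OrderLine> outboundOrderLines)
    {
        // A single pass over the lines builds the lookup (outbound number -> lines).
        var linesByOutboundNumber = outboundOrderLines.ToLookup(line => line.OutboundNumber);

        foreach (var orderHeader in orderHeaders)
        {
            // Indexing a lookup with a key it does not contain yields an empty
            // sequence rather than throwing, so headers without lines get an empty list.
            orderHeader.OrderLine = linesByOutboundNumber[orderHeader.OutboundNumber].ToList();
        }
    }
}
```

With H headers and L lines this does roughly H + L work instead of H * L, which is where the gain over the nested Where comes from.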

Related

EF - A proper way to search several items in database

I have about 100 items (allRights) in the database and about 10 ids to search for (inputRightsIds). Which is better: first get all the rights and then search the items in memory (Variant 1), or make 10 checking requests to the database (Variant 2)?
Here is some example code:
DbContext db = new DbContext();
int[] inputRightsIds = new int[10]{...};
Variant 1
var allRights = db.Rights.ToList();
foreach(var right in allRights)
{
    for(int i = 0; i < inputRightsIds.Length; i++)
    {
        if(inputRightsIds[i] == right.Id)
        {
            // Do something
        }
    }
}
Variant 2
for(int i = 0; i < inputRightsIds.Length; i++)
{
    if(db.Rights.Any(r => r.Id == inputRightsIds[i]))
    {
        // Do something
    }
}
Thanks in advance!

As others have already stated, you should do the following.
var matchingIds = from r in db.Rights
                  where inputRightsIds.Contains(r.Id)
                  select r.Id;
foreach(var id in matchingIds)
{
    // Do something
}
But this is different from both of your approaches. Your first approach makes one SQL call to the DB that returns more results than you are interested in. The second makes multiple SQL calls, each returning part of the information you want. The query above makes one SQL call to the DB and returns only the data you are interested in. This is the best approach because it removes both bottlenecks: making multiple calls to the DB and having too much data returned.
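The shape of that query can be tried without a database. The sketch below is only an illustration: the Right class and RightsQuery helper are hypothetical, and an in-memory IQueryable stands in for db.Rights. Against a real EF context, a Contains filter over a local collection is normally translated into a SQL IN (...) clause, but that translation is not exercised here.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical entity; only the Id matters for this sketch.
public class Right
{
    public int Id { get; set; }
}

public static class RightsQuery
{
    // 'rights' stands in for db.Rights. The ids are taken as IEnumerable<int>
    // (an assumption; the question uses int[]) so the call binds to Enumerable.Contains.
    public static List<int> MatchingIds(IQueryable<Right> rights, IEnumerable<int> inputRightsIds)
    {
        var matchingIds = from r in rights
                          where inputRightsIds.Contains(r.Id)
                          select r.Id;

        // The query runs once, here, when it is enumerated.
        return matchingIds.ToList();
    }
}
```

Passing something like `someList.AsQueryable()` as `rights` is enough to experiment with the filter locally.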

You can use the following:
db.Rights.Where(right => inputRightsIds.Contains(right.Id));

They should run at very similar speeds, since both must enumerate the arrays the same number of times. There might be subtle differences in speed between the two depending on the input data, but in general I would go with Variant 2. I think you should almost always prefer LINQ over manual enumeration when possible. Also consider using the following LINQ statement to simplify the whole search to a single line.
var matches = db.Rights.Where(r => inputRightsIds.Contains(r.Id));
...//Do stuff with matches

Don't forget to bring the items into memory if you need to process the list further:
var itemsFromDatabase = db.Rights.Where(r => inputRightsIds.Contains(r.Id)).ToList();
Or you could even enumerate the collection and do something with each item:
db.Rights.Where(r => inputRightsIds.Contains(r.Id)).ToList().ForEach(item => {
    //your code here
});

Set of values in one or other list but not both

I am diffing two dictionaries, and I want the set of all keys that are in one or the other dictionary but not both (I don't care about order). Since this only involves the keys, we can do this with the IEnumerables of the keys of the dictionaries.
The easy way, involving 2 passes:
return first.Keys.Except(second.Keys).Concat(second.Keys.Except(first.Keys));
We can concat because the Excepts guarantee the lists will be entirely different.
But I sense there is a better, linqy way to do it.

I prefer a non-LINQy way:
var set = new HashSet<KeyType>(first.Keys);
set.SymmetricExceptWith(second.Keys);
Here's an alternative (but not better) LINQy way to yours:
var result = first.Keys.Union(second.Keys)
.Except(first.Keys.Intersect(second.Keys));
If you're looking for something (possibly) more performant:
var result = new HashSet<KeyType>();
foreach(var firstKey in first.Keys)
{
    if(!second.ContainsKey(firstKey))
        result.Add(firstKey);
}
foreach(var secondKey in second.Keys)
{
    if(!first.ContainsKey(secondKey))
        result.Add(secondKey);
}
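The HashSet approach from the first answer above can be wrapped in a small generic helper. This is a sketch; DictionaryDiff and KeysInExactlyOne are made-up names for illustration.

```csharp
using System.Collections.Generic;

public static class DictionaryDiff
{
    // Returns the keys that appear in exactly one of the two dictionaries.
    public static HashSet<TKey> KeysInExactlyOne<TKey, TValue>(
        IDictionary<TKey, TValue> first,
        IDictionary<TKey, TValue> second)
    {
        // Start with every key of the first dictionary...
        var set = new HashSet<TKey>(first.Keys);

        // ...then remove the keys also found in the second and add the keys
        // found only in the second. The set is modified in place.
        set.SymmetricExceptWith(second.Keys);
        return set;
    }
}
```

For dictionaries with keys {a, b} and {b, c}, the helper returns the set {a, c}.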

Efficient and Accurate way to add items into a TList

I have 2 lists, and the entities in those lists each have an ID: for instance
Client.ID, where ID is a property of Client, and PopulationClient.ID, where ID is a property of the class PopulationClient. So I have two lists:
TList<Client> clients = clientsHelper.GetAllClients();
TList<PopulationClient> populationClients = populationHelper.GetAllPopulationClients();
Then I have a temp list:
TList<Client> temp_list = new TList<Client>();
The problem I am having is doing this efficiently and correctly. This is what I have tried, but I am not getting the correct results:
foreach(PopulationClient pClients in populationClients)
{
    foreach(Client client in clients)
    {
        if(pClients.ID != client.ID && !InTempList(temp_list, pClients.ID))
        {
            temp_list.Add(client);
        }
    }
}
public bool InTempList(TList<Client> c, int id)
{
    bool IsInList = false;
    foreach(Client client in c)
    {
        if(client.ID == id)
        {
            IsInList = true;
        }
    }
    return IsInList;
}
I am trying to do it right but cannot come up with a good way of doing it. This does not return the correct data because, with the condition in the first loop at the top, at some point one or more IDs differ from the other one, so the item gets added anyway. What should I check here so that I only end up with a list of the clients that are in population clients but not in Clients?
For instance, population clients would have 4 clients and Clients 2; those 2 are also in population clients, but I need to get a list of the population clients that are not in Clients.
Any help or pointers would be appreciated.

First, let's concentrate on getting the right results, and then we'll optimize.
Consider your nested loops: you will get too many positives, because in most (pclient, client) pairs the IDs won't match. I think you wanted to code it like this:
foreach(PopulationClient pClients in populationClients)
{
    if(!InTempList(clients, pClients.ID) && !InTempList(temp_list, pClients.ID))
    {
        // No 'client' variable is in scope here, so add the population client itself.
        // Assumption: temp_list (and the matching InTempList call) are changed to
        // work with PopulationClient items.
        temp_list.Add(pClients);
    }
}
Now for the efficiency of that code: InTempList does a linear search through the list. This is not efficient; consider using structures that are faster to search, for example hash sets.

If I understand what you're looking for, here is a way to do it with LINQ...
tempList = populationList.Where(p => !clientList.Any(p2 => p2.ID == p.ID));

Just to offer another LINQ-based answer... I think your intent is to populate tempList with all the items in 'clients' (returned from GetAllClients) that don't show up (based on 'ID' value) in the populationClients collection.
If that's the case, then I'm going to assume that populationClients is sufficiently large to warrant doing a hash-based lookup (if it's less than 10 items, the linear scan may not be a big deal, for instance).
So we want a fast-lookup version of all the ID values from the populationClients collection:
var populationClientIDs = populationClients.Select(pc => pc.ID);
var populationClientIDHash = new HashSet<int>(populationClientIDs);
Now that we have the ID values we want to ignore in a fast-lookup data structure, we can use that as a filter for the clients:
var filteredClients = clients.Where(c => populationClientIDHash.Contains(c.ID) == false);
Depending on your usage, you could populate the tempList from 'filteredClients', call ToList on it, or whatever fits.
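The question's last paragraph asks for the opposite direction (population clients that are not in Clients), so here is a minimal sketch of the same hash-set filter for that case. It assumes plain IEnumerable inputs instead of TList, and the Client, PopulationClient and ClientFilter types are simplified, hypothetical stand-ins.

```csharp
using System.Collections.Generic;
using System.Linq;

// Simplified stand-ins; the real entities have more members.
public class Client
{
    public int ID { get; set; }
}

public class PopulationClient
{
    public int ID { get; set; }
}

public static class ClientFilter
{
    // Returns the population clients whose ID does not occur in 'clients'.
    public static List<PopulationClient> NotInClients(
        IEnumerable<PopulationClient> populationClients,
        IEnumerable<Client> clients)
    {
        // Build the set of known client IDs once.
        var clientIds = new HashSet<int>(clients.Select(c => c.ID));

        // Each membership test is then a hash lookup instead of a list scan.
        return populationClients.Where(p => !clientIds.Contains(p.ID)).ToList();
    }
}
```

Swapping the roles of the two collections gives the interpretation used in the answer above.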

Would there be any performance difference between looping every row of dataset and same dataset list form

I need to loop over every row of a dataset 100k times.
This dataset contains 1 primary key column and another string column. The dataset has 600k rows.
At the moment I am looping like this:
for (int i = 0; i < dsProductNameInfo.Tables[0].Rows.Count; i++)
{
    for (int k = 0; k < dsFull.Tables[0].Rows.Count; k++)
    {
    }
}
dsProductNameInfo has 100k rows and dsFull has 600k rows. Should I convert dsFull to a list of string KeyValuePairs and loop over that, or would there be no speed difference?
Which solution would work fastest?
Thank you.
C# 4.0 WPF application

In the exact scenario you mentioned, the performance would be the same except converting to the list would take some time and cause the list approach to be slower. You can easily find out by writing a unit test and timing it.
I would think it'd be best to do this:
// create a class for each type of object you're going to be dealing with
public class ProductNameInformation { ... }
public class Product { ... }
// load a list from a SqlDataReader (much faster than loading a DataSet)
List<Product> products = GetProductsUsingSqlDataReader(); // don't actually call it that :)
// The only thing I can think of where DataSets are better is indexing certain columns.
// So if you have indices, just emulate them with a hashtable:
Dictionary<string, Product> index1 = products.ToDictionary( ... );
Here are references to the SqlDataReader and ToDictionary concepts that you may or may not be familiar with.
The real question is, why isn't this kind of heavy processing done at the database layer? SQL servers are much more optimized for this type of work. Also, you may not have to actually do this, why don't you post the original problem and maybe we can help you optimize deeper?
HTH
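To illustrate the "emulate an index with a hashtable" idea, here is a small sketch. The Product class and its Id and Name members are assumptions made for the example; the answer above deliberately leaves the class bodies and the ToDictionary arguments open.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape: a key column plus one string column.
public class Product
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public static class ProductIndex
{
    // Builds the index in one pass. ToDictionary throws on duplicate keys,
    // so this assumes Id is unique (as a primary key would be).
    public static Dictionary<int, Product> ById(IEnumerable<Product> products)
    {
        return products.ToDictionary(p => p.Id);
    }

    // A lookup is then a hash probe instead of a scan over every row.
    public static string NameOrNull(Dictionary<int, Product> index, int id)
    {
        return index.TryGetValue(id, out var product) ? product.Name : null;
    }
}
```

Whether this helps depends on what the inner loop actually does; it only pays off when the inner loop is really a search by key.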

There might be quite a few things that could be optimized that are not related to the looping itself. For example, reducing the number of iterations would yield a lot. At present the body of the inner loop is executed 100k * 600k times, so eliminating one iteration of the outer loop would eliminate 600k iterations of the inner loop (or you might be able to switch the inner and outer loops if it's easier to remove iterations from the inner loop).
One thing that you could do in any case is index into each table only once:
var productNameInfoRows = dsProductNameInfo.Tables[0].Rows;
var productInfoCount = productNameInfoRows.Count;
var fullRows = dsFull.Tables[0].Rows;
var fullCount = fullRows.Count;
for (int i = 0; i < productInfoCount; i++)
{
    for (int k = 0; k < fullCount; k++)
    {
    }
}
Inside the loops you'd get to the rows with productNameInfoRows[i] and fullRows[k], which is faster than using the longhand. I'm guessing there might be more to gain from optimizing the body than from changing the way you loop over the collection, unless of course you have already profiled the code and found the actual looping to be the bottleneck.
EDIT: After reading your comment to Marc about what you are trying to accomplish, here's a go at how you could do this. It's worth noting that the algorithm below is probabilistic. That is, there's a 1 in 2^32 chance of two words being seen as equal without actually being equal. It is, however, a lot faster than comparing strings.
The code assumes that the first column is the one you are comparing.
//store all the values that will not change through the execution for faster access
var productNameInfoRows = dsProductNameInfo.Tables[0].Rows;
var fullRows = dsFull.Tables[0].Rows;
var productInfoCount = productNameInfoRows.Count;
var fullCount = fullRows.Count;
var full = new List<int[]>(fullCount);
for (int i = 0; i < productInfoCount; i++){
    //we're going to compare hash codes and not strings
    var prd = productNameInfoRows[i][0].ToString().Split(';')
        .Select(s => s.GetHashCode()).OrderBy(t=>t).ToArray();
    for (int k = 0; k < fullCount; k++){
        //caches the calculation for all subsequent iterations of the outer loop
        if (i == 0) {
            full.Add(fullRows[k][0].ToString().Split(';')
                .Select(s => s.GetHashCode()).OrderBy(t=>t).ToArray());
        }
        var fl = full[k];
        var count = 0;
        for(var j = 0;j<fl.Length;j++){
            var f = fl[j];
            //the values are sorted so we can exit early
            for(var m = 0;m<prd.Length && prd[m] <= f;m++){
                count += prd[m] == f ? 1 : 0;
            }
        }
        if((double)(fl.Length + prd.Length)/count >= 0.6){
            //there's a match
        }
    }
}
EDIT: Your comment motivated me to give it another try. The code below could need fewer iterations. I say "could" because it depends on the number of matches and the number of unique words. A lot of unique words with a lot of matches for each (which would require a LOT of words per column) would potentially yield more iterations. However, under the assumption that each row has few words, this should yield substantially fewer iterations. Your code has N*M complexity; this has N + M + (matches * productInfoMatches * fullMatches). In other words, the latter term would have to be almost 99999*600k for this to need more iterations than yours.
//store all the values that will not change through the execution for faster access
var productNameInfoRows = dsProductNameInfo.Tables[0].Rows;
var fullRows = dsFull.Tables[0].Rows;
var productInfoCount = productNameInfoRows.Count;
var fullCount = fullRows.Count;
//Create a list of the words from the product info
var lists = new Dictionary<int, Tuple<List<int>, List<int>>>(productInfoCount*3);
for(var i = 0;i<productInfoCount;i++){
    foreach (var token in productNameInfoRows[i][0].ToString().Split(';')
        .Select(p => p.GetHashCode())){
        if (!lists.ContainsKey(token)){
            lists.Add(token, Tuple.Create(new List<int>(), new List<int>()));
        }
        lists[token].Item1.Add(i);
    }
}
//Pair words from full with those from productinfo
for(var i = 0;i<fullCount;i++){
    foreach (var token in fullRows[i][0].ToString().Split(';')
        .Select(p => p.GetHashCode())){
        if (lists.ContainsKey(token)){
            lists[token].Item2.Add(i);
        }
    }
}
//Count all matches for each pair of rows
var counts = new Dictionary<int, Dictionary<int, int>>();
foreach(var key in lists.Keys){
    foreach(var p in lists[key].Item1){
        if(!counts.ContainsKey(p)){
            counts.Add(p,new Dictionary<int, int>());
        }
        foreach(var f in lists[key].Item2){
            var dic = counts[p];
            if(!dic.ContainsKey(f)){
                dic.Add(f,0);
            }
            dic[f]++;
        }
    }
}
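The same inverted-index idea can be sketched without DataSets. The version below is a variation, not the code above: it keys the index on the token strings themselves rather than their hash codes (so there are no false matches from collisions), takes plain string lists as input, and counts each distinct token at most once per row. TokenMatcher and CountSharedTokens are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class TokenMatcher
{
    // For every (product row, full row) pair that shares at least one token,
    // returns how many distinct tokens the two rows have in common.
    public static Dictionary<(int Product, int Full), int> CountSharedTokens(
        IList<string> productRows, IList<string> fullRows)
    {
        // Inverted index: token -> indices of the product rows containing it.
        var productRowsByToken = new Dictionary<string, List<int>>();
        for (var i = 0; i < productRows.Count; i++)
        {
            foreach (var token in productRows[i].Split(';').Distinct())
            {
                if (!productRowsByToken.TryGetValue(token, out var rows))
                {
                    rows = new List<int>();
                    productRowsByToken.Add(token, rows);
                }
                rows.Add(i);
            }
        }

        // Only pairs that actually share a token are ever visited.
        var counts = new Dictionary<(int Product, int Full), int>();
        for (var k = 0; k < fullRows.Count; k++)
        {
            foreach (var token in fullRows[k].Split(';').Distinct())
            {
                if (!productRowsByToken.TryGetValue(token, out var rows)) continue;
                foreach (var i in rows)
                {
                    var key = (i, k);
                    counts.TryGetValue(key, out var current); // current is 0 when the key is absent
                    counts[key] = current + 1;
                }
            }
        }
        return counts;
    }
}
```

A match threshold can then be applied to each count together with the token counts of the two rows involved.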

If performance is the critical factor, then I would suggest trying an array-of-struct; this has minimal indirection (DataSet/DataTable has quite a lot of indirection). You mention KeyValuePair, and that would work, although it might not necessarily be my first choice. Milimetric is right to say that there is an overhead if you create a DataSet first and then build an array/list from that; however, even then the time saved when looping may exceed the build time. If you can restructure the load to remove the DataSet completely, great.
I would also look carefully at the loops, to see if anything could reduce the actual work needed; for example, would building a dictionary/grouping allow faster lookups? Would sorting allow binary search? Can any operations be pre-aggregated and applied at a higher level (with fewer rows)?
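As a rough illustration of the array-of-struct plus binary search idea, here is a sketch. ProductEntry and ProductSearch are hypothetical; the fields mirror the "primary key plus string column" shape described in the question.

```csharp
using System;

// A small value type: an array of these is one contiguous block of entries.
public readonly struct ProductEntry : IComparable<ProductEntry>
{
    public ProductEntry(int id, string name)
    {
        Id = id;
        Name = name;
    }

    public int Id { get; }
    public string Name { get; }

    // Order entries by key only, so a probe entry with any name finds its match.
    public int CompareTo(ProductEntry other) => Id.CompareTo(other.Id);
}

public static class ProductSearch
{
    // Sort once up front; the input array is left untouched.
    public static ProductEntry[] BuildSorted(ProductEntry[] entries)
    {
        var copy = (ProductEntry[])entries.Clone();
        Array.Sort(copy);
        return copy;
    }

    // Each lookup is then O(log n) instead of a full scan.
    public static bool TryFind(ProductEntry[] sorted, int id, out ProductEntry found)
    {
        var index = Array.BinarySearch(sorted, new ProductEntry(id, string.Empty));
        found = index >= 0 ? sorted[index] : default;
        return index >= 0;
    }
}
```

Whether this beats a dictionary depends on the data; it is worth measuring both.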

What are you doing with the data inside the nested loop?
Is the source of your datasets a SQL database? If so, the best possible performance you could get would be to perform your calculation in SQL using an inner join and return the result to .net.
Another alternative would be to use the dataset's built in querying methods that act like SQL, but in-memory.
If neither of those options is appropriate, you would get a performance improvement by retrieving the 'full' dataset as a DataReader and looping over it as the outer loop. A dataset loads all of the data from SQL into memory in one hit. With 600k rows, this will take up a lot of memory! A DataReader, by contrast, keeps the connection to the DB open and streams rows as they are read. Once you have read a row, its memory can be reclaimed by the garbage collector.

In your comment reply to my earlier answer you said that both datasets are essentially lists of strings, and each string is effectively a delimited list of tags. I would first look to normalise the CSV strings in the database, i.e. split the CSVs, add them to a tag table, and link from the product to the tags via a link table.
You can then quite easily create a SQL statement that does your matching on the link records rather than by string (which will be more performant in its own right).
The issue you would then have is that if your subset product list needs to be passed into SQL from .net, you would need to call the SP 100k times. Thankfully, SQL Server 2008 introduced table-valued parameters (user-defined table types). You could define a table type in your database with one column to hold your product ID, have your SP accept that as an input parameter, and then perform an inner join between your actual tables and your table parameter. I've used this in my own project with very large datasets and the performance gain was massive.
On the .net side you can create a DataTable matching the structure of the SQL table type and then pass that as a command parameter when calling your SP (once!).
This article shows you how to do both the SQL and .net sides. http://www.mssqltips.com/sqlservertip/2112/table-value-parameters-in-sql-server-2008-and-net-c/

Compare two IQueryable instances

I have two IQueryable instances, objIQuerableA and objIQuerableB, and I want to obtain only the elements that are present in objIQuerableA and not in objIQuerableB.
One way is to use a foreach loop but I wonder if there is a better method.

Simple and straightforward.
var result = objIQuerableA.Except(objIQuerableB);

The title actually says compare two IQueryables. If you wanted to actually do a compare to determine if both IQueryable contain the same results in a single query....
var aExceptB = objIQuerableA.Except(objIQuerableB);
var bExceptA = objIQuerableB.Except(objIQuerableA);
var symmetricDiff = aExceptB.Union(bExceptA);
bool areDifferent = symmetricDiff.Any();
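A generic wrapper for that comparison might look like the sketch below. QueryCompare and HaveSameElements are made-up names. Note that Except and Union have set semantics, so duplicates and ordering are ignored; for element types other than primitives, equality is whatever the query provider or the default comparer defines.

```csharp
using System.Linq;

public static class QueryCompare
{
    // True when neither query contains an element the other lacks
    // (set comparison: duplicates and ordering are not considered).
    public static bool HaveSameElements<T>(IQueryable<T> a, IQueryable<T> b)
    {
        var aExceptB = a.Except(b);
        var bExceptA = b.Except(a);
        var symmetricDiff = aExceptB.Union(bExceptA);

        // Any() can stop at the first difference found.
        return !symmetricDiff.Any();
    }
}
```

In-memory sequences can be passed through AsQueryable() to try it out without a database.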
