EF - A proper way to search several items in database - c#

I have about 100 items (allRights) in the database and about 10 ids to search for (inputRightsIds). Which is better: first get all rights and then search through them (Variant 1), or make 10 separate checking requests to the database (Variant 2)?
Here is some example code:
DbContext db = new DbContext();
int[] inputRightsIds = new int[10]{...};
Variant 1
var allRights = db.Rights.ToList();

foreach (var right in allRights)
{
    for (int i = 0; i < inputRightsIds.Length; i++)
    {
        if (inputRightsIds[i] == right.Id)
        {
            // Do something
        }
    }
}
Variant 2
for (int i = 0; i < inputRightsIds.Length; i++)
{
    if (db.Rights.Any(r => r.Id == inputRightsIds[i]))
    {
        // Do something
    }
}
Thanks in advance!

As others have already stated, you should do the following:
var matchingIds = from r in db.Rights
                  where inputRightsIds.Contains(r.Id)
                  select r.Id;

foreach (var id in matchingIds)
{
    // Do something
}
But this is different from both of your approaches. In your first approach you are making one SQL call that returns more results than you are interested in. The second makes multiple SQL calls, each returning part of the information you want. The query above makes one SQL call and returns only the data you are interested in. This is the best approach, as it avoids the two bottlenecks of making multiple calls to the DB and having too much data returned.

You can use the following:
db.Rights.Where(right => inputRightsIds.Contains(right.Id));

They should be very similar in speed, since both have to enumerate the collections the same number of times. There might be subtle differences between the two depending on the input data, but in general I would go with Variant 2. I think you should almost always prefer LINQ over manual enumeration when possible. Also consider using the following LINQ statement to simplify the whole search to a single line:
var matches = db.Rights.Where(r => inputRightsIds.Contains(r.Id));
// ... do stuff with matches
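If whatever you do per id needs the matching entity itself, one option (just a sketch; I'm assuming the entity class behind db.Rights is called Right) is to pull the matches into a dictionary keyed by Id and then walk the input ids:
// Sketch only: "Right" is an assumed entity type name; the "Do something" part stays your own logic
var matchesById = db.Rights
    .Where(r => inputRightsIds.Contains(r.Id))
    .ToDictionary(r => r.Id);

foreach (var id in inputRightsIds)
{
    Right right;
    if (matchesById.TryGetValue(id, out right))
    {
        // Do something with right
    }
}
This still runs a single query and gives O(1) lookups per input id.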

Don't forget that you can pull all your items into memory to process the list further:
var itemsFromDatabase = db.Rights.Where(r => inputRightsIds.Contains(r.Id)).ToList();
Or you could even enumerate through the collection and do some work on each item:
db.Rights.Where(r => inputRightsIds.Contains(r.Id)).ToList().ForEach(item =>
{
    // your code here
});

Related

Fastest way to match members of two lists

I have two lists, orderHeaders and orderLines. These are two related tables in the database; however, when I pull them I have to pull them separately as two different lists and then map them to each other later. I have a solution right now, but the performance is a bit disappointing given that I have around 400k headers and 1 million+ lines.
Here's my code below. Is this the standard way to iterate over and find members inside two lists or is there a more optimized approach in C#?
var OutboundOrderHeaders =
    DbContext.Context.Database.SqlQuery<OutboundOrderDTO>(queryString, parameter);
var OutboundOrderHeadersList = OutboundOrderHeaders.ToList();

var OutboundOrderLine =
    DbContext.Context.Database.SqlQuery<OutboundOrderLineDTO>(queryStringLine, parameter2);
var OutboundOrderLineList = OutboundOrderLine.ToList();

for (var i = 0; i < OutboundOrderHeadersList.Count(); i++)
{
    var LineToAdd = OutboundOrderLineList
        .Where(x => x.OutboundNumber == OutboundOrderHeadersList[i].OutboundNumber)
        .ToList();

    OutboundOrderHeadersList[i].OrderLine = LineToAdd;
}
return OutboundOrderHeadersList;
As noted in comments, I'd really try hard to do this in the database rather than in memory. But to do it in memory, ToLookup is probably the right way to go:
// Note: here I've used outboundOrderLines where you've got OutboundOrderLineList,
// and orderHeaders where you've got OutboundOrderHeadersList, as simpler
// and more conventional variable names.
var linesByOutboundNumber = outboundOrderLines.ToLookup(line => line.OutboundNumber);
foreach (var orderHeader in orderHeaders)
{
    orderHeader.OrderLine = linesByOutboundNumber[orderHeader.OutboundNumber].ToList();
}
This builds a map going from outbound number to "all the lines with that outbound number" by going through outboundOrderLines once, rather than iterating over it for every order header.
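Purely as an alternative sketch of the same in-memory pairing (using the same renamed variables as above), GroupJoin expresses it in one statement:
// Sketch: the same idea expressed with GroupJoin instead of ToLookup
var headersWithLines = orderHeaders.GroupJoin(
    outboundOrderLines,
    header => header.OutboundNumber,
    line => line.OutboundNumber,
    (header, lines) => new { header, lines });

foreach (var pair in headersWithLines)
{
    pair.header.OrderLine = pair.lines.ToList();
}
Either way, each list is walked once rather than once per header.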

Looping on IEnumerator<T>, Any Suggestions

I am having a situation where looping through the result of LINQ is getting on my nerves. Well here is my scenario:
I have a DataTable that comes from the database, from which I am taking data as:
var results = from d in dtAllData.AsEnumerable()
              select new MyType
              {
                  ID = d.Field<Decimal>("ID"),
                  Name = d.Field<string>("Name")
              };
After doing the order by depending on the sort order as:
if (orderBy != "")
{
    string[] ord = orderBy.Split(' ');
    if (ord != null && ord.Length == 2 && ord[0] != "")
    {
        if (ord[1].ToLower() != "desc")
        {
            results = from sorted in results
                      orderby GetPropertyValue(sorted, ord[0])
                      select sorted;
        }
        else
        {
            results = from sorted in results
                      orderby GetPropertyValue(sorted, ord[0]) descending
                      select sorted;
        }
    }
}
The GetPropertyValue method is as:
private object GetPropertyValue(object obj, string property)
{
    System.Reflection.PropertyInfo propertyInfo = obj.GetType().GetProperty(property);
    return propertyInfo.GetValue(obj, null);
}
After this I am taking out 25 records for first page like:
results = from sorted in results
.Skip(0)
.Take(25)
select sorted;
So far things are going well. Now I have to pass these results to a method which is going to do some manipulation on the data and return the desired data; in this method, when I want to loop over these 25 records, it takes quite a long time. My method definition is:
public MyTypeCollection GetMyTypes(IEnumerable<MyType> myData, String dateFormat, String offset)
I have tried foreach and it takes like 8-10 secs on my machine; it is taking time at this line:
foreach(var _data in myData)
I tried a while loop and it does the same thing; I used it like:
var enumerator = myData.GetEnumerator();
while (enumerator.MoveNext())
{
    var item = enumerator.Current;
    Console.WriteLine(item);
}
This piece of code is taking time at MoveNext.
Then I went for a for loop like:
int length = myData.Count();
for (int i = 0; i < 25;i++ )
{
var temp = myData.ElementAt(i);
}
This code is taking time at ElementAt.
Can anyone please guide me on what I am doing wrong? I am using Framework 3.5 in VS 2008.
Thanks in advance
EDIT: I suspect the problem is in how you're ordering. You're using reflection to first fetch and then invoke a property for every record. Even though you only want the first 25 records, it has to call GetPropertyValue on all the records first, in order to order them.
It would be much better if you could do this without reflection at all... but if you do need to use reflection, at least call Type.GetProperty() once instead of for every record.
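For example, something along these lines (just a sketch, reusing MyType and the ord array from your code) hoists the reflection lookup out of the ordering:
// Sketch: resolve the PropertyInfo once, then reuse it for every record
var propertyInfo = typeof(MyType).GetProperty(ord[0]);
results = ord[1].ToLower() != "desc"
    ? results.OrderBy(item => propertyInfo.GetValue(item, null))
    : results.OrderByDescending(item => propertyInfo.GetValue(item, null));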
(In some ways this is more to do with helping you diagnose the problem more easily than a full answer as such...)
As Henk said, this is very odd:
results = from sorted in results
.Skip(0)
.Take(25)
select sorted;
You almost certainly really just want:
results = results.Take(25);
(Skip(0) is pointless.)
It may not actually help, but it will make the code simpler to debug.
The next problem is that we can't actually see all your code. You've written:
After doing the order by depending on the sort order
... but you haven't shown how you're performing the ordering.
You should show us a complete example going from DataTable to its use.
Changing how you iterate over the sequence will not help - it's going to do the same thing either way, really - although it's surprising that in your last attempt, Count() apparently works quickly. Stick to the foreach - but work out exactly what that's going to be doing. LINQ uses a lot of lazy evaluation, and if you've done something which makes that very heavy going, that could be the problem. It's hard to know without seeing the whole pipeline.
The problem is that your "results" IEnumerable isn't actually being evaluated until it is passed into your method and enumerated. That means that the whole operation, getting all the data from dtAllData, selecting out the new type (which is happening on the whole enumerable, not just the first 25), and then finally the take 25 operation, are all happening on the first enumeration of the IEnumerable (foreach, while, whatever).
That's why your method is taking so long. It's actually doing some of the work defined elsewhere inside the method. If you want that to happen before your method, you could do a "ToList()" prior to the method.
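Something like this (a sketch, keeping your method signature) forces the query to run before the method is called:
// Forces the query to execute here, so the method only loops over 25 in-memory items
var pageOfResults = results.Take(25).ToList();
MyTypeCollection myTypes = GetMyTypes(pageOfResults, dateFormat, offset);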
You might find it easier to adopt a hybrid approach.
In order:
1) Sort your datatable in-situ. It's probably best to do this at the database level, but, if you can't, then DataTable.DefaultView.Sort is pretty efficient:
dtAllData.DefaultView.Sort = ord[0] + " " + ord[1];
This assumes that ord[0] is the column name, and ord[1] is either ASC or DESC
2) Page through the DefaultView by index:
int pageStart = 0;
List<DataRowView> pageRows = new List<DataRowView>();
for (int i = pageStart; i < dtAllData.DefaultView.Count; i++)
{
    // Exit once we have a full page (the loop condition already stops at the end of the rows)
    if (i >= pageStart + 25)
    {
        break;
    }
    pageRows.Add(dtAllData.DefaultView[i]);
}
...and create your objects from this much smaller list... (I've assumed the columns are called Id and Name, as well as the types)
List<MyType> myObjects = new List<MyType>();
foreach (DataRowView pageRow in pageRows)
{
    myObjects.Add(new MyType() { Id = Convert.ToInt32(pageRow["Id"]), Name = Convert.ToString(pageRow["Name"]) });
}
You can then proceed with the rest of what you were doing.

Efficient and Accurate way to add items into a TList

I have 2 lists, and the entities of those lists have some IDs; for instance
Client.ID, where ID is a property of Client, and then I have PopulationClient.ID, where ID is a property of the class PopulationClient. So I have two lists:
TList<Client> clients = clientsHelper.GetAllClients();
TList<PopulationClient> populationClients = populationHelper.GetAllPopulationClients();
So then I have a temp List
TList<Client> temp_list = new TList<Client>();
So the problem I am having is doing this efficiently and correctly. This is what I have tried, but I am not getting the correct results:
foreach (PopulationClient pClients in populationClients)
{
    foreach (Client client in clients)
    {
        if (pClients.ID != client.ID && !InTempList(temp_list, pClients.ID))
        {
            temp_list.Add(client);
        }
    }
}

public bool InTempList(TList<Client> c, int id)
{
    bool IsInList = false;
    foreach (Client client in c)
    {
        if (client.ID == id)
        {
            IsInList = true;
        }
    }
    return IsInList;
}
So while I am trying to do it right, I cannot come up with a good way of doing it. This is not returning the correct data because of the condition in the first loop at the top: at some point one ID or another is different from the other, so it adds the client anyway. What constraints do you think I should check here so that I only end up with a list of clients that are in population clients but not in Clients?
For instance, population clients might have 4 clients and Clients 2; those 2 are also in population clients, but I need to get a list of the population clients that are not in Clients.
Any help or pointers would be appreciated.
First, let's concentrate on getting the right results, and then we'll optimize.
Consider your nested loops: you will get too many positives, because in most (pclient, client) pairs the IDs wouldn't match. I think you wanted to code it like this:
foreach (PopulationClient pClients in populationClients)
{
    if (!InTempList(clients, pClients.ID) && !InTempList(temp_list, pClients.ID))
    {
        temp_list.Add(pClients);
    }
}
Now for the efficiency of that code: InTempList uses linear search through lists. This is not efficient - consider using structures that are faster to search, for example, hash sets.
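For example, something like this (a sketch; it assumes ID is an int on both classes):
// Build a set of the existing client IDs once; each membership test is then O(1)
var clientIds = new HashSet<int>(clients.Select(c => c.ID));
foreach (PopulationClient pClient in populationClients)
{
    if (!clientIds.Contains(pClient.ID))
    {
        // pClient is in populationClients but not in clients
    }
}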
If I understand what you're looking for, here is a way to do it with LINQ...
tempList = populationList.Where(p => !clientList.Any(p2 => p2.ID == p.ID));
Just to offer another LINQ-based answer... I think your intent is to populate tempList based on all the items in 'clients' (returned from GetAllClients) that don't show up (based on 'ID' value) in the populationClients collection.
If that's the case, then I'm going to assume that populationClients is sufficiently large to warrant doing a hash-based lookup (if it's less than 10 items, the linear scan may not be a big deal, for instance).
So we want a fast-lookup version of all the ID values from the populationClients collection:
var populationClientIDs = populationClients.Select(pc => pc.ID);
var populationClientIDHash = new HashSet<int>(populationClientIDs);
Now that we have the ID values we want to ignore in a fast lookup data structure, we can then use that as a filter for the clients:
var filteredClients = clients.Where(c => populationClientIDHash.Contains(c.ID) == false);
Based on the usage/need, you could either populate the tempList from 'filteredClients', or do a ToList, or whatever.

Would there be any performance difference between looping over every row of a dataset and the same dataset in list form

I need to loop every row of a dataset 100k times.
This dataset contains 1 Primary key and another string column. Dataset has 600k rows.
So at the moment I am looping like this:
for (int i = 0; i < dsProductNameInfo.Tables[0].Rows.Count; i++)
{
    for (int k = 0; k < dsFull.Tables[0].Rows.Count; k++)
    {
    }
}
Now dsProductNameInfo has 100k rows and dsFull has 600k rows. Should I convert dsFull to a KeyValuePair string list and loop over that, or would there be no speed difference?
What solution would work fastest ?
Thank you.
C# 4.0 WPF application
In the exact scenario you mentioned, the performance would be about the same, except that converting to the list would take some time and make the list approach slower. You can easily find out by writing a unit test and timing it.
I would think it'd be best to do this:
// create a class for each type of object you're going to be dealing with
public class ProductNameInformation { ... }
public class Product { ... }
// load a list from a SqlDataReader (much faster than loading a DataSet)
List<Product> products = GetProductsUsingSqlDataReader(); // don't actually call it that :)
// The only thing I can think of where DataSets are better is indexing certain columns.
// So if you have indices, just emulate them with a hashtable:
Dictionary<string, Product> index1 = products.ToDictionary( ... );
Here are references to the SqlDataReader and ToDictionary concepts that you may or may not be familiar with.
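Purely to illustrate the SqlDataReader idea, a minimal sketch (the connection string, SQL text and column types are all assumptions; it needs System.Data.SqlClient):
// Sketch of GetProductsUsingSqlDataReader-style loading; adjust names and types to your schema
var products = new List<Product>();
using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand("SELECT Id, Name FROM Product", connection))
{
    connection.Open();
    using (var reader = command.ExecuteReader())
    {
        while (reader.Read())
        {
            products.Add(new Product
            {
                Id = reader.GetInt32(0),
                Name = reader.GetString(1)
            });
        }
    }
}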
The real question is, why isn't this kind of heavy processing done at the database layer? SQL servers are much more optimized for this type of work. Also, you may not have to actually do this, why don't you post the original problem and maybe we can help you optimize deeper?
HTH
There might be quite a few things that could be optimized that are not related to the looping itself. E.g. reducing the number of iterations would yield a lot: at present the body of the inner loop is executed 100k * 600k times, so eliminating one iteration of the outer loop eliminates 600k iterations of the inner one (or you might be able to swap the inner and outer loops if it's easier to remove iterations from the inner loop).
One thing that you could do in any case is resolve the row collections and counts only once for each table:
var productNameInfoRows = dsProductNameInfo.Tables[0].Rows;
var productInfoCount = productNameInfoRows.Count;
var fullRows = dsFull.Tables[0].Rows;
var fullCount = fullRows.Count;
for (int i = 0; i < productInfoCount; i++)
{
    for (int k = 0; k < fullCount; k++)
    {
    }
}
Inside the loops you'd get to the rows with productNameInfoRows[i] and fullRows[k], which is faster than using the long-hand form. I'm guessing there might be more to gain from optimizing the body than from the way you are looping over the collection, unless of course you have already profiled the code and found the actual looping to be the bottleneck.
EDIT: After reading your comment to Marc about what you are trying to accomplish, here's a go at how you could do this. It's worth noting that the algorithm below is probabilistic: there's a 1 in 2^32 chance of two words being seen as equal without actually being equal. It is however a lot faster than comparing strings.
The code assumes that the first column is the one you are comparing.
// store all the values that will not change through the execution, for faster access
var productNameInfoRows = dsProductNameInfo.Tables[0].Rows;
var fullRows = dsFull.Tables[0].Rows;
var productInfoCount = productNameInfoRows.Count;
var fullCount = fullRows.Count;
var full = new List<int[]>(fullCount);

for (int i = 0; i < productInfoCount; i++)
{
    // we're going to compare hash codes and not strings
    var prd = productNameInfoRows[i][0].ToString().Split(';')
        .Select(s => s.GetHashCode()).OrderBy(t => t).ToArray();

    for (int k = 0; k < fullCount; k++)
    {
        // cache the calculation for all subsequent iterations of the outer loop
        if (i == 0)
        {
            full.Add(fullRows[k][0].ToString().Split(';')
                .Select(s => s.GetHashCode()).OrderBy(t => t).ToArray());
        }

        var fl = full[k];
        var count = 0;
        for (var j = 0; j < fl.Length; j++)
        {
            var f = fl[j];
            // the values are sorted so we can exit early
            for (var m = 0; m < prd.Length && prd[m] <= f; m++)
            {
                count += prd[m] == f ? 1 : 0;
            }
        }

        if ((double)(fl.Length + prd.Length) / count >= 0.6)
        {
            // there's a match
        }
    }
}
EDIT: Your comment motivated me to give it another try. The code below could have fewer iterations. "Could have" because it depends on the number of matches and the number of unique words: a lot of unique words with a lot of matches each (which would require a LOT of words per column) could potentially yield more iterations. However, under the assumption that each row has few words, this should yield substantially fewer iterations: your code has N*M complexity, while this is roughly N + M + (productInfoMatches * fullMatches summed over the matching words). In other words, the matching work would have to approach 99999 * 600k for this to have more iterations than yours.
// store all the values that will not change through the execution for faster access
var productNameInfoRows = dsProductNameInfo.Tables[0].Rows;
var fullRows = dsFull.Tables[0].Rows;
var productInfoCount = productNameInfoRows.Count;
var fullCount = fullRows.Count;

// Create a list of the words from the product info
var lists = new Dictionary<int, Tuple<List<int>, List<int>>>(productInfoCount * 3);
for (var i = 0; i < productInfoCount; i++)
{
    foreach (var token in productNameInfoRows[i][0].ToString().Split(';')
        .Select(p => p.GetHashCode()))
    {
        if (!lists.ContainsKey(token))
        {
            lists.Add(token, Tuple.Create(new List<int>(), new List<int>()));
        }
        lists[token].Item1.Add(i);
    }
}

// Pair words from full with those from product info
for (var i = 0; i < fullCount; i++)
{
    foreach (var token in fullRows[i][0].ToString().Split(';')
        .Select(p => p.GetHashCode()))
    {
        if (lists.ContainsKey(token))
        {
            lists[token].Item2.Add(i);
        }
    }
}

// Count all matches for each pair of rows
var counts = new Dictionary<int, Dictionary<int, int>>();
foreach (var key in lists.Keys)
{
    foreach (var p in lists[key].Item1)
    {
        if (!counts.ContainsKey(p))
        {
            counts.Add(p, new Dictionary<int, int>());
        }
        foreach (var f in lists[key].Item2)
        {
            var dic = counts[p];
            if (!dic.ContainsKey(f))
            {
                dic.Add(f, 0);
            }
            dic[f]++;
        }
    }
}
If performance is the critical factor, then I would suggest trying an array of structs; this has minimal indirection (DataSet/DataTable has quite a lot of indirection). You mention KeyValuePair, and that would work, although it might not necessarily be my first choice. Milimetric is right to say that there is an overhead if you create a DataSet first and then build an array/list from that; however, even then the time saved when looping may exceed the build time. If you can restructure the load to remove the DataSet completely, great.
I would also look carefully at the loops, to see if anything could reduce the actual work needed; for example, would building a dictionary/grouping allow faster lookups? Would sorting allow binary search? Can any operations be pre-aggregated and applied at a higher level (with fewer rows)?
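To illustrate the dictionary/grouping idea (a sketch only; I'm assuming the string column being matched is called "Name"):
// Group the 600k rows once by the match column, then only visit rows that share the key
var fullRowsByKey = dsFull.Tables[0].AsEnumerable()
    .ToLookup(row => row.Field<string>("Name"));

foreach (DataRow productRow in dsProductNameInfo.Tables[0].Rows)
{
    foreach (DataRow matchingRow in fullRowsByKey[productRow.Field<string>("Name")])
    {
        // work only against the matching rows instead of all 600k
    }
}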
What are you doing with the data inside the nested loop?
Is the source of your datasets a SQL database? If so, the best possible performance you could get would be to perform your calculation in SQL using an inner join and return the result to .net.
Another alternative would be to use the dataset's built in querying methods that act like SQL, but in-memory.
If neither of those options are appropriate, you would get a performance improvement by retrieving the 'full' dataset as a DataReader and looping over it as the outer loop. A dataset loads all of the data from SQL into memory in one hit. With 600k rows, this will take up a lot of memory! Whereas a DataReader will keep the connection to the DB open and stream rows as they are read. Once you have read a row the memory will be reused/reclaimed by the garbage collector.
In your comment reply to my earlier answer you said that both datasets are essentially lists of strings and each string a delimited list of tags effectively. I would first look to normalise the csv strings in the database. I.e. Split the CSVs, add them to a tag table and link from the product to the tags via a link table.
You can then quite easily create a SQL statement that will do your matching according to the link records rather than by string (which will be more performant in its own right).
The issue you would then have is that if your sub-set product list needs to be passed into SQL from .NET, you would need to call the SP 100k times. Thankfully, SQL 2008 R2 introduced table types. You could define a table type in your database with one column to hold your product ID, have your SP accept that as an input parameter, and then perform an inner join between your actual tables and your table parameter. I've used this in my own project with very large datasets and the performance gain was massive.
On the .net side you can create a DataTable matching the structure of the SQL table type and then pass that as a command parameter when calling your SP (once!).
This article shows you how to do both the SQL and .net sides. http://www.mssqltips.com/sqlservertip/2112/table-value-parameters-in-sql-server-2008-and-net-c/
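For reference, the .NET side ends up looking roughly like this (a sketch only; the table type, stored procedure, column name and the productIds list are assumptions, and the linked article covers the SQL side):
// Sketch: pass all product IDs to the SP in one call via a table-valued parameter
var idTable = new DataTable();
idTable.Columns.Add("ProductId", typeof(int));
foreach (int id in productIds)
{
    idTable.Rows.Add(id);
}

using (var command = new SqlCommand("dbo.MatchProducts", connection))
{
    command.CommandType = CommandType.StoredProcedure;
    var parameter = command.Parameters.AddWithValue("@ProductIds", idTable);
    parameter.SqlDbType = SqlDbType.Structured;
    parameter.TypeName = "dbo.ProductIdTableType";
    // ExecuteReader / fill your results as usual
}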

Help needed for optimizing linq data extraction

I'm fetching data from all 3 tables at once to avoid network latency. Fetching the data is pretty fast, but when I loop through the results a lot of time is used:
Int32[] arr = { 1 };

var query = from a in arr
            select new
            {
                Basket = from b in ent.Basket
                         where b.SUPERBASKETID == parentId
                         select new
                         {
                             Basket = b,
                             ObjectTypeId = 0,
                             firstObjectId = "-1",
                         },
                BasketImage = from b in ent.Image
                              where b.BASKETID == parentId
                              select new
                              {
                                  Image = b,
                                  ObjectTypeId = 1,
                                  CheckedOutBy = b.CHECKEDOUTBY,
                                  firstObjectId = b.FIRSTOBJECTID,
                                  ParentBasket = (from parentBasket in ent.Basket
                                                  where parentBasket.ID == b.BASKETID
                                                  select parentBasket).ToList()[0],
                              },
                BasketFile = from b in ent.BasketFile
                             where b.BASKETID == parentId
                             select new
                             {
                                 BasketFile = b,
                                 ObjectTypeId = 2,
                                 CheckedOutBy = b.CHECKEDOUTBY,
                                 firstObjectId = b.FIRSTOBJECTID,
                                 ParentBasket = (from parentBasket in ent.Basket
                                                 where parentBasket.ID == b.BASKETID
                                                 select parentBasket),
                             }
            };
//Exception handling
var mixedElements = query.First();
ICollection<BasketItem> basketItems = new Collection<BasketItem>();
//Here 15 millis has been used
//only 6 elements were found
if (mixedElements.Basket.Count() > 0)
{
    foreach (var mixedBasket in mixedElements.Basket) { }
}
if (mixedElements.BasketFile.Count() > 0)
{
    foreach (var mixedBasketFile in mixedElements.BasketFile) { }
}
if (mixedElements.BasketImage.Count() > 0)
{
    foreach (var mixedBasketImage in mixedElements.BasketImage) { }
}
// the empty loops take 811 millis!!
Why are you bothering to check the counts before the foreach statements? If there are no results, the foreach will just finish immediately.
Your queries are actually all being deferred - they'll be executed as and when you ask for the data. Don't forget that your outermost query is a LINQ to Objects query: it's just returning the result of calling ent.Basket.Where(...).Select(...) etc... which doesn't actually execute the query.
Your plan to do all three queries in one go isn't actually working. However, by asking for the count separately, you may actually be executing each database query twice - once just getting the count and once for the results.
I strongly suggest that you get rid of the "optimizations" in this code which are making it much more complicated and slower than just writing the simplest code you can.
I don't know of any way of getting LINQ to SQL (or LINQ to EF) to execute multiple queries in a single call - but this approach certainly isn't going to do it.
One other minor hint which is irrelevant in this case, but can be useful in LINQ to Objects - if you want to find out whether there's any data in a collection, just use Any() instead of Count() > 0 - that way it can stop as soon as it's found anything.
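For example (using your first block just to show the shape):
// Any() stops at the first element instead of enumerating everything to count it
if (mixedElements.Basket.Any())
{
    foreach (var mixedBasket in mixedElements.Basket) { }
}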
You're using IEnumerable in the foreach loop. Implementations only have to prepare data when it's asked for. In this way, I'd suggest that the above code is accessing your data lazily -- that is, only when you enumerate the items (which actually happens when you call Count().)
Put a System.Diagnostics.Stopwatch around the call to Count() and see whether that's taking the bulk of the time you're seeing.
I can't comment further here because you don't specify the type of ent in your code sample.
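For example, a quick sketch:
// Time just the Count() call to see whether the real query execution happens there
var stopwatch = System.Diagnostics.Stopwatch.StartNew();
var basketCount = mixedElements.Basket.Count();
stopwatch.Stop();
Console.WriteLine("Basket.Count() took {0} ms", stopwatch.ElapsedMilliseconds);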
