Help needed for optimizing LINQ data extraction - C#

I'm fetching data from all 3 tables at once to avoid network latency. Fetching the data is pretty fast, but a lot of time is spent when I loop through the results:
Int32[] arr = { 1 };
var query = from a in arr
            select new
            {
                Basket = from b in ent.Basket
                         where b.SUPERBASKETID == parentId
                         select new
                         {
                             Basket = b,
                             ObjectTypeId = 0,
                             firstObjectId = "-1",
                         },
                BasketImage = from b in ent.Image
                              where b.BASKETID == parentId
                              select new
                              {
                                  Image = b,
                                  ObjectTypeId = 1,
                                  CheckedOutBy = b.CHECKEDOUTBY,
                                  firstObjectId = b.FIRSTOBJECTID,
                                  ParentBasket = (from parentBasket in ent.Basket
                                                  where parentBasket.ID == b.BASKETID
                                                  select parentBasket).ToList()[0],
                              },
                BasketFile = from b in ent.BasketFile
                             where b.BASKETID == parentId
                             select new
                             {
                                 BasketFile = b,
                                 ObjectTypeId = 2,
                                 CheckedOutBy = b.CHECKEDOUTBY,
                                 firstObjectId = b.FIRSTOBJECTID,
                                 ParentBasket = (from parentBasket in ent.Basket
                                                 where parentBasket.ID == b.BASKETID
                                                 select parentBasket),
                             }
            };
//Exception handling
var mixedElements = query.First();
ICollection<BasketItem> basketItems = new Collection<BasketItem>();
//Here 15 millis have been used
//only 6 elements were found
if (mixedElements.Basket.Count() > 0)
{
    foreach (var mixedBasket in mixedElements.Basket) { }
}
if (mixedElements.BasketFile.Count() > 0)
{
    foreach (var mixedBasketFile in mixedElements.BasketFile) { }
}
if (mixedElements.BasketImage.Count() > 0)
{
    foreach (var mixedBasketImage in mixedElements.BasketImage) { }
}
//the empty loops take 811 millis!!

Why are you bothering to check the counts before the foreach statements? If there are no results, the foreach will just finish immediately.
Your queries are actually all being deferred - they'll be executed as and when you ask for the data. Don't forget that your outermost query is a LINQ to Objects query: it's just returning the result of calling ent.Basket.Where(...).Select(...) etc... which doesn't actually execute the query.
Your plan to do all three queries in one go isn't actually working. However, by asking for the count separately, you may actually be executing each database query twice - once just getting the count and once for the results.
I strongly suggest that you get rid of the "optimizations" in this code which are making it much more complicated and slower than just writing the simplest code you can.
I don't know of any way of getting LINQ to SQL (or LINQ to EF) to execute multiple queries in a single call - but this approach certainly isn't going to do it.
One other minor hint which is irrelevant in this case, but can be useful in LINQ to Objects - if you want to find out whether there's any data in a collection, just use Any() instead of Count() > 0 - that way it can stop as soon as it's found anything.
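For illustration, a rough sketch of that simpler approach (assuming the ent context and parentId from the question; each ToList() issues exactly one query and runs it exactly once):
var baskets      = ent.Basket.Where(b => b.SUPERBASKETID == parentId).ToList();
var basketImages = ent.Image.Where(b => b.BASKETID == parentId).ToList();
var basketFiles  = ent.BasketFile.Where(b => b.BASKETID == parentId).ToList();

foreach (var basket in baskets) { /* ... */ }        // no Count()/Any() guard needed
foreach (var image in basketImages) { /* ... */ }
foreach (var file in basketFiles) { /* ... */ }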

You're using IEnumerable in the foreach loop. Implementations only have to prepare data when it's asked for. In other words, the above code is accessing your data lazily -- that is, only when you enumerate the items (which actually happens when you call Count()).
Put a System.Diagnostics.Stopwatch around the call to Count() and see whether that's taking the bulk of the time you're seeing.
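For example, a rough sketch of that measurement, using the names from the code above:
var sw = System.Diagnostics.Stopwatch.StartNew();
int basketCount = mixedElements.Basket.Count();   // forces the deferred query to execute
sw.Stop();
Console.WriteLine("Count() took {0} ms", sw.ElapsedMilliseconds);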
I can't comment further here because you don't specify the type of ent in your code sample.

Related

How to optimize a code using DataTable and Linq?

I have 2 DataTables. There are about 17000 (table1) and 100000 (table2) records.
I need to check whether the field "FooName" contains "ItemName". I also need to take "FooId" and then add "ItemId" and "FooId" to a ConcurrentDictionary.
I have this code.
DataTable table1;
DataTable table2;
var table1Select = table1.Select();
ConcurrentDictionary<double, double> compareDictionary = new ConcurrentDictionary<double, double>();
foreach (var item in table1.AsEnumerable())
{
    var fooItem = from foo in table2.AsEnumerable()
                  where foo.Field<string>("FooName").Contains(item.Field<string>("ItemName"))
                  select foo.Field<double>("FooId");
    if (fooItem != null && fooItem.FirstOrDefault() != 0)
    {
        compareDictionary.TryAdd(item.Field<double>("ItemId"), fooItem.FirstOrDefault());
    }
}
It works slowly (it takes about 10 minutes to perform the task).
I want to make it faster. How can I optimize it?
I see three points you can attack:
ditch the strongly typed Field<T> accessors in favour of direct casts: the idea was to avoid boxing overhead, doubles being value types. Update: as pointed out in the comments, the unboxing happens either way; at best you save some method call overhead (which is, again, arguable), so this point can probably be ignored
cache the reference string so you only read it once per outer-loop iteration
(I think this is where the biggest gains are) since you always seem to take the first result, opt for FirstOrDefault() straight in LINQ so it stops enumerating as soon as a match is found
ConcurrentDictionary<double, double> compareDictionary = new ConcurrentDictionary<double, double>();
foreach (var item in table1.AsEnumerable())
{
    var sample = (string)item["ItemName"]; // cache the value before looping through the inner collection
    var fooItem = table2.AsEnumerable()
        .FirstOrDefault(foo => ((string)foo["FooName"]).Contains(sample)); // you always take the first item, so instruct LINQ to stop as soon as a match is found
    if (fooItem != null && (double)fooItem["FooId"] != 0)
    {
        compareDictionary.TryAdd((double)item["ItemId"], (double)fooItem["FooId"]);
    }
}
It appears that applying .FirstOrDefault() to LINQ query syntax reduces it to method-chain syntax anyway, so I'd opt for method chaining all the way and leave the aesthetics to you.
If you are willing to sacrifice memory for speed, converting the DataTable to a list of just the fields you need gives about a 6x speedup over repeatedly pulling the column data out of table2. (This is in addition to the speedup from using FirstOrDefault.)
var compareDictionary = new ConcurrentDictionary<double, double>();
var t2e = table2.AsEnumerable().Select(r => (FooName: r.Field<string>("FooName"), FooId: r.Field<double>("FooId"))).ToList();
foreach (var item in table1.AsEnumerable().Select(r => (ItemName: r.Field<string>("ItemName"), ItemId: r.Field<double>("ItemId")))) {
var firstFooId = t2e.FirstOrDefault(foo => foo.FooName.Contains(item.ItemName)).FooId;
if (firstFooId != 0.0) {
compareDictionary.TryAdd(item.ItemId, firstFooId);
}
}
I am using C# ValueTuples to avoid reference object overhead from anonymous classes.

EF - A proper way to search several items in database

I have about 100 items (allRights) in the database and about 10 ids to be searched (inputRightsIds). Which is better: first get all rights and then search the items (Variant 1), or make 10 separate check requests to the database (Variant 2)?
Here is some example code:
DbContext db = new DbContext();
int[] inputRightsIds = new int[10]{...};
Variant 1
var allRights = db.Rights.ToList();
foreach( var right in allRights)
{
for(int i = 0; i < inputRightsIds.Length; i++)
{
if(inputRightsIds[i] == right.Id)
{
// Do something
}
}
}
Variant 2
for(int i = 0; i < inputRightsIds.Length; i++)
{
if(db.Rights.Any(r => r.Id == inputRightsIds[i]))
{
// Do something
}
}
Thanks in advance!
As others have already stated, you should do the following.
var matchingIds = from r in db.Rights
where inputRightIds.Contains(r.Id)
select r.Id;
foreach(var id in matchingIds)
{
// Do something
}
But this is different from both of your approaches. In your first approach you are making one SQL call to the DB that returns more results than you are interested in. The second makes multiple SQL calls, each returning only part of the information you want. The query above makes one SQL call to the DB and returns only the data you are interested in. This is the best approach, as it avoids the two bottlenecks of making multiple calls to the DB and of returning too much data.
You can use the following:
db.Rights.Where(right => inputRightsIds.Contains(right.Id));
They should be very similar speeds since both must enumerate the arrays the same number of times. There might be subtle differences in speed between the two depending on the input data but in general I would go with Variant 2. I think you should almost always prefer LINQ over manual enumeration when possible. Also consider using the following LINQ statement to simplify the whole search to a single line.
var matches = db.Rights.Where(r=> inputRightIds.Contains(r.Id));
...//Do stuff with matches
Don't forget to pull all your items into memory if you want to process the list further:
var itemsFromDatabase = db.Rights.Where(r => inputRightsIds.Contains(r.Id)).ToList();
Or you could even enumerate through the collection and do something with each item:
db.Rights.Where(r => inputRightsIds.Contains(r.Id)).ToList().ForEach(item => {
//your code here
});

Looping on IEnumerator<T>, Any Suggestions

I have a situation where looping through the result of a LINQ query is getting on my nerves. Here is my scenario:
I have a DataTable that comes from the database, from which I am taking data as:
var results = from d in dtAllData.AsEnumerable()
              select new MyType
              {
                  ID = d.Field<Decimal>("ID"),
                  Name = d.Field<string>("Name")
              };
Then I apply the ordering, depending on the sort order, as:
if(orderBy != "")
{
string[] ord = orderBy.Split(' ');
if (ord != null && ord.Length == 2 && ord[0] != "")
{
if (ord[1].ToLower() != "desc")
{
results = from sorted in results
orderby GetPropertyValue(sorted, ord[0])
select sorted;
}
else
{
results = from sorted in results
orderby GetPropertyValue(sorted, ord[0]) descending
select sorted;
}
}
}
The GetPropertyValue method is as:
private object GetPropertyValue(object obj, string property)
{
System.Reflection.PropertyInfo propertyInfo = obj.GetType().GetProperty(property);
return propertyInfo.GetValue(obj, null);
}
After this I take out 25 records for the first page like:
results = from sorted in results
.Skip(0)
.Take(25)
select sorted;
So far things are going well. Now I have to pass these results to a method which does some manipulation on the data and returns the desired data. In this method, when I loop through these 25 records, it takes quite a lot of time. My method definition is:
public MyTypeCollection GetMyTypes(IEnumerable<MyType> myData, String dateFormat, String offset)
I have tried foreach and it takes around 8-10 seconds on my machine; the time is spent at this line:
foreach(var _data in myData)
I tried a while loop and it does the same thing. I used it like:
var enumerator = myData.GetEnumerator();
while (enumerator.MoveNext())
{
    var item = enumerator.Current;
    Console.WriteLine(item);
}
This piece of code is taking time at MoveNext
Then I went for a for loop:
int length = myData.Count();
for (int i = 0; i < 25;i++ )
{
var temp = myData.ElementAt(i);
}
This code is taking time at ElementAt
Can anyone please tell me what I am doing wrong? I am using .NET Framework 3.5 in VS 2008.
Thanks in advance
EDIT: I suspect the problem is in how you're ordering. You're using reflection to first fetch and then invoke a property for every record. Even though you only want the first 25 records, it has to call GetPropertyValue on all the records first, in order to order them.
It would be much better if you could do this without reflection at all... but if you do need to use reflection, at least call Type.GetProperty() once instead of for every record.
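For example, a rough sketch of caching the PropertyInfo once (using the MyType, results and ord names from the question):
System.Reflection.PropertyInfo propertyInfo = typeof(MyType).GetProperty(ord[0]);
if (ord[1].ToLower() != "desc")
    results = results.OrderBy(x => propertyInfo.GetValue(x, null));        // one lookup, reused for every record
else
    results = results.OrderByDescending(x => propertyInfo.GetValue(x, null));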
(In some ways this is more to do with helping you diagnose the problem more easily than a full answer as such...)
As Henk said, this is very odd:
results = from sorted in results
.Skip(0)
.Take(25)
select sorted;
You almost certainly really just want:
results = results.Take(25);
(Skip(0) is pointless.)
It may not actually help, but it will make the code simpler to debug.
The next problem is that we can't actually see all your code. You've written:
After doing the order by depending on the sort order
... but you haven't shown how you're performing the ordering.
You should show us a complete example going from DataTable to its use.
Changing how you iterate over the sequence will not help - it's going to do the same thing either way, really - although it's surprising that in your last attempt, Count() apparently works quickly. Stick to the foreach - but work out exactly what that's going to be doing. LINQ uses a lot of lazy evaluation, and if you've done something which makes that very heavy going, that could be the problem. It's hard to know without seeing the whole pipeline.
The problem is that your "results" IEnumerable isn't actually evaluated until it is passed into your method and enumerated. That means that the whole operation - getting all the data from dtAllData, selecting out the new type (which happens on the whole enumerable, not just the first 25), and finally the Take(25) - all happens on the first enumeration of the IEnumerable (foreach, while, whatever).
That's why your method is taking so long. It's actually doing some of the work defined elsewhere inside the method. If you want that to happen before your method, you could do a "ToList()" prior to the method.
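A minimal sketch of that, using the names from the question (dateFormat and offset stand in for whatever you currently pass):
var page = results.ToList();                                        // executes the whole pipeline once; 'page' holds the 25 items
MyTypeCollection myTypes = GetMyTypes(page, dateFormat, offset);    // the method now loops over in-memory data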
You might find it easier to adopt a hybrid approach.
In order:
1) Sort your datatable in-situ. It's probably best to do this at the database level, but, if you can't, then DataTable.DefaultView.Sort is pretty efficient:
dtAllData.DefaultView.Sort = ord[0] + " " + ord[1];
This assumes that ord[0] is the column name, and ord[1] is either ASC or DESC
2) Page through the DefaultView by index:
int pageStart = 0;
List<DataRowView> pageRows = new List<DataRowView>();
for (int i = pageStart; i < dtAllData.DefaultView.Count; i++)
{
    if (i >= pageStart + 25) { break; } // stop once a full page of 25 rows has been collected; the loop condition handles running out of rows
    pageRows.Add(dtAllData.DefaultView[i]);
}
...and create your objects from this much smaller list... (I've assumed the columns are called Id and Name, as well as the types)
List<MyType> myObjects = new List<MyType>();
foreach (DataRowView pageRow in pageRows)
{
    myObjects.Add(new MyType() { ID = Convert.ToInt32(pageRow["Id"]), Name = Convert.ToString(pageRow["Name"]) });
}
You can then proceed with the rest of what you were doing.

Update object in IEnumerable<> not updating?

I have an IEnumerable of a POCO type containing around 80,000 rows,
and a db table (L2E/EF4) containing a subset of rows where there was an "error/difference" (about 5,000 rows, but often repeated, giving about 150 distinct entries).
The following code gets the distinct VSACodes "in error" and then attempts to update the complete result set, updating JUST the rows that match... but it doesn't work!
var vsaCodes = (from g in db.GLDIFFLs
select g.VSACode)
.Distinct();
foreach (var code in vsaCodes)
{
var hasDifference = results.Where(r => r.VSACode == code);
foreach (var diff in hasDifference)
diff.Difference = true;
}
var i = results.Count(r => r.Difference == true);
After this code, i = 0
I've also tried:
foreach (var code in vsaCodes)
{
results.Where(r => r.VSACode == code).Select(r => { r.Difference = true; return r; }).ToList();
}
How can I update "results" so that only the matching items have their Difference property set?
Assuming results is just a query (you haven't shown it), it will be evaluated every time you iterate over it. If that query creates new objects each time, you won't see the updates. If it returns references to the same objects, you would.
If you change results to be a materialized query result - e.g. by adding ToList() to the end - then iterating over results won't issue a new query, and you'll see your changes.
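A minimal sketch of that, assuming results is the (unshown) query from the question:
var materialized = results.ToList();    // execute the query once and keep the objects
foreach (var code in vsaCodes)
{
    foreach (var diff in materialized.Where(r => r.VSACode == code))
        diff.Difference = true;
}
var i = materialized.Count(r => r.Difference == true);   // now reflects the updates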
I had the same kind of error some time ago. The problem is that LINQ queries are often deferred and not executed at the point where it appears you are calling them.
Quotation from "Pro LINQ Language Integrated Query in C# 2010":
"Notice that even though we called the query only once, the results of
the enumeration are different for each of the enumerations. This is
further evidence that the query is deferred. If it were not, the
results of both enumerations would be the same. This could be a
benefit or detriment. If you do not want this to happen, use one of
the conversion operators that do not return an IEnumerable so that
the query is not deferred, such as ToArray, ToList, ToDictionary, or
ToLookup, to create a different data structure with cached results
that will not change if the data source changes."
Here you have a good explanation with examples of it:
http://blogs.msdn.com/b/charlie/archive/2007/12/09/deferred-execution.aspx
Regards
Parsing words pretty closely on #jonskeet's answer...
If your query is simply a filter and the underlying source objects are updated, the query will be reevaluated and may exclude these objects based on the filter condition in which case your query results will change on subsequent enumerations but the underlying objects will still have been updated.
The key is a lack of a projection to a new type as far as updating and persisting the changed objects.
ToList() is the usual solution to this issue, and it will solve the problem if there is a projection to a new type, but things get cloudy in the event your query filters but does not project. Updates to the query results still affect the original source objects, given everything references the same objects.
Again, parsing words but these edge cases can trip you up.
public class Widget
{
public string Name { get; set; }
}
var widgets1 = new[]
{
new Widget { Name = "Red", },
new Widget { Name = "Green", },
new Widget { Name = "Blue", },
new Widget { Name = "Black", },
};
// adding ToList() will result in 'static' query result but
// updates to the objects will still affect the source objects
var query1 = widgets1
.Where(i => i.Name.StartsWith("B"))
//.ToList()
;
foreach (var widget in query1)
{
widget.Name = "Yellow";
}
// produces no output unless you uncomment the ToList() above
// query1 is reevaluated and filters out "yellow" which does not start with "B"
foreach (var name in query1)
Console.WriteLine(name.Name);
// produces Red, Green, Yellow, Yellow
// the underlying widgets were updated
foreach (var name in widgets1)
Console.WriteLine(name.Name);

Optimize code: Linq and foreach loop 15k records

This is my code:
void fixInstellingenTabel(object source, ElapsedEventArgs e)
{
    NASDataContext _db = new NASDataContext();
    List<Instellingen> newOnes = new List<Instellingen>();
    List<InstellingGegeven> li = _db.InstellingGegevens.ToList();
    foreach (InstellingGegeven i in li)
    {
        if (_db.Instellingens.Count(q => q.INST_LOC_REF == i.INST_LOC_REF && q.INST_LOCNR == i.INST_LOCNR
                                          && q.INST_REF == i.INST_REF && q.INST_TYPE == i.INST_TYPE) <= 0)
        {
            // There is no item yet. Create one.
            Instellingen newInst = new Instellingen();
            newInst.INST_LOC_REF = i.INST_LOC_REF;
            newInst.INST_LOCNR = i.INST_LOCNR;
            newInst.INST_REF = i.INST_REF;
            newInst.INST_TYPE = i.INST_TYPE;
            newInst.Opt_KalStandaard = false;
            newOnes.Add(newInst);
        }
    }
    _db.Instellingens.InsertAllOnSubmit(newOnes);
    _db.SubmitChanges();
}
Basically, the InstellingGegevens table gets filled by a procedure from another server.
What I then need to do is check whether there are new records in this table, and insert the new ones into Instellingens.
This code runs for about 4 minutes on 15k records. How do I optimize it? Or is a stored procedure the only way?
This code runs on a timer, every 6 hours. If a stored procedure is best, how do I use it from a timer?
Timer Tim = new Timer(21600000); // 6 hours
Tim.Elapsed += new ElapsedEventHandler(fixInstellingenTabel);
Tim.Start();
Doing this in a stored procedure would be a lot faster. We do something quite similar, only there are about 100k items in the table, it's updated every five minutes, and it has a lot more fields. Our job takes about two minutes to run, and it then does updates in several tables across three databases, so your job should reasonably take only a couple of seconds.
The query you need would just be something like:
create procedure UpdateInstellingens as
insert into Instellingens (
INST_LOC_REF, INST_LOCNR, INST_REF, INST_TYPE, Opt_KalStandaard
)
select q.INST_LOC_REF, q.INST_LOCNR, q.INST_REF, q.INST_TYPE, cast(0 as bit)
from InstellingGegevens q
left join Instellingens i
on q.INST_LOC_REF = i.INST_LOC_REF and q.INST_LOCNR = i.INST_LOCNR
and q.INST_REF = i.INST_REF and q.INST_TYPE = i.INST_TYPE
where i.INST_LOC_REF is null
You can run the procedure from a job in the SQL server, without involving any application at all, or you can use ADO.NET to execute the procedure from your timer.
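As a rough sketch, calling that procedure from the timer handler with ADO.NET might look like this (the "NAS" connection string name is a placeholder for whatever your config uses):
void fixInstellingenTabel(object source, ElapsedEventArgs e)
{
    // run the set-based insert on the server instead of looping in C#
    string connectionString = System.Configuration.ConfigurationManager.ConnectionStrings["NAS"].ConnectionString;
    using (var conn = new System.Data.SqlClient.SqlConnection(connectionString))
    using (var cmd = new System.Data.SqlClient.SqlCommand("UpdateInstellingens", conn))
    {
        cmd.CommandType = System.Data.CommandType.StoredProcedure;
        conn.Open();
        cmd.ExecuteNonQuery();
    }
}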
One way you could optimise this is by changing the Count(...) <= 0 check into !Any(...). However, an even better optimisation would be to retrieve this information in a single query outside the loop:
var instellingens = _db.Instellingens
.Select(q => new { q.INST_LOC_REF, q.INST_LOCNR, q.INST_REF, q.INST_TYPE })
.Distinct()
.ToDictionary(q => q, q => true);
(On second thought, a HashSet would be most appropriate here, but there is unfortunately no ToHashSet() extension method. You can write one of your own if you like!)
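For example, a minimal sketch of such an extension method:
public static class EnumerableExtensions
{
    // wraps the HashSet<T> constructor so it can be used at the end of a LINQ chain
    public static HashSet<T> ToHashSet<T>(this IEnumerable<T> source)
    {
        return new HashSet<T>(source);
    }
}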
And then inside your loop:
if (!instellingens.ContainsKey(new { i.INST_LOC_REF, i.INST_LOCNR,
                                     i.INST_REF, i.INST_TYPE }))
{
    // There is no item yet. Create one.
    // ...
}
Then you can optimise the loop itself by making it lazy-retrieve:
// No need for the List<InstellingGegeven>
foreach (InstellingGegeven i in _db.InstellingGegevens) {
// ...
}
What Guffa said, but using LINQ here is not the best course if performance is what you are after. LINQ, like every other ORM, sacrifices performance for usability. That is usually a great tradeoff for typical application execution paths. On the other hand, SQL is very, very good at set-based operations, so that really is the way to fly here.
