Efficient LINQ Query for large dataset - c#

I have a LINQ query that runs on a DataTable with 500,000+ records. The query returns only one row but takes almost 30 seconds to run. This is my query:
var callDetailsForNodes = from records in dtRowForNode.Select().Select(dr =>
    new
    {
        caller1 = StringComparer.CurrentCultureIgnoreCase.Compare(dr["F1"], dr["F2"]) < 0 ? dr["F1"] : dr["F2"],
        caller2 = StringComparer.CurrentCultureIgnoreCase.Compare(dr["F1"], dr["F2"]) < 0 ? dr["F2"] : dr["F1"],
        time = dr["F3"],
        filters = dr.Field<string>("F9")
    }).Where(dr => (dtMin <= Convert.ToDateTime(dr.time)) && (dtMax >= Convert.ToDateTime(dr.time)) && (lstCallType.Contains(dr.filters))
        && (dtMinTime <= Convert.ToDateTime(dr.time).TimeOfDay) && (dtMaxTime >= Convert.ToDateTime(dr.time).TimeOfDay))
    .GroupBy(drg => new { drg.caller1, drg.caller2 })
    .Select(drg => new { drg.Key.caller1, drg.Key.caller2, count = drg.Count() }).AsEnumerable()
                          where (records.caller1.ToString() == VerSelected || records.caller2.ToString() == VerSelected)
                          select records;
Then I run another query to rearrange the data returned by the query above:
var callDetailsForNodes_ReArrange = from records in callDetailsForNodes.Select(r => new
    {
        caller1 = r.caller1.ToString() == VerSelected ? r.caller1 : r.caller2,
        caller2 = r.caller1.ToString() != VerSelected ? r.caller1 : r.caller2,
        count = r.count
    })
    select records;
Then I just bind this collection to a GridView.
Is there a more efficient way to query such a large dataset?
Edit
I have tried to debug the program step by step and found that these two queries actually run fast; the time is taken at the step where I add the result set to an ObservableCollection to bind it to the GridView. Here is the code:
foreach (var callDetailsForNode_ReArrange in callDetailsForNodes_ReArrange)
{
    _CallForNodes.Add(new CallForNodeData
    {
        Caller1 = callDetailsForNode_ReArrange.caller1.ToString(),
        Caller2 = callDetailsForNode_ReArrange.caller2.ToString(),
        Count = callDetailsForNode_ReArrange.count
    });
}
Here callDetailsForNodes_ReArrange has a result set count of 1.

One thing that would help would be to convert dtMin, dtMax and dtMinTime into the units of the data (dr.time) before the call. Then you can get rid of the Convert.ToDateTime calls that happen multiple times on each record.

I have tidied up your query a little (although this won't make a massive amount of difference to performance, and there may be typos as I don't have VS to hand). From your edit it seems you are a little confused by deferred execution in LINQ. callDetailsForNodes does not represent your results - it is a query that will provide your results once it is executed.
If you have to do all this querying in process I suggest you add a ToList after the first select and run that in isolation. Then add ToList to the Where clause. Calling ToList will force your query to execute and you will see where the delays are.
One final note - you should pass your records directly to ObservableCollection constructor rather than calling Add for each item. Calling Add will (I think) cause the collection to raise a changed notification which is not a big deal for small lists but will slow things down for larger lists.
var callDetailsForNodes = dtRowForNode.AsEnumerable()
    .Select(dr => new {
        caller1 = StringComparer.CurrentCultureIgnoreCase.Compare(dr["F1"], dr["F2"]) < 0 ? dr["F1"] : dr["F2"],
        caller2 = StringComparer.CurrentCultureIgnoreCase.Compare(dr["F1"], dr["F2"]) < 0 ? dr["F2"] : dr["F1"],
        time = Convert.ToDateTime(dr["F3"]),
        filters = dr.Field<string>("F9") })
    .Where(dr => (dtMin <= dr.time)
        && (dtMax >= dr.time)
        && (lstCallType.Contains(dr.filters))
        && (dtMinTime <= dr.time.TimeOfDay)
        && (dtMaxTime >= dr.time.TimeOfDay)
        && (dr.caller1.ToString() == VerSelected || dr.caller2.ToString() == VerSelected))
    .GroupBy(drg => new { drg.caller1, drg.caller2 })
    .Select(drg => new { drg.Key.caller1, drg.Key.caller2, count = drg.Count() });
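On the last point about the ObservableCollection, here is a minimal sketch of building the collection in one go instead of calling Add per item. It reuses CallForNodeData and _CallForNodes from the question; if your grid is bound to the existing _CallForNodes instance you may need to rebind after reassigning it:
var items = callDetailsForNodes_ReArrange
    .Select(r => new CallForNodeData
    {
        Caller1 = r.caller1.ToString(),
        Caller2 = r.caller2.ToString(),
        Count = r.count
    });
// the constructor copies the items without raising a change notification per element
_CallForNodes = new ObservableCollection<CallForNodeData>(items);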

Related

This code takes 2 hours to compare and sort 20,000 items each; is there a better way to write this C# code?

I am trying to sort all the updated items in DataTableA by coloring the items that have not been completely updated and removing the items that have been updated completely from the DataTable. Both the completely updated items and the incompletely updated items are in the "managed" table in the database; the discharge date is null if an item has not been completely updated.
The code below works, but it can take all day for the page to run. This is a C# WebForm.
The code below is in my code-behind file:
foreach (GridDataItem dataItem in RadGrid1.Items)
{
    var panu = dataItem["Inumber"];
    var panum = panu.Text;
    var _cas = db.managed.Any(b =>
        b.panumber == panum && b.dischargedate != null);
    var casm = db.managed.Any(b =>
        b.panumber == panum && b.dischargedate == null);
    if (_cas == true)
    {
        dataItem.Visible = false;
    }
    if (casm == true)
    {
        dataItem.BackColor = Color.Yellow;
    }
}
As mentioned in the comment, each call to db.managed.Any will create a new SQL query.
There are various improvements you can make to speed this up:
First, you don't need to call db.managed.Any twice inside the loop if it's checking the same unique entity. Call it just once and check dischargedate. This alone will speed up the loop 2x.
// one database call, fetching one column
var dischargedate = db.managed
    .Where(b => b.panumber == panum)
    .Select(x => x.dischargedate)
    .FirstOrDefault();
var _cas = dischargedate != null;
var casm = dischargedate == null;
If panumber is not a unique primary key and you don't have a SQL index for this column, then each db.managed.Any call will scan all rows in the table. This is easily solved by creating an index on panumber and dischargedate, so if you don't have this index, create it.
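If you happen to be on EF Core, one way to declare that composite index is through the fluent API; this is just a sketch assuming a Managed entity class mapped to the managed table (otherwise simply create the index directly in the database):
// hypothetical EF Core mapping - adjust the entity and property names to your model
protected override void OnModelCreating(ModelBuilder modelBuilder)
{
    modelBuilder.Entity<Managed>()
        .HasIndex(m => new { m.panumber, m.dischargedate });
}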
Ideally, if the table is not huge, you can just load it all at once. But even if you have tens of millions of records, you can split the loop into several chunks, instead of repeating the same query over and over again.
Consider using better naming for your variables. _cas and casm are a poor choice of variable names.
Pro tip: Always code as if the person who ends up maintaining your code is a violent psychopath who knows where you live.
So if you don't have hundreds of thousands of items, here is the simplest fix: load the panumber and dischargedate values for all rows from that table into memory, and then use a dictionary to find the items instantly:
// load all into memory
var allDischargeDates = await db.managed
    .Select(x => new { x.panumber, x.dischargedate })
    .ToListAsync(cancellationToken);

// create a dictionary so that you can quickly map panumber -> dischargedate
var dischargeDateByNumber = allDischargeDates
    .ToDictionary(x => x.panumber, x => x.dischargedate);

foreach (GridDataItem dataItem in RadGrid1.Items)
{
    var panu = dataItem["Inumber"];
    var panum = panu.Text;

    // this is very fast to check now
    if (!dischargeDateByNumber.TryGetValue(panum, out DateTime? dischargeDate))
    {
        // no such entry - in this case your original code just skips the item
        continue;
    }

    if (dischargeDate != null)
    {
        dataItem.Visible = false;
    }
    else
    {
        dataItem.BackColor = Color.Yellow;
    }
}
If the table is huge and you only want to load certain items, you would do:
// get the list of numbers to fetch from the database
// (this should not be a large list!)
var someList = RadGrid1
    .Items
    .Select(x => x["Inumber"].Text)
    .ToList();

// load these items into memory
var allDischargeDates = await db.managed
    .Where(x => someList.Contains(x.panumber))
    .Select(x => new { x.panumber, x.dischargedate })
    .ToListAsync(cancellationToken);
But there is a limit on how large someList can be (you don't want to run this query for a list of 200 thousand items).
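One way to stay under that limit is to load the data in chunks and build the lookup incrementally. A rough sketch; the chunk size of 1000 is arbitrary, and the panumber/dischargedate types are assumed to be string and DateTime? as in the question:
// load discharge dates in batches instead of one huge Contains list
var dischargeDateByNumber = new Dictionary<string, DateTime?>();
const int chunkSize = 1000;
for (int i = 0; i < someList.Count; i += chunkSize)
{
    var chunk = someList.Skip(i).Take(chunkSize).ToList();
    var batch = await db.managed
        .Where(x => chunk.Contains(x.panumber))
        .Select(x => new { x.panumber, x.dischargedate })
        .ToListAsync(cancellationToken);
    foreach (var row in batch)
        dischargeDateByNumber[row.panumber] = row.dischargedate;
}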
Well, 900 items might be worth simply fetching into a list in memory and then processing that. It will definitely be faster, although it consumes more memory.
You can do something like this (assuming the type of managed is Managed):
List<Managed> myList = db.managed.ToList();
That will fetch the whole table.
Now replace your code with:
foreach (GridDataItem dataItem in RadGrid1.Items)
{
    var panu = dataItem["Inumber"];
    var panum = panu.Text;
    var _cas = myList.Any(b =>
        b.panumber == panum && b.dischargedate != null);
    var casm = myList.Any(b =>
        b.panumber == panum && b.dischargedate == null);
    if (_cas == true)
    {
        dataItem.Visible = false;
    }
    if (casm == true)
    {
        dataItem.BackColor = Color.Yellow;
    }
}
You should see a huge performance improvement.
Another thing: You don't mention what database you're using, but you should make sure the panumber column is properly indexed.

EF Core - Exclude from IQueryable if Datetime-Value is between 2 DateTimes from a list

I'm currently trying to achieve the following:
I have an IQueryable (UserDataDateRange) that has "from" and "to" DateTime values, and another, bigger IQueryable (UserData) with a DateTime value. Basically I want to exclude every item from UserData whose DateTime value falls between the "from" and "to" of any UserDataDateRange entry.
Also, every DateTime is nullable, and I just want to ignore the null ones.
Here is what I have tried:
private IQueryable<Userdata> ExcludeIfInDaterange(IQueryable<Userdata> query)
{
    var dateRangeQuery = DBContext.UserDateDateRange.Where(x => x.From.HasValue && x.To.HasValue);
    query = query.Where(l => !l.UserDate.HasValue);
    foreach (var q in dateRangeQuery)
    {
        query = query.Where(l => l.UserDate.Value <= q.From.Value && l.UserDate.Value >= q.To.Value);
    }
    return query;
}
From my understanding this should work? Also, I have tried to avoid using something like ToArray, because from my understanding an IQueryable is basically the SQL that I'm manipulating, and something like ToArray gives me the actual data.
However, I really don't know what I'm doing wrong. There's no real exception; I'm just getting the following error:
Could not get function from a frame. The code is not available. The error code is CORDBG_E_CODE_NOT_AVAILABLE, or 0x80131309.
My function seems to break the query, but I can't figure out why. I can't even use Count(); it gives me the same error.
Anyone got an idea?
l.UserDate.Value <= q.From.Value && l.UserDate.Value >= q.To.Value
How can a date be both earlier than the From date and later than the To date at the same time? Unless you have some very odd data in your UserDateDateRange table, that filter will exclude all records.
You're also combining two mutually-exclusive filters with an AND operator, which is another way to exclude all records:
!l.UserDate.HasValue : UserDate is NULL
l.UserDate.Value <= ... : Can't possibly be satisfied, since UserDate is NULL
And there's no need to use .Value when comparing nullable properties with < / <= / > / >=.
Try something like:
private IQueryable<Userdata> ExcludeIfInDaterange(IQueryable<Userdata> query)
{
    var dateRangeQuery = DBContext.UserDateDateRange.Where(x => x.From.HasValue && x.To.HasValue);
    return query.Where(l => !l.UserDate.HasValue || !dateRangeQuery.Any(q => q.From <= l.UserDate && l.UserDate <= q.To));
}
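For what it's worth, a quick usage sketch (assuming DBContext.UserData is the DbSet behind the UserData IQueryable from the question); the SQL is only generated and executed when the result is enumerated:
var filtered = ExcludeIfInDaterange(DBContext.UserData);
var list = filtered.ToList();    // the query runs here
var count = filtered.Count();    // or here, as a separate round trip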

Collections manipulation, need help optimizing this code from a report generator

I'm creating a report-generating tool that uses custom data types from different sources in our system. The user can create a report schema and, depending on what is asked, the data gets associated based on different index keys, time, time ranges, etc. The project is NOT doing queries against a relational database; it's pure C# code over collections in RAM.
I'm having a huge performance issue; I've been looking at my code for a few days and am struggling to optimize it.
I stripped the code down to a minimal example of what the profiler points to as the problematic algorithm, but the real version is a bit more complex, with more conditions and work with dates.
In short, this function returns the subset of "values" satisfying the conditions, depending on the keys of the values that were selected from the "index rows".
private List<LoadedDataSource> GetAssociatedValues(IReadOnlyCollection<List<LoadedDataSource>> indexRows, List<LoadedDataSource> values)
{
    var checkContainers = ((ValueColumn.LinkKeys & ReportLinkKeys.ContainerId) > 0 &&
        values.Any(t => t.ContainerId.HasValue));
    var checkEnterpriseId = ((ValueColumn.LinkKeys & ReportLinkKeys.EnterpriseId) > 0 &&
        values.Any(t => t.EnterpriseId.HasValue));

    var ret = new List<LoadedDataSource>();
    foreach (var value in values)
    {
        var valid = true;
        foreach (var index in indexRows)
        {
            // ContainerId
            var indexConservedSource = index.AsEnumerable();
            if (checkContainers && index.CheckContainer && value.ContainerId.HasValue)
            {
                indexConservedSource = indexConservedSource.Where(t => t.ContainerId.HasValue && t.ContainerId.Value == value.ContainerId.Value);
                if (!indexConservedSource.Any())
                {
                    valid = false;
                    break;
                }
            }

            // EnterpriseId
            if (checkEnterpriseId && index.CheckEnterpriseId && value.EnterpriseId.HasValue)
            {
                indexConservedSource = indexConservedSource.Where(t => t.EnterpriseId.HasValue && t.EnterpriseId.Value == value.EnterpriseId.Value);
                if (!indexConservedSource.Any())
                {
                    valid = false;
                    break;
                }
            }
        }
        if (valid)
            ret.Add(value);
    }
    return ret;
}
This works for small samples, but as soon as I have thousands of values, and 2-3 index rows with a few dozen values each, it can take hours to generate.
As you can see, I try to break as soon as an index condition fails and move on to the next value.
I could probably do everything in a single values.Where(####).ToList(), but that condition gets complex fast.
I tried wrapping indexConservedSource in an IQueryable, but it was even worse. I tried using a Parallel.ForEach with a ConcurrentBag for ret, and it was also slower.
What else can be done?
What you are doing, in principle, is calculating the intersection of two sequences. You use two nested loops, and that is slow: the time is O(m*n). You have two other options:
sort both sequences and merge them
convert one sequence into a hash table and test the second against it
The second approach seems better for this scenario. Just convert those index lists into HashSets and test the values against them. I added some code for inspiration:
private List<LoadedDataSource> GetAssociatedValues(IReadOnlyCollection<List<LoadedDataSource>> indexRows, List<LoadedDataSource> values)
{
    var ret = values;

    if ((ValueColumn.LinkKeys & ReportLinkKeys.ContainerId) > 0 &&
        ret.Any(t => t.ContainerId.HasValue))
    {
        var indexes = indexRows
            .Where(i => i.CheckContainer)
            .Select(i => new HashSet<int>(i
                .Where(h => h.ContainerId.HasValue)
                .Select(h => h.ContainerId.Value)))
            .ToList();
        ret = ret.Where(v => v.ContainerId == null
                || indexes.All(i => i.Contains(v.ContainerId.Value)))
            .ToList();
    }

    if ((ValueColumn.LinkKeys & ReportLinkKeys.EnterpriseId) > 0 &&
        ret.Any(t => t.EnterpriseId.HasValue))
    {
        var indexes = indexRows
            .Where(i => i.CheckEnterpriseId)
            .Select(i => new HashSet<int>(i
                .Where(h => h.EnterpriseId.HasValue)
                .Select(h => h.EnterpriseId.Value)))
            .ToList();
        ret = ret.Where(v => v.EnterpriseId == null
                || indexes.All(i => i.Contains(v.EnterpriseId.Value)))
            .ToList();
    }

    return ret;
}

Improve performance on LINQ Query

My LINQ query goes slow when I try to loop through the results to create an XElement, which I later process with XSLT.
Here is my code:
public override XElement Search(SearchCriteria searchCriteria)
{
    XElement root = new XElement("Root");
    using (ReportOrderLogsDataContext dataContext = DataConnection.GetLinqDataConnection<ReportOrderLogsDataContext>(searchCriteria.GetConnectionString()))
    {
        try
        {
            IQueryable<vw_udisclosedDriverResponsePart> results = from a in dataContext.vw_udisclosedDriverResponseParts
                                                                  where
                                                                      (a.CreateDt.HasValue &&
                                                                       a.CreateDt >= Convert.ToDateTime(searchCriteria.BeginDt) &&
                                                                       a.CreateDt <= Convert.ToDateTime(searchCriteria.EndDt))
                                                                  select a;
            if (!string.IsNullOrEmpty(searchCriteria.AgentNumber))
            {
                results = results.Where(request => request.LgAgentNumber == searchCriteria.AgentNumber);
            }
            if (!string.IsNullOrEmpty(searchCriteria.AgentTitle))
            {
                results = results.Where(a => a.LgTitle == searchCriteria.AgentTitle);
            }
            if (!string.IsNullOrEmpty(searchCriteria.QuotePolicyNumber))
            {
                results = results.Where(a => a.QuotePolicyNumber == searchCriteria.QuotePolicyNumber);
            }
            if (!string.IsNullOrEmpty(searchCriteria.InsuredName))
            {
                results = results.Where(a => a.LgInsuredName.Contains(searchCriteria.InsuredName));
            }
            foreach (var match in results) // goes slow here, specifically times out before evaluating the first match when results are too large.
            {
                DateTime date;
                string strDate = string.Empty;
                if (DateTime.TryParse(match.CreateDt.ToString(), out date))
                {
                    strDate = date.ToString("MM/dd/yyyy");
                }
                root.Add(new XElement("Record",
                    new XElement("System", "Not Supported"),
                    new XElement("Date", strDate),
                    new XElement("Agent", match.LgAgentNumber),
                    new XElement("UserId", match.LgUserId),
                    new XElement("UserTitle", match.LgTitle),
                    new XElement("QuoteNum", match.QuotePolicyNumber),
                    new XElement("AddressLine1", match.AddressLine1),
                    new XElement("AddressLine2", match.AddressLine2),
                    new XElement("City", match.City),
                    new XElement("State", match.State),
                    new XElement("Zip", match.Zip),
                    new XElement("DriverName", string.Concat(match.GivenName, " ", match.SurName)),
                    new XElement("DriverLicense", match.LicenseNumber),
                    new XElement("LicenseState", match.LicenseState)));
            }
        }
        catch (Exception es)
        {
            throw es;
        }
    }
    return root;
    // return GetSearchedCriteriaFromStoredPocedure(searchCriteria);
}
I assume there is a better way to convert the results object into an XElement. Processing the view itself only takes about 2 seconds. Trying to loop through the results object results in a timeout, even when not many results are returned.
Any help would be appreciated.
Thanks!
-James
AMENDED 7/10/2012
The issue is not with the LINQ query itself; it's with the execution of the view when a date range is specified. Executing the view by itself takes about 4-6 seconds. When a small date range (07/05/2012 - 07/10/2012) is used, the view takes around 1:30. Does anyone have suggestions for improving the performance of the query with a date range specified? It's faster if I get all of the results and loop through them checking the date.
i.e.
IQueryable<vw_udisclosedDriverResponsePart> results = from a in dataContext.vw_udisclosedDriverResponseParts select a;
foreach (var match in results) // results only takes 3 seconds to enumerate; before, it would time out
{
    // eval search criteria date here.
}
I can code it like I suggested above, but does anyone have a better way?
How does the database perform? The simplest test is to run a sample query - a query that retrieves the data you need from the database - just to test database indexing and performance, because in 99% of cases that's the cause of slowness.
I would guess that the slowness is occurring because
you are iterating from the database, rather than retrieving all the rows up front, and
you are selecting on bad WHERE conditions (are your indexes correct?)
Firstly, call ToList to get the results, to determine whether the slowness is happening in the database or in the XML construction:
if (!string.IsNullOrEmpty(searchCriteria.InsuredName))
{
    //...
}
var matches = results.ToList();
foreach (var match in matches)
{
    //...
Assuming that the var matches = results.ToList() is very slow, I'd look at the functions in the WHERE clause
(a.CreateDt.HasValue &&
a.CreateDt >= Convert.ToDateTime(searchCriteria.BeginDt) &&
a.CreateDt <= Convert.ToDateTime(searchCriteria.EndDt))
to check that they aren't being executed for every row.
If you use SQL Server, run Profiler (in the Tools menu) to trace the SQL that LINQ to SQL generates.
And, of course, do the conversion outside the LINQ query; the criteria won't change while the LINQ expression runs.
From what you posted, I made this example:
var begin = Convert.ToDateTime(searchCriteria.BeginDt);
var end = Convert.ToDateTime(searchCriteria.EndDt);
var results = from a in searchList
              where ((a.CreateDt.HasValue &&
                      a.CreateDt >= begin &&
                      a.CreateDt <= end)
                     && (string.IsNullOrEmpty(searchCriteria.AgentNumber) || a.LgAgentNumber == searchCriteria.AgentNumber)
                     && (string.IsNullOrEmpty(searchCriteria.AgentTitle) || a.LgTitle == searchCriteria.AgentTitle)
                     && (string.IsNullOrEmpty(searchCriteria.QuotePolicyNumber) || a.QuotePolicyNumber == searchCriteria.QuotePolicyNumber)
                     && (string.IsNullOrEmpty(searchCriteria.InsuredName) || a.LgInsuredName.Contains(searchCriteria.InsuredName))
                    )
              select a;
Perhaps this is helpful for you.
For measuring the time I used the following:
var watch = new Stopwatch();
watch.Start();
var arr = results.ToArray(); // force evaluation of linq
watch.Stop();
var elapsed = watch.ElapsedTicks;
The altered query already seems to be about 30-40% faster on average, but I have only done a few runs.
I would suggest a few experiments:
One.
Put a
int count = results.Count();
before the foreach and see if this takes a long time.
Two.
Leave the Count() call in and see if the foreach is still slow. If it is fast, that would suggest that the initial connection to the db is slow.
As others suggested, have a look at how your query performs in the db (actually type it in the database, without C#).
You could also post a SHOW TABLE result so the community could inspect the indexes and help you with a fix.

Problem creating empty IQueryable<T> object

Basically I want to merge two IQueryables into one IQueryable and then return the complete record set after my loop ends. It runs without errors, but in the end objret has nothing, even though when I debug the loop, obj has some records. What am I doing wrong?
IQueryable<MediaType> objret = Enumerable.Empty<MediaType>().AsQueryable();
var typ = _db.MediaTypes.Where(e => e.int_MediaTypeId != 1 && e.int_MediaTypeId_FK == null).ToList();
for (int i = 0; i < typ.Count; i++)
{
    IQueryable<MediaType> obj = _db.MediaTypes.Where(e => e.bit_IsActive == true && e.int_MediaTypeId_FK == typ[i].int_MediaTypeId);
    IQueryable<MediaType> obj1 = _db.MediaTypes.Where(e => e.int_OrganizationId == Authorization.OrganizationID && e.bit_IsActive == true && e.int_MediaTypeId_FK == typ[i].int_MediaTypeId);
    if (obj1.Count() > 0)
        obj.Concat(obj1);
    if (obj.Count() > 0)
        objret.Concat(obj);
}
return objret;
Just like the other query operators, Concat doesn't change the existing sequence - it returns a new sequence.
So these lines:
if (obj1.Count() > 0)
    obj.Concat(obj1);
if (obj.Count() > 0)
    objret.Concat(obj);
should be
if (obj1.Count() > 0)
    objret = objret.Concat(obj1);
if (obj.Count() > 0)
    objret = objret.Concat(obj);
I'm not sure how well IQueryable is going to handle this, given that you're mixing LINQ to SQL (? maybe Entities) with Enumerable.AsQueryable, mind you. Given that you're already executing the queries to some extent due to the Count() calls, have you considered building up a List<T> instead?
(You don't need to execute the Count() at all - just call List<T>.AddRange(obj1) and ditto for obj.)
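A minimal sketch of that List<T> approach, keeping the question's names (the final AsQueryable is only needed if the caller really requires an IQueryable<MediaType>):
var merged = new List<MediaType>();
foreach (var parent in typ)
{
    // materialize each result set and append it; no Count() checks are needed
    merged.AddRange(_db.MediaTypes
        .Where(e => e.bit_IsActive == true && e.int_MediaTypeId_FK == parent.int_MediaTypeId));
    merged.AddRange(_db.MediaTypes
        .Where(e => e.int_OrganizationId == Authorization.OrganizationID
                 && e.bit_IsActive == true
                 && e.int_MediaTypeId_FK == parent.int_MediaTypeId));
}
return merged.AsQueryable();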
As jeroenh mentioned, ideally it would be nice to use a solution that could do it all at the database, without looping at all in your C# code.
I think you should not do this with a for loop. The code you posted will go to the db for every single active mediatype twice to get Count() and additionally, twice to get actual results.
Checking the Count() property is not necessary: concatenating empty result sets has no additional effect.
Furthermore, I think what you're trying to achieve can be done with a single query, something along the lines of (not tested):
// build up the query
var rootTypes = _db.MediaTypes.Where(e => e.int_MediaTypeId != 1 && e.int_MediaTypeId_FK == null);
var activeChildren = _db.MediaTypes
    .Where(e => e.bit_IsActive);
var activeChildrenForOrganization = _db.MediaTypes
    .Where(e => e.int_OrganizationId == Authorization.OrganizationID && e.bit_IsActive);
var q = from types in rootTypes
        join e in activeChildren
            on types.int_MediaTypeId equals e.int_MediaTypeId_FK into joined1
        join e2 in activeChildrenForOrganization
            on types.int_MediaTypeId equals e2.int_MediaTypeId_FK into joined2
        select new { types, joined1, joined2 };

// evaluate the query and concatenate the results.
// This will only go to the db once
return q.ToList().SelectMany(x => x.joined1.Concat(x.joined2));
