I want to get items 50-100 back from a result set in linq. How do I do that?
Situation:
I get back an index of the last item of the last result set. I want to then grab the next 50. I do not have the ID of the last result only its index number.
You have to order it by something, otherwise "the next 50" isn't well defined.
So it would be something like:
mycontext.mytable
    .OrderBy(item => item.PropertyYouWantToOrderBy)
    .Skip(HowManyYouWantToSkip)
    .Take(50);
LINQ is based on the concept of one-way enumeration, so queries all start at the beginning. To implement paging you'll have to use the Skip and Take extensions to isolate the items you're interested in:
int pageSize = 50;
// 0-based page number
int pageNum = 2;
var pageContent = myCollection.Skip(pageSize * pageNum).Take(pageSize);
Of course this just sets up an IEnumerable<T> that, when enumerated, will step through myCollection 100 times, then start returning data for 50 steps before closing.
Which is fine if you're working on something that can be enumerated multiple times, but not if you're working with a source that will only enumerate once. But then you can't realistically implement paging on that sort of enumeration anyway; you need intermediate storage for at least the portion you've already consumed.
In LINQ to SQL this will result in a query that attempts to select only the 50 records you've asked for, which (from memory) is based on taking the first numSkip + numTake records, reversing the sort order, taking numTake records and reversing again. Depending on the sort order you've set up and the size of the numbers involved, this could be a much more expensive operation than simply pulling a bunch of data back and filtering it in memory.
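To avoid repeating the Skip/Take arithmetic, you could wrap the pattern in a small extension method. This is just a minimal sketch of my own (the Page name is not from the answers above), and it assumes the source is already ordered:
using System.Linq;

public static class PagingExtensions
{
    // pageNum is 0-based; order the source first so the result is deterministic
    public static IQueryable<T> Page<T>(this IQueryable<T> source, int pageNum, int pageSize)
    {
        return source.Skip(pageNum * pageSize).Take(pageSize);
    }
}

// Usage, continuing the example above:
// var page = mycontext.mytable
//     .OrderBy(item => item.PropertyYouWantToOrderBy)
//     .Page(pageNum: 2, pageSize: 50);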
Using an ADO.NET entity data model, I've constructed the two queries below against a table containing 1,800 records with just over 30 fields, and they yield staggeringly different results.
// Executes slowly, over 6000 ms
int count = context.viewCustomers.AsNoTracking()
.Where(c => c.Cust_ID == _custID).Count();
// Executes instantly, under 20 ms
int count = context.viewCustomers.AsNoTracking()
.Where(c => c.Cust_ID == 625).Count();
I can see from the database log that Entity Framework produces almost identical queries in both cases, except that the filter portion of the first uses a parameter. Copying this query into SSMS and declaring & setting the parameter there results in a near-instant query, so it doesn't appear to be a problem on the database end of things.
Has anyone encountered this who can explain what's happening? I'm at the mercy of a third-party control that adds this command to the query in an attempt to limit the number of rows returned, and getting the count is a must. This is used for several queries, so a generic solution is needed. It's unfortunate that it doesn't work as advertised; it seems to make the query take 5-10 times as long as it would if I just loaded the entire view into memory. When no filter is used, however, it works like a dream.
The components ship with source code, so I can change this behavior, but I need to consider which approaches can provide a reusable solution.
You didn't mention the design details of your model, but if you only want the count of records matching a condition, you can optimize by counting over a single column. For example:
int count = context.viewCustomers.AsNoTracking().Where(c => c.Cust_ID == _custID).Count();
If your view has 10 columns and, say, 100 records match the filter above, then the result set behind that count carries all 10 columns of data for every record, which is of no use for a count.
You can optimize this by restricting the counted result set to a single column:
int count = context.viewCustomers.AsNoTracking().Where(c => c.Cust_ID == _custID).Select(x => new { x.column }).Count();
Other optimizations, such as the async variant CountAsync, can also be used.
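As a small illustration of the async suggestion, here is a hedged sketch; CountAsync comes from System.Data.Entity (EF6) or Microsoft.EntityFrameworkCore (EF Core), and the surrounding method is assumed to be async:
// using System.Data.Entity;            (EF6)
// using Microsoft.EntityFrameworkCore; (EF Core)
int count = await context.viewCustomers
    .AsNoTracking()
    .Where(c => c.Cust_ID == _custID)
    .CountAsync();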
We are using a Search index to power one of our APIs. The index is populated by Azure Functions which pull data from the database. We can see that the number of records in the database and in the Search service differ. Is there any way to get the list of keys in the Search service so that we can compare them with the database and see which keys are missing?
Regards,
John
The Azure Search query API is designed for search/filter scenarios; it doesn't offer an efficient way to traverse all documents.
That said, you can do this reasonably well by scanning the keys in order: if you have a field in your index (the key field or another one) that is both filterable and sortable, you can use $select to pull only the keys for each document, 1000 at a time, ordered by that field. After you retrieve the first 1000, don't use $skip (which is capped at 100,000); instead, filter with a greater-than comparison on that field, using the highest value you saw in the previous response. This lets you traverse the whole set with reasonable performance, although doing it 1000 at a time will take a while.
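A rough sketch of that approach with the Azure.Search.Documents SDK might look like the following. The endpoint, index name ("customers"), key field name ("key") and credentials are placeholders of mine, not from the question:
using System;
using System.Collections.Generic;
using Azure;
using Azure.Search.Documents;
using Azure.Search.Documents.Models;

// Hypothetical service details - adjust to your index.
var client = new SearchClient(new Uri("https://<service>.search.windows.net"),
                              "customers",
                              new AzureKeyCredential("<api-key>"));

string lastKey = null;
var allKeys = new List<string>();

while (true)
{
    var options = new SearchOptions { Size = 1000 };
    options.Select.Add("key");                   // pull back only the key field
    options.OrderBy.Add("key asc");              // the field must be sortable
    if (lastKey != null)
        options.Filter = $"key gt '{lastKey}'";  // the field must be filterable

    var response = await client.SearchAsync<SearchDocument>("*", options);

    int returned = 0;
    await foreach (var result in response.Value.GetResultsAsync())
    {
        lastKey = result.Document.GetString("key");
        allKeys.Add(lastKey);
        returned++;
    }

    if (returned == 0)
        break;                                   // traversed the whole index
}
// allKeys can now be compared against the database.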
You can search for "*" and use $orderby and $filter to page through all the data, as in the following example.
I use the metadata_storage_last_modified field as the filter.
offset     skip       last-modified handling
0          0
100,000    100,000    getLastTime
101,000    0          useLastTime
200,000    99,000     useLastTime
201,000    100,000    useLastTime & getLastTime
202,000    0          useLastTime
Because the $skip limit is 100k, we can calculate skip as:
AzureSearchSkipLimit = 100k
AzureSearchTopLimit = 1k
skip = offset % (AzureSearchSkipLimit + AzureSearchTopLimit)
If the total search count is larger than AzureSearchSkipLimit, apply
orderby = "metadata_storage_last_modified desc"
When skip reaches AzureSearchSkipLimit, take the metadata_storage_last_modified time from the end of the data and use it as the filter for the next 100k of the search:
filter = metadata_storage_last_modified lt ${metadata_storage_last_modified}
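A small C# sketch of that bookkeeping; the helper name GetSkip and the comments are mine, not from the answer:
const int AzureSearchSkipLimit = 100_000;  // $skip cap within one filter window
const int AzureSearchTopLimit = 1_000;     // $top cap per request

// Maps a logical offset onto the $skip value used inside the current filter window.
static int GetSkip(int offset) => offset % (AzureSearchSkipLimit + AzureSearchTopLimit);

// Whenever GetSkip(offset) reaches AzureSearchSkipLimit, remember the
// metadata_storage_last_modified value of the last document in that page and use
// "metadata_storage_last_modified lt <that value>" as the filter for the next window.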
I have an IEnumerable that is the result of a database query over a large dataset, as many as 10,000 records. I need the count to display on the webpage for pagination. How can I do it? Using .Count() either results in exceptions like "The underlying provider failed on Open" or takes way too long.
Is there a way I can query the database to get the count of the results with LINQ to SQL?
It may be how you are using LINQ. If you build the query like this:
var users = (from u in context.Users select u);
int userCount = users.Count();
That would effectively issue only a query that returns a count from the database. If you did something like this:
List<User> users = (from u in context.Users select u).ToList();
int userCount = users.Count();
That would retrieve the records from the database and then count them on the client.
Let the database tell you the count - databases are built to be able to do this - and select only the rows you need instead of returning the whole set when you only want to use a small subset.
You could do a SELECT COUNT query first, but then some data might have been added or deleted in the meantime. Wrap it in a transaction and you'll have a huge locking issue.
Another way would be to get page 1, then fetch the rest in batches on another thread; when that finishes, update the count and the display. You could even show a running count to indicate progress in getting all the data.
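A minimal sketch of that count-plus-page pattern, assuming an Entity Framework context with a Users set (the names are mine for illustration):
int pageSize = 50;
int pageNum = 0; // 0-based

// One query for the total; this translates to SELECT COUNT(*) on the server.
int total = context.Users.Count();

// A separate query for just the rows on the current page.
var page = context.Users
    .OrderBy(u => u.Id)
    .Skip(pageNum * pageSize)
    .Take(pageSize)
    .ToList();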
Why are you getting all the records at once?
If you looked at it as getting the first, last, previous and next page, and then dealt with the issues that arise from that approach, you'd have a scalable solution.
The reason the count is taking so long is that it has to pull all 10,000 records into the client in order to count them.
It's a solution that will scale very badly; imagine 100,000 records, or a million!
The other point is latency. Those 100,000 records are a snapshot from the time the query was executed; refreshing them all isn't going to be popular, is it? And when you do refresh, you have all the "keep me on the page I was on" issues...
If you reworked it to get the next n records after the last one currently displayed, or before the first...
Then you could cache, refresh pages in the background, and even try to predict requests, say by fetching the next n pages in advance.
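A hedged sketch of that "next n after the last one displayed" idea (keyset paging), assuming the rows have an orderable Id and you know the Id of the last row on screen:
int pageSize = 50;
int lastDisplayedId = 1234; // assumed: the Id of the last row currently displayed

// No Skip involved, so the database can seek straight to the next page.
var nextPage = context.Users
    .Where(u => u.Id > lastDisplayedId)
    .OrderBy(u => u.Id)
    .Take(pageSize)
    .ToList();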
I have a set of 'codes' Z that are valid in a certain time period.
Since I need them many times in a large loop (a million+ iterations) and every time I have to look up the corresponding code, I cache them in a List<>. After finding the correct codes, I'm inserting a million rows (using SqlBulkCopy).
I look up the id with the following code (l_z is a List<T>):
var z_fk = (from z in l_z
where z.CODE == lookupCode &&
z.VALIDFROM <= lookupDate &&
z.VALIDUNTIL >= lookupDate
select z.id).SingleOrDefault();
In other situations I have used a Dictionary with superb performance, but in those cases I only had to look up the id based on the code.
But now, searching on a combination of fields, I am stuck.
Any ideas? Thanks in advance.
Create a Dictionary that stores a List of items per lookup code - Dictionary<string, List<Code>> (assuming that lookup code is a string and the objects are of type Code).
Then when you need to query based on lookupDate, you can run your query directly off of dict[lookupCode]:
var z_fk = (from z in dict[lookupCode]
where z.VALIDFROM <= lookupDate &&
z.VALIDUNTIL >= lookupDate
select z.id).SingleOrDefault();
Then just make sure that whenever you have a new Code object, it gets added to the List<Code> collection in the dict corresponding to its lookupCode (and if one doesn't exist yet, create it).
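A small sketch of building that structure up front, reusing l_z, lookupCode and lookupDate from the question and assuming the items are of type Code:
// Build once, before the big loop.
var dict = new Dictionary<string, List<Code>>();
foreach (var z in l_z)
{
    if (!dict.TryGetValue(z.CODE, out var list))
        dict[z.CODE] = list = new List<Code>();
    list.Add(z);
}

// Inside the loop: only the entries for this code are scanned.
var z_fk = dict.TryGetValue(lookupCode, out var candidates)
    ? candidates.Where(z => z.VALIDFROM <= lookupDate && z.VALIDUNTIL >= lookupDate)
                .Select(z => z.id)
                .SingleOrDefault()
    : default;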
A simple improvement would be to use...
//in initialization somewhere
ILookup<string, T> lookup = l_z.ToLookup(z => z.CODE);
//your repeated code:
var z_fk = (from z in lookup[lookupCode]
            where z.VALIDFROM <= lookupDate && z.VALIDUNTIL >= lookupDate
            select z.id).SingleOrDefault();
You could further use a more complex, smarter data structure storing dates in sorted fashion and use a binary search to find the id, but this may be sufficient. Further, you speak of SqlBulkCopy - if you're dealing with a database, perhaps you can execute the query on the database, and then simply create the appropriate index including columns CODE, VALIDUNTIL and VALIDFROM.
I generally prefer using a Lookup over a Dictionary containing Lists since it's trivial to construct and has a cleaner API (e.g. when a key is not present).
We don't have enough information to give very prescriptive advice - but there are some general things you should be thinking about.
What types are the time values? Are you comparing DateTimes or some primitive value (like a time_t)? Think about how your data types affect performance. Choose the best ones.
Should you really be doing this in memory or should you be putting all these rows in to SQL and letting it be queried on there? It's really good at that.
But let's stick with what you asked about - in memory searching.
When searching is taking too long there is only one solution - search fewer things. You do this by partitioning your data in a way that allows you to easily rule out as many nodes as possible with as few operations as possible.
In your case you have two criteria - a code and a date range. Here are some ideas...
You could partition based on code, i.e. Dictionary<string, List<T>>. If you have many evenly distributed codes, your list sizes will each be about N/M in size (where N = total event count and M = number of distinct codes). So a million nodes with ten codes now requires searching 100k items rather than a million. But you could take that a bit further: each List could itself be sorted by starting time, allowing a binary search to rule out many other nodes very quickly (this of course has a trade-off in time spent building the collection). This should provide very quick lookups; a sketch follows after these ideas.
You could partition based on date and just store all the data in a single list sorted by start date and use a binary search to find the start date then march forward to find the code. Is there a benefit to this approach over the dictionary? That depends on the rest of your program. Maybe being an IList is important. I don't know. You need to figure that out.
You could flip the dictionary model and partition the data by start time, rounded to some boundary (depending on the length, granularity and frequency of your events). This is basically bucketing the data into groups that have similar start times. E.g., all the events that started between 12:00 and 12:01 might be in one bucket, etc. If you have a fairly small number of codes and a lot of highly frequent (but not pathologically so) events, this might give you very good lookup performance.
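As a hedged sketch of the first idea above (partition by code, each list sorted by VALIDFROM so a binary search narrows the scan), reusing the names from the question:
// Build: one list per code, sorted by VALIDFROM.
var byCode = l_z
    .GroupBy(z => z.CODE)
    .ToDictionary(g => g.Key, g => g.OrderBy(z => z.VALIDFROM).ToList());

// Query: binary-search for the last entry starting on or before lookupDate.
// This assumes validity ranges for a given code don't overlap; adjust if they do.
var list = byCode[lookupCode];
int lo = 0, hi = list.Count - 1, idx = -1;
while (lo <= hi)
{
    int mid = (lo + hi) / 2;
    if (list[mid].VALIDFROM <= lookupDate) { idx = mid; lo = mid + 1; }
    else hi = mid - 1;
}
var z_fk = (idx >= 0 && list[idx].VALIDUNTIL >= lookupDate) ? list[idx].id : default;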
The point? Think about your data. Consider how expensive it should be to add new data and how expensive it should be to query the data. Think about how your data types affect those characteristics. Make an informed decision based on that data. When in doubt let SQL do it for you.
This to me sounds like a situation where this could all happen on the database via a single statement. Then you can use indexing to keep the query fast and avoid having to push data over the wire to and from your database.
I'm in need of a data structure that is basically a list of data points, where each data point has a timestamp and a double[] of data values. I want to be able to retrieve the closest point to a given timestamp, or all points within a specified range of timestamps.
I'm using C#. My thinking was that a regular List would work, where "DataPoint" is a class containing the timestamp and double[] fields. To insert, I'd use the built-in BinarySearch() to find where to insert the new data, and I could use it again to find the start/end indexes for a range search.
I first tried SortedList, but it seems you can't iterate through indexes i = 0, 1, 2, ..., n, only through keys, so I wasn't sure how to do the range search without some convoluted function.
But then I learned that List<>'s Insert() is O(n)... couldn't I do better than that without sacrificing elsewhere?
Alternatively, is there some nice LINQ query that will do everything I want in a single line?
If you're willing to use non-BCL libraries, the C5.SortedArray<T> has always worked quite well for me.
It has a great method, RangeFromTo, that performs quite well with this sort of problem.
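For illustration, a hedged sketch with the C5 library; the exact RangeFromTo behaviour (I recall the upper bound being exclusive) and the comparer setup are from memory, so treat the details as assumptions:
using System;
using C5;

class DataPoint : IComparable<DataPoint>
{
    public DateTime Timestamp;
    public double[] Values;
    public int CompareTo(DataPoint other) => Timestamp.CompareTo(other.Timestamp);
}

// SortedArray keeps items ordered by the comparison above as they are added.
var points = new SortedArray<DataPoint>();

// start and end are assumed DateTime bounds; this should return all points
// with start <= Timestamp < end.
var inRange = points.RangeFromTo(new DataPoint { Timestamp = start },
                                 new DataPoint { Timestamp = end });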
If you have only static data then any structure implementing IList should be fine: sort it once and then make queries using BinarySearch. This also works if your inserted timestamps are always increasing; then you can just do List.Add() in O(1) and the list will still be sorted.
List<int> x = new List<int>();
x.Add(5);
x.Add(7);
x.Add(3);
x.Sort();

// want to find all elements between 4 and 6
int rangeStart = x.BinarySearch(4);
// since there is no element equal to 4, we get the bitwise complement of the index
// where 4 would have been inserted - see MSDN for List<T>.BinarySearch
if (rangeStart < 0)
    rangeStart = ~rangeStart;
while (rangeStart < x.Count && x[rangeStart] < 6)
{
    // do your business
    rangeStart++;
}
If you need to insert data at random points in your structure, keep it sorted, and be able to query ranges fast, you need a structure called a B+ tree. It's not implemented in the framework; you'll need to get it somewhere on your own.
Inserting a record requires O(log n) operations in the worst case
Finding a record requires O(log n) operations in the worst case
Removing a (previously located) record requires O(log n) operations in the worst case
Performing a range query with k elements occurring within the range requires O((log n) + k) operations in the worst case.
P.S. "is there some nice linq query that will do everything I want in a single line"
I wish I knew such a nice linq query that could do everything I want in one line :-)
You have the choice of cost at insertion, retrieval or removal time. There are various data structures optimized for each of these cases. Before you decide on one, I'd estimate the total size of your structures, how many data points are being generated (and at which frequency) and what will be used more often: insertion or retrieval.
If you insert a lot of new data points at high frequency, I'd suggest looking at a LinkedList<>. If you're retrieving more often, I'd use a List<> even though its insertion time is slower.
Of course you could do that in a LINQ query, but remember this is only sugar coating: The query will execute every time and for every execution search the entire set of data points to find a match. This may be more expensive than using the right collection for the job in the first place.
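For completeness, the single-line LINQ versions might look like this (assuming points is a List<DataPoint> with a DateTime Timestamp); note that both scan the whole list on every call, which is the cost mentioned above:
// All points whose timestamps fall within [start, end]:
var inRange = points.Where(p => p.Timestamp >= start && p.Timestamp <= end).ToList();

// The single point closest to a given timestamp t:
var closest = points.OrderBy(p => Math.Abs((p.Timestamp - t).Ticks)).FirstOrDefault();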
How about using an actual database to store your data and run queries against that?
Then, you could use LINQ-to-SQL.