Reading thousands of objects with EF Core FAST - C#

I am reading 40,000 small objects/rows from SQLite with EF Core, and it's taking 18 seconds, which is too long for my UWP app.
When this happens, CPU usage on a single core reaches 100%, but disk utilization is around 1%.
var dataPoints = _db.DataPoints.AsNoTracking().ToArray();
Without AsNoTracking() the time taken is even longer.
DataPoint is a small POCO with a few primitive properties. Total amount of data I am loading is 4.5 MB.
public class DataPointDto
{
    [Key]
    public ulong Id { get; set; }
    [Required]
    public DateTimeOffset TimeStamp { get; set; }
    [Required]
    public bool trueTime { get; set; }
    [Required]
    public double Value { get; set; }
}
Question: Is there a better way of loading this many objects, or am I stuck with this level of performance?
Fun fact: x86 takes 11 seconds, x64 takes 18. 'Optimise code' shaves off a second. Using Async pushes execution time to 30 seconds.

Most answers follow the common wisdom of loading less data, but in some circumstances, such as here, you Absolutely Positively Must load a lot of entities. So how do we do that?
Cause of poor performance
Is it unavoidable for this operation to take this long?
Well, it's not. We are loading just a megabyte of data from disk; the cause of poor performance is that the data is split across 40,000 tiny entities. The database can handle that, but Entity Framework seems to struggle with setting up all those entities, change tracking, etc. If we do not intend to modify the data, there is a lot we can do.
I tried three things
Primitives
If you load just one property, you get a list of primitives.
List<double> dataPoints = _db.DataPoints.Select(dp => dp.Value).ToList();
This bypasses all of the entity creation normally performed by Entity Framework. This query took 0.4 seconds, compared to 18 seconds for the original query: a 45× (!) improvement.
Anonymous types
Of course, most of the time we need more than just an array of primitives.
We can create new objects right inside the LINQ query. Entity framework won't create the entities it normally would, and the operation runs much faster. We can use anonymous objects for convenience.
var query = db.DataPoints.Select(dp => new { ID = dp.sensorID, Timestamp = dp.TimeStamp, Value = dp.Value });
This operation takes 1.2 seconds, compared to 18 seconds for retrieving the same amount of data normally.
Tuples
I found that in my case, using Tuples instead of anonymous types improved performance a little; the following query executed roughly 30% faster:
var query = db.DataPoints.Select(dp => Tuple.Create(dp.sensorID, dp.TimeStamp, dp.Value));
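If you are on a compiler with value-tuple support, note that tuple literals such as (dp.TimeStamp, dp.Value) are not allowed inside expression trees; ValueTuple.Create is a plain method call, though, so a variant along these lines should work (a sketch, assuming the final projection is evaluated on the client, as in the Tuple.Create case above):
var points = db.DataPoints
    .AsNoTracking()
    .Select(dp => ValueTuple.Create(dp.sensorID, dp.TimeStamp, dp.Value))
    .ToList();
Value tuples are structs, so this also avoids one heap allocation per row compared to Tuple.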
Other ways
You cannot use structs inside LINQ queries, so that's not an option.
In many cases you can combine many records together to reduce the overhead associated with retrieving many individual records; by retrieving fewer, larger records you could improve performance. For instance, in my use case I've got some measurements that are being taken every 5 minutes, 24/7. At the moment I am storing them individually, and that's silly: nobody will ever query less than a day's worth of them. I plan to update this post when I make the change and find out how performance changed. (A sketch of what such a batched entity might look like follows below.)
Some recommend using an object-oriented DB or a micro ORM. I have never used either, so I can't comment.
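As a rough sketch of that batching idea (DataPointDayDto and the packing scheme are my assumptions, not something I have measured yet), one row could hold a whole day of 5-minute samples:
public class DataPointDayDto
{
    [Key]
    public ulong Id { get; set; }

    // Midnight of the day this row covers.
    [Required]
    public DateTimeOffset Day { get; set; }

    // 288 doubles (one sample per 5 minutes) packed into a single blob column.
    [Required]
    public byte[] Values { get; set; }

    public static byte[] Pack(double[] samples)
    {
        var blob = new byte[samples.Length * sizeof(double)];
        Buffer.BlockCopy(samples, 0, blob, 0, blob.Length);
        return blob;
    }

    public static double[] Unpack(byte[] blob)
    {
        var samples = new double[blob.Length / sizeof(double)];
        Buffer.BlockCopy(blob, 0, samples, 0, blob.Length);
        return samples;
    }
}
A year of data then becomes a few hundred rows to materialize instead of ~100,000.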

You can use a different technique to load all your items.
You can create your own logic to load parts of the data while the user is scrolling the ListView (I guess you are using one).
Fortunately, UWP has an easy way to do this technique:
Incremental loading
Please see the documentation and example:
https://msdn.microsoft.com/library/windows/apps/Hh701916
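A minimal sketch of what that could look like here, assuming an EF Core context (AppDbContext is a hypothetical name) that exposes the DataPoints set from the question:
using System;
using System.Collections.ObjectModel;
using System.Linq;
using System.Runtime.InteropServices.WindowsRuntime;
using Microsoft.EntityFrameworkCore;
using Windows.Foundation;
using Windows.UI.Xaml.Data;

// Bind the ListView's ItemsSource to this collection; the ListView calls
// LoadMoreItemsAsync as the user scrolls, so only the visible pages get loaded.
public class IncrementalDataPoints : ObservableCollection<DataPointDto>, ISupportIncrementalLoading
{
    private readonly AppDbContext _db; // hypothetical DbContext
    private int _loaded;

    public IncrementalDataPoints(AppDbContext db) => _db = db;

    public bool HasMoreItems { get; private set; } = true;

    public IAsyncOperation<LoadMoreItemsResult> LoadMoreItemsAsync(uint count)
    {
        return AsyncInfo.Run(async cancellationToken =>
        {
            var page = await _db.DataPoints
                .AsNoTracking()
                .OrderBy(dp => dp.Id)
                .Skip(_loaded)
                .Take((int)count)
                .ToListAsync(cancellationToken);

            foreach (var dp in page)
                Add(dp);

            _loaded += page.Count;
            HasMoreItems = page.Count == (int)count;
            return new LoadMoreItemsResult { Count = (uint)page.Count };
        });
    }
}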

Performance test on 26 million records (1 datetime, 1 double, 1 int), EF Core 3.1.5:
Anonymous types or tuples, as suggested in the accepted answer: about 20 sec, 1.3 GB RAM
Struct: about 15 sec, 0.8 GB RAM
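A sketch of the struct variant (PointStruct and its members are illustrative; EF Core 3.x evaluates the final Select on the client, so a constructor call works here):
public readonly struct PointStruct
{
    public PointStruct(DateTime timeStamp, double value, int sensorId)
    {
        TimeStamp = timeStamp;
        Value = value;
        SensorId = sensorId;
    }

    public DateTime TimeStamp { get; }
    public double Value { get; }
    public int SensorId { get; }
}

var points = db.DataPoints
    .AsNoTracking()
    .Select(dp => new PointStruct(dp.TimeStamp, dp.Value, dp.SensorId))
    .ToList();
Since PointStruct is a value type, the 26 million items live inline in the list's backing array, which is where the RAM saving comes from.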

Related

Entity Framework Query Slow for Small Result Set

So I have the below EF query that returns about 12,000 records from a flat table. I use a projection that only selects the necessary fields (about 15) and then puts them into a list of a custom class. It takes almost 3 seconds, which seems like a long time for 12,000 records. I've tried wrapping the entire thing in a transaction scope with "read uncommitted", and I've also tried using AsNoTracking(). Neither made any difference. Anyone have any idea why the performance on this would be so lousy?
List<InfoModel> results = new List<InfoModel>();
using (InfoData data = new InfoData())
{
    results = (from S in data.InfoRecords
               select new
               {
                   ...bunch of entity fields...
               }).AsEnumerable().Select(x => new InfoModel()
               {
                   ...bunch of model fields...
               }).ToList();
}
It's difficult to answer, because there are so many things that can affect it: your network, the number of other requests on your SQL Server or Windows server, your model, ....
Although the quality of generated queries and the performance of Entity Framework have improved a lot in the latest versions, it is still far from others in terms of speed. There are some performance considerations you can look at: https://msdn.microsoft.com/en-us/data/hh949853.aspx
If speed matters and 3 seconds is too much for you, I probably wouldn't use Entity Framework to retrieve so many rows; for me, Entity Framework is great when you need just some items, but not thousands when speed is important.
To improve speed you can use one of the many other ORMs, like Dapper, the one used by Stack Overflow, which is pretty fast.
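As a rough sketch of what the same load might look like with Dapper (connectionString is assumed, and in real code you would list only the ~15 columns you need instead of *; Dapper maps columns to InfoModel properties by name):
using System.Collections.Generic;
using System.Data.SqlClient;
using Dapper;

List<InfoModel> results;
using (var conn = new SqlConnection(connectionString))
{
    // Query<T> materializes plain objects with no change tracking.
    results = conn.Query<InfoModel>("SELECT * FROM InfoRecords").AsList();
}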

Poor performance when loading child entities with Entity Framework

I'm building an ASP.Net application with Entity Framework (Code First) and I've implemented a Repository pattern like the one in this example.
I only have two tables in my database. One called Sensor and one called MeasurePoint (containing only TimeStamp and Value). A sensor can have multiple measure points. At the moment I have 5 sensors and around 15000 measure points (approximately 3000 points for each sensor).
In one of my MVC controllers I execute the following line (to get the most recent MeasurePoint for a Sensor)
DbSet<Sensor> dbSet = context.Set<Sensor>();
var sensor = dbSet.Find(sensorId);
var point = sensor.MeasurePoints.OrderByDescending(measurePoint => measurePoint.TimeStamp).First();
This call takes ~1 s to execute, which feels like a lot to me. The call results in the following SQL query
SELECT
[Extent1].[MeasurePointId] AS [MeasurePointId],
[Extent1].[Value] AS [Value],
[Extent1].[TimeStamp] AS [TimeStamp],
[Extent1].[Sensor_SensorId] AS [Sensor_SensorId]
FROM [dbo].[MeasurePoint] AS [Extent1]
WHERE ([Extent1].[Sensor_SensorId] IS NOT NULL) AND ([Extent1].[Sensor_SensorId] = @EntityKeyValue1)
Which only takes ~200ms to execute, so the time is spent somewhere else.
I've profiled the code with the help of Visual Studio Profiler and found that the call that causes the delay is
System.Data.Objects.Internal.LazyLoadBehavior.<>c__DisplayClass7`2.<GetInterceptorDelegate>b__1(!0,!1)
So I guess it has something to do with lazy loading. Do I have to live with performance like this, or are there improvements I can make? Is it the ordering by time that causes the performance drop? If so, what options do I have?
Update:
I've updated the code to show where sensor comes from.
What that will do is load the entire children collection into memory and then perform the .First() LINQ query against the loaded (approx. 3,000) children.
If you just want the most recent point for the sensor, query the context directly instead (the Sensor navigation property on MeasurePoint is assumed from your model):
context.MeasurePoints.Where(mp => mp.Sensor.SensorId == sensorId).OrderByDescending(mp => mp.TimeStamp).First();
If that's the query it's running, it's loading all 3000 points into memory for the sensor. Try running the query directly on your DbContext instead of using the navigation property and see what the performance difference is. Your overhead may be coming from the 2999 points you don't need being loaded.

Is it possible to accelerate (dynamic) LINQ queries using GPU?

I have been searching for some days for solid information on the possibility of accelerating LINQ queries using a GPU.
Technologies I have "investigated" so far:
Microsoft Accelerator
Cudafy
Brahma
In short, would it even be possible at all to do an in-memory filtering of objects on the GPU?
Let's say we have a list of some objects and we want to filter something like:
var result = myList.Where(x => x.SomeProperty == SomeValue);
Any pointers on this one?
Thanks in advance!
UPDATE
I'll try to be more specific about what I am trying to achieve :)
The goal is to use any technology that is able to filter a list of objects (ranging from ~50,000 to ~2,000,000) in the absolutely fastest way possible.
The operations I perform on the data when the filtering is done (sum, min, max, etc.) are made using the built-in LINQ methods and are already fast enough for our application, so that's not a problem.
The bottleneck is "simply" the filtering of data.
UPDATE
Just wanted to add that I have tested about 15 databases, including MySQL (checking a possible cluster approach / memcached solution), H2, HSQLDB, VelocityDB (currently investigating further), SQLite, MongoDB, etc., and NONE is good enough when it comes to the speed of filtering data (of course, the NoSQL solutions do not offer this like the SQL ones do, but you get the idea) and/or the returning of the actual data.
Just to summarize what I/we need:
A database which is able to sort data in the format of 200 columns and about 250,000 rows in less than 100 ms.
I currently have a solution with parallelized LINQ which is able (on a specific machine) to spend only nanoseconds on each row when filtering AND processing the result!
So, we need something like sub-nanosecond filtering on each row.
Why does it seem that only in-memory LINQ is able to provide this?
Why would this be impossible?
Some figures from the logfile:
Total tid för 1164 frågor: 2579
This is Swedish and translates:
Total time for 1164 queries: 2579
Where the queries in this case are queries like:
WHERE SomeProperty = SomeValue
And those queries are all being done in parallel on 225,639 rows.
So, 225,639 rows are being filtered in memory 1,164 times in about 2.5 seconds.
That's about 9.5e-9 seconds per row, BUT that also includes the actual processing of the numbers! We do Count (not null), total count, Sum, Min, Max, Avg, and Median, so we have 7 operations on these filtered rows.
So we could say it's actually 7 times faster than the databases we've tried, since we do NOT do any aggregation in those cases!
So, in conclusion, why are the databases so poor at filtering data compared to in-memory LINQ filtering? Has Microsoft really done such a good job that it is impossible to compete with it? :)
It makes sense, though, that in-memory filtering should be faster, but I don't want a sense that it is faster. I want to know what is faster, and if it's possible, why.
I will answer definitively about Brahma since it's my library, but it probably applies to other approaches as well. The GPU has no knowledge of objects. Its memory is also mostly completely separate from CPU memory.
If you do have a LARGE set of objects and want to operate on them, you can only pack the data you want to operate on into a buffer suitable for the GPU/API you're using and send it off to be processed.
Note that this will make two round trips over the CPU-GPU memory interface, so if you aren't doing enough work on the GPU to make it worthwhile, you'll be slower than if you simply used the CPU in the first place (like the sample above).
Hope this helps.
The GPU is really not intended for all general purpose computing purposes, especially with object oriented designs like this, and filtering an arbitrary collection of data like this would really not be an appropriate thing.
GPU computations are great for things where you are performing the same operation on a large dataset - which is why things like matrix operations and transforms can be very nice. There, the data copying can be outweighed by the incredibly fast computational capabilities on the GPU....
In this case, you'd have to copy all of the data into the GPU to make this work, and restructure it into some form the GPU will understand, which would likely be more expensive than just performing the filter in software in the first place.
Instead, I would recommend looking at using PLINQ for speeding up queries of this nature. Provided your filter is thread safe (which it'd have to be for any GPU related work...) this is likely a better option for general purpose query optimization, as it won't require the memory copying of your data. PLINQ would work by rewriting your query as:
var result = myList.AsParallel().Where(x => x.SomeProperty == SomeValue);
If the predicate is an expensive operation, or the collection is very large (and easily partitionable), this can make a significant improvement to the overall performance when compared to standard LINQ to Objects.
GpuLinq
GpuLinq's main mission is to democratize GPGPU programming through LINQ. The main idea is that we represent the query as an Expression tree and after various transformations-optimizations we compile it into fast OpenCL kernel code. In addition we provide a very easy to work API without the need of messing with the details of the OpenCL API.
https://github.com/nessos/GpuLinq
select *
from table1 -- contains 100k rows
left join table2 -- contains 1M rows
on table1.id1=table2.id2 -- this would run for ~100G times
-- unless they are cached on sql side
where table1.id between 1 and 100000 -- but this optimizes things (depends)
could be turned into
select id1 from table1 -- 400k bytes if id1 is 32 bit
-- no need to order
stored in memory
select id2 from table2 -- 4Mbytes if id2 is 32 bit
-- no need to order
stored in memory; both arrays are then sent to the GPU using a kernel (CUDA, OpenCL) like the one below:
int i = get_global_id(0);   // one thread per id2 element
int selectedID2 = id2[i];
int summary__ = -1;
for (int j = 0; j < id1Length; j++)
{
    int selectedID1 = id1[j];
    summary__ = (selectedID2 == selectedID1 ? j : summary__); // no branching
}
summary[i] = summary__; // accumulates the matching index for the
                        // "on table1.id1=table2.id2" part.
On the host side, you can make
select * from table1 --- query3
and
select * from table2 --- query4
then use the id list from the GPU to select the data:
// x is table1's data (AsParallel() has no ForEach, so use ForAll)
myList.AsParallel().ForAll(x => query3.leftjoindata = query4[summary[index]]);
The GPU code shouldn't be slower than 50 ms for a GPU with constant memory, global broadcast ability, and some thousands of cores.
If any trigonometric function is used for filtering, the performance will drop fast. Also, the left join's row counts make it O(m*n) complexity, so millions versus millions would be much slower. GPU memory bandwidth is important here.
Edit:
A single operation of gpu.findIdToJoin(table1,table2,"id1","id2") on my HD 7870 (1280 cores) and R7 240 (320 cores) with a "products" table (64k rows) and a "categories" table (64k rows) (left-join filter) took 48 milliseconds with an unoptimized kernel.
ADO.NET's "nosql"-style LINQ join took more than 2000 ms with only a 44k-row products table and a 4k-row categories table.
Edit-2:
A left join with a string search condition gets 50 to 200x faster on the GPU when the tables grow to thousands of rows, each having at least hundreds of characters.
The simple answer for your use case is no.
1) There's no solution for that kind of workload, even in raw LINQ to Objects, much less in something that would replace your database.
2) Even if you were fine with loading the whole set of data at once (this takes time), it would still be much slower: GPUs have high throughput, but access to them is high-latency, so if you're looking at "very" fast solutions, GPGPU is often not the answer, as just preparing/sending the workload and getting back the results will be slow, and in your case it would probably need to be done in chunks too.

Why is Entity Framework taking 30 seconds to load records when the generated query only takes 1/2 of a second?

The executeTime below is 30 seconds the first time, and 25 seconds the next time I execute the same set of code. When watching in SQL Profiler, I immediately see a login, then it just sits there for about 30 seconds. As soon as the select statement is run, the app finishes the ToList command. When I run the generated query from Management Studio, the database query only takes about 400 ms. It returns 14 rows and 350 columns. The time spent transforming the database results into entities looks so small it is not noticeable.
So what is happening in the 30 seconds before the database call is made?
If entity framework is this slow, it is not possible for us to use it. Is there something I am doing wrong or something I can change to speed this up dramatically?
UPDATE:
All right, if I use a compiled query, the first time it takes 30 seconds, and the second time it takes a quarter of a second. Is there anything I can do to speed up the first call?
using (EntitiesContext context = new EntitiesContext())
{
Stopwatch sw = new Stopwatch();
sw.Start();
var groupQuery = (from g in context.Groups.Include("DealContract")
.Include("DealContract.Contracts")
.Include("DealContract.Contracts.AdvertiserAccountType1")
.Include("DealContract.Contracts.ContractItemDetails")
.Include("DealContract.Contracts.Brands")
.Include("DealContract.Contracts.Agencies")
.Include("DealContract.Contracts.AdvertiserAccountType2")
.Include("DealContract.Contracts.ContractProductLinks.Products")
.Include("DealContract.Contracts.ContractPersonnelLinks")
.Include("DealContract.Contracts.ContractSpotOrderTypes")
.Include("DealContract.Contracts.Advertisers")
where g.GroupKey == 6
select g).OfType<Deal>();
sw.Stop();
var queryTime = sw.Elapsed;
sw.Reset();
sw.Start();
var groups = groupQuery.ToList();
sw.Stop();
var executeTime = sw.Elapsed;
}
I had this exact same problem; my query was taking 40 seconds.
I found the problem was with the .Include("table_name") calls. The more of these I had, the worse it was. Instead I changed my code to lazy load all the data I needed right after the query, which knocked the total time down to about 1.5 seconds from 40 seconds. As far as I know, this accomplishes the exact same thing.
So for your code it would be something like this:
var groupQuery = (from g in context.Groups
where g.GroupKey == 6
select g).OfType<Deal>();
var groups = groupQuery.ToList();
foreach (var g in groups)
{
// Assuming Dealcontract is an Object, not a Collection of Objects
g.DealContractReference.Load();
if (g.DealContract != null)
{
foreach (var d in g.DealContract)
{
// If the reference is to a collection, you can just do a straight .Load();
// if it is an object, call .Load() on the reference instead, as with "g.DealContractReference" above
d.Contracts.Load();
foreach (var c in d.Contracts)
{
c.AdvertiserAccountType1Reference.Load();
// etc....
}
}
}
}
Incidentally, if you were to add this line of code above the query in your current code, it would knock the time down to about 4-5 seconds (still too long in my opinion). From what I understand, the MergeOption.NoTracking option disables a lot of the tracking overhead for updating and inserting stuff back into the database:
context.groups.MergeOption = MergeOption.NoTracking;
It is because of the Includes. My guess is that you are eager loading a lot of objects into memory. It takes much time to build the C# objects that correspond to your db entities.
My recommendation for you is to try to lazy load only the data you need.
The only way I know of to make the initial compilation of the query faster is to make the query less complex. The MSDN documentation on performance considerations for the Entity Framework and compiled queries doesn't indicate any way to save a compiled query for use in a different application execution session.
I would add that we have found that having lots of Includes can make query execution slower than having fewer Includes and doing more Loads on related entities later. Some trial and error is required to find the right medium.
However, I have to ask if you really need every property of every entity you are including here. It seems to me that there is a large number of different entity types in this query, so materializing them could well be quite expensive. If you are just trying to get tabular results which you don't intend to update, projecting the (relatively) small number of fields that you actually need into a flat, anonymous type should be significantly faster for various reasons. Also, this frees you from having to worry about eager loading, calling Load/IsLoaded, etc.
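As a rough illustration of such a projection (the navigation path and property names below are guesses based on the Includes above, not the real model):
var rows = context.Groups
    .Where(g => g.GroupKey == 6)
    .OfType<Deal>()
    .SelectMany(d => d.DealContract.Contracts, (d, c) => new
    {
        d.GroupKey,
        ContractName = c.Name // hypothetical property
    })
    .ToList();
Only the selected columns are fetched and materialized, and the results are not change-tracked.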
You can certainly speed up the initial view generation by precompiling the entity views. There is documentation on MSDN for this. But since you pay that cost at the time the first query is executed, and your test with a simple query shows view generation running in the neighborhood of 2 seconds for you, precompiling will save those 2 seconds but nothing else.
EF takes a while to start up. It needs to build metadata from XML and probably generates the objects used for mapping.
So it takes a few seconds to start up; I don't think there is a way to get around that, except never restarting your program.

Querying Complex Data Structure in Memory

First time posting to a questions site, but I sort of have a complex problem I've been looking at for a few days.
Background
At work we're implementing a new billing system. However, we want to take the unprecedented move of actually auditing the new billing system against the old one, which is significantly more robust, on an ongoing basis. The reason is that the new billing system is a lot more flexible for our new rate plans, so marketing is really pushing us to get this new billing system in place.
We had our IT group develop a report, for a ridiculous amount of money, that runs at 8 AM each morning for yesterday's data, compares records to find byte-count discrepancies, and generates the report. This isn't very useful for us since, for one, it runs the next day, and secondly, if it shows bad results, we don't have any indication why we may have had a problem the day before.
So we want to build our own system, that hooks into any possible data source (at first only the new and old systems User Data Records (UDR)) and compares the results in near real-time.
Just some notes on the scale, each billing system produces roughly 6 million records / day at a total file size of about 1 gig.
My Proposed set-up
Essentially, buy some servers; we have budget for several 8-core / 32 GB-of-RAM machines, so I'd like to do all the processing and storage in in-memory data structures. We can buy bigger servers if necessary, but after a couple of days I don't see any reason to keep the data in memory any longer (it gets written out to persistent storage), with aggregate statistics stored in a database.
Each record essentially contains a record-id from the platform, correlation-id, username, login-time, duration, bytes-in, bytes-out, and a few other fields.
I was thinking of using a fairly complex data structure for processing. Each record would be broken into a user object and a record object belonging to either platform A or platform B. At the top level would be a binary search tree (self-balancing) on the username. The next step would be sort of like a skip list based on date, so we would have next matched_record, next day, next hour, next month, next year, etc. Finally we would have our matched record object, essentially just a holder which references the udr_record object from system A and the udr record object from system B.
I'd run a number of internal analytics as data is added to see if the new billing system has choked or started having large discrepancies compared to the old system, and send an alarm to our operations center to be investigated. I don't have any problem with this part myself.
Problem
The problem I have is that aggregate statistics are great, but I want to see if I can come up with a sort of query language where the user can enter a query, for, say, the top contributors to this alarm, see what records contributed to the discrepancy, and dig in and investigate. Originally I wanted to use a syntax similar to a filter in Wireshark, with some SQL added in.
Example:
udr.bytesin > 1000 && (udr.analysis.discrepancy > 100000 || udr.analysis.discrepency_percent > 100) && udr.started_date > '2008-11-10 22:00:44' order by udr.analysis.discrepancy DESC LIMIT 10
The other option would be to use DLINQ, but I've been out of the C# game for a year and a half now, so I am not 100% up to speed on the .NET 3.5 stuff. Also, I'm not sure it could handle the data structure I was planning on using. The real question is: can I get any feedback on how to approach getting a query string from the user, parsing it, applying it to the data structure (which has quite a few more attributes than outlined above), and getting the resulting list back? I can handle the rest on my own.
I am fully prepared to hard-code many of the possible queries and just have them more as reports that are run with some parameters, but if there is a nice clean way of doing this type of query syntax, I think it would be an immensely cool feature to add.
Actually, for the above type of query, the dynamic LINQ stuff is quite a good fit. Otherwise you'll have to write pretty much the same thing anyway: a parser, and a mechanism for mapping it to attributes. Unfortunately it isn't an exact hit, since you need to split out things like OrderBy, and dates need to be parameterized, but here's a working example:
class Udr { // formatted for space
public int BytesIn { get; set; }
public UdrAnalysis Analysis { get; set; }
public DateTime StartedDate { get; set; }
}
class UdrAnalysis {
public int Discrepency { get; set; }
public int DiscrepencyPercent { get; set; }
}
static class Program {
static void Main() {
Udr[] data = new [] {
new Udr { BytesIn = 50000, StartedDate = DateTime.Today,
Analysis = new UdrAnalysis { Discrepency = 50000, DiscrepencyPercent = 130}},
new Udr { BytesIn = 500, StartedDate = DateTime.Today,
Analysis = new UdrAnalysis { Discrepency = 50000, DiscrepencyPercent = 130}}
};
DateTime when = DateTime.Parse("2008-11-10 22:00:44");
var query = data.AsQueryable().Where(
    @"bytesin > 1000 && (analysis.discrepency > 100000
        || analysis.discrepencypercent > 100)
        && starteddate > @0", when)
    .OrderBy("analysis.discrepency DESC")
    .Take(10);
foreach(var item in query) {
Console.WriteLine(item.BytesIn);
}
}
}
Of course, you could take the dynamic LINQ sample and customize the parser to do more of what you need...
Whether you use DLINQ or not, I suspect that you'll want to use LINQ somewhere in the solution, because it provides so many bits of what you want.
How much protection do you need from your users, and how technical are they? If this is only for a few very technical internal staff (e.g. people who are already developers), then you could just let them write a C# expression, use CSharpCodeProvider to compile the code, and then apply it to your data.
Obviously this requires your users to be able to write C# - or at least just enough of it for a query expression - and it requires that you trust them not to trash the server. (You can load the code into a separate AppDomain, give it low privileges and tear down the AppDomain after a timeout, but that sort of thing is complicated to achieve - and you don't really want huge amounts of data crossing an AppDomain boundary.)
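A minimal sketch of that approach (assuming the Udr type from the sample above, which is declared without a namespace; the UserQuery wrapper and Matches method are illustrative):
using System;
using System.CodeDom.Compiler;
using System.Reflection;
using Microsoft.CSharp;

static class UserQueryCompiler
{
    // Wraps the user's boolean expression in a static method, compiles it
    // in memory, and binds it to a delegate usable with Enumerable.Where.
    public static Func<Udr, bool> CompilePredicate(string userExpression)
    {
        string source =
            "public static class UserQuery {" +
            "    public static bool Matches(Udr u) { return " + userExpression + "; }" +
            "}";

        var options = new CompilerParameters { GenerateInMemory = true };
        // Let the generated code see the assembly that defines Udr
        // (assumes Udr lives in a compiled assembly with a location on disk).
        options.ReferencedAssemblies.Add(typeof(Udr).Assembly.Location);

        CompilerResults results = new CSharpCodeProvider().CompileAssemblyFromSource(options, source);
        if (results.Errors.HasErrors)
            throw new InvalidOperationException(results.Errors[0].ErrorText);

        MethodInfo matches = results.CompiledAssembly.GetType("UserQuery").GetMethod("Matches");
        return (Func<Udr, bool>)Delegate.CreateDelegate(typeof(Func<Udr, bool>), matches);
    }
}
Usage would then be something like data.Where(UserQueryCompiler.CompilePredicate("u.BytesIn > 1000")).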
On the subject of LINQ in general - again, a good fit due to your size issues:
Just some notes on the scale, each
billing system produces roughly 6
million records / day at a total file
size of about 1 gig.
LINQ can be used fully with streaming solutions. For example, your "source" could be a file reader. The Where would then iterate over the data checking individual rows without having to buffer the entire thing in memory:
static IEnumerable<Foo> ReadFoos(string path) {
return from line in ReadLines(path)
let parts = line.Split('|')
select new Foo { Name = parts[0],
Size = int.Parse(parts[1]) };
}
static IEnumerable<string> ReadLines(string path) {
using (var reader = File.OpenText(path)) {
string line;
while ((line = reader.ReadLine()) != null) {
yield return line;
}
}
}
This now streams lazily: we only read one line at a time. You'll need to use AsQueryable() to use it with dynamic LINQ, but it stays lazy.
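For example (assuming the dynamic LINQ sample library is referenced), a user-supplied filter string can then be applied directly to the stream:
var bigFoos = ReadFoos(path)   // streams one line at a time
    .AsQueryable()             // needed for the string-based operators
    .Where("Size > 1000")      // user-supplied filter
    .Take(10);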
If you need to perform multiple aggregates over the same data, then Push LINQ is a good fit; this works particularly well if you need to group data, since it doesn't buffer everything.
Finally - if you want binary storage, serializers like protobuf-net can be used to create streaming solutions. At the moment it works best with the "push" approach of Push LINQ, but I expect I could invert it for regular IEnumerable<T> if needed.
