Is it somehow possible to chain several LINQ queries together over the same IEnumerable?
Some background,
I have some files, 20-50 GB in size, that will not fit in memory. Some code parses messages from such a file and basically does:
public IEnumerable<Record> ReadRecordsFromStream(Stream inStream) {
    Record msg;
    while ((msg = ReadRecord(inStream)) != null) {
        yield return msg;
    }
}
This allows me to perform interesting queries on the records.
e.g. find the average duration of a Record
var records = ReadRecordsFromStream(stream);
var avg = records.Average(x => x.Duration);
Or perhaps the number of records per hour/minute
var x = from t in records
        group t by t.Time.Hour + ":" + t.Time.Minute into g
        select new { Period = g.Key, Frequency = g.Count() };
And there are a dozen or so more queries I'd like to run to pull relevant information out of these records. Some of the simpler queries can certainly be combined into a single query, but this seems to get unmanageable quite fast.
Now, each time I run these queries I have to read the file from the beginning again, reparsing every record - parsing a 20 GB file 20 times takes time and is a waste.
What can I do to make just one pass over the file, but run several LINQ queries against it?
You might want to consider using Reactive Extensions for this. It's been a while since I've used it, but you'd probably create a Subject<Record>, attach all your queries to it (as appropriate IObservable<T> variables) and then hook up the data source. That will push all the data through the various aggregations for you, only reading from disk once.
While the exact details elude me without downloading the latest build myself, I blogged on this a couple of times: part 1; part 2. (Various features that I complained about being missing in part 1 were added :)
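For illustration, a minimal sketch of that idea (a sketch only: it assumes the System.Reactive package, and Record, ReadRecordsFromStream, stream, Duration and Time are the names from the question, with Duration assumed to be numeric):

using System;
using System.Reactive.Linq;
using System.Reactive.Subjects;

// One pass over the file, several aggregations attached up front.
var subject = new Subject<Record>();

// Attach the queries before pushing any data through.
subject.Average(r => r.Duration)
       .Subscribe(avg => Console.WriteLine("Average duration: " + avg));

subject.GroupBy(r => r.Time.Hour + ":" + r.Time.Minute)
       .SelectMany(g => g.Count().Select(c => new { Period = g.Key, Frequency = c }))
       .Subscribe(x => Console.WriteLine(x.Period + " -> " + x.Frequency));

// Single read of the file, pushing each record to every subscriber.
foreach (var record in ReadRecordsFromStream(stream))
    subject.OnNext(record);
subject.OnCompleted();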
I have done this before for logs of 3-10 MB per file. I haven't reached your file sizes, but I did try this over 1 GB+ of total log files without consuming much RAM. You may try what I did.
There's a technology that allows you to do this kind of thing. It's called a database :)
Related
Using an ADO.NET entity data model, I've constructed the two queries below against a table containing 1800 records with just over 30 fields; they yield staggeringly different results.
// Executes slowly, over 6000 ms
int count = context.viewCustomers.AsNoTracking()
.Where(c => c.Cust_ID == _custID).Count();
// Executes instantly, under 20 ms
int count = context.viewCustomers.AsNoTracking()
.Where(c => c.Cust_ID == 625).Count();
I can see from the database log that Entity Framework provides that the two queries are almost identical, except that the filter portion of the first uses a parameter. Copying this query into SSMS and declaring and setting the parameter there results in a near-instant query, so the problem doesn't appear to be on the database end of things.
Has anyone encountered this who can explain what's happening? I'm at the mercy of a third-party control that adds this filter to the query in an attempt to limit the number of rows returned, and getting the count is a must. This is used for several queries, so a generic solution is needed. It is unfortunate that it doesn't work as advertised; it seems to make the query take 5-10 times as long as it would if I just loaded the entire view into memory. When no filter is used, however, it works like a dream.
The components ship with their source code, so I can change this behavior, but I need to consider which approaches would provide a reusable solution.
You did not mention the design details of your model, but if you only want a count of records matching a condition, this can be optimized by counting over a single column. For example,
int count = context.viewCustomers.AsNoTracking().Where(c => c.Cust_ID == _custID).Count();
If your design has 10 columns and, say, 100 records match the statement above, then for every record the result set contains data for all 10 columns, which is of no use.
You can optimize this by counting the result set based on a single column.
int count = context.viewCustomers.AsNoTracking().Where(c => c.Cust_ID == _custID).Select(x => new { x.column }).Count();
Other optimizations, such as the async variant of Count, CountAsync, can also be used.
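For illustration, a minimal sketch of the async variant (a sketch only: it assumes EF6, where CountAsync is an IQueryable extension in the System.Data.Entity namespace, and that the surrounding method is async):

using System.Data.Entity;   // EF6: provides CountAsync as an IQueryable extension

// Counts the matching rows asynchronously so the calling thread is not blocked.
int count = await context.viewCustomers
    .AsNoTracking()
    .Where(c => c.Cust_ID == _custID)
    .CountAsync();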
Currently I have 7,000 video entries and I am having a hard time optimizing the search for Tags and Actress.
This is the code I am trying to modify. I tried using HashSet; it is my first time using it, but I don't think I am doing it right.
var dictTag = JsonPairtoDictionary(tagsId, tagsName);
var dictActresss = JsonPairtoDictionary(actressId, actressName);
var listVid = new List<VideoItem>(db.VideoItems.ToList());
HashSet<VideoItem> lll = new HashSet<VideoItem>(listVid);
foreach (var tags in dictTag)
{
    lll = new HashSet<VideoItem>(lll.Where(q => q.Tags.Exists(p => p.Id == tags.Key)));
}

foreach (var actress in dictActresss)
{
    listVid = listVid.Where(q => q.Actress.Exists(p => p.Id == actress.Key)).ToList();
}
First, I get all the videos in the DB using db.VideoItems.ToList().
Then it goes through a loop to check whether a tag exists.
Each VideoItem has a List<Tags>, and I use Exists to check whether a tag matches.
Then the same thing for Actress.
I am not sure if it is because I am in Debug mode and Application Insights is active, but it is slow. I also get around 10-15 events per second with baseType:RemoteDependencyData, which I am not sure means it is still connected to the database (it should not be, since I should only be working with the new in-memory list of all videos) or what.
After 7 mins it is still processing and that's the longest time I have waited.
I am afraid to put this on my live site, since it will eat up resources like candy.
Instead of optimizing the LINQ, you should optimize your database query.
Databases are great at optimized searches and creating subsets, and will most likely be faster than anything you write. If you need to create a subset based on more than one database parameter, I would recommend looking into creating some indexes and using those.
Edit:
Example of a DB query that would eliminate the first foreach loop (which is actually multiple nested loops, and is where the time delay comes from):
select * from videos where tag in [list of tags]
Edit2
To make sure this is most efficient, require the database to index on the TAGS column. To create the index:
CREATE INDEX video_tags_idx ON videos (tag)
Use EXPLAIN to see whether the index is being used automatically (it should be):
explain select * from videos where tag in [list of tags]
If it doesn't show your index as being used you can look up the syntax to force the use of it.
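If you would rather express it through LINQ so the filtering still happens in the database, something along these lines should translate to a SQL IN clause (a sketch only: db.VideoItems, Tags and dictTag are the names from the question, dictTag is assumed to be a generic Dictionary keyed by tag id, and Tags is assumed to be a mapped navigation collection):

// Filter in the database rather than in memory; Contains on a local list becomes SQL IN.
var tagIds = dictTag.Keys.ToList();

var matchingVideos = db.VideoItems
    .Where(v => v.Tags.Any(t => tagIds.Contains(t.Id)))
    .ToList();

Note that this matches the IN semantics of the SQL above (any of the tags); the original loop required every tag, and chaining one Where per tag keeps that all-tags behaviour while still running in SQL.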
The problem was not optimization; it was that I was not utilizing Microsoft SQL Server through my ApplicationDbContext.
I realized this when I found PredicateBuilder: http://www.albahari.com/nutshell/predicatebuilder.aspx
The problem with the keyword search is that there can be multiple keywords, and the code I wrote above doesn't push the work down to SQL, which is what caused the long execution time.
Using the predicate builder, it is possible to create dynamic conditions in LINQ.
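For illustration, a rough sketch of how that can look (a sketch only: PredicateBuilder is the class from the albahari.com link above, VideoItem and dictTag are from my code above, and with Entity Framework LINQKit's AsExpandable() may also be needed so the combined expression can be translated to SQL):

// Start with a predicate that matches nothing, then OR in one condition per keyword/tag.
var predicate = PredicateBuilder.False<VideoItem>();

foreach (var tagId in dictTag.Keys)
{
    var id = tagId;   // capture the loop variable
    predicate = predicate.Or(v => v.Tags.Any(t => t.Id == id));
}

var results = db.VideoItems
    .AsExpandable()   // LINQKit: lets EF translate the invoked sub-expressions
    .Where(predicate)
    .ToList();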
Let's say I have a query with a very large result set (100,000+ rows) and I need to loop through it and perform an update:
var ds = context.Where(/* query */).Select(e => new { /* fields */ } );
foreach (var d in ds)
{
    // perform update
}
I'm fine with this process taking long time to execute but I have limited amount of memory on my server.
What happens in the foreach? Is the entire result fetched at once from the database?
Would it be better to use Skip and Take to do the update in portions?
The best way is indeed to use Skip and Take, and to make sure that after each batch of updates you dispose the DataContext (by wrapping it in a using block).
You could check out my question, which has a similar problem and a nice solution: Out of memory when creating a lot of objects C#
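A minimal sketch of that batching idea (the context and entity names here are invented for illustration; SubmitChanges is LINQ to SQL, with EF you would call SaveChanges instead):

const int batchSize = 1000;
int processed = 0;

while (true)
{
    // A fresh context per batch releases the change tracking from previous batches.
    using (var context = new MyDataContext())
    {
        var batch = context.MyEntities
            .Where(e => e.NeedsUpdate)     // your original filter goes here
            .OrderBy(e => e.Id)            // Skip/Take needs a stable ordering
            .Skip(processed)
            .Take(batchSize)
            .ToList();

        if (batch.Count == 0)
            break;

        foreach (var d in batch)
        {
            // perform update
        }

        context.SubmitChanges();           // SaveChanges() with Entity Framework
        processed += batch.Count;
    }
}

One caveat: if the update changes whether a row still matches the Where filter, Skip-based paging will drift; in that case leave Skip out and keep processing the first batch of matching rows until none are left.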
You are basically abusing LINQ to SQL - it's not made for that.
All results are loaded into memory.
Your changes are written out in one go, once you are done.
This will be slow, and it will use tons of memory. Given a limited amount of memory, it is not possible.
Do NOT load all data in at once. Try to run multiple queries with partial result sets (1000-2500 items each).
ORM's are not made for mass manipulation.
Could you not use a stored procedure to update everything in one go?
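For illustration, a rough sketch of that route with LINQ to SQL (a sketch only: ExecuteCommand is the DataContext method, and the stored procedure name and parameter are invented for illustration):

// One round trip; the stored procedure does the whole update server-side,
// so no entities are ever materialized in .NET.
context.ExecuteCommand("EXEC UpdateAllPendingOrders @Status = {0}", "Processed");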
The executeTime below is 30 seconds the first time and 25 seconds the next time I execute the same block of code. Watching in SQL Profiler, I immediately see a login, then it just sits there for about 30 seconds. As soon as the SELECT statement runs, the app finishes the ToList call. When I run the generated query from Management Studio, the database query itself only takes about 400 ms. It returns 14 rows and 350 columns. The time spent transforming the database results into entities is so small that it is not noticeable.
So what is happening in the 30 seconds before the database call is made?
If entity framework is this slow, it is not possible for us to use it. Is there something I am doing wrong or something I can change to speed this up dramatically?
UPDATE:
All right: if I use a compiled query, the first time it takes 30 seconds and the second time it takes a quarter of a second. Is there anything I can do to speed up the first call?
using (EntitiesContext context = new EntitiesContext())
{
Stopwatch sw = new Stopwatch();
sw.Start();
var groupQuery = (from g in context.Groups.Include("DealContract")
.Include("DealContract.Contracts")
.Include("DealContract.Contracts.AdvertiserAccountType1")
.Include("DealContract.Contracts.ContractItemDetails")
.Include("DealContract.Contracts.Brands")
.Include("DealContract.Contracts.Agencies")
.Include("DealContract.Contracts.AdvertiserAccountType2")
.Include("DealContract.Contracts.ContractProductLinks.Products")
.Include("DealContract.Contracts.ContractPersonnelLinks")
.Include("DealContract.Contracts.ContractSpotOrderTypes")
.Include("DealContract.Contracts.Advertisers")
where g.GroupKey == 6
select g).OfType<Deal>();
sw.Stop();
var queryTime = sw.Elapsed;
sw.Reset();
sw.Start();
var groups = groupQuery.ToList();
sw.Stop();
var executeTime = sw.Elapsed;
}
I had this exact same problem, my query was taking 40 seconds.
I found the problem was with the .Include("table_name") calls. The more of these I had, the worse it got. Instead, I changed my code to lazy load all the data I needed right after the query, which knocked the total time down from 40 seconds to about 1.5 seconds. As far as I know, this accomplishes the exact same thing.
So for your code it would be something like this:
var groupQuery = (from g in context.Groups
where g.GroupKey == 6
select g).OfType<Deal>();
var groups = groupQuery.ToList();
foreach (var g in groups)
{
    // Assuming DealContract is an object, not a collection of objects
    g.DealContractReference.Load();
    if (g.DealContract != null)
    {
        foreach (var d in g.DealContract)
        {
            // If the reference is to a collection, you can just do a straight .Load();
            // if it is an object, you call .Load() on the reference instead, as with "g.DealContractReference" above
            d.Contracts.Load();
            foreach (var c in d.Contracts)
            {
                c.AdvertiserAccountType1Reference.Load();
                // etc....
            }
        }
    }
}
Incidentally, if you were to add this line of code above the query in your current code, it would knock the time down to about 4-5 seconds (still too long in my opinion). From what I understand, the MergeOption.NoTracking option disables a lot of the tracking overhead used for updating and inserting entities back into the database:
context.Groups.MergeOption = MergeOption.NoTracking;
It is because of the Includes. My guess is that you are eager loading a lot of objects into memory. It takes a lot of time to build the C# objects that correspond to your DB entities.
My recommendation for you is to try to lazy load only the data you need.
The only way I know of to make the initial compilation of the query faster is to make the query less complex. The MSDN documentation on performance considerations for the Entity Framework and compiled queries doesn't indicate any way to persist a compiled query across application execution sessions.
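For reference, a hedged sketch of what a compiled query can look like with ObjectContext-based EF (CompiledQuery lives in System.Data.Objects; the field would sit inside one of your classes, and the Include chain from the question is omitted here for brevity):

using System;
using System.Linq;
using System.Data.Objects;

// Compiled once per process; later calls reuse the cached translation.
static readonly Func<EntitiesContext, int, IQueryable<Deal>> DealsByGroupKey =
    CompiledQuery.Compile((EntitiesContext ctx, int groupKey) =>
        ctx.Groups.Where(g => g.GroupKey == groupKey).OfType<Deal>());

// Usage:
// var deals = DealsByGroupKey(context, 6).ToList();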
I would add that we have found that having lots of Includes can make query execution slower than having fewer Includes and doing more Loads on related entities later. Some trial and error is required to find the right medium.
However, I have to ask whether you really need every property of every entity you are including here. There is a large number of different entity types in this query, so materializing them could well be quite expensive. If you are just trying to get tabular results which you don't intend to update, projecting the relatively few fields you actually need into a flat, anonymous type should be significantly faster for various reasons. It also frees you from having to worry about eager loading, calling Load/IsLoaded, and so on.
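For example, something along these lines (a sketch: GroupKey and context.Groups are from the question, while the selected property names are invented for illustration):

// Project only the columns you actually need into a flat, anonymous type.
var rows = (from g in context.Groups
            where g.GroupKey == 6
            select new
            {
                g.GroupKey,
                DealName = g.Name   // illustrative property; pick the real columns you need
            }).ToList();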
You can certainly speed up the initial view generation by pre-compiling the entity views; there is documentation on MSDN for this. Since you pay that cost when the first query is executed, your test with a simple query suggests it is in the neighborhood of 2 seconds for you. It's nice to save those 2 seconds, but it won't save anything more than that.
EF takes a while to start up. It needs to build metadata from XML and probably generates the objects used for mapping.
So it takes a few seconds to start up; I don't think there is a way to get around that, except never restarting your program.
First time posting to a questions site, but I have a somewhat complex problem I've been looking at for a few days.
Background
At work we're implementing a new billing system. However, we want to take the unprecedented step of actually auditing the new billing system, on an ongoing basis, against the old one, which is significantly more robust. The reason is that the new billing system is a lot more flexible for our new rate plans, so marketing is really pressing us to get it in place.
We had our IT group develop a report, for a ridiculous amount of money, that runs at 8 AM each morning over the previous day's data, compares records to find byte-count discrepancies, and generates the report. This isn't very useful for us, since for one thing it runs the next day, and secondly, if it shows bad results we have no indication of why we had a problem the day before.
So we want to build our own system, that hooks into any possible data source (at first only the new and old systems User Data Records (UDR)) and compares the results in near real-time.
Just some notes on the scale, each billing system produces roughly 6 million records / day at a total file size of about 1 gig.
My Proposed set-up
Essentially, buy some servers; we have budget for several 8-core / 32 GB RAM machines, so I'd like to do all the processing and storage in in-memory data structures. We can buy bigger servers if necessary, but after a couple of days I don't see any reason to keep the data in memory any longer, so it would be written out to persistent storage, with aggregate statistics stored in a database.
Each record essentially contains a record-id from the platform, correlation-id, username, login-time, duration, bytes-in, bytes-out, and a few other fields.
I was thinking of using a fairly complex data structure for processing. Each record would be broken into a user object and a record object belonging to either platform A or platform B. At the top level would be a (self-balancing) binary search tree keyed on the username. The next level would be something like a skip list based on date, so we would have next matched_record, next day, next hour, next month, next year, etc. Finally we would have the matched record object itself, essentially just a holder which references the udr_record object from system A and the one from system B.
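To make that shape concrete, a rough sketch of the objects I have in mind (all names are illustrative only, not a finished design):

// One record as produced by either billing platform.
class UdrRecord
{
    public string RecordId        { get; set; }
    public string CorrelationId   { get; set; }
    public string Username        { get; set; }
    public DateTime LoginTime     { get; set; }
    public TimeSpan Duration      { get; set; }
    public long BytesIn           { get; set; }
    public long BytesOut          { get; set; }
}

// Holder that pairs the record from platform A with its counterpart from platform B,
// with a forward link so the skip-list-style time navigation can hang off it.
class MatchedRecord
{
    public UdrRecord PlatformA            { get; set; }
    public UdrRecord PlatformB            { get; set; }
    public MatchedRecord NextMatchedRecord { get; set; }
}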
I'd run a number of internal analytic as data is added to see if the new billing system has choked, started having large discrepancies compared to the old system, and send an alarm to our operations center to be investigated. I don't have any problem with this part myself.
Problem
The problem I have is that aggregate statistics are great, but I want to see if I can come up with a sort of query language where the user can enter a query - say, for the top contributors to this alarm - see which records contributed to the discrepancy, and dig in and investigate. Originally, I wanted to use a syntax similar to a Wireshark display filter, with some SQL mixed in.
Example:
udr.bytesin > 1000 && (udr.analysis.discrepancy > 100000 || udr.analysis.discrepency_percent > 100) && udr.started_date > '2008-11-10 22:00:44' order by udr.analysis.discrepancy DESC LIMIT 10
The other option would be to use DLINQ, but I've been out of the C# game for a year and a half now, so I'm not 100% up to speed on the .NET 3.5 stuff. Also, I'm not sure whether it could handle the data structure I was planning on using. The real question is: can I get any feedback on how to approach getting a query string from the user, parsing it, applying it to the data structure (which has quite a few more attributes than outlined above), and getting the resulting list back? I can handle the rest on my own.
I am fully prepared to hard-code many of the possible queries and just have them be reports that are run with some parameters, but if there is a nice clean way of doing this type of query syntax, I think it would be an immensely cool feature to add.
Actually, for the above type of query, the dynamic LINQ stuff is quite a good fit. Otherwise you'll have to write pretty-much the same anyway - a parser, and a mechanism for mapping that to attributes. Unfortunately it isn't an exact hit, since you need to split things like OrderBy, and dates need to be parameterized - but here's a working example:
class Udr { // formatted for space
public int BytesIn { get; set; }
public UdrAnalysis Analysis { get; set; }
public DateTime StartedDate { get; set; }
}
class UdrAnalysis {
public int Discrepency { get; set; }
public int DiscrepencyPercent { get; set; }
}
static class Program {
static void Main() {
Udr[] data = new [] {
new Udr { BytesIn = 50000, StartedDate = DateTime.Today,
Analysis = new UdrAnalysis { Discrepency = 50000, DiscrepencyPercent = 130}},
new Udr { BytesIn = 500, StartedDate = DateTime.Today,
Analysis = new UdrAnalysis { Discrepency = 50000, DiscrepencyPercent = 130}}
};
DateTime when = DateTime.Parse("2008-11-10 22:00:44");
var query = data.AsQueryable().Where(
#"bytesin > 1000 && (analysis.discrepency > 100000
|| analysis.discrepencypercent > 100)
&& starteddate > #0",when)
.OrderBy("analysis.discrepency DESC")
.Take(10);
foreach(var item in query) {
Console.WriteLine(item.BytesIn);
}
}
}
Of course, you could take the dynamic LINQ sample and customize the parser to do more of what you need...
Whether you use DLINQ or not, I suspect that you'll want to use LINQ somewhere in the solution, because it provides so many bits of what you want.
How much protection do you need from your users, and how technical are they? If this is only for a few very technical internal staff (e.g. who are already developers) then you could just let them write a C# expression and then use CSharpCodeProvider to compile the code - then apply it on your data.
Obviously this requires your users to be able to write C# - or at least just enough of it for a query expression - and it requires that you trust them not to trash the server. (You can load the code into a separate AppDomain, give it low privileges and tear down the AppDomain after a timeout, but that sort of thing is complicated to achieve - and you don't really want huge amounts of data crossing an AppDomain boundary.)
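To illustrate the idea only, a rough sketch with no sandboxing (it assumes the Udr type from the sample above is public and visible to the generated code; the class and method names here are invented for illustration):

using System;
using System.CodeDom.Compiler;
using Microsoft.CSharp;

static Func<Udr, bool> CompilePredicate(string userExpression)
{
    // Wrap the user's boolean expression in a tiny class with a known entry point.
    string source = @"
        public static class UserQuery
        {
            public static bool Matches(Udr udr) { return " + userExpression + @"; }
        }";

    var parameters = new CompilerParameters { GenerateInMemory = true };
    parameters.ReferencedAssemblies.Add("System.dll");
    parameters.ReferencedAssemblies.Add(typeof(Udr).Assembly.Location);  // so Udr resolves

    var results = new CSharpCodeProvider().CompileAssemblyFromSource(parameters, source);
    if (results.Errors.HasErrors)
        throw new InvalidOperationException(results.Errors[0].ToString());

    var method = results.CompiledAssembly.GetType("UserQuery").GetMethod("Matches");
    return udr => (bool)method.Invoke(null, new object[] { udr });
}

// Usage (the user types a C# boolean expression over "udr"):
// var filter = CompilePredicate("udr.BytesIn > 1000 && udr.Analysis.DiscrepencyPercent > 100");
// var hits = records.Where(filter).Take(10);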
On the subject of LINQ in general - again, a good fit due to your size issues:
Just some notes on the scale, each billing system produces roughly 6 million records / day at a total file size of about 1 gig.
LINQ can be used fully with streaming solutions. For example, your "source" could be a file reader. The Where would then iterate over the data checking individual rows without having to buffer the entire thing in memory:
static IEnumerable<Foo> ReadFoos(string path) {
return from line in ReadLines(path)
let parts = line.Split('|')
select new Foo { Name = parts[0],
Size = int.Parse(parts[1]) };
}
static IEnumerable<string> ReadLines(string path) {
using (var reader = File.OpenText(path)) {
string line;
while ((line = reader.ReadLine()) != null) {
yield return line;
}
}
}
This is now lazy loading... we only read one line at a time. You'll need to use AsQueryable() to use it with dynamic LINQ, but it stays lazy.
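Putting the two together, usage might look something like this (a sketch: the file path is invented, and it assumes the Dynamic LINQ sample's string-based Where extension is referenced):

// Streams the file, applies a user-supplied filter string, and stays deferred until enumeration.
var query = ReadFoos(@"c:\data\foos.txt")
    .AsQueryable()
    .Where("Size > 1000")   // Dynamic LINQ string predicate
    .Take(10);

foreach (var foo in query)
    Console.WriteLine("{0}: {1}", foo.Name, foo.Size);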
If you need to perform multiple aggregates over the same data, then Push LINQ is a good fit; this works particularly well if you need to group data, since it doesn't buffer everything.
Finally - if you want binary storage, serializers like protobuf-net can be used to create streaming solutions. At the moment it works best with the "push" approach of Push LINQ, but I expect I could invert it for regular IEnumerable<T> if needed.