Can you get DataReader-like streaming using Linq-to-SQL?

I've been using Linq-to-SQL for quite a while and it works great. However, lately I've been experimenting with using it to pull really large amounts of data and am running into some issues. (Of course, I understand that L2S may not be the right tool for this particular kind of processing, but that's why I'm experimenting - to find its limits.)
Here's a code sample:
var buf = new StringBuilder();
var dc = new DataContext(AppSettings.ConnectionString);
var records = from a in dc.GetTable<MyReallyBigTable>()
              where a.State == "OH"
              select a;
var i = 0;
foreach (var record in records) {
    buf.AppendLine(record.ID.ToString());
    i += 1;
    if (i > 3) {
        break; // Takes forever...
    }
}
Once I start iterating over the data, the query executes as expected. When stepping through the code, I enter the loop right away, which is exactly what I hoped for - it means that L2S appears to be using a DataReader behind the scenes instead of pulling all the data first. However, once I get to the break, the query continues to run and pulls all the rest of the records. Here are my questions for the SO community:
1.) Is there a way to stop Linq-to-SQL from finishing execution of a really big query in the middle, the way you can with a DataReader?
2.) If you execute a large Linq-to-SQL query, is there a way to prevent the DataContext from filling up with change tracking information for every object returned? Basically, instead of filling up memory, can I run a large query with short object lifecycles the way you can with DataReader techniques?
I'm okay if this isn't functionality built into the DataContext itself and requires extending it with some customization. I'm just looking to leverage the simplicity and power of LINQ for large queries in nightly processing tasks instead of relying on T-SQL for everything.

1.) Is there a way to stop Linq-to-SQL from finishing execution of a really big query in the middle, the way you can with a DataReader?
Not quite. Once the query is finally executed, the underlying SQL statement returns a result set of matching records. Execution is deferred up to that point, but not during traversal.
For your example you could simply use records.Take(3), but I understand that your actual logic for halting the process might be external to SQL or not easily translatable.
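For illustration, a minimal sketch of that approach, reusing the names from the question (Take(3) translates to TOP (3) in the generated SQL, so only three rows ever come back from the server):

// Only three rows are returned; the rest of the table is never read.
var firstThree = (from a in dc.GetTable<MyReallyBigTable>()
                  where a.State == "OH"
                  select a).Take(3);

foreach (var record in firstThree) {
    buf.AppendLine(record.ID.ToString());
}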
You could use a combination approach: build the strongly typed LINQ query, then execute it with old-fashioned ADO.NET. The downside is that you lose the mapping to the class and have to deal with the SqlDataReader results manually. An example of this is shown below:
var query = from c in Customers
            where c.ID < 15
            select c;

using (var command = dc.GetCommand(query))
{
    command.Connection.Open();
    using (var reader = command.ExecuteReader())
    {
        int i = 0;
        while (reader.Read())
        {
            Customer c = new Customer();
            c.ID = reader.GetInt32(reader.GetOrdinal("ID"));
            c.Name = reader.GetString(reader.GetOrdinal("Name"));
            Console.WriteLine("{0}: {1}", c.ID, c.Name);

            i++;
            if (i > 3)
                break;
        }
    }
}
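One caveat, from general ADO.NET behavior rather than anything L2S-specific: disposing a SqlDataReader mid-resultset makes the provider drain the remaining rows before closing the reader. For a truly cheap early exit, call command.Cancel() before the reader is disposed.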
2.) If you execute a large Linq-to-SQL query, is there a way to prevent the DataContext from filling up with change tracking information for every object returned?
If your intention for a particular query is read-only use, you can disable object tracking to increase performance by setting the DataContext.ObjectTrackingEnabled property to false:
using (var dc = new MyDataContext())
{
    dc.ObjectTrackingEnabled = false;
    // do stuff
}
You can also read this MSDN topic: How to: Retrieve Information As Read-Only (LINQ to SQL).
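Putting the two answers together, here is a minimal read-only streaming sketch reusing the names from the question (Take(3) stands in for whatever real stopping condition you have):

using (var dc = new DataContext(AppSettings.ConnectionString))
{
    // Must be set before the first query executes.
    dc.ObjectTrackingEnabled = false;

    var records = from a in dc.GetTable<MyReallyBigTable>()
                  where a.State == "OH"
                  select a;

    // No change tracking state accumulates while iterating.
    foreach (var record in records.Take(3))
    {
        Console.WriteLine(record.ID);
    }
}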

Related

EF (C#) with tunneling to MySQL - optimising database calls

So I have a program that does some work in a MySQL database. I'm connecting through SSH tunneling with PuTTY (yes, I know launching the program on the server itself would be much better, but I don't have a choice here).
I have some problems with the program's speed. I solved one by adding .Include(table_name), but I can't see a way to do that here.
The purpose of this method is to clean the database of unwanted, broken records. Simplified, the code looks like this:
using (var dbContext = new MyDatabase_dataEntities())
{
    List<achievements> achiList = new List<achievements>();
    var achievementsQuery = from data in dbContext.achievements
                            orderby data.playerID
                            select data;
    achiList = achievementsQuery.Skip(counter * 5000).Take(5000).ToList();

    foreach (achievements record in achiList)
    {
        var playerExists = from data in dbContext.players_data
                           where data.playerID == record.playerID
                           select data;
        if (!playerExists.Any())
        {
            dbContext.achievements.Remove(record);
        }
    }

    dbContext.SaveChanges();
    counter++;
}
This is built this way because I want to load the achievements table, then check whether each achievement has its player in players_data. If such a player doesn't exist, remove his achievements.
It is all in a do-while loop, so I don't overload memory by loading all the data at once.
I know the problem is the database check inside the foreach, but I can't figure out how to do this without it. Other things I tried generated errors, either because EF couldn't translate them into SQL or because exceptions were thrown when trying to access a non-existing entity. Doing it in the foreach bottlenecks the whole program, probably because of the ping to the server.
I will need something similar more often, so I would be really grateful if anyone could help me make this work without the database call in the foreach. I know I could try to load the whole players_data table and then check Any(), but some of the tables I need this on are too big for that.
Oh, and turning off change tracking doesn't help at this point, because that is not what slows the program down.
I would be grateful for any help, thanks in advance!
EDIT: Mmmm, is there a way to get achievements that don't have corresponding player_data through one query, using associations? Something like adding this to the achievements query:
where !data.player_data.Exists()
Intellisense shows me that there is nothing like Exists or Any available at that point. Is there any trick similar to this? It would definitely deal with the speed problem I have, since the database call in the foreach wouldn't be needed.
If you want to delete achievements that don't have corresponding user records, you can use the SQL query below:
DELETE a
FROM `achievements` a
LEFT JOIN `user` AS u
ON u.`playerID` = a.`playerID`
WHERE u.`playerID` IS NULL;
A raw SQL statement like this will be an order of magnitude faster than going through Entity Framework.
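If you would rather keep that statement in the application code than in a SQL script, a sketch, assuming an EF5/EF6-style context where Database.ExecuteSqlCommand is available:

using (var dbContext = new MyDatabase_dataEntities())
{
    // Runs as one set-based statement on the server; no entities are materialized.
    var rowsDeleted = dbContext.Database.ExecuteSqlCommand(
        @"DELETE a
          FROM `achievements` a
          LEFT JOIN `user` AS u ON u.`playerID` = a.`playerID`
          WHERE u.`playerID` IS NULL;");
}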
If you prefer to do it with LINQ in the application, you can use the following code, which uses LINQ to Entities and LINQ extension methods. I assume you have a foreign key from achievements to player_data, so Entity Framework generates a lazy players_data navigation property on your achievements entity:
using (var dbContext = new MyDatabase_dataEntities())
{
    var proceed = true;
    while (proceed)
    {
        // Get the next 1000 entities to delete
        var entitiesToDelete = dbContext.achievements
            .Where(x => x.players_data == null)
            .Take(1000)
            .ToList();

        dbContext.achievements.RemoveRange(entitiesToDelete);
        dbContext.SaveChanges();

        // Proceed only if this iteration actually deleted something
        proceed = entitiesToDelete.Count > 0;
    }
}
If you prefer to use LINQ query syntax instead of extension methods, then your code will look like:
using (var dbContext = new MyDatabase_dataEntities())
{
    var proceed = true;
    while (proceed)
    {
        // Get the next 1000 entities to delete
        var query = from achievement in dbContext.achievements
                    where achievement.players_data == null
                    select achievement;
        var entitiesToDelete = query.Take(1000).ToList();

        dbContext.achievements.RemoveRange(entitiesToDelete);
        dbContext.SaveChanges();

        // Proceed only if this iteration actually deleted something
        proceed = entitiesToDelete.Count > 0;
    }
}
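One hedged caveat on both versions: x.players_data == null assumes Entity Framework generated players_data as a reference (many-to-one) navigation property. If it was generated as a collection instead, the equivalent filter would be !x.players_data.Any(), which LINQ to Entities can also translate to SQL.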

Slow LINQ Performance on DataTable Where Clause?

I'm dumping a table out of MySQL into a DataTable object using MySqlDataAdapter. Database input and output are doing fine, but my application code seems to have a performance issue that I was able to track down to a specific LINQ statement.
The goal is simple: search the contents of the DataTable for a column value matching a specific string, just like a traditional WHERE column = 'text' SQL clause.
Simplified code:
foreach (String someValue in someList) {
    String searchCode = OutOfScopeFunction(someValue);
    var results = emoteTable.AsEnumerable()
        .Where(myRow => myRow.Field<String>("code") == searchCode)
        .Take(1);
    if (results.Any()) {
        results.First()["columnname"] = 10;
    }
}
This simplified code is executed thousands of times, once for each entry in someList. When I run Visual Studio Performance Profiler I see that the "results.Any()" line is highlighted as consuming 93.5% of the execution time.
I've tried several different methods for optimizing this code, but none have improved performance while keeping the emoteTable DataTable as the primary source of the data. I can convert emoteTable to a Dictionary<String, DataRow> outside of the foreach, but then I have to keep the DataTable and the Dictionary in sync, which, while still a performance improvement, feels wrong.
Three questions:
Is this the proper way to search for a value in a DataTable (equivalent of a traditional SQL WHERE clause)? If not, how SHOULD it be done?
Addendum to 1, regardless of the proper way, what is the fastest (execution time)?
Why does the results.Any() line consume 90%+ of the resources? In this situation it makes more sense that the var results line should consume the resources; after all, it's the line doing the actual search, right?
Thank you for your time. If I find an answer I shall post it here as well.
Any() is taking 90% of the time because the query only executes when you call Any(); before that, nothing is actually evaluated. This is LINQ's deferred execution.
It would also seem that the problem is you first fetch the entire table into memory and then search it there. You should instruct your database to do the searching.
Moreover, when you call results.First(), the whole query is executed again.
With deferred execution in mind, you should write something like:
var result = emoteTable.AsEnumerable()
    .Where(myRow => myRow.Field<String>("code") == searchCode)
    .FirstOrDefault();

if (result != null) {
    result["columnname"] = 10;
}
What you have implemented is pretty much a join:
var searchCodes = someList.Select(OutOfScopeFunction);
var emotes = emoteTable.AsEnumerable();
var results = emotes.Join(searchCodes,
                          e => e.Field<String>("code"), // key from each DataRow
                          sc => sc,                     // the search code itself
                          (e, sc) => e);                // keep the matching row

foreach (var result in results)
{
    result["columnname"] = 10;
}
Join buffers one of the sequences into a hash-based lookup, so matching is roughly linear instead of scanning the table once per search code.
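Since the question also asks what is fastest: if the codes are unique, building a dictionary once before the loop turns each search into a constant-time lookup. A minimal sketch reusing the names from the question (assuming unique codes; ToDictionary throws on duplicates):

var rowsByCode = emoteTable.AsEnumerable()
    .ToDictionary(myRow => myRow.Field<String>("code"));

foreach (String someValue in someList) {
    DataRow row;
    if (rowsByCode.TryGetValue(OutOfScopeFunction(someValue), out row)) {
        row["columnname"] = 10;
    }
}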
But the first thing I would do is completely abandon the idea of combining DataTable and LINQ. They are two different technologies, and reasoning about what they do internally when combined is hard.
Did you try doing raw UPDATE calls? How many items are you expecting to update?

LINQ commands took quite a long time

I'm trying some basic select statements to retrieve data from my database, and I want to display the data in a DataGridView.
The problem I encountered is that executing the LINQ commands takes quite a long time (5-10 seconds; the result has 1000 rows). I tried to look for an answer on this site, but the questions were about much more difficult queries than mine.
My code is the following:
using (var db = new Model1())
{
    var query = from a in db.Animals
                select a;
    dgvAnimals.DataSource = query.ToList();
}
Can anyone tell me why it takes so long?
Try using LINQ's Take to reduce the size of the result set:
using (var db = new Model1())
{
    var query = db.Animals.Take(20);
    dgvAnimals.DataSource = query.ToList(); // bind only the reduced set
}
If this does not improve performance, try mocking some data or passing an empty set to the UI and see what happens. If taking the database out of the equation does not improve performance, it must be some other component that is lagging.

Read huge table with LINQ to SQL: Running out of memory vs slow paging

I have a huge table which I need to read through in a certain order and compute some aggregate statistics. The table already has a clustered index for the correct order, so getting the records themselves is pretty fast. I'm trying to use LINQ to SQL to simplify the code that I need to write. The problem is that I don't want to load all the objects into memory, since the DataContext seems to keep them around -- yet trying to page them results in horrible performance problems.
Here's the breakdown. Original attempt was this:
var logs =
    (from record in dataContext.someTable
     where [index is appropriate]
     select record);

foreach (linqEntity l in logs)
{
    // Do stuff with data from l
}
This is pretty fast and streams at a good rate, but the problem is that the application's memory use keeps going up and never stops. My guess is that the LINQ to SQL entities are being kept in memory and not disposed properly. So after reading Out of memory when creating a lot of objects C#, I tried the following approach. This seems to be the common Skip/Take paradigm that many people use, with the added feature of saving memory.
Note that _conn is created beforehand, and a temporary data context is created for each query, resulting in the associated entities being garbage collected.
int skipAmount = 0;
bool finished = false;
while (!finished)
{
    // Trick to allow for automatic garbage collection while iterating through the DB
    using (var tempDataContext = new MyDataContext(_conn) { CommandTimeout = 600 })
    {
        var query =
            (from record in tempDataContext.someTable
             where [index is appropriate]
             select record);

        List<workerLog> logs = query.Skip(skipAmount).Take(BatchSize).ToList();
        if (logs.Count == 0)
        {
            finished = true;
            continue;
        }

        foreach (workerLog l in logs)
        {
            // Do stuff with data from l
        }

        skipAmount += logs.Count;
    }
}
Now I have the desired behavior: memory usage doesn't increase at all as I stream through the data. Yet I have a far worse problem: each Skip causes the data to load more and more slowly, because the underlying query seems to make the server go through all the data for all previous pages. While running the query, each page takes longer and longer to load, and I can tell this is turning into a quadratic operation. The problem has appeared in the following posts:
LINQ Skip() Problem
LINQ2SQL select orders and skip/take
I can't seem to find a way to do this with LINQ that allows limited memory use by paging data and yet still loads each page in constant time. Is there a way to do this properly? My hunch is that there might be some way to tell the DataContext to explicitly forget about the objects in the first approach above, but I can't find out how to do that.
After madly grasping at some straws, I found that the DataContext's ObjectTrackingEnabled = false could be just what the doctor ordered. It is, not surprisingly, specifically designed for a read-only case like this.
using (var readOnlyDataContext =
    new MyDataContext(_conn) { CommandTimeout = really_long, ObjectTrackingEnabled = false })
{
    var logs =
        (from record in readOnlyDataContext.someTable
         where [index is appropriate]
         select record);

    foreach (linqEntity l in logs)
    {
        // Do stuff with data from l
    }
}
The above approach does not accumulate memory when streaming through objects. When writing data, I can use a different DataContext that has object tracking enabled, and that seems to work okay. However, this approach does have the problem of a SQL query that can take an hour or more to stream and complete, so if there's a way to do the paging as above without the performance hit, I'm open to other alternatives.
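For what it's worth, one alternative that avoids the quadratic Skip behavior is keyset ("seek") paging: remember the last key seen and filter on it, instead of asking the server to count past all previous rows. A sketch under the assumption that the clustered index is on an ascending ID column (the ID name is illustrative):

int lastId = 0;
bool finished = false;
while (!finished)
{
    using (var tempDataContext =
        new MyDataContext(_conn) { CommandTimeout = 600, ObjectTrackingEnabled = false })
    {
        // The WHERE on the key lets the server seek straight to this page.
        var logs = (from record in tempDataContext.someTable
                    where record.ID > lastId
                    orderby record.ID
                    select record)
                   .Take(BatchSize)
                   .ToList();

        if (logs.Count == 0)
        {
            finished = true;
            continue;
        }

        foreach (var l in logs)
        {
            // Do stuff with data from l
        }

        lastId = logs[logs.Count - 1].ID;
    }
}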
A warning about turning object tracking off: I found that when you try to do multiple concurrent reads with the same DataContext, you don't get the error "There is already an open DataReader associated with this Command which must be closed first." Instead, the application just goes into an infinite loop with 100% CPU usage. I'm not sure if this is a C# bug or a feature.

Fastest way to copy data from one table to another using linq

I have a table on another SQL Server which I need to copy from overnight. The structure of the destination table is very similar, so I was just going to use something like the code below.
Source - http://forums.asp.net/t/1322979.aspx/1
I have not tried this yet, but is there a better/quicker way to do this in LINQ?
// There exist two tables: list and listSecond
DataClassesDataContext dataClass = new DataClassesDataContext(); // create the DataContext instance
var str = from a in dataClass.lists select a;

foreach (var val in str) // iterate over the list rows and insert them into listSecond
{
    listSecond ls = new listSecond();
    ls.ID = val.ID;
    ls.pid = val.pid;
    ls.url = val.url;
    dataClass.listSeconds.InsertOnSubmit(ls);
}

dataClass.SubmitChanges();
Response.Write("success");
Using LINQ to insert large amounts of data is not a good idea, except maybe with complicated schemas that need many transformations before being copied. It creates a separate INSERT statement for each row, in addition to logging them all in the transaction log.
A much faster solution can be found here - it uses SqlBulkCopy, a mechanism for inserting large amounts of data in a single bulk operation, with minimal transaction logging to slow it down. It will be an order of magnitude faster, and I'm telling you this from personal experience with both methods.
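For reference, a minimal SqlBulkCopy sketch, assuming both databases are SQL Server, the destination columns match the source, and the connection strings are your own (requires System.Data.SqlClient):

using (var source = new SqlConnection(sourceConnectionString))
using (var destination = new SqlConnection(destinationConnectionString))
{
    source.Open();
    destination.Open();

    using (var command = new SqlCommand("SELECT ID, pid, url FROM lists", source))
    using (var reader = command.ExecuteReader())
    using (var bulkCopy = new SqlBulkCopy(destination))
    {
        bulkCopy.DestinationTableName = "listSecond";
        bulkCopy.BatchSize = 5000;      // commit in batches rather than row by row
        bulkCopy.BulkCopyTimeout = 0;   // no timeout for a long overnight copy
        bulkCopy.WriteToServer(reader); // streams rows without per-row INSERT statements
    }
}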
