db InsertOnSubmit - how to clear collection - c#

I'm doing a massive import, and only calling .SubmitChanges() every 1,000 records.
Example:
var targetRecord = new Data.User() { FirstName = sourceRecord.FirstName };
db.Users.InsertOnSubmit(targetRecord);
The above is in a loop, for each record from the source database. Then, later...
if (i % 1000 == 0) { db.SubmitChanges(); }
The problem is, the collection of items to be inserted keeps getting bigger and bigger, and I want to clear it out after each SubmitChanges().
What I'm looking for:
if (i % 1000 == 0) { db.SubmitChanges(); db.Dispose_InsertOnSubmit_Records(); }
Something like that. I could alternatively keep the data records in a local list that I re-instantiate after submitting changes, but that's more code.
Hopefully this makes sense. Thanks!

You can initialize a new DataContext after each SubmitChanges. I'm not sure of the performance implications, but I've done something similar in the past without any problems.
The only other solution I've seen is iterating through your changes and reverting them. It seems like the former would be a much more efficient method.
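For illustration, a minimal sketch of that approach, assuming a LINQ to SQL context class (YourDataContext is a made-up name) and the loop from the question:
var db = new YourDataContext();
int i = 0;
foreach (var sourceRecord in sourceRecords)
{
    var targetRecord = new Data.User() { FirstName = sourceRecord.FirstName };
    db.Users.InsertOnSubmit(targetRecord);

    if (++i % 1000 == 0)
    {
        db.SubmitChanges();
        db.Dispose();                 // release the old context and its tracked inserts
        db = new YourDataContext();   // continue with an empty change tracker
    }
}
db.SubmitChanges();                   // flush the final partial batch
db.Dispose();
Creating a DataContext is relatively cheap, so the per-batch context is usually not the bottleneck here.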

Well, massive imports and LINQ to SQL do not go together, I'm afraid. It is just not made for batch processing.
If what you are doing is a straight import (and your example indicates that), you are much better off using SqlBulkCopy. That is orders of magnitude faster. It is also more code, but if you are looking for speed there is no better solution.
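For reference, a rough SqlBulkCopy sketch for the single-column example in the question; the destination table name, column name and connection string are assumptions:
// Requires System.Data and System.Data.SqlClient.
var table = new DataTable();
table.Columns.Add("FirstName", typeof(string));
foreach (var sourceRecord in sourceRecords)
{
    table.Rows.Add(sourceRecord.FirstName);               // stage the rows in memory (or feed a DataReader instead)
}

using (var bulk = new SqlBulkCopy(connectionString))
{
    bulk.DestinationTableName = "dbo.Users";              // assumed table name
    bulk.ColumnMappings.Add("FirstName", "FirstName");    // source column -> destination column
    bulk.BatchSize = 1000;                                 // send rows to the server in batches
    bulk.WriteToServer(table);
}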

Related

Improving Linq query

I have the following query:
if (idUO > 0)
{
    query = query.Where(b => b.Product.Center.UO.Id == idUO);
}
else if (dependencyId > 0)
{
    query = query.Where(b => b.DependencyId == dependencyId);
}
else
{
    var dependencyIds = dependencies.Select(d => d.Id).ToList();
    query = query.Where(b => dependencyIds.Contains(b.DependencyId.Value));
}
[...] <- Other filters...
if (specialDateId != 0)
{
    query = query.Where(b => b.SpecialDateId == specialDateId);
}
So, I have other filters in this query, but at the end, I process the query in the database with:
return query.OrderBy(b => b.Date).Skip(20 * page).Take(20).ToList(); // the returned object is a Ticket, which has 23 properties; 5 of them are relationships (FKs) and I fill 3 of those relationships with lazy loading
When I access the first page it's OK, the query takes less than 1 second, but when I try to access page 30,000 the query takes more than 20 seconds. Is there a way, within the LINQ query, to improve its performance? Or only at the database level? And at the database level, what is the best way to improve performance for this kind of query?
There is not much room here, imo, to make things better (at least looking at the code provided).
When you're trying to achieve good performance on numbers like these, I would recommend not using LINQ at all, or at least only using it for smaller data access.
What you can do here is introduce paging of that data at the database level, with a stored procedure, and invoke it from your C# code (a sketch follows the steps below):
1- Create a view in the DB which orders items by date and includes all related relationships, like Products etc.
2- Create a stored procedure querying this view with the related parameters.
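A minimal sketch of the C# side of that approach, using plain ADO.NET; the stored procedure name and its parameters are hypothetical, and connectionString and page are assumed to be available:
// Assumes a stored procedure dbo.GetTicketPage(@Page int, @PageSize int) built on the
// view described above, which sorts by date and pages on the server.
using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand("dbo.GetTicketPage", connection))
{
    command.CommandType = CommandType.StoredProcedure;
    command.Parameters.AddWithValue("@Page", page);
    command.Parameters.AddWithValue("@PageSize", 20);

    connection.Open();
    using (var reader = command.ExecuteReader())
    {
        while (reader.Read())
        {
            // map each row to a Ticket here
        }
    }
}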
I would recommend that you pull up SQL Server Profiler, and run a profile on the server while you run the queries (both the fast and the slow).
Once you've done this, you can pull it into the Database Engine Tuning Advisor to get some tips about indexes that you should add. This has had a great effect for me in the past. Of course, if you know what indexes you need, you can just add them without running the Advisor :)
I think you'll find that the bottleneck is occurring at the database. Here's why;
query.
You have your query, and the criteria. It goes to the database with a pretty ugly, but not too terrible select statement.
.OrderBy(b => b.Date)
Now you're ordering this giant recordset by date, which probably isn't a terrible hit because it's (hopefully) indexed on that field, but that does mean the entire set is going to be brought into memory and sorted before any skipping or taking occurs.
.Skip(20 * page).Take(20)
Ok, here's where it gets rough for the poor database. Entity is pretty awful at this sort of thing for large recordsets. I dare you to open sql profiler and view the random mess of sql it's sending over.
When you start skipping and taking, Entity usually sends queries that coerce the database into scanning the entire giant recordset until it finds what you are looking for. If that's the first ordered records in the recordset, say page 1, it might not take terribly long. By the time you're picking out page 30,000 it could be scanning a lot of data due to the way Entity has prepared your statement.
I highly recommend you take a look at the following link. I know it says 2005, but it's applicable to 2008 as well.
http://www.codeguru.com/csharp/.net/net_data/article.php/c19611/Paging-in-SQL-Server-2005.htm
Once you've read that link, you might want to consider how you can create a stored procedure to accomplish what you're going for. It will be more lightweight, have cached execution plans, and is pretty well guaranteed to return the data much faster for you.
Barring that, if you want to stick with LINQ, read up on Compiled Queries and make sure you're setting MergeOption.NoTracking for read-only operations. You should also try returning an Object Query with explicit Joins instead of an IQueryable with deferred loading, especially if you're iterating through the results and joining to other tables. Deferred Loading can be a real performance killer.
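As a loose illustration of those last two suggestions, assuming an EF ObjectContext called MyEntities (a made-up name); whether the merge option is honored by an already-compiled query can depend on the EF version, so treat this as a sketch only:
// CompiledQuery and MergeOption live in System.Data.Objects.
static readonly Func<MyEntities, int, IQueryable<Ticket>> PagedTickets =
    CompiledQuery.Compile((MyEntities ctx, int page) =>
        ctx.Tickets.OrderBy(b => b.Date).Skip(20 * page).Take(20));

using (var ctx = new MyEntities())
{
    ctx.Tickets.MergeOption = MergeOption.NoTracking;   // read-only: skip change tracking
    var tickets = PagedTickets(ctx, page).ToList();     // reuses the cached translated SQL
}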

How to use MongoDB as unique/enumeration store

This seems to be like a common use case... but somehow I cannot get it working.
I'm attempting to use MongoDB as an enumeration store with unique items. I've created a collection with a byte[] Id (the unique ID) and a timestamp (a long, used for enumeration). The store is quite big (terabytes) and distributed among different servers. I am able to re-build the store from scratch currently, since I'm still in the testing phase.
What I want to do is two things:
Create a unique id for each item that I insert. This basically means that if I insert the same ID twice, MongoDB will detect this and give an error. This approach seems to work fine.
Continuously have other processes enumerate the store for new items. The approach I took was to add a second index on InsertId and use a high-precision timestamp for it, combined with the server id and a counter (just to make it unique and ascending).
In the best scenario this would mean that the enumerator keeps track of an index cursor for every server. From what I've learned about MongoDB query processing I expected this behavior. However, when I try to execute the code (below) it seems to take forever to get anything.
long lastid = 0;
while (true)
{
    DateTime first = DateTime.UtcNow;
    foreach (var item in collection.FindAllAs<ContentItem>().OrderBy((a) => (a.InsertId)).Take(100))
    {
        lastid = item.InsertId;
    }
    Console.WriteLine("Took {0:0.00} for 100", (DateTime.UtcNow - first).TotalSeconds);
}
I've read about cursors, but am unsure if they fulfill the requirements when new items are inserted into the store.
As I said, I'm not bound to any table structure or anything like that... the only things that are important are that I can get new items over time and that I don't get duplicate items.
-Stefan.
Somehow I figured it out... more or less...
I created the query manually and ended up with something like this:
db.documents.find({ "InsertId" : { "$gt" : NumberLong("2020374866209304106") } }).limit(10).sort({ "InsertId" : 1 });
The LINQ query I put in the question doesn't generate this query. After some digging in the code I found that it should be this LINQ query:
foreach (var item in collection.AsQueryable().Where((a)=>(a.InsertId > lastid)).OrderBy((a) => (a.InsertId)).Take(100))
The AsQueryable() seems to be the key to execute the rewriting of LINQ to MongoDB queries.
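For reference, the corrected query wired back into the polling loop from the question; this assumes the legacy C# driver (the one with FindAllAs and AsQueryable) and uses a naive Thread.Sleep back-off:
long lastid = 0;
while (true)
{
    // Ask only for items newer than the last one seen, ordered by InsertId, so each
    // pass becomes a ranged, index-friendly query like the shell example above.
    var batch = collection.AsQueryable<ContentItem>()
                          .Where(a => a.InsertId > lastid)
                          .OrderBy(a => a.InsertId)
                          .Take(100)
                          .ToList();

    foreach (var item in batch)
    {
        lastid = item.InsertId;
        // process the new item here
    }

    if (batch.Count == 0)
    {
        System.Threading.Thread.Sleep(1000);   // nothing new yet; back off before polling again
    }
}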
The rewritten query gives results, but they still appeared to be slow (4 secs for 10 results, 30 for 100). However, when I added 'explain()' I noticed '0 millis' for the query execution.
I stopped the process doing the bulk inserts and, tada, it works, and fast. In other words: the issues I was having were due to the locking behavior of MongoDB and to the way I interpreted the LINQ implementation. Since the former was only the result of the initial bulk-filling of the data store, the problem is solved.
On the 'negative' side of the solution: I would have preferred something involving serializable cursors or the like... this 'take' solution has to iterate the b-tree over and over again. If someone has an answer for this, please let me know.
-Stefan.

Log changes made to LINQ to SQL generated objects

I was wondering what would be the best way to log changes made to objects created by LINQ.
I have searched around and this is what I came up with:
using (testDBDataContext db = new testDBDataContext())
{
    Sometable table = db.Sometables.Single(x => x.id == 1);
    table.Something = txtTextboxToChangeValue.Text;
    Sometable tableBeforeChanges = db.Sometables.GetOriginalEntityState(table);
    foreach (System.Data.Linq.ModifiedMemberInfo item in db.Sometables.GetModifiedMembers(table))
    {
        // Obviously writing to Debug is not what I would like to do
        System.Diagnostics.Debug.WriteLine("Old value: " + item.OriginalValue.ToString());
        System.Diagnostics.Debug.WriteLine("New value: " + item.CurrentValue.ToString());
    }
}
Is this really the way to go to log changes?
Change Tracking or Change Data Capture are the way to go; LINQ has nothing to do with it. As a general rule, the client side cannot properly track changes that occur on the server, because changes may be made through paths that never go through the client. As a cautionary note, setting up a complete data audit for all changes is seldom successful, as the performance penalty is usually too high.
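As a rough illustration of the Change Tracking route (it is a SQL Server feature, not a LINQ one), changed rows can be read back with the CHANGETABLE function; the table name, key column and version bookkeeping below are assumptions based on the question, and change tracking must already be enabled on the database and the table:
// Hypothetical projection type for rows returned by CHANGETABLE.
class SometableChange
{
    public int id { get; set; }                       // assumed primary key column
    public long SYS_CHANGE_VERSION { get; set; }
    public string SYS_CHANGE_OPERATION { get; set; }  // I / U / D
}

long lastSyncVersion = 0;   // persist this value between runs
using (var db = new testDBDataContext())
{
    var changes = db.ExecuteQuery<SometableChange>(
        "SELECT CT.id, CT.SYS_CHANGE_VERSION, CT.SYS_CHANGE_OPERATION " +
        "FROM CHANGETABLE(CHANGES dbo.Sometable, {0}) AS CT",
        lastSyncVersion);

    foreach (var change in changes)
    {
        Console.WriteLine("{0} on id {1} (version {2})",
            change.SYS_CHANGE_OPERATION, change.id, change.SYS_CHANGE_VERSION);
    }
}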
I've searched around and I think I'm going to use DoddleAudit (doddleaudit.codeplex.com); it seems to give me what I wanted. Thanks for helping anyway!

Linq2Sql: How do I manage large resultsets?

Let's say I have a query with a very large resultset (100,000+ rows) and I need to loop through it and perform an update:
var ds = context.Where(/* query */).Select(e => new { /* fields */ } );
foreach(var d in ds)
{
//perform update
}
I'm fine with this process taking a long time to execute, but I have a limited amount of memory on my server.
What happens in the foreach? Is the entire result fetched at once from the database?
Would it be better to use Skip and Take to do the update in portions?
The best way is to use Skip and Take, yes, and to make sure that after each batch you dispose of the DataContext (with a "using" block).
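A small sketch of what that can look like; the context, table and key names are placeholders, and an ordering column is assumed so the Skip/Take pages are stable:
const int pageSize = 1000;
for (int page = 0; ; page++)
{
    using (var context = new MyDataContext())   // fresh context per batch, disposed afterwards
    {
        var batch = context.SomeTable
                           .Where(e => /* query */ true)
                           .OrderBy(e => e.Id)          // paging needs a stable order
                           .Skip(page * pageSize)
                           .Take(pageSize)
                           .ToList();
        if (batch.Count == 0)
            break;

        foreach (var d in batch)
        {
            // perform update
        }
        context.SubmitChanges();                 // write only this batch's changes
    }
}
One caveat: if the update changes whether rows still match the Where clause, Skip-based paging can skip or repeat rows, so paging on a key range (WHERE Id > lastId) is often safer.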
You could check out my question; it has a similar problem with a nice solution: Out of memory when creating a lot of objects C#
You are basically abusing LINQ to SQL - it is not made for that.
All results are loaded into memory.
Your changes are written out in one go, once you are done.
This will be slow, and it will be - hm - using TONS of memory. Given limited amounts of memory - not possible.
Do NOT load all the data in at once. Try to run multiple queries with partial result sets (1,000-2,500 items each).
ORMs are not made for mass manipulation.
Could you not use a stored procedure to update everything in one go?
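If the update logic can be expressed in SQL, a single set-based statement avoids loading the rows at all; a hedged sketch with hypothetical table, column and procedure names:
using (var context = new MyDataContext())
{
    // Inline SQL through the same DataContext...
    context.ExecuteCommand(
        "UPDATE dbo.SomeTable SET SomeColumn = {0} WHERE SomeFilter = {1}",
        newValue, filterValue);

    // ...or a stored procedure that encapsulates the same statement.
    context.ExecuteCommand("EXEC dbo.UpdateSomeTable {0}, {1}", newValue, filterValue);
}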

How to optimize this Linq query

I am trying to optimize this query. It is not slow for now, but I am expecting a spike to hit this query in the near future. Is there anything else I can do to make it faster?
var posts = from p in context.post
            where p.post_isdeleted == false && p.post_parentid == null
            select new
            {
                p.post_date,
                p.post_id,
                p.post_titleslug,
                p.post_title,
                p.post_descriptionrender,
                p.userinfo.user_username,
                p.userinfo.user_userid,
                p.userinfo.user_GravatarHash,
                p.userinfo.user_points,
                p.category.catid,
                p.category.name,
                p.post_answercount,
                p.post_hasbestanswer,
                p.post_hits,
                p.post_isanonymous,
                p.post_votecount,
                FavoriteCount = context.favorites.Where(x => x.post.post_id == p.post_id).Count(),
                tags = from tg in context.posttag
                       where tg.posttag_postid == p.post_id
                       select new
                       {
                           tg.tag.tag_id,
                           tg.tag.tag_title
                       }
            };
In a general sense you could look into caching that information, but nothing is intrinsically "slow" about the query. This will really depend on how the query is being used (how often, what data is being hit, etc). There are a lot of possible optimization solutions for a given problem, and though you may find improvements based on intuition you'll have a much easier time doing so if you have profiling tools in place to nail down problem areas. Plus, you'll have the satisfaction of proving that the areas you improve are worth the time investment.
A possible optimization for this query would be to load only the Ids of the related objects (UserInfo, Category, Tag) and to initialize those objects on demand using a lazy loading strategy, or to resolve them with a separate query.
But it depends on how you use the result of the query. Maybe you need all the information from the related objects, or only some of it, or maybe the Ids alone are enough because you need them for other queries.
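A hedged sketch of that 'Ids only' variant, keeping the column names from the question and leaving the detail lookups for later:
var posts = from p in context.post
            where p.post_isdeleted == false && p.post_parentid == null
            select new
            {
                p.post_id,
                p.post_date,
                p.post_title,
                p.post_titleslug,
                UserId = p.userinfo.user_userid,        // fetch user details later, only if needed
                CategoryId = p.category.catid,          // same for the category
                FavoriteCount = context.favorites.Count(x => x.post.post_id == p.post_id),
                TagIds = context.posttag
                                .Where(tg => tg.posttag_postid == p.post_id)
                                .Select(tg => tg.tag.tag_id)
            };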
I could make the LINQ cleaner (by using associations instead of pseudo-joins), but it wouldn't make it any faster. To make it faster you probably need to look at DB indexing.
