How to split synchronization process in sync framework - c#

I'm using Sync Framework to synchronize a SQL Server 2008 database with SQL CE on a mobile device. Everything looks fine apart from a few problems. One of them is:
If I want to sync 1000 or more rows, I get an OutOfMemoryException on the mobile device (though the sync completes correctly; afterwards I check the data of some rows and it looks synced). I thought that maybe the XML messages travelling between the mobile device and the server are too large (for 100 rows everything works just fine). That's why I asked how to split the sent data. But maybe I'm wrong. I didn't find any resources on this, so I don't know exactly what can eat so much memory just to add 60 KB to the compact database.

You'll need to implement some sort of batching.
A fairly naive version of it is shown here:
http://msdn.microsoft.com/en-us/library/bb902828.aspx
I've seen that you're interested in some filtering. If that filters out some, or rather a lot, of rows, I would recommend writing your own batch logic. The one we're currently using sets @sync_new_received_anchor to the anchor of the @sync_batch_size:th row to be synced.
In a simplified form, the logic looks like this:
SELECT @sync_new_received_anchor = MAX(ThisBatch.ChangeVersion)
FROM (SELECT TOP (@sync_batch_size) CT.SYS_CHANGE_VERSION AS ChangeVersion
      FROM TabletoSync
      INNER JOIN CHANGETABLE(CHANGES [TabletoSync],
                             @sync_last_received_anchor) AS CT
          ON TabletoSync.TabletoSyncID = CT.TabletoSyncID
      WHERE TabletoSync.FilterColumn = @ToClient
      ORDER BY CT.SYS_CHANGE_VERSION ASC) AS ThisBatch
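For reference, here is a minimal sketch of how such an anchor query can be plugged into the server-side provider so each session only picks up one batch. This assumes the offline providers from Sync Services for ADO.NET (DbServerSyncProvider); the table name, filter parameter and batch-size setting are illustrative, so check the exact property names against your Sync Framework version.

// Sketch only: wire a batch-limited anchor query into the server provider.
// Requires System.Data, System.Data.SqlClient and Microsoft.Synchronization.Data.Server.
var selectNewAnchor = new SqlCommand();
selectNewAnchor.CommandText =
    @"SELECT @sync_new_received_anchor = MAX(ThisBatch.ChangeVersion)
      FROM (SELECT TOP (@sync_batch_size) CT.SYS_CHANGE_VERSION AS ChangeVersion
            FROM TabletoSync
            INNER JOIN CHANGETABLE(CHANGES [TabletoSync], @sync_last_received_anchor) AS CT
                ON TabletoSync.TabletoSyncID = CT.TabletoSyncID
            WHERE TabletoSync.FilterColumn = @ToClient
            ORDER BY CT.SYS_CHANGE_VERSION ASC) AS ThisBatch";
selectNewAnchor.Parameters.Add("@sync_last_received_anchor", SqlDbType.BigInt);
selectNewAnchor.Parameters.Add("@sync_batch_size", SqlDbType.Int);
selectNewAnchor.Parameters.Add("@ToClient", SqlDbType.NVarChar, 50);                 // your filter value
selectNewAnchor.Parameters.Add("@sync_new_received_anchor", SqlDbType.BigInt)
               .Direction = ParameterDirection.Output;

var serverProvider = new DbServerSyncProvider();
serverProvider.SelectNewAnchorCommand = selectNewAnchor;
// serverProvider.BatchSize = 100;   // assumed batch-size property; verify for your provider version

With a batch size around 100 rows (the size you already know works), the device only has to hold one batch's change XML in memory at a time.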


Figuring out what process is running on my SQL that is being called by my c# code

I am working on a .NET nopCommerce application where I have around 5 million+ rows in the database and I need to query all of that data for extraction. But the data from SQL is never returned to my code, while my managed heap keeps growing (it goes beyond 1 GB); yet when I run the same stored procedure in SQL Server with the same parameters, it takes less than 2 minutes. I need to figure out why the call from my code takes so much longer.
nopCommerce uses Entity Framework to call the database's stored procedures, but that path is not async, so I am trying to call the stored procedure asynchronously using this:
await dbcontext.Database.SqlQuery<TEntity>(commandText, parameters).ToListAsync();
As far as I understand from another SO post, ToListAsync() turns this call into an async one, so the work is handed off to the task library.
Now I need to figure out three things that I'm currently unable to do:
1) I need to figure out whether that thread is actually running in the background. I assume it is, since memory keeps growing, but I'm not sure; I tried to check this with the Diagnostic Tools window in Visual Studio.
2) I need to make sure SQL Server is giving enough time to the database calls from my code. I tried the following queries, but they don't show me anything for the process running the data export initiated by my code.
I tried this query:
select top 50
sum(qs.total_worker_time) as total_cpu_time,
sum(qs.execution_count) as total_execution_count,
count(*) as number_of_statements,
qs.plan_handle
from
sys.dm_exec_query_stats qs
group by qs.plan_handle
order by sum(qs.total_worker_time) desc
I also tried this one:
SELECT
r.session_id
,st.TEXT AS batch_text
,SUBSTRING(st.TEXT, statement_start_offset / 2 + 1, (
(
CASE
WHEN r.statement_end_offset = - 1
THEN (LEN(CONVERT(NVARCHAR(max), st.TEXT)) * 2)
ELSE r.statement_end_offset
END
) - r.statement_start_offset
) / 2 + 1) AS statement_text
,qp.query_plan AS 'XML Plan'
,r.*
FROM sys.dm_exec_requests r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(r.plan_handle) AS qp
ORDER BY cpu_time DESC
Also, when I use sp_who or sp_who2, the statuses of the processes for my database stay 'runnable', along with CPU and DiskIO values that don't tell me much.
3) I need to know whether my DB call may have completed successfully but mapping the results to the relevant list is what takes so long.
I would very much appreciate someone pointing me in the right direction: maybe a query that shows the right results, or help with viewing the running background threads and their status, or better ways of looking at the GC, threads and CPU utilization.
Any help will be highly appreciated. Thanks
A couple of diagnostic things to try:
Try adding a TOP 100 clause to the select statement, to see whether the problem is in the communication layer or in the data mapper.
How much data is being returned by the stored procedure? If the procedure returns more than a million rows, you may not be querying the data you mean to.
Have you tried running it both synchronously and asynchronously?
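One rough way to separate the SQL time from the mapping time is to run the same procedure once through a bare SqlDataReader (no object materialization) and compare that with the EF SqlQuery call. A hedged sketch; "YourConnectionString" and "dbo.YourExportProc" are placeholders:

using System;
using System.Data;
using System.Data.SqlClient;
using System.Diagnostics;
using System.Threading.Tasks;

// Sketch only: time how long it takes to just stream the rows from the procedure.
public static async Task TimeRawReadAsync()
{
    var sw = Stopwatch.StartNew();
    using (var conn = new SqlConnection("YourConnectionString"))
    using (var cmd = new SqlCommand("dbo.YourExportProc", conn))
    {
        cmd.CommandType = CommandType.StoredProcedure;
        cmd.CommandTimeout = 0;                       // long-running export: no timeout
        await conn.OpenAsync();
        using (var reader = await cmd.ExecuteReaderAsync())
        {
            long rows = 0;
            while (await reader.ReadAsync())          // stream the rows, build no objects
                rows++;
            Console.WriteLine($"{rows} rows streamed in {sw.Elapsed}");
        }
    }
}
// If this finishes in roughly the SSMS time while SqlQuery(...).ToListAsync() takes far longer
// and memory keeps climbing, the cost is in materializing millions of entities into a list,
// not in SQL Server itself.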

Trying to sync data from third party api

This question has probably been asked correctly before, and I'll gladly accept an answer pointing me to the right spot. The problem is I don't know how to ask the question correctly to get anything returned in a search.
I'm trying to pull data from a third-party API (ADP) and store it in my database using ASP.NET Core.
I want to take the users returned from the API and store them in my database, where I have an ADP ancillary table seeded with the majority of the data from the API.
I would then like to update or add any missing or altered records in my database FROM the API.
I'm thinking about using an AJAX call to the API to retrieve the records, then either storing the data in another table and using SQL to look for records that differ between the two tables and make any necessary changes (this would be manually activated via a button), or some kind of scheduled background task that does this through methods in my C# code instead of AJAX.
The question I have is:
Is it a better fit to do this as a stored procedure in SQL, or to have a method in my web app perform the data transformation?
I'm looking for any examples of iterating through the returned data and updating/creating records in my database.
I've only seen vague, not-quite-what-I'm-looking-for examples and nothing definitive on the best way to accomplish this. If I can find any reference material or examples, I'll gladly research, but I don't even know where to start or the correct terms to search for. I've looked into model binding, AJAX calls, and JSON serialization and deserialization. I'm probably overthinking this.
Any suggestions or tech I should look at would be appreciated. Thanks for your time in advance.
My app is written in ASP.NET Core 2.2 using EF Core.
* EDIT *
For anyone looking - https://learn.microsoft.com/en-us/dotnet/csharp/tutorials/console-webapiclient
This, together with John Wu's answer, helped me achieve what I was looking for.
If this were my project this is how I would break down the tasks, in this order.
First, start an empty console application.
Next, write a method that gets the list of users from the API. You didn't tell us anything at all about the API, so here is a dummy example that uses an HTTP client.
public async Task<List<User>> GetUsers()
{
    var client = new HttpClient();
    var response = await client.GetAsync("https://SomeApi.com/Users");
    var users = await ParseResponse(response);
    return users.ToList();
}
Test the above (e.g. write a little shoestring code to run it and dump the results, or something) to ensure that it works independently. You want to make sure it is solid before moving on.
Next, create a temporary table (or tables) that matches the schema of the data objects that are returned from the API. For now you will just want to store it exactly the way you retrieve it.
Next, write some code to insert records into the table(s). Again, test this independently, and review the data in the table to make sure it all worked correctly. It might look a little like this:
public async Task InsertUser(User user)
{
    using (var conn = new SqlConnection(Configuration.ConnectionString))
    {
        var cmd = new SqlCommand();
        // etc. -- open the connection, set cmd.Connection, CommandText and parameters
        await cmd.ExecuteNonQueryAsync();
    }
}
Once you know how to pull the data and store it, you can finish the code to extract the data from the API and insert it. It might look a little like this:
public async Task DoTheMigration()
{
    var users = await GetUsers();
    var tasks = users.Select(u => InsertUser(u));
    await Task.WhenAll(tasks.ToArray());
}
As a final step, write a series of stored procedures or a DTS package to move the data from the temp tables to their final resting place. If you are using MS Access, you can write a series of queries and execute them in order with some VBA. At a high level it would:
Check for any records that exist in the temp table but not in the final table and insert them into the final table.
Check for any records that exist in the final table but not the temp table and remove them or mark them as deleted.
Check for any records in common that have different column values and update the final table.
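As a rough illustration of those three steps (purely a sketch; the staging/final table and column names such as AdpUsersStaging, AdpUsers and ExternalId are made up, so adjust them to your schema), a single T-SQL MERGE run from C# can cover the insert, delete/mark and update cases in one pass:

using System.Data.SqlClient;
using System.Threading.Tasks;

// Hypothetical sketch: reconcile the staging table with the final table.
public static async Task ReconcileStagingAsync(string connectionString)
{
    const string sql = @"
        MERGE dbo.AdpUsers AS target
        USING dbo.AdpUsersStaging AS source
            ON target.ExternalId = source.ExternalId
        WHEN NOT MATCHED BY TARGET THEN                      -- in staging, missing from final: insert
            INSERT (ExternalId, FirstName, LastName)
            VALUES (source.ExternalId, source.FirstName, source.LastName)
        WHEN NOT MATCHED BY SOURCE THEN                      -- in final, gone from staging: mark deleted
            UPDATE SET target.IsDeleted = 1
        WHEN MATCHED AND (target.FirstName <> source.FirstName
                       OR target.LastName  <> source.LastName) THEN   -- changed: update
            UPDATE SET target.FirstName = source.FirstName,
                       target.LastName  = source.LastName;";

    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(sql, conn))
    {
        await conn.OpenAsync();
        await cmd.ExecuteNonQueryAsync();
    }
}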
Each of these development activities raises its own set of questions, of course, which you can post back to Stack Overflow with details. As it stands, your question doesn't have enough specificity for a more in-depth answer.

Parsing and inserting bulk data. How to keep performance and do relations?

The data
I have a collection with around 300,000 vacations. Every vacation has several categories, countries, cities, activities and other subobjects. This data needs to be inserted into a MySQL / SQL Server database. I have the luxury of being able to truncate the entire database and start clean every time the parser program is run.
What I have tried
I have tried working with Entity Framework, which is also where my preference lies. To keep Entity Framework's performance up, I have created a construction where 300 items are taken out of the vacations collection, parsed and inserted by Entity Framework, and its context disposed thereafter. The program finishes in a matter of minutes using this method. If I fill the context with all 300k vacations from the collection (and their subobjects), it's a matter of hours.
int total = vacationsObjects.Count;
for (int i = 0; i < total; i += Math.Min(300, (total - i)))
{
    var set = vacationsObjects.Skip(i).Take(300);
    int enumerator = 0;
    using (var database = InitializeContext())
    {
        foreach (VacationModel vacationData in set)
        {
            enumerator++;
            Vacations vacation = new Vacations
            {
                ProductId = vacationData.ExternalId,
                Name = vacationData.Name,
                Description = vacationData.Description,
                Price = vacationData.Price,
                Url = vacationData.Url,
            };
            foreach (string category in vacationData.Categories)
            {
                var existingCategory = database.Categories.Local.FirstOrDefault(c => c.CategoryName == category);
                if (existingCategory != null)
                    vacation.Categories.Add(existingCategory);
                else
                {
                    vacation.Categories.Add(new Category
                    {
                        CategoryName = category
                    });
                }
            }
            database.Vacations.Add(vacation);
        }
        database.SaveChanges();
    }
}
The downside (and possibly dealbreaker) with this method is figuring out the relationships. As you can see when adding a Category I check if it's already been created in the local context, and then use that. But what if it has been added in a previous set of 300? I don't want to query the database multiple times for every vacation to check whether an entity already resides within it.
Possible solution
I could keep a dictionary in memory containing the categories that have been added. I'd need to figure out how to attach these categories to the proper vacations (or vice-versa) and insert them, including their respective relations into the database.
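A hedged sketch of that idea, reusing the names from the code above (EF6-style; the Attach call assumes Category gets its database-generated key populated by SaveChanges, so later batches can reference the same row instead of inserting a duplicate):

// Sketch: one category cache shared across all 300-item batches.
// EntityState lives in System.Data.Entity (EF6).
var categoryCache = new Dictionary<string, Category>(StringComparer.OrdinalIgnoreCase);

for (int i = 0; i < total; i += Math.Min(300, (total - i)))
{
    var set = vacationsObjects.Skip(i).Take(300);
    using (var database = InitializeContext())
    {
        foreach (VacationModel vacationData in set)
        {
            var vacation = new Vacations { ProductId = vacationData.ExternalId, Name = vacationData.Name /* ... */ };

            foreach (string categoryName in vacationData.Categories)
            {
                Category category;
                if (!categoryCache.TryGetValue(categoryName, out category))
                {
                    category = new Category { CategoryName = categoryName };
                    categoryCache[categoryName] = category;              // first time we see this category
                }
                else if (database.Entry(category).State == EntityState.Detached)
                {
                    // Saved by an earlier batch: attach it so EF reuses the existing row.
                    database.Categories.Attach(category);
                }
                vacation.Categories.Add(category);
            }
            database.Vacations.Add(vacation);
        }
        database.SaveChanges();   // the cached categories keep their generated keys
    }
}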
Possible alternatives
Segregate the context and the transaction -
Purely theoretical; I do not know if I'm making any sense here. Maybe I could have EF's context keep track of all objects and take manual control over the inserting part. I have messed around with this, trying to work with manual transaction scopes, to no avail.
Stored procedure -
I could write a stored procedure that handles and inserts my data. I'm not a big fan of this alternative, as I would like to keep the flexibility of switching between MySQL and SQL Server. Also, I would be in the dark as to where to begin.
Intermediary CSV file -
Instead of inserting parsed data directly into the RDBMS, I could export it into one or more CSV files and make use of import tools such as MySQL's LOAD DATA INFILE.
Alternative database systems
Databases such as Azure Table Storage, MongoDB or RavenDB could be an option. However, I would prefer to stick to a traditional RDBMS due to compatibility with my skill set and tools.
I have been working on and researching this problem for a couple of weeks now. It seems like the best way of finding a solution that fits is by simply trying the different possibilities and observing the result. I was hoping that I could receive some pointers or tips from your personal experiences.
If you insert each record separately, the whole operation will take a lot of time. The bottleneck is the SQL queries between client and server: each query takes time, so try to avoid issuing many of them. For a huge amount of data it is much better to process it server-side. The best solution is to use a dedicated import tool: in MySQL you can use LOAD DATA, in MS SQL there is BULK INSERT. To import your data, you need a .csv file.
To handle foreign keys correctly, you must populate the tables manually before inserting. If the destination tables are empty, you can simply create the .csv file with predefined primary and foreign keys. Otherwise you can import the existing records from the server, update them with your data, then export them back.
Time
Since you can afford to make only INSERTs, one suggestion is to try the Entity Framework Bulk Insert extension. I have used it to save up to 200K records and it works fine. Just include it in your project and write something like this:
context.BulkInsert(listOfEntities);
This should solve (or greatly improve over the plain EF version) the time dimension of your problem.
Data integrity
Keeping everything in one transaction does not sound reasonable (I expect those 300K parent records to generate at least 3M rows overall), so I would try the following approach:
1) insert your entities using bulk insert;
2) call a stored procedure to check data integrity.
If the insertion takes quite long and the chance of failure is relatively high, you can load what is already in the database and have the process skip it:
1) make smaller bulk inserts, one batch of vacation records together with all of its child records, and ensure each batch runs in a transaction. A single BULK INSERT runs atomically (no transaction needed); for several statements it gets trickier.
2) if the process fails, you still have complete vacation data in your database (no partially imported vacation).
3) when you retake the process, load the existing vacation records (parents only) first. With EF, a faster way is to use AsNoTracking to spare the change-tracking overhead (which matters for large lists):
var existingVacations = context.Vacation.Select(v => v.VacationSourceIdentifier).AsNoTracking();
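Putting that together, the resume step might look roughly like this (VacationSourceIdentifier comes from the line above, ExternalId from the question's code; the key type and the rest are illustrative):

// Sketch: skip vacations that already made it in on a previous run.
var existingIds = new HashSet<string>(
    context.Vacation.AsNoTracking()
                    .Select(v => v.VacationSourceIdentifier));

var remaining = vacationsObjects
    .Where(v => !existingIds.Contains(v.ExternalId))
    .ToList();

// Feed `remaining` (instead of the full collection) into the batched bulk-insert loop.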
As suggested by Alexei, EntityFramework.BulkInsert is a very good solution if your model is supported by this library.
You can also use Entity Framework Extensions (PRO version), which allows you to use BulkSaveChanges and bulk operations (Insert, Update, Delete and Merge).
It supports both of your providers: MySQL and SQL Server.
// Upgrade SaveChanges performance with BulkSaveChanges
var context = new CustomerContext();
// ... context code ...
// Easy to use
context.BulkSaveChanges();
// Easy to customize
context.BulkSaveChanges(operation => operation.BatchSize = 1000);
// Use direct bulk operation
context.BulkInsert(customers);
Disclaimer: I'm the owner of the project Entity Framework Extensions

Speed up LINQ inserts

I have a CSV file and I have to insert it into a SQL Server database. Is there a way to speed up the LINQ inserts?
I've created a simple Repository method to save a record:
public void SaveOffer(Offer offer)
{
    Offer dbOffer = this.db.Offers.SingleOrDefault(
        o => o.offer_id == offer.offer_id);

    // add new offer
    if (dbOffer == null)
    {
        this.db.Offers.InsertOnSubmit(offer);
    }
    // update existing offer
    else
    {
        dbOffer = offer;
    }

    this.db.SubmitChanges();
}
But using this method, the program is much slower than inserting the data using plain ADO.NET SQL commands (new SqlConnection, new SqlCommand for the "does it exist" select, new SqlCommand for the update/insert).
On 100k CSV rows it takes about an hour, versus a minute or so the ADO.NET way. For 2M CSV rows, ADO.NET took about 20 minutes; LINQ added about 30k of those 2M rows in 25 minutes. My database has 3 tables, linked in the dbml, but the other two tables are empty. The tests were made with all the tables empty.
P.S. I've tried to use SqlBulkCopy, but I need to do some transformations on Offer before inserting it into the db, and I think that defeats the purpose of SqlBulkCopy.
Updates/Edits:
After 18 hours, the LINQ version had added just ~200K rows.
I've tested the import with plain LINQ inserts too, and it is also really slow compared with ADO.NET. I haven't seen a big difference between insert/SubmitChanges only and select/update/insert/SubmitChanges.
I still have to try batch commits, manually connecting to the db, and compiled queries.
SubmitChanges does not batch changes; it issues a single INSERT statement per object. If you want to do fast inserts, I think you need to stop using LINQ.
While SubmitChanges is executing, fire up SQL Profiler and watch the SQL being executed.
See the question "Can LINQ to SQL perform batch updates and deletes? Or does it always do one row update at a time?" here: http://www.hookedonlinq.com/LINQToSQLFAQ.ashx
It links to this article: http://www.aneyfamily.com/terryandann/post/2008/04/Batch-Updates-and-Deletes-with-LINQ-to-SQL.aspx, which uses extension methods to work around LINQ to SQL's inability to batch inserts, updates, etc.
Have you tried wrapping the inserts within a transaction and/or delaying db.SubmitChanges so that you can batch several inserts?
Transactions help throughput by reducing the need for fsync()s, and delaying db.SubmitChanges will reduce the number of .NET <-> db round-trips.
Edit: see http://www.sidarok.com/web/blog/content/2008/05/02/10-tips-to-improve-your-linq-to-sql-application-performance.html for some more optimization principles.
Have a look at the following page for a simple walk-through of how to change your code to use a Bulk Insert instead of using LINQ's InsertOnSubmit() function.
You just need to add the (provided) BulkInsert class to your code, make a few subtle changes to your code, and you'll see a huge improvement in performance.
Mike's Knowledge Base - BulkInserts with LINQ
Good luck !
I wonder if you're suffering from an overly large set of data accumulating in the data-context, making it slow to resolve rows against the internal identity cache (which is checked once during the SingleOrDefault, and for "misses" I would expect to see a second hit when the entity is materialized).
I can't recall 100% whether the short-circuit works for SingleOrDefault (although it will in .NET 4.0).
I would try ditching the data context (SubmitChanges and replace it with an empty one) every n operations for some n: maybe 250 or so.
Given that you're calling SubmitChanges per instance at the moment, you may also be wasting a lot of time checking the delta, which is pointless if you've only changed one row. Only call SubmitChanges in batches, not per record.
Alex gave the best answer, but I think a few things are being overlooked.
One of the major bottlenecks here is calling SubmitChanges for each item individually. A problem I don't think most people know about is that if you haven't manually opened your DataContext's connection yourself, the DataContext will repeatedly open and close it itself. However, if you open it yourself and close it yourself once you're completely finished, things will run a lot faster, since it won't have to reconnect to the database every time. I found this out while trying to figure out why DataContext.ExecuteCommand() was so unbelievably slow when executing multiple commands at once.
A few other areas where you could speed things up:
While LINQ to SQL doesn't support straight-up batch processing, you should wait to call SubmitChanges() until you've processed everything first. You don't need to call SubmitChanges() after each InsertOnSubmit call.
If live data integrity isn't super crucial, you could retrieve the list of existing offer_id values from the server before you start checking whether each offer already exists. This could significantly reduce the number of round-trips to the server just to discover that an item isn't there.
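For example, a hedged sketch along those lines (reusing the Offer/offer_id names from the question; offersFromCsv stands in for your parsed CSV rows, and offer_id is assumed to be an int):

// Pull the existing keys once, then decide insert vs. update locally.
var existingIds = new HashSet<int>(this.db.Offers.Select(o => o.offer_id));

foreach (Offer offer in offersFromCsv)
{
    if (!existingIds.Contains(offer.offer_id))
        this.db.Offers.InsertOnSubmit(offer);      // brand-new row
    else
    {
        // Existing row: look it up (or map the new values onto it) and update here.
    }
}
this.db.SubmitChanges();                           // one submit per batch, not per record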
Why not pass an Offer[] into that method and do all the changes in memory before submitting them to the database? Or you could use groups for submission, so you don't run out of cache. The main thing is how long it takes until you send the data over; the biggest time waster is the closing and opening of the connection.
Converting this to a compiled query is the easiest way I can think of to boost your performance here:
Change the following:
Offer dbOffer = this.db.Offers.SingleOrDefault (
o => o.offer_id == offer.offer_id);
to:
Offer dbOffer = RetrieveOffer(this.db, offer.offer_id);

// DataContext here stands for your generated LINQ to SQL context class (the type of this.db).
private static readonly Func<DataContext, int, Offer> RetrieveOffer =
    CompiledQuery.Compile((DataContext context, int offerId) =>
        context.Offers.SingleOrDefault(o => o.offer_id == offerId));
This change alone will not make it as fast as your ADO.NET version, but it will be a significant improvement, because without the compiled query you are dynamically building the expression tree every time you run this method.
As one poster already mentioned, you must refactor your code so that SubmitChanges is called only once if you want optimal performance.
Do you really need to check whether the record exists before inserting it into the DB? I thought it looked strange, since the data comes from a CSV file.
"P.S. I've tried to use SqlBulkCopy, but I need to do some transformations on Offer before inserting it into the db, and I think that defeats the purpose of SqlBulkCopy."
I don't think it defeats the purpose at all. Why would it? Just fill a simple DataTable with all the data from the CSV and do a SqlBulkCopy. I did a similar thing with a collection of 30,000+ rows and the import time went from minutes to seconds.
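For instance, a rough sketch (the column list, the Transform helper and the csvRows variable are placeholders; needs System.Data and System.Data.SqlClient):

// Apply the transformations in memory, then push everything in one bulk copy.
var table = new DataTable();
table.Columns.Add("offer_id", typeof(int));
table.Columns.Add("price", typeof(decimal));
// ... one column per destination column ...

foreach (var row in csvRows)
{
    var offer = Transform(row);                    // whatever per-row transformation you need
    table.Rows.Add(offer.offer_id, offer.price);
}

using (var bulk = new SqlBulkCopy(connectionString))
{
    bulk.DestinationTableName = "dbo.Offers";
    bulk.BatchSize = 5000;
    bulk.WriteToServer(table);
}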
I suspect it isn't the inserting or updating operations that are taking a long time, rather the code that determines if your offer already exists:
Offer dbOffer = this.db.Offers.SingleOrDefault (
o => o.offer_id == offer.offer_id);
If you look to optimise this, I think you'll be on the right track. Perhaps use the Stopwatch class to do some timing that will help to prove me right or wrong.
Usually, when not using LINQ to SQL, you would have an insert/update procedure or SQL script that determines whether the record you pass already exists. You're doing this expensive operation in LINQ, which will never match the speed of native SQL (which is what happens when you use a SqlCommand and select whether the record exists), particularly when the lookup is on a primary key.
You must understand that LINQ generates its ADO.NET code dynamically for every operation you do, instead of it being handwritten, so it will always take more time than your manual code. It's simply an easy way to write code, but if you want to talk about performance, hand-written ADO.NET code will always be faster, depending on how you write it.
I don't know whether LINQ will try to reuse its last statement or not; if it does, then separating the insert batch from the update batch may improve performance a little.
This code runs OK, and prevents large amounts of pending changes from building up:
if (repository2.GeoItems.GetChangeSet().Inserts.Count > 1000)
{
    repository2.GeoItems.SubmitChanges();
}
Then, at the end of the bulk insertion, use this:
repository2.GeoItems.SubmitChanges();

How can I get a percentage of LINQ to SQL submitchanges?

I wonder if anyone else has asked a similar question.
Basically, I have a huge tree I'm building up in RAM using LINQ to SQL objects, and then I dump it all in one go using DataContext.SubmitChanges().
It works, but I can't find a way to give the user a visual indication of how far the operation has progressed. If I could ultimately implement a sort of progress bar, that would be great, even if there is a minimal loss in performance.
Note that I have quite a large amount of rows to put into the DB, over 750,000 rows.
I haven't timed it exactly, but it does take a long while to put them in.
Edit: I thought I'd better give some indication of what I'm doing.
Basically, I'm building a suffix tree from the text of The Lord of the Rings. Thus there are a lot of nodes, and certain nodes have positions associated with them (nodes that happen to be at the end of a suffix). I am building the LINQ objects along these lines:
suffixTreeDB.NodeObjs.InsertOnSubmit(new NodeObj()
{
    NodeID = 0,
    ParentID = 0,
    Path = "$"
});
After the suffix tree has been fully generated in RAM (which only takes a few seconds), I then call suffixTreeDB.SubmitChanges();
What I'm wondering is if there is any faster way of doing this. Thanks!
Edit 2: I did a stopwatch run, and apparently it takes precisely 6 minutes for the DB to be written.
I suggest you divide up the calls you are making, since they are sent as separate statements to the db anyway. This will also reduce the size of the transaction (which LINQ to SQL creates when calling SubmitChanges).
If you divide them into 10 blocks of 75,000, you can provide a rough progress estimate on a 1/10 scale.
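A hedged sketch of that idea (SuffixTreeDataContext, allNodes and reportProgress are placeholders; NodeObjs follows your code):

// Sketch: submit in chunks and report progress after each chunk.
const int chunkSize = 75000;
for (int i = 0; i < allNodes.Count; i += chunkSize)
{
    var chunk = allNodes.Skip(i).Take(chunkSize).ToList();
    using (var db = new SuffixTreeDataContext())            // a fresh context keeps change tracking small
    {
        db.NodeObjs.InsertAllOnSubmit(chunk);
        db.SubmitChanges();                                 // one transaction per chunk
    }
    reportProgress((i + chunk.Count) / (double)allNodes.Count);   // drive your progress bar here
}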
Update 1: After re-reading your post and your new comments, I think you should take a look at SqlBulkCopy instead. If you need to improve the time of the operation, that's the way to go. Check this related question/answer: What's the fastest way to bulk insert a lot of data in SQL Server (C# client)
I was able to get percentage progress for ctx.SubmitChanges() by using ctx.Log and an ActionTextWriter:
ctx.Log = new ActionTextWriter(s =>
{
    // ActionTextWriter is a small TextWriter that forwards writes to an Action<string>;
    // see the blog post linked below for its implementation.
    if (s.StartsWith("INSERT INTO"))
        insertsCount++;
    ReportProgress(insertsCount);
});
More details are available in my blog post:
http://epandzo.wordpress.com/2011/01/02/linq-to-sql-ctx-submitchanges-progress/
and in this Stack Overflow question:
LINQ to SQL SubmitChanges() progress
This isn't ideal, but you could create another thread that periodically queries the table you're populating to count the number of records that have been inserted. I'm not sure how/if this will work if you are running in a transaction though, since there could be locking/etc.
What I really think I need is a form of bulk insert; however, it appears that LINQ to SQL doesn't support it.
