I must sync 2 tables of two databases one of which is a MSSQL and the other one MySQL. I have to do this through a Windows Service. I currently have created a Windows Service, and what I currently have added to the service is : getting the data from the MySQL database and inserting it into a Data Adapter from which I plan to move the data to the MSSQL database using an insertion through transaction.Can you tell me what is the best approach to this problem and if what I'm doing right now is on the right track, first time I'm doing such a thing.
Well, probably there is a couple ways to solve this. You didn't mention about count and size of your tables, desired synchronization frequency so my answer might be not 100% relevant. Anyway my proposal is:
Each table should have additional column - SyncDate (TIMESTAMP) which by default is null.
Sync logic which is implemented in Windows Service periodically checks each table if there is some data to sync by executing query like this (use pure IDataReader for performance reasons):
SELECT * from TEST_TABLE where SyncDate is null
If above statement returns data then:
collect package of rows to insert (for example 500 rows)
begin transaction on target database and execute insert statement for collected package (set value for column SyncDate to DateTime.Now)
when insert is complete execute statement like this in source db:
UPDATE TEST_TABLE SET SyncDate = #DateTime.Now where ID_PRIMARY_KEY IN (IDENTIFIRES FROM PACKAGE)
commit transaction
collect another packages and repeat algorithm until DataReader provides data
Above algorithm should be applied on both databases
Probably your database has foreign keys so you have to remember that dependant tables should be synchronized in valid order
Wherever it possible you can benefit from multi threading
For this kind of job, SQL Server Integration Services is the way to go.
SSIS is made to Extract, transform, and load data. (ETL process)
You design a worklow with each transformation step of your data, from MySQL to MSSQL.
You then create a job and configure its schedule, and it will be executed by SQL Server.
No need to develop something from scratch and hard to maintain.
If you are already at a point where you are ready to move the MySQL data to SQL. I would suggest you put them on a separate table and issue a Merge command to merge the data. Make sure that you flag back the data that has been updated so you can now get them back and push to MySQL:
MERGE TableFromMySQL AS TARGET USING TableFromSQL AS SOURCE
ON (TARGET.PrimaryKey = SOURCE.PrimaryKey)
WHEN Matched AND (TARGET.Field1 <> SOURCE.Field1
OR TARGET.Field2 <> SOURCE.Field2
OR .. put more field comparison here that you want to sync
THEN
UPDATE
SET TARGET.Field1 = SOURCE.Field1,
TARGET.Field2 = SOURCE.Field2,
//flag the target field as updated
TARGET.IsUpdated = 1,
TARGET.LastUpdatedOn = GETDATE()
WHEN Not Matched
THEN
INSERT(PrimaryKey, Field1, Field2, IsUpdated, LastUpdatedOn)
VALUES(SOURCE.PrimaryKey, SOURCE.Field1, SOURCE.Field2, 1, GETDATE());
At this point, you can now query the MySQL table with IsUpdated = 1 and push all values to MYSQL database. I hope this helps.
If you add the mySql database as a linked server (sp_addlinkedserver) then I think you could do the following in a stored proc on your MS SQL database:
insert table1(col1)
select col1
from openquery('MySqlLinkedDb','select col1 from mySqlDb.dbo.table2')
If you have to do this through a Windows Service, I am curious how you are receiving the data from MySQL? What is initiating the call to the service and what form is the data passed in? Are you getting changes or are you getting a couple refresh of the data?
If you receive a DataTable or some other kind of IEnumerable and its a complete refresh, for example, you could use SqlBulkCopy.WriteToServer(). That would be the fastest and most efficient.
Tom
Related
I want to perform bulk insert from CSV to MySQL database using C#, I'm using MySql.Data.MySqlClient for connection. CSV columns are refereed into multiple tables and they are dependent on primary key value, for example,
CSV(column & value): -
emp_name, address,country
-------------------------------
jhon,new york,usa
amanda,san diago,usa
Brad,london,uk
DB Schema(CountryTbl) & value
country_Id,Country_Name
1,usa
2,UK
3,Germany
DB Schema(EmployeeTbl)
Emp_Id(AutoIncrement),Emp_Name
DB Schema(AddressTbl)
Address_Id(AutoIncrement), Emp_Id,Address,countryid
Problem statement:
1> Read data from CSV to get the CountryId from "CountryTbl" for respective employee.
2> Insert data into EmployeeTbl and AddressTbl with CountryId
Approach 1
Go as per above problem statement steps, but that will be a performance hit (Row-by-Row read and insert)
Approach 2
Use "Bulk Insert" option "MySqlBulkLoader", but that needs csv files to read, and looks that this option is not going to work for me.
Approach 3
Use stored proc and use the procedure for upload. But I don't want to use stored proc.
Please suggest if there is any other option by which I can do bulk upload or suggest any other approach.
Unless you have hundreds of thousands of rows to upload, bulk loading (your approach 2) probably is not worth the extra programming and debugging time it will cost. That's my opinion, for what it's worth (2x what you paid for it :)
Approaches 1 and 3 are more or less the same. The difference lies in whether you issue the queries from c# or from your sp. You still have to work out the queries. So let's deal with 1.
The solutions to these sorts of problems depend on make and model of RDBMS. If you decide you want to migrate to SQL Server, you'll have to change this stuff.
Here's what you do. For each row of your employee csv ...
... Put a row into the employee tbl
INSERT INTO EmployeeTbl (Emp_Name) VALUES (#emp_name);
Notice this query uses the INSERT ... VALUES form of the insert query. When this query (or any insert query) runs, it drops the autoincremented Emp_Id value where a subsequent invocation of LAST_INSERT_ID() can get it.
... Put a row into the address table
INSERT INTO AddressTbl (Emp_Id,Address,countryid)
SELECT LAST_INSERT_ID() AS Emp_Id,
#address AS Address,
country_id AS countryid
FROM CountryTbl
WHERE Country_Name = #country;
Notice this second INSERT uses the INSERT ... SELECT form of the insert query. The SELECT part of all this generates one row of data with the column values to insert.
It uses LAST_INSERT_ID() to get Emp_Id,
it uses a constant provided by your C# program for the #address, and
it looks up the countryid value from your pre-existing CountryTbl.
Notice, of course, that you must use the C# Parameters.AddWithValue() method to set the values of the # parameters in these queries. Those values come from your CSV file.
Finally, wrap each thousand rows or so of your csv in a transaction, by preceding their INSERT statements with a START TRANSACTION; statement and ending them with a COMMIT; statement. That will get you a performance improvement, and if something goes wrong the entire transaction will get rolled back so you can start over.
My query is fairly complex, but I have simplified it to figure out this problem and now it is a simple JOIN that I'm running on a SQL Server 2014 database. The query is:
SELECT * FROM SportsCars as sc INNER JOIN Cars AS c ON c.CarID = sc.CarID WHERE c.Type = 1
When I run this query from SMSS and watch it in SQL Profiler, it takes around 350ms to execute. When I run the same query inside my application using Entity Framework or ADO.NET (I've tried both). It takes 4500ms to execute.
ADO.NET Code:
using (var connection = new SqlConnection(connectionString))
{
connection.Open();
var cmdA = new SqlCommand("SET ARITHABORT ON", connection);
cmdA.ExecuteNonQuery();
var query = "SELECT * FROM SportsCars as sc INNER JOIN Cars AS c ON c.CarID = sc.CarID WHERE c.Type = 1";
var cmd = new SqlCommand(query, connection);
cmd.ExecuteNonQuery()
}
I've done an extensive Google search and found this awesome article and several StackOverflow questions (here and here). In order to make the session parameters identical for both queries I call SET ARITHABORT ON in ADO.NET and it makes no difference. This is a straight SQL query, so there is not a parameter sniffing problem. I've simplified the query and the indexes down to their most basic form for this test. There is nothing else running on the server and there is nothing else accessing the database during the test. There are no computed columns in the Cars or SportsCars table, just INTs and VARCHARs.
The SportsCars table has about 170k records and 4 columns, and the Cars table has about 1.2M records and 7 columns. The resulting data set (SportsCars of Type=1) has about 2600 records and 11 columns. I have a single non-clustered index on the Cars table, on the [Type] column that includes all the columns of the cars table. And both tables have a clustered index on the CarID column. No other indexes exist on either table. I'm running as the same database user in both cases.
When I view the data in SQL Profiler, I see that both queries are using the exact same, very simple query plan. In SQL Profiler, I'm using the Performance Event Class and the ShowPlan XML Statistics Profile, which I believe to be the proper event to monitor and capture the actual execution plan. The # of reads is the same for both queries (2596).
How can two exact same queries with the exact same query plan take 10x longer in ADO.NET vs. SMSS?
Figured it out:
Because I'm using Entity Framework, the connection string in my application has MultipleActiveResultSets=True. When I remove this from the connection string, the queries have the same performance in ADO.NET and SSMS.
Apparently there is an issue with this setting causing queries to respond slowly when connected to SQL Server via WAN. I found this link and this comment:
MARS uses "firehose mode" to retrieve data. Firehose mode means that
the server will produce data as fast as possible. This also means that
your client application must receive inbound data at the same speed as
it comes in. If it doesn't the data storage buffers on the server will
fill up and the processing will stop until those buffers empty.
So what? You may ask... But as long as the processing is stopped the
resources on the SQL server are in use and are tied up. This includes
the worker thread, schema and data locks, memory, etc. So it is
crucial that your client application consumes the inbound results as
quickly as they arrive.
I have to use this setting with Entity Framework otherwise lazy loading will generate exceptions. So I'm going to have to figure out some other workaround. But at least I understand the issue now.
How can two exact same queries with the exact same query plan take 10x longer in ADO.NET vs. SMSS?
First we need to be clear about what is considered "same" with regards to queries and query plans. Assuming that the query at the very top of the question is a copy-and-paste, then it is not the same query as the one being submitted via ADO.NET. For two queries to be the same, they need to be byte-by-byte the same, which includes all white-space, capitalization, punctuation, comments, etc.
The two queries shown are definitely very similar. And they might even share the same execution plan. But how was "same"ness determined for those? Was the XML the same in both cases? Or just what was shown graphically in SSMS when viewing the plans? If they were determined to be the same based on their graphical representation then that is sometimes misleading. The XML itself needs to be checked. Even if two query plans have the same query hash, there are still (sometimes) parts of a query plan that are variable and changes do not change the plan hash. One example is the evaluation of expressions. Sometimes they are calculated and their result is embedded into the plan as a constant. Sometimes they are calculated at the start of each execution and stored and reused within that particular execution, but not for any subsequent executions.
One difference between SSMS and ADO.NET is the default session properties for each. I thought I had seen a chart years ago showing the defaults for ADO / OLEDB / SQLNCLI but can't find it out. Either way, it doesn't need to be guess work as it can be discovered using the SESSIONPROPERTY function. Just run this query in the C# code instead of your current SELECT, and inspect the results in debug or print them out or whatever. Either way, run something like this:
SELECT SESSIONPROPERTY('ANSI_NULLS') AS [AnsiNulls],
SESSIONPROPERTY('ANSI_PADDING') AS [AnsiPadding],
SESSIONPROPERTY('CONCAT_NULL_YIELDS_NULL') AS [ConcatNullYieldsNull],
...;
Make sure to get all of the setting noted in the linked MSDN page. Now, in SSMS, go to the "Query" menu, select "Query Options...", and go to "Execution" | "ANSI". The settings coming back from the C# code need to match the ones showing in SSMS. Anything set different requires adding something like this to the beginning of your ADO.NET query string:
SET ANSI_NULLS ON;
{rest of query}
Now, if you want to eliminate the DataTable loading from being a possible suspect, just replace that line, just replace:
var cars = new DataTable();
cars.Load(reader);
with:
while(reader.Read());
And lastly, why not just put the query into a Stored Procedure? The session settings (i.e. ANSI_NULLS, etc) that typically matter the most are stored with the proc definition so they should work the same whether you EXEC from SSMS or from ADO.NET (again, we aren't dealing with any parameters here).
This question already has answers here:
Insert 2 million rows into SQL Server quickly
(8 answers)
Closed 8 years ago.
I am writing a stored procedure to insert rows into a table. The problem is that in some operation we might want to insert more than 1 million rows and we want to make it fast. Another thing is that in one of the column, it is Nvarchar(MAX). We might want to put avg 1000 characters in this column.
Firstly, I wrote a prc to insert row by row. Then I generate some random data for insert with the NVARCHAR(MAX) column to be a string of 1000 characters. Then use a loop to call the prc to insert the rows. The perf is very bad which takes 48 mins if I use SQL server to log on the database server to insert. If I use C# to connect to the server in my desktop (that is what we usually want to do ), it takes about more than 90mins.
Then, I changed the prc to take a table type parameter as the input. I prepared the rows somehow and put them in the table type parameter and do the insert by the following command:
INSERT INTO tableA SELECT * from #tableTypeParameterB
I tried batch size as 1000 rows and 3000 rows (Put 1000-3000 rows in the #tableTypeParameterB to be inserted for one time). The performance is still bad. It takes about 3 mins to insert 1 million rows if I run it in the SQL server and take about 10 mins if I use C# program to connect from my desktop.
The tableA has a clustered index with 2 columns.
My target is to make the insert as fast as possible (My idea target is within 1 min). Is there any way to optimize it?
Just an update:
I tried the Bulk Copy Insert which was suggested by some people below. I tried use the SQLBULKCOPY to insert 1000 row and 10000 row at a time. The performance is still 10 mins to insert 1 million row (Every row has a column with 1000 characters). There is no performance improve. Is there any other suggestions?
An update based on the comments require.
The data is actually coming from UI. The user will change use UI to bulk select, we say, one million rows and change one column from the old value to new value. This operation will be done in a separate procedure.But here what we need to do is that make the mid-tier service to get the old value and new value from the UI and insert them in the table. The old value and new value may be up to 4000 characters and the average is 1000 characters. I think the long string old/new value slow down the speed because when I change the test data old value/new value to 20-50 characters and insert is very fast no matter use SQLBulkCopy or table type variable
I think what you are looking for is Bulk Insert if you prefer using SQL.
Or there is also the ADO.NET for Batch Operations option, so you keep the logic in your C# application. This article is also very complete.
Update
Yes I'm afraid bulk insert will only work with imported files (from within the database).
I have an experience in a Java project where we needed to insert millions of rows (data came from outside the application btw).
Database was Oracle, so of course we used the multi-line insert of Oracle. It turned out that the Java batch update was much faster than the multi-valued insert of Oracle (so called "bulk updates").
My suggestion is:
Compare the performance between the multi-value insert of SQL Server code (then you can read from inside your database, a procedure if you like) with the ADO.NET Batch Insert.
If the data you are going to manipulate is coming from outside your application (if it is not already in the database), I would say just go for the ADO.NET Batch Inserts. I think that its your case.
Note: Keep in mind that batch inserts usually operate with the same query. That is what makes them so fast.
Calling a prc in a loop incurs many round trips to SQL.
Not sure what batching approach you used but you should look into table value parameters: Docs are here. You'll want to still batch write.
You'll also want to consider memory on your server. Batching (say 10K at a time) might be a bit slower but might keep memory pressure lower on your server since you're buffering and processing a set at a time.
Table-valued parameters provide an easy way to marshal multiple rows
of data from a client application to SQL Server without requiring
multiple round trips or special server-side logic for processing the
data. You can use table-valued parameters to encapsulate rows of data
in a client application and send the data to the server in a single
parameterized command. The incoming data rows are stored in a table
variable that can then be operated on by using Transact-SQL.
Another option is bulk insert. TVPs benefit from re-use however so it depends on your usage pattern. The first link has a note about comparing:
Using table-valued parameters is comparable to other ways of using
set-based variables; however, using table-valued parameters frequently
can be faster for large data sets. Compared to bulk operations that
have a greater startup cost than table-valued parameters, table-valued
parameters perform well for inserting less than 1000 rows.
Table-valued parameters that are reused benefit from temporary table
caching. This table caching enables better scalability than equivalent
BULK INSERT operations.
Another comparison here: Performance of bcp/BULK INSERT vs. Table-Valued Parameters
Here is an example what I've used before with SqlBulkCopy. Grant it I was only dealing with around 10,000 records but it did it inserted them a few seconds after the query ran. My field names were the same so it was pretty easy. You might have to modify the DataTable field names. Hope this helps.
private void UpdateMemberRecords(Int32 memberId)
{
string sql = string.Format("select * from Member where mem_id > {0}", memberId);
try {
DataTable dt = new DataTable();
using (SqlDataAdapter da = new SqlDataAdapter(new SqlCommand(sql, _sourceDb))) {
da.Fill(dt);
}
Console.WriteLine("Member Count: {0}", dt.Rows.Count);
using (SqlBulkCopy sqlBulk = new SqlBulkCopy(ConfigurationManager.AppSettings("DestDb"), SqlBulkCopyOptions.KeepIdentity)) {
sqlBulk.BulkCopyTimeout = 600;
sqlBulk.DestinationTableName = "Member";
sqlBulk.WriteToServer(dt);
}
} catch (Exception ex) {
throw;
}
}
If you have SQL2014, then the speed of In-Memory OLTP is amazing;
http://msdn.microsoft.com/en-au/library/dn133186.aspx
Depending on your end goal, it may be a good idea to look into Entity Framework (or similar). This abstracts out the SQL so that you don't really have to worry about it in your client application, which is how things should be.
Eventually, you could end up with something like this:
using (DatabaseContext db = new DatabaseContext())
{
for (int i = 0; i < 1000000; i++)
{
db.Table.Add(new Row(){ /* column data goes here */});
}
db.SaveChanges();
}
The key part here (and it boils down to a lot of the other answers) is that Entity Framework handles building the actual insert statement and committing it to the database.
In the above code, nothing will actually be sent to the database until SaveChanges is called and then everything is sent.
I can't quite remember where I found it, but there is research around that suggests it is worth while to call SaveChanges every so often. From memory, I think every 1000 entries is a good choice for committing to the database. Committing every entry, compared to every 100 entries, doesn't provide much performance benefit and 10000 takes it past the limit. Don't take my word for that though, the numbers could be wrong. You seem to have a good grasp on the testing side of things though, so have a play around with things.
[note: needs to be in code as can't use SSIS or similar]
I need to bulk copy data from one database to another using C# and EF probably - though this isn't cast in stone.
The problem is that the source data is all in varchar(max) and I want the destination in correct data types. The source is historical from an old ETL job that works very well and I can't get replaced. The most common issue I have seen is alpha's in numeric fields - e.g. "none" in a money field. These are fine in the source since it's all varchar.
I'd like to copy the data and validate it:source -> validate -> destination
in the simplest way possible. If validation fails then I need to know the exact row that failed (and ideally WHAT failed) so that it can be manually fixed in the source, and the data re-copied.
There are around 50 tables ranging between 10 and 1.7M rows! So speed is important as well.
What would be a sensible way to approach this? Create DTO's, validation attributes and automap? Two EF entities and map across row by row and validate each? SPROC and manual insert?
Do it in T-SQL with a linked server.
I.e.:
--begin a transaction to wrap validation and load
BEGIN TRAN
--Validate that no tickets are set to closed without a completion date
SELECT *
FROM bigTableOnLocalServer with (TABLOCKX) -- prevent new rows
WHERE ticketState = '1' /* ticket closed */ and CompletionDate = 'open'
--if validation fails, quit the transaction to release the lock
COMMIT TRAN
--if no rows in result set 1, execute the load
INSERT INTO RemoteServerName.RemoteServerDBName.RemoteSchema.RemoteTable (field1Int, Field2Money, field3text)
SELECT CAST(Field1 as int),
CASE Field2Money WHEN 'none' then null else CAST(Field2Money as money) END,
Field3Text
FROM bigTableOnLocalServer
WHERE recordID between 1 and 1000000
-- after complete, commit the transaction to release the lock
COMMIT TRAN
If you cannot communicate directly between the servers, still do the validation in SQL, but use a C# client to write the data to disk and hit the Bulk insert function on the destination server. Since the C# component would do nothing more than transport the data, I would just go straight to a format usable by BULK INSERT, à la CSV.
I have a legacy data table in SQL Server 2005 that has a PK with no identity/autoincrement and no power to implement one.
As a result, I am forced to create new records in ASP.NET manually via the ole "SELECT MAX(id) + 1 FROM table"-before-insert technique.
Obviously this creates a race condition on the ID in the event of simultaneous inserts.
What's the best way to gracefully resolve the event of a race collision? I'm looking for VB.NET or C# code ideas along the lines of detecting a collision and then re-attempting the failed insert by getting yet another max(id) + 1. Can this be done?
Thoughts? Comments? Wisdom?
Thank you!
NOTE: What if I cannot change the database in any way?
Create an auxiliary table with an identity column. In a transaction insert into the aux table, retrieve the value and use it to insert in your legacy table. At this point you can even delete the row inserted in the aux table, the point is just to use it as a source of incremented values.
Not being able to change database schema is harsh.
If you insert existing PK into table you will get SqlException with a message indicating PK constraint violation. Catch this exception and retry insert a few times until you succeed. If you find that collision rate is too high, you may try max(id) + <small-random-int> instead of max(id) + 1. Note that with this approach your ids will have gaps and the id space will be exhausted sooner.
Another possible approach is to emulate autoincrementing id outside of database. For instance, create a static integer, Interlocked.Increment it every time you need next id and use returned value. The tricky part is to initialize this static counter to good value. I would do it with Interlocked.CompareExchange:
class Autoincrement {
static int id = -1;
public static int NextId() {
if (id == -1) {
// not initialized - initialize
int lastId = <select max(id) from db>
Interlocked.CompareExchange(id, -1, lastId);
}
// get next id atomically
return Interlocked.Increment(id);
}
}
Obviously the latter works only if all inserted ids are obtained via Autoincrement.NextId of single process.
The key is to do it in one statement or one transaction.
Can you do this?
INSERT (PKcol, col2, col3, ...)
SELECT (SELECT MAX(id) + 1 FROM table WITH (HOLDLOCK, UPDLOCK)), #val2, #val3, ...
Without testing, this will probably work too:
INSERT (PKcol, col2, col3, ...)
VALUES ((SELECT MAX(id) + 1 FROM table WITH (HOLDLOCK, UPDLOCK)), #val2, #val3, ...)
If you can't, another way is to do it in a trigger.
The trigger is part of the INSERT transaction
Use HOLDLOCK, UPDLOCK for the MAX. This holds the row lock until commit
The row being updated is locked for the duration
A second insert will wait until the first completes.
The downside is that you are changing the primary key.
An auxiliary table needs to be part of a transaction.
Or change the schema as suggested...
Note: All you need is a source of ever-increasing integers. It doesn't have to come from the same database, or even from a database at all.
Personally, I would use SQL Express because it is free and easy.
If you have a single web server:
Create a SQL Express database on the web server with a single table [ids] with a single autoincrementing field [new_id]. Insert a record into this [ids] table, get the [new_id], and pass that onto your database layer as the PK of the table in question.
If you have multiple web servers:
It's a pain to setup, but you can use the same trick by setting appropriate seed/increment (i.e. increment = 3, and seed = 1/2/3 for three web servers).
What about running the whole batch (select for id and insert) in serializable transaction?
That should get you around needing to make changes in the database.
Is the main concern concurrent access? I mean, will multiple instances of your app (or, God forbid, other apps outside your control) be performing inserts concurrently?
If not, you can probably manage the inserts through a central, synchronized module in your app, and avoid race conditions entirely.
If so, well... like Joel said, change the database. I know you can't, but the problem is as old as the hills, and it's been solved well -- at the database level. If you want to fix it yourself, you're just going to have to loop (insert, check for collisions, delete) over and over and over again. The fundamental problem is that you can't perform a transaction (I don't mean that in the SQL "TRANSACTION" sense, but in the larger data-theory sense) if you don't have support from the database.
The only further thought I have is that if you at least have control over who has access to the database (e.g., only "authorized" apps, either written or approved by you), you could implement a side-band mutex of sorts, where a "talking stick" is shared by all the apps and ownership of the mutex is required to do an insert. That would be its own hairy ball of wax, though, as you'd have to figure out policy for dead clients, where it's hosted, configuration issues, etc. And of course a "rogue" client could do inserts without the talking stick and hose the whole setup.
The best solution is to change the database. You may not be able to change the column to be an identity column, but you should be able to make sure there's a unique constraint on the column and add a new identity column seeded with your existing PK's. Then either use the new column instead or use a trigger to make the old column mirror the new, or both.