I've got a source database (Sybase) that is effectively read-only: the only way to write to it is via an import file. The other side is my own database (MSSQL), which has no limitations.
The main problem is that there are no timestamps in the source database, and I don't have any access to change it. So is there an engine/solution to get this sync done?
A diff algorithm might work, but it wouldn't be fast, in the sense that you would have to scan the whole source database for each synchronization.
Basically, you would do a full data extract in an agreed-upon, stable manner (i.e. two such extracts with no changes in between would produce identical output).
Then you compare that to the previous extract you did, and from that you can find all the changes. Something slightly more intelligent than a pure text diff would be needed, to help determine that rows weren't just deleted+inserted, but in fact updated.
Unfortunately, if there is no way to ask the source database what the latest changes are - through, as you've pointed out, lack of timestamps or similar mechanisms - then I don't see how you can get any better than a full extract each time.
Now, I don't know Sybase that much, but in MS SQL Server you could potentially create another database that mirrors the first, and in this second database you could make whatever changes you need.
However, if you can make such a database in Sybase, and use SQL to access both at the same time, you might be able to run queries that produce the differences.
For instance, something along the lines of:
SELECT S.*
FROM sourcedb..sourcetable1 AS S
FULL JOIN clonedb..sourcetable1 AS C
ON S.pkvalue = C.pkvalue
WHERE S.pkvalue IS NULL OR C.pkvalue IS NULL
This would produce rows that are inserted or deleted.
To find those that changed, you would need this WHERE-clause:
WHERE S.column1 <> C.column1
OR S.column2 <> C.column2
OR ....
Since the tables are joined on the key, this WHERE clause keeps only the rows where the previous extract and the current state differ.
Now, this might not run fast either; you would have to test to make sure.
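For example, a combined (untested) sketch that classifies all three kinds of change in one pass - note that nullable columns would need extra handling in the <> comparisons:

SELECT COALESCE(S.pkvalue, C.pkvalue) AS pkvalue,
       CASE
           WHEN C.pkvalue IS NULL THEN 'inserted'  -- in the source now, not in the previous extract
           WHEN S.pkvalue IS NULL THEN 'deleted'   -- in the previous extract, gone from the source
           ELSE 'updated'                          -- key on both sides, but values differ
       END AS change_type
FROM sourcedb..sourcetable1 AS S
FULL JOIN clonedb..sourcetable1 AS C
    ON S.pkvalue = C.pkvalue
WHERE C.pkvalue IS NULL
   OR S.pkvalue IS NULL
   OR S.column1 <> C.column1
   OR S.column2 <> C.column2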
I have a fully working production site based on Entity Framework, and now I need to import a large amount of data into the database weekly.
The data comes in the form of text files which I go through line by line, check against the database to see if the record exists, and either update anything that has changed or insert it if it's new.
The problem I'm having is that it takes around 32 hours to run the full import process, and some of the files have to be manually split into smaller chunks to avoid memory issues seemingly caused by Entity Framework. I have managed to slow down the memory growth, but the last time I ran a file without splitting it, it ran for about 12 hours before running out of memory at somewhere over 1.5 GB.
So can someone suggest the best way of importing this data? I have heard of SqlBulkCopy but wasn't sure if it was the correct thing to use. Can anyone provide any examples, or suggest anything more appropriate? For instance, should I create a duplicate of the entity using standard .NET SQL commands and possibly use a stored procedure?
Although SqlBulkCopy is handy from managed code, I reckon the fastest way is to do it in "pure" SQL - given that SqlBulkCopy doesn't easily do upserts, you would need to execute the MERGE part below anyway.
Assume that your text file is in CSV format, that it exists on the SQL Server as "C:\Data\TheFile.txt", and that line endings are normalised as CR-LF (\r\n).
And let's assume that the data is ID,Value1,Value2.
This SQL command will insert into a staging table TheFile_Staging, which has ID,Value1,Value2 columns with compatible data types, and then update the "real" table TheFile_Table (note: code below not tested!):
truncate table TheFile_Staging;

BULK INSERT TheFile_Staging FROM 'C:\Data\TheFile.txt'
WITH (FIELDTERMINATOR=',', ROWTERMINATOR='\r\n', FIRSTROW=2);
-- FIRSTROW=2 means skip row #1 - use this when the first row is a header.

MERGE TheFile_Table AS target
USING (SELECT ID, Value1, Value2 FROM TheFile_Staging) AS source
ON target.ID = source.ID
WHEN MATCHED THEN
    UPDATE SET target.Value1 = source.Value1, target.Value2 = source.Value2
WHEN NOT MATCHED THEN
    INSERT (ID, Value1, Value2) VALUES (source.ID, source.Value1, source.Value2);
You can create a stored procedure and set it to run, or invoke it from code, etc. The only problem with this approach is that error handling for BULK INSERT is a bit of a mess - but as long as your incoming data is OK, it's quite fast.
Normally I'd add some kind of validation check in the WHERE clause of the USING() select of the MERGE, to only take the rows that are valid in terms of data.
It's probably also worth pointing out that the definition of the staging table should omit any non-null, primary key and identity constraints, so that the data can be read in without error, especially if there are empty fields here and there in your source data. I also normally prefer to pull in date/time data as a plain nvarchar - this way you avoid incorrectly formatted dates causing import errors, and your MERGE statement can perform a CAST or CONVERT as needed, while ignoring and/or logging to an error table any invalid data it comes across.
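As a rough, untested sketch of that idea (assuming, purely for illustration, that Value2 holds a date and was staged as nvarchar), the USING part could filter and convert like this:

MERGE TheFile_Table AS target
USING (
    SELECT ID,
           Value1,
           CONVERT(datetime, Value2) AS Value2  -- Value2 staged as nvarchar, converted here
    FROM TheFile_Staging
    WHERE ID IS NOT NULL
      AND ISDATE(Value2) = 1                    -- skip rows whose date won't parse
) AS source
ON target.ID = source.ID
WHEN MATCHED THEN
    UPDATE SET target.Value1 = source.Value1, target.Value2 = source.Value2
WHEN NOT MATCHED THEN
    INSERT (ID, Value1, Value2) VALUES (source.ID, source.Value1, source.Value2);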
Sadly, you need to move away from Entity Framework in this kind of scenario; out of the box, EF only does row-by-row inserts. There are workarounds, or you can disregard EF entirely and manually code the class that will do the bulk inserts using ADO.NET (SqlBulkCopy).
Edit: you can also stick with the current approach if the performance is acceptable, but you will need to recreate the context periodically rather than use the same context for all records. I suspect that's the reason for the outrageous memory consumption.
Should the database engine do all the work, or should checking for uniqueness be the responsibility of the client application?
I’m developing an application in C# to scan drives and store file information in a SQL Server CE database and I would like to know which way of ensuring unique entries is "best". So far I tried the following three approaches and haven’t seen any difference in performance:
Maintaining a collection object
Checking for existence in the database
Relying on a unique index in the database
Pseudo code of my three approaches. The actual code breaks the file up into its parts and uses several tables to store path, extension, volume/server, and other information, plus indexing records to look up data.
collectionObj //initialize with existing records from database
While (filesToAdd.Count > 0 )
{
file = filesToAdd.Dequeue();
If(!collectionObj.Contains( file.Name ))
{
Insert file.Name into database
collectionObj.Add(file.Name)
}
}
With method 1 I thought it would be faster to search an object in memory, but since a SQL Server CE database is also in memory I’m not so sure of the benefit.
While (filesToAdd.Count > 0 )
{
file = filesToAdd.Dequeue();
if( ( select count(*) from database where filename = file.Name) == 0 )
{
Insert file.Name into database
}
}
Method 2 doesn’t use any extra objects/memory but queries the database a lot looking for duplicates. With SQL Server CE network traffic isn’t a problem but excessive querying has to have an effect on performance.
While (filesToAdd.Count > 0 )
{
file = filesToAdd.Dequeue();
try
{
Insert file.Name into database
}catch(Duplicate index violation exception)
{
//do nothing
}
}
I'm leaning towards method 3, mainly because it simplifies the code, but it seems too lazy to be a best practice. Also, on duplicate insertions the database throws an error and so does the program, which seems like it would hurt performance.
Given the information provided, which is the "best" way of adding a lot of information to a database when you know there will be many duplicates? Does the answer change if the data is primarily unique or mainly duplicates? If you have an even better approach than what I have thought of, I would be happy to hear it. My question is specifically about SQL Server CE, which doesn't have the full power of SQL Server, so please keep that in mind when offering suggestions.
The answer is... do it in the database.
The uniqueness requirement is a requirement of the data. The database should be used to enforce these requirements.
Remember that ensuring unique entries requires doing tests for both inserts and updates. And you want the uniqueness as part of your data integrity, so you want the check to happen regardless of how the update or insert is being done (through your application, manually, via a trigger, or whatever). The only way to guarantee that it is always done is to do the check in the database.
This argument transcends performance. That said, the database should be quite efficient with such a check, assuming the unique index fits into memory. There might be some situations where performance is so paramount that constraints would be checked in the application, but these are few and far between - and I might question why a database is being used as the data store for such an application.
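For the file-name case in the question, a minimal sketch of enforcing this in the database (table and column names are just assumptions) is a unique constraint:

-- Duplicate file names are now rejected by the engine itself,
-- no matter which application or code path performs the insert.
CREATE TABLE Files (
    FileId   INT IDENTITY(1,1) PRIMARY KEY,
    FileName NVARCHAR(260) NOT NULL,
    CONSTRAINT UQ_Files_FileName UNIQUE (FileName)
);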
The correct answer is, as usual: it depends. The "lazy" solution of having the database do it is ultimately the correct answer. However, if you can filter out duplicates on the client, and the time and effort to do so has enough benefit to keep the database from having to perform all of the filtering, then filtering on the client makes sense. You would still enforce uniqueness in the database, but you might be able to offload some of its processing by filtering out some or most duplicates on the client. I would probably only go this route if I knew from actual application experience that it would be worth the effort.
Why would a good method be a bad method just because it's lazy?
If you are going to use a database to store data and you want to be sure there are no duplicate entries, then of course you should enforce a UNIQUE constraint on your rows. Not only does it help you maintain duplicate-free data storage, it also provides a good way of identifying each row.
If there is a duplicate entry, the database engine will notice this while inserting to the database and throw an error/exception that you easily can catch.
Obviously, you want your database to handle the unique constraints, but it sounds like you want to avoid the exceptions that get thrown when attempting to insert a duplicate record. Normally, I would suggest using IF NOT EXISTS in your SQL INSERT statement, but you can't do that with SQL Server Compact.
Another trick might be to try an UPDATE first and if no rows are affected, you know the record doesn't exist and you can safely INSERT it. It's a little bit of extra work but if you're expecting a lot of duplicates, it might still be more efficient than catching all those exceptions.
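A minimal sketch of that pattern (table and column names are assumptions; with SQL Server Compact you would execute the two statements separately from the application and check the rows-affected count returned by the UPDATE):

-- Try the update first.
UPDATE Files SET LastSeen = GETDATE() WHERE FileName = @fileName;

-- Only if the UPDATE affected 0 rows, insert the new record.
INSERT INTO Files (FileName, LastSeen) VALUES (@fileName, GETDATE());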
It might also be wise to try to filter out known duplicates before even trying to put them into the database. Perhaps consider using a HashSet to keep track of the unique IDs that you've already inserted during that session. If a value is in your HashSet, you know you can just skip it and save yourself a call to the database.
EDIT: Solution (kind of)
So, what I did had very little in common with what I originally wanted to do, but my application now works much faster (DataSets that took upward of 15 minutes to process now go through in 30-40 seconds tops). Here's roughly what I did:
- Read spreadsheet & populate DataTable/DataSet normally
- [HACK WARNING] Instead of using UpdateDataSet, I generate my own SQL queries, mostly by having a skeleton string for each type of update (e.g. String skeleton = "UPDATE ... SET ... WHERE ..."). I then consult the template database and replace the placeholder ... with the appropriate entries.
- [MORE HACK WARNING] The way I dealt with errors was by manually checking whether those errors would occur. So if I know I am about to do an insert, I run an error-checking command before the actual insert; what the error checker does is construct a JOIN statement, checking whether any of the entries in the user's DataSet already exist in the database. Just by executing the JOIN command, I get back a DataSet with the results, so I know that anything in it represents an error. Then I can proceed to print them.
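A rough sketch of such an error-checking query (untested; it assumes the uploaded rows have been staged into a temporary table #Upload keyed the same way as the target table):

-- Uploaded rows whose key already exists in the target would violate the
-- primary key on insert, so return them as the error set.
SELECT u.*
FROM #Upload AS u
INNER JOIN TargetTable AS t
    ON t.KeyColumn = u.KeyColumn;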
If anyone needs more details, I'll be happy to provide them. It's a fairly specific question, so I should probably keep this outline fairly high level.
Original Question
For (good) reasons outside of my control, I need to use the Database.UpdateDataSet() method from Microsoft's Enterprise Library. The way my project works, I let the user make changes to the database (multiple databases, multiple schemas, multiple tables, but always only one at a time) by uploading Excel spreadsheets to a web application. The spreadsheets follow a design/template specified by me (usually). I am at the stage where I read the spreadsheet, turn it into a DataTable/DataSet, and use (dynamically generated) prepared statements to make the appropriate changes to the database. Here's the problem:
Each spreadsheet only allows for one type of change (insert/update/delete). I want to make it so if the user uploads an insert spreadsheet, but several (let's say 10) of the entries are already in the database, I not only return with an error, but also tell them which entries (DataRows) violated the primary key constraint.
The ideal solution would be to get a DataSet with the list of errors back, but I don't see how I can do that. Perhaps there is a way to construct the prepared statements in such a way that if a DataRow is to be inserted (following the example above), it proceeds normally; however, if it would update or delete instead, the row is skipped and added to an error collection of some sort?
Note that I am trying to avoid using stored procedures. Since the number of different templates will grow extremely quickly after deployment, it is important that I stay away from manually written code and as close to a database-driven model as possible.
I have a lot of data which needs to be paired based on a few simple criteria. There is a time window (both records have a DateTime column): if one record is very close in time (within 5 seconds) to another, it is a potential match, and the record which is closest in time is considered a complete match. There are other fields which help narrow this down as well.
I wrote a stored procedure which does this matching on the server before returning the full, matched dataset to a C# application. My question is: would it be better to pull in the 1 million (x2) rows and deal with them in C#, or is SQL Server better suited to perform this matching? If SQL Server is, then what is the fastest way of pairing data using datetime fields?
Right now I select all records from Table 1/Table 2 into temporary tables, iterate through each record in Table 1, look for a match in Table 2 and store the match (if one exists) in a temporary table, then I delete both records in their own temporary tables.
I had to rush this piece for a game I'm writing, so excuse the bad (very bad) procedure... It works, it's just horribly inefficient! The whole SP is available on pastebin: http://pastebin.com/qaieDsW7
I know the SP is written poorly, so saying "hey, dumbass... write it better" doesn't help! I'm looking for help in improving it, or help/advice on how I should do the whole thing differently! I have about 3/5 days to rewrite it, I can push that deadline back a bit, but I'd rather not if you guys can help me in time! :)
Thanks!
Ultimately, compiling your data on the database side is preferable 99% of the time, as it's designed for data crunching (through the use of indexes, relations, etc.). A lot of your code can be consolidated by the use of joins to compile the data into exactly the format you need. In fact, you can bypass almost all your temp tables entirely and just fill a master Event temp table.
The general pattern is this:
INSERT INTO #Events
SELECT <all interested columns>
FROM FireEvent
LEFT OUTER JOIN HitEvent ON <all join conditions for HitEvent>
This way you match each fire event to zero or more hit events. As discussed in chat, you can even limit it to zero or one hit event by wrapping the query in a subquery, using a window function such as ROW_NUMBER() OVER (PARTITION BY HitEvent.EventID ORDER BY ...) AS HitRank, and adding WHERE HitRank = 1 to the outer query. This is ultimately what you ended up doing, and you got the results you were expecting (with a bit of work and learning in the process).
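A rough sketch of that pattern (untested; the column names and the 5-second window are assumptions from the question, and here each fire event keeps only its closest hit - swap the partition key if it is each hit that should keep one fire):

INSERT INTO #Events
SELECT FireEventID, HitEventID, FireTime, HitTime
FROM (
    SELECT F.EventID   AS FireEventID,
           H.EventID   AS HitEventID,
           F.EventTime AS FireTime,
           H.EventTime AS HitTime,
           ROW_NUMBER() OVER (
               PARTITION BY F.EventID
               ORDER BY ABS(DATEDIFF(millisecond, F.EventTime, H.EventTime))
           ) AS HitRank
    FROM FireEvent AS F
    LEFT OUTER JOIN HitEvent AS H
        ON H.EventTime BETWEEN DATEADD(second, -5, F.EventTime)
                           AND DATEADD(second,  5, F.EventTime)
        -- plus the other narrowing conditions
) AS ranked
WHERE HitRank = 1;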
If the data is already in the database, that is where you should do the work. You absolutely should learn to display and read query plans using SQL Server Management Studio, and become able to notice and optimize away expensive operations like nested loops.
Your task probably does not require any use of temporary tables. Temporary tables tend to be efficient when they are relatively small and/or heavily reused, which is not your case.
I would advise you to try to optimize the stored procedure if it is not running fast enough, rather than rewrite it in C#. Why would you want to transfer millions of rows out of SQL Server anyway?
Unfortunately I don't have a SQL Server installation, so I can't test your script, but I don't see any CREATE INDEX statements in there. If you didn't just skip them for brevity, then you should surely analyze your queries and see which indexes are needed.
So the answer depends on several factors, like the resources available per client/server (RAM/CPU/concurrent users/concurrent processes, etc.).
Here are some basic rules that will improve your performance regardless of what you use:
Loading a million rows into a C# program is not good practice, unless it is a stand-alone process with plenty of RAM.
Uniqueidentifiers will never outperform integers in comparisons.
Common table expressions are a good alternative for fast matching (a sketch follows below).
Finally, you have to consider output. If there is constant reading and writing that affects the user interface, then you should manage that in memory (C#); otherwise all CRUD operations should be kept inside the database.
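Here is the same time-window pairing idea as the earlier sketch, expressed as a CTE (untested; table and column names are assumptions):

-- Rank candidate matches within the 5-second window and keep the closest one.
WITH Candidates AS (
    SELECT t1.RecordID  AS Table1ID,
           t2.RecordID  AS Table2ID,
           ROW_NUMBER() OVER (
               PARTITION BY t1.RecordID
               ORDER BY ABS(DATEDIFF(millisecond, t1.EventTime, t2.EventTime))
           ) AS MatchRank
    FROM Table1 AS t1
    JOIN Table2 AS t2
      ON t2.EventTime BETWEEN DATEADD(second, -5, t1.EventTime)
                          AND DATEADD(second,  5, t1.EventTime)
      -- plus the other narrowing criteria from the question
)
SELECT Table1ID, Table2ID
FROM Candidates
WHERE MatchRank = 1;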
I am trying to write an SSIS package which would migrate queried data from a MySQL server to SQL Server. I need to modify a particular column, say "stream" (DT_I4), replacing its values (1 would become 2, 2 would become 4, etc. - just a handful of fixed integer replacements), and then check whether another column value (emp_id) already exists in SQL Server before inserting. If it exists, do not insert; if it does not, write the row.
I am an SSIS newbie; so far I have been able to add both an ADO.NET source and an ADO.NET destination. I need help with the following:
1) Should I use a derived column or a script component to convert the values?
2) How do I check if emp_id exists in SQL Server?
3) How do I map the errors?
What is the best practice for implementing the above? Thanks for reading and for your help.
Generally speaking, it is better to use the stock components to accomplish a task than to write a custom script. Performance and maintenance are two big reasons for that advice. Also, don't try to do too many things in a single transformation. The pipeline can really take advantage of parallelization if you let it.
1) Specifically speaking, perhaps I didn't understand where the conversion needs to happen in your problem description but I would start with neither a Derived Column Transformation nor a Script Component. Instead, for a straight type conversion I'd use Data Conversion Transformation.
Rereading it, perhaps you are attempting a value conversion. Depending on the complexity, it could be accomplished with a derived column or two; worst case, drop to a script component. But even better: does the data need to come over with the unmapped value at all? Toss a CASE statement into your source query and skip the SSIS complexity of mapping value A to value B.
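For example, a sketch of that source query (the table name and other_columns are placeholders, and only the 1->2 and 2->4 replacements were given in the question):

SELECT emp_id,
       CASE stream
           WHEN 1 THEN 2
           WHEN 2 THEN 4
           -- ...the remaining replacements go here
           ELSE stream    -- pass any unmapped values through unchanged
       END AS stream,
       other_columns
FROM source_table;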
2) The Lookup Transformation will help you in this department. It is important to note that failure to find a value would cause the package to fail in 2005; in 2008+, the option for handling not-found rows is more readily available. There is an output path "Redirect Rows to No Match Output", and this is the path you will want to use, as you only want the rows that don't already exist. As a general guideline on a Lookup, only pull back the columns of interest, as the package will cache that lookup locally. That does not go well on server memory when it's hundreds of millions of rows and 80+ columns wide.
3) What errors? Conversion errors? Lookup errors? Some-other-error-not-defined? In general, you'll probably want to read about Integration Services Paths. Everything in a data flow has an Error path leading out of it. Most everything has 1+ non-error paths leading out. In cases where there are multiple non-error paths available, when you connect them to the next component, BIDS will ask which output you are intending to use.
4) Given the extremely general problem as defined, your package may look something like: ADO.NET source (with the CASE statement in the source query) -> Lookup on emp_id -> "Redirect Rows to No Match Output" -> ADO.NET destination.
Refine your question if that doesn't address the specifics.