EDIT: Solution (kind of)
So, what I did had very little in common with what I originally wanted to do, but my application now works much faster (DataSets that took upward of 15 minutes to process now go through in 30-40 seconds tops). Here's roughly what I did:
- Read spreadsheet & populate DataTable/DataSet normally
- [HACK WARNING] Instead of using UpdateDataSet, I generate my own SQL queries, mostly by having a skeleton string for each type of update (e.g. String skeleton = "UPDATE ... SET ... WHERE ..."). I then consult the template database and replace the placeholder ... with the appropriate entries.
- [MORE HACK WARNING] I deal with errors by checking in advance whether they will occur. If I know I am about to do an insert, I run an error-checking command before the actual insert: the error checker constructs a JOIN statement that checks whether any of the entries in the user's DataSet already exist in the database. Executing the JOIN returns a DataSet with the results, so anything in it is an error, and I can proceed to print those rows.
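The pre-insert check in the last step can be sketched roughly as follows. This is only an illustration, not my actual code: it uses Python's built-in sqlite3 as a stand-in for the real database, and the table names (`customers`, `incoming`), the columns, and the helper `find_duplicate_keys` are all made up for the example.

```python
import sqlite3

def find_duplicate_keys(conn, incoming_rows):
    """Return the uploaded rows whose id already exists in the target table."""
    cur = conn.cursor()
    # Stage the uploaded rows in a temp table so the check is a single JOIN.
    cur.execute("CREATE TEMP TABLE incoming (id INTEGER, name TEXT)")
    cur.executemany("INSERT INTO incoming (id, name) VALUES (?, ?)", incoming_rows)
    cur.execute(
        "SELECT i.id, i.name FROM incoming AS i "
        "JOIN customers AS c ON c.id = i.id ORDER BY i.id"
    )
    duplicates = cur.fetchall()
    cur.execute("DROP TABLE incoming")
    return duplicates

# Hypothetical target table with two existing rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])

upload = [(2, "Bobby"), (3, "Cid")]
errors = find_duplicate_keys(conn, upload)
if errors:
    # Anything returned by the JOIN would violate the primary key on insert.
    print("Rows already in the database:", errors)
else:
    conn.executemany("INSERT INTO customers VALUES (?, ?)", upload)
```

Whether to reject the whole upload or insert only the clean rows is a policy decision; the sketch rejects the whole upload.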
If anyone needs more details, I'll be happy to provide them. It's a fairly specific question, so I should probably keep this outline fairly high level.
Original Question
For (good) reasons outside of my control, I need to use the Database.UpdateDataSet() method from Microsoft's Enterprise Library. The way my project will work, I am letting the user make changes to the database (multiple databases, multiple schemas, multiple tables, but always only one at a time) by uploading Excel spreadsheets to a web application. The spreadsheets follow a design/template specified by me (usually). I am at a stage where I read the spreadsheet, turn it into a DataTable/DataSet, and use (dynamically generated) prepared statements to make the appropriate changes to the database. Here's the problem:
Each spreadsheet only allows for one type of change (insert/update/delete). I want to make it so if the user uploads an insert spreadsheet, but several (let's say 10) of the entries are already in the database, I not only return with an error, but also tell them which entries (DataRows) violated the primary key constraint.
The ideal solution would be to get a DataSet with the list of errors back, but I don't see how I can do that. Perhaps there is a way to construct the prepared statements in such a way that if a DataRow is to be inserted (following the example from above), it proceeds normally; however, if it attempts to update or delete, it is skipped and added to an error collection of some sort?
Note that I am trying to avoid using stored procedures. Since the number of different templates will grow extremely quickly after deployment, it is important that I stay away from manually written code and as close to a database-driven model as possible.
Related
I don't know whether it is better to use temporary tables in SQL Server or a DataTable in C# for a report. Here is the scope of the report: it will be copied into a workbook with about 10 worksheets, each worksheet containing about 1000 rows and about 30 columns, so it's a lot of data. There is some guidance out there, but I could not find anything specific regarding the amount of data that is too much for a DataTable. According to https://msdn.microsoft.com/en-us/library/system.data.datatable.aspx, the limit is 16M rows, but my data set seems unwieldy considering the number of columns I have. Plus, I will either have to make multiple SQL queries to collect the data in my report or try to write a stored procedure in SQL to collect that data. How do I resolve this quandary?
My rule of thumb is that if it can be processed on the database server, it probably should be. Keep in mind, no matter how efficient your C# code is, SQL Server will most likely do it faster and more efficiently; after all, it was designed for data manipulation.
There is no shame in using #temp tables. They maintain stats and can be indexed and/or manipulated. One recent example: a developer created an admittedly elegant query using a CTE, and its performance was 12-14 seconds vs. mine at 1 second using #temps.
Now, one carefully structured stored procedure could produce and return the 10 data-sets for your worksheets. If you are using a product like SpreadSheetLight (there are many options available), it becomes a small matter of passing the results and creating the tabs (no cell level looping... unless you want or need to).
I would also like to add, you can dramatically reduce the number of touch points and better enforce the business logic by making SQL Server do the heavy lifting. For example, a client introduced a 6W risk rating, which was essentially a 6.5. HUNDREDS of legacy reports had to be updated, while I only had to add the 6W into my mapping table.
There's a lot of missing context here - how is this report going to be accessed and run? Is this going to run as a scripted event every day?
Have you considered SSRS?
In my opinion it's best to abstract away your business logic by creating Views or Stored Procedures in the database. Stored Procedures would probably be the way to go but it really depends on your specific environment. Then you can point whatever tools you want to use at the database object. This has several advantages:
- if you end up having different versions or different formats of the report, and your logic ever changes, you can update the logic in one place rather than many.
- your code is simpler and cleaner, typically:
select v.col1, v.col2, v.col3
from MY_VIEW v
where v.date between @startdate and @enddate
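To make the idea concrete, here is a small sketch of client code that only knows about the view and two date parameters. It uses Python's sqlite3 so it can run anywhere; the `sales` table, the `my_view` definition, and the sample rows are invented for the example, and in SQL Server the parameters would be written `@startdate` and `@enddate`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (dept TEXT, sale_date TEXT, amount REAL);
    INSERT INTO sales VALUES ('A', '2024-01-05', 10.0),
                             ('B', '2024-02-10', 20.0),
                             ('A', '2024-03-15', 30.0);
    -- The reporting logic lives in the view, not in the client code.
    CREATE VIEW my_view AS
        SELECT dept, sale_date, amount FROM sales;
""")

# The client passes only the date range; the view hides everything else.
rows = conn.execute(
    "SELECT v.dept, v.sale_date, v.amount FROM my_view AS v "
    "WHERE v.sale_date BETWEEN :startdate AND :enddate "
    "ORDER BY v.sale_date",
    {"startdate": "2024-01-01", "enddate": "2024-02-28"},
).fetchall()
print(rows)
```

If the report logic changes later, only the view definition needs to change; the query above stays the same.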
I assume your 10 spreadsheets are going to be something like
Summary Page | Department 1 | Department 2 | ...
So you could make a generalized View or SP, create a master spreadsheet linked to the db object that pulls all the relevant data from SQL, and use Pivot Tables or filters or whatever else you want, and use that to generate your copies that get sent out.
But before going to all that trouble, I would make sure that SSRS is not an option, because if you can use that, it has a lot of baked in functionality that would make your life easier (export to Excel, automatic date parameters, scheduled execution, email subscriptions, etc).
I have a fully working production site based on Entity Framework, and now I need to import a large amount of data into the database weekly.
The data comes in the form of text files, which I go through line by line, checking against the database to see whether each record exists. If it does, I update anything that has changed; if not, I just insert it.
The problem I'm having is that it takes around 32 hours to run the full import process, and some of the files have to be manually split into smaller chunks to avoid memory issues seemingly caused by Entity Framework. I have managed to slow down the memory increase, but the last time I ran a file without splitting it, it ran for about 12 hours before running out of memory at somewhere over 1.5 GB.
So can someone suggest the best way of importing this data? I have heard of SqlBulkCopy but wasn't sure if it was the correct thing to use. Can anyone provide any examples, or suggest anything more appropriate? For instance, should I create a duplicate of the entity using standard .NET SQL commands and possibly use a stored procedure?
Although SqlBulkCopy is handy from managed code, I reckon the fastest way to do it is in "pure" SQL. Given that SqlBulkCopy doesn't easily do upserts, you would need to execute the MERGE part below anyway.
Assuming that your text file is in csv format, and it exists on the SQL Server as "C:\Data\TheFile.txt", and that line endings are normalised as CR-LF (\r\n)
And let's assume that the data is ID,Value1,Value2
This SQL command will insert into a staging table TheFile_Staging, which has ID, Value1, Value2 columns with compatible data types, and then update the "real" table TheFile_Table (note: code below not tested!)
TRUNCATE TABLE TheFile_Staging;
-- FIRSTROW = 2 means skip row 1; use this when the first row is a header.
BULK INSERT TheFile_Staging FROM 'C:\Data\TheFile.txt'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\r\n', FIRSTROW = 2);
MERGE TheFile_Table AS target
USING (SELECT ID, Value1, Value2 FROM TheFile_Staging) AS source
ON target.ID = source.ID
WHEN MATCHED THEN
UPDATE SET target.Value1 = source.Value1, target.Value2 = source.Value2
WHEN NOT MATCHED THEN
INSERT (ID, Value1, Value2) VALUES (source.ID, source.Value1, source.Value2);
You can create a stored procedure and schedule it, invoke it from code, etc. The only problem with this approach is that error handling for BULK INSERT is a bit of a mess, but as long as the incoming data is OK, it is quite fast.
Normally I'd add some kind of validation check in the WHERE clause of the USING() select of the MERGE, so that it only takes the rows that are valid in terms of data.
It's probably also worth pointing out that the definition of the staging table should omit any non-null, primary key and identity constraints, in order that the data can be read in without error esp. if there are empty fields here and there in your source data; and I also normally prefer to pull in date/time data as a plain nvarchar - this way you avoid incorrectly formatted dates causing import errors and your MERGE statement can perform a CAST or CONVERT as needed whilst at the same time ignoring and/or logging to an error table any invalid data it comes across.
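To illustrate the staging-table advice, here is a rough sketch of "loose staging table, validate, then upsert". It is not T-SQL: it runs on Python's sqlite3, replaces MERGE with a separate UPDATE and INSERT, and uses SQLite's `date()` function as the validity check where SQL Server code would use CAST/CONVERT. The table and column names (`staging`, `target`, `import_errors`, `created`) are invented, and it assumes each ID appears at most once in the staging table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Staging table: no keys, no NOT NULL, dates kept as plain text.
    CREATE TABLE staging (id INTEGER, value1 TEXT, created TEXT);
    CREATE TABLE target (id INTEGER PRIMARY KEY, value1 TEXT,
                         created TEXT NOT NULL);
    CREATE TABLE import_errors (id INTEGER, value1 TEXT, created TEXT);

    INSERT INTO target VALUES (1, 'old', '2024-01-01');
    INSERT INTO staging VALUES (1, 'new', '2024-02-01'),
                               (2, 'fresh', '2024-02-02'),
                               (3, 'bad', 'not a date'),
                               (NULL, 'no id', '2024-02-03');

    -- Log rows that fail validation instead of aborting the import.
    INSERT INTO import_errors
        SELECT id, value1, created FROM staging
        WHERE id IS NULL OR date(created) IS NULL;

    -- Update rows that already exist, but only from valid staging rows.
    UPDATE target SET
        value1  = (SELECT s.value1 FROM staging AS s WHERE s.id = target.id),
        created = (SELECT date(s.created) FROM staging AS s
                   WHERE s.id = target.id)
    WHERE id IN (SELECT id FROM staging
                 WHERE id IS NOT NULL AND date(created) IS NOT NULL);

    -- Insert valid rows that are not in the target yet.
    INSERT INTO target (id, value1, created)
        SELECT id, value1, date(created) FROM staging
        WHERE id IS NOT NULL AND date(created) IS NOT NULL
          AND id NOT IN (SELECT id FROM target);
""")

target_rows = conn.execute(
    "SELECT id, value1, created FROM target ORDER BY id").fetchall()
error_rows = conn.execute(
    "SELECT id, value1, created FROM import_errors").fetchall()
print(target_rows)
print(error_rows)
```

The invalid rows never reach the real table, and they remain available in the error table for reporting.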
Sadly you need to move away from Entity Framework in this kind of scenario; out of the box EF only does line-by-line inserts. You can do interesting things like this, or you can completely disregard EF and manually code the class that will do the bulk inserts using ADO.Net (SqlBulkCopy).
Edit: you can also keep with the current approach if the performance is acceptable, but you will need to recreate the context periodically, not use the same context for all records. I suspect that's the reason for the outrageous memory consumption.
I have an ERP database "A" on which I have only read permission, so I can't create triggers on its tables.
A belongs to an ERP system (a program unknown to me). I have another database "B" that is private to my application, and the application works with both databases. I want to reflect A's changes (any insert/update/delete) instantly in B.
Is there any functionality in C# that works exactly like a trigger does in a database?
You have a few options; the best one depends on which kind of database you have to support.
Generic solution, changes in A database aren't allowed
If you can't change master database and this must work with every kind of database then you have only one option: polling.
You shouldn't check too often (so forget about doing it more or less instantly) in order to save network traffic, and it's better to handle insert, update, and delete in different ways. What you can do depends on how the database is structured, for example:
Insert: to catch an insert you may simply check for highest row ID (assuming what you need to monitor has an integer column used as key).
Update: for updates you may check a timestamp column (if it's present).
Delete: this may be trickier to detect. A first check would be to count the rows: if the count changed and no insert occurred, you have detected a delete; otherwise, subtract the number of inserts first.
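A minimal sketch of the insert and delete checks above, assuming the monitored table has an integer key that only grows and is never reused. It uses Python's sqlite3 purely so that it is runnable; the `orders` table and both helper functions are hypothetical. Updates are left out because they need a timestamp column, which the table may not have.

```python
import sqlite3

def take_snapshot(conn):
    """Read the two cheap indicators: highest ID and row count."""
    max_id, count = conn.execute(
        "SELECT COALESCE(MAX(id), 0), COUNT(*) FROM orders"
    ).fetchone()
    return max_id, count

def detect_changes(conn, previous):
    """Compare the current state with the previous snapshot."""
    prev_max, prev_count = previous
    max_id, count = take_snapshot(conn)
    # Any row with an ID above the old maximum is a new insert.
    inserted = conn.execute(
        "SELECT COUNT(*) FROM orders WHERE id > ?", (prev_max,)
    ).fetchone()[0]
    # Rows missing once the inserts are accounted for were deleted.
    deleted = prev_count + inserted - count
    return {"inserted": inserted, "deleted": deleted}, (max_id, count)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "a"), (2, "b"), (3, "c")])
snapshot = take_snapshot(conn)

# Simulate activity in database A between two polls.
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(4, "d"), (5, "e")])
conn.execute("DELETE FROM orders WHERE id = 2")

changes, snapshot = detect_changes(conn, snapshot)
print(changes)
```

This only tells you how many rows were deleted, not which ones; finding the missing IDs requires comparing the key lists in A and B.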
Generic solution, changes in A database are allowed
If you can change the original database, you can decrease network traffic (and complexity) by using triggers on the database side: when a trigger fires, it just puts a record in an internal log table (only a few columns: one for the change type, one for the affected table, one for the affected record).
You will need to poll only on this table (using a simple query to check if number of rows increased). Because action (insert/update/delete) is stored in the table you just need to switch on that column to execute proper action.
This has a big disadvantage (from my point of view): it puts logic related to your application inside the master database. Whether that is a serious problem depends on many factors.
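The log-table approach could look roughly like the sketch below. It runs on Python's sqlite3 for convenience, so the trigger syntax is SQLite's and not that of your actual DBMS; the `items` table, the `change_log` layout, and the `poll` helper are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT);
    -- Log table: change type, affected table, affected record.
    CREATE TABLE change_log (
        seq INTEGER PRIMARY KEY AUTOINCREMENT,
        action TEXT, table_name TEXT, record_id INTEGER
    );
    CREATE TRIGGER items_ins AFTER INSERT ON items BEGIN
        INSERT INTO change_log (action, table_name, record_id)
        VALUES ('insert', 'items', NEW.id);
    END;
    CREATE TRIGGER items_upd AFTER UPDATE ON items BEGIN
        INSERT INTO change_log (action, table_name, record_id)
        VALUES ('update', 'items', NEW.id);
    END;
    CREATE TRIGGER items_del AFTER DELETE ON items BEGIN
        INSERT INTO change_log (action, table_name, record_id)
        VALUES ('delete', 'items', OLD.id);
    END;
""")

def poll(conn, last_seq):
    """Return log entries newer than last_seq and the new high-water mark."""
    rows = conn.execute(
        "SELECT seq, action, table_name, record_id FROM change_log "
        "WHERE seq > ? ORDER BY seq", (last_seq,)
    ).fetchall()
    new_last = rows[-1][0] if rows else last_seq
    return rows, new_last

conn.execute("INSERT INTO items VALUES (1, 'a')")
conn.execute("UPDATE items SET name = 'b' WHERE id = 1")
conn.execute("DELETE FROM items WHERE id = 1")

changes, last_seq = poll(conn, 0)
actions = [action for _, action, _, _ in changes]
for _, action, table_name, record_id in changes:
    # Switch on the action column to decide what to replay in database B.
    print(action, table_name, record_id)
```

Remembering the last processed `seq` is slightly more robust than comparing row counts, because it still works if old log rows are purged.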
SQL Server/Vendor specific
If your application is tied to Microsoft SQL Server, you can use the SqlDependency class to track changes. It works for SQL Server only, but I think there may be implementations for other databases. The disadvantage is that this will always be specific to one vendor (so if database A changes host... you'll have to change your code too).
From MSDN:
SqlDependency was designed to be used in ASP.NET or middle-tier services where there is a relatively small number of servers having dependencies active against the database. It was not designed for use in client applications, where hundreds or thousands of client computers would have SqlDependency objects set up for a single database server.
Anyway if you're using SQL Server you have other options, just follow links in MSDN documentation.
Addendum: if you need finer control, you may check the TraceServer and Object:Altered (and friends) classes. This is even more tied to Microsoft SQL Server, but it should be usable in a wider context (and you can keep your applications unaware of these things).
You may find useful, depending on your DBMS:
Change Data Capture (MS SQL)
http://msdn.microsoft.com/en-us/library/bb522489%28v=SQL.100%29.aspx
Database Change Notification (Oracle)
http://docs.oracle.com/cd/B19306_01/appdev.102/b14251/adfns_dcn.htm
http://www.oracle.com/technetwork/issue-archive/2006/06-mar/o26odpnet-093584.html
Unfortunately, there's no SQL92 solution on data change notification
Yes, there is an excellent post on this here; please check it out:
http://devzone.advantagedatabase.com/dz/webhelp/advantage9.1/mergedprojects/devguide/part1point5/creating_triggers_in_c_with_visual_studio_net.htm
If this post solves your question, please mark it as answered.
Thanks
I am trying to write an SSIS package that migrates queried data from a MySQL server to SQL Server. I need to modify the values of a particular column, say "stream" (DT_I4): 1 would become 2, 2 would become 4, etc. (just four arbitrary integer replacements). Then I need to check whether another column's value (emp_id) already exists in SQL Server before inserting. If it exists, do not insert; if it does not, write these values.
I am a SSIS newbie, so far I have been able to add both ADO.NET source and ADO.NET destination. I need help with the following
1) Should I use a derived column or a script component to convert the values?
2) How do I check if emp_id exists in SQL Server?
3) How do I map the errors?
What is the best practice to implement the above situation, thanks for reading and for your help.
Generally speaking, it is better to use the stock components to accomplish a task than to write a custom script. Performance and maintenance are two big reasons for that advice. Also, don't try to do too many things in a single transformation. The pipeline can really take advantage of parallelization if you let it.
1) Specifically speaking, perhaps I didn't understand where the conversion needs to happen in your problem description but I would start with neither a Derived Column Transformation nor a Script Component. Instead, for a straight type conversion I'd use Data Conversion Transformation.
Rereading it, perhaps you are attempting a value conversion. Depending on the complexity, it could be accomplished with a derived column or two, or in the worst case by dropping to a script task. Even better: does the data need to come over with the unmapped value at all? Toss a CASE statement into your source query and skip the SSIS complexity of mapping value A to value B.
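As an illustration of the CASE idea, the sketch below maps the values in the source query itself. It runs against an in-memory SQLite table via Python only to be self-contained; the `employees` table is invented, only the 1 to 2 and 2 to 4 mappings come from the question, and passing other values through unchanged is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (emp_id INTEGER, stream INTEGER);
    INSERT INTO employees VALUES (10, 1), (11, 2), (12, 9);
""")

# The mapping happens in the source query, so the rows arrive in the
# data flow already converted and no transformation is needed.
rows = conn.execute("""
    SELECT emp_id,
           CASE stream
               WHEN 1 THEN 2
               WHEN 2 THEN 4
               ELSE stream  -- assumption: unmapped values pass through
           END AS stream
    FROM employees
    ORDER BY emp_id
""").fetchall()
print(rows)
```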
2) The Lookup Transformation will help you in this department. It is important to note that in 2005 a failure to find a value would result in the package failing; in 2008+ the option for handling not-found rows is more readily available. There is an output path, "Redirect Rows to No Match Output", and this is the path you will want to use, as you only want the rows that don't already exist. As a general guideline on a Lookup, only pull back the columns of interest, as the package will cache that lookup locally. That does not go well for server memory when the lookup is hundreds of millions of rows and 80+ columns wide.
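Conceptually, the Lookup with a no-match output behaves like the sketch below. This is plain Python and not SSIS code; the function name and the sample rows are hypothetical. It also shows why caching only the key column keeps the memory footprint small.

```python
def split_by_lookup(source_rows, existing_emp_ids):
    """Split rows into (match, no_match) based on a cached key set."""
    # Cache only the lookup key, not the full width of the reference table.
    cache = set(existing_emp_ids)
    match, no_match = [], []
    for row in source_rows:
        if row["emp_id"] in cache:
            match.append(row)       # already in the destination: skip
        else:
            no_match.append(row)    # "no match" output: rows to insert
    return match, no_match

source = [
    {"emp_id": 10, "stream": 2},
    {"emp_id": 11, "stream": 4},
    {"emp_id": 12, "stream": 9},
]
match, no_match = split_by_lookup(source, existing_emp_ids=[10, 12])
print([row["emp_id"] for row in no_match])
```

Only the rows on the no-match path would be connected to the destination.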
3) What errors? Conversion errors? Lookup errors? Some-other-error-not-defined? In general, you'll probably want to read about Integration Services Paths. Everything in a data flow has an Error path leading out of it. Most everything has 1+ non-error paths leading out. In cases where there are multiple non-error paths available, when you connect them to the next component, BIDS will ask which output you are intending to use.
4) Knowing the extremely general problem defined, your package may look something like
Refine your question if that doesn't address the specifics.
I've got a source database (Sybase) that is read-only; you can only write to it through an import file. On the other side is my own database (MSSQL), which has no limitations.
The main problem is that there are no timestamps in the first database, and I don't have any access to change the source database. So is there an engine/solution to get this sync done?
A diff algorithm might work, but it wouldn't be fast, in the sense that you would have to scan the whole source database for each synchronization.
Basically you would do a full data extract, in an agreed upon, and stable, manner (ie. two such extracts with no changes between would produce identical output.)
Then you compare that to the previous extract you did, and then you can find all the changes. Something slightly more intelligent than a pure text diff would be needed, to help determine that rows weren't just deleted+inserted, but in fact updated.
Unfortunately, if there is no way to ask the source database what the latest changes are, through, as you've pointed out, lack of timestamps, or similar mechanisms, then I don't see how you can get any better than a full extract each time.
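The comparison of two extracts can be sketched like this, assuming each extract has been loaded into a dictionary keyed by primary key. The function and the sample data are hypothetical; matching on the key is what lets an update be told apart from a delete plus an insert.

```python
def diff_extracts(previous, current):
    """Compare two extracts given as {primary_key: row_tuple} dicts."""
    # Keys only in the new extract were inserted.
    inserted = {k: current[k] for k in current.keys() - previous.keys()}
    # Keys only in the old extract were deleted.
    deleted = {k: previous[k] for k in previous.keys() - current.keys()}
    # Keys in both extracts whose row content differs were updated.
    updated = {
        k: (previous[k], current[k])
        for k in previous.keys() & current.keys()
        if previous[k] != current[k]
    }
    return inserted, deleted, updated

old_extract = {1: ("Ann", "NY"), 2: ("Bob", "LA"), 3: ("Cid", "SF")}
new_extract = {1: ("Ann", "NY"), 2: ("Bob", "TX"), 4: ("Dee", "DC")}

inserted, deleted, updated = diff_extracts(old_extract, new_extract)
print(inserted, deleted, updated)
```

Both extracts still have to be read in full, so this does not avoid the full scan; it only makes the change detection itself cheap.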
Now, I don't know Sybase that much, but in MS SQL Server you could potentially create another database that mirrors the first, and in this second database you could make whatever changes you need.
However, if you can make such a database in Sybase, and use SQL to access both at the same time, you might be able to run queries that produce the differences.
For instance, something along the lines of:
SELECT S.*
FROM sourcedb..sourcetable1 AS S
FULL JOIN clonedb..sourcetable1 AS C
ON S.pkvalue = C.pkvalue
WHERE S.pkvalue IS NULL OR C.pkvalue IS NULL
This would produce rows that are inserted or deleted.
To find those that changed, you would need this WHERE-clause:
WHERE S.column1 <> C.column1
OR S.column2 <> C.column2
OR ....
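Since the tables are joined, the WHERE-clause would keep only the rows where the previous extract and the current state differ. Note that <> never evaluates to true when either side is NULL, so nullable columns need explicit IS NULL checks as well.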
Now, this might not run fast either, you would have to test to make sure.