I am working on a dynamic loader. Based on a database table that defines the flat text files, I can read a single file with multiple record types and load it into database tables. The tables are related and use identity primary keys. Everything currently works, but it runs really slowly, as would be expected given that it is all done with single-row insert statements. I am working on optimizing the process and can't find an 'easy' or 'best practice' answer on the web.
My current project deals with 8 tables but to simplify I will use a customers / orders example.
Let's look at the two customers below; the data would repeat for each set of customers and orders in the data file. Parent records always come before their child records. The first field is the record type, and each record type has a different definition of the fields that follow. This is all specified in the control tables.
CUST|Joe Green|123 Main St
ORD|Pancakes|5
ORD|Nails|2
CUST|John Deere|456 Park Pl
ORD|Tires|4
Current code will:
Insert customer Joe Green and return an ID (using OUTPUT Inserted.Id in the insert statement).
Insert orders pancakes and nails attaching the returned ID.
Insert customer John Deere and return an ID.
Insert order Tires with the return ID.
This runs painfully slowly. Ideally it could be optimized without my having to change much code, but I can't think of how.
So the solution? I was thinking datatables... Here is what I am thinking of so far.
Create a transaction
Lock all tables that are part of the 'file definition', in this case Customers and Orders
Get the max ID for each table and increment by one to have starting IDs for all tables
Create a DataTable for each table
Execute as currently set up, but add rows to the DataTables instead of issuing insert statements
After the data is read, bulk upload the tables in the correct order based on relationships
Unlock the tables
End the transaction
I was wondering, before I go down this path, if anyone has worked out a better solution. I am also considering a custom script component in SSIS. I have seen posts and blogs about holding off on committing a transaction, but each parent record has only a few child records and the tree can get up to 4 levels deep (think order details and products). Because I need the parent record's ID, I have to commit the insert of parent records first. I have also considered managing the IDs myself rather than using identity columns, but I do not want to add that extra management if I can avoid it.
UPDATE based on answer, for clarification / context.
A typical text file has
one file header record
- 5 facility records that relate to the file header
- 7,000 customers(account)
- 5 - 10 notes per customer
- 1-5 payments at the account level
- 1-5 adjustments at the account level
- 5 - 20 orders per customer
- 5 - 20 order details per order
- 1-5 payments at the order level
- 1-5 adjustments at the order level
- one file trailer record related to the file header
Keys
- File Header -> Facility -> Customer (Account)
- File Header -> FileTrailer
- Customer -> Notes
- Customer -> Payments
- Customer -> Adjustments
- Customer -> Orders
- Order -> OrderDetails
- Order -> Payments
- Order -> Adjustments
There are a few more tables involved but this should give an idea of the overall context.
Data sample ("..." = more fields, "...." = more records)
HEADER|F1|F2|...
FACILITY|F1|F2|..
CUSTOMER|F1|F2|...
NOTE|F1|F2|....
....
ORDER|F1|F2|...
ORDERDETAIL|F1|F2|...
.... ORDER DETAILS
ORDERPYMT|F1|F2|...
....
ORDERADJ|F1|F2|...
....
CUSTOMERPYMT|F1|F2|...
....
CUSTOMERADJ|F1|F2|...
....
(The structure repeats for each facility)
TRAILER|F1|F2|...
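To make the parent/child bookkeeping in this layout concrete, here is a minimal sketch of buffering the rows per record type instead of inserting them one by one. It is not the actual loader: `PARENT_OF`, `buffer_records`, the temp-id scheme and the sample values are all hypothetical names chosen for illustration, and in the real process the parent map would come from the control tables. It relies on the stated rule that parent records always come before their children.

```python
# Hypothetical parent map mirroring the "Keys" list above; the real loader
# would read this from its control tables.
PARENT_OF = {
    "HEADER": None,
    "FACILITY": "HEADER",
    "CUSTOMER": "FACILITY",
    "NOTE": "CUSTOMER",
    "CUSTOMERPYMT": "CUSTOMER",
    "CUSTOMERADJ": "CUSTOMER",
    "ORDER": "CUSTOMER",
    "ORDERDETAIL": "ORDER",
    "ORDERPYMT": "ORDER",
    "ORDERADJ": "ORDER",
    "TRAILER": "HEADER",
}


def buffer_records(lines):
    """Group rows by record type, tagging each with a temp id and its parent's temp id."""
    buffers = {record_type: [] for record_type in PARENT_OF}
    last_temp_id = {}  # record type -> temp id of the most recent row of that type
    for line in lines:
        record_type, *fields = line.rstrip("\n").split("|")
        parent_type = PARENT_OF[record_type]
        rows = buffers[record_type]
        temp_id = len(rows) + 1  # temp ids are only unique within one record type
        # Parents always precede children, so the latest parent row is the owner.
        parent_temp_id = last_temp_id[parent_type] if parent_type else None
        rows.append({"temp_id": temp_id, "parent_temp_id": parent_temp_id, "fields": fields})
        last_temp_id[record_type] = temp_id
    return buffers


# Made-up sample lines in the shape of the data sample above.
sample = [
    "HEADER|H1",
    "FACILITY|F1",
    "CUSTOMER|Joe Green",
    "NOTE|Called twice",
    "ORDER|Pancakes",
    "ORDERDETAIL|5",
    "ORDERPYMT|10.00",
    "CUSTOMERPYMT|2.50",
    "CUSTOMER|John Deere",
    "ORDER|Tires",
    "TRAILER|T1",
]
buffers = buffer_records(sample)
```

Each buffer can then be handed to a bulk upload in parent-before-child order, with the temp ids resolved to real identity values as each level is loaded.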
Inserting related tables with low data volumes should normally not be a problem. If they are slow, we will need more context to answer your question.
If you are encountering problems because you have many records to insert, you will probably have to look at SqlBulkCopy.
If you prefer not managing your ids yourself, the cleanest way I know of is working with temporary placeholder id columns.
Create and fill DataTables with your data, adding a tempId column that you fill yourself and leaving the foreign keys blank
SqlBulkCopy the primary table
Update the secondary DataTable with the generated foreign keys, looking up the primary keys of the previously inserted rows through the tempId column
Upload secondary table
Repeat until done
Remove temporary id columns (optional)
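The steps above can be sketched as follows. This is only an illustration of the temp-id remapping idea, not SqlBulkCopy itself: it uses Python's built-in sqlite3 with `executemany` as a stand-in for the bulk upload, and the table layout (`Customers.TempId`, `Orders.CustomerId`) and the pre-existing row are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (Id INTEGER PRIMARY KEY AUTOINCREMENT, TempId INTEGER, Name TEXT);
    CREATE TABLE Orders (Id INTEGER PRIMARY KEY AUTOINCREMENT, CustomerId INTEGER, Item TEXT, Quantity INTEGER);
    -- A pre-existing row, so generated ids differ from the temp ids.
    INSERT INTO Customers (Name) VALUES ('Existing customer');
""")

# In-memory buffers filled while reading the file (stand-ins for DataTables).
customers = [(1, "Joe Green"), (2, "John Deere")]                # (temp id, name)
orders = [(1, "Pancakes", 5), (1, "Nails", 2), (2, "Tires", 4)]  # (customer temp id, item, qty)

# Step 1: upload the primary table in one batch, temp ids included.
conn.executemany("INSERT INTO Customers (TempId, Name) VALUES (?, ?)", customers)

# Step 2: read back the generated keys and build a temp id -> real id map.
id_map = dict(conn.execute("SELECT TempId, Id FROM Customers WHERE TempId IS NOT NULL"))

# Step 3: fill in the foreign keys on the secondary rows, then upload them.
resolved = [(id_map[temp_id], item, qty) for temp_id, item, qty in orders]
conn.executemany("INSERT INTO Orders (CustomerId, Item, Quantity) VALUES (?, ?, ?)", resolved)

# Step 4 (optional): clear the temp ids so the next load can reuse them.
conn.execute("UPDATE Customers SET TempId = NULL")
conn.commit()

loaded = conn.execute(
    "SELECT c.Name, o.Item FROM Orders o JOIN Customers c ON c.Id = o.CustomerId ORDER BY o.Id"
).fetchall()
```

The same pattern repeats one level at a time for deeper trees: each uploaded level supplies the key map for the level below it.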
Related
I cringe to ask this... as usual, I'm stuck with a legacy design beyond my control.
Incoming datafeed table (keyed to datafeed, line number, and file date):
DATAFEED_USER
(
"DATAFEED_NM" VARCHAR2(32) NOT NULL ENABLE,
"LINE_NBR" NUMBER(6,0) NOT NULL ENABLE,
"FILE_DT" DATE NOT NULL ENABLE,
"NETWORK_ID" VARCHAR2(10),
"PERNR" VARCHAR2(10),
"COMPANY_CODE" VARCHAR2(5),
"LOCAL_COMPANY_ID" VARCHAR2(12),
// end identifier fields, begin user data
"USER_TYPE" VARCHAR2(16),
"LAST_NM" VARCHAR2(40),
"FIRST_NM" VARCHAR2(40),
...
)
Table in our system (keyed to network ID):
USER_DESC
(
"NETWORK_ID" VARCHAR2(30) NOT NULL ENABLE,
"PERNR" CHAR(8),
"COMPANY_CODE" VARCHAR2(4),
"LOCAL_COMPANY_ID" VARCHAR2(12),
// end identifier fields, begin user data
"LAST_NM" VARCHAR2(40),
"FIRST_NM" VARCHAR2(40),
...
)
I need the datafeed_user entities to have collections of matching user_desc records - records can match by network ID, PerNr, or Company Code + Local ID. There's no FK relationship, because (1) new users will come in on the datafeed before we have a record of them, and (2) only the network ID is a PK in our system.
The relationships:
Network ID: Many datafeeds can send the same network ID; there will only be one match in our system - Many to 1.
PerNr: Only one datafeed uses PerNrs, but they can match multiple Network IDs in our system - 1 to Many
Company Code/Local ID: Each datafeed sends one Local ID, but they can match to multiple Network IDs in our system - 1 to Many
The datafeed processing looks at all potential matches to the database, chooses one using a set of business rules, and updates the matching record. Users sent on more than one feed are flagged and not processed. There are 250k+ records in our database, so I only want to pull down the matching records to update. (I'd love to only pull down one, but then I'd have to push the business logic for matching records to the database)
How do I define an association/navigation property in the EF designer so that I can easily work with the related records?
I understand that what I'm after isn't a true relationship in db terms, so I'm open to alternatives. The code I'm migrating uses typed datasets, which were extended to have custom properties (such as a collection of matched records). I can't cleanly do that in the EF, because the .cs file is auto-generated. The requirements are:
For each datafeed record, look at all records that match in our system by one of the three unique identifiers
Update the matching record in our system
Don't pull all 250k records down to update one (and definitely don't do that 1,000 times, once for each datafeed record)
Suggestions?
I have two tables, e.g. Users (UserID, CompanyID, Name, ...) and Companies (CompanyID, Name, ...). I have to process XML import data that generates a large number of items for each table. When a (new) user in the XML file references a new company, the company must be inserted as well. When a user (or company) item in the XML contains changed data, the corresponding rows in the database must be updated.
Using EF6, creating and updating the items is quite simple, but SubmitChanges() takes extremely long. I searched Google and Stack Overflow and found several similar topics about using a data context and bulk inserting items. They are useful, but in my case, when I bulk insert a new company, its ID is unknown to me, so I cannot bulk insert the user items as well. What is a good way to solve this?
Gather all the new company items first, bulk insert them, and read back all the information and IDs? (Or how else would I know the IDs of these new company items?)
Then bulk insert all the user items?
My second question: is there a common way to generate the data table structure for the bulk insert from the EF6 class? Can I write a generic bulk insert method using the DB First table class?
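For what it's worth, the first idea (insert the companies, read the IDs back, then insert the users) might look roughly like the sketch below. It uses Python's built-in sqlite3 rather than EF6 or SqlBulkCopy, and it assumes that the company name is a unique natural key that can be used to read the generated IDs back; that assumption, the sample names and the SQLite-specific `INSERT OR IGNORE` are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Companies (CompanyID INTEGER PRIMARY KEY, Name TEXT UNIQUE);
    CREATE TABLE Users (UserID INTEGER PRIMARY KEY, CompanyID INTEGER REFERENCES Companies, Name TEXT);
    INSERT INTO Companies (Name) VALUES ('Acme');  -- already known before the import
""")

# Made-up import rows: (user name, company name).
imported_users = [("Alice", "Acme"), ("Bob", "Globex"), ("Carol", "Globex")]

# 1. Batch-insert every referenced company; existing names are skipped
#    (INSERT OR IGNORE is SQLite syntax, assumed here for brevity).
company_names = sorted({company for _, company in imported_users})
conn.executemany("INSERT OR IGNORE INTO Companies (Name) VALUES (?)", [(n,) for n in company_names])

# 2. Read the generated ids back in a single query, keyed by the natural key.
placeholders = ",".join("?" * len(company_names))
company_ids = dict(
    conn.execute(f"SELECT Name, CompanyID FROM Companies WHERE Name IN ({placeholders})", company_names)
)

# 3. Batch-insert the users with the resolved foreign keys.
conn.executemany(
    "INSERT INTO Users (CompanyID, Name) VALUES (?, ?)",
    [(company_ids[company], user) for user, company in imported_users],
)
conn.commit()

user_rows = conn.execute(
    "SELECT u.Name, c.Name FROM Users u JOIN Companies c ON c.CompanyID = u.CompanyID ORDER BY u.UserID"
).fetchall()
```

If no natural key is available, a temporary placeholder id column on the company rows would serve the same purpose.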
I need to update a bit field in a table and set this field to true for a specific list of Ids in that table.
The Ids are passed in from an external process.
I guess in pure SQL the most efficient way would be to create a temp table and populate it with the Ids, then join the main table with this and set the bit field accordingly.
I could create a SPROC to take the Ids, but there could be 200 - 300,000 rows that need this flag set, so it's probably not the most efficient way. Using an IN clause has limitations with respect to the amount of data that can be passed and to performance.
How can I achieve the above using the Entity Framework?
I guess it's possible to create a SPROC that creates a temp table, but that table would not exist from the model's perspective.
Is there a way to dynamically add entities at run time? [Or is this approach just going to cause headaches?]
I'm making the assumption above though that populating a temp table with 300,000 rows and doing a join would be quicker than calling a SPROC 300,000 times :)
[The Ids are Guids]
Is there another approach that I should consider?
For data volumes like 300k rows, I would forget EF. I would do this by having a table such as:
BatchId RowId
Where RowId is the PK of the row we want to update, and BatchId just refers to this "run" of 300k rows (to allow multiple at once etc).
I would generate a new BatchId (this could be anything unique; a Guid leaps to mind), and use SqlBulkCopy to insert the records into this table, i.e.
100034 17
100034 22
...
100034 134556
I would then use a single sproc to do the join and update (and delete the batch from the table).
SqlBulkCopy is the fastest way of getting this volume of data to the server; you won't drown in round-trips. EF is object-oriented : nice for lots of scenarios - but not this one.
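As a rough illustration of the batch-table flow described above (not the real SqlBulkCopy/sproc pair), here is a sketch using Python's built-in sqlite3. The table names `Main` and `Batch`, the `Flag` column and the helper `flag_rows` are hypothetical; `executemany` stands in for the bulk upload and the UPDATE plus DELETE stand in for the stored procedure body.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Main (Id TEXT PRIMARY KEY, Flag INTEGER NOT NULL DEFAULT 0);
    CREATE TABLE Batch (BatchId TEXT NOT NULL, RowId TEXT NOT NULL);
""")

# Seed the main table with some Guid-keyed rows.
all_ids = [str(uuid.uuid4()) for _ in range(10)]
conn.executemany("INSERT INTO Main (Id) VALUES (?)", [(row_id,) for row_id in all_ids])


def flag_rows(conn, row_ids):
    """Upload the ids under a fresh batch id, then update and clean up set-wise."""
    batch_id = str(uuid.uuid4())  # identifies this run, so several runs can overlap
    # One batched upload instead of one round-trip per id.
    conn.executemany(
        "INSERT INTO Batch (BatchId, RowId) VALUES (?, ?)",
        [(batch_id, row_id) for row_id in row_ids],
    )
    # A single set-based update driven by the batch table.
    conn.execute(
        "UPDATE Main SET Flag = 1 WHERE Id IN (SELECT RowId FROM Batch WHERE BatchId = ?)",
        (batch_id,),
    )
    # Remove only this run's rows from the batch table.
    conn.execute("DELETE FROM Batch WHERE BatchId = ?", (batch_id,))
    conn.commit()


flag_rows(conn, all_ids[:3])
flagged = conn.execute("SELECT COUNT(*) FROM Main WHERE Flag = 1").fetchone()[0]
```

The point of the shape is that the number of statements stays constant no matter how many ids are in the batch.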
I'm marking Marc's response as the answer, but I'd just like to give a little detail on how we implemented the requirement.
Marc's response helped greatly in the formulation of our solution.
We had to deal with an aim/guideline to keep within the Entity Framework while not utilizing SPROCs, and although our solution may not suit others, it has worked for us.
We created an Item table in the database with BatchId [uniqueidentifier] and ItemId varchar columns.
This table was added to the EF model so we did not use temporary tables.
On upload of these Ids this table is populated with the Ids [Inserts are quick enough we find using EF]
We then use context.ExecuteStoreCommand to run SQL that joins the Item table to the main table and updates the bit field in the main table for the records that belong to the batch Id created specifically for that session.
Finally, we clear this table for that batch Id.
We have the performance while keeping within our no-SPROC goal. [Which not all of us agree with :) but it's a democracy]
Our exact requirements are a little more complex but insofar as needing good update performance using the Entity framework given our specific restrictions it works fine.
Liam
I am doing a conversion with SqlBulkCopy. I currently have an IList collection of classes which I can convert to a DataTable for use with SqlBulkCopy.
Problem is that I can have 3 records with the same ID.
Let me explain .. here are 3 records
ID Name Address
1 Scott London
1 Mark London
1 Manchester
Basically I need to insert them sequentially: I insert record 1 if it doesn't exist; for the next record, since the ID already exists, I need to update the record rather than insert a new one (notice the ID is still 1), so for the second record I replace both columns, Name and Address, on ID 1.
Finally, on the 3rd record you'll notice that Name is empty but the ID is 1 and the address is Manchester, so I need to update the record WITHOUT changing Name while setting Address to Manchester. Hence, after the 3rd record, ID 1 would be:
ID Name Address
1 Mark Manchester
Any ideas how I can do this? I am at a loss.
Thanks.
EDIT
OK, a little update. I will manage and merge my records before using SqlBulkCopy. Is it possible to get a list of what succeeded and what failed, or is it a case of all or nothing? I presume SqlBulkCopy cannot do updates and there is no alternative for that?
It would be ideal to be able to insert everything and have the rows that failed inserted into a temp table, so that I only need to worry about correcting the ones in my failed table, knowing that the others are all OK.
Since you need to process that data into a DataTable anyway (unless you are writing a custom IDataReader), you should merge the records before giving them to SqlBulkCopy; for example (in pseudo code):
/* create empty data-table */
foreach(item in list) {
    var row = /* try to get existing row from data-table based on item's id */
    if(row == null) { row = /* create and append row to data-table from item */ }
    else { /* merge non-trivial properties of item into existing row */ }
}
then pass the DataTable to SqlBulkCopy once you have the desired data.
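A minimal runnable version of that merge, using the three sample records from the question, could look like this. The function name `merge_records` is hypothetical, and the sketch assumes that an empty string or None means "no value supplied, keep the previous one".

```python
def merge_records(records):
    """Collapse rows that share an ID; later non-empty values overwrite earlier ones."""
    merged = {}  # ID -> merged row; dicts keep the first-seen order of IDs
    for record in records:
        target = merged.setdefault(record["ID"], {"ID": record["ID"]})
        for column, value in record.items():
            # Assumption: blank or missing values must not overwrite existing data.
            if value not in (None, ""):
                target[column] = value
    return list(merged.values())


rows = [
    {"ID": 1, "Name": "Scott", "Address": "London"},
    {"ID": 1, "Name": "Mark", "Address": "London"},
    {"ID": 1, "Name": "", "Address": "Manchester"},
]
merged_rows = merge_records(rows)
# merged_rows == [{"ID": 1, "Name": "Mark", "Address": "Manchester"}]
```

The merged list contains one row per ID, which is the shape a single bulk insert needs.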
Re the edit; in that scenario, I would upload to a staging table (just a regular table that has a schema like the uploaded data, but typically no foreign keys etc), then use regular TSQL to move the data into the transactional tables. In addition to full TSQL support this also allows better logging of operations. In particular, perhaps look at the OUTPUT clause of INSERT which can help complex bulk operations.
You can't do updates with bulk copy (bulk insert), only insert. Hence the name.
You need to fix the data before you insert them. If this means you have updates to pre-existing rows, you can't insert those as that will generate the key conflict.
You can either bulk insert into a temporary table, and run the appropriate insert or update statements, only insert the new rows and issue update statements for the rest, or delete the pre-existing rows after fetching them and fixing the data before reinserting.
But there's no way to persuade bulk copy to update an existing row.
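The staging-table route mentioned in both answers might be sketched like this. It uses Python's built-in sqlite3 instead of SQL Server, so `executemany` stands in for the bulk copy and the SQL is SQLite's dialect rather than T-SQL; the `People` and `Staging` tables and the second sample person are made up, and the staged data is assumed to hold at most one row per ID (i.e. it was merged beforehand).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE People (ID INTEGER PRIMARY KEY, Name TEXT, Address TEXT);
    CREATE TABLE Staging (ID INTEGER, Name TEXT, Address TEXT);  -- no keys or constraints
    INSERT INTO People VALUES (1, 'Scott', 'London');
""")

# Already-merged incoming rows: one update to ID 1 and one brand-new row.
incoming = [(1, "Mark", "Manchester"), (2, "Anna", "Leeds")]
conn.executemany("INSERT INTO Staging VALUES (?, ?, ?)", incoming)

# Update the rows that already exist in the target table.
conn.execute("""
    UPDATE People
    SET Name = (SELECT s.Name FROM Staging s WHERE s.ID = People.ID),
        Address = (SELECT s.Address FROM Staging s WHERE s.ID = People.ID)
    WHERE ID IN (SELECT ID FROM Staging)
""")

# Insert the rows that are genuinely new.
conn.execute("""
    INSERT INTO People (ID, Name, Address)
    SELECT s.ID, s.Name, s.Address
    FROM Staging s
    WHERE NOT EXISTS (SELECT 1 FROM People p WHERE p.ID = s.ID)
""")

# Empty the staging table for the next load.
conn.execute("DELETE FROM Staging")
conn.commit()

people = conn.execute("SELECT ID, Name, Address FROM People ORDER BY ID").fetchall()
```

Because the staging table has no constraints, the upload itself cannot hit key conflicts; any validation or logging happens in the set-based statements afterwards.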
Out of my lack of SQL Server experience and taking into account that this task is a usual one for Line of Business applications, I'd like to ask, maybe there is a standard, common way of doing the following database operation:
Assume we have two tables connected to each other by a one-to-many relationship, for example SalesOrderHeader and SalesOrderLines
http://s43.radikal.ru/i100/1002/1d/c664780e92d5.jpg
Field SalesHeaderNo is a PK in the SalesOrderHeader table and a FK in the SalesOrderLines table.
In a front-end app, a user selects some number of records in the SalesOrderHeader table, for example by date range or via an IsSelected field set by clicking checkboxes in a GridView. The user then performs some operation (let it be just "move to another table") on the selected range of sales orders.
My question is:
How, in this case, can I reach the child records in the SalesOrderLines table to perform the same operation (in our case "move to another table") on them in as easy, correct, fast and elegant a way as possible?
If you're okay with a T-SQL based solution (as opposed to C# / LINQ) - you could do something like this:
-- define a table variable to hold the primary keys of the selected master rows
DECLARE @MasterIDs TABLE (HeaderNo INT)
-- fill that table somehow, e.g. by passing in values from a C# app or something
INSERT INTO dbo.NewTable(LineCodeNo, Item, Quantity, Price)
SELECT SalesLineCodeNo, Item, Quantity, Price
FROM dbo.SalesOrderLine sol
INNER JOIN @MasterIDs m ON m.HeaderNo = sol.SalesHeaderNo
With this, you can insert a whole set of rows from your child table into a new table based on a selection criteria.
Your question is still a bit vague to me in that I'm not exactly sure what would be entailed by "move to another table." Does that mean there is another table with the exact schema of both your sample tables?
However, here's a stab at a solution. When a user commits on a SalesOrderHeader record, some operation will be performed that looks like:
Update SalesOrderHeader
Set....
Where SalesOrderHeaderNo = @SalesOrderHeaderNo
Or
Insert SomeOtherTable
Select ...
From SalesOrderHeader
Where SalesOrderHeaderNo = @SalesOrderHeaderNo
In that same operation, is there a reason you can't also do something to the line items such as:
Insert SomeOtherTableItems
Select ...
From SalesOrderLineItems
Where SalesOrderHeaderNo = @SalesOrderHeaderNo
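Putting the header and line-item statements into one operation could look like the sketch below, if "move to another table" means copy and then delete. It uses Python's built-in sqlite3 rather than SQL Server; the target tables `ArchivedHeader` and `ArchivedLines`, the `Customer` column, the sample rows and the helper `move_order` are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SalesOrderHeader (SalesHeaderNo INTEGER PRIMARY KEY, Customer TEXT);
    CREATE TABLE SalesOrderLines (SalesLineCodeNo INTEGER PRIMARY KEY, SalesHeaderNo INTEGER, Item TEXT);
    CREATE TABLE ArchivedHeader (SalesHeaderNo INTEGER PRIMARY KEY, Customer TEXT);
    CREATE TABLE ArchivedLines (SalesLineCodeNo INTEGER PRIMARY KEY, SalesHeaderNo INTEGER, Item TEXT);
    INSERT INTO SalesOrderHeader VALUES (14, 'A'), (15, 'B');
    INSERT INTO SalesOrderLines VALUES (1, 14, 'Bolt'), (2, 14, 'Nut'), (3, 15, 'Gear');
""")


def move_order(conn, header_no):
    """Move one header and all of its lines to the archive tables atomically."""
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute(
            "INSERT INTO ArchivedHeader SELECT * FROM SalesOrderHeader WHERE SalesHeaderNo = ?",
            (header_no,),
        )
        conn.execute(
            "INSERT INTO ArchivedLines SELECT * FROM SalesOrderLines WHERE SalesHeaderNo = ?",
            (header_no,),
        )
        # Delete children before the parent so a foreign key would not be violated.
        conn.execute("DELETE FROM SalesOrderLines WHERE SalesHeaderNo = ?", (header_no,))
        conn.execute("DELETE FROM SalesOrderHeader WHERE SalesHeaderNo = ?", (header_no,))


move_order(conn, 14)
archived_lines = conn.execute("SELECT Item FROM ArchivedLines ORDER BY SalesLineCodeNo").fetchall()
remaining_lines = conn.execute("SELECT * FROM SalesOrderLines").fetchall()
```

Running both levels inside one transaction means a failure partway through leaves neither the header nor its lines half-moved.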
I don't know about "Best Practices", but this is what I use:
var header = db.SalesOrderHeaders.SingleOrDefault(h => h.SaleHeaderNo == 14);
IEnumerable<SalesOrderLine> list = header.SalesOrderLines.AsEnumerable();
// now your list contains the "many" records for the header
foreach (SalesOrderLine line in list)
{
// some code
}
I tried to model it after your table design, but the names may be a little different.
Now whether this is the "best practices" way, I am not sure.
EDITED: Noticed that you want to update them all, possibly move to another table. Since LINQ-To-SQL can't do bulk inserts/updates, you would probably want to use T-SQL for that.