How to sync only certain rows in a database - c#

I am trying to sync a couple of tables from a central (SQL Server) database dbA to a local (SQL CE) client database dbB. For this I'm planning to use the Microsoft Sync Framework, and I have managed to get it working so far as to sync all the rows (or updated rows) in some specified tables.
However, my dbA contains a lot of data, and it is still growing, so I don't want the customers to have a full copy of dbA in their local databases. Instead I just want them to have some parts of the database (some rows that depend on an expression, for instance from tableA select only the rows where moduleId = 4).
So initially I thought that filters seemed to be the thing for me, specifically parameter-based filters where I can specify a parameter to the filter. However, after some investigation it appears that when creating a parameter-based filter you are actually just creating a template, and then creating different scopes from that.
I can't describe exactly what my application does, but you could think of it almost as a version handling system, where the parameter would represent which version I want to select from. So you can see that this would result in a large number of scopes.
So in the end, what are my options?

Could you maybe use a SQL function for the filter parameter so that it can change?
Something like
serverTemplate.Tables["tableA"].AddFilterColumn("ModuleId");
serverTemplate.Tables["tableA"].FilterClause = "[side].[ModuleId] = sf_GetCurrentModule()";
You could create a SQL function sf_GetCurrentModule() to return the ModuleId to be synced. If the rows to be synced vary from client to client, you would at least need a sync scope per client - which is not a problem, just use a unique name for the scope - and maybe pass an id from the client to the function:
serverTemplate.Tables["tableA"].FilterClause = string.Format("[side].[ModuleId] = sf_GetCurrentModule({0})", ClientId);
So you could return the ModuleId to be synced depending on the client.
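For completeness, here is a rough sketch of how provisioning one such per-client scope could look with Sync Framework 2.1. The scope name, connection string and ClientId variable are assumptions, and sf_GetCurrentModule is the SQL function described above:
// requires: using Microsoft.Synchronization.Data; using Microsoft.Synchronization.Data.SqlServer;
using (var serverConn = new SqlConnection(serverConnectionString))
{
    // one scope per client, with the client id baked into the filter clause
    string scopeName = "tableA_client_" + ClientId;
    var scopeDesc = new DbSyncScopeDescription(scopeName);
    scopeDesc.Tables.Add(SqlSyncDescriptionBuilder.GetDescriptionForTable("tableA", serverConn));

    var serverProvision = new SqlSyncScopeProvisioning(serverConn, scopeDesc);
    serverProvision.Tables["tableA"].AddFilterColumn("ModuleId");
    serverProvision.Tables["tableA"].FilterClause =
        string.Format("[side].[ModuleId] = sf_GetCurrentModule({0})", ClientId);

    if (!serverProvision.ScopeExists(scopeName))
        serverProvision.Apply();
}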
Hope this helps.
Travis

Related

Best approach to track Amount field on Invoice table when InvoiceItem items change?

I'm building an app where I need to store invoices from customers so we can track who has paid and who has not, and if not, see how much they owe in total. Right now my schema looks something like this:
Customer
- Id
- Name
Invoice
- Id
- CreatedOn
- PaidOn
- CustomerId
InvoiceItem
- Id
- Amount
- InvoiceId
Normally I'd fetch all the data using Entity Framework and calculate everything in my C# service (or even do the calculation on SQL Server), something like so:
var amountOwed = Invoice.Where(i => i.CustomerId == customer.Id)
                        .SelectMany(i => i.InvoiceItems)
                        .Select(ii => ii.Amount)
                        .Sum();
But calculating everything every time I need to generate a report doesn't feel like the right approach this time, because down the line I'll have to generate reports that calculate what all the customers owe (and sometimes go even higher up the hierarchy).
For this scenario I was thinking of adding an Amount field on my Invoice table and possibly an AmountOwed on my Customer table which will be updated or populated via the InvoiceService whenever I insert/update/delete an InvoiceItem. This should be safe enough and make the report querying much faster.
But I've also been searching some on this subject and another recommended approach is using triggers on my database. I like this method best because even if I were to directly modify a value using SQL and not the app services, the other tables would automatically update.
My question is:
How do I add a trigger to update all the parent tables whenever an InvoiceItem is changed?
And from your experience, is this the best (safer, less error-prone) solution to this problem, or am I missing something?
There are many examples of triggers that you can find on the web. Many are poorly written unfortunately. And for future reference, post DDL for your tables, not some abbreviated list. No one should need to ask about the constraints and relationships you have (or should have) defined.
To start, how would you write a query to calculate the total amount at the invoice level? Presumably you know the tsql to do that. So write it, test it, verify it. Then add your amount column to the invoice table. Now how would you write an update statement to set that new amount column to the sum of the associated item rows? Again - write it, test it, verify it. At this point you have all the code you need to implement your trigger.
Since this process involves changes to the item table, you will need to write triggers to handle all three types of dml statements - insert, update, and delete. Write a trigger for each to simplify your learning and debugging. Triggers have access to special tables - go learn about them. And go learn about the false assumption that a trigger works with a single row - it doesn't. Triggers must be written to work correctly if 0 (yes, zero), 1, or many rows are affected.
In an insert statement, the inserted table will hold all the rows inserted by the statement that caused the trigger to execute. So you merely sum the values (using the appropriate grouping logic) and update the appropriate rows in the invoice table. Having written the update statement mentioned in the previous paragraphs, this should be a relatively simple change to that query. But since you can insert a new row for an old invoice, you must remember to add the summed amount to the value already stored in the invoice table. This should be enough direction for you to start.
And to answer your second question - the safest and easiest way is to calculate the value every time. I fear you are trying to solve a problem that you do not have and that you may never have. Generally speaking, no one cares about invoices that are of "significant" age. You might care about unpaid invoices for a period of time, but eventually you write these things off (especially if the amounts are not significant). Another relatively easy approach is to create an indexed view to calculate and materialize the total amount. But remember - nothing is free. An indexed view must be maintained and it will add extra processing for DML statements affecting the item table. Indexed views do have limitations - which are documented.
And one last comment. I would strongly hesitate to maintain a total amount at any level higher than invoice. Above that level one frequently wants to filter the results in many ways - date, location, type, customer, etc. At that level you are approaching data warehouse functionality, which is not appropriate for an OLTP system.
First of all, never use triggers for business logic. Triggers are tricky and easily forgotten; an application built on them will be hard to maintain.
For most cases you can easily populate your reporting data via Entity Framework or a SQL query. But if it requires lots of joins then you need to consider using staging tables, because reporting requires data denormalization. To populate staging tables you can use SQL jobs or another scheduling mechanism (Azure Scheduler maybe). This way you won't need to work with lots of joins and your reports will populate faster.

ASP.NET Sorting Multiple SQL Database Tables By Date Range

This past week I was tasked with moving a PHP based database to a new SQL database. There are a handful of requirements, but one of those was using ASP.Net MVC to connect to the SQL database...and I have never used ASP.Net or MVC.
I have successfully moved the database to SQL and have the foundation of the ASP site set up (after spending many hours poring over tutorials). The issue I am having now is that one of the pages is meant to display a handful of fields (User_Name, Work_Date, Work_Description, Work_Location, etc.), but the only way of grabbing all of those fields is by combining two of the tables. Furthermore, I am required to allow the user to search the combined table for any matching rows between a user-inputted date range.
I have tried having a basic table set up that displays the correct fields and have implemented a search bar...but that only allows me to search by a single date, not a range. I have also tried to use GridView with its Query Builder feature to grab the data fields I needed (which worked really well), but I can't figure out how to attach textboxes/buttons to the newly made GridView. Using a single table with GridView works perfectly and using textboxes/buttons is very intuitive. I just can't seem to make the same connection with a joined view.
So I suppose my question is this: what is the best way for me to combine these two tables while also still having the ability to perform searches on the displayed data? If I could build this database from scratch I would have just made a table with the relevant data attached to it, but because this is derived from a previously made database it has 12+ years of information that I need to dump into it.
Any help would be greatly appreciated. I am kind of dead in the water here. My inexperience with these systems is getting the better of me. I could post the code that I have, but I am mainly interested in my options and then I can do the research on my own.
Thanks!
It's difficult to offer definitive answers to your questions due to the need for guesswork.
But here are some hints.
You can say WHERE datestamp >= '2017-01-01' AND datestamp < '2018-01-01' to filter all the rows in calendar year 2017. Many variations on this sort of date range filter are available.
Your first table probably has some kind of ID number on each row. Let's call it first.first_id. Your second table probably has its own id, let's call it second.second_id. And, it probably has another id that identifies a row in your first table, let's call it second.first_id. That second.first_id is called a foreign key in the second table to the first table. There can be any number of rows in your second table corresponding to your first table via this foreign key.
If this is the case you can do something like this:
SELECT first.datestamp, first.val1, first.val2, second.val1, second.val2
FROM first
JOIN second ON first.first_id = second.first_id
WHERE first.datestamp >= '2018-06-01' AND first.datestamp < '2018-07-01'
AND (first.val1 = 'some search term' OR second.val1 = 'some search term')
ORDER BY first.datestamp
This makes a virtual table by joining together your two physical tables (FROM...JOIN...).
Then it filters the rows you want from that virtual table (WHERE ...).
Then it puts them in the order you want (ORDER BY...).
Finally, it chooses the columns from the virtual table you want in your result set (SELECT ...).
SQL database servers (MySQL, SQL Server, PostgreSQL, Oracle and the rest) are very smart about doing this sort of thing efficiently.
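If it helps, here is a rough C# sketch of running that kind of query with the user's date range bound as parameters; the table, column and variable names are assumptions based on the fields you listed:
// requires: using System.Data; using System.Data.SqlClient;
var table = new DataTable();

using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand(
    @"SELECT u.User_Name, w.Work_Date, w.Work_Description, w.Work_Location
      FROM Users u
      JOIN WorkEntries w ON w.User_Id = u.User_Id
      WHERE w.Work_Date >= @startDate AND w.Work_Date < @endDate
      ORDER BY w.Work_Date", connection))
{
    command.Parameters.Add("@startDate", SqlDbType.Date).Value = startDate;
    // exclusive upper bound so the whole end day is included
    command.Parameters.Add("@endDate", SqlDbType.Date).Value = endDate.AddDays(1);

    using (var adapter = new SqlDataAdapter(command))
    {
        adapter.Fill(table);   // Fill opens and closes the connection for you
    }
}

// the DataTable can then be bound to a GridView or handed to your MVC view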

Azure Elastic Database Merge GUI Key Shards

In Azure we have four shards and I want to remove two of them as we do not need them anymore. The data should be merged into the other two shards.
I use a list shard map with GUIDs as keys to identify the shards (in our application this is the UserId).
In the tutorials I only found samples for merging shards of the Range type.
Is there a way to merge these types of shards in a faster way, or do I have to write my own tool for this?
If the merge is performed automatically, what will happen, for example, in the following case:
The GUID to identify the shard is the UserId, and now this data is moved from Shard A to Shard B. There is another table called Comments which has the UserId as a foreign key. The primary key in this table is a classic numeric auto-increment value. What will happen to those values when they are moved from Shard A to Shard B? Will they be inserted and a new ID assigned to them, or will this not work at all?
Also, there is some local file storage involved which uses IDs in the path, so I will have to write my own tool anyway, I think.
For that I took a look at the ShardMapManager but did not fully understand how it works. In the ShardMappingsGlobal table there is a column called MappingId. But this is not the Guid/UserId which is stored in the shard database. How do I get the actual Guid which is used to identify the shard, in my case the UserId?
I also did not find methods to move data between shards.
What I would do now is transfer the data between the shards with a tool of my own and then use the ListShardMap.UpdateMapping method to set a new shard for the value.
At the end of the operation I would use ListShardMap.DeleteShard - or is there a better way to do this?
EDIT:
I wrote my own tool to merge the shards, but now I get a strange exception. Here is some code:
Guid userKey = Guid.Parse(userId);
ListShardMap<Guid> map = GetUserShardMap<Guid>();
try
{
    PointMapping<Guid> currentMapping = map.GetMappingForKey(userKey);
    PointMapping<Guid> mappingOffline = map.UpdateMapping(currentMapping, new PointMappingUpdate()
    {
        Status = MappingStatus.Offline
    });
}
The UpdateMapping causes the following exception:
Store Error: Error 515, Level 16, State 2, Procedure __ShardManagement.spBulkOperationShardMappingsLocal, Line 98, Message: Cannot insert the value NULL into column 'LockOwnerId', table __ShardManagement.ShardMappingsLocal
I do not understand why there is even an insert. I checked for the MappingId in the local and global shard mapping tables and the mapping is there, so no insert should be required in my opinion. I also took a look at the code of the mentioned stored procedure spBulkOperationShardMappingsLocal here: https://github.com/Azure/elastic-db-tools/blob/master/Src/ElasticScale.Client/ShardManagement/Scripts/UpgradeShardMapManagerLocalFrom1.1To1.2.sql
In the INSERT statement the LockOwnerId is not passed as a parameter, so it can only fail.
Currently I work with a test setup because of course I do not want to experiment on the production system. Maybe I made a mistake there, but to me everything looks good. I would be very grateful for any hint regarding this error.
In the tutorials I only found samples for merging shards of the Range type. Is there a way to merge these types of shards in a faster way or do I have to write my own tool for this?
Yes, the Split-Merge tool can move data from both range and list shard maps. For a list shard map you can issue shardlet move requests for each key. The Split-Merge tool unfortunately has a somewhat complicated setup; last time it took me around an hour to configure. I know this is not great, so I'll leave it up to you to determine whether it would take more or less time to write your own custom version.
There is another table called Comments which has the UserId as a foreign key. The primary key in this table is a classic numeric auto-increment value. What will happen to those values when they are moved from Shard A to Shard B? Will they be inserted and a new ID assigned to them, or will this not work at all?
The values of autoincrement columns are not copied over; they will be regenerated at the destination. So new ids will be assigned to these rows.
For that I took a look at the ShardMapManager but did not fully understand how it works. In the ShardMappingsGlobal table there is a column called MappingId. But this is not the Guid/UserId which is stored in the shard database. How do I get the actual Guid which is used to identify the shard, in my case the UserId?
I would strongly suggest not trying to edit the ShardMapManager tables on your own, it's very easy to mess up. Editing ShardMapManager tables is precisely what the Elastic Database Tools library is designed to do.
You can update the metadata for a mapping by using the ListShardMap.UpdateMapping method. Just to be clear, this only updates the ShardMapManager tables' knowledge of where the data should be for the key. Actually moving the data must be done by a higher layer.
This is a high-level summary of what the Split-Merge service does (a rough code sketch follows this list):
Lock the mapping to prevent concurrent update from another shard map management operation
Mark the mapping offline with ListShardMap.UpdateMapping. This prevents data-directed routing with OpenConnectionForKey from being allowed to access data with that key. It also kills all current sessions on the shard to force them to reconnect; this ensures that there are no active connections operating on data with the now-offline key
Move the underlying data, using the Shard Map's SchemaInfo to determine which tables need to be moved
Update the mapping and mark it online with ListShardMap.UpdateMapping
Unlock the mapping
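For reference, here is a rough sketch of steps 2-4 in code using the elastic database client library. MoveUserData is your own data-copy routine (not part of the library), and the shard map name, server and database names are assumptions:
// requires: using Microsoft.Azure.SqlDatabase.ElasticScale.ShardManagement;
ListShardMap<Guid> map = shardMapManager.GetListShardMap<Guid>("UserIdShardMap");
PointMapping<Guid> mapping = map.GetMappingForKey(userKey);

// 2) take the mapping offline so OpenConnectionForKey stops routing to it
mapping = map.UpdateMapping(mapping, new PointMappingUpdate { Status = MappingStatus.Offline });

// 3) copy the rows for this key to the target shard (your own code, e.g. TVPs or bulk copy)
MoveUserData(userKey, sourceShardConnectionString, targetShardConnectionString);

// 4) re-point the mapping at the target shard and bring it back online
//    (assumes PointMappingUpdate also lets you change the Shard of an offline mapping)
Shard targetShard = map.GetShard(new ShardLocation("targetserver.database.windows.net", "ShardDb2"));
mapping = map.UpdateMapping(mapping, new PointMappingUpdate { Shard = targetShard, Status = MappingStatus.Online });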

Bulk insert strategy from c# to SQL Server

In our current project, customers will send collections of complex/nested messages to our system. The frequency of these messages is approximately 1000-2000 messages per second.
These complex objects contain the transaction data (to be added) as well as master data (which will be added if not found). But instead of passing the ids of the master data, the customer passes the 'name' column.
The system checks if master data exists for these names. If found, it uses the ids from the database; otherwise it creates the master data first and then uses those ids.
Once the master data ids are resolved, the system inserts the transactional data into a SQL Server database (using the master data ids). The number of master entities per message is around 15-20.
Following are some strategies we can adopt.
We can resolve the master ids first from our C# code (inserting master data if not found) and store these ids in a C# cache. Once all ids are resolved, we can bulk insert the transactional data using the SqlBulkCopy class. We can hit the database 15 times to fetch the ids for the different entities and then hit the database one more time to insert the final data. We can use the same connection and close it after all this processing is done (a rough SqlBulkCopy sketch is shown after these strategies).
We can send all these messages containing master data and transactional data in single hit to the database (in the form of multiple TVP) and then inside stored procedure, create the master data first for the missing ones and then insert the transactional data.
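For illustration, the bulk-copy step in strategy #1 could look something like the sketch below; the destination table, column names and the BuildTransactionTable helper are placeholders:
// requires: using System.Data; using System.Data.SqlClient;
using (var connection = new SqlConnection(connectionString))
{
    connection.Open();

    // ... the 15 or so lookups / inserts that resolve master-data names into ids go here ...
    DataTable transactionTable = BuildTransactionTable(messages);   // hypothetical helper that fills a DataTable with resolved ids

    using (var bulkCopy = new SqlBulkCopy(connection))
    {
        bulkCopy.DestinationTableName = "dbo.ProductPrices";
        bulkCopy.BatchSize = 2000;
        bulkCopy.ColumnMappings.Add("ProductId", "ProductId");
        bulkCopy.ColumnMappings.Add("VendorId", "VendorId");
        bulkCopy.ColumnMappings.Add("ListPrice", "ListPrice");
        bulkCopy.ColumnMappings.Add("Discount", "Discount");

        bulkCopy.WriteToServer(transactionTable);
    }
}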
Could anyone suggest the best approach in this use case?
Due to some privacy issue, I cannot share the actual object structure. But here is the hypothetical object structure which is very close to our business object.
One such message will contain information about one product (its master data) and its price details (transaction data) from different vendors:
Master data (which need to be added if not found)
Product name: ABC, ProductCategory: XYZ, Manufacturer: XXX and some other details (the number of properties is in the range of 15-20).
Transaction data (which will always be added)
Vendor Name: A, ListPrice: XXX, Discount: XXX
Vendor Name: B, ListPrice: XXX, Discount: XXX
Vendor Name: C, ListPrice: XXX, Discount: XXX
Vendor Name: D, ListPrice: XXX, Discount: XXX
Most of the information about the master data will remain the same for a message belonging to one product (and will change less frequently), but the transaction data will always fluctuate. So the system will check whether the product 'XXX' exists in the system or not. If not, it will check whether the 'Category' mentioned with this product exists or not. If not, it will insert a new record for the category and then for the product. The same will be done for Manufacturer and the other master data.
Multiple vendors will be sending data about multiple products (2000-5000) at the same time.
So assume that we have 1000 suppliers, and each vendor is sending data about 10-15 different products. Every 2-3 seconds, each vendor sends us price updates for these 10 products. A vendor may start sending data about new products, but that will not be very frequent.
You would likely be best off with your #2 idea (i.e. sending all of the 15 - 20 entities to the DB in one shot using multiple TVPs and processing as a whole set of up to 2000 messages).
Caching master data lookups at the app layer and translating prior to sending to the DB sounds great, but misses something:
You are going to have to hit the DB to get the initial list anyway
You are going to have to hit the DB to insert new entries anyway
Looking up values in a dictionary to replace with IDs is exactly what a database does (assume a Non-Clustered Index on each of these name-to-ID lookups)
Frequently queried values will have their datapages cached in the buffer pool (which is a memory cache)
Why duplicate at the app layer what is already provided and happening right now at the DB layer, especially given:
The 15 - 20 entities can have up to 20k records (which is a relatively small number, especially when considering that the Non-Clustered Index only needs to be two fields: Name and ID which can pack many rows into a single data page when using a 100% Fill Factor).
Not all 20k entries are "active" or "current", so you don't need to worry about caching all of them. So whatever values are current will be easily identified as the ones being queried, and those data pages (which may include some inactive entries, but no big deal there) will be the ones to get cached in the Buffer Pool.
Hence, you don't need to worry about aging out old entries OR forcing any key expirations or reloads due to possibly changing values (i.e. updated Name for a particular ID) as that is handled naturally.
Yes, in-memory caching is wonderful technology and greatly speeds up websites, but those scenarios / use-cases are for when non-database processes request the same data over and over for purely read-only purposes. This particular scenario, however, is one in which data is being merged and the list of lookup values can change frequently (more so due to new entries than due to updated entries).
That all being said, Option #2 is the way to go. I have done this technique several times with much success, though not with 15 TVPs. It might be that some optimizations / adjustments need to be made to the method to tune this particular situation, but what I have found to work well is:
Accept the data via TVP. I prefer this over SqlBulkCopy because:
it makes for an easily self-contained Stored Procedure
it fits very nicely into the app code to fully stream the collection(s) to the DB without needing to copy the collection(s) to a DataTable first, which duplicates the collection and wastes CPU and memory. This requires that you create a method per collection that returns IEnumerable<SqlDataRecord>, accepts the collection as input, and uses yield return; to send each record in the for or foreach loop (see the sketch after this list)
TVPs are not great for statistics and hence not great for JOINing to (though this can be mitigated by using a TOP (@RecordCount) in the queries), but you don't need to worry about that anyway since they are only used to populate the real tables with any missing values
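A rough sketch of one such streaming method follows; the table type name dbo.CategoryList, the proc name and the single Name column are assumptions, and you would create one method like this per collection/TVP:
// requires: using System.Collections.Generic; using System.Data;
//           using System.Data.SqlClient; using Microsoft.SqlServer.Server;
private static IEnumerable<SqlDataRecord> GetCategoryRecords(IEnumerable<string> categories)
{
    var metaData = new[] { new SqlMetaData("Name", SqlDbType.NVarChar, 100) };

    foreach (string category in categories)
    {
        var record = new SqlDataRecord(metaData);
        record.SetString(0, category);
        yield return record;   // rows are streamed; the collection is never copied into a DataTable
    }
}

// usage: bind the iterator to a Structured parameter on the stored procedure call
using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand("dbo.ImportMessageBatch", connection))
{
    command.CommandType = CommandType.StoredProcedure;

    SqlParameter tvpParam = command.Parameters.Add("@Categories", SqlDbType.Structured);
    tvpParam.TypeName = "dbo.CategoryList";             // must match the table type defined in SQL Server
    tvpParam.Value = GetCategoryRecords(categories);    // note: the enumeration must yield at least one row

    connection.Open();
    command.ExecuteNonQuery();
}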
Step 1: Insert missing Names for each entity. Remember that there should be a NonClustered Index on the [Name] field for each entity, and assuming that the ID is the Clustered Index, that value will naturally be a part of the index, hence [Name] only will provide a covering index in addition to helping the following operation. And also remember that any prior executions for this client (i.e. roughly the same entity values) will cause the data pages for these indexes to remain cached in the Buffer Pool (i.e. memory).
;WITH cte AS
(
    SELECT DISTINCT tmp.[Name]
    FROM #EntityNumeroUno tmp
)
INSERT INTO EntityNumeroUno ([Name])
    SELECT cte.[Name]
    FROM cte
    WHERE NOT EXISTS (
                       SELECT *
                       FROM EntityNumeroUno tab
                       WHERE tab.[Name] = cte.[Name]
                     );
Step 2: INSERT all of the "messages" in simple INSERT...SELECT where the data pages for the lookup tables (i.e. the "entities") are already cached in the Buffer Pool due to Step 1
Finally, keep in mind that conjecture / assumptions / educated guesses are no substitute for testing. You need to try a few methods to see what works best for your particular situation since there might be additional details that have not been shared that could influence what is considered "ideal" here.
I will say that if the Messages are insert-only, then Vlad's idea might be faster. The method I am describing here I have used in situations that were more complex and required full syncing (updates and deletes) and did additional validations and creation of related operational data (not lookup values). Using SqlBulkCopy might be faster on straight inserts (though for only 2000 records I doubt there is much difference if any at all), but this assumes you are loading directly to the destination tables (messages and lookups) and not into intermediary / staging tables (and I believe Vlad's idea is to SqlBulkCopy directly to the destination tables). However, as stated above, using an external cache (i.e. not the Buffer Pool) is also more error prone due to the issue of updating lookup values. It could take more code than it's worth to account for invalidating an external cache, especially if using an external cache is only marginally faster. That additional risk / maintenance needs to be factored into which method is overall better for your needs.
UPDATE
Based on info provided in comments, we now know:
There are multiple Vendors
There are multiple Products offered by each Vendor
Products are not unique to a Vendor; Products are sold by 1 or more Vendors
Product properties are singular
Pricing info has properties that can have multiple records
Pricing info is INSERT-only (i.e. point-in-time history)
Unique Product is determined by SKU (or similar field)
Once created, a Product coming through with an existing SKU but different properties otherwise (e.g. category, manufacturer, etc) will be considered the same Product; the differences will be ignored
With all of this in mind, I will still recommend TVPs, but re-think the approach and make it Vendor-centric, not Product-centric. The assumption here is that Vendors send files whenever they like. So when you get a file, import it. The only lookup you would be doing ahead of time is the Vendor. Here is the basic layout:
Seems reasonable to assume that you already have a VendorID at this point because why would the system be importing a file from an unknown source?
You can import in batches
Create a SendRows method (a rough sketch follows this section) that:
accepts a FileStream or something that allows for advancing through a file
accepts something like int BatchSize
returns IEnumerable<SqlDataRecord>
creates a SqlDataRecord to match the TVP structure
loops through the FileStream until either BatchSize has been met or there are no more records in the file
perform any necessary validations on the data
map the data to the SqlDataRecord
call yield return;
Open the file
While there is data in the file
call the stored proc
pass in VendorID
pass in SendRows(FileStream, BatchSize) for the TVP
Close the file
Experiment with:
opening the SqlConnection before the loop around the FileStream and closing it after the loops are done
Opening the SqlConnection, executing the stored procedure, and closing the SqlConnection inside of the FileStream loop
Experiment with various BatchSize values. Start at 100, then 200, 500, etc.
The stored proc will handle inserting new Products
Using this type of structure you will be sending in Product properties that are not used (i.e. only the SKU is used for the lookup of existing Products). BUT, it scales very well as there is no upper bound regarding file size. If the Vendor sends 50 Products, fine. If they send 50k Products, fine. If they send 4 million Products (which is the system I worked on, and it did handle updating Product info that was different for any of its properties!), then fine. No increase in memory at the app layer or DB layer to handle even 10 million Products. The time the import takes should increase in step with the number of Products sent.
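Here is a rough sketch of the SendRows method and the surrounding import loop. The file format, column layout, proc name and TVP type name are all assumptions:
// requires: using System.Collections.Generic; using System.Data;
//           using System.Data.SqlClient; using System.IO; using Microsoft.SqlServer.Server;
private static IEnumerable<SqlDataRecord> SendRows(StreamReader reader, int batchSize)
{
    var metaData = new[]
    {
        new SqlMetaData("SKU",       SqlDbType.VarChar, 50),
        new SqlMetaData("ListPrice", SqlDbType.Decimal, 19, 4),
        new SqlMetaData("Discount",  SqlDbType.Decimal, 19, 4)
    };

    int sent = 0;
    string line;
    while (sent < batchSize && (line = reader.ReadLine()) != null)
    {
        string[] fields = line.Split(',');   // add real validation / parsing here
        var record = new SqlDataRecord(metaData);
        record.SetString(0, fields[0]);
        record.SetDecimal(1, decimal.Parse(fields[1]));
        record.SetDecimal(2, decimal.Parse(fields[2]));
        sent++;
        yield return record;
    }
}

// the import loop: one proc call per batch, reusing the same open connection
using (var reader = new StreamReader(filePath))
using (var connection = new SqlConnection(connectionString))
{
    connection.Open();
    while (!reader.EndOfStream)
    {
        using (var command = new SqlCommand("dbo.ImportVendorPrices", connection))
        {
            command.CommandType = CommandType.StoredProcedure;
            command.Parameters.AddWithValue("@VendorID", vendorId);

            SqlParameter tvp = command.Parameters.Add("@Prices", SqlDbType.Structured);
            tvp.TypeName = "dbo.VendorPriceRows";
            tvp.Value = SendRows(reader, batchSize: 500);   // experiment with the batch size

            command.ExecuteNonQuery();
        }
    }
}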
UPDATE 2
New details related to Source data:
comes from Azure EventHub
comes in the form of C# objects (no files)
Product details come in through O.P.'s system's APIs
is collected in a single queue (just pull the data out and insert it into the database)
If the data source is C# objects then I would most definitely use TVPs as you can send them over as is via the method I described in my first update (i.e. a method that returns IEnumerable<SqlDataRecord>). Send one or more TVPs for the Price/Offer per Vendor details but regular input params for the singular Property attributes. For example:
CREATE PROCEDURE dbo.ImportProduct
(
    @SKU             VARCHAR(50),
    @ProductName     NVARCHAR(100),
    @Manufacturer    NVARCHAR(100),
    @Category        NVARCHAR(300),
    @VendorPrices    dbo.VendorPrices READONLY,
    @DiscountCoupons dbo.DiscountCoupons READONLY
)
AS
SET NOCOUNT ON;

-- Insert Product if it doesn't already exist
IF (NOT EXISTS(
        SELECT *
        FROM dbo.Products pr
        WHERE pr.SKU = @SKU
       )
   )
BEGIN
    INSERT INTO dbo.Products (SKU, ProductName, Manufacturer, Category, ...)
    VALUES (@SKU, @ProductName, @Manufacturer, @Category, ...);
END;
...INSERT data from TVPs
-- might need OPTION (RECOMPILE) per each TVP query to ensure proper estimated rows
From a DB point of view, there's nothing faster than BULK INSERT (from CSV files, for example). The best approach is to bulk load all the data as soon as possible, then process it with stored procedures.
A C# layer will just slow down the process, since all the queries between C# and SQL will be thousands of times slower than what SQL Server can handle directly.

Parsing and inserting bulk data. How to keep performance and do relations?

The data
I have a collection with around 300,000 vacations. Every vacation has several categories, countries, cities, activities and other subobjects. This data needs to be inserted into a MySQL / SQL Server database. I have the luxury of being able to truncate the entire database and start clean every time the parser program is run.
What I have tried
I have tried working with Entity Framework, and this is also where my preference lies. To keep Entity Framework's performance up I have created a construction where 300 items are taken out of the vacations collection, parsed and inserted by Entity Framework, and its context disposed thereafter. The program finishes in a matter of minutes using this method. If I fill the context with all 300k vacations from the collection (and their subobjects) it's a matter of hours.
int total = vacationsObjects.Count;
for (int i = 0; i < total; i += Math.Min(300, (total - i)))
{
    var set = vacationsObjects.Skip(i).Take(300);
    int enumerator = 0;
    using (var database = InitializeContext())
    {
        foreach (VacationModel vacationData in set)
        {
            enumerator++;
            Vacations vacation = new Vacations
            {
                ProductId = vacationData.ExternalId,
                Name = vacationData.Name,
                Description = vacationData.Description,
                Price = vacationData.Price,
                Url = vacationData.Url,
            };
            foreach (string category in vacationData.Categories)
            {
                var existingCategory = database.Categories.Local.FirstOrDefault(c => c.CategoryName == category);
                if (existingCategory != null)
                    vacation.Categories.Add(existingCategory);
                else
                {
                    vacation.Categories.Add(new Category
                    {
                        CategoryName = category
                    });
                }
            }
            database.Vacations.Add(vacation);
        }
        database.SaveChanges();
    }
}
The downside (and possibly dealbreaker) with this method is figuring out the relationships. As you can see when adding a Category I check if it's already been created in the local context, and then use that. But what if it has been added in a previous set of 300? I don't want to query the database multiple times for every vacation to check whether an entity already resides within it.
Possible solution
I could keep a dictionary in memory containing the categories that have been added. I'd need to figure out how to attach these categories to the proper vacations (or vice versa) and insert them, including their respective relations, into the database.
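A rough sketch of that idea follows, assuming EF 6 style DbContext code and that Category ids are generated by the database on SaveChanges. It is the inner category loop from the code above, with the in-memory dictionary reused across batches:
// requires: using System.Data.Entity;
var categoryCache = new Dictionary<string, Category>(StringComparer.OrdinalIgnoreCase);   // lives outside the batch loop

// inside the batch loop, for each vacation:
foreach (string categoryName in vacationData.Categories)
{
    Category category;
    if (!categoryCache.TryGetValue(categoryName, out category))
    {
        category = new Category { CategoryName = categoryName };
        categoryCache[categoryName] = category;              // first time seen: will be INSERTed with the vacation
    }
    else if (database.Entry(category).State == EntityState.Detached)
    {
        // saved by an earlier batch/context: attach it so EF treats it as an existing row
        database.Categories.Attach(category);
    }

    vacation.Categories.Add(category);
}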
Possible alternatives
Segregate the context and the transaction -
Purely theoretical, I do not know if I'm making any sense here. Maybe I could have EF's context keep track of all objects and take manual control over the inserting part. I have messed around with this, trying to work with manual transaction scopes, to no avail.
Stored procedure -
I could write a stored procedure that handles and inserts my data. I'm not a big fan of this alternative, as I would like to keep the flexibility of switching between MySQL and SQL Server. Also, I would be in the dark as to where to begin.
Intermediary CSV file -
Instead of inserting parsed data directly into the RDBMS, I could export it into one or more CSV files and make use of importing tools such as MySQL's LOAD DATA INFILE.
Alternative database systems
Databases such as Azure Table Storage, MongoDB or RavenDB could be an option. However, I would prefer to stick to a traditional RDBMS due to compatibility with my skillset and tools.
I have been working on and researching this problem for a couple of weeks now. It seems like the best way of finding a solution that fits is by simply trying the different possibilities and observing the result. I was hoping that I could receive some pointers or tips from your personal experiences.
If you insert each record separately, the whole operation will take a lot of time. The bottleneck is the SQL queries between client and server. Each query takes time, so try to avoid using many of them. For huge amounts of data it will be much better to process it locally. The best solution is to use a special import tool. In MySQL you can use LOAD DATA, in MSSQL there is BULK INSERT. To import your data, you need a .csv file.
To handle foreign keys correctly, you must populate the tables manually before inserting. If the destination tables are empty, you can simply create the .csv file with predefined primary and foreign keys. Otherwise you can import existing records from the server, update them with your data, then export them back.
Time
Since you can afford to make only INSERTs, one suggestion is to try the Entity Framework Bulk Insert extension. I have used it to save up to 200K records and it works fine. Just include it in your project and write something like this:
context.BulkInsert(listOfEntities);
This should solve (or greatly improve over the plain EF version) the time dimension of your problem.
Data integrity
Keeping everything in one transaction does not sound reasonable (I expect those 300K parent records to generate at least 3M records overall), so I would try the following approach:
1) make your entities insertion using bulk insert.
2) call a stored procedure to check data integrity
If the insertion takes quite long and the chance of failure is relatively high, you can check what is already loaded and have the process skip it:
1) make smaller bulk inserts for a batch of vacation records and all their child records. Ensure that each batch runs in a transaction. One BULK INSERT runs atomically (no transaction needed); for several of them it seems tricky.
2) if the process fails, you have complete vacation data in your database (no partially imported vacations)
3) restart the process, but load the existing vacation records (parents only) first. Using EF, a faster way is to use AsNoTracking to spare the tracking overhead (which is great for large lists):
var existingVacations = context.Vacation.Select(v => v.VacationSourceIdentifier).AsNoTracking();
As suggested by Alexei, EntityFramework.BulkInsert is a very good solution if your model is supported by this library.
You can also use Entity Framework Extensions (PRO Version), which allows you to use BulkSaveChanges and bulk operations (Insert, Update, Delete and Merge).
It supports both of your providers: MySQL and SQL Server.
// Upgrade SaveChanges performance with BulkSaveChanges
var context = new CustomerContext();
// ... context code ...
// Easy to use
context.BulkSaveChanges();
// Easy to customize
context.BulkSaveChanges(operation => operation.BatchSize = 1000);
// Use direct bulk operation
context.BulkInsert(customers);
Disclaimer: I'm the owner of the project Entity Framework Extensions
