READ FIRST before answering!
I have a RESTful service which wraps around the Entity Framework. Basically, all I did was create a database, add relations between the tables, create an Entity model around this database and finally expose the whole thing as a RESTful *.svc service. This is done, cannot be changed either.
Now I need to query data from it through a client application. All I can access is the service itself. I cannot add any server-side code, even if I wanted to. The server is now locked.
I need to retrieve data from a table called "ProductVoorwaarden" (product conditions) which is linked to three other tables. (Rubriek, Categorie and Datatype.) This data needs to be returned as XML, with a root node called "PRODUCTVOORWAARDEN" and every record in it's own XElement called "REC". Within this REC, there's an attribute for every field in the table plus references to related tables. Here's the code I have right now:
XElement PRODUCTVOORWAARDEN()
{
XElement Result = new XElement("PRODUCTVOORWAARDEN");
var Brondata = COBA.Productvoorwaarden.OrderBy(O => O.Code);
foreach (var item in Brondata)
{
COBA.LoadProperty(item, "Rubriek");
COBA.LoadProperty(item, "Categorie");
COBA.LoadProperty(item, "Datatype");
XElement REC = new XElement("REC",
Attribute("Rubriek", item.Rubriek.Code),
Attribute("Categorie", item.Categorie.Naam),
Attribute("Code", item.Code),
Attribute("Datatype", item.Datatype.Naam),
Attribute("Eenheid", item.Eenheid),
Attribute("Naam", item.Naam),
Attribute("Omschrijving", item.Omschrijving),
Attribute("UitgebreideTekstVeld", item.UitgebreideTekstVeld),
Attribute("Veld", item.Veld)
);
Result.Add(REC);
}
return Result;
}
This code works fine, but it's slow. It reads all ProductVoorwaarden records but then it has to make round-trips to the server again for every record to retrieve Rubriek.Code, Categorie.Naam and Datatype.Naam. (In the database, these relations are set by an auto-incremental Identity field but the XML code uses Code or Naam as reference.)
As you can imagine, every trip back to the RESTful service just eats up more time, which I'm trying to avoid. So is there any way to speed this all up a bit more just on the client-side?
The server is still under development and the next release will take a few more months. As a result, I have to deal with the options that the server provides right now. If there's no way to speed this up without modifying the server then fine. At least I've tried. There are 35 more tables that need to be processed with a deadline in a few days so if it works, then it works.
You could make each of your COBA.LoadProperty calls asynchronous and run them in parallel rather than sequentially. It will make your client code more complex since you'll have to handle the return of each async call and determine when they have all completed and you're ready to build your XML. Assuming each of your 4 REST calls is taking the same amount of time that would reduce the delay by half.
You've probably already double checked but I have come across cases where generating the enumerator from the lambda expression can be expensive. Still it was in the hundreds of milliseconds and I get the impression your delay is larger than that. May be worth checking.
Related
This question has probably been asked correctly before, and I'll gladly accept an answer pointing me to the right spot. The problem is I don't know how to ask the question correctly to get anything returned in a search.
I'm trying to pull data from a 3rd party api (ADP) and store data in my database using asp.net core.
I am wanting to take the users returned from the API and store them in my database, where I have an ADP ancillary table seeded with the majority of the data from the api.
I would then like to update or add any missing or altered records in my database FROM the API.
I'm thinking that about using an ajax call to the api to retrieve the records, then either storing the data to another table and using sql to look for records that are changed between the two tables and making any necessary changes(this would be manually activated via a button), or some kind of scheduled background task to perform this through methods in my c# code instead of ajax.
The question I have is:
Is it a better fit to do this as a stored procedure in sql or rather have a method in my web app perform the data transformation.
I'm looking for any examples of iterating through the returned data and updating/creating records in my database.
I've only seen vague not quite what I'm looking for examples and nothing definitive on the best way to accomplish this. If I can find any reference material or examples, I'll gladly research but I don't even know where to start, or the correct terms to search for. I've looked into model binding, ajax calls, json serialization & deserialization. I'm probably overthinking this.
Any suggestions or tech I should look at would be appreciated. Thanks for you time in advance.
My app is written in asp.net core 2.2 using EF Core
* EDIT *
For anyone looking - https://learn.microsoft.com/en-us/dotnet/csharp/tutorials/console-webapiclient
This with John Wu's Answer helped me achieve what I was looking for.
If this were my project this is how I would break down the tasks, in this order.
First, start an empty console application.
Next, write a method that gets the list of users from the API. You didn't tell us anything at all about the API, so here is a dummy example that uses an HTTP client.
public async Task<List<User>> GetUsers()
{
var client = new HttpClient();
var response = await client.GetAsync("https://SomeApi.com/Users");
var users = await ParseResponse(response);
return users.ToList();
}
Test the above (e.g. write a little shoestring code to run it and dump the results, or something) to ensure that it works independently. You want to make sure it is solid before moving on.
Next, create a temporary table (or tables) that matches the schema of the data objects that are returned from the API. For now you will just want to store it exactly the way you retrieve it.
Next, write some code to insert records into the table(s). Again, test this independently, and review the data in the table to make sure it all worked correctly. It might look a little like this:
public async Task InsertUser(User user)
{
using (var conn = new SqlConnection(Configuration.ConnectionString))
{
var cmd = new SqlCommand();
//etc.
await cmd.ExecuteNonQueryAsync();
}
}
Once you know how to pull the data and store it, you can finish the code to extract the data from the API and insert it. It might look a little like this:
public async Task DoTheMigration()
{
var users = await GetUsers();
var tasks = users.Select
(
u => InsertUser(u)
);
await Task.WhenAll(tasks.ToArray());
}
As a final step, write a series of stored procedures or a DTS package to move the data from the temp tables to their final resting place. If you are using MS Access, you can write a series of queries and execute them in order with some VBA. At a high level it would:
Check for any records that exist in the temp table but not in the final table and insert them into the final table.
Check for any records that exist in the final table but not the temp table and remove them or mark them as deleted.
Check for any records in common that have different column values and update the final table.
Each of these development activities raises it own set of questions, of course, which you can post back to StackOverflow with details. As it is your question doesn't have enough specificity for a more in-depth answer.
The data
I have a collection with around 300,000 vacations. Every vacation has several categories, countries, cities, activities and other subobjects. This data needs to be inserted into a MySQL / SQL Server database. I have the luxury of being able to truncate the entire database and start clean every time the parser program is run.
What I have tried
I have tried working with Entity Framework, this is also where my preference lies. To keep Entity Framework's performance up I have created a construction where 300 items are taken out of the vacations collection, parsed and inserted by Entity Framework and it's context disposed thereafter. The program finishes in a matter of minutes using this method. If I fill the context with all 300k vacations from the collection (and it's subobjects) it's a matter of hours.
int total = vacationsObjects.Count;
for (int i = 0; i < total; i += Math.Min(300, (total - i)))
{
var set = vacationsObjects.Skip(i).Take(300);
int enumerator = 0;
using (var database = InitializeContext())
{
foreach (VacationModel vacationData in set)
{
enumerator++;;
Vacations vacation = new Vacations
{
ProductId = vacationData.ExternalId,
Name = vacationData.Name,
Description = vacationData.Description,
Price = vacationData.Price,
Url = vacationData.Url,
};
foreach (string category in vacationData.Categories)
{
var existingCategory = database.Categories.Local.FirstOrDefault(c => c.CategoryName == categor);
if (existingCategory != null)
vacation.Categories.Add(existingCategory);
else
{
vacation.Categories.Add(new Category
{
CategoryName = category
});
}
}
database.Vacations.Add(vacation);
}
database.SaveChanges();
}
}
The downside (and possibly dealbreaker) with this method is figuring out the relationships. As you can see when adding a Category I check if it's already been created in the local context, and then use that. But what if it has been added in a previous set of 300? I don't want to query the database multiple times for every vacation to check whether an entity already resides within it.
Possible solution
I could keep a dictionary in memory containing the categories that have been added. I'd need to figure out how to attach these categories to the proper vacations (or vice-versa) and insert them, including their respective relations into the database.
Possible alternatives
Segregate the context and the transaction -
Purely theoretical, I do not know if I'm making any sense here. Maybe I could have EF's context keep track of all objects, and take manual control over the inserting part. I have messed around with this, trying to work with manual transaction scopes without avail.
Stored procedure -
I could write a stored procedure that handles and inserts my data. I'm not a big fan of this alternative, as I would like to keep the flexibility of switching between MySQL and SQL Server. Also, I would be in the dark as to where to begin.
Intermediary CSV file -
Instead of inserting parsed data directly into the RDMBS, I could export it into one or more CSV files and make use of importing tools such as MySQL's INFLINE.
Alternative database systems
Databases such as Azure Table Storage, MongoDB or RavenDB could be an option. However, I would prefer to stick to a traditional RDMBS due to compatibility with my skillset and tools.
I have been working on and researching this problem for a couple of weeks now. It seems like the best way of finding a solution that fits is by simply trying the different possibilities and observing the result. I was hoping that I could receive some pointers or tips from your personal experiences.
If you insert each record separately, the whole operation will take a lot of time. The bottleneck is SQL-queries between client and server. Each query takes time, so try to avoid using multiple of them. For huge amount of data it will be much better to process them locally. The best solution is to use special import tool. In MySQL you can use LOAD DATA, in MSSQL there is BULK INSERT. To import your data, you need a .css file.
To handle external keys correctly, you must populate tables manually before inserting. If destination tables are empty, you can simply create .css file with predefined primary and external keys. Otherwise you can import existing records from server, update them with your data, then export them back.
Time
Since you can afford to make only INSERTs, one suggestion is to try Entity Framework Bulk Insert extension. I have used it to save up to 200K records and it works fine. Just include in your project and write something like this:
context.BulkInsert(listOfEntities);
This should solve (or greatly improve the EF version) your problem's the time dimension
Data integrity
Keeping everything in one transaction does not sound reasonable (I expect that 300K parent records to generate at least 3M overall records), so I would try the following approach:
1) make your entities insertion using bulk insert.
2) call a stored procedure to check data integrity
If the insertion is quite long and the chance of failure is relatively big, you can load what is already loaded and have the process skip what is already loaded:
1) make smaller bulk inserts for a batch of vacation records and all its children records. Ensure that it runs in a transaction. One BULK INSERT is run atomically (no transaction needed), for several it seems tricky.
2) if the process fails, you have complete vacation data in your database (no partially imported vacation)
3) retake the process, but load existing vacation records (parents only). Using EF, a faster way is using AsNoTracking to spare the tracking overhead (which is great for large lists)
var existingVacations = context.Vacation.Select(v => v.VacationSourceIdentifier).AsNoTracking();
As suggested by Alexei, EntityFramework.BulkInsert is a very good solution if your model is supported by this library.
You can also use Entity Framework Extensions (PRO Version) which allow to use BulkSaveChanges and Bulk Operations (Insert, Update, Delete and Merge).
It's support your both provider: MySQL and SQL Server
// Upgrade SaveChanges performance with BulkSaveChanges
var context = new CustomerContext();
// ... context code ...
// Easy to use
context.BulkSaveChanges();
// Easy to customize
context.BulkSaveChanges(operation => operation.BatchSize = 1000);
// Use direct bulk operation
context.BulkInsert(customers);
Disclaimer: I'm the owner of the project Entity Framework Extensions
I am trying to sync a couple of tables from a central (sql-server) database dbA to a local (sql-ce) client database dbB. To this I'm planning to use the Microsoft sync framework. And I have managed to get it working so far as to sync all the rows (or updated rows) in some specified tables.
However, my dbA contains very much data, and it is also growing, so I don't want the customers to have a full copy of the dbA in their local databases. Instead I just want them to have some parts of the database (some rows that depends on some expression, for instance from tableA just select rows where moduleId = 4)
So initially I thought that filters seemed to be the thing for me. Specifically parameter based filters where I can specify a parameter to the filter. However after some investigation it appears that when creating a parameter based filter, you are actually just creating a template, and then creating different scopes from that.
I can't describe exactly what my application does, but you could think of it almost as a version handling system, and the parameter would represent what version I want to select from. So you can see that it would become a large number of scopes.
So in the end, what are my options?
Could you maybe use a SQL function for the filter-parameter so this can change?
Something like
serverTemplate.Tables["tableA"].AddFilterColumn("ModuleId");
serverTemplate.Tables["tableA"].FilterClause = "[side].[ModuleId] = sf_GetCurrentModule()";
You could create a SQL function sf_GetCurrentModule() to return the ModuleId to be synched. If the rows to be synched vary from client to client, you would at least need a synch scope per client - which is not a problem, just use an unique name for the scope, and maybe pass an id from the client to the function:
serverTemplate.Tables["tableA"].FilterClause = string.Format("[side].[ModuleId] = sf_GetCurrentModule({0})", ClientId);
So you could return the ModuleId to be synched depending on the client.
Hope this helps.
Travis
I've got this piece of caching code
public static ICollection<Messages> GetMessages()
{
if (System.Web.HttpContext.Current.Cache["GetMessages_" + user_id] == null)
{
using (DataContext db = new DataContext())
{
var msgs = (from m in db.Messages
where m.user_id == user_id
&& m.date_deleted == null
select m).ToList();
System.Web.HttpContext.Current.Cache.Insert(
"GetMessages_" + user_id, msgs, null,
DateTime.Now.AddMinutes(5), TimeSpan.Zero);
}
}
return System.Web.HttpContext.Current.Cache["GetMessages_" + user_id]
as ICollection<Messages>;
}
The first time it runs, it pulls the data from a SQL table, and takes about 1500 ms. Every subsequent call takes about 600ms. The collection i'm testing on currently contains just 3 objects, each with minimal data (a string, 3 datetime fields, 3 bools and 5 ints)
Is this normal? loading a page with this tiny amount of data on it takes almost 2 seconds, every single time.
[FYI this is just running on a dev machine, not a fully fledged web server. data is being pulled from a remote server but that should only affect the initial page load]
Please remember that when you create first Context in EF code first approach - EF has to rebuild model from metadata which slow down first invokation.
As noted by HackedByChinese, this was running in the VS development web server which of course slowed things down. However, I did find that my programming style was lending itself to very slow load times.
I used the method above to load a cached collection of messages, and also similar methods to pull back other entities from the database. One such item was a 'Settings' object.
On the settings page, I would call the function each time I wanted to return a property. For instance Cache.GetSettings().username, then Cache.GetSettings().useremail etc. I assumed that this would make the app run a little faster, as it's retrieving the object from the cache each time, without needing to access the database. But obviously, cache is not as fast as memory. Each call was taking about half a second (on the local server) and maybe .1s or .2s on the remote server. I realised if I set this to a variable...
var settings = Cache.GetSettings();
and referenced that instead, load times dropped significantly. There was a single 0.5s load at this line, and then every subsequent reference to settings took no time at all.
So to sum up, yes, cache can be slow, but only if you use it in a stupid manner like I did!
I have a CSV file and I have to insert it into a SQL Server database. Is there a way to speed up the LINQ inserts?
I've created a simple Repository method to save a record:
public void SaveOffer(Offer offer)
{
Offer dbOffer = this.db.Offers.SingleOrDefault (
o => o.offer_id == offer.offer_id);
// add new offer
if (dbOffer == null)
{
this.db.Offers.InsertOnSubmit(offer);
}
//update existing offer
else
{
dbOffer = offer;
}
this.db.SubmitChanges();
}
But using this method, the program is way much slower then inserting the data using ADO.net SQL inserts (new SqlConnection, new SqlCommand for select if exists, new SqlCommand for update/insert).
On 100k csv rows it takes about an hour vs 1 minute or so for the ADO.net way. For 2M csv rows it took ADO.net about 20 minutes. LINQ added about 30k of those 2M rows in 25 minutes. My database has 3 tables, linked in the dbml, but the other two tables are empty. The tests were made with all the tables empty.
P.S. I've tried to use SqlBulkCopy, but I need to do some transformations on Offer before inserting it into the db, and I think that defeats the purpose of SqlBulkCopy.
Updates/Edits:
After 18hours, the LINQ version added just ~200K rows.
I've tested the import just with LINQ inserts too, and also is really slow compared with ADO.net. I haven't seen a big difference between just inserts/submitchanges and selects/updates/inserts/submitchanges.
I still have to try batch commit, manually connecting to the db and compiled queries.
SubmitChanges does not batch changes, it does a single insert statement per object. If you want to do fast inserts, I think you need to stop using LINQ.
While SubmitChanges is executing, fire up SQL Profiler and watch the SQL being executed.
See question "Can LINQ to SQL perform batch updates and deletes? Or does it always do one row update at a time?" here: http://www.hookedonlinq.com/LINQToSQLFAQ.ashx
It links to this article: http://www.aneyfamily.com/terryandann/post/2008/04/Batch-Updates-and-Deletes-with-LINQ-to-SQL.aspx that uses extension methods to fix linq's inability to batch inserts and updates etc.
Have you tried wrapping the inserts within a transaction and/or delaying db.SubmitChanges so that you can batch several inserts?
Transactions help throughput by reducing the needs for fsync()'s, and delaying db.SubmitChanges will reduce the number of .NET<->db roundtrips.
Edit: see http://www.sidarok.com/web/blog/content/2008/05/02/10-tips-to-improve-your-linq-to-sql-application-performance.html for some more optimization principles.
Have a look at the following page for a simple walk-through of how to change your code to use a Bulk Insert instead of using LINQ's InsertOnSubmit() function.
You just need to add the (provided) BulkInsert class to your code, make a few subtle changes to your code, and you'll see a huge improvement in performance.
Mikes Knowledge Base - BulkInserts with LINQ
Good luck !
I wonder if you're suffering from an overly large set of data accumulating in the data-context, making it slow to resolve rows against the internal identity cache (which is checked once during the SingleOrDefault, and for "misses" I would expect to see a second hit when the entity is materialized).
I can't recall 100% whether the short-circuit works for SingleOrDefault (although it will in .NET 4.0).
I would try ditching the data-context (submit-changes and replace with an empty one) every n operations for some n - maybe 250 or something.
Given that you're calling SubmitChanges per isntance at the moment, you may also be wasting a lot of time checking the delta - pointless if you've only changed one row. Only call SubmitChanges in batches; not per record.
Alex gave the best answer, but I think a few things are being over looked.
One of the major bottlenecks you have here is calling SubmitChanges for each item individually. A problem I don't think most people know about is that if you haven't manually opened your DataContext's connection yourself, then the DataContext will repeatedly open and close it itself. However, if you open it yourself, and then close it yourself when you're absolutely finished, things will run a lot faster since it won't have to reconnect to the database every time. I found this out when trying to find out why DataContext.ExecuteCommand() was so unbelievably slow when executing multiple commands at once.
A few other areas where you could speed things up:
While Linq To SQL doesn't support your straight up batch processing, you should wait to call SubmitChanges() until you've analyzed everything first. You don't need to call SubmitChanges() after each InsertOnSubmit call.
If live data integrity isn't super crucial, you could retrieve a list of offer_id back from the server before you start checking to see if an offer already exists. This could significantly reduce the amount of times you're calling the server to get an existing item when it's not even there.
Why not pass an offer[] into that method, and doing all the changes in cache before submitting them to the database. Or you could use groups for submission, so you don't run out of cache. The main thing would be how long till you send over the data, the biggest time wasting is in the closing and opening of the connection.
Converting this to a compiled query is the easiest way I can think of to boost your performance here:
Change the following:
Offer dbOffer = this.db.Offers.SingleOrDefault (
o => o.offer_id == offer.offer_id);
to:
Offer dbOffer = RetrieveOffer(offer.offer_id);
private static readonly Func<DataContext, int> RetrieveOffer
{
CompiledQuery.Compile((DataContext context, int offerId) => context.Offers.SingleOrDefault(o => o.offer_id == offerid))
}
This change alone will not make it as fast as your ado.net version, but it will be a significant improvement because without the compiled query you are dynamically building the expression tree every time you run this method.
As one poster already mentioned, you must refactor your code so that submit changes is called only once if you want optimal performance.
Do you really need to check if the record exist before inserting it into the DB. I thought it looked strange as the data comes from a csv file.
P.S. I've tried to use SqlBulkCopy,
but I need to do some transformations
on Offer before inserting it into the
db, and I think that defeats the
purpose of SqlBulkCopy.
I don't think it defeat the purpose at all, why would it? Just fill a simple dataset with all the data from the csv and do a SqlBulkCopy. I did a similar thing with a collection of 30000+ rows and the import time went from minutes to seconds
I suspect it isn't the inserting or updating operations that are taking a long time, rather the code that determines if your offer already exists:
Offer dbOffer = this.db.Offers.SingleOrDefault (
o => o.offer_id == offer.offer_id);
If you look to optimise this, I think you'll be on the right track. Perhaps use the Stopwatch class to do some timing that will help to prove me right or wrong.
Usually, when not using Linq-to-Sql, you would have an insert/update procedure or sql script that would determine whether the record you pass already exists. You're doing this expensive operation in Linq, which certainly will never hope to match the speed of native sql (which is what's happening when you use a SqlCommand and select if the record exists) looking-up on a primary key.
Well you must understand linq creates code dynamically for all ADO operations that you do instead handwritten, so it will always take up more time then your manual code. Its simply an easy way to write code but if you want to talk about performance, ADO.NET code will always be faster depending upon how you write it.
I dont know if linq will try to reuse its last statement or not, if it does then seperating insert batch with update batch may improve performance little bit.
This code runs ok, and prevents large amounts of data:
if (repository2.GeoItems.GetChangeSet().Inserts.Count > 1000)
{
repository2.GeoItems.SubmitChanges();
}
Then, at the end of the bulk insertion, use this:
repository2.GeoItems.SubmitChanges();