I'm making an API call with C#, getting JSON back, breaking it down into nested objects, breaking each object into fields, and putting the fields into a SQL Server table.
There is one field (OnlineURL) which should be unique.
What is an efficient way of achieving this goal? I currently make a database call for every nested object I pull out of the JSON and then use an if statement. But this is not efficient.
Database Layer
Creating a unique index/constraint for the OnlineURL field in the database will enforce the field being unique no matter what system/codebase references it. This will result in applications getting errors when inserting new records whose OnlineURL already exists, or when updating record X to an OnlineURL that is already used by record Y.
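A minimal sketch of that database-layer enforcement run from C#; the table name dbo.Articles and the index name are assumptions for illustration, not something from the question:

    using System.Data.SqlClient;

    static class UniqueIndexSetup
    {
        public static void EnsureUniqueIndex(string connectionString)
        {
            // Create the unique index only if it does not already exist.
            const string ddl = @"
                IF NOT EXISTS (SELECT 1 FROM sys.indexes
                               WHERE name = 'UX_Articles_OnlineURL'
                                 AND object_id = OBJECT_ID('dbo.Articles'))
                    CREATE UNIQUE INDEX UX_Articles_OnlineURL ON dbo.Articles (OnlineURL);";

            using (var connection = new SqlConnection(connectionString))
            using (var command = new SqlCommand(ddl, connection))
            {
                connection.Open();
                command.ExecuteNonQuery();
            }
        }
    }

Bear in mind that a unique index key has to fit within SQL Server's index key size limit, so a very wide URL column may need to be trimmed or hashed first.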
Application Layer
What is the rule when the OnlineURL already exists? Do you reject the data? Do you update the matching row? Maybe you want to leverage a stored procedure that inserts a new row based on OnlineURL or updates the existing one. This turns a 2-query process into a single query, which makes a difference for large-scale inserts.
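For example, a minimal sketch of that single-query upsert written as an inline MERGE rather than a stored procedure; the dbo.Articles table and its Title column are assumptions for illustration:

    using System.Data.SqlClient;

    static class ArticleUpsert
    {
        private const string UpsertSql = @"
            MERGE dbo.Articles AS target
            USING (SELECT @OnlineURL AS OnlineURL, @Title AS Title) AS source
                ON target.OnlineURL = source.OnlineURL
            WHEN MATCHED THEN
                UPDATE SET Title = source.Title
            WHEN NOT MATCHED THEN
                INSERT (OnlineURL, Title) VALUES (source.OnlineURL, source.Title);";

        public static void Upsert(SqlConnection connection, string onlineUrl, string title)
        {
            // One round trip per record: insert if the URL is new, otherwise update.
            using (var command = new SqlCommand(UpsertSql, connection))
            {
                command.Parameters.AddWithValue("@OnlineURL", onlineUrl);
                command.Parameters.AddWithValue("@Title", title);
                command.ExecuteNonQuery();
            }
        }
    }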
Assuming your application is serial and is the only one working against the database, you could also keep a local cache of OnlineURLs for use during your loop: read the list in from the database once, check each incoming record against it, and add each new OnlineURL you insert to the list. Reading in the initial list takes only a single query, and each comparison is done in memory.
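A minimal sketch of that cache, assuming a serial importer and the same illustrative dbo.Articles table:

    using System;
    using System.Collections.Generic;
    using System.Data.SqlClient;

    class OnlineUrlCache
    {
        private readonly HashSet<string> _knownUrls =
            new HashSet<string>(StringComparer.OrdinalIgnoreCase);

        // One query up front; every later check is an in-memory lookup.
        public void Load(SqlConnection connection)
        {
            using (var command = new SqlCommand("SELECT OnlineURL FROM dbo.Articles", connection))
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                    _knownUrls.Add(reader.GetString(0));
            }
        }

        // Returns true if the URL has not been seen before and records it as used.
        public bool TryAdd(string onlineUrl)
        {
            return _knownUrls.Add(onlineUrl);
        }
    }

During the import loop you would call TryAdd for each incoming OnlineURL and only issue an INSERT when it returns true.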
Create a unique index on that field and it will be.
Uniqueness has to be checked, and that can't be done without querying the data; in other words, the entire column has to be checked. Your first option is to speed that check up with an index, using a fill factor of 80 so you can avoid unnecessary page splits caused by the inserts.
Another option is caching, and the right approach depends on your setup.
You could load the entire column into memory and check for uniqueness there, or you could use a distributed cache like Redis. Either way, analyze the complexity and cost, and you'll probably find that the index is the most ergonomic option.
Related
I've hit a wall when it comes to adding a new entity object (a regular SQL table) to the Data Context using LINQ-to-SQL. This isn't about the drag-and-drop method that is cited regularly across many other threads; that method has worked repeatedly without issue.
The end goal is relatively simple. I need to find a way to add a table that gets created during runtime via stored procedure to the current Data Context of the LINQ-to-SQL dbml file. I'll then need to be able to use the regular LINQ query methods/extension methods (InsertOnSubmit(), DeleteOnSubmit(), Where(), Contains(), FirstOrDefault(), etc...) on this new table object through the existing Data Context. Essentially, I need to find a way to procedurally create the code that would otherwise be automatically generated when you do use the drag-and-drop method during development (when the application isn't running), but have it generate this same code while the application is running via command and/or event trigger.
More Detail
There's one table that gets used a lot and, over the course of an entire year, collects many thousands of rows. Each row contains a timestamp and this table needs to be divided into multiple tables based on the year that the row was added.
Current Solution (using one table)
Single table with tens of thousands of rows which are constantly queried against.
Table is added to Data Context during development using drag-and-drop, so there are no additional coding issues
Significant performance decrease over time
Goals (using multiple tables)
(Complete) While the application is running, use C# code to check if a table for the current year already exists. If it does, no action is taken. If not, a new table gets created using a stored procedure with the current year as a prefix on the table name (2017_TableName, 2018_TableName, 2019_TableName, and so on...).
(Incomplete) While the application is still running, add the newly created table to the active LINQ-to-SQL Data Context (the same code that would otherwise be added using drag-and-drop during development).
(Incomplete) Run regular LINQ queries against the newly added table.
Final Thoughts
Beyond the above, my only other concern is how to write C# code that references a table that may or may not already exist. Is it possible to use a variable in place of the standard 'DB_DataContext.2019_TableName' methodology in order to actually get the table's data into a UI control? Is there a way to simply create an Enumerable of all the tables whose names are prefixed with a year and then select the most current one?
From what I've read so far, the most likely solution seems to involve the use of a SQL add-on like SQLMetal or Huagati which (based solely on what I've read) will generate the code I need at runtime and update the corresponding dbml file. I have no experience using these types of add-ons, so any additional insight into them would be appreciated.
Lastly, I've seen some references to LINQ-to-Entities and/or LINQ-to-Objects. Would these be the components I'm looking for?
Thanks for reading through a rather lengthy first post. Any comments/criticisms are welcome.
The simplest way to achieve what you want is to redirect in SQL Server and leave your client code alone. At design time, create your L2S DataContext or EF DbContext referencing a database with only a single table. Then at run time, substitute a view or synonym for that table that points to the "current year" table.
HOWEVER, this should not be necessary in the first place. SQL Server supports partitioning, so you can store all the data in physically separate data structures but have a single logical table. And SQL Server supports columnstore tables, which can compress and store many millions of rows with excellent performance.
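A minimal sketch of the synonym redirect described above, assuming the data context was mapped against a logical name such as dbo.TableName and that the physical tables follow the 2019_TableName naming from the question; the helper itself is illustrative:

    using System.Data.SqlClient;

    static class YearlyTableRedirect
    {
        // Re-points the dbo.TableName synonym (which the L2S/EF model was mapped
        // against at design time) at the physical table for the given year.
        public static void PointSynonymAtYear(SqlConnection connection, int year)
        {
            string sql = @"
                IF OBJECT_ID('dbo.TableName', 'SN') IS NOT NULL
                    DROP SYNONYM dbo.TableName;
                EXEC('CREATE SYNONYM dbo.TableName FOR dbo.[" + year + @"_TableName]');";

            using (var command = new SqlCommand(sql, connection))
            {
                command.ExecuteNonQuery(); // assumes the connection is already open
            }
        }
    }

Because a synonym is just a metadata redirection, the client code keeps querying the same logical name while the physical table changes underneath it.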
I have a scenario where I need to synchronize a database table with a list (XML) from an external system.
I am using EF but am not sure which would be the best way to achieve this in terms of performance.
There are two ways to do this as I see it, but neither seems efficient to me.
Call Db each time
- Read each entry from the XML.
- Try to retrieve the entry from the list.
- If no entry is found, add the entry.
- If found, update the timestamp.
- At the end of the loop, delete all entries with an older timestamp.
Load All Objects and work in memory
- Read all EF objects into a list.
- Delete all EF objects.
- Add an item for each item in the XML.
- Save the changes to the DB.
The lists are not that long, estimated at around 70k rows. I don't want to clear the DB table before inserting the new rows, as this table is a source for data from a webservice, and I don't want to lock the table while it's possible to query it.
If I were doing this in T-SQL, I would most likely insert the rows into a temp table and join to find missing and deleted entries, but I have no idea what the best way to handle this in Entity Framework would be.
Any suggestions / ideas ?
The general problem with Entity Framework is that, when changing data, it will fire a query for each changed record anyway, regardless of lazy or eager loading. So by nature it will be extremely slow (think a factor of 1000+).
My suggestion is to use a stored procedure with a table-valued parameter and ignore Entity Framework altogether. You could use a MERGE statement.
70k rows is not much, but 70k insert/update/delete statements is always going to be very slow.
You could test it and see if the performance is manageable, but intuition says Entity Framework is not the way to go.
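A minimal sketch of that TVP-plus-MERGE route, assuming a user-defined table type dbo.EntryType and a stored procedure dbo.SyncEntries that performs the MERGE; both names, and the ExternalId/Value columns, are assumptions for illustration:

    using System.Collections.Generic;
    using System.Data;
    using System.Data.SqlClient;

    static class XmlSync
    {
        public static void Sync(SqlConnection connection, IEnumerable<KeyValuePair<string, string>> entries)
        {
            // Shape the parsed XML entries into a DataTable matching the table type.
            var table = new DataTable();
            table.Columns.Add("ExternalId", typeof(string));
            table.Columns.Add("Value", typeof(string));
            foreach (var entry in entries)
                table.Rows.Add(entry.Key, entry.Value);

            using (var command = new SqlCommand("dbo.SyncEntries", connection))
            {
                command.CommandType = CommandType.StoredProcedure;
                var parameter = command.Parameters.AddWithValue("@Entries", table);
                parameter.SqlDbType = SqlDbType.Structured;
                parameter.TypeName = "dbo.EntryType"; // the user-defined table type
                command.ExecuteNonQuery();            // one round trip for all ~70k rows
            }
        }
    }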
I would iterate over the elements in the XML and update the corresponding row in the DB one at a time. I guess that's what you meant by your first option? As long as you have a good query plan to select each row, that should be pretty efficient. Like you said, 70k rows isn't that much, so you are better off keeping the code straightforward rather than doing something less readable for a little more speed.
It depends. It's OK to use EF if there won't be many changes (say, fewer than a few hundred). Otherwise, you need to bulk insert into the DB and merge the rows inside the database.
I have one database server, acting as the main SQL Server, containing a table that holds all the data. Other database servers (different SQL Server instances) come and go. When they come online, they need to download data from the main table (for a given time period); they then generate their own additional data in the same local SQL Server table, and then want to update the main server with only the new data, using a C# program run by a scheduled service every so often. Multiple additional servers could be generating data at the same time, although there won't be that many of them.
The main table will always be online. The additional non-main table is not always online and should not be an identical copy of the main one: first it contains a subset of the main data, then it generates its own additional data in the local table and updates the main table every so often with its changes. A decent number of rows could be generated and/or downloaded, so an efficient algorithm is needed to copy from the extra database to the main table.
What is the most efficient way to transfer this in C#? SqlBulkCopy doesn't look like it will work, because I can't have duplicate entries on the main server and it would fail the constraint checks since some entries already exist.
You could do it in the DB or in C#. In either case you must do something like Using FULL JOINs to Compare Datasets. You know that already.
The most important thing is to do it in a transaction. If you have 100k rows, split them into 1000 rows per transaction, or try to determine what number of rows per transaction works best for you.
Use Dapper. It's really fast.
If you have all your data in C#, use a TVP to pass it to a DB stored procedure. In the stored procedure, use MERGE to UPDATE/DELETE/INSERT the data.
And last: in C#, use a Dictionary<TKey, TValue> or something else with O(1) access time.
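A minimal sketch of that in-memory comparison, keying the existing rows in a Dictionary so each incoming row is matched in O(1); the generic shape and the key selector are illustrative:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class DiffResult<T>
    {
        public List<T> ToInsert { get; } = new List<T>();
        public List<T> ToUpdate { get; } = new List<T>();
        public List<T> ToDelete { get; } = new List<T>();
    }

    static class DatasetDiff
    {
        public static DiffResult<TRow> Compare<TRow, TKey>(
            IEnumerable<TRow> existing, IEnumerable<TRow> incoming, Func<TRow, TKey> keyOf)
        {
            var result = new DiffResult<TRow>();
            var existingByKey = existing.ToDictionary(keyOf);
            var incomingKeys = new HashSet<TKey>();

            foreach (var row in incoming)
            {
                var key = keyOf(row);
                incomingKeys.Add(key);
                if (existingByKey.ContainsKey(key))
                    result.ToUpdate.Add(row);  // present on both sides
                else
                    result.ToInsert.Add(row);  // only in the incoming data
            }

            // Rows only in the existing data are candidates for deletion.
            result.ToDelete.AddRange(existingByKey
                .Where(kv => !incomingKeys.Contains(kv.Key))
                .Select(kv => kv.Value));

            return result;
        }
    }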
SqlBulkCopy is the fastest way to insert data into a table from a C# program. I have used it to copy data between databases, and so far nothing beats it speed-wise. Here is a nice generic example: Generic bulk copy.
I would use an IsProcessed flag in the main server's table and keep track of the main table's primary keys when you download data to the local DB server. Then you should be able to do a delete and update against the main server again.
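A minimal sketch of the SqlBulkCopy route that avoids the duplicate-key problem by loading into a staging table first and letting the server merge; the staging/main table names and the Id/Payload columns are assumptions for illustration:

    using System.Data;
    using System.Data.SqlClient;

    static class BulkUploader
    {
        public static void Upload(SqlConnection connection, DataTable newRows)
        {
            // Fast path: bulk copy everything into an empty staging table.
            using (var bulkCopy = new SqlBulkCopy(connection))
            {
                bulkCopy.DestinationTableName = "dbo.MainTable_Staging";
                bulkCopy.BatchSize = 1000;
                bulkCopy.WriteToServer(newRows);
            }

            // Server-side merge, so rows that already exist in the main table are skipped.
            const string merge = @"
                MERGE dbo.MainTable AS target
                USING dbo.MainTable_Staging AS source ON target.Id = source.Id
                WHEN NOT MATCHED THEN INSERT (Id, Payload) VALUES (source.Id, source.Payload);
                TRUNCATE TABLE dbo.MainTable_Staging;";
            using (var command = new SqlCommand(merge, connection))
            {
                command.ExecuteNonQuery();
            }
        }
    }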
Here's how I would do it:
Create a stored procedure on the main table's database which receives a user-defined table type parameter with the same structure as the main table.
It should do something like this:
INSERT INTO yourtable SELECT * FROM @tablevar;
Or you could use the MERGE statement for insert-or-update functionality.
In code (a Windows service), load all (or part) of the data from the secondary table and send it to the stored procedure as a table-valued parameter.
You could do it in batches of 1000, and each time a batch is uploaded you should mark it in the source table / source updater code; see the sketch below.
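A minimal sketch of that batching idea, sending chunks of 1000 rows to an assumed stored procedure dbo.MergeIntoMainTable via a table-valued parameter; the table type, column names, and marking step are illustrative:

    using System.Collections.Generic;
    using System.Data;
    using System.Data.SqlClient;
    using System.Linq;

    static class ChunkedUploader
    {
        public static void Send(SqlConnection main, IReadOnlyList<KeyValuePair<long, string>> rows)
        {
            for (int offset = 0; offset < rows.Count; offset += 1000)
            {
                // Build a DataTable for the next chunk of up to 1000 rows.
                var chunk = new DataTable();
                chunk.Columns.Add("Id", typeof(long));
                chunk.Columns.Add("Payload", typeof(string));
                foreach (var row in rows.Skip(offset).Take(1000))
                    chunk.Rows.Add(row.Key, row.Value);

                using (var command = new SqlCommand("dbo.MergeIntoMainTable", main))
                {
                    command.CommandType = CommandType.StoredProcedure;
                    var parameter = command.Parameters.AddWithValue("@Rows", chunk);
                    parameter.SqlDbType = SqlDbType.Structured;
                    parameter.TypeName = "dbo.MainTableRowType";
                    command.ExecuteNonQuery();
                }

                // Here you would mark this chunk as transferred in the source table,
                // so a failed run can resume where it stopped.
            }
        }
    }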
Can you use linked servers for this? If so, it will make copying data to and from the main server much easier.
When copying data back to the main server, I'd use IF EXISTS before each INSERT statement to additionally make sure there are no duplicates, and wrap all the insert statements in a transaction so that if an error occurs the transaction is rolled back.
I also agree with others on doing this in batches of 1,000 or so records, so that if something goes wrong you can limit the damage.
I'm importing data into a database, about 5000 rows each time. One of the columns I'm inserting holds a location, and there are about 80 possible locations in total. I want to check each one of those and map it to one of another 80 location names before I insert each row into the database. I have a switch statement helping me at the moment, but I was wondering whether anyone thinks this is a bad way to do it or whether I'm on the right track.
So basically, at the moment, when I upload my data that switch statement gets evaluated and a value changed 5000 times. Is a switch the right way to go?
Don't use a switch statement; it's very hard to maintain. Create another table in your DB that maps your input locations to the required database locations and query off that instead. It makes it much easier to update/insert new locations, and keeps the length of your script to a sane level.
You could use either a conversion table in your database or a Dictionary in your application instead of a switch.
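A minimal sketch of the in-application Dictionary approach; the location names are placeholders, and in practice the map could itself be loaded from a conversion table:

    using System;
    using System.Collections.Generic;

    static class LocationMap
    {
        private static readonly Dictionary<string, string> Map =
            new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
            {
                ["SourceLocationA"] = "TargetLocationA",
                ["SourceLocationB"] = "TargetLocationB",
                // ... the remaining ~80 mappings
            };

        public static string Translate(string source)
        {
            // Fall back to the original value if no mapping exists.
            return Map.TryGetValue(source, out var target) ? target : source;
        }
    }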
It seems inappropriate to convert during the import process.
I would import the data as-is and then either UPDATE the table or use a lookup table as previously suggested.
I have a list of strings; the list is constant and won't be changed, and there are 12 strings.
Inside a database table I have a column with an index into that list.
I don't think it's wise to keep a separate table just to hold those strings, because they never change, nor to save the string itself inside this column.
So the only option is to hold the list in some other form.
What about holding the strings in an XML file and using LINQ-to-XML to load them into a dictionary?
If so, is this better, performance-wise, than using a database table?
Those strings will most likely get cached by your SQL Server and incur almost no performance hit, but keeping them there will give you flexibility in case you have multiple applications sharing the same database. Overall, keep them in the database unless you have/expect millions of database hits.
I agree with Zepplock: keep the strings in the database. You won't have to worry about performance. One of the big reasons is that it will be easier for future developers to find the strings and understand their function within the application if you store them in the database in their proper context.
It sounds as if you're describing a table holding product catalog data. I suggest keeping those values in their own rows, not stored as an XML data type or as XML in a varchar column.
It sounds as if the data is static today and rarely, if ever, changes. By storing it in XML, you lose the potential future advantage of the relational nature of the database.
I suggest keeping them in a table. As you say, it's only 12 strings/products, and the performance hit will be zero.