Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 2 years ago.
I'm currently working on point-of-sale software in which I have a table that records every item of each transaction. Since it's going to hold hundreds of records each day after its release, I want to know the maximum number of records a table can hold. Can anyone please let me know whether that can slow the software down over time?
For practical day-to-day purposes (where you're inserting hundreds or thousands of rows per day) there is no limit to the size of the table, except if it fills up your disk.
Remember that organisations with user bases larger than yours run databases that take not hundreds of rows per day, but millions of rows per day.
Typically, though, you will start to run into performance issues that need fixing. You can still get good performance; you just need to do more to watch and tune it.
For example, you may have a typical table with, say,
An ID (autoincrement/identity) that is the Primary Key (and clustered index).
A date/time field recording when it occurred
Some other data e.g., user IDs, amounts, types of action, etc
Each row you insert into the table just puts a new row at the end of that table, which databases typically have no problem doing. Even if the table is already large, adding more rows isn't much of a problem.
However, imagine you have a query/report that gets the data for the last week - for example, SELECT * FROM trn_log WHERE trn_datetime >= DATEADD(day, -7, getdate())
At first that runs fine.
After a while, it slows down. Why? Because the database doesn't know that the datetimes are sequential, and therefore it must read every row of the table to work out which rows are the ones you want.
At that point, you start to think about indexes - which is a good next step. But when you add an index, it slows down your new row inserts (by a small amount).
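To make the scan-versus-index point concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module rather than SQL Server, purely so it can run anywhere; the trn_log table name comes from the query above, while the amount column and the index name ix_trn_log_datetime are made up for illustration. It asks the query planner how it would run the "last week" query before and after an index is added on trn_datetime.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the planner's description of how it would run `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The fourth column of each plan row holds the human-readable detail.
    return " | ".join(row[3] for row in rows)


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE trn_log ("
        "id INTEGER PRIMARY KEY, trn_datetime TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO trn_log (trn_datetime, amount) VALUES (?, ?)",
        [(f"2024-01-{day:02d} 12:00:00", 9.99) for day in range(1, 29)],
    )
    last_week = "SELECT * FROM trn_log WHERE trn_datetime >= ?"
    cutoff = ("2024-01-21 00:00:00",)

    # Without an index the only option is to read the whole table.
    before = query_plan(conn, last_week, cutoff)

    # Hypothetical index name; any index on trn_datetime would do.
    conn.execute("CREATE INDEX ix_trn_log_datetime ON trn_log (trn_datetime)")
    after = query_plan(conn, last_week, cutoff)
    return before, after


before_plan, after_plan = demo()
print("before:", before_plan)
print("after: ", after_plan)
```

The first plan should report a scan of the table and the second a search using the index; the exact wording differs between SQLite versions, and SQL Server shows the same idea in its own execution plans.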
I learned a lot from watching Brent Ozar's videos. I recommend watching his How to Think Like the SQL Server Engine series.
Note that the above is based on my experience with SQL Server, but it's likely that (at this fundamental level) most other databases behave the same way.
The number of rows per page is limited to 255 rows so that works out to 4.1 billion rows per partition. A table can have an unlimited number of partitions and a single server can manage up to 128PB of storage.
https://www.quora.com/How-many-rows-can-recent-SQL-and-NoSQL-databases-reasonably-handle-within-one-table#:~:text=The%20number%20of%20rows%20per,up%20to%20128PB%20of%20storage.
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
I want to implement cache for my WPF application.
My application holds 2,328,681 items (sorry, I read off the wrong count earlier), and what I want to do is cache all these items in a file saved on the computer or something similar, which should relieve the workload of retrieving data from the database on the next run.
I'm going to have a function that checks the latest DBUpdateTime: it compares the DBUpdateTime in the cache with the one in SQL and, if they differ, retrieves the newest update.
Does someone know how I can achieve this? What kind of library do you suggest I use to implement the cache?
I'm going to show active items, but I also want to show inactive items. Should I save all items in a cache and then filter them at runtime?
Building a dynamic database in a cache is the wrong approach. I doubt you ever need to load 300,000 records into a single window.
It's better to put a limit of 200 records wherever you display them. Add a normal filter and, if you already have one, optimize your query.
Instead of 300,000 records, a reasonable window shows 200, or optionally 300, 500, 1,000 or 10,000.
For example, I have "Connections" and "Contracts" windows, plus a Link window. I have about 2 million entries, and I show the last 200 that match the filter.
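As a rough illustration of "limit plus filter", here is a small sketch using Python's built-in sqlite3 module. The items table, its columns and the function names are all hypothetical; the point is only that the database applies the active/inactive filter and the row limit, so the client never holds more than one window's worth of rows.

```python
import sqlite3


def make_demo_db(row_count):
    """Build a throwaway in-memory table; every second item is inactive."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, active INTEGER)"
    )
    conn.executemany(
        "INSERT INTO items (id, name, active) VALUES (?, ?, ?)",
        [(i, f"item {i}", i % 2) for i in range(1, row_count + 1)],
    )
    return conn


def latest_items(conn, only_active=None, limit=200):
    """Return the newest `limit` items, optionally filtered by active flag."""
    sql = "SELECT id, name, active FROM items"
    params = []
    if only_active is not None:
        sql += " WHERE active = ?"
        params.append(1 if only_active else 0)
    # Newest first, and never more rows than the window can show.
    sql += " ORDER BY id DESC LIMIT ?"
    params.append(limit)
    return conn.execute(sql, params).fetchall()


conn = make_demo_db(1000)
print(len(latest_items(conn)))                     # rows for the default window
print(len(latest_items(conn, only_active=False)))  # inactive items only
```

The same function covers both the active and the inactive view, so there is no need to pull every item to the client and filter it there.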
With small amounts of data, Serialisation is better than a local database.
In this case it seems you need over 2 million records so you'd need to pull them all into memory to work with them if you stored them in a flat file or memory.
That sounds like it'd be too much data to handle.
Meaning a local database is very likely your best candidate. Which one suits best depends on how you will access the data and what you'll store.
SQLite would be a candidate if these are simple records.
If you need to know immediately when any change is made to a record, then a push mechanism would be an idea.
You could use SignalR to tell clients data has changed.
If you don't need to know immediately then you could have the client poll and ask what's changed every once in a while.
What I have done in the past is to add a RecentChanges table per logical entity. When a record is changed, a row is added with the id, timestamp and user. You can then read this table to find what has changed since a specific time. Where heavy usage and database overheads called for a more sophisticated approach, I've cached copies of recently changed records on a business server.
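The RecentChanges idea could look roughly like the sketch below. It uses Python's built-in sqlite3 module, and the table and function names (item, item_recent_changes, update_item, changed_since) are invented for the example. Timestamps are stored as ISO-8601 text, which is an assumption that lets plain string comparison order them correctly.

```python
import sqlite3


def create_schema(conn):
    conn.executescript(
        """
        CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE item_recent_changes (
            item_id    INTEGER NOT NULL,
            changed_at TEXT    NOT NULL,  -- ISO-8601, so text order = time order
            changed_by TEXT    NOT NULL
        );
        """
    )


def update_item(conn, item_id, name, user, changed_at):
    """Change a record and log the change in the same transaction."""
    with conn:
        conn.execute("UPDATE item SET name = ? WHERE id = ?", (name, item_id))
        conn.execute(
            "INSERT INTO item_recent_changes (item_id, changed_at, changed_by) "
            "VALUES (?, ?, ?)",
            (item_id, changed_at, user),
        )


def changed_since(conn, since):
    """Return the ids a polling client needs to refresh."""
    rows = conn.execute(
        "SELECT DISTINCT item_id FROM item_recent_changes "
        "WHERE changed_at > ? ORDER BY item_id",
        (since,),
    )
    return [row[0] for row in rows]
```

A polling client would remember the time of its last check, call changed_since with it, and reload only the returned ids instead of the whole table.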
Closed. This question is opinion-based. It is not currently accepting answers.
Closed 4 years ago.
I am trying to create a simple social networking app starting with back-end.
App Description
When a user opens the app a list of users will be displayed based on his/her Geo-Location, age and gender.
Once a user has been viewed, he won't be shown ever again.
Technologies
I am using Azure CosmosDB (MongoDB implementation) and Azure Redis Cache to store the documents.
My Approach to deal with the problem
I save all the users in CosmosDB. I query for user IDs based on geocoordinates and on age and gender preference filters, and limit the results to 5000.
I also apply one more filter, which excludes users who have already been viewed. I maintain a collection in which, for each user, the IDs of all the users he has viewed are saved as a document.
The first time, I get 5000 IDs from CosmosDB and put 4950 of them in the Redis cache (with an expiry time). Using the remaining 50 IDs, I fetch the users from CosmosDB and return them as the response to the API call. For subsequent calls, I get the next 50 IDs from the Redis cache, fetch those users and return them as the response.
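For readers who want to see the flow above in one place, here is a simplified sketch in Python. Nothing in it talks to CosmosDB or Redis: a plain dict of deques stands in for the Redis cache (with no expiry), an in-memory set stands in for the viewed-IDs collection, and find_candidate_ids is a hypothetical callable representing the expensive geo/age/gender query. The class and constant names are made up too.

```python
from collections import deque

PAGE_SIZE = 50      # users returned per API call
FETCH_LIMIT = 5000  # ids fetched per expensive query


class MatchFeed:
    def __init__(self, find_candidate_ids):
        # Hypothetical callable: (viewer_id, exclude_ids, limit) -> list of ids.
        self._find = find_candidate_ids
        self._cache = {}   # viewer_id -> deque of ids (stand-in for Redis)
        self._viewed = {}  # viewer_id -> set of ids (stand-in for the viewed collection)

    def next_page(self, viewer_id):
        """Return the next PAGE_SIZE ids, refilling the cache when it runs dry."""
        pending = self._cache.get(viewer_id)
        if not pending:
            viewed = self._viewed.setdefault(viewer_id, set())
            ids = self._find(viewer_id, viewed, FETCH_LIMIT)
            # As in the description, the whole batch is marked as viewed up front.
            viewed.update(ids)
            pending = self._cache[viewer_id] = deque(ids)
        return [pending.popleft() for _ in range(min(PAGE_SIZE, len(pending)))]
```

The sketch makes the cost pattern visible: every refill passes a larger exclusion set to the expensive query, which is where the growing times in the stats below come from.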
Problem I'm facing
Fetching the 5000 users is a time-consuming step, as it involves geolocation computation and other filtering. I created a sample user database with nearly 2 million users in a 100-mile radius, of which I would get about 100,000 based on my preferences (i.e., age and gender) had I not applied the 5000 limit.
It would take around 25 seconds to do so.
With the 5000 limit applied, the query initially runs for only 1 to 1.5 seconds. But as users get viewed, i.e., as the Not-In ($nin) filter has to exclude those 5000 IDs, the time taken eventually increases. Fetching from the cache is fast, but when the cache is exhausted or expires and we have to hit CosmosDB to query for 5000 more users, it takes longer, because the set of users he has already viewed keeps growing.
Stats
Time format is in hrs:min:sec.
It is performed just for performance stats. Actual Api request will provide 50 users each time(most of the time from cache).
first time
Time taken to get 5000 matches is 00:00:01.22
Time taken to set Viewed Ids is 00:00:00.06
second time
Time taken to get 5000 matches is 00:00:02.49
Time taken to set Viewed Ids is 00:00:00.67
:
:
Fifteenth time
Time taken to get 5000 matches is 00:00:23.05
Time taken to set Viewed Ids is 00:00:09.23
Question
How can the architecture be improved for better performance? How do apps like Uber, Tinder, etc., which involve computations on users' geolocation, architect their applications? Is there a better way to model the problem or the data?
Any help would be appreciated. Thank you.
2 million users is enough that you need a good indexing strategy for database queries to work. Geography queries pose a unique indexing problem because they are searches over two related variables (namely longitude and latitude).
There's a good description of how Microsoft SQL Server does its spatial index over here, which also nicely summarizes the indexing problem more generally.
While I've not personally used it, CosmosDB seems to now have some support for this too. See this and this.
The first thing I would do is slightly rethink your expectations. Simply finding the 50 or 5000 (or whatever n) nearest items can involve a lengthy search if there are no nearby matches (or even if there are), but if your database is properly indexed, you can search very efficiently within some radius r of a point and then sort those results by distance. If you have or expect to have a large number of coordinates, I would suggest doing that several times: search for all matches within 100 m and sort by distance; then, if you need more, search for all matches within 500 m and exclude the ones you've already seen; and so on up to 10 km or 25 km or whatever your app calls for.
MongoDB has a quite efficient index available for geospatial coordinates (basically a world map partitioned into B+ trees). The '$near' query allows you to specify both a minimum and a maximum distance, and sorts by distance by default, so it's very convenient for this kind of tiered, distance-based search. You will have to format your coordinates (both in the DB and in the query) as GeoJSON Point objects if they aren't already, though.
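The tiered radius search can be sketched without any database at all. In the Python below, within_radius is a brute-force stand-in for what would be an indexed radius query in a real system (for instance a '$near' query with a maximum distance); the function names, the radii and the points dictionary are assumptions made for the example.

```python
import math

EARTH_RADIUS_M = 6_371_000


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in metres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def within_radius(points, lat, lon, radius_m):
    """Stand-in for an indexed radius query: (distance, id) pairs, nearest first."""
    hits = []
    for point_id, (plat, plon) in points.items():
        dist = haversine_m(lat, lon, plat, plon)
        if dist <= radius_m:
            hits.append((dist, point_id))
    return sorted(hits)


def tiered_nearest(points, lat, lon, wanted,
                   radii_m=(100, 500, 2_000, 10_000, 25_000)):
    """Widen the search radius step by step until enough matches are found."""
    found, seen = [], set()
    for radius in radii_m:
        for dist, point_id in within_radius(points, lat, lon, radius):
            if point_id not in seen:  # skip matches from the smaller radius
                seen.add(point_id)
                found.append((point_id, dist))
        if len(found) >= wanted:
            break
    return found[:wanted]
```

Because each tier only adds points that are farther away than everything found so far, the combined result stays sorted by distance, and in dense areas the search stops after the first cheap, small-radius query.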
Closed. This question is opinion-based. It is not currently accepting answers.
Closed 7 years ago.
I'm designing an accounting application with more than 400 tables in SQL Server.
About 10% of those tables are operational tables, and the others are used for decoding and reference information.
For example, the Invoice tables (master and details) use about 10 tables to decode information such as buyer, item, marketer, and so on.
I want to know whether it is acceptable to cache the decode tables in the ASP.NET cache and not query them from SQL Server (I know that changes to cached items must be committed to SQL Server too), and to use the cached items for decoding.
I think it makes it so much faster than regular applications.
Maybe all of them together(Cache tables) are about 500 MB after some years because they don't change frequently.
If you've got the RAM then it's fine to use 500 MB.
However, unless you have a performance problem now, caching will only cause problems. Don't fix problems that you haven't encountered; design for performance and optimize only when you have problems, because otherwise the optimization can cause more problems than it solves.
So I would advise that it is usually better to ensure that your queries are optimized and well structured, that you have the correct indexes on the tables, and that you issue a minimum number of queries.
Although 500MB isn't a lot of data to cache, with all due respect, usually SQL Server will do a better job of caching than you can - providing that you use it correctly.
Using a cache will always improve performance, at the cost of higher implementation complexity.
For static data that never changes a cache is useful; but it still needs to be loaded and shared between threads which in itself can present challenges.
For data that rarely changes it becomes much more complex simply because it could have changed. If a single application (process) is the only updater of a cache then it isn't as difficult, but still not a simple task.
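To give a feel for the "loaded and shared between threads" part, here is a minimal sketch of a load-once, read-mostly cache in Python. ReferenceCache and its loader argument are hypothetical names; the loader stands for whatever query reads a decode table and returns (key, value) pairs. It deliberately ignores the hard part discussed above, namely knowing when some other process has changed the data.

```python
import threading


class ReferenceCache:
    """Loads a lookup table once and serves it to any number of threads."""

    def __init__(self, loader):
        self._loader = loader          # callable returning (key, value) pairs
        self._lock = threading.Lock()
        self._data = None              # None means "not loaded yet"

    def get(self, key):
        data = self._data              # take a snapshot; it may be swapped later
        if data is None:
            with self._lock:
                data = self._data
                if data is None:       # another thread may have loaded it already
                    data = self._data = dict(self._loader())
        return data[key]

    def invalidate(self):
        """Force a reload on the next lookup, e.g. after this process saved a change."""
        with self._lock:
            self._data = None
```

If the same application is the only writer, calling invalidate after each committed change is enough; with several writers the cache needs some external signal, which is where the complexity comes from.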
I have spent months optimizing an offline batch processing system (where the code has complete control of the database for a period of 12 hours). Part of the optimisation is to use various caches and data reprojections. All of the caches are read-only. Memory usage is around the 10 GB mark during execution; the database is around 170 GB, with 60 million records.
Even with the caching, there have been considerable changes to the underlying schema to improve efficiency. The read-only caches are there to eliminate reads during processing, to allow multi-threaded processing, and to improve insert performance.
Processing rate has gone from 6 items processed per second 20 months ago to around 6000 items per second (yesterday) - but there is a genuine need for this optimization as the number of items to process has risen from 100,000 to 8 million in the same period.
If you don't have a need then don't optimize.
We have a C# application which parses data from text files. We then have to update records in our SQL database based on the information in the text files. What's the most efficient way of passing the data from the application to SQL Server?
We currently use a delimited string and then loop through the string in a stored procedure to update the records. I am also testing using TVP (table valued parameter). Are there any other options out there?
Our files contain thousands of records and we would like a solution that takes the least amount of time.
Please do not use a DataTable as that is just wasting CPU and memory for no benefit (other than possibly familiarity). I have detailed a very fast and flexible approach in my answer to the following question, which is very similar to this one:
How can I insert 10 million records in the shortest time possible?
The example shown in that answer is for INSERT only, but it can easily be adapted to include UPDATE. Also, it uploads all rows in a single shot, but that can also be easily adapted to set a counter for X number of records and to exit the IEnumerable method after that many records have been passed in, and then close the file once there are no more records. This would require storing the File pointer (i.e. the stream) in a static variable to keep passing to the IEnumerable method so that it can be advanced and picked up at the most recent position the next time around. I have a working example of this method shown in the following answer, though it was using a SqlDataReader as input, but the technique is the same and requires very little modification:
How to split one big table that has 100 million data to multiple tables?
And for some perspective, 50k records is not even close to "huge". I have been uploading / merging / syncing data using the method I am showing here on 4 million row files and that hit several tables with 10 million (or more) rows.
Things to not do:
Use a DataTable: as I said, if you are just filling it for the purpose of using with a TVP, it is a waste of CPU, memory, and time.
Make 1 update at a time in parallel (as suggested in a comment on the question): this is just crazy. Relational database engines are heavily tuned to work most efficiently with sets, not singleton operations. There is no way that 50k inserts will be more efficient than even 500 inserts of 100 rows each. Doing it individually just guarantees more contention on the table, even if just row locks (it's 100k lock + unlock operations). It could be faster than a single 50k row transaction that escalates to a table lock (as Aaron mentioned), but that is why you do it in smaller batches, just so long as small does not mean 1 row ;).
Set the batch size arbitrarily. Staying below 5000 rows is good to help reduce chances of lock escalation, but don't just pick 200. Experiment with several batch sizes (100, 200, 500, 700, 1000) and try each one a few times. You will see what is best for your system. Just make sure that the batch size is configurable through the app.config file or some other means (table in the DB, registry setting, etc) so that it can be changed without having to re-deploy code.
SSIS (powerful, but very bulky and not fun to debug)
Things which work, but not nearly as flexible as a properly done TVP (i.e. passing in a method that returns IEnumerable<SqlDataRecord>). These are ok, but why dump the records into a temp table just to have to parse them into the destination when you can do it all inline?
BCP / OPENROWSET(BULK...) / BULK INSERT
.NET's SqlBulkCopy
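The batch-size advice above can be illustrated with a small sketch. It is written in Python against the built-in sqlite3 module rather than C# and SQL Server, and the readings table and the function names are made up; the point is only the shape: rows are streamed from an iterator, grouped into batches of a configurable size, and each batch is sent as one set-based call in its own transaction.

```python
import sqlite3
from itertools import islice


def batched(rows, batch_size):
    """Yield lists of up to `batch_size` rows without loading everything at once."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def insert_in_batches(conn, rows, batch_size):
    """Insert `rows` one batch per transaction; returns the number of batches."""
    batch_count = 0
    for batch in batched(rows, batch_size):
        with conn:  # commit after each batch, not after each row
            conn.executemany(
                "INSERT INTO readings (id, value) VALUES (?, ?)", batch
            )
        batch_count += 1
    return batch_count


def open_demo_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, value REAL)")
    return conn
```

In a real application batch_size would come from configuration, so that different values can be timed against the actual system without redeploying.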
The best way to do this in my opinion is to create a temp table then use SqlBulkCopy to insert into that temp table (https://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlbulkcopy%28v=vs.110%29.aspx), and then simply Update the table based on the temp table.
Based on my tests (using Dapper and also LINQ), updating in bulk or in batches takes far longer than just creating a temp table and sending a command to the server to update the data based on the temp table. The process is faster because SqlBulkCopy populates the data natively and quickly, and the rest is completed on the SQL Server side, which goes through fewer calculation steps, since the data at that point already resides on the server.
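The temp-table pattern looks roughly like the sketch below. It uses Python's built-in sqlite3 module as a stand-in, so executemany plays the role that SqlBulkCopy plays in .NET, and the product and staging tables and the bulk_update_prices function are hypothetical names. The part that carries over is the shape: load the new values into a temporary table, then run one set-based UPDATE on the server.

```python
import sqlite3


def bulk_update_prices(conn, updates):
    """Apply (id, price) pairs through a temp table; returns rows changed."""
    with conn:
        conn.execute(
            "CREATE TEMP TABLE staging (id INTEGER PRIMARY KEY, price REAL NOT NULL)"
        )
        # Bulk-load step (SqlBulkCopy would do this in the C# version).
        conn.executemany("INSERT INTO staging (id, price) VALUES (?, ?)", updates)
        # One set-based statement updates every matching row.
        cursor = conn.execute(
            """
            UPDATE product
               SET price = (SELECT s.price FROM staging AS s WHERE s.id = product.id)
             WHERE id IN (SELECT id FROM staging)
            """
        )
        changed = cursor.rowcount
        conn.execute("DROP TABLE staging")
    return changed
```

Ids in the staging table that do not exist in product are simply ignored by the UPDATE. On SQL Server the equivalent statement would normally be written as an UPDATE with a JOIN to the temp table.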
I am still on a learning curve in C# and SQL Server, so please forgive my ‘greenness’.
Here is my scenario:
I have an EMPLOYEE table with 10,000 rows. Each of these employees has transactions in a TRANSACTIONS table.
The transaction table has the salary elements like basic pay, acting allowance, overtime hours etc. It also has payroll deductions like advance deductions, some loans (with interest), and savings (pension, social security savings etc.).
I need to go through each employee’s transactions and compute taxes, outstanding balances on loans, update balances on savings, convert hours into payments/deductions and some other stuff.
This processing will give me a new set of rows for each employee, with a period marker (e.g. 2013-04 for April 2013). I need to store this in a HISTORY table for future reference.
What is the best approach for processing the entire 10,000 employee table and their transactions?
I am told that pulling the entire table into memory via readers is not good practice and I agree.
Do I keep pulling an employee from the database, process their transactions, and commit the history to the database? And pull the next and so forth?
Too many calls to the back end?
(EF not an option for me, still doing raw SQL in ADO.NET)
I will appreciate any help on this.
10,000 rows is not much. Memory could easily handle that unless there are some enormous varchar or binary columns. Don't feel completely locked in by good practice "rules".
On the other hand, consider a stored procedure. Then all processing will be done locally on the server.
edit: if neither of the above is an option, try to stream your results. For example, when reading your query save each row in a ConcurrentQueue or something like that. Before you execute the query, start another thread or a BackgroundWorker which checks the queue for new items and saves back results simultaneously on another SqlConnection. Work will be done when query is done AND the queue has Count 0.
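The streaming idea in the edit can be sketched as a small producer/consumer pair. The version below is Python rather than C#: queue.Queue plays the role of the ConcurrentQueue and a plain thread plays the role of the BackgroundWorker. The three callables (read_rows, compute, save) are hypothetical placeholders for reading the query results, doing the payroll calculation and writing the history row on a second connection.

```python
import queue
import threading

_DONE = object()  # sentinel telling the writer that reading has finished


def process_streaming(read_rows, compute, save, max_pending=1000):
    """Read rows on the calling thread while a worker computes and saves them."""
    pending = queue.Queue(maxsize=max_pending)
    errors = []

    def writer():
        while True:
            row = pending.get()
            if row is _DONE:
                return
            if errors:
                continue  # keep draining so the reader is never blocked
            try:
                save(compute(row))
            except Exception as exc:
                errors.append(exc)

    worker = threading.Thread(target=writer)
    worker.start()
    try:
        for row in read_rows():
            pending.put(row)  # blocks if the writer falls too far behind
    finally:
        pending.put(_DONE)
        worker.join()
    if errors:
        raise errors[0]
```

The work is finished when the reader has sent the sentinel and the worker has drained the queue, which matches the "query is done AND the queue has Count 0" condition above.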
Check out using ROW_NUMBER(). This can be used by programs to allow large tables to be essentially browsed using 'x' number of rows at a time. You could then conceivably use this same method to batch your job over, say, 1000 rows at a time.
See this link for more information.
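Batching with ROW_NUMBER() could look something like the following sketch. It runs against Python's built-in sqlite3 module, which needs SQLite 3.25 or newer for window functions; the employee table and the iter_batches function are assumptions for the example, and the same query shape works in SQL Server.

```python
import sqlite3


def iter_batches(conn, batch_size):
    """Yield employees in id order, `batch_size` rows at a time."""
    first = 1
    while True:
        rows = conn.execute(
            """
            SELECT id, name
              FROM (SELECT id, name,
                           ROW_NUMBER() OVER (ORDER BY id) AS rn
                      FROM employee)
             WHERE rn BETWEEN ? AND ?
             ORDER BY rn
            """,
            (first, first + batch_size - 1),
        ).fetchall()
        if not rows:
            return
        yield rows
        first += batch_size


def open_demo_db(employee_count):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO employee (id, name) VALUES (?, ?)",
        [(i, f"employee {i}") for i in range(1, employee_count + 1)],
    )
    return conn
```

Each batch can then be processed and its history rows committed before the next batch is read. Note that the row numbers are recomputed on every query, so this assumes the employee table is not changing while the job runs.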