I am trying to create a simple social networking app, starting with the back-end.
App Description
When a user opens the app, a list of users will be displayed based on his/her geo-location, age and gender.
Once a user has been viewed, he won't be shown ever again.
Technologies
I am using Azure Cosmos DB (MongoDB API) and Azure Redis Cache to store the documents.
My Approach to deal with the problem
I store the whole user database in Cosmos DB. I query for user IDs based on geo-coordinate, age and gender preference filters and limit the results to 5000.
I also apply one more filter: whether a user has already been viewed. I maintain a collection where, for each user, all the user IDs that he has viewed are saved as a document.
The first time, I get 5000 IDs from Cosmos DB and put 4950 of them in the Redis cache (with an expiry time). Using the remaining 50 IDs, I fetch the users from Cosmos DB and return them as the response to the API call. For subsequent calls I get the next 50 IDs from the Redis cache, fetch those users and return them as the response.
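A minimal sketch of that Redis step in Python (redis-py), purely to illustrate the flow described above; the key layout, the page size of 50 and the TTL are assumptions, not the actual implementation:

```python
# Sketch of the ID-batch caching described above, using redis-py.
# Key names, batch size and TTL are illustrative assumptions.
import redis

r = redis.Redis(host="localhost", port=6379)

BATCH_KEY = "feed:{user_id}:ids"   # per-user queue of not-yet-served IDs
PAGE_SIZE = 50
TTL_SECONDS = 6 * 60 * 60          # let stale batches expire after 6 hours

def cache_batch(user_id, candidate_ids):
    """Store everything except the first page, preserving order."""
    key = BATCH_KEY.format(user_id=user_id)
    rest = candidate_ids[PAGE_SIZE:]
    if rest:
        r.rpush(key, *rest)
        r.expire(key, TTL_SECONDS)
    return candidate_ids[:PAGE_SIZE]      # first page is served immediately

def next_page(user_id):
    """Pop the next 50 IDs; an empty list means the cache is exhausted."""
    key = BATCH_KEY.format(user_id=user_id)
    pipe = r.pipeline()
    pipe.lrange(key, 0, PAGE_SIZE - 1)
    pipe.ltrim(key, PAGE_SIZE, -1)
    ids, _ = pipe.execute()
    return [i.decode() for i in ids]
```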
Problem I'm facing
Fetching the 5000 users is a time-consuming step as it involves geo-location computation and other filtering. I created a sample user database with nearly 2 million users in a 100-mile radius; had I not applied the 5000 limit, the query would return about 100,000 users matching my preferences (age and gender).
It would take around 25 seconds to do so.
Applying the 5000 limit makes the query run for only 1 - 1.5 seconds initially. But as users get viewed, i.e. when the Not-In ($nin) filter has to exclude those viewed IDs, the time taken eventually increases. Getting IDs from the cache is fast, but when the cache gets exhausted or expires and we have to hit Cosmos DB to query for 5000 more users, it takes more and more time as the number of users he has already viewed keeps increasing.
Stats
The time format is hrs:min:sec.
These runs were performed just for performance stats; the actual API request will return 50 users each time (most of the time from the cache).
first time
Time taken to get 5000 matches is 00:00:01.22
Time taken to set Viewed Ids is 00:00:00.06
second time
Time taken to get 5000 matches is 00:00:02.49
Time taken to set Viewed Ids is 00:00:00.67
:
:
Fifteenth time
Time taken to get 5000 matches is 00:00:23.05
Time taken to set Viewed Ids is 00:00:09.23
Question
How can the architecture be improved for better performance? How do apps like Uber, Tinder, etc. that involve user geo-location computations architect their applications? Is there a better way to model the problem or the data?
Any help would be appreciated. Thank you.
2 million users is enough that you need to start having a good indexing strategy for database queries to work. Geography queries pose a unique indexing problem because they are searches over two related variables (namely longitude and latitude).
There's a good description of how Microsoft SQL Server does its spatial index over here, which also nicely summarizes the indexing problem more generally.
While I've not personally used it, Cosmos DB seems to now have some support for this too. See this and this.
The first thing I would do is slightly rethink your expectations - simply finding the 50 or 5000 (or whatever n) nearest items can involve a lengthy search if there are no nearby matches (or even if there are). But if your database is properly indexed, you can search very efficiently within some radius r of a point and then sort those results by distance. If you have, or expect to have, a large number of coordinates, I would suggest doing that several times: search for all matches within 100m, sort by distance, and then, if you need more, search for all matches within 500m and exclude the ones you've already seen, and so on up to 10km or 25km or whatever your app calls for.
MongoDB has a quite efficient index available for geospatial coordinates (basically a world map partitioned into B+ trees). The $near query allows you to specify both minimum and maximum distance and sorts by distance by default, so it's very convenient for this kind of tiered, distance-based search. You will have to format your coordinates (both in the DB and in the query) as GeoJSON Point objects if they aren't already, though.
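A minimal sketch of that tiered search with pymongo; the collection, field names, filter values and radius tiers are illustrative assumptions, and support for these operators through the Cosmos DB MongoDB API may vary:

```python
# Sketch of a tiered $near search with pymongo. Collection/field names and
# the filter values are assumptions for illustration.
from pymongo import MongoClient, GEOSPHERE

client = MongoClient("mongodb://localhost:27017")
users = client["social"]["users"]

# 2dsphere index over GeoJSON Point coordinates.
users.create_index([("location", GEOSPHERE)])

def nearby_batch(lon, lat, min_m, max_m, exclude_ids, limit=5000):
    """One 'ring' of the tiered search: min_m <= distance < max_m."""
    query = {
        "location": {
            "$near": {
                "$geometry": {"type": "Point", "coordinates": [lon, lat]},
                "$minDistance": min_m,
                "$maxDistance": max_m,
            }
        },
        "age": {"$gte": 21, "$lte": 35},      # illustrative preference filter
        "gender": "F",
        "_id": {"$nin": exclude_ids},          # already-viewed users
    }
    # $near already returns results sorted by distance.
    return [doc["_id"] for doc in users.find(query, {"_id": 1}).limit(limit)]

# Widen the ring until enough candidates are collected.
seen, rings = [], [(0, 100), (100, 500), (500, 5000), (5000, 25000)]
for lo, hi in rings:
    seen += nearby_batch(-122.42, 37.77, lo, hi, seen)
    if len(seen) >= 5000:
        break
```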
I'm in the process of programming a web application that gets data from an inverter to which a PV cell is attached. I read the data from a CSV file. Every 20 seconds, the CSV file gains a line that contains the data at the respective point in time (line contains the following data: timestamp, current performance, energy).
The CSV file is saved to a database when the application is started (when the index action is called in the controller). It's all working.
Since the database now contains data at 20-second intervals, it is rapidly increasing in size. Because I use graphs on my web application to show the energy that the PV system supplies me over the year, I have to summarize the 20-second data, which also requires computing power. I also do this in the index action.
So whenever the user opens the page, the data is updated. If I, for example, switch from one view to the other and back again, the index action in the associated controller is called again, so it takes time to load the page again and my application becomes slow.
What do I need to do to solve such a problem?
Ok, then.
In our IT industry, we often come across the term "data warehousing".
What this means (in most cases) is that we have a LOT of transaction data. Think maybe the very high transaction rate generated by people shopping on amazon. HUGE number of transactions.
But, if we want to report on such data? Say we want sales by hour, or maybe even only need by per day.
Well, we don't store each single transaction in that "house" of data, but a sum total, summed over a given time period "chosen" by the developer of that data warehouse system.
So, you probably don't need to capture each 20 second data point. (maybe you do?????).
So, as I stated, every 20 seconds, you get a data point. Given a year has 31 million seconds? then that means you will have 1.5 million data points per year.
However, perhaps you don't need such fine resolution. If you take the data and sum it by, say, 1-minute intervals, then you are down to only 525,000 data points per year (and if you report by month, that is only 43,000 points per month).
However, maybe a resolution of 5 minutes is more than fine for your needs. At that resolution, a whole year of data becomes only 105,120 data points.
And thus for a graph or display of one month of data, we have only 8,760 data points.
So, if we have to (for example) display a graph for one month, then we are only pulling 8,700 points of data. Not at all a large query for any database system these days.
So, you might want to think of this as a "miniature" data warehousing project, in which you do lose some data "granularity", but it is still sufficient for your reporting needs.
What time slot or "gap" you choose will be based on YOUR requirements or need.
What the above thus then suggests?
You would need a routine that reads the CSV, groups the data by the "time slot" you have chosen, sums it into existing data points, and appends new ones.
This would not only vastly reduce the number of rows of data, but of course would also significantly speed up reports and graphing on such data.
So, you could easily drop from about 1.5 million rows of data per year down to, say, 100,000 rows per year. With an index on the date, reporting on such data, be it daily, weekly, or monthly, becomes far more manageable, and you have reduced the data by a factor of 10x. Thus, you have a lot more headroom in the database, a lot less data, and after 10 years of data you would only be at around 1 million rows - not a lot for even the free "express" edition of SQL Server.
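A minimal sketch of that routine in Python, assuming the CSV columns named in the question (timestamp, energy) and an illustrative 5-minute slot; the SQLite table is a stand-in for whatever database the application actually uses:

```python
# Sketch: collapse 20-second CSV samples into 5-minute buckets before storing.
# Column names, the bucket size and the table are assumptions for illustration.
# The whole (growing) CSV is re-read, so the upsert simply replaces each slot.
import csv
import sqlite3
from datetime import datetime

def bucket_of(ts: datetime) -> datetime:
    """Round a timestamp down to the start of its 5-minute slot."""
    return ts.replace(minute=ts.minute - ts.minute % 5, second=0, microsecond=0)

def aggregate(csv_path: str, db_path: str = "pv.sqlite") -> None:
    totals = {}                      # slot start -> summed energy
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            slot = bucket_of(datetime.fromisoformat(row["timestamp"]))
            totals[slot] = totals.get(slot, 0.0) + float(row["energy"])

    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS pv_summary (
                       slot_start TEXT PRIMARY KEY, energy REAL)""")
    con.executemany(
        """INSERT INTO pv_summary (slot_start, energy) VALUES (?, ?)
           ON CONFLICT(slot_start) DO UPDATE SET energy = excluded.energy""",
        [(slot.isoformat(), total) for slot, total in totals.items()])
    con.commit()
    con.close()
```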
Also, since you can't control when the "device" triggers adding data to that CSV, I would consider renaming the file before you read it (and deleting it after you're done); that way you reduce the possibility of losing data during your CSV read + delete operation.
I'm currently working on point-of-sale software in which I have a table to record each and every item of a transaction. Since it's going to hold hundreds of records each day after its release, I just want to know the maximum number of records that can be held by a table, and can anyone please let me know whether it can slow down the software over time.
For practical day-to-day purposes (where you're inserting hundreds or thousands of rows per day) there is no limit to the size of the table, except if it fills up your disk.
Remember that organisations with userbases larger than yours use databases with not hundreds of rows per day, but millions of rows per day.
Typically though, you will start to run into performance issues that need fixing. You can still get good performance; you just need to do more to watch and tweak it.
For example, you may have a typical table with, say,
An ID (autoincrement/identity) that is the Primary Key (and clustered index).
A date/time field recording when it occurred
Some other data e.g., user IDs, amounts, types of action, etc
Each row you insert into the table just puts a new row at the end of that table, which databases typically have no problem doing. Even if the table is already large, adding more rows isn't much of a problem.
However, imagine you have a query/report that gets the data for the last week - for example, SELECT * FROM trn_log WHERE trn_datetime >= DATEADD(day, -7, getdate())
At first that runs fine.
After a while, it slows down. Why? Because the database doesn't know that the datetimes are sequential, and therefore it must read every row of the table and work out which of the rows are the ones you want to use.
At that point, you start to think about indexes - which is a good next step. But when you add an index, it slows down your new row inserts (by a small amount).
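As a hedged illustration of that next step, using the example table and query from above (the index name and connection string are placeholder assumptions), a nonclustered index on trn_datetime lets the date-range report seek instead of scanning:

```python
# Sketch: add a nonclustered index so the "last week" report can seek on
# trn_datetime instead of scanning the whole table. Names/connection string
# are assumptions for illustration.
import pyodbc

conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                      "SERVER=localhost;DATABASE=pos;Trusted_Connection=yes")
cur = conn.cursor()

# One-time DDL: every insert now also maintains this index (the small cost
# mentioned above), but date-range filters can use an index seek.
cur.execute("""
    CREATE NONCLUSTERED INDEX IX_trn_log_trn_datetime
        ON dbo.trn_log (trn_datetime)
""")
conn.commit()

# The report query from above can now seek on the index.
rows = cur.execute(
    "SELECT * FROM trn_log WHERE trn_datetime >= DATEADD(day, -7, GETDATE())"
).fetchall()
```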
I learned a lot from watching Brent Ozar's videos. I recommend watching his How to Think Like the SQL Server Engine series.
Note that this above is based on my experience with SQL Server - but it's likely (at this fundamental level) most other databases are the same.
The number of rows per page is limited to 255 rows so that works out to 4.1 billion rows per partition. A table can have an unlimited number of partitions and a single server can manage up to 128PB of storage.
https://www.quora.com/How-many-rows-can-recent-SQL-and-NoSQL-databases-reasonably-handle-within-one-table#:~:text=The%20number%20of%20rows%20per,up%20to%20128PB%20of%20storage.
I want to implement cache for my WPF application.
My application is holding 2,328,681 items, and what I want to do is cache all these items into a file saved on the computer or something, which should reduce the workload of retrieving data from the database at the next runtime.
I'm going to have a function which checks the latest DBUpdateTime and, if the DBUpdateTime in the cache differs from the one in SQL, retrieves the newest update.
Does someone know how I can achieve this? What kind of library do you suggest I use in order to implement the cache?
I'm going to show active items, but I also want to show inactive items. Should I save all items in a cache and then filter at runtime?
Caching a dynamic database like that is wrong. You are not going to display 300,000 records in one window anyway.
Better, where you display them, put a limit of 200 records. And add a proper filter, if you have one, and optimize your query.
I think instead of 300,000 records, a "reasonable" display is 200, or if you like 300, 500, 1000, 10000.
For example, I have a "Connections" window and a "Contracts" window, plus a Link window. I have about 2 million entries, and I show the last 200 through a filter.
With small amounts of data, serialisation is better than a local database.
In this case, though, it seems you have over 2 million records, so you'd need to pull them all into memory to work with them if you stored them in a flat file.
That sounds like it'd be too much data to handle.
Meaning a local database is very likely your best candidate. Which one suits best depends on how you will access the data and what you'll store.
SQLite would be a candidate if these are simple records.
If you need to know immediately when any change is made to a record, then a push mechanism would be an idea.
You could use SignalR to tell clients data has changed.
If you don't need to know immediately then you could have the client poll and ask what's changed every once in a while.
What I have done in the past is to add a RecentChanges table per logical entity. When a record is changed, a record is added with the ID, timestamp and user. You can then read this table to find what's been changed since a specific time. Where heavy usage and database overheads warranted a more sophisticated approach, I've cached copies of recently changed records on a business server.
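A minimal sketch of that RecentChanges pattern in Python, with hypothetical table and column names; the client remembers the last timestamp it saw and re-fetches only the IDs that changed since then:

```python
# Sketch of the RecentChanges polling pattern described above. Table and
# column names are hypothetical; any DB-API connection would do.
import sqlite3
from datetime import datetime, timezone

con = sqlite3.connect("cache_demo.sqlite")
con.execute("""CREATE TABLE IF NOT EXISTS RecentChanges_Item (
                   item_id INTEGER, changed_at TEXT, changed_by TEXT)""")

def record_change(item_id: int, user: str) -> None:
    """Called by the writer whenever an Item row is modified."""
    con.execute("INSERT INTO RecentChanges_Item VALUES (?, ?, ?)",
                (item_id, datetime.now(timezone.utc).isoformat(), user))
    con.commit()

def changed_since(last_seen: str) -> list:
    """Client-side poll: which item IDs need re-fetching into the cache?"""
    rows = con.execute(
        "SELECT DISTINCT item_id FROM RecentChanges_Item WHERE changed_at > ?",
        (last_seen,)).fetchall()
    return [r[0] for r in rows]

# Example poll: refresh only the stale items in the local cache.
last_seen = "1970-01-01T00:00:00+00:00"
stale_ids = changed_since(last_seen)
```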
I have an application. Suppose it's an invoice service. Each time a user creates an invoice, I need to assign the next sequential number (i.e. ISequentialNumberGeneratorRepository.Next();), so essentially the invoice number must be unique despite having several instances of my application running (horizontal scalability is likely in the future).
In other words, I need a global sequential number generator.
Traditionally this problem is resolved by using a relational database such as SQL server, PostgreSQL, MySQL, etc. because these systems have the capability to generate sequential unique IDs on inserting a record and returning the generated id as part of the same atomic operation, so they're a perfect fit for a centralised sequential number generator.
But I don't have a relational database and I don't need one, so it's a bit brutal having to use one just for this tiny functionality.
I have, however, an EventStore available (EventStore.org) but I couldn't find out whether it has sequential number generation capability.
So my question is: is there any available product out there which I could use to generate unique sequential numbers, so that I can implement my repository's Next() method with it, and which would work well regardless of how many instances of my client invoice application I have?
Note: alternatively, if someone can think of a way to use EventStore for this purpose, or can explain how they achieved this in a DDD/CQRS/ES environment, that'd also be great.
You have not stated the reasons (or presented any code) as to why you want this capability. I will assume the term sequential should be taken as monotonically increasing (sorting, not looping).
I tend to agree with A.Chiesa, I would add timestamps to the list, although not applicable here.
Since your post does not indicate how the data is to be consumed, I propose two solutions, with the second preferred over the first where possible; and for all later visitors: use a database solution instead.
The only way to guarantee numerical order across a horizontally scaled application without aggregation is to use a central server to assign the numbers (using REST or RPCs or custom network code; not to mention an SQL server, as a side note). Due to concurrency, the application must wait its turn for the next number, and once network usage and delay are included, this limits the scalability of the application and introduces a single point of failure. These risks can be minimized by creating multiple instances of the central server and multiple application pools (you will lose the global sorting ability).
As an alternative, I would recommend the Hi/Lo assigning method, combined with batch aggregation. Each instance has a four(?)-digit identifier prefixed to an incrementing number per instance. Schedule an aggregation task on a central server (or more than one, for redundancy) to pick up the data and assign a sequential unique ID during aggregation. This process keeps the data local (until pickup, which could be scheduled at 100, 500 or 1000 millisecond intervals if needed for coherence; minutes or more if not), and provides almost perfect horizontal scaling, with the drawback of increased vertical scaling requirements at the aggregation server(s).
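A minimal sketch of the Hi/Lo idea in Python; the block allocator here is a hypothetical stand-in for whatever shared store reserves "hi" blocks, so no per-invoice network call is needed:

```python
# Sketch of Hi/Lo number generation as described above. The block allocator is
# hypothetical; in practice the "hi" value would come from a shared store.
import itertools
import threading

class HiLoGenerator:
    """Hands out locally incrementing IDs from a reserved block."""

    BLOCK_SIZE = 10_000                      # "lo" range per reserved "hi"

    def __init__(self, next_hi):
        self._next_hi = next_hi              # callable reserving the next block
        self._lock = threading.Lock()
        self._hi = None
        self._lo = self.BLOCK_SIZE           # force a block fetch on first use

    def next(self):
        with self._lock:
            if self._lo >= self.BLOCK_SIZE:  # block exhausted: reserve another
                self._hi = self._next_hi()
                self._lo = 0
            self._lo += 1
            return self._hi * self.BLOCK_SIZE + self._lo

# Stand-in allocator: a process-local counter. A real deployment would reserve
# "hi" blocks from a single shared source (a tiny key-value store, a file, ...).
_hi_counter = itertools.count(1)
gen = HiLoGenerator(next_hi=lambda: next(_hi_counter))

print(gen.next(), gen.next())   # e.g. 10001 10002
```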
Distributed computing is a balancing act, between processing, memory, and communication overhead. Where your computing/memory/network capacity boundaries lie cannot be determined from your post.
There is no single correct answer. I have provided you with two possibilities, but without specific requirements of the task at hand, I can go no further.
IMHO, your requirement is kinda flawed, because you have conflicting needs.
You want a unique id. The usual solutions use:
guid. Can be generated centrally or locally. Really easy to implement. Kinda hard for a human reader, but YMMV. But you want incremental keys.
centrally assigned key: you need a transactional system. But you want to do CQRS, and use Event Store. It seems to me that having a separate transactional system just to have an IDENTITY_COLUMN or a SEQUENCE largely misses the point of doing CQRS.
use a Hi/Lo generation approach. That is: every single client gets a unique seed (like 1 billion for the first client, 2 billion for the second, etc.), so each client can generate a sequence locally. This sequence is distributed and uses sequential numbers, so there are no concurrency problems, but there is no global sorting for requests, and you must ensure that no two clients get the same Hi value (a relatively easy task).
use the id assigned by Event Store. I don't know the product, but every event sent to the queue gets a unique id. But (as I understand it) you require the id to be available BEFORE sending the event.
You can generally mix and match any of these solutions (especially the Hi/Lo algorithm) with timestamps (like seconds since the Unix epoch, or something alike) in order to produce a (weak, non-guaranteed) sortability. But generally I would avoid this, because if you generate IDs on multiple sites you introduce the risk of the clocks being unsynchronized, and generally other unsolved (or unsolvable) problems.
Probably I'm missing something, but these are the ones off the top of my head.
So, as far as I can tell, you are at an impasse. I would try really hard to put myself in one of the previous situations.
That is a strange opinion:
"so it's a bit brutal having to use one just for this tiny functionality."
Today SQLite is used as a relational database even in mobile phones. It is simple, has a small memory footprint and has bindings for all popular programming languages. 20 years ago databases consumed many resources - today you can find a database engine for every task. Also, if you need a tiny key-value store, you can use BerkeleyDB.
I would like some advice on how to best go about what I'm trying to achieve.
I'd like to provide the user with a screen that will display one or more "icons" (so to speak), with a total displayed next to each (a bit like the iPhone does). Don't worry about the UI; the question is not about that, it is more about how to handle the back-end.
Let's say for argument sake, I want to provide the following:
Total number of unread records
Total number of waiting for approval
Total number of pre-approved
Total number of approved
etc...
I suppose the easiest way to describe the above would be "MS Outlook". Whenever emails arrive in your inbox, you can see the number of unread emails being updated immediately. I know it's local, so it's a bit different, but now imagine having the same principle but for the queries above.
This could vary from user to user, and while dynamic stored procedures are not ideal, I don't think I could write one SP for each scenario, but again, that's not the issue here.
Now the recommendation part:
Should I create a timer that polls the database every minute (for example) and runs all my relevant SQL queries, which will then provide me with the relevant information?
Is there a way to do this in real time without a "polling" mechanism, i.e. whenever a query's result changes, it updates the total/count and then pushes the count out to the relevant client(s)?
Should I have some sort of table storing these "totals" for each query and handle updating them immediately via SQL triggers, so that when queried by a user it only reads the "total" rather than trying to calculate it?
The problem with triggers is that they would have to be defined individually, and I'm really trying to keep this as generic as possible... Again, I'm not 100% clear on how to handle this to be honest, so let me know what you think is best or how you would go about it.
Ideally, when a specific query is created, I'd like to provide two choices: a) general, where anyone can use it, and b) specific, where the "username" would be used as part of the query and the count returned would only apply to that user - but that's another issue.
The important part is really the notification part. While the polling is easy, I'm not sure I like it.
Imagine if I had 50 queries to execute and 500 users (unlikely, but still!) looking at the screen with these icons. 500 users would poll the database every minute and 50 queries would be executed each time; this could potentially be 25,000 queries per minute... It just doesn't sound right.
As mentioned, ideally a) I'd love to have the data changes in real time rather than having to wait a minute to be notified of a new "count", and b) I want to reduce the number of queries to a minimum. Maybe I won't have a choice.
The idea behind this, is that they will have a small icon for each of these queries, and a little number will be displayed indicating how many records apply to the relevant query. When they click on this, it will bring them the relevant result data rather than the actual count and then can deal with it accordingly.
I don't know if I've explained this correctly, but if unclear, please ask, but hopefully I have and I'll be able to get some feedback on this.
Looking forward to your feedback.
Thanks.
I am not sure if this is the ideal solution, but it may be a decent one.
The following are the assumptions I have taken
Considering that your front end is a web application, i.e. ASP.NET
The data which needs to be fetched on a regular basis is not huge
The data which needs to be fetched does not change very frequently
If I were in this situation then I would have gone with the following approach
Implement SQL caching using the SqlCacheDependency class. This class will fetch the data from the database and store it in the application's cache. The cache gets invalidated whenever the data in the table on which the dependency is created changes, thus fetching the new data and recreating the cache. You just need to get the data from the cache; everything else (polling the database, etc.) is done by ASP.NET itself. Here is a link which describes the steps to implement SQL caching, and believe me, it is not that difficult to implement.
Use AJAX to update the counts on the UI so that the user does not feel the pinch of a PostBack.
What about "Improving Performance with SQL Server 2008 Indexed Views"?
"This is often particularly effective for aggregate views in decision
support or data warehouse environments"
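As a hedged sketch of that suggestion (table, view and column names are assumptions), an indexed view can keep per-status counts materialized so the dashboard icons only read precomputed totals:

```python
# Sketch: maintain per-status counts with a SQL Server indexed view, so the
# dashboard reads precomputed totals. Table/view/column names are assumptions.
import pyodbc

conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                      "SERVER=localhost;DATABASE=appdb;Trusted_Connection=yes")
cur = conn.cursor()

# Indexed views require SCHEMABINDING and COUNT_BIG(*); once the unique
# clustered index exists, SQL Server keeps the totals up to date on writes.
cur.execute("""
    CREATE VIEW dbo.vw_RecordCounts WITH SCHEMABINDING AS
    SELECT Status, COUNT_BIG(*) AS Total
    FROM dbo.Records
    GROUP BY Status
""")
cur.execute("""
    CREATE UNIQUE CLUSTERED INDEX IX_vw_RecordCounts
        ON dbo.vw_RecordCounts (Status)
""")
conn.commit()

# The per-icon totals are now a tiny read instead of a COUNT over the table.
# (NOEXPAND asks the engine to use the materialized view directly.)
totals = cur.execute(
    "SELECT Status, Total FROM dbo.vw_RecordCounts WITH (NOEXPAND)").fetchall()
```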