I am using an API which is sending some data about products, every 1 second.
on the other hand I have a list of user-created conditions. And I want to check if any data that comes, matches any of the conditions. and if so, I want to notify the user.
for example , user condition maybe like this : price < 30000 and productName = 'chairNumber2'
and the data would be something like this :
{'data':[{'name':'chair1','price':'20000','color':blue},{'name':'chairNumber2','price':'45500','color':green},{'name':'chairNumber2','price':'27000','color':blue}]
I am using microservice architecture, and on validating condition I am sending a message on RabbitMQ to my notification service
I have tried the naïve solution (every 1 second, check every condition , and if any data meets the condition then pass on data my other service)
but this takes so much RAM and time(time order is in n*m,n being the count of conditions, and m is the count of data), so I am looking for a better scenario
It's an interesting problem. I have to confess I don't really know how I would do it - it depends a lot on exactly how fast the processing needs to occur, and a lot of other factors not mentioned - such as what constraints to do you have in terms of the technology stack you have, is it on-premise or in the cloud, must the solution be coded by you/your team or can you buy some $$ tool. For future reference, for architecture questions especially, any context you can provide is really helpful - e.g. constraints.
I did think of Pub-Sub, which may offer patterns you can use, but you really just need a simple implementation that will work within your code base, AND very importantly you only have one consuming client, the RabbitMQ queue - it's not like you have X number of random clients wanting the data. So an off-the-shelf Pub-Sub solution might not be a good fit.
Assuming you want a "home-grown" solution, this is what has come to mind so far:
("flow" connectors show data flow, which could be interpreted as a 'push'; where as the other lines are UML "dependency" lines; e.g. the match engine depends on data held in the batch, but it's agnostic as to how that happens).
The external data source is where the data is coming from. I had not made any assumptions about how that works or what control you have over it.
Interface, all this does is take the raw data and put it into batches that can be processed later by the Match Engine. How the interface works depends on how you want to balance (a) the data coming in, and (b) what you know the match engine expects.
Batches are thrown into a batch queue. It's job is to ensure that no data is lost before its processed, that processing can be managed (order of batch processing, resilience, etc).
Match engine, works fast on the assumption that the size of each batch is a manageable number of records/changes. It's job is to take changes and ask who's interested in them, and return the results to the RabbitMQ. So its inputs are just the batches and the user & user matching rules (more on that later). How this actually works I'm not sure, worst case it iterates through each rule seeing who has a match - what you're doing now, but...
Key point: the queue would also allow you to scale-out the number of match engine instances - but, I don't know what affect that has downstream on the RabbitMQ and it's downstream consumers (the order in which the updates would arrive, etc).
What's not shown: caching. The match engine needs to know what the matching rules are, and which users those rules relate to. The fastest way to do that look-up is probably in memory, not a database read (unless you can be smart about how that happens), which brings me to this addition:
Data Source is wherever the user data, and user matching rules, are kept. I have assumed they are external to "Your Solution" but it doesn't matter.
Cache is something that holds the user matches (rules) & user data. It's sole job is to hold these in a way that is optimized for the Match Engine to work fast. You could logically say it was part of the match engine, or separate. How you approach this might be determined by whether or not you intend to scale-out the match engine.
Data Provider is simply the component whose job it is to fetch user & rule data and make it available for caching.
So, the Rule engine, cache and data provider could all be separate components, or logically parts of the one component / microservice.
Related
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 4 years ago.
Improve this question
I have an application. Suppose it's an invoice service. Each time a user creates an invoice I need to assign the next sequential number (I.e: ISequentialNumberGeneratorRepository.Next(); So essentially the invoice number must be unique despite having several instances of my application running (horizontal scalability is likely in the future).
In other words, I need a global sequential number generator.
Traditionally this problem is resolved by using a relational database such as SQL server, PostgreSQL, MySQL, etc. because these systems have the capability to generate sequential unique IDs on inserting a record and returning the generated id as part of the same atomic operation, so they're a perfect fit for a centralised sequential number generator.
But I don't have a relational database and I don't need one, so it's a bit brutal having to use one just for this tiny functionality.
I have, however, an EventStore available (EventStore.org) but I couldn't find out whether it has sequential number generation capability.
So my question is: Is there any available product out there which I could use to generate unique sequential numbers so that I can implement my Next(); repository's method with, and which would work well independently of how many instances of my client invoice application I have?
Note: Alternatively, if someone can think of a way to use EventStore for this purpose or how did they achieve this in a DDD/CQRS/ES environment it'd also be great.
You have not stated the reasons(or presented any code) as to why you want this capability. I will assume the term sequential should be taken as monotonically increasing(sorting not looping).
I tend to agree with A.Chiesa, I would add timestamps to the list, although not applicable here.
Since your post does not indicate how the data is to be consumed, I purpose two solutions, the second preferred over the first, if possible; and for all later visitors, use a database solution instead.
The only way to guarantee numerical order across a horizontally scaled application without aggregation, is to utilize a central server to assign the numbers(using REST or RPCs or custom network code; not to mention an SQL server, as a side note). Due to concurrency, the application must wait it's turn for the next number and including network usage and delay, this delay limits the scalability of the application, and provides a single point of failure. These risks can be minimized by creating multiple instances of the central server and multiple application pools(You will lose the global sorting ability).
As an alternative, I would recommend the HI/LO Assigning method, combined with batch aggregation. Each instance has a four? digit identifier prefixed to an incrementing number per instance. Schedule an aggregation task on a central(or more than one, for redundancy) server(s) to pickup the data and assign a sequential unique id during aggregation. This process localizes the data(until pickup, which could be scheduled for (100, 500, 1000)? millisecond intervals if needed for coherence; minutes or more ,if not), and provides almost perfect horizontal scaling, with the drawback of increased vertical scaling requirements at the aggregation server(s).
Distributed computing is a balancing act, between processing, memory, and communication overhead. Where your computing/memory/network capacity boundaries lie cannot be determined from your post.
There is no single correct answer. I have provided you with two possibilities, but without specific requirements of the task at hand, I can go no further.
IMHO, your requirement is kinda flawed, because you have conflicting needs.
You want a unique id. The usual solutions use:
guid. Can be generated centrally or locally. Really easy to implement. Kinda hard for a human reader, but YMMV. But you want incremental keys.
centrally assigned key: you need a transactional system. But you want to do CQRS, and use Event Store. It seems to me that having a separate transactional system just to have an IDENTITY_COLUMN or a SEQUENCE largely misses the point of doing CQRS.
use an HiLo generation approach. That is: every single client gets a unique seed (like 1 billion for the first client, 2 billions for the second, etc). So each client can generate locally a sequence. This sequence is distributed and uses sequential numbers, so there is no concurrency problems, but there is no global sorting for requests and you must ensure that no two clients get the same Hi value (relatively easy task).
use the id assigned by Event Store. I don't know the product, but every event sent to the queue gets a unique id. But (as I understand it) you require the id to be available BEFORE sending the event.
You can generally mix-and-match either of this solutions (especially the Hilo algorithm) with timestamps (like seconds from Unix Epoch, or something alike), in order to produce a (weak, non guaranteed) sortability. But generally I would avoid this, because if you generate ids on multiple sites, you introduce the risk of the clocks being unsynchronized, and generally other unsolved (or unsolvable) problems.
Probably I'm missing something, but this are the ones from the top of my head.
So, as far as i can tell, you are in an empasse. I would try really hard to put myself in one of the previous situations.
It is strange opinion
so it's a bit brutal having to use one just for this tiny
functionality.
Today SQLite is used as relational database even in mobile phones. It is simple, have small memory footprint and have binding for all popular programming languages. 20 years ago databases consumed many resources - today you can find database engine for all tasks. Also, if you need tiny key-pair store you can use BerkeleyDB.
My application has different tasks each one posting an XML Document through each HTTP POST on a different endpoint. For every thread I need to keep count of the message I sent, which is identified by a unique incremental number.
I need a mechanism that, after a message has been received by the endpoint will save the last message id sent, so that if there is a problem and the application needs to restart it won't send the same message again, and will restart from where it currently was.
If I don't persist the counters, on my laptop I can manage to obtain a throughput of about 100 messages processed per second for every queue with 5 tasks running. My goal is to achieve no more than a 10/15% reduction in throughput by persisting the counters.
Using SQL Server for saving the counters, with a row for every tasks gives me a 50% decrease in throughput. Saving the counter value on a text file for every task is a bit faster but still far from my goal. I am looking for a way to persist such information so that I can be as close as possible to my goal. I thought that maybe appending the last processed Id rather than updating it could help me in avoiding possible write locks, but the bottom line is that I don't care if for the sake of performance I will have to waste disk space or have a higher startup time for reading the last counter.
In your experience what might be a fast way to avoid contentions and safely persist data from multiple tasks even at the cost of more disk space?
You can get pretty good performance with an ESENT storage, via the ManagedEsent - PersistentDictionary wrapper.
The PersistentDictionary class is concurrent and provides real concurrent access to the ESENT backend. You would represent everything in key-value pair format.
Give it a try, it is not much code to write.
ESENT is an in-process database engine, disk based + in-memory caching, used throughout several Windows components (Search, Exchange, etc). It does provide transactional support, which is what you're after.
It has been included in all versions of Windows since 2000 so you don't need to install any dependencies other than ManagedEsent.
You would probably want to define something like this:
var dictionary = new PersistentDictionary<Guid, int>("ThreadStorage");
The key, I assume, should be something unique (maybe even the service endpoint) so that you are able to re-map it after a restart. The value is the last message identifier.
I am pasting below, shamelessly, their performance benchmarks:
Sequential inserts 32,000 entries/second
Random inserts 17,000 entries/second
Random Updates 36,000 entries/second
Random lookups (database cached in memory) 137,000 entries/second
Linq queries (range of records) 14,000 queries/second
You fit in the Random Updates case, which as you can see offers a really good throughput.
I faced the same issue as OP asked.
I used SQL server Sequence Numbers (with CREATE SEQUENCE).
However, the accepted answer is a good solution to avoid using SQL server.
I would like some advice on how to best go about what I'm trying to achieve.
I'd like to provide a user with a screen that will display one or more "icon" (per say) and display a total next to it (bit like the iPhone does). Don't worry about the UI, the question is not about that, it is more about how to handle the back-end.
Let's say for argument sake, I want to provide the following:
Total number of unread records
Total number of waiting for approval
Total number of pre-approved
Total number of approved
etc...
I suppose, the easiest way to descrive the above would be "MS Outlook". Whenever emails arrive to your inbox, you can see the number of unread email being updated immediately. I know it's local, so it's a bit different, but now imagine having the same principle but for the queries above.
This could vary from user to user and while dynamic stored procedures are not ideal, I don't think I could write one sp for each scenario, but again, that's not the issue heree.
Now the recommendation part:
Should I be creating a timer that polls the database every minute (for example?) and run-all my relevant sql queries which will then provide me with the relevant information.
Is there a way to do this in real time without having a "polling" mechanism i.e. Whenever a query changes, it updates the total/count and then pushes out the count of the query to the relevant client(s)?
Should I have some sort of table storing these "totals" for each query and handle the updating of these immediately based on triggers in SQL and then when queried by a user, it would only read the "total" rather than trying to calculate them?
The problem with triggers is that these would have to be defined individually and I'm really tring to keep this as generic as possible... Again, I'm not 100% clear on how to handle this to be honest, so let me know what you think is best or how you would go about it.
Ideally when a specific query is created, I'd like to provide to choices. 1) General (where anyone can use this) and b) Specific where the "username" would be used as part of the query and the count returned would only be applied for that user but that's another issue.
The important part is really the notification part. While the polling is easy, I'm not sure I like it.
Imagine if I had 50 queries to be execute and I've got 500 users (unlikely, but still!) looking at the screen with these icons. 500 users would poll the database every minute and 50 queries would also be executed, this could potentially be 25000 queries per miuntes... Just doesn't sound right.
As mentioned, ideally, a) I'd love to have the data changes in real-time rather than having to wait a minute to be notified of a new "count" and b) I want to reduce the amount of queries to a minimum. Maybe I won't have a choice.
The idea behind this, is that they will have a small icon for each of these queries, and a little number will be displayed indicating how many records apply to the relevant query. When they click on this, it will bring them the relevant result data rather than the actual count and then can deal with it accordingly.
I don't know if I've explained this correctly, but if unclear, please ask, but hopefully I have and I'll be able to get some feedback on this.
Looking forward to your feeback.
Thanks.
I am not sure if this is the ideal solution but maybe a decent 1.
The following are the assumptions I have taken
Considering that your front end is a web application i.e. asp.net
The data which needs to be fetched on a regular basis is not hugh
The data which needs to be fetched does not change very frequently
If I were in this situation then I would have gone with the following approach
Implemented SQL Caching using SQLCacheDependency class. This class will fetch the data from the database and store in the cache of the application. The cache will get invalidated whenever the data in the table on which the dependency is created changes thus fetching the new data and again creating the cache. And you just need to get the data from the cache rest everything (polling the database, etc) is done by asp.net itself. Here is a link which describes the steps to implement SQL Caching and believe me it is not that difficult to implement.
Use AJAX to update the counts on the UI so that the User does not feel the pinch of PostBack.
What about "Improving Performance with SQL Server 2008 Indexed Views"?
"This is often particularly effective for aggregate views in decision
support or data warehouse environments"
Does anyone have any experience with receiving and updating a large volume of data, storing it, sorting it, and visualizing it very quickly?
Preferably, I'm looking for a .NET solution, but that may not be practical.
Now for the details...
I will receive roughly 1000 updates per second, some updates, some new rows of data records. But, it can also be very burst driven, with sometimes 5000 updates and new rows.
By the end of the day, I could have 4 to 5 million rows of data.
I have to both store them and also show the user updates in the UI. The UI allows the user to apply a number of filters to the data to just show what they want. I need to update all the records plus show the user these updates.
I have an visual update rate of 1 fps.
Anyone have any guidance or direction on this problem? I can't imagine I'm the first one to have to deal with something like this...
At first though, some sort of in memory database I would think, but will it be fast enough for querying for updates near the end of the day once I get a large enough data set? Or is that all dependent on smart indexing and queries?
Thanks in advance.
It's a very interesting and also challenging problem.
I would approach a pipeline design with processors implementing sorting, filtering, aggregation etc. The pipeline needs an async (threadsafe) input buffer that is processed in a timely manner (according to your 1fps req. under a second). If you can't do it, you need to queue the data somewhere, on disk or in memory depending on the nature of your problem.
Consequently, the UI needs to be implemented in a pull style rather than push, you only want to update it every second.
For datastore you have several options. Using a database is not a bad idea, since you need the data persisted (and I guess also queryable) anyway. If you are using an ORM, you may find NHibernate in combination with its superior second level cache a decent choice.
Many of the considerations might also be similar to those Ayende made when designing NHProf, a realtime profiler for NHibernate. He has written a series of posts about them on his blog.
May be Oracle is more appropriate RDBMS solution fo you. The problem with your question is that at this "critical" levels there are too much variables and condition you need to deal with. Not only software, but hardware that you can have (It costs :)), connection speed, your expected common user system setup and more and more and more...
Good Luck.
I am creating a mass mailer application, where a web application sets up a email template and then queues a bunch of email address for sending. The other side will be a Windows service (or exe) that will poll this queue, picking up the messages for sending.
My question is, what would the advantage be of using SQL Service Broker (or MSMQ) over just creating my own custom queue table?
Everything I'm reading is suggesting I use Service Broker, but I really don't see what the huge advantage over a flat table (that would be a lot simpler to work with for me). For reference the application will be used to send 50,000-100,000 emails almost daily.
Do you know how to implement a queue over a flat table? This is not a silly question, implementing a queue over a table correctly is much harder than it sounds. Queue-like-tables are notoriously deadlock prone and you need to carefully consider the table design and the enqueue and dequeue operations. Also, do you know how to scale your pooling of the table? And how are you goind to handle retries and timeouts (ie. what timers are used for)?
I'm not saying you should use SSB. The lerning curve is very steep and is primarily a distributed applicaiton platform, not a local queueing product so some features, like dialogs, will actually be obstacles for you rather than advantages. I'm just saying that you must consider also the difficulties of flat-table-queues. If you never implemented a flat-table-queue then be warned, there are many dragons under that bridge.
50k-100k messages per day is nothing, is only one message per second. If you want 100k per minute, then we have something to talk about.
If you every need to port to another vendor's database, you will have less problem if you used normal tables.
As you seem to only have one reader and one write from your queue, I would tend to use a standard table until you hit problem. However if you start to feel the need to use “locking hints” etc, that the time to switch to the Service Broker Queues.
I would not use MSMQ, if both the sender and the reader need a database connection to work. MSMQ would be good if the sender did not talk to the database at all, as it lets the sender keep working when the database is down. However having to setup and maintain both the MSMQ and the database is likely to be more work then it is worth for most systems.
For advantages of Service Broker see this link:
http://msdn.microsoft.com/en-us/library/ms166063.aspx
In general we try to use a tool or standard functionality rather than building things ourselves. This lowers the cost and can make upgrading easier.
I know this is old question, but is sufficiently abstract to be relevant for long enough time.
After using both paradigms I would suggest flat table. It is surprisingly scalable and nifty. Correct hints need to be used.
Once the application goes distributed, or starts using mutiple allways on groups with different RW and RO servers, the Service Broker (or any other method of distributed communication) becomes a neccessity.
Flat table
needs only few hints (higly dependent on isolation level) to work scalably and reliably in the consumer (READPAST, UPDLOCK, ROWLOCK)
the order of message processing is not set in stone
the consumer must make sure that the message stays in the queue if the processing fails
needs some polling mechanism (job, CDC (here lies madness :)), external application...)
turn of maintenance jobs and automatic statistics for the table
Service broker
needs extremely overblown "infrastructure" (message types, contracts, services, queues, activation procedures, must be enabled after each server restart, conversations need to be correctly created and dropped...)
is extremely opaque - we have spent ages trying to make it run after it mysteriously stopped working
there is a predefined order of message processing
the tables it uses can cause deadlocks themselfs if SB is overused
is the only way (except for linked servers...) to send messages directly from database on RW server of one HA group to a database that is RO in this HA group (without any external app)
is the only way to send messages between different servers (linked servers are a big NONO (unless they become an YESYES - you know the drill - it depends)) (without any external app)