I have an application. Suppose it's an invoice service. Each time a user creates an invoice I need to assign the next sequential number (i.e. ISequentialNumberGeneratorRepository.Next()). So essentially the invoice number must be unique despite having several instances of my application running (horizontal scaling is likely in the future).
In other words, I need a global sequential number generator.
Traditionally this problem is solved with a relational database such as SQL Server, PostgreSQL, MySQL, etc., because these systems can generate a sequential unique ID when inserting a record and return the generated ID as part of the same atomic operation, so they're a perfect fit for a centralised sequential number generator.
But I don't have a relational database and I don't need one, so it's a bit brutal having to use one just for this tiny functionality.
I have, however, an EventStore available (EventStore.org) but I couldn't find out whether it has sequential number generation capability.
So my question is: is there any product available which I could use to generate unique sequential numbers, so that I can implement my repository's Next() method, and which would work well regardless of how many instances of my client invoice application are running?
Note: Alternatively, if someone can think of a way to use EventStore for this purpose, or can explain how they achieved this in a DDD/CQRS/ES environment, that would also be great.
You have not stated the reasons (or presented any code) explaining why you want this capability. I will assume that "sequential" should be taken as monotonically increasing (for sorting, not looping).
I tend to agree with A.Chiesa; I would add timestamps to the list, although they are not applicable here.
Since your post does not indicate how the data is to be consumed, I propose two solutions, the second preferred over the first if possible; and for all later visitors: use a database solution instead.
The only way to guarantee numerical order across a horizontally scaled application without aggregation is to use a central server to assign the numbers (via REST, RPCs, or custom network code; not to mention an SQL server, as a side note). Due to concurrency, each application instance must wait its turn for the next number, and together with network usage and latency this delay limits the scalability of the application and introduces a single point of failure. These risks can be minimized by creating multiple instances of the central server and multiple application pools (you will lose the global sorting ability).
As an alternative, I would recommend the HI/LO assignment method, combined with batch aggregation. Each instance prefixes its own incrementing counter with a (say, four-digit) instance identifier. Schedule an aggregation task on one or more central servers (more than one for redundancy) to pick up the data and assign a sequential unique ID during aggregation. This keeps the data local until pickup (which could be scheduled at 100, 500, or 1000 millisecond intervals if needed for coherence; minutes or more if not), and provides almost perfect horizontal scaling, with the drawback of increased vertical scaling requirements on the aggregation server(s).
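A minimal sketch of the per-instance prefixed counter described above (the class name, the four-digit prefix width, and the id format are my own assumptions for illustration, not an existing library):

using System.Threading;

// Sketch only: each instance stamps its own incrementing counter with a
// four-digit instance identifier; a central aggregation job can later
// assign the globally sequential number.
public class PrefixedIdGenerator
{
    private readonly string _instancePrefix; // e.g. "0003" for instance 3
    private long _counter;

    public PrefixedIdGenerator(int instanceId)
    {
        _instancePrefix = instanceId.ToString("D4");
    }

    public string Next()
    {
        long local = Interlocked.Increment(ref _counter);
        return $"{_instancePrefix}-{local:D10}";
    }
}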
Distributed computing is a balancing act, between processing, memory, and communication overhead. Where your computing/memory/network capacity boundaries lie cannot be determined from your post.
There is no single correct answer. I have provided you with two possibilities, but without specific requirements of the task at hand, I can go no further.
IMHO, your requirement is kinda flawed, because you have conflicting needs.
You want a unique id. The usual solutions use:
guid. Can be generated centrally or locally. Really easy to implement. Kinda hard for a human reader, but YMMV. But you want incremental keys.
centrally assigned key: you need a transactional system. But you want to do CQRS, and use Event Store. It seems to me that having a separate transactional system just to have an IDENTITY_COLUMN or a SEQUENCE largely misses the point of doing CQRS.
use a HiLo generation approach (see the sketch after this list). That is: every single client gets a unique seed (like 1 billion for the first client, 2 billion for the second, etc.), so each client can generate a sequence locally. This sequence is distributed and uses sequential numbers, so there are no concurrency problems, but there is no global ordering of requests, and you must ensure that no two clients get the same Hi value (a relatively easy task).
use the id assigned by Event Store. I don't know the product, but every event sent to the queue gets a unique id. But (as I understand it) you require the id to be available BEFORE sending the event.
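A minimal sketch of that HiLo idea, under the assumptions above (each client is handed a unique hi block one billion ids wide; the class name and block size are illustrative):

using System;
using System.Threading;

// Sketch only: the "hi" value is handed out once, centrally, per client;
// "lo" is incremented locally, so no coordination is needed per id.
public class HiLoGenerator
{
    private const long BlockSize = 1_000_000_000; // ids available per client
    private readonly long _hi;
    private long _lo = -1;

    public HiLoGenerator(long hi) => _hi = hi;

    public long Next()
    {
        long lo = Interlocked.Increment(ref _lo);
        if (lo >= BlockSize)
            throw new InvalidOperationException("Block exhausted; request a new hi value.");
        return _hi * BlockSize + lo;
    }
}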
You can generally mix and match any of these solutions (especially the HiLo algorithm) with timestamps (like seconds since the Unix epoch, or something similar) in order to produce a (weak, non-guaranteed) sortability. But generally I would avoid this, because if you generate ids on multiple sites you introduce the risk of unsynchronized clocks, and generally other unsolved (or unsolvable) problems.
Probably I'm missing something, but these are the ones off the top of my head.
So, as far as I can tell, you are at an impasse. I would try really hard to put myself in one of the previous situations.
That is a strange opinion:
"so it's a bit brutal having to use one just for this tiny functionality."
Today SQLite is used as a relational database even in mobile phones. It is simple, has a small memory footprint, and has bindings for all popular programming languages. Twenty years ago databases consumed a lot of resources; today you can find a database engine for every task. Also, if you need a tiny key-value store, you can use BerkeleyDB.
Related
I am using an API which sends some data about products every second.
On the other hand, I have a list of user-created conditions, and I want to check whether any incoming data matches any of the conditions; if so, I want to notify the user.
For example, a user condition may look like this: price < 30000 and productName = 'chairNumber2'
And the data would be something like this:
{'data':[{'name':'chair1','price':'20000','color':'blue'},{'name':'chairNumber2','price':'45500','color':'green'},{'name':'chairNumber2','price':'27000','color':'blue'}]}
I am using a microservice architecture, and when a condition is validated I send a message over RabbitMQ to my notification service.
I have tried the naïve solution (every second, check every condition, and if any data meets a condition, pass the data on to my other service),
but this takes a lot of RAM and time (the time complexity is O(n*m), where n is the number of conditions and m is the number of data items), so I am looking for a better approach.
It's an interesting problem. I have to confess I don't really know how I would do it - it depends a lot on exactly how fast the processing needs to occur, and a lot of other factors not mentioned - such as what constraints you have in terms of the technology stack, whether it is on-premise or in the cloud, and whether the solution must be coded by you/your team or you can buy some $$ tool. For future reference, for architecture questions especially, any context you can provide is really helpful - e.g. constraints.
I did think of Pub-Sub, which may offer patterns you can use, but you really just need a simple implementation that will work within your code base, AND very importantly you only have one consuming client, the RabbitMQ queue - it's not like you have X number of random clients wanting the data. So an off-the-shelf Pub-Sub solution might not be a good fit.
Assuming you want a "home-grown" solution, this is what has come to mind so far:
("flow" connectors show data flow, which could be interpreted as a 'push'; where as the other lines are UML "dependency" lines; e.g. the match engine depends on data held in the batch, but it's agnostic as to how that happens).
The external data source is where the data is coming from. I had not made any assumptions about how that works or what control you have over it.
Interface: all this does is take the raw data and put it into batches that can be processed later by the Match Engine. How the interface works depends on how you want to balance (a) the data coming in, and (b) what you know the match engine expects.
Batches are thrown into a batch queue. Its job is to ensure that no data is lost before it is processed and that processing can be managed (order of batch processing, resilience, etc.).
Match engine: works fast on the assumption that the size of each batch is a manageable number of records/changes. Its job is to take changes and ask who's interested in them, then return the results to RabbitMQ. So its inputs are just the batches and the users & user matching rules (more on that later). How this actually works I'm not sure; worst case it iterates through each rule seeing who has a match - what you're doing now, but...
Key point: the queue would also allow you to scale out the number of match engine instances - but I don't know what effect that has downstream on RabbitMQ and its downstream consumers (the order in which the updates would arrive, etc.).
What's not shown: caching. The match engine needs to know what the matching rules are, and which users those rules relate to. The fastest way to do that look-up is probably in memory, not a database read (unless you can be smart about how that happens), which brings me to this addition:
Data Source is wherever the user data, and user matching rules, are kept. I have assumed they are external to "Your Solution" but it doesn't matter.
Cache is something that holds the user matches (rules) & user data. Its sole job is to hold these in a way that is optimized for the Match Engine to work fast. You could logically say it is part of the match engine, or separate. How you approach this might be determined by whether or not you intend to scale out the match engine.
Data Provider is simply the component whose job it is to fetch user & rule data and make it available for caching.
So, the Rule engine, cache and data provider could all be separate components, or logically parts of the one component / microservice.
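As a rough sketch of what the match engine's lookup could be (the Product and Rule shapes, and the notification callback, are assumptions for illustration only): indexing the rules by product name means each incoming record is only checked against the rules that mention it, rather than against every rule.

using System;
using System.Collections.Generic;
using System.Linq;

// Sketch only: rules are grouped by product name so matching is not O(n*m)
// over all rules for every record.
public record Product(string Name, decimal Price, string Color);
public record Rule(string UserId, string ProductName, Func<Product, bool> Predicate);

public class MatchEngine
{
    private readonly Dictionary<string, List<Rule>> _rulesByProduct;
    private readonly Action<string, Product> _notify; // e.g. publish to RabbitMQ

    public MatchEngine(IEnumerable<Rule> rules, Action<string, Product> notify)
    {
        _rulesByProduct = rules
            .GroupBy(r => r.ProductName)
            .ToDictionary(g => g.Key, g => g.ToList());
        _notify = notify;
    }

    public void Process(IEnumerable<Product> batch)
    {
        foreach (var product in batch)
        {
            if (!_rulesByProduct.TryGetValue(product.Name, out var candidates))
                continue;
            foreach (var rule in candidates.Where(r => r.Predicate(product)))
                _notify(rule.UserId, product);
        }
    }
}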
I am currently benchmarking two databases, Postgres and MongoDB, on a relatively large data set with equivalent queries. Of course, I am doing my best to put them on equal footing, but I have one dilemma. For Postgres I take the execution time reported by EXPLAIN ANALYZE, and there is a similar concept in MongoDB using profiling (although not exactly equivalent: the millis field).
However, different times are observed if the query is executed from, let's say, pgAdmin or the mongo CLI client, or timed from within my C# app. That time also includes transfer latency, and probably protocol differences. pgAdmin, for example, actually seems to completely distort the execution time (it obviously includes the result rendering time).
The question is: is there any sense in actually measuring the time on the "receiving end", since an application actually does consume that data? Or does it just include too many variables and does not contribute anything to the actual database performance, and I should stick to the reported DBMS execution times?
The question you'd have to answer is why are you benchmarking the databases? If you are benchmarking so you can select one over the other, for use in a C# application, then you need to measure the time "on the 'receiving end'". Whatever variables that may contain, that is what you need to compare.
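A minimal sketch of measuring on the receiving end, assuming the Postgres side uses Npgsql (the connection string and query are placeholders; the MongoDB side would be timed the same way with its C# driver): time the full round trip, including consuming the results, exactly as the application would.

using System;
using System.Diagnostics;
using Npgsql;

// Sketch only: measures the whole round trip as the application sees it.
class ClientSideBenchmark
{
    static void Main()
    {
        var stopwatch = Stopwatch.StartNew();

        using var connection = new NpgsqlConnection("<connection string>");
        connection.Open();
        using var command = new NpgsqlCommand("SELECT * FROM your_table", connection);
        using var reader = command.ExecuteReader();
        int rows = 0;
        while (reader.Read())
            rows++; // force the client to actually consume the data

        stopwatch.Stop();
        Console.WriteLine($"{rows} rows in {stopwatch.ElapsedMilliseconds} ms");
    }
}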
My application has several tasks, each one posting an XML document via HTTP POST to a different endpoint. For every task I need to keep count of the messages I send, each of which is identified by a unique incremental number.
I need a mechanism that, after a message has been received by the endpoint, saves the last message ID sent, so that if there is a problem and the application needs to restart, it won't send the same message again and will resume from where it was.
If I don't persist the counters, on my laptop I can manage to obtain a throughput of about 100 messages processed per second for every queue with 5 tasks running. My goal is to achieve no more than a 10/15% reduction in throughput by persisting the counters.
Using SQL Server to save the counters, with a row for every task, gives me a 50% decrease in throughput. Saving the counter value in a text file for every task is a bit faster, but still far from my goal. I am looking for a way to persist this information so that I can get as close as possible to my goal. I thought that maybe appending the last processed ID rather than updating it could help me avoid possible write locks, but the bottom line is that, for the sake of performance, I don't mind wasting disk space or having a longer startup time to read the last counter.
In your experience what might be a fast way to avoid contentions and safely persist data from multiple tasks even at the cost of more disk space?
You can get pretty good performance with ESENT storage, via the ManagedEsent PersistentDictionary wrapper.
The PersistentDictionary class is concurrent and provides real concurrent access to the ESENT backend. You would represent everything in key-value pair format.
Give it a try, it is not much code to write.
ESENT is an in-process database engine, disk based + in-memory caching, used throughout several Windows components (Search, Exchange, etc). It does provide transactional support, which is what you're after.
It has been included in all versions of Windows since 2000 so you don't need to install any dependencies other than ManagedEsent.
You would probably want to define something like this:
var dictionary = new PersistentDictionary<Guid, int>("ThreadStorage");
The key, I assume, should be something unique (maybe even the service endpoint) so that you are able to re-map it after a restart. The value is the last message identifier.
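A rough sketch of how that could look for this scenario, assuming the PersistentDictionary API from ManagedEsent behaves as described (keying by endpoint string rather than a Guid, and flushing after every update, are my own assumptions, not part of the answer above):

using System;
using Microsoft.Isam.Esent.Collections.Generic;

// Sketch only: persists the last message id per endpoint/task.
class CounterStore : IDisposable
{
    private readonly PersistentDictionary<string, int> _lastMessageIds =
        new PersistentDictionary<string, int>("CounterStorage");

    public int GetLastId(string endpoint) =>
        _lastMessageIds.TryGetValue(endpoint, out var id) ? id : 0;

    public void SaveLastId(string endpoint, int messageId)
    {
        _lastMessageIds[endpoint] = messageId;
        _lastMessageIds.Flush(); // make sure the value survives a restart
    }

    public void Dispose() => _lastMessageIds.Dispose();
}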
I am pasting below, shamelessly, their performance benchmarks:
Sequential inserts 32,000 entries/second
Random inserts 17,000 entries/second
Random Updates 36,000 entries/second
Random lookups (database cached in memory) 137,000 entries/second
Linq queries (range of records) 14,000 queries/second
You fit in the Random Updates case, which as you can see offers a really good throughput.
I faced the same issue as the OP.
I used SQL Server sequence numbers (with CREATE SEQUENCE).
However, the accepted answer is a good solution to avoid using SQL server.
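For reference, a minimal sketch of that approach from C# (the sequence name, connection string, and wrapper class are assumptions; the sequence itself is created once with CREATE SEQUENCE dbo.InvoiceNumbers START WITH 1 INCREMENT BY 1):

using Microsoft.Data.SqlClient;

// Sketch only: fetches the next value from a SQL Server sequence.
class SequenceNumberGenerator
{
    private readonly string _connectionString;

    public SequenceNumberGenerator(string connectionString) =>
        _connectionString = connectionString;

    public long Next()
    {
        using var connection = new SqlConnection(_connectionString);
        connection.Open();
        using var command = new SqlCommand(
            "SELECT NEXT VALUE FOR dbo.InvoiceNumbers;", connection);
        return (long)command.ExecuteScalar();
    }
}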
I have a webshop with a lot of products and other content. Currently I load all content into a global list at Application_Start, which takes approximately 15-25 seconds.
This makes the site really fast, as I can get any product/content in O(1) time.
However, is this best practice?
Currently I have shared hosting, not a VPS / dedicated server, so it recycles the application from time to time, which gives random visitors load times of up to 15-25 seconds (a number that will only grow with more content). This is of course totally unacceptable, but I guess it would be solved with a VPS.
What is the normal way of doing this? I guess a webshop like Amazon probably doesn't load all of its products into one huge list :-D
Any thoughts and ideas would be highly appreciated.
It looks like you've answered your own question for your case: "This is of course totally unacceptable".
If your goal is O(1), a normal database request for a single product is likely O(1) anyway, unless you need complicated joins between products. Consider dropping all your pre-caching logic and see whether you actually have a performance problem. You can limit the startup impact by using lazy caching instead.
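A rough sketch of the lazy-caching idea (the Product type, the LoadProductFromDatabase call, and the 30-minute expiry are assumptions for illustration): each product is loaded on first request and then served from memory, instead of loading everything at Application_Start.

using System;
using System.Runtime.Caching;

// Sketch only: cache a product lazily, on first request.
public static class ProductCache
{
    private static readonly MemoryCache Cache = MemoryCache.Default;

    public static Product GetProduct(int id)
    {
        string key = "product:" + id;
        if (Cache.Get(key) is Product cached)
            return cached;

        Product product = LoadProductFromDatabase(id); // hypothetical data access call
        Cache.Set(key, product, DateTimeOffset.UtcNow.AddMinutes(30));
        return product;
    }

    private static Product LoadProductFromDatabase(int id)
    {
        // placeholder for the real database read
        throw new NotImplementedException();
    }
}

public class Product { public int Id { get; set; } public string Name { get; set; } }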
Large sites often use distributed caching like Memcached.
A more scalable setup is to set up a web service to provide the content, which the website calls when it needs it. The web service will need to cache frequently needed content to achieve fast response times.
First of all, 15-20 seconds to load the data is too much time, so I suspect one of these causes:
The time is spent on compilation, not on loading the data.
There is too much data and you are filling up memory.
The method you use to read the data is very slow.
The data storage is too slow, or the data is structured in a text file and reading it is slow.
My opinion is that you should cache only the small amounts of data that you need to use many times within a short period. The approach you describe is not good practice, for several reasons:
If you have many application pools, you read the same data in every pool and waste memory for no reason.
The data you cache cannot be changed; it is read-only.
Even if you cache some data, you still need to render the page, and that is where you actually need to cache: on the final render, not on the data.
What and how to cache:
We cache the final rendered page (see the sketch below).
We also enable client-side caching for the page and other elements.
We read and write data from the database as requests come in, and we let the database do the caching; it knows better.
If we cache data, it should be a small amount that is needed inside a long loop, so that we avoid calling the database many times.
Also, we cache items as they are requested, and if an item is not used for a long time, or memory needs the space, that part of the cache is discarded. If some part of the data comes from a complex combination of many tables, we build a big temporary flat table that keeps all the data together, one item per row. This table is temporary, and if it grows too large we move that part of the data into a second temporary database file.
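A minimal sketch of caching the final rendered page, assuming ASP.NET MVC output caching (the controller, action, and cache settings are illustrative assumptions):

using System.Web.Mvc;
using System.Web.UI;

// Sketch only: cache the rendered output per product id, on the server and on the client.
public class ProductController : Controller
{
    [OutputCache(Duration = 60, VaryByParam = "id", Location = OutputCacheLocation.ServerAndClient)]
    public ActionResult Details(int id)
    {
        var product = LoadProduct(id); // hypothetical data access call
        return View(product);
    }

    private object LoadProduct(int id) => new { Id = id };
}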
How fast is the database read? Fast enough that you do not need to worry about it; you need to check other points of delay, such as, as I said, the full render of a page or some parts of the page.
What you need to worry about is good database design, a good and fast way to retrieve your data, and well-optimized code to display it.
Separation of responsibilities will help you scale for the future.
With your current setup, you are limited to the resources of your web server, and, like you said, your start up times will grow out of control as you continue adding more products.
If you share the burden of each page request with SQL Server, you open up your application to allow it to scale as needed. Over time, you may decide to add more web servers, cluster SQL Server, or switch to a new database back-end altogether. However, if all the burden is on the application pool, then you are drastically limiting yourself.
Does anyone have any experience with receiving and updating a large volume of data, storing it, sorting it, and visualizing it very quickly?
Preferably, I'm looking for a .NET solution, but that may not be practical.
Now for the details...
I will receive roughly 1000 updates per second, some of them updates to existing records, some of them new rows. It can also be very bursty, with sometimes 5000 updates and new rows at once.
By the end of the day, I could have 4 to 5 million rows of data.
I have to both store them and also show the user updates in the UI. The UI allows the user to apply a number of filters to the data to just show what they want. I need to update all the records plus show the user these updates.
I have a visual update rate of 1 fps.
Anyone have any guidance or direction on this problem? I can't imagine I'm the first one to have to deal with something like this...
At first thought, I would use some sort of in-memory database, but will it be fast enough to query for updates near the end of the day, once I have a large enough data set? Or does that all depend on smart indexing and queries?
Thanks in advance.
It's a very interesting and also challenging problem.
I would approach this with a pipeline design, with processors implementing sorting, filtering, aggregation, etc. The pipeline needs an async (thread-safe) input buffer that is processed in a timely manner (per your 1 fps requirement, under a second). If you can't keep up, you need to queue the data somewhere, on disk or in memory, depending on the nature of your problem.
Consequently, the UI needs to be implemented in a pull style rather than push; you only want to update it once per second.
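A rough sketch of that buffered, pull-style setup (the Update type and the batch callback are assumptions for illustration): producers enqueue updates as they arrive, and a timer drains the buffer once per second for the sorting/filtering processors and the UI.

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

// Sketch only: thread-safe input buffer drained on a fixed interval.
public class UpdatePipeline : IDisposable
{
    private readonly ConcurrentQueue<Update> _buffer = new ConcurrentQueue<Update>();
    private readonly Action<IReadOnlyList<Update>> _onBatchReady;
    private readonly Timer _timer;

    public UpdatePipeline(Action<IReadOnlyList<Update>> onBatchReady)
    {
        _onBatchReady = onBatchReady;
        // Fire roughly once per second to match the 1 fps UI refresh.
        _timer = new Timer(_ => Drain(), null, TimeSpan.FromSeconds(1), TimeSpan.FromSeconds(1));
    }

    public void Enqueue(Update update) => _buffer.Enqueue(update);

    private void Drain()
    {
        var batch = new List<Update>();
        while (_buffer.TryDequeue(out var update))
            batch.Add(update);
        if (batch.Count > 0)
            _onBatchReady(batch); // sort/filter/aggregate, then hand to the UI
    }

    public void Dispose() => _timer.Dispose();
}

public class Update { /* fields for the incoming record or change */ }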
For datastore you have several options. Using a database is not a bad idea, since you need the data persisted (and I guess also queryable) anyway. If you are using an ORM, you may find NHibernate in combination with its superior second level cache a decent choice.
Many of the considerations might also be similar to those Ayende made when designing NHProf, a realtime profiler for NHibernate. He has written a series of posts about them on his blog.
Maybe Oracle is a more appropriate RDBMS solution for you. The problem with your question is that at these "critical" levels there are too many variables and conditions you need to deal with: not only software, but the hardware you can afford (it costs :)), connection speed, your expected typical user setup, and more and more and more...
Good Luck.