Buffering of observable to stabilize variable delays for slower observer - c#

I have one observable that produces a sequence of numbers with delays in between the numbers that range from 0 to 1 second (randomly):
var random = new Random();
var randomDelaysObservable = Observable.Create<int>(async observer =>
{
var value = 0;
while (true)
{
// delay from 0 to 1 second
var randomDelay = TimeSpan.FromSeconds(random.NextDouble());
await Task.Delay(randomDelay);
observer.OnNext(value++);
}
return Disposable.Empty;
// ReSharper disable once FunctionNeverReturns
});
I would like to have a consumer that consumes those numbers and writes them out to the console, but only to take a single number every 2 seconds (exactly every two seconds).
Right now, I have this code for the observer (although I know it isn't correct with the use of await):
var delayedConsoleWritingObserver = Observer.Create<int>(async value =>
{
// fixed delay of 2 seconds
var fixedDelay = TimeSpan.FromSeconds(2);
await Task.Delay(fixedDelay);
Console.WriteLine($"[{DateTime.Now:O}] Received value: {value}.");
});
randomDelaysObservable.Subscribe(delayedConsoleWritingObserver);
If the producer produces numbers every 0 to 1 second, and the consumer is only able to consume single number every 2 seconds, it's clear that the producer produces the numbers faster than the consumer can consume them (backpressure). What I would like to do is to be able to "preload" e.g. 10 or 20 of the numbers from the producer in advance (if the consumer cannot process them fast enough) so that the consumer could consume them without the random delays (but not all of them as the observable sequence is infinite, and we'd run out of memory if it was running for some time).
This would sort of stabilize the variable delays from the producer if I have a slower consumer. However, I cannot think of a way to do this with the operators in ReactiveX. I've looked at the documentation of Buffer, Sample, Debounce and Window, and none of them look like what I'm looking for.
Any ideas on how this would be possible? Please note that even my observer code isn't really correct with the async/await, but I wasn't able to think of a better way to illustrate what I'm trying to achieve.
EDIT:
As pointed out by Schlomo, I might not have formulated the question well, as it looks a bit more like an XY-problem. I'm sorry about that. I'll try to illustrate the process I'm trying to model on another example. I don't really care that much about exact time delays on the producer or consumer side. The time delays were really just a placeholder for some asynchronous work that takes some time.
I'm more thinking about a general pattern, where producer produces items at some variable rate, I want to process all the items, and consumer also can only consume the items at some variable rate. And I'm trying to do this more effectively.
I'll try to illustrate this on a more real-world example of a pizza place 🍕.
Let's say that I'm the owner of a pizza place, and we serve just one kind of pizza, e.g. pizza Margherita.
I have one cook employed in the kitchen who makes the pizzas.
Whenever an order comes in for a pizza, he takes the order and prepares the pizza.
Now when I look at this as the owner, I see that it's not efficient. Every time a new order comes in, he has to start preparing the pizza. I think we can increase the throughput and serve the orders faster.
We only make one kind of pizza. I'm thinking that maybe if the cook has free time on his hands and there are no currently pending orders, he could prepare a couple of pizzas in advance. Let's say I'd let him prepare up to 10 pizzas in advance -- again, only when he has free time and is not busy fulfilling orders.
When we open the place in the morning, we've got no pizzas prepared in advance, and we just serve the orders as they come in. As soon as there's just a little bit of time and no orders are pending, the cook starts putting the pizzas aside in a queue. And he only stops once there are 10 pizzas in the queue. If there is an incoming order, we just fulfill it from the queue, and the cook needs to fill in the queue from the other end. For example, if we've got the queue completely filled with all 10 pizzas, and we take 1 pizza out, leaving 9 pizzas in the queue, the cook should immediately start preparing the 1 pizza to fill the queue again to 10 pizzas.
I see the generalized problem as a producer-consumer where the producer produces each item in some time, and consumer consumes each item in some time. And by adding this "buffer queue" between them, we can improve the throughput, so they wouldn't have to wait for each other that much. But I want to limit the size of the queue to 10 to avoid making too many pizzas in advance.
Now to the possible operators from Rx:
Throttle and Sample won't work because they discard items produced by the producer. Throughout the day, I don't want to throw away any pizzas that the cook makes. Maybe at the end of the day, if a few uneaten pizzas are left, it's OK, but I don't want to discard anything during the day.
Buffer won't work because that would just basically mean to prepare the pizzas in batches of 10. That's not what I want to do because I would still need to wait for every batch of 10 pizzas whenever the previous batch is gone. Also, I would still need to prepare the first batch of 10 pizzas first thing in the morning, and I couldn't just start fulfilling orders. So if there would be 10 people waiting in line before the place opens, I would serve all those 10 people at once. That's not how I want it to work, I want "first come first served" as soon as possible.
Window is a little bit better than Buffer in this sense, but I still don't think it works completely like the queue that I described above. Again, when the queue is filled with 10 pizzas, and one pizza gets out, I immediately want to start producing new pizza to fill the queue again, not wait until all 10 pizzas are out.
Hope this helps in illustrating my idea a little bit better. If it's still not clear, maybe I can come up with some better code samples and start a new question later.

Here's what your observables could look like using pure Rx:
var producer = Observable.Generate(
(r: new Random(), i: 0), // initial state
_ => true, // condition
t => (t.r, t.i + 1), // iterator
t => t.i, // result selector
t => TimeSpan.FromSeconds(t.r.NextDouble()) // timespan generator
);
var consumer = producer.Zip(
Observable.Interval(TimeSpan.FromSeconds(2)),
(value, _) => value // keep the produced value; the interval only sets the pace
);
However, that doesn't make it easy to 'grab the first n without delay'. So we can instead create a non-time-gapped producer:
var rawProducer = Observable.Range(0, int.MaxValue);
then create the time gaps separately:
var timeGaps = Observable.Repeat(TimeSpan.Zero).Take(10) //or 20
.Concat(Observable.Generate(new Random(), r => true, r => r, r => TimeSpan.FromSeconds(r.NextDouble())));
then combine those two:
var timeGappedProducer = rawProducer.Zip(timeGaps, (i, ts) => Observable.Return(i).Delay(ts))
.Concat();
the consumer looks basically the same:
var lessPressureConsumer = timeGappedProducer.Zip(
Observable.Interval(TimeSpan.FromSeconds(2)),
(value, _) => value // keep the produced value; the interval only sets the pace
);
Given all of that, I don't really understand why you want to do this. It's not a good way to handle back-pressure, and the question sounds like a bit of an XY-problem. The operators you mention (Sample, Throttle, etc.) are better ways of handling back-pressure.

Your problem as described is well suited to a simple bounded buffer shared between the producer and the consumer. The producer must have a condition associated with writing to the buffer, stating that the buffer must not be full. The consumer must have a condition stating that the buffer must not be empty.
See the following example using the Ada language.
with Ada.Text_IO; use Ada.Text_IO;
procedure Main is
type Order_Nums is range 1..10_000;
Type Index is mod 10;
type Buf_T is array(Index) of Order_Nums;
protected Orders is
entry Prepare(Order : in Order_Nums);
entry Sell(Order : out Order_Nums);
private
Buffer : Buf_T;
P_Index : Index := Index'First;
S_Index : Index := Index'First;
Count : Natural := 0;
end Orders;
protected body Orders is
entry Prepare(Order : in Order_Nums) when Count < Index'Modulus is
begin
Buffer(P_Index) := Order;
P_Index := P_Index + 1;
Count := Count + 1;
end Prepare;
entry Sell(Order : out Order_Nums) when Count > 0 is
begin
Order := Buffer(S_Index);
S_Index := S_Index + 1;
Count := Count - 1;
end Sell;
end Orders;
task Chef is
Entry Stop;
end Chef;
task Seller is
Entry Stop;
end Seller;
task body Chef is
The_Order : Order_Nums := Order_Nums'First;
begin
loop
select
accept Stop;
exit;
else
delay 1.0; -- one second
Orders.Prepare(The_Order);
Put_Line("Chef made order number " & The_Order'Image);
The_Order := The_Order + 1;
exit when The_Order = Order_Nums'Last;
end select;
end loop;
end Chef;
task body Seller is
The_Order : Order_Nums;
begin
loop
select
accept Stop;
exit;
else
delay 2.0; -- two seconds
Orders.Sell(The_Order);
Put_Line("Sold order number " & The_Order'Image);
end select;
end loop;
end Seller;
begin
delay 60.0; -- 60 seconds
Chef.Stop;
Seller.Stop;
end Main;
The shared buffer is named Orders. Orders contains a circular buffer of 10 Order_Nums. The index for the array containing the orders is declared as mod 10 which contains the values 0 through 9. Ada modular types exhibit wrap-around arithmetic, so incrementing past 9 wraps to 0.
The Prepare entry has a boundary condition requiring Count < Index'Modulus, which evaluates to Count < 10 in this instance. The Sell entry has a boundary condition Count > 0.
The Chef task waits 1 second to produce a pizza, but waits until there is room in the buffer. As soon as there is room in the buffer Chef produces an order. Seller waits 2 seconds to consume an order.
Each task terminates when its Stop entry is called. Main waits 60 seconds and then calls the Stop entries for each task.
The output of the program is:
Chef made order number 1
Sold order number 1
Chef made order number 2
Chef made order number 3
Sold order number 2
Chef made order number 4
Chef made order number 5
Sold order number 3
Chef made order number 6
Chef made order number 7
Sold order number 4
Chef made order number 8
Chef made order number 9
Sold order number 5
Chef made order number 10
Chef made order number 11
Sold order number 6
Chef made order number 12
Chef made order number 13
Sold order number 7
Chef made order number 14
Chef made order number 15
Sold order number 8
Chef made order number 16
Chef made order number 17
Sold order number 9
Chef made order number 18
Chef made order number 19
Sold order number 10
Chef made order number 20
Sold order number 11
Chef made order number 21
Sold order number 12
Chef made order number 22
Chef made order number 23
Sold order number 13
Sold order number 14
Chef made order number 24
Sold order number 15
Chef made order number 25
Sold order number 16
Chef made order number 26
Chef made order number 27
Sold order number 17
Chef made order number 28
Sold order number 18
Chef made order number 29
Sold order number 19
Sold order number 20
Chef made order number 30
Sold order number 21
Chef made order number 31
Chef made order number 32
Sold order number 22
Sold order number 23
Chef made order number 33
Sold order number 24
Chef made order number 34
Sold order number 25
Chef made order number 35
Sold order number 26
Chef made order number 36
Chef made order number 37
Sold order number 27
Sold order number 28
Chef made order number 38
Sold order number 29
Chef made order number 39
Sold order number 30
Chef made order number 40
Sold order number 31
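For readers who don't use Ada, the same bounded-buffer idea can be sketched in Python. This is a minimal illustration rather than a translation of the program above: it relies on the standard library's queue.Queue, whose put blocks while the queue is full and whose get blocks while it is empty. The names run_pizza_place, total_orders and capacity are hypothetical, and the delays are left out so the sketch finishes immediately.

```python
import queue
import threading

def run_pizza_place(total_orders=50, capacity=10):
    """Producer/consumer pair sharing a bounded FIFO queue (hypothetical helper)."""
    buffer = queue.Queue(maxsize=capacity)  # put() blocks while the queue is full
    sold = []
    max_seen = 0

    def chef():
        for order in range(1, total_orders + 1):
            buffer.put(order)   # waits for free space, like the Prepare entry
        buffer.put(None)        # sentinel: no more pizzas will be made

    def seller():
        nonlocal max_seen
        while True:
            max_seen = max(max_seen, buffer.qsize())
            order = buffer.get()  # waits for an item, like the Sell entry
            if order is None:
                break
            sold.append(order)

    threads = [threading.Thread(target=chef), threading.Thread(target=seller)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sold, max_seen
```

The chef can run at most capacity items ahead of the seller, and orders are still sold first come, first served.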

Related

Efficient and deterministic ranking items in collection?

I have a list of billions of items in SQL that the user can reorder at random by moving an item to another position in the list. I am considering a simple solution that stores a double rank per item and halves the gap between neighbours:
Id, Rank
1 10
2 20
3 30
4 40
5 50
Now the user moves item id=3 to the first position, and I recalculate its rank from its adjacent items (0 means no neighbour on the left, max means no neighbour on the right):
Id, Rank
3 (0+10)/2 = 5
1 10
2 20
4 40
5 50
There is a flaw, though: this works only until the gap between neighbours shrinks to the precision limit of double. After that you end up with several elements whose ranks can no longer be told apart, and they cannot be moved.
This can be avoided by an infrequent recalculation of the ranks for the entire collection, but I hesitate to implement this at the moment because it looks like too much work.
I would like to know whether there is an algorithmic solution that avoids updating billions of items, or whether this problem has a well-known name so that I can find an appropriate solution myself.
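To make the precision limit concrete, here is a small sketch that counts how many times an item can be inserted at the same spot before the midpoint collides with a neighbour. It assumes IEEE 754 doubles (Python's float); insertions_until_exhausted is a hypothetical helper, not part of any library.

```python
def insertions_until_exhausted(lo, hi):
    """Count how often an item can be placed directly after `lo` before
    the midpoint rank equals one of its neighbours (hypothetical helper)."""
    count = 0
    while True:
        mid = (lo + hi) / 2
        if mid == lo or mid == hi:
            return count  # no representable rank is left between lo and hi
        hi = mid          # the next item goes between lo and the new item
        count += 1

steps = insertions_until_exhausted(10.0, 20.0)
print(steps)
```

Between ranks 10 and 20 the gap halves on every insert, so the count is on the order of fifty, which is why repeated moves to the same position exhaust the scheme quickly.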

Consistent number generator from multiple input variables

I want to generate a fictional job title from some information I have about the visitor.
For this, I have a table of about 30 different job titles:
01 CEO
02 CFO
03 Key Account Manager
...
29 Window Cleaner
30 Dishwasher
I'm trying to find a way to generate one of these titles from a few different variables like name, age, education history, work history and so on. I want it to be somewhat random but still consistent, so that the same variables always result in the same title.
I also want the different variables to have some impact on the result. Lower numbers are "better" jobs and higher numbers are "worse" jobs, but it doesn't have to be very accurate, just not completely random.
So take these two people as an example.
Name: Joe Smith
Number of previous employers: 10
Number of years education: 8
Age: 56
Name: Samantha Smith
Number of previous employers: 1
Number of years education: 0
Age: 19
Now the reason I want the name in there is to have a bit of randomness, so that two co-workers of the same age with the same background don't get exactly the same title. So I was thinking of using the number of letters in the name to mix it up a bit.
Now I can generate consistent numbers in an infinite number of ways, like the number of letters in the name * age * years of education * number of employers. This would come out as 35 840 for Joe Smith and 247 for Samantha Smith (leaving out her 0 years of education, which would otherwise make the product 0). But I want it to be a number between 1-30 where Samantha is closer to 25-30 and Joe is closer to 1-5.
Maybe this is more of a math problem than a programming problem, but I have seen a lot of "What's your pirate name?" and similar apps out there and I can't figure out how they work. "What's your pirate name?" might be a bad example, since it's probably completely random and I want my variables to matter some, but the idea is the same.
What I have tried
I tried adding weights to variable groups so I would get an easier number to use in my calculations.
Age
01-20 5
20-30 4
30-40 3
40-50 2
...
Years of education
00-01 0
01-02 1
02-03 2
04-05 3
...
I added them together and played around with those numbers, but there were a lot of problems, like everyone ending up in pretty much the same mid-range (no one got to be CEO or dishwasher, everyone was somewhere in the middle), not to mention how messy the code was.
Is there a good way to accomplish what I want to do without having to build a massive math engine?
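int numberOfTitles = 30;
// Note: string.GetHashCode() is not guaranteed to be stable across processes
// or runtime versions; use a fixed hash function if titles must persist.
var semiRandomID = person.Name.GetHashCode()
^ person.NumberOfPreviousEmployers.GetHashCode()
^ person.NumberOfYearsEducation.GetHashCode()
^ person.Age.GetHashCode();
// Masking clears the sign bit; Math.Abs would throw for int.MinValue.
var semiRandomTitle = (semiRandomID & int.MaxValue) % numberOfTitles;
// adjust semiRandomTitle as you see fit
semiRandomTitle += ((person.Age / 10) - 2);
semiRandomTitle += (person.NumberOfYearsEducation / 2);
// keep the result inside the table (0-based index here)
semiRandomTitle = Math.Max(0, Math.Min(numberOfTitles - 1, semiRandomTitle));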
The semiRandomID is a number that is generated from unique hashes of each component. The numbers are unique so that you will always generate the same number for "Joe" for example, but they don't mean anything. It's just a number. So we take all those unique numbers and generate one job title out of the 30 available. Every person has the same chance to get each job title (probably some math freak will prove that there are edge cases to the contrary, but for all practical, non-cryptographic means, it's sufficient).
Now each person has one job title assigned that looks random. However, as it's math and not randomness, they will get the same every time.
Now let's assume Joe got Taxi-Driver, the number 20. However, he has 10 years of formal education, so you decide you want to have that aspect have some weight. You could just add the years onto the job title number, but that would make anyone with 30 years of college parties CEO, so you decide (arbitrarily) that each year of education counts for half a job title. You add (NumberOfYearsEducation / 2) to the job title.
Let's assume Jane got CIO, the number 5. However, she is only 22 years old, a little young to be that high on the list. Again, you could just add the years onto the job title number, but that would make anyone with 30 years of age a CEO, so you decide (arbitrarily) that each year counts as 1/10 of a job title. In addition, you think that being very young should instead subtract from the job title. All years below the first 20 should indeed be a negative weight. So the formula would be ((Age / 10) - 2). One point for each 10 years of age, with the first 2 counting as negative.
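The same idea can be sketched in Python with a hash that is stable between runs. This is only an illustration under assumptions: title_number and its weights are hypothetical, the result is 1-based to match the title table in the question, and the adjustments are subtracted because the question treats lower numbers as "better" jobs.

```python
import hashlib

TITLE_COUNT = 30  # size of the title table from the question

def title_number(name, employers, education_years, age):
    """Map a person's data to a stable title number in 1..TITLE_COUNT."""
    # hashlib gives the same digest in every process, unlike Python's
    # built-in hash() for str, which is salted per run by default.
    key = f"{name}|{employers}|{education_years}|{age}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    base = int.from_bytes(digest[:8], "big") % TITLE_COUNT + 1
    # Assumed weights: education and age pull towards lower numbers.
    adjusted = base - education_years // 2 - (age // 10 - 2)
    return max(1, min(TITLE_COUNT, adjusted))  # clamp into the table

print(title_number("Joe Smith", 10, 8, 56))
print(title_number("Samantha Smith", 1, 0, 19))
```

Clamping keeps every result inside the table; the weights would need tuning to get the spread of titles you want.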

Ideas about Generating Untraceable Invoice IDs

I want to print invoices for customers in my app. Each invoice has an Invoice ID. I want IDs to be:
Sequential (IDs issued later have higher values)
32 bit integers
Not easily traceable like 1 2 3 so that people can't tell how many items we sell.
An idea of my own:
Number of seconds since a specific date and time (e.g. 1/1/2010 12:00 AM).
Any other ideas on how to generate these numbers?
I don't like the idea of using time. You can run into all sorts of issues - time differences, several events happening in a single second and so on.
If you want something sequential and not easily traceable, how about generating a random number between 1 and whatever you wish (for example 100) for each new Id. Each new Id will be the previous Id + the random number.
You can also add a constant to your IDs to make them look more impressive. For example you can add 44323 to all your IDs and turn IDs 15, 23 and 27 into 44338, 44346 and 44350.
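A minimal sketch of this random-increment idea in Python follows. The helper name next_invoice_id and the use of the secrets module are assumptions made for illustration; the starting value reuses the constant from the example above.

```python
import secrets

def next_invoice_id(previous_id, max_step=100):
    """Return an ID that is between 1 and max_step larger than the previous one."""
    return previous_id + 1 + secrets.randbelow(max_step)

ids = [44323]  # arbitrary starting constant
for _ in range(5):
    ids.append(next_invoice_id(ids[-1]))
print(ids)
```

The IDs stay strictly increasing, but with an average step of about 50 a 32-bit range is used up roughly 50 times faster than with plain counting.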
There are two problems in your question. One is solvable, one isn't (with the constraints you give).
Solvable: Unguessable numbers
The first one is quite simple: It should be hard for a customer to guess a valid invoice number (or the next valid invoice number), when the customer has access to a set of valid invoice numbers.
You can solve this with your constraint:
Split your invoice number in two parts:
A 20 bit prefix, taken from a sequence of increasing numbers (e.g. the natural numbers 0,1,2,...)
A 10 bit suffix that is randomly generated
With this scheme, there are about 1 million valid invoice numbers (2^20 = 1,048,576 prefixes). You can precalculate them and store them in the database. When presented with an invoice number, check whether it is in your database. If it isn't, it's not valid.
Use a SQL sequence for handing out numbers. When issuing a new (i.e. unused) invoice number, increment the sequence and issue the n-th number from the precalculated list (ordered by value).
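The prefix/suffix layout can be sketched as follows. This is an illustration only: it builds numbers on demand instead of precalculating them, a Python set stands in for the database table, and all names are hypothetical.

```python
import secrets

SUFFIX_BITS = 10

def make_invoice_number(sequence_value):
    """Combine a 20-bit sequential prefix with a 10-bit random suffix."""
    if not 0 <= sequence_value < 2 ** 20:
        raise ValueError("prefix must fit in 20 bits")
    return (sequence_value << SUFFIX_BITS) | secrets.randbelow(2 ** SUFFIX_BITS)

def prefix_of(invoice_number):
    """Recover the sequential part; the suffix carries no information."""
    return invoice_number >> SUFFIX_BITS

# Stand-in for the table of issued numbers.
issued = {make_invoice_number(n) for n in range(1000)}

def is_valid(invoice_number):
    """A number is valid only if it was actually issued."""
    return invoice_number in issued
```

Because the prefix occupies the high bits, the numbers still sort in the order they were issued, while only about one in 1024 values around a valid number is itself valid.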
Not solvable: Guessing the number of customers
When you want to prevent a customer who holds a number of valid invoice numbers from guessing how many invoice numbers you have issued so far (and therefore how many customers you have): this is not possible.
You have here a variant of the so-called "German tank problem". In the Second World War, the Allies used serial numbers printed on the gearboxes of German tanks to estimate how many tanks Germany had produced. This worked because the serial numbers were increasing without gaps.
But even when you increase the numbers with gaps, the solution for the German tank problem still works. It is quite easy:
You use the method described here to guess the highest issued invoice number
You guess the mean difference between two successive invoice numbers and divide the estimated highest number by this value
You can use linear regression to get a stable delta value (if it exists).
Now you have a good guess about the order of magnitude of the number of invoices (200, 15000, half a million, etc.).
This works as long as there (theoretically) exists a mean value for the gap between two successive invoice numbers. This is usually the case, even when using a random number generator, because most random number generators are designed to have such a mean value.
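As a rough illustration of the first step, the commonly cited estimator for the German tank problem is N ≈ m + m/k - 1, where m is the largest observed number and k is the number of observations. The sketch below assumes numbering starts at 1; the function names and the sample values are made up for the example.

```python
def estimate_highest_number(observed_numbers):
    """Estimate the highest issued number from a sample: m + m/k - 1."""
    k = len(observed_numbers)
    m = max(observed_numbers)
    return m + m / k - 1

def estimate_invoice_count(observed_numbers, assumed_mean_gap):
    """Divide by the guessed mean gap between successive numbers."""
    return estimate_highest_number(observed_numbers) / assumed_mean_gap

sample = [19, 40, 42, 60]
print(estimate_highest_number(sample))     # 60 + 60/4 - 1 = 74.0
print(estimate_invoice_count(sample, 2))   # 37.0 if numbers grow by 2 on average
```

The second function only helps if the attacker can guess the mean gap, which is exactly the property the countermeasure below tries to remove.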
There is a countermeasure: you have to make sure that no mean value exists for the gap between two successive numbers. A random number generator with this property can be constructed very easily.
Example:
Start with the last invoice number plus one as current number
Multiply the current number by a random number >=2. This is your new current number.
Get a random bit: If the bit is 0, the result is your current number. Otherwise go back to step 2.
While this will work in theory, you will very soon run out of 32 bit integer numbers.
I don't think there is a practical solution for this problem. Either the gap between two successive numbers has a mean value (with little variance) and you can guess the amount of issued numbers easily. Or you will run out of 32 bit numbers very quickly.
Snakeoil (non working solutions)
Don't use any time-based solution. The timestamp is usually easy to guess (an approximately correct timestamp will probably be printed somewhere on the invoice). Using timestamps usually makes it easier for the attacker, not harder.
Don't use insecure random numbers. Most random number generators are not cryptographically safe. They usually have mathematical properties that are good for statistics but bad for your security (e.g. a predictable distribution, a stable mean value, etc.)
One solution may involve Exclusive OR (XOR) binary bitmaps. The result function is reversible, may generate non-sequential numbers (if the first bit of the least significant byte is set to 1), and is extremely easy to implement. And, as long as you use a reliable sequence generator (your database, for example,) there is no need for thread safety concerns.
According to MSDN, 'the result [of an exclusive-OR operation] is true if and only if exactly one of its operands is true.' Conversely, equal operands always result in false.
As an example, I just generated a 32-bit sequence on Random.org. This is it:
11010101111000100101101100111101
This binary number translates to 3588381501 in decimal, 0xD5E25B3D in hex. Let's call it your base key.
Now, let's generate some values using the ([base key] XOR [ID]) formula. In C#, that's what your encryption function would look like:
public static long FlipMask(long baseKey, long ID)
{
return baseKey ^ ID;
}
The following list contains some generated content. Its columns are as follows:
ID
Binary representation of ID
Binary value after XOR operation
Final, 'encrypted' decimal value
0 | 000 | 11010101111000100101101100111101 | 3588381501
1 | 001 | 11010101111000100101101100111100 | 3588381500
2 | 010 | 11010101111000100101101100111111 | 3588381503
3 | 011 | 11010101111000100101101100111110 | 3588381502
4 | 100 | 11010101111000100101101100111001 | 3588381497
In order to reverse the generated key and determine the original value, you only need to do the same XOR operation using the same base key. Let's say we want to obtain the original value of the second row:
11010101111000100101101100111101 XOR
11010101111000100101101100111100 =
00000000000000000000000000000001
Which was indeed your original value.
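The table and the round trip can be reproduced with a few lines of Python. flip_mask mirrors the FlipMask function above; the key is the example key from this answer.

```python
BASE_KEY = 0xD5E25B3D  # example base key, 3588381501 in decimal

def flip_mask(value, base_key=BASE_KEY):
    """XOR is its own inverse, so the same call encodes and decodes."""
    return base_key ^ value

encoded = [flip_mask(i) for i in range(5)]
decoded = [flip_mask(e) for e in encoded]
print(encoded)  # [3588381501, 3588381500, 3588381503, 3588381502, 3588381497]
print(decoded)  # [0, 1, 2, 3, 4]
```

Note how small IDs only change the lowest bits, so the encoded values stay clustered around the base key.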
Now, Stefan made very good points, and the first topic is crucial.
In order to cover his concerns, you may reserve the top, say, 8 bits for purely random garbage (which I believe is called a nonce), which you generate when encrypting the original ID and ignore when reversing it. That would heavily increase your security at the expense of a generous slice of all the possible positive integer numbers with 32 bits (16,777,216 instead of 4,294,967,296, or 1/256 of it).
A class to do that would look like this:
public static class int32crypto
{
// C# follows ECMA 334v4, so Integer Literals have only two possible forms -
// decimal and hexadecimal.
// Original key: 0b11010101111000100101101100111101
public static long baseKey = 0xD5E25B3D;
public static long encrypt(long value)
{
// First we will extract from our baseKey the bits we'll actually use.
// We do this with an AND mask, indicating the bits to extract.
// Remember, we'll ignore the first 8. So the mask must look like this:
// Significance mask: 0b00000000111111111111111111111111
long _sigMask = 0x00FFFFFF;
// sigKey is our baseKey with only the indicated bits still true.
long _sigKey = _sigMask & baseKey;
// nonce generation. First security issue, since Random()
// is time-based on its first iteration. But that's OK for the sake
// of explanation, and safe for most circumstances.
// The bits it will occupy are the first eight, like this:
// OriginalNonce: 0b000000000000000000000000NNNNNNNN
// Next(256) returns 0..255; the upper bound is exclusive.
long _tempNonce = new Random().Next(256);
// We now shift them to the last byte, like this:
// finalNonce: 0bNNNNNNNN000000000000000000000000
_tempNonce = _tempNonce << 0x18;
// And now we mix both Nonce and sigKey, 'poisoning' the original
// key, like this:
long _finalKey = _tempNonce | _sigKey;
// Phew! Now we apply the final key to the value, and return
// the encrypted value.
return _finalKey ^ value;
}
public static long decrypt(long value)
{
// This is easier than encrypting. We will just ignore the bits
// we know are used by our nonce.
long _sigMask = 0x00FFFFFF;
long _sigKey = _sigMask & baseKey;
// We will do the same to the informed value:
long _trueValue = _sigMask & value;
// Now we decode and return the value:
return _sigKey ^ _trueValue;
}
}
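For comparison, here is a compact Python sketch of the same nonce idea. It is an illustration under assumptions, not a port of the class above: it uses the secrets module for the random byte and rejects values that do not fit into the 24 usable bits.

```python
import secrets

BASE_KEY = 0xD5E25B3D
SIG_MASK = 0x00FFFFFF          # the low 24 bits carry the ID
SIG_KEY = BASE_KEY & SIG_MASK  # part of the key that is actually used

def encrypt(value):
    """XOR the ID with the key and put a random byte into the top 8 bits."""
    if not 0 <= value <= SIG_MASK:
        raise ValueError("value must fit in 24 bits")
    nonce = secrets.randbelow(256) << 24
    return nonce | (SIG_KEY ^ value)

def decrypt(token):
    """Drop the nonce byte, then undo the XOR."""
    return (token & SIG_MASK) ^ SIG_KEY

token = encrypt(12345)
print(token, decrypt(token))
```

The same ID can now map to 256 different tokens, all of which decode to the original value.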
Perhaps an idea may come from the military? Group invoices in blocks like these:
28th Infantry Division
--1st Brigade
---1st BN
----A Co
----B Co
---2nd BN
----A Co
----B Co
--2nd Brigade
---1st BN
----A Co
----B Co
---2nd BN
----A Co
----B Co
--3rd Brigade
---1st BN
----A Co
----B Co
---2nd BN
----A Co
----B Co
http://boards.straightdope.com/sdmb/showthread.php?t=432978
groups don't have to be sequential but numbers in groups do
UPDATE
Think of the above as groups differentiated by place, time, person, etc. For example: create a group using a temporary seller ID that changes every 10 days, or group by office/shop.
There is another idea, you may say a bit weird, but the more I think of it the more I like it. Why not count these invoices down? Choose a big number and count down. It's easy to trace the number of items when counting up, but counting down? How would anyone guess where the starting point is? It's easy to implement, too.
If the orders sit in an inbox until a single person processes them each morning, seeing that it took that person till 16:00 before he got round to creating my invoice will give me the impression that he's been busy. Getting the 9:01 invoice makes me feel like I'm the only customer today.
But if you generate the ID at the time when I place my order, the timestamp tells me nothing.
I think I therefore actually like the timestamps, assuming that collisions where two customers simultaneously need an ID created are rare.
You can see from the code below that I use newsequentialid() to generate a sequential number then convert that to a [bigint]. As that generates a consistent increment of 4294967296 I simply divide that number by the [id] on the table (it could be rand() seeded with nanoseconds or something similar). The result is a number that is always less than 4294967296 so I can safely add it and be sure I'm not overlapping the range of the next number.
Peace
Katherine
declare @generator as table (
[id] [bigint],
[guid] [uniqueidentifier] default( newsequentialid()) not null,
[converted] as (convert([bigint], convert ([varbinary](8), [guid], 1))) + 10000000000000000000,
[converted_with_randomizer] as (convert([bigint], convert ([varbinary](8), [guid], 1))) + 10000000000000000000 + cast((4294967296 / [id]) as [bigint])
);
insert into @generator ([id])
values (1), (2), (3), (4), (5), (6), (7), (8), (9), (10);
select [id],
[guid],
[converted],
[converted] - lag([converted],
1.0)
over (
order by [id]) as [orderly_increment],
[converted_with_randomizer],
[converted_with_randomizer] - lag([converted_with_randomizer],
1.0)
over (
order by [id]) as [disorderly_increment]
from @generator
order by [converted];
I do not know the reasons for the rules you set on the Invoice ID, but you could consider having an internal Invoice ID, which could be the sequential 32-bit integer, and an external Invoice ID that you can share with your customers.
This way your internal ID can start at 1 and you can add one to it every time, and the customer invoice ID can be whatever you want.
I think Na Na has the correct idea with choosing a big number and counting down. Start off with a large value seed and either count up or down, but don't start with the last placeholder. If you use one of the other placeholders it will give an illusion of a higher invoice count....if they are actually looking at that anyway.
The only caveat here would be to modify the last X digits of the number periodically to maintain the appearance of a change.
Why not use an easily readable number constructed like this:
first 12 digits is the datetime in a yyyymmddhhmm format (that ensures the order of your invoice IDs)
last x-digits is the order number (in this example 8 digits)
The number you get then is something like 20130814140300000008
Then do some simple calculations with it, for example on the first 12 digits:
(201308141403) * 3 = 603924424209
The second part (original: 00000008) can be obfuscated like this:
(10001234 - 00000008 * 256) * (minutes + 2) = 49995930
It is easy to translate it back into an easily readable number, but unless you know how, the customer has no clue at all.
Altogether, this number would look like 603924424209-49995930
for an invoice on 14 August 2013 at 14:03 with the internal invoice number 00000008.
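The arithmetic above, and its inverse, can be sketched like this. The function names are hypothetical, and the sketch assumes order numbers small enough that the second part stays positive (below 39067 with these constants).

```python
def obfuscate(timestamp, order_number):
    """timestamp is an int in yyyymmddhhmm form, order_number the internal number."""
    minutes = timestamp % 100
    first = timestamp * 3
    second = (10001234 - order_number * 256) * (minutes + 2)
    return f"{first}-{second}"

def deobfuscate(text):
    """Reverse the two calculations to recover timestamp and order number."""
    first, second = (int(part) for part in text.split("-"))
    timestamp = first // 3
    minutes = timestamp % 100
    order_number = (10001234 - second // (minutes + 2)) // 256
    return timestamp, order_number

print(obfuscate(201308141403, 8))              # 603924424209-49995930
print(deobfuscate("603924424209-49995930"))    # (201308141403, 8)
```

Anyone who sees a few invoices and knows their dates can work out the factor 3 quickly, so this hides the numbers from casual readers only.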
You can write your own function that, when applied to the previous number, generates the next number, which is greater than the previous one but otherwise random. The numbers that can be generated come from a finite set (for example, integers between 1 and 2 to the power of 31), so they may eventually repeat, though that is highly unlikely. To add more complexity to the generated numbers, you can append some alphanumeric characters at the end. You can read about this here: Sequential Random Numbers.
An example generator can be
private static string GetNextnumber(int currentNumber)
{
Int32 nextnumber = currentNumber + (currentNumber % 3) + 5;
Random _random = new Random();
//you can skip the below 2 lines if you don't want alpha numeric
int num = _random.Next(0, 26); // Zero to 25
char let = (char)('a' + num);
return nextnumber + let.ToString();
}
and you can call like
string nextnumber = GetNextnumber(yourpreviouslyGeneratedNumber);

Redis - Hits count tracking and querying in given datetime range

I have many different items and I want to keep track of the number of hits to each item, and then query the hit count for each item in a given datetime range, down to the second.
So I started storing the hits in sorted sets, one sorted set for each second (unix epoch time), for example:
zincrby ItemCount:1346742000 item1 1
zincrby ItemCount:1346742000 item2 1
zincrby ItemCount:1346742001 item1 1
zincrby ItemCount:1346742005 item9 1
Now to get an aggregate hit count for each item in a given date range:
1. Given a start datetime and end datetime:
Calculate the range of epochs that fall under that range.
2. Generate the key names for each sorted set using the epoch values, for example:
ItemCount:1346742001, ItemCount:1346742002, ItemCount:1346742003
3. Use ZUNIONSTORE to aggregate all the values from the different sorted sets:
ZUNIONSTORE _item_count <numkeys> KEYS....
4. To get the final results out:
ZRANGE _item_count 0 -1 WITHSCORES
So it kind of works, but I run into a problem when I have a big date range like 1 month: the number of key names calculated in steps 1 and 2 runs into the millions (86400 epoch values per day).
With such a large number of keys, the ZUNIONSTORE command fails and the socket gets broken. It also takes a while to loop through and generate that many keys.
How can I design this in Redis in a more efficient way and still keep the tracking granularity all the way down to seconds, not minutes or days?
Yeah, you should avoid big unions of sorted sets. Here is a nice trick you can use, assuming you know the maximum number of hits an item can get per second:
Keep one sorted set per item, with timestamps as BOTH scores and values.
Every hit increments the score by 1/(max_predicted_hits_per_second); only the first client to write in a given second also adds the timestamp itself. This way the part after the decimal point is always hits/max_predicted_hits_per_second, but you can still do range queries.
So let's say max_predicted_hits_per_second is 1000. What we do is this (Python example):
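import time

# `redis` is a connected redis-py client; `itemId` identifies the item
now = int(time.time())
# 1. Make sure only one client adds the actual timestamp for this second,
#    by doing SETNX on a temporary per-second key
ts_key = 'item_ts:%s:%s' % (itemId, now)
rc = redis.setnx(ts_key, now)
# just the count part
val = float(1) / 1000
if rc:  # we are the first to increment this second
    val += now
    # we won't need that key anymore soon, assuming all clients have the same clock
    redis.expire(ts_key, 10)
# 2. Increment the count (keyword arguments, because the positional order of
#    value and amount differs between redis-py versions)
redis.zincrby('item_counts:%s' % itemId, value=now, amount=val)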
and now querying a range will be something like:
counts = redis.zrangebyscore('item_counts:%s' % itemId, minTime, maxTime + 0.999, withscores=True)
total = 0
for value, score in counts:
    # the fractional part of the score is hits/1000; round() absorbs floating-point error
    count = round((score - int(value)) * 1000)
    total += count
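To see how the score encoding works without a Redis server, here is a minimal sketch that uses a plain dict in place of the sorted set. The names record_hit and hits_in_range are made up for illustration, and the limit of 1000 hits per second is the same assumption as above.

```python
MAX_HITS_PER_SECOND = 1000  # assumed upper bound, as in the answer above


def record_hit(scores, ts):
    """Simulate one hit at integer second `ts`; `scores` stands in for the sorted set."""
    if ts not in scores:
        # First hit in this second: the score starts at the timestamp itself.
        scores[ts] = float(ts)
    # Every hit adds 1/MAX_HITS_PER_SECOND to the fractional part.
    scores[ts] += 1.0 / MAX_HITS_PER_SECOND


def hits_in_range(scores, min_time, max_time):
    """Sum the hit counts encoded in the fractional part of each score."""
    total = 0
    for ts, score in scores.items():
        if min_time <= ts <= max_time:
            total += round((score - ts) * MAX_HITS_PER_SECOND)
    return total


scores = {}
for ts in [1346742000, 1346742000, 1346742001, 1346742005]:
    record_hit(scores, ts)

print(hits_in_range(scores, 1346742000, 1346742001))  # hits in the first two seconds
```

Note that the encoding only holds while the real hit count in any second stays below the assumed maximum; at exactly 1000 hits the fractional part would roll over into the next second.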

Calculate percent at runtime

I have this problem where I have to "audit" a percentage of my transactions.
If the percentage is 100 I have to audit them all, if it is 0 I have to skip them all, and if it is 50 I have to review half of them, etc.
The problem ( or the opportunity ) is that I have to perform the check at runtime.
What I tried was:
audit = 100/percent
So if percent is 50
audit = 100 / 50 ( which is 2 )
So I have to audit 1 and skip 1 audit 1 and skip 1 ..
If it is 30
audit = 100 / 30 ( 3.3 )
I audit 2 and skip the third.
Question
I'm having problems with numbers beyond 50% (like 75%) because it gives me 1.333, ...
What would be the correct algorithm to know how many to audit as they go? I also had problems with 0 (due to division by 0 :P) and with 100, but I have fixed those already.
Any suggestion is greatly appreciated.
Why not do it randomly? For each transaction, pick a random number between 0 and 100. If that number is less than your "percent", then audit the transaction; otherwise, don't. I don't know if this satisfies your requirements, but over an extended period of time, you will have the right percentage audited.
If you need an exact "skip 2, audit one, skip 2 audit one" type of algorithm, you'll likely have luck adapting a line-drawing algorithm.
Try this:
1) Keep your audit percentage as a decimal.
2) For every transaction, associate a random number (between 0 and 1) with it
3) If the random number is less than the percentage, audit the transaction.
To follow your own algorithm: just keep adding that 1.333333 (or other quotient) to a counter.
Have two counters: an integer one and a real one. If the truncated part of the real counter = the integer counter, the audit is carried out, otherwise it isn't, like this:
Integer counter   Real counter
1                 1.333333: audit transaction
2                 2.666666: audit transaction
3                 3.999999: audit transaction
4                 5.333333: truncated(5.333333) = 5 > 4 => do NOT audit transaction
5                 5.333333: audit transaction
Only increment the real counter when its truncated version = the integer counter. Always increment the integer counter.
In code:
var p, pc: double;
    c: integer;
begin
  p := 100 / Percentage;
  pc := p;
  for c := 1 to NrOfTransactions do begin
    if trunc(pc) = c then begin
      pc := pc + p;
      { do audit on transaction c }
    end
  end;
end;
import random

# randint(1, 100) includes both ends, so >= is needed to audit everything at 100%
if percent >= random.randint(1, 100):
    print("audit")
else:
    print("skip")
If you need to audit these transactions in real time (as they are received) perhaps you could use a random number generator to check if you need to audit the transaction.
So if for example you want to audit 50% of transactions, for every transaction received you would generate a random number between 0 and 1, and if the number was greater than 0.5, audit that transaction.
While this would not work well for a small number of transactions, for large numbers of transactions it would give you very close to the required percentage.
This is better than your initial suggestion because it does not offer a way to 'game' the audit process: if you are auditing every second transaction, bad transactions can be timed to slip through.
Another possibility is to keep a running total of the transactions; as this total changes, so does the number of transactions that need to be audited (according to your percentage), and you can pipe transactions into the auditing process accordingly. This, however, still leaves a slight possibility of someone detecting the pattern and circumventing the audit.
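The claim above that random auditing only approaches the target percentage for large numbers of transactions can be checked with a quick simulation. This is a rough sketch; the function name audited_fraction and the fixed seed are choices made for illustration only.

```python
import random


def audited_fraction(percent, n_transactions, rng):
    """Audit each transaction independently with probability percent/100."""
    audited = sum(1 for _ in range(n_transactions) if rng.random() < percent / 100)
    return audited / n_transactions


rng = random.Random(42)  # fixed seed only so that a run is repeatable
for n in (10, 1_000, 100_000):
    # The printed fraction should drift toward 0.30 as n grows.
    print(n, audited_fraction(30, n, rng))
```

With only 10 transactions the audited fraction can be well off 30%, while with 100,000 it should land very close to it.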
For a high-throughput system the random method is best, but if you don't want randomness, then this algorithm will do the job. Don't forget to test it in a unit test!
// setup
int transactionCount = 0;
int auditCount = 0;
double targetAuditRatio = auditPercent/100.0;
// start of processing
transactionCount++;
// cast to double: integer division would truncate the ratio to 0
double actualAuditRatio = (double) auditCount / transactionCount;
if (actualAuditRatio < targetAuditRatio) {
    auditCount++;
    // do audit
}
// do processing
You can decide each audit as the transactions arrive by using a counter. For example:
ctr = 0;
percent = 50;
while (1) {
    ctr += percent;
    if (ctr >= 100) {
        /* audit this transaction */
        ctr = ctr - 100;
    } else {
        /* skip this transaction */
    }
}
You can use floats (though this will bring some unpredictability) or scale both the percentage and the threshold of 100 by some factor to get better resolution.
There is really no need to use random number generator.
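The counter approach above can be written as a small runnable sketch. The generator name audit_decisions is hypothetical, and the sketch assumes percent is an integer between 0 and 100.

```python
def audit_decisions(percent, n_transactions):
    """Yield True (audit) or False (skip) per transaction, using an integer accumulator."""
    acc = 0
    for _ in range(n_transactions):
        acc += percent
        if acc >= 100:
            # Threshold crossed: audit and keep the remainder for the next round.
            acc -= 100
            yield True
        else:
            yield False


decisions = list(audit_decisions(75, 8))
print(decisions)
print(sum(audit_decisions(30, 100)), "of 100 audited at 30%")
```

Because the accumulator carries its remainder forward, every run of 100 consecutive transactions contains exactly percent audits, which is why no random number generator is required.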
Not tested, but the random module has a function called sample. If transactions is a list of transactions, you would do something like:
import random
# sample size = the given percentage of the list length, rounded down
to_be_audited = random.sample(transactions, int(len(transactions) * percentage / 100))
This would generate a list to_be_audited containing a random, non-duplicating sample of the transactions.
See documentation on random
