I currently have a system where the server tells all client applications when to next connect to the server, within a server-configured time window (say 12AM to 6AM client time).
The current algorithm takes the client's 10-digit ID number (fairly evenly distributed), mods it by the number of seconds in the time window, and gives a pretty evenly distributed, predictable time for each client to connect to the server. The problem now is that clients are spread disproportionately across time zones, and certain time zones overlap within the given window, so the net effect is that the load on the server is not even. What I would like is an algorithm that I can configure with the percentage of clients we currently have in each time zone, and that distributes each client's next connect time within the window so that the resulting server load is even, in a manner that is predictable (non-random).
Here is a simple graphical representation:
12AM 1AM 2AM 3AM 4AM 5AM 6AM GMT
GMT -4 40% of the clients ||||||||||||||||||||||||||||||
GMT -5 10% of the clients ||||||||||||||||||||||||||||||
GMT -6 20% of the clients ||||||||||||||||||||||||||||||
GMT -7 30% of the clients ||||||||||||||||||||||||||||||
Break the problem into two parts: (1) determining what distribution you want each set of clients to have; and (2) deterministically assigning reconnect times that fit that distribution.
For problem (1), consider a two-dimensional array of numbers, much like the diagram you've drawn: each row represents a time zone and each column represents an equal period of time (an hour, perhaps) during the day. The problem you have to solve is to fill in the grid with numbers such that
the total of each row is the number of clients in that time zone;
for each row, all the numbers outside that time zone's reconnect window are zero;
the sums of the columns do not exceed some predetermined maximum (and are as evenly balanced as possible).
This kind of problem has lots of solutions. You can find one by simulation without doing any hard math. Write a program that fills the grid in so that each time zone's clients are evenly distributed (that is, the way you're distributing them now) and then repeatedly moves clients horizontally from crowded times-of-day to less crowded ones.
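For illustration, here is a rough Python sketch of that simulation. The zone windows (as GMT hour columns) and client counts are made up, not taken from the question:

zones = {                                   # zone -> (allowed GMT hour columns, client count)
    "GMT-4": ([0, 1, 2, 3, 4, 5], 400),
    "GMT-5": ([1, 2, 3, 4, 5, 6], 100),
    "GMT-6": ([2, 3, 4, 5, 6, 7], 200),
    "GMT-7": ([3, 4, 5, 6, 7, 8], 300),
}
n_cols = 9

# Start with each zone's clients spread evenly over its own window
# (integer division drops a small remainder; fine for a sketch).
grid = {z: [count // len(cols) if c in cols else 0 for c in range(n_cols)]
        for z, (cols, count) in zones.items()}

def column_totals():
    return [sum(grid[z][c] for z in grid) for c in range(n_cols)]

for _ in range(100000):                     # crude relaxation loop
    totals = column_totals()
    hi = max(range(n_cols), key=lambda c: totals[c])   # most crowded hour
    moved = False
    for z, (cols, _) in zones.items():
        if grid[z][hi] > 0:
            lo = min(cols, key=lambda c: totals[c])    # least crowded hour this zone may use
            if totals[lo] + 1 < totals[hi]:
                grid[z][hi] -= 1
                grid[z][lo] += 1
                moved = True
                break
    if not moved:
        break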
For problem (2), you want a function that takes a ten-digit ID and a desired distribution (that is, one row of the matrix from problem 1 above), and deterministically produces a reconnect time. This is easily done by linear interpolation. Suppose the desired distribution is:
12:00 1:00 2:00 3:00 4:00 5:00 6:00 ...
+------+------+------+------+------+------+----
| 0 | 0 | 100 | 70 | 30 | 0 | ...
+------+------+------+------+------+------+----
First find the sum of the whole row, and scale the numbers up to the range of IDs. That is, divide each number by the sum and multiply by 10^10 (the size of the ten-digit ID space).
 12:00  1:00   2:00         3:00         4:00        5:00  6:00 ...
+------+------+------------+------------+------------+------+----
|   0  |   0  | 5000000000 | 3500000000 | 1500000000 |   0  | ...
+------+------+------------+------------+------------+------+----
Now let x = the ten-digit ID, and read the row from left to right. At each box, if the number in the box is greater than what's left of x, stop; otherwise subtract that box's value from x and move on to the next box. When you stop, return the time
(start time for this box) + (duration of this box) * x / (number in box)
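As a small Python sketch of that walk, using the example row above and assuming IDs range from 0 to 10^10 - 1 and that the window starts at 12:00:

ID_SPACE = 10**10                          # ten-digit IDs
weights = [0, 0, 100, 70, 30, 0]           # desired clients per hour, starting at 12:00
scaled = [w * ID_SPACE // sum(weights) for w in weights]   # row scaled up to the ID range

def reconnect_time(client_id):
    x = client_id
    for hour, box in enumerate(scaled):
        if x < box:
            seconds = 3600 * x // box      # linear interpolation inside this hour's box
            return hour, seconds           # (hours after 12:00, seconds into that hour)
        x -= box
    return len(scaled) - 1, 3599           # truncation leftovers land at the very end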
Note that once you calculate the solution to problem (1), the reconnect times will be deterministic until the next time you recalculate the matrix. Then everyone's reconnect time will shift around a little--but not much, unless the matrix changes dramatically.
You could take into account the time zone of the user in addition to its ID.
One example solution that uses this would be the following:
There are 24 time zones. Calculate the relative load for each of them by summing the total number of clients per time zone from your static data. Now you have "weighted time zones": each time zone will get a time share proportional to its weight.
For example, if you have the following data (for simplicity, let's assume that there are only three time zones):
Time Zone | Clients num
------------------------
0 | 20
1 | 30
2 | 10
Then you would divide your time window by the total weight (60) and give each of the time zones its share of time: the first time zone will get (20/60 * #time), the second will get the following (30/60 * #time), etc.
Once you have the smaller time frames, you can tell each client its time using your previous function (the mod, for example), applied to the smaller interval you calculated for its specific time zone.
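A minimal sketch of that time sharing, using the example weights above; the 6-hour window length is an assumption:

WINDOW_SECONDS = 6 * 3600                    # assume a 6-hour window
clients_per_zone = {0: 20, 1: 30, 2: 10}     # time zone -> client count (example data)

total = sum(clients_per_zone.values())
slices, offset = {}, 0
for zone, count in sorted(clients_per_zone.items()):
    length = WINDOW_SECONDS * count // total # this zone's share of the window
    slices[zone] = (offset, length)
    offset += length

def next_connect_offset(client_id, zone):
    start, length = slices[zone]
    return start + client_id % length        # seconds after the window opens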
Notes:
Obviously, you will need some minimum client count for time zones that have very little traffic, but this is simple - you just edit the original table.
This is one example of "time division"; you can modify it to your needs. For example, you could have shared time frames for several time zones.
EDIT:
Given the example you added to your question, you could apply this method in the following way:
If I understand you correctly, you have 10 hours in which your server is active, and you would like the load to be more or less equal in each of these hours. Meaning: in each of these hours, you would like 10% of the clients to access the server.
Using the idea explained above, it is possible to divide the users non-uniformly, so that for each time zone there are hours with "more probability" and hours with "less probability". In your example, 10%/40% (a quarter) of the GMT-4 clients would access the server in the first hour, 12AM-1AM GMT. It is possible to calculate the load for each of the time zones so that the total load on the server in every hour is 10%. There are many methods to do this - a greedy one will do.
Once you have this, you know the weights for each of the time zones, and it should be clearer how to use the time sharing method described above.
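As one possible illustration (not the only greedy you could use), here is a sketch that fills the hours in chronological order from the zones whose windows close soonest; the mapping of each zone's local window to GMT hours is an assumption, and the per-hour target is computed from the totals rather than hard-coded:

zone_share = {"GMT-4": 40, "GMT-5": 10, "GMT-6": 20, "GMT-7": 30}     # % of all clients
zone_hours = {"GMT-4": range(4, 10), "GMT-5": range(5, 11),
              "GMT-6": range(6, 12), "GMT-7": range(7, 13)}           # local 12AM-6AM mapped to GMT hours

remaining = dict(zone_share)
load = {z: {} for z in zone_share}            # zone -> {gmt_hour: % of all clients}
all_hours = sorted({h for hours in zone_hours.values() for h in hours})
per_hour_target = sum(zone_share.values()) / len(all_hours)

for hour in all_hours:
    need = per_hour_target
    # serve this hour first from the zones whose windows close soonest
    for z in sorted(zone_hours, key=lambda z: zone_hours[z][-1]):
        if hour in zone_hours[z] and remaining[z] > 0 and need > 0:
            take = min(remaining[z], need)
            load[z][hour] = take
            remaining[z] -= take
            need -= take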
I would define a helper class for each of the Timezones you are looking at:
using System;
using System.Linq;

class Timezone
{
    DateTime start;                     // start of this timezone's reconnect window
    int[] hourlyWeights = new int[6];   // assuming you have a 6-hour-long timeslot for every timezone;
                                        // fill with the desired distribution for this timezone

    public DateTime GetStartTime(long clientId)
    {
        // total weighted seconds in the window
        long allTicks = 3600L * hourlyWeights.Sum();
        long clientTicks = clientId % allTicks;

        // walk the hours until we find the one this client falls into
        int i = 0;
        while (clientTicks >= hourlyWeights[i] * 3600L)
        {
            clientTicks -= hourlyWeights[i] * 3600L;
            i++;
        }

        // spread the remaining ticks evenly within that hour
        long seconds = clientTicks / hourlyWeights[i];
        return start.AddHours(i).AddSeconds(seconds);
    }
}
You now use the method GetStartTime to get the start time for a client from this timezone. The idea here is that we have this hourlyWeights table with the distribution you want to get for the given timezone, e.g. [40, 20, 0, 0, 0, 0] would mean that these clients will be served only during the first 2 hours, and we want twice as many clients during the first hour. Note: I assume that ids are uniformly distributed among clients from a given timezone.
The tricky bit is to get these classes created. If you have a fairly stable customer structure, then you can figure out the distributions manually and put them in the config file. If it changes often, let me know and I will post some code to figure it out dynamically.
How about this for something simple:
If the load on the server is OK, send the client the same number of seconds you sent last time.
If the load on the server is too high, instead send the client some other random number in the time window.
Over a few days things should sort themselves out.
(This assumes you have some way of measuring the quantity you're trying to optimize, which doesn't seem too unreasonable.)
Why not generate your reconnect-window times in GMT on the server and convert to client local time before you send the time to the client?
Related
I'm interfacing with a third party API that returns a call limit threshold and how many calls I've used of the threshold so far. I believe it's 60 calls every minute. After 1 minute it resets.
I would like to delay my API calls as I reach that limit more and more, sort of like an exponential curve where the curve hits double the max threshold at the max threshold.
So at 0 it's 0 delay. At 60 it would be a 120 second delay.
And if they change the call limit, I want to be able to respond and adjust my max limit again to 2 * the new limit with an exponential-sorta curve.
What algorithm can I use for this? (Preferably VB.NET, else C#)
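For what it's worth, one formula (a sketch, not taken from the answer below) that hits the two anchor points you describe, 0 delay at 0 calls and 2 * limit seconds at the limit, and that rescales itself if the limit changes, is delay = 2 * limit * (2^(used/limit) - 1):

import math

def delay_seconds(calls_used, call_limit):
    # 0 at calls_used == 0, 2 * call_limit at calls_used == call_limit,
    # exponential in between; adapts automatically if call_limit changes
    return 2 * call_limit * (math.pow(2, calls_used / call_limit) - 1)

# delay_seconds(0, 60)  -> 0.0
# delay_seconds(60, 60) -> 120.0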
You could potentially do something along these lines; we did this to avoid bombarding our mail server when a camera went offline or had an error.
using System;

public static class Delay
{
    // Divides the maximum delay by (2^interval - 1) and rounds to a whole number of seconds.
    public static double ByInterval(int maximum, int interval) =>
        Math.Round(maximum / (Math.Pow(2, interval) - 1), 0);
}
So for instance, if the maximum delay should be 120 and we use an interval of three, the output would be about 17 (that is, 120 / (2^3 - 1), rounded to a whole number). Not sure if this is what you're looking for, but we coupled this to an appender, so we store the emails until our threshold is met. We converted the values into ticks with (10000000 * Delay.ByInterval(120, 3)), for instance, since we stored ticks primarily.
An order consists of: startDate (the starting date of the span), endDate (the ending date of the span), unitAmount (the number of units per delivery), frequencyAmount (the number of times the units are delivered per frequency), and frequencyId (the frequency of the delivery). For example: from 2017-01-01 to 2017-04-01, 6 units are delivered 5 times per week. That covers 13 calendar weeks at (6*5) units per week, resulting in a total of 390 units for the entire order.
Multiple orders can be created overlapping the same dates. This is allowed because two orders such as 1 unit 3 times per week and 5 units 1 time per week cannot be written as a single order, and likewise for different frequencies such as 10 units 1 time per month.
Problem: I cannot figure out a way to validate that these orders do not go over certain set limits. For example, I want to make sure the orders stay under 40 units total per month, and I might have 4 orders of different date spans, units, and frequencies overlapping each other.
I thought to combine all the orders by calculating how many units in total each order has and what percentage of each order overlaps another order. However, when I validate against a larger frequency, say <1000 units per year, and the orders I have combined are smaller, I end up having to extrapolate and overestimate how many units are being called for. For example, orders that combine to 200 units for a single month are okay on their own, but if I extrapolate a yearly amount from that (200*12), I get 2400, which is over the 1000-unit limit, even though in reality the total units might still be under it.
Orders:
|-----------------4 6x/week---------------|
|-------------8 1x/week-------------|
|-------10 2x/month-------|
===========================================================
Combined Orders:
|--a---|---b---|--------c--------|---d----|----e---|
I am checking if each span of the combined orders (a-e) are over any daily, weekly, monthly, or yearly limits. Different units have different limits and I need to be able to validate at these different frequencies.
I feel like I am going about this the wrong way; I keep running into issues with this approach, such as the overestimation when extrapolating. Another issue, looking at my diagram: the 3rd order, for 10 units 2 times per month (let's say the order is a month long), falls into 2 spans, b and c, when combined. The 10 units could have been delivered twice in b and none in c, once in each of b and c, or none in b and twice in c. So if I convert to a weekly amount to combine, I have to assume the total units were delivered in both spans b and c as a worst case, which leads to overestimation. If I instead split the units by the percentage of the order in each span, it leads to underestimation.
Has anyone else faced a similar issue, or does anyone think they have a solution to this problem?
Thanks
EDIT: Another situation can occur, imagine a limit of 30 units per month:
Orders:
|---10 units 2x/week---| |---10 units 2x/week---|
2017-01-01 2017-01-31
===========================================================
Combined Orders:
|--------20------------| |----------20-----------|
In this case, it goes over the limit not due to overlaps. This makes me believe that I will also have to calculate the amount of units in each month (or week, day, year) between the earliest startDate and the furthest endDate. Unless there is a better way, but I have a feeling this is the only way.
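A rough Python sketch of that per-month tally; the orders and the 30-unit limit are made up to mirror the example, and frequency is expressed in days:

from datetime import date, timedelta
from collections import defaultdict

orders = [
    # (startDate, endDate, unitAmount, frequencyAmount, frequency_in_days)
    (date(2017, 1, 1),  date(2017, 1, 7),  10, 2, 7),   # 10 units 2x/week for one week
    (date(2017, 1, 25), date(2017, 1, 31), 10, 2, 7),   # 10 units 2x/week for one week
]

units_per_month = defaultdict(float)
for start, end, units, times, freq_days in orders:
    daily = units * times / freq_days                   # average units per day
    d = start
    while d <= end:
        units_per_month[(d.year, d.month)] += daily
        d += timedelta(days=1)

LIMIT = 30
for month, total in sorted(units_per_month.items()):
    if total > LIMIT:
        print(month, "exceeds the monthly limit:", round(total, 1))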
I do not see any way other than what #Furmek suggested in his 3rd comment.
The prototype of the solution is below.
-- Here we calculate daily unit amount for all orders that overlap a validation period.
-- [frequency] is in days
;WITH ValidationPeriodOrders AS(
SELECT ( unitAmount * frequencyAmount ) / [frequency] AS AvgDailyAmount,
-- Adjust order dates to be within the validation period
CASE WHEN startDate < ValidationPeriodStart THEN ValidationPeriodStart ELSE startDate END AS OrderStart,
CASE WHEN endDate > ValidationPeriodEnd THEN ValidationPeriodEnd ELSE endDate END AS OrderEnd
FROM [Order]
-- Look for orders that overlap the validation period (including orders that span all of it)
WHERE startDate <= ValidationPeriodEnd
AND endDate >= ValidationPeriodStart
)
-- Get total units per validation period
SELECT SUM( AvgDailyAmount * DATEDIFF( dd, OrderStart, OrderEnd ))
FROM ValidationPeriodOrders
Since there is no (reliable) way to tell when an order was delivered I would suggest rounding down the number (rather than mathematical rounding).
I'm not talking about the bloodsucking spider-like disease-spreader, but rather the sort of tick I recorded here:
Stopwatch noLyme = new Stopwatch();
noLyme.Start();
. . .
noLyme.Stop();
MessageBox.Show(string.Format(
"elapsed milliseconds == {0}, elapsed ticks == {1}",
noLyme.ElapsedMilliseconds, noLyme.ElapsedTicks));
What the message box showed me was 17357 milliseconds and 56411802 ticks; this equates to 3250.089416373797 ticks per millisecond, or approximately 3.25 million ticks per second.
Since the ratio is such an odd one (3250.089416373797:1), I assume the time length of a tick changes based on hardware used, or other factors. That being the case, in what practical way are tick counts used? To me, it seems milliseconds hold more value. IOW: why would I care about ticks (the variable time slices)?
From the documentation (with Frequency being another property of the Stopwatch):
Each tick in the ElapsedTicks value represents the time interval equal
to 1 second divided by the Frequency.
Stopwatch.ElapsedTicks (MSDN)
Ticks are useful if you need very precise timing based on the specifics of your hardware.
You would use ticks if you want to know a very precise performance measurement that is specific to a given machine. Internal hardware mechanisms determine the conversion from ticks to actual time.
ticks are the raw, low level units in which the hardware measures time.
It's like asking "what use are bits when we can use ints". Well, if we didn't have bits, ints wouldn't exist!
However, ticks can be useful. Converting ticks to milliseconds is a relatively costly operation, so when measuring accurately you can count everything in ticks and convert the results to seconds at the end of the process. When comparing measurements, absolute values may not be relevant; it may only be the relative differences that are of interest.
Of course, in these days of high level language and multitasking there aren't many cases where you would go to the bare metal in this way, but why shouldn't the raw hardware value be exposed through the higher level interfaces?
I have many different items and I want to keep track of the number of hits to each item, and then query the hit count for each item in a given datetime range, down to every second.
So I started storing the hits in sorted sets, one sorted set for each second (unix epoch time), for example:
zincrby ItemCount:1346742000 item1 1
zincrby ItemCount:1346742000 item2 1
zincrby ItemCount:1346742001 item1 1
zincrby ItemCount:1346742005 item9 1
Now, to get an aggregate hit count for each item in a given date range (see the sketch after these steps):
1. Given a start datetime and end datetime:
Calculate the range of epochs that fall under that range.
2. Generate the key names for each sorted set using the epoch values example:
ItemCount:1346742001, ItemCount:1346742002, ItemCount:1346742003
3. Use Union store to aggregate all the values from different sorted sets
ZUNIONSTORE _item_count KEYS....
4. To get the final results out:
ZRANGE _item_count 0 -1 WITHSCORES
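For reference, here is roughly what those four steps look like in Python with redis-py; the key names follow the example above and the client setup is an assumption:

import redis

r = redis.Redis(decode_responses=True)

def aggregate_hits(start_epoch, end_epoch):
    # one key per second in the range (this is exactly what blows up for month-long ranges)
    keys = ['ItemCount:%d' % ts for ts in range(start_epoch, end_epoch + 1)]
    r.zunionstore('_item_count', keys)                   # sums the scores of each item
    return r.zrange('_item_count', 0, -1, withscores=True)

# aggregate_hits(1346742000, 1346742005)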
So it kinda works, but I run into problems when I have a big date range like 1 month: the number of key names calculated in steps 1 & 2 runs into the millions (86400 epoch values per day).
With such a large number of keys, the ZUNIONSTORE command fails - the socket gets broken. Plus it takes a while to loop through and generate that many keys.
How can I design this in Redis in a more efficient way and still keep the tracking granularity all the way down to seconds, and not minutes or days?
Yeah, you should avoid big unions of sorted sets. There is a nice trick you can do, assuming you know the maximum number of hits an item can get per second:
Keep a sorted set per item with timestamps as BOTH scores and values,
but increment the scores by 1/(max_predicted_hits_per_second) if you are not the first client to write them. This way the number after the decimal dot is always hits/max_predicted_hits_per_second, but you can still do range queries.
So let's say max_predicted_hits_per_second is 1000. What we do is this (Python example):
import time
import redis

r = redis.Redis(decode_responses=True)
itemId = 'item1'   # example item id

# 1. Make sure only one client adds the integer timestamp part, by doing SETNX
#    on a temporary key (per item *and* second, so each second gets its base added once)
now = int(time.time())
rc = r.setnx('item_ts:%s:%s' % (itemId, now), now)

# just the count part
val = float(1) / 1000
if rc:   # we are the first to increment this second
    val += now
    r.expire('item_ts:%s:%s' % (itemId, now), 10)   # we won't need that for long, assuming all clients have the same clock

# 2. Increment the count (redis-py 3.x argument order: name, amount, value)
r.zincrby('item_counts:%s' % itemId, val, now)
And now querying a range will be something like:
# minTime and maxTime are the range bounds as integer epoch seconds
counts = r.zrangebyscore('item_counts:%s' % itemId, minTime, maxTime + 0.999, withscores=True)
total = 0
for value, score in counts:
    count = round((score - int(value)) * 1000)   # recover the hit count for that second
    total += count
We build software that audits fees charged by banks to merchants that accept credit and debit cards. Our customers want us to tell them if the card processor is overcharging them. Per-transaction credit card fees are calculated like this:
fee = fixed + variable*transaction_price
A "fee scheme" is the pair of (fixed, variable) used by a group of credit cards, e.g. "MasterCard business debit gold cards issued by First National Bank of Hollywood". We believe there are fewer than 10 different fee schemes in use at any time, but we aren't getting a complete nor current list of fee schemes from our partners. (yes, I know that some "fee schemes" are more complicated than the equation above because of caps and other gotchas, but our transactions are known to have only a + bx schemes in use).
Here's the problem we're trying to solve: we want to use per-transaction data about fees to derive the fee schemes in use. Then we can compare that list to the fee schemes that each customer should be using according to their bank.
The data we get about each transaction is a data tuple: (card_id, transaction_price, fee).
transaction_price and fee are in integer cents. The bank rolls over fractional cents for each transaction until the cumulative amount is greater than one cent, and then a "rounding cent" is attached to the fees of that transaction. We cannot predict which transaction the "rounding cent" will be attached to.
card_id identifies a group of cards that share the same fee scheme. In a typical day of 10,000 transactions, there may be several hundred unique card_id's. Multiple card_id's will share a fee scheme.
The data we get looks like this, and what we want to figure out is the last two columns.
card_id transaction_price fee fixed variable
=======================================================================
12345 200 22 ? ?
67890 300 21 ? ?
56789 150 8 ? ?
34567 150 8 ? ?
34567 150 "rounding cent"-> 9 ? ?
34567 150 8 ? ?
The end result we want is a short list like this with 10 or fewer entries showing the fee schemes that best fit our data. Like this:
fee_scheme_id fixed variable
======================================
1 22 0
2 21 0
3 ? ?
4 ? ?
...
The average fee is about 8 cents. This means the rounding cents have a huge impact and the derivation above requires a lot of data.
The average transaction is 125 cents. Transaction prices are always on 5-cent boundaries.
We want a short list of fee schemes that "fit" 98%+ of the 3,000+ transactions each customer gets each day. If that's not enough data to achieve 98% confidence, we can use multiple days' of data.
Because of the rounding cents applied somewhat arbitrarily to each transaction, this isn't a simple algebra problem. Instead, it's a kind of statistical clustering exercise that I'm not sure how to solve.
Any suggestions for how to approach this problem? The implementation can be in C# or T-SQL, whichever makes the most sense given the algorithm.
Hough transform
Consider your problem in image terms: if you plotted your input data on a diagram of price vs. fee, each scheme's entries would form a straight line (with the rounding cents being noise). Consider the density map of your plot as an image, and the task is reduced to finding straight lines in an image, which is exactly the job of the Hough transform.
You would essentially approach this by plotting one line for each transaction into a diagram of possible fixed fee versus possible variable fee, adding the values of lines where they cross. At the points of real fee schemes, many lines will intersect and form a large local maximum. By detecting this maximum, you find your fee scheme, and even a degree of importance for the fee scheme.
This approach will surely work, but might take some time depending on the resolution you want to achieve. If computation time proves to be an issue, remember that a Voronoi diagram of a coarse Hough space can be used as a classificator - and once you have classified your points into fee schemes, simple linear regression solves your problem.
Considering that a processing query's storage requirements are in the same power of 2 as a day's worth of transaction data, I assume that such storage is not a problem, so:
First pass: group the transactions for each card_id by transaction_price, keeping card_id, transaction_price and the average fee. This can easily be done in SQL. It assumes there are no outliers, but you can catch those after this stage if required. The resulting number of rows is guaranteed to be no higher than the number of raw data points.
Second pass: per group, walk these new data points (with a cursor or in C#) and calculate the average value of b. Again, any outliers can be caught after this stage if desired.
Third pass: per group, calculate the average value of a, now that b is known. This is basic SQL. Outliers as always.
If you decide to do the second step with a cursor, you can stuff all of that into a stored procedure.
Different card_id groups that use the same fee scheme can now be coalesced (sorry if that is the wrong word; I'm not a native English speaker) into fee schemes by rounding a and b to a sane precision and grouping again.
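For illustration, here is a rough Python version of those three passes (the answer suggests SQL or a cursor; the tuple layout of the transactions is an assumption):

from collections import defaultdict
from statistics import mean

def derive_schemes(transactions):
    # transactions: iterable of (card_id, transaction_price, fee), all in cents

    # First pass: per (card_id, price), average the fee to wash out rounding cents.
    buckets = defaultdict(list)
    for card, price, fee in transactions:
        buckets[(card, price)].append(fee)
    points = defaultdict(list)                  # card_id -> [(price, avg_fee), ...]
    for (card, price), fees in buckets.items():
        points[card].append((price, mean(fees)))

    schemes = defaultdict(list)                 # (fixed, variable) -> [card_id, ...]
    for card, pts in points.items():
        pts.sort()
        # Second pass: average slope b between consecutive price points.
        slopes = [(f2 - f1) / (p2 - p1)
                  for (p1, f1), (p2, f2) in zip(pts, pts[1:]) if p2 != p1]
        b = mean(slopes) if slopes else 0.0
        # Third pass: with b known, a is the average of fee - b * price.
        a = mean(f - b * p for p, f in pts)
        # Coalesce card groups by rounding a and b to a sane precision.
        schemes[(round(a), round(b, 3))].append(card)
    return schemes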
The Hough transform is the most general answer, though I don't know how one would implement it in SQL (rather than pulling the data out and processing it in a general purpose language of your choice).
Alas, the naive version is known to be slow if you have a lot of input data (1000 points is kinda medium sized) and if you want high precision results (scales as size_of_the_input / (rho_precision * theta_precision)).
There is a faster approach based on 2^n-trees, but there are few implementations out on the web to just plug in. (I recently did one in C++ as a testbed for a project I'm involved in. Maybe I'll clean it up and post it somewhere.)
If there is some additional order to the data you may be able to do better (i.e. do the line segments form a piecewise function?).
Naive Hough transform
Define an accumulator in (theta, rho) space, spanning [-pi, pi) and [0, max(hypotenuse(x, y))], as a 2D array.
Foreach point in the input data
    Foreach bin in theta
        find the distance rho of the altitude from the origin to
        a line through (x, y) making angle theta with the horizontal
            rho = x cos(theta) + y sin(theta)
        and increment the bin (theta, rho) in the accumulator
Find the maximum bin in the accumulator; this
represents the most line-like structure in the data
if (theta != 0) { a = rho/sin(theta); b = -1/tan(theta); }
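A small Python version of that pseudocode, applied to (price, fee) points; the bin count and rho step are arbitrary choices for the sketch:

import math
from collections import defaultdict

def strongest_line(points, n_theta=360, rho_step=1.0):
    # points: list of (transaction_price, fee); returns (a, b, votes) for y = a + b*x
    acc = defaultdict(int)                       # (theta_bin, rho_bin) -> votes
    for x, y in points:
        for t in range(n_theta):
            theta = -math.pi + (t + 0.5) * (2 * math.pi / n_theta)
            rho = x * math.cos(theta) + y * math.sin(theta)
            if rho >= 0:                         # accumulator spans rho in [0, max)
                acc[(t, int(rho / rho_step))] += 1
    (t, rbin), votes = max(acc.items(), key=lambda kv: kv[1])
    theta = -math.pi + (t + 0.5) * (2 * math.pi / n_theta)
    rho = (rbin + 0.5) * rho_step
    # theta bins are offset by half a bin, so sin(theta) is never exactly zero,
    # but this assumes the winning structure is not a (near-)vertical line
    a = rho / math.sin(theta)                    # intercept (the fixed fee)
    b = -1.0 / math.tan(theta)                   # slope (the variable rate)
    return a, b, votes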
Reliably getting multiple lines out of a single pass takes a little more bookkeeping, but it is not significantly harder.
You can improve the result a little by smoothing the data near the candidate peaks and fitting to get sub-bin precision, which should be faster than using smaller bins and should pick up the effect of the "rounding" cents fairly smoothly.
You're looking at the rounding cent as a significant source of noise in your calculations, so I'd focus on minimizing the noise due to that issue. The easiest way to do this IMO is to increase the sample size.
Instead of viewing your data as thousands of y = mx + b (+rounding) equations, group your data into larger subsets:
If you combine X transactions with the same card_id and look at this as (sum of X fees) = (variable rate)*(sum of X transaction prices) + X*(base rate) (+rounding), the noise from the rounding cents will likely fall by the wayside.
Get enough groups of size 'X' and you should be able to come up with a pretty close representation of the real numbers.
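As a tiny worked example of that idea, ignoring the rounding noise (which in practice you would average out, e.g. by least squares over many groups): with two aggregated groups that share a scheme you can solve the pair of equations directly. The helper below and its sample numbers are made up.

def solve_scheme(g1, g2):
    # each g is (number_of_transactions, sum_of_prices, sum_of_fees) for one group
    # model per group: sum_of_fees = variable * sum_of_prices + n * fixed (+ rounding noise)
    n1, p1, f1 = g1
    n2, p2, f2 = g2
    variable = (f1 * n2 - f2 * n1) / (p1 * n2 - p2 * n1)
    fixed = (f1 - variable * p1) / n1
    return fixed, variable

# e.g. two groups of 100 transactions each, priced differently, same scheme:
# solve_scheme((100, 12500, 875), (100, 20000, 1100)) -> (5.0, 0.03)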