I have a list of billions of items in SQL that a user can shuffle at random by moving an item to another position in the list. I am considering a simple "double divide" (midpoint) solution:
Id, Rank
1 10
2 20
3 30
4 40
5 50
Now the user moves item id=3 to the first position, and I recalculate that item's rank from its adjacent items (0 means no neighbour on the left, max means no neighbour on the right):
Id, Rank
3 (0+10)/2 = 5
1 10
2 20
4 40
5 50
Now there is a bug: this works until the midpoints shrink below double precision; after that you get a couple of elements whose ranks differ by less than epsilon, and those elements can no longer be moved.
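For illustration, a small sketch (names mine, not from the question) of the midpoint scheme shows how quickly the precision runs out when an item is repeatedly moved to the front:

```python
def midpoint(lo, hi):
    # New rank for an item placed between two neighbours.
    return (lo + hi) / 2.0

# Repeatedly move a new item to the front of the list: each move halves
# the gap between rank 0 (no left neighbour) and the current first rank.
lo, hi = 0.0, 10.0
moves = 0
while midpoint(lo, hi) > lo:  # stop once the midpoint collides with its neighbour
    hi = midpoint(lo, hi)
    moves += 1
print(moves)  # only on the order of a thousand front-insertions before ranks collide
```

So even in the worst case the scheme survives a large number of moves at the same spot, but it does eventually break down, which is exactly the bug described above.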
This can be avoided by an infrequent recalculation of the stack rank of the entire collection, but I hesitate to implement that at the moment because it looks like too much.
I would like to know whether there is some other algorithmic solution besides rewriting billions of items, or whether there is a well-known name for this problem, so that I can find an appropriate solution myself.
I want to generate a fictional job title from some information I have about the visitor.
For this, I have a table of about 30 different job titles:
01 CEO
02 CFO
03 Key Account Manager
...
29 Window Cleaner
30 Dishwasher
I'm trying to find a way to generate one of these titles from a few different variables like name, age, education history, work history and so on. I want it to be somewhat random but still consistent, so that the same variables always result in the same title.
I also want the different variables to have some impact on the result. Lower numbers are "better" jobs and higher numbers are "worse" jobs, but it doesn't have to be very accurate, just not completely random.
So take these two people as an example.
Name: Joe Smith
Number of previous employers: 10
Number of years education: 8
Age: 56
Name: Samantha Smith
Number of previous employers: 1
Number of years education: 0
Age: 19
Now the reason I want the name in there is to have a bit of randomness, so that two co-workers of the same age with the same background don't get exactly the same title. So I was thinking of using the number of letters in the name to mix it up a bit.
Now I can generate consistent numbers in an infinite number of ways, like the number of letters in the name * age * years of education * number of employers. This would come out as 35,840 for Joe Smith and 247 for Samantha Smith. But I want it to be a number between 1-30, where Samantha is closer to 25-30 and Joe is closer to 1-5.
Maybe this is more of a math problem than a programming problem, but I have seen a lot of "What's your pirate name?" and similar apps out there and I can't figure out how they work. "What's your pirate name?" might be a bad example, since it's probably completely random and I want my variables to matter some, but the idea is the same.
What I have tried
I tried adding weights to variable groups so I would get an easier number to use in my calculations.
Age
01-20 5
20-30 4
30-40 3
40-50 2
...
Years of education
00-01 0
01-02 1
02-03 2
04-05 3
...
Adding them together and playing around with those numbers led to a lot of problems, like everyone ending up in pretty much the same mid-range (no one got to be CEO or Dishwasher; everyone was somewhere in the middle), not to mention how messy the code was.
Is there a good way to accomplish what I want to do without having to build a massive math engine?
int numberOfTitles = 30;
var semiRandomID = person.Name.GetHashCode()
^ person.NumberOfPreviousEmployers.GetHashCode()
^ person.NumberOfYearsEducation.GetHashCode()
^ person.Age.GetHashCode();
var semiRandomTitle = Math.Abs(semiRandomID) % numberOfTitles;
// adjust semiRandomTitle as you see fit
semiRandomTitle += ((person.Age / 10) - 2);
semiRandomTitle += (person.NumberOfYearsEducation / 2);
The semiRandomID is a number generated from the hashes of each component. The hashes are deterministic, so you will always generate the same number for "Joe", for example, but they don't mean anything; it's just a number. So we take all those numbers and generate one job title out of the 30 available. Every person has the same chance to get each job title (probably some math freak will prove that there are edge cases to the contrary, but for all practical, non-cryptographic purposes it's sufficient).
Now each person has one job title assigned that looks random. However, as it's math and not randomness, they will get the same one every time.
Now let's assume Joe got Taxi Driver, number 20. However, he has 10 years of formal education, so you decide you want that aspect to have some weight. You could just add the years onto the job title number, but that would make anyone with 30 years of college parties a CEO, so you decide (arbitrarily) that each year of education counts for half a job title. You add (NumberOfYearsEducation / 2) to the job title.
Let's assume Jane got CIO, number 5. However, she is only 22 years old, a little young to be that high on the list. Again, you could just add the years onto the job title number, but that would make anyone 30 years of age a CEO, so you decide (arbitrarily) that each year counts as 1/10 of a job title. In addition, you think that being very young should instead subtract from the job title: all years below the first 20 should count as a negative weight. So the formula is ((Age / 10) - 2), one point for each 10 years of age, with the first 2 counting as negative.
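The whole approach can be sketched in Python as well (a sketch only: hashlib.md5 replaces .NET's GetHashCode so the result is stable across runs, job_title_number is a made-up name, and the weights are the arbitrary ones from above):

```python
import hashlib

TITLE_COUNT = 30

def job_title_number(name, employers, education_years, age):
    # Deterministic hash of all inputs; md5 is used here (instead of
    # .NET's GetHashCode) so the value is the same on every run.
    key = "%s|%d|%d|%d" % (name, employers, education_years, age)
    base = int(hashlib.md5(key.encode()).hexdigest(), 16) % TITLE_COUNT
    # The arbitrary weights discussed above.
    base += (age // 10) - 2          # one point per decade, first two negative
    base += education_years // 2     # half a title per year of education
    # Clamp into the valid 1..30 range of the title table.
    return min(max(base + 1, 1), TITLE_COUNT)
```

Calling it twice with the same inputs always yields the same title number, which is the "random but consistent" property the question asks for.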
I want to print invoices for customers in my app. Each invoice has an Invoice ID. I want IDs to be:
Sequential (IDs created later are larger)
32 bit integers
Not easily traceable like 1, 2, 3, so that people can't tell how many items we sell.
An idea of my own:
The number of seconds since a specific date & time (e.g. 1/1/2010 00 AM).
Any other ideas how to generate these numbers?
I don't like the idea of using time. You can run into all sorts of issues: time differences, several events happening in a single second, and so on.
If you want something sequential and not easily traceable, how about generating a random number between 1 and whatever you wish (for example 100) for each new ID? Each new ID will be the previous ID plus that random number.
You can also add a constant to your IDs to make them look more impressive. For example you can add 44323 to all your IDs and turn IDs 15, 23 and 27 into 44338, 44346 and 44350.
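A minimal sketch of the two ideas above combined (the function names are mine; 44323 is the example constant from above):

```python
import random

OFFSET = 44323  # constant added only for display, to make IDs look more impressive

def next_invoice_id(previous_id):
    # Each new ID is the previous one plus a random gap of 1..100,
    # so IDs stay sequential but the count of issued IDs is blurred.
    return previous_id + random.randint(1, 100)

def display_id(internal_id):
    # The stored ID stays small; only the printed form carries the offset.
    return internal_id + OFFSET
```

With this, internal IDs 15, 23 and 27 print as 44338, 44346 and 44350, exactly as in the example.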
There are two problems in your question. One is solvable, one isn't (with the constraints you give).
Solvable: Unguessable numbers
The first one is quite simple: It should be hard for a customer to guess a valid invoice number (or the next valid invoice number), when the customer has access to a set of valid invoice numbers.
You can solve this with your constraint:
Split your invoice number in two parts:
A 20 bit prefix, taken from a sequence of increasing numbers (e.g. the natural numbers 0,1,2,...)
A 10 bit suffix that is randomly generated
With this scheme, there are about 1 million valid invoice numbers. You can precalculate them and store them in the database. When presented with an invoice number, check whether it is in your database; if it isn't, it's not valid.
Use an SQL sequence for handing out numbers. When issuing a new (i.e. unused) invoice number, increment the sequence and issue the n-th number from the precalculated list (ordered by value).
Not solvable: Guessing the number of customers
If you want to prevent a customer who holds a number of valid invoice numbers from guessing how many invoice numbers you have issued so far (and therefore how many customers you have): this is not possible.
What you have here is a variant of the so-called "German tank problem". In the Second World War, the Allies used serial numbers printed on the gearboxes of German tanks to estimate how many tanks Germany had produced. This worked because the serial numbers increased without gaps.
But even when you increase the numbers with gaps, the solution to the German tank problem still works. It is quite simple:
You use the method described here to guess the highest issued invoice number
You guess the mean difference between two successive invoice numbers and divide the highest number by this value
You can use linear regression to get a stable delta value (if it exists).
Now you have a good guess about the order of magnitude of the number of invoices (200, 15000, half a million, etc.).
This works as long as there (theoretically) exists a mean value for the gap between two successive invoice numbers. This is usually the case, even when using a random number generator, because most random number generators are designed to have such a mean value.
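For illustration, a rough sketch of that estimate (the helper name is mine; a real attacker would use regression as noted above):

```python
def estimate_issued(sample):
    # German-tank-style estimate: highest observed number divided by the
    # mean gap between successive observed numbers.
    # Needs at least two sampled invoice numbers.
    s = sorted(sample)
    gaps = [b - a for a, b in zip(s, s[1:])]
    mean_gap = sum(gaps) / len(gaps)
    return s[-1] / mean_gap
```

For example, if the issuer uses a constant gap of 50, a sample like 50, 100, ..., 5000 yields an estimate of 100 issued invoices, which is exactly right.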
There is a countermeasure: you have to make sure that no mean value exists for the gap between two successive numbers. A random number generator with this property can be constructed very easily.
Example:
Start with the last invoice number plus one as current number
Multiply the current number by a random number >= 2. This is your new current number.
Get a random bit: If the bit is 0, the result is your current number. Otherwise go back to step 2.
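A sketch of those three steps (the factor range 2-10 is an arbitrary choice of mine for illustration):

```python
import random

def next_invoice_number(last_number):
    # Step 1: start with the last invoice number plus one.
    current = last_number + 1
    while True:
        # Step 2: multiply by a random factor >= 2.
        current *= random.randint(2, 10)
        # Step 3: stop on a random bit, otherwise multiply again.
        if random.getrandbits(1) == 0:
            return current
```

Because the loop can multiply several times before stopping, the gap sizes have no finite mean, but the numbers also explode out of the 32 bit range after only a handful of invoices.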
While this will work in theory, you will very soon run out of 32 bit integer numbers.
I don't think there is a practical solution to this problem. Either the gap between two successive numbers has a mean value (with little variance) and you can guess the number of issued invoices easily, or you will run out of 32 bit numbers very quickly.
Snake oil (non-working solutions)
Don't use any time-based solution. The timestamp is usually easy to guess (an approximately correct timestamp is probably printed somewhere on the invoice anyway). Using timestamps usually makes things easier for the attacker, not harder.
Don't use insecure random numbers. Most random number generators are not cryptographically safe. They usually have mathematical properties that are good for statistics but bad for your security (e.g. a predictable distribution, a stable mean value, etc.).
One solution may involve Exclusive OR (XOR) binary bitmaps. The result function is reversible, may generate non-sequential numbers (if the first bit of the least significant byte is set to 1), and is extremely easy to implement. And, as long as you use a reliable sequence generator (your database, for example,) there is no need for thread safety concerns.
According to MSDN, 'the result [of an exclusive-OR operation] is true if and only if exactly one of its operands is true.' By the reverse logic, equal operands always result in false.
As an example, I just generated a 32-bit sequence on Random.org. This is it:
11010101111000100101101100111101
This binary number translates to 3588381501 in decimal, 0xD5E25B3D in hex. Let's call it your base key.
Now, let's generate some values using the ([base key] XOR [ID]) formula. In C#, this is what your encryption function would look like:
public static long FlipMask(long baseKey, long ID)
{
    return baseKey ^ ID;
}
The following list contains some generated content. Its columns are as follows:
ID
Binary representation of ID
Binary value after XOR operation
Final, 'encrypted' decimal value
0 | 000 | 11010101111000100101101100111101 | 3588381501
1 | 001 | 11010101111000100101101100111100 | 3588381500
2 | 010 | 11010101111000100101101100111111 | 3588381503
3 | 011 | 11010101111000100101101100111110 | 3588381502
4 | 100 | 11010101111000100101101100111001 | 3588381497
In order to reverse the generated key and determine the original value, you only need to do the same XOR operation using the same base key. Let's say we want to obtain the original value of the second row:
11010101111000100101101100111101 XOR
11010101111000100101101100111100 =
00000000000000000000000000000001
Which was indeed your original value.
Now, Stefan made very good points, and the first topic is crucial.
In order to cover his concerns, you may reserve the last, say, 8 bits to be purely random garbage (which I believe is called a nonce), which you generate when encrypting the original ID and ignore when reversing it. That would heavily increase your security at the expense of a generous slice of all the possible positive 32 bit integers (16,777,216 instead of 4,294,967,296, or 1/256 of it).
A class to do that would look like this:
public static class int32crypto
{
    // C# follows ECMA 334v4, so integer literals have only two possible forms -
    // decimal and hexadecimal.
    // Original key: 0b11010101111000100101101100111101
    public static long baseKey = 0xD5E25B3D;

    public static long encrypt(long value)
    {
        // First we extract from our baseKey the bits we'll actually use.
        // We do this with an AND mask, indicating the bits to keep.
        // Remember, we ignore the first 8. So the mask must look like this:
        // Significance mask: 0b00000000111111111111111111111111
        long _sigMask = 0x00FFFFFF;
        // sigKey is our baseKey with only the indicated bits still set.
        long _sigKey = _sigMask & baseKey;
        // Nonce generation. First security issue, since Random()
        // is time-based on its first iteration. But that's OK for the sake
        // of explanation, and safe for most circumstances.
        // The bits it will occupy are the first eight, like this:
        // originalNonce: 0b000000000000000000000000NNNNNNNN
        long _tempNonce = new Random().Next(255);
        // We now shift them to the last byte, like this:
        // finalNonce: 0bNNNNNNNN000000000000000000000000
        _tempNonce = _tempNonce << 0x18;
        // And now we mix both nonce and sigKey, 'poisoning' the original
        // key, like this:
        long _finalKey = _tempNonce | _sigKey;
        // Phew! Now we apply the final key to the value and return
        // the encrypted value.
        return _finalKey ^ value;
    }

    public static long decrypt(long value)
    {
        // This is easier than encrypting. We just ignore the bits
        // we know are used by our nonce.
        long _sigMask = 0x00FFFFFF;
        long _sigKey = _sigMask & baseKey;
        // We do the same to the supplied value:
        long _trueValue = _sigMask & value;
        // Now we decode and return the value:
        return _sigKey ^ _trueValue;
    }
}
Perhaps an idea may come from the military? Group invoices in blocks like these:
28th Infantry Division
--1st Brigade
---1st BN
----A Co
----B Co
---2nd BN
----A Co
----B Co
--2nd Brigade
---1st BN
----A Co
----B Co
---2nd BN
----A Co
----B Co
--3rd Brigade
---1st BN
----A Co
----B Co
---2nd BN
----A Co
----B Co
http://boards.straightdope.com/sdmb/showthread.php?t=432978
Groups don't have to be sequential, but numbers within a group do.
UPDATE
Think of the above as groups differentiated by place, time, person, etc. For example: create a group using a seller's temporary ID, changing it every 10 days or per office/shop.
There is another idea, which you may find a bit weird, but the more I think about it the more I like it: why not count these invoices down? Choose a big number and count down. It's easy to trace the number of items when counting up, but counting down? How would anyone guess the starting point? It's easy to implement, too.
If the orders sit in an inbox until a single person processes them each morning, seeing that it took that person till 16:00 before he got round to creating my invoice will give me the impression that he's been busy. Getting the 9:01 invoice makes me feel like I'm the only customer today.
But if you generate the ID at the time when I place my order, the timestamp tells me nothing.
I think I therefore actually like the timestamps, assuming that collisions where two customers simultaneously need an ID created are rare.
You can see from the code below that I use newsequentialid() to generate a sequential number and then convert it to a [bigint]. As that generates a consistent increment of 4294967296, I simply divide that number by the [id] on the table (it could be rand() seeded with nanoseconds or something similar). The result is a number that is always less than 4294967296, so I can safely add it and be sure I'm not overlapping the range of the next number.
Peace
Katherine
declare @generator as table (
    [id] [bigint],
    [guid] [uniqueidentifier] default (newsequentialid()) not null,
    [converted] as (convert([bigint], convert([varbinary](8), [guid], 1))) + 10000000000000000000,
    [converted_with_randomizer] as (convert([bigint], convert([varbinary](8), [guid], 1))) + 10000000000000000000 + cast((4294967296 / [id]) as [bigint])
);
insert into @generator ([id])
values (1), (2), (3), (4), (5), (6), (7), (8), (9), (10);
select [id],
       [guid],
       [converted],
       [converted] - lag([converted], 1)
           over (order by [id]) as [orderly_increment],
       [converted_with_randomizer],
       [converted_with_randomizer] - lag([converted_with_randomizer], 1)
           over (order by [id]) as [disorderly_increment]
from @generator
order by [converted];
I do not know the reasons for the rules you set on the invoice ID, but you could consider having an internal invoice ID, which could be the sequential 32 bit integer, and an external invoice ID that you share with your customers.
This way your internal ID can start at 1 and you can add one to it every time, while the customer invoice ID can be whatever you want.
I think Na Na has the correct idea with choosing a big number and counting down. Start off with a large seed value and either count up or down, but don't start at the last placeholder. If you use one of the other placeholders, it will give the illusion of a higher invoice count, if anyone is actually looking at that anyway.
The only caveat here would be to modify the last X digits of the number periodically to maintain the appearance of change.
Why not take an easily readable number constructed like this:
the first 12 digits are the datetime in yyyymmddhhmm format (which ensures the order of your invoice IDs)
the last x digits are the order number (in this example, 8 digits)
The number you get is then something like 20130814140300000008.
Then do some simple calculations with it, like multiplying the first 12 digits:
(201308141403) * 3 = 603924424209
The second part (original: 00000008) can be obfuscated like this:
(10001234 - 00000008 * 256) * (minutes + 2) = 49995930
It is easy to translate back into an easily readable number, but unless you know how, the customer has no clue at all.
Altogether this number would look like 603924424209-49995930
for an invoice from 14 August 2013 at 14:03 with internal invoice number 00000008.
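A sketch of this scheme (using the example constants from above; the function name is mine, and `minutes` is taken from the mm part of the timestamp):

```python
def obfuscated_invoice_id(yyyymmddhhmm, order_no):
    minutes = yyyymmddhhmm % 100          # the mm part of the timestamp
    part1 = yyyymmddhhmm * 3              # obfuscated date part
    part2 = (10001234 - order_no * 256) * (minutes + 2)  # obfuscated order number
    return "%d-%d" % (part1, part2)

print(obfuscated_invoice_id(201308141403, 8))  # → 603924424209-49995930
```

Anyone who knows the constants can invert both parts, but a customer seeing only the final number cannot.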
You can write your own function that, when applied to the previous number, generates the next sequential random number, which is greater than the previous one but random. The numbers that can be generated come from a finite set (for example, the integers between 1 and 2^31), so the sequence may eventually repeat itself, though that is highly unlikely. To add more complexity to the generated numbers, you can append some alphanumeric characters at the end. You can read about this here: Sequential Random Numbers.
An example generator can be
private static string GetNextnumber(int currentNumber)
{
    Int32 nextnumber = currentNumber + (currentNumber % 3) + 5;
    Random _random = new Random();
    // you can skip the next 2 lines if you don't want alphanumeric output
    int num = _random.Next(0, 26); // zero to 25
    char let = (char)('a' + num);
    return nextnumber + let.ToString();
}
and you can call like
string nextnumber = GetNextnumber(yourpreviouslyGeneratedNumber);
I have many different items and I want to keep track of the number of hits to each item, then query the hit count for each item in a given datetime range, down to the second.
So I started storing the hits in sorted sets, one sorted set for each second (unix epoch time), for example:
zincrby ItemCount:1346742000 item1 1
zincrby ItemCount:1346742000 item2 1
zincrby ItemCount:1346742001 item1 1
zincrby ItemCount:1346742005 item9 1
Now to get an aggregate hit count for each item in a given date range :
1. Given a start datetime and end datetime:
Calculate the range of epochs that fall under that range.
2. Generate the key names for each sorted set using the epoch values example:
ItemCount:1346742001, ItemCount:1346742002, ItemCount:1346742003
3. Use ZUNIONSTORE to aggregate all the values from the different sorted sets:
ZUNIONSTORE _item_count KEYS....
4. To get the final results out:
ZRANGE _item_count 0 -1 WITHSCORES
So it kind of works, but I run into a problem when I have a big date range like one month: the number of key names calculated in steps 1 and 2 runs into the millions (86,400 epoch values per day).
With such a large number of keys, the ZUNIONSTORE command fails: the socket gets broken. Plus, it takes a while to loop through and generate that many keys.
How can I design this in Redis in a more efficient way and still keep the tracking granularity all the way down to seconds, not minutes or days?
Yeah, you should avoid big unions of sorted sets. Here's a nice trick you can do, assuming you know the maximum number of hits an item can get per second:
Keep one sorted set per item, with timestamps as BOTH scores and values.
The scores are incremented by 1/(max_predicted_hits_per_second) if you are not the first client to write them that second. This way the part after the decimal dot is always hits/max_predicted_hits_per_second, but you can still do range queries.
So let's say max_predicted_hits_per_second is 1000. What we do is this (Python example):
# 1. Make sure only one client adds the actual timestamp,
#    by doing SETNX on a temporary per-second key.
now = int(time.time())
# the key must include the second, or later seconds never get the base timestamp
rc = redis.setnx('item_ts:%s:%s' % (itemId, now), now)
# just the count part
val = float(1) / 1000
if rc:  # we are the first to increment this second
    val += now
    # we won't need that key anymore soon, assuming all clients have the same clock
    redis.expire('item_ts:%s:%s' % (itemId, now), 10)
# 2. Increment the count
redis.zincrby('item_counts:%s' % itemId, now, amount=val)
and now querying a range will be something like:
counts = redis.zrangebyscore('item_counts:%s' % itemId, minTime, maxTime + 0.999, withscores=True)
total = 0
for value, score in counts:
    count = (score - int(value)) * 1000
    total += count
I have a problem where I have to "audit" a percentage of my transactions.
If the percentage is 100, I have to audit them all; if it is 0, I have to skip them all; if it is 50%, I have to review half; and so on.
The problem (or the opportunity) is that I have to perform the check at runtime.
What I tried was:
audit = 100/percent
So if percent is 50
audit = 100 / 50 ( which is 2 )
So I have to audit 1 and skip 1 audit 1 and skip 1 ..
If it is 30:
audit = 100 / 30 ( 3.3 )
I audit 2 and skip the third.
Question
I'm having problems with percentages beyond 50% (like 75%), because the quotient comes out as 1.333...
What would be the correct algorithm to know which transactions to audit as they come in? I also had problems with 0 (due to division by zero :P), but I have fixed that already, and with 100, etc.
Any suggestion is greatly appreciated.
Why not do it randomly? For each transaction, pick a random number between 0 and 100. If that number is less than your "percent", audit the transaction; if it is greater, don't. I don't know if this satisfies your requirements, but over an extended period of time you will have the right percentage audited.
If you need an exact "skip 2, audit one, skip 2 audit one" type of algorithm, you'll likely have luck adapting a line-drawing algorithm.
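Such an adaptation could look like this error-accumulator sketch, in the spirit of Bresenham's line algorithm (the names are mine):

```python
def audit_schedule(percent, n):
    # Decide audit/skip for n transactions so that audits are spread
    # evenly: accumulate `percent` per transaction and emit an audit
    # every time the accumulator crosses 100.
    decisions = []
    err = 0
    for _ in range(n):
        err += percent
        if err >= 100:
            err -= 100
            decisions.append(True)   # audit this transaction
        else:
            decisions.append(False)  # skip it
    return decisions

print(sum(audit_schedule(75, 100)))  # → 75
```

Because the accumulator carries the fractional remainder forward, awkward quotients like 1.333 never arise, and the audited fraction is exact over any window of 100 transactions.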
Try this:
1) Keep your audit percentage as a decimal.
2) For every transaction, associate a random number (between 0 and 1) with it
3) If the random number is less than the percentage, audit the transaction.
To follow your own algorithm: just keep adding that 1.333333 (or other quotient) to a counter.
Keep two counters: an integer one and a real one. If the truncated part of the real counter equals the integer counter, the audit is carried out; otherwise it isn't. Like this:
Integer counter  Real counter
1                1.333333: audit transaction
2                2.666666: audit transaction
3                3.999999: audit transaction
4                truncated(5.333333) = 5 > 4 => do NOT audit transaction
5                5.333333: audit transaction
Only increment the real counter when its truncated value equals the integer counter. Always increment the integer counter.
In code:
var
  p, pc: double;
  c: integer;
begin
  p := 100 / Percentage;
  pc := p;
  for c := 1 to NrOfTransactions do begin
    if trunc(pc) = c then begin
      pc := pc + p;
      { Do audit on transaction c }
    end
  end;
end;
import random

if percent >= random.randint(1, 100):  # >= so that percent == 100 always audits
    print("audit")
else:
    print("skip")
If you need to audit these transactions in real time (as they are received), perhaps you could use a random number generator to decide whether to audit each transaction.
So if, for example, you want to audit 50% of transactions, for every transaction received you would generate a random number between 0 and 1, and audit that transaction if the number is greater than 0.5.
While for low numbers of transactions this would not work exactly, for large numbers it will give you very close to the required percentage.
This is also better than your initial suggestion because it does not give anyone a way to 'game' the audit process: if you audit every second transaction, that predictable pattern lets bad transactions slip through deliberately.
Another possibility is to keep a running total of transactions and, as the total number of transactions that need auditing (according to your percentage) changes, pipe transactions into the auditing process. This, however, still leaves a slight possibility of someone detecting the pattern and circumventing the audit.
For a high-throughput system the random method is best, but if you don't want randomness, this algorithm will do the job. Don't forget to test it in a unit test!
// setup
int transactionCount = 0;
int auditCount = 0;
double targetAuditRatio = auditPercent / 100.0;
// start of processing
transactionCount++;
// cast to double, otherwise integer division always yields 0
double actualAuditRatio = (double)auditCount / transactionCount;
if (actualAuditRatio < targetAuditRatio) {
    auditCount++;
    // do audit
}
// do processing
// do processing
You can constantly "query" each audit decision using a counter. For example:
ctr = 0;
percent = 50;
while (1) {
    ctr += percent;
    if (ctr >= 100) {
        audit;
        ctr = ctr - 100;
    } else
        skip;
}
You can use floats (though this will introduce some unpredictability), or multiply the percentage by some factor to get better resolution.
There is really no need to use random number generator.
Not tested, but the random module has a sample function. If transactions is a list of transactions, you would do something like:
import random
to_be_audited = random.sample(transactions, len(transactions) * percentage // 100)
This generates a list to_be_audited, which is a random, non-duplicating sample of the transactions.
See documentation on random