I want to print invoices for customers in my app. Each invoice has an Invoice ID. I want IDs to be:
Sequential (IDs issued later should come later)
32 bit integers
Not easily traceable like 1, 2, 3, so that people can't tell how many items we sell.
An idea of my own:
Number of seconds since a specific date & time (e.g. 1/1/2010 00:00).
Any other ideas on how to generate these numbers?
I don't like the idea of using time. You can run into all sorts of issues - time differences, several events happening in a single second and so on.
If you want something sequential and not easily traceable, how about generating a random number between 1 and whatever you wish (for example 100) for each new Id. Each new Id will be the previous Id + the random number.
You can also add a constant to your IDs to make them look more impressive. For example you can add 44323 to all your IDs and turn IDs 15, 23 and 27 into 44338, 44346 and 44350.
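A minimal C# sketch of this scheme (the gap range 1..100 and the offset 44323 are just the example values above; in a real app the last ID would be loaded from the database):

```csharp
using System;

// Sketch of the "previous ID + small random gap" idea, plus the cosmetic
// constant offset. The starting value, gap range, and offset are the
// example values from the text, not recommendations.
public static class GappedIdGenerator
{
    private static readonly Random Rng = new Random();
    private static int _lastId = 0; // load this from the database in practice

    public static int NextId()
    {
        // A gap of 1..100 keeps IDs increasing but hides the exact count.
        _lastId += Rng.Next(1, 101);
        return _lastId + 44323; // cosmetic constant offset
    }
}
```

Note that the stored counter (`_lastId`) must be persisted and updated atomically if several invoices can be created concurrently.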
There are two problems in your question. One is solvable, one isn't (with the constraints you give).
Solvable: Unguessable numbers
The first one is quite simple: It should be hard for a customer to guess a valid invoice number (or the next valid invoice number), when the customer has access to a set of valid invoice numbers.
You can solve this with your constraint:
Split your invoice number in two parts:
A 20 bit prefix, taken from a sequence of increasing numbers (e.g. the natural numbers 0,1,2,...)
A 10 bit suffix that is randomly generated
With this scheme, there are about 1 million valid invoice numbers. You can precalculate them and store them in the database. When presented with an invoice number, check whether it is in your database; if it isn't, it's not valid.
Use an SQL sequence for handing out numbers. When issuing a new (i.e. unused) invoice number, increment the sequence and issue the n-th number from the precalculated list (ordered by value).
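A sketch of the split, assuming a 20-bit counter and a 10-bit random suffix as described (in production the counter would be the database sequence and the suffix would come from a cryptographic RNG):

```csharp
using System;

// Sketch: 20-bit sequential prefix + 10-bit random suffix, as described above.
// 2^20 counters gives roughly one million valid numbers, and the whole ID
// fits comfortably in a 32-bit integer.
public static class PrefixSuffixIds
{
    private static readonly Random Rng = new Random(); // use a CSPRNG in practice

    public static int Make(int counter)
    {
        if (counter < 0 || counter >= (1 << 20))
            throw new ArgumentOutOfRangeException(nameof(counter));
        int suffix = Rng.Next(1 << 10); // 10 random bits
        return (counter << 10) | suffix;
    }

    // The sequential part can always be recovered; the suffix is just noise.
    public static int CounterOf(int id) => id >> 10;
}
```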
Not solvable: Guessing the number of customers
If you want to prevent a customer who holds a set of valid invoice numbers from guessing how many invoice numbers you have issued so far (and therefore how many customers you have): this is not possible.
What you have here is a variant of the so-called "German tank problem". In the Second World War, the Allies used the serial numbers printed on the gearboxes of German tanks to estimate how many tanks Germany had produced. This worked because the serial numbers increased without gaps.
But even when you increase the numbers with gaps, the solution to the German tank problem still works. It is quite simple:
Use the method described there to estimate the highest issued invoice number
Estimate the mean difference between two successive invoice numbers and divide the highest number by this value
You can use linear regression to get a stable delta value (if one exists)
Now you have a good guess about the order of magnitude of the number of invoices (200, 15,000, half a million, etc.).
This works as long as there (theoretically) exists a mean value for the gap between two successive invoice numbers. This is usually the case, even when using a random number generator, because most random number generators are designed to have such a mean value.
There is a countermeasure: you have to make sure that no mean value exists for the gap between two successive numbers. A random number generator with this property can be constructed very easily.
Example:
Start with the last invoice number plus one as the current number
Multiply the current number by a random number >= 2. This is your new current number.
Get a random bit: if the bit is 0, the result is your current number. Otherwise go back to step 2.
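The three steps above can be sketched like this (such a generator exhausts 32-bit space almost immediately, so the sketch uses long and caps the growth):

```csharp
using System;

// Sketch of the "no mean gap" generator from steps 1-3 above. Each new number
// is the previous one times a random factor >= 2, repeated a geometrically
// distributed number of times, so the gaps have no finite mean value.
public static class HeavyTailedIds
{
    private static readonly Random Rng = new Random();

    public static long Next(long lastId)
    {
        long current = lastId + 1; // step 1
        do
        {
            current *= Rng.Next(2, 10); // step 2: random factor >= 2
        }
        // Step 3: a random bit decides whether to repeat step 2.
        // The extra bound guards against long overflow in this sketch.
        while (Rng.Next(2) == 1 && current < (1L << 40));
        return current;
    }
}
```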
While this will work in theory, you will very quickly run out of 32-bit integers.
I don't think there is a practical solution to this problem. Either the gap between two successive numbers has a mean value (with little variance), and the number of issued invoices can be estimated easily, or you will run out of 32-bit numbers very quickly.
Snake oil (non-working solutions)
Don't use any time-based solution. The timestamp is usually easily guessable (an approximately correct timestamp is probably printed somewhere on the invoice anyway). Using timestamps usually makes things easier for the attacker, not harder.
Don't use insecure random numbers. Most random number generators are not cryptographically safe. They usually have mathematical properties that are good for statistics but bad for your security (e.g. a predictable distribution, a stable mean value, etc.).
One solution may involve XOR (exclusive OR) bitmasks. The function is reversible, may generate non-sequential numbers (if the first bit of the least significant byte is set to 1), and is extremely easy to implement. And, as long as you use a reliable sequence generator (your database, for example), there is no need for thread-safety concerns.
According to MSDN, 'the result [of an exclusive-OR operation] is true if and only if exactly one of its operands is true.' By the same logic, equal operands always yield false.
As an example, I just generated a 32-bit sequence on Random.org. This is it:
11010101111000100101101100111101
This binary number translates to 3588381501 in decimal, 0xD5E25B3D in hex. Let's call it your base key.
Now, let's generate some values using the ([base key] XOR [ID]) formula. In C#, the encryption function would look like this:
public static long FlipMask(long baseKey, long ID)
{
    return baseKey ^ ID;
}
The following list contains some generated content. Its columns are as follows:
ID
Binary representation of ID
Binary value after XOR operation
Final, 'encrypted' decimal value
0 | 000 | 11010101111000100101101100111101 | 3588381501
1 | 001 | 11010101111000100101101100111100 | 3588381500
2 | 010 | 11010101111000100101101100111111 | 3588381503
3 | 011 | 11010101111000100101101100111110 | 3588381502
4 | 100 | 11010101111000100101101100111001 | 3588381497
In order to reverse the generated key and determine the original value, you only need to do the same XOR operation using the same base key. Let's say we want to obtain the original value of the second row:
11010101111000100101101100111101 XOR
11010101111000100101101100111100 =
00000000000000000000000000000001
Which was indeed your original value.
Now, Stefan made very good points, and the first topic is crucial.
In order to cover his concerns, you may reserve the last, say, 8 bits to be purely random garbage (which I believe is called a nonce), which you generate when encrypting the original ID and ignore when reversing it. That would heavily increase your security at the expense of a generous slice of all the possible positive 32-bit integers (16,777,216 instead of 4,294,967,296, or 1/256 of it.)
A class to do that would look like this:
public static class int32crypto
{
    // C# follows ECMA 334v4, so integer literals have only two possible forms -
    // decimal and hexadecimal.
    // Original key: 0b11010101111000100101101100111101
    public static long baseKey = 0xD5E25B3D;

    public static long encrypt(long value)
    {
        // First we will extract from our baseKey the bits we'll actually use.
        // We do this with an AND mask, indicating the bits to extract.
        // Remember, we'll ignore the first 8. So the mask must look like this:
        // Significance mask: 0b00000000111111111111111111111111
        long _sigMask = 0x00FFFFFF;
        // sigKey is our baseKey with only the indicated bits still true.
        long _sigKey = _sigMask & baseKey;
        // Nonce generation. First security issue, since Random()
        // is time-seeded on its first iteration. But that's OK for the sake
        // of explanation, and safe for most circumstances.
        // The bits it will occupy are the first eight, like this:
        // originalNonce: 0b000000000000000000000000NNNNNNNN
        long _tempNonce = new Random().Next(256); // upper bound is exclusive
        // We now shift them to the last byte, like this:
        // finalNonce: 0bNNNNNNNN000000000000000000000000
        _tempNonce = _tempNonce << 0x18;
        // And now we mix both nonce and sigKey, 'poisoning' the original
        // key, like this:
        long _finalKey = _tempNonce | _sigKey;
        // Phew! Now we apply the final key to the value, and return
        // the encrypted value. Note this only round-trips for values
        // that fit in the low 24 bits.
        return _finalKey ^ value;
    }

    public static long decrypt(long value)
    {
        // This is easier than encrypting. We will just ignore the bits
        // we know are used by our nonce.
        long _sigMask = 0x00FFFFFF;
        long _sigKey = _sigMask & baseKey;
        // We will do the same to the informed value:
        long _trueValue = _sigMask & value;
        // Now we decode and return the value:
        return _sigKey ^ _trueValue;
    }
}
Perhaps an idea may come from the military? Group invoices in blocks like these:
28th Infantry Division
--1st Brigade
---1st BN
----A Co
----B Co
---2nd BN
----A Co
----B Co
--2nd Brigade
---1st BN
----A Co
----B Co
---2nd BN
----A Co
----B Co
--3rd Brigade
---1st BN
----A Co
----B Co
---2nd BN
----A Co
----B Co
http://boards.straightdope.com/sdmb/showthread.php?t=432978
Groups don't have to be sequential, but numbers within a group do.
UPDATE
Think of the above as groups differentiated by place, time, person, etc. For example: create a group using a seller's temporary ID, changing it every 10 days or per office/shop.
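One way to sketch such block-structured IDs, with hypothetical field widths (16 bits for a non-sequential group code, 16 bits for the in-group sequence):

```csharp
using System;

// Sketch of the "blocks" idea: a group code (e.g. per office or per 10-day
// period, chosen in any non-sequential order) combined with a sequential
// number inside the group. The 16/16 bit split is an arbitrary choice.
public static class GroupedInvoiceIds
{
    public static int Make(int groupCode, int seqInGroup)
    {
        if (groupCode < 0 || groupCode > 0x7FFF)
            throw new ArgumentOutOfRangeException(nameof(groupCode));
        if (seqInGroup < 0 || seqInGroup > 0xFFFF)
            throw new ArgumentOutOfRangeException(nameof(seqInGroup));
        // Group in the high half, in-group sequence in the low half.
        return (groupCode << 16) | seqInGroup;
    }
}
```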
There is another idea; you may say it's a bit weird, but the more I think of it, the more I like it. Why not count down these invoices? Choose a big number and count down. It's easy to trace the number of items when counting up, but counting down? How would anyone guess the starting point? It's easy to implement, too.
If the orders sit in an inbox until a single person processes them each morning, seeing that it took that person till 16:00 before he got round to creating my invoice will give me the impression that he's been busy. Getting the 9:01 invoice makes me feel like I'm the only customer today.
But if you generate the ID at the time when I place my order, the timestamp tells me nothing.
I think I therefore actually like the timestamps, assuming that collisions where two customers simultaneously need an ID created are rare.
You can see from the code below that I use newsequentialid() to generate a sequential number and then convert it to a [bigint]. As that generates a consistent increment of 4294967296, I simply divide that number by the [id] on the table (it could be rand() seeded with nanoseconds or something similar). The result is a number that is always less than 4294967296, so I can safely add it and be sure I'm not overlapping the range of the next number.
Peace
Katherine
declare @generator as table (
    [id] [bigint],
    [guid] [uniqueidentifier] default(newsequentialid()) not null,
    [converted] as (convert([bigint], convert([varbinary](8), [guid], 1))) + 10000000000000000000,
    [converted_with_randomizer] as (convert([bigint], convert([varbinary](8), [guid], 1))) + 10000000000000000000 + cast((4294967296 / [id]) as [bigint])
);
insert into @generator ([id])
values (1), (2), (3), (4), (5), (6), (7), (8), (9), (10);
select [id],
       [guid],
       [converted],
       [converted] - lag([converted], 1) over (order by [id]) as [orderly_increment],
       [converted_with_randomizer],
       [converted_with_randomizer] - lag([converted_with_randomizer], 1) over (order by [id]) as [disorderly_increment]
from @generator
order by [converted];
I do not know the reasons for the rules you set on the invoice ID, but you could consider having an internal invoice ID, which could be the sequential 32-bit integer, and an external invoice ID that you share with your customers.
This way your internal ID can start at 1, and you can add one to it every time, while the customer-facing invoice ID can be whatever you want.
I think Na Na has the correct idea with choosing a big number and counting down. Start off with a large seed value and either count up or down, but don't start with the last placeholder. If you use one of the other placeholders, it will give the illusion of a higher invoice count... if they are actually looking at that anyway.
The only caveat here would be to modify the last X digits of the number periodically to maintain the appearance of change.
Why not take an easily readable number constructed like this:
the first 12 digits are the datetime in yyyymmddhhmm format (this ensures the order of your invoice IDs)
the last x digits are the order number (in this example, 8 digits)
The number you get then is something like 20130814140300000008
Then do some simple calculations with it, like multiplying the first 12 digits:
(201308141403) * 3 = 603924424209
The second part (original: 00000008) can be obfuscated like this:
(10001234 - 00000008 * 256) * (minutes + 2) = 49995930
It is easy to translate back into an easily readable number, but unless you know how, the customer has no clue at all.
Altogether this number would look like 603924424209-49995930
for an invoice at the 14th August 2013 at 14:03 with the internal invoice number 00000008.
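A sketch reproducing the worked example above (the multiplier 3 and the constant 10001234 are the arbitrary values from the text):

```csharp
using System;
using System.Globalization;

// Sketch of the obfuscation described above. The date part is multiplied by 3
// and the order number is mixed with a magic constant and the minute, exactly
// as in the worked example; both constants are arbitrary.
public static class ReadableObfuscator
{
    public static string Encode(DateTime when, long orderNumber)
    {
        long datePart =
            long.Parse(when.ToString("yyyyMMddHHmm", CultureInfo.InvariantCulture)) * 3;
        long orderPart = (10001234 - orderNumber * 256) * (when.Minute + 2);
        return $"{datePart}-{orderPart}";
    }
}
```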
You can write your own function that, when applied to the previous number, generates the next sequential random number: greater than the previous one, but random. The numbers that can be generated come from a finite set (for example, integers between 1 and 2^31), so the sequence may eventually repeat itself, though that is highly unlikely. To add more complexity to the generated numbers, you can append some alphanumeric characters at the end. You can read about this here: Sequential Random Numbers.
An example generator can be
private static string GetNextnumber(int currentNumber)
{
    Int32 nextnumber = currentNumber + (currentNumber % 3) + 5;
    Random _random = new Random();
    // you can skip the below 2 lines if you don't want alphanumeric
    int num = _random.Next(0, 26); // zero to 25
    char let = (char)('a' + num);
    return nextnumber + let.ToString();
}
and you can call it like this (strip any letter suffix before passing the value back in):
string nextnumber = GetNextnumber(yourPreviouslyGeneratedNumber);
Related
I'm working on a simple game and I have the requirement of taking a word or phrase such as "hello world" and converting it to a series of numbers.
The criteria is:
Numbers need to be distinct
Need ability to configure the maximum sequence of numbers, i.e. 10 numbers total.
Need ability to configure max range for each number in sequence.
Must be deterministic; that is, we should get the same sequence every time for the same input phrase.
I've tried breaking down the problem like so:
Convert characters to ASCII number code: "hello world" = 104 101 108 108 111 32 119 111 114 108 100
Remove every other number until we satisfy the total count (10 in this case)
For each number, if the number > max number, then divide by 2 until the number <= max number
If any numbers are duplicated, increase or decrease the first occurrence until satisfied. (This could cause a problem, as you could create a duplicate by solving another duplicate.)
Is there a better way of doing this, or am I on the right track? As stated above, I think I may run into issues with maintaining distinctness.
If you want to limit the size of the output series - then this is impossible.
Proof:
Assume your output is a series of size k, each element with at most M possible values for some predefined M; then there are at most M^k possible outputs.
However, there are infinitely many possible inputs, and in particular there are more than M^k different inputs.
By the pigeonhole principle (where the inputs are the pigeons and the outputs are the pigeonholes), some two pigeons (inputs) must share one pigeonhole (output), so the requirement cannot be achieved.
Original answer, provides workaround without limiting the size of the output series:
You can use prime numbers, let p1,p2,... be the series of prime numbers.
Then, convert the string into a series of numbers using number[i] = ascii(char[i]) * p_i
The range of each element is then obviously [0, 255 * p_i].
Since for each i, j such that i != j, p_i * x != p_j * y (for each x, y), you get uniqueness. However, this is mainly nice in theory, as the generated numbers grow quickly, and for a practical implementation you are going to need a big-number library such as Java's BigInteger (the C# equivalent is System.Numerics.BigInteger).
Another possible solution (with the same relaxation of no series limitation) is:
number[i] = ascii(char[i]) + 256*(i-1)
Here the range for number[i] is [256*(i-1), 256*i), and elements are still distinct.
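A sketch of this position-offset encoding (using 0-based indices, so the formula becomes ascii(char[i]) + 256*i):

```csharp
using System;
using System.Linq;

// Sketch of number[i] = ascii(char[i]) + 256*i: each position gets its own
// disjoint range [256*i, 256*(i+1)), so outputs are distinct across positions
// and the original string can be recovered exactly.
public static class PositionOffsetEncoder
{
    public static int[] Encode(string s) =>
        s.Select((c, i) => (int)c + 256 * i).ToArray();

    public static string Decode(int[] numbers) =>
        new string(numbers.Select((n, i) => (char)(n - 256 * i)).ToArray());
}
```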
Mathematically, it is theoretically possible to do what you want, but you won't be able to do it in C#:
If your outputs are required to be distinct, then you cannot lose any information after encoding the string using ASCII values. This means that if you limit your output size to n numbers then the numbers will have to include all information from the encoding.
So for your example
"hello world" -> 104 101 108 108 111 32 119 111 114 108 100
you would have to preserve the meaning of each of those numbers. The simplest way to do this would be to zero-pad the numbers to three digits and concatenate them into one large number, making your result 104101108108111032119111114108100 for max numbers = 1.
(You can see where the issue arises: for arbitrary-length input you need very large numbers.) So it is certainly possible to encode any arbitrary-length string input into n numbers, but the numbers will become exceedingly large.
If by "numbers" you meant digits, then no, you cannot have distinct outputs, as @amit explained in his example with the pigeonhole principle.
Let's eliminate your criteria as easily as possible.
For distinct and deterministic, just use a hash code. (A hash isn't actually guaranteed to be distinct, but collisions are very unlikely):
string s = "hello world";
uint hash = (uint)s.GetHashCode();
Note that I cast the signed int returned from GetHashCode to unsigned to avoid the chance of a '-' appearing. (A cast, not Convert.ToUInt32, which throws on negative input.)
Then, for your max range per number, just convert the base.
That leaves you with the maximum sequence criteria. Without understanding your requirements better, all I can propose is truncate if necessary:
hash.ToString().Substring(0, size)
Truncating leaves a chance that you'll no longer be distinct, but that must be built in as acceptable to your requirements. As amit explains in another answer, you can't map infinite input to finite output.
Ok, so in one comment you've said that this is just to pick lottery numbers. In that case, you could do something like this:
public static List<int> GenNumbers(String input, int count, int maxNum)
{
    List<int> ret = new List<int>();
    Random r = new Random(input.GetHashCode());
    for (int i = 0; i < count; ++i)
    {
        int next = r.Next(maxNum - i);
        foreach (int picked in ret.OrderBy(x => x))
        {
            if (picked <= next)
                ++next;
            else
                break;
        }
        ret.Add(next);
    }
    return ret;
}
The idea is to seed a random number generator with the hash code of the String. The rest of that is just picking numbers without replacement. I'm sure it could be written more efficiently - an alternative is to generate all maxNum numbers and shuffle the first count. Warning, untested.
I know newer versions of the .Net runtime use a random String hash code algorithm (so results will differ between runs), but I believe this is opt-in. Writing your own hash algorithm is an option.
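If a stable, hand-rolled hash is needed, FNV-1a is a common, simple choice; this sketch returns the same value on every run and runtime, so it is safe as a Random seed:

```csharp
using System;

// FNV-1a: a small, well-known, deterministic string hash. Unlike
// String.GetHashCode, its value never changes between runs or runtimes.
public static class StableHash
{
    public static int Fnv1a(string s)
    {
        unchecked
        {
            uint hash = 2166136261; // FNV offset basis
            foreach (char c in s)
            {
                hash ^= c;
                hash *= 16777619; // FNV prime
            }
            return (int)hash;
        }
    }
}
```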
This question already has answers here:
Mapping two integers to one, in a unique and deterministic way
(19 answers)
Closed 7 years ago.
I need to find a way such that the user inputs 2 numbers (int) and for every different pair a single output (int, preferably!) is returned.
Say the user enters 6, 8: it returns k. When the user enters anything else, like 6, 7 or 9, 8, or any other input m, n except 6, 8 (even if only one input is changed), a completely different output is produced. The thing is, it should be unique for only that m, n, so I can't use something like m*n: 6 x 4 = 24, but also 12 x 2 = 24, so the output is not unique. I need a way where for every different input pair there is a totally different output that is not repeated for any other pair.
EDIT: In response to Nicolas: the input range can be anything but will be less then 1000 (but more then 1 of course!)
EDIT 2: In response to Rawling, I can use long (Int64) but preferably not float or double, because this output will be used in a for loop, and float and double are terrible for for loops; you can check it here.
Since your two numbers are less than 1000, you can do k = (1000 * x1) + x2 to get a unique answer. The maximum value would be 999999, which is well within the range of a 32-bit int.
You can always return a long: from two integers a and b, return 2^|INT_SIZE|*a + b
It is easy to see from pigeonhole principle, that given two ints, one cannot return a unique int for every different input. Explanation: If you have 2 numbers, each containing n bits, then there are 2^n possibilities for each number, and thus there are (2^n)^2 possible pairs, so from piegeonhole principle - you need at least lg_2((2^n)^2) = 2n bits to represent them,
EDIT: Your edit mentions the range of your numbers is [1,1000] - thus the same idea can be applied: 1000*a + b will generate a unique int for each pairs.
Note that for the same reasons, the range of the resulting integer must be [1,1000000] - or you will get clashes.
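A sketch of this bounded pairing and its inverse, assuming inputs in [1, 1000]:

```csharp
using System;

// Bounded pairing as described above: unique for a, b in [1, 1000],
// and trivially reversible with division and modulus.
public static class BoundedPair
{
    public static int Pair(int a, int b)
    {
        if (a < 1 || a > 1000 || b < 1 || b > 1000)
            throw new ArgumentOutOfRangeException();
        return 1000 * a + b;
    }

    public static (int a, int b) Unpair(int k)
    {
        int b = k % 1000;
        if (b == 0) b = 1000; // b = 1000 wraps to 0 in the modulus
        return ((k - b) / 1000, b);
    }
}
```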
Because I don't have 50 reputation to comment, I must say: there are functions called pairing functions.
Pairing functions such as Cantor's pairing function (shown at the previous link) and Szudzik's pairing function allow the inputs to be arbitrarily large and still provide a unique and deterministic output.
Here is another similar question on Stack Overflow:
stackoverflow.com/questions/919612/mapping-two-integers-to-one-in-a-unique-and-deterministic-way
EDIT: I'm late.
If you didn't have a hard upper bound, you could do the following:
int Unique(int x, int y)
{
    int n = x + y;
    int t = (n % 2 == 0) ? ((n / 2) * (n + 1)) : (n * ((n + 1) / 2));
    return t + x;
}
Mathematically speaking, this will return a unique non-negative integer for each (non-negative) pair of integers, with no upper bound.
Programmatically speaking, it will run into overflow problems, which can be overcome by using long instead of int for everything except the input variables.
The canonical mathematical solution is to use prime powers. As every number can be decomposed uniquely into its prime factors, returning 2^n * 3^m will give you different results for every n and m.
This can be extended to 2^n * 3^m * 5^a * 7^b *11^c and so on; you only need to check that you do not run out of 32-bit integers. If there is a risk of overflowing, you can take the remainder after dividing by a prime larger than your input range, and you will still have uniqueness.
What's the best way to convert (to hash) a string like 3800290030, which represents an ID for a classification, into a four-character one like 3450 (I need to support at most 9999 classes)? We will only have fewer than 1000 classes in the 10-character space, and it will never grow to more than 10k.
The hash needs to be unique and always the same for the same input.
The resulting string should be numeric (but it will be saved as char(4) in SQL Server).
I removed the requirement for reversibility.
This is my solution, please comment:
string classTIC = "3254002092";
MD5 md5Hasher = MD5.Create();
byte[] classHash = md5Hasher.ComputeHash(Encoding.Default.GetBytes(classTIC));
StringBuilder sBuilder = new StringBuilder();
foreach (byte b in classHash)
{
    sBuilder.Append(b.ToString());
}
string newClass = (double.Parse(sBuilder.ToString()) % 9999 + 1).ToString();
You can do something like
str.GetHashCode() % 9999 + 1;
The hash can't be unique, since you have more than 9,999 strings.
It is not unique, so it cannot be reversible.
And of course my answer is wrong if you don't have more than 9999 different 10-character classes.
If you don't have more than 9999 classes, you need a mapping from the string ID to its 4-character representation; for example, save the strings in a list, and each string's key will be its index in the list.
When you want to reverse the process, and have no knowledge about the id's apart from that there are at most 9999 of them, I think you need to use a translation dictionary to map each id to its short version.
Even without the need to reverse the process, I don't think there is a way to guarantee unique IDs without such a dictionary.
This short version could then simply be incremented by one with each new id.
You do not want a hash. Hashing by design allows for collisions. There is no possible hashing function for the kind of strings you work with that won't have collisions.
You need to build a persistent mapping table to convert the string to a number. Logically similar to a Dictionary<string, int>. The first string you'll add gets number 0. When you need to map, look up the string and return its associate number. If it is not present then add the string and simply assign it a number equal to the count.
Making this mapping table persistent is what you'll need to think about. Trivially done with a database, of course.
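In memory, that mapping is just a dictionary; a sketch (persistence to the database is left out):

```csharp
using System;
using System.Collections.Generic;

// Sketch of the mapping table: the first string seen gets 0, the next gets 1,
// and so on. In practice the dictionary would be backed by a database table
// so the assignments survive restarts.
public class IdMapper
{
    private readonly Dictionary<string, int> _map = new Dictionary<string, int>();

    public int GetOrAdd(string classId)
    {
        if (!_map.TryGetValue(classId, out int n))
        {
            n = _map.Count; // next free number
            _map[classId] = n;
        }
        return n;
    }
}
```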
Unique is difficult: with only 4 characters, you have a maximum of 9999 values, so collisions will occur.
A hash is not reversible. Data is lost (obviously).
I think you might need to create and store a lookup table to be able to support your requirements. And in that case you don't even need a hash you could just increment the last used 4 digit lookup code.
use MD5 or SHA, like:
string = substring(md5("05910395410"), 0, 4)
or write your own simple method, for example:
sum = 0
foreach (char c in string)
{
    sum += (int)c;
}
sum %= 9999
Convert the number to base35/base36
ex: 3800290030 decimal = 22CGHK5 base-35 //length: 7
Or maybe convert to base 60 [ignoring capital O and small o so as not to confuse them with 0]
ex: 3800290030 decimal = 4tDw7A base-60 //length: 6
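A base-36 version of the same radix-conversion idea (digits 0-9 then A-Z), which stays unique and reversible:

```csharp
using System;
using System.Text;

// Sketch: shorten a decimal ID by re-expressing it in base 36. This is a pure
// change of radix, so it stays unique and reversible, but the output is
// alphanumeric rather than purely numeric.
public static class Base36
{
    private const string Digits = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";

    public static string Encode(long value)
    {
        if (value == 0) return "0";
        var sb = new StringBuilder();
        for (long v = value; v > 0; v /= 36)
            sb.Insert(0, Digits[(int)(v % 36)]);
        return sb.ToString();
    }

    public static long Decode(string s)
    {
        long result = 0;
        foreach (char c in s)
            result = result * 36 + Digits.IndexOf(c);
        return result;
    }
}
```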
Convert your int to binary and then base64-encode it. It won't be numbers then, but it will be a reversible hash.
Edit:
As far as I can tell, you are asking for the impossible.
You cannot take totally random data and somehow reduce the amount of data it takes to encode it (some values might come out shorter, others longer), so your requirement that the number be unique cannot be met; there has to be some data loss somewhere, and no matter how you do it, it won't ensure uniqueness.
Second, due to the above, it is also not possible to make it reversible. So that is out of the question.
Therefore, the only possible way I can see is if you have an enumerable data source, i.e. you know all the values prior to calculating the value. In that case you can simply assign them sequential IDs.
I have a large set of numbers, probably in the multiple gigabytes range. First issue is that I can't store all of these in memory. Second is that any attempt at addition of these will result in an overflow. I was thinking of using more of a rolling average, but it needs to be accurate. Any ideas?
These are all floating point numbers.
This is not read from a database; it is a CSV file collected from multiple sources. It has to be accurate, as it is stored as parts of a second (e.g. 0.293482888929), and a rolling average can be the difference between .2 and .3
It is a set of numbers representing how long users took to respond to certain form actions. For example, when showing a message box, how long did it take them to press OK or Cancel? The data was sent to me stored as seconds.portions-of-a-second (1.2347 seconds, for example). Converting it to milliseconds, I overflow int, long, etc. rather quickly. Even if I don't convert it, I still overflow rather quickly. I guess the one answer below is correct: maybe I don't have to be 100% accurate, just look within a certain range inside a specific standard deviation, and I would be close enough.
You can sample randomly from your set ("population") to get an average ("mean"). The accuracy will be determined by how much your samples vary (as determined by "standard deviation" or variance).
The advantage is that you have billions of observations, and you only have to sample a fraction of them to get a decent accuracy or the "confidence range" of your choice. If the conditions are right, this cuts down the amount of work you will be doing.
Here's a numerical library for C# that includes a random sequence generator. Just make a random sequence of numbers that reference indices in your array of elements (from 1 to x, the number of elements in your array). Dereference to get the values, and then calculate your mean and standard deviation.
If you want to test the distribution of your data, consider using the Chi-Squared Fit test or the K-S test, which you'll find in many spreadsheet and statistical packages (e.g., R). That will help confirm whether this approach is usable or not.
Integers or floats?
If they're integers, you need to accumulate a frequency distribution by reading the numbers and recording how many of each value you see. That can be averaged easily.
For floating point, this is a bit of a problem. Given the overall range of the floats, and the actual distribution, you have to work out a bin-size that preserves the accuracy you want without preserving all of the numbers.
Edit
First, you need to sample your data to get a mean and a standard deviation. A few thousand points should be good enough.
Then, you need to determine a respectable range. Folks pick things like ±6σ (standard deviations) around the mean. You'll divide this range into as many buckets as you can stand.
In effect, the number of buckets determines the number of significant digits in your average. So, pick 10,000 or 100,000 buckets to get 4 or 5 digits of precision. Since it's a measurement, odds are good that your measurements only have two or three digits.
Edit
What you'll discover is that the mean of your initial sample is very close to the mean of any other sample. And any sample mean is close to the population mean. You'll note that most (but not all) of your means are within 1 standard deviation of each other.
You should find that your measurement errors and inaccuracies are larger than your standard deviation.
This means that a sample mean is as useful as a population mean.
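The bucketing approach described above can be sketched like this (the ±6σ range and bucket count are the example values from the text; values outside the range are clamped to the edge buckets):

```csharp
using System;

// Sketch of the binning approach: bucket each value into a fixed grid around
// the sample mean (covering mean ± 6 standard deviations), keep only counts,
// and average the bucket centres. Precision is limited by the bucket width.
public sealed class BinnedMean
{
    private readonly double _lo, _width;
    private readonly long[] _counts;

    public BinnedMean(double mean, double stdDev, int buckets = 100000)
    {
        _lo = mean - 6 * stdDev;
        _width = 12 * stdDev / buckets;
        _counts = new long[buckets];
    }

    public void Add(double x)
    {
        int i = (int)((x - _lo) / _width);
        if (i < 0) i = 0;                              // clamp outliers
        else if (i >= _counts.Length) i = _counts.Length - 1;
        _counts[i]++;
    }

    public double Mean()
    {
        double sum = 0;
        long n = 0;
        for (int i = 0; i < _counts.Length; i++)
        {
            sum += _counts[i] * (_lo + (i + 0.5) * _width); // bucket centre
            n += _counts[i];
        }
        return sum / n;
    }
}
```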
Wouldn't a rolling average be as accurate as anything else (discounting rounding errors, I mean)? It might be kind of slow because of all the dividing.
You could group batches of numbers and average them recursively. Like average 100 numbers 100 times, then average the result. This would be less thrashing and mostly addition.
In fact, if you added 256 or 512 numbers at once, you could divide the sum by 256 or 512 very cheaply; for a double, dividing by a power of two only adjusts the floating-point exponent. This would make your program extremely quick, and it could be written recursively in just a few lines of code.
Perhaps dividing by 256 already uses this optimization? I may have to speed-test dividing by 255 vs 256 and see if there is some massive improvement. I'm guessing not.
You have a mean of 32-bit and 64-bit numbers to compute. But why not just use a proper rational bignum library? If you have that much data and you want an exact mean, then just code it.
// Note: Bignum and RationalBignum are hypothetical arbitrary-precision types;
// .NET's System.Numerics.BigInteger could serve for the numerator/denominator.
class RationalBignum
{
    public Bignum Numerator { get; set; }
    public Bignum Denominator { get; set; }
}

class BigMeanr
{
    public static void Main(string[] argv)
    {
        var sum = new RationalBignum(0);
        var n = new Bignum(0);
        using (var s = new FileStream(argv[0], FileMode.Open))
        using (var r = new BinaryReader(s))
        {
            try
            {
                while (true)
                {
                    var flt = r.ReadSingle();
                    var rat = new RationalBignum(flt);
                    sum += rat;
                    n++;
                }
            }
            catch (EndOfStreamException)
            {
                // reached the end of the file
            }
        }
        Console.WriteLine("The mean is: {0}", sum / n);
    }
}
Just remember, there are more numeric types out there than the ones your compiler offers you.
You could break the data into sets of, say, 1000 numbers, average these, and then average the averages.
This is a classic divide-and-conquer type problem.
The key fact is that the average of a large set of numbers is the same
as the average of the first half of the set, averaged with the average of the second half of the set (weighted by their sizes if the halves are unequal).
In other words:
AVG(A[1..N]) == AVG( AVG(A[1..N/2]), AVG(A[N/2..N]) )
Here is a simple, recursive C# solution.
It has passed my tests and should be completely correct.
public struct SubAverage
{
    public float Average;
    public int Count;
};

static SubAverage AverageMegaList(List<float> aList)
{
    if (aList.Count <= 500) // Brute-force average 500 numbers or fewer.
    {
        SubAverage avg;
        avg.Average = 0;
        avg.Count = aList.Count;
        foreach (float f in aList)
        {
            avg.Average += f;
        }
        avg.Average /= avg.Count;
        return avg;
    }
    // For more than 500 numbers, break the list into two sub-lists.
    SubAverage subAvg_A = AverageMegaList(aList.GetRange(0, aList.Count/2));
    SubAverage subAvg_B = AverageMegaList(aList.GetRange(aList.Count/2, aList.Count - aList.Count/2));

    // Weight each sub-average by the fraction of the total it represents.
    SubAverage finalAnswer;
    finalAnswer.Average = subAvg_A.Average * subAvg_A.Count / aList.Count +
                          subAvg_B.Average * subAvg_B.Count / aList.Count;
    finalAnswer.Count = aList.Count;

    Console.WriteLine("The average of {0} numbers is {1}",
        finalAnswer.Count, finalAnswer.Average);
    return finalAnswer;
}
The trick is that you're worried about an overflow. In that case, it all comes down to order of execution. The basic formula is like this:
Given:
A = current avg
C = count of items
V = next value in the sequence
The next average (A1) is:
(C * A) + V
A1 = ———————————
C + 1
The danger is that over the course of evaluating the sequence, while A should stay relatively manageable, C will become very large.
Eventually C * A will overflow the integer or double types.
One thing we can try is to re-write it like this, to reduce the chance of an overflow:
A1 = C/(C+1) * A/(C+1) + V/(C+1)
In this way, we never multiply C * A and only deal with smaller numbers. But the concern now is the result of the division operations. If C is very large, C/(C+1) (for example) may not be meaningful when constrained to normal floating point representations. The best I can suggest is to use the largest type possible for C here.
Here's one way to do it in pseudocode:
average=first
count=1
while more:
count+=1
diff=next-average
average+=diff/count
return average
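The pseudocode above translates directly to C#. A sketch (this version starts the accumulator at zero instead of seeding it with the first element, which is mathematically equivalent):

```csharp
using System;
using System.Collections.Generic;

static class RunningMean
{
    // Incremental mean: average += (next - average) / count.
    // Never forms a large running total, so the sum cannot overflow
    // no matter how many values are processed.
    public static double Mean(IEnumerable<double> numbers)
    {
        double average = 0;
        long count = 0;
        foreach (var next in numbers)
        {
            count++;
            average += (next - average) / count;
        }
        if (count == 0)
            throw new InvalidOperationException("empty sequence");
        return average;
    }
}
```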
Sorry for the late comment, but isn't the formula above, provided by Joel Coehoorn, rewritten wrongly?
I mean, the basic formula is right:
Given:
A = current avg
C = count of items
V = next value in the sequence
The next average (A1) is:
A1 = ( (C * A) + V ) / ( C + 1 )
But instead of:
A1 = C/(C+1) * A/(C+1) + V/(C+1)
shouldn't we have:
A1 = C/(C+1) * A + V/(C+1)
That would explain kastermester's post:
"My math ticks off here - You have C, which you say "go towards infinity" or at least, a really big number, then: C/(C+1) goes towards 1. A /(C+1) goes towards 0. V/(C+1) goes towards 0. All in all: A1 = 1 * 0 + 0 So put shortly A1 goes towards 0 - seems a bit off. – kastermester"
Because we would have A1 = 1 * A + 0, i.e., A1 goes towards A, which is right.
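A quick numeric check (values chosen arbitrarily) confirms the correction:

```csharp
using System;

class FormulaCheck
{
    static void Main()
    {
        double A = 10; // current average
        double C = 4;  // count of items
        double V = 20; // next value in the sequence

        double exact     = (C * A + V) / (C + 1);                   // ≈ 12
        double corrected = C / (C + 1) * A + V / (C + 1);           // ≈ 12, matches
        double wrong     = C / (C + 1) * A / (C + 1) + V / (C + 1); // ≈ 5.6, does not

        Console.WriteLine($"exact={exact} corrected={corrected} wrong={wrong}");
    }
}
```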
I've been using such method for calculating averages for a long time and the aforementioned precision problems have never been an issue for me.
With floating point numbers the problem is not overflow, but loss of precision when the accumulated value gets large. Adding a small number to a huge accumulated value will result in losing most of the bits of the small number.
There is a clever solution by the author of the IEEE floating point standard himself, the Kahan summation algorithm, which deals exactly with this kind of problems by checking the error at each step and keeping a running compensation term that prevents losing the small values.
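A sketch of Kahan summation applied to the mean, in C#:

```csharp
using System;
using System.Collections.Generic;

static class KahanMean
{
    // Kahan (compensated) summation: "c" carries the low-order bits that
    // would otherwise be lost when adding a small term to a large sum.
    public static double Mean(IReadOnlyCollection<double> numbers)
    {
        double sum = 0.0, c = 0.0;
        foreach (var x in numbers)
        {
            double y = x - c;   // correct the next term by the stored error
            double t = sum + y; // big + small: low-order bits of y may be lost...
            c = (t - sum) - y;  // ...but are recovered here
            sum = t;
        }
        return sum / numbers.Count;
    }
}
```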
If the numbers are int's, accumulate the total in a long. If the numbers are long's ... what language are you using? In Java you could accumulate the total in a BigInteger, which is an integer which will grow as large as it needs to be. You could always write your own class to reproduce this functionality. The gist of it is just to make an array of integers to hold each "big number". When you add two numbers, loop through starting with the low-order value. If the result of the addition sets the high order bit, clear this bit and carry the one to the next column.
Another option would be to find the average of, say, 1000 numbers at a time. Hold these intermediate results, then when you're done average them all together.
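C# has the same facility as Java's BigInteger, in System.Numerics; a sketch of the accumulate-then-divide approach for integer input:

```csharp
using System;
using System.Collections.Generic;
using System.Numerics;

static class ExactIntegerMean
{
    // Accumulate the total in a BigInteger so it can never overflow,
    // then divide once at the end.
    public static double Mean(IReadOnlyCollection<long> numbers)
    {
        BigInteger total = 0;
        foreach (var n in numbers)
            total += n;
        return (double)total / numbers.Count;
    }
}
```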
Why is a sum of floating point numbers overflowing? In order for that to happen, you would need to have values near the max float value, which sounds odd.
If you were dealing with integers I'd suggest using a BigInteger, or breaking the set into multiple subsets, recursively averaging the subsets, then averaging the averages.
If you're dealing with floats, it gets a bit weird. A rolling average could become very inaccurate. I suggest using a rolling average which is only updated when you hit an overflow exception or the end of the set. So effectively dividing the set into non-overflowing sets.
Two ideas from me:
If the numbers are ints, use an arbitrary precision library like IntX - this could be too slow, though
If the numbers are floats and you know the total amount, you can divide each entry by that number and add up the result. If you use double, the precision should be sufficient.
Why not just scale the numbers (down) before computing the average?
If I were to find the mean of billions of doubles as accurately as possible, I would take the following approach (NOT TESTED):
Find out 'M', an upper bound for log2(nb_of_input_data). If there are billions of data, 50 may be a good candidate (> 1 000 000 billions capacity). Create an L1 array of M double elements. If you're not sure about M, creating an extensible list will solve the issue, but it is slower.
Also create an associated L2 boolean array (all cells set to false by default).
For each incoming data D:
int i = 0;
double localMean = D;
while (L2[i]) {
L2[i] = false;
localMean = (localMean + L1[i]) / 2;
i++;
}
L1[i] = localMean;
L2[i] = true;
And your final mean will be:
double sum = 0;
double totalWeight = 0;
for (int i = 0; i < M; i++) {
    if (L2[i]) {
        long weight = 1L << i; // 1L, not 1: an int shift overflows past i = 30
        sum += L1[i] * weight;
        totalWeight += weight;
    }
}
return sum / totalWeight;
Notes:
Many proposed solutions in this thread miss the point of lost precision.
Using binary instead of 100-group-or-whatever provides better precision, and doubles can be safely doubled or halved without losing precision!
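Pulling the fragments above together into one self-contained class (M = 50 as suggested; like the original, treat this as a sketch rather than a tested implementation):

```csharp
using System;

class BinaryMergeMean
{
    const int M = 50;                     // supports up to 2^50 inputs
    readonly double[] L1 = new double[M]; // partial means, one per binary "level"
    readonly bool[] L2 = new bool[M];     // whether level i currently holds a value

    public void Add(double d)
    {
        int i = 0;
        double localMean = d;
        // Merge equal-weight partial means, like binary addition with carries.
        while (L2[i])
        {
            L2[i] = false;
            localMean = (localMean + L1[i]) / 2; // both operands represent 2^i values
            i++;
        }
        L1[i] = localMean;
        L2[i] = true;
    }

    public double Mean()
    {
        double sum = 0, totalWeight = 0;
        for (int i = 0; i < M; i++)
        {
            if (L2[i])
            {
                double weight = (double)(1L << i); // level i holds a mean of 2^i values
                sum += L1[i] * weight;
                totalWeight += weight;
            }
        }
        return sum / totalWeight;
    }
}
```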
Try this
Iterate through the numbers incrementing a counter, and adding each number to a total, until adding the next number would result in an overflow, or you run out of numbers.
(It makes no difference whether the inputs are integers or floats; use the largest-precision float you can and convert each input to that type.)
Divide the total by the counter to get a mean (a floating point), and add it to a temp array.
If you had run out of numbers and there is only one element in temp, that's your result.
Otherwise, start over using the temp array as input, i.e. recurse iteratively until you reach the end condition described earlier.
Depending on the range of the numbers, it might be a good idea to have an array where the subscript is your number and the value is the quantity of that number; you could then do your calculation from this.
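As an illustrative sketch of that counting approach (the range bound and names here are mine, and it assumes the values are non-negative integers in a known small range):

```csharp
using System;

static class HistogramMean
{
    // Works when values fall in a known range [0, maxValue]. counts[v]
    // records how many times v occurred, so the data set itself never
    // needs to be held in memory and the sum stays small per bucket.
    public static double Mean(int[] values, int maxValue)
    {
        var counts = new long[maxValue + 1];
        foreach (var v in values)
            counts[v]++;

        double sum = 0;
        long total = 0;
        for (int v = 0; v <= maxValue; v++)
        {
            sum += (double)v * counts[v];
            total += counts[v];
        }
        return sum / total;
    }
}
```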
I have looked all of the place for this and I can't seem to get a complete answer for this. So if the answer does already exist on stackoverflow then I apologize in advance.
I want a unique and random ID so that users in my website can't guess the next number and just hop to someone else's information. I plan to stick to a incrementing ID for the primary key but to also store a random and unique ID (sort of a hash) for that row in the DB and put an index on it.
From my searching I realize that I would like to avoid collisions and I have read some mentions of SHA1.
My basic requirements are
Something smaller than a GUID. (Looks horrible in URL)
Must be unique
Avoid collisions
Not a long list of strange characters that are unreadable.
An example of what I am looking for would be www.somesite.com/page.aspx?id=AF78FEB
I am not sure whether I should be implementing this in the database (I am using SQL Server 2005) or in the code (I am using C# ASP.Net)
EDIT:
From all the reading I have done I realize that this is security through obscurity. I do intend having proper authorization and authentication for access to the pages; I will use .Net's authentication and authorization framework. But once a legitimate user has logged in, he is accessing a legitimate (but dynamically created) page filled with links to items that belong to him. For example, a link might be www.site.com/page.aspx?item_id=123. What is stopping him from clicking on that link, then altering the URL to www.site.com/page.aspx?item_id=456, which does NOT belong to him? I know some Java technologies like Struts (I stand to be corrected) store everything in the session and somehow work it out from that, but I have no idea how this is done.
Raymond Chen has a good article on why you shouldn't use "half a guid", and offers a suitable solution to generating your own "not quite guid but good enough" type value here:
GUIDs are globally unique, but substrings of GUIDs aren't
His strategy (without a specific implementation) was based on:
Four bits to encode the computer number,
56 bits for the timestamp, and
four bits as a uniquifier.
We can reduce the number of bits to make the computer unique since the number of computers in the cluster is bounded, and we can reduce the number of bits in the timestamp by assuming that the program won’t be in service 200 years from now.
You can get away with a four-bit uniquifier by assuming that the clock won’t drift more than an hour out of skew (say) and that the clock won’t reset more than sixteen times per hour.
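A rough sketch of packing those fields into 64 bits (the field widths follow the article; how you obtain the computer number, timestamp, and uniquifier is up to you):

```csharp
using System;

static class ClusterId
{
    // Layout: [ 4-bit computer | 56-bit timestamp | 4-bit uniquifier ] = 64 bits.
    // This only shows the bit packing described in the article; sourcing the
    // three field values is left to the application.
    public static ulong Make(byte computer, ulong timestamp, byte uniquifier)
    {
        return ((ulong)(computer & 0xF) << 60)
             | ((timestamp & 0x00FF_FFFF_FFFF_FFFFUL) << 4)
             | (ulong)(uniquifier & 0xF);
    }
}
```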
UPDATE (4 Feb 2017):
Walter Stabosz discovered a bug in the original code. Upon investigation, further bugs were discovered; however, extensive testing and reworking of the code by myself, the original author (CraigTP), has now fixed all of these issues. I've updated the code here with the correct working version, and you can also download a Visual Studio 2015 solution here which contains the "shortcode" generation code and a fairly comprehensive test suite to prove correctness.
One interesting mechanism I've used in the past is to internally just use an incrementing integer/long, but to "map" that integer to an alphanumeric "code".
Example
Console.WriteLine($"1371 as a shortcode is: {ShortCodes.LongToShortCode(1371)}");
Console.WriteLine($"12345 as a shortcode is: {ShortCodes.LongToShortCode(12345)}");
Console.WriteLine($"7422822196733609484 as a shortcode is: {ShortCodes.LongToShortCode(7422822196733609484)}");
Console.WriteLine($"abc as a long is: {ShortCodes.ShortCodeToLong("abc")}");
Console.WriteLine($"ir6 as a long is: {ShortCodes.ShortCodeToLong("ir6")}");
Console.WriteLine($"atnhb4evqqcyx as a long is: {ShortCodes.ShortCodeToLong("atnhb4evqqcyx")}");
// PLh7lX5fsEKqLgMrI9zCIA
Console.WriteLine(GuidToShortGuid( Guid.Parse("957bb83c-5f7e-42b0-aa2e-032b23dcc220") ) );
Code
The following code shows a simple class that will change a long to a "code" (and back again!):
public static class ShortCodes
{
    // You may change the "shortcodeKeyspace" variable to contain as many or as few characters
    // as you please. The more characters that are included in the keyspace, the shorter
    // the codes you can produce for a given long.
    private static string shortcodeKeyspace = "abcdefghijklmnopqrstuvwxyz0123456789";

    public static string LongToShortCode(long number)
    {
        // Guard clause. If passed 0 as input,
        // we always return the empty string.
        if (number == 0)
        {
            return string.Empty;
        }
        var keyspaceLength = shortcodeKeyspace.Length;
        var shortcodeResult = "";
        var numberToEncode = number;
        do
        {
            var characterValue = numberToEncode % keyspaceLength == 0 ? keyspaceLength : numberToEncode % keyspaceLength;
            var indexer = (int)characterValue - 1;
            shortcodeResult = shortcodeKeyspace[indexer] + shortcodeResult;
            numberToEncode = (numberToEncode - characterValue) / keyspaceLength;
        }
        while (numberToEncode != 0);
        return shortcodeResult;
    }

    public static long ShortCodeToLong(string shortcode)
    {
        var keyspaceLength = shortcodeKeyspace.Length;
        long shortcodeResult = 0;
        var shortcodeLength = shortcode.Length;
        foreach (var codeChar in shortcode)
        {
            shortcodeLength--;
            var codeCharIndex = shortcodeKeyspace.IndexOf(codeChar);
            if (codeCharIndex < 0)
            {
                // The character is not part of the keyspace, so the entire shortcode is invalid.
                return 0;
            }
            try
            {
                checked
                {
                    shortcodeResult += (codeCharIndex + 1) * (long)Math.Pow(keyspaceLength, shortcodeLength);
                }
            }
            catch (OverflowException)
            {
                // We've overflowed the maximum size for a long (possibly the shortcode is invalid or too long).
                return 0;
            }
        }
        return shortcodeResult;
    }
}
This is essentially your own base-X numbering system (where X is the number of unique characters in the shortcodeKeyspace constant).
To make things unpredictable, start your internal incrementing numbering at something other than 1 or 0 (e.g. start at 184723), and also change the order of the characters in the shortcodeKeyspace constant (i.e. use the letters A-Z and the numbers 0-9, but scramble their order within the constant string). This will help make each code somewhat unpredictable.
If you're using this to "protect" anything, this is still security by obscurity, and if a given user can observe enough of these generated codes, they can predict the relevant code for a given long. The "security" (if you can call it that) of this is that the shortcodeKeyspace constant is scrambled, and remains secret.
EDIT:
If you just want to generate a GUID, and transform it to something that is still unique, but contains a few less characters, this little function will do the trick:
public static string GuidToShortGuid(Guid gooid)
{
string encoded = Convert.ToBase64String(gooid.ToByteArray());
encoded = encoded.Replace("/", "_").Replace("+", "-");
return encoded.Substring(0, 22);
}
If you don't want other users to see people's information, why don't you secure the page on which you are using the id?
If you do that then it won't matter if you use an incrementing Id.
[In response to the edit]
You should consider query strings as "evil input". You need to programmatically check that the authenticated user is allowed to view the requested item.
if (!item456.BelongsTo(user123))
{
    // Either show them one of their own items, or show an error message.
}
You could randomly generate a number. Check that this number is not already in the DB and use it. If you want it to appear as a random string you could just convert it to hexadecimal, so you get A-F in there just like in your example.
A GUID is 128 bits. If you take these bits and represent them not with a character set of just 16 characters (16 = 2^4, so 128/4 = 32 characters) but with a character set of, let's say, 64 characters (like Base64), you would end up with only 22 characters (64 = 2^6 and 128/6 = 21.333, so 22 characters).
Take your auto-increment ID, and HMAC-SHA1 it with a secret known only to you. This will generate a random-looking 160-bits that hide the real incremental ID. Then, take a prefix of a length that makes collisions sufficiently unlikely for your application---say 64-bits, which you can encode in 8 characters. Use this as your string.
HMAC will guarantee that no one can map from the bits shown back to the underlying number. By hashing an auto-increment ID, you can be pretty sure that it will be unique, so your risk of collisions comes from the likelihood of a 64-bit partial collision in SHA1. With this method, you can determine ahead of time whether you will have any collisions, by pre-generating all the random strings this method will generate (e.g. up to the number of rows you expect) and checking.
Of course, if you are willing to specify a unique condition on your database column, then simply generating a totally random number will work just as well. You just have to be careful about the source of randomness.
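A sketch of the HMAC approach in C# (the key bytes below are placeholders; in practice the key must be secret, random, and stable across the application's lifetime):

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

static class ObscuredId
{
    // HMAC-SHA1 the auto-increment ID with a secret key, then keep the
    // first 8 bytes (64 bits) as a 16-character hex string.
    public static string FromId(long id, byte[] secretKey)
    {
        using (var hmac = new HMACSHA1(secretKey))
        {
            byte[] hash = hmac.ComputeHash(BitConverter.GetBytes(id));
            var sb = new StringBuilder(16);
            for (int i = 0; i < 8; i++)
                sb.Append(hash[i].ToString("x2"));
            return sb.ToString();
        }
    }
}
```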
How long is too long? You could convert the GUID to Base 64, which ends up making it quite a bit shorter.
What you could do is something I do when I want exactly what you are wanting.
Create your GUID.
Remove the dashes, and take a substring of the length you want your ID to be.
Check the DB for that ID; if it exists, go to step 1.
Insert the record.
This is the simplest way to insure it is obscured and unique.
I have just had an idea and I see Greg also pointed it out. I have the user stored in the session with a user ID. When I create my query I will join on the Users table with that User ID, if the result set is empty then we know he was hacking the URL and I can redirect to an error page.
A GUID is just a number
The latest generation of GUIDs (version 4) is basically a big random number (with a small caveat covered in the details at the end of this answer).
Because it's a big random number the chances of a collision are REALLY small.
The biggest number you can make with a GUID is over:
5,000,000,000,000,000,000,000,000,000,000,000,000
So if you generate two GUIDs the chance the second GUID is the same as the first is:
1 in 5,000,000,000,000,000,000,000,000,000,000,000,000
If you generate 100 billion GUIDs, the chance that your 100-billionth GUID collides with the other 99,999,999,999 GUIDs is:
1 in 50,000,000,000,000,000,000,000,000
Why 128 bits?
One reason is that computers like working with multiples of 8 bits.
8, 16, 32, 64, 128, etc
The other reason is that the guy who came up with the GUID felt 64 wasn't enough, and 256 was way too much.
Do you need 128 bits?
No, how many bits you need depends on how many numbers you expect to generate and how sure you want to be that they don't collide.
64 bit example
Suppose you generated 64-bit random numbers instead. Then the chance that your second number would collide with the first would be:
1 in 18,000,000,000,000,000,000 (64 bit)
Instead of:
1 in 5,000,000,000,000,000,000,000,000,000,000,000,000 (128 bit)
What about the 100 billionth number?
The chance your 100 billionth number collides with the other 99,999,999,999 would be:
1 in 180,000,000 (64 bit)
Instead of:
1 in 50,000,000,000,000,000,000,000,000 (128 bit)
So should you use 64 bits?
That depends: are you generating 100 billion numbers? Even if you were, does a 1 in 180,000,000 chance make you uncomfortable?
A little more details about GUIDs
I'm specifically talking about version 4.
Version 4 doesn't actually use all 128 bits for the random number portion; it uses 122 bits. The other 6 bits are used to indicate that it is version 4 of the GUID standard.
The numbers in this answer are based on 122 bits.
And yes, since it's just a random number, you can take however many bits you want from it. (Just make sure you don't take any of the 6 versioning bits that never change - see above.)
Instead of taking bits from the GUID, though, you could use the same random number generator the GUID got its bits from - probably the random number generator that comes with the operating system.
Late to the party but I found this to be the most reliable way to generate Base62 random strings in C#.
private static Random random = new Random();

void Main()
{
    var s = RandomString(7);
    Console.WriteLine(s);
}

public static string RandomString(int length)
{
    const string chars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    return new string(Enumerable.Repeat(chars, length)
        .Select(s => s[random.Next(s.Length)]).ToArray());
}