How to calculate the entropy of a file? (Or let's just say a bunch of bytes)
I have an idea, but I'm not sure that it's mathematically correct.
My idea is the following:
Create an array of 256 integers (all zeros).
Traverse through the file and for each of its bytes,
increment the corresponding position in the array.
At the end: Calculate the "average" value for the array.
Initialize a counter with zero,
and for each of the array's entries:
add the difference between the entry and the "average" to the counter.
Well, now I'm stuck. How do I "project" the counter result in such a way
that all results lie between 0.0 and 1.0? But I'm sure
the idea is inconsistent anyway...
I hope someone has a better and simpler solution.
Note: I need the whole thing to make assumptions on the file's contents:
(plaintext, markup, compressed or some binary, ...)
At the end: Calculate the "average" value for the array.
Initialize a counter with zero,
and for each of the array's entries:
add the difference between the entry and the "average" to the counter.
With some modifications you can get Shannon's entropy:
rename "average" to "entropy"
(float) entropy = 0
for i in the array Counts[0..255] do
    (float) p = Counts[i] / filesize
    if (p > 0) entropy = entropy - p * lg(p)    // lg is the logarithm with base 2
Edit:
As Wesley mentioned, we must divide entropy by 8 in order to adjust it to the range 0..1 (or alternatively, we can use log base 256).
A simpler solution: gzip the file. Use the ratio of file sizes (size-of-gzipped / size-of-original) as a measure of randomness (i.e. entropy).
This method doesn't give you the exact absolute value of entropy (because gzip is not an "ideal" compressor), but it's good enough if you need to compare entropy of different sources.
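For illustration, here is a rough C# sketch of that idea; it is only a sketch under stated assumptions (GZipStream as the compressor, the whole file read into memory), and note that gzip's header overhead means very small files can report a ratio above 1.

using System;
using System.IO;
using System.IO.Compression;

class GzipEntropyEstimate
{
    // Rough "randomness" estimate: compressed size divided by original size.
    // Values near 1 suggest high-entropy (random or already compressed) data,
    // values near 0 suggest highly redundant data.
    static double EstimateEntropyRatio(string path)
    {
        byte[] original = File.ReadAllBytes(path);
        if (original.Length == 0)
            return 0.0;

        using (var compressedStream = new MemoryStream())
        {
            using (var gzip = new GZipStream(compressedStream, CompressionMode.Compress, leaveOpen: true))
            {
                gzip.Write(original, 0, original.Length);
            }
            return (double)compressedStream.Length / original.Length;
        }
    }

    static void Main(string[] args)
    {
        Console.WriteLine(EstimateEntropyRatio(args[0]));
    }
}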
To calculate the information entropy of a collection of bytes, you'll need to do something similar to tydok's answer. (tydok's answer works on a collection of bits.)
The following variables are assumed to already exist:
byte_counts is a 256-element list of the number of bytes with each value in your file. For example, byte_counts[2] is the number of bytes that have the value 2.
total is the total number of bytes in your file.
I'll write the following code in Python, but it should be obvious what's going on.
import math

entropy = 0
for count in byte_counts:
    # If no bytes of this value were seen in the file, it doesn't affect
    # the entropy of the file.
    if count == 0:
        continue
    # p is the probability of seeing this byte in the file, as a
    # floating-point number
    p = 1.0 * count / total
    entropy -= p * math.log(p, 256)
There are several things that are important to note.
The check for count == 0 is not just an optimization. If count == 0, then p == 0, and log(p) will be undefined ("negative infinity"), causing an error.
The 256 in the call to math.log represents the number of discrete values that are possible. A byte composed of eight bits will have 256 possible values.
The resulting value will range from 0 (every single byte in the file is the same) to 1 (the bytes are evenly divided among every possible value of a byte).
An explanation for the use of log base 256
It is true that this algorithm is usually applied using log base 2. This gives the resulting answer in bits. In such a case, you have a maximum of 8 bits of entropy for any given file. Try it yourself: maximize the entropy of the input by making byte_counts a list that is all 1s, or all 2s, or all 100s. When the bytes of a file are evenly distributed, you'll find that there is an entropy of 8 bits.
It is possible to use other logarithm bases. Using b=2 allows a result in bits, as each bit can have 2 values. Using b=10 puts the result in dits, or decimal bits, as there are 10 possible values for each dit. Using b=256 will give the result in bytes, as each byte can have one of 256 discrete values.
Interestingly, using log identities, you can work out how to convert the resulting entropy between units. Any result obtained in units of bits can be converted to units of bytes by dividing by 8. As an interesting, intentional side-effect, this gives the entropy as a value between 0 and 1.
In summary:
You can use various units to express entropy
Most people express entropy in bits (b=2)
For a collection of bytes, this gives a maximum entropy of 8 bits
Since the asker wants a result between 0 and 1, divide this result by 8 for a meaningful value
The algorithm above calculates entropy in bytes (b=256)
This is equivalent to (entropy in bits) / 8
This already gives a value between 0 and 1
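As a hedged illustration of that summary (the uniform histogram below is a made-up example, not data from the question), computing the entropy with log base 2 and dividing by 8 gives the same number as computing it directly with log base 256:

using System;

class EntropyUnits
{
    static void Main()
    {
        // Hypothetical histogram: every byte value occurs equally often,
        // which gives the maximum entropy (8 bits/byte, or 1 in base-256 units).
        int[] byteCounts = new int[256];
        for (int i = 0; i < 256; i++) byteCounts[i] = 100;

        long total = 0;
        foreach (int c in byteCounts) total += c;

        double bits = 0.0, base256 = 0.0;
        foreach (int count in byteCounts)
        {
            if (count == 0) continue;
            double p = (double)count / total;
            bits    -= p * Math.Log(p, 2);
            base256 -= p * Math.Log(p, 256);
        }

        Console.WriteLine(bits);      // 8 for the uniform histogram
        Console.WriteLine(bits / 8);  // 1, same as the base-256 result below
        Console.WriteLine(base256);   // 1
    }
}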
I'm two years late in answering, so please consider this despite only a few up-votes.
Short answer: use my 1st and 3rd bold equations below to get what most people are thinking about when they say "entropy" of a file in bits. Use just the 1st equation if you want Shannon's H entropy, which is actually entropy/symbol, as he stated 13 times in his paper (something most people are not aware of). Some online entropy calculators use this one, but Shannon's H is "specific entropy", not "total entropy", which has caused a lot of confusion. Use the 1st and 2nd equations if you want the answer between 0 and 1, which is normalized entropy/symbol (it's not bits/symbol, but a true statistical measure of the "entropic nature" of the data, letting the data choose its own log base instead of arbitrarily assigning 2, e, or 10).
There are 4 types of entropy for files (data) of N symbols long with n unique types of symbols. But keep in mind that by knowing the contents of a file, you know the state it is in and therefore S = 0. To be precise, if you have a source that generates a lot of data that you have access to, then you can calculate the expected future entropy/character of that source. If you use the following on a file, it is more accurate to say it is estimating the expected entropy of other files from that source.
Shannon (specific) entropy H = -1*sum(count_i / N * log(count_i / N))
where count_i is the number of times symbol i occurred in N.
Units are bits/symbol if log is base 2, nats/symbol if natural log.
Normalized specific entropy: H / log(n)
Units are entropy/symbol. Ranges from 0 to 1. A value of 1 means each symbol occurred equally often; near 0 means all symbols except one occurred only once, and the rest of a very long file was that other symbol. The log is in the same base as the H.
Absolute entropy S = N * H
Units are bits if the log is base 2, nats if the natural log (ln) is used.
Normalized absolute entropy S = N * H / log(n)
Unit is "entropy", varies from 0 to N. The log is in the same base as the H.
Although the last one is the truest "entropy", the first one (Shannon entropy H) is what all books call "entropy" without the (IMHO needed) qualification. Most do not clarify (as Shannon did) that it is bits/symbol or entropy per symbol. Calling H "entropy" is speaking too loosely.
For files where each symbol occurs with equal frequency: S = N * H = N. This is the case for most large files of bits. The entropy calculation does not do any compression of the data and is thereby completely ignorant of any patterns, so 000000111111 has the same H and S as 010111101000 (six 1's and six 0's in both cases).
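For instance, here is a small hedged C# check of that last point (the helper name H is mine): both strings contain six 1's and six 0's, so they yield the same H regardless of ordering.

using System;
using System.Linq;

class OrderBlindEntropy
{
    // Shannon H in bits/symbol over the characters of a string.
    static double H(string s)
    {
        return s.GroupBy(c => c)
                .Select(g => (double)g.Count() / s.Length)
                .Sum(p => -p * Math.Log(p, 2));
    }

    static void Main()
    {
        Console.WriteLine(H("000000111111"));  // 1.0
        Console.WriteLine(H("010111101000"));  // 1.0 as well; H ignores ordering
    }
}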
Like others have said, using a standard compression routine like gzip and comparing the file size before and after (i.e. taking the ratio) will give a better measure of the amount of pre-existing "order" in the file, but that is biased against data that fits the compression scheme better. There's no general-purpose, perfectly optimized compressor that we can use to define an absolute "order".
Another thing to consider: H changes if you change how you express the data. H will be different if you select different groupings of bits (bits, nibbles, bytes, or hex). So you divide by log(n) where n is the number of unique symbols in the data (2 for binary, 256 for bytes) and H will range from 0 to 1 (this is normalized intensive Shannon entropy in units of entropy per symbol). But technically if only 100 of the 256 types of bytes occur, then n=100, not 256.
H is an "intensive" entropy, i.e. it is per symbol, which is analogous to specific entropy in physics (entropy per kg or per mole). The regular "extensive" entropy of a file, analogous to physics' S, is S = N * H, where N is the number of symbols in the file. H would be exactly analogous to a portion of an ideal gas volume. Information entropy can't simply be made exactly equal to physical entropy in a deeper sense, because physical entropy allows for "ordered" as well as disordered arrangements: physical entropy comes out larger than a completely random entropy (such as a compressed file). For an ideal gas there is an additional 5/2 factor to account for this: S = k * N * (H + 5/2) where H = possible quantum states per molecule = (xp)^3/hbar * 2 * sigma^2, where x = width of the box, p = total non-directional momentum in the system (calculated from kinetic energy and mass per molecule), and sigma = 0.341, in keeping with the uncertainty principle giving only the number of possible states within 1 standard deviation.
A little math gives a shorter form of normalized extensive entropy for a file:
S=N * H / log(n) = sum(count_i*log(N/count_i))/log(n)
Units of this are "entropy" (which is not really a unit). It is normalized to be a better universal measure than the "entropy" units of N * H. But it also should not be called "entropy" without clarification, because the normal historical convention is to erroneously call H "entropy" (which is contrary to the clarifications made in Shannon's text).
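Putting the equations above together, a hedged C# sketch (reading the file from a command-line argument, and all variable names, are my own assumptions) might look like this:

using System;
using System.IO;
using System.Linq;

class FourEntropies
{
    static void Main(string[] args)
    {
        byte[] data = File.ReadAllBytes(args[0]);
        long N = data.Length;
        if (N == 0) { Console.WriteLine("empty file"); return; }

        // count_i for each byte value that actually occurs
        long[] counts = data.GroupBy(b => b).Select(g => (long)g.Count()).ToArray();
        int n = counts.Length;  // number of unique symbols actually present

        // 1) Shannon (specific) entropy H, in bits/symbol (log base 2)
        double H = counts.Sum(c => -(double)c / N * Math.Log((double)c / N, 2));

        // 2) Normalized specific entropy H / log(n), ranges from 0 to 1
        double Hnorm = n > 1 ? H / Math.Log(n, 2) : 0.0;

        // 3) Absolute entropy S = N * H, in bits
        double S = N * H;

        // 4) Normalized absolute (extensive) entropy N * H / log(n), ranges from 0 to N
        double Snorm = N * Hnorm;

        Console.WriteLine($"H = {H} bits/symbol, H/log(n) = {Hnorm}, S = {S} bits, N*H/log(n) = {Snorm}");
    }
}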
For what it's worth, here's the traditional (bits of entropy) calculation represented in C#:
/// <summary>
/// returns bits of entropy represented in a given string, per
/// http://en.wikipedia.org/wiki/Entropy_(information_theory)
/// </summary>
public static double ShannonEntropy(string s)
{
    var map = new Dictionary<char, int>();
    foreach (char c in s)
    {
        if (!map.ContainsKey(c))
            map.Add(c, 1);
        else
            map[c] += 1;
    }

    double result = 0.0;
    int len = s.Length;
    foreach (var item in map)
    {
        var frequency = (double)item.Value / len;
        result -= frequency * (Math.Log(frequency) / Math.Log(2));
    }

    return result;
}
Is this something that ent could handle? (Or perhaps it's not available on your platform.)
$ dd if=/dev/urandom of=file bs=1024 count=10
$ ent file
Entropy = 7.983185 bits per byte.
...
As a counter example, here is a file with no entropy.
$ dd if=/dev/zero of=file bs=1024 count=10
$ ent file
Entropy = 0.000000 bits per byte.
...
There's no such thing as the entropy of a file. In information theory, the entropy is a function of a random variable, not of a fixed data set (well, technically a fixed data set does have an entropy, but that entropy would be 0 — we can regard the data as a random distribution that has only one possible outcome with probability 1).
In order to calculate the entropy, you need a random variable with which to model your file. The entropy will then be the entropy of the distribution of that random variable. This entropy will equal the number of bits of information contained in that random variable.
If you use information theory entropy, mind that it might make sense not to use it on bytes. Say, if your data consists of floats you should instead fit a probability distribution to those floats and calculate the entropy of that distribution.
Or, if the contents of the file is unicode characters, you should use those, etc.
Calculates the entropy of any string of unsigned chars of size "length". This is basically a refactoring of the code found at http://rosettacode.org/wiki/Entropy. I use this for a 64-bit IV generator that creates a container of 100000000 IVs with no dupes and an average entropy of 3.9. http://www.quantifiedtechnologies.com/Programming.html
#include <string>
#include <map>
#include <algorithm>
#include <cmath>

typedef unsigned char uint8;

double Calculate(uint8 * input, int length)
{
    std::map<char, int> frequencies;
    for (int i = 0; i < length; ++i)
        frequencies[input[i]]++;

    double infocontent = 0;
    for (const auto & p : frequencies)
    {
        double freq = static_cast<double>(p.second) / length;
        infocontent += freq * log2(freq);
    }
    infocontent *= -1;
    return infocontent;
}
Re: I need the whole thing to make assumptions on the file's contents:
(plaintext, markup, compressed or some binary, ...)
As others have pointed out (or been confused/distracted by), I think you're actually talking about metric entropy (entropy divided by length of message). See more at Entropy (information theory) - Wikipedia.
jitter's comment linking to Scanning data for entropy anomalies is very relevant to your underlying goal. That links eventually to libdisorder (a C library for measuring byte entropy). That approach would seem to give you lots more information to work with, since it shows how the metric entropy varies in different parts of the file. See e.g. this graph of how the entropy of a block of 256 consecutive bytes from a 4 MB jpg image (y axis) changes for different offsets (x axis). At the beginning and end the entropy is lower, as it is part-way in, but it is about 7 bits per byte for most of the file.
Source: https://github.com/cyphunk/entropy_examples. [Note that this and other graphs are available via the novel http://nonwhiteheterosexualmalelicense.org license....]
More interesting is the analysis and similar graphs at Analysing the byte entropy of a FAT formatted disk | GL.IB.LY
Statistics like the max, min, mode, and standard deviation of the metric entropy for the whole file and/or the first and last blocks of it might be very helpful as a signature.
This book also seems relevant: Detection and Recognition of File Masquerading for E-mail and Data Security - Springer
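If you want to experiment with that per-block idea without libdisorder, here is a rough C# sketch; the 256-byte block size mirrors the graph described above, and everything else (names, output format) is my own assumption:

using System;
using System.IO;
using System.Linq;

class BlockEntropy
{
    // Shannon entropy in bits/byte of a single block.
    static double Entropy(byte[] block)
    {
        return block.GroupBy(b => b)
                    .Select(g => (double)g.Count() / block.Length)
                    .Sum(p => -p * Math.Log(p, 2));
    }

    static void Main(string[] args)
    {
        byte[] data = File.ReadAllBytes(args[0]);
        const int blockSize = 256;  // assumed block size, as in the graph described above

        // Print "offset <tab> metric entropy" for each full block; plotting this
        // gives a curve like the one described for the jpg example.
        for (int offset = 0; offset + blockSize <= data.Length; offset += blockSize)
        {
            byte[] block = data.Skip(offset).Take(blockSize).ToArray();
            Console.WriteLine($"{offset}\t{Entropy(block):F3}");
        }
    }
}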
Here's a Java algo based on this snippet and the invasion that took place during the infinity war
public static double shannon_entropy(File file) throws IOException {
    byte[] bytes = Files.readAllBytes(file.toPath());  // byte sequence
    int max_byte = 255;                                 // max byte value
    int no_bytes = bytes.length;                        // file length
    int[] freq = new int[256];                          // byte frequencies
    for (int j = 0; j < no_bytes; j++) {
        int value = bytes[j] & 0xFF;                    // integer value of byte
        freq[value]++;
    }
    double entropy = 0.0;
    for (int i = 0; i <= max_byte; i++) {
        double p = 1.0 * freq[i] / no_bytes;
        if (freq[i] > 0)
            entropy -= p * Math.log(p) / Math.log(2);
    }
    return entropy;
}
usage-example:
File file = new File("C:\\Users\\Somewhere\\In\\The\\Omniverse\\Thanos Invasion.Log");
int file_length = (int) file.length();
double shannon_entropy = shannon_entropy(file);
System.out.println("file length: " + file_length + " bytes");
System.out.println("shannon entropy: " + shannon_entropy + " bits per byte, i.e. a minimum of " + shannon_entropy + " bits is needed on average to encode each byte transfer" +
        "\nfrom the file, so in total we transfer at least " + (file_length * shannon_entropy) + " bits (" + ((file_length * shannon_entropy) / 8D) +
        " bytes instead of " + file_length + " bytes).");
output-example:
file length: 5412 bytes
shannon entropy: 4.537883805240875 bits per byte, i.e. a minimum of 4.537883805240875 bits is needed on average to encode each byte transfer
from the file, so in total we transfer at least 24559.027153963616 bits (3069.878394245452 bytes instead of 5412 bytes).
Without any additional information, the entropy of a file is (by definition) equal to its size * 8 bits. The entropy of a text file is roughly size * 6.6 bits, given that:
each character is equally probable
there are 95 printable ASCII characters
log(95)/log(2) = 6.6
The entropy of an English text file is estimated to be around 0.6 to 1.3 bits per character (as explained here).
In general you cannot talk about entropy of a given file. Entropy is a property of a set of files.
If you need an entropy (or entropy per byte, to be exact) the best way is to compress it using gzip, bz2, rar or any other strong compression, and then divide compressed size by uncompressed size. It would be a great estimate of entropy.
Calculating entropy byte by byte as Nick Dandoulakis suggested gives a very poor estimate, because it assumes every byte is independent. In text files, for example, it is much more probable to have a lowercase letter after a letter than a whitespace or punctuation character after a letter, since words are typically longer than 2 characters. So the probability of the next character being in the a-z range is correlated with the value of the previous character. Don't use Nick's rough estimate for any real data; use the gzip compression ratio instead.
Related
In the "Small Collision Probabilities" section of https://preshing.com/20110504/hash-collision-probabilities/ you can see the trade-offs depending on how big the hash value is. Whether a 32-bit or 160-bit hash is appropriate depends on your use case and limitations.
How can I construct a checksum/hash of n-bits given the integer n?
The idea is that the caller of the hash function can determine the trade-off that fits their need (only want to store integers and don't care about collisions: use 32 bits, etc.).
The probability of collisions should be on-par with typical hash functions of the same bit-length (Given n = 160, the probability of collisions is close to sha-1. Given n = 32 the probability of collisions is close to crc-32)
My attempt:
if n < 160, sha-1(input) truncate first bits to bit-size n.
if n = 160, sha-1(input)
if n > 160, sha-1(input) & sha-1(input) [repeat the concatenation (n % 160) times and then truncate the bits to n]
I realize that for n > 160 my collision probability isn't getting any lower because its performing the same hash. Also sha-1 isn't needed because my purpose is not cryptographic in nature and more of a checksum
For hashes of n*160 bits you could pick n prefixes (different but of the same length, to avoid collisions) and hash the input data appended to a prefix to give you n different 160-bit hashes. If you can create a collision while doing this you should be able to track this back to a collision in the underlying hash function, because different input data values never give rise to the same inputs for the underlying hash function.
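A hedged C# sketch of that prefix construction (SHA-1 is used only because the question mentions it; the fixed-width decimal prefixes and helper names are my own choices, not a standard scheme):

using System;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

class ExtendedHash
{
    // Produces k * 160 bits by hashing the input under k distinct, equal-length prefixes.
    static byte[] Hash(byte[] input, int k)
    {
        using (var sha1 = SHA1.Create())
        {
            var output = new byte[k * 20];  // 20 bytes = 160 bits per block
            for (int i = 0; i < k; i++)
            {
                // Fixed-length prefixes "0000", "0001", ... are different but equally long,
                // so two different prefixes can never produce the same prefixed input.
                byte[] prefix = Encoding.ASCII.GetBytes(i.ToString("D4"));
                byte[] block = sha1.ComputeHash(prefix.Concat(input).ToArray());
                Buffer.BlockCopy(block, 0, output, i * 20, 20);
            }
            return output;
        }
    }

    static void Main()
    {
        byte[] h = Hash(Encoding.UTF8.GetBytes("hello world"), 2);  // 320-bit result
        Console.WriteLine(BitConverter.ToString(h));
    }
}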
The paper "Computational complexity of universal hashing", e.g. at https://core.ac.uk/download/pdf/82448366.pdf, shows that binary convolution of bit-vectors is a (non-secure) universal hash function. To hash a bit-vector of n bits and produce a bit-vector of m bits as output, you convolve with a bit-vector of n+m-1 bits and xor with a bit-vector of m bits. (Claim 2.2, p. 5, marked 125.)
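As a hedged reading of that construction (keeping the m fully-overlapping convolution coefficients is my interpretation of the paper; check Claim 2.2 for the exact indexing), a GF(2) convolution hash might look like this:

using System;
using System.Linq;

class ConvolutionHash
{
    // Hash an n-bit message to m bits: convolve over GF(2) with an (n+m-1)-bit key,
    // keep the m coefficients where message and key fully overlap, then XOR an m-bit pad.
    static bool[] Hash(bool[] message, bool[] key, bool[] pad)
    {
        int n = message.Length;
        int m = pad.Length;
        if (key.Length != n + m - 1)
            throw new ArgumentException("key must have n + m - 1 bits");

        var output = new bool[m];
        for (int j = 0; j < m; j++)
        {
            // Convolution coefficient at position n - 1 + j: every message bit participates.
            bool bit = false;
            for (int i = 0; i < n; i++)
                bit ^= message[i] & key[n - 1 + j - i];
            output[j] = bit ^ pad[j];
        }
        return output;
    }

    static void Main()
    {
        var rnd = new Random(42);
        bool[] msg = new bool[16], key = new bool[16 + 8 - 1], pad = new bool[8];
        for (int i = 0; i < msg.Length; i++) msg[i] = rnd.Next(2) == 1;
        for (int i = 0; i < key.Length; i++) key[i] = rnd.Next(2) == 1;
        for (int i = 0; i < pad.Length; i++) pad[i] = rnd.Next(2) == 1;

        Console.WriteLine(string.Join("", Hash(msg, key, pad).Select(b => b ? 1 : 0)));
    }
}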
Currently, I'm developing some fuzzy logic stuff in C# and want to achieve this in a generic way. For simplicity, I can use float, double and decimal to process an interval [0, 1], but for performance, it would be better to use integers. Some thoughts about symmetry also led to the decision to omit the highest value in unsigned and the lowest value in signed integers. The lowest, non-omitted value maps to 0 and the highest, non-omitted value maps to 1. The omitted value is normalized to the next non-omitted value.
Now, I want to implement some compound calculations of the form:
byte f(byte p1, byte p2, byte p3, byte p4)
{
    return (p1 * p2) / (p3 * p4);
}
where the byte values are interpreted as the [0, 1] interval mentioned above. This means p1 * p2 < p1 and p1 * p2 < p2 as opposed to numbers greater than 1, where this is not valid, e. g. 2 * 3 = 6, but 0.1 * 0.2 = 0.02.
Additionally, a problem is: p1 * p2 and p3 * p4 may exceed the range of the type byte. The result of the whole formula may not exceed this range, but the overflow would still occur in one or both parts. Of course, I can just cast to ushort and in the end back to byte, but for an ulong I wouldn't have this possibility without further effort and I don't want to stick to 32 bits. On the other hand, if I return (p1 / p3) * (p2 / p4), I decrease the type escalation, but might run into a result of 0, where the actual result is non-zero.
So I thought of somehow simultaneously "shrinking" both products step by step until I have the result in the [0, 1] interpretation. I don't need an exact value, a heuristic with an error less than 3 integer values off the correct value would be sufficient, and for an ulong an even higher error would certainly be OK.
So far, I have tried to convert the input to a decimal/float/double in the interval [0, 1] and calculated with that. But this is completely counterproductive regarding performance. I read up on division algorithms, but I couldn't find the one I saw once in class. It was about calculating the quotient and remainder simultaneously, with an accumulator. I tried to reconstruct and extend it for factorized parts of the division with corrections, but this breaks where indivisibility occurs and I get too big an error. I also made some notes and calculated some integer examples manually, trying to factor out, cancel out, split sums and such fancy derivation stuff, but nothing led to a satisfying result or steps for an algorithm.
Is there a
performant way
to multiply/divide signed (and unsigned) integers as above
interpreted as interval [0, 1]
without type promotion
?
To answer your question as summarised: No.
You need to state (and rank) your overall goals explicitly (e.g., is symmetry more or less important than performance?). Your chances of getting a helpful answer improve with succinctly stating them in the question.
While I think Phil1970's "you can ignore scaling for … division" is overly optimistic, multiplication is enough of a problem: if you don't generate partial results bigger (twice as big) than your "base type", you are stuck with multiplying parts of your operands and piecing the result together.
For ideas about piecing together "larger" results: AVR's Fractional Multiply.
Regarding …in signed integers. The lowest, non-omitted value maps to 0…, I expect that you will find, e.g., excess -32767/32768-coded fractions even harder to handle than two's complement ones.
If you are not careful, you will lose more time doing conversions than it would have taken with regular operations.
That being said, an alternative that might make some sense would be to map values between 0 and 128 inclusive (or 0 and 32768 if you want more precision), so that all values are essentially stored multiplied by 128.
So if you have (0.5 * 0.75) / (0.125 * 0.25), the stored values for each of those numbers would be 64, 96, 16 and 32 respectively. If you do the computation using ushort you would have (64 * 96) / (16 * 32) = 6144 / 512 = 12. Because the extra scale factors in the numerator and denominator cancel, this 12 is already the real-valued result (here it falls outside the [0, 1] range, so it would have to be clamped; to store it back in the 128-scaled form you would multiply by 128).
By the way, you can ignore scaling for addition, subtraction and division. For multiplication, you would do the multiplication as usual and then divide by 128. So for 0.5 * 0.75 you would have 64 * 96 / 128 = 48, which corresponds to 48 / 128 = 0.375 as expected.
The code can be optimized for the platform particularly if the platform is more efficient with narrow numbers. And if necessary, rounding could be added to operation.
By the way, since the scaling is a power of 2, you can use bit shifting for it. You might prefer to use 256 instead of 128, particularly if you don't have single-cycle bit shifting, but then you need a larger width to handle some operations.
But you might be able to do some optimization if, for example, the most significant bit is not set, so that you would only use the larger width when necessary.
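As a hedged illustration of the 128-scaling approach described above (the clamping behaviour and helper names are my own additions, and this sketch simply uses int intermediates rather than trying to avoid type promotion entirely):

using System;

class Fixed128
{
    const int Scale = 128;  // stored value = real value * 128, so 0..128 maps to 0.0..1.0

    // Multiplication: (a*128)*(b*128) carries an extra factor of 128, so divide it out.
    static int Mul(int a, int b) => a * b / Scale;

    // Division: the scale cancels between numerator and denominator, so re-apply it,
    // and clamp because the true result may exceed 1.0.
    static int Div(int a, int b) => b == 0 ? Scale : Math.Min(Scale, a * Scale / b);

    static void Main()
    {
        int p1 = 64, p2 = 96, p3 = 16, p4 = 32;  // 0.5, 0.75, 0.125, 0.25

        // 0.5 * 0.75 = 0.375
        Console.WriteLine(Mul(p1, p2) / (double)Scale);  // 0.375

        // (0.5 * 0.75) / (0.125 * 0.25) = 12, which clamps to 1.0 in a [0, 1] representation.
        int result = Div(Mul(p1, p2), Mul(p3, p4));
        Console.WriteLine(result / (double)Scale);  // 1.0 after clamping
    }
}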
I have a single-precision float value, and no information about the distribution of the samples from which this value was generated, so I can't apply a sigmoid or perform some kind of normalization. Also, I know the value will always be non-negative. What is the best way to represent this float as a byte?
I've thought of the following:
Interpret the float as a UInt32 (I expect this to maintain relative ordering between numbers, please correct me if I'm wrong) and then scale it to the range of a byte.
UInt32 uVal = BitConverter.ToUInt32(BitConverter.GetBytes(fVal), 0);
byte bVal = Convert.ToByte(uVal * Byte.MaxValue / UInt32.MaxValue);
I'd appreciate your comments and any other suggestions. Thanks!
You have to assume a distribution. You have no choice. Somehow you have to partition the float values and assign them to byte values.
If the distribution is assumed to be linear arithmetic then the space is roughly 0 to 3.4e38. Each increment in the byte value would have a weight of about +1.3e36.
If the distribution is assumed to be linear geometric then the space is roughly 2.3e83. Each increment in the byte value would have a weight of about x2.1.
You can derive these values by simple arithmetic. The first is maxfloat/256. The second is the 256th root of (maxfloat/minfloat).
Your proposal to use and scale the raw bit pattern will produce a lumpy distribution in which numbers with different exponents are grouped together while numbers with the same exponent and different mantissa are separated. I would not recommend it for most purposes.
--
A really simple way that might suit some purposes is to simply use the 8-bit exponent (mask 0x7F800000 on the 32-bit pattern, or 0x7F80 on the upper 16 bits), ignoring the sign bit and mantissa. The exponent values 0x00 and 0xFF would have to be handled specially. See http://en.wikipedia.org/wiki/Single-precision_floating-point_format.
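A hedged C# sketch of that exponent-byte idea (the helper name is mine, and the special cases 0x00 and 0xFF are simply passed through rather than handled):

using System;

class FloatToByte
{
    // Map a non-negative float to a byte using only its 8-bit IEEE 754 exponent field.
    static byte ExponentByte(float value)
    {
        uint bits = BitConverter.ToUInt32(BitConverter.GetBytes(value), 0);
        byte exponent = (byte)((bits >> 23) & 0xFF);  // bits 23..30 of the 32-bit pattern

        // 0x00 (zero/denormals) and 0xFF (infinity/NaN) would need special handling,
        // as noted above; here they are returned as-is.
        return exponent;
    }

    static void Main()
    {
        Console.WriteLine(ExponentByte(1.0f));     // 127 (unbiased exponent 0)
        Console.WriteLine(ExponentByte(2.0f));     // 128
        Console.WriteLine(ExponentByte(0.5f));     // 126
        Console.WriteLine(ExponentByte(3.4e38f));  // 254, near the top of the range
    }
}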
I'm working on a simple game and I have the requirement of taking a word or phrase such as "hello world" and converting it to a series of numbers.
The criteria is:
Numbers need to be distinct
Need ability to configure maximum sequence of numbers. IE 10 numbers total.
Need ability to configure max range for each number in sequence.
Must be deterministic, that is, we should get the same sequence every time for the same input phrase.
I've tried breaking down the problem like so:
Convert characters to ASCII number code: "hello world" = 104 101 108 108 111 32 119 111 114 108 100
Remove every other number until we satisfy the total count of numbers (10 in this case)
Foreach number if number > max number then divide by 2 until number <= max number
If any numbers are duplicated, increase or decrease the first occurrence until satisfied. (This could cause a problem, as you could create a duplicate by resolving another duplicate.)
Is there a better way of doing this, or am I on the right track? As stated above, I think I may run into issues with keeping the numbers distinct.
If you want to limit the size of the output series - then this is impossible.
Proof:
Assume your output is a series of size k, each element with a range of at most M values for some predefined M; then there are at most M^k possible outputs.
However, there are infinitely many possible inputs; in particular, you can pick M^k + 1 different inputs.
By the pigeonhole principle (where the inputs are the pigeons and the outputs are the pigeonholes), there must be 2 pigeons (inputs) in one pigeonhole (output), so the requirement cannot be achieved.
Original answer, provides workaround without limiting the size of the output series:
You can use prime numbers, let p1,p2,... be the series of prime numbers.
Then, convert the string into series of numbers using number[i] = ascii(char[i]) * p_i
The range of each character is obviously then [0,255 * p_i]
Since for each i,j such that i != j -> p_i * x != p_j * y (for each x,y) - you get uniqueness. However, this is mainly nice theoretically as the generated numbers might grow quickly, and for practical implementation you are going to need some big number library such as java's BigInteger (cannot recall the C# equivalent)
Another possible solution (with the same relaxation of no series limitation) is:
number[i] = ascii(char[i]) + 256*(i-1)
In here the range for number[i] is [256*(i-1),256*i), and elements are still distinct.
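A quick hedged C# illustration of that second scheme (0-based indexing is used here, so the formula becomes ascii(char[i]) + 256*i; the class and method names are mine):

using System;
using System.Linq;

class DistinctEncoding
{
    // number[i] = ascii(char[i]) + 256 * i, so each position gets its own
    // non-overlapping range [256*i, 256*i + 255] and all outputs are distinct.
    static long[] Encode(string s)
    {
        return s.Select((c, i) => (long)c + 256L * i).ToArray();
    }

    static void Main()
    {
        Console.WriteLine(string.Join(" ", Encode("hello world")));
        // 104 357 620 876 1135 1312 1655 1903 2162 2412 2660
    }
}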
Mathematically, it is theoretically possible to do what you want, but you won't be able to do it in C#:
If your outputs are required to be distinct, then you cannot lose any information after encoding the string using ASCII values. This means that if you limit your output size to n numbers then the numbers will have to include all information from the encoding.
So for your example
"Hello World" -> 104 101 108 108 111 32 119 111 114 108 100
you would have to preserve the meaning of each of those numbers. The simplest way to do this would just 0 pad your numbers to three digits and concatenate them together into one large number...making your result 104101108111032119111114108100 for max numbers = 1.
(You can see where the issue becomes, for arbitrary length input you need very large numbers.) So certainly it is possible to encode any arbitrary length string input to n numbers, but the numbers will become exceedingly large.
If by "numbers" you meant digits, then no you cannot have distinct outputs, as #amit explained in his example with the pidgeonhole principle.
Let's eliminate your criteria as easily as possible.
For distinct, deterministic, just use a hash code. (Hash actually isn't guaranteed to be distinct, but is highly likely to be):
string s = "hello world";
uint hash = (uint)s.GetHashCode();
Note that I converted the signed int returned from GetHashCode to unsigned, to avoid the chance of having a '-' appear.
Then, for your max range per number, just convert the base.
That leaves you with the maximum sequence criteria. Without understanding your requirements better, all I can propose is truncate if necessary:
hash.ToString().Substring(0, size)
Truncating leaves a chance that you'll no longer be distinct, but that must be built in as acceptable to your requirements? As amit explains in another answer, you can't have infinite input and non-infinite output.
Ok, so in one comment you've said that this is just to pick lottery numbers. In that case, you could do something like this:
public static List<int> GenNumbers(String input, int count, int maxNum)
{
    List<int> ret = new List<int>();
    Random r = new Random(input.GetHashCode());
    for (int i = 0; i < count; ++i)
    {
        int next = r.Next(maxNum - i);
        foreach (int picked in ret.OrderBy(x => x))
        {
            if (picked <= next)
                ++next;
            else
                break;
        }
        ret.Add(next);
    }
    return ret;
}
The idea is to seed a random number generator with the hash code of the String. The rest of that is just picking numbers without replacement. I'm sure it could be written more efficiently - an alternative is to generate all maxNum numbers and shuffle the first count. Warning, untested.
I know newer versions of the .Net runtime use a random String hash code algorithm (so results will differ between runs), but I believe this is opt-in. Writing your own hash algorithm is an option.
I am working on a piece of software that analyzes E01 bitstream images. Basically these are forensic data files that allow a user to compress all the data on a disk into a single file. The E01 format embeds data about the original data, including MD5 hash of the source and resulting data, etc. If you are interested in some light reading, the EWF/E01 specification is here. Onto my problem:
The e01 file contains a "table" section which is a series of 32 bit numbers that are offsets to other locations within the e01 file where the actual data chunks are located. I have successfully parsed this data out into a list doing the following:
this.ChunkLocations = new List<int>();
// hack: Will this overflow? We are adding two integers to a long?
long currentReadLocation = TableSectionDescriptorRef.OffsetFromFileStart + c_SECTION_DESCRIPTOR_LENGTH + c_TABLE_HEADER_LENGTH;

byte[] currReadBytes;
using (var fs = new FileStream(E01File.FullName, FileMode.Open))
{
    fs.Seek(currentReadLocation, 0);
    for (int i = 0; i < NumberOfEntries; i++)
    {
        currReadBytes = new byte[c_CHUNK_DATA_OFFSET_LENGTH];
        fs.Read(currReadBytes, 0, c_CHUNK_DATA_OFFSET_LENGTH);
        this.ChunkLocations.Add((int)BitConverter.ToUInt32(currReadBytes, 0));
    }
}
The c_CHUNK_DATA_OFFSET_LENGTH is 4 bytes/ "32 bit" number.
According to the ewf/e01 specification, "The most significant bit in the chunk data offset indicates if the chunk is compressed (1) or uncompressed (0)". This appears to be evidenced by the fact that, if I convert the offsets to ints, there are large negative numbers in the results (for compressed chunks, no doubt), but most of the other offsets appear to be correctly incremented, though every once in a while there is crazy data. The data in ChunkLocations looks something like this:
346256
379028
-2147071848
444556
477328
510100
Where with -2147071848 it appears the MSB was flipped to indicate compression/lack of compression.
QUESTIONS: So, if the MSB is used to flag the presence of compression, then really I'm dealing with a 31-bit number, right?
1. How do I ignore the MSB / compute a 31-bit number when figuring the offset value?
2. This seems to be a strange standard, since it would seem to significantly limit the size of the offsets you could have, so I'm questioning whether I'm missing something. These offsets do seem correct when I navigate to these locations within the e01 file.
Thanks for any help!
This sort of thing is typical when dealing with binary formats. As dtb pointed out, 31 bits is probably plenty large for this application, because it can address offsets up to 2 GiB. So they use that extra bit as a flag to save space.
You can just mask off the bit with a bitwise AND:
const UInt32 COMPRESSED = 0x80000000; // Only bit 31 on
UInt32 raw_value = 0x80004000; // test value
bool compressed = (raw_value & COMPRESSED) > 0;
UInt32 offset = raw_value & ~COMPRESSED;
Console.WriteLine("Compressed={0} Offset=0x{1:X}", compressed, offset);
Output:
Compressed=True Offset=0x4000
If you just want to strip off the leading bit, perform a bitwise and (&) of the value with 0x7FFFFFFF