I recall having read somewhere that it is better (in terms of performance) to use Int32, even if you only require Byte. It applies (supposedly) only to cases where you do not care about the storage. Is this valid?
For example, I need a variable that will hold a day of the week. Do I write
int dayOfWeek;
or
byte dayOfWeek;
EDIT:
Guys, I am aware of DayOfWeek enum. The question is about something else.
Usually yes, a 32-bit integer will perform slightly better because it is already properly aligned for native CPU instructions. You should only use a smaller-sized numeric type when you actually need to store something of that size.
You should use the DayOfWeek enum, unless there's a strong reason not to.
DayOfWeek day = DayOfWeek.Friday;
To explain, since I was downvoted:
The correctness of your code is almost always more critical than the performance, especially in cases where we're talking this small of a difference. If using an enum or a class representing the semantics of the data (whether it's the DayOfWeek enum, or another enum, or a Gallons or Feet class) makes your code clearer or more maintainable, it will help you get to the point where you can safely optimize.
int z;
int x = 3;
int y = 4;
z = x + y;
That may compile. But there's no way to know if it's doing anything sane or not.
Gallons z;
Gallons x = new Gallons(3);
Feet y = new Feet(4);
z = x + y;
This won't compile, and even looking at it it's obvious why not - adding Gallons to Feet makes no sense.
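A minimal sketch of what such unit types might look like (illustrative only, not the answer's actual code; the Gallons/Feet classes and their members are made up for this example):

public class Gallons
{
    public double Value { get; private set; }
    public Gallons(double value) { Value = value; }
    // + is defined only for Gallons + Gallons, so mixing units cannot compile
    public static Gallons operator +(Gallons a, Gallons b) { return new Gallons(a.Value + b.Value); }
}

public class Feet
{
    public double Value { get; private set; }
    public Feet(double value) { Value = value; }
    public static Feet operator +(Feet a, Feet b) { return new Feet(a.Value + b.Value); }
}

// var z = new Gallons(3) + new Feet(4);   // compile error: no operator + (Gallons, Feet)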
My default position is to try to use strong types to add constraints to values - where you know those in advance. Thus in your example, it may be preferable to use byte dayOfWeek because it is closer to your desired value range.
Here is my reasoning, with the example of storing and passing the year part of a date. The year part, when considering other parts of the system that include SQL Server datetimes, is constrained to 1753 through 9999 (note that C#'s possible range for DateTime is different!). Thus a short covers my possible values, and if I try to pass anything larger the compiler will complain before the code compiles. Unfortunately, in this particular example, the C# DateTime.Year property returns an int, forcing me to cast the result if I need to pass e.g. DateTime.Now.Year into my function.
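A small sketch of that constraint in code (StoreYear is a made-up name for illustration):

static void StoreYear(short year)
{
    // persist the value somewhere
}

StoreYear((short)DateTime.Now.Year);   // explicit cast required, because Year returns int
// StoreYear(DateTime.Now.Year);       // compile error: cannot implicitly convert int to short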
This starting-position is driven by considerations of long-term storage of data, assuming 'millions of rows' and disk space - even though it is cheap (it is far less cheap when you are hosted and running a SAN or similar).
In another DB example, I will use smaller types such as byte (SQL Server tinyint) for lookup IDs where I am confident that there will not be many lookup types, through to long (SQL Server bigint) for IDs where there are likely to be more records, i.e. to cover transactional records.
So my rules of thumb:
Go for correctness first if possible. Use DayOfWeek in your example, of course :)
Go for a type of appropriate size thus making use of compiler safety checks giving you errors at the earliest possible time;
...but offset against extreme performance needs and simplicity, especially where long-term storage is not involved, or where we are considering a lookup (low row count) table rather than a transactional (high row count) one.
In the interests of clarity, DB storage tends not to shrink as quickly as you expect by shrinking column types from bigint to smaller types. This is both because of padding to word boundaries and page-size issues internal to the DB. However, you probably store every data item several times in your DB, perhaps through storing historic records as they change, and also keeping the last few days of backups and log backups. So saving a few percent of your storage needs will have long term savings in storage cost.
I have never personally experienced a case where the in-memory performance of bytes vs. ints has been an issue, but I have wasted hours and hours having to reallocate disk space and have had live servers entirely stall because there was no one available to monitor and manage such things.
Use an int. Computer memory is addressed by "words," which are usually 4 bytes long. What this means is that if you want to get one byte of data from memory, the CPU has to retrieve the entire 4-byte word from RAM and then perform some extra steps to isolate the single byte that you're interested in. When thinking about performance, it will be a lot easier for the CPU to retrieve a whole word and be done with it.
Actually in all reality, you won't notice any difference between the two as far as performance is concerned (except in rare, extreme circumstances). That's why I like to use int instead of byte, because you can store bigger numbers with pretty much no penalty.
In terms of storage, use byte; in terms of CPU performance, use int.
System.DayOfWeek (see MSDN).
Most of the time use int. Not for performance but simplicity.
Related
Is the reason the same as in Why is Array.Length an int, and not an uint? I am asking because I will need to do some additional casting/validation in my code, which will unnecessarily reduce readability, and I think there should not be any issue with just casting Seconds to uint like below:
uint modulo = (uint)DateTime.Now.Second % triggerModuloSeconds;
Using int as the default data type tends to make programming easier, since it is large enough for most common use cases and limits the risk of someone making a mistake with unsigned arithmetic. It will often have a much greater range than actually needed, but that is fine: memory is cheap, and CPUs may be optimized for accessing 32-bit chunks of data.
If you wanted the tightest-fitting data type, a byte would be most appropriate, but what would you gain from using that instead of an int? There might be a point if you have millions of values, but it would be rare to store something like seconds that way.
As mentioned in the comments, unsigned types are not CLS compliant, so they limit compatibility with other languages that do not support unsigned types - which is why a framework type like DateTime exposes int rather than uint.
You should also prefer to use specific types over primitives. For example using TimeSpan to represent, well, a span of time.
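For example (a rough sketch; triggerInterval is just an illustrative name), the modulo from the question could stay in TimeSpan terms instead of bare integers:

TimeSpan triggerInterval = TimeSpan.FromSeconds(15);       // illustrative interval
TimeSpan sinceMidnight = DateTime.Now.TimeOfDay;

// the remainder stays a TimeSpan rather than becoming a bare (u)int of seconds
TimeSpan remainder = TimeSpan.FromTicks(sinceMidnight.Ticks % triggerInterval.Ticks);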
I'm working on a genetic algorithm project where I encode my chromosome in a binary string where each 32 bits represents a value. The problem is that the function I'm optimizing has over 3000 parameters, which implies that I have over 96000 bits in my bit string, and the manipulations I do on it are simply too slow...
I have proceeded as follows: I have a binary class where I'm creating a boolean array that represents my big binary string. Then I manipulate this binary string with various shifts and moves and such.
My question is, is there a better way to do this? The speed is just killing me. I'm sure it would be fine if I only did this on one bit string, but I have to do the manipulations on 25 bit strings for way over 1000 generations.
What I would do is take a step back. Your analysis seems to be wedded to an implementation detail, namely that you have chosen bool[] as how you represent a string of bits.
Clear your mind of bools and arrays and make a complete list of the operations you actually need to perform, how frequently they happen, and how fast they have to be. Ideally consider whether your speed requirement is average speed or worst case speed. (There are many data structures that attain high average speed by having one expensive operation for every thousand cheap operations; if having any expensive operations is unacceptable then you need to know that up front.)
Once you have that list, you can then do research on what data structures work well.
For example, suppose your list of operations is:
construct bit sequences on the order of 32 bits
concatenate on the order of 3000 bit sequences together to form new bit sequences
insert new bit sequences into existing long bit sequences at specific locations, quickly
Given just that list of operations, I'd think that the data structure you want is a catenable deque. Catenable deques support fast insertion on either end, and can be broken up into two deques efficiently. Inserting stuff into the middle of a deque is then easily done: break the deque up, insert the item into the end of one half, and join them back up again.
However, if you then add another operation to the problem, say, "search for a particular bit string anywhere in the 90000-bit sequence, and find the result in sublinear time" then just a catenable deque isn't going to do it. Searching a deque is slow. There are other data structures that support that operation.
If I understood correctly you are encoding the bit array in a bool[]. The first obvious optimisation would be to change this to int[] (or even long[]) and take advantage of bit operations on a whole machine word, where possible.
For example, this would make shifts more efficient by roughly a factor of 4.
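As a rough sketch of the idea (assuming the bit string is stored least-significant word first in a ulong[]; this is not code from the answer), a shift by one bit can be done a whole word at a time:

static void ShiftLeftOneBit(ulong[] words)
{
    ulong carry = 0;
    for (int i = 0; i < words.Length; i++)
    {
        ulong nextCarry = words[i] >> 63;        // bit that spills over into the next word
        words[i] = (words[i] << 1) | carry;
        carry = nextCarry;
    }
    // any final carry simply falls off the end of the fixed-length bit string
}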
Is the BitArray class no help?
A BitArray would probably be faster than a boolean array but you would still not get built-in support to shift 96000 bits.
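For the operations BitArray does support, a quick sketch of its usage (the chromosome/mask names and sizes are illustrative):

// using System.Collections;
var chromosome = new BitArray(96000);      // all bits start out false
chromosome.Set(12345, true);               // set/read individual bits
bool bit = chromosome.Get(12345);

var mask = new BitArray(96000);
chromosome.Xor(mask);                      // And/Or/Xor/Not over the whole array are built in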
GPUs are extremely good at massive bit operations. Maybe Brahma, CUDA.NET, or Accelerator could be of service?
Let me understand; currently, you're using a sequence of 32-bit values for a "chromosome". Are we talking about DNA chromosomes or neuroevolutionary algorithmic chromosomes?
If it's DNA, you deal with four values: A, C, G, T. They can be coded in 2 bits, making a byte able to hold 4 values. Your 3000-element chromosome sequence can be stored in a 750-element byte array; that's nothing, really.
Your two most expensive operations are converting to and from the compressed bit stream. I would recommend an enum with byte as its underlying type:
public enum DnaMarker : byte { A, C, G, T };
Then, you go from 4 of these to a byte with one operation:
public static byte ToByteCode(this DnaMarker[] markers)
{
    byte output = 0;
    for (int i = 0; i < 4; i++)
        output = (byte)((output << 2) | (byte)markers[i]);   // pack 2 bits per marker; markers[0] ends up in the high bits
    return output;
}
... and parse them back out with something like this:
public static DnaMarker[] ToMarkers(this byte input)
{
    var result = new DnaMarker[4];
    for (int i = 0; i < 4; i++)
        result[i] = (DnaMarker)((input >> (2 * (3 - i))) & 0x3);   // undo the packing, high bits first
    return result;
}
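Assuming the two extension methods above live in a static class, a quick round-trip check might look like this:

var markers = new[] { DnaMarker.G, DnaMarker.A, DnaMarker.T, DnaMarker.C };
byte packed = markers.ToByteCode();         // four markers -> one byte
DnaMarker[] unpacked = packed.ToMarkers();  // one byte -> the same four markers, in order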
You might see a slight performance increase using four parameters (out parameters if necessary) instead of allocating and using an array on the heap, but you lose the loop that keeps the code compact.
Now, because you're packing four markers into each byte, if your sequence length isn't always an exact multiple of four you'll end up "padding" the end of your last byte with zero values (A). Working around this is messy, but if you have a 32-bit integer that tells you the exact number of markers, you can simply discard anything more you find in the byte stream.
From here, possibilities are endless; you can convert the enum array to a string by simply calling ToString() on each one, and likewise you can feed in a string and get an enum array by iterating through using Enum.Parse().
And always remember, unless memory is at a premium (usually it isn't), it's almost always faster to deal with the data in an easily-usable format instead of the most compact format. The one big exception is in network transmission; if you had to send 750 bytes vs 12KB over the Internet, there's an obvious advantage in the smaller size.
I'm working on an application that needs to pass around large sets of Int32 values. The sets are expected to contain ~1,000,000-50,000,000 items, where each item is a database key in the range 0-50,000,000. I expect distribution of ids in any given set to be effectively random over this range. The operations I need on the set are dirt simple:
Add a new value
Iterate over all of the values.
There is a serious concern about the memory usage of these sets, so I'm looking for a data structure that can store the ids more efficiently than a simple List<int> or HashSet<int>. I've looked at BitArray, but that can be wasteful depending on how sparse the ids are. I've also considered a bitwise trie, but I'm unsure how to calculate the space efficiency of that solution for the expected data. A Bloom filter would be great, if only I could tolerate the false positives.
I would appreciate any suggestions of data structures suitable for this purpose. I'm interested in both out-of-the-box and custom solutions.
EDIT: To answer your questions:
No, the items don't need to be sorted
By "pass around" I mean both pass between methods and serialize and send over the wire. I clearly should have mentioned this.
There could be a decent number of these sets in memory at once (~100).
Use the BitArray. It uses only some 6MB of memory; the only real problem is that iteration is Theta(N), i.e. you have to walk the entire range. Locality of reference is good though and you can allocate the entire structure in one operation.
As for wasting space: you waste 6MB in the worst case.
EDIT: OK, you've got lots of sets and you're serializing. For serializing to disk, I suggest 6MB files :)
For sending over the wire, just iterate and consider sending ranges instead of individual elements. That does require a sorting structure.
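A sketch of the BitArray approach (assuming ids in the range 0..49,999,999; the IdSet class and its member names are illustrative, not a real library type):

// using System.Collections; using System.Collections.Generic;
class IdSet
{
    const int MaxId = 50000000;                      // ids are in the range 0..49,999,999
    readonly BitArray bits = new BitArray(MaxId);    // ~6 MB no matter how many ids are stored

    public void Add(int id) { bits.Set(id, true); }

    public IEnumerable<int> Values()
    {
        for (int i = 0; i < MaxId; i++)              // Theta(range) walk, as noted above
            if (bits.Get(i))
                yield return i;
    }
}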
You need lots of these sets. Consider if you have 600MB to spare. Otherwise, check out:
Bytewise tries: O(1) insert, O(n) iteration, much lower constant factors than bitwise tries
A custom hash table, perhaps Google sparsehash through C++/CLI
BSTs storing ranges/intervals
Supernode BSTs
It would depend on the distribution of the sizes of your sets. Unless you expect most of the sets to be (close to) the minimum you've specified, I'd probably use a bitset. To cover a range up to 50,000,000, a bitset ends up ~6 megabytes.
Compared to storing the numbers directly, this is marginally larger for the minimum size set you've specified (~6 megabytes instead of ~4), but considerably smaller for the maximum size set (1/32nd the size).
The second possibility would be to use a delta encoding. For example, instead of storing each number directly, store the difference between that number and the previous number that was included. Given a maximum magnitude of 50,000,000 and a minimum size of 1,000,000 items, the average difference between one number and the next is ~50. This means you can theoretically store the difference in <6 bits on average. I'd probably use the 7 least significant bits directly, and if you need to encode a larger gap, set the msb and (for example) store the size of the gap in the lower 7 bits plus the next three bytes. That can't happen very often, so in most cases you're using only one byte per number, for about 4:1 compression compared to storing numbers directly. In the best case this would use ~1 megabyte for a set, and in the worst about 50 megabytes -- 4:1 compression compared to storing numbers directly.
If you don't mind a little bit of extra code, you could use an adaptive scheme -- delta encoding for small sets (up to 6,000,000 numbers), and a bitmap for larger sets.
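A sketch of the delta idea using a standard 7-bit varint layout (not the exact byte layout described above, and the ids must be sorted first so the deltas are small and non-negative):

// using System.Collections.Generic; using System.Linq;
static byte[] DeltaEncode(IEnumerable<int> ids)
{
    var output = new List<byte>();
    int previous = 0;
    foreach (int id in ids.OrderBy(x => x))
    {
        uint delta = (uint)(id - previous);
        previous = id;
        while (delta >= 0x80)                 // emit the low 7 bits per byte;
        {                                     // high bit set means "more bytes follow"
            output.Add((byte)(delta | 0x80));
            delta >>= 7;
        }
        output.Add((byte)delta);
    }
    return output.ToArray();
}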
I think the answer depends on what you mean by "passing around" and what you're trying to accomplish. You say you are only adding to the list: how often do you add? How fast will the list grow? What is an acceptable overhead for memory use, versus the time to reallocate memory?
In your worst case, 50,000,000 32-bit numbers = 200 megabytes using the most efficient possible data storage mechanism. Assuming you may end up with this much use in your worst case scenario, is it OK to use this much memory all the time? Is that better than having to reallocate memory frequently? What's the distribution of typical usage patterns? You could always just use an int[] that's pre-allocated to the whole 50 million.
As far as access speed for your operations, nothing is faster than iterating and adding to a pre-allocated chunk of memory.
From OP edit: There could be a decent number of these sets in memory at once (~100).
Hey now. You need to store 100 sets of 1 to 50 million numbers in memory at once? I think the bitset method is the only possible way this could work.
That would be 600 megabytes. Not insignificant, but unless they are (typically) mostly empty, it seems very unlikely that you would find a more efficient storage mechanism.
Now, if you don't use bitsets, but rather use dynamically sized constructs, and they could somehow take up less space to begin with, you're talking about a real ugly memory allocation/deallocation/garbage collection scenario.
Let's assume you really need to do this, though I can only imagine why. So your server's got a ton of memory, just allocate as many of these 6 megabyte bitsets as you need and recycle them. Allocation and garbage collection are no longer a problem. Yeah, you're using a ton of memory, but that seems inevitable.
I have found a few threads in regards to this issue. Most people appear to favor using int in their C# code across the board, even if a byte or smallint would handle the data, unless it is a mobile app. I don't understand why. Doesn't it make more sense to define your C# data type as the same data type that would be in your data storage solution?
My Premise:
If I am using a typed dataset, Linq2SQL classes, POCO, one way or another I will run into compiler data type conversion issues if I don't keep my data types in sync across my tiers. I don't really like doing System.Convert all the time just because it was easier to use int across the board in C# code. I have always used whatever smallest data type is needed to handle the data in the database as well as in code, to keep my interface to the database clean. So I would bet 75% of my C# code is using byte or short as opposed to int, because that is what is in the database.
Possibilities:
Does this mean that most people who just use int for everything in code also use the int data type for their SQL storage data types and couldn't care less about the overall size of their database, or do they do System.Convert in code wherever applicable?
Why I care: I have worked on my own forever and I just want to be familiar with best practices and standard coding conventions.
Performance-wise, an int is faster in almost all cases. The CPU is designed to work efficiently with 32-bit values.
Shorter values are complicated to deal with. To read a single byte, say, the CPU has to read the 32-bit block that contains it, and then mask out the upper 24 bits.
To write a byte, it has to read the destination 32-bit block, overwrite the lower 8 bits with the desired byte value, and write the entire 32-bit block back again.
Space-wise, of course, you save a few bytes by using smaller datatypes. So if you're building a table with a few million rows, then shorter datatypes may be worth considering. (And the same may be a good reason why you should use smaller datatypes in your database.)
And correctness-wise, an int doesn't overflow easily. What if you think your value is going to fit within a byte, and then at some point in the future some harmless-looking change to the code means larger values get stored into it?
Those are some of the reasons why int should be your default datatype for all integral data. Only use byte if you actually want to store machine bytes. Only use shorts if you're dealing with a file format or protocol or similar that actually specifies 16-bit integer values. If you're just dealing with integers in general, make them ints.
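To illustrate the overflow and promotion points above, a small sketch:

byte a = 200, b = 100;
// byte sum = a + b;                    // compile error: byte + byte is evaluated as int
byte wrapped = (byte)(a + b);           // unchecked: 300 silently wraps around to 44
// byte safe = checked((byte)(a + b));  // checked: throws OverflowException instead of wrapping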
I am only 6 years late but maybe I can help someone else.
Here are some guidelines I would use:
If there is a possibility the data will not fit in the future then use the larger int type.
If the variable is used as a struct/class field then by default it will be padded to take up the whole 32 bits anyway, so using byte/Int16 will not save memory.
If the variable is short-lived (like inside a function) then the smaller data types will not help much.
"byte" or "char" can sometimes describe the data better and can do compile-time checking to make sure larger values are not assigned to it by accident. E.g. if you store the day of the month (1-31) in a byte and try to assign 1000 to it, it will cause an error.
If the variable is used in an array of roughly 100 or more I would use the smaller data type as long as it makes sense.
byte and int16 arrays are not as thread safe as an int (a primitive).
One topic that no one brought up is the limited CPU cache. Smaller programs execute faster than larger ones because the CPU can fit more of the program in the faster L1/L2/L3 caches.
Using the int type can result in fewer CPU instructions, however it will also force a higher percentage of the data memory to not fit in the CPU cache. Instructions are cheap to execute; modern CPU cores can execute 3-7 instructions per clock cycle. A single cache miss, on the other hand, can cost 1000-2000 clock cycles because it has to go all the way to RAM.
When memory is conserved it also results in the rest of the application performing better because it is not squeezed out of the cache.
I did a quick sum test with accessing random data in random order using both a byte array and an int array.
// requires: using System; using System.Linq;
const int SIZE = 10000000, LOOPS = 80000;
var r = new Random();

// data to sum; the int version of the test uses int[] here instead of byte[]
byte[] array = Enumerable.Repeat(0, SIZE).Select(i => (byte)r.Next(10)).ToArray();
// random, cache-unfriendly visiting order
int[] visitOrder = Enumerable.Repeat(0, LOOPS).Select(i => r.Next(SIZE)).ToArray();

var sw = new System.Diagnostics.Stopwatch();
sw.Start();
int sum = 0;
foreach (int v in visitOrder)
    sum += array[v];
sw.Stop();
Here are the results in time (ticks) on x86, release mode, without debugger, .NET 4.5, i7-3930K (smaller is better):

Array size:    10    100    1K    10K   100K     1M    10M
byte:         549    559   552    552    568    632   3041
int:          549    566   552    562    590   1803   4206
Accessing 1M items randomly, byte was roughly 2.85x faster than int on my CPU (632 vs. 1803 ticks)!
Anything under 10,000 was hardly noticeable.
int was never faster than byte for this basic sum test.
These values will vary with different CPUs with different cache sizes.
One final note: sometimes I look at the now open-source .NET Framework to see what Microsoft's experts do. The .NET Framework uses byte/Int16 surprisingly little; I could hardly find any uses at all.
You would have to be dealing with a few BILLION rows before this makes any significant difference in terms of storage capacity. Let's say you have three columns, and instead of using a byte-equivalent database type, you use an int-equivalent.
That gives us 3 (columns) x 3 (bytes extra) per row, or 9 bytes per row.
This means, for "a few million rows" (let's say three million), you are consuming a whole extra 27 megabytes of disk space! Fortunately, as we're no longer living in the 1970s, you shouldn't have to worry about this :)
As said above, stop micro-optimising - the performance hit in converting to/from different integer-like numeric types is going to hit you much, much harder than the bandwidth/diskspace costs, unless you are dealing with very, very, very large datasets.
For the most part, 'No'.
Unless you know upfront that you are going to be dealing with 100's of millions of rows, it's a micro-optimisation.
Do what fits the domain model best. Later, if you have performance problems, benchmark and profile to pinpoint where they are occurring.
Not that I didn't believe Jon Grant and others, but I had to see for myself with our "million row table". The table has 1,018,000 rows. I converted 11 tinyint columns and 6 smallint columns into int; there were already 5 int and 3 smalldatetime columns. 4 different indexes used a combo of the various data types, but obviously the new indexes are now all using int columns.
Making the changes only cost me 40 MB, calculated from base table disk usage with no indexes. When I added the indexes back in, the overall change was only a 30 MB difference. So I was surprised, because I thought the index size would be larger.
So is 30 MB worth the hassle of using all the different data types? No way! I am off to INT land, thanks everyone for setting this anal-retentive programmer back on the straight and happy blissful life of no more integer conversions... yippee!
The .NET runtime is optimised for Int32. See previous discussion at .NET Integer vs Int16?
If int is used everywhere, no casting or conversions are required. That is a bigger bang for the buck than the memory you will save by using multiple integer sizes.
It just makes life simpler.
I often have to take a retrieved value (usually a string) and convert it to an int. But in C# (.NET) you have to choose int16, int32 or int64 - how do you know which one to choose when you don't know how big your retrieved number will be?
Everyone here who has mentioned that declaring an Int16 saves ram should get a downvote.
The answer to your question is to use the keyword "int" (or if you feel like it, use "Int32").
That gives you a range of roughly -2.1 billion to +2.1 billion... Also, 32-bit processors will handle those ints better... also, and THE MOST IMPORTANT REASON, if you plan on using that int for almost any purpose... it will likely need to be an "int" (Int32).
In the .Net framework, 99.999% of numeric fields (that are whole numbers) are "ints" (Int32).
Example: Array.Length, Process.ID, Windows.Width, Button.Height, etc, etc, etc 1 million times.
EDIT: I realize that my grumpiness is going to get me down-voted... but this is the right answer.
Just wanted to add that... I remembered that in the days of .NET 1.1 the compiler was optimized so that 'int' operations are actually faster than byte or short operations.
I believe it still holds today, but I'm running some tests now.
EDIT: I have got a surprise discovery: the add, subtract and multiply operations for short(s) actually return int!
Repeatedly trying TryParse() doesn't make sense; you already have a field declared. You can't change your mind unless you make that field of type Object. Not a good idea.
Whatever data the field represents has a physical meaning. It's an age, a size, a count, etc. Physical quantities have realistic constraints on their range. Pick the int type that can store that range. Don't try to fix an overflow; it would be a bug.
Contrary to the current most popular answer, shorter integers (like Int16 and SByte) often take up less space in memory than larger integers (like Int32 and Int64). You can easily verify this by instantiating large arrays of sbyte/short/int/long and using perfmon to measure managed heap sizes. It is true that many CLR flavors will widen these integers for CPU-specific optimizations when doing arithmetic on them and such, but when stored as part of an object, they take up only as much memory as is necessary.
So you definitely should take size into consideration, especially if you'll be working with large lists of integers (or with large lists of objects containing integer fields). You should also consider things like CLS compliance (which disallows any unsigned integers in public members).
For simple cases like converting a string to an integer, I agree an Int32 (C# int) usually makes the most sense and is likely what other programmers will expect.
If we're just talking about a couple of numbers, choosing the largest won't make a noticeable difference in your overall RAM usage and will just work. If you are talking about lots of numbers, you'll need to use TryParse() on them and figure out the smallest int type, to save RAM.
All computers are finite. You need to define an upper limit based on what you think your users requirements will be.
If you really have no upper limit and want to allow 'unlimited' values, try adding the .NET Java runtime libraries to your project, which will allow you to use the java.math.BigInteger class - which does math on integers of practically unlimited size.
Note: The .Net Java libraries come with full DevStudio, but I don't think they come with Express.