I have the following code below:
List<long> numbers = new List<long>();
for (long i = 1; i <= 300000000; i++)
{
numbers.Add(i);
}
What I wanted to do is populate the list with the numbers from 1 to 300 million. But when it hits 67,108,865, it throws an exception on line 4: Exception of type 'System.OutOfMemoryException' was thrown.
I tried using ulong but still no luck.
I believe the maximum value for the long data type is 9,223,372,036,854,775,807, so why am I getting an error here?
Thanks in advance!
EDIT
Thanks for all the answers. They helped me realize that my design was not good. I ended up changing my code design.
Well, it is not that your numbers are too large, but your list is...
Let's calculate its size:
300,000,000 * 64 bits (size of long) = 19,200,000,000 bits
19,200,000,000 /8 (size of byte) = 2,400,000,000
2,400,000,000 / 2^10 = 2,343,750 KB
2,343,750 / 2^10 = 2,288~ MB
2,288/ 2^10 = 2.235~ GB
You wanted a list of about 2.24 GB.
The current CLR limitation is 2 GB per object (see this or this SO thread)
If you need a list with size of 300,000,000, split it into 2 lists (either in place or using a wrapping object that will handle managing the lists).
First, note that System.OutOfMemoryException is not thrown when the limit of a variable's data type is reached.
Secondly, it is thrown because there is not enough memory available to continue the execution of the program.
Sadly, you cannot configure this; the .NET runtime makes all the decisions about heap size and memory.
One option is to switch to a 64-bit machine.
For info, on a 32-bit machine you can increase the available memory by using the /3GB boot switch in Boot.ini.
EDIT While searching I found in MSDN Documentation under Remarks section
By default, the maximum size of an Array is 2 gigabytes (GB). In a 64-bit environment, you can avoid the size restriction by setting the enabled attribute of the gcAllowVeryLargeObjects configuration element to true in the run-time environment. However, the array will still be limited to a total of 4 billion elements, and to a maximum index of 0X7FEFFFFF in any given dimension (0X7FFFFFC7 for byte arrays and arrays of single-byte structures).
A List<long> is backed by a long[]. You will fail as soon as the backing array cannot be allocated; during a reallocation, there has to be enough total memory for both the old and the new arrays.
But if you want a collection with more than 2^31 elements, you have to write your own implementation using multiple arrays or Lists and managing them yourself, as in the sketch below.
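A minimal sketch of that idea (my own illustration; ChunkedList and its members are made-up names, not from any of the answers): spread the elements over several inner lists so that no single backing array has to approach the 2 GB object limit.
using System.Collections.Generic;

class ChunkedList<T>
{
    private const int ChunkSize = 16000000;                  // elements per inner list
    private readonly List<List<T>> chunks = new List<List<T>>();

    public long Count { get; private set; }

    public void Add(T item)
    {
        // start a new chunk when the last one is full (or none exists yet)
        if (chunks.Count == 0 || chunks[chunks.Count - 1].Count == ChunkSize)
            chunks.Add(new List<T>(ChunkSize));
        chunks[chunks.Count - 1].Add(item);
        Count++;
    }

    public T this[long index]
    {
        get { return chunks[(int)(index / ChunkSize)][(int)(index % ChunkSize)]; }
        set { chunks[(int)(index / ChunkSize)][(int)(index % ChunkSize)] = value; }
    }
}
With long elements, each chunk is at most 16,000,000 * 8 bytes = about 128 MB, comfortably below the per-object limit, so populating 300 million values only needs enough total memory, not one huge contiguous block.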
Your numbers are not too large; your list is too long. It is using all the available memory.
Reduce 300000000 to 3000000, and it will (probably) work fine.
Ideally, a List should hold the number of elements you ask it to.
But as per the current CLR implementation, each object can be at most 2 GB in size. Since you are storing long values (8 bytes each), while populating the List the backing array eventually tries to exceed that 2 GB limit. That's why you are getting a System.OutOfMemoryException.
This question already has answers here: "Can't create huge arrays" and "What is the maximum length of an array in .NET on 64-bit Windows".
Even though this post says it should work, if you create an int array of size Int32.MaxValue, it throws an OutOfMemoryException: Array dimensions exceeded supported range.
From my testing, it seems like the maximum size that an array can be initialized to is Int32.MaxValue - 1048576 (2,146,435,071). 1048576 is 2^20. So only this works:
var maxSizeOfIntArray = Int32.MaxValue - 1048576;
var array = new int[maxSizeOfIntArray];
Does any one know why? Is there a way to create a larger integer array?
PS: I need to use arrays instead of lists because of a Math.NET library method that only returns arrays for sets of cryptographically secure pseudo-random numbers.
Yes, I have looked at the other questions linked, but they are not quite right, as those questions say the largest size is Int32.MaxValue, which is not the same as what my computer lets me do.
Yes, I do know the array will be 8 GB. I need to generate a data set of billions of rows in order to test the randomness with the dieharder suite of tests.
I also tried creating a BigArray<T>, but that doesn't seem to be supported in C# anymore. I found one implementation, but it throws an IndexOutOfRangeException at index 524287, even though I set the array size to 3 million.
An Int32 is 32 bits, or 4 bytes. The max value of an Int32 is 2,147,483,647. So, if you could create an array of 2,147,483,647 elements, where each element is 4 bytes, you would need a contiguous piece of memory that is 8GB in size. That is ridiculously huge, and even if your machine had 128GB of RAM (and you were running in a 64-bit process), that would be outside of realistic proportions. If you really need to use that much memory (and your system has it), I would recommend going to native code (i.e., C++).
I am getting an error on this command:
Dictionary<UInt64, int> myIntDict = new Dictionary<UInt64, int>(89478458);
The error is:
System.OutOfMemoryException was unhandled HResult=-2147024882
Message=Array dimensions exceeded supported range.
Source=mscorlib
StackTrace:
at System.Collections.Generic.Dictionary`2.Initialize(Int32 capacity)
at System.Collections.Generic.Dictionary`2..ctor(Int32 capacity, IEqualityComparer`1 comparer)
With capacity 89478457 there is no error. Here is the source of Initialize in Dictionary.cs:
private void Initialize(int capacity)
{
int size = HashHelpers.GetPrime(capacity);
...
entries = new Entry[size];
...
}
When I reproduce this, the error happens on the array creation. Entry is a struct, in this case with size 24 bytes. If we take max Int32 (0x80000000 - 1) and divide it by 24, we get 89,478,485, and this number lies between the prime numbers 89,478,457 and 89,478,503.
Does this mean that an array of structs cannot be bigger than maxInt32 / sizeof(struct)?
EDIT:
Yes, I actually go over 2 GB. This happens when the dictionary creates the internal array of struct Entry, where the (key, value) pairs are stored. In my case sizeof(Entry) is 24 bytes, and as a value type it is allocated inline in the array.
And the solution is to use the gcAllowVeryLargeObjects flag (thank you Evk). Actually, in .NET Core the flag is the environment variable COMPlus_gcAllowVeryLargeObjects (thank you svick).
And yes, Paparazzi is right. I have to think about how not to waste memory.
Thank you all.
There is a known limitation of the .NET runtime - the maximum object size allowed on the heap is 2 GB, even on the 64-bit version of the runtime. But starting from .NET 4.5 there is a configuration option which allows you to relax this limit (still only on the 64-bit runtime) and create larger arrays. An example of the configuration to enable that is:
<configuration>
<runtime>
<gcAllowVeryLargeObjects enabled="true" />
</runtime>
</configuration>
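As a rough illustration (my own, not from the original answer) of what the flag changes: in a 64-bit process with the element above enabled, and assuming the machine actually has the memory, an allocation like the following can succeed, whereas without the flag it throws OutOfMemoryException because the array object alone would exceed 2 GB.
long[] big = new long[2000000000]; // about 16 GB; still capped by the per-dimension limit of 0X7FEFFFFF elements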
On the surface the Dictionary does not make sense.
You can only have int unique values.
Do you really have that many duplicates?
UInt32 goes up to 4,294,967,295.
Why are you wasting 4 bytes?
89,478,458 rows.
Currently a row is 12 bytes.
You hit 1 GB at about 83,333,333 rows.
Since an object needs contiguous memory, 1 GB is more of a practical limit.
If the value is really a 24-byte struct,
then 1 GB is about 31,250,000 rows.
That is just a really big collection
You can split it up into more than one collection, as in the sketch below.
Or use a class, as then the entry holds just a reference, which I think is 4 bytes.
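A minimal sketch of the "split it up" option (my own illustration; PartitionedDictionary is a made-up name): route each key to one of several smaller dictionaries so that no single internal Entry[] has to approach the 2 GB object limit.
using System.Collections.Generic;

class PartitionedDictionary
{
    private const int Partitions = 16;
    private readonly Dictionary<ulong, int>[] parts;

    public PartitionedDictionary(int capacityPerPart)
    {
        parts = new Dictionary<ulong, int>[Partitions];
        for (int i = 0; i < Partitions; i++)
            parts[i] = new Dictionary<ulong, int>(capacityPerPart);   // each backing array stays small
    }

    public int this[ulong key]
    {
        get { return parts[(int)(key % Partitions)][key]; }
        set { parts[(int)(key % Partitions)][key] = value; }
    }

    public bool TryGetValue(ulong key, out int value)
    {
        return parts[(int)(key % Partitions)].TryGetValue(key, out value);
    }
}
For the 89,478,458 entries above, 16 partitions of roughly 5.6 million entries each keep every internal array far below the limit, at the cost of one extra modulo per lookup.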
15 years ago, while programming in Pascal, I learned why one should use powers of two for memory allocation. But this still seems to be state of the art.
C# Examples:
new StringBuilder(256);
new byte[1024];
int bufferSize = 1 << 12;
I still see this thousands of times; I use it myself, and I'm still wondering:
Do we need this in modern programming languages and on modern hardware?
I guess it's good practice, but what's the reason?
EDIT
For example, for a byte[] array, as stated by answers here, a power of 2 makes no sense: the array object itself will use 16 bytes (?), so does it make sense to use 240 (= 256 - 16) for the size so that the total fits in 256 bytes?
Do we need this in modern programming languages and modern hardware? I guess it's good practice, but what's the reason?
It depends. There are two things to consider here:
For sizes less than the memory page size, there's no appreciable difference between a power-of-two and an arbitrary number to allocate space;
You mostly use managed data structures with C#, so you won't even know how many bytes are really allocated underneath.
Assuming you're doing low-level allocation with malloc(), using multiples of the page size would be considered a good idea, e.g. 4096 or 8192; this is because it allows for more efficient memory management.
My advice would be to just allocate what you need and let C# handle the memory management and allocation for you.
Sadly, it's quite stupid if you want to keep a block of memory in a single 4K memory page... and people don't even know it :-) (I didn't until 10 minutes ago... I only had a hunch)... An example... It's unsafe code and implementation dependent (using .NET 4.5 at 32/64 bits):
byte[] arr = new byte[4096];
unsafe
{
    fixed (byte* p = arr)
    {
        // the array's Length is stored just before its first element
        int size = ((int*)p)[IntPtr.Size == 4 ? -1 : -2];
    }
}
So the CLR has allocated at least 4096 + (1 or 2) * sizeof(int) bytes... so it has gone over a single 4K memory page. This is logical... it has to keep the size of the array somewhere, and keeping it together with the array is the most intelligent choice (for those who know what Pascal strings and BSTRs are, yes, it's the same principle).
I'll add that all objects in .NET have a syncblock number and a RuntimeType... they are at least int-sized, if not IntPtr-sized, so a total of between 8 and 16 bytes per object (this is explained in various places... try looking for ".net object header" if you are interested).
It still makes sense in certain cases, but I would prefer to analyze case-by-case whether I need that kind of specification or not, rather than blindly use it as good practice.
For example, there might be cases where you want to use exactly 8 bits of information (1 byte) to address a table.
In that case, I would let the table have the size of 2^8.
Object[] table = new Object[256];
By this, you will be able to address any object of the table using only one byte.
Even if the table is actually smaller and doesn't use all 256 places, you still have the guarantee of bidirectional mapping from table to index and from index to table, which could prevent errors that would appear, for example, if you had:
Object[] table = new Object[100];
And then someone (probably someone else) accesses it with a byte value out of table's range.
Maybe this kind of bijective behavior could be good, maybe you could have other ways to guarantee your constraints.
Probably, given the increase in smartness of current compilers, it is not the only good practice anymore.
IMHO, anything that ends in arithmetic on exact powers of two is like a fast track: low-level operations on powers of two take fewer cycles and bit manipulations, whereas other numbers need extra work from the CPU.
And I found this possible duplicate: Is it better to allocate memory in the power of two?
Yes, it's good practice, and it has at least one reason.
Modern processors have an L1 cache-line size of 64 bytes, and if you use a buffer size of 2^n (for example 1024 or 4096), you fill whole cache lines with no wasted space.
In some cases, this will help prevent the false sharing problem (http://en.wikipedia.org/wiki/False_sharing); a small illustration follows.
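To make that concrete, here is a small sketch (my own illustration, not taken from the answer above): pad each per-thread counter so that counters written by different threads are not packed next to each other inside one cache line.
using System;
using System.Runtime.InteropServices;

// Size = 64 reserves a full cache line's worth of space per counter, so hot
// counters belonging to different threads are not packed side by side in memory.
[StructLayout(LayoutKind.Explicit, Size = 64)]
struct PaddedCounter
{
    [FieldOffset(0)] public long Value;
}

class PerThreadCounters
{
    // One padded slot per core; thread i only ever writes Slots[i].Value.
    public static readonly PaddedCounter[] Slots = new PaddedCounter[Environment.ProcessorCount];
}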
In C, I'm working on a "class" that manages a byte buffer, allowing arbitrary data to be appended to the end. I'm now looking into automatic resizing as the underlying array fills up using calls to realloc. This should make sense to anyone who's ever used Java or C# StringBuilder. I understand how to go about the resizing. But does anyone have any suggestions, with rationale provided, on how much to grow the buffer with each resize?
Obviously, there's a trade off to be made between wasted space and excessive realloc calls (which could lead to excessive copying). I've seen some tutorials/articles that suggest doubling. That seems wasteful if the user manages to supply a good initial guess. Is it worth trying to round to some power of two or a multiple of the alignment size on a platform?
Does any one know what Java or C# does under the hood?
In C# the strategy used to grow the internal buffer used by a StringBuilder has changed over time.
There are three basic strategies for solving this problem, and they have different performance characteristics.
The first basic strategy is:
Make an array of characters
When you run out of room, create a new array with k more characters, for some constant k.
Copy the old array to the new array, and orphan the old array.
This strategy has a number of problems, the most obvious of which is that it is O(n^2) in time if the string being built is extremely large. Let's say that k is a thousand characters and the final string is a million characters. You end up reallocating the string at 1000, 2000, 3000, 4000, ... and therefore copying 1000 + 2000 + 3000 + 4000 + ... + 999000 characters, which sums to on the order of 500 million characters copied!
This strategy has the nice property that the amount of "wasted" memory is bounded by k.
In practice this strategy is seldom used because of that n-squared problem.
The second basic strategy is
Make an array
When you run out of room, create a new array with k% more characters, for some constant k.
Copy the old array to the new array, and orphan the old array.
k% is usually 100%; if it is then this is called the "double when full" strategy.
This strategy has the nice property that its amortized cost is O(n). Suppose again the final string is a million characters and you start with a thousand. You make copies at 1000, 2000, 4000, 8000, ... and end up copying 1000 + 2000 + 4000 + 8000 ... + 512000 characters, which sums to about a million characters copied; much better.
The strategy has the property that the amortized cost is linear no matter what percentage you choose.
This strategy has the downside that sometimes a copy operation is extremely expensive, and you can be wasting up to k% of the final string length in unused memory.
The third strategy is to make a linked list of arrays, each array of size k. When you overflow an existing array, a new one is allocated and appended to the end of the list.
This strategy has the nice property that no operation is particularly expensive, the total wasted memory is bounded by k, and you don't need to be able to locate large blocks in the heap on a regular basis. It has the downside that finally turning the thing into a string can be expensive as the arrays in the linked list might have poor locality.
The string builder in the .NET framework used to use a double-when-full strategy; it now uses a linked-list-of-blocks strategy.
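For reference, here is a small sketch (my own, not framework code) of the second, "double when full" strategy in C#:
using System;

class AppendBuffer
{
    private byte[] data = new byte[16];    // small initial capacity
    private int length;

    public void Append(byte[] bytes)
    {
        int required = length + bytes.Length;
        if (required > data.Length)
        {
            int newCapacity = data.Length;
            while (newCapacity < required)
                newCapacity *= 2;                   // double when full -> amortized O(n) total copying
            Array.Resize(ref data, newCapacity);    // allocate a new array and copy the old one (the C analogue is realloc)
        }
        Array.Copy(bytes, 0, data, length, bytes.Length);
        length = required;
    }
}
The while loop covers the case where a single Append is larger than the current capacity; everything else is exactly the "create a new array with 100% more room, copy, orphan the old one" recipe described above.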
You generally want to keep the growth factor a little smaller than the golden mean (~1.6). When it's smaller than the golden mean, the discarded segments will be large enough to satisfy a later request, as long as they're adjacent to each other. If your growth factor is larger than the golden mean, that can't happen.
I've found that reducing the factor to 1.5 still works quite nicely, and has the advantage of being easy to implement in integer math (size = (size + (size << 1))>>1; -- with a decent compiler you can write that as (size * 3)/2, and it should still compile to fast code).
I seem to recall a conversation some years ago on Usenet in which P.J. Plauger (or maybe it was Pete Becker) of Dinkumware said they'd run rather more extensive tests than I ever did and reached the same conclusion (so, for example, the implementation of std::vector in their C++ standard library uses 1.5).
When working with expanding and contracting buffers, the key property you want is to grow or shrink by a multiple of your size, not a constant difference.
Consider the case where you have a 16-byte array: increasing its size by 128 bytes is overkill; however, if instead you had a 4096-byte array and increased it by only 128 bytes, you would end up copying a lot.
I was taught to always double or halve arrays. If you really have no hint as to the size or maximum, multiplying by two ensures that you have a lot of capacity for a long time, and unless you're working on a resource constrained system, allocating at most twice the space isn't too terrible. Additionally, keeping things in powers of two can let you use bit shifts and other tricks and the underlying allocation is usually in powers of two.
Does any one know what Java or C# does under the hood?
Have a look at the following link to see how it's done in Java's StringBuilder from JDK11, in particular, the ensureCapacityInternal method.
https://java-browser.yawk.at/java/11/java.base/java/lang/AbstractStringBuilder.java#java.lang.AbstractStringBuilder%23ensureCapacityInternal%28int%29
It's implementation-specific, according to the documentation, but starts with 16:
The default capacity for this implementation is 16, and the default maximum capacity is Int32.MaxValue.
A StringBuilder object can allocate more memory to store characters when the value of an instance is enlarged, and the capacity is adjusted accordingly. For example, the Append, AppendFormat, EnsureCapacity, Insert, and Replace methods can enlarge the value of an instance.
The amount of memory allocated is implementation-specific, and an exception (either ArgumentOutOfRangeException or OutOfMemoryException) is thrown if the amount of memory required is greater than the maximum capacity.
Based on some other .NET framework things, I would suggest multiplying it by 1.1 each time the current capacity is reached. If extra space is needed, just have an equivalent to EnsureCapacity that will expand it to the necessary size manually.
Translate this to C as needed; I would probably maintain a List<List<string>> list.
class StringBuilder
{
    private List<List<string>> list = new List<List<string>>();

    public void Append(List<string> listOfCharsToAppend)
    {
        list.Add(listOfCharsToAppend);
    }
}
This way you are just maintaining a list of Lists and allocating memory on demand rather than allocating memory well ahead.
List<T> in the .NET Framework uses this algorithm: if an initial capacity is specified, it creates a buffer of that size; otherwise no buffer is allocated until the first item(s) are added, at which point it allocates space equal to the number of items added, but no less than 4. When more space is needed, it allocates a new buffer with 2x the previous capacity and copies all items from the old buffer to the new buffer. Earlier, StringBuilder used a similar algorithm.
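You can watch that List<T> behaviour for yourself (a small sketch; the exact numbers are an implementation detail, but with the algorithm above they come out as 4, 8, 16, 32, ...):
using System;
using System.Collections.Generic;

class ListCapacityDemo
{
    static void Main()
    {
        var list = new List<int>();
        int lastCapacity = -1;
        for (int i = 0; i < 100; i++)
        {
            list.Add(i);
            if (list.Capacity != lastCapacity)    // report only when the backing buffer was replaced
            {
                Console.WriteLine("Count={0}, Capacity={1}", list.Count, list.Capacity);
                lastCapacity = list.Capacity;
            }
        }
    }
}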
In .NET 4, StringBuilder allocates an initial buffer of the size specified in the constructor (the default size is 16 characters). When the allocated buffer is too small, no copying is done. Instead it fills the current buffer to the rim, then creates a new instance of StringBuilder, which allocates a buffer of size MAX(length_of_remaining_data_to_add, MIN(length_of_all_previous_buffers, 8000)), so at least all remaining data fits into the new buffer and the total size of all buffers is at least doubled. The new StringBuilder keeps a reference to the old StringBuilder, and so individual instances create a linked list of buffers.
I have found a few threads in regards to this issue. Most people appear to favor using int in their C# code across the board, even if a byte or smallint would handle the data, unless it is a mobile app. I don't understand why. Doesn't it make more sense to define your C# data type as the same data type that would be in your data storage solution?
My Premise:
If I am using a typed DataSet, Linq2SQL classes, or POCOs, one way or another I will run into compiler data type conversion issues if I don't keep my data types in sync across my tiers. I don't really like doing System.Convert all the time just because it was easier to use int across the board in C# code. I have always used whatever smallest data type is needed to handle the data, in the database as well as in code, to keep my interface to the database clean. So I would bet 75% of my C# code is using byte or short as opposed to int, because that is what is in the database.
Possibilities:
Does this mean that most people who just use int for everything in code also use the int data type for their SQL storage data types and couldn't care less about the overall size of their database, or do they do System.Convert in code wherever applicable?
Why I care: I have worked on my own forever and I just want to be familiar with best practices and standard coding conventions.
Performance-wise, an int is faster in almost all cases. The CPU is designed to work efficiently with 32-bit values.
Shorter values are complicated to deal with. To read a single byte, say, the CPU has to read the 32-bit block that contains it, and then mask out the upper 24 bits.
To write a byte, it has to read the destination 32-bit block, overwrite the lower 8 bits with the desired byte value, and write the entire 32-bit block back again.
Space-wise, of course, you save a few bytes by using smaller data types. So if you're building a table with a few million rows, then shorter data types may be worth considering. (And the same might be a good reason to use smaller data types in your database.)
And correctness-wise, an int doesn't overflow easily. What if you think your value is going to fit within a byte, and then at some point in the future some harmless-looking change to the code means larger values get stored into it?
Those are some of the reasons why int should be your default datatype for all integral data. Only use byte if you actually want to store machine bytes. Only use shorts if you're dealing with a file format or protocol or similar that actually specifies 16-bit integer values. If you're just dealing with integers in general, make them ints.
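One concrete illustration of the "just make them ints" advice (a small sketch of standard C# behaviour): arithmetic on byte and short operands is performed as int, so the narrow types force casts and can silently wrap once you cast back.
byte a = 200, b = 100;
// byte c = a + b;         // does not compile: a + b is of type int
byte c = (byte)(a + b);    // compiles, but 300 wraps around to 44
int d = a + b;             // no cast needed, and no wrap-around here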
I am only 6 years late but maybe I can help someone else.
Here are some guidelines I would use:
If there is a possibility the data will not fit in the future then use the larger int type.
If the variable is used as a struct/class field, then by default it will be padded to take up the whole 32 bits anyway, so using byte/Int16 will not save memory.
If the variable is short lived (like inside a function) then the smaller data types will not help much.
"byte" or "char" can sometimes describe the data better and can do compile time checking to make sure larger values are not assigned to it on accident. e.g. If storing the day of the month(1-31) using a byte and try to assign 1000 to it then it will cause an error.
If the variable is used in an array of roughly 100 or more I would use the smaller data type as long as it makes sense.
byte and int16 arrays are not as thread safe as an int (a primitive).
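To illustrate the compile-time range check mentioned in the list above (a small sketch):
byte dayOfMonth = 31;    // fine
// byte bad = 1000;      // compile-time error: the constant 1000 cannot be converted to a byte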
One topic that no one brought up is the limited CPU cache. Smaller programs execute faster than larger ones because the CPU can fit more of the program in the faster L1/L2/L3 caches.
Using the int type can result in fewer CPU instructions; however, it will also force a higher percentage of the data memory not to fit in the CPU cache. Instructions are cheap to execute. Modern CPU cores can execute 3-7 instructions per clock cycle; a single cache miss, on the other hand, can cost 1,000-2,000 clock cycles because it has to go all the way to RAM.
When memory is conserved it also results in the rest of the application performing better because it is not squeezed out of the cache.
I did a quick sum test with accessing random data in random order using both a byte array and an int array.
// needs using System.Linq; r is a shared Random instance
const int SIZE = 10000000, LOOPS = 80000;
var r = new Random();
byte[] array = Enumerable.Repeat(0, SIZE).Select(i => (byte)r.Next(10)).ToArray();
int[] visitOrder = Enumerable.Repeat(0, LOOPS).Select(i => r.Next(SIZE)).ToArray();
System.Diagnostics.Stopwatch sw = new System.Diagnostics.Stopwatch();
sw.Start();
int sum = 0;
foreach (int v in visitOrder)
    sum += array[v];
sw.Stop();
Here are the results in time(ticks): (x86, release mode, without debugger, .NET 4.5, I7-3930k) (smaller is better)
________________ Array Size __________________
10 100 1K 10K 100K 1M 10M
byte: 549 559 552 552 568 632 3041
int : 549 566 552 562 590 1803 4206
Accessing 1M items randomly, byte was about 2.85x faster than int on my CPU (632 vs. 1803 ticks)!
Anything under 10,000 was hardly noticeable.
int was never faster than byte for this basic sum test.
These values will vary with different CPUs with different cache sizes.
One final note: sometimes I look at the now open-source .NET Framework to see what Microsoft's experts do. The .NET Framework uses byte/Int16 surprisingly little; I could not actually find any uses.
You would have to be dealing with a few BILLION rows before this makes any significant difference in terms of storage capacity. Let's say you have three columns, and instead of using a byte-equivalent database type, you use an int-equivalent one.
That gives us 3 (columns) x 3 (bytes extra) per row, or 9 bytes per row.
This means, for "a few million rows" (lets say three million), you are consuming a whole extra 27 megabytes of disk space! Fortunately as we're no longer living in the 1970s, you shouldn't have to worry about this :)
As said above, stop micro-optimising - the performance hit in converting to/from different integer-like numeric types is going to hit you much, much harder than the bandwidth/diskspace costs, unless you are dealing with very, very, very large datasets.
For the most part, 'No'.
Unless you know upfront that you are going to be dealing with 100's of millions of rows, it's a micro-optimisation.
Do what fits the Domain model best. Later, if you have performance problems, benchmark and profile to pinpoint where they are occurring.
Not that I didn't believe Jon Grant and the others, but I had to see for myself with our "million row table". The table has 1,018,000 rows. I converted 11 tinyint columns and 6 smallint columns into int; there were already 5 int and 3 smalldatetime columns. 4 different indexes used a combo of the various data types, but obviously the new indexes are now all using int columns.
Making the changes only cost me 40 MB, calculating base table disk usage with no indexes. When I added the indexes back in, the overall change was only a 30 MB difference. So I was surprised, because I thought the index size would be larger.
So is 30 MB worth the hassle of using all the different data types? No way! I am off to INT land. Thanks, everyone, for setting this anal-retentive programmer back on the straight and happy blissful life of no more integer conversions... yippee!
The .NET runtime is optimised for Int32. See previous discussion at .NET Integer vs Int16?
If int is used everywhere, no casting or conversions are required. That is a bigger bang for the buck than the memory you will save by using multiple integer sizes.
It just makes life simpler.