Working with a large array on disk - C#

I need to work with a large 2-dimensional array of doubles with more than 100 million cells. The matrix first needs to be filled and then manipulated by taking either one row or one column. The matrix can be bigger than 1 terabyte in size and will not fit in memory.
How can the array be stored efficiently? The main operations are saving it from memory to disk quickly, row by row (double[100k] each), and quickly reading a single row or a single column back into memory.

You could use Memory Mapped Files. You are essentially still working with an array, but allowing the kernel to choose what parts to load into memory. You could also possibly use Fixed size buffers to read whole sections of the memory mapped files.
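As a rough sketch of how that could look for this matrix (the class name OnDiskMatrix, the row-major layout, and the constructor arguments are my own assumptions, not part of the answer): reading a row is a single contiguous read, while reading a column is a strided read that touches one element per row.

using System;
using System.IO;
using System.IO.MemoryMappedFiles;

class OnDiskMatrix : IDisposable
{
    private readonly MemoryMappedFile _mmf;
    private readonly MemoryMappedViewAccessor _view;
    private readonly long _rows, _cols;

    public OnDiskMatrix(string path, long rows, long cols)
    {
        _rows = rows;
        _cols = cols;
        long bytes = rows * cols * sizeof(double);
        // The kernel pages parts of the file in and out as needed.
        _mmf = MemoryMappedFile.CreateFromFile(path, FileMode.OpenOrCreate, null, bytes);
        _view = _mmf.CreateViewAccessor(0, bytes);
    }

    // Save one row (e.g. a double[100000]) in a single call.
    public void WriteRow(long row, double[] values)
    {
        _view.WriteArray(row * _cols * sizeof(double), values, 0, values.Length);
    }

    // Read one row back as a contiguous block.
    public void ReadRow(long row, double[] buffer)
    {
        _view.ReadArray(row * _cols * sizeof(double), buffer, 0, buffer.Length);
    }

    // Reading a column is strided: one 8-byte read per row.
    public void ReadColumn(long col, double[] buffer)
    {
        for (long r = 0; r < _rows; r++)
            buffer[r] = _view.ReadDouble((r * _cols + col) * sizeof(double));
    }

    public void Dispose() { _view.Dispose(); _mmf.Dispose(); }
}

If columns are read as often as rows, it may be worth keeping a second, column-major copy of the data on disk, since strided reads over a terabyte-sized file are far slower than contiguous ones.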

Related

Lists double their space in C# when they need more room. At some point, does it become less efficient to double, say, 1024 to 2048?

When the numbers are small, it's quick to grow an array list from 2 to 4 memory addresses, but what about when it's allocating amounts close to the maximum space allowed for an array list (close to the 2 MB limit)? Would it be more efficient if, at some point, the list grew by only a fraction of the size it needs rather than doubling? Obviously growing from 1 MB to 2 MB isn't really a big deal nowadays. HOWEVER, if you had 50,000 people per hour running something that doubles an array like this, I'm curious whether that would be a good enough reason to change how this works, not to mention cutting down on unneeded memory (in theory).
A small graphical representation of what I mean:
ArrayList a has 4 elements in it, and that is its current maximum size at the moment:
||||
Now let's add another item to the ArrayList. The internal code will double the size of the array even though we're only adding one thing.
The ArrayList now becomes 8 elements large:
||||||||
At these sizes I doubt it makes any difference, but when you're allocating 1 MB up to 2 MB every time someone adds something of around 1.25 MB to an ArrayList, there's 0.75 MB of unneeded space allocated.
To give you more of an idea, here is the code that currently runs in C# in the System.Collections.Generic namespace. The way it works now, it doubles the size of a list (read: array) every time a user tries to add something to an array that is too small. Doubling the size is a good solution and makes sense, until you're growing it far bigger than you technically need it to be.
Here's the source for this particular part of the class:
private void EnsureCapacity(int min)
{
    if (this._items.Length >= min)
        return;
    // This is the doubling I'm referring to
    int num = this._items.Length == 0 ? 4 : this._items.Length * 2;
    if ((uint) num > 2146435071U)
        num = 2146435071;
    if (num < min)
        num = min;
    this.Capacity = num;
}
I'm going to guess that this is how memory management is handled in many programming languages, so this has probably been considered many times before. I'm just wondering whether this is the kind of efficiency tweak that could save a large amount of system resources at massive scale.
As the size of the collection gets larger, so does the cost of creating a new buffer, as you need to copy over all of the existing elements. The fact that the number of these copies that need to be done is inversely proportional to the expense of each copy is exactly why the amortized cost of adding items to a List is O(1). If the size of the buffer increases linearly, then the amortized cost of adding an item to a List actually becomes O(n).
You save on memory, allowing the "wasted" memory to go from being O(n) to being O(1). As with virtually all performance/algorithm decisions, we're once again faced with the quintessential decision of exchanging memory for speed. We can save on memory and have slower adding speeds (because of more copying), or we can use more memory to get faster additions. Of course there is no one universally right answer. Some people really would prefer to have a slower addition speed in exchange for less wasted memory. The particular resource that is going to run out first will vary based on the program, the system it's running on, and so forth. Those people in the situation where memory is the scarcer resource may not be able to use List, which is designed to be as widely applicable as possible even though it can't be universally the best option.
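To make the amortized claim concrete, here is a small hedged sketch (the method names, starting capacity, and step size are illustrative only, not from the answer) that counts how many element copies are performed while appending n items under doubling versus fixed-step growth:

using System;

class GrowthCost
{
    // Counts element copies caused by reallocations while appending n items.
    static long CopiesWhenAppending(long n, Func<long, long> grow)
    {
        long capacity = 4, copies = 0;
        for (long count = 0; count < n; count++)
        {
            if (count == capacity)
            {
                copies += count;            // a reallocation copies every existing element
                capacity = grow(capacity);
            }
        }
        return copies;
    }

    static void Main()
    {
        long n = 10000000;
        // Doubling: total copies stay proportional to n (O(1) amortized per add).
        Console.WriteLine(CopiesWhenAppending(n, c => c * 2));
        // Growing by a fixed 1024 elements: total copies grow roughly as n^2 / 2048.
        Console.WriteLine(CopiesWhenAppending(n, c => c + 1024));
    }
}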
The idea behind the exponential growth factor for dynamic arrays such as List<T> is that:
The amount of wasted space is always merely proportional to the amount of data in the array. Thus you are never wasting resources on a more massive scale than you are properly using.
Even with many, many reallocations, the total potential time spent copying while creating an array of size N is O(N) -- or O(1) for a single element.
Access time is extremely fast at O(1) with a small coefficient.
This makes List<T> very appropriate for arrays of, say, in-memory tables of references to database objects, for which near-instant access is required but the array elements themselves are small.
Conversely, linear growth of dynamic arrays can result in n-squared memory wastage. This happens in the following situation:
You add something to the array, expanding it to size N for large N, freeing the previous memory block (possibly quite large) of size N-K for small K.
You allocate a few objects. The memory manager puts some in the large memory block just vacated, because why not?
You add something else to the array, expanding it to size N+K for some small K. Because the previously freed memory block now is sparsely occupied, the memory manager does not have a large enough contiguous free memory block and must request more virtual memory from the OS.
Thus virtual memory committed grows quadratically despite the measured size of objects created growing linearly.
This isn't a theoretical possibility. I actually had to fix an n-squared memory leak that arose because somebody had manually coded a linearly-growing dynamic array of integers. The fix was to throw away the manual code and use the library of geometrically-growing arrays that had been created for that purpose.
That being said, I also have seen problems with the exponential reallocation of List<T> (as well as the similarly-growing memory buffer in Dictionary<TKey,TValue>) in 32-bit processes when the total memory required needs to grow past 128 MB. In this case the List or Dictionary will frequently be unable to allocate a 256 MB contiguous range of memory even if there is more than sufficient virtual address space left. The application will then report an out-of-memory error to the user. In my case, customers complained about this since Task Manager was reporting that VM use never went over, say, 1.5 GB. If I were Microsoft I would damp the growth of List<T> (and the similar memory buffer in Dictionary<TKey,TValue>) to 1% of total virtual address space.

OutOfMemoryException on LOH

I have a simple C# application which tries to read and write a single row to a SQL Server database.
That row contains an Image field that I fill with 100 MB of data or a little more.
Every time, I get a System.OutOfMemoryException.
So, instead of reading and writing the byte array in a single operation, I would like to do it in several operations, appending the chunks together.
Is it possible to accomplish that through SqlDataAdapter?
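No answer was recorded for this question, but as a hedged sketch of one common workaround (using SqlCommand rather than SqlDataAdapter, and assuming the column can be changed from the deprecated image type to varbinary(max), which supports the .WRITE clause; the Documents/Data/Id names are placeholders):

using System.Data;
using System.Data.SqlClient;

public static class BlobWriter
{
    // Appends one chunk to the row's existing value; call once per chunk.
    // The row must already exist with a non-NULL value in Data (e.g. 0x).
    public static void AppendChunk(SqlConnection connection, int id, byte[] chunk)
    {
        const string sql =
            "UPDATE Documents SET Data.WRITE(@chunk, NULL, NULL) WHERE Id = @id";
        using (var cmd = new SqlCommand(sql, connection))
        {
            cmd.Parameters.AddWithValue("@id", id);
            cmd.Parameters.Add("@chunk", SqlDbType.VarBinary, -1).Value = chunk;
            cmd.ExecuteNonQuery();   // a NULL offset makes .WRITE append to the value
        }
    }
}

Reading back in pieces works similarly with a SqlDataReader opened with CommandBehavior.SequentialAccess and repeated GetBytes calls, so the full 100 MB never has to be materialized at once.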

Fastest / most efficient way to work with very large Lists (in C#)

This question is related to the discussion here but can also be treated stand-alone.
Also, I think it would be nice to have the relevant results in one separate thread. I couldn't find anything on the web that deals with the topic comprehensively.
Let's say we need to work with a very very large List<double> of Data (e.g. 2 Billion entries).
Having such a large list loaded into memory either leads to a System.OutOfMemoryException,
or, if one enables <gcAllowVeryLargeObjects enabled="true" />, eventually just eats up the entire memory.
First: How to minimize the size of the container:
would an array of 2 billion doubles be less expensive?
use decimals instead of doubles?
is there a data type even less expensive than decimal in C#? (-100 to 100 in 0.00001 steps would do the job for me)
Second: Saving the data to disc and reading it
What would be the fastest way to save the List to disk and read it again?
Here I think the size of the file could be a problem. A text file containing 2 billion entries will be huge; opening it could take forever. (Perhaps some kind of stream between the program and the file would do the job?) Some sample code would be much appreciated. :)
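No answer was recorded here, but as a minimal hedged sketch of the second point (the class and method names are my own): write the values as raw 8-byte doubles through a buffered FileStream and stream them back lazily, so the whole collection never has to sit in memory at once.

using System.Collections.Generic;
using System.IO;

public static class DoubleFile
{
    public static void Save(string path, IEnumerable<double> values)
    {
        // 1 MB write buffer; each value is 8 bytes on disk, far smaller than text.
        using (var stream = new FileStream(path, FileMode.Create, FileAccess.Write,
                                           FileShare.None, 1 << 20))
        using (var writer = new BinaryWriter(stream))
        {
            foreach (double v in values)
                writer.Write(v);
        }
    }

    public static IEnumerable<double> Read(string path)
    {
        using (var reader = new BinaryReader(File.OpenRead(path)))
        {
            long count = reader.BaseStream.Length / sizeof(double);
            for (long i = 0; i < count; i++)
                yield return reader.ReadDouble();   // yielded one value at a time
        }
    }
}

The Read method uses yield return, which also touches on the iteration point below: throughput is limited mainly by disk read speed.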
Third: Iteration
If the List is in memory, I think using yield would make sense.
If the List is saved to disk, the speed of iteration is mostly limited by how fast we can read it from the disk.

I need a very big array length (size) in C#

public double[] result = new double[ ??? ];
I am storing results, and the total number of results is bigger than 2,147,483,647, which is the maximum Int32.
I tried BigInteger, ulong, etc., but all of them gave me errors.
How can I get an array that can store more than 50,147,483,647 results (doubles)?
Thanks...
An array of 2,147,483,648 doubles will occupy 16 GB of memory. For some people, that's not a big deal. I've got servers that won't even bother to hit the page file if I allocate a few of those arrays. That doesn't mean it's a good idea.
When you are dealing with huge amounts of data like that you should be looking to minimize the memory impact of the process. There are several ways to go with this, depending on how you're working with the data.
Sparse Arrays
If your array is sparsely populated - lots of default/empty values with a small percentage of actually valid/useful data - then a sparse array can drastically reduce the memory requirements. You can write various implementations to optimize for different distribution profiles: random distribution, grouped values, arbitrary contiguous groups, etc.
Works fine for any type of contained data, including complex classes. Has some overheads, so can actually be worse than naked arrays when the fill percentage is high. And of course you're still going to be using memory to store your actual data.
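A hedged sketch of the simplest possible variant (a dictionary keyed by index; the class name is my own, and a real implementation would likely be tuned to the expected distribution profile):

using System.Collections.Generic;

public class SparseArray<T>
{
    // Only non-default entries are stored, so memory tracks the fill percentage.
    private readonly Dictionary<long, T> _items = new Dictionary<long, T>();

    public T this[long index]
    {
        get
        {
            T value;
            return _items.TryGetValue(index, out value) ? value : default(T);
        }
        set { _items[index] = value; }
    }
}

Each stored entry carries dictionary overhead, which is why this loses to a plain array once the fill percentage gets high.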
Simple Flat File
Store the data on disk, create a read/write FileStream for the file, and enclose that in a wrapper that lets you access the file's contents as if it were an in-memory array. The simplest implementation of this will give you reasonable usefulness for sequential reads from the file. Random reads and writes can slow you down, but you can do some buffering in the background to help mitigate the speed issues.
This approach works for any type that has a static size, including structures that can be copied to/from a range of bytes in the file. Doesn't work for dynamically-sized data like strings.
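A minimal hedged sketch of such a wrapper for doubles (class name assumed, no background buffering, and no error handling):

using System;
using System.IO;

public class FileBackedDoubleArray : IDisposable
{
    private readonly FileStream _stream;
    private readonly byte[] _buffer = new byte[sizeof(double)];

    public FileBackedDoubleArray(string path, long length)
    {
        _stream = new FileStream(path, FileMode.OpenOrCreate, FileAccess.ReadWrite);
        _stream.SetLength(length * sizeof(double));   // pre-size the file
    }

    public long Length
    {
        get { return _stream.Length / sizeof(double); }
    }

    // Index the file as if it were a double[]: seek, then read or write 8 bytes.
    public double this[long index]
    {
        get
        {
            _stream.Seek(index * sizeof(double), SeekOrigin.Begin);
            _stream.Read(_buffer, 0, _buffer.Length);
            return BitConverter.ToDouble(_buffer, 0);
        }
        set
        {
            _stream.Seek(index * sizeof(double), SeekOrigin.Begin);
            _stream.Write(BitConverter.GetBytes(value), 0, sizeof(double));
        }
    }

    public void Dispose() { _stream.Dispose(); }
}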
Complex Flat File
If you need to handle dynamic-size records, sparse data, etc. then you might be able to design a file format that can handle it elegantly. Then again, a database is probably a better option at this point.
Memory Mapped File
Same as the other file options, but using a different mechanism to access the data. See System.IO.MemoryMappedFile for more information on how to use Memory Mapped Files from .NET.
Database Storage
Depending on the nature of the data, storing it in a database might work for you. For a large array of doubles this is unlikely to be a great option, however. There are the overheads of reading/writing data in the database, plus the storage overheads: each row will at least need a row identity, probably a BIGINT (8-byte integer) for a large recordset, doubling the size of the data right off the bat. Add in the overheads for indexing, row storage, etc. and you can very easily multiply the size of your data.
Databases are great for storing and manipulating complicated data. That's what they're for. If you have variable-width data - strings and the like - then a database is probably one of your best options. The flip-side is that they're generally not an optimal solution for working with large amounts of very simple data.
Whichever option you go with, you can create an IList<T>-compatible class that encapsulates your data. This lets you write code that doesn't have any need to know how the data is stored, only what it is.
BCL arrays cannot do that.
Someone wrote a chunked BigArray<T> class that can.
However, that will not magically create enough memory to store it.
You can't. Even with gcAllowVeryLargeObjects, the maximum size of any single dimension in an array (of non-byte elements) is 2,146,435,071.
So you'll need to rethink your design, or use an alternative implementation such as a jagged array.
Another possible approach is to implement your own BigList. First, note that List<T> is backed by an array. Also, you can set the initial capacity of the List in the constructor, so if you know it will be big, grab a big chunk of memory up front.
Then
public class myBigList<T> : List<List<T>>
{
}
or, maybe preferably, use a has-a approach:
public class myBigList<T>
{
    List<List<T>> theList;
}
In doing this you will need to re-implement the indexer, using division and modulo to find the correct indexes into your backing store. You can then use a BigInteger (or simply a long) as the index; in your custom indexer you decompose it into two legally sized ints.
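A hedged sketch of the has-a version (the chunk size and member names are my own; a plain long index is already enough here, since 50,147,483,647 fits comfortably in Int64):

using System.Collections.Generic;

public class myBigList<T>
{
    private const int ChunkSize = 1 << 20;               // ~1M elements per inner list
    private readonly List<List<T>> theList = new List<List<T>>();

    public void Add(T item)
    {
        if (theList.Count == 0 || theList[theList.Count - 1].Count == ChunkSize)
            theList.Add(new List<T>(ChunkSize));          // start a new chunk when full
        theList[theList.Count - 1].Add(item);
    }

    // Division picks the inner list, modulo picks the position within it.
    public T this[long index]
    {
        get { return theList[(int)(index / ChunkSize)][(int)(index % ChunkSize)]; }
        set { theList[(int)(index / ChunkSize)][(int)(index % ChunkSize)] = value; }
    }
}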
I ran into the same problem. I solved it using a list of lists, which mimics an array very well but can go well beyond the 2 GB limit, e.g. List<List<sbyte>>. It worked for a 250k x 250k matrix of sbyte running on a 32 GB computer, even though this elephant represents 60 GB+ of space. :-)
C# arrays are limited in size to System.Int32.MaxValue.
For bigger than that, use List<T> (where T is whatever you want to hold).
More here: What is the Maximum Size that an Array can hold?

data structure for indexing big file

I need to build an index for a very big (50 GB+) ASCII text file which will enable me to provide fast random read access to the file (get the nth line, get the nth word in the nth line). I've decided to use a List<List<long>> map, where the map[i][j] element is the position of the jth word of the ith line in the file.
I will build the index sequentially, i.e. by reading the whole file and populating the index with map.Add(new List<long>()) (new line) and map[i].Add(position) (new word). I will then retrieve a specific word position with map[i][j].
The only problem I see is that I can't predict the total count of lines/words, so I will hit an O(n) copy on every List reallocation, and I have no idea how to avoid this.
Are there any other problems with the data structure I chose for the task? Which structure could be better?
UPD: The file will not be altered at runtime. There are no other ways to retrieve the content except the ones I've listed.
Increasing the size of a large list is a very expensive operation, so it's better to reserve the list size at the beginning.
I'd suggest using two lists. The first contains the positions of words within the file, and the second contains indexes into the first list (the index of the first word of each line).
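A hedged sketch of that two-list layout (the class and member names are my own):

using System.Collections.Generic;

public class FileIndex
{
    // Byte offset of every word, in file order.
    private readonly List<long> wordPositions = new List<long>();
    // For each line, the index of its first word inside wordPositions.
    private readonly List<int> lineStarts = new List<int>();

    public void BeginLine() { lineStarts.Add(wordPositions.Count); }

    public void AddWord(long bytePosition) { wordPositions.Add(bytePosition); }

    // Byte offset of the j-th word of the i-th line.
    public long PositionOf(int line, int word)
    {
        return wordPositions[lineStarts[line] + word];
    }
}

For word counts beyond int.MaxValue, the inner index would itself need to be a long and the lists would need chunking, as discussed earlier in this thread.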
You are very likely to exceed all available RAM, and when the system starts to page GC-managed memory in and out, the program's performance will be completely killed. I'd suggest storing your data in a memory-mapped file rather than in managed memory: http://msdn.microsoft.com/en-us/library/dd997372.aspx
UPD: Memory-mapped files are effective when you need to work with huge amounts of data that don't fit in RAM. Basically, they're your only choice if your index becomes bigger than the available RAM.
