C# Quick bit array

As stated in the title, I am evaluating the cost of implementing a BitArray over byte[] (I have understood that the native BitArray is pretty slow) instead of using a string representation of bits (e.g. "001001001"), but I am open to any suggestion that is more effective.
The length of the array is not known at design time, but I suppose it may be between 200 and 500 bits per array.
Memory is not a concern, so using a lot of memory to represent the array is not an issue; what matters is speed when the array is created and manipulated (they will be manipulated a lot).
Thanks in advance for your consideration and suggestions on the topic.

A few suggestions:
1) Computers don't process individual bits, so even an int or long will work at the same speed
2) For speed you can consider writing it with unsafe code
3) New is expensive. If the objects are created a lot you can do the following: create a bulk of 10K objects at a time and serve them from a method when required. Once the cache runs out you can recreate them. Have another method that, once an object's processing completes, cleans it up and returns it to the cache (a minimal pooled sketch follows this list)
4) Make sure your manipulation is optimal
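For illustration, here is a minimal sketch along those lines: a fixed-capacity bit set stored in ulong[] words, plus a trivial pool as in suggestion 3. The names (FastBitSet, BitSetPool) and the 512-bit capacity are assumptions based on the 200-500 bit estimate above, not an existing API.

using System;
using System.Collections.Generic;

// Minimal sketch: a bit set stored in ulong[] words, plus a trivial pool.
// FastBitSet and BitSetPool are illustrative names, not an existing API.
public sealed class FastBitSet
{
    private readonly ulong[] words;

    public FastBitSet(int bitCount)
    {
        words = new ulong[(bitCount + 63) / 64];
    }

    public bool Get(int index)
    {
        return (words[index >> 6] & (1UL << (index & 63))) != 0;
    }

    public void Set(int index, bool value)
    {
        if (value)
            words[index >> 6] |= 1UL << (index & 63);
        else
            words[index >> 6] &= ~(1UL << (index & 63));
    }

    public void Clear()
    {
        Array.Clear(words, 0, words.Length);
    }
}

public static class BitSetPool
{
    private const int MaxBits = 512; // assumed upper bound, from the 200-500 bit estimate
    private static readonly Stack<FastBitSet> cache = new Stack<FastBitSet>();

    public static FastBitSet Rent()
    {
        // Reuse a cleared instance when available; otherwise allocate a new one.
        return cache.Count > 0 ? cache.Pop() : new FastBitSet(MaxBits);
    }

    public static void Return(FastBitSet set)
    {
        set.Clear();
        cache.Push(set);
    }
}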


I need a very big array length (size) in C#

public double[] result = new double[ ??? ];
I am storing results, and the total number of results is bigger than 2,147,483,647, which is the max Int32 value.
I tried BigInteger, ulong, etc., but all of them gave me errors.
How can I extend the size of the array so that it can store > 50,147,483,647 results (double) inside it?
Thanks...
An array of 2,147,483,648 doubles will occupy 16GB of memory. For some people, that's not a big deal. I've got servers that won't even bother to hit the page file if I allocate a few of those arrays. Doesn't mean it's a good idea.
When you are dealing with huge amounts of data like that you should be looking to minimize the memory impact of the process. There are several ways to go with this, depending on how you're working with the data.
Sparse Arrays
If your array is sparsely populated - lots of default/empty values with a small percentage of actually valid/useful data - then a sparse array can drastically reduce the memory requirements. You can write various implementations to optimize for different distribution profiles: random distribution, grouped values, arbitrary contiguous groups, etc.
Works fine for any type of contained data, including complex classes. Has some overheads, so can actually be worse than naked arrays when the fill percentage is high. And of course you're still going to be using memory to store your actual data.
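As a rough illustration of one such implementation (SparseArray is an invented name, and the default value is assumed to be 0.0):

using System.Collections.Generic;

// Sketch of a sparse array: only non-default values are stored.
// SparseArray is an illustrative name, not a BCL type; the default value is 0.0.
public sealed class SparseArray
{
    private readonly Dictionary<long, double> values = new Dictionary<long, double>();

    public long Length { get; private set; }

    public SparseArray(long length)
    {
        Length = length;
    }

    public double this[long index]
    {
        get
        {
            double value;
            return values.TryGetValue(index, out value) ? value : 0.0;
        }
        set
        {
            if (value == 0.0)
                values.Remove(index);   // don't spend memory on default entries
            else
                values[index] = value;
        }
    }
}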
Simple Flat File
Store the data on disk, create a read/write FileStream for the file, and enclose that in a wrapper that lets you access the file's contents as if it were an in-memory array. The simplest implementation of this will give you reasonable usefulness for sequential reads from the file. Random reads and writes can slow you down, but you can do some buffering in the background to help mitigate the speed issues.
This approach works for any type that has a static size, including structures that can be copied to/from a range of bytes in the file. Doesn't work for dynamically-sized data like strings.
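A bare-bones sketch of that wrapper for doubles, with no buffering or error handling (FileBackedDoubleArray is an invented name):

using System;
using System.IO;

// Sketch: expose a file of doubles as if it were an in-memory array.
// Every access seeks, so sequential use is reasonable; random access is slow.
public sealed class FileBackedDoubleArray : IDisposable
{
    private readonly FileStream stream;

    public FileBackedDoubleArray(string path, long length)
    {
        stream = new FileStream(path, FileMode.OpenOrCreate, FileAccess.ReadWrite);
        stream.SetLength(length * sizeof(double));
    }

    public long Length
    {
        get { return stream.Length / sizeof(double); }
    }

    public double this[long index]
    {
        get
        {
            stream.Position = index * sizeof(double);
            byte[] buffer = new byte[sizeof(double)];
            stream.Read(buffer, 0, buffer.Length);
            return BitConverter.ToDouble(buffer, 0);
        }
        set
        {
            stream.Position = index * sizeof(double);
            stream.Write(BitConverter.GetBytes(value), 0, sizeof(double));
        }
    }

    public void Dispose()
    {
        stream.Dispose();
    }
}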
Complex Flat File
If you need to handle dynamic-size records, sparse data, etc. then you might be able to design a file format that can handle it elegantly. Then again, a database is probably a better option at this point.
Memory Mapped File
Same as the other file options, but using a different mechanism to access the data. See System.IO.MemoryMappedFile for more information on how to use Memory Mapped Files from .NET.
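For example, something along these lines; the file name, capacity and indices are placeholders, and note that the stated capacity reserves the full file size on disk:

using System;
using System.IO;
using System.IO.MemoryMappedFiles;

class MmfExample
{
    static void Main()
    {
        const long count = 3000000000L;   // more doubles than a single array can hold
        using (var mmf = MemoryMappedFile.CreateFromFile(
                   "results.bin", FileMode.OpenOrCreate, null, count * sizeof(double)))
        using (var accessor = mmf.CreateViewAccessor())
        {
            long i = 2500000000L;                       // an index past int.MaxValue
            accessor.Write(i * sizeof(double), 3.14);   // store one double
            double value = accessor.ReadDouble(i * sizeof(double));
            Console.WriteLine(value);
        }
    }
}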
Database Storage
Depending on the nature of the data, storing it in a database might work for you. For a large array of doubles this is unlikely to be a great option, however. There are the overheads of reading/writing data in the database, plus the storage overheads: each row will at least need a row identity, probably a BIGINT (8-byte integer) for a large recordset, doubling the size of the data right off the bat. Add in the overheads for indexing, row storage, etc. and you can very easily multiply the size of your data.
Databases are great for storing and manipulating complicated data. That's what they're for. If you have variable-width data - strings and the like - then a database is probably one of your best options. The flip-side is that they're generally not an optimal solution for working with large amounts of very simple data.
Whichever option you go with, you can create an IList<T>-compatible class that encapsulates your data. This lets you write code that doesn't have any need to know how the data is stored, only what it is.
BCL arrays cannot do that.
Someone wrote a chunked BigArray<T> class that can.
However, that will not magically create enough memory to store it.
You can't. Even with gcAllowVeryLargeObjects enabled, the maximum number of elements in any single dimension of an array (of non-byte types) is 2,146,435,071.
So you'll need to rethink your design, or use an alternative implementation such as a jagged array.
Another possible approach is to implement your own BigList. First note that List is implemented as an array. Also, you can set the initial size of the List in the constructor, so if you know it will be big, get a big chunk of memory up front.
Then
public class myBigList<T> : List<List<T>>
{
}
or, maybe preferably, use a has-a approach:
public class myBigList<T>
{
    List<List<T>> theList;
}
In doing this you will need to re-implement the indexer, so you can use division and modulo to find the correct indexes into your backing store. You can then use a BigInt as the index; in your custom indexer you decompose the BigInt into two legal-sized ints.
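A rough sketch of the has-a variant with a long-based indexer; the chunk size, class name and members are arbitrary choices for illustration:

using System.Collections.Generic;

// Sketch: a "big list" chunked into inner lists so no single allocation
// exceeds the per-object limit. MyBigList, ChunkSize etc. are illustrative.
public class MyBigList<T>
{
    private const int ChunkSize = 1 << 20;   // about one million elements per chunk
    private readonly List<List<T>> chunks = new List<List<T>>();

    public void Add(T item)
    {
        if (chunks.Count == 0 || chunks[chunks.Count - 1].Count == ChunkSize)
            chunks.Add(new List<T>(ChunkSize));   // reserve a full chunk up front
        chunks[chunks.Count - 1].Add(item);
    }

    // Division finds the chunk, modulo finds the position inside it.
    public T this[long index]
    {
        get { return chunks[(int)(index / ChunkSize)][(int)(index % ChunkSize)]; }
        set { chunks[(int)(index / ChunkSize)][(int)(index % ChunkSize)] = value; }
    }

    public long Count
    {
        get
        {
            long count = 0;
            foreach (List<T> chunk in chunks)
                count += chunk.Count;
            return count;
        }
    }
}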
I ran into the same problem. I solved it using a list of lists, which mimics an array very well but can go well beyond the 2 GB limit, e.g. List<List<sbyte>>. It worked for a 250k x 250k matrix of sbyte running on a 32 GB computer, even though this elephant represents 60 GB+ of space :-)
C# arrays are limited in size to System.Int32.MaxValue.
For bigger than that, use List<T> (where T is whatever you want to hold).
More here: What is the Maximum Size that an Array can hold?

How can I reduce the garbage generation in this situation

My game has gotten to the point where it's generating too much garbage, resulting in long GC times. I've been going around and reducing a lot of the garbage generated, but there's one spot that's allocating a large amount of memory too frequently and I'm stuck on how to resolve it.
My game is a minecraft-type world that generates new regions as you walk. I have a large, variable size array that is allocated on the creation of a new region that is used to store the vertex data for the terrain. After the array is filled with data, it's passed to a slimdx DataStream so it can be used for rendering.
The problem is the fact that this is a variable-size array and that it needs to be passed to SlimDX, which calls GCHandle.Alloc on it. Since it's a variable size, it may have to be resized in order to reuse it. I also can't just allocate a max-sized array for each region because it would require impossibly large amounts of memory. I can't use a List because of the GCHandle business with SlimDX.
So far, resizing the array only when it needs to be made bigger seems to be the only plausible option to me, but it may not work out that well and will likely be a pain to implement. I'd need to keep track of the actual size of the array separately and use unsafe code to get a pointer to the array and pass that to slimdx. It may also end up eventually using such a large amount of memory that I have to occasionally go and reduce the size of all the arrays down to the minimum needed.
I'm hesitant to jump at this solution and would like to know if anyone sees any better solutions to this.
I'd suggest a tighter integration with the slimdx library. It's open source so you could dig in and find the critical path that you need for the rendering. Then you could integrate tighter by using a DMA-style memory sharing approach.
Since SlimDX is open source and it is too slow, the time has come to change the open source to suit your performance needs. What I see here is that you want to keep a much larger array but hand SlimDX only the actually used region, to prevent additional memory allocations for this potentially huge array.
There is a type in the .NET Framework named ArraySegment which was made exactly for this purpose.
// Adapted from the MSDN ArraySegment<T> example.
using System;

class ArraySegmentExample
{
    static void Main()
    {
        // Create and initialize a new string array.
        String[] myArr = { "The", "quick", "brown", "fox", "jumps", "over", "the",
                           "lazy", "dog" };

        // Define an array segment that contains the middle five values of the array.
        ArraySegment<String> myArrSegMid = new ArraySegment<String>(myArr, 2, 5);
        PrintIndexAndValues(myArrSegMid);
    }

    public static void PrintIndexAndValues(ArraySegment<String> arrSeg)
    {
        for (int i = arrSeg.Offset; i < (arrSeg.Offset + arrSeg.Count); i++)
        {
            Console.WriteLine(" [{0}] : {1}", i, arrSeg.Array[i]);
        }
        Console.WriteLine();
    }
}
That said, I have found the usage of ArraySegment somewhat strange, because I always have to work with the offset plus the index; it just does not behave like a regular array. Instead you can distill your own struct which allows zero-based indexing, which is much easier to use but comes at the cost that every index-based access costs you an add of the base offset. But if the usage pattern is mainly foreach loops then it does not really matter.
I have had situations where ArraySegment was also too costly, because you allocate a struct every time and pass it to all methods by value on the stack. You need to watch closely where its usage is OK and whether it is allocated at too high a rate.
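Such a zero-based wrapper might look like the following sketch (ArraySlice is an invented name; it assumes the caller keeps the underlying array alive and passes a valid offset/count):

// Sketch: a zero-based view over part of an array. Each indexer access pays
// one extra add for the base offset, as described above.
public struct ArraySlice<T>
{
    private readonly T[] array;
    private readonly int offset;
    public readonly int Count;

    public ArraySlice(T[] array, int offset, int count)
    {
        this.array = array;
        this.offset = offset;
        this.Count = count;
    }

    public T this[int index]
    {
        get { return array[offset + index]; }
        set { array[offset + index] = value; }
    }
}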
I sympathize with your problem with an older library, SlimDX, which may not be fully .NET compliant. I have dealt with such a situation.
Suggestions:
Use a more performance-efficient generic list or array type, like ArrayList. It keeps track of the size of the array so you don't have to. Allocate the list in chunks, e.g. 100 elements at a time.
Use C++ .NET and take advantage of unsafe arrays or .NET classes like ArrayList.
Updated: use the idea of virtual memory. Save some data to an XML file or SQL database, relieving a huge amount of memory.
I realize it's a gamble either way.

C# dictionary - how to solve limit on number of items?

I am using a Dictionary and I need to store almost 13,000,000 keys in it. Unfortunately, after adding the 11,950,000th key I got the exception "System out of memory". Is there any solution to this problem? I will need my program to run on less powerful computers than mine in the future.
I need that many keys because I need to store pairs - sequence name and sequence length; it is for solving a bioinformatics-related problem.
Any help will be appreciated.
Buy more memory, install a 64-bit version of the OS and recompile for 64 bits. No, I'm not kidding. If you want that many objects... in RAM... and then call it a "feature"... Even the new Android can require 16 GB of memory to be compiled...
I was forgetting... You could begin by reading C# array of objects, very large, looking for a better way
Do you know how much 13 million objects is?
To make a comparison, a 32-bit Windows app has access to less than 2 GB of address space. So that's 2 billion bytes (give or take)... 2 billion / 13 million = something around 150 bytes/object. Now, if we consider how much a reference type occupies... it's quite easy to eat 150 bytes.
I'll add something: I've looked in my Magic 8-Ball and it told me: show us your code. If you don't tell us what you are using for the key and the values, how should we be able to help you? What are you using, class or struct or "primitive" types? Tell us the "size" of your TKey and TValue. Sadly our crystal ball broke yesterday :-)
C# is not a language that was designed to solve heavy-duty scientific computation problems. It is absolutely possible to use C# to build tools that do what you want, but the off-the-shelf parts like Dictionary were designed to solve more common business problems, like mapping zip codes to cities and that sort of thing.
You're going to have to go with external storage of some sort. My recommendation would be to buy a database and use it to store your data. Then use a DataSet or some similar technology to load portions of the data into memory, manipulate them, and then pour more data from the database into the DataSet, and so on.
Well, I had almost exactly the same problem.
I wanted to load about 12.5 million [string, int]s into a dictionary from a database (for all the programming "gods" above who don't understand why, the answer is that it is enormously quicker when you are working with a 150 GB database if you can cache a proportion of one of the key tables in memory).
It annoyingly threw an out-of-memory exception at pretty much the same place - just under the 12 million mark - even though the process was only consuming about 1.3 GB of memory (reduced to about 800 MB after a judicious change in the db read method to not try to do it all at once), despite running on an i7 with 8 GB of memory.
The solution was actually remarkably simple -
in Visual Studio (2010) in Solution Explorer right click the project and select properties.
In the Build tab set Platform Target to x64 and rebuild.
It rattles through the load into the Dictionary in a few seconds and the Dictionary performance is very good.
An easy solution is to just use a simple DB. The most obvious choice in this case, IMHO, is SQLite .NET: fast, easy and with a low memory footprint.
I think that you need a new approach to your processing.
I must assume that you obtain the data from a file or a database, either way that is where it should remain.
There is no way you can actually increase the limit on the number of values stored within a Dictionary, other than increasing system memory, but either way it is an extremely inefficient means of processing such a large amount of data.
You should rethink your algorithm so that you can process the data in more manageable portions. It will mean processing it in stages until you get your result. This may mean many hundreds of passes through the data, but it's the only way to do it.
I would also suggest that you look at using generics to help speed up this repetitive processing and cut down on memory usage.
Remember that there will still be a balancing act between system performance and access to externally stored data (be it external disk store or database).
It is not a problem with the Dictionary object, but with the available memory in your server. I've done some investigation to understand the failures of the Dictionary object, but it never failed. Below is the code for your reference:
private static void TestDictionaryLimit()
{
    int intCnt = 0;
    Dictionary<long, string> dItems = new Dictionary<long, string>();
    Console.WriteLine("Total number of iterations = {0}", long.MaxValue);
    Console.WriteLine("....");
    for (long lngCnt = 0; lngCnt < long.MaxValue; lngCnt++)
    {
        if (lngCnt < 11950020)
            dItems.Add(lngCnt, lngCnt.ToString());
        else
            break;
        if (lngCnt % 100000 == 0)
            Console.Write(intCnt++);
    }
    Console.WriteLine("Completed..");
    Console.WriteLine("{0} number of items in dictionary", dItems.Count);
}
The above code executes properly, and stores more than the number of items that you have mentioned.
Really, 13,000,000 items are quite a lot.
If those 13,000,000 items are allocated classes, that is a very deep kick into the garbage collector's stomach!
Also, even if you find a way to use the default .NET Dictionary, the performance would be really bad: with too many keys, the number of keys approaches the number of values a 31-bit hash can represent, performance will be awful in whatever system you use, and of course memory usage will be too much!
If you need a data structure that can use more memory than a hash table, you probably need a custom hash table mixed with a custom binary tree data structure.
Yes, it is possible to write your own combination of the two.
You cannot rely on the .NET hashtable for sure for such a strange and specific problem.
Consider that a tree has a lookup complexity of O(log n), while its building complexity is O(n * log n); of course, building it will take long.
You could then build a hashtable of binary trees (or vice versa) that lets you use both data structures while consuming less memory; a small sketch follows.
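As a very rough sketch of that idea, using SortedDictionary (which is tree-based) as the per-bucket structure; the class name and bucket count are arbitrary:

using System;
using System.Collections.Generic;

// Sketch: hash each key into a fixed number of buckets, each of which is a
// tree-based SortedDictionary. Names and bucket count are illustrative only.
public class HashedTreeMap<TKey, TValue> where TKey : IComparable<TKey>
{
    private readonly SortedDictionary<TKey, TValue>[] buckets;

    public HashedTreeMap(int bucketCount)
    {
        buckets = new SortedDictionary<TKey, TValue>[bucketCount];
        for (int i = 0; i < bucketCount; i++)
            buckets[i] = new SortedDictionary<TKey, TValue>();
    }

    private SortedDictionary<TKey, TValue> BucketFor(TKey key)
    {
        int bucket = (key.GetHashCode() & 0x7FFFFFFF) % buckets.Length;
        return buckets[bucket];
    }

    public void Add(TKey key, TValue value)
    {
        BucketFor(key).Add(key, value);
    }

    public bool TryGetValue(TKey key, out TValue value)
    {
        return BucketFor(key).TryGetValue(key, out value);
    }
}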
Then, think about compiling it in 32-bit mode, not 64-bit mode: 64-bit mode uses more memory for pointers.
At the same time the opposite may be true: a 32-bit address space may not be sufficient for your problem.
It has never happened to me to have a problem that could run out of 32-bit address space!
If both keys and values are simple value types, I would suggest you write your data structure in a C DLL and use it from C#.
You can try to write a dictionary of dictionaries.
Let's say you can split your data into chunks of 500,000 items across 26 dictionaries, for example, but the occupied memory would still be very, very big; I don't think your system will handle it.
using System.Collections.Generic;

public class MySuperDictionary<TKey, TValue>
{
    private readonly Dictionary<TKey, TValue>[] dictionaries;

    public MySuperDictionary()
    {
        this.dictionaries = new Dictionary<TKey, TValue>[373]; // must be a prime number.
        for (int i = 0; i < dictionaries.Length; ++i)
            dictionaries[i] = new Dictionary<TKey, TValue>(13000000 / dictionaries.Length);
    }

    public void Add(TKey key, TValue value)
    {
        int bucket = (GetSecondaryHashCode(key) & 0x7FFFFFFF) % dictionaries.Length;
        dictionaries[bucket].Add(key, value);
    }

    public bool Remove(TKey key)
    {
        int bucket = (GetSecondaryHashCode(key) & 0x7FFFFFFF) % dictionaries.Length;
        return dictionaries[bucket].Remove(key);
    }

    public bool TryGetValue(TKey key, out TValue result)
    {
        int bucket = (GetSecondaryHashCode(key) & 0x7FFFFFFF) % dictionaries.Length;
        return dictionaries[bucket].TryGetValue(key, out result);
    }

    public static int GetSecondaryHashCode(TKey key)
    {
        // Here you should return a hash code for the key, possibly using a
        // different hashing algorithm than the one the inner dictionaries use.
        // As a simple placeholder this just remixes the default hash code.
        int h = key.GetHashCode();
        return (h * 31) ^ (h >> 16);
    }
}
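Usage would then look something like this (the string/int type arguments and the sample key are just placeholders matching the question's name/length pairs):

var map = new MySuperDictionary<string, int>();
map.Add("sequence_name_1", 151);             // sequence name -> sequence length

int length;
if (map.TryGetValue("sequence_name_1", out length))
    System.Console.WriteLine(length);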
With that many keys, you should either use a database or something like memcache while swapping out chunks of the cache in storage. I'm doubting you need all of the items at once, and if you do, there's no way it's going to work on a low-powered machine with little RAM.

In C#/.NET does a dynamic type take less space than object?

I have a console application that allows the users to specify variables to process. These variables come in three flavors: string, double and long (with double and long being by far the most commonly used types). The user can specify whatever variables they like and in whatever order, so my system has to be able to handle that. To this end, in my application I had been storing these as object and then casting/uncasting them as required. For example:
public class UnitResponse
{
    public object Value { get; set; }
}
My understanding was that boxed objects take up a bit more memory (about 12 bytes) than a standard value type.
My question is: would it be more efficient to use the dynamic keyword to store these values? It might get around the boxing/unboxing issue, and if it is more efficient how would this impact performance?
EDIT
To provide some context and prevent the "are you sure you're using enough RAM to worry about this" in my worst case I have 420,000,000 datapoints to worry about (60 variables * 7,000,000 records). This is in addition to a bunch of other data I keep about each variable (including a few booleans, etc.). So reducing memory does have a HUGE impact.
OK, so the real question here is "I've got a freakin' enormous data set that I am storing in memory, how do I optimize its performance in both time and memory space?"
Several thoughts:
You are absolutely right to hate and fear boxing. Boxing has big costs. First, yes, boxed objects take up extra memory. Second, boxed objects get stored on the heap, not on the stack or in registers. Third, they are garbage collected; every single one of those objects has to be interrogated at GC time to see if it contains a reference to another object, which it never will, and that's a lot of time on the GC thread. You almost certainly need to do something to avoid boxing.
Dynamic ain't it; it's boxing plus a whole lot of other overhead. (C#'s dynamic is very fast compared to other dynamic dispatch systems, but it is not fast or small in absolute terms).
It's gross, but you could consider using a struct whose layout shares memory between the various fields - like a union in C. Doing so is really really gross and not at all safe but it can help in situations like these. Do a web search for "StructLayoutAttribute"; you'll find tutorials.
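A sketch of the union-style layout for the numeric cases (the type name and tag values are invented; a reference-typed field such as a string cannot safely share bytes with these, so strings would have to live elsewhere):

using System.Runtime.InteropServices;

// Sketch: the long and the double share the same 8 bytes, C-union style.
[StructLayout(LayoutKind.Explicit)]
public struct NumericVariant
{
    [FieldOffset(0)] public long AsLong;
    [FieldOffset(0)] public double AsDouble;
    [FieldOffset(8)] public byte Kind;   // e.g. 0 = long, 1 = double
}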
Long, double or string, really? Can't be int, float or string? Is the data really either in excess of several billion in magnitude or accurate to 15 decimal places? Wouldn't int and float do the job for 99% of the cases? They're half the size.
Normally I don't recommend using float over double because it's a false economy; people often economize this way when they have ONE number, as if the savings of four bytes is going to make the difference. But the difference between 42 million floats and 42 million doubles is considerable.
Is there regularity in the data that you can exploit? For example, suppose that of your 42 million records, there are only 100000 actual values for, say, each long, 100000 values for each double, and 100000 values for each string. In that case, you make an indexed storage of some sort for the longs, doubles and strings, and then each record gets an integer where the low bits are the index, and the top two bits indicate which storage to get it out of. Now you have 42 million records each containing an int, and the values are stored away in some nicely compact form somewhere else.
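A sketch of that packing scheme, assuming at most 2^30 entries per value pool (the names are invented):

// Sketch: each record holds a single int; the top two bits say which pool the
// value lives in and the low 30 bits index into that pool.
static class PackedRef
{
    public const int TagLong = 0, TagDouble = 1, TagString = 2;

    public static int Pack(int tag, int index)
    {
        return (tag << 30) | index;                // index must fit in 30 bits
    }

    public static int TagOf(int packed)
    {
        return (int)((uint)packed >> 30);
    }

    public static int IndexOf(int packed)
    {
        return packed & 0x3FFFFFFF;
    }
}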
Store the booleans as bits in a byte; write properties to do the bit shifting to get 'em out. Save yourself several bytes that way.
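For the booleans, something like this sketch (the struct and flag names are invented):

// Sketch: pack several bool flags into one byte and expose them as properties.
public struct RecordFlags
{
    private byte bits;

    public bool IsActive
    {
        get { return (bits & 0x01) != 0; }
        set { if (value) bits |= 0x01; else bits &= 0xFE; }
    }

    public bool IsDirty
    {
        get { return (bits & 0x02) != 0; }
        set { if (value) bits |= 0x02; else bits &= 0xFD; }
    }
}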
Remember that memory is actually disk space; RAM is just a convenient cache on top of it. If the data set is going to be too large to keep in RAM then something is going to page it back out to disk and read it back in later; that could be you or it could be the operating system. It is possible that you know more about your data locality than the operating system does. You could write your data to disk in some conveniently pageable form (like a b-tree) and be more efficient about keeping stuff on disk and only bringing it in to memory when you need it.
I think you might be looking at the wrong thing here. Remember what dynamic does. It starts the compiler again, in process, at runtime. It loads hundreds of thousands of bytes of code for the compiler, and then at every call site it emits caches that contain the results of the freshly-emitted IL for each dynamic operation. You're spending a few hundred thousand bytes in order to save eight. That seems like a bad idea.
And of course, you don't save anything. "dynamic" is just "object" with a fancy hat on. "Dynamic" objects are still boxed.
No. dynamic has to do with how operations on the object are performed, not how the object itself is stored. In this particular context, value types would still be boxed.
Also, is all of this effort really worth 12 bytes per object? Surely there's a better use for your time than saving a few kilobytes (if that) of RAM? Have you proved that RAM usage by your program is actually an issue?
No. Dynamic will simply store it as an Object.
Chances are this is a micro optimization that will provide little to no benefit. If this really does become an issue then there are other mechanisms you can use (generics) to speed things up.

Using Int32 or what you need

Should you use Int32 in places where you know the value is not going to be higher than 32,767?
I'd like to keep memory down; however, using casts everywhere just to perform simple arithmetic is getting annoying.
short a = 1;
short result = a + 1; // Error
short result = (short)(a + 1); // works but looks ugly when does lots of times
What would be better for overall application performance?
As far as I know it is good practice to use int whenever possible. The size of int equals the word size on many architectures, so I think there may be a slight performance degradation when using short in some arithmetic operations.
If you are creating large arrays, then it can save a considerable amount of memory to use narrower types (less bytes), as the size of the array will be "type width" * "number of elements" + "overhead".
However, I'm pretty sure that by default, in classes and structs, fields will be packed along whole word boundaries, e.g. 32 bits = 4 bytes. A short will still be packed into a 4-byte space.
You can however, manually configure packing in structs\classes by using structure layout:
http://msdn.microsoft.com/en-us/library/system.runtime.interopservices.structlayoutattribute(VS.71).aspx
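For instance, a small sketch of explicit packing; note that Marshal.SizeOf reports the marshaled size, and the runtime may still lay out managed-only structs as it sees fit:

using System;
using System.Runtime.InteropServices;

// Sketch: with Pack = 1 the fields are not padded out to word boundaries
// in the marshaled layout (5 bytes here instead of 6).
[StructLayout(LayoutKind.Sequential, Pack = 1)]
struct PackedFields
{
    public short A;
    public short B;
    public byte C;
}

class PackingDemo
{
    static void Main()
    {
        Console.WriteLine(Marshal.SizeOf(typeof(PackedFields)));   // prints 5
    }
}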
As with any performance related issue: "don't think - measure".
From an API perspective, it can be majorly annoying to have to keep casting from shorts to ints, etc, as you will find most APIs will use ints for example.
The 3 reasons for me to use an integer datatype smaller than Int32:
A system with severe memory constraints.
Huge arrays or similar.
I think it would make the purpose of the code easier to read and understand.
I mostly do normal Windows apps, so the only one of those reasons that normally matters to me is the 3rd one.
Unless you're creating hundreds of thousands of members, the space savings of a handful of bytes here and there won't really matter on any modern machine. I think the maxim about premature optimization applies here. It might make you feel better, but you're not getting anything particularly measurable out of it. IMO - only optimize memory usage if you're actually using a lot of memory.
