Why bit shifting? - C#

I was recently looking at a config file that was saving some cryptic values. I happened to have the source available, so I took a look at what it was doing, and it was saving a bunch of different values and bit shifting them into each other. It mystified me why someone would do that. My question, then, is: is there an obvious advantage to storing numeric data in this way? I can see how it might make for a slightly smaller value to store, byte-wise, but it seems like a lot of work to save a couple of bytes of storage. It also seems like it would be significantly slower.
The other possibility that occurred to me is that it was used for obfuscation purposes. Is this a common usage of bit shifting?

This is one of the common usages of bit shifting. There are several benefits:
1) Bit-shift operations are fast.
2) You can store multiple flags in a single value.
If you have an app that has several features, but you only want certain (configurable) ones enabled, you could do something like:
[Flags]
public enum Features
{
    Profile = 1,
    Messaging = 1 << 1,
    Signing = 1 << 2,
    Advanced = 1 << 3
}
And your single value to enable Messaging and Advanced would be:
(1 << 1) + (1 << 3) = 2 + 8 = 10
<add name="EnabledFeatures" value="10" />
And then to figure out if a given feature is enabled, you just perform some simple bitwise math:
var AdvancedEnabled =
    (EnabledFeatures & Features.Advanced) == Features.Advanced;
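In practice the stored integer has to be read back and cast to the enum before testing; here is a minimal sketch (reading from ConfigurationManager.AppSettings is just one possible source for the value):
// Hedged sketch: read the combined value back and test individual flags.
// Requires System.Configuration; any other int source works the same way.
var enabled = (Features)int.Parse(
    ConfigurationManager.AppSettings["EnabledFeatures"]);    // e.g. "10"

bool advancedEnabled = (enabled & Features.Advanced) == Features.Advanced;
bool messagingEnabled = enabled.HasFlag(Features.Messaging); // equivalent, slightly slower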

Bit shifting seems more common in systems-level languages like C, C++ and assembly, but I've seen it here and there in C# too. It's not often used so much to save space, though, as it is for one (or both) of two typical reasons:
You're talking to an existing system (or using an established protocol, or generating a file in a known format) that requires stuff to be very precisely laid out; and/or
The combination of bits comprises a value that is itself useful for testing.
Anyone who uses it in a high-level language solely to save space or obfuscate their code is almost always prematurely optimizing (and/or is an idiot). The space savings rarely justify the added complexity, and bit shifts really aren't complex enough to stop someone determined to understand your code.

I have a project that stores a day/hour matrix of available hours during one week. So it has 24x7 values that have to be stored somehow.
I chose to store it as 7 ints, so each day is represented with one integer, and each hour in it as one bit. Just an example where it can be useful.
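A minimal sketch of that layout (the names here are mine, just to illustrate the idea):
// One int per day, one bit per hour (hour 0 = bit 0).
int[] week = new int[7];

void SetAvailable(int day, int hour) => week[day] |= 1 << hour;
bool IsAvailable(int day, int hour) => (week[day] & (1 << hour)) != 0;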

Sometimes (especially in older Windows programming), there will be information encoded in the "high order" and "low order" bits of a value... and it's sometimes necessary to shift to get the information out. I'm not 100% sure of the reasons behind that, aside from the fact that it can be convenient to return a 64-bit value with two 32-bit values encoded in it (so you can handle it with a single return value from a method call).
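For illustration, packing and unpacking two 32-bit values in one 64-bit value looks roughly like this (a generic sketch, not any particular Windows API):
// Pack two 32-bit values into one 64-bit value and pull them back out.
ulong Pack(uint high, uint low) => ((ulong)high << 32) | low;
uint High(ulong packed) => (uint)(packed >> 32);
uint Low(ulong packed) => (uint)(packed & 0xFFFFFFFF);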
I agree with your assertions about needless complexity though, I don't see a lot of applications outside of really heavy number-crunching/cryptography stuff where this would be necessary.


Is a GUID unique 100% in the same machine? [duplicate]

Is a GUID unique 100% of the time?
Will it stay unique over multiple threads?
While each generated GUID is not guaranteed to be unique, the total number of unique keys (2^128 or 3.4×10^38) is so large that the probability of the same number being generated twice is very small. For example, consider the observable universe, which contains about 5×10^22 stars; every star could then have 6.8×10^15 universally unique GUIDs.
From Wikipedia.
These are some good articles on how a GUID is made (for .NET) and how you could get the same guid in the right situation.
https://ericlippert.com/2012/04/24/guid-guide-part-one/
https://ericlippert.com/2012/04/30/guid-guide-part-two/
https://ericlippert.com/2012/05/07/guid-guide-part-three/
If you are scared of the same GUID values then put two of them next to each other.
Guid.NewGuid().ToString() + Guid.NewGuid().ToString();
If you are too paranoid then put three.
The simple answer is yes.
Raymond Chen wrote a great article on GUIDs and why substrings of GUIDs are not guaranteed unique. The article goes into some depth as to the way GUIDs are generated and the data they use to ensure uniqueness, which should go some way toward explaining why they are unique :-)
As a side note, I was playing around with Volume GUIDs in Windows XP. This is a very obscure partition layout with three disks and fourteen volumes.
\\?\Volume{23005604-eb1b-11de-85ba-806d6172696f}\ (F:)
\\?\Volume{23005605-eb1b-11de-85ba-806d6172696f}\ (G:)
\\?\Volume{23005606-eb1b-11de-85ba-806d6172696f}\ (H:)
\\?\Volume{23005607-eb1b-11de-85ba-806d6172696f}\ (J:)
\\?\Volume{23005608-eb1b-11de-85ba-806d6172696f}\ (D:)
\\?\Volume{23005609-eb1b-11de-85ba-806d6172696f}\ (P:)
\\?\Volume{2300560b-eb1b-11de-85ba-806d6172696f}\ (K:)
\\?\Volume{2300560c-eb1b-11de-85ba-806d6172696f}\ (L:)
\\?\Volume{2300560d-eb1b-11de-85ba-806d6172696f}\ (M:)
\\?\Volume{2300560e-eb1b-11de-85ba-806d6172696f}\ (N:)
\\?\Volume{2300560f-eb1b-11de-85ba-806d6172696f}\ (O:)
\\?\Volume{23005610-eb1b-11de-85ba-806d6172696f}\ (E:)
\\?\Volume{23005611-eb1b-11de-85ba-806d6172696f}\ (R:)
| | | | |
| | | | +-- 6f = o
| | | +---- 69 = i
| | +------ 72 = r
| +-------- 61 = a
+---------- 6d = m
It's not that the GUIDs are very similar but the fact that all GUIDs have the string "mario" in them. Is that a coincidence or is there an explanation behind this?
Now, when googling for part 4 of the GUID I found approx. 125,000 hits for volume GUIDs.
Conclusion: When it comes to Volume GUIDs they aren't as unique as other GUIDs.
It should not happen. However, when .NET is under a heavy load, it is possible to get duplicate guids. I have two different web servers using two different sql servers. I went to merge the data and found I had 15 million guids and 7 duplicates.
Yes, a GUID should always be unique. It is based on both hardware and time, plus a few extra bits to make sure it's unique. I'm sure it's theoretically possible to end up with two identical ones, but extremely unlikely in a real-world scenario.
Here's a great article by Raymond Chen on Guids:
https://blogs.msdn.com/oldnewthing/archive/2008/06/27/8659071.aspx
Guids are statistically unique. The odds of two different clients generating the same Guid are infinitesimally small (assuming no bugs in the Guid generating code). You may as well worry about your processor glitching due to a cosmic ray and deciding that 2+2=5 today.
Multiple threads allocating new guids will get unique values, but you should check that the function you are calling is thread-safe. Which environment is this in?
Eric Lippert has written a very interesting series of articles about GUIDs.
There are on the order of 2^30 personal computers in the world (and of course lots of hand-held devices or non-PC computing devices that have more or less the same levels of computing power, but let's ignore those). Let's assume that we put all those PCs in the world to the task of generating GUIDs; if each one can generate, say, 2^20 GUIDs per second then after only about 2^72 seconds -- one hundred and fifty trillion years -- you'll have a very high chance of generating a collision with your specific GUID. And the odds of collision get pretty good after only thirty trillion years.
GUID Guide, part one
GUID Guide, part two
GUID Guide, part three
Theoretically, no, they are not unique. It's possible to generate an identical guid over and over. However, the chances of it happening are so low that you can assume they are unique.
I've read before that the chances are so low that you really should stress about something else--like your server spontaneously combusting or other bugs in your code. That is, assume it's unique and don't build in any code to "catch" duplicates--spend your time on something more likely to happen (i.e. anything else).
I made an attempt to describe the usefulness of GUIDs to my blog audience (non-technical family members). From there (via Wikipedia), the odds of generating a duplicate GUID:
1 in 2^128
1 in 340 undecillion (don’t worry, undecillion is not on the
quiz)
1 in 3.4 × 10^38
1 in 340,000,000,000,000,000,000,000,000,000,000,000,000
None seems to mention the actual math of the probability of it occurring.
First, let's assume we can use the entire 128-bit space (GUID v4 only uses 122 bits).
We know that the general probability of NOT getting a duplicate in n picks is:
(1 - 1/2^128)(1 - 2/2^128)...(1 - (n-1)/2^128)
Because 2^128 is much, much larger than n, we can approximate this to:
(1 - 1/2^128)^(n(n-1)/2)
And because we can assume n is much, much larger than 1, we can approximate that to:
(1 - 1/2^128)^(n^2/2)
Now we can set this probability of seeing no duplicate to a small value, say 1% (i.e. a 99% chance of at least one collision):
(1 - 1/2^128)^(n^2/2) = 0.01
Which we solve for n and get:
n = sqrt(2 * log 0.01 / log(1 - 1/2^128))
Which Wolfram Alpha gets to be 5.598318 × 10^19
To put that number into perspective, let's take 10,000 machines, each having a 4-core CPU running at 4 GHz and spending 10,000 cycles to generate a GUID, doing nothing else. It would then take them ~111 years to generate that many GUIDs.
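For anyone who wants to check those figures, the same arithmetic can be reproduced in a few lines of C# (using the approximation log(1 - x) ≈ -x so the tiny 1/2^128 term doesn't underflow):
// All constants below come from the text above.
double N = Math.Pow(2, 128);               // size of the idealized GUID space
double pNoDuplicate = 0.01;                // i.e. a 99% chance of at least one collision
double n = Math.Sqrt(2 * Math.Log(pNoDuplicate) / -(1.0 / N));
Console.WriteLine(n);                      // ~5.6e19 GUIDs

double guidsPerSecond = 10000 * 4 * (4e9 / 10000);            // 10,000 machines * 4 cores * 4 GHz / 10,000 cycles
Console.WriteLine(n / guidsPerSecond / (3600.0 * 24 * 365));  // ~111 years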
From http://www.guidgenerator.com/online-guid-generator.aspx
What is a GUID?
GUID (or UUID) is an acronym for 'Globally Unique Identifier' (or 'Universally Unique Identifier'). It is a 128-bit integer number used to identify resources. The term GUID is generally used by developers working with Microsoft technologies, while UUID is used everywhere else.
How unique is a GUID?
128-bits is big enough and the generation algorithm is unique enough that if 1,000,000,000 GUIDs per second were generated for 1 year the probability of a duplicate would be only 50%. Or if every human on Earth generated 600,000,000 GUIDs there would only be a 50% probability of a duplicate.
Is a GUID unique 100% of the time?
Not guaranteed, since there are several ways of generating one. However, you can try to calculate the chance of creating two GUIDs that are identical and you get the idea: a GUID has 128 bits, hence there are 2^128 distinct GUIDs – much more than there are stars in the known universe. Read the Wikipedia article for more details.
MSDN:
There is a very low probability that the value of the new Guid is all zeroes or equal to any other Guid.
If your system clock is set properly and hasn't wrapped around, and if your NIC has its own MAC (i.e. you haven't set a custom MAC) and your NIC vendor has not been recycling MACs (which they are not supposed to do but which has been known to occur), and if your system's GUID generation function is properly implemented, then your system will never generate duplicate GUIDs.
If everyone on earth who is generating GUIDs follows those rules then your GUIDs will be globally unique.
In practice, the number of people who break the rules is low, and their GUIDs are unlikely to "escape". Conflicts are statistically improbable.
I experienced a duplicate GUID.
I use the Neat Receipts desktop scanner and it comes with proprietary database software. The software has a sync to cloud feature, and I kept getting an error upon syncing. A gander at the logs revealed the awesome line:
"errors":[{"code":1,"message":"creator_guid: is already
taken","guid":"C83E5734-D77A-4B09-B8C1-9623CAC7B167"}]}
I was a bit in disbelief, but sure enough, when I found a way into my local neatworks database and deleted the record containing that GUID, the error stopped occurring.
So to answer your question with anecdotal evidence, no. A duplicate is possible. But it is likely that the reason it happened wasn't due to chance, but due to standard practice not being adhered to in some way. (I am just not that lucky) However, I cannot say for sure. It isn't my software.
Their customer support was EXTREMELY courteous and helpful, but they must have never encountered this issue before because after 3+ hours on the phone with them, they didn't find the solution. (FWIW, I am very impressed by Neat, and this glitch, however frustrating, didn't change my opinion of their product.)
For a better result, the best way is to append a timestamp to the GUID (just to make sure that it stays unique):
Guid.NewGuid().ToString() + DateTime.Now.ToString();
GUID algorithms are usually implemented according to the v4 GUID specification, which is essentially a pseudo-random string. Sadly, these fall into the category of "likely non-unique", from Wikipedia (I don't know why so many people ignore this bit): "... other GUID versions have different uniqueness properties and probabilities, ranging from guaranteed uniqueness to likely non-uniqueness."
The pseudo-random properties of V8's JavaScript Math.random() are TERRIBLE at uniqueness, with collisions often coming after only a few thousand iterations, but V8 isn't the only culprit. I've seen real-world GUID collisions using both PHP and Ruby implementations of v4 GUIDs.
Because it's becoming more and more common to scale ID generation across multiple clients, and clusters of servers, entropy takes a big hit -- the chances of the same random seed being used to generate an ID escalate (time is often used as a random seed in pseudo-random generators), and GUID collisions escalate from "likely non-unique" to "very likely to cause lots of trouble".
To solve this problem, I set out to create an ID algorithm that could scale safely, and make better guarantees against collision. It does so by using the timestamp, an in-memory client counter, client fingerprint, and random characters. The combination of factors creates an additive complexity that is particularly resistant to collision, even if you scale it across a number of hosts:
http://usecuid.org/
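The general shape of that approach looks something like the sketch below (this is not the actual cuid implementation, just an illustration of combining several independent components):
// Hypothetical sketch: concatenate several entropy sources so that two clients
// colliding on one component still differ on another.
static int _counter;

static string NewId()
{
    string time    = DateTime.UtcNow.Ticks.ToString("x");                   // timestamp
    string counter = Interlocked.Increment(ref _counter).ToString("x");     // in-memory counter
    string host    = Environment.MachineName.GetHashCode().ToString("x8");  // client fingerprint
    string random  = Guid.NewGuid().ToString("N").Substring(0, 8);          // random characters
    return "c" + time + counter + host + random;
}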
I have experienced GUIDs not being unique during multi-threaded/multi-process unit testing, too. I guess that has to do with, all other things being equal, the identical seeding (or lack of seeding) of pseudo-random generators. I was using them for generating unique file names. I found the OS is much better at doing that :)
Trolling alert
You ask if GUIDs are 100% unique. That depends on the number of GUIDs it must be unique among. As the number of GUIDs approach infinity, the probability for duplicate GUIDs approach 100%.
In a more general sense, this is known as the "birthday problem" or "birthday paradox". Wikipedia has a pretty good overview at:
Wikipedia - Birthday Problem
In very rough terms, the square root of the size of the pool is a rough approximation of when you can expect a 50% chance of a duplicate. The article includes a probability table of pool size and various probabilities, including a row for 2^128. So for a 1% probability of collision you would expect to randomly pick 2.6*10^18 128-bit numbers. A 50% chance requires 2.2*10^19 picks, while SQRT(2^128) is 1.8*10^19.
Of course, that is just the ideal case of a truly random process. As others mentioned, a lot is riding on that random aspect - just how good is the generator and seed? It would be nice if there were some hardware support to assist with this process, which would be more bullet-proof, except that anything can be spoofed or virtualized. I suspect that might be the reason why MAC addresses/time-stamps are no longer incorporated.
The answer to "Is a GUID 100% unique?" is simply "No".
If you want 100% uniqueness of a GUID, then do the following (sketched in code after this list):
generate a GUID
check whether that GUID exists in the table column where you are looking for uniqueness
if it exists then go to step 1, else go to step 4
use this GUID as unique.
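A rough sketch of that loop (KeyAlreadyExists is a hypothetical helper standing in for whatever table lookup you use):
// Keep generating until the value is not present in the target column.
Guid candidate;
do
{
    candidate = Guid.NewGuid();
}
while (KeyAlreadyExists(candidate));   // e.g. a SELECT COUNT(*) against your table
// candidate is now unique within that table at the time of the check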
The hardest part is not generating a duplicated Guid.
The hardest part is designing a database to store all of the generated ones in order to check whether a new one is actually a duplicate.
From WIKI:
For example, the number of random version 4 UUIDs which need to be generated in order to have a 50% probability of at least one collision is 2.71 quintillion, computed as follows:
n ≈ sqrt(2 × 2^122 × ln 2) ≈ 2.71 × 10^18
This number is equivalent to generating 1 billion UUIDs per second for about 85 years, and a file containing this many UUIDs, at 16 bytes per UUID, would be about 45 exabytes, many times larger than the largest databases currently in existence, which are on the order of hundreds of petabytes
GUID stands for Globally Unique Identifier
In Brief:
(the clue is in the name)
In Detail:
GUIDs are designed to be unique; they are calculated using a random method based on the computer's clock and the computer itself. If you are creating many GUIDs at the same millisecond on the same machine, it is possible they may match, but for almost all normal operations they should be considered unique.
I think that when people bury their thoughts and fears in statistics, they tend to forget the obvious. If a system is truly random, then the result you are least likely to expect (all ones, say) is equally as likely as any other unexpected value (all zeros, say). Neither fact prevents these occurring in succession, nor within the first pair of samples (even though that would be statistically "truly shocking"). And that's the problem with measuring chance: it ignores criticality (and rotten luck) entirely.
IF it ever happened, what's the outcome? Does your software stop working? Does someone get injured? Does someone die? Does the world explode?
The more extreme the criticality, the worse the word "probability" sits in the mouth. In the end, chaining GUIDs (or XORing them, or whatever) is what you do when you regard (subjectively) your particular criticality (and your feeling of "luckiness") to be unacceptable. And if it could end the world, then please on behalf of all of us not involved in nuclear experiments in the Large Hadron Collider, don't use GUIDs or anything else indeterministic!
Enough GUIDs to assign one to each and every hypothetical grain of sand on every hypothetical planet around each and every star in the visible universe.
Enough so that if every computer in the world generates 1000 GUIDs a second for 200 years, there might (MIGHT) be a collision.
Given the number of current local uses for GUIDs (one sequence per table per database for instance) it is extraordinarily unlikely to ever be a problem for us limited creatures (and machines with lifetimes that are usually less than a decade if not a year or two for mobile phones).
... Can we close this thread now?

Find right features in multiclass svm without PCA

I'm classifying users with a multiclass SVM (one-against-one), 3 classes. In binary, I would be able to plot the distribution of the weight of each feature in the hyperplane equation for different training sets. In this case, I don't really need a PCA to see the stability of the hyperplane and the relative importance of the features (reduced and centered, btw). What would the alternative be in a multiclass SVM, as for each training set you have 3 classifiers and you choose one class according to the result of the three classifiers (what is it already? the class that appears the maximum number of times or the biggest discriminant? whichever, it does not really matter here). Anyone has an idea?
And if it matters, I am writing in C# with Accord.
Thank you !
In a multi-class SVM that uses the one-vs-one strategy, the problem is divided into a set of smaller binary problems. For example, if you have three possible classes, using the one-vs-one strategy requires the creation of n(n-1)/2 binary classifiers. In your example, this would be
n(n-1)/2 = 3(3-1)/2 = (3*2)/2 = 3
Each of those will be specialized in the following problems:
Distinguishing between class 1 and class 2 (let's call it svma).
Distinguishing between class 1 and class 3 (let's call it svmb)
Distinguishing between class 2 and class 3 (let's call it svmc)
Now, I see that you have actually asked multiple questions in your original post, so I will answer them separately. First I will clarify how the decision process works, and then explain how you could detect which features are the most important.
Since you mentioned Accord.NET, there are two ways this framework might be computing the multi-class decision. The default one is to use a Decision Directed Acyclic Graph (DDAG), which is nothing more than the sequential elimination of classes. The other way is by solving all binary problems and taking the class that won most of the time. You can configure this at the moment you are classifying a new sample by setting the method parameter of the SVM's Compute method.
Since the winning-most-of-the-time version is straightforward to understand, I will explain a little more about the default approach, the DDAG.
Decision using directed acyclic graphs
In this algorithm, we test each of the SVMs and eliminate the class that lost at each round. So for example, the algorithm starts with all possible classes:
Candidate classes: [1, 2, 3]
Now it asks svma to classify x, it decides for class 2. Therefore, class 1 lost and is not considered anymore in further tests:
Candidate classes: [2, 3]
Now it asks svmb to classify x, it decides for class 2. Therefore, class 3 lost and is not considered anymore in further tests:
Candidate classes: [2]
The final answer is thus 2.
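In generic form, the elimination loop looks roughly like this (the decide delegate stands in for whichever pairwise SVM handles the two candidate classes; it is not part of the Accord.NET API):
// DDAG-style elimination: keep asking the pairwise classifier for the first
// two remaining candidates and drop the loser, until one class is left.
static int Classify(double[] x, int classCount, Func<int, int, double[], int> decide)
{
    var candidates = Enumerable.Range(0, classCount).ToList();
    while (candidates.Count > 1)
    {
        int winner = decide(candidates[0], candidates[1], x);
        candidates.Remove(winner == candidates[0] ? candidates[1] : candidates[0]);
    }
    return candidates[0];
}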
Detecting which features are the most useful
Now, since the one-vs-one SVM is decomposed into (n(n-1)/2) binary problems, the most straightforward way to analyse which features are the most important is by considering each binary problem separately. Unfortunately it might be tricky to globally determine which are the most important for the entire problem, but it will be possible to detect which ones are the most important to discriminate between class 1 and 2, or class 1 and 3, or class 2 and 3.
However, here I can offer a suggestion if you are using DDAGs. Using DDAGs, it is possible to extract the decision path that led to a particular decision. This means that it is possible to estimate how many times each of the binary machines was used when classifying your entire database. If you can estimate the importance of a feature for each of the binary machines, and estimate how many times a machine is used during the decision process in your database, perhaps you could take their weighted sum as an indicator of how useful a feature is in your decision process.
By the way, you might also be interested in trying a logistic regression support vector machine with L1-regularization and a high C to perform sparse feature selection:
// Create a new linear machine
var svm = new SupportVectorMachine(inputs: 2);

// Create a new instance of the sparse logistic learning algorithm
var smo = new ProbabilisticCoordinateDescent(svm, inputs, outputs)
{
    // Set learning parameters
    Complexity = 100,
};

// Run the learning algorithm to estimate the machine's parameters
double error = smo.Run();
I'm not an expert in ML or SVMs; I am a self-learner. However, my prototype outperformed some similar commercial and academic software in accuracy, while the training time is about 2 hours, in contrast to days and weeks(!) for some competitors.
My recognition system (patterns in bio-cells) uses the following approach to select the best features:
1) Extract features and calculate mean and variance for all classes
2) Select those features where the means of the classes are most distant and the variances are minimal
3) Remove redundant features - those whose mean histograms over the classes are similar
In my prototype I'm using parametric features, e.g. the feature "circle" with parameters diameter, threshold, etc.
The training is controlled by scripts defining which features with which argument ranges are to be used. So the software tests all possible combinations.
For some training-time optimization:
The software begins with 5 instances per class for extracting the features and increases the number when the 2nd condition is met.
Probably there are some academic names for some of these steps. Unfortunately I'm not aware of them; I've "reinvented the wheel" myself.

C# dictionary - how to solve limit on number of items?

I am using a Dictionary and I need to store almost 13,000,000 keys in it. Unfortunately, after adding the 11,950,000th key I got the exception "System out of memory". Is there any solution to this problem? I will need my program to run on less powerful computers than mine in the future.
I need that many keys because I need to store pairs - sequence name and sequence length, it is for solving bioinformatics related problem.
Any help will be appreciated.
Buy more memory, install a 64-bit version of the OS and recompile for 64 bits. No, I'm not kidding. If you want so many objects... in RAM... and then call it a "feature". If the new Android can require 16 GB of memory to be compiled...
I was forgetting... You could begin by reading C# array of objects, very large, looking for a better way
Do you know how much 13 million objects is?
To make a comparison, a 32-bit Windows app has access to less than 2 GB of address space. So that's 2 billion bytes (give or take)... 2 billion / 13 million = something around 150 bytes/object. Now, if we consider how much a reference type occupies... it's quite easy to eat 150 bytes.
I'll add something: I've looked into my Magic 8-Ball and it told me: show us your code. If you don't tell us what you are using for the key and the values, how should we be able to help you? What are you using, class or struct or "primitive" types? Tell us the "size" of your TKey and TValue. Sadly our crystal ball broke yesterday :-)
C# is not a language that was designed to solve heavy-duty scientific computation problems. It is absolutely possible to use C# to build tools that do what you want, but the off-the-shelf parts like Dictionary were designed to solve more common business problems, like mapping zip codes to cities and that sort of thing.
You're going to have to go with external storage of some sort. My recommendation would be to buy a database and use it to store your data. Then use a DataSet or some similar technology to load portions of the data into memory, manipulate them, and then pour more data from the database into the DataSet, and so on.
Well, I had almost exactly the same problem.
I wanted to load about 12.5 million [string, int]s into a dictionary from a database (for all the programming "gods" above who don't understand why, the answer is that it is enormously quicker when you are working with a 150 GB database if you can cache a proportion of one of the key tables in memory).
It annoyingly threw an out of memory exception at pretty much the same place - just under the 12 million mark, even though the process was only consuming about 1.3 GB of memory (reduced to about 800 MB of memory after a judicious change in db read method to not try and do it all at once) - despite running on an i7 with 8 GB of memory.
The solution was actually remarkably simple -
in Visual Studio (2010) in Solution Explorer right click the project and select properties.
In the Build tab set Platform Target to x64 and rebuild.
It rattles through the load into the Dictionary in a few seconds and the Dictionary performance is very good.
An easy solution is to just use a simple DB. The most obvious solution in this case, IMHO, is using SQLite .NET: fast, easy and with a low memory footprint.
I think that you need a new approach to your processing.
I must assume that you obtain the data from a file or a database, either way that is where it should remain.
There is no way that you can actually increase the limit on the number of values stored within a Dictionary, other than increasing system memory, but either way it is an extremely inefficient means of processing such a large amount of data.
You should rethink your algorithm so that you can process the data in more manageable portions. It will mean processing it in stages until you get your result. This may mean many hundreds of passes through the data, but it's the only way to do it.
I would also suggest that you look at using generics to help speed up this repetitive processing and cut down on memory usage.
Remember that there will still be a balancing act between system performance and access to externally stored data (be it external disk store or database).
It is not a problem with the Dictionary object, but with the available memory in your server. I've done some investigation to understand the failures of the dictionary object, but it never failed. Below is the code for your reference:
private static void TestDictionaryLimit()
{
    int intCnt = 0;
    Dictionary<long, string> dItems = new Dictionary<long, string>();
    Console.WriteLine("Total number of iterations = {0}", long.MaxValue);
    Console.WriteLine("....");
    for (long lngCnt = 0; lngCnt < long.MaxValue; lngCnt++)
    {
        if (lngCnt < 11950020)
            dItems.Add(lngCnt, lngCnt.ToString());
        else
            break;
        if ((lngCnt % 100000).Equals(0))
            Console.Write(intCnt++);
    }
    Console.WriteLine("Completed..");
    Console.WriteLine("{0} number of items in dictionary", dItems.Count);
}
The above code executes properly, and stores more than the number of count that you have mentioned.
Really, 13,000,000 items are quite a lot.
If those 13,000,000 items are allocated classes, that is a very deep kick into the garbage collector's stomach!
Also, even if you find a way to use the default .NET dictionary, the performance would be really bad: with that many keys, the number of keys approaches the number of values a 31-bit hash can represent, performance will be awful in whatever system you use, and of course memory will be too much!
If you need a data structure that can use more memory than a hash table, you probably need a custom hashtable mixed with a custom binary tree data structure.
Yes, it is possible to write your own combination of the two.
You cannot rely on the .NET hashtable for sure for this strange and specific problem.
Consider that a tree has a lookup complexity of O(log n), but a building complexity of O(n * log n); of course, building it will take too long.
You should then build a hashtable of binary trees (or vice versa), which will allow you to use both data structures while consuming less memory.
Then, think about compiling it in 32-bit mode, not in 64-bit mode: 64-bit mode uses more memory for pointers.
At the same time, the contrary is possible: a 32-bit address space may not be sufficient for your problem.
It has never happened to me to have a problem that could exhaust a 32-bit address space!
If both keys and values are simple value types, I would suggest you write your data structure in a C DLL and use it from C#.
You can try to write a dictionary of dictionaries.
Let's say you can split your data into chunks of 500,000 items between 26 dictionaries, for example, but the occupied memory would be very, very big; I don't think your system will handle it.
public class MySuperDictionary<TKey, TValue>
{
    private readonly Dictionary<TKey, TValue>[] dictionaries;

    public MySuperDictionary()
    {
        this.dictionaries = new Dictionary<TKey, TValue>[373]; // must be a prime number.
        for (int i = 0; i < dictionaries.Length; ++i)
            dictionaries[i] = new Dictionary<TKey, TValue>(13000000 / dictionaries.Length);
    }

    public void Add(TKey key, TValue value)
    {
        int bucket = (GetSecondaryHashCode(key) & 0x7FFFFFFF) % dictionaries.Length;
        dictionaries[bucket].Add(key, value);
    }

    public bool Remove(TKey key)
    {
        int bucket = (GetSecondaryHashCode(key) & 0x7FFFFFFF) % dictionaries.Length;
        return dictionaries[bucket].Remove(key);
    }

    public bool TryGetValue(TKey key, out TValue result)
    {
        int bucket = (GetSecondaryHashCode(key) & 0x7FFFFFFF) % dictionaries.Length;
        return dictionaries[bucket].TryGetValue(key, out result);
    }

    public static int GetSecondaryHashCode(TKey key)
    {
        // Here you should return a hash code for the key, possibly using a different
        // hashing algorithm than the one used by the inner dictionaries.
        throw new NotImplementedException();
    }
}
With that many keys, you should either use a database or something like memcache while swapping out chunks of the cache in storage. I'm doubting you need all of the items at once, and if you do, there's no way it's going to work on a low-powered machine with little RAM.

In C#/.NET does a dynamic type take less space than object?

I have a console application that allows the users to specify variables to process. These variables come in three flavors: string, double and long (with double and long being by far the most commonly used types). The user can specify whatever variables they like and in whatever order, so my system has to be able to handle that. To this end, in my application I had been storing these as object and then casting/uncasting them as required. For example:
public class UnitResponse
{
    public object Value { get; set; }
}
My understanding was that boxed objects take up a bit more memory (about 12 bytes) than a standard value type.
My question is: would it be more efficient to use the dynamic keyword to store these values? It might get around the boxing/unboxing issue, and if it is more efficient how would this impact performance?
EDIT
To provide some context and prevent the "are you sure you're using enough RAM to worry about this" in my worst case I have 420,000,000 datapoints to worry about (60 variables * 7,000,000 records). This is in addition to a bunch of other data I keep about each variable (including a few booleans, etc.). So reducing memory does have a HUGE impact.
OK, so the real question here is "I've got a freakin' enormous data set that I am storing in memory, how do I optimize its performance in both time and memory space?"
Several thoughts:
You are absolutely right to hate and fear boxing. Boxing has big costs. First, yes, boxed objects take up extra memory. Second, boxed objects get stored on the heap, not on the stack or in registers. Third, they are garbage collected; every single one of those objects has to be interrogated at GC time to see if it contains a reference to another object, which it never will, and that's a lot of time on the GC thread. You almost certainly need to do something to avoid boxing.
Dynamic ain't it; it's boxing plus a whole lot of other overhead. (C#'s dynamic is very fast compared to other dynamic dispatch systems, but it is not fast or small in absolute terms).
It's gross, but you could consider using a struct whose layout shares memory between the various fields - like a union in C. Doing so is really really gross and not at all safe but it can help in situations like these. Do a web search for "StructLayoutAttribute"; you'll find tutorials.
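A hedged sketch of what such a struct could look like (type and field names are invented here; the string has to stay outside the overlapped part because references cannot legally overlap value types):
[StructLayout(LayoutKind.Explicit)]            // requires System.Runtime.InteropServices
struct NumericValue
{
    [FieldOffset(0)] public long AsLong;       // both fields share the same 8 bytes
    [FieldOffset(0)] public double AsDouble;
}

struct UnitValue
{
    public NumericValue Number;   // valid when Kind is 0 or 1
    public string Text;           // valid when Kind is 2, otherwise null
    public byte Kind;             // 0 = long, 1 = double, 2 = string
}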
Long, double or string, really? Can't be int, float or string? Is the data really either in excess of several billion in magnitude or accurate to 15 decimal places? Wouldn't int and float do the job for 99% of the cases? They're half the size.
Normally I don't recommend using float over double because it's a false economy; people often economise this way when they have ONE number, as if the savings of four bytes is going to make the difference. The difference between 42 million floats and 42 million doubles is considerable.
Is there regularity in the data that you can exploit? For example, suppose that of your 42 million records, there are only 100000 actual values for, say, each long, 100000 values for each double, and 100000 values for each string. In that case, you make an indexed storage of some sort for the longs, doubles and strings, and then each record gets an integer where the low bits are the index, and the top two bits indicate which storage to get it out of. Now you have 42 million records each containing an int, and the values are stored away in some nicely compact form somewhere else.
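Sketched out, that indexed-storage idea might look like this (all names here are made up; a real version would also de-duplicate values when adding):
// The handle is an int: the top 2 bits say which pool the value lives in,
// the remaining 30 bits are an index into that pool.
const int KindLong = 0, KindDouble = 1, KindString = 2;

List<long> longPool = new List<long>();
List<double> doublePool = new List<double>();
List<string> stringPool = new List<string>();

int MakeHandle(int kind, int index) => (kind << 30) | index;
int KindOf(int handle) => (int)((uint)handle >> 30);
int IndexOf(int handle) => handle & 0x3FFFFFFF;

int StoreLong(long value)
{
    longPool.Add(value);                        // dedupe with a Dictionary<long, int> in practice
    return MakeHandle(KindLong, longPool.Count - 1);
}

long LoadLong(int handle) => longPool[IndexOf(handle)];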
Store the booleans as bits in a byte; write properties to do the bit shifting to get 'em out. Save yourself several bytes that way.
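For example (a minimal sketch; the field and property names are invented):
// Eight booleans packed into one byte, exposed through properties.
private byte _flags;

public bool IsActive
{
    get { return (_flags & 0x01) != 0; }
    set { _flags = value ? (byte)(_flags | 0x01) : (byte)(_flags & ~0x01); }
}

public bool IsDirty
{
    get { return (_flags & 0x02) != 0; }
    set { _flags = value ? (byte)(_flags | 0x02) : (byte)(_flags & ~0x02); }
}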
Remember that memory is actually disk space; RAM is just a convenient cache on top of it. If the data set is going to be too large to keep in RAM then something is going to page it back out to disk and read it back in later; that could be you or it could be the operating system. It is possible that you know more about your data locality than the operating system does. You could write your data to disk in some conveniently pageable form (like a b-tree) and be more efficient about keeping stuff on disk and only bringing it in to memory when you need it.
I think you might be looking at the wrong thing here. Remember what dynamic does. It starts the compiler again, in process, at runtime. It loads hundreds of thousands of bytes of code for the compiler, and then at every call site it emits caches that contain the results of the freshly-emitted IL for each dynamic operation. You're spending a few hundred thousand bytes in order to save eight. That seems like a bad idea.
And of course, you don't save anything. "dynamic" is just "object" with a fancy hat on. "Dynamic" objects are still boxed.
No. dynamic has to do with how operations on the object are performed, not how the object itself is stored. In this particular context, value types would still be boxed.
Also, is all of this effort really worth 12 bytes per object? Surely there's a better use for your time than saving a few kilobytes (if that) of RAM? Have you proved that RAM usage by your program is actually an issue?
No. Dynamic will simply store it as an Object.
Chances are this is a micro optimization that will provide little to no benefit. If this really does become an issue then there are other mechanisms you can use (generics) to speed things up.

Using Int32 or what you need

Should you use Int32 in places where you know the value is not going to be higher than 32,767?
I'd like to keep memory down; however, using casts everywhere just to perform simple arithmetic is getting annoying.
short a = 1;
short result = a + 1; // Error
short result = (short)(a + 1); // works but looks ugly when done lots of times
What would be better for overall application performance?
As far as I know, it is good practice to use int whenever possible. The size of int equals the word size on many architectures, so I think there may be a slight performance degradation when using short in some arithmetic operations.
If you are creating large arrays, then it can save a considerable amount of memory to use narrower types (less bytes), as the size of the array will be "type width" * "number of elements" + "overhead".
However, I'm pretty sure that by default, fields in classes and structs are packed along whole-word boundaries, e.g. 32 bits = 4 bytes. A short will still be packed into a 4-byte space.
You can however, manually configure packing in structs\classes by using structure layout:
http://msdn.microsoft.com/en-us/library/system.runtime.interopservices.structlayoutattribute(VS.71).aspx
As with any performance related issue: "don't think - measure".
From an API perspective, it can be majorly annoying to have to keep casting from shorts to ints, etc, as you will find most APIs will use ints for example.
The 3 reasons for me to use an integer datatype smaller than int32:
A system with severe memory constraints.
Huge arrays or similar.
I think it would make the purpose of the code easier to read and understand.
I mostly do normal Windows apps, so the only reason of those that normally matters to me is the 3rd one.
Unless you're creating 100s of thousands of members then the space savings of a handful of bytes here and there won't really matter on any modern machine. I think the maxim of premature optimization applies here. It might make you feel better, but you're not getting anything particularly measurable out of it. IMO - only optimize memory usage if you're actually using a lot of memory.
