I'm trying to figure out the best way to store large binary (more than 96-bit) numbers in C#.
I'm building an application that will automatically allocate workers to shifts. Shifts can be as short as 15 minutes (and this might become even smaller in the future). To avoid double-booking workers, I plan to keep a binary map of their day: 24 hours divided into equal chunks (15 minutes), where every chunk has a flag (0 for free, 1 for busy).
So when we try to give another shift to a worker, we can do a binary comparison of the worker's daily availability with the shift's time. Simple and easy to decide.
But a C# long only holds up to 64 bits, and with the current setup I need at least 96 bits (24 hours * 60 minutes / 15 minutes per period).
This representation must be memory-friendly, as there will be about a million objects operated on at a time.
A few other options I considered:
String. Memory-hungry, and bit-wise operations are not simple to implement.
Array of bits. But as far as I know, C# does not have a bit type.
Array of unsigned integers, where each integer represents only part of the day. The best I can think of.
Any other suggestions??
Thanks in advance!
Have you looked at the BitArray class? It should be pretty much exactly what you're looking for.
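For example, a minimal sketch of the asker's 96-slot day map with BitArray (the slot numbers and booking logic here are illustrative assumptions, not from the question):
using System.Collections;

// 24 hours / 15-minute chunks = 96 flags per worker per day
BitArray availability = new BitArray(96);   // all bits start as false = free
BitArray shift = new BitArray(96);

// request a shift from 09:00 to 10:30, i.e. slots 36..41
for (int slot = 36; slot < 42; slot++)
    shift[slot] = true;

// conflict check: AND the maps and see whether any bit survives.
// BitArray.And mutates the instance it is called on, so work on a copy.
BitArray overlap = new BitArray(availability).And(shift);
bool conflict = false;
foreach (bool bit in overlap)
    if (bit) { conflict = true; break; }

if (!conflict)
    availability.Or(shift);                 // book the shift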
Try the following:
.NET 4 has a built-in BigInteger type:
http://msdn.microsoft.com/en-us/library/system.numerics.biginteger.aspx
A .NET 2 project on CodeProject:
http://www.codeproject.com/KB/cs/biginteger.
Another alternative:
http://www.codeplex.com/IntX/
Unless you have millions of employees that all need to be scheduled at the same time, I'd be tempted to store your 96 booleans as a char array, with '0' meaning "free" and '1' meaning "busy". Simple to index/access/update. The rest of the employees' schedules can sit in their database rows on disk, where you simply don't care about "96 megabytes".
If you can find a class which implements a bit array, you could use that. (You could code one easily, too). But does it really matter spacewise?
Frankly, if your organization really has a million employees to schedule, surely you can afford a machine which has space for a 96 MB array as well as the rest of your code?
The one good excuse I can see for using bit vectors has to do with execution time. If your scheduling algorithm essentially ANDs one employee bit vector against another looking for conflicts, and does that on a large scale, bit vectors might reduce the computation time by a factor of roughly 10 (use two *long*s per employee to get your 96 bits). I'd wait until my algorithm worked before I worried about this.
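A rough sketch of that idea, with hypothetical type and member names:
// 96 slots split across two 64-bit words: slots 0..63 in Lo, slots 64..95 in Hi
struct DayMap
{
    public ulong Lo;
    public ulong Hi;

    // true if any 15-minute slot is busy in both maps
    public bool ConflictsWith(DayMap other)
    {
        return (Lo & other.Lo) != 0 || (Hi & other.Hi) != 0;
    }

    // mark the shift's slots as busy
    public void Book(DayMap shift)
    {
        Lo |= shift.Lo;
        Hi |= shift.Hi;
    }
}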
You could use an array of bytes. I don't think any language supports an array of bits, as a byte is the smallest addressable piece of memory. Another option is an array of booleans, but each boolean is, I believe, stored as a byte anyway, so there would be wasted memory; it might be easier to work with, though. It really depends on how many days you are going to work with. You could also just store the start and end of the shift and use other means to figure out if there are overlapping schedules. That would probably make the most sense, and be the easiest to debug.
BitArray has already been mentioned, it uses an array of ints much like you planned to do anyway.
That also means it adds an extra layer of indirection (and some extra bytes); it also does a lot of checking everywhere, e.g. to make sure the lengths of two bit arrays are the same when operating on them. So I would be careful with it. It's easy to use, but slower than necessary; the difference is especially big (compared to handling the array yourself) for smallish bit arrays.
I need to store many strings in RAM, but they do not contain special Unicode characters; they all contain only characters from ISO 8859-1, which fit in one byte each.
Now I could convert every string, store it in memory, and convert it back in order to use it with .Contains() and similar methods, but that would be overhead (in my opinion) and slow.
Is there a string class that is fast and reliable and offers some methods of the original string class like .Contains()?
I need this to store more strings in memory with less RAM used. Or is there another way to do it?
Update:
Thank you for your comments and your answer.
I have a class that stores strings. Then, with one method call, I need to figure out whether I already have that string in memory. I have to check about 1000 strings per second against the list, and there are hundreds of millions in total.
The average size of a string is about 20 chars. It is really the RAM that concerns me.
I even thought about compressing a few million strings at a time and storing these packages in memory. But then I would need to decompress a package every time I need to access the values.
I also tried to use a HashSet, but the needed memory amount was even higher.
I don't need the true value, just to know whether the value is in the list. So if there is a hash value that can do this, even better. But everything I found needs more memory than the plain strings.
Currently there is no plan for further internationalization. So it is something I would deal with when it is time to :-)
I don't know if using a database would solve it. I don't need to fetch anything, just to know if the value was stored in the class. And I need to do this fast.
It is very unlikely that you will win any significant performance from this. However, if you need to save memory, this strategy may be appropriate.
To convert a string to a byte[] for this purpose, use Encoding.Default.GetBytes() [1].
To convert a byte[] back to a string for display or other string-based processing, use Encoding.Default.GetString().
You can make your code look nicer if you use extension methods defined on string and byte[]. Alternatively, you can wrap the byte[] in a wrapper type and put the methods there. Make this wrapper type a struct, not a class, otherwise it will incur extra heap allocations, which is what you’re trying to avoid.
I want to warn you, though — you are throwing away the ability to have Unicode in your application. You should normally have all alarm bells go off every time you think you need to do this. It is best if you structure your code in such a way that you can easily go back to using string when memory sizes will have gone up and memory consumption stops being an issue.
[1] Encoding.Default returns the current 8-bit codepage of the running operating system. The default for this on English-language Windows is Windows-1252, which is what you want. For Russian Windows it will be Windows-1251 (Cyrillic) etc.
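A minimal sketch of the wrapper-struct idea from above, geared to the poster's "is it already in the list" use case (the ByteString name and its members are my own, not from the answer):
using System.Text;

// Thin wrapper over byte[]; a struct, so the wrapper itself adds no heap allocation.
struct ByteString
{
    private readonly byte[] _data;

    public ByteString(string s) { _data = Encoding.Default.GetBytes(s); }

    public override string ToString() { return Encoding.Default.GetString(_data); }

    // content-based hash and equality so instances can live in a HashSet<ByteString>
    public override int GetHashCode()
    {
        int h = 17;
        foreach (byte b in _data) h = h * 31 + b;
        return h;
    }

    public override bool Equals(object obj)
    {
        if (!(obj is ByteString)) return false;
        byte[] other = ((ByteString)obj)._data;
        if (_data.Length != other.Length) return false;
        for (int i = 0; i < _data.Length; i++)
            if (_data[i] != other[i]) return false;
        return true;
    }
}
Instances can then be kept in a HashSet<ByteString>, using roughly half the per-string storage of the UTF-16 string data.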
As per comments, a basically bad idea. If you have to do it, byte[] is your friend. There is no byte-oriented string class in .NET.
Check out the string.Intern method; it could help you out:
http://www.yoda.arachsys.com/csharp/strings.html
http://en.csharp-online.net/CSharp_String_Theory%E2%80%94String_intern_pool
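As a quick illustration of interning (ReadNextValue() here is a hypothetical source of incoming strings):
string a = string.Intern(ReadNextValue());
string b = string.Intern(ReadNextValue());

// identical contents now share one instance, so a cheap reference check suffices
bool alreadySeen = ReferenceEquals(a, b);
Note that interned strings stay in the intern pool for the lifetime of the process, so this only helps if the same values genuinely recur.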
However, looking at your requirements, I think you are over-engineering it. You have 1000 strings at 20 chars each = 1000 * 20 * 2 bytes per char = 40,000 bytes; that's not much memory.
If you really have a large amount, store it in a DB with an index. That would be much faster than anything the average programmer can come up with.
In my C# app, I would like to know whether it is really important to use short for smaller numbers, int for bigger etc. Does the memory consumption really matter?
Unless you are packing large numbers of these together in some kind of structure, it will probably not affect the memory consumption at all. The best reason to use a particular integer type is compatibility with an API. Other than that, just make sure the type you pick has enough range to cover the values you need. Beyond that for simple local variables, it doesn't matter much.
The simple answer is that it's not really important.
The more complex answer is that it depends.
Obviously you need to choose a type that will hold your data structure without overflowing, and even if you're only storing smaller numbers, choosing int is probably the most sensible thing to do.
However, if your application loads a lot of data or runs on a device with limited memory then you might need to choose short for some values.
For C# apps that aren't trying to mirror some sort of structure from a file, you're better off using ints or whatever your native format is. The only other time it might matter is if using arrays on the order of millions of entries. Even then, I'd still consider ints.
Only you can be the judge of whether the memory consumption really matters to you. In most situations it won't make any discernible difference.
In general, I would recommend using int/Int32 where you can get away with it. If you really need to use short, long, byte, uint etc in a particular situation then do so.
This is entirely relative to the amount of memory you can afford to waste. If you aren't sure, it probably doesn't matter.
The answer is: it depends. The question of whether memory matters is entirely up to you. If you are writing a small application that has minimal storage and memory requirements, then no. If you are Google, storing billions and billions of records on thousands of servers, then every byte can cost some real money.
There are a few cases where I really bother to choose:
When I have memory limitations
When I do bitshift operations
When I care about x86/x64 portability
Every other case is int all the way
Edit: About x86/x64
In C#, an int is always 32 bits; what changes between x86 and x64 are the pointer-sized types (IntPtr and the native types you marshal against).
If you write "int" everywhere and interop with native code, moving from one architecture to another can lead to problems. For example, you call a native API that returns a pointer-sized value; you cast it to an int and everything is fine on x86, but when you move to x64, all hell breaks loose.
Those native sizes are defined by the architecture, so when you change architecture you need to be aware that they might lead to potential problems.
That all depends on how you are using them and how many you have. Even if you only have a few in memory at a time - this might drive the data type in your backing store.
Memory consumption based on the type of integers you are storing is probably not an issue in a desktop or web app. In a game or a mobile device app, it may be more of an issue.
However, the real reason to differentiate between the types is the kind of numbers you need to store. If you have really big numbers, or high precision, you may need to use long to store it.
The context of the situation is very important here. You don't need to guess whether it is important or not, though; we are dealing with quantifiable things here. We know that we save 2 bytes by using a short instead of an int.
What do you estimate the largest number of instances in memory at a given point in time will be? If there are a million, then you are saving ~2 MB of RAM. Is that a large amount of RAM? Again, it depends on the context; if the app is running on a desktop with 4 GB of RAM, you probably don't care too much about the 2 MB.
If there will be hundreds of millions of instances in memory, the savings get pretty big, but if that is the case you may simply not have enough RAM to deal with it, and you may have to store this structure on disk and work with parts of it at a time.
Int32 will be fine for almost anything. Exceptions include:
if you have specific needs where a different type is clearly better. Example: if you're writing a 16 bit emulator, Int16 (aka: short) would probably be better to represent some of the internals
when an API requires a certain type
One time, I had an invalid int cast and Visual Studio's first suggestion was to verify my value was less than infinity. I couldn't find a good type for that without using the pre-defined constants, so I used ulong since that was the closest I could come in .NET 2.0 :)
I'm building an application that is seriously slower than it should be (a process takes 4 seconds when it should take only 0.1 seconds; that's my goal, at least).
I have a bunch of methods that pass an array from one to the other. This has kept my code nice and organized, but I'm worried that it's killing the efficiency of my code.
Can anyone confirm if this is the case?
Also, I have all of my code contained in a class separate from my UI. Is this going to make things run significantly slower than if I had my code contained in the Form1.cs file?
Edit: There are about 95,000 points that need to be calculated; each point goes through 7 methods that do additional calculations.
Have you tried any profiling or performance tools to narrow down why the slowdown occurs?
It might show you ways that you could use to refactor your code and improve performance.
This question asked by another user has several options that you can choose from:
Good .Net Profilers
No. This is not what is killing your code's speed, unless "a bunch of methods" means something like a million of them. You probably have more things iterating through your array than you need or realize, and the array itself may have a larger memory footprint than you realize.
Perhaps you should look into a design where, instead of passing the whole array to 7 methods, you iterate the array once and pass each member through the 7 methods; this will minimize the number of times you iterate through the 95,000 members, as in the sketch below.
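A sketch of that shape, where StageOne, StageTwo and StageThree stand in for the poster's seven methods (placeholders, not real code from the question):
// One pass over all 95,000 points; each point flows through the stages in turn,
// instead of every stage looping over the whole array by itself.
double[] results = new double[points.Length];
for (int i = 0; i < points.Length; i++)
{
    double v = StageOne(points[i]);   // StageOne..StageThree are placeholders
    v = StageTwo(v);
    results[i] = StageThree(v);
}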
In general, function calls are basic enough to be highly optimized by any interpreter (or compiler), so they do not produce too much blow-up in run time. In fact, if you rework your problem into, say, some fancy iterative solution, you save the stack handling but instead have to handle some iteration variables, which will not be too hard.
I know there have been programmers who wondered why their recursive algorithms were so slow, until someone told them not to pass array entries by value.
You should provide some sample code. In general, you should look for other bottlenecks, or find another algorithm.
You just need to run it against a good profiling tool. I've got some stuff I wish only took 4 seconds, and it works with upwards of a hundred million records in a pass.
An array is a reference type, not a value type, so you never copy the array when you pass it; you are actually passing a reference to the array in memory. So passing the array isn't your issue. Most likely you have an issue with what you do with your array. You need to do what Jamie Keeling said and run it through a profiler, or even just debug it and see if you get stuck in some big loops.
Why are you loading them all into an array and doing each method in turn rather than iterating through them as loaded?
If you can obtain them (from whatever input source), deal with them, and output them (whether to screen, file, or wherever) one at a time, this will inevitably use less memory and reduce start-up time, at the very least.
If this answer is applicable to your situation, start by changing your methods to deal with enumerations rather than arrays (non-breaking change, since arrays are enumerations), then change your input method to yield return items as loaded rather than loading an entire array.
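A hedged sketch of that approach (Point, ParsePoint and Calculate are placeholders for the poster's own types and methods):
using System.Collections.Generic;
using System.IO;

// Produce points lazily: each one is read and parsed only when the consumer asks for it.
static IEnumerable<Point> LoadPoints(string path)
{
    foreach (string line in File.ReadLines(path))
        yield return ParsePoint(line);
}

// Downstream stages accept IEnumerable<Point> instead of Point[],
// so nothing forces the whole data set into memory at once.
static IEnumerable<Point> Transform(IEnumerable<Point> points)
{
    foreach (Point p in points)
        yield return Calculate(p);
}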
Sorry for posting an old link (.NET 1.1), but it was contained in a VS2010 article, so:
Here you can read about method costs. (Initial link)
Also, if you start your code from Visual Studio (even in Release mode), the VS debugger attaches to your code and slows it down.
I know I will be downvoted for this advice, but... the maximum performance will be achieved with unsafe operations on arrays (yes, it's UNSAFE, but when performance is at stake...).
And lastly, refactor your code to use a minimum of methods working with your arrays. It will improve the performance.
Given a case where I have an object that may be in one or more true/false states, I've always been a little fuzzy on why programmers frequently use flags+bitmasks instead of just using several boolean values.
It's all over the .NET framework. Not sure if this is the best example, but the .NET framework has the following:
public enum AnchorStyles
{
None = 0,
Top = 1,
Bottom = 2,
Left = 4,
Right = 8
}
So given an anchor style, we can use bitmasks to figure out which of the states are selected. However, it seems like you could accomplish the same thing with an AnchorStyle class/struct with bool properties defined for each possible value, or an array of individual enum values.
Of course the main reason for my question is that I'm wondering if I should follow a similar practice with my own code.
So, why use this approach?
Less memory consumption? (it doesn't seem like it would consume less than an array/struct of bools)
Better stack/heap performance than a struct or array?
Faster compare operations? Faster value addition/removal?
More convenient for the developer who wrote it?
It was traditionally a way of reducing memory usage. So, yes, it's quite obsolete in C# :-)
As a programming technique, it may be obsolete in today's systems, and you'd be quite alright to use an array of bools, but...
It is fast to compare values stored as a bitmask. Use the AND and OR logic operators and compare the resulting 2 ints.
It uses considerably less memory. Putting all 4 of your example values in a bitmask would use half a byte. Using an array of bools would most likely use a few bytes for the array object plus a byte for each bool. If you have to store a million values, you'll see exactly why the bitmask version is superior.
It is easier to manage, you only have to deal with a single integer value, whereas an array of bools would store quite differently in, say a database.
And, because of the memory layout, much faster in every aspect than an array. It's nearly as fast as using a single 32-bit integer. We all know that is as fast as you can get for operations on data.
Easy to set multiple flags in any order.
Easy to save and retrieve a series like 0101011 from the database.
Among other things, it's easier to add new bit meanings to a bitfield than to add new boolean values to a class. It's also easier to copy a bitfield from one instance to another than a series of booleans.
It can also make methods clearer. Imagine a method with 10 bools vs. 1 bitmask.
Actually, it can give better performance, mainly if your enum derives from byte.
In that extreme case, a single byte can hold up to 8 flags and therefore all 256 of their combinations, whereas representing the same flags as separate booleans would take a byte per flag.
But even then, I don't think that is the real reason. The reason I prefer these is the power C# gives me to handle enums. I can add several values with a single expression. I can remove them also. I can even compare several values at once with a single expression using the enum. With booleans, code can become, let's say, more verbose.
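For example, with the AnchorStyles enum from the question:
// set several flags in one expression
AnchorStyles anchors = AnchorStyles.Top | AnchorStyles.Left;

// add another, then remove one
anchors |= AnchorStyles.Right;
anchors &= ~AnchorStyles.Left;

// test several values at once
AnchorStyles wanted = AnchorStyles.Top | AnchorStyles.Right;
bool topAndRight = (anchors & wanted) == wanted;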
From a domain model perspective, it just models reality better in some situations. If you have three booleans like AccountIsInDefault, IsPreferredCustomer, and RequiresSalesTaxState, then it doesn't make sense to merge them into a single Flags-decorated enumeration, because they are not three distinct values for the same domain model element.
But if you have a set of booleans like:
[Flags] enum AccountStatus {AccountIsInDefault=1,
AccountOverdue=2, AccountFrozen=4}
or
[Flags] enum CargoState {ExceedsWeightLimit=1,
ContainsDangerousCargo=2, IsFlammableCargo=4,
ContainsRadioactive=8}
Then it is useful to be able to store the total state of the account (or the cargo) in ONE variable... one that represents ONE domain element whose value can cover any possible combination of states.
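For instance, with the CargoState enum above, the whole state travels in one field:
CargoState state = CargoState.ExceedsWeightLimit | CargoState.ContainsDangerousCargo;

// one test answers a question that spans several of the underlying booleans
bool needsSpecialHandling =
    (state & (CargoState.ContainsDangerousCargo | CargoState.ContainsRadioactive)) != 0;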
Raymond Chen has a blog post on this subject.
Sure, bitfields save data memory, but you have to balance it against the cost in code size, debuggability, and reduced multithreading.
As others have said, its time is largely past. It's tempting to still do it, because bit fiddling is fun and cool-looking, but it's no longer more efficient, it has serious drawbacks in terms of maintenance, it doesn't play nicely with databases, and unless you're working in an embedded world, you have enough memory.
I would suggest never using enum flags unless you are dealing with some pretty serious memory limitations (not likely). You should always write code optimized for maintenance.
Having several boolean properties makes it easier to read and understand the code, change the values, and provide IntelliSense comments, not to mention reduce the likelihood of bugs. If necessary, you can always use an enum flag field internally; just make sure you expose the setting/getting of the values through boolean properties.
Space efficiency - 1 bit
Time efficiency - bit comparisons are handled quickly by hardware.
Language independence - where the data may be handled by a number of different programs you don't need to worry about the implementation of booleans across different languages/platforms.
Most of the time, these are not worth the tradeoff in terms of maintenance. However, there are times when it is useful:
Network protocols - there will be a big saving in reduced size of messages
Legacy software - once I had to add some information for tracing into some legacy software.
Cost to modify the header: millions of dollars and years of effort.
Cost to shoehorn the information into 2 bytes in the header that weren't being used: 0.
Of course, there was the additional cost in the code that accessed and manipulated this information, but that was done through functions anyway, so once you had the accessors defined it was no less maintainable than using booleans.
I have seen answers citing time efficiency and compatibility. Those are indeed the reasons, but I do not think it has been well explained why these are sometimes still necessary in times like ours. From all the answers, and from experience chatting with other engineers, I have seen bit manipulation pictured as some sort of quirky old-time way of doing things that should just die because newer ways of doing things are better.
Yes, in very rare cases you may want to do it the "old way" for performance's sake, for example if you have the classic million-iteration loop. But I would say that is the wrong perspective to put on things.
While it is true that you should NOT care at all and should use whatever the C# language throws at you as the new right way™ to do things (enforced by some fancy AI code analysis slapping you whenever you do not meet its code style), you should understand deeply that low-level strategies aren't there randomly. Even more, in many cases they are the only way to solve things when you have no help from a fancy framework. Your OS, your drivers, and even .NET itself (especially the garbage collector) are built using bitfields and transactional instructions. Your CPU instruction set itself is a very complex bitfield, so JIT compilers encode their output using complex bit processing and a few hardcoded bitfields so that the CPU can execute them correctly.
When we talk about performance, these things have a much larger impact than people imagine, today more than ever, especially when you start considering multicores.
When multicore systems started to become more common, all CPU manufacturers started to mitigate the issues of SMP by adding dedicated transactional memory access instructions. While these were made specifically to mitigate the near-impossible task of making multiple CPUs cooperate at kernel level without a huge drop in performance, they also provide additional benefits, like an OS-independent way to speed up the low-level parts of most programs. Basically, your program can use CPU-assisted instructions to perform changes to integer-sized memory locations, that is, a read-modify-write where the "modify" part can be anything you want, though the most common patterns are some combination of set/clear/increment.
Usually the CPU simply monitors whether any other CPU is accessing the same address, and if contention happens it stops the operation from being committed to memory and signals the event to the application within the same instruction. This seems a trivial task, but superscalar CPUs (each core has multiple ALUs allowing instruction parallelism), multi-level caches (some private to each core, some shared by a cluster of cores), and Non-Uniform Memory Access systems (check Threadripper CPUs) make it difficult to keep things coherent. Luckily, the smartest people in the world work to boost performance and keep all these things happening correctly. Today's CPUs have a large number of transistors dedicated to this task so that caches and our read-modify-write transactions work correctly.
C# lets you use the most common transactional memory access patterns through the Interlocked class (it exposes only a limited set; for example, a very useful "clear mask and increment" is missing, but you can always use CompareExchange instead, which gets very close to the same performance).
To achieve the same result with an array of booleans you would need some sort of lock, and under contention a lock is several orders of magnitude less performant than the atomic instructions.
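As a hedged sketch of what that looks like in C#, here is a lock-free "try to claim bit n" on a shared int built from Interlocked.CompareExchange (the class and method names are mine):
using System.Threading;

static class AtomicBits
{
    // Atomically set the given bit; returns true only if this call flipped it from 0 to 1.
    public static bool TryClaim(ref int field, int bit)
    {
        int mask = 1 << bit;
        while (true)
        {
            int current = field;
            if ((current & mask) != 0)
                return false;                       // someone else already owns it
            if (Interlocked.CompareExchange(ref field, current | mask, current) == current)
                return true;                        // we won the race
            // otherwise another thread changed the field between read and CAS; retry
        }
    }

    // Atomically release the given bit.
    public static void Release(ref int field, int bit)
    {
        int mask = ~(1 << bit);
        while (true)
        {
            int current = field;
            if (Interlocked.CompareExchange(ref field, current & mask, current) == current)
                return;
        }
    }
}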
Here are some examples of highly appreciated hardware-assisted transactional access using bitfields, which would require a completely different strategy without them; of course, these are outside C#'s scope:
Assume a DMA peripheral that has a set of DMA channels, say 20 (but any number up to the bit width of the interlocked integer will do). When any peripheral interrupt, which might execute at any time, including from your beloved OS and from any core of your 32-core latest-gen CPU, wants a DMA channel, you need to allocate one (assign it to the peripheral) and use it. A bitfield covers all those requirements and uses just a dozen instructions to perform the allocation, and they are inlineable within the requesting code. Basically, you cannot go faster than this, and your code is just a few functions; we delegate the hard part to the hardware. Constraints: bitfield only.
Assume a peripheral that, to perform its duty, requires some working space in normal RAM. For example, take a high-speed I/O peripheral that uses scatter-gather DMA: it uses fixed-size blocks of RAM, each populated with the description of the next transfer (the descriptor is itself made of bitfields) and chained to one another, creating a FIFO queue of transfers in RAM. The application prepares the descriptors first and then chains them onto the tail of the current transfers without ever pausing the controller (not even disabling interrupts). The allocation and deallocation of such descriptors can be done using a bitfield and transactional instructions, so when it is shared between different CPUs, and between the driver interrupt and the kernel, everything still works without conflicts. One usage case: the kernel allocates descriptors atomically without stopping or disabling interrupts and without additional locks (the bitfield itself is the lock), and the interrupt deallocates them when the transfer completes.
Most of the old strategies were to preallocate the resources and force the application to free them after usage.
If you ever need multitasking on steroids, C# lets you use Threads + Interlocked, but lately C# introduced lightweight Tasks; guess how they are built? Transactional memory access using the Interlocked class. So you likely do not need to reinvent the wheel: the low-level parts are already covered and well engineered.
So the idea is: let smart people (not me, I am a common developer like you) solve the hard part for you, and just enjoy a general-purpose computing platform like C#. If you still see some remnants of these parts, it is because someone may still need to interface with worlds outside .NET and access a driver or a system call, for example, which requires knowing how to build a descriptor and put each bit in the right place. Don't be mad at those people; they made our jobs possible.
In short: Interlocked + bitfields. Incredibly powerful; you probably shouldn't use it yourself.
It is for speed and efficiency. Essentially all you are working with is a single int.
if ((flags & AnchorStyles.Top) == AnchorStyles.Top)
{
//Do stuff
}
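Since .NET 4 the same check can also be written with Enum.HasFlag, which reads more clearly at the cost of a small boxing overhead:
if (flags.HasFlag(AnchorStyles.Top))
{
    //Do stuff
}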