Storing data in custom-made files - C#

I am making a combinations generator. For small numbers of elements it is not a problem that the data is kept in RAM instead of in a file, but when the number of elements gets bigger, my computer runs out of memory (an OutOfMemoryException is thrown). The combinations are numbers stored in Lists, which are in turn stored in another List.
But this is only the first step - the generator itself works correctly. I want the data to be stored in a file, from which a different program will be able to extract the combinations it needs. I mostly need the data in a separate file because the generator will have to create more and bigger combinations in the future. The computer will have to read certain parts of the data without copying all of it into memory, because that is impossible.
I don't want to turn the data into text and convert the text back into data when needed; I think the conversions will make things slower. I want the lists to be stored in a custom-made file, from which the program can extract the data directly without any conversion.

There are a ton of options available; I will briefly describe a few.
Use a database. From your description this does not look like a good choice, but it would be the most flexible to all clients and gives relatively fast and efficient storage.
Use one of the .NET serializers; from your description the binary serializer would be your best choice. The serializers offer a lot of advantages: they are relatively fast, baked into .NET with built-in support, and very easy to use.
Use a custom binary format. This will be the fastest option, especially if you combine it with a memory-mapped file. However, binary formats can be hard to get right and easy to screw up.
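As an illustration of the last option, here is a rough sketch (the method names and file layout are just assumptions for the example): each combination is written as a length-prefixed block of Int32 values, and a side index of file offsets lets a reader seek straight to one record without loading the rest.

// Hypothetical sketch of a custom binary format for the combinations.
using System.Collections.Generic;
using System.IO;

static class CombinationFile
{
    public static void Write(string path, IEnumerable<List<int>> combinations)
    {
        var offsets = new List<long>();
        using (var writer = new BinaryWriter(File.Create(path)))
        {
            foreach (var combination in combinations)
            {
                offsets.Add(writer.BaseStream.Position); // remember where this record starts
                writer.Write(combination.Count);         // length prefix
                foreach (int value in combination)
                    writer.Write(value);
            }
        }
        // A separate index file lets the reader jump straight to record i later.
        using (var index = new BinaryWriter(File.Create(path + ".idx")))
            foreach (long offset in offsets)
                index.Write(offset);
    }
}

The reader can then look up the offset for record i in the index file and Seek straight to it, which matches the "read certain parts without copying all of it" requirement.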

If you really want to store your data in a file, you can use the BinaryFormatter class. It is probably the most efficient way of serializing data objects into a binary stream.
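A minimal sketch of that, assuming the combinations live in a List<List<int>> as in the question (note that deserializing still pulls the whole list back into memory at once, so this alone does not give you partial reads):

// Minimal BinaryFormatter sketch.
using System.Collections.Generic;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

var combinations = new List<List<int>> { new List<int> { 1, 2, 3 } };
var formatter = new BinaryFormatter();

using (var stream = File.Create("combinations.bin"))
    formatter.Serialize(stream, combinations);

using (var stream = File.OpenRead("combinations.bin"))
    combinations = (List<List<int>>)formatter.Deserialize(stream);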
But I wouldn't recommend generating combinations this way unless you need to store them all at one time and load them long after that. It is better to use lazy generation of combinations: produce them one by one, each fully formed, without the need to "generate bigger combinations in the future" (generate "the biggest" needed combinations one at a time - you might have to change your generation algorithm a bit; there are plenty of answers on how to do that already).
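For example, a sketch of lazy generation with an iterator (the recursive approach here is just one way to do it), so only the current combination has to exist in memory:

// Hypothetical lazy combination generator: yields k-element combinations one at a time.
using System.Collections.Generic;
using System.Linq;

static class LazyCombinations
{
    public static IEnumerable<List<int>> Of(IList<int> items, int k)
    {
        if (k == 0)
        {
            yield return new List<int>();
            yield break;
        }
        for (int i = 0; i <= items.Count - k; i++)
        {
            foreach (var tail in Of(items.Skip(i + 1).ToList(), k - 1))
            {
                tail.Insert(0, items[i]);
                yield return tail;
            }
        }
    }
}

// Usage: foreach (var c in LazyCombinations.Of(Enumerable.Range(1, 30).ToList(), 5)) { ... }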

There's a good write up on how to serialize a List<> to a file at http://www.switchonthecode.com/tutorials/csharp-tutorial-serialize-objects-to-a-file

You can use a persistent data structure instead; this will reduce the amount of memory needed by your app without changing the current code too much.
Have a look at this question:
Looking for a simple standalone persistent dictionary implementation in C#
There are a lot of resources on doing that, and the answers there point to some really interesting links.

Related

C# : Serialize objects to XML without reflection

In an application, we can save the current state of the application and its configuration (which can be huge). We are using the XmlSerializer.
We now have only what we need in the XML (all XmlIgnore attributes are in place), and it's VERY slow to store the whole configuration (a file of ~50-100 MB).
We NEED to keep storing this configuration as XML, but we would like to avoid :
Reflection, which is too slow
Implementing the IXmlSerializable interface
The idea was to have a method to implement in each object, in which we can register which fields/properties we want to serialize, and then have a SerializationManager that can read what we want to serialize and write it out.
This way, the objects don't know the format (XML) in which they will be rendered, and if one day we want binary serialization (or the ability to serialize to different formats), we can add it.
But we don't want to reinvent the wheel, and I don't know if some library exists for this, whether something like LINQ to XML can help, or whether this is natively possible...
So how do you think I can achieve this?
"The reflection, which is to slow"
Except, it doesn't use reflection at runtime. It performs metaprogramming on the first run (assuming you are using new XmlSerializer(type)) to inspect the type and generate static code that will work on the given type. Therefore, any volume-related performance issue is not related to reflection. There is a chance that the metaprogramming itself can take a measurable time, but a: this is unlikely unless your model is really complicated, and b: it can be avoided by using the sgen.exe tool to pre-generate the serialization assembly.
Any performance issue, therefore, is most likely due to the size of the model and the overhead of xml.
If you want to try a different serializer, consider something like protobuf-net. You won't be able to read the data by eye (it will not be XML), but the output will be much smaller and faster.
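A rough sketch of what that looks like with protobuf-net (the AppState type and its members are made up for the example; the attributes and Serializer calls are the library's usual pattern):

// Hedged protobuf-net sketch; AppState is a stand-in for the real configuration type.
using System.IO;
using ProtoBuf;

[ProtoContract]
public class AppState
{
    [ProtoMember(1)] public string Name { get; set; }
    [ProtoMember(2)] public double[] Values { get; set; }
}

public static class StateStore
{
    public static void Save(string path, AppState state)
    {
        using (var stream = File.Create(path))
            Serializer.Serialize(stream, state);
    }

    public static AppState Load(string path)
    {
        using (var stream = File.OpenRead(path))
            return Serializer.Deserialize<AppState>(stream);
    }
}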
As you mentioned
In an application, we can save the current state of the application and its configuration
State, especially when it is big (100 MB is... huge!), requires its own way of serializing data. Many of us know and hate the slow saving/loading of game saves from the past. Even now, game developers distinguish a quicksave from an ordinary save; the quicksave is optimized to happen faster (for example, by caching part of the most recent quicksave).
The first question is: why XML? BinarySerializer is faster, but for sizes like this you are better off with manual serialization (as Marc Gravell suggested, use protobuf; it is superior to pretty much everything else).
The second question is: do you really need to serialize the data (change its format)? The fastest way to save state is to dump memory. Imagine you have all your data in one block of memory; dumping that block into a file is a very quick save. You may (I am not sure, but it should be doable) construct your data in such a way that overwriting this memory amounts to loading the saved state. This is much faster than any conversion.
If you go with dumping, then consider packing it (into a zip). Packing and saving 10 MB should be faster than saving an unpacked 100 MB (assuming you are not using a packing algorithm that is too slow or too aggressive); memory operations and the CPU are much faster than an SSD.
To save the configuration, you can still serialize it as usual. If you want everything in a single file, then define your own format for that file, for example:
config_stream, separator["<<<>>>>"], memory block [100 Mb]
Serialize with XmlSerializer into memory, create the file, write the config, then the separator, then the dump.
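A rough sketch of that layout, assuming the separator bytes from the format above and treating the state as an opaque byte array (a length prefix is added here so the reader does not have to scan for the separator):

// Hypothetical single-file snapshot: [config length][XML config][separator][memory dump].
using System.IO;
using System.Text;
using System.Xml.Serialization;

static class Snapshot
{
    public static void Save(string path, object config, byte[] memoryBlock)
    {
        byte[] separator = Encoding.ASCII.GetBytes("<<<>>>>");
        using (var configStream = new MemoryStream())
        {
            new XmlSerializer(config.GetType()).Serialize(configStream, config);
            byte[] configBytes = configStream.ToArray();
            using (var writer = new BinaryWriter(File.Create(path)))
            {
                writer.Write(configBytes.Length); // length prefix for easy reading back
                writer.Write(configBytes);        // serialized configuration
                writer.Write(separator);          // marker from the suggested format
                writer.Write(memoryBlock);        // raw state dump
            }
        }
    }
}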

Performance considerations of a large hard-coded array in the .cs file

I'm writing some code where performance is important. In one part of it, I have to compare a large set of pre-computed data against dynamic values. Currently, I'm storing that pre-computed data in a giant array in the .cs file:
Data[] data = { /* my data set */ };
The data set is about 90kb, or roughly 13k elements. I was wondering if there's any downside to doing this, as opposed to loading it in from an external file? I'm not entirely sure how C# works internally, so I just wanted to be aware of any performance issues I might encounter with this method.
Well, 90kb just isn't that big, to be honest. You're not going to see any appreciable difference either way for an array of that size.
In general, if you had a huge array, I would not recommend storing it within a source file. It may not be a problem for runtime performance, but I could see it slowing down compilation.
From a design standpoint, the decision might come down to whether the data will ever change. If you're storing, say, the byte header of some file type, it may be reasonable to keep it in the source since it will never change. But for precomputed data, especially data you may regenerate at a later date, you should probably put it in an external file.
Bad:
Modifying a hard-coded data set is cumbersome
Good:
You're shielded from silly things like data file not being there, being corrupt or in the wrong format due to user error.
Don't have to load/parse the data
Sidenote:
If you're concerned about performance make sure to use an array, not a List: Performance of Arrays vs. Lists.

Memory-mapped file IList implementation, for storing large datasets "in memory"?

I need to perform operations chronologically on huge time series implemented as IList. The data is ultimately stored into a database, but it would not make sense to submit tens of millions of queries to the database.
Currently the in-memory IList triggers an OutOfMemory exception when trying to store more than 8 million (small) objects, though I would need to deal with tens of millions.
After some research, it looks like the best way to do it would be to store data on disk and access it through an IList wrapper.
Memory-mapped files (introduced in .NET 4.0) seem the right interface to use, but I wonder what is the best way to write a class that should implement IList (for easy access) and internally deal with a memory-mapped file.
I am also curious to hear if you know about other ways! I thought for example of an IList wrapper using data from db4o (someone mentioned here using a memory-mapped file as the IoAdapterFile, though using db4o probably adds a performance cost vs. dealing directly with the memory-mapped file).
I have come across this question asked in 2009, but it did not yield useful answers or serious ideas.
I found this PersistentDictionary<>, but it only works with strings, and by reading the source code I am not sure it was designed for very large datasets.
More scalable (up to 16 TB), the ESENT PersistentDictionary<> uses the ESENT database engine present in Windows (XP+) and can store all serializable objects containing simple types.
Disk Based Data Structures, including Dictionary, List and Array with an "intelligent" serializer looked exactly like what I was looking for, but it did not run smoothly with extremely large datasets, especially as it does not make use of the "native" .NET MemoryMappedFiles yet, and support for 32 bits systems is experimental.
Update 1: I ended up implementing my own version that makes extensive use of .NET MemoryMappedFiles; it is very fast and I will probably release it on Codeplex once I have made it better for more general purpose usages.
Update 2: TeaFiles.Net also worked great for my purpose. Highly recommended (and free).
I see several options:
"in-memory-DB"
for example SQLite can be used this way - no setup needed, just deploy the DLL (1 or 2) together with the app and the rest can be done programmatically
Load all data into temporary table(s) in the DB; with unknown (but big) amounts of data I found that this pays off really fast (and processing can usually be done inside the DB, which is even better!)
use a MemoryMappedFile and a fixed structure size (array-like access via offset), but beware that physical memory is the limit unless you use some sort of "sliding window" to map only parts into memory
Memory-mapped files are a nice way to do it, but they are going to be very slow if you need to access things randomly.
Your best bet is probably to come up with a fixed structure size when saved in memory (if you can); then you use the offset as the list item id. However, deletes/sorting are always a problem.
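A rough sketch of the fixed-record-size idea, assuming a blittable struct as the element type (the Tick struct and the class names here are made up; the index maps directly to a byte offset in the mapped file):

// Hypothetical IList-style wrapper over a memory-mapped file of fixed-size records.
using System;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]
struct Tick
{
    public long Timestamp;
    public double Price;
}

class MappedTickList : IDisposable
{
    static readonly int RecordSize = Marshal.SizeOf(typeof(Tick));
    readonly MemoryMappedFile _file;
    readonly MemoryMappedViewAccessor _accessor;

    public MappedTickList(string path, long capacity)
    {
        _file = MemoryMappedFile.CreateFromFile(path, FileMode.OpenOrCreate,
                                                null, capacity * RecordSize);
        _accessor = _file.CreateViewAccessor();
    }

    public Tick this[long index]
    {
        get { Tick item; _accessor.Read(index * RecordSize, out item); return item; }
        set { _accessor.Write(index * RecordSize, ref value); }
    }

    public void Dispose() { _accessor.Dispose(); _file.Dispose(); }
}

An actual IList<Tick> implementation would add Count and an enumerator on top of this indexer.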

Computing, storing, and retrieving values to and from an N-Dimensional matrix

This question is probably quite different from what you are used to reading here - I hope it can provide a fun challenge.
Essentially I have an algorithm that uses 5(or more) variables to compute a single value, called outcome. Now I have to implement this algorithm on an embedded device which has no memory limitations, but has very harsh processing constraints.
Because of this, I would like to run a calculation engine which computes outcome for, say, 20 different values of each variable and stores this information in a file. You may think of this as a 5(or more)-dimensional matrix or 5(or more)-dimensional array, each dimension being 20 entries long.
In any modern language, filling this array is as simple as having 5(or more) nested for loops. The tricky part is that I need to dump these values into a file that can then be placed onto the embedded device so that the device can use it as a lookup table.
The questions now are:
1. What format(s) might be acceptable for storing the data?
2. What programs (MATLAB, C#, etc.) might be best suited to compute the data?
3. C# must be used to import the data on the device - is this possible given your answer to #1?
Edit:
Is it possible to read from my lookup table file without reading the entire file into memory? Can you explain how that might be done in C#?
I'll comment on 1 and 3 as well. It may be preferable to use a fixed width output file rather than a CSV. This may take up more or less space than a CSV, depending on the output numbers. However, it tends to work well for lookup tables, as figuring out where to look in a fixed width data file can be done without reading the entire file. This is usually important for a lookup table.
Fixed width data, as with CSV, is trivial to read and write. Some math-oriented languages might offer poor string and binary manipulation functionality, but it should be really easy to convert the data to fixed width during the import step regardless.
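For instance, a hedged sketch of reading a single record from a fixed-width file by seeking (the record width and the one-value-per-record layout are assumptions for the example):

// Hypothetical fixed-width lookup: record i starts at byte i * recordWidth.
using System.Globalization;
using System.IO;
using System.Text;

static class FixedWidthTable
{
    public static double ReadRecord(string path, long recordIndex, int recordWidth)
    {
        using (var stream = File.OpenRead(path))
        {
            stream.Seek(recordIndex * recordWidth, SeekOrigin.Begin);
            var buffer = new byte[recordWidth];
            stream.Read(buffer, 0, recordWidth);
            string text = Encoding.ASCII.GetString(buffer).Trim();
            return double.Parse(text, CultureInfo.InvariantCulture);
        }
    }
}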
Number 2 is harder to answer, particularly without knowing what kind of algorithm you are computing. Matlab and similar programs tend to be great about certain types of computations and often have a lot of stuff built in to make it easier. That said, a lot of the math stuff that is built into such languages is available for other languages in the form of libraries.
I'll comment on (1) and (3). All you need to do is dump the data in slices. Pick a traversal and dump data out in that order. Write it out as comma-delimited numbers.
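As a sketch of that, assuming 20 sample values per variable and a placeholder ComputeOutcome standing in for the real algorithm:

// Hypothetical slice dump: one comma-delimited line per combination of inputs.
using System.Globalization;
using System.IO;

static class TableDump
{
    public static void Write(string path, double[] samples)   // e.g. 20 samples per variable
    {
        using (var writer = new StreamWriter(path))
            foreach (var a in samples)
            foreach (var b in samples)
            foreach (var c in samples)
            foreach (var d in samples)
            foreach (var e in samples)
                writer.WriteLine(string.Format(CultureInfo.InvariantCulture,
                    "{0},{1},{2},{3},{4},{5}", a, b, c, d, e, ComputeOutcome(a, b, c, d, e)));
    }

    static double ComputeOutcome(double a, double b, double c, double d, double e)
    {
        return a + b + c + d + e;   // placeholder for the actual algorithm
    }
}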

What is the best way to implement precomputed data?

I have a computation that calculates a resulting percentage based on certain input. But these calculations can take quite some time, which can be annoying. Since there are about 12500 possible inputs, I thought it would be a good idea to precompute all the data, and look this up during normal program execution.
My first idea was to just create a simple file which is read at program initialization and populates some arrays. Although this will work, I would like to know if there are some other options? For example that the array is populated during compile time.
BTW, I'm writing my code in C#.
This tutorial here implements a serializer, which you can use to easily convert an object to a binary file and back. Once you have the serializer in hand, you can just create an object that holds all your data and serialize it; when you actually run your program, just deserialize the object and use it.
This has all the benefits of saving an object to the hard drive, with an implementation that is object-agnostic (meaning you don't have to write much code for any object you want to serialize) and outputs in binary (thus saving space, if that is a concern).
A file with data is probably the easiest and most flexible way to implement it.
If you wanted it in memory without having to read it from somewhere, I would write a program to output your data in C#-like CSV format suitable for copying and pasting into an array/collection initializer, and thereby generate the source code for your precomputed data.
Create a program that outputs valid C# code which initializes your lookup tables. Make this part of your build process so that it will automatically create the source file and then build the rest of your project.
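A hedged sketch of that generator step (the class name and the Compute placeholder are made up; the emitted .cs file is then included in the project and rebuilt by the pre-build step):

// Hypothetical generator: writes PrecomputedData.cs containing the lookup table.
using System.Globalization;
using System.IO;
using System.Linq;

class GenerateLookupTable
{
    static void Main()
    {
        double[] table = Enumerable.Range(0, 12500)
                                   .Select(i => Compute(i))
                                   .ToArray();

        using (var writer = new StreamWriter("PrecomputedData.cs"))
        {
            writer.WriteLine("// Generated file - do not edit by hand.");
            writer.WriteLine("public static class PrecomputedData");
            writer.WriteLine("{");
            writer.WriteLine("    public static readonly double[] Values =");
            writer.WriteLine("    {");
            writer.WriteLine("        " + string.Join(", ",
                table.Select(v => v.ToString(CultureInfo.InvariantCulture))));
            writer.WriteLine("    };");
            writer.WriteLine("}");
        }
    }

    static double Compute(int input)
    {
        return input * 0.01;   // placeholder for the slow calculation
    }
}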
As Daniel Lew said, serialize it into a binary file.
If you need speed, go for a Dictionary. A Dictionary is indexed on its key and should allow rapid lookup even with large amounts of data.
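A minimal sketch of that idea, with a placeholder computation standing in for the real one (build it once, or fill it from the deserialized file):

// Hypothetical precomputed lookup: build (or deserialize) once, then read by key.
using System;
using System.Collections.Generic;

static class PercentageTable
{
    static readonly Dictionary<int, double> Lookup = new Dictionary<int, double>();

    public static void Build(Func<int, double> slowComputation)
    {
        for (int input = 0; input < 12500; input++)
            Lookup[input] = slowComputation(input);   // or fill this from the deserialized file
    }

    public static double Get(int input)
    {
        return Lookup[input];   // O(1) at runtime
    }
}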
I would always start by considering whether there is any way to avoid precomputing. If there are 12,500 possible inputs, how many are required per user request? Will all 12,500 be needed at the same time, or will they be spread out over time? If you can get by with calculating a few at a time, I'd do that with lazy initialization. I prefer this solution simply because I'll have fewer issues with it in the long run. What do you do when the persistent format changes, or the data changes? How will you handle it when the file is missing or corrupted? Persisting to a file does not result in less code.
I would serialize such a file to a human-readable format if I had to persist a preloaded version. I'd probably use XML serialization since it's simple. But quite often there are issues of invalidation and recalculation. Do the values never change, or only very infrequently?
I agree with mquander and Trent. Use your favorite language or script to generate the whole C# file you need to define your data (no copy-pasting, that's a manual step and error-prone). Add it as a Pre-Build event in Visual Studio. You could even detect that you have an up-to-date file and avoid regeneration for most builds.
There is definitely a way to statically generate almost any data using template metaprogramming in C++, although it can be painful. It's not worth it unless you need many sets of different data in several parts of your program. I am not familiar enough with metaprogramming in C# to evaluate the general effort in your case, but you should look into it.
