C# store and read large array of objects

I have a WinForms C# application that performs calculations on a raster. The calculation results are stored as objects in an array whose total length depends on the project, but is currently around 1 million entries (and I want to make it larger, even 2 or 3 million). The goal of the application is to run queries against the data: the user (de)selects some properties, then the app iterates over the array and summarizes the values of the objects for each array entry. The results are shown as a picture (each pixel is an array entry).
Currently I'm storing the data as a compressed JSON string on disk, and I load all the data into memory. The advantage of doing this is that queries run very fast (max 2 seconds). The disadvantage is that it takes a lot of memory, and it will throw an out-of-memory exception if the array gets larger (I'm already building the app as 64 bit).
Question: is there a way of storing my array on disk, without loading the entire array into memory, while still performing the queries very fast? I've done some tests with LiteDB, but executing the queries was not fast enough (though I have no experience with LiteDB, so maybe I'm doing something wrong). Is a database like LiteDB a good solution? Or is loading all the data into memory the only option?
Update: each entry in my array is a List of CellResultPart, with around 1 to 10 objects in the list. The definition is as follows:
public struct CellResultPart
{
    public CellResultPart(double designElevation, double existingElevation)
    {
        DesignElevation = designElevation;
        ExistingElevation = existingElevation;
        MaterialName = "<None>";
        Location = "<None>";
        EnvironmentalClass = "<None>";
        ElevationTop = double.NaN;
        ElevationBottom = double.NaN;
        ElevationLayerTop = double.NaN;
        ElevationLayerBottom = double.NaN;
        DepthLayerTop = double.NaN;
        DepthLayerBottom = double.NaN;
    }

    public double DesignElevation;
    public double ExistingElevation;

    public double Depth
    {
        get
        {
            if (IsExcavation)
            {
                return -Math.Round(Math.Abs(DepthBottom - DepthTop), 3);
            }
            else
            {
                return Math.Round(Math.Abs(DepthBottom - DepthTop), 3);
            }
        }
    }

    public double ElevationTop;
    public double ElevationBottom;
    public double ElevationLayerTop;
    public double ElevationLayerBottom;

    public double DepthTop
    {
        get
        {
            if (IsExcavation)
            {
                return -Math.Round(Math.Abs(ExistingElevation - ElevationTop), 3);
            }
            else
            {
                return Math.Round(Math.Abs(DesignElevation - ElevationTop), 3);
            }
        }
    }

    public double DepthBottom
    {
        get
        {
            if (IsExcavation)
            {
                return -Math.Round(Math.Abs(ExistingElevation - ElevationBottom), 3);
            }
            else
            {
                return Math.Round(Math.Abs(DesignElevation - ElevationBottom), 3);
            }
        }
    }

    public double DepthLayerTop;
    public double DepthLayerBottom;
    public string EnvironmentalClass;
    public string Location;
    public string MaterialName;

    public bool IsExcavation
    {
        get { return DesignElevation <= ExistingElevation; }
    }
}

Let's make some rough calculations. You have 10 doubles and 3 strings. Let's assume the strings are 20 characters on average. That gives you about 200 bytes per entry, or 200-600 MB overall. That should be feasible to keep in memory, even on a 32-bit system.
Using JSON will probably not help, since it makes the data much larger. I would consider some binary format that comes closer to the theoretical required size. I have used protobuf-net with good results. It also supports SerializeWithLengthPrefix, which allows you to serialize each object independently of the others into a single stream, and that avoids the need to keep everything in memory at the same time.
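A minimal sketch of that streaming pattern with protobuf-net (assuming CellResultPart is annotated with [ProtoContract]/[ProtoMember] attributes; the file name and field number 1 are arbitrary, and results stands in for the in-memory array):

using System.IO;
using ProtoBuf;

// write: each entry is serialized independently into one stream
using (var stream = File.Create("results.bin"))
{
    foreach (CellResultPart part in results)
    {
        Serializer.SerializeWithLengthPrefix(stream, part, PrefixStyle.Base128, 1);
    }
}

// read: entries come back one at a time, so the whole
// array never has to be materialized in memory at once
using (var stream = File.OpenRead("results.bin"))
{
    foreach (CellResultPart part in
             Serializer.DeserializeItems<CellResultPart>(stream, PrefixStyle.Base128, 1))
    {
        // aggregate into the per-pixel summary here
    }
}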
The other option would be to use some kind of database. Such a solution would most likely scale better as the size increases. Database performance is mostly a matter of using appropriate indices, and I assume that is the reason your attempt went poorly. Creating good indices may be difficult if you have no idea what queries will be run, but I would still expect a database to perform better than a linear search.
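If the LiteDB attempt was scanning the collection, an index on the queried property is the first thing to try. A sketch with the stock LiteDB API (collection name and query are illustrative; LiteDB maps documents as POCOs with an id, so the struct may need a small wrapper class):

using LiteDB;

using (var db = new LiteDatabase(@"results.db"))
{
    var col = db.GetCollection<CellResultPart>("cellResults");

    // without an index, every query walks all ~1M documents
    col.EnsureIndex(x => x.MaterialName);

    // with the index, this becomes a seek instead of a full scan
    var hits = col.Find(x => x.MaterialName == "Sand");
}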

Related

best data structure for storing large number of numeric fields

I am working with a class, say Widget, that has a large number of numeric real-world attributes (e.g., height, length, weight, cost, etc.). There are different types of widgets (sprockets, cogs, etc.), but each widget shares the exact same attributes (the values will differ by widget, of course, but they all have a height, weight, etc.). I have 1,000s of each type of widget (1,000 cogs, 1,000 sprockets, etc.).
I need to perform a lot of calculations on these attributes (say calculating the weighted average of the attributes for 1000s of different widgets). For the weighted averages, I have different weights for each widget type (ie, I may care more about length for sprockets than for cogs).
Right now, I am storing all the attributes in a Dictionary<string, double> within each widget (the widgets have an enum that specifies their type: cog, sprocket, etc.). I then have some calculator classes that store weights for each attribute as a Dictionary<WidgetType, Dictionary<string, double>>. To calculate the weighted average for each widget, I simply iterate through its attribute dictionary keys like:
double weightedAvg = 0.0;
foreach (string attributeName in widget.Attributes.Keys)
{
    double attributeValue = widget.Attributes[attributeName];
    double attributeWeight = calculator.Weights[widget.Type][attributeName];
    weightedAvg += (attributeValue * attributeWeight);
}
So this works fine and is pretty readable and easy to maintain, but based on some profiling it is very slow for 1000s of widgets. My universe of attribute names is known and will not change during the life of the application, so I am wondering what some better options are. The few I can think of:
1) Store attribute values and weights in double[]s. I think this is probably the most efficient option, but then I need to make sure the arrays are always stored in the same order between widgets and calculators. This also decouples the data from the metadata, so I will need to store an array (?) somewhere that maps between the attribute names and the indexes into the double[]s of attribute values and weights.
2) Store attribute values and weights in immutable structs. I like this option because I don't have to worry about the ordering and the data is "self-documenting". But is there an easy way to loop over these attributes in code? I have almost 100 attributes, so I don't want to hardcode them all. I could use reflection, but I worry that this would incur an even larger penalty since I am looping over so many widgets and would have to use reflection on each one.
Any other alternatives?
Three possibilities come immediately to mind. The first, which I think you rejected too readily, is to have individual fields in your class. That is, individual double values named height, length, weight, cost, etc. You're right that it would be more code to do the calculations, but you wouldn't have the indirection of dictionary lookup.
Second is to ditch the dictionary in favor of an array. So rather than a Dictionary<string, double>, you'd just have a double[]. Again, I think you rejected this too quickly. You can easily replace the string dictionary keys with an enumeration. So you'd have:
enum WidgetProperty
{
    First = 0,
    Height = 0,
    Length = 1,
    Weight = 2,
    Cost = 3,
    ...
    Last = 100
}
Given that and an array of double, you can easily go through all of the values for each instance:
for (int i = (int)WidgetProperty.First; i < (int)WidgetProperty.Last; ++i)
{
    double attributeValue = widget.Attributes[i];
    double attributeWeight = calculator.Weights[widget.Type][i];
    weightedAvg += (attributeValue * attributeWeight);
}
Direct array access is going to be significantly faster than accessing a dictionary by string.
Finally, you can optimize your dictionary access a little bit. Rather than doing a foreach on the keys and then doing a dictionary lookup, do a foreach on the dictionary itself:
foreach (KeyValuePair<string, double> kvp in widget.Attributes)
{
    double attributeValue = kvp.Value;
    double attributeWeight = calculator.Weights[widget.Type][kvp.Key];
    weightedAvg += (attributeValue * attributeWeight);
}
To calculate weighted averages without looping or reflection, one way would be to calculate the weighted average of the individual attributes and store it somewhere. This should happen while you are creating the instance of the widget. The following is sample code, which needs to be adapted to your needs.
Also, for further processing of the widgets themselves, you can use data parallelism; see my other response in this thread.
public enum WidgetType { Cog, Sprocket }

public static class Calculator
{
    public static Dictionary<WidgetType, Dictionary<string, double>> Weights { get; } =
        new Dictionary<WidgetType, Dictionary<string, double>>();
}

public static class WeightStore
{
    static Dictionary<int, double> widgetWeightedAvg = new Dictionary<int, double>();

    public static void AttWeightedAvgAvailable(double attWeightedAvg, int widgetId)
    {
        // accumulate the running weighted average per widget
        if (widgetWeightedAvg.ContainsKey(widgetId))
            widgetWeightedAvg[widgetId] += attWeightedAvg;
        else
            widgetWeightedAvg[widgetId] = attWeightedAvg;
    }
}

public class WidgetAttribute
{
    public string Name { get; }
    public double Value { get; }

    public WidgetAttribute(string name, double value, WidgetType type, int widgetId)
    {
        Name = name;
        Value = value;
        double attWeight = Calculator.Weights[type][name];
        WeightStore.AttWeightedAvgAvailable(Value * attWeight, widgetId);
    }
}

public class CogWidget
{
    public int Id { get; set; }
    public WidgetAttribute Height { get; set; }
    public WidgetAttribute Weight { get; set; }
}

public class Client
{
    public void BuildCogWidgets()
    {
        var widget = new CogWidget();
        widget.Id = 1;
        widget.Height = new WidgetAttribute("height", 12.22, WidgetType.Cog, widget.Id);
    }
}
As is always the case with data normalization, the normalization level you choose determines a good part of the performance. It looks like you would have to move from your current model to another model, or a mix.
Better performance for your scenario is possible when you do not process this on the C# side, but in the database instead. You then get the benefit of indexes, no data transfer except the wanted result, plus the 100,000s of man-hours already spent on performance optimization.
Use data parallelism, supported by .NET 4 and above.
https://msdn.microsoft.com/en-us/library/dd537608(v=vs.110).aspx
An excerpt from the above link
When a parallel loop runs, the TPL partitions the data source so that the loop can operate on multiple parts concurrently. Behind the scenes, the Task Scheduler partitions the task based on system resources and workload. When possible, the scheduler redistributes work among multiple threads and processors if the workload becomes unbalanced
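For example, the weighted-average loop from the question parallelizes naturally over widgets, since each iteration only reads shared data and writes its own result. A minimal sketch, assuming a Widget class with the Type and Attributes members shown earlier:

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

var results = new ConcurrentDictionary<Widget, double>();

Parallel.ForEach(widgets, widget =>
{
    // reads Weights (shared, never mutated here); writes only this widget's entry
    double weightedAvg = 0.0;
    foreach (KeyValuePair<string, double> kvp in widget.Attributes)
    {
        weightedAvg += kvp.Value * calculator.Weights[widget.Type][kvp.Key];
    }
    results[widget] = weightedAvg;
});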

c# Best type/collection/list/dataset to handle super large data (csv/tab files)

I am building one WPF (MVVM) app that handles really large csv files. We are talking about 1GB to 10GB.
I open the file and parse it with File.ReadLines into a List of the following class:
public class FileLine
{
    public DateTime Time { get; set; }
    public string Message { get; set; } // usually around 256 characters
    public string Info1 { get; set; }   // exactly 56 characters
    public string Info2 { get; set; }   // exactly 4 characters
    //and so on
}
... then I do all sorts of data manipulation, queries, charts... you name it... everything using Linq.
We are testing with a 1.8GB file; when it is opened, the process takes around 2GB of memory.
Eventually, when my customer needs to open his 10GB file, it will be impossible, because it is going to take 12GB+ of memory.
What is the best type/collection/list/dataset for this kind of work?
When I've had to do something like this before, I handled it with a container object that held a list of dictionaries. At the time I thought the limit would/should be 2^32 elements, but an exception for exceeding the collection size was thrown well before reaching 2^32 elements, with many GB of RAM still free. Say you want a Dictionary; something like the following should work until you really do exhaust all physical and virtual memory. I remember when I worked on this a few years ago the server actually had 512GB of RAM, and I'm sure they have ones with more now... anyway, that's a separate story.
public class MyHugeDictionary<TKey, TValue>
{
    private readonly List<Dictionary<TKey, TValue>> allDict;
    private Dictionary<TKey, TValue> currDictionary;

    public MyHugeDictionary()
    {
        allDict = new List<Dictionary<TKey, TValue>>();
        currDictionary = new Dictionary<TKey, TValue>();
        allDict.Add(currDictionary);
    }

    public bool ItemExists(TKey key)
    {
        // the key could be in any chunk, so check them all
        foreach (Dictionary<TKey, TValue> dict in allDict)
        {
            if (dict.ContainsKey(key))
            {
                return true;
            }
        }
        return false;
    }

    public void Add(TKey key, TValue value)
    {
        try
        {
            if (!ItemExists(key)) // find if the key is in any other chunk first
            {
                currDictionary.Add(key, value);
            }
            else
            {
                // handle dups...
            }
        }
        catch (OutOfMemoryException) // look up the actual exception thrown when a chunk gets too big
        {
            // current chunk is full: start a new one and retry;
            // if memory is truly exhausted, it's game over for real :(
            currDictionary = new Dictionary<TKey, TValue>();
            allDict.Add(currDictionary);
            currDictionary.Add(key, value);
        }
    }
}
After some discussion, the best approach is to read the file, process it, and discard everything else, keeping only the result.
Another possibility was to use a database, but that would add too much complexity, although it is possible.
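Sticking with the first option: because File.ReadLines yields lines lazily, the parsing and the aggregation can happen in a single pass, so only the running result ever lives in memory. A minimal sketch, where ParseLine stands in for the existing parsing logic and the aggregation is just an example:

// count lines per day without ever building a List<FileLine>
var countPerDay = new Dictionary<DateTime, int>();

foreach (string line in File.ReadLines(path))
{
    FileLine entry = ParseLine(line); // your existing parsing logic

    int count;
    countPerDay.TryGetValue(entry.Time.Date, out count);
    countPerDay[entry.Time.Date] = count + 1;
}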
See this:
https://github.com/aumcode/nfx/tree/master/Source/NFX/ApplicationModel/Pile
https://www.infoq.com/articles/Big-Memory-Part-3
You can store whatever you want - no pauses.
The problem with large collections is:
a. They are not really designed to hold very many entries (i.e. Dictionary never shrinks back to zero size)
b. You get GC stalls/pauses when you have too many objects
See the links above - what we did is "hide" the data from the GC, as described in the article. This way you can store millions of objects, using the LocalCache class as a dictionary.
For large-memory apps in .NET, remember to build for 64 bit and set the GC to server mode in your app config file.
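For a .NET Framework WinForms app, that is the <gcServer> element in app.config (64 bit comes from the project's platform target rather than from config):

<configuration>
  <runtime>
    <!-- server GC trades some memory for much better throughput on large heaps -->
    <gcServer enabled="true" />
  </runtime>
</configuration>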

Database queries in Entity Framework model - variable equals zero

I have some problems with using the database in my Model. I suspect that it's not a good idea to use database queries in the Model, but I don't know how to do it better.
Code:
Let's assume that I have an application to analyze football scores. I have an EF model that stores info about a footballer:
public class Player
{
    [...]
    public virtual ICollection<Goal> HisGoals { get; set; }

    public float Efficiency
    {
        get
        {
            using (var db = new ScoresContext())
            {
                var allGoalsInSeason = db.Goals.Count();
                return HisGoals.Count / allGoalsInSeason;
            }
        }
    }
}
Problem:
So the case is: I want a property in my model called "Efficiency" that returns the quotient of two variables, one of which contains data fetched in real time from the database.
For now this code doesn't work: "Efficiency" equals 0. I tried the debugger and all the data is correct, so it should return a different value.
What's wrong? Why does it always return zero?
My suspicions:
Maybe I'm wrong, I'm not good at C#, but I think the reason Efficiency is always zero is that I use the database in it and it is somehow asynchronous: when I read the property, it returns zero first and only then queries the database.
I think your problem lies in dividing integer by integer. In order to get a float you have to cast the first one to float, like this:
public float Efficiency
{
    get
    {
        using (var db = new ScoresContext())
        {
            var allGoalsInSeason = db.Goals.Count();
            return (float)HisGoals.Count / allGoalsInSeason;
        }
    }
}
Dividing int by int always results in an int, which in your case is 0 (if it is, as you said in a comment, 4/11).
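To illustrate:

int goals = 4, total = 11;

float wrong = goals / total;        // 0: 4 / 11 is integer division, truncated before the assignment
float right = (float)goals / total; // ~0.364: the cast forces floating-point division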
A second thing: Entity Framework will cache values - test that before shipping to production.

If - return is a huge bottleneck in my application

This is a snippet of code from my C# application:
public Player GetSquareCache(int x, int y)
{
    if (squaresCacheValid)
        return (Player)SquaresCache[x, y];
    else
        //generate square cache and retry...
}
squaresCacheValid is a private bool and SquaresCache is a private uint[,].
The problem is that the application runs extremely slowly, and every optimization so far has only made it slower, so I ran a tracing session.
I found that GetSquareCache() gets 94.41% own time, with the if and the return splitting that value roughly evenly (46% for the if and 44.82% for the return statement). The method is hit about 15,000 times in 30 seconds, in some tests going up to 20,000.
Before I added the methods that call GetSquareCache(), the program performed pretty well, but it was using a random value instead of actual GetSquareCache() calls.
My questions are: is it possible that these if/return statements use up so much CPU time? How is it possible that the if statements GetSquareCache() is called from (which in total are hit the same number of times) have minimal own time? And is it possible to speed up an operation as fundamental as an if?
Edit: Player is defined as
public enum Player
{
    None = 0,
    PL1 = 1,
    PL2 = 2,
    Both = 3
}
I would suggest a different approach, under the assumption that most of the squares hold no player and that the grid is very large: remember only the locations where there are players. It should look something like this:
public class PlayerLocation
{
    private readonly Dictionary<Point, Player> _playerLocation = new Dictionary<Point, Player>();

    public void SetPlayer(int x, int y, Player p)
    {
        _playerLocation[new Point(x, y)] = p;
    }

    public Player GetSquareCache(int x, int y)
    {
        if (squaresCacheValid)
        {
            Player value;
            if (_playerLocation.TryGetValue(new Point(x, y), out value))
            {
                return value;
            }
            return Player.None;
        }
        else
            //generate square cache and retry...
    }
}
The problem is simply the fact that the method is called way too many times. And indeed, the 34,637 ms it gets in the last trace, over the 34,122 hits it got, is a little over 1 ms per hit. In the decompiled CIL there are also some assignments to local variables, not present in the source, in both if branches, because the method needs a single ret statement. The algorithm itself is what needs to be modified, and such modifications were planned anyway.
- Replace the return type of this method with int and remove the cast to Player.
- If the cache is set only once, remove the if from this method so that it is always true when the method is called.
- Replace the array with a single-dimension array and access it via unsafe/fixed code (see the sketch below).
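A rough sketch of the last point, without even going unsafe (the width field is an assumption): element access on a uint[,] goes through a runtime helper method, while a flat uint[] uses a single indexed load.

private uint[] squaresCache; // flattened from uint[x, y]
private int width;           // number of columns

public uint GetSquareCacheFast(int x, int y)
{
    // one multiply and one add instead of the
    // multi-dimensional array accessor call
    return squaresCache[y * width + x];
}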

Is this more suited for key value storage or a tree?

I'm trying to figure out the best way to represent some data. It basically follows the form Manufacturer.Product.Attribute = Value. Something like:
Acme.*.MinimumPrice = 100
Acme.ProductA.MinimumPrice = 50
Acme.ProductB.MinimumPrice = 60
Acme.ProductC.DefaultColor = Blue
So the minimum price across all Acme products is 100, except for products A and B. I want to store this data in C# and have some function where GetValue("Acme.ProductC.MinimumPrice") returns 100 but GetValue("Acme.ProductA.MinimumPrice") returns 50.
I'm not sure how to best represent the data. Is there a clean way to code this in C#?
Edit: I may not have been clear. This is configuration data that needs to be stored in a text file, then parsed and stored in memory in some way so that it can be retrieved like in the examples I gave.
Write the text file exactly like this:
Acme.*.MinimumPrice = 100
Acme.ProductA.MinimumPrice = 50
Acme.ProductB.MinimumPrice = 60
Acme.ProductC.DefaultColor = Blue
Parse it into a path/value pair sequence:
foreach (var pair in File.ReadAllLines(configFileName)
    .Select(l => l.Split('='))
    .Select(a => new { Path = a[0].Trim(), Value = a[1].Trim() }))
{
    // do something with each pair.Path and pair.Value
}
Now, there are two possible interpretations of what you want to do. The string Acme.*.MinimumPrice could mean that for any lookup where there is no specific override, such as Acme.Toadstool.MinimumPrice, we return 100 - even though nothing refers to Toadstool anywhere in the file. Or it could mean that it should only return 100 if there are other specific mentions of Toadstool in the file.
If it's the former, you could store the whole lot in a flat dictionary, and at look up time keep trying different variants of the key until you find something that matches.
If it's the latter, you need to build a data structure of all the names that actually occur in the path structure, to avoid returning values for ones that don't actually exist. This seems more reliable to me.
So going with the latter option, Acme.*.MinimumPrice is really saying "add this MinimumPrice value to any product that doesn't have its own specifically defined value". This means that you can basically process the pairs at parse time to eliminate all the asterisks, expanding it out into the equivalent of a completed version of the config file:
Acme.ProductA.MinimumPrice = 50
Acme.ProductB.MinimumPrice = 60
Acme.ProductC.DefaultColor = Blue
Acme.ProductC.MinimumPrice = 100
The nice thing about this is that you only need a flat dictionary as the final representation and you can just use TryGetValue or [] to look things up. The result may be a lot bigger, but it all depends how big your config file is.
You could store the information more minimally, but I'd go with something simple that works to start with, and give it a very simple API so that you can re-implement it later if it really turns out to be necessary. You may find (depending on the application) that making the look-up process more complicated is worse over all.
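For reference, the first interpretation (try key variants at lookup time) could look like this minimal sketch, assuming paths always have exactly three parts:

private readonly Dictionary<string, string> _values = new Dictionary<string, string>();

public string GetValue(string path)
{
    string value;
    if (_values.TryGetValue(path, out value))
        return value;

    // no exact match: retry with the product replaced by the wildcard,
    // e.g. "Acme.ProductC.MinimumPrice" -> "Acme.*.MinimumPrice"
    string[] parts = path.Split('.');
    parts[1] = "*";
    if (_values.TryGetValue(string.Join(".", parts), out value))
        return value;

    throw new KeyNotFoundException(path);
}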
I'm not entirely sure what you're asking, but it sounds like you're saying one of two things.
I need a function that will return a fixed value, 100, for every product ID except for two cases: ProductA and ProductB
In that case you don't even need a data structure. A simple comparison function will do
int GetValue(string key) {
    if ( key == "Acme.ProductA.MinimumPrice" ) { return 50; }
    else if (key == "Acme.ProductB.MinimumPrice") { return 60; }
    else { return 100; }
}
Or you could have been asking
I need a function that will return a value if already defined or 100 if it's not
In that case I would use a Dictionary<string,int>. For example
class DataBucket {
    private Dictionary<string,int> _priceMap = new Dictionary<string,int>();

    public DataBucket() {
        _priceMap["Acme.ProductA.MinimumPrice"] = 50;
        _priceMap["Acme.ProductB.MinimumPrice"] = 60;
    }

    public int GetValue(string key) {
        int price = 0;
        if ( !_priceMap.TryGetValue(key, out price)) {
            price = 100;
        }
        return price;
    }
}
One of the ways: you can create a nested dictionary, Dictionary<string, Dictionary<string, Dictionary<string, object>>>. In your code you split "Acme.ProductA.MinimumPrice" on the dots and get or set a value in the dictionary corresponding to the split chunks.
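A minimal sketch of that idea (three fixed levels; error handling omitted):

private readonly Dictionary<string, Dictionary<string, Dictionary<string, object>>> _data =
    new Dictionary<string, Dictionary<string, Dictionary<string, object>>>();

public void SetValue(string path, object value)
{
    string[] p = path.Split('.'); // manufacturer, product, attribute

    Dictionary<string, Dictionary<string, object>> products;
    if (!_data.TryGetValue(p[0], out products))
    {
        products = new Dictionary<string, Dictionary<string, object>>();
        _data[p[0]] = products;
    }

    Dictionary<string, object> attributes;
    if (!products.TryGetValue(p[1], out attributes))
    {
        attributes = new Dictionary<string, object>();
        products[p[1]] = attributes;
    }

    attributes[p[2]] = value;
}

public object GetValue(string path)
{
    string[] p = path.Split('.');
    return _data[p[0]][p[1]][p[2]];
}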
Another way is to use Linq2Xml: you can create an XDocument with Acme as the root node, products as children of the root, and the attributes stored either as attributes on the products or as child nodes. I prefer the second solution, but it would be slower if you have thousands of products.
I would take an OOP approach to this. The way you explain it, all your products are represented by objects, which is good. This seems like a good use of polymorphism.
I would have all products derive from a ProductBase class that has a virtual property supplying the default:
public virtual int MinimumPrice { get { return 100; } }
And then your specific products, such as ProductA, override that:
public override int MinimumPrice { get { return 50; } }
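Put together, a minimal sketch (class names are assumed):

public abstract class ProductBase
{
    public virtual int MinimumPrice { get { return 100; } }
}

public class ProductA : ProductBase
{
    public override int MinimumPrice { get { return 50; } }
}

public class ProductB : ProductBase
{
    public override int MinimumPrice { get { return 60; } }
}

// ProductC has no override, so it inherits the default of 100
public class ProductC : ProductBase { }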
