Best data structure for high-performance seek in C#

I was wondering which data structure would offer the best performance for my scenario.
My requirements are:
A possibly huge data set of several million records. I am going to write it only once, I am not going to change it during the execution lifetime, and I don't need it stored in a sorted way.
I was thinking of going with a List, but if I use a LINQ query and call InRange in the where condition, performance is very bad; if I do a foreach, performance is not much better. I am pretty sure there is a better way to do it (I was thinking of using a struct and/or implementing IEquatable, but performance did not improve).
Which data structure in C# is quickest for querying my ranges with optimal performance?
What I want is a data structure to store several million instances of the class Range:
class Range
{
    public int Low { get; set; }
    public int High { get; set; }
    public bool InRange(int val) { return val >= Low && val <= High; }
}
A logical choice would be a List, but I am afraid that the List class is not optimized for my requirements, since it is sorted and I don't need sorting, and that affects performance a lot...
Thanks for the help!

I think you may want an interval tree. Stack Overflow user alan2here has recently asked several questions regarding a project he's working on; Eric Lippert pointed him towards the interval tree structure in one of them.
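For reference, here is a minimal sketch of one way a centered interval tree could look for the Range class above; the pivot choice and list layout are simplifications, not the only (or best) way to build one:
// Minimal centered interval tree sketch (build once, query many times).
// Assumes the Range class from the question; uses System.Collections.Generic and System.Linq.
class IntervalTreeNode
{
    private readonly int center;
    private readonly IntervalTreeNode left;
    private readonly IntervalTreeNode right;
    private readonly List<Range> byLow;   // ranges overlapping 'center', sorted by Low
    private readonly List<Range> byHigh;  // same ranges, sorted by High descending

    public IntervalTreeNode(List<Range> ranges)
    {
        // Crude pivot choice; a real implementation would pick a median endpoint.
        center = ranges[ranges.Count / 2].Low;

        var leftSet = new List<Range>();
        var rightSet = new List<Range>();
        var overlapping = new List<Range>();
        foreach (var r in ranges)
        {
            if (r.High < center) leftSet.Add(r);
            else if (r.Low > center) rightSet.Add(r);
            else overlapping.Add(r);
        }

        byLow = overlapping.OrderBy(r => r.Low).ToList();
        byHigh = overlapping.OrderByDescending(r => r.High).ToList();
        if (leftSet.Count > 0) left = new IntervalTreeNode(leftSet);
        if (rightSet.Count > 0) right = new IntervalTreeNode(rightSet);
    }

    // Returns every stored range that contains 'val'.
    public IEnumerable<Range> Stab(int val)
    {
        if (val < center)
        {
            // Overlapping ranges all end at or after 'center', so only Low matters here.
            foreach (var r in byLow)
            {
                if (r.Low > val) break;
                yield return r;
            }
            if (left != null)
                foreach (var r in left.Stab(val)) yield return r;
        }
        else
        {
            // Overlapping ranges all start at or before 'center', so only High matters here.
            foreach (var r in byHigh)
            {
                if (r.High < val) break;
                yield return r;
            }
            if (right != null)
                foreach (var r in right.Stab(val)) yield return r;
        }
    }
}
You would build it once from your list of ranges and call Stab(value) for each query; each lookup only touches the nodes along one root-to-leaf path plus the ranges that actually overlap.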

Related

Preventing a double hash operation when updating a value in a Dictionary<IComparable, int>

I am working on software for scientific research that deals heavily with chemical formulas. I keep track of the contents of a chemical formula using an internal Dictionary<Isotope, int> where Isotope is an object like "Carbon-13", "Nitrogen-14", and the int represents the number of those isotopes in the chemical formula. So the formula C2H3NO would exist like this:
{"C12", 2
"H1", 3
"N14", 1
"O16", 1}
This is all fine and dandy, but when I want to add two chemical formulas together, I end up having to calculate the hash of Isotope twice to update a value; see the following code example.
public class ChemicalFormula
{
    internal Dictionary<Isotope, int> _isotopes = new Dictionary<Isotope, int>();

    public void Add(Isotope isotope, int count)
    {
        if (count != 0)
        {
            int curValue = 0;
            if (_isotopes.TryGetValue(isotope, out curValue))
            {
                int newValue = curValue + count;
                if (newValue == 0)
                {
                    _isotopes.Remove(isotope);
                }
                else
                {
                    _isotopes[isotope] = newValue;
                }
            }
            else
            {
                _isotopes.Add(isotope, count);
            }
            _isDirty = true;
        }
    }
}
While this may not seem like much of a slowdown, it is when we are adding billions of chemical formulas together: this method is consistently the slowest part of the program (>45% of the running time). I am dealing with large chemical formulas like "H5921C3759N1023O1201S21" that are repeatedly being added to by smaller chemical formulas.
My question is: is there a better data structure for storing data like this? I have tried creating a simple IsotopeCount object that contains an int so I can access the value through a reference type (as opposed to a value type) to avoid the double hash lookup. However, this didn't seem beneficial.
EDIT
Isotope is immutable and shouldn't change during the lifetime of the program, so I should be able to cache the hash code.
I have linked to the source code so you can see the classes in more depth, rather than copying and pasting them here.
I second the opinion that Isotope should be made immutable with a precalculated hash. That would make everything much simpler.
(In fact, a functional programming style is better suited to calculations of this sort, and it deals in immutable objects.)
I have tried creating a simple IsotopeCount object that contains an int so I can access the value through a reference type (as opposed to a value type) to avoid the double hash lookup. However, this didn't seem beneficial.
Well it would stop the double hashing... but obviously it's then worse in terms of space. What performance difference did you notice?
Another option you should strongly consider if you're doing this a lot is caching the hash within the Isotope class, assuming it's immutable. (If it's not, then using it as a dictionary key is at least somewhat worrying.)
If you're likely to use most Isotope values as dictionary keys (or candidates) then it's probably worth computing the hash during initialization. Otherwise, pick a particularly unlikely hash value (in an ideal world, that would be any value) and use that as the "uncached" value, and compute it lazily.
If you've got 45% of the running time in GetHashCode, have you looked at optimizing that? Is it actually GetHashCode, or Equals which is the problem? (You talk about "hashing" but I suspect you mean "hash lookup in general".)
If you could post the relevant bits of the Isotope type, we may be able to help more.
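Since the actual fields of Isotope aren't shown here, the following is only a sketch of the hash-caching idea above, assuming (hypothetically) that an isotope is identified by an element symbol and a mass number:
// Hypothetical Isotope with a precalculated hash; the fields are assumptions,
// not the questioner's real class.
public sealed class Isotope : IEquatable<Isotope>
{
    private readonly string symbol;     // e.g. "C"
    private readonly int massNumber;    // e.g. 12
    private readonly int hashCode;      // computed once in the constructor

    public Isotope(string symbol, int massNumber)
    {
        this.symbol = symbol;
        this.massNumber = massNumber;
        this.hashCode = (symbol.GetHashCode() * 397) ^ massNumber;
    }

    public bool Equals(Isotope other)
    {
        return other != null
            && massNumber == other.massNumber
            && symbol == other.symbol;
    }

    public override bool Equals(object obj)
    {
        return Equals(obj as Isotope);
    }

    public override int GetHashCode()
    {
        // Dictionary lookups now reuse the cached value instead of recomputing it.
        return hashCode;
    }
}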
EDIT: Another option to consider if you're using .NET 4 would be ConcurrentDictionary, with its AddOrUpdate method. You'd use it like this:
public void Add(Isotope isotope, int count)
{
    // I prefer early exit to lots of nesting :)
    if (count == 0)
    {
        return;
    }
    int newCount = _isotopes.AddOrUpdate(isotope, count,
        (key, oldCount) => oldCount + count);
    if (newCount == 0)
    {
        // ConcurrentDictionary exposes TryRemove rather than Remove
        int ignored;
        _isotopes.TryRemove(isotope, out ignored);
    }
    _isDirty = true;
}
Do you actually require random access to Isotope count by type or are you using the dictionary as a means for associating a key with a value?
I would guess the latter.
My suggestion to you is not to work with a dictionary but with a sorted array (or List) of IsotopeTuples, something like:
class IsotopeTuple
{
    Isotope i;
    int count;
}
sorted by the name of the isotope.
Why the sorting?
Because then, when you want to "add" two formulas together, you can do it in linear time by traversing both arrays, as sketched below (hope this is clear, I can elaborate if needed). No hash computation is required, just super fast comparisons of order.
This is a classic approach when dealing with vector multiplications where the dimensions are words.
Used widely in text mining.
The tradeoff, of course, is that the construction of the initial vector is O(n log n), but I doubt you will feel the impact.
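A rough sketch of that linear-time merge, assuming IsotopeTuple exposes a comparable Name and a Count (both names are placeholders, not from the question):
// Merge two formulas represented as lists sorted by isotope name.
static List<IsotopeTuple> AddFormulas(List<IsotopeTuple> a, List<IsotopeTuple> b)
{
    var result = new List<IsotopeTuple>(a.Count + b.Count);
    int i = 0, j = 0;
    while (i < a.Count && j < b.Count)
    {
        int cmp = string.CompareOrdinal(a[i].Name, b[j].Name);
        if (cmp < 0) result.Add(a[i++]);
        else if (cmp > 0) result.Add(b[j++]);
        else
        {
            // Same isotope in both formulas: combine the counts, no hashing needed.
            result.Add(new IsotopeTuple(a[i].Name, a[i].Count + b[j].Count));
            i++;
            j++;
        }
    }
    // Copy whatever is left of the longer list.
    while (i < a.Count) result.Add(a[i++]);
    while (j < b.Count) result.Add(b[j++]);
    return result;
}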
Another solution that you could think of if you had a limited number of Isotopes and no memory problems:
public struct Formula
{
    public int C12;
    public int H1;
    public int N14;
    public int O16;
}
I am guessing you're looking at organic chemistry, so you may not have to deal with that many isotopes, and if the lookup is the issue, this one will be pretty fast...

Code performance question

Let's say I have a relatively large list of an object MyObjectModel called MyBigList. One of the properties of MyObjectModel is an int called ObjectID. In theory, I think MyBigList could reach 15-20MB in size. I also have a table in my database that stores some scalars about this list so that it can be recomposed later.
What is going to be more efficient?
Option A:
List<MyObjectModel> MyBigList = null;
MyBigList = GetBigList(some parameters);
int RowID = PutScalarsInDB(MyBigList);
Option B:
List<MyObjectModel> MyBigList = null;
MyBigList = GetBigList(some parameters);
int TheCount = MyBigList.Count();
StringBuilder ListOfObjectID = new StringBuilder();
foreach (MyObjectModel ThisObject in MyBigList)
{
    ListOfObjectID.Append(ThisObject.ObjectID.ToString());
}
int RowID = PutScalarsInDB(TheCount, ListOfObjectID);
In option A I pass MyBigList to a function that extracts the scalars from the list, stores these in the DB and returns the row where these entries were made. In option B, I keep MyBigList in the page method where I do the extraction of the scalars and I just pass these to the PutScalarsInDB function.
What's the better option, or could yet another approach be better still? I'm concerned about passing around objects of this size and about memory usage.
I don't think you'll see a material difference between these two approaches. From your description, it sounds like you'll be burning the same CPU cycles either way. The things that matter are:
Get the list
Iterate through the list to get the IDs
Iterate through the list to update the database
The order in which these three activities occur, and whether they occur within a single method or a subroutine, doesn't matter. All other activities (declaring variables, assigning results, etc.,) are of zero to negligible performance impact.
Other things being equal, your first option may be slightly more performant because you'll only be iterating once, I assume, both extracting IDs and updating the database in a single pass. But the cost of iteration will likely be very small compared with the cost of updating the database, so it's not a performance difference you're likely to notice.
Having said all that, there are many, many more factors that may impact performance, such as the type of list you're iterating through, the speed of your connection to the database, etc., that could dwarf these other considerations. It doesn't look like too much code either way, so I'd strongly suggest building both and testing them.
Then let us know your results!
If you want to know which method performs better, you can use the Stopwatch class to check the time needed for each method; see here for Stopwatch usage: http://www.dotnetperls.com/stopwatch
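For example, something along these lines (GetBigList and PutScalarsInDB are the placeholders from the question):
// Time one option with Stopwatch; repeat the same pattern for the other option.
var sw = System.Diagnostics.Stopwatch.StartNew();
List<MyObjectModel> myBigList = GetBigList(/* some parameters */);
int rowId = PutScalarsInDB(myBigList);
sw.Stop();
Console.WriteLine("Option A took {0} ms", sw.ElapsedMilliseconds);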
I think there are other issues you need to verify for an ASP.NET application:
Where do you read your list from? If you read it from the database, would it be more efficient to do the work in the database within a stored procedure?
Where is it stored? Is it only read and destroyed, or is it stored in session or application state?

Cost of mapping POCOs in a high-load C# system

I have a POCO that needs to be mapped to another POCO in a high-traffic system. I intend to map these objects together with a simple mapper similar to this:
public class A
{
    public int MyValue { get; set; }
    public string YAV { get; set; }
}

public class B
{
    public int aTestValue { get; set; }
    public string YetAnotherValue { get; set; }
}

public class Mapper
{
    public static B MapIt(A a)
    {
        return new B { aTestValue = a.MyValue, YetAnotherValue = a.YAV };
    }
}
How much does a mapping like this really affect performance? Ignore the fact that we'll have to write a mapping for all our types and just focus on the performance lost doing the actual mapping.
How much does a mapping like this really affect performance?
I would say that such a mapping wouldn't noticeably affect performance even in a high-traffic system. The cost of calling getters and setters will probably be negligible compared to other operations you might be doing.
Obviously that's just my two cents; if you want real numbers, run performance benchmarks and measure the difference with and without the mapping.
At least, that's what I would do: build something that meets the requirements, then benchmark it. Then there are two possibilities: either you are satisfied with the results, so you ship to production and enjoy life, or you are not satisfied, and the benchmarks have let you identify this part as the bottleneck of your application, so you refactor the code and start optimizing it. But never do premature optimization, or you will struggle to meet the project deadlines.
In our experience, the overhead won't be much. I tested this recently by retrieving 75,000 rows of data using LINQ to SQL and then mapping the L2S entities to POCO entities using mapping code we wrote. The cost of doing this was amazingly small; if I recall correctly, it was something like 75 to 100 ms to map 75K rows.
It's almost impossible to know how this will affect performance without knowing something about the scale of the system, the looping structure in which this mapping occurs, etc.
In general, these types of simple mappings are quick, but you can always run into issues that are associated with the scaling issues I mentioned when things such as serialization are involved.
The best thing to do is hook it up to a profiler and take some measurements. A manual mapping like that is a fairly lightweight way to do it, so it shouldn't be significant. The AutoMapper tool is also available and will reduce coding time, but it has a little more overhead as it provides other services besides just mapping:
Analyzing AutoMapper Performance
How about using conversion operators? Only worry about their performance if a profiler shows them to be a bottleneck.
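For example, a conversion operator added to B could stand in for the mapper class above (a sketch only; whether the conversion should be explicit or implicit is a design choice):
public class B
{
    public int aTestValue { get; set; }
    public string YetAnotherValue { get; set; }

    // Explicit so the conversion stays visible at the call site.
    public static explicit operator B(A a)
    {
        return new B { aTestValue = a.MyValue, YetAnotherValue = a.YAV };
    }
}

// Usage: B b = (B)someA;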

How to query for the oldest object from db4o?

I have objects that have a DateTime property; how can I query for the oldest object?
After asking on the db4o forum, I got this answer:
It's quite easy: create a sorted SODA-Query and take the first / last object from the resulting ObjectSet. Don't iterate the ObjectSet (therefore the objects won't be activated), just take the required object directly via #ObjectSet.Get(index).
Please note: db4o supports just a limited set of performant sortings (alphabetical, numbers, object ids) in query execution, so maybe you have to store your DateTime as milliseconds to achieve good performance.
First of all, your object needs to keep track of the time itself, so it depends on your requirements:
class Customer
{
    public DateTime DateSignedUp { get; private set; }
    // ...
}
Now, you can query for the object in whatever way you like, using LINQ, SODA, or Native Queries, e.g.
IObjectContainer container = ...;
Customer oldestCustomer = container.Query<Customer>().OrderBy(p => p.DateSignedUp).First();
However, there is a set of pitfalls:
Don't use DateTime in your persisted objects. I have had massive problems with them; I can't reproduce the problem, so I couldn't report it yet, but I personally cannot recommend using them. Use a long instead and copy the ticks from the respective DateTime (see the sketch after this list). Store all times in UTC, unless you're explicitly referring to local time, such as in the case of bus schedules.
Put an index on the time
The order operation can be very, very slow for large numbers of objects because of issue COR-1133. If you have a large number of objects and you know the approximate age of the object, try to impose a where constraint, because that will be fast. See also my blog post regarding that performance issue, which can already become very annoying at ~50-100k objects.
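A sketch of the long-ticks approach from the first pitfall (the field and property names are illustrative):
class Customer
{
    // The persisted field: a plain long, cheap to sort and to index as suggested above.
    private long dateSignedUpTicks;

    // Convenience wrapper for the rest of the code; just converts to and from ticks.
    public DateTime DateSignedUp
    {
        get { return new DateTime(dateSignedUpTicks, DateTimeKind.Utc); }
        set { dateSignedUpTicks = value.ToUniversalTime().Ticks; }
    }
}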
Best,
Chris

Best C# data structure for random order population?

In C# I have a use case where I have a mapping from ints to collections.
The ints are a dense (but not packed) set from 1 to n, where n is not known.
The cells will be loaded in random order.
The marginal cost of each cell should be minimal (as good as a List<T> or T[]).
I'd like to have the cells default-filled on demand.
What is the best structure for this?
A List<T> would work well (better in space than a Dictionary<>), and by deriving from it I can get much of what I want, but is there something better? As in, the best code is code you don't write.
A Dictionary<int, Cell> sounds like a good match to me. Or you could use a List<Cell> quite easily, and just make sure that you expand it where necessary:
public static void EnsureCount<T>(List<T> list, int count)
{
    if (list.Count > count)
    {
        return;
    }
    if (list.Capacity < count)
    {
        // Always at least double the capacity, to reduce
        // the number of expansions required
        list.Capacity = Math.Max(list.Capacity * 2, count);
    }
    list.AddRange(Enumerable.Repeat(default(T), list.Capacity - list.Count));
}
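Possible usage when cells arrive in random order (index and cell stand in for your own data):
// Grow the list so the slot exists, then assign it directly.
EnsureCount(cells, index + 1);
cells[index] = cell;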
If you want to be a real stickler, one option is to write a facade class which provides a strategy around a number of different mappers (a rough sketch follows below). Make it so that it uses a statically defined array when N is below some value and the load (packedness) is over some threshold, and have it swap to a tree or a Dictionary representation when certain thresholds are passed.
You did not state how you planned on actually interacting with this data structure once it had been built, so depending on what your access / usage patterns are, a mixed strategy might provide a better runtime behavior than sticking with a single data representation. Think of it in a similar fashion to how quicksort algorithm implementations will sometimes switch to a simpler sort algorithm when the collection to be sorted has less than N data elements.
Cheers!
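A very rough sketch of that facade idea; the threshold, the growth policy, and the switch-over logic are all placeholders you would tune for your own access patterns:
// Starts sparse (Dictionary) and switches to a dense array past a threshold.
class CellMap<T>
{
    private const int DenseThreshold = 100000;           // placeholder cut-off
    private Dictionary<int, T> sparse = new Dictionary<int, T>();
    private T[] dense;                                    // non-null once we switch

    public T this[int index]
    {
        get
        {
            if (dense != null)
                return index < dense.Length ? dense[index] : default(T);
            T value;
            return sparse.TryGetValue(index, out value) ? value : default(T);
        }
        set
        {
            if (dense != null)
            {
                EnsureDenseSize(index + 1);
                dense[index] = value;
                return;
            }
            sparse[index] = value;
            if (sparse.Count > DenseThreshold) SwitchToDense();
        }
    }

    private void SwitchToDense()
    {
        int max = 0;
        foreach (int key in sparse.Keys) max = Math.Max(max, key);
        dense = new T[max + 1];
        foreach (var pair in sparse) dense[pair.Key] = pair.Value;
        sparse = null;
    }

    private void EnsureDenseSize(int size)
    {
        // Grow geometrically so random-order inserts stay cheap on average.
        if (dense.Length >= size) return;
        Array.Resize(ref dense, Math.Max(dense.Length * 2, size));
    }
}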
