I have a class A that works with hundreds or thousands of classes; each of those classes has a method that performs some calculations, for example.
Class A has a method that chooses which of those hundreds or thousands of classes runs, and this method of class A runs many times in a short time.
The solution I thought of at the beginning was to have the instances already created in class A, to avoid creating and destroying them every time the event executes and having the garbage collector consume CPU. But class A, as I say, is going to work with hundreds or thousands of classes, and keeping them all loaded is too high a cost in memory (I think).
My question is: can you think of an optimal way to work with hundreds or thousands of classes, some of which will run every second, without having to create and destroy instances on each execution of the method that works with them?
Edit:
First example: create and store the instances and then use them. I think this costs memory, but it keeps the garbage collector from working too much.
public class ClassA {
Class1 class1;
Class2 class2;
// ... more classes
Class100 class100;
public ClassA() {
class1 = new Class1();
// ... initializations
class100 = new Class100();
}
public void ChooseClass(int numberClass) {
switch (numberClass) {
case 1:
class1.calculate();
break;
case 2:
class2.run();
break;
// ... more cases, one for each class
case 100:
class100.method();
break;
default:
break;
}
}
}
Second example: create the class only when it is used. This saves memory, but the garbage collector consumes a lot of CPU.
public class ClassA {
public void ChooseClass(int numberClass) {
switch (numberClass) {
case 1:
Class1 class1 = new Class1();
class1.calculate();
break;
case 2:
Class2 class2 = new Class2();
class2.run();
break;
// ... more cases, one for each class
case 100:
Class100 class100 = new Class100();
class100.method();
break;
default:
break;
}
}
}
The basic problem you face when you start increasing the number of class instances is that they all need to be accounted for and tracked during garbage collection; even if you never free those instances, the garbage collector still needs to track them. There comes a point where the program spends more time performing garbage collection than actual work. We experienced this kind of performance problem with a binary search tree that ended up containing several million nodes that originally were class instances.
We were able to circumvent this by using a List<T> of structs rather than of classes. (A list's memory is backed by an array, and for structs the garbage collector only needs to track a single reference to that array.) Now, instead of references to a class, we store indices into this list in order to access a desired instance of the struct.
In fact, we also faced the problem (note that newer versions of the .NET Framework do away with this limitation) that the backing array couldn't grow beyond 2 GB even under 64 bits, so we split storage across several lists (256) and used a 32-bit index where 8 bits acted as a list selector and the remaining 24 bits served as an index into the list.
Of course it is convenient to build a class that abstracts all these details. You also need to be aware that to modify a struct you must copy it to a local variable, modify it, and then write the modified copy back over the original; otherwise your changes will happen on a temporary copy of the struct and not be reflected in your data collection. There is also a performance impact, which fortunately pays for itself once the collection is large enough, with extremely fast garbage collection cycles.
Here is some (quite old) code showing these ideas in place; a server went from spending nearly 100% of its CPU time to around 15%, just by migrating our search tree to this approach.
public class SplitList<T> where T : struct {
// A virtual list divided into several sublists, removing the 2GB capacity limit
private List<T>[] _lists;
private Queue<int> _free = new Queue<int>();
private int _maxId = 0;
private const int _hashingBits = 8;
private const int _listSelector = 32 - _hashingBits;
private const int _subIndexMask = (1 << _listSelector) - 1;
public SplitList() {
int listCount = 1 << _hashingBits;
_lists = new List<T>[listCount];
for( int i = 0; i < listCount; i++ )
_lists[i] = new List<T>();
}
// Access a struct by index
// Remember that this returns a local copy of the struct, so if changes are to be done,
// the local copy must be copied to a local struct, modify it, and then copy back the changes
// to the list
public T this[int idx] {
get {
return _lists[(idx >> _listSelector)][idx & _subIndexMask];
}
set {
_lists[idx >> _listSelector][idx & _subIndexMask] = value ;
}
}
// returns an index to a "new" struct inside the collection
public int New() {
int result;
T newElement = new T();
// are there any free indexes available?
if( _free.Count > 0 ) {
// yes, return a free index and initialize reused struct to default values
result = _free.Dequeue();
this[result] = newElement;
} else {
// no, grow the capacity
result = _maxId++; // post-increment: indices start at 0 and line up with List.Add positions
List<T> list = _lists[result >> _listSelector];
list.Add(newElement);
}
return result;
}
// free an index and allow the struct slot to be reused.
public void Free(int idx) {
_free.Enqueue(idx);
}
}
Here is a snippet of how our binary tree implementation ended up looking using this SplitList backing container class:
public class CLookupTree {
public struct TreeNode {
public int HashValue;
public int LeftIdx;
public int RightIdx;
public int firstSpotIdx;
}
SplitList<TreeNode> _nodes;
…
private int RotateLeft(int idx) {
// Performs a tree rotation to the left, here you can see how we need
// to retrieve the struct to a local copy (thisNode), modify it, and
// push back the modifications to the node storage list
// Also note that we are working with indexes rather than references to
// the nodes
TreeNode thisNode = _nodes[idx];
int result = thisNode.RightIdx;
TreeNode rightNode = _nodes[result];
thisNode.RightIdx = rightNode.LeftIdx;
rightNode.LeftIdx = idx;
_nodes[idx] = thisNode;
_nodes[result] = rightNode;
return result;
}
}
I have declared a basic struct like this
private struct ValLine {
public string val;
public ulong linenum;
}
and declared a Queue like this
Queue<ValLine> check = new Queue<ValLine>();
Then in a using StreamReader setup where I'm reading through the lines of an input file using ReadLine in a while loop, among other things, I'm doing this to populate the Queue:
check.Enqueue(new ValLine { val = line, linenum = linenum });
("line" is a string containing the text of each line, "linenum" is just a counter that is initialized at 0 and is incremented each time through the loop.)
The purpose of the "check" Queue is that if a particular line meets some criteria, then I store that line in "check" along with the line number that it occurs on in the input file.
After I've finished reading through the input file, I use "check" for various things, but then when I'm finished using it I clear it out in the obvious manner:
check.Clear();
(Alternatively, in my final loop through "check" I could just use .Dequeue(), instead of foreach'ing it.)
But then I got to thinking - wait a minute, what about all those "new ValLine" I generated when populating the Queue in the first place??? Have I created a memory leak? I'm pretty new to C#, so it's not coming clear to me how to deal with this - or even if it should be dealt with (perhaps .Clear() or .Dequeue() deals with the now-obsolete structs automatically?). I've spent over an hour with our dear friend Google, and just haven't found any specific discussion of this kind of example in regard to the clearing of a collection of structs.
So... In C# do we need to deal with wiping out the individual structs before clearing the queue (or as we are dequeueing), or not? And if so, then what is the proper way to do this?
(Just in case it's relevant, I'm using .NET 4.5 in Visual Studio 2013.)
UPDATE: This is for future reference (you know, like if this page comes up in a Google search) in regard to proper coding. To make the struct immutable as per the recommendation, this is what I've ended up with:
private struct ValLine {
private readonly string _val;
private readonly ulong _linenum;
public string val { get { return _val; } }
public ulong linenum { get { return _linenum; } }
public ValLine(string x, ulong n) { _val = x; _linenum = n; }
}
Corresponding to that change, the queue population line is now this:
check.Enqueue(new ValLine(line,linenum));
Also, though not strictly necessary, I did get rid of my foreach on the queue (and the check.Clear();), and changed it to this
while (check.Count > 0) {
ValLine ll = check.Dequeue();
writer.WriteLine("[{0}] {1}", ll.linenum, ll.val);
}
so that the queue is emptied out as the information is output.
UPDATE 2: Okay, yes, I'm still a C# newbie (less than a year). I learn a lot from the Internet, but of course, I'm often looking at examples from more than a year ago. I have changed my struct so now it looks like this:
private struct ValLine {
public string val { get; private set; }
public ulong linenum { get; private set; }
public ValLine(string x, ulong n): this()
{ this.val = x; this.linenum = n; }
}
Interestingly enough, I had actually tried exactly this off the top of my head before coming up with what's in the first update (above), but got a compile error (because I did not have the : this() with the constructor). Upon further suggestion, I checked further and found a recent example showing the : this() needed to make it work the way I tried before, plugged that in, and - voilà! - a clean compile. I like the cleaner look of the code. What the private variables are called is irrelevant to me.
No, you won't have created a memory leak. Calling Clear or Dequeue will clear the memory appropriately - for example, if you had a List<T> then a clear operation might use:
for (int i = 0; i < capacity; i++)
{
array[i] = default(T);
}
I don't know offhand whether Queue<T> is implemented with a circular buffer built on an array, or a linked list - but either way, you'll be fine.
Having said that, I would strongly recommend against using mutable structs as you're doing here, along with mutable fields. While it's not causing the particular problem you're envisaging, they can behave in confusing ways.
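For illustration only (a hypothetical MutablePoint type, not from the question), here is the classic surprise with mutable structs: modifying a copy pulled out of a List<T> does not change the element stored in the list.
using System;
using System.Collections.Generic;

struct MutablePoint { public int X; public int Y; }

static class MutableStructDemo
{
    static void Main()
    {
        var points = new List<MutablePoint> { new MutablePoint { X = 1, Y = 1 } };

        // points[0].X = 42; would not even compile ("Cannot modify the return value...
        // because it is not a variable"), so the element has to be copied out first:
        MutablePoint p = points[0];
        p.X = 42; // modifies only the local copy

        Console.WriteLine(points[0].X); // prints 1, not 42
    }
}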
My financial software constantly processes almost the same objects. For example, I have data like this coming in online:
HP 100 1
HP 100 2
HP 100.1 1
etc.
I have about 1000 updates every second.
Each update is stored in an object, but to improve latency I do not want to allocate these objects on the fly.
I use the objects only for a short period of time: I receive them, apply them, and free them. Once an object is freed it can actually be reused for another pack of data.
So I need some storage (likely a ring buffer) that allocates the required number of objects once and then allows me to "obtain" and "free" them. What is the best way to do that in C#?
Each object has an id, and I assign ids sequentially and free them sequentially too.
For example, I receive ids 1, 2, and 3, then I free 1, 2, 3. So any FIFO collection would work, but I'm looking for some library class that covers the required functionality.
I.e., I need a FIFO collection that does not allocate objects, but reuses them and allows them to be reconfigured.
Update:
I've added my implementation of what I want. This is not tested code and probably has bugs.
The idea is simple: the writer should call the Obtain and Commit methods, and the reader should call the TryGet method. The reader and writer can access this structure from different threads:
public sealed class ArrayPool<T> where T : class
{
readonly T[] array;
private readonly uint MASK;
private volatile uint curWriteNum;
private volatile uint curReadNum;
public ArrayPool(uint length = 1024) // length must be power of 2
{
if (length <= 0) throw new ArgumentOutOfRangeException("length");
array = new T[length];
MASK = length - 1;
}
/// <summary>
/// TryGet() itself is not thread safe and should be called from one thread.
/// However TryGet() and Obtain/Commit can be called from different threads
/// </summary>
/// <returns></returns>
public T TryGet()
{
if (curReadNum == curWriteNum)
{
return null;
}
T result = array[curReadNum & MASK];
curReadNum++;
return result;
}
public T Obtain()
{
return array[curWriteNum & MASK];
}
public void Commit()
{
curWriteNum++;
}
}
Comments about my implementation are welcome, and perhaps some library class can replace this simple one?
I don't think you should leap at this, as per my comments on the question - however, a simple approach would be something like:
public sealed class MicroPool<T> where T : class
{
readonly T[] array;
public MicroPool(int length = 10)
{
if (length <= 0) throw new ArgumentOutOfRangeException("length");
array = new T[length];
}
public T TryGet()
{
T item;
for (int i = 0; i < array.Length; i++)
{
if ((item = Interlocked.Exchange(ref array[i], null)) != null)
return item;
}
return null;
}
public void Recycle(T item)
{
if(item == null) return;
for (int i = 0; i < array.Length; i++)
{
if (Interlocked.CompareExchange(ref array[i], item, null) == null)
return;
}
using (item as IDisposable) { } // cleanup if needed
}
}
If the loads come in bursts, you may be able to use the GC's latency modes to offset the overhead by delaying collections. This is not a silver bullet, but in some cases it can be very helpful.
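As a rough sketch of what that looks like (which mode fits depends on your runtime version and workload; LowLatency has been around since .NET Framework 3.5, SustainedLowLatency since 4.5):
using System;
using System.Runtime;

static class GcLatencyExample
{
    // Runs a latency-sensitive burst of work under a low-latency GC mode,
    // then restores whatever mode was active before.
    public static void RunBurst(Action processBurst)
    {
        GCLatencyMode oldMode = GCSettings.LatencyMode;
        try
        {
            GCSettings.LatencyMode = GCLatencyMode.LowLatency;
            processBurst(); // e.g. apply the ~1000 updates of one burst
        }
        finally
        {
            GCSettings.LatencyMode = oldMode;
        }
    }
}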
I am not sure if this is what you need, but you could always make a pool of the objects that are going to be used. Initialize a List of the object type; then, when you need an object, remove it from the list and add it back when you are done with it.
http://www.codeproject.com/Articles/20848/C-Object-Pooling is a good start.
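A minimal, single-threaded sketch of that idea (Update is just a placeholder for your own reusable type):
using System.Collections.Generic;

public class Update { /* your reusable data object */ }

public class SimpleListPool
{
    private readonly List<Update> _free = new List<Update>();

    public SimpleListPool(int size)
    {
        for (int i = 0; i < size; i++)
            _free.Add(new Update()); // allocate everything up front
    }

    public Update Take()
    {
        if (_free.Count == 0)
            return new Update(); // pool exhausted; fall back to allocation

        Update item = _free[_free.Count - 1]; // take from the end to avoid shifting
        _free.RemoveAt(_free.Count - 1);
        return item;
    }

    public void Return(Update item)
    {
        _free.Add(item); // caller is responsible for resetting the object's state
    }
}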
Hope I've helped even if a little :)
If you are just worried about the time taken for the GC to run, then don't be - it can't be beaten by anything you can do yourself.
However, if your objects' constructors do some work it might be quicker to cache them.
A fairly straightforward way to do this is to use a ConcurrentBag.
Essentially what you do is to pre-populate it with a set of objects using ConcurrentBag.Add() (that is if you want - or you can start with it empty and let it grow).
Then when you need a new object you use ConcurrentBag.TryTake() to grab an object.
If TryTake() fails then you just create a new object and use that instead.
Regardless of whether you grabbed an object from the bag or created a new one, once you're done with it you just put that object back into the bag using ConcurrentBag.Add().
Generally your bag will get to a certain size but no larger (but you might want to instrument things just to check it).
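Putting those steps together, a generic sketch might look like this (BagPool is just an illustrative name):
using System.Collections.Concurrent;

public class BagPool<T> where T : class, new()
{
    private readonly ConcurrentBag<T> _bag = new ConcurrentBag<T>();

    public BagPool(int prePopulate = 0)
    {
        // Optionally seed the bag so the first callers don't have to allocate.
        for (int i = 0; i < prePopulate; i++)
            _bag.Add(new T());
    }

    public T Take()
    {
        T item;
        // Reuse a pooled instance if one is available, otherwise allocate a new one.
        return _bag.TryTake(out item) ? item : new T();
    }

    public void Return(T item)
    {
        _bag.Add(item); // remember to reset the object's state when reusing it
    }
}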
In any case, I would always do some timings to see if changes like this actually make any difference. Unless the object constructors are doing a fair bit of work, chances are it won't make much difference.
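For those timings, System.Diagnostics.Stopwatch is usually enough for a rough first comparison; something along these lines:
using System;
using System.Diagnostics;

static class PoolTimingCheck
{
    static void Main()
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < 1000000; i++)
        {
            // exercise the pooled path (or the plain "new" path) here
        }
        sw.Stop();
        Console.WriteLine("Elapsed: {0} ms", sw.ElapsedMilliseconds);
    }
}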
I have an object; let's call it "Friend".
This object has a method "GetFriendsOfFriend" that returns a List<Friend>.
Given a user input of, say, 5, I need to get all of the Friend's friends, and the friends' friends' friends (you get the point), down to a depth of 5 (this can be up to 20).
This may be a lot of calculations, so I don't know if recursion is the best solution.
Does anyone have a smart idea of
1. How to do this recursive function best?
2. How to do it without recursion.
Thanks!
Whilst it is certainly possible to do this without recursion, I don't see a particular problem with what you're trying to do. To prevent things going crazy, it might make sense to set a maximum depth so your program doesn't die.
public class Friend
{
public static readonly int MaxDepth = 8; // prevent more than 8 recursions
private List<Friend> myFriends_ = new List<Friend>();
// private implementation
private void InternalFriends(int depth, int currDepth, List<Friend> list)
{
// Add "us"
if(currDepth > 1 && !list.Contains(this))
list.Add(this);
if(currDepth <= depth)
{
foreach(Friend f in myFriends_)
{
if(!list.Contains(f))
f.InternalFriends(depth, currDepth + 1, list); // we can call private functions here.
}
}
} // eo InternalFriends
public List<Friend> GetFriendsOfFriend(int depth)
{
List<Friend> ret = new List<Friend>();
InternalFriends(depth < MaxDepth ? depth : MaxDepth, 1, ret);
return ret;
} // eo getFriendsOfFriend
} // eo class Friend
EDIT: Fixed an error in the code in that an actual friend would not get added, just "their" friends. This is only necessary when adding friends after a depth of "1" (the first call). I also made use of Contains to check for duplicates.
Here is a non recursive version of this code:
public static void ProcessFriendsOf(string person) {
var toVisit = new Queue<string>();
var seen = new HashSet<string>();
toVisit.Enqueue(person);
seen.Add(person);
while(toVisit.Count > 0) {
var current = toVisit.Dequeue();
//process this friend in some way
foreach(var friend in GetFriendsOfFriend(current)) {
if (!seen.Contains(friend)) {
toVisit.Enqueue(friend);
seen.Add(friend);
}
}
}
}
It avoids infinite loops by keeping a HashSet of all members already seen and not adding any member to be processed more than once.
It visits friends using a Queue, in a way that is known as Breadth-first search. If we use a Stack instead of a Queue, it becomes a Depth-first search, and would behave pretty much the same as a recursive approach (which uses an implicit stack - the call stack).
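Since the question also needs a depth limit, one hedged variation is to enqueue the current depth alongside each friend (this mirrors the snippet above, including its assumed GetFriendsOfFriend(string) helper):
public static void ProcessFriendsOf(string person, int maxDepth) {
    var toVisit = new Queue<Tuple<string, int>>(); // each entry is a friend plus its depth
    var seen = new HashSet<string>();
    toVisit.Enqueue(Tuple.Create(person, 0));
    seen.Add(person);
    while (toVisit.Count > 0) {
        var current = toVisit.Dequeue();
        // process current.Item1 in some way
        if (current.Item2 >= maxDepth)
            continue; // do not expand friends beyond the requested depth
        foreach (var friend in GetFriendsOfFriend(current.Item1)) {
            if (seen.Add(friend)) // Add returns false if the friend was already seen
                toVisit.Enqueue(Tuple.Create(friend, current.Item2 + 1));
        }
    }
}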
I have 2 time series that contain Bar objects; each Bar object contains a member variable of type long, and each time series is stored within its own BlockingCollection. The time series are sorted in ascending order of the long values.
I'd like to devise a merge algorithm that lets me take the Bar whose long member variable has the lowest value relative to the same comparison element in the other BlockingCollection.
For example, if the long value contained in the first Bar (bar1) in BlockingCollection1 is lower than the long value contained in the first Bar (bar2) in BlockingCollection2, then Take() from BlockingCollection1 and Add() to a MasterBlockingCollection, essentially ending up with a merged stream of Bar objects sorted by the value of each Bar's long member variable.
I'd like to later extend this to n BlockingCollections, not just 2. I played around with arrays that hold the long values to make the mapping easier, but I think arrays are handier when working with pointers pertaining to this specific target algorithm.
I wonder whether anyone can point me to a LINQ implementation and comment on how computationally expensive such an approach is. I am asking because throughput matters: there are hundreds of millions of Bar objects flowing through the collections. If someone has a cleverer idea than using LINQ, that would be very welcome. I came across some ideas regarding merge algorithms at Dr. Dobb's some time ago but cannot find the article anymore. In case it is not apparent by now, I am targeting C# (.NET 4.0).
Thanks a lot
Edit: I forgot to mention that the merging process is supposed to happen at the same time as workers add new items to the BlockingCollections (running on different tasks).
Here's an implementation of Merge. It should run in O(cN) time, where c is the number of collections. Is this what you're looking for?
public static BlockingCollection<Bar> Merge(IEnumerable<BlockingCollection<Bar>> collections)
{
BlockingCollection<Bar> masterCollection = new BlockingCollection<Bar>();
LinkedList<BarWrapper> orderedLows = new LinkedList<BarWrapper>();
foreach (var c in collections)
OrderedInsert(new BarWrapper { Value = c.Take(), Source = c }, orderedLows);
while (orderedLows.Any())
{
BarWrapper currentLow = orderedLows.First.Value;
orderedLows.RemoveFirst();
BlockingCollection<Bar> collection = currentLow.Source;
if (collection.Any())
OrderedInsert(new BarWrapper { Value = collection.Take(), Source = collection }, orderedLows);
masterCollection.Add(currentLow.Value);
}
return masterCollection;
}
private static void OrderedInsert(BarWrapper bar, LinkedList<BarWrapper> orderedLows)
{
if (!orderedLows.Any())
{
orderedLows.AddFirst(bar);
return;
}
var iterator = orderedLows.First;
while (iterator != null && iterator.Value.Value.LongValue < bar.Value.LongValue)
iterator = iterator.Next;
if (iterator == null)
orderedLows.AddLast(bar);
else
orderedLows.AddBefore(iterator, bar);
}
class BarWrapper
{
public Bar Value { get; set; }
public BlockingCollection<Bar> Source { get; set; }
}
class Bar
{
public Bar(long l)
{
this.LongValue = l;
}
public long LongValue { get; set; }
}
I have this scenario in which memory conservation is paramount. I am trying to read > 1 GB of peptide sequences into memory and group peptide instances together that share the same sequence. I am storing the Peptide objects in a HashSet so I can quickly check for duplicates, but found out that you cannot access the objects in the set, even after knowing that the set contains that object.
Memory is really important and I don't want to duplicate data if at all possible. (Otherwise I would have designed my data structure as peptides = Dictionary<string, Peptide>, but that would duplicate the string in both the dictionary and the Peptide class.) Below is the code to show you what I would like to accomplish:
public SomeClass {
// Main Storage of all the Peptide instances, class provided below
private HashSet<Peptide> peptides = new HashSet<Peptide>();
public void SomeMethod(IEnumerable<string> files) {
foreach(string file in files) {
using(PeptideReader reader = new PeptideReader(file)) {
foreach(DataLine line in reader.ReadNextLine()) {
Peptide testPep = new Peptide(line.Sequence);
if(peptides.Contains(testPep)) {
// ** Problem Is Here **
// I want to get the Peptide object that is in HashSet
// so I can add the DataLine to it, I don't want use the
// testPep object (even though they are considered "equal")
peptides[testPep].Add(line); // I know this doesn't work
testPep.Add(line) // THIS IS NO GOOD, since it won't be saved in the HashSet which i use in other methods.
} else {
// The HashSet doesn't contain this peptide, so we can just add it
testPep.Add(line);
peptides.Add(testPep);
}
}
}
}
}
}
public class Peptide : IEquatable<Peptide> {
public string Sequence {get;private set;}
private int hCode = 0;
public PsmList PSMs {get;set;}
public Peptide(string sequence) {
Sequence = sequence.Replace('I', 'L');
hCode = Sequence.GetHashCode();
}
public void Add(DataLine data) {
if(PSMs == null) {
PSMs = new PsmList();
}
PSMs.Add(data);
}
public override int GetHashCode() {
return hCode;
}
public bool Equals(Peptide other) {
return Sequence.Equals(other.Sequence);
}
}
public class PsmList : List<DataLine> { /* and some other stuff that is not important */ }
Why does HashSet not let me get the object reference that is contained in the HashSet? I know people will try to say that if HashSet.Contains() returns true, your objects are equivalent. They may be equivalent in terms of values, but I need the references to be the same since I am storing additional information in the Peptide class.
The only solution I came up with is Dictionary<Peptide, Peptide> in which both the key and value point to the same reference. But this seems tacky. Is there another data structure to accomplish this?
Basically you could reimplement HashSet<T> yourself, but that's about the only solution I'm aware of. The Dictionary<Peptide, Peptide> or Dictionary<string, Peptide> solution is probably not that inefficient though - if you're only wasting a single reference per entry, I would imagine that would be relatively insignificant.
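For reference, a sketch of that Dictionary<Peptide, Peptide> workaround, reusing the types from the question (DataLine and the Peptide.Add method are assumed from there; AddLine is just an illustrative name):
private Dictionary<Peptide, Peptide> peptides = new Dictionary<Peptide, Peptide>();

public void AddLine(DataLine line)
{
    Peptide testPep = new Peptide(line.Sequence);
    Peptide existing;
    if (peptides.TryGetValue(testPep, out existing))
    {
        // Reuse the canonical instance already stored in the dictionary.
        existing.Add(line);
    }
    else
    {
        // First time this sequence is seen: the new instance becomes the canonical one.
        testPep.Add(line);
        peptides.Add(testPep, testPep);
    }
}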
In fact, if you remove the hCode member from Peptide, that will save you 4 bytes per object, which is the same size as a reference in x86 anyway... there's no point in caching the hash as far as I can tell, as you'll only compute the hash of each object once, at least in the code you've shown.
If you're really desperate for memory, I suspect you could store the sequence considerably more efficiently than as a string. If you give us more information about what the sequence contains, we may be able to make some suggestions there.
I don't know that there's any particularly strong reason why HashSet doesn't permit this, other than that it's a relatively rare requirement - but it's something I've seen requested in Java as well...
Use a Dictionary<string, Peptide>.