As the title states: are there any downsides to recycling resources (like large arrays) in the finalizer of the containing object? So far it works out fine but since finalizers can be a bit funky and difficult to debug I decided to ask here.
The use case is a polygon class which just represents a list of points. My application makes heavy use of large polygons with a few thousand points - allocating those arrays often isn't exactly cheap. A dispose pattern is unfortunately out of the question because they get passed around and pretty much only the GC knows when no other object is referencing them.
This is how I implemented it (arrays always have the length of 2^sizeLog2 and only arrays with sizeLog2 >= MIN_SIZE_LOG_2 get recycled):
Constructor:
public Polygon(int capacity = 1)
{
    // fast method for getting the ceiling of the logarithm of an integer
    int sizeLog2 = UIM.CeilingLog2((uint)capacity);
    // suppress finalization when the array should not be recycled
    if (sizeLog2 < MIN_SIZE_LOG_2) GC.SuppressFinalize(this);
    Points = Create(sizeLog2);
}
Create array of size 2^sizeLog2:
private static Pnt[] Create(int sizeLog2)
{
    if (sizeLog2 >= MIN_SIZE_LOG_2)
    {
        if (/*try to get an item from recycle queue*/) return result;
        Pnt[] points = new Pnt[1 << sizeLog2];
        // keep the array alive so that it won't get collected by the GC when the polygon is
        GC.KeepAlive(points);
        return points;
    }
    return new Pnt[1 << sizeLog2];
}
In the increase-capacity method, this is the line that re-registers the polygon for finalization if finalization was suppressed up to this point (capacity can only be increased, and only by a factor of 2):
if (newSizeLog2 == MIN_SIZE_LOG_2) GC.ReRegisterForFinalize(this);
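For context, the surrounding increase-capacity method might look roughly like this (a sketch; only the re-register line above is from the question):
private void IncreaseCapacity(int newSizeLog2)
{
    // the polygon crosses the recycling threshold at most once on its way up
    if (newSizeLog2 == MIN_SIZE_LOG_2) GC.ReRegisterForFinalize(this);
    Pnt[] newPoints = Create(newSizeLog2);
    Array.Copy(Points, newPoints, Points.Length);
    Points = newPoints;
}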
And this is the destructor:
~Polygon()
{
    if (Points != null) Recycle(Points); // enqueues the array in a ConcurrentQueue
    Points = null;
}
And yes, I do know that this isn't exactly a fancy programming style, but it is rather performance-critical, so I really can't just let the GC do all the work, because then I end up with hundreds of MB on the large object heap within a few seconds.
Also, the polygons get handled by different threads concurrently.
The short answer is: yes, there are. Large arrays are not the kind of resource finalizers are intended to recycle. Finalizers should be used for external resources and, in vanishingly rare instances, application state.
I would suggest this article, and/or this one, to better understand some of the pitfalls of finalization.
The crux of the problem with what you've described is this statement: "I really can't just let the GC do all the work because then I end up with hundreds of MB on the large object heap in a few seconds."
But the finalizer is never going to be called until the GC has "done all the work", so you must not have a full understanding of what is causing the memory pressure in your application.
It really sounds like you've got an issue with the overall design of your application, where you're finding it hard to reason about who owns Polygons and/or Pnts. The result is that there are references to polygons/Pnts - somewhere - when you don't expect to have them. Use a good profiler to find where this is happening and fix the overall design problem instead of trying to use finalization.
We have a problem which seems to be caused by the constant allocation and deallocation of memory:
We have a rather complex system here, where a USB device measures arbitrary points and sends the measurement data to the PC at a rate of 50k samples per second. These samples are then collected per point as MeasurementTasks in the software and afterwards processed, which requires even more memory because of the demands of the calculations.
Simplified each MeasurementTask looks like the following:
public class MeasurementTask
{
    public LinkedList<Sample> Samples { get; set; }
    public ComplexSample[] ComplexSamples { get; set; }
    public Complex Result { get; set; }
}
Where Sample looks like:
public class Sample
{
    public ushort CommandIndex;
    public double ValueChannel1;
    public double ValueChannel2;
}
and ComplexSample like:
public class ComplexSample
{
    public double Channel1Real;
    public double Channel1Imag;
    public double Channel2Real;
    public double Channel2Imag;
}
In the calculation process each Sample is first turned into a ComplexSample and then processed further until we get our Complex Result. After these calculations are done we release all the Sample and ComplexSample instances, and the GC cleans them up soon after, but this results in a constant "up and down" of the memory usage.
This is how it looks at the moment with each MeasurementTask containing ~300k samples: [memory usage graph omitted]
Now we sometimes have the problem that the sample buffer in our HW device overflows, as it can only store ~5000 samples (~100 ms) and it seems the application is not always reading the device fast enough (we use BULK transfer with LibUSB/LibUSBDotNet). We tracked this problem down to this memory "up and down" via the following facts:
the reading from the USB device happens in its own thread which runs at ThreadPriority.Highest, so the calculations should not interfere
CPU usage is between 1-5% on my 8-core CPU => <50% of one core
if we have (much) faster MeasurementTasks with only a few hundred samples each, the memory only goes up and down very little and the buffer never overflows (but the number of instances/second is the same, as the device still sends 50k samples/second)
we had a bug before which did not release the Sample and ComplexSample instances after the calculations, so the memory only went up at ~2-3 MB/s and the buffer overflowed all the time
At the moment (after fixing the bug mentioned above) we have a direct correlation between the samples count per point and the overflows. More samples/point = higher memory delta = more overflows.
Now to the actual question:
Can this behaviour be improved (easily)?
Maybe there is a way to tell the GC/runtime to not release the memory so there is no need to re-allocate?
We also thought of an alternative approach by "re-using" the LinkedList<Sample> and ComplexSample[]: Keep a pool of such lists/arrays and instead of releasing them put them back in the pool and "change" these instances instead of creating new ones, but we are not sure this is a good idea as it adds complexity to the whole system...
But we are open to other suggestions!
UPDATE:
I now optimized the code base with the following improvements and did various test runs:
converted Sample to a struct
got rid of the LinkedList<Sample> and replaced it with straight arrays (I actually had another one somewhere else, which I also removed)
several minor optimizations I found during analysis and optimization
(optional - see below) converted ComplexSample to a struct
In any case it seems that the problem is gone now on my machine (long-term tests and tests on low-spec hardware will follow), but I first ran a test with both types as structs and got the following memory usage graph: [graph omitted]
There it was still going up to ~300 MB on a regular basis (but with no overflow errors anymore); as this still seemed odd to me, I did some additional tests:
Side note: Each value of each ComplexSample is altered at least once during the calculations.
1) Add a GC.Collect after a task is processed and the samples are not referenced any more:
Now it was alternating between 140 MB and 150 MB (no noticeable performance hit).
2) ComplexSample as a class (no GC.Collect):
Using a class it is much more "stable" at ~140-200 MB.
3) ComplexSample as a class and GC.Collect:
Now it is going "up and down" a little in the range of 135-150 MB.
Current solution:
As we are not sure this is a valid case for manually calling GC.Collect we are using "solution 2)" now and I will start running the long-term (= several hours) and low-spec hardware tests...
Can this behaviour be improved (easily)?
Yes (depends on how much you need to improve it).
The first thing I would do is to change Sample and ComplexSample to be value types. This will reduce the complexity of the object graph dealt with by the GC: while the arrays and linked lists are still collected, they contain those values directly rather than references to them, and that simplifies the rest of the GC's work.
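Concretely, based on the class definitions in the question, that change is just:
public struct Sample
{
    public ushort CommandIndex;
    public double ValueChannel1;
    public double ValueChannel2;
}

public struct ComplexSample
{
    public double Channel1Real;
    public double Channel1Imag;
    public double Channel2Real;
    public double Channel2Imag;
}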
Then I'd measure performance at this point. The impact of working with relatively large structs is mixed. The guideline that value types should be less than 16 bytes comes from it being around that point where the performance benefits of using a reference type tend to overwhelm the performance benefits of using a value type, but that guideline is only a guideline because "tend to overwhelm" is not the same as "will overwhelm in your application".
After that, if it has either not improved things or not improved things enough, I would consider using a pool of objects - whether for those smaller objects, only the larger objects, or both. This will most certainly increase the complexity of your application, but if it's time-critical, then it might well help. (See How do these people avoid creating any garbage?, for example, which discusses avoiding normal GC in a time-critical case.)
If you know you'll need a fixed maximum of a given type, this isn't too hard: create and fill an array of them and dole them out from that array, returning them to it as they are no longer used. It's still hard enough in that you no longer have automatic GC and have to manually "delete" the objects by putting them back in the pool.
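A minimal sketch of such a fixed-maximum pool (the names are illustrative; a Stack stands in for the array for brevity):
public sealed class FixedPool<T> where T : class, new()
{
    private readonly Stack<T> _free;

    public FixedPool(int max)
    {
        // create and fill the whole pool up front
        _free = new Stack<T>(max);
        for (int i = 0; i < max; i++) _free.Push(new T());
    }

    // dole one out; throws InvalidOperationException if the fixed maximum is exceeded
    public T Rent() { return _free.Pop(); }

    // the manual "delete": put the object back so it can be reused
    public void Return(T item) { _free.Push(item); }
}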
If you don't have such knowledge, it gets harder but is still possible.
If it is really vital that you avoid GC, be careful of hidden objects. Adding to most collection types can for example result in them moving up to a larger internal store, and leaving the earlier store to be collected. Maybe this is fine in that you've still reduced GC use enough that it is no longer causing the problem you have, but maybe not.
I've rarely seen a LinkedList<> used in .NET... Have you tried using a List<>? Consider that the basic "element" of a LinkedList<> is a LinkedListNode<>, which is a class - so for each Sample there is the additional overhead of a whole extra object.
Note that if you want to use "big" value types (as suggested by others), the List<> could again become slower, because a List<> grows by allocating a new internal array of double the current size and copying from old to new - so the bigger the elements, the more memory the List<> has to copy around when it doubles itself.
If you go with List<>, you could try splitting each Sample across two parallel lists (see the sketch below):
List<ushort> commandIndexes;
List<ChannelValues> valueChannels;
This is because the doubles of Sample require 8-byte alignment, so as written a Sample is 24 bytes, with only 18 bytes used; splitting off the ushort lets the doubles pack tightly.
This wouldn't be a good idea for LinkedList<>, because the LL has a big overhead per item.
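A sketch of that split; ChannelValues and expectedCount are names I'm introducing here, with ChannelValues holding just the two doubles so each element is exactly 16 bytes:
// hypothetical companion struct for the split (not in the original post)
public struct ChannelValues
{
    public double Channel1;
    public double Channel2;
}

// parallel lists kept in sync by index; pre-sized to avoid the doubling copies
var commandIndexes = new List<ushort>(expectedCount);
var valueChannels = new List<ChannelValues>(expectedCount);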
Change Sample and ComplexSample to struct.
Ok, I want to do the following; to me it seems like a good idea, so if there's no way to do what I'm asking, I'm sure there's a reasonable alternative.
Anyways, I have a sparse matrix. It's pretty big and mostly empty. I have a class called MatrixNode that's basically a wrapper around each of the cells in the matrix. Through it you can get and set the value of that cell. It also has Up, Down, Left and Right properties that return a new MatrixNode that points to the corresponding cell.
Now, since the matrix is mostly empty, having a live node for each cell, including the empty ones, is an unacceptable memory overhead. The other solution is to make new instances of MatrixNode every time a node is requested. This will make sure that only the needed nodes are kept in the memory and the rest will be collected. What I don't like about it is that a new object has to be created every time. I'm scared about it being too slow.
So here's what I've come up with. Have a dictionary of weak references to nodes. When a node is requested, if it doesn't exist, the dictionary creates it and stores it as a weak reference. If the node does already exist (probably referenced somewhere), it just returns it.
Then, if the node doesn't have any live references left, instead of it being collected, I want to store it in a pool. Later, when a new node is needed, I want to first check the pool and only make a new node if there isn't one already available that can just have its data swapped out.
Can this be done?
A better question would be, does .NET already do this for me? Am I right in worrying about the performance of creating single use objects in large numbers?
Instead of guessing, you should make a performance test to see if there are any issues at all. You may be surprised to know that managed memory allocation can often outperform explicit allocation because your code doesn't have to pay for deallocation when your data goes out of scope.
Performance may become an issue only when you are allocating new objects so frequently that the garbage collector has no chance to collect them.
That said, there are sparse array implementations for C# already, like Math.NET and MetaNumerics. These libraries are already optimized for performance and will probably let you avoid the performance issues you would run into if you started your implementation from scratch.
An SO search for c# and sparse-matrix will return many related questions, including answers pointing to commercial libraries like ILNumerics (which has a community edition), NMath, and Extreme Optimization's libraries.
Most sparse matrix implementations use one of a few well-known schemes for their data; I generally recommend CSR or CSC, as those are efficient for common operations.
If that seems too complex, you can start using COO. What this means in your code is that you will not store anything for empty members; however, you have an item for every non-empty one. A simple implementation might be:
public struct SparseMatrixItem
{
    public int Row;
    public int Col;
    public double Value;
}
And your matrix would generally be a simple container:
public interface SparseMatrix
{
    IList<SparseMatrixItem> Items { get; }
}
You should make sure that the Items list stays sorted according to the row and col indices, because then you can use binary search to quickly find out if an item exists for a specific (i,j).
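Concretely, ordering by row then column lets List<T>.BinarySearch find an item; a sketch (the comparer and the TryGet helper are names I'm introducing):
// orders items by row, then column, so a sorted Items list can be binary-searched
sealed class ItemPositionComparer : IComparer<SparseMatrixItem>
{
    public int Compare(SparseMatrixItem a, SparseMatrixItem b)
    {
        int c = a.Row.CompareTo(b.Row);
        return c != 0 ? c : a.Col.CompareTo(b.Col);
    }
}

// returns true and the value if an item exists for (i, j); items must stay sorted
static bool TryGet(List<SparseMatrixItem> items, int i, int j, out double value)
{
    var probe = new SparseMatrixItem { Row = i, Col = j };
    int idx = items.BinarySearch(probe, new ItemPositionComparer());
    value = idx >= 0 ? items[idx].Value : 0.0;
    return idx >= 0;
}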
The idea of having a pool of objects that people use and then return to the pool is used for really expensive objects. Objects representing a network connection, a new thread, etc. It sounds like your object is very small and easy to create. Given that, you're almost certainly going to harm performance pooling it; the overhead of managing the pool will be greater than the cost of just creating a new one each time.
Having lots of short lived very small objects is the exact case that the GC is designed to handle quickly. Creating a new object is dirt cheap; it's just moving a pointer up and clearing out the bits for that object. The real overhead for objects comes in when a new garbage collection happens; for that it needs to find all "alive" objects and move them around, leaving all "dead" objects in their place. If your small object doesn't live through a single collection it has added almost no overhead. Keeping the objects around for a long time (like, say, by pooling them so you can reuse them) means copying them through several collections, consuming a fair bit of resources.
I'm developing an Xbox 360 game with XNA. I'd really like to use C#'s yield return construct in a couple of places, but it seems to create a lot of garbage. Have a look at this code:
class ComponentPool<T> where T : DrawableGameComponent
{
    List<T> preallocatedComponents;

    public IEnumerable<T> Components
    {
        get
        {
            foreach (T component in this.preallocatedComponents)
            {
                // Enabled often changes during iteration over Components;
                // for example, it's not uncommon for bullet components to get
                // disabled during collision testing
                // (sorry I didn't make that clear originally)
                if (component.Enabled)
                {
                    yield return component;
                }
            }
        }
    }
...
I use these component pools everywhere - for bullets, enemies, explosions; anything numerous and transient. I often need to loop over their contents, and I'm only ever interested in components that are active (i.e., Enabled == true), hence the behavior of the Components property.
Currently, I'm seeing as much as ~800K per second of additional garbage when using this technique. Is this avoidable? Is there another way to use yield return?
Edit: I found this question about the broader issue of how to iterate over a resource pool without creating garbage. A lot of commenters were dismissive, apparently not understanding the limitations of the Compact Framework, but this commenter was more sympathetic and suggested creating an iterator pool. That's the solution I'm going to use.
The implementation of iterators by the compiler does indeed use class objects and the use (with foreach, for example) of an iterator implemented with yield return will indeed cause memory to be allocated. In the scheme of things this is rarely a problem because either considerable work is done while iterating or considerably more memory is allocated doing other things while iterating.
In order for the memory allocated by an iterator to become a problem, your application must be data-structure intensive and your algorithms must operate on objects without allocating any memory. Think of the Game of Life or something similar. Suddenly it is the iteration itself that overwhelms. And when the iteration allocates memory, a tremendous amount of memory can be allocated.
If your application fits this profile (and only if) then the first rule you should follow is:
avoid iterators in inner loops when a simpler iteration concept is available
For example, if you have an array or list like data structure, you are already exposing an indexer property and a count property so clients can simply use a for loop instead of using foreach with your iterator. This is "easy money" to reduce GC and it doesn't make your code ugly or bloated, just a little less elegant.
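For instance, assuming the pool exposes a Count property and an indexer (as a List-backed pool would; both are assumptions here), the hot loop becomes:
// no iterator object is allocated here; the Enabled filter moves into the loop body
for (int i = 0; i < pool.Count; i++)
{
    var component = pool[i];
    if (component.Enabled)
    {
        // per-component work goes here
    }
}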
The second principle you should follow is:
measure memory allocations to see when and where you should apply the first rule
Just for grins, try capturing the filter in a Linq query and holding onto the query instance. This might reduce memory reallocations each time the query is enumerated.
If nothing else, the statement preallocatedComponents.Where(r => r.Enabled) is a heck of a lot less code to look at to do the same thing as your yield return.
class ComponentPool<T> where T : DrawableGameComponent
{
    List<T> preallocatedComponents;
    IEnumerable<T> enabledComponentsFilter;

    public ComponentPool()
    {
        preallocatedComponents = new List<T>();
        // build the query once over the initialized list; it is lazily re-evaluated on each enumeration
        enabledComponentsFilter = this.preallocatedComponents.Where(r => r.Enabled);
    }

    public IEnumerable<T> Components
    {
        get { return enabledComponentsFilter; }
    }

    ...
I am using a 3rd-party object I didn't create that, over time, consumes a lot of resources. This object shouldn't hold any state; it simply performs a calculation. Despite this fact, every time I call a specific function of this object a little more memory is consumed. A few hours later, and my program is sitting at gigabytes of allocated memory.
The object was originally initialized as a static member of my Program class in my command-line application. I have found that if I wrap my entire program in a class and reinitialize it every now and again, the older (and bloated) object is collected by the GC and a new, smaller object replaces it.
My issue is that this method is quite clumsy and ruins the flow of my program.
Is there any other way you can dispose of an object? I am led to believe GC.Collect() will only collect unreachable objects. Is there any way I can make an object 'unreachable'?
Edit: As requested, the code:
static ILexicon lexicon = new Lexicon();
...
lexicon.LoadDataFromFile(@"lexicon.dat", null);
...
byte similarityScore(string w1, string w2, PartOfSpeech pos, SimilarityMeasure measure)
{
    if (w1 == w2)
        return 255;
    if (pos != PartOfSpeech.Noun && pos != PartOfSpeech.Verb)
        return 0;

    IList<ILemma> w1_lemmas = lexicon.FindSenses(w1, pos);
    IList<ILemma> w2_lemmas = lexicon.FindSenses(w2, pos);

    byte result;
    byte score = 0;
    foreach (ILemma w1_lemma in w1_lemmas)
    {
        foreach (ILemma w2_lemma in w2_lemmas)
        {
            result = (byte)(w1_lemma.GetSimilarity(w2_lemma, measure) * 255);
            if (result > score)
                score = result;
        }
    }
    return score;
}
As similarityScore is called, more memory is allocated to a private member of lexicon. It does not implement IDisposable and there are no obvious functions to clear the memory. The library is based on WordNet and uses an algorithm to find path lengths in the hypernym tree to calculate the similarity of two words. Unless there is caching, I can't see why it would need to hold on to any memory. What is for sure is that I can't change it. I'm almost certain there is nothing wrong with my code. I just need to dispose of lexicon when it gets too large (N.B. it takes a second or two to load the lexicon from file into memory).
If the object doesn't implement IDisposable and you want to push it out of scope, you can set all references to it to null and then force garbage collection with GC.Collect().
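Applied to the code in the question, that would look roughly like this (rebuilding the lexicon afterwards, since it takes a second or two to load):
lexicon = null;                  // drop the only reference so the object becomes unreachable
GC.Collect();                    // force a full collection
GC.WaitForPendingFinalizers();   // let any finalizers run
GC.Collect();                    // reclaim anything the finalizers kept alive
lexicon = new Lexicon();         // rebuild the small replacement
lexicon.LoadDataFromFile(@"lexicon.dat", null);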
GC.Collect() is very expensive. If you're going to have to do this frequently, you might want to consider contacting the vendor.
Find out:
If you are using their library correctly, or is there something you're doing wrong that's causing the memory leak.
If their library is leaking memory even when used as intended, can they fix the leak?
Additional note: if the 3rd-party library is native and you're having to use COM interop, you can use Marshal.ReleaseComObject to release the underlying unmanaged object.
You could try calling the Dispose() method. This would make the object unusable, so you would have to instantiate another one. I assume your program runs in a loop, so the object can be a loop variable, with the call to Dispose at the bottom.
I would suggest that if you can get your hands on a memory profiler, you use it. A memory profiler will let you pause your program, click on a class, and see a list of objects of that class. One can then click on an object and see how it was created, and the "path" to that object from a root (e.g. there's a static class foo, which holds a reference to a bar, which holds a reference to a boz, which holds a reference to a reallybigthing). Often, seeing that will make it clear what needs to be done to break the chain.
You might be able to download the source from the WordNet repository and modify the code, since it is open source.
Here's an interesting article that I found on the web.
It talks about how this firm is able to parse a huge amount of financial data in a managed environment, essentially by reusing objects and avoiding immutables such as string. They then go on to show that their program doesn't do any GC during the continuous operation phase.
This is pretty impressive, and I'd like to know if anyone else here has some more detailed guidelines as to how to do this. For one, I'm wondering how the heck you can avoid using string, when obviously some of the data inside the messages are strings, and whatever client application is looking at the messages will want to be passed those strings. Also, what do you allocate in the startup phase? How will you know it's enough? Is it simply a matter of claiming a big chunk of memory and keeping a reference to it so that the GC doesn't kick in? And what about whatever client application is using the messages? Does it also need to be written according to these stringent standards?
Also, would I need a special tool to look at the memory? I've been using SciTech memory profiler thus far.
I found the paper you linked to rather deficient:
It assumes, and wants you to assume, that garbage collection is the ultimate latency killer. They have not explained why they think so, nor have they explained in what way their system is not basically a custom-made garbage collector in disguise.
It talks about the amount of memory cleaned up in garbage collection, which is irrelevant: the time taken to garbage collect depends more on the number of objects, irrespective of their size.
The table of “results” at the bottom provides no comparison to a system that uses .NET’s garbage collector.
Of course, this doesn’t mean they’re lying and it’s nothing to do with garbage collection, but it basically means that the paper is just trying to sound impressive without actually divulging anything useful that you could use to build your own.
One thing to note from the beginning is where they say "Conventional wisdom has been that developing low latency messaging technology required the use of unmanaged C++ or assembly language". In particular, they are talking about a sort of case where people would often dismiss a .NET (or Java) solution out of hand. For that matter, a relatively naïve C++ solution probably wouldn't make the grade either.
Another thing to consider here is that they essentially haven't so much gotten rid of the GC as replaced it - there's code there managing object lifetime, but it's their own code.
There are several different ways one could do this instead. Here's one. Say I need to create and destroy several Foo objects as my application runs. Foo creation is parameterised by an int, so the normal code would be:
public class Foo
{
    private readonly int _bar;

    public Foo(int bar)
    {
        _bar = bar;
    }

    /* other code that makes this class actually interesting. */
}
public class UsesFoo
{
    public void FooUsedHere(int param)
    {
        Foo baz = new Foo(param);
        // Do something here
        // baz falls out of scope and is liable to GC collection
    }
}
A much different approach is:
public class Foo
{
    private static readonly Foo[] FOO_STORE = new Foo[MOST_POSSIBLY_NEEDED];
    private static Foo FREE;

    static Foo()
    {
        FOO_STORE[MOST_POSSIBLY_NEEDED - 1] = new Foo();
        int idx = MOST_POSSIBLY_NEEDED - 1;
        while (idx != 0)
        {
            Foo newFoo = FOO_STORE[--idx] = new Foo();
            newFoo._next = FOO_STORE[idx + 1];
        }
        // the last element's _next stays null, marking the end of the free list
        FREE = FOO_STORE[0];
    }

    private Foo _next;
    // Note _bar is no longer readonly. We lose the advantages
    // as a cost of reusing objects. Even if Foo acts immutable
    // it isn't really.
    private int _bar;

    public static Foo GetFoo(int bar)
    {
        Foo ret = FREE;
        FREE = ret._next;
        ret._bar = bar; // reinitialise the recycled object
        return ret;
    }

    public void Release()
    {
        _next = FREE;
        FREE = this;
    }

    /* other code that makes this class actually interesting. */
}
public class UsesFoo
{
    public void FooUsedHere(int param)
    {
        Foo baz = Foo.GetFoo(param);
        // Do something here
        baz.Release();
    }
}
Further complication can be added if you are multithreaded (though for really high performance in a non-interactive environment, you may want to have either one thread or separate stores of Foo objects per thread), and if you cannot predict MOST_POSSIBLY_NEEDED in advance (the simplest approach is to create a new Foo() as needed, but not release it for GC, which can easily be done in the above code by creating a new Foo when FREE is null).
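A sketch of that growth path, assuming (as in the code above) that a null FREE marks an exhausted free list:
public static Foo GetFoo(int bar)
{
    Foo ret = FREE;
    if (ret == null)
    {
        // pool exhausted: allocate a fresh Foo; it joins the pool when Release() is called
        ret = new Foo();
    }
    else
    {
        FREE = ret._next;
    }
    ret._bar = bar;
    return ret;
}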
If we allow for unsafe code we can gain even greater advantages by making Foo a struct (and hence having the array hold a contiguous area of memory), _next being a pointer to Foo, and GetFoo() returning a pointer.
Whether this is what these people are actually doing I of course cannot say, but the above does prevent the GC from activating. This will only be faster in very high-throughput conditions; if not, letting the GC do its stuff is probably better (the GC really does help you, despite 90% of questions about it treating it as a Big Bad).
There are other approaches that similarly avoid GC. In C++ the new and delete operators can be overridden, which allows for the default creation and destruction behaviour to change, and discussions of how and why one might do so might interest you.
A practical take-away from this applies when objects either hold resources other than memory that are expensive (e.g. connections to databases) or "learn" as they continue to be used (e.g. XmlNameTables). In those cases pooling objects is useful (ADO.NET connections do so behind the scenes by default), and there a simple Queue is the way to go, as the extra overhead in terms of memory doesn't matter. You can also abandon objects on lock contention (you're looking to gain performance, and lock contention will hurt it more than abandoning the object would), which I doubt would work in their case.
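For that simpler case the pool really can be little more than a queue; a minimal single-threaded sketch (a concurrent version would need locking or a ConcurrentQueue):
public sealed class QueuePool<T> where T : class, new()
{
    private readonly Queue<T> _pool = new Queue<T>();

    public T Acquire()
    {
        // reuse a pooled instance if one is available, otherwise create one
        return _pool.Count > 0 ? _pool.Dequeue() : new T();
    }

    public void Release(T item)
    {
        // the extra memory held by the queue doesn't matter for expensive objects
        _pool.Enqueue(item);
    }
}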
From what I understood, the article doesn't say they don't use strings; they don't use immutable strings. The problem with immutable strings is that when you're doing parsing, most of the strings generated are just throw-away strings.
I'm guessing they're using some sort of pre-allocation combined with free lists of mutable strings.
I worked for a while with a CEP product called StreamBase. One of their engineers told me that they were migrating their C++ code to Java because they were getting better performance, fewer bugs and better portability on the JVM by pretty much avoiding GC altogether. I imagine the arguments apply to the CLR as well.
It seemed counter-intuitive, but their product was blazingly fast.
Here's some information from their site:
StreamBase avoids garbage collection in two ways: Not using objects, and only using the minimum set of objects we need.
First, we avoid using objects by using Java primitive types (Boolean, byte, int, double, and long) to represent our data for processing. Each StreamBase data type is represented by one or more primitive type. By only manipulating the primitive types, we can store data efficiently in stack or array allocated regions of memory. We can then use techniques like parallel arrays or method calling to pass data around efficiently.
Second, when we do use objects, we are careful about their creation and destruction. We tend to pool objects rather than releasing them for garbage collection. We try to manage object lifecycle such that objects are either caught by the garbage collector in the young generation, or kept around forever.
Finally, we test this internally using a benchmarking harness that measures per-tuple garbage collection. In order to achieve our high speeds, we try to eliminate all per-tuple garbage collection, generally with good success.
99% of the time, you will be wasting your boss's money when you try to achieve this. The article describes an absolutely extreme scenario where they need the last drop of performance. As you can read in the article, there are great parts of the .NET framework that can't be used when trying to be GC-free. Some of the most basic parts of the BCL use memory allocations (or 'produce garbage', as the paper calls it), and you will need to find a way around those methods.
Even when you need an absolutely blazingly fast application, you'd better first try to build an application/architecture that can scale out (use multiple machines) before walking the no-GC route. The sole reason for them to use the no-GC route is that they need an absolute low latency. IMO, when you need absolute speed but don't care about the absolute minimum response time, it will be hard to justify a no-GC architecture. Besides this, if you try to build a GC-free client application (such as a Windows Forms or WPF app), forget it: those presentation frameworks create new objects constantly.
But if you really want this, it is actually quite simple. Here is a simple how to:
Find out which parts of the .NET API can't be used (you can write a tool that analyzes the .NET assemblies using an introspection engine).
Write a program that verifies the code you or your developers write to ensure they don't allocate directly or use 'forbidden' .NET methods, using the safe list created in the previous point (FxCop is a great tool for this).
Create object pools that you initialize at startup time. The rest of the program can then reuse the existing objects so that it never has to do any new operations.
If you need to manipulate strings, use byte arrays instead and store those byte arrays in a pool (WCF uses this technique as well). You will have to create an API that allows manipulating those byte arrays; see the sketch after this list.
And last but not least, profile, profile, profile.
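As a sketch of the byte-array approach to strings mentioned above (bufferPool and the literal are purely illustrative; requires System.Text):
// rent a pooled buffer, write ASCII text into it, and pass the
// (buffer, length) pair around instead of allocating strings
byte[] buffer = bufferPool.Rent();
int len = Encoding.ASCII.GetBytes("ORDER", 0, 5, buffer, 0);
// ... manipulate buffer[0..len) in place through your byte-array API ...
bufferPool.Return(buffer); // nothing was allocated, so no garbage is produced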
Good luck