Too many Pinned Objects causing OutOfMemoryException - C#

We have an application written in C# on the .NET Compact Framework 3.5, running on an instrument.
We are using the Windows Embedded Compact 7 operating system.
Broadly speaking, this application reads data from the device, stores it in files on disk or on an SD card, and displays the data in graphical/numerical form.
To read the data, we call the driver by passing a byte array in the following manner:
[System.Runtime.InteropServices.DllImport("coredll.dll")]
private static unsafe extern bool ReadFile(
    IntPtr hFile,
    byte* lpBuffer,
    uint nNumberOfBytesToRead,
    uint* lpNumberOfBytesRead,
    uint lpOverlapped);

public int Read(byte[] buffer, int index, int count)
{
    uint temp;
    unsafe
    {
        fixed (byte* pByte = &(buffer[index]))
        {
            if (!ReadFile(handle, pByte, (uint)count, &temp, 0))
            {
                return -1;
            }
        }
    }
    return (int)temp;
}
And our Reader method, which runs in a separate thread, looks like this:
private void Reader()
{
    try
    {
        while (true)
        {
            byte[] data = new byte[1024 * 10];
            Read(data, 0, data.Length);
            // Do something with data
        }
    }
    catch (Exception ex)
    {
    }
}
Our requirement is that the application must run continuously for months.
I enabled the performance monitor on the instrument, and when I run the app, stop it, and look at the statistics file the perf monitor generates, I see a very large number of pinned objects, which may eventually end in a memory leak.
Here is the stat file:
counter total last datum n mean min max
Total Program Run Time (ms) 60240358 - - - - -
App Domains Created 1 - - - - -
App Domains Unloaded 2 - - - - -
Assemblies Loaded 16 - - - - -
Classes Loaded 2466 - - - - -
Methods Loaded 9685 - - - - -
Closed Types Loaded 598 - - - - -
Closed Types Loaded per Definition 598 1 61 9 1 97
Open Types Loaded 16 - - - - -
Closed Methods Loaded 184 - - - - -
Closed Methods Loaded per Definition 184 3 55 3 1 13
Open Methods Loaded 1 - - - - -
Threads in Thread Pool - 9 19 6 1 9
Pending Timers - 0 1986882 0 0 4
Scheduled Timers 662294 - - - - -
Timers Delayed by Thread Pool Limit 0 - - - - -
Work Items Queued 662294 - - - - -
Uncontested Monitor.Enter Calls 30435505 - - - - -
Contested Monitor.Enter Calls 72 - - - - -
Peak Bytes Allocated (native + managed) 10869964 - - - - -
Managed Objects Allocated 386020616 - - - - -
Managed Bytes Allocated 25202701076 16 386022691 65 8 1048588
Managed String Objects Allocated 39325752 - - - - -
Bytes of String Objects Allocated 3041772660 - - - - -
Garbage Collections (GC) 24644 - - - - -
Bytes Collected By GC 25218571932 936288 24644 1023314 49132 1528716
Managed Bytes In Use After GC - 6162292 24644 5988047 235620 6162292
Total Bytes In Use After GC - 10729796 24644 10045775 2113544 10729796
GC Compactions 24640 - - - - -
Code Pitchings 1 - - - - -
Calls to GC.Collect 0 - - - - -
GC Latency Time (ms) 532689 23 24644 21 4 103
Pinned Objects 846589 - - - - -
Objects Moved by Compactor 15728750 - - - - -
Objects Not Moved by Compactor 261474730 - - - - -
Objects Finalized 14917931 - - - - -
Objects on Finalizer Queue - 0 14944139 303 0 2546
Boxed Value Types 17838485 - - - - -
Process Heap - 5968 101947181 156314 336 209656
Short Term Heap - 0 4050880 51 0 79952
JIT Heap - 0 27236 895917 0 1861844
App Domain Heap - 1536 29669 960504 1536 1297080
GC Heap - 0 208 4088477 0 7512064
Native Bytes Jitted 3495028 136 8431 414 84 111248
Methods Jitted 8431 - - - - -
Bytes Pitched 1738556 164 4269 407 72 111248
Methods Pitched 4269 - - - - -
Method Pitch Latency Time (ms) 78 78 1 78 78 78
Exceptions Thrown 171 - - - - -
Platform Invoke Calls 9528151 - - - - -
COM Calls Using a vtable 0 - - - - -
COM Calls Using IDispatch 0 - - - - -
Complex Marshaling 1929605 - - - - -
Runtime Callable Wrappers 0 - - - - -
Socket Bytes Sent 784 - - - - -
Socket Bytes Received 1934 - - - - -
Controls Created 255 - - - - -
Brushes Created 3022728 - - - - -
Pens Created 1888616 - - - - -
Bitmaps Created 166286 - - - - -
Regions Created 19 - - - - -
Fonts Created 31 - - - - -
Graphics Created (FromImage) 0 - - - - -
Graphics Created (CreateGraphics) 23 - - - - -
From reading around on the Internet, I understand that the fixed statement in the Read method pins the buffer, but that it should be unpinned again when execution leaves the fixed block.
Apart from the code above, we have no other code that interacts with the unmanaged environment.
I am looking for advice on fixing this pinned-object problem. Is there any other way, or a tool, that can help us find out what exactly those pinned objects are?
Is there a better way to do this kind of managed/unmanaged communication?

I think the counter refers to the number of objects the system finds in a pinned state during a GC. If the read call can take a long time, this may happen quite often.

On the other hand, I think you can easily reduce the number of memory allocations your app performs by simply moving the allocation of the data buffer outside the while loop. If its size is fixed (as it seems to be from the code), or at least a maximum size is easy to determine, re-allocating it on every iteration makes no sense: you mark the old buffer for GC and take new memory, you reach the memory threshold, and the GC has to run. I don't know how long your app had been running, but allocating 386 million objects looks like quite a big number. Avoiding useless re-allocations (like the one in the code you posted) and using StringBuilder instead of string concatenation, for example, can reduce the number of allocations, keep the GC from running too often and, in general, improve your code's efficiency.
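For example, here is a minimal sketch of the reader with the buffer hoisted out of the loop (the bytesRead handling is an addition for illustration, not part of the original code):

private void Reader()
{
    // Allocate once: the size is fixed, so there is no need to create
    // (and later pin) a fresh 10 KB array on every iteration.
    byte[] data = new byte[1024 * 10];
    try
    {
        while (true)
        {
            int bytesRead = Read(data, 0, data.Length);
            if (bytesRead < 0)
            {
                break; // read failed; leave the loop instead of spinning
            }
            // Do something with data[0..bytesRead)
        }
    }
    catch (Exception ex)
    {
        // At the very least, log ex instead of swallowing it.
    }
}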

Related

GC Committed Bytes - what does it mean?

I was practicing app memory profiling. I have an ASP.NET Core 7 MVC app running on Ubuntu. I noticed that some memory wasn't being released, so I tried dotnet-counters monitor -p <pid> to see what's going on, and found some strange values: GC Committed Bytes is huge.
I don't understand what that means. How can I interpret the result to decide how to investigate further?
I couldn't find any information in the official documentation that would help me explain what this counter means and how to interpret it.
dotnet-counters output example:
[System.Runtime]
% Time in GC since last GC (%) 0
Allocation Rate (B / 1 sec) 8,168
CPU Usage (%) 0.015
Exception Count (Count / 1 sec) 0
GC Committed Bytes (MB) 1,269.08
GC Fragmentation (%) 54.622
GC Heap Size (MB) 19.337
Gen 0 GC Count (Count / 1 sec) 0
Gen 0 Size (B) 0
Gen 1 GC Count (Count / 1 sec) 0
Gen 1 Size (B) 554,856
Gen 2 GC Count (Count / 1 sec) 0
Gen 2 Size (B) 27,092,392
IL Bytes Jitted (B) 2,410,959
LOH Size (B) 8,032,824
Monitor Lock Contention Count (Count / 1 sec) 0
Number of Active Timers 5
Number of Assemblies Loaded 158
Number of Methods Jitted 29,767
POH (Pinned Object Heap) Size (B) 3,765,072
ThreadPool Completed Work Item Count (Count / 1 sec) 0
ThreadPool Queue Length 0
ThreadPool Thread Count 4
Time spent in JIT (ms / 1 sec) 0
Working Set (MB) 1,552.4

shift bits while retaining pattern

My apologies if this has been asked/answered before but I'm honestly not even sure how to word this as a question properly. I have the following bit pattern:
0110110110110110110110110110110110110110110110110110110110110110
I'm trying to perform a shift that'll preserve my underlying pattern; my first instinct was to use right rotation ((x >> count) | (x << (-count & 63))) but the asymmetry in my bit pattern results in:
0011011011011011011011011011011011011011011011011011011011011011 <--- wrong
The problem is that the most significant (far left) bit ends up being 0 instead of the desired 1:
1011011011011011011011011011011011011011011011011011011011011011 <--- right
Is there a colloquial name for this function I'm looking for? If not, how could I go about implementing this idea?
Additional Information:
While the question is language agnostic I'm currently trying to solve this using C#.
The bit patterns I'm using are entirely predictable and always have the same structure; the pattern starts with a single zero followed by n - 1 ones (where n is an odd number) and then repeats infinitely.
I'd like to accomplish this without conditional operations since they'd defeat the purpose of using bitwise manipulation in the first place but maybe I have no choice...
You've got a number structured like this:
B16 B15 B14 B13 B12 B11 B10 B09 B08 B07 B06 B05 B04 B03 B02 B01 B00
? 0 1 1 0 1 1 0 1 1 0 1 1 0 1 1 0
The ? needs to appear in the MSB (B15, or B63, or whatever) after the shift. Where does it come from? Well, the closest copy is found n places to the right:
B13 0 1 1 0 1 1 0 1 1 0 1 1 0 1 1 0
^--------------/
If your word has width w, this is 1 << (w-n)
So you can do:
var selector = 1 << (w-n);
var rotated = (val >> 1) | ((val & selector) << (n-1));
But you may want to shift by more than one position. Then we need to build a wider mask:
? 0 1 1 0 1 1 0 1 1 0 1 1 0 1 1 0
* * * * *
Here I've chosen to pretend n = 6; it just needs to be a multiple of the basic n, and larger than the shift. Now:
var selector = ((1UL << shift) - 1) << (w - n);
var rotated = (val >> shift) | ((val & selector) << (n - shift));
Working demonstration using your pattern: http://rextester.com/UWYSW47054
It's easy to see that the output has period 3, as required:
1:B6DB6DB6DB6DB6DB
2:DB6DB6DB6DB6DB6D
3:6DB6DB6DB6DB6DB6
4:B6DB6DB6DB6DB6DB
5:DB6DB6DB6DB6DB6D
6:6DB6DB6DB6DB6DB6
7:B6DB6DB6DB6DB6DB
8:DB6DB6DB6DB6DB6D
9:6DB6DB6DB6DB6DB6
10:B6DB6DB6DB6DB6DB
11:DB6DB6DB6DB6DB6D
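In case the linked demo goes away, here is a self-contained C# sketch of the same technique (the PatternShiftDemo/ShiftPattern names are mine):

using System;

class PatternShiftDemo
{
    // Shift val right by 'shift' bits, refilling the vacated high bits
    // from the copy of the pattern found n positions to the right.
    // w is the word width; n is the pattern period (or a multiple of it
    // that is larger than shift).
    static ulong ShiftPattern(ulong val, int shift, int n, int w = 64)
    {
        ulong selector = ((1UL << shift) - 1) << (w - n);
        return (val >> shift) | ((val & selector) << (n - shift));
    }

    static void Main()
    {
        ulong val = 0x6DB6DB6DB6DB6DB6UL; // "011" repeating, period 3
        for (int i = 1; i <= 11; i++)
        {
            val = ShiftPattern(val, 1, 3);
            Console.WriteLine("{0}:{1:X16}", i, val);
        }
    }
}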
Instead of storing a lot of repetitions of a pattern, just store one occurrence and apply modulo operations to the indexes:
byte[] pattern = new byte[] { 0, 1, 1 };
// Get a "bit" at index "i", shifted right by "shift"
byte bit = pattern[(i - shift + 1000000 * pattern.Length) % pattern.Length];
The + 1000000 * pattern.Length term must be greater than the greatest expected shift and ensures that we get a positive sum.
This allows you to store patterns of virtually any length.
An optimization would be to store a mirrored version of the pattern. You could then shift left instead of right, which simplifies the index calculation:
byte bit = pattern[(i + shift) % pattern.Length];
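Putting the fragments together, a minimal sketch (the BitAt helper name is mine):

static byte BitAt(byte[] pattern, int i, int shift)
{
    // The large constant keeps the index positive for any reasonable
    // shift, as noted above.
    return pattern[(i - shift + 1000000 * pattern.Length) % pattern.Length];
}

byte[] pattern = { 0, 1, 1 };
// Bit 5 of the stream 011011011..., after shifting right by 2:
byte bit = BitAt(pattern, 5, 2);   // == 0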
Branchless answer, after a poke by @BenVoigt:
Get the last bit b by doing (n & 1);
Return (n >> 1) | (b << (8 * sizeof(n) - 1)), where 8 * sizeof(n) is the width of n in bits.
Original answer:
Get the last bit b by doing (n & 1);
If b is 1, right-shift the number by 1 bit and bitwise-OR it with 1 << (8 * sizeof(n) - 1);
If b is 0, just right-shift the number by 1 bit.
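As code, the branchless version amounts to a rotate-right by one; a sketch for a 64-bit value (RotateRightOne is an illustrative name; note 8 * sizeof(ulong) - 1 == 63):

static ulong RotateRightOne(ulong n)
{
    ulong b = n & 1;               // grab the last bit
    return (n >> 1) | (b << 63);   // shift right, re-insert it at the MSB
}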
The problem was changed a bit through the comments.
For all reasonable n, the following problem can be solved efficiently after minimal pre-computation:
Given an offset k, get 64 bits starting at that position in the stream of bits that follows the pattern of (zero, n-1 ones) repeating.
Clearly the pattern repeats with a period of n, so only n different ulongs have to be produced for any given value of n. That can either be done explicitly, constructing all of them in pre-processing (they can be constructed in any obvious way; it doesn't really matter, since it only happens once), or left more implicit by storing only two ulongs per value of n (this works under the assumption that n < 64, see below) and then extracting a range from them with some shifting/ORing. Either way, use offset % n to compute which pattern to retrieve (since the offset increases in a predictable manner, no actual modulo operation is required[1]).
Even with the first method, memory consumption will be reasonable, since this is only an optimization for low n: in particular, for n > 64 there will be fewer than one zero per word on average, so the "old fashioned way" of visiting every multiple of n and resetting that bit starts to skip work, while the above trick would still visit every word and would no longer be able to reset multiple bits at once.
[1]: if there are multiple n's in play at the same time, a possible strategy is keeping an array offsets where offsets[n] = offset % n, which could be updated according to: (not tested)
int next = offsets[n] + _64modn[n]; // 64 % n precomputed
offsets[n] = next - (((n - next - 1) >> 31) & n);
The idea is that n is subtracted whenever next >= n. Only one subtraction is needed, since the offset and the amount added to it are already reduced modulo n.
This offset-increment can be done with System.Numerics.Vectors, which is very feature-poor compared to actual hardware but is just about able to do this. It can't do the shift (yes, it's weird) but it can implement a comparison in a branchless way.
Doing one pass per value of n is easier, but touches lots of memory in a cache-unfriendly manner. Doing lots of different n at the same time may not be great either. I guess you'd just have to benchmark that...
Also, you could consider hard-coding it for some low numbers: something like offset % 3 is fairly efficient (unlike offset % variable). This does take manual loop-unrolling, which is a bit annoying, but it's actually simpler, just big in terms of lines of code.
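For illustration, a sketch of the explicit pre-computation described above (the method name and the MSB-first bit order are assumptions of mine):

// Precompute the n distinct 64-bit windows of the repeating
// (one zero, n - 1 ones) pattern; windows[k] starts at stream offset k.
static ulong[] PrecomputeWindows(int n)
{
    var windows = new ulong[n];
    for (int start = 0; start < n; start++)
    {
        ulong w = 0;
        for (int bit = 0; bit < 64; bit++)
        {
            // The stream is 0 exactly at indexes that are multiples of n.
            if ((start + bit) % n != 0)
                w |= 1UL << (63 - bit);
        }
        windows[start] = w;
    }
    return windows;
}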

Profiling results - how to understand them

I profiled my console application, which uses the Unity IoC container and makes a lot of calls through HttpClient. How do I interpret the results?
Function Name, Inclusive Samples, Exclusive Samples, Inclusive Samples %, Exclusive Samples %
Microsoft.Practices.Unity.UnityContainer.Resolve 175 58 38.89 12.89
Microsoft.Practices.Unity.UnityContainer..ctor 29 29 6.44 6.44
System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1[System.DateTime].Start 36 13 8.00 2.89
Microsoft.Practices.Unity.UnityContainerExtensions.RegisterInstance 9 9 2.00 2.00
System.Net.Http.HttpClientHandler..ctor 9 9 2.00 2.00
System.Net.Http.HttpMessageInvoker.Dispose 9 9 2.00 2.00
System.Activator.CreateInstance 20 8 4.44 1.78
Microsoft.Practices.Unity.ObjectBuilder.NamedTypeDependencyResolverPolicy.Resolve 115 3 25.56 0.67
What does it mean that the inclusive samples for Microsoft.Practices.Unity.UnityContainer.Resolve are 38.89 but the exclusive samples are 12.89? Is that OK? Not too much?
"Inclusive" means "exclusive time plus time spent in all callees".
Forget the "exclusive" stuff.
"Inclusive" is what it's costing you.
It says UnityContainer.Resolve is costing you 39% of time,
and Unity.ObjectBuilder.NamedTypeDependencyResolverPolicy.Resolve is costing you 26%.
It looks like the first one calls the second one, so you can't add their times together.
If you could avoid calling all that stuff, you would save at least 40% of the time, giving you a speedup of at least 100/60 ≈ 1.67x, or 67%.
By the way, that Unity stuff, while not exactly deprecated, is no longer being maintained.

Memory allocation in CUDA device is not what is expected

I can't create new tags, but this should have a managedCuda tag, since I'm using that framework to work with CUDA from C#.
I allocate two int arrays with this code for testing:
Console.WriteLine("Cells: "+sum+" Expected Total Memory (x4): "+sum*4);
int temp= 0;
temp = cntxt.GetFreeDeviceMemorySize();
Console.Write("\n Memory available before:" + cntxt.GetFreeDeviceMemorySize() + "\n");
CudaDeviceVariable<int> matrix = new CudaDeviceVariable<int>(sum);
CudaDeviceVariable<int> matrixDir = new CudaDeviceVariable<int>(sum);
Console.Write("\n Memory available after allocation:" + cntxt.GetFreeDeviceMemorySize() + "\n");
Console.WriteLine("Memory took: "+(temp - cntxt.GetFreeDeviceMemorySize()));
Console.WriteLine("Diference between the expected and allocated: " + ((temp - cntxt.GetFreeDeviceMemorySize())-sum*8));
After running it, I got this in the console:
When you allocate memory through an allocator (malloc, cudaMalloc, ...), it needs to keep track of the bytes you allocated, in special metadata structures. This metadata may contain, for example, the number of bytes allocated and their location in memory, some padding to align the allocation, and buffer-overrun checks.
To reduce the management overhead, most modern allocators use pages, that is, they allocate memory in indivisible chunks of a fixed size. On many host systems, this size is by default 4 kB.
In your precise case, it would appear that CUDA is serving your memory allocation requests in pages of 64 kB. That is, if you request 56 kB, CUDA will serve you 64 kB anyway, and the unused 8 kB are "wasted" (from the point of view of your application).
When you request an allocation of 1552516 bytes (that's 23.7 pages), the runtime will instead serve you 24 pages (1572864 bytes): that's 20348 bytes extra. Double that (because you have 2 arrays), and this is where your 40696 bytes difference comes from.
Note: The page size varies between GPUs and driver versions. You may try to find it out experimentally by yourself, or search for results published by other people. In any case, this is (to the best of my knowledge) not documented, and may therefore not be relied upon if you intend your program to be portable.
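A quick sketch of that arithmetic (the 64 kB page size is an observed value, not a documented guarantee):

const long PageSize = 64 * 1024;                          // observed allocation granularity
long requested = 1552516;                                 // bytes per array
long pages = (requested + PageSize - 1) / PageSize;       // rounds up to 24 pages
long served = pages * PageSize;                           // 1572864 bytes
long wastePerArray = served - requested;                  // 20348 bytes
long totalWaste = 2 * wastePerArray;                      // 40696 bytes for the two arrays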

Estimate/Calculate Session Memory Usage

I would like to estimate the amount of memory used by each session on my server in my ASP.NET web application. A few key questions:
How much memory is allocated just to have each Session instance?
Is the memory usage of each variable equal to its given address space (e.g. 32 bits for an Int32)?
What about variables with variable address space (e.g. String, Array[]s)?
What about custom object instances (e.g. MyCustomObject which holds various other things)?
Is anything added for each variable (e.g. address of Int32 variable to tie it to the session instance) adding to overhead-per-variable?
I would appreciate some help figuring out how I can accurately predict how much memory each session will consume. Thank you!
The HttpSessionStateContainer class has ten local variables, so that is roughly 40 bytes, plus 8 bytes for the object overhead. It has a session id string and an items collection, so that is something like 50 more bytes when the items collection is empty. It has some more references, but I believe those are references to objects shared by all session objects. So, all in all, that makes about 100 bytes per session object.
If you put a value type like Int32 in the items collection of a session, it has to be boxed. With the object overhead of 8 bytes it comes to 12 bytes, but due to limitations in the memory manager it can't allocate less than 16 bytes for the object. With the four bytes for the reference to the object, an Int32 needs 20 bytes.
If you put a reference type in the items collection, you only store the reference, so that is just four bytes. If it's a literal string, it's already created so it won't use any more memory. A created string will use (8 + 8 + 2 * Length) bytes.
An array of value types will use (Length * sizeof(type)) plus a few more bytes. An array of reference types will use (Length * 4) plus a few more bytes for the references, and each object is allocated separately.
A custom object uses roughly the sum of the sizes of its members, plus some padding in some cases, plus the 8 bytes of object overhead. An object containing an Int32 and a Boolean (= 5 bytes) will be padded to 8 bytes, plus the 8 bytes of overhead.
So, if you put a string with 20 characters and three integers in a session object, that will use about (100 + (8 + 8 + 20 *2) + 3 * (20)) = 216 bytes. (However, the session items collection will probably allocate a capacity of 16 items, so that's 64 bytes of which you are using 16, so the size would be 264 bytes.)
(All the sizes are on a 32 bit system. On a 64 bit system each reference is 8 bytes instead of 4 bytes.)
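The same arithmetic as code, for the example of a 20-character string plus three boxed Int32s (32-bit sizes, following the estimates above):

int sessionOverhead = 100;            // container, id string, empty items collection
int stringBytes = 8 + 8 + 2 * 20;     // object overhead + fields + UTF-16 chars = 56
int boxedInts = 3 * (16 + 4);         // 16-byte minimum box + 4-byte reference, each = 60
int total = sessionOverhead + stringBytes + boxedInts;    // = 216 bytes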
.NET Memory Profiler is your friend:
http://memprofiler.com/
You can download the trial version for free and run it. Although these things sometimes get complicated to install and run, I found it surprisingly simple to connect to a running web server and inspect all the objects it was holding in memory.
You can probably get some of these figures using Performance Counters and custom performance counters. I never tested these with ASP.NET, but otherwise they're quite nice for measuring performance.
This old, old article from Microsoft containing performance tips for ASP (not ASP.NET) states that each session has an overhead of about 10 kilobytes. I have no idea whether this also applies to ASP.NET, but it sure is a lot more than the 100 bytes of overhead that Guffa mentions.
When planning a large-scale application, there are some other things you probably need to consider besides rough memory usage.
It depends on the session state provider you choose, and the default in-process sessions are probably not what you'll end up using at all.
In the case of out-of-process session storage (which may be preferred for scalable applications), the picture will be completely different, and it depends on how session objects are serialized and stored.
With SQL session storage, there will not be linear RAM consumption.
For a large-scale application, I would recommend integration testing with an out-of-process flavour of session state provider from the very beginning.
