Why does a recursive call cause StackOverflow at different stack depths? - c#

I was trying to figure out hands-on how tail calls are handled by the C# compiler.
(Answer: They're not. But the 64bit JIT(s) WILL do TCE (tail call elimination). Restrictions apply.)
So I wrote a small test using a recursive call that prints how many times it gets called before the StackOverflowException kills the process.
class Program
{
    static void Main(string[] args)
    {
        Rec();
    }

    static int sz = 0;
    static Random r = new Random();

    static void Rec()
    {
        sz++;
        //uncomment for faster, more imprecise runs
        //if (sz % 100 == 0)
        {
            //some code to keep this method from being inlined
            var zz = r.Next();
            Console.Write("{0} Random: {1}\r", sz, zz);
        }
        //uncommenting this stops TCE from happening
        //else
        //{
        //    Console.Write("{0}\r", sz);
        //}
        Rec();
    }
}
Right on cue, the program ends with a StackOverflowException under any of the following:
'Optimize build' OFF (either Debug or Release)
Target: x86
Target: AnyCPU + "Prefer 32 bit" (this is new in VS 2012 and the first time I saw it. More here.)
Some seemingly innocuous branch in the code (see commented 'else' branch).
Conversely, with 'Optimize build' ON and Target = x64 (or AnyCPU with 'Prefer 32 bit' OFF, on a 64-bit CPU), TCE happens and the counter keeps spinning up forever (well, arguably it spins back down each time its value overflows).
But I noticed a behaviour I can't explain in the StackOverflowException case: it never (?) happens at exactly the same stack depth. Here are the outputs of a few 32-bit runs, Release build:
51600 Random: 1778264579
Process is terminated due to StackOverflowException.
51599 Random: 1515673450
Process is terminated due to StackOverflowException.
51602 Random: 1567871768
Process is terminated due to StackOverflowException.
51535 Random: 2760045665
Process is terminated due to StackOverflowException.
And Debug build:
28641 Random: 4435795885
Process is terminated due to StackOverflowException.
28641 Random: 4873901326 //never say never
Process is terminated due to StackOverflowException.
28623 Random: 7255802746
Process is terminated due to StackOverflowException.
28669 Random: 1613806023
Process is terminated due to StackOverflowException.
The stack size is constant (defaults to 1 MB). The stack frames' sizes are constant.
So then, what can account for the (sometimes non-trivial) variation of stack depth when the StackOverflowException hits?
UPDATE
Hans Passant raises the issue of Console.WriteLine touching P/Invoke, interop and possibly non-deterministic locking.
So I simplified the code to this:
class Program
{
    static void Main(string[] args)
    {
        Rec();
    }

    static int sz = 0;

    static void Rec()
    {
        sz++;
        Rec();
    }
}
I ran it in Release/32bit/Optimization ON without a debugger. When the program crashes, I attach the debugger and check the value of the counter.
And it still isn't the same on several runs. (Or my test is flawed.)
UPDATE: Closure
As suggested by fejesjoco, I looked into ASLR (Address space layout randomization).
It's a security technique that makes it hard for buffer overflow attacks to find the precise location of (e.g.) specific system calls, by randomizing various things in the process address space, including the stack position and, apparently, its size.
The theory sounds good. Let's put it into practice!
In order to test this, I used a Microsoft tool specific for the task: EMET or The Enhanced Mitigation Experience Toolkit. It allows setting the ASLR flag (and a lot more) on a system- or process-level.
(There is also a system-wide, registry hacking alternative that I didn't try)
While verifying the effectiveness of the tool, I also discovered that Process Explorer duly reports the status of the ASLR flag on the 'Properties' page of the process. Never noticed that until today :)
Theoretically, EMET can (re)set the ASLR flag for a single process. In practice, it didn't seem to change anything.
However, I disabled ASLR for the entire system and (one reboot later) I could finally verify that indeed, the SO exception now always happens at the same stack depth.
BONUS
ASLR-related, in older news: How Chrome got pwned

I think it may be ASLR at work. You can turn off ASLR to test this theory.
See here for a C# utility class to check memory information: https://stackoverflow.com/a/8716410/552139
By the way, with this tool, I found that the difference between the maximum and minimum stack size is around 2 KiB, which is half a page. That's weird.
Update: OK, now I know I'm right. I followed up on the half-page theory, and found this doc that examines the ASLR implementation in Windows: http://www.symantec.com/avcenter/reference/Address_Space_Layout_Randomization.pdf
Quote:
Once the stack has been placed, the initial stack pointer is further
randomized by a random decremental amount. The initial offset is
selected to be up to half a page (2,048 bytes)
And this is the answer to your question. ASLR takes away between 0 and 2048 bytes of your initial stack randomly.
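(Not part of the original answer.) If you want to observe this from the C# side, a minimal sketch is to print the page offset of a stack local on each run; with stack ASLR active the offset varies between runs, with it disabled it stays constant:
using System;

class StackOffsetProbe
{
    // Requires compiling with /unsafe. The low 12 bits of a stack address
    // reflect the randomized decrement applied to the initial stack pointer.
    static unsafe void Main()
    {
        int marker = 0;
        ulong address = (ulong)&marker;
        Console.WriteLine("local at 0x{0:X}, offset within 4 KiB page: {1}", address, address & 0xFFF);
    }
}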

This C++11 code prints the offset of the stack within its starting page:
#include <Windows.h>
#include <iostream>
#include <atomic>
using namespace std;
#if !defined(__llvm__)
#pragma warning(disable: 6387) // handle could be NULL
#pragma warning(disable: 6001) // using uninitialized memory
#endif
int main()
{
    SYSTEM_INFO si;
    GetSystemInfo( &si );
    static atomic<size_t> aPageSize( si.dwPageSize );
    auto theThread = []( LPVOID ) -> DWORD
    {
        size_t pageSize = aPageSize.load( memory_order_relaxed );
        // distance from the address of a stack local up to the next page boundary
        return (DWORD)(pageSize - ((size_t)&pageSize & (pageSize - 1)));
    };
    constexpr unsigned ROUNDS = 10;
    for( unsigned r = ROUNDS; r--; )
    {
        HANDLE hThread = CreateThread( nullptr, 0, theThread, nullptr, 0, nullptr );
        WaitForSingleObject( hThread, INFINITE );
        DWORD dwExit;
        GetExitCodeThread( hThread, &dwExit );
        CloseHandle( hThread );
        cout << dwExit << endl;
    }
}
Linux doesn't randomize the lower 12 bits by default:
#include <iostream>
#include <atomic>
#include <pthread.h>
#include <unistd.h>
using namespace std;
int main()
{
    static atomic<size_t> aPageSize( (size_t)sysconf( _SC_PAGESIZE ) );
    auto theThread = []( void *threadParam ) -> void *
    {
        size_t pageSize = aPageSize.load( memory_order_relaxed );
        // distance from the address of a stack local up to the next page boundary
        return (void *)(pageSize - ((size_t)&pageSize & (pageSize - 1)));
    };
    constexpr unsigned ROUNDS = 10;
    for( unsigned r = ROUNDS; r--; )
    {
        pthread_t pThread;
        pthread_create( &pThread, nullptr, theThread, nullptr );
        void *retVal;
        pthread_join( pThread, &retVal );
        cout << (size_t)retVal << endl;
    }
}
Randomizing the thread stack's starting address within a page doesn't make sense from a security standpoint: on a 64-bit system with a 47-bit userspace (on newer Intel CPUs you even have a 55-bit userspace) there are still 35 bits left to randomize above the page granularity, i.e. about 34 billion possible stack placements. And it doesn't make sense from a performance standpoint either, since cache-line aliasing on SMT systems can't happen anymore: today's caches have enough associativity.

Change r.Next() to r.Next(10). The StackOverflowException should then occur at the same depth every time.
The generated strings consume the same memory because they have the same size: r.Next(10).ToString().Length == 1 always, whereas r.Next().ToString().Length is variable.
The same applies if you use r.Next(100, 1000).
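For illustration, here is the suggested change applied to the question's Rec() method (a sketch; only the Random call differs from the original):
static void Rec()
{
    sz++;
    // r.Next(10) always yields a single digit, so the random part of the
    // formatted string has a constant length (see the explanation above)
    var zz = r.Next(10);
    Console.Write("{0} Random: {1}\r", sz, zz);
    Rec();
}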

Related

ReadProcessMemory vs MiniDumpWriteDump

I noticed that if I try to read the entirety of the process with ReadProcessMemory, it takes VERY long. However, MiniDumpWriteDump finishes in about 1 second.
Also, for some reason the byte array becomes corrupt when I try to store the entire process via ReadProcessMemory, while MiniDumpWriteDump's output doesn't.
The only problem is that with a MiniDumpWriteDump I can't match the addresses/values in something like Cheat Engine. For example, a byte array search returns a different address.
MiniDumpWriteDump(pHandle, procID, fsToDump.SafeFileHandle.DangerousGetHandle(), 0x00000002, IntPtr.Zero, IntPtr.Zero, IntPtr.Zero);
ReadProcessMemory(pHandle, (UIntPtr)0, test, (UIntPtr)procs.PrivateMemorySize, IntPtr.Zero);
ReadProcessMemory Length = 597577728
Dump Length = 372053153
if I try to read the entirety of the process with ReadProcessMemory it takes VERY long.
MiniDumpWriteDump is fast because it's a highly optimized function written by Microsoft themselves.
A proper pattern scan that checks the page protection type and state via VirtualQueryEx(), with a limited number of wildcards, won't take more than 10 seconds, and in most cases takes less than 2 seconds.
This is C++ code, but the logic will be the same in C#:
#include <iostream>
#include <windows.h>
int main()
{
    char* addr = 0;
    HANDLE hProc = OpenProcess(PROCESS_ALL_ACCESS, FALSE, GetCurrentProcessId());
    MEMORY_BASIC_INFORMATION mbi;
    while (VirtualQueryEx(hProc, addr, &mbi, sizeof(mbi)))
    {
        if (mbi.State == MEM_COMMIT && mbi.Protect != PAGE_NOACCESS)
        {
            // read only committed, accessible regions
            char* buffer = new char[mbi.RegionSize];
            ReadProcessMemory(hProc, addr, buffer, mbi.RegionSize, nullptr);
            // ... run the pattern scan over 'buffer' here ...
            delete[] buffer;
        }
        addr += mbi.RegionSize;
    }
    CloseHandle(hProc);
}
Notice we check for MEM_COMMIT: if the memory isn't committed, it's invalid. Similarly, if the protection is PAGE_NOACCESS we discard that memory as well. This simple technique yields only proper memory worth scanning, resulting in a fast scan. After you read each region into the local buffer, you can run your pattern scan code against it. Lastly, resolve the offset from the beginning of the region to the absolute address in the target process.
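The answer above gives the C++ version; purely as an illustration (not from the original post), a rough C# translation of the same region walk could look like the sketch below. The P/Invoke declarations are the commonly used ones, and hProc is assumed to be a process handle opened elsewhere with at least PROCESS_QUERY_INFORMATION | PROCESS_VM_READ.
using System;
using System.Runtime.InteropServices;

static class RegionScanner
{
    const uint MEM_COMMIT = 0x1000;
    const uint PAGE_NOACCESS = 0x01;

    [StructLayout(LayoutKind.Sequential)]
    struct MEMORY_BASIC_INFORMATION
    {
        public IntPtr BaseAddress;
        public IntPtr AllocationBase;
        public uint AllocationProtect;
        public IntPtr RegionSize;
        public uint State;
        public uint Protect;
        public uint Type;
    }

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern IntPtr VirtualQueryEx(IntPtr hProcess, IntPtr lpAddress,
        out MEMORY_BASIC_INFORMATION lpBuffer, IntPtr dwLength);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool ReadProcessMemory(IntPtr hProcess, IntPtr lpBaseAddress,
        [Out] byte[] lpBuffer, IntPtr dwSize, out IntPtr lpNumberOfBytesRead);

    // Walks the target's address space and reads only committed, accessible regions.
    public static void Scan(IntPtr hProc)
    {
        long addr = 0;
        MEMORY_BASIC_INFORMATION mbi;
        IntPtr mbiSize = (IntPtr)Marshal.SizeOf(typeof(MEMORY_BASIC_INFORMATION));
        while (VirtualQueryEx(hProc, (IntPtr)addr, out mbi, mbiSize) != IntPtr.Zero)
        {
            if (mbi.State == MEM_COMMIT && mbi.Protect != PAGE_NOACCESS)
            {
                var buffer = new byte[(long)mbi.RegionSize];
                IntPtr bytesRead;
                if (ReadProcessMemory(hProc, mbi.BaseAddress, buffer, mbi.RegionSize, out bytesRead))
                {
                    // run the pattern scan over 'buffer' here; an absolute match
                    // address is (long)mbi.BaseAddress + offsetWithinBuffer
                }
            }
            addr = (long)mbi.BaseAddress + (long)mbi.RegionSize;
        }
    }
}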

A first chance exception of type 'System.OutOfMemoryException' occurred in mscorlib.dll

I am new to C# and trying to read a .sgy file that contains seismic data. I found a library known as Unplugged.SEGY for reading the file. My file is 4.12 GB. I am getting "A first chance exception of type 'System.OutOfMemoryException' occurred in mscorlib.dll" and then the program stops suddenly. This is my code:
using System;
using Unplugged.Segy;
namespace ABC
{
    class abc
    {
        static void Main(String[] args)
        {
            var reader = new SegyReader();
            ISegyFile line = reader.Read(@"D:\Major\Seismic.sgy");
            ITrace trace = line.Traces[0];
            double mean = 0;
            double max = double.MinValue;
            double min = double.MaxValue;
            foreach (var sampleValue in trace.Values)
            {
                mean += sampleValue / trace.Values.Count;
                if (sampleValue < min) min = sampleValue;
                if (sampleValue > max) max = sampleValue;
            }
            Console.WriteLine(mean);
            Console.WriteLine(min);
            Console.WriteLine(max);
        }
    }
}
Please Help me out
EDIT: I am running the application as a 64-bit process
Since you are running in 64-bit (and as long as you're on .NET 4.5+), I recommend making sure the gcAllowVeryLargeObjects flag is set to true.
A 32-bit .NET application caps out at somewhere between 2 and 4 GB of address space per process, while a 64-bit application can consume much more per process.
However, in both 32-bit and 64-bit, a single object can normally consume at most 2 GB.
Since .NET 4.5 you can override that limit by flagging your configuration to allow objects greater than 2 GB, and my final thought is that this flag needs to be set in your situation.
In short:
To have a .NET process use more than 4 GB, it must be a 64-bit process.
To have a single object greater than 2 GB, it must be a 64-bit process running .NET 4.5 or later, with the gcAllowVeryLargeObjects flag set to true.
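For reference, the flag goes in the application's configuration file; a minimal app.config would look like this:
<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <runtime>
    <!-- Allows arrays with a total size larger than 2 GB on 64-bit, .NET 4.5+ -->
    <gcAllowVeryLargeObjects enabled="true" />
  </runtime>
</configuration>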

How to diagnose a corrupted suffix pattern in a mixed managed/unmanaged x32 .NET application

I've got a .NET application that pinvokes several libraries, all 32 bit (the application is 32 bit as well). I recently started getting crash bugs that occurred when the GC started freeing memory, and when I attached I saw that it was an access violation. After some web searches, I got myself set up with gflags and windbg, and was able to get the actual problem :
===========================================================
VERIFIER STOP 0000000F: pid 0x9650: corrupted suffix pattern
001B1000 : Heap handle
20A5F008 : Heap block
00000006 : Block size
20A5F00E : corruption address
===========================================================
This verifier stop is not continuable. Process will be terminated
when you use the `go' debugger command.
===========================================================
After doing some more reading, I was able to get a stack trace :
0:009> !heap -p -a 20A5F008
address 20a5f008 found in
_HEAP # f420000
HEAP_ENTRY Size Prev Flags UserPtr UserSize - state
20a5efe0 0008 0000 [00] 20a5f008 00006 - (busy)
Trace: 0a94
60cba6a7 verifier!AVrfpDphNormalHeapAllocate+0x000000d7
60cb8f6e verifier!AVrfDebugPageHeapAllocate+0x0000030e
77e00d96 ntdll!RtlDebugAllocateHeap+0x00000030
77dbaf0d ntdll!RtlpAllocateHeap+0x000000c4
77d63cfe ntdll!RtlAllocateHeap+0x0000023a
60cccb62 verifier!AVrfpRtlAllocateHeap+0x00000092
7666ea43 ole32!CRetailMalloc_Alloc+0x00000016
7666ea5f ole32!CoTaskMemAlloc+0x00000013
6c40b25d clr!MngdNativeArrayMarshaler::ConvertSpaceToNative+0x000000bd
... and some more detailed information on the block entry :
0:009> !heap -i 20a5f008
Detailed information for block entry 20a5f008
Assumed heap : 0x0f610000 (Use !heap -i NewHeapHandle to change)
Header content : 0x00000000 0x00000001
Owning segment : 0x0f610000 (offset 0)
Block flags : 0x0 (free )
Total block size : 0x0 units (0x0 bytes)
Previous block size: 0xb4e4 units (0x5a720 bytes)
Block CRC : OK - 0x0
List corrupted: (Blink->Flink = 00000000) != (Block = 20a5f010)
Free list entry : CORRUPTED
Previous block : 0x20a048e8
Next block : 0x20a5f008
I'm kind of stuck with this data. Unfortunately, ConvertSpaceToNative isn't an illuminating call, since that encompasses... pretty much every unmanaged allocation request. I've tried branching out further to find the information I'd need to trace it back to the offending call and spent days looking through documentation, but am not finding a way to determine the actual source of the corruption. I've tried setting break points and stepping through, but I can't find a way to verify the contents of the heap manually that actually works - it always reports that everything is okay. It also seems to me that I should be able to get the application to halt immediately by turning on full page heaps, but it still looks like it's not halting until the free call (this is the call stack when execution halts) :
0:009> kL
ChildEBP RetAddr
2354ecac 60cb9df2 verifier!VerifierStopMessage+0x1f8
2354ed10 60cba22a verifier!AVrfpDphReportCorruptedBlock+0x1c2
2354ed6c 60cba742 verifier!AVrfpDphCheckNormalHeapBlock+0x11a
2354ed8c 60cb90d3 verifier!AVrfpDphNormalHeapFree+0x22
2354edb0 77e01564 verifier!AVrfDebugPageHeapFree+0xe3
2354edf8 77dbac29 ntdll!RtlDebugFreeHeap+0x2f
2354eeec 77d634a2 ntdll!RtlpFreeHeap+0x5d
2354ef0c 60cccc4f ntdll!RtlFreeHeap+0x142
2354ef54 76676e6a verifier!AVrfpRtlFreeHeap+0x86
2354ef68 76676f54 ole32!CRetailMalloc_Free+0x1c
2354ef78 6c40b346 ole32!CoTaskMemFree+0x13
2354f008 231f7e8a clr!MngdNativeArrayMarshaler::ClearNative+0x78
WARNING: Frame IP not in any known module. Following frames may be wrong.
2354f08c 231f6442 0x231f7e8a
2354f154 231f5a7b 0x231f6442
2354f264 231f572b 0x231f5a7b
2354f288 231f56a4 0x231f572b
2354f2a4 231f7b3e 0x231f56a4
2354f330 231f207b 0x231f7b3e
2354f3b4 1edf60e0 0x231f207b
*** WARNING: Unable to verify checksum for C:\Windows\assembly\NativeImages_v4.0.30319_32\mscorlib\045c9588954c3662d542b53f4462268b\mscorlib.ni.dll
2354f850 6a746ed4 0x1edf60e0
2354f85c 6a724157 mscorlib_ni+0x386ed4
2354f8c0 6a724096 mscorlib_ni+0x364157
2354f8d4 6a724051 mscorlib_ni+0x364096
2354f8f0 6a691cd2 mscorlib_ni+0x364051
2354f908 6c353e22 mscorlib_ni+0x2d1cd2
2354f914 6c363355 clr!CallDescrWorkerInternal+0x34
2354f968 6c366d1f clr!CallDescrWorkerWithHandler+0x6b
2354f9e0 6c4d29d6 clr!MethodDescCallSite::CallTargetWorker+0x152
2354fb54 6c3c8357 clr!ThreadNative::KickOffThread_Worker+0x19d
2354fb68 6c3c83c5 clr!Thread::DoExtraWorkForFinalizer+0x1ca
2354fc10 6c3c8492 clr!Thread::DoExtraWorkForFinalizer+0x256
2354fc6c 6c3c84ff clr!Thread::DoExtraWorkForFinalizer+0x615
2354fc90 6c4d2ad8 clr!Thread::DoExtraWorkForFinalizer+0x6b2
2354fd14 6c3fb4ad clr!ThreadNative::KickOffThread+0x1d2
2354feb0 60cd11d3 clr!Thread::intermediateThreadProc+0x4d
2354fee8 75c6336a verifier!AVrfpStandardThreadFunction+0x2f
2354fef4 77d69f72 KERNEL32!BaseThreadInitThunk+0xe
2354ff34 77d69f45 ntdll!__RtlUserThreadStart+0x70
2354ff4c 00000000 ntdll!_RtlUserThreadStart+0x1b
I feel like it's supposed to be obvious what I ought to be doing now, but no avenue of investigation is turning up anything to move me towards resolving the bug.
I finally resolved this, and found that I had made one crucial mistake. With a corrupted suffix pattern, the error will come on a free attempt, which led me to believe that it was unlikely that the allocation would have come right before the free. This was not accurate. When dealing with corruption that occurs on free, barring further information, any allocation point is equally likely. In this case, the verifier halt was coming on freeing a parameter which had been incorrectly defined as a struct of shorts instead of as a struct of ints.
Here's the offending code:
[DllImport("gdi32.dll", CharSet = CharSet.Unicode)]
[return: MarshalAs(UnmanagedType.Bool)]
static extern bool GetCharABCWidths(IntPtr hdc, uint uFirstChar, uint uLastChar, [Out] ABC[] lpabc);
(This declaration is okay)
[StructLayout(LayoutKind.Sequential)]
public struct ABC
{
    public short A;
    public ushort B;
    public short C;
}
(This is not okay, per the MSDN article on the ABC struct : http://msdn.microsoft.com/en-us/library/windows/desktop/dd162454(v=vs.85).aspx )
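For reference, the native ABC struct uses int / UINT / int fields (per the MSDN page above), so a corrected managed declaration would be:
[StructLayout(LayoutKind.Sequential)]
public struct ABC
{
    public int A;     // abcA is an int in the native struct
    public uint B;    // abcB is a UINT
    public int C;     // abcC is an int
}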
So, if you find yourself debugging memory corruption that halts on free, keep in mind: never discount the possibility that the memory being freed was incorrectly allocated to begin with... and mind those [Out] parameters on unmanaged calls!

C# opengl context handle getter returns wrong address

Problem solved!
Deleted the sharing pragma from the kernel string (using OpenCL 1.2).
Reordered GL-VBO creation and CL-context creation: first create the CL context from the GL context, then create the GL VBO, then acquire it by CL, then compute, then release by CL, then bind by GL, draw, finish GL, and start over. Always use clFinish to ensure it syncs with GL. For more speed, clFlush may be okay, and maybe even an implicit sync could work, which I did not try.
[original question from here]
In C#, context construction for opencl-gl-interop fails because handle getter function gives wrong address and causes System.AccessViolationException.
C# part:
[DllImport("opengl32.dll",EntryPoint="wglGetCurrentDC")]
extern static IntPtr wglGetCurrentDC();//CAl
[DllImport("opengl32.dll", EntryPoint = "wglGetCurrentContext")]
extern static IntPtr wglGetCurrentContext();// DCAl
C++ part in opencl(this is in a wrapper class of C++ opencl):
pl = new cl_platform_id[2];
clGetPlatformIDs( 1, pl, NULL);
cl_context_properties props[] ={ CL_GL_CONTEXT_KHR, (cl_context_properties)CAl,
CL_WGL_HDC_KHR, (cl_context_properties)DCAl,CL_CONTEXT_PLATFORM,
(cl_context_properties)&pl[0], 0};
ctx=cl::Context(CL_DEVICE_TYPE_GPU,props,NULL,NULL,NULL);//error comes from here
//ctx=cl::Context(CL_DEVICE_TYPE_GPU); this does not interop >:c
What is wrong in these parts? When I change "opengl32.dll" to "opengl64.dll", the compiler/linker cannot find it.
I call wglGetCurrentDC() and wglGetCurrentContext() after glControl1 is loaded, but they seem to return wrong addresses. Calling wglMakeCurrent() or glControl1.MakeCurrent() before them did not solve the problem either.
OS: 64 bit windows7
Host: fx8150
Device: HD7870
MSVC2012(windows forms application) + OpenTK(2010_10_6) + Khronos opencl 1.2 headers
Build target is x64(release).
Note: opencl part is working well for computing(sgemm) and opengl part is drawing VBO well (some plane built of triangles with some color and normals) but opencl part(context) refuses to interop.
Edit: Adding #pragma OPENCL EXTENSION cl_khr_gl_sharing : enable into kernel string did not solve the problem.
Edit: When I create the GL VBOs "after" the construction of the CL context, the error vanishes but nothing is updated by the OpenCL kernel. Weird. Plus, when I delete the cl_khr_gl_sharing pragma, the 3D shape starts artifacting, which means OpenCL is doing something now, but it's just randomly deleted pixels and some cropped areas that I did not write in the kernel. Weirder. (I am trying to make the flat blue sheet disappear, but it doesn't fully disappear, and changing the color has no effect either.)
Edit: CMSoft's opencltemplate looks like what I need to learn/do, but their example code consists of only 6-7 lines of code! I don't know where to put the compute kernel or where to get/set the initial data, but that example works great (it gives hundreds of "WARNING! ComputeBuffer{T}(575296656) leaked." messages, by the way).
Edit: In case you wonder, here is kernel arguments' construction in C++:
//v1,v2,v3,v4 are unsigned int taken from `bindbuffer` of GL in C#
//so v1 is buf[0] and v2 is buf[1] and so goes like this
glBuf1=cl::BufferGL(ctx,CL_MEM_READ_WRITE,v1,0);
glBuf2=cl::BufferGL(ctx,CL_MEM_READ_WRITE,v2,0);
glBuf3=cl::BufferGL(ctx,CL_MEM_READ_WRITE,v3,0);
glBuf4=cl::BufferGL(ctx,CL_MEM_READ_WRITE,v4,0);
and here is how they are set into the command queue:
v.clear();
v.push_back(glBuf1);
v.push_back(glBuf2);
v.push_back(glBuf3);
v.push_back(glBuf4);
cq.enqueueAcquireGLObjects(&v,0,0);
cq.finish();
and here is how I set them as kernel arguments:
kernel.setArg(0,glBuf1);
kernel.setArg(1,glBuf2);
kernel.setArg(2,glBuf3);
kernel.setArg(3,glBuf3);
here is how it is executed:
cq.enqueueNDRangeKernel(kernel,referans,Global,Local);
cq.flush();
cq.finish();
here is how they are released:
cq.enqueueReleaseGLObjects(&v,0,0);
cq.finish();
Simulation iteration:
for (int i = 0; i < 200; i++)
{
GL.Finish(); // lets cl take over
//cl acquires buffers in glTest
clh.glTest(gci.buf[0], gci.buf[1], gci.buf[2], gci.buf[3]);// then computes
// then releases
Thread.Sleep(50);
glControl1.MakeCurrent();
glControl1.Invalidate();
gci.ciz(); //draw
}

Why is my C# program faster in a profiler?

I have a relatively large system (~25,000 lines so far) for monitoring radio-related devices. It shows graphs and such using the latest version of ZedGraph.
The program is coded using C# on VS2010 with Win7.
The problem is:
when I run the program from within VS, it runs slow
when I run the program from the built EXE, it runs slow
when I run the program through Performance Wizard / CPU Profiler, it runs Blazing Fast.
when I run the program from the built EXE, and then start VS and Attach a profiler to ANY OTHER PROCESS, my program speeds up!
I want the program to always run that fast!
Every project in the solution is set to RELEASE, 'Debug unmanaged code' is DISABLED, 'Define DEBUG and TRACE constants' is DISABLED, 'Optimize Code' - I tried both, 'Warning Level' - I tried both, 'Suppress JIT' - I tried both;
in short, I tried all the solutions already proposed on StackOverflow - none worked. The program is slow outside the profiler and fast inside it.
I don't think the problem is in my code, because it becomes fast if I attach the profiler to other, unrelated process as well!
Please help!
I really need it to be that fast everywhere, because it's a business critical application and performance issues are not tolerated...
UPDATES 1 - 8 follow
--------------------Update1:--------------------
The problem does not seem to be ZedGraph-related, because it still manifests after I replaced ZedGraph with my own basic drawing.
--------------------Update2:--------------------
Running the program in a Virtual machine, the program still runs slow, and running profiler from the Host machine doesn't make it fast.
--------------------Update3:--------------------
Starting screen capture to video also speeds the program up!
--------------------Update4:--------------------
If I open the Intel graphics driver settings window (this thing: http://www.intel.com/support/graphics/sb/img/resolution_new.jpg)
and just constantly hover the cursor over buttons so they glow, etc., my program speeds up!
It doesn't speed up if I run GPU-Z or Kombustor though, so there is no downclocking on the GPU - it stays at a steady 850 MHz.
--------------------Update5:--------------------
Tests on different machines:
-On my Core i5-2400S with Intel HD2000, UI runs slow and CPU usage is ~15%.
-On a colleague's Core 2 Duo with Intel G41 Express, UI runs fast, but CPU usage is ~90% (which isn't normal either)
-On Core i5-2400S with dedicated Radeon X1650, UI runs blazing fast, CPU usage is ~50%.
--------------------Update6:--------------------
A snip of code showing how I update a single graph (graphFFT is an encapsulation of ZedGraphControl for ease of use):
public void LoopDataRefresh() //executes in a new thread
{
    while (true)
    {
        while (!d.Connected)
            Thread.Sleep(1000);
        if (IsDisposed)
            return;
        //... other graphs update here
        if (signalNewFFT && PanelFFT.Visible)
        {
            signalNewFFT = false;
            #region FFT
            bool newRange = false;
            if (graphFFT.MaxY != d.fftRangeYMax)
            {
                graphFFT.MaxY = d.fftRangeYMax;
                newRange = true;
            }
            if (graphFFT.MinY != d.fftRangeYMin)
            {
                graphFFT.MinY = d.fftRangeYMin;
                newRange = true;
            }
            List<PointF> points = new List<PointF>(2048);
            int tempLength = 0;
            short[] tempData = new short[2048];
            int i = 0;
            lock (d.fftDataLock)
            {
                tempLength = d.fftLength;
                tempData = (short[])d.fftData.Clone();
            }
            foreach (short s in tempData)
                points.Add(new PointF(i++, s));
            graphFFT.SetLine("FFT", points);
            if (newRange)
                graphFFT.RefreshGraphComplete();
            else if (PanelFFT.Visible)
                graphFFT.RefreshGraph();
            #endregion
        }
        //... other graphs update here
        Thread.Sleep(5);
    }
}
SetLine is:
public void SetLine(String lineTitle, List<PointF> values)
{
    IPointListEdit ip = zgcGraph.GraphPane.CurveList[lineTitle].Points as IPointListEdit;
    int tmp = Math.Min(ip.Count, values.Count);
    int i = 0;
    while (i < tmp)
    {
        if (values[i].X > peakX)
            peakX = values[i].X;
        if (values[i].Y > peakY)
            peakY = values[i].Y;
        ip[i].X = values[i].X;
        ip[i].Y = values[i].Y;
        i++;
    }
    while (ip.Count < values.Count)
    {
        if (values[i].X > peakX)
            peakX = values[i].X;
        if (values[i].Y > peakY)
            peakY = values[i].Y;
        ip.Add(values[i].X, values[i].Y);
        i++;
    }
    while (values.Count > ip.Count)
    {
        ip.RemoveAt(ip.Count - 1);
    }
}
RefreshGraph is:
public void RefreshGraph()
{
    if (!explicidX && autoScrollFlag)
    {
        zgcGraph.GraphPane.XAxis.Scale.Max = Math.Max(peakX + grace.X, rangeX);
        zgcGraph.GraphPane.XAxis.Scale.Min = zgcGraph.GraphPane.XAxis.Scale.Max - rangeX;
    }
    if (!explicidY)
    {
        zgcGraph.GraphPane.YAxis.Scale.Max = Math.Max(peakY + grace.Y, maxY);
        zgcGraph.GraphPane.YAxis.Scale.Min = minY;
    }
    zgcGraph.Refresh();
}
--------------------Update7:--------------------
Just ran it through the ANTS profiler. It tells me that the ZedGraph refresh counts when the program is fast are precisely two times higher compared to when it's slow.
Here are the screenshots:
I find it VERY strange that, considering the small difference in the length of the sections, performance differs by exactly a factor of two.
Also, I updated the GPU driver, that didn't help.
--------------------Update8:--------------------
Unfortunately, for a few days now, I'm unable to reproduce the issue... I'm getting constant acceptable speed (which still appears a bit slower than what I had in the profiler two weeks ago) that isn't affected by any of the factors that used to affect it two weeks ago - profiler, video capturing or the GPU driver window. I still have no explanation of what was causing it...
Luaan posted the solution in the comments above: it's the system-wide timer resolution. The default resolution is 15.6 ms; the profiler sets it to 1 ms.
I had the exact same problem, very slow execution that would speed up when the profiler was opened. The problem went away on my PC but popped back up on other PCs seemingly at random. We also noticed the problem disappeared when running a Join Me window in Chrome.
My application transmits a file over a CAN bus. The app loads a CAN message with eight bytes of data, transmits it and waits for an acknowledgment. With the timer set to 15.6ms each round trip took exactly 15.6ms and the entire file transfer would take about 14 minutes. With the timer set to 1ms round trip time varied but would be as low as 4ms and the entire transfer time would drop to less than two minutes.
You can verify your system timer resolution as well as find out which program increased the resolution by opening a command prompt as administrator and entering:
powercfg -energy -duration 5
The output file will have the following in it somewhere:
Platform Timer Resolution:Platform Timer Resolution
The default platform timer resolution is 15.6ms (15625000ns) and should be used whenever the system is idle. If the timer resolution is increased, processor power management technologies may not be effective. The timer resolution may be increased due to multimedia playback or graphical animations.
Current Timer Resolution (100ns units) 10000
Maximum Timer Period (100ns units) 156001
My current resolution is 1 ms (10,000 units of 100nS) and is followed by a list of the programs that requested the increased resolution.
This information as well as more detail can be found here: https://randomascii.wordpress.com/2013/07/08/windows-timer-resolution-megawatts-wasted/
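Not from the original answer, but a quick way to observe the effect from C# is to time Thread.Sleep(1), which cannot wake up more often than the system timer ticks, so the average sleep duration approximates the current resolution:
using System;
using System.Diagnostics;
using System.Threading;

class SleepGranularity
{
    static void Main()
    {
        const int iterations = 100;
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            Thread.Sleep(1);
        sw.Stop();
        // roughly 15-16 ms per sleep at the default 15.6 ms resolution, ~1-2 ms at 1 ms
        Console.WriteLine("average Sleep(1) took {0:F2} ms", sw.Elapsed.TotalMilliseconds / iterations);
    }
}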
Here is some code to increase the timer resolution (originally posted as the answer to this question: how to set timer resolution from C# to 1 ms?):
public static class WinApi
{
    /// <summary>TimeBeginPeriod(). See the Windows API documentation for details.</summary>
    [System.Diagnostics.CodeAnalysis.SuppressMessage("Microsoft.Interoperability", "CA1401:PInvokesShouldNotBeVisible"), System.Diagnostics.CodeAnalysis.SuppressMessage("Microsoft.Security", "CA2118:ReviewSuppressUnmanagedCodeSecurityUsage"), SuppressUnmanagedCodeSecurity]
    [DllImport("winmm.dll", EntryPoint = "timeBeginPeriod", SetLastError = true)]
    public static extern uint TimeBeginPeriod(uint uMilliseconds);

    /// <summary>TimeEndPeriod(). See the Windows API documentation for details.</summary>
    [System.Diagnostics.CodeAnalysis.SuppressMessage("Microsoft.Interoperability", "CA1401:PInvokesShouldNotBeVisible"), System.Diagnostics.CodeAnalysis.SuppressMessage("Microsoft.Security", "CA2118:ReviewSuppressUnmanagedCodeSecurityUsage"), SuppressUnmanagedCodeSecurity]
    [DllImport("winmm.dll", EntryPoint = "timeEndPeriod", SetLastError = true)]
    public static extern uint TimeEndPeriod(uint uMilliseconds);
}
Use it like this to increase resolution: WinApi.TimeBeginPeriod(1);
And like this to return to the default: WinApi.TimeEndPeriod(1);
The parameter passed to TimeEndPeriod() must match the parameter that was passed to TimeBeginPeriod().
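One way to keep the two calls paired is a small disposable scope (a sketch, not from the original answer, reusing the WinApi class above):
using System;

public sealed class TimerResolutionScope : IDisposable
{
    private readonly uint _milliseconds;

    public TimerResolutionScope(uint milliseconds)
    {
        _milliseconds = milliseconds;
        WinApi.TimeBeginPeriod(milliseconds);
    }

    public void Dispose()
    {
        // TimeEndPeriod must receive the same value that was passed to TimeBeginPeriod.
        WinApi.TimeEndPeriod(_milliseconds);
    }
}

// Usage: the resolution is only raised for the duration of the block.
// using (new TimerResolutionScope(1))
// {
//     TransferFileOverCanBus();   // hypothetical work that benefits from 1 ms granularity
// }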
There are situations when slowing down a thread can speed up other threads significantly, usually when one thread is polling or locking some common resource frequently.
For instance (this is a Windows Forms example), when the main thread is checking overall progress in a tight loop instead of using a timer:
private void SomeWork() {
    // start the worker thread here
    while(!PollDone()) {
        progressBar1.Value = PollProgress();
        Application.DoEvents(); // keep the GUI responsive
    }
}
Slowing it down could improve performance:
private void SomeWork() {
    // start the worker thread here
    while(!PollDone()) {
        progressBar1.Value = PollProgress();
        System.Threading.Thread.Sleep(300); // give the polled thread some time to work instead of responding to your poll
        Application.DoEvents(); // keep the GUI responsive
    }
}
To do it correctly, one should avoid using the DoEvents call altogether:
private Timer tim = new Timer(){ Interval=300 };
private void SomeWork() {
    // start the worker thread here
    tim.Tick += tim_Tick;
    tim.Start();
}
private void tim_Tick(object sender, EventArgs e){
    tim.Enabled = false; // prevent timer messages from piling up
    if(PollDone()){
        tim.Tick -= tim_Tick;
        return;
    }
    progressBar1.Value = PollProgress();
    tim.Enabled = true;
}
Calling Application.DoEvents() can potentially cause a lot of headaches when GUI elements have not been disabled and the user kicks off other events, or the same event a second time, causing re-entrant calls that queue the first action behind the new one - but I'm going off topic.
That example is probably too WinForms-specific, so I'll try a more general one. If you have a thread that is filling a buffer that is processed by other threads, be sure to leave some System.Threading.Thread.Sleep() slack in the loop to allow the other threads to do some processing before checking whether the buffer needs to be filled again:
public class WorkItem {
    // populate with something useful
}
public static object WorkItemsSyncRoot = new object();
public static Queue<WorkItem> workitems = new Queue<WorkItem>();
public void FillBuffer() {
    while(!done) {
        lock(WorkItemsSyncRoot) {
            if(workitems.Count < 30) {
                workitems.Enqueue(new WorkItem(/* load a file or something */ ));
            }
        }
    }
}
The worker threads will have difficulty obtaining anything from the queue since it's constantly being locked by the filling thread. Adding a Sleep() (outside the lock) could significantly speed up the other threads:
public void FillBuffer() {
    while(!done) {
        lock(WorkItemsSyncRoot) {
            if(workitems.Count < 30) {
                workitems.Enqueue(new WorkItem(/* load a file or something */ ));
            }
        }
        System.Threading.Thread.Sleep(50);
    }
}
Hooking up a profiler could in some cases have the same effect as the sleep function.
I'm not sure if I've given representative examples (it's quite hard to come up with something simple), but I guess the point is clear: putting Sleep() in the correct place can help improve the flow of the other threads.
---------- Edit after Update7 -------------
I'd remove that LoopDataRefresh() thread altogether. Rather put a timer in your window with an interval of at least 20 (which would be 50 frames a second if none were skipped):
private void tim_Tick(object sender, EventArgs e) {
    tim.Enabled = false; // skip frames that come while we're still drawing
    if(IsDisposed) {
        tim.Tick -= tim_Tick;
        return;
    }
    // Your code follows, I've tried to optimize it here and there, but no guarantee that it compiles or works, not tested at all
    if(signalNewFFT && PanelFFT.Visible) {
        signalNewFFT = false;
        #region FFT
        bool newRange = false;
        if(graphFFT.MaxY != d.fftRangeYMax) {
            graphFFT.MaxY = d.fftRangeYMax;
            newRange = true;
        }
        if(graphFFT.MinY != d.fftRangeYMin) {
            graphFFT.MinY = d.fftRangeYMin;
            newRange = true;
        }
        int tempLength = 0;
        short[] tempData;
        int i = 0;
        lock(d.fftDataLock) {
            tempLength = d.fftLength;
            tempData = (short[])d.fftData.Clone();
        }
        graphFFT.SetLine("FFT", tempData);
        if(newRange) graphFFT.RefreshGraphComplete();
        else if(PanelFFT.Visible) graphFFT.RefreshGraph();
        #endregion
        // End of your code
        tim.Enabled = true; // Drawing is done, allow new frames to come in.
    }
}
Here's the optimized SetLine() which no longer takes a list of points but the raw data:
public class GraphFFT {
    public void SetLine(String lineTitle, short[] values) {
        IPointListEdit ip = zgcGraph.GraphPane.CurveList[lineTitle].Points as IPointListEdit;
        int tmp = Math.Min(ip.Count, values.Length);
        int i = 0;
        peakX = values.Length;
        while(i < tmp) {
            if(values[i] > peakY) peakY = values[i];
            ip[i].X = i;
            ip[i].Y = values[i];
            i++;
        }
        while(ip.Count < values.Length) {
            if(values[i] > peakY) peakY = values[i];
            ip.Add(i, values[i]);
            i++;
        }
        while(values.Length > ip.Count) {
            ip.RemoveAt(ip.Count - 1);
        }
    }
}
I hope you get that working. As I commented before, I haven't had the chance to compile or check it, so there could be some bugs in there. There's more to optimize, but those gains should be marginal compared to the boost from skipping frames and only collecting data when there is time to actually draw the frame before the next one comes in.
If you closely study the graphs in the video at iZotope, you'll notice that they too are skipping frames, and sometimes are a bit jumpy. That's not bad at all, it's a trade-off you make between the processing power of the foreground thread and the background workers.
If you really want the drawing to be done in a separate thread, you'll have to draw the graph to a bitmap (calling Draw() and passing the bitmaps device context). Then pass the bitmap on to the main thread and have it update. That way you do lose the convenience of the designer and property grid in your IDE, but you can make use of otherwise vacant processor cores.
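Not tested and not part of the original answer, but a minimal sketch of that hand-off could look like the code below. It assumes ZedGraph's GraphPane.Draw(Graphics) renders the pane off-screen (GetImage() would be an alternative) and uses a hypothetical PictureBox as the presentation surface.
using System;
using System.Drawing;
using System.Windows.Forms;
using ZedGraph;

public class OffThreadRenderer
{
    private readonly PictureBox _target;   // hypothetical presentation surface on the form

    public OffThreadRenderer(PictureBox target)
    {
        _target = target;
    }

    // Called from the background data thread: render the pane into a bitmap.
    public void Render(GraphPane pane, int width, int height)
    {
        var bmp = new Bitmap(width, height);
        using (var g = Graphics.FromImage(bmp))
        {
            pane.Draw(g);   // assumed ZedGraph API for off-screen drawing
        }

        // Hand the finished frame to the UI thread; only the swap happens there.
        _target.BeginInvoke(new Action(() =>
        {
            var old = _target.Image;
            _target.Image = bmp;
            if (old != null) old.Dispose();
        }));
    }
}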
---------- edit answer to remarks --------
Yes there is a way to tell what calls what. Look at your first screen-shot, you have selected the "call tree" graph. Each next line jumps in a bit (it's a tree-view, not just a list!). In a call-graph, each tree-node represents a method that has been called by its parent tree-node (method).
In the first image, WndProc was called about 1800 times; it handled 872 messages, of which 62 triggered ZedGraphControl.OnPaint() (which in turn accounts for 53% of the main thread's total time).
The reason you don't see another root node is that the 3rd dropdown box has "[604] Mian Thread" selected, which I didn't notice before.
As for the more fluent graphs, I have second thoughts on that now after looking more closely at the screenshots. The main thread has clearly received more (double) update messages, and the CPU still has some headroom.
It looks like the threads are out-of-sync and in-sync at different times, where the update messages arrive just too late (when WndProc was done and went to sleep for a while), and then suddenly arrive in time for a while. I'm not very familiar with ANTS, but does it have a side-by-side thread timeline including sleep time? You should be able to see what's going on in such a view. Microsoft's threads view tool would come in handy for this.
While I have never heard of or seen something similar, I'd recommend the common-sense approach of commenting out sections of code (or injecting returns at the tops of functions) until you find the logic that's producing the side effect. You know your code and likely have an educated guess where to start chopping. Otherwise, chop out almost everything as a sanity test and start adding blocks back. I'm often amazed how fast one can find those seemingly impossible bugs this way. Once you find the related code, you will have more clues to solve your issue.
There is an array of potential causes. Without claiming completeness, here is how you could approach your search for the actual cause:
Environment variables: the timer issue in another answer is only one example. There might be modifications to the Path and to other variables, and new variables could be set by the profiler. Write the current environment variables to a file and compare both configurations. Try to find suspicious entries and unset them one by one (or in combinations) until you get the same behavior in both cases.
Processor frequency: this can easily happen on laptops. The power-saving system may set the processor frequency to a lower value, and some apps can 'wake' the system up, increasing the frequency. Check this via Performance Monitor (perfmon).
If the app runs slower than it could, there must be some inefficient resource utilization. Use the profiler to investigate this! You can attach the profiler to the (slow) running process to see which resources are under-/over-utilized. Mostly, there are two major categories of causes for slow execution: memory-bound and compute-bound execution. Both can give more insight into what is triggering the slowdown.
If, however, your app actually changes its efficiency when a profiler is attached, you can still use your favorite monitoring app to see which performance indicators actually change. Again, perfmon is your friend.
If you have a method which throws a lot of exceptions, it can run slowly in debug mode and fast in CPU Profiling mode.
As detailed here, debug performance can be improved by using the DebuggerNonUserCode attribute. For example:
[DebuggerNonUserCode]
public static bool IsArchive(string filename)
{
    bool result = false;
    try
    {
        //this calls an external library, which throws an exception if the file is not an archive
        result = ExternalLibrary.IsArchive(filename);
    }
    catch
    {
    }
    return result;
}
