Is there a way to check a thread's stack size in C#?
This is a case of "if you have to ask, you can't afford it" (Raymond Chen said it first). If the code depends on there being enough stack space to the extent that it has to check first, it might be worthwhile to refactor it to use an explicit Stack<T> object instead. There's merit in John's comment about using a profiler instead.
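As a minimal sketch of that refactoring (Node, Visit, and the traversal are hypothetical names, just to show the shape), a recursive walk becomes a loop whose depth is bounded by the heap rather than the call stack:

using System;
using System.Collections.Generic;

class Node
{
    public string Name = "";
    public List<Node> Children = new List<Node>();
}

class Traversal
{
    // Recursive version, limited by the thread's stack:
    //   static void Visit(Node n) { Console.WriteLine(n.Name); foreach (var c in n.Children) Visit(c); }

    // Iterative equivalent using an explicit Stack<T>:
    static void Visit(Node root)
    {
        var pending = new Stack<Node>();
        pending.Push(root);

        while (pending.Count > 0)
        {
            Node current = pending.Pop();
            Console.WriteLine(current.Name);

            foreach (Node child in current.Children)
                pending.Push(child);
        }
    }
}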
That said, it turns out that there is a way to estimate the remaining stack space. It's not precise, but it's useful enough for the purpose of evaluating how close to the bottom you are. The following is heavily based on an excellent article by Joe Duffy.
We know (or will make the assumptions) that:
Stack memory is allocated in a contiguous block.
The stack grows 'downwards', from higher addresses towards lower addresses.
The system needs some space near the bottom of the allocated stack space to allow graceful handling of out-of-stack exceptions. We don't know the exact reserved space, but we'll attempt to conservatively bound it.
With these assumptions, we could P/Invoke VirtualQuery to obtain the start address of the allocated stack, and subtract it from the address of some stack-allocated variable (obtained with unsafe code). Further subtracting our estimate of the space the system needs at the bottom of the stack gives us an estimate of the available space.
The code below demonstrates this by invoking a recursive function and writing out the remaining estimated stack space, in bytes, as it goes:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Runtime.InteropServices;

namespace ConsoleApplication1
{
    class Program
    {
        private struct MEMORY_BASIC_INFORMATION
        {
            public uint BaseAddress;
            public uint AllocationBase;
            public uint AllocationProtect;
            public uint RegionSize;
            public uint State;
            public uint Protect;
            public uint Type;
        }

        private const uint STACK_RESERVED_SPACE = 4096 * 16;

        [DllImport("kernel32.dll")]
        private static extern int VirtualQuery(
            IntPtr lpAddress,
            ref MEMORY_BASIC_INFORMATION lpBuffer,
            int dwLength);

        private unsafe static uint EstimatedRemainingStackBytes()
        {
            MEMORY_BASIC_INFORMATION stackInfo = new MEMORY_BASIC_INFORMATION();
            IntPtr currentAddr = new IntPtr((uint) &stackInfo - 4096);

            VirtualQuery(currentAddr, ref stackInfo, sizeof(MEMORY_BASIC_INFORMATION));
            return (uint) currentAddr.ToInt64() - stackInfo.AllocationBase - STACK_RESERVED_SPACE;
        }

        static void SampleRecursiveMethod(int remainingIterations)
        {
            if (remainingIterations <= 0) { return; }
            Console.WriteLine(EstimatedRemainingStackBytes());
            SampleRecursiveMethod(remainingIterations - 1);
        }

        static void Main(string[] args)
        {
            SampleRecursiveMethod(100);
            Console.ReadLine();
        }
    }
}
And here are the first 10 lines of output (Intel x64, .NET 4.0, debug). Given the 1 MB default stack size, the counts appear plausible.
969332
969256
969180
969104
969028
968952
968876
968800
968724
968648
For brevity, the code above assumes a page size of 4 KB. While that holds true for x86 and x64, it might not be correct for other supported CLR architectures. You could P/Invoke GetSystemInfo to obtain the machine's page size (the dwPageSize field of the SYSTEM_INFO struct).
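A sketch of that lookup (the struct layout below mirrors the Win32 SYSTEM_INFO definition):

using System;
using System.Runtime.InteropServices;

static class PageSize
{
    [StructLayout(LayoutKind.Sequential)]
    private struct SYSTEM_INFO
    {
        public ushort wProcessorArchitecture;
        public ushort wReserved;
        public uint dwPageSize;
        public IntPtr lpMinimumApplicationAddress;
        public IntPtr lpMaximumApplicationAddress;
        public IntPtr dwActiveProcessorMask;
        public uint dwNumberOfProcessors;
        public uint dwProcessorType;
        public uint dwAllocationGranularity;
        public ushort wProcessorLevel;
        public ushort wProcessorRevision;
    }

    [DllImport("kernel32.dll")]
    private static extern void GetSystemInfo(out SYSTEM_INFO lpSystemInfo);

    public static uint Get()
    {
        GetSystemInfo(out SYSTEM_INFO si);
        return si.dwPageSize; // 4096 on x86/x64 Windows
    }
}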
Note that this technique isn't particularly portable, nor is it future proof. The use of pinvoke limits the utility of this approach to Windows hosts. The assumptions about the continuity and direction of growth of the CLR stack may hold true for the present Microsoft implementations. However, my (possibly limited) reading of the CLI standard (common language infrastructure, PDF, a long read) does not appear to demand as much of thread stacks. As far as the CLI is concerned, each method invocation requires a stack frame; it couldn't care less, however, if stacks grow upward, if local variable stacks are separate from return value stacks, or if stack frames are allocated on the heap.
I'm adding this answer for my future reference. :-)
Oren's answer answers the OP's question (as refined by the comment), but it does not indicate how much memory was actually allocated for the stack to begin with. To get that, you can use Michael Ganß's answer here, which I've updated below using some more recent C# syntax.
using System;
using System.Runtime.InteropServices;
using System.Threading;

public static class Extensions
{
    public static void StartAndJoin(this Thread thread, string header)
    {
        thread.Start(header);
        thread.Join();
    }
}

class Program
{
    // The native limits are ULONG_PTRs, so marshal them as UIntPtr
    // to stay correct in both 32- and 64-bit processes.
    [DllImport("kernel32.dll")]
    static extern void GetCurrentThreadStackLimits(out UIntPtr lowLimit, out UIntPtr highLimit);

    static void WriteAllocatedStackSize(object header)
    {
        GetCurrentThreadStackLimits(out var low, out var high);

        Console.WriteLine($"{header,-19}: {(high.ToUInt64() - low.ToUInt64()) / 1024,4} KB");
    }

    static void Main(string[] args)
    {
        WriteAllocatedStackSize("Main Stack Size");

        new Thread(WriteAllocatedStackSize, 1024 * 0).StartAndJoin("Default Stack Size");
        new Thread(WriteAllocatedStackSize, 1024 * 128).StartAndJoin(" 128 KB Stack Size");
        new Thread(WriteAllocatedStackSize, 1024 * 256).StartAndJoin(" 256 KB Stack Size");
        new Thread(WriteAllocatedStackSize, 1024 * 512).StartAndJoin(" 512 KB Stack Size");
        new Thread(WriteAllocatedStackSize, 1024 * 1024).StartAndJoin(" 1 MB Stack Size");
        new Thread(WriteAllocatedStackSize, 1024 * 2048).StartAndJoin(" 2 MB Stack Size");
        new Thread(WriteAllocatedStackSize, 1024 * 4096).StartAndJoin(" 4 MB Stack Size");
        new Thread(WriteAllocatedStackSize, 1024 * 8192).StartAndJoin(" 8 MB Stack Size");
    }
}
What is interesting (and the reason I'm posting this) is the output when run using different configurations. For reference, I'm running this on a Windows 10 Enterprise (Build 1709) 64-bit OS using .NET Framework 4.7.2 (if it matters).
Release|Any CPU (Prefer 32-bit checked), Release|Any CPU (Prefer 32-bit unchecked), and Release|x86 (all three produce the same output):
Main Stack Size : 1024 KB
Default Stack Size : 1024 KB // default stack size = 1 MB
128 KB Stack Size : 256 KB // minimum stack size = 256 KB
256 KB Stack Size : 256 KB
512 KB Stack Size : 512 KB
1 MB Stack Size : 1024 KB
2 MB Stack Size : 2048 KB
4 MB Stack Size : 4096 KB
8 MB Stack Size : 8192 KB
Release|x64:
Main Stack Size : 4096 KB
Default Stack Size : 4096 KB // default stack size = 4 MB
128 KB Stack Size : 256 KB // minimum stack size = 256 KB
256 KB Stack Size : 256 KB
512 KB Stack Size : 512 KB
1 MB Stack Size : 1024 KB
2 MB Stack Size : 2048 KB
4 MB Stack Size : 4096 KB
8 MB Stack Size : 8192 KB
There's nothing particularly shocking about these results, given that they are consistent with the documentation. What was a little surprising, though, was that the default stack size is 1 MB when running in the Release|Any CPU configuration with the Prefer 32-bit option unchecked, meaning it runs as a 64-bit process on a 64-bit OS. I would have assumed the default stack size in this case would be 4 MB, like in the Release|x64 configuration.
In any case, I hope this might be of use to someone who lands here wanting to know about the stack size of a .NET thread, like I did.
Related
In my UWP app I need to run a critical section for a few seconds, during which I need to be sure that the garbage collector is not invoked. So, I call:
const long TOTAL_GC_ALLOWED = 1024L * 1024 * 240; // 240 MB; the max allowed seems to be 244 MB for 64-bit workstations

try
{
    bool res = GC.TryStartNoGCRegion(TOTAL_GC_ALLOWED);
    if (!res)
        s_log.ErrorFormat("Cannot allocate noGC Region!");
}
catch (Exception)
{
    s_log.WarnFormat("Cannot start NoGCRegion");
}
Unfortunately, even when GC.TryStartNoGCRegion() returns true, I still see the same number of gen 2 garbage collections as when I don't call this method at all.
Please also note that I am testing on a machine with 16 GB of RAM, of which only 9 GB were in use by the whole OS during my tests.
What am I doing wrong?
How can I suppress the GC (for a limited amount of time)?
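For reference, a minimal sketch of the begin/end pattern the API expects (the wrapper names here are mine, not from the question). One common cause of collections reappearing is allocating past the budget inside the region, which takes the runtime out of no-GC mode early:

using System;
using System.Runtime;

static class NoGcDemo
{
    public static void RunCriticalSection(Action criticalSection)
    {
        // The budget must cover ALL allocations made inside the region.
        const long budget = 1024L * 1024 * 240;

        if (!GC.TryStartNoGCRegion(budget))
            return; // could not reserve the memory up front

        try
        {
            criticalSection(); // must allocate no more than 'budget' bytes
        }
        finally
        {
            // EndNoGCRegion throws if the region already ended on its own,
            // so check that we are actually still inside it.
            if (GCSettings.LatencyMode == GCLatencyMode.NoGCRegion)
                GC.EndNoGCRegion();
        }
    }
}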
At my hosting company I have two more or less identical .NET applications. The two applications share the same pool, which has a memory limit of 300 MB.
I can refresh the start page of one of the applications about 10 times; then I get an out-of-memory exception and the pool crashes.
In my application I print out these memory values:
PrivateMemorySize64: 197 804.00 kb (193.00 mb)
PeakPagedMemorySize64: 247 220.00 kb (241.00 mb)
VirtualMemorySize64: 27 327 128.00 kb (26 686.00 mb)
PagedMemorySize64: 197 804.00 kb (193.00 mb)
PagedSystemMemorySize64: 415.00 kb (0.00 mb)
PeakWorkingSet64: 109 196.00 kb (106.00 mb)
WorkingSet64: 61 196.00 kb (59.00 mb)
GC.GetTotalMemory(true): 2 960.00 kb (2.00 mb)
GC.GetTotalMemory(false): 2 968.00 kb (2.00 mb)
I have read and read and read, and watched videos about memory profiling, but I can't find any problem when I profile my application.
I use ANTS Memory Profiler 8 and get this result when I refresh the start page once after the build:
Looking at the Summary, .NET is using 41.65 MB of the 135.8 MB total private bytes allocated to the application.
These values get bigger with each refresh. Is that normal? After 8 refreshes I get this:
.NET is using 56.11 MB of the 153 MB total private bytes allocated to the application.
Where should I start? What could be using so much memory? Is 300 MB too low?
This is almost certainly due to a memory leak in your code, likely in the form of not disposing/closing connections to something like a queue or database. Profiling aside, review your code and ensure that you're closing/disposing all appropriate resources; your problem should then resolve itself.
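For example (SqlConnection here is just a stand-in for whatever resource type is actually leaking), using blocks guarantee disposal even when an exception is thrown:

using System.Data.SqlClient; // or whichever provider you actually use

static class Db
{
    public static object QueryScalar(string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand("SELECT 1", connection))
        {
            connection.Open();
            return command.ExecuteScalar();
        } // Dispose (and therefore Close) runs here, even on an exception.
    }
}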
There were some DB connections that were not being disposed. I also have a class which removes ETags, like this:
public class CustomHeaderModule : IHttpModule
{
    public void Dispose() { }

    public void Init(HttpApplication context)
    {
        context.PreSendRequestHeaders += OnPreSendRequestHeaders;
    }

    void OnPreSendRequestHeaders(object sender, EventArgs e)
    {
        HttpContext.Current.Response.Headers.Remove("ETag");
    }
}
Must I remove, in Dispose, the event handler I add in the Init function? Or will the GC take care of that?
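If you do want to detach it explicitly, a minimal sketch (keeping a reference to the HttpApplication so Dispose can unsubscribe the same handler):

using System;
using System.Web;

public class CustomHeaderModule : IHttpModule
{
    private HttpApplication _context;

    public void Init(HttpApplication context)
    {
        _context = context;
        _context.PreSendRequestHeaders += OnPreSendRequestHeaders;
    }

    public void Dispose()
    {
        if (_context != null)
            _context.PreSendRequestHeaders -= OnPreSendRequestHeaders;
    }

    void OnPreSendRequestHeaders(object sender, EventArgs e)
    {
        HttpContext.Current.Response.Headers.Remove("ETag");
    }
}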
And I have a lot of this:
Task.Factory.StartNew(() =>
{
Add(...);
});
But I don't dispose of them anywhere in my code. Will the GC handle that, or should I do it some other way?
I was trying to figure out hands-on how tail calls are handled by the C# compiler.
(Answer: they're not. But the 64-bit JIT(s) WILL do TCE (tail call elimination). Restrictions apply.)
So I wrote a small test using a recursive call that prints how many times it gets called before the StackOverflowException kills the process.
using System;

class Program
{
    static void Main(string[] args)
    {
        Rec();
    }

    static int sz = 0;
    static Random r = new Random();

    static void Rec()
    {
        sz++;

        //uncomment for faster, more imprecise runs
        //if (sz % 100 == 0)
        {
            //some code to keep this method from being inlined
            var zz = r.Next();
            Console.Write("{0} Random: {1}\r", sz, zz);
        }

        //uncommenting this stops TCE from happening
        //else
        //{
        //    Console.Write("{0}\r", sz);
        //}

        Rec();
    }
}
Right on cue, the program ends with a StackOverflowException under any of:
'Optimize build' OFF (either Debug or Release)
Target: x86
Target: AnyCPU + "Prefer 32 bit" (this is new in VS 2012 and the first time I saw it. More here.)
Some seemingly innocuous branch in the code (see commented 'else' branch).
Conversely, with 'Optimize build' ON and (Target = x64, or AnyCPU with 'Prefer 32-bit' OFF, on a 64-bit CPU), TCE happens and the counter keeps spinning up forever (OK, it arguably spins down each time its value overflows).
But I noticed a behaviour I can't explain in the StackOverflowException case: it never (?) happens at exactly the same stack depth. Here are the outputs of a few 32-bit runs, Release build:
51600 Random: 1778264579
Process is terminated due to StackOverflowException.
51599 Random: 1515673450
Process is terminated due to StackOverflowException.
51602 Random: 1567871768
Process is terminated due to StackOverflowException.
51535 Random: 2760045665
Process is terminated due to StackOverflowException.
And Debug build:
28641 Random: 4435795885
Process is terminated due to StackOverflowException.
28641 Random: 4873901326 //never say never
Process is terminated due to StackOverflowException.
28623 Random: 7255802746
Process is terminated due to StackOverflowException.
28669 Random: 1613806023
Process is terminated due to StackOverflowException.
The stack size is constant (defaults to 1 MB). The stack frames' sizes are constant.
So then, what can account for the (sometimes non-trivial) variation of stack depth when the StackOverflowException hits?
UPDATE
Hans Passant raises the issue of Console.WriteLine touching P/Invoke, interop and possibly non-deterministic locking.
So I simplified the code to this:
class Program
{
    static void Main(string[] args)
    {
        Rec();
    }

    static int sz = 0;

    static void Rec()
    {
        sz++;
        Rec();
    }
}
I ran it in Release/32bit/Optimization ON without a debugger. When the program crashes, I attach the debugger and check the value of the counter.
And it still isn't the same on several runs. (Or my test is flawed.)
UPDATE: Closure
As suggested by fejesjoco, I looked into ASLR (Address space layout randomization).
It's a security technique that makes it hard for buffer overflow attacks to find the precise location of (e.g.) specific system calls, by randomizing various things in the process address space, including the stack position and, apparently, its size.
The theory sounds good. Let's put it into practice!
In order to test this, I used a Microsoft tool specific for the task: EMET or The Enhanced Mitigation Experience Toolkit. It allows setting the ASLR flag (and a lot more) on a system- or process-level.
(There is also a system-wide, registry hacking alternative that I didn't try)
In order to verify the effectiveness of the tool, I also discovered that Process Explorer duly reports the status of the ASLR flag in the 'Properties' page of the process. Never saw that until today :)
Theoretically, EMET can (re)set the ASLR flag for a single process. In practice, it didn't seem to change anything.
However, I disabled ASLR for the entire system and (one reboot later) I could finally verify that indeed, the SO exception now always happens at the same stack depth.
BONUS
ASLR-related, in older news: How Chrome got pwned
I think it may be ASLR at work. You can turn off DEP to test this theory.
See here for a C# utility class to check memory information: https://stackoverflow.com/a/8716410/552139
By the way, with this tool, I found that the difference between the maximum and minimum stack size is around 2 KiB, which is half a page. That's weird.
Update: OK, now I know I'm right. I followed up on the half-page theory, and found this doc that examines the ASLR implementation in Windows: http://www.symantec.com/avcenter/reference/Address_Space_Layout_Randomization.pdf
Quote:
Once the stack has been placed, the initial stack pointer is further
randomized by a random decremental amount. The initial offset is
selected to be up to half a page (2,048 bytes)
And this is the answer to your question. ASLR takes away between 0 and 2048 bytes of your initial stack randomly.
This C++11 code prints the offset of the stack within the start page:
#include <Windows.h>
#include <iostream>
#include <atomic> // required for atomic<> and memory_order_relaxed

using namespace std;

#if !defined(__llvm__)
    #pragma warning(disable: 6387) // handle could be NULL
    #pragma warning(disable: 6001) // using uninitialized memory
#endif

int main()
{
    SYSTEM_INFO si;
    GetSystemInfo( &si );
    static atomic<size_t> aPageSize = si.dwPageSize;

    auto theThread = []( LPVOID ) -> DWORD
    {
        size_t pageSize = aPageSize.load( memory_order_relaxed );
        return (DWORD)(pageSize - ((size_t)&pageSize & pageSize - 1));
    };

    constexpr unsigned ROUNDS = 10;
    for( unsigned r = ROUNDS; r--; )
    {
        HANDLE hThread = CreateThread( nullptr, 0, theThread, nullptr, 0, nullptr );
        WaitForSingleObject( hThread, INFINITE );
        DWORD dwExit;
        GetExitCodeThread( hThread, &dwExit );
        CloseHandle( hThread );
        cout << dwExit << endl;
    }
}
Linux doesn't randomize the lower 12 bits by default:
#include <iostream>
#include <atomic>
#include <pthread.h>
#include <unistd.h>

using namespace std;

int main()
{
    static atomic<size_t> aPageSize = sysconf( _SC_PAGESIZE );

    auto theThread = []( void *threadParam ) -> void *
    {
        size_t pageSize = aPageSize.load( memory_order_relaxed );
        return (void *)(pageSize - ((size_t)&pageSize & pageSize - 1));
    };

    constexpr unsigned ROUNDS = 10;
    for( unsigned r = ROUNDS; r--; )
    {
        pthread_t pThread;
        pthread_create( &pThread, nullptr, theThread, nullptr );
        void *retVal;
        pthread_join( pThread, &retVal );
        cout << (size_t)retVal << endl;
    }
}
Randomizing the thread stack's starting address within a page doesn't make sense from a security standpoint, though. On a 64-bit system with a 47-bit userspace (newer Intel CPUs even offer a 55-bit userspace) you still have about 35 bits to randomize, i.e. roughly 34 billion possible stack placements. And it doesn't make sense from a performance standpoint either, since cache-line aliasing on SMT systems can't happen: today's caches have enough associativity.
Change r.Next() to r.Next(10). StackOverflowExceptions should then occur at the same depth.
The generated strings should consume the same memory, because they have the same size: r.Next(10).ToString().Length == 1 always, whereas r.Next().ToString().Length is variable.
The same applies if you use r.Next(100, 1000), whose results are always three digits long.
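Applied to the repro above, the only change is in the branch that prints (a sketch):

//some code to keep this method from being inlined
var zz = r.Next(10); // single digit, so the formatted output has constant length
Console.Write("{0} Random: {1}\r", sz, zz);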
Will MemoryMappedFile.CreateViewStream(0, len) allocate a managed block of memory of size len, or will it allocate a smaller buffer that acts as a sliding window over the unmanaged data?
I wonder because I aim to replace an intermediate buffer for deserialization that is a MemoryStream today, which is giving me trouble for large datasets, both because of the size of the buffer and because of LOH fragmentation.
If the view stream's internal buffer ends up the same size, making this switch wouldn't make sense.
Edit:
In a quick test I found these numbers when comparing the MemoryStream to the memory-mapped file, reading from GC.GetTotalMemory(true)/1024 and Process.GetCurrentProcess().VirtualMemorySize64/1024.
Allocating a 1 GB memory stream:
                   Managed         Virtual
Initial:             81 kB      190 896 kB
After alloc:  1 024 084 kB    1 244 852 kB
As expected, a gig of both managed and virtual memory.
Now, for the MemoryMappedFile:
                            Managed        Virtual
Initial:                      81 kB     189 616 kB
MMF allocated:                84 kB     189 684 kB
1 GB view stream alloc'd:     84 kB   1 213 368 kB
View stream disposed:         84 kB     190 964 kB
So, by this not very scientific test, my assumption is that the view stream uses only unmanaged memory. Correct?
An MMF like that doesn't solve your problem. A program bombs on OOM because there isn't a hole in the virtual memory address space big enough to fit the allocation. You are still consuming VM address space with an MMF, as you can tell.
Using a small sliding view would be a workaround, but that isn't any different from writing to a file. Which is what an MMF does when you remap the view: it needs to flush the dirty pages to disk. Simply streaming to a FileStream is the proper workaround. That still uses RAM; the file system cache helps make writing fast. If you've got a gigabyte of RAM available, not hard to come by these days, then writing to a FileStream is just a memory-to-memory copy. Very fast, 5 gigabytes/sec and up. The file gets written in a lazy fashion in the background.
Trying too hard to keep data in memory is unproductive in Windows. Private data in memory is backed by the paging file and will be written to that file when Windows needs the RAM for other processes. And read back when you access it again. That's slow, the more memory you use, the worse it gets. Like any demand-paged virtual memory operating system, the distinction between disk and memory is a small one.
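A minimal sketch of that FileStream workaround (the path and buffer size are arbitrary; 81,920-byte chunks also stay below the 85,000-byte LOH threshold):

using System.IO;

static class SerializeToDisk
{
    // Streams 'source' straight to a file instead of buffering the whole
    // payload in a MemoryStream.
    public static void Write(Stream source, string path)
    {
        using (var target = new FileStream(path, FileMode.Create, FileAccess.Write))
        {
            source.CopyTo(target, 81920);
        } // Lazily flushed via the file system cache; no gigabyte buffer needed.
    }
}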
Given the example at http://msdn.microsoft.com/en-us/library/system.io.memorymappedfiles.memorymappedfile.aspx, it seems to me that you get a sliding window; at least, that is how I interpret the example.
Here is the example, for convenience:
using System;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Runtime.InteropServices;

class Program
{
    static void Main(string[] args)
    {
        long offset = 0x10000000; // 256 megabytes
        long length = 0x20000000; // 512 megabytes

        // Create the memory-mapped file.
        using (var mmf = MemoryMappedFile.CreateFromFile(@"c:\ExtremelyLargeImage.data", FileMode.Open, "ImgA"))
        {
            // Create a random access view, from the 256th megabyte (the offset)
            // to the 768th megabyte (the offset plus length).
            using (var accessor = mmf.CreateViewAccessor(offset, length))
            {
                int colorSize = Marshal.SizeOf(typeof(MyColor));
                MyColor color;

                // Make changes to the view.
                for (long i = 0; i < length; i += colorSize)
                {
                    accessor.Read(i, out color);
                    color.Brighten(10);
                    accessor.Write(i, ref color);
                }
            }
        }
    }
}

public struct MyColor
{
    public short Red;
    public short Green;
    public short Blue;
    public short Alpha;

    // Make the view brighter.
    public void Brighten(short value)
    {
        Red = (short)Math.Min(short.MaxValue, (int)Red + value);
        Green = (short)Math.Min(short.MaxValue, (int)Green + value);
        Blue = (short)Math.Min(short.MaxValue, (int)Blue + value);
        Alpha = (short)Math.Min(short.MaxValue, (int)Alpha + value);
    }
}
I have implemented a file transfer rate calculator to display kB/sec for an upload process occurring in my app. However, with the following code it seems I am getting 'bursts' in my kB/s readings just after the file commences to upload.
This is the relevant portion of my stream code, which streams a file in 1024-byte chunks to a server using HttpWebRequest:
using (Stream httpWebRequestStream = httpWebRequest.GetRequestStream())
{
    if (request.DataStream != null)
    {
        byte[] buffer = new byte[1024];
        int bytesRead = 0;
        Debug.WriteLine("File Start");

        var duration = new Stopwatch();
        duration.Start();

        while (true)
        {
            bytesRead = request.DataStream.Read(buffer, 0, buffer.Length);
            if (bytesRead == 0)
                break;

            httpWebRequestStream.Write(buffer, 0, bytesRead);
            totalBytes += bytesRead;

            double bytesPerSecond = 0;
            if (duration.Elapsed.TotalSeconds > 0)
                bytesPerSecond = (totalBytes / duration.Elapsed.TotalSeconds);

            Debug.WriteLine(((long)bytesPerSecond).FormatAsFileSize());
        }

        duration.Stop();
        Debug.WriteLine("File End");
        request.DataStream.Close();
    }
}
Now, an output log of the upload process and the associated kB/sec readings follows.
(You will note each file starts and ends with 'File Start' and 'File End'.)
File Start
5.19 MB
7.89 MB
9.35 MB
11.12 MB
12.2 MB
13.13 MB
13.84 MB
14.42 MB
41.97 kB
37.44 kB
41.17 kB
37.68 kB
40.81 kB
40.21 kB
33.8 kB
34.68 kB
33.34 kB
35.3 kB
33.92 kB
35.7 kB
34.36 kB
35.99 kB
34.7 kB
34.85 kB
File End
File Start
11.32 MB
14.7 MB
15.98 MB
17.82 MB
18.02 MB
18.88 MB
18.93 MB
19.44 MB
40.76 kB
36.53 kB
40.17 kB
36.99 kB
40.07 kB
37.27 kB
39.92 kB
37.44 kB
39.77 kB
36.49 kB
34.81 kB
36.63 kB
35.15 kB
36.82 kB
35.51 kB
37.04 kB
35.71 kB
37.13 kB
34.66 kB
33.6 kB
34.8 kB
33.96 kB
35.09 kB
34.1 kB
35.17 kB
34.34 kB
35.35 kB
34.28 kB
File End
My problem, as you will notice, is the 'burst' at the start of every new file: the reading peaks in the MBs and then evens out properly. Is it normal for an upload to burst like this? My upload speed here typically won't go higher than about 40 kB/sec, so the initial readings can't be right.
This is a real issue: when I take an average of the last 5-10 seconds for on-screen display, the burst really throws things out, producing a result of around ~3 MB/sec!
Any ideas whether I am approaching this problem the best way, and what I should do? :S
Graham
Also: why can't I do bytesPerSecond = (bytesRead / duration.Elapsed.TotalSeconds), move duration.Start and duration.Stop into the while loop, and get accurate results? I would have thought this would be more accurate, but each speed then reads as 900 bytes/sec, 800 bytes/sec, etc.
The way I do this is:
Save up all bytes transferred in a long.
Then, every second, I check how much has been transferred. So I basically only trigger the code that records the speed once per second. Your while loop is going to iterate many, many times per second on a fast network.
Depending on the speed of your network, you may need to check the bytes transferred in a separate thread or function. I prefer doing this with a Timer so I can easily update the UI.
EDIT:
From looking at your code, I'm guessing what you're doing wrong is that you don't take into account that one iteration of the while (true) loop is not one second.
EDIT2:
Another advantage of only doing the speed check once per second is that things will go much quicker. In cases like this, updating the UI can be the slowest thing you are doing, so if you try to update the UI on every loop iteration, that is most likely your slowest point and will produce an unresponsive UI.
You're also correct that you should average out the values so you don't get the 'Microsoft minutes' bug. I normally do this in the Timer function with something like this:
//Global variables
long gTotalDownloadedBytes;
long gCurrentDownloaded; // Where you add up from the download/upload until the speed check is done.
int gTotalDownloadSpeedChecks;

//Inside the function that does the speed check
gTotalDownloadedBytes += gCurrentDownloaded;
gTotalDownloadSpeedChecks++;
long avgDwnSpeed = gTotalDownloadedBytes / gTotalDownloadSpeedChecks; // Assumes one speed check per second.
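A more complete sketch of that timer-driven approach (the class and member names are mine, not from the code above):

using System;
using System.Threading;

class SpeedMeter
{
    private long _totalBytes;  // incremented by the transfer loop
    private long _lastTotal;   // only touched by the timer callback
    private readonly Timer _timer;

    public SpeedMeter()
    {
        // Fires once per second; the delta since the previous tick is bytes/sec.
        _timer = new Timer(_ =>
        {
            long total = Interlocked.Read(ref _totalBytes);
            long bytesPerSecond = total - _lastTotal;
            _lastTotal = total;
            Console.WriteLine("{0:N0} bytes/sec", bytesPerSecond);
        }, null, 1000, 1000);
    }

    // Call from the upload loop after each successful Write.
    public void Add(int bytes)
    {
        Interlocked.Add(ref _totalBytes, bytes);
    }
}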
There are many layers of software and hardware between you and the system you're sending to, and several of those layers have a certain amount of buffer space available.
When you first start sending, you can pump out data quite quickly until you fill those buffers - it's not actually getting all the way to the other end that fast, though! After you fill up the send buffers, you're limited to putting more data into them at the same rate it's draining out, so the rate you see will drop to the underlying networking sending rate.
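If you want the on-screen figure to ignore that initial buffer-filling burst, one option is a short sliding window over recent samples rather than a since-start average. A sketch, with an arbitrary 5-second window:

using System;
using System.Collections.Generic;

class SlidingRate
{
    private readonly Queue<(DateTime time, long total)> _samples = new Queue<(DateTime time, long total)>();
    private readonly TimeSpan _window = TimeSpan.FromSeconds(5);

    // Call roughly once per second with the cumulative byte count;
    // returns the average bytes/sec over the last few samples only.
    public double Update(long totalBytes)
    {
        var now = DateTime.UtcNow;
        _samples.Enqueue((now, totalBytes));

        // Drop samples older than the window.
        while (_samples.Peek().time < now - _window)
            _samples.Dequeue();

        var oldest = _samples.Peek();
        double seconds = (now - oldest.time).TotalSeconds;
        return seconds > 0 ? (totalBytes - oldest.total) / seconds : 0;
    }
}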
All, I think I have fixed my issue by adjusting the 5-10 second averaging window to wait one second, to account for the burst. Not the best solution, but it allows the connection to sort itself out and lets me capture a smooth transfer.
It appears from my network traffic that it really is bursting, so there is nothing I could do differently in code to stop this.
I will still be interested in more answers before I hesitantly accept my own.