Ok - here is the scenario:
I host a server application on Windows instances hosted on Amazon AWS. (I do not have access to the source code, so I cannot resolve the issues from within the application's source code.)
These specific instances are able to build up CPU credits during times of idle CPU (less than 10-20% usage) and then spend those CPU credits during times of increased compute requirements.
My server application, however, typically runs at around 15-20% CPU usage when no clients are connected. This is when I would rather lower the CPU usage to around 5% by throttling the CPU, while maintaining enough CPU throughput to accept a TCP socket from incoming clients.
When a connected client is detected, I would like to remove the throttle and allow full access to the reserve of AWS CPU Credits.
I have got code in place that can Suspend and Resume processes via C# using Windows API calls.
However, I am a bit fuzzy on how to accurately achieve a target CPU usage for that process.
What I am doing so far, which is having moderate success:
Looping inside another application:
1. Check the CPU usage of the server application using performance counters (I don't like these; they require a 100-1000 ms wait in order to return a % value).
2. Determine whether the current value is above or below the target value. If above, increase an int value called 'sleep' by 10 ms; if below, decrease 'sleep' by 10 ms.
3. Then the application calls:
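Process.Suspend();   // custom wrapper around the Windows API suspend calls (not a built-in .NET method)
Thread.Sleep(sleep);
Process.Resume();    // custom wrapper around the Windows API resume calls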
Like I said - this is having moderate success.
But there are several reasons I don't like it:
1. It requires a semi-rapid loop in an external application, which might end up just shifting CPU usage to that application.
2. I'm sure there are better mathematical solutions for working out the ideal sleep time.
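Regarding point 2, one way to compute the sleep time directly, instead of nudging it by 10 ms per iteration, is to treat the throttle as a duty cycle. If the server would use a fraction u of all cores when never suspended, and is suspended for a fraction s of each control period, it should average roughly u(1 - s); solving u(1 - s) = target gives s = 1 - target/u. This sketch assumes the server's CPU use shrinks in proportion to the time it is allowed to run, which a suspended server may not honour exactly (it can catch up on queued work after a resume), so fresh measurements should still be fed back in. `DutyCycle` and its methods are illustrative names, not from any library:

```csharp
using System;

// Duty-cycle calculation for a suspend/resume throttle (illustrative names).
// Assumes the process would use `unthrottled` (fraction of all cores) if it were never
// suspended, and that its CPU use shrinks in proportion to the time it is allowed to run.
public static class DutyCycle
{
    // Fraction of each period to stay suspended: s = 1 - target / unthrottled.
    public static double SuspendFraction(double unthrottled, double target)
    {
        if (unthrottled <= 0 || target >= unthrottled) return 0.0; // already at or below target
        return Math.Clamp(1.0 - target / unthrottled, 0.0, 1.0);
    }

    // Split a control period into suspended and running milliseconds.
    public static (int SuspendMs, int RunMs) Split(double unthrottled, double target, int periodMs)
    {
        int suspendMs = (int)Math.Round(SuspendFraction(unthrottled, target) * periodMs);
        return (suspendMs, periodMs - suspendMs);
    }

    // Measurements are taken while throttled, so undo the throttle to estimate u:
    // measured = u * (1 - s)  =>  u = measured / (1 - s).
    public static double EstimateUnthrottled(double measured, double suspendFraction)
        => suspendFraction >= 1.0 ? measured : measured / (1.0 - suspendFraction);
}
```

For the numbers in the question (about 20% unthrottled, 5% target) and a 100 ms period, `Split(0.20, 0.05, 100)` gives 75 ms suspended and 25 ms running. Keeping the period short limits how long an incoming connection waits before the server next gets to run.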
I came across this application: http://mion.faireal.net/BES/
It seems to do everything I want, except that I need to be able to control it, and I am not a C++ developer.
It also seems to be able to achieve accurate CPU throttling without consuming much CPU itself.
Can someone suggest CPU throttling techniques?
Remember: I cannot modify the source code of the application being throttled. At most, I could inject code into it, but it occurs to me that if I inject suspend code into it, the resume code could not fire.
An external agent program might be the best way to go.
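On the measurement side, the 100-1000 ms wait of the performance counter can be avoided by sampling the process's cumulative CPU time directly: read `Process.TotalProcessorTime` on each loop iteration and divide the change by the elapsed wall-clock time multiplied by the number of logical processors. Each sample is cheap, which also helps with the concern about the external loop consuming CPU itself. A minimal sketch; `CpuSampler` is a hypothetical name, and usage is expressed as a fraction of all cores:

```csharp
using System;
using System.Diagnostics;

// Illustrative sampler: CPU usage of a process as a fraction of all logical processors,
// computed from the change in Process.TotalProcessorTime between two calls.
public sealed class CpuSampler
{
    private readonly Process _process;
    private readonly Stopwatch _wall = Stopwatch.StartNew();
    private TimeSpan _lastCpu;
    private TimeSpan _lastWall;

    public CpuSampler(Process process)
    {
        _process = process;
        _lastCpu = _process.TotalProcessorTime;
        _lastWall = _wall.Elapsed;
    }

    // Usage since the previous call (or since construction), between 0 and 1.
    public double Sample()
    {
        _process.Refresh();
        TimeSpan cpu = _process.TotalProcessorTime;
        TimeSpan wall = _wall.Elapsed;
        double usage = Fraction(cpu - _lastCpu, wall - _lastWall, Environment.ProcessorCount);
        _lastCpu = cpu;
        _lastWall = wall;
        return usage;
    }

    // CPU time consumed divided by the CPU time available on all cores in the same interval.
    public static double Fraction(TimeSpan cpuDelta, TimeSpan wallDelta, int cores)
    {
        if (wallDelta <= TimeSpan.Zero || cores <= 0) return 0.0;
        double fraction = cpuDelta.TotalMilliseconds / (wallDelta.TotalMilliseconds * cores);
        return Math.Clamp(fraction, 0.0, 1.0);
    }
}
```

Calling `Sample()` once per control-loop iteration gives the usage over exactly the interval since the previous call, so the loop does not have to block waiting on a counter.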
Is there a way to make an application, or a thread, run at a fixed rate?
I'm trying to do some deterministic simulations between networked clients and would like both machines (Windows) to run or process the data at a fixed, unchanging rate. Is this possible?
You can't make an existing application run at a particular speed (there could be VM-based solutions that normalize execution speed, but I'm not aware of any myself).
If you are writing your own code, the usual approach is to sleep between iterations until it is time for the next one. This is commonly done in (simple) games where there is less processing to do than available CPU power.
Pseudocode:
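while (true)
{
    executeStep();
    // GetTimeforNextStep() returns the UTC time at which the next step should start.
    // Task.Delay throws on a negative delay, so skip the wait if we are already late.
    var delay = GetTimeforNextStep() - DateTime.UtcNow;
    if (delay > TimeSpan.Zero)
        await Task.Delay(delay);
}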
Note that precise synchronization is not possible with a consumer-grade OS (Windows/Linux/macOS); you need an RTOS for precise millisecond-level timing.
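Building on the loop above, for the deterministic simulation in the question it can help to number the steps and give them a fixed size, so that both machines execute exactly the same sequence of steps even though the moment each step runs will jitter. This is a sketch of that fixed-timestep pattern, not a real-time guarantee; `FixedStepClock` and `FixedStepLoop` are illustrative names, and late steps are caught up in a burst rather than skipped:

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Fixed-timestep scheduling: steps are numbered and always advance by the same amount,
// so the step sequence is identical on every machine even if wall-clock timing jitters.
public sealed class FixedStepClock
{
    private readonly long _stepTicks;

    public FixedStepClock(TimeSpan step) => _stepTicks = step.Ticks;

    public long StepsDone { get; private set; }

    // How many whole steps have become due by `elapsed`; marks them as done.
    public int TakeDueSteps(TimeSpan elapsed)
    {
        long due = elapsed.Ticks / _stepTicks - StepsDone;
        if (due <= 0) return 0;
        StepsDone += due;
        return (int)due;
    }

    // Time left until the next step is due (zero if it is already due).
    public TimeSpan UntilNextStep(TimeSpan elapsed)
    {
        long wait = (StepsDone + 1) * _stepTicks - elapsed.Ticks;
        return TimeSpan.FromTicks(Math.Max(0, wait));
    }
}

public static class FixedStepLoop
{
    // Calls simulate(stepNumber) at a fixed rate until keepRunning() returns false.
    public static void Run(TimeSpan step, Action<long> simulate, Func<bool> keepRunning)
    {
        var clock = new FixedStepClock(step);
        var wall = Stopwatch.StartNew();
        while (keepRunning())
        {
            long first = clock.StepsDone + 1;
            int due = clock.TakeDueSteps(wall.Elapsed);
            for (long n = first; n < first + due; n++)
                simulate(n);                      // catch up on every step that is due
            Thread.Sleep(clock.UntilNextStep(wall.Elapsed));
        }
    }
}
```

Because the simulation only ever sees step numbers, a slow frame on one machine delays its steps but never changes what each step computes.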
We are scraping a web-based API using Microsoft Azure. The issue is that there is SO much data to retrieve (there are combinations/permutations involved).
If we use a standard Web Job approach, we calculated it would take about 200 years to process all the data we want to get - and we would like our data to be refreshed every week.
Each request/response from the API takes about 0.5-1.0 seconds to process. Requests are 20000 bytes on average and responses 35000 bytes. I believe the total number of requests is in the millions.
Another way to think about this question would be: how would you use Azure for web scraping and make sure you don't overload (in terms of memory and network) the VM it's running on? (I don't think you need too much CPU processing in this case.)
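A quick back-of-the-envelope check helps with sizing: by Little's law, the number of requests that must be in flight at once equals the request rate times the latency. Assuming, purely for illustration, 10 million requests per week (the question only says "millions") and 1 s per request, about 17 concurrent requests and roughly 0.9 MB/s (about 7 Mbit/s) of traffic would suffice, which suggests the job is latency-bound rather than bandwidth- or CPU-bound. `ScrapeBudget` is a hypothetical helper:

```csharp
using System;

// Back-of-the-envelope sizing with Little's law:
// requests in flight = request rate x time each request spends in flight.
public static class ScrapeBudget
{
    public static int RequiredConcurrency(long requests, TimeSpan window, TimeSpan latency)
    {
        double ratePerSecond = requests / window.TotalSeconds;
        return (int)Math.Ceiling(ratePerSecond * latency.TotalSeconds);
    }

    // Average bytes per second moved in both directions for the given request/response sizes.
    public static double BytesPerSecond(long requests, TimeSpan window, long requestBytes, long responseBytes)
        => requests / window.TotalSeconds * (requestBytes + responseBytes);
}
```

With 0.5 s latency the same assumed volume needs about 9 concurrent requests. Either way the concurrency is small compared with what a single VM can keep open using async IO.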
What we have tried so far:
We used Service Bus queues/worker roles scaled to 8 small VMs, but this caused a lot of network errors (there must be some network limit to how much EACH worker role VM can handle).
We used Service Bus queues/a continuous Web Job scaled to 8 small VMs, but this seems to work more slowly, and even when scaled, it doesn't give us much control over what's happening behind the scenes (we don't REALLY know how many VMs are up).
It seems that these things are built for CPU calculation - not for Web/API scraping.
Just to clarify: I throw my requests into a queue, and they then get picked up by my multiple VMs for processing to get the responses. That's how I was using the queues. Each VM was using the ServiceBusTrigger class as prescribed by Microsoft.
Is it better to have a lot of small VMs or a few massive VMs?
What C# classes should we be looking at?
What are the technical best practices when trying to do something like this on Azure?
Actually, a web scraper is something that I have had up and running in Azure for quite some time now :-)
AFAIK there is no 'magic bullet'. Scraping a lot of sources with deadlines is quite hard.
How it works (the most important things):
I use worker roles, with C# for the code itself.
For scheduling, I use queue storage. I put crawling tasks on the queue with a timeout (i.e. when to crawl them) and have the scraper pull them off. You can put triggers on the queue size to ensure you meet deadlines in terms of speed; personally, I don't need them.
SQL Azure is slow, so I don't use that. Instead, I only use table storage for storing the scraped items. Note that updating data might be quite complex.
Don't use too much threading; instead, use async IO for all network traffic.
Also, consider that extra threads require extra memory (parse trees can become quite big), so there's a trade-off there. I do recall using some threads, but really just a few.
Note that this probably requires you to redesign and reimplement your complete web scraper if you're currently using a threaded approach. Then again, there are some benefits:
Table storage and queue storage are cheap.
I currently use a single Extra Small VM to scrape well over a thousand web sources.
Inbound network traffic is free.
As such, the result is quite cheap as well; I'm sure it's much less than the alternatives.
As for the classes that I use... well, that's a bit of a long list. I'm using HttpWebRequest for the async HTTP requests and the Azure SDK, but all the rest is hand-crafted (and not open source).
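The "async IO instead of threads" advice can be illustrated with a small throttled runner: a `SemaphoreSlim` caps how many operations are in flight, and while an operation is waiting on the network it holds no thread. This is a sketch, not the answerer's (closed-source) code, and `ThrottledRunner` is a hypothetical name. The answer uses `HttpWebRequest`; on current .NET, `HttpClient` is the usual choice, and on .NET 6 or later `Parallel.ForEachAsync` with `ParallelOptions.MaxDegreeOfParallelism` offers similar behaviour out of the box:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Sketch: run an async operation for every item, with at most maxConcurrency in flight.
// While an operation awaits network IO it holds no thread, so a small VM can keep
// many requests outstanding.
public static class ThrottledRunner
{
    public static async Task RunAllAsync<T>(IEnumerable<T> items, Func<T, Task> work, int maxConcurrency)
    {
        using var gate = new SemaphoreSlim(maxConcurrency);
        var running = new List<Task>();
        foreach (var item in items)
        {
            await gate.WaitAsync();          // wait for a free slot before starting the next item
            running.Add(RunOneAsync(item));
        }
        await Task.WhenAll(running);         // rethrows the first failure, if any

        async Task RunOneAsync(T value)
        {
            try { await work(value); }
            finally { gate.Release(); }      // free the slot even if the operation failed
        }
    }
}
```

For example, `await ThrottledRunner.RunAllAsync(urls, async url => { using var response = await http.GetAsync(url); /* parse and store */ }, 50);`, where `urls` and `http` (a shared `HttpClient`) are placeholders. The sketch keeps one `Task` per item until the end, so for millions of items you would feed it in chunks, for example one queue batch at a time.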
P.S.: This doesn't just hold for Azure; most of this also holds for on-premises scrapers.
I have some experience with scraping so I will share my thoughts.
It seems that these things are built for CPU calculation - not for Web/API scraping.
They are built for dynamic scaling, which, given your task, is not something you really need.
How to make sure you don't overload the VM?
Measure the response times and error rates, and tune your code to lower them.
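One way to automate that tuning (a swapped-in technique, not something this answer prescribes) is additive-increase/multiplicative-decrease in the spirit of TCP congestion control: allow one more concurrent request after each success and halve the allowance after an error, a timeout or an unusually slow response. A minimal sketch; `AimdLimit` is a hypothetical class, and the scraper would consult `Current` before starting each request:

```csharp
using System;

// Additive-increase / multiplicative-decrease limit on in-flight requests:
// grow by one after each success, halve after a failure, always within [min, max].
public sealed class AimdLimit
{
    private readonly int _min;
    private readonly int _max;
    private readonly object _sync = new object();
    private int _limit;

    public AimdLimit(int initial, int min, int max)
    {
        _min = min;
        _max = max;
        _limit = Math.Clamp(initial, min, max);
    }

    // How many requests the scraper should currently allow in flight.
    public int Current { get { lock (_sync) return _limit; } }

    // Call after a request completed successfully within the expected time.
    public void OnSuccess() { lock (_sync) _limit = Math.Min(_max, _limit + 1); }

    // Call after an error, a timeout or an unusually slow response.
    public void OnFailure() { lock (_sync) _limit = Math.Max(_min, _limit / 2); }
}
```

The fast back-off keeps a struggling VM (or API endpoint) from being hammered, while the slow growth probes for spare capacity once errors stop.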
I don't think you need too much CPU processing in this case.
Depends on how much data is coming in each second and what you are doing with it. More complex parsing on quickly incoming data (if you decide to do it on the same machine) will eat up CPU pretty quickly.
8 small VMs caused a lot of network errors to occur (there must be some network limit)
The smaller the VM, the fewer shared resources it gets. There are throughput limits, and then there is the issue of neighbors sharing the actual hardware with you. Often, the smaller your instance size, the more trouble you run into.
Is it better to have a lot of small VMs or a few massive VMs?
In my experience, smaller VMs are too crippled. However, your mileage may vary, and it all depends on the particular task and its implementation. Really, you have to measure it yourself in your environment.
What C# classes should we be looking at?
What are the technical best practices when trying to do something like this on Azure?
With high-throughput scraping you should be looking at infrastructure. You will get different latency in different Azure datacenters, and a different experience with network latency/sustained throughput at different VM sizes, depending also on who in particular is sharing the hardware with you. The best practice is to try to find what works best for you: change datacenters and VM sizes, and otherwise experiment.
Azure may not be the best solution to this problem (unless you are on a spending spree). 8 small VMs cost $450 a month, which is enough to pay for an unmanaged dedicated server with 256 GB of RAM, 40 hardware threads and 500 Mbps to 1 Gbps (or even bursts of up to several Gbps) of quality network bandwidth without latency issues.
For your budget, you will have a dedicated server that you cannot overload. You will have more than enough RAM to deal with async pinning (if you decide to go async), or enough hardware threads for multi-threaded synchronous IO, which gives the best throughput (if you choose to go synchronous with a fixed-size thread pool).
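The second option mentioned above, synchronous IO on a fixed-size pool of threads, might look like the following sketch: a fixed number of dedicated threads each block on one request at a time and pull the next item from a shared, bounded queue. `FixedThreadPoolRunner` and the `fetch` delegate are illustrative placeholders; `fetch` is expected to handle its own errors, because an unhandled exception on one of these threads terminates the process:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

// Sketch of synchronous IO on a fixed-size pool: threadCount dedicated threads each
// block on one item at a time and pull the next one from a shared queue.
public static class FixedThreadPoolRunner
{
    public static void RunAll<T>(IEnumerable<T> items, Action<T> fetch, int threadCount)
    {
        // Bounded so the producer never queues far ahead of the workers.
        using var queue = new BlockingCollection<T>(threadCount * 4);
        var threads = new List<Thread>();
        for (int i = 0; i < threadCount; i++)
        {
            var thread = new Thread(() =>
            {
                foreach (var item in queue.GetConsumingEnumerable())
                    fetch(item);                 // e.g. a blocking HTTP request
            }) { IsBackground = true };
            thread.Start();
            threads.Add(thread);
        }

        foreach (var item in items) queue.Add(item);
        queue.CompleteAdding();                  // workers exit once the queue is drained
        foreach (var thread in threads) thread.Join();
    }
}
```

The thread count is the knob to tune here: on the kind of dedicated server described, it can be set well above the core count, since most threads spend their time blocked on the network.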
On a side note, depending on the API specifics, it might turn out that your main issue will be the API owner simply throttling you down to a crawl when you start to put too much pressure on the API endpoints.
In the company I work for, we build machines that are controlled by software running on Windows. A C# application communicates with a bus controller (via a DLL). The bus controller runs on a cycle time of 15 ms. That means that we get updates of the actual sensors in the system with a heartbeat of 15 ms from the bus controller (which is real-time).
Now the machines are evolving into the next generation, where we get a new bus controller that runs on a cycle time of 1 ms. Since everybody realizes that Windows is not a real-time OS, the question arises: should we move the controlling part of the software to a real-time application (on a real-time OS, e.g. a (soft) PLC)?
If we stay on the Windows platform, we do not have guaranteed responsiveness. That in itself is not necessarily a problem: if we miss a few bus cycles (have a few hiccups), the machine will just produce slightly more slowly (which is acceptable).
The part that worries me is thread synchronization between the main machine-controlling thread and the updates we receive from the real-time controller (every millisecond).
Where can I learn more about how Windows / .NET C# behaves when it comes to thread synchronization at the millisecond level? I know that, e.g., Thread.Sleep(1) can take up to 15 ms because Windows is preempting other tasks, so how does this affect synchronizing between two threads with Monitor.PulseAll every ms? Can I expect the same unpredictable behavior? Am I asking for trouble by moving into soft real-time requirements of 1 ms in Windows applications?
I hope somebody with experience on these aspects of threading can shed some light on this. If I need to clarify more, by all means, shoot.
Your scenario sounds like a candidate for a kiosk-mode/dedicated application.
In the company I work for we build machines which are controlled by software running on Windows OS.
If so, you could rig the machines such that your low-latency I/O thread could run on a dedicated core with thread and process priorities maximized. Furthermore, ensure the machine has enough cores to handle a buffering thread as well as any others that process your data in transit. The buffer should allocate memory upfront if possible to avoid garbage collection bottlenecks.
Aron's example is good for situations where data integrity can be compromised to a certain extent. In audio, latency matters a lot during recording for multiple reasons, but for pure playback, data loss is acceptable to a certain degree. I am assuming this is not an option in your case.
Of course Windows is not designed to be a real-time OS but if you are using it for a dedicated app, you have control over every aspect of it and can turn off all unrelated services and background processes.
I have had a reasonable amount of success writing software to monitor how well UPS units cope with power fluctuations by measuring their power compensation response times (disclaimer: not for commercial purposes though). Since the data to measure per sample was very small, the GC was not problematic and we cycled pre-allocated memory blocks for buffers.
Some micro-optimizations that came in handy:
Using immutable structs to poll I/O data.
Optimizing data structures to work well with memory allocation.
Optimizing processing algorithms to minimize CPU cache misses.
Using an optimized buffer class to hold data in transit.
Using the Monitor and Interlocked classes for synchronization.
Using unsafe code with (void*) to gain easy access to buffer arrays in various ways to decrease processing time. Minimal use of Marshal and Buffer.BlockCopy.
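Several of these points (memory allocated up front, immutable structs, Interlocked for synchronization) can be combined in a pre-allocated single-producer/single-consumer ring buffer. The answerer's own buffer class isn't shown, so this is only a sketch; `Sample` and `SpscRing` are hypothetical names, and the buffer is only correct with exactly one writer thread and one reader thread:

```csharp
using System.Threading;

// Immutable sample; a struct, so the array stores values inline with no per-item allocation.
public readonly struct Sample
{
    public Sample(long timestamp, double value) { Timestamp = timestamp; Value = value; }
    public long Timestamp { get; }
    public double Value { get; }
}

// Single-producer/single-consumer ring buffer. All storage is allocated once in the
// constructor, so steady-state use creates no garbage for the GC to collect.
public sealed class SpscRing
{
    private readonly Sample[] _items;
    private long _head; // next slot to read; advanced only by the consumer
    private long _tail; // next slot to write; advanced only by the producer

    public SpscRing(int capacity) => _items = new Sample[capacity];

    public bool TryWrite(in Sample sample)
    {
        long tail = Interlocked.Read(ref _tail);
        if (tail - Interlocked.Read(ref _head) == _items.Length) return false; // full
        _items[tail % _items.Length] = sample;
        Interlocked.Exchange(ref _tail, tail + 1); // publish the slot only after it is filled
        return true;
    }

    public bool TryRead(out Sample sample)
    {
        long head = Interlocked.Read(ref _head);
        if (head == Interlocked.Read(ref _tail)) { sample = default; return false; } // empty
        sample = _items[head % _items.Length];
        Interlocked.Exchange(ref _head, head + 1); // hand the slot back to the producer
        return true;
    }
}
```

A full buffer makes `TryWrite` return false rather than block, so the I/O thread can decide whether to drop or count an overrun instead of stalling on the consumer.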
Lastly, you could go the DDK way and write a small driver. Though off-topic, DFMirage is a good example of a video driver that provides both an event-based and a polling model for differential screen capture, so that the consumer application can choose on the fly based on system load.
As for Thread.Sleep, you could use it as sparingly as possible considering your energy consumption boundaries. With redundant processes out of the way, Thread.Sleep(1) should not be as bad as you think. Try the following to see what you get. Note that this has been coded in the SO editor so I may have made mistakes.
using System;
using System.Diagnostics;
using System.Threading;

// RealTime priority can starve other processes (even input handling); use it with care on a dedicated machine.
Thread.CurrentThread.Priority = ThreadPriority.Highest;
Process.GetCurrentProcess().PriorityClass = ProcessPriorityClass.RealTime;
var ticks = 0L;
var iteration = 0D;
var timer = new Stopwatch();
do
{
    iteration++;
    timer.Restart();
    Thread.Sleep(1);
    timer.Stop();
    ticks += timer.Elapsed.Ticks;
    // Press Escape to stop measuring.
    if (Console.KeyAvailable) { if (Console.ReadKey(true).Key == ConsoleKey.Escape) { break; } }
    Console.WriteLine("Elapsed (ms): Last Iteration = {0:N2}, Average = {1:N2}.", timer.Elapsed.TotalMilliseconds, TimeSpan.FromTicks((long)(ticks / iteration)).TotalMilliseconds);
}
while (true);
Console.WriteLine();
Console.WriteLine();
Console.Write("Press any key to continue...");
Console.ReadKey(true);
Coming to the actual problem itself, processing data at 1 ms is pretty easy. If you consider audio recording as an analogous (pun not intended) problem, you might find some inspiration there for how to achieve your goals.
Bear in mind:
Even a modest setup can achieve a 44.1 kHz/16-bit per-channel sampling rate (a sample period of about 22.7 microseconds, which is less than a fortieth of your 1 ms target).
Using ASIO, you can achieve sub-10 ms latencies.
Most methods of achieving high sampling rates will work by increasing your buffer size and sending data to your system in batches
To achieve the best throughput, don't use threads. Use DMA and interrupts to call back into your processing loop.
Given that sound cards routinely can achieve your goals, you might have a chance.
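The batching idea from the audio analogy can be applied on the consuming side as well: instead of waking up once per 1 ms update, block until at least one update is available and then drain everything already queued into a reused array. A sketch with a hypothetical `BatchDrain` helper built on `BlockingCollection<T>`; it is ordinary thread-based code, not the DMA/interrupt approach mentioned above:

```csharp
using System.Collections.Concurrent;

// Batching consumer: block for the first item, then grab whatever else is already
// queued, so the processing loop runs once per batch instead of once per update.
public static class BatchDrain
{
    // Fills `buffer` (which must have at least one slot) and returns how many items were written.
    // Take() throws InvalidOperationException once the collection is completed and empty.
    public static int TakeBatch<T>(BlockingCollection<T> source, T[] buffer)
    {
        int count = 0;
        buffer[count++] = source.Take();
        while (count < buffer.Length && source.TryTake(out T item))
            buffer[count++] = item;
        return count;
    }
}
```

Reusing the same array for every batch avoids per-batch allocations, in line with the garbage-collection advice in the other answer.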
I have an application where clients connect to IIS with websockets.
IIS then creates a local proxy for IPC to connect to an executable.
So IIS is a sort of middleman.
The more connections come in, the slower the whole architecture gets.
So there is a bottleneck somewhere.
The interesting thing is that CPU usage does not go above 25%. I have not put any limit on CPU utilization.
The issue is not the code itself; rather, a function that was taking, say, 100 milliseconds is now taking 1000 milliseconds. And these functions are not network-bound.
They are simple image conversions.
I also checked whether I am blocking on locks or anything similar.
One would think that the more users join the system, the more of these image conversions occur and the more CPU is used.
But again, the CPU utilization is not changing; it is stuck at around 25%.
Since execution of even the simplest function is slowing down, I am guessing there is a limit on how much CPU the application pool can use. Again, I checked the AppPool settings and there is no limit.
Any suggestions on how to go about this?
Sounds like a CPU affinity setting, either in the code or in the system settings.
You can set processor affinity (and thus limit to one processor) per application pool, which would effectively limit the app running in that pool to one processor. This limits the w3wp process to a single processor, so on a quad-core CPU it would top out at 25%. You can find the details on changing this through your IIS settings here: http://www.iis.net/configreference/system.applicationhost/applicationpools/add/cpu
You can also check Task Manager: right-click the process, click "Set Affinity..." and see if you are limiting IIS to one core.
Hope this helps you!
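To confirm the affinity theory from code rather than Task Manager, a small check can read the affinity mask of each w3wp process and turn it into the highest CPU percentage it allows. A sketch with a hypothetical `AffinityCheck` class; `Process.ProcessorAffinity` returns the mask as an `IntPtr`, and reading it for another account's process may require elevated rights. In IIS itself, the per-pool setting lives on the application pool's `cpu` element (`smpAffinitized` and `smpProcessorAffinityMask`):

```csharp
using System;
using System.Diagnostics;
using System.Numerics;

// Illustrative check: how much of the machine each IIS worker process (w3wp) may use,
// based on its processor affinity mask.
public static class AffinityCheck
{
    // Number of logical processors enabled in an affinity mask.
    public static int CoresAllowed(long mask) => BitOperations.PopCount((ulong)mask);

    // Highest total CPU percentage reachable with this mask, e.g. 1 of 4 cores -> 25%.
    public static double MaxCpuPercent(long mask, int totalCores)
        => 100.0 * Math.Min(CoresAllowed(mask), totalCores) / totalCores;

    public static void Report()
    {
        foreach (var process in Process.GetProcessesByName("w3wp"))
        {
            using (process)
            {
                // Reading another account's process may require elevated rights.
                long mask = process.ProcessorAffinity.ToInt64();
                double max = MaxCpuPercent(mask, Environment.ProcessorCount);
                Console.WriteLine($"w3wp PID {process.Id}: affinity 0x{mask:X}, max ~{max:N0}% CPU");
            }
        }
    }
}
```

A reported maximum of 25% on a four-core machine would match the symptom in the question exactly.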
You can check process affinity for the Application Pool Process. That may be the reason you are stuck on 25%.
Beyond processor affinity, if your requests are long-running, you may be running up against the default limit on the number of concurrent requests per CPU that IIS allows (especially in integrated mode, where the default is 12). The 25% on a quad-core CPU hints that affinity is your problem, but if it isn't, you can check this as well. Here is a related answer
This is a VERY open question.
Basically, I have a computing application that launches test combinations for N Scenarios.
Each test is conducted in a single dedicated thread, and involves reading large binary data, processing it, and dropping results to DB.
If the number of threads is too large, the app goes rogue, eats up all available memory and hangs.
What is the most efficient way to exploit all CPU and RAM capabilities (high-performance computing, i.e. 12 cores/16 GB RAM) without bringing the system to its knees (which happens if "too many" simultaneous threads are launched, "too many" being a relative notion of course)?
I should specify that I have a worker buffer queue with N workers: every time one finishes and dies, a new one is launched via a queue. This works pretty well as of now. But I would like to avoid "manually" and "empirically" setting the number of simultaneous threads, and instead have an intelligent, scalable system that launches as many threads at a time as the system can properly handle and stops at a "reasonable" memory usage (the target server is dedicated to the app, so there is no problem regarding other applications except the system).
PS: I know that .NET 3.5 comes with thread pools and .NET 4 has interesting TPL capabilities, which I am still considering right now (I never went very deep into these so far).
PS 2: After reading this post I was a bit puzzled by the "don't do this" answers, though I think such a request is fair for a memory-demanding computing program.
EDIT
After reading this post I will try to use WMI features.
None of the built-in threading capabilities in .NET support adjusting according to memory usage. You need to build this yourself.
You can either predict memory usage or react to low memory conditions. Alternatives:
1. Look at the amount of free memory on the system before launching a new task. If it is below 500 MB, wait until enough has been freed.
2. Launch tasks as they come and throttle as soon as some of them start to fail because of OOM. Restart them later. This alternative sucks big time because your process will do garbage collections like crazy to avoid the OOMs.
I recommend (1).
You can either look at free system memory or your own processes memory usage. In order to get the memory usage I recommend looking at private bytes using the Process class.
If you set aside 1GB of buffer on your 16GB system you run at 94% efficiency and are pretty safe.
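Option (1) can be sketched as a small gate that the worker queue consults before launching the next test. It reads the private bytes of the current process via `Process.PrivateMemorySize64`, as suggested; the class name, the per-test memory estimate and the polling interval are assumptions for illustration. Because a freshly launched test has not allocated its memory yet, the caller should pass a realistic per-test estimate rather than zero:

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

// Sketch of a memory gate: before starting another task, wait until the current
// private bytes plus the task's expected footprint fit within a budget.
// The memory probe is injectable so the policy can be tested without real load.
public sealed class MemoryGate
{
    private readonly long _budgetBytes;
    private readonly Func<long> _usedBytes;

    public MemoryGate(long budgetBytes, Func<long> usedBytes = null)
    {
        _budgetBytes = budgetBytes;
        _usedBytes = usedBytes ?? (() =>
        {
            using var self = Process.GetCurrentProcess();
            return self.PrivateMemorySize64;   // "private bytes" of this process
        });
    }

    public bool HasRoom(long expectedTaskBytes) => _usedBytes() + expectedTaskBytes <= _budgetBytes;

    // Polls until a task expected to need `expectedTaskBytes` fits within the budget.
    public async Task WaitForRoomAsync(long expectedTaskBytes, TimeSpan pollInterval, CancellationToken ct = default)
    {
        while (!HasRoom(expectedTaskBytes))
            await Task.Delay(pollInterval, ct);
    }
}
```

For the 16 GB machine above, `new MemoryGate(15L * 1024 * 1024 * 1024)` leaves roughly the 1 GB buffer the answer suggests.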