I am processing my SSAS Cube programmatically. I process the dimensions in parallel (I manage the parallel calls to .Process() myself) and once they're all finished, I process the measure group partitions in parallel (again managing the parallelism myself).
As far as I can see, this is a direct replication of what I would otherwise do in SSMS (same process types, etc.). The only difference I can see is that I'm processing ALL of the dimensions in parallel, and ALL of the measure group partitions in parallel thereafter. If you right-click and process several objects within SSMS, it appears to only process 2 in parallel at any one time (inferred from the text indicating that processing has not started in all processing windows other than 2). But if anything, I would expect my code to be faster than SSMS, not slower.
I have wrapped the processing action with "starting" and "finishing" debug messages and everything is as expected. It is the work done by .Process() that seems to be much slower than SSMS.
On a cube that normally takes just under 1 hour to process, it is taking 7.5 hours.
On a cube that normally takes just under 3 minutes to process, it is taking 6.5 minutes.
As far as I can tell, the processing of dimensions is about the same, but the measure groups are significantly slower. However, the latter are much, much larger of course, so it might just be that the difference is less obvious to me.
I'm at a loss for ideas and would appreciate any help! Am I missing a setting? Is managing the parallelism myself and processing multiple in parallel as opposed to 2 causing a problem?
If you can provide your code I'm happy to look but my guess is that you are calling dimension.Process() in parallel threads expecting it to process in parallel on the server. It won't. It will process in serial due to locking because you are executing separate processing batches and separate transactions.
Any reason not to process everything (rather than incrementally processing just recent partitions or something)? Let's start simple and see if this is all you need. Can you get the database object and just do a ProcessFull? That will properly process in parallel all dimensions and measure groups.
database.Process(ProcessType.ProcessFull)
If you do need incremental processing then review this link for using ExecuteCaptureLog(true,true) to run multiple ProcessUpdate commands in parallel and in a transaction:
https://jesseorosz.wordpress.com/2006/11/20/how-to-process-dimensions-in-parallel-using-amo/
I would recommend including the partitions you want to process in that transactional batch. It will know the right dependencies automatically. Also make sure to include a ProcessIndexes on the cube object in that batch so flexible aggs and indexes on old partitions get rebuilt after the dimension ProcessUpdate.
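For reference, here is a minimal sketch of that pattern with AMO. The connection string and object names are placeholders, and using ProcessData on the partitions is my assumption for the partition step; adapt to your model:

using Microsoft.AnalysisServices; // AMO

Server server = new Server();
server.Connect("Data Source=localhost"); // placeholder connection string
Database db = server.Databases.FindByName("MyOlapDb"); // placeholder name

// Capture the processing commands instead of executing them one by one.
server.CaptureXml = true;

foreach (Dimension dim in db.Dimensions)
    dim.Process(ProcessType.ProcessUpdate); // queued, not executed yet

foreach (Cube cube in db.Cubes)
{
    foreach (MeasureGroup mg in cube.MeasureGroups)
        foreach (Partition p in mg.Partitions)
            p.Process(ProcessType.ProcessData); // queued

    // Rebuild flexible aggs and indexes after the dimension ProcessUpdate.
    cube.Process(ProcessType.ProcessIndexes);
}

server.CaptureXml = false;

// Execute everything as ONE transactional, parallel batch on the server.
XmlaResultCollection results = server.ExecuteCaptureLog(true, true);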
I have 2 WebJobs in my WebApp. Let's call them WJ1 and WJ2, both WebJobs are triggered by queues (Q1 and Q2 respectively). Since Q2 gets a lot of messages and I would like to process them quickly, I scale out my WebApp to 10 instances (it only runs with 1 instance at times when no messages are expected).
The problem is that since Q2 receives overwhelmingly more messages than Q1, WJ2 takes all the resources and the processing of Q1 lags behind its expected schedule. Would there be a way of assigning WJ1 a higher priority than WJ2, so that whenever there is a message in Q1 it will be processed before taking any more from Q2?
Both WebJobs are separate (and both have 2 functions each, triggered by the aforementioned queues and timers) and can be started and stopped independently, if that is of any help.
Also, WJ1 would be happy with just one instance, since the messages are expected to be processed one after the other.
I have read about and considered splitting the WebJobs into 2 different WebApps, limiting WebApp2 to 9 of the 10 available instances to run WJ2. However, I don't like that option because WJ2 takes 3 hours or so to finish its burst of messages while WJ1 should be done within an hour or less, so it makes no sense to prevent WJ2 from taking all the available capacity once WJ1 is done.
Also, if WJ1 were a singleton, would it then have higher priority? (Probably not, but worth asking.)
Thanks!
Simply scaling the compute resources might not be sufficient to prevent loss of performance under load. You might also need to scale storage queues and other resources to prevent a single point of the overall processing chain from becoming a bottleneck. Also, consider other limitations, such as the maximum throughput of storage and other services that the application and the background tasks rely on.
Please go through Scaling and performance considerations of Web Jobs.
Secondly, you should try to keep your WebJobs smaller - 3 hours and 1 hour are very large timespans.
It means that your WJ2 might hold all the resources for 3 hours! It requires an in-depth performance review of the WebJob to identify whether (and how many) resources are being held unnecessarily. Smaller WebJobs will allow different instances to scale independently according to their own usage and free resources as soon as possible.
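If you stay on the WebJobs SDK, one related knob worth knowing about (a sketch, assuming SDK 1.x/2.x where JobHostConfiguration exists): lowering WJ2's queue batch size reduces how many Q2 messages each instance processes concurrently, leaving headroom for WJ1 on shared instances.

// Program.cs of WJ2; the values are illustrative, not recommendations.
using Microsoft.Azure.WebJobs;

class Program
{
    static void Main()
    {
        var config = new JobHostConfiguration();

        config.Queues.BatchSize = 4;          // default is 16
        config.Queues.NewBatchThreshold = 2;  // fetch more when 2 remain in flight

        new JobHost(config).RunAndBlock();
    }
}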
I have a C# console application running on a 64-bit Windows 2008 R2 server which also hosts MSSQL Server 2005.
This application runs through text files, reads the line, splits the line values into variables, and inserts the data into a SQL database hosted at localhost.
Each text file is handled by a new thread, each line by a new thread, and each SQL insert statement is executed on yet another new thread.
I am counting the number of each of these types of threads and decrementing when they complete. I'm wondering what the best way is to "pend" future threads from opening...
For example, before a new SQL insert thread is opened, I'm calling:
while (numberofcurrentthreads > specifiednumberofthreads)
{
    Thread.Sleep(100); // wait
}
new Thread(insertSQL).Start();
Where specifiednumberofthreads has been estimated to a value that does not throw System.OutofMemoryExceptions. A lot of guess work has gone into determining that number for each process.
My question is: is there a more 'efficient' or proper way to do this? Is there a way to read system memory (not physical memory) and wait based on a specified resource allotment?
To illustrate this idea:
while (System.Memory < (System.Memory / 2) || System.OutOfMemory == true)
{
    // wait
}
new Thread(insertSQL).Start();
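For what it's worth, the closest real mechanism I know of in .NET is MemoryFailPoint, which pre-checks whether an estimated amount of memory is likely to be available before work starts. A rough sketch, where the 200 MB estimate and the InsertSql stand-in are purely illustrative:

using System;
using System.Runtime;
using System.Threading;

class Gate
{
    static void InsertSql() { /* stand-in for the real insert work */ }

    static void StartWhenMemoryAllows()
    {
        while (true)
        {
            try
            {
                // Coarse pre-check: is ~200 MB likely to be available?
                using (new MemoryFailPoint(200))
                {
                    new Thread(InsertSql).Start();
                    return;
                }
            }
            catch (InsufficientMemoryException)
            {
                Thread.Sleep(500); // back off instead of risking an OOM
            }
        }
    }
}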
The current method I am employing works and completes in a decent time, but it could do better. Some of the text files going through the process are larger than others and do not necessarily make the best use of system resources...
For example, processing 2 text files at a time works perfectly when both text files are < 300 KB. It does not work so well if one or both are over 100,000 KB.
There also seems to be a 'butter zone' where things process most efficiently, somewhere around 75% of all CPU resources. Crank these values too high and it will run at 100% CPU but process far slower, as it cannot keep up.
It's crazy to be creating a new thread for every file and for every line and for every SQL insert statement. You'd probably be much better off using three threads and a chained producer-consumer model, all of which communicate through thread-safe queues. In C#, that would be BlockingCollection.
First, you set up two queues, one for lines that have been read from a text file, and one for lines that have been processed:
const int MaxQueueSize = 10000;
BlockingCollection<string> _lines = new BlockingCollection<string>(MaxQueueSize);
BlockingCollection<DataObject> _dataObjects = new BlockingCollection<DataObject>(MaxQueueSize);
DataObject, by the way, is what I'm calling the object that you'll be inserting into the database. You don't say what that is. It doesn't really matter for the purposes of this discussion, but you'd replace it with whatever type you use to represent the processed string.
Now, you create three threads (a sketch of the full pipeline follows the list):
A thread that reads text files line-by-line and places the lines into the _lines queue.
A line processor that reads lines one by one from the _lines queue, processes each, and creates a DataObject, which it then places on the _dataObjects queue.
A thread that reads the _dataObjects queue and inserts them into the database.
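Here is a minimal sketch of that pipeline, assuming DataObject simply wraps a parsed line; the input path and the parse/insert bodies are placeholders:

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading;

class DataObject { public string Payload; }

class Pipeline
{
    const int MaxQueueSize = 10000;
    static readonly BlockingCollection<string> _lines =
        new BlockingCollection<string>(MaxQueueSize);
    static readonly BlockingCollection<DataObject> _dataObjects =
        new BlockingCollection<DataObject>(MaxQueueSize);

    static void Main()
    {
        var reader = new Thread(() =>
        {
            foreach (var file in Directory.EnumerateFiles(@"C:\input")) // placeholder path
                foreach (var line in File.ReadLines(file))
                    _lines.Add(line); // blocks when the queue is full

            _lines.CompleteAdding(); // no more lines are coming
        });

        var processor = new Thread(() =>
        {
            foreach (var line in _lines.GetConsumingEnumerable())
                _dataObjects.Add(new DataObject { Payload = line }); // your parsing here

            _dataObjects.CompleteAdding();
        });

        var writer = new Thread(() =>
        {
            foreach (var obj in _dataObjects.GetConsumingEnumerable())
            {
                // your SQL insert here
            }
        });

        reader.Start(); processor.Start(); writer.Start();
        reader.Join(); processor.Join(); writer.Join();
    }
}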
Beyond simplicity (and this is very easy to put together), there are many benefits to this model.
First, having more than one thread reading from the disk concurrently usually leads to slower performance because the disk drive can only do one thing at a time. Having multiple threads hitting the disk at the same time just causes unnecessary head seeks. Just one thread will keep your input queue full.
Second, limiting the queues' sizes will prevent you from running out of memory. When the disk reading thread tries to insert the 10,001st item into the queue, it will wait until the processing thread removes an item. That's the "blocking" part of BlockingCollection.
You might find that you can speed up your SQL inserts by grouping them and sending a bunch of records at once, doing what is essentially a bulk insert of 100 or 1000 records at a time rather than sending 100 or 1000 individual transactions.
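One concrete way to do that is SqlBulkCopy (my suggestion; the text above doesn't prescribe a specific API, and the connection string, table, and column names here are placeholders):

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

static void BulkInsert(IEnumerable<DataObject> batch)
{
    // Buffer the batch into a DataTable matching the target table's schema.
    var table = new DataTable();
    table.Columns.Add("Payload", typeof(string));

    foreach (var obj in batch)
        table.Rows.Add(obj.Payload);

    using (var bulk = new SqlBulkCopy("Data Source=localhost;Initial Catalog=MyDb;Integrated Security=true"))
    {
        bulk.DestinationTableName = "dbo.MyTable";
        bulk.WriteToServer(table); // one round trip instead of N individual inserts
    }
}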
This solution prevents the problem of too many threads. You have a fixed number of threads, all of which are running as fast as they possibly can. And memory use is constrained by limiting the number of things that can be in the queues.
The solution also scales rather well. If you have files on multiple drives, you can add a second file reading thread to read the files from that other physical drive and place the lines in the same queue. BlockingCollection supports multiple producers and multiple consumers, so adding another producer is no trouble at all.
The same goes for consumers. If you find that the processing step is the bottleneck, you can add another processing thread. It, too, will read from the _lines queue and write to the _dataObjects queue.
However, having more threads than you have processor cores will likely make your program slower. If you have a four-core processor, creating 8 processing threads won't do you any good. It will make things slower because the operating system will be spending a lot of time on thread context switches rather than on doing useful work.
You'll have to do a little tuning to get the best performance. Queue sizes should be large enough to support continuous workflow (so no thread is starved of work, or spends too much time waiting for the output queue), but not so large to overfill memory. Depending on the relative speed of the three stages, one of the queues might have to be larger than the other. If one of the three stages is a bottleneck, you can add another thread to help at that stage.
I created a simple example of this model using text files for input and output. It should be pretty easy to extend for your situation. See Simple Multithreading, and the follow up, Part 2.
I have a long-running process that reads large files and writes summary files. To speed things up, I'm processing multiple files simultaneously using regular old threads:
ThreadStart ts = new ThreadStart(Work);
Thread t = new Thread(ts);
t.Start();
What I've found is that even with separate threads reading separate files and no locking between them and using 4 threads on a 24-core box, I can't even get up to 10% on the CPU or 10% on disk I/O. If I use more threads in my app, it seems to run even more slowly.
I'd guess I'm doing something wrong, but where it gets curious is that if I start the whole exe a second and third time, then it actually processes files two and three times faster. My question is, why can't I get 12 threads in my one app to process data and tax the machine as well as 4 threads in 3 instances of my app?
I've profiled the app and the most time-intensive and frequently called functions are all string processing calls.
It's possible that your computing problem is not CPU bound, but I/O bound. It doesn't help to state that your disk I/O is "only at 10%". I'm not sure such a performance counter even exists.
The reason why it gets slower while using more threads is that those threads are all trying to get to their respective files at the same time, while the disk subsystem is having a hard time trying to accommodate all of the different threads. You see, even with modern technology like SSDs, where the seek time is several orders of magnitude smaller than with traditional hard drives, there's still a penalty involved.
Rather, you should conclude that your problem is disk bound and a single thread will probably be the fastest way to solve your problem.
One could argue that you could use asynchronous techniques to process a bit that's been read while, in the background, the next bit is being read in, but I think you'll see very little performance improvement there.
I've had a similar problem not too long ago in a small tool where I wanted to calculate MD5 signatures of all the files on my hard drive, and I found that the CPU is way too fast compared to the storage system; I got similar results when trying to get more performance by using more threads.
Using the Task Parallel Library didn't alleviate this problem.
First of all, on a 24-core box, if you are using only 4 threads the most CPU they could ever use is 16.7%, so really you are getting 60% utilization, which is fairly good.
It is hard to tell if your program is I/O bound at this point; my guess is that it is. You need to run a profiler on your project and see in which sections of code your project is spending most of its time. If it is sitting on a read/write operation, it is I/O bound.
It is possible you have some form of inter-thread locking in use. That would cause the program to slow down as you add more threads, and yes, running a second process would sidestep that, but fixing your locking would too.
What it all boils down to is that without profiling information we cannot say whether using a second process will speed things up or slow things down; we need to know if the program is hanging on an I/O operation, a locking operation, or just taking a long time in a function that could be parallelized better.
I think you have found that the file cache is not ideal when one process writes data to many files concurrently. The file cache is synced to disk when the number of dirty cache pages exceeds a threshold, and it seems that concurrent writers in one process hit that threshold faster than a single-threaded writer does. You can read about the file system cache here: File Cache Performance and Tuning
Try using the Task library from .NET 4 (System.Threading.Tasks). This library has built-in optimizations for different numbers of processors.
I have no clue what your problem is, maybe because your code snippet is not really informative.
I know there are some existing questions and they provide a very good general perspective on things. I'm hoping to get some details on the C#/VB.Net side for the actual implementation (not philosophy) of some of these perspectives.
My Particular Case
I have a WCF Service which, amongst other things, receives files. For most of the service's life this particular area just sits doing nothing - when work does come, it arrives in high bursts of greatly varying quantities.
For each file received (which at a max can be thousands per second) the service needs to work on the files for between 1-10 seconds (each) depending on a number of other services, local resources, and network IO wait times.
To aid the service with these burst workloads I implemented a Queue system. Those thousands of files received per second are placed onto the Queue. A controller calculates the number of threads to use based on the size of the queue, up until it reaches a "Peak Max Threads" setting which prevents it from creating additional threads. These threads are placed in a thread pool and reused to cycle through the queue. The controller will, at intervals, recalculate the number of threads required. If the queue size reduces, a relevant number of threads are released.
The age-old problem
How many threads should I peak at? Clearly, adding a new thread every time a file is received would be silly, for lack of a better word - performance, at best, would deteriorate. Capping the threads when CPU utilization is only 10% across each core also doesn't seem to be the best use of resources.
So, is there an appropriate way to determine how many threads to cap at? I would rather the service could determine this for itself by sampling available resources, but is there a performance hit from doing so? I know the common answer is to monitor workloads, adjust the counts through trial and error until I find a number I like, but due to the nature of this service (long periods of idle followed by high/burst workloads) it could take a long time to get that kind of information.
What if we then move the server's image to a different host which is faster/slower/different from the first? Do I have to re-sample the process all over again?
Ideally what I'm after is for the co-ordinator to intelligently increase the size of the threadpool until CPU utilisation is at x% (would 80% be reasonable? 90%? 99%?). Clearly, I want to do this without adding more threads than necessary to hit x%; otherwise all I'll end up with is threads not just waiting on IO resources, but waiting on each other too.
Thanks in advance!
Related questions (if you want some generic ideas):
How many threads to create?
How many threads is too many?
How many threads to create and when?
A Complication for you
Where would be the fun if I didn't make the problem more difficult?
As it currently stands, the service does regularly hit 100% CPU during these bursts. The issue is that the CPU utilisation spikes: it goes from idle (0-10%) to 100%, and back down again. I'm not sure I can help that - ideally I wouldn't take it all the way to 100%. The problem exists because the files mentioned are in fact images, and part of the service's process is to pass the image through to the System.Windows.Media blackbox, which does some complex image processing for me.
There are then lulls in between the spikes because of the IO waits and other processing that goes on. If the spikes hitting 100% can't be helped (and I'm all for knowing how to prevent that, or if I should) how should I aim for the CPU utilisation graph to look? Sat constantly at 100%? Bouncing between 50-100? If I do go through the effort of sampling to decide what does seem to work best, is it guaranteed that switching the virtual servers' host will also work best with the same graph?
This added complexity I won't take into consideration for those of you willing to answer. Feel free to ignore this section. However, any answer that also accounts for this complication, or even answers that just provide tips on how to handle it, I'll at the very least upvote!
Heck of a long question - sorry about that - and thanks for reading so much!!
PerformanceCounter allows you to query for processor usage.
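For instance, a small sample of how that counter is typically read (the first NextValue() always returns 0, so you need two samples with a delay in between):

using System;
using System.Diagnostics;
using System.Threading;

var cpu = new PerformanceCounter("Processor", "% Processor Time", "_Total");
cpu.NextValue();                      // prime the counter
Thread.Sleep(1000);
float usagePercent = cpu.NextValue(); // 0..100 across all cores
Console.WriteLine("CPU: {0:F1}%", usagePercent);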
However, have you tried something the framework provides?
foreach (var file in files)
{
    var workitem = file;
    Task.Factory.StartNew(() =>
    {
        // do work on workitem
    }, TaskCreationOptions.LongRunning | TaskCreationOptions.PreferFairness);
}
You can tune the concurrency level for Tasks in the Task.Factory.
The .NET 4 threadpool by default will schedule the number of threads it finds performs best on the hardware where it runs, but you can change how that works with the previous link.
You probably need a custom solution, but it would be worthwhile to benchmark yours against the standard one.
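If you want an explicit cap while benchmarking, Parallel.ForEach exposes one directly (a different entry point than Task.Factory, mentioned only because the knob is built in; 'files' and the work body are placeholders):

using System;
using System.Threading.Tasks;

Parallel.ForEach(files,
    new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
    file =>
    {
        // do work on file
    });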
Edit: (comment note):
No links needed; I may have used an invented term, since English is not my language. What I mean is: have a variable where you store the change since the last check (prevDelta), and call the new change delta. Each time you 'check', add delta to the variable averageDelta and divide by 2. You will have a variable, averageDelta, that will mostly stay low since you usually have no activity. Then have another delta variable that is not the average of all deltas but the average of deltas over a small timespan (you will have to come up with an algorithm to calculate this temporal variance accurately). Once that's done, you can compare averageDelta and the 'temporal delta'. averageDelta will mostly be low and will slowly go up when bursts come; over the same period the temporal delta will go up really fast. Then, when the burst stops, averageDelta goes down slowly while the 'temporal' delta drops really fast.
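A sketch of that idea (the short-window 3:1 weighting is my own placeholder; the comment above deliberately leaves that algorithm to you):

class BurstDetector
{
    double _averageDelta;   // long-term average: moves slowly
    double _temporalDelta;  // short-window average: reacts fast
    double _prevValue;

    public void Check(double currentValue)
    {
        double delta = currentValue - _prevValue;
        _prevValue = currentValue;

        _averageDelta  = (_averageDelta + delta) / 2.0;
        _temporalDelta = (_temporalDelta + delta * 3) / 4.0; // placeholder weighting

        if (_temporalDelta > _averageDelta)
        {
            // burst ramping up: grow the thread pool
        }
        else
        {
            // burst fading: shrink the thread pool
        }
    }
}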
You could use I/O Completion Ports to asynchronously fetch your images without tying up any threads until it comes time to process what you have fetched.
You could then limit your thread pool based on the number of cores on your client PC, making sure to leave a core free for other processes to use.
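A rough illustration of both points, assuming a .NET version with async/await; opening the FileStream with useAsync: true is what routes the reads through I/O completion ports:

using System;
using System.IO;
using System.Threading.Tasks;

static class ImageFetcher
{
    // No thread blocks while the bytes come in; the read completes on an IOCP.
    public static async Task<byte[]> FetchAsync(string path)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read,
                                       FileShare.Read, 4096, useAsync: true))
        {
            var buffer = new byte[(int)fs.Length];
            await fs.ReadAsync(buffer, 0, buffer.Length);
            return buffer;
        }
    }

    // Leave one core free for other processes on the box.
    public static readonly int ProcessingThreads =
        Math.Max(1, Environment.ProcessorCount - 1);
}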
What about a dynamic thread manager that monitors the threads' overall performance and, based on that, spawns new threads or kills old ones? The main problem here is only how to define the performance measurement function. The rest can be done with a periodically scheduled job that increases or decreases the number of threads according to the previous number of threads and the measured performance, or something like that. Maybe also in connection with resource utilization (CPU, disks, network...).
There is a multi-threaded batch processing program that creates multiple worker threads to process each batch.
Now, to scale the application to handle 100 million records, we need to use a server farm to do the processing of each batch process. Is there native support in C# for handling requests running on a server farm? Any thoughts on how to set up the C# executable to work with this setup?
You can either create a manager that distributes the work like fejesjoco said or you can make your apps smart enough to only grab a certain number of units of work to process on. When they have completed processing of those units, have them contact the db server to get the next batch. Rinse and repeat until done.
As a side note, most distributed worker systems run like this (a minimal sketch follows the list):
Work is queued on the server in batches.
Worker processes check in with the server to get a batch to operate on; the available batch is marked as being processed by that worker.
(optional) Worker processes check back in with the server with a status report (e.g., 10% done, 20% done, etc.)
Worker process completes work and submits results.
Go to step 2.
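Here is the minimal shape of steps 2-5; ClaimNextBatch, ProcessBatch, and CompleteBatch are hypothetical stand-ins for whatever queries or stored procedures your server exposes:

using System;
using System.Threading;

class Worker
{
    static void Main()
    {
        string workerId = Environment.MachineName + ":" + Guid.NewGuid();

        while (true)
        {
            int? batchId = ClaimNextBatch(workerId);   // step 2: mark the batch as ours
            if (batchId == null)
            {
                Thread.Sleep(5000);                    // nothing available; poll again
                continue;
            }

            ProcessBatch(batchId.Value);               // step 3: do the work (report progress)
            CompleteBatch(batchId.Value, workerId);    // step 4: submit results
        }                                              // step 5: go back to step 2
    }

    static int? ClaimNextBatch(string workerId) { /* hypothetical DB call */ return null; }
    static void ProcessBatch(int batchId) { /* your batch logic */ }
    static void CompleteBatch(int batchId, string workerId) { /* hypothetical DB call */ }
}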
Another option is to have 3 workers process the exact same data set. This would allow you to compare results. If 2 or more have identical results then you accept those results. If all 3 have different results then you know there is a problem and you need to inspect the data/code. Usually this only happens when the workers are outside of your control (like SETI) or you are running massive calculations and want to correct for potential hardware issues.
Sometimes there is a management app which displays the current number of workers and the progress through the entire set. If you know roughly how long an individual batch takes, then you can detect when a worker has died and let a new process get the same batch.
This allows you to add or remove as many individual workers as you want without having to recode anything.
I don't think there's built-in support for clustering. In the simplest case, you might try creating a simple manager application which divides the input among the servers; your processes will not need to know about each other, so there's no need to rewrite anything.
Why not deploy the app using a distributed framework? I'd recommend the CloudIQ Platform. You can use the platform to distribute your code to any number of servers. It also handles the load balancing, so you would only need to submit your jobs to the framework, and it will handle job distribution to the individual machines. It also monitors application execution, so if one of the machines suffers a failure, the jobs running there will be restarted on another machine in the group.
Check out the Community link for downloads, forums, etc.