I have the following C# code:
string[] files = Directory.GetFiles(@"C:\Users\Documents\DailyData", "*.txt", SearchOption.AllDirectories);
for (int ii = 0; ii < files.Length; ii++)
{
    perFile(files[ii]);
}
where perFile opens each file, creates a series of generic lists, and then writes the results to a file.
Each iteration takes about 15 minutes.
I want the perFile calls to execute concurrently, i.e. not wait for the current iteration to finish before starting the next one.
My questions are:
How would I do this?
How can I control how many instances of perFile are running concurrently?
How can I determine how many concurrent instances of perFile my computer can handle at one time?
Try Parallel.ForEach. You can set the ParallelOptions.MaxDegreeOfParallelism property to limit how many perFile calls run at once.
You can use Parallel.For to do what you need if you're using .NET 4.0 or later. You can control how many threads are used by passing a ParallelOptions instance with MaxDegreeOfParallelism set to a number you like.
Example code:
ParallelOptions POptions = new ParallelOptions();
POptions.MaxDegreeOfParallelism = ReplaceThisWithSomeNumber;
Parallel.For(0, files.Length, POptions, (x) => { perFile(files[x]); });
Note that your program will probably be I/O bound, though (read up on asynchronous I/O for more information on how to tackle that).
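For the third question, a rough way to pick a limit is to start from Environment.ProcessorCount and measure. The sketch below is only an illustration under stated assumptions: PerFile is a hypothetical stand-in that sleeps briefly instead of doing the real work, and the file names are made up. It caps the loop with MaxDegreeOfParallelism and records the highest number of calls that were running at the same moment.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

static class ThrottleDemo
{
    static int running; // calls currently inside PerFile
    static int peak;    // highest value of 'running' observed so far

    // Hypothetical stand-in for the real perFile method.
    static void PerFile(string path)
    {
        int now = Interlocked.Increment(ref running);
        int seen;
        // Raise 'peak' to 'now' with a compare-and-swap loop if 'now' is larger.
        do { seen = Volatile.Read(ref peak); }
        while (now > seen && Interlocked.CompareExchange(ref peak, now, seen) != seen);
        Thread.Sleep(20); // simulate work
        Interlocked.Decrement(ref running);
    }

    public static int RunAndGetPeak(string[] files, int maxParallel)
    {
        running = 0;
        peak = 0;
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallel };
        Parallel.ForEach(files, options, path => PerFile(path));
        return peak;
    }

    static void Main()
    {
        string[] files = Enumerable.Range(0, 32).Select(n => "file" + n + ".txt").ToArray();
        int cap = Math.Max(1, Environment.ProcessorCount);
        Console.WriteLine("cap = " + cap + ", peak = " + RunAndGetPeak(files, cap));
    }
}
```

The observed peak never exceeds the cap, but it may be lower. For work that is mostly disk-bound, the best cap is often smaller than the core count, so timing a few values is more reliable than guessing.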
Related
There is a C# function A(arg1, arg2) which needs to be called many times. To do this as fast as possible, I am using parallel programming.
Take the example of the following code:
long totalCalls = 2000000;
int threads = Environment.ProcessorCount;
ParallelOptions options = new ParallelOptions();
options.MaxDegreeOfParallelism = threads;
Parallel.ForEach(Enumerable.Range(1, threads), options, range =>
{
    // each worker handles an equal share of the calls
    for (int i = 0; i < totalCalls / threads; i++)
    {
        // init arg1 and arg2
        var value = A(arg1, arg2);
        // do something with value
    }
});
Now the issue is that this is not scaling up with an increase in number of cores; e.g. on 8 cores it is using 80% of CPU and on 16 cores it is using 40-50% of CPU. I want to use the CPU to maximum extent.
You may assume A(arg1, arg2) internally contains a complex calculation, but it doesn't have any IO or network-bound operations, and also there is no thread locking. What are other possibilities to find out which part of the code is making it not perform in a 100% parallel manner?
I also tried increasing the degree of parallelism, e.g.
int threads = Environment.ProcessorCount * 2;
// AND
int threads = Environment.ProcessorCount * 4;
// etc.
But it was of no help.
Update 1 - if I run the same code after replacing A() with a simple function that calculates prime numbers, it utilizes 100% CPU and scales up well. So this shows that the rest of the code is correct, and the issue must be within the original function A(). I need a way to detect what is causing some sort of sequencing.
You have determined that the code in A is the problem.
There is one very common problem: Garbage collection. Configure your application in app.config to use the concurrent server GC. The Workstation GC tends to serialize execution. The effect is severe.
If this is not the problem, pause the debugger a few times and look at the Debug -> Parallel Stacks window. There, you can see what your threads are doing. Look for common resources and contention. For example, if you find many threads waiting for a lock, that's your problem.
Another nice debugging technique is commenting out code. Once the scalability limit disappears you know what code caused it.
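Before changing any configuration, it may help to confirm which GC mode the process is actually running with. This is a minimal sketch using System.Runtime.GCSettings; the class name GcCheck is only illustrative.

```csharp
using System;
using System.Runtime;

static class GcCheck
{
    // Reports whether the server GC is active and the current latency mode.
    public static string Describe()
    {
        return "Server GC: " + GCSettings.IsServerGC +
               ", latency mode: " + GCSettings.LatencyMode;
    }

    static void Main()
    {
        Console.WriteLine(Describe());
        // Gen 0 collection count so far; a rapidly growing number during the
        // parallel loop suggests heavy allocation inside the worker code.
        Console.WriteLine("Gen 0 collections: " + GC.CollectionCount(0));
    }
}
```

If IsServerGC reports False on .NET Framework, the server GC is switched on with a gcServer element (enabled="true") under the runtime section of app.config.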
Background
Historically, I've not had to write too much thread related code. I am familiar with the process and I can use System.Threading to do what I need to do when the need arises. However, the bulk of my experience has been stuck in .Net 2.0 world (ugh!) and I need a bit of help executing a very simple task in the most recent version of .Net.
I need to write a simple, parallel job but I would prefer it to execute as a low-priority task that doesn't bring my PC to a sluggish halt.
Below is a simple parallel example that demonstrates the type of work that I'm trying to accomplish. In this example, I take a random number and I store it if the new value is larger than the last, largest random number that has been discovered.
Please understand, the point of this example is to strictly show that I have a calculation that I wish to repeatedly execute, compare and store. I understand the need for variable locking and I also know that this code isn't perfect. However, it is an extremely simple example of the nature of what I need to accomplish. If you run this code, your CPU will come to a grinding halt.
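int i = 0;
ThreadSafeRNG r = new ThreadSafeRNG();
ParallelOptions options = new ParallelOptions();
options.MaxDegreeOfParallelism = 4; // Assume I have 8 cores.
Parallel.For(0, Int32.MaxValue, options, (j, loopState) =>
{
    int k = r.Next();
    if (k > i) i = k; // keep the largest value seen so far (unsynchronized, for illustration only)
});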
How can I do this work in parallel and have it execute as a low-priority CPU job? The above mechanism has provided tremendous performance improvements over a standard for-loop. However, the cost has been that I can't use my computer, even after setting the MaxDegreeOfParallelism option.
What can I do?
Because Parallel.For uses ThreadPool threads, you cannot set the priority of those threads to be low. However, you can change the priority of the entire application:
Process currentProcess = Process.GetCurrentProcess();
ProcessPriorityClass oldPriority = currentProcess.PriorityClass;
try
{
    currentProcess.PriorityClass = ProcessPriorityClass.BelowNormal;
    int i = 0;
    ThreadSafeRNG r = new ThreadSafeRNG();
    ParallelOptions options = new ParallelOptions();
    options.MaxDegreeOfParallelism = 4; // Assume I have 8 cores.
    Parallel.For(0, Int32.MaxValue, options, (j, loopState) =>
    {
        int k = r.Next();
        if (k > i) i = k; // keep the largest value seen so far (unsynchronized, for illustration only)
    });
}
finally
{
    // Bring the priority back up to the original level.
    currentProcess.PriorityClass = oldPriority;
}
If you really don't want your whole application's priority to be lowered, combine this trick with some form of IPC like WCF, and have your slow, long-running operation run in a second process that you start up and kill as needed.
Using the Parallel.For method, you cannot set thread priority since they are ThreadPool threads.
Your best bet is to set ParallelOptions.MaxDegreeOfParallelism to Environment.ProcessorCount - 1. This way you have a free core to do other things (such as handling the GUI).
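As a rough sketch of the "leave one core free" idea combined with a safe compare-and-store, the example below uses the Parallel.For overload with thread-local state. The Value function is a deterministic stand-in for the random source (an assumption made so the result can be checked), and the class and method names are illustrative.

```csharp
using System;
using System.Threading.Tasks;

static class LocalMaxDemo
{
    // Deterministic stand-in for the random number source.
    static int Value(int j) { return (int)(((long)j * 7919) % 10007); }

    public static int ParallelMax(int count)
    {
        var options = new ParallelOptions
        {
            // Leave one core free, but never go below 1.
            MaxDegreeOfParallelism = Math.Max(1, Environment.ProcessorCount - 1)
        };
        int globalMax = int.MinValue;
        object gate = new object();
        Parallel.For(0, count, options,
            () => int.MinValue,                                       // per-worker starting value
            (j, loopState, localMax) => Math.Max(localMax, Value(j)), // no locking in the hot path
            localMax =>
            {
                // Merge each worker's result once, under a lock.
                lock (gate) { if (localMax > globalMax) globalMax = localMax; }
            });
        return globalMax;
    }

    public static int SequentialMax(int count)
    {
        int max = int.MinValue;
        for (int j = 0; j < count; j++) max = Math.Max(max, Value(j));
        return max;
    }

    static void Main()
    {
        Console.WriteLine(ParallelMax(5000) == SequentialMax(5000));
    }
}
```

Each worker keeps its own running maximum and takes the lock only once at the end, which avoids contention on the shared variable inside the loop body.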
I have a List of items, and I would like to go through each item, create a task and launch the task. But I want it to run in batches of 10 tasks at a time.
For example, if I have 100 URLs in a list, I want it to group them into batches of 10, and loop through the batches, getting the web responses from 10 URLs per batch iteration.
Is this possible?
I am using C# 5 and .NET 4.5.
You can use Parallel.For() or Parallel.ForEach(), they will execute the work on a number of Tasks.
When you need precise control over the batches, you could use a custom Partitioner, but given that the problem is about URLs, it will probably make more sense to use the more common MaxDegreeOfParallelism option.
The Partitioner has a good algorithm for creating the batches depending also on the number of cores.
Parallel.ForEach(Partitioner.Create(fromInclusive, toExclusive), range =>
{
    // each range is a Tuple<int, int>: Item1 is inclusive, Item2 is exclusive
    for (int i = range.Item1; i < range.Item2; i++)
    {
        // ... process i
    }
});
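If strict batches of 10 are really what is wanted, one alternative to the Parallel class is to start 10 asynchronous operations and await them together before moving on. This is a minimal sketch, not the answerer's method: FetchAsync is a hypothetical stand-in that delays briefly instead of issuing a real web request, and the URLs are made up.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

static class BatchDemo
{
    // Hypothetical stand-in for a real web request; returns a dummy result.
    static async Task<int> FetchAsync(string url)
    {
        await Task.Delay(10);
        return url.Length;
    }

    public static async Task<List<int>> FetchInBatchesAsync(IList<string> urls, int batchSize)
    {
        var results = new List<int>();
        for (int start = 0; start < urls.Count; start += batchSize)
        {
            // Start every request in this batch, then wait for all of them.
            List<Task<int>> batch = urls.Skip(start).Take(batchSize)
                                        .Select(u => FetchAsync(u))
                                        .ToList();
            results.AddRange(await Task.WhenAll(batch));
        }
        return results;
    }

    static void Main()
    {
        List<string> urls = Enumerable.Range(0, 100)
                                      .Select(n => "http://example.com/page" + n)
                                      .ToList();
        List<int> results = FetchInBatchesAsync(urls, 10).Result;
        Console.WriteLine(results.Count + " responses");
    }
}
```

A fixed batch waits for its slowest member before the next batch starts, so a limit on concurrency (such as MaxDegreeOfParallelism) usually keeps the connection pool busier than strict batching.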
I want to access a web server using HttpWebRequest and fetch thousands of records from a given range of pages. Each hit to a web page fetches 15 records, and there are almost 8,000 to 10,000 pages on the web server. That means 8,000 to 10,000 hits to the server for a total of at least 120,000 records! Done naively in a single thread, the task can be very time-consuming. Hence, multithreading is the immediate solution that comes to mind.
Currently, I have created a worker class for searching. That worker class will spawn 5 subworkers (threads) to search in a given range. But, due to my novice abilities in threading, I am unable to make it work, as I am having trouble synchronizing them and making them all work together. I know about delegates, actions, and events in .NET, but making them work with threads is getting confusing. This is the code that I am using:
public void Start()
{
    this.totalRangePerThread = ((this.endRange - this.startRange) / this.subWorkerThreads.Length);
    for (int i = 0; i < this.subWorkerThreads.Length; ++i)
    {
        //theThreads[counter] = new Thread(new ThreadStart(MethodName));
        this.subWorkerThreads[i] = new Thread(() => searchItem(this.startRange, this.totalRangePerThread));
        //this.subWorkerThreads[i].Start();
        this.startRange = this.startRange + this.totalRangePerThread;
    }
    for (int threadIndex = 0; threadIndex < this.subWorkerThreads.Length; ++threadIndex)
        this.subWorkerThreads[threadIndex].Start();
}
The searchItem method:
public void searchItem(int start, int pagesToSearchPerThread)
{
    for (int count = 0; count < pagesToSearchPerThread; ++count)
    {
        //searching routine here
    }
}
The problem exists between the shared variables of the threads, can anyone guide me how to make it a threadsafe procedure?
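The real problem you're facing is that the lambda expression in the Thread constructor captures the outer variable (startRange) itself rather than its current value. The lambda only runs once the thread starts, and by then the loop has already changed startRange, so the threads do not see the ranges you intended. One way to fix it is to make a copy of the variable, like this: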
for (int i = 0; i < this.subWorkerThreads.Length; ++i)
{
    var copy = startRange;
    this.subWorkerThreads[i] = new Thread(() => searchItem(copy, this.totalRangePerThread));
    this.startRange = this.startRange + this.totalRangePerThread;
}
For more information on creating and starting threads, see Joe Albahari's excellent ebook (there's also a section on captured variables a bit further down). If you want to learn about closures, see this question.
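The difference between capturing the variable and capturing a copy can be seen in a small, self-contained sketch. The names here (CaptureDemo, Run, step) are illustrative and not part of the original code; the threads are started only after the loop has finished, mirroring the Start method in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

static class CaptureDemo
{
    // Returns the value each of three threads observed for the range start.
    public static int[] Run(bool copyFirst)
    {
        int start = 0;
        const int step = 10;
        int[] seen = new int[3];
        var threads = new List<Thread>();
        for (int i = 0; i < seen.Length; i++)
        {
            int slot = i;     // copy of the loop variable so each thread writes its own slot
            int copy = start; // snapshot of 'start' taken in this iteration
            threads.Add(copyFirst
                ? new Thread(() => seen[slot] = copy)    // captures the snapshot
                : new Thread(() => seen[slot] = start)); // captures the shared variable
            start += step;
        }
        foreach (Thread t in threads) t.Start();
        foreach (Thread t in threads) t.Join();
        return seen;
    }

    static void Main()
    {
        Console.WriteLine(string.Join(",", Run(false))); // prints 30,30,30
        Console.WriteLine(string.Join(",", Run(true)));  // prints 0,10,20
    }
}
```

Without the copy, every thread reads the variable after the loop has finished and sees its final value. With the copy, each thread gets the value that was current when its lambda was created.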
The first answer is that these threads don't really need that much work to share variables (assuming I'm understanding you correctly). They have some shared input variables (the target web server, for example), but those are thread-safe because they aren't being changed. The plan is that they'll build a database or some such containing the records they retrieve. You should be fine to just have each of the five fill their own input archive, and then merge them in a single thread once all the subworker threads are done. If somehow the architecture that you're using to store the data makes that expensive... well, how much you're planning to store and what you're planning to store it in becomes pertinent, then, and perhaps you could share what those are?
I have a list of large text files to process. I wonder which is the fastest method, because reading line by line is slow.
I have something like that:
int cnt = this.listView1.Items.Count;
for (int i = 0; i < this.listView1.Items.Count; i++)
{
    FileStream fs = new FileStream(this.listView1.Items[i].Text.ToString(), FileMode.Open, FileAccess.Read);
    using (StreamReader reader = new StreamReader(fs))
        while (reader.Peek() != -1)
        {
            //code part
        }
}
I read that using blocks (like 100k lines each) via BackgroundWorkers with multiple threads would help, but I don't know how to implement it. If you have better ideas to improve the performance, your expert advice would be appreciated.
First you need to decide what is your bottleneck - I/O (reading the files) or CPU (processing them). If it's I/O, reading multiple files concurrently is not going to help you much, the most you can achieve is have one thread read files, and another process them. The processing thread will be done before the next file is available.
I agree with @asawyer: if it's only 100MB, you should read the file entirely into memory in one swoop. You might as well read 5 of them entirely into memory; it's really not a big deal.
EDIT: After realizing all the files are on a single hard-drive, and that processing takes longer than reading the file.
You should have one thread reading the files sequentially. Once a file is read, fire up another thread that handles the processing, and start reading the second file in the first thread. Once the second file is read, fire up another thread, and so on.
You should make sure you don't fire more processing threads than the numbers of cores you have, but for starters just use the thread-pool for this, and optimize later.
You're missing a little bit of performance, because the time you spend reading the first file is not used for any processing. This should be negligible; reading 100MB of data into memory shouldn't take more than a few seconds.
I assume that you are processing files line by line. You also said that loading of files is faster than processing them. There are few ways you can do what you need. One for example:
Create a thread that reads the files one by one, line by line. Do this sequentially, because reading in parallel will only hammer your HDD and possibly give worse results. You can use a Queue<string> for that, guarded by a lock because Queue<T> is not thread-safe on its own. Use Queue.Enqueue() to add the lines you've read.
Run another thread that is processing the queue. Use Queue.Dequeue() to get (and remove) lines from beginning of your queue. Process the line and write it to the output file. Eventually you can put processed lines in another queue or list and write them at once when you finish processing.
If order of lines in output file is not important you can create as many threads as you have CPU cores (or use ThreadPool class) to do the processing (that would speed up things significantly).
[Edit]
If order of lines in the output file is important you should limit line processing to one thread. Or process them in parallel using separate threads and implement mechanism that would control output order. For example you may do that by numbering lines you read from input file (the easy way) or processing lines by each thread in chunks of n-lines and writing output chunk by chunk in the same order you started processing threads.
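The reader-plus-workers arrangement described above can be sketched with BlockingCollection<T> (available from .NET 4), which takes care of the locking that a plain Queue<string> would need. This is a swapped-in technique rather than the answerer's exact design, and the "processing" here is a placeholder that sums line lengths so the result is easy to check; all names are illustrative.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

static class PipelineDemo
{
    public static int Process(IEnumerable<string> lines, int workers)
    {
        // Bounded so a fast reader cannot fill memory ahead of slow workers.
        var queue = new BlockingCollection<string>(boundedCapacity: 1000);
        int totalLength = 0;

        // Single producer: in real code this would read the files sequentially.
        Task producer = Task.Factory.StartNew(() =>
        {
            try { foreach (string line in lines) queue.Add(line); }
            finally { queue.CompleteAdding(); } // always release the consumers
        }, TaskCreationOptions.LongRunning);

        var consumers = new Task[workers];
        for (int w = 0; w < workers; w++)
        {
            consumers[w] = Task.Factory.StartNew(() =>
            {
                int local = 0;
                // Blocks until a line is available; ends after CompleteAdding.
                foreach (string line in queue.GetConsumingEnumerable())
                    local += line.Length; // placeholder for real processing
                Interlocked.Add(ref totalLength, local);
            }, TaskCreationOptions.LongRunning);
        }

        producer.Wait();
        Task.WaitAll(consumers);
        return totalLength;
    }

    static void Main()
    {
        Console.WriteLine(Process(new[] { "ab", "cde", "f" }, 2)); // prints 6
    }
}
```

With more than one consumer, lines are processed in no particular order, so this form fits the case where output order does not matter. Preserving order would need line numbering or chunking as described above.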
Here is some simple threading code you can use (.NET 4):
// first get the file paths from the ListView so you won't block the UI thread
List<string> filesPaths = new List<string>();
for (int i = 0; i < this.listView1.Items.Count; i++)
{
    filesPaths.Add(listView1.Items[i].Text.ToString());
}
// this loop reads up to 50 files at the same time
Parallel.ForEach(filesPaths, new ParallelOptions() { MaxDegreeOfParallelism = 50 }, (filepath, loopState, index) =>
{
    // read file contents
    string data = File.ReadAllText(filepath);
    // do whatever you want with the contents
});
not tested though...