A few days ago I tried to write a fast search over my disks that does a few things per item: read attributes and extensions, perform changes inside files, etc.
The idea was to build it with as few limitations/locks as possible, to avoid "latency" on big files or directories with lots of files inside.
I know it's far from "best practices", since I'm not using things like "MaxDegreeOfParallelism" and I rely on polling loops with "while(true)".
Even so, the code runs quite fast since the architecture supports it.
I moved the code to a dummy console project in case anybody would like to check what's going on.
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading;
using System.Threading.Tasks;
class Program
{
static ConcurrentQueue<String> dirToCheck;
static ConcurrentQueue<String> fileToCheck;
static int fileCount;
static void Main(string[] args)
{
Initialize();
Task.Factory.StartNew(() => ScanDirectories(), TaskCreationOptions.LongRunning);
Task.Factory.StartNew(() => ScanFiles(), TaskCreationOptions.LongRunning);
Console.ReadLine();
}
static void Initialize()
{
//Instantiate caches
dirToCheck = new ConcurrentQueue<string>();
fileToCheck = new ConcurrentQueue<string>();
//Enqueue Directory to Scan here
//Avoid enqueuing nested/sub-directories, else they will be scanned at least twice
dirToCheck.Enqueue(@"C:\");
//Initialize counters
fileCount = 0;
}
static void ScanDirectories()
{
String dirToScan = null;
while (true)
{
if (dirToCheck.TryDequeue(out dirToScan))
{
ExtractDirectories(dirToScan);
ExtractFiles(dirToScan);
}
//Just here as a visual tracker, to get some idea of what's going on and where the load is
Console.WriteLine(dirToCheck.Count + "\t\t" + fileToCheck.Count + "\t\t" + fileCount);
}
}
static void ScanFiles()
{
while (true)
{
String fileToScan = null;
if (fileToCheck.TryDequeue(out fileToScan))
{
CheckFileAsync(fileToScan);
}
}
}
private static Task ExtractDirectories(string dirToScan)
{
Task worker = Task.Factory.StartNew(() =>
{
try
{
Parallel.ForEach<String>(Directory.EnumerateDirectories(dirToScan), (dirPath) =>
{
dirToCheck.Enqueue(dirPath);
});
}
catch (UnauthorizedAccessException) { }
}, TaskCreationOptions.AttachedToParent);
return worker;
}
private static Task ExtractFiles(string dirToScan)
{
Task worker = Task.Factory.StartNew(() =>
{
try
{
Parallel.ForEach<String>(Directory.EnumerateFiles(dirToScan), (filePath) =>
{
fileToCheck.Enqueue(filePath);
});
}
catch (UnauthorizedAccessException) { }
}, TaskCreationOptions.AttachedToParent);
return worker;
}
static Task CheckFileAsync(String filePath)
{
Task worker = Task.Factory.StartNew(() =>
{
//Add statement to play along with the file here
Interlocked.Increment(ref fileCount);
//WARNING!!! If the file's full name is too long, this code may not execute or may just crash.
//I only put a simple check here because I found 2 or 3 different error messages between the framework and the MSDN documentation:
//"Full paths must not exceed 260 characters to maintain compatibility with Windows operating systems. For more information about this restriction, see the entry Long Paths in .NET in the BCL Team blog"
if (filePath.Length > 260)
return;
FileInfo fi = new FileInfo(filePath);
//Add statement here to use FileInfo
}, TaskCreationOptions.AttachedToParent);
return worker;
}
}
Problems:
How can I detect that ScanDirectories is done?
Once it is, I can manage to enqueue an empty string (or whatever) into the file queue to exit it.
I know that if I use AttachedToParent I get a completion state on the parent task, and can then do something like ContinueWith(() => { /* some code to notice the end */ }).
But the parent task is still polling, stuck in a kind of infinite loop, and each sub-statement starts a new task.
On the other hand, I cannot simply test Count on each queue, because I might have flushed the file list and directory list while another task is still about to call EnumerateDirectories().
I'm trying to find some kind of "reactive" solution and avoid an if() inside the loop that would be checked 80% of the time for nothing, since it's a simple while(true) {} with async calls.
PS: I know I could use TPL Dataflow; I'm not, because I'm stuck on .NET 4.0 for now. Still, I'm curious how this would look in .NET 4.5 without Dataflow, since there are a few improvements in the TPL.
Instead of ConcurrentQueue<T>, you could use BlockingCollection<T>.
BlockingCollection<T> is designed specifically for producer/consumer scenarios such as this, and provides a CompleteAdding method so the producer can notify the consumers that it has finished adding work.
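For illustration, here is a minimal sketch of that pattern (the class name and the file-checking body are mine, not from the question). The consumer blocks while the collection is empty, and its foreach ends on its own once CompleteAdding has been called and the collection drains:
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class BlockingCollectionSketch
{
    static readonly BlockingCollection<string> fileToCheck = new BlockingCollection<string>();

    static void Main()
    {
        var consumer = Task.Factory.StartNew(() =>
        {
            // Blocks when empty - no spinning, no if() checked 80% of the time for nothing.
            foreach (string file in fileToCheck.GetConsumingEnumerable())
            {
                Console.WriteLine("Checking " + file);
            }
        }, TaskCreationOptions.LongRunning);

        fileToCheck.Add(@"C:\a.txt");  // producers call Add instead of Enqueue
        fileToCheck.Add(@"C:\b.txt");
        fileToCheck.CompleteAdding();  // producer signals "no more work"

        consumer.Wait();               // returns once the queue has drained
    }
}
Everything used here is available in .NET 4.0.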
Related
I have a .NET 6 console app where I use several BlockingCollections to process files that are dropped in a folder. I watch the folder using .NET's FileSystemWatcher.
In the Created event, I use a Channel to handle the processing, which is done in two phases; after each phase the result item is moved to a BlockingCollection that is then consumed by the next phase.
Program.cs
public static async Task Main(string[] args)
{
BlockingCollection<FileMetadata> _fileMetaDataQueue = new BlockingCollection<FileMetadata>();
var channel = Channel.CreateUnbounded<FileSystemEventArgs>();
// Start a task to monitor the channel and process notifications
var notificationProcessor = Task.Run(() => ProcessNotifications(channel, _fileMetaDataQueue));
Task fileCopyingTask = Task.Run(() => fileCopyThread.Start()); //injected using DI
Task processMovedFile = Task.Run(() => ProcessDestinationThread.Start()); //injected using DI
Task retryOnErrorTask = Task.Run(() => RetryOnErrorThread.Start()); //injected using DI
using var watcher = new FileSystemWatcher(sourceFolder); //C:\temp
// other fw related config
watcher.Created += (sender, e) => channel.Writer.WriteAsync(e);
}
private static async Task ProcessNotifications(Channel<FileSystemEventArgs> channel, BlockingCollection<FileMetadata> queue)
{
await foreach (var e in channel.Reader.ReadAllAsync())
{
Thread.Sleep(300); // So the file is released after it is dropped
try
{
// Process the file and add its name and extension to the queue
FileMetadata fileInfo = ExtractFileMetadata(e.FullPath); //processing method
queue.Add(fileInfo);
}
catch (Exception ex)
{
// logging etc
}
}
}
The BlockingCollection queue is then consumed in the FileCopyThread class, with its Start() method exposed (and called):
FileCopyThread.cs
BlockingCollection<FileMetadata> resultQueue = new();
BlockingCollection<FileMetadata> retryQueue = new();
public async Task Start()
{
await Task.Run(() => {
ProcessQueue();
});
}
private void ProcessQueue()
{
// Since IsCompleted is never set, it will always run
while (!fileMetadataQueue.IsCompleted)
{
// Try to remove an item from the queue
if (fileMetadataQueue.TryTake(out FileMetadata result))
{
// Copy the file to a new location
var newFileLocation = processorOps.MoveFile(result); // move file to other path
// Add the new file location to the result queue
if (newFileLocation != String.Empty)
{
result.newFileLocation = newFileLocation;
resultQueue.Add(result);
}
else {
retryQueue.Add(result);
}
}
}
}
The ProcessDestinationThread and RetryOnErrorThread work in exactly the same way, but do some different processing, and consume the resultQueue and the retryQueue, respectively.
Now when I run this app, it works fine and everything gets processed as expected, but my CPU and power usage sits between 85% and 95%, which is huge, IMO, and it stays there even when the app is not processing anything, just sitting idle. I figured this is because of all the infinite loops, but how can I remedy this?
Bird's-eye view: what I would like is that when the FileSystemWatcher's Created event is not firing (i.e. no files are dropped), all the queues behind it can sit idle, so to speak. No need for constant checking then.
I thought about calling CompleteAdding() on the BlockingCollection<T>, but it seems that I cannot reverse that, and the app is supposed to run indefinitely: even if the drop folder is empty, it might receive new files at any time.
Is there a way how I can reduce the CPU usage of my application?
Ps. I am aware that this code is not a fully working example. The real code is far more complex than this, and I had to remove a lot of stuff that is distracting. If you think any pieces of relevant code are missing, I can provide them. I hope this code will at least make clear what I am trying to achieve.
private void ProcessQueue()
{
while (!fileMetadataQueue.IsCompleted)
{
if (fileMetadataQueue.TryTake(out FileMetadata result))
{
//...
}
}
}
This pattern for consuming a BlockingCollection<T> is incorrect. It causes a tight loop that unproductively burns a CPU core. The correct pattern is to use the GetConsumingEnumerable method:
private void ProcessQueue()
{
foreach (FileMetadata result in fileMetadataQueue.GetConsumingEnumerable())
{
//...
}
}
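GetConsumingEnumerable blocks while the collection is empty, so an idle pipeline costs essentially no CPU, and the foreach completes on its own if CompleteAdding is ever called.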
I'm currently working on a concurrent file downloader.
For that reason I want to parameterize the number of concurrent tasks. I don't want to wait for all the tasks to complete, but to keep the same number running at all times.
In fact, this thread on Stack Overflow gave me a proper clue, but I'm struggling to make it async:
Keep running a specific number of tasks
Here is my code:
public async Task StartAsync()
{
var semaphore = new SemaphoreSlim(1, _concurrentTransfers);
var queueHasMessages = true;
while (queueHasMessages)
{
try {
await Task.Run(async () =>
{
await semaphore.WaitAsync();
await asyncStuff();
});
}
finally {
semaphore.Release();
};
}
}
But the code just executes one task at a time. I think the await is blocking me from generating the desired number of tasks, but I don't know how to avoid that while respecting the limit established by the semaphore.
If I add all the tasks to a list and do a WhenAll, the semaphore throws an exception since it has reached the max count.
Any suggestions?
It was brought to my attention that the struck-through solution will drop any exceptions that occur during execution. That's bad.
Here is a solution that will not drop exceptions:
Task.Run is a factory method for creating a Task. You can check this yourself via the IntelliSense return value. You can assign the returned Task anywhere you like.
await is an operator that waits until the task it operates on completes. You can use the await operator on any Task.
public static async Task RunTasksConcurrently()
{
IList<Task> tasks = new List<Task>();
for (int i = 1; i < 4; i++)
{
tasks.Add(RunNextTask());
}
foreach (var task in tasks) {
await task;
}
}
public static async Task RunNextTask()
{
while(true) {
await Task.Delay(500);
}
}
By adding the Tasks we create to a list, we can await them later on in execution.
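As an aside (this is not from the answer above): the semaphore-based throttling the question was aiming for is usually written like this, releasing in a finally block so a failed download still frees a slot. A rough sketch; DownloadAsync is a hypothetical stand-in for the real transfer:
public static async Task StartAsync(IEnumerable<string> urls, int concurrentTransfers)
{
    var semaphore = new SemaphoreSlim(concurrentTransfers, concurrentTransfers);
    var tasks = new List<Task>();
    foreach (string url in urls)
    {
        await semaphore.WaitAsync(); // parks here once 'concurrentTransfers' downloads are in flight
        tasks.Add(Task.Run(async () =>
        {
            try { await DownloadAsync(url); }  // hypothetical download method
            finally { semaphore.Release(); }   // always free the slot
        }));
    }
    await Task.WhenAll(tasks); // surfaces any exceptions instead of dropping them
}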
Previous Answer below
Edit: With the clarification I think I understand better.
Instead of running every task at once, you want to start 3 tasks, and as soon as a task is finished, run the next one.
I believe this can happen using the .ContinueWith(Action<Task>) method.
See if this gets closer to your intended solution.
public void SpawnInitialTasks()
{
for (int i = 0; i < 3; i++)
{
RunNextTask();
}
}
public void RunNextTask()
{
Task.Run(async () => await Task.Delay(500))
.ContinueWith(t => RunNextTask());
// Recurse here to keep running tasks whenever we finish one.
}
The idea is that we spawn 3 tasks right away, then whenever one finishes we spawn the next. If you need to keep data flowing between the tasks, you can use parameters:
RunNextTask(DataObject object)
You can do this easily the old-fashioned way without using await by using Parallel.ForEach(), which lets you specify the maximum number of concurrent threads to use.
For example:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
namespace Demo
{
class Program
{
public static void Main(string[] args)
{
IEnumerable<string> filenames = Enumerable.Range(1, 100).Select(x => x.ToString());
Parallel.ForEach(
filenames,
new ParallelOptions { MaxDegreeOfParallelism = 4},
download
);
}
static void download(string filepath)
{
Console.WriteLine("Downloading " + filepath);
Thread.Sleep(1000); // Simulate downloading time.
Console.WriteLine("Downloaded " + filepath);
}
}
}
If you run this and observe the output, you'll see that the "files" are "downloaded" in batches.
A better simulation is to change download() so that it takes a random amount of time to process each "file", like so:
static Random rng = new Random();
static void download(string filepath)
{
Console.WriteLine("Downloading " + filepath);
Thread.Sleep(500 + rng.Next(1000)); // Simulate random downloading time.
Console.WriteLine("Downloaded " + filepath);
}
Try that and see the difference in the output.
However, if you want a more modern way to do this, you could look into the Dataflow part of the TPL (Task Parallel Library) - this works well with async methods.
This is a lot more complicated to get to grips with, but it's a lot more powerful. You could use an ActionBlock to do it, but describing how to do that is a bit beyond the scope of an answer I could give here.
Have a look at this other answer on StackOverflow; it gives a brief example.
Also note that TPL Dataflow is not built in to .Net - you have to get it from NuGet.
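For a taste of it anyway, here is a minimal sketch of the ActionBlock approach (DownloadAsync is a stand-in of mine, not a real API): the block caps concurrency through ExecutionDataflowBlockOptions, and Completion lets you await the drain.
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow; // NuGet package: System.Threading.Tasks.Dataflow

static class DataflowSketch
{
    public static async Task DownloadAllAsync(string[] filenames)
    {
        var downloader = new ActionBlock<string>(
            filepath => DownloadAsync(filepath),  // hypothetical async download
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });

        foreach (var filepath in filenames)
            downloader.Post(filepath);  // enqueue work; never blocks on an unbounded block

        downloader.Complete();          // no more items will arrive
        await downloader.Completion;    // wait for in-flight downloads to finish
    }

    static Task DownloadAsync(string filepath) => Task.Delay(1000); // simulate download time
}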
I have a queue of jobs which can be populated by multiple threads (ConcurrentQueue<MyJob>). I need to execute these jobs continuously and asynchronously (i.e. not on the main thread), but only on one thread at a time. I've tried something like this:
public class ConcurrentLoop {
private static ConcurrentQueue<MyJob> _concurrentQueue = new ConcurrentQueue<MyJob>();
private static Task _currentTask;
private static object _lock = new object();
public static void QueueJob(MyJob job)
{
_concurrentQueue.Enqueue(job);
checkLoop();
}
private static void checkLoop()
{
if ( _currentTask == null || _currentTask.IsCompleted )
{
lock (_lock)
{
if ( _currentTask == null || _currentTask.IsCompleted )
{
_currentTask = Task.Run(() =>
{
MyJob current;
while( _concurrentQueue.TryDequeue( out current ) )
//Do something
});
}
}
}
}
}
This code, in my opinion, has a problem: if the task is finishing (TryDequeue returns false, but the task has not been marked as completed yet) and I get a new job at that exact moment, the job will not be executed. Am I right? If so, how do I fix this?
Your problem statement looks like a producer-consumer problem, with a caveat that you only want a single consumer.
There is no need to reimplement such functionality manually.
Instead, I suggest using BlockingCollection -- internally it uses a ConcurrentQueue, and you pair it with a separate thread for the consumption.
Note, that this may or may not be suitable for your use case.
Something like:
_blockingCollection = new BlockingCollection<your type>(); // you may want to create bounded or unbounded collection
_consumingThread = new Thread(() =>
{
foreach (var workItem in _blockingCollection.GetConsumingEnumerable()) // blocks when there is no more work to do, continues whenever a new item is added.
{
// do work with workItem
}
});
_consumingThread.Start();
Multiple producers (tasks or threads) can add work items to the _blockingCollection, no problem, and there is no need to worry about synchronizing producers and consumer.
When you are done producing, call _blockingCollection.CompleteAdding() (this method is not thread-safe, so it is advised to stop all producers beforehand).
You should probably also call _consumingThread.Join() somewhere to terminate your consuming thread.
I would use Microsoft's Reactive Framework Team's Reactive Extensions (NuGet "System.Reactive") for this. It's a lovely abstraction.
public class ConcurrentLoop
{
private static Subject<MyJob> _jobs = new Subject<MyJob>();
private static IDisposable _subscription =
_jobs
.Synchronize()
.ObserveOn(Scheduler.Default)
.Subscribe(job =>
{
//Do something
});
public static void QueueJob(MyJob job)
{
_jobs.OnNext(job);
}
}
This nicely synchronizes all incoming jobs into a single stream and pushes execution onto Scheduler.Default (which is basically the thread pool), but because all input has been serialized, only one job can happen at a time. The nice thing about this is that it releases the thread if there is a significant gap between values. It's a very lean solution.
To clean up, you just need to call either _jobs.OnCompleted(); or _subscription.Dispose();.
SITUATION
Currently in my project I have 3 Workers, each with a working loop inside, and one CommonWork class object containing the work methods (DoFirstTask, DoSecondTask, DoThirdTask) that the Workers can call. Each work method must be executed mutually exclusively with respect to the others. Each method spawns more nested Tasks that are waited on until they finish.
PROBLEM
When all 3 Workers are started, either 2 Workers perform at roughly the same speed and the 3rd lags behind, or the 1st Worker is super fast, the 2nd a bit slower, and the 3rd very slow; it depends on the real-world load.
BIZARRENESS
When only 2 Workers are working, they share the work nicely and perform at the same speed.
What's more interesting: even though the 3rd Worker calls fewer CommonWork methods, and so has the potential to complete more loop cycles, it does not. I tried to simulate that in the code below with this condition:
if (Task.CurrentId.Value < 3)
When debugging, I found out that the 3rd Worker waited substantially longer than the other Workers to acquire the Mutex. Sometimes the other two Workers just work interchangeably while the 3rd keeps waiting on Mutex.WaitOne(), I guess without ever really entering it, because the other Workers have no problem acquiring the lock!
WHAT I TRIED ALREADY
I tried starting the Worker Tasks with TaskCreationOptions.LongRunning, but nothing changed. I also tried making the nested Tasks child Tasks by specifying TaskCreationOptions.AttachedToParent, thinking it might be related to local queues and scheduling, but apparently it is not.
SIMPLIFIED CODE
Below is the simplified code of my real-world application. Sad to say, I could not reproduce this situation in this simple example:
class Program
{
public class CommonWork
{
private Mutex _mutex;
public CommonWork() { this._mutex = new Mutex(false); }
private void Lock() { this._mutex.WaitOne(); }
private void Unlock() { this._mutex.ReleaseMutex(); }
public void DoFirstTask(int taskId)
{
this.Lock();
try
{
// imitating sync work from 3rd Party lib, that I need to make async
var t = Task.Run(() => {
Thread.Sleep(500); // sync work
});
// ... doing some work here
t.Wait();
Console.WriteLine("Task {0}: DoFirstTask - complete", taskId);
}
finally { this.Unlock(); }
}
public void DoSecondTask(int taskId)
{
this.Lock();
try
{
// imitating sync work from 3rd Party lib, that I need to make async
var t = Task.Run(() => {
Thread.Sleep(500); // sync work
});
// ... doing some work here
t.Wait();
Console.WriteLine("Task {0}: DoSecondTask - complete", taskId);
}
finally { this.Unlock(); }
}
public void DoThirdTask(int taskId)
{
this.Lock();
try
{
// imitating sync work from 3rd Party lib, that I need to make async
var t = Task.Run(() => {
Thread.Sleep(500); // sync work
});
// ... doing some work here
t.Wait();
Console.WriteLine("Task {0}: DoThirdTask - complete", taskId);
}
finally { this.Unlock(); }
}
}
// Worker class
public class Worker
{
private CommonWork CommonWork { get; set; }
public Worker(CommonWork commonWork)
{ this.CommonWork = commonWork; }
private void Loop()
{
while (true)
{
this.CommonWork.DoFirstTask(Task.CurrentId.Value);
if (Task.CurrentId.Value < 3)
{
this.CommonWork.DoSecondTask(Task.CurrentId.Value);
this.CommonWork.DoThirdTask(Task.CurrentId.Value);
}
}
}
public Task Start()
{
return Task.Run(() => this.Loop());
}
}
static void Main(string[] args)
{
var work = new CommonWork();
var client1 = new Worker(work);
var client2 = new Worker(work);
var client3 = new Worker(work);
client1.Start();
client2.Start();
client3.Start();
Console.ReadKey();
}
} // end of Program
The solution was to use new SemaphoreSlim(1) instead of the Mutex (or a simple lock, or Monitor). Only with SemaphoreSlim did thread scheduling become round-robin, which stopped making some threads/Tasks "special" with respect to the others. Thanks I3arnon.
If someone could comment on why that is, I would appreciate it.
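For concreteness, a sketch of what that swap looks like in the CommonWork class above (only the locking members change):
public class CommonWork
{
    // SemaphoreSlim(1, 1) gives the same mutual exclusion as the Mutex did,
    // but without thread affinity, and in this case the waiters were
    // released round-robin instead of starving the 3rd Worker.
    private readonly SemaphoreSlim _semaphore = new SemaphoreSlim(1, 1);

    private void Lock() { this._semaphore.Wait(); }
    private void Unlock() { this._semaphore.Release(); }

    // DoFirstTask / DoSecondTask / DoThirdTask stay exactly as they were.
}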
I am trying to use Parallel.ForEach on a list and, for each item in the list, make a database call. I am trying to log each item, with or without an error. I just wanted to check with the experts here if I am doing things the right way. For this example, I am simulating the I/O using file access instead of database access.
static ConcurrentQueue<IdAndErrorMessage> queue = new ConcurrentQueue<IdAndErrorMessage>();
private static void RunParallelForEach()
{
List<int> list = Enumerable.Range(1, 5).ToList<int>();
Console.WriteLine("Start....");
Stopwatch stopWatch = new Stopwatch();
stopWatch.Start();
Parallel.ForEach(list, (tempId) =>
{
string errorMessage = string.Empty;
try
{
ComputeBoundOperationTest(tempId);
try
{
/*Below task is I/O bound - so do this Async.*/
Task[] task = new Task[1]
{
Task.Factory.StartNew(() => FileAccess(tempId))
};
Task.WaitAll(task);
}
catch (Exception ex)
{
queue.Enqueue(new IdAndErrorMessage(tempId, ex.ToString()));
}
}
catch (Exception ex)
{
errorMessage = ex.ToString();
}
if (queue.SingleOrDefault((IdAndErrorMessageObj) => IdAndErrorMessageObj.Id == tempId) == null)
{
queue.Enqueue(new IdAndErrorMessage(tempId, errorMessage));
}
}
);
Console.WriteLine("Stop....");
Console.WriteLine("Total milliseconds :- " + stopWatch.ElapsedMilliseconds.ToString());
}
Below are the helper methods :-
private static byte[] FileAccess(int id)
{
if (id == 5)
{
throw new ApplicationException("This is some file access exception");
}
return File.ReadAllBytes(Directory.GetFiles(Environment.SystemDirectory).First());
//return File.ReadAllBytes("Files/" + fileName + ".docx");
}
private static void ComputeBoundOperationTest(int tempId)
{
//Console.WriteLine("Compute-bound operation started for :- " + tempId.ToString());
if (tempId == 4)
{
throw new ApplicationException("Error thrown for id = 4 from compute-bound operation");
}
Thread.Sleep(20);
}
private static void EnumerateQueue(ConcurrentQueue<IdAndErrorMessage> queue)
{
Console.WriteLine("Enumerating the queue items :- ");
foreach (var item in queue)
{
Console.WriteLine(item.Id.ToString() + (!string.IsNullOrWhiteSpace(item.ErrorMessage) ? item.ErrorMessage : "No error"));
}
}
There is no reason to do this:
/*Below task is I/O bound - so do this Async.*/
Task[] task = new Task[1]
{
Task.Factory.StartNew(() => FileAccess(tempId))
};
Task.WaitAll(task);
By scheduling this in a separate task, and then immediately waiting on it, you're just tying up more threads. You're better off leaving this as:
/*Below task is I/O bound - but just call it.*/
FileAccess(tempId);
That being said, given that you're producing a logged value (exception or success) for every item, you might want to consider writing this into a method and then running the entire thing as a PLINQ query.
For example, if you write this into a method that handles the try/catch (with no threading) and returns the "logged string", i.e.:
string ProcessItem(int id) { // ...
You could write the entire operation as:
var results = theIDs.AsParallel().Select(id => ProcessItem(id));
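A minimal sketch of what that method could look like, reusing the question's helpers (the exact wording of the returned string is mine):
static string ProcessItem(int id)
{
    try
    {
        ComputeBoundOperationTest(id);
        FileAccess(id); // I/O bound, but just call it directly - no extra task needed
        return id + ": No error";
    }
    catch (Exception ex)
    {
        return id + ": " + ex.Message; // the "logged string" for this item
    }
}
Enumerating results afterwards (e.g. foreach (var line in results)) then replaces both the ConcurrentQueue and the EnumerateQueue step.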
You might want to remove Console.WriteLine from the threaded code. The reason is that there can be only one console per Windows app, so if two or more threads write to the console in parallel, one has to wait.
As a replacement for your custom error queue, you might want to look at .NET 4's AggregateException: catch that and process the exceptions accordingly. The InnerExceptions property will give you the necessary list of exceptions. More here
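To sketch that (with a hypothetical failing body): exceptions thrown inside Parallel.ForEach iterations are collected and rethrown as a single AggregateException at the call site.
try
{
    Parallel.ForEach(list, tempId =>
    {
        ComputeBoundOperationTest(tempId); // throws for some ids
    });
}
catch (AggregateException ae)
{
    foreach (Exception inner in ae.InnerExceptions)
    {
        Console.WriteLine(inner.Message); // one entry per failed iteration
    }
}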
And a general code review comment: don't use magic numbers like 4 in if (tempId == 4). Instead, define a constant that says what 4 stands for, e.g. if (tempId == Error.FileMissing).
Parallel.ForEach runs an action/func concurrently, up to a certain number of simultaneous instances. If what each of those iterations does is not inherently independent of the others, you're not getting any performance gain; you are likely reducing performance by introducing expensive context switching and contention. You say you want to make a "database call" and are simulating it with a file operation. If each iteration uses the same resource (the same row in a database table, for example, or trying to write to the same file in the same location), then they're not really going to run in parallel: only one will run at a time while the others simply wait to get hold of the resource, needlessly making your code complex.
You haven't detailed what you want to do in each iteration; but when I've encountered situations like this with other programmers, they almost always aren't really doing things in parallel: they've simply gone through and replaced foreach loops with Parallel.ForEach in the hope of magically gaining performance or magically making use of multi-CPU/core processors.