While keeping in mind that:
I am using a blocking queue that waits for ever until something is added to it
I might get a FileSystemWatcher event twice
The updated code:
{
FileProcessingManager processingManager = new FileProcessingManager();
processingManager.RegisterProcessor(new ExcelFileProcessor());
processingManager.RegisterProcessor(new PdfFileProcessor());
processingManager.Completed += new ProcessingCompletedHandler(ProcessingCompletedHandler);
processingManager.Completed += new ProcessingCompletedHandler(LogFileStatus);
while (true)
{
try
{
var jobData = (JobData)fileMonitor.FileQueue.Dequeue();
if (jobData == null)
break;
_pool.WaitOne();
Application.Log(String.Format("{0}:{1}", DateTime.Now.ToString(CultureInfo.InvariantCulture), "Thread launched"));
Task.Factory.StartNew(() => processingManager.Process(jobData));
}
catch (Exception e)
{
Application.Log(String.Format("{0}:{1}", DateTime.Now.ToString(CultureInfo.InvariantCulture), e.Message));
}
}
}
What are are you suggestions on making the code multi-threaded while taking into consideration the possibility that two identical string paths may be added into the blocking queue? I have left the possibility that this might happen and in this case.. the file would be processed twice, the thing is that sometimes I get it twice, sometimes not, it is really awkward, if you have suggestions on this, please tell.
The null checking is for exiting the loop, I intentionally add a null from outside the threaded loop to determine it to stop.
For multi-threading this... I would probably add a "Completed" event to your FileProcessingManager and register for it. One argument of that event will be the "bool" return value you currently have. Then in that event handler, I would do the checking of the bool and re-queueing of the file. Note that you will have to keep a reference to the FileMonitorManager. So, I would have this ThreadProc method be in a class where you keep the FileMonitorManager and FileProcessingManager instances in a property.
To deduplicate, in ThreadProc, I would create a List outside of the while loop. Then inside the while loop, before you process a file, lock that list, check to see if the string is already in there, if not, add the string to the list and process the file, if it is, then skip processing.
Obviously, this is based on little information surrounding your method but my 2 cents anyway.
Rough code, from Notepad:
private static FileMonitorManager fileMon = null;
private static FileProcessingManager processingManager = new FileProcessingManager();
private static void ThreadProc(object param)
{
processingManager.RegisterProcessor(new ExcelFileProcessor());
processingManager.RegisterProcessor(new PdfFileProcessor());
processingManager.Completed += ProcessingCompletedHandler;
var procList = new List<string>();
while (true)
{
try
{
var path = (string)fileMon.FileQueue.Dequeue();
if (path == null)
break;
bool processThis = false;
lock(procList)
{
if(!procList.Contains(path))
{
processThis = true;
procList.Add(path);
}
}
if(processThis)
{
Thread t = new Thread (new ParameterizedThreadStart(processingManager.Process));
t.Start (path);
}
}
catch (System.Exception e)
{
Console.WriteLine(e.Message);
}
}
}
private static void ProcessingCompletedHandler(bool status, string path)
{
if (!status)
{
fileMon.FileQueue.Enqueue(path);
Console.WriteLine("\n\nError on file: " + path);
}
else
Console.WriteLine("\n\nSucces on file: " + path);
}
Related
I need some help. If you input an Directory into my code, it goes in every folder in that Directory and gets every single file. This way, i managed to bypass the "AccessDeniedException" by using a code, BUT if the Directory is one, which contains alot of Data and folders (example: C:/) it just takes way to much time.
I dont really know how to multithread and i could not find any help on the internet. Is there a way to make the code run faster by multithreading? Or is it possible to ask the code to use more memory or Cores ? I really dont know and could use advise
My code to go in every File in every Subdirectory:
public static List<string> Files = new List<string>();
public static List<string> Exceptions = new List<string>();
public MainWindow()
{
InitializeComponent();
}
private static void GetFilesRecursively(string Directory)
{
try
{
foreach (string A in Directory.GetDirectories(Directory))
GetFilesRecursively(A);
foreach (string B in Directory.GetFiles(Directory))
AddtoList(B);
} catch (System.Exception ex) { Exceptions.Add(ex.ToString()); }
}
private static void AddtoList(string Result)
{
Files.Add(Result);
}
private void Btn_Click(object sender, RoutedEventArgs e)
{
GetFilesRecursively(Textbox1.Text);
foreach(string C in Files)
Textbox2.Text += $"{C} \n";
}
You don't need recursion to avoid inaccessible files. You can use the EnumerateFiles overload that accepts an EnumerationOptions parameter and set EnumerationOptions.IgnoreInaccessible to true:
var options=new EnumerationOptions
{
IgnoreInaccessible=true,
RecurseSubdirectories=true
};
var files=Directory.EnumerateFiles(somePath,"*",options);
The loop that appends file paths is very expensive too. Not only does it create a new temporary string on each iteration, it also forces a UI redraw. You could improve speed and memory usage (which, due to garbage collection is also affecting performance) by creating a single string, eg with String.Join or a StringBuilder :
var text=String.Join("\n",files);
Textbox2.Text=text;
String.Join uses a StringBuilder internally whose internal buffer gets reallocated each time it's full. The previous buffer has to be garbage-collected. Once could avoid even this by using a StringBuilder with a specific capacity. Even a rough estimate can reduce reallocations significantly:
var builder=new StringBuilder(4096);
foreach(var file in files)
{
builder.AppendLine(file);
}
create a class so you can add a private field to count the deep of the directroy.
add a TaskSource<T> property to the class, and await the Task that generated only if the deep out of the limit, and trigger an event so your UI can hook into the action and ask user.
if user cancel , then the task fail, if user confirm, then continue.
some logic code
public class FileLocator
{
public FileLocator(int maxDeep = 6){
_maxDeep = maxDeep;
this.TaskSource = new TaskSource();
this.ConfirmTask = this.TaskSource.Task;
}
private int _maxDeep;
private int _deep;
public event Action<FileLocator> OnReachMaxDeep;
public Task ConfirmTask ;
public TaskSource TaskSource {get;}
public Task<List<string>> GetFilesRecursivelyAsync()
{
var result = new List<string>();
foreach(xxxxxxx)
{
xxxxxxxxxxxxxx;
this._deep +=1;
if(_deep == _maxDeep)
{ OnRichMaxDeep?.Invoke(this); }
if(_deep >= _maxDeep)
{
try{
await ConfirmTask;
continue;
}
catch{
return result;
}
}
}
}
}
and call
var locator = new FileLocator();
locator.OnReachMaxDeep += (x)=> { var result = UI.Confirm(); if(result){ x.TaskSource.SetResult(); else{ x.TaskSource.SetException(new Exception()) } } }
var result = await locator.GetFilesRecursivelyAsync("C:");
I have been working on a webscraping project.
I am having two issues, one being presenting the number of urls processed as percentage but a far larger issue is that I can not figure out how I know when all the threads i am creating are totaly finished.
NOTE: I am aware of that the a parallel foreach once done moves on BUT this is within a recursive method.
My code below:
public async Task Scrape(string url)
{
var page = string.Empty;
try
{
page = await _service.Get(url);
if (page != string.Empty)
{
if (regex.IsMatch(page))
{
Parallel.For(0, regex.Matches(page).Count,
index =>
{
try
{
if (regex.Matches(page)[index].Groups[1].Value.StartsWith("/"))
{
var match = regex.Matches(page)[index].Groups[1].Value.ToLower();
if (!links.Contains(BaseUrl + match) && !Visitedlinks.Contains(BaseUrl + match))
{
Uri ValidUri = WebPageValidator.GetUrl(match);
if (ValidUri != null && HostUrls.Contains(ValidUri.Host))
links.Enqueue(match.Replace(".html", ""));
else
links.Enqueue(BaseUrl + match.Replace(".html", ""));
}
}
}
catch (Exception e)
{
log.Error("Error occured: " + e.Message);
Console.WriteLine("Error occured, check log for further details."); ;
}
});
WebPageInternalHandler.SavePage(page, url);
var context = CustomSynchronizationContext.GetSynchronizationContext();
Parallel.ForEach(links, new ParallelOptions { MaxDegreeOfParallelism = 25 },
webpage =>
{
try
{
if (WebPageValidator.ValidUrl(webpage))
{
string linkToProcess = webpage;
if (links.TryDequeue(out linkToProcess) && !Visitedlinks.Contains(linkToProcess))
{
ShowPercentProgress();
Thread.Sleep(15);
Visitedlinks.Enqueue(linkToProcess);
Task d = Scrape(linkToProcess);
Console.Clear();
}
}
}
catch (Exception e)
{
log.Error("Error occured: " + e.Message);
Console.WriteLine("Error occured, check log for further details.");
}
});
Console.WriteLine("parallel finished");
}
}
catch (Exception e)
{
log.Error("Error occured: " + e.Message);
Console.WriteLine("Error occured, check log for further details.");
}
}
NOTE that Scrape gets called multiple times(recursive)
call the method like this:
public Task ExecuteScrape()
{
var context = CustomSynchronizationContext.GetSynchronizationContext();
Scrape(BaseUrl).ContinueWith(x => {
Visitedlinks.Enqueue(BaseUrl);
}, context).Wait();
return null;
}
which in turn gets called like so:
static void Main(string[] args)
{
RunScrapper();
Console.ReadLine();
}
public static void RunScrapper()
{
try
{
_scrapper.ExecuteScrape();
}
catch (Exception e)
{
Console.WriteLine(e);
throw;
}
}
my result:
How do I solve this?
(Is it ethical for me to answer a question about web page scraping?)
Don't call Scrape recursively. Place the list of urls you want to scrape in a ConcurrentQueue and begin processing that queue. As the process of scraping a page returns more urls, just add them into the same queue.
I wouldn't use just a string, either. I recommend creating a class like
public class UrlToScrape //because naming things is hard
{
public string Url { get; set; }
public int Depth { get; set; }
}
Regardless of how you execute this it's recursive, so you have to somehow keep track of how many levels deep you are. A website could deliberately generate URLs that send you into infinite recursion. (If they did this then they don't want you scraping their site. Does anybody want people scraping their site?)
When your queue is empty that doesn't mean you're done. The queue could be empty, but the process of scraping the last url dequeued could still add more items back into that queue, so you need a way to account for that.
You could use a thread safe counter (int using Interlocked.Increment/Decrement) that you increment when you start processing a url and decrement when you finish. You're done when the queue is empty and the count of in-process urls is zero.
This is a very rough model to illustrate the concept, not what I'd call a refined solution. For example, you still need to account for exception handling, and I have no idea where the results go, etc.
public class UrlScraper
{
private readonly ConcurrentQueue<UrlToScrape> _queue = new ConcurrentQueue<UrlToScrape>();
private int _inProcessUrlCounter;
private readonly List<string> _processedUrls = new List<string>();
public UrlScraper(IEnumerable<string> urls)
{
foreach (var url in urls)
{
_queue.Enqueue(new UrlToScrape {Url = url, Depth = 1});
}
}
public void ScrapeUrls()
{
while (_queue.TryDequeue(out var dequeuedUrl) || _inProcessUrlCounter > 0)
{
if (dequeuedUrl != null)
{
// Make sure you don't go more levels deep than you want to.
if (dequeuedUrl.Depth > 5) continue;
if (_processedUrls.Contains(dequeuedUrl.Url)) continue;
_processedUrls.Add(dequeuedUrl.Url);
Interlocked.Increment(ref _inProcessUrlCounter);
var url = dequeuedUrl;
Task.Run(() => ProcessUrl(url));
}
}
}
private void ProcessUrl(UrlToScrape url)
{
try
{
// As the process discovers more urls to scrape,
// pretend that this is one of those new urls.
var someNewUrl = "http://discovered";
_queue.Enqueue(new UrlToScrape { Url = someNewUrl, Depth = url.Depth + 1 });
}
catch (Exception ex)
{
// whatever you want to do with this
}
finally
{
Interlocked.Decrement(ref _inProcessUrlCounter);
}
}
}
If I was doing this for real the ProcessUrl method would be its own class, and it would take HTML, not a URL. In this form it's difficult to unit test. If it were in a separate class then you could pass in HTML, verify that it outputs results somewhere, and that it calls a method to enqueue new URLs it finds.
It's also not a bad idea to maintain the queue as a database table instead. Otherwise if you're processing a bunch of urls and you have to stop, you'd have start all over again.
Can't you add all tasks Task d to some type of concurrent collection you thread through all recursive calls (via method argument) and then simply call Task.WhenAll(tasks).Wait()?
You'd need an intermediate method (makes it cleaner) that launches the base Scrape call and passes in the empty task collection. When the base call returns you have in hand all tasks and you simply wait them out.
public async Task Scrape (
string url) {
var tasks = new ConcurrentQueue<Task>();
//call your implementation but
//change it so that you add
//all launched tasks d to tasks
Scrape(url, tasks);
//1st option: Wait().
//This will block caller
//until all tasks finish
Task.WhenAll(tasks).Wait();
//or 2nd option: await
//this won't block and will return to caller.
//Once all tasks are finished method
//will resume in WriteLine
await Task.WhenAll(tasks);
Console.WriteLine("Finished!"); }
Simple rule: if you want to know when something finishes, the first step is to keep track of it. In your current implementation you are essentially firing and forgetting all launched tasks...
I have a delegate method to run a heavy process in my app (I must use MS Framework 3.5):
private delegate void delRunJob(string strBox, string strJob);
Execution:
private void run()
{
string strBox = "G4P";
string strJob = "Test";
delRunJob delegateRunJob = new delRunJob(runJobThread);
delegateRunJob.Invoke(strBox, strJob);
}
In some part of the method runJobThread
I call to an external program (SAP - Remote Function Calls) to retrieve data. The execution of that line can take 1-30 mins.
private void runJobThread(string strBox, string strJob)
{
// CODE ...
sapLocFunction.Call(); // When this line is running I cannot cancel the process
// CODE ...
}
I want to allow the user cancel whole process.
How can achieve this? I tried some methods; but I fall in the same point; when this specific line is running I cannot stop the process.
Instead of using the delegate mechanism you have to study the async and await mechanism. When you understand this mechanism you can move to cancellationtoken.
An example doing both things can be found here :
http://blogs.msdn.com/b/dotnet/archive/2012/06/06/async-in-4-5-enabling-progress-and-cancellation-in-async-apis.aspx
Well; I find out a complicated, but effective, way to solve my problem:
a.) I created a "Helper application" to show a notification icon when the process is running (To ensure to don't interfere with the normal execution of the main app):
private void callHelper(bool blnClose = false)
{
if (blnClose)
fw.processKill("SDM Helper");
else
Process.Start(fw.appGetPath + "SDM Helper.exe");
}
b.) I created a Thread that call only the heavy process line.
c.) While the Thread is alive I check for external file named "cancel" (The "Helper application" do that; when the user click an option to cancel the process the Helper create the file).
d.) If exists the file; dispose all objects and break the while cycle.
e.) The method sapLocFunction.Call() will raise an exception but I expect errors.
private void runJobThread(string strBox, string strJob)
{
// CODE ...
Thread thrSapCall = new Thread(() =>
{
try { sapLocFunction.Call(); }
catch { /* Do nothing */ }
});
thrSapCall.Start();
while (thrSapCall.IsAlive)
{
Thread.Sleep(1000);
try
{
if (fw.fileExists(fw.appGetPath + "\\cancel"))
{
sapLocFunction = null;
sapLocTable = null;
sapConn.Logoff();
sapConn = null;
canceled = true;
break;
}
}
finally { /* Do nothing */ }
}
thrSapCall = null;
// CODE ...
}
Works like a charm!
I think you would have to resort to the method described here. Read the post to see why this is a long way from ideal.
Perhaps this might work...
private void runJobThread(string strBox, string strJob, CancellationToken token)
{
Thread t = Thread.CurrentThread;
using (token.Register(t.Abort))
{
// CODE ...
sapLocFunction.Call(); // When this line is running I cannot cancel the process
// CODE ...
}
}
A bit of dnspy exposes a cancel method on nco3.0.
private readonly static Type RfcConnection = typeof(RfcSessionManager).Assembly.GetType("SAP.Middleware.Connector.RfcConnection");
private readonly static Func<RfcDestination, object> GetConnection = typeof(RfcSessionManager).GetMethod("GetConnection", BindingFlags.Static | BindingFlags.NonPublic).CreateDelegate(typeof(Func<RfcDestination, object>)) as Func<RfcDestination, object>;
private readonly static MethodInfo Cancel = RfcConnection.GetMethod("Cancel", BindingFlags.Instance | BindingFlags.NonPublic);
object connection = null;
var completed = true;
using (var task = Task.Run(() => { connection = GetConnection(destination); rfcFunction.Invoke(destination); }))
{
try
{
completed = task.Wait(TimeSpan.FromSeconds(invokeTimeout));
if (!completed)
Cancel.Invoke(connection, null);
task.Wait();
}
catch(AggregateException e)
{
if (e.InnerException is RfcCommunicationCanceledException && !completed)
throw new TimeoutException($"SAP FM {functionName} on {destination} did not respond in {timeout} seconds.");
throw;
}
}
I have a simple logging mechanism that should be thread safe. It works most of the time, but every now and then I get an exception on this line, "_logQ.Enqueue(s);" that the queue is not long enough. Looking in the debugger there are sometimes just hundreds of items, so I can't see it being resources. The queue is supposed to expand as needed. If I catch the exception as opposed to letting the debugger pause at the exception I see the same error. Is there something not thread safe here? I don't even know how to start debugging this.
static void ProcessLogQ(object state)
{
try
{
while (_logQ.Count > 0)
{
var s = _logQ.Dequeue();
string dir="";
Type t=Type.GetType("Mono.Runtime");
if (t!=null)
{
dir ="/var/log";
}else
{
dir = #"c:\log";
if (!Directory.Exists(dir))
Directory.CreateDirectory(dir);
}
if (Directory.Exists(dir))
{
File.AppendAllText(Path.Combine(dir, "admin.log"), DateTime.Now.ToString("hh:mm:ss ") + s + Environment.NewLine);
}
}
}
catch (Exception)
{
}
finally
{
_isProcessingLogQ = false;
}
}
public static void Log(string s) {
if (_logQ == null)
_logQ = new Queue<string> { };
lock (_logQ)
_logQ.Enqueue(s);
if (!_isProcessingLogQ) {
_isProcessingLogQ = true;
ThreadPool.QueueUserWorkItem(ProcessLogQ);
}
}
Note that the threads all call Log(string s). ProcessLogQ is private to the logger class.
* Edit *
I made a mistake in not mentioning that this is in a .NET 3.5 environment, therefore I can't use Task or ConcurrentQueue. I am working on fixes for the current example within .NET 3.5 constraints.
** Edit *
I believe I have a thread-safe version for .NET 3.5 listed below. I start the logger thread once from a single thread at program start, so there is only one thread running to log to the file (t is a static Thread):
static void ProcessLogQ()
{
while (true) {
try {
lock (_logQ);
while (_logQ.Count > 0) {
var s = _logQ.Dequeue ();
string dir = "../../log";
if (!Directory.Exists (dir))
Directory.CreateDirectory (dir);
if (Directory.Exists (dir)) {
File.AppendAllText (Path.Combine (dir, "s3ol.log"), DateTime.Now.ToString ("hh:mm:ss ") + s + Environment.NewLine);
}
}
} catch (Exception ex) {
Console.WriteLine (ex.Message);
} finally {
}
Thread.Sleep (1000);
}
}
public static void startLogger(){
lock (t) {
if (t.ThreadState != ThreadState.Running)
t.Start ();
}
}
private static void multiThreadLog(string msg){
lock (_logQ)
_logQ.Enqueue(msg);
}
Look at the TaskParallel Library. All the hard work is already done for you. If you're doing this to learn about multithreading read up on locking techniques and pros and cons of each.
Further, you're checking if _logQ is null outside your lock statement, from what I can deduce it's a static field that you're not initializing inside a static constructor. You can avoid doing this null check (which should be inside a lock, it's critical code!) you can ensure thread-safety by making it a static readonly and initializing it inside the static constructor.
Further, you're not properly handling queue states. Since there's no lock during the check of the queue count it could vary on every iteration. You're missing a lock as your dequeuing items.
Excellent resource:
http://www.yoda.arachsys.com/csharp/threads/
For a thread-safe queue, you should use the ConcurrentQueue instead:
https://msdn.microsoft.com/en-us/library/dd267265(v=vs.110).aspx
Output Buffer declared as a class variable
private Queue<String> __OutputBuffer = new Queue<String>();
Timer Used to Process Output every 100ms
new System.Timers.Timer()
{
Interval = 100,
Enabled = true
}.Elapsed += new ElapsedEventHandler(
(caller, args) =>
{
ProcessOutput();
}
);
Process the Queue
private void ProcessOutput()
{
if (__OutputBuffer.Count > 0 && !String.IsNullOrEmpty(__OutputBuffer.Peek()))
{
object _Item = __OutputBuffer.Dequeue();
if(_Item is String)
{
try
{
Browser.DocumentText += "<span style='font-family: Tahoma; font-size: 9pt;'>" + _Item + "</span>";
//Exception On Line Above!
}
catch (Exception) { }
}
}
}
Method for adding to the output buffer
private void UpdateOutput(String text)
{
__OutputBuffer.Enqueue(text);
}
I'm getting invalid cast exception, and the following is the contents of _Item at the point of getting the exception.
** Also the following causes an exception... so i'm doubting that it's the contents of the string in the queue.
Queue<> is not thread-safe, while System.Timers.Timer fires its events on a random pool thread. That's where ProcessOutput is called, and that's where you call __OutputBuffer.Dequeue() and access Browser.DocumentText.
You can protect __OutputBuffer from concurrent access with a lock (for both Dequeue and Enqueue), or use ConcurrentQueue instead. However, you'd need to marshal the Browser.DocumentText assignment to the UI thread, e.g. with Control.Invoke or Control.BeginInvoke.
As Noseratio said, Queue<> is not thread safe, however if you do not wish to use locking in your project and you are using .NET 4.0 or newer you can use the ConcurrentQueue<> class which is thread safe.
You will need to make a few changes, like there is no Peek nor Dequeue method instead you must use TryPeek and TryDequeue. But it should not require too many major changes, it even lets you do some optimisations because the two try methods will return false if the Queue is empty so you nolonger need the Count check.
private void ProcessOutput()
{
string output;
if (__OutputBuffer.TryDequeue(out output) && !String.IsNullOrEmpty(output))
{
try
{
Browser.DocumentText += "<span style='font-family: Tahoma; font-size: 9pt;'>" + output + "</span>";
}
catch (Exception) { } // <--- Blindly catching exceptsions is almost never the right thing to do.
}
}
using Timer is considered Multithreaded.
you have a new thread every 100 ms, which may cause you to race condition on the dequeue
Use:
private ConcurrentQueue<String> __OutputBuffer = new ConcurrentQueue<String>();
private void ProcessOutput()
{
string _Item;
if (__OutputBuffer.TryDequeue(out _Item))
{
try
{
Browser.DocumentText += "<span style='font-family: Tahoma; font-size: 9pt;'>" + _Item + "</span>";
//Exception On Line Above!
}
catch (Exception) { }
}
}