Questions about code using a Task queue for parallel web GETs - C#

So I've got this code to drill down into a hierarchy of XML documents from a REST API. I posted earlier to get advice on how to make it recursive, then I went ahead and made it parallel.
First, I was SHOCKED by how fast it ran - it pulled down 318 XML docs in just under 12 seconds, compared to well over 10 minutes single-threaded - I really didn't expect to gain that much. Is there some catch to this, because it seems too good to be true?
Second, I suspect this code is implementing a common pattern, but possibly in a non-"idiomatic" way. I have kind of a "producer-consumer queue" happening, with two separate locking objects. Is there a more standard way I could have done this?
Code:
public class ResourceGetter
{
    public ResourceGetter(ILogger logger, string url)
    {
        this.logger = logger;
        this.rootURL = url;
    }

    public List<XDocument> GetResources()
    {
        GetResources(rootURL);
        while (NumTasks() > 0) RemoveTask().Wait();
        return resources;
    }

    void GetResources(string url)
    {
        logger.Log("Getting resources at " + url);
        AddTask(Task.Factory.StartNew(new Action(() =>
        {
            var doc = XDocument.Parse(GetXml(url));
            if (deserializer.CanDeserialize(doc.CreateReader()))
            {
                var rl = (resourceList)deserializer.Deserialize(doc.CreateReader());
                foreach (var item in rl.resourceURL)
                {
                    GetResources(url + item.location);
                }
            }
            else
            {
                logger.Log("Got resource for " + url);
                AddResource(doc);
            }
        })));
    }

    object resourceLock = new object();
    List<XDocument> resources = new List<XDocument>();

    void AddResource(XDocument doc)
    {
        lock (resourceLock)
        {
            logger.Log("add resource");
            resources.Add(doc);
        }
    }

    object taskLock = new object();
    Queue<Task> tasks = new Queue<Task>();

    void AddTask(Task task)
    {
        lock (taskLock)
        {
            tasks.Enqueue(task);
        }
    }

    Task RemoveTask()
    {
        lock (taskLock)
        {
            return tasks.Dequeue();
        }
    }

    int NumTasks()
    {
        lock (taskLock)
        {
            logger.Log(tasks.Count + " tasks left");
            return tasks.Count;
        }
    }

    ILogger logger;
    XmlSerializer deserializer = new XmlSerializer(typeof(resourceList));
    readonly string rootURL;
}

Just offhand, I wouldn't bother with the code for managing the task list, all the locking, and the NumTasks() method. It would be simpler to just use a CountdownEvent, which is thread-safe to begin with. Just increment it when you create a new task and decrement it when a task finishes, much like you are doing now but without the locking.
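For example, a minimal sketch of that suggestion against the question's class (reusing its GetXml, deserializer, resources, and rootURL members; error handling omitted) might look like this:
CountdownEvent pending = new CountdownEvent(1); // initial count of 1 stands for the root task

void GetResources(string url)
{
    Task.Factory.StartNew(() =>
    {
        try
        {
            var doc = XDocument.Parse(GetXml(url));
            if (deserializer.CanDeserialize(doc.CreateReader()))
            {
                var rl = (resourceList)deserializer.Deserialize(doc.CreateReader());
                foreach (var item in rl.resourceURL)
                {
                    pending.AddCount(); // one more outstanding task
                    GetResources(url + item.location);
                }
            }
            else
            {
                AddResource(doc);
            }
        }
        finally
        {
            pending.Signal(); // this task is done
        }
    });
}

public List<XDocument> GetResources()
{
    GetResources(rootURL);
    pending.Wait(); // blocks until every AddCount has been matched by a Signal
    return resources;
}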

Related

C# code takes too long to run. Is there a way to make it finish quicker?

I need some help. If you input a directory into my code, it goes into every folder in that directory and gets every single file. This way I managed to bypass the "AccessDeniedException", BUT if the directory is one which contains a lot of data and folders (example: C:/), it just takes way too much time.
I don't really know how to multithread and I could not find any help on the internet. Is there a way to make the code run faster by multithreading? Or is it possible to ask the code to use more memory or cores? I really don't know and could use advice.
My code to go through every file in every subdirectory:
public static List<string> Files = new List<string>();
public static List<string> Exceptions = new List<string>();

public MainWindow()
{
    InitializeComponent();
}

private static void GetFilesRecursively(string directory)
{
    try
    {
        // Note: the parameter must not be named "Directory", or it would
        // shadow the System.IO.Directory class used below.
        foreach (string A in Directory.GetDirectories(directory))
            GetFilesRecursively(A);
        foreach (string B in Directory.GetFiles(directory))
            AddtoList(B);
    }
    catch (System.Exception ex) { Exceptions.Add(ex.ToString()); }
}

private static void AddtoList(string Result)
{
    Files.Add(Result);
}

private void Btn_Click(object sender, RoutedEventArgs e)
{
    GetFilesRecursively(Textbox1.Text);
    foreach (string C in Files)
        Textbox2.Text += $"{C} \n";
}
You don't need recursion to avoid inaccessible files. You can use the EnumerateFiles overload that accepts an EnumerationOptions parameter and set EnumerationOptions.IgnoreInaccessible to true:
var options = new EnumerationOptions
{
    IgnoreInaccessible = true,
    RecurseSubdirectories = true
};
var files = Directory.EnumerateFiles(somePath, "*", options);
The loop that appends file paths is very expensive too. Not only does it create a new temporary string on each iteration, it also forces a UI redraw. You could improve speed and memory usage (which, because of garbage collection, also affects performance) by creating a single string, e.g. with String.Join or a StringBuilder:
var text = String.Join("\n", files);
Textbox2.Text = text;
String.Join uses a StringBuilder internally whose internal buffer gets reallocated each time it's full. The previous buffer has to be garbage-collected. One could avoid even this by using a StringBuilder with a specific capacity. Even a rough estimate can reduce reallocations significantly:
var builder = new StringBuilder(4096);
foreach (var file in files)
{
    builder.AppendLine(file);
}
Create a class so you can add a private field to count the depth of the directory.
Add a TaskCompletionSource property to the class, await the task it produces only when the depth exceeds the limit, and trigger an event so your UI can hook into the action and ask the user.
If the user cancels, the task fails; if the user confirms, continue.
Some sketch of the logic:
public class FileLocator
{
    public FileLocator(int maxDepth = 6)
    {
        _maxDepth = maxDepth;
        this.TaskSource = new TaskCompletionSource<bool>();
        this.ConfirmTask = this.TaskSource.Task;
    }

    private readonly int _maxDepth;
    private int _depth;

    public event Action<FileLocator> OnReachMaxDepth;

    public Task ConfirmTask;
    public TaskCompletionSource<bool> TaskSource { get; }

    public async Task<List<string>> GetFilesRecursivelyAsync(string directory)
    {
        var result = new List<string>();
        foreach (var subdirectory in Directory.GetDirectories(directory))
        {
            // ... collect files / recurse here (elided in this sketch) ...
            _depth += 1;
            if (_depth == _maxDepth)
            {
                OnReachMaxDepth?.Invoke(this);
            }
            if (_depth >= _maxDepth)
            {
                try
                {
                    await ConfirmTask; // resumes when the user confirms
                    continue;
                }
                catch
                {
                    return result; // the user cancelled
                }
            }
        }
        return result;
    }
}
and call it like this:
var locator = new FileLocator();
locator.OnReachMaxDepth += x =>
{
    var result = UI.Confirm();
    if (result) { x.TaskSource.SetResult(true); }
    else { x.TaskSource.SetException(new Exception()); }
};
var result = await locator.GetFilesRecursivelyAsync("C:");

How to know when all my threads have finished executing when in a recursive method?

I have been working on a web scraping project.
I am having two issues: one is presenting the number of URLs processed as a percentage, but a far larger issue is that I cannot figure out how to know when all the threads I am creating are totally finished.
NOTE: I am aware that a Parallel.ForEach moves on once it is done, BUT this is within a recursive method.
My code below:
public async Task Scrape(string url)
{
    var page = string.Empty;
    try
    {
        page = await _service.Get(url);
        if (page != string.Empty)
        {
            if (regex.IsMatch(page))
            {
                Parallel.For(0, regex.Matches(page).Count,
                    index =>
                    {
                        try
                        {
                            if (regex.Matches(page)[index].Groups[1].Value.StartsWith("/"))
                            {
                                var match = regex.Matches(page)[index].Groups[1].Value.ToLower();
                                if (!links.Contains(BaseUrl + match) && !Visitedlinks.Contains(BaseUrl + match))
                                {
                                    Uri ValidUri = WebPageValidator.GetUrl(match);
                                    if (ValidUri != null && HostUrls.Contains(ValidUri.Host))
                                        links.Enqueue(match.Replace(".html", ""));
                                    else
                                        links.Enqueue(BaseUrl + match.Replace(".html", ""));
                                }
                            }
                        }
                        catch (Exception e)
                        {
                            log.Error("Error occured: " + e.Message);
                            Console.WriteLine("Error occured, check log for further details.");
                        }
                    });
                WebPageInternalHandler.SavePage(page, url);
                var context = CustomSynchronizationContext.GetSynchronizationContext();
                Parallel.ForEach(links, new ParallelOptions { MaxDegreeOfParallelism = 25 },
                    webpage =>
                    {
                        try
                        {
                            if (WebPageValidator.ValidUrl(webpage))
                            {
                                string linkToProcess = webpage;
                                if (links.TryDequeue(out linkToProcess) && !Visitedlinks.Contains(linkToProcess))
                                {
                                    ShowPercentProgress();
                                    Thread.Sleep(15);
                                    Visitedlinks.Enqueue(linkToProcess);
                                    Task d = Scrape(linkToProcess);
                                    Console.Clear();
                                }
                            }
                        }
                        catch (Exception e)
                        {
                            log.Error("Error occured: " + e.Message);
                            Console.WriteLine("Error occured, check log for further details.");
                        }
                    });
                Console.WriteLine("parallel finished");
            }
        }
    }
    catch (Exception e)
    {
        log.Error("Error occured: " + e.Message);
        Console.WriteLine("Error occured, check log for further details.");
    }
}
NOTE that Scrape gets called multiple times (recursively).
I call the method like this:
public Task ExecuteScrape()
{
    var context = CustomSynchronizationContext.GetSynchronizationContext();
    Scrape(BaseUrl).ContinueWith(x =>
    {
        Visitedlinks.Enqueue(BaseUrl);
    }, context).Wait();
    return null;
}
which in turn gets called like so:
static void Main(string[] args)
{
    RunScrapper();
    Console.ReadLine();
}

public static void RunScrapper()
{
    try
    {
        _scrapper.ExecuteScrape();
    }
    catch (Exception e)
    {
        Console.WriteLine(e);
        throw;
    }
}
How do I solve this?
(Is it ethical for me to answer a question about web page scraping?)
Don't call Scrape recursively. Place the list of urls you want to scrape in a ConcurrentQueue and begin processing that queue. As the process of scraping a page returns more urls, just add them into the same queue.
I wouldn't use just a string, either. I recommend creating a class like
public class UrlToScrape // because naming things is hard
{
    public string Url { get; set; }
    public int Depth { get; set; }
}
Regardless of how you execute this, it's recursive, so you have to somehow keep track of how many levels deep you are. A website could deliberately generate URLs that send you into infinite recursion. (If they did this then they don't want you scraping their site. Does anybody want people scraping their site?)
When your queue is empty that doesn't mean you're done. The queue could be empty, but the process of scraping the last url dequeued could still add more items back into that queue, so you need a way to account for that.
You could use a thread safe counter (int using Interlocked.Increment/Decrement) that you increment when you start processing a url and decrement when you finish. You're done when the queue is empty and the count of in-process urls is zero.
This is a very rough model to illustrate the concept, not what I'd call a refined solution. For example, you still need to account for exception handling, and I have no idea where the results go, etc.
public class UrlScraper
{
    private readonly ConcurrentQueue<UrlToScrape> _queue = new ConcurrentQueue<UrlToScrape>();
    private int _inProcessUrlCounter;
    private readonly List<string> _processedUrls = new List<string>();

    public UrlScraper(IEnumerable<string> urls)
    {
        foreach (var url in urls)
        {
            _queue.Enqueue(new UrlToScrape { Url = url, Depth = 1 });
        }
    }

    public void ScrapeUrls()
    {
        while (_queue.TryDequeue(out var dequeuedUrl) || _inProcessUrlCounter > 0)
        {
            if (dequeuedUrl != null)
            {
                // Make sure you don't go more levels deep than you want to.
                if (dequeuedUrl.Depth > 5) continue;
                if (_processedUrls.Contains(dequeuedUrl.Url)) continue;
                _processedUrls.Add(dequeuedUrl.Url);
                Interlocked.Increment(ref _inProcessUrlCounter);
                var url = dequeuedUrl;
                Task.Run(() => ProcessUrl(url));
            }
        }
    }

    private void ProcessUrl(UrlToScrape url)
    {
        try
        {
            // As the process discovers more urls to scrape,
            // pretend that this is one of those new urls.
            var someNewUrl = "http://discovered";
            _queue.Enqueue(new UrlToScrape { Url = someNewUrl, Depth = url.Depth + 1 });
        }
        catch (Exception ex)
        {
            // whatever you want to do with this
        }
        finally
        {
            Interlocked.Decrement(ref _inProcessUrlCounter);
        }
    }
}
If I were doing this for real, the ProcessUrl method would be its own class, and it would take HTML, not a URL. In this form it's difficult to unit test. If it were in a separate class, you could pass in HTML, verify that it outputs results somewhere, and that it calls a method to enqueue new URLs it finds.
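To make that testability point concrete, a hypothetical shape for such a class might be (these names are illustrative, not from the original code):
public interface IUrlSink
{
    // Receives URLs discovered while processing a page.
    void Enqueue(string url, int depth);
}

public interface IHtmlProcessor
{
    // Takes HTML rather than a URL, so a unit test can feed it a fixed
    // string and assert on the URLs it reports to the sink.
    void Process(string html, int depth, IUrlSink discoveredUrls);
}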
It's also not a bad idea to maintain the queue as a database table instead. Otherwise, if you're processing a bunch of URLs and you have to stop, you'd have to start all over again.
Can't you add all the tasks (Task d) to some type of concurrent collection that you thread through all recursive calls (via a method argument) and then simply call Task.WhenAll(tasks).Wait()?
You'd need an intermediate method (it makes things cleaner) that launches the base Scrape call and passes in the empty task collection. When the base call returns, you have all the tasks in hand and you simply wait them out.
public async Task Scrape(string url)
{
    var tasks = new ConcurrentQueue<Task>();

    // Call your implementation, but change it so that
    // every launched task d is added to tasks.
    Scrape(url, tasks);

    // 1st option: Wait().
    // This will block the caller until all tasks finish.
    Task.WhenAll(tasks).Wait();

    // 2nd option: await.
    // This won't block and will return to the caller.
    // Once all tasks are finished, the method resumes at WriteLine.
    await Task.WhenAll(tasks);

    Console.WriteLine("Finished!");
}
Simple rule: if you want to know when something finishes, the first step is to keep track of it. In your current implementation you are essentially firing and forgetting all launched tasks...
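For instance, the recursive overload could look roughly like this (GetPage and ExtractLinks are hypothetical stand-ins for the question's _service.Get and regex matching):
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public class Scraper
{
    // Hypothetical stand-in for the question's _service.Get(url).
    private Task<string> GetPage(string url) => Task.FromResult("<html/>");

    // Hypothetical stand-in for the question's regex link extraction.
    private IEnumerable<string> ExtractLinks(string page) => new string[0];

    public async Task Scrape(string url, ConcurrentQueue<Task> tasks)
    {
        var page = await GetPage(url);
        foreach (var link in ExtractLinks(page))
        {
            // Track the child task instead of discarding it ("Task d").
            tasks.Enqueue(Scrape(link, tasks));
        }
    }
}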

Thread.Suspend() is obsolete

I have a problem with threads. I want to create n threads and write a log (with a Write method, already implemented).
This is a unit test; when I run it, it works fine, but an exception appears:
System.AppDomainUnloadedException: Attempted to access an unloaded AppDomain. This can happen if the test(s) started a thread but did not stop it. Make sure that all the threads started by the test(s) are stopped before completion.
So I tried to use threadC.Suspend() and the error disappears, but the Suspend method is obsolete.
How can I fix it?
public void TestMethod1()
{
    try
    {
        LogTest logTest = new LogTest(new FileLog());
        logTest.PerformanceTest();
        logTest = new LogTest(new CLogApi());
        logTest.PerformanceTest();
        logTest = new LogTest(new EmptyLog());
        logTest.PerformanceTest();
    }
    catch (Exception)
    {
        Assert.IsTrue(false);
    }
}

public class LogTest
{
    private readonly Log log;
    private int numberOfIterations = 5;

    public LogTest(Log log)
    {
        this.log = log;
    }

    public void PerformanceTest()
    {
        for (int i = 0; i < this.numberOfIterations; i++)
        {
            try
            {
                Thread threadC = Thread.CurrentThread;
                threadC = new Thread(this.ThreadProc);
                threadC.Name = i.ToString();
                threadC.Start();
                // threadC.IsBackground = true;
            }
            catch (Exception)
            {
                Assert.IsTrue(false);
            }
        }
    }

    private void ThreadProc()
    {
        try
        {
            this.log.Write(" Thread : " + Thread.CurrentThread.Name.ToString());
            this.log.Write(" Thread : " + Thread.CurrentThread.Name.ToString());
            this.log.Write(" Thread : " + Thread.CurrentThread.Name.ToString());
            this.log.Write(" Thread : " + Thread.CurrentThread.Name.ToString());
        }
        catch (Exception)
        {
            Assert.IsTrue(false);
        }
    }
}
1: You should use Assert.Fail() instead of Assert.IsTrue(false).
2: Read the Microsoft documentation if you use an obsolete method. They write what you can use instead: "Thread.Suspend has been deprecated. Please use other classes in System.Threading, such as Monitor, Mutex, Event, and Semaphore, to synchronize Threads or protect resources."
3: If I understand you correctly, you want to kill all running threads or wait for them. You can use Thread.Join(): https://msdn.microsoft.com/de-de/library/95hbf2ta(v=vs.110).aspx
You can store all threads in an array or list and join all threads at the end, as sketched below.
4: Instead of using threads you can use the async pattern and wait for all tasks with Task.WaitAll(tasks): https://msdn.microsoft.com/en-us/library/dd270695(v=vs.110).aspx
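As a minimal sketch of point 3 (assuming the question's ThreadProc and numberOfIterations members), PerformanceTest could join its threads like this:
public void PerformanceTest()
{
    var threads = new List<Thread>(); // System.Collections.Generic
    for (int i = 0; i < this.numberOfIterations; i++)
    {
        var thread = new Thread(this.ThreadProc) { Name = i.ToString() };
        threads.Add(thread);
        thread.Start();
    }
    // Join all threads so the test cannot finish while workers still run.
    foreach (var thread in threads)
        thread.Join();
}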

C# Enqueue Failure

I have a simple logging mechanism that should be thread-safe. It works most of the time, but every now and then I get an exception on the line _logQ.Enqueue(s); saying that the queue is not long enough. Looking in the debugger there are sometimes just hundreds of items, so I can't see it being resources. The queue is supposed to expand as needed. If I catch the exception, as opposed to letting the debugger pause at it, I see the same error. Is there something not thread-safe here? I don't even know how to start debugging this.
static void ProcessLogQ(object state)
{
    try
    {
        while (_logQ.Count > 0)
        {
            var s = _logQ.Dequeue();
            string dir = "";
            Type t = Type.GetType("Mono.Runtime");
            if (t != null)
            {
                dir = "/var/log";
            }
            else
            {
                dir = @"c:\log";
                if (!Directory.Exists(dir))
                    Directory.CreateDirectory(dir);
            }
            if (Directory.Exists(dir))
            {
                File.AppendAllText(Path.Combine(dir, "admin.log"), DateTime.Now.ToString("hh:mm:ss ") + s + Environment.NewLine);
            }
        }
    }
    catch (Exception)
    {
    }
    finally
    {
        _isProcessingLogQ = false;
    }
}

public static void Log(string s)
{
    if (_logQ == null)
        _logQ = new Queue<string> { };
    lock (_logQ)
        _logQ.Enqueue(s);
    if (!_isProcessingLogQ)
    {
        _isProcessingLogQ = true;
        ThreadPool.QueueUserWorkItem(ProcessLogQ);
    }
}
Note that the threads all call Log(string s). ProcessLogQ is private to the logger class.
* Edit *
I made a mistake in not mentioning that this is in a .NET 3.5 environment; therefore I can't use Task or ConcurrentQueue. I am working on fixes for the current example within the .NET 3.5 constraints.
* Edit 2 *
I believe I have a thread-safe version for .NET 3.5, listed below. I start the logger thread once from a single thread at program start, so there is only one thread running to log to the file (t is a static Thread):
static void ProcessLogQ()
{
    while (true)
    {
        try
        {
            lock (_logQ)
            {
                while (_logQ.Count > 0)
                {
                    var s = _logQ.Dequeue();
                    string dir = "../../log";
                    if (!Directory.Exists(dir))
                        Directory.CreateDirectory(dir);
                    if (Directory.Exists(dir))
                    {
                        File.AppendAllText(Path.Combine(dir, "s3ol.log"), DateTime.Now.ToString("hh:mm:ss ") + s + Environment.NewLine);
                    }
                }
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.Message);
        }
        Thread.Sleep(1000);
    }
}

public static void startLogger()
{
    lock (t)
    {
        if (t.ThreadState != ThreadState.Running)
            t.Start();
    }
}

private static void multiThreadLog(string msg)
{
    lock (_logQ)
        _logQ.Enqueue(msg);
}
Look at the Task Parallel Library. All the hard work is already done for you. If you're doing this to learn about multithreading, read up on locking techniques and the pros and cons of each.
Further, you're checking whether _logQ is null outside your lock statement. From what I can deduce, it's a static field that you're not initializing inside a static constructor. You can avoid the null check (which should be inside a lock; it's critical code!) and ensure thread safety by making it static readonly and initializing it inside the static constructor.
Further, you're not properly handling queue states. Since there's no lock during the check of the queue count, it could vary on every iteration. You're also missing a lock as you dequeue items.
Excellent resource:
http://www.yoda.arachsys.com/csharp/threads/
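Putting those points together under the question's .NET 3.5 constraint (no ConcurrentQueue), a minimal sketch might be the following: the queue is static readonly, so the null check disappears, and both ends of the queue take the same lock.
using System.Collections.Generic;

public static class SafeLogger
{
    // Initialized once by the type initializer, so no null check is needed.
    private static readonly Queue<string> _logQ = new Queue<string>();

    public static void Log(string s)
    {
        lock (_logQ)
            _logQ.Enqueue(s);
    }

    // Called by the single consumer thread.
    private static bool TryDequeue(out string s)
    {
        lock (_logQ) // the same lock guards Count and Dequeue together
        {
            if (_logQ.Count > 0)
            {
                s = _logQ.Dequeue();
                return true;
            }
            s = null;
            return false;
        }
    }
}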
For a thread-safe queue, you should use the ConcurrentQueue instead:
https://msdn.microsoft.com/en-us/library/dd267265(v=vs.110).aspx

Stop thread until enough memory is available

Environment: .NET 4.0
I have a task that transforms XML files with an XSLT stylesheet. Here is my code:
public string TransformFileIntoTempFile(string xsltPath,
    string xmlPath)
{
    var transform = new MvpXslTransform();
    transform.Load(xsltPath, new XsltSettings(true, false),
        new XmlUrlResolver());
    string tempPath = Path.GetTempFileName();
    using (var writer = new StreamWriter(tempPath))
    {
        using (XmlReader reader = XmlReader.Create(xmlPath))
        {
            transform.Transform(new XmlInput(reader), null,
                new XmlOutput(writer));
        }
    }
    return tempPath;
}
I have X threads that can launch this task in parallel.
Sometimes my input files are about 300 MB; sometimes they're only a few MB.
My problem: I get an OutOfMemoryException when my program tries to transform several big XML files at the same time.
How can I avoid these OutOfMemoryExceptions? My idea is to stop a thread before executing the task until there is enough available memory, but I don't know how to do that. Or is there some other solution (like putting my task in a distinct application)?
Thanks
I don't recommend blocking a thread. In the worst case, you'll just end up starving the task that could potentially free the memory you needed, leading to deadlock or very bad performance in general.
Instead, I suggest you keep a work queue with priorities. Get the tasks from the queue scheduled fairly across a thread pool. Make sure no thread ever blocks on a wait operation; instead, repost the task to the queue (with a lower priority).
So what you'd do (e.g. on receiving an OutOfMemory exception) is post the same job/task onto the queue and terminate the current task, freeing up the thread for another task, as sketched below.
A simplistic approach is to use LIFO, which ensures that a task posted to the queue will have 'lower priority' than any other jobs already on that queue.
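As a rough illustration of the repost idea (the Job type here is hypothetical; a real scheduler would add priorities, backoff, and a retry limit):
using System;
using System.Collections.Generic;

public class Job
{
    public Action Execute;
}

public static class Worker
{
    public static void RunJob(Job job, Queue<Job> workQueue)
    {
        try
        {
            job.Execute();
        }
        catch (OutOfMemoryException)
        {
            // Repost the same job and free this thread for other work,
            // rather than blocking until memory becomes available.
            lock (workQueue)
                workQueue.Enqueue(job);
        }
    }
}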
Since .NET Framework 4 we have an API to work with the good old memory-mapped files feature, which has been available for many years in the Win32 API, so now you can use it from .NET managed code.
For your task, the "persisted memory-mapped files" option is the better fit.
MSDN:
Persisted files are memory-mapped files that are associated with a source file on a disk. When the last process has finished working with the file, the data is saved to the source file on the disk. These memory-mapped files are suitable for working with extremely large source files.
On the page for the MemoryMappedFile.CreateFromFile() method description you can find a nice example describing how to create memory-mapped views for an extremely large file.
EDIT: Update regarding considerable notes in the comments
Just found the method MemoryMappedFile.CreateViewStream(), which creates a stream of type MemoryMappedViewStream, inherited from System.IO.Stream.
I believe you can create an instance of XmlReader from this stream and then instantiate your custom implementation of the XslTransform using this reader/stream.
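A minimal sketch of that idea (these are real .NET 4 APIs; whether this particular transform accepts such a stream is exactly what the note below addresses):
using System.IO.MemoryMappedFiles;
using System.Xml;

// Open the large source file as a memory-mapped file and hand its view
// stream to an XmlReader instead of opening the path directly.
using (var mmf = MemoryMappedFile.CreateFromFile(xmlPath))
using (var stream = mmf.CreateViewStream())
using (XmlReader reader = XmlReader.Create(stream))
{
    // pass `reader` to the transform in place of XmlReader.Create(xmlPath)
}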
EDIT 2: remi bourgarel (OP) already tested this approach, and it looks like this particular XslTransform implementation (I wonder whether ANY would) won't work with a memory-mapped view stream the way it was supposed to.
The main problem is that you are loading the entire XML file. If you were to just transform-as-you-read, the out-of-memory problem should not normally appear.
That being said, I found an MS support article which suggests how it can be done:
http://support.microsoft.com/kb/300934
Disclaimer: I did not test this, so if you use it and it works, please let us know.
You could consider using a queue to throttle how many concurrent transforms are being done based on some sort of artificial memory boundary, e.g. file size. Something like the following could be used.
This sort of throttling strategy can be combined with a maximum number of concurrent files being processed to ensure your disk is not being thrashed too much.
NB: I have not included the necessary try\catch\finally around execution to ensure that exceptions are propagated to the calling thread and wait handles are always released. I could go into further detail here.
public static class QueuedXmlTransform
{
    private const int MaxBatchSizeMB = 300;
    private const double MB = (1024 * 1024);
    private static readonly object SyncObj = new object();
    private static readonly TaskQueue Tasks = new TaskQueue();
    private static readonly Action Join = () => { };
    private static double _CurrentBatchSizeMb;

    public static string Transform(string xsltPath, string xmlPath)
    {
        string tempPath = Path.GetTempFileName();
        using (AutoResetEvent transformedEvent = new AutoResetEvent(false))
        {
            Action transformTask = () =>
            {
                MvpXslTransform transform = new MvpXslTransform();
                transform.Load(xsltPath, new XsltSettings(true, false),
                    new XmlUrlResolver());
                using (StreamWriter writer = new StreamWriter(tempPath))
                using (XmlReader reader = XmlReader.Create(xmlPath))
                {
                    transform.Transform(new XmlInput(reader), null,
                        new XmlOutput(writer));
                }
                transformedEvent.Set();
            };
            double fileSizeMb = new FileInfo(xmlPath).Length / MB;
            lock (SyncObj)
            {
                if ((_CurrentBatchSizeMb += fileSizeMb) > MaxBatchSizeMB)
                {
                    _CurrentBatchSizeMb = fileSizeMb;
                    Tasks.Queue(isParallel: false, task: Join);
                }
                Tasks.Queue(isParallel: true, task: transformTask);
            }
            transformedEvent.WaitOne();
        }
        return tempPath;
    }

    private class TaskQueue
    {
        private readonly object _syncObj = new object();
        private readonly Queue<QTask> _tasks = new Queue<QTask>();
        private int _runningTaskCount;

        public void Queue(bool isParallel, Action task)
        {
            lock (_syncObj)
            {
                _tasks.Enqueue(new QTask { IsParallel = isParallel, Task = task });
            }
            ProcessTaskQueue();
        }

        private void ProcessTaskQueue()
        {
            lock (_syncObj)
            {
                if (_runningTaskCount != 0) return;
                while (_tasks.Count > 0 && _tasks.Peek().IsParallel)
                {
                    QTask parallelTask = _tasks.Dequeue();
                    QueueUserWorkItem(parallelTask);
                }
                if (_tasks.Count > 0 && _runningTaskCount == 0)
                {
                    QTask serialTask = _tasks.Dequeue();
                    QueueUserWorkItem(serialTask);
                }
            }
        }

        private void QueueUserWorkItem(QTask qTask)
        {
            Action completionTask = () =>
            {
                qTask.Task();
                OnTaskCompleted();
            };
            _runningTaskCount++;
            ThreadPool.QueueUserWorkItem(_ => completionTask());
        }

        private void OnTaskCompleted()
        {
            lock (_syncObj)
            {
                if (--_runningTaskCount == 0)
                {
                    ProcessTaskQueue();
                }
            }
        }

        private class QTask
        {
            public Action Task { get; set; }
            public bool IsParallel { get; set; }
        }
    }
}
Update
Fixed a bug in maintaining the batch size when rolling over to the next batch window:
_CurrentBatchSizeMb = fileSizeMb;
