Recursive function MultiThreading to perform one task at a time - c#

I am writing a program to crawl websites. The crawl function is recursive and may take a long time to complete, so I used multithreading to crawl multiple websites.
What I actually need is this: after it finishes crawling one website, it should start the next one (which should be waiting in a queue) instead of crawling multiple websites at the same time.
I am using C# and ASP.NET.

The standard practice for doing this is to use a blocking queue. If you are using .NET 4.0, you can take advantage of the BlockingCollection class; otherwise, you can use Stephen Toub's implementation.
What you will do is spin up as many worker threads as you feel necessary and have them go around in an infinite loop, dequeueing items as they appear in the queue. Your main thread will be enqueueing the items. A blocking queue is designed to wait/block on the dequeue operation until an item becomes available.
public class Program
{
private static BlockingQueue<string> m_Queue = new BlockingQueue<string>();
public static void Main()
{
var thread1 = new Thread(Process);
var thread2 = new Thread(Process);
thread1.Start();
thread2.Start();
while (true)
{
string url = GetNextUrl();
m_Queue.Enqueue(url);
}
}
public static void Process()
{
while (true)
{
string url = m_Queue.Dequeue();
// Do whatever with the url here.
}
}
}
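For the "one website at a time" requirement in the question, a single consumer thread is enough. The following is a minimal sketch, assuming .NET 4.0 or later and using BlockingCollection instead of the BlockingQueue class above. The names SequentialCrawler, CrawlAll and crawlSite are hypothetical and stand in for your own crawl code.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

// Hypothetical sketch: one consumer thread drains a BlockingCollection,
// so sites are crawled strictly one after another.
public static class SequentialCrawler
{
    // crawlSite stands in for the real (recursive) crawl function.
    public static List<string> CrawlAll(IEnumerable<string> urls, Action<string> crawlSite)
    {
        var queue = new BlockingCollection<string>();
        var finished = new List<string>(); // written only by the worker thread

        var worker = new Thread(() =>
        {
            // Blocks while the queue is empty; the loop ends after CompleteAdding().
            foreach (string url in queue.GetConsumingEnumerable())
            {
                crawlSite(url);
                finished.Add(url);
            }
        });
        worker.Start();

        foreach (string url in urls)
        {
            queue.Add(url);
        }
        queue.CompleteAdding(); // tell the worker that no more URLs are coming
        worker.Join();          // wait until the last site is done
        return finished;
    }
}
```

Because there is only one consumer, the sites are processed in the order they were queued. Starting more worker threads on the same collection would give the multi-site behavior of the original answer.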

I don't usually think positive thoughts when it comes to web crawlers...
You want to use a threadpool.
ThreadPool.QueueUserWorkItem(new WaitCallback(CrawlSite), (object)s);
You simply 'push' your workload into the queue, and let the threadpool manage it.

I have to say - I'm not a Threading expert and my C# is quite rusty - but considering the requirements I would suggest something like this:
1. Define a Queue for the websites.
2. Define a Pool with Crawler threads.
3. The main process iterates over the website queue and retrieves the site address.
4. Retrieve an available thread from the pool, assign it the website address and allow it to start running. Set an indicator in the thread object that it should wait for all subsequent threads to finish (so you will not continue to the next site).
5. Once all the threads have ended, the main thread (started in step #4) will end and return to the main loop of the main process to continue to the next website.
The Crawler behavior should be something like this:
1. Investigate the content of the current address.
2. Retrieve the hierarchy below the current level.
3. For each child of the current node of the site tree, pull a new crawler thread from the pool and start it running in the background with the address of the child node.
4. If the pool is empty, wait until a thread becomes available.
5. If the thread is marked to wait, wait for all the other threads to finish.
I think there are some challenges here, but as a general flow I believe it can do the job.

Put all your URLs in a queue, and pop one off the queue each time you are done with the previous one.
You could also put the recursive links in the queue, to better control how many downloads you are executing at a time.
You could set up X worker threads which all take URLs off the queue in order to process more at a time. But this way you can throttle it yourself.
You can use ConcurrentQueue<T> in .NET to get a thread-safe queue to work with.
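As a rough illustration of putting the recursive links into the queue, here is a sketch that replaces recursion with a loop over a ConcurrentQueue<T>. The names QueueCrawler, getLinks and maxPages are assumptions made for the example: getLinks stands in for "download the page and extract its links".

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

public static class QueueCrawler
{
    // getLinks is a hypothetical stand-in for downloading a page
    // and returning the URLs it links to.
    public static List<string> Crawl(
        string startUrl,
        Func<string, IEnumerable<string>> getLinks,
        int maxPages)
    {
        var pending = new ConcurrentQueue<string>();
        var seen = new HashSet<string>();
        var visited = new List<string>();

        pending.Enqueue(startUrl);
        seen.Add(startUrl);

        string url;
        // One URL is taken off the queue only after the previous one is done.
        while (visited.Count < maxPages && pending.TryDequeue(out url))
        {
            visited.Add(url);
            foreach (string link in getLinks(url))
            {
                // HashSet.Add returns false for URLs that were already queued.
                if (seen.Add(link))
                {
                    pending.Enqueue(link);
                }
            }
        }
        return visited;
    }
}
```

This version runs on a single thread, so the HashSet and List need no locking. If you add worker threads as described above, those two collections would need their own synchronization; only the queue itself is thread safe.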

Related

create multiple threads and communicate with them

I have a program that takes a long time to initialize, but its execution is rather fast.
It's becoming a bottleneck, so I want to start multiple instances of this program (like a pool) that are already initialized. The idea is to just pass the arguments needed for each execution, saving all the initialization time.
The problem is that I only found how to start new processes passing arguments:
How to pass parameters to ThreadStart method in Thread?
but I would like to start the process normally and then be able to communicate with it, sending each thread the parameters required for its execution.
The best approach I found was to create multiple threads in which I would initialize the program, and then use some communication mechanism (named pipes, for example, as it's all running on the same machine) to pass those arguments and trigger the execution of the program (one of the triggers could break an infinite loop, for example).
I'm asking whether anyone can advise a better solution than the one I came up with.
I suggest you don't mess with direct Thread usage, and use the TPL, something like this:
foreach (var data in YOUR_INITIALIZATION_LOGIC_METHOD_HERE)
{
Task.Run(() => yourDelegate(data) /*, other params here */);
}
More about Task.Run on MSDN, Stephen Cleary blog
Process != Thread
A thread lives inside a process, while a process is an entire program or service in your OS.
If you want to speed up your app initialization you can still use threads, but nowadays we use them through the Task Parallel Library, following the Task-based Asynchronous Pattern.
In order for tasks (usually threads) to communicate, you might need to implement some kind of state machine (a basic workflow) where you can detect when a task progresses and perform actions based on its state (running, failed, completed...).
Anyway, you don't need named pipes or anything like that for tasks/threads to communicate, as everything lives in the same parent process. Instead, use regular programming approaches: C# thread synchronization mechanisms and some kind of in-app messaging.
Some very basic idea...
.NET has a List<T> collection class. You should design a coordinator class where you might add some list which receives a message class (designed by you) like this:
public enum OperationType { DataInitialization, Authentication, Caching }
public class Message
{
public OperationType Operation { get; set; }
public Task Task { get; set; }
}
When you start all the parallel initialization tasks, you add each one to a list in the coordinator class:
Coordinator.Messages.AddRange
(
new List<Message>
{
new Message
{
Operation = OperationType.DataInitialization,
Task = dataInitTask
},
..., // <--- more messages
}
);
Once you've added all messages with pending initialization tasks, somewhere in your code you can wait until initialization ends asynchronously this way:
// You do a projection of each message to get an IEnumerable<Task>
// to give it as argument of Task.WhenAll
await Task.WhenAll(Coordinator.Messages.Select(message => message.Task));
While this line waits asynchronously for all initialization to finish, your UI (i.e. the main thread) can continue to work and show some kind of loading animation or whatever else you like.
Perhaps you can go a step further, and don't wait for all but wait for a group of tasks which allow your users to start using your app, while other non-critical tasks end...
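For the pool of already-initialized workers that the question describes, one possible in-process arrangement is to let each worker pay the initialization cost once and then receive its arguments through a shared BlockingCollection. This is only a sketch under that assumption: WarmWorkerPool, init and execute are hypothetical names, not part of any library.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class WarmWorkerPool
{
    // init: the slow one-time setup, run once per worker.
    // execute: the fast per-request work that reuses the initialized state.
    public static List<TResult> Run<TState, TArg, TResult>(
        int workerCount,
        Func<TState> init,
        Func<TState, TArg, TResult> execute,
        IEnumerable<TArg> arguments)
    {
        var requests = new BlockingCollection<TArg>();
        var results = new ConcurrentBag<TResult>();
        var workers = new Task[workerCount];

        for (int i = 0; i < workerCount; i++)
        {
            workers[i] = Task.Factory.StartNew(() =>
            {
                TState state = init(); // pay the initialization cost once
                // Then serve arguments until the producer calls CompleteAdding().
                foreach (TArg arg in requests.GetConsumingEnumerable())
                {
                    results.Add(execute(state, arg));
                }
            }, TaskCreationOptions.LongRunning);
        }

        foreach (TArg arg in arguments)
        {
            requests.Add(arg);
        }
        requests.CompleteAdding();
        Task.WaitAll(workers);
        return new List<TResult>(results); // order of results is not guaranteed
    }
}
```

No named pipes are involved: the collection is the communication channel, which matches the point above that everything lives in the same process.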

Thread Pool of workers in a Windows service?

I'm creating a Windows service with 2 separate components:
1 component creates jobs and inserts them to the database (1 thread)
The 2nd component processes these jobs (multiple FIXED # of threads in a thread pool)
These 2 components will always run as long as the service is running.
What I'm stuck on is determining how to implement this thread pool. I've done some research, and there seem to be many ways of doing this, such as creating a class that overrides the method "ThreadPoolCallback" and using ThreadPool.QueueUserWorkItem to queue a work item. http://msdn.microsoft.com/en-us/library/3dasc8as.aspx
However, the example given doesn't seem to fit my scenario. I want to create a FIXED number of threads in a thread pool initially, then feed it jobs to process. How do I do this?
// Wrapper method for use with thread pool.
public void ThreadPoolCallback(Object threadContext)
{
int threadIndex = (int)threadContext;
Console.WriteLine("thread {0} started...", threadIndex);
_fibOfN = Calculate(_n);
Console.WriteLine("thread {0} result calculated...", threadIndex);
_doneEvent.Set();
}
const int FibonacciCalculations = 10;
Fibonacci[] fibArray = new Fibonacci[FibonacciCalculations];
for (int i = 0; i < FibonacciCalculations; i++)
{
ThreadPool.QueueUserWorkItem(f.ThreadPoolCallback, i);
}
Create a BlockingCollection of work items. The thread that creates jobs adds them to this collection.
Create a fixed number of persistent threads that read items from that BlockingCollection and process them. Something like:
BlockingCollection<WorkItem> WorkItems = new BlockingCollection<WorkItem>();
void WorkerThreadProc()
{
foreach (var item in WorkItems.GetConsumingEnumerable())
{
// process item
}
}
Multiple worker threads can be doing that concurrently. BlockingCollection supports multiple readers and writers, so there are no concurrency problems that you have to deal with.
See my blog post Simple Multithreading, part 2 for an example that uses one consumer and one producer. Adding multiple consumers is a very simple matter of spinning up a new task for each consumer.
Another way to do it is to use a semaphore that controls how many jobs are currently being processed. I show how to do that in this answer. However, I think the shared BlockingCollection is in general a better solution.
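To make the semaphore alternative concrete, here is a minimal sketch (it is not the code from the linked answer). It assumes .NET 4.0 or later; ThrottledRunner and RunAll are hypothetical names. The method returns the highest concurrency it observed so that the limit can be checked.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottledRunner
{
    // Runs every job, but never more than maxConcurrent at the same time.
    // Returns the highest number of jobs that were running simultaneously.
    public static int RunAll(IEnumerable<Action> jobs, int maxConcurrent)
    {
        var gate = new SemaphoreSlim(maxConcurrent);
        var tasks = new List<Task>();
        object sync = new object();
        int running = 0;
        int peak = 0;

        foreach (Action job in jobs)
        {
            gate.Wait(); // block the producer until a slot is free
            Action current = job;
            tasks.Add(Task.Factory.StartNew(() =>
            {
                lock (sync)
                {
                    running++;
                    if (running > peak) peak = running;
                }
                try
                {
                    current();
                }
                finally
                {
                    lock (sync) { running--; }
                    gate.Release(); // free the slot even if the job throws
                }
            }));
        }

        Task.WaitAll(tasks.ToArray());
        return peak;
    }
}
```

Note that this limits how many jobs run at once but does not keep a fixed set of persistent threads, which is one reason the shared BlockingCollection is usually the simpler fit for the question.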
The .NET thread pool isn't really designed for a fixed number of threads. It's designed to use the resources of the machine in the best way possible to perform multiple relatively small jobs.
Maybe a better solution for you would be to instantiate a fixed number of BackgroundWorkers instead? There are some reasonable BW examples.

Running class in the background

I have a WinForms application that starts a mini server to serve web pages to the local browser. The problem is that when I start it, the application obviously won't respond, because there is a loop that waits for requests; for every request I create a new thread. Should I create a completely new thread for the entire process, or is there another way? The class is in a separate DLL file I have created. On its own it works perfectly as expected.
I suggest you take a look at the ThreadPool Class. It is an easy-to-use option for handling multiple threads:
The thread pool enables you to use threads more efficiently by providing your application with a pool of worker threads that are managed by the system.
To queue a method for execution simply use the QueueUserWorkItem Method:
ThreadPool.QueueUserWorkItem(state =>
{
// do some work!
});
If you realize that you need more active concurrent threads to serve your clients, call the SetMaxThreads Method:
ThreadPool.SetMaxThreads(50, 10);
All requests above those numbers for worker threads and I/O threads remain queued until thread pool threads become available.
There are two ways here:
Async server. More difficult to write, but better performance. http://robjdavey.wordpress.com/2011/02/12/asynchronous-tcp-server-example/
One thread per client. Easy to write, but not suitable if you have many clients. http://tech.pro/tutorial/704/csharp-tutorial-simple-threaded-tcp-server
Don't use a loop that waits for requests.
I would follow @Thomas's suggestion, but add wait handles to your ThreadPool work items to manage the callback cycles.
// waitHandles is assumed to be a ManualResetEvent[] with one entry per work item.
WaitCallback classMethod1 = new WaitCallback(DoClassMethod1);
bool isQueued1 = ThreadPool.QueueUserWorkItem(classMethod1, waitHandles[0]);
WaitCallback classMethod2 = new WaitCallback(DoClassMethod2);
bool isQueued2 = ThreadPool.QueueUserWorkItem(classMethod2, waitHandles[1]);
// do this if you want to wait for all requests to complete
if (WaitHandle.WaitAll(waitHandles, 5000, false))
{
// requests completed, show your result.
}
else
{
// problem: timed out before all handles were signaled.
}
void DoClassMethod1(object state)
{
// do your work
ManualResetEvent mre = (ManualResetEvent)state;
mre.Set();
}

What is the most efficient method for assigning threads based on the following scenario?

I can have a maximum of 5 threads running simultaneously at any one time, which make use of 5 separate hardware devices to speed up some complex calculations and return the result. The API (which contains only one method) for each of these devices is not thread safe and can only run on a single thread at any point in time. Once a computation is completed, the same thread can be reused to start another computation on either the same or a different device, depending on availability. Each computation is standalone and does not depend on the results of any other computation. Hence, up to 5 threads may complete their execution in any order.
What is the most efficient C# (using .Net Framework 2.0) coding solution for keeping track of which hardware is free/available and assigning a thread to the appropriate hardware API for performing the computation? Note that other than the limitation of 5 concurrently running threads, I do not have any control over when or how the threads are fired.
Please correct me if I am wrong but a lock free solution is preferred as I believe it will result in increased efficiency and a more scalable solution.
Also note that this is not homework although it may sound like it...
.NET provides a thread pool that you can use. System.Threading.ThreadPool.QueueUserWorkItem() tells a thread in the pool to do some work for you.
Were I designing this, I'd not focus on mapping threads to your HW resources. Instead I'd expose a lockable object for each HW resource - this can simply be an array or queue of 5 Objects. Then for each bit of computation you have, call QueueUserWorkItem(). Inside the method you pass to QUWI, find the next available lockable object and lock it (aka, dequeue it). Use the HW resource, then re-enqueue the object, exit the QUWI method.
It won't matter how many times you call QUWI; there can be at most 5 locks held, each lock guards access to one instance of your special hardware device.
The doc page for Monitor.Enter() shows how to create a safe (blocking) Queue that can be accessed by multiple workers. In .NET 4.0, you would use the builtin BlockingCollection - it's the same thing.
That's basically what you want. Except don't create the threads yourself with new Thread(). Use the thread pool.
cite: Advantage of using Thread.Start vs QueueUserWorkItem
// assume the SafeQueue class from the cited doc page.
SafeQueue<SpecialHardware> q = new SafeQueue<SpecialHardware>();
// set up the queue with objects protecting the 5 magic stones
private void Setup()
{
for (int i=0; i< 5; i++)
{
q.Enqueue(GetInstanceOfSpecialHardware(i));
}
}
// something like this gets called many times, by QueueUserWorkItem()
public void DoWork(WorkDescription d)
{
d.DoPrepWork();
// gain access to one of the special hardware devices
SpecialHardware shw = q.Dequeue();
try
{
shw.DoTheMagicThing();
}
finally
{
// ensure no matter what happens the HW device is released
q.Enqueue(shw);
// at this point another worker can use it.
}
d.DoFollowupWork();
}
A lock free solution is only beneficial if the computation time is very small.
I would create a facade for each hardware thread where jobs are enqueued and a callback is invoked each time a job finishes.
Something like:
public class Job
{
public string JobInfo {get;set;}
public Action<Job> Callback {get;set;}
}
public class MyHardwareService
{
Queue<Job> _jobs = new Queue<Job>();
Thread _hardwareThread;
ManualResetEvent _event = new ManualResetEvent(false);
public MyHardwareService()
{
_hardwareThread = new Thread(WorkerFunc);
_hardwareThread.Start();
}
public void Enqueue(Job job)
{
lock (_jobs)
{
_jobs.Enqueue(job);
// signal inside the lock so the worker cannot miss a wake-up
_event.Set();
}
}
public void WorkerFunc()
{
while (true)
{
_event.WaitOne();
Job currentJob;
lock (_jobs)
{
if (_jobs.Count == 0)
{
// nothing left: go back to waiting until Enqueue signals again
_event.Reset();
continue;
}
currentJob = _jobs.Dequeue();
}
//invoke hardware here.
//trigger callback in a Thread Pool thread to be able
// to continue with the next job ASAP
ThreadPool.QueueUserWorkItem(state => currentJob.Callback(currentJob));
}
}
}
Sounds like you need a thread pool with 5 threads where each one relinquishes the HW once it's done and adds it back to some queue. Would that work? If so, .Net makes thread pools very easy.
Sounds a lot like the Sleeping barber problem. I believe the standard solution to that is to use semaphores.

In .NET is there a thread scheduler for long running threads?

Our scenario is a network scanner.
It connects to a set of hosts and scans them in parallel for a while using low priority background threads.
I want to be able to schedule lots of work but only have a given number of hosts (say ten) scanned in parallel. Even if I create my own threads, the many callbacks and other asynchronous goodness use the ThreadPool and I end up running out of resources. I should look at MonoTorrent...
If I use THE ThreadPool, can I limit my application to some number of threads that will leave enough for the rest of the application to run smoothly?
Is there a thread pool that I can initialize to n long-lived threads?
[Edit]
No one seems to have noticed that I made some comments on some responses so I will add a couple things here.
Threads should be cancellable both gracefully and forcefully.
Threads should have low priority leaving the GUI responsive.
Threads are long running but in Order(minutes) and not Order(days).
Work for a given target host is basically:
For each test
Probe target (work is done mostly on the target end of an SSH connection)
Compare probe result to expected result (work is done on engine machine)
Prepare results for host
Can someone explain why using SmartThreadPool is marked with a negative usefulness?
In .NET 4 you have the integrated Task Parallel Library. When you create a new Task (the new thread abstraction) you can specify that the Task is long running. We have had good experiences with that (long being days rather than minutes or hours).
You can use it in .NET 2 as well but there it's actually an extension, check here.
In VS2010, debugging of parallel applications based on Tasks (not threads) has been radically improved. It's advised to use Tasks rather than raw threads whenever possible, since they let you handle parallelism in a more object-oriented way.
UPDATE
Tasks that are NOT specified as long running, are queued into the thread pool (or any other scheduler for that matter).
But if a task is specified to be long running, it just creates a standalone Thread, no thread pool is involved.
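A minimal sketch of a long-running task with graceful cancellation might look like this. LongRunningScan and probeOnce are hypothetical names used only for illustration; probeOnce stands in for one unit of scanning work.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class LongRunningScan
{
    // Starts a scan loop with the LongRunning hint, so the scheduler is told
    // not to tie up a regular thread pool thread for it.
    // The returned task yields the number of probes completed.
    public static Task<int> Start(CancellationToken token, Action probeOnce)
    {
        return Task.Factory.StartNew(() =>
        {
            int probes = 0;
            // Graceful cancellation: the loop checks the token between probes.
            while (!token.IsCancellationRequested)
            {
                probeOnce();
                probes++;
            }
            return probes;
        }, TaskCreationOptions.LongRunning);
    }
}
```

A caller would create a CancellationTokenSource, pass its Token to Start, and call Cancel() when the scan should stop. This only covers cooperative cancellation; forcefully stopping a thread is a separate problem that the token does not solve.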
The CLR ThreadPool isn't appropriate for executing long-running tasks: it's for performing short tasks where the cost of creating a thread would be nearly as high as executing the method itself. (Or at least a significant percentage of the time it takes to execute the method.) As you've seen, .NET itself consumes thread pool threads, so you can't reserve a block of them for yourself without risking starving the runtime.
Scheduling, throttling, and cancelling work is a different matter. There's no other built-in .NET worker-queue thread pool, so you'll have to roll your own (managing the threads or BackgroundWorkers yourself) or find a preexisting one (Ami Bar's SmartThreadPool looks promising, though I haven't used it myself).
In your particular case, the best option would not be threads, the thread pool, or BackgroundWorker, but the asynchronous programming model (BeginXXX, EndXXX) provided by the framework.
The advantage of using the asynchronous model is that the TCP/IP stack uses callbacks whenever there is data to read, and the callback is automatically run on a thread from the thread pool.
Using the asynchronous model, you can control the number of requests initiated per time interval. If you want, you can also initiate all the requests from a lower-priority thread while processing the requests on a normal-priority thread, which means the packets will stay as briefly as possible in the internal TCP queue of the networking stack.
Asynchronous Client Socket Example - MSDN
P.S. For multiple concurrent and long-running jobs that don't do a lot of computation but mostly wait on I/O (network, disk, etc.), the better option is always to use a callback mechanism and not threads.
I'd create your own thread manager. In the following simple example a Queue is used to hold waiting threads and a Dictionary is used to hold active threads, keyed by ManagedThreadId. When a thread finishes, it removes itself from the active dictionary and launches another thread via a callback.
You can change the max running thread limit from your UI, and you can pass extra info to the ThreadDone callback for monitoring performance, etc. If a thread fails for say, a network timeout, you can reinsert back into the queue. Add extra control methods to Supervisor for pausing, stopping, etc.
using System;
using System.Collections.Generic;
using System.Threading;
namespace ConsoleApplication1
{
public delegate void CallbackDelegate(int idArg);
class Program
{
static void Main(string[] args)
{
new Supervisor().Run();
Console.WriteLine("Done");
Console.ReadKey();
}
}
class Supervisor
{
Queue<System.Threading.Thread> waitingThreads = new Queue<System.Threading.Thread>();
Dictionary<int, System.Threading.Thread> activeThreads = new Dictionary<int, System.Threading.Thread>();
int maxRunningThreads = 10;
object locker = new object();
volatile bool done;
public void Run()
{
// queue up some threads
for (int i = 0; i < 50; i++)
{
Thread newThread = new Thread(new Worker(ThreadDone).DoWork);
newThread.IsBackground = true;
waitingThreads.Enqueue(newThread);
}
LaunchWaitingThreads();
while (!done) Thread.Sleep(200);
}
// keep starting waiting threads until we max out
void LaunchWaitingThreads()
{
lock (locker)
{
while ((activeThreads.Count < maxRunningThreads) && (waitingThreads.Count > 0))
{
Thread nextThread = waitingThreads.Dequeue();
activeThreads.Add(nextThread.ManagedThreadId, nextThread);
nextThread.Start();
Console.WriteLine("Thread " + nextThread.ManagedThreadId.ToString() + " launched");
}
done = (activeThreads.Count == 0) && (waitingThreads.Count == 0);
}
}
// this is called by each thread when it's done
void ThreadDone(int threadIdArg)
{
lock (locker)
{
// remove thread from active pool
activeThreads.Remove(threadIdArg);
}
Console.WriteLine("Thread " + threadIdArg.ToString() + " finished");
LaunchWaitingThreads(); // this could instead be put in the wait loop at the end of Run()
}
}
class Worker
{
CallbackDelegate callback;
public Worker(CallbackDelegate callbackArg)
{
callback = callbackArg;
}
public void DoWork()
{
System.Threading.Thread.Sleep(new Random().Next(100, 1000));
callback(System.Threading.Thread.CurrentThread.ManagedThreadId);
}
}
}
Use the built-in threadpool. It has good capabilities.
Alternatively you can look at the Smart Thread Pool implementation here or at Extended Thread Pool for a limit on the maximum number of working threads.
