I want to access a web server using HttpWebRequest and fetch thousands of records from a given range of pages. Each hit to a page fetches 15 records, and there are roughly 8,000 to 10,000 pages on the web server. That means up to 120,000 hits to the server! Done naively with a single thread, the task is very time consuming, so multithreading is the immediate solution that comes to mind.
Currently, I have created a worker class for searching; that worker class spawns 5 subworkers (threads) to search a given range. But, being a novice with threading, I am unable to make it work: I am having trouble synchronizing the threads and making them work together. I know about delegates, actions, and events in .NET, but making them work with threads is getting confusing. This is the code that I am using:
public void Start()
{
    this.totalRangePerThread = ((this.endRange - this.startRange) / this.subWorkerThreads.Length);
    for (int i = 0; i < this.subWorkerThreads.Length; ++i)
    {
        //theThreads[counter] = new Thread(new ThreadStart(MethodName));
        this.subWorkerThreads[i] = new Thread(() => searchItem(this.startRange, this.totalRangePerThread));
        //this.subWorkerThreads[i].Start();
        this.startRange = this.startRange + this.totalRangePerThread;
    }
    for (int threadIndex = 0; threadIndex < this.subWorkerThreads.Length; ++threadIndex)
        this.subWorkerThreads[threadIndex].Start();
}
The searchItem method:
public void searchItem(int start, int pagesToSearchPerThread)
{
    for (int count = 0; count < pagesToSearchPerThread; ++count)
    {
        //searching routine here
    }
}
The problem lies in the variables shared between the threads; can anyone guide me on how to make this procedure thread-safe?
The real problem you're facing is that the lambda expression in the Thread constructor is capturing the outer variable (startRange). One way to fix it is to make a copy of the variable, like this:
for (int i = 0; i < this.subWorkerThreads.Length; ++i)
{
    var copy = startRange;
    this.subWorkerThreads[i] = new Thread(() => searchItem(copy, this.totalRangePerThread));
    this.startRange = this.startRange + this.totalRangePerThread;
}
For more information on creating and starting threads, see Joe Albahari's excellent ebook (there's also a section on captured variables a bit further down). If you want to learn about closures, see this question.
The first answer is that these threads don't really need to share much (assuming I'm understanding you correctly). They have some shared input variables (the target web server, for example), but those are thread-safe because they aren't being changed. The plan, presumably, is that they'll build a database or some such containing the records they retrieve. You should be fine to have each of the five fill its own result collection, and then merge them on a single thread once all the subworker threads are done, as in the sketch below. If the architecture you're using to store the data somehow makes that expensive, then how much you're planning to store and what you're planning to store it in become pertinent; perhaps you could share what those are?
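As a rough sketch of that idea (Record and FetchPage here are hypothetical placeholders for your own record type and page-fetching routine):
// Each thread fills its own list, so no locking is needed while fetching;
// the results are merged on a single thread after Thread.Join.
var perThreadResults = new List<Record>[threadCount];
var threads = new Thread[threadCount];

for (int i = 0; i < threadCount; ++i)
{
    int index = i;                                   // copy; don't capture the loop variable
    int start = startRange + index * totalRangePerThread;
    perThreadResults[index] = new List<Record>();
    threads[index] = new Thread(() =>
    {
        for (int page = start; page < start + totalRangePerThread; ++page)
            perThreadResults[index].AddRange(FetchPage(page));   // hypothetical fetch routine
    });
    threads[index].Start();
}

foreach (var t in threads)
    t.Join();                                        // wait for all subworkers

var allRecords = new List<Record>();                 // merge on a single thread
foreach (var list in perThreadResults)
    allRecords.AddRange(list);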
Related
I implemented a method for reporting progress from a parallel select. It works fine, but after looking at some answers to similar problems on Stack Overflow I wonder whether my approach was overkill or whether it has merit.
My current method is just as fast as not reporting progress and doing an AsParallel on the whole collection. However, I am interested in the limitation to 12 threads and how it might impact performance on other machines.
int count = ids.Count();
const int takeValue = 12;
List<ValueDecorator> results = new List<ValueDecorator>();
for (int i = 0; ids.Count > 0; i += takeValue)
{
    int take = takeValue;
    if (take > ids.Count)
    {
        take = ids.Count;   // don't take more than is left
    }
    var idsToProcess = ids.Take(take);
    var processed = idsToProcess.AsParallel().Select(x =>
    {
        var value = GetValue(x, divBuffer, mask);
        return new ValueDecorator(value, x);
    }).ToList();
    results.AddRange(processed);
    ids.RemoveRange(0, take);
    int processedSoFar = i;  // items handled before this chunk
    float percent = (processedSoFar / (float)count) * 100.0f;
    if (backgroundWorker != null)
    {
        backgroundWorker.ReportProgress((int)percent);
        if (backgroundWorker.CancellationPending)
            return;
    }
}
Splitting the work and reporting back the progress doesn't seem to take up much computational time, but the application is doing some (even if little) extra work to do it.
My current method is just as fast as not reporting progress and doing
an AsParallel on the whole collection
You can try your method with much larger data, benchmark both strategies, and pick the better one.
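A minimal sketch of such a benchmark, reusing GetValue, divBuffer, mask and ValueDecorator from your question:
// Time a plain AsParallel over the whole collection against the
// chunked strategy; run both on realistically sized data.
var sw = System.Diagnostics.Stopwatch.StartNew();
var whole = ids.AsParallel()
               .Select(x => new ValueDecorator(GetValue(x, divBuffer, mask), x))
               .ToList();
sw.Stop();
Console.WriteLine("Whole collection: {0} ms", sw.ElapsedMilliseconds);

sw.Restart();
// ... run the chunked loop with progress reporting from the question here ...
sw.Stop();
Console.WriteLine("Chunked with progress: {0} ms", sw.ElapsedMilliseconds);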
However, splitting up the processing and doing the additional work of reporting progress will cost you some additional time (which will be noticeable for very large data). You should decide whether the progress report adds enough value to your application to make up for the additional time.
The performance of AsParallel() depends on various factors, such as the computational load, the number of system cores, etc. This particular method might run faster on your system and perform worse on a system with less computational power.
See this link for speeding up PLINQ: https://msdn.microsoft.com/en-us/library/dd997399(v=vs.110).aspx
I have the following C# code:
string[] files = Directory.GetFiles(@"C:\Users\Documents\DailyData", "*.txt", SearchOption.AllDirectories);
for (int ii = 0; ii < files.Length; ii++)
{
    perFile(files[ii]);
}
where perFile opens each file, creates a series of generic lists, and then writes the results to a file.
Each iteration takes about 15 minutes.
I want the perFile method to execute concurrently, i.e. not wait for the current iteration to finish before starting the next one.
My questions are:
How would I do this?
How can I control how many instances of perFile are running concurrently?
How can I determine how many concurrent instances of perFile my computer can handle at one time?
Try using Parallel.ForEach. You can use the ParallelOptions.MaxDegreeOfParallelism property to set the maximum number of concurrent perFile calls.
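A minimal sketch of that (the value 4 is an arbitrary example):
// using System.Threading.Tasks;
// Process files concurrently, at most 4 at a time (arbitrary example value).
var options = new ParallelOptions { MaxDegreeOfParallelism = 4 };
Parallel.ForEach(files, options, file => perFile(file));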
You can use Parallel.For to do what you need if you're using .NET 4.0 and above. You can control how many threads to use by providing a ParallelOptions with MaxDegreeOfParallelism set to a number you like.
Example code:
ParallelOptions POptions = new ParallelOptions();
POptions.MaxDegreeOfParallelism = ReplaceThisWithSomeNumber;
Parallel.For(0, files.Length, POptions, (x) => { perFile(files[x]); });
Note that your program will be IO bound, though (read up on asynchronous IO for more information on how to tackle that).
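If you do go down the asynchronous route, a minimal sketch of throttled async processing might look like this (perFileAsync is a hypothetical async counterpart of perFile, not something from the question):
// using System.Linq; using System.Threading; using System.Threading.Tasks;
// Limit concurrent async file processing with SemaphoreSlim.
// This must run inside an async method.
var throttle = new SemaphoreSlim(4);            // at most 4 files in flight
var tasks = files.Select(async file =>
{
    await throttle.WaitAsync();                 // wait for a free slot
    try { await perFileAsync(file); }           // hypothetical async version of perFile
    finally { throttle.Release(); }             // free the slot
});
await Task.WhenAll(tasks);                      // wait for all files to finish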
Background
Historically, I've not had to write much thread-related code. I am familiar with the process, and I can use System.Threading to do what I need when the need arises. However, the bulk of my experience has been stuck in the .NET 2.0 world (ugh!) and I need a bit of help executing a very simple task in the most recent version of .NET.
I need to write a simple parallel job, but I would prefer that it execute as a low-priority task that doesn't bring my PC to a sluggish halt.
Below is a simple parallel example that demonstrates the type of work I'm trying to accomplish. In this example, I take a random number and store it if the new value is larger than the largest random number discovered so far.
Please understand, the point of this example is strictly to show that I have a calculation I wish to repeatedly execute, compare, and store. I understand the need for variable locking, and I also know that this code isn't perfect. However, it is an extremely simple example of the nature of what I need to accomplish. If you run this code, your CPU will come to a grinding halt.
int i = 0;
ThreadSafeRNG r = new ThreadSafeRNG();
ParallelOptions options = new ParallelOptions { MaxDegreeOfParallelism = 4 }; //Assume I have 8 cores.
Parallel.For(0, Int32.MaxValue, options, (j, loopState) =>
{
    int k = r.Next();
    if (k > i) i = k; // store the new value if it is larger (not locked; illustrative only)
});
How can I do this work in parallel and have it execute as a low-priority CPU job? The above mechanism has provided tremendous performance improvements over a standard for-loop, but the cost has been that I can't use my computer while it runs, even after setting the MaxDegreeOfParallelism option.
What can I do?
Because Parallel.For uses ThreadPool threads, you cannot set the priority of the threads to be low. However, you can change the priority of the entire application:
Process currentProcess = Process.GetCurrentProcess();
ProcessPriorityClass oldPriority = currentProcess.PriorityClass;
try
{
    currentProcess.PriorityClass = ProcessPriorityClass.BelowNormal;
    int i = 0;
    ThreadSafeRNG r = new ThreadSafeRNG();
    ParallelOptions options = new ParallelOptions { MaxDegreeOfParallelism = 4 }; //Assume I have 8 cores.
    Parallel.For(0, Int32.MaxValue, options, (j, loopState) =>
    {
        int k = r.Next();
        if (k > i) i = k;
    });
}
finally
{
    //Bring the priority back up to the original level.
    currentProcess.PriorityClass = oldPriority;
}
If you really don't want your whole application's priority to be lowered, combine this trick with some form of IPC, like WCF, and have your slow, long-running operation run in a second process that you start up and kill as needed.
Using the Parallel.For method, you cannot set thread priority, since they are ThreadPool threads.
Your best bet is to set ParallelOptions.MaxDegreeOfParallelism to Environment.ProcessorCount - 1; this way you have a free core to do other things (such as handle the GUI).
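A minimal sketch of that suggestion:
// Leave one core free for the GUI / the rest of the system.
var options = new ParallelOptions
{
    MaxDegreeOfParallelism = Math.Max(1, Environment.ProcessorCount - 1)
};
Parallel.For(0, Int32.MaxValue, options, (j, loopState) =>
{
    // work item here
});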
OK, so I just started screwing around with threading. It's taking a bit of time to wrap my head around the concepts, so I wrote a pretty simple test to see how much faster, if faster at all, printing out 20000 lines would be (I figured it would be faster since I have a quad core processor?).
So first I wrote this (this is how I would normally do it):
System.DateTime startdate = DateTime.Now;
for (int i = 0; i < 10000; ++i)
{
    Console.WriteLine("Producing " + i);
    Console.WriteLine("\t\t\t\tConsuming " + i);
}
System.DateTime endtime = DateTime.Now;
Console.WriteLine(startdate.Second + ":" + startdate.Millisecond + " to " + endtime.Second + ":" + endtime.Millisecond);
And then with threading:
public class Test
{
    static ProducerConsumer queue;

    public System.DateTime startdate = DateTime.Now;

    static void Main()
    {
        queue = new ProducerConsumer();
        new Thread(new ThreadStart(ConsumerJob)).Start();
        for (int i = 0; i < 10000; i++)
        {
            Console.WriteLine("Producing {0}", i);
            queue.Produce(i);
        }
    }

    static void ConsumerJob()
    {
        Test a = new Test(); // records the consumer's start time
        for (int i = 0; i < 10000; i++)
        {
            object o = queue.Consume();
            Console.WriteLine("\t\t\t\tConsuming {0}", o);
        }
        System.DateTime endtime = DateTime.Now;
        Console.WriteLine(a.startdate.Second + ":" + a.startdate.Millisecond + " to " + endtime.Second + ":" + endtime.Millisecond);
    }
}

public class ProducerConsumer
{
    readonly object listLock = new object();
    Queue queue = new Queue();

    public void Produce(object o)
    {
        lock (listLock)
        {
            queue.Enqueue(o);
            Monitor.Pulse(listLock);
        }
    }

    public object Consume()
    {
        lock (listLock)
        {
            while (queue.Count == 0)
            {
                Monitor.Wait(listLock);
            }
            return queue.Dequeue();
        }
    }
}
Now, for some reason I assumed this would be faster, but after testing it 15 times, the median of the results is... a few milliseconds different, in favor of not threading.
Then I figured, hey... maybe I should try it with a million Console.WriteLines, but the results were similar.
Am I doing something wrong?
Writing to the console is internally synchronized. It is not parallel. It also causes cross-process communication.
In short: It is the worst possible benchmark I can think of ;-)
Try benchmarking something real, something that you actually would want to speed up. It needs to be CPU bound and not internally synchronized.
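For instance, here is a minimal CPU-bound sketch you could time instead (the arithmetic inside the loops is an arbitrary stand-in for real work):
// using System.Threading.Tasks;
// Sum an arbitrary function over a range, sequentially and in parallel,
// and compare the timings. No console output inside the loops, so
// nothing is internally synchronized except the final merge.
var sw = System.Diagnostics.Stopwatch.StartNew();
long serial = 0;
for (int i = 0; i < 10000000; i++)
    serial += (i % 7) * (i % 13);               // arbitrary CPU work
sw.Stop();
Console.WriteLine("Sequential: {0} ms", sw.ElapsedMilliseconds);

sw.Restart();
long parallel = 0;
object sumLock = new object();
Parallel.For(0, 10000000, () => 0L,
    (i, state, local) => local + (i % 7) * (i % 13),
    local => { lock (sumLock) parallel += local; });
sw.Stop();
Console.WriteLine("Parallel: {0} ms, sums match: {1}", sw.ElapsedMilliseconds, parallel == serial);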
As far as I can see you have only got one thread servicing the queue, so why would this be any quicker?
I have an example for why your expectation of a big speedup through multi-threading is wrong:
Assume you want to upload 100 pictures. The single threaded variant loads the first, uploads it, loads the second, uploads it, etc.
The limiting part here is the bandwidth of your internet connection (assuming that every upload uses up all the upload bandwidth you have).
What happens if you create 100 threads, each uploading only one picture? Well, each thread reads its picture (this is the part that speeds things up a little, because reading the pictures is done in parallel instead of one after the other).
But as the currently active thread uses 100% of the internet upload bandwidth for its picture, no other thread can upload a single byte in the meantime. Since the total amount of bytes to be transmitted stays the same, 100 threads uploading one picture each need the same time as one thread uploading 100 pictures one after the other.
You only get a speedup if uploading pictures were limited to, let's say, 50% of the available bandwidth. Then 100 threads would be done in 50% of the time it would take one thread to upload 100 pictures.
"For some reason i assumed this would be faster"
If you don't know why you assumed it would be faster, why are you surprised that it's not? Simply starting up new threads is never guaranteed to make any operation run faster. There has to be some inefficiency in the original algorithm that a new thread can reduce (and that is sufficient to overcome the extra overhead of creating the thread).
All the advice given by others is good, especially the mention that console output is serialized and that adding threads does not guarantee a speedup.
What I want to point out, and what it seems the others missed, is that in your original scenario you are printing everything on the main thread, while in the second scenario you are merely delegating the entire printing task to a secondary worker. This cannot be any faster than your original scenario, because you simply traded one worker for another.
A scenario where you might see speedup is this one:
for (int i = 0; i < largeNumber; i++)
{
    // embarrassingly parallel task that takes some time to process
}
and then replacing that with:
Parallel.For(0, largeNumber,
    i =>
    {
        // embarrassingly parallel task that takes some time to process
    });
This will split the loop among the workers such that each worker processes a smaller chunk of the original data. If the task does not need synchronization you should see the expected speedup.
Cool test.
One thing to keep in mind when dealing with threads is bottlenecks. Consider this:
You have a Restaurant. Your kitchen can make a new order every 10
minutes (your chef has a bladder problem so he's always in the
bathroom, but is your girlfriend's cousin), so he produces 6 orders an
hour.
You currently employ only one waiter, who can attend tables
immediately (he's probably on E, but you don't care as long as the
service is good).
During the first week of business everything is fine: you get
customers every ten minutes. Customers still wait for exactly ten
minutes for their meal, but that's fine.
However, after that week, you are getting as many as 2 customers every
ten minutes, and they have to wait as much as 20 minutes to get their
meal. They start complaining and making noises. And god, you have
noise. So what do you do?
Waiters are cheap, so you hire two more. Will the wait time change?
Not at all... waiters will get the order faster, sure (attend two
customers in parallel), but still some customers wait 20 minutes for
the chef to complete their orders. You need another chef, but as you
search, you discover they are lacking! Every one of them is on TV
doing some crazy reality show (except for your girlfriend's cousin who
actually, you discover, is a former drug dealer).
In your case, the waiters are the threads making calls to Console.WriteLine, but your chef is the Console itself. It can only service so many calls a second. Adding some threads might make things a bit faster, but the gains should be minimal.
You have multiple sources but only 1 output; in that case multi-threading will not speed things up. It's like a road where 4 lanes merge into 1: having 4 lanes moves traffic faster, but at the end it slows back down when everything merges into the single lane.
I am trying to increase the speed of some UI Automation operations. I stumbled across the (not so well) documented caching possibility.
From what I have understood, the whole operation (if you have a large GUI tree) is so slow because every single function call requires a cross-process transition (speed-wise kind of like going to kernel mode, I suppose). So in comes caching.
Simply tell the framework to cache an element and its children, and then work on them lightning-fast. (From what I understand, you then have only one context change and assemble all the data you need in one go.)
Good idea, but... it is just as slow for me as the uncached variation. I wrote some simple test code and could not see an improvement:
Stopwatch watch = new Stopwatch();  // timer used below (declaration not in the original snippet)
AutomationElement ae;               // element whose siblings are to be examined; there are quite a few siblings
AutomationElement sibling;

#region non-cached
watch.Start();
for (int i = 0; i < 10; ++i)
{
    sibling = TreeWalker.RawViewWalker.GetFirstChild(TreeWalker.RawViewWalker.GetParent(ae));
    while (sibling != null)
    {
        sibling = TreeWalker.RawViewWalker.GetNextSibling(sibling);
    }
}
watch.Stop();
System.Diagnostics.Debug.WriteLine("Execution time without cache: " + watch.ElapsedMilliseconds + " ms.");
#endregion
#endregion
#region cached
watch.Reset();
watch.Start();
CacheRequest cacheRequest = new CacheRequest();
cacheRequest.TreeScope = TreeScope.Children | TreeScope.Element; // for testing I chose a minimal set
AutomationElement parent;
for (int j = 0; j < 10; ++j)
{
    using (cacheRequest.Activate())
    {
        parent = TreeWalker.RawViewWalker.GetParent(ae, cacheRequest);
    }
    int cnt = parent.CachedChildren.Count;
    for (int i = 0; i < cnt; ++i)
    {
        sibling = parent.CachedChildren[i];
    }
}
watch.Stop();
System.Diagnostics.Debug.WriteLine("Execution time parentcache: " + watch.ElapsedMilliseconds + " ms.");
#endregion
The setup is: you get an element and want to check on all its (many) siblings. Both implementations, without and with cache, are given.
The output (debug mode):
Execution time without cache: 1130 ms.
Execution time parentcache: 1271 ms.
Why is this not working? How can I improve it?
Thanks for any ideas!
I would not expect the two loops to differ much in their running time, since in both cases the parent's children must be fully walked (in the first explicitly, and in the second to populate the cache). I do think, though, that looping through the parent.CachedChildren collection afterwards is much cheaper than your initial walk code: at that point the elements are cached, and you can use them without re-walking the tree.
The general point is that you can't get performance for free: you need to invest the time to actually populate the cache before it becomes useful.
Actually, I recently found the time to check it again, also against different UI elements.
For caching to be faster, one first has to specify all (and only) the needed elements.
There is a real performance difference if you, e.g., look at a ComboBox with 100 elements inside, or something similar like a very complicated GUI with a lot of elements on the same hierarchy level, and you actually need the data from all of them.
So... caching is not a solution for everything, but rather a performance-optimization tool that has to be applied with knowledge about the situation, the requirements, and the internal workings.
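As a minimal sketch of what "all (and only) the needed elements" can look like, caching a single property for an element and its children (the Name property is just an arbitrary example):
// using System.Windows.Automation;
// Cache only what is actually needed: the element, its children,
// and a single property (Name, as an arbitrary example).
CacheRequest cacheRequest = new CacheRequest();
cacheRequest.TreeScope = TreeScope.Element | TreeScope.Children;
cacheRequest.Add(AutomationElement.NameProperty);

using (cacheRequest.Activate())
{
    AutomationElement parent = TreeWalker.RawViewWalker.GetParent(ae, cacheRequest);
    foreach (AutomationElement child in parent.CachedChildren)
    {
        // Cached access: no extra cross-process round trip per child.
        string name = (string)child.GetCachedPropertyValue(AutomationElement.NameProperty);
    }
}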
BTW, talking about performance: I found out that each .Current access to a UI element takes very roughly 20 ms, so for complex operations this really can add up to a significant amount of time.