How to best parallelize parsing of webpages? - c#

I am using the HTML Agility Pack to parse individual pages of a forum website. The parsing method takes a page link as an argument and returns all the topic/thread links on that page. I gather the topic links from all the parsed pages in a single collection.
After that, I check whether they are in my Dictionary of already-viewed URLs, and if they are not, I add them to a new list and the UI shows this list, which is basically the new topics/threads created since last time.
Since all these operations seem independent, what would be the best way to parallelize this?
Should I use .NET 4.0's Parallel.For/ForEach?
Either way, how can I gather the results of each page in a single collection? Or is this not necessary?
Can I read from my centralized Dictionary whenever a parse method finishes to see if they are there, simultaneously?
If I run this program for 4000 pages it takes about 90 minutes; it would be great if I could use all my 8 cores to finish the same task in ~10 minutes.

Parallel.For/ForEach combined with a ConcurrentDictionary<TKey, TValue> to share state between the different threads seems like a good way to implement this. The concurrent dictionary ensures safe reads and writes from multiple threads.
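A minimal sketch of that approach, using System.Threading.Tasks and System.Collections.Concurrent (pageUrls and ParseTopicLinks are placeholder names for your existing page list and HtmlAgilityPack-based parser):

// seenUrls is pre-loaded with the URLs viewed on previous runs
var seenUrls = new ConcurrentDictionary<string, bool>();
// thread-safe collection for the links that turn out to be new
var newTopics = new ConcurrentBag<string>();

Parallel.ForEach(pageUrls, pageUrl =>
{
    // ParseTopicLinks = your HtmlAgilityPack-based method returning the topic links of one page
    foreach (var topicLink in ParseTopicLinks(pageUrl))
    {
        // TryAdd is atomic: it returns false if the URL was already recorded
        if (seenUrls.TryAdd(topicLink, true))
            newTopics.Add(topicLink);
    }
});

// newTopics now holds only the topics not seen before; hand it to the UI on the UI thread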

After that, I check whether they are in my Dictionary of already-viewed URLs, and if they are not, I add them to a new list and the UI shows this list, which is basically the new topics/threads created since last time.
Since all these operations seem independent, what would be the best way to parallelize this?
You can certainly use Parallel.For/ForEach to do that, but you should think about the design of your crawler a bit. Most crawlers dedicate several threads to crawling, and each thread is associated with a page-fetching client responsible for downloading the pages (in your case, probably using WebRequest/WebResponse). I would recommend reading these papers:
Mercator: A scalable, extensible Web crawler (an 11-page paper; should be a pretty light read).
IRLbot: Scaling to 6 Billion Pages and Beyond (a 10-page paper describing a crawler that crawls about 600 pages per second on a 150 Mbit connection).
IRLbot: Scaling to 6 billion pages and beyond (the full paper).
If you implement the Mercator design, you should easily be able to download 50 pages per second, so your 4000 pages will be downloaded in 80 seconds.
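As a very rough illustration of the "several crawling threads, each with its own page-fetching client" idea, here is a producer/consumer sketch using BlockingCollection (the names urlFrontier and fetchedPages and the thread count of 8 are made up):

var urlFrontier = new BlockingCollection<string>();   // URLs waiting to be fetched
var fetchedPages = new BlockingCollection<string>();  // raw HTML waiting to be parsed

// dedicate a handful of long-running tasks purely to fetching
var fetchers = Enumerable.Range(0, 8).Select(_ => Task.Factory.StartNew(() =>
{
    foreach (var url in urlFrontier.GetConsumingEnumerable())
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        using (var response = request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
            fetchedPages.Add(reader.ReadToEnd());
    }
}, TaskCreationOptions.LongRunning)).ToArray();

// parsing threads consume fetchedPages.GetConsumingEnumerable() in the same way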
Either way, how can I gather the results of each page in a single collection?
You can store your results in a ConcurrentDictionary<TKey, TValue>, like Darin mentioned. You don't need to store anything in the value, since your key would be the link/URL; however, if you're performing a URL-seen test, you can hash each link/URL into an integer and then store the hash as the key and the link/URL as the value.
Or is this not necessary?
It's entirely up to you to decide what's necessary, but if you're performing a URL-seen test, then it is necessary.
Can I read from my centralized Dictionary whenever a parse method finishes to see if they are there, simultaneously?
Yes, ConcurrentDictionary allows multiple threads to read simultaneously, so it will work fine if you just want to check whether a link has already been crawled.
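For example, a URL-seen check against the shared dictionary could look like this (class-member style; the hash-as-key variant mentioned above works the same way through TryAdd):

// shared dictionary, loaded with the URLs from the previous run
static readonly ConcurrentDictionary<string, bool> alreadyViewed = new ConcurrentDictionary<string, bool>();

// read-only check, safe from many threads at once
static bool AlreadyCrawled(string url)
{
    return alreadyViewed.ContainsKey(url);
}

// check-and-record in a single atomic step
static bool MarkAsSeen(string url)
{
    return alreadyViewed.TryAdd(url, true);   // true only for the first thread to add this URL
}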
If I run this program for 4000 pages it takes about 90 minutes; it would be great if I could use all my 8 cores to finish the same task in ~10 minutes.
If you design your crawler sufficiently well, you should be able to download and parse (extract all the links from) 4000 pages in about 57 seconds on an average desktop PC... I get roughly those results with the standard C# WebRequest on a 4 GB, i5 3.2 GHz PC with a 10 Mbps connection.

Related

High volume blacklist contains operation - performance in C#

I am working on a desktop application which needs to perform web site access checks. I have huge black lists on the PC where the application is running, and I am faced with this task:
How to perform fastest check over those black lists?
I'm using the C#/.NET development stack. My current idea is to load all those lists into a HashSet and invoke its Contains method, but I'm not sure loading it all into memory is a good idea. Maybe you can suggest another way that saves memory on the one hand and works as fast as possible on the other?
The files are plain text and currently in the region of megabytes, but this size is expected to grow.
UPDATE:
I found black lists of web sites here; after downloading and unzipping them, the data is about 80 megabytes, so I am not sure that keeping all the data in memory is a good idea.
UPDATE 2
I've created a performance test: I downloaded a blacklist with 2,339,643 items, loaded it into a HashSet and performed 1000 iterations to check the speed.
Results:
The maximum amount of time a Contains call takes is 0.2 milliseconds (the first call).
The second call takes about 0.0164 milliseconds, and subsequent calls take even less. The performance is good.
But the application in which I run the test takes about 250 MB of system memory, which is not as good as the HashSet performance.
You can use a HashSet to store your black list; this data structure gives O(1) amortised time complexity for inserts and for checking whether an item is present in the set.
If you need something more scalable, you can consider bringing in Redis or Memcached.
Reading through the comments, I would consider creating a web service that performs the check. A user can query the web service, which in turn would query Redis or Memcached, or SQL Server if you don't need it all in memory. Alternatively, I would suggest looking at whitelisting; if your black lists grow too much, this could indicate a problem with the current approach.
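A minimal sketch of the in-memory approach (System.IO and System.Collections.Generic; the folder path and the one-entry-per-line file layout are assumptions):

// one case-insensitive set holding every blacklisted entry
static readonly HashSet<string> blacklist = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

static void LoadBlacklists()
{
    foreach (var file in Directory.EnumerateFiles(@"C:\blacklists", "*.txt"))   // path is illustrative
        foreach (var line in File.ReadLines(file))
            blacklist.Add(line.Trim());
}

static bool IsBlocked(string host)
{
    return blacklist.Contains(host);   // amortised O(1) lookup
}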

What is the fastest way to persistently increment a list of numbers from multiple threads?

My application has different tasks, each one posting an XML document via HTTP POST to a different endpoint. For every thread I need to keep count of the messages I send, each identified by a unique incremental number.
I need a mechanism that, after a message has been received by the endpoint, saves the last message id sent, so that if there is a problem and the application needs to restart it won't send the same message again and will resume from where it was.
If I don't persist the counters, on my laptop I can manage to obtain a throughput of about 100 messages processed per second for every queue with 5 tasks running. My goal is to achieve no more than a 10/15% reduction in throughput by persisting the counters.
Using SQL Server to save the counters, with a row for every task, gives me a 50% decrease in throughput. Saving the counter value in a text file for every task is a bit faster, but still far from my goal. I am looking for a way to persist this information so that I can be as close as possible to my goal. I thought that appending the last processed id rather than updating it could help me avoid write locks, but the bottom line is that I don't care if, for the sake of performance, I have to waste disk space or accept a longer startup time for reading back the last counter.
In your experience what might be a fast way to avoid contentions and safely persist data from multiple tasks even at the cost of more disk space?
You can get pretty good performance with ESENT storage, via the ManagedEsent PersistentDictionary wrapper.
The PersistentDictionary class is thread-safe and provides real concurrent access to the ESENT backend; you represent everything as key-value pairs.
Give it a try; it is not much code to write.
ESENT is an in-process database engine, disk based with in-memory caching, used throughout several Windows components (Search, Exchange, etc.). It provides transactional support, which is what you're after.
It has been included in every version of Windows since 2000, so you don't need to install any dependencies other than ManagedEsent.
You would probably want to define something like this:
var dictionary = new PersistentDictionary<Guid, int>("ThreadStorage");
The key, I assume, should be something unique (maybe even the service endpoint) so that you are able to re-map it after a restart. The value is the last message identifier.
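A rough sketch of that usage (namespace Microsoft.Isam.Esent.Collections.Generic; endpointId and messageId are placeholders, and flushing after every message is an assumption you may want to relax):

Guid endpointId = Guid.NewGuid();   // placeholder: in practice, a stable id per endpoint
int messageId = 42;                 // placeholder: the message number just acknowledged

using (var lastSent = new PersistentDictionary<Guid, int>("ThreadStorage"))
{
    // after the endpoint acknowledges the message, persist the counter
    lastSent[endpointId] = messageId;
    lastSent.Flush();   // push the update to disk so a restart will not resend it

    // on restart, resume from the persisted value (0 if this endpoint was never seen)
    int resumeFrom = lastSent.ContainsKey(endpointId) ? lastSent[endpointId] : 0;
}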
I am pasting below, shamelessly, their performance benchmarks:
Sequential inserts 32,000 entries/second
Random inserts 17,000 entries/second
Random Updates 36,000 entries/second
Random lookups (database cached in memory) 137,000 entries/second
Linq queries (range of records) 14,000 queries/second
You fit in the Random Updates case, which as you can see offers a really good throughput.
I faced the same issue as the OP.
I used SQL Server sequence numbers (with CREATE SEQUENCE).
However, the accepted answer is a good solution if you want to avoid using SQL Server.

Why does my Parallel.ForAll call end up using a single thread?

I have been using PLINQ recently to perform some data handling.
Basically I have about 4000 time series (so basically instances of Dictionary<DateTime, T>) which I store in a list called timeSeries.
To perform my operation, I simply do:
timeSeries.AsParallel().ForAll(x => myOperation(x));
If I have a look at what is happening with my different cores, I notice that at first all my CPUs are being used, and I see on the console (where I output some logs) that several time series are processed at the same time.
However, the process is lengthy, and after about 45 minutes, the logging clearly indicates that there is only one thread working. Why is that?
I tried to give it some thought, and I realized that timeSeries contains instances that are simpler to process (from myOperation's point of view) at the beginning and the end of the list. So I wondered whether the algorithm PLINQ uses consists of splitting the 4000 instances over, say, 4 cores, giving each of them 1000. Then, when a core is finished with its allocation of work, it goes back to idle. This would mean that one of the cores may face a much heavier workload.
Is my theory correct or is there another possible explanation?
Shall I shuffle my list before running it or is there some kind of parallelism parameters I can use to fix that problem?
Your theory is probably correct, although there is something called 'work stealing' that should counter this; I'm not sure why that doesn't work here. Are there many (dozens or more) large jobs at the outer ends, or just a few?
Aside from shuffling your data, you could use the overload of AsParallel() that accepts a custom Partitioner. That would allow you to balance the work better.
Side note: for this situation I would prefer Parallel.ForEach(); it has more options and cleaner syntax.
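For instance, a load-balancing partitioner hands out chunks on demand instead of pre-splitting the list into fixed ranges; here is a sketch using System.Collections.Concurrent, assuming timeSeries is an IList:

// idle workers keep pulling small chunks instead of being stuck with a fixed,
// possibly much heavier, range of the list
var partitioner = Partitioner.Create(timeSeries, loadBalance: true);

// PLINQ version
partitioner.AsParallel().ForAll(x => myOperation(x));

// Parallel.ForEach version (the preference mentioned above)
Parallel.ForEach(partitioner, x => myOperation(x));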

Calling multiple services in a method. How to do it effectively?

I have an ASP.NET (MVC) web page displaying 10,000 products.
For this I am using a method that has to call an external web service 20 times, because the web service returns 500 records at a time, so to get 10,000 records I need to call it 20 times.
The 20 calls make the page load slowly, and I need to improve the performance. Since the web service is external, I cannot make changes there.
Threading is an option I thought of. Since I can use page numbers (the service pages the data), each service call is almost independent.
Another option is using Parallel LINQ.
Should I use Parallel LINQ, or choose threading?
Someone please guide me here. Or let me know another way to achieve this.
Note : this web page can be used by many users at a time.
We have filters on the left side of the page; to construct them we need all 10,000 records, otherwise page-wise data would have been enough. Caching is not possible because of the huge load on the server: 400-1000 users can hit the server at a time, and the web service response time is 10 seconds, so we can hit it many times.
We have to hit the service 20 times to get all the data. Now I need a solution to improve those calls. Is threading the only option?
If you can't cache the data from the service, then just get the data you need, when you need to display it. I very much doubt that somebody wants to see all 10000 products on a single web page, and if they do, there is probably something wrong!
Threads and Parallel LINQ will not help you here.
Parallel LINQ is meant for spreading lots of CPU work over CPU cores; what you want to do is make 20 web requests at the same time, and you will need to use threading to do that.
You'll probably want to use the built in async capability of HttpWebRequest (see BeginGetResponse).
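A sketch of what that could look like on .NET 4.0, wrapping BeginGetResponse/EndGetResponse in tasks (the URL and its page query-string parameter are made up for the example):

// the default per-host connection limit is small; raise it so 20 requests can really run in parallel
ServicePointManager.DefaultConnectionLimit = 20;

var tasks = Enumerable.Range(1, 20).Select(page =>
{
    var request = (HttpWebRequest)WebRequest.Create("http://example.com/products?page=" + page);
    return Task.Factory.FromAsync<WebResponse>(request.BeginGetResponse, request.EndGetResponse, null)
        .ContinueWith(t =>
        {
            using (var response = t.Result)
            using (var reader = new StreamReader(response.GetResponseStream()))
                return reader.ReadToEnd();   // the raw payload for this page
        });
}).ToArray();

Task.WaitAll(tasks);                                   // or await Task.WhenAll on .NET 4.5+
var allPages = tasks.Select(t => t.Result).ToList();   // 20 x 500 products to merge and render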
Consider calling that service asynchronously. Most of the delay in calling a web service is caused by IO operations, which can be done simultaneously.
But getting 10,000 items per request is something very scary :)

Web spider/crawler in C# Windows.forms

I have created a web crawler in Visual C#. The crawler indexes certain information from .nl sites by brute-forcing all of the possible .nl addresses, starting with http://aa.nl up to (theoretically) http://zzzzzzzzzzzzzzzzzzzz.nl.
It works all right, except that it takes an incredibly long time just to go through the two-letter domains (aa, ab, ..., zz). I calculated how long it would take to go through all of the domains in this fashion and got about a thousand years.
I tried to accelerate this with threading, but with 1300 threads running at the same time WebClient just kept failing, making the resulting data file too inaccurate to be usable.
I do not have access to anything other than a 5 Mb/s internet connection, an E6300 Core 2 Duo and 2 GB of 533/667 MHz RAM on Windows 7.
Does anybody have an idea what to do to make this work? Any idea will do.
Thank you
The combinatorial explosion makes this impossible to do (unless you can wait several months at the very least). What I would try instead is to contact SIDN, who is the authority for the .nl TLD and ask them for the list.
IMO such an implementation of a web crawler is not appropriate.
The number of pings you need to do for one crawl is ~ 10^29.
Say every ping takes 200 ms and processing takes 100 ms.
Total time estimate: 3*10^4 * 10^29 ms ~ 3*10^23 years. Please correct me if I am wrong.
If you want to take advantage of threading you need a dedicated core per thread, and each thread will take at least 1 MB of your memory.
Threading will not help you here; you would only be able to, hypothetically, reduce the time to ~ 3*10^20 years.
The exceptions you get are likely the result of thread synchronization issues.
The HTTP support in .NET has a maximum concurrent connections limit, which by default is around 8, I think (somewhere around that figure anyway).
If you create more HTTP requests than that, many of them will be forced to wait for an available connection and, as a result, will time out long before they ever get one, leading valid URIs to appear invalid.
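If that limit is what is biting you, it can be raised before any requests are issued (System.Net); the value below is only an example, not a recommendation for 1300 concurrent threads:

// raise the per-host connection limit before creating requests
ServicePointManager.DefaultConnectionLimit = 100;

// equivalently, in app.config:
// <system.net>
//   <connectionManagement>
//     <add address="*" maxconnection="100" />
//   </connectionManagement>
// </system.net>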
