I have created a web crawler in VC#. The crawler indexes certain information from .nl sites by brute-forcing all possible .nl addresses, from http://aa.nl up to (theoretically) http://zzzzzzzzzzzzzzzzzzzz.nl.
It works all right, except that it takes an incredibly long time just to get through the two-letter domains - aa, ab ... zz. I calculated how long it would take to go through all of the domains in this fashion and got about a thousand years.
I tried to accelerate this with threading, but with 1300 threads running at the same time WebClient just kept failing, making the resulting data file too inaccurate to be usable.
I do not have access to anything other than a 5 Mb/s internet connection, an E6300 Core 2 Duo and 2 GB of 533/667 MHz RAM on Win7.
Does anybody have an idea what to do to make this work? Any idea will do.
Thank you
The combinatorial explosion makes this impossible to do (unless you can wait several months at the very least). What I would try instead is to contact SIDN, which is the authority for the .nl TLD, and ask them for the list.
IMO such an implementation of a web crawler is not appropriate.
The number of pings you need to do for one crawl is ~10^29.
Say every ping takes 200 ms.
Time for processing: 100 ms.
Total time estimate: 300 ms * 10^29 = 3*10^31 ms, which puts it in the 10^20-10^21 year range. Please correct me if I am wrong.
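If you want to sanity-check that estimate, here is a throwaway sketch that just reuses the 200 ms + 100 ms per-address assumption above:

using System;

class CrawlEstimate
{
    static void Main()
    {
        // Count every possible .nl label from 2 to 20 lowercase letters.
        double domains = 0;
        for (int length = 2; length <= 20; length++)
            domains += Math.Pow(26, length);

        // ~300 ms per address: 200 ms for the request + 100 ms of processing.
        double years = domains * 300 / 1000 / 3600 / 24 / 365;

        Console.WriteLine("Candidate domains: {0:E1}", domains);
        Console.WriteLine("Sequential crawl:  {0:E1} years", years);
        // Prints roughly 2E+28 domains and 2E+20 years, consistent with
        // the estimate above.
    }
}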
If you want to take advantage of threading you need a dedicated core for each thread, and each thread will take at least 1 MB of your memory.
Threading will not help you here; you would hypothetically only reduce the time to ~3*10^20 years.
The exceptions you are getting are most likely the result of thread synchronization issues.
The HTTP support in .NET has a maximum concurrent connections limit per host (ServicePointManager.DefaultConnectionLimit), which defaults to just 2 for client applications.
If you create more HTTP requests than that, many of them will be forced to wait for an available connection and, as a result, will time out long before they ever get one, leading valid URIs to appear invalid.
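If you stick with WebClient/HttpWebRequest, you can raise that limit up front. A minimal sketch, assuming a value of 20 (an arbitrary placeholder - how many parallel requests the remote servers and your 5 Mb/s line actually tolerate is another matter):

using System;
using System.Net;

class CrawlerSetup
{
    static void Main()
    {
        // Allow more simultaneous HTTP connections per host than the default.
        ServicePointManager.DefaultConnectionLimit = 20;

        using (var client = new WebClient())
        {
            // Requests issued from here on can use up to 20 parallel
            // connections per host instead of queueing behind the default limit.
            string html = client.DownloadString("http://example.nl");
            Console.WriteLine(html.Length);
        }
    }
}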
I have a C# MVC application with a WCF service running on Azure. At first it was of course hosted on the Free tier, but once I had it running smoothly I wanted to see how it ran on either Basic or Standard, which as far as I know should be dedicated servers.
To my surprise the code ran significantly slower once it was changed from Free to either Standard or Basic. I chose the smallest instance, but I still expected it to perform better than the Free option.
From my performance logging I can see that the code that runs especially slowly is started as async from Task.Run. Initially it was old-school Thread.Start(), but I wondered whether that might spawn it on some lower-priority thread, so I changed it to Task.Run - without that changing anything - so perhaps it has nothing to do with it, but it might, so now you know.
The code that runs really slowly basically works on an XML document, through XDocument, XElement etc. It loops through, has some LINQ etc., but nothing too fancy. Still, it is 5-10 times slower on Basic and Standard than on the Free version. For the exact same request the Free version takes around 1000 ms, whereas Basic and Standard take 8000-10000 ms.
In each test I have tried 5-10 times, but without any decrease in response times. I wondered whether I needed to wait some hours before the Basic/Standard instance is fully functional or something like that, but each time I switch back, the Free version just outperforms it from the get-go.
Any suggestions? Is the Free version for some strange reason more powerful than Basic or Standard or do I need to configure something differently once I get up and running on Basic or Standard?
The notable difference between the Free and Basic/Standard tiers is that Free uses an undisclosed number of shared cores, whereas Basic/Standard has a defined number of CPU cores (1-4 based on how much you pay). Related to this is the fact that Free is a shared instance while Basic/Standard is a private instance.
My best guess, based on this, is that since the Free servers you would be on house multiple different users and applications, they probably have pretty beefy specs. Their CPUs are probably 8-core Xeons and there might even be multiple CPUs. Most likely, Azure isn't enforcing any caps but rather relying on quotas (60 CPU minutes / day for the Free tier) and overall demand on the server to restrict CPU use. In other words, if your site is the only one that happens to be doing anything at the moment (unlikely of course, but for the sake of example), you could potentially be utilizing all 8+ cores on the box, whereas when you move over to Basic/Standard you are hard-limited to 1-4. Processing XML is actually very CPU heavy, so this seems to line up with my assumptions.
More than likely, this is a fluke. Perhaps your residency is currently on a relatively newly provisioned server that hasn't been filled up with tenants yet. Maybe you just happen to be sharing with tenants that aren't doing much. Who knows? But if the server is ever actually under real load, I'd imagine you'd see a much worse response time on the Free tier than even Basic/Standard.
The ReadWriteTimeout for HttpWebRequests seems to be defaulted to 5 minutes.
Is there a reason why it is that high? I was trying to set the timeout of an API call to 10 seconds, but it was spinning for over 2 minutes.
When I set this to 30 seconds, it times out in a reasonable amount of time now.
Is it dangerous to set this too low?
I can't imagine something taking longer than 20-30 seconds in my application (small 2-30kb payloads).
Reference: http://msdn.microsoft.com/en-us/library/system.net.httpwebrequest.readwritetimeout.aspx
Sure there's a reason for a 5 minute time-out. It looks like this:
This contraption is a robotic tape retrieval system, used by the International Centre for Radio Astronomy Research. It stores 32.5 petabytes of historical data. When its server gets an HttpWebRequest, the machine sends the robot on its way to retrieve the tape with the data. This takes a while, as you might imagine.
These systems were quite common a decade ago, around the time .NET was designed. Not so much today; the unrelenting improvements in hard disk storage capacity have made them close to obsolete. Although more than 5 petabytes of SAN storage still sets you back a rather major chunk of money, so if speed is not essential then tape is hard to beat.
Clearly .NET cannot reliably declare a timeout when it doesn't know anything about what's happening on the other end of the wire. So the default is high. If you have good reason to believe that there's an upper limit in your particular setup, then don't hesitate to lower it. Do make it a configurable setting; you can't predict the future.
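For example, a minimal sketch of lowering it while keeping it configurable (the 30-second fallback and the "apiReadWriteTimeoutMs" appSettings key are made up for this sketch; both HttpWebRequest.Timeout and ReadWriteTimeout are in milliseconds):

using System;
using System.Configuration; // requires a reference to System.Configuration
using System.Net;

class ApiCaller
{
    public static HttpWebResponse Get(string url)
    {
        // Read the limit from configuration so it can change without a rebuild.
        int timeoutMs = int.Parse(
            ConfigurationManager.AppSettings["apiReadWriteTimeoutMs"] ?? "30000");

        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Timeout = timeoutMs;          // applies to GetResponse/GetRequestStream
        request.ReadWriteTimeout = timeoutMs; // applies to reads/writes on the streams
        return (HttpWebResponse)request.GetResponse();
    }
}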
You can't possibly know what connection speeds the users who connect to your website have. And as the creator of the framework, you can't know what the developer will host either. This class has existed since .NET 1.1, so for a very long time, and back then users had slower connections too.
Finding a good default value is very difficult. You don't want to set it too high because of potential security and resource-exhaustion issues, and you don't want to set it too low because then you would get a million (exaggerating) complaints about aborted requests.
I'm sorry I can't give you any official sources, but this is just reasonable.
Why 5 minutes? Why not?
JustAnotherUserYouMayKnow explained it to you pretty well.
But as usual, you have the freedom to change this default value to one that suits your particular case, so feel free to follow the path that Christian pointed out.
Setting a default value is not an easy task at all when we are talking about millions of users and maybe billions of possible scenarios.
The bottom line is that it isn't so important why it's 5 minutes, but rather how you can adjust it to your own needs.
Well, by setting it that low you may or may not introduce a series of issues. While you may be able to reach the site within a reasonable time, others may not.
A perfect example is Verizon: they use a series of proxy servers which can drastically slow a connection down. The reason I bring up this example is that our application specified a one-minute timeout before throwing an exception.
Our server has no issues with large volumes of requests; it handles them quite easily. However, some of our users throughout the world receive this error: Error 10060.
The issue can stem from an incorrect proxy configuration or an invalid registry key which actually handles the timeout request.
You'd think that one minute would be fast enough, but it actually isn't: that particular customer's network doesn't push the data through quickly enough, thus causing an error.
So you asked:
Why is the HttpWebRequest ReadWrite Timeout Defaulted to five minutes?
They are attempting to account for the lowest common denominator.
Simply put, each network and client may see a vast range of traffic and delays as data moves to the desired location. If it can't reach the destination within the timeout of your socket request, your user will experience an exception.
Some really important things to know about a network:
Some networks are configured with a limited hop count / time to live.
Proxies and firewalls that do heavy data filtering and security checks may delay your traffic.
Some areas do not have Fiber or Cable high-speed. They may rely on Satellite or DSL.
Each network protocol is different.
Those are a few variables that you have to consider. If we are talking about the internet: each client has a home network, which connects to an ISP, which connects to the internet, which connects to you. So you have several layers of traffic to be aggregated.
If we are talking about an Intranet, with most modern day technology the odds of your time being an issue are slim but still possible.
Also, each individual computer can play a part in or cause an issue. In Windows 8 the default timeout specified for the browser is one minute; in some cases those users may experience exceptions with your application, your site, or others, so you'd manually alter the ServerTimeOut and TimeOut keys in the registry to assign a longer value.
In short:
Client Machines may pose a problem in reaching your site within your allocated time.
Network / ISP may incur a problem for some users.
Your Server may be configured incorrectly or not allocate the right amount of time.
These are all variables that need to be accounted for; as they will impact access to your application. Unfortunately you won't know for certain until it's launched and users begin to utilize your site.
Unfortunately you won't know whether the time you specified will be enough; it defaults to a higher number because there is so much variation across the world that it has to account for the lowest common denominator, and your goal is to reach as many people as possible.
By the way very nice question, and some great answers so far as well.
I am using the Html Agility Pack to parse individual pages of a forum website. The parsing method returns all the topic/thread links on the page whose link is passed as an argument. I gather all these topic links from all the parsed pages in a single collection.
After that, I check if they are on my Dictionary of already-viewed urls, and if they are not, then I add them to a new list and the UI shows this list, which is basically new topics/threads created since last time.
Since all these operations seem independent, what would be the best way to parallelize this?
Should I use .NET 4.0's Parallel.For/ForEach?
Either way, how can I gather the results of each page in a single collection? Or is this not necessary?
Can I read from my centralized Dictionary whenever a parse method finishes to see if they are there, simultaneously?
If I run this program for 4000 pages, it takes like 90 mins, it would be great if I could use all my 8 cores to finish the same task in ~10 mins.
Parallel.For/ForEach combined with a ConcurrentDictionary<TKey, TValue> to share state between the different threads seems like a good way to implement this. The concurrent dictionary ensures safe reads/writes from multiple threads.
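A minimal sketch of that combination, assuming a FindNewTopics entry point and an ExtractTopicLinks placeholder that wraps your Html Agility Pack parsing:

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

class TopicScanner
{
    // URLs already seen; safe for concurrent reads and writes.
    private readonly ConcurrentDictionary<string, bool> seen =
        new ConcurrentDictionary<string, bool>();

    public List<string> FindNewTopics(IEnumerable<string> pageUrls)
    {
        var newTopics = new ConcurrentBag<string>();

        Parallel.ForEach(pageUrls, pageUrl =>
        {
            foreach (string topicUrl in ExtractTopicLinks(pageUrl))
            {
                // TryAdd returns false if another thread (or a previous run)
                // already recorded this topic URL.
                if (seen.TryAdd(topicUrl, true))
                    newTopics.Add(topicUrl);
            }
        });

        return new List<string>(newTopics);
    }

    private IEnumerable<string> ExtractTopicLinks(string pageUrl)
    {
        // Placeholder: download the page and return the topic links it contains.
        return new string[0];
    }
}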
After that, I check if they are on my Dictionary of already-viewed urls, and if they are not, then I add them to a new list and the UI shows this list, which is basically new topics/threads created since last time.
Since all these operations seem independent, what would be the best way to parallelize this?
You can certainly use Parallel.For/ForEach to do that, but you should think about the design of your crawler a bit. Most crawlers dedicate several threads to crawling, and each thread is associated with a page-fetching client which is responsible for fetching the pages (in your case, probably using WebRequest/WebResponse). I would recommend reading these papers:
Mercator: A scalable, extensible Web crawler (an 11 page paper, should be a pretty light read).
IRLbot: Scaling to 6 Billion Pages and Beyond (a 10 page paper that describes a crawler that crawls at about 600 pages per second on a 150 Mbit connection).
IRLbot: Scaling to 6 billion pages and beyond: full paper
If you implement the Mercator design, then you should easily be able to download 50 pages per second, so your 4000 pages will be downloaded in 80 seconds.
Either way, how can I gather the results of each page in a single collection?
You can store your results in a ConcurrentDictionary<TKey, TValue>, like Darin mentioned. You don't need to store anything in the value, since your key would be the link/URL, however if you're performing a URL-seen Test then you can hash each link/URL into an integer and then store the hash as the key and the link/URL as the value.
Or is this not necessary?
It's entirely up to you to decide what's necessary, but if you're performing a URL-seen Test, then it is necessary.
Can I read from my centralized Dictionary whenever a parse method finishes to see if they are there, simultaneously?
Yes, the ConcurrentDictionary allows multiple threads to read simultaneously, so it should be fine. It will work fine if you just want to see if a link has already been crawled.
If I run this program for 4000 pages, it takes like 90 mins, it would be great if I could use all my 8 cores to finish the same task in ~10 mins.
If you design your crawler sufficiently well, you should be able to download and parse (extract all the links from) 4000 pages in about 57 seconds on an average desktop PC... I get roughly those results with the standard C# WebRequest on a 4 GB RAM, i5 3.2 GHz PC with a 10 Mbps connection.
I'm trying to create a website similar to BidCactus and LanceLivre.
The specific part I'm having trouble with is the seconds aspect of the timer.
When an auction starts, a timer of 15 seconds starts counting down, and every time a person bids, the timer is reset and the price of the item is increased by $0.01.
I've tried using SignalR for this bit, and while it works well during trial runs in the office, it's just not good enough for real-world usage where seconds count. I would get HTTP 503 errors when too many users were bidding and idling on the site.
How can I make the timer on the client's end show the correct remaining time?
Would HTTP GETting that information with AJAX every second allow me to properly display the remaining time? That's a request every second!
And not only that: when a user makes that GET request, I calculate the remaining seconds, but by the time the user sees the response that value may no longer be accurate, since a second or more might pass between processing and returning. Do you see my conundrum?
Any suggestions on how to approach this problem?
There are a couple problems with the solution you described:
It is extremely wasteful. There is already a fairly high accuracy clock built into every computer on the Internet.
The Internet always has latency. By the time the packet reaches the client, it will be old.
The Internet is a variable-latency network, so the time-update packets you get could be a second or more behind for one packet, and as little as 20 ms behind for another.
It takes complicated algorithms to deal with #2 and #3.
If you actually need second-level accuracy
There is existing Internet-standard software that solves it - the Network Time Protocol.
Use a real NTP client (not the one built into Windows - it only guarantees it will be accurate to within a couple seconds) to synchronize your server with national standard NTP servers, and build a real NTP client into your application. Sync the time on your server regularly, and sync the time on the client regularly (possibly each time they log in/connect? Maybe every hour?). Then simply use the system clock for time calculations.
Don't try to sync the client's system time - they may not have access to do so, and certainly not from the browser. Instead, you can get a reference time relative to the system time, and simply add the difference as an offset on client-side calculations.
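The same offset idea, sketched in C# (in the browser you'd do this in JavaScript, but the bookkeeping is identical; referenceUtcNow stands for whatever synchronized source you trust, e.g. the NTP-synced server time you just sent down):

using System;

class SyncedClock
{
    // Difference between the reference clock and the local clock,
    // measured once (or periodically) when the client connects.
    private TimeSpan offset = TimeSpan.Zero;

    public void Synchronize(DateTime referenceUtcNow)
    {
        // Don't touch the system clock; just remember how far off it is.
        offset = referenceUtcNow - DateTime.UtcNow;
    }

    // Use this instead of DateTime.UtcNow for countdown calculations.
    public DateTime UtcNow
    {
        get { return DateTime.UtcNow + offset; }
    }
}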
If you don't actually need second-level accuracy
You might not really need to guarantee accuracy to within a second.
If you make this decision, you can simplify things a bit. Simply transmit a relative finish time to the client for each auction, rather than an absolute time. Re-request it on the client side every so often (e.g. every minute). Their global system time may be out of sync, but the second-hand on their clock should pretty accurately tick down seconds.
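For example, the server could simply hand out the number of seconds left and let the client tick it down locally until the next refresh. A minimal sketch (AuctionEndUtc is a placeholder for however you actually store the end time):

using System;

class AuctionTimer
{
    // Absolute end time, kept only on the server.
    public DateTime AuctionEndUtc { get; set; }

    // Returned to the client; the client counts this down locally
    // and re-requests it every minute or so to correct drift.
    public int SecondsRemaining()
    {
        double left = (AuctionEndUtc - DateTime.UtcNow).TotalSeconds;
        return left > 0 ? (int)Math.Ceiling(left) : 0;
    }
}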
If you want to make this a little more slick, you could try to determine the (relative) latency for each call to the server. Keep track of how much time has passed between calls to the server, and the time-left value from the previous call. Compare them. Then, calculate whichever is smaller, and base your new time off that calculation.
I'd be careful when engineering such a solution, though. If you get the calculations wrong, or are dealing with inaccurate system clocks, you could break your whole syncing model, or unintentionally cause the client to prefer the highest-latency call. Make sure you account for all cases if you write the "slick" version of this code :)
One way to get really good real-time communication is to open a connection from the browser to a special tcp/ip socket server that you write on the server. This is how a lot of chat packages on the web work.
Duplex sockets allow you to push data both directions. Because the connection is already open, you can send quite a bit of very fast data across.
In the past, you needed to use Adobe Flash to accomplish this. I'm not sure whether browsers have advanced enough to handle this without a plugin (e.g. WebSockets?).
Another approach worth looking at is long polling. In concept, a connection is made to the server that just doesn't die, and it gives you the opportunity on the server to trickle bits of realtime data down to the clients.
Just some pointers. I have written web software using JavaScript <-> Flash <-> Python/PHP, and was pleased with how it worked.
Good luck.
I know there are some existing questions and they provide a very good general perspective on things. I'm hoping to get some details on the C#/VB.Net side for the actual implementation (not philosophy) of some of these perspectives.
My Particular Case
I have a WCF service which, amongst other things, receives files. For most of the service's life this particular area actually just sits doing nothing - when work does come, it arrives in high bursts of greatly varying quantity.
For each file received (which at a max can be thousands per second) the service needs to work on the files for between 1-10 seconds (each) depending on a number of other services, local resources, and network IO wait times.
To help the service with these burst workloads I implemented a queue system. Those thousands of files received per second are placed onto the queue. A controller calculates the number of threads to use based on the size of the queue, up until it reaches a "Peak Max Threads" setting which prevents it from creating additional threads. These threads are placed in a thread pool and reused to cycle through the queue. The controller will, at intervals, recalculate the number of threads required. If the queue size reduces, a relevant number of threads are released.
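For reference, a stripped-down sketch of that queue/worker shape using BlockingCollection<T>, which takes care of the producer/consumer locking (the controller logic that grows and shrinks the pool is omitted; MaxWorkers stands in for the "Peak Max Threads" setting):

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class FileWorkQueue
{
    private const int MaxWorkers = 8; // placeholder for "Peak Max Threads"

    private readonly BlockingCollection<string> queue =
        new BlockingCollection<string>();

    public void Start()
    {
        // Fixed pool of long-running workers draining the queue.
        // A real controller would grow/shrink this based on queue length.
        for (int i = 0; i < MaxWorkers; i++)
        {
            Task.Factory.StartNew(() =>
            {
                foreach (string file in queue.GetConsumingEnumerable())
                    Process(file); // the 1-10 s of work per file
            }, TaskCreationOptions.LongRunning);
        }
        // Call queue.CompleteAdding() on shutdown so workers drain and exit.
    }

    public void Enqueue(string file)
    {
        queue.Add(file); // called from the WCF receive path
    }

    private void Process(string file)
    {
        // Placeholder for the real per-file work.
    }
}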
The age old problem
How many threads should I peak at? Clearly, adding a new thread every time a file is received would be silly, for lack of a better word - the performance, at best, would deteriorate. Capping the threads when CPU utilization is only 10% across each core also doesn't seem like the best use of resources.
So, is there an appropriate way to determine how many threads to cap at? I would rather the service could determine this for itself by sampling available resources, but is there a performance hit from doing so? I know the common answer is to monitor workloads, adjust the counts through trial and error until I find a number I like, but due to the nature of this service (long periods of idle followed by high/burst workloads) it could take a long time to get that kind of information.
What then if we move the server's image to a different host which is faster/slower/different to the first? I have to re-sample the process all over again?
Ideally what I'm after, is for the co-ordinator to intelligently increase the size of the threadpool until CPU utilisation is at x% (would 80% be reasonable? 90%? 99%?). Clearly, I want to do this without adding more threads than is necessary to hit x% otherwise all I'll end up with is threads not just waiting on IO resources, but awaiting each other too.
Thanks in advance!
Related questions (if you want some generic ideas):
How many threads to create?
How many threads is too many?
How many threads to create and when?
A Complication for you
Where would be the fun if I didn't make the problem more difficult?
As it currently stands, the service does regularly hit 100% CPU during these bursts. The issue is that the CPU utilisation spikes: it goes from idle (0-10%) to 100%, and back down again. I'm not sure I can help that - ideally I wouldn't take it all the way to 100%. The problem exists because the files mentioned are in fact images, and part of the service's process is to pass the image through to the System.Windows.Media black box, which does some complex image processing for me.
There are then lulls in between the spikes because of the IO waits and other processing that goes on. If the spikes hitting 100% can't be helped (and I'm all for knowing how to prevent that, or whether I should), what should I aim for the CPU utilisation graph to look like? Sat constantly at 100%? Bouncing between 50-100%? If I do go through the effort of sampling to decide what seems to work best, is it guaranteed that switching the virtual server's host will also work best with the same graph?
This added complexity I won't take into consideration for those of you willing to answer. Feel free to ignore this section. However, any answer that also accounts for this complication, or even answers that just provide tips on how to handle it, I'll at the very least upvote!
Heck of a long question - sorry about that - and thanks for reading so much!!
PerformanceCounter allows you to query for processor usage.
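For example, a minimal sketch of sampling total CPU usage (note the first NextValue() call always returns 0, so you have to wait a moment before reading a real value):

using System;
using System.Diagnostics;
using System.Threading;

class CpuSampler
{
    static void Main()
    {
        // "% Processor Time" of "_Total" is the overall CPU usage across all cores.
        using (var cpu = new PerformanceCounter("Processor", "% Processor Time", "_Total"))
        {
            cpu.NextValue();        // first reading is always 0
            Thread.Sleep(1000);     // let the counter accumulate a sample
            float usage = cpu.NextValue();
            Console.WriteLine("CPU: {0:F1} %", usage);
        }
    }
}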
However, have you tried something the framework provides?
foreach (var file in files)
{
    var workitem = file;
    Task.Factory.StartNew(() =>
    {
        // do work on workitem
    }, TaskCreationOptions.LongRunning | TaskCreationOptions.PreferFairness);
}
You can tune the concurrency level for Tasks in the Task.Factory.
The .NET 4 thread pool will by default schedule the number of threads it finds performs best on the hardware where it runs, but you can change how that works with the previous link.
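If you do want to nudge those defaults yourself, ThreadPool exposes the knobs directly; the figures below are placeholders, not recommendations - raising the maximum too far just recreates the oversubscription problem:

using System;
using System.Threading;

class ThreadPoolTuning
{
    static void Main()
    {
        int workerMin, ioMin, workerMax, ioMax;
        ThreadPool.GetMinThreads(out workerMin, out ioMin);
        ThreadPool.GetMaxThreads(out workerMax, out ioMax);
        Console.WriteLine("Defaults: min {0}/{1}, max {2}/{3}",
            workerMin, ioMin, workerMax, ioMax);

        // Example only: keep a few workers warm for bursts and cap the total.
        ThreadPool.SetMinThreads(Environment.ProcessorCount, ioMin);
        ThreadPool.SetMaxThreads(Environment.ProcessorCount * 4, ioMax);
    }
}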
Probably you need a custom solution, but it would be worthwhile to benchmark yours against the standard one.
Edit (in response to a comment):
No links needed; I may have used an invented term, since English is not my first language. What I mean is: keep a variable that stores the reading from the previous check (prevDelta), and call the change since then the delta. Each time you 'check', add the delta to the variable averageDelta and divide by 2. averageDelta will mostly stay low, since most of the time there is no activity. Then keep another set of delta variables: the one you already have (delta - prevDelta), plus a second average that covers not all deltas but only those within a small timespan (you will have to come up with an algorithm to calculate this short-term variance accurately). Once that is done you can compare averageDelta with this 'temporal delta'. averageDelta will mostly be low and will climb slowly when a burst comes, while over the same period the temporal delta goes up really fast. Then, when the burst stops, averageDelta goes down slowly and the 'temporal' one drops really fast.
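A rough sketch of that bookkeeping (the smoothing factors and the burst threshold are arbitrary placeholders; feed it a CPU or queue-length sample on every 'check'):

class BurstDetector
{
    private double prevSample;
    private double averageDelta;   // long-term, slow-moving average
    private double temporalDelta;  // short-term average over recent checks

    // Call this on every periodic check with the latest CPU/queue sample.
    public bool Check(double sample)
    {
        double delta = sample - prevSample;
        prevSample = sample;

        // Long-term average: folds the new delta in gently.
        averageDelta = (averageDelta + delta) / 2.0;

        // Short-term average: weighted heavily toward the newest delta,
        // so it reacts almost immediately to a burst.
        temporalDelta = (temporalDelta + 3.0 * delta) / 4.0;

        // A burst is starting when the short-term average pulls well
        // ahead of the long-term one (threshold is a placeholder).
        return temporalDelta > averageDelta * 2.0 + 1.0;
    }
}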
You could use I/O Completion Ports to asynchronously fetch your images without tying up any threads until it comes time to process what you have fetched.
You could then limit your thread pool based on the number of cores on your client PC, making sure to leave a core free for other processes to use.
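In .NET 4 the easiest way to get that behaviour from managed code is the asynchronous pattern on HttpWebRequest, which uses the thread pool's I/O completion ports under the hood. A rough sketch (the ProcessImage step is a placeholder for handing the bytes to your worker pool):

using System;
using System.IO;
using System.Net;

class AsyncFetcher
{
    public void Fetch(Uri imageUri)
    {
        var request = (HttpWebRequest)WebRequest.Create(imageUri);

        // BeginGetResponse returns immediately; no thread is blocked while
        // the download is in flight - the callback fires on an I/O thread.
        request.BeginGetResponse(ar =>
        {
            using (var response = (HttpWebResponse)request.EndGetResponse(ar))
            using (var stream = response.GetResponseStream())
            using (var buffer = new MemoryStream())
            {
                stream.CopyTo(buffer);
                ProcessImage(buffer.ToArray()); // hand off to the worker pool here
            }
        }, null);
    }

    private void ProcessImage(byte[] data)
    {
        // Placeholder for the CPU-bound System.Windows.Media work.
    }
}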
What about a dynamic thread manager that monitors overall performance and spawns new threads or kills old ones accordingly? The main problem here is just how to define the performance measurement function. The rest can be done with a periodically scheduled job that increases or decreases the number of threads based on the previous thread count and the measured performance, or something like that - maybe also in connection with resource utilization (CPU, disks, network...).