I'm building a small specialized search engine for price info. The engine will only collect specific segments of data from each site. My plan is to split the process into two steps:
1. Simple screen scraping based on a URL that points to the page where the segment I need exists. Is the easiest way to do this just to use a WebClient object and get the full HTML?
2. Once the HTML is pulled and saved, analyse it via some script and pull out just the segment and values I need (for example the price value of a product). My problem is that this script somehow has to be unique for each site I pull from, it has to be able to handle really ugly HTML (so I don't think XSLT will do ...), and I need to be able to change it on the fly as the target sites update and change. I will finally take the specific values and write them to a database to make them searchable.
Could you please give me some hints on the best way to architect this? Would you do it differently than described above?
Well, I would go with the way you describe.
1. How much data is it going to handle? Fetching the full HTML via WebClient / HttpWebRequest should not be a problem.
2. I would go for HtmlAgilityPack for HTML parsing. It's very forgiving and can handle pretty ugly markup. As HtmlAgilityPack supports XPath, it's pretty easy to have specific XPath selections for individual sites (see the sketch below).
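To make point 2 concrete, here's a minimal sketch of fetch-then-parse; the URL and the XPath expression are placeholders you would swap out per site:

using System;
using System.Net;
using HtmlAgilityPack;

class PriceScraper
{
    static void Main()
    {
        // Step 1: fetch the raw HTML.
        string html;
        using (WebClient client = new WebClient())
        {
            html = client.DownloadString("http://example.com/product/42");
        }

        // Step 2: parse even ugly markup and pull out the price node
        // with a site-specific XPath expression.
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode priceNode = doc.DocumentNode.SelectSingleNode("//span[@class='price']");
        if (priceNode != null)
            Console.WriteLine(priceNode.InnerText.Trim());
    }
}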
I'm on the run, but I'm going to expand on this answer ASAP.
Yes, a WebClient can work well for this. The WebBrowser control will work as well, depending on your requirements. If you are going to load the document into an HtmlDocument (the IE HTML DOM), then it might be easier to use the WebBrowser control.
The HtmlDocument object that is now built into .NET can be used to parse the HTML. It is designed to be used with the WebBrowser control, but you can use the implementation from the mshtml DLL as well. I have not used the HtmlAgilityPack, but I hear that it can do a similar job.
The HTML DOM objects will typically handle, and fix up, most ugly HTML that you throw at them, as well as allowing a nicer way to parse the HTML (document.GetElementsByTagName to get a collection of tag objects, for example).
As for handling the changing requirements of the sites, it sounds like a good candidate for the strategy pattern. You could load the strategies for each site using reflection or something of that sort; see the sketch below.
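A rough sketch of the strategy idea; every name here (IPriceExtractor and so on) is invented for illustration:

using System;
using HtmlAgilityPack;

// One extractor per site, all behind a common interface.
public interface IPriceExtractor
{
    bool CanHandle(Uri url);
    string ExtractPrice(string html);
}

public class ExampleShopExtractor : IPriceExtractor
{
    public bool CanHandle(Uri url)
    {
        return url.Host.EndsWith("exampleshop.com");
    }

    public string ExtractPrice(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);
        // The site-specific XPath lives here; change it when the site changes.
        HtmlNode node = doc.DocumentNode.SelectSingleNode("//span[@id='price']");
        return node == null ? null : node.InnerText.Trim();
    }
}

// Discovering the strategies via reflection, as suggested above:
// var extractors = typeof(IPriceExtractor).Assembly.GetTypes()
//     .Where(t => typeof(IPriceExtractor).IsAssignableFrom(t) && t.IsClass)
//     .Select(t => (IPriceExtractor)Activator.CreateInstance(t));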
I have worked on a system that uses XML to define a generic set of parameters for extracting text from HTML pages. Basically it would define start and end elements to begin and end extraction. I have found this technique to work well enough for a small sample, but it gets rather cumbersome and difficult to customize as the collection of sites gets larger and larger. Keeping the XML up to date, and trying to keep a generic set of XML and code that can handle any type of site, is difficult. But if the type and number of sites is small then this might work.
One last thing to mention is that you might want to add a cleaning step to your approach. A flexible way to clean up HTML as it comes into the process was invaluable on the code I have worked on in the past. Perhaps implementing a type of pipeline would be a good approach, if you think the domain is complex enough to warrant it. But even just a method that runs some regexes over the HTML before you parse it would be valuable: getting rid of images, replacing particular misused tags with nicer HTML, etc. The amount of really dodgy HTML that is out there continues to amaze me...
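As an example of such a cleaning step, a trivial pre-clean method might look like this; the regexes are only illustrative, and real cleanup rules would be driven by the sites you target:

using System.Text.RegularExpressions;

static class HtmlCleaner
{
    public static string Clean(string html)
    {
        // Drop <img> tags - they are never needed for text extraction.
        html = Regex.Replace(html, "<img[^>]*>", "", RegexOptions.IgnoreCase);
        // Collapse runs of whitespace so later parsing is more predictable.
        html = Regex.Replace(html, @"\s+", " ");
        return html;
    }
}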
Related
I'm writing a program which runs on a local system with no internet access. Is it possible to create my own custom Web Map Service (WMS) server using C#? I know that there are free open source systems, but I'd like to have full control.
Thanks Morten Starck
That is very possible, but you might be in for a headache or two before you are done. The implementation specification and more is available from the Open Geospatial Consortium at the URL below.
http://www.opengeospatial.org/standards/wms
It's quite a large specification but you might be able to get away with implementing only the parts you really need and leaving some of the more specific stuff out. You will of course also need to parse and render the map data from some source which might be your largest problem (for which I really would suggest you have a look at SharpMap, http://sharpmap.codeplex.com/ instead of rolling your own).
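To give a feel for the shape of such a server, here is a bare sketch; the parameter names (REQUEST, BBOX, WIDTH, ...) come from the WMS spec, everything else is assumption:

using System;
using System.Net;

class MiniWmsServer
{
    static void Main()
    {
        HttpListener listener = new HttpListener();
        listener.Prefixes.Add("http://localhost:8080/wms/");
        listener.Start();

        while (true)
        {
            HttpListenerContext ctx = listener.GetContext();
            // WMS is essentially HTTP GET with well-known query parameters.
            string request = ctx.Request.QueryString["REQUEST"];

            if ("GetCapabilities".Equals(request, StringComparison.OrdinalIgnoreCase))
            {
                // Return the capabilities XML described in the OGC spec.
            }
            else if ("GetMap".Equals(request, StringComparison.OrdinalIgnoreCase))
            {
                // Parse BBOX, WIDTH, HEIGHT, LAYERS, SRS, FORMAT, ...
                string bbox = ctx.Request.QueryString["BBOX"];
                // ... then render an image - this is where SharpMap could help.
            }

            ctx.Response.Close();
        }
    }
}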
I've been entrusted with an idiotic task by my boss.
The task is: given a web application that returns a table with pagination, write software that "reads and parses it", since there is nothing like a web service that provides the raw data. It's like a "spider" or "crawler" application to steal data that is not meant to be accessed programmatically.
Now the thing: the application is made with the standard ASPX WebForms engine, so there is nothing like standard URLs or POSTs, just the dreadful postback engine crowded with JavaScript and non-accessible HTML. The pagination links call the infamous javascript:__doPostBack(param, param), so I don't think it would even work if I tried to simulate clicks on those links.
There are also inputs to filter the results, and they are also part of the postback mechanism, so I can't simulate a regular POST to get the results.
I was forced to do something like this in the past, but it was on a standard-like website with parameters in the querystring like pagesize and pagenumber so I was able to sort it out.
Does anyone have a vague idea if this is doable, or should I tell my boss to quit asking me to do this kind of stuff?
EDIT: maybe I was a bit unclear about what I have to achieve. I have to parse, extract and convert that data in another format - let's say excel - and not just read it. And this stuff must be automated without user input. I don't think Selenium would cut it.
EDIT: I just blogged about this situation. Anyone who is interested can check my post at http://matteomosca.com/archive/2010/09/14/unethical-programming.aspx and comment on it.
Stop disregarding the tools suggested.
No, the parser isn't something you can write yourself; WatiN or Selenium - both of those will work in that scenario.
P.S. Had you mentioned anything about needing to extract the data from Flash/Flex/Silverlight or similar, this would be a different answer.
BTW, the reason to proceed or not is definitely not technical, but ethical and maybe even legal. See my comment on the question for my opinion on this.
WatiN will help you navigate the site from the perspective of the UI and grab the HTML for you, and you can find information on .NET DOM parsers here.
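For a feel of what that looks like, here is a hedged sketch of driving an ASPX pager with WatiN; the URL and link text are placeholders:

using System;
using WatiN.Core;

class PostbackScraper
{
    [STAThread] // WatiN drives COM/IE and needs an STA thread
    static void Main()
    {
        using (IE browser = new IE("http://example.com/GridPage.aspx"))
        {
            // Clicking the pager link runs the page's own
            // javascript:__doPostBack(...) for you.
            browser.Link(Find.ByText("2")).Click();

            // Grab the refreshed page's HTML for parsing.
            string html = browser.Html;
            Console.WriteLine(html.Length);
        }
    }
}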
I already commented, but I think this is actually an answer.
You need a tool which can click client-side links and wait while the page reloads.
Tools like Selenium can do that.
Also (from the comments): WatiN, Watir.
@Insane, the CDC's website has this exact problem, and the data is public (we taxpayers have paid for it). I'm trying to get the survey and question data from http://wwwn.cdc.gov/qbank/Survey.aspx and it's absurdly difficult. Not illegal or unethical, just a terrible implementation that appears to be intentionally making it difficult to get the data (it's also inaccessible to search engines).
I think Selenium is going to work for us, thanks for the suggestion.
I'm interested to hear other developers' opinions on an approach that I typically take. I have a web application, ASP.NET 2.0, C#.
What I usually do to write out drop-downs, tables, input controls, etc. is, in the code behind, use a StringBuilder and write out something like sb.Append("<table>...").
I don't find myself using too many .NET controls, as I typically write out the HTML in the code behind. When I want to use jQuery or call JavaScript I just put that function call in my sb.Append tag, like sb.Append("<td ... onblur='fnCallJS()'>").
I've gotten pretty comfortable with this approach. For data access I use EntitySpaces.
I'm just kind of curious whether this sort of approach is horribly wrong, OK depending on the context, good, a sign it's time to learn 3.0, etc. I'm interested in learning and was just looking for some input.
Edit
After reading the comments here it sounds like I should take a look at MVC. I've not done that yet. The only hesitancy in doing so is that the existing project is just that, existing. There is a lot of code already done the way I explained and it is hard to imagine what would be involved in changing it, advantages of doing so, and just learning what that would take.
The other thing I'm taking away from the comments is that my code behind should really not include much of the sb.Append code, whereas now it is filled with it in numerous functions. To me it is not messy but that is because I know what each function does and can look at it and see, oh that writes out x, y, and z.
It's not uncommon for me to just have a div on the .aspx part and then build up the .innerHtml of that with the StringBuilder in the code behind.
Thanks again for the comments. I'm thinking as I'm reading them.
I typically write out the html in the code behind.
That part is a little odd, and not something I recommend for webforms. If you want to do that, consider an asp.net mvc project instead.
In WebForms, you really want the meat of your HTML to live with the markup rather than the code. The two should remain separate. You also don't want a huge StringBuilder that encompasses your entire page. This will force you to keep the entire page in memory twice (once for the StringBuilder bytes and once for the built string at the end) rather than writing the page to the response stream as it's built. That means more memory per request, which can really kill scalability.
To those ends, I would abstract distinct portions of your StringBuilder code into custom/user controls that you can use in the .aspx markup. These controls can use a StringBuilder to create their output. This means you only need to keep enough HTML markup in memory to render one control at a time. It also allows you to more easily re-use common markup across pages or even sites.
There are times when you need to generate some HTML in your code behind, but in general, you want to leave the HTML where it belongs, and that's separated from your code. The VS IDE is a pretty good HTML editor. Use it.
I'm going to go out on a limb and guess you may have come from a "Classic" ASP (VBScript) or PHP background.
My background is "Classic ASP", and my first attempts at the WebForms model were pretty much the same as yours; once I started using them and understanding them I never looked back. There is a distinct learning curve, though, in understanding how the page life cycle interacts with the various WebForm controls.
Look up the various threads on ASP.NET WebForms vs MVC to see which suits your project's needs best. MVC isn't a magic cure-all, but in many respects may be more familiar if you're from a "Classic ASP" or PHP background.
From a practical perspective, assuming you're sticking with WebForms: if there is the possibility of other developers becoming involved in the project, aim towards using more of the inbuilt controls where you can, as that is more than likely what they will be familiar with. Stating the obvious, the more you use the controls the more you will become familiar with what they can and can't do, and before too long you will find yourself writing your own controls to fill the gaps or finding existing third-party controls.
A big problem you have with that is that it can get pretty messy: having to escape all the quotes, or messing with carriage returns. Sure, YOU can program around that, but what if you want to copy/paste code? Sounds like a nightmare and WAY more work than it's worth.
It sounds like you should be writing a custom control and using HtmlTextWriter to write the markup.
Or perhaps more appropriate would be a user control, with markup in the aspx page and anything else in the code behind.
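A bare-bones sketch of the custom-control route; the control name and property are invented for illustration:

using System.Web.UI;

// A custom control that emits its markup through HtmlTextWriter
// instead of hand-concatenated strings.
public class PriceLabel : Control
{
    private string _price;
    public string Price
    {
        get { return _price; }
        set { _price = value; }
    }

    protected override void Render(HtmlTextWriter writer)
    {
        writer.AddAttribute(HtmlTextWriterAttribute.Class, "price");
        writer.RenderBeginTag(HtmlTextWriterTag.Span);
        writer.WriteEncodedText(Price); // HTML-encodes the value for you
        writer.RenderEndTag(); // </span>
    }
}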
If you're using this approach, you should migrate your development efforts to ASP.Net MVC. Whereas ASP.Net actively tries to abstract the HTML, CSS, JavaScript, etc. away by using web controls, ASP.Net MVC is built around a paradigm of directly controlling the markup itself (though that may arguably be the least of the differences between the two - you should definitely read up on it to at least know the alternatives, even if you stick with ASP.Net in the long run).
Otherwise, what you're doing works if done properly (though you'll be fighting the framework the whole way), though I'd recommend using a StringWriter instead. It uses a StringBuilder internally so the performance characteristics are the same between the two, but the semantics are more consistent with the rest of the .Net framework (e.g., Write vs. Append).
I think this approach kind of defeats the purpose of what webforms was trying to accomplish (separating markup and code).
I know this thread is kind of old and has been answered really well, I just thought I would "append" (pun intended) my answer since I am working with code that was mentioned in the question.
ALL the markup is in the C# classes, and they created a StringBuilder object to append all the HTML and JavaScript strings. This has made it very difficult to read the code and see what's going on, and what if they want to change the markup/design of the front-end? Now I've got a heck of a job on my hands, having to go in and refactor all that markup in the classes, when it would be so much easier to change the .aspx pages and connect the data model to those pages.
In my humble opinion, I can't find a good reason to put any markup in your classes/code behind. They are for logic only. Plus, it makes it difficult to test and debug JavaScript. That's my two cents. K.
What is the easiest way to programmatically extract structured data from a bunch of web pages?
I am currently using an Adobe AIR program I have written to follow the links on one page and grab a section of data off of the subsequent pages. This actually works fine, and for programmers I think this (or other languages) provides a reasonable approach, to be written on a case-by-case basis. Maybe there is a specific language or library that allows a programmer to do this very quickly, and if so I would be interested in knowing what they are.
Also do any tools exist which would allow a non-programmer, like a customer support rep or someone in charge of data acquisition, to extract structured data from web pages without the need to do a bunch of copy and paste?
If you do a search on Stack Overflow for WWW::Mechanize & pQuery you will see many examples using these Perl CPAN modules.
However, because you have mentioned "non-programmer", perhaps the Web::Scraper CPAN module may be more appropriate? It's more DSL-like, and so perhaps easier for a "non-programmer" to pick up.
Here is an example from the documentation for retrieving tweets from Twitter:
use URI;
use Web::Scraper;

my $tweets = scraper {
    process "li.status", "tweets[]" => scraper {
        process ".entry-content", body => 'TEXT';
        process ".entry-date", when => 'TEXT';
        process 'a[rel="bookmark"]', link => '@href';
    };
};

my $res = $tweets->scrape( URI->new("http://twitter.com/miyagawa") );

for my $tweet (@{$res->{tweets}}) {
    print "$tweet->{body} $tweet->{when} (link: $tweet->{link})\n";
}
I found YQL to be very powerful and useful for this sort of thing. You can select any web page from the internet and it will make it valid and then allow you to use XPath to query sections of it. You can output it as XML or JSON, ready for loading into another script/application.
I wrote up my first experiment with it here:
http://www.kelvinluck.com/2009/02/data-scraping-with-yql-and-jquery/
Since then YQL has become more powerful with the addition of the EXECUTE keyword, which allows you to write your own logic in JavaScript and run this on Yahoo!'s servers before returning the data to you.
A more detailed writeup of YQL is here.
You could create a data table for YQL to get at the basics of the information you are trying to grab, and then the person in charge of data acquisition could write very simple queries (in a DSL which is pretty much English) against that table. It would be easier for them than "proper programming", at least...
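If I remember the syntax correctly, a query against YQL's built-in html table looks something like this (the URL and XPath are placeholders):

select * from html where url="http://example.com/product" and xpath="//span[@class='price']"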
There is Sprog, which lets you graphically build processes out of parts (Get URL -> Process HTML Table -> Write File), and you can put Perl code in any stage of the process, or write your own parts for non-programmer use. It looks a bit abandoned, but still works well.
I use a combination of Ruby with Hpricot and Watir; it gets the job done very efficiently.
If you don't mind it taking over your computer, and you happen to need JavaScript support, WatiN is a pretty damn good browsing tool. Written in C#, it has been very reliable for me in the past, providing a nice browser-independent wrapper for running through and getting text from pages.
Are commercial tools viable answers? If so, check out http://screen-scraper.com/ - it is super easy to set up and use for scraping websites. They have a free version which is actually fairly complete. And no, I am not affiliated with the company :)
I've saved an entire webpage's html to a string, and now I want to grab the "href" values from the links, preferably with the ability to save them to different strings later. What's the best way to do this?
I've tried saving the string as an .xml doc and parsing it using an XPathDocument navigator, but (surprise surprise) it doesn't navigate a not-really-an-xml-document too well.
Are regular expressions the best way to achieve what I'm trying to accomplish?
I can recommend the HTML Agility Pack. I've used it in a few cases where I needed to parse HTML and it works great. Once you load your HTML into it, you can use XPath expressions to query the document and get your anchor tags (as well as just about anything else in there).
HtmlDocument yourDoc = new HtmlDocument();
yourDoc.LoadHtml(html); // the HTML string you already saved
int someCount = yourDoc.DocumentNode.SelectNodes("your_xpath").Count;
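And since the goal here is grabbing href values, a sketch of that step (untested against your markup):

// Note: SelectNodes returns null when nothing matches, so guard in real code.
foreach (HtmlNode link in yourDoc.DocumentNode.SelectNodes("//a[@href]"))
{
    // GetAttributeValue returns the fallback when the attribute is missing.
    string href = link.GetAttributeValue("href", string.Empty);
    // ... save each href off to its own string or collection here.
}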
Regular expressions are one way to do it, but it can be problematic.
Most HTML pages can't be parsed using standard html techniques because, as you've found out, most don't validate.
You could spend the time trying to integrate HTML Tidy or a similar tool, but it would be much faster to just build the regex you need.
UPDATE
At the time of this update I've received 15 upvotes and 9 downvotes. I think that maybe people aren't reading the question or the comments on this answer. All the OP wanted to do was grab the href values. That's it. From that perspective, a simple regex is just fine. If the author had wanted to parse other items then there is no way I would recommend regex; as I stated at the beginning, it's problematic at best.
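For the record, a minimal sketch of that simple-regex approach in C#; the pattern is deliberately basic and won't cover every edge case:

using System;
using System.Text.RegularExpressions;

class HrefGrabber
{
    static void Main()
    {
        string html = "<a href=\"http://example.com\">x</a> <a href='/page'>y</a>";

        // Capture whatever follows href= up to the closing quote or a space.
        MatchCollection matches = Regex.Matches(
            html, "href\\s*=\\s*[\"']?([^\"'\\s>]+)", RegexOptions.IgnoreCase);

        foreach (Match m in matches)
            Console.WriteLine(m.Groups[1].Value); // the captured URL
    }
}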
For dealing with HTML of all shapes and sizes I prefer to use the HtmlAgilityPack at http://www.codeplex.com/htmlagilitypack - it lets you write XPaths against the nodes you want and get those returned in a collection.
Probably you want something like the Majestic parser: http://www.majestic12.co.uk/projects/html_parser.php
There are a few other options that can deal with flaky html, as well. The Html Agility Pack is worth a look, as someone else mentioned.
I don't think regexes are an ideal solution for HTML, since HTML is not context-free. They'll probably produce an adequate, if imprecise, result; even deterministically identifying a URI is a messy problem.
It is always better, if possible, not to reinvent the wheel. Some good tools exist that either convert HTML to well-formed XML or act as an XmlReader:
Here are three good tools:
TagSoup, an open-source program, is a Java- and SAX-based tool developed by John Cowan. It is a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.
Taggle is a commercial C++ port of TagSoup.
SgmlReader is a tool developed by Microsoft's Chris Lovett.
SgmlReader is an XmlReader API over any SGML document (including built in support for HTML). A command line utility is also provided which outputs the well formed XML result.
Download the zip file including the standalone executable and the full source code: SgmlReader.zip
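Typical usage looks roughly like this - a sketch based on SgmlReader's documented pattern, with property names from memory:

using System.IO;
using System.Xml;
using Sgml;

class SgmlExample
{
    static XmlDocument ParseHtml(string html)
    {
        SgmlReader sgmlReader = new SgmlReader();
        sgmlReader.DocType = "HTML";                  // treat the input as HTML
        sgmlReader.CaseFolding = CaseFolding.ToLower; // normalize tag names
        sgmlReader.InputStream = new StringReader(html);

        // Load through the reader and you get well-formed XML to query.
        XmlDocument doc = new XmlDocument();
        doc.Load(sgmlReader);
        return doc;
    }
}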
An outstanding achievement is the pure XSLT 2.0 Parser of HTML written by David Carlisle.
Reading its code would be a great learning exercise for everyone of us.
From the description:
"d:htmlparse(string)
d:htmlparse(string,namespace,html-mode)
The one argument form is equivalent to)
d:htmlparse(string,'http://ww.w3.org/1999/xhtml',true()))
Parses the string as HTML and/or XML using some inbuilt heuristics to)
control implied opening and closing of elements.
It doesn't have full knowledge of HTML DTD but does have full list of
empty elements and full list of entity definitions. HTML entities, and
decimal and hex character references are all accepted. Note html-entities
are recognised even if html-mode=false().
Element names are lowercased (if html-mode is true()) and placed into the
namespace specified by the namespace parameter (which may be "" to denote
no-namespace unless the input has explict namespace declarations, in
which case these will be honoured.
Attribute names are lowercased if html-mode=true()"
Read a more detailed description here.
Hope this helped.
Cheers,
Dimitre Novatchev.
I agree with Chris Lively: because HTML is often not very well formed, you are probably best off with a regular expression for this.
href=[\"\'](http:\/\/|\.\/|\/)?\w+(\.\w+)*(\/\w+(\.\w+)?)*(\/|\?\w*=\w*(&\w*=\w*)*)?[\"\']
From here, RegExLib should get you started.
You might have more luck using xml if you know or can fix the document to be at least well-formed. If you have good html (or rather, xhtml), the xml system in .Net should be able to handle it. Unfortunately, good html is extremely rare.
On the other hand, regular expressions are really bad at parsing html. Fortunately, you don't need to handle a full html spec. All you need to worry about is parsing href= strings to get the url. Even this can be tricky, so I won't make an attempt at it right away. Instead I'll start by asking a few questions to try and establish a few ground rules. They basically all boil down to "How much do you know about the document?", but here goes:
Do you know if the "href" text will always be lower case?
Do you know if it will always use double quotes, single quotes, or nothing around the url?
Will it always be a valid URL, or do you need to account for things like '#', javascript statements, and the like?
Is it possible you'll work with a document where the content describes HTML features (i.e. href= could also appear in the document without belonging to an anchor tag)?
What else can you tell us about the document?
I've linked some code here that will let you use "LINQ to HTML"...
Looking for C# HTML parser