Extracting data from an ASPX page - C#

I've been entrusted with a pretty absurd task by my boss.
The task: given a web application that returns a paginated table, write software that "reads and parses it", since there is no web service providing the raw data. In other words, a "spider" or "crawler" application that effectively steals data not meant to be accessed programmatically.
Now the catch: the application is built with the standard ASPX WebForms engine, so there are no clean URLs or form posts, just the dreadful postback mechanism crowded with JavaScript and inaccessible HTML. The pagination links call the infamous javascript:__doPostBack(param, param), so I suspect it wouldn't work even if I tried to simulate clicks on those links.
There are also inputs for filtering the results, and they are part of the same postback mechanism, so I can't simulate a regular POST to get the results.
I was forced to do something like this in the past, but that was on a more conventional site with query string parameters like pagesize and pagenumber, so I was able to sort it out.
Does anyone have an idea whether this is doable, or should I tell my boss to stop asking me to do this sort of thing?
EDIT: Maybe I was a bit unclear about what I have to achieve. I have to parse, extract and convert that data into another format - say Excel - not just read it. And this has to be automated without user input. I don't think Selenium would cut it.
EDIT: I just blogged about this situation. Anyone interested can check my post at http://matteomosca.com/archive/2010/09/14/unethical-programming.aspx and comment there.

Stop disregarding the tools suggested.
No, the parser you could write yourself won't beat WatiN or Selenium; both of those will work in that scenario.
P.S. Had you mentioned needing to extract the data from Flash/Flex/Silverlight or similar, this would be a different answer.
By the way, the reason to proceed or not is definitely not technical, but ethical and perhaps even legal. See my comment on the question for my opinion on this.

WatiN will help you navigate the site from the perspective of the UI and grab the HTML for you, and you can find information on .NET DOM parsers here.
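For illustration, here is a minimal WatiN sketch under the assumption that the pagination links are ordinary anchors a browser can click; the URL and link text are placeholders:

    using System;
    using WatiN.Core;

    class PagerScrape
    {
        [STAThread] // WatiN requires a single-threaded apartment
        static void Main()
        {
            // WatiN drives a real Internet Explorer instance, so the
            // __doPostBack JavaScript runs exactly as it would for a user.
            using (var browser = new IE("http://example.com/Report.aspx"))
            {
                // Click the pagination link by its visible text and wait for the postback.
                browser.Link(Find.ByText("2")).Click();

                // Grab the resulting HTML for parsing with a DOM parser of your choice.
                string html = browser.Html;
                Console.WriteLine(html.Length);
            }
        }
    }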

Already commented, but I think this is actually an answer.
You need a tool which can click client-side links and wait while the page reloads.
Tools like Selenium can do that.
Also (from the comments): WatiN, Watir.
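As a comparable sketch with Selenium's C# WebDriver bindings (again, the URL and link text are placeholders):

    using OpenQA.Selenium;
    using OpenQA.Selenium.Firefox;

    class SeleniumScrape
    {
        static void Main()
        {
            // A real browser executes the WebForms JavaScript, including __doPostBack.
            using (IWebDriver driver = new FirefoxDriver())
            {
                driver.Navigate().GoToUrl("http://example.com/Report.aspx");
                driver.FindElement(By.LinkText("2")).Click(); // pagination link
                string html = driver.PageSource;              // HTML after the postback
                System.Console.WriteLine(html.Length);
            }
        }
    }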

Insane as it sounds, the CDC's website has this exact problem, and the data is public (we taxpayers have paid for it). I'm trying to get the survey and question data from http://wwwn.cdc.gov/qbank/Survey.aspx and it's absurdly difficult. Not illegal or unethical, just a terrible implementation that appears to be intentionally making it difficult to get the data (it's also inaccessible to search engines).
I think Selenium is going to work for us, thanks for the suggestion.


ASP.NET AJAX Call C# functions

I've been looking around for the last few days trying to figure out what the best route is. I am fairly new to ASP.NET, so I am in need of a little assistance.
I like the idea of using Master Pages, as it will make changes to the template a lot easier! But I am running into some problems. I will list them below and see where we can go; maybe this will help some other newbies like myself.
Dynamic Menu:
I am trying to create a menu system that shows certain links depending on the user's role. This is simple enough, until I just want a link to perform some function and nothing else, with no postback. My next step was to try jQuery, as I would in my PHP development. The problem is I can't seem to get my jQuery to call the master page's code-behind function. I've gone through all the tutorials I could find on WebMethods, but I keep getting an error along the lines of "This type of page is not served".
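For reference, "This type of page is not served" is the error you get when requesting a .master file directly; a common workaround is to put a static page method on a regular .aspx page instead and call that from jQuery. A minimal sketch with illustrative names:

    // Menu.aspx.cs -- hypothetical page, not from the question
    using System.Web.Services;
    using System.Web.UI;

    public partial class Menu : Page
    {
        // Page methods must be public, static and marked [WebMethod].
        // jQuery can then POST to "Menu.aspx/GetMenuLinks" with
        // contentType "application/json; charset=utf-8".
        [WebMethod]
        public static string[] GetMenuLinks(string role)
        {
            // Return whatever the menu needs for the given role.
            return role == "admin"
                ? new[] { "Users.aspx", "Reports.aspx" }
                : new[] { "Default.aspx" };
        }
    }

On ASP.NET 3.5 and later the JSON response is wrapped in a "d" property, so the jQuery success handler reads result.d.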
General Classes:
In PHP I would sometimes need a few general classes that pertained to a specific area of the application; I would use them to hold all the functions I might need to call from jQuery. Is there something like that in ASP.NET? I tried just adding a class, but again I couldn't call it from jQuery. Is this something Web Services would be good at? I am still trying to understand their full use. It seems like Web Services could act as a buffer between the client and the back-end classes.
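One way to get that kind of "buffer" in ASP.NET is an .asmx web service marked as a script service, which jQuery can call as JSON; a hedged sketch with made-up names:

    // SiteServices.asmx.cs -- hypothetical service
    using System.Web.Script.Services;
    using System.Web.Services;

    [WebService(Namespace = "http://example.com/")]
    [ScriptService] // makes the methods callable from client-side script as JSON
    public class SiteServices : WebService
    {
        [WebMethod]
        public string GetGreeting(string name)
        {
            // Delegate to your ordinary back-end classes here.
            return "Hello, " + name;
        }
    }

jQuery would then POST to SiteServices.asmx/GetGreeting with a JSON content type, much like a page method.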
I look forward to any pointers or tips!
Oded and jrummell made it very clear I should probably start with ASP.NET MVC first. It will most likely be an easier road for me coming from PHP.

AS3 Communicate with C# or ASP.NET

I have been looking around for hours trying to find a clear, simple solution to my question and have yet to find a good answer. I am trying to do a URL request in Flash to my nopCommerce site. I want to pass a GET value to my .cs file, then use that value to grab specific information and return it back to Flash. How would I set up the C# or ASP.NET side of things? If anyone could give me an example of what I am looking for, I would greatly appreciate it.
I don't know if I am supposed to use a .aspx, .cs or .ascx file.
Thanks,
Brennan
I found it to be extremely simple with web services in AS3. Here is a link to see what I mean:
AS3 Web Services
Use the HttpWebRequest class to GET the variables, do the magic and return a result by invoking the HttpWebRequest again.
Examples and usage here:
http://www.csharp-station.com/HowTo/HttpWebFetch.aspx
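A minimal fetch along the lines of that article; the URL is a placeholder:

    using System.IO;
    using System.Net;

    class Fetch
    {
        static void Main()
        {
            var request = (HttpWebRequest)WebRequest.Create("http://example.com/data.aspx?id=42");
            using (var response = (HttpWebResponse)request.GetResponse())
            using (var reader = new StreamReader(response.GetResponseStream()))
            {
                // Raw response body, ready to parse or hand back to Flash.
                string body = reader.ReadToEnd();
                System.Console.WriteLine(body);
            }
        }
    }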
You have a few options for server-side communication with Flash.
1. Flash remoting. This is the most popular because it's the most performant, but not the easiest to understand at first glance. It transfers data in a binary format. Available libraries are WebORB and Fluorine.
2. Web services, as mentioned in a previous post.
3. Ajax/JSON. I think that with Flash Player 11.3, JSON decoding is native in the player.
4. Straight-up HTTP request.
5. Sockets (not recommended for beginners).
To answer your question as you asked it, though: for all but #4 you'd be using a .cs file to retrieve your data. For #4 you'd most likely be using an .aspx page, but it could be a combination of .aspx and .ascx files.
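For the plain HTTP option (#4), the server side can be as small as a generic handler (.ashx) that reads the GET parameter and writes something Flash's URLLoader can parse; a sketch with hypothetical names:

    // GetPrice.ashx.cs -- hypothetical handler
    using System.Web;

    public class GetPrice : IHttpHandler
    {
        public void ProcessRequest(HttpContext context)
        {
            // Value sent from Flash as a GET parameter, e.g. GetPrice.ashx?productId=7
            string productId = context.Request.QueryString["productId"];

            // Look up whatever data you need, then write it back.
            // name=value pairs are easy to read with URLLoaderDataFormat.VARIABLES in AS3.
            context.Response.ContentType = "text/plain";
            context.Response.Write("productId=" + productId + "&price=9.99");
        }

        public bool IsReusable
        {
            get { return false; }
        }
    }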
My recommendation is that you do some research on each of these methods to decide what would work best with your development environment, required level of security, and project. Then, ask specific questions about each method as necessary.
Good Luck!

Practices on filtering user inputs

I would like to ask for some suggestions from the more experienced people out there.
I have to filter user inputs, where they might try to enter values like
<script type="text/javascript">alert(12);</script>
into the textbox. Do you have any recommendations for good practices regarding this issue?
Recently we actually encountered a problem on one of our SharePoint projects: we typed a script into a textbox and boom, the page crashed... Trapping that particular case is easy, I think, because we know it's one of the possible user inputs, but what about the things we don't know? There might be other situations we haven't considered beyond just trapping a script. Can somebody suggest a good practice regarding this matter?
Thanks in advance! :)
Microsoft actually produces an anti-cross-site-scripting library, though when I looked at it, it was little more than a wrapper around various encoding functions in the .NET framework: the AntiXSS library.
Two of the main threats you should consider are:
Script injection
HTML tag injection
Both of these can be mitigated (to a degree) by HTML-encoding user input before re-rendering it on the page.
There is also a library called AntiSamy available from the OWASP project, designed to neuter malicious input in web applications.
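As a concrete example of the encode-on-output approach; the AntiXSS call in the comment assumes that library is referenced:

    using System.Web;

    static class EncodeExample
    {
        public static string Render(string userInput)
        {
            // Built-in encoding: <script> becomes &lt;script&gt; and is displayed, not executed.
            string safe = HttpUtility.HtmlEncode(userInput);

            // With the AntiXSS library referenced, the equivalent (whitelist-based) call is,
            // depending on version, Microsoft.Security.Application.AntiXss.HtmlEncode(userInput)
            // or Microsoft.Security.Application.Encoder.HtmlEncode(userInput).

            return "<p>" + safe + "</p>";
        }
    }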
Jimmy's answer is a good technique for managing "Input Validation & Representation" problems.
But you can also filter your textbox inputs yourself before passing them to a third-party API such as AntiSamy.
I generally apply these checks:
1) Minimize the length of the textbox value: not only on the client side but on the server side too (you might not believe it, but buffer-overflow-style attacks exist in scripting too).
2) Apply a whitelist check to the characters users can write into the textbox (client side and server side).
3) Use a whitelist where possible; blacklists are less secure than whitelists.
It is very important that you perform these checks on the server side.
Sure, it's easy to forget some checks, which is why AntiSamy and products like it are very useful. But I advise you to implement your own personal "Input Validation" API.
Securing software is not about grabbing some third-party product; it is about programming in a different way.
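A small server-side sketch of the length and whitelist checks described above; the allowed pattern is only an example:

    using System.Text.RegularExpressions;

    static class InputValidation
    {
        // Example whitelist: letters, digits, whitespace and a little punctuation, max 100 chars.
        private static readonly Regex Allowed = new Regex(@"^[\w\s.,'-]{1,100}$");

        public static bool IsValid(string input)
        {
            // Reject null, over-long and non-whitelisted input on the server,
            // regardless of what client-side validation already did.
            return !string.IsNullOrEmpty(input) && Allowed.IsMatch(input);
        }
    }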
I have tried this on SharePoint with both a single line of text and multiple lines of text, and in both cases SharePoint encodes the value (I get no alert).
Which version of SharePoint are you using?

What is the easiest way to programmatically extract structured data from a bunch of web pages?

I am currently using an Adobe AIR program I have written to follow the links on one page and grab a section of data off the subsequent pages. This actually works fine, and for programmers I think this (or other languages) provides a reasonable approach, to be written on a case-by-case basis. Maybe there is a specific language or library that allows a programmer to do this very quickly, and if so I would be interested in knowing what it is.
Also, do any tools exist which would allow a non-programmer, like a customer support rep or someone in charge of data acquisition, to extract structured data from web pages without a bunch of copy and paste?
If you do a search on Stack Overflow for WWW::Mechanize & pQuery you will see many examples using these Perl CPAN modules.
However, since you mentioned "non-programmer", perhaps the Web::Scraper CPAN module may be more appropriate? It's more DSL-like and so perhaps easier for a "non-programmer" to pick up.
Here is an example from the documentation for retrieving tweets from Twitter:
    use URI;
    use Web::Scraper;

    my $tweets = scraper {
        process "li.status", "tweets[]" => scraper {
            process ".entry-content", body => 'TEXT';
            process ".entry-date",    when => 'TEXT';
            process 'a[rel="bookmark"]', link => '@href';
        };
    };

    my $res = $tweets->scrape( URI->new("http://twitter.com/miyagawa") );

    for my $tweet (@{$res->{tweets}}) {
        print "$tweet->{body} $tweet->{when} (link: $tweet->{link})\n";
    }
I found YQL to be very powerful and useful for this sort of thing. You can select any web page from the internet, it will make it valid, and it then allows you to use XPath to query sections of it. You can output it as XML or JSON, ready for loading into another script/application.
I wrote up my first experiment with it here:
http://www.kelvinluck.com/2009/02/data-scraping-with-yql-and-jquery/
Since then YQL has become more powerful with the addition of the EXECUTE keyword, which allows you to write your own logic in JavaScript and run it on Yahoo!'s servers before returning the data to you.
A more detailed writeup of YQL is here.
You could create a data table for YQL that captures the basics of the information you are trying to grab, and then the person in charge of data acquisition could write very simple queries (in a DSL which is pretty much English) against that table. It would be easier for them than "proper programming", at least...
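To keep the examples in one language, here is a rough C# sketch of calling YQL's public REST endpoint as it worked at the time; the endpoint, query and XPath are illustrative:

    using System;
    using System.Net;

    class YqlExample
    {
        static void Main()
        {
            // YQL statement: fetch a page and select nodes by XPath.
            string yql = "select * from html where url=\"http://example.com\" " +
                         "and xpath='//table[@id=\"prices\"]//tr'";

            string endpoint = "http://query.yahooapis.com/v1/public/yql?format=json&q=" +
                              Uri.EscapeDataString(yql);

            using (var client = new WebClient())
            {
                // JSON result, ready to load into another script or application.
                Console.WriteLine(client.DownloadString(endpoint));
            }
        }
    }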
There is Sprog, which lets you graphically build processes out of parts (Get URL -> Process HTML Table -> Write File), and you can put Perl code in any stage of the process, or write your own parts for non-programmer use. It looks a bit abandoned, but still works well.
I use a combination of Ruby with Hpricot and Watir; it gets the job done very efficiently.
If you don't mind it taking over your computer, and you happen to need JavaScript support, WatiN is a pretty damn good browsing tool. Written in C#, it has been very reliable for me in the past, providing a nice browser-independent wrapper for running through and getting text from pages.
Are commercial tools viable answers? If so, check out http://screen-scraper.com/; it is super easy to set up and use to scrape websites. They have a free version which is actually fairly complete. And no, I am not affiliated with the company :)

Simple screen scraping and analyze in .NET

I'm building a small, specialized search engine for price info. The engine will only collect specific segments of data from each site. My plan is to split the process into two steps.
1. Simple screen scraping, based on a URL that points to the page where the segment I need exists. Is the easiest way to do this just to use a WebClient object and get the full HTML?
2. Once the HTML is pulled and saved, analyse it via some script and pull out just the segment and values I need (for example, the price of a product). My problem is that this script somehow has to be unique for each site I pull, it has to be able to handle really ugly HTML (so I don't think XSLT will do...), and I need to be able to change it on the fly as the target sites update and change. Finally, I will take the specific values and write them to a database to make them searchable.
Could you please give me some hints on the best way to architect this? Would you do it differently than described above?
Well, I would go with the approach you describe.
1. How much data is it going to handle? Fetching the full HTML via WebClient / HttpWebRequest should not be a problem.
2. I would go for HtmlAgilityPack for the HTML parsing. It's very forgiving and can handle pretty ugly markup. As HtmlAgilityPack supports XPath, it's pretty easy to have specific XPath selections for individual sites.
I'm on the run and will expand on this answer ASAP.
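A minimal sketch of that combination; the URL and XPath are placeholders for whatever each target site needs:

    using System;
    using System.Net;
    using HtmlAgilityPack;

    class PriceScraper
    {
        static void Main()
        {
            // Step 1: pull the raw HTML.
            string html;
            using (var client = new WebClient())
            {
                html = client.DownloadString("http://example.com/product/123");
            }

            // Step 2: parse the (possibly ugly) markup and pick out the price node.
            var doc = new HtmlDocument();
            doc.LoadHtml(html);
            var priceNode = doc.DocumentNode.SelectSingleNode("//span[@class='price']");
            if (priceNode != null)
            {
                Console.WriteLine(priceNode.InnerText.Trim());
            }
        }
    }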
Yes, a WebClient can work well for this. The WebBrowser control will work as well, depending on your requirements. If you are going to load the document into an HtmlDocument (the IE HTML DOM), then it might be easier to use the WebBrowser control.
The HtmlDocument object that is now built into .NET can be used to parse the HTML. It is designed to be used with the WebBrowser control, but you can use the implementation from the mshtml dll as well. I have not used the HtmlAgilityPack, but I hear it can do a similar job.
The HTML DOM objects will typically handle, and fix up, most of the ugly HTML that you throw at them, as well as providing a nicer way to parse the HTML, e.g. document.GetElementsByTagName to get a collection of tag objects.
As for handling the changing requirements of the site, it sounds like a good candidate for the strategy pattern. You could load the strategies for each site using reflection or something of that sort.
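A sketch of what that strategy approach might look like; the interface and class names are invented for illustration:

    // One strategy per target site; the host picks an implementation at runtime,
    // for example via configuration or reflection.
    public interface ISiteScraper
    {
        bool CanHandle(string url);
        decimal ExtractPrice(string html);
    }

    public class ExampleShopScraper : ISiteScraper
    {
        public bool CanHandle(string url)
        {
            return url.Contains("exampleshop.com"); // hypothetical site
        }

        public decimal ExtractPrice(string html)
        {
            // Site-specific parsing (XPath, regex, ...) goes here.
            return 0m;
        }
    }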
I have worked on a system that used XML to define a generic set of parameters for extracting text from HTML pages. Basically it would define start and end elements that begin and end the extraction. I found this technique to work well enough for a small sample, but it gets rather cumbersome and difficult to customize as the collection of sites grows larger and larger. Keeping the XML up to date, and trying to keep a generic set of XML and code that handles any type of site, is difficult. But if the type and number of sites is small, this might work.
One last thing to mention is that you might want to add a cleaning step to your approach. A flexible way to clean up HTML as it comes into the process was invaluable in code I have worked on in the past. Perhaps implementing some kind of pipeline would be a good approach, if you think the domain is complex enough to warrant it. But even just a method that runs some regexes over the HTML before you parse it would be valuable: getting rid of images, replacing particular misused tags with nicer HTML, etc. The amount of really dodgy HTML out there continues to amaze me...
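For instance, a pre-parse cleaning step can be as simple as a couple of regexes run over the HTML; the patterns here are only examples:

    using System.Text.RegularExpressions;

    static class HtmlCleaner
    {
        public static string Clean(string html)
        {
            // Drop script blocks and images before handing the HTML to the parser.
            html = Regex.Replace(html, @"<script[\s\S]*?</script>", string.Empty, RegexOptions.IgnoreCase);
            html = Regex.Replace(html, @"<img[^>]*>", string.Empty, RegexOptions.IgnoreCase);
            return html;
        }
    }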
