Need Help in building a "robot" that extracts data from HTTP request - c#

I am building a web site in ASP.NET and C#. One of its components logs in, on behalf of the user, to a website where the user has an account (for example, a cellular phone company), takes information from that site, and stores it in our database.
I think this action is called "scraping".
Are there any products that already do this that I could integrate with my software?
I don't need standalone software that does it; I need some sort of SDK that I can integrate with my C# code.
Thanks,
Koby

Use the HtmlAgilityPack to parse the HTML that you get from a web request once you've logged in.
See here for logging in: Login to website, via C#
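A minimal sketch of that flow, assuming the site accepts a plain form POST for login; the URLs, form field names, and XPath below are placeholders you would take from the target site:

    using System;
    using System.IO;
    using System.Net;
    using System.Text;
    using HtmlAgilityPack;

    // Holds the session cookie the site issues at login.
    var cookies = new CookieContainer();

    var login = (HttpWebRequest)WebRequest.Create("https://example.com/login");
    login.Method = "POST";
    login.ContentType = "application/x-www-form-urlencoded";
    login.CookieContainer = cookies;

    byte[] body = Encoding.UTF8.GetBytes("username=koby&password=secret");
    using (Stream s = login.GetRequestStream())
        s.Write(body, 0, body.Length);
    login.GetResponse().Close();

    // Reusing the same container keeps us logged in for the data page.
    var page = (HttpWebRequest)WebRequest.Create("https://example.com/account");
    page.CookieContainer = cookies;
    string html;
    using (var reader = new StreamReader(page.GetResponse().GetResponseStream()))
        html = reader.ReadToEnd();

    // Parse the returned HTML with the HtmlAgilityPack.
    var doc = new HtmlDocument();
    doc.LoadHtml(html);
    Console.WriteLine(doc.DocumentNode.SelectSingleNode("//title").InnerText);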

I haven't found any product that does this well so far.
One way to handle it is to:
- make the requests yourself
- use http://htmlagilitypack.codeplex.com/ to extract the important information from the downloaded HTML
- save the extracted information yourself
The thing is, depending on the context there is so much to tune and configure that you would need a very large product, and it still wouldn't match the performance and accuracy of a custom solution:
a) multithreading control
b) extraction rules (see the sketch after this list)
c) persistence control
d) web spidering (i.e., how the next link to parse is chosen)
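As an illustration of point b), a minimal extraction rule with the HtmlAgilityPack; the XPath is hypothetical and must be adapted to the page you scrape:

    using System;
    using HtmlAgilityPack;

    // html holds the page source downloaded after logging in.
    var doc = new HtmlDocument();
    doc.LoadHtml(html);

    // Hypothetical XPath: adjust it to the page's actual structure.
    var rows = doc.DocumentNode.SelectNodes("//table[@id='invoices']//tr");
    if (rows != null)   // SelectNodes returns null when nothing matches
    {
        foreach (HtmlNode row in rows)
            Console.WriteLine(row.InnerText.Trim());
    }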

Check the Web Scraping Wikipedia Entry.
However, since what you need to acquire via web scraping is usually application specific, it is often more efficient to scrape whatever you need directly from the web response stream.
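For example, a quick-and-dirty grab straight from the response stream; the URL and pattern here are hypothetical, and a regex like this is brittle against markup changes, so prefer a real HTML parser for anything long-lived:

    using System;
    using System.Net;
    using System.Text.RegularExpressions;

    string html = new WebClient().DownloadString("https://example.com/balance");
    Match m = Regex.Match(html, @"Current balance:\s*([\d\.]+)");
    if (m.Success)
        Console.WriteLine(m.Groups[1].Value);  // e.g. "42.50"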

Related

Prevent unwanted access to my web service

I have coded a C# MVC5 Internet application, and I have a Web API 2 web service that returns JSON data. I am retrieving this JSON data in an Android application.
How can I add a feature to the web service such that only my Android application can retrieve the JSON data? I want to do this so that other web users cannot hammer the URL and the web service will not send my data to unwanted applications and/or users.
Is this possible? If so, how should I do this?
Thanks in advance.
There are in fact various ways to achieve this.
For example, you can store a key in your Android application and send this key together with each request to your Web API. The Web API then checks whether the key is valid and, if it is, returns the JSON.
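A minimal sketch of that check as a Web API 2 message handler; the header name and key value are illustrative choices, not a standard:

    using System.Collections.Generic;
    using System.Linq;
    using System.Net;
    using System.Net.Http;
    using System.Threading;
    using System.Threading.Tasks;

    // Rejects any request that does not carry the shared key.
    public class ApiKeyHandler : DelegatingHandler
    {
        private const string ExpectedKey = "replace-with-a-long-random-value";

        protected override Task<HttpResponseMessage> SendAsync(
            HttpRequestMessage request, CancellationToken cancellationToken)
        {
            IEnumerable<string> values;
            bool valid = request.Headers.TryGetValues("X-Api-Key", out values)
                         && values.FirstOrDefault() == ExpectedKey;

            if (!valid)
                return Task.FromResult(request.CreateResponse(HttpStatusCode.Unauthorized));

            return base.SendAsync(request, cancellationToken);
        }
    }

    // Registered once in WebApiConfig.Register:
    // config.MessageHandlers.Add(new ApiKeyHandler());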
However, there is no way to ensure that nobody else can request and get your data: the key can be recovered by reverse engineering your Android application, or by monitoring the network traffic and finding it there.
You need to understand that nothing guarantees you 100% security.
Look at it this way: right now you have an open door. You can close it little by little, but shutting and locking it completely is not possible; there will always be a gap. A house can't be made burglar-proof either, but you can make it very hard for a burglar to enter.
Go to this link: Web API. I have used individual-accounts authentication for my Web API. When you register the user, the response you get back is an access token; use that access token as the Authorization header in your AJAX call if you are using jQuery AJAX to call your Web API. Also refer to The OAuth 2.0 Authorization Framework. Hope this helps.
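A hedged C# sketch of that token flow, in case it helps; the base address and controller route are placeholders, and the JSON parsing assumes the Json.NET package:

    using System;
    using System.Collections.Generic;
    using System.Net.Http;
    using System.Net.Http.Headers;
    using System.Threading.Tasks;
    using Newtonsoft.Json.Linq;

    static async Task CallApiAsync()
    {
        using (var client = new HttpClient { BaseAddress = new Uri("https://example.com/") })
        {
            // The individual-accounts template exposes a /Token endpoint
            // that accepts a resource-owner password grant.
            var form = new FormUrlEncodedContent(new Dictionary<string, string>
            {
                { "grant_type", "password" },
                { "username", "user@example.com" },
                { "password", "secret" }
            });
            HttpResponseMessage tokenResponse = await client.PostAsync("Token", form);
            string tokenJson = await tokenResponse.Content.ReadAsStringAsync();
            string accessToken = (string)JObject.Parse(tokenJson)["access_token"];

            // Send the token as a bearer Authorization header on every call.
            client.DefaultRequestHeaders.Authorization =
                new AuthenticationHeaderValue("Bearer", accessToken);
            string data = await client.GetStringAsync("api/values");
            Console.WriteLine(data);
        }
    }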
Are you looking for something like this?
http://httpd.apache.org/docs/2.2/howto/access.html
If you are running a different web server, it should offer comparable access-control features.

How to pull data from any website and use it in Windows Store apps

I want to know the method of pulling data from a website and parsing it in our own code to present it to the user.
For example: consider an app in which a user types a movie name and all the posters get fetched from various websites, like IMDb, etc. Or a user enters a movie name and all the data from IMDb is fetched. I know about certain third-party API services for fetching data from IMDb, like omdbapi and imdbapi, but I want to know the method of doing so from any sort of website, not just IMDb.
I am a complete newbie in this context so please guide me through this from the very beginning. I want to do this in a Windows 8 Store app using C# and XAML in Visual Studio.
A simple way is to use the website's RSS feed, if it publishes one (many websites do). All you have to do is pass the parameters as a query string using a web request object; the response stream will then have all the details you want, which you can parse in C# and work with.
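A minimal sketch of that, assuming a hypothetical feed URL and standard RSS 2.0 markup (run it inside an async method):

    using System;
    using System.Linq;
    using System.Net.Http;
    using System.Xml.Linq;

    var client = new HttpClient();
    string xml = await client.GetStringAsync("https://example.com/rss?search=inception");

    // RSS 2.0 wraps each entry in an <item> element.
    XDocument feed = XDocument.Parse(xml);
    foreach (string title in feed.Descendants("item")
                                 .Select(item => (string)item.Element("title")))
    {
        Console.WriteLine(title);
    }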
There is no standard way to do it for any website; you must write your own algorithm for each of the websites you want to get content from.
HttpClient is your tool for getting web content in your app.
Check out YQL:
The Yahoo Query Language is an expressive SQL-like language that lets you query, filter, and join data across the Web.
You should use Html Agility Pack.
For better performance, host your scraping service on Azure.

Retrieving the words from WordNet database

I'm looking for a website that offers an API for retrieving words from the English WordNet database.
I do not want to download the WordNet database and implement it on my server.
I simply want to call an API and get back some results in XML format from that website.
I have a web application in ASP.NET that is written in C#.
Here is a sample from WordNet; I want to do something like that in my web application.
WordNet Online
It seems there is no such API publicly available.
According to the Related Projects page, part of the WordNet data is available as an API via abbreviations.com:
Abbreviations.com has created free APIs based on REST calls which return a well-formatted XML result, providing both synonyms and definitions APIs based on the WordNet database.
However, in the .NET/C# section of the same page you can find some publicly available local APIs, so you don't have to implement it yourself, but you do have to download the data files.
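A rough sketch of calling such a REST/XML API from C#; the endpoint URL, query parameters, and element names below are placeholders, so consult the provider's documentation for the real ones:

    using System;
    using System.Net;
    using System.Xml.Linq;

    // Placeholder endpoint: substitute the documented API URL and your key.
    string url = "https://example.com/wordnet-api?word=bank&format=xml";
    string xml = new WebClient().DownloadString(url);

    XDocument doc = XDocument.Parse(xml);
    foreach (XElement synonym in doc.Descendants("synonym"))
        Console.WriteLine(synonym.Value);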
WordNet does not seem to expose a REST or similar API that you can use. That said, you might be able to derive the URL pattern by searching online, use it in your application, and parse the response HTML. You might want to check their website to make sure this is legal.

Programmatically purchase from a website in C# or javascript

Hey guys, I'm trying to create a website that can help a user purchase items from other websites. What would be the best way to go about doing this?
I know most of the sites I'm targeting send their information using form POSTs, but I'm having trouble finding the exact POST packet in Fiddler (I'm assuming it's encrypted?), and I know that a lot of the sites use login credentials, which complicates things a bit.
Is there any way I could use WebKit or something to handle all the HTTP stuff, and just pass JavaScript to fill in the forms? Or is there an even simpler way to create proper POST packets and use a WebRequest?
Thank you!
1) Get permission.
2) Use their published API.
If the sites have no API but allow you to use their server process, copy their forms to your site and use POST. You can post from your server with credentials using, for example, cURL; a C# sketch of the same idea follows.
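The sketch below posts a copied form from the server side; the action URL and field names are placeholders that you would copy from the vendor's actual form:

    using System.Collections.Generic;
    using System.Net.Http;
    using System.Threading.Tasks;

    static async Task<string> SubmitOrderAsync()
    {
        using (var client = new HttpClient())
        {
            // Field names mirror the vendor's form; all values are placeholders.
            var form = new FormUrlEncodedContent(new Dictionary<string, string>
            {
                { "user", "account@example.com" },
                { "password", "secret" },
                { "item_id", "12345" },
                { "quantity", "1" }
            });
            HttpResponseMessage response = await client.PostAsync(
                "https://vendor.example.com/cart/checkout", form);
            return await response.Content.ReadAsStringAsync();
        }
    }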
Shopping carts and credit-card transactions usually run over SSL, and you have to log in to the site, so I don't think it's that simple to bridge with JavaScript or a plain WebRequest. There is no standard, simple way to do this!
You're heading for a world of hurt.
First, you should check whether what you're trying to do is legal. Does the web site allow "proxy orders", or are they forbidden by its EULA?
Second, you'll have to handle the user's confidential data (username, password, credit card number), and credit card numbers in particular are asking for trouble.
Third, how are you planning to implement payment methods like PayPal? Are you going to collect the user's PayPal credentials in order to make payments on their behalf? (See point two if the answer is yes.)
Fourth, since you have to fake HTTP requests, your tool will break as soon as the web site changes a single field. How are you planning to handle that?
Or are you trying to automate only the first steps of the order and not the payment?

Authenticate on an ASP.Net Forms Authorization website from a console app

I'm trying to build a C# console application to automate grabbing certain files from our website, mostly to save myself clicks and, frankly, just to have done it. But I've hit a snag for which I've been unable to find a working solution.
The website to which I'm trying to connect uses ASP.NET forms authentication, and I cannot figure out how to authenticate myself with it. This application is a complete hack, so I can hard-code my username and password or any other needed auth info, and the solution doesn't need to be viable enough to release to general users. In other words, if the only possible solution is a hack, I'm fine with that.
Basically, I'm trying to use HttpWebRequest to pull the site that has the list of files, iterating through that list and then downloading what I need. So the actual work on the site is fairly trivial once I can get the website to consider me authorized.
I have dealt with something similar, and the hardest part is figuring out exactly what you need to "fake" to get authorized. In my case it was authenticating against some Lotus Notes web service, but the details are unimportant; the method is the same.
Essentially, you need to record a regular user session. I would recommend Fiddler (http://www.fiddler2.com), but if you're on Linux or something you'll need to use Wireshark to figure some of the things out. Not sure if there is a Firefox plugin that could be used.
Anyway, start up IE, then start up Fiddler, and complete the login process.
Stop what you're doing, switch to the Fiddler pane, and examine the recorded sessions in detail. They should show you exactly what you need to fake using WebRequests.
This page should get you started. You first need to make a request to the login page, and then save the cookie to a container that you include in all later requests. That should keep you logged in and able to retrieve the files.
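A rough sketch of that cookie flow, assuming a standard WebForms login page; the URLs and field names are placeholders you would take from a Fiddler trace or the page source, and a real page may also require the __VIEWSTATE and __EVENTVALIDATION values scraped from an initial GET:

    using System;
    using System.IO;
    using System.Net;
    using System.Text;

    var cookies = new CookieContainer();

    var login = (HttpWebRequest)WebRequest.Create("https://example.com/Login.aspx");
    login.Method = "POST";
    login.ContentType = "application/x-www-form-urlencoded";
    login.CookieContainer = cookies;

    byte[] body = Encoding.UTF8.GetBytes("UserName=me&Password=secret");
    using (Stream s = login.GetRequestStream())
        s.Write(body, 0, body.Length);
    login.GetResponse().Close();   // the forms-auth cookie now sits in the container

    // Every later request that carries the same container is authenticated.
    var files = (HttpWebRequest)WebRequest.Create("https://example.com/Files.aspx");
    files.CookieContainer = cookies;
    using (var reader = new StreamReader(files.GetResponse().GetResponseStream()))
        Console.WriteLine(reader.ReadToEnd());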
