I'm working on a C# application that needs to scrape some data from a phpBB forum. Scraping the forum requires logging in; the application will prompt the user for their credentials to connect.
I've scraped websites with C# before, but what I'm not sure how to do is log in to phpBB and keep the session open for the duration of the screen scraping. I've done some searching and haven't had much luck. Is there a good way to do something like this programmatically?
You don't say what you've tried, but if you're using an HttpWebRequest object to retrieve pages and/or log on, you need to assign a CookieContainer to the HttpWebRequest so it can store any cookies returned by the website. Share that same container across all your HttpWebRequest objects to stay logged in.
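A minimal sketch of that pattern, assuming phpBB's standard login form (the ucp.php?mode=login URL and the 'username'/'password'/'login' field names are assumptions based on a stock phpBB 3 board; check your forum's actual markup, and note that newer boards may also require hidden fields such as the sid or a form token):

using System;
using System.IO;
using System.Net;
using System.Text;

class PhpbbSession
{
    // One CookieContainer shared across every request keeps the session alive.
    static readonly CookieContainer Cookies = new CookieContainer();

    static string Post(string url, string formData)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        request.CookieContainer = Cookies;   // any Set-Cookie headers land in here

        byte[] body = Encoding.UTF8.GetBytes(formData);
        using (Stream stream = request.GetRequestStream())
            stream.Write(body, 0, body.Length);

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
            return reader.ReadToEnd();
    }

    static string Get(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.CookieContainer = Cookies;   // same container -> still logged in
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
            return reader.ReadToEnd();
    }

    static void Main()
    {
        // Placeholder URL and field names -- inspect the real login form first.
        string login = "username=" + Uri.EscapeDataString("myUser") +
                       "&password=" + Uri.EscapeDataString("myPass") +
                       "&login=Login";
        Post("http://forum.example.com/ucp.php?mode=login", login);

        // Any later request that reuses the same CookieContainer is authenticated.
        string html = Get("http://forum.example.com/viewforum.php?f=1");
        Console.WriteLine(html.Length);
    }
}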
Look for the names of the username and password fields using Firebug or Chrome (or even View Source), then use my answer here: Programmatically logging into a site, replacing 'session_key' and 'session_password' as appropriate. That should work.
And then translate it to C#!
I would recommend the WatiN API for screen scraping. I have done screen scraping with it and it works well.
Check it out!
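For what it's worth, a rough sketch of the phpBB login from the question done with WatiN (the URL and field names are placeholders; check the real form):

using System;
using WatiN.Core;

class WatinScrape
{
    [STAThread]   // WatiN drives IE over COM, so it needs an STA thread
    static void Main()
    {
        using (var browser = new IE("http://forum.example.com/ucp.php?mode=login"))
        {
            browser.TextField(Find.ByName("username")).TypeText("myUser");
            browser.TextField(Find.ByName("password")).TypeText("myPass");
            browser.Button(Find.ByName("login")).Click();

            // The browser instance holds the session cookies, so later
            // navigation within the same instance stays logged in.
            browser.GoTo("http://forum.example.com/viewforum.php?f=1");
            Console.WriteLine(browser.Html.Length);
        }
    }
}

Because WatiN drives a real browser, you get the login, cookies, and any JavaScript for free; the trade-off is that it is much slower than raw HTTP requests.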
I recommend using HTML Agility Pack.
Hi guys. I'll be rather brief if I can so here goes.
I made this app in C# that goes onto my employee portal and automatically gets my shifts for me every 30 minutes by using a WebBrowser control; it then reads the HTML from that, generates a calendar for me, and also provides automated alerts.
Issue
Problem is that this web browser control uses IE (yeesh, help me) and it doesn't work with all parts of the site. I have done some digging around and found where the ASP site gets its data from: an XML file somewhere on the server. I can access this XML file, but only if I'm logged in (please see the attached images for more information).
Current solution
So my question is this: How do I actually log in to this area?
I could log in using the WebBrowser control and then download the XML through it, but it's too slow and too old, so is there a way I can pass my credentials through directly?
The URL is like this: "https://www.mycoles.com.au/api/rosters/nextweek" -- I don't see anything like ?name=myname&pass=mypassword... so yeah. (I'm a bit new.)
Further details:
Application language: C#.
Current technology: Windows Forms application / IE WebBrowser control.
Site backend: Microsoft SharePoint.
Anything I'm missing? Please ask..?
Attached content
Mycoles XML Logged in
Mycoles XML Access Denied
Update:
After a while of searching and examining the site, I tried to access the data with a C# WebBrowser control and it didn't work. It said it couldn't download the data, yet Chrome is able to. Odd. I'm not sure it's a plain XML file anymore, rather an API request, and I don't have enough knowledge to work with this, so pointers, anyone? Check out https://www.mycoles.com.au/api/rosters/nextweek and tell me what you think it is, please. Thanks in advance... :)
SharePoint supports different forms of authentication. Out of the box, Active Directory-based single sign-on is provided, and forms-based (username, password) authentication can be configured.
Typically, organizations use AD SSO for its simplicity. If, once you open your desktop browser and navigate to a SharePoint site, you don't have to enter any credentials and are just logged in, then it's most likely this case. This can be either Kerberos or NTLM. The HttpWebRequest class supports both these methods.
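If it is the AD/SSO case, a minimal sketch of fetching the XML endpoint from the question with the current Windows logon could look like this (whether that endpoint actually accepts NTLM/Kerberos is an assumption; if the site uses forms-based or federated login instead, you would need to replay that login and carry its cookies):

using System;
using System.IO;
using System.Net;

class SharePointFetch
{
    static void Main()
    {
        var request = (HttpWebRequest)WebRequest.Create(
            "https://www.mycoles.com.au/api/rosters/nextweek");

        // Use the current Windows logon (Kerberos/NTLM). For an explicit account,
        // swap in: new NetworkCredential("user", "pass", "DOMAIN")
        request.Credentials = CredentialCache.DefaultNetworkCredentials;
        request.PreAuthenticate = true;

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            string xml = reader.ReadToEnd();
            Console.WriteLine(xml);
        }
    }
}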
I'm developing software in C# which has to get info from a website the user opens in Chrome; the user has to input some data and then the website returns a list of different items.
What I want is a way to access the source code of the page in order to get the info. I can't open the page myself, as it doesn't show anything because I didn't input any data, so I need to get it directly from Chrome.
How can I achieve this? A Chrome extension? Or can I access Chrome directly from my software?
Off the top of my head, I don't know of any application that gets data directly from an open instance of Chrome. You'd have to write your own Chrome extension.
Alternatively, you can open the web browser from your application initially.
You can look into these libraries for doing so:
WatiN (my personal favourite)
Selenium
Awesomium (you'd have to roll your own UI; it's invisible)
CEF
Essential Objects Web Browser
EDIT: I didn't think about using QA tools as the actual browser hook, as @TheAnathema mentions. That would probably work for your needs.
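As a sketch of the "open the browser from your application" route, driving Chrome with Selenium and reading the rendered source might look roughly like this (the URL and element ids are placeholders; it needs the Selenium.WebDriver package and a matching chromedriver on the PATH):

using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

class ChromeScrape
{
    static void Main()
    {
        using (IWebDriver driver = new ChromeDriver())
        {
            driver.Navigate().GoToUrl("https://example.com/search");

            // Placeholder element ids -- inspect the real page for the actual ones.
            driver.FindElement(By.Id("query")).SendKeys("the user's input");
            driver.FindElement(By.Id("submit")).Click();

            // Rendered HTML after the interaction, ready for parsing.
            string html = driver.PageSource;
            Console.WriteLine(html.Length);
        }
    }
}

The trade-off is that this is a Chrome instance your application launched, not the tab the user already has open; if it must be the user's existing session, you are back to writing a Chrome extension.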
You're going to need to create it as a Chrome extension if you must depend on the user actually going to a specific web site (i.e. not being able to make the requests yourself with either Selenium or standard web requests).
The reason a Chrome extension would be required: think of how bad it would be if any software could easily read the pages you browse. Banking, medical, email, etc. could all be read silently by any process if Google allowed outside processes to tap into the web page.
Even Chrome extensions have to ask for permission to do what they do, but at least it is software the user knowingly installed and whose permissions they agreed to.
A quick search yielded this example of modifying a page's HTML with a Chrome extension: https://blog.lateral.io/2016/04/create-chrome-extension-modify-websites-html-css/
It sounds like you want to do web scraping. Here's a good tutorial to get you started: HTML Scraping.
And this answer has a good example of how to scrape data from a website where you need to submit a form to get access to the data.
Hey guys, I'm trying to create a website that can help a user purchase items from other websites. What would be the best way to go about doing this?
I know most of the sites I'm targeting send their information using a form POST, but I'm having trouble finding the exact POST request in Fiddler (I'm assuming it's encrypted?), and I know that a lot of the sites require login credentials, so that complicates things a bit.
Is there any way I could use WebKit or something to handle all the HTTP stuff and just pass JavaScript to fill in the forms? Or is there an even simpler way to create the proper POST requests and use a WebRequest?
Thank you!
1) Get permission.
2) Use their published API.
If the sites do not have an API but allow you to use their server process, copy their forms to your site and POST them. You can post from your server with credentials using, for example, cURL.
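In C#, that server-side POST (the cURL equivalent) could be sketched with HttpClient roughly as below; the URLs and form field names are placeholders copied from whatever the target site's forms actually use, and a CookieContainer carries the login session between the two requests:

using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

class ServerSidePost
{
    static async Task Main()
    {
        // The CookieContainer keeps the session cookie returned by the login POST.
        var cookies = new CookieContainer();
        var handler = new HttpClientHandler { CookieContainer = cookies };

        using (var client = new HttpClient(handler))
        {
            // Placeholder URL and field names -- copy them from the site's login form.
            var login = new FormUrlEncodedContent(new Dictionary<string, string>
            {
                ["username"] = "user",
                ["password"] = "pass"
            });
            await client.PostAsync("https://shop.example.com/login", login);

            // Placeholder order form, posted with the authenticated session.
            var order = new FormUrlEncodedContent(new Dictionary<string, string>
            {
                ["item_id"] = "12345",
                ["quantity"] = "1"
            });
            var response = await client.PostAsync("https://shop.example.com/cart/add", order);
            Console.WriteLine(response.StatusCode);
        }
    }
}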
Usually shopping carts and credit-card transactions use SSL and you have to log in to the site, so I don't think it's so simple to bridge with JavaScript or a simple web request.
There's no standard, simple way to do this!
You're heading for a world of hurt.
First, you should check if what you're trying to do is legal. Does the web site allow "proxy orders"? Or are they forbidden by their EULA?
Second, you'll have to handle the user's confidential data (username, password, credit card number), and especially credit card numbers are calling for troubles.
Third, how are you planning to implement payment methods like PayPal? You're going to collect the user's PayPal credentials in order to make payments on their behalf? (See point number two if answer is yes.)
Fourth, since you have to fake HTTP requests, your tool will break as soon as the web site changes a single field. How are you planning to handle this?
Or are you trying to automate only the first steps of the order and not the payment?
I'm trying to automate the download of a file from a website. Normally, to download the file, I log in with a username and password, navigate to a particular screen, then click a button.
I've been trying to watch the sequence of POSTs using Chrome's developer tools and then replicate all the steps with the .NET WebClient class, but with no success. I've derived from the WebClient class and added cookie handling, which seems to be working. I go to the login page and post using WebClient.UploadValues. About half the time it seems to work. The next step appears to be another POST to a reporting URL. Once again I use WebClient.UploadValues, but the response from the server is a page showing an internal error.
I have a couple of questions.
1) Are there better tools than hand-coding C# to replicate a bunch of web browser interactions? I really only care about being able to download the file at a particular time each day onto a Windows box.
2) The WebClient does not seem to be the best class to use for this. Perhaps it's a bit too simplistic. I tried using HttpWebRequest, but I couldn't see how to encode the POST requests with it. Any other recommendations?
3) Although Chrome's developer tools appear to show all the interaction, I find them a bit cumbersome to use. I'd be interested in seeing all of the raw communication (unencrypted, though; the site is only accessed via HTTPS) so I can see if I'm really replicating all of the steps.
I can even post the exact code I'm using. The site I'm pulling data from, specifically, is the Standard and Poor's website. They have the ability to create custom reports for downloading historical data, which I need for reporting, not republishing.
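For reference, the "WebClient with cookie handling" mentioned above is usually built roughly like this (the URLs and form field names here are placeholders, not the real S&P ones):

using System;
using System.Collections.Specialized;
using System.Net;

// A WebClient that carries one CookieContainer across requests, so the session
// cookie set by the login POST is sent with every later request.
class CookieAwareWebClient : WebClient
{
    private readonly CookieContainer cookies = new CookieContainer();

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);
        var http = request as HttpWebRequest;
        if (http != null)
            http.CookieContainer = cookies;
        return request;
    }
}

class Program
{
    static void Main()
    {
        using (var client = new CookieAwareWebClient())
        {
            // Placeholder field names -- copy them from the real login form.
            var login = new NameValueCollection();
            login.Add("user", "myName");
            login.Add("pass", "myPassword");
            client.UploadValues("https://example.com/login", login);   // POST by default

            // Same client, same cookies: this request runs as the logged-in user.
            byte[] report = client.DownloadData("https://example.com/reports/custom.csv");
            Console.WriteLine(report.Length);
        }
    }
}

If the login form also contains hidden fields (anti-forgery tokens, view state), they have to be scraped from the login page and included in the POST as well; that is a common reason a replayed login only works some of the time.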
Using IE to download the file would be much easier than writing C# / Perl / Java code to replicate the HTTP requests.
The reason is that even a slight change in the site's JavaScript can break the flow.
With IE, you can automate it via COM. The following VBA example opens IE and performs a Google search:
Sub Search_Google()
    Dim IE As Object
    Set IE = CreateObject("InternetExplorer.Application")
    IE.Navigate "http://www.google.com"   'load web page google.com
    While IE.Busy
        DoEvents                          'wait until IE is done loading the page
    Wend
    IE.Document.all("q").Value = "what you want to put in the text box"
    IE.Document.all("btnG").Click
    'clicks the button named "btnG", which is Google's "Google Search" button
    While IE.Busy
        DoEvents                          'wait until IE is done loading the page
    Wend
End Sub
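Translated to the C# context of this thread, the same COM automation might look roughly like the sketch below. It assumes COM references to "Microsoft Internet Controls" (SHDocVw) and "Microsoft HTML Object Library" (mshtml), and the URLs and element ids are placeholders for the real login and report pages:

using System;
using System.IO;
using System.Threading;

class IeAutomation
{
    static void WaitForIe(SHDocVw.InternetExplorer ie)
    {
        while (ie.Busy)          // same idea as the While IE.Busy / DoEvents loop above
            Thread.Sleep(100);
    }

    static void Main()
    {
        var ie = new SHDocVw.InternetExplorer { Visible = true };
        ie.Navigate("https://example.com/login");              // placeholder URL
        WaitForIe(ie);

        // Placeholder element ids -- use the ones from the real login page
        // (if the page uses name attributes instead, use getElementsByName).
        var doc = (mshtml.IHTMLDocument3)ie.Document;
        ((mshtml.IHTMLInputElement)doc.getElementById("username")).value = "myUser";
        ((mshtml.IHTMLInputElement)doc.getElementById("password")).value = "myPass";
        ((mshtml.IHTMLElement)doc.getElementById("loginButton")).click();
        WaitForIe(ie);

        // Once logged in, the session lives inside this IE instance.
        ie.Navigate("https://example.com/reports/daily");      // placeholder URL
        WaitForIe(ie);

        var page = (mshtml.IHTMLDocument2)ie.Document;
        File.WriteAllText("report.html", page.body.outerHTML);

        ie.Quit();
    }
}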
3) Although Chrome's developer tools appear to show all the interaction, I find them a bit cumbersome to use. I'd be interested in seeing all of the raw communication (unencrypted, though; the site is only accessed via HTTPS) so I can see if I'm really replicating all of the steps.
For this you can use Fiddler to view all the interaction going on and the raw data going back and forth. To make it work with HTTPS you will need to install Fiddler's root certificate to enable decryption of the traffic.
I've tried to find some good how-tos or beginner-friendly examples for writing your first web crawler. I would like to write it in C#. Does anybody have good example code to share, or tips on sites where I can find info on C# and some basic web crawling?
Thanks
HtmlAgilityPack is your friend.
Yes, HtmlAgilityPack is a good tool to parse the HTML, but that alone is definitely not enough.
There are 3 elements to crawling:
1) Crawling itself, i.e. looping through web sites: this could be done by sending requests to random IP addresses, but it does not work well, since many websites are virtual-hosted on a shared IP address and selected by the HTTP Host header, so hitting the IP alone does not reach them. On top of that, far too many IP addresses are unused or not hosting a web server, so this does not get you anywhere.
I suggest you send requests to Google (searching for words from a dictionary) and crawl the results that come back.
2) Rendering the content: many websites generate their HTML content with JavaScript when the page loads, so a simple request will not capture the content the way a user sees it. You need to render the page as a browser does, which can be done with WebKit.NET, an open-source tool (still in beta at the time of writing).
3) Comprehending and parsing the HTML: use HtmlAgilityPack; there are tons of examples online. It can also be used to find the links for crawling the site, as in the sketch below.
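A minimal sketch of points 1 and 3 together -- fetch a page, parse it with HtmlAgilityPack, print its title, and queue the links it contains for further crawling (the seed URL is a placeholder, and there is no robots.txt or politeness handling here):

using System;
using System.Collections.Generic;
using System.Net;
using HtmlAgilityPack;

class MiniCrawler
{
    static void Main()
    {
        var pending = new Queue<string>();
        var seen = new HashSet<string>();
        pending.Enqueue("http://example.com/");          // placeholder seed URL

        using (var client = new WebClient())
        {
            while (pending.Count > 0 && seen.Count < 50) // hard cap for the sketch
            {
                string url = pending.Dequeue();
                if (!seen.Add(url))
                    continue;                            // already visited

                string html;
                try { html = client.DownloadString(url); }
                catch (WebException) { continue; }       // skip pages that fail to load

                var doc = new HtmlDocument();
                doc.LoadHtml(html);
                var title = doc.DocumentNode.SelectSingleNode("//title");
                Console.WriteLine("{0} -> {1}", url, title == null ? "(no title)" : title.InnerText);

                // Queue every link found on the page, resolved against the current URL.
                var links = doc.DocumentNode.SelectNodes("//a[@href]");
                if (links == null)
                    continue;
                foreach (HtmlNode link in links)
                {
                    string href = link.GetAttributeValue("href", "");
                    Uri absolute;
                    if (Uri.TryCreate(new Uri(url), href, out absolute))
                        pending.Enqueue(absolute.ToString());
                }
            }
        }
    }
}

For point 2 (JavaScript-rendered pages), the DownloadString call would have to be replaced by something that actually runs the page, e.g. WebKit.NET as mentioned above or a browser-automation tool.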
A while ago I also wanted to write a custom web crawler, and found this document:
Web Crawler
It has some great info, and is very well written IMO.