C# data scraping from websites

Hi, I am pretty new to the C# sphere; I've been working in PHP and JavaScript since the beginning of this year. I want to scrape posts and comments from a blog. The site is http://www.somewhereinblog.net
What I want to do is:
1. Log in from my own program
2. Download the HTML
3. Use regular expressions, XPath, or whatever comes in handy to separate the contents of posts and comments
I've been searching all over and understood very little, though I am quite sure I need to use Html Agility Pack. I don't know how to add a library to a C# console or WinForms application. Can someone give me some help? I badly need this, and I've only been into C# for a week, so I would be grateful for some detailed information. Waiting eagerly.
Thanks in advance, brothers.

Using WebClient you can log in and download the HTML.
Instead of Html Agility Pack I like CsQuery, because it lets you use jQuery syntax inside a string in C# code: you can download the HTML to a string, then search it and manipulate it much as you would with jQuery on an HTML page.
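Here is a minimal sketch of the WebClient approach. The login URL and form field names are assumptions; inspect the blog's actual login form for the real values. It parses with Html Agility Pack (installed via NuGet), and the comment selector is likewise made up:

```csharp
using System;
using System.Collections.Specialized;
using System.Net;
using HtmlAgilityPack; // NuGet: Install-Package HtmlAgilityPack

// WebClient does not persist cookies between requests, so this subclass
// attaches one shared CookieContainer to every request it creates.
class CookieAwareWebClient : WebClient
{
    public CookieContainer Cookies { get; } = new CookieContainer();

    protected override WebRequest GetWebRequest(Uri address)
    {
        var request = base.GetWebRequest(address);
        if (request is HttpWebRequest http)
            http.CookieContainer = Cookies;
        return request;
    }
}

class Program
{
    static void Main()
    {
        using (var client = new CookieAwareWebClient())
        {
            // Hypothetical login endpoint and field names -- inspect the
            // site's login <form> to find the real action URL and inputs.
            var form = new NameValueCollection
            {
                { "username", "myUser" },
                { "password", "myPass" }
            };
            client.UploadValues("http://www.somewhereinblog.net/login", form);

            // The session cookie is stored now, so this request is authenticated.
            string html = client.DownloadString("http://www.somewhereinblog.net/some-post");

            // Parse with Html Agility Pack instead of regular expressions.
            var doc = new HtmlDocument();
            doc.LoadHtml(html);
            var comments = doc.DocumentNode.SelectNodes("//div[@class='comment']");
            if (comments != null)
                foreach (var node in comments)
                    Console.WriteLine(node.InnerText.Trim());
        }
    }
}
```

Without the cookie-aware subclass, the session established by the login would be lost on the very next request, which is the most common reason this kind of scraper "logs in" but still sees the logged-out page.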

Related

C# Using HTTPClient to 'Navigate' a Website

So I am just beginning to learn C#, and one of my main goals is to be able to 'navigate' a website. I have done minimal research and have found that the two primary ways to do this would be HttpClient and Requests, and I would like to learn this through HttpClient.
Now what I mean by navigate is to essentially bot a website for practice. This is like clicking buttons, putting text into fields, etc.
If anyone can give me an idea on where to start with this it would be much appreciated! Not looking for code specifically, just looking for what I should learn in HTTPClient to make this happen. Thanks!
I think you are a little confused about the concepts. HttpClient sends requests to a site, but you cannot click buttons or "navigate" inside the site with it.
If you're looking for a way to test a site, I recommend learning about cypress.io. You can type into textboxes, click buttons, or navigate around any site, all with a few lines of JavaScript. It's free.
Otherwise, if you need to save values to a database depending on your "navigation", you should research browser-automation and scraping tools. I recommend Selenium or any other similar tool.
Usually HttpClient is used when you have to consume a REST API.
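As a rough sketch of the Selenium route mentioned above (assuming the Selenium.WebDriver NuGet package and a matching ChromeDriver are installed; the URL and element names are hypothetical):

```csharp
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

class SeleniumSketch
{
    static void Main()
    {
        using (IWebDriver driver = new ChromeDriver())
        {
            driver.Navigate().GoToUrl("https://example.com/login");

            // "Putting text into fields" -- the element names are made up.
            driver.FindElement(By.Name("username")).SendKeys("myUser");
            driver.FindElement(By.Name("password")).SendKeys("myPass");

            // "Clicking buttons".
            driver.FindElement(By.CssSelector("button[type=submit]")).Click();

            // Read something off the page that loads next.
            System.Console.WriteLine(driver.FindElement(By.TagName("h1")).Text);
        }
    }
}
```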
Basically you have to think about how a program could 'see' a website. You cannot expect to say to the HttpClient: 'Open page www.google.com and search for something.' If you want to do this programmatically, you have to specify exactly what your program should do.
For your purpose I recommend the Html Agility Pack. It can be used to get at the navigation elements of an HTML document; this way you can parse the HTML delivered by a website into your program and do further work with it.
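To make that concrete, here is a minimal sketch: "clicking" a button really means sending the HTTP request its form would send, and "navigating" means parsing the returned HTML to find the next URL. The endpoint and field name below are placeholders:

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack; // NuGet: Install-Package HtmlAgilityPack

class Program
{
    static async Task Main()
    {
        // Cookies are what make a sequence of stateless requests feel like a session.
        var handler = new HttpClientHandler { UseCookies = true };
        using (var client = new HttpClient(handler))
        {
            // "Clicking a submit button" = posting the form the button belongs to.
            var form = new FormUrlEncodedContent(new Dictionary<string, string>
            {
                ["q"] = "search text"
            });
            var response = await client.PostAsync("https://example.com/search", form);
            string html = await response.Content.ReadAsStringAsync();

            // "Navigating" = parsing the returned HTML to choose the next URL.
            var doc = new HtmlDocument();
            doc.LoadHtml(html);
            var links = doc.DocumentNode.SelectNodes("//a[@href]");
            if (links != null)
                foreach (var a in links)
                    Console.WriteLine(a.GetAttributeValue("href", ""));
        }
    }
}
```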
Kind regards :)

Dummies guide to making dropzone.js work?

I'm trying to implement a drag-and-drop file upload on my website and found a piece of java called Dropzone.js. It has everything I need ... I just have no idea how to use it!
So far I've been programming only in Razor ASP.NET (C#) / HTML / CSS, but no JavaScript / jQuery yet.
It's Razor Web Pages, so no MVC.
Due to my current lack of knowledge in java, I apologize in advance, but I'm stuck!
What I'd like to know is:
If I've understood things correctly, I should not modify Dropzone.js directly; I should use it as a library and integrate it with my other scripts, e.g. in another .js file. Correct?
Any help with this will be greatly appreciated.
Kind regards,
Daniel A. Rischel
Edited as requested.
Well, you're somewhat mistaken. It might not be related to the question, but please note that these scripts are JavaScript, and
java != javascript
Got the point? :) Please keep this in mind in the future. Also, Java code cannot be added to these pages at all, because Java is an entirely different language. What you can use is JavaScript or one of its libraries (such as jQuery) to create plugins.
If I've understood things correctly, I should not modify Dropzone.js directly; I should use it as a library and integrate it with my other scripts, e.g. in another .js file. Correct?
True. You need to link the dropzone.js plugin to the page using
<script src="/link/to/dropzone.js"></script>
The script is then available throughout any page where it is linked.
And you should never edit the source code of a plugin unless you know exactly what you're doing, because you might break it.
I'd really like to see some source code examples of what a single-file drag-and-drop upload form would look like. I've googled extensively but not found something I can use yet.
Did you try reading their documentation? There is a good code explanation on the main page of their website too. Really, the look depends on how you style it: it's just a plugin, so you will need to add your own CSS to style the form. The plugin won't style the form for you; all it does is handle the upload events and report back the result. Form creation and its elements are entirely up to you!
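On the server side, Dropzone posts each dropped file as an ordinary multipart form upload, so a Razor Web Pages page can receive it like any other file input. A minimal sketch, where the page name, folder, and the idea of pointing Dropzone's url option at it are all assumptions:

```cshtml
@* UploadHandler.cshtml -- point Dropzone's "url" option at this page. *@
@using System.IO
@{
    // Dropzone sends one ordinary multipart POST per dropped file,
    // so Request.Files picks it up like a normal <input type="file">.
    if (IsPost && Request.Files.Count > 0)
    {
        var folder = Server.MapPath("~/App_Data/Uploads");
        Directory.CreateDirectory(folder); // no-op if it already exists

        var uploaded = Request.Files[0];
        uploaded.SaveAs(Path.Combine(folder, Path.GetFileName(uploaded.FileName)));
    }
}
```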
If there is a dummies' tutorial on making this work, I'd really like to be pointed in the right direction!
There is no perfect answer to give you; however, here are some basics:
http://www.javascriptoo.com/dropzone-js
https://stackoverflow.com/questions/tagged/dropzone.js?sort=newest&pageSize=15 (Stack Overflow tag)
In the second link you can see the very basic issues people hit and the code that handles them. The easy parts you'll handle yourself; you'll only need guidance for the hard jobs. Good luck with that!
My suggestion is to first understand and learn jQuery, then learn how to make AJAX calls. After that you'll know how to create and handle the events. The basic code would be
$('input[type=file]').change(function () { // fires when the selected file changes
    // send the file via AJAX (e.g. wrap it in a FormData object)
});
http://jquery.com
This approach is best because you will know exactly what code you're using and what it does. Using the plugin is the easier method compared to this one, but my preference is the second: creating your own.

jQuery + C# code to resolve URLs from user-supplied text

I'd like to add some kind of simple URL resolution and formatting to my C# and jQuery-based ASP.NET web application. I currently allow users to add simple text-based descriptions to items and leave simple comments ('simple' as in I only allow plain text).
What I need to support is the ability for a user to enter something like:
Check out this cool link: http://www.really-cool-site.com
...and have the URL above automagically turned into a clickable link... kinda like the way the editor on Stack Overflow works, except that we don't want to support BBCode or any of its variants. The user experience would actually be more like the way Facebook resolves user-generated URLs.
What are some jQuery + C# solutions I should consider?
There's another question with a solution that might help you; it uses a regex in pure JS.
Personally, though, I would do it server-side when the user submits the text. That way you only need to do it once, rather than every time you display that text. You could use a similar regex in C#.
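A minimal server-side sketch of that idea. The pattern is deliberately naive, since real-world URL matching has many edge cases, and the input is HTML-encoded first so user text cannot inject markup (the accepted approach below does the same job with AntiXSS):

```csharp
using System.Text.RegularExpressions;
using System.Web; // reference System.Web for HttpUtility

static class Linkifier
{
    // Turn bare http(s) URLs in plain text into anchor tags.
    public static string Linkify(string input)
    {
        // Encode first so user-supplied text cannot inject markup.
        string safe = HttpUtility.HtmlEncode(input);

        // Naive URL pattern, for illustration only.
        return Regex.Replace(
            safe,
            @"(https?://[^\s<]+)",
            "<a href=\"$1\" rel=\"nofollow\">$1</a>");
    }
}
```

Calling Linkifier.Linkify("Check out this cool link: http://www.really-cool-site.com") then returns the same text with the URL wrapped in an anchor tag.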
I ended up using server-side C# code to do the linkification. I use an AJAX-jQuery wrapper to call into a PageMethod that does the work.
The PageMethod both linkifies and sanitizes the user-supplied string, then returns the result.
I use the Microsoft Anti-Cross Site Scripting Library (AntiXSS) to sanitize:
http://www.microsoft.com/download/en/details.aspx?id=5242
And I use C# code I found here and there to resolve and shorten links using good olde string parsing and regular expressions.
My method is not as cool as the way Facebook does it in real time, but at least now my users can add links to their descriptions and comments.

Grab details from web page

I need to write C# code to grab the contents of a web page. The steps look like the following:
1. Browse to the login page.
2. I have a user name and a password; provide them programmatically and log in.
3. Then you are on the detail page.
4. You have to get some information there, like product ID, description, etc.
5. Then you need to click (by code) on Detail View.
6. Then you can get the price for that product from there.
7. Now it is done, so we can write a detail line into a text file like this:
ABC Printer::225519::285.00
Please help me with this. (Even VB.NET code is OK; I can convert it to C#.)
The WatiN library is probably what you want, then. Basically, it controls a web browser (native support for IE and Firefox, I believe, though they may have added more since I last used it) and provides an easy syntax for programmatically interacting with page elements within that browser. All you'll need are the names and/or IDs of those elements, or some unique way to identify them on the page.
You should be able to achieve this using the WebRequest class to retrieve pages, and the HTML Agility Pack to extract elements from HTML source.
Yeah, I downloaded that library. Nice one.
Thanks for sharing it with me, but I have an issue with it: the site I want to get data from has a CAPTCHA on the login page.
I could enter that value if the code could show me the image and wait for my input.
Can we achieve that with this library? If so, I'd like to see a sample.
You should be able to achieve this by using two classes in C#: HttpWebRequest (to request the web pages) and perhaps XmlTextReader (to parse the HTML/XML response).
If you do not wish to use XmlTextReader, then I'd advise looking into regular expressions, as they are fantastically useful for extracting information from large bodies of text in which patterns exist.
How to: Send Data Using the WebRequest Class
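Putting the pieces above together, here is a minimal sketch of the WebRequest + Html Agility Pack approach, following the numbered steps in the question. Every URL, form field, and selector is a placeholder for the real site's values:

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;
using HtmlAgilityPack; // NuGet: Install-Package HtmlAgilityPack

class Program
{
    static void Main()
    {
        // Share one cookie container so the login session carries over
        // to the detail-page request.
        var cookies = new CookieContainer();

        // Steps 1-2: post the credentials (URL and field names are placeholders).
        var login = (HttpWebRequest)WebRequest.Create("https://example.com/login");
        login.Method = "POST";
        login.ContentType = "application/x-www-form-urlencoded";
        login.CookieContainer = cookies;
        byte[] body = Encoding.UTF8.GetBytes("user=myUser&pass=myPass");
        using (var stream = login.GetRequestStream())
            stream.Write(body, 0, body.Length);
        login.GetResponse().Close();

        // Steps 3-6: "clicking" Detail View by code means requesting the URL
        // behind that link, with the authenticated session attached.
        var detail = (HttpWebRequest)WebRequest.Create("https://example.com/product/225519/detail-view");
        detail.CookieContainer = cookies;
        string html;
        using (var reader = new StreamReader(detail.GetResponse().GetResponseStream()))
            html = reader.ReadToEnd();

        // Pull the fields out with Html Agility Pack (selectors are made up;
        // inspect the real page to find the right ones).
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        string name  = doc.DocumentNode.SelectSingleNode("//h1[@class='name']")?.InnerText.Trim();
        string id    = doc.DocumentNode.SelectSingleNode("//span[@class='id']")?.InnerText.Trim();
        string price = doc.DocumentNode.SelectSingleNode("//span[@class='price']")?.InnerText.Trim();

        // Step 7: append the detail line in the Name::Id::Price format.
        File.AppendAllText("products.txt", $"{name}::{id}::{price}{Environment.NewLine}");
    }
}
```

Note that this request-level approach cannot display a CAPTCHA image for you to solve interactively; that is exactly where a browser-driving library such as WatiN has the advantage.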

Where to create an RSS feed for dynamic website

I'm currently creating a website a little bit like Digg.com. There are different categories like "Technology", "Sports", etc. I want to create RSS feeds for my website, and while doing research on this I have questions I can't find the answers to.
First, this is what I have:
- I have .NET code in C# that creates a file with the last 15 news items from a database query.
What I need to know:
- Does the RSS feed (the XML file) need to be regenerated on each page load? (I saw that on some tutorial pages, but maybe it was only for educational purposes.) Personally, I'm thinking about regenerating the .xml file each time someone submits something new. Is this a good idea?
- Do I need to create a different file for each category, for example feedSports.xml, feedTechnology.xml, etc.? Or is there another way? (I saw something about channels.)
- What does FeedBurner do with all of this?
Thanks a lot for your help. I know these must be very newbie questions, which is why I can't find anything answering them clearly on Google.
DarkJaf
Your feeds would be generated just as your HTML pages are generated, on each request; but instead of outputting HTML, you would output RSS.
I probably would not make a file for each feed, though it is certainly possible. A better approach may be to pass a variable via GET or POST to the page generating the RSS and grab the data that pertains to the variable passed. You can most likely reuse the same logic you use to generate your HTML news lists if your code is well isolated.
I would also take a look at the article posted by Raj. C# has a nice namespace (System.ServiceModel.Syndication) that contains objects that make the job pretty easy.
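A minimal sketch with System.ServiceModel.Syndication. The URLs and the category parameter are placeholders, and a real handler would write to the HTTP response with content type application/rss+xml instead of Console.Out:

```csharp
using System;
using System.Collections.Generic;
using System.ServiceModel.Syndication; // in the System.ServiceModel assembly
using System.Xml;

class FeedDemo
{
    static void Main()
    {
        // In practice, loop over the last 15 rows of your news query here,
        // filtered by a ?category=... value from the query string.
        var items = new List<SyndicationItem>
        {
            new SyndicationItem(
                "Sample headline",
                "Sample summary text",
                new Uri("http://example.com/news/42"))
        };

        var feed = new SyndicationFeed(
            "My Site - Sports",
            "Latest Sports stories",
            new Uri("http://example.com/rss?category=Sports"),
            items);

        // Serialize as RSS 2.0; one page can serve every category this way.
        using (var writer = XmlWriter.Create(Console.Out))
            new Rss20FeedFormatter(feed).WriteTo(writer);
    }
}
```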
Have fun!
Nick
nickgs.com
