How to fetch webpage title and images from URL? - c#

I want to fetch a website's title and images from a URL, the way facebook.com does.
How can I get the images and the website title from a third-party link?

Use the Html Agility Pack. This is sample code to get the title:
using System;
using HtmlAgilityPack;
protected void Page_Load(object sender, EventArgs e)
{
string url = @"http://www.veranomovistar.com.pe/";
System.Net.WebClient wc = new System.Net.WebClient();
HtmlDocument doc = new HtmlDocument();
doc.Load(wc.OpenRead(url));
var metaTags = doc.DocumentNode.SelectNodes("//title");
if (metaTags != null)
{
string title = metaTags[0].InnerText;
}
}
If anything is unclear, post a comment.

At a high level, you just need to send a standard HTTP request to the desired URL. This will get you the site's markup. You can then inspect the markup (either by parsing it into a DOM object and then querying the DOM, or by running some simple regexp's/pattern matching to find the things you are interested in) to extract things like the document's <title> element and any <img> elements on the page.
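The pattern-matching route described above could look roughly like the sketch below. It is only an illustration, not a robust HTML parser: the class name PageScraper and its method names are made up for this example, the markup is assumed to be already downloaded into a string, and the regular expressions only handle quoted src attributes. For messy real-world pages a DOM parser is the safer choice.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of any library.
public static class PageScraper
{
    // Returns the text of the first <title> element, or null if there is none.
    public static string ExtractTitle(string html)
    {
        Match m = Regex.Match(html, @"<title[^>]*>(.*?)</title>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return m.Success ? WebUtility.HtmlDecode(m.Groups[1].Value).Trim() : null;
    }

    // Returns the absolute URL of every <img ... src="..."> found in the markup.
    public static List<string> ExtractImageUrls(string html, Uri baseUri)
    {
        var urls = new List<string>();
        foreach (Match m in Regex.Matches(html,
            @"<img\b[^>]*?\ssrc\s*=\s*[""']([^""']+)[""']", RegexOptions.IgnoreCase))
        {
            // Resolve relative paths such as "/img/a.png" against the page URL.
            Uri absolute;
            if (Uri.TryCreate(baseUri, WebUtility.HtmlDecode(m.Groups[1].Value), out absolute))
                urls.Add(absolute.AbsoluteUri);
        }
        return urls;
    }
}
```

You would pass in the string you got from the HTTP request together with the page's own URL, then download each returned image URL with a further request.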

Off the top of my head, I'd use an HttpWebRequest to go get the page and parse the title out myself, then use further HttpWebRequests in order to go get any images referenced on the page. There's a darn good chance though that there's a better way to do this and somebody will come along and tell you what it is. If not, it'd look something like this:
HttpWebResponse response = null;
try
{
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("<your URL here>");
response = (HttpWebResponse)request.GetResponse();
Stream responseStream = response.GetResponseStream();
StreamReader reader = new StreamReader(responseStream);
//use the StreamReader object to get the page data and parse out the title as well as
//getting locations of any images you need to get
}
catch
{
//handle exceptions
}
finally
{
if(response != null)
{
response.Close();
}
}
Probably the dumb way to do it, but that's my $0.02.

You just have to write it using JavaScript in the page's source body.
For example, if you are using a master page, you only have to write the code on the master page, and it is reflected on all the pages.
You can also use the image URL property in this script.
khan mohd faizan

Related

How to get the dynamic data of websites which has ajax calls in c#

Sorry if my question is not phrased properly; please edit it if required.
USE CASE
I want a C# function that finds a given string or text in a web page as soon as the page is updated with it.
Take https://www.worldometers.info/world-population/ as an example: I am trying to get the "40" data. When the page loads the "40" content, my function should stop. I think the AJAX calls are not being loaded.
public static void WaitForWebPageContent(string url,string text)
{
while (true)
{
string pageContent = null;
HttpWebRequest myReq = (HttpWebRequest)WebRequest.Create(url);
HttpWebResponse myres = (HttpWebResponse)myReq.GetResponse();
using (StreamReader sr = new StreamReader(myres.GetResponseStream()))
{
pageContent = sr.ReadToEnd();
}
if(pageContent.Contains(text))
{
Debug.WriteLine("Found it");
break;
}
}
}
My Question
I am looking for a method that I can call from my function to get the page content. Currently I am using httpclient to get the response from a URL, but the loop does not stop even when the content has loaded in the browser. I am sending the request continuously, every second, using the code above. Please guide me to a proper solution.
Thanks
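For what it's worth, here is a hedged sketch of the same loop with an explicit delay between requests and an overall timeout, so it cannot spin forever. The names Poller, WaitForContent and fetchPage are made up for this illustration; fetchPage stands for whatever code downloads the page. Keep in mind that text inserted by client-side script after the page loads is not part of the raw HTTP response, so this only helps when the text is present in the downloaded markup.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical helper for illustration only.
public static class Poller
{
    // Calls fetchPage repeatedly until the result contains the text or the timeout expires.
    // Returns true if the text was found, false if the timeout was reached first.
    public static bool WaitForContent(Func<string> fetchPage, string text,
        TimeSpan interval, TimeSpan timeout)
    {
        Stopwatch elapsed = Stopwatch.StartNew();
        while (true)
        {
            string pageContent = fetchPage();
            if (pageContent != null && pageContent.Contains(text))
                return true;

            // Give up if another wait would take us past the timeout.
            if (elapsed.Elapsed + interval > timeout)
                return false;

            Thread.Sleep(interval);
        }
    }
}
```

A call might pass a lambda that wraps the HttpWebRequest code from the question, an interval of one second, and a timeout of a few minutes.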

Get Page Main Content using the URL

I need to be able to get the main content of a page from a given URL.
A very good example of what I need to do is the following: http://embed.ly/docs/explore/preview?url=http%3A%2F%2Fedition.cnn.com%2F2012%2F08%2F20%2Fworld%2Fmeast%2Fflight-phobia-boy-long-way-home%2Findex.html%3Fiid%3Darticle_sidebar
I am using ASP.NET with C#.
Parsing HTML pages and guessing the main content is not an easy process. I would recommend using NReadability and HtmlAgilityPack.
Here is an example of how it could be done. After NReadability has transcoded the page, the main text is always in the div with id readInner.
string url = "http://.......";
var t = new NReadability.NReadabilityWebTranscoder();
bool b;
string page = t.Transcode(url, out b);
if (b)
{
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(page);
var title = doc.DocumentNode.SelectSingleNode("//title").InnerText;
var text = doc.DocumentNode.SelectSingleNode("//div[@id='readInner']")
.InnerText;
}
Man,
I guess it is implemented with the WebClient class or the WebRequest class. With either one you can download the full content of the page and then use a data mining algorithm to extract the information you want.
[]'s

How to post to asp.net validation required page with C# and read response

I am writing my own product-specific crawler. There is a product-selling website that uses POST data for its pages, and I really need to be able to post data and read the response. But the site uses ASP.NET validation and it is so messy that I could not figure out how to post the data and read the response properly. I am using HtmlAgilityPack. If it is possible to post data with HtmlAgilityPack and read the response, that would be really great.
This is the example page: http://www.hizlial.com/HizliListele.aspx?CatID=482643
When you open the page, look at the class "urun_listele".
You will see these options there:
20 Ürün Listele
40 Ürün Listele
60 Ürün Listele
Tümünü Listele
Those numbers are the product counts to display; "Tümünü Listele" means "list all products". I need to post data and get all of the products in that product category. I used Firebug to debug and tried the code below, but I still get the default number of products:
private void button11_Click(object sender, RoutedEventArgs e)
{
StringBuilder srBuilder = new StringBuilder();
AppendPostParameter(srBuilder, "ctl00$ContentPlaceHolder1$cmbUrunSayi", "full");
srBuilder = srBuilder.Replace("&", "", srBuilder.Length - 1, 1);
byte[] byteArray = Encoding.UTF8.GetBytes(srBuilder.ToString());
HttpWebRequest hWebReq = (HttpWebRequest)WebRequest.Create("http://www.hizlial.com/HizliListele.aspx?CatID=482643");
hWebReq.Method = "POST";
hWebReq.ContentType = "application/x-www-form-urlencoded";
using (Stream requestStream = hWebReq.GetRequestStream())
{
requestStream.Write(byteArray, 0, byteArray.Length);
}
HtmlDocument hd = new HtmlDocument();
using (HttpWebResponse response = (HttpWebResponse)hWebReq.GetResponse())
{
using (StreamReader sr = new StreamReader(response.GetResponseStream()))
{
var htmlstring = sr.ReadToEnd();
}
}
}
static private void AppendPostParameter(StringBuilder sb, string name, string value)
{
sb.AppendFormat("{0}={1}&", name, HttpUtility.UrlEncode(value));
}
After I get the data, I will load it into an HtmlAgilityPack HtmlDocument.
Any help is appreciated.
C# 4.0, WPF application, HtmlAgilityPack
ASP.NET uses the __EVENTTARGET and __EVENTARGUMENT fields to simulate Windows Forms behavior. To simulate the combo box's Change event on the server, you need to append two form fields to the request: __EVENTTARGET as 'ctl00$ContentPlaceHolder1$cmbUrunSayi' and __EVENTARGUMENT as ''.
If you look at the combo box's onchange code and the __doPostBack method, you will understand what I mean. You can insert the code below after your declaration of srBuilder. That way the code will work.
AppendPostParameter(srBuilder, "__EVENTTARGET", "ctl00$ContentPlaceHolder1$cmbUrunSayi");
AppendPostParameter(srBuilder, "__EVENTARGUMENT", string.Empty);
You will also need to extract the __VIEWSTATE and __EVENTVALIDATION values. To get them, send a dummy request, extract those values and the cookies from its response, and then append them to the new request.
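The extraction step could be sketched like this. It is only a sketch under assumptions: WebFormsPost, GetHiddenField and BuildFormBody are names invented for this example, and the regular expression assumes the hidden inputs are rendered with the name attribute before the value attribute and with double quotes, which you should check against the actual page source.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class WebFormsPost
{
    // Reads the value attribute of a hidden input such as __VIEWSTATE.
    // Assumes markup shaped like: <input type="hidden" name="X" id="X" value="..." />
    // Returns an empty string when the field is not found.
    public static string GetHiddenField(string html, string name)
    {
        Match m = Regex.Match(html,
            "<input[^>]*name=\"" + Regex.Escape(name) + "\"[^>]*value=\"([^\"]*)\"",
            RegexOptions.IgnoreCase);
        return m.Success ? WebUtility.HtmlDecode(m.Groups[1].Value) : string.Empty;
    }

    // Builds an application/x-www-form-urlencoded body from name/value pairs.
    public static string BuildFormBody(IEnumerable<KeyValuePair<string, string>> fields)
    {
        var parts = new List<string>();
        foreach (var field in fields)
            parts.Add(WebUtility.UrlEncode(field.Key) + "=" + WebUtility.UrlEncode(field.Value));
        return string.Join("&", parts);
    }
}
```

The idea is to GET the page once, read both hidden fields from that HTML, and include them in the POST body next to __EVENTTARGET, __EVENTARGUMENT and the combo box value. Using the same CookieContainer on both requests should carry the cookies across.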

Open webpage programmatically and retrieve its HTML content as a string

I have a Facebook account and I would like to extract my friends' photos and personal details such as "Date of birth", "Studied at" and so on. I am able to extract the address of the first Facebook page for each of my friends' accounts, but I don't know how to programmatically open each of those pages and save the HTML content as a string so that I can extract their personal details and photos. Please help! Thanks in advance!
You have three options:
1- Using a WebClient object.
WebClient webClient = new WebClient();
webClient.Credentials = new System.Net.NetworkCredential("UserName","Password", "Domain");
string pageHTML = webClient.DownloadString("http://url");
2- Using a WebRequest. This is the best solution because it gives you more control over your request.
WebRequest myWebRequest = WebRequest.Create("http://URL");
WebResponse myWebResponse = myWebRequest.GetResponse();
Stream ReceiveStream = myWebResponse.GetResponseStream();
Encoding encode = System.Text.Encoding.GetEncoding("utf-8");
StreamReader readStream = new StreamReader( ReceiveStream, encode );
string strResponse=readStream.ReadToEnd();
StreamWriter oSw=new StreamWriter(strFilePath);
oSw.WriteLine(strResponse);
oSw.Close();
readStream.Close();
myWebResponse.Close();
3- Using a WebBrowser (I bet you don't wanna do that)
WebBrowser wb = new WebBrowser();
wb.Navigate("http://URL");
string pageHTML = "";
wb.DocumentCompleted += (sender, e) => pageHTML = wb.DocumentText;
Excuse me if I mistyped any code; I improvised it and I don't have a syntax checker to verify it. But I think it should be fine.
EDIT: For facebook pages. You may consider using facebook Graph API:
http://developers.facebook.com/docs/reference/api/
Try this:
var html = new WebClient()
.DownloadString("the facebook account url goes here");
Also, once you have downloaded the HTML as a string I would highly recommend that you use the Html Agility Pack to parse it.
There are in general 2 things you can do here. The first thing you can do is called web scraping. That way you can download the source of the html with the following code:
var request = WebRequest.Create("http://example.com");
var response = request.GetResponse();
using (Stream responseStream = response.GetResponseStream())
{
StreamReader reader = new StreamReader(responseStream);
string stringResponse = reader.ReadToEnd();
}
stringResponse then contains the Html source of the website http://example.com
However, this is probably not what you want to do. Facebook has an SDK that you can use to download this kind of information. You can read about it on the following page:
http://developers.facebook.com/docs/reference/api/user/
If you want to use the Facebook API, then I think it's worth changing your question or asking a new one about it, since it's quite a bit more complicated and requires some authorization and other coding. However, it's the best way, since it's unlikely that your code is ever going to break and it protects the privacy of the people you want to get information from.
For example, if you query me with the api, you get the following string:
{
"id": "1089655429",
"name": "Timo Willemsen",
"birthday": "08/29/1989",
"education": [
{
"school": {
"id": "115091211836927",
"name": "Stedelijk Gymnasium Arnhem"
},
"year": {
"id": "127668947248449",
"name": "2001"
},
"type": "High School"
}
]
}
You can see that I'm Timo Willemsen, 21 years old, and studied at Stedelijk Gymnasium Arnhem in 2001.
Use selenium 2.0 for C#. http://seleniumhq.org/download/
var driver = new FirefoxDriver();
driver.Navigate().GoToUrl("http://www.google.com");
String pageSource = driver.PageSource;

get data from page that a url points to

Given a Url, I'd like to be able to capture the Title of the page this url points to, as well
as other info - eg a snippet of text from the first paragraph on a page? - maybe even an image from the page.
Digg.com does this nicely when you submit a url.
How could something like this be done in .Net c#?
You're looking for the HTML Agility Pack, which can parse malformed HTML documents.
You can use its HtmlWeb class to download a webpage over HTTP.
You can also download text over HTTP using .Net's WebClient class.
However, it won't help you parse the HTML.
You could try something like this:
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text;
namespace WebGet
{
class progMain
{
static void Main(string[] args)
{
// ContentLength can be -1 and a single Read call may return only part of the body,
// so read the whole response stream through a StreamReader instead.
WebRequest wrq = WebRequest.Create("http://localhost");
using (WebResponse wrp = wrq.GetResponse())
using (StreamReader reader = new StreamReader(wrp.GetResponseStream(), Encoding.ASCII))
{
Console.WriteLine(reader.ReadToEnd());
}
}
}
}
Once you have the buffer, you can process it looking for paragraph or image HTML tags to extract portions of the returned data.
You can extract the title of a page with a function like the following. You would need to modify the regular expression to look for, say, the first paragraph of text but since each page is different, that may prove difficult. You could look for a meta description tag and take the value from that, however.
public static string GetWebPageTitle(string url)
{
// Create a request to the url
HttpWebRequest request = HttpWebRequest.Create(url) as HttpWebRequest;
// If the request wasn't an HTTP request (like a file), ignore it
if (request == null) return null;
// Use the user's credentials
request.UseDefaultCredentials = true;
// Obtain a response from the server, if there was an error, return nothing
HttpWebResponse response = null;
try { response = request.GetResponse() as HttpWebResponse; }
catch (WebException) { return null; }
// Regular expression for an HTML title
string regex = @"(?<=<title.*>)([\s\S]*)(?=</title>)";
// If the correct HTML header exists for HTML text, continue
if (new List<string>(response.Headers.AllKeys).Contains("Content-Type"))
if (response.Headers["Content-Type"].StartsWith("text/html"))
{
// Download the page
WebClient web = new WebClient();
web.UseDefaultCredentials = true;
string page = web.DownloadString(url);
// Extract the title
Regex ex = new Regex(regex, RegexOptions.IgnoreCase);
return ex.Match(page).Value.Trim();
}
// Not a valid HTML page
return null;
}
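The meta description idea mentioned above could be sketched along the same lines. This is a rough illustration rather than a complete solution: MetaReader and GetMetaDescription are names made up for this example, the page is assumed to be already downloaded into a string, and the pattern assumes quoted attribute values with no ">" character inside the tag.

```csharp
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class MetaReader
{
    // Returns the content of the first <meta name="description" ...> tag, or null if absent.
    public static string GetMetaDescription(string html)
    {
        foreach (Match tag in Regex.Matches(html, @"<meta\b[^>]*>", RegexOptions.IgnoreCase))
        {
            // Skip meta tags that are not the description (keywords, charset, ...).
            if (!Regex.IsMatch(tag.Value, @"\bname\s*=\s*[""']description[""']",
                    RegexOptions.IgnoreCase))
                continue;

            // Accept the content attribute before or after name, in either quote style.
            Match content = Regex.Match(tag.Value,
                @"\bcontent\s*=\s*(?:""([^""]*)""|'([^']*)')", RegexOptions.IgnoreCase);
            if (content.Success)
            {
                string raw = content.Groups[1].Success
                    ? content.Groups[1].Value
                    : content.Groups[2].Value;
                return WebUtility.HtmlDecode(raw).Trim();
            }
        }
        return null;
    }
}
```

You could call it on the same page string that GetWebPageTitle downloads and fall back to the first paragraph only when it returns null.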
You could use Selenium RC (open source, www.seleniumhq.org) to parse data from the pages. It is a web test automation tool with a C# .NET library.
Selenium has a full API for reading specific items on an HTML page.
