Get data from the page that a URL points to - C#

Given a URL, I'd like to be able to capture the title of the page it points to, as well
as other info - e.g. a snippet of text from the first paragraph, and maybe even an image from the page.
Digg.com does this nicely when you submit a URL.
How could something like this be done in .NET with C#?

You're looking for the HTML Agility Pack, which can parse malformed HTML documents.
You can use its HtmlWeb class to download a webpage over HTTP.
You can also download text over HTTP using .NET's WebClient class;
however, it won't help you parse the HTML.
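For instance, a minimal sketch with the Agility Pack (the URL is a placeholder, and it assumes the HtmlAgilityPack package is referenced):

using HtmlAgilityPack;

var web = new HtmlWeb();
HtmlDocument doc = web.Load("http://example.com/");

// SelectSingleNode returns null when a node is missing, so guard with ?.
string title = doc.DocumentNode.SelectSingleNode("//title")?.InnerText;
string firstParagraph = doc.DocumentNode.SelectSingleNode("//p")?.InnerText;
string firstImageUrl = doc.DocumentNode.SelectSingleNode("//img")?.GetAttributeValue("src", "");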

You could try something like this:
using System;
using System.IO;
using System.Net;

namespace WebGet
{
    class ProgMain
    {
        static void Main(string[] args)
        {
            WebRequest wrq = WebRequest.Create("http://localhost");
            using (WebResponse wrp = wrq.GetResponse())
            using (StreamReader reader = new StreamReader(wrp.GetResponseStream()))
            {
                // Read the whole body: ContentLength can be -1 and a single
                // Stream.Read call isn't guaranteed to fill a buffer, so let
                // StreamReader handle the buffering and decoding.
                Console.WriteLine(reader.ReadToEnd());
            }
        }
    }
}
Once you have the page text, you can search it for paragraph or image HTML tags to extract portions of the returned data.
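For instance, a rough sketch of that tag hunting with Regex (pageHtml stands for the text read above; regex-on-HTML is brittle, so treat this as illustrative only):

using System.Text.RegularExpressions;

// First <p>...</p> body, if any
Match p = Regex.Match(pageHtml, @"<p[^>]*>(.*?)</p>",
    RegexOptions.IgnoreCase | RegexOptions.Singleline);
string firstParagraph = p.Success ? p.Groups[1].Value : null;

// src attribute of the first <img>, if any
Match img = Regex.Match(pageHtml, @"<img[^>]+src\s*=\s*[""']([^""']+)[""']",
    RegexOptions.IgnoreCase);
string firstImageUrl = img.Success ? img.Groups[1].Value : null;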

You can extract the title of a page with a function like the following. You would need to modify the regular expression to look for, say, the first paragraph of text, but since each page is different, that may prove difficult. Alternatively, you could look for a meta description tag and take its value.
public static string GetWebPageTitle(string url)
{
    // Create a request to the url
    HttpWebRequest request = WebRequest.Create(url) as HttpWebRequest;

    // If the request wasn't an HTTP request (like a file), ignore it
    if (request == null) return null;

    // Use the user's credentials
    request.UseDefaultCredentials = true;

    // Obtain a response from the server; if there was an error, return nothing
    HttpWebResponse response = null;
    try { response = request.GetResponse() as HttpWebResponse; }
    catch (WebException) { return null; }

    using (response)
    {
        // Regular expression for an HTML title (requires System.Text.RegularExpressions)
        string regex = @"(?<=<title.*>)([\s\S]*)(?=</title>)";

        // If the correct HTML header exists for HTML text, continue
        if (new List<string>(response.Headers.AllKeys).Contains("Content-Type"))
            if (response.Headers["Content-Type"].StartsWith("text/html"))
            {
                // Download the page
                WebClient web = new WebClient();
                web.UseDefaultCredentials = true;
                string page = web.DownloadString(url);

                // Extract the title
                Regex ex = new Regex(regex, RegexOptions.IgnoreCase);
                return ex.Match(page).Value.Trim();
            }

        // Not a valid HTML page
        return null;
    }
}
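Usage is a one-liner, and the meta-description fallback suggested above could be sketched along the same lines (page is the same downloaded string as in the function; the attribute order in the pattern is an assumption):

string title = GetWebPageTitle("http://example.com/");

// Hypothetical variant: pull content from <meta name="description" content="...">
string descPattern = @"<meta\s+name\s*=\s*[""']description[""'][^>]*content\s*=\s*[""']([^""']*)[""']";
Match m = Regex.Match(page, descPattern, RegexOptions.IgnoreCase);
string description = m.Success ? m.Groups[1].Value : null;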

You could use Selenium RC (open source, www.seleniumhq.org) to parse data from the pages. It is a web test automation tool with a C# .NET library.
Selenium has a full API for reading out specific items on an HTML page.
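Selenium RC is legacy these days; a minimal sketch with the current Selenium WebDriver C# bindings instead (assumes the Selenium.WebDriver package plus a local Firefox/geckodriver):

using OpenQA.Selenium;
using OpenQA.Selenium.Firefox;

using (IWebDriver driver = new FirefoxDriver())
{
    driver.Navigate().GoToUrl("http://example.com/");
    string title = driver.Title; // the page title, no parsing needed

    // Throws NoSuchElementException if the page has no <p> element
    string firstParagraph = driver.FindElement(By.TagName("p")).Text;
}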

Related

Kanji characters from WebClient html different from actual Kanji in website

So, I'm trying to get a portion of text from a website called Kanji-A-Day.com, but I have a problem.
I'm trying to get the daily kanji from the website, and I was able to narrow the HTML down to what I want, but it seems the characters come out different?
[Screenshots: what it looks like vs. what it should look like]
What's even stranger is that I produced the result in the second image by copying and pasting directly from the site, so it's not a font problem.
Here's the code I use for getting the character:
public void UpdateDailyKanji() // Called at the initialization of a new main form
{
    string kanji;
    using (WebClient client = new WebClient()) // Grab the string
        kanji = client.DownloadString("http://www.kanji-a-day.com/level4/index.php");

    // Trim the HTML to just the Kanji
    kanji = kanji.Remove(0, kanji.IndexOf(@"<div class=""glyph"">") + 19);
    kanji = kanji.Remove(kanji.IndexOf("</div>") - 2);
    kanji = kanji.Trim();

    Text_DailyKanji.Text = kanji; // Set the Kanji
}
Does anyone know what's going on here? I'm guessing it's some Unicode thing but I don't know much about it.
Thanks in advance.
The page you're trying to download as a string is encoded using charset=EUC-JP, also known as Japanese (EUC) (CodePage 51932). This is clearly set in the page headers.
Why is the string returned by WebClient.DownloadString encoded using the wrong encoder?
The MSDN Docs state this:
This method retrieves the specified resource. After it downloads the
resource, the method uses the encoding specified in the Encoding
property to convert the resource to a String.
Thus, you have to know beforehand what encoding will be used and specify it by setting the WebClient.Encoding property.
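For this page that fix is a single line before the download; a minimal sketch (assumes using System.Net and System.Text):

using (var client = new WebClient())
{
    // EUC-JP, matching the charset declared in the page's Content-Type header
    client.Encoding = Encoding.GetEncoding("EUC-JP");
    string kanji = client.DownloadString("http://www.kanji-a-day.com/level4/index.php");
}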
To verify this, check the .NET Reference Source for the WebClient.DownloadString method:
try {
    WebRequest request;
    byte[] data = DownloadDataInternal(address, out request);
    string stringData = GetStringUsingEncoding(request, data);
    if (Logging.On) Logging.Exit(Logging.Web, this, "DownloadString", stringData);
    return stringData;
} finally {
    CompleteWebClientState();
}
The encoding is taken from the Request settings, not the Response ones.
As a result, the downloaded string is decoded using the default code page.
What you can do now is:
- Download the page twice: the first time to check whether the WebClient encoding and the HTML page encoding match.
- Re-encode the string with the correct encoding, as set in the underlying WebResponse.
- Don't use WebClient; use HttpClient or WebRequest directly (see the sketch below). Or, if you like this tool, create a custom WebClient class to handle the WebRequest/WebResponse in a more direct way.
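For the HttpClient route, a minimal sketch: HttpClient's GetStringAsync decodes the body using the charset declared in the response's Content-Type header, so no re-encoding pass is needed (assumes .NET 4.5+):

using System;
using System.Net.Http;
using System.Threading.Tasks;

static async Task<string> DownloadStringAsync(Uri uri)
{
    using (var http = new HttpClient())
    {
        // Decoded with the response's declared charset (EUC-JP here)
        return await http.GetStringAsync(uri);
    }
}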
Here is a method that performs the re-encoding task:
the string returned by WebClient is converted to a byte array and fed into a MemoryStream, then re-read using a StreamReader with the Encoding retrieved from the Content-Type charset of the response header.
EDIT:
Now using Reflection to get the page Encoding from the underlying HttpWebResponse. This should avoid errors in parsing the original CharacterSet as defined by the remote response.
using System.IO;
using System.Net;
using System.Reflection;
using System.Text;
public string WebClient_DownLoadString(Uri uri)
{
    using (var client = new WebClient())
    {
        // If Windows 7 - Windows Server 2008 R2
        ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12;

        client.CachePolicy = new System.Net.Cache.RequestCachePolicy(System.Net.Cache.RequestCacheLevel.BypassCache);
        client.Headers.Add(HttpRequestHeader.Accept, "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        client.Headers.Add(HttpRequestHeader.AcceptLanguage, "en-US,en;q=0.8");
        client.Headers.Add(HttpRequestHeader.KeepAlive, "keep-alive");

        string result = client.DownloadString(uri);

        // WebClient keeps its last response in a private field; read it out
        // with reflection so we can see the charset the server declared
        var flags = BindingFlags.Instance | BindingFlags.NonPublic;
        using (var response = (HttpWebResponse)client.GetType().GetField("m_WebResponse", flags).GetValue(client))
        {
            var pageEncoding = Encoding.GetEncoding(response.CharacterSet);
            byte[] bytes = client.Encoding.GetBytes(result);
            using (var ms = new MemoryStream(bytes, 0, bytes.Length))
            using (var reader = new StreamReader(ms, pageEncoding))
            {
                return reader.ReadToEnd();
            }
        }
    }
}
Now your code should get the Japanese characters in their correct form.
Uri uri = new Uri("http://www.kanji-a-day.com/level4/index.php", UriKind.Absolute);
string kanji = WebClient_DownLoadString(uri);
kanji = kanji.Remove(0, kanji.IndexOf("<div class=\"glyph\">") + 19);
kanji = kanji.Remove(kanji.IndexOf("</div>")-2);
kanji = kanji.Trim();
Text_DailyKanji.Text = kanji;

Get Page Main Content using the URL

I need to be able to get a page's main content from a certain URL.
A very good example of what I need to do is the following: http://embed.ly/docs/explore/preview?url=http%3A%2F%2Fedition.cnn.com%2F2012%2F08%2F20%2Fworld%2Fmeast%2Fflight-phobia-boy-long-way-home%2Findex.html%3Fiid%3Darticle_sidebar
I am using ASP.NET with C#.
Parsing HTML pages and guessing the main content is not an easy process. I would recommend using NReadability and HtmlAgilityPack.
Here is an example of how it could be done. The main text is always in the div with id readInner after NReadability has transcoded the page.
string url = "http://.......";
var t = new NReadability.NReadabilityWebTranscoder();
bool b;
string page = t.Transcode(url, out b);
if (b)
{
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(page);
var title = doc.DocumentNode.SelectSingleNode("//title").InnerText;
var text = doc.DocumentNode.SelectSingleNode("//div[#id='readInner']")
.InnerText;
}
I guess it's done with the WebClient or WebRequest class. With those you can download the whole content of a page, and then extract the information you want with any data-mining algorithm.

WebRequest using Mozilla Firefox

I need access to the HTML of a Facebook page so I can extract some data from it, so I need to create a WebRequest.
My code worked well for other sites, but for Facebook I must be logged in before I can access the HTML.
How can I use Firefox session data to create a WebRequest for a Facebook page?
I tried this:
List<string> HTML_code = new List<string>();
WebRequest request = WebRequest.Create(URL);

using (WebResponse response = request.GetResponse())
using (StreamReader stream = new StreamReader(response.GetResponseStream()))
{
    string line;
    while ((line = stream.ReadLine()) != null)
    {
        HTML_code.Add(line);
    }
}
...but the resulting HTML is the HTML of the Facebook home page when I am not logged in.
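One hedged way to reuse a logged-in Firefox session is to copy its Facebook cookies into a CookieContainer on the request; a sketch (the cookie name/value below are placeholders you would read out of Firefox's cookie store):

HttpWebRequest request = (HttpWebRequest)WebRequest.Create(URL);

// Placeholder: substitute the real session cookie(s) copied from Firefox
CookieContainer cookies = new CookieContainer();
cookies.Add(new Cookie("session_cookie_name", "session_cookie_value", "/", ".facebook.com"));
request.CookieContainer = cookies;

using (WebResponse response = request.GetResponse())
using (StreamReader stream = new StreamReader(response.GetResponseStream()))
{
    // read the logged-in HTML as before
}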
If what you are trying to do is retrieve the number of likes from a Facebook page, you can use Facebook's Graph API service. To keep it simple, this is what I basically did in the code:
Retrieve the Facebook page's data. In this case I used the Coke page's data, since it was an example FB had listed.
Parse the returned JSON using Json.NET. There are other ways to do this, but this keeps it simple, and you can get Json.NET over at CodePlex. Its documentation will also help you with parsing and serializing even more JSON if you need to.
That basically translates into the code below. Just note that I left out all the fancy exception handling to keep it simple, as networking is not always reliable! Also don't forget to include the Json.NET library in your project!
Usings:
using System.IO;
using System.Net;
using Newtonsoft.Json.Linq;
Code:
string url = "https://graph.facebook.com/cocacola";
WebClient client = new WebClient();
string jsonData = string.Empty;
// Load the Facebook page info
Console.WriteLine("Connecting to Facebook...");
using (Stream data = client.OpenRead(url))
{
using (StreamReader reader = new StreamReader(data))
{
jsonData = reader.ReadToEnd();
}
}
// Get number of likes from Json data
JObject jsonParsed = JObject.Parse(jsonData);
int likes = (int)jsonParsed.SelectToken("likes");
// Write out the result
Console.WriteLine("Number of Likes: " + likes);

Capturing IP address of web stream with StreamReader returning too much data

I have this piece of code:
WebClient web = new WebClient();
System.IO.Stream stream = web.OpenRead("http://url/getAddress.html");
string text = "";

using (System.IO.StreamReader reader = new System.IO.StreamReader(stream))
{
    text = reader.ReadToEnd();
}
Rendered as HTML, the result is just an IP address, but when I try to save the result to a database, what gets saved is the whole HTML page from the web request.
What am I doing wrong?
Example:
text holds the result.
If I do Response.Write(text);, it outputs: 111.222.33.3
If I try to save the variable text, it saves the whole HTML content of the requested page.
You realize that if you write to the Response object, it renders as HTML on the web page? That just re-emits the HTML you received. You need to parse the HTML you get to extract the actual data you're looking for, in the format you want.
public static String GetIP()
{
    String ip = HttpContext.Current.Request.ServerVariables["HTTP_X_FORWARDED_FOR"];
    if (string.IsNullOrEmpty(ip))
    {
        ip = HttpContext.Current.Request.ServerVariables["REMOTE_ADDR"];
    }
    return ip;
}
I've found this solution; it's perfect for what I want.
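If you instead need to pull the address out of the downloaded HTML, a rough sketch with a regex over the text variable from the question (assumes a single IPv4 address appears in the markup):

using System.Text.RegularExpressions;

Match m = Regex.Match(text, @"\b\d{1,3}(\.\d{1,3}){3}\b");
string ip = m.Success ? m.Value : null;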

How to fetch webpage title and images from URL?

I want to fetch the website title and images from a URL,
as facebook.com does. How do I get images and the website title from a third-party link?
Use HTML Agility Pack. Here is sample code to get the title:
using System;
using HtmlAgilityPack;

protected void Page_Load(object sender, EventArgs e)
{
    string url = @"http://www.veranomovistar.com.pe/";
    System.Net.WebClient wc = new System.Net.WebClient();

    HtmlDocument doc = new HtmlDocument();
    doc.Load(wc.OpenRead(url));

    var metaTags = doc.DocumentNode.SelectNodes("//title");
    if (metaTags != null)
    {
        string title = metaTags[0].InnerText;
    }
}
If you have any doubt, post a comment.
At a high level, you just need to send a standard HTTP request to the desired URL. This will get you the site's markup. You can then inspect the markup (either by parsing it into a DOM object and then querying the DOM, or by running some simple regexp's/pattern matching to find the things you are interested in) to extract things like the document's <title> element and any <img> elements on the page.
Off the top of my head, I'd use an HttpWebRequest to go get the page and parse the title out myself, then use further HttpWebRequests in order to go get any images referenced on the page. There's a darn good chance though that there's a better way to do this and somebody will come along and tell you what it is. If not, it'd look something like this:
HttpWebResponse response = null;
try
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(<your URL here>);
    response = (HttpWebResponse)request.GetResponse();
    Stream responseStream = response.GetResponseStream();
    StreamReader reader = new StreamReader(responseStream);
    // Use the StreamReader object to get the page data and parse out the title,
    // as well as getting the locations of any images you need to fetch
}
catch
{
    // handle exceptions
}
finally
{
    if (response != null)
    {
        response.Close();
    }
}
Probably the dumb way to do it, but that's my $0.02.
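To fill in the parsing step that comment leaves open, a crude string-search sketch for the title (assumes a plain <title> tag; real-world pages may need a proper parser):

string html = reader.ReadToEnd();
int start = html.IndexOf("<title>", StringComparison.OrdinalIgnoreCase);
if (start >= 0)
{
    start += "<title>".Length;
    int end = html.IndexOf("</title>", start, StringComparison.OrdinalIgnoreCase);
    if (end > start)
    {
        string title = html.Substring(start, end - start).Trim();
    }
}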
You just have to write JavaScript in the source body.
For example,
if you are using a master page, you just have to write the code on the master page and it is reflected on all the pages.
You can also use the image URL property in that script.
