I want to count the number of rows in an HTML string returned from an API. Is there a way to get the row count without using Html Agility Pack?
The following code connects to the API and stores the returned HTML string in apiContent.
using (var client = new HttpClient())
{
    client.DefaultRequestHeaders.Accept.Add(new MediaTypeWithQualityHeaderValue("application/json"));
    var response = client.GetAsync(apiURL).Result;
    using (HttpContent content = response.Content)
    {
        Task<string> result = content.ReadAsStringAsync();
        apiContent = result.Result;
    }
}
Now I need to count the number of rows (tr) in the HTML string held in the variable "apiContent", but without using Html Agility Pack.
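If the only <tr> elements being returned are the ones you are interested in, why not just count the opening tags? A LINQ .Count() over a string compares single characters, so it cannot match "<tr" directly, but a regex match count does the job:
int count = Regex.Matches(apiContent, @"<tr\b", RegexOptions.IgnoreCase).Count;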
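As a slightly more defensive sketch of the same idea, the count can skip rows that are commented out. The class and method names (RowCounter, CountRows) are hypothetical, and this is still plain pattern matching rather than real HTML parsing, so it assumes reasonably clean markup.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RowCounter
{
    // Counts opening <tr> tags in an HTML string (hypothetical helper).
    public static int CountRows(string html)
    {
        if (string.IsNullOrEmpty(html)) return 0;

        // Drop HTML comments first so commented-out rows are not counted.
        string withoutComments = Regex.Replace(
            html, "<!--.*?-->", string.Empty, RegexOptions.Singleline);

        // \b keeps tags such as <track> from matching; closing </tr> tags never match "<tr".
        return Regex.Matches(withoutComments, @"<tr\b", RegexOptions.IgnoreCase).Count;
    }
}
```

Calling RowCounter.CountRows(apiContent) would then give the row count for the string fetched above.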
Here is a robust solution without HtmlAgilityPack.
Let's consider this HTML:
var html = "<table><tr><td>cell</td></tr><!--<tr><td>comment</td></tr>--></table>";
Let's load this HTML as a document with AngleSharp:
// Create a new context for evaluating webpages with the default configuration
var context = BrowsingContext.New(Configuration.Default);
// Parse the document from the content of a response to a virtual request
var document = await context.OpenAsync(req => req.Content(html));
Query whatever you are looking for in your HTML:
var rows = document.QuerySelectorAll("tr");
Console.WriteLine(rows.Count());
Whenever you want to parse HTML, always rely on an HTML parser. If you don't want to use HAP, AngleSharp is a great alternative. If you don't want to use an existing HTML parser, you are doomed to write your own. That is easier for a subset of HTML, but mostly not worth the hassle. Help yourself: use a library.
Related
I am trying to get a table from a website using the Html Agility Pack in C# but it always returns null and I don't understand why.
This is my code:
using (var httpClient = new HttpClient())
{
var response = await httpClient.GetAsync("some website");
var htmlBody = await response.Content.ReadAsStringAsync();
var doc = new HtmlDocument();
doc.LoadHtml(htmlBody);
var table = doc.DocumentNode.SelectSingleNode("/html/body/div/div/div/div[2]/div[5]/div/div/table");
}
I have also tried this XPath but it still doesn't work:
var table = doc.DocumentNode.SelectSingleNode("//*[@id=\"__layout\"]/div/div[2]/div[5]/div/div/table");
The variable table is always null after I run this. Is there something wrong with my code or is it an issue with the XPath I'm using?
Here is my code:
HttpClient client = new HttpClient();
HttpResponseMessage response = await client.GetAsync("http://vk.com/video219171498_166049761");
string Vk_video_resText = await response.Content.ReadAsStringAsync();
txt.Text = "" + Vk_video_resText + "";
How can I take
http:\\\/\\\/cs513404v4.vk.me\\\/u3692175\\\/videos\\\/b113808aad.360.mp4\
from the HTML page?
If I understand correctly, all you want to do is strip away all the HTML tags so you're only left with the text.
A lighter-weight alternative to HtmlAgilityPack is the code presented in the article Quick and Easy Method to Remove Html Tags.
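If the goal is instead to pull the escaped .mp4 address out of the downloaded page source, a pattern match may be enough. The sketch below assumes the URL appears in the source with escaped slashes, as in the question; VideoUrlExtractor and FindFirstMp4Url are hypothetical names, and the real page layout may differ.

```csharp
using System;
using System.Text.RegularExpressions;

public static class VideoUrlExtractor
{
    // Returns the first .mp4 URL found in the page source, or null if there is none.
    public static string FindFirstMp4Url(string pageSource)
    {
        if (string.IsNullOrEmpty(pageSource)) return null;

        // Accept any mix of backslashes and slashes after the scheme,
        // because the page embeds the URL in escaped form.
        Match m = Regex.Match(
            pageSource,
            @"https?:[\\/]+[^""'\s<>]+?\.mp4",
            RegexOptions.IgnoreCase);

        if (!m.Success) return null;

        // Remove the escaping backslashes to get a normal URL.
        return m.Value.Replace("\\", string.Empty);
    }
}
```

With the variable from the question, the call would be VideoUrlExtractor.FindFirstMp4Url(Vk_video_resText).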
I need to be able to get the page main content from a certain url.
A very good example of what I need to do is the following: http://embed.ly/docs/explore/preview?url=http%3A%2F%2Fedition.cnn.com%2F2012%2F08%2F20%2Fworld%2Fmeast%2Fflight-phobia-boy-long-way-home%2Findex.html%3Fiid%3Darticle_sidebar
I am using ASP.NET with C#.
Parsing HTML pages and guessing the main content is not an easy process. I would recommend using NReadability and HtmlAgilityPack.
Here is an example of how it could be done. After NReadability has transcoded the page, the main text is always in the div with id readInner.
string url = "http://.......";
var t = new NReadability.NReadabilityWebTranscoder();
bool b;
string page = t.Transcode(url, out b);
if (b)
{
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(page);
    var title = doc.DocumentNode.SelectSingleNode("//title").InnerText;
    var text = doc.DocumentNode.SelectSingleNode("//div[@id='readInner']")
        .InnerText;
}
Man,
I guess it's done with the WebClient class or the WebRequest class. With either one you can download the full content of the page, and then use any data mining algorithm to get the information you want.
[]'s
I want to fetch the website title and images from a URL,
as facebook.com does. How can I get the images and the website title from a third-party link?
Use Html Agility Pack. This is sample code to get the title:
using System;
using HtmlAgilityPack;
protected void Page_Load(object sender, EventArgs e)
{
    string url = @"http://www.veranomovistar.com.pe/";
    System.Net.WebClient wc = new System.Net.WebClient();
    HtmlDocument doc = new HtmlDocument();
    doc.Load(wc.OpenRead(url));
    var metaTags = doc.DocumentNode.SelectNodes("//title");
    if (metaTags != null)
    {
        string title = metaTags[0].InnerText;
    }
}
If you have any doubts, post a comment.
At a high level, you just need to send a standard HTTP request to the desired URL. This will get you the site's markup. You can then inspect the markup (either by parsing it into a DOM object and then querying the DOM, or by running some simple regexp's/pattern matching to find the things you are interested in) to extract things like the document's <title> element and any <img> elements on the page.
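The pattern-matching route described above could look roughly like the following sketch, which works on markup that has already been downloaded. PagePreview, GetTitle and GetImageSources are hypothetical names, and the regular expressions assume quoted src attributes and fairly regular markup; a DOM parser is the safer choice for arbitrary pages.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class PagePreview
{
    // Returns the decoded text of the first <title> element, or null if none is found.
    public static string GetTitle(string html)
    {
        Match m = Regex.Match(
            html, @"<title[^>]*>(.*?)</title>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return m.Success ? WebUtility.HtmlDecode(m.Groups[1].Value).Trim() : null;
    }

    // Returns the src value of every <img> tag that has a quoted src attribute.
    public static List<string> GetImageSources(string html)
    {
        var sources = new List<string>();
        MatchCollection matches = Regex.Matches(
            html, @"<img\b[^>]*?\bsrc\s*=\s*[""']([^""']+)[""']",
            RegexOptions.IgnoreCase);
        foreach (Match m in matches)
        {
            sources.Add(m.Groups[1].Value);
        }
        return sources;
    }
}
```

Relative src values would still need to be resolved against the page URL before the images can be requested.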
Off the top of my head, I'd use an HttpWebRequest to go get the page and parse the title out myself, then use further HttpWebRequests in order to go get any images referenced on the page. There's a darn good chance though that there's a better way to do this and somebody will come along and tell you what it is. If not, it'd look something like this:
HttpWebResponse response = null;
try
{
    // Replace the placeholder with the URL of the page to fetch
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create("<your URL here>");
    response = (HttpWebResponse)request.GetResponse();
    Stream responseStream = response.GetResponseStream();
    StreamReader reader = new StreamReader(responseStream);
    //use the StreamReader object to get the page data and parse out the title as well as
    //getting locations of any images you need to get
}
catch
{
    //handle exceptions
}
finally
{
    if (response != null)
    {
        response.Close();
    }
}
Probably the dumb way to do it, but that's my $0.02.
You just have to write it using JavaScript in the page source body.
For example, if you are using a master page, you only have to write the code on the master page, and it is reflected on all the pages.
You can also use the image URL property in this script.
khan mohd faizan
Given a URL, I'd like to be able to capture the title of the page this URL points to, as well as other info, e.g. a snippet of text from the first paragraph on the page, and maybe even an image from the page.
Digg.com does this nicely when you submit a URL.
How could something like this be done in .NET with C#?
You're looking for the HTML Agility Pack, which can parse malformed HTML documents.
You can use its HtmlWeb class to download a webpage over HTTP.
You can also download text over HTTP using .NET's WebClient class.
However, it won't help you parse the HTML.
You could try something like this:
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text;
namespace WebGet
{
    class progMain
    {
        static void Main(string[] args)
        {
            ASCIIEncoding asc = new ASCIIEncoding();
            WebRequest wrq = WebRequest.Create("http://localhost");
            using (WebResponse wrp = wrq.GetResponse())
            using (Stream body = wrp.GetResponseStream())
            using (MemoryStream ms = new MemoryStream())
            {
                // Copy the whole body: a single Stream.Read call may return only part
                // of it, and ContentLength can be -1 when the length is unknown
                body.CopyTo(ms);
                byte[] responseBuf = ms.ToArray();
                Console.WriteLine(asc.GetString(responseBuf));
            }
        }
    }
}
Once you have the buffer, you can process it looking for paragraph or image HTML tags to extract portions of the returned data.
You can extract the title of a page with a function like the following. You would need to modify the regular expression to look for, say, the first paragraph of text but since each page is different, that may prove difficult. You could look for a meta description tag and take the value from that, however.
// Requires: using System.Net; using System.Text.RegularExpressions;
public static string GetWebPageTitle(string url)
{
    // Create a request to the url
    HttpWebRequest request = WebRequest.Create(url) as HttpWebRequest;
    // If the request wasn't an HTTP request (like a file), ignore it
    if (request == null) return null;
    // Use the user's credentials
    request.UseDefaultCredentials = true;
    // Obtain a response from the server; if there was an error, return nothing
    HttpWebResponse response = null;
    try { response = request.GetResponse() as HttpWebResponse; }
    catch (WebException) { return null; }
    // Regular expression for an HTML title (lazy, so it stops at the first </title>)
    string regex = @"(?<=<title[^>]*>)([\s\S]*?)(?=</title>)";
    // Dispose the response once the header has been checked
    using (response)
    {
        // If the correct HTML header exists for HTML text, continue
        string contentType = response.Headers["Content-Type"];
        if (contentType != null && contentType.StartsWith("text/html"))
        {
            // Download the page
            WebClient web = new WebClient();
            web.UseDefaultCredentials = true;
            string page = web.DownloadString(url);
            // Extract the title
            Regex ex = new Regex(regex, RegexOptions.IgnoreCase);
            return ex.Match(page).Value.Trim();
        }
    }
    // Not a valid HTML page
    return null;
}
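The meta description idea mentioned above can be sketched in the same regex style. This is only an illustration on an already downloaded page string: MetaDescription and Extract are hypothetical names, and the patterns assume quoted attribute values with no ">" inside them.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MetaDescription
{
    // Returns the content of <meta name="description" ...>, or null if it is absent.
    public static string Extract(string html)
    {
        foreach (Match tag in Regex.Matches(html, @"<meta\b[^>]*>", RegexOptions.IgnoreCase))
        {
            string name = GetAttribute(tag.Value, "name");
            if (string.Equals(name, "description", StringComparison.OrdinalIgnoreCase))
            {
                return GetAttribute(tag.Value, "content");
            }
        }
        return null;
    }

    // Reads one quoted attribute value from a single tag, in any attribute order.
    private static string GetAttribute(string tag, string attribute)
    {
        // The lookbehind avoids matching longer names such as data-name.
        Match m = Regex.Match(
            tag,
            @"(?<![\w-])" + Regex.Escape(attribute) + @"\s*=\s*([""'])(.*?)\1",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return m.Success ? m.Groups[2].Value : null;
    }
}
```

Not every page provides a description tag, so a caller would likely fall back to the title when Extract returns null.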
You could use Selenium RC (open source, www.seleniumhq.org) to parse data from the pages. It is a web test automation tool with a C# .NET library.
Selenium has a full API for reading specific items on an HTML page.