How can I extract this text from an HTML page in C#?

Here is my code:
HttpClient client = new HttpClient();
HttpResponseMessage response = await client.GetAsync("http://vk.com/video219171498_166049761");
string Vk_video_resText = await response.Content.ReadAsStringAsync();
txt.Text = Vk_video_resText;
How can I take
http:\\\/\\\/cs513404v4.vk.me\\\/u3692175\\\/videos\\\/b113808aad.360.mp4\
from the HTML page?

If I understand correctly, all you want to do is strip away all the HTML tags so you're only left with the text.
A lighter-weight solution than HtmlAgilityPack is the code presented in the article Quick and Easy Method to Remove Html Tags.
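That article's approach boils down to a regular expression that deletes anything between angle brackets. A minimal sketch of the same idea (the helper name and the exact pattern here are illustrative, not the article's verbatim code):
using System.Text.RegularExpressions;

static string StripHtmlTags(string html)
{
    // Crude but often good enough: remove anything that looks like a tag.
    return Regex.Replace(html, "<[^>]*>", string.Empty);
}
Stripping tags alone won't isolate the .mp4 link, but it leaves the page text (including the escaped URL) that you can then search.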

Related

Count <tr> from html string without using HtmlDoc

I want to count the number of rows in an HTML string returned from an API. Any idea how to get the row count without using Html Agility Pack?
The following code connects to the API and returns the HTML string into apiContent:
using (var client = new HttpClient())
{
    client.DefaultRequestHeaders.Accept.Add(new MediaTypeWithQualityHeaderValue("application/json"));
    var response = client.GetAsync(apiURL).Result;
    using (HttpContent content = response.Content)
    {
        Task<string> result = content.ReadAsStringAsync();
        apiContent = result.Result;
    }
}
Now I need to count the number of rows (tr) in the HTML string in the variable "apiContent", but without using Html Agility Pack.
If the only <tr>'s being returned are the ones you are interested in, why not just count occurrences of the substring? A LINQ .Count() over the string's characters can't match a multi-character token like "<tr", but Regex.Matches (from System.Text.RegularExpressions) can:
int count = Regex.Matches(apiContent, "<tr", RegexOptions.IgnoreCase).Count;
Here is a robust solution without HtmlAgilityPack.
Let's consider this HTML:
var html = "<table><tr><td>cell</td></tr><!--<tr><td>comment</td></tr>--></table>";
Let's load this HTML as a document (this uses AngleSharp):
// Requires the AngleSharp NuGet package
using AngleSharp;
using System.Linq; // for .Count() on the query result below

// Create a new context for evaluating webpages with the default configuration
var context = BrowsingContext.New(Configuration.Default);
// Parse the document from the content of a response to a virtual request
var document = await context.OpenAsync(req => req.Content(html));
Query whatever you are looking for in your HTML:
var rows = document.QuerySelectorAll("tr");
Console.WriteLine(rows.Count());
Whenever you want to parse HTML, always rely on an HTML parser. If you don't want to use HAP, AngleSharp is a great alternative. If you don't want to use an existing HTML parser, you are doomed to write your own. That is easier on a subset of HTML, but mostly not worth the hassle. Help yourself; use a library.

Get data from a Pastebin raw

On form load, I'm trying to count the number of lines in a Pastebin raw and return the value to a textbox. I've been racking my brains and still can't figure it out.
textBox1.Text = new WebClient().DownloadString("yourlink");
I'm expanding my comment into an answer.
As already mentioned, you need an HttpRequest or WebRequest to get the content as a string.
Maybe new WebClient().DownloadString(url);, but I prefer WebRequest since it's also supported in .NET Core.
What you need to do is extract the content of the raw textarea element from the HTML. I know people will probably hate me for this, but I used a regex for that task. Alternatively, you can use an HTML parser.
The Raw data is contained within a textarea with following attributes:
<textarea id="paste_code" class="paste_code" name="paste_code" onkeydown="return catchTab(this,event)">
So the regex pattern looks like this:
private static string rgxPatternPasteBinRawContent = @"<textarea id=""paste_code"" class=""paste_code"" name=""paste_code"" onkeydown=""return catchTab\(this,event\)"">(.*)<\/textarea>";
Since the HTML code is spread over multiple lines, our regex has to be used with the Singleline option.
Regex rgx = new Regex(rgxPatternPasteBinRawContent, RegexOptions.Singleline);
Now find the match that contains the raw data:
string htmlContent = await GetHtmlContentFromPage("SomePasteBinURL");
// Possibly your new WebClient().DownloadString("SomePasteBinURL");
// await not necessarily needed here!
Match match = rgx.Match(htmlContent);
string rawContent = "ERROR: No Raw content found!";
if (match.Success)
{
    rawContent = match.Groups[1].Value;
}
// Split already yields one entry per line, so no +1 is needed.
int numberOfLines = rawContent.Split('\n').Length;
And you're done.
The WebRequest looks like this for me:
private static async Task<string> GetHtmlContentFromPage(string url)
{
    WebRequest request = WebRequest.CreateHttp(url);
    using (WebResponse response = await request.GetResponseAsync())
    using (Stream receiveStream = response.GetResponseStream())
    using (StreamReader readStream = new StreamReader(receiveStream))
    {
        return readStream.ReadToEnd();
    }
}
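Wired together in an async form-load handler, the whole thing might look like this (a sketch; Form1 and textBox1 are the control names assumed from the question):
private async void Form1_Load(object sender, EventArgs e)
{
    string htmlContent = await GetHtmlContentFromPage("SomePasteBinURL");
    Regex rgx = new Regex(rgxPatternPasteBinRawContent, RegexOptions.Singleline);
    Match match = rgx.Match(htmlContent);
    if (match.Success)
    {
        // One entry per line of the paste.
        textBox1.Text = match.Groups[1].Value.Split('\n').Length.ToString();
    }
}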

How to translate word using google translator API?

I am trying to get the translated text using Google Translator's API.
public JsonResult getCultureMeaning(string word, string langcode)
{
    string url = "https://translate.google.com/#en/" + langcode + "/" + word;
    HtmlWeb web = new HtmlWeb();
    HtmlDocument doc = web.Load(url);
    string m = "";
    foreach (HtmlNode node in doc.DocumentNode.SelectSingleNode("//span[@id='result_box']").ChildNodes)
    {
        m += node.InnerHtml;
    }
    return Json(m, JsonRequestBehavior.AllowGet);
}
In the above method I am passing parameters; say word is Welcome and langcode is hi in this case.
So I would have the URL https://translate.google.com/#en/hi/welcome and the result is आपका स्वागत है
But when I select the result container with its child nodes, as doc.DocumentNode.SelectSingleNode("//span[@id='result_box']").ChildNodes, it does not find the result container in the response. Hence I can't get this API to work in my case.
Edit:
Result container from the URL:
<span id="result_box" class="short_text" lang="hi"><span class="hps">आपका स्वागत है</span></span>
How should I approach this to get it working? For reference, I am using HtmlAgilityPack.
If you inspect the page's requests, you might notice that the actual translation request is done via AJAX. A sample query for your translation is: https://translate.google.com/translate_a/single?client=t&sl=en&tl=hi&hl=en&dt=bd&dt=ex&dt=ld&dt=md&dt=qc&dt=rw&dt=rm&dt=ss&dt=t&dt=at&dt=sw&ie=UTF-8&oe=UTF-8&ssel=0&tsel=0&q=welcome
It returns JSON; you can inspect it and get what you're looking for (the data is pretty big, so I won't post it here).
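A rough sketch of calling that endpoint directly, trimmed down to the dt=t fragment that carries the translated text (this API is unofficial and undocumented, so Google may change or block it at any time; the payload is a deeply nested, loosely formatted JSON array whose first string is usually the translation, so the sketch pulls out the first quoted value instead of fully parsing it):
var client = new WebClient { Encoding = System.Text.Encoding.UTF8 };
string json = client.DownloadString(
    "https://translate.google.com/translate_a/single?client=t&sl=en&tl=hi" +
    "&dt=t&ie=UTF-8&oe=UTF-8&q=welcome");
// Grab the first quoted string, which is usually the translation.
Match first = Regex.Match(json, "\"(.*?)\"");
string translation = first.Success ? first.Groups[1].Value : null;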
Agility Pack only gets back the document's elements; it cannot capture content that is loaded by an AJAX request afterwards. Thanks to @Uriil for shedding light on this issue.
However, I was able to manage it the traditional way, using WebClient.
Here is what I did:
public JsonResult getCultureMeaning(string word, string langcode)
{
    string languagePair = "en|" + langcode;
    string url = String.Format("http://www.google.com/translate_t?hl=en&ie=UTF8&text={0}&langpair={1}", word, languagePair);
    WebClient webClient = new WebClient();
    webClient.Encoding = System.Text.Encoding.UTF8;
    string result = webClient.DownloadString(url);
    result = result.Substring(result.IndexOf("<span title=\"") + "<span title=\"".Length);
    result = result.Substring(result.IndexOf(">") + 1);
    result = result.Substring(0, result.IndexOf("</span>"));
    result = HttpUtility.HtmlDecode(result.Trim());
    return Json(result, JsonRequestBehavior.AllowGet);
}
It works for every culture pair, except en|en; in that case it would return the whole HTML document as the result.
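A simple guard for that edge case (a sketch: skip the remote call entirely when the source and target languages are the same):
if (langcode == "en")
{
    return Json(word, JsonRequestBehavior.AllowGet);
}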

Get Page Main Content using the URL

I need to be able to get the page's main content from a certain URL.
A very good example of what I need to do is the following: http://embed.ly/docs/explore/preview?url=http%3A%2F%2Fedition.cnn.com%2F2012%2F08%2F20%2Fworld%2Fmeast%2Fflight-phobia-boy-long-way-home%2Findex.html%3Fiid%3Darticle_sidebar
I am using ASP.NET with C#.
Parsing HTML pages and guessing the main content is not an easy process. I would recommend using NReadability and HtmlAgilityPack.
Here is an example of how it could be done. The main text is always in a div with id readInner after NReadability has transcoded the page.
string url = "http://.......";
var t = new NReadability.NReadabilityWebTranscoder();
bool b;
string page = t.Transcode(url, out b);
if (b)
{
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(page);
    var title = doc.DocumentNode.SelectSingleNode("//title").InnerText;
    var text = doc.DocumentNode.SelectSingleNode("//div[@id='readInner']").InnerText;
}
I guess it's done using the WebClient class or the WebRequest class. With either you can download the full content of the page, and then, using any data-mining algorithm, you can get the information you want.
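For the download step, a minimal sketch (the URL is a placeholder):
using (var client = new System.Net.WebClient())
{
    // Fetch the raw HTML, then feed it into a content-extraction step
    // such as the NReadability approach shown above.
    string html = client.DownloadString("http://example.com/article");
}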

How to fetch webpage title and images from URL?

I want to fetch the website title and images from a URL, as facebook.com does. How do I get images and the website title from a third-party link?
Use Html Agility Pack. This is sample code to get the title:
using System;
using HtmlAgilityPack;

protected void Page_Load(object sender, EventArgs e)
{
    string url = @"http://www.veranomovistar.com.pe/";
    System.Net.WebClient wc = new System.Net.WebClient();
    HtmlDocument doc = new HtmlDocument();
    doc.Load(wc.OpenRead(url));
    var metaTags = doc.DocumentNode.SelectNodes("//title");
    if (metaTags != null)
    {
        string title = metaTags[0].InnerText;
    }
}
If you have any doubt, post a comment.
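To also pull the images, the same document can be queried for img nodes. A minimal sketch along the same lines (note that src values may be relative and would need resolving against the page URL):
var imgTags = doc.DocumentNode.SelectNodes("//img[@src]");
if (imgTags != null)
{
    foreach (var img in imgTags)
    {
        // Each src may be absolute or relative to the page URL.
        string src = img.GetAttributeValue("src", "");
    }
}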
At a high level, you just need to send a standard HTTP request to the desired URL. This will get you the site's markup. You can then inspect the markup (either by parsing it into a DOM object and then querying the DOM, or by running some simple regexes/pattern matching to find the things you are interested in) to extract things like the document's <title> element and any <img> elements on the page.
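A rough sketch of the regex route (fragile by nature, since HTML is not a regular language; a parser is safer for anything beyond the title; assumes a url string plus using System.Net.Http and System.Text.RegularExpressions directives):
string html = await new HttpClient().GetStringAsync(url);
Match m = Regex.Match(html, "<title[^>]*>(.*?)</title>",
    RegexOptions.IgnoreCase | RegexOptions.Singleline);
string title = m.Success
    ? System.Net.WebUtility.HtmlDecode(m.Groups[1].Value.Trim())
    : "";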
Off the top of my head, I'd use an HttpWebRequest to go get the page and parse the title out myself, then use further HttpWebRequests in order to go get any images referenced on the page. There's a darn good chance though that there's a better way to do this and somebody will come along and tell you what it is. If not, it'd look something like this:
HttpWebResponse response = null;
try
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(<your URL here>);
    response = (HttpWebResponse)request.GetResponse();
    Stream responseStream = response.GetResponseStream();
    StreamReader reader = new StreamReader(responseStream);
    //use the StreamReader object to get the page data and parse out the title as well as
    //getting locations of any images you need to get
}
catch
{
    //handle exceptions
}
finally
{
    if (response != null)
    {
        response.Close();
    }
}
Probably the dumb way to do it, but that's my $0.02.
You just have to write it using JavaScript on the source body.
For example, if you are using a master page, you just have to write the code on the master page, and that reflects on all the pages.
You can also use the image URL property in this script.
