How to get a specific part of a webpage in MVC? - C#

I want to get a specific part of the current page so I can save it to another file, for example a PDF.
When I click submit, I want to grab this part of the current view inside the ActionResult:
<div id="t1" name="t1">
.......
</div>
I tried this code to get the page:
WebClient client = new WebClient();
string url = HttpContext.Request.Url.AbsoluteUri;
string content = "";
Stream data = client.OpenRead(url);
StreamReader sr = new StreamReader(data);
content = sr.ReadToEnd();
data.Close();
But I only want the div with id t1 from the current page.
NOTE: I don't want to use jQuery; I only want to do it in C#.

Try using HtmlAgilityPack for this sort of thing. Example code block:
public string GetContent(string url)
{
    HtmlWeb hw = new HtmlWeb();
    HtmlDocument doc = hw.Load(url);
    // SelectSingleNode returns null if no matching div is found.
    HtmlNode node = doc.DocumentNode.SelectSingleNode("//div[@id='t1']");
    return node != null ? node.InnerHtml : string.Empty;
}
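As a follow-up, here is a minimal sketch of how the helper above could be called from a controller action to save the extracted div to a file; the action name, file path and redirect target are placeholders, and producing an actual PDF would still require a separate PDF library.
// Sketch only: saves the extracted div's HTML under App_Data.
// "SaveDiv" and the redirect target are hypothetical names.
[HttpPost]
public ActionResult SaveDiv()
{
    // Reuse the current request's URL, as in the question's own code.
    string divHtml = GetContent(HttpContext.Request.Url.AbsoluteUri);
    string path = Server.MapPath("~/App_Data/t1.html");
    System.IO.File.WriteAllText(path, divHtml);
    return RedirectToAction("Index");
}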

Related

Can't parse body of page

I am trying to parse some href attributes from a page. My code looks like this:
WebClient webClient = new WebClient();
string htmlCode = webClient.DownloadString("https://www.firmy.cz/Auto-moto");
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlCode);
HtmlNodeCollection collection = doc.DocumentNode.SelectNodes("//div[contains(@class,'companyWrap')]");
string target = "";
foreach (HtmlNode link in collection)
{
    target = target + "\n" + link.Attributes["href"].Value;
}
On this page my doc.ParsedText has an empty body (<body id="root" class="root"></body>), but if I open the page in a browser I can see the body's elements. Can you tell me where the problem is?
If you view the source of the URL you are trying to parse (https://www.firmy.cz/Auto-moto), you can see that the body is empty.
It seems like the page is loading the content through JavaScript on the client side and will thus not be available for you to parse.
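As a quick sanity check (a sketch based on the code in the question), guard against SelectNodes returning null; that is exactly what happens when the markup you see in the browser is only injected client-side:
WebClient webClient = new WebClient();
string htmlCode = webClient.DownloadString("https://www.firmy.cz/Auto-moto");
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlCode);
// SelectNodes returns null when nothing matches the XPath.
var collection = doc.DocumentNode.SelectNodes("//div[contains(@class,'companyWrap')]");
if (collection == null)
{
    // Nothing matched: the content is added by JavaScript after the page loads,
    // so it is not part of the HTML the server actually returns.
}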

What is the fastest way to get an HTML document node using XPath and the HtmlAgilityPack?

In my application I need to get the URL of a blog post's image. In order to do this I'm using the HtmlAgilityPack.
This is the code I have so far:
static string GetBlogImageUrl(string postUrl)
{
    string imageUrl = string.Empty;
    using (WebClient client = new WebClient())
    {
        string htmlString = client.DownloadString(postUrl);
        HtmlDocument htmlDocument = new HtmlDocument();
        htmlDocument.LoadHtml(htmlString);
        string xPath = "/html/body/div[contains(@class, 'container')]/div[contains(@class, 'content_border')]/div[contains(@class, 'single-post')]/main[contains(@class, 'site-main')]/article/header/div[contains(@class, 'featured_image')]/img";
        HtmlNode node = htmlDocument.DocumentNode.SelectSingleNode(xPath);
        imageUrl = node.GetAttributeValue("src", string.Empty);
    }
    return imageUrl;
}
The problem is that this is too slow: in my tests it took about three seconds to extract the image URL from the given page, which is a problem when I'm loading a feed and trying to read several articles.
I tried using the absolute XPath of the element I want, but I didn't notice any improvement. Is there a faster way to achieve this?
Can you try this code and see if it's faster or not?
string Url = "http://blog.cedrotech.com/5-tendencias-mobile-que-sua-empresa-precisa-acompanhar/";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(Url);
var featureDiv = doc.DocumentNode.Descendants("div").FirstOrDefault(_ => _.Attributes.Contains("class") && _.Attributes["class"].Value.Contains("featured_image"));
var img = featureDiv.ChildNodes.First(_ => _.Name.Equals("img"));
var imgUrl = img.Attributes["src"];

Printing a JSON result in MVC inside <pre> tags

I am trying to show the JSON result from the method below and display it inside <pre> tags, to see clearly what's going on in the text.
[OutputCache(Duration = 300)]
public ActionResult Foo()
{
    WebClient wc = new WebClient();
    var link = new Uri("http://eu.battle.net/wow/en/feed/news");
    var infoFromLinku = wc.DownloadData(link);
    string sContent = string.Empty;
    sContent = System.Text.Encoding.ASCII.GetString(infoFromLinku);
    return Json(sContent, JsonRequestBehavior.AllowGet);
}
and in View:
<div>
<pre>
@Html.Action("Foo", "Home")
</pre>
</div>
I want the output to look like picture 1, but all I get is a total mess like picture 2.
Can someone help me with this one?
I have also tried serializing this XML, without success. Here is what I have tried:
System.Web.Helpers.Json.Encode/Decode
and also
XmlDocument doc = new XmlDocument();
doc.LoadXml(sContent);
string jsonText = JsonConvert.SerializeObject(doc);
One more edit:
All I'm trying to do is turn this PHP code into an MVC version. In PHP it is about 4 lines of code:
$sContent = file_get_contents($url);
$simpleXml = simplexml_load_string($sContent);
$json = json_encode($simpleXml);
$result = json_decode($json, TRUE);
maybe this will help as well.
Return a partial view and assign your JSON string to ViewBag.yourJsonstring:
[OutputCache(Duration = 300)]
public ActionResult Foo()
{
    WebClient wc = new WebClient();
    var link = new Uri("http://eu.battle.net/wow/en/feed/news");
    var infoFromLinku = wc.DownloadData(link);
    string sContent = string.Empty;
    sContent = System.Text.Encoding.ASCII.GetString(infoFromLinku);
    ViewBag.yourJsonstring = sContent;
    return PartialView("_YourPartialView");
}
and in the partial view:
<div><pre>@ViewBag.yourJsonstring</pre></div>
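If the goal is to reproduce the four PHP lines from the question, a rough C# counterpart would look like the sketch below; it assumes Json.NET is referenced (the question already uses JsonConvert) along with the System.Net, System.Xml and System.Collections.Generic namespaces.
using (var wc = new WebClient())
{
    // file_get_contents($url)
    string sContent = wc.DownloadString("http://eu.battle.net/wow/en/feed/news");
    // simplexml_load_string($sContent)
    var xml = new XmlDocument();
    xml.LoadXml(sContent);
    // json_encode($simpleXml)
    string json = JsonConvert.SerializeXmlNode(xml);
    // json_decode($json, TRUE)
    var result = JsonConvert.DeserializeObject<Dictionary<string, object>>(json);
}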
Another way: you can deserialize your JSON into model objects and bind them in the view.

C# encoding: Shift-JIS vs. UTF-8 with Html Agility Pack

I have a problem. My goal is to save some text from a Japanese, Shift-JIS encoded HTML page into a UTF-8 encoded text file, but I don't really know how to handle the encoding. The HtmlNode object is encoded in Shift-JIS, but after I use the ToString() method the content is corrupted.
My method so far looks like this:
public String getPage(String url)
{
    String content = "";
    HtmlDocument page = new HtmlWeb() { AutoDetectEncoding = true }.Load(url);
    HtmlNode anchor = page.DocumentNode.SelectSingleNode("//div[contains(@class, 'article-def')]");
    if (anchor != null)
    {
        content = anchor.InnerHtml.ToString();
    }
    return content;
}
I tried
Console.WriteLine(page.Encoding.EncodingName.ToString());
and got: Japanese Shift-JIS
But converting the HTML into a string still produces the corrupted text. I thought there should be a way, but since the documentation for Html Agility Pack is sparse and I couldn't really find a solution via Google, I'm here to get some hints.
Well, AutoDetectEncoding doesn't really work like you'd expect it to. From what I found by looking at the source code of the Agility Pack, the property is only used when loading a local file from disk, not from a URL.
So there are three options. One would be to just set the encoding:
OverrideEncoding = Encoding.GetEncoding("shift-jis")
If you know the encoding will always be the same, that's the easiest fix.
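In context, that would look roughly like this (a sketch; "shift_jis" is the registered .NET name for the Shift-JIS encoding):
// Force the encoding on the HtmlWeb instance before loading the page.
var htmlWeb = new HtmlWeb { OverrideEncoding = Encoding.GetEncoding("shift_jis") };
HtmlDocument page = htmlWeb.Load(url);
HtmlNode anchor = page.DocumentNode.SelectSingleNode("//div[contains(@class, 'article-def')]");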
Or you could download the file locally and load it the same way you do now, but pass the file path instead of the URL:
using (var client = new WebClient())
{
    client.DownloadFile(url, "20130519-OYT1T00606.htm");
}
var htmlWeb = new HtmlWeb() { AutoDetectEncoding = true };
var file = new FileInfo("20130519-OYT1T00606.htm");
HtmlDocument page = htmlWeb.Load(file.FullName);
Or you can detect the encoding from your content like this:
byte[] pageBytes;
using (var client = new WebClient())
{
    pageBytes = client.DownloadData(url);
}
HtmlDocument page = new HtmlDocument();
using (var ms = new MemoryStream(pageBytes))
{
    page.Load(ms);
    var metaContentType = page.DocumentNode.SelectSingleNode("//meta[@http-equiv='Content-Type']").GetAttributeValue("content", "");
    var contentType = new System.Net.Mime.ContentType(metaContentType);
    ms.Position = 0;
    page.Load(ms, Encoding.GetEncoding(contentType.CharSet));
}
And finally, if the page you are querying returns the Content-Type in the response, you can read the encoding from the response headers.
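With WebClient that could look something like the sketch below (an illustration only; it assumes the server actually sends a charset in the Content-Type header and falls back to UTF-8 otherwise):
byte[] pageBytes;
string charSet = null;
using (var client = new WebClient())
{
    pageBytes = client.DownloadData(url);
    // After the request completes, the response headers are available on the client.
    string contentTypeHeader = client.ResponseHeaders[HttpResponseHeader.ContentType];
    if (!string.IsNullOrEmpty(contentTypeHeader))
        charSet = new System.Net.Mime.ContentType(contentTypeHeader).CharSet;
}
HtmlDocument page = new HtmlDocument();
using (var ms = new MemoryStream(pageBytes))
{
    page.Load(ms, charSet != null ? Encoding.GetEncoding(charSet) : Encoding.UTF8);
}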
Your code would of course need a few more null checks than mine does. ;)

Can I read an iframe through WebClient (I want the outer HTML)?

My program is reading a web page whose body contains an iframe that I want to read.
My HTML source:
<html>
...
<iframe src="http://www.mysite.com" ></iframe>
...
</html>
In my program I have a method that returns the source as a string:
public static string get_url_source(string url)
{
    using (WebClient client = new WebClient())
    {
        return client.DownloadString(url);
    }
}
My problem is that I want to get the source of the iframe while reading the page source, as a browser would in normal browsing.
Can I only do this with the WebBrowser class, or is there a way to do it with WebClient or another class?
The real question:
How can I get the outer HTML for a given URL? Any approach is welcome.
After getting the source of the site, you can use HtmlAgilityPack to get the URL of the iframe:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var src = doc.DocumentNode.SelectSingleNode("//iframe")
.Attributes["src"].Value;
then make a second call to get_url_source
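Putting both steps together, the flow looks roughly like this (a sketch that reuses get_url_source from the question):
string outerHtml = get_url_source(url);            // first call: the page containing the iframe
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(outerHtml);
var src = doc.DocumentNode.SelectSingleNode("//iframe").Attributes["src"].Value;
string iframeHtml = get_url_source(src);           // second call: the iframe's own document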
Parse your source using HTML Agility Pack and then:
List<String> iframeSource = new List<String>();
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(get_url_source(url));
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//iframe"))
    iframeSource.Add(get_url_source(node.Attributes["src"].Value));
If you are targeting a single iframe, try to identify it by its ID attribute (or something else) so you retrieve only one source:
String iframeSource = null;
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(get_url_source(url));
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//iframe"))
{
    // Just an example check; you could use different approaches...
    if (node.Attributes["id"].Value == "targetframe")
        iframeSource = get_url_source(node.Attributes["src"].Value);
}
Well, I found the answer after some searching and this is what I wanted:
webBrowser1.Url = new Uri("http://www.mysite.com/");
while (webBrowser1.ReadyState != WebBrowserReadyState.Complete) Application.DoEvents();
string InnerSource = webBrowser1.Document.Body.InnerHtml;
// You can use OuterHtml here too.
