OK, so I found this code online. Everything works, and it finds the div class I am searching for, but it removes all the text. Any idea why? Here's an example of what it's outputting...
<div class="marquee"><img src="logo.png" /></div>
<div id="joke">
<div id="setup" class="exit-left"></div>
<div id="punchline">
<div class="question"></div>
<div id="zing" class="exit-right"></div>
</div>
</div>
<div id="buttons">
<input id="tell-me" class="exit-bottom no-select" type="button" value="Tell Me!" />
<!--<input id="another" class="exit-bottom" type="button" value="Another!" />-->
<table class="another exit-bottom no-select">
<tr>
<td class="another" colspan="3">Another</td>
<td class="share"><div class="share-img"></div>Share</td>
</tr>
</table>
</div>
And the inner text is not shown at all...
And here is my code in VS:
var doc = new HtmlAgilityPack.HtmlDocument();
HtmlAgilityPack.HtmlNode.ElementsFlags["br"] = HtmlAgilityPack.HtmlElementFlag.Empty;
doc.OptionWriteEmptyNodes = true;
try
{
var webRequest = HttpWebRequest.Create("http://dadjokegenerator.com/");
Stream stream = webRequest.GetResponse().GetResponseStream();
doc.Load(stream);
stream.Close();
}
catch (System.UriFormatException uex)
{
throw;
}
catch (System.Net.WebException wex)
{
throw;
}
//get the div by id and then get the inner text
var divString = doc.GetElementbyId("content").InnerHtml;
await e.Channel.SendMessage("test " + divString);
Although your code correctly downloads the content of the page http://dadjokegenerator.com/, InnerHtml is empty because the page doesn't actually contain the joke you are looking for (you can see that if you display the page source in your web browser - e.g. press CTRL+U in Firefox). The joke is added to the page later by JavaScript. If you look at the source code of that JavaScript at http://dadjokegenerator.com/js/main.js, you can see that individual jokes are downloaded from the URL http://dadjokegenerator.com/api/api.php?a=j<=r&vj=0
Here is a minimal sample to download a joke from this URL. I omitted all error checks for simplicity and used the free Json.NET library for JSON deserialization:
public class Joke
{
public int Id;
public string Setup;
public string Punchline;
public override string ToString()
{
return Setup + " " + Punchline;
}
}
public static Joke GetJoke()
{
var request = HttpWebRequest.Create("http://dadjokegenerator.com/api/api.php?a=j<=r&vj=0");
using (var response = request.GetResponse())
{
using (var stream = response.GetResponseStream())
{
using (var reader = new StreamReader(stream))
{
var jokeString = reader.ReadToEnd();
Joke[] jokes = JsonConvert.DeserializeObject<Joke[]>(jokeString);
return jokes.FirstOrDefault();
}
}
}
}
Usage is e.g.
GetJoke().ToString();
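To wire this into the bot command from the question (the `e.Channel.SendMessage` call is taken from the question's code; treating it as awaitable here is an assumption about that API):

```csharp
// Hypothetical usage inside the asker's command handler;
// GetJoke() is the method defined above.
var joke = GetJoke();
if (joke != null)
{
    await e.Channel.SendMessage(joke.ToString());
}
```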
These links show how to read a web page.
Html Agility Pack. Load and scrape webpage
Get HTML code from website in C#
I'm using ASP .NET Core MVC (views and controllers).
Is there a way to add some additional output to all *.cshtml files using middleware, filter or something similar?
I would like to display the path(s) of all cshtml-files like the view itself, partial views, layout-file or components, that are part of the current page.
This is what it should look like:
Right now, I have to add this line to the *.cshtml files, one by one:
@using WkOne.AuthorizationServer.ViewModels;
@model IEnumerable<UserViewModel>
@{
Layout = "_Layout3Cols";
ViewData["Title"] = "Users";
}
<!-- I need this line in every cshtml file -->
<!-- \ -->
<div style="font-size: small;background-color: #CFC;">Path: @Path.ToString() </div>
<table class="table">
<!-- ... and so on... -->
But what I'm looking for is a way to do this in central place.
Any suggestions?
An MVC project returns HTML in the response body to the browser (the Razor code has already been compiled to HTML at that point, so what you inject shouldn't contain Razor code).
The response body stream can be written but not read, so if you want to add HTML somewhere, I think you need to replace the default body.
I tried the following and added the text "hi":
public class CusTestMiddleware
{
private readonly RequestDelegate _next;
public CusTestMiddleware(RequestDelegate next)
{
_next = next;
}
public async Task InvokeAsync(HttpContext context)
{
var response = context.Response;
var responseOriginalBody = response.Body;
using var memStream = new MemoryStream();
response.Body = memStream;
await _next(context);
var targetstr = "<a>hi</a>";
byte[] targetbyte = Encoding.UTF8.GetBytes(targetstr);
memStream.Write(targetbyte);
memStream.Position = 0;
var responseReader = new StreamReader(memStream);
var responseBody = await responseReader.ReadToEndAsync();
memStream.Position = 0;
await memStream.CopyToAsync(responseOriginalBody);
response.Body = responseOriginalBody;
}
}
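For the middleware above to take effect, it still needs to be registered in the pipeline; a minimal sketch (assuming a typical `Startup.Configure` with endpoint routing):

```csharp
public void Configure(IApplicationBuilder app)
{
    // Register the body-rewriting middleware first so it wraps
    // everything written to the response by later components.
    app.UseMiddleware<CusTestMiddleware>();

    app.UseRouting();
    app.UseEndpoints(endpoints =>
    {
        endpoints.MapDefaultControllerRoute();
    });
}
```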
I am trying to get an element using an XPath tree expression, but it returns null. This type of XPath works for other sites for me; it fails on only about 2% of sites. I also tried the XPath copied from Chrome, but when my XPath doesn't work, Chrome's XPath doesn't work either.
public static void Main()
{
string url = "http://www.ndrf.gov.in/tender";
HtmlWeb web = new HtmlWeb();
var htmlDoc = web.Load(url);
var nodetest1 = htmlDoc.DocumentNode.SelectSingleNode("/html[1]/body[1]/section[2]/div[1]/div[1]/div[1]/div[1]/div[2]/table[1]"); // i want this type // not working
//var nodetest2 = htmlDoc.DocumentNode.SelectSingleNode("//*[@id=\"content\"]/div/div[1]/div[2]/table"); // from Google chrome // not working
//var nodetest3 = htmlDoc.DocumentNode.SelectSingleNode("//*[@id=\"content\"]"); // by ID but i don't want this type // working
Console.WriteLine(nodetest1.InnerText); //fail
//Console.WriteLine(nodetest2.InnerText); //fail
//Console.WriteLine(nodetest3.InnerText); //proper but I don't want this type
}
The answer that @QHarr suggested works perfectly. But the reason you get null with a correct XPath is that a JavaScript file in the header of the site adds a wrapper div around the table, and since HtmlAgilityPack does not load or execute JS, the XPath returns null.
What you observe after that JS runs is:
<div class="view-content">
<div class="guide-text">
...
</div>
<div class="scroll-table1">
<!-- Your table is here -->
</div>
</div>
but what you actually get without that JS is:
<div class="view-content">
<!-- Your table is here -->
</div>
thus your XPath should be:
var nodetest1 = htmlDoc.DocumentNode.SelectSingleNode("/html[1]/body[1]/section[2]/div[1]/div[1]/div[1]/div[1]/table[1]");
Your XPath, when used in a browser, selects the entire table. You can shorten it and use it as follows (fiddle):
using System;
using HtmlAgilityPack;
public class Program
{
public static void Main()
{
string url = "http://www.ndrf.gov.in/tender";
HtmlWeb web = new HtmlWeb();
var htmlDoc = web.Load(url);
var nodetest1 = htmlDoc.DocumentNode.SelectSingleNode("//table");
Console.WriteLine(nodetest1.InnerText);
}
}
Use Fizzler.Systems.HtmlAgilityPack
details here : https://www.nuget.org/packages/Fizzler.Systems.HtmlAgilityPack/
This library adds extension methods called QuerySelector and QuerySelectorAll that take CSS selectors instead of XPath.
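A short sketch of what that can look like against the same page (the `table` selector is just illustrative):

```csharp
using System;
using HtmlAgilityPack;
using Fizzler.Systems.HtmlAgilityPack; // adds QuerySelector/QuerySelectorAll

class Program
{
    static void Main()
    {
        var web = new HtmlWeb();
        var htmlDoc = web.Load("http://www.ndrf.gov.in/tender");

        // CSS selector instead of XPath; returns null if nothing matches.
        var table = htmlDoc.DocumentNode.QuerySelector("table");
        Console.WriteLine(table?.InnerText);
    }
}
```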
Ali Bordbar is exactly right: this URL adds a wrapper div when I navigate to it in a WebBrowser control, where all the JavaScript files are loaded,
but when I load the URL using HtmlWeb, none of the JavaScript files are loaded.
HtmlWeb retrieves the static HTML response that the server sends and does not execute any JavaScript, whereas a WebBrowser does.
So the WebBrowser control's HTML DOM XPath and HtmlWeb's HTML DOM XPath don't match.
My code below works well for this situation:
HtmlWeb web = new HtmlWeb();
web.AutoDetectEncoding = true;
HtmlAgilityPack.HtmlDocument theDoc1 = web.Load("http://www.ndrf.gov.in/tender");
var HtmlDoc = new HtmlAgilityPack.HtmlDocument();
var bodytag = theDoc1.DocumentNode.SelectSingleNode("//html");
HtmlDoc.LoadHtml(bodytag.OuterHtml);
var xpathHtmldata = HtmlDoc.DocumentNode.SelectSingleNode(savexpath); //savexpath is my first XPath, built from the HTML DOM of the WebBrowser control, which works for most URLs
if (xpathHtmldata == null)
{
//take last tag name from first xpath
string mainele = savexpath.Substring(savexpath.LastIndexOf("/") + 1);
if (mainele.Contains("[")) { mainele = mainele.Remove(mainele.IndexOf("[")); }
//collect all tags whose name is stored in the mainele variable
var taglist = HtmlDoc.DocumentNode.SelectNodes("//" + mainele);
foreach (var ele in taglist) //check the elements one by one
{
string htmltext1 = ele.InnerText;
htmltext1 = Regex.Replace(htmltext1, @"\s", "");
htmltext1 = htmltext1.Replace("&amp;", "&").Trim();
htmltext1 = htmltext1.Replace("&nbsp;", "").Trim();
string htmltext2 = saveInnerText; // the text from my previous XPath, taken from the HTML DOM of the WebBrowser control
htmltext2 = Regex.Replace(htmltext2, @"\s", "");
if (htmltext1 == htmltext2) // compare with the previous XPath's text; if equal, that's my new XPath
{
savexpath = ele.XPath;
break;
}
}
}
So I was wondering how I would get the source of an iframe within the page source of the web request that has been made.
Example of what I mean:
string text = streamReader.ReadToEnd(); // Sets the string Text to the source of the page.
Now the string text holds the source of the page.
And within that page source is:
Authenticator
</h3>
</div>
<div class="RaggedBoxContainer"><div class="RaggedBoxBg"><div class="RaggedBoxTop"></div><div class="RaggedBoxContent">
<iframe src="https://secure.runescape.com/m=totp-authenticator/a=13/c=zBsBJTw2E0M/accountinfo" allowtransparency="true" frameborder="0"></iframe>
</div><div class="RaggedBoxBottom"></div></div></div>
</div>
And I need to read the source of the iframe, which is:
<h2 class="accountSettingsTitle">RuneScape Authenticator is enabled</h2>
<p>Your account is protected from hijackers. You will need your code generator each time you log in to RuneScape.</p>
<p>It's also really important to keep your email account secure. <a target="_top" href="https://support.runescape.com/hc/en-gb/articles/207258145">Find out how to do this.</a>
<p>You can <a target="_top" href="cape.com/m=totp-authenticator/a=13/c=zBsBJTw2E0M/disableTOTPRequest">disable</a> Authenticator - but remember this will make your account much less secure.</p>
How would I do that?
I know that in the webbrowser it will be:
foreach (HtmlElement elm in webBrowser1.Document.GetElementsByTagName("iframe"))
{
string src = elm.GetAttribute("src");
if (src != null && src != "")
{
string content = new System.Net.WebClient().DownloadString(src); //or using HttpWebRequest
MessageBox.Show(content);
}
}
Please help me out; I'm confused.
I am learning to write a web crawler and found some great examples to get me started, but since I am new to this, I have a few questions regarding the coding method.
The search result for example can be found here: Search Result
When I look at the HTML source for the result I can see the following:
<HR><CENTER><H3>License Information *</H3></CENTER><HR>
<P>
<CENTER> 06/03/2014 </CENTER> <BR>
<B>Name : </B> WILLIAMS AJAYA L <BR>
<B>Address : </B> NEW YORK NY <BR>
<B>Profession : </B> ATHLETIC TRAINER <BR>
<B>License No: </B> 001475 <BR>
<B>Date of Licensure : </B> 01/12/07 <BR>
<B>Additional Qualification : </B> Not applicable in this profession <BR>
<B> Status :</B> REGISTERED <BR>
<B>Registered through last day of : </B> 08/15 <BR>
How can I use the HtmlAgilityPack to scrape that data from the site?
I was trying to implement the example shown below, but I'm not sure where to make the edit to get it to crawl the page:
private void btnCrawl_Click(object sender, EventArgs e)
{
string url = "";
foreach (SHDocVw.InternetExplorer ie in shellWindows)
{
filename = Path.GetFileNameWithoutExtension( ie.FullName ).ToLower();
if ( filename.Equals( "iexplore" ) )
{
txtURL.Text = "Now Crawling: " + ie.LocationURL.ToString();
url = ie.LocationURL.ToString(); // capture the URL while ie is still in scope
}
}
string xmlns = "{http://www.w3.org/1999/xhtml}";
Crawler cl = new Crawler(url);
XDocument xdoc = cl.GetXDocument();
var res = from item in xdoc.Descendants(xmlns + "div")
where item.Attribute("class") != null && item.Attribute("class").Value == "folder-news"
&& item.Element(xmlns + "a") != null
//select item;
select new
{
Link = item.Element(xmlns + "a").Attribute("href").Value,
Image = item.Element(xmlns + "a").Element(xmlns + "img").Attribute("src").Value,
Title = item.Elements(xmlns + "p").ElementAt(0).Element(xmlns + "a").Value,
Desc = item.Elements(xmlns + "p").ElementAt(1).Value
};
foreach (var node in res)
{
MessageBox.Show(node.ToString());
tb.Text = node + "\n";
}
//Console.ReadKey();
}
The Crawler helper class:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Xml.Linq;
namespace CrawlerWeb
{
public class Crawler
{
public string Url
{
get;
set;
}
public Crawler() { }
public Crawler(string Url)
{
this.Url = Url;
}
public XDocument GetXDocument()
{
HtmlAgilityPack.HtmlWeb doc1 = new HtmlAgilityPack.HtmlWeb();
doc1.UserAgent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)";
HtmlAgilityPack.HtmlDocument doc2 = doc1.Load(Url);
doc2.OptionOutputAsXml = true;
doc2.OptionAutoCloseOnEnd = true;
doc2.OptionDefaultStreamEncoding = System.Text.Encoding.UTF8;
XDocument xdoc = XDocument.Parse(doc2.DocumentNode.SelectSingleNode("html").OuterHtml);
return xdoc;
}
}
}
tb is a multiline textbox... So I would like it to display the following:
Name WILLIAMS AJAYA L
Address NEW YORK NY
Profession ATHLETIC TRAINER
License No 001475
Date of Licensure 1/12/07
Additional Qualification Not applicable in this profession
Status REGISTERED
Registered through last day of 08/15
I would like the second argument to be added to an array, because the next step would be to write it to a SQL database...
I am able to get the URL from the IE instance which has the search result, but how can I code that in my script?
This little snippet should get you started:
HtmlDocument doc = new HtmlDocument();
WebClient client = new WebClient();
string html = client.DownloadString("http://www.nysed.gov/coms/op001/opsc2a?profcd=67&plicno=001475&namechk=WIL");
doc.LoadHtml(html);
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//div");
You basically use the WebClient class to download the HTML file and then load that HTML into the HtmlDocument object. Then you use XPath to query the DOM tree and search for nodes. In the example above, "nodes" will contain all the div elements in the document.
Here's a quick reference about the XPath syntax: http://msdn.microsoft.com/en-us/library/ms256086(v=vs.110).aspx
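Building on that snippet, here is a sketch of pulling out the labeled fields shown in the question (it assumes the `<B>Label : </B> value <BR>` structure from the page source above; `tb` is the question's textbox):

```csharp
// Each <b> element holds a label; the value is the text node after it.
HtmlNodeCollection labels = doc.DocumentNode.SelectNodes("//b");
if (labels != null)
{
    foreach (HtmlNode b in labels)
    {
        string label = b.InnerText.Replace(":", "").Trim();
        // NextSibling is the text node holding the value, e.g. "WILLIAMS AJAYA L".
        string value = b.NextSibling != null ? b.NextSibling.InnerText.Trim() : "";
        tb.Text += label + " " + value + Environment.NewLine;
    }
}
```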
I'm trying to read my app's comments, but I don't know how; there is no correct/complete sample.
[CanvasAuthorize(Permissions = "user_about_me")]
public ActionResult About()
{
var client = new FacebookWebClient(FacebookWebContext.Current.AccessToken);
dynamic result = client.Get("19292868552_118464504835613/comments");
ViewBag.result = result;
return View();
}
and in the view, I try to read it like this:
foreach (dynamic comment in ViewBag.result)
{
@comment.id
<text><br /></text>
}
Please help: how can I read the user comments entered for a Facebook page in an MVC application?
As you know, in Facebook you can create your own page, and that page has an ID in the URL.
Here is the code for the controller:
[CanvasAuthorize(Permissions = "user_about_me")]
public ActionResult Comments()
{
var client = new FacebookWebClient(FacebookWebContext.Current.AccessToken);
dynamic feeds = client.Get("{PageID}/feed");
ViewBag.feeds = feeds;
return View();
}
and the code for your view:
<table>
@foreach (dynamic myFeed in ViewBag.feeds.data)
{
if (myFeed.type == "status" && myFeed.from.id != "{PageID}")
{
<tr>
<td>
<img src="@HomeController.GetPictureUrlSmall(myFeed.from.id)" />
</td>
<td>
<span style="font-weight: bold;">@myFeed.from.name</span><br />
@myFeed.message
</td>
</tr>
}
}
</table>
@{
string strNext = ViewBag.feeds.paging.next;
string strPrevious = ViewBag.feeds.paging.previous;
}
<a href="@strPrevious" >Previous</a>
<br />
<a href="@strNext" >Next</a>
A function for getting the users' pictures:
public static string GetPictureUrlSmall(string faceBookId)
{
WebResponse response = null;
string pictureUrl = string.Empty;
try
{
WebRequest request = WebRequest.Create(string.Format("http://graph.facebook.com/{0}/picture?type=small", faceBookId));
response = request.GetResponse();
pictureUrl = response.ResponseUri.ToString();
}
catch (Exception ex)
{
//? handle
}
finally
{
if (response != null) response.Close();
}
return pictureUrl;
}
I hope this will be helpful for some developers, but I still have a problem with pagination; I haven't found a handy solution for that. Thanks, Pedram