I want to fetch a webpage from the internet and get the absolute URLs of some images on the page using HtmlAgilityPack in C#.
The problem is...
The website first redirects the URL to another one, and the src attribute in the <img> tag is then a relative URL.
Currently, I have some code like this:
using HtmlAgilityPack;
HtmlDocument webpageDocument = new HtmlWeb().Load("http://xyz.example.com/");
HtmlNodeCollection nodes = webpageDocument.DocumentNode.SelectNodes("//img");
string url = nodes[0].Attributes["src"].Value;
The code above fetches a webpage from the given example URL, selects the <img> elements from the DOM tree, and reads the src attribute of the first one.
It works if the <img> has an absolute URL. Unfortunately, the website I want to handle gives me a relative URI (e.g. /img/01.png). I need the absolute URL so that I can do more with the image.
So I need to know which URL is the base URL for a given src, but I have failed to find it. In other words, I don't know how to get the location of the webpage after the redirect.
Server side is not mine (I have no control to it).
Consider ResponseUri, and to avoid a second call, give the Html Agility Pack parser the string containing the page content instead of the URL.
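A minimal sketch of that idea, using the asker's placeholder URL: download the page once with HttpWebRequest, remember WebResponse.ResponseUri (the URL after any redirects), parse the already-downloaded string with HtmlDocument.LoadHtml, and let the Uri class resolve relative src values against the redirected base.
using System;
using System.IO;
using System.Net;
using HtmlAgilityPack;

// Download the page once, remembering where the redirect landed.
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://xyz.example.com/");
string html;
Uri finalUri;
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
using (StreamReader reader = new StreamReader(response.GetResponseStream()))
{
    finalUri = response.ResponseUri; // the URI that actually answered, after redirects
    html = reader.ReadToEnd();
}

// Parse the already-downloaded string; no second request needed.
HtmlDocument webpageDocument = new HtmlDocument();
webpageDocument.LoadHtml(html);

HtmlNodeCollection nodes = webpageDocument.DocumentNode.SelectNodes("//img");
string src = nodes[0].GetAttributeValue("src", "");

// Uri resolves relative paths like "/img/01.png" against the redirected base.
Uri absolute = new Uri(finalUri, src);
Console.WriteLine(absolute.AbsoluteUri);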
Related
I am trying to retrieve all the HTML text from a website so I can parse it for more stuff, and I am using WebClient and DownloadString, but they are not returning the entire HTML text. In fact, it isn't even the same HTML I see when I hit F12 on the website.
Below is my code.
string htmlText;
using (WebClient client = new WebClient())
{
    htmlText = client.DownloadString("URL here");
}
This same code is used in many examples, but it is not working for me. It doesn't return an error; it just doesn't retrieve the entire HTML text. It actually returns only a small percentage of the full HTML.
Why does this happen? By the way, I am using an apartment search website, and I set some filters by location, number of beds, and so on. I am trying to get the result page's HTML text, but the returned HTML doesn't even contain any apartment results, which is what I need for parsing.
I need to retrieve a form anti-forgery token from an HTML page.
To do so, I'm using the Html Agility Pack, but I'm fairly new to it.
This is my code:
var page = new HtmlDocument();
page.LoadHtml(htmlPage);
var tokenNode = page.DocumentNode.SelectSingleNode("/html/body/div[3]/div[2]/form/input").Attributes["value"].Value;
The 'tokenNode' variable is returning null.
I've managed to track down my problem to this method:
page.DocumentNode.SelectSingleNode("/html/body/div[3]/div[2]/form/input");
If I simply use page.DocumentNode.SelectSingleNode("/html/body/div[3]"), it returns a value. However, when I add the second div to my XPath, it starts returning null.
What am I missing here?
Edit: I got the XPath using Chrome developer tools.
Edit 2: In the end, the problem was the XPath I got from Chrome.
TL;DR: The HTML in the browser was different from the one my HTTP request retrieved, so the XPath was wrong.
Here's a more thorough explanation
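A quick way to verify this kind of mismatch (a debugging sketch, not part of the original post): save the HTML your request actually retrieved and compare it with what the browser's dev tools show.
// Dump the HTML the HTTP request returned so it can be diffed
// against the markup shown by the browser (F12).
System.IO.File.WriteAllText("retrieved.html", htmlPage);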
To get the anti-forgery token from a page, you could just call the GetElementbyId method, passing the element's id.
For example
var page = new HtmlDocument();
page.LoadHtml(htmlPage);
string token = page.GetElementbyId("__RequestVerificationToken").GetAttributeValue("value", "");
You don't need to go through the nested path.
I would like to access the content of a web page using C#. The content is inside an iframe in the body of the website, under an #document node. I am using this to read the page:
WebClient wbClient = new WebClient();
wbClient.UseDefaultCredentials = true;
byte[] raw = wbClient.DownloadData(stWebPage);
stWebPageContent = System.Text.Encoding.UTF8.GetString(raw);
However, the relevant information inside the #document is ignored.
Can anybody explain what I have to do to access the needed info? It is nested under body/div/iframe/#document/html/body/div/..... Thanks!
Note: I am assuming stWebPage is pointing to an HTTP URL.
The iframe content will not be downloaded by this one call. You need to look for the iframe in stWebPageContent, for example with a regex, pull the value of its 'src' attribute, and make another call to that src URL to download the content. More details can be found at this link.
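A sketch of that two-call flow, using HtmlAgilityPack (which this thread already uses) instead of a regex; the stWebPage value is a placeholder for the real URL:
using System;
using System.Net;
using System.Text;
using HtmlAgilityPack;

string stWebPage = "http://example.com/page-with-iframe"; // placeholder URL

WebClient wbClient = new WebClient();
wbClient.UseDefaultCredentials = true;

// First call: download the outer page.
byte[] raw = wbClient.DownloadData(stWebPage);
string stWebPageContent = Encoding.UTF8.GetString(raw);

// Locate the <iframe> and read its src attribute.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(stWebPageContent);
HtmlNode iframe = doc.DocumentNode.SelectSingleNode("//iframe");
string src = iframe.GetAttributeValue("src", "");

// Resolve a relative src against the outer page's URL, then make the second call.
Uri iframeUri = new Uri(new Uri(stWebPage), src);
string iframeContent = Encoding.UTF8.GetString(wbClient.DownloadData(iframeUri));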
The goal of my program is to grab a webpage and then generate a list of absolute links to the pages it links to.
The problem I am having is that when a page redirects to another page without the program knowing, it makes all the relative links wrong.
For example:
I give my program this link: moodle.pgmb.si/moodle/course/view.php?id=1
On this page, if it finds the link href="signup.php", meaning signup.php in the current directory, it errors because there is no directory above the root.
However, this error is invalid, because the page's real location is:
moodle.pgmb.si/moodle/login/index.php
Meaning that "signup.php" links to moodle.pgmb.si/signup.php, which is a valid page, not moodle.pgmb.si/moodle/course/signup.php as my program thinks.
So my question is: how is my program supposed to know that the page it received is at another location?
I am doing this in C# using the following code to get the HTML:
WebRequest wrq = WebRequest.Create(address);
WebResponse wrs = wrq.GetResponse();
StreamReader strdr = new StreamReader(wrs.GetResponseStream());
string html = strdr.ReadToEnd();
strdr.Close();
wrs.Close();
You should be able to use the ResponseUri property of the WebResponse class. It contains the URI of the Internet resource that actually provided the response data, as opposed to the resource that was requested. You can then use this URI to build correct links.
http://msdn.microsoft.com/en-us/library/system.net.webresponse.responseuri.aspx
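Applied to the code above, a minimal sketch (ResponseUri is read before the response is closed):
WebRequest wrq = WebRequest.Create(address);
using (WebResponse wrs = wrq.GetResponse())
using (StreamReader strdr = new StreamReader(wrs.GetResponseStream()))
{
    string html = strdr.ReadToEnd();
    // The URI that actually served the page, after any redirects.
    Uri actualUri = wrs.ResponseUri;
    // Resolve relative hrefs such as "signup.php" against the real location.
    Uri absoluteLink = new Uri(actualUri, "signup.php");
}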
What I would do is first check whether each link is absolute or relative by searching for "http://" within it. If it's absolute, you're done. If it's relative, then you need to prepend the path of the page you're scanning.
There are a number of ways you could get the current path: you could Split() it on the slashes ("/") and recombine all but the last piece, or you could search for the last occurrence of a slash and take the substring up to and including that position; a sketch of this follows.
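A rough sketch of that manual approach (MakeAbsolute is a hypothetical helper; in practice, new Uri(baseUri, link) handles these cases more robustly):
// Hypothetical helper sketching the split/substring idea described above.
static string MakeAbsolute(string currentPage, string link)
{
    // Absolute links can be used as-is.
    if (link.StartsWith("http://") || link.StartsWith("https://"))
        return link;

    // Keep everything up to and including the last slash of the current page,
    // then append the relative link to it.
    int lastSlash = currentPage.LastIndexOf('/');
    string basePath = currentPage.Substring(0, lastSlash + 1);
    return basePath + link;
}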
Edit: Re-reading the question, I'm not sure I am understanding. href="signup.php" is a relative link, which should resolve against the current directory, so the current behavior you mentioned, moodle.pgmb.si/moodle/course/signup.php, is correct.
The problem is that, whether the URL is relative or absolute, you have no way of knowing where it really goes unless you request it. Even then, it might not actually be served from where you think it is located, because it might be implemented as an HTTP redirect or similar on the server side.
So if you want to be exhaustive, what you can do is:
- Use your current technique to grab a list of all links on the page.
- Attempt to request each of those pages. Then, if you:
  - get a 200 response code, all is good: the page is there;
  - get a 404 response code, you know the page does not exist;
  - get a 3xx response code, you know where the web server expects that content to actually originate from.
Your (Http)WebResponse object has a StatusCode property. Note that you should also handle any possible WebException errors; these too carry a Response with a StatusCode (usually 5xx).
You can also look at the HttpWebResponse Headers property, specifically the Location header.
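A sketch of that exhaustive check (CheckLink is a hypothetical helper; HEAD is used so only the status comes back, and AllowAutoRedirect is disabled so 3xx codes surface instead of being followed):
using System;
using System.Net;

// Hypothetical helper: probe one absolute link and report its status code.
static HttpStatusCode CheckLink(string link)
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(link);
    request.Method = "HEAD";            // we only need the status, not the body
    request.AllowAutoRedirect = false;  // surface 3xx instead of silently following it

    try
    {
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        {
            int code = (int)response.StatusCode;
            if (code >= 300 && code < 400)
                Console.WriteLine("Redirects to: " + response.Headers["Location"]);
            return response.StatusCode; // 200, 3xx, ...
        }
    }
    catch (WebException ex) when (ex.Response is HttpWebResponse errorResponse)
    {
        return errorResponse.StatusCode; // 404, 5xx, ...
    }
}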
I'm unable to get Html.ActionLink to produce absolute URLs.
Html.ActionLink(DataBinder.Eval(c.DataItem, "Name").ToString(), DataBinder.Eval(c.DataItem, "Path").ToString())
This pulls the data from my model correctly, but appends the path to the end of the current page's URL, producing URLs like "http://localhost:24590/www.google.com".
How can I get this to work how I want it to?
This works for me:
<a href="http://@Model.URL">
Click Here
</a>
Use an absolute URL, i.e. one starting with http://. Without that prefix, @Model.URL alone would have the same result as your ActionLink, because it would be treated as a relative URL.
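A small sketch of that idea as a helper (ToAbsoluteUrl is hypothetical, not part of ASP.NET MVC; Model.Path and Model.Name are assumed from the question's data), so stored values like "www.google.com" render as absolute links:
using System;

// Hypothetical helper: prefix a scheme only when the stored path lacks one,
// so bare hosts like "www.google.com" become absolute links.
public static class UrlHelpers
{
    public static string ToAbsoluteUrl(string path)
    {
        if (path.StartsWith("http://", StringComparison.OrdinalIgnoreCase) ||
            path.StartsWith("https://", StringComparison.OrdinalIgnoreCase))
            return path; // already absolute
        return "http://" + path; // assume http for bare hosts
    }
}
In the view this could then be used as, e.g., <a href="@UrlHelpers.ToAbsoluteUrl(Model.Path)">@Model.Name</a>.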