Get HTML Code from a website after it completed loading

I am trying to get the HTML code from a specific website asynchronously with the following code:
var response = await httpClient.GetStringAsync("url");
The problem is that the website usually takes another second to load the other parts of the page, which I need. So the question is whether I can load the site first and read the content after a certain amount of time.
Sorry if this question has already been answered, but I didn't really know what to search for.
Thanks,
Twenty
Edit #1
If you want to try it yourself, the URL is http://iloveradio.de/iloveradio/. I need the title and the artist, which do not load immediately.

You are going in the wrong direction. The referenced site has a playlist API that returns JSON. You can get the information from:
http://iloveradio.de/typo3conf/ext/ep_channel/Scripts/playlist.php
Edit: I used the Chrome Inspector to find the playlist link.
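A minimal sketch of that approach, assuming the endpoint still responds and returns JSON as described. The shape of the JSON is not shown in this thread, so the code assumes no field names: it flattens whatever comes back into "path = value" lines so the artist and title entries can be spotted by eye. PlaylistJson, Flatten and PrintPlaylistAsync are hypothetical names chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

public static class PlaylistJson
{
    // Flattens any JSON document into "path = value" lines, one per leaf value.
    public static List<string> Flatten(string json)
    {
        var lines = new List<string>();
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            Walk(doc.RootElement, "$", lines);
        }
        return lines;
    }

    private static void Walk(JsonElement element, string path, List<string> lines)
    {
        switch (element.ValueKind)
        {
            case JsonValueKind.Object:
                foreach (JsonProperty property in element.EnumerateObject())
                    Walk(property.Value, path + "." + property.Name, lines);
                break;
            case JsonValueKind.Array:
                int index = 0;
                foreach (JsonElement item in element.EnumerateArray())
                    Walk(item, path + "[" + index++ + "]", lines);
                break;
            default:
                // Strings, numbers, booleans and null are leaves.
                lines.Add(path + " = " + element.ToString());
                break;
        }
    }

    // Downloads the playlist JSON (URL taken from the answer above) and prints every leaf.
    public static async Task PrintPlaylistAsync(HttpClient httpClient)
    {
        string json = await httpClient.GetStringAsync(
            "http://iloveradio.de/typo3conf/ext/ep_channel/Scripts/playlist.php");
        foreach (string line in Flatten(json))
            Console.WriteLine(line);
    }
}
```

Once the relevant paths are known, the generic walk could be replaced with direct GetProperty calls on the JsonElement.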

You could use Puppeteer-Sharp:
await new BrowserFetcher().DownloadAsync(BrowserFetcher.DefaultRevision);
using (var browser = await Puppeteer.LaunchAsync(new LaunchOptions { Headless = false }))
using (var page = await browser.NewPageAsync())
{
await page.SetViewportAsync(new ViewPortOptions() { Width = 1280, Height = 600 });
await page.GoToAsync("http://iloveradio.de/iloveradio/");
await page.WaitForSelectorAsync("#artisttitle DIV");
var artist = await page.EvaluateExpressionAsync<string>("$('#artisttitle DIV')[0].innerText");
Console.WriteLine(artist);
Console.ReadLine();
}

If parts of the page load afterwards, they are generated by JavaScript after the page load (an Ajax request, for example). No matter how long you wait, the response will not contain the content you want, because it is not in the source the server sends.
The easy way:
Use a WebBrowser control and, when the DocumentCompleted event fires, wait until the element you want appears.
The right way:
Find the JavaScript that loads the content and make the same request yourself (easy to say, hard to do).
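The "wait until the element appears" step of the easy way can be sketched as a small polling helper. This is only an illustration: Poll and UntilAsync are hypothetical names, and the usage comment assumes a WinForms WebBrowser control named webBrowser1 and an element id of "artisttitle" (borrowed from the selector in the Puppeteer-Sharp answer).

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

public static class Poll
{
    // Re-checks `condition` every `interval` until it returns true or `timeout` elapses.
    // Returns true if the condition was met, false if the wait timed out.
    public static async Task<bool> UntilAsync(Func<bool> condition, TimeSpan timeout, TimeSpan interval)
    {
        Stopwatch clock = Stopwatch.StartNew();
        while (true)
        {
            if (condition()) return true;
            if (clock.Elapsed >= timeout) return false;
            await Task.Delay(interval);
        }
    }
}

// Possible use inside an async DocumentCompleted handler (control name and element id are assumptions):
//
//   bool found = await Poll.UntilAsync(
//       () => webBrowser1.Document.GetElementById("artisttitle") != null,
//       TimeSpan.FromSeconds(10),
//       TimeSpan.FromMilliseconds(250));
```

Polling with a timeout keeps the handler from waiting forever if the script on the page never creates the element.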

The thing to understand here is that when you read the response from the URL, all you will ever get is the raw response, in this case the HTML source code the server replied with.
Unlike what you might see in your browser's DOM Inspector developer tools, you will only get the original HTML source from the page (what you might see in the "Page Source" developer tool) which will not include any dynamically created content (JavaScript) or loaded content (like iframes).
So you aren't getting what you see here in the DOM Inspector:
You are getting what you see here in the Page Source (View > Developer > View Source in Chrome):
You can't wait for that other content to load because it will never load since that HTML content isn't being parsed or rendered like a browser would.
You have several options available though:
See if the website has an API you can use
Determine where that content you want is actually loaded from, and make another/different HTTP request to that content (the Network Panel is helpful here)
Use a headless browser to programmatically load the page and dynamically read the page contents (this will add a lot of overhead, and should probably be avoided if possible)

I have checked the website: the data is loaded by JavaScript. With httpClient.GetStringAsync("url"); you can only get the HTML the server sends.
As far as I know, you cannot get the elements that the browser adds or changes afterwards this way.

Related

Blazor page's URL is interfering with component's call to API -- 400-Bad Request

My web-app is .Net 6, VS-2022 Blazor WASM-hosted.
I am having a problem with the URL for the API when being called from a component.
I have three files: page: VehicleList.razor, page: VehicleEdit.Razor and component: CRUD_Vehicle.razor.
The page flow is the user navigates to 'VehicleList' and has the ability to select a vehicle to EDIT via callback function. Once selected, the flow navigates to 'VehicleEdit' where some data needs to be passed to support the embedded sub-component 'CRUD_Vehicle'.
The problem is that URL to the 'VehicleEdit' looks like this with query-string data in a comma-delimited string AND it remains there when the CRUD-component is showing.
https://localhost:7777/vehicleedit/667?par=17,Bigelow,Active
When the user makes editing changes and SUBMITs the component, the HttpClient service gets called and fails with a 400-Bad Request. I am under the impression that the query-string interferes with the API URL. Please see below. Is there a way I could set the URL to not show the query-string and only have the base-address and the UID_VEHICLE like this?
https://localhost:7777/vehicleedit/667
public async Task<Vehicle> UpdateVehicle(Vehicle pVehicle) {
// Save the edited record.
HttpResponseMessage response = await httpClient.PutAsJsonAsync<Vehicle>($"/api/vehicle/{pVehicle.UID_VEHICLE}", pVehicle);
if (!response.IsSuccessStatusCode) {
// problems handling here.
Console.WriteLine($"UpdateVehicle() Error occurred, the status code is: {(int)response.StatusCode}: {response.StatusCode}" );
}
return await response.Content.ReadFromJsonAsync<Vehicle>();
}
I have a similar construction for the editing of the Customer object, but the page that shows the CRUD_Customer component has a simple URL and works perfectly to save data to the DB. The main difference is the URL for the VehicleEdit.razor. Your comments are welcome.
Thanks.
https://localhost:7777/customeredit/17

I downloaded Postman and entered the PUT transaction and the PUT RESULT showed there was an ERROR (based on DataAnnotation for the 'StartDate'-field) that was NOT caught at the client validation code. It turns out that the TITLE text was based on what the possible cause might be. But Postman showed me the exact reason why the PUT call failed.
So, my 'answer' is: Use Postman to simulate the 'httpClient.PutAsJsonAsync'-call to observe the result-details to pinpoint the reason for the '400-Bad Request' return from the service.
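The same detail Postman revealed can usually be logged from the client as well, by reading the response body instead of only the status code. A minimal sketch, assuming the server puts the validation details into the body of the 400 response (as it apparently did here); HttpDiagnostics and DescribeFailureAsync are hypothetical names.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public static class HttpDiagnostics
{
    // Builds a one-line description of a failed response: numeric code, status name, and body text.
    public static async Task<string> DescribeFailureAsync(HttpResponseMessage response)
    {
        string body = response.Content == null
            ? ""
            : await response.Content.ReadAsStringAsync();
        return $"{(int)response.StatusCode} {response.StatusCode}: {body}";
    }
}

// Possible use in UpdateVehicle() from the question:
//
//   if (!response.IsSuccessStatusCode) {
//       Console.WriteLine(await HttpDiagnostics.DescribeFailureAsync(response));
//   }
```

In UpdateVehicle() it would also make sense to return or throw inside that branch, since an error body generally will not deserialize into a valid Vehicle.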

Parse page using AngleSharp

I want to parse a website using C# with AngleSharp. It's easy to do with static pages, but there is a problem: I can't parse information that is available only to authorized users. What should I do to log in to the website programmatically and parse all the information available to me?

Depending on the used authorization scheme this may either be super simple or ultra hard / impossible.
So let's first visit what can be done with AngleSharp:
Any kind of requests incl. their manipulation (on request, but also before response)
General cookie management (and their manipulation, of course)
Querying the DOM and perform "simple" actions (e.g., clicking a button, submitting a form)
Running trivial JavaScript files
Here trivial means: Scripts that do not need any capabilities beyond what AngleSharp offers, e.g., rendering tree information, advanced CSSOM access, ... - or scripts that require non-ES5 compliant parsers (e.g., make use of ES6 or some special non-standard capabilities).
Now since I do not know what is the authorization scheme or exact problem that you are hitting (some code / MWE would be helpful!) I'll just go for a simple click example.
var context = BrowsingContext.New(Configuration.Default.WithDefaultLoader().WithCookies());
var loginPage = await context.OpenAsync("http://yourpage.com");
var loginForm = loginPage.QuerySelector<IHtmlFormElement>("form");
var profilePage = await loginForm.SubmitAsync(new { userName = "myUser", password = "password" });
// get something on profilePage
Note that in this example the form field names for the login form are userName and password - they may be different for your login page. Also note that your page may contain multiple forms and the selector could be more sophisticated than a simple form.
HTH!

How can I make a windows app in C# that for example automatically follows someone on Twitter when I open it up?

If I were to follow USA Today on Twitter with a JavaScript code I could go to their page and paste this simple code in the console:
follow();
function follow(){
var button = document.getElementsByClassName("EdgeButton EdgeButton--secondary EdgeButton--medium button-text follow-text")[0];
button.click();
}
How can I trigger this code on a specific URL in C# (only after the page loaded)?
So what I'm trying to ask is how can I trigger a button in C# remotely, without actually visiting the page?
So it could be a simple console application too where I simply
// Verbatim string: inner double quotes are escaped by doubling them.
string jsCode = @"follow();
function follow(){
var button = document.getElementsByClassName(""EdgeButton EdgeButton--secondary EdgeButton--medium button-text follow-text"")[0];
button.click();
}";
string url = "https://twitter.com/";
List<string> pagesToFollow = new List<string>();
pagesToFollow.Add("USATODAY");
pagesToFollow.Add("RT_America");
pagesToFollow.Add("Reuters");
foreach (var s in pagesToFollow) whateverMethodGoesHere(url + s, jsCode);
This is pretty much what the app would look like, but I have no idea how to execute JavaScript code remotely, and I have yet to find a way to make the code wait until the page has loaded.

Considering it a general question as you request, a general answer would be something like this:
Use a proxy, like Fiddler, and study the HTTP requests done while performing the task you want to emulate. Then, replay the requests from your program.
Embed a headless browser like Chromium and control it from your code.
Use Selenium. There is a binding package in NuGet.
... and many more, I suppose. This question is really open to speculation.

How to get data from a client on load?

I'm aware that data can be passed in through the URL, like "example.com/thing?id=1234", or it can be passed in through a form and a "submit" button, but neither of these methods will work for me.
I need to get a fairly large xml string/file. I need to parse it and get the data from it before I can even display my page.
How can I get this on page load? Does the client have to send a http request? Or submit the xml as a string to a hidden form?
Edit with background info:
I am creating a widget that will appear in my customer's application, embedded using C# WebBrowser control, but will be hosted on my server. The web app needs to pass some data (including a token for client validation) to my widget via xml, and this needs to be loaded in first thing when my widget starts up.

ASP.NET MVC 4 works great with jQuery and Ajax posts. I have accomplished this goal many times by taking advantage of this.
jQuery:
$(document).ready(function() {
$.ajax({
type: "POST",
url: "/{controller}/{action}/",
data: { clientToken: '{token}', foo: 'bar' },
success: function (data, text) {
//APPEND YOUR PAGE WITH YOUR PARSED XML DATA
//NOTE: 'data' WILL CONTAIN YOUR RETURNED RESULT
}
});
});
MVC Controller:
[HttpPost]
public JsonResult jqGetXML(string clientToken, string foo)
{
JsonResult jqResult = new JsonResult();
//GET YOUR XML DATA AND DO YOUR WORK
jqResult.Data = null; // REPLACE null WITH WHATEVER YOU WANT TO RETURN
return jqResult;
}
Note: This example returns Json data (easier to work with IMO), not XML. It also assumes that the XML data is not coming from the client but is stored server-side.
EDIT: Here is a link to jQuery's Ajax documentation,
http://api.jquery.com/jQuery.ajax/

Assuming you're using ASP.NET, since you say it's generated by another page, just stick the XML in the Session state.

Another approach, not sure if it helps in your situation.
If you share the second level domain name on your two sites (i.e. .....sitename.com ) then another potential way to share data is you could have them assert a cookie at this 2nd level with the token and xml data in it. You'll then be provided with this cookie.
I've only done this to share authentication details, you need to share machine keys at a minimum to support this (assuming .Net here...).

You won't be able to automatically upload a file from the client to the server - at least not via a browser using html/js/httprequests. The browser simply will not allow this.
Imagine the security implications if browsers allowed you to silently upload a file from the client's local machine without their knowledge.

Sample solution:
A background process imports the XML file and parses it. The background process knows it is for customer YYY and updates their information so it knows the XML file has been processed.
A visitor goes to the customer's web application where the widget is embedded. In the markup of the widget the customer token has been added. This could be in JavaScript, Flash, iFrame, etc.
When the widget loads, it makes a request to your app, which checks whether the file was parsed for the provided customer (YYY). If it has been, show the page/widget.

If the XML is being served via HTTP, you can use LINQ to XML to parse the data.
Example:
using System;
using System.Text;
using System.Xml.Linq;

public partial class Sample : System.Web.UI.Page
{
protected void Page_Load(object sender, EventArgs e)
{
string url = "http://news.yahoo.com/rss/";
var el = XElement.Load(url).Elements("channel");
StringBuilder output = new StringBuilder();
foreach (var c in el.Elements())
{
switch (c.Name.LocalName.ToLower())
{
case "title":
output.Append(c.Value);
output.Append("<br />");
break;
}
}
this.Label1.Text = output.ToString();
}
}

It is not exactly clear what the application is, what options you have, or how much control you have over the web server.
If you own the web server/application, your options are much wider. You can first send the file to the web server with HTTP POST or PUT, including a random token, and then use the same token in the query string of the GET request.
Otherwise, use options that also apply to third-party websites:
If you are trying to consume some auth API, learn more about it. Since you are hosting the WebBrowser control, you have plenty of options to script it, including loading a form, setting a textarea or hidden field to your XML, and then simulating a submit button click. You can then respond to any redirects and HTML responses.
You can also inject JavaScript into the page that sends the XML to the server with an Ajax request.
The choice depends heavily on the interaction model.
If you need better advice, it would help most if you provided a sample or simplified URL pattern, the form content, and the sequence of events expected of you from a code/API/SDK perspective. They are usually quite friendly.

There are a limited number of ways to pass data between pages. Personally, I would keep it in the session while the generating page runs and clear it once the receiving page has retrieved it.
If it is generated server side then there is no reason to retrieve it from client side.
http://msdn.microsoft.com/en-us/library/6c3yckfw(v=vs.100).aspx

Create a webservice that your C# app can POST the XML to and get back HTML in response. Load this HTML string into the WebBrowser control rather than pointing the control to a URL.
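A rough sketch of the client side of that idea. The service URL, the "application/xml" content type, and the names WidgetClient and FetchWidgetHtmlAsync are assumptions for illustration; the thread does not specify any of them.

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

public static class WidgetClient
{
    // POSTs the XML to the service and returns the HTML from the response body.
    public static async Task<string> FetchWidgetHtmlAsync(HttpClient httpClient, Uri serviceUrl, string xml)
    {
        using (var content = new StringContent(xml, Encoding.UTF8, "application/xml"))
        using (HttpResponseMessage response = await httpClient.PostAsync(serviceUrl, content))
        {
            // Throws HttpRequestException for non-2xx responses.
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
    }
}

// Possible use in the host application (control name and URL are assumptions):
//
//   string html = await WidgetClient.FetchWidgetHtmlAsync(
//       httpClient, new Uri("https://example.com/widget"), xml);
//   webBrowser1.DocumentText = html;
```

Because the HTML is loaded as a string, relative links inside it have no base URL to resolve against, so the service would probably need to emit absolute URLs for scripts and images.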

Why Response.Write behavior varies in the given scenario?

When I POST the page using the following code, Response.Write("Hey") doesn't write the content ("Hey") to the parent page:
<form method="post" name="upload" enctype="multipart/form-data"
action="http://localhost:2518/Web/CrossPage.aspx" >
<input type="file" name="filename" />
<input type="submit" value="Upload Data File" name="cmdSubmit" />
</form>
But when I use the following code and POST the data, the output of Response.Write("Hey") can be obtained in the parent page:
HttpWebRequest requestToSender = (HttpWebRequest)WebRequest.Create("http://localhost:2518/Web/CrossPage.aspx");
requestToSender.Method = "POST";
requestToSender.ContentType = "multipart/form-data";
HttpWebResponse responseFromSender = (HttpWebResponse)requestToSender.GetResponse();
string fromSender = string.Empty;
using (StreamReader responseReader = new StreamReader(responseFromSender.GetResponseStream()))
{
fromSender = responseReader.ReadToEnd();
}
In CrossPage.aspx I have the following code:
if (!Page.IsPostBack)
{
NameValueCollection postPageCollection = Request.Form;
foreach (string name in postPageCollection.AllKeys)
{
Response.Write(name + " " + postPageCollection[name]);
}
HttpFileCollection postCollection = Request.Files;
foreach (string name in postCollection.AllKeys)
{
HttpPostedFile aFile = postCollection[name];
aFile.SaveAs(Server.MapPath(".") + "/" + Path.GetFileName(aFile.FileName));
}
Response.Write("Hey");
}
I don't have any code in the Page_Load event of the parent page.
What could be the cause? I need to write the "Hey" to the parent page using the first scenario. The two applications are on different domains.
Edit: "Hey" would come from CrossPage.aspx. I need to write this back to the parent page.
When I post using the form action, after processing the Page_Load() event in CrossPage.aspx, the URL points to "http://localhost:2518/Web/CrossPage.aspx", which means the application is still in CrossPage.aspx and didn't move to the parent page.

You almost hit on the reason yourself:
When I post using the form action, after processing the Page_Load() event in CrossPage.aspx, the URL points to "http://localhost:2518/Web/CrossPage.aspx", which means the application is still in CrossPage.aspx and didn't move to the parent page.
Your form takes the user to CrossPage.aspx, so the parent page is gone, now the previous page in the user's history.
It sounds like you are trying to do some sort of asynchronous file upload. Try looking for AJAX file upload examples, something like this: How can I upload files asynchronously?

It is probably because you have the code in an
if (!Page.IsPostBack) block. This code is executed only when the page is not loaded on a post-back.
(HttpWebResponse)requestToSender.GetResponse(); will trigger a GET request, which is why your code works when you call CrossPage.aspx that way.

You are treating the page like a service. E.G. Start at ParentPage.aspx > pass data to ServicePage.aspx for processing > write response back to ParentPage.aspx for display.
You got it to work with C# by passing the duty back to the server, where state can easily be maintained while crossing page boundaries. It's not so simple when you try to solve the problem without C#. This isn't a WinForms app. As NimsDotNet pointed out, changing the method to "get" will get you closer, but you will get redirected to CrossPage.aspx and lose the calling page.
You said you are on "different domains". By this I think you mean you're using two different IIS servers. The C# solution should still work in this scenario, just as you've shown. Just add a Response.Write() of the fromSender object. There's nothing you've told us that makes this technically impossible. Still, if you want a client-side solution, you could use JavaScript to make the GET request without getting redirected.
This post shows you how to make a get request with JQuery.

I suspect your aFile.SaveAs(Server.MapPath(".") + "/".... call is throwing an exception. Try commenting it out and testing again.
UPDATE:
I'm guessing it works from an HttpWebRequest because no file is being posted, so the file loop is skipped. When posting from the HTML form you have a file input, so the file loop runs and the save logic is executed. That again leads me to think the problem is in your save logic.
Also, I think you have a try/catch statement wrapped around all of this that swallows the exception, so you have no idea what is going wrong. If that's the case, don't do that. You rarely ever want to catch a general exception.
After a quick glance at your save logic: replace "/" with @"\". Server.MapPath(".") returns the path using backslashes, not forward slashes.
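To exercise the file loop from C# as well, the request has to carry a real multipart body; the HttpWebRequest snippet in the question only sets the content type and sends no file. A minimal sketch using HttpClient instead, where UploadRequest and Build are hypothetical names and the field names are copied from the HTML form in the question:

```csharp
using System;
using System.Net.Http;
using System.Text;

public static class UploadRequest
{
    // Builds a multipart/form-data body equivalent to submitting the question's HTML form.
    public static MultipartFormDataContent Build(byte[] fileBytes, string fileName)
    {
        var form = new MultipartFormDataContent();
        // "filename" matches the name attribute of the <input type="file"> element.
        form.Add(new ByteArrayContent(fileBytes), "filename", fileName);
        // "cmdSubmit" matches the submit button, so it shows up in Request.Form.
        form.Add(new StringContent("Upload Data File"), "cmdSubmit");
        return form;
    }
}

// Possible use (URL taken from the question):
//
//   var response = await httpClient.PostAsync(
//       "http://localhost:2518/Web/CrossPage.aspx",
//       UploadRequest.Build(Encoding.UTF8.GetBytes("hello"), "data.txt"));
```

MultipartFormDataContent generates the boundary and sets the Content-Type header itself, so nothing has to be assembled by hand.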

I would suggest changing CrossPage.aspx into an .ashx HttpHandler. Page.IsPostBack may not be working correctly because it expects ViewState and the other hidden ASP.NET form fields that tell it the request is a post-back from an ASP.NET form. You also don't need to go through the whole page life cycle that ASP.NET Web Forms runs.
