How to integrate HTML markup from another URL in C#

I have an .aspx file that loads HTML markup. This markup contains a div element that acts as a container for additional HTML markup retrieved from another URL. A code snippet looks like this:
<div id="container">
<%= RetrieveIntegrationMarkup() %>
</div>
What is the best way to retrieve the markup in RetrieveIntegrationMarkup()? Currently we use a workaround to accept self-signed SSL certificates, but it only works in our test environments; it fails in production.
I don't know if this will help, but here's the snippet of the said method:
HttpWebRequest.DefaultCachePolicy = new HttpRequestCachePolicy(HttpRequestCacheLevel.Revalidate);
ServicePointManager.CertificatePolicy = new MyPolicy(); // workaround to accept self-signed certificates

Uri serviceUri = new Uri(integrationUrl, UriKind.Absolute);
HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(serviceUri);
HttpWebResponse response = (HttpWebResponse)webRequest.GetResponse();

string markup;
using (var sr = new StreamReader(response.GetResponseStream()))
{
    markup = sr.ReadToEnd();
}
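For reference, here is a minimal self-contained sketch of the same retrieval using ServerCertificateValidationCallback, the non-obsolete replacement for ServicePointManager.CertificatePolicy. This is an illustration only, not the code we currently run, and accepting every certificate like this should really be limited to test environments:
using System;
using System.IO;
using System.Net;
using System.Net.Cache;
using System.Net.Security;

public static class IntegrationClient
{
    // Sketch of RetrieveIntegrationMarkup; "integrationUrl" is assumed to come
    // from configuration, as in the original snippet.
    public static string RetrieveIntegrationMarkup(string integrationUrl)
    {
        HttpWebRequest.DefaultCachePolicy =
            new HttpRequestCachePolicy(HttpRequestCacheLevel.Revalidate);

        // Accept the self-signed certificate. In production you would normally install
        // a trusted certificate, or at least restrict this check to a known thumbprint
        // instead of returning true unconditionally.
        ServicePointManager.ServerCertificateValidationCallback =
            (sender, certificate, chain, sslPolicyErrors) => true;

        var serviceUri = new Uri(integrationUrl, UriKind.Absolute);
        var webRequest = (HttpWebRequest)WebRequest.Create(serviceUri);

        using (var response = (HttpWebResponse)webRequest.GetResponse())
        using (var sr = new StreamReader(response.GetResponseStream()))
        {
            return sr.ReadToEnd();
        }
    }
}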
Thanks!

Related

Scrape data from web page with HtmlAgilityPack c#

I previously had a problem scraping data from a web page, for which I got a solution:
Scrape data from web page that using iframe c#
My problem is that they changed the web page, which is now https://webportal.thpa.gr/ctreport/container/track. I don't think it uses iframes any more, and I cannot get any data back.
Can someone tell me if I can use the same method to get data from this web page, or should I use a different approach?
I don't know how @coder_b found that I should use https://portal.thpa.gr/fnet5/track/index.php as the URL, and that I should use
var reqUrlContent = hc.PostAsync(url,
        new StringContent($"d=1&containerCode={reference}&go=1", Encoding.UTF8,
            "application/x-www-form-urlencoded"))
    .Result;
to pass the variables
EDIT: When I inspect the web page, there is an input which contains the number:
input type="text" id="report_container_containerno"
name="report_container[containerno]" required="required"
class="form-control" minlength="11" maxlength="11" placeholder="E/K
για αναζήτηση" value="ARKU2215462"
Can I use something like this to pass the value with HtmlAgilityPack? Then it should be easy to read the result.
Also, when I check the DocumentNode, it seems to show me the cookie-consent page that I am supposed to accept.
Can I bypass it or auto-accept the cookies?
Try this:
public static string Download(string search)
{
    var request = (HttpWebRequest)WebRequest.Create("https://webportal.thpa.gr/ctreport/container/track");

    // Form data, URL-encoded ("[" and "]" become %5B and %5D)
    var postData = string.Format("report_container%5Bcontainerno%5D={0}&report_container%5Bsearch%5D=", search);
    var data = Encoding.ASCII.GetBytes(postData);

    request.Method = "POST";
    request.ContentType = "application/x-www-form-urlencoded";
    request.ContentLength = data.Length;

    using (var stream = request.GetRequestStream())
    {
        stream.Write(data, 0, data.Length);
    }

    using (var response = (HttpWebResponse)request.GetResponse())
    using (var reader = new StreamReader(response.GetResponseStream()))
    {
        return reader.ReadToEnd();
    }
}
Usage:
var html = Download("ARKU2215462");
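From there, HtmlAgilityPack can load the returned string directly. A minimal sketch (the XPath below is only an assumption about the result page's structure, not taken from the real response, so adjust it after inspecting the HTML):
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html); // "html" is the string returned by Download()

// Hypothetical selector: point it at whatever table/element the real response contains.
var rows = doc.DocumentNode.SelectNodes("//table//tr");
if (rows != null)
{
    foreach (var row in rows)
    {
        Console.WriteLine(row.InnerText.Trim());
    }
}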
UPDATE
To find the post parameters to use, press F12 in the browser to open dev tools, then select the Network tab. Now fill the search input with your ARKU2215462 and press the button.
That makes a request to the server to get the response, and in dev tools you can inspect both the request and the response. There are lots of requests (styles, scripts, images...), but you want the HTML document request.
In its form data, if you click "view source", you get the data encoded as "report_container%5Bcontainerno%5D=ARKU2215462&report_container%5Bsearch%5D=", which is what you need in your code.
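As for the cookie-consent page: HttpWebRequest lets you attach a CookieContainer, and if the site only needs a consent cookie you can pre-seed it. The cookie name and value below are purely hypothetical; check the real ones in the browser's dev tools:
var cookies = new CookieContainer();

// Hypothetical consent cookie; inspect the real name/value in dev tools first.
cookies.Add(new Cookie("cookie_consent", "accepted", "/", "webportal.thpa.gr"));

var request = (HttpWebRequest)WebRequest.Create("https://webportal.thpa.gr/ctreport/container/track");
request.CookieContainer = cookies;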

How can I scrape a table that is created with JavaScript in c#

I am trying to get a table from the web page https://www.belastingdienst.nl/rekenhulpen/wisselkoersen/ using HtmlAgilityPack.
My code so far is
WebClient webClient = new WebClient();
string page = webClient.DownloadString("https://www.belastingdienst.nl/rekenhulpen/wisselkoersen/");

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(page);

List<List<string>> table = doc.DocumentNode.SelectSingleNode("//table[@class='list_result Result']")
    .Descendants("tr")
    .Skip(1)
    .Where(tr => tr.Elements("td").Count() > 1)
    .Select(tr => tr.Elements("td").Select(td => td.InnerText.Trim()).ToList())
    .ToList();
My problem is that the web page builds the table with JavaScript, so when I try to read it I get a null reference exception: the downloaded page only tells me that I must enable JavaScript.
I also tried the "GET" method:
string Url = "https://www.belastingdienst.nl/rekenhulpen/wisselkoersen/";
HttpWebRequest myRequest = (HttpWebRequest)WebRequest.Create(Url);
myRequest.Method = "GET";
WebResponse myResponse = myRequest.GetResponse();
StreamReader sr = new StreamReader(myResponse.GetResponseStream(), System.Text.Encoding.UTF8);
string result = sr.ReadToEnd();
sr.Close();
myResponse.Close();
with the same results.
I already enabled JavaScript in Internet Explorer and changed the registry as well:
if (Environment.Is64BitOperatingSystem)
    Regkey = Microsoft.Win32.Registry.LocalMachine.OpenSubKey(@"SOFTWARE\Wow6432Node\Microsoft\Internet Explorer\MAIN\FeatureControl\FEATURE_BROWSER_EMULATION", true);
else // For 32-bit machines
    Regkey = Microsoft.Win32.Registry.LocalMachine.OpenSubKey(@"SOFTWARE\Microsoft\Internet Explorer\Main\FeatureControl\FEATURE_BROWSER_EMULATION", true);
If I use a WebBrowser component I can see the web page without a problem, but I still can't get the table into a list.
F12 is your friend in any browser.
Select the Network tab and you'll notice that all of the info is in this file:
https://www.belastingdienst.nl/data/douane_wisselkoersen/wks.douane.wisselkoersen.dd201806.xml
(I suppose the data for July 2018 will be held in a URL named *.dd201807.xml.)
Using C#, you will need to do a GET for that URL and parse it as XML; there is no need for HtmlAgilityPack. Concatenate the current year and month to pick the right URL.
Leuker kan ik het niet maken! ("I can't make it any more fun than this!")
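A minimal sketch of that approach (the file-name pattern is inferred from the June 2018 URL above and is an assumption, so verify it before relying on it):
using System;
using System.Net;
using System.Xml;

class WisselkoersenExample
{
    static void Main()
    {
        // Build the URL for the current year/month, e.g. ...dd201806.xml for June 2018.
        string yearMonth = DateTime.Now.ToString("yyyyMM");
        string url = "https://www.belastingdienst.nl/data/douane_wisselkoersen/wks.douane.wisselkoersen.dd" + yearMonth + ".xml";

        using (var client = new WebClient())
        {
            string xml = client.DownloadString(url);

            var doc = new XmlDocument();
            doc.LoadXml(xml);

            // Element names are not documented here; dump the nodes and inspect the structure.
            foreach (XmlNode node in doc.DocumentElement.ChildNodes)
            {
                Console.WriteLine(node.OuterXml);
            }
        }
    }
}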
WebClient is an HTTP client, not a web browser, so it won't execute JavaScript. What is needed is a headless web browser. See this page for a list of headless web browsers; I have not tried any of them, so I cannot give you a recommendation here:
Headless browser for C# (.NET)?

Can't parse body of page

I am trying to parse some href values from a page. My code looks like this:
WebClient webClient = new WebClient();
string htmlCode = webClient.DownloadString("https://www.firmy.cz/Auto-moto");

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlCode);

HtmlNodeCollection collection = doc.DocumentNode.SelectNodes("//div[contains(@class,'companyWrap')]");
string target = "";
foreach (HtmlNode link in collection)
{
    target = target + "\n" + link.Attributes["href"].Value;
}
On this page my doc.ParsedText has no body content (<body id="root" class="root"></body> is empty), but if I open the page in a browser I can see the body's elements. Can you tell me where the problem is?
If you view the source of the URL you are trying to parse (https://www.firmy.cz/Auto-moto), you can see that the body is empty.
It seems like the page loads its content through JavaScript on the client side, so the content will not be available for you to parse.

Get HTML data over a GET request in C#

I made a little API in PHP that returns some user information after a successful login. The information is returned as HTML with paragraph IDs. Here's an example of the returned HTML:
<body>
<p id="msg">Successful login</p>
<p id="uid">1</p>
<p id="username">Joey</p>
<p id="email">Test#gmail.com</p>
<p id="hwid"></p>
<p id="funds">0</p>
</body>
So I want to post the login data to the API and read the information by HTML ID.
The API:
api.php?set=login&username={USER}&password={PASS}
First up, I'd suggest using JSON instead of HTML for this. PHP has json_encode and json_decode, and you can add the JSON.NET NuGet package to deserialize on your end very easily.
echo json_encode($resultObject);
and then in C#:
JsonConvert.DeserializeObject<ResultType>(downloadedString)
Then all you need to do is look into HttpWebRequest and WebRequest to download that string from your API.
That would look something like:
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://url/api");
HttpWebResponse response = (HttpWebResponse)request.GetResponse();

using (StreamReader reader = new StreamReader(response.GetResponseStream()))
{
    String downloadedString = reader.ReadToEnd();
}
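If you do switch the API to JSON as suggested, deserializing that string could look like this (ResultType is a hypothetical class mirroring the fields the PHP script returns):
using System;
using Newtonsoft.Json;

// Hypothetical shape of the PHP result object; adjust it to match what your API actually returns.
public class ResultType
{
    public string Msg { get; set; }
    public int Uid { get; set; }
    public string Username { get; set; }
    public string Email { get; set; }
    public string Hwid { get; set; }
    public decimal Funds { get; set; }
}

public static class LoginApiExample
{
    public static void Handle(string downloadedString)
    {
        // downloadedString is the JSON text read from the response above.
        ResultType result = JsonConvert.DeserializeObject<ResultType>(downloadedString);
        Console.WriteLine(result.Username);
    }
}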
Alternatively, you can process the HTML as XML:
XmlDocument doc = new XmlDocument();
doc.Load(response.GetResponseStream());

// XmlDocument.GetElementById only works when a DTD defines ID attributes,
// so select the element by XPath and read its InnerText instead.
String msg = doc.SelectSingleNode("//p[@id='msg']").InnerText;

Logging into a website using HttpWebRequest/Response in C#?

Now, first off, I want to understand whether it's better to use HttpWebRequest/Response or to simply use a WebBrowser control. Most people seem to prefer the WebBrowser control, but whenever I ask about it, I'm told that HttpWebRequest and Response are better. So, if this question can be avoided by switching to a WebBrowser control (and there's a good reason why it's better), please let me know!
Basically, I set up a test site, written in PHP, running on localhost. It consists of three files....
The first is index.php, which just contains a simple login form. All the session stuff is just me testing how sessions work, so it's not very well written; like I said, it's just for testing purposes:
<?php
session_start();
$_SESSION['id'] = 2233;
?>
<form method="post" action="login.php">
U: <input type="text" name="username" />
<br />
P: <input type="password" name="password" />
<br />
<input type="submit" value="Log In" />
</form>
Then, I have login.php (the action of the form), which looks like:
<?php
session_start();
$username = $_POST['username'];
$password = $_POST['password'];
if ($username == "username" && $password == "password" && $_SESSION['id'] == 2233)
{
header('Location: loggedin.php');
die();
}
else
{
die('Incorrect login details');
}
?>
And lastly, loggedin.php just displays "Success!" (using the element).
As you can see, a very simple test, and many of the things I have there are just for testing purposes.
So, then I go to my C# code. I created a method called "HttpPost". It looks like:
private static string HttpPost(string url)
{
    // request, response, cookies, userAgent and keepAlive are presumably fields on the containing class.
    request = HttpWebRequest.Create(url) as HttpWebRequest;
    request.CookieContainer = cookies;
    request.UserAgent = userAgent;
    request.KeepAlive = keepAlive;
    request.Method = "POST";

    response = request.GetResponse() as HttpWebResponse;
    if (response.StatusCode != HttpStatusCode.Found)
        throw new Exception("Website not found");

    StreamReader sr = new StreamReader(response.GetResponseStream());
    return sr.ReadToEnd();
}
I built a Windows Forms application, so in the button's Click event I want to add the code that calls the HttpPost method with the appropriate URL. However, I'm not really sure what I'm supposed to put there to make it log in.
Can anyone help me out? I'd also appreciate some general pointers on programmatically logging into websites!
Have you considered using WebClient?
It provides a set of high-level methods for use with web pages, including UploadValues, but I'm not sure whether that would work for your purposes.
Also, it's probably better not to use WebBrowser, as that's a full-blown web browser that can execute scripts and such; HttpWebRequest and WebClient are much more lightweight.
Edit: Login to website, via C#
Check this answer out, I think this is exactly what you're looking for.
Relevant code snippet from above link :
var client = new WebClient();
client.BaseAddress = @"https://www.site.com/any/base/url/";

var loginData = new NameValueCollection();
loginData.Add("login", "YourLogin");
loginData.Add("password", "YourPassword");

client.UploadValues("login.php", "POST", loginData);
You should use something like the WCF Web API HttpClient (now System.Net.Http.HttpClient). It's much easier.
The following code is written off the top of my head, but it should give you the idea.
using (var client = new HttpClient())
{
    var data = new Dictionary<string, string> { { "username", "username_value" }, { "password", "the_password" } };
    var content = new FormUrlEncodedContent(data);

    // PostAsync is the System.Net.Http.HttpClient equivalent of the old synchronous Post.
    var response = client.PostAsync("http://yourdomain/login.php", content).Result;
    if (response.StatusCode == HttpStatusCode.OK)
    {
        // logged in successfully
    }
}
