I have an ASP.NET controller with an action that simply returns the output of a WebClient as a ContentResult.
CNN is my testbed: my WebClient downloads the CNN page as a string, puts it into its content, and the action returns the result as an ActionResult/ContentResult.
I'm calling a RenderPartial on this action, to get CNN to display "below" and to the right of my current content, in its own box.
The problem I'm running into is that when you click a link on CNN, it sometimes points to a "relative" URL, and that relative URL doesn't exist on my localhost and won't exist on my web server, so the link fails.
When it does redirect to an absolute url, that's also a problem, as the resulting page "replaces" my .NET shell.
What I need is to take any URLs inside the ContentResult I returned and, when one of them is clicked, pass it back to my .NET application to be downloaded by the WebClient and rendered back in that shell.
I'm aware I could use an IFRAME instead of WebClient, but an IFRAME is impossible in my case: it's restricted by the public API we're consuming, it's restricted or flawed in the framework that API is servicing, and it's blocked on some of our client machines, which we have no control over.
Also, the client machines will be off our network, so I can't use an AJAX load, as that'd be a cross-site request (blocked by the browser's same-origin policy).
One idea I can think of is to create custom routes, with filters/rules that look for "CNN"-like links and pass them to my controller as a parameter; my controller would then render my page and pass those links to the WebClient.
This would obviously be a lot of work, and I'm not even sure where to start with it in the routing engine.
The only other thing I can think of that might work is checking the validity of each URL by seeing whether it points to a valid link. If it doesn't, I'd add CNN's URL prefix to it and check whether that is then a valid link. But that seems like a lot of work for a solution that would be a heavy hit on performance, as checking each link on a page like CNN could be very costly. Also, since I'd be requesting each link twice, that'd be a problem where POST operations were required, as it'd essentially run each operation twice, and since this is a public API I'm consuming off a public framework, I wouldn't have control of the source code to safeguard against that.
Any other suggestions?
Is there any way to accomplish what I'm trying to do easily?
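For what it's worth, the "rewrite the links before returning the content" idea can be sketched roughly as below. This is only a sketch under assumptions: LinkRewriter, Rewrite and the /Shell/Browse path are made-up names, the regex only handles quoted href attributes (it will also touch things like stylesheet links), and a real HTML parser would be more robust. It resolves each href against the remote site's base URL and points it back at an action in your own app, with the real target passed as a query parameter.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinkRewriter
{
    // Matches href="..." or href='...'; unquoted attribute values are ignored.
    private static readonly Regex Href = new Regex(
        "href\\s*=\\s*([\"'])(?<url>.*?)\\1",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // baseUri: the page the HTML was downloaded from.
    // proxyPath: hypothetical route in your own app that re-downloads the target.
    public static string Rewrite(string html, Uri baseUri, string proxyPath)
    {
        return Href.Replace(html, m =>
        {
            Uri absolute;
            // Resolves relative links against the base; absolute links pass through.
            if (!Uri.TryCreate(baseUri, m.Groups["url"].Value, out absolute))
                return m.Value;

            // Leave javascript:, mailto: and similar links untouched.
            if (absolute.Scheme != Uri.UriSchemeHttp && absolute.Scheme != Uri.UriSchemeHttps)
                return m.Value;

            return "href=\"" + proxyPath + "?url=" +
                   Uri.EscapeDataString(absolute.AbsoluteUri) + "\"";
        });
    }
}
```

The action behind proxyPath would then read the url parameter, download it with the WebClient, run the result through Rewrite again, and return it inside the shell. This avoids probing each link for validity, so nothing gets requested twice.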
I am building onto a massive Razor website, which I cannot re-architect. I need to use AngularJs on the client side, and when the page is loaded there is a little bit of server-side preprocessing that needs to be done before the page is rendered.
I need to pass a parameter to C# via the URL query string, and I need that same parameter on the JavaScript side. Currently, if I use this URL:
http://localhost:32289/razorPage.cshtml#/?param="1234"
I can get that value on the JavaScript side, but when I call
var queryString = HttpContext.Current.Request.QueryString;
on the server side, it's empty. Additionally, if I use this URL:
http://localhost:32289/razorPage.cshtml/?param="1234"#/
I can access the query string on the server side, but then JavaScript goes nuts, as though I was continuously rerunning the code in my Angular controller. I get this in the Chrome console:
Synchronous XMLHttpRequest on the main thread is deprecated because of its detrimental effects to the end user's experience. For more help, check https://xhr.spec.whatwg.org/.
And eventually an error that says the Maximum call stack size has been reached. If I put a console.log() in that Angular controller, it logs continuously until the call stack max size is reached.
This is my razorPage.cshtml:
@{
Page.Title = "OCE Visualizer";
Page.IsDetailedView = false;
Page.IsCapacity = false;
Page.IsEmbedded = false;
InitProvider.Init();
}
<html>
<!--...AngularJs App lives in here-->
</html>
The init method (which populates some data folders on the server so they can be served to Angular) uses parameters from the query string, as does the AngularJs app, which also manages its own routing.
I'm relatively new to Razor, but I am familiar with AngularJs. I think part of the problem could be because of the way .NET manages routing, which could be messing with how Angular can do it. I am aware of this and this SO answers, but they apply to an MVC app, where mine is just a website with a lot of .cshtml pages, no Controllers or APIs.
Is there a way to access query strings in both Angular and Razor C#, while maintaining AngularJs routing with "pretty" URLs?
Ok I figured out a solution. I'll post it if people in the future have this problem, or if it's a bad answer and people can fix it.
I realized that the main difference between the two URLs in my question was the location of the #, which marks the start of the fragment identifier. I read about it here, but I should caution that that page is almost 20 years old. Anyway, I found that the fragment part of the URL does NOT get sent to the server, which is why I couldn't parse the query string server side when it was after the #. I don't know why JavaScript was freaking out when the query was before the #, but I'm willing to believe it was a problem with my code.
The solution was to pass the query string on both sides of the #. Thus, the working URL looks like this:
http://localhost:32289/razorPage.cshtml?param="1234"#/?param="1234"
The query string to the left is what the C# can access, and the one on the right is what AngularJs can access. Additionally, anything after the # works like normal AngularJs routing, so I don't think that was related.
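To illustrate the split, here is a small sketch using System.Uri, which separates a URL into query and fragment the same way. UrlParts and its two methods are names I made up for the example, and I dropped the quotes around 1234 to keep escaping out of the picture.

```csharp
using System;

public static class UrlParts
{
    // The part the browser sends to the server (and that Request.QueryString is built from).
    public static string ServerQuery(string url)
    {
        return new Uri(url).Query;
    }

    // The part that stays in the browser and is only visible to client-side routing.
    public static string ClientFragment(string url)
    {
        return new Uri(url).Fragment;
    }
}
```

For http://localhost:32289/razorPage.cshtml#/?param=1234 the query is empty and the fragment is "#/?param=1234". For http://localhost:32289/razorPage.cshtml?param=1234#/?param=1234 the query is "?param=1234" and the fragment is unchanged, which matches what I saw on each side.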
We have a homebrewed advertising system on our website. Part of this includes code so that when an ad is clicked, we first go to an intermediary page that records the click data, which then redirects the user along to the desired advertiser's website.
Unfortunately, our current solution requires that a URL parameter be passed to the intermediary page that is the destination URL. Some savvy advertisers have discovered that they can use this for their own nefarious purposes and "launder" their traffic through our site. In other words, on their site, they have a link along the lines of www.oursite.com/redirect?URL=www.theirtargetsite.com, making it seem like that traffic is coming from our site.
I'm working on a solution that will only redirect to a whitelist of URLs, but my first problem is more just knowing what this is called. Finding alternative and probably better solutions is difficult when I don't even know what to call it. With so much spoofing, laundering, and hijacking going on, it's hard to find help for the right topic.
What is it called when website A redirects to website C through website B without the permission of B?
The term you're looking for is open redirect. The MITRE article on this class of vulnerability has some examples of ways that this can be mitigated, e.g.:
Whitelist the URLs that you will redirect to
Display a warning page before redirecting (probably not viable in your situation)
Use numbers to identify the URLs to redirect to (i.e., look them up in a table) instead of putting the target in a query parameter
Use an HMAC construction to "sign" URLs to redirect to, and reject redirects that don't have a valid signature
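As a rough illustration of the last option, here is a minimal sketch in C#. RedirectSigner, Sign and IsValid are hypothetical names, and the key is assumed to be a server-side secret that never appears in any page. The ad link carries both the target URL and its signature, and the intermediary page only redirects when the two match.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class RedirectSigner
{
    // Computes an HMAC-SHA256 over the target URL using a server-side secret key.
    public static string Sign(string targetUrl, byte[] key)
    {
        using (var hmac = new HMACSHA256(key))
        {
            byte[] mac = hmac.ComputeHash(Encoding.UTF8.GetBytes(targetUrl));
            return Convert.ToBase64String(mac);
        }
    }

    // Recomputes the signature and compares without exiting early on the first mismatch.
    public static bool IsValid(string targetUrl, string signature, byte[] key)
    {
        byte[] expected = Encoding.UTF8.GetBytes(Sign(targetUrl, key));
        byte[] actual = Encoding.UTF8.GetBytes(signature ?? "");

        int diff = expected.Length ^ actual.Length;
        for (int i = 0; i < expected.Length && i < actual.Length; i++)
            diff |= expected[i] ^ actual[i];
        return diff == 0;
    }
}
```

When you generate an ad link you would append the signature as a second parameter (passed through Uri.EscapeDataString, since Base64 can contain + and /). A third party who swaps in their own URL cannot produce a matching signature without the key.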
Similar questions have been asked about the nature of when to use POST and when to use GET in an AJAX request
Here:
What are the advantages of using a GET request over a POST request?
and here: GET vs. POST ajax requests: When and how to use either?
However, I want to make it clear that that is not exactly what I am asking. I get idempotence, sensitive data, the ability for browsers to be able to try again in the event of an error, and the ability for the browser to be able to cache query string data.
My real scenario is such that I want to prevent my users from being able to simply enter in the URL to my "Compute.cshtml" file (i.e. the file on the server that my jQuery $.ajax function posts to).
I am in a WebMatrix C#.net web-pages environment and I have tried prefixing the file name with an underscore (_), but apparently an AJAX request counts as exactly the kind of direct request the underscore is designed to block, so that, of course, breaks the request.
So if I use POST I can simply use this logic:
if (!IsPost) //if this is not a post...
{
Response.Redirect("~/"); //...redirect back to home page.
}
If I use GET, I suppose I can send additional data like a string containing the value "AccessGranted" and check it on the other side to see if it equals this value and redirect if not, but this could be easily duplicated through typing in the address bar (not that the data is sensitive on the other side, but...).
Anyway, I suppose I am asking if it is okay to always use POST to handle this logic or what the appropriate way to handle my situation is in regards to using GET or POST with AJAX in a WebMatrix C#.net web-pages environment.
My advice is, don't try to stop them. It's harmless.
You won't have direct links to it, so it won't really come up. (You might want your robots.txt to exclude the whole /api directory, for Google's sake).
It is data they have access to anyway (otherwise you need server-side trimming), so you can't be exposing anything dangerous or sensitive.
The advantages in using GETs for GET-like requests are many, as you linked to (caching, semantics, etc)
So what's the harm in having that url be accessible via direct browser entry? They can POST directly too, if they're crafty enough, using Fiddler "compose" for example. And having the GETs be accessible via url is useful for debugging.
EDIT: See sites like http://www.robotstxt.org/orig.html for lots of details, but a robots.txt that excluded search engines from your web services directory called /api would look like this:
User-agent: *
Disallow: /api/
Similar to IsPost, you can use IsAjax to determine whether the request was initiated by the XMLHttpRequest object in most browsers.
if(!IsAjax){
Response.Redirect("~/WhatDoYouThinkYoureDoing.cshtml");
}
It checks the request to see if it has an X-Requested-With header with the value of XMLHttpRequest, or if there is an item in the Request object with the key X-Requested-With that has a value of XMLHttpRequest.
One way to detect a direct AJAX call is to check for the presence of the Referer header (HTTP_REFERER). Directly typed URLs won't generate a referrer, but you still won't be able to differentiate the call from a simple anchor link.
(Just keep in mind that some browsers don't generate the header for XHR requests.)
I have an idea for an App that would really help me out in work but I'm not sure if it's possible.
I want to run a C# desktop application that will ask for a value. When a value is supplied, the application will open a browser, go to a webpage and add the value into a form on an online website. The form is then submitted and a new page is loaded that contains a table of results. I then want to extract the table of results from the page source and write code to parse the result values.
It is not important that the user sees this happen in an actual browser. In other words, if there's a way to do it by making the HTTP requests directly then that's great.
The biggest problem I have is getting the values into the form and then retrieving the page source after the form is submitted and the next page loads.
Any help really appreciated.
Thanks
Provided that you're only using this in a legal context:
Usually, web forms are sent via POST request to the web server, specifically some script that handles it. You can look at the HTML code for the form's page and find out the destination for the form (form's action).
You can then use an HttpWebRequest in C# to "pretend you are the form", sending a POST request with all the required parameters (form-encoded in the request body).
As a result you will get the source code of the destination page as it would be sent to the browser. You can parse this.
This is definitely possible and you don't need to use an actual web browser for this. You can simply use a System.Net.WebClient to send your HTTP request and get an HTTP response.
I suggest using Wireshark (or Firefox + Firebug); it allows you to see HTTP requests and responses. By looking at the HTTP traffic you can see exactly how you should build your HTTP request and which parameters you should be setting.
You don't need to involve the browser with this. WebClient should do all that you require. You'll need to see what's actually being posted when you submit the form with the browser, and then you should be able to make a POST request using the WebClient and retrieve the resulting page as a string.
The docs for the WebClient constructor have a nice example.
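A minimal sketch of that approach follows. The field name "searchTerm", the FormPoster class and the assumption that the response is UTF-8 are all mine; take the real field names and the action URL from the form's HTML or from the traffic you capture. WebClient.UploadValues form-encodes the fields and sends them in the body of the request.

```csharp
using System.Collections.Specialized;
using System.Net;
using System.Text;

public static class FormPoster
{
    // Builds the same name/value pairs the browser would submit.
    // "searchTerm" is a placeholder; use the form's real field names.
    public static NameValueCollection BuildFields(string value)
    {
        return new NameValueCollection { { "searchTerm", value } };
    }

    // Posts the fields to the form's action URL and returns the resulting page source.
    public static string Submit(string actionUrl, NameValueCollection fields)
    {
        using (var client = new WebClient())
        {
            byte[] responseBytes = client.UploadValues(actionUrl, "POST", fields);
            // Assumes the site answers in UTF-8; check the response's charset if not.
            return Encoding.UTF8.GetString(responseBytes);
        }
    }
}
```

Calling FormPoster.Submit(actionUrl, FormPoster.BuildFields(userValue)) gives you the results page as a string, which you can then parse for the table.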
See e.g. this question for some pointers on at least the data retrieval side. You're going to know a lot more about the http protocol before you're done with this...
Why would you do this through web pages if you don't even want the user to do anything?
Web pages are purely for interaction with users, if you simply want data transfer, use WCF.
@Brian using Wireshark will result in a very angry network manager, make sure you are actually allowed to use it.
I'm using a C# WebClient to post login details to a page and read all the results.
The page I am trying to load includes Flash (which, in the browser, translates into HTML). I'm guessing it's Flash to avoid being picked up by search engines???
The Flash I am interested in is just text (not an image/video, etc.) and when I "View Selection Source" in Firefox I do actually see the text, within HTML, that I want to see.
(Interestingly when I view the source for the whole page I do not see the text, within HTML, that I want to see. Could this be related?)
Currently after I have posted my login details, and loaded the HTML back, I see the page which does NOT show the flash HTML (as if I had viewed source for the whole page).
Thanks in advance,
Jim
PS: I should point out that the POST is actually working, my log in is successful.
Fiddler (or similar tool) is invaluable to track down screen-scraping problems like this. Using a normal browser and with fiddler active, look at all the requests being made as you go through the login and navigation process to get to the data you want. In between, you will likely see one or more things that your code is doing differently which the server is responding to and hence showing you different HTML than a real client.
The list of stuff below (think of it as "scraping 101") is what you want to look for. Most of the stuff below is probably stuff you're already doing, but I included everything for completeness.
In order to scrape effectively, you may need to deal with one or more of the following:
cookies and/or hidden fields: when you show up at any page on a site, you'll typically get a session cookie and/or hidden form field which (in a normal browser) would be propagated back to the server on all subsequent requests. You will likely also get a persistent cookie. On many sites, if a request shows up without a proper cookie (or form field, for sites using "cookieless sessions"), the site will redirect the user to a "no cookies" UI, a login page, or another undesirable location (from the scraper app's perspective). Always make sure you capture the cookies set on the initial request and faithfully send them back to the server on subsequent requests, except if one of those subsequent requests changes a cookie (in which case propagate that new cookie instead).
authentication tokens: a special case of the above is forms-authentication cookies or hidden fields. Make sure you're capturing the login token (usually a cookie) and sending it back.
POST vs. GET: this is obvious, but make sure you're using the same HTTP method that a real browser does.
form fields (esp. hidden ones!): I'm sure you're doing this already, but make sure to send all form fields that a real browser does, not just the visible fields. Make sure field values are URL-encoded properly.
HTTP headers: you already checked this, but it may make sense to check again just to make sure the (non-cookie) headers are identical. I always start with the exact same headers and then start pulling out headers one by one, and only keep the ones whose removal causes the request to fail or return bogus data. This approach simplifies your scraping code.
redirects: these can either come from the server, or from client script (e.g. "if user doesn't have flash plug-in loaded, redirect to a non-flash page"). See WebRequest: How to find a postal code using a WebRequest against this ContentType="application/xhtml+xml, text/xml, text/html; charset=utf-8"? for a crazy example of how redirection can trip up a screen-scraper. Note that if you're using .NET for scraping, you'll need to use HttpWebRequest (not WebClient) for redirect-dependent scraping, because by default WebClient doesn't provide a way for your code to attach cookies and headers to the second (post-redirect) request. See the thread above for more details.
sub-requests (frames, ajax, flash, etc.): often, page elements (not the main HTTP requests) will end up fetching the data you want to scrape. You'll be able to figure this out by looking at which HTTP response contains the text you want, and then working backwards until you find what on the page is actually making the request for that content. A few sites do really crazy things in sub-requests, like requesting compressed or encrypted text via ajax, and then using client-side script to decrypt it. If this is the case, you'll need to do a bit more work, like reverse-engineering what the client script is doing.
ordering: this one is obvious: make HTTP requests in the same order that a browser client does. That doesn't mean you need to make every request (e.g. images). Typically you only need to make the requests which return text/html content type, unless the data you want is not in the HTML and is in an ajax/flash/etc. request.
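For the cookie and redirect points above, a minimal sketch with HttpWebRequest could look like this. ScraperSession, CreateRequest and the User-Agent string are placeholders of mine, not anything the site requires. The idea is to create one CookieContainer and attach it to every request, so cookies set by the login response are sent back on later requests.

```csharp
using System.Net;

public static class ScraperSession
{
    // Creates a request that reads and writes cookies in the shared container.
    public static HttpWebRequest CreateRequest(string url, string method, CookieContainer cookies)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = method;
        // Cookies from earlier responses are sent; new Set-Cookie values are stored.
        request.CookieContainer = cookies;
        // Turn off automatic redirects so your code can inspect each redirect itself.
        request.AllowAutoRedirect = false;
        // Placeholder; copy the header values a real browser sends (see Fiddler).
        request.UserAgent = "Mozilla/5.0 (compatible; ExampleScraper)";
        return request;
    }
}
```

Create the CookieContainer once, use it for the login POST, then reuse the same instance for the request that fetches the data page. If a response comes back as a 3xx, read its Location header and issue the follow-up request through the same helper.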
(Interestingly when I view the source for the whole page I do not see the text, within HTML, that I want to see. Could this be related?)
This usually means that the discrepancy is caused by some DOM manipulation via JavaScript after the page has loaded. Try turning off JavaScript and see what the page looks like.