C# Screen Scraper - Handle long uri's

I'm building an HTML screen scraper that parses URLs and then compares them with a set of other URLs.
The comparison is done with Uri.AbsoluteUri or Uri.Host.
My problem is that when I create a new Uri (new Uri(url)), a UriFormatException is thrown when the URL is too long or contains too many slashes.
Since my predefined set of URLs contains several (too) long URLs, I cannot just use Substring to fetch only a part of the URL.
What would be the best way to handle this?
Thanks

You can use Uri.TryCreate to check whether the URI is valid before you construct it.
You should not get an exception on a URL that is this short. The following program runs well on VS2008:
static void Main(string[] args)
{
Uri uri = new Uri("http://stackoverflow.com/questions/1298985/c-screen-scraper-handle-long-uris/c-screen-scraper-handle-long-uris/c-screen-scraper-handle-long-uris/c-screen-scraper-handle-long-uris/c-screen-scraper-handle-long-uris/c-screen-scraper-handle-long-uris/c-screen-scraper-handle-long-uris/c-screen-scraper-handle-long-uris/");
Uri uri2 = new Uri("http://stackoverflow.com/questions/1298985/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/");
Console.ReadLine();
}
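A minimal sketch of the Uri.TryCreate approach for a scraper, assuming the goal is to skip malformed links rather than abort on them. The class and method names (ScrapedUrlParser, ParseValid) are made up for illustration, and restricting the result to http/https is an extra assumption, not something the question requires.

```csharp
using System;
using System.Collections.Generic;

public static class ScrapedUrlParser
{
    // Returns only the inputs that parse as absolute http/https URIs.
    // Strings that fail to parse are skipped instead of throwing
    // UriFormatException, as new Uri(string) would.
    public static List<Uri> ParseValid(IEnumerable<string> candidates)
    {
        var result = new List<Uri>();
        foreach (string candidate in candidates)
        {
            Uri uri;
            if (Uri.TryCreate(candidate, UriKind.Absolute, out uri) &&
                (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps))
            {
                result.Add(uri);
            }
        }
        return result;
    }
}
```

The returned Uri objects can then be compared through AbsoluteUri or Host as before; anything that was rejected can be logged so the offending string can be inspected.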

Related

How to access properties of Uri defined with UriKind.Relative

My method receives a URI as a string and attempts to parse it into a predictable and consistent format. The incoming URL could be absolute (http://www.test.com/myFolder) or relative (/myFolder). Absolute URIs are easy enough to work with, but I've hit some stumbling blocks working with relative ones. Most notable is the fact that, although the constructor for Uri allows you to designate a relative URI using UriKind.Relative (or UriKind.RelativeOrAbsolute), it doesn't appear to have any properties available when you do this.
Specifically, it throws this exception: System.InvalidOperationException : This operation is not supported for a relative URI.
It makes sense that you wouldn't be able to access, say, the Scheme or Authority properties--although it seems weird that they actually throw invalid operation exceptions instead of just returning blank strings--but even properties like PathAndQuery or Fragment exhibit the same behavior. In fact, pretty much the only properties that don't throw exceptions for relative URIs are the IsX flags and OriginalString, which just shows the string you passed in to the object in the first place.
Given that that constructor explicitly allows you to declare a relative URI, this all seems like a baffling omission. Is there something I'm missing here? Is there any way to handle relative URIs as their component parts, or do I need to just treat it as a string? Am I completely failing to understand what "relative URI" means in this case?
To replicate:
var uri = new Uri("/myFolder", UriKind.Relative);
string foo = uri.PathAndQuery; // throws exception
Visual Studio Pro 2015,
.NET 4.5.2 (if any of that makes a difference)

It's by design that an exception gets thrown when accessing e.g. PathAndQuery for a relative URI; see the Uri source code.
As a rather quick and dirty workaround to parse some segments, you could construct a temporary absolute uri out of the relative one, using a dummy base uri (scheme and host) which you ignore.
String url = "/myFolder?bar=1#baz";
Uri uri = new Uri(url, UriKind.RelativeOrAbsolute);
if (!uri.IsAbsoluteUri)
{
    // Dummy base (scheme and host) that is ignored afterwards.
    Uri baseUri = new Uri("http://foo.com");
    uri = new Uri(baseUri, uri);
}
String pathAndQuery = uri.PathAndQuery; // /myFolder?bar=1
String query = uri.Query; // ?bar=1
String fragment = uri.Fragment; // #baz

Hundreds of pins in static google map api

This works fine:
static void Main(string[] args)
{
    string latlng = "55.379110,-3.1420898";
    string url = "http://maps.googleapis.com/maps/api/staticmap?center=" + latlng +
        "&zoom=6&size=1000x1000&maptype=satellite&markers=color:blue%7Clabel:S%7C" +
        latlng + "&sensor=false";
    using (WebClient wc = new WebClient())
    {
        wc.DownloadFile(url, @"C:\Bla\img.png");
    }
}
Just wondering, how could I add hundreds of pins and save the map as a PNG? Surely there is a limit on the length of a GET request, so one cannot append too many query string parameters.
PS: The limit is 8192 characters - see https://developers.google.com/maps/documentation/static-maps/intro section URL Size Restriction

I'm afraid downloading and storing static maps images is against ToS:
You may not store and serve copies of images generated using the Google Static Maps API from your website. All web pages that require static images must link the src attribute of an HTML img tag or the CSS background-image attribute of an HTML div tag directly to the Google Static Maps API so that all map images are displayed within the HTML content of the web page and served directly to end users by Google.
https://developers.google.com/maps/faq?csw=1#tos_staticmaps_reuse

You will have to deal with the URL size restriction of the Google Static Maps API.
https://developers.google.com/maps/documentation/static-maps/intro#url-size-restriction
URL Size Restriction
Google Static Maps API URLs are restricted to 8192 characters in size.
In practice, you will probably not have need for URLs longer than
this, unless you produce complicated maps with a high number of
markers and paths. Note, however, that certain characters may be
URL-encoded by browsers and/or services before sending them off to the
API, resulting in increased character usage. For more information, see
Building a Valid URL.
To add multiple markers:
https://developers.google.com/maps/documentation/static-maps/intro
https://maps.googleapis.com/maps/api/staticmap?center=Brooklyn+Bridge,New+York,NY&zoom=13&size=600x300&maptype=roadmap&markers=color:blue%7Clabel:S%7C40.702147,-74.015794&markers=color:green%7Clabel:G%7C40.711614,-74.012318&markers=color:red%7Clabel:C%7C40.718217,-73.998284&key=YOUR_API_KEY
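A rough sketch of building such a URL in C# while checking the 8192-character limit quoted above. It assumes the documented form where one markers parameter carries a shared style followed by several locations separated by "|" (encoded as %7C), which keeps the URL shorter than repeating the parameter for each pin. The names LatLng, StaticMapUrl and Build are invented for this example; the zoom, size and maptype values are copied from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Hypothetical value type for one pin.
public struct LatLng
{
    public readonly double Lat;
    public readonly double Lng;
    public LatLng(double lat, double lng) { Lat = lat; Lng = lng; }
}

public static class StaticMapUrl
{
    // Limit taken from the "URL Size Restriction" text quoted above.
    public const int MaxUrlLength = 8192;

    public static string Build(string center, IEnumerable<LatLng> pins, string apiKey)
    {
        var sb = new StringBuilder("https://maps.googleapis.com/maps/api/staticmap?center=");
        sb.Append(Uri.EscapeDataString(center));
        sb.Append("&zoom=6&size=1000x1000&maptype=satellite&markers=color:blue");
        foreach (LatLng pin in pins)
        {
            // Invariant culture so the decimal separator is always "."
            sb.Append("%7C");
            sb.Append(pin.Lat.ToString("0.######", CultureInfo.InvariantCulture));
            sb.Append(',');
            sb.Append(pin.Lng.ToString("0.######", CultureInfo.InvariantCulture));
        }
        sb.Append("&key=").Append(Uri.EscapeDataString(apiKey));

        if (sb.Length > MaxUrlLength)
        {
            throw new ArgumentException(
                "URL is " + sb.Length + " characters; the documented limit is " + MaxUrlLength + ".");
        }
        return sb.ToString();
    }
}
```

Rounding coordinates to six decimals is a choice made here to save characters. With a dozen or so characters per pin, a few hundred pins can fit; beyond that the pins would have to be split across several images or thinned out.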

You may also be interested in: Issue 207: KML layer in Static Maps API

Selenium: Find the base Url

I'm using Selenium on different machines to automate testing of an MVC Web application.
My problem is that I can't get the base url for each machine.
I can get the current url using the following code:
IWebDriver driver = new FirefoxDriver();
string currentUrl = driver.Url;
But this doesn't help when I need to navigate to a different page.
Ideally I could just use the following to navigate to different pages:
driver.Navigate().GoToUrl(baseUrl+ "/Feedback");
driver.Navigate().GoToUrl(baseUrl+ "/Home");
A possible workaround I was using is:
string baseUrl = currentUrl.Remove(22); //remove everything from the current url but the base url
driver.Navigate().GoToUrl(baseUrl+ "/Feedback");
Is there a better way I could do this?

The best way around this would be to create a Uri instance of the URL.
This is because the Uri class in .NET already has code in place to do this exactly for you, so you should just use that. I'd go for something like (untested code):
string url = driver.Url; // get the current URL (full)
Uri currentUri = new Uri(url); // create a Uri instance of it
string baseUrl = currentUri.Scheme + Uri.SchemeDelimiter + currentUri.Authority; // Authority alone is "host:port" without the scheme
driver.Navigate().GoToUrl(baseUrl + "/Feedback");
Essentially, you are after the Authority property within the Uri class.
Note, there is a property that does a similar thing, called Host but this does not include port numbers, which your site does. It's something to bear in mind though.

Take the driver.Url, toss it into a new System.Uri, and use myUri.GetLeftPart(System.UriPartial.Authority).
If your base URL is http://localhost:12345/Login, this will return you http://localhost:12345.
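The GetLeftPart approach can be wrapped in a small helper, sketched here under the assumption that the page URLs are site-relative paths such as "/Feedback". BaseUrlHelper, GetBaseUrl and Combine are illustrative names, not part of Selenium or .NET.

```csharp
using System;

public static class BaseUrlHelper
{
    // Returns scheme, host and port without a trailing slash.
    public static string GetBaseUrl(string currentUrl)
    {
        return new Uri(currentUrl).GetLeftPart(UriPartial.Authority);
    }

    // Resolves a site-relative path (e.g. "/Feedback") against the
    // base of the current URL.
    public static string Combine(string currentUrl, string relativePath)
    {
        Uri baseUri = new Uri(GetBaseUrl(currentUrl));
        return new Uri(baseUri, relativePath).AbsoluteUri;
    }
}
```

With Selenium this would be used as driver.Navigate().GoToUrl(BaseUrlHelper.Combine(driver.Url, "/Feedback")), which avoids the hard-coded Remove(22).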

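Try this regular expression taken from this answer (the snippet is Java).
String baseUrl = null;
// Group 1 captures the optional scheme, the host and the optional port.
Pattern p = Pattern.compile("^(([a-zA-Z]+://)?[a-zA-Z0-9.-]+\\.[a-zA-Z]+(:\\d+)?)/");
Matcher m = p.matcher(str);
// find() rather than matches(): the pattern only covers the start of the URL.
if (m.find())
    baseUrl = m.group(1);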

Converting a web address to a valid href value

Firstly, this seems like something that should have been asked before, but I cannot find anything that answers my question.
A basic overview of my task is to render an anchor link on a web page which is based on a user defined web address. As the address is user defined this could be in any format, for example:
http://www.example.com
https://www.example.com
www.example.com
example.com
What I need to do with this value is to set it as the href property of an anchor tag. Now, the problem is that (in Chrome at least) only the first two examples will work due to the fact they are recognised as absolute URL paths. The last two examples will redirect to the same domain (i.e. treated as relative paths)
So the ultimate question is: What is the best way to format these values to ensure a consistent absolute path is used? I could check for http/https and add it if missing, but I was hoping there might be an out of the box .Net class that would be more reliable.
In addition, as this is a user defined value, it could be complete junk anyway so a function to validate the URL would be a nice bonus too.

We ran into this problem a few months back, and needed a consistent way of ensuring the URLs were absolute. We also wanted a way of removing http(s):// for displaying the URL on the web page.
I came up with this function:
public static string FormatUrl(string Url, bool? IncludeHttp = null)
{
    Url = Url.ToLower();
    switch (IncludeHttp) {
        case true:
            // Add a scheme when none is present.
            if (!(Url.StartsWith("http://") || Url.StartsWith("https://")))
                Url = "http://" + Url;
            break;
        case false:
            // Strip the scheme for display.
            if (Url.StartsWith("http://"))
                Url = Url.Remove(0, "http://".Length);
            if (Url.StartsWith("https://"))
                Url = Url.Remove(0, "https://".Length);
            break;
    }
    return Url;
}
I know you're after an "out of the box" library, but this may be of some help.
I think the problem with an "out of the box" solution would be that the function won't know whether the URL should be http:// or https://. With my function I've made an assumption that it's going to be http://, but for some URLs you need https://. If Microsoft were to build something like this into the framework, it would be buggy from the start.

You can try using this overload of the Uri class:
Uri Constructor (String)
This constructor creates a Uri instance from a URI string. It parses the URI, puts it in canonical format, and makes any required escape encodings.
This constructor does not ensure that the Uri refers to an accessible resource.
This constructor assumes that the string parameter references an absolute URI and is equivalent to calling the Uri constructor with UriKind set to Absolute. If the string parameter passed to the constructor is a relative URI, this constructor will throw a UriFormatException.
This will try to construct a canonical Uri from the user input. And you have lots of properties to check and extract the URL parts that you need.
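Because that constructor throws on input such as www.example.com, one possible combination is sketched below: prepend a scheme when none is present, then let Uri.TryCreate validate and canonicalize the result. HrefNormalizer and TryNormalize are hypothetical names, and defaulting to http:// is an assumption that will be wrong for https-only sites.

```csharp
using System;

public static class HrefNormalizer
{
    // Tries to turn user input such as "example.com" into an absolute
    // http/https URL suitable for an href attribute.
    public static bool TryNormalize(string input, out string href)
    {
        href = null;
        if (string.IsNullOrWhiteSpace(input))
            return false;

        string candidate = input.Trim();
        // Assumption: input without "://" is a web address missing its scheme.
        if (candidate.IndexOf("://", StringComparison.Ordinal) < 0)
            candidate = "http://" + candidate;

        Uri uri;
        if (!Uri.TryCreate(candidate, UriKind.Absolute, out uri))
            return false;

        // Reject other schemes (ftp, file, ...) for an anchor to a web page.
        if (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps)
            return false;

        href = uri.AbsoluteUri;
        return true;
    }
}
```

This only checks that the string is syntactically a URL; like the constructor, it does not verify that the address is reachable.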

Truncating Query String & Returning Clean URL C# ASP.net

I would like to take the original URL, truncate the query string parameters, and return a cleaned up version of the URL. I would like it to occur across the whole application, so performing through the global.asax would be ideal. Also, I think a 301 redirect would be in order as well.
ie.
in: www.website.com/default.aspx?utm_source=twitter&utm_medium=social-media
out: www.website.com/default.aspx
What would be the best way to achieve this?

System.Uri is your friend here. This has many helpful utilities on it, but the one you want is GetLeftPart:
string url = "http://www.website.com/default.aspx?utm_source=twitter&utm_medium=social-media";
Uri uri = new Uri(url);
Console.WriteLine(uri.GetLeftPart(UriPartial.Path));
This gives the output: http://www.website.com/default.aspx
[The Uri class does require the protocol, http://, to be specified]
GetLeftPart basically says "get the left part of the URI up to and including the part I specify". This can be Scheme (just the http:// bit), Authority (the www.website.com part), Path (the /default.aspx) or Query (the querystring).
Assuming you are on an aspx web page, you can then use Response.Redirect(newUrl) to redirect the caller.
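For the application-wide case, the decision of whether to redirect can be isolated in a small function, sketched here. CleanUrl and WithoutQuery are made-up names, and the Global.asax wiring in the comment is an assumption about how it would be hooked up, not tested code.

```csharp
using System;

public static class CleanUrl
{
    // Returns the URL without its query string, or null when there is
    // no query string and therefore nothing to redirect.
    public static string WithoutQuery(Uri requestUrl)
    {
        if (string.IsNullOrEmpty(requestUrl.Query))
            return null;
        return requestUrl.GetLeftPart(UriPartial.Path);
    }
}

// Possible use in Global.asax (assumed wiring):
//
// protected void Application_BeginRequest(object sender, EventArgs e)
// {
//     string clean = CleanUrl.WithoutQuery(Request.Url);
//     if (clean != null)
//         Response.RedirectPermanent(clean); // sends a 301
// }
```

Returning null when there is no query string matters: without that check every request would redirect to itself.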

Here is a simple trick
Dim uri = New Uri(Request.Url.AbsoluteUri)
Dim reqURL = uri.GetLeftPart(UriPartial.Path)

Here is a quick way of getting the root path sans the full path and query.
string path = Request.Url.AbsoluteUri.Replace(Request.Url.PathAndQuery,"");

This may look a little better.
string rawUrl = String.Concat(this.GetApplicationUrl(), Request.RawUrl);
if (rawUrl.Contains("/post/"))
{
    bool hasQueryStrings = Request.QueryString.Keys.Count > 1;
    if (hasQueryStrings)
    {
        // Strip the query string and emit a canonical link element.
        Uri uri = new Uri(rawUrl);
        rawUrl = uri.GetLeftPart(UriPartial.Path);
        HtmlLink canonical = new HtmlLink();
        canonical.Href = rawUrl;
        canonical.Attributes["rel"] = "canonical";
        Page.Header.Controls.Add(canonical);
    }
}
Followed by a function to properly fetch the application URL.
Works perfectly.

I'm guessing that you want to do this because you want your users to see pretty looking URLs. The only way to get the client to "change" the URL in its address bar is to send it to a new location - i.e. you need to redirect them.
Are the query string parameters going to affect the output of your page? If so, you'll have to look at how to maintain state between requests (session variables, cookies, etc.) because your query string parameters will be lost as soon as you redirect to a page without them.
There are a few ways you can do this globally (in order of preference):
1. If you have direct control over your server environment then a configurable server module like ISAPI_ReWrite or IIS 7.0 URL Rewrite Module is a great approach.
2. A custom IHttpModule is a nice, reusable roll-your-own approach.
3. You can also do this in the global.asax as you suggest.
You should only use the 301 response code if the resource has indeed moved permanently. Again, this depends on whether your application needs to use the query string parameters. If you use a permanent redirect a browser (that respects the 301 response code) will skip loading a URL like .../default.aspx?utm_source=twitter&utm_medium=social-media and load .../default.aspx - you'll never even know about the query string parameters.
Finally, you can use POST method requests. This gives you clean URLs and lets you pass parameters in, but will only work with <form> elements or requests you create using JavaScript.

Take a look at the UriBuilder class. You can create one with a url string, and the object will then parse this url and let you access just the elements you desire.
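A short sketch of the UriBuilder route, assuming the aim is the same cleaned-up URL as in the question. The wrapper name UriBuilderExample.StripQuery is invented for illustration.

```csharp
using System;

public static class UriBuilderExample
{
    // Parses the URL, clears the query string and fragment, and
    // returns the rebuilt absolute URL.
    public static string StripQuery(string url)
    {
        var builder = new UriBuilder(url);
        builder.Query = string.Empty;
        builder.Fragment = string.Empty;
        return builder.Uri.AbsoluteUri;
    }
}
```

The same builder also exposes Scheme, Host, Port and Path as settable properties, which is useful when more than the query string has to change.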

After completing whatever processing you need to do on the query string, just split the url on the question mark:
Dim _CleanUrl as String = Request.Url.AbsoluteUri.Split("?")(0)
Response.Redirect(_CleanUrl)
Granted, my solution is in VB.NET, but I'd imagine that it could be ported over pretty easily. And since we are only looking for the first element of the split, it even "fails" gracefully when there is no querystring.
