I am implementing URL rewriting in ASP.net and my URLs are causing me a world of problems.
The URL is generated from a database of departments & categories. I want employees to be able to add items to the database with whatever special characters are appropriate without it breaking the site.
I am encoding the data before I construct the URLs.
There are several problems...
IIS decodes the URL before it reaches .net making it impossible to properly parse anything with a "/" in it.
ASP.net gets confused by the url making "~" useless within certain pages
I migrated from the built in test server to my local IIS server (XP machine) and any URL containing an encoded & (%26) gives me a "Bad Request" error.
UrlEncode leaves some breaking characters untouched such as '.'
I did have two other related posts on this subject; at the time I only saw the small problems, not the big problem upstream. I've found some registry tricks to solve the "Bad Request" issue, but I'm going to be deploying to a shared hosting environment, which makes that useless. I also know that this is a fix for some security issue, so I don't want to bypass it without knowing what can of worms I'm opening.
Rather than trying to force .NET to pass me the raw URL, or override IIS settings, I'd like to make truly safe URLs in the first place.
I'll note I've tried AntiXss.UrlEncode, HttpUtility.UrlEncode, and Uri.EscapeDataString. I've even tried stupid things like double URL encoding. Is there a utility that does what I need, or do I really need to roll my own? I'm even considering doing something hacky like replacing the % with an unusual string of characters. The end result should be at least readable, which was the point of using URL rewriting in the first place.
Sorry for the long post- I just wanted to make sure that I've included all the necessary details. I can't seem to find any relevant information on this, and it seems like it would be a common problem - so maybe I'm missing something big. Thanks for your help, and patience with the long explanation!
Edit for clarity:
When I say the URLs are being built from a database, what I mean is that the directory structure is constructed from the departments and categories in my database.
Some example URLs -
Mystore/Refrigeration/Bar+Fridge.aspx
Mystore/Cooking+Equipment.aspx
Mystore/Kitchen/Cutting+Boards.aspx
The problems come in when I use a department like "Beverage & Bar" or "Pastry/Decorating" to construct my URL. Despite being encoded first these cause the aforementioned issues.
My handlers are already implemented and working fine except for the special character encoding issues.
You should consider having a table off of your category/department table which has a unique URL for each category. Then you can use a special routine to generate the URLs. This can be a SQL scalar function, or a CLR function, but one of the things it would do is normalize the URL for the web. You can convert "Beverage & Bar" to "Beverage-And-Bar" and "Pastry / Decorating" to "Pastry-Decorating". Mainly, the routine needs to replace all invalid HTTP URL characters with something else. An example is this:
using System.Text.RegularExpressions;

public static class URL
{
    // Measurements and symbols that get spelled out before everything else is stripped.
    static readonly Regex feet = new Regex(@"([0-9]\s?)'([^'])", RegexOptions.Compiled);
    static readonly Regex inch1 = new Regex(@"([0-9]\s?)''", RegexOptions.Compiled);
    static readonly Regex inch2 = new Regex(@"([0-9]\s?)""", RegexOptions.Compiled);
    static readonly Regex num = new Regex(@"#([0-9]+)", RegexOptions.Compiled);
    static readonly Regex dollar = new Regex(@"[$]([0-9]+)", RegexOptions.Compiled);
    static readonly Regex percent = new Regex(@"([0-9]+)%", RegexOptions.Compiled);
    // Separators become dashes; anything else that is not a letter, digit or dash is dropped.
    static readonly Regex sep = new Regex(@"[\s_/\\+:.]", RegexOptions.Compiled);
    static readonly Regex empty = new Regex(@"[^-A-Za-z0-9]", RegexOptions.Compiled);
    static readonly Regex extra = new Regex(@"[-]+", RegexOptions.Compiled);

    public static string PrepareURL(string str)
    {
        str = str.Trim().ToLower();
        str = str.Replace("&", "and");
        // $2 puts back the character that follows the foot mark, which the pattern also consumes.
        str = feet.Replace(str, "$1-ft-$2");
        str = inch1.Replace(str, "$1-in-");
        str = inch2.Replace(str, "$1-in-");
        str = num.Replace(str, "num-$1");
        str = dollar.Replace(str, "$1-dollar-");
        str = percent.Replace(str, "$1-percent-");
        str = sep.Replace(str, "-");
        str = empty.Replace(str, string.Empty);
        // Collapse runs of dashes and trim them from both ends.
        str = extra.Replace(str, "-");
        str = str.Trim('-');
        return str;
    }
}
You could make this a SQL CLR function, or run URL generation as a separate process. To implement the mapping, you would then map the entire URL directly to a category ID. This approach is better in the long run for several reasons. First, you are not generating URLs on every request: you generate them once and they stay static, so you don't have to worry about your procedure changing and GoogleBot no longer finding the old URLs. Second, a collision may point you to a duplicate category name, because two names that collide differ only by special characters. Finally, you can always view your URLs in the database without having to run the mapping function.
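The slug-to-ID mapping and the collision check described above could be sketched roughly as follows. This is only an illustration: `CategoryUrlMap`, `Slug` and `Build` are hypothetical names, `Slug` is a simplified stand-in for `PrepareURL`, and an in-memory dictionary stands in for the extra database table.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CategoryUrlMap
{
    // Simplified stand-in for PrepareURL: lower-case, "&" -> "and",
    // every other run of non-alphanumeric characters -> a single dash.
    public static string Slug(string name)
    {
        string s = name.Trim().ToLowerInvariant().Replace("&", " and ");
        s = Regex.Replace(s, "[^a-z0-9]+", "-");
        return s.Trim('-');
    }

    // Builds slug -> category ID once. Throws when two names normalize to the
    // same slug, which is the "potential duplicate category" signal.
    public static Dictionary<string, int> Build(IEnumerable<KeyValuePair<int, string>> categories)
    {
        var map = new Dictionary<string, int>(StringComparer.Ordinal);
        foreach (KeyValuePair<int, string> category in categories)
        {
            string slug = Slug(category.Value);
            int existingId;
            if (map.TryGetValue(slug, out existingId))
            {
                throw new InvalidOperationException(
                    "Slug '" + slug + "' is shared by categories " + existingId + " and " + category.Key);
            }
            map.Add(slug, category.Key);
        }
        return map;
    }
}
```

With this sketch, "Beverage & Bar" maps to beverage-and-bar and "Pastry/Decorating" to pastry-decorating; the rewrite handler would then look up the incoming path segment in the map instead of decoding it.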
I have a URL rewrite that I implement in the global.asax file, in the begin authenticated request event, as I have some security. This is where I take the raw URL and then do the DB lookup. This then rewrites the path to the aspx page, and all the parameters are passed through the query string. No encoding is necessary.
However, if you are using the URL to actually change data, then I can see that you will have huge problems, as you are effectively using HTTP GET to change the database. It is usually considered a bad idea, and not something I do.
I only use a POST request to do any database manipulation. This keeps the URL clean, as all the data is in the page form.
The only issue I had was setting the correct URL on page.form.action, which in most cases is the raw URL.
If it's the category names that are causing the issue, then perhaps you should restrict the names to alphanumeric characters only and swap spaces for "-". IIS will throw a wobbly with periods "." as it looks for file names.
P.S.
IIS does not understand the tilde "~"; it is ASP.NET that resolves it, and only on the server side. So if you use it in a plain anchor tag it will not work as expected, and you should use the application root instead of the tilde.
Edit:
OK, it looks like IIS has issues with certain characters such as . / and &. Even if you URL-encode these, IIS will still try to apply its own meanings to them.
As such consider removing them so:
Beverage & bar becomes BeverageBar
Pastry / decorating becomes PastryDecorating.
This will keep your URLs clean, but it does mean an extra column in the database so you can check the URL against this shortened category name.
I'm having the exact same problem. Thanks for writing it up so nicely. It actually helped me to understand the problem better.
I had some other considerations however. One of the goals I have is to support the potential for any characters to be in the url which is based on the title of an article. Additionally I want to ensure uniqueness in the encoding and a two way encode / decode process.
So I did some manual encoding to solve the problem. This won't completely eliminate percent encoding, but it will greatly reduce it and keep users from generating an inaccessible URL. My process starts with the Server.UrlEncode function, but that alone doesn't eliminate the problems in the URL. Because IIS decodes the URL and then passes it to the application, certain characters will break it with a dangerous request exception. These characters include +, &, /, !, *, ., ( and ). So for those characters, plus other characters I would like to make more readable, I do a second encoding to get a more usable URL. Encoding is also hard because of the limited number of characters that are allowed in a URL. So prior to encoding I made all letters capital and then did the encoding with lower case. This keeps it from being totally decodable, but I can easily do a match in the database or in code by making the value I wish to match upper case.
Well, here is my code. Feedback would be appreciated. Oh yeah, this is in VB, but things should transfer over to C# easily enough.
Dim strReturn As String = Trim(strStringToEncode)
strReturn = Server.UrlEncode(strReturn)
strReturn = strReturn.Replace("-", "dash").Replace("+", "-")
strReturn = strReturn.Replace("%26", "and").
    Replace("%2f", "or").
    Replace("!", "excl").
    Replace("*", "star").
    Replace("%27", "apos").
    Replace("(", "lprn").
    Replace(")", "rprn").
    Replace("%3b", "semi").
    Replace("%3a", "coln").
    Replace("%40", "at").
    Replace("%3d", "eq").
    Replace("%2b", "plus").
    Replace("%24", "dols").
    Replace("%25", "pct").
    Replace("%2c", "coma").
    Replace("%3f", "query").
    Replace("%23", "hash").
    Replace("%5b", "lbrk").
    Replace("%5d", "rbrk").
    Replace(".", "dot").
    Replace("%3e", "gt").
    Replace("%3c", "lt")
Return strReturn
I guess you are looking for HttpUtility.UrlEncode and HttpUtility.UrlDecode
string url = "http://www.google.com/search?q=" + HttpUtility.UrlEncode("Example");
I'm trying to reference an image like this:
<img src="/controller/method/@Model.attribute">
This works until the attribute has a plus sign. I already know that the + sign has a semantic meaning but I'd like to keep it, because some values have the plus sign.
I've tried:
<img src="/controller/method/@HttpUtility.HtmlEncode(@Model.attribute)">
And on the server side:
public method(string param)
{
string p = HttpUtility.HtmlDecode(param);
}
How can I accomplish this using ASP.NET MVC 5?
You need to use UrlEncode:
<img src="/controller/method/@HttpUtility.UrlEncode(Model.attribute)">
And do nothing in the method:
public ActionResult method(string param){
// param should already be decoded
}
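To see why HTML encoding leaves the plus sign alone while URL encoding does not, here is a small sketch. It assumes `System.Net.WebUtility` as a stand-in for `HttpUtility` so that it runs outside a web project; `PlusSignDemo` and its method names are made up for illustration, and the hex case of the escape can differ between the two classes.

```csharp
using System.Net;

public static class PlusSignDemo
{
    // HTML encoding only escapes markup characters such as < > & and quotes,
    // so a "+" passes through unchanged and still means "space" once it is in a URL.
    public static string ForHtml(string value)
    {
        return WebUtility.HtmlEncode(value);
    }

    // URL encoding turns a literal "+" into a percent escape and a space into "+".
    public static string ForUrl(string value)
    {
        return WebUtility.UrlEncode(value);
    }

    // What the receiving side does when it decodes the value.
    public static string Decode(string value)
    {
        return WebUtility.UrlDecode(value);
    }
}
```

For the value a+b, `ForHtml` returns a+b, `ForUrl` returns a%2Bb, and `Decode("a%2Bb")` gives back a+b, whereas `Decode("a+b")` yields "a b", which is the behavior the question ran into.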
Did some testing and got error page while trying to reproduce scenario you described.
Here is related question: double escape sequence inside a url : The request filtering module is configured to deny a request that contains a double escape sequence
In my designs, I'm avoiding any direct use of model fields as part of the URL. It's not only the question of URL-encoding them - which you can always do - but also the question of readability.
What I do instead is to add another field to the model, which is the URL-ready representation of an attribute. That field can be calculated from the original field by only accepting letters and numbers and replacing spaces or any other character with a dash.
For example, if you had the attribute set to someone's pencil + one, the auto-created URL version of this attribute would be someone-s-pencil-one.
You can customize this process, make it recognize some domain-specific words, etc. But that is the general idea I'm always following in my designs.
As a quick solution you can use a regular expression to isolate acceptable words and then separate them with dashes for better readability:
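// Requires: using System.Linq; using System.Text.RegularExpressions;
string encoded = string.Join("-",
    Regex.Matches(attributeValue, @"[a-zA-Z0-9]+")  // A-Z, not A-z, which would also admit [ \ ] ^ _ `
        .Cast<Match>()
        .Select(match => match.Value)
        .ToArray());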
When done this way, you must account for possible duplicates. Part of the information is lost with this encoding.
If you fear that two models could clash with the same URL, then you have to do something to break the clash. Some websites append a GUID to the generated URL to make it unique.
Another possibility is to generate a short random string, like 3-5 letters only, and store it in the database so that you can control its uniqueness. Everything in this solution is subordinated to readability, keep that in mind.
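One way to break such clashes is sketched below. Note the swap: instead of a GUID or a random string it appends a numeric counter, which is deterministic and easier to check; `UniqueSlug.Create` and the `taken` set are hypothetical, with the set standing in for a uniqueness check against the database.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class UniqueSlug
{
    // Returns a URL-ready version of value that is not yet in taken, and records it there.
    public static string Create(string value, ISet<string> taken)
    {
        // Keep only runs of letters and digits, lower-cased, joined by dashes.
        string baseSlug = string.Join("-",
            Regex.Matches(value, "[a-zA-Z0-9]+")
                .Cast<Match>()
                .Select(m => m.Value.ToLowerInvariant())
                .ToArray());
        if (baseSlug.Length == 0)
        {
            baseSlug = "item"; // assumption: fallback for values with no usable characters
        }

        // ISet.Add returns false when the slug already exists, so keep trying
        // base-2, base-3, ... until one is free.
        string slug = baseSlug;
        for (int n = 2; !taken.Add(slug); n++)
        {
            slug = baseSlug + "-" + n;
        }
        return slug;
    }
}
```

Called twice with someone's pencil + one against the same set, this yields someone-s-pencil-one and then someone-s-pencil-one-2.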
I have an Android app and I'm attempting to use PHP/MySQL.
I'm having a lot of trouble getting my results from PHP accessible in C#/Android.
This is my PHP so far:
$sql = "SELECT Name FROM Employees WHERE Password='$password'";
if(!$result = $mysqli->query($sql)) {
echo "Sorry, the query was unsuccessful";
}
while($employee = $result->fetch_assoc()) {
$jsonResult = json_encode($employee);
$employee->close();
}
I've left out the basic connection code as I have all that up and running. Here is my C#:
private void OnLoginButtonClick()
{
var mClient = new WebClient();
mClient.DownloadDataAsync(new Uri("https://127.0.0.1/JMapp/Login.php?password=" + _passwordEditText.Text));
}
As you can see I really am at a very basic stage. I've installed Newtonsoft so I'm ready to deal with the Json that is coming back, however I have a few questions.
I'm well aware of SQL injection, and the way that my variable (password) is passed to the PHP concerns me. Is there a safer way of doing this?
Secondly, I am now unsure of how to get the 'Employees' that match the MySQL command in PHP back into C#. How am I able to access the object that is passed back from PHP?
Leaving aside other aspects of the code in the question, I suggest some reading on sanitizing and escaping user data.
For this specific case of a password, see @Jay Blanchard's comments. For other input, which you would not transform on the way in, the idea is to sanitize it as soon as you receive it.
This is to make sure you receive what you were expecting. In the case of a string, trim() the text and match it against a regex of allowed characters. If you allow HTML tags, match them against a whitelist. Enforce a maximum length.
Then you would validate it, that is, check that it makes sense and meets the business requirements.
At the time of storing it in the database, you can avoid SQL injection by using prepared statements. That way it is clear what is text to be stored and what is SQL instructions.
At the time of using the data, you escape it according to where it is going to be used. For example, if it is HTML content you escape it for HTML content; if it is an HTML attribute or a URL parameter, you do the escaping accordingly for each case. (WordPress has a nice suite of functions that do this.)
Also, don't send passwords as URL parameters. Use a form with method POST instead. URLs are shown in the browser's address bar, and they also get copy-pasted into emails, Facebook, etc.
I'm trying to get all urls in one regular expression, currently i'm using this pattern.
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
However that regex returns the pages/files, instead of hosts. So instead of having to run a second regular expression, I'm hoping someone here can help
This returns http://www.yoursite.com/index.html
I'm attempting to return yoursite.com.
Also, the regex will be parsing from HTML and hosts will be checked afterwards, so 100% accuracy isn't crucial.
Assuming that your regex:
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
Actually does parse the Urls (I haven't checked it), you could easily use a capture group to get the host:
/^(https?:\/\/)?(?<host>([\da-z\.-]+)\.([a-z\.]{2,6}))([\/\w \.-]*)*\/?$/
When you get the Match result, you can examine Groups["host"] to get the host name.
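In C#, reading that named group might look like the sketch below. `HostCapture` and `HostOf` are made-up names, and the pattern is the one from the question, which this sketch does not try to validate or improve.

```csharp
using System.Text.RegularExpressions;

public static class HostCapture
{
    // The question's pattern with the host portion wrapped in a named group.
    static readonly Regex UrlPattern = new Regex(
        @"^(https?:\/\/)?(?<host>([\da-z\.-]+)\.([a-z\.]{2,6}))([\/\w \.-]*)*\/?$",
        RegexOptions.IgnoreCase);

    // Returns the captured host, or null when the pattern does not match.
    public static string HostOf(string url)
    {
        Match match = UrlPattern.Match(url);
        return match.Success ? match.Groups["host"].Value : null;
    }
}
```

For http://www.yoursite.com/index.html this returns www.yoursite.com; dropping the www. prefix would be a separate step.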
But you're much better off, in my opinion, just using Uri.TryCreate, although you'll need a little logic to get around the possible lack of a scheme. That is:
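// Prepend a scheme when the line has none, so Uri.TryCreate can parse it as absolute.
// A verbatim string is needed here: "\/" is not a valid escape in a regular C# string.
if (!Regex.IsMatch(line, @"^https?://"))
    line = "http://" + line;
string host = null;
Uri uri;
if (Uri.TryCreate(line, UriKind.Absolute, out uri))
{
    // it's a valid url.
    host = uri.Host;
}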
Parsing Urls is a pretty tricky business. For example, no individual dotted segment can exceed 63 characters, and there's nothing preventing the last dotted segment from having numbers or hyphens. Nor is it limited to 6 characters. You're better off passing the entire string to Uri.TryCreate than you are trying to duplicate the craziness of URL parsing with a single regular expression.
It's possible that the rest of the Url (after the host name) could be trash. If you want to eliminate that bit causing a problem, then extract everything up to the end of the host name:
^https?:\/\/[^\/]*
Then run that through Uri.TryCreate.
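Put together, the two steps (cut everything after the host, then let Uri do the parsing) could look like this. It is a sketch under assumptions: `HostExtractor.GetHost` is a hypothetical helper, and it returns null for anything Uri.TryCreate rejects.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HostExtractor
{
    // Scheme plus everything up to the first slash after it.
    static readonly Regex SchemeAndHost = new Regex(@"^https?://[^/]*", RegexOptions.IgnoreCase);

    public static string GetHost(string line)
    {
        line = line.Trim();

        // Default to http:// when no scheme is present.
        if (!Regex.IsMatch(line, @"^https?://", RegexOptions.IgnoreCase))
        {
            line = "http://" + line;
        }

        // Drop the path so that trash after the host cannot make parsing fail.
        Match prefix = SchemeAndHost.Match(line);
        Uri uri;
        if (prefix.Success && Uri.TryCreate(prefix.Value, UriKind.Absolute, out uri))
        {
            return uri.Host;
        }
        return null;
    }
}
```

`GetHost("http://www.yoursite.com/index.html")` returns www.yoursite.com, and a scheme-less input such as yoursite.com/some page returns yoursite.com.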
To capture just the yoursite.com from sample text http://www.yoursite.com/index?querystring=value you could use this expression, however this does not validate the string:
^(https?:\/\/)?(?:[^.\/?]*[.])?([^.\/?]*[.][^.\/?]*)
Live demo: http://www.rubular.com/r/UNR7qiQ0Eq
I've got a .NET 3.5 web application written in C# doing some URL rewriting that includes a file path, and I'm running into a problem. When I call string.Split('/') it matches both '/' and '\' characters. Is that... supposed to happen? I assumed that it would notice that the ASCII values were different and skip it, but it appears that I'm wrong.
// url = 'someserver.com/user/token/files\subdir\file.jpg'
string[] buffer = url.Split('/');
The above code gives a string[] with 6 elements in it... which seems counter intuitive. Is there a way to force Split() to match ONLY the forward slash? Right now I'm lucky, since the offending slashes are at the end of the URL, I can just concatenate the rest of the elements in the string[], but it's a lot of work for what we're doing, and not a great solution to the underlying problem.
Anyone run into this before? Have a simple answer? I appreciate it!
More Code:
url = HttpContext.Current.Request.Path.Replace("http://", "");
string[] buffer = url.Split('/');
Turns out, Request.Path and Request.RawUrl are both changing my slashes, which is ridiculous. So, time to research that a bit more and figure out how to get the URL from a function that doesn't break my formatting. Thanks everyone for playing along with my insanity, sorry it was a misleading question!
When I try the following:
string url = @"someserver.com/user/token/files\subdir\file.jpg";
string[] buffer = url.Split('/');
Console.WriteLine(buffer.Length);
... I get 4. Post more code.
Something else is happening, paste more code.
string str = "a\\b/c\\d";
string[] ts = str.Split('/');
foreach (string t in ts)
{
Console.WriteLine(t);
}
outputs
a\b
c\d
just like it should.
My guess is that you are converting / into \ somewhere.
You could use regex to convert all \ slashes to a temp char, split on /, then regex the temp chars back to \. Pain in the butt, but one option.
I suspect (without seeing your whole application) that the problem lies in the semantics of path delimiters in URLs. It sounds like you are trying to attach a semantic value to backslashes within your application that is contrary to the way HTTP protocols define and use backslashes.
This is just a guess, of course.
The best way to solve this problem might be modifying the application to encode the path in some other way (such as "%5C" for backslashes, maybe?).
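A rough sketch of that idea: escape the file path before it is placed in the URL, so the backslashes travel as %5C, and unescape each segment after splitting. `BackslashSafeUrl` is a hypothetical helper, and whether the server hands the still-encoded form to your code is an assumption that depends on which request property you read.

```csharp
using System;

public static class BackslashSafeUrl
{
    // Appends filePath to prefix as a single escaped segment ("\" becomes "%5C").
    public static string Build(string prefix, string filePath)
    {
        return prefix.TrimEnd('/') + "/" + Uri.EscapeDataString(filePath);
    }

    // Splits on "/" only, then restores the original characters in each segment.
    public static string[] SplitAndDecode(string url)
    {
        string[] parts = url.Split('/');
        for (int i = 0; i < parts.Length; i++)
        {
            parts[i] = Uri.UnescapeDataString(parts[i]);
        }
        return parts;
    }
}
```

`Build("someserver.com/user/token", @"files\subdir\file.jpg")` gives someserver.com/user/token/files%5Csubdir%5Cfile.jpg, and `SplitAndDecode` on that string returns four elements, the last being the original file path with its backslashes.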
Those two functions are probably converting \ to / because \ is not a valid character in a URL (see Which characters make a URL invalid?). The browser (NOT C#, as you are implying) is assuming that when you use that invalid character you mean /, so it is "fixing" it for you. If you want \ in your URL, you need to encode it first.
The browsers themselves are actually the ones that make that change in the request, even if it is behind the scenes. To verify this, just turn on fiddler and look at the URLs that are actually getting sent when you go to a URL like this. IE and Chrome actually change the \ to / in the URL field on the browser itself, FireFox doesn't, but the request goes through that way anyways.
Update:
How about this:
Regex.Split(url, "/");
If I have a series of "pattern" Urls of the form:
http://{username}.sitename.com/
http://{username}.othersite.net/
http://mysite.com/{username}
and I have an actual Url of the form:
http://joesmith.sitename.com/
Is there any way that I can match a pattern Url and in turn use it to extract the username portion out the actual Url? I've thought of nasty ways to do it, but it just seems like there should be a more intuitive way to accomplish this.
ASP.NET MVC uses a similar approach to extract the various segments of the URL when it is building its routes. Given the example:
{controller}/{action}
So given the Url of the form, Home/Index, it knows that it is the Home controller calling the Index action method.
Not sure I understand this question correctly but you can just use a regular expression to match anything between 'http://' and the first dot.
A very simple regex will do:
':https?://([a-z0-9\.-]*[a-z0-9])\.sitename\.com'
This will allow any subdomain that only contains valid subdomain characters. Example of allowed subdomains:
joesmith.sitename.com
joe.smith.sitename.com
joe-smith.sitename.com
a-very-long-subdomain.sitename.com
As you can see, you might want to complicate the regex slightly. For instance, you could limit it to only allow a certain amount of characters in the subdomain.
It seems the quickest and easiest solution is to build on Machine's answer.
var givenUri = "http://joesmith.sitename.com/";
var patternUri = "http://{username}.sitename.com/";
patternUri = patternUri.Replace("{username}", @"([a-z0-9\.-]*[a-z0-9])");
var result = Regex.Match(givenUri, patternUri, RegexOptions.IgnoreCase).Groups;
if(!String.IsNullOrEmpty(result[1].Value))
return result[1].Value;
Seems to work great.
Well, this "pattern URL" is a format you've made up, right? Basically, you'll just need to process it.
If the format of it is:
anything inside "{ }" is a thing to capture, everything else must be as is
Then you'd just find the start/end index of those brackets, and match everything else. Then when you get to a place where one is, make sure you only look for chars such that they don't match whatever 'token' comes after the next ending '}'.
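The approach above can be sketched by turning the pattern URL into a regular expression: literal text is escaped as-is and every {token} becomes a named capture group. `UrlPattern`, `ToRegex` and `Extract` are hypothetical names, and the rule that a token may not contain "." or "/" is an assumption made here to keep tokens from swallowing the literal text that follows them.

```csharp
using System.Text;
using System.Text.RegularExpressions;

public static class UrlPattern
{
    public static Regex ToRegex(string pattern)
    {
        // Regex.Split keeps captured text in its result, so even indexes hold
        // literal text and odd indexes hold token names.
        string[] parts = Regex.Split(pattern, @"\{(\w+)\}");
        var sb = new StringBuilder("^");
        for (int i = 0; i < parts.Length; i++)
        {
            if (i % 2 == 0)
            {
                sb.Append(Regex.Escape(parts[i]));      // must match exactly
            }
            else
            {
                sb.Append("(?<").Append(parts[i]).Append(">[^./]+)"); // thing to capture
            }
        }
        sb.Append("$");
        return new Regex(sb.ToString(), RegexOptions.IgnoreCase);
    }

    // Returns the value captured for token, or null when url does not fit pattern.
    public static string Extract(string pattern, string url, string token)
    {
        Match match = ToRegex(pattern).Match(url);
        return match.Success ? match.Groups[token].Value : null;
    }
}
```

`Extract("http://{username}.sitename.com/", "http://joesmith.sitename.com/", "username")` returns joesmith, and the same URL tested against the othersite.net pattern returns null, so you can try each pattern in turn until one matches.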
There are definitely different ways - ultimately though your server must be configured to handle (and possibly route) these different subdomain requests.
What I would do would be to answer all subdomain requests (except maybe some reserved words, like 'www', 'mail', etc.) on sitename.com with a single handler or page (I'm assuming ASP.NET here based on your C# tag).
I'd use the request path, which is easy enough to get, with some simple string parsing/regex routines (remove the 'http://', grab the first token up until '.' or '/' or '\', etc.) and then use that in a session, making sure to observe URL changes.
Alternately, you could map certain virtual paths to request urls ('joesmith.sitename.com' => 'sitename.com/index.aspx?username=joesmith') via IIS but that's kind of nasty too.
Hope this helps!