Query strings and text case


Is it safe to evaluate a query string's value with case (upper/lower) taken into account? Do some browsers lowercase the whole string, for example? Is it reliable enough to code as though whatever parameters I add to the query string will keep the same case? (Obviously, putting to one side the fact that users might mess with it.)
Tagged with C# because I'm not sure whether the platform evaluating the query string affects the answer, and C# is what I'm coding in.

Convention is key. If you use camel-cased query strings throughout your app, use camel-case, etc. You're going to be the one passing arguments and specifying query strings, so keep it consistent to make life easy on yourself. Other than keeping it consistent, there's no real benefit to a particular casing convention.
The browser will keep capitalization intact.
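A minimal sketch of what this means on the server side, assuming `System.Web.HttpUtility` is available and using a made-up `mode` parameter. The value comes back with the casing it was sent in, so whether the comparison is case-sensitive is your decision, not the browser's:

```csharp
using System;
using System.Collections.Specialized;
using System.Web;

// Hypothetical query string, as it would arrive from the browser.
NameValueCollection query = HttpUtility.ParseQueryString("?mode=EditUser");
string? mode = query["mode"];

Console.WriteLine(mode); // EditUser

// Case-sensitive check: only matches the exact casing you generated.
bool exact = string.Equals(mode, "EditUser", StringComparison.Ordinal);

// Case-insensitive check: tolerates a user retyping the URL by hand.
bool relaxed = string.Equals(mode, "edituser", StringComparison.OrdinalIgnoreCase);

Console.WriteLine($"{exact} {relaxed}"); // True True
```

If users editing the URL is a realistic concern, the `OrdinalIgnoreCase` comparison is the more forgiving choice.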

Related

How does Visual Studio syntax-highlight strings in the Regex constructor?

Hi fellow programmers and nerds!
When creating regular expressions in Visual Studio, the IDE will highlight the string if it's preceded by a verbatim identifier (for example, @"Some string"). This looks something like this:
(Notice the way the string is highlighted). Most of you will have seen this by now, I'm sure.
My problem: I am using a package acquired from NuGet which deals with regular expressions, and it has a function which takes a regular expression string; however, that function doesn't get the syntax highlighting.
As you can see, this makes reading the Regex string a pain. It's not all that important, but the highlighting would reduce the time and effort spent deciphering each expression, especially in a case like mine where there will be quite a lot of these expressions.
The question
So what I'm wanting to know is, is there a way to make a function highlight the string this way*, or is it just something that's hardwired into the IDE for the specific case of the Regex c-tor? Is there some sort of annotation which can be tacked onto the function to achieve this with minimal effort, or would it be necessary to use some sort of extension?
*I have wrapped the call to AddStyle() into one of my own functions anyway, and the string will be passed as a parameter, so if any modifications need to be made to achieve the syntax-highlight, they can be made to my function. Therefore the fact that the AddStyle() function is from an external library should be irrelevant.
If it's a lot of work then it's not worth my time, somebody else is welcome to develop an extension to solve this, but if there is a way...
Important distinction
Please bear in mind I am talking about Visual Studio, NOT Visual Studio Code.
Also, if there is a way to pull the original expression string from the Regex, I might do it that way, since performance isn't a huge concern here as this is a once-on-startup thing, however I would prefer not to do it that way. I don't actually need the Regex object.

According to https://devblogs.microsoft.com/dotnet/visual-studio-2019-net-productivity/#regex-language-support and https://www.meziantou.net/visual-studio-tips-and-tricks-regex-editing.htm you can mark the string with a special comment to get syntax highlighting:
// language=regex
var str = @"[A-Z]\d+";
or
MyMethod(/* language=regex */ @"[A-Z]\d+");
(the comment may contain more than just this language=regex part)
The first linked blog talks about a preview, but this feature is also present in the final product.

.NET 7 introduces the new [StringSyntax(...)] attribute, which is used in .NET 7 on more than 350 string, string[], and ReadOnlySpan<char> parameters, properties, and fields to highlight to an interested tool what kind of syntax is expected to be passed or set.
https://devblogs.microsoft.com/dotnet/regular-expression-improvements-in-dotnet-7/
So for a method argument you should just use:
void MyMethod([StringSyntax(StringSyntaxAttribute.Regex)] string regex);
Here is a video demonstrating the feature: https://youtu.be/Y2YOaqSAJAQ
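For the wrapper-function case in the question, a sketch might look like the following. `AddStyleRule` is a hypothetical stand-in for your own wrapper around the library's `AddStyle()`, and the `sample` parameter exists only so the example has something to run. It assumes .NET 7 or later and an IDE version that understands the attribute:

```csharp
using System;
using System.Diagnostics.CodeAnalysis;
using System.Text.RegularExpressions;

// The pattern argument at this call site is what the IDE should highlight as a regex.
bool ok = AddStyleRule(@"[A-Z]\d+", "A42");
Console.WriteLine(ok); // True

// Hypothetical wrapper. The attribute only informs tooling; it does not
// change how the string behaves at run time.
bool AddStyleRule([StringSyntax(StringSyntaxAttribute.Regex)] string pattern, string sample)
{
    // A real wrapper would forward 'pattern' to the library; here we just test it.
    return Regex.IsMatch(sample, pattern);
}
```

On older target frameworks, where the attribute is not available, the `// language=regex` comment from the other answer is the fallback.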

Compare a string which has a param

I am reading in a header from a file which has time fields, for example Time (UTC +1). I then need to compare this with a list of stored headers to work out whether the file is valid. However, my stored headers are also used for writing, so they allow flexibility on the timezone by being written like so: Time (UTC {0}).
I would like to know the best way of dealing with this in as flexible a way as possible. The only way I can imagine doing it is by getting the position of the { and only comparing up to that. This is fine in this circumstance, but what if I have some words after the parameter which are more important than a closing bracket?
EDIT: I would like to give some context to the problem so that I can explain better how flexible I need it. I think I possibly didn't emphasise the fact that I didn't want it to JUST work with the time field.
I am trying to write a system which is very flexible. I store a list of valid headings and then use them to find out what value to read from or write to the CSV file, and I want to keep that neat and easily maintainable. I want to be able to write a function which takes in a string with one or more parameters in it and compares it with a value which has had the parameters filled in (like the example with the Time header). In the future I may have a field for temperature in a particular place, so my stored heading would be Temperature in {0}({1}), which when reading back would be Temperature in Britain(c) or Temperature in America(f).

You could use a regex like this one :
string pattern = @"Time \(UTC [+-]?\d+\)";
Regex rgx = new Regex(pattern);
Regex has a Match method you can use to check whether any string matches the pattern you provided.
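To cover the general case from the question's edit (any stored heading, any number of `{n}` placeholders), one option is to build the regex from the stored template instead of writing it by hand. This is only a sketch: `TemplateToRegex` and `MatchesTemplate` are hypothetical helper names, and it assumes every placeholder stands for at least one character.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

Console.WriteLine(MatchesTemplate("Time (UTC {0})", "Time (UTC +1)"));                      // True
Console.WriteLine(MatchesTemplate("Temperature in {0}({1})", "Temperature in Britain(c)")); // True
Console.WriteLine(MatchesTemplate("Time (UTC {0})", "Date (UTC +1)"));                      // False

// Turns a string.Format-style template into an anchored regex in which
// every {n} placeholder matches one or more arbitrary characters.
Regex TemplateToRegex(string template)
{
    // Split on the placeholders, escape the literal pieces, rejoin with a wildcard group.
    string[] literals = Regex.Split(template, @"\{\d+\}");
    string pattern = "^" + string.Join("(.+?)", literals.Select(s => Regex.Escape(s))) + "$";
    return new Regex(pattern);
}

bool MatchesTemplate(string template, string header)
{
    return TemplateToRegex(template).IsMatch(header);
}
```

Because the literal text after a placeholder is part of the pattern, words that follow the parameter are checked too, which addresses the concern about only comparing up to the `{`.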

Request.QueryString[] vs. Request.QueryString.Get() vs. HttpUtility.ParseQueryString()

I searched SO and found similar questions, but none compared all three. That surprised me, so if someone knows of one, please point me to it.
There are a number of different ways to parse the query string of a request... the "correct" way (IMO) should handle null/missing values, but also decode parameter values as appropriate. Which of the following would be the best way to do both?
Method 1
string suffix = Request.QueryString.Get("suffix") ?? "DefaultSuffix";
Method 2
string suffix = Request.QueryString["suffix"] ?? "DefaultSuffix";
Method 3
NameValueCollection query = HttpUtility.ParseQueryString(Request.Url.Query);
string suffix = query.Get("suffix") ?? "DefaultSuffix";
Method 4
NameValueCollection query = HttpUtility.ParseQueryString(Request.Url.Query);
string suffix = query["suffix"] ?? "DefaultSuffix";
Questions:
1. Would Request.QueryString["suffix"] return a null if no suffix was specified? (Embarrassingly basic question, I know)
2. Does HttpUtility.ParseQueryString() provide any extra functionality over accessing Request.QueryString directly?
3. The MSDN documentation lists this warning:
The ParseQueryString method uses query strings that might contain user input, which is a potential security threat. By default, ASP.NET Web pages validate that user input does not include script or HTML elements. For more information, see Script Exploits Overview.
But it's not clear to me if that means ParseQueryString() should be used to handle that, or is exposed to security flaws because of it... Which is it?
4. ParseQueryString() uses UTF8 encoding by default... do all browsers encode the query string in UTF8 by default?
5. ParseQueryString() will comma-separate values if more than one is specified... does Request.QueryString do that as well, or what happens if it doesn't?
6. Which of those methods would correctly decode "%2b" to be a "+"?
Showing my Windows development roots again... and I would be a much faster developer if I didn't wonder about these things so much... : P

Methods #1 and #2 are the same thing, really. (I think the .Get() method is provided for language compatibility.)
ParseQueryString returns the functional equivalent of Request.QueryString. You would usually use it when you have a raw URL and no other way to parse the query string parameters from it. Request.QueryString does that for you, so in this case it's not needed.
1. You can't leave off "suffix". You either have to pass a string or an index number. If you leave off the [] entirely, you get the whole NameValueCollection. If you mean what happens when "suffix" is not one of the QueryString values, then yes: you would get null if you called Request.QueryString["suffix"].
2. No. The most likely time you would use it is if you had an external URL and wanted to parse the query string parameters from it.
3. ParseQueryString does not handle it... neither does pulling the values straight from Request.QueryString. For ASP.NET, you usually handle form values as the values of controls, and that is where ASP.NET usually 'handles' these things for you. In other words: DON'T TRUST USER INPUT. Ever. No matter what framework is doing whatever for you.
4. I have no clue (I think no). However, I think what you are reading is telling you that ParseQueryString returns UTF-8 encoded text, regardless of whether it was so encoded when it came in.
5. Again: ParseQueryString returns basically the same thing you get from Request.QueryString. In fact, I think ParseQueryString is used internally to provide Request.QueryString.
6. They would produce the equivalent; they will all properly decode the values submitted. If you have the URL http://site.com/page.aspx?id=%20Hello and then call Request.QueryString["id"], the return value will be " Hello", because it automatically decodes.
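The points above can be checked outside a web request with a short sketch. It calls `HttpUtility.ParseQueryString` on a literal string (since `Request.QueryString` needs a live request), and the parameter names are made up for illustration:

```csharp
using System;
using System.Collections.Specialized;
using System.Web;

// Hypothetical query string with an encoded space, an encoded plus and a repeated key.
NameValueCollection query = HttpUtility.ParseQueryString("?id=%20Hello&op=%2b&tag=a&tag=b");

string? id = query["id"];      // " Hello": percent-decoding is already applied
string? op = query.Get("op");  // "+": Get() and the indexer behave the same
string? tags = query["tag"];   // "a,b": repeated keys are joined with commas
string suffix = query["suffix"] ?? "DefaultSuffix"; // key absent, so null falls through

Console.WriteLine($"[{id}] [{op}] [{tags}] [{suffix}]"); // [ Hello] [+] [a,b] [DefaultSuffix]
```

If the individual values of a repeated key are needed, `query.GetValues("tag")` returns them as an array instead of a comma-joined string.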

Example 1:
string itsMeString = string.IsNullOrEmpty(Request.QueryString["itsMe"]) ? string.Empty : HttpUtility.UrlDecode(Request.QueryString["itsMe"]);
Straight to your questions:
1. I'm not quite sure what you mean by suffix. If you are asking what happens when the key is not present (you don't have it in the QueryString), then yes, it will return null.
2. My GUESS here is that, when constructed, Request.QueryString internally calls the HttpUtility.ParseQueryString() method and caches the NameValueCollection for subsequent access. I think the latter is only exposed so you can use it on a string that does not come from the Request, for example if you are scraping a web page and need to get some arguments from a string you've found in the code of that page. This way you won't need to construct a Uri object but can get just the query string as a NameValueCollection, if you are sure that is all you need. (This is a wild guess ;).)
3. This is implemented at the page level, so if you access the QueryString in, say, the Page_Load event handler, you already have a valid and safe string. Otherwise ASP.NET throws an exception and never lets the code reach Page_Load, so you are protected from storing XSS in your database. The exception message is "A potentially dangerous Request.QueryString value was detected from the client"; it is the same message you get when a POST variable contains traces of XSS, except that it says Request.QueryString instead of Request.Form. This holds as long as you leave validateRequest switched on (it is by default). Switching it off implies you know what you're doing, so you will then need to implement the security yourself (by checking what's coming in).
4. It is probably safe to say yes. In any case, you will in most cases be generating the QueryString on your own (via JavaScript or server-side code), so be sure to use HttpUtility.UrlEncode for backend code and escape for JavaScript. This way the browser will be forced to turn "It's me!" into "It%27s%20me%21". You can refer to this article for more on URL encoding in JavaScript: http://www.javascripter.net/faq/escape.htm.
5. Please elaborate on that; I couldn't quite get what you mean by "will comma-separate values if more than one is specified".
6. As far as I remember, none of them will. You will probably need to call HttpUtility.UrlDecode / HttpUtility.HtmlDecode (depending on what input you have) to get the string correctly. In the above example with "It's me!" you would do something like Example 1 (it sits at the top because the code formatting breaks if I put it after the numbered list).

string.ToLower() and string.ToLowerInvariant()

What's the difference and when to use what? What's the risk if I always use ToLower() and what's the risk if I always use ToLowerInvariant()?

Depending on the current culture, ToLower might produce a culture-specific lowercase letter that you aren't expecting, such as ınfo (without the dot on the i) instead of info, and thus muck up string comparisons. For that reason, ToLowerInvariant should be used on any non-language-specific data. Generally, the only time to use ToLower is when you have user input that might be in the user's native language or character set.
See this question for an example of this issue:
C#- ToLower() is sometimes removing dot from the letter "I"

TL;DR:
When working with "content" (e.g. articles, posts, comments, names, places, etc.) use ToLower(). When working with "literals" (e.g. command line arguments, custom grammars, strings that should be enums, etc.) use ToLowerInvariant().
Examples:
=Using ToLowerInvariant incorrectly=
In Turkish, DIŞ means "outside" and diş means "tooth". The proper lower casing of DIŞ is dış. So, if you use ToLowerInvariant incorrectly you may have typos in Turkey.
=Using ToLower incorrectly=
Now pretend you are writing an SQL parser. Somewhere you will have code that looks like:
if (op.ToLower() == "like")
{
    // Handle an SQL LIKE operator
}
The SQL grammar does not change when you change cultures. A Frenchman does not write SÉLECTIONNEZ x DE books instead of SELECT X FROM books. However, in order for the above code to work, a Turkish person would need to write SELECT x FROM books WHERE Author LİKE '%Adams%' (note the dot above the capital i, almost impossible to see). This would be quite frustrating for your Turkish user.
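The effect can be reproduced without changing the machine's locale by passing a culture explicitly. This is a sketch that assumes Turkish (`tr-TR`) culture data is available at run time; in an invariant-globalization configuration the culture lookup may fail or behave like the invariant culture, so the culture-specific result is described in a comment and not asserted.

```csharp
using System;
using System.Globalization;

var turkish = new CultureInfo("tr-TR");

string invariant = "LIKE".ToLowerInvariant(); // "like" on every machine
string cultural = "LIKE".ToLower(turkish);    // "lıke" (dotless ı) when Turkish casing rules apply

Console.WriteLine(invariant == "like"); // True
Console.WriteLine(cultural == "like");  // False under Turkish casing rules

// For keyword checks, comparing without lowercasing at all sidesteps the issue.
bool isLike = string.Equals("LIKE", "like", StringComparison.OrdinalIgnoreCase);
Console.WriteLine(isLike); // True
```

For the SQL parser example, the `OrdinalIgnoreCase` comparison is arguably the simplest fix, since it never allocates a lowercased copy and ignores culture entirely.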

I think this can be useful:
http://msdn.microsoft.com/en-us/library/system.string.tolowerinvariant.aspx
update
If your application depends on the case of a string changing in a predictable way that is unaffected by the current culture, use the ToLowerInvariant method. The ToLowerInvariant method is equivalent to ToLower(CultureInfo.InvariantCulture). The method is recommended when a collection of strings must appear in a predictable order in a user interface control.
also
...ToLower is very similar in most places to ToLowerInvariant. The documents indicate that these methods will only change behavior with Turkish cultures. Also, on Windows systems, the file system is case-insensitive, which further limits its use...
http://www.dotnetperls.com/tolowerinvariant-toupperinvariant
hth

String.ToLower() uses the current culture while String.ToLowerInvariant() uses the invariant culture. So you are essentially asking about the difference between the current culture and the invariant culture.

Regular expression to extract domain name from any domain

I'm trying to extract the domain name from a string in C#. You don't necessarily have to use a RegEx but we should be able to extract yourdomain.com from all of the following:
yourdomain.com
www.yourdomain.com
http://www.yourdomain.com
http://www.yourdomain.com/
store.yourdomain.com
http://store.yourdomain.com
whatever.youdomain.com
*.yourdomain.com
Also, any TLD is acceptable, so replace .com in all of the above with .net, .org, .co.uk, etc.

1. If no scheme is present (no colon in the string), prepend "http://" to make it a valid URL.
2. Pass the string to the Uri constructor.
3. Access the Uri's Host property.
Now you have the hostname. What exactly you consider the ‘domain name’ of a given hostname is a debatable point. I'm guessing you don't simply mean everything after the first dot.
It's not possible to distinguish hostnames like ‘whatever.youdomain.com’ from domains-in-an-SLD like ‘warwick.ac.uk’ from just the strings. Indeed, there is even a bit of grey area about what is and isn't a public SLD, given the efforts of some registrars to carve out their own niches.
A common approach is to maintain a big list of SLDs and other suffixes used by unrelated entities. This is what web browsers do to stop unwanted public cookie sharing. Once you've found a public suffix, you could add the one nearest prefix in the host name split by dots to get the highest-level entity responsible for the given hostname, if that's what you want. Suffix lists are hell to maintain, but you can piggy-back on someone else's efforts.
Alternatively, if your app has the time and network connection to do it, it could start sniffing for information on the hostname. For example, it could do a whois query for the hostname and keep looking at each parent until it got a result; that would be the domain name of the lowest-level entity responsible for the given hostname.
Or, if all that's too much work, you could try just chopping off any leading ‘www.’ present!
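The three steps at the top of this answer, plus the "chop off www." shortcut, can be sketched as follows. `GetHost` is a hypothetical helper name, and it uses `Uri.TryCreate` in place of the constructor so that malformed input returns null instead of throwing:

```csharp
using System;

Console.WriteLine(GetHost("yourdomain.com"));              // yourdomain.com
Console.WriteLine(GetHost("http://www.yourdomain.com/"));  // yourdomain.com
Console.WriteLine(GetHost("http://store.yourdomain.com")); // store.yourdomain.com

// Hypothetical helper: returns the host name with a leading "www." removed,
// or null when the input cannot be parsed as an absolute URL.
string? GetHost(string input)
{
    // Step 1: no colon means no scheme, so prepend one.
    string candidate = input.Contains(":") ? input : "http://" + input;

    // Steps 2 and 3: let Uri parse the string, then read its Host.
    if (!Uri.TryCreate(candidate, UriKind.Absolute, out Uri? uri))
    {
        return null;
    }

    string host = uri.Host;
    return host.StartsWith("www.", StringComparison.OrdinalIgnoreCase) ? host.Substring(4) : host;
}
```

Note that this deliberately leaves store.yourdomain.com alone: reducing it to yourdomain.com requires the public-suffix knowledge discussed above, which a string operation cannot supply.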

I would recommend trying this yourself, using Regulator and a regex cheat sheet:
http://sourceforge.net/projects/regulator/
http://regexlib.com/CheatSheet.aspx
You can also find some good info on regular expressions at Coding Horror.

Have a look at this other answer. It was for PHP but you'll easily get the regex out of the 4-5 lines of PHP and you can benefit from the discussion that followed (see Alnitak's answer).

A regex doesn't really fit your requirement of "any TLD", since the format and number of TLDs is quite large and continually in flux. If you limited your scope to:
(?<domain>[^\.]+\.([A-Z]+$|co\.[A-Z]+$))
You would catch .anything and .co.anything, which I imagine covers most realistic cases...
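A quick way to try a pattern along these lines is sketched below. `ExtractDomain` is a hypothetical helper; `RegexOptions.IgnoreCase` is needed because the pattern only lists `[A-Z]`, and the `co.` branch here uses `[A-Z]+` so that two-letter endings such as .co.uk match.

```csharp
using System;
using System.Text.RegularExpressions;

var domainRegex = new Regex(@"(?<domain>[^\.]+\.([A-Z]+$|co\.[A-Z]+$))", RegexOptions.IgnoreCase);

Console.WriteLine(ExtractDomain("www.yourdomain.com"));          // yourdomain.com
Console.WriteLine(ExtractDomain("http://store.yourdomain.com")); // yourdomain.com
Console.WriteLine(ExtractDomain("store.yourdomain.co.uk"));      // yourdomain.co.uk

// Hypothetical helper: returns the captured domain, or null when nothing matches.
string? ExtractDomain(string input)
{
    Match match = domainRegex.Match(input);
    return match.Success ? match.Groups["domain"].Value : null;
}
```

Because the pattern is anchored to the end of the string, input with a trailing slash or path (for example http://www.yourdomain.com/) does not match and would need that part trimmed first.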
