I have a C# app that sometimes has to work with strings like:
"example\x27s string"
How do I decode that? I know 27 is the ASCII code (in hex) for a single quote ', but UrlDecode() won't work on that string.
Should I replace the \x value with % and then use System.Web.HttpUtility.UrlDecode() or is there another way to do it?
\x27 is not an HTML-encoded value; it is a string escape sequence. In the actual string, though, there is probably a literal \ character, so what you're dealing with is:
"\\x27"
Or
@"\x27"
I am unsure whether .NET has a way to re-evaluate a string, but the \ codes in string literals are handled at the compiler level, if I remember correctly.
You could use regular expressions to do a replacement if you want, since you know what it represents.
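A minimal sketch of that regex replacement, assuming the input holds a literal backslash followed by x and exactly two hex digits. The class and method names (EscapeDecoder, DecodeHexEscapes) are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDecoder
{
    // Matches a literal backslash, an 'x', and exactly two hex digits, e.g. \x27.
    private static readonly Regex HexEscape = new Regex(@"\\x([0-9A-Fa-f]{2})");

    // Replaces each \xNN sequence with the character whose code is NN (hex).
    public static string DecodeHexEscapes(string input)
    {
        return HexEscape.Replace(input, m =>
            ((char)Convert.ToInt32(m.Groups[1].Value, 16)).ToString());
    }
}
```

Calling EscapeDecoder.DecodeHexEscapes(@"example\x27s string") returns "example's string". Regex.Unescape may also work here, but it converts every regex-style escape it finds, which could be broader than you want.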
I have some string that contains the following code/value:
"You won\u0092t find a ...."
It looks like that string contains the Right Apostrophe special character.
ref1: Unicode control 0092
ref2: ASCII chart (both 127 + extra extended ascii)
I'm not sure how to display this in the web browser. It keeps displaying the tofu square-box character instead. I'm under the impression that the unicode (hex) value 0092 can be converted to unicode (html) &#146;
Is my understanding correct?
Update 1:
It was suggested by @sam-axe that I HtmlEncode the unicode. That didn't work. Here it is...
Note the ampersand got correctly encoded....
It looks like there's an encoding mix-up. In .NET, strings are normally encoded as UTF-16, and a right apostrophe should be represented as \u2019. But in your example, the right apostrophe is represented as \x92, which suggests the original encoding was Windows code page 1252. If you include your string in a Unicode document, the character \x92 won't be interpreted properly.
You can fix the problem by re-encoding your string as UTF-16. To do so, treat the string as an array of bytes, and then convert the bytes back to Unicode using the 1252 code page:
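using System.Linq;
using System.Text;
// On .NET Core and later, first call Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
string title = "You won\u0092t find a cheaper apartment * Sauna & Spa";
byte[] bytes = title.Select(c => (byte)c).ToArray(); // every char in this string is below 256
title = Encoding.GetEncoding(1252).GetString(bytes);
// Result: "You won’t find a cheaper apartment * Sauna & Spa"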
Note: much of my answer is based on guessing and looking at the decompiled code of System.Web 4.0. The reference source looks very similar (identical?).
You're correct that "&#146;" (6 characters) can be displayed in the browser. Your output string, however, contains "\u0092" (1 character). This is a control character, not an HTML entity.
According to the reference code, WebUtility.HtmlEncode() doesn't transform characters from 128 through 159; all characters in this range are control characters (the ampersand is special-cased in the code, as are a few other special HTML symbols).
My guess is that because these are control characters, they're output without transformation, since transforming them would change the meaning of the string. (I tried running some examples in LINQPad; this character was not rendered.)
If you really want to transform these characters (or remove them), you'll probably have to write your own function before/after calling HtmlEncode() - there may be something that does this already but I don't know of any.
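One possible shape for such a function, as a sketch only: it runs WebUtility.HtmlEncode() first and then rewrites any remaining character in the 0x80 to 0x9F range as a decimal numeric reference. The names HtmlC1Encoder and Encode are hypothetical, and the sketch assumes the browser maps references such as &#146; to the Windows-1252 character, which HTML parsers generally do.

```csharp
using System.Net;
using System.Text;

public static class HtmlC1Encoder
{
    // HTML-encodes the input, then turns C1 control characters (0x80-0x9F)
    // into decimal numeric character references such as &#146;.
    public static string Encode(string input)
    {
        if (input == null) return null;

        string encoded = WebUtility.HtmlEncode(input);
        var sb = new StringBuilder(encoded.Length);
        foreach (char c in encoded)
        {
            if (c >= '\u0080' && c <= '\u009F')
                sb.Append("&#").Append((int)c).Append(';');
            else
                sb.Append(c);
        }
        return sb.ToString();
    }
}
```

With this, HtmlC1Encoder.Encode("won\u0092t & more") yields "won&#146;t &amp; more". If the string is known to be mis-decoded Windows-1252, re-decoding it with the correct code page is the cleaner fix.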
Hope this helps.
Edit: Michael Liu's answer seems correct. I'm leaving my answer here because it may be useful in cases when the input encoding of a string is not known.
I have a C# string like this:
string a = "Hello";
How can I use the Encoding class to get the exact length of characters including null-terminating characters? For example, if I used Encoding.Unicode.GetByteCount, I should get 12 and if I used Encoding.ASCII.GetByteCount, I should get 6.
How can I use the Encoding class to encode the string into a byte array including the null-terminating characters?
Thank you for help!
As far as I remember, null termination is specific to C/C++-style languages and platforms. The Unicode and ANSI encodings do not require a string to be null-terminated, and neither does the C#/CLR platform, so you can't expect them to include that extra character. You will probably have a hard time making those classes emit it from your 5-character "Hello" string.
However, in C#/CLR, strings can contain null characters.
So, based on that, try converting this 6-character string instead:
string a = "Hello\0";
or
string a = "Hello";
a += "\0"; // if you really can't have the \0 at first time, you can simply add it
and I'm pretty sure you will get the result you wanted from both Encoding.ASCII and Encoding.Unicode (a single \0 byte in ASCII or UTF-8, \0\0 in UTF-16, and so on).
(Also, note that if you are P/Invoking, you don't need to handle this manually. The marshaller will null-terminate the string correctly, provided the data type is treated as string-like data and not array-like data.)
In .NET, strings are not null terminated, so you need to add the null character yourself if the protocol you're working with requires one. That means:
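You need to manually add the size of the terminator to the byte count (1 byte for ASCII or UTF-8, 2 bytes for UTF-16).
You need to manually write the null character (e.g. (byte)0, or two zero bytes for UTF-16) at the end of the byte array.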
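The two steps above could be sketched like this. NullTerminated and GetBytes are illustrative names, and the sketch assumes the terminator is simply the encoded form of "\0" in whichever encoding is passed in.

```csharp
using System.Text;

public static class NullTerminated
{
    // Encodes s and leaves room for an encoded '\0' at the end.
    // A new byte[] is zero-filled, so the terminator bytes are already 0.
    public static byte[] GetBytes(string s, Encoding encoding)
    {
        int textBytes = encoding.GetByteCount(s);
        int terminatorBytes = encoding.GetByteCount("\0"); // 1 for ASCII, 2 for UTF-16
        byte[] result = new byte[textBytes + terminatorBytes];
        encoding.GetBytes(s, 0, s.Length, result, 0);
        return result;
    }
}
```

For "Hello", NullTerminated.GetBytes(a, Encoding.ASCII).Length is 6 and NullTerminated.GetBytes(a, Encoding.Unicode).Length is 12, matching the counts asked for in the question.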
I want a String to have a New Line in it, but I cannot use escape sequences because the interface I am sending my string to does not recognize them. As far as I know, C# does not actually store a New Line in the String, but rather it stores the escape sequence, causing the literal contents to be passed, rather than what they actually mean.
My best guess is that I would have to somehow parse the number 10 (the decimal value of a New Line according to the ASCII table) into ASCII. But I'm not sure how to do that, because C# parses numbers directly to String if attempting this:
"hello" + 10 + "world"
Any suggestions?
If you say "hello\nworld", the actual string will contain:
hello
world
There will be an actual new-line character in the string. At no point are the characters \ and n stored in the string.
There are a few ways to get the exact same result, but a simple \n in the string is a common way.
A simple cast should also do the same:
"hello" + (char)10 + "world"
This is likely slightly slower because of the string concatenation. I say "likely" because the compiler may optimize the concatenation away, and a real-world example using \n would probably also involve string concatenation, taking roughly the same amount of time.
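A small sketch that makes the point checkable; NewlineDemo and its field names are invented for this example. The verbatim literal is included only as a contrast, since that form really does store a backslash and an n.

```csharp
public static class NewlineDemo
{
    // Regular literal: the compiler turns \n into a single character, code 10.
    public static readonly string Escaped = "hello\nworld";

    // Same content built from the numeric character code.
    public static readonly string FromCast = "hello" + (char)10 + "world";

    // Verbatim literal: here the backslash and the 'n' are stored as two characters.
    public static readonly string Verbatim = @"hello\nworld";
}
```

Escaped and FromCast are equal, both have length 11, and the character at index 5 is (char)10. Verbatim has length 12 and contains a backslash.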
The preferred newline string is Environment.NewLine, because it follows the convention of the platform the code runs on.
You could use XML for communication, if your receiver can handle this.
really simple question... just want to represent double quote " without needing to do "" or \"
cases that I'm aware of:
var s=@"123 "" 456 """;
var s="123 \" 456 \"";
It'd make a reasonable difference if I could remove this noise somehow. The reason is that the escape character \ and the double quote have meaning in a domain-specific language (DSL) that we're using. Sometimes it's convenient to throw some syntax inline into a C# string.
What I'd like is a way to tell .net not to touch it. Perhaps some kind of catch all via the DLR?
Within a C# literal, there's nothing you can do; don't forget this is all done at compile time.
If you don't use single quotes, you could always do:
var s = "123 ' 456 '".Replace("'", "\"");
(Or choose some other character you don't use much, and replace that afterwards instead.)
Other than that, avoiding storing lots of data in your source code helps a lot with this sort of thing - for test data, I often use an embedded resource and load that in at execution time.
I don't suppose you could just read them in from a file or database?
Yeah, there's definitely a way to do that, and I use it all the time for exactly that reason.
You create a string resource collection (open Project Properties, Resources, make sure it's on Strings) and put your literal strings in there. Then, when you need one of those strings, use the Properties.Resources.{insert string resource name} reference to collect it in a pure and unadulterated form!
For completeness, I'll mention that you can use hex in a C# string, so in this case, \x0022. Note that you can omit the leading 0's if the character immediately following isn't hex.
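To illustrate the caveat about leading zeros, here is a short sketch (HexEscapeDemo and its field names are made up). In C#, \x consumes up to four hex digits, so the short form changes meaning when a hex digit follows it.

```csharp
public static class HexEscapeDemo
{
    // Fully padded form: always exactly one double-quote character.
    public static readonly string Padded = "123 \x0022 456";

    // Short form is safe here because the next character (a space) is not a hex digit.
    public static readonly string Short = "123 \x22 456";

    // Pitfall: 'a' and 'b' are hex digits, so this is \x22ab (U+22AB) followed by 'c'.
    public static readonly string Greedy = "\x22abc";

    // Padding to four digits keeps the quote separate from the text after it.
    public static readonly string Safe = "\x0022abc";
}
```

Padded and Short both equal "123 \" 456". Greedy has length 2, while Safe has length 4 and starts with a double quote.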
I am currently looking to detect whether a URL is encoded or not. Here are some specific examples:
http://www.linxology.com/browse.php?u=Oi8vZXNwbnN0YXIuY29tL21lZGlhLXBsYXllci8%3D&b=13
http://www.linxology.com/browse.php?u=Oi8vZXNwbnN0YXIuY29tL290aGVyX2ZpbGVzL2VzcG5zdGFyL25hdl9iZy1vZmYucG5n&b=13
Can you please give me a Regular Expression for this?
Is there a self-learning regular expression generator out there that can refine a regex as the number of inputs increases?
If you are interested in the base64-encoded URLs, you can do it.
A little theory. If L, R are regular languages and T is a regular transducer, then LR (concatenation), L & R (intersection), L | R (union), T(L) (image), T^-1(L) (preimage) are all regular languages. Every regular language has a regular expression that generates it, and every regexp generates a regular language. URLs can be described by a regular language (except if you need a subset of those that is not), and almost every escaping scheme (and base64) is a regular transducer. Therefore, in theory, it's possible.
In practice, it gets rather messy.
A regex for valid base64 strings is ([A-Za-z0-9+/]{4})*(|[A-Za-z0-9+/]{2}(==|[A-Za-z0-9+/]=))
If it is embedded in a query parameter of an url, it will probably be urlencoded. Let's assume only the = will be urlencoded (because other characters can too, but don't need to).
This gets us to something like [?&][^?&#=;]+=([A-Za-z0-9+/]{4})*(|[A-Za-z0-9+/]{2}(%3D%3D|[A-Za-z0-9+/]%3D))
Another possibility is to consider only those base64-encoded URLs that have some property. In your case, they all begin with "://", which is fortunate, because that translates exactly to the 4 characters "Oi8v". Otherwise, it would be more complex.
This gets [?&][^?&#=;]+=Oi8v([A-Za-z0-9+/]{4})*(|[A-Za-z0-9+/]{2}(%3D%3D|[A-Za-z0-9+/]%3D))
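A hedged sketch of trying that last pattern from C#, with its parentheses balanced. The class and method names are illustrative. Note that the pattern is not anchored at the end of the value, so it only checks that some parameter starts with "Oi8v" followed by base64-looking text.

```csharp
using System.Text.RegularExpressions;

public static class EncodedUrlRegex
{
    // A query parameter whose value starts with "Oi8v" (base64 for "://"),
    // with any padding '=' urlencoded as %3D.
    private static readonly Regex Pattern = new Regex(
        @"[?&][^?&#=;]+=Oi8v([A-Za-z0-9+/]{4})*(|[A-Za-z0-9+/]{2}(%3D%3D|[A-Za-z0-9+/]%3D))");

    public static bool LooksEncoded(string url)
    {
        return Pattern.IsMatch(url);
    }
}
```

Both example URLs from the question match, while something like http://example.com/?year=2009 does not.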
As you can see, it gets messier and messier. Therefore, I'd recommend that you instead:
break the URL into its parts (e.g. protocol, host, query string)
get the parameters from the query string, and urldecode them
try base64 decode on the values of the parameters
apply your criterion for "good encoded URLs"
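Those steps might look roughly like the following. This is a sketch under assumptions: the helper name TryDecodeBase64Parameter is invented, the query string is split by hand, and the decoded bytes are assumed to be UTF-8 text.

```csharp
using System;
using System.Text;

public static class EncodedUrlInspector
{
    // Finds the named query parameter, urldecodes it, and tries to base64-decode it.
    public static bool TryDecodeBase64Parameter(string url, string name, out string decoded)
    {
        decoded = null;
        var uri = new Uri(url);                      // step 1: break the URL into parts
        string query = uri.Query.TrimStart('?');

        foreach (string pair in query.Split(new[] { '&' }, StringSplitOptions.RemoveEmptyEntries))
        {
            int eq = pair.IndexOf('=');
            if (eq < 0) continue;

            // step 2: urldecode the key and value (a '+' is kept, as base64 needs it)
            string key = Uri.UnescapeDataString(pair.Substring(0, eq));
            if (key != name) continue;
            string value = Uri.UnescapeDataString(pair.Substring(eq + 1));

            try
            {
                // step 3: try base64-decoding the value
                decoded = Encoding.UTF8.GetString(Convert.FromBase64String(value));
                return true;
            }
            catch (FormatException)
            {
                return false;                        // not valid base64
            }
        }
        return false;                                // parameter not present
    }
}
```

For the first example URL and the parameter name "u", the decoded text is "://espnstar.com/media-player/". The last step, your own criterion, could then be as simple as checking decoded.StartsWith("://").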
Well, depending on what is in that encoded text, you might not even need a regular expression. If there are multiple querystring parameters in that one "u" key, perhaps you could just check the length of the text on each querystring value, and if it is over (say) 50, you can assume it's probably encoded. I doubt any unencoded single parameters would be as long as these, since those would have to be string data, and therefore they would probably need to be encoded!
This question may be harder than you realize. For example:
I could say that if a query string includes a question mark character then what follows it is encoded.
Now, it may be simple encoding like "?year=2009" or complicated like in your examples.
Or
The site URLs could use URL rewriting (like this site does). Look at the URL of this question. The "615958" is encoded and... no question marks were used!
In fact, you could say that the entire URL is encoded!
Perhaps you need to better define what you mean by "encoded".
You can't reliably parse a URL using regex. (Is this an SO mantra yet?)
You wrote: "Here are some specific examples:"
It's not clear what ‘encoded’ means; can you give some counter-examples of URLs you consider “not encoded”?
Are you talking about the Base64 encoding in the ‘u’ parameter? Whilst it is possible to say whether a string is a valid Base64 string, it's not possible to detect Base64 and distinguish it from anything else; for example the word “sausages” also happens to be valid Base64 (it decodes to '\xb1\xab\xacj\x07\xac').
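The "sausages" point is easy to confirm with a validity check. In this sketch, Base64Ambiguity and IsValidBase64 are illustrative names.

```csharp
using System;

public static class Base64Ambiguity
{
    // Reports whether text parses as base64 and returns the decoded bytes if so.
    // Passing this check says nothing about whether the text was meant as base64.
    public static bool IsValidBase64(string text, out byte[] bytes)
    {
        try
        {
            bytes = Convert.FromBase64String(text);
            return true;
        }
        catch (FormatException)
        {
            bytes = null;
            return false;
        }
    }
}
```

IsValidBase64("sausages", out var b) returns true, and BitConverter.ToString(b) gives "B1-AB-AC-6A-07-AC", the same bytes quoted above.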