Uri.UnescapeDataString not escaping in runtime - c#

I tried using Uri.UnescapeDataString to unescape a JavaScript-encoded URL. Here's the sample URL:
https://drive.google.com/open?id\u003d1n1hiV2sDFVctI8Qc9Z3EWPvEBO6KstFr\u0026usp\u003dsharing
When I tried Uri.UnescapeDataString in the C# Interactive window, it correctly unescaped the URL.
Microsoft (R) Roslyn C# Compiler version 2.8.3.63029
Loading context from 'CSharpInteractive.rsp'.
Type "#help" for more information.
> Uri.UnescapeDataString("https://drive.google.com/open?id\u003d1n1hiV2sDFVctI8Qc9Z3EWPvEBO6KstFr\u0026usp\u003dsharing")
"https://drive.google.com/open?id=1n1hiV2sDFVctI8Qc9Z3EWPvEBO6KstFr&usp=sharing"
But in the real application, it does not unescape anything. I tried it from the Immediate Window:
? uri
"https://drive.google.com/open?id\\u003d1n1hiV2sDFVctI8Qc9Z3EWPvEBO6KstFr\\u0026usp\\u003dsharing"
Uri.UnescapeDataString(uri)
"https://drive.google.com/open?id\\u003d1n1hiV2sDFVctI8Qc9Z3EWPvEBO6KstFr\\u0026usp\\u003dsharing"
Solution
The code below works for me, using Newtonsoft.Json's JObject:
var json = "{\"su\": \"" + uri + "\"}";
var ss = JObject.Parse(json);
return ss["su"].Value<string>();

Notice the difference in these two strings:
"https://drive.google.com/open?id\\u003d1n1hiV2sDFVctI8Qc9Z3EWPvEBO6KstFr\\u0026usp\\u003dsharing"
"https://drive.google.com/open?id\u003d1n1hiV2sDFVctI8Qc9Z3EWPvEBO6KstFr\u0026usp\u003dsharing"
The first is the string you printed out in the "real" application, and the second is what you typed into the command line interpreter.
The command line interpreter, like a compiler, is what converts \uXXXX into Unicode characters, not the call to UnescapeDataString. UnescapeDataString decodes URL-encoded (percent-encoded) strings, such as %20 for a space.
Your best bet is to use Json parsing of some kind. For something simple like this, System.Web.Script.Serialization.JavaScriptSerializer is adequate.
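A minimal sketch of the JSON-based approach, assuming .NET Core 3.0 or later, where System.Text.Json is available. The class and helper names are hypothetical, and System.Text.Json is swapped in for the JavaScriptSerializer mentioned above. Wrapping the value in double quotes turns it into a JSON string literal, so the parser decodes the \uXXXX sequences. This only works if the value contains no unescaped double quotes or other characters that are invalid inside a JSON string.

```csharp
using System;
using System.Text.Json;

public static class JsEscapeDemo
{
    // Hypothetical helper: treats the input as the body of a JSON string
    // literal so that the JSON parser decodes any \uXXXX sequences.
    public static string DecodeJsEscapes(string escaped)
    {
        return JsonSerializer.Deserialize<string>("\"" + escaped + "\"");
    }

    public static void Main()
    {
        // Verbatim string: the backslashes are real characters, as in the runtime value.
        string uri = @"https://drive.google.com/open?id\u003d1n1hiV2sDFVctI8Qc9Z3EWPvEBO6KstFr\u0026usp\u003dsharing";

        // There is no percent-encoding in the value, so UnescapeDataString leaves it unchanged.
        Console.WriteLine(Uri.UnescapeDataString(uri) == uri); // True

        // The JSON parser turns \u003d into '=' and \u0026 into '&'.
        Console.WriteLine(DecodeJsEscapes(uri));
    }
}
```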

Related

Base64EncodedString does not include NewLines

I'm using a .NET Core 3.0 project on Windows 10. I'm trying to encode a string to Base64 with the code below:
var stringvalue = "Row1" + Environment.NewLine + "\n\n" + "Row2";
var encodedString = Convert.ToBase64String(Encoding.UTF8.GetBytes(stringvalue));
encodedString then has the following value:
Um93MQ0KCgpSb3cy
stringvalue is:
Row1\r\n\n\nRow2
However, if I pass the same value to this site (https://www.base64encode.org/), I get a different result:
Um93MVxyXG5cblxuUm93Mg==
In Visual Studio, I tried to resave the file with Unix line endings, but without any luck.
I want the string to be encoded the way it's done on https://www.base64encode.org. Any ideas how to get this done?
From the screenshot, I can see that you have entered a different string from the string you used in your C# code. The string you used in https://www.base64encode.org is represented as a C# string literal like this:
"Row1\\r\\n\\n\\nRow2"
// or
@"Row1\r\n\n\nRow2"
So to answer your question:
I want the string to be encoded the way it's done on https://www.base64encode.org. Any ideas how to get this done?
You should do:
var encodedString = Convert.ToBase64String(Encoding.UTF8.GetBytes("Row1\\r\\n\\n\\nRow2"));
But that's probably not what you actually want. Your first attempt at the C# code is more likely to be desired, because that is actually a carriage return character, followed by 3 new line characters. The string you entered in https://www.base64encode.org is simply the backslash character followed by the letter r (or n).
You can't really make the output on https://www.base64encode.org match the C# output, because you can only choose one kind of line separator there. You can only encode either Row1\r\n\r\n\r\nRow2 or Row1\n\n\nRow2. Nevertheless, you can check that the C# result is correct by decoding the output using https://www.base64decode.org.
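The difference between the two inputs can be reproduced entirely in C#. This is a minimal sketch with a hypothetical class and helper name; it uses "\r\n" directly instead of Environment.NewLine so the result does not depend on the platform.

```csharp
using System;
using System.Text;

public static class Base64Demo
{
    // Hypothetical helper: UTF-8 encode, then Base64 encode.
    public static string Encode(string s)
    {
        return Convert.ToBase64String(Encoding.UTF8.GetBytes(s));
    }

    public static void Main()
    {
        // Regular literal: one carriage return followed by three line feeds.
        Console.WriteLine(Encode("Row1\r\n\n\nRow2"));  // Um93MQ0KCgpSb3cy

        // Verbatim literal: backslashes and letters, which is what was typed into the website.
        Console.WriteLine(Encode(@"Row1\r\n\n\nRow2")); // Um93MVxyXG5cblxuUm93Mg==

        // Decoding the first result gives back the original control characters.
        string decoded = Encoding.UTF8.GetString(Convert.FromBase64String("Um93MQ0KCgpSb3cy"));
        Console.WriteLine(decoded == "Row1\r\n\n\nRow2"); // True
    }
}
```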
The \r\n you typed on the website is encoded literally: it is not a newline but four characters (backslash, r, backslash, n). The site has a newline separator option that lets you choose Windows style when converting real multi-line input such as:
Row1
Row2
I guess your \r\n\n\n is just a mistake; the website is only prepared to convert it to \r\n\r\n.

Escaping a double quotes in string in c#

I know this has been covered lots of times but I still have a problem with all of the solutions.
I need to build a string to send to a JSON parser which needs quotes in it. I've tried these forms:
string t1 = "[{\"TS\"}]";
string t2 = "[{" + "\"" + "TS" + "\"" + "}]";
string t3 = @"[{""TS""}]";
Debug.Print(t1);
Debug.Print(t2);
Debug.Print(t3);
The debug statement shows it correctly as [{"TS"}], but when I look at it in the debugger and, most importantly, when I send the string to my server-side JSON parser, it has the escape character in it:
"[{\"TS\"}]"
How can I get rid of the escape characters in the actual string?
The debug statement shows it correctly as [{"TS"}], but when I look at it in the debugger and, most importantly, when I send the string to my server-side JSON parser, it has the escape character in it:
"[{\"TS\"}]"
From the debugger's point of view, it will always show the escaped version (this is so that you, as the developer, know exactly what the string value is). This is not an error. When you send it to another .NET system, the debugger there will again show the escaped version. If you output the value (with Response.Write() or Console.WriteLine()), you will see the version you expect.
If you highlight the variable (from the debugger) and select the dropdown next to the magnifying glass icon and select "Text Visualizer" you will see how it displays in plain text. This may be what you are looking for.
Per your comments, I wanted to suggest that you also watch how you convert your string into bytes. You want to make sure you encode your bytes in a format that can be understood by other machines. Make sure you convert your string into bytes using a command such as the following:
System.Text.Encoding.ASCII.GetBytes(mystring);
I have the sneaking suspicion that you are sending the bit representation of the string itself instead of an encoded version.
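To confirm that the backslashes exist only in the source code and in the debugger display, and not in the string itself, the value can be inspected directly. This is a minimal sketch with a hypothetical class name; it uses Console.WriteLine instead of Debug.Print so the output is visible in a console application.

```csharp
using System;

public static class QuoteDemo
{
    public static void Main()
    {
        string t1 = "[{\"TS\"}]";
        string t2 = "[{" + "\"" + "TS" + "\"" + "}]";
        string t3 = @"[{""TS""}]";

        // All three literals produce exactly the same string.
        Console.WriteLine(t1 == t2 && t2 == t3); // True

        // Eight characters: [ { " T S " } ]
        Console.WriteLine(t1.Length);            // 8

        // The string contains no backslash character at all.
        Console.WriteLine(t1.Contains("\\"));    // False

        Console.WriteLine(t1);                   // [{"TS"}]
    }
}
```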

KeyNotFoundException with using HtmlEntity.DeEntitize() method

I am currently working on a scraper written in C# 4.0. I use a variety of tools, including the built-in WebClient and Regex features of .NET. For a part of my scraper I am parsing an HTML document using HtmlAgilityPack. I got everything to work as I desired and went through some cleanup of the code.
I am using the HtmlEntity.DeEntitize() method to clean up the HTML. I made a few tests and the method seemed to work great. But when I implemented the method in my code I kept getting KeyNotFoundException. There are no further details so I'm pretty lost. My code looks like this:
WebClient client = new WebClient();
string html = HtmlEntity.DeEntitize(client.DownloadString(path));
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
The HTML downloaded is UTF-8 encoded. How can I get around the KeyNotFound exception?
I understand that the problem is due to the occurrence of non-standard characters, for example Chinese or Japanese ones.
After you find out which characters are causing the problem, perhaps you could search for a suitable HtmlAgilityPack patch.
This may be of some help to you in case you want to modify the HtmlAgilityPack source yourself.
Four years later and I have the same problem with some encoded characters (version 1.4.9.5). In my case, there is a limited set of characters that might generate the problem, so I have just created a function to perform the replacements:
// to be called before HtmlEntity.DeEntitize
public static string ReplaceProblematicHtmlEntities(string str)
{
    var sb = new StringBuilder(str);
    //TODO: add other replacements, as needed
    return sb.Replace("&period;", ".")
        .Replace("&abreve;", "ă")
        .Replace("&acirc;", "â")
        .ToString();
}
In my case, the string contains both html-encoded characters and UTF-8 characters, but the problem is related to some encoded characters only.
This is not an elegant solution, but it is a quick fix for texts with a limited (and known) set of problematic encoded characters.
My HTML had a block of text like so:
... found in sections: 233.9 & 517.3; ...
Despite the spacing and decimal point, it was interpreting & 517.3; as a unicode character.
Simply HTML Encoding the raw text fixed the problem for me.
string raw = "sections: 233.9 & 517.3;";
// turn '&' into '&amp;', etc, before DeEntitizing
string encoded = System.Web.HttpUtility.HtmlEncode(raw);
string deEntitized = HtmlEntity.DeEntitize(encoded);
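The encoding step can be checked on its own without HtmlAgilityPack. This is a minimal sketch with a hypothetical class name; it swaps in System.Net.WebUtility (available without a reference to System.Web) for HttpUtility, and uses WebUtility.HtmlDecode only to show the round trip, not as a replacement for HtmlEntity.DeEntitize.

```csharp
using System;
using System.Net;

public static class EntityDemo
{
    public static void Main()
    {
        string raw = "sections: 233.9 & 517.3;";

        // The bare ampersand becomes a proper entity, so "& 517.3;" can no
        // longer be mistaken for an entity reference.
        string encoded = WebUtility.HtmlEncode(raw);
        Console.WriteLine(encoded); // sections: 233.9 &amp; 517.3;

        // Decoding the encoded text restores the original string.
        Console.WriteLine(WebUtility.HtmlDecode(encoded) == raw); // True
    }
}
```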
In my case, I fixed this by updating HtmlAgilityPack to version 1.5.0.

Loading XML to an XDocument with a URL containing an ampersand

XDocument xd = XDocument.Load("http://www.google.com/ig/api?weather=vilnius&hl=lt");
The ampersand & isn't a supported character in a string containing a URL when calling the Load() method. This error occurs:
XmlException was unhandled: Invalid character in the given encoding
How can you load XML from a URL into an XDocument where the URL has an ampersand in the querystring?
You need to encode it as &amp;:
XDocument xd = XDocument.Load(
    "http://www.google.com/ig/api?weather=vilnius&amp;hl=lt");
You might be able to get away with using WebUtility.HtmlEncode to perform this conversion automatically; however, be aware that this is not the intended use of that method.
Edit: The real issue here has nothing to do with the ampersand, but with the way Google is encoding the XML document using a custom encoding and failing to declare it. (Ampersands only need to be encoded when they occur within special contexts, such as the <a href="…" /> element of (X)HTML. Read Ampersands (&'s) in URLs for a quick explanation.)
Since the XML declaration does not specify the encoding, XDocument.Load is internally falling back to default UTF-8 encoding as required by XML specification, which is incompatible with the actual data.
To circumvent this issue, you can fetch the raw data and decode it manually using the sample below. I don’t know whether the encoding really is Windows-1252, so you might need to experiment a bit with other encodings.
string url = "http://www.google.com/ig/api?weather=vilnius&hl=lt";
byte[] data;
using (WebClient webClient = new WebClient())
data = webClient.DownloadData(url);
string str = Encoding.GetEncoding("Windows-1252").GetString(data);
XDocument xd = XDocument.Parse(str);
There is nothing wrong with your code - it is perfectly OK to have & in the query string, and it is how separate parameters are defined.
When you look at the error you'll see that it fails to load XML, not to query it from the Url:
XmlException: Invalid character in the given encoding. Line 1, position 473
which clearly points outside of your query string.
The problem could be "Apsiniaukę" (notice last character) in the XML response...
Instead of "&", use "&amp;" or "&#38;" and it will work fine.

How can I deal with ampersands in a mail client's mailto links?

I have an ASP.NET/C# application, part of which converts WWW links to mailto links in an HTML email.
For example, if I have a link such as:
www.site.com
It gets rewritten as:
mailto:my@address.com?Subject=www.site.com
This works extremely well, until I run into URLs with ampersands, which then causes the subject to be truncated.
For example the link:
www.site.com?val1=a&val2=b
Shows up as:
mailto:my@address.com?Subject=www.site.com?val1=a&val2=b
Which is exactly what I want, but then when clicked, it creates a message with:
subject=www.site.com?val1=a
Which has dropped the &val2, which makes sense as & is the delimiter in a mailto command.
So, I have tried various other ways to work around this, with no success.
I have tried implicitly quoting the subject='' part and that did nothing.
I replaced '&' with '&amp;' (in C#), which Live Mail and Thunderbird just turn back into:
www.site.com?val1=a&val2=b
I replaced '&' with '%26' which resulted in:
mailto:my@address.com?Subject=www.site.com?val1=a%26amp;val2=b
In the mail with the subject:
www.site.com?val1=a&val2=b
EDIT:
In response to how the URL is being built: this is much trimmed down, but it is the gist of it. In place of the att.Value.Replace I have tried System.Web.HttpUtility.UrlEncode calls, which also fail.
HtmlAgilityPack.HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a[@href]");
foreach (HtmlAgilityPack.HtmlNode link in nodes)
{
    HtmlAgilityPack.HtmlAttribute att = link.Attributes["href"];
    att.Value = att.Value.Replace("&", "%26");
}
Try mailto:my@address.com?Subject=www.site.com?val1=a%26val2=b
&amp; is an HTML escape code, whereas %26 is a URL escape code. Since it's a URL, that's all you need.
EDIT: I figured that's how you were building your URL. Don't build URLs that way! You need to get the %26 in there before you let anything else parse or escape it. If you really must do it this way (which you really should try to avoid), then you should search for "&amp;" instead of just "&" because the string has already been HTML escaped at this point.
So, ideally, you build your URL properly before it's HTML escaped. If you can't do it properly, at least search for the right string instead of the wrong one. "&" is the wrong one.
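A minimal sketch of building the link in that order, with a hypothetical class and helper name. It uses Uri.EscapeDataString to percent-encode the whole subject, which is a swapped-in choice rather than the Replace call from the question, and only afterwards HTML-encodes the finished URL for use in an href attribute.

```csharp
using System;
using System.Net;

public static class MailtoDemo
{
    // Hypothetical helper: percent-encodes the subject before it is placed in the URL.
    public static string BuildMailto(string address, string subject)
    {
        return "mailto:" + address + "?Subject=" + Uri.EscapeDataString(subject);
    }

    public static void Main()
    {
        string href = BuildMailto("my@address.com", "www.site.com?val1=a&val2=b");

        // '?', '=' and '&' inside the subject are all percent-encoded.
        Console.WriteLine(href); // mailto:my@address.com?Subject=www.site.com%3Fval1%3Da%26val2%3Db

        // HTML-encoding afterwards changes nothing, because no raw '&' is left in the URL.
        Console.WriteLine(WebUtility.HtmlEncode(href) == href); // True
    }
}
```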
You can't put arbitrary characters in the subject. You could try using the System.Web.HttpUtility.UrlEncode function on the subject's value.
Using the URL escape code %26 is the right way.
Sadly this is still not working on the Android OS because of bug 8023
What I ended up doing for my case was eliminating the &.
www.site.com/mytest.php?val1=a=b=c. Where the 2nd and 3rd = would be equivalent to www.site.com?val1=a&val2=b&val3=c
In mytest.php I explode on ? and then explode again on =.
A total hack I know but it does work for me.
