Using C# .net I am parsing some data with some partial html/javascript inside it (i dont know who made that decision) and i need to pull a link. The link looks like this
http:\/\/fc0.site.net\/fs50\/i\/2009\/name.jpg
It came from this which i assume is javascript and looks like json
"name":{"id":"589","src":"http:\/\/fc0.site.net\/fs50\/i\/2009\/name.jpg"}
But anyways how should i escape the first link so i get
http://fc0.site.net/fs50/i/2009/name.jpg
In this case i could just replace '\' with '' since links dont contain \ nor " so i could do that but i am a fan of knowing the right solution and doing things properly. So how might i escape this. After looking at that link for a minute i thought is that valid? does java script or json escape / with \? It doesnt seem like it should?
In your case:
"name":{"id":"589","src":"http://fc0.site.net/fs50/i/2009/name.jpg"}
"/" is a valid escape sequence. However, it is not required that / be escaped. You may escape it if you need to. The reason JSON explicitly allows escaping of slash is because HTML does not allow a string in a to contain "...
Update:
Check out this post
Odd, it doesn’t look like any JavaScript/JSON escaping you’d expect. You can have forward slashes in JavaScript strings just fine.
Why dont you try a regex on the escaped slashes to replace them in the C# code...
String url = #"http:\/\/fc0.site.net\/fs50\/i\/2009\/name.jpg";
String pattern = #"\\/";
String cleanUrl = Regex.Replace(url, pattern, "/");
Hope it helps!
Actually you want to unescape the string. Answered in this question.
Related
I am trying to see if a large string contains this line of HTML:
<label ng-class="choiceCaptionClass" class="ng-binding choice-caption">Was this information helpful?</label>
As you can see, this snippet has quotations in multiple places and it's causing problems when I do something like this:
Assert.IsTrue(responseContent.Contains("<label ng-class="choiceCaptionClass" class="ng - binding choice - caption">Was this information helpful?</label>"));
I've tried both of these ways of defining the string:
#"<label ng-class=""choiceCaptionClass"" class=""ng - binding choice - caption"">Was this information helpful?</label>"
and
"<label ng-class=\"choiceCaptionClass\" class=\"ng - binding choice - caption\">Was this information helpful?</label>"
But in each case the Contains() method looks for the literal string with either the double quotes or the backslashes. Is there another way I could define this string so I can correctly search for it?
Escaping the double-quotes with backslashes is the proper thing to do.
The reason your search may be failing is that the strings don't actually match. For example, in your version with backslashes, you have spaces around some of the dashes but your HTML string does not.
Try using regular expressions. I made this one for you but you can test your own regex here.
var regex = new Regex(#"<label\s+ng-class\s*=\s*""choiceCaptionClass""\s+class\s*=\s*""ng-binding choice-caption""\s*>\s*Was this information helpful\?\s*</label>", RegexOptions.IgnoreCase);
Assert.IsTrue(regex.IsMatch(responseContent));
If this is not working use the tester tool to figure it out what part of the pattern is getting off.
Hope this help!
I have URL's like:
http://127.0.0.1:81/controller/verbOne/NXw4fDF8MXwxfDQ1?source=dddd
or
http://127.0.0.1:81/controller/verbTwo/NXw4fDF8MXwxfDQ1
I'd like to extract that part in bold. The host and port can change to anything (when I publish it to a live server it will change). The controller never changes. And for the verb part, there are 2 possibilities.
Can anyone help me with the regex?
Thanks
Instead of using a regex you could use the built in functionality of Uri
Uri uri = new Uri("http://127.0.0.1:81/controller/verbOne/NXw4fDF8MXwxfDQ1?source=dddd");
var lastSegment = uri.Segments.Last();
You're looking for the Uri and Path classes:
Path.GetFileName(new Uri(str).AbsolutePath)
Why do you look for a regex? you can look for the two string elements "verbOne/" or "verbTwo/" and make a substring from the end. And then you can look for the rest and substrakt the part with the '?'
I think this is faster then a regex.
krikit
Though everyone else here is correct that regex is not the best solution, because it could fail when parsers already exist that should never fail due to their specialization, I believe you could use the following regex:
(?<=http://127\.0\.0\.1:81/controller/verb(One|Two)/)[a-zA-Z0-9]*
I have some query text that is being encoded with JavaScript, but I've encountered a use case where I might have to encode the same text on the server side, and the encoding that's happening is not the same. I need it to be the same. Here's an example.
I enter "I like food" into the search box and hit the search button. JavaScript encodes this as %22I%20like%20food%22
Let's say I get the same value as a string on a request object on the server side. It will look like this: "\"I like food\""
When I use HttpUtility.UrlEncode(value), the result is "%22I+like+food%22". If I use HttpUtility.UrlPathEncode(value), the result is "\"I%20like%20food\""
So UrlEncode is encoding my quotes but is using the + character for spaces. UrlPathEncode is encoding my spaces but is not encoding my escaped quotes.
I really need it to do both, otherwise the Search code completely borks on me (and I have no control over the search code).
Tips?
UrlPathEncode doesn't escape " because they don't need to be escaped in path components.
Uri.EscapeDataString should do what you want.
There are a few options available to you, the fastest might be to use UrlEncode then do a string.replace to swap the + characters with %20.
Something like
HttpUtility.UrlEncode(input).Replace("+", "%20");
WebUtility.UrlEncode(str)
Will encode all characters that need encoded using the %XX format, including space.
I've got a .NET 3.5 web application written in C# doing some URL rewriting that includes a file path, and I'm running into a problem. When I call string.Split('/') it matches both '/' and '\' characters. Is that... supposed to happen? I assumed that it would notice that the ASCII values were different and skip it, but it appears that I'm wrong.
// url = 'someserver.com/user/token/files\subdir\file.jpg
string[] buffer = url.Split('/');
The above code gives a string[] with 6 elements in it... which seems counter intuitive. Is there a way to force Split() to match ONLY the forward slash? Right now I'm lucky, since the offending slashes are at the end of the URL, I can just concatenate the rest of the elements in the string[], but it's a lot of work for what we're doing, and not a great solution to the underlying problem.
Anyone run into this before? Have a simple answer? I appreciate it!
More Code:
url = HttpContext.Current.Request.Path.Replace("http://", "");
string[] buffer = url.Split('/');
Turns out, Request.Path and Request.RawUrl are both changing my slashes, which is ridiculous. So, time to research that a bit more and figure out how to get the URL from a function that doesn't break my formatting. Thanks everyone for playing along with my insanity, sorry it was a misleading question!
When I try the following:
string url = #"someserver.com/user/token/files\subdir\file.jpg";
string[] buffer = url.Split('/');
Console.WriteLine(buffer.Length);
... I get 4. Post more code.
Something else is happening, paste more code.
string str = "a\\b/c\\d";
string[] ts = str.Split('/');
foreach (string t in ts)
{
Console.WriteLine(t);
}
outputs
a\b
c\d
just like it should.
My guess is that you are converting / into \ somewhere.
You could use regex to convert all \ slashes to a temp char, split on /, then regex the temp chars back to \. Pain in the butt, but one option.
I suspect (without seeing your whole application) that the problem lies in the semantics of path delimiters in URLs. It sounds like you are trying to attach a semantic value to backslashes within your application that is contrary to the way HTTP protocols define and use backslashes.
This is just a guess, of course.
The best way to solve this problem might be modifying the application to encode the path in some other way (such as "%5C" for backslashes, maybe?).
those two functions are probably converting \ to / because \ is not a valid character in a URL (see Which characters make a URL invalid?). The browser (NOT C#, as you are inferring) is assuming that when you are using that invalid character, you mean /, so it is "fixing" it for you. If you want \ in your URL, you need to encode it first.
The browsers themselves are actually the ones that make that change in the request, even if it is behind the scenes. To verify this, just turn on fiddler and look at the URLs that are actually getting sent when you go to a URL like this. IE and Chrome actually change the \ to / in the URL field on the browser itself, FireFox doesn't, but the request goes through that way anyways.
Update:
How about this:
Regex.Split(url, "/");
How can I write a regular expression to replace links with no link text like this:
with
http://www.somesite.com
?
This is what I was trying to do to capture the matches, and it isn't catching any. What am I doing wrong?
string pattern = "<a\\s+href\\s*=\\s*\"(?<href>.*)\">\\s*</a>";
I wouldn't use a regex - I'd use the Html Agility Pack, and a query like:
foreach(HtmlNode link in doc.DocumentElement.SelectNodes("//a[.='']")) {
link.InnerText = link.GetAttribute("href");
}
I could be wrong, but I think you simply need to change the quantifier within the href group to be lazy rather than greedy.
string pattern = #"<a\s+href\s*=\s*""(?<href>.*?)"">\s*</a>";
(I've also changed the type of the string literal to use #, for better readability.)
The rest of the regex appears fine to me. That you're not capturing any matches at all makes me think otherwise, but there could be a problem in the rest of the code (or even the input data - have you verified that?).
I would suggest
string pattern = "(<a\\b[^>]*href=\"([^\"]+)\"[^>]*>)[\\s\\r\\n]*(</a>)";
This way also links with their href attribute somewhere else would be captured.
Replace with
"$1$2$3"
The usual word of warning: HTML and regex are essentially incompatible. Use with caution, this might blow up.
Marc Gravell has the right answer, regexes are fundamentally bad at parsing HTML (see Can you provide some examples of why it is hard to parse XML and HTML with a regex? for why). See Can you provide an example of parsing HTML with your favorite parser? for examples using a variety of parsers.