Regex replace string function not working as expected - c#

I'm trying to implement a hashtag function in a web app to easily embed search links into a page. The issue is that I'm trying to do a replace on the hash marks so they don't appear in the HTML output. Since I also want to be able to have literal hash marks in the output, I can't just do a final Replace on the entire string at the end of processing. Eventually I'll want to be able to escape some hash marks, like so: \#1 is my answer, and then find and replace the \# with just #, but that is another problem that I'm not even ready for (but still thinking of).
This is what I have so far, mocked up in a console app:
static void Main(string[] args)
{
Regex _regex = new Regex(@"(#([a-z0-9]+))");
string link = _regex.Replace("<p>this is #my hash #tag.</p>", MakeLink("$1"));
}
public static string MakeLink(string tag)
{
return string.Format("{1}", tag.Replace("#", ""), tag);
}
The output being:
<p>this is #my hash #tag.</p>
But when I run it with breakpoints, the string passed to MakeLink() is displayed as "$1" in the debugger output, and it's not replacing the hashes as expected.
Is there a better tool for the job than regex? Or can I do something else to get this working correctly?

Note that you're passing a literal "$1" into MakeLink, not the first captured group. Thus your .Replace("#", "") is doing nothing. The regular expression then replaces the two occurrences of "$1" in the output of MakeLink with the first capture group.
If you replace "$1" with "$2" then I think you get the result you want, just not quite in the manner you're expecting.
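If the goal is to run your own code on each match, one option is the Regex.Replace overload that takes a MatchEvaluator delegate, which is called once per match with the real captured text. A minimal sketch, assuming a hypothetical /search?q= link format (the markup produced by the question's MakeLink is not shown, so HashtagLinker and the link shape here are made up for illustration):

```csharp
using System;
using System.Text.RegularExpressions;

public static class HashtagLinker
{
    // Group 1 is the tag text without the leading '#'.
    private static readonly Regex TagRegex = new Regex(@"#([a-z0-9]+)");

    // Hypothetical link format; substitute whatever markup the app needs.
    public static string MakeLink(string tag)
    {
        return string.Format("<a href=\"/search?q={0}\">{0}</a>", tag);
    }

    public static string LinkHashtags(string html)
    {
        // The lambda runs once per match, so MakeLink receives the actual capture
        // instead of the literal string "$1".
        return TagRegex.Replace(html, m => MakeLink(m.Groups[1].Value));
    }
}
```

Calling HashtagLinker.LinkHashtags("<p>this is #my hash #tag.</p>") wraps my and tag in links and drops the hash marks.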

To not replace your escaped hashtags, just modify your current regex to not match anything that starts with an escape:
Regex _regex = new Regex(@"[^\\](#([a-z0-9]+))");
And then apply a new regex to find only escaped hashtags and replace them with unescaped ones:
Regex _escape = new Regex(@"\\(#([a-z0-9]+))");
_escape.Replace(input, "$1");
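One caveat with the [^\\] approach is that it consumes the character before the hash and cannot match a tag at the very start of the input. A negative lookbehind avoids both. The sketch below assumes a placeholder [tag] output instead of real link markup, and EscapedHashtags and Process are names invented for the example:

```csharp
using System.Text.RegularExpressions;

public static class EscapedHashtags
{
    // '#' followed by a tag, but only when it is not preceded by a backslash.
    private static readonly Regex Tag = new Regex(@"(?<!\\)#([a-z0-9]+)");

    // A backslash immediately followed by '#'.
    private static readonly Regex Escaped = new Regex(@"\\#");

    public static string Process(string input)
    {
        // Step 1: rewrite unescaped tags (placeholder output, not real markup).
        string linked = Tag.Replace(input, m => "[" + m.Groups[1].Value + "]");
        // Step 2: turn the remaining \# sequences into plain '#'.
        return Escaped.Replace(linked, "#");
    }
}
```

The order matters: unescaping first would turn \#1 into #1 and the tag pass would then pick it up.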

Related

Regex Replace with forward slash

Can someone please explain how to get this regex working? I'm trying to take this string:
"Test0/1"
and turn it into:
"Test0\/1"
I'm using this, but it is not working:
var test = Regex.Replace("Test0/1", @"/", @"\/");
It keeps giving me
"Test0\\/1"
Then I want to take the results of the string and put it into a Regex statement like so:
var match = new Regex(test).Match(myString);
So the string 'test' has to be a valid regex statement.
Basically what I'm trying to do is take a list of interfaces off a device, create a regex statement out of them and then use that regex to compare results for other things in my code. Because of the way interfaces are formatted "FastEthernet0/1" for example, it is causing my regex to fail because you have to escape all forward slashes. I have to build this regex on the fly though because every device will have a different set of interfaces.
That is just Visual Studio's debugger escaping the \ on your behalf when it displays the string; the string itself contains a single backslash. Look at the following question: What's the use/meaning of the @ character in variable names in C#?. Removing the @ symbol from @"\" turns the string into "\\".
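For building a pattern out of arbitrary interface names, Regex.Escape may be a safer route than hand-escaping individual characters. Note that the forward slash is not a metacharacter in .NET regular expressions, so it needs no escaping at all. The following is only a sketch: InterfacePattern and Build are hypothetical names, and anchoring with ^ and $ assumes whole-string matching is what you want.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class InterfacePattern
{
    // Builds one alternation, e.g. ^(?:FastEthernet0/1|Serial0\.1)$, from raw names.
    public static Regex Build(IEnumerable<string> interfaceNames)
    {
        // Regex.Escape neutralises characters such as '.', '(' or '+' in each name.
        string alternatives = string.Join("|", interfaceNames.Select(n => Regex.Escape(n)));
        return new Regex("^(?:" + alternatives + ")$");
    }
}
```

With names such as "FastEthernet0/1" and "Serial0.1", the resulting regex matches those exact strings but not "Serial0x1", because the dot has been escaped.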

Single regular expression with 2 matches

Given the path
C:\Users\Bob\Downloads\Product12\Prices\USD
and only knowing it contains a subdirectory called Downloads
I have this regex to locate the Downloads part
(?<=Downloads\\)[^\\""]*
Ideally, I want to also match everything after Downloads as a separate group, but using a single regex for both Downloads and the following path portion.
This will match everything up to and including 'Downloads\' in one subgroup, and everything after it in another subgroup:
/^(.*?Downloads\\)(.*)$/
So given the sample input, you want to get Product12\Prices\USD, right?
result = Regex.Match(s, @"\\Downloads\\(.*)$").Groups[1].Value;
But that [^\\""]* in your regex seems to indicate that your path is enclosed in quotes, and you don't want to match the closing quote or anything after it.
result = Regex.Match(s, @"\\Downloads\\([^""]*)""").Groups[1].Value;
Some points of particular interest are:
When you create regexes in C#, always use C#'s verbatim string notation if at all possible (i.e., @"regex"). It saves you having to litter your code with a bunch of backslashes. For example, if your regex were in a standard C-style string literal you would have to use four backslashes in the regex to match one backslash in the input.
When you include regexes in your posts here at SO, show them as they appear in your code. Then we won't have to guess what the backslashes mean. For example, is the \\ in [^\\""]* supposed to match a literal backslash, or are you just escaping it for the regex?
Speaking of quotes, " has no special meaning in regexes, so you don't have to escape it for that. I changed that sequence to [^""]* because that's how you escape quotes in a verbatim string. In a C-style string literal it would be [^\"]*.
You don't need RegEx to parse paths
var paths = new Uri(@"C:\Users\Bob\Downloads\Product12\Prices\USD").Segments;
would return all segments and you can skip till Downloads. For example
var paths = new Uri(@"C:\Users\Bob\Downloads\Product12\Prices\USD")
.Segments
.SkipWhile(s => s != "Downloads/")
.Skip(1)
.ToList();

How do I escape a RegEx?

I have a Regex that I now need to move into C#. I'm getting errors like this
Unrecognized escape sequence
I am using Regex.Escape -- but obviously incorrectly.
string pattern = Regex.Escape("^.*(?=.{7,})(?=.*[a-zA-Z])(?=.*(\d|[!@#$%\?\(\)\*\&\^\-\+\=_])).*$");
hiddenRegex.Attributes.Add("value", pattern);
How is this correctly done?
The error you're getting is coming at compile time, correct? That means the C# compiler is not able to make sense of your string. Prepend an @ sign to the string and you should be fine. You don't need Regex.Escape.
See What's the @ in front of a string in C#?
var pattern = new Regex(@"^.*(?=.{7,})(?=.*[a-zA-Z])(?=.*(\d|[!@#$%\?\(\)\*\&\^\-\+\=_])).*$");
pattern.IsMatch("Your input string to test the pattern against");
The error you are getting is due to the fact that your string contains invalid escape sequences (e.g. \d). To fix this, either escape the backslashes manually or write a verbatim string literal instead:
string pattern = @"^.*(?=.{7,})(?=.*[a-zA-Z])(?=.*(\d|[!@#$%\?\(\)\*\&\^\-\+\=_])).*$";
Regex.Escape would be used when you want to embed dynamic content to a regular expression, not when you want to construct a fixed regex. For example, you would use it here:
string name = "this comes from user input";
string pattern = string.Format("^{0}$", Regex.Escape(name));
You do this because name could very well include characters that have special meaning in a regex, such as dots or parentheses. When name is hardcoded (as in your example) you can escape those characters manually.

parsing tweet text with regex

Regex-noob here. Looking for some C# regex code to "syntax highlight" twitter text. So given this tweet:
@taglius here's some tweet text that shouldn't be highlighted #tagtestpix http://aurl.jpg
I want to find the user mentions (@), hashtags (#), and urls (http://) and add appropriate html to color highlight these elements. Something like
<font color=red>@taglius</font> here's some tweet text that shouldn't be highlighted <font color=blue>#tagtestpix</font> <font color=yellow>http://aurl.jpg</font>
This isn't the exact html I will use, but I think you get the idea.
The answers above are parts of the whole answer, so I think I can add a little extra to answer your question:
Your highlight function would look something like this:
public static String HighlightTwitter(String input)
{
String result = Regex.Replace(input, @"\B@\w+", @"<font color=""red"">$0</font>");
result = Regex.Replace(result, @"\B#\w+", @"<font color=""blue"">$0</font>");
result = Regex.Replace(result, @"\bhttps?://[-\w]+(\.\w[-\w]*)+(:\d+)?(/[^.!,?;""\'<>()\[\]\{\}\s\x7F-\xFF]*([.!,?]+[^.!,?;""\'<>\(\)\[\]\{\}\s\x7F-\xFF]+)*)?\b", @"<font color=""yellow"">$0</font>", RegexOptions.IgnoreCase);
return result;
}
I have included \B before @ and # to make sure each one starts a word (a plain \b there would require a word character immediately before the symbol), and \b around the URL pattern to make sure URLs stand alone. This means that @this_will_highlight but@this_will_not.
If performance might be an issue you can make the Regexes static members with RegexOptions.Compiled
E.g.:
private static Regex regexAt = new Regex(@"\B@\w+", RegexOptions.Compiled);
...
String result = regexAt.Replace(input, @"<font color=""red"">$0</font>");
...
The following would match the '@' character followed by a sequence of alpha-num characters:
@\w+
The following would match the '#' character followed by a sequence of alpha-num characters:
\#\w+
There are a lot of free-form http url match expressions, this is the one I use most commonly:
https?://[-\w]+(\.\w[-\w]*)+(:\d+)?(/[^.!,?;""\'<>()\[\]\{\}\s\x7F-\xFF]*([.!,?]+[^.!,?;""\'<>\(\)\[\]\{\}\s\x7F-\xFF]+)*)?
Lastly, you're going to get false positive hits with all of these, so you're going to need to look real hard at how to correctly delineate these tags. For instance, say you have the following tweet:
the url http://Roger@example.com/#bookmark is interesting.
Obviously this is going to be a problem, as all three of the expressions will match inside the url. To avoid this you will need to figure out what characters are allowed to precede or follow the match. As an example, the following requires whitespace or the start of the string to precede the @name reference and requires a ',' or whitespace to follow it.
(?<=^|\s)@\w+(?=[,\s])
Regex patterns are not easy, I recommend getting a tool like Expresso.
You can parse out the @ replies using (\@\w+). You can parse out the hash tags using (#\w+).
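One way to sidestep the false positives inside URLs is to do everything in a single pass: put the URL alternative first in one combined pattern, so that any @ or # inside a URL is consumed as part of the URL match. This is only a sketch. TweetHighlighter is a made-up name, and the URL pattern https?://\S+ is a deliberately crude stand-in (it will swallow trailing punctuation), not the longer expression quoted above.

```csharp
using System.Text.RegularExpressions;

public static class TweetHighlighter
{
    // The URL alternative comes first so '@' and '#' inside a URL belong to it.
    private static readonly Regex Token = new Regex(
        @"(?<url>https?://\S+)|(?<!\w)(?<mention>@\w+)|(?<!\w)(?<tag>#\w+)",
        RegexOptions.IgnoreCase);

    public static string Highlight(string input)
    {
        return Token.Replace(input, m =>
        {
            // Pick the colour from whichever named group took part in the match.
            string color = m.Groups["url"].Success ? "yellow"
                         : m.Groups["mention"].Success ? "red"
                         : "blue";
            return "<font color=\"" + color + "\">" + m.Value + "</font>";
        });
    }
}
```

Because the replacement is produced by a delegate rather than by three successive Replace calls, the inserted markup is never rescanned by a later pattern.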

Trying to understand this line of Java, as C# code

See this java code :-
s = s.replaceAll( "\\\\", "\\\\\\\\" ).replaceAll( "\\$", "\\\\\\$" );
I sorta don't understand it. It's a regex replace all.
I've tried the following C# code...
text = text.RegexReplace("\\\\", "\\\\\\\\");
text = text.RegexReplace("\\$", "\\\\\\$");
But if i have the following unit test :-
} ul[id$=foo] label:hover {
The java code returns: } ul[id\$=foo] label:hover {
My c# code returns: } ul[id\\\$=foo] label:hover {
So i'm not sure I understand why my c# code is putting more \'s in, mainly with regards to how these control characters are being represented.. ??
Update:
So, when i use XXX's idea of just using text.Replace(..), this works.
eg.
text = text.Replace("\\\\", "\\\\\\\\");
text = text.Replace("\\$", "\\\\\\$");
But I was hoping to stick with RegEx... to try and keep it as close to the java code as possible.
The extension method being used is...
public static string RegexReplace(this string input,
string pattern,
string replacement)
{
return Regex.Replace(input, pattern, replacement);
}
hmm...
Java needs all $ signs escaped in its replace string - "\\\\\\$" means \\ and \$. Without it it throws an error: http://www.regular-expressions.info/refreplace.html (look for "$ (unescaped dollar as literal text)").
Remember that $1, $0, etc. are replaced with the text of the captured groups, so they are part of the syntax of the second argument to replaceAll. C# has a slightly different syntax and doesn't require the extra backslash, which it takes literally.
You could write:
text = text.RegexReplace(@"\\", @"\\");
text = text.RegexReplace(@"\$", @"\$");
Or,
text = text.RegexReplace(@"[$\\]", @"\$&");
I think it's the equivalent of this C# code:
text = text.Replace(@"\", @"\\");
text = text.Replace("$", @"\$");
The @ indicates a verbatim string in C#, meaning that the backslashes in strings don't have to be escaped with more backslashes. In other words, the code replaces a single backslash with a double backslash and then replaces a dollarsign with a backslash followed by a dollarsign.
If you were to use the regex function, it would be something like this:
text = text.RegexReplace(@"\\", @"\\");
text = text.RegexReplace(@"\$", @"\$$");
Note that in the regex pattern (the first parameter), backslashes are special, while in the replacement (the second parameter) it is the dollarsigns that are special.
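To see those rules in action, here is a small sketch that quotes backslashes and dollar signs in one pass and can be checked against the unit-test string from the question. CssEscaper and Quote are names invented for the example, not part of the original code.

```csharp
using System.Text.RegularExpressions;

public static class CssEscaper
{
    // Prefixes every backslash and every dollar sign with a backslash.
    public static string Quote(string text)
    {
        // Pattern: a character class containing '\' (escaped) and '$'.
        // Replacement: '\' is literal in a .NET replacement string, and "$&"
        // stands for the whole match.
        return Regex.Replace(text, @"[\\$]", @"\$&");
    }
}
```

Doing both substitutions in a single pass also removes any dependence on the order of the two replace calls.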
The code quotes the backslashes and '$' characters in the original string.
Java regex parsing: http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html
C#: http://msdn.microsoft.com/en-us/library/xwewhkd1.aspx
I think that in Java, you have to escape the \ character by using \, but in C#, you don't. Try taking out half of the \ in your C# version.
