Regex vs String.Contains - c#

Hi. I'm failing to write a method that tests for words within a plain-text or HTML document. I am reasonably literate with regex, but I am newer to C# (coming from much more Java).
Just 'cause,
string html = source.ToLower();
string plaintext = Regex.Replace(html, @"<(.|\n)*?>", " "); // remove tags
plaintext = Regex.Replace(plaintext, @"\s+", " "); // remove excess white space
and then,
string tag = "c++";
bool foundAsRegex = Regex.IsMatch(plaintext, @"\b" + Regex.Escape(tag) + @"\b");
bool foundAsContains = plaintext.Contains(tag);
For a case where "c++" should be found, sometimes foundAsRegex is true and sometimes false. My google-fu is weak, so I didn't get much back on "what the hell". Any ideas or pointers welcome!
edit:
I'm searching for matches on skills in resumes. for example, the distinct value "c++".
edit:
a real excerpt is given below:
"...administration- c, c++, perl, shell programming..."

The problem is that \b matches between a word character and a non-word character. Given the expression \bc\+\+\b, you have a problem. "+" is a non-word character. So searching for the pattern in "xxx c++, xxx", you're not going to find anything. There's no "word break" after the "+" character.
If you're looking for non-word characters then you'll have to change your logic. Not sure what the best thing would be. I suppose you can use \W, but then it's not going to match at the beginning or end of the line, so you'll need (^|\W) and (\W|$) ... which is ugly. And slow, although perhaps still fast enough depending on your needs.
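A lookaround-based variant of the same idea can be sketched as follows. It is an alternative to the (^|\W) and (\W|$) form, not something taken from the answers here: (?<!\w) and (?!\w) only assert that no word character touches the term, so they also succeed at the start and end of the input. The names SkillMatcher and ContainsTerm are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SkillMatcher
{
    // Returns true when `term` occurs in `text` without a word character
    // (letter, digit, underscore) directly before or after it.
    // Assumption: "standalone" means exactly that; adjust if, for example,
    // "c++11" should also count as a hit for "c++".
    public static bool ContainsTerm(string text, string term)
    {
        // Regex.Escape turns "c++" into "c\+\+" so the plus signs are literal.
        string pattern = @"(?<!\w)" + Regex.Escape(term) + @"(?!\w)";
        return Regex.IsMatch(text, pattern);
    }
}
```

With the excerpt from the question, SkillMatcher.ContainsTerm("...administration- c, c++, perl, shell programming...", "c++") should return true, whereas the \b-delimited pattern finds nothing there because no word boundary exists between "+" and ",".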

Your regular expression is turning into:
/\bc\+\+\b/
Which means you're looking for a word boundary, followed by the string c++, followed by another word boundary. This means it won't match on strings like abc++, whereas plaintext.Contains will succeed.
If you can give us examples of where your regex fails when you expected it to succeed, then we can give you a more definite answer.
Edit: My original regex was /\bc++\b/, which is incorrect, as c++ is being passed to Regex.Escape(), which escapes out regular expression metacharacters like +. I've fixed it above.

Related

Find special character in string and change it using Regular expression in C#

I am trying to find the '&' character in my string and switch it with "and" string using regular expression, but I am obviously doing it wrong.
This is the part of the code that checks whether there is a '&' symbol and, if there is one, should change it to "and".
if (Regex.IsMatch(toCheck, @"[^&]"))
return Regex.Replace(toCheck, @"[^&]", "and");
What I am getting as the outcome is a string that contains only '&' symbols.
Can someone help me with this regular expression? It is a bit confusing to me. Thanks!
I would just do it using regular string functions:
return toCheck.Replace("&","and");
If you really want to do it with regex, your pattern is a bit wrong: [^&] is a negated character class, so it matches any single character that is not &. Remove the ^ and you'll be fine. It's not even necessary to put & between brackets, as it's not a special character in regex. Just remember not to use regex for trivial things like searching for one character and replacing it; it's using a sledgehammer to crack a nut.
You don't need to check with IsMatch if you want to replace; it just won't replace anything if there are no occurrences.
Also, a simple string replace is enough; you do not need regexes to solve this:
Console.WriteLine("Hello&World&Mars".Replace("&", " and "));
This is enough
[^&] means all characters that are not equal to &.
^ inverts the selection
This is how you check whether a string contains the '&' character:
if(Regex.IsMatch(toCheck,@"[&]"))
//do whatever you want to
There are many topics discussing regex that you can visit; here is a link where you can learn regular expressions briefly.
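The difference between the literal & and the negated class [^&] can be seen side by side in a small sketch. The class and method names are hypothetical and exist only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AmpersandSketch
{
    // Plain string replacement: every "&" becomes "and".
    public static string ReplaceLiteral(string toCheck)
    {
        return toCheck.Replace("&", "and");
    }

    // The pattern from the question: [^&] matches each character that is
    // NOT an ampersand, so everything except "&" gets replaced.
    public static string ReplaceNegatedClass(string toCheck)
    {
        return Regex.Replace(toCheck, @"[^&]", "and");
    }
}
```

For the input "a&b", ReplaceLiteral gives "aandb", while ReplaceNegatedClass gives "and&and": the ampersand is the only character left untouched.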

Regular Expression Pattern Matching

Hi I need to do like this.
Actually **ctu** is a good university but **ctu's** is not. There are many **,ctus,** present.
What I want to do is, I want to replace ctu in the string like this.
Actually **<s>ctu<e>** is a good university but **<s>ctu's<e>** is not. There are many **,<s>ctus<e>,** present.
But with the following pattern
**\\bctu*(?:['\\\\|""\\\\]*)\\w+\\b**
I'm getting the output as:
A**<s>ctu<e>**ally **<s>ctu<e>** is a good university but **<s>ctu's<e>** is not. There are many **,ctus,** present.
I don't want to replace ctu inside words such as Actually, and I also need to replace " ,ctus, " with " ,<s>ctus<e>, ".
How do I achieve this using regex? I need this in C#.
Thanks in advance.
The following regex matches all the cases listed in your example:
@"(\bctu(?:'\w+)?\w*\b)"
Then just replace the match with @"<s>$1<e>", where $1 is the .NET replacement-pattern reference to the group captured above.
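Put together, the approach might look like the sketch below. Note that in a .NET replacement pattern the first captured group is written $1. The class name CtuTagger and the method name Tag are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CtuTagger
{
    // \bctu       : "ctu" starting at a word boundary (so not inside "Actually")
    // (?:'\w+)?   : optional apostrophe suffix such as 's
    // \w*\b       : any trailing word characters, e.g. the "s" in "ctus"
    private static readonly Regex Ctu = new Regex(@"(\bctu(?:'\w+)?\w*\b)");

    public static string Tag(string input)
    {
        // $1 refers to the text captured by the outer group.
        return Ctu.Replace(input, "<s>$1<e>");
    }
}
```

Applied to the sentence from the question (without the ** markers), Tag should wrap ctu, ctu's and ctus while leaving Actually alone. The match is case-sensitive as written; passing RegexOptions.IgnoreCase would change that.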
Are you looking for @"\bctu\b" ("ctu" with word boundaries on both sides, so it matches ctu, including the ctu in ctu's because an apostrophe is a non-word character, but not the ctu inside Actually or ,ctus,) for the first search pattern and ",ctus," (exactly the string ,ctus,, regardless of where it might fall in a word) as the second search pattern? To search for both of these at once, you could use @"(\bctu\b|,ctus,)".
As a slight aside, in C# you can write regex literals much more easily by using the @"" notation (verbatim strings) instead of "". E.g. for the regex engine to see a word boundary, it must receive \b, which can be written as @"\b" or "\\b", and a literal \ is "\\\\" or @"\\". The verbatim form is easier to read, especially in more complex cases.
If this doesn't answer your question, please give a clear example of expected input/output.

Escaping \x from strings

Well, I got this little method:
static string escapeString(string str) {
string s = str.Replace(@"\r", "\r").Replace(@"\n", "\n").Replace(@"\t", "\t");
Regex regex = new Regex(@"\\x(..)");
var matches = regex.Matches(s);
foreach (Match match in matches) {
s = s.Replace(match.Value, ((char)Convert.ToByte(match.Value.Replace(@"\x", ""), 16)).ToString());
}
return s;
}
It replaces sequences such as "\x65" in the string I get from args[0].
But my problem is: "\\x65" will be replaced too, so I get "\e". I tried to figure out a regex that would check whether there is more than one backslash, but I had no luck.
Can somebody give me a hint?
You can continue to hack regexes together with things like "\s|\w\x(..)" to remove the case of \x65. Obviously that will be brittle since there is no guarantee that your sequence \x65 always has a space or character in front of it. It could be the beginning of the file. Also, your regex will match \xTT, which obviously isn't unicode. Consider replacing the '.' with a character class like "\x([0-9a-f]{2})".
If this was a school project, I would do something like the following. You can replace all occurrences of "\\" with another unlikely sequence, like "#!!#!!#", run the regex and replacements, and then replace every occurrence of the unlikely sequence with "\". For example:
String s = inputString.Replace(@"\\", @"_#!!#!!#_");
// do all of the regex, replacements, etc here
String output = s.Replace(@"_#!!#!!#_", @"\");
However, you shouldn't do this in production code because if your input stream ever has the magic sequence then you will get extra backslashes.
It's obvious that you are writing some kind of interpolator. I feel obligated to recommend looking into something more robust, like lexers that use regexes to form finite state machines. Wikipedia has some great articles on this topic, and I'm a big fan of ANTLR. It may be overengineering now, but if you keep running into these special cases, consider solving your problem in a more general way.
Start reading here for the theory: http://en.wikipedia.org/wiki/Lexical_analysis
Use a negative look-behind:
Regex regex = new Regex(@"(?<!([^\\]|^)\\)\\x(..)");
This asserts that the \x is not preceded by a solo backslash, without consuming the preceding characters (look-arounds are zero-width).
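A minimal sketch of the look-behind idea, combined with the hex-digit character class suggested in the other answer, might look like this. The names HexEscapeSketch and Unescape are hypothetical, and the sketch assumes that an \x directly preceded by a single additional backslash should be left untouched; longer runs of backslashes are not handled specially.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HexEscapeSketch
{
    // (?<!(?:[^\\]|^)\\) : the \x must not be preceded by a solo backslash
    // \\x                : a literal backslash followed by x
    // ([0-9A-Fa-f]{2})   : exactly two hex digits, captured as group 1
    private static readonly Regex HexEscape =
        new Regex(@"(?<!(?:[^\\]|^)\\)\\x([0-9A-Fa-f]{2})");

    public static string Unescape(string input)
    {
        // Replace each match with the character whose code is the hex value.
        return HexEscape.Replace(input, m =>
            ((char)Convert.ToByte(m.Groups[1].Value, 16)).ToString());
    }
}
```

Using a MatchEvaluator with Regex.Replace avoids the loop of s.Replace calls from the question, which would rewrite every occurrence of the matched text, including occurrences the pattern deliberately skipped.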

Regex : replace a string

I'm currently facing a (little) blocking issue. I'd like to replace a substring by one another using regular expression. But here is the trick : I suck at regex.
Regex.Replace(contenu, "Request.ServerVariables("*"))",
"ServerVariables('test')");
Basically I'd like to replace whatever is between the " by "test". I tried ".{*}" as a pattern but it doesn't work.
Could you give me some tips, I'd appreciate it!
There are several issues you need to take care of.
1. You are using special characters in your regex (., parens, quotes), and you need to escape these with a backslash. You also need to escape each backslash with another backslash because we're in a C# string literal, unless you prefix the string with @, in which case the escaping rules are different.
2. The expression to match "any number of whatever characters" is .*. In this case, you would want to match any number of non-quote characters, which is [^"]*.
3. In contrast to (1) above, the replacement string is not a regular expression, so you don't want any backslashes there.
4. You need to store the return value of the replace somewhere.
The end result is
var result = Regex.Replace(contenu,
@"Request\.ServerVariables\(""[^""]*""\)",
"Request.ServerVariables('test')");
Based purely on my knowledge of regex (and not how they are done in C#), the pattern you want is probably:
"[^"]*"
ie - match a " then match everything that's not a " then match another "
You may need to escape the double-quotes to make your regex-parser actually match on them... that's what I don't know about C#
Try to avoid where you can the '.*' in regex, you can usually find what you want to get by avoiding other characters, for example [^"]+ not quoted, or ([^)]+) not in parenthesis. So you may just want "([^"]+)" which should give you the whole thing in [0], then in [1] you'll find 'test'.
You could also just replace '"' with '' I think.
Taryn East's regex includes the *. You should remove it if it is just a placeholder for any value:
"[^"]"
BTW: You can test this regex with this cool editor: http://rubular.com/r/1MMtJNF3kM

Regex setting word characters and matching exact word

I need my C# regex to only match full words, and I need to make sure that +-/*() delimit words as well (I'm not sure if the last part is already set that way.) I find regexes very confusing and would like some help on the matter.
Currently, my regex is:
public Regex codeFunctions = new Regex("draw_line|draw_rectangle|draw_circle");
Thank you! :)
Try
public Regex codeFunctions = new Regex(@"\b(draw_line|draw_rectangle|draw_circle)\b");
The \b means match a word boundary, i.e. a transition from a non-word character to a word character (or vice versa).
Word characters include alphabet characters, digits, and the underscore symbol. Non-word characters include everything else, including +-/*(), so it should work fine for you.
See the Regex Class documentation for more details.
The @ at the start of the string makes it a verbatim string; otherwise you have to type two backslashes to produce one backslash.
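How the word boundaries behave around operators and underscores can be checked with a small sketch. The class and method names are hypothetical; the pattern is the one given above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FunctionNameMatcher
{
    private static readonly Regex CodeFunctions =
        new Regex(@"\b(draw_line|draw_rectangle|draw_circle)\b");

    // True when one of the listed function names appears as a whole word.
    public static bool ContainsFunction(string code)
    {
        return CodeFunctions.IsMatch(code);
    }
}
```

ContainsFunction("x+draw_line(1,2)") should be true, because + and ( are non-word characters. ContainsFunction("my_draw_line(1,2)") and ContainsFunction("draw_lines") should be false, because an underscore or letter next to the name means there is no word boundary there.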
Do you want to match any words, or just the words listed above? To match an arbitrary word, substitute this for the bit that creates the Regex object:
new Regex(@"\b(\w+)\b");
In the future, if you want more characters to be treated as whitespace (for example, underscores), I would recommend String.Replace-ing them to a space character. There may be a clever way to get the same effect with regular expressions, but personally I think it would be too clever. The String.Replace version is obvious.
Also, I can't help but recommend that you read up on regular expressions. Yes, they look like line noise until you get used to them, but once you do they're convenient and there are plenty of good resources out there to help you.
