I'm a little too new to RegEx's so this is mostly asking for help with specific pattern matching and a little with how to implement them in C#.
I have a large Excel file full of, among other things, repeated addresses that are written in different styles. Most are abbreviations of words like Avenue/etc.
For the simple ones I looked up the string.replace() function:
address = address.Replace("Av ", "Av. ");
And it does the trick there and for some others, but if I want to replace the word "Ave" I run into the possibility of it being part of another word (some addresses are in Spanish, so this is likely to happen). I thought about including whitespace before and after (" ave "), but would that work if it's the first word in the string?
Or should I use a pattern like (this might be wrong too)
^[0-9a-zA-Z_#' ](Ave)\w //the word is **not** preceded by any character other than a whitespace and is followed by a whitespace
For Expressions such as those, I should use something along this pattern, right?
string replacement = "Av.";
Regex rgx = new Regex(@"^[0-9a-zA-Z_#' ](Ave)\w");
string result = rgx.Replace(input, replacement);
Thanks
Regular expressions have a nifty tool for this: the \b word boundary anchor. It matches at word boundaries without consuming any characters, so Ave\b would only match Ave followed by a space, a dot, the end of the string, or anything else that is not a word character.
Read all about the word boundary class here: http://www.regular-expressions.info/wordboundaries.html
BTW, that site is THE place to go to learn about regular expressions.
Also, if you were to do it in the way you try, it could be something like this: [^\w]Ave[^\w]
That literally is: not a word character (a letter, a digit or _), then Ave, then again not a word character (a space, a dot, etc.). Note that it needs an actual character on both sides, so it will not match Ave at the very start or end of the string.
Also you could use the shorthand for [^\w], which is \W, so it would then become \WAve\W
But the \b way is better.
Add the word boundary to your regex:
Regex.Match(content, @"\b(Ave)\b");
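To make the \b suggestion concrete, here is a minimal sketch of the replacement step. The class and method names (AddressNormalizer, NormalizeAve) are made up for illustration, and the IgnoreCase option and the optional trailing dot are my assumptions about the data, not something the question states.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AddressNormalizer
{
    // \b on both sides keeps "Ave" from matching inside longer words such as "Avenida".
    // \.? also consumes an existing trailing dot, so "Ave." does not turn into "Av..".
    // IgnoreCase is an assumption: drop it if lowercase "ave" must stay untouched.
    private static readonly Regex AveWord =
        new Regex(@"\bAve\b\.?", RegexOptions.IgnoreCase);

    public static string NormalizeAve(string address)
    {
        // Regex.Replace returns a new string; the input is not modified.
        return AveWord.Replace(address, "Av.");
    }
}
```

Called as AddressNormalizer.NormalizeAve("Ave Maria 12"), this should also handle the first-word case, because \b matches at the start of the string.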
I have searched through some questions but couldn't find the exact answer I am looking for.
I have a requirement to search through large strings of text looking for keyword matches. I was using IndexOf; however, I need to find whole-word matches, e.g. if I search for Java but the text contains JavaScript, it shouldn't match. This works fine using \b{pattern}\b, but if I search for something like C#, then it doesn't work.
Below are a few examples of the text strings that I am searching through:
languages include Java,JavaScript,MySql,C#
languages include Java/JavaScript/MySql/C#
languages include Java, JavaScript, MySql, C#
Obviously the issue is with the special character '#'; so this also doesn't work when searching for C++.
Escape the pattern using Regex.Escape and replace the context-dependent \b word boundaries with (?<!\w) / (?!\w) lookarounds:
var rx = $@"(?<!\w){Regex.Escape(pattern)}(?!\w)";
The (?<!\w) is a negative lookbehind that fails the match if there is a word char immediately before the current location, and (?!\w) is a negative lookahead that fails the match if there is a word char immediately after the current location. Unlike \b, both succeed at the start or end of the string and next to non-word chars such as # or +.
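Here is a small sketch of how that pattern could be wrapped for the keyword search in the question. KeywordSearch and ContainsWholeWord are hypothetical names chosen for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class KeywordSearch
{
    public static bool ContainsWholeWord(string text, string keyword)
    {
        // Regex.Escape turns "C#" and "C++" into literal text.
        // The lookarounds reject a match only when a word character
        // touches either side of the keyword.
        string pattern = $@"(?<!\w){Regex.Escape(keyword)}(?!\w)";
        return Regex.IsMatch(text, pattern);
    }
}
```

With the sample text "languages include Java,JavaScript,MySql,C#", searching for "Java" or "C#" should succeed, while "Java" is not found in a text that only contains "JavaScript". One thing to keep in mind: searching for plain "C" would still match the C in "C#", because # is not a word character.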
Yeah, this is because there isn't a word boundary (a \b) after the #, because # isn't a "word" character. You could use a regular expression like the following, which searches for a character that isn't a part of a language name [^a-zA-Z+#] after the language:
\b{pattern}[^a-zA-Z+#]
Or, if you believe you can list all of the possible characters that aren't part of a language name (for example, whitespace, ,, ., and ;):
[\s,.;]{pattern}[\s,.;]
Alternately, if it is possible that a language name is at the very end of a string (depending on what you're getting the data from), you might need to also match the end of the string $ in addition to the separators, or similarly, the beginning of the string ^.
[\s,.;]{pattern}(?:[\s,.;]|$)
My Regex is for a Canadian postal code and only allows the valid letters:
Regex pattern = new Regex("^[ABCEGHJKLMNPRSTVXY][0-9][ABCEGHJKLMNPRSTVWXYZ][/s][0-9][ABCEGHJKLMNPRSTVWXYZ][0-9]$");
The problem I am having is that I want to allow a space between the two sets but cannot find the correct character to use.
You've got a forward-slash instead of a backslash in your regular expression for whitespace (\s). The following regex should work.
@"^[ABCEGHJKLMNPRSTVXY][0-9][ABCEGHJKLMNPRSTVWXYZ][\s][0-9][ABCEGHJKLMNPRSTVWXYZ][0-9]$"
If you are simply searching for space use \s
To write the backslash escape sequence without doubling it, use the @ verbatim string literal prefix, as in the example below.
Regex pattern = new Regex(@"^[ABCEGHJKLMNPRSTVXY][0-9][ABCEGHJKLMNPRSTVWXYZ]\s[0-9][ABCEGHJKLMNPRSTVWXYZ][0-9]$");
As pointed out in the comments, if space is optional you can use ? quantifier as below.
Regex pattern = new Regex(@"^[ABCEGHJKLMNPRSTVXY][0-9][ABCEGHJKLMNPRSTVWXYZ]\s?[0-9][ABCEGHJKLMNPRSTVWXYZ][0-9]$");
Use the \s token for whitespace instead of /s.
Some handy tools to speed up regex development:
regexr.com helps with syntax and provides realtime testing
regexpr.com (yes I know :)) visualizes your expression.
As per other answers....
Use \s instead of /s
You shouldn't need to square bracket the [\s], because it already implies a complete class of characters.
Also...
In most languages, you probably don't want to use double quotes "..." as delimiters to the Regex, since this might be interpolating the \s before the pattern is applied. It's certainly worth a try.
Use a trailing quantifier \s* or \s? to allow the space to be optional.
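Putting the suggestions from the answers together (backslash instead of forward slash, verbatim string, optional space), a validation helper might look roughly like this. PostalCode and IsValid are illustrative names, and the letter sets are copied from the question without checking them against the official rules.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PostalCode
{
    // Letter-digit-letter, one optional whitespace character, digit-letter-digit.
    // The allowed letters are the ones from the question.
    private static readonly Regex Pattern = new Regex(
        @"^[ABCEGHJKLMNPRSTVXY][0-9][ABCEGHJKLMNPRSTVWXYZ]\s?[0-9][ABCEGHJKLMNPRSTVWXYZ][0-9]$");

    public static bool IsValid(string code)
    {
        return Pattern.IsMatch(code);
    }
}
```

Both "K1A 0B1" and "K1A0B1" should pass here, while "K1A/0B1" should not. If the space must be mandatory, replace \s? with \s.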
I have a series of words coupled with their definitions and I want to add some bbcode around every instance of the word inside the definition.
To achieve this I have the following code:
wd[1] = Regex.Replace(wd[1],@"\b"+wd[0]+@"\b","[ffa500]"+wd[0]+"[-]");
where wd[0] is the word and wd[1] is the definition.
This works for single words but does not when wd[0] contains commas or exclamation points. For instance, it works when wd[0] contains "break dance" or "" but does not for "ay, caramba!".
Any idea why this is happening?
Edit:
I should add that for "ay, caramba!" and some other words I have the italic bbcode [i][/i] around the word in the definition, but that is not the case for all words found in the definition. I would like the solution to work regardless; is there any way to achieve this?
The issue is that punctuation is not considered to be part of a word; specifically, it is not included in the \w class. That means that if your "word" ends with punctuation, it will not be followed by a word boundary \b unless a word character immediately follows the punctuation. So for your example, "\bay, caramba!\b" would not match "ay, caramba! What is going on?" or "ay, caramba!", but would match "ay, caramba!No Space.". You could try matching word boundaries, non-word characters, or the beginning or end of the line instead, like this:
wd[1] = Regex.Replace(
wd[1],
@"(^|\b|\W)"+wd[0]+@"(\b|\W|$)",
"$1[ffa500]"+wd[0]+"[-]$2");
Notice that you have to add the $1 and $2 groups in the replacement string in case they matched non-word characters (\W).
EDIT
And here's how you can do a case insensitive match without changing the case in the replacement.
wd[1] = Regex.Replace(
wd[1],
@"(^|\b|\W)(" + wd[0] + @")(\b|\W|$)",
"$1[ffa500]$2[-]$3",
RegexOptions.IgnoreCase);
Further you might want to consider escaping wd[0] when you use it in the pattern in case it contains special regular expression characters like . and *.
wd[1] = Regex.Replace(
wd[1],
@"(^|\b|\W)(" + Regex.Escape(wd[0]) + @")(\b|\W|$)",
"$1[ffa500]$2[-]$3",
RegexOptions.IgnoreCase);
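For reference, the last version can be wrapped in a small helper and tried on a sample definition. BbcodeHighlighter and Highlight are names invented for this sketch; the pattern and replacement string are the ones from the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BbcodeHighlighter
{
    public static string Highlight(string definition, string word)
    {
        // Group 1 and group 3 capture whatever delimited the word (possibly nothing),
        // group 2 captures the word as it appears in the text, so its case is preserved.
        string pattern = @"(^|\b|\W)(" + Regex.Escape(word) + @")(\b|\W|$)";
        return Regex.Replace(
            definition,
            pattern,
            "$1[ffa500]$2[-]$3",
            RegexOptions.IgnoreCase);
    }
}
```

For example, Highlight("He shouted ay, caramba! and left.", "ay, caramba!") should wrap only the phrase itself and put the surrounding spaces back via $1 and $3.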
I need my C# regex to only match full words, and I need to make sure that +-/*() delimit words as well (I'm not sure if the last part is already set that way.) I find regexes very confusing and would like some help on the matter.
Currently, my regex is:
public Regex codeFunctions = new Regex("draw_line|draw_rectangle|draw_circle");
Thank you! :)
Try
public Regex codeFunctions = new Regex(@"\b(draw_line|draw_rectangle|draw_circle)\b");
The \b means match a word boundary, i.e. a transition from a non-word character to a word character (or vice versa).
Word characters include alphabet characters, digits, and the underscore symbol. Non-word characters include everything else, including +-/*(), so it should work fine for you.
See the Regex Class documentation for more details.
The @ at the start of the string makes it a verbatim string; otherwise you would have to type two backslashes to make one backslash.
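A quick way to check that the operators really act as delimiters is to run the pattern over a sample expression. This is only a sketch; CodeFunctionFinder and Find are hypothetical names, and the sample input is made up.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CodeFunctionFinder
{
    private static readonly Regex CodeFunctions =
        new Regex(@"\b(draw_line|draw_rectangle|draw_circle)\b");

    // Returns every whole-word occurrence of one of the listed function names.
    public static List<string> Find(string source)
    {
        var found = new List<string>();
        foreach (Match m in CodeFunctions.Matches(source))
        {
            found.Add(m.Value);
        }
        return found;
    }
}
```

On "x+draw_line(1,2)*my_draw_circle(3)" this should return only draw_line: the + and ( around it are non-word characters, while the underscore in front of draw_circle is a word character, so there is no boundary there.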
Do you want to match any words, or just the words listed above? To match an arbitrary word, substitute this for the bit that creates the Regex object:
new Regex(@"\b(\w+)\b");
In the future, if you want more characters to be treated as whitespace (for example, underscores), I would recommend String.Replace-ing them to a space character. There may be a clever way to get the same effect with regular expressions, but personally I think it would be too clever. The String.Replace version is obvious.
Also, I can't help but recommend that you read up on regular expressions. Yes, they look like line noise until you get used to them, but once you do they're convenient and there are plenty of good resources out there to help you.
I am trying to create a regular expression pattern in C#. The pattern can only allow for:
letters
numbers
underscores
So far I am having little luck (I'm not good at RegEx). Here is what I have tried thus far:
// Create the regular expression
string pattern = @"\w+_";
Regex regex = new Regex(pattern);
// Compare a string against the regular expression
return regex.IsMatch(stringToTest);
EDIT :
@"^[a-zA-Z0-9\_]+$"
or
@"^\w+$"
@"^\w+$"
\w matches any "word character", defined as digits, letters, and underscores. It's Unicode-aware so it'll match letters with umlauts and such (better than trying to roll your own character class like [A-Za-z0-9_] which would only match English letters).
The ^ at the beginning means "match the beginning of the string here", and the $ at the end means "match the end of the string here". Without those, e.g. if you just had @"\w+", then "##Foo##" would match, because it contains one or more word characters. With the ^ and $, "##Foo##" would not match (which sounds like what you're looking for), because you don't have beginning-of-string followed by one-or-more-word-characters followed by end-of-string.
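The difference between the anchored and the unanchored pattern is easy to see side by side. A minimal sketch, with IdentifierCheck and its two methods being names invented for the example:

```csharp
using System;
using System.Text.RegularExpressions;

public static class IdentifierCheck
{
    // Anchored: the whole string must consist of word characters.
    public static bool IsWordOnly(string s)
    {
        return Regex.IsMatch(s, @"^\w+$");
    }

    // Unanchored: true as soon as any run of word characters occurs somewhere.
    public static bool ContainsWord(string s)
    {
        return Regex.IsMatch(s, @"\w+");
    }
}
```

IsWordOnly("Foo_123") should be true and IsWordOnly("##Foo##") false, whereas ContainsWord("##Foo##") is true. An empty string fails the anchored check because + requires at least one character.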
Try experimenting with something like http://www.weitz.de/regex-coach/ which lets you develop regex interactively.
It's designed for Perl, but helped me understand how a regex works in practice.
Regex packedasciiRegex = new Regex(@"^[!#$%&'()*+,-./:;?#[\]^_]*$");