string replacement issues with bbcode in c# - c#

I have a series of words coupled with their definitions and I want to add some bbcode around every instance of the word inside the definition.
To achieve this I have the following code:
wd[1] = Regex.Replace(wd[1], @"\b" + wd[0] + @"\b", "[ffa500]" + wd[0] + "[-]");
where wd[0] is the word and wd[1] is the definition.
This works for single words but does not when wd[0] contains commas or exclamation points. For instance, it works when wd[0] contains "break dance" or "" but does not for "ay, caramba!".
Any idea why this is happening?
Edit:
I should add that for "ay, caramba!" and some other words, the word is wrapped in the italic bbcode [i][/i] in the definition, but that is not the case for every word found in the definition. I would like the solution to work either way. Is there any way to achieve this?

The issue is that punctuation is not considered part of a word; specifically, it is not included in the \w class. If your "word" ends with punctuation, that punctuation is followed by a word boundary \b only when a word character comes immediately after it. So for your example, "\bay, caramba!\b" would not match "ay, caramba! What is going on?" or "ay, caramba!", but it would match "ay, caramba!No Space.". You could instead match a word boundary, a non-word character, or the beginning or end of the line, like this:
wd[1] = Regex.Replace(
    wd[1],
    @"(^|\b|\W)" + wd[0] + @"(\b|\W|$)",
    "$1[ffa500]" + wd[0] + "[-]$2");
Notice that you have to add the $1 and $2 groups in the replacement string in case they matched non-word characters (\W).
EDIT
And here's how you can do a case insensitive match without changing the case in the replacement.
wd[1] = Regex.Replace(
    wd[1],
    @"(^|\b|\W)(" + wd[0] + @")(\b|\W|$)",
    "$1[ffa500]$2[-]$3",
    RegexOptions.IgnoreCase);
Further you might want to consider escaping wd[0] when you use it in the pattern in case it contains special regular expression characters like . and *.
wd[1] = Regex.Replace(
    wd[1],
    @"(^|\b|\W)(" + Regex.Escape(wd[0]) + @")(\b|\W|$)",
    "$1[ffa500]$2[-]$3",
    RegexOptions.IgnoreCase);
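As an alternative to the capture groups above, here is a minimal sketch that uses lookarounds, so nothing around the word is consumed and nothing has to be put back in the replacement. The names BbcodeHighlighter and Highlight are made up for illustration, and the sketch assumes that "standalone" simply means "not directly next to a \w character".

```csharp
using System.Text.RegularExpressions;

public static class BbcodeHighlighter
{
    // Hypothetical helper: wraps every standalone occurrence of `term`
    // in `text` with the [ffa500]...[-] colour tags from the question.
    public static string Highlight(string text, string term)
    {
        // (?<!\w) and (?!\w) only assert that no word character is adjacent;
        // they consume nothing, so punctuation and [i] tags are left alone.
        string pattern = @"(?<!\w)" + Regex.Escape(term) + @"(?!\w)";

        // $& inserts the whole match, which keeps the original casing.
        return Regex.Replace(text, pattern, "[ffa500]$&[-]", RegexOptions.IgnoreCase);
    }
}
```

With this, wd[1] = BbcodeHighlighter.Highlight(wd[1], wd[0]); should behave the same whether or not the word sits inside [i][/i].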

Related

RegEx to find non-existence of white space prefix but not include the character in the match?

So i have the following RegEx for the purpose of finding and adding whitespace:
(\S)(\()
For a string like "SomeText(Somemoretext)" I want to produce "SomeText (Somemoretext)". The expression matches "t(", so my replacement removes the "t" from the string, which is not good. I also do not know what the preceding character could be; I am merely trying to detect the absence of whitespace.
Is there a better expression to use, or a way to exclude the found character from the match so that I can safely replace without catching characters I do not want to replace?
Thanks
I find lookarounds hard to read and would prefer using substitutions in the replacement string instead:
var s = Regex.Replace("test1() test2()", @"(\S)\(", "$1 (");
Debug.Assert(s == "test1 () test2 ()");
$1 inserts the first capture group from the regex into the replacement string which is the non-space character before the opening parenthesis (.
If you need to detect the absence of space before a specific character (such as bracket) after a word, how about the following?
\b(?=[^\s])\(
This will detect words ([a-zA-Z0-9_]) that are followed by a bracket, without a space.
If I understood your problem correctly, you can replace the full match with " (" (a space followed by the bracket) and get exactly what you need.
In case you need to look for the absence of spaces before a symbol (like a bracket) in any kind of text (the text may be non-word, such as punctuation), you might want to use the following instead.
^(?:\S*)(\()(?:\S*)$
When using this, your result will be in group 1, instead of just full match (which now contains the whole line, if a line is matched).
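For completeness, here is a minimal lookbehind sketch for the original question. The lookbehind (?<=\S) checks that a non-whitespace character precedes the bracket without making it part of the match. The names ParenSpacer and AddSpaceBeforeParen are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class ParenSpacer
{
    // Hypothetical helper: inserts a space before every "(" that directly
    // follows a non-whitespace character.
    public static string AddSpaceBeforeParen(string input)
    {
        // Only the "(" itself is matched, so the preceding character
        // never has to be re-inserted in the replacement.
        return Regex.Replace(input, @"(?<=\S)\(", " (");
    }
}
```

A bracket that already has whitespace in front of it, or that starts the string, is left untouched.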

Word replacement using regex if the target word is not a part of another word

I am working on regex expression for a word to be replaced if it is a standalone word and not a part of another word. For example the word "thing". If it is something, the substring "thing" should be ignored there, but if a word "thing" is preceded with a special character such as a dot or a bracket, I want it captured. Also I want the word captured if there is a bracket, dot, or comma (or any other non-alphanumeric character is there) after it.
In the string
Something is a thing, and one more thingy and (thing and more thing
In the sentence above I highlighted the 3 words to be marked for replacement. I used the following Regex
\bthing\b
I tried the above sentence on regex101.com and, with this expression, only the first word got highlighted. I understand that my regex might not capture (thing, but I thought it would at least capture the last word in the sentence, giving 2 occurrences.
Can someone please help me modify my regex to capture all 3 occurrences in the sentence above?
You were likely using the JavaScript regex flavor, which stops after the first match is found. If you add the modifier g in the second box on regex101.com, it will find all matches.
This site is better for C# regex testing: http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx
If you code this in C# and use the 'Matches' method, it should match multiple times.
Regex regex = new Regex("\\bthing\\b");
foreach (Match match in regex.Matches(
    "Something is a thing, and one more thingy and (thing and more thing"))
{
    Console.WriteLine(match.Value);
}
Shorthand for alphanum [0-9A-Za-z] is [^\W_]
Using a lookbehind and lookahead you'd get
(?<![^\W_])thing(?![^\W_])
Expanded
(?<! [^\W_] )   # Not alphanum behind
thing           # 'thing'
(?! [^\W_] )    # Not alphanum ahead
Matches the highlighted text
Something is a thing, and one more thingy and (thing and more thing
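A rough C# sketch of the lookaround version, in case it helps. The class name StandaloneWord, its method names, and the replacement word passed in are assumptions for illustration. Note that, unlike \bthing\b, this pattern treats an underscore as a separator.

```csharp
using System.Text.RegularExpressions;

public static class StandaloneWord
{
    // "thing" with no letter or digit directly before or after it.
    private const string Pattern = @"(?<![^\W_])thing(?![^\W_])";

    // Counts standalone occurrences of "thing".
    public static int Count(string input)
    {
        return Regex.Matches(input, Pattern).Count;
    }

    // Replaces standalone occurrences of "thing" with `replacement`.
    public static string Replace(string input, string replacement)
    {
        return Regex.Replace(input, Pattern, replacement);
    }
}
```

On the sentence from the question, Count should report the three occurrences that were meant to be highlighted.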

Using Regular Expressions to replace patterns in C#

I'm a little too new to RegEx's so this is mostly asking for help with specific pattern matching and a little with how to implement them in C#.
I have a large Excel file full of, among other things, repeated addresses that are written in different styles. Most are abbreviations of words like Avenue/etc.
For the simple ones I looked up the string.replace() function:
address.Replace("Av ", "Av. ");
And it does the trick there and for some others; but what if I want to replace the word "Ave" I run into the possibility of it being part of another word (some addresses are in Spanish so this is likely to happen). I thought about including whitespaces before and after (" ave ") but would that work if it's the first word in the string?
Or should I use a pattern like (this might be wrong too)
^[0-9a-zA-Z_#' ](Ave)\w //the word is **not** preceded by any character other than a whitespace and is followed by a whitespace
For Expressions such as those, I should use something along this pattern, right?
string replacement = "Av.";
Regex rgx = new Regex( ^[0-9a-zA-Z_#' ](Ave)\w);
string result = rgx.Replace(input, replacement);
Thanks
Regular expressions have a nifty tool for this: the \b anchor. It matches at word boundaries, so Ave\b only matches Ave when it is followed by a space, a dot, the end of the string, or anything else that is not a word character.
Read all about the word boundary class here: http://www.regular-expressions.info/wordboundaries.html
BTW, that site is THE place to go to to learn about regular expressions.
Also, if you were to do it in the way you try, it could be something like this: [^\w]Ave[^\s]
That literally is: Not a word character (a-z, A-Z, 0-9 or _), then Ave, then not a space character (tab, space, linebreak etc.).
Also you could use the shorthand for [^\w] and [^\s] which are \W and \S so it would then become \WAve\S
But the \b way is better.
Add the word delimiter to your regex,
Regex.Match(content, @"\b(Ave)\b");
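Putting the \b suggestion into a replacement could look roughly like this. It is only a sketch: AddressNormalizer and NormalizeAvenue are made-up names, the match is case-sensitive, and the optional \.? is my own assumption so that an address that already says "Ave." does not end up as "Av..".

```csharp
using System.Text.RegularExpressions;

public static class AddressNormalizer
{
    // Hypothetical helper: rewrites the standalone abbreviation "Ave"
    // (with or without a trailing dot) as "Av.".
    public static string NormalizeAvenue(string address)
    {
        // \b on both sides keeps words such as "Avenida" untouched;
        // \.? absorbs an existing dot so it is not doubled.
        return Regex.Replace(address, @"\bAve\b\.?", "Av.");
    }
}
```

Because \b also matches at the start of the string, this works when "Ave" is the first word, which the " ave " approach with literal spaces would miss.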

Replace with wildcards

I need some advice. Suppose I have the following string: Read Variable
I want to find all pieces of text like this in a string and turn each of them into the following: Variable = MessageBox.Show. Some additional examples:
"Read Dog" --> "Dog = MessageBox.Show"
"Read Cat" --> "Cat = MessageBox.Show"
Can you help me? I need a fast advice using RegEx in C#. I think it is a job involving wildcards, but I do not know how to use them very well... Also, I need this for a school project tomorrow... Thanks!
Edit: This is what I have done so far and it does not work: Regex.Replace(String, "Read ", " = Messagebox.Show").
You can do this
string ns = Regex.Replace(yourString, @"Read\s+(.*?)(?:\s|$)", "$1 = MessageBox.Show");
\s+ matches 1 to many space characters
(.*?)(?:\s|$) matches 0 to many characters till the first space (i.e \s) or till the end of the string is reached(i.e $)
$1 represents the first captured group i.e (.*?)
You might want to clarify your question... but here goes:
If you want to match the next word after "Read " in regex, use Read (\w*) where \w is the word character class and * is the greedy "zero or more" quantifier.
If you want to match everything after "Read " in regex, use Read (.*)$ where . will match all characters and $ means end of line.
With either regex, you can use a replace of $1 = MessageBox.Show as $1 will reference the first matched group (which was denoted by the parenthesis).
Complete code:
replacedString = Regex.Replace(inStr, @"Read (.*)$", "$1 = MessageBox.Show");
The problem with your attempt is that it cannot know that the replacement string should be inserted after your variable. Let's assume that valid variable names contain letters, digits and underscores (which can be conveniently matched with \w). That means any other character ends the variable name. Then you could match the variable name, capture it (using parentheses) and put it in the replacement string with $1:
output = Regex.Replace(input, @"Read\s+(\w+)", "$1 = MessageBox.Show");
Note that \s+ matches one or more arbitrary whitespace characters. \w+ matches one or more letters, digits and underscores. If you want to restrict variable names to letters only, this is the place to change it:
output = Regex.Replace(input, @"Read\s+([a-zA-Z]+)", "$1 = MessageBox.Show");
Here is a good tutorial.
Finally, note that in C# it is advisable to write regular expressions as verbatim strings (@"..."). Otherwise, you will have to double escape everything so that the backslashes get through to the regex engine, and that really lessens the readability of the regex.
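To see the \w+ variant handle several statements in one string, here is a small sketch. ReadRewriter and Rewrite are hypothetical names, and it assumes variable names consist only of \w characters, as discussed above.

```csharp
using System.Text.RegularExpressions;

public static class ReadRewriter
{
    // Hypothetical helper: turns every "Read <name>" into
    // "<name> = MessageBox.Show".
    public static string Rewrite(string source)
    {
        // \s+ skips the whitespace after "Read"; (\w+) captures the name,
        // which $1 then places in front of the assignment.
        return Regex.Replace(source, @"Read\s+(\w+)", "$1 = MessageBox.Show");
    }
}
```

Since the pattern is not anchored with $, every occurrence in a multi-line string is rewritten, not just one at the end of the input.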

Regex to find specific word starting with a specific char

Using Regex, I need to find a word within a string that starts with specific char. The word must be alphanumeric, but may contain underscore (_) within the word. underscore at the beginning and end of the word is not acceptable.
For example I have the following string.
#word1 Message ## message # message #word2_ message #word#3 #_word4 mesagge #word_5
The result should be:
#word1 #word_5
Thanks.
Use regex pattern
(?:^|(?<=\s))#(?!_)\w+(?<!_)(?:(?=\s)|$)
or
(?:^|(?<=\W))#(?!_)\w+(?<!_)(?:(?=\W)|$)
depending on what you need or want to have in front of and behind the word.
For example if #word1 in #word_5 #word1. #word#2 #word*3 should match, considering dot . as separator or end of sentence.
This Regex will do it!
(?<=(^|\s))#([a-zA-Z0-9]{1}\w*[a-zA-Z0-9]|[a-zA-Z0-9]{1})(?=(\s|$))
It also matches single letter
This will work - the bounds (lines 1 and 3) are fairly heavy because \b, the word boundary, won't work here since you don't want to match "#word#3", and the "#" character after "d" triggers a word boundary.
(?<=\s|^)
#(?!_)\w+(?<!_)
(?=\s|$)
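A minimal sketch of running that last pattern from C#, joined onto one line. TagFinder and FindTags are made-up names; the pattern itself is the three-line one above.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class TagFinder
{
    // Whitespace or start before, "#", a word that neither starts nor ends
    // with "_", then whitespace or end after.
    private const string Pattern = @"(?<=\s|^)#(?!_)\w+(?<!_)(?=\s|$)";

    // Hypothetical helper: returns every matching tag in order of appearance.
    public static string[] FindTags(string input)
    {
        return Regex.Matches(input, Pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }
}
```

For the sample string in the question, string.Join(" ", TagFinder.FindTags(sample)) should give "#word1 #word_5".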
