How to replace group of words in file [duplicate] - c#

In C#, I want to use a regular expression to match any of these words:
string keywords = "(shoes|shirt|pants)";
I want to find the whole words in the content string. I thought this regex would do that:
if (Regex.Match(content, keywords + "\\s+",
RegexOptions.Singleline | RegexOptions.IgnoreCase).Success)
{
//matched
}
but it returns true for words like participants, even though I only want the whole word pants.
How do I match only those literal words?

You should add the word delimiter to your regex:
\b(shoes|shirt|pants)\b
In code:
Regex.Match(content, @"\b(shoes|shirt|pants)\b");

Try
Regex.Match(content, @"\b" + keywords + @"\b", RegexOptions.Singleline | RegexOptions.IgnoreCase)
\b matches at word boundaries, that is, positions between a word character (\w) and either a non-word character or the start or end of the input.

You need a zero-width assertion on either side that the characters before and after the word are not part of the word:
(?<=\W|^)(shoes|shirt|pants)(?=\W|$)
As others suggested, \b works in place of (?<=\W|^) and (?=\W|$), including when the word is at the beginning or end of the input string.

Put a word boundary on each side of the group using the \b metasequence.
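The following is a minimal sketch of the \b approach suggested in the answers, covering the cases the question asks about: a keyword embedded in a longer word, and a keyword at the very start or end of the input. The class and method names (WholeWordDemo, ContainsKeyword) are made up for illustration and do not come from the original posts.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WholeWordDemo
{
    // Alternation of literal keywords, as in the question.
    private const string Keywords = "(shoes|shirt|pants)";

    // Returns true only when one of the keywords occurs as a whole word.
    public static bool ContainsKeyword(string content)
    {
        return Regex.IsMatch(content, @"\b" + Keywords + @"\b", RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        // "pants" is only a substring of "participants" here.
        Console.WriteLine(ContainsKeyword("the participants arrived"));
        // The keyword is the entire input, so both \b sit at the string edges.
        Console.WriteLine(ContainsKeyword("pants"));
        // The keyword is followed by punctuation and differs in case.
        Console.WriteLine(ContainsKeyword("I bought new Shoes."));
    }
}
```

The first call should report no match and the other two a match, since \b also holds at the start and end of the input when the adjacent character is a word character.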

Related

Regex replace special character

I need help in my regex.
I need to remove the special character found in the start of text
for example I have a text like this
.just a $#text this should not be incl#uded
The output should be like this
just a text this should not be incl#uded
I've been testing my regex here but I can't make it work
([\!-\/\;-\@]+)[\w\d]+
How do I limit the regex to check only the text that starts with special characters?
Thank you
Use \B[!-/;-@]+\s*\b:
var result = Regex.Replace(s, @"\B[!-/;-@]+\s*\b", "");
See the regex demo
Details
\B - a position other than a word boundary (there must be the start of the string or a non-word char immediately to the left of the current position)
[!-/;-@]+ - 1 or more ASCII punctuation chars
\s* - 0+ whitespace chars
\b - a word boundary; there must be a letter/digit/underscore immediately to the right of the current location.
If you plan to remove all punctuation and symbols, use
var result = Regex.Replace(s, @"\B[\p{P}\p{S}]+\s*\b", "");
See another regex demo.
Note that \p{P} matches any punctuation symbols and \p{S} matches any symbols.
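As a quick check of the \B[!-/;-@]+\s*\b pattern against the sample text from the question, here is a small sketch. The class and method names are hypothetical, and the input is copied as it appears in the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LeadingPunctuationDemo
{
    // Removes runs of ASCII punctuation that start a word, leaving
    // punctuation inside a word (such as the # in "incl#uded") alone.
    public static string StripLeadingPunctuation(string s)
    {
        return Regex.Replace(s, @"\B[!-/;-@]+\s*\b", "");
    }

    public static void Main()
    {
        string input = ".just a $#text this should not be incl#uded";
        // Prints: just a text this should not be incl#uded
        Console.WriteLine(StripLeadingPunctuation(input));
    }
}
```

The # in "incl#uded" survives because the position between "l" and "#" is a word boundary, so \B fails there.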
Use a lookbehind:
(^[.$#]+|(?<= )[.$#]+)
The ^[.$#]+ part matches the special characters at the start of a line.
The (?<= )[.$#]+ part matches the special characters at the start of a word inside the sentence, that is, directly after a space.
Add your special characters to the character class [] as needed.
Following are two possible options from your question details. Hope it will help you.
string input = ".just a $#text this should not be incl#uded";
// Remove all special characters from the whole string
string output1 = Regex.Replace(input, @"[^0-9a-zA-Z\ ]+", "");
// Remove leading special characters from each word in the string; other special characters are kept
// (Select and ToArray require "using System.Linq;")
var split = input.Split();
string output2 = string.Join(" ", split.Select(s => Regex.Replace(s, @"^[^0-9a-zA-Z]+", "")).ToArray());
A negative lookahead is fine here:
(?![\.\$#].*)[\S]+
https://regex101.com/r/i0aacp/11/
[\S] matches any non-whitespace character
(?![\.\$#].*) is a negative lookahead, meaning the run matched by [\S]+ must not start with any of . $ #

Regex Match for HTML string with newline

I am trying to match:
<h4>Manufacturer</h4>\n\n Gigabyte\n\n\n
My Regex ATM is:
Match regex = Regex.Match(cleanedUpHtml, "Manufacturer(.*?)\n\n\n", RegexOptions.IgnoreCase);
However it does not work.
The (.*?) should match all in between.
Here are 2 things I find important:
Whenever you declare a regex pattern in C#, it is advisable to use verbatim string literals, i.e. @"PATTERN". This simplifies writing regex patterns because backslashes do not need to be doubled.
RegexOptions.Singleline must be used so that the dot also matches line breaks; by default . matches any character except \n.
Here is my code snippet:
var str = "<h4>Manufacturer</h4>\n\n Gigabyte\n\n\n";
var regex = Regex.Match(str, @"Manufacturer(.*?)\n\n\n",
RegexOptions.IgnoreCase | RegexOptions.Singleline);
if (regex.Success)
MessageBox.Show("\"" + regex.Value + "\"");
The regex.Value is
"Manufacturer</h4>
Gigabyte
"
Best regards.
I replaced \n with another value and then Regex searched my replaced value. It is working for the time being, but it may not be the best approach. Any recommendations appreciated.
cleanedUpHtml = cleanedUpHtml.Replace("\n", "p19o9");
Match regex = Regex.Match(cleanedUpHtml, "Manufacturer(.*?)p19o9p19o9p19o9", RegexOptions.IgnoreCase);
Generally I prefer to clean the HTML tags and newline characters out of the string before using the regex.
(.*?) stops capturing at a \n character unless RegexOptions.Singleline is set; you might use a more generic group instead, like ([\w\W]*?)
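To see the effect of RegexOptions.Singleline on the sample string, here is a small sketch that runs the same pattern with and without the option. The names (SinglelineDemo, TryExtract) are hypothetical. Adding </h4> to the pattern and trimming the capture are assumptions made here so that only the manufacturer name is returned; they are not part of the original answers.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    // Tries to capture the text between the heading and three consecutive newlines.
    // extraOptions lets the caller switch RegexOptions.Singleline on or off.
    public static bool TryExtract(string html, RegexOptions extraOptions, out string value)
    {
        Match m = Regex.Match(html, @"Manufacturer</h4>(.*?)\n\n\n",
            RegexOptions.IgnoreCase | extraOptions);
        value = m.Success ? m.Groups[1].Value.Trim() : "";
        return m.Success;
    }

    public static void Main()
    {
        string html = "<h4>Manufacturer</h4>\n\n Gigabyte\n\n\n";
        string value;

        // Without Singleline the dot cannot cross the newlines, so nothing matches.
        Console.WriteLine(TryExtract(html, RegexOptions.None, out value));

        // With Singleline the lazy group spans the line breaks.
        if (TryExtract(html, RegexOptions.Singleline, out value))
        {
            Console.WriteLine(value);
        }
    }
}
```

This also avoids the workaround of replacing \n with a placeholder token before matching.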

string replacement issues with bbcode in c#

I have a series of words coupled with their definitions and I want to add some bbcode around every instance of the word inside the definition.
To achieve this I have the following code:
wd[1] = Regex.Replace(wd[1], @"\b" + wd[0] + @"\b", "[ffa500]" + wd[0] + "[-]");
where wd[0] is the word and wd[1] is the definition.
This works for single words but does not when wd[0] contains commas or exclamation points. For instance, it works when wd[0] contains "break dance" or "" but does not for "ay, caramba!".
Any idea why this is happening?
Edit:
I should add that for "ay, caramba!" and some other words I have the italic bbcode [i][/i] around the word in the definition, but that is not the case for all words found in the definition. I would like the solution to work regardless. Is there any way to achieve this?
The issue is that punctuation is not considered part of a word; specifically, it is not included in the \w class. So if your "word" ends with punctuation, the position after it is a word boundary \b only when a word character immediately follows the punctuation. For your example, "\bay, caramba!\b" would not match "ay, caramba! What is going on?" or "ay, caramba!", but would match "ay, caramba!No Space.". You could instead match a word boundary, a non-word character, or the beginning or end of the line, like this:
wd[1] = Regex.Replace(
wd[1],
@"(^|\b|\W)" + wd[0] + @"(\b|\W|$)",
"$1[ffa500]" + wd[0] + "[-]$2");
Notice that you have to add the $1 and $2 groups in the replacement string in case they matched non-word characters (\W).
EDIT
And here's how you can do a case insensitive match without changing the case in the replacement.
wd[1] = Regex.Replace(
wd[1],
@"(^|\b|\W)(" + wd[0] + @")(\b|\W|$)",
"$1[ffa500]$2[-]$3",
RegexOptions.IgnoreCase);
Further you might want to consider escaping wd[0] when you use it in the pattern in case it contains special regular expression characters like . and *.
wd[1] = Regex.Replace(
wd[1],
@"(^|\b|\W)(" + Regex.Escape(wd[0]) + @")(\b|\W|$)",
"$1[ffa500]$2[-]$3",
RegexOptions.IgnoreCase);
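An alternative to capturing the surrounding characters is to assert them with lookarounds, so that nothing outside the word is consumed and the replacement needs only one group. This is a different technique from the one in the answer: a sketch using (?<!\w) and (?!\w) together with Regex.Escape. The names BbcodeHighlightDemo and Highlight are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BbcodeHighlightDemo
{
    // Wraps every standalone occurrence of word in the colour bbcode.
    // (?<!\w) and (?!\w) require that no word character touches the match,
    // which also works when the word starts or ends with punctuation.
    public static string Highlight(string definition, string word)
    {
        string pattern = @"(?<!\w)(" + Regex.Escape(word) + @")(?!\w)";
        // $1 re-inserts the matched text, so its original casing is kept.
        return Regex.Replace(definition, pattern, "[ffa500]$1[-]", RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        Console.WriteLine(Highlight("Ay, caramba! What is going on?", "ay, caramba!"));
        Console.WriteLine(Highlight("a break dance move", "break dance"));
        Console.WriteLine(Highlight("breakfast", "break"));
    }
}
```

Because the lookarounds are zero-width, two occurrences separated by a single space can both be matched, which is not the case when the separator is consumed by a (\W) group.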

Regex for word.otherword

I want a Regular Expression for a word.otherword form. I tried \b[a-z]\.[a-z]\b, but it gives me an error at the \. part, saying Unrecognized escape sequence. Any idea what's wrong? I'm working under .NET C#. Thanks!
LE:
john.Smith or JoHn.SmItH or JOHN.SMITH should work.
John Smith or john!Smith or john.Smith.Smith shouldn't work.
Try this :
foundMatch = Regex.IsMatch(SubjectString, @"\b[a-z]\.[a-z]\b");
Probably you were not using @? Without it, the C# compiler treats \. as a string escape sequence and rejects it.
Your regex tries to match a.a, that is, a single character on each side of the dot. Since you want it to match complete words, you need a quantifier, e.g.
\b[a-z]+\.[a-z]+\b
Finally you may want to use the case insensitive match to allow for words with capital letters to be matched too :
foundMatch = Regex.IsMatch(SubjectString, @"\b[a-z]+\.[a-z]+\b", RegexOptions.IgnoreCase);
This will match all words.words with at least one character for each word regardless of capitalization.
This will match word.otherword only if there is whitespace before the first word (or it is at the start of the string) and whitespace after the second word (or it is at the end of the string).
foundMatch = Regex.IsMatch(SubjectString, @"(?<=\s|^)\b[a-z]+\.[a-z]+\b(?=\s|$)", RegexOptions.IgnoreCase);
Try this regex for the word.word format (the same word repeated on both sides of the dot):
@"\b([a-z]+)\.\1"
For word.otherword use this:
@"\b[a-z]+\.[a-z]+\b"
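The lookaround version can be checked against the examples listed in the question. This is only a sketch; the names DottedWordDemo and ContainsDottedWord are hypothetical. The first line of Main also shows that a verbatim literal and a regular literal with doubled backslashes produce the same pattern string.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DottedWordDemo
{
    // word.otherword delimited by whitespace or the string edges, case-insensitive.
    private static readonly Regex Dotted = new Regex(
        @"(?<=\s|^)\b[a-z]+\.[a-z]+\b(?=\s|$)", RegexOptions.IgnoreCase);

    public static bool ContainsDottedWord(string s)
    {
        return Dotted.IsMatch(s);
    }

    public static void Main()
    {
        // Verbatim (@"...") and escaped ("\\.") literals are the same string.
        Console.WriteLine(@"\b[a-z]+\.[a-z]+\b" == "\\b[a-z]+\\.[a-z]+\\b");

        foreach (string s in new[] { "john.Smith", "JoHn.SmItH", "JOHN.SMITH",
                                     "John Smith", "john!Smith", "john.Smith.Smith" })
        {
            Console.WriteLine(s + " -> " + ContainsDottedWord(s));
        }
    }
}
```

The first three inputs should match and the last three should not; john.Smith.Smith is rejected because the lookahead requires whitespace or the end of the string right after the second word.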

C# Regex getting words that start with?

How can I use a regular expression to get words that start with ! ? For example !Test.
I tried this but it doesn't give any matches:
@"\B\!\d+\b"
Although it did work when I replaced the ! with $.
I'd say that your regex was quite OK already; you just need to use \w (word character: letter, digit or underscore) instead of \d (digit):
@"\B!\w+\b"
will match any word that is immediately preceded by a !, unless that ! is itself directly preceded by a word character (that's what the \B asserts). Using a ^ instead will limit the matches to words that start at the beginning of a line, which might not be what you want.
So this will match all the words including exactly one preceding ! in this line:
!hello !this ...!will !!!be !matched!
but none of the words in this line:
this! won't!be matched!!!
You could also drop the \B altogether if you don't mind matching !that in this!that.
This should work when the word is at the start of the input: ^!\w+
MatchCollection matches = Regex.Matches(inputText, @"^!\w+");
foreach (Match match in matches)
{
Console.WriteLine (match.Value);
}
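The two sample lines from the \B!\w+\b answer can be run through a short sketch like the following. The names BangWordDemo and FindBangWords are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class BangWordDemo
{
    // Returns every word that is prefixed by a single '!' which is not
    // itself glued to the end of a preceding word.
    public static string[] FindBangWords(string text)
    {
        return Regex.Matches(text, @"\B!\w+\b")
            .Cast<Match>()
            .Select(m => m.Value)
            .ToArray();
    }

    public static void Main()
    {
        // Prints: !hello, !this, !will, !be, !matched
        Console.WriteLine(string.Join(", ",
            FindBangWords("!hello !this ...!will !!!be !matched!")));

        // Prints: 0
        Console.WriteLine(FindBangWords("this! won't!be matched!!!").Length);
    }
}
```

Unlike ^!\w+, this finds words anywhere in the input, not only at its start.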
