Regular Expression to Match Exact Word - Search String Highlight - c#

I'm using the following 2 methods to highlight the search keywords. It is working fine but fetching partial words also.
For Example:
Text: "This is .net Programming"
Search Key Word: "is"
It is highlighting partial word from this and "is"
Please let me know the correct regular expression to highlight the correct match.
private string HighlightSearchKeyWords(string searchKeyWord, string text)
{
Regex exp = new Regex(#", ?");
searchKeyWord = "(\b" + exp.Replace(searchKeyWord, #"|") + "\b)";
exp = new Regex(searchKeyWord, RegexOptions.Singleline | RegexOptions.IgnoreCase);
return exp.Replace(text, new MatchEvaluator(MatchEval));
}
private string MatchEval(Match match)
{
if (match.Groups[1].Success)
{
return "<span class='search-highlight'>" + match.ToString() + "</span>";
}
return ""; //no match
}

You really just need # before your "(\b" and "\b)" because the string "\b" will not be "\b" as you would expect. But I have also tried making another version with a replacement pattern instead of a full-blown method.
How about this one:
private string keywordPattern(string searchKeyword)
{
var keywords = searchKeyword.Split(',').Select(k => k.Trim()).Where(k => k != "").Select(k => Regex.Escape(k));
return #"\b(" + string.Join("|", keywords) + #")\b";
}
private string HighlightSearchKeyWords(string searchKeyword, string text)
{
var pattern = keywordPattern(searchKeyword);
Regex exp = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
return exp.Replace(text, #"<span class=""search-highlight"">$0</span>");
}
Usage:
var res = HighlightSearchKeyWords("is,this", "Is this programming? This is .net Programming.");
Result:
<span class="search-highlight">Is</span> <span class="search-highlight">this</span> programming? <span class="search-highlight">This</span> <span class="search-highlight">is</span> .net Programming.
Updated to use \b and a simplified replace pattern. (The old one used (^|\s) instead of the first \b and ($|\s) instead of the last \b. So it would also work on search terms which not only includes word-characters.
Updated to your comma notation for search terms
Updated forgot Regex.Escape - added now. Otherwise searches for "\w" would blow up the thing :)
Updated do to a comment ;)

Try this fixed line:
searchKeyWord = #"(\b" + exp.Replace(searchKeyWord, #"|") + #"\b)";

You need to enclose the keywords in a non-matching group, otherwise you will get false positives (if you are using multiple keywords separated by commas as indicated in the sample)!
private string EscapeKeyWords(string searchKeyWord)
{
string[] keyWords = searchKeyWord.Split(',');
for (int i = 0; i < keyWords.Length; i++) keyWords[i] = Regex.Escape(keyWords[i].Trim());
return String.Join("|", keyWords);
}
private string HighlightSearchKeyWords(string searchKeyWord, string text)
{
searchKeyWord = #"(\b(?:" + EscapeKeyWords(searchKeyWord) + #")\b)";
Regex exp = new Regex(searchKeyWord, RegexOptions.Singleline | RegexOptions.IgnoreCase);
return exp.Replace(text, #"<span class=""search-highlight"">$0</span>");
}

Related

Replace and remove a string via regex

This is first time I am working with regex.
The string below
var value ="abc ltd as yes"
need to be change to
var value ="abc Limited"
I have the following code:
public static string Attempt_Prefix_Removal( string prefix, string replacement, bool remove = false)
{
if (remove == true)
{
var yesy = $"(?<!prefix )" + prefix + ".*";
var test = Regex.Replace(prefix.ToLower(), $"(?<!prefix )" + prefix + ".*", replacement );
}
var output = (remove == true) ? Regex.Replace(prefix.ToLower(), $"(?<!prefix )" + prefix + ".*", replacement) : Regex.Replace(prefix.ToLower(), $"(?<!prefix )" + prefix + "", replacement);
return output;
}
the values that are passed to the method are
prefix ="ltd", replacement = "Limited" , remove = ture
after running the code the result is
abc Limited as yes
what do i need to change to get ride of as yes ??
thanks
You may leverage this code:
var prefix ="ltd"; var replacement = "Limited";
var pat = $#"(?s)(?<!\w){Regex.Escape(prefix)}(?!\w){remove ? ".*" : string.Empty}";
return Regex.Replace(val, pat, replacement.Replace("$", "$$"));
See the C# demo online
The main points here are:
(?s) - will allow . match a newline (in case you will use .* in the pattern)
(?<!\w){Regex.Escape(prefix)}(?!\w) - the (?<!\w) negative lookbehind will fail the match if the current location is preceded with a word char (you may further tweak the lookbehind pattern as per your requirements)
{remove ? ".*" : string.Empty} - this will either append .* (if remove is true) or not.
private string regexOp(string sentence, string word, string wordtoReplace, bool isRemove)
{
var retValue = sentence;
if (isRemove)
{
var Pattern = "^.*?(?=" + word + ")";
Match result = Regex.Match(sentence, #Pattern);
if (!string.IsNullOrEmpty(result.Value))
retValue = result + wordtoReplace;
}
else
retValue = Regex.Replace(sentence, word, wordtoReplace);
return retValue;
}
try this method, this will work as you expected with dynamic,
do not forget to mark it as answer if this really helped you,

C# Regex to replace invalid character to make it as perfect float number

for example if the string is "-234.24234.-23423.344"
the result should be "-234.2423423423344"
if the string is "898.4.44.4"
the result should be "898.4444"
if the string is "-898.4.-"
the result should be "-898.4"
the result should always make scene as a double type
What I can make is this:
string pattern = String.Format(#"[^\d\{0}\{1}]",
NumberFormatInfo.CurrentInfo.NumberDecimalSeparator,
NumberFormatInfo.CurrentInfo.NegativeSign);
string result = Regex.Replace(value, pattern, string.Empty);
// this will not be able to deal with something like this "-.3-46821721.114.4"
Is there any perfect way to deal with those cases?
It's probably a bad idea, but you can do this with regex like this:
Regex.Replace(input, #"[^-.0-9]|(?<!^)-|(?<=\..*)\.", "")
The regex matches:
[^-.0-9] # anything which isn't ., -, or a digit.
| # or
(?<!^)- # a - which is not at the start of the string
| # or
(?<=\..*)\. # a dot which is not the first dot in the string
This works on your examples, and additionally this case: "9-1.1" becomes "91.1".
You could also change (?<!^)- to (?<!^[^-.0-9]*)- if you'd like "asd-8" to become "-8" rather than "8".
It's not a good idea using regex itself to achieve your goal, since regex lack AND and NOT logic for expression.
Try the code below, it will do the same thing.
var str = #"-.3-46821721.114.4";
var beforeHead = "";
var afterHead = "";
var validHead = new Regex(#"(\d\.)" /* use #"\." if you think "-.5" is also valid*/, RegexOptions.Compiled);
Regex.Replace(str, #"[^0-9\.-]", "");
var match = validHead.Match(str);
beforeHead = str.Substring(0, str.IndexOf(match.Value));
if (beforeHead[0] == '-')
{
beforeHead = '-' + Regex.Replace(beforeHead, #"[^0-9]", "");
}
else
{
beforeHead = Regex.Replace(beforeHead, #"[^0-9]", "");
}
afterHead = Regex.Replace(str.Substring(beforeHead.Length + 2 /* 1, if you use \. as head*/), #"[^0-9]", "");
var validFloatNumber = beforeHead + match.Value + afterHead;
String must be trimmed before operation.

regex replace matchEvaluator using string Array

I need to highlight search terms in a block of text.
My initial thought was looping though the search terms. But is there an easier way?
Here is what I'm thinking using a loop...
public string HighlightText(string inputText)
{
string[] sessionPhrases = (string[])Session["KeywordPhrase"];
string description = inputText;
foreach (string field in sessionPhrases)
{
Regex expression = new Regex(field, RegexOptions.IgnoreCase);
description = expression.Replace(description,
new MatchEvaluator(ReplaceKeywords));
}
return description;
}
public string ReplaceKeywords(Match m)
{
return "<span style='color:red;'>" + m.Value + "</span>";
}
You could replace the loop with something like:
string[] phrases = ...
var re = String.Join("|", phrases.Select(s => Regex.Escape(s)).ToArray());
text = Regex.Replace(re, text, new MatchEvaluator(SomeFunction), RegexOptions.IgnoreCase);
Extending on Qtax's answer:
phrases = ...
// Use Regex.Escape to prevent ., (, * and other special characters to break the search
string re = String.Join("|", phrases.Select(s => Regex.Escape(s)).ToArray());
// Use \b (expression) \b to ensure you're only matching whole words, not partial words
re = #"\b(?:" +re + #")\b"
// use a simple replacement pattern instead of a MatchEvaluator
string replacement = "<span style='color:red;'>$0</span>";
text = Regex.Replace(re, text, replacement, RegexOptions.IgnoreCase);
Not that if you're already replacing data inside HTML, it might not be a good idea to use Regex to replace just anything in the content, you might end up getting:
<<span style='color:red;'>script</span>>
if someone is searching for the term script.
To prevent that from happening, you could use the HTML Agility Pack in combination with Regex.
You might also want to check out this post which deals with a very similar issue.

Need to perform Wildcard (*,?, etc) search on a string using Regex

I need to perform Wildcard (*, ?, etc.) search on a string.
This is what I have done:
string input = "Message";
string pattern = "d*";
Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
if (regex.IsMatch(input))
{
MessageBox.Show("Found");
}
else
{
MessageBox.Show("Not Found");
}
With the above code "Found" block is hitting but actually it should not!
If my pattern is "e*" then only "Found" should hit.
My understanding or requirement is d* search should find the text containing "d" followed by any characters.
Should I change my pattern as "d.*" and "e.*"? Is there any support in .NET for Wild Card which internally does it while using Regex class?
From http://www.codeproject.com/KB/recipes/wildcardtoregex.aspx:
public static string WildcardToRegex(string pattern)
{
return "^" + Regex.Escape(pattern)
.Replace(#"\*", ".*")
.Replace(#"\?", ".")
+ "$";
}
So something like foo*.xls? will get transformed to ^foo.*\.xls.$.
You can do a simple wildcard mach without RegEx using a Visual Basic function called LikeString.
using Microsoft.VisualBasic;
using Microsoft.VisualBasic.CompilerServices;
if (Operators.LikeString("This is just a test", "*just*", CompareMethod.Text))
{
Console.WriteLine("This matched!");
}
If you use CompareMethod.Text it will compare case-insensitive. For case-sensitive comparison, you can use CompareMethod.Binary.
More info here: http://www.henrikbrinch.dk/Blog/2012/02/14/Wildcard-matching-in-C
MSDN: http://msdn.microsoft.com/en-us/library/microsoft.visualbasic.compilerservices.operators.likestring%28v=vs.100%29.ASPX
The correct regular expression formulation of the glob expression d* is ^d, which means match anything that starts with d.
string input = "Message";
string pattern = #"^d";
Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
(The # quoting is not necessary in this case, but good practice since many regexes use backslash escapes that need to be left alone, and it also indicates to the reader that this string is special).
Windows and *nux treat wildcards differently. *, ? and . are processed in a very complex way by Windows, one's presence or position would change another's meaning. While *nux keeps it simple, all it does is just one simple pattern match. Besides that, Windows matches ? for 0 or 1 chars, Linux matches it for exactly 1 chars.
I didn't find authoritative documents on this matter, here is just my conclusion based on days of tests on Windows 8/XP (command line, dir command to be specific, and the Directory.GetFiles method uses the same rules too) and Ubuntu Server 12.04.1 (ls command). I made tens of common and uncommon cases work, although there'are many failed cases too.
The current answer by Gabe, works like *nux. If you also want a Windows style one, and are willing to accept the imperfection, then here it is:
/// <summary>
/// <para>Tests if a file name matches the given wildcard pattern, uses the same rule as shell commands.</para>
/// </summary>
/// <param name="fileName">The file name to test, without folder.</param>
/// <param name="pattern">A wildcard pattern which can use char * to match any amount of characters; or char ? to match one character.</param>
/// <param name="unixStyle">If true, use the *nix style wildcard rules; otherwise use windows style rules.</param>
/// <returns>true if the file name matches the pattern, false otherwise.</returns>
public static bool MatchesWildcard(this string fileName, string pattern, bool unixStyle)
{
if (fileName == null)
throw new ArgumentNullException("fileName");
if (pattern == null)
throw new ArgumentNullException("pattern");
if (unixStyle)
return WildcardMatchesUnixStyle(pattern, fileName);
return WildcardMatchesWindowsStyle(fileName, pattern);
}
private static bool WildcardMatchesWindowsStyle(string fileName, string pattern)
{
var dotdot = pattern.IndexOf("..", StringComparison.Ordinal);
if (dotdot >= 0)
{
for (var i = dotdot; i < pattern.Length; i++)
if (pattern[i] != '.')
return false;
}
var normalized = Regex.Replace(pattern, #"\.+$", "");
var endsWithDot = normalized.Length != pattern.Length;
var endWeight = 0;
if (endsWithDot)
{
var lastNonWildcard = normalized.Length - 1;
for (; lastNonWildcard >= 0; lastNonWildcard--)
{
var c = normalized[lastNonWildcard];
if (c == '*')
endWeight += short.MaxValue;
else if (c == '?')
endWeight += 1;
else
break;
}
if (endWeight > 0)
normalized = normalized.Substring(0, lastNonWildcard + 1);
}
var endsWithWildcardDot = endWeight > 0;
var endsWithDotWildcardDot = endsWithWildcardDot && normalized.EndsWith(".");
if (endsWithDotWildcardDot)
normalized = normalized.Substring(0, normalized.Length - 1);
normalized = Regex.Replace(normalized, #"(?!^)(\.\*)+$", #".*");
var escaped = Regex.Escape(normalized);
string head, tail;
if (endsWithDotWildcardDot)
{
head = "^" + escaped;
tail = #"(\.[^.]{0," + endWeight + "})?$";
}
else if (endsWithWildcardDot)
{
head = "^" + escaped;
tail = "[^.]{0," + endWeight + "}$";
}
else
{
head = "^" + escaped;
tail = "$";
}
if (head.EndsWith(#"\.\*") && head.Length > 5)
{
head = head.Substring(0, head.Length - 4);
tail = #"(\..*)?" + tail;
}
var regex = head.Replace(#"\*", ".*").Replace(#"\?", "[^.]?") + tail;
return Regex.IsMatch(fileName, regex, RegexOptions.IgnoreCase);
}
private static bool WildcardMatchesUnixStyle(string pattern, string text)
{
var regex = "^" + Regex.Escape(pattern)
.Replace("\\*", ".*")
.Replace("\\?", ".")
+ "$";
return Regex.IsMatch(text, regex);
}
There's a funny thing, even the Windows API PathMatchSpec does not agree with FindFirstFile. Just try a1*., FindFirstFile says it matches a1, PathMatchSpec says not.
d* means that it should match zero or more "d" characters. So any string is a valid match. Try d+ instead!
In order to have support for wildcard patterns I would replace the wildcards with the RegEx equivalents. Like * becomes .* and ? becomes .?. Then your expression above becomes d.*
You need to convert your wildcard expression to a regular expression. For example:
private bool WildcardMatch(String s, String wildcard, bool case_sensitive)
{
// Replace the * with an .* and the ? with a dot. Put ^ at the
// beginning and a $ at the end
String pattern = "^" + Regex.Escape(wildcard).Replace(#"\*", ".*").Replace(#"\?", ".") + "$";
// Now, run the Regex as you already know
Regex regex;
if(case_sensitive)
regex = new Regex(pattern);
else
regex = new Regex(pattern, RegexOptions.IgnoreCase);
return(regex.IsMatch(s));
}
You must escape special Regex symbols in input wildcard pattern (for example pattern *.txt will equivalent to ^.*\.txt$)
So slashes, braces and many special symbols must be replaced with #"\" + s, where s - special Regex symbol.
I think #Dmitri has nice solution at
Matching strings with wildcard https://stackoverflow.com/a/30300521/1726296
Based on his solution, I have created two extension methods. (credit goes to him)
May be helpful.
public static String WildCardToRegular(this String value)
{
return "^" + Regex.Escape(value).Replace("\\?", ".").Replace("\\*", ".*") + "$";
}
public static bool WildCardMatch(this String value,string pattern,bool ignoreCase = true)
{
if (ignoreCase)
return Regex.IsMatch(value, WildCardToRegular(pattern), RegexOptions.IgnoreCase);
return Regex.IsMatch(value, WildCardToRegular(pattern));
}
Usage:
string pattern = "file.*";
var isMatched = "file.doc".WildCardMatch(pattern)
or
string xlsxFile = "file.xlsx"
var isMatched = xlsxFile.WildCardMatch(pattern)
All upper code is not correct to the end.
This is because when searching zz*foo* or zz* you will not get correct results.
And if you search "abcd*" in "abcd" in TotalCommander will he find a abcd file so all upper code is wrong.
Here is the correct code.
public string WildcardToRegex(string pattern)
{
string result= Regex.Escape(pattern).
Replace(#"\*", ".+?").
Replace(#"\?", ".");
if (result.EndsWith(".+?"))
{
result = result.Remove(result.Length - 3, 3);
result += ".*";
}
return result;
}
You may want to use WildcardPattern from System.Management.Automation assembly. See my answer here.
The most accepted answer works fine for most cases and can be used in most scenarios:
"^" + Regex.Escape(pattern).Replace(#"\*", ".*").Replace(#"\?", ".") + "$";
However if you allow escaping in you input wildcard pattern, e.g. "find \*", meaning you want to search for a string "find *" with asterisk, it won't work. The already escaped * will be escaped to "\\\\\\*" and after replacing we have "^value\\ with\\\\.*$", which is wrong.
The following code (which for sure can be optimized and rewritten) handles that special case:
public static string WildcardToRegex(string wildcard)
{
var sb = new StringBuilder();
for (var i = 0; i < wildcard.Length; i++)
{
// If wildcard has an escaped \* or \?, preserve it like it is in the Regex expression
var character = wildcard[i];
if (character == '\\' && i < wildcard.Length - 1)
{
if (wildcard[i + 1] == '*')
{
sb.Append("\\*");
i++;
continue;
}
if (wildcard[i + 1] == '?')
{
sb.Append("\\?");
i++;
continue;
}
}
switch (character)
{
// If it's unescaped * or ?, change it to Regex equivalents. Add more wildcard characters (like []) if you need to support them.
case '*':
sb.Append(".*");
break;
case '?':
sb.Append('.');
break;
default:
//// Escape all other symbols because wildcard could contain Regex special symbols like '.'
sb.Append(Regex.Escape(character.ToString()));
break;
}
}
return $"^{sb}$";
}
Solution for the problem just with Regex substitutions is proposed here https://stackoverflow.com/a/15275806/1105564

c# Regex question

I have a problem dealing with the # symbol in Regex, I am trying to remove #sometext
from a text string can't seem to find anywhere where it uses the # as a literal. I have tried myself but doesn't remove the word from the string. Any ideas?
public string removeAtSymbol(string input)
{
Regex findWords = new Regex(______);//Find the words like "#text"
Regex[] removeWords;
string test = input;
MatchCollection all = findWords.Matches(test);
removeWords = new Regex[all.Count];
int index = 0;
string[] values = new string[all.Count];
YesOutputBox.Text = " you got here";
foreach (Match m in all) //List all the words
{
values[index] = m.Value.Trim();
index++;
YesOutputBox.Text = YesOutputBox.Text + " " + m.Value;
}
for (int i = 0; i < removeWords.Length; i++)
{
removeWords[i] = new Regex(" " + values[i]);
// If the words appears more than one time
if (removeWords[i].Matches(test).Count > 1)
{
removeWords[i] = new Regex(" " + values[i] + " ");
test = removeWords[i].Replace(test, " "); //Remove the first word.
}
}
return test;
}
You can remove all occurences of "#sometext" from string test via the method
Regex.Replace(test, "#sometext", "")
or for any word starting with "#" you can use
Regex.Replace(test, "#\\w+", "")
If you need specifically a separate word (i.e. nothing like #comp within tom#comp.com) you may preceed the regex with a special word boundary (\b does not work here):
Regex.Replace(test, "(^|\\W)#\\w+", "")
You can use:
^\s#([A-Za-z0-9_]+)
as the regex to recognize Twitter usernames.
Regex to remove #something from this string: I want to remove #something from this string.
var regex = new Regex("#\\w*");
string result = regex.Replace(stringWithAt, "");
Is that what you are looking for?
I've had good luck applying this pattern:
\B#\w+
This will match any string starting with an # character that contains alphanumeric characters, plus some linking punctuation like the underscore character, if it does not occur on a boundary between alphanumeric and non-alphanumeric characters.
The result of executing this code:
string result = Regex.Replace(
#"#This1 #That2_thing this2#3that #the5Others #alpha#beta#gamma",
#"\B#\w+",
#"redacted");
is the following string:
redacted redacted this2#3that redacted redacted#beta#gamma
If this question is Twitter-specific, then Twitter provides an open source library that helps capture Twitter-specific entities like links, mentions and hashtags. This java file contains the code defining the regular expressions that Twitter uses, and this yml file contains test strings and expected outcomes of many unit tests that exercise the regular expressions in the Twitter library.
Twitter's mention-matching pattern (extracted from their library, modified to remove unnecessary capture groups, and edited to make sense in the context of a replacement) is shown below. The match should be performed in a case-insensitive manner.
(^|[^a-z0-9_])[#\uFF20][a-z0-9_]{1,20}
Here is an example which reproduces the results of the first replacement in my answer:
string result = Regex.Replace(
#"#This1 #That2_thing this2#3that #the5Others #alpha#beta#gamma",
#"(^|[^a-z0-9_])[#\uFF20][a-z0-9_]{1,20}",
#"$1redacted",
RegexOptions.IgnoreCase);
Note the need to include the substitution $1 since the first capture group can't be directly converted into an atomic zero-width assertion.

Categories

Resources