I need to highlight search terms in a block of text.
My initial thought was looping though the search terms. But is there an easier way?
Here is what I'm thinking using a loop...
public string HighlightText(string inputText)
{
string[] sessionPhrases = (string[])Session["KeywordPhrase"];
string description = inputText;
foreach (string field in sessionPhrases)
{
Regex expression = new Regex(field, RegexOptions.IgnoreCase);
description = expression.Replace(description,
new MatchEvaluator(ReplaceKeywords));
}
return description;
}
public string ReplaceKeywords(Match m)
{
return "<span style='color:red;'>" + m.Value + "</span>";
}
You could replace the loop with something like:
string[] phrases = ...
var re = String.Join("|", phrases.Select(s => Regex.Escape(s)).ToArray());
text = Regex.Replace(re, text, new MatchEvaluator(SomeFunction), RegexOptions.IgnoreCase);
Extending on Qtax's answer:
phrases = ...
// Use Regex.Escape to prevent ., (, * and other special characters to break the search
string re = String.Join("|", phrases.Select(s => Regex.Escape(s)).ToArray());
// Use \b (expression) \b to ensure you're only matching whole words, not partial words
re = #"\b(?:" +re + #")\b"
// use a simple replacement pattern instead of a MatchEvaluator
string replacement = "<span style='color:red;'>$0</span>";
text = Regex.Replace(re, text, replacement, RegexOptions.IgnoreCase);
Not that if you're already replacing data inside HTML, it might not be a good idea to use Regex to replace just anything in the content, you might end up getting:
<<span style='color:red;'>script</span>>
if someone is searching for the term script.
To prevent that from happening, you could use the HTML Agility Pack in combination with Regex.
You might also want to check out this post which deals with a very similar issue.
Related
I just want to replace a portion of a string only if matches the given text.
My use case is as follows:
var text = "<wd:response><wd:response-data></wd:response-data></wd:response >";
string result = text.Replace("wd:response", "response");
/*
* expecting the below text
<response><wd:response-data></wd:response-data></response>
*
*/
I followed the following answers:
Way to have String.Replace only hit "whole words"
Regular expression for exact match of a string
But I failed to achieve what I want.
Please share your thoughts/solutions.
Sample on
https://dotnetfiddle.net/pMkO8Q
In general, you should really be parsing and manipulating XML as XML, using functions that know how XML works and what's legal in the language. Regex and other naive text manipulation will often lead you into trouble.
That said, for a very simple solution to this specific problem, you can do this with two replaces:
var text = "<wd:response><wd:response-data></wd:response-data></wd:response >";
text.Replace("wd:response>", "response>").Replace("wd:response ", "response ")
(Note the spaces at the end of the parameters to the second replace.)
Alternatively use a regex similar to "wd:response\s*>"
The easiest way to achieve your result as per your .net fiddle is use the replace as below.
string result = text.Replace("wd:response>", "response>");
But proper way to achieve this is parsing using XML
You can capture the string wd-response in a capturing group and replace using Regex.Replace using the MatchEvaluator like this.
Regex explanation - <[/]?(wd:response)[\s+]?>
Match < literally
Match / optionally hence the ?
Match the string wd:response and place it in a capturing group enclosed with ()
Match one or more optional whitespace [\s+]?
Match > literally
public class Program
{
public static void Main(string[] args)
{
string text = "<wd:response><wd:response-data></wd:response-data></wd:response >";
string replacePattern = "response";
string pattern = #"<[/]?(wd:response)[\s+]?>";
string replacedPattern = Regex.Replace(text, pattern, match =>
{
// Extract the first group
Group group = match.Groups[1];
// Replace the group value with the replacePattern
return string.Format("{0}{1}{2}", match.Value.Substring(0, group.Index - match.Index), replacePattern, match.Value.Substring(group.Index - match.Index + group.Length));
});
Console.WriteLine(replacedPattern);
}
}
Outputting:
<response><wd:response-data></wd:response-data></response >
Truth is, I'm having a hard time writing a regex string to parse something in the form of
[[[tab name=dog content=cat|tab name=dog2 content=cat2]]]
This regex would be parsed so that I can dynamically build tabs as demonstrated here. Initially I tried a regex pattern like \[\[\[tab name=(?'name'.*?) content=(?'content'.*?)\]\]\]
But I realized I couldn't get the tab as a whole and build upon a query without doing a regex.replace. Is it possible to take the entire tab leading up to the pipe symbol as a group and then parse that group down from the sub key/value pairs?
This is the current regex string I'm working with \[\[\[(?'tab'tab name=(?'name'.*?) content=(?'content'.*?))\]\]\]
And here is my code for performing the regex. Any guidance would be appreciated.
public override string BeforeParse(string markupText)
{
if (CompiledRegex.IsMatch(markupText))
{
// Replaces the [[[code lang=sql|xxx]]]
// with the HTML tags (surrounded with {{{roadkillinternal}}.
// As the code is HTML encoded, it doesn't get butchered by the HTML cleaner.
MatchCollection matches = CompiledRegex.Matches(markupText);
foreach (Match match in matches)
{
string tabname = match.Groups["name"].Value;
string tabcontent = HttpUtility.HtmlEncode(match.Groups["content"].Value);
markupText = markupText.Replace(match.Groups["content"].Value, tabcontent);
markupText = Regex.Replace(markupText, RegexString, ReplacementPattern, CompiledRegex.Options);
}
}
return markupText;
}
Is this what you want?
string input = "[[[tab name=dog content=cat|tab name=dog2 content=cat2]]]";
Regex r = new Regex(#"tab name=([a-z0-9]+) content=([a-z0-9]+)(\||])");
foreach (Match m in r.Matches(input))
{
Console.WriteLine("{0} : {1}", m.Groups[1].Value, m.Groups[2].Value);
}
http://regexr.com/3boot
Maybe string.split will be better in that case? For example something like that :
strgin str = "[[[tab name=dog content=cat|tab name=dog2 content=cat2]]]";
foreach(var entry in str.Split('|')){
var eqBlocks = entry.Split('=');
var tabName = eqBlocks[1].TrimEnd(" content");
var content = eqBlocks[2];
}
Ugly code, but should work.
Try this:
Starts with a word boundary and followed only by allowed characters.
/\b[\w =]*/g
https://regex101.com/r/cI7jS7/1
Just distill the regex pattern down to the individual tab patterns such as name=??? content=??? and match that only. That pattern which will make each Match (two in you example) where the data can be extracted.
string text = #"[[[tab name=dog content=cat|tab name=dog2 content=cat2]]]";
string pattern = #"name=(?<Name>[^\s]+)\scontent=(?<Content>[^\s|\]]+)";
var result = Regex.Matches(text, pattern)
.OfType<Match>()
.Select(mt => new
{
Name = mt.Groups["Name"].Value,
Content = mt.Groups["Content"].Value,
});
The result is an enumerable list with the created dynamic entities with the tabs needed which can be directly bound to the control:
Note in the set notation [^\s|\]] the pipe | is treated as a literal in the set and not used as an or. The bracket ] does have to be escaped though to be treated as a literal. Finally the logic the parse will look for: "To not (^) be a space or a pipe or a brace for that set".
So I'm replacing all instances of a word in a string ignoring the case:
public static String ReplaceAll(String Input, String Word)
{
string Pattern = string.Format(#"\b{0}\b", Word);
Regex rgx = new Regex(Pattern, RegexOptions.IgnoreCase);
StringBuilder sb = new StringBuilder();
sb.Append(rgx.Replace(Input, string.Format("<span class='highlight'>{0}</span>", Word)));
return sb.ToString();
}
What I also need is the replace to keep the found words case, so if I'm looking for 'this' and RegEx finds 'This' it will replace the found word as 'This' and not 'this', I've done this before but it was a few years ago and in javascript, having a little trouble working it out again.
public static string ReplaceAll(string source, string word)
{
string pattern = #"\b" + Regex.Escape(word) + #"\b";
var rx = new Regex(pattern, RegexOptions.IgnoreCase);
return rx.Replace(source, "<span class='highlight'>$0</span>");
}
The following has pretty much what you're looking for using Regex. The only consideration is that it keeps the case for the first character, so if you have an upper case in the middle it doesn't look like it will keep that.
Replace Text While Keeping Case Intact In C Sharp
I have this code that reads a file and creates Regex groups. Then I walk through the groups and use other matches on keywords to extract what I need. I need the stuff between each keyword and the next space or newline. I am wondering if there is a way using the Regex keyword match itself to discard what I don't want (the keyword).
//create the pattern for the regex
String VSANMatchString = #"vsan\s(?<number>\d+)[:\s](?<info>.+)\n(\s+name:(?<name>.+)\s+state:(?<state>.+)\s+\n\s+interoperability mode:(?<mode>.+)\s\n\s+loadbalancing:(?<loadbal>.+)\s\n\s+operational state:(?<opstate>.+)\s\n)?";
//set up the patch
MatchCollection VSANInfoList = Regex.Matches(block, VSANMatchString);
// set up the keyword matches
Regex VSANNum = new Regex(#" \d* ");
Regex VSANName = new Regex(#"name:\S*");
Regex VSANState = new Regex(#"operational state\S*");
//now we can extract what we need since we know all the VSAN info will be matched to the correct VSAN
//match each keyword (name, state, etc), then split and extract the value
foreach (Match m in VSANInfoList)
{
string num=String.Empty;
string name=String.Empty;
string state=String.Empty;
string s = m.ToString();
if (VSANNum.IsMatch(s)) { num=VSANNum.Match(s).ToString().Trim(); }
if (VSANName.IsMatch(s))
{
string totrim = VSANName.Match(s).ToString().Trim();
string[] strsplit = Regex.Split (totrim, "name:");
name=strsplit[1].Trim();
}
if (VSANState.IsMatch(s))
{
string totrim = VSANState.Match(s).ToString().Trim();
string[] strsplit=Regex.Split (totrim, "state:");
state=strsplit[1].Trim();
}
It looks like your single regex should be able to gather all you need. Try this:
string name = m.Groups["name"].Value; // Or was it m.Captures["name"].Value?
I have a problem dealing with the # symbol in Regex, I am trying to remove #sometext
from a text string can't seem to find anywhere where it uses the # as a literal. I have tried myself but doesn't remove the word from the string. Any ideas?
public string removeAtSymbol(string input)
{
Regex findWords = new Regex(______);//Find the words like "#text"
Regex[] removeWords;
string test = input;
MatchCollection all = findWords.Matches(test);
removeWords = new Regex[all.Count];
int index = 0;
string[] values = new string[all.Count];
YesOutputBox.Text = " you got here";
foreach (Match m in all) //List all the words
{
values[index] = m.Value.Trim();
index++;
YesOutputBox.Text = YesOutputBox.Text + " " + m.Value;
}
for (int i = 0; i < removeWords.Length; i++)
{
removeWords[i] = new Regex(" " + values[i]);
// If the words appears more than one time
if (removeWords[i].Matches(test).Count > 1)
{
removeWords[i] = new Regex(" " + values[i] + " ");
test = removeWords[i].Replace(test, " "); //Remove the first word.
}
}
return test;
}
You can remove all occurences of "#sometext" from string test via the method
Regex.Replace(test, "#sometext", "")
or for any word starting with "#" you can use
Regex.Replace(test, "#\\w+", "")
If you need specifically a separate word (i.e. nothing like #comp within tom#comp.com) you may preceed the regex with a special word boundary (\b does not work here):
Regex.Replace(test, "(^|\\W)#\\w+", "")
You can use:
^\s#([A-Za-z0-9_]+)
as the regex to recognize Twitter usernames.
Regex to remove #something from this string: I want to remove #something from this string.
var regex = new Regex("#\\w*");
string result = regex.Replace(stringWithAt, "");
Is that what you are looking for?
I've had good luck applying this pattern:
\B#\w+
This will match any string starting with an # character that contains alphanumeric characters, plus some linking punctuation like the underscore character, if it does not occur on a boundary between alphanumeric and non-alphanumeric characters.
The result of executing this code:
string result = Regex.Replace(
#"#This1 #That2_thing this2#3that #the5Others #alpha#beta#gamma",
#"\B#\w+",
#"redacted");
is the following string:
redacted redacted this2#3that redacted redacted#beta#gamma
If this question is Twitter-specific, then Twitter provides an open source library that helps capture Twitter-specific entities like links, mentions and hashtags. This java file contains the code defining the regular expressions that Twitter uses, and this yml file contains test strings and expected outcomes of many unit tests that exercise the regular expressions in the Twitter library.
Twitter's mention-matching pattern (extracted from their library, modified to remove unnecessary capture groups, and edited to make sense in the context of a replacement) is shown below. The match should be performed in a case-insensitive manner.
(^|[^a-z0-9_])[#\uFF20][a-z0-9_]{1,20}
Here is an example which reproduces the results of the first replacement in my answer:
string result = Regex.Replace(
#"#This1 #That2_thing this2#3that #the5Others #alpha#beta#gamma",
#"(^|[^a-z0-9_])[#\uFF20][a-z0-9_]{1,20}",
#"$1redacted",
RegexOptions.IgnoreCase);
Note the need to include the substitution $1 since the first capture group can't be directly converted into an atomic zero-width assertion.