Matching words with a forward-slash via Regular Expression - c#

I am trying to match words that start with a forward slash in C#.
For example, I want to match /exit. I tried the regex \b(/exit)\b, but for some reason it doesn't match.
Here's the sample code I am trying out:
static void Main(string[] args)
{
    var commands = new List<string>();
    commands.Add("/exit");
    var listOfString = commands.Select(Regex.Escape).ToList();
    var joinTheWords = string.Join("|", listOfString);
    var regexPattern = $@"\b({joinTheWords})\b";
    var theRegex = new Regex(regexPattern, RegexOptions.IgnoreCase);
    Console.WriteLine(theRegex);
    Console.WriteLine(theRegex.Match(@"/exit").Success);
    Console.WriteLine("Press any key to exit.");
    Console.ReadLine();
}

At the beginning of the string "/exit" there is no word boundary (\b), because "/" isn't a letter, number, or underscore and nothing precedes it. (There is a word boundary just after the "/", between "/" and "e".)
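A minimal sketch of that behavior, assuming a plain console program. The names WordBoundaryDemo and MatchesWithBoundary are made up for illustration. It shows that \b/exit\b only succeeds when a word character happens to sit directly before the slash:

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordBoundaryDemo
{
    // Hypothetical helper: true if \b/exit\b matches anywhere in the input.
    public static bool MatchesWithBoundary(string input)
    {
        return Regex.IsMatch(input, @"\b/exit\b");
    }

    public static void Main()
    {
        // No word character before "/", so there is no boundary for \b to match.
        Console.WriteLine(MatchesWithBoundary("/exit"));      // False
        Console.WriteLine(MatchesWithBoundary("run /exit"));  // False
        // "a" followed by "/" is a word/non-word transition, so \b succeeds.
        Console.WriteLine(MatchesWithBoundary("a/exit"));     // True
    }
}
```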
You could roll your own "smart word boundary" that treats these forward slashes as valid "word" characters:
(?:((?<!/)\B(?=/))|\b(?=\w))
In English, this means that you must have either "a NON-word-boundary that is followed by a slash and not preceded by a slash", (?<!/)\B(?=/), OR "a regular word boundary, provided a word character follows it", \b(?=\w). By pairing \B with "/", we get "pseudo word boundary" behavior:
var commands = new List<string>();
commands.Add("/exit");
List<String> listOfString = commands.Select(Regex.Escape).ToList();
String joinTheWords = string.Join("|", listOfString);
var regexPattern = $@"(?:(?:(?<!/)\B)(?=/)|\b(?=\w))({joinTheWords})\b";
var theRegex = new Regex(regexPattern, RegexOptions.IgnoreCase);
Console.WriteLine(theRegex);
Console.WriteLine(theRegex.Match("/exit").Success);
Console.WriteLine("Press any key to exit.");
Console.ReadLine();
There may be (and probably are) simpler ways to approach this, especially if you can "preprocess" the list of pattern fragments first: replace the special characters with static tokens, match with regular \b's, then replace them back.

Since you already know that every word starts with /, you can factor it out of your command list.
Change commands.Add("/exit"); to commands.Add("exit");
Then proceed as usual, escaping metacharacters and joining.
Since you only care that the / is not preceded by another /, all that's needed at the beginning is (?<!/)/.
For the end, I'd use a conditional word boundary, (?(?<=\w)\b), which requires \b only when the last matched character is a word character.
That's all you really need.
Putting it all together, the regex line would be:
var regexPattern = $@"(?<!/)(/(?:{joinTheWords}))(?(?<=\w)\b)";
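For a quick check, here is a sketch that assembles this pattern and tries it on a few inputs. The class and method names (CommandRegexDemo, BuildCommandRegex) and the extra "help" command are assumptions added for illustration only:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class CommandRegexDemo
{
    // Hypothetical helper: builds the pattern from command names
    // given WITHOUT their leading slash (e.g. "exit", "help").
    public static Regex BuildCommandRegex(IEnumerable<string> commandNames)
    {
        string joinTheWords = string.Join("|", commandNames.Select(Regex.Escape));
        string regexPattern = $@"(?<!/)(/(?:{joinTheWords}))(?(?<=\w)\b)";
        return new Regex(regexPattern, RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        Regex theRegex = BuildCommandRegex(new[] { "exit", "help" });
        Console.WriteLine(theRegex.IsMatch("/exit"));     // True
        Console.WriteLine(theRegex.IsMatch("//exit"));    // False: preceded by another slash
        Console.WriteLine(theRegex.IsMatch("/exiting"));  // False: no word boundary after "exit"
        Console.WriteLine(theRegex.IsMatch("and/HELP"));  // True: only a preceding slash is rejected
    }
}
```

Note that, unlike the \b-based attempts, this pattern accepts a command glued to a preceding word ("and/HELP"). Whether that is desirable depends on the input you expect.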

A not-so-clean (but simple) way to find words with forward slashes is to replace each forward slash with a placeholder made of word characters that never occurs in the input, and use that placeholder in your regex search:
var str = "this is a search string with /exit and/exit";
var key = "/exit";
var value = "/EXIT";
str = str.replace(/\//g, "_a_a_"); // hide every slash behind the placeholder
var k = key.replace(/\//g, "_a_a_");
var regex = new RegExp('\\b' + k + '\\b', "g");
str = str.replace(regex, value);
str = str.replace(/_a_a_/g, "/"); // restore ALL remaining placeholders, not just the first
console.log(str);

Related

Why does this Regular Expression match nothing?

I want to replace all instances of all consecutive non-lowercase-alphabet-letters with a single space for each instance. This works, but why does it inject spaces in between the alphabet letters?
const string pattern = @"[^a-z]*";
const string replacement = @" ";
var reg = new Regex(pattern);
string a = "the --fat- cat";
string b = reg.Replace(a, replacement); // b = " t h e f a t c a t " should be "the fat cat"
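Because of the * quantifier, which repeats the previous token zero or more times. The pattern can therefore match the empty string, so it finds a match at every position between letters, and each of those empty matches is replaced with a space. Use + (one or more) instead:
const string pattern = @"[^a-z]+";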
You don't need regex if you simply want to remove non-lowercase letters:
string a = "the --fat- cat";
string res = String.Join("", a.Where(c => Char.IsLower(c) || Char.IsWhiteSpace(c)));
Console.WriteLine(res); // the fat cat
Just a follow-up answer that might turn out useful: if you need to match any character other than a Unicode lowercase letter, you may use
var res = Regex.Replace(str, @"\P{Ll}+", " ");
// "моя НЕ знает" => "моя знает"
The \P{Ll} construct matches any character except lowercase letters from the whole Unicode table. The + quantifier matches one or more occurrences and so does not cause the issue in the OP.
To illustrate the problem caused by [^a-z]*: Regex.Replace also finds empty-string matches between the letters and at the end of the string, and each one is replaced with a space.
A rule of thumb: avoid unanchored patterns that may match empty strings!
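One way to see the difference is to count the empty matches each pattern produces. This is only a sketch. EmptyMatchDemo and CountEmptyMatches are illustrative names, not part of any answer above:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class EmptyMatchDemo
{
    // Hypothetical helper: counts how many matches of the pattern are empty strings.
    public static int CountEmptyMatches(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Cast<Match>().Count(m => m.Length == 0);
    }

    public static void Main()
    {
        const string a = "the --fat- cat";
        // With *, the pattern also succeeds where nothing is consumed.
        Console.WriteLine(CountEmptyMatches(a, @"[^a-z]*") > 0);  // True
        // With +, every match consumes at least one character.
        Console.WriteLine(CountEmptyMatches(a, @"[^a-z]+"));      // 0
        Console.WriteLine(Regex.Replace(a, @"[^a-z]+", " "));     // the fat cat
    }
}
```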

Regex Matchcollection groups

I have spent two days trying to solve this problem. I have a MatchCollection, and the pattern contains a group. I want a list with the values of that group (there are two or more matches).
string input = "<tr><td>Mi, 09.09.15</td><td>1</td><td>PK</td><td>E</td><td>123</td><td></td></tr><tr><td>Mi, 09.09.15</td><td>2</td><td>ER</td><td>ER</td><td>234</td><td></td></tr>";
string Patter2 = "^<tr>$?<td>$?[D-M][i-r],[' '][0-3][1-9].[0-1][1-9].[0-9][0-9]$?</td>$?<td>$?([1-9][0-2]?)$?</td>$?";
Regex r2 = new Regex(Patter2);
MatchCollection mc2 = r2.Matches(input);
foreach (Match match in mc2)
{
    GroupCollection groups = match.Groups;
    string s = groups[1].Value;
    Datum2.Text = s;
}
But only the last match (2) appears in the TextBox "Datum2".
I know that I have to use e.g. a listbox, but the Groups[1].Value is a string...
Thanks for your help and time.
Dieter
The first thing to correct in the code is the line Datum2.Text = s; because it overwrites the text in Datum2 on every iteration, so only the last match survives.
Now, about your regex:
^ forces a match at the beginning of the input, so there is really only 1 match. If you remove it, the pattern matches twice.
I can't tell what was intended with $? all over the pattern (just take them out).
[' '] matches "either a quote, a space, or a quote" (there is no need to repeat characters in a character class).
All dots in [0-3][1-9].[0-1][1-9].[0-9][0-9] need to be escaped. Otherwise a dot matches any character.
[0-1][1-9] matches all months except "10". The second character should be [0-9] (or \d). The same applies to the day, where [0-3][1-9] misses "10", "20" and "30".
Code:
string input = "<tr><td>Mi, 09.09.15</td><td>1</td><td>PK</td><td>E</td><td>123</td><td></td></tr><tr><td>Mi, 09.09.15</td><td>2</td><td>ER</td><td>ER</td><td>234</td><td></td></tr>";
string Patter2 = "<tr><td>[D-M][i-r],[' ][0-3][0-9]\\.[0-1][0-9]\\.[0-9][0-9]</td><td>([1-9][0-2]?)</td>";
Regex r2 = new Regex(Patter2);
MatchCollection mc2 = r2.Matches(input);
string s= "";
foreach (Match match in mc2)
{
    GroupCollection groups = match.Groups;
    s = s + " " + groups[1].Value;
}
Datum2.Text = s;
Output:
1 2
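Since the question asks for a list of the group values, here is a sketch that collects them into a List<string> instead of concatenating. The pattern is the corrected one from this answer. The names GroupListDemo and ExtractGroupValues, and the shortened input, are assumptions for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class GroupListDemo
{
    // Hypothetical helper: returns the value of capture group 1 for every match.
    public static List<string> ExtractGroupValues(string input)
    {
        const string pattern =
            @"<tr><td>[D-M][i-r],[' ][0-3][0-9]\.[0-1][0-9]\.[0-9][0-9]</td><td>([1-9][0-2]?)</td>";
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Groups[1].Value)
                    .ToList();
    }

    public static void Main()
    {
        // Shortened version of the input from the question.
        string input = "<tr><td>Mi, 09.09.15</td><td>1</td><td>PK</td></tr>"
                     + "<tr><td>Mi, 09.09.15</td><td>2</td><td>ER</td></tr>";
        List<string> values = ExtractGroupValues(input);
        Console.WriteLine(string.Join(", ", values));  // 1, 2
    }
}
```

The resulting list can then be bound to a list control, or joined into a single string as shown in Main.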
You should know that regex is not the right tool for parsing HTML. It'll work for simple cases, but for real cases do consider using HTML Agility Pack.

How can I cut out the below pattern from a string using Regex?

I have a string which will have the word "TAG" followed by an integer, an underscore and another word.
Eg: "TAG123_Sample"
I need to cut the "TAGXXX_" pattern and get only the word Sample. That means I have to cut the word "TAG", the integer that follows it, and the underscore.
I wrote the following code but it doesn't work. What have I done wrong? How can I do this? Please advise.
static void Main(string[] args)
{
    String sentence = "TAG123_Sample";
    String pattern = @"TAG[^\d]_";
    String replacement = "";
    Regex r = new Regex(pattern);
    String res = r.Replace(sentence, replacement);
    Console.WriteLine(res);
    Console.ReadLine();
}
Your character class [^\d] is negated, so it matches a single character that is NOT a digit. You need to modify the regex as follows:
String s = "TAG123_Sample";
String r = Regex.Replace(s, @"TAG\d+_", "");
Console.WriteLine(r); //=> "Sample"
Explanation:
TAG    match 'TAG'
\d+    digits (0-9) (1 or more times)
_      '_'
You can use String.Split for this:
string[] s = "TAG123_Sample".Split('_');
Console.WriteLine(s[1]);
https://msdn.microsoft.com/en-us/library/b873y76a.aspx
Try this; it will work in this case for sure:
resultString = Regex.Replace(sentence,
    @"^      # Match start of string
    [^_]*    # Match 0 or more characters except underscore
    _        # Match the underscore", "", RegexOptions.IgnorePatternWhitespace);
No regex is necessary if your string contains 1 underscore and you need to get a substring after it.
Here is a Substring+IndexOf-based approach:
var res = sentence.Substring(sentence.IndexOf('_') + 1); // => Sample

c# Regex question

I have a problem dealing with the @ symbol in Regex. I am trying to remove @sometext
from a text string, but I can't find an example anywhere that uses @ as a literal. I have tried myself, but my code doesn't remove the word from the string. Any ideas?
public string removeAtSymbol(string input)
{
    Regex findWords = new Regex(______);//Find the words like "@text"
    Regex[] removeWords;
    string test = input;
    MatchCollection all = findWords.Matches(test);
    removeWords = new Regex[all.Count];
    int index = 0;
    string[] values = new string[all.Count];
    YesOutputBox.Text = " you got here";
    foreach (Match m in all) //List all the words
    {
        values[index] = m.Value.Trim();
        index++;
        YesOutputBox.Text = YesOutputBox.Text + " " + m.Value;
    }
    for (int i = 0; i < removeWords.Length; i++)
    {
        removeWords[i] = new Regex(" " + values[i]);
        // If the words appears more than one time
        if (removeWords[i].Matches(test).Count > 1)
        {
            removeWords[i] = new Regex(" " + values[i] + " ");
            test = removeWords[i].Replace(test, " "); //Remove the first word.
        }
    }
    return test;
}
You can remove all occurrences of "@sometext" from the string test via the method
Regex.Replace(test, "@sometext", "")
or for any word starting with "@" you can use
Regex.Replace(test, "@\\w+", "")
If you specifically need a separate word (i.e. nothing like @comp within tom@comp.com), you may precede the regex with a special word boundary (\b does not work here):
Regex.Replace(test, "(^|\\W)@\\w+", "")
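A small sketch comparing the last two patterns on a string that contains both an e-mail address and a standalone @word. The sample text and the names AtWordDemo, RemoveAnyAtWord and RemoveStandaloneAtWord are made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class AtWordDemo
{
    // Removes every "@word", even inside an e-mail address.
    public static string RemoveAnyAtWord(string text)
    {
        return Regex.Replace(text, @"@\w+", "");
    }

    // Removes "@word" only at the start of the string or after a non-word character.
    // The non-word character matched by \W is removed together with the word.
    public static string RemoveStandaloneAtWord(string text)
    {
        return Regex.Replace(text, @"(^|\W)@\w+", "");
    }

    public static void Main()
    {
        const string test = "mail tom@comp.com or ping @bob now";
        Console.WriteLine(RemoveAnyAtWord(test));         // mail tom.com or ping  now
        Console.WriteLine(RemoveStandaloneAtWord(test));  // mail tom@comp.com or ping now
    }
}
```

Because the second pattern consumes the character in front of the @, it also swallows the preceding space. If that character must be kept, it has to be put back in the replacement string (for example with $1).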
You can use:
^\s@([A-Za-z0-9_]+)
as the regex to recognize Twitter usernames.
Regex to remove @something from this string: I want to remove @something from this string.
var regex = new Regex("@\\w*");
string result = regex.Replace(stringWithAt, "");
Is that what you are looking for?
I've had good luck applying this pattern:
\B@\w+
This matches an @ character followed by one or more word characters (alphanumeric characters plus some linking punctuation such as the underscore), provided the @ does not sit on a boundary between a word character and a non-word character.
The result of executing this code:
string result = Regex.Replace(
    @"@This1 @That2_thing this2@3that @the5Others @alpha@beta@gamma",
    @"\B@\w+",
    @"redacted");
is the following string:
redacted redacted this2@3that redacted redacted@beta@gamma
If this question is Twitter-specific, then Twitter provides an open source library that helps capture Twitter-specific entities like links, mentions and hashtags. This java file contains the code defining the regular expressions that Twitter uses, and this yml file contains test strings and expected outcomes of many unit tests that exercise the regular expressions in the Twitter library.
Twitter's mention-matching pattern (extracted from their library, modified to remove unnecessary capture groups, and edited to make sense in the context of a replacement) is shown below. The match should be performed in a case-insensitive manner.
(^|[^a-z0-9_])[@\uFF20][a-z0-9_]{1,20}
Here is an example which reproduces the results of the first replacement in my answer:
string result = Regex.Replace(
    @"@This1 @That2_thing this2@3that @the5Others @alpha@beta@gamma",
    @"(^|[^a-z0-9_])[@\uFF20][a-z0-9_]{1,20}",
    @"$1redacted",
    RegexOptions.IgnoreCase);
Note the need to include the substitution $1 since the first capture group can't be directly converted into an atomic zero-width assertion.

string manipulation in regex

I have a problem with string manipulation.
Here is the code:
string str = "LDAP://company.com/OU=MyOU1 Control,DC=MyCompany,DC=com";
Regex regex = new Regex("OU=\\w+");
var result = regex.Matches(str);
var strList = new List<string>();
foreach (var item in result)
{
    strList.Add(item.ToString().Remove(0,3));
}
Console.WriteLine(string.Join("/",strList));
The result I am getting is "MyOU1" instead of "MyOU1 Control".
Please help, thanks.
If you want the space character to be matched as well, you need to include it in your regex. \w only matches word characters, which do not include spaces.
Regex regex = new Regex(@"OU=[\w\s]+");
This matches word characters (\w) and whitespace characters (\s).
(The @ in front of the string is just for convenience: if you use it, you don't need to escape backslashes.)
Either add space to the allowed list (\w doesn't allow space) or use the knowledge that comma can be used as a separator.
Regex regex = new Regex("OU=(\\w|\\s)+");
OR
Regex regex = new Regex("OU=[^,]+");
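As a sketch of the comma-separator variant, the version below captures the name in a group instead of trimming the "OU=" prefix with Remove(0,3). The capture group and the names OuDemo and ExtractOuNames are assumptions added for illustration, not part of the answers above:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class OuDemo
{
    // Hypothetical helper: returns the text after each "OU=" up to the next comma.
    public static List<string> ExtractOuNames(string ldapPath)
    {
        return Regex.Matches(ldapPath, @"OU=([^,]+)")
                    .Cast<Match>()
                    .Select(m => m.Groups[1].Value)
                    .ToList();
    }

    public static void Main()
    {
        string str = "LDAP://company.com/OU=MyOU1 Control,DC=MyCompany,DC=com";
        Console.WriteLine(string.Join("/", ExtractOuNames(str)));  // MyOU1 Control
    }
}
```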
