Regex to match alphanumeric except specific substring


Edit:
MANDATORY CONDITION:
The regex has to be inserted into the following statement:
Regex regex = new Regex("<REGEX_STRING>");
val = regex.Matches(val).Cast<Match>().Aggregate("", (s, e) => s + e.Value, s => s);
I found out that I can't use the Regex.Replace() method, as was suggested in the answer below.
I am looking for a regex that satisfies two conditions:
accept only a-z, A-Z, 0-9 and \s (one or more), and ignore _ (that's why \w is not an option)
[!] exclude any {sq} substring anywhere inside the string
*{sq} is literally this 4-character string, not shorthand for any ASCII character!
What I have so far is:
\b(?!sq)[a-zA-Z0-9 ]*
but this regex cuts everything off when _ shows up, and it also excludes, e.g., the whole [sq].
So, for example, for the given string
test[sq]uirrel{sq}_things I should get testsquirrelthings, but what I get is testuirrel.
Small input | expected output table below:
Input string | Expected output
Na#me | Name
M2a_ny | M2any
Vari{sq}o#us | Various
test [sq]uirrel h23ere! | test squirrel h23ere
I would really appreciate any help, it's the most complicated regex I have ever come across 🙄

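The problem is that .NET regex has no direct way to match "any text except a given multi-character sequence".
You will have to use a workaround like
((?:(?!{sq})[A-Za-z0-9\s])+)|{sq}
and you will need to get the Group 1 values. The first alternative consumes allowed characters one at a time, checking before each one that {sq} does not start at that position; the second alternative consumes {sq} itself without capturing it. Here is a C# demo: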
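using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

var texts = new List<string> { "Na#me","M2a_ny","Vari{sq}o#us","test [sq]uirrel h23ere!" };
var pattern = @"((?:(?!{sq})[A-Za-z0-9\s])+)|{sq}";
foreach (var text in texts)
{
    // Concatenate only the Group 1 captures; a bare {sq} match contributes an empty string.
    var result = Regex.Matches(text, pattern).Cast<Match>()
        .Aggregate("", (s, e) => s + e.Groups[1].Value, s => s);
    Console.WriteLine(result);
}
// => Name, M2any, Various, test squirrel h23ere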
A better, Regex.Replace based solution
You can remove {sq} and all characters other than letters, digits and whitespace using
Regex.Replace(text, @"{sq}|[^a-zA-Z0-9\s]", "")
Regex.Replace(text, @"{sq}|[^\p{L}\p{N}\s]", "")
The \p{L} / \p{N} version can be used to support any Unicode letters/digits.
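As a quick check of the Regex.Replace approach, here is a minimal sketch that applies the ASCII pattern to the sample inputs from the question. The class and method names (SqCleaner, Clean) are made up for illustration, and the braces are escaped only to make the literal {sq} explicit.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SqCleaner
{
    // Hypothetical helper (not from the original answer): removes the literal
    // "{sq}" and every character that is not an ASCII letter, an ASCII digit,
    // or whitespace.
    public static string Clean(string text)
    {
        return Regex.Replace(text, @"\{sq\}|[^a-zA-Z0-9\s]", "");
    }
}
```

Calling SqCleaner.Clean("Vari{sq}o#us") should give Various. The {sq} alternative is listed first so that the whole token is removed before the character class gets a chance to strip only its braces.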

Related

Use RegEx to uppercase and lowercase the string

I am trying to convert a string to uppercase and lowercase based on the index.
My string is a LanguageCode like cc-CC, where cc is the language code and CC is the country code. The user can enter it in any casing, like "cC-Cc". I am using a regular expression to check whether the data is in the format cc-CC.
var regex = new Regex("^[a-z]{2}-[A-Z]{2}$", RegexOptions.IgnoreCase);
// I could use the CultureInfos from the .NET Framework to check whether the code is valid.
// But the requirement is that invalid language codes are also allowed, as long as
// the entered code is in cc-CC format.
Now, when the user enters something like cC-Cc, I want to lowercase the first two characters and uppercase the last two.
I can split the string on - and then concatenate the parts.
var languageDetails = languageCode.Split('-');
var languageCodeUpdated = $"{languageDetails[0].ToLowerInvariant()}-{languageDetails[1].ToUpperInvariant()}";
I wondered whether I could avoid creating multiple strings and use the regex itself to uppercase and lowercase accordingly.
While searching, I found some solutions that use \L and \U, but I cannot use them because the C# compiler reports an error. Also, Regex.Replace() has a MatchEvaluator delegate parameter, which I don't understand.
Is there any way in C# to use a regex to replace uppercase with lowercase and vice versa?

.NET regex does not support case modifying operators.
You may use MatchEvaluator:
var result = Regex.Replace(s, @"(?i)^([a-z]{2})-([a-z]{2})$", m =>
    $"{m.Groups[1].Value.ToLower()}-{m.Groups[2].Value.ToUpper()}");
See the C# demo.
Details
(?i) - the inline version of the RegexOptions.IgnoreCase modifier
^ - start of the string
([a-z]{2}) - Capturing group #1: 2 ASCII letters
- - a hyphen
([a-z]{2}) - Capturing group #2: 2 ASCII letters
$ - end of string.
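The MatchEvaluator idea above can be wrapped in a small helper. This is only a sketch: the names LanguageCodes and Normalize are hypothetical, and it uses the invariant-culture case conversions from the question rather than ToLower()/ToUpper().

```csharp
using System;
using System.Text.RegularExpressions;

public static class LanguageCodes
{
    // Hypothetical helper: returns "cc-CC" for any casing of a
    // two-letter/two-letter code. Input that does not have that shape is
    // returned unchanged, because Regex.Replace finds nothing to replace.
    public static string Normalize(string code)
    {
        return Regex.Replace(
            code,
            @"^([a-z]{2})-([a-z]{2})$",
            // The lambda is the MatchEvaluator: it receives each Match and
            // returns the text that should replace it.
            m => m.Groups[1].Value.ToLowerInvariant() + "-" + m.Groups[2].Value.ToUpperInvariant(),
            RegexOptions.IgnoreCase);
    }
}
```

For example, LanguageCodes.Normalize("cC-Cc") should return cc-CC, while a string such as "english" passes through untouched.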

TLDR: This is Regex.Replace with \U and \L support.
private static string EnhancedReplace(string input, string pattern, string replacement, RegexOptions options)
{
    // Wrap every \U$n / \L$n (or ${name}) in the replacement string in a placeholder.
    replacement = Regex.Replace(replacement, @"(?<mode>\\[UL])(?<group>\$((\d+)|({[^}]+})))", @"<!<mode:${mode}>%&${group}&%>");
    // Run the actual replacement; the group values end up inside the placeholders.
    var output = Regex.Replace(input, pattern, replacement, options);
    // Resolve the placeholders by changing the case of the wrapped values.
    output = Regex.Replace(output, @"<!<mode:\\L>%&(?<value>[\w\W]*?)&%>", x => x.Groups["value"].Value.ToLower());
    output = Regex.Replace(output, @"<!<mode:\\U>%&(?<value>[\w\W]*?)&%>", x => x.Groups["value"].Value.ToUpper());
    return output;
}
How To Use
Call the function with \U followed by the group to be uppercased:
var result = EnhancedReplace(input, @"(public \w+ )(\w)", @"$1\U$2", RegexOptions.None);
Will replace this:
public string test12 { get; set; } = "test3";
With that:
public string Test12 { get; set; } = "test3";
Details
I'm currently working on an app which allows the user to define a batch of Regex Replace operations.
For example the user enters json and the batch converts it to a C#-Class.
Therefore, speed is no key requirement. But it would be very handy to be able to use \U and \L.
This method will apply Regex.Replace 3 times to the whole content and one time to the replacement string. Therefore it’s at least three times slower than Regex.Replace without \U \L support.
Step by Step
1. The first Regex.Replace enhances the replacement string. It replaces \U$1 with <!<mode:\U>%&$1&%> (this also works for named groups: ${groupName}).
2. The new replacement is applied to the content.
3. & 4. The inserted placeholder is now relatively unique. That allows you to search only for <!<mode:\U>%&Actual Value&%> and use the MatchEvaluator to replace it with its uppercase version. The same is done for \L.
Regex101 Demo:
Step 1: Enhance pattern with placeholder
https://regex101.com/r/ZtqigN/1
Step 2 Use new replacement pattern
https://regex101.com/r/PWLTFD/1
Step 3&4 Resolve new placeholders
https://regex101.com/r/5DIIUo/1
Answer
var result = EnhancedReplace(input, @"(cc)(-)(cc)", @"\L$1$2\U$3", RegexOptions.IgnoreCase);

C# Regex - starts with pattern1 not contain pattern2

The following input string contains all of these lines:
a1.aaa[SUBSCRIBED]
a1.bbb
a1.ccc
b1.ddd
d1.ddd[SUBSCRIBED]
I want to get the output:
bbb
ccc
which means: all the words that come after "a1." and do not contain the substring "[SUBSCRIBED]"

all the words that come after "a1." and do not contain the substring
"[SUBSCRIBED]"
Why regex? Following is crystal clear:
var result = strings
    .Where(s => s.StartsWith("a1.") && !s.Contains("[SUBSCRIBED]"))
    .Select(s => s.Substring(3));

Tim's answer makes sense. However, if you insist on a regex, I would venture that it would look like this:
^a1\.(.*)(?<!\[SUBSCRIBED\])$
where ^a1\. means the string starts with a1.,
(.*) captures any number of characters,
and the negative lookbehind (?<!\[SUBSCRIBED\])$ rejects text ending with [SUBSCRIBED].

You may use
^a1\.(?!.*\[SUBSCRIBED])(.*)
See the regex demo.
Details
^ - start of string
a1\. - a literal a1. substring
(?!.*\[SUBSCRIBED]) - a negative lookahead that fails the match if a [SUBSCRIBED] substring is present after any 0+ chars (other than newline if the RegexOptions.Singleline option is not used)
(.*) - Group 1: the rest of the line up to the end (if you use RegexOptions.Singleline option, . will match newlines as well).
C# code:
var result = string.Empty;
var m = Regex.Match(s, @"^a1\.(?!.*\[SUBSCRIBED])(.*)");
if (m.Success)
{
    result = m.Groups[1].Value;
}
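The snippet above handles a single line. If the whole input from the question arrives as one multi-line string, a sketch along these lines may help; it assumes RegexOptions.Multiline so that ^ and $ anchor at each line, and the names SubscriptionFilter and Unsubscribed are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SubscriptionFilter
{
    // Hypothetical helper: returns the part after "a1." for every line that
    // starts with "a1." and does not contain "[SUBSCRIBED]".
    public static List<string> Unsubscribed(string input)
    {
        var results = new List<string>();
        var pattern = @"^a1\.(?!.*\[SUBSCRIBED\])(.*)$";
        foreach (Match m in Regex.Matches(input, pattern, RegexOptions.Multiline))
        {
            // With Windows line endings, "." also matches the '\r' before '\n',
            // so strip it from the captured value.
            results.Add(m.Groups[1].Value.TrimEnd('\r'));
        }
        return results;
    }
}
```

For the five sample lines joined with newlines, this should yield bbb and ccc.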

Regex & C#: Replace all Special Characters except Emojis

I need to replace all special characters in a string except the following (which includes alphabetic characters):
:)
:P
;)
:D
:(
This is what I have now:
string input = "Hi there!!! :)";
string output = Regex.Replace(input, "[^0-9a-zA-Z]+", "");
This replaces all special characters. How can I modify this to not replace mentioned characters (emojis) but replace any other special character?

You may use a known technique: match and capture what you need and match only what you want to remove, and replace with the backreference to Group 1:
(:(?:[D()P])|;\))|[^0-9a-zA-Z\s]
Replace with $1. Note I added \s to the character class, but in case you do not need spaces, remove it.
See the regex demo
Pattern explanation:
(:(?:[D()P])|;\)) - Group 1 (what we need to keep):
:(?:[D()P]) - a : followed with either D, (, ) or P
| - or
;\) - a ;) substring
(here, you may extend the capture group with more |-separated branches).
| - or ...
[^0-9a-zA-Z\s] - match any char other than ASCII digits, letters (and whitespace, but as I mentioned, you may remove \s if you do not need to keep spaces).
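In C#, the capture-and-keep technique described above could be applied roughly as follows. The pattern is the one from this answer; the class and method names (EmojiFilter, StripSpecials) are assumptions made for the sketch.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmojiFilter
{
    // Hypothetical helper: removes special characters but keeps the
    // emoticons :) :P ;) :D :( as well as letters, digits and whitespace.
    public static string StripSpecials(string input)
    {
        // Group 1 holds an emoticon when one is matched and is empty when
        // the second alternative matched, so "$1" keeps emoticons and
        // deletes everything else that the pattern consumed.
        return Regex.Replace(input, @"(:(?:[D()P])|;\))|[^0-9a-zA-Z\s]", "$1");
    }
}
```

With the question's input, EmojiFilter.StripSpecials("Hi there!!! :)") should return "Hi there :)".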

I would use a regex to match all emojis and select them out of the text:
string input = "Hi there!!! :)";
string output = string.Concat(Regex.Matches(input, "[;:][DP)(]+").Cast<Match>().Select(x => x.Value));
Pattern [;:][DP)(]+
[;:] starts with : or ;
[DP)(] ends with D, P, ) or (
+ one or more
Inside a character class the alternatives are simply listed one after another; a | placed there would be matched as a literal character.

Regex for retaining numbers in the replacement of group containing numbers

Regarding the possible dupe post: Replace only some groups with Regex
This is not a dupe as the post replaces the group with static text, what I want is to replace the group by retaining the text in the group.
I have some texts which contain pattern like:
\super 1 \nosupersub
\super 2 \nosupersub
...
\super 592 \nosupersub
I want to replace them using regex such that they become:
<sup>1</sup>
<sup>2</sup>
...
<sup>592</sup>
So, I am using the following regex (note the group (\d+)):
RegexOptions options = RegexOptions.Multiline; //as of v1.3.1.0 default is multiline
mytext = Regex.Replace(mytext, @"\s?\\super\s?(\d+)\s?\\nosupersub\s", @"<sup>\1</sup>", options);
However, instead of getting what I want, I got all the results replaced with <sup>\1</sup>:
<sup>\1</sup>
<sup>\1</sup>
...
<sup>\1</sup>
If I try the regex replacement using a text editor like https://www.sublimetext.com and also using Python, it is OK.
How to get such group replacement of (\d+) like that (retain the number) in C#?

Many regex tools use the \1 notation to refer to a group's value in the replacement pattern (same in syntax to a backreference). For whatever reason, Microsoft chose to instead use $1 for the notation in the .NET implementation of regex. Note that backreferences still use the \1 syntax in .NET. It's only the syntax in the replacement pattern which is different. See the Substitutions section of this page for more info.
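A minimal sketch of the $1 substitution follows. The pattern is slightly simplified compared to the one in the question (it does not consume the surrounding whitespace, so line breaks are preserved), and the names SuperscriptConverter and ToSup are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SuperscriptConverter
{
    // Hypothetical helper: turns "\super 592 \nosupersub" into "<sup>592</sup>".
    public static string ToSup(string text)
    {
        // In the replacement string, $1 refers to the first capturing group.
        return Regex.Replace(text, @"\\super\s*(\d+)\s*\\nosupersub", "<sup>$1</sup>");
    }
}
```

If a digit has to follow the group reference directly, writing ${1} instead of $1 avoids ambiguity about where the group number ends.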

I haven't tested this code and wrote it from memory so this might not work but the general idea is there.
Why use regex at all?
List<string> output = new List<string>();
foreach (string line in myText.Split(new string[] { Environment.NewLine }, StringSplitOptions.None))
{
    // Verbatim strings (@"...") keep the backslashes literal.
    string alteredLine = line.Replace(@"\super", "").Replace(@"\nosupersub", "").Trim();
    int n;
    if (Int32.TryParse(alteredLine, out n))
    {
        output.Add("<sup>" + n + "</sup>");
    }
    else
    {
        //Add the original input in case it failed?
        output.Add(line);
    }
}
or for a LINQ version:
myText = string.Join(Environment.NewLine,
    myText.Split(new string[] { Environment.NewLine }, StringSplitOptions.None)
        .Select(l => "<sup>" + l.Replace(@"\super", "").Replace(@"\nosupersub", "").Trim() + "</sup>"));

Regex to remove all (non numeric OR period)

I need for text like "joe ($3,004.50)" to be filtered down to 3004.50 but am terrible at regex and can't find a suitable solution. So only numbers and periods should stay - everything else filtered. I use C# and VS.net 2008 framework 3.5

This should do it:
string s = "joe ($3,004.50)";
s = Regex.Replace(s, "[^0-9.]", "");

The regex is:
[^0-9.]
You can cache the regex:
Regex not_num_period = new Regex("[^0-9.]");
then use:
string result = not_num_period.Replace("joe ($3,004.50)", "");
However, you should keep in mind that some cultures have different conventions for writing monetary amounts, such as: 3.004,50.

You are dealing with a string - string is an IEnumerable<char>, so you can use LINQ:
var input = "joe ($3,004.50)";
var result = String.Join("", input.Where(c => Char.IsDigit(c) || c == '.'));
Console.WriteLine(result); // 3004.50

For the accepted answer, MatthewGunn raises a valid point in that all digits, commas, and periods in the entire string will be condensed together. This will avoid that:
string s = "joe.smith ($3,004.50)";
Regex r = new Regex(@"(?:^|[^\w.,])(\d[\d,.]+)(?=\W|$)");
Match m = r.Match(s);
string v = null;
if (m.Success) {
    v = m.Groups[1].Value;
    v = Regex.Replace(v, ",", "");
}

The approach of removing offending characters is potentially problematic. What if there's another . in the string somewhere? It won't be removed, though it should!
Removing non-digits or periods, the string joe.smith ($3,004.50) would transform into the unparseable .3004.50.
Imho, it is better to match a specific pattern, and extract it using a group. Something simple would be to find all contiguous commas, digits, and periods with regexp:
[\d,\.]+
Sample test run:
Pattern understood as:
[\d,\.]+
Enter string to check if matches pattern
> a2.3 fjdfadfj34 34j3424 2,300 adsfa
Group 0 match: "2.3"
Group 0 match: "34"
Group 0 match: "34"
Group 0 match: "3424"
Group 0 match: "2,300"
Then, for each match, remove all commas and send the result to the parser. To handle a case like 12.323.344, you could add another check that a matching substring contains at most one period.
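The match-then-parse steps described above might look like this in C#. It is a sketch under a few assumptions: amounts use a period as the decimal separator and commas as group separators, parsing is done with the invariant culture, and the names AmountExtractor and Extract are invented here.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text.RegularExpressions;

public static class AmountExtractor
{
    // Hypothetical helper: finds runs of digits, commas and periods and
    // returns those that parse as a decimal number.
    public static List<decimal> Extract(string input)
    {
        var amounts = new List<decimal>();
        foreach (Match m in Regex.Matches(input, @"[\d,.]+"))
        {
            // Drop the group separators before parsing.
            string candidate = m.Value.Replace(",", "");

            // Skip candidates such as 12.323.344 that contain more than one period.
            if (candidate.Count(c => c == '.') > 1) continue;

            decimal value;
            // TryParse also rejects leftovers such as a lone "." from "joe.smith".
            if (decimal.TryParse(candidate, NumberStyles.AllowDecimalPoint,
                                 CultureInfo.InvariantCulture, out value))
            {
                amounts.Add(value);
            }
        }
        return amounts;
    }
}
```

For "joe.smith ($3,004.50)" this should return a single value, 3004.50, because the period in joe.smith does not parse as a number on its own.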
