Regex to skip certain characters - C#

I want to write a regex that rejects strings containing characters such as < and >.
To represent this I came across [^<>] and tried using it in a console application, but it does not work.
[^<>]
string value = "shubh<";
string regEx = "[^<>]";
Regex rx = new Regex(regEx);
if (rx.IsMatch(value))
{
Console.WriteLine("Pass");
}
else { Console.WriteLine("Fail"); }
Console.ReadLine();
The string 'shubh<' should fail, but it passes the match and I am not sure why. Am I doing something wrong?

From Regex.IsMatch Method (String):
Indicates whether the regular expression specified in the Regex constructor finds a match in a specified input string.
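[^<>] matches any single character other than < or >, and such a character is found in shubh< (the s, the h, etc.), so IsMatch returns true.
You need to use the ^ and $ anchors so that the whole string must consist of allowed characters: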
Regex rx = new Regex("^[^<>]*$");
if (rx.IsMatch(value)) {
Console.WriteLine("Pass");
} else {
Console.WriteLine("Fail");
}
Another solution is to check if < or > is contained:
Regex rx = new Regex("[<>]");
if (rx.IsMatch(value)) {
Console.WriteLine("Fail");
} else {
Console.WriteLine("Pass");
}
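The difference between the unanchored and the anchored pattern can be seen side by side. This is only a minimal sketch: the class and method names (AngleBracketCheck, MatchesSomewhere, MatchesWholeString, HasNoAngleBrackets) are made up for illustration, and the last method is a regex-free alternative that is not part of the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AngleBracketCheck
{
    // Unanchored: succeeds as soon as ANY single character is not < or >.
    private static readonly Regex Unanchored = new Regex("[^<>]");

    // Anchored: succeeds only if EVERY character in the string is not < or >.
    private static readonly Regex Anchored = new Regex("^[^<>]*$");

    public static bool MatchesSomewhere(string value) => Unanchored.IsMatch(value);

    public static bool MatchesWholeString(string value) => Anchored.IsMatch(value);

    // Assumption: a plain character search is acceptable instead of a regex.
    public static bool HasNoAngleBrackets(string value) =>
        value.IndexOfAny(new[] { '<', '>' }) < 0;
}
```

For "shubh<", MatchesSomewhere returns true, while MatchesWholeString and HasNoAngleBrackets both return false.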

Related

Find pattern to solve regex in one step

I am having trouble finding a pattern that solves the problem in one step.
The string looks like this:
Text1
Text1$Text2$Text3
Text1$Text2$Text3$Text4$Text5$Text6 etc.
What I want to get is: take up to four Text items. If there are more than four, take only the last character of each remaining item.
Example:
Text1$Text2$Text3$Text4$Text5$Text6 -> Text1$Text2$Text3$Text4&56
My current solution is:
First pattern:
^([^\$]*)\$?([^\$]*)\$?([^\$]*)\$?([^\$]*)\$?
After this I do a substitution with the first pattern, which removes the matched part.
New string: Text5$Text6
second pattern is:
([^\$])\b
result: 56
combine both and get the result:
Text1$Text2$Text3$Text4$56
It is not clear to me why I can't simply append the second pattern to the first to form one pattern. Is there something like an anchor that tells the engine to start matching from this point, as it would if this were the only pattern?

You might use an alternation with a positive lookbehind and then concatenate the matches.
(?<=^(?:[^$]+\$){0,3})[^$]+\$?|[^$](?=\$|$)
Explanation
(?<= Positive lookbehind, assert what is on the left is
^(?:[^$]+\$){0,3} From the start of the string, match 0-3 times: 1+ chars other than $, followed by a $
) Close lookbehind
[^$]+\$? Match 1+ times any char except $, then match an optional $
| Or
[^$] Match any char except $
(?=\$|$) Positive lookahead, assert what is directly to the right is either $ or the end of the string
Example
string pattern = @"(?<=^(?:[^$]*\$){0,3})[^$]*\$?|[^$](?=\$|$)";
string[] strings = {
"Text1",
"Text1$Text2$Text3",
"Text1$Text2$Text3$Text4$Text5$Text6"
};
Regex regex = new Regex(pattern);
foreach (String s in strings) {
Console.WriteLine(string.Join("", from Match match in regex.Matches(s) select match.Value));
}
Output
Text1
Text1$Text2$Text3
Text1$Text2$Text3$Text4$56

I strongly believe regular expression isn't the way to do that. Mostly because of the readability.
You may consider using simple algorithm like this one to reach your goal:
using System;
public class Program
{
public static void Main()
{
var input = "Text1$Text2$Text3$Text4$Text5$Text6";
var parts = input.Split('$');
var result = "";
for(var i=0; i<parts.Length; i++){
// The first four parts (indexes 0-3) are kept whole; later parts keep only what follows "Text".
result += (i < 4 ? parts[i] + "$" : parts[i].Substring(4));
}
Console.WriteLine(result);
}
}
There are also LINQ alternatives:
using System;
using System.Linq;
public class Program
{
public static void Main()
{
var input = "Text1$Text2$Text3$Text4$Text5$Text6";
var parts = input.Split('$');
var first4 = parts.Take(4);
var remainings = parts.Skip(4);
var result2 = string.Join("$", first4) + "$" + string.Join("", remainings.Select( r=>r.Substring(4)));
Console.WriteLine(result2);
}
}
It has to be adjusted to the actual needs, but the idea is there.
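One possible adjustment is sketched below. It wraps the split-based idea in a reusable method that also copes with inputs of fewer than four parts and takes the last character of each extra part rather than assuming a fixed "Text" prefix. The names TextShortener and KeepFirstParts and the optional parameters are hypothetical, and '$' is assumed as the joining character, as in the combined result shown in the question.

```csharp
using System;
using System.Linq;

public static class TextShortener
{
    // Keeps the first `keep` parts whole and reduces every further part
    // to its last character (empty parts are skipped).
    public static string KeepFirstParts(string input, char separator = '$', int keep = 4)
    {
        string[] parts = input.Split(separator);

        // Re-join the parts that are kept as they are.
        string head = string.Join(separator.ToString(), parts.Take(keep));

        // Last character of each remaining part, concatenated.
        string tail = string.Concat(
            parts.Skip(keep)
                 .Where(p => p.Length > 0)
                 .Select(p => p[p.Length - 1]));

        return tail.Length == 0 ? head : head + separator + tail;
    }
}
```

With the inputs from the question, "Text1" and "Text1$Text2$Text3" come back unchanged, and "Text1$Text2$Text3$Text4$Text5$Text6" becomes "Text1$Text2$Text3$Text4$56".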

Try this code:
var texts = new string[] {"Text1", "Text1$Text2$Text3", "Text1$Text2$Text3$Text4$Text5$Text6" };
var parsed = texts
.Select(s => Regex.Replace(s,
@"(Text\d{1,3}(?:\$Text\d{1,3}){0,3})((?:\$Text\d{1,3})*)",
(match) => match.Groups[1].Value +"$"+ match.Groups[2].Value.Replace("Text", "").Replace("$", "")
)).ToArray();
// parsed is now: string[3] { "Text1$", "Text1$Text2$Text3$", "Text1$Text2$Text3$Text4$56" }
Explanation:
solution uses regex pattern: (Text\d{1,3}(?:\$Text\d{1,3}){0,3})((?:\$Text\d{1,3})*)
(...) - first capturing group
(?:...) - non-capturing group
Text\d{1,3}(?:\$Text\d{1,3}){0,3} - match Text literally, then \d{1,3} (one to three digits); \$ matches $ literally, and {0,3} repeats the $Text... part up to three more times
The rest is just a repetition of this. Basically, the first group captures the first four pieces and the second group captures the rest, if any.
We also use MatchEvaluator here which is delegate type defined as:
public delegate string MatchEvaluator(Match match);
We define such method:
(match) => match.Groups[1].Value +"$"+ match.Groups[2].Value.Replace("Text", "").Replace("$", "")
We use it to evaluate each match: take the first capturing group and concatenate it with the second, removing the unnecessary text.

It's not clear to me whether your goal can be achieved using exclusively regex. If nothing else, the fact that you want to introduce a new character '&' into the output adds to the challenge, since just plain matching would never be able to accomplish that. Possibly using the Replace() method? I'm not sure that would work though...using only a replacement pattern and not a MatchEvaluator, I don't see a way to recognize but still exclude the "$Text" portion from the fifth instance and later.
But, if you are willing to mix regex with a small amount of post-processing, you can definitely do it:
static readonly Regex regex1 = new Regex(@"(Text\d(?:\$Text\d){0,3})(?:\$Text(\d))*", RegexOptions.Compiled);
static void Main(string[] args)
{
for (int i = 1; i <= 6; i++)
{
string text = string.Join("$", Enumerable.Range(1, i).Select(j => $"Text{j}"));
WriteLine(KeepFour(text));
}
}
private static string KeepFour(string text)
{
Match match = regex1.Match(text);
if (!match.Success)
{
return "[NO MATCH]";
}
StringBuilder result = new StringBuilder();
result.Append(match.Groups[1].Value);
if (match.Groups[2].Captures.Count > 0)
{
result.Append("&");
// Have to iterate (join), because we don't want the whole match,
// just the captured text.
result.Append(JoinCaptures(match.Groups[2]));
}
return result.ToString();
}
private static string JoinCaptures(Group group)
{
return string.Join("", group.Captures.Cast<Capture>().Select(c => c.Value));
}
The above breaks your requirement into two capture groups in a regex: the first holds up to four leading items, and the second collects the digit of each further item. It then extracts the captured text and composes the result from it.

Print the regex pattern where the string becomes invalid

Given the regular expression
^(aa|bb){1}(a*)(ab){1}$
For the language,
All strings that start with a double letter (aa or bb) and end with the substring ab
I would like to know whether it is possible to print the part of the regex at which the string becomes invalid. This has to do with regular expressions in finite automata.
For example, I have the following set of invalid input strings,
abaa
aabb
aaba
I wanted to have an output like this,
abaa ^(aa|bb){1}
aabb ^(aa|bb){1}(a*)
aaba ^(aa|bb){1}(a*)(ab){1}$

You can create a Regex from a string; if the pattern is malformed, the constructor throws an exception. You can loop over successively longer prefixes of the pattern and try to create a regex from each one; if that fails, just continue with the next prefix.
Once you have a Regex you can test for a match and store the last pattern that matched the input. So it would be something like this:
public static string FindBestValidRegex(string input, string pattern)
{
var lastMatch = "";
for (int i = 0; i < pattern.Length; i++)
{
try
{
var partialPattern = pattern.Substring(0, i + 1);
var regex = new Regex(partialPattern);
if (regex.IsMatch(input))
{
lastMatch = partialPattern;
}
}
catch { }
}
return lastMatch;
}
Testing:
static void Main(string[] args)
{
var pattern = @"^(aa|bb){1}(a*)(ab){1}$";
Console.WriteLine(FindBestValidRegex("bbb", pattern));
Console.WriteLine(FindBestValidRegex("aabb", pattern));
Console.WriteLine(FindBestValidRegex("aaab", pattern));
Console.WriteLine(FindBestValidRegex("bbaab", pattern));
Console.ReadKey();
}
Output:
^(aa|bb){1}(a*)
^(aa|bb){1}(a*)
^(aa|bb){1}(a*)(ab){1}$
^(aa|bb){1}(a*)(ab){1}$
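A variation on the same idea avoids trial-and-error construction of invalid patterns: if the pattern is supplied already split into complete sub-patterns, each cumulative prefix is a valid regex. The sketch below assumes the caller provides that split by hand; PrefixMatcher and LongestMatchingPrefix are hypothetical names, and this is not the method from the answer above.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PrefixMatcher
{
    // Concatenates the segments one by one and returns the longest
    // cumulative pattern that still matches the input ("" if none does).
    public static string LongestMatchingPrefix(string input, IEnumerable<string> segments)
    {
        string prefix = "";
        string lastMatching = "";
        foreach (string segment in segments)
        {
            prefix += segment;
            if (!Regex.IsMatch(input, prefix))
            {
                break; // a longer concatenation cannot match once this one fails
            }
            lastMatching = prefix;
        }
        return lastMatching;
    }
}
```

Called with the segments "^(aa|bb){1}", "(a*)", "(ab){1}" and "$", the input "aabb" yields ^(aa|bb){1}(a*), "bbaab" yields the full pattern, and "abaa" yields an empty string because even the first segment fails.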

Replace regular expression with regular expression

Consider two regular expressions:
var regex_A = @"Main\.(.+)\.Value";
var regex_B = @"M_(.+)_Sp";
I want to be able to replace a string using regex_A as input, and regex_B as the replacement string. But also the other way around. And without supplying additional information like a format string per regex.
Specifically I want to create a replaced_B string from an input_A string. So:
var input_A = "Main.Rotating.Value";
var replaced_B = input_A.RegEx_Awesome_Replace(regex_A, regex_B);
Assert.AreEqual("M_Rotating_Sp", replaced_B);
And this should also work in reverse (that's the reason I can't use a simple string.Format for regex_B), because I don't want to supply a format string for every regular expression (I'm lazy).
var input_B = "M_Skew_Sp";
var replaced_A = input_B.RegEx_Awesome_Replace(regex_B, regex_A);
Assert.AreEqual("Main.Skew.Value", replaced_A);
I have no clue if this exists, or how to call it. Google search finds me all kinds of other regex replaces... not this one.
Update:
So basically I need a way to convert a regular expression to a format string.
var regex_A_format = Regex2Format(regex_A);
Assert.AreEqual("Main.$1.Value", regex_A_format);
and
var regex_B_format = Regex2Format(regex_B);
Assert.AreEqual("M_$1_Sp", regex_B_format);
So what should the RegEx_Awesome_Replace and/or Regex2Format function look like?
Update 2:
I guess the RegEx_Awesome_Replace should look something like (using some code from answers below):
public static class StringExtenstions
{
public static string RegExAwesomeReplace(this string inputString,string searchPattern,string replacePattern)
{
return Regex.Replace(inputString, searchPattern, Regex2Format(replacePattern));
}
}
Which would leave the Regex2Format as an open question.

There is no defined way for one regex to refer to a match found in another regex. Regexes are not format strings.
What you can do is to use Tuples of a format string together with its regex. e.g.
var a = new Tuple<Regex,string>(new Regex(@"(?<=Main\.).+(?=\.Value)"), @"Main.{0}.Value");
var b = new Tuple<Regex,string>(new Regex(@"(?<=M_).+(?=_Sp)"), @"M_{0}_Sp");
Then you can pass these objects to a common replacement method in any order, like this:
private string RegEx_Awesome_Replace(string input, Tuple<Regex,string> toFind, Tuple<Regex,string> replaceWith)
{
return string.Format(replaceWith.Item2, toFind.Item1.Match(input).Value);
}
You will notice that I have used zero-width positive lookahead assertion and zero-width positive lookbehind assertions in my regexes, to ensure that Value contains exactly the text that I want to replace.
You may also want to add error handling for cases where no match is found; see the documentation for Regex.Match.
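The pairing of an extracting regex with a format string, plus the error handling mentioned above, could look roughly like this. It is a sketch under assumptions: NameMapping and TryConvert are invented names, and a small class is used in place of the Tuple only to make the two roles explicit.

```csharp
using System;
using System.Text.RegularExpressions;

public sealed class NameMapping
{
    public NameMapping(string extractorPattern, string format)
    {
        Extractor = new Regex(extractorPattern);
        Format = format;
    }

    // Regex whose whole match is the variable part of the name.
    public Regex Extractor { get; }

    // Format string with {0} where the variable part goes.
    public string Format { get; }

    // Returns false (and a null result) when the input does not match `from`.
    public static bool TryConvert(string input, NameMapping from, NameMapping to, out string result)
    {
        Match match = from.Extractor.Match(input);
        result = match.Success ? string.Format(to.Format, match.Value) : null;
        return match.Success;
    }
}
```

With the two mappings from this answer, "Main.Rotating.Value" converts to "M_Rotating_Sp" and "M_Skew_Sp" converts back to "Main.Skew.Value"; an input that matches neither makes TryConvert return false instead of producing a half-filled string.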

Since you have already reduced your problem to where you need to change a Regex into a string format (implementing Regex2Format) I will focus my answer just on that part. Note that my answer is incomplete because it doesn't address the full breadth of parsing regex capturing groups, however it works for simple cases.
First, we need a regex that matches regex capture groups. It uses a negative lookbehind so that escaped parentheses are not matched. Other cases still break this regex, e.g. non-capturing groups, wildcard symbols, and anything between square brackets.
private static readonly Regex CaptureGroupMatcher = new Regex(@"(?<!\\)\([^\)]+\)");
The implementation of Regex2Format here basically writes everything outside of capture groups into the output string, and replaces the capture group value by {x}.
static string Regex2Format(string pattern)
{
var targetBuilder = new StringBuilder();
int previousEndIndex = 0;
int formatIndex = 0;
foreach (Match match in CaptureGroupMatcher.Matches(pattern))
{
var group = match.Groups[0];
int endIndex = group.Index;
AppendPart(pattern, previousEndIndex, endIndex, targetBuilder);
targetBuilder.Append('{');
targetBuilder.Append(formatIndex++);
targetBuilder.Append('}');
previousEndIndex = group.Index + group.Length;
}
AppendPart(pattern, previousEndIndex, pattern.Length, targetBuilder);
return targetBuilder.ToString();
}
This helper function writes pattern string values into the output, it currently writes everything except \ characters used to escape something.
static void AppendPart(string pattern, int previousEndIndex, int endIndex, StringBuilder targetBuilder)
{
for (int i = previousEndIndex; i < endIndex; i++)
{
char c = pattern[i];
if (c == '\\' && i < pattern.Length - 1 && pattern[i + 1] != '\\')
{
//backslash not followed by another backslash - it's an escape char
}
else
{
targetBuilder.Append(c);
}
}
}
Test cases
static void Test()
{
var cases = new Dictionary<string, string>
{
{ @"Main\.(.+)\.Value", @"Main.{0}.Value" },
{ @"M_(.+)_Sp(.*)", "M_{0}_Sp{1}" },
{ @"M_\(.+)_Sp", @"M_(.+)_Sp" },
};
foreach (var kvp in cases)
{
if (Regex2Format(kvp.Key) != kvp.Value)
{
Console.WriteLine("Test failed for {0} - expected {1} but got {2}", kvp.Key, kvp.Value, Regex2Format(kvp.Key));
}
}
}
To wrap up, here is the usage:
private static string AwesomeRegexReplace(string input, string sourcePattern, string targetPattern)
{
var targetFormat = Regex2Format(targetPattern);
return Regex.Replace(input, sourcePattern, match =>
{
var args = match.Groups.OfType<Group>().Skip(1).Select(g => g.Value).ToArray<object>();
return string.Format(targetFormat, args);
});
}

Something like this might work
var replaced_B = Regex.Replace(input_A, @"Main\.(.+)\.Value", @"M_$1_Sp");

Are you looking for something like this?
public static class StringExtenstions
{
public static string RegExAwesomeReplace(this string inputString,string searchPattern,string replacePattern)
{
Match searchMatch = Regex.Match(inputString,searchPattern);
Match replaceMatch = Regex.Match(inputString, replacePattern);
if (!searchMatch.Success || !replaceMatch.Success)
{
return inputString;
}
return inputString.Replace(searchMatch.Value, replaceMatch.Value);
}
}
The string extension method returns the string with replaced value for search pattern and replace pattern.
This is how you call:
input_A.RegExAwesomeReplace(regex_A, regex_B);

Using regex to capture a numeric value within a string in C#

I have a string of characters which has 0 or more occurrences of ABC = dddd within it. The dddd stands for an integer value, not necessarily four digits.
What I'd like to do is capture the integer values that occur within this pattern. I know how to perform matches with regexes but I'm new to capturing. It's not necessary to capture all the ABC integer values in one call—it's fine to loop over the string.
If this is too involved I'll just write a tiny parser, but I'd like to use regex if it's reasonably elegant. Expertise greatly appreciated.

First we need to start with a regex that matches the pattern we are looking for. This will match the example you have given (assuming ABC is alphanumeric): \w+\s*=\s*\d+
Next we need to define what we want to capture in a match by defining capture groups. .Net includes support for named capture groups, which I absolutely adore. We specify a group with (?<name for capture>expression), turning our regex into: (?<key>\w+)\s*=\s*(?<value>\d+). This gives us two captures, key and value.
Using this, we can iterate over all matches in your text:
Regex pattern = new Regex(@"(?<key>\w+)\s*=\s*(?<value>\d+)");
string body = "This is your text here. value = 1234";
foreach (Match match in pattern.Matches(body))
{
Console.WriteLine("Found key {0} with value {1}",
match.Groups["key"].Value,
match.Groups["value"].Value
);
}
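Applied to the literal key from the question, the named-group approach can also return the values as integers. The following is a minimal sketch; AbcValues and Extract are hypothetical names, and it assumes that values too large for an int should simply be skipped.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AbcValues
{
    // \b keeps longer keys such as XABC from matching; [0-9]+ is the integer.
    private static readonly Regex AbcPattern =
        new Regex(@"\bABC\s*=\s*(?<value>[0-9]+)");

    public static List<int> Extract(string text)
    {
        var values = new List<int>();
        foreach (Match match in AbcPattern.Matches(text))
        {
            // TryParse guards against digit runs that overflow an int.
            if (int.TryParse(match.Groups["value"].Value, out int number))
            {
                values.Add(number);
            }
        }
        return values;
    }
}
```

For the text "x ABC = 12; ABC=7 and ABCD = 9" this returns 12 and 7; the ABCD entry is ignored because the pattern requires the key to be exactly ABC.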

You can use something like this:
MatchCollection allMatchResults = null;
try {
// This matches a literal '=' and then any number of digits following
Regex regexObj = new Regex(@"=(\d+)");
allMatchResults = regexObj.Matches(subjectString);
if (allMatchResults.Count > 0) {
// Access individual matches using allMatchResults[i]
} else {
// Match attempt failed
}
} catch (ArgumentException ex) {
// Syntax error in the regular expression
}
Based on your comment, perhaps this is more what you're after:
try {
Regex regexObj = new Regex(@"=(\d+)");
Match matchResults = regexObj.Match(subjectString);
while (matchResults.Success) {
for (int i = 1; i < matchResults.Groups.Count; i++) {
Group groupObj = matchResults.Groups[i];
if (groupObj.Success) {
// matched text: groupObj.Value
// match start: groupObj.Index
// match length: groupObj.Length
}
}
matchResults = matchResults.NextMatch();
}
} catch (ArgumentException ex) {
// Syntax error in the regular expression
}

Regular Expression To Split On Comma Except If Quoted

What is the regular expression to split on comma (,) except if surrounded by double quotes? For example:
max,emily,john = ["max", "emily", "john"]
BUT
max,"emily,kate",john = ["max", "emily,kate", "john"]
Looking to use in C#: Regex.Split(string, "PATTERN-HERE");
Thanks.

Situations like this often call for something other than regular expressions. They are nifty, but patterns for handling this kind of thing are more complicated than they are useful.
You might try something like this instead:
public static IEnumerable<string> SplitCSV(string csvString)
{
var sb = new StringBuilder();
bool quoted = false;
foreach (char c in csvString) {
if (quoted) {
if (c == '"')
quoted = false;
else
sb.Append(c);
} else {
if (c == '"') {
quoted = true;
} else if (c == ',') {
yield return sb.ToString();
sb.Length = 0;
} else {
sb.Append(c);
}
}
}
if (quoted)
throw new ArgumentException("Unterminated quotation mark.", nameof(csvString));
yield return sb.ToString();
}
It probably needs a few tweaks to follow the CSV spec exactly, but the basic logic is sound.

This is a clear-cut case for a CSV parser, so you should be using .NET's own CSV parsing capabilities or cdhowie's solution.
Purely for your information and not intended as a workable solution, here's what contortions you'd have to go through using regular expressions with Regex.Split():
You could use the regex (please don't!)
(?<=^(?:[^"]*"[^"]*")*[^"]*) # assert that there is an even number of quotes before...
\s*,\s* # the comma to be split on...
(?=(?:[^"]*"[^"]*")*[^"]*$) # as well as after the comma.
if your quoted strings never contain escaped quotes, and you don't mind the quotes themselves becoming part of the match.
This is horribly inefficient, a pain to read and debug, works only in .NET, and it fails on escaped quotes (at least if you're not using "" to escape a single quote). Of course the regex could be modified to handle that as well, but then it's going to be perfectly ghastly.
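As for the built-in route mentioned at the start of this answer, one candidate is TextFieldParser from the Microsoft.VisualBasic.FileIO namespace. Whether that is the facility the answer had in mind is an assumption, and the wrapper names CsvLine and Parse are made up; treat this as a sketch rather than a recommendation.

```csharp
using System;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class CsvLine
{
    // Parses a single comma-delimited line, honouring double-quoted fields.
    public static string[] Parse(string line)
    {
        using (var reader = new StringReader(line))
        using (var parser = new TextFieldParser(reader))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true;

            // ReadFields returns null when there is nothing left to read.
            return parser.ReadFields() ?? Array.Empty<string>();
        }
    }
}
```

For the line max,"emily,kate",john this should yield the three fields max, emily,kate and john, with the surrounding quotes removed.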

A little late maybe but I hope I can help someone else
String[] cols = Regex.Split("max, emily, john", @"\s*,\s*");
foreach ( String s in cols ) {
Console.WriteLine(s);
}

Justin, resurrecting this question because it had a simple regex solution that wasn't mentioned. This situation sounds straight out of Match (or replace) a pattern except in situations s1, s2, s3 etc.
Here's our simple regex:
"[^"]*"|(,)
The left side of the alternation matches complete "quoted strings" tags. We will ignore these matches. The right side matches and captures commas to Group 1, and we know they are the right commas because they were not matched by the expression on the left. We replace these commas with SplitHere, then we split on SplitHere.
This program shows how to use the regex:
using System;
using System.Text.RegularExpressions;
using System.Collections.Specialized;
class Program
{
static void Main() {
string s1 = @"max,""emily,kate"",john";
var myRegex = new Regex(@"""[^""]*""|(,)");
string replaced = myRegex.Replace(s1, delegate(Match m) {
if (m.Groups[1].Value == "") return m.Value;
else return "SplitHere";
});
string[] splits = Regex.Split(replaced,"SplitHere");
foreach (string split in splits) Console.WriteLine(split);
Console.WriteLine("\nPress Any Key to Exit.");
Console.ReadKey();
} // END Main
} // END Program
Reference
How to match (or replace) a pattern except in situations s1, s2, s3...
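The same alternation can be used without the intermediate SplitHere marker by slicing the input at the positions of the Group 1 commas. This is a sketch of that variation, not the program above; QuoteAwareSplitter and SplitOutsideQuotes are hypothetical names, and like the original it keeps the quotes in the output and does not handle escaped quotes.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QuoteAwareSplitter
{
    // Left branch consumes whole quoted strings; right branch captures
    // the remaining (unquoted) commas in Group 1.
    private static readonly Regex CommaOrQuoted = new Regex("\"[^\"]*\"|(,)");

    public static List<string> SplitOutsideQuotes(string input)
    {
        var parts = new List<string>();
        int start = 0;
        foreach (Match match in CommaOrQuoted.Matches(input))
        {
            if (!match.Groups[1].Success)
            {
                continue; // a quoted string, not a separator
            }
            parts.Add(input.Substring(start, match.Index - start));
            start = match.Index + 1;
        }
        parts.Add(input.Substring(start)); // text after the last separator
        return parts;
    }
}
```

For max,"emily,kate",john the result has three entries: max, "emily,kate" (quotes included) and john. Because no marker string is inserted, input that happens to contain the text SplitHere is not affected.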
