.NET Regular Expression for N number of Consecutive Characters

I need a regular expression that matches three consecutive identical characters (any alphanumeric character) in a string.
In 2a82a9e4eee646448db00e3fccabd8c7, "eee" would be a match.
In 2a82a9e4efe64644448db00e3fccabd8c7, "444" would be a match.
And so on.

Use backreferences.
([a-zA-Z0-9])\1\1
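
To generalize the backreference pattern from three to N repeats, the two trailing \1 references can be written as a quantified backreference, \1{N-1}. The following is a minimal sketch of that idea; the class name RepeatFinder and the method FindRuns are hypothetical names chosen for illustration, and n is assumed to be at least 2.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RepeatFinder
{
    // Hypothetical helper: returns each run of n identical alphanumeric
    // characters found in the input. Assumes n >= 2.
    public static List<string> FindRuns(string input, int n)
    {
        // Group 1 captures one alphanumeric character; \1{n-1} requires
        // the same character to follow n-1 more times.
        string pattern = @"([a-zA-Z0-9])\1{" + (n - 1) + "}";
        var runs = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            runs.Add(m.Value);
        }
        return runs;
    }
}
```

With n = 3 this builds the same pattern as ([a-zA-Z0-9])\1\1, so calling RepeatFinder.FindRuns("2a82a9e4eee646448db00e3fccabd8c7", 3) should yield a single entry, "eee". Note that a longer run such as "4444" is reported as one "444" match, because matches do not overlap.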

Try this:
using System;
using System.Text.RegularExpressions;
class MainClass {
private static void DisplayMatches(string text,
string regularExpressionString)
{
Console.WriteLine("using the following regular expression: "
+regularExpressionString);
MatchCollection myMatchCollection =
Regex.Matches(text, regularExpressionString);
foreach (Match myMatch in myMatchCollection) {
Console.WriteLine(myMatch);
}
}
public static void Main()
{
string text ="Missisipli Kerrisdale she";
Console.WriteLine("Matching words that contain "
+ "two consecutive identical characters");
DisplayMatches(text, @"\S*(.)\1\S*");
}
}

Related

How to use (?!...) regex pattern to skip the whole unmatched part?

I would like to use the ((?!(SEPARATOR)).)* regex pattern for splitting a string.
using System;
using System.Text.RegularExpressions;
public class Program
{
public static void Main()
{
var separator = "__";
var pattern = String.Format("((?!{0}).)*", separator);
var regex = new Regex(pattern);
foreach (var item in regex.Matches("first__second"))
Console.WriteLine(item);
}
}
It works fine when the SEPARATOR is a single character, but when it is longer than one character I get an unexpected result. In the code above, the second matched string is "_second" instead of "second". How should I modify my pattern to skip the whole unmatched separator?
My real problem is to split lines while skipping line separators inside quotes. My line separator is not a predefined value; it can be, for example, "\r\n".

You can do something like this:
using System;
using System.Text.RegularExpressions;
public class Example
{
public static void Main()
{
string input = "plum--pear";
string pattern = "-"; // Split on hyphens
string[] substrings = Regex.Split(input, pattern);
foreach (string match in substrings)
{
Console.WriteLine("'{0}'", match);
}
}
}
// The method displays the following output:
// 'plum'
// ''
// 'pear'

The .NET regex does not support matching a piece of text other than a specific multicharacter string. In PCRE, you would use the (*SKIP)(*FAIL) verbs, but they are not supported in the native .NET regex library. You could use PCRE.NET, but .NET regex can usually handle such scenarios well with Regex.Split.
If you need to, say, match all but [anything here], you could use
var res = Regex.Split(s, @"\[[^][]*]").Where(m => !string.IsNullOrEmpty(m));
If the separator is a simple literal fixed string like __, just use String.Split.
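
As a rough sketch of the fixed-separator case, the snippet below shows both String.Split with a string separator and, for a separator that is only known at run time, Regex.Split combined with Regex.Escape so that any regex metacharacters in the separator are treated literally. The class and method names (SeparatorSplit, SplitLiteral, SplitEscaped) are hypothetical and only serve the illustration; neither method handles the quoted-line case.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SeparatorSplit
{
    // Splits on a literal multi-character separator without any regex.
    public static string[] SplitLiteral(string input, string separator)
    {
        return input.Split(new[] { separator }, StringSplitOptions.None);
    }

    // Same result via Regex.Split; Regex.Escape makes the separator literal,
    // which matters when it contains characters such as '.' or '|'.
    public static string[] SplitEscaped(string input, string separator)
    {
        return Regex.Split(input, Regex.Escape(separator));
    }
}
```

For example, SeparatorSplit.SplitLiteral("first__second", "__") should return the two items "first" and "second".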
As for your real problem, it seems all you need is
var res = Regex.Matches(s, "(?:\"[^\"]*\"|[^\r\n\"])+")
.Cast<Match>()
.Select(m => m.Value)
.ToList();
The pattern matches one or more (due to the final +) occurrences of either a quoted substring, that is, a ", zero or more chars other than ", and then a closing " (the "[^"]*" branch), or (|) any single char other than CR, LF or " (the [^\r\n"] branch).

Document filtering by regex

I'm trying to find the best way to validate an input document. I need to check every line of the document, and each line may contain one or more invalid characters. The result of the validation should be: 'give me the index of each line with an invalid character and the index of each invalid character in that line'.
I know how to do it the standard way (open file -> read all lines -> check characters one by one), but that method isn't the most efficient. I think a better solution would be to use "MatchCollection".
But how do I do this correctly in C#?
Link:
http://www.dotnetperls.com/regex-matches
Example:
"Some Înput text here,\nÎs another lÎne of thÎs text."
In the first line [0], an invalid character is found at index [5]; in line [1],
invalid characters are found at indexes [0, 12, 21].
using System;
using System.Text.RegularExpressions;
namespace RegularExpresion
{
class Program
{
private static Regex regex = null;
static void Main(string[] args)
{
string input_text = "Some Înput text here, Îs another lÎne of thÎs text.";
string line_pattern = "\n";
string invalid_character = "Î";
regex = new Regex(line_pattern);
/// Check is multiple or single line document
if (IsMultipleLine(input_text))
{
/// ---> How to do this correctly for each line ? <---
}
else
{
Console.WriteLine("Is a single line file");
regex = new Regex(invalid_character);
MatchCollection mc = regex.Matches(input_text);
Console.WriteLine($"How many matches: {mc.Count}");
foreach (Match match in mc)
Console.WriteLine($"Index: {match.Index}");
}
Console.ReadKey();
}
public static bool IsMultipleLine(string input) => regex.IsMatch(input);
}
}
Output:
Is a single line file
How many matches: 4
Index: 5
Index: 22
Index: 34
Index: 43

Link:
http://www.dotnetperls.com/regexoptions-multiline
SOLUTION
using System;
using System.Text.RegularExpressions;
namespace RegularExpresion
{
class Program
{
private static Regex regex = null;
static void Main(string[] args)
{
string input_text = @"Some Înput text here,
Îs another lÎne of thÎs text.";
string line_pattern = "\n";
string invalid_character = "Î";
regex = new Regex(line_pattern);
/// Check is multiple or single line document
if (IsMultipleLine(input_text))
{
Console.WriteLine("Is a multiple line file");
MatchCollection matches = Regex.Matches(input_text, "^(.+)$", RegexOptions.Multiline);
int line = 0;
foreach (Match match in matches)
{
foreach (Capture capture in match.Captures)
{
line++;
Console.WriteLine($"Line: {line}");
RegexpLine(capture.Value, invalid_character);
}
}
}
else
{
Console.WriteLine("Is a single line file");
RegexpLine(input_text, invalid_character);
}
Pause();
}
public static bool IsMultipleLine(string input) => regex.IsMatch(input);
public static void RegexpLine(string line, string characters)
{
regex = new Regex(characters);
MatchCollection mc = regex.Matches(line);
Console.WriteLine($"How many matches: {mc.Count}");
foreach (Match match in mc)
Console.WriteLine($"Index: {match.Index}");
}
public static ConsoleKeyInfo Pause(string message = "please press ANY key to continue...")
{
Console.WriteLine(message);
return Console.ReadKey();
}
}
}
Thanks for the help, guys. It would be nice if someone smarter than me could check this code in terms of performance.
Regards,
Nerus.

My approach would be to split the string into an array of strings, each containing one line. If the length of the array is 1, you have only one line. From there, use the Regex on each line to find the invalid characters you are looking for.
string input_text = "Some Înput text here,\nÎs another lÎne of thÎs text.";
string line_pattern = "\n";
// split the string into string arrays
string[] input_texts = input_text.Split(new string[] { line_pattern }, StringSplitOptions.RemoveEmptyEntries);
string invalid_character = "Î";
if (input_texts != null && input_texts.Length > 0)
{
if (input_texts.Length == 1)
{
Console.WriteLine("Is a single line file");
}
// loop every line
foreach (string oneline in input_texts)
{
Regex regex = new Regex(invalid_character);
MatchCollection mc = regex.Matches(oneline);
Console.WriteLine("How many matches: {0}", mc.Count);
foreach (Match match in mc)
{
Console.WriteLine("Index: {0}", match.Index);
}
}
}
--- EDIT ---
Things to consider:
If you get your input from a file, I would recommend reading it line by line rather than loading the whole text.
Usually, when you search for invalid characters, you don't specify them one by one. Instead you look for a pattern, for example: not a char from a-z, A-Z, 0-9. Then your regex is going to be a little bit different.
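
The two points above can be sketched roughly as follows: lines are read one at a time from a TextReader, and a negated character class flags anything outside an allowed range. The class name LineValidator, the method FindInvalid, and the choice of "printable ASCII" (\u0020-\u007E) as the allowed range are all assumptions made for this illustration; a real validator would substitute its own class of valid characters.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

public static class LineValidator
{
    // Assumption: anything outside printable ASCII counts as invalid.
    // The Regex is created once and reused for every line.
    private static readonly Regex Invalid = new Regex(@"[^\u0020-\u007E]");

    // Maps a zero-based line index to the zero-based indexes of the
    // invalid characters found on that line. Clean lines are omitted.
    public static Dictionary<int, List<int>> FindInvalid(TextReader reader)
    {
        var result = new Dictionary<int, List<int>>();
        string line;
        int lineIndex = 0;
        while ((line = reader.ReadLine()) != null)
        {
            foreach (Match m in Invalid.Matches(line))
            {
                if (!result.ContainsKey(lineIndex))
                {
                    result[lineIndex] = new List<int>();
                }
                result[lineIndex].Add(m.Index);
            }
            lineIndex++;
        }
        return result;
    }
}
```

For an in-memory string, pass new StringReader(text); for a file, File.OpenText(path) returns a reader that can be passed the same way, so the whole document never has to be held in memory at once.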

Get Removed characters from string

I am using Regex to remove unwanted characters from a string, like below:
str = System.Text.RegularExpressions.Regex.Replace(str, @"[^\u0020-\u007E]", "");
How can I efficiently retrieve the distinct characters that will be removed?
EDIT:
Sample input : str = "This☺ contains Åüsome æspecialæ characters"
Sample output : str = "This contains some special characters"
removedchar = "☺,Å,ü,æ"

// Negated class: matches the characters that Regex.Replace would remove
string pattern = @"[^\u0020-\u007E]";
Regex rgx = new Regex(pattern);
List<string> matches = new List<string> ();
foreach (Match match in rgx.Matches(str))
{
if (!matches.Contains (match.Value))
{
matches.Add (match.Value);
}
}

Here is an example of how you can do it with a callback method, using the Regex.Replace overload that takes an evaluator:
evaluator
Type: System.Text.RegularExpressions.MatchEvaluator
A custom method that examines each match and returns either the original matched string or a replacement string.
C# demo:
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
public class Test
{
public static List<string> characters = new List<string>();
public static void Main()
{
var str = Regex.Replace("§My string 123”˝", "[^\u0020-\u007E]", Repl);//""
Console.WriteLine(str); // => My string 123
Console.WriteLine(string.Join(", ", characters)); // => §, ”, ˝
}
public static string Repl(Match m)
{
characters.Add(m.Value);
return string.Empty;
}
}
In short, declare a "global" variable (a list of strings, here, characters), initialize it. Add the Repl method to handle the replacement, and when Regex.Replace calls that method, add each matched value to the characters list.
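
Since the question asks for distinct removed characters, here is a hedged variation on the same evaluator idea that avoids the static list and skips duplicates. It is only a sketch: the class name RemovedChars and the method Strip are hypothetical, and a HashSet is used alongside a List so that the characters keep their first-seen order.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RemovedChars
{
    // Returns the cleaned string; 'removed' receives each distinct
    // removed character once, in the order it first appeared.
    public static string Strip(string input, out List<string> removed)
    {
        var seen = new HashSet<string>();
        var ordered = new List<string>();
        string cleaned = Regex.Replace(input, @"[^\u0020-\u007E]", m =>
        {
            // HashSet.Add returns false for a value already recorded.
            if (seen.Add(m.Value))
            {
                ordered.Add(m.Value);
            }
            return string.Empty;
        });
        removed = ordered;
        return cleaned;
    }
}
```

Called on the sample input "This☺ contains Åüsome æspecialæ characters", this should return "This contains some special characters" and collect ☺, Å, ü, æ, with æ recorded only once.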

Get only wild card value using regular expression

I want to extract only the wildcard tokens using regular expressions in .NET (C#).
For example, if I use a pattern like Book_* (so it behaves like a directory wildcard), it should extract the values that match the *.
For example:
For a string "Book_1234" and pattern "Book_*"
I want to extract "1234".
For a string "Book_1234_ABC" and pattern "Book_*_*"
I should be able to extract 1234 and ABC.

This should do it:
string input = "Book_1234_ABC";
MatchCollection matches = Regex.Matches(input, @"_([A-Za-z0-9]*)");
foreach (Match m in matches)
if (m.Success)
Console.WriteLine(m.Groups[1].Value);

The approach to your scenario would be to:
1. Get the list of strings that appear between the wildcards (*).
2. Join the list with the regex alternation operator (|).
3. Replace every match of that regular expression with a char you do not expect in your string (I suppose a space should be adequate here).
4. Trim and then split the resulting string by the char used in the previous step, which gives you the list of wildcard values.
var str = "Book_1234_ABC";
var inputPattern = "Book_*_*";
var patterns = inputPattern.Split('*');
if (patterns.Last().Equals(""))
patterns = patterns.Take(patterns.Length - 1).ToArray();
string expression = string.Join("|", patterns);
var wildCards = Regex.Replace(str, expression, " ").Trim().Split(' ');

I would first convert the '*' wildcard into an equivalent regex, i.e.:
* becomes \w+
Then I would use this regex to extract the matches.
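
A minimal sketch of that conversion might look like the following. It assumes each * should become a capturing group (\w+) and that the rest of the wildcard pattern is literal text, which is why Regex.Escape is applied first. The names WildcardExtractor and Extract are hypothetical.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WildcardExtractor
{
    // Converts a wildcard pattern such as "Book_*_*" into an anchored
    // regex and returns the text matched by each '*'.
    public static List<string> Extract(string input, string wildcardPattern)
    {
        // Regex.Escape turns '*' into "\*"; that escaped star is then
        // swapped for a capturing group.
        string regex = "^"
            + Regex.Escape(wildcardPattern).Replace(@"\*", @"(\w+)")
            + "$";
        var values = new List<string>();
        Match m = Regex.Match(input, regex);
        if (m.Success)
        {
            // Group 0 is the whole match; the wildcards start at group 1.
            for (int i = 1; i < m.Groups.Count; i++)
            {
                values.Add(m.Groups[i].Value);
            }
        }
        return values;
    }
}
```

Extract("Book_1234_ABC", "Book_*_*") should give 1234 and ABC. One caveat: \w also matches the underscore, so with the pattern "Book_*" the same input yields "1234_ABC" as a single value; replacing \w+ with [^_]+ would be one way to stop at underscores if that is not wanted.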

When I run this code using your input strings:
using System;
using System.Text.RegularExpressions;
namespace SampleApplication
{
public class Test
{
static Regex reg = new Regex(@"Book_([^_]+)_*(.*)");
static void DoMatch(String value) {
Console.WriteLine("Input: " + value);
foreach (Match item in reg.Matches(value)) {
for (int i = 0; i < item.Groups.Count; ++i) {
Console.WriteLine(String.Format("Group: {0} = {1}", i, item.Groups[i].Value));
}
}
Console.WriteLine("\n");
}
static void Main(string[] args) {
// For a string "Book_1234" and pattern "Book_*" I want to extract "1234"
DoMatch("Book_1234");
// For a string "Book_1234_ABC" and pattern "Book_*_*" I should be able to extract 1234 and ABC
DoMatch("Book_1234_ABC");
}
}
}
I get this console output:
Input: Book_1234
Group: 0 = Book_1234
Group: 1 = 1234
Group: 2 =
Input: Book_1234_ABC
Group: 0 = Book_1234_ABC
Group: 1 = 1234
Group: 2 = ABC

C# Regex, Either Or

I have a string that I parse in regex:
"one [two] three [four] five"
I have a regex that extracts the bracketed text into <bracket>, but now I want to add the other stuff (one, three, five) into <text>, and I want these to be separate matches.
So each match is either a match for <text> or a match for <bracket>. Is this possible using regex?
So the list of matches would look like:
text=one, bracketed=null
text=null, bracketed=[two]
text=three, bracketed=null
text=null, bracketed=[four]
text=five, bracketed=null

Is this what you're after? Basically | is used for alternation in regular expressions.
using System;
using System.Text.RegularExpressions;
public class Test
{
public static void Main()
{
string test = "one [two] three [four] five";
Regex regex = new Regex(@"(?<text>[a-z]+)|(?<bracketed>\[[a-z]+\])");
Match match = regex.Match(test);
while (match.Success)
{
Console.WriteLine("text: {0}; bracketed: {1}",
match.Groups["text"],
match.Groups["bracketed"]);
match = match.NextMatch();
}
}
}
