Document filtering by regex - c#

I'm trying to find the best way to validate an input document. I need to check every line of the document, and any line may contain one or more invalid characters. The result of the validation should be the index of each line that contains an invalid character, plus the index of each invalid character within that line.
I know how to do this the standard way (open the file -> read all lines -> check the characters one by one), but that approach isn't well optimized. In my opinion, a better solution would be to use "MatchCollection".
But how do I do this correctly in C#?
Link:
http://www.dotnetperls.com/regex-matches
Example:
"Some Înput text here,\nÎs another lÎne of thÎs text."
In the first line [0], an invalid character is found at index [5]; in line [1],
invalid characters are found at indexes [0, 12, 21].
using System;
using System.Text.RegularExpressions;

namespace RegularExpresion
{
    class Program
    {
        private static Regex regex = null;

        static void Main(string[] args)
        {
            string input_text = "Some Înput text here, Îs another lÎne of thÎs text.";
            string line_pattern = "\n";
            string invalid_character = "Î";
            regex = new Regex(line_pattern);
            // Check whether the document has multiple lines or a single line
            if (IsMultipleLine(input_text))
            {
                // ---> How to do this correctly for each line ? <---
            }
            else
            {
                Console.WriteLine("Is a single line file");
                regex = new Regex(invalid_character);
                MatchCollection mc = regex.Matches(input_text);
                Console.WriteLine($"How many matches: {mc.Count}");
                foreach (Match match in mc)
                    Console.WriteLine($"Index: {match.Index}");
            }
            Console.ReadKey();
        }

        public static bool IsMultipleLine(string input) => regex.IsMatch(input);
    }
}
Output:
Is a single line file
How many matches: 4
Index: 5
Index: 22
Index: 34
Index: 43

Link:
http://www.dotnetperls.com/regexoptions-multiline
SOLUTION
using System;
using System.Text.RegularExpressions;

namespace RegularExpresion
{
    class Program
    {
        private static Regex regex = null;

        static void Main(string[] args)
        {
            // Verbatim string: the line break inside the literal is part of the text
            string input_text = @"Some Înput text here,
Îs another lÎne of thÎs text.";
            string line_pattern = "\n";
            string invalid_character = "Î";
            regex = new Regex(line_pattern);
            // Check whether the document has multiple lines or a single line
            if (IsMultipleLine(input_text))
            {
                Console.WriteLine("Is a multiple line file");
                MatchCollection matches = Regex.Matches(input_text, "^(.+)$", RegexOptions.Multiline);
                int line = 0;
                foreach (Match match in matches)
                {
                    foreach (Capture capture in match.Captures)
                    {
                        line++;
                        Console.WriteLine($"Line: {line}");
                        RegexpLine(capture.Value, invalid_character);
                    }
                }
            }
            else
            {
                Console.WriteLine("Is a single line file");
                RegexpLine(input_text, invalid_character);
            }
            Pause();
        }

        public static bool IsMultipleLine(string input) => regex.IsMatch(input);

        // Prints the number of matches and the index of each match in the given line
        public static void RegexpLine(string line, string characters)
        {
            regex = new Regex(characters);
            MatchCollection mc = regex.Matches(line);
            Console.WriteLine($"How many matches: {mc.Count}");
            foreach (Match match in mc)
                Console.WriteLine($"Index: {match.Index}");
        }

        public static ConsoleKeyInfo Pause(string message = "please press ANY key to continue...")
        {
            Console.WriteLine(message);
            return Console.ReadKey();
        }
    }
}
Thanks for the help, guys. It would be nice if someone smarter than me could check this code in terms of performance.
Regards,
Nerus.

My approach would be to split the string into an array of strings, each containing one line. If the length of the array is 1, you have only one line. From there, use the Regex to match each line and find the invalid character you are looking for.
string input_text = "Some Înput text here,\nÎs another lÎne of thÎs text.";
string line_pattern = "\n";
// split the string into string arrays
string[] input_texts = input_text.Split(new string[] { line_pattern }, StringSplitOptions.RemoveEmptyEntries);
string invalid_character = "Î";
if (input_texts != null && input_texts.Length > 0)
{
    if (input_texts.Length == 1)
    {
        Console.WriteLine("Is a single line file");
    }
    // loop every line
    foreach (string oneline in input_texts)
    {
        Regex regex = new Regex(invalid_character);
        MatchCollection mc = regex.Matches(oneline);
        Console.WriteLine("How many matches: {0}", mc.Count);
        foreach (Match match in mc)
        {
            Console.WriteLine("Index: {0}", match.Index);
        }
    }
}
--- EDIT ---
Things to consider:
If you get your input from a file, I would recommend reading it line by line rather than loading the whole text.
Usually, when you search for invalid characters, you don't specify them individually. Instead, you look for a pattern, for example: any character that is not in a-z, A-Z, 0-9. Your regex will then be a little different.
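The two points above can be combined into one small helper. The following is only a sketch: the class name InvalidCharScanner and the choice of "printable ASCII" as the set of valid characters are assumptions made for illustration, not something the question specifies. It reads from any TextReader one line at a time and reports, for each line that contains invalid characters, the zero-based line index and the index of every invalid character.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

public static class InvalidCharScanner
{
    // Assumption: "valid" means printable ASCII (space through tilde).
    // Swap in your own negated character class as needed.
    private static readonly Regex Invalid = new Regex(@"[^\u0020-\u007E]");

    // Yields (line index, indexes of invalid characters) for every line that has any.
    public static IEnumerable<(int Line, List<int> Columns)> Scan(TextReader reader)
    {
        string line;
        int lineIndex = 0;
        while ((line = reader.ReadLine()) != null)
        {
            var columns = new List<int>();
            foreach (Match m in Invalid.Matches(line))
            {
                columns.Add(m.Index);
            }
            if (columns.Count > 0)
            {
                yield return (lineIndex, columns);
            }
            lineIndex++; // count every line, including clean or empty ones
        }
    }
}
```

For a file, pass File.OpenText(path) as the reader; for the sample string from the question, a StringReader gives line 0 with index 5 and line 1 with indexes 0, 12, 21. Because the regex is created once and lines are read lazily, the whole document never has to be held in memory.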

Related

Getting a list of strings by splitting a string by a specific tag

I would like to split a string into a list or array by a specific tag.
<START><A>message<B>UnknownLengthOfText<BEOF><AEOF><A>message<B>UnknownLengthOfText<BEOF><AEOF><END>
I want to split the above example into two items, the items being the strings between the <A> and <AEOF> tags.
Any help is appreciated.
I would suggest a simple regex for this.
Take a look at this example:
using System.Diagnostics;
using System.Text.RegularExpressions;
...
Regex regex = new Regex("<A>(.*?)<B><BEOF>(.*?)<AEOF>");
string myString = @"<START><A>message<B><BEOF>UnknownLengthOfText<AEOF><A>message<B><BEOF>some other line of text<AEOF><END>";
MatchCollection matches = regex.Matches(myString);
foreach (Match m in matches)
{
    // Group 1 is the text before <B><BEOF>, group 2 is the text after it.
    // A single formatted string avoids the (message, category) overload of Debug.WriteLine.
    Debug.WriteLine($"{m.Groups[1].Value}: {m.Groups[2].Value}");
}
EDIT:
Since the string is on one line, the regex should be "lazy", which is marked with the lazy quantifier ?. I also changed the regex to follow sTrenat's suggestion, so that it parses the message and the title automatically as well.
So, instead of
Regex regex = new Regex("<A>(.*)<AEOF>");
I used
Regex regex = new Regex("<A>(.*?)<B><BEOF>(.*?)<AEOF>");
Notice the additional ?, which marks the quantifier as lazy so that it stops at the first match between the tags (without the ?, the whole string is captured instead of the n messages between the tags).
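To see the difference between the greedy and the lazy quantifier in isolation, here is a minimal sketch. The class name TagExtractor and the shortened test string are made up for this illustration; only Regex.Matches and the two patterns come from the discussion above.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagExtractor
{
    // Returns the first capture group of every match of the given pattern.
    public static List<string> Extract(string input, string pattern)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            result.Add(m.Groups[1].Value);
        }
        return result;
    }
}
```

With the input "<START><A>one<AEOF><A>two<AEOF><END>", the greedy pattern "<A>(.*)<AEOF>" returns a single item, "one<AEOF><A>two", because .* runs to the last <AEOF>. The lazy pattern "<A>(.*?)<AEOF>" returns two items, "one" and "two".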
Try it with something like this:
using System;
using System.Collections.Generic;

string test = @"<START>
<A>message<B><BEOF>UnknownLengthOfText<AEOF>
<A>message<B><BEOF>UnknownLengthOfText<AEOF>
<END>";
// For this test, this gives you an array containing 3 items.
// The array overload of Split is used because it is available on every .NET version.
string[] tmp1 = test.Split(new[] { "<AEOF>" }, StringSplitOptions.None);
// Store your results here
List<string> results = new List<string>();
// For each of those 3 items:
foreach (string item in tmp1)
{
    // This is only true for the first and second item
    if (item.Contains("<A>"))
    {
        string[] tmp2 = item.Split(new[] { "<A>" }, StringSplitOptions.None);
        // The string you are looking for is always BEHIND the <A>, so
        // store tmp2[1] (tmp2[0] is whatever came in front of it)
        results.Add(tmp2[1]);
    }
}
Rather than using String.Split, you can use Regex.Split as below:
var stringToSplit = @"<START>
<A>message<B>UnknownLengthOfText<BEOF><AEOF>
<A>message<B>UnknownLengthOfText<BEOF><AEOF>
<END>";
var regex = "<A>(.*)<AEOF>";
var splitStrings = Regex.Split(stringToSplit, regex);
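splitStrings will contain 5 elements, because Regex.Split also returns the text between the matches, which here is the line breaks:
splitStrings[0] = "<START>" followed by a line break
splitStrings[1] = "message<B>UnknownLengthOfText<BEOF>"
splitStrings[2] = the line break between the two matches
splitStrings[3] = "message<B>UnknownLengthOfText<BEOF>"
splitStrings[4] = a line break followed by "<END>"
Playing with the regex could give you only the strings between <A> and <AEOF>.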
All answers so far are regex-based. Here is an alternative without regex:
using System;
using System.Collections.Generic;
using System.Linq;

public class Program
{
    public static void Main()
    {
        var input = @"
<START>
<A>message<B>UnknownLengthOfText<BEOF><AEOF>
<A>message<B>UnknownLengthOfText<BEOF><AEOF>
<END>";
        var start = "<A>";
        var end = "<AEOF>";
        foreach (var item in ExtractEach(input, start, end))
        {
            Console.WriteLine(item);
        }
    }

    public static IEnumerable<string> ExtractEach(string input, string start, string end)
    {
        // >= 0 (not > 0) so that lines beginning with the start tag are included
        foreach (var line in input
            .Split(Environment.NewLine.ToCharArray())
            .Where(x => x.IndexOf(start) >= 0 && x.IndexOf(start) < x.IndexOf(end)))
        {
            yield return Extract(line, start, end);
        }
    }

    // Returns the text between the last start tag and the first end tag of the line
    public static string Extract(string input, string start, string end)
    {
        int startPosition = input.LastIndexOf(start) + start.Length;
        int length = input.IndexOf(end) - startPosition;
        var substring = input.Substring(startPosition, length);
        return substring;
    }
}

Print the regex pattern where the string becomes invalid

Given the regular expression
^(aa|bb){1}(a*)(ab){1}$
For the language,
All strings that start with a double letter and end with the substring ab
I would like to know whether it is possible to print the part of the regex where the string becomes invalid. This is related to regular expressions in finite automata.
For example, I have the following set of invalid input strings:
abaa
aabb
aaba
I want output like this:
abaa ^(aa|bb){1}
aabb ^(aa|bb){1}(a*)
aaba ^(aa|bb){1}(a*)(ab){1}$
You can create a Regex from a string; if the pattern is malformed, the constructor throws an exception. You can write a loop that takes successive substrings of the pattern and tries to create a regex from each one; if that fails, just continue.
Once you have a Regex, you can test for a match and store the last pattern that matched the input. So it would be something like this:
public static string FindBestValidRegex(string input, string pattern)
{
    var lastMatch = "";
    for (int i = 0; i < pattern.Length; i++)
    {
        try
        {
            var partialPattern = pattern.Substring(0, i + 1);
            var regex = new Regex(partialPattern);
            if (regex.IsMatch(input))
            {
                lastMatch = partialPattern;
            }
        }
        catch (ArgumentException)
        {
            // The partial pattern is not a valid regex (e.g. an unclosed group); skip it
        }
    }
    return lastMatch;
}
Testing:
static void Main(string[] args)
{
    var pattern = @"^(aa|bb){1}(a*)(ab){1}$";
    Console.WriteLine(FindBestValidRegex("bbb", pattern));
    Console.WriteLine(FindBestValidRegex("aabb", pattern));
    Console.WriteLine(FindBestValidRegex("aaab", pattern));
    Console.WriteLine(FindBestValidRegex("bbaab", pattern));
    Console.ReadKey();
}
Output:
^(aa|bb){1}(a*)
^(aa|bb){1}(a*)
^(aa|bb){1}(a*)(ab){1}$
^(aa|bb){1}(a*)(ab){1}$
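A variant of the same idea, shown only as a sketch: if you are willing to split the pattern into its logical parts by hand, you can test the growing prefix part by part instead of character by character, which avoids building invalid patterns and catching exceptions. The class name PrefixMatcher and the way the pattern is cut into segments are assumptions made for this example.

```csharp
using System.Text.RegularExpressions;

public static class PrefixMatcher
{
    // Concatenates the segments one by one and returns the longest prefix that
    // still matches the input, or an empty string if not even the first one matches.
    public static string LongestMatchingPrefix(string input, params string[] segments)
    {
        string lastMatch = "";
        string current = "";
        foreach (string segment in segments)
        {
            current += segment;
            if (Regex.IsMatch(input, current))
            {
                lastMatch = current;
            }
        }
        return lastMatch;
    }
}
```

Called as PrefixMatcher.LongestMatchingPrefix("aabb", "^(aa|bb){1}", "(a*)", "(ab){1}$"), it returns ^(aa|bb){1}(a*), the same result as the character-based loop above. Like that loop, it reports the last part that matched rather than the first part that failed.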

dividing richtextbox lines - console app into windows forms

I made a console program that reads a file, splits every line into two texts separated by ":", and checks whether they meet the requirements. Every line in the file has the syntax "xxxxx:xxxxx".
using System;
using System.Text.RegularExpressions;

namespace ConsoleApp1
{
    class Program
    {
        static void Main(string[] args)
        {
            string textone, texttwo, filename;
            Regex reg = new Regex(@"\W"); // all characters except A-Z, a-z, 0-9 (and underscore)
            Regex numb = new Regex("[a-zA-Z]");
            Regex numbek = new Regex("[0-9]");
            Regex donjacrta = new Regex("_"); // underscore
            filename = Console.ReadLine(); // should load richtextbox instead
            string[] linije = System.IO.File.ReadAllLines(filename);
            for (int i = 0; i < linije.Length; i++)
            {
                string trenutni = linije[i]; // current line
                int indexx = trenutni.IndexOf(':');
                if (indexx < 0)
                {
                    continue; // no ':' separator, so Substring below would throw
                }
                textone = trenutni.Substring(0, indexx);
                texttwo = trenutni.Substring((indexx + 1), (trenutni.Length) - (indexx + 1));
                if (textone.Length < 3 || textone.Length > 25 || reg.IsMatch(textone) || donjacrta.IsMatch(textone) || !numb.IsMatch(textone) || !numbek.IsMatch(textone))
                {
                    continue;
                }
                else if (texttwo.Length < 3 || texttwo.Length > 30)
                {
                    continue;
                }
                else
                {
                    Console.WriteLine(textone + ":" + texttwo);
                }
            }
        }
    }
}
(When I try to format the code here, it deletes/hides some of the code; I don't know why.)
In my Windows Forms app, I first load the file into a RichTextBox. From there I need to connect it somehow and make it either:
clear the whole RichTextBox and then write only the valid lines, or
delete the invalid lines.
Brief
It took me some time to understand what you're trying to do and what you want, but, I think I found your solution. You can pretty much remove all the code you've written and replace it with a single regular expression.
What I believe you are trying to do:
Match strings that are split by : (i.e. xxxxx:xxxxx - where x is defined below) and each string consumes a single row in a text file
Ensure both sections (before and after the colon :) match a-zA-Z0-9 ONLY (no other character)
Ensure the first section is between 3 and 25 characters
Ensure the second section is between 3 and 30 characters
Code
^([a-zA-Z\d]{3,25}):([a-zA-Z\d]{3,30})$
For a sample C# program, you can use the following. Obviously, you would replace the logic to pull from a text file rather than a string.
using System;
using System.Text.RegularExpressions;

public class Example
{
    public static void Main()
    {
        string pattern = @"^([a-zA-Z\d]{3,25}):([a-zA-Z\d]{3,30})$";
        string input = @"xxxxx:xxxxx
1adfasfdfasdfsdfsfsfssfsd:1asfdsfsdfsafsdfsadfdfsdfadf2s
sfsd12321:12sfs3123
#342fdfasd:1dsadafdsfs";
        RegexOptions options = RegexOptions.Multiline;
        foreach (Match m in Regex.Matches(input, pattern, options))
        {
            Console.WriteLine("'{0}' found at index {1}.", m.Value, m.Index);
        }
    }
}
Results
Input
xxxxx:xxxxx
1adfasfdfasdfsdfsfsfssfsd:1asfdsfsdfsafsdfsadfdfsdfadf2s
sfsd12321:12sfs3123
#342fdfasd:1dsadafdsfs
Output
The matched strings are listed below.
Full match: xxxxx:xxxxx
Group 1: xxxxx
Group 2: xxxxx
Full match: 1adfasfdfasdfsdfsfsfssfsd:1asfdsfsdfsafsdfsadfdfsdfadf2s
Group 1: 1adfasfdfasdfsdfsfsfssfsd
Group 2: 1asfdsfsdfsafsdfsadfdfsdfadf2s
Full match: sfsd12321:12sfs3123
Group 1: sfsd12321
Group 2: 12sfs3123
Explanation
Assert position at the beginning of the line
Match and capture into group 1: Any character in the set a-zA-Z\d between 3 and 25 times
Match a colon :
Match and capture into group 2: Any character in the set a-zA-Z\d between 3 and 30 times
Assert position at the end of the line
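To connect this to the Windows Forms side of the question, one option is to apply the same pattern to each line separately and keep only the lines that match. The sketch below is an assumption-laden illustration: the class name LineFilter is made up, and it works on any sequence of strings so it can be tried without a form.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class LineFilter
{
    // Same pattern as above; each line is tested on its own, so no Multiline option is needed.
    private static readonly Regex ValidLine =
        new Regex(@"^([a-zA-Z\d]{3,25}):([a-zA-Z\d]{3,30})$");

    // Returns only the lines that have the form xxxxx:xxxxx with the allowed lengths.
    public static string[] KeepValid(IEnumerable<string> lines)
    {
        return lines.Where(line => ValidLine.IsMatch(line)).ToArray();
    }
}
```

In a form, assuming the control is named richTextBox1, something like richTextBox1.Lines = LineFilter.KeepValid(richTextBox1.Lines); would replace the content with the valid lines only. Testing line by line also sidesteps the fact that in .NET the $ anchor with RegexOptions.Multiline matches before \n but not before \r, which can matter for text with Windows line endings.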

Get Removed characters from string

I am using Regex to remove unwanted characters from a string, like below:
str = System.Text.RegularExpressions.Regex.Replace(str, @"[^\u0020-\u007E]", "");
How can I efficiently retrieve the distinct characters that will be removed?
EDIT:
Sample input : str = "This☺ contains Åüsome æspecialæ characters"
Sample output : str = "This contains some special characters"
removedchar = "☺,Å,ü,æ"
// The negated class matches exactly the characters that the Replace call removes
string pattern = @"[^\u0020-\u007E]";
Regex rgx = new Regex(pattern);
List<string> matches = new List<string>();
foreach (Match match in rgx.Matches(str))
{
    if (!matches.Contains(match.Value))
    {
        matches.Add(match.Value);
    }
}
Here is an example how you can do it with a callback method inside the Regex.Replace overload with an evaluator:
evaluator
Type: System.Text.RegularExpressions.MatchEvaluator
A custom method that examines each match and returns either the original matched string or a replacement string.
C# demo:
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public class Test
{
    public static List<string> characters = new List<string>();

    public static void Main()
    {
        var str = Regex.Replace("§My string 123”˝", "[^\u0020-\u007E]", Repl);
        Console.WriteLine(str); // => My string 123
        Console.WriteLine(string.Join(", ", characters)); // => §, ”, ˝
    }

    public static string Repl(Match m)
    {
        characters.Add(m.Value);
        return string.Empty;
    }
}
In short, declare a "global" variable (a list of strings, here, characters), initialize it. Add the Repl method to handle the replacement, and when Regex.Replace calls that method, add each matched value to the characters list.
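If you would rather not keep shared state in a static list, a LINQ-based variant is possible. This is a sketch only: the class name RemovedChars and its two method names are made up for this example, and it scans the input twice (once to collect, once to replace).

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class RemovedChars
{
    // Matches everything outside printable ASCII, as in the question.
    private static readonly Regex NonPrintableAscii = new Regex(@"[^\u0020-\u007E]");

    // Distinct characters that Clean would remove, in order of first appearance.
    public static List<string> DistinctRemoved(string input)
    {
        return NonPrintableAscii.Matches(input)
            .Cast<Match>()
            .Select(m => m.Value)
            .Distinct()
            .ToList();
    }

    // The cleaned string, with every matched character removed.
    public static string Clean(string input)
    {
        return NonPrintableAscii.Replace(input, "");
    }
}
```

For the sample input from the question, string.Join(",", RemovedChars.DistinctRemoved(str)) gives "☺,Å,ü,æ" and RemovedChars.Clean(str) gives "This contains some special characters".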

Get only wild card value using regular expression

I want to extract only the wildcard values using regular expressions in .NET (C#).
For example, if I use a pattern like Book_* (so it behaves like a directory wildcard), it should extract the values that match *.
For Example:
For a string "Book_1234" and pattern "Book_*"
I want to extract "1234"
For a string "Book_1234_ABC" and pattern "Book_*_*"
I should be able to extract 1234 and ABC
This should do it:
string input = "Book_1234_ABC";
MatchCollection matches = Regex.Matches(input, @"_([A-Za-z0-9]*)");
foreach (Match m in matches)
    if (m.Success)
        Console.WriteLine(m.Groups[1].Value);
The approach to your scenario would be to:
Get the list of strings that appear between the wildcards (*).
Join the list with the regex alternation operator (|).
Replace every match of that regular expression with a character you do not expect in your string (I suppose a space should be adequate here).
Trim and then split the returned string by the character you used in the previous step, which gives you the list of wildcard values.
using System.Linq;
using System.Text.RegularExpressions;

var str = "Book_1234_ABC";
var inputPattern = "Book_*_*";
var patterns = inputPattern.Split('*');
// A trailing '*' leaves an empty last entry; drop it
if (patterns.Last().Equals(""))
    patterns = patterns.Take(patterns.Length - 1).ToArray();
string expression = string.Join("|", patterns);
var wildCards = Regex.Replace(str, expression, " ").Trim().Split(' ');
I would first convert the '*' wildcard into an equivalent regex, i.e.:
* becomes \w+
Then I use this regex to extract the matches.
When I run this code using your input strings:
using System;
using System.Text.RegularExpressions;

namespace SampleApplication
{
    public class Test
    {
        static Regex reg = new Regex(@"Book_([^_]+)_*(.*)");

        static void DoMatch(String value)
        {
            Console.WriteLine("Input: " + value);
            foreach (Match item in reg.Matches(value))
            {
                for (int i = 0; i < item.Groups.Count; ++i)
                {
                    Console.WriteLine(String.Format("Group: {0} = {1}", i, item.Groups[i].Value));
                }
            }
            Console.WriteLine("\n");
        }

        static void Main(string[] args)
        {
            // For a string "Book_1234" and pattern "Book_*" I want to extract "1234"
            DoMatch("Book_1234");
            // For a string "Book_1234_ABC" and pattern "Book_*_*" I should be able to extract 1234 and ABC
            DoMatch("Book_1234_ABC");
        }
    }
}
I get this console output:
Input: Book_1234
Group: 0 = Book_1234
Group: 1 = 1234
Group: 2 =
Input: Book_1234_ABC
Group: 0 = Book_1234_ABC
Group: 1 = 1234
Group: 2 = ABC
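The conversion step described above ("* becomes \w+") can also be done programmatically instead of hard-coding the regex. The following is a sketch under a few assumptions: the class name WildcardExtractor is made up, the pattern is anchored to the whole input, and each * is turned into a lazy capturing group (\w+?) so that several wildcards split the text at the literal separators between them.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class WildcardExtractor
{
    // Returns the text matched by each '*' in the wildcard pattern,
    // or an empty list if the input does not match the pattern at all.
    public static List<string> Extract(string input, string wildcardPattern)
    {
        // Escape everything literally, then turn each escaped '*' into a capture group.
        string regexPattern =
            "^" + Regex.Escape(wildcardPattern).Replace(@"\*", @"(\w+?)") + "$";
        Match match = Regex.Match(input, regexPattern);
        if (!match.Success)
        {
            return new List<string>();
        }
        // Group 0 is the whole match; the wildcard values start at group 1.
        return match.Groups.Cast<Group>().Skip(1).Select(g => g.Value).ToList();
    }
}
```

With this, WildcardExtractor.Extract("Book_1234_ABC", "Book_*_*") gives 1234 and ABC, and WildcardExtractor.Extract("Book_1234", "Book_*") gives 1234. Note that \w also matches the underscore, so a single * can span several underscore-separated parts when the pattern has fewer wildcards than the input has parts.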
