Regex in C# to process a text - c#

I am trying to strip part of each string and keep only a small piece of it.
I am very new to regex; I read an article but did not understand it very well.
Here is an example of my text (each line is a separate string object):
2015-03-08 10:30:00 /user841/column-width
2015-03-08 10:30:01 /user849/connect
2015-03-08 10:30:01 /user262/open-level2-price/some other text
2015-03-08 10:30:01 /user839/open-detailed-quotes
I want to process them using regex in c# and have the following output:
column-width
connect
open-level2-price/some other text
open-detailed-quotes
I have used the following line to do that but it throws an exception:
Match match = Regex.Match(line, @"*./user\d+/*.");
The Exception:
System.ArgumentException: 'parsing "*./user\d+/*." - Quantifier {x,y} following nothing.'
Could anyone help, please?

The error you get is caused by the fact that you try to quantify the start of the pattern, which is considered an error in a .NET regex. Perhaps, you meant to use .* instead of the *. (to match any 0+ chars greedily, as many as possible), but it is certainly not what you need judging by the expected results.
You need
/user\d+/(.*)
See the regex demo
Details:
/user - a literal substring /user
\d+ - 1 or more digits (use RegexOptions.ECMAScript option to only match ASCII digits with \d in a .NET regex)
/ - a literal /
(.*) - A capturing group #1 that matches any 0+ chars other than a newline (replace * with + to match at least 1 char).
C#:
var results = Regex.Matches(s, @"/user\d+/(.*)")
    .Cast<Match>()
    .Select(m => m.Groups[1].Value)
    .ToList();
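As a minimal sketch, here is the pattern above applied to single lines from the question. The `PathExtractor` class and `ExtractPath` method are names made up for this illustration, and returning an empty string for a non-matching line is an assumption.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PathExtractor
{
    // Returns the text after "/user<digits>/", or an empty string
    // when the line does not contain that prefix (assumed behaviour).
    public static string ExtractPath(string line)
    {
        Match m = Regex.Match(line, @"/user\d+/(.*)");
        return m.Success ? m.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        string[] lines =
        {
            "2015-03-08 10:30:00 /user841/column-width",
            "2015-03-08 10:30:01 /user262/open-level2-price/some other text"
        };
        foreach (string line in lines)
            Console.WriteLine(ExtractPath(line));
        // prints:
        // column-width
        // open-level2-price/some other text
    }
}
```

Because the group is `(.*)`, any further `/` after the user segment stays in the result, which is what the third sample line needs.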

Instead of using Regex, just split on the '/' character and use the last index of the array (using LINQ):
string inputString = "2015-03-08 10:30:01 /user262/open-level2-price";
inputString.Split('/').Last();
Split returns an array of strings, in your case with the sample input above the string array would look like:
array[0] = "2015-03-08 10:30:01 "
array[1] = "user262"
array[2] = "open-level2-price"
You indicate you always want the last part so just use LINQ to take the .Last() index of the array.
Fiddle here
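One caveat: for the question's third sample line, which has an extra `/` in the part to keep, `Split('/').Last()` returns only `some other text`. A possible workaround, sketched below, is to cap the number of parts at three. It assumes the timestamp part never contains a `/`; the `SplitExample` and `LastPart` names are made up for the example.

```csharp
using System;

public static class SplitExample
{
    // Splits into at most three parts, so every '/' after the
    // user segment stays inside the last part.
    public static string LastPart(string line)
    {
        string[] parts = line.Split(new[] { '/' }, 3);
        return parts[parts.Length - 1];
    }

    public static void Main()
    {
        Console.WriteLine(LastPart("2015-03-08 10:30:01 /user262/open-level2-price/some other text"));
        // prints: open-level2-price/some other text
    }
}
```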

Here's a simple example of how to use the Regex.Replace static method.
https://dotnetfiddle.net/JuUF9E
using System;
using System.Text.RegularExpressions;
public class Program
{
public static void Main()
{
string[] lines = new string[] {
"2015-03-08 10:30:00 /user841/column-width",
"2015-03-08 10:30:01 /user849/connect",
"2015-03-08 10:30:01 /user262/open-level2-price",
"2015-03-08 10:30:01 /user839/open-detailed-quotes"
};
string pattern = @"(.*/.*/)(.*)";
string replacement = "$2";
foreach(var line in lines)
{
Console.WriteLine(Regex.Replace(line, pattern, replacement));
}
}
}

I don't know why you're trying to do this simple thing with regex; you just have to read the lines, split each one on '/', then take the last element, and that's it. For example, if you have that data in a file you can use something like this:
// requires: using System; using System.IO;
string newString = "";
using (StreamReader sr = new StreamReader("log.txt"))
{
    string line;
    // ReadLine returns null at the end of the file
    while ((line = sr.ReadLine()) != null)
    {
        string[] splitted = line.Split('/');
        if (splitted.Length > 0)
            newString += splitted[splitted.Length - 1] + Environment.NewLine;
    }
}
At the end, the newString variable will contain what you want, one value per line. Alternatively, you can add every value to a list if you need to do something more with the data.

How about using a lookaround?
var line = "2015-03-08 10:30:01 /user839/open-detailed otes/dsada/dsa/das/dsadsa";
// dsadsa
var match = Regex.Match(line, @"(?!.*/).*").Value;

Related

RegEx for capturing a word in between = and ;

I want to select word2 from the following :
word2;word3
word2 that is between ; and start of the line unless there is a = in between. In that case, I want start from the = instead of the start of the line
like word2 from
word1=word2;word3
I have tried using this regex
(?<=\=|^).*?(?=;)
which select the word2 from
word2;word3
but also the whole word1=word2 from
word1=word2;word3
You can use an optional group to check for a word followed by an equals sign and capture the value in the first capturing group:
^(?:\w+=)?(\w+);
Explanation
^ Start of string
(?:\w+=)? Optional non capturing group matching 1+ word chars followed by =
(\w+) Capture in the first capturing group 1+ word chars
; Match ;
See a regex demo
In .NET you might also use:
(?<=^(?:\w+=)?)\w+(?=;)
Regex demo | C# demo
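A small sketch of the capturing-group pattern above, run against both inputs from the question. The `WordExtractor` class and `ExtractWord` method are illustrative names, not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordExtractor
{
    // Group 1 holds the word right before ';', after an optional "name=" prefix.
    public static string ExtractWord(string input)
    {
        Match m = Regex.Match(input, @"^(?:\w+=)?(\w+);");
        return m.Success ? m.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        Console.WriteLine(ExtractWord("word2;word3"));       // prints: word2
        Console.WriteLine(ExtractWord("word1=word2;word3")); // prints: word2
    }
}
```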
There are many ways to do this, and regular expressions are probably among the last ones to reach for.
But, if we wish to use an expression for this problem, let's start with a simple one and explore other options, maybe something similar to:
(.+=)?(.+?);
or
(.+=)?(.+?)(?:;.+)
where the second capturing group has our desired word2.
Demo 1
Demo 2
Example 1
using System;
using System.Text.RegularExpressions;
public class Example
{
public static void Main()
{
string pattern = @"(.+=)?(.+?);";
string input = @"word1=word2;word3
word2;word3";
RegexOptions options = RegexOptions.Multiline;
foreach (Match m in Regex.Matches(input, pattern, options))
{
Console.WriteLine("'{0}' found at index {1}.", m.Value, m.Index);
}
}
}
Example 2
using System;
using System.Text.RegularExpressions;
public class Example
{
public static void Main()
{
string pattern = @"(.+=)?(.+?)(?:;.+)";
string substitution = @"$2";
string input = @"word1=word2;word3
word2;word3";
RegexOptions options = RegexOptions.Multiline;
Regex regex = new Regex(pattern, options);
string result = regex.Replace(input, substitution);
}
}
Instead of using regular expressions you can solve the problem with String class methods.
string[] words = str.Split(';');
string word2 = words[0].Substring(words[0].IndexOf('=') + 1);
The first line splits the input on ';'. Assuming there is only a single ';', this splits the line into two strings. The second line returns the substring of the first part (words[0]) that starts at the character after the first occurrence of '=' (words[0].IndexOf('=') + 1) and runs to the end. If the line has no '=' character, the substring simply starts at the beginning, because IndexOf returns -1 and adding 1 gives index 0.
Related documentation:
https://learn.microsoft.com/en-us/dotnet/api/system.string.split?view=netframework-4.8
https://learn.microsoft.com/en-us/dotnet/api/system.string.substring?view=netframework-4.8
https://learn.microsoft.com/en-us/dotnet/api/system.string.indexof?view=netframework-4.8

Parsing File with C# And Replace method

I'm trying to parse a bunch of files with the string Replace method. While it does what I expect, it does not feel practical: I will process 10K files, and in the first 72 alone I found about 30 values that need to be replaced. This is the rule:
My goal:
My goal is to replace all instances of ':' that don't follow these rules:
1- The 2nd or 3rd character forward is not another ':'
2- The 3rd or 2nd character backward is not another ':'
All others should be replaced.
In other words, any time I find this character (:) and it is not preceded by two or three characters as in :00: or :12A:, I should replace it with an asterisk (*).
This is the method that I have so far.....
private static string cleanMesage(string str)
{
string result = String.Empty;
try
{
result = str.Replace("BNF:", "BNF*").Replace("B/O:", "B/O*").Replace("O/B:", "O/B*");
result = result.Replace("Epsas:", "Epsas*").Replace("2017:", "2017*").Replace("BANK:", "BANK*");
result = result.Replace("CDT:", "CDT*").Replace("ENT:", "").Replace("GB22:", "GB22*");
result = result.Replace("A / C:", "A/C*").Replace("ORD:", "ORD*").Replace("A/C:", "A/C*");
result = result.Replace("REF:", "REF*").Replace("ISIN:", "ISIN*").Replace("PAY:", "PAY*");
result = result.Replace("DEPOSITO:", "DEPOSITO*").Replace("WITH:", "WITH*");
result = result.Replace("Operaciones:", "Operaciones*").Replace("INST:", "INST*");
result = result.Replace("DETAIL:", "DETAIL*").Replace("WITH:", "WITH*").Replace("BO:", "BO*");
result = result.Replace("CUST:", "CUST*").Replace("ISIN:", "ISIN*").Replace("SEDL:", "SEDL*");
result = result.Replace("Enero:", "Enero*").Replace("enero:", "Enero*");
result = result.Replace("agosto:", "agosto*").Replace("febrero:", "febrero*");
result = result.Replace("marzo:", "marzo*").Replace("abril:", "abril*");
result = result.Replace("mayo:", "mayo*").Replace("junio:", "junio*").Replace("RE:", "RE:*");
result = result.Replace("julio:", "julio*").Replace("septiembre:", "septiembre*");
result = result.Replace("NIF:", "NIF*").Replace("INST:", "INST*").Replace("SHS:", "SHS*")
.Replace("SK:", "");
result = result.Replace("PARTY:", "PARTY*").Replace("SEDOL:", "SEDOL*").Replace("PD:", "PD*");
}
catch (Exception e)
{
}
return result;
}
And this is some sample data:
:13: <-- keep /ISIN/XS SVUNSK UXPORTKRUDIT ZX PZY DZTU:<- replace UX DZ
TU:<- replace02ZUG12 RZTU:<- replace W/H TZX RZTU:<- replace0.00000 SHZRUS PZID:<- replace
0.000000 IDDSIN:<- replace
:31: <-- keep 1201000100CD05302,24NSUC20523531001//00520023531014
:13: <-- keep /ISIN/XS0153242003 SVUNSK UXPORTKRUDIT ZX PZY DZTU:<- replace00ZUG12 UX DZ
TU:02ZUG12 RZTU:0.30241 W/H TZX RZTU:<- replace0.00000 SHZRUS PZID:<- replace
0.000000 ISIN:XS0153242003
:31: <-- keep 1201000100DD121253,25S202IMSSMSZUX534C//S0322211DF4301
S F/O 0150001400
:13: <-- keep XNF:<- replace this
If your goal is to replace all instances of the ':' character where it is not followed by 2 or 3 other characters, you could indeed try the System.Text.RegularExpressions library. You could then simplify your cleanMessage function in the following way.
using System.Text.RegularExpressions;
private static string cleanMessage(string str)
{
    string pattern = @":(\s)"; // a ':' followed by a whitespace character
    Regex rgx = new Regex(pattern);
    string replaceResult = rgx.Replace(str, "*$1"); // replaces the ':' with a '*' and keeps the whitespace
    return replaceResult;
}
If your goal is to replace all instances of the ':' character where it is not followed by 2 or 3 other characters and the 2nd or 3rd character forward or backward is not another ':', you could change your cleanMessage to the following instead.
using System.Text.RegularExpressions;
private static string cleanMessage(string str)
{
    string pattern = @"([^:]{2}.):(\s[^:]{2})";
    // 2 characters that cannot be ':' followed by any character, then a ':' followed by whitespace and 2 more characters that cannot be ':'
    // For instance, "BNF: :F" would FAIL and not get replaced, but "BNF: HH" would pass and become "BNF* HH"
    Regex rgx = new Regex(pattern);
    string replaceResult = rgx.Replace(str, "$1*$2"); // replaces the ':' with a '*'
    return replaceResult;
}
More information on the System.Text.RegularExpressions library replace can be found at
https://msdn.microsoft.com/en-us/library/xwewhkd1(v=vs.110).aspx
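As @dymanoid mentioned, regular expressions are a way to handle this. The following gets you most of the way (note that it also rewrites the closing colon of a tag such as :13:, because "13" is two alphanumeric characters followed by a colon):
result = Regex.Replace(str, @"([a-zA-Z0-9]{2,3}):", "$1*");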
However for large datasets this won't perform well. In that case I'd look at walking through str character by character using a for-loop. If the current character is not a colon, add it to the result string and to a temporary string. When the current character is a colon (:) and the temporary string has a length of 2 or 3, write an asterisk to the result and clear the temporary string.
In this case you don't do any string replacement, you just select what to write to a new string.
See here for a speed comparison between string replacement and regex replacement.
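One way to express the "keep the tag colons, replace the rest" rule directly is with lookarounds. This is only a sketch under an assumption drawn from the question's examples (`:00:`, `:12A:`): a tag is a colon, then 2 or 3 non-colon characters, then a colon. The `ColonCleaner` class name is invented for the example, and the pattern has not been checked against the full data set.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ColonCleaner
{
    // A ':' is left alone when it opens a tag (2-3 non-colon characters
    // and then ':' follow it) or closes one (':' plus 2-3 non-colon
    // characters sit right before it). Every other ':' is matched.
    private static readonly Regex StrayColon =
        new Regex(@"(?<!:[^:]{2,3}):(?![^:]{2,3}:)");

    public static string CleanMessage(string str)
    {
        return StrayColon.Replace(str, "*");
    }

    public static void Main()
    {
        Console.WriteLine(CleanMessage(":13: /ISIN/XS PZY DZTU: UX"));
        // prints: :13: /ISIN/XS PZY DZTU* UX
    }
}
```

A limitation of this sketch: any two colons that happen to sit 2 or 3 characters apart in free text (for example `A:BC:D`) are treated as a tag and kept.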

How to use (?!...) regex pattern to skip the whole unmatched part?

I would like to use the ((?!(SEPARATOR)).)* regex pattern for splitting a string.
using System;
using System.Text.RegularExpressions;
public class Program
{
public static void Main()
{
var separator = "__";
var pattern = String.Format("((?!{0}).)*", separator);
var regex = new Regex(pattern);
foreach (var item in regex.Matches("first__second"))
Console.WriteLine(item);
}
}
It works fine when the SEPARATOR is a single character, but when it is longer than 1 character I get an unexpected result. In the code above the second matched string is "_second" instead of "second". How should I modify my pattern to skip the whole unmatched separator?
My real problem is to split lines where I should skip line separators inside quotes. My line separator is not a predefined value and it can be for example "\r\n".
You can do something like this:
using System;
using System.Text.RegularExpressions;
public class Example
{
public static void Main()
{
string input = "plum--pear";
string pattern = "-"; // Split on hyphens
string[] substrings = Regex.Split(input, pattern);
foreach (string match in substrings)
{
Console.WriteLine("'{0}'", match);
}
}
}
// The method displays the following output:
// 'plum'
// ''
// 'pear'
The .NET regex does not support matching a piece of text other than a specific multicharacter string. In PCRE, you would use the (*SKIP)(*FAIL) verbs, but they are not supported in the native .NET regex library. You could use PCRE.NET, but .NET regex can usually handle those scenarios well with Regex.Split.
If you need to, say, match all but [anything here], you could use
var res = Regex.Split(s, @"\[[^][]*]").Where(m => !string.IsNullOrEmpty(m));
If the separator is a simple literal fixed string like __, just use String.Split.
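For the `__` case from the question, both routes can be sketched as follows. The helper names are made up for the example. `Regex.Escape` is used so that a separator containing regex metacharacters is still treated literally.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SeparatorSplit
{
    // Plain string split on a multi-character literal separator.
    public static string[] SplitLiteral(string input, string separator)
    {
        return input.Split(new[] { separator }, StringSplitOptions.None);
    }

    // Regex-based split; the separator is escaped so it matches literally.
    public static string[] SplitEscaped(string input, string separator)
    {
        return Regex.Split(input, Regex.Escape(separator));
    }

    public static void Main()
    {
        foreach (string part in SplitLiteral("first__second", "__"))
            Console.WriteLine(part);
        // prints:
        // first
        // second
    }
}
```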
As for your real problem, it seems all you need is
var res = Regex.Matches(s, "(?:\"[^\"]*\"|[^\r\n\"])+")
.Cast<Match>()
.Select(m => m.Value)
.ToList();
See the regex demo
It matches 1+ (due to the final +) occurrences of ", 0+ chars other than " and then " (the "[^"]*" branch) or (|) any char but CR, LF or/and " (see [^\r\n"]).
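To make the behaviour concrete, here is a small sketch that runs the pattern above on input with a line break inside quotes. The `QuotedLineSplitter` name and the sample input are invented for illustration. It assumes quotes are balanced, and note that empty lines produce no match at all.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class QuotedLineSplitter
{
    // Each match is one logical line; CR/LF inside double quotes
    // is consumed by the quoted branch and does not end the line.
    public static List<string> SplitLines(string text)
    {
        return Regex.Matches(text, "(?:\"[^\"]*\"|[^\r\n\"])+")
            .Cast<Match>()
            .Select(m => m.Value)
            .ToList();
    }

    public static void Main()
    {
        List<string> lines = SplitLines("a,\"b\r\nc\",d\r\ne,f");
        Console.WriteLine(lines.Count); // prints: 2
    }
}
```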

Splitting of a string using Regex

I have string of the following format:
string test = "test.BO.ID";
My aim is to get the part of the string that comes after the first dot.
So ideally I am expecting output as "BO.ID".
Here is what I have tried:
// Checking for the first occurrence and take whatever comes after dot
var output = Regex.Match(test, @"^(?=.).*?");
The output I am getting is empty.
What is the modification I need to make it for Regex?
You get an empty output because the pattern you have can match an empty string at the start of a string, and that is enough since .*? is a lazy subpattern and . matches any char.
Use (the value will be in Match.Groups[1].Value)
\.(.*)
or (with a lookbehind, to get the string as a Match.Value)
(?<=\.).*
See the regex demo and a C# online demo.
A non-regex approach is to use String.Split with the count argument (demo):
var s = "test.BO.ID";
var res = s.Split(new[] {"."}, 2, StringSplitOptions.None);
if (res.GetLength(0) > 1)
Console.WriteLine(res[1]);
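For comparison, the two regex variants from this answer can be sketched side by side. The `AfterFirstDot` class and its method names are made up for the example. Both return an empty string when the input has no dot.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AfterFirstDot
{
    // Variant 1: the value is read from capturing group 1.
    public static string WithGroup(string s)
    {
        Match m = Regex.Match(s, @"\.(.*)");
        return m.Success ? m.Groups[1].Value : string.Empty;
    }

    // Variant 2: the lookbehind keeps the dot out of the match itself.
    public static string WithLookbehind(string s)
    {
        return Regex.Match(s, @"(?<=\.).*").Value;
    }

    public static void Main()
    {
        Console.WriteLine(WithGroup("test.BO.ID"));      // prints: BO.ID
        Console.WriteLine(WithLookbehind("test.BO.ID")); // prints: BO.ID
    }
}
```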
If you only want the part after the first dot you don't need a regex at all:
x.Substring(x.IndexOf('.') + 1)

How to remove only certain substrings from a string?

Using C#, I have a string that is a SQL script containing multiple queries. I want to remove sections of the string that are enclosed in single quotes. I can do this using Regex.Replace, in this manner:
string test = "Only 'together' can we turn him to the 'dark side' of the Force";
test = Regex.Replace(test, "'[^']*'", string.Empty);
Results in: "Only can we turn him to the of the Force"
What I want to do is remove the substrings between quotes EXCEPT for substrings containing a specific substring. For example, using the string above, I want to remove the quoted substrings except for those that contain "dark," such that the resulting string is:
Results in: "Only can we turn him to the 'dark side' of the Force"
How can this be accomplished using Regex.Replace, or perhaps by some other technique? I'm currently trying a solution that involves using Substring(), IndexOf(), and Contains().
Note: I don't care if the single quotes around "dark side" are removed or not, so the result could also be: "Only can we turn him to the dark side of the Force." I say this because a solution using Split() would remove all the single quotes.
Edit: I don't have a solution yet using Substring(), IndexOf(), etc. By "working on," I mean I'm thinking in my head how this can be done. I have no code, which is why I haven't posted any yet. Thanks.
Edit: VKS's solution below works. I wasn't escaping the \b the first attempt which is why it failed. Also, it didn't work unless I included the single quotes around the whole string as well.
test = Regex.Replace(test, "'(?![^']*\\bdark\\b)[^']*'", string.Empty);
'(?![^']*\bdark\b)[^']*'
Try this. See the demo. Replace with an empty string. You can use a lookahead here to check whether the quoted string contains the word dark.
https://www.regex101.com/r/rG7gX4/12
While vks's solution works, I'd like to demonstrate a different approach:
string test = "Only 'together' can we turn him to the 'dark side' of the Force";
test = Regex.Replace(test, @"'[^']*'", match => {
if (match.Value.Contains("dark"))
return match.Value;
// You can add more cases here
return string.Empty;
});
Or, if your condition is simple enough:
test = Regex.Replace(test, @"'[^']*'", match => match.Value.Contains("dark")
? match.Value
: string.Empty
);
That is, use a lambda to provide a callback for the replacement. This way, you can run arbitrary logic to replace the string.
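As a sketch of how the callback could grow to cover "more cases", the keep-condition can be driven by a list of keywords. The `QuoteFilter` class and `RemoveQuotedExcept` method are hypothetical names, and the plain `Contains` check is case-sensitive.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class QuoteFilter
{
    // Removes every single-quoted section unless it contains one of the keywords.
    public static string RemoveQuotedExcept(string text, params string[] keep)
    {
        return Regex.Replace(text, @"'[^']*'", match =>
            keep.Any(k => match.Value.Contains(k)) ? match.Value : string.Empty);
    }

    public static void Main()
    {
        string test = "Only 'together' can we turn him to the 'dark side' of the Force";
        Console.WriteLine(RemoveQuotedExcept(test, "dark"));
        // prints (note the double space left where 'together' was removed):
        // Only  can we turn him to the 'dark side' of the Force
    }
}
```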
Something like this would work. You can add all the strings you want to keep to the excludedString array.
string test = "Only 'together' can we turn him to the 'dark side' of the Force";
var excludedString = new string[] { "dark side" };
int startIndex = 0;
while ((startIndex = test.IndexOf('\'', startIndex)) >= 0)
{
var endIndex = test.IndexOf('\'', startIndex + 1);
var subString = test.Substring(startIndex, (endIndex - startIndex) + 1);
if (!excludedString.Contains(subString.Replace("'", "")))
{
test = test.Remove(startIndex, (endIndex - startIndex) + 1);
}
else
{
startIndex = endIndex + 1;
}
}
Another method uses the regex alternation operator |.
@"('[^']*\bdark\b[^']*')|'[^']*'"
Then replace each match with $1
DEMO
string str = "Only 'together' can we turn him to the 'dark side' of the Force";
string result = Regex.Replace(str, @"('[^']*\bdark\b[^']*')|'[^']*'", "$1");
Console.WriteLine(result);
IDEONE
Explanation:
(...) called capturing group.
'[^']*\bdark\b[^']*' would match all the single quoted strings which contains the substring dark . [^']* matches any character but not of ', zero or more times.
('[^']*\bdark\b[^']*'), because the regex is within a capturing group, all the matched characters are stored inside the group index 1.
| Next comes the regex alternation operator.
'[^']*' Now this matches all the remaining single-quoted strings (those that do not contain dark). Note that it won't match a single-quoted string that contains the substring dark, because those strings were already matched by the pattern before the | alternation operator.
Finally replacing all the matched characters with the chars inside group index 1 will give you the desired output.
I made this attempt that I think you were thinking about (some solution using split, Contain, ... without regex)
string test = "Only 'together' can we turn him to the 'dark side' of the Force";
string[] separated = test.Split('\'');
string result = "";
for (int i = 0; i < separated.Length; i++)
{
string str = separated[i];
str = str.Trim(); // trim leading and trailing spaces
if (i % 2 == 0 || str.Contains("dark")) // you can expand your condition
{
result += str+" "; // add space after each added string
}
}
result = result.Trim(); // trim the trailing space again
