Capture substring within delimiters and excluding characters using regex

Capture substring within delimiters and excluding characters using regex - c#

How could a regex pattern look like to capture a substring between 2 delimiters, but excluding some characters (if any) after first delimiter and before last delimiter (if any)?
The input string looks for instance like this:
var input = #"Not relevant {
#AddInfoStart Comment:String:=""This is a comment"";
AdditionalInfo:String:=""This is some additional info"" ;
# } also not relevant";
The capture should contain the substring between "{" and "}", but excluding any spaces, newlines and "#AddInfoStart" string after start delimiter "{" (just if any of them present), and also excluding any spaces, newlines and ";" and "#" characters before end delimiter "}" (also if any of them present).
The captured string should look like this
Comment:String:=""This is a comment"";
AdditionalInfo:String:=""This is some additional info""
It is possible that there are blanks before or after the ":" and ":=" internal delimiters, and also that the value after ":=" is not always marked as a string, for instance something like:
{ Val1 : Real := 1.7 }
For arrays is used the following syntax:
arr1 : ARRAY [1..5] OF INT := [2,5,44,555,11];
arr2 : ARRAY [1..3] OF REAL

This is my solution:
Remove the content outside the brackets
Use a regular expression to get the values inside the brackets
Code:
var input = #"Not relevant {
#AddInfoStart Comment:String:=""This is a comment"";
Val1 : Real := 1.7
AdditionalInfo:String:=""This is some additional info"" ;
# } also not relevant";
// remove content outside brackets
input = Regex.Replace(input, #".*\{", string.Empty);
input = Regex.Replace(input, #"\}.*", string.Empty);
string property = #"(\w+)";
string separator = #"\s*:\s*"; // ":" with or without whitespace
string type = #"(\w+)";
string equals = #"\s*:=\s*"; // ":=" with or without whitespace
string text = #"""?(.*?)"""; // value between ""
string number = #"(\d+(\.\d+)?)"; // number like 123 or with a . separator such as 1.45
string value = $"({text}|{number})"; // value can be a string or number
string pattern = $"{property}{separator}{type}{equals}{value}";
var result = Regex.Matches(input, pattern)
.Cast<Match>()
.Select(match => new
{
FullMatch = match.Groups[0].Value, // full match is always the 1st group
Property = match.Groups[1].Value,
Type = match.Groups[2].Value,
Value = match.Groups[3].Value
})
.ToList();

Related

Regex How to Match 2 fields

How would capture both the filenames inside the quotes, and the numbers following as named captures (Regex / C#)?
Files("fileone.txt", 5969784, "file2.txt", 45345333)
Out of every occurrence in the string, the ability to capture "fileone.txt" and the integer following (a loop cycles each pair)
I am trying to use this https://regex101.com/r/MwMzBo/1 but having issues matching without the '[' and ']'.
Required to be able to loop each filename+size as a pair and moving next.
Any help is appreciated!
UPDATE
string file = "Files(\"fileone.txt\", 5969784, \"file2.txt\", 45345333, \"file2.txt\", 45345333)";
var regex = new Regex(#"(?:\G(?!\A)\s*,\s*|\w+\()(?:""(?<file>.*?)""|'(?<file>.*?)')\s*,\s*(?<number>\d+)");
var match = regex.Match(file);
var names = match.Groups["file"].Captures.Cast<Capture>();
var lengths = match.Groups["number"].Captures.Cast<Capture>();
var filelist = names.Zip(lengths, (f, n) => new { file = f.Value, length = long.Parse(n.Value) }).ToArray();
foreach (var item in filelist)
{
// Only returning 1 pair result, ignoring the rest
}
Reading match.Value to confirm what is being read. Only first pair is being picked up.
while (match.Success)
{
MessageBox.Show(match.Value);
match = match.NextMatch();
}
Now we are getting all results properly. I read, that Regex.Match only returns the first matched result. This explains a lot.

You can use
(?:\G(?!\A)\s*,\s*|\w+\()(?:""(?<file>.*?)""|'(?<file>.*?)')\s*,\s*(?<number>\d+)
See the regex demo
Details:
(?:\G(?!\A)\s*,\s*|\w+\() - end of the previous successful match and a comma enclosed with zero or more whitespaces, or a word and an opening ( char
(?:""(?<file>.*?)""|'(?<file>.*?)') - ", Group "file" capturing any zero or more chars other than a newline char as few as possible and then a ", or a ', Group "file" capturing any zero or more chars other than a newline char as few as possible and then a '
\s*,\s* - a comma enclosed with zero or more whitespaces
(?<number>\d+) - Group "number": one or more digits.

I like doing it in smaller pieces :
string input = "cov('Age', ['5','7','9'])";
string pattern1 = #"\((?'key'[^,]+),\s+\[(?'values'[^\]]+)";
Match match = Regex.Match(input, pattern1);
string key = match.Groups["key"].Value.Trim(new char[] {'\''});
string pattern2 = #"'(?'value'[^']+)'";
string values = match.Groups["values"].Value;
MatchCollection matches = Regex.Matches(values, pattern2);
int[] number = matches.Cast<Match>().Select(x => int.Parse(x.Value.Replace("'",string.Empty))).ToArray();

C# - Split a string separated by ':'

Im trying to split this string:
PublishDate: "2011-03-18T11:08:07.983"
I tried Split method but it's not successful.
str.Split(new[] { ':', ' ' }, StringSplitOptions.RemoveEmptyEntries)
As a result I get PublishDate 2011-03-18T11 08 07.983
But correct result is PublishDate 2011-03-18T11:08:07.983
What i need to do?

Split(String, Int32, StringSplitOptions)
Splits a string into a maximum number of substrings based on a specified delimiting string and, optionally, options.
str.Split(':', 2, StringSplitOptions.RemoveEmptyEntries)
https://learn.microsoft.com/en-us/dotnet/api/system.string.split?view=net-6.0#system-string-split(system-string-system-int32-system-stringsplitoptions)

I would solve like this:
locate the index of the first :. The property name will be all the characters before this, which you can extract with Substring and Trim to remove whitespace before the colon, if present.
locate the index of the first " and last ". Characters between the first and last quotes are the property value.
string input = "PublishDate: \"2011-03-18T11:08:07.983\"";
int iColon = input.IndexOf(':');
int iOpenQuote = input.IndexOf('"', iColon);
int iCloseQuote = input.LastIndexOf('"');
string propertyName = input.Substring(0, iColon).Trim();
string propertyValue = input.Substring(iOpenQuote + 1, iCloseQuote - iOpenQuote - 1);
This does not handle escaped characters within the property value (for example, to embed a literal quote or newline using a typical escape sequence like \" or \n). But it's likely good enough to extract a date/time string, and permits all characters because of the use of LastIndexOf. However, this is not robust against malformed input, so you will want to add checks for missing colon, or missing quote, or what happens when the close quote is missing (same same index for start and end quote).

So if I got you right, you want as a result: PublishDate 2011-03-18T11:08:07.983.
Then I would recommend you to use the string.Replace method.
using System;
public class HelloWorld
{
public static void Main(string[] args)
{
string yourData = "PublishDate: \"2011-03-18T11:08:07.983\"";
// First replace the colon and the space after the PublishDate with and space
// then replace the quotes from the timestamp -> "2011-03-18T11:08:07.983"
yourData = yourData.Replace(": ", " ").Replace("\"", "");
// Output the result -> PublishDate 2011-03-18T11:08:07.983
Console.WriteLine(yourData);
}
}

Regular expression split string, extract string value before and numeric value between square brackets

I need to parse a string that looks like "Abc[123]". The numerical value between the brackets is needed, as well as the string value before the brackets.
The most examples that I tested work fine, but have problems to parse some special cases.
This code seems to work fine for "normal" cases, but has some problems handling "special" cases:
var pattern = #"\[(.*[0-9])\]";
var query = "Abc[123]";
var numVal = Regex.Matches(query, pattern).Cast<Match>().Select(m => m.Groups[1].Value).FirstOrDefault();
var stringVal = Regex.Split(query, pattern)
.Select(x => x.Trim())
.FirstOrDefault();
How should the code be adjusted to handle also some special cases?
For instance for the string "Abc[]" the parser should return correctly "Abc" as the string value and indicate an empty the numeric value (which could be eventually defaulted to 0).
For the string "Abc[xy33]" the parser should return "Abc" as the string value and indicate an invalid numeric value.
For the string "Abc" the parser should return "Abc" as the string value and indicate a missing numeric value. The blanks before/after or inside the brackets should be trimmed "Abc [ 123 ] ".

Try this pattern: ^([^\[]+)\[([^\]]*)\]
Explanation of a pattern:
^ - match beginning of a string
([^\[]+) - match one or more of any character ecept [ and store it insinde first capturing group
\[ - match [ literally
([^\]]*) - match zero or more of any character except ] and store inside second capturing group
\] - match ] literally
Here's tested code:
var pattern = #"^([^\[]+)\[([^\]]*)\]";
var queries = new string[]{ "Abc[123]", "Abc[xy33]", "Abc[]", "Abc[ 33 ]", "Abc" };
foreach (var query in queries)
{
string beforeBrackets;
string insideBrackets;
var match = Regex.Match(query, pattern);
if (match.Success)
{
beforeBrackets = match.Groups[1].Value;
insideBrackets = match.Groups[2].Value.Trim();
if (insideBrackets == "")
insideBrackets = "0";
else if (!int.TryParse(insideBrackets, out int i))
insideBrackets = "incorrect value!";
}
else
{
beforeBrackets = query;
insideBrackets = "no value";
}
Console.WriteLine($"Input string {query} : before brackets: {beforeBrackets}, inside brackets: {insideBrackets}");
}
Console.ReadKey();
Output:

We can try doing a regex replacement on the input, for a one-liner solution:
string input = "Abc[123]";
string letters = Regex.Replace(input, "\\[.*\\]", "");
string numbers = Regex.Replace("Abc[123]", ".*\\[(\\d+)\\]", "$1");
Console.WriteLine(letters);
Console.WriteLine(numbers);
This prints:
Abc
123

Pretty sure there'd be some language-based techniques for that, which I wouldn't know, yet with a regular expression, we'd capture everything using capturing groups and check for things one by one, maybe:
^([A-Za-z]+)\s*(\[?)\s*([A-Za-z]*)(\d*)\s*(\]?)\s*$
If you wish to explore/simplify/modify the expression, it's been
explained on the top right panel of
regex101.com. If you'd like, you
can also watch in this
link, how it would match
against some sample inputs.

You can achieve that easily without using regex
string temp = "Abc[123]";
string[] arr = temp.Split('[');
string name = arr[0];
string value = arr[1].ToString().TrimEnd(']');
output name = Abc, and value = 123

Extract multiple values from a string

I need to extract values from a string.
string sTemplate = "Hi [FirstName], how are you and [FriendName]?"
Values I need returned:
FirstName
FriendName
Any ideas on how to do this?

You can use the following regex globally:
\[(.*?)\]
Explanation:
\[ : [ is a meta char and needs to be escaped if you want to match it literally.
(.*?) : match everything in a non-greedy way and capture it.
\] : ] is a meta char and needs to be escaped if you want to match it literally.
Example:
string input = "Hi [FirstName], how are you and [FriendName]?";
string pattern = #"\[(.*?)\]";
Regex rgx = new Regex(pattern, RegexOptions.IgnoreCase);
MatchCollection matches = rgx.Matches(input);
if (matches.Count > 0)
{
Console.WriteLine("{0} ({1} matches):", input, matches.Count);
foreach (Match match in matches)
Console.WriteLine(" " + match.Value);
}

If the format/structure of the text won't be changing at all, and assuming the square brackets were used as markers for the variable, you could try something like this:
string sTemplate = "Hi FirstName, how are you and FriendName?"
// Split the string into two parts. Before and after the comma.
string[] clauses = sTemplate.Split(',');
// Grab the last word in each part.
string[] names = new string[]
{
clauses[0].Split(' ').Last(), // Using LINQ for .Last()
clauses[1].Split(' ').Last().TrimEnd('?')
};
return names;

You will need to tokenize the text and then extract the terms.
string[] tokenizedTerms = new string[7];
char delimiter = ' ';
tokenizedTerms = sTemplate.Split(delimiter);
firstName = tokenizedTerms[1];
friendName = tokenizedTerms[6];
char[] firstNameChars = firstName.ToCharArray();
firstName = new String(firstNameChars, 0, firstNameChars.length - 1);
char[] friendNameChars = lastName.ToCharArray();
friendName = new String(friendNameChars, 0, friendNameChars.length - 1);
Explanation:
You tokenize the terms, which separates the string into a string array with each element being the char sequence between each delimiter, in this case between spaces which is the words. From this word array we know that we want the 3rd word (element) and the 7th word (element). However each of these terms have punctuation at the end. So we convert the strings to a char array then back to a string minus that last character, which is the punctuation.
Note:
This method assumes that since it is a first name, there will only be one string, as well with the friend name. By this I mean if the name is just Will, it will work. But if one of the names is Will Fisher (first and last name), then this will not work.

replacing characters in a single field of a comma-separated list

I have string in my c# code
a,b,c,d,"e,f",g,h
I want to replace "e,f" with "e f" i.e. ',' which is inside inverted comma should be replaced by space.
I tried using string.split but it is not working for me.

OK, I can't be bothered to think of a regex approach so I am going to offer an old fashioned loop approach which will work:
string DoReplace(string input)
{
bool isInner = false;//flag to detect if we are in the inner string or not
string result = "";//result to return
foreach(char c in input)//loop each character in the input string
{
if(isInner && c == ',')//if we are in an inner string and it is a comma, append space
result += " ";
else//otherwise append the character
result += c;
if(c == '"')//if we have hit an inner quote, toggle the flag
isInner = !isInner;
}
return result;
}
NOTE: This solution assumes that there can only be one level of inner quotes, for example you cannot have "a,b,c,"d,e,"f,g",h",i,j" - because that's just plain madness!

For the scenario where you only need to match one pair of letters, the following regex will work:
string source = "a,b,c,d,\"e,f\",g,h";
string pattern = "\"([\\w]),([\\w])\"";
string replace = "\"$1 $2\"";
string result = Regex.Replace(source, pattern, replace);
Console.WriteLine(result); // a,b,c,d,"e f",g,h
Breaking apart the pattern, it is matching any instance where there is a "X,X" sequence where X is any letter, and is replacing it with the very same sequence, with a space in between the letters instead of a comma.
You could easily extend this if you needed to to have it match more than one letter, etc, as needed.
For the case where you can have multiple letters separated by commas within quotes that need to be replaced, the following can do it for you. Sample text is a,b,c,d,"e,f,a",g,h:
string source = "a,b,c,d,\"e,f,a\",g,h";
string pattern = "\"([ ,\\w]+),([ ,\\w]+)\"";
string replace = "\"$1 $2\"";
string result = source;
while (Regex.IsMatch(result, pattern)) {
result = Regex.Replace(result, pattern, replace);
}
Console.WriteLine(result); // a,b,c,d,"e f a",g,h
This does something similar compared to the first one, but just removes any comma that is sandwiched by letters surrounded by quotes, and repeats it until all cases are removed.

Here's a somewhat fragile but simple solution:
string.Join("\"", line.Split('"').Select((s, i) => i % 2 == 0 ? s : s.Replace(",", " ")))
It's fragile because it doesn't handle flavors of CSV that escape double-quotes inside double-quotes.

Use the following code:
string str = "a,b,c,d,\"e,f\",g,h";
string[] str2 = str.Split('\"');
var str3 = str2.Select(p => ((p.StartsWith(",") || p.EndsWith(",")) ? p : p.Replace(',', ' '))).ToList();
str = string.Join("", str3);

Use Split() and Join():
string input = "a,b,c,d,\"e,f\",g,h";
string[] pieces = input.Split('"');
for ( int i = 1; i < pieces.Length; i += 2 )
{
pieces[i] = string.Join(" ", pieces[i].Split(','));
}
string output = string.Join("\"", pieces);
Console.WriteLine(output);
// output: a,b,c,d,"e f",g,h

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Capture substring within delimiters and excluding characters using regex - c#

Related

Regex How to Match 2 fields

C# - Split a string separated by ':'

Regular expression split string, extract string value before and numeric value between square brackets

Extract multiple values from a string

replacing characters in a single field of a comma-separated list

Categories

Resources