Regex expression - match specific characters (multiple times) and ignore comments


I'm not a regex expert and need some help setting one up.
I'm using PowerShell and its [regex] type, which is a .NET class. The final objective is to read a TOML file (sample data at the bottom, or use this link to regex101), in which I need to:
match some values (values between "__")
ignore comments (a comment starts with "#")
To match the values and put them in a capture group the following regex works:
match the template value (values between "__" ):
__(?<tokenName>[\w\.]+)__
I also want to ignore the commented lines, and I came up with this:
Ignore lines that start with a comment (even if "#" is preceded by spaces or tabs):
^(?!\s*\t*#).*
The problem starts when I put them together:
^(?!\s*\t*#).*__(?<tokenName>[\w\.]+)__
This expression has the following problems:
It yields at most one match per line, the last one (e.g. the line with "Prop5 = ..." gives one match instead of two).
Comments at the end of a line are not taken into account (e.g. the line with "Prop4 = ..." gives two matches instead of one).
I've also tried to:
add this at the end of the expression, which should stop the match at the first occurrence of the character:
[^#]
add this at the beginning, which should check whether the matched string has the given character before it and exclude it:
(?<!^#)
This is a sample of my data
#templateFile
[Agent]
Prop1 = "__Data.Agent.Prop1__"
Prop2 = [__Data.Agent.Prop2__]
#I'm a comment
#Prop3 = "__NotUsed__"
Prop4 = [__Data.Agent.Prop4__] #sample usage comment __Data.Agent.xxx__
Prop5 = ["__Data.Agent.Prop5a__","__Data.Agent.Prop5b__"]
I think the easiest solution would be to match the given string only if there is no "#" before it on the same line.
Is it possible?
EDIT:
The first expression proposed by @the-fourth-bird works perfectly; it just needs the multiline modifier to be specified.
The final (runnable) result looks like this in PowerShell:
[regex]$reg = "(?m)(?<!^.*#.*)__(?<tokenName>[\w.]+)__"
$text = '
#templateFile
[Agent]
Prop1 = "__Data.Agent.Prop1__"
Prop2 = [__Data.Agent.Prop2__]
Prop5 = ["__Data.Agent.Prop5a__","__Data.Agent.Prop5b__"]
#a comment
#Prop3 = "__Data.Agent.Prop3__"
Prop4 = [__Data.Agent.Prop4__] #sample usage comment __Data.Agent.xxx__
'
$reg.Matches($text) | Format-Table
#This returns
Groups         Success Name Captures Index Length Value
------         ------- ---- -------- ----- ------ -----
{0, tokenName}    True 0    {0}         31     20 __Data.Agent.Prop1__
{0, tokenName}    True 0    {0}         62     20 __Data.Agent.Prop2__
{0, tokenName}    True 0    {0}         94     21 __Data.Agent.Prop5a__
{0, tokenName}    True 0    {0}        118     21 __Data.Agent.Prop5b__
{0, tokenName}    True 0    {0}        194     20 __Data.Agent.Prop4__

You could use an unbounded quantifier inside a lookbehind (which .NET supports) to assert that nothing before the token on the same line contains a #. This also accounts for the comment in Prop4:
(?<!^.*#.*)__(?<tokenName>[\w.]+)__
.NET regex demo
If Prop4 should have 2 matches, you might use:
(?<!^[ \t]*#.*)__(?<tokenName>[\w.]+)__
.NET regex demo
Both expressions need the multiline modifier to work properly.
It can be specified inline by adding (?m) at the beginning, or by passing RegexOptions.Multiline to a constructor that accepts options:
(?m)(?<!^.*#.*)__(?<tokenName>[\w.]+)__
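For reference, here is a minimal C# sketch of the same idea, using RegexOptions.Multiline instead of the inline (?m). The class and method names (TokenDemo, ExtractTokens) are made up for illustration, and the sample text mirrors the data from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TokenDemo
{
    // Same pattern as above; RegexOptions.Multiline plays the role of (?m).
    private static readonly Regex TokenRegex = new Regex(
        @"(?<!^.*#.*)__(?<tokenName>[\w.]+)__",
        RegexOptions.Multiline);

    // Returns the token names that have no '#' before them on the same line.
    public static List<string> ExtractTokens(string text)
    {
        return TokenRegex.Matches(text)
            .Cast<Match>()
            .Select(m => m.Groups["tokenName"].Value)
            .ToList();
    }

    public static void Main()
    {
        string text = string.Join("\n", new[]
        {
            "#templateFile",
            "[Agent]",
            "Prop1 = \"__Data.Agent.Prop1__\"",
            "Prop2 = [__Data.Agent.Prop2__]",
            "#I'm a comment",
            "#Prop3 = \"__NotUsed__\"",
            "Prop4 = [__Data.Agent.Prop4__] #sample usage comment __Data.Agent.xxx__",
            "Prop5 = [\"__Data.Agent.Prop5a__\",\"__Data.Agent.Prop5b__\"]"
        });

        // Prints Prop1, Prop2, Prop4, Prop5a and Prop5b (one per line).
        foreach (string token in ExtractTokens(text))
        {
            Console.WriteLine(token);
        }
    }
}
```

Note that this pattern treats every # as the start of a comment, so a # inside a quoted TOML string would also hide the tokens that follow it on that line.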

Related

C# - Getting multiple values with a single key, from a text file

I store multiple values that share a single key in a text file. The text file looks like this:
Brightness 36 , Manual
BacklightCompensation 3 , Manual
ColorEnable 0 , None
Contrast 16 , Manual
Gain 5 , Manual
Gamma 122 , Manual
Hue 0 , Manual
Saturation 100 , Manual
Sharpness 2 , Manual
WhiteBalance 5450 , Auto
Now I want to store the int value and the string value of each key (Brightness, for example).
I'm new to C# and couldn't find anything that worked yet.
Thanks

I'd recommend using custom types like these to store the settings:
public enum DisplaySettingType
{
    Manual, Auto, None
}
public class DisplaySetting
{
    public string Name { get; set; }
    public decimal Value { get; set; }
    public DisplaySettingType Type { get; set; }
}
Then you could use the following LINQ query, which splits each line with string.Split, to get all settings:
// Needs: using System; using System.Collections.Generic; using System.IO; using System.Linq;
decimal value = 0;
DisplaySettingType type = DisplaySettingType.None;
// value and type are filled by TryParse in Where and read in Select for the same line.
IEnumerable<DisplaySetting> settings = File.ReadLines(path)
    .Select(l => l.Trim().Split(new[] { ' ', ',' }, StringSplitOptions.RemoveEmptyEntries))
    .Where(arr => arr.Length >= 3 && decimal.TryParse(arr[1], out value) && Enum.TryParse(arr[2], out type))
    .Select(arr => new DisplaySetting { Name = arr[0], Value = value, Type = type });

With a regex and a little bit of LINQ you can do many things.
Here I assume you know how to read a text file.
Pros: if the file is not perfect, the regex will just ignore the malformed lines and won't throw an error.
Here is a hard-coded version of your file. Note that a \r may appear at the end of each line because of the multi-line string literal, depending on how you read your file; it should not be the case with File.ReadLines().
string input =
@"Brightness 36 , Manual
BacklightCompensation 3 , Manual
ColorEnable 0 , None
Contrast 16 , Manual
Gain 5 , Manual
Gamma 122 , Manual
Hue 0 , Manual
Saturation 100 , Manual
Sharpness 2 , Manual
WhiteBalance 5450 , Auto";
string regEx = @"(.*) (\d+) , (.*)";
var RegexMatch = Regex.Matches(input, regEx).Cast<Match>();
var outputlist = RegexMatch.Select(x => new { setting = x.Groups[1].Value
                                            , value = x.Groups[2].Value
                                            , mode = x.Groups[3].Value });
Regex explanation: /(.*) (\d+) , (.*)/g
1st Capturing Group (.*)
    .* matches any character (except for line terminators)
    * Quantifier: matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
" " matches the space character literally (case sensitive)
2nd Capturing Group (\d+)
    \d+ matches a digit (equal to [0-9])
    + Quantifier: matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
" , " matches the characters " , " literally (case sensitive)
3rd Capturing Group (.*)
    .* matches any character (except for line terminators)
    * Quantifier: matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
Disclaimer:
Never trust an input! Even if it's a file written by another program, or sent by a customer.
In my experience, you then have two ways of handling a bad format:
Read line by line, and record every bad line.
Or ignore them. You don't fit, you don't sit!
And don't tell yourself it won't happen; it will!
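The first option (read line by line and record every bad line) could look roughly like the sketch below. The line layout encoded in the pattern, the SettingsParser class and the Parse method are assumptions made for this illustration, not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SettingsParser
{
    // Assumed line layout: <name> <integer> , <mode>, with flexible spacing.
    private static readonly Regex LineRegex = new Regex(
        @"^\s*(?<name>\w+)\s+(?<value>\d+)\s*,\s*(?<mode>\w+)\s*$");

    // Parses each line; lines that do not fit the layout are collected in badLines.
    public static Dictionary<string, (int Value, string Mode)> Parse(
        IEnumerable<string> lines, List<string> badLines)
    {
        var settings = new Dictionary<string, (int Value, string Mode)>();
        foreach (string line in lines)
        {
            Match m = LineRegex.Match(line);
            if (m.Success && int.TryParse(m.Groups["value"].Value, out int value))
            {
                settings[m.Groups["name"].Value] = (value, m.Groups["mode"].Value);
            }
            else
            {
                badLines.Add(line); // keep the offending line for later reporting
            }
        }
        return settings;
    }

    public static void Main()
    {
        var lines = new[]
        {
            "Brightness 36 , Manual",
            "WhiteBalance 5450 , Auto",
            "this line is broken"
        };
        var bad = new List<string>();
        var settings = Parse(lines, bad);

        Console.WriteLine(settings["Brightness"].Value);   // 36
        Console.WriteLine(settings["WhiteBalance"].Mode);  // Auto
        Console.WriteLine(bad.Count);                      // 1
    }
}
```

In real use the lines would come from File.ReadLines(path), and the collected bad lines could be logged or shown to the user instead of being silently dropped.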

Filtering on full string match but not on substrings

So I've got a long string of numbers and characters and I'd like to filter out a substring. The thing I'm struggling with is that I need a full match on a certain value (starting with S), and it must not match when that value is only the beginning of another value.
Input:
S10 1+0000000297472+00EURS100 1+0000000297472+00EURS1023P 1+0000000816072+00EUR
The input is exactly like this.
Breakdown of input:
S10 1+0000000297472+00EUR
Every part starts with a tag S and ends with EUR
There are spaces in between because every part has a fixed length
=>
index 0 : tag 'S' with length 1
index 1 : code with length 7
index 8 : numbertype with length 1
index 9 : sign with length 1
index 10 : value with length 13
index 23 : sign with length 1
index 24 : exponent with length 2
index 26 : unit with length 3
I need to match on, for example, S10, and I only want the substring up to EUR. I don't want it to match on S100 or S1023P or any other combination, only on exactly S10.
Output:
S10 1+0000000297472+00EUR
I'm trying to use Regex to find my match on 'S + code'. I do a full match on my search query and want to reject it as soon as anything else follows the code. But doing it like this also discards the actual match, because after S10 the value follows, which also matches ([^\d|^\D])+\w
foreach (var field in fieldList)
{
    var query = "S" + field.BallanceCode;
    var index = Regex.Match(values, Regex.Escape(query) + @"([^\d|^\D])+\w").Index;
}
For example when looking for S10
needs to match:
S10 1+0000000297472+00EUR
must not match:
S10/15 1+0000001748447+00EUR
S1023P 1+0000000816072+00EUR
S10000001+0000000546546+00EUR
Update:
Using this code
var index = Regex.Match(values, Regex.Escape(query) + @"\p{Zs}.*?EUR").Index;
will yield S10, S10/15, etc. when looked for. However, looking for S1000000 in the string doesn't work, because there is no whitespace between the code and 1+
S10000001+0000000546546+00EUR
For example when looking for S1000000
needs to match:
S10000001+0000000297472+00EUR
must not match:
S10 1+0000001748447+00EUR
S1023P 1+0000000816072+00EUR
S10/15 1+0000000546546+00EUR

You can use a regex that requires a space (or whitespace) to appear right after the field.BallanceCode:
var index = Regex.Match(values, Regex.Escape(query) + (field.BallanceCode.Length < 7 ? @"\p{Zs}" : "") + ".*?EUR").Index;
The regex will match the S10, then a space character (\p{Zs}, the Unicode space-separator category), then any 0 or more characters other than a newline (as few as possible due to *?) up to the first EUR.
The (field.BallanceCode.Length < 7 ? @"\p{Zs}" : "") check is necessary to support a 7-character BallanceCode, which fills its field completely. If the code has 7 characters or more, we do not check for whitespace after it. If it is shorter than 7, we require a space.
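A small self-contained sketch of this lookup follows. The Record helper is hypothetical and only rebuilds the fixed-width input (the padding spaces are collapsed in the question as displayed), and Find returns the matched text rather than the index, purely to make the result easy to see.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FixedWidthLookup
{
    // Builds one 29-character record: 'S', the code padded to 7, then the value part.
    // Hypothetical helper, used only to recreate the fixed-width input.
    public static string Record(string code, string digits)
    {
        return "S" + code.PadRight(7) + "1+" + digits + "+00EUR";
    }

    // Returns the record whose code equals `code` exactly, or "" if none is found.
    public static string Find(string values, string code)
    {
        string pattern = Regex.Escape("S" + code)
            + (code.Length < 7 ? @"\p{Zs}" : "")  // a short code must be followed by padding
            + ".*?EUR";                            // then as little as possible up to EUR
        return Regex.Match(values, pattern).Value;
    }

    public static void Main()
    {
        string values =
            Record("10", "0000000297472") +
            Record("10/15", "0000001748447") +
            Record("1023P", "0000000816072") +
            Record("1000000", "0000000546546");

        Console.WriteLine(Find(values, "10"));       // S10     1+0000000297472+00EUR
        Console.WriteLine(Find(values, "1000000"));  // S10000001+0000000546546+00EUR
    }
}
```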

So you just want the start (S...) and end (...EUR) of each line and skip everything in between?
^([sS]\d+).*?([\d\+]+EUR)$
http://regexr.com/3c1ob

Regular expression match all numbers after the last dash?

Trying to find the last instance of numbers after last dash in a string so
test-123-2-456 would return 456
123-test would return ""
123-test-456 would return 456
123-test-456sdfsdf would return 456
123-test-asd456 would return 456
The expression, @"[^-]*$", does not match the numbers though, and I have tried using [\d] but to no avail.

Sure, the simplest solution would be something like this:
(\d+)[^-]*$
This will match one or more digits, captured in group 1, followed by zero or more of any character other than a hyphen, followed by the end of the string. In other words, it will match any sequence of digits as long as there are no hyphens between that sequence and the end of the string. You then just have to extract group 1 from the match. For example:
var inputs = new[] {
    "test-123-2-456",
    "123-test",
    "123-test-456",
    "123-test-456sdfsdf",
    "123-test-asd456"
};
foreach(var str in inputs)
{
    var m = Regex.Match(str, @"(\d+)[^-]*$");
    Console.WriteLine("{0} --> {1}", str, m.Groups[1].Value);
}
Produces:
test-123-2-456 --> 456
123-test -->
123-test-456 --> 456
123-test-456sdfsdf --> 456
123-test-asd456 --> 456
Alternatively, you could use a negative lookahead like this:
\d+(?!.*-)
This will match one or more digit characters so long as no hyphen appears anywhere after them in the string. Only the digits will be included in the match.
Note that these two options behave differently if there are two or more sets of numbers after the last -, e.g. foo-123bar456. In this case it's not entirely clear what you want to happen, but the first pattern will simply match everything starting from the first sequence of digits to the end (123bar456), with group 1 only containing the first sequence of digits (123). If you'd like to change this so that it only captures the last sequence of digits, place a \d inside the character class (i.e. (\d+)[^\d-]*$). The second pattern would produce a separate match for each sequence of digits (in this example, 123 and 456), but the Regex.Match method will only give you the first match.
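To make the difference concrete, here is a short sketch that runs the three patterns on foo-123bar456. The class and method names are invented for this illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LastDashDemo
{
    // Group 1 of the first pattern: the first digit run after the last hyphen.
    public static string FirstRun(string s) =>
        Regex.Match(s, @"(\d+)[^-]*$").Groups[1].Value;

    // Group 1 of the tightened pattern: the last digit run after the last hyphen.
    public static string LastRun(string s) =>
        Regex.Match(s, @"(\d+)[^\d-]*$").Groups[1].Value;

    // Every match of the lookahead pattern, not just the first one.
    public static string[] AllRuns(string s) =>
        Regex.Matches(s, @"\d+(?!.*-)").Cast<Match>().Select(m => m.Value).ToArray();

    public static void Main()
    {
        const string input = "foo-123bar456";
        Console.WriteLine(FirstRun(input));                   // 123
        Console.WriteLine(LastRun(input));                    // 456
        Console.WriteLine(string.Join(",", AllRuns(input)));  // 123,456
    }
}
```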

I suggest applying two regexes: take the result of the first one as the input for the second one.
The first regex is:
-[0-9]+[^-]+$ // Take the last piece of your string, led by a minus (-),
// followed by digits ([0-9]+)
// and some ugly rest that doesn't contain another minus ([^-]+$)
The second regex is:
-[0-9]+ // Separate the relevant digits from the ugly rest
// You know that there can only be one minus + digits part in it
Tested here: http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx

The last group from this regex can get the last number for you:
[^-A-z][0-9]+[^A-z]
If you are looking at groups, you can write code like this to read the number from the last group:
var inputs = new[] {
    "test-123-2-456",
    "123-test",
    "123-test-456",
    "123-test-456sdfsdf",
    "123-test-asd456"
};
foreach (var str in inputs)
{
    var m = Regex.Match(str, @"([0-9]*)");
    if (m.Groups.Count > 1) // This will avoid the values starting with numbers only.
        Console.WriteLine("{0} --> {1}", str, m.Groups[m.Groups.Count - 1].Value);
}

Regex help - match any number of characters

I have following kind of string-sets in a text file:
<< /ImageType 1
/Width 986 /Height 1
/BitsPerComponent 8
/Decode [0 1 0 1 0 1]
/ImageMatrix [986 0 0 -1 0 1]
/DataSource <
803fe0503824160d0784426150b864361d0f8844625138a4562d178c466351b8e4763d1f904864523924964d27944a6552b964b65d2f984c665339a4d66d379c4e6753b9e4f67d3fa05068543a25168d47a4526954ba648202
> /LZWDecode filter >> image } def
There are 100s of Images defined like above.
I need to find all such images defined in the document.
Here is my code -
string txtFile = @"text file path";
string fileContents = File.ReadAllText(txtFile);
string pattern = @"<< /ImageType 1.*(\n|\r|\r\n)*image } def"; // match any number of characters between `<< /ImageType 1` and `image } def`
MatchCollection matchCollection = Regex.Matches(fileContents, pattern, RegexOptions.Singleline);
int count = matchCollection.Count; // returns 1
However, I am getting just one match, whereas there are around 600 images defined.
It seems they are all matched as one because of the newline characters used in the pattern.
Can anyone please tell me what I need to modify so that the regex returns the correct 600 matches?

The reason is that regular expressions are usually greedy, i.e. the matches are always as long as possible. Thus, the image } def is contained in the .*. I think the best approach here would be to perform two separate regex queries, one for << /ImageType 1 and one for image } def. Every match of the first pattern would correspond to exactly one match of the second one and as these matches carry their indices in the original string, you can reconstruct the image by accessing the appropriate substring.

Instead of .* you should use the non-greedy quantifier .*? (keep RegexOptions.Singleline so that . still matches line breaks):
string pattern = @"<< /ImageType 1.*?image } def";
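The effect can be checked with a small sketch like the one below. The shortened image block and the ImageBlockCounter class are made up for this illustration; both patterns are run with RegexOptions.Singleline, as in the question.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class ImageBlockCounter
{
    // Counts matches; Singleline lets '.' cross line breaks in both patterns.
    public static int Count(string text, string pattern) =>
        Regex.Matches(text, pattern, RegexOptions.Singleline).Count;

    public static void Main()
    {
        // Shortened, made-up image block repeated three times.
        string block = "<< /ImageType 1\n/Width 986 /Height 1\n/DataSource <\n803fe0\n> /LZWDecode filter >> image } def\n";
        string text = string.Concat(Enumerable.Repeat(block, 3));

        Console.WriteLine(Count(text, @"<< /ImageType 1.*image } def"));   // greedy: 1
        Console.WriteLine(Count(text, @"<< /ImageType 1.*?image } def"));  // lazy: 3
    }
}
```

The greedy version runs from the first << /ImageType 1 to the last image } def, which is why the whole file comes back as a single match.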

Here is a site that can help you out with REGEX that I use. http://webcheatsheet.com/php/regular_expressions.php.
if(preg_match('/^/[a-z]/i', $string, $matches)){
echo "Match was found <br />";
echo $matches[0];
}

Need multiple regular expression matches using C#

So I have this list of flight data and I need to be able to parse through it using regular expressions (this isn't the entire list).
1 AA2401 F7 A4 Y7 B7 M7 H7 K7 /DFW A LAX 4 0715 0836 E0.M80 9 3:21
2 AA2421 F7 A1 Y7 B7 M7 H7 K7 DFWLAX 4 1106 1215 E0.777 7 3:09
3UA:US6352 B9 M9 H9 K0 /DFW 1 LAX 1200 1448 E0.733 1:48
For example, I might need from the first line 1, AA, 2401, and so on. Now, I'm not asking for someone to come up with a regular expression for me, because for the most part I can handle that myself. My issue has more to do with being able to store the data somewhere and access it.
So I'm just trying to initially just "match" the first piece of data I need, which is the line number '1'. My "pattern" for just getting the first number is: ".?(\d{1,2}).*" . The reason it's {1,2} is because obviously once you get past 10 it needs to be able to take 2 numbers. The rest of the line is set up so that it will definitely be a space or a letter.
Here's the code:
var assembly = Assembly.GetExecutingAssembly();
var textStreamReader = new StreamReader(
assembly.GetManifestResourceStream("FlightParser.flightdata.txt"));
List<string> lines = new List<string>();
do
{
lines.Add(textStreamReader.ReadLine());
} while (!textStreamReader.EndOfStream);
Regex sPattern = new Regex(@".?(\d{1,2}).*");//whatever the pattern is
foreach (string line in lines)
{
System.Console.Write("{0,24}", line);
MatchCollection mc = sPattern.Matches(line);
if ( sPattern.IsMatch(line))
{
System.Console.WriteLine(" (match for '{0}' found)", sPattern);
}
else
{
System.Console.WriteLine();
}
System.Console.WriteLine(mc[0].Groups[0].Captures);
System.Console.WriteLine(line);
}//end foreach
System.Console.ReadLine();
With the code I'm writing, I'm basically just trying to get '1' into the match collection and somehow access it and write it to the console (for the sake of testing, that's not the ultimate goal).

The trailing ".*" in your pattern matches the rest of the line, so the overall match (Groups[0]) is the whole line. The number you want is in the capture group: read Groups[1].Value, or drop the ".*" so the match stops after the digits. You may find an online regex tester such as this useful.

Assuming your file is not actually formatted as you posted and has each of the fields separated by something, you can match the one- or two-digit number at the start of the line with this regex (ignoring 0 and leading zeros):
^\s*([1-9]\d?)
Since it is grouped, you can access the matched part through the Groups property of the Match object.
var line = "12 foobar blah 123 etc";
var re = new Regex(@"^\s*([1-9]\d?)");
var match = re.Match(line);
if (match.Success)
{
    Console.WriteLine(match.Groups[1].Value); // "12"
}
else
{
    Console.WriteLine("No match");
}
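Building on the Groups property, the parsed values can be stored in a list and accessed later, which is what the question is ultimately after. The sketch below is only an illustration: the pattern assumes each line starts with a line number, a two-letter airline code and a flight number, and the FlightLineParser class and its group names are invented here. Lines in another shape (such as the third sample line with UA:US) are simply skipped.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FlightLineParser
{
    // Assumed layout: line number, two-letter airline code, flight number.
    private static readonly Regex Header = new Regex(
        @"^\s*(?<line>[0-9]{1,2})\s*(?<airline>[A-Z]{2})(?<flight>[0-9]+)");

    // Returns (line, airline, flight) for each line that fits the assumed layout.
    public static List<(int Line, string Airline, string Flight)> Parse(IEnumerable<string> lines)
    {
        var result = new List<(int Line, string Airline, string Flight)>();
        foreach (string text in lines)
        {
            Match m = Header.Match(text);
            if (!m.Success) continue; // skip lines in another format
            result.Add((int.Parse(m.Groups["line"].Value),
                        m.Groups["airline"].Value,
                        m.Groups["flight"].Value));
        }
        return result;
    }

    public static void Main()
    {
        var lines = new[]
        {
            "1 AA2401 F7 A4 Y7 B7 M7 H7 K7 /DFW A LAX 4 0715 0836 E0.M80 9 3:21",
            "2 AA2421 F7 A1 Y7 B7 M7 H7 K7 DFWLAX 4 1106 1215 E0.777 7 3:09",
            "3UA:US6352 B9 M9 H9 K0 /DFW 1 LAX 1200 1448 E0.733 1:48"
        };

        // Prints "1 AA 2401" and "2 AA 2421"; the third line is skipped.
        foreach (var flight in Parse(lines))
        {
            Console.WriteLine("{0} {1} {2}", flight.Line, flight.Airline, flight.Flight);
        }
    }
}
```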

The following expression matches the first digit, that you wanted to capture, in the group "First".
^\s*(?<First>\d{1})
I find this regular expression tool highly useful when dealing with regex. Give it a try.
Also set RegexOptions.Multiline if you run the pattern against the whole text rather than line by line, so that ^ matches at the start of every line.
