How to create Regex that contains Not colon char? - c#

I have created a regular expression that seems to be working somewhat:
// look for years starting with 19 or 20 followed by two digits surrounded by spaces.
// Instead of ending space, the year may be followed by a '.' or ';'
static Regex regex = new Regex(@" 19\d{2} | 19\d{2}. | 19\d{2}; | 20\d {2} | 20\d{2}. | 20\d{2}; ");
// Trying to add 'NOT followed by a colon'
static Regex regex = new Regex(@" 19\d{2}(?!:) | 19\d{2}. | 19\d{2}; | 20\d{2}(?!:) | 20\d{2}. | 20\d{2}; ");
// Trying to optimize --
//static Regex regex = new Regex(@" (19|20)\d{2}['.',';']");
You can see where I tried to optimize a bit.
But more importantly, it is finding a match for 2002:
How do I make it not do that?
I think I am looking for some sort of NOT operator?

(?:19|20)\d{2}(?=[ ,;.])
Try this. See the demo.
https://regex101.com/r/sJ9gM7/103
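A minimal sketch of how the lookahead pattern above could be exercised in C#. The class name, method name, and sample strings are hypothetical and not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class YearLookaheadDemo
{
    // Pattern from the answer above: a 19xx/20xx year that is followed by a
    // space, comma, semicolon, or period. The lookahead checks the next
    // character without including it in the match.
    private static readonly Regex YearRegex = new Regex(@"(?:19|20)\d{2}(?=[ ,;.])");

    // Returns the first matched year, or an empty string when nothing matches.
    public static string FirstYear(string input)
    {
        Match m = YearRegex.Match(input);
        return m.Success ? m.Value : "";
    }

    public static void Main()
    {
        // Hypothetical sample inputs.
        Console.WriteLine(FirstYear("Born in 1987; moved later")); // year followed by ';'
        Console.WriteLine(FirstYear("Logged at 2002:15") == "");   // year followed by ':'
    }
}
```

Because a colon is not in the lookahead's character class, "2002:" is rejected. Note that this pattern does not check what comes before the year, so a leading space or \b may still be wanted.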

I'd rather go with \b here, this will help deal with other punctuation that may appear after/before the years:
\b(?:19|20)[0-9]{2}\b
C#:
static Regex regex = new Regex(@"\b(?:19|20)[0-9]{2}\b");
Tested in Expresso.

The problem in your regex is the unescaped dot: it matches any character, including a colon.
You should have something like this:
static Regex regex =
    new Regex(@" 19\d{2} | 19\d{2}[.] | 19\d{2}; | 20\d{2} | 20\d{2}[.] | 20\d{2}; ");
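To illustrate the difference, here is a small sketch comparing an unescaped dot with [.] on one alternative of the pattern. The class name and sample strings are assumptions made for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotEscapeDemo
{
    // Unescaped dot: matches ANY character after the year, including ':'.
    private static readonly Regex Unescaped = new Regex(@" 20\d{2}. ");

    // Dot inside a character class: matches only a literal period.
    private static readonly Regex Escaped = new Regex(@" 20\d{2}[.] ");

    public static bool MatchesUnescaped(string input) => Unescaped.IsMatch(input);
    public static bool MatchesEscaped(string input) => Escaped.IsMatch(input);

    public static void Main()
    {
        string withColon = "at 2002: noon";   // hypothetical sample
        string withPeriod = "in 2002. Then";  // hypothetical sample

        Console.WriteLine(MatchesUnescaped(withColon));  // True: '.' accepted the colon
        Console.WriteLine(MatchesEscaped(withColon));    // False
        Console.WriteLine(MatchesUnescaped(withPeriod)); // True
        Console.WriteLine(MatchesEscaped(withPeriod));   // True
    }
}
```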

This did it for me:
// look for years starting with 19 or 20 followed by two digits surrounded by spaces.
// Instead of ending space, the year may also be followed by a '.' or ';'
// but may not be followed by a colon, dash or any other unspecified character.
// optimized --
static Regex regex = new Regex(@"(19|20)\d{2} | (19|20)\d{2};| (19|20)\d{2}[.]");
Used Regex Tester here:
http://regexhero.net/tester/

Related

How to extract specific value from a string with Regex? [closed]

I am new to regex and I want to extract a specific value from a string. I have strings like:
"20098: Blue Quest"
"95: Internal Comp"
"33: ICE"
and so on. Every string has the same pattern: a number followed by ":" followed by a space and arbitrary text. I want to get the numbers at the start, e.g. "20098", "95", "33", etc.
I tried
Regex ex = new Regex(@"[0-9]+\: [a-zA-Z]$");
This is not giving me any result. Where am I going wrong?
(I am using C#.)
This is a totally silly solution. However, I decided to benchmark an unchecked pointer version against the other regex and int parse solutions here in the answers.
You mentioned the strings are always in the same format, so I decided to see how fast we could get it.
Yehaa
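public unsafe static int? FindInt(string val)
{
    var result = 0;
    // Pin the string so it can be read through a raw pointer.
    fixed (char* p = val)
    {
        for (var i = 0; i < val.Length; i++)
        {
            // Stop at the first colon; everything before it is the number.
            if (p[i] == ':') return result;
            // 48 is the character code of '0'; assumes ASCII digits only.
            result = result * 10 + p[i] - 48;
        }
        // No colon found.
        return null;
    }
}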
I ran each test 50 times at scales of 100,000 and 1,000,000 comparisons, covering Lee Gunn's int.Parse approach, The fourth bird's ^\d+(?=: [A-Z]), a plain ^\d+, and my pointer version.
Results
Test Framework : .NET Framework 4.7.1
Scale : 100000
Name                  | Time       | Delta     | Deviation | Cycles
------------------------------------------------------------------------------
Pointers              | 2.597 ms   | 0.144 ms  | 0.19      | 8,836,015
Int.Parse             | 17.111 ms  | 1.009 ms  | 2.91      | 57,167,918
Regex ^\d+            | 85.564 ms  | 10.957 ms | 6.14      | 290,724,120
Regex ^\d+(?=: [A-Z]) | 98.912 ms  | 1.508 ms  | 7.16      | 336,716,453
Scale : 1000000
Name                  | Time       | Delta     | Deviation | Cycles
------------------------------------------------------------------------------
Pointers              | 25.968 ms  | 1.150 ms  | 1.15      | 88,395,856
Int.Parse             | 143.382 ms | 2.536 ms  | 2.62      | 487,929,382
Regex ^\d+            | 847.109 ms | 14.375 ms | 21.92     | 2,880,964,856
Regex ^\d+(?=: [A-Z]) | 950.591 ms | 6.281 ms  | 20.38     | 3,235,489,411
Not surprisingly regex sucks
If they are all separate strings - you don't need to use a regex, you can simply use:
var s = "20098: Blue Quest";
var index = s.IndexOf(':');
if (index > 0)
{
    if (int.TryParse(s.Substring(0, index), out var number))
    {
        // Do stuff
    }
}
If they're all contained in one string, you can loop over each line and perform the Substring. This is perhaps a bit easier to read, as a lot of people aren't comfortable with regular expressions.
In your regex [0-9]+\: [a-zA-Z]$ you match one or more digits followed by a colon, a space, and then a single lowercase or uppercase character at the end of the string.
Without the $ that would match 20098: B rather than the digits only; with the $ it does not match your strings at all, because more text follows the first letter.
There are better alternatives than a regex, as already suggested, but you could match one or more digits \d+ from the beginning of the string ^ and use a positive lookahead (?= ... ) to assert that what follows is a colon, a space and an uppercase character [A-Z]:
^\d+(?=: [A-Z])
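As a rough sketch, the lookahead pattern above could be wrapped like this to pull out the number as an int. The class and method names are hypothetical; the sample strings are the ones from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LeadingNumberDemo
{
    // Digits at the start of the string, only if followed by ": " and an
    // uppercase letter. The lookahead is not part of the matched value.
    private static readonly Regex LeadingNumber = new Regex(@"^\d+(?=: [A-Z])");

    // Returns true and sets number when the input starts with "<digits>: X".
    public static bool TryGetNumber(string input, out int number)
    {
        number = 0;
        Match m = LeadingNumber.Match(input);
        return m.Success && int.TryParse(m.Value, out number);
    }

    public static void Main()
    {
        string[] samples = { "20098: Blue Quest", "95: Internal Comp", "33: ICE" };
        foreach (string s in samples)
        {
            if (TryGetNumber(s, out int n))
            {
                Console.WriteLine(n);
            }
        }
    }
}
```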
Firstly, after the colon, you could use \s instead of a literal space. Also, if the text after the colon can include spaces, the character class should also allow \s and have a + after it.
[0-9]+\:\s[a-zA-Z\s]+$
Secondly, that entire regex will return the entire string. If you only want the first number, then the regex would be simply:
[0-9]+
You can use a look-behind (?<= ... ) to find a number that follows ^" (where ^ is the beginning of the line):
(?<=^")[0-9]+

Regular expression for NN-ARID-NNN?

What would be the regular expression for this?
NN-ARID-NNN?
//N = Number
I have tried this ^[0-9/-0-9]+$
You're not matching the ARID at all and a character class will match in any order... You might want to use something more like this:
^[0-9]{2}-ARID-[0-9]{3}$
[Assuming that ? is not in the actual string...]
If you want the first two digits to be within the range of 00 to 13, then you can use the OR operator with | and a group:
^(?:0[0-9]|1[0-3])-ARID-[0-9]{3}$
 ^^^      ^      ^
  |       OR     |
  |              |
  +---- Group ---+
Breakdown:
^          Matches beginning of string
(?:        Beginning of group
  0[0-9]   Matches 00 to 09 only
  |        OR
  1[0-3]   Matches 10 to 13 only
)          End of group
-ARID-     Matches -ARID- literally
[0-9]{3}   Matches 3 digits
$          Matches end of string (or just before a final newline)
When there is an option of matching 00-09 or 10-13, the pattern just cannot match a blank. There's no way it will match if the numbers are not there.
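A quick sketch of checking a few inputs against the range-restricted pattern. The class name and the test values are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AridDemo
{
    // Two digits in the range 00-13, then -ARID-, then exactly three digits.
    private static readonly Regex Arid = new Regex(@"^(?:0[0-9]|1[0-3])-ARID-[0-9]{3}$");

    public static bool IsValid(string s) => Arid.IsMatch(s);

    public static void Main()
    {
        // Hypothetical sample values.
        Console.WriteLine(IsValid("07-ARID-123")); // True
        Console.WriteLine(IsValid("13-ARID-000")); // True
        Console.WriteLine(IsValid("14-ARID-123")); // False: 14 is outside 00-13
        Console.WriteLine(IsValid("-ARID-123"));   // False: leading digits missing
    }
}
```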

Regex issue with nested groups and using "^" and "$" in the pattern

I have content like this:
var testInput =
    "05(testcontent)\r\n" +
    "06(testcontent2)\r\n" +
    "07(testcontent3)(testcontent4)" +
    "08(testcontent5)";
I need to get one code string and two value strings for each line.
For the first line:
Code: "05"
Value1: "testcontent"
Value2: Empty string.
For the third line:
Code: "07"
Value1: "testcontent3"
Value2: "testcontent4"
The pattern I use:
// (?<Code>[0-9]{2}) - 2 digit number
// \((?<Value1>.+)\) - First value, which is inside the parentheses.
// (\((?<Value2>.+)\))? - Second value, which also is inside the parentheses.
// The second value does not always exist. Which is why it has "?" at its end.
var testPattern = @"(?<Code>[0-9]{2})\((?<Value1>.+)\)(\((?<Value2>.+)\))?";
The code I use:
var testRegex = new Regex(testPattern,
    RegexOptions.Compiled |
    RegexOptions.CultureInvariant |
    RegexOptions.ExplicitCapture |
    RegexOptions.Multiline);
foreach (Match match in testRegex.Matches(testInput))
    Console.WriteLine("{0}: {1} | {2}",
        match.Groups["Code"].Value,
        match.Groups["Value1"].Value,
        match.Groups["Value2"].Value);
The result I get:
05: testcontent |
06: testcontent2 |
07: testcontent3)(testcontent4)08(testcontent5 |
If I use ^ at the start and $ at the end of the pattern, I get even worse:
07: testcontent3)(testcontent4)08(testcontent5 |
So,
Why do the ^ and $ complicate things even more when I specified "RegexOptions.Multiline"?
What is wrong with my pattern?
Will you ever have closing parentheses inside your Value1 or Value2? If not, I'd suggest using a negated character class like [^)]+ instead of .+. The reason is that .+ being "greedy" (i.e. repeating as many times as possible) is causing problems in this case.
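Here is a sketch of the asker's pattern with .+ swapped for the negated class [^)]+, assuming the values never contain a closing parenthesis. The class and method names are hypothetical; the input is the one from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NegatedClassDemo
{
    // Same shape as the asker's pattern, but each value is [^)]+ so it
    // cannot run past its own closing parenthesis.
    private static readonly Regex LineRegex = new Regex(
        @"(?<Code>[0-9]{2})\((?<Value1>[^)]+)\)(\((?<Value2>[^)]+)\))?",
        RegexOptions.ExplicitCapture);

    // Formats every match as "Code: Value1 | Value2".
    public static List<string> Parse(string input)
    {
        var rows = new List<string>();
        foreach (Match match in LineRegex.Matches(input))
        {
            rows.Add(string.Format("{0}: {1} | {2}",
                match.Groups["Code"].Value,
                match.Groups["Value1"].Value,
                match.Groups["Value2"].Value));
        }
        return rows;
    }

    public static void Main()
    {
        var testInput =
            "05(testcontent)\r\n" +
            "06(testcontent2)\r\n" +
            "07(testcontent3)(testcontent4)" +
            "08(testcontent5)";
        foreach (string row in Parse(testInput))
        {
            Console.WriteLine(row);
        }
    }
}
```

With this change the 07 and 08 entries come out as separate matches even though no line break separates them in the input.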

Regular expression for boolean expressions with ()&|~ as operators and groupers identifiers in various positions

I've not been able to find any examples that cover my specific problem. I'm using the .NET C# regex library. I have to parse a legacy file structure that uses long boolean expressions with their own format. I now need to replace specific identifier numbers with other text, and I can't get my matching to work.
For example i need to match the number 1 in the following types of cases.
"1" |
"1 & 2" |
"2 | 1" |
"3 & (1 | 2)" |
"(3 | 1) & 2"
But not match in:
"11" |
"2 & 11" |
"11 | 3"
Further, I'd prefer to match just the exact number, not any additional spaces or parens. I'm hoping it can be done in one expression. So far I can only do it with multiple expressions and regex replacements. Does any guru know how to do this?
The following regex, using lookahead and lookbehind, should do it for your examples.
//regex expression (the important part)
var regex = new Regex(@"(?<!\d)1(?!\d)");
//test input
var input1 = "\"1\" \"1 & 2\" \"2 | 1\" \"3 & (1 | 2)\" \"3( | 1) & 2\"";
var input2 = "\"11\" \"2 & 11\" \"11 | 3\"";
//replace and print result
Console.WriteLine(regex.Replace(input1, "NEW"));
Console.WriteLine(regex.Replace(input2, "NEW"));
// Output:
// "NEW" "NEW & 2" "2 | NEW" "3 & (NEW | 2)" "3( | NEW) & 2"
// "11" "2 & 11" "11 | 3"
This essentially means "match all 1s that aren't directly preceded or followed by a digit".
For more info on the lookahead/lookbehind assertions, check out this page: http://www.regular-expressions.info/lookaround.html
To match just "1" by itself, not as part of another value, you can probably use this:
var regex = new Regex(@"\b1\b");
The "\b" before and after match on a word boundary (the transition from a word to a non-word character or vice-versa). Based on your examples, the above will work.
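The two approaches agree on the question's examples but can differ when a letter or underscore sits next to the digit, since \b treats those as word characters. A small sketch of that difference follows; the class name and the "A1" input are hypothetical and go beyond what the question describes.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BoundaryDemo
{
    private static readonly Regex WordBoundary = new Regex(@"\b1\b");
    private static readonly Regex DigitLookaround = new Regex(@"(?<!\d)1(?!\d)");

    public static string ReplaceWithBoundary(string input) =>
        WordBoundary.Replace(input, "NEW");

    public static string ReplaceWithLookaround(string input) =>
        DigitLookaround.Replace(input, "NEW");

    public static void Main()
    {
        // Both behave the same on inputs like those in the question.
        Console.WriteLine(ReplaceWithBoundary("3 & (1 | 2)"));   // 3 & (NEW | 2)
        Console.WriteLine(ReplaceWithLookaround("3 & (1 | 2)")); // 3 & (NEW | 2)
        Console.WriteLine(ReplaceWithBoundary("2 & 11"));        // 2 & 11

        // Hypothetical input with a letter directly before the digit.
        Console.WriteLine(ReplaceWithBoundary("A1 | 1"));        // A1 | NEW
        Console.WriteLine(ReplaceWithLookaround("A1 | 1"));      // ANEW | NEW
    }
}
```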

C# regular expression

Please give a suggestion here:
I try to do a regex in C#. Here is the text
E1A: pop(+)
call T
call E1
E1B: return
TA: call F
call T1
I want to split it like this:
1)
E1A: pop(+)
call T
call E1
2)
E1B: return
3)
TA: call F
call T1
I thought of using a lookahead, but it's not working because of the .+
Here is what I hoped would work, but it doesn't:
"[A-Z0-9]+[:].+(?=([A-Z0-9]+[:]))"
Does anyone have a better idea?
EDIT: The "E1A", "E1B", "TA" labels change. All that stays the same is that they are made of letters and numbers followed by ":"
Regex regexObj = new Regex(
    @"^                      # Start of line
    [A-Z0-9]+:               # Match identifier
    (?:                      # Match...
     (?!^[A-Z0-9]+:)         # (unless it's the start of the next identifier)
     .                       # ... any character,
    )*                       # repeat as needed.",
    RegexOptions.Singleline | RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace);
MatchCollection allMatchResults = regexObj.Matches(subjectString);
Now allMatchResults.Count will contain the number of matches in subjectString, and allMatchResults[i] will contain the i-th match.
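A compact sketch of the same idea applied to the text from the question, collecting each block into a list. The class and method names are hypothetical, the pattern is written on one line without IgnorePatternWhitespace, and the sample text is assumed to use "\n" line breaks.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BlockSplitDemo
{
    // A label at the start of a line, then any characters (including line
    // breaks) up to the point where the next line starts with a label.
    private static readonly Regex BlockRegex = new Regex(
        @"^[A-Z0-9]+:(?:(?!^[A-Z0-9]+:).)*",
        RegexOptions.Singleline | RegexOptions.Multiline);

    // Returns each labelled block with trailing whitespace removed.
    public static List<string> Split(string text)
    {
        var blocks = new List<string>();
        foreach (Match m in BlockRegex.Matches(text))
        {
            blocks.Add(m.Value.TrimEnd());
        }
        return blocks;
    }

    public static void Main()
    {
        string text = "E1A: pop(+)\ncall T\ncall E1\nE1B: return\nTA: call F\ncall T1";
        List<string> blocks = Split(text);
        for (int i = 0; i < blocks.Count; i++)
        {
            Console.WriteLine("{0})", i + 1);
            Console.WriteLine(blocks[i]);
        }
    }
}
```

Each match includes the line break that precedes the next label, which is why the sketch trims the end of every block.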
