How can I modify this regex to support multiple alternatives?


I'm a newbie at constructing regexes.
I have this working regex:
^([a-zA-Z0-9\d]+-)*[a-zA-Z0-9\d]+$
Example:
-test : false
test- : false
te--st : false
test : true
test-test : true
te-st-t : true
I would like to add support for _ (underscore) as a delimiter, so that the examples above give the same results with - replaced by _. However, only one kind of delimiter may be used within a single string.
Example:
te-st_test : false
te_st_test : true
The solutions I tried:
^([a-zA-Z0-9\d]+(-|_))*[a-zA-Z0-9\d]+$
^(([a-zA-Z0-9\d]+-)|([a-zA-Z0-9\d]+_))*[a-zA-Z0-9\d]+$
Bad result:
te_st-test : true
I would like to have this result:
-test : false
test- : false
--test : false
__test : false
test-- : false
test__ : false
-_test : false
test-_ : false
test--test : false
test__test : false
test-_test : false
te-st_test : false
te-st : true
te_st : true
te_st_test : true
te-st-test : true
test : true
Thanks & have a nice day!

You may capture the first delimiter (if any) and then use a backreference to that value in the repeated non-capturing group:
^[a-zA-Z\d]+(?=([-_])?)(?:\1[a-zA-Z\d]+)*$
\A[a-zA-Z\d]+(?=([-_])?)(?:\1[a-zA-Z\d]+)*\z
See the regex demo.
Note: when validating strings, I'd use the \A (start of string) and \z (the very end of string) anchors rather than ^/$.
Also, if you do not want \d to match all Unicode digits (e.g. ३৫৬૦૧௮೪൮໘), pass the RegexOptions.ECMAScript option when creating the Regex object, or replace \d with 0-9 inside the character class.
Details:
\A - start of string
[a-zA-Z\d]+ - one or more letters or digits
(?=([-_])?) - a positive lookahead that captures the next char into Group 1 if it is - or _ (the capture is optional, so Group 1 stays unset otherwise)
(?:\1[a-zA-Z\d]+)* - zero or more sequences of Group 1 value and one or more letters or digits
\z - the very end of string.
In C#, you can declare it as
var Pattern = new Regex(@"\A[a-zA-Z\d]+(?=([-_])?)(?:\1[a-zA-Z\d]+)*\z");
// Or,
var Pattern = new Regex(@"\A[a-zA-Z\d]+(?=([-_])?)(?:\1[a-zA-Z\d]+)*\z", RegexOptions.ECMAScript);
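To try the pattern against a few of the strings from the question, a minimal sketch along these lines could be used. The class and method names (DelimiterCheck, IsValid) are made up for illustration and are not part of the answer; the pattern itself is the \A...\z version shown above, without RegexOptions.ECMAScript.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DelimiterCheck
{
    // Pattern from the answer above: the lookahead captures the first
    // delimiter (if any), and \1 forces every later delimiter to be the same.
    private static readonly Regex Pattern =
        new Regex(@"\A[a-zA-Z\d]+(?=([-_])?)(?:\1[a-zA-Z\d]+)*\z");

    // Hypothetical helper name, used only for this illustration.
    public static bool IsValid(string input) => Pattern.IsMatch(input);

    public static void Main()
    {
        string[] samples =
        {
            "test", "te-st-test", "te_st_test",        // expected to be accepted
            "te-st_test", "test--test", "-test", "test_" // expected to be rejected
        };

        foreach (var s in samples)
            Console.WriteLine($"{s} : {IsValid(s)}");
    }
}
```

Keeping the regex in a static readonly field avoids re-parsing the pattern on every call, which matters if the check runs often.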

Related

Regex query to validate a currency string C#?

I am in no way a master of regex, which is why I am here. I currently have this:
\s?[^0-9]?\s?[0-9,]+([\\.]{1})?[0-9]+\s?
Link to regex101
To explain my validation attempt I am trying to validate that a string matches the correct formatting structure.
I only want to match strings such as:
£1.00
£10.00
£100.00
£1000.00
£10,000.00
£100,000.00
£1,234,546.00
Validation rules:
String must include a '£' at the start.
String should always have 2 digits following a decimal place.
Following the '£' only digits between 0-9 should be accepted
If the string length is greater than 6 (after £1000.00) then commas need to be entered at the appropriate points (I.E. £10,000.00 - £100,000.00 - £1,000,000.00 - £10,000,000.00 etc)
For example, strings that shouldn't be accepted:
£1
£10.000
£1,00.00
£1,000.00
10,000.00
£100,000,00
£1.234.546
Really hoping that one of you amazing people can help me out. If you need any more info, please let me know!

You can try the pattern below:
^£(?:[0-9]{1,4}|[0-9]{2,3},[0-9]{3}|[0-9]{1,3}(?:,[0-9]{3}){2,})\.[0-9]{2}$
Since the pence part is mandatory (\.[0-9]{2}), we have 3 options for the pounds part:
[0-9]{1,4} for sums in range £0 .. £9999
[0-9]{2,3},[0-9]{3} for sums in range £10,000 .. £999,999
[0-9]{1,3}(?:,[0-9]{3}){2,} for huge sums £1,000,000 ..
Demo:
using System;
using System.Linq;
using System.Text.RegularExpressions;
...
Regex regex = new Regex(
@"^£(?:[0-9]{1,4}|[0-9]{2,3},[0-9]{3}|[0-9]{1,3}(?:,[0-9]{3}){2,})\.[0-9]{2}$");
string[] tests = new string[] {
"£1.00",
"£10.00",
"£100.00",
"£1000.00",
"£10,000.00",
"£100,000.00",
"£1,234,546.00",
"£1",
"£10.000",
"£1,00.00",
"£1,000.00",
"10,000.00",
"£100,000,00",
"£1.234.546",
};
string report = string.Join(Environment.NewLine, tests
.Select(test => $"{test,15} : {(regex.IsMatch(test) ? "Match" : "Fail")}"));
Console.Write(report);
Outcome:
£1.00 : Match
£10.00 : Match
£100.00 : Match
£1000.00 : Match
£10,000.00 : Match
£100,000.00 : Match
£1,234,546.00 : Match
£1 : Fail
£10.000 : Fail
£1,00.00 : Fail
£1,000.00 : Fail
10,000.00 : Fail
£100,000,00 : Fail
£1.234.546 : Fail

What about this?
new Regex(@"£\d{1,3}(\,\d{3})*\.\d{2}\b");
Edit:
new Regex(@"£((\d{1,4})|(\d{2,3}(\,\d{3})+)|(\d(\,\d{3}){2,}))\.\d{2}\b");
https://regex101.com/r/zSRw2B/1

ANTLR4 Nested modes?

I'm attempting to parse the following string:
<<! variable, my_variable, A description of my variable !>>
From the reading I've been doing here, I believe I need to use modes to distinguish between the lexers for the literal string 'variable', the variable name (my_variable), and the variable description.
The problem I'm having is that I'm not sure how to structure this. Is it possible to nest modes? Is there a better/smarter way to organize my lexer rules?
lexer grammar VariableLexer;
variableMarkdown : DELIMITER_OPEN SPACE VARIABLE COMMA SPACE variable_name COMMA SPACE description SPACE DELIMITER_CLOSE;
description : WORDS ;
variable_name : ID ;
DELIMITER_OPEN : '<<!' ;
DELIMITER_CLOSE : '!>>';
COMMA : ',' ;
SPACE : ' ' ;
VARIABLE : 'variable' -> pushMode(VariableName);
mode VariableName;
ID : LOWERCASE ( LOWERCASE | NUMBER | UNDERSCORE )* -> pushMode(VariableDescription) ;
mode VariableDescription;
WORDS : ( UPPERCASE | LOWERCASE | NUMBER | SPACE )+ -> popMode;
fragment LOWERCASE : 'a'..'z' ;
fragment UPPERCASE : 'A'..'Z' ;
fragment UNDERSCORE : '_' ;
fragment NUMBER : '0'..'9' ;

First - you can't have parser rules in a lexer grammar. Parser rule names start with a lowercase letter, lexer rule names with an uppercase one.
I'd do it like so (may not be correct syntax but you'll get the idea):
//default mode is implicitly defined by (or in) ANTLR4
VARIABLE : 'variable' ' '* ',' -> mode(mode_VariableName);
...
mode mode_VariableName;
//define token with anything ending with comma, many ways to do this...
//(fragment names are lexer rules, so they must start with an uppercase letter)
fragment VAR_NAME_FRAG : [a-zA-Z_0-9];
VARIABLE_NAME : VAR_NAME_FRAG+ ' '* ',' -> mode(mode_varDesc);
mode mode_varDesc;
//similar again for variable description
VAR_DESC: //I'll write just a comment here but should more or less match anything except
END_VAR : '!>>' -> mode(DEFAULT_MODE);
Basically in this way you are jumping to modes you need instead of pushing and popping.

c# migrating to ANTLR 4 from ANTLR 3 with AST

I have inherited some C# code based on ANTLR 3.
We have some grammar files that use the AST (abstract syntax tree) option, and we use those grammars to parse text files written in a very odd "language" into objects. We use the AST as intermediate objects and then convert them to the real objects that we need (with some more processing).
I have no knowledge of ANTLR, but we currently have a performance bottleneck in the application caused by ANTLR's processing of the files.
Since we are using ANTLR 3, we thought we might get a performance boost by migrating to ANTLR 4 (and also get the latest version of ANTLR, which is always good practice).
I have read that ASTs no longer exist in ANTLR 4. What is the best (and simplest) way to replace them, and what will it mean for my current code?
What is the best approach to upgrading, and will it really give us a performance boost?
An example of one of the grammar files (there are 6, and this is the simplest one):
grammar Rules;
options
{
language=CSharp2;
output=AST;
ASTLabelType=CommonTree;
superClass = OOPLParserBase;
}
tokens
{
OOPL_MODEL;
}
@lexer::namespace { TestParser.Common.RulesParser }
@parser::namespace { TestParser.Common.RulesParser }
@header
{
using System.Collections.Generic;
using TestParser.OOPLModel;
}
@members
{
public RulesParser() : base(null)
{
}
protected override CommonTree GetAst()
{
return root().Tree as CommonTree;
}
protected override Lexer GetLexer()
{
return new RulesLexer();
}
}
//semantic analysis
root : header (rule_line COMMENT?)+ -> ^(header rule_line+);
header : header_comment+ -> ^(OOPL_MODEL<OOPLModel>[new CommonToken(OOPL_MODEL), "1.0"] header_comment+);
header_comment : COMMENT -> ^(COMMENT<OOPLComment>[$COMMENT, $COMMENT.Text]);
rule_line : parameter RULE_TYPE COMMA PARAMETER_NAME COLON condition -> ^(RULE_TYPE<OOPLBlock>[$RULE_TYPE, $RULE_TYPE.Text] parameter PARAMETER_NAME<OOPLValue>[$PARAMETER_NAME, $PARAMETER_NAME.Text] condition);
parameter : PARAMETER_NAME EQUALS (integer_value = INTEGER | real_value = REAL |string_value = STRING) COMMA -> ^(PARAMETER_NAME<OOPLKeyedValue>[$PARAMETER_NAME, $PARAMETER_NAME.Text, SingleWhereNotNull<IToken>($integer_value, $string_value, $real_value).Text]);
condition : condition_value COMMA condition_value COMMA condition_value COMMA condition_value COMMA condition_value COMMA condition_value COMMA condition_value COMMA condition_value COMMA condition_value COMMA condition_value COMMA condition_value COMMA condition_value COMMA condition_value;
condition_value : (asterisk| parameter_name | positive_integer);
asterisk : ASTERISK -> ^(ASTERISK<OOPLValue>[$ASTERISK, $ASTERISK.Text]);
parameter_name : PARAMETER_NAME -> ^(PARAMETER_NAME<OOPLValue>[$PARAMETER_NAME, $PARAMETER_NAME.Text]);
positive_integer : INTEGER -> ^(INTEGER<OOPLValue>[$INTEGER, $INTEGER.Text]);
//lexical analysis
EQUALS : '=';
NEW_LINE_R : '\r' { $channel = HIDDEN; };
NEW_LINE_N : '\n' { $channel = HIDDEN; };
RULE_TYPE : ('Time'|'TIME'|'Lol'|'LOL'|'World'|'WORLD'|'Template'|'TEMPLATE');
DOUBLE_COLON : COLON COLON;
INTEGER : MINUS? DIGIT+;
REAL : INTEGER '.' INTEGER;
PARAMETER_NAME : ASTERISK? (LETTER|DIGIT|UNDERSCORE|FORWARDSLASH|DOUBLE_COLON|MINUS)+ ASTERISK?;
WS : ( ' '
| '\t'
| NEW_LINE_R
| NEW_LINE_N
) { $channel = HIDDEN; } ;
COMMENT : '#' ( options {greedy=false;} : . )* NEW_LINE_R? NEW_LINE_N;
STRING : '"'~('"')* '"';
fragment
MINUS : '-';
COMMA : ',';
COLON : ':';
fragment
DOT : '.';
ASTERISK : '*';
fragment
FORWARDSLASH : '/';
fragment
UNDERSCORE : '_';
fragment
DIGIT : '0'..'9';
fragment
LETTER : 'A'..'Z' | 'a'..'z';

I'd do the transformation solely in C# code after the parse.
In this case I'd even skip the intermediate AST form and transform the parse tree (provided by ANTLR4) directly into the target representation.
Some prefer ParseTreeListeners/ParseTreeWalkers, which help you walk the parse tree. Check these out if you want some pre-built code. Be sure to use the typed listener generated for your grammar, which should be named RulesParseTreeListener<>; inherit from it and adjust it to your needs.
link: https://theantlrguy.atlassian.net/wiki/display/ANTLR4/Parse+Tree+Listeners
I'd not recommend ParseTreeVisitors which are invoked during the parse (as opposed to after the parse). They are only suitable for simple operations or grammars that are not context free and require code during the parse. If the requirements evolve later on, you're way more flexible with custom processing or listeners/walkers.

Regex match 2 out of 4 groups

I want a single regex to match strings that contain at least 2 of these 4 groups: lowercase letters, uppercase letters, numbers, and special characters. The length also needs to be greater than 7.
I currently have this expression
^(?=.*[^a-zA-Z])(?=.*[a-z])(?=.*[A-Z]).{8,}$
It, however, forces the string to have lowercase and uppercase and digit or special character.
I currently have this implemented using 4 different regex expressions that I interrogate with some C# code.
I plan to reuse the same expression in JavaScript.
This is a sample console app that shows the difference between the 2 approaches.
using System;
using System.Linq;
using System.Text.RegularExpressions;
class Program
{
private static readonly Regex[] Regexs = new[] {
new Regex("[a-z]", RegexOptions.Compiled), //Lowercase Letter
new Regex("[A-Z]", RegexOptions.Compiled), // Uppercase Letter
new Regex(@"\d", RegexOptions.Compiled), // Numeric
new Regex(@"[^a-zA-Z\d\s:]", RegexOptions.Compiled) // Non AlphaNumeric
};
static void Main(string[] args)
{
// Flags are combined with | (bitwise OR); & would yield RegexOptions.None here
Regex expression = new Regex(@"^(?=.*[^a-zA-Z])(?=.*[a-z])(?=.*[A-Z]).{8,}$", RegexOptions.ECMAScript | RegexOptions.Compiled);
string[] testCases = new[] { "P#ssword", "Password", "P2ssword", "xpo123", "xpo123!", "xpo123!123##", "Myxpo123!123##", "Something_Really_Complex123!#43#2*333" };
Console.WriteLine("{0}\t{1}\t", "Single", "C# Hack");
Console.WriteLine("");
foreach (var testCase in testCases)
{
Console.WriteLine("{0}\t{2}\t : {1}", expression.IsMatch(testCase), testCase,
(testCase.Length >= 8 && Regexs.Count(x => x.IsMatch(testCase)) >= 2));
}
Console.ReadKey();
}
}
Result Proper Test String
------- ------- ------------
True True : P#ssword
False True : Password
True True : P2ssword
False False : xpo123
False False : xpo123!
False True : xpo123!123##
True True : Myxpo123!123##
True True : Something_Really_Complex123!#43#2*333

For JavaScript, you can use this pattern, which looks for boundaries between different character classes:
^(?=.*(?:.\b.|(?i)(?:[a-z]\d|\d[a-z])|[a-z][A-Z]|[A-Z][a-z]))[^:\s]{8,}$
If a boundary is found, you are sure to have two different classes.
pattern details:
\b # is a zero width assertion, it's a boundary between a member of
# the \w class and an other character that is not from this class.
.\b. # represents the two characters with the word boundary.
boundary between a letter and a number:
(?i) # make the subpattern case insensitive
(?:
[a-z]\d # a letter and a digit
| # OR
\d[a-z] # a digit and a letter
)
boundary between an uppercase and a lowercase letter:
[a-z][A-Z] | [A-Z][a-z]
Since every alternative contains at least two characters from two different character classes, you are sure to get the result you want.

You could use possessive quantifiers (emulated using atomic groups), something like this:
((?>[a-z]+)|(?>[A-Z]+)|(?>[^a-zA-Z]+)){2,}
Since using possessive matching will prevent backtracking, you won't run into the two groups being two consecutive groups of lowercase letters, for instance. So the full regex would be something like:
^(?=.*((?>[a-z]+)|(?>[A-Z]+)|(?>[^a-zA-Z]+)){2,}).{8,}$
Though, were it me, I'd cut the lookahead, just use the expression ((?>[a-z]+)|(?>[A-Z]+)|(?>[^a-zA-Z]+)){2,}, and check the length separately.
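As a rough illustration of that last suggestion (regex for the character classes, length checked in code), something like the following sketch might do. The names TwoClassCheck and IsAcceptable are invented for the example, and the minimum length of 8 is taken from the .{8,} in the question. Note that this expression only distinguishes three classes: digits and special characters both fall under [^a-zA-Z], so a string made of digits and symbols alone would be rejected.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TwoClassCheck
{
    // Each atomic group swallows a whole run of one character class and
    // cannot give characters back, so two consecutive iterations can only
    // succeed if the second run belongs to a different class.
    private static readonly Regex TwoRuns =
        new Regex(@"((?>[a-z]+)|(?>[A-Z]+)|(?>[^a-zA-Z]+)){2,}");

    // Hypothetical helper: length is checked in code, classes by the regex.
    public static bool IsAcceptable(string input, int minLength = 8) =>
        input.Length >= minLength && TwoRuns.IsMatch(input);

    public static void Main()
    {
        string[] testCases =
        {
            "P#ssword", "Password", "P2ssword", "xpo123",
            "xpo123!", "xpo123!123##", "password", "12345678"
        };

        foreach (var testCase in testCases)
            Console.WriteLine($"{IsAcceptable(testCase)}\t : {testCase}");
    }
}
```

Splitting the length check out of the pattern also makes it easy to report which rule failed, if that is ever needed.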

C# Regex.IsMatch returns true when it shouldn't?

I'm attempting to match a string that can contain any number of numeric characters or a decimal point using the following regex:
([0-9.])*
Here's some C# code to test the regex:
Regex regex = new Regex("([0-9.])*");
if (!regex.IsMatch("a"))
throw new Exception("No match.");
I expect the exception to be thrown here but it isn't - am I using the Regex incorrectly or is there an error in the pattern?
EDIT: I'd also like to match a blank string.

The * quantifier means "match 0 or more". In your case, the group matches zero times against "a", so the regex still succeeds with an empty match. You probably wanted:
([0-9.]+)
The + quantifier means "match 1 or more", so it fails on non-numeric inputs and returns no matches. A quick spin in the regex tester shows:
input result
----- ------
[empty] No matches
a No matches
. 1 match: "."
20.15 1 match: "20.15"
1 1 match: "1"
1.1.1 1 match: "1.1.1"
20. 1 match: "20."
Looks like we have some false positives, let's revise the regex as such:
^([0-9]+(?:\.[0-9]+)?)$
Now we get:
input result
----- ------
[empty] No matches
a No matches
. No matches
20.15 1 match: "20.15"
1 1 match: "1"
1.1.1 No matches
20. No matches
Coolness.

Regex.IsMatch("a", "([0-9.])*") // true
This is because the group can match ZERO or more times.
Regex.IsMatch("a", "([0-9.])+") // false
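Since the question's edit also asks for a blank string to be accepted, switching to + is not quite enough. One possible alternative, sketched below under that assumption, is to keep * and anchor the pattern so the whole input must consist of digits and dots. The names NumericCheck and IsNumericLike are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NumericCheck
{
    // Anchored with \A and \z so the entire input must be digits and dots;
    // * (rather than +) keeps the empty string valid.
    private static readonly Regex DigitsAndDots = new Regex(@"\A[0-9.]*\z");

    // Hypothetical helper name, used only for this illustration.
    public static bool IsNumericLike(string input) => DigitsAndDots.IsMatch(input);

    public static void Main()
    {
        // The unanchored original succeeds on "a" with an empty match.
        Match m = Regex.Match("a", "([0-9.])*");
        Console.WriteLine($"Success: {m.Success}, Length: {m.Length}");

        foreach (var s in new[] { "", "a", "20.15", "1.1.1" })
            Console.WriteLine($"[{s}] : {IsNumericLike(s)}");
    }
}
```

This still accepts inputs such as 1.1.1, because the question only asks for "any number of numeric characters or a decimal point"; a stricter number format needs a stricter pattern.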

You should use + instead of *
Regex reg = new Regex("([0-9.])+");
This should work fine.
When you use *, any string can match this pattern, because zero occurrences are enough.
