Algorithm for finding strings that a specific Regex will match - c#

Given a regex pattern, I'm trying to find a string that matches it. Similar to how Django reverses them, but in C#. Are there any pre-made C# libraries that do this?
Edit: Moving this project to Google code pretty soon.
Current Test Results
^abc$ > abc : pass
\Aa > a : pass
z\Z > z : pass
z\z > z : pass
z\z > z : pass
\G\(a\) > \(a\) : pass
ab\b > ab : pass
a\Bb > ab : pass
\a > : pass
[\b] > : pass
\t > \t : pass
\r > \r : pass
\v > ♂ : pass
\f > \f : pass
\n > \n : pass
\e > ← : pass
\141 > a : pass
\x61 > a : pass
\cC > ♥ : pass
\u0061 > a : pass
\\ > \\ : pass
[abc] > a : pass
[^abc] > î : pass
[a-z] > a : pass
. > p : pass
\w > W : pass
\W > ☻ : pass
\s > \n : pass
\S > b : pass
\d > 4 : pass
\D > G : pass
(a)\1 > aa : pass
(?<n>a)\k<n> > aa : pass
(?<n>a)\1 > aa : pass
(a)(?<n>b)\1\2 > abab : pass
(?<n>a)(b)\1\2 > abba : pass
(a(b))\1\2 > ababb : pass
(a(b)(c(d)))\1\2\3\4 > abcdabcdbcdd : pass
a\0 > a : pass
ab* > a : pass
ab+ > abbb : pass
ab? > a : pass
ab{2} > abb : pass
ab{2,} > abbbbbbbbb : pass
ab{2,3} > abb : pass
ab*? > abb : pass
ab+? > abbbbb : pass
ab?? > a : pass
ab{2}? > abb : pass
ab{2,}? > abbbbbbbbb : pass
ab{2,3}? > abbb : pass
/users(?:/(?<id>\d+))? > /users/77 : pass
Passed 52/52 tests.

See, for example, Using Regex to generate Strings rather than match them.
Also take a look at http://en.wikipedia.org/wiki/Deterministic_finite-state_machine, especially the "Accept and Generate modes" section.
As others noted, you will need to create a DFA from your regular expression and then generate your strings using this DFA.
To convert your regular expression to a DFA, generate an NFA first (see for example http://lambda.uta.edu/cse5317/spring01/notes/node9.html) and then convert the NFA to a DFA.
The easiest way I can see is to use a parser generator program for that. I do not think Django does this.
Hope this helps.
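To make the "generate mode" idea concrete, here is a minimal sketch that walks an already-built DFA breadth-first and returns a shortest accepted string. The class name DfaGenerator, the dictionary-based transition table, and the hand-written DFA for ab+ are assumptions made for illustration; building the DFA from a pattern is not shown, and features such as backreferences do not fit a plain DFA at all.

```csharp
using System;
using System.Collections.Generic;

static class DfaGenerator
{
    // Returns a shortest string accepted by the DFA, or null if no
    // accepting state is reachable from the start state.
    public static string ShortestAccepted(
        Dictionary<int, Dictionary<char, int>> transitions,
        int start,
        HashSet<int> accepting)
    {
        var queue = new Queue<(int State, string Text)>();
        var seen = new HashSet<int> { start };
        queue.Enqueue((start, ""));

        while (queue.Count > 0)
        {
            var (state, text) = queue.Dequeue();
            if (accepting.Contains(state)) return text;

            // States without outgoing edges are dead ends.
            if (!transitions.TryGetValue(state, out var edges)) continue;

            foreach (var edge in edges)
            {
                // Visit each state once; BFS guarantees the first visit is shortest.
                if (seen.Add(edge.Value))
                    queue.Enqueue((edge.Value, text + edge.Key));
            }
        }
        return null;
    }

    // Hand-written DFA for the pattern ab+ (state 2 is accepting).
    public static Dictionary<int, Dictionary<char, int>> AbPlus() =>
        new Dictionary<int, Dictionary<char, int>>
        {
            [0] = new Dictionary<char, int> { ['a'] = 1 },
            [1] = new Dictionary<char, int> { ['b'] = 2 },
            [2] = new Dictionary<char, int> { ['b'] = 2 },
        };
}
```

Calling DfaGenerator.ShortestAccepted(DfaGenerator.AbPlus(), 0, new HashSet<int> { 2 }) yields "ab". Picking random edges instead of the shortest path would give varied samples such as the abbb shown in the test results.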

"Are there any pre-made C# libraries that do this?"
NO
(I expect this will be accepted as the answer momentarily)

Related

How can I modify this regex to support multiple alternatives?

I'm a newbie at constructing regexes.
I have this working regex:
^([a-zA-Z0-9\d]+-)*[a-zA-Z0-9\d]+$
Example:
-test : false
test- : false
te--st : false
test : true
test-test : true
te-st-t : true
I would like to add support for _ (underscores) so that the examples above give the same results with - replaced by _, but only one kind of delimiter may be used within a single string.
Example:
te-st_test : false
te_st_test : true
The solutions I tried:
^([a-zA-Z0-9\d]+(-|_))*[a-zA-Z0-9\d]+$
^(([a-zA-Z0-9\d]+-)|([a-zA-Z0-9\d]+_))*[a-zA-Z0-9\d]+$
Bad result:
te_st-test : true
I would like to have this result:
-test : false
test- : false
--test : false
__test : false
test-- : false
test__ : false
-_test : false
test-_ : false
test--test : false
test__test : false
test-_test : false
te-st_test : false
te-st : true
te_st : true
te_st_test : true
te-st-test : true
test : true
Thanks & have a nice day!
You may capture the first delimiter (if any) and then use a backreference to that value in the repeated non-capturing group:
^[a-zA-Z\d]+(?=([-_])?)(?:\1[a-zA-Z\d]+)*$
\A[a-zA-Z\d]+(?=([-_])?)(?:\1[a-zA-Z\d]+)*\z
See the regex demo.
Note: when validating strings, I prefer the \A (start of string) and \z (the very end of string) anchors over ^/$.
Also, if you do not want \d to match all Unicode digits (e.g. ३৫৬૦૧௮೪൮໘), you need to pass the RegexOptions.ECMAScript option when compiling the regex object, or replace \d with 0-9 inside the character class.
Details:
\A - start of string
[a-zA-Z\d]+ - one or more letters or digits
(?=([-_])?) - a positive lookahead that captures into Group 1 the next char that is an optional - or _
(?:\1[a-zA-Z\d]+)* - zero or more sequences of Group 1 value and one or more letters or digits
\z - the very end of string.
In C#, you can declare it as
var Pattern = new Regex(@"\A[a-zA-Z\d]+(?=([-_])?)(?:\1[a-zA-Z\d]+)*\z");
// Or,
var Pattern = new Regex(@"\A[a-zA-Z\d]+(?=([-_])?)(?:\1[a-zA-Z\d]+)*\z", RegexOptions.ECMAScript);
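A quick way to check the pattern against the cases from the question is a small harness like the one below. The class name DelimiterValidator and the Demo method are made up for this sketch; only the pattern itself comes from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

static class DelimiterValidator
{
    // Pattern from the answer: the lookahead captures the first delimiter
    // (if any), and the backreference forces every later delimiter to match it.
    static readonly Regex Pattern =
        new Regex(@"\A[a-zA-Z\d]+(?=([-_])?)(?:\1[a-zA-Z\d]+)*\z");

    public static bool IsValid(string input) => Pattern.IsMatch(input);

    public static void Demo()
    {
        string[] samples =
        {
            "test", "te-st-test", "te_st_test",   // expected to be accepted
            "te-st_test", "test--test", "-test", "test_" // expected to be rejected
        };
        foreach (var s in samples)
            Console.WriteLine($"{s} : {IsValid(s)}");
    }
}
```

The first three samples match and the remaining four do not, which agrees with the table of desired results in the question.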

Directory.CreateDirectory fails with invalid character

I am facing an issue where my path string passes the check against Path.GetInvalidPathChars() but fails when I try to create the directory.
static void Main(string[] args)
{
string str2 = @"C:\Temp\hjk&(*&ghj\config\";
foreach (var character in System.IO.Path.GetInvalidPathChars())
{
if (str2.IndexOf(character) > -1)
{
Console.WriteLine("String contains invalid path character '{0}'", character);
return;
}
}
Directory.CreateDirectory(str2); //<-- Throws exception saying Invalid character.
Console.WriteLine("Press any key..");
Console.ReadKey();
}
Any idea what could be the issue?
This is one of those times where slight issues in the wording of the documentation can make all the difference on how we look at or use the API. In our case, that part of the API doesn't do us much good.
You haven't completely read the documentation on Path.GetInvalidPathChars():
The array returned from this method is not guaranteed to contain the complete set of characters that are invalid in file and directory names. The full set of invalid characters can vary by file system. For example, on Windows-based desktop platforms, invalid path characters might include ASCII/Unicode characters 1 through 31, as well as quote ("), less than (<), greater than (>), pipe (|), backspace (\b), null (\0) and tab (\t).
And don't think that Path.GetInvalidFileNameChars() will do you any better immediately (we'll prove how this is the better choice below):
The array returned from this method is not guaranteed to contain the complete set of characters that are invalid in file and directory names. The full set of invalid characters can vary by file system. For example, on Windows-based desktop platforms, invalid path characters might include ASCII/Unicode characters 1 through 31, as well as quote ("), less than (<), greater than (>), pipe (|), backspace (\b), null (\0) and tab (\t).
In this situation, it's best to try { Directory.CreateDirectory(str2); } catch (ArgumentException e) { /* Most likely the path was invalid */ } instead of manually validating the path*. This will work independent of file-system.
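As a slightly fuller sketch of that try/catch approach, the helper below wraps the call and reports failure instead of throwing. The name TryCreateDirectory and its signature are hypothetical. The caught types are among the exceptions documented for Directory.CreateDirectory; which of them an invalid character actually produces may differ between runtimes and file systems, so the filter is deliberately broader than ArgumentException alone.

```csharp
using System;
using System.IO;

static class DirectoryHelper
{
    // Hypothetical helper: returns true if the directory exists after the call,
    // false (with the exception message in 'error') if creation failed.
    public static bool TryCreateDirectory(string path, out string error)
    {
        try
        {
            Directory.CreateDirectory(path);
            error = null;
            return true;
        }
        catch (Exception e) when (e is ArgumentException          // includes ArgumentNullException
                               || e is NotSupportedException
                               || e is IOException                 // includes PathTooLongException
                               || e is UnauthorizedAccessException)
        {
            error = e.Message;
            return false;
        }
    }
}
```

A caller can then write if (!DirectoryHelper.TryCreateDirectory(str2, out var error)) Console.WriteLine(error); and let the file system be the judge of validity.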
When I tried to create your directory on my Windows system:
Now if we go through all the characters in that array:
foreach (char c in Path.GetInvalidPathChars())
{
Console.WriteLine($"0x{(int)c:X4} : {c}");
}
We get:
0x0022 : "
0x003C : <
0x003E : >
0x007C : |
0x0000 :
0x0001 :
0x0002 :
0x0003 :
0x0004 :
0x0005 :
0x0006 :
0x0007 :
0x0008 :
0x0009 :
0x000A :
0x000B :
0x000C :
0x000D :
0x000E :
0x000F :
0x0010 :
0x0011 :
0x0012 :
0x0013 :
0x0014 :
0x0015 :
0x0016 :
0x0017 :
0x0018 :
0x0019 :
0x001A :
0x001B :
0x001C :
0x001D :
0x001E :
0x001F :
As you can see, that list is incomplete.
However: if we do the same for GetInvalidFileNameChars()
foreach (char c in Path.GetInvalidFileNameChars())
{
Console.WriteLine($"0x{(int)c:X4} : {c}");
}
We end up with a different list, which includes all of the above, as well as:
0x003A : :
0x002A : *
0x003F : ?
0x005C : \
0x002F : /
Which is exactly what our error-message indicates. In this situation, you may decide you want to use that instead. Just remember our warning above, Microsoft makes no guarantees as to the accuracy of either of these methods.
Of course, this isn't perfect, because using Path.GetInvalidFileNameChars() on a path will throw a false invalidation (\ is invalid in a filename, but it's perfectly valid in a path!), so you'll need to correct for that. You can do so by ignoring (at the very least) the following characters:
0x003A : :
0x005C : \
You may also want to ignore the following character (as sometimes people use the web/*nix style paths):
0x002F : /
The last thing to do here is demonstrate a slightly easier way of writing this code. (I'm a regular on Code Review so it's second nature.)
We can do this whole thing in one expression:
System.IO.Path.GetInvalidFileNameChars().Except(new char[] { '/', '\\', ':' }).Count(c => str2.Contains(c)) > 0
Example of usage:
var invalidPath = @"C:\Temp\hjk&(*&ghj\config\";
var validPath = @"C:\Temp\hjk&(&ghj\config\"; // No asterisk (*)
var invalidPathChars = System.IO.Path.GetInvalidFileNameChars().Except(new char[] { '/', '\\', ':' });
if (invalidPathChars.Count(c => invalidPath.Contains(c)) > 0)
{
Console.WriteLine("Invalid character found.");
}
else
{
Console.WriteLine("Free and clear.");
}
if (invalidPathChars.Count(c => validPath.Contains(c)) > 0)
{
Console.WriteLine("Invalid character found.");
}
else
{
Console.WriteLine("Free and clear.");
}
*: This is arguable, you may want to manually validate the path if you are certain your validation code will not invalidate valid paths. As MikeT said: "you should always try to validate before getting an exception". Your validation code should be equal or less restrictive than the next level of validation.
I faced the same problem as described above.
Since each subdirectory name is effectively a file name, I combined some solutions that I found here:
string retValue = string.Empty;
// Split on the directory separator and drop empty parts (e.g. from a trailing separator).
var dirParts = path.Split(new[] { Path.DirectorySeparatorChar }, StringSplitOptions.RemoveEmptyEntries);
for (int i = 0; i < dirParts.Length; i++)
{
if (i == 0 && Path.IsPathRooted(path))
{
retValue = string.Join("_", dirParts[0].Split(Path.GetInvalidPathChars()));
}
else
{
retValue = Path.Combine(retValue, string.Join("_", dirParts[i].Split(Path.GetInvalidFileNameChars())));
}
}
My solution returns "C:\Temp\hjk&(_&ghj\config" for the given path in the question.

strange string format percent c# output

Does anybody have an idea why the following code outputs 1000% for RebatePercent=10:
return RebatePercent > 0 ? $"{RebatePercent.ToString("0%")}" : "-";
I couldn't find a format that outputs 10%.
thx
You can use:
RebatePercent > 0 ? String.Format("{0}%", RebatePercent) : "-";
And in C# 6:
RebatePercent > 0 ? $"{RebatePercent}%" : "-";
If you want to keep the use of string interpolation then you can just:
return RebatePercent > 0 ? $"{RebatePercent.ToString()}%" : "-";
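For context on why the original code prints 1000%: in a custom numeric format string, the % specifier multiplies the value by 100 before formatting, so 10 becomes 1000. The sketch below shows two ways around that while still using a format string. The class name RebateFormatter and the choice of decimal for the value are assumptions, since the question does not show the type of RebatePercent.

```csharp
using System;
using System.Globalization;

static class RebateFormatter
{
    // Value is already a percentage (10 means 10%).
    public static string Format(decimal rebatePercent)
    {
        if (rebatePercent <= 0) return "-";
        // Quoting the percent sign makes it a literal, so no scaling by 100 happens.
        return rebatePercent.ToString("0'%'", CultureInfo.InvariantCulture);
    }

    // Value is a fraction (0.10 means 10%); here the unquoted % specifier
    // does the scaling, which is what it was designed for.
    public static string FormatFraction(decimal fraction)
    {
        if (fraction <= 0) return "-";
        return fraction.ToString("0%", CultureInfo.InvariantCulture);
    }
}
```

Both RebateFormatter.Format(10m) and RebateFormatter.FormatFraction(0.10m) produce 10%. The invariant culture is passed explicitly because the percent symbol emitted by the unquoted specifier is culture-dependent.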

Can I use Antlr with damaged/incomplete input and if so - how?

Can rules/parser/lexer be set up so as to accept input that conforms to the expected structure, but the static (predefined) tokens are not written in full?
Example:
I have an ANTLR4 grammar (C# target) that I use to parse some input and use it to run specific methods of my application.
(made-up):
grammar:
setWage
: SETWAGE userId=STRING value=NUMBER
;
SETWAGE
: 'setWage'
;
input:
setWage john.doe 2000
A listener that walks the parse tree in method for setWage rule (after getting text from labeled tokens) would call for example:
SalaryManager.SetWage(User.GetById("john.doe"), 2000);
My question: can Antlr (or the grammar) be set up so as to allow for example for such input:
setW john.doe 2000
assuming that there are no rules for e.g. "setWater" or "setWindow", or assuming that there are and I'm fine with Antlr choosing one of those by itself (albeit, consistently the same one).
Please note that this question is mostly academic and I'm not looking for a better way to achieve that input->action linking.
You probably know this already, but you can elaborate the set of possible input matches
SETWAGE : 'setW' | 'setWa' | 'setWag' | 'setWage' ;
or
SETWAGE : 'set' ('W' ('a' ('g' ('e')? )? )? ) ;
Not sure if the latter satisfies your requirement that "the static (predefined) tokens are not written in full".
Hard-coding the "synonyms" could be tedious, but how many do you need?
Here's an example I wrote to validate the approach. (Java target, but that shouldn't matter)
actions.g4
grammar actions ;
actions : action+;
action : setWage | deductSum ;
setWage : SETWAGEOP userId=SYMBOL value=NUMBER ;
deductSum : DEDUCTSUMOP userId=SYMBOL value=NUMBER ;
//SETWAGEOP : 'setW' | 'setWa' | 'setWag' | 'setWage' ;
SETWAGEOP : 'set' ('W' ('a' ('g' ('e')? )? )? ) ;
DEDUCTSUMOP : 'deduct' ('S' ('u' ('m')? )? ) ;
WS : [ \t\n\r]+ -> channel(HIDDEN) ;
SYMBOL : [a-zA-Z][a-zA-Z0-9\.]* ;
NUMBER : [0-9]+ ;
testinput
setW john.doe 2000
deductS john.doe 50
setWag joe.doe.III 2002
deductSu joe.doe 40
setWage jane.doe 2004
deductSum john.doe 50
Transcript:
$ antlr4 actions.g4 ; javac actions*.java ; grun actions actions -tree < testinput
(actions (action (setWage setW john.doe 2000)) (action (deductSum deductS john.doe 50)) (action (setWage setWag joe.doe.III 2002)) (action (deductSum deductSu joe.doe 40)) (action (setWage setWage jane.doe 2004)) (action (deductSum deductSum john.doe 50)))
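If hard-coding the nested optional groups becomes tedious, the lexer rules could be generated rather than typed. The following is only a sketch: the helper name PrefixRule.Build and its parameters are invented for illustration, and it assumes the keyword contains no characters that need escaping inside an ANTLR string literal (such as ' or \).

```csharp
using System;
using System.Text;

static class PrefixRule
{
    // Builds an ANTLR lexer rule that accepts every prefix of 'keyword'
    // that is at least 'minLength' characters long.
    public static string Build(string ruleName, string keyword, int minLength)
    {
        if (minLength < 1 || minLength > keyword.Length)
            throw new ArgumentOutOfRangeException(nameof(minLength));

        var sb = new StringBuilder();
        // The mandatory part of the keyword.
        sb.Append(ruleName).Append(" : '")
          .Append(keyword.Substring(0, minLength)).Append('\'');

        // Each remaining character opens one nested group...
        for (int i = minLength; i < keyword.Length; i++)
            sb.Append(" ('").Append(keyword[i]).Append('\'');

        // ...and each group is closed as optional.
        for (int i = minLength; i < keyword.Length; i++)
            sb.Append(")?");

        return sb.Append(" ;").ToString();
    }
}
```

For example, PrefixRule.Build("SETWAGEOP", "setWage", 4) returns SETWAGEOP : 'setW' ('a' ('g' ('e')?)?)? ; which accepts the same inputs as the hand-written rule in the grammar above.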

ANTLR4 Mismatch in simple grammar

I am using the latest version of Antlr (4.3) to parse this simple source file. I'm using the Visual Studio add-in, but that should not have anything to do with my problem.
Source file:
OBJECT Codeunit 80 Sales-Post
{
OBJECT-PROPERTIES
{
Date=11/12/10;
Time=12:00:00;
Version List=NAVW16.00.10,NAVBE6.00.01;
}
}
This should be pretty straightforward to parse, but I keep getting 2 errors while parsing:
line 1:16 mismatched input '80' expecting DOCUMENT_ID
line 5:4 mismatched input 'Date' expecting {DOCUMENT_PROPERTY_ID, '}'}
Complete grammar:
grammar Cal;
/*
* Parser Rules
*/
document
: document_header OPEN_BRACE document_content CLOSE_BRACE
;
document_header
: OBJECT_DEFINITION DOCUMENT_TYPE DOCUMENT_ID DOCUMENT_NAME
;
document_content
: document_properties
;
document_properties
: OBJECT_PROPERTIES OPEN_BRACE document_property* CLOSE_BRACE
;
document_property
: DOCUMENT_PROPERTY_ID EQ DOCUMENT_PROPERTY_VALUE LINE_TERM
;
/*
* Lexer Rules
*/
OBJECT_PROPERTIES
: 'OBJECT-PROPERTIES'
;
OBJECT_DEFINITION
: 'OBJECT'
;
DOCUMENT_TYPE
: 'Codeunit'
| 'Table'
;
DOCUMENT_PROPERTY_VALUE
: ([0-9a-zA-Z]|'_'|'-'|'.'|'/'|','|':')+
;
DOCUMENT_PROPERTY_ID
: 'Date'
| 'Time'
| 'Version List'
;
DOCUMENT_ID
: [0-9]+
;
DOCUMENT_NAME
: ID
;
OPEN_BRACE
: '{'
;
CLOSE_BRACE
: '}'
;
LINE_TERM
: ';'
;
EQ
: '='
;
ID
: ([a-zA-Z]|'_'|'-')+
;
INT
: [0-9]+
;
WS
: [ \t]+ -> channel(HIDDEN)
;
NEWLINE
:'\r'? '\n' -> channel(HIDDEN)
;
This is the output of the token stream (tokens are surrounded with '<>'):
<OBJECT> < > <Codeunit> < > <80> < > <Sales-Post> <
> <{> <
> < > <OBJECT-PROPERTIES> <
> < > <{> <
> < > <Date> <=> <11/12/10> <;> <
> < > <Time> <=> <12:00:00> <;> <
> < > <Version List> <=> <NAVW16.00.10,NAVBE6.00.01> <;> <
> < > <}> <
> <}> <<EOF>
In ANTLR, when two lexer rules can match the same token (of the same length), the rule that appears first wins.
80 can be matched by DOCUMENT_ID, but also by DOCUMENT_PROPERTY_VALUE and INT, so just reorder these rules here.
You have the same problem here with DOCUMENT_PROPERTY_ID which is below DOCUMENT_PROPERTY_VALUE (both can match Date).
I suggest you put DOCUMENT_PROPERTY_VALUE just above WS: the most specific rules (i.e. keywords) go first, and broader rules last.
You also have to get rid of DOCUMENT_ID or INT, as they have the same definition. One of them will never match. You don't seem to use INT in the parser, so just remove the rule.
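The tie-breaking behaviour described above (longest match wins, and among equal lengths the rule listed first wins) can be imitated outside ANTLR with a few regexes. This is a toy model written for illustration, not ANTLR's actual implementation; the class name TieBreakDemo is made up, and only three of the question's lexer rules are included, in their original order.

```csharp
using System;
using System.Text.RegularExpressions;

static class TieBreakDemo
{
    // A subset of the question's lexer rules, in the order they appear in the grammar.
    static readonly (string Name, Regex Pattern)[] Rules =
    {
        ("DOCUMENT_PROPERTY_VALUE", new Regex(@"\G[0-9a-zA-Z_\-./,:]+")),
        ("DOCUMENT_PROPERTY_ID",    new Regex(@"\G(?:Date|Time|Version List)")),
        ("DOCUMENT_ID",             new Regex(@"\G[0-9]+")),
    };

    // Returns the name of the rule that would claim the token at the start of 'input'.
    public static string Classify(string input)
    {
        string winner = null;
        int best = 0;
        foreach (var (name, pattern) in Rules)
        {
            Match m = pattern.Match(input, 0);
            // Strictly longer matches replace the current winner;
            // ties keep the earlier rule.
            if (m.Success && m.Length > best)
            {
                best = m.Length;
                winner = name;
            }
        }
        return winner;
    }
}
```

With this ordering, Classify("80") and Classify("Date") both return DOCUMENT_PROPERTY_VALUE, matching the two reported errors, while Classify("Version List") returns DOCUMENT_PROPERTY_ID because that match is longer. Moving DOCUMENT_PROPERTY_VALUE to the end of the array changes the first two results to DOCUMENT_ID and DOCUMENT_PROPERTY_ID.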
