ANTLR4 grammar integration complexities for selection with removals - c#

I’m attempting to create a grammar for a lighting control system. I make good progress when testing with the tree GUI tool, but it all seems to fall apart when I try to integrate it into my app.
The basic structure of the language is [Source] [Mask] [Command] [Destination]. Mask is optional, so a very simple input might look like this: Fixture 1 # 50, which bypasses Mask. Fixture 1 is the source, # is the command and 50 is the destination, which in this case is an intensity value.
I’ve no issues with this type of input, but things get complicated as I try to build out more complex source selection. Let’s say I want to select a range of fixtures, remove a few from the selection and then add more fixtures afterwards.
Fixture 1 Thru 50 – 25 – 30 – 35 + 40 > 45 # 50
This is a very common syntax on existing control systems but I’m stumped at how to design the grammar for this in a way that makes integration into my app not too painful.
The user could just as easily type the following:
1 Thru 50 – 25 – 30 – 35 + 40 > 45 # 50
Because sourceType (fixture) is not provided, it's inferred.
To try and deal with the above situations, I've written the following:
grammar LiteMic;
/*
* Parser Rules
*/
start : expression;
expression : source command destination
| source mask command destination
| command destination
| source command;
destination : sourceType number
| sourceType number sourceType number
| number;
command : COMMAND;
mask : SOURCETYPE;
operator : ADD #Add
| SUB #Subtract
;
plus : ADD;
minus : SUB;
source : singleSource (plus source)*
| rangeSource (plus source)*
;
singleSource : sourceType number #SourceWithType
| number #InferedSource
;
rangeSource : sourceRange (removeSource)*
;
sourceRange : singleSource '>' singleSource;
removeSource : '-' source;
sourceType : SOURCETYPE;
number : NUMBER;
compileUnit
: EOF
;
/*
* Lexer Rules
*/
SOURCETYPE : 'Cue'
| 'Playback'
| 'List'
| 'Intensity'
| 'Position'
| 'Colour'
| 'Beam'
| 'Effect'
| 'Group'
| 'Fixture'
;
COMMAND : '#'
| 'Record'
| 'Update'
| 'Copy'
| 'Move'
| 'Delete'
| 'Highlight'
| 'Full'
;
ADD : '+' ;
SUB : '-' ;
THRU : '>' ;
/* A number: can be an integer value, or a decimal value */
NUMBER : [0-9]+ ;
/* We're going to ignore all white space characters */
WS : [ \t\r\n]+ -> skip
;
Running the command against grun gui produces the following:
I've had some measure of success being able to override the Listener for AddRangeSource as I can loop through and add the correct types but it all falls apart when I try and remove a range.
1 > 50 - 30 > 35 # 50
This produces a problem as the removal of a range matches to the 'addRangeSource'.
I'm pretty sure I'm missing something obvious. I've been working my way through the book I bought on Amazon, but it still hasn't become clear to me how to achieve what I'm after, and I've been looking at this for a week.
For good measure, below is a tree for a more advanced query that seems ok apart from the selection.
Does anyone have any pointers / suggestions on where I'm going wrong?
Cheers,
Mike

You can solve the problem by reorganizing the grammar a little:
Merge rangeSource with sourceRange:
rangeSource : singleSource '>' singleSource;
Note: This rule also matches input like Beam 1 > Group 16, which might be unintended. In that case you could use this:
rangeSource : sourceType? number '>' number;
Rename source to sourceList (and don't forget to change it in the expression rule):
expression : sourceList command destination
| sourceList mask command destination
| command destination
| sourceList command;
Add a source rule that matches either singleSource or rangeSource:
source : singleSource | rangeSource;
Put + and - at the same level (as addSource and removeSource):
addSource : plus source;
removeSource : minus source;
Change sourceList to accept a list of addSource/removeSource:
sourceList : source (addSource|removeSource)*;
I tried this and it doesn't have any problems with parsing even the more advanced query.
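One reason the flat sourceList shape tends to be easier to integrate is that a listener or visitor can apply each addSource/removeSource, in order, to a running selection. The sketch below is in Python and does not use the ANTLR runtime at all; the tuple encoding and the name apply_source_list are made up for illustration, and it assumes a range is inclusive at both ends.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical encoding of a parsed sourceList: each entry is
# (operator, first, last), where operator is "+" or "-" and a single
# source has first == last. The leading source counts as "+".
Operation = Tuple[str, int, int]


def apply_source_list(operations: Iterable[Operation]) -> List[int]:
    """Apply add/remove operations left to right and return the selection."""
    selected: Set[int] = set()
    for operator, first, last in operations:
        low, high = min(first, last), max(first, last)
        members = set(range(low, high + 1))  # inclusive range
        if operator == "+":
            selected |= members
        elif operator == "-":
            selected -= members
        else:
            raise ValueError(f"unknown operator: {operator!r}")
    return sorted(selected)


# Corresponds to the input: 1 > 10 - 3 > 5 + 20
selection = apply_source_list([("+", 1, 10), ("-", 3, 5), ("+", 20, 20)])
print(selection)
```

Because removing a range is just another entry in the list, it no longer needs a rule shape of its own.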

Related

Split long string for each colon ":" and get index of the line by position

I'm struggling to understand how to use the Split method to get the text I want.
I receive a long registration string from the user. I'm trying to split it on the colon (:), and for each colon found I want to get all the text up to the \n that ends the line.
The string I receive from the user is formatted like this example:
"Username: Jony \n
Fname: Dep\n
Address: Los Angeles\n
Age: 28\n
Date: 11/01:2001\n"
This is my approach so far. I haven't figured out how to make it work, and I couldn't find a similar question.
str = the long string
List<string> names = str.ToString().Split(':').ToList<string>();
names.Reverse();
var result = names[0].ToString();
var result1 = names[1].ToString();
Console.WriteLine(result.Remove('\n').Replace(" ",string.Empty));
Console.WriteLine(result1.Remove('\n').Replace(" ",string.Empty));
Benchmarks
----------------------------------------------------------------------------
Mode : Release (64Bit)
Test Framework : .NET Framework 4.7.1 (CLR 4.0.30319.42000)
----------------------------------------------------------------------------
Operating System : Microsoft Windows 10 Pro
Version : 10.0.17134
----------------------------------------------------------------------------
CPU Name : Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz
Description : Intel64 Family 6 Model 58 Stepping 9
Cores (Threads) : 4 (8) : Architecture : x64
Clock Speed : 3901 MHz : Bus Speed : 100 MHz
L2Cache : 1 MB : L3Cache : 8 MB
----------------------------------------------------------------------------
Results
--- Random characters -------------------------------------------------
| Value | Average | Fastest | Cycles | Garbage | Test | Gain |
--- Scale 1 -------------------------------------------- Time 1.152 ---
| split | 4.975 µs | 4.091 µs | 20.486 K | 0.000 B | N/A | 71.62 % |
| regex | 17.530 µs | 14.029 µs | 65.707 K | 0.000 B | N/A | 0.00 % |
-----------------------------------------------------------------------
Original Answer
You could use a regex, or you could simply use Split:
var input = "Username: Jony\n Fname: Dep\nAddress: Los Angeles\nAge: 28\nDate: 11/01:2001\n";
var results = input.Split(new []{'\n'}, StringSplitOptions.RemoveEmptyEntries)
.Select(x => x.Split(':')[1].Trim());
foreach (var result in results)
Console.WriteLine(result);
Output
Jony
Dep
Los Angeles
28
11/01
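Note: this has no error checking. A line without a colon will throw an IndexOutOfRangeException, and anything after a second colon on a line is dropped (which is why the Date line prints 11/01).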
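For comparison, here is a minimal sketch of the same split with the missing checks added. It is written in Python rather than C#, and parse_fields is a made-up name. It splits each line on the first colon only and skips lines that have none. In C#, the Split(char[], int) overload with a count of 2 should give a similar first-colon-only split.

```python
from typing import Dict


def parse_fields(text: str) -> Dict[str, str]:
    """Map each 'Key: value' line to its value, skipping lines without a colon."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        # partition splits on the first colon only, so "11/01:2001" stays whole
        key, separator, value = line.partition(":")
        if not separator:
            continue  # no colon on this line (blank or malformed)
        fields[key.strip()] = value.strip()
    return fields


sample = "Username: Jony\nFname: Dep\nAddress: Los Angeles\nAge: 28\nDate: 11/01:2001\n"
for value in parse_fields(sample).values():
    print(value)
```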
Additional Resources
String.Split Method
Returns a string array that contains the substrings in this instance
that are delimited by elements of a specified string or Unicode
character array.
StringSplitOptions Enum
Specifies whether applicable Split method overloads include or omit
empty substrings from the return value
String.Trim Method
Returns a new string in which all leading and trailing occurrences of
a set of specified characters from the current String object are
removed.
Enumerable.Select Method
Projects each element of a sequence into a new form.
You can use a regex to match everything after a colon, up to the newline character:
(?<=:)\s*[^\n]*
The regex uses a lookbehind to ensure there's a colon in front of the match, then matches every character that is not a newline, i.e. the rest of the line.
Use it like this:
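// requires: using System.Text.RegularExpressions;
string searchText = "Username: Jony\nFname: Dep\nAddress: Los Angeles\nAge: 28\nDate: 11/01:2001\n";
// Verbatim string (@"...") so that \s and \n reach the regex engine as written
Regex myRegex = new Regex(@"(?<=:)\s*[^\n]*");
foreach (Match match in myRegex.Matches(searchText))
{
    // The match keeps the whitespace after the colon, so trim it
    DoSomething(match.Value.Trim());
}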
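The behaviour of that pattern can be checked quickly outside C#. The sketch below uses Python's re module, which handles this lookbehind the same way for this input; the variable names are illustrative. Note that, unlike splitting on every colon, the match keeps the part of the Date value that follows the second colon.

```python
import re

# Same pattern as above: a lookbehind for ':' followed by the rest of the line.
VALUE_AFTER_COLON = re.compile(r"(?<=:)\s*[^\n]*")

sample = "Username: Jony\nFname: Dep\nAddress: Los Angeles\nAge: 28\nDate: 11/01:2001\n"

# findall returns the matched text; strip removes the whitespace after the colon
values = [match.strip() for match in VALUE_AFTER_COLON.findall(sample)]
print(values)
```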

parse C# preprocessors in antlr4

I'm trying to parse C# preprocessors using ANTLR4 instead of ignoring them. I'm using the grammar mentioned here: https://github.com/antlr/grammars-v4/tree/master/csharp
This is my addition (for now I'm focusing only on pp_conditional):
pp_directive
: Pp_declaration
| pp_conditional
| Pp_line
| Pp_diagnostic
| Pp_region
| Pp_pragma
;
pp_conditional
: pp_if_section (pp_elif_section | pp_else_section | pp_conditional)* pp_endif
;
pp_if_section:
SHARP 'if' conditional_or_expression statement_list
;
pp_elif_section:
SHARP 'elif' conditional_or_expression statement_list
;
pp_else_section:
SHARP 'else' (statement_list | pp_if_section)
;
pp_endif:
SHARP 'endif'
;
I added its entry here:
block
: OPEN_BRACE statement_list? CLOSE_BRACE
| pp_directive
;
I'm getting this error:
line 19:0 mismatched input '#if TEST\n' expecting '}'
when I use the following test case:
if (!IsPostBack){
#if TEST
ltrBuild.Text = "**TEST**";
#else
ltrBuild.Text = "**LIVE**";
#endif
}
The problem is that a block is composed of either '{' statement_list? '}' or a pp_directive. In this specific case, the parser chooses the first alternative, because the first token it sees is a { (after the if condition). It then expects an optional statement_list followed by a }, but what it finds is #if TEST, a pp_directive.
What do we have to do? Make your pp_directive a statement. Since we know statement_list: statement+;, we search for statement and add pp_directive to it:
statement
: labeled_statement
| declaration_statement
| embedded_statement
| pp_directive
;
And it should work fine. However, we must also consider whether the pp_directive alternative in your block rule (block: ... | pp_directive) should be removed, and it should be. I'll leave it to you to find out why, but here's a test case that's ambiguous:
if (!IsPostBack)
#pragma X
else {
}

Can I use Antlr with damaged/incomplete input and if so - how?

Can the rules (parser or lexer) be set up to accept input that conforms to the expected structure even when the static (predefined) tokens are not written out in full?
Example:
I have an ANTLR4 grammar (C# target) that I use to parse some input and use it to run specific methods of my application.
(made-up):
grammar:
setWage
: SETWAGE userId=STRING value=NUMBER
;
SETWAGE
: 'setWage'
;
input:
setWage john.doe 2000
A listener that walks the parse tree would, in its method for the setWage rule (after getting the text from the labeled tokens), call for example:
SalaryManager.SetWage(User.GetById("john.doe"), 2000);
My question: can ANTLR (or the grammar) be set up to allow input such as:
setW john.doe 2000
assuming that there are no rules for e.g. "setWater" or "setWindow", or assuming that there are and I'm fine with Antlr choosing one of those by itself (albeit, consistently the same one).
Please note that this question is mostly academic; I'm not looking for a better way to achieve that input->action linking.
You probably know this already, but you can elaborate the set of possible input matches
SETWAGE : 'setW' | 'setWa' | 'setWag' | 'setWage' ;
or
SETWAGE : 'set' ('W' ('a' ('g' ('e')? )? )? ) ;
Not sure if the latter satisfies your requirement that "the static (predefined) tokens are not written in full".
Hard-coding the "synonyms" could be tedious, but how many do you need?
Here's an example I wrote to validate the approach. (Java target, but that shouldn't matter)
actions.g4
grammar actions ;
actions : action+;
action : setWage | deductSum ;
setWage : SETWAGEOP userId=SYMBOL value=NUMBER ;
deductSum : DEDUCTSUMOP userId=SYMBOL value=NUMBER ;
//SETWAGEOP : 'setW' | 'setWa' | 'setWag' | 'setWage' ;
SETWAGEOP : 'set' ('W' ('a' ('g' ('e')? )? )? ) ;
DEDUCTSUMOP : 'deduct' ('S' ('u' ('m')? )? ) ;
WS : [ \t\n\r]+ -> channel(HIDDEN) ;
SYMBOL : [a-zA-Z][a-zA-Z0-9\.]* ;
NUMBER : [0-9]+ ;
testinput
setW john.doe 2000
deductS john.doe 50
setWag joe.doe.III 2002
deductSu joe.doe 40
setWage jane.doe 2004
deductSum john.doe 50
Transcript:
$ antlr4 actions.g4 ; javac actions*.java ; grun actions actions -tree < testinput
(actions (action (setWage setW john.doe 2000)) (action (deductSum deductS john.doe 50)) (action (setWage setWag joe.doe.III 2002)) (action (deductSum deductSu joe.doe 40)) (action (setWage setWage jane.doe 2004)) (action (deductSum deductSum john.doe 50)))
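If the number of abbreviated keywords grows, the alternatives form of the rule could be generated instead of typed by hand. The helper below is a hypothetical Python sketch and not part of ANTLR; it assumes keywords contain no quotes or backslashes, so no escaping is attempted.

```python
def prefix_alternatives(keyword: str, min_length: int) -> str:
    """Build the body of a lexer rule listing every prefix of keyword
    that is at least min_length characters long."""
    if not 1 <= min_length <= len(keyword):
        raise ValueError("min_length must be between 1 and len(keyword)")
    prefixes = [keyword[:end] for end in range(min_length, len(keyword) + 1)]
    return " | ".join(f"'{prefix}'" for prefix in prefixes)


print("SETWAGEOP : " + prefix_alternatives("setWage", 4) + " ;")
print("DEDUCTSUMOP : " + prefix_alternatives("deductSum", 7) + " ;")
```

The generated text would then be pasted (or written by a build step) into the grammar file before running the ANTLR tool.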

Why do I get an OutOfMemoryException when generating a parse tree with ANTLR?

I built a simple grammar to interpret a file that looks like JSON (or XML). But when I try to parse the file and navigate the tree, I get a System.OutOfMemoryException.
The input file is only 108MB but contains almost 5 million lines.
Here is a sample of the file:
(
:field ("ObjectName"
:field (
:field ("{6BF621F9-A0E2-49BB-A86B-3DE4750954F4}")
:field (Value)
:field (Value)
:field (
:Time ("Sun Jan 26 10:08:33 2014")
:last_modified_utc (1390730913)
:By ("Some text")
:From (localhost)
)
:field ("text/text")
:field (false)
:field (false)
)
:field ()
:field ()
:field ()
:field (0)
:field (true)
:field (true)
)
.
.
.
.
.
)
Here is the grammar:
grammar Objects;
/*
* Parser Rules
*/
compileUnit
: obj
;
obj
: OPEN ID? (field)* CLOSE
;
field
: ':'(ID)? obj
;
/*
* Lexer Rules
*/
OPEN
: '('
;
CLOSE
: ')'
;
ID
: (ALPHA | ALPHA_IN_STRING)
;
fragment
INT_ID
: ('0'..'9')
;
fragment
ALPHA_EACH
: 'A'..'Z' | 'a'..'z' | '_' | INT_ID | '-' | '.' | '#'
;
fragment
ALPHA
: (ALPHA_EACH)+
;
fragment
ALPHA_IN_STRING
: ('"' ( ~[\r\n] )+ '"')
;
WS
// : ' ' -> channel(HIDDEN)
: [ \t\r\n]+ -> skip // skip spaces, tabs, newlines
;
And the parser:
var input = new Antlr4.Runtime.AntlrInputStream(text);
var lexer = new ObjectsLexer(input);
var tokens = new Antlr4.Runtime.CommonTokenStream(lexer);
var parser = new ObjectsParser(tokens);
// Context for the compileUnit rule
// ERROR: this is where the exception is thrown, when the parser starts building the tree for the compileUnit rule
var ctx = parser.compileUnit();
// The following line is not executed
new ObjectsVisitor().Visit(ctx);
At that line, I can see memory usage growing exponentially.
If the input is UTF-8 encoded and uses primarily ASCII characters, the conversion to UTF-16 will require approximately 216MB.
Each token uses at least 48 bytes of memory.
Each token which appears in the parse tree uses at least 20 bytes of memory (in addition to the 48).
Each rule node in the parse tree uses at least 36 bytes of memory. If the rule has any children, the minimum is 68 bytes.
The numbers above do not include any locals, arguments, labels, or return values, all of which are stored in the tree if you use them.
Assuming 4 characters per token, half the tokens in the parse tree, and an average of 3 tokens per parse tree node (completely arbitrary values here), you get:
Input: 216MB
~28 million tokens: ~1281MB
~14 million terminal nodes in the parse tree: ~267MB
~4.7 million parse tree nodes: ~308MB
This is over 2GB memory, and doesn't count any of the overhead associated with the runtime or the dynamic DFA cache constructed internally by ANTLR. You will clearly need to either run your application as a 64-bit process or reduce the size of your inputs.
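The estimate above can be reproduced with a few lines of arithmetic. The sketch below is in Python and simply plugs in the rounded counts quoted in the answer, which are rough assumptions rather than measurements. The rule-node figure comes out a few MB below the ~308MB quoted, presumably because of that rounding.

```python
MIB = 1024 * 1024

# Figures quoted above; the counts are rough assumptions, not measurements.
input_utf16_mib = 216
token_count = 28_000_000
terminal_node_count = 14_000_000
rule_node_count = 4_700_000

tokens_mib = token_count * 48 / MIB             # 48 bytes per token
terminals_mib = terminal_node_count * 20 / MIB  # 20 extra bytes per terminal node
rule_nodes_mib = rule_node_count * 68 / MIB     # 68 bytes per rule node with children

total_mib = input_utf16_mib + tokens_mib + terminals_mib + rule_nodes_mib

print(f"tokens:         {tokens_mib:.0f} MiB")
print(f"terminal nodes: {terminals_mib:.0f} MiB")
print(f"rule nodes:     {rule_nodes_mib:.0f} MiB")
print(f"total:          {total_mib:.0f} MiB")
```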

ANTLR3 common values in 2 different domain values

I need to define a language-parser for the following search criteria:
CRITERIA_1=<values-set-#1> AND/OR CRITERIA_2=<values-set-#2>;
Where <values-set-#1> can have values from 1-50 and <values-set-#2> can be from the following set (5, A, B, C) - case is not important here.
I decided to use ANTLR3 (v3.4) with C# output (CSharp3), and it worked smoothly until now. The problem is that it fails to parse the string when I provide a value that belongs to both data sets (i.e. '5' in this case). For example, if I provide the following string
CRITERIA_1=5;
It returns the following error where the value node was supposed to be:
<unexpected: [@1,11:11='5',<27>,1:11], resync=5>
The grammar definition file is the following:
grammar ZeGrammar;
options {
language=CSharp3;
TokenLabelType=CommonToken;
output=AST;
ASTLabelType=CommonTree;
k=3;
}
tokens
{
ROOT;
CRITERIA_1;
CRITERIA_2;
OR = 'OR';
AND = 'AND';
EOF = ';';
LPAREN = '(';
RPAREN = ')';
}
public
start
: expr EOF -> ^(ROOT expr)
;
expr
: subexpr ((AND|OR)^ subexpr)*
;
subexpr
: grouppedsubexpr
| 'CRITERIA_1=' rangeval1_expr -> ^(CRITERIA_1 rangeval1_expr)
| 'CRITERIA_2=' rangeval2_expr -> ^(CRITERIA_2 rangeval2_expr)
;
grouppedsubexpr
: LPAREN! expr RPAREN!
;
rangeval1_expr
: rangeval1_subexpr
| RANGE1_VALUES
;
rangeval1_subexpr
: LPAREN! rangeval1_expr (OR^ rangeval1_expr)* RPAREN!
;
RANGE1_VALUES
: (('0'..'4')? ('0'..'9') | '5''0')
;
rangeval2_expr
: rangeval2_subexpr
| RANGE2_VALUES
;
rangeval2_subexpr
: LPAREN! rangeval2_expr (OR^ rangeval2_expr)* RPAREN!
;
RANGE2_VALUES
: '5' | ('a'|'A') | ('b'|'B') | ('c'|'C')
;
And if I remove the value '5' from RANGE2_VALUES it works fine. Can anyone give me a hint about what I am doing wrong?
You must realize that the lexer does not produce tokens based on what the parser tries to match. So, in your case, the input "5" will always be tokenized as a RANGE1_VALUES and never as a RANGE2_VALUES because both RANGE1_VALUES and RANGE2_VALUES can match this input but RANGE1_VALUES comes first (so RANGE1_VALUES takes precedence over RANGE2_VALUES).
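To make that point concrete, here is a deliberately simplified model of a lexer, written in Python. It is not ANTLR's actual algorithm. It only assumes that rules are tried in declaration order, that the longest match wins, and that a tie goes to the rule declared first. The regular expressions are hand translations of the two lexer rules from the question.

```python
import re
from typing import Optional, Tuple

# Simplified model only: the parser's expectations are never consulted.
LEXER_RULES = [
    ("RANGE1_VALUES", re.compile(r"50|[0-4]?[0-9]")),
    ("RANGE2_VALUES", re.compile(r"5|[aAbBcC]")),
]


def first_token(text: str) -> Optional[Tuple[str, str]]:
    """Return (rule name, matched text) for the token at the start of text."""
    best: Optional[Tuple[str, str]] = None
    for name, pattern in LEXER_RULES:
        match = pattern.match(text)
        # Strictly longer matches replace the current best; ties keep the earlier rule
        if match and (best is None or len(match.group()) > len(best[1])):
            best = (name, match.group())
    return best


print(first_token("5"))  # both rules match one character; the first rule wins
print(first_token("a"))  # only the second rule matches
```

In this model nothing about the surrounding input (such as a preceding CRITERIA_2=) can change which token type "5" receives.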
A possible fix would be to remove both RANGE1_VALUES and RANGE2_VALUES rules and replace them with the following lexer rules:
D0_4
: '0'..'4'
;
D5
: '5'
;
D6_50
: '6'..'9' // 6-9
| '1'..'4' '0'..'9' // 10-49
| '50' // 50
;
A_B_C
: ('a'|'A')
| ('b'|'B')
| ('c'|'C')
;
and then introduce these new parser rules:
range1_values
: D0_4
| D5
| D6_50
;
range2_values
: A_B_C
| D5
;
and replace all RANGE1_VALUES and RANGE2_VALUES references in your parser rules with range1_values and range2_values respectively.
EDIT
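Instead of trying to solve this at the lexer level, you could simply match any integer value and then check inside the parser rule that the value is the correct one (or in the correct range) using a semantic predicate. The predicates below are written in Java; for the CSharp3 target, something like int.Parse($INT.text) would take the place of Integer.valueOf($INT.text):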
range1_values
: INT {Integer.valueOf($INT.text) <= 50}?
;
range2_values
: A_B_C
| INT {Integer.valueOf($INT.text) == 5}?
;
INT
: '0'..'9'+
;
A_B_C
: 'a'..'c'
| 'A'..'C'
;
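The checks performed by those two predicates can be sketched as plain functions. This is Python rather than grammar action code, and the function names are made up for illustration. Like the predicate above, the range-1 check accepts any integer up to 50, including 0.

```python
def is_range1_value(text: str) -> bool:
    """Mirror of the first predicate: any integer no greater than 50."""
    return text.isascii() and text.isdigit() and int(text) <= 50


def is_range2_value(text: str) -> bool:
    """Mirror of range2_values: a single letter a-c (any case) or the integer 5."""
    if len(text) == 1 and text in "abcABC":
        return True
    return text.isascii() and text.isdigit() and int(text) == 5


# "5" is accepted in both domains, which the lexer-only grammar could not express
print(is_range1_value("5"), is_range2_value("5"))
print(is_range1_value("51"), is_range2_value("B"))
```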
