ANTLR4 Nested modes? - c#

I'm attempting to parse the following string:
<<! variable, my_variable, A description of my variable !>>
From the reading I've been doing here, I believe I need to use modes to distinguish between the lexers for the literal string 'variable', the variable name (my_variable), and the variable description.
The problem I'm having is that I'm not sure how to structure this. Is it possible to nest modes? Is there a better/smarter way to organize my lexer rules?
lexer grammar VariableLexer;
variableMarkdown : DELIMITER_OPEN SPACE VARIABLE COMMA SPACE variable_name COMMA SPACE description SPACE DELIMITER_CLOSE;
description : WORDS ;
variable_name : ID ;
DELIMITER_OPEN : '<<!' ;
DELIMITER_CLOSE : '!>>';
COMMA : ',' ;
SPACE : ' ' ;
VARIABLE : 'variable' -> pushMode(VariableName);
mode VariableName;
ID : LOWERCASE ( LOWERCASE | NUMBER | UNDERSCORE )* -> pushMode(VariableDescription) ;
mode VariableDescription;
WORDS : ( UPPERCASE | LOWERCASE | NUMBER | SPACE )+ -> popMode;
fragment LOWERCASE : 'a'..'z' ;
fragment UPPERCASE : 'A'..'Z' ;
fragment UNDERSCORE : '_' ;
fragment NUMBER : '0'..'9' ;

First, you can't have parser rules in a lexer grammar: parser rule names start with a lowercase letter, lexer rule names with an uppercase one.
I'd do it like so (may not be correct syntax but you'll get the idea):
//default mode is implicitly defined by (or in) ANTLR4
VARIABLE : 'variable' (' ')* ',' -> mode(mode_VariableName);
...
mode mode_VariableName;
//define token with anything ending with comma, many ways to do this...
//fragments are lexer rules, so their names must start with an uppercase letter
fragment VarNameFrag : [a-zA-Z_0-9];
VARIABLE_NAME : VarNameFrag+ (' ')* ',' -> mode(mode_varDesc);
mode mode_varDesc;
//similar again for variable description
//should more or less match anything except the closing '!>>';
//this sketch assumes the description itself contains no '!'
VAR_DESC : ~[!]+ ;
END_VAR : '!>>' -> mode(DEFAULT_MODE);
Basically in this way you are jumping to modes you need instead of pushing and popping.
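To make the "jump between modes" idea concrete, here is a minimal hand-written Java sketch of the same state machine. It is not ANTLR-generated code: the class name ModeSwitchSketch, the tokenize method and the "TYPE(text)" token strings are made up for illustration. It also simplifies by skipping spaces in every mode and by assuming the description contains no '!'.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: mimics lexer modes with an explicit "current mode" variable. */
class ModeSwitchSketch {
    enum Mode { DEFAULT, VARIABLE_NAME, VARIABLE_DESCRIPTION }

    private static final Pattern OPEN = Pattern.compile("<<!");
    private static final Pattern VARIABLE = Pattern.compile("variable *,");
    private static final Pattern NAME = Pattern.compile("([a-zA-Z_0-9]+) *,");
    private static final Pattern DESC = Pattern.compile("[^!]+"); // assumption: no '!' in descriptions
    private static final Pattern CLOSE = Pattern.compile("!>>");

    /** Returns a matcher positioned on a match starting exactly at pos, or null. */
    private static Matcher match(Pattern p, String input, int pos) {
        Matcher m = p.matcher(input);
        m.region(pos, input.length());
        return m.lookingAt() ? m : null;
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Mode mode = Mode.DEFAULT;
        int pos = 0;
        while (pos < input.length()) {
            if (input.charAt(pos) == ' ') { pos++; continue; } // simplification: skip spaces in all modes
            Matcher m = null;
            if (mode == Mode.DEFAULT && (m = match(OPEN, input, pos)) != null) {
                tokens.add("DELIMITER_OPEN");
            } else if (mode == Mode.DEFAULT && (m = match(VARIABLE, input, pos)) != null) {
                tokens.add("VARIABLE");
                mode = Mode.VARIABLE_NAME;          // like -> mode(mode_VariableName)
            } else if (mode == Mode.VARIABLE_NAME && (m = match(NAME, input, pos)) != null) {
                tokens.add("VARIABLE_NAME(" + m.group(1) + ")");
                mode = Mode.VARIABLE_DESCRIPTION;   // like -> mode(mode_varDesc)
            } else if (mode == Mode.VARIABLE_DESCRIPTION && (m = match(CLOSE, input, pos)) != null) {
                tokens.add("END_VAR");
                mode = Mode.DEFAULT;                // like -> mode(DEFAULT_MODE)
            } else if (mode == Mode.VARIABLE_DESCRIPTION && (m = match(DESC, input, pos)) != null) {
                tokens.add("VAR_DESC(" + m.group().trim() + ")");
            } else {
                throw new IllegalArgumentException("unexpected input at offset " + pos);
            }
            pos = m.end();
        }
        return tokens;
    }
}
```

Calling ModeSwitchSketch.tokenize on the string from the question yields one token per part of the markup. Because each rule sets the next mode directly, there is no mode stack to keep balanced.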

Related

ANTLR (v4) Arbitrary Rule Ordering

I'm parsing a JSON-like structure that looks something like this:
items {
item {
name : 'Name',
value : 'abc',
type : String
}
}
The parser rule for item might look like this:
item
: name ',' value ',' type // I want these to be able to be in any order
;
name
: NAME ':' Str
;
value
: VALUE ':' atom
;
type
: TYPE ':' data_type
;
How would I write the item rule such that the order of the key : value pairs is unimportant, and the parser would only check for the presence of the rule? That is, value could come before name, etc.
EDIT
I should clarify that I know I could do this:
item
: item_item*
;
item_item
: name
| value
| type
;
But the problem with that is that the item rule needs to limit each rule to only one instance. Using this technique, I could end up with any number of name rules, for example.
The naive brute force approach would be to solve the problem in syntactic analysis (avoid this):
item
: name ',' value ',' type
| name ',' type ',' value
| type ',' name ',' value
| ...
;
This leads to a large parser spec and unmaintainable visitor/listener code.
Better:
Use a simple parser rule and a semantic predicate to validate:
item
: i+=item_item ',' i+=item_item ',' i+=item_item {containsNVT($i)}?
;
item_item
: name
| value
| type
;
You can place the validation code, containsNVT($i), which checks that all three items are specified, inline or in a parser superclass.
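A minimal sketch of what such a check could look like, written in Java and kept independent of the ANTLR runtime. The class name ItemValidator and the idea of passing the matched alternatives as strings ("name", "value", "type") are assumptions for illustration; in a real parser you would derive each kind from the context objects collected in $i.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative helper; in practice this logic could live in a parser superclass. */
class ItemValidator {
    private static final Set<String> REQUIRED = Set.of("name", "value", "type");

    /**
     * Returns true when exactly three items were matched and together they
     * cover name, value and type once each, in any order.
     */
    static boolean containsNVT(List<String> kinds) {
        return kinds.size() == REQUIRED.size()
                && new HashSet<>(kinds).equals(REQUIRED);
    }
}
```

With this check, the kinds value, type, name pass, while name, name, type fail because value is missing.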

c# migrating to ANTLR 4 from ANTLR 3 with AST

I have inherited some c# code based on ANTLR 3.
We have some grammar files that use the AST (abstract syntax tree) option, and we use those grammars to parse text files written in a very odd "language" into objects. We use the AST as intermediate objects and then convert them to the real objects that we need (with some more processing).
I have no knowledge of ANTLR, but currently the ANTLR processing of the files is a bottleneck in the application's performance.
Since we are using ANTLR 3, we thought that we might get a performance boost if we migrated to ANTLR 4 (and also get the latest and greatest version of ANTLR, which is always good practice).
I have read that ASTs no longer exist in ANTLR 4. What is the best (and simplest) way to replace them, and what will it mean for my current code?
What is the best approach to upgrading, and will it really give us a performance boost?
An example of one of the grammar files (there are 6 and this is the simplest one):
grammar Rules;
options
{
language=CSharp2;
output=AST;
ASTLabelType=CommonTree;
superClass = OOPLParserBase;
}
tokens
{
OOPL_MODEL;
}
@lexer::namespace { TestParser.Common.RulesParser }
@parser::namespace { TestParser.Common.RulesParser }
@header
{
using System.Collections.Generic;
using TestParser.OOPLModel;
}
@members
{
public RulesParser() : base(null)
{
}
protected override CommonTree GetAst()
{
return root().Tree as CommonTree;
}
protected override Lexer GetLexer()
{
return new RulesLexer();
}
}
//semantic analysis
root : header (rule_line COMMENT?)+ -> ^(header rule_line+);
header : header_comment+ -> ^(OOPL_MODEL<OOPLModel>[new CommonToken(OOPL_MODEL), "1.0"] header_comment+);
header_comment : COMMENT -> ^(COMMENT<OOPLComment>[$COMMENT, $COMMENT.Text]);
rule_line : parameter RULE_TYPE COMMA PARAMETER_NAME COLON condition -> ^(RULE_TYPE<OOPLBlock>[$RULE_TYPE, $RULE_TYPE.Text] parameter PARAMETER_NAME<OOPLValue>[$PARAMETER_NAME, $PARAMETER_NAME.Text] condition);
parameter : PARAMETER_NAME EQUALS (integer_value = INTEGER | real_value = REAL |string_value = STRING) COMMA -> ^(PARAMETER_NAME<OOPLKeyedValue>[$PARAMETER_NAME, $PARAMETER_NAME.Text, SingleWhereNotNull<IToken>($integer_value, $string_value, $real_value).Text]);
condition : condition_value COMMA condition_value COMMA condition_value COMMA condition_value COMMA condition_value COMMA condition_value COMMA condition_value COMMA condition_value COMMA condition_value COMMA condition_value COMMA condition_value COMMA condition_value COMMA condition_value;
condition_value : (asterisk| parameter_name | positive_integer);
asterisk : ASTERISK -> ^(ASTERISK<OOPLValue>[$ASTERISK, $ASTERISK.Text]);
parameter_name : PARAMETER_NAME -> ^(PARAMETER_NAME<OOPLValue>[$PARAMETER_NAME, $PARAMETER_NAME.Text]);
positive_integer : INTEGER -> ^(INTEGER<OOPLValue>[$INTEGER, $INTEGER.Text]);
//lexical analysis
EQUALS : '=';
NEW_LINE_R : '\r' { $channel = HIDDEN; };
NEW_LINE_N : '\n' { $channel = HIDDEN; };
RULE_TYPE : ('Time'|'TIME'|'Lol'|'LOL'|'World'|'WORLD'|'Template'|'TEMPLATE');
DOUBLE_COLON : COLON COLON;
INTEGER : MINUS? DIGIT+;
REAL : INTEGER '.' INTEGER;
PARAMETER_NAME : ASTERISK? (LETTER|DIGIT|UNDERSCORE|FORWARDSLASH|DOUBLE_COLON|MINUS)+ ASTERISK?;
WS : ( ' '
| '\t'
| NEW_LINE_R
| NEW_LINE_N
) { $channel = HIDDEN; } ;
COMMENT : '#' ( options {greedy=false;} : . )* NEW_LINE_R? NEW_LINE_N;
STRING : '"'~('"')* '"';
fragment
MINUS : '-';
COMMA : ',';
COLON : ':';
fragment
DOT : '.';
ASTERISK : '*';
fragment
FORWARDSLASH : '/';
fragment
UNDERSCORE : '_';
fragment
DIGIT : '0'..'9';
fragment
LETTER : 'A'..'Z' | 'a'..'z';
I'd do the transformation solely in C# code after the parse.
In this case I'd even skip the intermediate AST form and transform the parse tree (provided by ANTLR4) directly into the target representation.
Some prefer ParseTreeListeners together with a ParseTreeWalker, which help you walk the parse tree. Check these out if you want some pre-built code. Be sure to use the typed listener that ANTLR generates for your grammar (for a grammar named Rules, the C# target should generate IRulesListener and RulesBaseListener); inherit from it and adjust it to your needs.
link: https://theantlrguy.atlassian.net/wiki/display/ANTLR4/Parse+Tree+Listeners
I'd not recommend ParseTreeVisitors which are invoked during the parse (as opposed to after the parse). They are only suitable for simple operations or grammars that are not context free and require code during the parse. If the requirements evolve later on, you're way more flexible with custom processing or listeners/walkers.

Using ANTLR Parser and Lexer Separately

I am using ANTLR version 4 to create a compiler. The first phase was the lexer. I created a "CompilerLexer.g4" file and put the lexer rules in it. It works fine.
CompilerLexer.g4:
lexer grammar CompilerLexer;
INT : 'int' ; //1
FLOAT : 'float' ; //2
BEGIN : 'begin' ; //3
END : 'end' ; //4
To : 'to' ; //5
NEXT : 'next' ; //6
REAL : 'real' ; //7
BOOLEAN : 'bool' ; //8
.
.
.
NOTEQUAL : '!=' ; //46
AND : '&&' ; //47
OR : '||' ; //48
POW : '^' ; //49
ID : [a-zA-Z]+ ; //50
WS
: ' ' -> channel(HIDDEN) //50
;
Now it is time for phase 2, which is the parser. I created a "CompilerParser.g4" file and put the grammar rules in it, but I get dozens of warnings and errors.
CompilerParser.g4:
parser grammar CompilerParser;
options { tokenVocab = CompilerLexer; }
STATEMENT : EXPRESSION SEMIC
| IFSTMT
| WHILESTMT
| FORSTMT
| READSTMT SEMIC
| WRITESTMT SEMIC
| VARDEF SEMIC
| BLOCK
;
BLOCK : BEGIN STATEMENTS END
;
STATEMENTS : STATEMENT STATEMENTS*
;
EXPRESSION : ID ASSIGN EXPRESSION
| BOOLEXP
;
RELEXP : MODEXP (GT | LT | EQUAL | NOTEQUAL | LE | GE | AND | OR) RELEXP
| MODEXP
;
.
.
.
VARDEF : (ID COMA)* ID COLON VARTYPE
;
VARTYPE : INT
| FLOAT
| CHAR
| STRING
;
compileUnit
: EOF
;
Warnings and errors:
implicit definition of token 'BLOCK' in parser
implicit definition of token 'BOOLEXP' in parser
implicit definition of token 'EXP' in parser
implicit definition of token 'EXPLIST' in parser
lexer rule 'BLOCK' not allowed in parser
lexer rule 'EXP' not allowed in parser
lexer rule 'EXPLIST' not allowed in parser
lexer rule 'EXPRESSION' not allowed in parser
I get dozens of these warnings and errors. What is the cause?
General questions: What is the difference between using a combined grammar and using a lexer and parser separately? How should I join separate parser and lexer grammar files?
Lexer rules start with a capital letter, and parser rules start with a lowercase letter. In a parser grammar, you can't define tokens. And since ANTLR treats all your upper-cased rules as lexer rules, it produces these errors/warnings.
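The naming convention can be illustrated with a tiny Java sketch. The class RuleKind and its classify method are hypothetical and only mimic the first-letter rule described above; they are not part of ANTLR.

```java
/** Illustrative only: classifies a rule name the way the naming convention does. */
class RuleKind {
    /** Returns "lexer" for names starting with an uppercase letter, otherwise "parser". */
    static String classify(String ruleName) {
        if (ruleName.isEmpty()) {
            throw new IllegalArgumentException("empty rule name");
        }
        return Character.isUpperCase(ruleName.charAt(0)) ? "lexer" : "parser";
    }
}
```

Under this rule, STATEMENT and BLOCK from the parser grammar above count as lexer rules, while compileUnit counts as a parser rule. Renaming them to statement and block would make them parser rules.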
EDIT
user2998131 wrote:
General Questions: What is difference between using combined grammar and using lexer and parser separately?
Separating the lexer and parser rules keeps things organized. Also, when you create separate lexer and parser grammars, you can't (accidentally) put literal tokens inside your parser grammar; you have to define all tokens in your lexer grammar. This makes it apparent which lexer rules get matched before others, and it rules out typos inside recurring literal tokens, which are easy to make in a combined grammar:
grammar P;
r1 : 'foo' r2;
r2 : r3 'foo '; // added an accidental space after 'foo'
But when you have a parser grammar, you can't make that mistake. You will have to use the lexer rule that matches 'foo':
parser grammar P;
options { tokenVocab=L; }
r1 : FOO r2;
r2 : r3 FOO;
lexer grammar L;
FOO : 'foo';
user2998131 wrote:
How should join separate grammar and lexer files?
Just like you do in your parser grammar: you point to the proper tokenVocab inside the options { ... } block.
Note that you can also import grammars, which is something different: https://github.com/antlr/antlr4/blob/master/doc/grammars.md#grammar-imports

ANTLR "[Token name] is already defined" errors when building in visual studio 2012

I have a very simple ANTLR parser building in Visual Studio 2012. It works. But when it builds the grammar file, it emits a warning for every token, saying that the token is already defined. What could be causing this?
Here is the grammar file SimpleCalc.g4:
grammar SimpleCalc;
options {
language=CSharp2;
}
tokens {
PLUS,
MINUS,
TIMES,
DIV
}
@members {
}
expr : term ( (PLUS|MINUS) term )* ;
term : factor ( ( TIMES|DIV ) factor )* ;
factor : NUMBER ;
DIV : '/';
PLUS : '+';
TIMES: '*';
MINUS: '-';
NUMBER : (DIGIT)+ {System.Console.WriteLine("Found number"); };
WHITESPACE: ( '\t' | ' ' | '\r' | '\n' | '\u000C' )+ -> skip ;
fragment DIGIT : '0'..'9';
And here are the warnings:
[path]\SimpleCalc.g4(8,3): warning AC0108: token name 'PLUS' is already defined
[path]\SimpleCalc.g4(9,3): warning AC0108: token name 'MINUS' is already defined
[path]\SimpleCalc.g4(10,3): warning AC0108: token name 'TIMES' is already defined
[path]\SimpleCalc.g4(11,3): warning AC0108: token name 'DIV' is already defined
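I would get rid of the unnecessary tokens {...} block: PLUS, MINUS, TIMES and DIV are already defined by the lexer rules further down, so declaring them a second time is what triggers the warning.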

ANTLR rule to skip method body

My task is to create an ANTLR grammar that analyses C# source files and generates a class hierarchy. I will then use it to generate a class diagram.
I wrote rules to parse namespaces, class declarations and method declarations. Now I have a problem with skipping method bodies. I don't need to parse them, because the bodies are irrelevant to my task.
I wrote a simple rule:
body:
'{' .* '}'
;
but it does not work properly when a method looks like this:
void foo()
{
...
{
...
}
...
}
The rule matches the first brace, which is fine. Then it matches
...
{
...
as 'any' (.*) and then takes the third brace as the closing brace, which is wrong, and the rule ends.
Could anybody help me write a proper rule for method bodies? As I said before, I don't want to parse them, only to skip them.
UPDATE:
Here is the solution to my problem, strongly based on Adam12's answer:
body:
'{' ( ~('{' | '}') | body)* '}'
;
You have to use recursive rules that match parentheses pairs.
rule1 : '('
(
nestedParan
| (~')')*
)
')';
nestedParan : '('
(
nestedParan
| (~')')*
)
')';
This code assumes you are using the parser here so strings and comments are already excluded. ANTLR doesn't allow negation of multiple alternatives in parser rules so the code above relies on the fact that alternatives are tried in order. It should give a warning that alternatives 1 and 2 both match '(' and thus choose the first alternative, which is what we want.
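The same recursive idea can be sketched outside ANTLR in plain Java. The class BodySkipper and its skipBody method are hypothetical names. Like the rules above, the sketch works on raw braces only and assumes strings and comments containing braces have already been dealt with.

```java
/** Illustrative only: skips a balanced '{' ... '}' block by recursing on nested blocks. */
class BodySkipper {
    /**
     * Mirrors the shape of: body : '{' ( ~('{' | '}') | body )* '}' ;
     *
     * @param src  the source text
     * @param open index of an opening '{'
     * @return the index just past the matching '}'
     */
    static int skipBody(String src, int open) {
        if (open < 0 || open >= src.length() || src.charAt(open) != '{') {
            throw new IllegalArgumentException("expected '{' at offset " + open);
        }
        int pos = open + 1;
        while (pos < src.length()) {
            char c = src.charAt(pos);
            if (c == '}') {
                return pos + 1;           // closing brace of this block
            } else if (c == '{') {
                pos = skipBody(src, pos); // nested block: recurse and continue after it
            } else {
                pos++;                    // any other character is simply skipped
            }
        }
        throw new IllegalArgumentException("no matching '}' for '{' at offset " + open);
    }
}
```

For the text "void foo() { a { b } c } rest", starting at the first brace returns the position right after the outer closing brace, so only " rest" remains.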
You can handle the recursion of (nested) blocks in your lexer. The trick is to let your class definition also include the opening { so that not the entire contents of the class is gobbled up by this recursive lexer rule.
A quick demo that is without a doubt not complete, but is a decent start to "fuzzy parse/lex" a Java (or C# with some slight modifications) source file:
grammar T;
parse
: (t=. {System.out.printf("\%-15s '\%s'\n", tokenNames[$t.type], $t.text.replace("\n", "\\n"));})* EOF
;
Skip
: (StringLiteral | CharLiteral | Comment) {skip();}
;
PackageDecl
: 'package' Spaces Ids {setText($Ids.text);}
;
ClassDecl
: 'class' Spaces Id Spaces? '{' {setText($Id.text);}
;
Method
: Id Spaces? ('(' {setText($Id.text);}
| /* no method after all! */ {skip();}
)
;
MethodOrStaticBlock
: Block {skip();}
;
Any
: . {skip();}
;
// fragments
fragment Spaces
: (' ' | '\t' | '\r' | '\n')+
;
fragment Ids
: Id ('.' Id)*
;
fragment Id
: ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*
;
fragment Block
: '{' ( ~('{' | '}' | '"' | '\'' | '/')
| {input.LA(2) != '/'}?=> '/'
| StringLiteral
| CharLiteral
| Comment
| Block
)*
'}'
;
fragment Comment
: '/*' .* '*/'
| '//' ~('\r' | '\n')*
;
fragment CharLiteral
: '\'' ('\\\'' | ~('\\' | '\'' | '\r' | '\n'))+ '\''
;
fragment StringLiteral
: '"' ('\\"' | ~('\\' | '"' | '\r' | '\n'))* '"'
;
I ran the generated parser against the following Java source file:
/*
... package NO.PACKAGE; ...
*/
package foo.bar;
public final class Mu {
static String x;
static {
x = "class NotAClass!";
}
void m1() {
// {
while(true) {
double a = 2.0 / 2;
if(a == 1.0) { break; } // }
/* } */
}
}
static class Inner {
int m2 () {return 42; /*comment}*/ }
}
}
which produced the following output:
PackageDecl 'foo.bar'
ClassDecl 'Mu'
Method 'm1'
ClassDecl 'Inner'
Method 'm2'
