ANTLR rule to skip method body - c#

My task is to create ANTLR grammar, to analyse C# source code files and generate class hierarchy. Then, I will use it to generate class diagram.
I wrote rules to parse namespaces, class declarations and method declarations. Now I have problem with skipping methods bodies. I don't need to parse them, because bodies are useless in my task.
I wrote simple rule:
body:
'{' .* '}'
;
but it does not work properly, when method looks like:
void foo()
{
...
{
...
}
...
}
rule matches first brace what is ok, then it matches
...
{
...
as 'any'(.*) and then third brace as final brace, what is not ok, and rule ends.
Anybody could help me to write proper rule for method bodies? As I said before, I don't want to parse them - only to skip.
UPDATE:
here is solution of my problem strongly based on Adam12 answer
body:
'{' ( ~('{' | '}') | body)* '}'
;

You have to use recursive rules that match parentheses pairs.
rule1 : '('
(
nestedParan
| (~')')*
)
')';
nestedParan : '('
(
nestedParan
| (~')')*
)
')';
This code assumes you are using the parser here so strings and comments are already excluded. ANTLR doesn't allow negation of multiple alternatives in parser rules so the code above relies on the fact that alternatives are tried in order. It should give a warning that alternatives 1 and 2 both match '(' and thus choose the first alternative, which is what we want.

You can handle the recursion of (nested) blocks in your lexer. The trick is to let your class definition also include the opening { so that not the entire contents of the class is gobbled up by this recursive lexer rule.
A quick demo that is without a doubt not complete, but is a decent start to "fuzzy parse/lex" a Java (or C# with some slight modifications) source file:
grammar T;
parse
: (t=. {System.out.printf("\%-15s '\%s'\n", tokenNames[$t.type], $t.text.replace("\n", "\\n"));})* EOF
;
Skip
: (StringLiteral | CharLiteral | Comment) {skip();}
;
PackageDecl
: 'package' Spaces Ids {setText($Ids.text);}
;
ClassDecl
: 'class' Spaces Id Spaces? '{' {setText($Id.text);}
;
Method
: Id Spaces? ('(' {setText($Id.text);}
| /* no method after all! */ {skip();}
)
;
MethodOrStaticBlock
: Block {skip();}
;
Any
: . {skip();}
;
// fragments
fragment Spaces
: (' ' | '\t' | '\r' | '\n')+
;
fragment Ids
: Id ('.' Id)*
;
fragment Id
: ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*
;
fragment Block
: '{' ( ~('{' | '}' | '"' | '\'' | '/')
| {input.LA(2) != '/'}?=> '/'
| StringLiteral
| CharLiteral
| Comment
| Block
)*
'}'
;
fragment Comment
: '/*' .* '*/'
| '//' ~('\r' | '\n')*
;
fragment CharLiteral
: '\'' ('\\\'' | ~('\\' | '\'' | '\r' | '\n'))+ '\''
;
fragment StringLiteral
: '"' ('\\"' | ~('\\' | '"' | '\r' | '\n'))* '"'
;
I ran the generated parser against the following Java source file:
/*
... package NO.PACKAGE; ...
*/
package foo.bar;
public final class Mu {
static String x;
static {
x = "class NotAClass!";
}
void m1() {
// {
while(true) {
double a = 2.0 / 2;
if(a == 1.0) { break; } // }
/* } */
}
}
static class Inner {
int m2 () {return 42; /*comment}*/ }
}
}
which produced the following output:
PackageDecl 'foo.bar'
ClassDecl 'Mu'
Method 'm1'
ClassDecl 'Inner'
Method 'm2'

Related

Possible generation bug for C# in Antlr?

Using Antlr 4.3 and this grammar
http://www.harward.us/~nharward/antlr/OracleNetServicesV3.g
following *Lexer.cs code for C# is generated :
private void WHITESPACE_action(RuleContext _localctx, int actionIndex) {
switch (actionIndex) {
case 1: skip(); break;
}
}
private void NEWLINE_action(RuleContext _localctx, int actionIndex) {
switch (actionIndex) {
case 2: skip(); break;
}
}
private void COMMENT_action(RuleContext _localctx, int actionIndex) {
switch (actionIndex) {
case 0: skip(); break;
}
}
But the method skip() in the runtime is defined as:
public virtual void Skip()
Which of course gives a compilation error.
The same skip() method is generated with Antlr 3.5.2 as well.
Is this a bug or i am doing something wrong?
You can easily make this v4 compatible and language independent by using the command
-> channel(HIDDEN)
Here is an updated grammar that implements this change
configuration_file
: ( parameter )*
;
parameter
: keyword EQUALS ( value
| LEFT_PAREN value_list RIGHT_PAREN
| ( LEFT_PAREN parameter RIGHT_PAREN )+
)
;
keyword
: WORD
;
value
: WORD
| QUOTED_STRING
;
value_list
: value ( COMMA value )*
;
QUOTED_STRING
: SINGLE_QUOTE ~'\''* SINGLE_QUOTE
| DOUBLE_QUOTE ~'"'* DOUBLE_QUOTE
;
WORD
: ( 'A' .. 'Z'
| 'a' .. 'z'
| '0' .. '9'
| '<'
| '>'
| '/'
| '.'
| ':'
| ';'
| '-'
| '_'
| '$'
| '+'
| '*'
| '&'
| '!'
| '%'
| '?'
| '#'
| '\\' .
)+
;
LEFT_PAREN
: '('
;
RIGHT_PAREN
: ')'
;
EQUALS
: '='
;
COMMA
: ','
;
SINGLE_QUOTE
: '\''
;
DOUBLE_QUOTE
: '"'
;
COMMENT
: '#' ( ~( '\n' ) )* -> channel(HIDDEN)
;
WHITESPACE
: ( '\t'
| ' '
) -> channel(HIDDEN)
;
NEWLINE
: ( '\r' )? '\n' -> channel(HIDDEN)
;
As written in my comment, it's because auf the skip() in the grammar file, which is Java dependend.
So there is no bug in Antlr. :)

ANTLR: no viable alternative at input

Sorry for my bad English.
I wrote ANTLR4-grammar for GDB/MI output commands from this manual:
grammar GdbOutput;
output : out_of_band_record | result_record | terminator_record;
result_record : TOKEN? '^' RESULT_CLASS (',' result)*;
out_of_band_record : async_record
| stream_record;
async_record : exec_async_output
| status_async_output
| notify_async_output;
exec_async_output : TOKEN? '*' async_output;
status_async_output : TOKEN? '+' async_output;
notify_async_output : TOKEN? '=' async_output;
async_output : async_class (',' result)*;
RESULT_CLASS : 'done'
| 'running'
| 'connected'
| 'error'
| 'exit';
async_class : 'stopped'; //TODO
result : VARIABLE '=' value;
value : const
| tuple
| list;
const : c_string;
c_string : '"' STRING_LITERAL '"';
tuple : '{}'
| '{' result (',' result)* '}';
list : '[]'
| '[' value (',' value)* ']'
| '[' result (',' result)* ']';
stream_record : console_stream_output
| target_stream_output
| log_stream_output;
console_stream_output : '~' c_string;
target_stream_output : '#' c_string;
log_stream_output : '&' c_string;
terminator_record : '(gdb)';
VARIABLE : [a-z-]*;
STRING_LITERAL : (~('"' | '\\' | '\r' | '\n') | '\\' ('"' | '\\'))*;
TOKEN : [0-9]+;
I tried some output strings from GDB:
"~\"Reading symbols from C:\\src.exe...\"" - OK (out_of_band_record -> stream_record -> console_stream_output)
"(gdb)" -> OK (terminator_record)
"^done,bkpt={number=\"1\",type=\"breakpoint\",disp=\"keep\",enabled=\"y\",addr=\"0x00000000004014e4\",file=\"src.s\",fullname=\"C:\\src.s\",line=\"17\",thread-groups=[\"i1\"],times=\"0\",original-location=\"main\"}" - FAIL (with exception: line 1:0 no viable alternative at input '^done,bkpt={number=')
"^done,bkpt=\"1\"" - FAIL (line 1:0 no viable alternative at input '^done,bkpt=)
"^done,bkpt={}" - FAIL (line 1:0 no viable alternative at input '^done,bkpt={}')
Why my parser didn't recognize strings #3-5?
P.S.: C# target for ANTLR v.4.2.0 prerelease from Nuget
For starters, let your STRING_LITERAL match the quotes too: don't match them in a parser rule. And let your VARIABLE rule match at least one character (change the * into a +).

ANTLR Grammar and generated code problems

I'm trying to create an expression parser using ANTLR
The expression will go inside an if statement so its root is a condition.
I have the following grammar, which "compiles" to parser/lexer files with no problems, however the generated code itself has some errors, essentially two "empty" if statements
i.e.
if (())
Not sure what I'm doing wrong, any help will be greatly appreciated.
Thanks.
Grammar .g file below:
grammar Expression;
options {
language=CSharp3;
output=AST;
}
tokens {
ROOT;
UNARY_MIN;
}
#parser::namespace { Antlr3 }
#lexer::namespace { Antlr3 }
public parse
: orcond EOF -> ^(ROOT orcond)
;
orcond
: andcond ('||' andcond)*
;
andcond
: condition ('&&' condition)*
;
condition
: exp (('<' | '>' | '==' | '!=' | '<=' | '>=')^ exp)?
;
exp
: addExp
;
addExp
: mulExp (('+' | '-')^ mulExp)*
;
mulExp
: unaryExp (('*' | '/')^ unaryExp)*
;
unaryExp
: '-' atom -> ^(UNARY_MIN atom)
| atom
;
atom
: Number
| '(' parenthesisvalid ')' -> parenthesisvalid
;
parenthesisvalid
: fullobjectref
| orcond
;
fullobjectref
: objectref ('.' objectref)?
;
objectref
: objectname ('()' | '(' params ')' | '[' params ']')?
;
objectname
: (('a'..'z') | ('A'..'Z'))^ (('a'..'z') | ('A'..'Z') | ('0'..'9') | '_')*
;
params
: paramitem (',' paramitem)?
;
paramitem
: unaryExp
;
Number
: ('0'..'9')+ ('.' ('0'..'9')+)?
;
Space
: (' ' | '\t' | '\r' | '\n'){Skip();}
;
Don't use the range operator, .., inside parser rules.
Remove the parser rule objectname and create the lexer rule:
Objectname
: ('a'..'z' | 'A'..'Z') ('a'..'z' | 'A'..'Z' | '0'..'9' | '_')*
;
Additionally it will be more neatly to use fragment blocks. In this case your code will be something like this:
Objectname : LETTER (LETTER | DIGIT | '_')*;
Number: DIGIT+ ('.' DIGIT+)?;
fragment DIGIT : '0'..'9' ;
fragment LETTER : ('a'..'z' | 'A'..'Z');

How can my ANTLR parser (not lexer) trigger a lexical "include" (not AST splice)?

The ANTLR website describes two approaches to implementing "include" directives. The first approach is to recognize the directive in the lexer and include the file lexically (by pushing the CharStream onto a stack and replacing it with one that reads the new file); the second is to recognize the directive in the parser, launch a sub-parser to parse the new file, and splice in the AST generated by the sub-parser. Neither of these are quite what I need.
In the language I'm parsing, recognizing the directive in the lexer is impractical for a few reasons:
There is no self-contained character pattern that always means "this is an include directive". For example, Include "foo"; at top level is an include directive, but in Array bar --> Include "foo"; or Constant Include "foo"; the word Include is an identifier.
The name of the file to include may be given as a string or as a constant identifier, and such constants can be defined with arbitrarily complex expressions.
So I want to trigger the inclusion from the parser. But to perform the inclusion, I can't launch a sub-parser and splice the AST together; I have to splice the tokens. It's legal for a block to begin with { in the main file and be terminated by } in the included file. A file included inside a function can even close the function definition and start a new one.
It seems like I'll need something like the first approach but at the level of TokenStreams instead of CharStreams. Is that a viable approach? How much state would I need to keep on the stack, and how would I make the parser switch back to the original token stream instead of terminating when it hits EOF? Or is there a better way to handle this?
==========
Here's an example of the language, demonstrating that blocks opened in the main file can be closed in the included file (and vice versa). Note that the # before Include is required when the directive is inside a function, but optional outside.
main.inf:
[ Main;
print "This is Main!";
if (0) {
#include "other.h";
print "This is OtherFunction!";
];
other.h:
} ! end if
]; ! end Main
[ OtherFunction;
A possibility is for each Include statement to let your parser create a new instance of your lexer and insert these new tokens the lexer creates at the index the parser is currently at (see the insertTokens(...) method in the parser's #members block.).
Here's a quick demo:
Inform6.g
grammar Inform6;
options {
output=AST;
}
tokens {
STATS;
F_DECL;
F_CALL;
EXPRS;
}
#parser::header {
import java.util.Map;
import java.util.HashMap;
}
#parser::members {
private Map<String, String> memory = new HashMap<String, String>();
private void putInMemory(String key, String str) {
String value;
if(str.startsWith("\"")) {
value = str.substring(1, str.length() - 1);
}
else {
value = memory.get(str);
}
memory.put(key, value);
}
private void insertTokens(String fileName) {
// possibly strip quotes from `fileName` in case it's a Str-token
try {
CommonTokenStream thatStream = new CommonTokenStream(new Inform6Lexer(new ANTLRFileStream(fileName)));
thatStream.fill();
List extraTokens = thatStream.getTokens();
extraTokens.remove(extraTokens.size() - 1); // remove EOF
CommonTokenStream thisStream = (CommonTokenStream)this.getTokenStream();
thisStream.getTokens().addAll(thisStream.index(), extraTokens);
} catch(Exception e) {
e.printStackTrace();
}
}
}
parse
: stats EOF -> stats
;
stats
: stat* -> ^(STATS stat*)
;
stat
: function_decl
| function_call
| include
| constant
| if_stat
;
if_stat
: If '(' expr ')' '{' stats '}' -> ^(If expr stats)
;
function_decl
: '[' id ';' stats ']' ';' -> ^(F_DECL id stats)
;
function_call
: Id exprs ';' -> ^(F_CALL Id exprs)
;
include
: Include Str ';' {insertTokens($Str.text);} -> /* omit statement from AST */
| Include id ';' {insertTokens(memory.get($id.text));} -> /* omit statement from AST */
;
constant
: Constant id expr ';' {putInMemory($id.text, $expr.text);} -> ^(Constant id expr)
;
exprs
: expr (',' expr)* -> ^(EXPRS expr+)
;
expr
: add_expr
;
add_expr
: mult_expr (('+' | '-')^ mult_expr)*
;
mult_expr
: atom (('*' | '/')^ atom)*
;
atom
: id
| Num
| Str
| '(' expr ')' -> expr
;
id
: Id
| Include
;
Comment : '!' ~('\r' | '\n')* {skip();};
Space : (' ' | '\t' | '\r' | '\n')+ {skip();};
If : 'if';
Include : 'Include';
Constant : 'Constant';
Id : ('a'..'z' | 'A'..'Z') ('a'..'z' | 'A'..'Z' | '0'..'9')+;
Str : '"' ~'"'* '"';
Num : '0'..'9'+ ('.' '0'..'9'+)?;
main.inf
Constant IMPORT "other.h";
[ Main;
print "This is Main!";
if (0) {
Include IMPORT;
print "This is OtherFunction!";
];
other.h
} ! end if
]; ! end Main
[ OtherFunction;
Main.java
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
// create lexer & parser
Inform6Lexer lexer = new Inform6Lexer(new ANTLRFileStream("main.inf"));
Inform6Parser parser = new Inform6Parser(new CommonTokenStream(lexer));
// print the AST
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT((CommonTree)parser.parse().getTree());
System.out.println(st);
}
}
To run the demo, do the following on the command line:
java -cp antlr-3.3.jar org.antlr.Tool Inform6.g
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar Main
The output you'll see corresponds to the following AST:

ANTLR3 common values in 2 different domain values

I need to define a language-parser for the following search criteria:
CRITERIA_1=<values-set-#1> AND/OR CRITERIA_2=<values-set-#2>;
Where <values-set-#1> can have values from 1-50 and <values-set-#2> can be from the following set (5, A, B, C) - case is not important here.
I have decided to use ANTLR3 (v3.4) with output in C# (CSharp3) and it used to work pretty smooth until now. The problem is that it fails to parse the string when I provide values from both data-sets (I.e. in this case '5'). For example, if I provide the following string
CRITERIA_1=5;
It returns the following error where the value node was supposed to be:
<unexpected: [#1,11:11='5',<27>,1:11], resync=5>
The grammar definition file is the following:
grammar ZeGrammar;
options {
language=CSharp3;
TokenLabelType=CommonToken;
output=AST;
ASTLabelType=CommonTree;
k=3;
}
tokens
{
ROOT;
CRITERIA_1;
CRITERIA_2;
OR = 'OR';
AND = 'AND';
EOF = ';';
LPAREN = '(';
RPAREN = ')';
}
public
start
: expr EOF -> ^(ROOT expr)
;
expr
: subexpr ((AND|OR)^ subexpr)*
;
subexpr
: grouppedsubexpr
| 'CRITERIA_1=' rangeval1_expr -> ^(CRITERIA_1 rangeval1_expr)
| 'CRITERIA_2=' rangeval2_expr -> ^(CRITERIA_2 rangeval2_expr)
;
grouppedsubexpr
: LPAREN! expr RPAREN!
;
rangeval1_expr
: rangeval1_subexpr
| RANGE1_VALUES
;
rangeval1_subexpr
: LPAREN! rangeval1_expr (OR^ rangeval1_expr)* RPAREN!
;
RANGE1_VALUES
: (('0'..'4')? ('0'..'9') | '5''0')
;
rangeval2_expr
: rangeval2_subexpr
| RANGE2_VALUES
;
rangeval2_subexpr
: LPAREN! rangeval2_expr (OR^ rangeval2_expr)* RPAREN!
;
RANGE2_VALUES
: '5' | ('a'|'A') | ('b'|'B') | ('c'|'C')
;
And if I remove the value '5' from RANGE2_VALUES it works fine. Can anyone hint me on what I am doing wrong?
You must realize that the lexer does not produce tokens based on what the parser tries to match. So, in your case, the input "5" will always be tokenized as a RANGE1_VALUES and never as a RANGE2_VALUES because both RANGE1_VALUES and RANGE2_VALUES can match this input but RANGE1_VALUES comes first (so RANGE1_VALUES takes precedence over RANGE2_VALUES).
A possible fix would be to remove both RANGE1_VALUES and RANGE2_VALUES rules and replace them with the following lexer rules:
D0_4
: '0'..'4'
;
D5
: '5'
;
D6_50
: '6'..'9' // 6-9
| '1'..'4' '0'..'9' // 10-49
| '50' // 50
;
A_B_C
: ('a'|'A')
| ('b'|'B')
| ('c'|'C')
;
and the introduce these new parser rules:
range1_values
: D0_4
| D5
| D6_50
;
range2_values
: A_B_C
| D5
;
and change all RANGE1_VALUES and RANGE2_VALUES calls in your parser rules with range1_values and range2_values respectively.
EDIT
Instead of trying to solve this at the lexer-level, you might simply match any integer value and check inside the parser rule if the value is the correct one (or correct range) using a semantic predicate:
range1_values
: INT {Integer.valueOf($INT.text) <= 50}?
;
range2_values
: A_B_C
| INT {Integer.valueOf($INT.text) == 5}?
;
INT
: '0'..'9'+
;
A_B_C
: 'a'..'c'
| 'A'..'C'
;

Categories

Resources