ANTLR (v4) Arbitrary Rule Ordering - c#

I'm parsing a JSON-like structure that looks something like this:
items {
item {
name : 'Name',
value : 'abc',
type : String
}
}
The parser rule for item might look like this:
item
: name ',' value ',' type // I want these to be able to be in any order
;
name
: NAME ':' Str
;
value
: VALUE ':' atom
;
type
: TYPE ':' data_type
;
How would I write the item rule such that the order of the key : value pairs is unimportant, and the parser would only check for the presence of the rule? That is, value could come before name, etc.
EDIT
I should clarify that I know I could do this:
item
: item_item*
;
item_item
: name
| value
| type
;
But the problem with that is that the item rule needs to limit each rule to only one instance. Using this technique, I could end up with any number of name rules, for example.

The naive brute force approach would be to solve the problem in syntactic analysis (avoid this):
item
: name ',' value ',' type
| name ',' type ',' value
| type ',' name ',' value
| ...
;
This leads to a large parser spec and unmaintainable visitor/listener code.
Better:
Use a simple parser rule and a semantic predicate to validate:
item
: i+=item_item ',' i+=item_item ',' i+=item_item {containsNVT($i)}?
;
item_item
: name
| value
| type
;
The validation code behind containsNVT($i), which checks that all three items are specified, can be placed inline or in a parser superclass.
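A minimal sketch of the check behind containsNVT follows, written in Java (the language of the ANTLR demo further down this page); the same logic should port directly to a C# parser superclass. The class name ItemValidator and the idea of passing each item's kind as a string are assumptions for illustration. In a real parser the method would receive the list of item_item contexts and derive the kind from each one.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: in a generated parser this logic would live in the
// parser superclass and inspect the item_item contexts collected in $i.
class ItemValidator {
    private static final Set<String> REQUIRED =
            new HashSet<>(Arrays.asList("name", "value", "type"));

    // Returns true only if every required kind appears exactly once.
    static boolean containsNVT(String... kinds) {
        List<String> seen = Arrays.asList(kinds);
        // Same count as REQUIRED plus same distinct set means no duplicates
        // and nothing missing.
        return seen.size() == REQUIRED.size()
                && new HashSet<>(seen).equals(REQUIRED);
    }

    public static void main(String[] args) {
        System.out.println(containsNVT("value", "type", "name")); // true
        System.out.println(containsNVT("name", "name", "type"));  // false
    }
}
```

Because the predicate only looks at which kinds are present, the order of the three pairs in the input no longer matters.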

Related

Error when using questionmark in ANTLR4 actions

I'm testing ANTLR 4 with C# as target language.
The Definitive ANTLR 4 reference says:
Actions are arbitrary chunks of code written in the target language
(the language in which ANTLR generates code) enclosed in {...}. We can
do whatever we want in these actions as long as they are valid target
language statements
However, I get an error if I place a '?' inside {...}
This works:
| ID '(' exprList? ')' { $result = creator.CreateFunctionCall( $ID, null, $exprList.result ); }
But if I add a questionmark, to take care of the optional exprList, ANTLR, not C#, gives an error:
| ID '(' exprList? ')' { $result = creator.CreateFunctionCall( $ID, null, $exprList?.result ); }
Error ANT02 error(67): Expr.g4:4:156: missing attribute access on rule
reference exprList in $exprList
Is this an error in ANTLR? Or can you use an escape code or similar?
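This is not a bug in ANTLR. ANTLR translates $-references such as $exprList.result itself, before the C# compiler ever sees the action, so it reads $exprList? as a bare rule reference with no attribute. Try splitting the alternative instead: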
| ID '(' exprList ')' { $result = creator.CreateFunctionCall( $ID, null, $exprList.result ); }
| ID '(' ')' { $result = creator.CreateFunctionCall( $ID, null, null ); }

ANTLR4 Nested modes?

I'm attempting to parse the following string:
<<! variable, my_variable, A description of my variable !>>
From the reading I've been doing here, I believe I need to use modes to distinguish between the lexers for the literal string 'variable', the variable name (my_variable), and the variable description.
The problem I'm having is that I'm not sure how to structure this. Is it possible to nest modes? Is there a better/smarter way to organize my lexer rules?
lexer grammar VariableLexer;
variableMarkdown : DELIMITER_OPEN SPACE VARIABLE COMMA SPACE variable_name COMMA SPACE description SPACE DELIMITER_CLOSE;
description : WORDS ;
variable_name : ID ;
DELIMITER_OPEN : '<<!' ;
DELIMITER_CLOSE : '!>>';
COMMA : ',' ;
SPACE : ' ' ;
VARIABLE : 'variable' -> pushMode(VariableName);
mode VariableName;
ID : LOWERCASE ( LOWERCASE | NUMBER | UNDERSCORE )* -> pushMode(VariableDescription) ;
mode VariableDescription;
WORDS : ( UPPERCASE | LOWERCASE | NUMBER | SPACE )+ -> popMode;
fragment LOWERCASE : 'a'..'z' ;
fragment UPPERCASE : 'A'..'Z' ;
fragment UNDERSCORE : '_' ;
fragment NUMBER : '0'..'9' ;
First, you can't have parser rules in a lexer grammar: parser rule names start with a lowercase letter, lexer rule names with an uppercase letter.
I'd do it like so (may not be correct syntax but you'll get the idea):
//default mode is implicitly defined by (or in) ANTLR4
VARIABLE : 'variable' (' ')* ',' -> mode(mode_VariableName);
...
mode mode_VariableName;
//define token with anything ending with comma, many ways to do this...
//fragment names are lexer rule names, so they must start with an uppercase letter
fragment VAR_NAME_FRAG: [a-zA-Z_0-9];
VARIABLE_NAME: VAR_NAME_FRAG+ (' ')* ',' -> mode(mode_varDesc);
mode mode_varDesc;
//similar again for variable description
VAR_DESC: //I'll write just a comment here but should more or less match anything except
END_VAR: '!>>' -> mode(DEFAULT_MODE);
Basically in this way you are jumping to modes you need instead of pushing and popping.
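To make the mode-jumping idea concrete, here is a small hand-written simulation in Java. It does not use the ANTLR runtime; the Rule class, the regular expressions, and the token names are assumptions modelled loosely on the grammar sketch above. Each rule belongs to one mode and names the mode to jump to after it matches, which is what -> mode(X) does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ModeLexerSketch {
    enum Mode { DEFAULT, VARIABLE_NAME, VARIABLE_DESC }

    // One lexer rule: owning mode, token name, pattern, and the mode to
    // jump to after a match (the equivalent of ANTLR's -> mode(X)).
    static final class Rule {
        final Mode in; final String type; final Pattern pattern; final Mode next;
        Rule(Mode in, String type, String regex, Mode next) {
            this.in = in; this.type = type;
            this.pattern = Pattern.compile(regex); this.next = next;
        }
    }

    static final Rule[] RULES = {
        new Rule(Mode.DEFAULT, "DELIMITER_OPEN", "<<!", Mode.DEFAULT),
        new Rule(Mode.DEFAULT, "VARIABLE", "variable *,", Mode.VARIABLE_NAME),
        new Rule(Mode.VARIABLE_NAME, "VARIABLE_NAME", "[a-zA-Z_0-9]+ *,", Mode.VARIABLE_DESC),
        new Rule(Mode.VARIABLE_DESC, "END_VAR", "!>>", Mode.DEFAULT),
        // Anything up to, but not including, the closing delimiter.
        new Rule(Mode.VARIABLE_DESC, "VAR_DESC", "(?:(?!!>>).)+", Mode.VARIABLE_DESC),
    };

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Mode mode = Mode.DEFAULT;
        int pos = 0;
        while (pos < input.length()) {
            if (input.charAt(pos) == ' ') { pos++; continue; } // skip blanks between tokens
            boolean matched = false;
            for (Rule rule : RULES) {
                if (rule.in != mode) continue; // only rules of the current mode apply
                Matcher m = rule.pattern.matcher(input).region(pos, input.length());
                if (m.lookingAt()) {
                    tokens.add(rule.type + "(" + m.group().trim() + ")");
                    pos = m.end();
                    mode = rule.next; // jump, no push/pop stack involved
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                throw new IllegalArgumentException("No rule matches at " + pos + " in mode " + mode);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("<<! variable, my_variable, A description of my variable !>>"));
    }
}
```

Note that the word "variable" inside the description is not mistaken for the VARIABLE keyword, because that rule is only active in the default mode.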

Parsing structured text input and composing structured output of nested classes

Here is my code for reading from a text file. It "works" and reads from the text file, but there is a small bug. It returns this: {Employee: Name: Name: red ID: 123 ID: Request: Name: Name: toilet ID: 444 Desc: water ID: Desc: } I know why it's doing it, I just can't figure out how to fix it. The columns[0] value is "Name: red \t ID: 123" and the columns[1] value is "Name: toilet \t ID: 444 \t Desc: water".
I know it's doing it because I'm calling assignment.Employee.Name but I don't know how else to call it to get it to show on my form. I thought it would be something like assignment.Employee but then it gives the error that I can't convert string to the Employee type.
Assignment is a list that holds 2 objects from other lists (employee and service request).
public static List<Assignment> GetAssignment()
{
if (!Directory.Exists(dir))
Directory.CreateDirectory(dir);
StreamReader textIn =
new StreamReader(
new FileStream(path3, FileMode.OpenOrCreate, FileAccess.Read));
List<Assignment> assignments = new List<Assignment>();
while (textIn.Peek() != -1)
{
string row = textIn.ReadLine();
string[] columns = row.Split('|');
if (columns.Length >= 2)
{
Assignment assignment = new Assignment();
assignment.Employee.Name = columns[0];
assignment.Request.Name = columns[1];
assignments.Add(assignment);
}
}
textIn.Close();
return assignments;
}
EDIT: I expect it to just return {Employee: Name: red ID: 123 Request: Name: toilet ID: 444 Desc: water}
Sorry this isn't an answer but due to the strange rules on this site I am not allowed to add a comment. Please give us the definition of the class or structure called "Assignment" and tell us what you expect it to contain after your code has run.
You are performing a string.Format() on the this.Employee so basically it is performing the default ToString() on the Employee object, which will list all fields and their associated values. You perhaps are meaning to call it like this:
return string.Format("Employee: {0} \t Request: {1}", this.Employee.Name, this.Request.Name);
Or perhaps you want to override the ToString() on your Employee and ServiceRequest objects to return your desired results.
Update
Since you edited your question to include the Employee object, the above is not relevant. Since your columns[0] value already contains the text "Name: red \t ID: 123", your Employee override of ToString does not also need to emit the text "Name:".
This answer is based on the assumption that a typical text line in your data file looks like this:
Name: red \t ID: 123 | Name: toilet \t ID: 444 \t Desc: water
This looks to me like it is encoding two objects, the first one having two attributes (Name and ID) and the second one having three attributes (Name, ID, Desc).
Objects within the same line are separated by pipe signs ("|"). Attributes within the same object are separated by tabs ("\t"). Each attribute consists of an identifier ("Name", "ID") and a value ("red", "123"), separated by a colon (":"). The natural data structure for such pairs would be a Dictionary<string, string>.
Reading such a file would emulate that nesting.
Read a line; split it by "|" into strings containing one object each (your columns).
Split each of these object strings by \t so that each resulting string contains one key and one value with a colon (":") and white space between them.
Split each of those key-values by ":" to separate the key from the value. Trim both to get rid of excess white space.
Employees or other objects of this kind hold a dictionary to store the key/value pairs, and ToString() just prints each pair by printing a key, a colon, and the value.
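The nested splitting described above can be sketched as follows. This is an illustration in Java rather than C# (string.Split and Dictionary<string, string> would play the same roles there); the class and method names are hypothetical, and it assumes the \t in the data file is a real tab character.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RecordLineParser {
    // Splits one line into objects ("|"), then attributes (tab),
    // then key and value (":"), trimming excess white space.
    static List<Map<String, String>> parseLine(String line) {
        List<Map<String, String>> objects = new ArrayList<>();
        for (String objectText : line.split("\\|")) {
            Map<String, String> attributes = new LinkedHashMap<>(); // keeps file order
            for (String pair : objectText.split("\t")) {
                String[] keyValue = pair.split(":", 2); // limit 2: a value may contain ':'
                if (keyValue.length == 2) {
                    attributes.put(keyValue[0].trim(), keyValue[1].trim());
                }
            }
            objects.add(attributes);
        }
        return objects;
    }

    // Mirrors the ToString() idea: key, colon, value for each pair.
    static String describe(Map<String, String> attributes) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> entry : attributes.entrySet()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(entry.getKey()).append(": ").append(entry.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String line = "Name: red \t ID: 123 | Name: toilet \t ID: 444 \t Desc: water";
        List<Map<String, String>> objects = parseLine(line);
        System.out.println("Employee: " + describe(objects.get(0)));
        System.out.println("Request: " + describe(objects.get(1)));
    }
}
```

Since each label is stored once as a dictionary key, printing never doubles it up the way "Name: Name: red" did.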

c# migrating to ANTLR 4 from ANTLR 3 with AST

I have inherited some c# code based on ANTLR 3.
We have some grammar files that use the AST (abstract syntax tree) option, and we use those grammars to parse text files written in a very odd "language" into objects. We use the AST as intermediate objects and then convert them to the real objects that we need (with some more processing).
I have no knowledge of ANTLR, but currently the ANTLR processing of the files is a bottleneck in the application's performance.
Since we are using ANTLR 3, we thought that we might get a performance boost if we migrate to ANTLR 4 (and also get the latest and greatest version of ANTLR, which is always a good practice).
I have read that ASTs no longer exist in ANTLR 4. What is the best (and simplest) way to replace them, and what will it mean for my current code?
What is the best approach to upgrading, and will it really give us a performance boost?
An example of one of the grammar file ( there are 6 and this is the simplest one):
grammar Rules;
options
{
language=CSharp2;
output=AST;
ASTLabelType=CommonTree;
superClass = OOPLParserBase;
}
tokens
{
OOPL_MODEL;
}
@lexer::namespace { TestParser.Common.RulesParser }
@parser::namespace { TestParser.Common.RulesParser }
@header
{
using System.Collections.Generic;
using TestParser.OOPLModel;
}
@members
{
public RulesParser() : base(null)
{
}
protected override CommonTree GetAst()
{
return root().Tree as CommonTree;
}
protected override Lexer GetLexer()
{
return new RulesLexer();
}
}
//semantic analysis
root : header (rule_line COMMENT?)+ -> ^(header rule_line+);
header : header_comment+ -> ^(OOPL_MODEL<OOPLModel>[new CommonToken(OOPL_MODEL), "1.0"] header_comment+);
header_comment : COMMENT -> ^(COMMENT<OOPLComment>[$COMMENT, $COMMENT.Text]);
rule_line : parameter RULE_TYPE COMMA PARAMETER_NAME COLON condition -> ^(RULE_TYPE<OOPLBlock>[$RULE_TYPE, $RULE_TYPE.Text] parameter PARAMETER_NAME<OOPLValue>[$PARAMETER_NAME, $PARAMETER_NAME.Text] condition);
parameter : PARAMETER_NAME EQUALS (integer_value = INTEGER | real_value = REAL |string_value = STRING) COMMA -> ^(PARAMETER_NAME<OOPLKeyedValue>[$PARAMETER_NAME, $PARAMETER_NAME.Text, SingleWhereNotNull<IToken>($integer_value, $string_value, $real_value).Text]);
condition : condition_value COMMA condition_value COMMA condition_value COMMA condition_value COMMA condition_value COMMA condition_value COMMA condition_value COMMA condition_value COMMA condition_value COMMA condition_value COMMA condition_value COMMA condition_value COMMA condition_value;
condition_value : (asterisk| parameter_name | positive_integer);
asterisk : ASTERISK -> ^(ASTERISK<OOPLValue>[$ASTERISK, $ASTERISK.Text]);
parameter_name : PARAMETER_NAME -> ^(PARAMETER_NAME<OOPLValue>[$PARAMETER_NAME, $PARAMETER_NAME.Text]);
positive_integer : INTEGER -> ^(INTEGER<OOPLValue>[$INTEGER, $INTEGER.Text]);
//lexical analysis
EQUALS : '=';
NEW_LINE_R : '\r' { $channel = HIDDEN; };
NEW_LINE_N : '\n' { $channel = HIDDEN; };
RULE_TYPE : ('Time'|'TIME'|'Lol'|'LOL'|'World'|'WORLD'|'Template'|'TEMPLATE');
DOUBLE_COLON : COLON COLON;
INTEGER : MINUS? DIGIT+;
REAL : INTEGER '.' INTEGER;
PARAMETER_NAME : ASTERISK? (LETTER|DIGIT|UNDERSCORE|FORWARDSLASH|DOUBLE_COLON|MINUS)+ ASTERISK?;
WS : ( ' '
| '\t'
| NEW_LINE_R
| NEW_LINE_N
) { $channel = HIDDEN; } ;
COMMENT : '#' ( options {greedy=false;} : . )* NEW_LINE_R? NEW_LINE_N;
STRING : '"'~('"')* '"';
fragment
MINUS : '-';
COMMA : ',';
COLON : ':';
fragment
DOT : '.';
ASTERISK : '*';
fragment
FORWARDSLASH : '/';
fragment
UNDERSCORE : '_';
fragment
DIGIT : '0'..'9';
fragment
LETTER : 'A'..'Z' | 'a'..'z';
I'd do the transformation solely in C# code after the parse.
In this case I'd even skip the intermediate AST form and transform the parse tree (provided by ANTLR4) directly into the target representation.
Some prefer ParseTreeListeners and the ParseTreeWalker, which aid you in walking the parse tree. Check these out if you want some pre-built code. Be sure to use the typed listener that ANTLR 4 generates for your grammar (for a grammar named Rules, the generated base class is RulesBaseListener); inherit from it and adjust it to your needs.
link: https://theantlrguy.atlassian.net/wiki/display/ANTLR4/Parse+Tree+Listeners
I'd not recommend ParseTreeVisitors which are invoked during the parse (as opposed to after the parse). They are only suitable for simple operations or grammars that are not context free and require code during the parse. If the requirements evolve later on, you're way more flexible with custom processing or listeners/walkers.
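As a rough illustration of walking a finished tree with a listener and building target objects directly, here is a self-contained Java sketch. Node, Listener, and walk are stand-ins written for this example and are not ANTLR runtime classes; the tree shape and the rule names are assumptions borrowed loosely from the Rules grammar in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ListenerSketch {
    // Stand-in for a parse-tree node; NOT an ANTLR runtime class.
    static final class Node {
        final String rule;
        final String text;
        final List<Node> children;
        Node(String rule, String text, Node... children) {
            this.rule = rule;
            this.text = text;
            this.children = Arrays.asList(children);
        }
    }

    // Stand-in for the listener interface a parser generator would emit.
    interface Listener {
        void enter(Node node);
        void exit(Node node);
    }

    // Depth-first walk over a finished tree, firing enter/exit callbacks.
    static void walk(Listener listener, Node node) {
        listener.enter(node);
        for (Node child : node.children) {
            walk(listener, child);
        }
        listener.exit(node);
    }

    // Builds the target representation (here just strings) straight from
    // the tree, with no intermediate AST.
    static final class RuleLineCollector implements Listener {
        final List<String> ruleTypes = new ArrayList<>();
        @Override public void enter(Node node) {
            if (node.rule.equals("rule_line")) {
                ruleTypes.add(node.text);
            }
        }
        @Override public void exit(Node node) { /* nothing to do on exit */ }
    }

    // Hypothetical tree, shaped loosely like a parse of a Rules file.
    static Node demoTree() {
        return new Node("root", "",
                new Node("header", "# comment"),
                new Node("rule_line", "Time"),
                new Node("rule_line", "World"));
    }

    static List<String> collectRuleTypes(Node root) {
        RuleLineCollector collector = new RuleLineCollector();
        walk(collector, root);
        return collector.ruleTypes;
    }

    public static void main(String[] args) {
        System.out.println(collectRuleTypes(demoTree())); // prints [Time, World]
    }
}
```

The conversion logic lives entirely in ordinary code, so the grammar itself stays free of tree-rewriting operators and embedded actions.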

How can my ANTLR parser (not lexer) trigger a lexical "include" (not AST splice)?

The ANTLR website describes two approaches to implementing "include" directives. The first approach is to recognize the directive in the lexer and include the file lexically (by pushing the CharStream onto a stack and replacing it with one that reads the new file); the second is to recognize the directive in the parser, launch a sub-parser to parse the new file, and splice in the AST generated by the sub-parser. Neither of these are quite what I need.
In the language I'm parsing, recognizing the directive in the lexer is impractical for a few reasons:
There is no self-contained character pattern that always means "this is an include directive". For example, Include "foo"; at top level is an include directive, but in Array bar --> Include "foo"; or Constant Include "foo"; the word Include is an identifier.
The name of the file to include may be given as a string or as a constant identifier, and such constants can be defined with arbitrarily complex expressions.
So I want to trigger the inclusion from the parser. But to perform the inclusion, I can't launch a sub-parser and splice the AST together; I have to splice the tokens. It's legal for a block to begin with { in the main file and be terminated by } in the included file. A file included inside a function can even close the function definition and start a new one.
It seems like I'll need something like the first approach but at the level of TokenStreams instead of CharStreams. Is that a viable approach? How much state would I need to keep on the stack, and how would I make the parser switch back to the original token stream instead of terminating when it hits EOF? Or is there a better way to handle this?
==========
Here's an example of the language, demonstrating that blocks opened in the main file can be closed in the included file (and vice versa). Note that the # before Include is required when the directive is inside a function, but optional outside.
main.inf:
[ Main;
print "This is Main!";
if (0) {
#include "other.h";
print "This is OtherFunction!";
];
other.h:
} ! end if
]; ! end Main
[ OtherFunction;
One possibility is to have your parser, for each Include statement, create a new instance of your lexer and insert the tokens that lexer produces at the index the parser is currently at (see the insertTokens(...) method in the parser's @members block).
Here's a quick demo:
Inform6.g
grammar Inform6;
options {
output=AST;
}
tokens {
STATS;
F_DECL;
F_CALL;
EXPRS;
}
@parser::header {
import java.util.Map;
import java.util.HashMap;
}
@parser::members {
private Map<String, String> memory = new HashMap<String, String>();
private void putInMemory(String key, String str) {
String value;
if(str.startsWith("\"")) {
value = str.substring(1, str.length() - 1);
}
else {
value = memory.get(str);
}
memory.put(key, value);
}
private void insertTokens(String fileName) {
// possibly strip quotes from `fileName` in case it's a Str-token
try {
CommonTokenStream thatStream = new CommonTokenStream(new Inform6Lexer(new ANTLRFileStream(fileName)));
thatStream.fill();
List extraTokens = thatStream.getTokens();
extraTokens.remove(extraTokens.size() - 1); // remove EOF
CommonTokenStream thisStream = (CommonTokenStream)this.getTokenStream();
thisStream.getTokens().addAll(thisStream.index(), extraTokens);
} catch(Exception e) {
e.printStackTrace();
}
}
}
parse
: stats EOF -> stats
;
stats
: stat* -> ^(STATS stat*)
;
stat
: function_decl
| function_call
| include
| constant
| if_stat
;
if_stat
: If '(' expr ')' '{' stats '}' -> ^(If expr stats)
;
function_decl
: '[' id ';' stats ']' ';' -> ^(F_DECL id stats)
;
function_call
: Id exprs ';' -> ^(F_CALL Id exprs)
;
include
: Include Str ';' {insertTokens($Str.text);} -> /* omit statement from AST */
| Include id ';' {insertTokens(memory.get($id.text));} -> /* omit statement from AST */
;
constant
: Constant id expr ';' {putInMemory($id.text, $expr.text);} -> ^(Constant id expr)
;
exprs
: expr (',' expr)* -> ^(EXPRS expr+)
;
expr
: add_expr
;
add_expr
: mult_expr (('+' | '-')^ mult_expr)*
;
mult_expr
: atom (('*' | '/')^ atom)*
;
atom
: id
| Num
| Str
| '(' expr ')' -> expr
;
id
: Id
| Include
;
Comment : '!' ~('\r' | '\n')* {skip();};
Space : (' ' | '\t' | '\r' | '\n')+ {skip();};
If : 'if';
Include : 'Include';
Constant : 'Constant';
Id : ('a'..'z' | 'A'..'Z') ('a'..'z' | 'A'..'Z' | '0'..'9')+;
Str : '"' ~'"'* '"';
Num : '0'..'9'+ ('.' '0'..'9'+)?;
main.inf
Constant IMPORT "other.h";
[ Main;
print "This is Main!";
if (0) {
Include IMPORT;
print "This is OtherFunction!";
];
other.h
} ! end if
]; ! end Main
[ OtherFunction;
Main.java
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
// create lexer & parser
Inform6Lexer lexer = new Inform6Lexer(new ANTLRFileStream("main.inf"));
Inform6Parser parser = new Inform6Parser(new CommonTokenStream(lexer));
// print the AST
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT((CommonTree)parser.parse().getTree());
System.out.println(st);
}
}
To run the demo, do the following on the command line:
java -cp antlr-3.3.jar org.antlr.Tool Inform6.g
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar Main
The output you'll see corresponds to the following AST:
