I built a "simples" grammar to interprete a file that looks like a json (or xml). But, when I try to parse the file and navigate on the tree I get a System.OutOfMemoryException.
The input file have just 108MB but contains almost 5 millions lines.
Here is a sample of the file:
(
:field ("ObjectName"
:field (
:field ("{6BF621F9-A0E2-49BB-A86B-3DE4750954F4}")
:field (Value)
:field (Value)
:field (
:Time ("Sun Jan 26 10:08:33 2014")
:last_modified_utc (1390730913)
:By ("Some text")
:From (localhost)
)
:field ("text/text")
:field (false)
:field (false)
)
:field ()
:field ()
:field ()
:field (0)
:field (true)
:field (true)
)
.
.
.
.
.
)
Following the grammar:
grammar Objects;
/*
* Parser Rules
*/
compileUnit
: obj
;
obj
: OPEN ID? (field)* CLOSE
;
field
: ':'(ID)? obj
;
/*
* Lexer Rules
*/
OPEN
: '('
;
CLOSE
: ')'
;
ID
: (ALPHA | ALPHA_IN_STRING)
;
fragment
INT_ID
: ('0'..'9')
;
fragment
ALPHA_EACH
: 'A'..'Z' | 'a'..'z' | '_' | INT_ID | '-' | '.' | '#'
;
fragment
ALPHA
: (ALPHA_EACH)+
;
fragment
ALPHA_IN_STRING
: ('"' ( ~[\r\n] )+ '"')
;
WS
// : ' ' -> channel(HIDDEN)
: [ \t\r\n]+ -> skip // skip spaces, tabs, newlines
;
And the parser:
var input = new Antlr4.Runtime.AntlrInputStream(text);
var lexer = new ObjectsLexer(input);
var tokens = new Antlr4.Runtime.CommonTokenStream(lexer);
var parser = new ObjectsParser(tokens);
// Context for the compileUnit rule
// ERROR: Here I got the error. When start the to build the tree for compileUnit rule
var ctx = parser.compileUnit();
// The following line is not executed
new ObjectsVisitor().Visit(ctx);
On the error line, I realise that the memory growth exponentialy.
If the input is UTF-8 encoded and uses primarily ASCII characters, the conversion to UTF-16 will require approximately 216MB.
Each token uses at least 48 bytes of memory.
Each token which appears in the parse tree uses at least 20 bytes of memory (in addition to the 44).
Each rule node in the parse tree uses at least 36 bytes of memory. If the rule has any children, the minimum is 68 bytes.
The numbers above do not include any locals, arguments, labels, or return values, all of which are stored in the tree if you use them.
Assuming 4 characters per token, half the tokens in the parse tree, and an average of 3 tokens per parse tree node (completely arbitrary values here), you get:
Input: 216MB
~28 million tokens: ~1281MB
~14 million terminal nodes in the parse tree: ~267MB
~4.7 million parse tree nodes: ~308MB
This is over 2GB memory, and doesn't count any of the overhead associated with the runtime or the dynamic DFA cache constructed internally by ANTLR. You will clearly need to either run your application as a 64-bit process or reduce the size of your inputs.
Related
I’m attempting to create a grammar for a lighting control system and I make good progress when testing with the tree gui tool but it all seems to fall apart when I attempt to implement it into my app.
The basic structure of the language is [Source] [Mask] [Command] [Destination]. Mask is optional so a super simple sample input might look like this : Fixture 1 # 50 which bypasses Mask. Fixture 1 is the source, # is the command and 50 is the destination which in this case is an intensity value.
I’ve no issues with this type of input but things get complicated as I try and build out more complex source selection. Let’s say I want to select a range of fixtures and remove a few from the selection and then add more fixtures after.
Fixture 1 Thru 50 – 25 – 30 – 35 + 40 > 45 # 50
This is a very common syntax on existing control systems but I’m stumped at how to design the grammar for this in a way that makes integration into my app not too painful.
The user could just as easily type the following:
1 Thru 50 – 25 – 30 – 35 + 40 > 45 # 50
Because sourceType (fixture) is not provided, its inferred.
To try and deal with the above situations, I've written the following:
grammar LiteMic;
/*
* Parser Rules
*/
start : expression;
expression : source command destination
| source mask command destination
| command destination
| source command;
destination : sourceType number
| sourceType number sourceType number
| number;
command : COMMAND;
mask : SOURCETYPE;
operator : ADD #Add
| SUB #Subtract
;
plus : ADD;
minus : SUB;
source : singleSource (plus source)*
| rangeSource (plus source)*
;
singleSource : sourceType number #SourceWithType
| number #InferedSource
;
rangeSource : sourceRange (removeSource)*
;
sourceRange : singleSource '>' singleSource;
removeSource : '-' source;
sourceType : SOURCETYPE;
number : NUMBER;
compileUnit
: EOF
;
/*
* Lexer Rules
*/
SOURCETYPE : 'Cue'
| 'Playback'
| 'List'
| 'Intensity'
| 'Position'
| 'Colour'
| 'Beam'
| 'Effect'
| 'Group'
| 'Fixture'
;
COMMAND : '#'
| 'Record'
| 'Update'
| 'Copy'
| 'Move'
| 'Delete'
| 'Highlight'
| 'Full'
;
ADD : '+' ;
SUB : '-' ;
THRU : '>' ;
/* A number: can be an integer value, or a decimal value */
NUMBER : [0-9]+ ;
/* We're going to ignore all white space characters */
WS : [ \t\r\n]+ -> skip
;
Running the command against grun gui produces the following:
I've had some measure of success being able to override the Listener for AddRangeSource as I can loop through and add the correct types but it all falls apart when I try and remove a range.
1 > 50 - 30 > 35 # 50
This produces a problem as the removal of a range matches to the 'addRangeSource'.
I'm pretty sure I'm missing something obvious and I've been working my way through the book I bought on Amazon but it's still not cleared up in my head how to archieve what I'm after and I've been looking at this for a week.
For good measure, below is a tree for a more advanced query that seems ok apart from the selection.
Does anyone have any pointers / suggestions on where I'm going wrong?
Cheers,
Mike
You can solve the problem by reorganizing the grammar a little:
Merge rangeSource with sourceRange:
rangeSource : singleSource '>' singleSource;
Note: This rule also matches input like Beam 1 > Group 16, which might be unintended, in that case you could use this:
rangeSource : sourceType? number '>' number;
Rename source to sourceList (and don't forget to change it in the expression rule):
expression : sourceList command destination
| sourceList mask command destination
| command destination
| sourceList command;
Add a source rule that matches either singleSource or rangeSource:
source : singleSource | rangeSource;
Put + and - at the same level (as addSource and removeSource):
addSource : plus source;
removeSource : minus source;
Change sourceList to accept a list of addSource/removeSource:
sourceList : source (addSource|removeSource)*;
I tried this and it doesn't have any problems with parsing even the more advanced query.
I'm attempting to parse the following string:
<<! variable, my_variable, A description of my variable !>>
From the reading I've been doing here, I believe I need to use modes to distinguish between the lexers for the literal string 'variable', the variable name (my_variable), and the variable description.
The problem I'm having is that I'm not sure how to structure this. Is it possible to nest modes? Is there a better/smarter way to organize my lexer rules?
lexer grammar VariableLexer;
variableMarkdown : DELIMITER_OPEN SPACE VARIABLE COMMA SPACE variable_name COMMA SPACE description SPACE DELIMITER_CLOSE;
description : WORDS ;
variable_name : ID ;
DELIMITER_OPEN : '<<!' ;
DELIMITER_CLOSE : '!>>';
COMMA : ',' ;
SPACE : ' ' ;
VARIABLE : 'variable' -> pushMode(VariableName);
mode VariableName;
ID : LOWERCASE ( LOWERCASE | NUMBER | UNDERSCORE )* -> pushMode(VariableDescription) ;
mode VariableDescription;
WORDS : ( UPPERCASE | LOWERCASE | NUMBER | SPACE )+ -> popMode;
fragment LOWERCASE : 'a'..'z' ;
fragment UPPERCASE : 'A'..'Z' ;
fragment UNDERSCORE : '_' ;
fragment NUMBER : '0'..'9' ;
First - you can't have parser rules in lexer grammar - parser rules start with small letter, lexer ones with capital.
I'd do it like so (may not be correct syntax but you'll get the idea):
//default mode is implicitly defined by (or in) ANTLR4
VARIABLE : 'variable' (' ')* ',' -> mode(mode_VariableName);
...
mode mode_VariableName;
//define token with anything ending with comma, many ways to do this...
fragment varNameFrag: [a-zA-Z_0-9];
VARIABLE_NAME: varNameFrag varNameFrag* (' ')* ',' -> mode(mode_varDesc);
mode mode_varDesc;
//similar again for variable description
VAR_DESC: //I'll write just a comment here but should more or less match anything except
END_VAR: '!>>' -> mode(DEFAULT_MODE)
Basically in this way you are jumping to modes you need instead of pushing and popping.
I used ANTLR version 4 for creating compiler.First Phase was the Lexer part. I created "CompilerLexer.g4" file and putted lexer rules in it.It works fine.
CompilerLexer.g4:
lexer grammar CompilerLexer;
INT : 'int' ; //1
FLOAT : 'float' ; //2
BEGIN : 'begin' ; //3
END : 'end' ; //4
To : 'to' ; //5
NEXT : 'next' ; //6
REAL : 'real' ; //7
BOOLEAN : 'bool' ; //8
.
.
.
NOTEQUAL : '!=' ; //46
AND : '&&' ; //47
OR : '||' ; //48
POW : '^' ; //49
ID : [a-zA-Z]+ ; //50
WS
: ' ' -> channel(HIDDEN) //50
;
Now it is time for phase 2 which is the parser.I created "CompilerParser.g4" file and putted grammars in it but have dozens warning and errors.
CompilerParser.g4:
parser grammar CompilerParser;
options { tokenVocab = CompilerLexer; }
STATEMENT : EXPRESSION SEMIC
| IFSTMT
| WHILESTMT
| FORSTMT
| READSTMT SEMIC
| WRITESTMT SEMIC
| VARDEF SEMIC
| BLOCK
;
BLOCK : BEGIN STATEMENTS END
;
STATEMENTS : STATEMENT STATEMENTS*
;
EXPRESSION : ID ASSIGN EXPRESSION
| BOOLEXP
;
RELEXP : MODEXP (GT | LT | EQUAL | NOTEQUAL | LE | GE | AND | OR) RELEXP
| MODEXP
;
.
.
.
VARDEF : (ID COMA)* ID COLON VARTYPE
;
VARTYPE : INT
| FLOAT
| CHAR
| STRING
;
compileUnit
: EOF
;
Warning and errors:
implicit definition of token 'BLOCK' in parser
implicit definition of token 'BOOLEXP' in parser
implicit definition of token 'EXP' in parser
implicit definition of token 'EXPLIST' in parser
lexer rule 'BLOCK' not allowed in parser
lexer rule 'EXP' not allowed in parser
lexer rule 'EXPLIST' not allowed in parser
lexer rule 'EXPRESSION' not allowed in parser
Have dozens of these warning and errors. What is the cause?
General Questions: What is difference between using combined grammar and using lexer and parser separately? How should join separate grammar and lexer files?
Lexer rules start with a capital letter, and parser rules start with a lowercase letter. In a parser grammar, you can't define tokens. And since ANTLR thinks all your upper-cased rules lexer rules, it produces theses errors/warning.
EDIT
user2998131 wrote:
General Questions: What is difference between using combined grammar and using lexer and parser separately?
Separating the lexer and parser rules will keeps things organized. Also, when creating separate lexer and parser grammars, you can't (accidentally) put literal tokens inside your parser grammar but will need to define all tokens in your lexer grammar. This will make it apparent which lexer rules get matched before others, and you can't make any typo's inside recurring literal tokens:
grammar P;
r1 : 'foo' r2;
r2 : r3 'foo '; // added an accidental space after 'foo'
But when you have a parser grammar, you can't make that mistake. You will have to use the lexer rule that matches 'foo':
parser grammar P
options { tokenVocab=L; }
r1 : FOO r2;
r2 : r3 FOO;
lexer grammar L;
FOO : 'foo';
user2998131 wrote:
How should join separate grammar and lexer files?
Just like you do in your parser grammar: you point to the proper tokenVocab inside the options { ... } block.
Note that you can also import grammars, which is something different: https://github.com/antlr/antlr4/blob/master/doc/grammars.md#grammar-imports
I have a very simple ANTLR parser building in Visual Studio 2012. It works. But when it builds the grammar file, it emits a warning for every token, saying that the token is already defined. What could be causing this?
Here is the grammar file SimpleCalc.g4:
grammar SimpleCalc;
options {
language=CSharp2;
}
tokens {
PLUS,
MINUS,
TIMES,
DIV
}
#members {
}
expr : term ( (PLUS|MINUS) term )* ;
term : factor ( ( TIMES|DIV ) factor )* ;
factor : NUMBER ;
DIV : '/';
PLUS : '+';
TIMES: '*';
MINUS: '-';
NUMBER : (DIGIT)+ {System.Console.WriteLine("Found number"); };
WHITESPACE: ( '\t' | ' ' | '\r' | '\n' | '\u000C' )+ -> skip ;
fragment DIGIT : '0'..'9';
And here are the warnings:
[path]\SimpleCalc.g4(8,3): warning AC0108: token name 'PLUS' is already defined
[path]\SimpleCalc.g4(9,3): warning AC0108: token name 'MINUS' is already defined
[path]\SimpleCalc.g4(10,3): warning AC0108: token name 'TIMES' is already defined
[path]\SimpleCalc.g4(11,3): warning AC0108: token name 'DIV' is already defined
I would get rid of the unnecessary tokens {...} block.
The ANTLR website describes two approaches to implementing "include" directives. The first approach is to recognize the directive in the lexer and include the file lexically (by pushing the CharStream onto a stack and replacing it with one that reads the new file); the second is to recognize the directive in the parser, launch a sub-parser to parse the new file, and splice in the AST generated by the sub-parser. Neither of these are quite what I need.
In the language I'm parsing, recognizing the directive in the lexer is impractical for a few reasons:
There is no self-contained character pattern that always means "this is an include directive". For example, Include "foo"; at top level is an include directive, but in Array bar --> Include "foo"; or Constant Include "foo"; the word Include is an identifier.
The name of the file to include may be given as a string or as a constant identifier, and such constants can be defined with arbitrarily complex expressions.
So I want to trigger the inclusion from the parser. But to perform the inclusion, I can't launch a sub-parser and splice the AST together; I have to splice the tokens. It's legal for a block to begin with { in the main file and be terminated by } in the included file. A file included inside a function can even close the function definition and start a new one.
It seems like I'll need something like the first approach but at the level of TokenStreams instead of CharStreams. Is that a viable approach? How much state would I need to keep on the stack, and how would I make the parser switch back to the original token stream instead of terminating when it hits EOF? Or is there a better way to handle this?
==========
Here's an example of the language, demonstrating that blocks opened in the main file can be closed in the included file (and vice versa). Note that the # before Include is required when the directive is inside a function, but optional outside.
main.inf:
[ Main;
print "This is Main!";
if (0) {
#include "other.h";
print "This is OtherFunction!";
];
other.h:
} ! end if
]; ! end Main
[ OtherFunction;
A possibility is for each Include statement to let your parser create a new instance of your lexer and insert these new tokens the lexer creates at the index the parser is currently at (see the insertTokens(...) method in the parser's #members block.).
Here's a quick demo:
Inform6.g
grammar Inform6;
options {
output=AST;
}
tokens {
STATS;
F_DECL;
F_CALL;
EXPRS;
}
#parser::header {
import java.util.Map;
import java.util.HashMap;
}
#parser::members {
private Map<String, String> memory = new HashMap<String, String>();
private void putInMemory(String key, String str) {
String value;
if(str.startsWith("\"")) {
value = str.substring(1, str.length() - 1);
}
else {
value = memory.get(str);
}
memory.put(key, value);
}
private void insertTokens(String fileName) {
// possibly strip quotes from `fileName` in case it's a Str-token
try {
CommonTokenStream thatStream = new CommonTokenStream(new Inform6Lexer(new ANTLRFileStream(fileName)));
thatStream.fill();
List extraTokens = thatStream.getTokens();
extraTokens.remove(extraTokens.size() - 1); // remove EOF
CommonTokenStream thisStream = (CommonTokenStream)this.getTokenStream();
thisStream.getTokens().addAll(thisStream.index(), extraTokens);
} catch(Exception e) {
e.printStackTrace();
}
}
}
parse
: stats EOF -> stats
;
stats
: stat* -> ^(STATS stat*)
;
stat
: function_decl
| function_call
| include
| constant
| if_stat
;
if_stat
: If '(' expr ')' '{' stats '}' -> ^(If expr stats)
;
function_decl
: '[' id ';' stats ']' ';' -> ^(F_DECL id stats)
;
function_call
: Id exprs ';' -> ^(F_CALL Id exprs)
;
include
: Include Str ';' {insertTokens($Str.text);} -> /* omit statement from AST */
| Include id ';' {insertTokens(memory.get($id.text));} -> /* omit statement from AST */
;
constant
: Constant id expr ';' {putInMemory($id.text, $expr.text);} -> ^(Constant id expr)
;
exprs
: expr (',' expr)* -> ^(EXPRS expr+)
;
expr
: add_expr
;
add_expr
: mult_expr (('+' | '-')^ mult_expr)*
;
mult_expr
: atom (('*' | '/')^ atom)*
;
atom
: id
| Num
| Str
| '(' expr ')' -> expr
;
id
: Id
| Include
;
Comment : '!' ~('\r' | '\n')* {skip();};
Space : (' ' | '\t' | '\r' | '\n')+ {skip();};
If : 'if';
Include : 'Include';
Constant : 'Constant';
Id : ('a'..'z' | 'A'..'Z') ('a'..'z' | 'A'..'Z' | '0'..'9')+;
Str : '"' ~'"'* '"';
Num : '0'..'9'+ ('.' '0'..'9'+)?;
main.inf
Constant IMPORT "other.h";
[ Main;
print "This is Main!";
if (0) {
Include IMPORT;
print "This is OtherFunction!";
];
other.h
} ! end if
]; ! end Main
[ OtherFunction;
Main.java
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
// create lexer & parser
Inform6Lexer lexer = new Inform6Lexer(new ANTLRFileStream("main.inf"));
Inform6Parser parser = new Inform6Parser(new CommonTokenStream(lexer));
// print the AST
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT((CommonTree)parser.parse().getTree());
System.out.println(st);
}
}
To run the demo, do the following on the command line:
java -cp antlr-3.3.jar org.antlr.Tool Inform6.g
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar Main
The output you'll see corresponds to the following AST: