Can I use ANTLR for non-preprocessed code? - c#

I am about to write a parser for OpenEdge (a 4GL database language) and I would like to use ANTLR (or similar).
There are two reasons I think this may be a problem:
OpenEdge is a 4GL database language which allows constructs like:
assign
customer.name = 'Customer name'
customer.age = 20
.
Where the . at the end is the statement terminator, and this single statement combines the assignment of two database fields. OpenEdge has many more of these constructs;
I need to preserve all details of the source files, so I cannot expand preprocessor statements before parsing the file. For example:
// file myinc.i
7 * 14
// source.p
assign customer.age = {myinc.i}.
In the above example, I need to preserve the fact that customer.age was assigned using {myinc.i} instead of 7 * 14.
Can I use ANTLR to achieve this, or do I need to write my own parser?
UPDATE:
I don't need this parser to generate an executable; rather, it is for code analysis. This is why I need the AST to contain the fact that the include was used.

Just to clarify: ANTLR isn't a parser, but a parser generator.
You either write your own parser for the language, or you write an (ANTLR) grammar for it and let ANTLR generate the lexer and parser for you. You can mix custom code into your grammar to keep track of your assignments.
So, the answer is: yes, you can use ANTLR.
Note that I am unfamiliar with OpenEdge, but SQL-like languages are usually tough to write parsers or grammars for. Have a look at the ANTLR wiki to see that it's no trivial task to write one from the ground up. You didn't mention it, but I assume you've looked at existing parsers that can parse your language?
FYI: you might already have it, but here's a link to the documentation including a BNF grammar for the OpenEdge SQL dialect: http://www.progress.com/progress/products/documentation/docs/dmsrf/dmsrf.pdf

The solution lies within OpenEdge Architect itself. You should check out the OpenEdge Architect jar files (C:\Progress\OpenEdge\oeide\eclipse\plugins\com.openedge.pdt.core_10.2.1.01\lib\progressparser.jar).
Here you will find the parser classes. They are tied to Eclipse, but I separated them from the Eclipse framework, and it works.
The progressparser uses ANTLR, and the ANTLR definition can be found in the following archive:
C:\Progress\OpenEdge\oeide\eclipse\plugins\com.openedge.pdt.core_10.2.1.01\oe_common_services.jar
Inside that file you will find the ANTLR grammar (look for openedge.g).
Good luck. If you want the separated eclipse environment just drop me a mail.

Are you aware that there is already an open source parser for OpenEdge / Progress 4GL? It is called Proparse, written using ANTLR (originally it was hand-coded in OpenEdge itself, but eventually converted to ANTLR). It is written in Java, but I think you can run it in C# by using IKVM.
The license is the Eclipse license, so it is business-friendly.

You can do the same thing the C preprocessor does: extend your grammar with some sort of pragma that records a source location, and let your preprocessor generate code stuffed with those pragmas.
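C# itself ships with such a pragma (#line), so a preprocessor could emit the location markers for you. A minimal sketch of the idea (the class and method names are made up):

// Hypothetical preprocessor output: myinc.i has been expanded, but the
// expanded tokens are still attributed to their original file, cpp-style.
class Demo
{
    static int CustomerAge()
    {
#line 1 "myinc.i"   // diagnostics below point at myinc.i, line 1
        return 7 * 14;
#line default       // restore the real file/line numbers
    }
}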

The issue with multiple assignments is easy enough to handle in a grammar. Just allow multiple assignments:
assign_stmt = 'assign' assignments '.' ;
assignments = ;
assignments = assignments target '=' expression ;
One method you can use is to augment the grammar to allow preprocessor token sequences wherever a nonterminal is allowed, and simply not do preprocessor expansion. For your example, you have some grammar rule:
expression = ... ;
just add the rule:
expression = '{' include_reference '}' ;
This works to the extent that the preprocessor isn't used abusively to generate several language elements that span nonterminal boundaries.
What kind of code analysis do you intend to do? To do pretty much anything, you'll need name and type resolution, which will require expanding the preprocessor directives. In that case, you'll need a more sophisticated scheme, because you need the expanded tree to do the name resolution, yet you need the include information associated off to the side.
Our DMS Software Reengineering Toolkit has an OpenEdge parser, in which we do precisely the "keep the include file references" trick described above. DMS's C parser adds a "macro node" to the tree wherever a macro is used (an OpenEdge "include" is just a funny way to write a macro definition); its child nodes contain the expanded tree as you'd expect it, plus reference information that points back to the macro definition. This requires some careful organization, and lots of special handling of macro nodes where they occur.
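To sketch the idea in C# (the class and member names here are hypothetical, not DMS's actual API):

// A made-up AST node kind for the "macro node" trick described above.
abstract class Expression { }

// Stands in wherever an expression was produced by an include reference;
// analysis passes can see the reference itself and, when expansion has
// been done, the tree of its expansion.
class IncludeReference : Expression
{
    public string FileName = "";     // e.g. "myinc.i"
    public Expression Expansion;     // the parsed "7 * 14", if expanded
}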

Related

General purpose plain-text linting tool

I'm looking for a command-line tool where I can specify regex patterns (or similar) for certain file extensions (e.g. cs files, js files, xaml files) that can provide errors/warnings when run, like during a build. These would scan plain-text source code of all types.
I know there are tools for specific languages... I plan on using those too. This tool is for quick patterns we want to flag where we don't want to invest in writing a Roslyn rule, for example. I'd like to flag certain patterns or API usages in an easy way, where anyone can add a new rule without thinking too hard. Often we don't add rules because it is hard.
Features like source tokenization are a bonus. Open-source / free is a mega bonus.
Is there such a tool?
If you want to go old-skool, you can dust off Awk for this one.
It scans files line by line (for some configurable definition of line, with a sane default), cuts them into pieces (on whitespace, IMMSMR) and applies a set of regexes, firing the code behind the matching regex. There are conditions to match the beginning and end of a file so you can print headers/footers.
It seems to be what you want, but IMHO a Perl or Ruby script is easier, and replaced Awk for me a long time ago. But it IS simple and straightforward for your use case, AFAICT.
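If you'd rather have something that plugs straight into a build, the same rule-table idea is only a few dozen lines of C#. Everything below is a hypothetical sketch, not an existing tool:

using System;
using System.IO;
using System.Text.RegularExpressions;

// One rule = file extension + regex + message; anyone can append a rule
// without thinking too hard, which was the stated goal.
record LintRule(string Extension, Regex Pattern, string Message);

class Linter
{
    static readonly LintRule[] Rules =
    {
        new(".cs", new Regex(@"\bif\s*\([^=!<>]*[^=!<>]=[^=]"), "possible assignment in if"),
        new(".js", new Regex(@"\beval\s*\("), "avoid eval"),
    };

    static int Main(string[] args)
    {
        int hits = 0;
        foreach (var file in args)
            foreach (var rule in Rules)
            {
                if (!file.EndsWith(rule.Extension)) continue;
                int lineNo = 0;
                foreach (var line in File.ReadLines(file))
                {
                    lineNo++;
                    if (rule.Pattern.IsMatch(line))
                    {
                        Console.Error.WriteLine($"{file}({lineNo}): warning: {rule.Message}");
                        hits++;
                    }
                }
            }
        return hits == 0 ? 0 : 1;   // non-zero exit code fails the build
    }
}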
Our Source Code Search Engine (SCSE) can do this.
SCSE lexes (using language-accurate tokenization including skipping language-specific whitespace but retaining comments) a set of source files, and then builds a token index for each token type. One can provide SCSE with token-based search queries such as:
'if' '(' I '='
to search for patterns in the source code; this example "lints" C-like code for the common mistake of assigning a variable (I for "identifier") in an IF statement caused by accidental use of '=' instead of the intended '=='.
The search is accomplished using the token indexes to speed up the search. Typically SCSE can search millions of lines of code in a few seconds, far faster than grep or other schemes that insist on reading the file content for each query. It also produces fewer false positives because the token checks are accurate, and the queries are much easier to write because one does not have to worry about whitespace/line breaks/comments.
A list of hits on the pattern can be logged or merely counted.
Normally SCSE is used interactively; queries produce a list of hits, and clicking on a hit produces a view of a page of the source text with the hit superimposed. However, one can also script calls on the SCSE.
SCSE can be obtained with language-accurate lexers for some 40 languages.

Pre-Compile - Obfuscate Roslyn Generated Code

I have recently been tasked with coming up with a solution that provides renaming, as well as various other obfuscations, pre-compile at runtime. I believe using Roslyn is the way to go, but please provide any insight you may have.
The ultimate goal is as follows:
Allow the end user to select various options that are then generated into a text version of the assembly at runtime. We then use Roslyn to generate the .exe. I was curious whether it is possible to obfuscate at runtime, before the EXE is even generated. This way I can rename vars, etc.
You can use any tool that can reliably transform C# source code.
Roslyn is one, but in a funny way: you can modify the program in memory and produce object code directly. That should work.
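For the renaming part specifically, here's a minimal sketch of the Roslyn route (parse, rewrite the syntax tree, re-emit source). It assumes the Microsoft.CodeAnalysis.CSharp package, and it is nowhere near a real obfuscator:

using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

// Renames every declared variable to v0, v1, ... NOTE: a real tool must
// also rewrite all *uses* of each name (via the semantic model); this
// sketch only touches the declarations.
class LocalRenamer : CSharpSyntaxRewriter
{
    private int _counter;

    public override SyntaxNode VisitVariableDeclarator(VariableDeclaratorSyntax node)
    {
        return node.WithIdentifier(SyntaxFactory.Identifier($"v{_counter++}"));
    }
}

class Program
{
    static void Main()
    {
        var tree = CSharpSyntaxTree.ParseText(
            "class C { void M() { int total = 1; } }");
        var newRoot = new LocalRenamer().Visit(tree.GetRoot());
        System.Console.WriteLine(newRoot.NormalizeWhitespace().ToFullString());
    }
}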
Other Program Transformation Systems (PTS) can do this by modifying the source code. A PTS reads source code, builds compiler data structures (e.g., ASTs), lets you modify the ASTs, and then can regenerate source code from the modified AST. That way you can see the obfuscated code; you can always compile it later with the C# compiler. A good PTS will let you write code transformations in terms of the syntax of the targeted language in a form like this:
if you see *this pattern*, replace it by *that pattern*
expressed below as
rule <name> <patternvariables> "thispattern" -> "thatpattern";
Using a PTS, you can arguably make arbitrary changes to the source code, including function and variable renaming, code flow scrambling and data flow scrambling. For instance, you might use this rule to add confusion:
rule scramble_if_then(c: condition, b: block): statement -> statement
" if (\c) \b " -> "int temp = \c?4:3;
while (temp>3) {\b; temp--; }";
This rule is a bit simple/silly but I think it makes the point that you can write readable source code transformations. If you have many such rules, it will scramble the code a lot, especially if your rules do sophisticated transformations.
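To make that concrete in C#, here's what the rule would do to a snippet (a real tool would generate a fresh name instead of 'temp'):

using System;

class ScrambleDemo
{
    // Before the rule fires:
    static void Before(int x)
    {
        if (x > 0) Console.WriteLine(x);
    }

    // After scramble_if_then rewrites it: behavior is identical, but the
    // control flow no longer looks like a plain if.
    static void After(int x)
    {
        int temp = x > 0 ? 4 : 3;
        while (temp > 3) { Console.WriteLine(x); temp--; }
    }

    static void Main()
    {
        Before(5);
        After(5);   // prints the same thing
    }
}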
We use our DMS Software Reengineering Toolkit to implement name-scrambling obfuscators, including one for C#.

Add a keyword to C# with code generation?

I have a domain-specific language that I would like to integrate with C# by adding new keywords (or some keyword-like syntax). Using attributes would be insufficient (I can't use them in method bodies), and shoehorning it into 'valid' C# notation that gets compiled into something else would be ugly and ruin the analogy with the DSL (and the translation from DSL-like notation to C# is nontrivial, so just writing the C# each time is out of the question).
I already have a way to parse the .cs file and transform it into legitimate, nontrivial, C# code which can be compiled.
The problem is, even though I can do all the work of defining the DSL, parsing it, and translating it into valid C#, Visual Studio won't let me use notation it doesn't understand; it just adds red squiggles, emits a "cannot resolve symbol" error, and then often fails to properly parse things after it.
Is there a way to force Visual Studio to ignore specific strings in its analysis? I've looked at Visual Studio plugins, but it looks like, although I can do syntax highlighting and other stuff, I can't force it to ignore something it doesn't know how to parse (unless I'm missing some way to do that in the extension API, which is certainly possible).
I've skimmed through the Roslyn stuff and don't see offhand a way to do this there, either. (Again, I may be missing something; it doesn't seem to have great documentation.)
Take a look at PowerLanguages.E: http://visualstudiogallery.msdn.microsoft.com/a512e0d0-f4f3-4435-bad4-8d5efbb1db4a
No English docs yet, sorry.

Parse .h header files into C# data structures at runtime

I'm trying to write a C# library to manipulate my C/C++ header files. I want to be able to read and parse the header files and manipulate function prototypes and data structures in C#. I'm trying to avoid writing a C parser, due to all the code branches caused by #ifdefs and stuff like that.
I've tried playing around with EnvDTE, but couldn't find any decent documentation.
Any ideas how can I do it?
Edit -
Thank you for the answers... Here are some more details about my project: I'm writing a ptrace-like tool for Windows using the debugging APIs, which enables me to trace my already compiled binaries and see which Windows APIs are being called. I also want to see which parameters are given in each call and what return values come back, so I need to know the definitions of the APIs. I also want to know the definitions for my own libraries (hence the header parsing approach). I thought of 3 solutions:
* Parsing the header files
* Parsing the PDB files (I wrote a prototype using the DIA SDK, but unfortunately the symbol PDBs contained only general info about the APIs and not the real prototypes with the parameters and return values)
* Crawling over the MSDN online library (automatically or manually)
Is there any better way of getting the names and types of the Windows APIs and my libraries at runtime in C#?
Parsing C (even "just" headers) is hard; the language is more complex than people remember, and then there's the preprocessor, and finally the problem of doing something with the parse. C++ includes essentially all of C, and with C++11 here the problem is even worse.
People can often hack a 98% solution for a limited set of inputs, often with regexes in Perl or some other string hackery. If that works for you, then fine. Usually what happens is that the remaining 2% causes the hacked parser to choke or produce the wrong answer, and then you get to debug the result and hand-patch the 98% solution's output.
Hacked solutions tend to fail pretty badly on real header files, which seem to concentrate weirdness in macros and conditionals (sometimes even to the point of mixing different dialects of C and C++ in the conditional arms). See a typical Microsoft .h file as an example. This appears to be what OP wants to process. Preprocessing gets rid of part of the problem, and now you get to encounter the real complexity of C and/or C++. You won't get a 98% solution for real header files even with preprocessing; you need typedefs and therefore name and type resolution, too. You might "parse" FOO X; that tells you that X is of type FOO... oops, what's that? Only a symbol table knows for sure.
GCCXML does all this preprocessing, parsing, and symbol table construction ... for the GCC dialect of C. Microsoft's dialect is different, and I don't think GCCXML can handle it.
A more general tool is our DMS Software Reengineering Toolkit, with its C front end; there's also a C++ front end (yes, they're different; C and C++ aren't the same language by a long shot). These process a wide variety of C dialects (both MS and GCC when configured properly), do macro/conditional expansion, and build an AST and a symbol table (doing that name and type resolution stuff correctly).
You can add customization to extract the information you want by crawling over the symbol table structures produced. You'll have to export what you want to C# (e.g. generate your C# classes), since DMS isn't implemented in a .NET language.
In the most general case, header files are only usable, not convertible.
This is due to the possible use of preprocessor (#define) macros: fragments of structures, constants, etc. which only get meaning when used in context.
Examples
anything with ## in macros
or
// header
#define mystructconstant "bla", "bla"

// in the using .c
char test[2][10] = { mystructconstant };
But you can't simply discard all macros, since then you won't process the very common calling-convention macros, etc.
So header parsing and conversion is mostly only possible for semi-automated use (manually running cleaned-up headers through it) or for reasonably clean and consistent headers (like the older MS SDK headers).
Since the general case is so hard, there isn't much readily available. Everybody crafts something quick and dirty for their own headers.
The only more general tool that I know of is SWIG.

Interpreting custom language

I need to develop an application that will read and understand a text file containing a custom language that describes a list of operations (i.e. a cooking recipe). This language has not been defined yet, but it will probably take one of the following shapes:
C++ like code
(This code is randomly generated, just for example purposes):
begin
    repeat(10)
    {
        bar(toto, 10, 1999, xxx);
    }
    result = foo(xxxx, 10);
    if (foo == ok)
    {
        ...
    }
    else
    {
        ...
    }
end
XML code
(This code is randomly generated, just for example purposes):
<recipe>
    <action name="foo" argument="bar, toto, xxx" repeat="10"/>
    <action name="bar" argument="xxxxx;10" condition="foo == ok">
        <true>...</true>
        <false>...</false>
    </action>
</recipe>
No matter which form is chosen, it will have to handle simple conditions and loops.
I have never done such a thing, but at first sight it occurs to me that describing those operations in XML would be simpler yet less powerful.
After browsing Stack Overflow, I've found some discussions of a tool called ANTLR... I started reading "The Definitive ANTLR Reference", but since I've never done that kind of stuff, I find it hard to know whether it's really the kind of tool I need...
In other words, what do I need in order to read a text file, interpret it properly, and perform actions in my C# code? Those operations will interact with each other through simple conditions like:
If operation1 fails, do operation2, else do operation3.
Repeat the operation4 10 times.
What would be the best language to describe those text files (XML, my own)? What are the key points in such developments?
I hope I'm being clear :)
Thanks a lot for your help and advices !
XML is great for storing relational data in a verbose way. I think it is a terrible candidate for writing logic such as a program, however.
Have you considered using an existing grammar/scripting language that you can embed, rather than writing your own? E.g.:
Lua
Python
In one of my projects I actually started with an XML-like language, as I already had an XML parser, and parsed the XML structure into an expression tree in memory to be interpreted/run.
This works out very nicely to get past the problem of tokenizing/parsing text files, and lets you concentrate instead on your 'language' and the logic of its operations. The downside is that writing the text files is a little strange and very wordy. It's also very unnatural for a programmer used to C/C++ syntax.
Eventually you could easily replace the XML with a full-blown lexer & parser that reads a more 'natural C++'-like text format into the same expression tree.
As for writing a lexer & parser, I found it easier to write these by hand, using simple logic flow/loops for the lexer and a recursive descent parser for the grammar.
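To give a feel for how small a hand-written recursive descent parser can be, here's a toy C# sketch for the grammar expr = term (('+'|'-') term)*, term = NUMBER; nothing here is specific to any real project:

// A hand-rolled lexer/parser in one class: one method per grammar rule.
class TinyParser
{
    private readonly string _src;
    private int _pos;

    public TinyParser(string src) { _src = src; }

    // expr = term (('+'|'-') term)*
    public int ParseExpr()
    {
        int value = ParseTerm();
        while (Peek() == '+' || Peek() == '-')
        {
            char op = _src[_pos++];
            int rhs = ParseTerm();
            value = op == '+' ? value + rhs : value - rhs;
        }
        return value;
    }

    // term = NUMBER
    private int ParseTerm()
    {
        SkipSpaces();
        int start = _pos;
        while (_pos < _src.Length && char.IsDigit(_src[_pos])) _pos++;
        return int.Parse(_src.Substring(start, _pos - start));
    }

    private char Peek()
    {
        SkipSpaces();
        return _pos < _src.Length ? _src[_pos] : '\0';
    }

    private void SkipSpaces()
    {
        while (_pos < _src.Length && _src[_pos] == ' ') _pos++;
    }
}

// Usage: new TinyParser("1 + 2 - 3").ParseExpr() evaluates to 0.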
That said, ANTLR is great at letting you write out rules for your language and generating the lexer & parser for you. This allows for a much more dynamic language that can easily change without having to refactor everything again when new things are added. So it might be worth looking into, as it would save you much time in rewrites as things change, compared to hand-writing your own.
I'd recommend writing the app in F#. It has many useful features for parsing strings and XML, like pattern matching and active patterns.
For parsing C-like code I would recommend F# (I just did one interpreter with F#; works like a charm).
For parsing XML I would recommend C#/F# + the XmlDocument class.
You basically need to work on two files:
Operator dictionary
Code file in YourLanguage
Load and interpret the operators and then apply them recursively to your code file.
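A rough C# sketch of that recursive approach against the XML form from the question (the element and attribute names are the asker's, and Execute is a stand-in for the operator dictionary lookup):

using System;
using System.Xml;

class RecipeInterpreter
{
    public void Run(XmlElement element)
    {
        switch (element.Name)
        {
            case "recipe":
                // Interpret each child action in document order.
                foreach (XmlNode child in element.ChildNodes)
                    if (child is XmlElement e) Run(e);
                break;
            case "action":
                int repeat = int.TryParse(element.GetAttribute("repeat"), out var n) ? n : 1;
                for (int i = 0; i < repeat; i++)
                    Execute(element.GetAttribute("name"), element.GetAttribute("argument"));
                break;
        }
    }

    // Stand-in for the operator dictionary: map a name to real code here.
    private void Execute(string name, string argument) =>
        Console.WriteLine($"calling {name}({argument})");
}

// Usage:
//   var doc = new XmlDocument();
//   doc.Load("recipe.xml");
//   new RecipeInterpreter().Run(doc.DocumentElement);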
The best prefab answer: S-expressions
C and XML are good first steps. They have sort of opposite disadvantages. The C-like syntax won't add a ton of extra characters, but it's going to be hard to parse due to ambiguity, the variety of tokens, and probably a bunch more issues I can't think of. XML is relatively easy to parse and there's tons of example code, but it will also contain tons of extra text. It might also give you too many options for where to stick language features: for example, is the number of times to repeat a loop an attribute, an element, or text?
S-expressions are more terse than XML for sure, maybe even C. At the same time, they're specific to the task of applying operations to data. They don't admit ambiguity. Parsers are simple and easy to find example code for.
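For instance, the C-like example from the question might render as an S-expression like this (operator names are the asker's; the parentheses are essentially the entire syntax):

; hypothetical S-expression form of the recipe above
(begin
  (repeat 10
    (bar toto 10 1999 xxx))
  (set result (foo xxxx 10))
  (if (= foo ok)
      (...)
      (...)))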
This might save you from having to learn too much theory before you start experimenting. I'll emphasize MerickOWA's point that ANTLR and other parser generators are probably a bigger battle than you want to fight right now. See this discussion on programmers.stackexchange for some background on when the full generality of this type of tool could help.
