Best/fastest way to write a parser in c# - c#

What is the best way to build a parser in C# to parse my own language?
Ideally I'd like to provide a grammar, and get Abstract Syntax Trees as an output.
Many thanks,
Nestor

I've had good experience with ANTLR v3. By far the biggest benefit is that it lets you write LL(*) parsers with infinite lookahead - these can be quite suboptimal, but the grammar can be written in the most straightforward and natural way with no need to refactor to work around parser limitations, and parser performance is often not a big deal (I hope you aren't writing a C++ compiler), especially in learning projects.
It also provides pretty good means of constructing meaningful ASTs without needing to write any code: for every grammar production, you indicate the "crucial" token or sub-production, and that becomes a tree node. Or you can write a tree production.
Have a look at the following ANTLR grammars (listed here in order of increasing complexity) to get the gist of how it looks and feels:
JSON grammar - with tree productions
Lua grammar
C grammar

I've played with Irony. It looks simple and useful.

You could study the source code for the Mono C# compiler.

While it is still in early beta, the Oslo modeling language and MGrammar tools from Microsoft are showing some promise.

I would also take a look at SableCC. It's very easy to create the EBNF grammar. Here is a simple C# calculator example.

There's a short paper here on constructing an LL(1) parser; of course, you could use a generator too.
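To give a feel for what a hand-written LL(1) parser that returns an AST looks like, here is a minimal sketch. It is in Python rather than C# to keep it short, the arithmetic grammar is my own choice (it is not taken from the paper mentioned above), and the names `tokenize`, `Parser` and `parse` are hypothetical. Each grammar rule becomes one method, and the parser only ever looks at the next token.

```python
import re

# Grammar (LL(1): one token of lookahead is enough to pick a rule):
#   expr   -> term (('+' | '-') term)*
#   term   -> factor (('*' | '/') factor)*
#   factor -> NUMBER | '(' expr ')'
TOKEN = re.compile(r"\s*(\d+|[-+*/()])")


def tokenize(text):
    """Split the input into number and operator tokens."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        """Return the next token without consuming it (None at the end)."""
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        """Consume and return the next token."""
        token = self.peek()
        self.pos += 1
        return token

    def expr(self):
        node = self.term()
        while self.peek() in ("+", "-"):
            # Left-associative: the tree built so far becomes the left child.
            node = (self.take(), node, self.term())
        return node

    def term(self):
        node = self.factor()
        while self.peek() in ("*", "/"):
            node = (self.take(), node, self.factor())
        return node

    def factor(self):
        token = self.take()
        if token == "(":
            node = self.expr()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return node
        if token is not None and token.isdigit():
            return ("num", int(token))
        raise SyntaxError(f"unexpected token {token!r}")


def parse(text):
    """Parse an arithmetic expression into a nested-tuple AST."""
    parser = Parser(tokenize(text))
    tree = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree
```

For example, `parse("1 + 2 * 3")` returns `("+", ("num", 1), ("*", ("num", 2), ("num", 3)))`, so operator precedence falls out of the rule nesting rather than from any extra code.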

Lex and yacc are still my favorites. Obscure if you're just starting out, but extremely simple, fast, and easy once you've got the lingo down.
You can make it do whatever you want: generate C# code, build other grammars, emulate instructions, whatever.
It's not pretty: it's a text-based format and LALR(1), so your syntax has to accommodate that.
On the plus side, it's everywhere. There are great O'Reilly books about it, lots of sample code, lots of premade grammars, and lots of native language libraries.

Related

Interpreting custom language

I need to develop an application that will read and understand a text file written in a custom language that describes a list of operations (e.g. a cooking recipe). This language has not been defined yet, but it will probably take one of the following shapes:
C++-like code
(This code is randomly generated, just for example purposes):
begin
    repeat(10)
    {
        bar(toto, 10, 1999, xxx);
    }
    result = foo(xxxx, 10);
    if(foo == ok)
    {
        ...
    }
    else
    {
        ...
    }
end
XML code
(This code is randomly generated, just for example purposes):
<recipe>
    <action name="foo" argument="bar, toto, xxx" repeat="10"/>
    <action name="bar" argument="xxxxx;10" condition="foo == ok">
        <true>...</true>
        <false>...</false>
    </action>
</recipe>
No matter which language is chosen, it will have to handle simple conditions and loops.
I have never done such a thing, but at first sight it occurs to me that describing those operations in XML would be simpler yet less powerful.
After browsing Stack Overflow, I've found some discussions of a tool called "ANTLR"... I started reading "The Definitive ANTLR Reference", but since I've never done that kind of thing, I find it hard to know whether it's really the kind of tool I need...
In other words, what do I need in order to read a text file, interpret it properly and perform actions in my C# code? Those operations will interact with each other through simple conditions like:
If operation1 failed, I do operation2, else operation3.
Repeat operation4 10 times.
What would be the best language to describe those text files (XML, or my own)? What are the key points in such a development?
I hope I'm being clear :)
Thanks a lot for your help and advice!
XML is great for storing relational data in a verbose way. I think it is a terrible candidate for writing logic such as a program, however.
Have you considered using an existing grammar/scripting language that you can embed, rather than writing your own? E.g:
Lua
Python
In one of my projects I actually started with an XML-like language, as I already had an XML parser, and parsed the XML structure into an expression tree in memory to be interpreted/run.
This works out very nicely to get past the problem of tokenizing/parsing text files and lets you concentrate instead on your 'language' and the logic of its operations. The downside is that writing the text files is a little strange and very wordy. It's also very unnatural for a programmer used to C/C++ syntax.
Eventually you could easily replace your XML with a full-blown scanner & parser that reads a more natural, C++-like text format into your expression tree.
As for writing a scanner & parser, I found it easier to write these by hand, using simple logic flow/loops for the scanner and recursive descent for the parser.
That said, ANTLR is great at letting you write out the rules for your language and generating your scanner & parser for you. This allows for a much more dynamic language, which can easily change without having to refactor everything again when new things are added. So it might be worth learning, as it would save you much of the time you would spend on rewrites if you hand-wrote your own.
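To make the "simple logic flow/loops" idea concrete, here is a minimal sketch of a hand-written scanner for the C++-like sample in the question. It is in Python for brevity (the same loop translates directly to C#), and the keyword set, token kinds and the `scan` name are my own assumptions, since the language has not been defined yet.

```python
# Hypothetical keyword and operator sets, guessed from the question's sample.
KEYWORDS = {"begin", "end", "repeat", "if", "else"}
TWO_CHAR = {"==", "!=", "<=", ">="}


def scan(source):
    """Turn source text into (kind, text) tokens using one plain loop."""
    tokens, i = [], 0
    while i < len(source):
        ch = source[i]
        if ch.isspace():
            i += 1
        elif ch.isalpha() or ch == "_":
            # Identifier or keyword: read letters, digits and underscores.
            start = i
            while i < len(source) and (source[i].isalnum() or source[i] == "_"):
                i += 1
            word = source[start:i]
            tokens.append(("keyword" if word in KEYWORDS else "ident", word))
        elif ch.isdigit():
            start = i
            while i < len(source) and source[i].isdigit():
                i += 1
            tokens.append(("number", source[start:i]))
        elif source[i:i + 2] in TWO_CHAR:
            # Check two-character operators before single characters.
            tokens.append(("op", source[i:i + 2]))
            i += 2
        elif ch in "(){},;=<>":
            tokens.append(("op", ch))
            i += 1
        else:
            raise SyntaxError(f"unexpected character {ch!r} at offset {i}")
    return tokens
```

A recursive descent parser would then consume this token list, with one function per construct (`repeat`, `if`/`else`, call, assignment), and build the same expression tree the XML version produced.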
I'd recommend writing the app in F#. It has many useful features for parsing strings and XML, like Pattern Matching and Active Patterns.
For parsing C-like code I would recommend F# (I just wrote an interpreter in F#, and it works like a charm).
For parsing XML I would recommend C#/F# + the XmlDocument class.
You basically need to work on two files:
Operator dictionary
Code file in YourLanguage
Load and interpret the operators and then apply them recursively to your code file.
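As a rough illustration of the "operator dictionary plus recursive application" idea, here is a minimal sketch that walks an XML recipe. It uses Python's standard `xml.etree.ElementTree` rather than `XmlDocument`, and everything about the recipe format is an assumption on my part: the operator names, the comma-separated `argument` attribute, and the rule that an operator's return value selects the `<true>` or `<false>` branch.

```python
import xml.etree.ElementTree as ET

# Hypothetical recipe, loosely modeled on the XML sample in the question.
RECIPE = """
<recipe>
  <action name="foo" argument="bar, toto" repeat="2"/>
  <action name="check" argument="x">
    <true><action name="bar" argument="ok"/></true>
    <false><action name="bar" argument="failed"/></false>
  </action>
</recipe>
"""


def run(node, operators, log):
    """Apply operators to each <action> child, recursing into branches."""
    for action in node.findall("action"):
        op = operators[action.get("name")]
        raw = action.get("argument", "")
        args = [a.strip() for a in raw.split(",") if a.strip()]
        result = None
        for _ in range(int(action.get("repeat", "1"))):
            result = op(log, *args)
        # Assumption: the last result picks the <true> or <false> child.
        branch = action.find("true" if result else "false")
        if branch is not None:
            run(branch, operators, log)


# Operator dictionary: maps names used in the recipe to host functions.
def foo(log, *args):
    log.append(("foo", args))
    return True


def check(log, name):
    log.append(("check", (name,)))
    return False


def bar(log, message):
    log.append(("bar", (message,)))
    return True


OPERATORS = {"foo": foo, "check": check, "bar": bar}

log = []
run(ET.fromstring(RECIPE), OPERATORS, log)
```

After the run, `log` holds two `foo` calls, one `check` call, and the `bar` call from the `<false>` branch, in that order.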
The best prefab answer: S-expressions
C and XML are good first steps. They have sort of opposite disadvantages. The C-like syntax won't add a ton of extra characters, but it's going to be hard to parse due to ambiguity, the variety of tokens, and probably a bunch more issues I can't think of. XML is relatively easy to parse and there's tons of example code, but it will also contain tons of extra text. It might also give you too many options for where to stick language features - for example, is the number of times to repeat a loop an attribute, element or text?
S-expressions are more terse than XML for sure, maybe even C. At the same time, they're specific to the task of applying operations to data. They don't admit ambiguity. Parsers are simple and easy to find example code for.
This might save you from having to learn too much theory before you start experimenting. I'll emphasize MerickOWA's point that ANTLR and other parser generators are probably a bigger battle than you want to fight right now. See this discussion on programmers.stackexchange for some background on when the full generality of this type of tool could help.
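To show how little code an S-expression reader needs, here is a minimal sketch in Python (the structure ports directly to C#). The function names and the sample recipe syntax are my own assumptions; the reader only distinguishes integers from symbols and does not handle quoted strings.

```python
def tokenize(text):
    """Pad parentheses with spaces, then split on whitespace."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def read(tokens):
    """Consume tokens from the front and return one parsed expression."""
    if not tokens:
        raise SyntaxError("unexpected end of input")
    token = tokens.pop(0)
    if token == "(":
        items = []
        while tokens and tokens[0] != ")":
            items.append(read(tokens))
        if not tokens:
            raise SyntaxError("missing ')'")
        tokens.pop(0)  # discard the closing ')'
        return items
    if token == ")":
        raise SyntaxError("unexpected ')'")
    try:
        return int(token)
    except ValueError:
        return token  # anything that is not an integer is a symbol


def parse(text):
    """Parse a single S-expression into nested Python lists."""
    return read(tokenize(text))
```

For example, `parse("(repeat 10 (bar toto 1999))")` returns `["repeat", 10, ["bar", "toto", 1999]]`, which is already the tree an interpreter would walk: the first element names the operation and the rest are its arguments.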

Is there a C# token counter?

I want to have a programming competition with a friend of mine in C# and the competition will be to write with the fewest number of C# tokens. I have seen C++ token counting programs around but is there one for C#? Or would there be something in System.Reflection? Additionally, if anyone has links to token counters for other languages, feel free to link them.
Irony (a C# parser) has a C# grammar (I'm not sure which version of C# it supports), and the grammar explorer tool that Irony comes with probably gives you a token count...
If it doesn't, I'm sure you could make it do so pretty easily (open source ftw)
Well I believe that technically anything in the Reflection namespaces won't be a token counter, as everything in Reflection deals with inspection of the IL, which may be optimised for example.
This Wikipedia entry might help you, however: List of C Sharp lexer generators
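If no ready-made counter turns up, a rough one can be approximated with a single regular expression. The sketch below is in Python and is only an approximation of C# lexing, not the C# lexical grammar: interpolated strings, preprocessor directives and a few multi-character operators are not handled, and the pattern and the `count_tokens` name are my own. For a friendly competition both sides would just need to agree on the same counter.

```python
import re

# Approximate C#-style token pattern; alternatives are tried in order.
TOKEN = re.compile(r"""
    (?P<skip>\s+|//[^\n]*|/\*.*?\*/)                       # whitespace, comments
  | (?P<string>@"(?:[^"]|"")*"|"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])+')
  | (?P<number>0[xX][0-9A-Fa-f]+|\d+(?:\.\d+)?[fFdDmMuUlL]*)
  | (?P<ident>@?[A-Za-z_]\w*)
  | (?P<op>=>|==|!=|<=|>=|&&|\|\||\+\+|--|\?\?|->|[-+*/%&|^]=|.)
""", re.VERBOSE | re.DOTALL)


def count_tokens(source):
    """Count lexical tokens, ignoring whitespace and comments."""
    return sum(1 for m in TOKEN.finditer(source) if m.lastgroup != "skip")
```

For instance, `count_tokens('int x = 42; // answer')` gives 5: `int`, `x`, `=`, `42` and `;`.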

artificial intelligence - Creative Writing

I am trying to find information (and hopefully C# source code) about creating a basic AI tool that can understand English words, grammar and context.
The idea is to train the AI on as many written documents as possible and then, based on these documents, have the AI create its own creative writing in proper English that makes sense to a human.
While the idea is simple, I do realise that the hurdles are huge; any starting points or good resources will be appreciated.
A basic AI tool that you can use to do something like this is a Markov Chain. It's actually not too tricky to write!
See: http://pscode.com/vb/scripts/ShowCode.asp?txtCodeId=2031&lngWId=10
If that's not enough, you might be able to store WordNet synsets in your Markov chain instead of just words. This gives you some sense of the meaning of the words.
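To back up the claim that it is not too tricky, here is a minimal word-level Markov chain sketch. It is in Python rather than C# for brevity, the function names are my own, and it works on plain words only (no WordNet synsets).

```python
import random
from collections import defaultdict


def build_chain(text, order=2):
    """Map each run of `order` words to the words seen right after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain


def generate(chain, length=20, seed=None):
    """Random-walk the chain, stopping early if a state has no followers."""
    if not chain:
        return ""
    rng = random.Random(seed)
    state = rng.choice(sorted(chain))  # sorted so a seed gives repeatable output
    output = list(state)
    while len(output) < length:
        followers = chain.get(state)
        if not followers:
            break
        # Followers are stored with repeats, so frequent words are likelier.
        output.append(rng.choice(followers))
        state = tuple(output[-len(state):])
    return " ".join(output)
```

A higher `order` makes the output read more like the training text but also copy it more closely; with a small corpus, an order of 1 or 2 is usually as far as it is worth going.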
To be able to recompose a document you are going to need a way to filter out the bad results.
Which means:
You are going to have to write a program that can evaluate whether the output is valid (grammatically and syntactically is the best you can do reliably; this would require NLP)
You would need lots of training data and test data
You would need to watch out for overtraining (take a look at ROC curves)
Instead of writing a tool you could:
Manually score the output (it will take a long time to properly train the algorithm)
For this, using Amazon Mechanical Turk might be a good idea
The irony of this: the computer would have a difficult time "creatively" composing something new. All of its worth will be based on its previous experiences [training data]
Some good references and reading at this Natural Language article.
As others said, a Markov chain seems to be most suitable for such a task. A nice description of implementing a Markov chain can be found in Kernighan & Pike, The Practice of Programming, section 3.1. A nice description of text generation is also present in Programming Pearls.
One thing, though not quite what you need, would be a Markov chain of words. Here's a link I found by a quick search: http://blog.figmentengine.com/2008/10/markov-chain-code.html, but you can find much more information by searching for it.
Take a look at http://www.nltk.org/ (Natural Language Toolkit), lots of powerful tools there. They use Python (not C#) but Python is easy enough to pick up. Much easier to pick up than the breadth and depth of natural language processing, at least.
I agree that you will have trouble creating something creative. You could possibly also use a keyword spinner on certain words. You might also want to implement a stop word filter to remove anything colloquial.

Lex/Yacc for C#?

Actually, maybe not full-blown Lex/Yacc. I'm implementing a command-interpreter front-end to administer a webapp. I'm looking for something that'll take a grammar definition and turn it into a parser that directly invokes methods on my object. Similar to how ASP.NET MVC can figure out which controller method to invoke, and how to pony up the arguments.
So, if the user types "create foo" at my command-prompt, it should transparently call a method:
private void Create(string id) { /* ... */ }
Oh, and if it could generate help text from (e.g.) attributes on those controller methods, that'd be awesome, too.
I've done a couple of small projects with GPLEX/GPPG, which are pretty straightforward reimplementations of LEX/YACC in C#. I've not used any of the other tools above, so I can't really compare them, but these worked fine.
GPPG can be found here and GPLEX here.
That being said, I agree, a full LEX/YACC solution probably is overkill for your problem. I would suggest generating a set of bindings using IronPython: it interfaces easily with .NET code, non-programmers seem to find the basic syntax fairly usable, and it gives you a lot of flexibility/power if you choose to use it.
I'm not sure Lex/Yacc will be of any help. You'll just need a basic tokenizer and an interpreter, which are faster to write by hand. If you still want to go the parsing route, see Irony.
As a sidenote: have you considered PowerShell and its commandlets?
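As a rough sketch of the hand-written approach, a line like "create foo" can be dispatched to a method by name, with help text taken from metadata on the method. The example below is in Python, where docstrings stand in for C# attributes and `getattr` stands in for reflection; the `Commands` class, its methods and the `dispatch` function are all hypothetical.

```python
import inspect
import shlex


class Commands:
    """Each public method is a command; its docstring is the help text."""

    def __init__(self):
        self.created = []

    def create(self, item_id):
        """create <id>: create a new item"""
        self.created.append(item_id)
        return f"created {item_id}"

    def help(self):
        """help: list available commands"""
        docs = [inspect.getdoc(method)
                for name, method in inspect.getmembers(self, inspect.ismethod)
                if not name.startswith("_")]
        return "\n".join(docs)


def dispatch(target, line):
    """Tokenize a command line and call the matching method on target."""
    parts = shlex.split(line)
    if not parts:
        return ""
    name, args = parts[0].lower(), parts[1:]
    method = getattr(target, name, None)
    if name.startswith("_") or not callable(method):
        return f"unknown command: {name}"
    try:
        # Check the argument count before calling, to report usage instead.
        inspect.signature(method).bind(*args)
    except TypeError:
        return f"usage: {inspect.getdoc(method)}"
    return method(*args)
```

With this in place, `dispatch(Commands(), "create foo")` returns `"created foo"`, and calling `create` with no argument returns the usage line built from the docstring. In C# the same shape would presumably use `MethodInfo` lookup and a custom attribute for the help text.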
Also look at Antlr, which has C# support.
It's still an early CTP, so it can't be used in production apps, but you may be interested in Oslo/MGrammar:
http://msdn.microsoft.com/en-us/oslo/
Jison has been getting a lot of traction recently. It is a Bison port to JavaScript. Because of its extremely simple nature, I've ported the Jison parsing/lexing template to PHP, and now to C#. It is still very new, but if you get a chance, take a look at it here: https://github.com/robertleeplummerjr/jison/tree/master/ports/csharp/Jison
If you don't fear alpha software and want an alternative to Lex/Yacc for creating your own languages, you might look into Oslo. I would recommend sitting through the recordings of sessions TL27 and TL31 from last year's PDC. TL31 directly addresses the creation of Domain Specific Languages using Oslo.
Coco/R is a compiler generator with a .NET implementation. You could try that out, but I'm not sure if getting such a library to work would be faster than writing your own tokenizer.
http://www.ssw.uni-linz.ac.at/Research/Projects/Coco/
I would suggest csflex, a C# port of flex, the best-known Unix scanner generator.
I believe that lex/yacc are in one of the SDKs already (i.e. RTM). Either Windows or .NET Framework SDK.
Gardens Point Parser Generator here provides Yacc/Bison functionality for C#. It can be downloaded here. A useful example using GPPG is provided here.
As Anton said, PowerShell is probably the way to go. If you do want a lex/ yacc implementation then Malcolm Crowe has a good set.
Edit: Direct Link to the Compiler Tools
Just for the record, implementation of lexer and LALR parser in C# for C#:
http://code.google.com/p/naive-language-tools/
It should be similar in use to Lex/Yacc, however those tools (NLT) are not generators! Thus, forget about speed.

Constructing a simple interpreter

I’m starting a project where I need to implement a light-weight interpreter.
The interpreter is used to execute simple scientific algorithms.
The programming language that this interpreter will use should be simple, since it is targeting non-software developers (for example, mathematicians).
The interpreter should support basic programming languages features:
Real numbers, variables, multi-dimensional arrays
Binary (+, -, *, /, %) and Boolean (==, !=, <, >, <=, >=) operations
Loops (for, while), Conditional expressions (if)
Functions
MathWorks MatLab is a good example of where I’m heading, just much simpler.
The interpreter will be used as an environment to demonstrate algorithms; simple algorithms such as finding the average of a dataset/array, or slightly more complicated algorithms such as Gaussian elimination or RSA.
Best/Most practical resource I found on the subject is Ron Ayoub’s entry on Code Project (Parsing Algebraic Expressions Using the Interpreter Pattern) - a perfect example of a minified version of my problem.
The Purple Dragon Book seems to be too much, anything more practical?
The interpreter will be implemented as a .NET library, using C#. However, resources for any platform are welcome, since the design-architecture part of this problem is the most challenging.
Any practical resources?
(please avoid “this is not trivial” or “why re-invent the wheel” responses)
I would write it in ANTLR. Write the grammar, and let ANTLR generate a C# parser. You can ask ANTLR for a parse tree, and possibly the interpreter can already operate on the parse tree. Perhaps you'll have to convert the parse tree to some more abstract internal representation (although ANTLR already allows you to leave out irrelevant punctuation when generating the tree).
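Whichever tool produces the tree, the interpreter itself can be a short recursive walk over it. Here is a minimal sketch in Python, assuming a hypothetical tree made of nested tuples such as `("assign", name, expr)` and `("while", cond, body)`; this node layout is my own invention and not what ANTLR emits, so a real implementation would adapt the node access to the generated tree classes.

```python
import operator

# Arithmetic and comparison operators from the question's feature list.
BINARY = {
    "+": operator.add, "-": operator.sub, "*": operator.mul,
    "/": operator.truediv, "%": operator.mod,
    "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, ">": operator.gt,
    "<=": operator.le, ">=": operator.ge,
}


def evaluate(node, env):
    """Evaluate an expression node against the variable environment."""
    kind = node[0]
    if kind == "num":
        return node[1]
    if kind == "var":
        return env[node[1]]
    if kind == "index":
        return evaluate(node[1], env)[int(evaluate(node[2], env))]
    if kind in BINARY:
        return BINARY[kind](evaluate(node[1], env), evaluate(node[2], env))
    raise ValueError(f"unknown expression node {kind!r}")


def execute(node, env):
    """Run a statement node, updating env in place."""
    kind = node[0]
    if kind == "assign":
        env[node[1]] = evaluate(node[2], env)
    elif kind == "block":
        for statement in node[1]:
            execute(statement, env)
    elif kind == "if":
        if evaluate(node[1], env):
            execute(node[2], env)
        elif len(node) > 3:  # optional else branch
            execute(node[3], env)
    elif kind == "while":
        while evaluate(node[1], env):
            execute(node[2], env)
    else:
        raise ValueError(f"unknown statement node {kind!r}")


# Example program: the mean of a dataset, written directly as a tree.
program = ("block", [
    ("assign", "total", ("num", 0)),
    ("assign", "i", ("num", 0)),
    ("while", ("<", ("var", "i"), ("var", "n")), ("block", [
        ("assign", "total",
         ("+", ("var", "total"), ("index", ("var", "data"), ("var", "i")))),
        ("assign", "i", ("+", ("var", "i"), ("num", 1))),
    ])),
    ("assign", "mean", ("/", ("var", "total"), ("var", "n"))),
])
env = {"data": [2.0, 4.0, 9.0], "n": 3}
execute(program, env)
```

After `execute` returns, `env["mean"]` is `5.0`. Functions and multi-dimensional arrays would be further node kinds handled the same way.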
It might sound odd, but Game Scripting Mastery is a great resource for learning about parsing, compiling and interpreting code.
You should really check it out:
http://www.amazon.com/Scripting-Mastery-Premier-Press-Development/dp/1931841578
One way to do it is to examine the source code for an existing interpreter. I've written a JavaScript interpreter in the D programming language; you can download the source code from http://ftp.digitalmars.com/dmdscript.zip
Walter Bright, Digital Mars
I'd recommend leveraging the DLR to do this, as this is exactly what it is designed for.
Create Your Own Language on top of the DLR
Lua was designed as an extensible interpreter for use by non-programmers. (The first users were Brazilian petroleum geologists although the user base has broadened considerably since then.) You can take Lua and easily add your scientific algorithms, visualizations, what have you. It's superbly well engineered and you can get on with the task at hand.
Of course, if what you really want is the fun of building your own, then the other advice is reasonable.
Have you considered using IronPython? It's easy to use from .NET and it seems to meet all your requirements. I understand that Python is fairly popular for scientific programming, so it's possible your users will already be familiar with it.
The Silk library has just been published to GitHub. It seems to do most of what you are asking. It is very easy to use. Just register the functions you want to make available to the script, compile the script to bytecode and execute it.
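For readers unfamiliar with the register/compile/execute pattern described here, the following is a generic sketch of it in Python. It is not Silk's API: the `VM` class, the `PUSH`/`CALL` instruction set and the tuple-based input tree are all hypothetical, chosen only to show the three steps.

```python
import math


def compile_expr(node, code):
    """Append stack-machine instructions for an expression tree to code."""
    kind = node[0]
    if kind == "num":
        code.append(("PUSH", node[1]))
    elif kind == "call":
        _, name, args = node
        for arg in args:
            compile_expr(arg, code)  # arguments are evaluated left to right
        code.append(("CALL", name, len(args)))
    else:
        raise ValueError(f"unknown node {kind!r}")
    return code


class VM:
    def __init__(self):
        self.functions = {}

    def register(self, name, func):
        """Expose a host function to scripts under the given name."""
        self.functions[name] = func

    def run(self, code):
        """Execute the instructions and return the value left on the stack."""
        stack = []
        for instruction in code:
            if instruction[0] == "PUSH":
                stack.append(instruction[1])
            elif instruction[0] == "CALL":
                _, name, argc = instruction
                split = len(stack) - argc
                args = stack[split:]
                del stack[split:]
                stack.append(self.functions[name](*args))
        return stack.pop()


vm = VM()
vm.register("add", lambda a, b: a + b)
vm.register("sqrt", math.sqrt)
code = compile_expr(
    ("call", "sqrt", [("call", "add", [("num", 9), ("num", 16)])]), [])
result = vm.run(code)
```

Here `code` is the four-instruction list `PUSH 9`, `PUSH 16`, `CALL add 2`, `CALL sqrt 1`, and `result` is `5.0`. Compiling once and running the flat instruction list many times is the main attraction over walking the tree on every run.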
The programming language that this interpreter will use should be simple, since it is targeting non-software developers.
I'm going to chime in on this part of your question. A simple language is not what you really want to hand to non-software developers. Stripped-down languages require more effort from the programmer. What you really want is a well-designed and well-implemented Domain Specific Language (DSL).
In this sense I will second what Norman Ramsey recommends with Lua. It has an excellent reputation as a base for high quality DSLs. A well documented and useful DSL takes time and effort, but will save everyone time in the long run when domain experts can be brought up to speed quickly and require minimal support.
I am surprised no one has mentioned Xtext yet. It is available as an Eclipse plugin and an IntelliJ plugin. It provides not just the parser, like ANTLR, but the whole pipeline (including parser, linker, typechecker, compiler) needed for a DSL. You can check its source code on GitHub to understand how an interpreter/compiler works.
