Interpreting custom language

Interpreting custom language - c#

I need to develop an application that will read and understand text file in which I'll find a custom language that describe a list of operations (ie cooking recipe). This language has not been defined yet, but it will probably take one of the following shape :
C++ like code
(This code is randomly generated, just for example purpose) :
begin
repeat(10)
{
bar(toto, 10, 1999, xxx);
}
result = foo(xxxx, 10);
if(foo == ok)
{
...
}
else
{
...
}
end
XML code
(This code is randomly generated, just for example purpose) :
<recipe>
<action name="foo" argument"bar, toto, xxx" repeat=10/>
<action name="bar" argument"xxxxx;10" condition="foo == ok">
<true>...</true>
<false>...</false>
</action>
</recipe>
No matter which language will be chosen, there will have to handle simple conditions, loops.
I never did such a thing but at first sight, it occurs to me that describing those operations into XML would be simplier yet less powerful.
After browsing StackOverFlow, I've found some chats on a tool called "ANTLR"... I started reading "The Definitive ANTLR Reference" but since I never done that kind of stuff, I find it hard to know if it's really the kind of tool I need...
In other words, what do I need to read a text file, interpret it properly and perform actions in my C# code. Those operations will interact between themselves by simple conditions like :
If operation1 failed, I do operation2 else operation3.
Repeat the operation4 10 times.
What would be the best language to do describe those text file (XML, my own) ? What are the key points during such developments ?
I hope I'm being clear :)
Thanks a lot for your help and advices !

XML is great for storing relational data in a verbose way. I think it is a terrible candidate for writing logic such as a program, however.
Have you considered using an existing grammar/scripting language that you can embed, rather than writing your own? E.g:
LUA
Python

In one of my projects I actually started with an XML like language as I already had an XML parser and parsed the XML structure into an expression tree in memory to be interpreted/run.
This works out very nicely to get passed the problem of figuring out tokenizing/parsing of text files and concentrate instead on your 'language' and the logic of the operations in your language. The down side is writing the text files is a little strange and very wordy. Its also very unnatural for a programmer use to C/C++ syntax.
Eventually you could easily replace your XML with a full blown scanner & lexer to parse a more 'natural C++' like text format into your expression tree.
As for writing a scanner & lexer, I found it easier to write these by hand using simple logic flow/loops for the scanner and recursive decent parser for the lexer.
That said, ANTLR is great at letting you write out rules for your language and generating your scanner & lexer for you. This allows for much more dynamic language which can easily change without having to refactor everything again when new things are added. So, it might be worth looking into as learning this as it would save you much time in rewrites as things change if you hand wrote your own.

I'd recommend writing the app in F#. It has many useful features for parsing strings and xmls like Pattern Matching and Active Patterns.
For parsing C-like code I would recommend F# (just did one interpreter with F#, works like a charm)
For parsing XML's I would recommend C#/F# + XmlDocument class.
You basically need to work on two files:
Operator dictionary
Code file in YourLanguage
Load and interpret the operators and then apply them recursively to your code file.

The best prefab answer: S-expressions
C and XML are good first steps. They have sort of opposite disadvantages. The C-like syntax won't add a ton of extra characters, but it's going to be hard to parse due to ambiguity, the variety of tokens, and probably a bunch more issues I can't think of. XML is relatively easy to parse and there's tons of example code, but it will also contain tons of extra text. It might also give you too many options for where to stick language features - for example, is the number of times to repeat a loop an attribute, element or text?
S-expressions are more terse than XML for sure, maybe even C. At the same time, they're specific to the task of applying operations to data. They don't admit ambiguity. Parsers are simple and easy to find example code for.
This might save you from having to learn too much theory before you start experimenting. I'll emphasize MerickOWA's point that ANTLR and other parser generators are probably a bigger battle than you want to fight right now. See this discussion on programmers.stackexchange for some background on when the full generality of this type of tool could help.

Related

What techniques are used to write a parser that switches between languages?

I'm interested in how a parser like the Razor view engine can parse two distinct languages like C# and JavaScript.
It's very cool that the following works, for instance:
$("#fm_duedate").val('#DateTime.Now.AddMonths(1).ToString("MM/dd/yyyy")');
I'm going to try and look at the source but I'm curious if there's a some kind of theoretical foundation for a parser like this or is it more brute force like taking the union of the two languages and parsing that?
Trying to reason it for myself, I say "you start with a parser for each language then you add to each one a set of productions that switch it to the other" but I doubt its so simple.
I guess the perfect answer would be a pointer to discussion on how the Razor engine is implemented or a walk-through of the source (I haven't actually Google'd this for fear of going down a rabbit hole). Alternately, just some insight on how the problem of parsing two languages is approached would be great.

As Corey points out, Razor and similar frameworks do not do anything particularly fancy.
However there are some more theoretically sound models for building parsers for languages where one language is embedded in another. My erstwhile colleague Luke Hoban has a great introductory article on parser combinators, which afford a very nice way to build a parser for one-language-embedded-in-another-language scenarios:
http://blogs.msdn.com/b/lukeh/archive/2007/08/19/monadic-parser-combinators-using-c-3-0.aspx
The wikipedia page is pretty straightforward as well:
http://en.wikipedia.org/wiki/Parser_combinator

Razor (and the other view engines) do not parse the HTML or JavaScript of a view. Instead they parse the text to detect specific tokens, with no real concern about the surrounding text.
In the case of Razor, every # character in the source file is processed as a code block of some sort. Razor is quite smart about detecting the expression that follows the # character, including handling things like #foreach (var x in collection) { and locating the closing } while not trying to parse the HTML (or JavaScript) inside. It also lets you use #{ } and #( ) to override the processing to a degree.
I find the ASPX <%...%> format simpler to read, since I've used that format more and I've got some established pattern recognition going on for those. Having explicit start/finish tokens is simpler to process and simpler to read in-place.

Communication between lexer and parser

Every time I write a simple lexer and parser, I stumble upon the same question: how should the lexer and the parser communicate? I see four different approaches:
The lexer eagerly converts the entire input string into a vector of tokens. Once this is done, the vector is fed to the parser which converts it into a tree. This is by far the simplest solution to implement, but since all tokens are stored in memory, it wastes a lot of space.
Each time the lexer finds a token, it invokes a function on the parser, passing the current token. In my experience, this only works if the parser can naturally be implemented as a state machine like LALR parsers. By contrast, I don't think it would work at all for recursive descent parsers.
Each time the parser needs a token, it asks the lexer for the next one. This is very easy to implement in C# due to the yield keyword, but quite hard in C++ which doesn't have it.
The lexer and parser communicate through an asynchronous queue. This is commonly known under the title "producer/consumer", and it should simplify the communication between the lexer and the parser a lot. Does it also outperform the other solutions on multicores? Or is lexing too trivial?
Is my analysis sound? Are there other approaches I haven't thought of? What is used in real-world compilers? It would be really cool if compiler writers like Eric Lippert could shed some light on this issue.

While I wouldn't classify much of the above as incorrect, I do believe several items are misleading.
Lexing an entire input before running a parser has many advantages over other options. Implementations vary, but in general the memory required for this operation is not a problem, especially when you consider the type of information that you'd like to have available for reporting compilation errors.
Benefits
Potentially more information available during error reporting.
Languages written in a way that allows lexing to occur before parsing are easier to specify and write compilers for.
Drawbacks
Some languages require context-sensitive lexers that simply cannot operate before the parsing phase.
Language implementation note: This is my preferred strategy, as it results in separable code and is best suited for translation to implementing an IDE for the language.
Parser implementation note: I experimented with ANTLR v3 regarding memory overhead with this strategy. The C target uses over 130 bytes per token, and the Java target uses around 44 bytes per token. With a modified C# target, I showed it's possible to fully represent the tokenized input with only 8 bytes per token, making this strategy practical for even quite large source files.
Language design note: I encourage people designing a new language to do so in a way that allows this parsing strategy, whether or not they end up choosing it for their reference compiler.
It appears you've described a "push" version of what I generally see described as a "pull" parser like you have in #3. My work emphasis has always been on LL parsing, so this wasn't really an option for me. I would be surprised if there are benefits to this over #3, but cannot rule them out.
The most misleading part of this is the statement about C++. Proper use of iterators in C++ makes it exceptionally well suited to this type of behavior.
A queue seems like a rehash of #3 with a middleman. While abstracting independent operations has many advantages in areas like modular software development, a lexer/parser pair for a distributable product offering is highly performance-sensitive, and this type of abstraction removes the ability to do certain types of optimization regarding data structure and memory layout. I would encourage the use of option #3 over this.
As an additional note on multi-core parsing: The initial lexer/parser phases of compilation for a single compilation unit generally cannot be parallelized, nor do they need to be considering how easy it is to simply run parallel compilation tasks on different compilation units (e.g. one lexer/parser operation on each source file, parallelizing across the source files but only using a single thread for any given file).
Regarding other options: For a compiler intended for widespread use (commercial or otherwise), generally implementers choose a parsing strategy and implementation which provides the best performance under the constraints of the target language. Some languages (e.g. Go) can be parsed exceptionally quickly with a simple LR parsing strategy, and using a "more powerful" parsing strategy (read: unnecessary features) would only serve to slow things down. Other languages (e.g. C++) are extremely challenging or impossible to parse with typical algorithms, so slower but more powerful/flexible parsers are employed.

I think there is no golden rule here. Requirements may vary from one case to another. So, reasonable solutions can be different also. Let me comment on your options from my own experience.
"Vector of tokens". This solution may have big memory footprint. Imagine compiling source file with a lot of headers. Storing the token itself is not enough. Error message should contain context with the file name and the line number. It may happen that lexer depends on the parser. Reasonable example: ">>" - is this a shift operator or this is closing of 2 layers of template instantiations? I would not recommend this option.
(2,3). "One part calls another". My impression is that more complex system should call less complex one. I consider lexer to be more simple. This means parser should call lexer. I do not see why C# is better than C++. I implemented C/C++ lexer as a subroutine (in reality this is a complex class) that is called from the grammar based parser. There were no problems in this implementation.
"Communicating processes". This seems to me an overkill. There is nothing wrong in this approach, but maybe it is better to keep the things simple? Multicore aspect. Compiling single file is a relatively rare case. I would recommend to load each core with its own file.
I do not see other reasonable options of combiming lexer and parser together.
I wrote these notes thinking about compiling sources of the software project. Parsing a short query request is completely different thing, and reasons can significantly differ.
My answer is based on my own experience. Other people may see this differently.

The lexer-parser relationship is simpler than the most general case of coroutines, because in general the communication is one-way; the parser does not need to send information back to the lexer. This is why the method of eager generation works (with some penalty, although it does mean that you can discard the input earlier).
As you've observed, if either the lexer or the parser can be written in a reinvocable style then the other can be treated as a simple subroutine. This can always be implemented as a source code transformation, with local variables translated to object slots.
Although C++ doesn't have language support for coroutines, it is possible to make use of library support, in particular fibers. The Unix setcontext family is one option; another is to use multithreading but with a synchronous queue (essentially single-threading but switching between two threads of control).

Also consider for #1 that you lex tokens that don't need it, for example if there is an error, and in addition, you may run low on memory or I/O bandwidth. I believe that the best solution is that employed by parsers generated by tools like Bison, where the parser calls the lexer to get the next token. Minimizes space requirements and memory bandwidth requirements.
#4 is just not going to be worth it. Lexing and parsing are inherently synchronous- there's just not enough processing going on to justify the costs of communication. In addition, typically you parse/lex multiple files concurrently- this can already max out all your cores at once.

The way I handle it in my toy buildsystem project in progress is by having a "file reader" class, with a function bool next_token(std::string&,const std::set<char>&). This class contains one line of input (for error reporting purposes with line number). The function accepts a std::string reference to put the token in, and a std::set<char> which contains the "token-ending" characters. My input class is both parser and lexer, but you could easily split it up if you need more fanciness. So the parsing functions just call next_token and can do their thing, including very detailed error output.
If you need to keep the verbatim input, you'll need to store each line that's read in a vector<string> or something, but not store each token seperately and/or double-y.
The code I'm talking about is located here:
https://github.com/rubenvb/Ambrosia/blob/master/libAmbrosia/Source/nectar_loader.cpp
(search for ::next_token and the extract_nectar function is where is all begins)

Best/fastest way to write a parser in c#

What is the best way to build a parser in c# to parse my own language?
Ideally I'd like to provide a grammar, and get Abstract Syntax Trees as an output.
Many thanks,
Nestor

I've had good experience with ANTLR v3. By far the biggest benefit is that it lets you write LL(*) parsers with infinite lookahead - these can be quite suboptimal, but the grammar can be written in the most straightforward and natural way with no need to refactor to work around parser limitations, and parser performance is often not a big deal (I hope you aren't writing a C++ compiler), especially in learning projects.
It also provides pretty good means of constructing meaningful ASTs without need to write any code - for every grammar production, you indicate the "crucial" token or sub-production, and that becomes a tree node. Or you can write a tree production.
Have a look at the following ANTLR grammars (listed here in order of increasing complexity) to get a gist of how it looks and feels
JSON grammar - with tree productions
Lua grammar
C grammar

I've played wtih Irony. It looks simple and useful.

You could study the source code for the Mono C# compiler.

While it is still in early beta the Oslo Modeling language and MGrammar tools from Microsoft are showing some promise.

I would also take a look at SableCC. Its very easy to create the EBNF grammer. Here is a simple C# calculator example.

There's a short paper here on constructing an LL(1) parser here, of course you could use a generator too.

Lex and yacc are still my favorites. Obscure if you're just starting out, but extremely simple, fast, and easy once you've got the lingo down.
You can make it do whatever you want; generate C# code, build other grammars, emulate instructions, whatever.
It's not pretty, it's a text based format and LL1, so your syntax has to accomodate that.
On the plus side, it's everywhere. There are great O'reilly books about it, lots of sample code, lots of premade grammars, and lots of native language libraries.

Using reflection for code gen?

I'm writing a console tool to generate some C# code for objects in a class library. The best/easiest way I can actual generate the code is to use reflection after the library has been built. It works great, but this seems like a haphazard approch at best. Since the generated code will be compiled with the library, after making a change I'll need to build the solution twice to get the final result, etc. Some of these issues could be mitigated with a build script, but it still feels like a bit too much of a hack to me.
My question is, are there any high-level best practices for this sort of thing?

Its pretty unclear what you are doing, but what does seem clear is that you have some base line code, and based on some its properties, you want to generate more code.
So the key issue here are, given the base line code, how do you extract interesting properties, and how do you generate code from those properties?
Reflection is a way to extract properties of code running (well, at least loaded) into the same execution enviroment as the reflection user code. The problem with reflection is it only provides a very limited set of properties, typically lists of classes, methods, or perhaps names of arguments. IF all the code generation you want to do can be done with just that, well, then reflection seems just fine. But if you want more detailed properties about the code, reflection won't cut it.
In fact, the only artifact from which truly arbitrary code properties can be extracted is the the source code as a character string (how else could you answer, is the number of characters between the add operator and T in middle of the variable name is a prime number?). As a practical matter, properties you can get from character strings are generally not very helpful (see the example I just gave :).
The compiler guys have spent the last 60 years figuring out how to extract interesting program properties and you'd be a complete idiot to ignore what they've learned in that half century.
They have settled on a number of relatively standard "compiler data structures": abstract syntax trees (ASTs), symbol tables (STs), control flow graphs (CFGs), data flow facts (DFFs), program triples, ponter analyses, etc.
If you want to analyze or generate code, your best bet is to process it first into such standard compiler data structures and then do the job. If you have ASTs, you can answer all kinds of question about what operators and operands are used. If you have STs, you can answer questions about where-defined, where-visible and what-type. If you have CFGs, you can answer questions about "this-before-that", "what conditions does statement X depend upon". If you have DFFs, you can determine which assignments affect the actions at a point in the code. Reflection will never provide this IMHO, because it will always be limited to what the runtime system developers are willing to keep around when running a program. (Maybe someday they'll keep all the compiler data structures around, but then it won't be reflection; it will just finally be compiler support).
Now, after you have determined the properties of interest, what do you do for code generation? Here the compiler guys have been so focused on generation of machine code that they don't offer standard answers. The guys that do are the program transformation community (http://en.wikipedia.org/wiki/Program_transformation). Here the idea is to keep at least one representation of your program as ASTs, and to provide special support for matching source code syntax (by constructing pattern-match ASTs from the code fragments of interest), and provide "rewrite" rules that say in effect, "when you see this pattern, then replace it by that pattern under this condition".
By connecting the condition to various property-extracting mechanisms from the compiler guys, you get relatively easy way to say what you want backed up by that 50 years of experience. Such program transformation systems have the ability to read in source code,
carry out analysis and transformations, and generally to regenerate code after transformation.
For your code generation task, you'd read in the base line code into ASTs, apply analyses to determine properties of interesting, use transformations to generate new ASTs, and then spit out the answer.
For such a system to be useful, it also has to be able to parse and prettyprint a wide variety of source code langauges, so that folks other than C# lovers can also have the benefits of code analysis and generation.
These ideas are all reified in the
DMS Software Reengineering Toolkit. DMS handles C, C++, C#, Java, COBOL, JavaScript, PHP, Verilog, ... and a lot of other langauges.
(I'm the architect of DMS, so I have a rather biased view. YMMV).

Have you considered using T4 templates for performing the code generation? It looks like it's getting much more publicity and attention now and more support in VS2010.
This tutorial seems database centric but it may give you some pointers: http://www.olegsych.com/2008/09/t4-tutorial-creatating-your-first-code-generator/ in addition there was a recent Hanselminutes on T4 here: http://www.hanselminutes.com/default.aspx?showID=170.
Edit: Another great place is the T4 tag here on StackOverflow: https://stackoverflow.com/questions/tagged/t4
EDIT: (By asker, new developments)
As of VS2012, T4 now supports reflection over an active project in a single step. This means you can make a change to your code, and the compiled output of the T4 template will reflect the newest version, without requiring you to perform a second reflect/build step. With this capability, I'm marking this as the accepted answer.

You may wish to use CodeDom, so that you only have to build once.
First, I would read this CodeProject article to make sure there are not language-specific features you'd be unable to support without using Reflection.

From what I understand, you could use something like Common Compiler Infrastructure (http://ccimetadata.codeplex.com/) to programatically analyze your existing c# source.
This looks pretty involved to me though, and CCI apparently only has full support for C# language spec 2. A better strategy may be to streamline your existing method instead.

I'm not sure of the best way to do this, but you could do this
As a post-build step on your base dll, run the code generator
As another post-build step, run csc or msbuild to build the generated dll
Other things which depend on the generated dll will also need to depend on the base dll, so the build order remains correct

artificial intelligence - Creative Writing

I am trying to find information (and hopefully c# source code) about trying to create a basic AI tool that can understand english words, grammar and context.
The Idea is to train the AI by using as many written documents as possible and then based on these documents, for the AI to create its own creative writitng in proper english that makes sense to a human.
While the idea is simple, I do realise that the hurdles are huge, any starting points or good resoueces will be appriacted.

A basic AI tool that you can use to do something like this is a Markov Chain. It's actually not too tricky to write!
See: http://pscode.com/vb/scripts/ShowCode.asp?txtCodeId=2031&lngWId=10
If that's not enough, you might be able to store WordNet synsets in your Markov chain instead of just words. This gives you some sense of the meaning of the words.

To be able to recompose a document you are going to have to have away to filter through the bad results.
Which means:
You are going to have to write a program that can evaluate if the output is valid (grammatically and syntactically is the best you can do reliablily) (This would would NLP)
You would need lots of training data and test data
You would need to watch out for overtraining (take a look at ROC curves)
Instead of writing a tool you could:
Manually score the output (will take a long time to properly train the algorigthm)
With this using the Amazon Mechanical Turk might be a good idea
The irony of this: The computer would have a difficult time "Creatively" composing something new. All of its worth will be based on its previous experiences [training data]

Some good references and reading at this Natural Language article.

As others said, Markov chain seems to be most suitable for such a task. Nice description of implementing Markov chain can be found in Kernighan & Pike, The Practice of Programming, section 3.1. Nice description of text-generating is also present in Programming Pearls.

One thing, though not quite what you need, would be a Markov chain of words. Here's a link I found by a quick search: http://blog.figmentengine.com/2008/10/markov-chain-code.html, but you can find much more information by searching for it.

Take a look at http://www.nltk.org/ (Natural Language Toolkit), lots of powerful tools there. They use Python (not C#) but Python is easy enough to pick up. Much easier to pick up than the breadth and depth of natural language processing, at least.

I agree, that you will have troubles in creating something creative. You could possibly also use a keyword spinner on certain words. You might also want to implement a stop word filter to remove anything colloquial.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.