Coco/R vs. ANTLR - c#

I'm evaluating using Coco/R vs. ANTLR for use in a C# project as part of what's essentially a scriptable mail-merge functionality. To parse the (simple) scripts, I'll need a parser.
I've focussed on Coco/R and ANTLR because both seem fairly mature and well-maintained and capable of generating decent C# parsers.
Neither seems to be trivial to use, however, and simplicity is something I'd appreciate - particularly maintainability by others.
Does anyone have any recommendations to make? What are the pros/cons of either for parsing a small language - or am I looking into the wrong things entirely? How well do these integrate into a typical continuous integration setup? What are the pitfalls?
Related: Well, many questions, such as 1, 2, 3, 4, 5.

We have used Coco for 2 years, having replaced the ANTLR parser we were formerly using. For a typical big-data query (our application), our experience has been this. Caveat: We are dependent upon full UTF-8 handling, with the parser implemented in C++. These numbers are for a language that has some 200 EBNF productions.
ANTLR: 260 usecs/query and a 108 MEGABYTE memory footprint for the generated parser/lexer
Coco: 220 usecs/query and a 70 KBYTE memory footprint for the parser/scanner
Initially, Coco had a 1.2 msec startup time and generated several 60 KBYTE tables for mapping UTF-8. We have made many local enhancements to Coco: we eliminated the big tables, eliminated the 1.2 msec startup time, and hugely enhanced the internal documentation (as well as the documentation in the generated code).
Our version of (open source) Coco has a tiny footprint compared to ANTLR, is very measurably faster, has no startup delay and just... works. It does not have ANTLR's nice UI, but that never struck us as an issue once we started using Coco.

ANTLR is LL(*), which is as powerful as PEG, though usually much more efficient and flexible. LL(*) degenerates to LL(k) for k>=1 when arbitrary lookahead is not necessary.

If you're simply merging data into a complicated template, consider Terence Parr's StringTemplate engine. He's the man behind ANTLR. StringTemplate may be better suited and easier to use than a full parser generator. It's a very feature-rich template engine.
There is a C# port available in the downloads.
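To make the "template engine instead of parser generator" idea concrete, here is a minimal sketch of a mail-merge. It uses Python's standard `string.Template` purely as a stand-in; it is not StringTemplate and has none of its features, and the field names (`first_name`, `order_id`, `ship_date`) are made up for illustration.

```python
from string import Template

# Hypothetical letter template; $name placeholders are filled per record.
LETTER = Template("Dear $first_name,\nyour order $order_id ships on $ship_date.")


def merge(template, records):
    """Return one merged document per record (each record is a dict)."""
    # substitute() raises KeyError if a record lacks a placeholder;
    # safe_substitute() would leave the placeholder in place instead.
    return [template.substitute(record) for record in records]


if __name__ == "__main__":
    rows = [{"first_name": "Ada", "order_id": "42", "ship_date": "Friday"}]
    for letter in merge(LETTER, rows):
        print(letter)
```

If the scripts never need more than placeholder substitution, something of this shape may be all the "parser" that is required.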

Basically, Coco/R generates recursive descent parsers and only supports LL(1) grammars, whereas ANTLR uses back-tracking (among other techniques), which allows it to handle more complex grammars. Coco/R parsers are much more light-weight and easier to understand and deploy, but sometimes it's a struggle getting the grammar into a form that Coco/R understands given its one look-ahead constraint - for many common programming language grammars (e.g. C++, SQL), it's not possible at all.
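For readers who have not seen one, this is roughly what a hand-written LL(1) recursive descent parser looks like. It is only a sketch in Python, not Coco/R output: the toy grammar (integers, `+`, `-`, `*`, parentheses) and all names are assumptions chosen for illustration. The point is that every decision is made by looking at exactly one token, `self.look`.

```python
import re

TOKEN_RE = re.compile(r"\s*(?:(\d+)|(\S))")


def tokenize(text):
    """Yield ('num', int) and ('op', char) tokens, then a final ('end', None)."""
    for number, symbol in TOKEN_RE.findall(text):
        yield ("num", int(number)) if number else ("op", symbol)
    yield ("end", None)


class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.look = next(self.tokens)  # the single lookahead token

    def advance(self):
        self.look = next(self.tokens)

    def expr(self):  # expr = term { ("+" | "-") term }
        value = self.term()
        while self.look in (("op", "+"), ("op", "-")):
            op = self.look[1]
            self.advance()
            rhs = self.term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term(self):  # term = factor { "*" factor }
        value = self.factor()
        while self.look == ("op", "*"):
            self.advance()
            value *= self.factor()
        return value

    def factor(self):  # factor = number | "(" expr ")"
        kind, val = self.look
        if kind == "num":
            self.advance()
            return val
        if self.look == ("op", "("):
            self.advance()
            value = self.expr()
            if self.look != ("op", ")"):
                raise SyntaxError("expected ')'")
            self.advance()
            return value
        raise SyntaxError(f"unexpected token {self.look!r}")


def evaluate(text):
    parser = Parser(text)
    value = parser.expr()
    if parser.look != ("end", None):
        raise SyntaxError(f"trailing input at {parser.look!r}")
    return value
```

Each grammar rule maps to one method, which is why such parsers are easy to read and debug. The struggle mentioned above begins when two alternatives start with the same token and one token of lookahead can no longer pick between them.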

Communication between lexer and parser

Every time I write a simple lexer and parser, I stumble upon the same question: how should the lexer and the parser communicate? I see four different approaches:
1) The lexer eagerly converts the entire input string into a vector of tokens. Once this is done, the vector is fed to the parser which converts it into a tree. This is by far the simplest solution to implement, but since all tokens are stored in memory, it wastes a lot of space.
2) Each time the lexer finds a token, it invokes a function on the parser, passing the current token. In my experience, this only works if the parser can naturally be implemented as a state machine like LALR parsers. By contrast, I don't think it would work at all for recursive descent parsers.
3) Each time the parser needs a token, it asks the lexer for the next one. This is very easy to implement in C# due to the yield keyword, but quite hard in C++ which doesn't have it.
4) The lexer and parser communicate through an asynchronous queue. This is commonly known under the title "producer/consumer", and it should simplify the communication between the lexer and the parser a lot. Does it also outperform the other solutions on multicores? Or is lexing too trivial?
Is my analysis sound? Are there other approaches I haven't thought of? What is used in real-world compilers? It would be really cool if compiler writers like Eric Lippert could shed some light on this issue.
While I wouldn't classify much of the above as incorrect, I do believe several items are misleading.
Lexing an entire input before running a parser has many advantages over other options. Implementations vary, but in general the memory required for this operation is not a problem, especially when you consider the type of information that you'd like to have available for reporting compilation errors.
Benefits
Potentially more information available during error reporting.
Languages written in a way that allows lexing to occur before parsing are easier to specify and write compilers for.
Drawbacks
Some languages require context-sensitive lexers that simply cannot operate before the parsing phase.
Language implementation note: This is my preferred strategy, as it results in separable code and is best suited to later implementing an IDE for the language.
Parser implementation note: I experimented with ANTLR v3 regarding memory overhead with this strategy. The C target uses over 130 bytes per token, and the Java target uses around 44 bytes per token. With a modified C# target, I showed it's possible to fully represent the tokenized input with only 8 bytes per token, making this strategy practical for even quite large source files.
Language design note: I encourage people designing a new language to do so in a way that allows this parsing strategy, whether or not they end up choosing it for their reference compiler.
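To give a feel for how a fully tokenized input can fit in 8 bytes per token, here is one possible layout, sketched in Python. This is an assumption for illustration only, not the modified C# target described above: each token is packed as a 16-bit kind, a 16-bit length and a 32-bit start offset, and the token text is recovered from the original source on demand. The token kinds and the regular expression are hypothetical.

```python
import re
import struct

# Assumed record layout: kind (uint16), length (uint16), start offset (uint32).
TOKEN = struct.Struct("<HHI")  # exactly 8 bytes, no padding

KINDS = ["ident", "number", "punct"]
LEXER = re.compile(r"(?P<ident>[A-Za-z_]\w*)|(?P<number>\d+)|(?P<punct>\S)")


def lex_packed(source):
    """Lex the whole input eagerly into a flat byte string of 8-byte records."""
    out = bytearray()
    for match in LEXER.finditer(source):  # whitespace matches nothing and is skipped
        kind = KINDS.index(match.lastgroup)
        out += TOKEN.pack(kind, match.end() - match.start(), match.start())
    return bytes(out)


def token_at(source, packed, index):
    """Decode token number `index` back into (kind name, text)."""
    kind, length, start = TOKEN.unpack_from(packed, index * TOKEN.size)
    return KINDS[kind], source[start:start + length]
```

Line and column numbers for error messages are not stored; they can be recomputed from the start offset when an error is actually reported. This particular layout caps a token at 65,535 characters and a file at 4 GiB.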
It appears you've described a "push" version of what I generally see described as a "pull" parser like you have in #3. My work emphasis has always been on LL parsing, so this wasn't really an option for me. I would be surprised if there are benefits to this over #3, but cannot rule them out.
The most misleading part of this is the statement about C++. Proper use of iterators in C++ makes it exceptionally well suited to this type of behavior.
A queue seems like a rehash of #3 with a middleman. While abstracting independent operations has many advantages in areas like modular software development, a lexer/parser pair for a distributable product offering is highly performance-sensitive, and this type of abstraction removes the ability to do certain types of optimization regarding data structure and memory layout. I would encourage the use of option #3 over this.
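The pull arrangement of option #3 can be sketched with a Python generator standing in for C#'s yield or a C++ iterator. The whitespace-separated `name = value` language and the `log` list are assumptions made up for the demonstration; the log only exists to show that the lexer does no work beyond what the parser asks for.

```python
def lex(text, log):
    """Lazily yield (line number, word) tokens, recording each one produced."""
    for lineno, line in enumerate(text.splitlines(), 1):
        for word in line.split():
            log.append(word)
            yield (lineno, word)


def parse_assignments(tokens):
    """Parse a sequence of 'name = value' statements, pulling tokens on demand."""
    result = {}
    for lineno, name in tokens:
        _, equals = next(tokens, (lineno, None))
        _, value = next(tokens, (lineno, None))
        if equals != "=" or value is None:
            raise SyntaxError(f"line {lineno}: expected 'name = value'")
        result[name] = value
    return result
```

With the input `"a = 1\nb 2 c\nd = 4"` the parser fails on the second statement, and the log shows that nothing after `c` was ever lexed. An eager vector of tokens would have lexed the whole input first.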
As an additional note on multi-core parsing: The initial lexer/parser phases of compilation for a single compilation unit generally cannot be parallelized, nor do they need to be considering how easy it is to simply run parallel compilation tasks on different compilation units (e.g. one lexer/parser operation on each source file, parallelizing across the source files but only using a single thread for any given file).
Regarding other options: For a compiler intended for widespread use (commercial or otherwise), generally implementers choose a parsing strategy and implementation which provides the best performance under the constraints of the target language. Some languages (e.g. Go) can be parsed exceptionally quickly with a simple LR parsing strategy, and using a "more powerful" parsing strategy (read: unnecessary features) would only serve to slow things down. Other languages (e.g. C++) are extremely challenging or impossible to parse with typical algorithms, so slower but more powerful/flexible parsers are employed.
I think there is no golden rule here. Requirements may vary from one case to another. So, reasonable solutions can be different also. Let me comment on your options from my own experience.
"Vector of tokens". This solution may have big memory footprint. Imagine compiling source file with a lot of headers. Storing the token itself is not enough. Error message should contain context with the file name and the line number. It may happen that lexer depends on the parser. Reasonable example: ">>" - is this a shift operator or this is closing of 2 layers of template instantiations? I would not recommend this option.
(2,3). "One part calls another". My impression is that the more complex system should call the less complex one. I consider the lexer to be the simpler of the two, so the parser should call the lexer. I do not see why C# is better than C++ here. I implemented a C/C++ lexer as a subroutine (in reality it is a complex class) that is called from the grammar-based parser. There were no problems with this implementation.
"Communicating processes". This seems to me an overkill. There is nothing wrong in this approach, but maybe it is better to keep the things simple? Multicore aspect. Compiling single file is a relatively rare case. I would recommend to load each core with its own file.
I do not see other reasonable options for combining the lexer and parser.
I wrote these notes thinking about compiling sources of the software project. Parsing a short query request is completely different thing, and reasons can significantly differ.
My answer is based on my own experience. Other people may see this differently.
The lexer-parser relationship is simpler than the most general case of coroutines, because in general the communication is one-way; the parser does not need to send information back to the lexer. This is why the method of eager generation works (with some penalty, although it does mean that you can discard the input earlier).
As you've observed, if either the lexer or the parser can be written in a reinvocable style then the other can be treated as a simple subroutine. This can always be implemented as a source code transformation, with local variables translated to object slots.
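Here is a small sketch of that transformation applied to the parser side, in Python. The toy `name = value` language and every name in the code are assumptions for illustration. What would be the "where am I in the grammar" position and the local variables of a recursive routine become slots on an object (`state`, `name`), so the lexer can drive the parser by calling `feed` once per token.

```python
class PushParser:
    """Reinvocable parser for 'name = value' statements, fed one token at a time."""

    def __init__(self):
        self.state = "name"   # which token kind is expected next
        self.name = None      # formerly a local variable, now an object slot
        self.result = {}

    def feed(self, token):
        if self.state == "name":
            self.name = token
            self.state = "equals"
        elif self.state == "equals":
            if token != "=":
                raise SyntaxError(f"expected '=' after {self.name!r}, got {token!r}")
            self.state = "value"
        else:  # self.state == "value"
            self.result[self.name] = token
            self.state = "name"

    def finish(self):
        if self.state != "name":
            raise SyntaxError("unexpected end of input")
        return self.result


def lex_and_push(text, parser):
    """The lexer is now the plain subroutine: it loops and pushes tokens."""
    for token in text.split():
        parser.feed(token)
    return parser.finish()
```

For a grammar with nesting, the single `state` slot would have to grow into an explicit stack, which is essentially what a table-driven LR parser maintains.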
Although C++ doesn't have language support for coroutines, it is possible to make use of library support, in particular fibers. The Unix setcontext family is one option; another is to use multithreading but with a synchronous queue (essentially single-threading but switching between two threads of control).
Also consider for #1 that you lex tokens that may never be needed (for example, if there is an error) and, in addition, you may run low on memory or I/O bandwidth. I believe that the best solution is the one employed by parsers generated by tools like Bison, where the parser calls the lexer to get the next token. This minimizes space requirements and memory bandwidth requirements.
#4 is just not going to be worth it. Lexing and parsing are inherently synchronous- there's just not enough processing going on to justify the costs of communication. In addition, typically you parse/lex multiple files concurrently- this can already max out all your cores at once.
The way I handle it in my toy buildsystem project in progress is by having a "file reader" class, with a function bool next_token(std::string&,const std::set<char>&). This class contains one line of input (for error reporting purposes with line number). The function accepts a std::string reference to put the token in, and a std::set<char> which contains the "token-ending" characters. My input class is both parser and lexer, but you could easily split it up if you need more fanciness. So the parsing functions just call next_token and can do their thing, including very detailed error output.
If you need to keep the verbatim input, you'll need to store each line that's read in a vector<string> or something, but not store each token separately and/or twice.
The code I'm talking about is located here:
https://github.com/rubenvb/Ambrosia/blob/master/libAmbrosia/Source/nectar_loader.cpp
(search for ::next_token; the extract_nectar function is where it all begins)
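The shape of that interface can be sketched in Python as follows. This is not the Ambrosia code; the exact semantics are assumptions: whitespace always ends a token, and any character in `enders` both ends the current token and is returned as a token of its own. Returning `None` plays the role of the C++ function returning false.

```python
class LineReader:
    """Holds one line of input at a time and hands out tokens on request."""

    def __init__(self, lines):
        self.lines = iter(lines)
        self.line = ""
        self.line_number = 0  # available to callers for error messages
        self.pos = 0

    def next_token(self, enders=frozenset()):
        # Skip whitespace, reading further lines as needed.
        while True:
            while self.pos < len(self.line) and self.line[self.pos].isspace():
                self.pos += 1
            if self.pos < len(self.line):
                break
            following = next(self.lines, None)
            if following is None:
                return None  # end of input
            self.line, self.pos = following, 0
            self.line_number += 1
        start = self.pos
        if self.line[start] in enders:  # a token-ending character stands alone
            self.pos += 1
            return self.line[start]
        while (self.pos < len(self.line)
               and not self.line[self.pos].isspace()
               and self.line[self.pos] not in enders):
            self.pos += 1
        return self.line[start:self.pos]
```

Because only the current line is kept, memory use stays flat regardless of file size, while `line_number` and `line` are still at hand when a parsing function needs to print a detailed error.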

What Python features will excite the interest of a C# developer? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 2 years ago.
For someone who’s been happily programming in C# for quite some time now and planning to learn a new language I find the Python community more closely knit than many others.
Personally dynamic typing puts me off, but I am fascinated by the way the Python community rallies around it. There are a lot of other things I expect I would miss in Python (LINQ, expression trees, etc.)
What are the good things about Python that developers love? Stuff that’ll excite me more than C#.
For me it's the flexibility and elegance, but there are a handful of things I wish could be pulled in from other languages (better threading, more robust expressions).
Typically I can write a little bit of code in Python and do a lot more than with the same number of lines in many other languages. Also, in Python code form is of utmost importance, and the syntax lends itself to highly readable, clean looking code. That of course helps out with maintenance.
I love having a command line interpreter that I can quickly prototype an algorithm in rather than having to start up a new project, code, compile, test, repeat. Not to mention the fact I can use it to help me automate my server maintenance as well (I double as a SA for my company).
The last thing that comes to mind immediately is the vast amounts of libraries. There are a lot of things already solved out there, the built-in library has a lot to offer, and the third party ones are many times very good (not always though).
Being able to type in some code and get the result back immediately.
(Disclaimer: I use both C# and Python regularly, and I think both have their good and bad points.)
I'm primarily a .NET developer and use Python for my personal projects.
What are the good things about python that developers love?
I can say for myself - Python is like a breath of fresh air.
1) It's simple to learn; it took me about a week of evenings. I'm talking about Python + Django. Python syntax is quite simple.
2) It's simple to use. No troubles installing Python + Django on Windows at all.
3) It can be run on Windows and UNIX.
4) I need it for web, so I get cheaper hosting than ASP.NET.
5) All the advantages of Python language over C#. Like tuples - so useful!
The only thing I don't like is that my favorite IDE Visual Studio doesn't support it (I know about IronPython, don't you worry).
I'm a very heavy user of both C# and Python; I've built very complicated applications in both languages, and I've also embedded Python scripting in my major C# application. I'm not using either to do much in the way of web work right now, but other than that I feel like I'm pretty qualified to answer the question.
The things about Python that excite me, in particular:
The deep integration of generators into the language. This was the first thing that made me realize that I needed to take a long, serious look at Python. My appreciation for this has deepened considerably since I've become conversant with the itertools module, which looks like a nifty set of tools but is in fact a new way of life.
The coupling of dynamic typing and the fact that everything's an object makes pretty sophisticated techniques extremely simple to implement. It's so easy to replace logic with tables in Python (e.g. o = class_map[k]() instead of if k == 'foo': o = Foo()) that it becomes a basic technique. It's so normal in Python to write methods that take methods as parameters that you don't raise an eyebrow when you see d = defaultdict(list).
zip, and the methods that are designed with it in mind. It takes a while before you can intuitively grasp what dict(zip(k, v)) and d.update(zip(k, v)) are doing, but it's a paradigm-shifting moment when you get there. An entire universe of uninteresting and potentially error-laden code eliminated, just by using one function. Then you start designing functions and classes with the expectation that they'll be used in conjunction with zip, and suddenly your code gets simpler and easier. (Protip: Or itertools.izip. Or itertools.izip_longest.)
Speaking of dictionaries, the way that they're deeply integrated into the language. Understanding what a line of code like self.__dict__.update(**kwargs) does is another one of those paradigm-shifting moments.
List comprehensions and generator expressions, of course.
Inexpensive exceptions.
An interactive interpreter.
Function decorators.
IronPython, which is so much simpler to use than we have any right to expect.
And that's without even getting into the remarkable array of functionality in the standard modules, or the ridiculous bounty of third-party tools like BeautifulSoup or SQL Alchemy or Pylons.
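A few of the idioms from the list above, written out as a short sketch. The class and function names (`Foo`, `Bar`, `make`, `group_by_length`, `pair_up`, `Record`) are made up for illustration. Note that in Python 3 the built-in `zip` is already lazy and `itertools.izip_longest` is spelled `itertools.zip_longest`.

```python
from collections import defaultdict


class Foo:
    pass


class Bar:
    pass


# A table instead of an if/elif chain: classes are objects, so they can be dict values.
class_map = {"foo": Foo, "bar": Bar}


def make(kind):
    return class_map[kind]()


def group_by_length(words):
    # defaultdict takes a callable (here the list type) and calls it for missing keys.
    groups = defaultdict(list)
    for word in words:
        groups[len(word)].append(word)
    return dict(groups)


def pair_up(keys, values):
    # zip pairs the two sequences element by element; dict consumes the pairs.
    return dict(zip(keys, values))


class Record:
    def __init__(self, **kwargs):
        # Every keyword argument becomes an attribute of the instance.
        self.__dict__.update(**kwargs)
```

For example, `Record(x=1, y=2).y` evaluates to `2`, and `pair_up("ab", [1, 2])` gives `{'a': 1, 'b': 2}`.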
One of the most direct benefits that I've gotten from getting deeply into Python is that it has greatly improved my C# code. I could generally understand code that had a variable of type Dictionary<string, Action<Foo>> in it, but it didn't seem natural to write it. (I use static dictionaries to replace hard-coded logic far more frequently today than I did a year ago.) I have no difficulty understanding what LINQ is doing now, or how IEnumerable<T> and yield return work.
So what don't I like about Python?
Dynamic typing really limits what you can do with static code analysis. Not only isn't there a tool like Resharper for Python, in a language where it's possible to write getattr(x, y)() there really can't be.
It has a bunch of inelegant conventions. How I would love to be able to go back in time and try to talk GVR out of the idea that lambda expressions should be introduced with the word lambda - it's pretty damning that something as fundamental as lambda expressions should be more concise in C# than they are in Python. The leading and trailing double-underscore convention is horrible, and the fact that people mutely acquiesce to it is testimony to Dostoevsky's observation that man is the animal who can get used to anything. And don't get me started on the fact that a module with the name of StringIO was allowed to get out the door.
Some of the features that make Python work on multiple platforms also make it kind of baffling. It's easy to use import, but it's really not easy to understand what the hell it's actually doing. (Where is it looking? What does __init__.py do? Etc.)
The amazingly rich library of standard modules is so amazingly rich that it's hard to know what's in it. It's often easier to write a function than it is to find out whether or not there's something in the standard library that does the same thing - I'm looking at you, itertools.chain.
Your question is kind of like a plumber asking why carpenters are always going on and on about hammers. After all the plumber doesn't have a hammer and has never missed it. Python (even IronPython) and C# target different types of developers and different types of programs. I am very comfortable in Python and enjoy the freedom to focus on the business rules without being distracted by the syntax requirements of the language. On the other hand I have written some fairly substantial code in C# and would be very concerned about the lack of type safety had I taken on the same task in Python. This is not to say that Python is a "toy" language. You can (and people have) write a complete medium or large application in Python. You have the freedom of dynamic typing, but you also have the responsibility to keep it all straight (frameworks help here). Similarly you can write a small application in C#, but you will bring along some overhead you do not likely need.
So if the problem is a nail use a hammer, if the problem is a screw use a screwdriver. In other words, spend some time to learn Python, get to know its strengths (text processing, quick coding cycles, simple clean code, etc.) and then when you are looking at tackling a new problem ask whether you would be better off in Python or C#. One thing is certain. So long as C# is the only programming language you know, it is the only one you will ever use.
Pat O
My language of choice is C#, and I didn't quite see the point for me to learn Python so far. This talk from PDC09 really piqued my interest: the guy demonstrates how you can use IronPython (or IronRuby) to make a C# app scriptable (in his demo, drop a Python script in a text box, and it works with/extends your C# code). I found this really fascinating: I don't even know where I would start to do something similar in C#, and this made me at least appreciate that it brings something different to the table, which could really enrich what I can develop!
I'm an asymmetrical user of both languages, in a sense that I use C# mostly professionally and Python for all my "fun" projects (not that work is never fun, but... you know...)
This difference of context may skew my perspective, including my opinion that they are two distinct types (pun intended) of languages for, generally, distinct purposes.
This said, it may not be a coincidence that Python is, at this point in time, [one of?] the languages of choice for all kinds of cutting edge, somewhat scholarly, technology/science oriented projects. (And BTW, the "scholarly" keyword here does NOT imply that Python is a university toy; plenty of "serious" applications in plenty of domains/industries are proof to the contrary.) This may be due to several factors:
(I don't develop most points, readily well expressed in other responses)
the openness and quasi universal availability of Python (unlike C# !)
the lightweight / ease of use / low learning curve
the extensive, high quality, "standard" library and the even more extensive (and occasionally bum quality, but on the whole available, open-sourced, etc.) additional libraries.
the wide array of open source projects in the Python language
the relative ease of binding with C/C++ for reusing legacy code, but also for implementing performance-critical portions of a project
the generally higher level of abstraction of many constructs of the language
the multiple paradigms (imperative, object oriented and functional)
the availability of practitioners in so many domains of science and technology
and, yes, the "herd mentality effect" mentioned in a remark, possibly in a [self?] deriding way. The fact that a language attracts a broad, "closely knit" community makes it attractive too, beyond the superficial ("look cool" and such) traits of herd mentality. Put in broader context, sometimes the best technology/language to use is measured not on its intrinsic merits but on the overall "picture", including the user community.
I like everything to do with [] and {}, such as slices like [-1:1]. I also like being able to write less code that says something more meaningful, which lets you write models and other declarative things in a very DRY way.
Like any programming language, it is just a tool in the box or a brush by which you may paint your creation. Any creative endeavour requires that the artist loves the tools he uses; otherwise, the outcome suffers. Some people like Python for the same reason others love Perl. Incidentally, I have found that most Python lovers loathe Perl's flexible and expressive syntax. As a Perl lover, I don't hate Python, but consider it to be overly structured and restrictive.
If you ask me, all of these throngs of people who seem to love Python were silently suffering under the tool choices before Python came into being. Some suffered under Perl, others under something else. In other words, I believe that when Python came along, it found a large group of silent sufferers longing for a tool like Python.
I can't program in Python because I can't "think" in Python. I can "think" in Perl, therefore, it is the tool I prefer. The silently suffering mass of, now, Python users seem to have found some long lost salvation. Now if they could only keep their evangelism to themselves :).
If you are familiar with the .NET CLR and prefer a statically-typed language, but you like Python's lightweight syntax, then perhaps Boo is the language for you.
Don't get me wrong, I am and will always be a devoted fan of C#.
But sometimes there are things I can't do in C#. Although C# keeps reducing those gaps, Python is still the language I go to to fill them.
It's dynamic, flexible, powerful, and clean. Lovely language. Whenever I need to script or build dynamic or functional (as in functional programming) software, I go Python.
For me Python is the most elegant language I've used. The syntax is minimalist (significantly less punctuation than most) and intentionally modeled after the pseudo-code conventions which are ubiquitously used by programmers to outline their intentions.
Python's if __name__ == '__main__': suite encourages re-use and test driven development.
For example, the night before last I hacked together a script to run thousands of ssh jobs (with about 100 running concurrently), gather up all the results (output, error messages, exit values), and record the time taken by each. It also handles timeouts (an ssh command can stall indefinitely on connection to a thrashing system; its connection timeout and retry options don't apply after the socket connection is made, no matter if the authentication stalls). This only takes a few dozen lines of Python, and it really is easiest to create it as a class (defined above the __main__ suite) and do my command line parsing in a simple wrapper down inside __main__. That's sufficient to do the job at hand (I ran the script on 25,000 hosts the next day, in about two hours). I can now use this code in other scripts as easily as:
    from sshwrap import SSHJobMan
    cmd = '/etc/init.d/foo restart'
    targets = queryDB(some_criteria)
    job = SSHJobMan(cmd, targets)
    job.start()
    while not job.done():
        completed = job.poll()
        # ...
        # Deal with incremental disposition of completed jobs
    for each in sorted(job.results):
        # ...
        # Summarize results
... and so on.
So my script can be used for simple jobs ... and it can be imported as a module for more specialized work that couldn't be described on my wrapper's command line. (For example I could start up "consumer" subprocesses for handling other work on each host where the job was successful while spitting out service tickets or automated reboot requests for all hosts reporting timeouts or failures, etc).
For modules which have no standalone usage I can use the __main__ suite to contain unit-tests. Thus every module can contain its own tests ... which, in fact, can be integrated into the "doc strings" using the doctest module from the standard libraries. (Which, incidentally, means that properly formatted examples in the documentary comments can be kept in sync with the implementation ... since they are parts of the unit-test suite).
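A minimal sketch of that arrangement: a module whose docstring examples double as its tests, run from the `__main__` suite. The `slugify` function is a made-up example; `doctest.testmod()` is the standard library call that collects and runs the examples.

```python
import doctest


def slugify(title):
    """Lower-case a title and join its words with hyphens.

    >>> slugify("Coco/R vs ANTLR")
    'coco/r-vs-antlr'
    >>> slugify("  Hello   World ")
    'hello-world'
    """
    return "-".join(title.lower().split())


if __name__ == "__main__":
    # Runs only when the file is executed directly, not when it is imported.
    results = doctest.testmod()
    print(f"{results.attempted} examples run, {results.failed} failed")
```

Importing the module elsewhere gives you `slugify` with no side effects, while running the file directly checks that the documented examples still match the implementation.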
The main thing I like about Python is its very concise, readable syntax. Though using indentation as a block delimiter can seem strange at first, once you begin to code a lot in the language I find it begins to make sense. Though the core language is quite simple, its more advanced features, e.g. list comprehension, decorators and generators, are rather useful too.
In addition, the Python standard library is just fantastic; its documentation is very well written, and it contains a lot of very useful packages. I also find that there are plenty of good bindings for C libraries, such as PyGTK, Webkit and Qt, to name but a few.
One caveat is that Python, like most dynamic languages, is quite slow in comparison with compiled, statically-typed languages. However, you can easily extend it with C, allowing you to write code requiring better performance in C and the rest in Python.
It's a great language overall, and (for me at least) makes coding more productive and enjoyable.

Best/fastest way to write a parser in c#

What is the best way to build a parser in c# to parse my own language?
Ideally I'd like to provide a grammar, and get Abstract Syntax Trees as an output.
Many thanks,
Nestor
I've had good experience with ANTLR v3. By far the biggest benefit is that it lets you write LL(*) parsers with infinite lookahead - these can be quite suboptimal, but the grammar can be written in the most straightforward and natural way with no need to refactor to work around parser limitations, and parser performance is often not a big deal (I hope you aren't writing a C++ compiler), especially in learning projects.
It also provides pretty good means of constructing meaningful ASTs without need to write any code - for every grammar production, you indicate the "crucial" token or sub-production, and that becomes a tree node. Or you can write a tree production.
Have a look at the following ANTLR grammars (listed here in order of increasing complexity) to get a gist of how it looks and feels
JSON grammar - with tree productions
Lua grammar
C grammar
I've played with Irony. It looks simple and useful.
You could study the source code for the Mono C# compiler.
While it is still in early beta the Oslo Modeling language and MGrammar tools from Microsoft are showing some promise.
I would also take a look at SableCC. It's very easy to create the EBNF grammar. Here is a simple C# calculator example.
There's a short paper here on constructing an LL(1) parser; of course, you could use a generator too.
Lex and yacc are still my favorites. Obscure if you're just starting out, but extremely simple, fast, and easy once you've got the lingo down.
You can make it do whatever you want; generate C# code, build other grammars, emulate instructions, whatever.
It's not pretty: it's a text-based format and LALR(1), so your syntax has to accommodate that.
On the plus side, it's everywhere. There are great O'Reilly books about it, lots of sample code, lots of premade grammars, and lots of native language libraries.

SAX vs XmlTextReader - SAX in C#

I am attempting to read a large XML document and I wanted to do it in chunks vs XmlDocument's way of reading the entire file into memory. I know I can use XmlTextReader to do this but I was wondering if anyone has used SAX for .NET? I know Java developers swear by it and I was wondering if it is worth giving it a try and if so what are the benefits in using it. I am looking for specifics.
If you just want to get the job done quickly, the XmlTextReader exists for that purpose (in .NET).
If you want to learn a de facto standard (available in many other programming languages) that is stable and which will force you to code very efficiently and elegantly, but which is also extremely flexible, then look into SAX. However, don't waste your time unless you're going to be creating highly esoteric XML parsers. Instead, look for next-generation parsers (like XmlTextReader) for your particular platform.
SAX Resources
SAX was originally written for Java, and you can find the original open source project, which has been stable for several years, here:
http://sax.sourceforge.net/
There is a C# port of the same project here (with HTML docs as part of the source download); it is also stable:
http://saxdotnet.sourceforge.net/
If you do not like the C# implementation, you could always resort to referencing COM DLLs via COMInterop using MSXML3 or later: http://msdn.microsoft.com/en-us/library/ms994343.aspx
Articles that come from the Java world but which probably illustrate the concepts you need to be successful with this approach (there may also be downloadable Java source code that could prove useful and may be easy enough to convert to C#):
Output large XML documents, Part 1 (http://www.ibm.com/developerworks/xml/library/x-tipbigdoc.html)
Output large XML documents, Part 2 (http://www.ibm.com/developerworks/xml/library/x-tipbigdoc2.html)
Use a SAX filter to manipulate data (http://www.ibm.com/developerworks/xml/library/x-tipsaxfilter/)
It will be a cumbersome implementation. I have only used SAX back in my pre-.NET days, but it requires some pretty advanced coding techniques. At this point, it's just not worth the trouble.
Interesting Concept for a Hybrid Parser
This thread describes a hybrid parser that uses the .NET XmlTextReader to implement a parser that provides a combination of DOM and SAX benefits...
http://bytes.com/groups/net-xml/178403-xmltextreader-versus-dom
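The code from that thread is not reproduced here. As a rough sketch of the same idea, one way to combine streaming with a tree API is to let XmlReader walk the document and materialize only one record at a time with XNode.ReadFrom from LINQ to XML. This is a swapped-in technique, not the thread's implementation; HybridXml, StreamElements and the element name are hypothetical.

```csharp
using System.Collections.Generic;
using System.Xml;
using System.Xml.Linq;

public static class HybridXml
{
    // Streams the document, but hands each matching element to the
    // caller as a small in-memory tree (hypothetical helper).
    public static IEnumerable<XElement> StreamElements(XmlReader reader, string name)
    {
        reader.MoveToContent();
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == name)
            {
                // ReadFrom consumes the whole element and leaves the
                // reader on the node that follows it, so we must not
                // call Read() again on this path.
                yield return (XElement)XNode.ReadFrom(reader);
            }
            else
            {
                reader.Read();
            }
        }
    }
}
```

Each XElement can be queried like a small DOM and then dropped, so only one record is held in memory at a time.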
If you're talking about SAX for .NET, the project doesn't appear to be maintained. The last release was more than 2 years ago. Maybe they got it perfect on the last release, but I wouldn't bet on it. The author, Karl Waclawek, seems to have disappeared off the net.
As for SAX under Java? You bet, it's great. Unfortunately, SAX was never developed as a standard, so all of the non-Java ports have been adapting a Java API for their own needs. While DOM is a pretty lousy API, it has the advantage of having been designed for multiple languages and environments, so it's easy to implement in Java, C#, JavaScript, C, et al.
I believe there are no benefits to using SAX, for at least two reasons:
SAX is a "push" model while XmlReader is a pull parser that has a number of benefits.
Being dependent on a 3rd-party library rather than using a standard .NET API.
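To make the push/pull distinction concrete, here is a small sketch that layers a SAX-style callback interface on top of the pull-based XmlReader. The interface and class names (ISimpleHandler, PushDriver, RecordingHandler) are invented for this example and are not the SAX for .NET API.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical push-style callbacks, loosely modelled on SAX events.
public interface ISimpleHandler
{
    void StartElement(string name);
    void Characters(string text);
    void EndElement(string name);
}

public static class PushDriver
{
    // The driver owns the loop (pull) and calls the handler (push).
    public static void Parse(TextReader input, ISimpleHandler handler)
    {
        using (XmlReader reader = XmlReader.Create(input))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        handler.StartElement(reader.Name);
                        // <c/> produces no EndElement node, so emit one.
                        if (reader.IsEmptyElement) handler.EndElement(reader.Name);
                        break;
                    case XmlNodeType.Text:
                    case XmlNodeType.CDATA:
                        handler.Characters(reader.Value);
                        break;
                    case XmlNodeType.EndElement:
                        handler.EndElement(reader.Name);
                        break;
                }
            }
        }
    }
}

// Records events in order, just to show what the handler receives.
public sealed class RecordingHandler : ISimpleHandler
{
    public readonly List<string> Events = new List<string>();
    public void StartElement(string name) { Events.Add("start " + name); }
    public void Characters(string text) { Events.Add("text " + text); }
    public void EndElement(string name) { Events.Add("end " + name); }
}
```

With a pull parser the calling code decides when to advance; with a push parser the handler is called and has to keep its own state between callbacks. Building push on top of pull, as here, is straightforward, whereas going the other way generally requires buffering or a separate thread.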
Personally, I much prefer the SAX model as the XmlReader has some really annoying traps that can cause bugs in your code that might cause your code to skip elements. Most code would be structured around a while(rdr.Read()) model, but if you have any "ReadString" or "ReadInnerXml()" within that loop you will find yourself skipping elements on the next iteration.
As SAX is event-based, this will never happen, because you cannot perform any operation that would cause the parser to read ahead.
My personal feeling is that Microsoft invented the notion that XmlReader is better and justified it with the push/pull explanation, but I don't really buy it. Microsoft's position is that you don't need to build a state machine with XmlReader. That doesn't make sense to me, but it's just my opinion.
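The skipping trap described in this answer can be reproduced with a few lines. The sketch below assumes XML with no whitespace between elements (whitespace nodes can mask the problem) and uses ReadInnerXml, which leaves the reader positioned after the element's end tag. The class and method names are made up for illustration.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class ReaderTrap
{
    // Buggy: ReadInnerXml() has already moved the reader onto the
    // next node, and the loop's Read() then steps past that node.
    public static List<string> ReadSkipping(string xml)
    {
        var values = new List<string>();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "a")
                    values.Add(reader.ReadInnerXml());
            }
        }
        return values;
    }

    // Fixed: call Read() only when no other method has advanced
    // the reader during this iteration.
    public static List<string> ReadAll(string xml)
    {
        var values = new List<string>();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "a")
                    values.Add(reader.ReadInnerXml());
                else
                    reader.Read();
            }
        }
        return values;
    }
}
```

For the input `<r><a>1</a><a>2</a><a>3</a></r>`, ReadSkipping returns 1 and 3 and silently drops 2, while ReadAll returns all three values.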

Constructing a simple interpreter

I’m starting a project where I need to implement a light-weight interpreter.
The interpreter is used to execute simple scientific algorithms.
The programming language that this interpreter will use should be simple, since it is targeting non-software developers (for example, mathematicians).
The interpreter should support basic programming languages features:
Real numbers, variables, multi-dimensional arrays
Arithmetic (+, -, *, /, %) and comparison (==, !=, <, >, <=, >=) operations
Loops (for, while), Conditional expressions (if)
Functions
MathWorks MATLAB is a good example of where I’m heading, just much simpler.
The interpreter will be used as an environment to demonstrate algorithms; simple algorithms such as finding the average of a dataset/array, or slightly more complicated algorithms such as Gaussian elimination or RSA.
The best and most practical resource I have found on the subject is Ron Ayoub’s entry on Code Project (Parsing Algebraic Expressions Using the Interpreter Pattern), a perfect example of a miniature version of my problem.
The Purple Dragon Book seems to be too much, anything more practical?
The interpreter will be implemented as a .NET library, using C#. However, resources for any platform are welcome, since the design-architecture part of this problem is the most challenging.
Any practical resources?
(please avoid “this is not trivial” or “why re-invent the wheel” responses)
I would write it in ANTLR. Write the grammar and let ANTLR generate a C# parser. You can ask ANTLR for a parse tree, and the interpreter may be able to operate directly on that tree. Perhaps you'll have to convert the parse tree to a more abstract internal representation (although ANTLR already allows you to leave out irrelevant punctuation when generating the tree).
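As a rough idea of what "operating on the tree" can look like, here is a minimal tree-walking evaluator in the style of the Interpreter pattern. The tree is built by hand here; in practice it would come from the parser. The class names (Expr, Number, Variable, Binary) are hypothetical and are not ANTLR-generated types.

```csharp
using System;
using System.Collections.Generic;

// Each node type knows how to evaluate itself against the variables.
public abstract class Expr
{
    public abstract double Eval(IDictionary<string, double> vars);
}

public sealed class Number : Expr
{
    private readonly double value;
    public Number(double value) { this.value = value; }
    public override double Eval(IDictionary<string, double> vars) { return value; }
}

public sealed class Variable : Expr
{
    private readonly string name;
    public Variable(string name) { this.name = name; }
    public override double Eval(IDictionary<string, double> vars)
    {
        double v;
        if (!vars.TryGetValue(name, out v))
            throw new KeyNotFoundException("Undefined variable: " + name);
        return v;
    }
}

public sealed class Binary : Expr
{
    private readonly char op;
    private readonly Expr left;
    private readonly Expr right;

    public Binary(char op, Expr left, Expr right)
    {
        this.op = op;
        this.left = left;
        this.right = right;
    }

    public override double Eval(IDictionary<string, double> vars)
    {
        // Evaluate both operands first, then apply the operator.
        double l = left.Eval(vars);
        double r = right.Eval(vars);
        switch (op)
        {
            case '+': return l + r;
            case '-': return l - r;
            case '*': return l * r;
            case '/': return l / r;
            case '%': return l % r;
            default: throw new InvalidOperationException("Unknown operator: " + op);
        }
    }
}
```

Evaluating `(x + 2) * 3` with x = 4 means building `new Binary('*', new Binary('+', new Variable("x"), new Number(2)), new Number(3))` and calling Eval with a dictionary that maps "x" to 4. Statements, loops and functions can follow the same shape, with an Execute method in place of Eval.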
It might sound odd, but Game Scripting Mastery is a great resource for learning about parsing, compiling and interpreting code.
You should really check it out:
http://www.amazon.com/Scripting-Mastery-Premier-Press-Development/dp/1931841578
One way to do it is to examine the source code of an existing interpreter. I've written a JavaScript interpreter in the D programming language; you can download the source code from http://ftp.digitalmars.com/dmdscript.zip
Walter Bright, Digital Mars
I'd recommend leveraging the DLR to do this, as this is exactly what it is designed for.
Create Your Own Language on top of the DLR
Lua was designed as an extensible interpreter for use by non-programmers. (The first users were Brazilian petroleum geologists although the user base has broadened considerably since then.) You can take Lua and easily add your scientific algorithms, visualizations, what have you. It's superbly well engineered and you can get on with the task at hand.
Of course, if what you really want is the fun of building your own, then the other advice is reasonable.
Have you considered using IronPython? It's easy to use from .NET and it seems to meet all your requirements. I understand that Python is fairly popular for scientific programming, so it's possible your users will already be familiar with it.
The Silk library has just been published to GitHub. It seems to do most of what you are asking. It is very easy to use. Just register the functions you want to make available to the script, compile the script to bytecode and execute it.
The programming language that this interpreter will use should be simple, since it is targeting non-software developers.
I'm going to chime in on this part of your question. A simple language is not what you really want to hand to non-software developers. Stripped-down languages require more effort from the programmer. What you really want is a well-designed and well-implemented Domain Specific Language (DSL).
In this sense I will second what Norman Ramsey recommends with Lua. It has an excellent reputation as a base for high quality DSLs. A well documented and useful DSL takes time and effort, but will save everyone time in the long run when domain experts can be brought up to speed quickly and require minimal support.
I am surprised no one has mentioned Xtext yet. It is available as an Eclipse plugin and an IntelliJ plugin. It provides not just a parser, as ANTLR does, but the whole pipeline (parser, linker, type checker, compiler) needed for a DSL. You can check its source code on GitHub to understand how an interpreter/compiler works.
