Here is an extract from the grammar section of the C# Language Specification:
Is this written in a specific format? I looked at the grammar section in an old C++ ISO I found and it seemed to follow the same format, so is there some standard being used here for writing this grammar? I ask because I would like to be able to create a tool where I can paste the grammar directly and have a working C# parser immediately.
Microsoft seems to release its C# spec for free, but I can't find the C++11 spec anywhere. Am I going to have to buy it to view it?
It's a variant of the BNF used by Yacc. Yacc normally has ; as part of its syntax, but dropping that makes things simpler for languages like C# and C++, in which ; is very significant in itself. Unlike most BNF variants, it uses a : where BNF often uses = (see also Van Wijngaarden grammars, and you'll soon know much more than the little bit of knowledge this answer is drawing on).
ISO doesn't have a rule about which grammar notation must be used in its standards, so different standards use BNF, ABNF, EBNF, Wirth syntax, and perhaps others.
ISO standards often originate as national or other standards that are then adopted by ISO. Since different standards bodies use different grammar notations (the IETF uses ABNF in RFCs [itself defined in RFC 5234], BSI and the W3C use different variants of EBNF, and so on), the grammar in an ISO standard often reflects its origins.
That's the case here. Kernighan and Ritchie used this format in their book, The C Programming Language. While the ANSI standard and later ISO standards differ in the grammar itself, they use the same format, and it has been used since for other C-like languages.
Each standard does its own thing. But among compiler writers there's a fairly standard way of describing grammars, and that's what you're seeing here and in the C++ standard.
What you are seeing here is a variation of Backus-Naur Form (BNF). While not exactly the standard format, it is pretty similar. This is generally the standard way of showing how a language is supposed to be parsed, and a common input to parser generators.
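To see how such a grammar connects to a working parser: each production maps directly to one method in a hand-written recursive-descent parser. Here is a minimal C# sketch for a toy additive-expression grammar (the grammar and all the names are my own, not from any spec):

```csharp
using System;

// Toy grammar, written BNF-style:
//   expr : term ( '+' term )*
//   term : DIGIT+
// Each production becomes one method; parser generators like Yacc
// mechanically produce the same shape of code from such rules.
class TinyParser
{
    private readonly string _input;
    private int _pos;

    public TinyParser(string input) { _input = input; }

    // expr : term ( '+' term )*
    public int ParseExpr()
    {
        int value = ParseTerm();
        while (_pos < _input.Length && _input[_pos] == '+')
        {
            _pos++;                 // consume '+'
            value += ParseTerm();
        }
        return value;
    }

    // term : DIGIT+
    private int ParseTerm()
    {
        int start = _pos;
        while (_pos < _input.Length && char.IsDigit(_input[_pos]))
            _pos++;
        return int.Parse(_input.Substring(start, _pos - start));
    }

    static void Main()
    {
        Console.WriteLine(new TinyParser("12+30").ParseExpr()); // 42
    }
}
```

A real tool would of course generate this kind of code from the pasted grammar rather than have you write it by hand, but the correspondence between productions and methods is the same.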
The C++ standard is not available for free. You can buy a copy for 30 USD at webstore.ansi.org. Search for document number 14882, and then look for the C++ standard.
The common way to describe a grammar is either Backus-Naur Form (BNF) or Extended Backus-Naur Form (EBNF). If you are looking to parse a language easily in C#, take a look at Irony, a language implementation toolkit for C# that lets you describe the grammar in something very similar to EBNF.
Besides those, there are also Parsing Expression Grammars (PEGs), but I don't believe they are as common as BNF or EBNF.
I am working on a larger C# project in Visual Studio handling financial math, so naturally the code implements many special math formulas, and they need to be properly documented. I am looking for a good way to produce documentation from the code. Many objects already have some XML-doc comments with descriptions set up, and I am looking for ways to include math formulas written in LaTeX in that.
What options are there and how easy are they to set up?
Or maybe more generally, are there better ways to produce such code documentation?
For me a few things are important:
documentation must have a way to include math formulas
LaTeX is our preferred syntax for writing formulas
ability to use cref-like links in documentation
refactoring (like renaming a class) shouldn't break the links between documentation and object.
it should work with VS IntelliSense tooltips and at least show the summary documentation of methods and classes
I tried using Doxygen 1.9.6 (we also have one C++ project) and I managed to make it partially work. It does render LaTeX formulas from the summary tag, but it seems to have issues with certain C# constructs; for example, I cannot make it generate any documentation for (public) implementations of methods from generic interfaces, regardless of how I set up the configuration (I need to do more research into what exactly the problem is).
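For concreteness, this is the style of comment I'm writing. The \f$ ... \f$ markers are Doxygen's inline-formula syntax, and I enable MathJax rendering with USE_MATHJAX = YES in the Doxyfile; the formula and method below are just a made-up example:

```csharp
using System;

public static class Finance
{
    /// <summary>
    /// Present value of a single cash flow: \f$ PV = \frac{FV}{(1+r)^n} \f$
    /// The \f$ ... \f$ markers are Doxygen inline-formula syntax; they are
    /// ignored by the C# compiler and by IntelliSense.
    /// </summary>
    public static double PresentValue(double futureValue, double rate, int periods)
        => futureValue / Math.Pow(1 + rate, periods);

    static void Main()
    {
        // 110 discounted one period at 10% is 100 (up to rounding).
        Console.WriteLine(Math.Abs(PresentValue(110, 0.10, 1) - 100.0) < 1e-9); // True
    }
}
```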
I add this as a separate answer because it is a completely different approach.
I have found another existing answer which may be helpful.
There are two extensions for VS that support LaTeX formulas in comments.
https://marketplace.visualstudio.com/items?itemName=vs-publisher-1305558.VsTeXCommentsExtension (for VS 2017, 2019)
https://marketplace.visualstudio.com/items?itemName=pierreMertz.TeXcomments (works with VS 2010)
For VS 2022 there is a new version of the first extension:
https://marketplace.visualstudio.com/items?itemName=vs-publisher-1305558.VsTeXCommentsExtension2022
Maybe Literate Programming is what you are looking for. Literate programming is a programming paradigm in which the source code is embedded in documentation written in a natural language, making the code easier to read, understand, and maintain. The source code is interspersed with explanations and descriptions that provide context and clarify the purpose and function of the code. This approach aims to make the code more accessible to a wider audience and to improve its overall quality.
The general idea was introduced by Donald E. Knuth and implemented for C. (see http://www.literateprogramming.com)
Tommi Johtela proposed LiterateCS to implement it for C#. It assumes Markdown with LaTeX syntax for math formulas.
General introduction: https://johtela.github.io/LiterateCS/Introduction.html
Example of math formula in the generated documentation: https://johtela.github.io/LiterateCS/TipsAndTricks.html
I am looking for best practices on externalizing simple data structures into human readable files.
I have some experience with iOS's plist functionality (which I believe is XML-like underneath) and I'd like to find something similar.
On the .NET side, .resx seems to be the way to go, but as I do research, everyone brings up localization, and this data is not meant to be localized. Is .resx still the answer?
If so, is there a way to get a dictionary structure of all the .resx data instead of reading a single entry? I'd like to know things like number of entries, an array of all the keys, an array of all the values, etc.
Given my druthers, I'd avoid XML. It's designed to be easy to parse, not easy to read: it's verbose, and human readability was not a design goal. Avoid the angle-bracket tax if you can.
There's JSON, a useful alternative: simple, easy to read, easy to parse, no angle-bracket tax. That's one option. YAML is another (it's a superset of JSON).
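For example, with the built-in System.Text.Json, a JSON file maps straight onto a dictionary, which also answers the "give me all keys and values at once" requirement (the key names below are made up):

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

class JsonDemo
{
    static void Main()
    {
        // A small, human-readable settings file (hypothetical keys).
        string json = @"{ ""retries"": 3, ""timeoutSeconds"": 30 }";

        // One call gives you the whole structure as a dictionary:
        // count, keys, and values are then trivially available.
        var settings = JsonSerializer.Deserialize<Dictionary<string, int>>(json);

        Console.WriteLine(settings.Count);      // 2
        Console.WriteLine(settings["retries"]); // 3
    }
}
```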
There are LISP-style S-expressions (see also Wikipedia). You could also use Prolog-style terms to construct the data structures of your choice (also quite easy to parse).
And there are old-school DOS/Windows INI files. There are multiple tools out there for wrangling them, including .NET/CLR implementations.
You could just co-opt Apple's plist format from OS X. You can use its old-school "ASCII" (text) representation or its XML representation.
You can also (preferred, IMHO) write a custom/"little" language to suit your needs specifically. The buzzword du jour for this, these days, is "domain-specific language". I'd avoid the Visual Studio/C#/.NET domain-specific-language facilities, because what you get is going to be XML-based.
Terence Parr's excellent ANTLR is arguably the tool of choice for language building. It's written in Java, comes with an IDE for working with grammars and parse trees, and can target multiple languages (Java, C#, Python, Objective-C, and C/C++ are all up to date; there's some support for Scala as well, and a few other target languages exist for older versions, in varying levels of completeness).
Terence Parr's books are equally excellent:
The Definitive ANTLR Reference: Building Domain-Specific Languages
Language Implementation Patterns: Create Your Own Domain-Specific and General Programming Languages
XML (any schema, not just .resx) and CSV are common choices. XML is easier to use in code, as long as it is valid XML.
Look into LINQ to XML ( http://msdn.microsoft.com/en-us/library/bb387098.aspx ) to get started on reading XML.
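As a sketch of what LINQ to XML looks like for this kind of task (the element and attribute names here are invented, not a .resx schema):

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

class XmlDemo
{
    static void Main()
    {
        // A hypothetical key/value file; in practice you'd use XDocument.Load(path).
        string xml = @"<items>
                         <item key='a' value='1' />
                         <item key='b' value='2' />
                       </items>";

        XDocument doc = XDocument.Parse(xml);

        // LINQ queries give you all entries at once: counts, keys, values.
        var keys = doc.Root.Elements("item")
                           .Select(e => (string)e.Attribute("key"))
                           .ToArray();

        Console.WriteLine(string.Join(",", keys)); // a,b
    }
}
```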
What is the best way to build a parser in C# to parse my own language?
Ideally I'd like to provide a grammar, and get Abstract Syntax Trees as an output.
Many thanks,
Nestor
I've had a good experience with ANTLR v3. By far the biggest benefit is that it lets you write LL(*) parsers with infinite lookahead. These can be quite suboptimal, but the grammar can be written in the most straightforward and natural way, with no need to refactor around parser limitations, and parser performance is often not a big deal (I hope you aren't writing a C++ compiler), especially in learning projects.
It also provides pretty good means of constructing meaningful ASTs without needing to write any code: for every grammar production, you indicate the "crucial" token or sub-production, and that becomes a tree node. Or you can write a tree rewrite rule.
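For instance, here is a toy rule of my own in ANTLR v3 syntax (not taken from the grammars linked below), where ^ makes a token the root of the subtree and ! drops a token from the tree:

```antlr
// '+' and '-' become AST roots; the parentheses are dropped entirely,
// so "(1+2)-3" yields the tree (- (+ 1 2) 3) with no punctuation nodes.
expr : term (('+' | '-')^ term)* ;
term : INT | '('! expr ')'! ;
INT  : '0'..'9'+ ;
```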
Have a look at the following ANTLR grammars (listed here in order of increasing complexity) to get a feel for how it looks:
JSON grammar - with tree productions
Lua grammar
C grammar
I've played with Irony. It looks simple and useful.
You could study the source code for the Mono C# compiler.
While they are still in early beta, the Oslo modeling language and MGrammar tools from Microsoft are showing some promise.
I would also take a look at SableCC. It's very easy to create the EBNF grammar. Here is a simple C# calculator example.
There's a short paper here on constructing an LL(1) parser; of course, you could use a generator too.
Lex and yacc are still my favorites. Obscure if you're just starting out, but extremely simple, fast, and easy once you've got the lingo down.
You can make it do whatever you want; generate C# code, build other grammars, emulate instructions, whatever.
It's not pretty, it's a text-based format, and it's LALR(1), so your syntax has to accommodate that.
On the plus side, it's everywhere. There are great O'Reilly books about it, lots of sample code, lots of premade grammars, and lots of native-language libraries.
This question already has answers here:
How to detect the language of a string?
(9 answers)
Closed 8 years ago.
Is there any C# library which can detect the language of a particular piece of text? i.e. for an input text "This is a sentence", it should detect the language as "English". Or for "Esto es una sentencia" it should detect the language as "Spanish".
I understand that language detection from text is not a deterministic problem. But both Google Translate and Bing Translator have an "Auto detect" option, which best-guesses the input language. Is there something similar available publicly, preferably in C#?
Yes indeed, TextCat is very good for language identification. And it has a lot of implementations in different languages.
There was no .NET port, so I have written one: NTextCat (NuGet, Online Demo).
It is a pure .NET Standard 2.0 DLL plus a command-line interface to it. By default, it uses a profile of 14 languages.
Any feedback is very much appreciated! New ideas and feature requests are welcome too :)
Language detection is a pretty hard thing to do.
Some languages are much easier to detect than others, simply due to the diacritics and digraphs/trigraphs used. For example, double acute accents are used almost exclusively in Hungarian; the dotless ‘ı’ is used exclusively [I think] in Turkish; t-comma (not t-cedilla) is used only in Romanian; and the eszett ‘ß’ occurs only in German.
Some digraphs, trigraphs, and tetragraphs are also a good giveaway. For example, you'll most likely find ‘eeuw’ and ‘ieuw’ primarily in Dutch, and ‘tsch’ and ‘dsch’ primarily in German, etc.
More giveaways would include common words or common prefixes/suffixes used in a particular language. Sometimes even the punctuation that is used can help determine a language (quote-style and use, etc).
If such a library exists I would like to know about it, since I'm working on one myself.
Please find a C# implementation based on 3-gram analysis here:
http://idsyst.hu/development/language_detector.html
Here you have a simple detector based on bigram statistics (basically: learn from a big corpus which bigrams occur more frequently in each language, then count those in a piece of text and compare against the previously computed values):
http://allantech.blogspot.com/2007/07/automatic-language-detection.html
This is probably good enough for many (most?) applications and doesn't require Internet access.
Of course, it will perform worse than Google's or Bing's algorithms (which themselves aren't great). If you need excellent detection performance, you would have to do a lot of hard work over huge amounts of data.
The other option would be to leverage Google's or Bing's APIs, if your app has Internet access.
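The bigram approach linked above boils down to something like the following C# sketch. The per-language profiles here are tiny, hand-picked stand-ins for what a real detector would learn from large corpora, so treat this as an illustration of the scoring idea only:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class BigramDetector
{
    // Score = how many of the profile's characteristic bigrams occur in the text.
    // A real implementation compares full frequency distributions instead.
    static int Score(string text, string[] profile) =>
        profile.Count(bigram => text.Contains(bigram));

    static void Main()
    {
        // Toy profiles: a handful of frequent bigrams per language (made up).
        var profiles = new Dictionary<string, string[]>
        {
            ["English"] = new[] { "th", "he", "in", "er" },
            ["Spanish"] = new[] { "es", "en", "el", "la" },
        };

        string input = "the weather is better in the north";
        string best = profiles.OrderByDescending(p => Score(input, p.Value))
                              .First().Key;

        Console.WriteLine(best); // English
    }
}
```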
You'll want a machine-learning approach based on hidden Markov models, trained on a bunch of texts in different languages.
Then, when it gets the unidentified text, the language with the closest 'score' is the winner.
There is a simple tool to identify text language:
http://www.detectlanguage.com/
I've found that "textcat" is very useful for this. I've used a PHP implementation, PHP Text Cat, based on this original implementation, and found it reliable. If you have a look at the sources, you'll find it's not a terrifyingly difficult thing to implement in the language of your choice. The hard work (the letter combinations that are relevant to a particular language) is all in there as data.
I’m starting a project where I need to implement a light-weight interpreter.
The interpreter is used to execute simple scientific algorithms.
The programming language that this interpreter will use should be simple, since it is targeting non-software developers (for example, mathematicians).
The interpreter should support basic programming languages features:
Real numbers, variables, multi-dimensional arrays
Arithmetic (+, -, *, /, %) and comparison (==, !=, <, >, <=, >=) operations
Loops (for, while), Conditional expressions (if)
Functions
MathWorks MATLAB is a good example of where I'm heading, just much simpler.
The interpreter will be used as an environment to demonstrate algorithms; simple algorithms such as finding the average of a dataset/array, or slightly more complicated algorithms such as Gaussian elimination or RSA.
The best/most practical resource I found on the subject is Ron Ayoub's entry on The Code Project (Parsing Algebraic Expressions Using the Interpreter Pattern): a perfect example of a minified version of my problem.
The Purple Dragon Book seems to be too much; is there anything more practical?
The interpreter will be implemented as a .NET library, using C#. However, resources for any platform are welcome, since the design-architecture part of this problem is the most challenging.
Any practical resources?
(please avoid “this is not trivial” or “why re-invent the wheel” responses)
I would write it in ANTLR. Write the grammar and let ANTLR generate a C# parser. You can ask ANTLR for a parse tree, and possibly the interpreter can operate on the parse tree directly. Perhaps you'll have to convert the parse tree to some more abstract internal representation (although ANTLR already lets you leave out irrelevant punctuation when generating the tree).
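Whichever route you take, once you have a tree, interpreting it is a recursive walk. A minimal sketch of such an AST, with my own node types (not ANTLR's), looks like this:

```csharp
using System;

// Minimal AST for arithmetic; the interpreter is just a recursive Eval.
abstract class Node { public abstract double Eval(); }

class Num : Node
{
    private readonly double _value;
    public Num(double value) { _value = value; }
    public override double Eval() => _value;
}

class BinOp : Node
{
    private readonly char _op;
    private readonly Node _left, _right;
    public BinOp(char op, Node left, Node right) { _op = op; _left = left; _right = right; }

    // Evaluate children first, then apply the operator.
    public override double Eval() => _op switch
    {
        '+' => _left.Eval() + _right.Eval(),
        '-' => _left.Eval() - _right.Eval(),
        '*' => _left.Eval() * _right.Eval(),
        '/' => _left.Eval() / _right.Eval(),
        _   => throw new InvalidOperationException($"unknown operator {_op}")
    };
}

class Program
{
    static void Main()
    {
        // The tree for (2 + 3) * 4, as a parser would build it.
        Node tree = new BinOp('*', new BinOp('+', new Num(2), new Num(3)), new Num(4));
        Console.WriteLine(tree.Eval()); // 20
    }
}
```

Variables, arrays, loops, and functions extend the same pattern: each becomes a node type whose Eval consults or updates an environment passed down the walk.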
It might sound odd, but Game Scripting Mastery is a great resource for learning about parsing, compiling and interpreting code.
You should really check it out:
http://www.amazon.com/Scripting-Mastery-Premier-Press-Development/dp/1931841578
One way to do it is to examine the source code of an existing interpreter. I've written a JavaScript interpreter in the D programming language; you can download the source code from http://ftp.digitalmars.com/dmdscript.zip
Walter Bright, Digital Mars
I'd recommend leveraging the DLR to do this, as this is exactly what it is designed for.
Create Your Own Language ontop of the DLR
Lua was designed as an extensible interpreter for use by non-programmers. (The first users were Brazilian petroleum geologists although the user base has broadened considerably since then.) You can take Lua and easily add your scientific algorithms, visualizations, what have you. It's superbly well engineered and you can get on with the task at hand.
Of course, if what you really want is the fun of building your own, then the other advice is reasonable.
Have you considered using IronPython? It's easy to use from .NET and it seems to meet all your requirements. I understand that python is fairly popular for scientific programming, so it's possible your users will already be familiar with it.
The Silk library has just been published on GitHub. It seems to do most of what you are asking, and it is very easy to use: just register the functions you want to make available to the script, compile the script to bytecode, and execute it.
The programming language that this interpreter will use should be simple, since it is targeting non- software developers.
I'm going to chime in on this part of your question. A simple language is not what you really want to hand to non-software developers. Stripped-down languages require more effort from the programmer. What you really want is a well-designed and well-implemented domain-specific language (DSL).
In this sense I will second Norman Ramsey's recommendation of Lua. It has an excellent reputation as a base for high-quality DSLs. A well-documented and useful DSL takes time and effort, but it will save everyone time in the long run, when domain experts can be brought up to speed quickly and require minimal support.
I am surprised no one has mentioned Xtext yet. It is available as an Eclipse plugin and an IntelliJ plugin. It provides not just the parser, like ANTLR, but the whole pipeline (parser, linker, typechecker, compiler) needed for a DSL. You can check its source code on GitHub to understand how an interpreter/compiler works.