What's the best way to highlight C# syntax using Roslyn?

What's the best way to highlight C# syntax using Roslyn? - c#

I'm trying to make a Xamarin.Android app that highlights the syntax of many different languages. I plan to use ANTLR to deal with most of them, but for C# I want to use Roslyn as that will undoubtedly be faster and less buggy than ANTLR.
What is the best way to implement syntax highlighting with Roslyn? For highlighting Java syntax, the approach I took was parsing the text into a parse tree, and using a visitor to color the text associated with each terminal. You can view my code here. Is this also a good idea for Roslyn, or does Roslyn provide APIs for syntax highlighting? (e.g. Does the code behind syntax highlighting in Visual Studio live in the dotnet/roslyn repo?) I'd really prefer not to reinvent the wheel, but I will if I have to.
edit: I have accepted Tamas' answer because his solution is the most practical for my use case; I do not have access to the full solution to build a semantic model with, so I will have to do some of my own analysis. However, if your app supports more broad C# integration and can build a semantic model, take a look at the Roslyn Classification APIs which are used in Jonathon Marolf's answer.

the ConsoleClassifier project in the roslyn Samples should be a good starting place.

Did you have a look at SourceBrowser? If you can do a full solution build, then I would use the same approach.
If your context doesn't allow a full build, then you can implement something relatively good based on syntax token types. However you might have to handle some corner cases, like contextual keywords, var, implicitly declared local variables (like value), etc. Have a look at what SonarQube is using.
Similarly, you can look for other tools that you know are Roslyn based, like OmniSharp. I'm not sure if that uses regex or Roslyn to do the highlighting. But in any case you could get quite far with Regex too.

Related

Add a keyword to C# with code generation?

I have a domain specific language that I would like to interact with C# by adding new keywords (or some keyword-like syntax). Using attributes would be insufficient (I can't use them in method bodies), and shoehorning it into 'valid' C# notation that gets compiled into something else would be ugly and ruin the analogy with the DSL (and the translation from DSL-like notation to C# is nontrivial, so just writing the C# each time is out of the question).
I already have a way to parse the .cs file and transform it into legitimate, nontrivial, C# code which can be compiled.
The problem is, even through I can do all the work of defining the DSL, parsing it, and translating it into valid C#, Visual Studio won't let me use notation it doesn't understand; it just adds red squiggles, emits an error "cannot resolve symbol", and then often fails to properly parse things after it.
Is there a way to to force visual studio to ignore specific strings in its analysis? I've looked at visual studio plugins but it looks like, although I can do syntax highlighting and other stuff, I can't force it to ignore something it doesn't know how to parse (unless I'm missing some way to do that in the extension API, which is certainly possible).
I've skimmed through the Roslyn stuff and don't see offhand a way to do this there, either. (Again, may be missing something, it doesn't seem to have great documentation.)

Take a look at PowerLanguages.E: http://visualstudiogallery.msdn.microsoft.com/a512e0d0-f4f3-4435-bad4-8d5efbb1db4a
No english docs yet, sorry

How to parse simple statement into CodeDom object

I need to parse a simple statement (essentially a chain of function calls on some object) represented as a string variable into a CodeDom object (probably a subclass of CodeStatement). I would also like to provide some default imports of namespaces to be able to use less verbose statements.
I have looked around SO and the Internet to find some suggestions but I'm quite confused about what is and isn't possible and what is the simplest way to do it. For example this question seems to be almost what I want, unfortunately I can't use the solution as the CodeSnippetStatement seems not to be supported by the execution engine that I use (the WF rules engine).
Any suggestions that could help me / point me into the right direction ?

There is no library or function to parse C# code into CodeDOM objects as part of the standard .NET libraries. The CodeDOM libraries have some methods that seem to be designed for this, but none of them are actually implemented. As far as I know, there is some implementation available in Visual Studio (used e.g. by designers), but that is only internal.
CodeSnippetStatement is a CodeDOM node that allows you to place any string into the generated code. If you want to create CodeDOM tree just to generate C# source code, than this is usually fine (the source code generator just prints the string to the output). If the WF engine needs to understand the code in your string (and not just generate source code and compile it), than CodeSnippetStatement won't work.
However, there are 3rd party tools that can be used for parsing C# source code. In one project I worked on, we used NRefactory library (which is used in SharpDevelop) and it worked quite well. It gives you some tree (AST) representing the parsed code and I'm afraid you'll need to convert this to the corresponding CodeDOM tree yourself.

I have found a library implementation here that seems to cover pretty much everything I need for my purposes. I don't know if it's robust enough to be used in business scenarios, but for my unit tests it's pretty much all I can ask for.

Best/fastest way to write a parser in c#

What is the best way to build a parser in c# to parse my own language?
Ideally I'd like to provide a grammar, and get Abstract Syntax Trees as an output.
Many thanks,
Nestor

I've had good experience with ANTLR v3. By far the biggest benefit is that it lets you write LL(*) parsers with infinite lookahead - these can be quite suboptimal, but the grammar can be written in the most straightforward and natural way with no need to refactor to work around parser limitations, and parser performance is often not a big deal (I hope you aren't writing a C++ compiler), especially in learning projects.
It also provides pretty good means of constructing meaningful ASTs without need to write any code - for every grammar production, you indicate the "crucial" token or sub-production, and that becomes a tree node. Or you can write a tree production.
Have a look at the following ANTLR grammars (listed here in order of increasing complexity) to get a gist of how it looks and feels
JSON grammar - with tree productions
Lua grammar
C grammar

I've played wtih Irony. It looks simple and useful.

You could study the source code for the Mono C# compiler.

While it is still in early beta the Oslo Modeling language and MGrammar tools from Microsoft are showing some promise.

I would also take a look at SableCC. Its very easy to create the EBNF grammer. Here is a simple C# calculator example.

There's a short paper here on constructing an LL(1) parser here, of course you could use a generator too.

Lex and yacc are still my favorites. Obscure if you're just starting out, but extremely simple, fast, and easy once you've got the lingo down.
You can make it do whatever you want; generate C# code, build other grammars, emulate instructions, whatever.
It's not pretty, it's a text based format and LL1, so your syntax has to accomodate that.
On the plus side, it's everywhere. There are great O'reilly books about it, lots of sample code, lots of premade grammars, and lots of native language libraries.

Using reflection for code gen?

I'm writing a console tool to generate some C# code for objects in a class library. The best/easiest way I can actual generate the code is to use reflection after the library has been built. It works great, but this seems like a haphazard approch at best. Since the generated code will be compiled with the library, after making a change I'll need to build the solution twice to get the final result, etc. Some of these issues could be mitigated with a build script, but it still feels like a bit too much of a hack to me.
My question is, are there any high-level best practices for this sort of thing?

Its pretty unclear what you are doing, but what does seem clear is that you have some base line code, and based on some its properties, you want to generate more code.
So the key issue here are, given the base line code, how do you extract interesting properties, and how do you generate code from those properties?
Reflection is a way to extract properties of code running (well, at least loaded) into the same execution enviroment as the reflection user code. The problem with reflection is it only provides a very limited set of properties, typically lists of classes, methods, or perhaps names of arguments. IF all the code generation you want to do can be done with just that, well, then reflection seems just fine. But if you want more detailed properties about the code, reflection won't cut it.
In fact, the only artifact from which truly arbitrary code properties can be extracted is the the source code as a character string (how else could you answer, is the number of characters between the add operator and T in middle of the variable name is a prime number?). As a practical matter, properties you can get from character strings are generally not very helpful (see the example I just gave :).
The compiler guys have spent the last 60 years figuring out how to extract interesting program properties and you'd be a complete idiot to ignore what they've learned in that half century.
They have settled on a number of relatively standard "compiler data structures": abstract syntax trees (ASTs), symbol tables (STs), control flow graphs (CFGs), data flow facts (DFFs), program triples, ponter analyses, etc.
If you want to analyze or generate code, your best bet is to process it first into such standard compiler data structures and then do the job. If you have ASTs, you can answer all kinds of question about what operators and operands are used. If you have STs, you can answer questions about where-defined, where-visible and what-type. If you have CFGs, you can answer questions about "this-before-that", "what conditions does statement X depend upon". If you have DFFs, you can determine which assignments affect the actions at a point in the code. Reflection will never provide this IMHO, because it will always be limited to what the runtime system developers are willing to keep around when running a program. (Maybe someday they'll keep all the compiler data structures around, but then it won't be reflection; it will just finally be compiler support).
Now, after you have determined the properties of interest, what do you do for code generation? Here the compiler guys have been so focused on generation of machine code that they don't offer standard answers. The guys that do are the program transformation community (http://en.wikipedia.org/wiki/Program_transformation). Here the idea is to keep at least one representation of your program as ASTs, and to provide special support for matching source code syntax (by constructing pattern-match ASTs from the code fragments of interest), and provide "rewrite" rules that say in effect, "when you see this pattern, then replace it by that pattern under this condition".
By connecting the condition to various property-extracting mechanisms from the compiler guys, you get relatively easy way to say what you want backed up by that 50 years of experience. Such program transformation systems have the ability to read in source code,
carry out analysis and transformations, and generally to regenerate code after transformation.
For your code generation task, you'd read in the base line code into ASTs, apply analyses to determine properties of interesting, use transformations to generate new ASTs, and then spit out the answer.
For such a system to be useful, it also has to be able to parse and prettyprint a wide variety of source code langauges, so that folks other than C# lovers can also have the benefits of code analysis and generation.
These ideas are all reified in the
DMS Software Reengineering Toolkit. DMS handles C, C++, C#, Java, COBOL, JavaScript, PHP, Verilog, ... and a lot of other langauges.
(I'm the architect of DMS, so I have a rather biased view. YMMV).

Have you considered using T4 templates for performing the code generation? It looks like it's getting much more publicity and attention now and more support in VS2010.
This tutorial seems database centric but it may give you some pointers: http://www.olegsych.com/2008/09/t4-tutorial-creatating-your-first-code-generator/ in addition there was a recent Hanselminutes on T4 here: http://www.hanselminutes.com/default.aspx?showID=170.
Edit: Another great place is the T4 tag here on StackOverflow: https://stackoverflow.com/questions/tagged/t4
EDIT: (By asker, new developments)
As of VS2012, T4 now supports reflection over an active project in a single step. This means you can make a change to your code, and the compiled output of the T4 template will reflect the newest version, without requiring you to perform a second reflect/build step. With this capability, I'm marking this as the accepted answer.

You may wish to use CodeDom, so that you only have to build once.
First, I would read this CodeProject article to make sure there are not language-specific features you'd be unable to support without using Reflection.

From what I understand, you could use something like Common Compiler Infrastructure (http://ccimetadata.codeplex.com/) to programatically analyze your existing c# source.
This looks pretty involved to me though, and CCI apparently only has full support for C# language spec 2. A better strategy may be to streamline your existing method instead.

I'm not sure of the best way to do this, but you could do this
As a post-build step on your base dll, run the code generator
As another post-build step, run csc or msbuild to build the generated dll
Other things which depend on the generated dll will also need to depend on the base dll, so the build order remains correct

Lex/Yacc for C#?

Actually, maybe not full-blown Lex/Yacc. I'm implementing a command-interpreter front-end to administer a webapp. I'm looking for something that'll take a grammar definition and turn it into a parser that directly invokes methods on my object. Similar to how ASP.NET MVC can figure out which controller method to invoke, and how to pony up the arguments.
So, if the user types "create foo" at my command-prompt, it should transparently call a method:
private void Create(string id) { /* ... */ }
Oh, and if it could generate help text from (e.g.) attributes on those controller methods, that'd be awesome, too.

I've done a couple of small projects with GPLEX/GPPG, which are pretty straightforward reimplementations of LEX/YACC in C#. I've not used any of the other tools above, so I can't really compare them, but these worked fine.
GPPG can be found here and GPLEX here.
That being said, I agree, a full LEX/YACC solution probably is overkill for your problem. I would suggest generating a set of bindings using IronPython: it interfaces easily with .NET code, non-programmers seem to find the basic syntax fairly usable, and it gives you a lot of flexibility/power if you choose to use it.

I'm not sure Lex/Yacc will be of any help. You'll just need a basic tokenizer and an interpreter which are faster to write by hand. If you're still into parsing route see Irony.
As a sidenote: have you considered PowerShell and its commandlets?

Also look at Antlr, which has C# support.

Still early CTP so can't be used in production apps but you may be interested in Oslo/MGrammar:
http://msdn.microsoft.com/en-us/oslo/

Jison is getting a lot of traction recently. It is a Bison port to javascript. Because of it's extremely simple nature, I've ported the jison parsing/lexing template to php, and now to C#. It is still very new, but if you get a chance, take a look at it here: https://github.com/robertleeplummerjr/jison/tree/master/ports/csharp/Jison

If you don't fear alpha software and want an alternative to Lex / Yacc for creating your own languages, you might look into Oslo. I would recommend you to sit through session recordings of sessions TL27 and TL31 from last years PDC. TL31 directly addresses the creation of Domain Specific Languages using Oslo.

Coco/R is a compiler generator with a .NET implementation. You could try that out, but I'm not sure if getting such a library to work would be faster than writing your own tokenizer.
http://www.ssw.uni-linz.ac.at/Research/Projects/Coco/

I would suggest csflex - C# port of flex - most famous unix scanner generator.

I believe that lex/yacc are in one of the SDKs already (i.e. RTM). Either Windows or .NET Framework SDK.

Gardens Point Parser Generator here provides Yacc/Bison functionality for C#. It can be donwloaded here. A usefull example using GPPG is provided here

As Anton said, PowerShell is probably the way to go. If you do want a lex/ yacc implementation then Malcolm Crowe has a good set.
Edit: Direct Link to the Compiler Tools

Just for the record, implementation of lexer and LALR parser in C# for C#:
http://code.google.com/p/naive-language-tools/
It should be similar in use to Lex/Yacc, however those tools (NLT) are not generators! Thus, forget about speed.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.