Trying to build a C# grammar for bison/wisent - c#

I've never done Bison or Wisent before.
How can I get started?
My real goal is to produce a working Wisent/Semantic grammar for C#, to allow C# to be edited in emacs with code completion and all the other CEDET goodies. (For those who don't know, Wisent is an emacs-lisp port of GNU Bison, which is included in CEDET. A wisent, apparently, is a European bison. And Bison, I take it, is a play on words deriving from YACC. And CEDET is a Collection of Emacs Development Tools. All caught up? I'm not going to try to define emacs.)
Microsoft provides the BNF grammar for C#, including all the LINQ extensions, in the language reference document. I was able to translate that into a .wy file that compiles successfully with semantic-grammar-create-package.
But the compiled grammar doesn't "work". In some cases the grammar "finds" enum declarations, but not class declarations. Why? I don't know. I haven't been able to get it to recognize attributes.
I'm not finding the "debugging" of the grammar to be very easy.
I thought I'd take a step back and try to produce a wisent grammar for a vastly simpler language, a toy language with only a few keywords. Just to sort of gain some experience. Even that is proving a challenge.
I've seen the .info documents on the grammar framework and on Wisent, but... those still aren't really clarifying for me how the stuff actually works.
So
Q1: any tips on debugging a wisent grammar in emacs? Is there a way to run a "lint-like" thing on the grammar to find out if there are unused rules, dead ends, stuff like that? What about being able to watch the parser in action? Anything like that?
Q2: Any tips on coming up to speed on bison/wisent in general? What I'm thinking of is a tool that will allow me to gain some insight into how the rules work. Something that provides some transparency, instead of the "it didn't work" experience I'm getting now with Wisent.
Q3: Rather than continue to fight this, should I give up and become an organic farmer?
ps: I know about the existing C# grammar in the contrib directory of CEDET/semantic. That thing works, but ... It doesn't support the latest C# spec, including LINQ, partial classes and methods, yield, anonymous methods, object initializers, and so on. Also it mostly punts on parsing a bunch of the C# code. It sniffs out the classes and methods, and then bails out. Even foreach loops aren't done quite right. It's good as far as it goes, but I'd like to see it be better. What I'm trying to do is make it current, and also extend it to parse more of the C# code.

You may want to look at the calc example in the semantic/wisent directory. It is quite simple, and also shows how to use the %left and %right features. It will "execute" the code instead of convert it into tags. Some other simple grammars include the 'dot' parser in cogre, and the srecode parser in srecode.
For wisent debugging, there is a verbosity flag in the menu, though to be honest I haven't tried it. There is also wisent-debug-on-entry, which lets you select an action that will cause the Emacs debugger to stop in that action so you can see what the values are.
The older "bovine" parser has a debug mode that allows you to step through the rules, but it was never ported to wisent. That is a feature I have sorely missed as I write wisent parsers.

Regarding Q1:
First, make sure that the wisent parser is actually used:
(fetch-overload 'semantic-parse-stream)
should return wisent-parse-stream.
Run the following elisp snippet:
(easy-menu-add-item semantic-mode-map '(menu-bar cedet-menu) ["Wisent-Debug" wisent-debug-toggle :style toggle :selected (wisent-debug-active)])

(defun wisent-debug-active ()
  "Return non-nil if wisent debugging is active."
  (assoc 'wisent-parse-action-debug (ad-get-advice-info-field 'wisent-parse-action 'after)))

(defun wisent-debug-toggle ()
  "Install debugging of wisent-parser"
  (interactive)
  (if (wisent-debug-active)
      (ad-unadvise 'wisent-parse-action)
    (defadvice wisent-parse-action (after wisent-parse-action-debug activate)
      (princ (format "\ntoken:%S;\nactionList:%S;\nreturn:%S\n"
                     (eval i)
                     (eval al)
                     (eval ad-return-value)) (get-buffer-create "*wisent-debug*"))))
  (let ((fileName (locate-file "semantic/wisent/wisent" load-path '(".el" ".el.gz")))
        fct found)
    (if fileName
        (with-current-buffer (find-file-noselect fileName)
          (goto-char (point-max))
          (while (progn
                   (backward-list)
                   (setq fct (sexp-at-point))
                   (null
                    (or
                     (bobp)
                     (and
                      (listp fct)
                      (eq 'defun (car fct))
                      (setq found (eq 'wisent-parse (cadr fct))))))))
          (if found
              (eval fct)
            (error "Did not find wisent-parse.")))
      (error "Source file for semantic/wisent/wisent not found."))))
It creates a new entry, Wisent-Debug, in the Development menu. Clicking this entry toggles debugging of the wisent parser. The next time you reparse a buffer with the wisent parser, it writes debug information to the buffer *wisent-debug*. That buffer is not shown automatically, but you can find it via the buffer menu.
To avoid flooding *wisent-debug*, you should disable "Reparse when idle".
From time to time you should clear *wisent-debug* with erase-buffer.

Related

How to rewrite AST dynamically in resharper plugin?

The request:
I'd like to be able to write an analyzer that can provide a proxy value for a certain expression and trigger a re-parsing of the document.
The motivation:
Our code is littered with ABTests that can be either in a deployed or active state with a control and variant group.
Determining a test's state is done through a database lookup.
For the tests that are deployed with the control group, any statement of the following form will evaluate to false:
if(ExperimentService.IsInVariant(ABTest.Test1))
{
}
I'm trying to provide tooling to make this easier to deal with at development time by greying it out in this scenario.
As it is, this is fairly limited and not robust because I basically have to play parser myself.
What if the actual code is
if(!ExperimentService.IsInVariant(ABTest.Test1))
or
if(ExperimentService.IsInVariant(ABTest.Test1) || true)
or
var val = ..... && ExperimentService.IsInVariant(ABTest.Test1);
if (val)
{
    // val is always going to be false if we deployed control.
}
A possible approach would be to allow us to write analyzers that are fired once and can rewrite the tree before the actual IDE parsing happens (or, well, just parse it a second time).
These should only fire once and allow us to replace a certain expression with another. This would allow me to swap all of these experiment calls for true and false literals.
As a result, these sections could benefit from all the other IDE features, such as greying out unreachable code, but also more intricate ones, like flagging a variable that will never have a different value.
Obviously this is just an example and I'm not sure how feasible it is. Any suggestions for a proper feature or something that already exists are more than welcome.
I don't think there's an approach that doesn't have a compromise.
ReSharper doesn't support rewriting the AST before analysis - that would just rewrite the text in the file.
You could write an analyser that greys out the code, by applying a "dead code" highlight to the contents of the if block, but as you say, you'd need to parse the code and analyse control flow in order to get it correct, and I think that would be very difficult (ReSharper does provide a control flow graph, so you could walk it, but it would be up to you to (a) find the return value of IsInVariant and (b) trace that value through whatever conditions, && or || expressions, until you find an appropriate if block).
Alternatively, you could mark the IsInVariant method with the ContractAnnotation attribute, something like:
[ContractAnnotation("=> false")]
public bool IsInVariant(string identifier)
{
    // whatever...
}
This will tell ReSharper's analysis that this method always returns false (you can also say it will return true/false/null/not null based on specific input). Because it always returns false, ReSharper will grey out the code in the if statement, or the else branch if you do if (!IsInVariant(…)).
The downside here is that ReSharper will also add a warning to the if statement to tell you that the expression always returns false. So, it's a compromise, but you could change the severity of that warning to Hint, so it's not so intrusive.
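To make the effect concrete, here is a self-contained sketch of that setup. It assumes the JetBrains.Annotations package is referenced; the ABTest enum, the static class layout, and the Demo code are made up purely so the example compiles.
using JetBrains.Annotations;

public enum ABTest { Test1 }

public static class ExperimentService
{
    // Annotated as described above: ReSharper's analysis treats the return value as always false.
    [ContractAnnotation("=> false")]
    public static bool IsInVariant(ABTest test)
    {
        return false; // the real implementation would do a database lookup
    }
}

public static class Demo
{
    public static void Run()
    {
        if (ExperimentService.IsInVariant(ABTest.Test1))
        {
            // ReSharper greys this block out as unreachable and puts an
            // "expression is always false" warning on the condition,
            // which you can downgrade to a Hint.
            System.Console.WriteLine("variant behaviour");
        }
    }
}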
This is not enough to really warrant the bounty, but one solution that might apply from the developer documentation is to create a custom language and extend the basic rules.
You said
I'm trying to provide tooling to make this easier to deal with at development time by greying it out in this scenario.
Greying out the corresponding parts might just be done by altering syntax highlighting rules.
See this example for .tt files.

ANTLR Parser with manual lexer

I'm migrating a C#-based programming language compiler from a manual lexer/parser to Antlr.
Antlr has been giving me severe headaches because it mostly works, but then there are the small parts that do not, and those are incredibly painful to solve.
I discovered that most of my headaches are caused by the lexer parts of Antlr, rather than the parser. Then I noticed parser grammar X; and realized that perhaps I could have my manually written lexer and then an Antlr generated parser.
So I'm looking for more documentation on this topic. I guess a custom ITokenStream could work, but there appears to be virtually no online documentation on this topic...
I found out how. It might not be the best approach but it certainly seems to be working.
Antlr parsers receive an ITokenStream parameter.
Antlr lexers are themselves ITokenSources.
ITokenSource is a significantly simpler interface than ITokenStream.
The simplest way to convert an ITokenSource to an ITokenStream is to use a CommonTokenStream, which receives an ITokenSource parameter.
So now we only need to do 2 things:
Adjust the grammar to be parser-only
Implement ITokenSource
Adjusting the grammar is very simple. Simply remove all lexer declarations and ensure you declare the grammar as a parser grammar. A simple example is posted here for convenience:
parser grammar mygrammar;

options
{
    language=CSharp2;
}

@parser::namespace { MyNamespace }

document: (WORD   {Console.WriteLine($WORD.text);} |
           NUMBER {Console.WriteLine($NUMBER.text);})*;
Note that this grammar will produce class mygrammar instead of class mygrammarParser.
So now we want to implement a "fake" lexer.
I personally used the following pseudo-code:
TokenQueue q = new TokenQueue();
//Do normal lexer stuff and output to q
CommonTokenStream cts = new CommonTokenStream(q);
mygrammar g = new mygrammar(cts);
g.document();
Finally, we need to define TokenQueue. TokenQueue is not strictly necessary but I used it for convenience.
It should have methods to receive the lexer tokens, and methods to output Antlr tokens. So if not using Antlr native tokens one has to implement a convert-to-Antlr-token method.
Also, TokenQueue must implement ITokenSource.
Be aware that it is very important to correctly set the token variables. Initially, I had some problems because I was miscalculating CharPositionInLine. If these variables are incorrectly set, then the parser may fail.
Also, the normal (non-hidden) channel is 0.
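Putting those pieces together, here is a rough sketch of such a TokenQueue, assuming the ANTLR 3 C# runtime (the Antlr.Runtime namespace). Member names can differ slightly between runtime versions, and the Enqueue helper is my own invention, so treat this as an outline rather than drop-in code.
using System.Collections.Generic;
using Antlr.Runtime;

public class TokenQueue : ITokenSource
{
    private const int EofTokenType = -1; // ANTLR's EOF token type

    private readonly Queue<IToken> tokens = new Queue<IToken>();

    // Called by the hand-written lexer for each token it produces.
    public void Enqueue(int type, string text, int line, int charPositionInLine)
    {
        var token = new CommonToken(type, text)
        {
            Line = line,
            CharPositionInLine = charPositionInLine, // must be accurate or the parser may fail
            Channel = 0                              // 0 is the normal (non-hidden) channel
        };
        tokens.Enqueue(token);
    }

    // ITokenSource: CommonTokenStream pulls tokens from here one at a time.
    public IToken NextToken()
    {
        if (tokens.Count > 0)
            return tokens.Dequeue();
        return new CommonToken(EofTokenType); // signal end of input once the queue is drained
    }

    public string SourceName
    {
        get { return "manual lexer"; }
    }
}
With that in place, the pseudo-code above becomes literal: fill the TokenQueue from your lexer, wrap it in a CommonTokenStream, and hand that stream to the generated parser.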
This seems to be working for me so far. I hope others find it useful as well.
I'm open to feedback. In particular, if you find a better way to solve this problem, feel free to post a separate reply.

Go To Statement Considered Harmful?

If the statement above is correct, then why, when I use Reflector on the .NET BCL, do I see it used a lot?
EDIT: let me rephrase: are all the gotos I see in Reflector written by humans, or by the compiler?
I think the following excerpt from the Wikipedia Article on Goto is particularly relevant here:
Probably the most famous criticism of GOTO is a 1968 letter by Edsger Dijkstra called Go To Statement Considered Harmful. In that letter Dijkstra argued that unrestricted GOTO statements should be abolished from higher-level languages because they complicated the task of analyzing and verifying the correctness of programs (particularly those involving loops).
An alternative viewpoint is presented in Donald Knuth's Structured Programming with go to Statements, which analyzes many common programming tasks and finds that in some of them GOTO is the optimal language construct to use.
So, on the one hand we have Edsger Dijkstra (an incredibly talented computer scientist) arguing against the use of the GOTO statement, and specifically against its excessive use, on the grounds that it is a much less structured way of writing code.
On the other hand, we have Donald Knuth (another incredibly talented computer scientist) arguing that GOTO, especially when used judiciously, can actually be the "best" and most optimal construct for a given piece of program code.
Ultimately, IMHO, I believe both men are correct. Dijkstra is correct in that overuse of the GOTO statement certainly makes a piece of code less readable and less structured, and this is certainly true when viewing computer programming from a purely theoretical perspective.
However, Knuth is also correct as, in the "real world", where one must take a pragmatic approach, the GOTO statement when used wisely can indeed be the best choice of language construct to use.
The above isn't really correct - it was a polemical device used by Dijkstra at a time when gotos were about the only flow control structure in use. In fact, several people have produced rebuttals, including Knuth's classic "Structured Programming Using Goto" paper (title from memory). And there are some situations (error handling, state machines) where gotos can produce clearer code (IMHO), than the "structured" alternatives.
These goto's are very often generated by the compiler, especially inside enumerators.
The compiler always knows what she's doing.
If you find yourself in the need to use goto, you should make sure it is the only option. Most often you'll find there's a better solution.
Other than that, there are very few instances in which the use of goto can be justified, such as when using nested loops. Again, there are other options even in this case. You could move the inner loops into a function and use a return statement instead. You just need to look closely at whether the additional method call is really too costly.
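For example, here is a minimal sketch of that refactoring; the names ProcessAll and TryProcess and the element types are made up purely for illustration.
using System.Collections.Generic;

static class Example
{
    // Instead of "goto done;" after both loops, the loops live in their own
    // method and a plain return escapes them both at once.
    static void ProcessAll(IEnumerable<int[]> groups)
    {
        foreach (int[] group in groups)
        {
            foreach (int item in group)
            {
                if (!TryProcess(item))
                    return; // bail out of both loops
            }
        }
    }

    static bool TryProcess(int item)
    {
        // made-up work; false means "stop processing"
        return item < 100;
    }
}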
In response to your edit:
No, not all gotos are compiler generated, but a lot of them result from compiler-generated state machines (enumerators), switch-case statements or optimized if-else structures. There are only a few instances where you'll be able to judge whether it was the compiler or the original developer. You can get a good hint by looking at the function/class name; a compiler will generate "forbidden" names to avoid name clashes with your code. If everything looks normal and the code has not been optimized or obfuscated, the use of goto was probably intentional.
Keep in mind that the code you are seeing in Reflector is a disassembly -- Reflector is looking at the compiled IL and trying to piece together the original source code.
With that, you must remember that rules against gotos apply to high-level code. All the constructs that are used to replace gotos (for, while, break, switch, etc.) compile down to code using jumps.
So, Reflector looks at code much like this:
A:
    if (!(a > b))
        goto B;
    DoStuff();
    goto A;
B: ...
And must realize that it was actually coded as:
while (a > b)
    DoStuff();
Sometimes the code being read is too complicated for it to recognize the pattern.
The goto statement itself is not harmful; it is even pretty useful sometimes. What is harmful is programmers who put it in inappropriate places in their code.
When compiled down to machine code, all control structures are converted to (un)conditional jumps. However, the optimizer may be too aggressive, and when the decompiler cannot identify which control structure a jump pattern corresponds to, it emits the always-correct statement, i.e. goto label;.
This has nothing to do with the harm(ful|less)ness of goto.
What about a double loop, or many nested loops, that you have to break out of? For example:
foreach (KeyValuePair<DateTime, ChangedValues> changedValForDate in _changedValForDates)
{
    foreach (KeyValuePair<string, int> TypVal in changedValForDate.Value.TypeVales)
    {
        RefreshProgress("Daten werden geändert...", count++, false);
        if (IsProgressCanceled)
        {
            goto TheEnd; //I like goto :)
        }
    }
}
TheEnd:
In this case you should consider that the same thing can be done with break:
foreach (KeyValuePair<DateTime, ChangedValues> changedValForDate in _changedValForDates)
{
    foreach (KeyValuePair<string, int> TypVal in changedValForDate.Value.TypeVales)
    {
        RefreshProgress("Daten werden geändert...", count++, false);
        if (IsProgressCanceled)
        {
            break; //I miss goto :|
        }
    }
    if (IsProgressCanceled)
    {
        break; //I really miss goto now :|
    }//waaaAA !! so many brackets, I hate'm
}
The general rule is that you don't need to use goto. As with any rule there are of course exceptions, but as with any exceptions they are few.
The goto command is like a drug. If it's used in limited amounts only in special situations, it's good. If you use too much all the time, it will ruin your life.
When you are looking at the code using Reflector, you are not seeing the actual code. You are seeing code that is recreated from what the compiler produced from the original code. When you see a goto in the recreated code, it's not certain that there was a goto in the original code. There might be a more structured command to control the flow, like a break or a continue, which has been implemented by the compiler in the same way as a goto, so that Reflector can't tell the difference.
goto is considered harmful for humans to use, but for computers it's okay, because no matter how madly we (humans) use goto, the compiler always knows how to read the code.
Believe me...
Reading other people's code with gotos in it is HARD. Reading your own code with gotos in it is HARDER.
That is why you see it used in low-level (machine) languages and not in high-level (human) languages, e.g. C#, Python... ;)
"C provides the infinitely-abusable goto statement, and labels to branch to. Formally, the goto is never necessary, and in practice it is almost always easy to write code without it. We have not used goto in this book."
-- K&R (2nd Ed.) : Page 65
I sometimes use goto when I want to perform a termination action:
static void DoAction(params int[] args)
{
    foreach (int arg in args)
    {
        Console.WriteLine(arg);
        if (arg == 93) goto exit;
    }
    //edit:
    if (args.Length > 3) goto exit;
    //Do another gazillion actions you might wanna skip.
    //etc.
    //etc.
exit:
    Console.Write("Delete resource or whatever");
}
So instead of hitting return, I jump to the last line, which performs one final action that I can reach from various places in the method, instead of just terminating.
In decompiled code, virtually all gotos that you see will be synthetic. Don't worry about them; they're an artifact of how the code is represented at the low level.
As to valid reasons for putting them in your own code? The main one I can think of is where the language you are using does not provide a control construct suitable for the problem you are tackling; languages which make it easy to make custom control flow systems typically don't have goto at all. It's also always possible to avoid using them at all, but rearranging arbitrarily complex code into a while loop and lots of conditionals with a whole battery of control variables... that can actually make the code even more obscure (and slower too; compilers usually aren't smart enough to pick apart such complexity). The main goal of programming should be to produce a description of a program that is both clear to the computer and to the people reading it.
Whether or not it's harmful is a matter of each person's likes and dislikes. I personally don't like them, and find them very unproductive, as they work against the maintainability of the code.
Now, one thing is how gotos affect our reading of the code; another is how the jitter behaves when it finds one. From Eric Lippert's blog, I'd like to quote:
We first run a pass to transform loops into gotos and labels.
So, in fact the compiler transforms pretty much every flow control structure into a goto/label pattern while emitting IL. When Reflector reads the IL of the assembly, it recognizes the pattern and transforms it back into the appropriate flow control structure.
In some cases, when the emitted code is too complicated for Reflector to understand, it just shows you C# code that uses labels and gotos, equivalent to the IL it's reading. This is the case, for example, when implementing IEnumerable<T> methods with yield return and yield break statements. Those kinds of methods get transformed into their own classes implementing the IEnumerable<T> interface using an underlying state machine. I believe you'll find lots of these cases in the BCL.
GOTO can be useful, if it's not overused as stated above. Microsoft even uses it in several instances within the .NET Framework itself.

Is there a way to mark up code to tell ReSharper not to format it?

I quite often use the ReSharper "Clean Up Code" command to format my code to our coding style before checking it into source control. This works well in general, but some bits of code are better formatted manually (e.g. because of the indenting rules in ReSharper, things like chained LINQ methods or multi-line ternary operators get a strange indent that pushes them way to the right).
Is there any way to mark up parts of a file to tell ReSharper not to format that area? I'm hoping for some kind of markup similar to how ReSharper suppresses other warnings/features. If not, is there some way of changing a combination of settings to get ReSharper to format the indenting correctly?
EDIT:
I have found this post from the ReSharper forums that says that generated code sections (as defined in the ReSharper options page) are ignored in code cleanup. Having tried it, though, it doesn't seem to be ignored.
Resharper>Options>Languages>C#>Formatting Style>Other>
Uncheck "Indent anonymous method body" and "Indent array, object and collection initializer blocks" and anything else that strikes your fancy.
As a last resort, if you've got legacy code that you don't want to format but you want additions to the class to be nicely formatted, then make the class partial and put new code in the new file.
Check out this question I asked that involves the same issue: Resharper formatting code into a single line
The answer I got there works really well for me.

Where do Label_ markers come from in Reflector and how to decipher them?

I'm trying to understand a method using the disassembly feature of Reflector. As anyone that's used this tool will know, certain code is displayed with C# labels that were (presumably) not used in the original source.
In the 110 line method I'm looking at there are 11 label statements. Random snippet examples:
Label_0076:
    if (enumerator.MoveNext())
    {
        goto Label_008F;
    }
    if (!base.IsValid)
    {
        return;
    }
    goto Label_0219;

Label_0087:
    num = 0;
    goto Label_01CB;

Label_01CB:
    if (num < entityArray.Length)
    {
        goto Label_0194;
    }
    goto Label_01AE;

Label_01F3:
    num++;
    goto Label_01CB;
What sort of code makes Reflector display these labels everywhere and why can't it disassemble them?
Is there a good technique for deciphering them?
Actually, the C# compiler doesn't do much of any optimization - it leaves that to the JIT compiler (or ngen). As such, the IL it generates is pretty consistent and predictable, which is why tools like Reflector are able to decompile IL so effectively. One situation where the compiler does transform your code is in an iterator method. The method you're looking at probably contained something along the lines of:
foreach (var x in something)
    if (x.IsValid)
        yield return x;
Since the iterator transformation can be pretty complex, Reflector can't really deal with it. To get familiar with what to look for, write your own iterator methods and run them through Reflector to see what kind of IL gets generated based on your C# code. Then you'll know what to look for.
You're looking at code generated by the compiler. The compiler doesn't respect you. No, really. It doesn't respect me or anybody else, either. It looks at our code, scoffs at us, and rewrites it to run as efficiently as possible.
Nested if statements, recursion, "yield"s, case statements, and other code shortcuts will result in weird-looking code. And if you're using lambdas with lots of closures, well, don't expect it to be pretty.
Anywhere and any time the compiler can rewrite your code to make it run faster, it will. So there isn't any one "sort of code" that will cause this. Reflector does its best to disassemble, but it can't divine the author's original code from its rewritten version. It does its best (which sometimes is even incorrect!) to translate IL into some form of acceptable code.
If you're having a hard time deciphering it, you could manually edit the code to inline gotos that are only jumped to once and refactor gotos that are targeted more than once into method calls. Another alternative is to disassemble into another language. The code that translates IL into higher-level languages isn't the same. The C++/CLI decompiler may do a better job for you and still be similar enough (find/replace -> with .) to be understandable.
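As an illustration of that manual deciphering, the Label_0087 / Label_01CB / Label_01F3 snippets above follow the classic initialise / test / increment shape, so they can be rewritten by hand as an ordinary for loop (the body at Label_0194 is elided, since it isn't shown in the question):
// Hand-reconstructed from the labelled snippets above:
//   Label_0087: num = 0;                                  -> initialiser
//   Label_01CB: if (num < entityArray.Length) goto body;  -> loop test (otherwise exit to Label_01AE)
//   Label_01F3: num++; goto Label_01CB;                   -> increment
for (int num = 0; num < entityArray.Length; num++)
{
    // ... the code at Label_0194 (the loop body) goes here ...
}
// ... execution continues here, at what was Label_01AE ...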
There really isn't a silver bullet for this; at least not until somebody writes a better disassembler plugin.
