Pre-Compile - Obfuscate Roslyn Generated Code

I have recently been tasked with coming up with a solution for renaming functions, along with various other obfuscation steps, applied at runtime before the code is compiled. I believe using Roslyn is the way to go, but please provide any insight you may have.
The ultimate goal is as follows:
Allow the end user to select various options, which are then generated into a text version of the assembly at runtime. We then use Roslyn to generate the .exe. I was curious whether it is possible to obfuscate at runtime, before the EXE is even generated. This way I can rename vars, etc.

You can use any tool that can reliably transform C# source code.
Roslyn is one but in a funny way; you can modify the program and produce object code. That should work.
Other Program Transformation Systems (PTS) can do this by modifying the source code. A PTS reads source code, builds compiler data structures (e.g., ASTs), lets you modify the ASTs, and then can regenerate source code from the modified AST. That way you can see the obfuscated code; you can always compile it later with the C# compiler. A good PTS will let you write code transformations in terms of the syntax of the targeted language in a form like this:
if you see *this pattern*, replace it by *that pattern*
expressed below as
rule <name> <patternvariables> "thispattern" -> "thatpattern";
Using a PTS, you can arguably make arbitrary changes to the source code, including function and variable renaming, code flow scrambling and data flow scrambling. For instance, you might use this rule to add confusion:
rule scramble_if_then(c: condition, b: block): statement -> statement
" if (\c) \b " -> "int temp = \c?4:3;
while (temp>3) {\b; temp--; }";
This rule is a bit simple/silly but I think it makes the point that you can write readable source code transformations. If you have many such rules, it will scramble the code a lot, especially if your rules do sophisticated transformations.
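To make the rule above concrete, here is a minimal sketch of the same rewrite. It is not DMS or Roslyn; it assumes Python 3.9+ and uses Python's standard ast module as a stand-in for a transformation system, with Python source standing in for C#. The names ScrambleIf and scramble are made up for illustration. The sketch only matches an if without an else, and it skips bodies containing break or continue, because wrapping those in a new loop would change their meaning.

```python
import ast


class ScrambleIf(ast.NodeTransformer):
    """Rewrite `if c: b` as `t = 4 if c else 3` plus `while t > 3: b; t -= 1`."""

    def __init__(self):
        self.counter = 0

    def visit_If(self, node):
        self.generic_visit(node)  # rewrite nested ifs first
        has_jump = any(
            isinstance(inner, (ast.Break, ast.Continue))
            for stmt in node.body
            for inner in ast.walk(stmt)
        )
        if node.orelse or has_jump:
            return node  # outside the pattern this rule handles
        self.counter += 1
        temp = f"_temp{self.counter}"
        # Build the replacement from source text, then splice in the real parts.
        assign = ast.parse(f"{temp} = 4 if True else 3").body[0]
        assign.value.test = node.test
        loop = ast.parse(f"while {temp} > 3:\n    {temp} -= 1").body[0]
        loop.body = node.body + loop.body
        return [assign, loop]


def scramble(source):
    """Parse, apply the rule everywhere it matches, and regenerate source text."""
    tree = ScrambleIf().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


if __name__ == "__main__":
    print(scramble("def f(x):\n    if x > 0:\n        print(x)\n"))
```

The regenerated text can be inspected before it is compiled, which is the property described above.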
We use our DMS Software Reengineering Toolkit to implement name-scrambling obfuscators, including one for C#.
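As an illustration of name scrambling in general (not of how DMS or Roslyn do it), the following sketch renames parameters and assigned variables by editing the AST and regenerating source. It assumes Python 3.9+ and the standard ast module; obfuscate, Renamer and the _v0, _v1, ... naming scheme are hypothetical. It deliberately ignores scoping, keyword arguments, and global/nonlocal declarations, which a real obfuscator needs a symbol table to handle.

```python
import ast


def collect_bound_names(tree):
    """Names introduced by assignment or as function parameters, in first-seen order."""
    names = []
    for node in ast.walk(tree):
        if isinstance(node, ast.arg):
            candidate = node.arg
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            candidate = node.id
        else:
            continue
        if candidate not in names:
            names.append(candidate)
    return names


class Renamer(ast.NodeTransformer):
    """Replace every occurrence of a mapped name; leave all other names alone."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        node.arg = self.mapping.get(node.arg, node.arg)
        return node


def obfuscate(source):
    tree = ast.parse(source)
    mapping = {name: f"_v{i}" for i, name in enumerate(collect_bound_names(tree))}
    tree = Renamer(mapping).visit(tree)
    return ast.unparse(tree)


if __name__ == "__main__":
    print(obfuscate("def area(width, height):\n    total = width * height\n    return total\n"))
```

Function names are left untouched here so that callers outside the transformed text keep working.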

Related

How to access the AST generated by the Q# compiler?

Background
Part of the project I'm working on requires me to analyze Q# source code and perform specific actions when certain syntax elements are encountered. For example, say I'd like to count how many different gate types are used throughout the program. Now, this could be implemented by walking the Abstract Syntax Tree of the program and performing actions based on the current syntax node.
What I've tried
I've started by analyzing the qsharp-compiler repository, however, the inner workings of the compiler lack online documentation and browsing all the C# and F# sources can be really tedious.
Of course, I could write my own parser for the language, but that would probably be an overkill for the task at hand. There has to be a way to extract the AST from inside of the compiler.
The question
Is there a way to compile Q# source code using the Q# compiler programmatically (from C# or F#), and extract the internal AST?

Yes, it is perfectly possible to compile Q# source code programmatically. This is particularly useful if you want to repeatedly update a compilation: you can add/remove/edit (parts of) the sources and references in memory, and query all kinds of useful information about the current state of the compilation that an IDE, for example, cares about (such as which symbols are defined at a particular location in a certain file).
However, if you just want to process the AST for a Q# compilation, then there is a much easier way! The Q# compiler has an extensibility mechanism that I believe fits your need perfectly.
This blog post gives a brief overview of the feature.
There is also an example for an extension on the compiler repo. This readme (and possibly this one) may also come in handy. I believe this answers half of your question, namely how to easily get access to the built AST.
The other half of the question according to my interpretation is how to conveniently analyze or transform the AST. For that there is also a mechanism provided; the syntax tree transformation framework. That framework consists of a couple of classes that define the walk/transformation for different kinds of nodes, as well as a wrapping class that plugs it all together.
Rather than starting by looking at the definition of the transformations, it is probably more intuitive to just look at some examples that use it. An example that is pretty close to what you want to do can be found here. The implemented transformation adds a comment to each callable listing all identifiers used within the callable. It is invoked as part of a compilation step (see here) that is defined in the example I already linked above.
There are a couple of other good examples of simple transformations that are a bit farther from what you want to do, but should give you an idea of how the whole setup works if you are interested: this one allows attaching attributes to callables, and this one is used to inline conjugations (patterns of the form U*VU).
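The Q# transformation classes themselves are not reproduced here. As a language-neutral sketch of the "walk the tree and act on certain nodes" idea from the question (counting which gates are applied), the following uses Python's standard ast module on Python-syntax input; CallCounter and count_calls are hypothetical names, and the gate-like calls in the demo are only stand-ins for Q# operations.

```python
import ast
from collections import Counter


class CallCounter(ast.NodeVisitor):
    """Count how often each plainly named function is called."""

    def __init__(self):
        self.counts = Counter()

    def visit_Call(self, node):
        if isinstance(node.func, ast.Name):
            self.counts[node.func.id] += 1
        self.generic_visit(node)  # keep walking into arguments


def count_calls(source):
    counter = CallCounter()
    counter.visit(ast.parse(source))
    return counter.counts


if __name__ == "__main__":
    print(count_calls("H(q0)\nCNOT(q0, q1)\nH(q1)\n"))
```

A walker in the Q# framework would follow the same shape: override the handler for the node kind of interest, record what is needed, and let the default traversal continue.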
Last but not least, the Gitter for the Q# community can possibly also be a good resource to engage as you work.

Parse .h header files into c# data structures in runtime

I'm trying to write a C# library to manipulate my C/C++ header files. I want to be able to read and parse the header files and manipulate function prototypes and data structures in C#. I'm trying to avoid writing a C parser, due to all the code branches caused by #ifdefs and stuff like that.
I've tried playing around with EnvDTE, but couldn't find any decent documentation.
Any ideas how I can do it?
Edit -
Thank you for the answers... Here are some more details about my project: I'm writing a ptrace-like tool for Windows using the debugging APIs, which enables me to trace my already compiled binaries and see which Windows APIs are being called. I also want to see which parameters are given in each call and what return values are given, so I need to know the definitions of the APIs. I also want to know the definitions for my own libraries (hence the header parsing approach). I thought of 3 solutions:
* Parsing the header files
* Parsing the PDB files (I wrote a prototype using the DIA SDK, but unfortunately the symbol PDBs contained only general info about the APIs and not the real prototypes with the parameters and return values)
* Crawling over the MSDN online library (automatically or manually)
Is there any better way of getting the names and types for Windows APIs and my libraries at runtime in C#?

Parsing C (even "just" headers) is hard; the language is more complex than people remember, and then there's the preprocessor, and finally the problem of doing something with the parse. C++ includes essentially all of C, and with C++11 here the problem is even worse.
People can often hack a 98% solution for a limited set of inputs, often with regexes in Perl or some other string hackery. If that works for you, then fine. Usually what happens is that 2% causes the hacked parser to choke or to produce the wrong answer, and then you get to debug the result and hand hack the 98% solution output.
Hacked solutions tend to fail pretty badly on real header files, which seem to concentrate weirdness in macros and conditionals (sometimes even to the point of mixing different dialects of C and C++ in the conditional arms). See a typical Microsoft .h file as an example. This appears to be what OP wants to process. Preprocessing gets rid of part of the problem, and now you get to encounter the real complexity of C and/or C++. You won't get a 98% solution for real header files even with preprocessing; you need typedefs and therefore name and type resolution, too. You might "parse" FOO X; that tells you that X is of type FOO... oops, what's that? Only a symbol table knows for sure.
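For a feel of what such a hacked solution looks like and where it breaks, here is a deliberately naive sketch in Python. The PROTOTYPE regex, the extract_prototypes helper and the sample header text are all made up for illustration. It copes with the simple declarations, but it has no idea that WINBASEAPI and WINAPI are macros rather than part of the return type, and it cannot say anything about what FOO is.

```python
import re

# Naive pattern: "<return type> <name>(<params>);" with no nested parentheses.
PROTOTYPE = re.compile(
    r"^\s*(?P<ret>[A-Za-z_][\w\s\*]*?)\s*\b(?P<name>[A-Za-z_]\w*)\s*"
    r"\((?P<params>[^()]*)\)\s*;",
    re.MULTILINE,
)


def extract_prototypes(text):
    """Return (return_type, name, params) tuples for everything the regex matches."""
    return [
        (m.group("ret").strip(), m.group("name"), m.group("params").strip())
        for m in PROTOTYPE.finditer(text)
    ]


HEADER = """\
int add(int a, int b);
const char *name(void);
WINBASEAPI BOOL WINAPI CloseHandle(HANDLE hObject);
FOO X;
"""

if __name__ == "__main__":
    for proto in extract_prototypes(HEADER):
        print(proto)
```

The third declaration comes back with the macros glued onto the return type, and the FOO X; line is skipped entirely; fixing either properly requires preprocessing and a symbol table, not a better regex.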
GCCXML does all this preprocessing, parsing, and symbol table construction ... for the GCC dialect of C. Microsoft's dialect is different, and I don't think GCCXML can handle it.
A more general tool is our DMS Software Reengineering Toolkit, with its C front end; there's also a C++ front end (yes, they're different; C and C++ aren't the same language by a long shot). These process a wide variety of C dialects (both MS and GCC when configured properly), do macro/conditional expansion, and build an AST and a symbol table (doing that name and type resolution stuff correctly).
You can add customization to extract the information you want, by crawling over the symbol table structures produced. You'll have to export what you want to C# (e.g. generate your C# classes), since DMS isn't implemented in a .net language.

In the most general case, header files are only usable, not convertible.
This is due to the possible preprocessor (#define) use of macros, fragments of structures, constants, etc., which only get meaning when used in context.
Examples
anything with ## in macros
or
//header
#define mystructconstant "bla","bla"
// in using .c
char test[10][2] ={mystructconstant};
but you can't simply discard all macros, since then you won't process the very common calling convention macros
etc etc.
So header parsing and conversion is mostly only possible for semi-automated use (manually run cleaned-up headers through it) or for reasonably clean and consistent headers (e.g. the older MS SDK headers).
Since the general case is so hard, there isn't much readily available. Everybody crafts something quick and dirty for their own headers.
The only more general tool that I know is SWIG.

Manipulating a Python file from C#

I'm working on some tools for a game I'm making. The tools serve as a front end to making editing game files easier. Several of the files are python scripting files. For instance, I have an Items.py file that contains the following (minimalized for example)
from ItemModule import *
import copy
class ScriptedItem(Item):
    def __init__(self, name, description, itemtypes, primarytype, flags, usability, value, throwpower):
        Item.__init__(self, name, description, itemtypes, primarytype, flags, usability, value, throwpower, Item.GetNextItemID())
    def Clone(self):
        return copy.deepcopy(self)
ItemLibrary.AddItem(ScriptedItem("Abounding Crystal", "A colourful crystal composed of many smaller crystals. It gives off a warm glow.", ItemType.SynthesisMaterial, ItemType.SynthesisMaterial, 0, ItemUsage.Unusable, 0, 50))
As I mentioned, I want to provide a front end for editing this file without requiring an editor to know Python or edit the file directly. My editor needs to be able to:
* Find and list all the class types (in this example, it'd be only ScriptedItem)
* Find and list all created items (in this case there'd only be one, Abounding Crystal). I'd need to find the type (in this case ScriptedItem) and all the parameter values
* Allow editing of parameters and the creation/removal of items.
To do this, I started writing my own parser, looking for the class keyword and for where these recorded classes are used to construct objects. This worked for simple data, but when I started using classes with complex constructors (lists, maps, etc.) it became increasingly difficult to parse correctly.
After searching around, I found IronPython made it easy to parse Python files, so that's what I went about doing. Once I built the Abstract Syntax Tree I used PythonWalkers to identify and find all the information I need. This works perfectly for reading in data, but I don't see an easy way to push updated data into the Python file. As far as I can tell, there's no way to change the values in the AST, much less to convert the AST back into a script file. If I'm wrong, I'd love for someone to tell me how I could do this. What I'd need to do now is search through the file until I find the correct line, then try to push the data into the constructor, ensuring correct ordering.
Is there some obvious solution I'm not seeing? Should I just keep working on my parser and make it support more complex data types? I really thought I had it with the IronPython parser, but I didn't think about how tricky it'd be to push modified data back into the file.
Any suggestions would be appreciated

You want a source-to-source program transformation tool.
Such a tool parses a language to an internal data structure (invariably an AST), allows you to modify the AST, and then can regenerate source text from the modified AST without changing essentially anything about the source except where the AST changes were made.
Such a program transformation tool has to parse text to ASTs, and "anti-parse" (called "Prettyprint") ASTs to text. If IronPython has a prettyprinter, that's what you need.
If it doesn't, you can build one with some (maybe a lot) of effort; as you've observed, this isn't as easy as one might think. See my answer: Compiling an AST back to source code
If that doesn't work, our DMS Software Reengineering Toolkit with its Python front end might do the trick. It has all the above properties.
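A minimal sketch of that parse, modify, regenerate loop, using CPython's own ast module (Python 3.9+ for ast.unparse) rather than IronPython or DMS; whether the asker's C# tool can shell out to a CPython script is an assumption. The SOURCE text is a shortened, hypothetical version of the Items.py file from the question, and list_classes and find_items are made-up helper names. Note that ast.unparse is not a fidelity prettyprinter: it drops comments and normalizes formatting, which is exactly the limitation discussed above.

```python
import ast

SOURCE = '''
class ScriptedItem(Item):
    def Clone(self):
        return copy.deepcopy(self)

ItemLibrary.AddItem(ScriptedItem("Abounding Crystal", "A colourful crystal.", 0, 50))
'''


def list_classes(tree):
    """Names of all classes defined in the file."""
    return [n.name for n in ast.walk(tree) if isinstance(n, ast.ClassDef)]


def find_items(tree, class_names):
    """Constructor calls of known classes passed as first argument to *.AddItem(...)."""
    items = []
    for node in ast.walk(tree):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "AddItem"
                and node.args
                and isinstance(node.args[0], ast.Call)
                and isinstance(node.args[0].func, ast.Name)
                and node.args[0].func.id in class_names):
            items.append(node.args[0])
    return items


tree = ast.parse(SOURCE)
classes = list_classes(tree)
items = find_items(tree, classes)

# Edit the last constructor argument (throwpower in the question) in place.
items[0].args[-1] = ast.Constant(value=75)
new_source = ast.unparse(tree)

if __name__ == "__main__":
    print(new_source)
```

Each item found this way carries its class name in `func.id` and its parameter values in `args`, which covers the listing requirements; creation and removal amount to adding or deleting statements in `tree.body` before regenerating.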

Provided you can find a complete and up-to-date context-free grammar file for Python, you could use the Coco/R parser generator to generate a Python parser in C#.
You can add production code to the grammar file itself to populate a data structure in your C# app. Said data structure can hold all the information you need (methods and their arguments, properties, constructors, destructors etc). Once you have this data structure, it's just a task of designing a front end for the user and representing this data structure in a way that makes it editable to them (this is more of a design task than a complicated programming task).
Finally, iterate through your data structure and write out a .py file.

You can use the python inspect module to print the source of an object. In your case: To print the source of your module - the file you just parsed with IronPython. I haven't checked to see if inspect works with IronPython yet, though.
As to adding stuff, well, it's a module, right? You can just add stuff to a module... I'd load the module and then alter it, use inspect to print it, and save the result to disk.
From your post, it looks like you're already deep in the trenches and having fun, so I'd be really happy to see a post here on how you solved this problem!

To me it sounds more like you are at the point where you shove it all into a sqlite database and start editing it that way. Hooking up some forms to edit tables is simpler for the UI. At that point you generate new python files by dumping your tables out with some formatting to provide the surrounding python scripts.
SVN / Git / whatever can merge the updated changes via the python files.
This is what I ended up doing for my project at any rate. I started using python to hook up the various items using their computed keys and then just added some forms UI to avoid editing mistakes in the python files.
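A rough sketch of that database-first approach, using Python's standard sqlite3 module. The table layout, the reduced four-argument ScriptedItem constructor and the function names are assumptions made for illustration, not the schema this answer's author actually used.

```python
import sqlite3


def build_db():
    """Create an in-memory database holding one example item."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items (name TEXT, description TEXT, value INTEGER, throwpower INTEGER)"
    )
    conn.execute(
        "INSERT INTO items VALUES (?, ?, ?, ?)",
        ("Abounding Crystal", "A colourful crystal.", 0, 50),
    )
    return conn


def generate_script(conn):
    """Dump every row as one ItemLibrary.AddItem(...) line of Python source."""
    lines = ["from ItemModule import *", ""]
    query = "SELECT name, description, value, throwpower FROM items ORDER BY name"
    for name, description, value, throwpower in conn.execute(query):
        # repr() yields valid Python literals for strings and integers.
        lines.append(
            f"ItemLibrary.AddItem(ScriptedItem({name!r}, {description!r}, "
            f"{value!r}, {throwpower!r}))"
        )
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(generate_script(build_db()))
```

The editor UI then only ever touches the tables; the .py file becomes a generated artifact that is rewritten in full on each save.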

How to parse simple statement into CodeDom object

I need to parse a simple statement (essentially a chain of function calls on some object) represented as a string variable into a CodeDom object (probably a subclass of CodeStatement). I would also like to provide some default imports of namespaces to be able to use less verbose statements.
I have looked around SO and the Internet to find some suggestions but I'm quite confused about what is and isn't possible and what is the simplest way to do it. For example this question seems to be almost what I want, unfortunately I can't use the solution as the CodeSnippetStatement seems not to be supported by the execution engine that I use (the WF rules engine).
Any suggestions that could help me / point me into the right direction ?

There is no library or function to parse C# code into CodeDOM objects as part of the standard .NET libraries. The CodeDOM libraries have some methods that seem to be designed for this, but none of them are actually implemented. As far as I know, there is some implementation available in Visual Studio (used e.g. by designers), but that is only internal.
CodeSnippetStatement is a CodeDOM node that allows you to place any string into the generated code. If you want to create a CodeDOM tree just to generate C# source code, then this is usually fine (the source code generator just prints the string to the output). If the WF engine needs to understand the code in your string (and not just generate source code and compile it), then CodeSnippetStatement won't work.
However, there are 3rd party tools that can be used for parsing C# source code. In one project I worked on, we used NRefactory library (which is used in SharpDevelop) and it worked quite well. It gives you some tree (AST) representing the parsed code and I'm afraid you'll need to convert this to the corresponding CodeDOM tree yourself.
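The conversion step has roughly the following shape. This is only an analogy in Python, not NRefactory or CodeDOM code: Python's ast module plays the role of the parser, and the VariableRef, Primitive and MethodInvoke dataclasses are hypothetical stand-ins for CodeVariableReferenceExpression, CodePrimitiveExpression and CodeMethodInvokeExpression. It handles only a chain of method calls on a named object with simple arguments, and rejects everything else.

```python
import ast
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass(frozen=True)
class VariableRef:
    name: str


@dataclass(frozen=True)
class Primitive:
    value: Any


@dataclass(frozen=True)
class MethodInvoke:
    target: Any
    method: str
    args: Tuple[Any, ...]


def convert(node):
    """Map one parser node onto the corresponding target-tree node."""
    if isinstance(node, ast.Name):
        return VariableRef(node.id)
    if isinstance(node, ast.Constant):
        return Primitive(node.value)
    if (isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and not node.keywords):
        return MethodInvoke(
            convert(node.func.value),
            node.func.attr,
            tuple(convert(arg) for arg in node.args),
        )
    raise ValueError(f"unsupported syntax: {ast.dump(node)}")


def parse_call_chain(text):
    """Parse a single expression and convert it to the target tree."""
    return convert(ast.parse(text, mode="eval").body)


if __name__ == "__main__":
    print(parse_call_chain("order.Lines(0).Total(rate)"))
```

Raising on unsupported syntax is a deliberate choice here: for a rules engine it is safer to reject a statement than to translate it incorrectly.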

I have found a library implementation here that seems to cover pretty much everything I need for my purposes. I don't know if it's robust enough to be used in business scenarios, but for my unit tests it's pretty much all I can ask for.

Using reflection for code gen?

I'm writing a console tool to generate some C# code for objects in a class library. The best/easiest way I can actually generate the code is to use reflection after the library has been built. It works great, but this seems like a haphazard approach at best. Since the generated code will be compiled with the library, after making a change I'll need to build the solution twice to get the final result, etc. Some of these issues could be mitigated with a build script, but it still feels like a bit too much of a hack to me.
My question is, are there any high-level best practices for this sort of thing?

It's pretty unclear what you are doing, but what does seem clear is that you have some baseline code, and based on some of its properties, you want to generate more code.
So the key issues here are: given the baseline code, how do you extract interesting properties, and how do you generate code from those properties?
Reflection is a way to extract properties of code running in (well, at least loaded into) the same execution environment as the reflection user code. The problem with reflection is that it only provides a very limited set of properties, typically lists of classes, methods, or perhaps names of arguments. If all the code generation you want to do can be done with just that, well, then reflection seems just fine. But if you want more detailed properties about the code, reflection won't cut it.
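As a sketch of code generation that needs nothing beyond what reflection offers (class names, method names, parameter names), here is a Python analogue using the standard inspect module in place of .NET reflection. The Account class, describe and generate_proxy are hypothetical examples, not anything from the asker's library.

```python
import inspect


class Account:
    """Example class standing in for a type in the already-built library."""

    def __init__(self):
        self.balance = 0

    def deposit(self, amount, memo):
        self.balance += amount
        return self.balance

    def close(self):
        return "closed"


def describe(cls):
    """Public method names mapped to their parameter names (excluding self)."""
    result = {}
    for name, func in inspect.getmembers(cls, inspect.isfunction):
        if name.startswith("_"):
            continue
        params = [p for p in inspect.signature(func).parameters if p != "self"]
        result[name] = params
    return result


def generate_proxy(cls):
    """Emit source text for a forwarding wrapper around cls."""
    lines = [
        f"class {cls.__name__}Proxy:",
        "    def __init__(self, inner):",
        "        self._inner = inner",
    ]
    for name, params in describe(cls).items():
        arglist = ", ".join(params)
        signature = ", ".join(["self"] + params)
        lines.append(f"    def {name}({signature}):")
        lines.append(f"        return self._inner.{name}({arglist})")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(generate_proxy(Account))
```

Anything that depends on what happens inside a method body (which operators it uses, what it assigns) is out of reach for this approach, which is the limitation the rest of this answer is about.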
In fact, the only artifact from which truly arbitrary code properties can be extracted is the source code as a character string (how else could you answer whether the number of characters between the add operator and the T in the middle of the variable name is a prime number?). As a practical matter, properties you can get from character strings are generally not very helpful (see the example I just gave :).
The compiler guys have spent the last 50 years figuring out how to extract interesting program properties, and you'd be a complete idiot to ignore what they've learned in that half century.
They have settled on a number of relatively standard "compiler data structures": abstract syntax trees (ASTs), symbol tables (STs), control flow graphs (CFGs), data flow facts (DFFs), program triples, pointer analyses, etc.
If you want to analyze or generate code, your best bet is to process it first into such standard compiler data structures and then do the job. If you have ASTs, you can answer all kinds of question about what operators and operands are used. If you have STs, you can answer questions about where-defined, where-visible and what-type. If you have CFGs, you can answer questions about "this-before-that", "what conditions does statement X depend upon". If you have DFFs, you can determine which assignments affect the actions at a point in the code. Reflection will never provide this IMHO, because it will always be limited to what the runtime system developers are willing to keep around when running a program. (Maybe someday they'll keep all the compiler data structures around, but then it won't be reflection; it will just finally be compiler support).
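To make the first two of those concrete, here is a small sketch using Python's standard ast module on Python source (standing in for C#). operators_used answers a "what operators are used" question from the AST, and where_defined is a crude approximation of a symbol table's "where-defined" query that ignores scopes; both function names are made up for illustration.

```python
import ast
from collections import Counter


def operators_used(source):
    """Count arithmetic/unary operator kinds appearing anywhere in the code."""
    tree = ast.parse(source)
    return Counter(
        type(node.op).__name__
        for node in ast.walk(tree)
        if isinstance(node, (ast.BinOp, ast.UnaryOp, ast.AugAssign))
    )


def where_defined(source):
    """Map each assigned name to the sorted line numbers where it is assigned."""
    tree = ast.parse(source)
    definitions = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            definitions.setdefault(node.id, []).append(node.lineno)
    return {name: sorted(lines) for name, lines in definitions.items()}


if __name__ == "__main__":
    code = "a = 1 + 2\nb = a * a\na = -b\n"
    print(operators_used(code))
    print(where_defined(code))
```

Neither question can be answered by reflection over the compiled result, since operator usage and assignment sites are not part of what the runtime keeps around.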
Now, after you have determined the properties of interest, what do you do for code generation? Here the compiler guys have been so focused on generation of machine code that they don't offer standard answers. The guys that do are the program transformation community (http://en.wikipedia.org/wiki/Program_transformation). Here the idea is to keep at least one representation of your program as ASTs, and to provide special support for matching source code syntax (by constructing pattern-match ASTs from the code fragments of interest), and provide "rewrite" rules that say in effect, "when you see this pattern, then replace it by that pattern under this condition".
By connecting the condition to various property-extracting mechanisms from the compiler guys, you get a relatively easy way to say what you want, backed up by that 50 years of experience. Such program transformation systems have the ability to read in source code, carry out analysis and transformations, and generally to regenerate code after transformation.
For your code generation task, you'd read the baseline code into ASTs, apply analyses to determine properties of interest, use transformations to generate new ASTs, and then spit out the answer.
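A toy version of "when you see this pattern, replace it by that pattern under this condition", again using Python's ast module (3.9+) as a stand-in and not DMS rule syntax. The rule rewrites x * 2 to x + x, with the condition that x is a plain variable name, so duplicating it cannot repeat a side effect; DoubleToAdd and rewrite are hypothetical names, and the rule assumes numeric operands.

```python
import ast


class DoubleToAdd(ast.NodeTransformer):
    """Pattern: <name> * 2  ->  <name> + <name>, only if the left side is a bare name."""

    def visit_BinOp(self, node):
        self.generic_visit(node)
        is_times_two = (
            isinstance(node.op, ast.Mult)
            and isinstance(node.right, ast.Constant)
            and type(node.right.value) is int
            and node.right.value == 2
        )
        # The condition: only rewrite when re-evaluating the operand is harmless.
        if is_times_two and isinstance(node.left, ast.Name):
            return ast.BinOp(
                left=node.left,
                op=ast.Add(),
                right=ast.Name(id=node.left.id, ctx=ast.Load()),
            )
        return node


def rewrite(source):
    """Read source in, apply the rule, and regenerate source text."""
    tree = DoubleToAdd().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


if __name__ == "__main__":
    print(rewrite("y = x * 2\nz = f() * 2\n"))
```

The second statement is left alone because f() fails the condition; in a real transformation system that condition would typically be backed by symbol table or data flow information rather than a simple node-type check.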
For such a system to be useful, it also has to be able to parse and prettyprint a wide variety of source code languages, so that folks other than C# lovers can also have the benefits of code analysis and generation.
These ideas are all reified in the DMS Software Reengineering Toolkit. DMS handles C, C++, C#, Java, COBOL, JavaScript, PHP, Verilog, ... and a lot of other languages.
(I'm the architect of DMS, so I have a rather biased view. YMMV).

Have you considered using T4 templates for performing the code generation? It looks like it's getting much more publicity and attention now and more support in VS2010.
This tutorial seems database centric but it may give you some pointers: http://www.olegsych.com/2008/09/t4-tutorial-creatating-your-first-code-generator/ in addition there was a recent Hanselminutes on T4 here: http://www.hanselminutes.com/default.aspx?showID=170.
Edit: Another great place is the T4 tag here on StackOverflow: https://stackoverflow.com/questions/tagged/t4
EDIT: (By asker, new developments)
As of VS2012, T4 now supports reflection over an active project in a single step. This means you can make a change to your code, and the compiled output of the T4 template will reflect the newest version, without requiring you to perform a second reflect/build step. With this capability, I'm marking this as the accepted answer.

You may wish to use CodeDom, so that you only have to build once.
First, I would read this CodeProject article to make sure there are not language-specific features you'd be unable to support without using Reflection.

From what I understand, you could use something like the Common Compiler Infrastructure (http://ccimetadata.codeplex.com/) to programmatically analyze your existing C# source.
This looks pretty involved to me though, and CCI apparently only has full support for C# language spec 2. A better strategy may be to streamline your existing method instead.

I'm not sure of the best way to do this, but you could do the following:
* As a post-build step on your base dll, run the code generator.
* As another post-build step, run csc or msbuild to build the generated dll.
* Other things which depend on the generated dll will also need to depend on the base dll, so the build order remains correct.
