I'm trying to write a C# library to manipulate my C/C++ header files. I want to be able to read and parse the header files and manipulate function prototypes and data structures in C#. I'm trying to avoid writing a full C parser, due to all the code branches caused by #ifdefs and the like.
I've tried playing around with EnvDTE, but couldn't find any decent documentation.
Any ideas on how I can do it?
Edit -
Thank you for the answers... Here are some more details about my project: I'm writing a ptrace-like tool for Windows using the debugging APIs, which enable me to trace my already-compiled binaries and see which Windows APIs are being called. I also want to see which parameters are given in each call and what return values come back, so I need to know the definitions of the APIs. I also want to know the definitions for my own libraries (hence the header-parsing approach). I thought of 3 solutions:
* Parsing the header files
* Parsing the PDB files (I wrote a prototype using the DIA SDK, but unfortunately, the symbol PDBs contained only general info about the APIs and not the real prototypes with parameters and return values)
* Crawling over the MSDN online library (automatically or manually)
Is there any better way of getting the names and types of Windows APIs and my libraries at runtime in C#?
Parsing C (even "just" headers) is hard; the language is more complex than people remember,
and then there's the preprocessor, and finally the problem of doing something with the parse. C++ includes essentially all of C, and with C++11 now here, the problem is even worse.
People can often hack a 98% solution for a limited set of inputs, often with regexes in Perl or some other string hackery. If that works for you, then fine. Usually what happens is that the remaining 2% causes the hacked parser to choke or produce the wrong answer, and then you get to debug the result and hand-patch the 98% solution's output.
Hacked solutions tend to fail pretty badly on real header files, which seem to concentrate weirdness in macros and conditionals (sometimes even to the point of mixing different dialects of C and C++ in the conditional arms). See a typical Microsoft .h file as an example; this appears to be what the OP wants to process. Preprocessing gets rid of part of the problem, and now you get to encounter the real complexity of C and/or C++. You won't get a 98% solution for real header files even with preprocessing; you need typedefs, and therefore name and type resolution, too. You might "parse" FOO X; that tells you that X is of type FOO... oops, what's that? Only a symbol table knows for sure.
GCCXML does all this preprocessing, parsing, and symbol table construction ... for the GCC dialect of C. Microsoft's dialect is different, and I don't think GCCXML can handle it.
A more general tool is our DMS Software Reengineering Toolkit, with its C front end; there's also a C++ front end (yes, they're different; C and C++ aren't the same language by a long shot). These process a wide variety of C dialects (both MS and GCC, when configured properly), do macro/conditional expansion, and build an AST and a symbol table (doing that name and type resolution stuff correctly).
You can add customization to extract the information you want, by crawling over the symbol table structures produced. You'll have to export what you want to C# (e.g., generate your C# classes), since DMS isn't implemented in a .NET language.
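For illustration, here's a minimal sketch of the kind of C# data model such an export might target; all the type and member names below are hypothetical, not part of DMS:

using System.Collections.Generic;

// Hypothetical container for one extracted prototype; a real export
// would also carry typedef-resolved types, calling convention, etc.
public class FunctionPrototype
{
    public string Name;
    public string ReturnType;
    public List<KeyValuePair<string, string>> Parameters =
        new List<KeyValuePair<string, string>>();  // (type, name) pairs
}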
In the most general case, header files are only usable, not convertible.
This is due to the possibility of preprocessor (#define) macros, fragments of structures, constants, etc., which only acquire meaning when used in context.
Examples:

anything with ## (token pasting) in macros

or

// header
#define mystructconstant "bla", "bla"

// in the using .c file
char test[2][4] = {mystructconstant};
But you can't simply discard all macros either, since then you won't process the very common calling-convention macros, etc.
So header parsing and conversion is mostly only possible for semi-automated use (manually running cleaned-up headers through it) or for reasonably clean and consistent headers (like, e.g., the older MS SDK headers).
Since the general case is so hard, there isn't much readily available; everybody crafts something quick and dirty for their own headers.
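As a concrete illustration of "quick and dirty" (and of its limits), a minimal C# sketch along these lines might look like the following; it strips a few known calling-convention macros and then regex-matches prototypes, which is exactly the kind of 98% hack that chokes on real headers:

using System;
using System.IO;
using System.Text.RegularExpressions;

class HeaderScan
{
    static void Main(string[] args)
    {
        string text = File.ReadAllText(args[0]);
        // Strip a few common calling-convention macros first, otherwise
        // they end up glued onto the return type.
        text = Regex.Replace(text, @"\b(WINAPI|APIENTRY|__stdcall|__cdecl)\b", "");
        // Naive match: "<return-type> <name>(<params>);"
        var proto = new Regex(@"^\s*([A-Za-z_][\w\s\*]*?)\s+([A-Za-z_]\w*)\s*\(([^;{}]*)\)\s*;",
                              RegexOptions.Multiline);
        foreach (Match m in proto.Matches(text))
            Console.WriteLine("{0}: returns '{1}', params '{2}'",
                m.Groups[2].Value, m.Groups[1].Value.Trim(), m.Groups[3].Value.Trim());
    }
}

This works on clean prototypes and fails on multi-line macros, #ifdef arms, and typedef'd return types, as described above.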
The only more general tool that I know of is SWIG.
Related
Is there a tool that analyzes an executable and detects:
- the programming language used (compiler),
- frameworks used (Qt, GTK, .NET, wxWidgets, etc.),
- other useful information (compression, etc.)?
I know it is sometimes quite hard to tell the programming language (especially for C or Pascal executables), but is it possible to tell the language or compiler used? (Delphi generates exes differently, as does VB6, for instance.)
It may be possible, e.g., with dependency analysis of the DLLs, headers, etc.
Thanks.
On GNU systems, you can use several tools to try to guess the information you want:
ldd to resolve the shared libraries linked to the binary
nm to list the symbols imported/exported by the binary
strings to dump the strings embedded in the binary
objdump, which can be useful too
A hex editor can also help.
I guess there are similar tools on the Windows platform: dumpbin.exe is similar to nm, and Depends.exe to ldd, IIRC.
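If it helps, here's a minimal C# sketch of the Windows side, assuming dumpbin.exe (from a Visual Studio install) is on the PATH; its /imports option lists the DLLs and functions the binary links against, which is often enough to spot Qt, .NET, and the like:

using System;
using System.Diagnostics;

class ImportDump
{
    static void Main(string[] args)
    {
        var psi = new ProcessStartInfo("dumpbin.exe", "/imports \"" + args[0] + "\"");
        psi.RedirectStandardOutput = true;
        psi.UseShellExecute = false;
        using (Process p = Process.Start(psi))
        {
            // The import table names DLLs (e.g. QtCore4.dll, mscoree.dll)
            // that hint at the framework and runtime used.
            Console.Write(p.StandardOutput.ReadToEnd());
            p.WaitForExit();
        }
    }
}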
Btw, Java is often bytecode-compiled, not native.
I have used PEInfo in the past (at uni), but it did not give the information you want. After that I used Reflector, as I knew my DLLs/EXEs were .NET.
But I think there is no software to do that.
Workaround: the best thing you can do is look at the strings in the exe (for example, using Process Explorer) and guess yourself.
Open your executable in a binary file viewer and look for strings that look like function names. These strings are not always available, but in certain cases they are present; they can be used to resolve links with DLLs, for example. Then google those strings. There is a chance that they will tell you something.
I don't think there is a way to do this correctly. Maybe some basic programming languages can be detected, but nobody can detect every framework used; there are thousands of frameworks.
I am about to embark on a project to connect two programs, one in C# and one in C++. I already have a working C# program, which is able to talk to other versions of itself. Before I start on the C++ version, I've thought of some issues:
1) I'm using protobuf-net v1. I take it the .proto files from the serializer are exactly what are required as templates for the C++ version? A Google search mentioned something about Pascal casing, but I have no idea if that's important.
2) What do I do if one of the .NET types does not have a direct counterpart in C++? What if I have a decimal or a Dictionary? Do I have to modify the .proto files somehow and squish the data into a different shape? (I shall examine the files and see if I can figure it out.)
3) Are there any other gotchas that people can think of? Binary formats and things like that?
EDIT
I've had a look at one of the .proto files now. It seems the .NET-specific stuff is tagged, e.g. bcl.DateTime or bcl.Decimal, and the subtypes are included in the proto definitions. I'm not sure what to do about the bcl types, though. If my C++ program sees a decimal, what will it do?
Yes, the .proto files should be compatible. The casing is about conventions, which shouldn't affect actual functionality, just the generated code etc.
It's not whether there's a directly comparable type in .NET that's important; it's whether protocol buffers support the type. Protocol buffers are mostly pretty primitive; if you want to build up anything bigger, you'll need to create your own messages.
The point of protocol buffers is to make it all binary-compatible on the wire, so there really shouldn't be gotchas... read the documentation to find out about versioning policies etc. The only thing I can think of is that, in the Java version at least, it's a good idea to make enum fields optional and give the enum type itself a zero value of "unknown", which will be used if you try to deserialize a new value that isn't supported by the deserializing code yet.
Some minor additions to Jon's points:
protobuf-net v1 does have a GetProto method which may help as a starting point; however, for interop purposes I would recommend starting from a .proto. protobuf-net can work this way around too, either via "protogen" or via the VS add-in.
Other than that, you shouldn't have any issues as long as you remember to treat all files as binary; opening files in text mode will cause grief.
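To make that concrete, a minimal protobuf-net v1 round-trip might look like this (the Order type here is just a hypothetical example):

using System.IO;
using ProtoBuf;

[ProtoContract]
public class Order
{
    [ProtoMember(1)] public int Id;
    [ProtoMember(2)] public string Customer;
}

class Demo
{
    static void Main()
    {
        // File.Create opens the stream in binary mode, which is what you want;
        // text mode (including text-mode fopen on the C++ side) will corrupt data.
        using (var file = File.Create("order.bin"))
            Serializer.Serialize(file, new Order { Id = 1, Customer = "example" });

        // Emit a .proto definition that protoc can compile for the C++ side.
        File.WriteAllText("order.proto", Serializer.GetProto<Order>());
    }
}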
I'm working on some tools for a game I'm making. The tools serve as a front end to make editing game files easier. Several of the files are Python scripting files. For instance, I have an Items.py file that contains the following (minimized for this example):
from ItemModule import *
import copy

class ScriptedItem(Item):
    def __init__(self, name, description, itemtypes, primarytype, flags, usability, value, throwpower):
        Item.__init__(self, name, description, itemtypes, primarytype, flags, usability, value, throwpower, Item.GetNextItemID())

    def Clone(self):
        return copy.deepcopy(self)

ItemLibrary.AddItem(ScriptedItem("Abounding Crystal", "A colourful crystal composed of many smaller crystals. It gives off a warm glow.", ItemType.SynthesisMaterial, ItemType.SynthesisMaterial, 0, ItemUsage.Unusable, 0, 50))
As I mentioned, I want to provide a front end for editing this file without requiring an editor to know Python or edit the file directly. My editor needs to be able to:
Find and list all the class types (in this example, it'd be only ScriptedItem)
Find and list all created items (in this case there'd be only one, Abounding Crystal). I'd need to find the type (in this case ScriptedItem) and all the parameter values
Allow editing of parameters and the creation/removal of items.
To do this, I started writing my own parser, looking for the class keyword and for where these recorded classes are used to construct objects. This worked for simple data, but when I started using classes with complex constructors (lists, maps, etc.) it became increasingly difficult to parse correctly.
After searching around, I found IronPython made it easy to parse Python files, so that's what I went about doing. Once I built the abstract syntax tree I used PythonWalkers to identify and find all the information I need. This works perfectly for reading in data, but I don't see an easy way to push updated data back into the Python file. As far as I can tell, there's no way to change the values in the AST, much less convert the AST back into a script file. If I'm wrong, I'd love for someone to tell me how I could do this. What I'd need to do now is search through the file until I find the correct line, then try to push the data into the constructor, ensuring correct ordering.
Is there some obvious solution I'm not seeing? Should I just keep working on my parser and make it support more complex data types? I really thought I had it with the IronPython parser, but I didn't think about how tricky it'd be to push modified data back into the file.
Any suggestions would be appreciated.
You want a source-to-source program transformation tool.
Such a tool parses a language to an internal data structure (invariably an AST), allows you to modify the AST, and then can regenerate source text from the modified AST without changing essentially anything about the source except where the AST changes were made.
Such a program transformation tool has to parse text to ASTs and "anti-parse" (usually called "prettyprint") ASTs back to text. If IronPython has a prettyprinter, that's what you need.
If it doesn't, you can build one with some (maybe a lot of) effort; as you've observed, this isn't as easy as one might think. See my answer to Compiling an AST back to source code.
If that doesn't work, our DMS Software Reengineering Toolkit with its Python front end might do the trick. It has all the above properties.
Provided you can find a complete and up-to-date context-free grammar file for Python, you could use the CoCo/R parser generator to generate a Python parser in C#.
You can add production code to the grammar file itself to populate a data structure in your C# app. Said data structure can hold all the information you need (methods and their arguments, properties, constructors, destructors, etc.). Once you have this data structure, it's just a task of designing a front end for the user and representing the data structure in a way that makes it editable for them (this is more of a design task than a complicated programming task).
Finally, iterate through your data structure and write out a .py file.
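The data structure itself can be as simple as something like this (a sketch; the names are illustrative, not part of CoCo/R):

using System.Collections.Generic;

// One record per "ItemLibrary.AddItem(...)" call found by the parser.
public class ParsedItem
{
    public string ClassName;              // e.g. "ScriptedItem"
    public List<string> ConstructorArgs = // raw argument expressions, in order
        new List<string>();
}

public class ParsedScript
{
    public List<string> ClassNames = new List<string>();
    public List<ParsedItem> Items = new List<ParsedItem>();
}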
You can use the Python inspect module to print the source of an object; in your case, to print the source of your module (the file you just parsed with IronPython). I haven't checked whether inspect works with IronPython yet, though.
As for adding stuff: well, it's a module, right? You can just add stuff to a module... I'd load the module, alter it, use inspect to print it, and save it to disk.
From your post, it looks like you're already deep in the trenches and having fun, so I'd be really happy to see a post here on how you solved this problem!
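If you try the load-and-alter route from C#, a hedged sketch of hosting IronPython might look like this (whether the ItemLibrary contents are reachable this way depends on what ItemModule exposes, so treat that part as an assumption):

using IronPython.Hosting;
using Microsoft.Scripting.Hosting;

class LoadItems
{
    static void Main()
    {
        ScriptEngine engine = Python.CreateEngine();
        ScriptScope scope = engine.CreateScope();

        // Executing the script gives you live objects instead of source text.
        engine.ExecuteFile("Items.py", scope);

        // "from ItemModule import *" should have pulled ItemLibrary into scope.
        dynamic library = scope.GetVariable("ItemLibrary");
    }
}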
To me it sounds more like you're at the point where you should shove it all into a SQLite database and start editing it that way. Hooking up some forms to edit the tables makes for a simpler UI. From there you generate new Python files by dumping your tables out with some formatting to produce the surrounding Python script.
SVN / Git / whatever can merge the updated changes via the python files.
This is what I ended up doing for my project, at any rate. I started using Python to hook up the various items using their computed keys and then just added some forms UI to avoid editing mistakes in the Python files.
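The dump-back-out step is then just string formatting; a rough sketch (ItemRow is a hypothetical stand-in for your table schema):

using System.IO;

public class ItemRow
{
    public string ClassName;   // e.g. "ScriptedItem"
    public string[] RawArgs;   // constructor arguments, already formatted
}

class PyExport
{
    static void WriteItems(TextWriter w, ItemRow[] rows)
    {
        w.WriteLine("from ItemModule import *");
        foreach (var row in rows)
            w.WriteLine("ItemLibrary.AddItem({0}({1}))",
                row.ClassName, string.Join(", ", row.RawArgs));
    }
}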
I'm writing something that will examine a function and rewrite it in another language. Basically, if inside my function F1 I have this line of code, var x = a.b(1), how do I break up the function body into symbols or "tokens"?
I've searched around and thought that the stuff in System.Reflection.MethodInfo.GetMethodBody would do the trick; however, that class doesn't seem to have the capabilities to do what I want.
What other solutions do we have?
Edit:
Is there any way we can get the "method body" of a method using reflection (as a string or something)?
Edit 2:
Basically, what I'm trying to do is write a program in C#/VB, and when I hit F5 a serializer function will (use reflection and) take the entire program (all the classes in the program) and serialize it into a single JavaScript file. Of course JavaScript doesn't have the .NET library, so the C#/VB program will limit its use of classes to the .js library (a library written in C#/VB emulating the framework of JavaScript objects).
The advantage is that I have type safety while coding my JavaScript classes, and many other benefits like overloading and classes, etc. Since JavaScript doesn't have classes or overloading natively, it relies on hacks to get that done, so the serializer function will write the JavaScript for me based on the C#/VB program input (along with all the hacks and possible optimizations).
I'm trying to code this serializer function.
It sounds like you want a parse tree, which Reflection won't give you. Have a look at NRefactory, which is a VB and C# parser.
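A hedged sketch against the newer NRefactory 5 API (older releases used ParserFactory instead, so adjust to your version):

using System;
using ICSharpCode.NRefactory.CSharp;

class ParseDemo
{
    static void Main()
    {
        string code = "class F { void F1(A a) { var x = a.b(1); } }";
        // Parse to a syntax tree, then walk every node in it.
        SyntaxTree tree = new CSharpParser().Parse(code, "demo.cs");
        foreach (var node in tree.Descendants)
            Console.WriteLine("{0}: {1}", node.GetType().Name, node.ToString());
    }
}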
If you want to do this, the best way would be to parse the C#/VB code with a parser/lexer such as the Gardens Point Parser Generator, flex/bison, or ANTLR, and then, at the token level, reassemble it according to proper JavaScript grammar. There are a few such grammars out there for C# and Java.
See this answer on analyzing and transforming source code
and this one on translating between programming languages.
These assume that you use conventional compiler methods for breaking your text into tokens ("lexing") and grouping related tokens into program structures ("parsing"). If your analysis is anything other than trivial, you'll need all the machinery, or it won't be reliable.
Reflection can only give you what the language designers decided to give you. They invariably don't give you details inside functions.
If you want to go from IL to another language, it may be easier than parsing the source language first. If you want to go this route, consider reading about Microsoft's "Volta" project (IL to JavaScript); while the project is no longer available, there are still old blogs discussing the issues around it.
Note that reflection alone is not enough: reflection gives you a byte array for the body of any particular method (MethodInfo.GetMethodBody().GetILAsByteArray(), http://msdn.microsoft.com/en-us/library/system.reflection.methodbody.aspx) and you have to read it yourself. There are several publicly available "IL reader" libraries.
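For example, getting the raw bytes is a one-liner; everything after that (decoding opcodes, resolving tokens) is where the IL reader library comes in:

using System;
using System.Reflection;

class IlDump
{
    static int Sample(int a) { return a * 2; }

    static void Main()
    {
        MethodInfo mi = typeof(IlDump).GetMethod("Sample",
            BindingFlags.NonPublic | BindingFlags.Static);
        // Raw IL for the method body; still needs an IL reader to decode.
        byte[] il = mi.GetMethodBody().GetILAsByteArray();
        Console.WriteLine("{0} bytes of IL: {1}", il.Length, BitConverter.ToString(il));
    }
}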
I'm writing a console tool to generate some C# code for objects in a class library. The best/easiest way I can actually generate the code is to use reflection after the library has been built. It works great, but this seems like a haphazard approach at best. Since the generated code will be compiled with the library, after making a change I'll need to build the solution twice to get the final result, etc. Some of these issues could be mitigated with a build script, but it still feels like a bit too much of a hack to me.
My question is, are there any high-level best practices for this sort of thing?
It's pretty unclear what you are doing, but what does seem clear is that you have some baseline code, and based on some of its properties, you want to generate more code.
So the key issues here are: given the baseline code, how do you extract interesting properties, and how do you generate code from those properties?
Reflection is a way to extract properties of code running (well, at least loaded) in the same execution environment as the reflecting code. The problem with reflection is that it only provides a very limited set of properties: typically lists of classes, methods, or perhaps names of arguments. If all the code generation you want to do can be done with just that, then reflection seems just fine. But if you want more detailed properties about the code, reflection won't cut it.
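Concretely, this is about all reflection will tell you (a minimal sketch):

using System;
using System.Reflection;

class ReflectDump
{
    static void Main(string[] args)
    {
        // Lists types, methods, and parameter counts; nothing about
        // what the method bodies actually do.
        Assembly asm = Assembly.LoadFrom(args[0]);
        foreach (Type t in asm.GetTypes())
            foreach (MethodInfo m in t.GetMethods())
                Console.WriteLine("{0}.{1} has {2} parameter(s)",
                    t.Name, m.Name, m.GetParameters().Length);
    }
}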
In fact, the only artifact from which truly arbitrary code properties can be extracted is the source code as a character string (how else could you answer: is the number of characters between the add operator and the T in the middle of the variable name a prime number?). As a practical matter, properties you can get from character strings alone are generally not very helpful (see the example I just gave :).
The compiler guys have spent the last 60 years figuring out how to extract interesting program properties, and you'd be a complete idiot to ignore what they've learned in that time.
They have settled on a number of relatively standard "compiler data structures": abstract syntax trees (ASTs), symbol tables (STs), control flow graphs (CFGs), data flow facts (DFFs), program triples, pointer analyses, etc.
If you want to analyze or generate code, your best bet is to process it first into such standard compiler data structures and then do the job. If you have ASTs, you can answer all kinds of questions about what operators and operands are used. If you have STs, you can answer questions about where-defined, where-visible, and what-type. If you have CFGs, you can answer questions about "this before that" and what conditions statement X depends upon. If you have DFFs, you can determine which assignments affect the actions at a point in the code. Reflection will never provide this, IMHO, because it will always be limited to what the runtime system developers are willing to keep around when running a program. (Maybe someday they'll keep all the compiler data structures around, but then it won't be reflection; it will just finally be compiler support.)
Now, after you have determined the properties of interest, what do you do for code generation? Here the compiler guys have been so focused on generating machine code that they don't offer standard answers. The guys that do are the program transformation community (http://en.wikipedia.org/wiki/Program_transformation). Here the idea is to keep at least one representation of your program as ASTs, to provide special support for matching source code syntax (by constructing pattern-match ASTs from the code fragments of interest), and to provide "rewrite" rules that say, in effect, "when you see this pattern, replace it by that pattern under this condition".
By connecting the conditions to the various property-extracting mechanisms from the compiler guys, you get a relatively easy way to say what you want, backed by those 60 years of experience. Such program transformation systems have the ability to read in source code, carry out analyses and transformations, and generally to regenerate code after transformation.
For your code generation task, you'd read the baseline code into ASTs, apply analyses to determine the properties of interest, use transformations to generate new ASTs, and then spit out the answer.
For such a system to be useful, it also has to be able to parse and prettyprint a wide variety of source languages, so that folks other than C# lovers can also have the benefits of code analysis and generation.
These ideas are all reified in the DMS Software Reengineering Toolkit. DMS handles C, C++, C#, Java, COBOL, JavaScript, PHP, Verilog, ... and a lot of other languages.
(I'm the architect of DMS, so I have a rather biased view. YMMV).
Have you considered using T4 templates for the code generation? T4 seems to be getting much more publicity and attention now, with more support in VS2010.
This tutorial seems database-centric, but it may give you some pointers: http://www.olegsych.com/2008/09/t4-tutorial-creatating-your-first-code-generator/ In addition, there was a recent Hanselminutes episode on T4: http://www.hanselminutes.com/default.aspx?showID=170
Edit: Another great place is the T4 tag here on StackOverflow: https://stackoverflow.com/questions/tagged/t4
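For a flavor of what a template looks like, here's a minimal hand-written sketch (the class list is hard-coded for illustration; a real template would reflect over your library instead):

<#@ template language="C#" #>
<#@ output extension=".cs" #>
// Generated file; do not edit by hand.
<# foreach (var name in new[] { "Alpha", "Beta" }) { #>
public partial class <#= name #>Generated { }
<# } #>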
EDIT: (By asker, new developments)
As of VS2012, T4 supports reflecting over an active project in a single step. This means you can make a change to your code, and the compiled output of the T4 template will reflect the newest version without requiring you to perform a second reflect/build step. With this capability, I'm marking this as the accepted answer.
You may wish to use CodeDom, so that you only have to build once.
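A minimal CodeDom sketch, for flavor: you build the type model in memory and emit C# source, so the generator doesn't need the compiled library at all.

using System.CodeDom;
using System.CodeDom.Compiler;
using System.IO;
using Microsoft.CSharp;

class CodeDomDemo
{
    static void Main()
    {
        var unit = new CodeCompileUnit();
        var ns = new CodeNamespace("Generated");
        ns.Types.Add(new CodeTypeDeclaration("Widget") { IsClass = true });
        unit.Namespaces.Add(ns);

        // Render the in-memory model as C# source.
        using (var writer = new StringWriter())
        {
            new CSharpCodeProvider().GenerateCodeFromCompileUnit(
                unit, writer, new CodeGeneratorOptions());
            File.WriteAllText("Widget.cs", writer.ToString());
        }
    }
}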
First, I would read this CodeProject article to make sure there are no language-specific features you'd be unable to support without using Reflection.
From what I understand, you could use something like the Common Compiler Infrastructure (http://ccimetadata.codeplex.com/) to programmatically analyze your existing C# source.
This looks pretty involved to me, though, and CCI apparently only has full support for the C# 2 language spec. A better strategy may be to streamline your existing method instead.
I'm not sure of the best way to do this, but you could:
As a post-build step on your base DLL, run the code generator
As another post-build step, run csc or msbuild to build the generated DLL
Other things that depend on the generated DLL will also need to depend on the base DLL, so the build order remains correct