Building an assembler - C#

I need to build an assembler for a CPU architecture that I've built. The architecture is similar to MIPS, but this is of no importance.
I started using C#, although C++ would be more appropriate. (C# means faster development time for me).
My only problem is that I can't come up with a good design for this application. I am building a two-pass assembler, and I know what I need to do in each pass.
I've implemented the first pass and realised that if I have two lines of assembly code on the same line, no error is thrown. This means only one thing: poor parsing techniques.
So, almighty programmers, fathers of assemblers, enlighten me: how should I proceed?
I just need to support symbols and data declaration. Instructions have fixed size.
Please let me know if you need more information.

I've written three or four simple assemblers. Without using a parser generator, what I did was model it on the S-C assembler for the 6502, which I knew best.
To do this, I used a simple syntax - a line was one of the following:
nothing
[label] [instruction] [comment]
[label] [directive] [comment]
A label was one letter followed by any number of letters or numbers.
An instruction was <whitespace><mnemonic> [operands]
A directive was <whitespace>.XX [operands]
A comment was a * up to end of line.
Operands depended on the instruction and the directive.
Directives included
.EQ equate for defining constants
.OR set origin address of code
.HS hex string of bytes
.AS ascii string of bytes - any delimiter except white space - whatever started it ended it
.TF target file for output
.BS n reserve block storage of n bytes
When I wrote it, I wrote simple parsers for each component. Whenever I encountered a label definition, I put it in a table with its target address. Whenever I encountered a reference to a label I didn't know yet, I marked the instruction as incomplete and recorded the unknown label together with a reference to the instruction that needed fixing.
After all source lines had been processed, I looked through the "to fix" table and tried to find an entry in the symbol table; if I found one, I patched the instruction. If not, then it was an error.
I kept a table of instruction names and all the valid addressing modes for operands. When I got an instruction, I tried to parse each addressing mode in turn until something worked.
Given this structure, it should take a day maybe two to do the whole thing.
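To make that concrete, here is a rough C# sketch of the pass-1 symbol-table / fixup-table idea. The mnemonics, the encoding, and the file names are placeholders, not your architecture:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Minimal sketch of the label/fixup approach described above, assuming fixed-size
// instructions (one word each). Mnemonics, encoding and file names are placeholders.
class MiniAssembler
{
    static readonly Dictionary<string, int> Opcodes = new Dictionary<string, int>
    {
        ["NOP"] = 0x0000, ["JMP"] = 0x1000, ["LDA"] = 0x2000   // hypothetical instruction set
    };

    static void Main()
    {
        var symbols = new Dictionary<string, int>();              // label -> address
        var fixups  = new List<(int Index, string Label)>();     // words still waiting on a label
        var output  = new List<int>();                            // assembled words

        foreach (var raw in File.ReadLines("program.asm"))
        {
            var line = raw.Split('*')[0].TrimEnd();               // '*' starts a comment
            if (line.Length == 0) continue;

            var parts = line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries).ToList();

            // Anything in column 0 is a label: record its address (= current word count).
            if (!char.IsWhiteSpace(line[0]))
            {
                symbols[parts[0]] = output.Count;
                parts.RemoveAt(0);
            }
            if (parts.Count == 0) continue;

            if (!Opcodes.TryGetValue(parts[0], out int word))
                throw new Exception($"Unknown mnemonic: {parts[0]}");

            if (parts.Count > 1)                                  // one optional operand
            {
                if (int.TryParse(parts[1], out int value))
                    word |= value;                                // numeric operand
                else if (symbols.TryGetValue(parts[1], out int addr))
                    word |= addr;                                 // label already defined
                else
                    fixups.Add((output.Count, parts[1]));         // forward reference: patch later
            }
            output.Add(word);
        }

        // "Second pass": walk the fixup table and patch, or report an error.
        foreach (var (index, label) in fixups)
        {
            if (symbols.TryGetValue(label, out int addr)) output[index] |= addr;
            else Console.Error.WriteLine($"Undefined label: {label}");
        }

        File.WriteAllLines("program.hex", output.Select(w => w.ToString("X4")));
    }
}

The real work then goes into parsing operands per addressing mode, exactly as described above.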

Look at the Assembler Developer's Kit from Randy Hyde, author of the famous "The Art of Assembly Language":
The Assembler Developer's Kit

The first pass of a two-pass assembler assembles the code and puts placeholders for the symbols (as you don't know how big everything is until you've run the assembler). The second pass fills in the addresses. If the assembled code subsequently needs to be linked to external references, this is the job of the eponymous linker.

If you are to write an assembler that just works, and spits out a hex file to be loaded on a microcontroller, it can be simple and easy. Part of my ciforth library is a full Pentium assembler of about 150 lines, used to add inline definitions. There is an assembler for the 8080 of a couple dozen lines.
The principle is explained at http://home.hccnet.nl/a.w.m.van.der.horst/postitfixup.html.
It amounts to applying the blackboard design pattern to the problem. You start by laying down the instruction, leaving holes for any and all operands. Then you fill in the holes as you encounter the parameters.
There is a strict separation between the generic tool and the instruction set.
In case the assembler you need is just for yourself, and there are no requirements other than usability (it's not a homework assignment), you can find an example implementation at http://home.hccnet.nl/a.w.m.van.der.horst/forthassembler.html. If you dislike Forth, there is also an example implementation in Perl. If the Pentium instruction set is too much to chew, you should still be able to understand the principle and the generic part.
You're advised to have a look at the asi8080.frt file first. This is 389 WOC (Words Of Code, not Lines Of Code). An experienced Forther familiar with the instruction set can crank out an assembler like that in an evening. The Pentium is a bitch.

General purpose plain-text linting tool

I'm looking for a command-line tool where I can specify regex patterns (or similar) for certain file extensions (e.g. cs files, js files, xaml files) that can provide errors/warnings when run, like during a build. These would scan plain-text source code of all types.
I know there are tools for specific languages... I plan on using those too. This tool is for quick patterns we want to flag where we don't want to invest in writing a Roslyn rule, for example. I'd like to flag certain patterns or API usages in an easy way where anyone can add a new rule without thinking too hard. Often we don't add rules because it is hard.
Features like source tokenization are a bonus. Open-source / free is a mega bonus.
Is there such a tool?
If you want to go old-skool, you can dust off Awk for this one.
It scans files line by line (for some configurable definition of a line, with a sane default), cuts them into pieces (on whitespace, IMMSMR), and applies a set of regexes, firing the code behind the matching regex. There are some conditions that match the beginning and end of a file to print headers/footers.
It seems to be what you want, but IMHO a Perl or Ruby script is easier, and replaced Awk for me a long time ago. But it IS simple and straightforward for your use case, AFAICT.
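If you decide to roll a small one yourself instead, the core is not much code. Here is a rough C# sketch; the rules, message text and exit-code convention are made-up examples:

using System;
using System.IO;
using System.Text.RegularExpressions;

// Rough sketch of a regex-based lint runner. The rules here are hard-coded;
// in practice you'd load them from a simple text/JSON file so anyone can add one.
class Lint
{
    // (extension, pattern, message) - example rules only, not recommendations
    static readonly (string Ext, Regex Pattern, string Message)[] Rules =
    {
        (".cs",   new Regex(@"Console\.WriteLine"), "Use the logger instead of Console.WriteLine"),
        (".js",   new Regex(@"\bdebugger\b"),       "Leftover debugger statement"),
        (".xaml", new Regex(@"Foreground=""Red"""), "Hard-coded color"),
    };

    static int Main(string[] args)
    {
        string root = args.Length > 0 ? args[0] : ".";
        int hits = 0;

        foreach (var file in Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories))
        {
            string ext = Path.GetExtension(file);
            int lineNo = 0;
            foreach (var line in File.ReadLines(file))
            {
                lineNo++;
                foreach (var rule in Rules)
                {
                    if (rule.Ext == ext && rule.Pattern.IsMatch(line))
                    {
                        // "path(line): warning CODE: message" style output
                        Console.WriteLine($"{file}({lineNo}): warning LINT: {rule.Message}");
                        hits++;
                    }
                }
            }
        }
        return hits == 0 ? 0 : 1;   // non-zero exit code can fail the build
    }
}

Printing matches in that path(line): warning format means many build systems will surface them alongside compiler warnings.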
Our Source Code Search Engine (SCSE) can do this.
SCSE lexes (using language-accurate tokenization including skipping language-specific whitespace but retaining comments) a set of source files, and then builds a token index for each token type. One can provide SCSE with token-based search queries such as:
'if' '(' I '='
to search for patterns in the source code; this example "lints" C-like code for the common mistake of assigning a variable (I for "identifier") in an IF statement caused by accidental use of '=' instead of the intended '=='.
The search is accomplished using the token indexes to speed up the search. Typically SCSE can search millions of lines of code in a few seconds, far faster than grep or other schemes that insist on reading the file content for each query. It also produces fewer false positives because the token checks are accurate, and the queries are much easier to write because one does not have to worry about white space/line breaks/comments.
A list of hits on the pattern can be logged or merely counted.
Normally SCSE is used interactively; queries produce a list of hits, and clicking on a hit produces a view of a page of the source text with the hit superimposed. However, one can also script calls on the SCSE.
SCSE can be obtained with language-accurate lexers for some 40 languages.

Parse .h header files into c# data structures in runtime

I'm trying to write a C# library to manipulate my C/C++ header files. I want to be able to read and parse the header files and manipulate function prototypes and data structures in C#. I'm trying to avoid writing a C parser, due to all the code branches caused by #ifdefs and stuff like that.
I've tried playing around with EnvDTE, but couldn't find any decent documentation.
Any ideas how can I do it?
Edit -
Thank you for the answers... Here are some more details about my project: I'm writing a ptrace-like tool for Windows using the debugging APIs, which enables me to trace my already compiled binaries and see which Windows APIs are being called. I also want to see which parameters are given in each call and what return values are given, so I need to know the definitions of the APIs. I also want to know the definitions for my own libraries (hence the header parsing approach). I thought of 3 solutions:
* Parsing the header files
* Parsing the PDB files (I wrote a prototype using the DIA SDK, but unfortunately the symbols PDB contained only general info about the APIs and not the real prototypes with the parameters and return values)
* Crawling over the MSDN online library (automatically or manually)
Is there any better way of getting the names and types of Windows APIs and my libraries at runtime in C#?
Parsing C (even "just" headers) is hard; the language is more complex than people remember,
and then there's the preprocessor, and finally the problem of doing something with the parse. C++ includes essentially all of C, and with C++11 here the problem is even worse.
People can often hack a 98% solution for a limited set of inputs, often with regexes in Perl or some other string hackery. If that works for you, then fine. Usually what happens is that 2% causes the hacked parser to choke or to produce the wrong answer, and then you get to debug the result and hand hack the 98% solution output.
Hacked solutions tend to fail pretty badly on real header files, which seem to concentrate weirdness in macros and conditionals (sometimes even to the point of mixing different dialects of C and C++ in the conditional arms). See a typical Microsoft .h file as an example. This appears to be what OP wants to process. Preprocessing gets rid of part of the problem, and now you get to encounter the real complexity of C and/or C++. You won't get a 98% solution for real header files even with preprocessing; you need typedefs and therefore name and type resolution, too. You might "parse" FOO X; that tells you that X is of type FOO... oops, what's that? Only a symbol table knows for sure.
GCCXML does all this preprocessing, parsing, and symbol table construction ... for the GCC dialect of C. Microsoft's dialect is different, and I don't think GCCXML can handle it.
A more general tool is our DMS Software Reengineering Toolkit, with its C front end; there's also a C++ front end (yes, they're different; C and C++ aren't the same language by a long shot). These process a wide variety of C dialects (both MS and GCC when configured properly), do macro/conditional expansion, and build an AST and a symbol table (doing that name and type resolution stuff correctly).
You can add customization to extract the information you want, by crawling over the symbol table structures produced. You'll have to export what you want to C# (e.g. generate your C# classes), since DMS isn't implemented in a .net language.
In the most general case, header files are only usable, not convertible.
This is due to the possibility of the preprocessor (#define) being used for macros, fragments of structures, constants, etc., which only get meaning when used in context.
Examples
anything with ## in macros
or
// header
#define mystructconstant "bla", "bla"
// in the using .c file
char test[2][10] = {mystructconstant};
but you can't simply discard all macros, since then you won't process the very common calling convention macros
etc etc.
So header parsing and conversion is mostly only possible for semi-automated use (manually running cleaned-up headers through it) or for reasonably clean and consistent headers (like e.g. the older MS SDK headers).
Since the general case is so hard, there isn't much readily available. Everybody crafts something quick and dirty for their own headers.
The only more general tool that I know is SWIG.

If I insert a comment in the source code, will its binary's MD5 be changed?

I'm just wondering...
If I were to create a different release for each customer I sold my software to, could I tell them apart with MD5 just by changing a comment inside the source code and recompiling? I mean, will a comment inside C++, C# or Java code change the binary's MD5?
Comments are removed early in (or before) the compilation process, so inserting a comment will not change the hash of the compiled binary.
The only exception (that I can think of) is if your binaries include line numbers, which can change based on comments. Typically this happens when you're compiling in debug mode, but you can also force it using something like the __LINE__ macro in C++. But even in this case, the content of the comment is irrelevant, only how many lines it takes up (so you might as well just use blank lines for that purpose). Besides, released software probably shouldn't include that information.
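If you want to verify this for your own compiler and settings, it's quick to check directly. A minimal C# sketch (the file paths are placeholders for your own output binaries):

using System;
using System.IO;
using System.Security.Cryptography;

// Hash two builds (one with the comment, one without) and compare the digests.
class Md5Check
{
    static string Md5Of(string path)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
            return BitConverter.ToString(md5.ComputeHash(stream));
    }

    static void Main()
    {
        string a = Md5Of(@"build_without_comment\app.exe");
        string b = Md5Of(@"build_with_comment\app.exe");
        Console.WriteLine(a == b ? "Binaries are identical" : "Binaries differ");
    }
}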
Short answer is no. Comments are stripped out very early in the compilation process.
The longer answer is, sometimes - but not reliably. There's a number of foreseeable reasons that vestiges (more like side-effects) of the comment could show up. However, those are fragile at best.
I assume this is for some sort of automated process, like selling the software on a website. How about outputting a header file like "user.h" that simply specifies the name/email/username/etc as a #define'd string, and then printing that somewhere in your program's About screen (both for the user's benefit and so the compiler doesn't optimize it away)? It requires you to recompile your program for each user, but that may not be a problem if it only takes a few seconds to build.
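In C#, a rough equivalent of that generated-header idea (the names below are made up) would be to emit a small source file per customer before the build:

// Generated per customer by the release/build script before compiling.
// Namespace and constant names are arbitrary examples.
namespace MyApp
{
    internal static class CustomerInfo
    {
        public const string Name  = "Jane Example";
        public const string Email = "jane@example.com";
    }
}

Then show CustomerInfo.Name somewhere in the About screen so the compiler can't optimize it away, and every customer's build differs by that constant rather than by a comment.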
In some cases, the binary changes for every build, if there is for instance a build timestamp. This could provide the traceability you want. However, comments should not affect the MD5 of a release mode binary, unless your compiler is buggy.
A comment cannot be compiled to a cpu opcode so it will not change the hash of the blob.
Yes, it can change the binaries.
For example, in C/C++ there's the __LINE__ macro. If this is used in the code, the binary will change if you add or remove a line containing a comment.

Best approach to render MediaWiki in C#?

Question:
I want to render MediaWiki syntax (and I mean MediaWiki syntax as used by Wikipedia, not some other wiki format from some other engine such as WikiPlex), and I want to do it in C#.
Input: MediaWiki Markup string
Output: HTML string
There are some alternative MediaWiki parsers, but nothing in C#, and additionally pinvoking C/C++ looks bleak because of the structure of those libraries.
As syntax guidance, I use
http://en.wikipedia.org/wiki/Wikipedia:Cheatsheet
My first goal is to render that page's markup correctly.
Markup can be seen here:
http://en.wikipedia.org/w/index.php?title=Wikipedia:Cheatsheet&action=edit
Now, if I use Regex, it's not of much use, because one can't exactly say which closing tag ends which opening one, especially when some elements, such as italic, become an attribute of the parent element.
On the other hand, parsing character by character is not a good approach either, because
for example ''' means bold, '' means italic, and ''''' means bold and italic...
I looked into porting some of the other parsers' code, but the Java implementations are obscure, and the Python implementations have a very different regex syntax.
The best approach I see so far would be to port mwlib to IronPython
http://www.mediawiki.org/wiki/Alternative_parsers
But frankly, I'm not looking forward to having the IronPython runtime added as a dependency to my application, and even if I would want to, the documentation is bad at best.
Update per 2017:
You can use ParseoidSharp to get a fully compatible MediaWiki-renderer.
It uses the official Wikipedia Parsoid library via NodeServices.
(NetStandard 2.0)
Since Parsoid is GPL 2.0, and the GPL code is invoked in nodejs in a separate process via the network, you can even use any license you like ;)
Pre-2017
Problem solved.
As originally assumed, the solution lies in using one of the existing alternative parsers in C#.
WikiModel (Java) works well for that purpose.
First attempt was to pinvoke kiwi.
It worked, but failed because:
kiwi uses char* (fails on anything non-English/ASCII)
it's not thread safe
it's bad because of the need to have a native dll in the code for every architecture
(did add x86 and amd64, then it went kaboom on my ARM processor)
Second attempt was mwlib.
That failed because somehow IronPython doesn't work as it should.
Third attempt was Swebele, which essentially turned out to be academic vaporware.
The fourth attempt was using the original mediawiki renderer, using Phalanger. That failed because the MediaWiki renderer is not really modular.
The fifth attempt was using Wiky.php via Phalanger, which worked, but was slow, and Wiky.php doesn't implement MediaWiki very completely.
The sixth attempt was using bliki via ikvmc, which failed because of the excessive use of 3rd party libraries ==> it compiles, but yields null-reference exceptions only
The seventh attempt was using JavaScript in C#, which worked but was very slow, plus the MediaWiki functionality implemented was very incomplete.
The 8th attempt was writing an own "parser" via Regex.
But the time required to make it work is just excessive, so I stopped.
The 9th attempt was successful.
Using ikvmc on WikiModel yields a useful dll.
The problem there was that the example code was hopelessly out of date.
But using Google and the WikiModel source code, I was able to piece it together.
The end-result can be found here:
https://github.com/ststeiger/MultiWikiParser
Why shouldn't this be possible with regular expressions?
inputString = Regex.Replace(inputString, @"(?:''''')(.*?)(?:''''')", "<strong><em>$1</em></strong>");
inputString = Regex.Replace(inputString, @"(?:''')(.*?)(?:''')", "<strong>$1</strong>");
inputString = Regex.Replace(inputString, @"(?:'')(.*?)(?:'')", "<em>$1</em>");
This will, as far as I can see, render all 'Bold and italic', 'Bold' and 'Italic' text.
Here is how I once implemented a solution:
define your regular expressions for Markup->HTML conversion
regular expressions must be non greedy
collect the regular expressions in a Dictionary<char, List<RegEx>>
The char is the first (Markup) character in each RegEx, and RegEx's must be sorted by Markup keyword length desc, e.g. === before ==.
Iterate through the characters of the input string, and check if Dictionary.ContainsKey(char). If it does, search the List for matching RegEx. First matching RegEx wins.
As MediaWiki allows recursive markup (except for <pre> and others), the string inside the markup must also be processed in this fashion recursively.
If there is a match, skip ahead the number of characters matching the RegEx in input string. Otherwise proceed to next character.
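A stripped-down C# sketch of that scheme, handling only the quote markup (real MediaWiki has far more rules, and the class and method names here are just illustrative):

using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

// Stripped-down sketch of the dictionary-of-regexes approach described above.
// Only the ''/'''/''''' markup is handled.
class MiniWikiRenderer
{
    // Keyed by the first markup character; longest markup keyword first (''''' before ''').
    static readonly Dictionary<char, List<(Regex Pattern, string Open, string Close)>> Rules =
        new Dictionary<char, List<(Regex Pattern, string Open, string Close)>>
        {
            ['\''] = new List<(Regex Pattern, string Open, string Close)>
            {
                (new Regex("'''''(.+?)'''''"), "<strong><em>", "</em></strong>"),
                (new Regex("'''(.+?)'''"),     "<strong>",     "</strong>"),
                (new Regex("''(.+?)''"),       "<em>",         "</em>"),
            }
        };

    public static string Render(string input)
    {
        var sb = new StringBuilder();
        int i = 0;
        while (i < input.Length)
        {
            bool matched = false;
            if (Rules.TryGetValue(input[i], out var candidates))
            {
                foreach (var (pattern, open, close) in candidates)
                {
                    var m = pattern.Match(input, i);
                    if (m.Success && m.Index == i)      // first regex matching at this position wins
                    {
                        // Markup can nest, so recurse into the inner text.
                        sb.Append(open).Append(Render(m.Groups[1].Value)).Append(close);
                        i += m.Length;                  // skip ahead past the whole match
                        matched = true;
                        break;
                    }
                }
            }
            if (!matched) { sb.Append(input[i]); i++; }
        }
        return sb.ToString();
    }

    static void Main()
    {
        Console.WriteLine(Render("'''''Bold and italic''''', '''bold''' and ''italic''."));
    }
}

Everything outside the Rules dictionary is generic; adding headings, links and so on is a matter of adding more entries (and handling exceptions like <pre>, which must not be recursed into).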
Kiwi (https://github.com/aboutus/kiwi, mentioned on http://mediawiki.org/wiki/Alternative_parsers) may be a solution. Since it is C based, and I/O is simply done by stdin/stdout, it should not be too hard to create a "PInvoke"-able DLL from it.
As with the accepted solution, I found Parsoid is the best way forward, as it's the official library and has the greatest support for the WikiMedia markup; that said, I found ParseoidSharp to be using obsolete methods such as Microsoft.AspNetCore.NodeServices, and really it's just a wrapper for a fairly old version of Parsoid's npm package.
Since there is a fairly current version of Parsoid in node.js, you can use Jering.Javascript.NodeJS to do the same thing as ParseoidSharp; the steps are fairly similar too.
Install Node.js
Download parsoid (https://www.npmjs.com/package/parsoid) and place the required files in your project.
In PowerShell, cd to your project
npm install
Then it's as simple as
output = Await StaticNodeJSService.InvokeFromFileAsync(Of String)(HttpContext.Current.Request.PhysicalApplicationPath & "./NodeScripts/parsee.js", args:=New Object() {Markup})
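For reference, the rough C# equivalent of that call (assuming the same NodeScripts/parsee.js wrapper script and the static service from Jering.Javascript.NodeJS) would be:

using System.Threading.Tasks;
using Jering.Javascript.NodeJS;

// Roughly the C# equivalent of the VB line above; parsee.js is the same
// wrapper script that calls Parsoid and returns the rendered HTML.
public static async Task<string> RenderAsync(string markup)
{
    return await StaticNodeJSService.InvokeFromFileAsync<string>(
        "NodeScripts/parsee.js",
        args: new object[] { markup });
}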
Bonus: it's now much easier than ParseoidSharp's method to add the options required; e.g. you'll probably want to set the domain to your own domain.

Using reflection for code gen?

I'm writing a console tool to generate some C# code for objects in a class library. The best/easiest way I can actually generate the code is to use reflection after the library has been built. It works great, but this seems like a haphazard approach at best. Since the generated code will be compiled with the library, after making a change I'll need to build the solution twice to get the final result, etc. Some of these issues could be mitigated with a build script, but it still feels like a bit too much of a hack to me.
My question is, are there any high-level best practices for this sort of thing?
It's pretty unclear what you are doing, but what does seem clear is that you have some baseline code, and based on some of its properties, you want to generate more code.
So the key issues here are: given the baseline code, how do you extract interesting properties, and how do you generate code from those properties?
Reflection is a way to extract properties of code running (well, at least loaded) in the same execution environment as the code using reflection. The problem with reflection is that it only provides a very limited set of properties, typically lists of classes, methods, or perhaps names of arguments. If all the code generation you want to do can be done with just that, well, then reflection seems just fine. But if you want more detailed properties about the code, reflection won't cut it.
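For example, the kind of information reflection does give you easily looks like this (a small sketch; the assembly path is a placeholder):

using System;
using System.Linq;
using System.Reflection;

// Small sketch of what reflection readily exposes: types, methods, parameter names
// and types - but not method bodies, comments, or any deeper source structure.
class ReflectionDump
{
    static void Main()
    {
        var assembly = Assembly.LoadFrom("MyClassLibrary.dll");   // placeholder path

        foreach (var type in assembly.GetExportedTypes())
        {
            Console.WriteLine($"class {type.Name}");
            foreach (var method in type.GetMethods(
                BindingFlags.Public | BindingFlags.Instance | BindingFlags.DeclaredOnly))
            {
                var parameters = string.Join(", ",
                    method.GetParameters().Select(p => $"{p.ParameterType.Name} {p.Name}"));
                Console.WriteLine($"    {method.ReturnType.Name} {method.Name}({parameters})");
            }
        }
    }
}

Notice that nothing about statements, expressions, or comments appears anywhere in that output, which is exactly the limitation described above.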
In fact, the only artifact from which truly arbitrary code properties can be extracted is the source code as a character string (how else could you answer: is the number of characters between the add operator and the T in the middle of the variable name a prime number?). As a practical matter, properties you can get from character strings are generally not very helpful (see the example I just gave :).
The compiler guys have spent the last 60 years figuring out how to extract interesting program properties and you'd be a complete idiot to ignore what they've learned in that half century.
They have settled on a number of relatively standard "compiler data structures": abstract syntax trees (ASTs), symbol tables (STs), control flow graphs (CFGs), data flow facts (DFFs), program triples, pointer analyses, etc.
If you want to analyze or generate code, your best bet is to process it first into such standard compiler data structures and then do the job. If you have ASTs, you can answer all kinds of question about what operators and operands are used. If you have STs, you can answer questions about where-defined, where-visible and what-type. If you have CFGs, you can answer questions about "this-before-that", "what conditions does statement X depend upon". If you have DFFs, you can determine which assignments affect the actions at a point in the code. Reflection will never provide this IMHO, because it will always be limited to what the runtime system developers are willing to keep around when running a program. (Maybe someday they'll keep all the compiler data structures around, but then it won't be reflection; it will just finally be compiler support).
Now, after you have determined the properties of interest, what do you do for code generation? Here the compiler guys have been so focused on generation of machine code that they don't offer standard answers. The guys that do are the program transformation community (http://en.wikipedia.org/wiki/Program_transformation). Here the idea is to keep at least one representation of your program as ASTs, and to provide special support for matching source code syntax (by constructing pattern-match ASTs from the code fragments of interest), and provide "rewrite" rules that say in effect, "when you see this pattern, then replace it by that pattern under this condition".
By connecting the condition to the various property-extracting mechanisms from the compiler guys, you get a relatively easy way to say what you want, backed up by that 50 years of experience. Such program transformation systems have the ability to read in source code,
carry out analysis and transformations, and generally to regenerate code after transformation.
For your code generation task, you'd read the baseline code into ASTs, apply analyses to determine properties of interest, use transformations to generate new ASTs, and then spit out the answer.
For such a system to be useful, it also has to be able to parse and prettyprint a wide variety of source code languages, so that folks other than C# lovers can also have the benefits of code analysis and generation.
These ideas are all reified in the
DMS Software Reengineering Toolkit. DMS handles C, C++, C#, Java, COBOL, JavaScript, PHP, Verilog, ... and a lot of other languages.
(I'm the architect of DMS, so I have a rather biased view. YMMV).
Have you considered using T4 templates for performing the code generation? It looks like it's getting much more publicity and attention now and more support in VS2010.
This tutorial seems database centric but it may give you some pointers: http://www.olegsych.com/2008/09/t4-tutorial-creatating-your-first-code-generator/ in addition there was a recent Hanselminutes on T4 here: http://www.hanselminutes.com/default.aspx?showID=170.
Edit: Another great place is the T4 tag here on StackOverflow: https://stackoverflow.com/questions/tagged/t4
EDIT: (By asker, new developments)
As of VS2012, T4 now supports reflection over an active project in a single step. This means you can make a change to your code, and the compiled output of the T4 template will reflect the newest version, without requiring you to perform a second reflect/build step. With this capability, I'm marking this as the accepted answer.
You may wish to use CodeDom, so that you only have to build once.
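A minimal CodeDom sketch (the namespace and class names are just examples) that builds a type in memory and writes it out as C# source you could include in the build:

using System.CodeDom;
using System.CodeDom.Compiler;
using System.IO;
using Microsoft.CSharp;

// Minimal CodeDom example: build a class in memory and emit it as C# source.
class CodeDomDemo
{
    static void Main()
    {
        var compileUnit = new CodeCompileUnit();
        var ns = new CodeNamespace("Generated");
        compileUnit.Namespaces.Add(ns);

        var cls = new CodeTypeDeclaration("GeneratedWidget") { IsClass = true };
        cls.Members.Add(new CodeMemberField(typeof(string), "Name")
        {
            Attributes = MemberAttributes.Public
        });
        ns.Types.Add(cls);

        using (var provider = new CSharpCodeProvider())
        using (var writer = new StreamWriter("GeneratedWidget.cs"))
        {
            provider.GenerateCodeFromCompileUnit(compileUnit, writer, new CodeGeneratorOptions());
        }
    }
}

The generated file can then be included in the project like any hand-written source.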
First, I would read this CodeProject article to make sure there are not language-specific features you'd be unable to support without using Reflection.
From what I understand, you could use something like the Common Compiler Infrastructure (http://ccimetadata.codeplex.com/) to programmatically analyze your existing C# source.
This looks pretty involved to me though, and CCI apparently only has full support for C# language spec 2. A better strategy may be to streamline your existing method instead.
I'm not sure of the best way to do this, but you could do the following:
As a post-build step on your base dll, run the code generator
As another post-build step, run csc or msbuild to build the generated dll
Other things which depend on the generated dll will also need to depend on the base dll, so the build order remains correct
