Polymorphic engines, in managed languages?

I have developed my programming skills to the point where I can do most everyday tasks quite easily, and it occurred to me one day that writing a polymorphic engine would really test my skills. Does anybody have any pointers on making a polymorphic engine for a program: where to start, maybe some code examples? Really, anything would be helpful at this point :)
Here are some of my resources:
http://en.wikipedia.org/wiki/Polymorphic_code <- this is the one I'm particularly interested in
http://en.wikipedia.org/wiki/Polymorphic_engine

As I mention in a comment, this is possible in .NET using the magical System.Reflection.Emit namespace. You just create a new DynamicMethod and emit any [valid] opcodes into it, and then call Invoke.
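As a minimal sketch of the DynamicMethod approach (the class and method names here are made up for illustration, and the emitted body is deliberately trivial), this builds a method that adds two integers at run time:

```csharp
using System;
using System.Reflection.Emit;

public static class EmitSketch
{
    // Builds the equivalent of: static int Add(int a, int b) { return a + b; }
    public static Func<int, int, int> BuildAdder()
    {
        var method = new DynamicMethod(
            "Add",                                  // name, used for diagnostics only
            typeof(int),                            // return type
            new[] { typeof(int), typeof(int) });    // parameter types

        ILGenerator il = method.GetILGenerator();
        il.Emit(OpCodes.Ldarg_0);   // push the first argument
        il.Emit(OpCodes.Ldarg_1);   // push the second argument
        il.Emit(OpCodes.Add);       // pop both, push their sum
        il.Emit(OpCodes.Ret);       // return the value on top of the stack

        // A typed delegate is used here instead of Invoke; both run the emitted IL.
        return (Func<int, int, int>)method.CreateDelegate(typeof(Func<int, int, int>));
    }
}
```

Calling `EmitSketch.BuildAdder()(2, 3)` runs the emitted IL and returns 5.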
I've spent the last few hours trying to build a simple showcase for a "clean" program that would create new copies of itself with encrypted IL code. The approach I went for was to have an Exec method, grab its IL bytes (using MethodBase.GetMethodBody), encrypt them, and emit a new assembly containing the IV, the key and the encrypted bytes. The main method would then decrypt, create a new DynamicMethod, call DynamicILInfo.SetCode and hopefully work. It didn't.
The encryption/decryption thingie worked, and my emitted code was correct. However, it seems that you cannot take the raw bytes from one assembly and just execute them in another.
Data (from BitConverter.ToString) from run A and run B.
A: 28-01-00-00-0A...
B: 28-11-00-00-0A...
Unless you know the byte values for every opcode, open ILDAsm, choose View > Show bytes. There's also a View > Show token values which also helps debugging. Press ctrl-m for View > MetaData > Show! to resolve tokens and other magical creatures.
"28 01 00 00 0A" -> CALL 0A000001 -> [According to ctrl-m] MethodBase.GetCurrentMethod
These different token values are generated sequentially by the compiler. This means that it's impossible to guarantee that everything will work using raw bytes. Just think of the common case where the compiler only created tokens for the method calls required to decrypt your byte array, and you call Console.WriteLine in your encrypted code. No such token is written, and you'll end up with a "BadImageFormatException: Bad binary signature" when invoking your dynamic method.
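To make the difference between run A and run B concrete, here is a small sketch (the helper names are made up) that decodes the token following a one-byte opcode. A metadata token is stored little-endian; its high byte selects the metadata table (0x0A is MemberRef in ECMA-335) and the low three bytes are the row number in that table.

```csharp
using System;

public static class TokenSketch
{
    // Reads the 4-byte little-endian token that follows a one-byte opcode.
    public static uint ReadToken(byte[] il, int opcodeOffset)
    {
        int i = opcodeOffset + 1;
        return (uint)il[i]
             | ((uint)il[i + 1] << 8)
             | ((uint)il[i + 2] << 16)
             | ((uint)il[i + 3] << 24);
    }

    // High byte: which metadata table the token points into.
    public static byte TableOf(uint token)
    {
        return (byte)(token >> 24);
    }

    // Low three bytes: the row within that table.
    public static uint RowOf(uint token)
    {
        return token & 0x00FFFFFF;
    }
}
```

For run A, `ReadToken(new byte[] { 0x28, 0x01, 0x00, 0x00, 0x0A }, 0)` gives 0x0A000001 (table 0x0A, row 1). For run B the same call gives 0x0A000011 (table 0x0A, row 17), so the same call instruction points at a different row in the other assembly's metadata.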
I leave it as a task for the reader (or until I'm bored again) to transform the byte array, during the emitting process, into a format which the decryptor can read and emit as real IL bytes. The emitting process will create all necessary tokens, so it should work.
If you want to chicken out from all the awesomeness of emitting opcodes, do some dynamic compilation from code stored as strings (which can of course be encrypted). This, however, loses in cleverness, coolness and everything else that can be used to measure the pure awesomeness of the developer (YOU!). Check out this tutorial for a quick display of dynamic compilation and execution of C# within strings.

Well, as far as I know, a polymorphic engine is just the code you want to run, stored encrypted and paired with a decryption module. So all you need to do is encrypt your code into a string. Then you write a decrypter. I would use a basic symmetric encryption class like hxxp://www.obviex.com/samples/Code.aspx?Source=EncryptionCS&Title=Symmetric%20Key%20Encryption&Lang=C%23
After that, run the code in memory, something like hxxp://support.microsoft.com/kb/304655
EDIT:
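If you wanted to get more in-depth, you could always write your own encryption/decryption and make it something like Base64 (no key) instead of AES (with a key). Note that Base64 is an encoding rather than encryption, so it only hides the code from a casual glance.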
Hope that helps,
Max

Polymorphic code is not possible in C# or managed languages. It requires you to produce assembly code for a specific platform (.NET is not platform-specific) into a buffer or data area, then jump into that buffer or data area. There are many layers of software and hardware in place to stop that happening - see the NX bit on Wikipedia for example.
You can't do it in managed code. You'd have to write it in unmanaged code and call into that.
Hope that helps.
Please see the helpful comment on this answer: you can dynamically create managed code from managed code. I was only considering unmanaged code, which is what polymorphic code generally uses.

Related

Determine if game has been de-compiled/altered

I'm looking for a Unity function to determine if my game has been de-compiled, recompiled or modified in any way.

Yes, there is a Unity function for this but it can still be circumvented.
This can be done with Application.genuine which returns false when the application is altered in any way after it was built.
// Only trust Application.genuine where the platform supports the check.
if (Application.genuineCheckAvailable)
{
    if (Application.genuine)
    {
        Debug.Log("Not tampered");
    }
}
The problem is that if the person is smart enough to de-compile, modify and recompile the game, he/she can also remove the check above, so the check becomes useless. Any type of program genuineness or authenticity check can be removed as long as it is running on the player's machine.
EDIT
You can make it harder to be circumvented by doing the following:
1. Go to File --> Build Settings... then select your platform.
2. Click on Player Settings --> Other Settings and then change the Scripting Backend from Mono to IL2CPP (C++).
This will make the check harder to circumvent, but it can still be circumvented.

TL;DR: That's frankly not possible.
You can never determine whether your program was decompiled, because there exists no measure to determine whether that happened. And every executable can be disassembled into at least assembler, even if you scramble and screw up your data. You can make your source code hard to understand, though, by using obfuscating software. The ultimate obfuscator would be the M/o/Vuscator, which changes all assembler commands into mov instructions, which makes it a pain in the butt to understand anything. But it is also slow as heck and probably not what you want (btw. this works because the mov instruction is Turing-complete in the x86 instruction set. There is a great talk about it here). When you follow this trend further down the rabbit hole, you can also use the exact same assembler code (around 10-20ish instructions) to create all possible programs, which will make it impossible to get to your source code by simply disassembling your code.
Staying in the realm of the possible though: No, you are not able to prevent people from disassembling or decompiling your code. But you can make it harder (not impossible) to understand.
Detecting a change in the executable is on the possible side, though, although probably not feasible for you.
The main problem is that any code you build into the app to detect changes can be patched away. So you'll need to prevent that. But there is no practical way of preventing that...
If you try to detect changes in your app by using a signature of the original and compare that to the actual signature for example, you can just exclude that check in the recompiled version. You can try to verify the signature against a server, but that can still be circumvented by removing the server check. You can force a server check for multiplayer games, but then we'll just use a fake signature. If you then want to calculate the signature on your server to prevent tampering, we'll just give you the original file and run the recompiled one.
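As a rough sketch of the signature comparison described above (names are made up, and SHA-256 is an assumption, since the answer does not say which kind of signature is meant), the check boils down to hashing the file bytes and comparing against a value recorded at build time. Everything said above still applies: whoever recompiles the app can simply remove the call.

```csharp
using System;
using System.Security.Cryptography;

public static class IntegritySketch
{
    // Returns the SHA-256 digest of the data as lowercase hex.
    public static string Sha256Hex(byte[] data)
    {
        using (var sha = SHA256.Create())
        {
            byte[] digest = sha.ComputeHash(data);
            return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
        }
    }

    // True when the data still hashes to the value recorded for the original build.
    public static bool MatchesExpected(byte[] data, string expectedHex)
    {
        return string.Equals(Sha256Hex(data), expectedHex, StringComparison.OrdinalIgnoreCase);
    }
}
```

In a real check the data would be the bytes of the shipped assembly or executable, and the expected value would ideally live somewhere the attacker cannot edit, which is exactly the part that is hard on the player's machine.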
There is one way (although not feasible, as mentioned above) to actually absolutely protect parts of your code against decompiling. The mechanism is called "BlurryBox" and was developed at KIT in Germany. As I can't seem to find a proper document as a reference, here is what it does to achieve this.
The mechanism uses a stick with an encrypted storage and a microcontroller to do encryption. You put the parts of your code you want to protect (something that is called regularly, is necessary but not that time critical) into the encrypted storage. As it is impossible to retrieve the key [citation needed], you cannot access the code. The microcontroller then takes commands from your program to call one of the encrypted functions in the storage with given parameters and to return the result. Because it is not possible to read the code, you need to analyze its behaviour. Here comes the "Blurry" part of the box. Each function you store needs to have a small and well-defined set of allowed parameters. Every other set of parameters leads into a trap that kills your device. As the attacker has no specs as to what the valid parameters are, this method gives you provable security against tampering with the code (as they state). There might be some mistakes on how this exactly works, though, as I'm writing this down from memory.
You could try mimicking that behaviour with a server you control (code on the server and IP bans for trying to understand the code).

Security of String Encryption in C# Obfuscation

Disclaimer: I'm perfectly aware that a client-side program will never be safe from a dedicated reverse engineer.
Mostly out of personal curiosity, I've been learning about "obfuscation" techniques for C# applications. It seems that a popular technique is "string encryption", which appears to encrypt the string constants in the software and decrypt them for use later. This makes them not appear properly in decompilers like Reflector (please correct me if this is wrong).
If this is true, and you only see an encrypted version of the string in Reflector, what needs to be done (i.e. how difficult is it) to work around this and get the decrypted string? Obviously it must be possible or the application wouldn't be able to do it, but just how much of a deterrent would it be?

I don't have any experience with C# obfuscators, but the Java obfuscators I've looked at (Stringer, Allatori, Zelix Klassmaster, JFuscator) were pretty bad. Usually, I can reverse engineer the encryption algorithm after a day or two, and then I can deobfuscate all apps protected by the same obfuscator version and other versions usually only require a slight tweak.
Note that this is for purely static analysis, to figure out the algorithm and write a script that decrypts it without executing any code. If your goal is to just decrypt things quickly, it's a lot easier to simply execute the decryption function. The good obfuscators have a call context check so you can't do it directly, but it's a simple matter to find and edit out the check. This could potentially be done in only a couple minutes.
Obviously, there are ways to make reverse engineering much harder, but they aren't done in practice.

If you have the encrypted strings in your application, then you also have the decryption key embedded in your application.
So, a moderately determined person could use a debugger to step through the decryption code to retrieve the key, and then decrypt all other strings in your application with the flick of a wrist.
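To illustrate why the key has to ship with the application, here is a toy sketch. The XOR scheme and the key are stand-ins invented for this example, not what any particular obfuscator does. The routine that reveals a string needs the key at run time, so anyone who can call or step through that routine gets the plain text.

```csharp
using System;
using System.Text;

public static class StringObfuscationSketch
{
    // Hypothetical embedded key: it is stored in the assembly next to the data.
    private static readonly byte[] Key = { 0x5A, 0xC3, 0x19 };

    // XOR is its own inverse, so one routine both hides and reveals.
    private static byte[] Transform(byte[] data)
    {
        var result = new byte[data.Length];
        for (int i = 0; i < data.Length; i++)
        {
            result[i] = (byte)(data[i] ^ Key[i % Key.Length]);
        }
        return result;
    }

    // What the obfuscator would run at build time.
    public static string Hide(string plain)
    {
        return Convert.ToBase64String(Transform(Encoding.UTF8.GetBytes(plain)));
    }

    // What the application runs at run time, and what an attacker can call too.
    public static string Reveal(string hidden)
    {
        return Encoding.UTF8.GetString(Transform(Convert.FromBase64String(hidden)));
    }
}
```

A decompiler would show only the output of `Hide`, but `Reveal` sits in the same assembly and turns it back into the original string.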

How much time it takes to understand manually depends on the general obfuscation level. If there is a decrypt method which can simply be called, then with compiling and all it probably takes less than a minute.
The other simple way: recompile with debug information, set a breakpoint, execute the program and just read the string in VS ("simple" depends on what is necessary to get to this part of the code).

I only recently learned about the SecureString class. I am sure it is not a total solution, but in combination with other techniques, it might help.
http://msdn.microsoft.com/en-us/library/system.security.securestring%28v=vs.80%29.aspx
The value of an instance of SecureString is automatically encrypted when the instance is initialized or when the value is modified.
Note that SecureString has no members that inspect, compare, or convert the value of a SecureString. The absence of such members helps protect the value of the instance from accidental or malicious exposure.

Which compression/encryption should I take?

We've developed a service, which sends e-mails... quite trivial at this step.
The next step will be: handling the bounces.
To implement this I need to add some information into the headers... Let's say it's a simple string (to keep the question really basic).
Which compression/encryption (.NET built-in preferred) should I use? I'm looking for an algorithm which includes a checksum internally (I do not want to create a CRC or the like and add it to the headers as well), so that changing any character of the encrypted/compressed string makes it invalid.
This need not be a highly sophisticated algorithm, as I just want basic detection of changes/injections...
Just to be clear: There must be a chance to decompress/decrypt!

If you need to decompress/decrypt the message, you probably want a two-way encryption. I am not an expert here, but I think .NET comes with built-in support for AES, which is a Rijndael algorithm. You can get more information here.
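Encryption on its own does not necessarily detect modification, so here is a hedged sketch of a different, swapped-in technique: appending an HMAC-SHA256 to the value so that any changed character makes the token invalid. The class and method names are made up, and this version only protects against changes; it does not hide the payload. If the value also has to be secret, the payload bytes could be encrypted before the MAC is computed.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class SignedHeaderSketch
{
    private const int MacLength = 32;   // HMAC-SHA256 output size in bytes

    // Produces Base64(payload bytes + HMAC of payload bytes).
    public static string Protect(string payload, byte[] key)
    {
        byte[] data = Encoding.UTF8.GetBytes(payload);
        byte[] mac;
        using (var hmac = new HMACSHA256(key))
        {
            mac = hmac.ComputeHash(data);
        }

        var token = new byte[data.Length + mac.Length];
        Buffer.BlockCopy(data, 0, token, 0, data.Length);
        Buffer.BlockCopy(mac, 0, token, data.Length, mac.Length);
        return Convert.ToBase64String(token);
    }

    // Returns false when the token is malformed or was modified.
    public static bool TryUnprotect(string token, byte[] key, out string payload)
    {
        payload = null;
        byte[] raw;
        try { raw = Convert.FromBase64String(token); }
        catch (FormatException) { return false; }
        if (raw.Length < MacLength) return false;

        int dataLength = raw.Length - MacLength;
        byte[] expected;
        using (var hmac = new HMACSHA256(key))
        {
            expected = hmac.ComputeHash(raw, 0, dataLength);
        }

        // Compare every byte so the time taken does not depend on where a mismatch is.
        int diff = 0;
        for (int i = 0; i < MacLength; i++)
        {
            diff |= expected[i] ^ raw[dataLength + i];
        }
        if (diff != 0) return false;

        payload = Encoding.UTF8.GetString(raw, 0, dataLength);
        return true;
    }
}
```

The key stays on the sending service, so a bounce that comes back with an altered header value fails `TryUnprotect` and can be ignored.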

Have you thought/read about OpenPGP? This SO thread might be a good starting point for you.

To answer the compression part of it, you may want to consider either the System.IO.Compression.GZipStream or System.IO.Compression.DeflateStream classes for compression. DeflateStream uses the DEFLATE algorithm (LZ77 plus Huffman coding), which is (with a bit of hackery) compatible with ZLib (http://stackoverflow.com/questions/70347/zlib-compatible-compression-streams).
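A minimal round trip with GZipStream might look like the following sketch (the helper class is made up for illustration). The compressed bytes would still need something like Base64 before going into a mail header.

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

public static class GzipSketch
{
    public static byte[] Compress(string text)
    {
        byte[] input = Encoding.UTF8.GetBytes(text);
        using (var output = new MemoryStream())
        {
            // Dispose the GZipStream before reading the buffer so the
            // final block and trailer are written out.
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
            {
                gzip.Write(input, 0, input.Length);
            }
            return output.ToArray();
        }
    }

    public static string Decompress(byte[] compressed)
    {
        using (var input = new MemoryStream(compressed))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```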

Using reflection for code gen?

I'm writing a console tool to generate some C# code for objects in a class library. The best/easiest way I can actually generate the code is to use reflection after the library has been built. It works great, but this seems like a haphazard approach at best. Since the generated code will be compiled with the library, after making a change I'll need to build the solution twice to get the final result, etc. Some of these issues could be mitigated with a build script, but it still feels like a bit too much of a hack to me.
My question is, are there any high-level best practices for this sort of thing?

It's pretty unclear what you are doing, but what does seem clear is that you have some base line code, and based on some of its properties, you want to generate more code.
So the key issues here are: given the base line code, how do you extract interesting properties, and how do you generate code from those properties?
Reflection is a way to extract properties of code running (well, at least loaded) in the same execution environment as the reflection user code. The problem with reflection is that it only provides a very limited set of properties, typically lists of classes, methods, or perhaps names of arguments. If all the code generation you want to do can be done with just that, well, then reflection seems just fine. But if you want more detailed properties about the code, reflection won't cut it.
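For the case where names and types are all you need, a reflection-driven generator can be quite small. The sketch below is only an illustration under made-up names: `Customer` stands in for a class from the library, and the generated "Dto" class is a hypothetical target.

```csharp
using System;
using System.Reflection;
using System.Text;

public static class ReflectionCodeGenSketch
{
    // Emits C# source for a "<TypeName>Dto" class that mirrors the
    // public instance properties of the given type.
    public static string GenerateDto(Type type)
    {
        var sb = new StringBuilder();
        sb.AppendLine("public class " + type.Name + "Dto");
        sb.AppendLine("{");
        foreach (PropertyInfo p in type.GetProperties(BindingFlags.Public | BindingFlags.Instance))
        {
            // FullName is good enough for simple types; generic types would
            // need their names converted to C# syntax first.
            sb.AppendLine("    public " + p.PropertyType.FullName + " " + p.Name + " { get; set; }");
        }
        sb.AppendLine("}");
        return sb.ToString();
    }
}

// Hypothetical input type, standing in for a class from the built library.
public class Customer
{
    public int Id { get; set; }
    public string Name { get; set; }
}
```

`ReflectionCodeGenSketch.GenerateDto(typeof(Customer))` returns source text containing one property line per property of `Customer`. Anything about method bodies, control flow or data flow is out of reach this way, which is the limitation described above.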
In fact, the only artifact from which truly arbitrary code properties can be extracted is the source code as a character string (how else could you answer whether the number of characters between the add operator and the T in the middle of the variable name is a prime number?). As a practical matter, properties you can get from character strings are generally not very helpful (see the example I just gave :).
The compiler guys have spent the last 50 years figuring out how to extract interesting program properties and you'd be a complete idiot to ignore what they've learned in that half century.
They have settled on a number of relatively standard "compiler data structures": abstract syntax trees (ASTs), symbol tables (STs), control flow graphs (CFGs), data flow facts (DFFs), program triples, pointer analyses, etc.
If you want to analyze or generate code, your best bet is to process it first into such standard compiler data structures and then do the job. If you have ASTs, you can answer all kinds of question about what operators and operands are used. If you have STs, you can answer questions about where-defined, where-visible and what-type. If you have CFGs, you can answer questions about "this-before-that", "what conditions does statement X depend upon". If you have DFFs, you can determine which assignments affect the actions at a point in the code. Reflection will never provide this IMHO, because it will always be limited to what the runtime system developers are willing to keep around when running a program. (Maybe someday they'll keep all the compiler data structures around, but then it won't be reflection; it will just finally be compiler support).
Now, after you have determined the properties of interest, what do you do for code generation? Here the compiler guys have been so focused on generation of machine code that they don't offer standard answers. The guys that do are the program transformation community (http://en.wikipedia.org/wiki/Program_transformation). Here the idea is to keep at least one representation of your program as ASTs, and to provide special support for matching source code syntax (by constructing pattern-match ASTs from the code fragments of interest), and provide "rewrite" rules that say in effect, "when you see this pattern, then replace it by that pattern under this condition".
By connecting the condition to various property-extracting mechanisms from the compiler guys, you get a relatively easy way to say what you want, backed up by that 50 years of experience. Such program transformation systems have the ability to read in source code, carry out analysis and transformations, and generally to regenerate code after transformation.
For your code generation task, you'd read the base line code into ASTs, apply analyses to determine properties of interest, use transformations to generate new ASTs, and then spit out the answer.
For such a system to be useful, it also has to be able to parse and prettyprint a wide variety of source code languages, so that folks other than C# lovers can also have the benefits of code analysis and generation.
These ideas are all reified in the DMS Software Reengineering Toolkit. DMS handles C, C++, C#, Java, COBOL, JavaScript, PHP, Verilog, ... and a lot of other languages.
(I'm the architect of DMS, so I have a rather biased view. YMMV).

Have you considered using T4 templates for performing the code generation? It looks like it's getting much more publicity and attention now and more support in VS2010.
This tutorial seems database centric but it may give you some pointers: http://www.olegsych.com/2008/09/t4-tutorial-creatating-your-first-code-generator/ in addition there was a recent Hanselminutes on T4 here: http://www.hanselminutes.com/default.aspx?showID=170.
Edit: Another great place is the T4 tag here on StackOverflow: https://stackoverflow.com/questions/tagged/t4
EDIT: (By asker, new developments)
As of VS2012, T4 now supports reflection over an active project in a single step. This means you can make a change to your code, and the compiled output of the T4 template will reflect the newest version, without requiring you to perform a second reflect/build step. With this capability, I'm marking this as the accepted answer.

You may wish to use CodeDom, so that you only have to build once.
First, I would read this CodeProject article to make sure there are not language-specific features you'd be unable to support without using Reflection.

From what I understand, you could use something like Common Compiler Infrastructure (http://ccimetadata.codeplex.com/) to programmatically analyze your existing C# source.
This looks pretty involved to me though, and CCI apparently only has full support for C# language spec 2. A better strategy may be to streamline your existing method instead.

I'm not sure of the best way to do this, but you could do this:
1. As a post-build step on your base dll, run the code generator.
2. As another post-build step, run csc or msbuild to build the generated dll.
Other things which depend on the generated dll will also need to depend on the base dll, so the build order remains correct.

Building an assembler

I need to build an assembler for a CPU architecture that I've built. The architecture is similar to MIPS, but this is of no importance.
I started using C#, although C++ would be more appropriate. (C# means faster development time for me).
My only problem is that I can't come up with a good design for this application. I am building a 2 pass assembler. I know what I need to do in each pass.
I've implemented the first pass and I realised that if I have two lines of assembly code on the same line, no error is thrown. This means only one thing: poor parsing techniques.
So, almighty programmers, fathers of assemblers, enlighten me: how should I proceed?
I just need to support symbols and data declaration. Instructions have fixed size.
Please let me know if you need more information.

I've written three or four simple assemblers. Without using a parser generator, what I did was model the S-C assembler that I knew best for 6502.
To do this, I used a simple syntax - a line was one of the following:
nothing
[label] [instruction] [comment]
[label] [directive] [comment]
A label was one letter followed by any number of letters or numbers.
An instruction was <whitespace><mnemonic> [operands]
A directive was <whitespace>.XX [operands]
A comment was a * up to end of line.
Operands depended on the instruction and the directive.
Directives included
.EQ equate for defining constants
.OR set origin address of code
.HS hex string of bytes
.AS ascii string of bytes - any delimiter except white space - whatever started it ended it
.TF target file for output
.BS n reserve block storage of n bytes
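The line syntax above can be sketched as a small hand-written parser. This is only one possible reading of the description, with made-up type names. It takes a label from column 0, treats the first word after whitespace as the mnemonic or directive, and keeps the operand text raw for a later per-instruction parser. As a simplification, a `*` inside an .AS string would be cut off as a comment.

```csharp
using System;

public sealed class SourceLine
{
    public string Label;      // null when the line has no label
    public string Operation;  // mnemonic or ".XX" directive, null when absent
    public string Operands;   // raw operand text, null when absent

    public bool IsDirective
    {
        get { return Operation != null && Operation[0] == '.'; }
    }
}

public static class LineParserSketch
{
    public static SourceLine Parse(string line)
    {
        var result = new SourceLine();

        // A comment is a '*' up to the end of the line.
        int star = line.IndexOf('*');
        if (star >= 0) line = line.Substring(0, star);
        line = line.TrimEnd();
        if (line.Length == 0) return result;

        // A label starts in column 0: one letter, then letters or digits.
        int pos = 0;
        if (char.IsLetter(line[0]))
        {
            while (pos < line.Length && char.IsLetterOrDigit(line[pos])) pos++;
            if (pos < line.Length && !char.IsWhiteSpace(line[pos]))
                throw new FormatException("Invalid character in label: " + line);
            result.Label = line.Substring(0, pos);
        }
        else if (!char.IsWhiteSpace(line[0]))
        {
            throw new FormatException("Line must start with a label or whitespace: " + line);
        }

        // The first word is the mnemonic or directive; the rest is operand text.
        string rest = line.Substring(pos).Trim();
        if (rest.Length == 0) return result;

        int space = rest.IndexOfAny(new[] { ' ', '\t' });
        if (space < 0)
        {
            result.Operation = rest;
        }
        else
        {
            result.Operation = rest.Substring(0, space);
            result.Operands = rest.Substring(space).Trim();
        }
        return result;
    }
}
```

For example, `Parse("LOOP LDA #1 * load")` yields label `LOOP`, operation `LDA` and operands `#1`, while `Parse("  .OR 8000")` yields a directive with no label.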
When I wrote it, I wrote simple parsers for each component. Whenever I encountered a label, I put it in a table with its target address. Whenever I encountered a label I didn't know, I marked the instruction as incomplete and put the unknown label with a reference to the instruction that needed fixing.
After all source lines had passed, I looked through the "to fix" table and tried to find each entry in the symbol table. If I found it, I patched the instruction. If not, then it was an error.
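The symbol table and "to fix" table can be sketched roughly as follows. The instruction encoding here (opcode in the high 16 bits, address in the low 16 bits, one word per instruction) is invented for the example and is not the 6502 format the answer was written for.

```csharp
using System;
using System.Collections.Generic;

public sealed class FixupAssemblerSketch
{
    // Label name -> address (here: index of the instruction word).
    private readonly Dictionary<string, int> symbols = new Dictionary<string, int>();

    // The "to fix" table: instruction index -> label it is waiting for.
    private readonly List<KeyValuePair<int, string>> fixups = new List<KeyValuePair<int, string>>();

    // Assembled output, one word per fixed-size instruction.
    public readonly List<int> Code = new List<int>();

    public void DefineLabel(string name)
    {
        if (symbols.ContainsKey(name))
            throw new InvalidOperationException("Duplicate label: " + name);
        symbols[name] = Code.Count;
    }

    // Emits an instruction whose address field refers to a label.
    public void EmitWithLabel(int opcode, string label)
    {
        int address;
        if (symbols.TryGetValue(label, out address))
        {
            Code.Add((opcode << 16) | address);
        }
        else
        {
            // Unknown so far: leave the address field zero and remember to patch it.
            fixups.Add(new KeyValuePair<int, string>(Code.Count, label));
            Code.Add(opcode << 16);
        }
    }

    // Run after all source lines: patch every incomplete instruction.
    public void ResolveFixups()
    {
        foreach (KeyValuePair<int, string> fixup in fixups)
        {
            int address;
            if (!symbols.TryGetValue(fixup.Value, out address))
                throw new InvalidOperationException("Undefined label: " + fixup.Value);
            Code[fixup.Key] |= address;
        }
        fixups.Clear();
    }
}
```

A forward reference such as a jump to `END` before `END` is defined is emitted with a zero address field and patched by `ResolveFixups`, while a reference to a label that never appears raises an error at that point.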
I kept a table of instruction names and all the valid addressing modes for operands. When I got an instruction, I tried to parse each addressing mode in turn until something worked.
Given this structure, it should take a day maybe two to do the whole thing.

Look at this Assembler Development Kit from Randy Hyde, author of the famous "The Art of Assembly Language":
The Assembler Developer's Kit

The first pass of a two-pass assembler assembles the code and puts in placeholders for the symbols (as you don't know how big everything is until you've run the assembler). The second pass fills in the addresses. If the assembled code subsequently needs to be linked to external references, this is the job of the linker.

If you are to write an assembler that just works, and spits out a hex file to be loaded on a microcontroller, it can be simple and easy. Part of my ciforth library is a full Pentium assembler to add inline definitions, of about 150 lines. There is an assembler for the 8080 of a couple dozen lines.
The principle is explained at http://home.hccnet.nl/a.w.m.van.der.horst/postitfixup.html .
It amounts to applying the blackboard design pattern to the problem. You start by laying down the instruction, leaving holes for any and all operands. Then you fill in the holes when you encounter the parameters.
There is a strict separation between the generic tool and the instruction set.
In case the assembler you need is just for yourself, and there are no requirements other than usability (not a homework assignment), you can find an example implementation at http://home.hccnet.nl/a.w.m.van.der.horst/forthassembler.html. If you dislike Forth, there is also an example implementation in Perl. If the Pentium instruction set is too much to chew, you should still be able to understand the principle and the generic part.
You're advised to have a look at the asi8080.frt file first. This is 389 WOC (Words Of Code, not Lines Of Code). An experienced Forther familiar with the instruction set can crank out an assembler like that in an evening. The Pentium is a bitch.
