Implementing ByteCode interpreter in c#

Implementing ByteCode interpreter in c# - c#

My question: is there a
memory-efficient way to mimic the c++ union concept while allowing for string datatype, or some other efficient way to include data types and values in bytecode with minimal pointer chasing so as to take advantage of instruction caching?
I'm trying to write a VM bytecode interpreter in C#. I'd like to keep it in C# for simplicity, security, and familiarity reasons, mostly because I want to interact with a library of C# code I've already written.
There's information about how to do so online readily enough, except that it uses 'union' in c++, for which I can't seem to find an equivalent. Specifically, any kind of values (that is, anything that isn't an instruction) are stored as a tagged union.
I've searched and found questions like: Discriminated union in C#, but their answers don't make for efficient code - using inheritance still involves pointer chasing.
C++ union in C# proposes using StructLayout. It works until you need string values, and then throws:
[StructLayout(LayoutKind.Explicit)]
public struct SampleUnion
{
[FieldOffset(0)] public byte typeTag;
[FieldOffset(1)] public int num;
[FieldOffset(1)] public bool flag;
[FieldOffset(1)] public string c;
}
Could not load type ... because it contains an object field at offset 1 that is incorrectly aligned or overlapped by a non-object field.
I also tried messing around with just passing around arrays of bytes but then I get burned in perf costs when I have to use the value, because I have to convert it.
I've considered using dynamic. Maybe that will work, but it's at best a waste of memory for some types, and at worst I'm uncertain what shenanigans it might try to pull behind the scenes.
I mean, worst case scenario I suppose I could write the byte code interpreter in c++ and call it within the c# code, but I'd rather avoid that if I can, especially because I don't love the idea of messing around with the unsafe keyword, and it introduces a lot of complexity into my project.

As described in this article, the pseudocode of a bytecode interpreter is:
load the bytecode into memory
initialize interpreter state
repeat {
   fetch the next instruction, advance the instruction pointer
   decode the instruction
   execute the instruction
}
Depending on the bytecode format or structure, the instruction can have either fixed or dynamic length. Data like arrays or strings are typically referenced as (fixed length) memory offsets. The data is embedded in the bytecode separate from the instructions. The data address/offset is an index within the bytecode, as data is stored as sequence of bytes. An instruction to load a string would contain the string offset but not the string data itself.
To fetch and decode the next instruction, it is common to analyze the first one or two bytes which is/are usually the opcode. From this opcode, the length of the instruction is derived. The bytes belonging to the instruction can then be copied into a struct(ure) to disect it further and extract the instruction operand(s).
I can't see where a union would help in this process.
A simple C++ bytecode interpreter is described in XIDEK Extensible Interpreter Development Kit

Related

Why can I not encode const structs?

I wish to encode hard coded value of a const Point struct.
Why does the compiler not allow neither internal, nor arbitrary structs to be replaced during compilation? Since the internal bitwise representation can be established at compile time (in both cases), there is no apparent reason for the restriction.
My question is: Is there a way to hard-code a predefined set of bytes in c# that can be interpreted at compile time as the appropriate type, since all structs have a predetermined memory outline.
EDIT:
To clarify: Compile time means C# -> IL byte-code as stored in the output assembly.
The use case example:
public void Draw(Bitmap bmp, Point Location = new Point(0,0)) // invalid
This is an error because the new Point(0,0) cannot be evaluated at compile time. I can pass in int X = 0, int Y = 0 or the nullable Point? Location = null and generate the struct inside of the method, or Overload the method without the optional parameters and call the main method passing in the default values, but that technique incurs a performance penalty in terms of the extra method calls required.
This may not be appropriate for all structs, since the constructor could rely on, or change, external state or randomness.
FINAL EDIT:
This is now possible. Making the question moot. Yay.
The issue was the incorrect belief that the new keyword always implied heap allocation or dynamic stack allocation, with constant arguments neither case was true.

Why does the compiler not allow neither internal, nor arbitrary structs to be replaced during compilation? Since the internal bitwise representation can be established at compile time (in both cases), there is no apparent reason for the restriction.
All not-implemented features are not implemented for the same reason. To be implemented, a feature must be thought of, judged to be appropriate, designed, specified, implemented, tested, documented and shipped. All those things must happen. For your proposed feature, none of them happened. Therefore, no feature.
Programming language designers are not required to provide a justification for why a feature was not implemented. Rather, the people who want the feature are required to provide a reason why programming language designers should spend their valuable time implementing a feature that you want.
The C# design process is open, and the compiler source code is available. Why have you not designed and implemented the feature? If it is fair for you to ask the designers that question, it's fair for them to ask it of you! You're a computer programmer; get busy programming computers and build the feature if you think it is worthwhile, and then convince the language team to accept your pull request. If you don't think it is worth your time to do that, well, probably the language designers feel the same way.
My question is: Is there a way to hard-code a predefined set of bytes in c# that can be interpreted at compile time as the appropriate type, since all structs have a predetermined memory outline.
I'm not sure what you mean by "at compile time"; can you clarify?
There are ways to store byte arrays in an assembly, sure. Make a C# program with a byte array initialized to all constant values and ildasm the assembly; you'll see the code that the C# compiler generates to get the byte array image out of the metadata and into the array.
You could implement similar shenanigans to get a byte array, fix the array in place, and then use unsafe pointer magic to reinterpret the array bytes as struct bytes. That sounds extraordinarily dangerous, and might mess up the performance of the garbage collector. I would not wish to do so myself, but you seem pretty keen on this feature, so go for it and report back what you find out!
Alternatively, C++/CLI probably implements the feature you want; I've never used it but that seems like the sort of thing it would do. You could write a little program in C++/CLI that does what you want, and then either (1) use that program's assembly as a dependency of your assembly, (2) compile it as a netmodule link it in to your assembly via the usual netmodule linking gear (yuck) or (3) deduce how they implemented the feature and then do the same.

You can convert a struct into byte array and encode a byte array. It may work like this: (please compile and fix any errors (typing through mobile))
Suppose your struct is:
public struct TestStruct
{
public int x;
public string y;
}
public byte[] GetTestByte(TestStruct c)
{
var intGuy = BitConverter.GetBytes(c.x);
var stringGuy = Encoding.UTF8.GetBytes(c.y);
var both = stringGuy.Concat(intGuy).Concat(new byte[1]).ToArray();
return both;
}
Now you can encode a byte array like:
Convert.ToBase64String(byteArray);
There is no direct way to encode a struct. Its a value type at best and probably there as legacy for C and C++

XOR linked list

I recently came across the link below which I have found quite interesting.
http://en.wikipedia.org/wiki/XOR_linked_list
General-purpose debugging tools
cannot follow the XOR chain, making
debugging more difficult; [1]
The price for the decrease in memory
usage is an increase in code
complexity, making maintenance more
expensive;
Most garbage collection schemes do
not work with data structures that do
not contain literal pointers;
XOR of pointers is not defined in
some contexts (e.g., the C language),
although many languages provide some
kind of type conversion between
pointers and integers;
The pointers will be unreadable if
one isn't traversing the list — for
example, if the pointer to a list
item was contained in another data
structure;
While traversing the list you need to
remember the address of the
previously accessed node in order to
calculate the next node's address.
Now I am wondering if that is exclusive to low level languages or if that is also possible within C#?
Are there any similar options to produce the same results with C#?

TL;DR I quickly wrote a proof-of-concept XorLinkedList implementation in C#.
This is absolutely possible using unsafe code in C#. There are a few restrictions, though:
XorLinkedList must be "unmanaged structs", i.e., they cannot contain managed references
Due to a limitation in C# generics, the linked list cannot be generic (not even with where T : struct)
The latter seems to be because you cannot restrict the generic parameter to unmanaged structs. With just where T : struct you'd also allow structs that contain managed references.
This means that your XorLinkedList can only hold primitive values like ints, pointers or other unmanaged structs.
Low-level programming in C#
private static Node* _ptrXor(Node* a, Node* b)
{
return (Node*)((ulong)a ^ (ulong)b);//very fragile
}
Very fragile, I know. C# pointers and IntPtr do not support the XOR-operator (probably a good idea).
private static Node* _allocate(Node* link, int value = 0)
{
var node = (Node*) Marshal.AllocHGlobal(sizeof (Node));
node->xorLink = link;
node->value = value;
return node;
}
Don't forget to Marshal.FreeHGlobal those nodes afterwards (Implement the full IDisposable pattern and be sure to place the free calls outside the if(disposing) block.
private static Node* _insertMiddle(Node* first, Node* second, int value)
{
var node = _allocate(_ptrXor(first, second), value);
var prev = _prev(first, second);
first->xorLink = _ptrXor(prev, node);
var next = _next(first, second);
second->xorLink = _ptrXor(node, next);
return node;
}
Conclusion
Personally, I would never use an XorLinkedList in C# (maybe in C when I'm writing really low level system stuff like memory allocators or kernel data structures. In any other setting the small gain in storage efficiency is really not worth the pain. The fact that you can't use it together with managed objects in C# renders it pretty much useless for everyday programming.
Also storage is almost free today, even main memory and if you're using C# you likely don't care about storage much. I've read somewhere that CLR object headers were around ~40 bytes, so this one pointer will be the least of your concerns ;)

C# doesn't generally let you manipulate references at that level, so no, unfortunately.

As an alternative to the unsafe solutions that have been proposed.
If you backed your linked list with an array or list collection where instead of a memory pointer 'next' and 'previous' indicate indexes into the array you could implement this xor without resorting to using unsafe features.

There are ways to work with pointers in C#, but you can have a pointer to an object only temporarily, so you can't use them in this scenario. The main reason for this is garbage collection – as long as you can do things like XOR pointers and unXOR them later, the GC has no way of knowing whether it's safe to collect certain object or not.
You could make something very similar by emulating pointers using indexes in one big array, but you would have to implement a simple form of memory management yourself (i.e. when creating new node, where in the array should I put it?).
Another option would be to go with C++/CLI which allows you both the full flexibility of pointers on one hand and GC and access to the framework when you need it on the other.

Sure. You would just need to code the class. the XOR operator in c# is ^
That should be all you need to start the coding.
Note this will require the code to be declared "unsafe." See here: for how to use pointers in c#.

Making a broad generalization here: C# appears to have gone the path of readability and clean interfaces and not the path of bit fiddling and packing all the information as dense as possible.
So, unless you have a specific need here, you should use the List you are provided. Future maintenance programmers will thank you for it.

It is possible however you have to understand how C# looks at objects. An instance variable does not actually contain an object but a pointer to the object in memory.
DateTime dt = DateTime.Now;
dt is a pointer to a struct in memory containing the DateTime scheme.
So you could do this type of linked list although I am not sure why you would as the framework typically has already implemented the most efficient collections. As a thought expirament it is possible.

When to Use a Data Type Other Than int?

I have a project in which I have many objects with many properties whose values will always be less than small numbers like 60, 100, and 255.
I'm aware that when doing math on bytes it is necessary to cast the result. But other than that, is there any disadvantage to using a byte or a short for all these properties? Or another way of looking at it, is there any advantage to using bytes or shorts instead of int?

bytes and shorts are useful in interop scenarios, where you need a datatype that is the same size as the native code expects.
They can also save memory when dealing with very large sets of data.

In general, use the type that makes sense. This can be helpful from data validation, integrity and persistence standpoints.
Typically, for instance, a bool will actually consume 4-bytes on a 32-bit architecture, perhaps 8-bytes on a 64-bit architecture due to padding and alignment issues. Most compilers will, by default, optimize for speed and not code size or memory usage. Aligned access is typically faster than unaligned access. In .Net, some of this can be controlled via attributes. To my knowledge, the JIT compiler makes no effort to compact multiple boolean types into an integer bit field.
The same would hold true for byte types. The JIT compiler could compact multiple byte types (and even rearrange storage order) to share the same word (assuming you don't override such behavior with attributes such as StructLayout and FieldOffset).
For instance:
struct Foo
{
string A;
short B;
string C;
short D;
}
Treating reference types as pointers, the above would likely have a size of 16 (maybe 14, but aligned to 16).
Foo could be rearranged such that the order would actually be:
struct Foo
{
string A;
string C;
short B;
short D;
}
The above would likely have a size of 12, and an alignment of 12 or 16 (probably 16).
...potentially, in this contrived example, you could save 4 bytes due to rearrangement. And do note that .Net is more aggressive about realigning members than a typical C++ compiler is.
(On an aside, I recently spent some effort in a C++ library I maintain. I was able to achieve a 55% reduction in memory usage purely by optimizing member layout).

If you are storing them on disk, they take up less space. But it's often more of a pain than it is worth.

What is the underlying reason for not being able to put arrays of pointers in unsafe structs in C#?

If one could put an array of pointers to child structs inside unsafe structs in C# like one could in C, constructing complex data structures without the overhead of having one object per node would be a lot easier and less of a time sink, as well as syntactically cleaner and much more readable.
Is there a deep architectural reason why fixed arrays inside unsafe structs are only allowed to be composed of "value types" and not pointers?
I assume only having explicitly named pointers inside structs must be a deliberate decision to weaken the language, but I can't find any documentation about why this is so, or the reasoning for not allowing pointer arrays inside structs, since I would assume the garbage collector shouldn't care what is going on in structs marked as unsafe.
Digital Mars' D handles structs and pointers elegantly in comparison, and I'm missing not being able to rapidly develop succinct data structures; by making references abstract in C# a lot of power seems to have been removed from the language, even though pointers are still there at least in a marketing sense.
Maybe I'm wrong to expect languages to become more powerful at representing complex data structures efficiently over time.

One very simple reason: dotNET has a compacting garbage collector. It moves things around. So even if you could create arrays like that, you would have to pin every allocated block and you would see the system slow down to a crawl.
But you are trying to optimize based on an assumption. Allocation and cleanup of objects in dotNET is highly optimized. So write a working program first and then use a profiler to find your bottlenecks. It will most likely not be the allocation of your objects.
Edit, to answer the latter part:
Maybe I'm wrong to expect languages to
become more powerful at representing
complex data structures efficiently
over time.
I think C# (or any managed language) is much more powerful at representing
complex data structures (efficiently). By changing from low level pointers to garbage collected references.

I'm just guessing, but it might have to do with different pointer sizes for different target platforms. It seems that the C# compiler is using the size of the elements directly for index calculations (i.e. there is no CLR support for calculating fixed sized buffers indices...)
Anyway you can use an array of ulongs and cast the pointers to it:
unsafe struct s1
{
public int a;
public int b;
}
unsafe struct s
{
public fixed ulong otherStruct[100];
}
unsafe void f() {
var S = new s();
var S1 = new s1();
S.otherStruct[4] = (ulong)&S1;
var S2 = (s1*)S.otherStruct[4];
}

Putting a fixed array of pointers in a struct would quickly make it a bad candidate for a struct. The recommended size limit for a struct is 16 bytes, so on a x64 system you would be able to fit only two pointers in the array, which is pretty pointless.
You should use classes for complex data structures, if you use structures they become very limited in their usage. You wouldn't for example be able to create a data structure in a method and return it, as it would then contain pointers to structs that no longer exists as they were allocated in the stack frame of the method.

C++ CLI structure to byte array

I have a structure that represents a wire format packet. In this structure is an array of other structures. I have generic code that handles this very nicely for most cases but this array of structures case is throwing the marshaller for a loop.
Unsafe code is a no go since I can't get a pointer to a struct with an array (argh!).
I can see from this codeproject article that there is a very nice, generic approach involving C++/CLI that goes something like...
public ref class Reader abstract sealed
{
public:
generic <typename T> where T : value class
static T Read(array<System::Byte>^ data)
{
T value;
pin_ptr<System::Byte> src = &data[0];
pin_ptr<T> dst = &value;
memcpy((void*)dst, (void*)src,
/*System::Runtime::InteropServices::Marshal::SizeOf(T::typeid)*/
sizeof(T));
return value;
}
};
Now if just had the structure -> byte array / writer version I'd be set! Thanks in advance!

Using memcpy to copy an array of bytes to a structure is extremely dangerous if you are not controlling the byte packing of the structure. It is safer to marshall and unmarshall a structure one field at a time. Of course you will lose the generic feature of the sample code you have given.
To answer your real question though (and consider this pseudo code):
public ref class Writer abstract sealed
{
public:
generic <typename T> where T : value class
static System::Byte[] Write(T value)
{
System::Byte buffer[] = new System::Byte[sizeof(T)]; // this syntax is probably wrong.
pin_ptr<System::Byte> dst = &buffer[0];
pin_ptr<T> src = &value;
memcpy((void*)dst, (void*)src,
/*System::Runtime::InteropServices::Marshal::SizeOf(T::typeid)*/
sizeof(T));
return buffer;
}
};

This is probably not the right way to go. CLR is allowed to add padding, reorder the items and alter the way it's stored in memory.
If you want to do this, be sure to add [System.Runtime.InteropServices.StructLayout] attribute to force a specific memory layout for the structure. In general, I suggest you not to mess with memory layout of .NET types.

Unsafe code can be made to do this, actually. See my post on reading structs from disk: Reading arrays from files in C# without extra copy.

Not altering the structure is certainly sound advice. I use liberal amounts of StructLayout attributes to specify the packing, layout and character encoding. Everything flows just fine.
My issue is just that I need a performant and preferably generic solution. Performance because this is a server application and generic for elegance. If you look at the codeproject link you'll see that the StructureToPtr and PtrToStructure methods perform on the order of 20 times slower than a simple unsafe pointer cast. This is one of those areas where unsafe code is full of win. C# will only let you have pointers to primitives (and it's not generic - can't get a pointer to a generic), so that's why CLI.
Thanks for the psuedocode grieve, I'll see if it gets the job done and report back.

Am I missing something? Why not create a new array of the same size and initialise each element seperately in a loop?
Using an array of byte data is quite dangerous unless you are targetting one platform only... for example your method doesn't consider differing endianness between the source and destination arrays.
Something I don't really understand about your question as well is why having an array as a member in your class is causing a problem. If the class comes from a .NET language you should have no issues, otherwise, you should be able to take the pointer in unsafe code and initialise a new array by going through the elements pointed at one by one (with unsafe code) and adding them to it.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.