Efficient small byte-arrays in C#

I have a huge collection of very small objects. To ensure the data is stored very compactly I rewrote the class to store all information within a byte-array with variable-byte encoding. Most instances of these millions of objects need only 3 to 7 bytes to store all the data.
After memory-profiling I found out that these byte-arrays always take at least 32 bytes.
Is there a way to store the information more compactly than bit-fiddled into a byte[]? Would it be better to point to an unmanaged array?
class MyClass
{
    byte[] compressed;

    public MyClass(IEnumerable<int> data)
    {
        compressed = compress(data);
    }

    private byte[] compress(IEnumerable<int> data)
    {
        // ...
    }

    private IEnumerable<int> decompress(byte[] compressedData)
    {
        // ...
    }

    public IEnumerable<int> Data { get { return decompress(compressed); } }
}

There are a couple of problems you're facing that eat up memory. One is per-object overhead, and the other is objects aligning to 32- or 64-bit boundaries (depending on your build). Your current approach suffers from both issues. The following sources describe this in more detail:
Of Memory and Strings
How much memory does a C# string take up
I played around with this myself when I was benchmarking object sizes.
A simple solution would be to create a struct with a single member that is a long value. Its methods would handle packing and unpacking bytes into and out of that long, using shift-and-mask bit fiddling (sketched below).
Another idea would be a class that serves up objects by ID and stores the actual bytes in a single backing List<byte>. But that would get complicated and messy; I think the struct idea is much more straightforward.
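A minimal sketch of that struct idea, assuming a hypothetical layout where the low byte holds the count and the remaining seven bytes hold the payload:

    // Sketch: up to 7 payload bytes packed into one long (low byte holds the count).
    // No per-object header, no separate array allocation; add bounds checks as needed.
    public readonly struct PackedBytes
    {
        private readonly long bits;

        public PackedBytes(ReadOnlySpan<byte> data)
        {
            if (data.Length > 7)
                throw new ArgumentException("At most 7 bytes fit into one long.");

            long b = data.Length;                    // byte 0: count
            for (int i = 0; i < data.Length; i++)
                b |= (long)data[i] << (8 * (i + 1)); // bytes 1..7: payload
            bits = b;
        }

        public int Count => (int)(bits & 0xFF);

        public byte this[int index] => (byte)(bits >> (8 * (index + 1)));
    }

Each value is then exactly 8 bytes and is stored inline in any array or List<PackedBytes>, with no per-object header.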


Using protobuf-net to write fixed size objects in parts and read them one-by-one

Use Case Description
I receive the collections in chunks from a server, and I want to write them to a file in a way that lets me read them back one by one later. My objects are fixed-size, meaning the class only contains fields of type double, long, and DateTime.
I already serialize and deserialize objects using the methods below in various places in my project:
public static T Deserialize<T>(byte[] buffer)
{
    using (MemoryStream stream = new MemoryStream(buffer))
    {
        return Serializer.Deserialize<T>(stream);
    }
}

public static byte[] Serialize<T>(T message)
{
    using (MemoryStream stream = new MemoryStream())
    {
        Serializer.Serialize(stream, message);
        return stream.ToArray();
    }
}
But even if this could work, I still think it will produce a larger output file, because I believe protobuf stores some information about field names (in its own way). I could instead create the byte[] using BinaryWriter without any field-name information. I know I would need to read the fields back in the same order, but this could still have a meaningful impact on the output file size, especially when the number of objects in the collection is really huge.
Do you think there is a way to efficiently write the collections in parts, read them back one by one, and keep the output file size and the memory footprint while reading to a minimum? My collections are really large, containing years of market data that I need to read and process. I only need to read each object once, process it, and forget about it; I have no need to keep objects in memory.
Protobuf doesn't store field names, but it does use a field prefix that is an encoded integer. For storing multiple objects, you would typically use the *WithLengthPrefix overloads. Note also that DateTime has no reliable fixed-length encoding.
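For illustration, appending items with a length prefix and streaming them back one at a time might look like this (a sketch only; MarketTick, ticks.bin, chunk, and Process are stand-ins, not from the question; requires using ProtoBuf; and using System.IO;):

    // Append each item with a length prefix so items can be read back one at a time.
    using (var file = File.Open("ticks.bin", FileMode.Append, FileAccess.Write))
    {
        foreach (MarketTick tick in chunk)
            Serializer.SerializeWithLengthPrefix(file, tick, PrefixStyle.Base128, 1);
    }

    // DeserializeItems streams one object at a time; nothing else is kept in memory.
    using (var file = File.OpenRead("ticks.bin"))
    {
        foreach (MarketTick tick in Serializer.DeserializeItems<MarketTick>(file, PrefixStyle.Base128, 1))
            Process(tick);   // hypothetical processing method
    }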
However! In your case, perhaps a serializer isn't the right tool. I would consider:
- creating a readonly struct composed of a double and two longs (or three longs if you need high-precision epoch time)
- using a memory-mapped file to access the file system directly
- creating a Span<byte> over the memory-mapped file (or a section thereof)
- coercing the Span<byte> to a Span<YourStruct> using MemoryMarshal.Cast
Et voilà: direct access to your values all the way down to the file system.
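A rough sketch of that approach, assuming a blittable Tick struct (double plus two longs, with the DateTime stored as Ticks) and the unsafe-pointer route for getting a Span<byte> over the view; the type name, file name, and method are illustrative only, and this needs /unsafe plus using System.IO.MemoryMappedFiles; and using System.Runtime.InteropServices;:

    [StructLayout(LayoutKind.Sequential)]
    public readonly struct Tick
    {
        public readonly double Price;
        public readonly long Volume;
        public readonly long TimestampTicks;   // store DateTime.Ticks as a long
    }

    public static unsafe void ProcessTicks(string path)
    {
        using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
        using (var accessor = mmf.CreateViewAccessor(0, 0, MemoryMappedFileAccess.Read))
        {
            byte* ptr = null;
            accessor.SafeMemoryMappedViewHandle.AcquirePointer(ref ptr);
            try
            {
                // Assumes the view fits in a single Span; chunk the view for very
                // large files, and add accessor.PointerOffset for non-zero offsets.
                var bytes = new Span<byte>(ptr, (int)accessor.Capacity);
                Span<Tick> ticks = MemoryMarshal.Cast<byte, Tick>(bytes);
                foreach (ref readonly Tick t in ticks)
                {
                    // read t.Price, t.Volume, new DateTime(t.TimestampTicks) straight from the file
                }
            }
            finally
            {
                accessor.SafeMemoryMappedViewHandle.ReleasePointer();
            }
        }
    }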

"Immutable" Byte Array or Object Locks?

I'm developing a multithreaded modbus server, and I will need to manage a block of bytes for the client(s) to read. Each modbus device will have a thread to update their respective portion of the byte array, so I will need to implement some form of locking.
When the application is initialized, the number of devices and the number of bytes of memory allocated to each device will be constant. I did some research, and it seems that it's safe to lock a multidimensional array with an array of ReaderWriterLockSlim objects to be used as locks:
private static ReaderWriterLockSlim[] memLock;
private static byte[][] memoryBlock;
...
// Modify or read memoryBlock corresponding to deviceIndex as needed using:
memLock[deviceIndex].EnterWriteLock();
try
{
    // Write to array of bytes from memoryBlock
    memoryBlock[2][3] = 0x05;
}
finally
{
    memLock[deviceIndex].ExitWriteLock();
}

memLock[deviceIndex].EnterReadLock();
try
{
    // Read array of bytes from memoryBlock
}
finally
{
    memLock[deviceIndex].ExitReadLock();
}
I have not written a lot of multithreaded applications, and I only recently "discovered" the concept of immutability. Here is my stab at turning the above into an immutable class, assuming a fixed set of devices and a memory size that never changes after the memory is initialized. The reason the class is static is that there will only be one instance of it in my application:
private static class DeviceMemory
{
    private static bool initialized_ = false;
    private static int memSizeInBytes_;
    private static List<byte[]> registers_;

    public static void Initialize(int deviceCount, int memSizeInBytes)
    {
        if (initialized_) throw new Exception("DeviceMemory already initialized");
        if (memSizeInBytes <= 0) throw new Exception("Invalid memory size in bytes");

        memSizeInBytes_ = memSizeInBytes;
        registers_ = new List<byte[]>();
        for (int i = 0; i < deviceCount; ++i)
        {
            byte[] scannerRegs = new byte[memSizeInBytes];
            registers_.Add(scannerRegs);
        }
        initialized_ = true;
    }

    public static byte[] GetBytes(int deviceIndex)
    {
        if (initialized_) return registers_[deviceIndex];
        else return null;
    }

    public static void UpdateBytes(int deviceIndex, byte[] memRegisters)
    {
        if (!initialized_) throw new Exception("Memory has not been initialized");
        if (memRegisters.Length != memSizeInBytes_)
            throw new Exception("Memory register size does not match the defined memory size in bytes: " + memSizeInBytes_);

        registers_[deviceIndex] = memRegisters;
    }
}
Here are my questions about the above:
Am I correct that I can lock a row of the 2 dimensional array as shown above?
Have I properly implemented an immutable class? i.e. do you see any issues with the DeviceMemory class that would prevent me from writing via the UpdateBytes method from the device thread while reading simultaneously from multiple clients on different threads?
Is this immutable class a wise choice over the more traditional multi-dimensional byte array/lock? Specifically, I'm concerned about memory usage/garbage collection since updates to the byte arrays will actually be "new" byte arrays that replace the reference to the old array. The reference to the old array should be released immediately after the client reads it, however. There will be about 35 devices updating every second and 4 clients per device reading at approximately 1 second intervals.
Would one solution or the other perform better, especially if the server were to scale out?
Thank you for reading!
First, what you've shown in your second code example is not an "immutable class", but rather a "static class". Two very different things; you'll want to review your terminology to ensure that you are communicating effectively, as well as not becoming confused when researching techniques related to either.
As for your questions:
1. Am I correct that I can lock a row of the 2 dimensional array as shown above?
ReaderWriterLockSlim is a reasonable choice here, though you haven't shown any mechanism for handling a failed or timed-out lock acquisition should one occur. But otherwise, yes…you can use an individual lock for each byte[] element of the byte[][] object.
More generally, you can use a lock to represent whatever unit of data or other resource you want. The lock object doesn't care.
2. Have I properly implemented an immutable class? i.e. do you see any issues with the DeviceMemory class that would prevent me from writing via the UpdateBytes method from the device thread while reading simultaneously from multiple clients on different threads?
If you really mean "immutable", then no. If you really mean "static", then yes, but it's not clear to me that knowing that is useful. The class isn't thread-safe, which seems to be your greater concern, so in that respect you haven't done it right.
As far as the UpdateBytes() method itself goes, there's nothing wrong with it per se. And indeed, since copying a reference from one variable to another is an atomic operation in .NET, the UpdateBytes() method can "safely" update the array element at the same time some other thread is trying to retrieve it. "Safely" in the sense that a reader won't get corrupted data.
But there's nothing in your class that ensures that Initialize() is called only once. In addition, without synchronization (a lock or marking variables volatile) you have no guarantees that values written in one thread will ever be observed by another thread. That includes all of the fields, as well as the individual byte[] array elements.
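For illustration, one minimal way to guard against double initialization is an atomic flag (a sketch; it requires using System.Threading; and does not by itself address the visibility of the array contents, which still needs a lock or volatile as noted above):

    private static int initState;   // 0 = not initialized, 1 = initialized

    public static void Initialize(int deviceCount, int memSizeInBytes)
    {
        // Atomically flip the flag; only the first caller gets through.
        if (Interlocked.CompareExchange(ref initState, 1, 0) != 0)
            throw new InvalidOperationException("DeviceMemory already initialized");

        // ... allocate the per-device arrays as in the original Initialize ...
    }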
3. Is this immutable class a wise choice over the more traditional multi-dimensional byte array/lock? Specifically, I'm concerned about memory usage/garbage collection since updates to the byte arrays will actually be "new" byte arrays that replace the reference to the old array. The reference to the old array should be released immediately after the client reads it, however. There will be about 35 devices updating every second and 4 clients per device reading at approximately 1 second intervals.
There's not enough context in your question to make a comparison, as we have no idea how you'd otherwise access the memoryBlock array. If you're copying new arrays into the backing collection in both cases, the two approaches should be similar. Even if it's only in the second example that you create new arrays, then assuming the arrays are not large, I would expect generating ~100 new objects per second to be well within the bandwidth of the garbage collector.
As far as whether the approach in your second code example is a desirable one, I will with all due respect suggest that you probably stick to conventionally synchronized code (i.e. with ReaderWriterLockSlim, or even just a plain lock statement). Concurrency is hard enough to get right with regular locks, never mind trying to write lock-free code. Given the update rates you're describing, I would expect a plain lock statement to work just fine.
If you run into some bandwidth problems, then at least you have a known-good implementation with which to compare new implementations, and will have a better idea of just how complicated an implementation you really require.
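For reference, a conventionally locked version of the accessors could be as simple as the following sketch (one lock shared by all devices, trading granularity for simplicity; registers_ is the field from your class):

    private static readonly object gate = new object();

    public static byte[] GetBytes(int deviceIndex)
    {
        lock (gate)   // also guarantees the reader sees the latest reference
        {
            return registers_[deviceIndex];
        }
    }

    public static void UpdateBytes(int deviceIndex, byte[] memRegisters)
    {
        lock (gate)
        {
            registers_[deviceIndex] = memRegisters;
        }
    }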
4. Would one solution or the other perform better, especially if the server were to scale out?
Impossible to say without more details.

Is it possible to avoid serialization/deserialization and to share big memory object with Memory-mapped files (MMF)?

I need to pass a C# memory object from one process to another (IPC).
I've just tried serializing this object to a file and then deserializing it in my second process using binary serialization (BinaryFormatter), in order to get good performance.
Unfortunately, the performance is not what I expected.
As my object holds a lot of information, serialization and deserialization take too much time (the serialized object takes more than 1 MB on my hard drive).
I have heard of memory-mapped files (MMF), which seem to be one of the fastest methods for IPC when the objects to share between processes are simple.
What is the fastest and easiest way to communicate between two processes in C#?
My object is just a simple nested struct like this:
public struct Library
{
    public Book[] books;
    public string name;
}

public struct Book
{
    public decimal price;
    public string title;
    public string author;
}
=> Is it possible to avoid serialization/deserialization and share this kind of object with an MMF?
=> What characteristics should the shared object have to avoid these serialization/deserialization operations?
One more constraint: my first process is a 64-bit C# process and my second process is a 32-bit one.
Thank you
You can't directly allocate objects inside a memory-mapped file from C# (staying in safe code), so you need some form of serialization to transfer the data between the two applications.
Broad options:
- keep all the raw data (more or less as a byte array) in the MMF and have C# wrappers to read/write the data on demand
- find a faster serializer, or build one by hand
- use some form of change tracking and send only diffs between the applications
I'd personally go with option 3, as it will give the most reliable and type-safe gains if it's applicable to your particular problem.
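To make option 1 concrete, here is a rough sketch (BookStore, the map name, and the fixed-size record layout are assumptions; only the blittable price is stored this way, and the string fields would need their own encoding; requires using System; and using System.IO.MemoryMappedFiles;):

    // Both processes open the same named memory-mapped file and read/write
    // fixed-size fields at computed offsets, with no serializer involved.
    public sealed class BookStore : IDisposable
    {
        private const int RecordSize = 16;   // one decimal (price) per record
        private readonly MemoryMappedFile mmf;
        private readonly MemoryMappedViewAccessor accessor;

        public BookStore(string mapName, int maxBooks)
        {
            mmf = MemoryMappedFile.CreateOrOpen(mapName, (long)maxBooks * RecordSize);
            accessor = mmf.CreateViewAccessor();
        }

        public void WritePrice(int index, decimal price)
        {
            accessor.Write((long)index * RecordSize, price);
        }

        public decimal ReadPrice(int index)
        {
            return accessor.ReadDecimal((long)index * RecordSize);
        }

        public void Dispose()
        {
            accessor.Dispose();
            mmf.Dispose();
        }
    }

A named memory-mapped file like this can be opened by both your 64-bit and your 32-bit process.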
Is it possible to avoid serialization/deserialization and share this kind of object with MMF?
Use a foreach statement to iterate the elements of your Book[] and write them to an MMF through a view accessor.
Example:
// The original pseudocode rewritten as a runnable sketch; "library.books" stands in
// for however you named your Book[]. Requires: using System.IO.MemoryMappedFiles;
// Note: MemoryMappedViewAccessor.Write only accepts value types without reference
// fields, so the string members (title, author) would still need to be encoded
// to bytes separately.
Book[] books = library.books;
long position = 0;
using (var mmf = MemoryMappedFile.CreateNew("LibraryMap", 1024 * 1024))
using (var accessor = mmf.CreateViewAccessor())
{
    foreach (Book book in books)
    {
        accessor.Write(position, book.price);   // 16 bytes for the decimal
        position += 16;                          // sizeof(decimal)
    }
} // disposing the accessor and the MMF handle releases the view

C# basics - Memory Management

I am new to programming in C#.
Can anyone please tell me about memory management in C#?
class Student
{
    int Id;
    string Name;
    double Marks;

    public string getStudentName()
    {
        return this.Name;
    }

    public double getPersantage()
    {
        return this.Marks * 100 / 500;
    }
}
I want to know how much memory is allocated for an instance of this class.
What about the methods - where are they stored?
And if there are static methods, what about their storage?
Can anyone briefly explain this to me?
An instance of the class itself will take up 24 bytes on a 32-bit CLR:
8 bytes of object overhead (sync block and type pointer)
4 bytes for the int
4 bytes for the string reference
8 bytes for the double
Note that the memory for the string itself is in addition to that - but many objects could share references to the same string, for example.
Methods don't incur the same sort of per-instance storage penalty as fields. Essentially they're associated with the type rather than with an instance of the type, although there's the IL version and the JIT-compiled code to consider. However, in my experience you can usually ignore this: you'd have to have a large amount of code and very few instances for the memory taken up by the code to be significant compared with the data. The important thing is that you don't get a separate copy of each method for each instance.
EDIT: Note that you happened to pick a relatively easy case. In situations where you've got fields of logically smaller sizes (e.g. short or byte fields), the CLR chooses how to lay out the object in memory so that values which require alignment (being on a word boundary) are aligned appropriately, while possibly packing other fields together - so four byte-sized fields could end up taking 4 bytes in total, or 16 if the CLR decides to align each of them separately. I think that's implementation-specific, but it's possible that the CLI spec dictates the exact approach taken.
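If you want to see the numbers for yourself, a rough empirical check is to allocate a large batch of instances and divide the growth of the managed heap by the count (a sketch; the figures differ between 32-bit and 64-bit CLRs and are only approximate):

    // Measure approximate per-instance size by allocating many objects.
    // The array is allocated before the baseline so its own cost is excluded.
    const int count = 1000000;
    var keepAlive = new Student[count];

    long before = GC.GetTotalMemory(true);   // true = force a full collection first
    for (int i = 0; i < count; i++)
        keepAlive[i] = new Student();
    long after = GC.GetTotalMemory(true);

    Console.WriteLine("~{0} bytes per instance", (after - before) / count);
    GC.KeepAlive(keepAlive);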
As I think Jon Skeet is saying, it depends on a lot of factors and isn't easily measurable ahead of time. Factors such as whether it's running on a 64-bit or 32-bit OS must be taken into account, and whether you are running a debug or release build also comes into play. The amount of memory taken up by code depends on the processor the JIT compiles for, as different optimizations can be used for different processors.
Not really an answer, just for fun.
struct Student
{
    int Id;
    [MarshalAs(UnmanagedType.LPStr)]
    string Name;
    double Marks;

    public string getStudentName()
    {
        return this.Name;
    }

    public double getPersantage()
    {
        return this.Marks * 100 / 500;
    }
}
And
Console.WriteLine(Marshal.SizeOf(typeof(Student)));
On 64-bit it returns:
24
and on 32-bit:
16
sizeof() might look like a good way to find out the bytes, but it only works on unmanaged value types, not on a method call like getPersantage(). I haven't done much C#, but better with an answer than no answer :=)

What is the best method for creating a network packet struct/class in C#?

I'm wondering if there are any good guides or books that explain the best way to handle network packet communication in C#?
Right now I'm using a structure and a method that generates a byte array based on values of the structure.
Is there a simpler way to do this? Or even a better way?
public struct hotline_transaction
{
    private int transaction_id;
    private short task_number;
    private int error_code;
    private int data_length;
    private int data_length2;
    ...

    public int Transaction_id
    {
        get
        {
            return IPAddress.HostToNetworkOrder(transaction_id);
        }
        set
        {
            transaction_id = value;
        }
    }
    ...

    public byte[] GetBytes()
    {
        List<byte> buffer = new List<byte>();
        buffer.Add(0); // reserved
        buffer.Add(0); // request = 0
        buffer.AddRange(BitConverter.GetBytes(Task_number));
        buffer.AddRange(BitConverter.GetBytes(Transaction_id));
        buffer.AddRange(BitConverter.GetBytes(error_code));
        buffer.AddRange(BitConverter.GetBytes(Data_length));
        buffer.AddRange(subBuffer.ToArray());
        return buffer.ToArray(); // return byte array for network sending
    }
}
Beyond that is there a good guide or article on the best practice of parsing network data into usable structures / classes?
Have you heard of google protocol buffers?
Protocol buffers is the name of the binary serialization format used by Google for much of their data communications. It is designed to be:
- small in size - efficient data storage (far smaller than xml)
- cheap to process - both at the client and server
- platform independent - portable between different programming architectures
- extensible - to add new data to old messages
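If you went that route with protobuf-net, the packet would be described with contract attributes rather than hand-rolled byte packing; a rough sketch (HotlineTransaction here is a reshaped stand-in for the struct above, the field numbers are arbitrary, and networkStream/transaction stand for your stream and instance; requires using ProtoBuf;):

    [ProtoContract]
    public class HotlineTransaction
    {
        [ProtoMember(1)] public int TransactionId { get; set; }
        [ProtoMember(2)] public short TaskNumber { get; set; }
        [ProtoMember(3)] public int ErrorCode { get; set; }
        [ProtoMember(4)] public byte[] Payload { get; set; }
    }

    // Writing to a network stream; the length prefix tells the receiver where each packet ends.
    Serializer.SerializeWithLengthPrefix(networkStream, transaction, PrefixStyle.Base128, 1);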
Well, rather than GetBytes(), I'd be tempted to use a Write(Stream), in case it is big... but in the general case there are serialization APIs for this... I'd mention my own, but I think people get bored of hearing it.
IMO, hotline_transaction looks more like a class (than a struct) to me, btw.
You should probably use a BinaryWriter for this, and rather than returning byte[], you should pass a BinaryWriter to the serialization code, i.e.,
public void WriteBytes(BinaryWriter writer)
{
    writer.Write((byte)0); // reserved
    writer.Write((byte)0); // request = 0
    writer.Write(Task_number);
    writer.Write(Transaction_id);
    writer.Write(error_code);
    writer.Write(Data_length);
    subBuffer.WriteBytes(writer);
}
You can easily wrap an existing Stream with a BinaryWriter. If you really need to get a byte[] somehow, you can use a MemoryStream as a backing stream, and call ToArray when you're done.
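For example, if some calling code still needs a byte[], the wrapping described above might look like this sketch:

    // Back the BinaryWriter with a MemoryStream when a byte[] is required.
    public byte[] GetBytes()
    {
        using (var stream = new MemoryStream())
        using (var writer = new BinaryWriter(stream))
        {
            WriteBytes(writer);   // the method shown above
            writer.Flush();
            return stream.ToArray();
        }
    }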
