C# string to byte array speed

So coming off of this question:
Which is the faster comparison: Convert.ToInt32(stringValue) == intValue or stringValue == intValue.ToString()?
I am looking for a base type for my networked application to store in packets.
The Idea:
Packet class stores a list of (type)
Add objects to the packet class
Serialize and send it between machines
Deserialize into (type)
Convert (type) into the type of object you added originally.
Originally, I was using strings as (type). However, I am a bit dubious, as converting an int to a string every time seems like a taxing process. When I am sending packets containing lots of uints converted to strings at 30 FPS, I would like to make this process as fast as possible.
Therefore, I was wondering if byte[] would be a more suitable type. How fast is converting back and forth between byte[] and ints/strings compared with converting between strings and ints? BTW, I will not be sending many strings over the network. Almost all of what I will be sending will be uints.

If you are using the same program on both ends, use binary serialization if possible. You are worried about speed; but unless this is just going between two processes on localhost, actual wire time, let alone latency, will be orders of magnitude slower than any real conversion process.
Of course, don't concatenate strings; you will make a liar out of me.
The thing you need to save here is your coding time, plus the possibility of errors from rolling your own serialization. If you properly encapsulate the data-transfer parts of your program, upgrading them later will be easy. Spending extra time up front making something fast is called premature optimization (google it - it's a valid argument, most of the time). If it does become a bottleneck, leverage your encapsulated design and change it. You won't spend much more time then than if you'd done it first - and you likely won't end up spending that time at all.
A warning about binary serialization: the types you are sending must have the same version and type name on both ends. If you can easily put the same version into production on both ends, it's no worry. If you need more than this, or binary serialization is too slow, look into FastJson, which makes big promises and is free, or something similar.
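For reference, a minimal sketch of what binary serialization over a stream looks like, assuming a hypothetical Packet type of your own (BinaryFormatter is the framework's built-in binary serializer):

    using System;
    using System.IO;
    using System.Runtime.Serialization.Formatters.Binary;

    [Serializable]
    class Packet
    {
        public uint[] Values;
    }

    static class Wire
    {
        // Sender: serialize straight onto the network stream.
        public static void Send(Stream networkStream, Packet packet)
        {
            new BinaryFormatter().Serialize(networkStream, packet);
        }

        // Receiver: the same Packet type/version must be loaded on this end.
        public static Packet Receive(Stream networkStream)
        {
            return (Packet)new BinaryFormatter().Deserialize(networkStream);
        }
    }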

byte[] is the "natural" data type for socket operations, so this seems a good fit; ints/uints will be very fast to convert as well. Strings are a bit different, but if you choose the natural encoding of the platform, they will be fast too.
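For illustration, the conversions in question are one-liners in the framework; a sketch (BitConverter for the numbers, an Encoding for the text):

    using System;
    using System.Text;

    class Conversions
    {
        static void Main()
        {
            uint value = 123456u;
            byte[] numberBytes = BitConverter.GetBytes(value);        // 4 bytes, machine byte order
            uint roundTripped = BitConverter.ToUInt32(numberBytes, 0);

            string text = "hello";
            byte[] textBytes = Encoding.UTF8.GetBytes(text);
            string back = Encoding.UTF8.GetString(textBytes);

            Console.WriteLine(roundTripped + " " + back);             // 123456 hello
        }
    }

Note that BitConverter uses the machine's byte order; if the two ends of the connection could differ, agree on an order explicitly (e.g. via IPAddress.HostToNetworkOrder on the signed equivalents).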

Convert.ToInt32 is decently fast, provided it does not fail. If it fails, you incur the overhead of a thrown/caught exception, which is massive.
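If malformed input is a realistic case rather than a bug, int.TryParse sidesteps the exception entirely; a small sketch:

    int intValue = 42;
    string stringValue = "42";

    // TryParse reports failure through its return value instead of throwing,
    // so bad input costs a branch, not an exception.
    int parsed;
    if (int.TryParse(stringValue, out parsed) && parsed == intValue)
    {
        // values match
    }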
The byte[] vs. some-other-type dichotomy is false. The network transports all information as -- essentially -- an array of bytes. So whether a StreamReader wrapped around a NetworkStream is turning the byte[] into a String, or you are doing it yourself, it's still getting done.

Related

C strtol vs C# long.Parse

I wonder why C# does not have a version of long.Parse accepting an offset into the string and a length. In effect, I am forced to call string.Substring first.
This is unlike C strtol where one does not need to extract the substring first.
If I need to parse millions of rows I have a feeling there will be overhead creating those small strings that immediately become garbage.
Is there a way to parse a string into numbers efficiently without creating temporary short lived garbage strings on the heap? (Essentially doing it the C way)
Unless I'm reading this wrong, strtol doesn't take an offset into the string. It takes a memory address, which the caller can set to any position within a character buffer (or outside the buffer, if they aren't paying attention).
This presents a couple issues:
Computation of the offset requires an understanding of how the string is encoded. I believe C# uses UTF-16 for in-memory strings, currently anyway. If that were ever to change, your offsets would be off, possibly with disastrous results.
Computation of the address could easily go stale for managed objects since they are not pinned in memory-- they could be moved around by memory management at any time. You'd have to pin it in memory using something like GCHandle.Alloc. When you're done, you'd better unpin it, or you could have serious problems!
If you get the address wrong, e.g. outside your buffer, your program is likely going to blow up.
I think C programmers are more accustomed to managing memory-mapped objects themselves and have no issue computing offsets and addresses and monkeying around with them like you would with assembly. With a managed language like C#, those sorts of things require more work and aren't typically done -- the only time we pin things in memory is when we have to pass objects off to unmanaged code. When we do it, it incurs overhead. I wouldn't advise it if your overall goal is to improve performance.
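For completeness, the pin/unpin dance described above looks roughly like this (a sketch only; as said, not a performance win):

    using System;
    using System.Runtime.InteropServices;

    class PinDemo
    {
        static void Main()
        {
            string s = "1234567890";
            GCHandle handle = GCHandle.Alloc(s, GCHandleType.Pinned);
            try
            {
                IntPtr address = handle.AddrOfPinnedObject(); // points at the first UTF-16 char
                // ... hand the address to unmanaged code here ...
            }
            finally
            {
                handle.Free(); // forget this and the string can never be moved or collected
            }
        }
    }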
But if you are hell-bent on getting down to the bare metal on this, you could try this solution, where one clever C# programmer reads the string as an array of ASCII-encoded bytes and computes the numbers from that. With his solution you can specify start and length to your heart's content. You'd have to write something different if your strings are encoded in UTF. I would go this route rather than trying to hack the string object's memory mapping.
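In that spirit, a minimal sketch of parsing a non-negative integer straight out of an ASCII byte buffer with caller-supplied start and length (no temporary strings; assumes plain decimal digits):

    using System;

    static class AsciiParser
    {
        // Parses the decimal digits in buffer[start .. start+length) without
        // allocating a substring.
        public static long ParseAsciiInt(byte[] buffer, int start, int length)
        {
            long result = 0;
            for (int i = start; i < start + length; i++)
            {
                byte b = buffer[i];
                if (b < (byte)'0' || b > (byte)'9')
                    throw new FormatException("Non-digit at position " + i);
                result = result * 10 + (b - (byte)'0');
            }
            return result;
        }
    }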

protobuf-net message serialized size property

We are using protobuf-net for serialization and deserialization of messages in an application whose public protocol is based on Google Protocol Buffers. The library is excellent and covers all our requirements except for this one: we need to find out the serialized message length in bytes before the message is actually serialized.
The question has already been asked a year and a half ago and according to Marc, the only way to do this was to serialize to a MemoryStream and read the .Length property afterwards. This is not acceptable in our case, because MemoryStream allocates a byte buffer behind the scenes and we have to avoid this.
This line from the same response gives us hope that it might be possible after all:
If you clarify what the use-case is, I'm sure we can make it easily
available (if it isn't already).
Here is our use case. We have messages whose size varies between several bytes and two megabytes. The application pre-allocates byte buffers used for socket operations and for serializing / deserializing, and once the warm-up phase is over, no additional buffers can be created (hint: avoiding GC and heap fragmentation). Byte buffers are essentially pooled. We also want to avoid copying bytes between buffers / streams as much as possible.
We have come up with two possible strategies and both of them require message size upfront:
Use (large) fixed-size byte buffers and serialize all messages that can fit into one buffer; send the content of the buffer using Socket.Send. We have to know when the next message cannot fit into the buffer and stop serializing. Without message size, the only way to achieve this is to wait for an exception to occur during Serialize.
Use (small) variable-size byte buffers and serialize each message into one buffer; send the content of the buffer using Socket.Send. In order to check out a byte buffer of appropriate size from the pool, we need to know how many bytes a serialized message will occupy.
Because the protocol is already defined (we cannot change this) and requires message length prefix to be Varint32, we cannot use SerializeWithLengthPrefix method.
So is it possible to add a method that estimates a message size without serialization into a stream? If it is something that does not fit into the current feature set and roadmap of the library but is doable, we are interested in extending the library ourselves. We are also looking for alternative approaches, if there are any.
As noted, this is not immediately available, as the code intentionally tries to do a single pass over the data (especially for IEnumerable<T> etc). Depending on your data, though, it might already be doing a moderate amount of copying to allow for the fact that sub-messages are also length-prefixed, so the data might need juggling. This juggling can be greatly reduced by using the "grouped" sub-format internally in the message, as groups allow forwards-only construction without backtracking.
So is it possible to add a method that estimates a message size without serialization into a stream?
An estimate is next to useless; since there is no terminator, it needs to be exact. Ultimately, the sizes are a little hard to predict without actually doing it. There was some code in v1 for size prediction, but the single-pass code currently seems preferred, and in most cases the buffer overhead is nominal (there is code in place to re-use the internal buffers so that it doesn't spend all the time allocating buffers for small messages).
If your message internally is forwards-only (grouped), then a cheat might be to serialize to a fake stream that measures, but drops all the data; you'd end up serializing twice, however.
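That "fake stream" can be a few lines: a write-only Stream that discards every byte and just counts them (a sketch; whether serializing twice is worth it depends on your data):

    using System;
    using System.IO;

    // Write-only stream that measures how many bytes pass through it.
    sealed class CountingStream : Stream
    {
        private long _length;

        public override bool CanRead { get { return false; } }
        public override bool CanSeek { get { return false; } }
        public override bool CanWrite { get { return true; } }
        public override long Length { get { return _length; } }
        public override long Position
        {
            get { return _length; }
            set { throw new NotSupportedException(); }
        }

        public override void Write(byte[] buffer, int offset, int count) { _length += count; }
        public override void Flush() { }
        public override int Read(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
        public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
        public override void SetLength(long value) { throw new NotSupportedException(); }
    }

Usage would then be protobuf-net's existing Serializer.Serialize(counter, message), followed by reading counter.Length -- the exact serialized size, with no buffer retained.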
Re:
and requires message length prefix to be Varint32, we cannot use SerializeWithLengthPrefix method
I'm not quite sure I see the relationship there - it allows a range of formats etc to be used here; perhaps if you can be more specific?
Re copying data around - an idea I played with here is that of using sub-normal forms for the length prefix. For example, it might be that in most cases 5 bytes is plenty, so rather than juggle, it could leave 5 bytes and then simply overwrite without condensing (since the octet 10000000 still means "zero and continue", even if it is redundant). This would still need to be buffered (to allow backfill), but would not require any movement of the data.
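To make that concrete, here is a sketch of a hypothetical helper (not protobuf-net API) that writes a value as exactly five varint bytes, so a length discovered later can be backfilled in place without shifting data:

    static class VarintHelper
    {
        // Writes 'value' as exactly five varint bytes at buffer[offset].
        // The redundant continuation bytes are still valid varint encoding,
        // so readers decode this exactly like the condensed form.
        public static void WriteVarint32Fixed5(byte[] buffer, int offset, uint value)
        {
            for (int i = 0; i < 4; i++)
            {
                buffer[offset + i] = (byte)((value & 0x7F) | 0x80); // low 7 bits + "continue" bit
                value >>= 7;
            }
            buffer[offset + 4] = (byte)value;                       // final 4 bits, no "continue" bit
        }
    }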
A final simple idea would be simply: serialize to a FileStream; then write the file length, and the file data. It trades memory usage for IO, obviously.

Processing data sent and received by a socket server in c#

Hey everyone.
This place is like a goldmine of knowledge and it's helping me so much! My next query is:
I have byte data being sent to my C# socket server. I am converting it to an ASCII string, then splitting the data on a common separator (like the bar | character) and using the pieces. Typically the first piece of data is a command in the form of a 4-digit number. I can imagine this not being very efficient! What would be the best way to process the data I am receiving, efficiently?
Relatedly, how should I be trapping and processing commands? Multiple if statements or a large case/switch statement? I really need speed and efficiency.
Typically the first piece of data is a command in the form of a 4-digit number. I can imagine this not being very efficient! What would be the best way to process the data I am receiving, efficiently?
No, converting a number to/from a string is not efficient. But the question is: does it really matter? It sounds to me like you are trying to do premature optimization. Do not do that. Your goal should be to write code that is easy to read and maintain. Do not optimize until someone actually complains about the performance.
Relatedly, how should I be trapping and processing commands? Multiple if statements or a large case/switch statement? I really need speed and efficiency.
Again: first determine that the command processing really is the bottleneck in your application.
The whole processing really depends on what you do with the incoming messages. You provide way too little information to give a proper answer. Create a new question (since two questions in one is not really allowed), add code which shows your current handling, and describe what you do not like about it.
If you really need the performance, you shouldn't use a string representation for your command but work directly on the bytes. Four digits in string form are 32 or 64 bits in size (depending on which charset you are using), whilst two bytes are sufficient to store any four-digit number. Using a lot of branches (which if statements are) also affects your performance.
My suggestion is that you reserve a fixed-size prefix in your message for the command. You then use those bytes for an O(1) lookup into a table of command objects, each with an execute method, so you can do something like table[command].execute() (see the sketch below).
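A sketch of that table, assuming single-byte command codes and a hypothetical ICommandHandler interface (names are illustrative):

    using System;

    interface ICommandHandler
    {
        void Execute(byte[] payload);
    }

    sealed class LoginHandler : ICommandHandler
    {
        public void Execute(byte[] payload) { /* handle login */ }
    }

    static class Dispatcher
    {
        // One slot per possible command byte: O(1) dispatch, no if/switch chain.
        static readonly ICommandHandler[] Table = new ICommandHandler[256];

        static Dispatcher()
        {
            Table[0x01] = new LoginHandler();
            // ... register the remaining commands ...
        }

        public static void Dispatch(byte command, byte[] payload)
        {
            ICommandHandler handler = Table[command];
            if (handler == null)
                throw new InvalidOperationException("Unknown command: " + command);
            handler.Execute(payload);
        }
    }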
That being said, I don't think the performance gain would be that large, and you are probably better off (maintenance-wise) using one of the serialization libraries out there.

Is there a string type with 8 BIT chars?

I need to store many strings in RAM. They do not contain special Unicode characters; they all contain only characters from ISO 8859-1, which is one byte per character.
Now I could convert every string to a byte[], store that in memory, and convert it back to use it with .Contains() and similar methods, but this would be overhead (in my opinion) and slow.
Is there a string class that is fast and reliable and offers some methods of the original string class like .Contains()?
I need this to store more strings in memory with less RAM used. Or is there another way to do it?
Update:
Thank you for your comments and your answer.
I have a class that stores strings. With one method call I need to figure out if I already have that string in memory. I need to check about 1000 strings per second against the list; there are hundreds of millions in total.
The average size of a string is about 20 chars. It is really the RAM that concerns me.
I even thought about compressing some millions of strings and storing those packages in memory. But then I would need to decompress them every time I need to access the values.
I also tried using a HashSet, but the amount of memory needed was even higher.
I don't need the actual value, just to know whether the value is in the list. So if there is a hash value that can do it, even better. But everything I found needed more memory than the plain strings.
Currently there is no plan for further internationalization, so that is something I would deal with when the time comes :-)
I don't know if using a database would solve it. I don't need to fetch anything, just to know if the value was stored in the class. And I need to do this fast.
It is very unlikely that you will win any significant performance from this. However, if you need to save memory, this strategy may be appropriate.
To convert a string to a byte[] for this purpose, use Encoding.Default.GetBytes() [1].
To convert a byte[] back to a string for display or other string-based processing, use Encoding.Default.GetString().
You can make your code look nicer if you use extension methods defined on string and byte[]. Alternatively, you can wrap the byte[] in a wrapper type and put the methods there. Make this wrapper type a struct, not a class, otherwise it will incur extra heap allocations, which is what you’re trying to avoid.
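A sketch of those extension methods (names are illustrative), using Encoding.Default as described in footnote [1]:

    using System.Text;

    static class ByteStringExtensions
    {
        // string -> one byte per char, via the OS's current ANSI codepage (see [1]).
        public static byte[] ToCompactBytes(this string s)
        {
            return Encoding.Default.GetBytes(s);
        }

        // bytes -> string, for display or other string-based processing.
        public static string ToDisplayString(this byte[] bytes)
        {
            return Encoding.Default.GetString(bytes);
        }
    }

Usage is then just "hello".ToCompactBytes() and bytes.ToDisplayString().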
I want to warn you, though — you are throwing away the ability to have Unicode in your application. You should normally have all alarm bells go off every time you think you need to do this. It is best if you structure your code in such a way that you can easily go back to using string when memory sizes will have gone up and memory consumption stops being an issue.
[1] Encoding.Default returns the current 8-bit codepage of the running operating system. The default for this on English-language Windows is Windows-1252, which is what you want. For Russian Windows it will be Windows-1251 (Cyrillic) etc.
As per comments, a basically bad idea. If you have to do it, byte[] is your friend. There is no byte-oriented string class in .NET.
Check out the string.Intern method; that could help you out:
http://www.yoda.arachsys.com/csharp/strings.html
http://en.csharp-online.net/CSharp_String_Theory%E2%80%94String_intern_pool
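A quick illustration of interning (string.Intern returns the canonical instance from the runtime's intern pool, so duplicates share storage):

    using System;

    class InternDemo
    {
        static void Main()
        {
            string a = new string("hello".ToCharArray()); // a distinct heap instance
            string b = string.Intern(a);                  // the pooled canonical instance
            Console.WriteLine(ReferenceEquals(b, string.Intern("hello"))); // True
        }
    }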
However, looking at your requirements, I think you are over-engineering it. You have 1000 strings at 20 chars = 1000 * 20 * 2 = 40,000 bytes; that's not much memory.
If you really have a large amount, store it in a DB with an index. That would be much faster than anything the average programmer can come up with.

Does it really matter to distinguish between short, int, long?

In my C# app, I would like to know whether it is really important to use short for smaller numbers, int for bigger etc. Does the memory consumption really matter?
Unless you are packing large numbers of these together in some kind of structure, it will probably not affect the memory consumption at all. The best reason to use a particular integer type is compatibility with an API. Other than that, just make sure the type you pick has enough range to cover the values you need. Beyond that for simple local variables, it doesn't matter much.
The simple answer is that it's not really important.
The more complex answer is that it depends.
Obviously you need to choose a type that will hold your data without overflowing, and even if you're only storing smaller numbers, choosing int is probably the most sensible thing to do.
However, if your application loads a lot of data or runs on a device with limited memory then you might need to choose short for some values.
For C# apps that aren't trying to mirror some sort of structure from a file, you're better off using ints or whatever your native format is. The only other time it might matter is if you are using arrays on the order of millions of entries. Even then, I'd still consider ints.
Only you can be the judge of whether the memory consumption really matters to you. In most situations it won't make any discernible difference.
In general, I would recommend using int/Int32 where you can get away with it. If you really need to use short, long, byte, uint etc in a particular situation then do so.
This is entirely relative to the amount of memory you can afford to waste. If you aren't sure, it probably doesn't matter.
The answer is: it depends. The question of whether memory matters is entirely up to you. If you are writing a small application that has minimal storage and memory requirements, then no. If you are Google, storing billions and billions of records on thousands of servers, then every byte can cost some real money.
There are a few cases where I really bother choosing:
When I have memory limitations
When I do bitshift operations
When I care about x86/x64 portability
Every other case is int all the way
Edit: About x86/x64
In C#, an int is always 32 bits, regardless of architecture; what changes between x86 and x64 is the size of native types such as pointers.
If you write code that assumes a native value fits in an int and then move from one architecture to the other, it can lead to problems. For example, a native API returns a pointer-sized value; you cast it to an int and everything is fine on x86, but when you move to x64, all hell breaks loose.
Since the native word size is defined by the architecture, when you change architecture you need to be aware that such casts might lead to potential problems.
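A concrete instance of that trap, using IntPtr (whose size does follow the architecture); the native call here is a hypothetical stand-in:

    using System;

    class SizeDemo
    {
        static IntPtr SomeNativeCall()
        {
            // Stand-in for a real P/Invoke returning a pointer-sized value.
            return new IntPtr(IntPtr.Size == 8 ? 5000000000L : 50000L);
        }

        static void Main()
        {
            Console.WriteLine(sizeof(int));   // 4 on both x86 and x64
            Console.WriteLine(IntPtr.Size);   // 4 on x86, 8 on x64

            IntPtr handle = SomeNativeCall();
            long safe = handle.ToInt64();     // correct on both architectures
            Console.WriteLine(safe);
            try
            {
                int risky = handle.ToInt32(); // throws OverflowException on x64 here
                Console.WriteLine(risky);
            }
            catch (OverflowException)
            {
                Console.WriteLine("value did not fit in an int");
            }
        }
    }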
That all depends on how you are using them and how many you have. Even if you only have a few in memory at a time, this might drive the data type in your backing store.
Memory consumption based on the type of integers you are storing is probably not an issue in a desktop or web app. In a game or a mobile device app, it may be more of an issue.
However, the real reason to differentiate between the types is the kind of numbers you need to store. If you have really big numbers, or high precision, you may need to use long to store it.
The context of the situation is very important here. You don't need to guess at whether it is important or not, though; we are dealing with quantifiable things here. We know that we save 2 bytes by using a short instead of an int.
What do you estimate the largest number of instances in memory at a given point in time will be? If there are a million, then you are saving ~2 MB of RAM. Is that a large amount of RAM? Again, it depends on the context: if the app is running on a desktop with 4 GB of RAM, you probably don't care too much about 2 MB.
If there will be hundreds of millions of instances in memory, the savings get pretty big; but if that is the case you may simply not have enough RAM to hold them, and you may have to store the structure on disk and work with parts of it at a time.
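Those numbers are easy to check empirically; a quick sketch comparing the two array types:

    using System;

    class MemoryDemo
    {
        const int Count = 1000000;

        static void Main()
        {
            long before = GC.GetTotalMemory(true);
            int[] ints = new int[Count];
            long afterInts = GC.GetTotalMemory(true);

            short[] shorts = new short[Count];
            long afterShorts = GC.GetTotalMemory(true);

            Console.WriteLine("int[]:   ~" + (afterInts - before) + " bytes");      // ~4,000,000
            Console.WriteLine("short[]: ~" + (afterShorts - afterInts) + " bytes"); // ~2,000,000

            GC.KeepAlive(ints);
            GC.KeepAlive(shorts);
        }
    }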
Int32 will be fine for almost anything. Exceptions include:
if you have specific needs where a different type is clearly better. Example: if you're writing a 16-bit emulator, Int16 (aka short) would probably be better for representing some of the internals
when an API requires a certain type
One time, I had an invalid int cast, and Visual Studio's first suggestion was to verify my value was less than infinity. I couldn't find a good type for that without using the pre-defined constants, so I used ulong, since that was the closest I could come in .NET 2.0 :)
