How many bytes will a string take up?

How many bytes will a string take up? - c#

Can anyone tell me how many bytes the below string will take up?
string abc = "a";

From my article on strings:
In the current implementation at least, strings take up 20+(n/2)*4 bytes (rounding the value of n/2 down), where n is the number of characters in the string. The string type is unusual in that the size of the object itself varies. The only other classes which do this (as far as I know) are arrays. Essentially, a string is a character array in memory, plus the length of the array and the length of the string (in characters). The length of the array isn't always the same as the length in characters, as strings can be "over-allocated" within mscorlib.dll, to make building them up easier. (StringBuilder does this, for instance.) While strings are immutable to the outside world, code within mscorlib can change the contents, so StringBuilder creates a string with a larger internal character array than the current contents requires, then appends to that string until the character array is no longer big enough to cope, at which point it creates a new string with a larger array. The string length member also contains a flag in its top bit to say whether or not the string contains any non-ASCII characters. This allows for extra optimisation in some cases.
I suspect that was written before I had a chance to work with a 64-bit CLR; I suspect in 64-bit land each string takes up either 4 or 8 more bytes.
EDIT: I wrote up a blog post more recently which includes 64-bit information (and contradicts the above slightly for x86...)

Basically, Each string object require a constant 20 bytes for the object data.
The buffer requires 2 bytes per character.
The memory usage estimation for string in bytes: 20 + (2 * Length).
So, Normally The memory in CLR for this string: 22 bytes
However while we pass or sending this string to another end or in any other usage, we do not need this much memory(we never need the 20 bytes for the object data). So it depends on the type of encoding you select, while you use it.
For a default encoding, it will take 1 byte for a character.
So Answer is 1 byte for default encoding.
You can check with this code:
Encoding.Default.GetBytes("a"); //It will give you a byte array of size 1.
Encoding.Default.GetBytes("ABC"); //It will give you a byte array of size 3.

If you ask about size of string object then it is wrong to ask about its size, without debugger it is impossible to say what exactly is it. Not sure that it is possible with debugger either. string uses pointers internally.
If you ask about size of sequence of chars that it contains then it is 4, because strings are stored in UTF-16. All chars in Basic Multilingual Plane are coded with two bytes.

Related

How can I hold a list of string as efficiently (memory) as possible?

I have huge a list of string. I want to hold these list as memory efficient. I tried to hold on a list. But, it uses 24 bytes for each string which has 5 characters. Namely, there should be some overhead areas.
Then, I tried to hold on a string array. The memory usage has been a bit efficient. But, I have still memory usage problem.
How can I hold a list of string? I know that "C# reserves 2 bytes for each character". I want to hold a string which has 5 characters as 5*2 = 10 bytes. But, why does it use 24 bytes for this process?
Thank you for helps.
enter image description here

Firstly, note that the difference between a List<string> that was created at the correct size, and a string[] (of the same size) is inconsequential for any non-trivial size; a List<T> is really just a fancy wrapper for T[] with insert/resize/etc capabilities. If you only need to hold the data: T[] is fine, but so is List<T> usually.
As for the string - it isn't C# that reserves anything - it is .NET that defines that a string is an object, which is internally a length (int) plus memory for char data, 2 bytes per char. But: objects in .NET have object headers, padding/alignment, etc - and importantly: a minimum size. So yes, they take more memory than just the raw data you're trying to represent.
If you only need the actual data, you could perhaps store the data not as string, but as raw memory - either a simple large byte[] or byte*, or as a twinned pair of int[]/int* (for lengths and/or offsets into the page) and a char[]/char* (for the actual character data), or a byte[]/byte* if you can work with encoded data (i.e. you're mainly interested in IO work). However, working with such a form will be hugely inconvenient - virtually no common APIs will want to play with you unless you are talking in string. There are some APIs that accept raw byte/char data, but they are largely the encoder/decoder APIs, and some IO APIs. So again: unless that's what you're doing: it won't end well. Very recently, some Span<char> / Span<byte> APIs have appeared which would make this slightly less inconvenient (if you can use the latest .NET Core builds, etc), but: I strongly suspect that in most common cases you're just going to have to accept the string overhead and live with it.

Minimum size of any object in 64-bit .NET is 24 bytes.
In 32-bit it's a bit smaller but there's always at least 8 bytes for the object header and here we'd expect the string to store it's length (4 bytes). 8 + 4 + 10 = 22. I'm guessing it also wants/needs all objects to be 4-byte aligned. So if you're storing them as objects, you're not going to get a much smaller representation.
If it's all 7-bit ASCII type characters, you could store them as arrays of bytes but each array would still take up some space.
Your best route (I appreciate this bit is more comment like) is to come up with different processing algorithms that don't require them to all be in memory at the same time in the first place.

Convert Byte[64] array to minimum length string

I have try to generate unlock key like XXXX-XXXX-XXXX or simply small length string or Hexstring. I am using RSA algorithm to encrypt and decrypt the Key. I got some long string like
Q65g2+uiytyEUW5SFsiI/c5z9NSxyuU2CM1SEly6cAVv9PdTpH81XaWS8lITcaTZ4IjdmINwhHBosvt5kdg==
when I convert the byte array (array size is 64 byte) using the below convert method.
Convert.ToBase64String(bytes);
My requirement is to generate the minimal length Key. Is there any way to convert the Byte array (array size is 64 byte) to minimal length and I need that back to byte array or any other suggestions (to minimize the string length) would be helpful.
I have tried to convert the output string to Hex decimal, but the output is too long than the string.

You may want to take a look at What is the most efficient way to encode an arbitrary GUID into readable ASCII (33-127)? There the Base 85 encoding is discussed which is used to compress PDF files.
Though, the difference between Base64 and Base85 in your case is 8 characters.
You can safely remove trailing '==' in Base64 string because it is used for alignment and will always be there for 64-byte values (Of course you will have to add these characters back to decode the string).

Since you mention you want users to be able to type in the string,
there will be an inverse correlation between easy-of-use from point of view of users and the length of string.
Even typing a Base64 string is prone to lot of errors. Base32 strings are much easier to type, but correspondingly the length will increase.
If the users can Copy-Paste the key, then the above is moot and there should not be any valid reason why the length of the string should be as small as possible.

Obviously, you can only fit a certain amount of data into a fixed number of characters. You have pretty much maxed out the limit with base64 already which gives you 6 bits per byte.
Therefore you need to reduce the amount of data that needs to be stored. Can you reduce the key length? You could use a 96 bit key (by always leaving all other bytes zero). That would require 16 base64 characters which is much better.
It seems you don't need much security against brute forcing. So you can reduce the key size even further.

Size of a Struct containing an array

I am using a c# struct as a pseudo-union (by using the LayoutKind.Explicit attribute), to pass network messages around my program. I understand how to use the layout with the primitive types, as they are of a know size.
However, how would I do this with one of the fields being a char array? I know a char is 2 bytes of data (when in unicode format), but how big is char[]? Am I correct in believing that this is a reference type, so its size is not just number of items * 2?
How would I layout the struct for this? Is it even possible?

The size is the width of a reference; so 4 bytes on x86 or 8 bytes on x64. The size of the array is irrelevant, as the array is stored separately on the heap. If you want to serialize that data to a byte stream, then it probably depends on which encoding you use for the char data. UTF16 would indeed be 2 * number of characters, but UTF8 or UTF32 will be different.

That is strange, shouldn't it be equal to the length times the number of bytes per character ?

Encoding-free String class for handling bytes? (Or alternative approach)

I have an application converted from Python 2 (where strings are essentially lists of bytes) and I'm using a string as a convenient byte buffer.
I am rewriting some of this code in the Boo language (Python-like syntax, runs on .NET) and am finding that the strings have an intrinsic encoding type, such as ASCII, UTF-8, etc. Most of the information dealing with bytes refer to arrays of bytes, which are (apparently) fixed length, making them quite awkward to work with.
I can obviously get bytes from a string, but at the risk of expanding some characters into multiple bytes, or discarding/altering bytes above 127, etc. This is fine and I fully understand the reasons for this - but what would be handy for me is either (a) an encoding that guarantees no conversion or discarding of characters so that I can use a string as a convenient byte buffer, or (b) some sort of ByteString class that gives the convenience of the string class. (Ideally the latter as it seems less of a hack.) Do either of these already exist? (Or are trivial to implement?)
I am aware of System.IO.MemoryStream, but the prospect of creating one of those each time and then having to make a System.IO.StreamReader at the end just to get access to ReadToEnd() doesn't seem very efficient, and this is in performance-sensitive code.
(I hope nobody minds that I tagged this as C# as I felt the answers would likely apply there also, and that C# users might have a good idea of the possible solutions.)
EDIT: I've also just discovered System.Text.StringBuilder - again, is there such a thing for bytes?

Use the Latin-1 encoding as described in this answer. It maps values in the range 128-255 unchanged, useful when you want to roundtrip bytes to chars.
UPDATE
Or if you want to manipulate bytes directly, use List<byte>:
List<byte> result = ...
...
// Add a byte at the end
result.Add(b);
// Add a collection of bytes at the end
byte[] bytesToAppend = ...
result.AddRange(bytesToAppend);
// Insert a collection of bytes at any position
byte[] bytesToInsert = ...
int insertIndex = ...
result.InsertRange(insertIndex, bytesToInsert);
// Remove a range of bytes
result.RemoveRange(index, count);
... etc ...
I've also just discovered System.Text.StringBuilder - again, is there such a thing for bytes?
The StringBuilder class is needed because regular strings are immutable, and a List<byte> gives you everything you might expect from a "StringBuilder for bytes".

I would suggest that you use MemoryStream combined with the GetBuffer() operator to retrieve the end result. Strings are actually fixed length and immutable, and to add or replace one byte to a string requires you to copy the whole thing into a new string, which is quite slow. To avoid this you would need to use a StringBuilder which allocates memory and doubles the capacity when needed, but then you might just as well use MemoryStream instead which does a similar thing but on bytes.
Each element in the string is a char and are actually two bytes because .NET strings are always UTF-16 in memory, which means you will also be wasting memory if you decide to store only one byte in each element.

Getting a string, int, etc in binary representation?

Is it possible to get strings, ints, etc in binary format? What I mean is that assume I have the string:
"Hello" and I want to store it in binary format, so assume "Hello" is
11110000110011001111111100000000 in binary (I know it not, I just typed something quickly).
Can I store the above binary not as a string, but in the actual format with the bits.
In addition to this, is it actually possible to store less than 8 bits. What I am getting at is if the letter A is the most frequent letter used in a text, can I use 1 bit to store it with regards to compression instead of building a binary tree.

Is it possible to get strings, ints,
etc in binary format?
Yes. There are several different methods for doing so. One common method is to make a MemoryStream out of an array of bytes, and then make a BinaryWriter on top of that memory stream, and then write ints, bools, chars, strings, whatever, to the BinaryWriter. That will fill the array with the bytes that represent the data you wrote. There are other ways to do this too.
Can I store the above binary not as a string, but in the actual format with the bits.
Sure, you can store an array of bytes.
is it actually possible to store less than 8 bits.
No. The smallest unit of storage in C# is a byte. However, there are classes that will let you treat an array of bytes as an array of bits. You should read about the BitArray class.

What encoding would you be assuming?

What you are looking for is something like Huffman coding, it's used to represent more common values with a shorter bit pattern.
How you store the bit codes is still limited to whole bytes. There is no data type that uses less than a byte. The way that you store variable width bit values is to pack them end to end in a byte array. That way you have a stream of bit values, but that also means that you can only read the stream from start to end, there is no random access to the values like you have with the byte values in a byte array.

What I am getting at is if the letter
A is the most frequent letter used in
a text, can I use 1 bit to store it
with regards to compression instead of
building a binary tree.
The algorithm you're describing is known as Huffman coding. To relate to your example, if 'A' appears frequently in the data, then the algorithm will represent 'A' as simply 1. If 'B' also appears frequently (but less frequently than A), the algorithm usually would represent 'B' as 01. Then, the rest of the characters would be 00xxxxx... etc.
In essence, the algorithm performs statistical analysis on the data and generates a code that will give you the most compression.

You can use things like:
Convert.ToBytes(1);
ASCII.GetBytes("text");
Unicode.GetBytes("text");
Once you have the bytes, you can do all the bit twiddling you want. You would need an algorithm of some sort before we can give you much more useful information.

The string is actually stored in binary format, as are all strings.
The difference between a string and another data type is that when your program displays the string, it retrieves the binary and shows the corresponding (ASCII) characters.
If you were to store data in a compressed format, you would need to assign more than 1 bit per character. How else would you identify which character is the mose frequent?
If 1 represents an 'A', what does 0 mean? all the other characters?

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

How many bytes will a string take up? - c#

Can anyone tell me how many bytes the below string will take up? string abc = "a";

Related

How can I hold a list of string as efficiently (memory) as possible?

Convert Byte[64] array to minimum length string

Size of a Struct containing an array

Encoding-free String class for handling bytes? (Or alternative approach)

Getting a string, int, etc in binary representation?

Categories

Resources