FromBase64String Method maximum length - c#

Basically as long as is practical - you'll be hit by other limitations (maximum size of a string, or indeed any object) before you hit a problem in FromBase64String itself.
If you want to read huge amounts of base64 data, you'll need to split it up. So long as you read chunks which are multiples of 4 characters at a time, you should be able to transform it one chunk at a time without any problem.
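As a rough illustration of the chunked approach, here is a minimal sketch. The class name Base64Chunks, the method name Decode and the default chunk size are made up for this example, and it assumes the input contains no whitespace or line breaks (otherwise a chunk boundary could split a 4-character group).

```csharp
using System;
using System.IO;

public static class Base64Chunks
{
    // Hypothetical helper: decodes Base64 text chunk by chunk so the whole
    // payload never has to exist as a single string.
    // Assumes the input has no whitespace or line breaks.
    public static void Decode(TextReader input, Stream output, int chunkChars = 4096)
    {
        if (chunkChars <= 0 || chunkChars % 4 != 0)
            throw new ArgumentException("chunkChars must be a positive multiple of 4.");

        var buffer = new char[chunkChars];
        int read;
        // ReadBlock keeps reading until the buffer is full or the input ends,
        // so every chunk except the last has exactly chunkChars characters.
        while ((read = input.ReadBlock(buffer, 0, buffer.Length)) > 0)
        {
            byte[] bytes = Convert.FromBase64CharArray(buffer, 0, read);
            output.Write(bytes, 0, bytes.Length);
        }
    }
}
```

Any '=' padding can only appear in the last chunk, which is why each chunk can be decoded independently.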

Related

How can I hold a list of string as efficiently (memory) as possible?

I have a huge list of strings and want to hold it in memory as efficiently as possible. I first used a list, but it takes 24 bytes for each 5-character string, so there must be some overhead.
Then I tried a string array. Memory usage improved a little, but it is still a problem.
How can I hold a list of strings? I know that "C# reserves 2 bytes for each character", so I expected a 5-character string to take 5*2 = 10 bytes. Why does it use 24 bytes instead?
Thank you for your help.
Firstly, note that the difference between a List<string> that was created at the correct size, and a string[] (of the same size) is inconsequential for any non-trivial size; a List<T> is really just a fancy wrapper for T[] with insert/resize/etc capabilities. If you only need to hold the data: T[] is fine, but so is List<T> usually.
As for the string - it isn't C# that reserves anything - it is .NET that defines that a string is an object, which is internally a length (int) plus memory for char data, 2 bytes per char. But: objects in .NET have object headers, padding/alignment, etc - and importantly: a minimum size. So yes, they take more memory than just the raw data you're trying to represent.
If you only need the actual data, you could perhaps store the data not as string, but as raw memory - either a simple large byte[] or byte*, or as a twinned pair of int[]/int* (for lengths and/or offsets into the page) and a char[]/char* (for the actual character data), or a byte[]/byte* if you can work with encoded data (i.e. you're mainly interested in IO work). However, working with such a form will be hugely inconvenient - virtually no common APIs will want to play with you unless you are talking in string. There are some APIs that accept raw byte/char data, but they are largely the encoder/decoder APIs, and some IO APIs. So again: unless that's what you're doing: it won't end well. Very recently, some Span<char> / Span<byte> APIs have appeared which would make this slightly less inconvenient (if you can use the latest .NET Core builds, etc), but: I strongly suspect that in most common cases you're just going to have to accept the string overhead and live with it.
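To make the "twinned pair" idea concrete, here is a minimal sketch of one char[] holding all character data plus an int[] of offsets. PackedStrings is a hypothetical name, and the layout is only one possible reading of the suggestion above, not a measured or recommended implementation.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical container: all characters live in one char[], and an int[]
// records where each string starts, so there is no per-string object header.
public sealed class PackedStrings
{
    private readonly char[] _chars;
    private readonly int[] _offsets; // _offsets[i] = start of item i; last entry = total length

    public PackedStrings(IReadOnlyList<string> items)
    {
        _offsets = new int[items.Count + 1];
        int total = 0;
        for (int i = 0; i < items.Count; i++)
        {
            _offsets[i] = total;
            total += items[i].Length;
        }
        _offsets[items.Count] = total;

        _chars = new char[total];
        for (int i = 0; i < items.Count; i++)
            items[i].CopyTo(0, _chars, _offsets[i], items[i].Length);
    }

    public int Count => _offsets.Length - 1;

    // Materializes a real string only when one is actually requested.
    public string this[int index] =>
        new string(_chars, _offsets[index], _offsets[index + 1] - _offsets[index]);
}
```

The inconvenience mentioned above shows up in the indexer: every time an ordinary API needs a string, a new one has to be allocated.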
Minimum size of any object in 64-bit .NET is 24 bytes.
In 32-bit it's a bit smaller, but there's always at least 8 bytes for the object header, and here we'd expect the string to store its length (4 bytes). 8 + 4 + 10 = 22. I'm guessing it also wants/needs all objects to be 4-byte aligned, which would round 22 up to 24. So if you're storing them as objects, you're not going to get a much smaller representation.
If it's all 7-bit ASCII type characters, you could store them as arrays of bytes but each array would still take up some space.
Your best route (I appreciate this bit is more comment like) is to come up with different processing algorithms that don't require them to all be in memory at the same time in the first place.

Cutting random bytes off of file byte array in C#

So I've been working on this project for a while now, involving LSB steganography. Really fun stuff. Anyways, I just finished writing the code for embedding and extracting files from an image(instead of just plaintext), and I'm running into this problem. I can recognize the MIME and extension of the bytes, but because the embedded file doesn't usually take up all of the LSBs of the image, there's a lot of garbage data. So I have the extracted file + some garbage in the byte array right after it. I need to figure out how to cut these, so that the file that is being exported is the correct, smaller size.
TLDR: I have a byte array with a recognized file in it, with some additional random bytes. How do I find out where the file ends and the random bytes begin?
Remember this is all in C#.
Any advice is appreciated.
Link to my project for reference: https://github.com/nicosogangstar/Steg
Generally you have two options.
End of stream marker
This is the more direct approach of the two, but it may lack some versatility depending on what data you want to hide. After you embed your data, continue by embedding a unique sequence of bits/bytes that you know cannot occur earlier in the data. As you extract the bits, you can stop reading once you encounter this sequence. If you expect to hide only readable text, i.e. bytes with ASCII codes between 32 and 127, your marker can be as short as eight 0s or eight 1s. However, if you intend to hide any sort of binary data, where every byte value has a chance of appearing, you may accidentally encounter the marker while extracting legitimate data and thus halt the process prematurely.
Header information
You can add a header preceding the data, e.g. another 16-24 bits (or any other amount) which can be translated to a number that tells you how many bits/bytes/pixels to read before stopping. For example, if you want to hide a byte array of size 1000, first embed 2 bytes holding the length of the secret and then follow them with the actual data. More specifically, split the length into 2 bytes, where the first byte has the 8th to 15th bits and the second byte has the 0th to 7th bits of the number 1000 in binary.
00000011 11101000    (1000 in binary)
3        -24         (byte values; -24 as a signed byte, which is 232 as an unsigned C# byte)
You can embed all sorts of information in a header, such as whether the data is encrypted or compressed with some algorithm, the original filename of the data, how many LSBs to read for extracting the information, etc.
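The header approach can be sketched as follows. This is an illustration only: LengthHeader, Wrap and Unwrap are hypothetical names, and it uses a 4-byte big-endian length rather than the 2 bytes of the example above so that payloads larger than 65535 bytes also fit. The embedding and extraction of the bytes into the image LSBs is assumed to happen elsewhere.

```csharp
using System;

public static class LengthHeader
{
    // Prepends a 4-byte big-endian length to the payload before embedding.
    public static byte[] Wrap(byte[] payload)
    {
        int n = payload.Length;
        var framed = new byte[4 + n];
        framed[0] = (byte)(n >> 24);
        framed[1] = (byte)(n >> 16);
        framed[2] = (byte)(n >> 8);
        framed[3] = (byte)n;
        Buffer.BlockCopy(payload, 0, framed, 4, n);
        return framed;
    }

    // Reads the length from the first 4 extracted bytes and returns only
    // that many payload bytes, dropping any trailing garbage.
    public static byte[] Unwrap(byte[] extracted)
    {
        if (extracted.Length < 4)
            throw new ArgumentException("Extracted data is too short to hold a header.");

        int n = (extracted[0] << 24) | (extracted[1] << 16) | (extracted[2] << 8) | extracted[3];
        if (n < 0 || n > extracted.Length - 4)
            throw new ArgumentException("Header length does not fit the extracted data.");

        var payload = new byte[n];
        Buffer.BlockCopy(extracted, 4, payload, 0, n);
        return payload;
    }
}
```

With this framing, the extractor reads the header first and then knows exactly how many bytes belong to the file, regardless of what follows.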

Convert Byte[64] array to minimum length string

I am trying to generate an unlock key like XXXX-XXXX-XXXX, or simply a short string or hex string. I am using the RSA algorithm to encrypt and decrypt the key. I get a long string like
Q65g2+uiytyEUW5SFsiI/c5z9NSxyuU2CM1SEly6cAVv9PdTpH81XaWS8lITcaTZ4IjdmINwhHBosvt5kdg==
when I convert the byte array (array size is 64 bytes) using the method below.
Convert.ToBase64String(bytes);
My requirement is to generate a key of minimal length. Is there any way to convert the byte array (array size is 64 bytes) to a minimal-length string that I can convert back to the byte array? Any other suggestions for minimizing the string length would also be helpful.
I have tried converting the output to hexadecimal, but the result is even longer than the Base64 string.
You may want to take a look at "What is the most efficient way to encode an arbitrary GUID into readable ASCII (33-127)?" It discusses Base85 encoding, which is used in PDF files to encode binary data.
Though, the difference between Base64 and Base85 in your case is 8 characters.
You can safely remove trailing '==' in Base64 string because it is used for alignment and will always be there for 64-byte values (Of course you will have to add these characters back to decode the string).
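A minimal sketch of stripping and restoring the padding (UnpaddedBase64 is a hypothetical helper name used only for this example):

```csharp
using System;

public static class UnpaddedBase64
{
    // Encodes and drops the trailing '=' padding characters.
    public static string Encode(byte[] bytes) =>
        Convert.ToBase64String(bytes).TrimEnd('=');

    // Restores the padding so the length is a multiple of 4, then decodes.
    public static byte[] Decode(string text)
    {
        int padding = (4 - text.Length % 4) % 4;
        return Convert.FromBase64String(text + new string('=', padding));
    }
}
```

For a 64-byte value this shortens the string from 88 to 86 characters.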
Since you mention you want users to be able to type in the string, there will be an inverse correlation between ease of use from the users' point of view and the length of the string.
Even typing a Base64 string is prone to a lot of errors. Base32 strings are much easier to type, but the length increases correspondingly.
If the users can Copy-Paste the key, then the above is moot and there should not be any valid reason why the length of the string should be as small as possible.
Obviously, you can only fit a certain amount of data into a fixed number of characters. You have pretty much maxed out the limit with Base64 already, which gives you 6 bits per character.
Therefore you need to reduce the amount of data that needs to be stored. Can you reduce the key length? You could use a 96 bit key (by always leaving all other bytes zero). That would require 16 base64 characters which is much better.
It seems you don't need much security against brute forcing. So you can reduce the key size even further.

ASCII file parsing speed

I have two types of file. One of them is an ASCII file in which the data is stored like this:
X Y Value
0 0 5154,4
1 0 5545455;
. . ...
. . ...
The other one is a binary file.
I parse the first one with StreamReader and the ReadLine() method, then use Split(' ') and store the values in a double[,] array.
I parse the second one with BinaryReader.
Parsing the binary file is 3-4 times faster than parsing the ASCII one.
Question 1: Reading the ASCII file is slower than reading the binary one. Is that normal?
Question 2: Can you suggest another way to parse the ASCII file?
It's not so much that reading ASCII is slower, but how you do it.
It's parsing: looking for new lines and separators, then converting bits of text to other formats. BinaryReader is basically a straight memory copy.
It's like the difference between fixed length and CSV, or CSV and XML. The more metadata you add, the more you can get out of it, but the more it costs.
Reading an ASCII file character by character might work out faster than ReadLine and Split, in that you could optimise it for your specific file structure. It is a lot of work though, and very fragile, making it a dubious prospect. Moving the loading to a separate thread, perhaps even processing the lines in parallel, might be more rewarding, and would definitely be more satisfying and reusable.
Reading an ASCII file and reading a binary file are not different in themselves; the difference is in the parsing. After reading the ASCII file you have to parse each string into a double, and that costs processing time. In the binary file, the data you read is already the binary representation of the double, so no parsing is needed.
Once a month we receive a 350 MB CSV file with 3.5 million rows. We used to read it one line at a time and build some indexes, which took approx. 60 seconds every time the service was restarted. I wrote a program that boiled it down to 1.7 million rows and converted it to a binary format of approx. 24 MB. This data was read directly into memory in 7 ms; the indexes were generated when needed and the data was converted when used. Memory consumption dropped from 400 MB to 90 MB. The point is that you should choose an appropriate format for your data if performance is an issue. Also note that this solution is only possible because the data is fairly static and is not retrieved more than a few million times in 24 hours. I believe that the new service actually answers a little faster now than it used to.
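As an illustration of "convert once, then read binary", here is a minimal sketch that writes a double[,] grid to a stream and reads it back without any text parsing. The class name GridBinary and the file layout (two Int32 dimensions followed by row-major doubles) are assumptions made for this example, not the format used in either the question or the answers.

```csharp
using System.IO;
using System.Text;

public static class GridBinary
{
    // Assumed layout: rows (Int32), cols (Int32), then rows*cols doubles, row by row.
    public static void Write(double[,] grid, Stream output)
    {
        using (var writer = new BinaryWriter(output, Encoding.UTF8, leaveOpen: true))
        {
            int rows = grid.GetLength(0), cols = grid.GetLength(1);
            writer.Write(rows);
            writer.Write(cols);
            for (int r = 0; r < rows; r++)
                for (int c = 0; c < cols; c++)
                    writer.Write(grid[r, c]);
        }
    }

    // Reads the same layout back; no string splitting or number parsing involved.
    public static double[,] Read(Stream input)
    {
        using (var reader = new BinaryReader(input, Encoding.UTF8, leaveOpen: true))
        {
            int rows = reader.ReadInt32();
            int cols = reader.ReadInt32();
            var grid = new double[rows, cols];
            for (int r = 0; r < rows; r++)
                for (int c = 0; c < cols; c++)
                    grid[r, c] = reader.ReadDouble();
            return grid;
        }
    }
}
```

The ASCII file would still have to be parsed once to produce the grid; the gain comes on every later load.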

How do I accomplish random reads of a UTF8 file

My understanding is that reads to a UTF8 or UTF16 Encoded file can't necessarily be random because of the occasional surrogate byte (used in Eastern languages for example).
How can I use .NET to skip to an approximate position within the file, and read the unicode text from a semi-random position?
Do I discard surrogate bytes and wait for a word break to continue reading? If so, what are the valid word breaks I should wait for until I start the decoding?
Easy, UTF-8 is self-synchronizing.
Simply jump to a random byte in the file and skip all bytes whose two leading bits are 10 (continuation bytes). The first byte that does not start with 10 is the starting byte of a proper UTF-8 character, and from there you can read the following bytes using a regular UTF-8 decoder.
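A minimal sketch of that skip loop, assuming a seekable stream containing well-formed UTF-8 (Utf8Seek and SeekToCharacterStart are hypothetical names for this example):

```csharp
using System.IO;

public static class Utf8Seek
{
    // Positions the stream at the first character boundary at or after
    // approximatePosition and returns that position.
    public static long SeekToCharacterStart(Stream stream, long approximatePosition)
    {
        stream.Position = approximatePosition;
        int b;
        while ((b = stream.ReadByte()) != -1)
        {
            // Continuation bytes look like 10xxxxxx; anything else starts a character.
            if ((b & 0xC0) != 0x80)
            {
                stream.Position -= 1; // step back so the lead byte is read again
                return stream.Position;
            }
        }
        return stream.Position; // reached the end of the stream
    }
}
```

After the call, the stream can be handed to a StreamReader created with Encoding.UTF8 to decode text from that point on.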
Assuming that you're looking to extract a pseudo-random character from a UTF-8 file, I personally would lean away from trying to work out how to jump into a random place and then scroll forwards to a guaranteed 'start of character' position, which my feeling is would be a tricky proposition (edit: this is wrong, see the note at the end of this answer). How about something like:
Establish the length of the file in bytes
Heuristically guess the number of characters - for example, by scaling by a constant established from some suitable corpus; or by examining the first n bytes and seeing how many characters they describe, in order to get a scaling constant that might be more representative of this file
Pick a pseudo-random number in 1..<guessed number of characters in file>
If the file is very big (which I'm guessing it must be, else you wouldn't be asking this), use a buffered read to:
Read the file's bytes, decoding to UTF-8, until you reach the desired character. If you fall off the end of the file, use the last
A buffered read here will need to use two buffers which are alternately 'first' to avoid losing context when a character's bytes are split across two reads, eg:
Read Buffer A : bytes 1000-1999
Read Buffer B : bytes 2000-2999
If a character occupies bytes 1998-2001, using a single buffer would lose context.
Read Buffer A : bytes 3000-3999
Now in effect buffer A follows buffer B when we convert the byte stream into characters.
As noted by #jleedev below, and as seen in the other answer, it is actually easy and safe to 'scroll forward' to a guaranteed character start. But the character count estimation stuff above might still prove useful.
For UTF-16, you always have to jump to an even byte position. Then you can check whether a trailing surrogate follows. If so, skip it, otherwise you are at the start of a well-formed UTF-16 code unit sequence (always assuming that the file is well-formed, of course).
The Unicode encodings UTF-8 and UTF-16 were specifically designed to be self-synchronizing, and there are strong guarantees that you only have to skip at most a small number of code units.
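For the UTF-16 case, the same idea can be sketched over a byte buffer. This example assumes well-formed little-endian UTF-16 without relying on a byte order mark; Utf16Seek and AlignLittleEndian are hypothetical names.

```csharp
public static class Utf16Seek
{
    // Returns the first character boundary at or after approximatePosition
    // (rounded down to an even byte offset) in little-endian UTF-16 data.
    public static int AlignLittleEndian(byte[] data, int approximatePosition)
    {
        int pos = approximatePosition & ~1; // code units start at even offsets

        if (pos + 1 < data.Length)
        {
            int unit = data[pos] | (data[pos + 1] << 8);
            // A trailing (low) surrogate is the second half of a pair: skip it.
            if (unit >= 0xDC00 && unit <= 0xDFFF)
                pos += 2;
        }
        return pos;
    }
}
```

At most one code unit is ever skipped, which matches the guarantee described above.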
