I have homework from my university teacher. I have to write code that encrypts/decrypts a small part of a big file (about 10 GB), using the Salsa20 algorithm.
The main constraint is not to load too much into RAM. As he said, I should read, for example, 100 lines, encrypt/decrypt them, write them to the file, and move on to the next chunk.
I create a list:
List<string> dict = new List<string>();
Read lines (because reading all the bytes at once would use a lot of RAM):
using (StreamReader sReader = new StreamReader(filePath))
{
    // Stop at 100 lines, or earlier if the file runs out of lines.
    while (dict.Count < 100 && !sReader.EndOfStream)
    {
        dict.Add(sReader.ReadLine());
    }
}
Try to create one line from the list:
string words = string.Join("", dict.ToArray());
Encrypt this line
string encrypted;
using (var salsa = new Salsa20.Salsa20())
using (var mstream_out = new MemoryStream())
{
    salsa.Key = key;
    salsa.IV = iv;

    using (var cstream = new CryptoStream(mstream_out,
        salsa.CreateEncryptor(), CryptoStreamMode.Write))
    {
        var bytes = Encoding.UTF8.GetBytes(words);
        cstream.Write(bytes, 0, bytes.Length);
    }

    encrypted = Encoding.UTF8.GetString(mstream_out.ToArray());
}
Then I need to write the 100 encrypted lines back to the file, but I don't know how to do it! Is there any solution?
OK, so here's what you could do.
Accept a filename, a starting line number and an ending line number.
Read the lines, simply writing them to another file if they come before the starting line number or after the ending line number.
Once you read a line that is in the range, you can encrypt it with the key and an IV. You will probably need to encode it to a byte array first, e.g. using UTF-8, as modern ciphers such as Salsa20 operate on bytes, not text.
You can possibly use the line number as the nonce/IV for your stream cipher, if you don't expect the number of lines to change. Otherwise you can prefix the ciphertext with a large, fixed-size, random nonce.
The ciphertext - possibly including the nonce - can be encoded as Base64 without line endings. Then you write the Base64 line to the other file.
Keep encrypting the lines until you reach the end index. It is up to you whether the ending line is inclusive or exclusive.
Now read the remaining lines, and write them to the other file.
Don't forget to finalize the encryption and close the file. You may possibly want to destroy the source input file.
Encrypting raw bytes might be easier, as you could then write back to the original file. Writing encrypted strings, however, will always expand the ciphertext compared with the plaintext, so you need to copy to a new file: the original would have to grow from the middle out.
I haven't got a clue why you would keep a list or dictionary in memory. If that's part of the requirements, I don't see it in the rest of the question. If you read in all the lines of a file that way, then clearly you're using up memory.
Of course, if your 10 GB file is just a single line then you're still using way too much memory. In that case you need to stream everything: reading text from the file into a character buffer, encoding the characters to bytes, encrypting those, encoding the result as Base64 and writing it to the file. Certainly doable, but tricky if you've never done such things.
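A minimal sketch of those steps, assuming the ending line is inclusive. EncryptRange is a made-up helper name, and the Salsa20.Salsa20 class with an 8-byte IV is taken from the question's own snippet, not from the .NET base class library:
static void EncryptRange(string inputPath, string outputPath,
                         int startLine, int endLine, byte[] key)
{
    using (var reader = new StreamReader(inputPath))
    using (var writer = new StreamWriter(outputPath))
    using (var salsa = new Salsa20.Salsa20())   // third-party SymmetricAlgorithm, as in the question
    {
        salsa.Key = key;

        string line;
        int lineNumber = 0;
        while ((line = reader.ReadLine()) != null)
        {
            lineNumber++;
            if (lineNumber < startLine || lineNumber > endLine)
            {
                writer.WriteLine(line);   // outside the range: copy unchanged
                continue;
            }

            // Line number as the nonce, as suggested above (this assumes the line
            // count never changes and that the wrapper expects an 8-byte IV).
            salsa.IV = BitConverter.GetBytes((long)lineNumber);

            byte[] plaintext = Encoding.UTF8.GetBytes(line);
            using (var encryptor = salsa.CreateEncryptor())
            {
                byte[] ciphertext = encryptor.TransformFinalBlock(plaintext, 0, plaintext.Length);
                writer.WriteLine(Convert.ToBase64String(ciphertext));   // Base64, no line breaks
            }
        }
    }
}
Decryption is the same loop with CreateDecryptor() and Convert.FromBase64String() applied to the lines in the range.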
Related
I have created a text file using TextWriter in C#. On final creation the text file often has several rows of whitespace at the end. The whitespace is not included in any of the string objects that make up the file, and I don't know what is causing it. The larger the file, the more whitespace there is.
I've tried various tests to see if the whitespace occurs based upon the content of the strings, but this is not the case, i.e. I have identified the row where the whitespace starts and changed that string for something completely different, but the whitespace still occurs.
//To start:
MemoryStream memoryStream = new MemoryStream();
TextWriter tw = new StreamWriter(memoryStream);
//Loop through records & create a concatenated string object
string strUTL1 = string.Format("{0}{1}{2}{3}{4}{5}{6}{7}", strUTL1_1, strUTL1_2, strUTL1_3, strUTL1_4, strUTL1_5, strUTL1_6, strUTL1_7, strUTL1_8);
//Add the line to the text file
tw.WriteLine(strUTL1);
//Once all rows are added I complete the file
tw.Flush();
tw.Close();
//Then return the file
return File(memoryStream.GetBuffer(), "text/plain", txtFileName);
I don't want to manipulate the file after completion (e.g. replace blank spaces), as this could lead to other problems. The file will be exchanged with a third party and needs to be formatted exactly.
Thank you for your assistance.
As the doc for MemoryStream.GetBuffer explains:
Note that the buffer contains allocated bytes which might be unused. For example, if the string "test" is written into the MemoryStream object, the length of the buffer returned from GetBuffer is 256, not 4, with 252 bytes unused. To obtain only the data in the buffer, use the ToArray method; however, ToArray creates a copy of the data in memory.
Use .ToArray() (which allocates a new array of exactly the right size), or keep using the buffer returned from .GetBuffer() but check .Length to see how many of its bytes are valid.
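Applied to the snippet in the question, the minimal change is something like this (memoryStream, tw and txtFileName are the names from your code):
// Flush the writer so everything it buffered reaches the MemoryStream, then
// close it; MemoryStream.ToArray() still works on a closed stream.
tw.Flush();
tw.Close();

// ToArray() copies exactly Length bytes, so none of the unused, zero-filled
// capacity of the internal buffer (the trailing "whitespace") ends up in the file.
return File(memoryStream.ToArray(), "text/plain", txtFileName);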
GetBuffer() returns all the memory that was allocated, which is almost always more bytes than what you actually wrote into it.
Might I suggest using Encoding.UTF8.GetBytes(...) instead:
string strUTL1 = string.Format("{0}{1}{2}{3}{4}{5}{6}{7}", strUTL1_1, strUTL1_2, strUTL1_3, strUTL1_4, strUTL1_5, strUTL1_6, strUTL1_7, strUTL1_8);
var bytes = Encoding.UTF8.GetBytes(strUTL1);
return File(bytes, "text/plain", txtFileName);
Use ToArray() instead of GetBuffer(), since the buffer is larger than needed.
That's often the case. Classes or functions that work with buffers usually reserve more memory than they need to hold the data, and then report how many bytes were actually written into the buffer. You should then only use the first n bytes of the buffer.
Quoting MSDN:
For example, if the string "test" is written into the MemoryStream object, the length of the buffer returned from GetBuffer() is 256, not 4, with 252 bytes unused. To obtain only the data in the buffer, use the ToArray() method; however, ToArray() creates a copy of the data in memory.
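A generic sketch of that "first n bytes only" pattern, with ms standing in for the MemoryStream and outputStream for wherever the bytes are headed (both names are placeholders):
ms.Flush();                       // make sure everything written has reached the stream

byte[] buffer = ms.GetBuffer();   // the whole allocated capacity, partly zero-filled
int count = (int)ms.Length;       // how many of those bytes are real data

// Only the first 'count' bytes are valid; pass that range on explicitly,
// e.g. when copying to another stream:
outputStream.Write(buffer, 0, count);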
I have a JSON file that is 50 GB and beyond.
Below is what I have written to read a very small chunk of the JSON. I now need to modify this to read the large file.
internal static IEnumerable<T> ReadJson<T>(string filePath)
{
    DataContractJsonSerializer ser = new DataContractJsonSerializer(typeof(T));
    using (StreamReader sr = new StreamReader(filePath))
    {
        string line;
        // Read and deserialize one line at a time until the end of
        // the file is reached.
        while ((line = sr.ReadLine()) != null)
        {
            byte[] jsonBytes = Encoding.UTF8.GetBytes(line);
            XmlDictionaryReader jsonReader = JsonReaderWriterFactory.CreateJsonReader(jsonBytes, XmlDictionaryReaderQuotas.Max);
            var myPerson = ser.ReadObject(jsonReader);
            jsonReader.Close();
            yield return (T)myPerson;
        }
    }
}
Would it suffice if I specify the buffer size while constructing the StreamReader in the current code?
Please correct me if I am wrong here: the buffer size basically specifies how much data is read from disk into memory at a time. So if the file is 100 MB in size with a buffer size of 5 MB, it reads 5 MB into memory at a time until the entire file has been read.
Assuming my understanding of point 3 is right, what would be the ideal buffer size for such a large text file? Would int.MaxValue be a bad idea? int.MaxValue is 2147483647, and I presume the buffer size is in bytes, which works out to about 2 GB; allocating that alone could take time. I was looking at something like 100 MB - 300 MB as the buffer size.
It is going to read a line at a time (of the input file), which could be 10 bytes or could be all 50 GB. So it comes down to: how is the input file structured? And if the input JSON has newlines anywhere other than cleanly at the breaks between objects, this could get really messy.
The buffer size might affect how much it reads while looking for the end of each line, but ultimately it needs to find a newline each time (at least, the way the code is currently written).
I think you should first compare different parsers before worrying about details such as the buffer size.
The differences between DataContractJsonSerializer, Raven JSON or Newtonsoft JSON will be quite significant.
So your main issue with this is where your boundaries are. Given that your document is JSON, it seems likely to me that your boundaries are classes; I assume (or hope) that you don't have one honking great class that is 50 GB large. I also assume that you don't really need all those classes in memory, but that you may need to search the whole thing for your subset. Does that sound roughly right? If so, I think your pseudocode is something like:
using a JSON parser that accepts a StreamReader (Newtonsoft?)
read and parse until EOF
    yield return your parsed class if it matches the criteria
    read and parse the next class
end
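A sketch of that pseudocode using Newtonsoft's Json.NET, assuming the 50 GB file is one large JSON array of objects; matchesCriteria is a placeholder for whatever filter you need, not something from the question:
// Requires the Newtonsoft.Json package (using Newtonsoft.Json;)
internal static IEnumerable<T> ReadLargeJsonArray<T>(string filePath, Func<T, bool> matchesCriteria)
{
    var serializer = new JsonSerializer();

    using (var streamReader = new StreamReader(filePath))
    using (var jsonReader = new JsonTextReader(streamReader))
    {
        // JsonTextReader pulls tokens from the stream as needed, so only the
        // object currently being deserialized is held in memory.
        while (jsonReader.Read())
        {
            if (jsonReader.TokenType == JsonToken.StartObject)
            {
                T item = serializer.Deserialize<T>(jsonReader);
                if (matchesCriteria(item))
                    yield return item;
            }
        }
    }
}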
I need to change a file in memory; currently I read the file into a byte[] using a FileStream and a BinaryReader.
I was wondering what's the best approach to change that file in memory: convert the byte[] to a string, make the changes and call Encoding.GetBytes()? Or read the file as a string with File.ReadAllText() first and then call Encoding.GetBytes()? Or will any approach work without caveats?
Any special approaches? I need to replace specific text inside files with additional characters or replacement strings, across several hundred thousand files. Reliability is preferred over efficiency. The files are text, like HTML, not binary files.
Read the files using File.ReadAllText(), modify them, then do byte[] byteData = Encoding.UTF8.GetBytes(your_modified_string_from_file), using the encoding with which the files were saved. This gives you a byte[]. You can convert the byte[] to a stream like this:
MemoryStream stream = new MemoryStream();
stream.Write(byteData, 0, byteData.Length);
Edit:
It looks like one of the Add methods in the API can take a byte array, so you don't have to use a stream.
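Put together, the flow looks roughly like this; fileEncoding, path and the Replace call are placeholders rather than anything from the question:
// Sketch only: pick the encoding your files were actually saved with.
Encoding fileEncoding = Encoding.UTF8;

string text = File.ReadAllText(path, fileEncoding);
string modified = text.Replace("oldText", "newText");   // your actual edits go here

byte[] byteData = fileEncoding.GetBytes(modified);

// Only needed if the API you hand the data to wants a Stream rather than a byte[]:
using (var stream = new MemoryStream(byteData))
{
    // pass 'stream' along here
}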
You're definitely making things harder on yourself by reading into bytes first. Just use a StreamReader. You can probably get away with using ReadLine() and processing a line at a time. This can seriously reduce your app's memory usage, especially if you're working with that many files.
using (var reader = File.OpenText(originalFile))
using (var writer = File.CreateText(tempFile))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        var temp = DoMyStuff(line);
        writer.WriteLine(temp);
    }
}
File.Delete(originalFile);
File.Move(tempFile, originalFile);
Based on the size of the files, I would use File.ReadAllText to read them and File.WriteAllText to write them. This frees you from the responsibility of having to call Close or Dispose for either the read or the write.
You generally don't want to read a text file at the binary level; just use File.ReadAllText() and supply it with the correct encoding used in the file (there's an overload for that). If the file encoding is UTF-8 or UTF-32, the method can usually detect and use the correct encoding automatically. The same applies to writing it back: if it's not UTF-8, specify which encoding you want.
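For example (a sketch: the 1252 code page is just an illustration of the encoding overload, and DoMyStuff stands in for your replacement logic as in the earlier snippet):
// Let ReadAllText detect a BOM, or pass the encoding explicitly if you know
// what the files use; code page 1252 is only an example.
Encoding enc = Encoding.GetEncoding(1252);

string text = File.ReadAllText(path, enc);
string modified = DoMyStuff(text);   // your replacements

// Write back with the same encoding so nothing but your edits changes.
File.WriteAllText(path, modified, enc);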
I want to convert an image file to a string. The following works:
MemoryStream ms = new MemoryStream();
Image1.Save(ms, ImageFormat.Jpeg);
byte[] picture = ms.ToArray();
string formmattedPic = Convert.ToBase64String(picture);
However, when saving this to an XmlWriter, it takes ages (20 seconds for a 26 KB image file). Is there a way to speed this up?
Thanks,
Raks
There are three points where you are doing large operations needlessly:
Getting the stream's bytes
Converting it to Base64
Writing it to the XmlWriter.
Instead, first call Length and GetBuffer. This lets you operate upon the stream's buffer directly (do flush it first, though).
Then, implement Base64 yourself. It's relatively simple: you take groups of 3 bytes, do some bit-twiddling to get the index of the character each group converts to, and output that character. At the very end you add some = symbols according to how many bytes were in the last partial block (== when one byte remains, = when two bytes remain, and none if there was no partial block).
Do this writing into a char buffer (a char[]). The most efficient size is a matter for experimentation, but I'd start with 2048 characters. When you've filled the buffer, call XmlWriter.WriteRaw on it, and then start writing back at index 0 again.
This way, you're doing fewer allocations, and you start on the output from the moment you've got your image loaded into the memory stream. Generally, this should result in better throughput.
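A sketch of that chunked approach; rather than hand-rolling the bit-twiddling it leans on Convert.ToBase64CharArray for each chunk (1536 input bytes is a multiple of 3, so no = padding appears mid-stream, and it encodes to exactly 2048 characters). The ms and xmlWriter names stand in for the question's MemoryStream and whatever XmlWriter is being written to; note XmlWriter also has a built-in WriteBase64 that does much the same internally.
ms.Flush();                       // make sure the JPEG bytes are all in the stream
byte[] buffer = ms.GetBuffer();   // internal buffer, no copy
int length = (int)ms.Length;      // number of valid bytes in it

const int chunkBytes = 1536;          // multiple of 3 => no padding until the end
char[] chunkChars = new char[2048];   // 1536 bytes encode to 2048 Base64 chars

for (int offset = 0; offset < length; offset += chunkBytes)
{
    int count = Math.Min(chunkBytes, length - offset);
    int charsWritten = Convert.ToBase64CharArray(buffer, offset, count, chunkChars, 0);
    xmlWriter.WriteRaw(chunkChars, 0, charsWritten);
}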
Hey there! I'm trying to read a 150 MB file with a FileStream, but every time I do, all I get is |zl instead of the whole stream. Note that the file has some special characters in it.
Does anybody know what the problem could be? Here is my code:
using (FileStream fs = File.OpenRead(path))
{
    byte[] buffer = new byte[fs.Length];
    fs.Read(buffer, 0, buffer.Length);
    extract = Encoding.Default.GetString(buffer);
}
Edit:
I tried File.ReadAllText, but it still returned the same four characters. It works fine on any other file except for these few. When I use ReadAllLines it only gets the first line.
fs.Read() does not read the whole smash of bytes all at once; it reads some number of bytes and returns the number of bytes read. MSDN has an excellent example of how to use it to get the whole file:
http://msdn.microsoft.com/en-us/library/system.io.filestream.read.aspx
For what it's worth, reading the entire 150MB of data into memory is really going to put a drain on your client's system -- the preferred option would be to optimize it so that you don't need the whole file all at once.
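The shape of that MSDN loop, for reference (a sketch; it still pulls the whole file into memory, which the note above warns about; path and extract are the names from the question):
using (FileStream fs = File.OpenRead(path))
{
    byte[] buffer = new byte[fs.Length];
    int totalRead = 0;

    // Keep calling Read until it has delivered every byte; it may return
    // fewer bytes than requested on any given call.
    while (totalRead < buffer.Length)
    {
        int bytesRead = fs.Read(buffer, totalRead, buffer.Length - totalRead);
        if (bytesRead == 0)
            break;   // unexpected end of file
        totalRead += bytesRead;
    }

    extract = Encoding.Default.GetString(buffer, 0, totalRead);
}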
If you want to read text this way, File.ReadAllLines (or ReadAllText) - http://msdn.microsoft.com/en-us/library/s2tte0y1.aspx - is the better option.
My guess is that the file is not a text file to start with, and that the way you display the resulting string stops at a NUL (0) character.
As debracey pointed out, Read returns the number of bytes read; check that out. Also, for file operations it is unlikely to stop at 4 characters...