I have the following code, which is reading a Stream to store the content of it as a string. Unfortunately after the StreamReader is not used anymore, the hash value of the Stream has changed. How is this possible? The Stream is readonly and thus can't be changed.
string content;
string hash = Cryptography.CalculateSHA1Hash(stream); // 5B006E35CF1838871FDC1E3DF52B0CB5A8A97274
using (StreamReader reader = new StreamReader(stream))
{
content = reader.ReadToEnd();
}
hash = Cryptography.CalculateSHA1Hash(stream); // DA39A3EE5E6B4B0D3255BFEF95601890AFD80709
The SHA1 value DA39A3EE5E6B4B0D3255BFEF95601890AFD80709 is the result of hashing empty input. The call to Cryptography.CalculateSHA1Hash reads everything (from the current position to the end) from the stream and hashes it. There's no more data to read after the first call to Cryptography.CalculateSHA1Hash.
I would also guess that your StreamReader.ReadToEnd() returns an empty string due to the same reason.
Are the 2 values always the same? It's possible that your method is creating a hash from the contents of the stream, and in the second instance the stream is at the end position so has no more data to be read.
If you seek the stream back to the beginning do you get consistent numbers?
Maybe the stream Position has changed (e.g. by ReadToEnd) and the digest is computed from the current Position?
It's only a guess since we can't help you much without seeing the code for Cryptography.CalculateSHA1Hash.
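For reference, such a helper presumably looks something like the sketch below (an assumption, since the question doesn't show Cryptography.CalculateSHA1Hash); the key point is that SHA1.ComputeHash(Stream) consumes the stream from its current position to the end:
public static string CalculateSHA1Hash(Stream stream)
{
    // Hypothetical sketch - ComputeHash(Stream) reads from the current position onwards
    using (var sha1 = System.Security.Cryptography.SHA1.Create())
    {
        byte[] hash = sha1.ComputeHash(stream);
        return BitConverter.ToString(hash).Replace("-", string.Empty);
    }
}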
You have placed your StreamReader into a using block - a good thing. However, TextReader.Dispose by default calls Dispose on the underlying stream, so the stream may no longer be usable once the using block ends.
Try checking the hash from within the using block.
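Putting the suggestions together, a minimal sketch (assuming the stream is seekable, and using the leaveOpen overload so disposing the reader does not close the stream):
stream.Position = 0;
string hash = Cryptography.CalculateSHA1Hash(stream);

stream.Position = 0;
string content;
// leaveOpen: true keeps the underlying stream open after the reader is disposed
using (var reader = new StreamReader(stream, Encoding.UTF8, true, 1024, leaveOpen: true))
{
    content = reader.ReadToEnd();
}

stream.Position = 0;
hash = Cryptography.CalculateSHA1Hash(stream); // now matches the first hash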
Related
I have created a text file using TextWriter in C#. On final creation the text file often has several rows of whitespace at the end. The whitespace is not included in any of the string objects that make up the file, and I don't know what is causing it. The larger the file, the more whitespace there is.
I've tried various tests to see if the whitespace occurs based upon the content of the strings, but this is not the case. I.e. I have identified the row where the whitespace starts and changed the string to something completely different, but the whitespace still occurs.
//To start:
MemoryStream memoryStream = new MemoryStream();
TextWriter tw = new StreamWriter(memoryStream);
//Loop through records & create a concatenated string object
string strUTL1 = string.Format("{0}{1}{2}{3}{4}{5}{6}{7}", strUTL1_1, strUTL1_2, strUTL1_3, strUTL1_4, strUTL1_5, strUTL1_6, strUTL1_7, strUTL1_8);
//Add the line to the text file
tw.WriteLine(strUTL1);
//Once all rows are added I complete the file
tw.Flush();
tw.Close();
//Then return the file
return File(memoryStream.GetBuffer(), "text/plain", txtFileName);
I don't want to manipulate the file after completion (e.g. replace blank spaces), as this could lead to other problems. The file will be exchanged with a third party and needs to be formatted exactly.
Thank you for your assistance.
As the doc for MemoryStream.GetBuffer explains:
Note that the buffer contains allocated bytes which might be unused. For example, if the string "test" is written into the MemoryStream object, the length of the buffer returned from GetBuffer is 256, not 4, with 252 bytes unused. To obtain only the data in the buffer, use the ToArray method; however, ToArray creates a copy of the data in memory.
Use .ToArray() (which will allocate a new array of the right size), or you can use the buffer returned from .GetBuffer() but you'll need to check the .Length to see how many valid bytes are in it.
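Applied to the question's code, either of these sketches should get rid of the trailing junk:
// Option 1: copy only the bytes that were actually written
return File(memoryStream.ToArray(), "text/plain", txtFileName);

// Option 2: keep the raw buffer, but only use the first Length bytes
byte[] buffer = memoryStream.GetBuffer();
int validBytes = (int)memoryStream.Length; // bytes actually written; the rest is unused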
GetBuffer() returns all the memory that was allocated, which is almost always more bytes than what you actually wrote into it.
Might I suggest using Encoding.UTF8.GetBytes(...) instead:
string strUTL1 = string.Format("{0}{1}{2}{3}{4}{5}{6}{7}", strUTL1_1, strUTL1_2, strUTL1_3, strUTL1_4, strUTL1_5, strUTL1_6, strUTL1_7, strUTL1_8);
var bytes = Encoding.UTF8.GetBytes(strUTL1);
return File(bytes, "text/plain", txtFileName);
Use ToArray() instead of GetBuffer(), since the buffer is larger than needed.
That's often the case. Classes or functions that work with buffers usually reserve a certain amount of memory to hold the data and then report how many bytes were actually written to the buffer. You should then use only the first n bytes of the buffer.
Quoting MSDN:
For example, if the string "test" is written into the MemoryStream object, the length of the buffer returned from GetBuffer() is 256, not 4, with 252 bytes unused. To obtain only the data in the buffer, use the ToArray() method; however, ToArray() creates a copy of the data in memory.
I have homework from my university teacher: I have to write code that encrypts/decrypts a small part of a big file (about 10 GB). I'm using the Salsa20 algorithm.
The main requirement is not to load everything into RAM. As he said, I should read, for example, 100 lines, encrypt/decrypt them, write them to a file, and repeat.
I create a List<string>:
List<string> dict = new List<string>();
Then I read lines (because reading all of the bytes at once would use a lot of RAM):
using (StreamReader sReader = new StreamReader(filePath))
{
    string line;
    // stop at 100 lines or at the end of the file, whichever comes first
    while (dict.Count < 100 && (line = sReader.ReadLine()) != null)
    {
        dict.Add(line);
    }
}
Then I try to create one line from the list:
string words = string.Join("", dict.ToArray());
Encrypt this line
string encrypted;
using (var salsa = new Salsa20.Salsa20())
using (var mstream_out = new MemoryStream())
{
salsa.Key = key;
salsa.IV = iv;
using (var cstream = new CryptoStream(mstream_out,
salsa.CreateEncryptor(), CryptoStreamMode.Write))
{
var bytes = Encoding.UTF8.GetBytes(words);
cstream.Write(bytes, 0, bytes.Length);
}
encrypted = Encoding.UTF8.GetString(mstream_out.ToArray());
}
Then I need to write 100 lines of encrypted string, but I don't know how to do it! Is there any solution?
OK, so here's what you could do.
Accept a filename, a starting line number and an ending line number.
Read the lines, simply writing them to another file if they are lower than the starting line number or larger than the ending line number.
Once you read a line that is in the range, you can encrypt it with the key and an IV. You will possibly need to encode it to a byte array e.g. using UTF-8 first, as modern ciphers such as Salsa operate on bytes, not text.
You could possibly use the line number as the nonce/IV for your stream cipher, if you don't expect the number of lines to change. Otherwise you can prefix the ciphertext with a large, fixed-size, random nonce.
The ciphertext - possibly including the nonce - can be encoded as base64 without line endings. Then you write the base 64 line to the other file.
Keep encrypting the lines until you reach the end index. It is up to you whether your ending line is inclusive or exclusive.
Now read the remaining lines, and write them to the other file.
Don't forget to finalize the encryption and close the file. You may possibly want to destroy the source input file.
Encrypting bytes may be easier as you could write to the original file. However, writing encrypted strings will likely always expand the ciphertext compared with the plaintext. So you need to copy the file, as it needs to grow from the middle out.
I haven't got a clue why you would keep a list or dictionary in memory. If that's part of the requirements then I don't see it in the rest of the question. If you read in all the lines of a file that way then clearly you're using up memory.
Of course, if your 4 GiB file is just a single line then you're still using way too much memory. In that case you need to stream everything, parsing text from files, putting it in a character buffer, character-decoding it, encrypting it, encoding it again to base 64 and writing it to file. Certainly doable, but tricky if you've never done such things.
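A rough sketch of the approach above, assuming the same third-party Salsa20.Salsa20 class from the question and hypothetical 1-based, inclusive startLine/endLine bounds:
using (var reader = new StreamReader(inputPath))
using (var writer = new StreamWriter(outputPath))
using (var salsa = new Salsa20.Salsa20())
{
    salsa.Key = key;
    salsa.IV = iv; // or derive a per-line nonce, as discussed above
    string line;
    int lineNumber = 0;
    while ((line = reader.ReadLine()) != null)
    {
        lineNumber++;
        if (lineNumber < startLine || lineNumber > endLine)
        {
            writer.WriteLine(line); // outside the range: copy unchanged
            continue;
        }
        using (var encryptor = salsa.CreateEncryptor())
        {
            byte[] plain = Encoding.UTF8.GetBytes(line);
            byte[] cipher = encryptor.TransformFinalBlock(plain, 0, plain.Length);
            writer.WriteLine(Convert.ToBase64String(cipher)); // base64 keeps it a valid text line
        }
    }
}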
I'm building an API which has a method that accepts a file via POST request.
Based on that file, I need to create a hash of the file itself (not the name), check whether the hash already exists, and do some other actions.
My problem is that whatever file I send through Postman, the hash is always the same, which means that every time I end up with only one file, which gets overwritten.
Here is my method
private string GetHashFromImage(IFormFile file)
{
/* Creates a hash with the image as a parameter
* with the SHA1 algorithm and returns the hash
* as a string since the ComputeHash() method
* creates a byte array.
*/
System.IO.MemoryStream image = new System.IO.MemoryStream();
file.CopyTo(image);
var hashedValue = System.Security.Cryptography.SHA1.Create().ComputeHash(image);
var hashAsString = Convert.ToBase64String(hashedValue).Replace("/", "");
image.Seek(0, System.IO.SeekOrigin.Begin);
return hashAsString;
}
I need a hash method that is agnostic to OS and will return the same hash on each file.
Not entirely sure why your solution is not working, but I think I have an idea on how to achieve what you want, and it uses MD5 instead of SHA1.
Let's create a function that will receive an IFormFile, compute the MD5 hash of its contents then return the hash value as a string.
using System;
using System.IO;
using System.Security.Cryptography;
private string GetMD5Hash(IFormFile file)
{
// get stream from file then convert it to a MemoryStream
MemoryStream stream = new MemoryStream();
file.OpenReadStream().CopyTo(stream);
// compute md5 hash of the file's byte array.
byte[] bytes = MD5.Create().ComputeHash(stream.ToArray());
return BitConverter.ToString(bytes).Replace("-",string.Empty).ToLower();
}
Hope it works for you!
The real reason for this behaviour is that the stream being hashed is positioned at its end (the same position as after image.Seek(0, System.IO.SeekOrigin.End)).
Stream operations like CopyTo, ComputeHash, etc. change the position of streams because they have to iterate through them. The hash of any stream positioned at its end is always the same: the hash of an empty stream or an empty array.
Converting the stream to an array works, of course, because ToArray reads the whole stream (from position 0), but it is generally not a very elegant solution because you have to copy the whole stream into memory (the same applies to a MemoryStream, whose data also live in memory).
When you work directly with the stream, the hashing function reads it in small chunks (e.g. 4096 bytes) and computes the hash iteratively (see the .NET source code). That means the original solution should work, provided you seek back to the start before computing the hash.
Actually, you should be able to compute the hash directly from the input stream (from IFormFile) without copying the whole stream into memory (an array or MemoryStream), with better performance and without the risk of e.g. an OutOfMemoryException.
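A minimal sketch of that idea, keeping the method signature from the question:
private string GetHashFromImage(IFormFile file)
{
    using (var stream = file.OpenReadStream())
    using (var sha1 = System.Security.Cryptography.SHA1.Create())
    {
        // ComputeHash reads the stream in chunks from its current position,
        // so nothing needs to be buffered in memory
        byte[] hash = sha1.ComputeHash(stream);
        return Convert.ToBase64String(hash).Replace("/", "");
    }
}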
The question says it all. This code
string hash = "";
using (var md5 = System.Security.Cryptography.MD5.Create())
{
hash = Convert.ToBase64String(md5.ComputeHash(streamReader.BaseStream));
}
will always return the same hash.
If I pass all of the data from the BaseStream into a MemoryStream, it gives a unique hash every time. The same goes for running
string hash = "";
using (var md5 = System.Security.Cryptography.MD5.Create())
{
hash = Convert.ToBase64String(md5.ComputeHash(
Encoding.ASCII.GetBytes(streamReader.ReadToEnd())));
}
The second one is actually faster, but I've heard it's bad practice.
My question is: what is the proper way to use ComputeHash(stream)? For me it always (and I mean always, even if I restart the program, meaning it's not just hashing the reference) returns the same hash, regardless of the data in the stream.
The Stream instance is likely positioned at the end of the stream. ComputeHash returns the hash for the bytes from the current position to the end of the stream. So if the current position is the end of the stream, it will return the hash for the empty input. Make sure that the Stream instance is positioned at the beginning of the stream.
I solved this issue by setting stream.Position = 0 before calling ComputeHash.
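In other words, something like this (assuming the underlying stream is seekable):
streamReader.BaseStream.Position = 0; // rewind before hashing
string hash;
using (var md5 = System.Security.Cryptography.MD5.Create())
{
    hash = Convert.ToBase64String(md5.ComputeHash(streamReader.BaseStream));
}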
I'm trying to read a (small-ish) file in chunks of a few lines at a time, and I need to return to the beginning of particular chunks.
The problem is, after the very first call to
streamReader.ReadLine();
the streamReader.BaseStream.Position property is set to the end of the file! Now I assume some caching is done in the backstage, but I was expecting this property to reflect the number of bytes that I used from that file. And yes, the file has more than one line :-)
For instance, calling ReadLine() again will (naturally) return the next line in the file, which does not start at the position previously reported by streamReader.BaseStream.Position.
How can I find the actual position where the 1st line ends, so I can return there later?
I can only think of manually doing the bookkeeping, by adding the lengths of the strings returned by ReadLine(), but even here there are a couple of caveats:
ReadLine() strips the new-line character(s), which may have a variable length (is it '\n'? Is it "\r\n"? Etc.)
I'm not sure if this would work OK with variable-length characters
...so right now it seems like my only option is to rethink how I parse the file, so I don't have to rewind.
If it helps, I open my file like this:
using (var reader = new StreamReader(
new FileStream(
m_path,
FileMode.Open,
FileAccess.Read,
FileShare.ReadWrite)))
{...}
Any suggestions?
If you need to read lines, and you need to go back to previous chunks, why not store the lines you read in a List? That should be easy enough.
You should not depend on calculating a length in bytes based on the length of the string - for the reasons you mention yourself: Multibyte characters, newline characters, etc.
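A minimal sketch of that idea, assuming the file is small enough to hold in memory (the question says it is small-ish):
var lines = new List<string>();
using (var reader = new StreamReader(m_path))
{
    string line;
    while ((line = reader.ReadLine()) != null)
        lines.Add(line);
}
// lines[i] is now line i + 1; returning to a chunk needs no seeking at all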
I have done a similar implementation where I needed to access the n-th line in an extremely big text file fast.
The reason streamReader.BaseStream.Position points to the end of the file is that the reader has a built-in buffer, as you expected.
Bookkeeping by counting the number of bytes read from each ReadLine() call will work for most plain text files. However, I have encountered cases where control characters, the unprintable ones, were mixed into the text file. The number of bytes calculated was then wrong, and my program was unable to seek to the correct location afterwards.
My final solution was to implement the line reader on my own. It has worked well so far. This should give you an idea of what it looks like:
using (FileStream fs = new FileStream(filePath, FileMode.Open))
{
int ch;
int currentLine = 1, offset = 0;
while ((ch = fs.ReadByte()) >= 0)
{
offset++;
// This covers all cases: \r\n and only \n (for UNIX files)
if (ch == 10)
{
currentLine++;
// ... e.g. log the current offset with the line number; offset is now the byte index where the new line starts
}
}
}
And to go back to logged offset:
using (FileStream fs = new FileStream(filePath, FileMode.Open))
{
fs.Seek(yourOffset, SeekOrigin.Begin);
TextReader tr = new StreamReader(fs);
string line = tr.ReadLine();
}
Also note there is already buffering mechanism built into FileStream.
StreamReader isn't designed for this kind of usage, so if this is what you need I suspect that you'll have to write your own wrapper for FileStream.
A problem with the accepted answer is that if ReadLine() throws an exception, say because a logging framework temporarily locks the file right when you call ReadLine(), then that line is never saved into the list, because nothing was returned. If you catch the exception, you cannot simply retry ReadLine(), because the StreamReader's internal state and buffer are out of sync after the failed ReadLine(): you will only get part of a line back, and you cannot ignore that broken line and seek back to its beginning, as the OP found out.
If you want to get the true seekable location, you need to use reflection to reach the StreamReader's private fields, which let you calculate its position inside its own buffer. Granger's solution, seen here: StreamReader and seeking, should work. Or do what answers to other related questions have done: create your own StreamReader that exposes the true seekable location (this answer in this link: Tracking the position of the line of a streamreader). Those are the only two options I've come across while dealing with StreamReader, whose design removes the possibility of seeking in nearly every situation.
Edit: I used Granger's solution and it works. Just be sure you go in this order: call GetActualPosition(), then set BaseStream.Position to that position, then call DiscardBufferedData(), and finally call ReadLine(); you will get the full line starting from the position returned by the method.
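In sketch form, assuming a GetActualPosition(StreamReader) helper like Granger's reflection-based one (hypothetical here):
long pos = GetActualPosition(reader);  // 1. capture the true position (hypothetical helper)
// ... read further, then decide to go back ...
reader.BaseStream.Position = pos;      // 2. seek the underlying stream
reader.DiscardBufferedData();          // 3. drop the reader's stale buffer
string line = reader.ReadLine();       // 4. reads the full line from the saved position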