Calculate hash without having the entire buffer in memory at once - C#

I am doing an operation where I receive some bytes from a component, do some processing, and then send them on to the next component. I need to be able to calculate the hash of all the data I have seen at any given time, and because of the data size, I cannot keep it all in a local buffer.
How would you calculate the (MD5) hash under these circumstances?
I am thinking that I should be able to hold on to an intermediate result of the hash and add more data as I go. But do any of the built-in framework classes support this?

You simply want to use the TransformBlock and TransformFinalBlock members of the MD5CryptoServiceProvider class (inherited from HashAlgorithm), which allow you to compute the hash in chunks.
MSDN has a good example of how to do this.
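For illustration, here is a minimal sketch of the pattern, feeding each chunk into the hash as it passes through (the class and method names are made up for the example):

using System.Security.Cryptography;

public class StreamingMd5
{
    private readonly MD5 md5 = MD5.Create();

    // Call this for every chunk you receive, before forwarding it on.
    public void Append(byte[] chunk)
    {
        // outputBuffer is null because we don't need the transformed copy.
        md5.TransformBlock(chunk, 0, chunk.Length, null, 0);
    }

    // Call this once when all data has been seen; returns the final hash.
    public byte[] Finish()
    {
        md5.TransformFinalBlock(new byte[0], 0, 0);
        return md5.Hash;
    }
}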

It's a bit surprising that it doesn't come in the box.
If you create the MD5CryptoServiceProvider in a member variable, and call ComputeHash() repeatedly, does it not work as an append?

Related

Deleting byte[] from program and memory in C#

I'm making a program in which one of its functions, in order to correctly create the message to be sent, keeps calling a function I have written to add each of the parts to the array. The thing is, in C# you can't do this, because byte arrays (and, if I'm not wrong, any kind of array) have a fixed Length which cannot be changed.
Because of this, I thought of creating 2 byte-array variables. The first one would get the first two values. The second one would be created once you know how many new bytes you have to add; after this, you would delete the first variable and create it again with the Length of the previous variable plus the Length of the new values, doing the same as you did with the second variable. The code I've written is:
byte[] message_mod_0 = adr_and_func;
byte[] byte_memory_adr = AddAndTypes.ToByteArray(memory_adr);
byte[] message_mod_1 = new byte[2 + byte_memory_adr.Length];
message_mod_1 = AddAndTypes.AddByteArrayToByteArray(message_mod_0, byte_memory_adr);
AddAndTypes.AddByteArrayToByteArray(message_mod_0, AddAndTypes.IntToByte(value));
byte[] CRC = Aux.CRC(message_mod_0);
AddAndTypes.AddByteArrayToByteArray(message_mod_0, CRC);
In this code, the two variables I meant are message_mod_0 and message_mod_1. I also thought of deleting and redeclaring the byte_memory_adr variable, which is needed in order to know the Length of the byte array you want to add to the output message.
The parameters adr_and_func, memory_adr and value are given as input parameters of the function I'm making.
The question can be summed up as: is there any way to delete variables in the same scope they were created? And, in case it can be done, would there be any problem if I created a new variable with the same name after I have deleted the first one? I can't think of any reason why that could happen, but I'm pretty new to this programming language.
Also, I don't know if there is any less messy way of doing this.
This sounds like you are writing your own custom serializer.
I would recommend just using an existing library, like protobuf-net, to define your messages if at all possible.
If this is not possible you can use a BinaryWriter to write your values to a Stream. If you want to keep it in memory, use a MemoryStream and call .ToArray() when you're done to get an array of all the bytes, as in the sketch below.
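A minimal sketch of that approach (the parameter names mirror the question but the method itself is made up; you would append the CRC the same way before calling ToArray()):

using System.IO;

static byte[] BuildMessage(byte[] adr_and_func, short memory_adr, int value)
{
    using (var ms = new MemoryStream())
    using (var writer = new BinaryWriter(ms))
    {
        writer.Write(adr_and_func);  // raw bytes, appended as-is
        writer.Write(memory_adr);    // 2 bytes, little-endian
        writer.Write(value);         // 4 bytes, little-endian
        writer.Flush();
        return ms.ToArray();         // a single array containing everything written
    }
}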
As for memory, do not worry about it. Unless you have gigabyte sized messages the memory requirements should not be an issue, and the garbage collector will automatically recycle memory when it is no longer needed, and it can do this after the last usage, regardless of scope. If you have huge memory streams you might want to look at something like recyclable memory stream since this can avoid some allocation performance issues and fragmentation issues.

C# Quick bit array

As stated in the title, I am evaluating the cost of implementing a BitArray over byte[] (I have understood that the native BitArray is pretty slow) instead of using a string representation of bits (e.g. "001001001"), but I am open to any suggestions that are more effective.
The length of the array is not known at design time, but I suppose it may be between 200 and 500 bits per array.
Memory is not a concern, so using a lot of memory to represent the array is not an issue; what matters is speed when the array is created and manipulated (they will be manipulated a lot).
Thanks in advance for your consideration and suggestions on the topic.
A few suggestions:
1) Computers don't process individual bits, so even an int or long will work at the same speed
2) To reach speed you can consider writing it with unsafe code
3) new is expensive. If the objects are created a lot you can do the following: create a bulk of 10K objects at a time and serve them from a method when required. Once the cache runs out you can recreate it. Have another method so that once an object's processing completes you clean it up and return it to the cache
4) Make sure your manipulation is optimal
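As a rough sketch of points 1 and 4, here is a bit array backed by ulong words, where every get/set is just a shift and a mask (all names are illustrative):

public sealed class FastBitArray
{
    private readonly ulong[] words;

    public FastBitArray(int bitCount)
    {
        words = new ulong[(bitCount + 63) / 64]; // 64 bits per word, rounded up
    }

    public bool Get(int index)
    {
        return (words[index >> 6] & (1UL << (index & 63))) != 0;
    }

    public void Set(int index, bool value)
    {
        if (value)
            words[index >> 6] |= 1UL << (index & 63);
        else
            words[index >> 6] &= ~(1UL << (index & 63));
    }
}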

I need a very big array length (size) in C#

public double[] result = new double[ ??? ];
I am storing results, and the total number of results is bigger than 2,147,483,647, which is the Int32 maximum.
I tried BigInteger, ulong, etc., but all of them gave me errors.
How can I extend the size of the array so that it can store > 50,147,483,647 results (double) inside it?
Thanks...
An array of 2,147,483,648 doubles will occupy 16GB of memory. For some people, that's not a big deal. I've got servers that won't even bother to hit the page file if I allocate a few of those arrays. Doesn't mean it's a good idea.
When you are dealing with huge amounts of data like that you should be looking to minimize the memory impact of the process. There are several ways to go with this, depending on how you're working with the data.
Sparse Arrays
If your array is sparsely populated - lots of default/empty values with a small percentage of actually valid/useful data - then a sparse array can drastically reduce the memory requirements. You can write various implementations to optimize for different distribution profiles: random distribution, grouped values, arbitrary contiguous groups, etc.
Works fine for any type of contained data, including complex classes. Has some overheads, so can actually be worse than naked arrays when the fill percentage is high. And of course you're still going to be using memory to store your actual data.
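As an illustration, a minimal dictionary-backed sparse array (a reasonable fit for randomly distributed values; the class name is made up):

using System.Collections.Generic;

public class SparseArray<T>
{
    private readonly Dictionary<long, T> items = new Dictionary<long, T>();

    public T this[long index]
    {
        get
        {
            T value;
            // Unset entries read back as default(T), e.g. 0.0 for double.
            return items.TryGetValue(index, out value) ? value : default(T);
        }
        set { items[index] = value; }
    }
}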
Simple Flat File
Store the data on disk, create a read/write FileStream for the file, and enclose that in a wrapper that lets you access the file's contents as if it were an in-memory array. The simplest implementation of this will give you reasonable usefulness for sequential reads from the file. Random reads and writes can slow you down, but you can do some buffering in the background to help mitigate the speed issues.
This approach works for any type that has a static size, including structures that can be copied to/from a range of bytes in the file. Doesn't work for dynamically-sized data like strings.
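A minimal, unbuffered sketch of such a wrapper for doubles (class and member names are made up; a real version would add buffering and bounds checks):

using System;
using System.IO;

public sealed class FileBackedDoubleArray : IDisposable
{
    private readonly FileStream stream;

    public FileBackedDoubleArray(string path)
    {
        stream = new FileStream(path, FileMode.OpenOrCreate, FileAccess.ReadWrite);
    }

    public double this[long index]
    {
        get
        {
            stream.Position = index * sizeof(double);
            var buffer = new byte[sizeof(double)];
            stream.Read(buffer, 0, buffer.Length);
            return BitConverter.ToDouble(buffer, 0);
        }
        set
        {
            stream.Position = index * sizeof(double);
            stream.Write(BitConverter.GetBytes(value), 0, sizeof(double));
        }
    }

    public void Dispose() { stream.Dispose(); }
}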
Complex Flat File
If you need to handle dynamic-size records, sparse data, etc. then you might be able to design a file format that can handle it elegantly. Then again, a database is probably a better option at this point.
Memory Mapped File
Same as the other file options, but using a different mechanism to access the data. See System.IO.MemoryMappedFiles.MemoryMappedFile for more information on how to use memory-mapped files from .NET.
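For example, a rough sketch treating a memory-mapped file as a huge array of doubles (the file name and sizes are placeholders):

using System.IO;
using System.IO.MemoryMappedFiles;

long elementCount = 5000000000;   // 5 billion doubles = ~40 GB backing file
using (var mmf = MemoryMappedFile.CreateFromFile(
           "results.bin", FileMode.Create, null, elementCount * sizeof(double)))
using (var accessor = mmf.CreateViewAccessor())
{
    accessor.Write(1234567890L * sizeof(double), 3.14);           // write element 1,234,567,890
    double x = accessor.ReadDouble(1234567890L * sizeof(double)); // read it back
}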
Database Storage
Depending on the nature of the data, storing it in a database might work for you. For a large array of doubles this is unlikely to be a great option, however. The overheads of reading/writing data in the database, plus the storage overheads, add up: each row will at least need a row identity, probably a BIGINT (8-byte integer) for a large recordset, doubling the size of the data right off the bat. Add in the overheads for indexing, row storage, etc. and you can very easily multiply the size of your data.
Databases are great for storing and manipulating complicated data. That's what they're for. If you have variable-width data - strings and the like - then a database is probably one of your best options. The flip-side is that they're generally not an optimal solution for working with large amounts of very simple data.
Whichever option you go with, you can create an IList<T>-compatible class that encapsulates your data. This lets you write code that doesn't have any need to know how the data is stored, only what it is.
BCL arrays cannot do that.
Someone wrote a chunked BigArray<T> class that can.
However, that will not magically create enough memory to store it.
You can't. Even with gcAllowVeryLargeObjects, the maximum size of any dimension in an array (of non-bytes) is 2,146,435,071.
So you'll need to rethink your design, or use an alternative implementation such as a jagged array.
Another possible approach is to implement your own BigList. First note that List is implemented as an array. Also, you can set the initial size of the List in the constructor, so if you know it will be big, get a big chunk of memory up front.
Then
public class myBigList<T> : List<List<T>>
{
}
or, perhaps preferably, use a has-a approach:
public class myBigList<T>
{
    private List<List<T>> theList;
}
In doing this you will need to re-implement the indexer so you can use division and modulo to find the correct indexes into your backing store. Then you can use a BigInteger (or a long) as the index. In your custom indexer you will decompose it into two legally sized ints.
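An illustrative sketch of the has-a version (here using a long index rather than a BigInteger, with a made-up chunk size):

using System.Collections.Generic;

public class MyBigList<T>
{
    private const int ChunkSize = 1 << 20;  // elements per inner list
    private readonly List<List<T>> theList = new List<List<T>>();

    public void Add(T item)
    {
        if (theList.Count == 0 || theList[theList.Count - 1].Count == ChunkSize)
            theList.Add(new List<T>(ChunkSize));
        theList[theList.Count - 1].Add(item);
    }

    public T this[long index]
    {
        // Division selects the inner list, modulo selects the position within it.
        get { return theList[(int)(index / ChunkSize)][(int)(index % ChunkSize)]; }
        set { theList[(int)(index / ChunkSize)][(int)(index % ChunkSize)] = value; }
    }
}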
I ran into the same problem. I solved it using a list of lists, which mimics an array very well but can go well beyond the 2 GB limit, e.g. List<List<sbyte>>. It worked for a 250k x 250k array of sbyte running on a 32 GB computer, even though this elephant represents 60 GB+ of space :-)
C# arrays are limited in size to System.Int32.MaxValue.
For bigger than that, use List<T> (where T is whatever you want to hold).
More here: What is the Maximum Size that an Array can hold?

Monitor HTML change using hash func

I want to write an application that gets a list of URLs.
For each of them I need to monitor periodically if the content has changed.
I thought:
to use HtmlAgilityPack to fetch the HTML content (any other recommendation?).
I don't need to spot the change itself,
so I thought to hash the content, save it in the DB,
and re-compare the hash in the future.
How would you suggest hashing? .net's GetHashCode() ?
I saw this documentation http://support.microsoft.com/kb/307020
which advises using
tmpSource = ASCIIEncoding.ASCII.GetBytes(sSourceData);
why?
You should absolutely not use GetHashCode() for this. The documentation explicitly states:
Furthermore, the .NET Framework does not guarantee that the default implementation of the GetHashCode method, and the value it returns, will be the same between different versions of the .NET Framework.
The results of GetHashCode can change between runs - all that's guaranteed is that calling it on two equal objects in the same process (possibly AppDomain) will give the same hash code. Indeed, String.GetHashCode's algorithm has changed over time, and in .NET 4 the 32-bit implementation is different to the 64-bit implementation.
If you want to use hashing, use MD5, SHA1 etc - something with a specified algorithm which will not change. (Note that these operate on binary data rather than string data, which is probably more appropriate too - you don't need to bother decoding the data as text.)
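For example, a sketch of hashing the raw response bytes (the URL is a placeholder):

using System;
using System.Net;
using System.Security.Cryptography;

byte[] content;
using (var client = new WebClient())
{
    content = client.DownloadData("http://example.com/page");
}
string fingerprint;
using (var md5 = MD5.Create())
{
    // Hex string of the 16-byte MD5 digest; store this in the DB and compare on the next fetch.
    fingerprint = BitConverter.ToString(md5.ComputeHash(content)).Replace("-", "");
}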
It's not clear to me whether refetching periodically is really the best idea though - do these servers not support last modified times, etags etc?
As you have asked for suggestions, I would have used this method instead:
WebClient client = new WebClient();
String htmlCode = client.DownloadString("http://google.com");
And I would have saved this string in my DB. After the given interval I could have compared them again.
But yes, I do agree the string size would really be large.
If I just wanted an alert that the content has changed somehow, I would use MD5, as an MD5 hash is only 16 bytes (a 32-character hex string).
Hence it is easier to compare and store in the DB.

Compress Guids by hashing in small data sets

I'm working on a mobile app and I want to optimise the data that it's receiving from the server (as JSON).
There are 3 lists returned (each containing its own class of objects, the approximate list sizes are 50, 100 and 170). Each object has a Guid id and there is some relation data for each object. E.g.:
o = { Id = "8f088552-5b24-4ba4-a6e5-8958c4353581",
RelatedIds = ["19d2e562-0874-473f-8e05-7052e8defd9a", "615b4c47-199a-4f7d-8268-08ed43d9c891", ... ] }
Is there a way to compress these Guids to something shorter without storing an identity map? Perhaps using a hash function?
You can convert the 16-byte representation of a GUID into a Base64 string. However, you didn't mention a programming language, so we can't help further.
A hash function is not recommended here because hash functions are generally lossy.
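Assuming C#, that conversion is a couple of lines, shortening the 36-character GUID string to 24 characters (22 if you also strip the trailing "==" padding), and it is lossless, unlike a hash:

Guid id = Guid.Parse("8f088552-5b24-4ba4-a6e5-8958c4353581");
string compact = Convert.ToBase64String(id.ToByteArray());        // 24 chars incl. "==" padding
Guid roundTripped = new Guid(Convert.FromBase64String(compact));  // recovers the original value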
No. One of the attributes of (non-cryptographic) hashes is that they collide: hash(a) == hash(b) but a != b. They are a performance optimization in the case where you are doing a lot of equality checks and you expect many false results (because if hash(a) != hash(b) then a != b). A GUID->counter map is probably the best way to get smaller ids here.
You can convert hex (base16) to base64 and remove all the punctuation. You should save 25% by using base64, and another 4 bytes by dropping the punctuation.
Thinking about it some more, I've realized that HTTP compression (if enabled) is probably going to compress that data well enough anyway, so it's not really worth the effort to compress the data manually.
