monitor html change using hash func

I want to write an application that gets a list of urls.
For each of them I need to monitor periodically if the content has changed.
I thought :
to use HtmlAgilityPack to fetch html content (any other recommendation?)
I don't need to spot the change itself,
so I thought to hash the content, save it in the DB
and re-compare the hash in the future.
How would you suggest hashing? .NET's GetHashCode()?
I saw this documentation http://support.microsoft.com/kb/307020
which advises using
tmpSource = ASCIIEncoding.ASCII.GetBytes(sSourceData);
why?

You should absolutely not use GetHashCode() for this. The documentation explicitly states:
Furthermore, the .NET Framework does not guarantee the default implementation of the GetHashCode method, and the value it returns will be the same between different versions of the .NET Framework.
The results of GetHashCode can change between runs - all that's guaranteed is that calling it on two equal objects in the same process (possibly AppDomain) will give the same hash code. Indeed, String.GetHashCode's algorithm has changed over time, and in .NET 4 the 32-bit implementation is different to the 64-bit implementation.
If you want to use hashing, use MD5, SHA1 etc - something with a specified algorithm which will not change. (Note that these operate on binary data rather than string data, which is probably more appropriate too - you don't need to bother decoding the data as text.)
It's not clear to me whether refetching periodically is really the best idea though - do these servers not support last modified times, ETags etc?
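A minimal sketch of hashing the raw response bytes, as suggested above. SHA-256 is used here in place of MD5 or SHA1 (any algorithm with a fixed specification would do), and the names ContentHasher, Sha256Hex and HasChanged are made up for illustration.

```csharp
using System;
using System.Security.Cryptography;

public static class ContentHasher
{
    // Hashes the raw bytes of a response; no text decoding is involved.
    public static string Sha256Hex(byte[] content)
    {
        using (SHA256 sha = SHA256.Create())
        {
            byte[] digest = sha.ComputeHash(content);
            // BitConverter yields "AB-CD-..."; strip the dashes for a compact hex string.
            return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
        }
    }

    // True when freshly fetched bytes hash differently from the value stored earlier.
    public static bool HasChanged(byte[] content, string storedHash)
    {
        return !string.Equals(Sha256Hex(content), storedHash, StringComparison.OrdinalIgnoreCase);
    }
}
```

The bytes could come from something like new WebClient().DownloadData(url); the hex string is what would be saved in the DB and compared on the next fetch.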

Since you asked for suggestions: I would use this method instead
WebClient client = new WebClient();
String htmlCode = client.DownloadString("http://google.com");
and save this string in my DB. After each interval I could fetch the page again and compare the two strings.
But yes, I do agree the string would be really large.
If I just wanted an alert that the content has changed somehow, I would use MD5, as an MD5 hash is only 16 bytes, which is 32 characters when written in hexadecimal.
Hence it is easier to compare and store in the DB.
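For reference, a small sketch of how large an MD5 digest is in its common text forms. It also shows why the KB article converts the string to bytes first: hash classes take byte arrays, not strings. UTF-8 is assumed here instead of ASCII so that non-ASCII characters are not lost, and the class name Md5Sizes is hypothetical.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class Md5Sizes
{
    // Hash functions work on bytes, so the text is encoded first.
    public static byte[] Digest(string text)
    {
        using (MD5 md5 = MD5.Create())
        {
            return md5.ComputeHash(Encoding.UTF8.GetBytes(text));
        }
    }

    // 16 bytes become 32 hexadecimal characters.
    public static string ToHex(byte[] digest)
    {
        return BitConverter.ToString(digest).Replace("-", "");
    }

    // 16 bytes become 24 Base64 characters (including padding).
    public static string ToBase64(byte[] digest)
    {
        return Convert.ToBase64String(digest);
    }
}
```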

Related

String Builder and string size

Why StringBuilder Size is greater than string(~250MB).
Please read the question. I want to know the reason of size constraint in the string, but not in stringbuilder. I have fixed the problem of reading file.
Yes, I know there are operations we can perform on a StringBuilder, like append, replace, remove, etc. But what is the use of them when we can't call ToString() on it and we can't write it directly to a file? We have to call ToString() to actually use the content, but because its size is outside the range of a string, it throws an exception.
So in particular, is there any use for a StringBuilder larger than a string can be? I read a file of around 1 GB into a StringBuilder but can't get it into a string. I have read all the pros and cons of StringBuilder over String but I can't find anything explaining this.
Update:
I want to load an XmlDocument from the file. If I read it in chunks the data cannot be loaded, because the root-level node needs its closing tag, which will be in another chunk.
Update:
I know this is not the correct approach and I am now using a different process, but I still want to know the reason for the size constraint on string but not on StringBuilder.
Update:
I have fixed my problem and want to know the reason why there is no memory constraint on StringBuilder.

Why StringBuilder Size is greater than string(~250MB).
The reason depends on the version of .net.
There are two implementations Eric Lippert mentions here: https://stackoverflow.com/a/6524401/360211
Internally a string builder maintains a char[]. When you append, it may have to resize this array. To avoid resizing on every append, it grows to a larger size than needed in anticipation of future appends (it actually doubles in size). So the StringBuilder often ends up larger than its content, by as much as double the size.
A newer implementation maintains a linked list of char[]. If you do many small appends, the overhead of the linked list may account for the extra 250MB.
In normal use, an extra 100% size on a string temporarily doesn't make one bit of difference given the performance benefits, but when you are dealing with a GB, it becomes significant and that is not its intended usage.
Why you get OutOfMemoryException
The linked list implementation can fit more in memory than a string because it does not need one contiguous block of 1GB. When you call ToString it is forced to try to find another GB, which must also be contiguous, and that is the problem.
Why is there no constraint preventing this?
Well there is. The constraint is if there is not enough memory to create a string during ToString, throw an OutOfMemoryException.
You may want this to happen during Append operations, but that would be impossible to determine. StringBuilder could look at the free memory, but that might change before you call ToString. So the author of StringBuilder could have set an arbitrary limit, but that can't suit all systems equally, as some will have more memory than others.
You also might want to do operations that reduce the size of the StringBuilder before calling ToString, or not call ToString at all! So just because StringBuilder is too large to ToString at any point is not a reason to throw an exception.

You can use StringBuilder.ToString(int, int) to get smaller-sized chunks of your huge content out of the StringBuilder.
In addition, you might want to consider whether you are really using the right tool for the job. StringBuilder's purpose is to build and modify strings, not to load huge files to memory.
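A minimal sketch of the chunked approach, assuming the goal is to write the content out (for example to a file) without ever materialising it as one string. The helper name StringBuilderChunks.WriteTo and the default chunk size are arbitrary choices for illustration.

```csharp
using System;
using System.IO;
using System.Text;

public static class StringBuilderChunks
{
    // Copies the builder's content to a writer in slices of at most chunkSize chars,
    // so no single string as large as the whole content is allocated.
    public static void WriteTo(StringBuilder source, TextWriter destination, int chunkSize = 64 * 1024)
    {
        if (chunkSize <= 0) throw new ArgumentOutOfRangeException(nameof(chunkSize));

        for (int offset = 0; offset < source.Length; offset += chunkSize)
        {
            // The last slice may be shorter than chunkSize.
            int length = Math.Min(chunkSize, source.Length - offset);
            destination.Write(source.ToString(offset, length));
        }
    }
}
```

With a StreamWriter as the destination, each slice is a short-lived string of bounded size rather than one contiguous multi-hundred-megabyte allocation.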

You can try the following to handle large XML files.
CodeProject

Fastest way to detect non-equal strings (without storing the string)?

I am writing a templating engine and I am searching for a good way to detect if a template has changed.
For this I have the following requirements (in order of importance):
non-equal strings are required to be detected different
as fast as possible
as little memory as possible (=> do not store the whole string for comparison)
high probability of detecting equal strings as equal
It is not a big problem if equal strings are sometimes not detected as equal, as this would just trigger an unnecessary "re-rendering", but because rendering is "heavy work", this should happen as rarely as possible.
I first thought of using String.GetHashCode(), but the probability of getting the same hash code for two non-equal strings is pretty high.
Are there any good combinations, like checking the hash code and Length, that would bring the probability of two non-equal strings being wrongly detected as equal down to an unrealistically low number?
Or is using some hashing algorithm, like MD5 or SHA, a good alternative (after the hash code is equal)?
My rendering looks something like the following:
public string RenderTemplate(string name, string template)
{
    var cachedTemplate = Cache.Get(name);
    if(cachedTemplate == null || !cachedTemplate.Equals(template)) // <= Equals
    {
        cachedTemplate = new Template(name, template);
        cachedTemplate.Render();
        Cache.Set(name, cachedTemplate);
    }
    return cachedTemplate.Result;
}
The Equals is the point I am asking about.
I am also open for other suggestions how this could be solved.
UPDATE:
To add some numbers to get more context:
I expect to have >1000 individual templates, and each template will have up to a few thousand characters, possibly more.
This is why I would like to avoid storing the whole template-string "in memory" only for the comparison.
Most of the templates are stored in the DB.
UPDATE 2:
What do you think about extending my RenderTemplate method with a timestamp as suggested by Nikola:
public string RenderTemplate(string name, string template, DateTime timestamp)
Then I could compare name, GetHashCode and timestamp which does not need much memory, should be pretty fast and the probability of a "wrongly detected equality" is practically 0. The timestamp I can read from the DB (have it already there) or the "last changed date" from the file-system for a file-based template.

You don't have much choice. If you don't compare strings by comparing their content, use a hash algorithm to determine whether the strings are equal. Personally, I would probably use a hash algorithm. If you are a bit paranoid and afraid of a collision, choose the algorithm with the widest output space (e.g. SHA512).
Why do you need to compare strings to determine that a template has changed? Why not use a different approach?
If the file is stored on disk, why not use a file watcher?
If it is stored in a database, why not use a timestamp to detect when it was saved?
If the application is restarted, reload the templates anyway.
Also, it's worrying that a template for the UI changes so often that you must make checks like this. I think you have more design problems besides comparing strings.
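A minimal sketch of the hash-based comparison, assuming SHA-256 is acceptable and that only a digest per template name is kept in memory. The class TemplateChangeDetector is hypothetical and is not part of the asker's engine. Note that a hash only makes a missed change astronomically unlikely; it cannot strictly guarantee that two different templates are always told apart.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

public sealed class TemplateChangeDetector
{
    // name -> Base64 SHA-256 digest of the last template text seen (44 chars each).
    private readonly Dictionary<string, string> _digests = new Dictionary<string, string>();

    // Returns true if the template is new or differs from the last one seen,
    // and records the new digest in that case.
    public bool HasChanged(string name, string template)
    {
        string digest = ComputeDigest(template);
        string previous;
        if (_digests.TryGetValue(name, out previous) && previous == digest)
        {
            return false;
        }
        _digests[name] = digest;
        return true;
    }

    private static string ComputeDigest(string text)
    {
        using (SHA256 sha = SHA256.Create())
        {
            return Convert.ToBase64String(sha.ComputeHash(Encoding.UTF8.GetBytes(text)));
        }
    }
}
```

In RenderTemplate, the Equals check could then be replaced by a call such as detector.HasChanged(name, template), at the cost of hashing the incoming template on every call.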

Compress a short but repeating string

I'm working on a web app that needs to take a list of files on a query string (specifically a GET and not a POST), something like:
http://site.com/app?things=/stuff/things/item123,/stuff/things/item456,/stuff/things/item789
I want to shorten that string:
http://site.com/app?things=somekindofencoding
The string isn't terribly long, varies from 20-150 chars. Something that short isn't really suitable for GZip, but it does have an awful lot of repetition so compression should be possible.
I don't want a DB or Dictionary of strings - the URL will be built by a different application to the one that consumes it. I want a reversible compression that shortens this URL. It doesn't need to be secure.
Is there an existing way to do this? I'm working in C#/.Net but would be happy to adapt an algorithm from some other language/stack.

If you can express the data in BNF you could construct a parser for the data. Instead of sending the data you could send the AST, where each node is identified by one character (or several if you have a lot of different nodes). In your example
we could have
files : file files
      |
file  : path id
path  : /items/thing
      | /files/item
      | /stuff/things/item
You could then represent a list of files as path[id1,id2,...,idn], using 0, 1, 2 for the paths. With the input being:
/stuff/things/item123,/stuff/things/item456,/stuff/things/item789
/files/item1,/files/item46,/files/item7
you'd then end up with ?things=2[123,456,789]1[1,46,7]
where /stuff/things/item is represented by 2 and /files/item is represented by 1. Each number within [...] is an id, so 2[123] would expand to /stuff/things/item123
EDIT The approach does not have to be static. If you have to discover the repeated items dynamically you can use the same approach and pass the map between identifier and token. In that case the above example would be
?things=2[123,456,789]1[1,46,7]&tokens=2=/stuff/things/item,1=/files/item
which, if the grammar is this simple, would of course do better as
?things=/stuff/things/item[123,456,789]/files/item[1,46,7]
Compressing the repeated part to less than the unique value with such a short string is possible, but it will most likely have to rely on constraining the possible values, or you risk actually increasing the size when "compressing".
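The dynamic prefix[id1,id2,...] form can be sketched as below. This assumes every path ends in a numeric id, that paths with the same prefix are adjacent in the list, and that the characters '[', ']' and ',' never occur inside a path. The class name PrefixListCodec is made up for illustration, and Decode does no validation of malformed input.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class PrefixListCodec
{
    // "/a/item1,/a/item2,/b/x7" -> "/a/item[1,2]/b/x[7]"
    public static string Encode(string list)
    {
        var result = new StringBuilder();
        string currentPrefix = null;
        foreach (string path in list.Split(','))
        {
            // Split the path into a prefix and its trailing run of digits.
            int idStart = path.Length;
            while (idStart > 0 && char.IsDigit(path[idStart - 1])) idStart--;
            string prefix = path.Substring(0, idStart);
            string id = path.Substring(idStart);

            if (prefix == currentPrefix)
            {
                result.Append(',').Append(id); // same group: only the id is added
            }
            else
            {
                if (currentPrefix != null) result.Append(']'); // close previous group
                result.Append(prefix).Append('[').Append(id);
                currentPrefix = prefix;
            }
        }
        if (currentPrefix != null) result.Append(']');
        return result.ToString();
    }

    // Reverses Encode by expanding each prefix[id,...] group.
    public static string Decode(string encoded)
    {
        var paths = new List<string>();
        int pos = 0;
        while (pos < encoded.Length)
        {
            int open = encoded.IndexOf('[', pos);
            int close = encoded.IndexOf(']', open);
            string prefix = encoded.Substring(pos, open - pos);
            foreach (string id in encoded.Substring(open + 1, close - open - 1).Split(','))
            {
                paths.Add(prefix + id);
            }
            pos = close + 1;
        }
        return string.Join(",", paths);
    }
}
```

Since brackets are not URL-safe everywhere, the encoded value would probably still need percent-encoding or a different pair of delimiters in a real query string.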

You can try zlib using raw deflate (no zlib or gzip headers and trailers). It will generally provide some compression even on short strings that are composed of printable characters, and it does look for and take advantage of repeated strings. I haven't tried it, but you could also see if smaz works for your data.
I would recommend obtaining a large set of real-life example URLs to use for benchmark testing of possible compression approaches.
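A rough sketch of the raw deflate idea in .NET, using System.IO.Compression.DeflateStream, which emits raw deflate data without zlib or gzip wrappers. The URL-safe Base64 step and the class name UrlDeflate are assumptions for illustration. Base64 expands the compressed bytes by about a third, so for very short inputs the result may not be shorter than the original; this is worth measuring on real URLs.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class UrlDeflate
{
    public static string Compress(string text)
    {
        byte[] input = Encoding.UTF8.GetBytes(text);
        using (var output = new MemoryStream())
        {
            // Disposing the DeflateStream flushes the final compressed block.
            using (var deflate = new DeflateStream(output, CompressionMode.Compress))
            {
                deflate.Write(input, 0, input.Length);
            }
            // URL-safe Base64: swap '+' and '/', drop the '=' padding.
            return Convert.ToBase64String(output.ToArray())
                .TrimEnd('=').Replace('+', '-').Replace('/', '_');
        }
    }

    public static string Decompress(string token)
    {
        string base64 = token.Replace('-', '+').Replace('_', '/');
        // Restore the padding that Compress removed.
        base64 = base64.PadRight(base64.Length + (4 - base64.Length % 4) % 4, '=');
        using (var input = new MemoryStream(Convert.FromBase64String(base64)))
        using (var deflate = new DeflateStream(input, CompressionMode.Decompress))
        using (var reader = new StreamReader(deflate, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```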

Generating a non-guid unique key outside of a database

I have a situation where I need to create some kind of uniqueness between 'entities', but it is not a GUID, and it is not saved in a database (It is saved, however. Just not by a database).
The basic use of the key is a mere redundancy check. It does not have to be as scalable as a real 'primary key', but in the simplest terms I can think of, this is how it works.
[receiver] has List<possibilities>.
possibilities exist independently, but many will have the same values (impossible to predict. This is by design)
Frequently, the receiver's list of possibilities will have to be emptied and then refilled (this is a business requirement).
The key is basically used to add a very lightweight redundancy check. In other words, sometimes the same possibility will be repeated, sometimes it should only appear once in the receiver's list.
I basically want to use something very light and simple. A string is sufficient. I was just wanting to figure out a modest algorithm to accomplish this. I thought about using the GetHashCode() method, but I am not certain about how reliable that is. Can I get some thoughts?

If GetHashCode() looks usable at first glance, you can probably use an MD5 hash as well, with a lower collision probability. The resulting MD5 can be stored as a 24-character string by encoding it in Base64. See this example:
using System;
using System.Security.Cryptography;
using System.Text;

public static class MD5Gen
{
    // Returns the MD5 digest of the UTF-8 bytes as a 24-character Base64 string.
    public static string Encode(string toEncode)
    {
        // A fresh instance per call: hash objects are not safe to share across threads.
        using (MD5 hash = MD5.Create())
        {
            return Convert.ToBase64String(
                hash.ComputeHash(Encoding.UTF8.GetBytes(toEncode)));
        }
    }
}
With this you encode a source string as an MD5 hash, also in string format. You just have to express the "possibility" class in terms of a string.

Try this for generating Guid.
VBScript Function to Generate a UUID/GUID
If you are on Windows, you can use the simple VBScript below to generate a UUID. Just save the code to a file called createguid.vbs, and then run cscript createguid.vbs at a command prompt.
Set TypeLib = CreateObject("Scriptlet.TypeLib")
NewGUID = TypeLib.Guid
WScript.Echo(left(NewGUID, len(NewGUID)-2))
Set TypeLib = Nothing
Create a UUID/GUID via the Windows Command Line
If you have the Microsoft SDK installed on your system, you can use the utility uuidgen.exe, which is located in the "C:\Program Files\Microsoft SDK\Bin" directory
or try the same for more info.
Link
I would say go for the Windows command line as it is more reliable.

Calculate hash without having the entire buffer in memory at once

I am doing an operation where I receive some bytes from a component, do some processing, and then send it on to the next component. I need to be able to calculate the hash of all the data I have seen at any given time - and because of data size; I cannot keep it all in a local buffer.
How would you calculate the (MD5) hash under these circumstances ?
I am thinking that I should be able to hold on to an intermediate result of the hash, and add more data as I go. But does any of the built-in framework classes support this ?

You simply want to use the TransformBlock and TransformFinalBlock members of the hash class (they are defined on HashAlgorithm, which MD5 derives from), which allow you to compute the hash in chunks.
MSDN has a good example of how to do this.
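A minimal sketch of chunked hashing with TransformBlock and TransformFinalBlock. The wrapper class RunningMd5 and its method names are hypothetical. In this sketch, calling Finish ends the computation, so no further chunks should be appended to the same instance afterwards.

```csharp
using System;
using System.Security.Cryptography;

public sealed class RunningMd5 : IDisposable
{
    private readonly MD5 _md5 = MD5.Create();

    // Feeds the next chunk; the hash object keeps only its small internal state,
    // so the caller can discard the buffer afterwards.
    public void Append(byte[] buffer, int offset, int count)
    {
        // No output buffer is needed for hashing, so null is passed.
        _md5.TransformBlock(buffer, offset, count, null, 0);
    }

    // Finishes the computation and returns the 16-byte digest.
    public byte[] Finish()
    {
        _md5.TransformFinalBlock(new byte[0], 0, 0);
        return _md5.Hash;
    }

    public void Dispose()
    {
        _md5.Dispose();
    }
}
```

Each block received from the upstream component would be passed to Append before being forwarded, and Finish would be called once the last block has been seen.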

It's a bit surprising that it doesn't come in the box.
If you create the MD5CryptoServiceProvider in a member variable, and call ComputeHash() repeatedly, does it not work as an append?
