CRC32 is calculated as a uint32, while HashAlgorithm in .NET by convention returns byte[]. I can, of course, easily convert it with bytes = BitConverter.GetBytes(hash), but this is affected by the endianness of the system (almost no chance of it being big-endian in practice, of course).
Anyway, I've been wondering: is there any convention to follow? I have a feeling it should be big-endian, so that hash.ToString("X") and bytes.ToHex() (assuming .ToHex() exists) look the same.
I've checked https://github.com/vurdalakov/crc32/wiki/CRC32 and it does not do that. Any thoughts?
I can only cite examples: the zip and gzip file formats store the CRC in little-endian order. I'm sure someone can find examples where a 32-bit CRC is stored in big-endian order, sometimes called "network order", since big-endian is intended to be the convention for network communications.
If you are defining your own protocol, then you can pick whichever you like. For the code to be portable, you would need to use shift operators to pick apart the bytes so that the endianness of the machine does not affect the result.
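A minimal sketch of that shift approach, assuming the CRC has already been computed as a uint: the byte order is fixed by the code (big-endian here), so crc.ToString("X8") and the hex of the byte array always agree, regardless of the machine's endianness.

using System;

static class CrcBytes
{
    // Big-endian byte order, independent of BitConverter and the host CPU.
    public static byte[] ToBigEndianBytes(uint crc)
    {
        return new byte[]
        {
            (byte)(crc >> 24),
            (byte)(crc >> 16),
            (byte)(crc >> 8),
            (byte)crc
        };
    }
}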
I have been searching on this topic for a while now without finding any relevant answers, so I thought of asking it on Stack Overflow...
We are trying to encode a string in order to pass it over a TCP/IP connection. Since ASN.1 is the most popular way to do this, we are trying the various encoding rules (BER, DER, PER, etc.) to find out which one we can use. Our application is .NET based, and I was looking for a freely available library that does this.
Strangely, I could not find any free libraries, so I started looking in the .NET Framework itself. I found that there is only a BerConverter. So I did a small example with it, taking an example string:
string str = "The BER format specifies a self-describing and self-delimiting format for encoding ASN.1 data structures. Each data element is encoded as a type identifier, a length description, the actual data elements, and, where necessary, an end-of-content marker. These types of encodings are commonly called type-length-value or TLV encodings. This format allows a receiver to decode the ASN.1 information from an incomplete stream, without requiring any pre-knowledge of the size, content, or semantic meaning of the data"
In UTF-8 or ASCII it shows as 512 bytes. I use the following code to encode it using BER:
public static byte[] BerConvert(byte[] inputbytes)
{
    // "{o}" writes the byte array as a BER octet string inside a sequence
    byte[] output = BerConverter.Encode("{o}", inputbytes);
    return output;
}
I get a byte array of size 522. In some other cases, too, I find that the byte size increases compared to the original text. I thought encoding would decrease the size. Why is it happening like this?
Apart from BER, are there other encoding rules like PER or DER which can be used to reduce the encoded size? Are there any examples, libraries, or support which will help in implementing these encoding styles?
When looking for ASN.1 Tools (free and commercial), a good place to start is the ITU-T web page http://www.itu.int/en/ITU-T/asn1/Pages/Tools.aspx that lists several. There are commercial tools listed there that support C#, but I do not see a free C# tool.
As for reduction of size of encodings, this depends significantly on the nature of your ASN.1 specification and the encoding rules used. If you are primarily sending text strings, BER and DER will not result in a reduction of the size of your message, while PER can significantly reduce the size of the message if you are able to produce a "permitted alphabet" constraint indicating a smaller set of characters permitted in the text you are sending.
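As a quick illustration of why BER grows rather than shrinks the payload (a minimal sketch, assuming the same BerConverter from System.DirectoryServices.Protocols that the question uses): BER wraps the data in type and length headers, so the output is always a few bytes larger than the input; it is an encoding, not a compression.

using System;
using System.Text;
using System.DirectoryServices.Protocols;

class BerOverheadDemo
{
    static void Main()
    {
        byte[] payload = Encoding.UTF8.GetBytes(new string('x', 512));

        // "{o}" = octet string inside a sequence, as in the question's code.
        byte[] encoded = BerConverter.Encode("{o}", payload);

        // The extra bytes are the BER type identifiers and length fields (TLV headers).
        Console.WriteLine("payload: " + payload.Length + " bytes, encoded: " + encoded.Length + " bytes");
    }
}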
You can try various encoding rules and different constraints to see the effects of your changes at the free online ASN.1 encoder/decoder at http://asn1-playground.oss.com.
If you are beginning work on a new protocol, you may want to reevaluate your needs a bit.
As you probably know by now, ASN.1 comes with a bit of overhead—not just in the messaging, but in the engineering. A typical workflow involves writing a specification that describes the protocol, feeding it into a CASE tool that generates source code for an API, and then integrating the generated components into your application.
That said, some prefer a more ad-hoc approach. Microsoft has a BER converter class that you could try to use with C#: it may be suitable for your needs.
If compression is important, you may want to look into PER, as Paul said. But it's hard to produce valid PER encodings by hand because they rely on the specification to perform compression. (The permitted alphabet constraint is written into the specification and used to enumerate valid characters for shrinking the encoding.)
For more information on ASN.1 there are a number of tutorials online; you can also look at ITU-T standards X.680-X.695, which specify both the syntax notation and various encoding rules.
There are a few libraries on CodePlex. Like this one.
https://asn1.codeplex.com/SourceControl/latest#ObjectIdentifier.cs
I'll just leave this here: Asn1DerParser.NET. And thanks to the author for his work!
When sending information from a Java application to a C# application through sockets, is the byte order different? Or can I just send an integer from C# to a Java application and read it as an integer?
(And does the OS matter, or is it the same for Java/.NET no matter how the actual OS handles it?)
It all comes down to how you encode the data. If you are treating it only as a raw sequence of bytes, there is no conflict; the sequence is the same. Where it matters is endianness, when interpreting chunks of the data as (for example) integers.
Any serializer written with portability in mind will have defined endianness - for example, in protocol buffers (available for both Java and C#) little-endian is always used regardless of your local hardware.
If you are doing manual writing to the stream, using things like shift-based encoding (rather than direct memory copying) will give you defined endianness.
If you use pre-canned platform serializers, you are at the mercy of the implementation. It might be endian-safe, it might not be (i.e. it might depend on the platform at both ends). For example, the .NET BitConverter class is not safe - it is usually assumed (incorrectly) to be little-endian, but on some platforms (and particularly in Mono on some hardware) it could be big-endian; hence the .IsLittleEndian property.
My advice would be to use a serializer that handles it all for you ;p
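For illustration, a minimal sketch of the shift-based approach mentioned above: the byte order is defined by the code (big-endian here, which is also what Java's DataInputStream/DataOutputStream use), not by the hardware at either end.

using System.IO;

static class BigEndianStream
{
    public static void WriteInt32(Stream stream, int value)
    {
        // Highest byte first: a defined big-endian order on any platform.
        stream.WriteByte((byte)(value >> 24));
        stream.WriteByte((byte)(value >> 16));
        stream.WriteByte((byte)(value >> 8));
        stream.WriteByte((byte)value);
    }

    public static int ReadInt32(Stream stream)
    {
        // Assumes four bytes are available on the stream.
        return (stream.ReadByte() << 24)
             | (stream.ReadByte() << 16)
             | (stream.ReadByte() << 8)
             |  stream.ReadByte();
    }
}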
In Java, you can use a DataInputStream or DataOutputStream which read and write the high-byte first, as documented:
http://download.oracle.com/javase/6/docs/api/java/io/DataOutputStream.html#writeInt%28int%29
You should check corresponding C# documentation to see what it does (or maybe someone here can tell you).
You also have, in Java, the option of using ByteBuffer:
http://download.oracle.com/javase/6/docs/api/java/nio/ByteBuffer.html
... which has the "order" method to allow you to specify a byte order for operations reading multi-byte primitive types.
Java uses Big Endian for some libraries like DataInput/OutputStream. IP protocols all use Big Endian which can lead people to use Big Endian as default for network protocols.
However, NIO's ByteBuffer allows you to specify big-endian, little-endian, or native byte order (whatever the system uses by default).
x86 systems tend to use little endian and so many Microsoft/Linux applications use little endian by default but can support big-endian.
Yes, the byte order may be different. C# typically uses little-endian (it follows the platform's byte ordering), while Java tends to use big-endian. This has been discussed before on SO; see for example C# little endian or big endian?
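A minimal sketch of handling this explicitly on the C# side (assuming the Java side reads with DataInputStream.readInt, which expects big-endian/network order):

using System;
using System.Net;

class NetworkOrderDemo
{
    static void Main()
    {
        int value = 270;

        // Convert from host order (usually little-endian) to network order (big-endian)
        // before putting the bytes on the wire.
        byte[] wire = BitConverter.GetBytes(IPAddress.HostToNetworkOrder(value));   // 00 00 01 0E

        // And back again on the receiving side.
        int received = IPAddress.NetworkToHostOrder(BitConverter.ToInt32(wire, 0));
        Console.WriteLine(received);   // 270
    }
}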
I want to generate a hash code for a file. Using C# I would do something like this, then store the value in a database:
byte[] b = File.ReadAllBytes(@"C:\image.jpg");
string hash = ComputeHash(b);
Now, if I use, say, a Java program that implements the same hashing algorithm (MD5), can I expect the hash value to be equal to the value generated in C#? What if I execute the Java program in different environments: Windows, Linux, or Mac?
Hash values are not globally unique. But that is not what you are really asking.
What you really want to know is whether a hashing algorithm (such as MD5) will produce the same hash value for identical files on different operating system platforms. The answer to that is "yes" ... provided that files are byte-for-byte identical.
In the case of a binary format, that should be the case. In the case of text files, transcoding between different character encodings or changing line-termination sequences will make the files different at the byte level and result in different MD5 hash values.
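For illustration, a minimal sketch on the C# side (the file path and hex formatting are just for the example): given byte-for-byte identical input, Java's MessageDigest.getInstance("MD5") will produce the same 16 bytes on any operating system.

using System;
using System.IO;
using System.Security.Cryptography;

class Md5Demo
{
    static void Main()
    {
        byte[] bytes = File.ReadAllBytes(@"C:\image.jpg");

        using (var md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(bytes);   // always 16 bytes

            // Format explicitly (lowercase hex here) so both sides store the
            // same string representation.
            string hex = BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
            Console.WriteLine(hex);
        }
    }
}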
Hash values generated from the same input with the same algorithm are defined to be equal. 1+1=2, regardless of the programming language I program this in.
Otherwise the internet would not work at all, you know.
My suggestion would be to use a common/accepted hashing algorithm like MD5 to achieve the same hash values.
If the Hashing algorithm and the input are same, the hash value generated would be same irrespective of language or environment.
The hashing algorithm takes the full/part of the key and manipulates it to generate the value which is why it would be same in all languages.
I wish I could comment on this but I don't have enough reputation to do that.
While I don't know for what purpose you want to use a hash algorithm, I'd like to say that some collisions have been found for MD5 so it might be less "secure" (well, we probably can't say "broken" since those collisions are hard to compute). The same remark applies to the SHA-1 algorithm.
More information here: http://www.mathstat.dal.ca/~selinger/md5collision/
So if you want to use a hash algorithm for security purposes, you might take a look at SHA-256 or SHA-512 which are stronger for now.
Otherwise you can probably keep going with MD5.
My two cents.
I am doing a md5 hash, and just want to make sure the result of:
md5.ComputeHash(bytePassword);
Is consistent regardless of the server?
e.g. windows 2003/2008 and 32/64 bit etc.
Yes, it's consistent; the MD5 algorithm specification defines it regardless of platform.
MD5 is independent of operating system and architecture. So it is "consistent".
However, MD5 takes as input an arbitrary sequence of bits, and outputs a sequence of 128 bits. In many situations, you want strings. For instance, you want to hash a password, and the password is initially a string. The conversion of that string into a sequence of bits is not part of MD5 itself, and several conventions exist. I do not know precisely about C#, but the Java equivalent String.getBytes() method will use the "platform default charset" which may vary with the operating system installation. Similarly, the output of MD5 is often converted to a string with hexadecimal notation, and it may be uppercase or lowercase or whatever.
So while MD5 itself is consistent, bugs often lurk in the parts which prepare the data for MD5 and post-process its output. Beware.
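A minimal sketch of that advice in C# (the helper name is just for illustration): fix both the string-to-bytes conversion and the output formatting explicitly, so no platform default can change the result.

using System;
using System.Security.Cryptography;
using System.Text;

static class Md5Util
{
    public static string Md5Hex(string password)
    {
        // Explicit charset instead of any platform default.
        byte[] input = Encoding.UTF8.GetBytes(password);

        using (var md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(input);

            // Explicit formatting (lowercase hex) for the output as well.
            return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
        }
    }
}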
The result of an md5 hash is a number. The number returned for a given input is always the same, no matter what server or even platform you use.
However, the expression of the number may vary. For example, 1 and 1.0 are the same number, but are expressed differently. Similarly, some platforms will return the hash formatted slightly differently than others. In this case, you have a byte array, and that should be fairly safe to pass around. Just be careful what you do after converting it to a string.
MD5 Hashing is [system/time/anything except the input] independent
We are beginning to roll out more and more WAN deployments of our product (.NET fat client with an IIS hosted Remoting backend). Because of this we are trying to reduce the size of the data on the wire.
We have overridden the default serialization by implementing ISerializable (similar to this), and we are seeing anywhere from 12% to 50% gains. Most of our efforts focus on optimizing arrays of primitive types. Is there a fancy way of serializing primitive types, beyond the obvious?
For example, today we serialize an array of ints as follows:
[4-bytes (array length)][4-bytes][4-bytes]
Can anyone do significantly better?
The most obvious example of a significant improvement, for boolean arrays, is putting 8 bools in each byte, which we already do.
Note: Saving 7 bits per bool may seem like a waste of time, but when you are dealing with large magnitudes of data (which we are), it adds up very fast.
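For reference, a minimal sketch of that kind of bool packing (one possible way, using BitArray; the helper names are illustrative):

using System.Collections;

static class BoolPacker
{
    public static byte[] Pack(bool[] values)
    {
        var bits = new BitArray(values);
        byte[] packed = new byte[(values.Length + 7) / 8];   // 8 flags per byte
        bits.CopyTo(packed, 0);
        return packed;
    }

    public static bool[] Unpack(byte[] packed, int count)
    {
        var bits = new BitArray(packed);
        bool[] values = new bool[count];
        for (int i = 0; i < count; i++)
            values[i] = bits[i];
        return values;
    }
}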
Note: We want to avoid general compression algorithms because of the latency associated with them. Remoting only supports buffered requests/responses (no chunked encoding). I realize there is a fine line between compression and optimal serialization, but our tests indicate we can afford very specific serialization optimizations at very little cost in latency, whereas reprocessing the entire buffered response into a new compressed buffer is too expensive.
(relates to messages/classes, not just primitives)
Google designed "protocol buffers" for this type of scenario (they shift a huge amount of data around) - their format is compact (using things like base-128 encoding) but extensible and version tolerant (so clients and servers can upgrade easily).
In the .NET world, I can recommend 2 protocol buffers implementations:
protobuf-net (by me)
dotnet-protobufs (by Jon Skeet)
For info, protobuf-net has direct support for ISerializable and remoting (it is part of the unit tests). There are performance/size metrics here.
And best of all, all you do is add a few attributes to your classes.
Caveat: it doesn't claim to be the theoretical best - but pragmatic and easy to get right - a compromise between performance, portability and simplicity.
Check out the base-128 varint type used in Google's protocol buffers; that might be what you're looking for.
(There are a number of .NET implementations of protocol buffers available if you search the web which, depending on their license, you might be able to grovel some code from!)
Yes, there is a fancy way of serialising primitive types. As a bonus it is also much faster (typically 20-40 times).
Simon Hewitt's open source library (see Optimizing Serialization in .NET - part 2) uses various tricks. For example, if it is known that an array contains small integers, then less data goes into the serialised output. This is described in detail in part 1 of the article. For example:
...So, an Int32 that is less than 128 can be stored in a single byte (by using 7-bit encoding) ....
The full-size and the size-optimised integers can be mixed and matched. This may seem obvious, but there are other things; for example, the integer value 0 gets special handling - an optimisation that stores just the numeric type and a zero value.
Part 1 states:
... If you've ever used .NET remoting for large amounts of data, you will have found that there are problems with scalability. For small amounts of data, it works well enough, but larger amounts take a lot of CPU and memory, generate massive amounts of data for transmission, and can fail with Out Of Memory exceptions. There is also a big problem with the time taken to actually perform the serialisation - large amounts of data can make it unfeasible for use in apps ....
I have used this library with great success in my application.
To make sure .NET serialisation is never used, put an ASSERT 0, Debug.WriteLine() or similar into the place in the library code where it falls back on .NET serialisation. That's at the end of the WriteObject() function in the file FastSerializer.cs, near createBinaryFormatter().Serialize(BaseStream, value);.
If your arrays can be sorted, you can perform a simple RLE to save space. Even if they aren't sorted, RLE can still be beneficial. It is fast to implement for both writing and reading.
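A minimal sketch of such an RLE pass (names are illustrative): each run is written as a (value, count) pair, which pays off whenever equal values sit next to each other, as they do in sorted arrays.

using System.Collections.Generic;

static class Rle
{
    public static List<(int Value, int Count)> Encode(int[] data)
    {
        var runs = new List<(int Value, int Count)>();
        int i = 0;
        while (i < data.Length)
        {
            int value = data[i];
            int count = 1;
            while (i + count < data.Length && data[i + count] == value)
                count++;
            runs.Add((value, count));   // one pair per run of equal values
            i += count;
        }
        return runs;
    }
}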
Here's a trick I used once for encoding arrays of integers:
Group array elements in groups of 4.
Precede each group with a byte (let's call it length mask) that indicates the length of the following 4 elements. The length mask is a bitmask composed of dibits which indicate how long the corresponding element is (00 - 1 byte, 01 - 2 bytes, 10 - 3 bytes, 11 - 4 bytes).
Write out the elements as short as you can.
For example, to represent unsigned integers 0x0000017B, 0x000000A9, 0xC247E8AD and 0x00032A64, you would write (assuming little-endian): B1, 7B, 01, A9, AD, E8, 47, C2, 64, 2A, 03.
It can save you up to 68.75% (11/16) of space in the best case. In the worst case, you would actually waste an additional 6.25% (1/16).
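A minimal sketch of this scheme (writer only, names illustrative); with the four values from the example above it produces exactly B1, 7B, 01, A9, AD, E8, 47, C2, 64, 2A, 03.

using System.Collections.Generic;

static class GroupOfFourEncoder
{
    public static byte[] Encode(uint[] values)
    {
        var output = new List<byte>();
        for (int i = 0; i < values.Length; i += 4)
        {
            int maskIndex = output.Count;
            output.Add(0);                 // placeholder for the length mask
            byte mask = 0;

            for (int j = 0; j < 4 && i + j < values.Length; j++)
            {
                uint v = values[i + j];
                int length = v <= 0xFF ? 1 : v <= 0xFFFF ? 2 : v <= 0xFFFFFF ? 3 : 4;
                mask |= (byte)((length - 1) << (2 * j));   // dibit for element j: 00 = 1 byte ... 11 = 4 bytes

                for (int b = 0; b < length; b++)           // little-endian, shortest form
                    output.Add((byte)(v >> (8 * b)));
            }

            output[maskIndex] = mask;
        }
        return output.ToArray();
    }
}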
If you know which int values are more common, you can encode those values in fewer bits (and encode the less-common values using correspondingly more bits): this is called "Huffman" coding/encoding/compression.
In general though I'd suggest that one of the easiest things you could do would be to run a standard 'zip' or 'compression' utility over your data.
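If you do go that route, a minimal sketch of compressing an already-serialized buffer in memory with GZipStream (bearing in mind the latency concern stated in the question):

using System.IO;
using System.IO.Compression;

static class Gzip
{
    public static byte[] Compress(byte[] data)
    {
        using (var output = new MemoryStream())
        {
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
            {
                gzip.Write(data, 0, data.Length);
            }   // disposing the GZipStream flushes the compressed data
            return output.ToArray();   // ToArray still works on the closed MemoryStream
        }
    }
}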
For integers, if you usually have small numbers (under 127 or 32768), you can encode the number using the MSB as a flag to determine whether it's the last byte or not. It's a little similar to UTF-8, but the flag bit is actually wasted (which is not the case with UTF-8).
Example (big-endian):
125 which is usually encoded as 00 00 00 7D
Could be encoded as 7D
270 which is usually encoded as 00 00 01 0E
Could be encoded as 82 0E
The main limitation is that the effective range of a 32 bit value is reduced to 28 bits. But for small values you will usually gain a lot.
This method is actually used in old formats such as MIDI because old electronics needed very efficient and simple encoding techniques.
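A minimal sketch of this MSB-flag scheme (encoder only, matching the 7D and 82 0E examples above): the high bit of each byte says whether another byte follows, and the 7-bit groups are written most significant first.

using System.Collections.Generic;

static class MsbFlagEncoder
{
    public static byte[] Encode(uint value)
    {
        // Collect 7-bit groups from least to most significant...
        var groups = new Stack<byte>();
        do
        {
            groups.Push((byte)(value & 0x7F));
            value >>= 7;
        } while (value != 0);

        // ...then emit them most significant first, setting the MSB
        // on every byte except the last one.
        var output = new List<byte>();
        while (groups.Count > 0)
        {
            byte b = groups.Pop();
            output.Add(groups.Count > 0 ? (byte)(b | 0x80) : b);
        }
        return output.ToArray();
    }
}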
If you want to control the serialization format yourself, with just library help for compact integer storage, derive a class from BinaryWriter that uses Write7BitEncodedInt. Do likewise for BinaryReader.Read7BitEncodedInt.
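A minimal sketch (type names are just for illustration): on older frameworks Write7BitEncodedInt and Read7BitEncodedInt are protected, so a small subclass is enough to expose them.

using System.IO;

public class CompactBinaryWriter : BinaryWriter
{
    public CompactBinaryWriter(Stream output) : base(output) { }

    public void WriteCompactInt32(int value)
    {
        Write7BitEncodedInt(value);   // 1-5 bytes, 7 bits per byte, low group first
    }
}

public class CompactBinaryReader : BinaryReader
{
    public CompactBinaryReader(Stream input) : base(input) { }

    public int ReadCompactInt32()
    {
        return Read7BitEncodedInt();
    }
}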
Before implementing ISerializable yourself, you were probably using XmlSerializer or the SOAP formatter in a web service. Given that your fat clients are all running .NET, you could try using the BinaryFormatter.