According to my understanding, the length of a base64-encoded string (i.e. the output of encode) must always be a multiple of 4.
The C# Convert.FromBase64String documentation says that its input length must be a multiple of 4.
However, if I give it a 25-character string it doesn't complain:
[convert]::FromBase64String("ei5gsIELIki+GpnPGyPVBA==")
[convert]::FromBase64String("1ei5gsIELIki+GpnPGyPVBA==")
Both work. (The first one is 24 characters, the second is 25.)
[convert]::FromBase64String("11ei5gsIELIki+GpnPGyPVBA==")
fails with an invalid-length exception.
I assume this is a bug in the C# library, but I just want to make sure. I am writing code that sniffs strings to see if they are valid base64 strings, and I want to be sure that I understand what a valid one looks like (one possible implementation was to give the string to System.Convert and see if it threw; why reinvent perfectly good code?).
Yes, this is a flaw (aka bug). It got started due to a perf optimization in an internal helper function named FromBase64_ComputeResultLength() which calculates the length of the byte[] result. It has this comment (edited to fit):
// For legal input, we can assume that 0 <= padding < 3. But it may be
// more for illegal input.
// We will notice it at decode when we see a '=' at the wrong place.
The "we will notice" remark is not entirely accurate: the decoder does flag an '=' where one isn't expected, but it fails to check whether there's one too many, which is the case for the 25-char string.
You can report the problem at connect.microsoft.com; I don't see an existing report that resembles it. Do note that it is fairly unlikely that Microsoft can actually fix it any time soon, since the change would break existing programs that now successfully parse bad base64 strings. It normally requires a major .NET release to get rid of such problems, as was done for .NET 4.0, and there isn't one on the horizon afaik.
But yes, the simple workaround for you is to check that the string length is divisible by 4, using the % operator.
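A minimal sketch of that workaround, assuming the input carries no embedded whitespace (Convert.FromBase64String skips whitespace, so a strict length check would reject such strings). The class and method names are made up for illustration, and rejecting the empty string is my own choice:

```csharp
using System;

public static class Base64Sniffer
{
    // Hypothetical helper: true only if 's' has a legal length AND decodes cleanly.
    public static bool LooksLikeBase64(string s)
    {
        // Guard against the lenient length handling described above:
        // reject anything whose length is not a multiple of 4 up front.
        if (string.IsNullOrEmpty(s) || s.Length % 4 != 0)
            return false;

        try
        {
            // Let the framework validate the alphabet and the padding.
            Convert.FromBase64String(s);
            return true;
        }
        catch (FormatException)
        {
            return false;
        }
    }
}
```

With this check in front, the 25-character string from the question is rejected regardless of how lenient the decoder is.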
byte[] lengthBytes = new byte[4];
serverStream.Read(lengthBytes, 0, 4);
MessageBox.Show("'>>" + System.Text.Encoding.UTF8.GetString(lengthBytes) + "<<'");
MessageBox.Show("Hello");
This is the code I used for debugging. I get 2 messageboxes now. If I used Debug.WriteLine it was also printed twice.
Msgbox 1: '>>/ (Note that this is still 4 characters long; the last 3 bytes are null.)
Msgbox 2: '>>{"ac<<'
Msgbox 3: Hello
I'm trying to send 4 bytes with an integer, the length of the message. This is going fine ('/ ' is utf8 for 47). The problem is that the first 4 bytes of the message are also being read ('{"ac'). I have no idea how this happens. I've been debugging this for several hours and I just can't get my head around it. One of my friends suggested making an account on StackOverflow, so here I am :p
Thanks for all the help :)
EDIT: The real code for the people who asked
My code http://kutj.es/2ah-j9
You are making traditional programmer mistakes; everybody has to make them once to learn how to avoid them and do it right. This primarily went off the rails because you wrote debugging code that is itself buggy, which made it a lot harder to find your mistake:
Never write debugging code that uses MessageBox.Show(). It is a very, very evil function: it causes re-entrancy. An expensive word that means that it only freezes the user interface, it doesn't freeze your program. It continues to run, and one of the things that can go wrong is that the code that you posted is executed again. Re-entered. You'll see two message boxes. And you'll have a completely corrupted program state because your code was never written to assume it could be re-entered. Which is why you complained that 4 bytes of data were swallowed.
The proper tool to use here is the feature that really freezes your program. A debugger breakpoint.
Never assume that binary data can be converted to text. Those 4 bytes you received contain binary zeros. There is no character for it. Worse, it acts as a string terminator to many operating system calls, the kind used by the debugger, Debug.WriteLine() etc. This is why you can't see the "<<"
The proper tool to use here is a debugger watch or tooltip, which lets you look into the array directly. If you absolutely have to generate a diagnostic string then use BitConverter.ToString().
Never assume that a stream's Read() method will always return the number of bytes you asked for. Using the return value in your code is a hard requirement. This is the real bug in your program, the only one you are actually trying to fix.
The proper solution is to continue to call Read() until you counted down the number of bytes you expected to receive from the length you read earlier. You'll need a MemoryStream to store the chunks of byte[]s you get.
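As a rough sketch of the last two points, assuming a 4-byte little-endian length prefix (the question's bytes 47, 0, 0, 0 suggest that, but it is a guess). FramedReader and its methods are hypothetical names. It fills a preallocated byte[] instead of collecting chunks in a MemoryStream, which is a simplification that works because the length is known up front:

```csharp
using System;
using System.IO;

public static class FramedReader
{
    // Keep calling Read() until 'count' bytes have arrived; Read() may return fewer.
    public static byte[] ReadFully(Stream stream, int count)
    {
        byte[] buffer = new byte[count];
        int offset = 0;
        while (offset < count)
        {
            int read = stream.Read(buffer, offset, count - offset);
            if (read == 0)
                throw new EndOfStreamException(
                    "Stream ended after " + offset + " of " + count + " bytes.");
            offset += read;
        }
        return buffer;
    }

    // Assumption: the sender writes the payload length as a 4-byte little-endian int.
    public static byte[] ReadMessage(Stream stream)
    {
        byte[] lengthBytes = ReadFully(stream, 4);
        if (!BitConverter.IsLittleEndian)
            Array.Reverse(lengthBytes);
        int length = BitConverter.ToInt32(lengthBytes, 0);
        if (length < 0)
            throw new InvalidDataException("Negative length prefix.");
        return ReadFully(stream, length);
    }

    // Diagnostic dump that is safe for binary data, e.g. "2F-00-00-00".
    public static string Dump(byte[] data)
    {
        return BitConverter.ToString(data);
    }
}
```

Calling FramedReader.Dump(lengthBytes) in place of Encoding.UTF8.GetString(lengthBytes) would have shown all four bytes, zeros included.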
Perhaps this link regarding Encoding.GetString() will help you out a bit. The part to pay attention to being:
If the data to be converted is available only in sequential blocks
(such as data read from a stream) or if the amount of data is so large
that it needs to be divided into smaller blocks, you should use the
Decoder object returned by the GetDecoder method of a derived class.
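For what it's worth, here is a small sketch of what that quote is getting at, assuming the text is UTF-8. ChunkDecoder is a made-up name. The Decoder keeps the trailing bytes of an incomplete character between calls, so a multi-byte character split across two reads still decodes correctly:

```csharp
using System.Collections.Generic;
using System.Text;

public static class ChunkDecoder
{
    // Decodes UTF-8 text that arrives in arbitrary byte chunks.
    public static string DecodeUtf8Chunks(IEnumerable<byte[]> chunks)
    {
        // One Decoder for the whole stream: it remembers partial characters.
        Decoder decoder = Encoding.UTF8.GetDecoder();
        StringBuilder text = new StringBuilder();

        foreach (byte[] chunk in chunks)
        {
            // GetCharCount takes the decoder's pending bytes into account.
            char[] chars = new char[decoder.GetCharCount(chunk, 0, chunk.Length)];
            int written = decoder.GetChars(chunk, 0, chunk.Length, chars, 0);
            text.Append(chars, 0, written);
        }
        return text.ToString();
    }
}
```

Calling Encoding.UTF8.GetString on each chunk separately would turn a split character into replacement characters instead.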
The problem was that I started the getMessage method 2 times. This started the while loop 2 times (in different threads).
Elgonzo helped me find the problem, he is a great guy :)
Is there any side effect of passing an extra argument to the string.Format function in C#? I was looking at the string.Format function documentation at MSDN ( http://msdn.microsoft.com/en-us/library/b1csw23d.aspx) but was unable to find an answer.
Eg:-
string str = string.Format("Hello_{0}", 255, 555);
Now, as you can see, according to the format string we are supposed to pass only one argument after it, but I have passed two.
EDIT:
I have tried it on my end and everything looks fine to me. Since I am new to C# and from C background, I just want to make sure that it will not cause any problem in later run.
Looking in Reflector, it will allocate a little more memory for building the string, but there's no massive repercussion for passing in an extra object.
There's also the "side effect" that, if you accidentally included a {n} in your format string where n was too large, and then added some spare arguments, you'd no longer get an exception but get a string with unexpected items in.
If you look at the exception section of the link you provide for string.Format
"The index of a format item is less than zero, or greater than or equal to the length of the args array."
Microsoft doesn't indicate that it can throw if you have too many arguments, so it won't. The effect is a small waste of memory due to a useless parameter.
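A quick sketch to make the three cases from these answers concrete. FormatDemo and its method names are invented for the example; the string.Format calls themselves are the standard ones:

```csharp
using System;

public static class FormatDemo
{
    // Extra argument: silently ignored, the result is "Hello_255".
    public static string WithSpareArgument()
    {
        return string.Format("Hello_{0}", 255, 555);
    }

    // Index beyond the argument list: string.Format throws FormatException.
    public static bool ThrowsWhenIndexTooLarge()
    {
        try
        {
            string.Format("Hello_{1}", 255);
            return false;
        }
        catch (FormatException)
        {
            return true;
        }
    }

    // The "side effect": a spare argument hides the mistaken {1},
    // so you get "Hello_555" instead of an exception.
    public static string TypoMaskedBySpareArgument()
    {
        return string.Format("Hello_{1}", 255, 555);
    }
}
```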
I want to write an application that gets a list of urls.
For each of them I need to monitor periodically if the content has changed.
I thought :
to use HtmlAgilityPack to fetch html content (any other recommendation?)
I don't need to spot the change itself,
so I thought to hash the content, save it in the DB
and re-compare the hash in the future.
How would you suggest hashing? .NET's GetHashCode()?
I saw this documentation http://support.microsoft.com/kb/307020
which advises using
tmpSource = ASCIIEncoding.ASCII.GetBytes(sSourceData);
why?
You should absolutely not use GetHashCode() for this. The documentation explicitly states:
Furthermore, the .NET Framework does not guarantee the default implementation of the GetHashCode method, and the value it returns will be the same between different versions of the .NET Framework.
The results of GetHashCode can change between runs - all that's guaranteed is that calling it on two equal objects in the same process (possibly AppDomain) will give the same hash code. Indeed, String.GetHashCode's algorithm has changed over time, and in .NET 4 the 32-bit implementation is different to the 64-bit implementation.
If you want to use hashing, use MD5, SHA1 etc - something with a specified algorithm which will not change. (Note that these operate on binary data rather than string data, which is probably more appropriate too - you don't need to bother decoding the data as text.)
It's not clear to me whether refetching periodically is really the best idea though - do these servers not support last modified times, etags etc?
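To illustrate the hashing suggestion, a minimal sketch that fingerprints the raw response bytes. SHA-256 is my pick here; MD5 or SHA1 would follow the same pattern. ContentFingerprint and both method names are hypothetical, and fetching the bytes and storing the hex string in the DB are left out:

```csharp
using System;
using System.Security.Cryptography;

public static class ContentFingerprint
{
    // Hash the raw bytes; no text decoding is needed.
    public static string Sha256Hex(byte[] content)
    {
        using (SHA256 sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(content);
            // 32 bytes -> 64 lowercase hex characters, stable across runs and versions.
            return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
        }
    }

    // Compare the stored fingerprint against freshly fetched content.
    public static bool HasChanged(string storedHex, byte[] freshContent)
    {
        return !string.Equals(storedHex, Sha256Hex(freshContent), StringComparison.Ordinal);
    }
}
```

Unlike GetHashCode(), the value depends only on the bytes, so it can safely be persisted and compared later.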
Since you asked for suggestions, I would use this method instead:
WebClient client = new WebClient();
String htmlCode = client.DownloadString("http://google.com");
I would save this string in my DB, and after each interval compare it against the freshly downloaded one.
But yes, I agree the string could be really large.
If I just wanted an alert that the content has changed somehow, I would use MD5, as an MD5 hash is only 32 hexadecimal characters (128 bits).
That makes it easier to compare and store in the DB.
I know there are similar questions already on SO, but none of them seem to address this problem. I have inherited the following C# code that has been used to create password hashes in a legacy .NET app. For various reasons the C# implementation is now being migrated to PHP:
string input = "fred";
SHA256CryptoServiceProvider provider = new SHA256CryptoServiceProvider();
byte[] hashedValue = provider.ComputeHash(Encoding.ASCII.GetBytes(input));
string output = "";
string asciiString = ASCIIEncoding.ASCII.GetString(hashedValue);
foreach ( char c in asciiString ) {
int tmp = c;
output += String.Format("{0:x2}",
(uint)System.Convert.ToUInt32(tmp.ToString()));
}
return output;
My PHP code is very simple, but for the same input "fred" it doesn't produce the same result:
$output = hash('sha256', "fred");
I've traced the problem down to an encoding issue - if I change this line in the C# code:
string asciiString = ASCIIEncoding.ASCII.GetString(hashedValue);
to
string asciiString = ASCIIEncoding.UTF7.GetString(hashedValue);
Then the PHP and C# output match (it yields d0cfc2e5319b82cdc71a33873e826c93d7ee11363f8ac91c4fa3a2cfcd2286e5).
Since I'm not able to change the .NET code, I need to work out how to replicate the results in PHP.
Thanks in advance for any help,
I don’t know PHP well enough to answer your question; however, I must point out that your C# code is broken. Try generating the hash of these two inputs: "âèí" and "çñÿ". You will find that their hashes collide:
3f3b221c6c6e3f71223f51695d456d52223f243f3f363949443f3f763b483615
The first bug lies in this operation:
Encoding.ASCII.GetBytes(input)
This assumes that all characters within your input are US-ASCII. Any non-ASCII characters would cause the encoder to fall back to the byte value for the ? character, thereby giving (unwanted) hash collisions, as demonstrated above. Notwithstanding, this will not be an issue if your input is constrained to only allow US-ASCII characters.
The other (more severe) bug lies in the following operation:
ASCIIEncoding.ASCII.GetString(hashedValue)
ASCII only defines mappings for values 0–127. Since the elements of your hashedValue byte array may contain any byte value (0–255), encoding them as ASCII would cause data to be lost whenever a value greater than 127 is encountered. This may lead to further “unwanted” (read: potentially maliciously generated) hash collisions, even when your original input was US-ASCII.
Given that, statistically, half of the bytes constituting your hashes would be greater than 127, you are losing at least half the strength of your hash algorithm. If a hacker gains access to your stored hashes, it is quite likely that they will manage to devise an attack to generate hash collisions by exploiting this cryptographic weakness.
Edit: Notwithstanding the considerations mentioned in my posts and Jon’s, here is the PHP code that succumbs to the same weakness – so to speak – as your C# code, and thereby gives the same hash:
$output = hash('sha256', $input, true);
for ($i = 0; $i < strlen($output); $i++)
if ($output[$i] > chr(127))
$output[$i] = '?';
$output = bin2hex($output);
Could you use mb_convert_encoding (see http://php.net/manual/en/function.mb-convert-encoding.php - the page also has a link to a list of supported encodings) to convert the PHP string to ASCII from UTF7?
I've traced the problem down to an encoding issue
Yes. You're trying to treat arbitrary binary data as if it's valid text-encoded data. It's not. You should not be using any Encoding here.
If you want the results in hex, the simplest approach is to use BitConverter.ToString:
string text = BitConverter.ToString(hashedValue).Replace("-", "").ToLower();
And yes, as pointed out elsewhere, you probably shouldn't be using ASCII to convert the text to binary at the start of the hashing process. I'd probably use UTF-8.
It's really important that you understand the problem here though, as otherwise you'll run into it in other places too. You should only use encodings such as ASCII, UTF-8 etc (on any platform) when you've genuinely got encoded text data. You shouldn't use them for images, the results of cryptography, the results of hashing, etc.
EDIT: Okay, you say you can't change the C# code... it's not clear whether that just means you've got legacy data, or whether you need to keep using the C# code regardless. You should absolutely not run this code for a second longer than you have to.
But in PHP, you may find you can get away with just replacing every byte with a value >= 0x80 in the hash with 0x3F, which is the ASCII for "question mark". If you look through your data you'll probably find there are a lot of 3F bytes in there.
If you can get this to work, I would strongly suggest that you migrate over to the true SHA-256 hash without losing information like this. Wherever you're storing the hashes, store two: the legacy one (which is all you have now) and the rehashed one. Whenever you're asked to validate that a password is correct, you should:
Check whether you have a "new" one; if so, only use that - ignore the legacy one.
If you only have a legacy one:
Hash the password in the broken way to check whether it's correct
If it is, hash it again properly and store the results in the "new" place.
Then when everyone's logged in correctly once, you'll be able to wipe out the legacy hashes.
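The steps above could be sketched roughly like this. Everything here is illustrative: PasswordRecord and PasswordMigration are invented names, the legacy hash is reproduced by the byte replacement described earlier (every byte >= 0x80 becomes 0x3F) rather than by running the original code, and plain SHA-256 over UTF-8 merely stands in for "hash it again properly":

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public class PasswordRecord
{
    public string LegacyHash;  // the lossy hash, all you have today
    public string NewHash;     // null until the user has logged in once
}

public static class PasswordMigration
{
    static string ToHex(byte[] bytes)
    {
        return BitConverter.ToString(bytes).Replace("-", "").ToLowerInvariant();
    }

    // Assumed equivalent of the legacy code: SHA-256 of the ASCII bytes,
    // then every hash byte >= 0x80 replaced with 0x3F ('?').
    public static string LegacyHash(string password)
    {
        byte[] hash;
        using (SHA256 sha = SHA256.Create())
            hash = sha.ComputeHash(Encoding.ASCII.GetBytes(password));
        for (int i = 0; i < hash.Length; i++)
            if (hash[i] >= 0x80) hash[i] = 0x3F;
        return ToHex(hash);
    }

    // Stand-in for the "proper" hash: no information thrown away.
    public static string ProperHash(string password)
    {
        using (SHA256 sha = SHA256.Create())
            return ToHex(sha.ComputeHash(Encoding.UTF8.GetBytes(password)));
    }

    public static bool VerifyAndUpgrade(PasswordRecord record, string password)
    {
        // A new hash exists: use only that, ignore the legacy one.
        if (record.NewHash != null)
            return record.NewHash == ProperHash(password);

        // Only a legacy hash: check it the broken way...
        if (record.LegacyHash != LegacyHash(password))
            return false;

        // ...and on success store the properly hashed value.
        record.NewHash = ProperHash(password);
        return true;
    }
}
```

Once every record has a NewHash, the LegacyHash field can be wiped as described.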
I'm trying to handle the following character: ⨝ (http://www.fileformat.info/info/unicode/char/2a1d/index.htm)
If you check whether an empty string starts with this character, it always returns true. This does not make any sense! Why is that?
// Visual Studio 2008 hides lines that contain this char literally (bug in Visual Studio?!?), so I wrote its Unicode code point instead.
char specialChar = (char)10781;
string specialString = specialChar.ToString();
// prints 1
Console.WriteLine(specialString.Length);
// prints 10781
Console.WriteLine((int)specialChar);
// prints false
Console.WriteLine(string.Empty.StartsWith("A"));
// both print true, WTF?!?
Console.WriteLine(string.Empty.StartsWith(specialString));
Console.WriteLine(string.Empty.StartsWith(((char)10781).ToString()));
You can fix this bug by using ordinal StringComparison:
From the MSDN docs:
When you specify either StringComparison.Ordinal or StringComparison.OrdinalIgnoreCase, the string comparison will be non-linguistic. That is, the features that are specific to the natural language are ignored when making comparison decisions. This means the decisions are based on simple byte comparisons and ignore casing or equivalence tables that are parameterized by culture. As a result, by explicitly setting the parameter to either the StringComparison.Ordinal or StringComparison.OrdinalIgnoreCase, your code often gains speed, increases correctness, and becomes more reliable.
char specialChar = (char)10781;
string specialString = Convert.ToString(specialChar);
// prints 1
Console.WriteLine(specialString.Length);
// prints 10781
Console.WriteLine((int)specialChar);
// prints false
Console.WriteLine(string.Empty.StartsWith("A"));
// prints false
Console.WriteLine(string.Empty.StartsWith(specialString, StringComparison.Ordinal));
Nice unicode glitch ;-p
I'm not sure why it does this, but amusingly:
Console.WriteLine(string.Empty.StartsWith(specialString)); // true
Console.WriteLine(string.Empty.Contains(specialString)); // false
Console.WriteLine("abc".StartsWith(specialString)); // true
Console.WriteLine("abc".Contains(specialString)); // false
I'm guessing this is treated a bit like the non-joining character that Jon mentioned at devdays; some string functions see it, and some don't. And if it doesn't see it, this becomes "does (some string) start with an empty string", which is always true.
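A tiny sketch of that last point, using ordinal comparison so the outcome does not depend on culture data. PrefixDemo is just an illustrative name. An empty prefix matches any string, whereas the special character itself is an ordinary one-character prefix that neither "" nor "abc" starts with:

```csharp
using System;

public static class PrefixDemo
{
    // "Does (some string) start with an empty string": true for every input.
    public static bool StartsWithEmpty(string text)
    {
        return text.StartsWith(string.Empty, StringComparison.Ordinal);
    }

    // Ordinal check: the prefix is compared code unit by code unit,
    // so no character is skipped as "ignorable".
    public static bool StartsWithOrdinal(string text, string prefix)
    {
        return text.StartsWith(prefix, StringComparison.Ordinal);
    }
}
```

So if the culture-sensitive overload ignores the character, the call degenerates into the first method, which would explain the `true` results.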
The underlying reason for this is the default string comparison is locale aware. This means using tables of locale data for comparisons (including equality).
Many (if not most) Unicode characters have no value for many locales, and thus don't exist (or do, but match anything, or nothing).
See entries on character weights on Michael Kaplan's blog "Sorting It All Out". This series of blogs contains a lot of background information (the APIs are native, but—as I understand—the mechanisms in .NET are the same).
Quick version: this is a complex area. Getting expected (normal-language) comparisons right is hard, and this tends to lead to odd results with code points for glyphs outside your language.