It's late and my caffeine IV is running low so my mind is mush and I'm having problems finding a solution to what I think is a simple encoding problem (which I have almost no experience dealing with).
I have a DB using EF6 Code First and everything seems to work well until I copy some text from a website forum contained within a codeblock. I checked the header and it's supposedly encoded in UTF-8.
I essentially take this text, split it to an array of strings and check the DB for a record matching the string in each line. Everything was going well until I hit a problem with a string "Magnеtic" not matching up to anything in my DB table yet when I went into the SQLMS and queried the table with LIKE '%Magnеtic%' I got a result.
I dropped the text from the website into Notepad++ with the text from the DB query and saw that they look equal:
Magnеtic
Magnеtic
Then, I changed the encoding to ANSI and it showed:
Magnetic <--From DB
Magnеtic <--From website
A tiny light bulb went on in my head but my attempts to remedy this issue has failed.
I've tried using various methods but I think it's my fried brain attacking the problem with the wrong tools:
string.compare(a, b) == 0
string.equals(a, b)
string.ToUpperInvariant()
and probably a few others that I can't remember.
So now you know what my issue is and I feel this is such a simple problem to fix but, like I said, I'm fried and now need some community help.
I'm not a professional coder, more a hobbyist so I may not be using best practices or advanced techniques to do some things.
Edit:
Today I did some more searching and found a couple of methods that didn't work.
private string RemoveAccent(string txt)
{
byte[] bytes = Encoding.GetEncoding("Cyrillic").GetBytes(txt);
return Encoding.ASCII.GetString(bytes);
}
This one appears to remove the accented characters of the Cyrillic encoding. The result wasn't as expected but it DID have an effect.
Results:
Magn?tic <- Computer interpretation
Magnetic <- Visual representation
I also tried:
private string RemoveAccent2(string txt)
{
char[] toReplace = "àèìòùÀÈÌÒÙ äëïöüÄËÏÖÜ âêîôûÂÊÎÔÛ áéíóúÁÉÍÓÚðÐýÝ ãñõÃÑÕšŠžŽçÇåÅøØ".ToCharArray();
char[] replaceChars = "aeiouAEIOU aeiouAEIOU aeiouAEIOU aeiouAEIOUdDyY anoANOsSzZcCaAoO".ToCharArray();
for (int i = 0; i < toReplace.Count(); i++)
{
txt = txt.Replace(toReplace[i], replaceChars[i]);
}
return txt;
}
This method didn't provide any changes.
What can help in these cases is to copy-paste the character into google. In this case, the results point to the Wikipedia article about the letter Ye in Cyrillic, which looks exactly like E in the Latin alphabet, but has different encoding in Unicode.
This means the results you're getting are correct: the string “Magnеtic” looks exactly the same as “Magnetic” (at least using common fonts), but it's actually a different string.
Related
Don't often post. but here is a problem that I can't seem to find an answer online.
With this snippet of a C# code:
if (Attributes.TryGetValue("Language", out value)) {
if (value.ToString().Equals("French")) {
cmd.Parameters.AddWithValue("#DEFAULTLANGUAGE","Français");
} else {
cmd.Parameters.AddWithValue("#DEFAULTLANGUAGE","English");
}
I get this stored in the DB Français instead of Français.
Can't find on the MSDN what I need to do to get the correct value.
The languages are stored in their native text, so the ç isn't going to be the only one.
*** Editing original, because replying to thread limits text length***
More details:
Ok, I should have said "Languages are stored in their native text when possible" because yes, it's just a varchar(60) not a nvarchar. I didn't think it would be a problem as ç is in the ASCII varchar list and other records have the value stored correctly.
The query isn't anything special just a standard insert into values query. (not able to display query, but it works. the record is created.) No stored proc, No specific definitions.
LOL
Ok.. There will be a facepalm in here.. but I'm going to post it, because leaving this question unanswered doesn't help people.. (Science should publish their failures too.. anyways)
So, the solutions didn't work, still getting Français instead of Français.
Then I realized that the code file itself, could be in unicode.. and it was..
Converted it to ascii, and it changed the Français to Français.
Corrected the character, and with the below code the problem is fixed!
cmd.Parameters.Add("#DEFAULTLANGUAGE",System.Data.SqlDbType.VarChar,60).Value ="Français";
In the end I wonder why this was a problem in the first place? Ascii is wholly contained within Unicode. Why does it change?
Ok the title may be not correct but this is what i came as best
My question is this
Example 1
see , saw
I can convert see to saw with as
replace ee with aw
string srA = "see";
string srB = "saw";
srA = srB.Replace("aw", "ee");
Or lets say
show , shown
add n to original string
Now what i want it is, with minimum length of code, generating such procedures to any compared strings
Looking for your ideas how can i make it? Can i generate regexes automatically to apply and convert?
c# 6
Check diffplex and and see if it is what you need. If you want to create a custom algorithm, instead of using a 3rd party library just go through the code -it's open source.
You might also want to check this work for optimizations, but it might get complicated.
Then there's also Diff.NET.
Also this blog post is part of a series in implementing a diff tool.
If you're simply interested in learning more about the subject, your googling efforts should be directed to the Levenshtein algorithm.
I can only assume what your end goal is, and the time you're willing to invest in this, but I believe the first library should be enough for most needs.
according to my understanding, a base64 encoded string (ie the output of encode) must always be a multiple of 4.
the c# Convert.FromBase64String says that its input must be a multiple of 4
However if I give it a 25 character string it doesnt complain
[convert]::FromBase64String("ei5gsIELIki+GpnPGyPVBA==")
[convert]::FromBase64String("1ei5gsIELIki+GpnPGyPVBA==")
both work. (The first one is 24 , second is 25)
[convert]::FromBase64String("11ei5gsIELIki+GpnPGyPVBA==")
fails with Invalid length exception
I assume this is a bug in the c# library but I just want to make sure - I am writing code that is sniffing strings to see if they are valid base64 strings and I want to be sure that I understand what a valid one looks like (one possible implementation was to give the string to system.convert and see if it threw - why reinvent perfectly good code)
Yes, this is a flaw (aka bug). It got started due to a perf optimization in an internal helper function named FromBase64_ComputeResultLength() which calculates the length of the byte[] result. It has this comment (edited to fit):
// For legal input, we can assume that 0 <= padding < 3. But it may be
// more for illegal input.
// We will notice it at decode when we see a '=' at the wrong place.
The "we will notice" remark is not entirely accurate, the decoder does flag an '=' if one isn't expected but it fails to check if there's one too many. Which is the case for the 25-char string.
You can report the problem at connect.microsoft.com, I don't see an existing report that resembles it. Do note that it is fairly unlikely that Microsoft can actually fix it any time soon since the change is going to break existing programs that now successfully parse bad base64 strings. It normally requires a major .NET release update to get rid of such problems, like it was done for .NET 4.0, there isn't one on the horizon afaik.
But yes, the simple workaround for you is to check if the string length is divisible by 4, use the % operator.
I know there are similar questions already on SO but none of them seem to address this problem. I have inherited the following c# code that has been used to create password hashes in a legacy .net app, for various reasons the C# implementation is now being migrated to php:
string input = "fred";
SHA256CryptoServiceProvider provider = new SHA256CryptoServiceProvider();
byte[] hashedValue = provider.ComputeHash(Encoding.ASCII.GetBytes(input));
string output = "";
string asciiString = ASCIIEncoding.ASCII.GetString(hashedValue);
foreach ( char c in asciiString ) {
int tmp = c;
output += String.Format("{0:x2}",
(uint)System.Convert.ToUInt32(tmp.ToString()));
}
return output;
My php code is very simple but for the same input "fred" doesn't produce the same result:
$output = hash('sha256', "fred");
I've traced the problem down to an encoding issue - if I change this line in the C# code:
string asciiString = ASCIIEncoding.ASCII.GetString(hashedValue);
to
string asciiString = ASCIIEncoding.UTF7.GetString(hashedValue);
Then the php and C# output match (it yields d0cfc2e5319b82cdc71a33873e826c93d7ee11363f8ac91c4fa3a2cfcd2286e5).
Since I'm not able to change the .net code I need to work out how to replicate the results in php.
Thanks in advance for any help,
I don’t know PHP well enough to answer your question; however, I must point out that your C# code is broken. Try generating the hash of these two inputs: "âèí" and "çñÿ". You will find that their hash collides:
3f3b221c6c6e3f71223f51695d456d52223f243f3f363949443f3f763b483615
The first bug lies in this operation:
Encoding.ASCII.GetBytes(input)
This assumes that all characters within your input are US-ASCII. Any non-ASCII characters would cause the encoder to fall back to the byte value for the ? character, thereby giving (unwanted) hash collisions, as demonstrated above. Notwithstanding, this will not be an issue if your input is constrained to only allow US-ASCII characters.
The other (more severe) bug lies in the following operation:
ASCIIEncoding.ASCII.GetString(hashedValue)
ASCII only defines mappings for values 0–127. Since the elements of your hashedValue byte array may contain any byte value (0–255), encoding them as ASCII would cause data to be lost whenever a value greater than 127 is encountered. This may lead to further “unwanted” (read: potentially maliciously generated) hash collisions, even when your original input was US-ASCII.
Given that, statistically, half of the bytes constituting your hashes would be greater than 127, then you are losing at least half the strength of your hash algorithm. If a hacker gains access to your stored hashes, it is quite likely that they will manage to devise an attack to generate hash collisions by exploiting this cryptographic weakness.
Edit: Notwithstanding the considerations mentioned in my posts and Jon’s, here is the PHP code that succumbs to the same weakness – so to speak – as your C# code, and thereby gives the same hash:
$output = hash('sha256', $input, true);
for ($i = 0; $i < strlen($output); $i++)
if ($output[$i] > chr(127))
$output[$i] = '?';
$output = bin2hex($output);
Could you use mb_convert_encoding (see http://php.net/manual/en/function.mb-convert-encoding.php - the page also has a link to a list of supported encodings) to convert the PHP string to ASCII from UTF7?
I've traced the problem down to an encoding issue
Yes. You're trying to treat arbitrary binary data as if it's valid text-encoded data. It's not. You should not be using any Encoding here.
If you want the results in hex, the simplest approach is to use BitConverter.ToString
string text = BitConverter.ToString(hashedValue).Replace("-", "").ToLower();
And yes, as pointed out elsewhere, you probably shouldn't be using ASCII to convert the text to binary at the start of the hashing process. I'd probably use UTF-8.
It's really important that you understand the problem here though, as otherwise you'll run into it in other places too. You should only use encodings such as ASCII, UTF-8 etc (on any platform) when you've genuinely got encoded text data. You shouldn't use them for images, the results of cryptography, the results of hashing, etc.
EDIT: Okay, you say you can't change the C# code... it's not clear whether that just means you've got legacy data, or whether you need to keep using the C# code regardless. You should absolutey not run this code for a second longer than you have to.
But in PHP, you may find you can get away with just replacing every byte with a value >= 0x80 in the hash with 0x3F, which is the ASCII for "question mark". If you look through your data you'll probably find there are a lot of 3F bytes in there.
If you can get this to work, I would strongly suggest that you migrate over to the true MD5 hash without losing information like this. Wherever you're storing the hashes, store two: the legacy one (which is all you have now) and the rehashed one. Whenever you're asked to validate that a password is correct, you should:
Check whether you have a "new" one; if so, only use that - ignore the legacy one.
If you only have a legacy one:
Hash the password in the broken way to check whether it's correct
If it is, hash it again properly and store the results in the "new" place.
Then when everyone's logged in correctly once, you'll be able to wipe out the legacy hashes.
Recently our site has been deluged with the resurgence of the Asprox botnet SQL injection attack. Without going into details, the attack attempts to execute SQL code by encoding the T-SQL commands in an ASCII encoded BINARY string. It looks something like this:
DECLARE%20#S%20NVARCHAR(4000);SET%20#S=CAST(0x44004500...06F007200%20AS%20NVARCHAR(4000));EXEC(#S);--
I was able to decode this in SQL, but I was a little wary of doing this since I didn't know exactly what was happening at the time.
I tried to write a simple decode tool, so I could decode this type of text without even touching SQL Server. The main part I need to be decoded is:
CAST(0x44004500...06F007200 AS
NVARCHAR(4000))
I've tried all of the following commands with no luck:
txtDecodedText.Text =
System.Web.HttpUtility.UrlDecode(txtURLText.Text);
txtDecodedText.Text =
Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(txtURLText.Text));
txtDecodedText.Text =
Encoding.Unicode.GetString(Encoding.Unicode.GetBytes(txtURLText.Text));
txtDecodedText.Text =
Encoding.ASCII.GetString(Encoding.Unicode.GetBytes(txtURLText.Text));
txtDecodedText.Text =
Encoding.Unicode.GetString(Convert.FromBase64String(txtURLText.Text));
What is the proper way to translate this encoding without using SQL Server? Is it possible? I'll take VB.NET code since I'm familiar with that too.
Okay, I'm sure I'm missing something here, so here's where I'm at.
Since my input is a basic string, I started with just a snippet of the encoded portion - 4445434C41 (which translates to DECLA) - and the first attempt was to do this...
txtDecodedText.Text = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(txtURL.Text));
...and all it did was return the exact same thing that I put in since it converted each character into is a byte.
I realized that I need to parse every two characters into a byte manually since I don't know of any methods yet that will do that, so now my little decoder looks something like this:
while (!boolIsDone)
{
bytURLChar = byte.Parse(txtURLText.Text.Substring(intParseIndex, 2));
bytURL[intURLIndex] = bytURLChar;
intParseIndex += 2;
intURLIndex++;
if (txtURLText.Text.Length - intParseIndex < 2)
{
boolIsDone = true;
}
}
txtDecodedText.Text = Encoding.UTF8.GetString(bytURL);
Things look good for the first couple of pairs, but then the loop balks when it gets to the "4C" pair and says that the string is in the incorrect format.
Interestingly enough, when I step through the debugger and to the GetString method on the byte array that I was able to parse up to that point, I get ",-+" as the result.
How do I figure out what I'm missing - do I need to do a "direct cast" for each byte instead of attempting to parse it?
I went back to Michael's post, did some more poking and realized that I did need to do a double conversion, and eventually worked out this little nugget:
Convert.ToString(Convert.ToChar(Int32.Parse(EncodedString.Substring(intParseIndex, 2), System.Globalization.NumberStyles.HexNumber)));
From there I simply made a loop to go through all the characters 2 by 2 and get them "hexified" and then translated to a string.
To Nick, and anybody else interested, I went ahead and posted my little application over in CodePlex. Feel free to use/modify as you need.
Try removing the 0x first and then call Encoding.UTF8.GetString. I think that may work.
Essentially: 0x44004500
Remove the 0x, and then always two bytes are one character:
44 00 = D
45 00 = E
6F 00 = o
72 00 = r
So it's definitely a Unicode/UTF format with two bytes/character.