Determine if a string contains a base64 string inside of it

Determine if a string contains a base64 string inside of it - c#

I'm trying to figure out a way to parse out a base64 string from with a larger string.
I have the string "Hello <base64 content> World" and I want to be able to parse out the base64 content and convert it back to a string. "Hello Awesome World"
Answers in C# preferred.
Edit: Updated with a more real example.
--abcdef
\n
Content-Type: Text/Plain;
Content-Transfer-Encoding: base64
\n
<base64 content>
\n
--abcdef--
This is taken from 1 sample. The problem is that the Content.... vary quite a bit from one record to the next.

There is no reliable way to do it. How would you know that, for instance, "Hello" is not a base64 string ? OK, it's a bad example because base64 is supposed to be padded so that the length is a multiple of 4, but what about "overflow" ? It's 8-character long, it is a valid base64 string (it would decode to "¢÷«~Z0"), even though it's obviously a normal word to a human reader. There's just no way you can tell for sure whether a word is a normal word or base64 encoded text.
The fact that you have base64 encoded text embedded in normal text is clearly a design mistake, I suggest you do something about it rather that trying to do something impossible...

In short form you could:
split the string on any chars that are not valid base64 data or padding
try to convert each token
if the conversion succeeds, call replace on the original string to switch the token with the converted value
In code:
var delimiters = new char[] { /* non-base64 ASCII chars */ };
var possibles = value.Split(delimiters, StringSplitOptions.RemoveEmptyEntries);
//need to tweak to include padding chars in matches, but still split on padding?
//maybe better off creating a regex to match base64 + padding
//and using Regex.Split?
foreach(var match in possibles)
{
try
{
var converted = Convert.FromBase64String(match);
var text = System.Text.Encoding.UTF8.GetString(converted);
if(!string.IsNullOrEmpty(text))
{
value = value.Replace(match, text);
}
}
catch (System.ArgumentNullException)
{
//handle it
}
catch (System.FormatException)
{
//handle it
}
}
Without a delimiter though, you can end up converting non-base64 text that happens to be also be valid as base64 encoded text.
Looking at your example of trying to convert "Hello QXdlc29tZQ== World" to "Hello Awesome World" the above algorithm could easily generate something like "ée¡Ý•Í½µ”¢¹]" by trying to convert the whole string from base64 since there is no delimiter between plain and encoded text.
Update (based on comments):
If there are no '\n's in the base64 content and it is always preceded by "Content-Transfer-Encoding: base64\n", then there is a way:
split the string on '\n'
iterate over all the tokens until a token ends in "Content-Transfer-Encoding: base64"
the next token (if there are any) should be decoded (if possible) and then the replacement should be made in the original string
return to iterating until out of tokens
In code:
private string ConvertMixedUpTextAndBase64(string value)
{
var delimiters = new char[] { '\n' };
var possibles = value.Split(delimiters,
StringSplitOptions.RemoveEmptyEntries);
for (int i = 0; i < possibles.Length - 1; i++)
{
if (possibles[i].EndsWith("Content-Transfer-Encoding: base64"))
{
var nextTokenPlain = DecodeBase64(possibles[i + 1]);
if (!string.IsNullOrEmpty(nextTokenPlain))
{
value = value.Replace(possibles[i + 1], nextTokenPlain);
i++;
}
}
}
return value;
}
private string DecodeBase64(string text)
{
string result = null;
try
{
var converted = Convert.FromBase64String(text);
result = System.Text.Encoding.UTF8.GetString(converted);
}
catch (System.ArgumentNullException)
{
//handle it
}
catch (System.FormatException)
{
//handle it
}
return result;
}

Related

Convert string with special characters to hex - C#

Hi I'm trying to transform a string containing special characters like û and ….
In my research and tests I almost succeeded using the following function:
public static string ToHex(this string input)
{
char[] values = input.ToCharArray();
string hex = "0x";
string add = "";
foreach (char c in values)
{
int value = Convert.ToInt32(c);
add = String.Format("{0:X}", value).Length == 1 ?
"0" + String.Format("{0:X}", value) + "00"
: String.Format("{0:X}", value) + "00";
hex += add;
}
return hex;
}
If I try to decode ´o¸sçPQ^ûË\u000f±d it does it correctly and turns it into this 0xB4006F00B8007300E700500051005E00FB00CB000F00B1006400,
instead when I try to decode ´o¸sçPQ](ÂF\u0012…a it fails and turns it into 0xB4006F00B8007300E700500051005D002800C200460012002026006100 instead of this
0xB4006F00B8007300E700500051005D002800C2004600120026206100.
Making a minimum of debug I saw that the string is transformed from
´o¸sçPQ](ÂF\u0012…a to ´o¸sçPQ](ÂF.a, I wouldn't want that to be the problem but I'm not sure.
EDIT
0xB4006F00B8007300E700500051005D002800C2004600120026206100 ´o¸sçPQ](ÂF…a CORRECT
0xB4006F00B8007300E700500051005D002800C200460012002026006100 ´o¸sçPQ](ÂF.a MY OUTPUT
0xB4006F00B8007300E700500051005D003D00CB0042000C00A50061006000AD004500BB00 ´o¸sçPQ]=ËB¥a`E» CORRECT
0xB4006F00B8007300E700500051005D003D00CB0042000C00A50061006000AD004500BB00 ´o¸sçPQ]=ËB¥a`E» MY OUTPUT
0xB4006F00B8007300E700500051005D002F00D30042001900B7006E006100 ´o¸sçPQ]/ÓB·na CORRECT
0xB4006F00B8007300E700500051005D002F00D30042001900B7006E006100 ´o¸sçPQ]/ÓB·na MY OUTPUT
0xB4006F00B8007300E700500051005F001A20BC006B0021003500DD00 ´o¸sçPQ_‚¼k!5Ý CORRECT
0xB4006F00B8007300E700500051005F00201A00BC006B0021003500DD00 ´o¸sçPQ_'¼k!5Ý MY OUTPUT
0xB4006F00B8007300E700500051005D002F00EE006B00290014204E004100 ´o¸sçPQ]/îk)—NA CORRECT
0xB4006F00B8007300E700500051005D002F00EE006B0029002014004E004100 ´o¸sçPQ]/îk)-NA MY OUTPUT
0xB4006F00B8007300E700500051005D003800E600690036001C204C004F00 ´o¸sçPQ]8æi6“LO CORRECT
0xB4006F00B8007300E700500051005D003800E60069003600201C004C004F00 ´o¸sçPQ]8æi6"LO MY OUTPUT
0xB4006F00B8007300E700500051005D002F00F3006200390014204E004700C602 ´o¸sçPQ]/ób9—NGˆ CORRECT
0xB4006F00B8007300E700500051005D002F00F300620039002014004E0047002C600 ´o¸sçPQ]/ób9-NG^ MY OUTPUT
0xB4006F00B8007300E700500051005D003B00EE007200330078014100 ´o¸sçPQ];îr3ŸA CORRECT
0xB4006F00B8007300E700500051005D003B00EE0072003300178004100 ´o¸sçPQ];îr3YA MY OUTPUT
0xB4006F00B8007300E700500051005D003000F20064003E009D004B00 ´o¸sçPQ]0òd>K CORRECT
0xB4006F00B8007300E700500051005D003000F20064003E009D004B00 ´o¸sçPQ]0òd>?K MY OUTPUT
0xB4006F00B8007300E700500051005D002F00E60075003E00 ´o¸sçPQ]/æu> CORRECT
0xB4006F00B8007300E700500051005D002F00E60075003E00 ´o¸sçPQ]/æu> MY OUTPUT
0xB4006F00B8007300E700500051005D002F00EE006A003000DC024500 ´o¸sçPQ]/îj0˜E CORRECT
0xB4006F00B8007300E700500051005D002F00EE006A0030002DC004500 ´o¸sçPQ]/îj0~E MY OUTPUT
I thank you in advance for every reply or comment,
greetings.

This is due to endianness, and different integer and string encodings.
char cc = '…';
Console.WriteLine(cc);
// 2026 <-- note, hex value differs from byte representation shown below
Console.WriteLine(((int)cc).ToString("x"));
// 26200000
Console.WriteLine(BytesToHex(BitConverter.GetBytes((int)cc)));
// 2620
Console.WriteLine(BytesToHex(Encoding.GetEncoding("utf-16").GetBytes(new[] { cc })));
You should not treat chars as integers. There are plenty of different ways to encode strings, .net internally uses UTF-16. And all encodings works with bytes, not with integers. Explicit conversion chars to integer can lead to unexpected results, like yours. Why don't you get encoding you need and work with bytes via Encoding.GetBytes?
void Main()
{
// output you expect 0xB4006F00B8007300E700500051005D002800C2004600120026206100
Console.WriteLine(BytesToHex(Encoding.GetEncoding("utf-16").GetBytes("´o¸sçPQ](ÂF\u0012…a")));
}
public static string BytesToHex(byte[] bytes)
{
// whatever way to convert bytes to hex
return "0x" + BitConverter.ToString(bytes).Replace("-", "");
}

Base64 Encoded String to byte array C# Failing

We are receiving a base64 encoded string from a external application in the form body and below is the code that we are using to decode the string into byte array, however we are getting an exception
Input String
PHBheW1lbnRSZXNwb25zZT48cmVzcG9uc2VDb2RlPjAwMDA8L3Jlc3BvbnNlQ29kZT48cmVzcG9uc2VDb2RlVGV4dD4wLVN1Y2Nlc3NmdWw8L3Jlc3BvbnNlQ29kZVRleHQ+PHJlc3BvbnNlU3VtbWFyeT5HUkVFTjwvcmVzcG9uc2VTdW1tYXJ5PjxwYXltZW50RXZlbnRJZGVudGlmaWVyPlRYTiAzNjM5PC9wYXltZW50RXZlbnRJZGVudGlmaWVyPjxMaXN0Pjxjb21wb25lbnRJRD5UWE4gMzYzOTwvY29tcG9uZW50SUQ+PGNsaWVudElEPkdPVERJU0UwNjwvY2xpZW50SUQ+PGJhbmtBdXRoQ29kZT5UOjEyMzQ8L2JhbmtBdXRoQ29kZT48YnV5bmV0VHhuSUQ+Mzc1PC9idXluZXRUeG5JRD48L0xpc3Q+PHBheW1lbnRJbnN0cnVtZW50UmVmPjwvcGF5bWVudEluc3RydW1lbnRSZWY+PG1hc2tlZENhcmROdW1iZXI+KioqKioqKioqKioqOTY4NjwvbWFza2VkQ2FyZE51bWJlcj48Y2FyZFR5cGU+TUFTVEVSQ0FSRDwvY2FyZFR5cGU+PGV4cGlyeURhdGU+MDMvMjAxNzwvZXhwaXJ5RGF0ZT48Y3VzdG9tRGF0YT4mbHQ7IVtDREFUQVsmbHQ7P3htbCB2ZXJzaW9uPSIxLjAiIGVuY29kaW5nPSJVVEYtOCI_Jmd0Ow0KJmx0O1RoaXN0bGVDdXN0b21EYXRhIHhtbG5zOnhzZD0iaHR0cDovL3d3dy53My5vcmcvMjAwMS9YTUxTY2hlbWEiIHhtbG5zOnhzaT0iaHR0cDovL3d3dy53My5vcmcvMjAwMS9YTUxTY2hlbWEtaW5zdGFuY2UiJmd0Ow0KICAmbHQ7T3JkZXJJZCZndDtCVFBQMzYzOSZsdDsvT3JkZXJJZCZndDsNCiAgJmx0O0Ftb3VudCZndDsxMjAmbHQ7L0Ftb3VudCZndDsNCiZsdDsvVGhpc3RsZUN1c3RvbURhdGEmZ3Q7XV0mZ3Q7PC9jdXN0b21EYXRhPjwvcGF5bWVudFJlc3BvbnNlPg
Exception
The input is not a valid Base-64 string as it contains a non-base 64 character, more than two padding characters, or an illegal character among the padding characters.
Code
var inputText = // use the data as shown above
byte[] decodedBytes = Convert.FromBase64String(inputText);
I want to know what is wrong in this and why it is throwing the exception, where are if i try online converters they are able to return the proper result.

Thank you all for the response.
I got this working, after studying how the base64 encoding works.
below is the code that i used to fix it.
var input = new StreamReader(Request.InputStream).ReadToEnd();
var inputText = "PHBheW1lbnRSZXNwb25zZT48cmVzcG9uc2VDb2RlPjAwMDA8L3Jlc3BvbnNlQ29kZT48cmVzcG9uc2VDb2RlVGV4dD4wLVN1Y2Nlc3NmdWw8L3Jlc3BvbnNlQ29kZVRleHQ+PHJlc3BvbnNlU3VtbWFyeT5HUkVFTjwvcmVzcG9uc2VTdW1tYXJ5PjxwYXltZW50RXZlbnRJZGVudGlmaWVyPlRYTiAzNjM5PC9wYXltZW50RXZlbnRJZGVudGlmaWVyPjxMaXN0Pjxjb21wb25lbnRJRD5UWE4gMzYzOTwvY29tcG9uZW50SUQ+PGNsaWVudElEPkdPVERJU0UwNjwvY2xpZW50SUQ+PGJhbmtBdXRoQ29kZT5UOjEyMzQ8L2JhbmtBdXRoQ29kZT48YnV5bmV0VHhuSUQ+Mzc1PC9idXluZXRUeG5JRD48L0xpc3Q+PHBheW1lbnRJbnN0cnVtZW50UmVmPjwvcGF5bWVudEluc3RydW1lbnRSZWY+PG1hc2tlZENhcmROdW1iZXI+KioqKioqKioqKioqOTY4NjwvbWFza2VkQ2FyZE51bWJlcj48Y2FyZFR5cGU+TUFTVEVSQ0FSRDwvY2FyZFR5cGU+PGV4cGlyeURhdGU+MDMvMjAxNzwvZXhwaXJ5RGF0ZT48Y3VzdG9tRGF0YT4mbHQ7IVtDREFUQVsmbHQ7P3htbCB2ZXJzaW9uPSIxLjAiIGVuY29kaW5nPSJVVEYtOCI_Jmd0Ow0KJmx0O1RoaXN0bGVDdXN0b21EYXRhIHhtbG5zOnhzZD0iaHR0cDovL3d3dy53My5vcmcvMjAwMS9YTUxTY2hlbWEiIHhtbG5zOnhzaT0iaHR0cDovL3d3dy53My5vcmcvMjAwMS9YTUxTY2hlbWEtaW5zdGFuY2UiJmd0Ow0KICAmbHQ7T3JkZXJJZCZndDtCVFBQMzYzOSZsdDsvT3JkZXJJZCZndDsNCiAgJmx0O0Ftb3VudCZndDsxMjAmbHQ7L0Ftb3VudCZndDsNCiZsdDsvVGhpc3RsZUN1c3RvbURhdGEmZ3Q7XV0mZ3Q7PC9jdXN0b21EYXRhPjwvcGF5bWVudFJlc3BvbnNlPg";
inputText = ValidateBase64EncodedString(inputText);
byte[] decodedBytes = Convert.FromBase64String(inputText);
string xml = Encoding.UTF8.GetString(decodedBytes);
private static string ValidateBase64EncodedString(string inputText)
{
string stringToValidate = inputText;
stringToValidate = stringToValidate.Replace('-', '+'); // 62nd char of encoding
stringToValidate = stringToValidate.Replace('_', '/'); // 63rd char of encoding
switch (stringToValidate.Length % 4) // Pad with trailing '='s
{
case 0: break; // No pad chars in this case
case 2: stringToValidate += "=="; break; // Two pad chars
case 3: stringToValidate += "="; break; // One pad char
default:
throw new System.Exception(
"Illegal base64url string!");
}
return stringToValidate;
}

It's because your custom data is invalid.
In that Base64 encoded string you have a raw [CDATA] which is most likely not encoded correctly.
notice that if i take the first 684 characters of your string, it converts properly to:
<paymentResponse><responseCode>0000</responseCode><responseCodeText>0-Successful</responseCodeText><responseSummary>GREEN</responseSummary><paymentEventIdentifier>TXN 3639</paymentEventIdentifier><List><componentID>TXN 3639</componentID><clientID>GOTDISE06</clientID><bankAuthCode>T:1234</bankAuthCode><buynetTxnID>375</buynetTxnID></List><paymentInstrumentRef></paymentInstrumentRef><maskedCardNumber>************9686</maskedCardNumber><cardType>MASTERCARD</cardType><expiryDate>03/2017</expiryDate><customData>&
but all hell breaks loose when we get to the customData tag, this I suppose, contains the CDATA xml content, which is probably not base64 encoded.
So either leave the CDATA content out, or encode it properly before adding it to the final string, to fix the issue

Compare Windows-1252 string to UTF-8 string

my goal is to convert a .NET string (Unicode) into Windows-1252 and - if necessary - store the original UTF-8 string in a Base64 entity.
For example, the string "DJ Doena" converted to 1252 is still "DJ Doena".
However if you convert the Japanese kanjii for tree (木) into 1251 you end up with a question mark.
These are my test strings:
String doena = "DJ Doena";
String umlaut = "äöüßéèâ";
String allIn = "< ä ß á â & 木 >";
This is how I convert the string in the first place:
using (MemoryStream ms = new MemoryStream())
{
using (StreamWriter sw = new StreamWriter(ms, Encoding.UTF8))
{
sw.Write(decoded);
sw.Flush();
ms.Seek(0, SeekOrigin.Begin);
using (StreamReader sr = new StreamReader(ms, Encoding.GetEncoding(1252)))
{
encoded = sr.ReadToEnd();
}
}
}
Problem is, while debugging string comparison claims that both are indeed identical, so a simple == or .Equals() doesn't suffice.
This is how I try to find out if I need base64 and produce it:
private static String GetBase64Alternate(String utf8Text, String windows1252Text)
{
Byte[] utf8Bytes;
Byte[] windows1252Bytes;
String base64;
utf8Bytes = Encoding.UTF8.GetBytes(utf8Text);
windows1252Bytes = Encoding.GetEncoding(1252).GetBytes(windows1252Text);
base64 = null;
if (utf8Bytes.Length != windows1252Bytes.Length)
{
base64 = Convert.ToBase64String(utf8Bytes);
}
else
{
for(Int32 i = 0; i < utf8Bytes.Length; i++)
{
if(utf8Bytes[i] != windows1252Bytes[i])
{
base64 = Convert.ToBase64String(utf8Bytes);
break;
}
}
}
return (base64);
}
The first string doena is completely identical and doesn't produce a base64 result
Console.WriteLine(String.Format("{0} / {1}", windows1252Text, base64Text));
results in
DJ Doena /
But the second string umlauts already has twice the bytes in UTF-8 than in 1252 and thus produces an Base64 string even though it does not appear to be necessary:
äöüßéèâ / w6TDtsO8w5/DqcOow6I=
And the third one does what it's supposed to do (no more "木" but a "?", thus base64 needed):
< ä ß á â & ? > / PCDDpCDDnyDDoSDDoiAmIOacqCA+
Any clues how my Base64 getter could be enhanced a) for performance b) for better results?
Thank you in advance. :-)

I'm not sure I completely understood the question. But I tried. :) If I do understand correctly, this code does what you want:
static void Main(string[] args)
{
string[] testStrings = { "DJ Doena", "äöüßéèâ", "< ä ß á â & 木 >" };
foreach (string text in testStrings)
{
Console.WriteLine(ReencodeText(text));
}
}
private static string ReencodeText(string text)
{
Encoding encoding = Encoding.GetEncoding(1252);
string text1252 = encoding.GetString(encoding.GetBytes(text));
return text.Equals(text1252, StringComparison.Ordinal) ?
text : Convert.ToBase64String(Encoding.UTF8.GetBytes(text));
}
I.e. it encodes the text to Windows-1252, then decodes back to a string object, which it then compares with the original. If the comparison succeeds, it returns the original string, otherwise it encodes it to UTF8, and then to base64.
It produces the following output:
DJ Doena
äöüßéèâ
PCDDpCDDnyDDoSDDoiAmIOacqCA+
In other words, the first two strings are left intact, while the third is encoded as base64.

In your first code you are encoding the string using one encoding, then decoding it using a different encoding. That doesn't give you any reliable result at all; it's the equivalent of writing out a number in octal, then reading it as if it was in decimal. It seems to work just fine for numbers up to 7, but after that you get useless results.
The problem with the GetBase64Alternate method is that it's encoding a string to two different encodings, and assumes that the first encoding doesn't support some of the characters if the second encoding resulted in a different set of bytes.
Comparing the byte sequences doesn't tell you whether any of the encodings failed. The sequences will be different if it failed, but it will also be different if there are any characters that are encoded differently between the encodings.
What you want to do is to determine if the encoding actually worked for all characters. You can do that by creating an Encoding instance with a fallback for unsupported characters. There is an EncoderExceptionFallback class that you can use for that, which throws an EncoderFallbackException if it's called.
This code will try use the Windows-1252 encoding on a string, and sets the ok variable to false if the encoding doesn't support all characters in the string:
Encoding e = Encoding.GetEncoding(1252, new EncoderExceptionFallback(), new DecoderExceptionFallback());
bool ok = true;
try {
e.GetByteCount(allIn);
} catch (EncoderFallbackException) {
ok = false;
}
As you are not actually going to used the encoded result for anything, you can use the GetByteCount method. It will check how all characters would be encoded without producing the encoded result.
Used in your method it would be:
private static String GetBase64Alternate(string text) {
Encoding e = Encoding.GetEncoding(1252, new EncoderExceptionFallback(), new DecoderExceptionFallback());
bool ok = true;
try {
e.GetByteCount(allIn);
} catch (EncoderFallbackException) {
ok = false;
}
return ok ? null : Convert.ToBase64(Encoding.UTF8.GetBytes(text));
}

The input is not a valid Base-64 string as it contains a non-base 64 character, "

C# code:
string base64string = Textbox1.Text;
string converted = base64string.Replace('-', '+');
converted = converted.Replace('_', '/');
try
{
// Convert base64string to bytes array
Byte[] bytes = Convert.FromBase64String(converted);
gif = iTextSharp.text.Image.GetInstance(bytes);
}
Textbox1.Text contains
data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAABUYAAADICAYAAAAkwztuAAAgAElEQVR4Xu3db6wdVd0v8KX2UGlLRfPE8JgIWiUc3yikvqt/ohi1L1Q04u2VCiVq0UpjTIAcCUewxj9YOAakgR7808TEe28rFpOSFKI3xPhGQk1jWyOJN+gLlQYLCtQoT4V7Zj+Z7Zw5s//v2XvNzKdJE+nZs+a3Pr91tvt8z5qZl7y49Cf4Q4AAAQIECBAgQIAAAQIECBAgQIAAgQYJvEQw2qBumyoBAgQIECBAgAABAgQIECBAgAABAi0BwaiFQIAAAQIECBAgQIAAAQIECBAgQIBA4wQEo41ruQkTIECAAAECBAgQIECAAAECBAgQICAYtQYIECBAgAABAgQIECBAgAABAgQIEGicgGC0cS03YQIECBAgQIAAAQIECBAgQIAAAQIEBKPWAAECBAgQIECAAA....
its a correct format but still i am getting error.

You need to skip this: data:image/png;base64,, so try something like:
string base64string = Textbox1.Text.Substring(22);
This will fetch everything after the first 22 characters in you string. Note that you might want to verify that there are more than 22 characters in the text-box before doing this, just to make sure it's not empty.
EDIT: Perhaps an even better approach would be:
var text = Textbox1.Text;
var metadataStart = text.IndexOf("data:image/png;base64,");
if(start != -1)
{
// Remove the metadata if found
text = text.Remove(metadataStart, metadataStart + 22);
}
After this, you can go ahead and convert text.

You need to strip the starting data:image/png;base64,. The rest of the string looks like valid BASE64, but everything up to and including the comma does not belong in there.

You need to split this: data:image/png;base64,so try something like
try
{
string base64string = Textbox1.Text.Split(',')[1];
//Convert base64string to bytes array
Byte[] bytes = Convert.FromBase64String(base64string);
gif = iTextSharp.text.Image.GetInstance(bytes);
}

How to get data off of a character

I am working on a project in Unity which uses Assembly C#. I try to get special character such as é, but in the console it just displays a blank character: "". For instance translating "How are you?" Should return "Cómo Estás?", but it returns "Cmo Ests". I put the return string "Cmo Ests" in a character array and realized that it is a non-null blank character. I am using Encoding.UTF8, and when I do:
char ch = '\u00e9';
print (ch);
It will print "é". I have tried getting the bytes off of a given string using:
byte[] utf8bytes = System.Text.Encoding.UTF8.GetBytes(temp);
While translating "How are you?", it will return a byte string, but for the special characters such as é, I get the series of bytes 239, 191, 189, which is a replacement character.
What type of information do I need to retrieve from the characters in order to accurately determining what character it is? Do I need to do something with the information that Google gives me, or is it something else? I am need a general case that I can place in my program and will work for any input string. If anyone can help, it would be greatly appreciated.
Here is the code that is referenced:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using UnityEngine;
using System.Collections;
using System.Net;
using HtmlAgilityPack;
public class Dictionary{
string[] formatParams;
HtmlDocument doc;
string returnString;
char[] letters;
public char[] charString;
public Dictionary(){
formatParams = new string[2];
doc = new HtmlDocument();
returnString = "";
}
public string Translate(String input, String languagePair, Encoding encoding)
{
formatParams[0]= input;
formatParams[1]= languagePair;
string url = String.Format("http://www.google.com/translate_t?hl=en&ie=UTF8&text={0}&langpair={1}", formatParams);
string result = String.Empty;
using (WebClient webClient = new WebClient())
{
webClient.Encoding = encoding;
result = webClient.DownloadString(url);
}
doc.LoadHtml(result);
input = alter (input);
string temp = doc.DocumentNode.SelectSingleNode("//span[#title='"+input+"']").InnerText;
charString = temp.ToCharArray();
return temp;
}
// Use this for initialization
void Start () {
}
string alter(string inputString){
returnString = "";
letters = inputString.ToCharArray();
for(int i=0; i<inputString.Length;i++){
if(letters[i]=='\''){
returnString = returnString + "'";
}else{
returnString = returnString + letters[i];
}
}
return returnString;
}
}

Maybe you should use another API/URL. This function below uses a different url that returns JSON data and seems to work better:
public static string Translate(string input, string fromLanguage, string toLanguage)
{
using (WebClient webClient = new WebClient())
{
string url = string.Format("http://translate.google.com/translate_a/t?client=j&text={0}&sl={1}&tl={2}", Uri.EscapeUriString(input), fromLanguage, toLanguage);
string result = webClient.DownloadString(url);
// I used JavaScriptSerializer but another JSON parser would work
JavaScriptSerializer serializer = new JavaScriptSerializer();
Dictionary<string, object> dic = (Dictionary<string, object>)serializer.DeserializeObject(result);
Dictionary<string, object> sentences = (Dictionary<string, object>)((object[])dic["sentences"])[0];
return (string)sentences["trans"];
}
}
If I run this in a Console App:
Console.WriteLine(Translate("How are you?", "en", "es"));
It will display
¿Cómo estás?

I don't know much about the GoogleTranslate API, but my first thought is that you've got a Unicode Normalization problem.
Have a look at System.String.Normalize() and it's friends.
Unicode is very complicated, so I'll over simplify! Many symbols can be represented in different ways in Unicode, that is: 'é' could be represented as 'é' (one character), or as an 'e' + 'accent character' (two characters), or, depending what comes back from the API, something else altogether.
The Normalize function will convert your string to one with the same Textual meaning, but potentially a different binary value which may fix your output problem.

You actually pretty much have it. Just insert the coded letter with a \u and it works.
string mystr = "C\u00f3mo Est\u00e1s?";

There are several issues with your approach. First of all the UTF8 encoding is a multibyte encoding. This means that if you use any non-ASCII character (having char code > 127), you will get a series of special characters that indicate to the system that this is an Unicode char. So actually your sequence 239, 191, 189 indicates a single character which is not an ASCII character. If you use UTF16, then you get fixed-size encodings (2-byte encodings) which actually map a character to an unsigned short (0-65535).
The char type in c# is a two-byte type, so it is actually an unsigned short. This contrasts with other languages, such as C/C++ where the char type is a 1-byte type.
So in your case, unless you really need to be using byte[] arrays, you should use char[] arrays. Or if you want to encode the characters so that they can be used in HTML, then you can just iterate through the characters and check if the character code is > 128, then you can replace it with the &hex; character code.

I had the same problem working one of my project [Language Resource Localization Translation]
I was doing the same thing and was using.. System.Text.Encoding.UTF8.GetBytes() and because of utf8 encoding was receiving special characters like your
e.g 239, 191, 189 in result string.
please take a look of my solution... hope this helps
Don't Use encoding at all Google translation will return correct like á as it self in the string. do some string manipulation and read the string as it is...
Generic Solution [works for every language translation which google support]
try
{
//Don't use UtF Encoding
// use default webclient encoding
var url = String.Format("http://www.google.com/translate_t?hl=en&text={0}&langpair={1}", "►" + txtNewResourceValue.Text.Trim() + "◄", "en|" + item.Text.Substring(0, 2));
var webClient = new WebClient();
string result = webClient.DownloadString(url); //get all data from google translate in UTF8 coding..
int start = result.IndexOf("id=result_box");
int end = result.IndexOf("id=spell-place-holder");
int length = end - start;
result = result.Substring(start, length);
result = reverseString(result);
start = result.IndexOf(";8669#&");//◄
end = result.IndexOf(";8569#&"); //►
length = end - start;
result = result.Substring(start +7 , length - 8);
objDic2.Text = reverseString(result);
//hard code substring; finding the correct translation within the string.
dictList.Add(objDic2);
}
catch (Exception ex)
{
lblMessages.InnerHtml = "<strong>Google translate exception occured no resource saved..." + ex.Message + "</strong>";
error = true;
}
public static string reverseString(string s)
{
char[] arr = s.ToCharArray();
Array.Reverse(arr);
return new string(arr);
}
as you can see from the code no encoding has been performed and i am sending 2 special key charachters as "►" + txtNewResourceValue.Text.Trim() + "◄"to determine the start and end of the return translation from google.
Also i have checked hough my language utility tool I am getting "Cómo Estás?" when sending
How are you to google translation... :)
Best regards
[Shaz]
---------------------------Edited-------------------------
public string Translate(String input, String languagePair)
{
try
{
//Don't use UtF Encoding
// use default webclient encoding
//input [string to translate]
//Languagepair [eg|es]
var url = String.Format("http://www.google.com/translate_t?hl=en&text={0}&langpair={1}", "►" + input.Trim() + "◄", languagePair);
var webClient = new WebClient();
string result = webClient.DownloadString(url); //get all data from google translate
int start = result.IndexOf("id=result_box");
int end = result.IndexOf("id=spell-place-holder");
int length = end - start;
result = result.Substring(start, length);
result = reverseString(result);
start = result.IndexOf(";8669#&");//◄
end = result.IndexOf(";8569#&"); //►
length = end - start;
result = result.Substring(start + 7, length - 8);
//return transalted string
return reverseString(result);
}
catch (Exception ex)
{
return "Google translate exception occured no resource saved..." + ex.Message";
}
}

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Determine if a string contains a base64 string inside of it - c#

Related

Convert string with special characters to hex - C#

Base64 Encoded String to byte array C# Failing

Compare Windows-1252 string to UTF-8 string

The input is not a valid Base-64 string as it contains a non-base 64 character, "

How to get data off of a character

Categories

Resources