How to get data off of a character - c#

I am working on a project in Unity which uses Assembly C#. I am trying to retrieve special characters such as é, but in the console they are displayed as a blank character: "". For instance, translating "How are you?" should return "Cómo Estás?", but it returns "Cmo Ests". I put the returned string "Cmo Ests" in a character array and found that the missing character is a non-null blank character. I am using Encoding.UTF8, and when I do:
char ch = '\u00e9';
print (ch);
It will print "é". I have tried getting the bytes off of a given string using:
byte[] utf8bytes = System.Text.Encoding.UTF8.GetBytes(temp);
While translating "How are you?", it will return a byte string, but for the special characters such as é, I get the series of bytes 239, 191, 189, which is a replacement character.
What type of information do I need to retrieve from the characters in order to accurately determine what character each one is? Do I need to do something with the information that Google gives me, or is it something else? I need a general solution that I can place in my program and that will work for any input string. If anyone can help, it would be greatly appreciated.
Here is the code that is referenced:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using UnityEngine;
using System.Collections;
using System.Net;
using HtmlAgilityPack;
public class Dictionary{
string[] formatParams;
HtmlDocument doc;
string returnString;
char[] letters;
public char[] charString;
public Dictionary(){
formatParams = new string[2];
doc = new HtmlDocument();
returnString = "";
}
public string Translate(String input, String languagePair, Encoding encoding)
{
formatParams[0]= input;
formatParams[1]= languagePair;
string url = String.Format("http://www.google.com/translate_t?hl=en&ie=UTF8&text={0}&langpair={1}", formatParams);
string result = String.Empty;
using (WebClient webClient = new WebClient())
{
webClient.Encoding = encoding;
result = webClient.DownloadString(url);
}
doc.LoadHtml(result);
input = alter (input);
string temp = doc.DocumentNode.SelectSingleNode("//span[@title='"+input+"']").InnerText;
charString = temp.ToCharArray();
return temp;
}
// Use this for initialization
void Start () {
}
string alter(string inputString){
returnString = "";
letters = inputString.ToCharArray();
for(int i=0; i<inputString.Length;i++){
if(letters[i]=='\''){
returnString = returnString + "&#39;"; // HTML entity for an apostrophe
}else{
returnString = returnString + letters[i];
}
}
return returnString;
}
}

Maybe you should use another API/URL. This function below uses a different url that returns JSON data and seems to work better:
public static string Translate(string input, string fromLanguage, string toLanguage)
{
using (WebClient webClient = new WebClient())
{
string url = string.Format("http://translate.google.com/translate_a/t?client=j&text={0}&sl={1}&tl={2}", Uri.EscapeUriString(input), fromLanguage, toLanguage);
string result = webClient.DownloadString(url);
// I used JavaScriptSerializer but another JSON parser would work
JavaScriptSerializer serializer = new JavaScriptSerializer();
Dictionary<string, object> dic = (Dictionary<string, object>)serializer.DeserializeObject(result);
Dictionary<string, object> sentences = (Dictionary<string, object>)((object[])dic["sentences"])[0];
return (string)sentences["trans"];
}
}
If I run this in a Console App:
Console.WriteLine(Translate("How are you?", "en", "es"));
It will display
¿Cómo estás?

I don't know much about the GoogleTranslate API, but my first thought is that you've got a Unicode Normalization problem.
Have a look at System.String.Normalize() and its friends.
Unicode is very complicated, so I'll oversimplify! Many symbols can be represented in different ways in Unicode, that is: 'é' could be represented as 'é' (one character), or as an 'e' + 'accent character' (two characters), or, depending on what comes back from the API, something else altogether.
The Normalize function will convert your string to one with the same Textual meaning, but potentially a different binary value which may fix your output problem.
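As a rough illustration of what Normalize does, here is a minimal sketch. The class and method names are invented for this example. It only helps when the accent really arrived as a separate combining character; it cannot restore a character that was already replaced or dropped before the string reached your code.

```csharp
using System;
using System.Text;

public static class NormalizationSketch
{
    // Converts a string to Unicode Normalization Form C (composed form),
    // so that 'e' followed by U+0301 (combining acute accent) becomes
    // the single character U+00E9.
    public static string Compose(string s)
    {
        return s.Normalize(NormalizationForm.FormC);
    }
}
```

For example, NormalizationSketch.Compose("e\u0301") returns a one-character string equal to "\u00e9".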

You actually pretty much have it. Just insert the coded letter with a \u and it works.
string mystr = "C\u00f3mo Est\u00e1s?";

There are several issues with your approach. First of all, UTF-8 is a multibyte encoding. This means that any non-ASCII character (char code > 127) is encoded as a sequence of two or more bytes rather than a single byte. Your sequence 239, 191, 189 (EF BF BD) is one such sequence: it is the UTF-8 encoding of U+FFFD, the Unicode replacement character, which means the original character had already been lost before you called GetBytes. If you use UTF-16, then every character in the Basic Multilingual Plane is encoded as a single 2-byte unit, which maps it to an unsigned short (0-65535); characters outside that range take two such units (a surrogate pair).
The char type in C# is a two-byte type holding one UTF-16 code unit, so it is effectively an unsigned short. This contrasts with other languages, such as C/C++, where the char type is a 1-byte type.
So in your case, unless you really need to be using byte[] arrays, you should use char[] arrays. Or, if you want to encode the characters so that they can be used in HTML, you can iterate through the characters, check whether the character code is > 127, and if so replace the character with its numeric character reference of the form &#xHEX;.
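To make this concrete, here is a minimal sketch (the class and method names are made up for illustration). It shows the UTF-8 bytes of a string, decodes the byte sequence from the question, and writes non-ASCII characters as hexadecimal HTML references. It works one char at a time, so a character outside the Basic Multilingual Plane would come out as two references; treat it as a starting point rather than a complete encoder.

```csharp
using System;
using System.Text;

public static class Utf8Sketch
{
    // Returns the UTF-8 bytes of s as hyphen-separated hex, e.g. "C3-A9" for "é".
    public static string Utf8Hex(string s)
    {
        return BitConverter.ToString(Encoding.UTF8.GetBytes(s));
    }

    // Decodes raw UTF-8 bytes back into a .NET string.
    public static string FromUtf8(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    // Replaces every char above 127 with a hexadecimal numeric character reference.
    public static string ToHtmlEntities(string s)
    {
        var sb = new StringBuilder();
        foreach (char c in s)
        {
            if (c > 127)
                sb.Append("&#x").Append(((int)c).ToString("X")).Append(';');
            else
                sb.Append(c);
        }
        return sb.ToString();
    }
}
```

Utf8Sketch.Utf8Hex("\u00e9") gives "C3-A9", whereas Utf8Sketch.FromUtf8(new byte[] { 239, 191, 189 }) gives "\uFFFD", the replacement character. Utf8Sketch.ToHtmlEntities("C\u00f3mo Est\u00e1s?") gives "C&#xF3;mo Est&#xE1;s?".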

I had the same problem while working on one of my projects [Language Resource Localization Translation].
I was doing the same thing, using System.Text.Encoding.UTF8.GetBytes(), and because of the UTF-8 encoding I was receiving special characters like yours,
e.g. 239, 191, 189, in the result string.
Please take a look at my solution; I hope it helps.
Don't use an encoding at all. Google Translate will return characters like á as themselves in the string. Do some string manipulation and read the string as it is.
Generic Solution [works for every language translation which google support]
try
{
//Don't use UtF Encoding
// use default webclient encoding
var url = String.Format("http://www.google.com/translate_t?hl=en&text={0}&langpair={1}", "►" + txtNewResourceValue.Text.Trim() + "◄", "en|" + item.Text.Substring(0, 2));
var webClient = new WebClient();
string result = webClient.DownloadString(url); //get all data from google translate in UTF8 coding..
int start = result.IndexOf("id=result_box");
int end = result.IndexOf("id=spell-place-holder");
int length = end - start;
result = result.Substring(start, length);
result = reverseString(result);
start = result.IndexOf(";8669#&");//◄
end = result.IndexOf(";8569#&"); //►
length = end - start;
result = result.Substring(start +7 , length - 8);
objDic2.Text = reverseString(result);
//hard code substring; finding the correct translation within the string.
dictList.Add(objDic2);
}
catch (Exception ex)
{
lblMessages.InnerHtml = "<strong>Google translate exception occured no resource saved..." + ex.Message + "</strong>";
error = true;
}
public static string reverseString(string s)
{
char[] arr = s.ToCharArray();
Array.Reverse(arr);
return new string(arr);
}
As you can see from the code, no encoding has been performed, and I am sending two special marker characters, "►" + txtNewResourceValue.Text.Trim() + "◄", to determine the start and end of the translation returned by Google.
I have also checked with my language utility tool: I get "Cómo Estás?" when sending
"How are you?" to Google Translate. :)
Best regards
[Shaz]
---------------------------Edited-------------------------
public string Translate(String input, String languagePair)
{
try
{
//Don't use UtF Encoding
// use default webclient encoding
//input [string to translate]
//Languagepair [eg|es]
var url = String.Format("http://www.google.com/translate_t?hl=en&text={0}&langpair={1}", "►" + input.Trim() + "◄", languagePair);
var webClient = new WebClient();
string result = webClient.DownloadString(url); //get all data from google translate
int start = result.IndexOf("id=result_box");
int end = result.IndexOf("id=spell-place-holder");
int length = end - start;
result = result.Substring(start, length);
result = reverseString(result);
start = result.IndexOf(";8669#&");//◄
end = result.IndexOf(";8569#&"); //►
length = end - start;
result = result.Substring(start + 7, length - 8);
//return transalted string
return reverseString(result);
}
catch (Exception ex)
{
return "Google translate exception occurred, no resource saved..." + ex.Message;
}
}

Related

Compare Windows-1252 string to UTF-8 string

My goal is to convert a .NET string (Unicode) into Windows-1252 and, if necessary, store the original UTF-8 string in a Base64 entity.
For example, the string "DJ Doena" converted to 1252 is still "DJ Doena".
However, if you convert the Japanese kanji for tree (木) into 1252, you end up with a question mark.
These are my test strings:
String doena = "DJ Doena";
String umlaut = "äöüßéèâ";
String allIn = "< ä ß á â & 木 >";
This is how I convert the string in the first place:
using (MemoryStream ms = new MemoryStream())
{
using (StreamWriter sw = new StreamWriter(ms, Encoding.UTF8))
{
sw.Write(decoded);
sw.Flush();
ms.Seek(0, SeekOrigin.Begin);
using (StreamReader sr = new StreamReader(ms, Encoding.GetEncoding(1252)))
{
encoded = sr.ReadToEnd();
}
}
}
Problem is, while debugging string comparison claims that both are indeed identical, so a simple == or .Equals() doesn't suffice.
This is how I try to find out if I need base64 and produce it:
private static String GetBase64Alternate(String utf8Text, String windows1252Text)
{
Byte[] utf8Bytes;
Byte[] windows1252Bytes;
String base64;
utf8Bytes = Encoding.UTF8.GetBytes(utf8Text);
windows1252Bytes = Encoding.GetEncoding(1252).GetBytes(windows1252Text);
base64 = null;
if (utf8Bytes.Length != windows1252Bytes.Length)
{
base64 = Convert.ToBase64String(utf8Bytes);
}
else
{
for(Int32 i = 0; i < utf8Bytes.Length; i++)
{
if(utf8Bytes[i] != windows1252Bytes[i])
{
base64 = Convert.ToBase64String(utf8Bytes);
break;
}
}
}
return (base64);
}
The first string doena is completely identical and doesn't produce a base64 result
Console.WriteLine(String.Format("{0} / {1}", windows1252Text, base64Text));
results in
DJ Doena /
But the second string, umlaut, already has twice as many bytes in UTF-8 as in 1252 and thus produces a Base64 string even though it does not appear to be necessary:
äöüßéèâ / w6TDtsO8w5/DqcOow6I=
And the third one does what it's supposed to do (no more "木" but a "?", thus base64 needed):
< ä ß á â & ? > / PCDDpCDDnyDDoSDDoiAmIOacqCA+
Any clues how my Base64 getter could be enhanced a) for performance b) for better results?
Thank you in advance. :-)
I'm not sure I completely understood the question. But I tried. :) If I do understand correctly, this code does what you want:
static void Main(string[] args)
{
string[] testStrings = { "DJ Doena", "äöüßéèâ", "< ä ß á â & 木 >" };
foreach (string text in testStrings)
{
Console.WriteLine(ReencodeText(text));
}
}
private static string ReencodeText(string text)
{
Encoding encoding = Encoding.GetEncoding(1252);
string text1252 = encoding.GetString(encoding.GetBytes(text));
return text.Equals(text1252, StringComparison.Ordinal) ?
text : Convert.ToBase64String(Encoding.UTF8.GetBytes(text));
}
I.e. it encodes the text to Windows-1252, then decodes back to a string object, which it then compares with the original. If the comparison succeeds, it returns the original string, otherwise it encodes it to UTF8, and then to base64.
It produces the following output:
DJ Doena
äöüßéèâ
PCDDpCDDnyDDoSDDoiAmIOacqCA+
In other words, the first two strings are left intact, while the third is encoded as base64.
In your first code you are encoding the string using one encoding, then decoding it using a different encoding. That doesn't give you any reliable result at all; it's the equivalent of writing out a number in octal, then reading it as if it was in decimal. It seems to work just fine for numbers up to 7, but after that you get useless results.
The problem with the GetBase64Alternate method is that it's encoding a string to two different encodings, and assumes that the first encoding doesn't support some of the characters if the second encoding resulted in a different set of bytes.
Comparing the byte sequences doesn't tell you whether any of the encodings failed. The sequences will be different if it failed, but it will also be different if there are any characters that are encoded differently between the encodings.
What you want to do is to determine if the encoding actually worked for all characters. You can do that by creating an Encoding instance with a fallback for unsupported characters. There is an EncoderExceptionFallback class that you can use for that, which throws an EncoderFallbackException if it's called.
This code will try to use the Windows-1252 encoding on a string, and sets the ok variable to false if the encoding doesn't support all characters in the string:
Encoding e = Encoding.GetEncoding(1252, new EncoderExceptionFallback(), new DecoderExceptionFallback());
bool ok = true;
try {
e.GetByteCount(allIn);
} catch (EncoderFallbackException) {
ok = false;
}
As you are not actually going to use the encoded result for anything, you can use the GetByteCount method. It will check how all characters would be encoded without producing the encoded result.
Used in your method it would be:
private static String GetBase64Alternate(string text) {
Encoding e = Encoding.GetEncoding(1252, new EncoderExceptionFallback(), new DecoderExceptionFallback());
bool ok = true;
try {
e.GetByteCount(text);
} catch (EncoderFallbackException) {
ok = false;
}
return ok ? null : Convert.ToBase64String(Encoding.UTF8.GetBytes(text));
}

C# ByteString to ASCII String

I am looking for a smart way to convert a string of hex-byte-values into a string of 'real text' (ASCII Characters).
For example I have the word "Hello" written in Hexadecimal ASCII: 48 45 4C 4C 4F. And using some method I want to receive the ASCII text of it (in this case "Hello").
// I have this string (example: "Hello") and want to convert it to "Hello".
string strHexa = "48454C4C4F";
// I want to convert the strHexa to an ASCII string.
string strResult = ConvertToASCII(strHexa);
I am sure there is a framework method. If this is not the case of course I could implement my own method.
Thanks!
var str = Encoding.UTF8.GetString(SoapHexBinary.Parse("48454C4C4F").Value); //HELLO
PS: SoapHexBinary is in System.Runtime.Remoting.Metadata.W3cXsd2001 namespace
I am sure there is a framework method.
A single framework method: No.
However the second part of this: converting a byte array containing ASCII encoded text into a .NET string (which is UTF-16 encoded Unicode) does exist: System.Text.ASCIIEncoding and specifically the method GetString:
string result = Encoding.ASCII.GetString(byteArray);
The first part is easy enough to do yourself: take two hex digits at a time, parse them as hex and cast to a byte to store in the array. Something like:
byte[] HexStringToByteArray(string input) {
Debug.Assert(input.Length % 2 == 0, "Must have two digits per byte");
var res = new byte[input.Length/2];
for (var i = 0; i < input.Length/2; i++) {
var h = input.Substring(i*2, 2);
res[i] = Convert.ToByte(h, 16);
}
return res;
}
Edit: Note: L.B.'s answer identifies a method in .NET that will do the first part more easily: this is a better approach than writing it yourself (while in a, perhaps, obscure namespace, it is implemented in mscorlib rather than needing an additional reference).
StringBuilder sb = new StringBuilder();
for (int i = 0; i < hexStr.Length; i += 2)
{
string hs = hexStr.Substring(i, 2);
sb.Append((char)Convert.ToByte(hs, 16)); // cast to char so the character, not its decimal value, is appended
}
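Putting the two parts from the answers above together, a minimal sketch could look like the following. The names HexSketch and HexToAscii are hypothetical, and the input is assumed to contain only hex digit pairs with no separators.

```csharp
using System;
using System.Text;

public static class HexSketch
{
    // Parses pairs of hex digits into bytes, then decodes the bytes as ASCII.
    public static string HexToAscii(string hex)
    {
        if (hex.Length % 2 != 0)
            throw new ArgumentException("Must have two hex digits per byte", nameof(hex));

        var bytes = new byte[hex.Length / 2];
        for (int i = 0; i < bytes.Length; i++)
        {
            // Convert.ToByte(string, 16) parses the two-digit substring as hexadecimal.
            bytes[i] = Convert.ToByte(hex.Substring(i * 2, 2), 16);
        }
        return Encoding.ASCII.GetString(bytes);
    }
}
```

HexSketch.HexToAscii("48454C4C4F") returns "HELLO" (the example bytes are the uppercase letters).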

How to convert a string containing escape characters to a string

I have a string that is returned to me which contains escape characters.
Here is a sample string
"test\40gmail.com"
As you can see it contains escape characters. I need it to be converted to its real value which is
"test@gmail.com"
How can I do this?
If you are looking to replace all escaped character codes, not only the code for @, you can use this snippet of code to do the conversion:
public static string UnescapeCodes(string src) {
var rx = new Regex("\\\\([0-9A-Fa-f]+)");
var res = new StringBuilder();
var pos = 0;
foreach (Match m in rx.Matches(src)) {
res.Append(src.Substring(pos, m.Index - pos));
pos = m.Index + m.Length;
res.Append((char)Convert.ToInt32(m.Groups[1].ToString(), 16));
}
res.Append(src.Substring(pos));
return res.ToString();
}
The code relies on a regular expression to find all sequences of hex digits, converting them to int, and casting the resultant value to a char.
string test = @"test\40gmail.com";
test = test.Replace(@"\40", "@");
If you want a more general approach ...
HTML Decode
The sample string provided ("test\40gmail.com") is JID escaped. It is not malformed, and HttpUtility/WebUtility will not correctly handle this escaping scheme.
You can certainly do it with string or regex functions, as suggested in the answers from dasblinkenlight and C.Barlow. This is probably the cleanest way to achieve the desired result. I'm not aware of any .NET libraries for decoding JID escaping, and a brief search hasn't turned up much. Here is a link to some source which may be useful, though.
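As a sketch of the regex route for this particular scheme, the following assumes that every escape is a backslash followed by exactly two hex digits, which is how the sample string looks. The class and method names are invented here, and the code decodes any such pair rather than checking it against the list of sequences the JID escaping specification defines.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class JidSketch
{
    // Replaces each backslash + two hex digits with the character it encodes.
    // Regex.Replace makes a single left-to-right pass, so decoded output is
    // never decoded a second time.
    public static string Unescape(string src)
    {
        return Regex.Replace(
            src,
            @"\\([0-9A-Fa-f]{2})",
            m => ((char)int.Parse(m.Groups[1].Value,
                                  NumberStyles.HexNumber,
                                  CultureInfo.InvariantCulture)).ToString());
    }
}
```

JidSketch.Unescape(@"test\40gmail.com") returns "test@gmail.com". Limiting the match to two digits keeps a following letter such as the "a" in "\40abc" from being swallowed as part of the code.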
I just wrote this piece of code and it seems to work beautifully... It requires that the escape sequence is in HEX, and is valid for value's 0x00 to 0xFF.
// Example
str = remEscChars(@"Test\x0D") // str = "Test\r"
Here is the code.
private string remEscChars(string str)
{
int pos = 0;
string subStr = null;
string escStr = null;
try
{
while ((pos = str.IndexOf(@"\x")) >= 0)
{
subStr = str.Substring(pos + 2, 2);
escStr = Convert.ToString(Convert.ToChar(Convert.ToInt32(subStr, 16)));
str = str.Replace(@"\x" + subStr, escStr);
}
}
catch (Exception ex)
{
throw ex;
}
return str;
}
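.NET provides the static methods Regex.Unescape and Regex.Escape to perform this task and back again. Regex.Unescape may do what you need, but be aware that it follows regular expression escape rules (for example, \x40 is a hex code, while a bare \40 is read as an octal code), so verify the result against your input.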
https://learn.microsoft.com/en-us/dotnet/api/system.text.regularexpressions.regex.unescape

Regex Match string with non-alphanumeric chars from a binary file

I am trying to extract some information from a binary file. It looks like this:
AUTHCODE(here goes 3 bytes, that I don't need)part_that_i_need(here goes a NULL byte).
How do I match the portion of alpha-numeric characters qszjlbnkmctkkezgd_qyzkyptqigudilzpkp_qgetefvmigwimrihudk that is between bytes {11} {00} {38} and {00}?
Here's what I've done so far:
string ReadFileMF;
using (StreamReader reader = new StreamReader(pathCopy))
{
ReadFileMF = reader.ReadToEnd();
}
///match the whole string
Match passMF = Regex.Match(ReadFileMF, @"(AUTHCODE).+?(www)");
String passMFs = passMF.Value;
//convert to array of bytes
byte[] bpass = StrToByteArray(passMFs);
//replace the 3 bytes after AUTHCODE with spaces
bpass[8] = 0x20;
bpass[9] = 0x20;
bpass[10] = 0x20;
OK, so now I just have to match the null byte at the end. Something like (AUTHCODE).+?(NULL_BYTE). Any ideas?
This might be easiest with a few simple for-loops or Copy() actions over the byte data. Too few specs to be exact. Like:
// untested, I could be off-by-1 somewhere
int start = "AUTHCODE".Length + 3;
int end = text.IndexOf('\0', start);
string result = text.Substring(start, end - start);
If you do want/need Regex, you'll have to turn it into a string first. Your only safe bet seems to be ASCII encoding.
string text = Encoding.ASCII.GetString(data);
and then (untested)
Regex.Match(text, @"AUTHCODE.{3}([^\x00]+)\x00");
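If you would rather stay on the raw bytes and skip the string conversion altogether, a sketch along these lines might do. The method name and parameters are made up for illustration, and it assumes the layout described in the question: an ASCII marker, a fixed number of bytes to skip, then the payload terminated by a null byte.

```csharp
using System;
using System.Text;

public static class BinaryExtractSketch
{
    // Finds the ASCII marker in data, skips `skip` bytes after it, and returns
    // the bytes up to the next 0x00 as an ASCII string. Returns null if the
    // marker or the terminating null byte is not found.
    public static string ExtractAfterMarker(byte[] data, string marker, int skip)
    {
        byte[] m = Encoding.ASCII.GetBytes(marker);
        for (int i = 0; i + m.Length + skip <= data.Length; i++)
        {
            bool found = true;
            for (int j = 0; j < m.Length; j++)
            {
                if (data[i + j] != m[j]) { found = false; break; }
            }
            if (!found) continue;

            int start = i + m.Length + skip;
            int end = Array.IndexOf(data, (byte)0, start);
            if (end < 0) return null;
            return Encoding.ASCII.GetString(data, start, end - start);
        }
        return null;
    }
}
```

With the file contents loaded through File.ReadAllBytes, the call would be BinaryExtractSketch.ExtractAfterMarker(data, "AUTHCODE", 3).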

Determine if a string contains a base64 string inside of it

I'm trying to figure out a way to parse out a base64 string from within a larger string.
I have the string "Hello <base64 content> World" and I want to be able to parse out the base64 content and convert it back to a string. "Hello Awesome World"
Answers in C# preferred.
Edit: Updated with a more real example.
--abcdef
\n
Content-Type: Text/Plain;
Content-Transfer-Encoding: base64
\n
<base64 content>
\n
--abcdef--
This is taken from 1 sample. The problem is that the Content.... vary quite a bit from one record to the next.
There is no reliable way to do it. How would you know that, for instance, "Hello" is not a base64 string? OK, it's a bad example because base64 is supposed to be padded so that the length is a multiple of 4, but what about "overflow"? It's 8 characters long and it is a valid base64 string (it would decode to "¢÷«~Z0"), even though it's obviously a normal word to a human reader. There's just no way you can tell for sure whether a word is a normal word or base64 encoded text.
The fact that you have base64 encoded text embedded in normal text is clearly a design mistake. I suggest you do something about it rather than trying to do something impossible...
In short form you could:
split the string on any chars that are not valid base64 data or padding
try to convert each token
if the conversion succeeds, call replace on the original string to switch the token with the converted value
In code:
var delimiters = new char[] { /* non-base64 ASCII chars */ };
var possibles = value.Split(delimiters, StringSplitOptions.RemoveEmptyEntries);
//need to tweak to include padding chars in matches, but still split on padding?
//maybe better off creating a regex to match base64 + padding
//and using Regex.Split?
foreach(var match in possibles)
{
try
{
var converted = Convert.FromBase64String(match);
var text = System.Text.Encoding.UTF8.GetString(converted);
if(!string.IsNullOrEmpty(text))
{
value = value.Replace(match, text);
}
}
catch (System.ArgumentNullException)
{
//handle it
}
catch (System.FormatException)
{
//handle it
}
}
Without a delimiter though, you can end up converting non-base64 text that also happens to be valid as base64 encoded text.
Looking at your example of trying to convert "Hello QXdlc29tZQ== World" to "Hello Awesome World" the above algorithm could easily generate something like "ée¡Ý•Í½µ”¢¹]" by trying to convert the whole string from base64 since there is no delimiter between plain and encoded text.
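The regex idea mentioned in the code comments above could be sketched roughly as follows. The class name and the pattern are my own assumptions: the pattern only accepts tokens whose length is a multiple of 4, with optional padding, that are not directly adjacent to other base64 characters. It still cannot distinguish an ordinary 4- or 8-letter word from encoded data, so the ambiguity described above remains.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class EmbeddedBase64Sketch
{
    // Matches runs of base64 characters in groups of four, optionally ending
    // in a padded group, and not touching other base64 characters.
    private static readonly Regex Candidate = new Regex(
        @"(?<![A-Za-z0-9+/=])(?:[A-Za-z0-9+/]{4})+(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?(?![A-Za-z0-9+/=])");

    // Replaces every candidate token with its decoded UTF-8 text.
    // Tokens that fail to decode are left as they are.
    public static string DecodeEmbedded(string value)
    {
        return Candidate.Replace(value, m =>
        {
            try
            {
                return Encoding.UTF8.GetString(Convert.FromBase64String(m.Value));
            }
            catch (FormatException)
            {
                return m.Value;
            }
        });
    }
}
```

EmbeddedBase64Sketch.DecodeEmbedded("Hello QXdlc29tZQ== World") returns "Hello Awesome World", because "Hello" and "World" are five characters long and therefore never match. A word such as "overflow" would be decoded into garbage, so this is only usable when the surrounding text is known not to contain such words.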
Update (based on comments):
If there are no '\n's in the base64 content and it is always preceded by "Content-Transfer-Encoding: base64\n", then there is a way:
split the string on '\n'
iterate over all the tokens until a token ends in "Content-Transfer-Encoding: base64"
the next token (if there are any) should be decoded (if possible) and then the replacement should be made in the original string
return to iterating until out of tokens
In code:
private string ConvertMixedUpTextAndBase64(string value)
{
var delimiters = new char[] { '\n' };
var possibles = value.Split(delimiters,
StringSplitOptions.RemoveEmptyEntries);
for (int i = 0; i < possibles.Length - 1; i++)
{
if (possibles[i].EndsWith("Content-Transfer-Encoding: base64"))
{
var nextTokenPlain = DecodeBase64(possibles[i + 1]);
if (!string.IsNullOrEmpty(nextTokenPlain))
{
value = value.Replace(possibles[i + 1], nextTokenPlain);
i++;
}
}
}
return value;
}
private string DecodeBase64(string text)
{
string result = null;
try
{
var converted = Convert.FromBase64String(text);
result = System.Text.Encoding.UTF8.GetString(converted);
}
catch (System.ArgumentNullException)
{
//handle it
}
catch (System.FormatException)
{
//handle it
}
return result;
}
