How to use UTF-16 Encoding in C# winform app

I am trying to use Google Translate to translate sentences from English to Urdu (Arabic does not work either), but I am stuck on a problem.
When I click the button, it returns symbols instead of the translated text.
I used the answer from this question:
getting the translated language using google translator in winform apps
But it does not work as I need. I think the problem is the text encoding.
UTF-16 is not supported. I tried Unicode as well, but that does not work either.
I have also installed the target language on my PC.
Any help is appreciated.
Thank you.
string input = "How are you";
string languagePair = "en|ur";
string url = String.Format("http://www.google.com/translate_t?hl=en&ie=UTF8&text={0}&langpair={1}", input, languagePair);
WebClient webClient = new WebClient();
webClient.Encoding = System.Text.Encoding.UTF8;
string result = webClient.DownloadString(url);
result = result.Substring(result.IndexOf("<span title=\"") + "<span title=\"".Length);
result = result.Substring(result.IndexOf(">") + 1);
result = result.Substring(0, result.IndexOf("</span>"));
result = HttpUtility.HtmlDecode(result.Trim());
MessageBox.Show(result);

Related

Non english characters in Kodi API

I am using the Kodi API to control my HTPC from ASP.NET,
especially the function named "Playlist.Add".
The JSON I send looks like this:
{"jsonrpc":"2.0","method":"Playlist.Insert","params":{"playlistid":0,"position":0,"item":{"file":"smb://server/Ferry Corsten/Beautiful/Ferry Corsten - Beautiful (Extended).mp3"}},"id":1}
This works fine. But when the string contains non-English characters, like this:
{"jsonrpc":"2.0","method":"Playlist.Insert","params":{"playlistid":0,"position":0,"item":{"file":"smb://server/01-Zum Geburtstag viel Glück.mp3"}},"id":1}
it just throws a "RequestCanceled" exception.
My C# source looks like this:
HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(_url);
string authInfo = Convert.ToBase64String(Encoding.Default.GetBytes(_username + ":" + _password));
webRequest.Headers["Authorization"] = "Basic " + authInfo;
webRequest.Method = "POST";
webRequest.UserAgent = "KodiControl";
webRequest.ContentType = "application/json";
webRequest.ContentLength = json.Length;
using (var streamWriter = new StreamWriter(webRequest.GetRequestStream()))
{
streamWriter.Write(json);
streamWriter.Flush();
streamWriter.Close();
}
The exception is thrown at streamWriter.Flush().
So what do I have to do to send this request?

I suggest you look into "Kodi addon unicode paths".
Following that guide will help you prevent common problems with non-Latin characters in Kodi.
"Python only works with unicode strings internally, converting to a particular encoding on output (or input)." To make string literals unicode by default, add
from __future__ import unicode_literals
Addon path
path = addon.getAddonInfo('path').decode('utf-8')
.decode('utf-8') decodes the UTF-8 byte string returned by the call into a unicode string.
Kodi's getAddonInfo returns a UTF-8 encoded string, and we decode it to unicode.
Browse dialog
dialog = xbmcgui.Dialog()
directory = dialog.browse(0, 'Title' , 'pictures').decode('utf-8')
dialog.browse() returns a UTF-8 encoded string that may contain non-Latin characters. Therefore, decode it to unicode.
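
Separately from the add-on guidance above, the C# snippet in the question sets ContentLength from json.Length, which counts UTF-16 characters rather than bytes. StreamWriter writes UTF-8 by default, so a character such as ü takes two bytes and the body no longer matches the declared length, which may well be what triggers the exception. A minimal sketch, assuming the body is sent as UTF-8; the class and method names (KodiRequestBody, ToUtf8Body, ExtraBytes) are hypothetical:

```csharp
using System.Text;

public static class KodiRequestBody
{
    // Encodes the JSON once so the same byte array can supply both the
    // Content-Length value and the bytes written to the request stream.
    public static byte[] ToUtf8Body(string json)
    {
        return Encoding.UTF8.GetBytes(json);
    }

    // Number of bytes the UTF-8 body needs beyond json.Length
    // (0 when the text is pure ASCII).
    public static int ExtraBytes(string json)
    {
        return Encoding.UTF8.GetByteCount(json) - json.Length;
    }
}
```

With the request from the question, the byte array returned by ToUtf8Body(json) would be used for webRequest.ContentLength (its Length) and written directly to webRequest.GetRequestStream() with Write(body, 0, body.Length), instead of going through a StreamWriter.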

DownloadString and Special Characters

I am trying to find the index of Mauricio in a string that I download from a website with WebClient.DownloadString. However, on the website the name contains a non-ASCII character: Maurício. So I found some code elsewhere
string ToASCII(string s)
{
return String.Join("",
s.Normalize(NormalizationForm.FormD)
.Where(c => char.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark));
}
that strips the accents from such characters. I have tested the code and it works. The problem is that when I download the string, the name arrives as MaurA-cio. I have tried both
wc.Encoding = System.Text.Encoding.UTF8;
wc.Headers.Add("Accept-Charset", "ISO-8859-1,utf-8;q=0.7,*;q=0.7");
Neither stops it from downloading as MaurA-cio.
(Also, I cannot change the search term, as I am getting it from a list.)
What else can I try?
Thanks

var client = new WebClient { Encoding = System.Text.Encoding.UTF8 };
var json = client.DownloadString(url);
This works as long as the server actually sends the page as UTF-8.

DownloadString doesn't look at HTTP response headers. It uses the previously set WebClient.Encoding property. If you have to use it, get the headers first:
// call twice
// (or to just do a HEAD, see http://stackoverflow.com/questions/3268926/head-with-webclient)
webClient.DownloadString("http://en.wikipedia.org/wiki/Maurício");
var contentType = webClient.ResponseHeaders["Content-Type"];
var charset = Regex.Match(contentType,"charset=([^;]+)").Groups[1].Value;
webClient.Encoding = Encoding.GetEncoding(charset);
var s = webClient.DownloadString("http://en.wikipedia.org/wiki/Maurício");
BTW--Unicode doesn't define "foreign" characters. From Maurício's perspective, "Mauricio" would be the foreign spelling of his name.
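
A variation on the approach above that avoids the second download is to fetch the raw bytes once and decode them with whatever charset the Content-Type header names. The sketch below covers only the decoding step; the class and method names (ResponseDecoder, ResolveEncoding, Decode) and the UTF-8 fallback are assumptions for illustration:

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class ResponseDecoder
{
    // Picks the encoding named in a Content-Type header value such as
    // "text/html; charset=ISO-8859-1". Returns the fallback when the header
    // is missing, has no charset, or names an encoding .NET does not know.
    public static Encoding ResolveEncoding(string contentType, Encoding fallback)
    {
        if (string.IsNullOrEmpty(contentType)) return fallback;
        Match m = Regex.Match(contentType, @"charset\s*=\s*""?([^;""\s]+)", RegexOptions.IgnoreCase);
        if (!m.Success) return fallback;
        try { return Encoding.GetEncoding(m.Groups[1].Value); }
        catch (ArgumentException) { return fallback; }
    }

    // Decodes the raw response bytes with the resolved encoding,
    // falling back to UTF-8 (an assumption of this sketch).
    public static string Decode(byte[] body, string contentType)
    {
        return ResolveEncoding(contentType, Encoding.UTF8).GetString(body);
    }
}
```

With WebClient, the bytes would come from webClient.DownloadData(url) and the header from webClient.ResponseHeaders["Content-Type"], both read after a single request.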

How to translate word using google translator API?

I am trying to get the converted text using google translator's api.
public JsonResult getCultureMeaning(string word, string langcode)
{
string url = String.Format("https://translate.google.com/#en/" + langcode+ "/" + word + "");
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
string m = "";
foreach (HtmlNode node in doc.DocumentNode.SelectSingleNode("//span[@id='result_box']").ChildNodes)
{
m += node.InnerHtml;
}
return Json(m, JsonRequestBehavior.AllowGet);
}
In the method above I pass the parameters; say word is Welcome and langcode is hi.
That gives the URL https://translate.google.com/#en/hi/welcome, and the result shown on the page is आपका स्वागत है
But when I select the result container and its child nodes with doc.DocumentNode.SelectSingleNode("//span[@id='result_box']").ChildNodes, the container is not found in the loaded document, so this approach does not work for me.
Edit-
result container from the url-
<span id="result_box" class="short_text" lang="hi"><span class="hps">आपका स्वागत है</span></span>
How should I approach this to get it working? For reference, I am using HtmlAgilityPack.

If you inspect the page's network requests, you will notice that the actual translation is requested via AJAX. A sample query for your translation is: https://translate.google.com/translate_a/single?client=t&sl=en&tl=hi&hl=en&dt=bd&dt=ex&dt=ld&dt=md&dt=qc&dt=rw&dt=rm&dt=ss&dt=t&dt=at&dt=sw&ie=UTF-8&oe=UTF-8&ssel=0&tsel=0&q=welcome
It returns JSON, which you can inspect to find what you are looking for (the data is fairly large, so I won't post it here).

HtmlAgilityPack only retrieves the document as served; it cannot see content that is filled in after an AJAX request completes. Thanks to @Uriil for shedding light on this issue.
However, I was able to manage it the traditional way using WebClient.
Here is what I did:
public JsonResult getCultureMeaning(string word, string langcode)
{
// Build the language pair from the target code, e.g. "en|hi".
string languagePair = "en|" + langcode;
// URL-encode the word so spaces and non-ASCII text survive the query string.
string url = String.Format("http://www.google.com/translate_t?hl=en&ie=UTF8&text={0}&langpair={1}", HttpUtility.UrlEncode(word), languagePair);
WebClient webClient = new WebClient();
webClient.Encoding = System.Text.Encoding.UTF8;
string result = webClient.DownloadString(url);
// Cut the translated text out of the first <span title="..."> element.
result = result.Substring(result.IndexOf("<span title=\"") + "<span title=\"".Length);
result = result.Substring(result.IndexOf(">") + 1);
result = result.Substring(0, result.IndexOf("</span>"));
result = HttpUtility.HtmlDecode(result.Trim());
return Json(result, JsonRequestBehavior.AllowGet);
}
It works for every language pair except en|en; in that case it returns the whole HTML document as the result.

Get Romaji from google translation website

I am trying to use the translation code below to get the romaji for a specific set of Japanese characters, but the romaji does not show up in the page I download; it is not even in the Google Translate page source. This is my code:
string languagePair = "jp|en";
string url = String.Format("http://www.google.com/translate_t?hl=en&ie=UTF8&text={0}&langpair={1}", "本", languagePair);
WebClient webClient = new WebClient();
webClient.Encoding = Encoding.UTF8;
string result = webClient.DownloadString(url);
Clipboard.SetText(result);
The character in my code is just an example; it is supposed to say Hon.

For Japanese you must use ja, the ISO 639-1 code, as described here:
Notes:
1. the language pairs are listed in this FAQ, while the language codes are included in this long list.
So, you must change your code to this:
string languagePair = "ja|en";
string url = String.Format("http://www.google.com/translate_t?hl=en&ie=UTF8&text={0}&langpair={1}", "本", languagePair);
WebClient webClient = new WebClient();
webClient.Encoding = Encoding.UTF8;
string result = webClient.DownloadString(url);
Clipboard.SetText(result);
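
One further detail the snippet above leaves open is that the Japanese text is placed in the URL unescaped. A minimal sketch of percent-encoding the query values with Uri.EscapeDataString follows; the endpoint and parameter names are copied from the snippet, the TranslateUrl class and Build method are hypothetical, and whether the endpoint still responds is not something this sketch verifies:

```csharp
using System;

public static class TranslateUrl
{
    // Builds the query URL with both values percent-encoded as UTF-8,
    // so non-ASCII text and the '|' separator are safe inside the query string.
    public static string Build(string text, string languagePair)
    {
        return string.Format(
            "http://www.google.com/translate_t?hl=en&ie=UTF8&text={0}&langpair={1}",
            Uri.EscapeDataString(text),
            Uri.EscapeDataString(languagePair));
    }
}
```

For example, TranslateUrl.Build("本", "ja|en") encodes the character as %E6%9C%AC and the pipe as %7C.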

Web scraper replacing some characters with question marks

I made a simple web scraper that scrapes lyrics and writes them to a database. Everything works, but for some reason it replaces some characters with question marks, and when I view the data on a simple PHP web page I see a lot of mistakes in the lyrics.
I?m = I'm
Let?s = Let's
haven?t = haven't
Stuff like that.
I know the error is in my C# code because I put a breakpoint before it writes to the database and displayed the text in a rich text box. How would I get it to display these characters correctly?
public static string getSourceCode(string url)
{
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
HttpWebResponse resp = (HttpWebResponse)req.GetResponse();
StreamReader sr = new StreamReader(resp.GetResponseStream());
string sourceCode = sr.ReadToEnd();
sr.Close();
resp.Close();
return sourceCode;
}
........
string url = txbURL2.Text;
string sourceCode = WorkerClass.getSourceCode(url);
int startIndex = sourceCode.IndexOf("<td valign=\"top\" width=\"100%\">");
sourceCode = sourceCode.Substring(startIndex, sourceCode.Length - startIndex);
........
//Gets Lyric
startIndex = sourceCode.IndexOf("<br><b>Lyrics:</b><br><br>") + 30;
endIndex = sourceCode.IndexOf(" <br><br>", startIndex);
string lyric = sourceCode.Substring(startIndex, endIndex - startIndex) + "";
rtbLyric.Text = lyric;
//End Lyric

The problem is probably character encoding. My guess is that the web page you're scraping is encoded in UTF-8, but somewhere along the line you're converting to ASCII.
Check out the excellent article called "What every developer should know about character encoding" for more details.
Update
You could try this, although the StreamReader should default to UTF-8 anyway:
var encoding = System.Text.Encoding.GetEncoding("utf-8");
StreamReader sr = new StreamReader(resp.GetResponseStream(), encoding);

Check the encoding by searching for charset in the html code.
Your code snippet omits the actual load process, so it is impossible to tell where it goes wrong.
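
Searching for the charset can also be done in code. The sketch below looks for the first charset declaration in the markup, such as a meta tag; the CharsetSniffer class and FindDeclaredCharset method are hypothetical names, and the pattern is a simplification that does not attempt full HTML parsing:

```csharp
using System.Text.RegularExpressions;

public static class CharsetSniffer
{
    // Returns the first charset name declared in the markup, e.g. from
    // <meta charset="utf-8"> or a Content-Type meta tag; null if none is found.
    public static string FindDeclaredCharset(string html)
    {
        Match m = Regex.Match(html ?? string.Empty,
            @"charset\s*=\s*[""']?([A-Za-z0-9_\-]+)", RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

The returned name could then be passed to Encoding.GetEncoding when constructing the StreamReader, assuming the name is one the runtime recognises.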

You can also try using the WebClient:
WebClient client = new WebClient { Encoding = Encoding.UTF8 };
string html = client.DownloadString(url);
