Get Romaji from google translation website - c#

i am trying to use the translation code bellow to get the romaji words for a specific set of japanese characters, but i can't get the romaji character to even show up the url i download, it's not even in the Google Translate page source code, this is my code:
string languagePair = "jp|en";
string url = String.Format("http://www.google.com/translate_t?hl=en&ie=UTF8&text={0}&langpair={1}", "本", languagePair);
WebClient webClient = new WebClient();
webClient.Encoding = Encoding.UTF8;
string result = webClient.DownloadString(url);
Clipboard.SetText(result);
the character in my code is just an example, it's supposed to say Hon.

For japanese language you must use ja ISO 639-1 Code as described here:
Notes:
1. the language pairs are listed in this FAQ, while the language codes are included in this long list.
So, you must change your code to this:
string languagePair = "ja|en";
string url = String.Format("http://www.google.com/translate_t?hl=en&ie=UTF8&text={0}&langpair={1}", "本", languagePair);
WebClient webClient = new WebClient();
webClient.Encoding = Encoding.UTF8;
string result = webClient.DownloadString(url);
Clipboard.SetText(result);
Result page:

Related

How to use UTF-16 Encoding in C# winform app

I am trying to use the Google translator for some sentence translation(English to Urdu also not working with Arabic) but I am stuck with a problem.
when i click button it returns some symbols.
i used this answer link
getting the translated language using google translator in winform apps
But it's not working as i require. I think the problem is with UTF Encoding.
UTF16 is not supported. I used Unicode but it's also not working.
I also have installed the desired language in my PC.
Any help is appreciated.
Thank you.
string input = "How are you";
string languagePair = "en|ur";
string url = String.Format("http://www.google.com/translate_t?hl=en&ie=UTF8&text={0}&langpair={1}", input, languagePair);
WebClient webClient = new WebClient();
webClient.Encoding = System.Text.Encoding.UTF8;
string result = webClient.DownloadString(url);
result = result.Substring(result.IndexOf("<span title=\"") + "<span title=\"".Length);
result = result.Substring(result.IndexOf(">") + 1);
result = result.Substring(0, result.IndexOf("</span>"));
result = HttpUtility.HtmlDecode(result.Trim());
MessageBox.Show(result);

Kanji characters from WebClient html different from actual Kanji in website

So, I'm trying to get a portion of text from a website called Kanji-A-Day.com, but I have a problem.
You see, I'm trying to get the daily kanji from the website, and I was able to narrow the HTML down to what I want, but it seems the characters are different..?
What it looks like
What it should look like
What's even more strange is that I produced the results for the second image by copying and pasting directly from the site, so it's not a font problem.
Here's the code I use for getting the character:
public void UpdateDailyKanji() // Called at the initialization of a new main form
{
string kanji;
using (WebClient client = new WebClient()) // Grab the string
kanji = client.DownloadString("http://www.kanji-a-day.com/level4/index.php");
// Trim the HTML to just the Kanji
kanji = kanji.Remove(0, kanji.IndexOf(#"<div class=""glyph"">") + 19);
kanji = kanji.Remove(kanji.IndexOf("</div>")-2);
kanji = kanji.Trim();
Text_DailyKanji.Text = kanji; // Set the Kanji
}
Does anyone know what's going on here? I'm guessing it's some Unicode thing but I don't know much about it.
Thanks in advance.
The page you're trying to download as a string is encoded using charset=EUC-JP, also known as Japanese (EUC) (CodePage 51932). This is clearly set in the page headers.
Why is the string returned by WebClient.DownloadString encoded using the wrong encoder?
The MSDN Docs state this:
This method retrieves the specified resource. After it downloads the
resource, the method uses the encoding specified in the Encoding
property to convert the resource to a String.
Thus, you have to know beforehand what encoding will be used and specify it, setting the WebClient.Encoding property.
To verify this, check the .NET Reference Source for the WebClient.DownloadString method:
try {
WebRequest request;
byte [] data = DownloadDataInternal(address, out request);
string stringData = GetStringUsingEncoding(request, data);
if(Logging.On)Logging.Exit(Logging.Web, this, "DownloadString", stringData);
return stringData;
} finally {
CompleteWebClientState();
}
The encoding is set using the Request settings, not the Response ones.
The result is, the downloaded string is encoded using the default CodePage.
What you can do now is:
Download the page twice, the first time to check whether the WebClient encoding and the Html page encoding don't match.
Re-encode the string with the correct encoding, set in the underlying WebResponse.
Don't use WebClient, use HttpClient or WebRequest directly. Or, if you like this tool, create a custom WebClient class to handle the WebRequest/WebResponse in a more direct way.
This is a method to perform the re-encoding task:
The string returned by WebClient is converted to a Byte Array and passed to a MemoryStream, then re-encoded using a StreamReader with the Encoding retrieved from the Content-Type: charset Response Header.
EDIT:
Now using Reflection to get the page Encoding from the underlying HttpWebResponse. This should avoid errors in parsing the original CharacterSet as defined by the remote response.
using System.IO;
using System.Net;
using System.Reflection;
using System.Text;
public string WebClient_DownLoadString(Uri uri)
{
using (var client = new WebClient())
{
// If Windows 7 - Windows Server 2008 R2
ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12;
client.CachePolicy = new System.Net.Cache.RequestCachePolicy(System.Net.Cache.RequestCacheLevel.BypassCache);
client.Headers.Add(HttpRequestHeader.Accept, "ext/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
client.Headers.Add(HttpRequestHeader.AcceptLanguage, "en-US,en;q=0.8");
client.Headers.Add(HttpRequestHeader.KeepAlive, "keep-alive");
string result = client.DownloadString(uri);
var flags = BindingFlags.Instance | BindingFlags.NonPublic;
using (var response = (HttpWebResponse)client.GetType().GetField("m_WebResponse", flags).GetValue(client))
{
var pageEncoding = Encoding.GetEncoding(wc_response.CharacterSet);
byte[] bytes = client.Encoding.GetBytes(result);
using (var ms = new MemoryStream(bytes, 0, bytes.Length))
using (var reader = new StreamReader(ms, pageEncoding))
{
ms.Position = 0;
return reader.ReadToEnd();
};
};
}
}
Now your code should get the Japanese characters in their correct form.
Uri uri = new Uri("http://www.kanji-a-day.com/level4/index.php", UriKind.Absolute);
string kanji = WebClient_DownLoadString(uri);
kanji = kanji.Remove(0, kanji.IndexOf("<div class=\"glyph\">") + 19);
kanji = kanji.Remove(kanji.IndexOf("</div>")-2);
kanji = kanji.Trim();
Text_DailyKanji.Text = kanji;

DownloadString and Special Characters

I am trying to find the index of Mauricio in a string that is downloaded from a website using webclient and download string. However, on the website it contains a foreign character, Maurício. So I found elsewhere some code
string ToASCII(string s)
{
return String.Join("",
s.Normalize(NormalizationForm.FormD)
.Where(c => char.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark));
}
that converts foreign characters. I have tested the code and it works. So the problem I have is that when I download the string, it downloads as MaurA-cio. I have tried both
wc.Encoding = System.Text.Encoding.UTF8;
wc.Headers.Add("Accept-Charset", "ISO-8859-1,utf-8;q=0.7,*;q=0.7");
Neither stop it from downloading as MaurA-cio.
(Also, I cannot change the search as I am getting the search term from a list).
What else can I try?
Thanks
var client = new WebClient { Encoding = System.Text.Encoding.UTF8 };
var json = client.DownloadString(url);
this one will work for any character
DownloadString doesn't look at HTTP response headers. It uses the previously set WebClient.Encoding property. If you have to use it, get the headers first:
// call twice
// (or to just do a HEAD, see http://stackoverflow.com/questions/3268926/head-with-webclient)
webClient.DownloadString("http://en.wikipedia.org/wiki/Maurício");
var contentType = webClient.ResponseHeaders["Content-Type"];
var charset = Regex.Match(contentType,"charset=([^;]+)").Groups[1].Value;
webClient.Encoding = Encoding.GetEncoding(charset);
var s = webClient.DownloadString("http://en.wikipedia.org/wiki/Maurício");
BTW--Unicode doesn't define "foreign" characters. From Maurício's perspective, "Mauricio" would be the foreign spelling of his name.

WebClient Http Post format parameters from illegal characters

I'm using the WebClient class to post a form, it looks like this.
string URI = "url here";
string myParameters = string.Format("Parameter1={0}&Parameter2={1}", var1, var2);
WebClient wc = new WebClient();
wc.Encoding = Encoding.UTF8;
wc.Headers["Content-type"] = "application/x-www-form-urlencoded";
string HtmlResult = wc.UploadString(URI, myParameters);
Which works great, but certain characters like the & obviously breaks it.
How do I format my parameters to not remove but still make illegal characters get through?
You need to UrlEncode your vars using HttpUtility.UrlEncode
string URI = "url here";
string myParameters = string.Format("Parameter1={0}&Parameter2={1}",
HttpUtility.UrlEncode(var1),
HttpUtility.UrlEncode(var2));
WebClient wc = new WebClient();
wc.Encoding = Encoding.UTF8;
wc.Headers["Content-type"] = "application/x-www-form-urlencoded";
string HtmlResult = wc.UploadString(URI, myParameters);
This will encode all illegal characters, not just &, BTW
when reading these, on the other side of the request, you'll probably need to also UrlDecode them MSDN
They need to be percent-encoded:
encodeURIComponent('&')
"%90k"
For example the URL would look like:
www.facebook.com?service_name=M%90k

iso-8859-1 string to Unicode string

i get html source of page with this line of code
WebClient client = new WebClient();
string sPage = client.DownloadString(url);
page code package is iso-8859-1
but exist unicode string like below in html source
<span title="ضافه"
how can i convert this code to Unicode string in c# ?
string s = HttpUtility.HtmlDecode(#"<span title=""ضافه");

Categories

Resources