foreign characters; language translation proxy application - c#

I am working on a C# application that sends an HTTP request to a language translation server that converts Norwegian to English and vice versa. The translation results I'm getting back suggest that certain Norwegian characters like "Ø" are being garbled. Stepping through the code, I found this character garbled in the URI members of the HttpWebRequest class, so I now construct the Uri explicitly (instead of depending on the HttpWebRequest class to do it for me). But my translation result remains garbled. I now suspect that the StreamReader is the culprit, and I have tried UTF8, Unicode, and ASCII encodings on a Norwegian machine, but to no avail. Please advise. Thank you in advance. The code for the function is pasted below.
Other details:
Example of input content: følger;
Example of translation result: ? lger;
Input and output are in unicode text files.
public string Translate(string requestquery)
{
string translatedresult = "";
Uri uri = new Uri(requestquery, true); //this appears to have solved the garbled URI
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
string charSet = response.CharacterSet;
Stream ReceiveStream = response.GetResponseStream();
//I've tried all sorts of encoding here and even changing the machine locale
string contentEncoding = response.ContentEncoding;
StreamReader sr = new StreamReader(ReceiveStream, Encoding.UTF8, false);
translatedresult = sr.ReadToEnd();
sr.Close();
return translatedresult;
}
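Rather than trying encodings at random, one way to narrow this down is to check whether the bytes the server sends are valid UTF-8 at all. The following is a minimal sketch, not a confirmed fix: EncodingProbe and IsValidUtf8 are hypothetical names, and it assumes the response body has first been copied into a byte array (for example via a MemoryStream).

```csharp
using System;
using System.Text;

public static class EncodingProbe
{
    // Returns true when the bytes form valid UTF-8. A strict decoder throws
    // on invalid sequences instead of silently substituting a placeholder.
    public static bool IsValidUtf8(byte[] body)
    {
        var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                                          throwOnInvalidBytes: true);
        try
        {
            strictUtf8.GetString(body);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

If this returns false, the server is probably using the charset reported in response.CharacterSet (which the function above reads but never uses), and passing Encoding.GetEncoding(charSet) to the StreamReader may be worth trying. If it returns true and the text is still wrong, the damage most likely happened on the request side, before the server translated anything.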

Related

Kanji characters from WebClient html different from actual Kanji in website

So, I'm trying to get a portion of text from a website called Kanji-A-Day.com, but I have a problem.
You see, I'm trying to get the daily kanji from the website, and I was able to narrow the HTML down to what I want, but it seems the characters are different..?
What it looks like
What it should look like
What's even more strange is that I produced the results for the second image by copying and pasting directly from the site, so it's not a font problem.
Here's the code I use for getting the character:
public void UpdateDailyKanji() // Called at the initialization of a new main form
{
string kanji;
using (WebClient client = new WebClient()) // Grab the string
kanji = client.DownloadString("http://www.kanji-a-day.com/level4/index.php");
// Trim the HTML to just the Kanji
kanji = kanji.Remove(0, kanji.IndexOf(@"<div class=""glyph"">") + 19);
kanji = kanji.Remove(kanji.IndexOf("</div>")-2);
kanji = kanji.Trim();
Text_DailyKanji.Text = kanji; // Set the Kanji
}
Does anyone know what's going on here? I'm guessing it's some Unicode thing but I don't know much about it.
Thanks in advance.
The page you're trying to download as a string is encoded using charset=EUC-JP, also known as Japanese (EUC) (CodePage 51932). This is clearly set in the page headers.
Why is the string returned by WebClient.DownloadString encoded using the wrong encoder?
The MSDN Docs state this:
This method retrieves the specified resource. After it downloads the
resource, the method uses the encoding specified in the Encoding
property to convert the resource to a String.
Thus, you have to know beforehand what encoding will be used and specify it, setting the WebClient.Encoding property.
To verify this, check the .NET Reference Source for the WebClient.DownloadString method:
try {
WebRequest request;
byte [] data = DownloadDataInternal(address, out request);
string stringData = GetStringUsingEncoding(request, data);
if(Logging.On)Logging.Exit(Logging.Web, this, "DownloadString", stringData);
return stringData;
} finally {
CompleteWebClientState();
}
The encoding is set using the Request settings, not the Response ones.
The result is that the downloaded string is decoded using the default code page rather than the page's own charset.
What you can do now is:
Download the page twice, the first time to check whether the WebClient encoding and the Html page encoding don't match.
Re-encode the string with the correct encoding, set in the underlying WebResponse.
Don't use WebClient, use HttpClient or WebRequest directly. Or, if you like this tool, create a custom WebClient class to handle the WebRequest/WebResponse in a more direct way.
This is a method to perform the re-encoding task:
The string returned by WebClient is converted to a Byte Array and passed to a MemoryStream, then re-encoded using a StreamReader with the Encoding retrieved from the Content-Type: charset Response Header.
EDIT:
Now using Reflection to get the page Encoding from the underlying HttpWebResponse. This should avoid errors in parsing the original CharacterSet as defined by the remote response.
using System.IO;
using System.Net;
using System.Reflection;
using System.Text;
public string WebClient_DownLoadString(Uri uri)
{
using (var client = new WebClient())
{
// If Windows 7 - Windows Server 2008 R2
ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12;
client.CachePolicy = new System.Net.Cache.RequestCachePolicy(System.Net.Cache.RequestCacheLevel.BypassCache);
client.Headers.Add(HttpRequestHeader.Accept, "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
client.Headers.Add(HttpRequestHeader.AcceptLanguage, "en-US,en;q=0.8");
client.Headers.Add(HttpRequestHeader.KeepAlive, "keep-alive");
string result = client.DownloadString(uri);
var flags = BindingFlags.Instance | BindingFlags.NonPublic;
using (var response = (HttpWebResponse)client.GetType().GetField("m_WebResponse", flags).GetValue(client))
{
var pageEncoding = Encoding.GetEncoding(response.CharacterSet);
byte[] bytes = client.Encoding.GetBytes(result);
using (var ms = new MemoryStream(bytes, 0, bytes.Length))
using (var reader = new StreamReader(ms, pageEncoding))
{
ms.Position = 0;
return reader.ReadToEnd();
}
}
}
}
Now your code should get the Japanese characters in their correct form.
Uri uri = new Uri("http://www.kanji-a-day.com/level4/index.php", UriKind.Absolute);
string kanji = WebClient_DownLoadString(uri);
kanji = kanji.Remove(0, kanji.IndexOf("<div class=\"glyph\">") + 19);
kanji = kanji.Remove(kanji.IndexOf("</div>")-2);
kanji = kanji.Trim();
Text_DailyKanji.Text = kanji;
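As an alternative to reflection, the third option above (skip DownloadString and decode the raw bytes yourself) could be sketched roughly as follows. BodyDecoder and Decode are hypothetical names, and the sketch assumes the charset is declared in the Content-Type response header and that System.Net.Http is referenced.

```csharp
using System;
using System.Net.Http.Headers;
using System.Text;

public static class BodyDecoder
{
    // Decodes raw bytes with the charset named in a Content-Type header value,
    // e.g. "text/html; charset=EUC-JP". Falls back to UTF-8 when no usable
    // charset is declared.
    public static string Decode(byte[] body, string contentType)
    {
        Encoding encoding = Encoding.UTF8;
        MediaTypeHeaderValue mediaType;
        if (MediaTypeHeaderValue.TryParse(contentType, out mediaType)
            && !string.IsNullOrEmpty(mediaType.CharSet))
        {
            try
            {
                // Some servers quote the value: charset="utf-8"
                encoding = Encoding.GetEncoding(mediaType.CharSet.Trim('"'));
            }
            catch (ArgumentException)
            {
                // Unknown or unregistered charset name: keep the UTF-8 fallback.
            }
        }
        return encoding.GetString(body);
    }
}
```

With WebClient this would be used as byte[] data = client.DownloadData(uri); followed by BodyDecoder.Decode(data, client.ResponseHeaders[HttpResponseHeader.ContentType]). Because the bytes are decoded only once, nothing is lost to a wrong first decoding. Note that on .NET Core and later, EUC-JP is only available after calling Encoding.RegisterProvider(CodePagesEncodingProvider.Instance); without it this sketch would silently fall back to UTF-8.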

Non english characters in Kodi API

I am using the Kodi API, to control my htpc via asp.net.
Especially the function named "Playlist.Add".
The Json I send is like this:
{"jsonrpc":"2.0","method":"Playlist.Insert","params":{"playlistid":0,"position":0,"item":{"file":"smb://server/Ferry Corsten/Beautiful/Ferry Corsten - Beautiful (Extended).mp3"}},"id":1}
This is working fine. But when there are some non-English characters in the string, like this:
{"jsonrpc":"2.0","method":"Playlist.Insert","params":{"playlistid":0,"position":0,"item":{"file":"smb://server/01-Zum Geburtstag viel Glück.mp3"}},"id":1}
It is just throwing a "RequestCanceled" Exception.
My C# source looks like this:
HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(_url);
string authInfo = Convert.ToBase64String(Encoding.Default.GetBytes(_username + ":" + _password));
webRequest.Headers["Authorization"] = "Basic " + authInfo;
webRequest.Method = "POST";
webRequest.UserAgent = "KodiControl";
webRequest.ContentType = "application/json";
webRequest.ContentLength = json.Length;
using (var streamWriter = new StreamWriter(webRequest.GetRequestStream()))
{
streamWriter.Write(json);
streamWriter.Flush();
streamWriter.Close();
}
The Exception is thrown at streamWriter.Flush().
So what do I have to do to send this request?
I suggest you look into Kodi addon unicode paths
Following that guide will help you prevent common problems with non-latin characters in Kodi.
"Python only works with unicode strings internally, converting to a particular encoding on output (or input)." To make string literals unicode by default, add
from __future__ import unicode_literals
Addon path
path = addon.getAddonInfo('path').decode('utf-8')
.decode('utf-8') decodes the returned value using UTF-8.
Kodi's getAddonInfo returns a UTF-8 encoded string, which we decode to unicode.
Browse dialog
dialog = xbmcgui.Dialog()
directory = dialog.browse(0, 'Title' , 'pictures').decode('utf-8')
dialog.browse() returns an UTF-8 encoded string which perhaps contains some non latin characters. Therefore decode it to unicode!
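On the C# side of the question, one likely cause (an assumption, not verified against Kodi) is that ContentLength is set to json.Length, which counts characters, while StreamWriter writes UTF-8 by default, where "ü" takes two bytes. The writer then tries to send more bytes than announced and the request is aborted. A minimal sketch that sizes the request from the encoded bytes; JsonBody, ToUtf8 and Send are hypothetical names:

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

public static class JsonBody
{
    // Encodes the JSON text as UTF-8 so the length can be taken from the
    // byte count rather than from the character count.
    public static byte[] ToUtf8(string json)
    {
        return Encoding.UTF8.GetBytes(json);
    }

    // Writes the JSON to the request with a ContentLength that matches
    // the number of bytes actually sent.
    public static void Send(HttpWebRequest request, string json)
    {
        byte[] body = ToUtf8(json);
        request.ContentLength = body.Length; // bytes, not json.Length
        using (Stream stream = request.GetRequestStream())
        {
            stream.Write(body, 0, body.Length);
        }
    }
}
```

For pure ASCII JSON the two counts are equal, which would explain why the first request works and only the one containing "Glück" fails.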

Web scraper replacing some characters with question marks

I made a simple web scraper that scrapes lyrics for me and then writes them to a database. Everything works, but for some reason it's replacing some characters with question marks, and when I view this information on a simple PHP web page I see a lot of mistakes in the lyrics.
I?m = I'm
Let?s = Let's
haven?t = haven't
stuff like that.
I know the error is in my C# code because I put a breakpoint before it writes to the database and displayed the text in a rich text box. How would I get it to display these characters correctly?
public static string getSourceCode(string url)
{
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
HttpWebResponse resp = (HttpWebResponse)req.GetResponse();
StreamReader sr = new StreamReader(resp.GetResponseStream());
string sourceCode = sr.ReadToEnd();
sr.Close();
resp.Close();
return sourceCode;
}
........
string url = txbURL2.Text;
string sourceCode = WorkerClass.getSourceCode(url);
int startIndex = sourceCode.IndexOf("<td valign=\"top\" width=\"100%\">");
sourceCode = sourceCode.Substring(startIndex, sourceCode.Length - startIndex);
........
//Gets Lyric
startIndex = sourceCode.IndexOf("<br><b>Lyrics:</b><br><br>") + 30;
endIndex = sourceCode.IndexOf(" <br><br>", startIndex);
string lyric = sourceCode.Substring(startIndex, endIndex - startIndex) + "";
rtbLyric.Text = lyric;
//End Lyric
The problem is probably character encoding. My guess is that the web page you're scraping is encoded in UTF-8, but somewhere along the line the text is being converted to ASCII.
Check out the excellent article called "What every developer should know about character encoding" for more details.
Update
You could try this, although the StreamReader should default to UTF-8 anyway:
var encoding = System.Text.Encoding.GetEncoding("utf-8");
StreamReader sr = new StreamReader(resp.GetResponseStream(), encoding);
Check the encoding by searching for charset in the html code.
Your code snippet is missing the actual load process, so it is impossible to tell where it goes wrong.
You can also try using the WebClient:
WebClient client = new WebClient { Encoding = Encoding.UTF8 };
string html = client.DownloadString(url);
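The "I?m" pattern is consistent with the ASCII guess above: .NET's ASCII encoding replaces every character it cannot represent with '?'. A small sketch to illustrate, assuming the lyrics use the typographic apostrophe U+2019 (which is common on lyrics sites, but not confirmed here); AsciiLossDemo and its methods are hypothetical names:

```csharp
using System;
using System.Text;

public static class AsciiLossDemo
{
    // Round-trips text through ASCII. Characters outside the 7-bit range,
    // such as the typographic apostrophe U+2019, come back as '?'.
    public static string ThroughAscii(string text)
    {
        byte[] asciiBytes = Encoding.ASCII.GetBytes(text);
        return Encoding.ASCII.GetString(asciiBytes);
    }

    // The same round trip through UTF-8 preserves every character.
    public static string ThroughUtf8(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.UTF8.GetString(utf8Bytes);
    }
}
```

Calling ThroughAscii("I\u2019m") yields "I?m", while ThroughUtf8 returns the input unchanged. So any step that passes through Encoding.ASCII (or a single-byte code page that lacks the character) is a candidate for where the apostrophes are lost.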

C# Escape Plus Sign (+) in POST using HttpWebRequest

I'm having issues sending POST data that includes characters like "+" in the password field,
string postData = String.Format("username={0}&password={1}", "anyname", "+13Gt2");
I'm using HttpWebRequest and a WebBrowser control to see the results. When I log in from my C# WinForms application by POSTing the data with HttpWebRequest, the website tells me that the password is incorrect (visible both in the source code shown in richTexbox1 and in webBrowser1). With another account of mine whose password does not contain the '+' character, I can log in correctly (using an array of bytes and writing it to the stream):
byte[] byteArray = Encoding.ASCII.GetBytes(postData); //get the data
request.Method = "POST";
request.Accept = "text/html";
request.ContentType = "application/x-www-form-urlencoded";
request.ContentLength = byteArray.Length;
Stream newStream = request.GetRequestStream(); //open connection
newStream.Write(byteArray, 0, byteArray.Length); // Send the data.
newStream.Close(); //this works well if the user does not include symbols
From this question I found that HttpUtility.UrlEncode() is the way to escape illegal characters, but I can't work out how to use it correctly. My question is: after URL-encoding my POST data with UrlEncode(), how do I send the data in my request correctly?
This is how I've been trying for HOURS to make it work, but no luck,
First method
string urlEncoded = HttpUtility.UrlEncode(postData, ASCIIEncoding.ASCII);
//request.ContentType = "application/x-www-form-urlencoded";
request.ContentLength = urlEncoded.Length;
StreamWriter wr = new StreamWriter(request.GetRequestStream(),ASCIIEncoding.ASCII);
wr.Write(urlEncoded); //server returns for wrong password.
wr.Close();
Second method
byte[] urlEncodedArray = HttpUtility.UrlEncodeToBytes(postData,ASCIIEncoding.ASCII);
Stream newStream = request.GetRequestStream(); //open connection
newStream.Write(urlEncodedArray, 0, urlEncodedArray.Length); // Send the data.
newStream.Close(); //The server tells me the same thing..
I think I'm going wrong in how the URL-encoded data is sent with the request. I would really appreciate some help; I searched Google and couldn't find more information about how to send URL-encoded data with an HttpWebRequest. I appreciate your time and attention. Thank you.
I found my answer in Uri.EscapeDataString; it solved my problem exactly, but somehow I couldn't do it with HttpUtility.UrlEncode. Searching around Stack Overflow, I found this question about the UrlEncode method, whose MSDN documentation says:
Encodes a URL string.
But I now know that it is wrong to use it to encode the POST data (@Marvin, @Polity, thanks for the correction). After discarding it, I tried the following:
string postData = String.Format("username={0}&password={1}", "anyname", Uri.EscapeDataString("+13Gt2"));
The POST data is converted into:
// Output:
// username=anyname&password=%2B13Gt2
The MSDN documentation for Uri.EscapeDataString says the following:
Converts a string to its escaped representation.
I think this is what I was looking for. With the above, I could POST correctly whenever the form data included '+' characters, though there is still much for me to learn about these methods.
This link is really helpful.
http://blogs.msdn.com/b/yangxind/archive/2006/11/09/don-t-use-net-system-uri-unescapedatastring-in-url-decoding.aspx
Thanks a lot for the answers and your time, I appreciate it very much. Regards mates.
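To make the difference concrete: encoding the whole postData string with HttpUtility.UrlEncode also escapes the '=' and '&' separators, so the server no longer sees two fields, whereas escaping each value separately keeps the structure intact. A minimal sketch of the per-field approach; FormBody and Build are hypothetical names:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FormBody
{
    // Builds an application/x-www-form-urlencoded body: each name and value is
    // escaped on its own, then joined with the literal '=' and '&' separators.
    public static string Build(IEnumerable<KeyValuePair<string, string>> fields)
    {
        return string.Join("&", fields.Select(field =>
            Uri.EscapeDataString(field.Key) + "=" + Uri.EscapeDataString(field.Value)));
    }
}
```

For the fields username=anyname and password=+13Gt2 this produces username=anyname&password=%2B13Gt2, which can then be converted to bytes and written to the request stream exactly as in the original byte-array code. One difference from browser behaviour: Uri.EscapeDataString writes a space as %20 rather than '+', which servers should decode the same way.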
I'd recommend using a WebClient. It will shorten your code and take care of the encoding for you:
using System.Collections.Specialized; // NameValueCollection
using System.Net; // WebClient
using (var client = new WebClient())
{
var values = new NameValueCollection
{
{ "username", "anyname" },
{ "password", "+13Gt2" },
};
var url = "http://foo.com";
var result = client.UploadValues(url, values);
}
The post data string as a whole should NOT be URL-encoded! It's form data, not a URL. Only the individual values need escaping.
string url = @"http://localhost/WebApp/AddService.asmx/Add";
string postData = "x=6&y=8";
WebRequest req = WebRequest.Create(url);
HttpWebRequest httpReq = (HttpWebRequest)req;
httpReq.Method = WebRequestMethods.Http.Post;
httpReq.ContentType = "application/x-www-form-urlencoded";
Stream s = httpReq.GetRequestStream();
StreamWriter sw = new StreamWriter(s,Encoding.ASCII);
sw.Write(postData);
sw.Close();
HttpWebResponse httpResp =
(HttpWebResponse)httpReq.GetResponse();
s = httpResp.GetResponseStream();
StreamReader sr = new StreamReader(s, Encoding.ASCII);
Console.WriteLine(sr.ReadToEnd());
This uses the System.Text.Encoding.ASCII to encode the postdata.
Hope this helps,

"The format of the URI could not be determined" with WebRequest

I'm trying to perform a POST to a site using a WebRequest in C#. The site I'm posting to is an SMS site, and the message text is part of the URL. To avoid spaces in the URL I'm calling HttpUtility.UrlEncode() to URL-encode it.
But I keep getting a UriFormatException - "Invalid URI: The format of the URI could not be determined" - when I use code similar to this:
string url = "http://www.stackoverflow.com?question=a sentence with spaces";
string encoded = HttpUtility.UrlEncode(url);
WebRequest r = WebRequest.Create(encoded);
r.Method = "POST";
r.ContentLength = encoded.Length;
WebResponse response = r.GetResponse();
The exception occurs when I call WebRequest.Create().
What am I doing wrong?
You should only encode the argument, not the entire url, so try:
string url = "http://www.stackoverflow.com?question=" + HttpUtility.UrlEncode("a sentence with spaces");
WebRequest r = WebRequest.Create(url);
r.Method = "POST";
r.ContentLength = 0; // no request body: the message is carried in the query string
WebResponse response = r.GetResponse();
Encoding the entire url would mean the :// and the ? get encoded too. The encoded string is then no longer a valid url.
UrlEncode should only be used on the values in the query string. Try this:
string query = "a sentence with spaces";
string encoded = "http://www.stackoverflow.com/?question=" + HttpUtility.UrlEncode(query);
The current version of your code URL-encodes the slashes and the colon in the URL, which confuses WebRequest.
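Both answers come down to the same rule, which could be sketched like this. SmsUrl, WithQuestion and IsAbsoluteUri are hypothetical names, and Uri.EscapeDataString stands in for HttpUtility.UrlEncode here so the sketch does not need a System.Web reference (it writes spaces as %20 instead of '+').

```csharp
using System;

public static class SmsUrl
{
    // Appends one escaped query parameter to a base address. Only the value is
    // escaped; the scheme, host and '?' stay as they are.
    public static Uri WithQuestion(string baseAddress, string message)
    {
        var builder = new UriBuilder(baseAddress)
        {
            Query = "question=" + Uri.EscapeDataString(message)
        };
        return builder.Uri;
    }

    // True when the text can be parsed as an absolute URI.
    public static bool IsAbsoluteUri(string text)
    {
        Uri parsed;
        return Uri.TryCreate(text, UriKind.Absolute, out parsed);
    }
}
```

SmsUrl.WithQuestion("http://www.stackoverflow.com/", "a sentence with spaces") gives a Uri that WebRequest.Create accepts, whereas IsAbsoluteUri returns false for the fully escaped string, because once "://" has been percent-encoded there is no scheme left to parse.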
