DownloadString and Special Characters - c#

I am trying to find the index of Mauricio in a string that is downloaded from a website using WebClient.DownloadString. However, on the website the name contains a foreign character: Maurício. I found some code elsewhere
string ToASCII(string s)
{
    return String.Join("",
        s.Normalize(NormalizationForm.FormD)
         .Where(c => char.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark));
}
that converts foreign characters by stripping their accents. I have tested the code and it works. The problem is that when I download the string, the name comes back as MaurA-cio. I have tried both
wc.Encoding = System.Text.Encoding.UTF8;
wc.Headers.Add("Accept-Charset", "ISO-8859-1,utf-8;q=0.7,*;q=0.7");
Neither stops it from downloading as MaurA-cio.
(Also, I cannot change the search as I am getting the search term from a list).
What else can I try?
Thanks

var client = new WebClient { Encoding = System.Text.Encoding.UTF8 };
var json = client.DownloadString(url);
This will work for any character, provided the server actually sends the page as UTF-8.

DownloadString doesn't look at HTTP response headers. It uses the previously set WebClient.Encoding property. If you have to use it, get the headers first:
// call twice
// (or to just do a HEAD, see http://stackoverflow.com/questions/3268926/head-with-webclient)
webClient.DownloadString("http://en.wikipedia.org/wiki/Maurício");
var contentType = webClient.ResponseHeaders["Content-Type"];
var charset = Regex.Match(contentType,"charset=([^;]+)").Groups[1].Value;
webClient.Encoding = Encoding.GetEncoding(charset);
var s = webClient.DownloadString("http://en.wikipedia.org/wiki/Maurício");
BTW--Unicode doesn't define "foreign" characters. From Maurício's perspective, "Mauricio" would be the foreign spelling of his name.
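If making the request twice is undesirable, one alternative is to download the raw bytes once with DownloadData and decode them yourself. The sketch below assumes the server reports its charset in the Content-Type response header. The names CharsetAwareDownload, Decode and DownloadString are made up for illustration, and falling back to UTF-8 is an assumption, not something the server guarantees.

```csharp
using System;
using System.Net;
using System.Net.Mime;
using System.Text;

public static class CharsetAwareDownload
{
    // Hypothetical helper: decodes raw bytes with the charset named in a
    // Content-Type header value, or with the fallback when none is usable.
    public static string Decode(byte[] data, string contentTypeHeader, Encoding fallback)
    {
        Encoding encoding = fallback;
        if (!string.IsNullOrWhiteSpace(contentTypeHeader))
        {
            try
            {
                string charset = new ContentType(contentTypeHeader).CharSet;
                if (!string.IsNullOrEmpty(charset))
                    encoding = Encoding.GetEncoding(charset);
            }
            catch (FormatException) { }   // malformed header: keep the fallback
            catch (ArgumentException) { } // unknown charset name: keep the fallback
        }
        return encoding.GetString(data);
    }

    // Single request: fetch the bytes, then decode using the response headers.
    public static string DownloadString(string url)
    {
        using (var client = new WebClient())
        {
            byte[] data = client.DownloadData(url);
            string contentType = client.ResponseHeaders[HttpResponseHeader.ContentType];
            return Decode(data, contentType, Encoding.UTF8); // UTF-8 fallback is an assumption
        }
    }
}
```

Calling CharsetAwareDownload.DownloadString(url) then replaces the two DownloadString calls. If the page declares its charset only in a meta tag, this sketch will still use the fallback.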

Related

How can I search for a Hebrew word on a website using C#

I'm trying to search for a Hebrew word on a website using C#, but I can't figure it out.
This is the code I'm currently working with:
var client = new WebClient();
Encoding encoding = Encoding.GetEncoding(1255);
var text = client.DownloadString("http://shchakim.iscool.co.il/default.aspx");
if (text.Contains("ביטול"))
{
    MessageBox.Show("idk");
}
thanks for any help :)
The problem seems to be that WebClient is not using the right encoding when converting the response into a string. You must set the WebClient.Encoding property to the encoding the server uses for this conversion to happen correctly.
I inspected the response from the server and it's encoded using UTF-8; the updated code below reflects this change:
using (var client = new WebClient())
{
    client.Encoding = System.Text.Encoding.UTF8;
    var text = client.DownloadString("http://shchakim.iscool.co.il/default.aspx");
    // The response from the server doesn't contain the word ביטול, so for demo purposes I replaced it with שוחרות, which is present in the response.
    if (text.Contains("שוחרות"))
    {
        MessageBox.Show("idk");
    }
}
Here you can find more information about the WebClient.Encoding property:
https://learn.microsoft.com/en-us/dotnet/api/system.net.webclient.encoding?view=netframework-4.7.2
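To see why the Encoding property matters, the following sketch reproduces the mismatch without any network access. It takes the UTF-8 bytes of a string and decodes them with another encoding, which is in effect what DownloadString does when Encoding does not match the server. EncodingMismatchDemo and DecodeAs are hypothetical names, and ISO-8859-1 merely stands in for whatever wrong encoding happens to be in effect.

```csharp
using System;
using System.Text;

public static class EncodingMismatchDemo
{
    // Encodes the text as UTF-8 (what the server sent) and decodes the bytes
    // with the given encoding (what the client assumed).
    public static string DecodeAs(string original, Encoding assumed)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(original);
        return assumed.GetString(utf8Bytes);
    }
}
```

Decoded as ISO-8859-1, each Hebrew letter turns into two unrelated Latin-1 characters, so Contains("ביטול") returns false even though the bytes on the wire were correct. Decoded as UTF-8, the text round-trips unchanged.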
Hope this helps.

Kanji characters from WebClient html different from actual Kanji in website

So, I'm trying to get a portion of text from a website called Kanji-A-Day.com, but I have a problem.
You see, I'm trying to get the daily kanji from the website, and I was able to narrow the HTML down to what I want, but it seems the characters are different..?
[Image: what it looks like]
[Image: what it should look like]
What's even more strange is that I produced the results for the second image by copying and pasting directly from the site, so it's not a font problem.
Here's the code I use for getting the character:
public void UpdateDailyKanji() // Called at the initialization of a new main form
{
    string kanji;
    using (WebClient client = new WebClient()) // Grab the string
        kanji = client.DownloadString("http://www.kanji-a-day.com/level4/index.php");
    // Trim the HTML to just the Kanji
    kanji = kanji.Remove(0, kanji.IndexOf(@"<div class=""glyph"">") + 19);
    kanji = kanji.Remove(kanji.IndexOf("</div>")-2);
    kanji = kanji.Trim();
    Text_DailyKanji.Text = kanji; // Set the Kanji
}
Does anyone know what's going on here? I'm guessing it's some Unicode thing but I don't know much about it.
Thanks in advance.
The page you're trying to download as a string is encoded using charset=EUC-JP, also known as Japanese (EUC) (CodePage 51932). This is clearly set in the page headers.
Why is the string returned by WebClient.DownloadString encoded using the wrong encoder?
The MSDN Docs state this:
This method retrieves the specified resource. After it downloads the
resource, the method uses the encoding specified in the Encoding
property to convert the resource to a String.
Thus, you have to know beforehand what encoding will be used and specify it, setting the WebClient.Encoding property.
To verify this, check the .NET Reference Source for the WebClient.DownloadString method:
try {
    WebRequest request;
    byte [] data = DownloadDataInternal(address, out request);
    string stringData = GetStringUsingEncoding(request, data);
    if(Logging.On)Logging.Exit(Logging.Web, this, "DownloadString", stringData);
    return stringData;
} finally {
    CompleteWebClientState();
}
The encoding is taken from the request settings, not from the response.
As a result, the downloaded string is decoded using the default code page.
What you can do now is:
1. Download the page twice, the first time to check whether the WebClient encoding and the HTML page encoding don't match.
2. Re-encode the string with the correct encoding, set in the underlying WebResponse.
3. Don't use WebClient; use HttpClient or WebRequest directly. Or, if you like this tool, create a custom WebClient class to handle the WebRequest/WebResponse in a more direct way.
This is a method to perform the re-encoding task:
The string returned by WebClient is converted to a Byte Array and passed to a MemoryStream, then re-encoded using a StreamReader with the Encoding retrieved from the Content-Type: charset Response Header.
EDIT:
Now using Reflection to get the page Encoding from the underlying HttpWebResponse. This should avoid errors in parsing the original CharacterSet as defined by the remote response.
using System;
using System.IO;
using System.Net;
using System.Reflection;
using System.Text;

public string WebClient_DownLoadString(Uri uri)
{
    using (var client = new WebClient())
    {
        // If Windows 7 - Windows Server 2008 R2
        ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12;
        client.CachePolicy = new System.Net.Cache.RequestCachePolicy(System.Net.Cache.RequestCacheLevel.BypassCache);
        client.Headers.Add(HttpRequestHeader.Accept, "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        client.Headers.Add(HttpRequestHeader.AcceptLanguage, "en-US,en;q=0.8");
        client.Headers.Add(HttpRequestHeader.KeepAlive, "keep-alive");
        string result = client.DownloadString(uri);

        // Read the private response field to get the charset the server reported
        var flags = BindingFlags.Instance | BindingFlags.NonPublic;
        using (var response = (HttpWebResponse)client.GetType().GetField("m_WebResponse", flags).GetValue(client))
        {
            var pageEncoding = Encoding.GetEncoding(response.CharacterSet);
            // Turn the wrongly decoded string back into bytes, then decode with the page encoding
            byte[] bytes = client.Encoding.GetBytes(result);
            using (var ms = new MemoryStream(bytes, 0, bytes.Length))
            using (var reader = new StreamReader(ms, pageEncoding))
            {
                ms.Position = 0;
                return reader.ReadToEnd();
            }
        }
    }
}
Now your code should get the Japanese characters in their correct form.
Uri uri = new Uri("http://www.kanji-a-day.com/level4/index.php", UriKind.Absolute);
string kanji = WebClient_DownLoadString(uri);
kanji = kanji.Remove(0, kanji.IndexOf("<div class=\"glyph\">") + 19);
kanji = kanji.Remove(kanji.IndexOf("</div>")-2);
kanji = kanji.Trim();
Text_DailyKanji.Text = kanji;
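For the third option, a minimal HttpClient sketch might look like the following. HttpContent.ReadAsStringAsync decodes the body using the charset from the response's Content-Type header and falls back to UTF-8 when none is given, so neither a second request nor reflection is needed. HttpClientDownload, ReadBodyAsync and DownloadAsync are hypothetical names. Two assumptions to check: the server must send the charset in the HTTP header rather than only in a meta tag, and on .NET Core and later, legacy code pages such as EUC-JP are only available after registering CodePagesEncodingProvider.Instance via Encoding.RegisterProvider.

```csharp
using System.Net.Http;
using System.Threading.Tasks;

public static class HttpClientDownload
{
    // Reads the body as text; the charset from the response's
    // Content-Type header decides how the bytes are decoded.
    public static async Task<string> ReadBodyAsync(HttpResponseMessage response)
    {
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync();
    }

    // Fetches the URL with a caller-supplied HttpClient and returns the decoded text.
    public static async Task<string> DownloadAsync(HttpClient client, string url)
    {
        using (HttpResponseMessage response = await client.GetAsync(url))
        {
            return await ReadBodyAsync(response);
        }
    }
}
```

The trimming code above would then run on the result of await HttpClientDownload.DownloadAsync(client, "http://www.kanji-a-day.com/level4/index.php").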

Cyrillic symbols in HttpClient POST request for upload filename

In one of my .NET applications I've got a method which uploads a file to the site via HttpClient. Here is the implementation:
using (var clientHandler = new HttpClientHandler
{
    CookieContainer = cookieContainer,
    UseDefaultCredentials = true
})
{
    using (var client = new HttpClient(clientHandler))
    {
        client.BaseAddress = requestAddress;
        client.DefaultRequestHeaders.Accept.Clear();
        using (var content = new MultipartFormDataContent())
        {
            var streamContent = new StreamContent(new MemoryStream(fileData));
            streamContent.Headers.ContentDisposition = ContentDispositionHeaderValue.Parse("form-data");
            streamContent.Headers.ContentDisposition.Parameters.Add(new NameValueHeaderValue("name", "contentFile"));
            streamContent.Headers.ContentDisposition.Parameters.Add(new NameValueHeaderValue("filename", "\"" + fileName + "\""));
            streamContent.Headers.ContentType = new MediaTypeHeaderValue(contentType);
            content.Add(streamContent);
            HttpResponseMessage response = await client.PostAsync("/Files/UploadFile", content);
            if (response.IsSuccessStatusCode)
            {
                return true;
            }
            return false;
        }
    }
}
The method works fine. But when I pass Cyrillic symbols in the fileName property, the filename in the generated POST request has corrupted symbols, like ????1.docx for example, where ? replaces each Cyrillic symbol. Is there any way to send Cyrillic symbols without corruption?
I believe the filename parameter is very limited in terms of the code pages it supports. I guess it only supports ASCII (not 100% sure). There is a better parameter you can use called filename*, which is a bit hard to Google for, since Google will just remove the * and give you results for the ordinary filename.
Long story short, you need to use this:
$"filename*=UTF-8''{fileName}"
You also might need to percent-encode the file name with regard to spaces, etc. You can Google a bit more on that and check this SO post.
P.S. Some older browsers might not like it, so check your requirements.
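As a sketch of how this could look with the types already used in the question: ContentDispositionHeaderValue has a FileNameStar property that emits the filename* parameter and performs the percent-encoding itself. UploadHeaders and BuildFormDataDisposition are hypothetical names. Whether the receiving server reads filename* from a multipart body is an assumption you should verify against your endpoint.

```csharp
using System.Net.Http.Headers;

public static class UploadHeaders
{
    // Builds a form-data Content-Disposition header whose file name is sent
    // through the filename* parameter (UTF-8, percent-encoded).
    public static ContentDispositionHeaderValue BuildFormDataDisposition(string fieldName, string fileName)
    {
        return new ContentDispositionHeaderValue("form-data")
        {
            Name = "\"" + fieldName + "\"",
            FileNameStar = fileName // serialized as filename*=utf-8''<percent-encoded name>
        };
    }
}
```

In the upload method this would replace the Parse call and the two Parameters.Add calls: streamContent.Headers.ContentDisposition = UploadHeaders.BuildFormDataDisposition("contentFile", fileName);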

Get Romaji from google translation website

I am trying to use the translation code below to get the romaji for a specific set of Japanese characters, but I can't get the romaji to show up in the page I download; it's not even in the Google Translate page source. This is my code:
string languagePair = "jp|en";
string url = String.Format("http://www.google.com/translate_t?hl=en&ie=UTF8&text={0}&langpair={1}", "本", languagePair);
WebClient webClient = new WebClient();
webClient.Encoding = Encoding.UTF8;
string result = webClient.DownloadString(url);
Clipboard.SetText(result);
The character in my code is just an example; it's supposed to say Hon.
For Japanese you must use ja, the ISO 639-1 code, as described here:
Notes:
1. The language pairs are listed in this FAQ, while the language codes are included in this long list.
So you must change your code to this:
string languagePair = "ja|en";
string url = String.Format("http://www.google.com/translate_t?hl=en&ie=UTF8&text={0}&langpair={1}", "本", languagePair);
WebClient webClient = new WebClient();
webClient.Encoding = Encoding.UTF8;
string result = webClient.DownloadString(url);
Clipboard.SetText(result);
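Because the text parameter contains non-ASCII characters, it may also be safer to percent-encode the query values rather than format them into the URL as-is. A small sketch, with TranslateUrl.Build as a hypothetical helper; the endpoint and parameter names are simply the ones from the question and are not verified to still work:

```csharp
using System;

public static class TranslateUrl
{
    // Builds the request URL with both query values percent-encoded as UTF-8.
    public static string Build(string text, string languagePair)
    {
        return String.Format(
            "http://www.google.com/translate_t?hl=en&ie=UTF8&text={0}&langpair={1}",
            Uri.EscapeDataString(text),
            Uri.EscapeDataString(languagePair));
    }
}
```

TranslateUrl.Build("本", "ja|en") produces a URL ending in text=%E6%9C%AC&langpair=ja%7Cen, which can be passed to DownloadString as before.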

C# encoding Shift-JIS vs. utf8 html agility pack

I have a problem. My goal is to save some text from a (Japanese Shift-JIS encoded) HTML page into a UTF-8 encoded text file.
But I don't really know how to encode the text. The HtmlNode object is encoded in Shift-JIS, but after I use the ToString() method, the content is corrupted.
My method so far looks like this:
public String getPage(String url)
{
    String content = "";
    HtmlDocument page = new HtmlWeb(){AutoDetectEncoding = true}.Load(url);
    HtmlNode anchor = page.DocumentNode.SelectSingleNode("//div[contains(@class, 'article-def')]");
    if (anchor != null)
    {
        content = anchor.InnerHtml.ToString();
    }
    return content;
}
I tried
Console.WriteLine(page.Encoding.EncodingName.ToString());
and got: Japanese Shift-JIS
But converting the HTML into a String produces the error. I thought there should be a way, but since documentation for html-agility-pack is sparse and I couldn't really find a solution via Google, I'm here to get some hints.
Well, AutoDetectEncoding doesn't really work like you'd expect it to. From what I found from looking at the source code of the AgilityPack, the property is only used when loading a local file from disk, not from a URL.
So there are three options. One would be to just set the encoding:
OverrideEncoding = Encoding.GetEncoding("shift-jis")
If you know the encoding will always be the same that's the easiest fix.
Or you could download the file locally and load it the same way you do now but instead of the url you'd pass the file path.
using (var client=new WebClient())
{
    client.DownloadFile(url, "20130519-OYT1T00606.htm");
}
var htmlWeb = new HtmlWeb(){AutoDetectEncoding = true};
var file = new FileInfo("20130519-OYT1T00606.htm");
HtmlDocument page = htmlWeb.Load(file.FullName);
Or you can detect the encoding from your content like this:
byte[] pageBytes;
using (var client = new WebClient())
{
    pageBytes = client.DownloadData(url);
}
HtmlDocument page = new HtmlDocument();
using (var ms = new MemoryStream(pageBytes))
{
    // First pass: load with the default encoding just to read the meta tag
    page.Load(ms);
    var metaContentType = page.DocumentNode.SelectSingleNode("//meta[@http-equiv='Content-Type']").GetAttributeValue("content", "");
    var contentType = new System.Net.Mime.ContentType(metaContentType);
    // Second pass: rewind and load again with the declared charset
    ms.Position = 0;
    page.Load(ms, Encoding.GetEncoding(contentType.CharSet));
}
And finally, if the page you are querying returns the content-Type in the response you can look here for how to get the encoding.
Your code would of course need a few more null checks than mine does. ;)
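For the last step of the original goal, writing the extracted text to a UTF-8 file, here is a short sketch. It assumes the content string has already been decoded correctly by one of the options above. Utf8Export and SaveAsUtf8 are hypothetical names.

```csharp
using System.IO;
using System.Text;

public static class Utf8Export
{
    // Writes the text as UTF-8; UTF8Encoding(false) omits the byte order mark.
    public static void SaveAsUtf8(string content, string path)
    {
        File.WriteAllText(path, content, new UTF8Encoding(false));
    }
}
```

Once the page has been decoded correctly, the resulting string is no longer tied to Shift-JIS; the target encoding is chosen only at write time, for example Utf8Export.SaveAsUtf8(getPage(url), "article.txt").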
