C# Get site source code with letters other than english

I'm trying to get a site's source in C# using
WebClient client = new WebClient();
string content = client.DownloadString(url);
and it downloads just fine.
However, the source contains Hebrew characters, which show up as gibberish in the content variable.
What do I need to do so that they are decoded correctly?

WebClient client = new WebClient();
client.Encoding = System.Text.Encoding.UTF8; // added
string content = client.DownloadString(url);
You have to specify the encoding. By default, WebClient.Encoding is Encoding.Default (on .NET Framework, the system's legacy ANSI code page), while the content is probably UTF-8. This example sets the encoding to UTF-8. If you are not sure which encoding the page uses, check the source manually first and then set the encoding accordingly. For more information, see the Remarks in the documentation.

The problem is the Encoding of your WebClient. MSDN says:
... the method uses the encoding specified in the Encoding property to convert the resource to a String.
Solution: Set a specific Encoding like
client.Encoding = Encoding.UTF8;
and try it again
string content = client.DownloadString(url);
UTF-8 should decode the Hebrew characters correctly as well.
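
To see why the Encoding property matters, here is a minimal sketch that needs no network access. It assumes the server sends UTF-8 and imitates a client that decodes the same bytes with a different encoding. The class and method names are made up for illustration, and ISO-8859-1 is only a stand-in for whatever single-byte default a machine might use.

```csharp
using System;
using System.Text;

public static class EncodingMismatchDemo
{
    // Encodes the text as UTF-8 (what the server is assumed to send) and
    // decodes those bytes with the encoding the client guessed.
    public static string DecodeAs(string text, Encoding guessed)
    {
        byte[] wireBytes = Encoding.UTF8.GetBytes(text);
        return guessed.GetString(wireBytes);
    }

    public static void Main()
    {
        // The Hebrew word "shalom", written with escapes so the source file's
        // own encoding cannot interfere with the demonstration.
        string hebrew = "\u05E9\u05DC\u05D5\u05DD";

        // Decoding with the matching encoding reproduces the text.
        Console.WriteLine(DecodeAs(hebrew, Encoding.UTF8) == hebrew);

        // Decoding with a single-byte encoding turns every UTF-8 byte into
        // its own character, which is the gibberish seen in the question.
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        Console.WriteLine(DecodeAs(hebrew, latin1) == hebrew);
    }
}
```

Each Hebrew letter takes two bytes in UTF-8, so the mis-decoded string is twice as long as the original and contains none of the intended characters.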

Related

WebClient.DownloadString uses wrong encoding

I'm downloading XML files from SharePoint Online using WebClient.
However, when I use the WebClient.DownloadString(string url) method, some characters are not decoded correctly.
When I use WebClient.DownloadFile(string url, string file) and then read the file, all characters are correct.
The XML itself does not contain an encoding declaration.
string wrongXml = webClient.DownloadString(url);
//wrongXml contains Ä™ instead of ę
webClient.DownloadFile(url, @"C:\temp\file1.xml");
string correctXml = File.ReadAllText(@"C:\temp\file1.xml");
//contains ę, like it should.
Also, when I open the URL in Internet Explorer, it is shown correctly.
Why is that? Is it because of the default Windows encoding on my machine, or does WebClient handle responses differently in DownloadString than in DownloadFile?

The encoding WebClient is using is probably not the one the service returns.
You can set the encoding you expect before you make the request:
webClient.Encoding = Encoding.UTF8;
string previouslyWrongXml = webClient.DownloadString(url);
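
If the expected encoding is not known in advance, the charset parameter of the Content-Type response header often names it. The following is a sketch of a helper that picks an encoding from such a header value and falls back to a caller-supplied default. The class name, method name and fallback policy are assumptions for illustration, not part of WebClient.

```csharp
using System;
using System.Net.Mime;
using System.Text;

public static class CharsetResolver
{
    // Returns the encoding named by the charset parameter of a Content-Type
    // header value, or the fallback when the header is missing, malformed,
    // has no charset, or names an encoding this runtime does not know.
    public static Encoding FromContentType(string contentType, Encoding fallback)
    {
        if (string.IsNullOrWhiteSpace(contentType))
        {
            return fallback;
        }

        try
        {
            string charset = new ContentType(contentType).CharSet;
            return string.IsNullOrEmpty(charset)
                ? fallback
                : Encoding.GetEncoding(charset);
        }
        catch (FormatException)
        {
            return fallback; // header value could not be parsed
        }
        catch (ArgumentException)
        {
            return fallback; // unknown charset name
        }
    }

    public static void Main()
    {
        Console.WriteLine(FromContentType("text/xml; charset=utf-8", Encoding.ASCII).WebName);
        Console.WriteLine(FromContentType("text/xml", Encoding.UTF8).WebName);
    }
}
```

One way to use it with WebClient, under the same assumptions, is to call DownloadData to get the raw bytes, read the header value from client.ResponseHeaders["Content-Type"], and decode the bytes with the encoding this helper returns.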

C# decoding "â„¢" to "TM"

A web page contains the following string:
"Qualcomm Snapdragon™ S4"
When I read this string in my .NET code, it becomes "Qualcomm Snapdragonâ„¢ S4".
The "™" character changes to "â„¢".
How can I decode "â„¢" back to "™"?
Update
The following is the code that downloads the string using a webproxy
(wc is the webproxy):
wc.Headers.Add("Accept-Charset", "ISO-8859-1,utf-8");
string html = Server.HtmlEncode(wc.DownloadString(url));

You should read the webpage in its proper encoding in the first place. In this case it seems you are reading with Encoding.Default (i.e. probably CP1252) and the page is really in UTF-8. This should be apparent either by reading the Content-Type header of the response or by looking for a <meta http-equiv="Content-Type" content='text/html; charset=utf-8'> in the content.
If you still need to do this after the fact, then use
var bytes = Encoding.Default.GetBytes(myString);
var correctString = Encoding.UTF8.GetString(bytes);
In any case you would need to know the exact encodings that were used on the page and for reading the malformed string in the first place. Furthermore I'd generally advise explicitly against using Encoding.Default because its value isn't fixed. It's just the legacy encoding on a Windows system for use in non-Unicode applications and also gets used as the default non-Unicode text file encoding. It should have no place whatsoever in handling external resources.
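
The after-the-fact repair described above can be sketched as a small helper that takes both encodings explicitly instead of relying on Encoding.Default. The class and method names are hypothetical. The demonstration uses ISO-8859-1 as the wrongly applied encoding because it is available on every .NET runtime and maps each byte to exactly one character. With CP1252, which the answer suspects was used here, the idea is the same, but the few byte values CP1252 leaves undefined may not survive the round trip.

```csharp
using System;
using System.Text;

public static class MojibakeRepair
{
    // Re-encodes the garbled string with the encoding that was wrongly used
    // to read it, which recovers the original bytes, then decodes those
    // bytes with the encoding the page actually used.
    public static string Repair(string garbled, Encoding wronglyUsed, Encoding actual)
    {
        byte[] originalBytes = wronglyUsed.GetBytes(garbled);
        return actual.GetString(originalBytes);
    }

    public static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        string original = "Qualcomm Snapdragon\u2122 S4"; // \u2122 is the trademark sign

        // Simulate the bug: UTF-8 bytes read as if they were ISO-8859-1.
        string garbled = latin1.GetString(Encoding.UTF8.GetBytes(original));

        // Undo it by reversing the two steps.
        Console.WriteLine(Repair(garbled, latin1, Encoding.UTF8) == original);
    }
}
```

Passing the encodings as parameters keeps the repair deterministic across machines, which is the reason the answer advises against Encoding.Default.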

Strange characters when consuming JSON web service

I'm consuming a JSON web service by using WebClient.DownloadStringAsync. The returned string contains a strange character pair:
"start_address" : "GoethestraÃŸe 7-9, Monaco di Baviera, Germania",
in place of an extended character. How can I get the correct one? In this example it should be: ß

Solved it myself:
WebClient client = new WebClient();
client.Encoding = Encoding.UTF8; // Specify the encoding here

Those two characters are the UTF-8 bytes of the German ß ("sharp S", as in Straße) decoded with a single-byte encoding. Switching to UTF-8 should solve your problem.

How to "iso-8859-1" encoding a string in jQuery?

I'm looking for a jQuery (or jQuery plugin) equivalent of this C# code block. It converts a string to bytes using the iso-8859-1 character set and then encodes those bytes as a base64 string.
string authInfo = "encrypted secret";
Encoding encoding = Encoding.GetEncoding("iso-8859-1");
byte[] authBytes = encoding.GetBytes(authInfo);
string encryptedMsg = Convert.ToBase64String(authBytes);
Is there a plugin out there that can do this?

Found a jQuery plugin that's close enough to what I need: Base64 encode and decode
It doesn't have an option to specify character set but I can live with it for now. So the jQuery code becomes:
authInfo = $.base64.encode(authInfo);

I believe you must set the character encoding of the page (or wherever authInfo is defined) to ISO-8859-1. If authInfo is defined in a referenced JavaScript file, you may also specify the character encoding on the script tag that references it.
As for base64 encoding, here's a page that has a code snippet that does just that: http://www.webtoolkit.info/javascript-base64.html

HttpWebRequest and Unicode characters

I am using this code:
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
string result = null;
using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
{
StreamReader reader = new StreamReader(resp.GetResponseStream());
result = reader.ReadToEnd();
reader.Close();
}
In result I get text like 003cbr /003e003cbr /003e (I think this should be two line breaks instead). I tried the two- and three-parameter overloads of StreamReader, but the string was the same. (The request returns a JSON string.)
Why am I getting those characters, and how can I avoid them?

It's not really clear what that text is, but you're not specifying an encoding at the moment. What content encoding is the server using? StreamReader will default to UTF-8.
It sounds like actually you're getting some sort of oddly-encoded HTML, as U+003C is < and U+003E is >, giving <br /><br /> as the content. That's not JSON...
Two tests:
Use WebClient.DownloadString, which will detect the right encoding to use
See what gets shown using the same URL in a browser
EDIT: Okay, now that I've seen the text, it's actually got:
\u003cbr /\u003e
The \u part is important here: it is the JSON escape stating that the next four characters form the hex representation of a UTF-16 code unit.
Any JSON API used to parse that text should perform the unescaping for you.
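
As an illustration of that unescaping, here is a minimal sketch that uses System.Text.Json, which is an assumption about the environment (it ships with .NET Core 3.0 and later; other JSON libraries behave the same way for this escape). The class name, method name and the "html" property name are hypothetical; the question does not show the real payload.

```csharp
using System;
using System.Text.Json;

public static class JsonUnescapeDemo
{
    // Parses the JSON text and returns the named string property. The parser
    // converts \uXXXX escapes into the characters they stand for.
    public static string ReadStringProperty(string json, string propertyName)
    {
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            return doc.RootElement.GetProperty(propertyName).GetString();
        }
    }

    public static void Main()
    {
        // Raw JSON as it would arrive over the wire: the backslashes are
        // part of the payload, not C# escapes.
        string json = "{\"html\":\"\\u003cbr /\\u003e\\u003cbr /\\u003e\"}";

        Console.WriteLine(ReadStringProperty(json, "html")); // prints <br /><br />
    }
}
```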
