I have a problem. My goal is to save some text from a (Japanese, Shift-JIS encoded) HTML page into a UTF-8 encoded text file.
But I don't really know how to encode the text. The HtmlNode object is encoded in Shift-JIS, but after I use the ToString() method, the content is corrupted.
My method so far looks like this:
public String getPage(String url)
{
    String content = "";
    HtmlDocument page = new HtmlWeb() { AutoDetectEncoding = true }.Load(url);
    HtmlNode anchor = page.DocumentNode.SelectSingleNode("//div[contains(@class, 'article-def')]");
    if (anchor != null)
    {
        content = anchor.InnerHtml.ToString();
    }
    return content;
}
I tried
Console.WriteLine(page.Encoding.EncodingName);
and got: Japanese (Shift-JIS)
But converting the HTML into a String produces the error. I thought there should be a way, but since documentation for the Html Agility Pack is sparse and I couldn't really find a solution via Google, I'm here to get some hints.
Well, AutoDetectEncoding doesn't really work like you'd expect it to. From what I found looking at the source code of the Agility Pack, the property is only used when loading a local file from disk, not from a URL.
So there are three options. One would be to just set the encoding:
OverrideEncoding = Encoding.GetEncoding("shift-jis")
If you know the encoding will always be the same, that's the easiest fix.
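For example, a minimal sketch of that first option (assuming the page really is always Shift-JIS):
// Tell the Agility Pack up front how to decode the response,
// bypassing the unreliable auto-detection.
var htmlWeb = new HtmlWeb { OverrideEncoding = Encoding.GetEncoding("shift-jis") };
HtmlDocument page = htmlWeb.Load(url);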
Or you could download the file locally and load it the same way you do now, but instead of the URL you'd pass the file path.
using (var client = new WebClient())
{
    client.DownloadFile(url, "20130519-OYT1T00606.htm");
}
var htmlWeb = new HtmlWeb() { AutoDetectEncoding = true };
var file = new FileInfo("20130519-OYT1T00606.htm");
HtmlDocument page = htmlWeb.Load(file.FullName);
Or you can detect the encoding from your content like this:
byte[] pageBytes;
using (var client = new WebClient())
{
    pageBytes = client.DownloadData(url);
}
HtmlDocument page = new HtmlDocument();
using (var ms = new MemoryStream(pageBytes))
{
    page.Load(ms);
    var metaContentType = page.DocumentNode
        .SelectSingleNode("//meta[@http-equiv='Content-Type']")
        .GetAttributeValue("content", "");
    var contentType = new System.Net.Mime.ContentType(metaContentType);
    ms.Position = 0;
    page.Load(ms, Encoding.GetEncoding(contentType.CharSet));
}
And finally, if the page you are querying returns the Content-Type in the response, you can look here for how to get the encoding from it.
Your code would of course need a few more null checks than mine does. ;)
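For completeness, here's a minimal sketch of that last approach, reading the charset straight from the response headers (assuming the server actually sends one):
var request = (HttpWebRequest)WebRequest.Create(url);
using (var response = (HttpWebResponse)request.GetResponse())
{
    // CharacterSet is parsed from the Content-Type response header,
    // e.g. "text/html; charset=shift_jis".
    var encoding = Encoding.GetEncoding(response.CharacterSet);
    using (var reader = new StreamReader(response.GetResponseStream(), encoding))
    {
        string html = reader.ReadToEnd();
    }
}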
So, I'm trying to get a portion of text from a website called Kanji-A-Day.com, but I have a problem.
You see, I'm trying to get the daily kanji from the website, and I was able to narrow the HTML down to what I want, but it seems the characters come out different?
[Screenshot: what it looks like]
[Screenshot: what it should look like]
What's even more strange is that I produced the results for the second image by copying and pasting directly from the site, so it's not a font problem.
Here's the code I use for getting the character:
public void UpdateDailyKanji() // Called at the initialization of a new main form
{
    string kanji;
    using (WebClient client = new WebClient()) // Grab the string
        kanji = client.DownloadString("http://www.kanji-a-day.com/level4/index.php");

    // Trim the HTML to just the Kanji
    kanji = kanji.Remove(0, kanji.IndexOf(@"<div class=""glyph"">") + 19);
    kanji = kanji.Remove(kanji.IndexOf("</div>") - 2);
    kanji = kanji.Trim();

    Text_DailyKanji.Text = kanji; // Set the Kanji
}
Does anyone know what's going on here? I'm guessing it's some Unicode thing but I don't know much about it.
Thanks in advance.
The page you're trying to download as a string is encoded using charset=EUC-JP, also known as Japanese (EUC) (CodePage 51932). This is clearly set in the page headers.
Why is the string returned by WebClient.DownloadString encoded using the wrong encoder?
The MSDN Docs state this:
This method retrieves the specified resource. After it downloads the
resource, the method uses the encoding specified in the Encoding
property to convert the resource to a String.
Thus, you have to know beforehand what encoding will be used and specify it, setting the WebClient.Encoding property.
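For this page, that means something like the following (a minimal sketch, assuming the charset stays EUC-JP):
using (var client = new WebClient())
{
    // Tell WebClient how to decode the bytes before DownloadString converts them.
    client.Encoding = Encoding.GetEncoding("EUC-JP");
    string html = client.DownloadString("http://www.kanji-a-day.com/level4/index.php");
}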
To verify this, check the .NET Reference Source for the WebClient.DownloadString method:
try {
    WebRequest request;
    byte[] data = DownloadDataInternal(address, out request);
    string stringData = GetStringUsingEncoding(request, data);
    if (Logging.On) Logging.Exit(Logging.Web, this, "DownloadString", stringData);
    return stringData;
} finally {
    CompleteWebClientState();
}
The encoding is taken from the Request settings, not the Response ones.
The result is that the downloaded string is decoded using the default code page.
What you can do now is:
- Download the page twice, the first time just to check whether the WebClient encoding and the Html page encoding match.
- Re-encode the string with the correct encoding, set in the underlying WebResponse.
- Don't use WebClient: use HttpClient or WebRequest directly. Or, if you like this tool, create a custom WebClient class to handle the WebRequest/WebResponse in a more direct way.
This is a method to perform the re-encoding task:
The string returned by WebClient is converted to a Byte Array and passed to a MemoryStream, then re-encoded using a StreamReader with the Encoding retrieved from the Content-Type: charset Response Header.
EDIT:
Now using Reflection to get the page Encoding from the underlying HttpWebResponse. This should avoid errors in parsing the original CharacterSet as defined by the remote response.
using System.IO;
using System.Net;
using System.Reflection;
using System.Text;

public string WebClient_DownLoadString(Uri uri)
{
    using (var client = new WebClient())
    {
        // If Windows 7 - Windows Server 2008 R2
        ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12;

        client.CachePolicy = new System.Net.Cache.RequestCachePolicy(System.Net.Cache.RequestCacheLevel.BypassCache);
        client.Headers.Add(HttpRequestHeader.Accept, "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        client.Headers.Add(HttpRequestHeader.AcceptLanguage, "en-US,en;q=0.8");
        client.Headers.Add(HttpRequestHeader.KeepAlive, "keep-alive");

        string result = client.DownloadString(uri);

        // The underlying HttpWebResponse is private; Reflection exposes it
        // so we can read the CharacterSet the server actually sent.
        var flags = BindingFlags.Instance | BindingFlags.NonPublic;
        using (var response = (HttpWebResponse)client.GetType().GetField("m_WebResponse", flags).GetValue(client))
        {
            var pageEncoding = Encoding.GetEncoding(response.CharacterSet);
            // Re-encode: turn the mis-decoded string back into bytes,
            // then read those bytes again with the correct encoding.
            byte[] bytes = client.Encoding.GetBytes(result);
            using (var ms = new MemoryStream(bytes))
            using (var reader = new StreamReader(ms, pageEncoding))
            {
                return reader.ReadToEnd();
            }
        }
    }
}
Now your code should get the Japanese characters in their correct form.
Uri uri = new Uri("http://www.kanji-a-day.com/level4/index.php", UriKind.Absolute);
string kanji = WebClient_DownLoadString(uri);
kanji = kanji.Remove(0, kanji.IndexOf("<div class=\"glyph\">") + 19);
kanji = kanji.Remove(kanji.IndexOf("</div>")-2);
kanji = kanji.Trim();
Text_DailyKanji.Text = kanji;
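If you can move off WebClient entirely, here is a rough sketch of the HttpClient route mentioned above; no Reflection is needed because the response charset is exposed directly (the UTF-8 fallback is my assumption, not part of the original answer):
// Requires: using System.Net.Http; using System.Threading.Tasks;
public static async Task<string> DownloadStringAsync(Uri uri)
{
    using (var client = new HttpClient())
    using (var response = await client.GetAsync(uri))
    {
        byte[] bytes = await response.Content.ReadAsByteArrayAsync();
        // Decode with the charset from the Content-Type response header.
        string charset = response.Content.Headers.ContentType?.CharSet;
        Encoding encoding = charset != null ? Encoding.GetEncoding(charset) : Encoding.UTF8;
        return encoding.GetString(bytes);
    }
}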
In my application I need to get the URL of the image of a blog post. In order to do this I'm using the HtmlAgilityPack.
This is the code I have so far:
static string GetBlogImageUrl(string postUrl)
{
    string imageUrl = string.Empty;
    using (WebClient client = new WebClient())
    {
        string htmlString = client.DownloadString(postUrl);
        HtmlDocument htmlDocument = new HtmlDocument();
        htmlDocument.LoadHtml(htmlString);
        string xPath = "/html/body/div[contains(@class, 'container')]/div[contains(@class, 'content_border')]/div[contains(@class, 'single-post')]/main[contains(@class, 'site-main')]/article/header/div[contains(@class, 'featured_image')]/img";
        HtmlNode node = htmlDocument.DocumentNode.SelectSingleNode(xPath);
        imageUrl = node.GetAttributeValue("src", string.Empty);
    }
    return imageUrl;
}
The problem is that this is too slow: in my tests it took about three seconds to extract the URL of the image on a given page, which is a problem when I'm loading a feed and trying to read several articles.
I tried to use the absolute XPath of the element I want to load, but I didn't notice any improvement. Is there a faster way to achieve this?
Can you try this code and see if it's faster or not?
string Url = "http://blog.cedrotech.com/5-tendencias-mobile-que-sua-empresa-precisa-acompanhar/";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(Url);
var featureDiv = doc.DocumentNode.Descendants("div")
    .FirstOrDefault(_ => _.Attributes.Contains("class") && _.Attributes["class"].Value.Contains("featured_image"));
var img = featureDiv.ChildNodes.First(_ => _.Name.Equals("img"));
var imgUrl = img.Attributes["src"].Value; // .Value gives the string; the HtmlAttribute object itself is not the URL
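Most of those three seconds are likely spent on the HTTP request itself rather than on parsing, so when reading a whole feed it may help more to download the articles concurrently. A rough sketch (the method and names are mine, not from the answer above):
// Requires: using System.Collections.Generic; using System.Linq;
//           using System.Net.Http; using System.Threading.Tasks;
static async Task<string[]> GetBlogImageUrlsAsync(IEnumerable<string> postUrls)
{
    using (var client = new HttpClient())
    {
        var tasks = postUrls.Select(async url =>
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(await client.GetStringAsync(url));
            var img = doc.DocumentNode
                .SelectSingleNode("//div[contains(@class, 'featured_image')]/img");
            return img?.GetAttributeValue("src", string.Empty) ?? string.Empty;
        });
        // All downloads run in parallel instead of one page at a time.
        return await Task.WhenAll(tasks);
    }
}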
I am reading a file with the File.OpenRead method, giving it this path:
http://localhost:10001/MyFiles/folder/abc.png
I have tried this as well, but no luck:
http://localhost:10001//MyFiles//abc.png
but it's giving:
URL Formats are not supported
When I give a physical path on my drive like this, it works fine:
d:\MyFolder\MyProject\MyFiles\folder\abc.png
How can I pass an HTTP path as a file path?
This is my code:
public FileStream GetFile(string filename)
{
    FileStream file = File.OpenRead(filename);
    return file;
}
Have a look at WebClient (MSDN docs), it has many utility methods for downloading data from the web.
If you want the resource as a Stream, try:
using (WebClient webClient = new WebClient())
using (Stream stream = webClient.OpenRead(uriString))
using (StreamReader sr = new StreamReader(stream))
{
    Console.WriteLine(sr.ReadToEnd());
}
You could either use a WebClient as suggested in other answers, or extract the file name from the URL and resolve it locally like this:
var url = "http://localhost:10001/MyFiles/folder/abc.png";
var uri = new Uri(url);
var path = Path.GetFileName(uri.AbsolutePath);
var file = GetFile(path);
// ...
In general you should get rid of the absolute URLs.
The best way to download the HTML is by using the WebClient class, or, as shown here, the lower-level WebRequest. You do it like this:
private string GetWebsiteHtml(string url)
{
    WebRequest request = WebRequest.Create(url);
    using (WebResponse response = request.GetResponse())
    using (Stream stream = response.GetResponseStream())
    using (StreamReader reader = new StreamReader(stream))
    {
        return reader.ReadToEnd();
    }
}
Then, if you want to further process the HTML to, for example, extract images or links, you will want to use the technique known as HTML scraping.
It's currently best achieved by using the HTML Agility Pack.
Also, documentation on WebClient class: MSDN
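If you do go with WebClient, the equivalent of the snippet above is nearly a one-liner (url here is whatever page you want):
string html;
using (var client = new WebClient())
{
    // DownloadString handles the request, response, and stream reading for you.
    html = client.DownloadString(url);
}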
Here I found this snippet; it might do exactly what you need:
using (WebClient client = new WebClient())
{
    // DownloadFile returns void; it saves the resource straight to the given file name.
    client.DownloadFile(new Uri("http://.../abc.png"), filename);
}
It uses the WebClient class.
To convert a file:// URL to a UNC file name, you should use the Uri.LocalPath property, as documented.
In other words, you can do this:
public FileStream GetFile(string url)
{
    var filename = new Uri(url).LocalPath;
    FileStream file = File.OpenRead(filename);
    return file;
}
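For example (a quick sketch; the path is just an illustration based on the question):
// A file:// URL maps back to a plain Windows path:
var path = new Uri("file:///d:/MyFolder/MyProject/MyFiles/folder/abc.png").LocalPath;
// path is now @"d:\MyFolder\MyProject\MyFiles\folder\abc.png"
Note that this only helps for file:// URLs; an http:// URL doesn't correspond to a local file, which is why File.OpenRead rejected it in the first place.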
I want to get HTML code to be displayed in a RichTextBox. I am using this code:
WebClient client = new WebClient();
byte[] data = client.DownloadData("http://www.google.com");
richTextBox1.Text = data.ToString();
How can I do this?
Also: I don't know why, but this shows me "System.Byte[]" in the RichTextBox.
Use WebClient.DownloadString, which downloads the requested resource as a String (the resource can be specified as a string URL or a Uri):
var contents = new System.Net.WebClient().DownloadString(url);
Note that RTF encoding is different from HTML, so you cannot display HTML in a RichTextBox straight away. I suggest the WebBrowser control.
Or try one of these:
http://www.codeproject.com/KB/HTML/XHTML2RTF.aspx
http://www.codeproject.com/KB/edit/htmlrichtextbox.aspx
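If it's the rendered page you're after, here is a minimal WinForms sketch using the WebBrowser control (assuming a control named webBrowser1 on the form):
var contents = new System.Net.WebClient().DownloadString("http://www.google.com");
// WebBrowser renders HTML; RichTextBox expects RTF and can't.
webBrowser1.DocumentText = contents;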
It shows System.Byte[] because that is the description of the data, not the data's contents: calling ToString() on a byte array just returns the type name. To see the actual content, do something like:
WebClient client = new WebClient();
byte[] file = client.DownloadData("example.com");
File.WriteAllBytes(@"example.txt", file);
// ReadAllText returns a single string, which can be assigned to Text
// (ReadAllLines returns a string[], which can't).
string text = File.ReadAllText("example.txt");
richTextBox1.Text = text;
EDIT
Or you can use WebClient.DownloadString like @Ria suggested. Only I would implement it like this:
WebClient client = new WebClient();
var data = client.DownloadString("example.com");
richTextBox1.Text = data; // data is already a string; no ToString() needed
Or to be more efficient even
richTextBox1.Text = client.DownloadString("example.com");
I want to generate HTML content based on the result returned by an HTTP URL.
http://www.zillow.com/webservice/GetDeepSearchResults.htm?zws-id=X1-ZWz1c239bjatxn_5taq0&address=2114+Bigelow+Ave&citystatezip=Seattle%2C+WA
This page will give you some XML results. I want to use that XML to generate HTML. I am not sure where to start. Would someone offer any guidelines or sample code for ASP.NET?
For details: http://www.zillow.com/howto/api/GetDeepSearchResults.htm
To fetch the data you can use the HttpWebRequest class. This is an example I have to hand, but it may be slightly overdone for your needs (and you need to make sure you're doing the right thing: I suspect the call above should be a GET rather than a POST).
// RemoteServer, action, and body come from the surrounding class in this example.
Uri baseUri = new Uri(this.RemoteServer);
HttpWebRequest rq = (HttpWebRequest)WebRequest.Create(new Uri(baseUri, action));
rq.Method = "POST";
rq.ContentType = "application/x-www-form-urlencoded";
rq.Accept = "text/xml";
rq.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;

Encoding encoding = Encoding.GetEncoding("UTF-8");
byte[] chars = encoding.GetBytes(body);
rq.ContentLength = chars.Length;
using (Stream stream = rq.GetRequestStream())
{
    stream.Write(chars, 0, chars.Length);
}

XDocument doc;
XElement responseXml = null; // declared here; the original snippet assumed it existed
using (WebResponse rs = rq.GetResponse())
using (Stream stream = rs.GetResponseStream())
using (XmlTextReader tr = new XmlTextReader(stream))
{
    doc = XDocument.Load(tr);
    responseXml = doc.Root;
}
if (responseXml == null)
{
    throw new Exception("No response");
}
return responseXml;
Once you've got the data back, you need to render HTML, and there are lots of choices. If you just want to convert what you've got into HTML with minimal further processing, you can use XSLT (which is a question all on its own). If you need to do more with it, the question is too vague and you'll need to be more specific.
Create an XSL stylesheet, and inject the stylesheet element into the resulting XML from the page.
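Alternatively, here is a minimal sketch of the server-side XSLT route mentioned above, using XslCompiledTransform (the file names are hypothetical):
using System.Xml.Xsl;

var transform = new XslCompiledTransform();
transform.Load("ZillowResults.xslt"); // your stylesheet (hypothetical name)
// Writes HTML according to the stylesheet's template rules.
transform.Transform("results.xml", "results.html");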