How can Cyrillic text be parsed with HtmlAgilityPack? - C#

I've got a problem with HtmlAgilityPack: I can't parse Cyrillic text; it appears as unknown symbols.
HtmlWeb webGet = new HtmlWeb();
webGet.OverrideEncoding = Encoding.UTF8;
HtmlAgilityPack.HtmlDocument doc = webGet.Load("http://vk.com/glitchhop");
HtmlNode myNode = doc.DocumentNode.SelectSingleNode("//div[@id='page_wall_posts']/*[2]//div[@class='wall_post_text']");
if (myNode != null)
    return myNode.InnerText;
else
    return "Nothing found";
I've also attached an example of the error and what the text should look like.

This problem is not related to HtmlAgilityPack; it is caused by the incorrect encoding you're using.
The webpage you're trying to parse is encoded with windows-1251.
So changing webGet.OverrideEncoding from Encoding.UTF8 to Encoding.GetEncoding(1251) should help you.
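A minimal sketch of that fix (note: on .NET Core / .NET 5+, the windows-1251 code page is only available after registering the System.Text.Encoding.CodePages provider; on .NET Framework it works out of the box):
// Encoding.RegisterProvider(CodePagesEncodingProvider.Instance); // needed on .NET Core/5+ only
HtmlWeb webGet = new HtmlWeb();
webGet.OverrideEncoding = Encoding.GetEncoding(1251); // windows-1251, not UTF-8
HtmlAgilityPack.HtmlDocument doc = webGet.Load("http://vk.com/glitchhop");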

Related

Get webpage source code with alt key code symbols using asp.net c#

I'm trying to get the webpage source code using HtmlAgilityPack. This is my code to get the source code and fill it into a multiline textbox:
var url = "http://www.example.com";
var web = new HtmlWeb();
var doc = web.Load(url);
sourcecodetxt.Text = doc.ToString();
The code is working fine, but if my webpage has some "alt code symbols", the symbol gets changed to other characters, e.g. ★ comes out as â˜….
My question is how to get the original symbol. Sorry for my bad English. Thanks in advance.
Try using WebClient and HtmlDocument's Load() method so you can specify the encoding:
using (WebClient client = new WebClient())
{
    HtmlDocument doc = new HtmlDocument();
    // Passing an explicit encoding to Load() keeps multi-byte symbols such as ★ intact
    doc.Load(client.OpenRead("http://www.example.com"), Encoding.UTF8);
}

.Net HtmlAgilityPack Turkish character encoding issue

I have a problem with HtmlAgilityPack and Turkish character encoding.
Thank you, I solved this issue with the following code:
string url = "blabla";
var webGet = new HtmlWeb();
webGet.OverrideEncoding = Encoding.UTF8;
var doc = webGet.Load(url);

HttpWebRequest returns broken chars

I'm reading a Dutch webpage:
HttpWebRequest oReq = (HttpWebRequest)WebRequest.Create(website);
oReq.Method = "GET";
HttpWebResponse resp = (HttpWebResponse)oReq.GetResponse();
HtmlDocument doc = new HtmlDocument(); // doc must be initialized before calling Load()
doc.Load(resp.GetResponseStream(), Encoding.GetEncoding("iso-8859-1"));
When I get the text of some random element within the page, I get weird characters, not the Dutch ones I see in Chrome:
HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
if (node != null)
{
    MessageBox.Show(node.InnerText, "--- just scraped some xpath ---");
}
Instead of café I get cafÃ©.
How do I solve this? I get the same broken text when writing it to a file, when I assign it to a RichTextBox, etc.
Change the encoding to Unicode, e.g. UTF-8.
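A minimal sketch of that change, assuming the page is actually served as UTF-8:
HttpWebRequest oReq = (HttpWebRequest)WebRequest.Create(website);
oReq.Method = "GET";
HttpWebResponse resp = (HttpWebResponse)oReq.GetResponse();
HtmlDocument doc = new HtmlDocument();
// Decode the response stream as UTF-8 instead of iso-8859-1
doc.Load(resp.GetResponseStream(), Encoding.UTF8);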

C# encoding: Shift-JIS vs. UTF-8 with Html Agility Pack

I have a problem. My goal is to save some text from a (Japanese, Shift-JIS encoded) HTML page into a UTF-8 encoded text file.
But I don't really know how to encode the text. The HtmlNode object is encoded in Shift-JIS, but after I use the ToString() method, the content is corrupted.
My method so far looks like this:
public String getPage(String url)
{
    String content = "";
    HtmlDocument page = new HtmlWeb() { AutoDetectEncoding = true }.Load(url);
    HtmlNode anchor = page.DocumentNode.SelectSingleNode("//div[contains(@class, 'article-def')]");
    if (anchor != null)
    {
        content = anchor.InnerHtml.ToString();
    }
    return content;
}
I tried
Console.WriteLine(page.Encoding.EncodingName.ToString());
and got: Japanese Shift-JIS.
But converting the HTML into a String produces the error. I thought there should be a way, but since the documentation for Html Agility Pack is sparse and I couldn't really find a solution via Google, I'm here to get some hints.
Well, AutoDetectEncoding doesn't really work like you'd expect it to. From what I found by looking at the source code of the Agility Pack, the property is only used when loading a local file from disk, not from a URL.
So there are three options. One would be to just set the encoding:
OverrideEncoding = Encoding.GetEncoding("shift-jis")
If you know the encoding will always be the same, that's the easiest fix.
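In context, a minimal sketch of that first option (assuming HtmlWeb is used as in the question):
var htmlWeb = new HtmlWeb { OverrideEncoding = Encoding.GetEncoding("shift-jis") };
// Every page loaded through this HtmlWeb is now decoded as Shift-JIS,
// regardless of what auto-detection would guess
HtmlDocument page = htmlWeb.Load(url);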
Or you could download the file locally and load it the same way you do now, but instead of the URL you'd pass the file path.
using (var client = new WebClient())
{
    client.DownloadFile(url, "20130519-OYT1T00606.htm");
}
var htmlWeb = new HtmlWeb() { AutoDetectEncoding = true };
var file = new FileInfo("20130519-OYT1T00606.htm");
HtmlDocument page = htmlWeb.Load(file.FullName);
Or you can detect the encoding from your content like this:
byte[] pageBytes;
using (var client = new WebClient())
{
    pageBytes = client.DownloadData(url);
}
HtmlDocument page = new HtmlDocument();
using (var ms = new MemoryStream(pageBytes))
{
    page.Load(ms);
    var metaContentType = page.DocumentNode.SelectSingleNode("//meta[@http-equiv='Content-Type']").GetAttributeValue("content", "");
    var contentType = new System.Net.Mime.ContentType(metaContentType);
    ms.Position = 0;
    page.Load(ms, Encoding.GetEncoding(contentType.CharSet));
}
And finally, if the page you are querying returns the Content-Type in the response, you can look here for how to get the encoding.
Your code would of course need a few more null checks than mine does. ;)
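Since that link isn't reproduced here, a hedged sketch of the header-based approach: HttpWebResponse exposes the charset from the Content-Type response header via its CharacterSet property (the UTF-8 fallback below is my own assumption):
byte[] data;
string charset;
var request = (HttpWebRequest)WebRequest.Create(url);
using (var response = (HttpWebResponse)request.GetResponse())
using (var stream = response.GetResponseStream())
using (var ms = new MemoryStream())
{
    stream.CopyTo(ms);
    data = ms.ToArray();
    charset = response.CharacterSet; // empty if the header omits a charset
}
// Fall back to UTF-8 when no charset is declared (assumption, adjust as needed)
var encoding = string.IsNullOrEmpty(charset) ? Encoding.UTF8 : Encoding.GetEncoding(charset);
var page = new HtmlDocument();
page.LoadHtml(encoding.GetString(data));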

Parsing HTML String [duplicate]

This question already has answers here:
What is the best way to parse html in C#? [closed]
(15 answers)
Closed 9 years ago.
Is there a way to parse an HTML string in .NET code-behind, like DOM parsing...
i.e. GetElementByTagName("abc").GetElementByTagName("tag")
I have this code chunk...
private void LoadProfilePage()
{
    string sURL = "http://www.abcd1234.com/abcd1234";
    WebRequest wrGETURL = WebRequest.Create(sURL);
    //WebProxy myProxy = new WebProxy("myproxy", 80);
    //myProxy.BypassProxyOnLocal = true;
    //wrGETURL.Proxy = WebProxy.GetDefaultProxy();
    Stream objStream = wrGETURL.GetResponse().GetResponseStream();
    if (objStream != null)
    {
        StreamReader objReader = new StreamReader(objStream);
        string sLine = objReader.ReadToEnd();
        if (String.IsNullOrEmpty(sLine) == false)
        {
            ....
        }
    }
}
You can use the excellent HTML Agility Pack.
This is an agile HTML parser that builds a read/write DOM and supports plain XPath or XSLT (you actually don't HAVE to understand XPath or XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to what System.Xml proposes, but for HTML documents (or streams).
Take a look at using the Html Agility Pack
Example of its use:
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    HtmlAttribute att = link.Attributes["href"];
    att.Value = FixLink(att);
}
You can use the HTML Agility Pack and a little XPath (it can even download the document for you):
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://www.abcd1234.com/abcd1234");
HtmlNodeCollection tags = doc.DocumentNode.SelectNodes("//abc//tag");
I've used the HTML Agility Pack to do this exact thing and I think it's great. It has been really helpful to me.
Maybe this can help: What is the best way to parse html in C#?
