.Net HtmlAgilityPack Turkish character encoding issue - c#

I have a problem with Turkish character encoding in HtmlAgilityPack.

Thank you. I solved this issue with the following code:
string url = "blabla";
var Webget = new HtmlWeb();
Webget.OverrideEncoding = Encoding.UTF8;
var doc = Webget.Load(url);

Related

Base64 to Html Decode

I want to convert Base64 data to HTML. When I do the conversion in code, the HTML file comes out corrupted and I cannot scrape it with HtmlAgilityPack. When I do the same conversion manually with an online tool, the HTML file comes out fine and I can scrape it. My code is below. Please help.
string base64Data = "/base64 in here";
byte[] decodedBytes = Convert.FromBase64String(base64Data);
string decodedText = Encoding.UTF8.GetString(decodedBytes);
string desktopPath = Environment.GetFolderPath(Environment.SpecialFolder.DesktopDirectory);
string filePath = Path.Combine(desktopPath, "decoded_data.html");
File.WriteAllText(filePath, decodedText);
HtmlWeb htmlWeb = new HtmlWeb();
HtmlDocument doc = htmlWeb.Load(filePath);
var name = doc.DocumentNode.SelectSingleNode("//*[@id='kunye']/tbody/tr[5]/td").InnerHtml;
Console.WriteLine(name);
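One thing that may be worth trying is to skip the file on disk and parse the decoded string in memory. The following is only a minimal sketch: the class name, the `ExtractText` helper and the sample table are made up for illustration, and it assumes the Base64 payload is UTF-8 encoded HTML. Note also that browsers insert `tbody` into tables when building the DOM, so an XPath copied from the browser's developer tools may not match the raw markup that HtmlAgilityPack sees; the sketch uses `//td` to sidestep that.

```csharp
using System;
using System.Text;
using HtmlAgilityPack;

public static class Base64HtmlExample
{
    // Decodes Base64 data (assumed to be UTF-8 HTML) and returns the trimmed
    // inner text of the first node matching the XPath, or null if none matches.
    public static string ExtractText(string base64Html, string xpath)
    {
        byte[] bytes = Convert.FromBase64String(base64Html);
        string html = Encoding.UTF8.GetString(bytes);

        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse the string directly, no temporary file

        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node?.InnerText.Trim(); // null-safe: no exception when nothing matches
    }

    public static void Main()
    {
        // Hypothetical sample document standing in for the real payload.
        string html = "<table id='kunye'><tr><td>İstanbul</td></tr></table>";
        string base64 = Convert.ToBase64String(Encoding.UTF8.GetBytes(html));

        Console.WriteLine(ExtractText(base64, "//*[@id='kunye']//td"));
    }
}
```

If the real data is not UTF-8, the same sketch applies with a different `Encoding` passed to `GetString`.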

Get webpage source code with alt key code symbols using asp.net c#

I'm trying to get a web page's source code using HtmlAgilityPack. This is my code to fetch the source and put it into a multiline textbox:
var url = "http://www.example.com";
var web = new HtmlWeb();
var doc = web.Load(url);
sourcecodetxt.Text = doc.ToString();
The code works fine, but if the web page contains "Alt code" symbols, each symbol is replaced by other characters; for example, ★ comes out as a run of garbled characters.
My question is how to get the original symbol. Sorry for my bad English. Thanks in advance.
Try using WebClient and HtmlDocument's Load() method so you can specify the encoding:
WebClient client = new WebClient();
HtmlDocument doc = new HtmlDocument();
doc.Load(client.OpenRead("http://www.example.com"), Encoding.UTF8);

Html Agility Pack, SelectSingleNode

This code works
WebClient client = new WebClient();
client.Encoding = Encoding.UTF8;
html = client.DownloadString("http://www.imdb.com/chart/moviemeter?ref_=nv_mv_mpm_8");
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
MessageBox.Show(doc.DocumentNode.SelectSingleNode("//*[@id='main']/div/span/div/div/div[3]/table/tbody/tr[1]/td[2]/a").InnerText);
The HTML is here:
Split
The MessageBox shows the text, which is "Split". But look at this HTML:
<div class="summary_text" itemprop="description">
Three girls are kidnapped by a man with a diagnosed 23 distinct personalities, and must try and escape before the apparent emergence of a frightful new 24th.
</div>
I want the MessageBox to show the text that starts with "Three girls are kidn...", so I wrote this code:
WebClient client2 = new WebClient();
client2.Encoding = Encoding.UTF8;
HtmlAgilityPack.HtmlDocument doc2 = new HtmlAgilityPack.HtmlDocument();
doc2.LoadHtml(client2.DownloadString("http://www.imdb.com/title/tt4972582/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2240084082&pf_rd_r=1QW31NGD6JSE46F79CKQ&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=moviemeter&ref_=chtmvm_tt_1"));
MessageBox.Show(doc2.DocumentNode.SelectSingleNode("//*[@id='title - overview - widget']/div[3]/div[1]/div[1]").InnerText);
When I run this code, an unhandled exception of type "System.NullReferenceException" occurs.
The XPaths are correct; I've checked them a hundred times. What should I do?
Can you try this?
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://www.imdb.com/title/tt4972582/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2240084082&pf_rd_r=1QW31NGD6JSE46F79CKQ&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=moviemeter&ref_=chtmvm_tt_1");
// Requires: using System.Linq;
// "?." keeps this from throwing when no matching div is found.
var desNodeText = doc.DocumentNode.Descendants("div").FirstOrDefault(o => o.GetAttributeValue("class", "") == "summary_text")?.InnerText;

How Cyrillic text can be parsed with HTMLAgilityPack?

I'm having trouble with HtmlAgilityPack. I can't parse Cyrillic text; it appears as unknown symbols.
HtmlWeb webGet = new HtmlWeb();
webGet.OverrideEncoding = Encoding.UTF8;
HtmlAgilityPack.HtmlDocument doc = webGet.Load("http://vk.com/glitchhop");
HtmlNode myNode = doc.DocumentNode.SelectSingleNode("//div[@id='page_wall_posts']/*[2]//div[@class='wall_post_text']");
if (myNode != null)
return myNode.InnerText;
else return "Nothing found";
I've also attached an example of the error and of how the text should look.
This problem is not related to HtmlAgilityPack; it is caused by the incorrect encoding you're using.
The web page you're trying to parse is encoded in windows-1251.
So changing webGet.OverrideEncoding from Encoding.UTF8 to Encoding.GetEncoding(1251) should help.
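To illustrate why the encoding matters, here is a small sketch that needs no network access. It encodes a sample Cyrillic string as windows-1251 bytes and then decodes those bytes twice: once as UTF-8 (wrong) and once as windows-1251 (right). The class name, the `Decode` helper and the sample string are invented for the example. It assumes .NET Core 3.0 or later (or .NET 5+), where code page 1251 only becomes available after the code pages provider is registered.

```csharp
using System;
using System.Text;

public static class EncodingMismatchExample
{
    // Decodes raw bytes with the given encoding.
    public static string Decode(byte[] bytes, Encoding encoding) => encoding.GetString(bytes);

    public static void Main()
    {
        // Assumption: running on .NET Core / .NET 5+, where legacy code pages
        // such as 1251 must be registered before GetEncoding can find them.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding cp1251 = Encoding.GetEncoding(1251);

        string original = "Привет";
        byte[] bytes = cp1251.GetBytes(original); // what the server would send

        string wrong = Decode(bytes, Encoding.UTF8); // bytes are not valid UTF-8
        string right = Decode(bytes, cp1251);

        Console.WriteLine(right == original); // True
        Console.WriteLine(wrong == original); // False
    }
}
```

The same registration call would be needed before assigning Encoding.GetEncoding(1251) to OverrideEncoding on those runtimes.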

Convert "iso-8859-1" to "utf-8" with HTML Agility Pack and xpath

I'm trying to get a piece of a web page, but I have a problem with special characters. How do I convert the data so that it reads correctly? The website uses ISO 8859-1 and I must use UTF-8.
string url = "http://www.ta-meteo.fr/troyes.htm";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
HtmlNode bulletinMatin = doc.DocumentNode.SelectSingleNode("//*[@id='blockdetday0']/div[1]/p[1]");
MessageBox.Show(bulletinMatin.InnerText);
Thanks.
I solved the problem:
string url = "http://www.ta-meteo.fr/troyes.htm";
Encoding iso = Encoding.GetEncoding("iso-8859-1");
HtmlWeb web = new HtmlWeb()
{
AutoDetectEncoding = false,
OverrideEncoding = iso,
};
HtmlDocument doc = web.Load(url);
HtmlNode bulletinMatin = doc.DocumentNode.SelectSingleNode("//*[@id='blockdetday0']/div[1]/p[1]");
MessageBox.Show(bulletinMatin.InnerText);
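For the "convert to UTF-8" part of the question, a rough offline sketch may help. It builds a tiny ISO 8859-1 page in memory (a stand-in for the real site), loads it with HtmlDocument.Load(Stream, Encoding), and only then produces UTF-8 bytes from the resulting string. The class name, the `ReadParagraph` helper and the sample markup are assumptions made for illustration.

```csharp
using System;
using System.IO;
using System.Text;
using HtmlAgilityPack;

public static class Latin1LoadExample
{
    // Parses raw page bytes using the encoding the page was written in
    // and returns the text of the first <p> element, or null if there is none.
    public static string ReadParagraph(byte[] pageBytes, Encoding pageEncoding)
    {
        var doc = new HtmlDocument();
        using (var stream = new MemoryStream(pageBytes))
        {
            doc.Load(stream, pageEncoding);
        }
        return doc.DocumentNode.SelectSingleNode("//p")?.InnerText;
    }

    public static void Main()
    {
        Encoding iso = Encoding.GetEncoding("iso-8859-1");

        // Hypothetical page content, encoded the way the server would send it.
        byte[] pageBytes = iso.GetBytes("<html><body><p>Météo à Troyes</p></body></html>");

        string text = ReadParagraph(pageBytes, iso);

        // The string is now ordinary .NET text; UTF-8 is chosen only on output.
        byte[] utf8 = Encoding.UTF8.GetBytes(text);
        Console.WriteLine(Encoding.UTF8.GetString(utf8) == "Météo à Troyes"); // True
    }
}
```

Once the page has been decoded with the right encoding, the text is a normal .NET string, so there is no separate conversion step; UTF-8 only comes into play when the string is written to a file or sent somewhere.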
