Base64 to HTML Decode - C#

I want to convert base64 data I have to HTML. When I do this conversion in code, the HTML file comes out corrupted and I cannot scrape it with Agility Pack. But when I do the conversion manually with an online tool, the HTML file comes out properly and I can scrape it. My code is as follows. Please help.
string base64Data = "/base64 in here";
byte[] decodedBytes = Convert.FromBase64String(base64Data);
string decodedText = Encoding.UTF8.GetString(decodedBytes);
string desktopPath = Environment.GetFolderPath(Environment.SpecialFolder.DesktopDirectory);
string filePath = Path.Combine(desktopPath, "decoded_data.html");
File.WriteAllText(filePath, decodedText);
HtmlWeb htmlWeb = new HtmlWeb();
HtmlDocument doc = htmlWeb.Load(filePath);
string name = doc.DocumentNode.SelectSingleNode("//*[@id='kunye']/tbody/tr[5]/td").InnerHtml;
Console.WriteLine(name);
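If the online tool produces a good file but Encoding.UTF8.GetString corrupts it, the decoded bytes are probably not UTF-8 (an assumption here, since the base64 payload isn't shown). A minimal sketch that sidesteps the problem by writing the raw bytes and letting Html Agility Pack detect the encoding itself:

// Write the decoded bytes directly; no UTF-8 round trip that could
// corrupt a payload stored in another encoding.
string base64Data = "/base64 in here";
byte[] decodedBytes = Convert.FromBase64String(base64Data);
string desktopPath = Environment.GetFolderPath(Environment.SpecialFolder.DesktopDirectory);
string filePath = Path.Combine(desktopPath, "decoded_data.html");
File.WriteAllBytes(filePath, decodedBytes);

// HtmlDocument.Load sniffs the encoding from the file contents/meta tags.
HtmlDocument doc = new HtmlDocument();
doc.Load(filePath);
var node = doc.DocumentNode.SelectSingleNode("//*[@id='kunye']/tbody/tr[5]/td");
Console.WriteLine(node?.InnerHtml);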

Related

What is the fastest way to get an HTML document node using XPath and the HtmlAgilityPack?

In my application I need to get the URL of the image of a blog post. In order to do this I'm using the HtmlAgilityPack.
This is the code I have so far:
static string GetBlogImageUrl(string postUrl)
{
string imageUrl = string.Empty;
using (WebClient client = new WebClient())
{
string htmlString = client.DownloadString(postUrl);
HtmlDocument htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(htmlString);
string xPath = "/html/body/div[contains(@class, 'container')]/div[contains(@class, 'content_border')]/div[contains(@class, 'single-post')]/main[contains(@class, 'site-main')]/article/header/div[contains(@class, 'featured_image')]/img";
HtmlNode node = htmlDocument.DocumentNode.SelectSingleNode(xPath);
imageUrl = node.GetAttributeValue("src", string.Empty);
}
return imageUrl;
}
The problem is that this is too slow: in my tests it takes about three seconds to extract the URL of the image on the given page, which is a problem when I'm loading a feed and trying to read several articles.
I tried to use the absolute XPath of the element I want to load, but I didn't notice any improvement. Is there a faster way to achieve this?
Can you try this code and see if it's faster or not?
string Url = "http://blog.cedrotech.com/5-tendencias-mobile-que-sua-empresa-precisa-acompanhar/";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(Url);
var featureDiv = doc.DocumentNode.Descendants("div").FirstOrDefault(_ => _.Attributes.Contains("class") && _.Attributes["class"].Value.Contains("featured_image"));
var img = featureDiv.ChildNodes.First(_ => _.Name.Equals("img"));
var imgUrl = img.GetAttributeValue("src", string.Empty); // Attributes["src"] returns an HtmlAttribute, not the string value
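It's also worth noting that most of those three seconds are almost certainly spent downloading the page, not evaluating the XPath, so when reading several feed articles the bigger win is usually downloading them concurrently. A sketch, using hypothetical URLs and assuming it runs inside an async method (requires System.Net.Http, System.Linq, and System.Threading.Tasks):

// Hypothetical feed URLs; substitute the posts from your feed.
string[] postUrls =
{
    "http://blog.cedrotech.com/post-1/",
    "http://blog.cedrotech.com/post-2/",
};

using (var client = new HttpClient())
{
    // Start all downloads at once, then await them together.
    Task<string>[] downloads = postUrls.Select(u => client.GetStringAsync(u)).ToArray();
    string[] pages = await Task.WhenAll(downloads);

    foreach (string html in pages)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        // ...extract the image URL with the Descendants query above...
    }
}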

Empty XML document response from API

I need information from the unofficial IMDb API, "omdbapi". The link I am sending is correct, but when I get the response, the document is null. I am using HtmlAgilityPack. What am I doing wrong?
Here is the direct link: http://www.omdbapi.com/?i=tt2231253&plot=short&r=xml
string url = "http://www.omdbapi.com/?i=" + ImdbID + "&plot=short&r=xml";
HtmlWeb source = new HtmlWeb();
HtmlDocument document = source.Load(url);
It's not HTML but an XML document you're expecting. Try this instead:
string url = "http://www.omdbapi.com/?i=tt2231253&plot=short&r=xml";
WebClient wc = new WebClient();
XDocument doc = XDocument.Parse(wc.DownloadString(url));
Console.WriteLine(doc);
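From there you can read fields with LINQ to XML (System.Xml.Linq). A sketch, assuming the shape the XML endpoint returned at the time: a root element containing a <movie> element whose data is stored in attributes:

XElement movie = doc.Root.Element("movie");
if (movie != null)
{
    // Attribute casts return null instead of throwing when missing.
    string title = (string)movie.Attribute("title");
    string year = (string)movie.Attribute("year");
    Console.WriteLine("{0} ({1})", title, year);
}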

Convert "iso-8859-1" to "utf-8" with HTML Agility Pack and xpath

I'm trying to get a piece of a web page, but I have a problem with special characters. How do I convert the data to get a correct reading? The website uses ISO-8859-1 and I must use UTF-8.
string url = "http://www.ta-meteo.fr/troyes.htm";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
HtmlNode bulletinMatin = doc.DocumentNode.SelectSingleNode("//*[@id='blockdetday0']/div[1]/p[1]");
MessageBox.Show(bulletinMatin.InnerText);
Thanks.
I solved the problem:
string url = "http://www.ta-meteo.fr/troyes.htm";
Encoding iso = Encoding.GetEncoding("iso-8859-1");
HtmlWeb web = new HtmlWeb()
{
AutoDetectEncoding = false,
OverrideEncoding = iso,
};
HtmlDocument doc = web.Load(url);
HtmlNode bulletinMatin = doc.DocumentNode.SelectSingleNode("//*[@id='blockdetday0']/div[1]/p[1]");
MessageBox.Show(bulletinMatin.InnerText);
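For completeness: once OverrideEncoding decodes the page correctly, the resulting .NET string is ordinary UTF-16, so producing UTF-8 is just a matter of choosing the encoding when you write it out. A one-line sketch with a hypothetical output path:

// Hypothetical path; Encoding.UTF8 only affects how the file is written.
File.WriteAllText(@"C:\temp\bulletin.txt", bulletinMatin.InnerText, Encoding.UTF8);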

C# encoding: Shift-JIS vs. UTF-8 with Html Agility Pack

I have a problem. My goal is to save some text from a (Japanese, Shift-JIS encoded) HTML page into a UTF-8 encoded text file.
But I don't really know how to encode the text. The HtmlNode object is encoded in Shift-JIS, but after I use the ToString() method, the content is corrupted.
My method so far looks like this:
public String getPage(String url)
{
String content = "";
HtmlDocument page = new HtmlWeb(){AutoDetectEncoding = true}.Load(url);
HtmlNode anchor = page.DocumentNode.SelectSingleNode("//div[contains(@class, 'article-def')]");
if (anchor != null)
{
content = anchor.InnerHtml;
}
return content;
}
I tried
Console.WriteLine(page.Encoding.EncodingName.ToString());
and got: Japanese Shift-JIS
But converting the HTML into a string produces the error. I thought there should be a way, but since documentation for Html Agility Pack is sparse and I couldn't really find a solution via Google, I'm here to get some hints.
Well, AutoDetectEncoding doesn't really work like you'd expect it to. From what I found by looking at the source code of the Agility Pack, the property is only used when loading a local file from disk, not from a URL.
So there are three options. One would be to just set the encoding:
OverrideEncoding = Encoding.GetEncoding("shift-jis")
If you know the encoding will always be the same, that's the easiest fix.
Or you could download the file locally and load it the same way you do now, but pass the file path instead of the URL.
using (var client=new WebClient())
{
client.DownloadFile(url, "20130519-OYT1T00606.htm");
}
var htmlWeb = new HtmlWeb(){AutoDetectEncoding = true};
var file = new FileInfo("20130519-OYT1T00606.htm");
HtmlDocument page = htmlWeb.Load(file.FullName);
Or you can detect the encoding from your content like this:
byte[] pageBytes;
using (var client = new WebClient())
{
pageBytes = client.DownloadData(url);
}
HtmlDocument page = new HtmlDocument();
using (var ms = new MemoryStream(pageBytes))
{
page.Load(ms);
var metaContentType = page.DocumentNode.SelectSingleNode("//meta[@http-equiv='Content-Type']").GetAttributeValue("content", "");
var contentType = new System.Net.Mime.ContentType(metaContentType);
ms.Position = 0;
page.Load(ms, Encoding.GetEncoding(contentType.CharSet));
}
And finally, if the page you are querying returns the Content-Type in the response, you can look here for how to get the encoding.
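For example, a sketch that reads the charset from the HTTP response headers (assuming the server actually sends one, with a fallback otherwise):

byte[] data;
string charset = null;
using (var client = new WebClient())
{
    data = client.DownloadData(url);
    // e.g. "text/html; charset=Shift_JIS"
    string contentTypeHeader = client.ResponseHeaders[HttpResponseHeader.ContentType];
    if (!string.IsNullOrEmpty(contentTypeHeader))
        charset = new System.Net.Mime.ContentType(contentTypeHeader).CharSet;
}
Encoding encoding = charset != null ? Encoding.GetEncoding(charset) : Encoding.UTF8;
HtmlDocument page = new HtmlDocument();
page.LoadHtml(encoding.GetString(data));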
Your code would of course need a few more null checks than mine does. ;)

iso-8859-1 string to Unicode string

I get the HTML source of a page with these lines of code:
WebClient client = new WebClient();
string sPage = client.DownloadString(url);
The page's character set is ISO-8859-1, but entity-encoded Unicode strings like the one below appear in the HTML source:
<span title="&#1590;&#1575;&#1601;&#1607;"
How can I convert this to a Unicode string in C#?
string s = HttpUtility.HtmlDecode(@"<span title=""&#1590;&#1575;&#1601;&#1607;");
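HttpUtility lives in System.Web; in a console or desktop app, WebUtility.HtmlDecode from System.Net does the same job for numeric entities. A small usage sketch:

using System.Net;

string s = WebUtility.HtmlDecode("<span title=\"&#1590;&#1575;&#1601;&#1607;\"");
Console.WriteLine(s); // <span title="ضافه"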
