In .NET, what is the best way to scrape HTML web pages?
Is there something open source that runs on .NET Framework 2.0 and puts all the HTML into objects? I have read about "HTML Agility Pack", but is there anything else?
I think HtmlAgilityPack is the best option, but you can also use:
Fizzler : a CSS selector engine for C#
SgmlReader : converts HTML to valid XML
SharpQuery : an alternative to Fizzler
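As a rough sketch of the Fizzler approach (this assumes the Fizzler.Systems.HtmlAgilityPack NuGet package, which layers a QuerySelectorAll extension method on top of HtmlAgilityPack nodes; the sample markup is made up):

```csharp
using System;
using Fizzler.Systems.HtmlAgilityPack; // NuGet: Fizzler.Systems.HtmlAgilityPack
using HtmlAgilityPack;                 // NuGet: HtmlAgilityPack

class FizzlerSketch
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<ul><li class='item'>one</li><li class='item'>two</li></ul>");

        // CSS selectors instead of XPath
        foreach (var li in doc.DocumentNode.QuerySelectorAll("li.item"))
            Console.WriteLine(li.InnerText);
    }
}
```

The same query in HtmlAgilityPack's native XPath would be //li[@class='item'].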
You might use Tidy.NET, a C# wrapper for the Tidy library that converts HTML to XHTML, available here: http://sourceforge.net/projects/tidynet/. That way you get valid XML and can process it as such.
I'd do it this way:
// don't forget to import TidyNet and System.Xml.Linq
var t = new Tidy();
TidyMessageCollection messages = new TidyMessageCollection();
t.Options.Xhtml = true;
//extra options if you plan to edit the result by hand
t.Options.IndentContent = true;
t.Options.SmartIndent = true;
t.Options.DropEmptyParas = true;
t.Options.DropFontTags = true;
t.Options.BreakBeforeBR = true;
string sInput = "your html code goes here";
var bytes = System.Text.Encoding.UTF8.GetBytes(sInput);
StringBuilder sbOutput = new StringBuilder();
var msIn = new MemoryStream(bytes);
var msOut = new MemoryStream();
t.Parse(msIn, msOut, messages);
var bytesOut = msOut.ToArray();
string sOut = System.Text.Encoding.UTF8.GetString(bytesOut);
XDocument doc = XDocument.Parse(sOut);
//process XML as you like
Otherwise, HTML Agility Pack is OK.
I have this HTML file: http://mek.oszk.hu/17700/17789/17789.htm, which I have already downloaded.
This HTML file uses the iso-8859-2 charset.
I want to convert this HTML file to a PDF file with the IronPdf NuGet package.
I tried this, but it doesn't work:
using (StreamReader stream = new StreamReader(book.Source, Encoding.GetEncoding("ISO-8859-2")))
{
    HtmlDocument htmlDocument = new HtmlDocument();
    htmlDocument.Load(stream);

    var Renderer = new IronPdf.HtmlToPdf();
    var PDF = Renderer.RenderHtmlAsPdf(htmlDocument.DocumentNode.OuterHtml);
    var OutputPath = "HtmlToPDF.pdf";
    PDF.SaveAs(OutputPath);
    System.Diagnostics.Process.Start(OutputPath);
}
My output result: (screenshot omitted)
UPDATE 1: This is the output result I want: (screenshot omitted)
For me it's Magyar :) but I obtained a better result with this piece of code:
var Renderer = new IronPdf.HtmlToPdf();
var PDF = Renderer.StaticRenderHTMLFileAsPdf("17789.htm", new IronPdf.PdfPrintOptions() { InputEncoding = Encoding.GetEncoding("ISO-8859-2") });
var OutputPath = "HtmlToPDF.pdf";
PDF.SaveAs(OutputPath);
System.Diagnostics.Process.Start(OutputPath);
Below is an example portion of a block of the HTML that I am trying to extract information from:
<a href="https://secure.tibia.com/community/?subtopic=characters&name=Alemao+Golpista" >Alemao Golpista</a></td><td style="width:10%;" >51</td><td style="width:20%;" >Knight</td></tr><tr class="Even" style="text-align:right;" ><td style="width:70%;text-align:left;" >
I am basically grabbing the entire HTML, which is a list of players online, and trying to append each player to a list with their Name (Alemao Golpista), Level (51), and Vocation (Knight).
Using regex for this is painful and pretty slow. How would I go about it using the Agility Pack?
Don't ever use regex to parse HTML files. As has already been stated, you should use the HtmlAgilityPack, even though examples are scarce on its site and the documentation isn't easy to find.
To get you started, here is how you can load an HtmlDocument and get the anchor tags' href attributes:
HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
try
{
    var temp = new Uri(url.Url);
    var request = (HttpWebRequest)WebRequest.Create(temp);
    request.Method = "GET";
    using (var response = (HttpWebResponse)request.GetResponse())
    using (var stream = response.GetResponseStream())
    {
        htmlDoc.Load(stream, Encoding.GetEncoding("iso-8859-9"));
    }
}
catch (WebException ex)
{
    Console.WriteLine(ex.Message);
}

HtmlNodeCollection c = htmlDoc.DocumentNode.SelectNodes("//a");
List<string> urls = new List<string>();
foreach (HtmlNode n in c)
{
    urls.Add(n.GetAttributeValue("href", ""));
}
The above code collects all the links on a webpage into a list of strings.
You should look into XPath, and you should also get the HAP documentation and read it. I couldn't find the documentation anywhere online, so I uploaded the copy I already had on my computer.
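To address the question directly, here is a sketch of pulling Name, Level, and Vocation out of each table row with XPath (assuming HtmlAgilityPack and the row/cell layout shown in the snippet above; the inline HTML is a trimmed stand-in for the real page):

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet: HtmlAgilityPack

class PlayerScraper
{
    static void Main()
    {
        // Trimmed stand-in for the "players online" table from the question.
        string html =
            "<table><tr class='Even'>" +
            "<td><a href='#'>Alemao Golpista</a></td>" +
            "<td>51</td><td>Knight</td>" +
            "</tr></table>";

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        List<string> players = new List<string>();
        // Each <tr> is one player; its three <td> cells are name, level, vocation.
        foreach (HtmlNode row in doc.DocumentNode.SelectNodes("//tr"))
        {
            HtmlNodeCollection cells = row.SelectNodes("td");
            if (cells == null || cells.Count < 3)
                continue; // skip header or malformed rows

            string name = cells[0].InnerText.Trim();
            string level = cells[1].InnerText.Trim();
            string vocation = cells[2].InnerText.Trim();
            players.Add(name + " | " + level + " | " + vocation);
        }

        foreach (string p in players)
            Console.WriteLine(p);
    }
}
```

SelectNodes returns null when nothing matches, so the null check on the cell collection is worth keeping against malformed rows.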
I am trying to read in the current day's Dilbert image. I am able to get the full text of the page by doing this:
var todayDate = DateTime.Now.ToString("yyyy-MM-dd");
var web = new HtmlWeb();
web.UseCookies = true;
var wp = new WebProxy("http://myproxy:8080");
wp.UseDefaultCredentials = true;
NetworkCredential nc = (NetworkCredential)CredentialCache.DefaultCredentials;
HtmlDocument document = web.Load("http://www.dilbert.com/strips/comic/" + todayDate, "GET", wp, nc);
If I look at the full HTML of the document, I see the image listed multiple times on the page, such as:
<meta property="og:image" content="http://assets.amuniversal.com/c2168fa0c45a0132d8f0005056a9545d"/>
or:
<meta name="twitter:image" content="http://assets.amuniversal.com/c2168fa0c45a0132d8f0005056a9545d">
or
<img alt="Squirrel In The Large Hadron Collider - Dilbert by Scott Adams" class="img-responsive img-comic" height="280" src="http://assets.amuniversal.com/c2168fa0c45a0132d8f0005056a9545d" width="900" />
What is the best way to parse out the URL of this picture?
You can try using HtmlAgilityPack or a similar library to parse the structure of the response HTML and then walk the DOM generated by the parser.
You can use HtmlAgilityPack if you are going to do lots of DOM manipulation, but a quick and dirty hack is to just use the built-in .NET string features.
This is untested and written without an IDE, but you could try something like:
var urlStartText = "<meta property=\"og:image\" content=\"";
var urlEndText = "\"/>";
var urlStartIndex = documentHtml.IndexOf(urlStartText)+urlStartText.Length;
var url = documentHtml.Substring(urlStartIndex, documentHtml.IndexOf(urlEndText, urlStartIndex) - urlStartIndex);
The idea is to find the start and end position of the HTML text surrounding the URL and then just use Substring to grab it. You could make a method like GetStringInBetween(string startText, string endText) so that it would be reusable.
Edit: here is an example of this turned into a method:
/// <summary>
/// Returns the text located between the start and end text within content.
/// </summary>
public static string GetStringInBetween(string content, string start, string end)
{
    var startIndex = content.IndexOf(start) + start.Length;
    return content.Substring(startIndex, content.IndexOf(end, startIndex) - startIndex);
}

string url = GetStringInBetween(documentHtml, "<meta property=\"og:image\" content=\"", "\"/>");
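A quick self-contained check of that helper (plain .NET strings only; the sample HTML and URL are made up):

```csharp
using System;

class GetStringInBetweenDemo
{
    /// <summary>
    /// Returns the text located between the start and end text within content.
    /// </summary>
    public static string GetStringInBetween(string content, string start, string end)
    {
        var startIndex = content.IndexOf(start) + start.Length;
        return content.Substring(startIndex, content.IndexOf(end, startIndex) - startIndex);
    }

    static void Main()
    {
        var documentHtml =
            "<head><meta property=\"og:image\" content=\"http://assets.example.com/abc123\"/></head>";

        string url = GetStringInBetween(
            documentHtml, "<meta property=\"og:image\" content=\"", "\"/>");

        Console.WriteLine(url); // prints http://assets.example.com/abc123
    }
}
```

Note that the end marker must match the markup exactly: the og:image tag in the page closes with "/> rather than ">, so passing the wrong end text makes IndexOf search past the URL.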
I want to parse the HTML of a website in my C# program.
First, I use the SGMLReader DLL to convert the HTML to XML. I use the following method for this:
XmlDocument FromHtml(TextReader reader)
{
    // set up SgmlReader
    Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader();
    sgmlReader.DocType = "HTML";
    sgmlReader.WhitespaceHandling = WhitespaceHandling.None;
    sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
    sgmlReader.InputStream = reader;

    // create document
    XmlDocument doc = new XmlDocument();
    doc.PreserveWhitespace = true;
    doc.XmlResolver = null;
    doc.Load(sgmlReader);
    return doc;
}
Next, I read a website and try to look for the header node:
var client = new WebClient();
var xmlDoc = FromHtml(new StringReader(client.DownloadString(@"http://www.switchonthecode.com")));
var result = xmlDoc.DocumentElement.SelectNodes("head");
However, this query gives an empty result (count == 0). But when I inspect the results view of xmlDoc.DocumentElement, I see the following: (screenshot omitted)
Any ideas why there are no results? Note that when I try another site, like http://www.google.com, it works.
You need to select using the namespace explicitly; see this question.
XmlNamespaceManager manager = new XmlNamespaceManager(doc.NameTable);
manager.AddNamespace("ns", "http://www.w3.org/1999/xhtml");
doc.DocumentElement.SelectNodes("ns:head", manager);
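To see the difference in isolation, here is a self-contained sketch using only System.Xml (the tiny XHTML snippet is made up):

```csharp
using System;
using System.Xml;

class NamespaceSelectDemo
{
    static void Main()
    {
        var doc = new XmlDocument();
        // XHTML documents put every element in the XHTML namespace,
        // which is what SgmlReader produces from real pages.
        doc.LoadXml(
            "<html xmlns=\"http://www.w3.org/1999/xhtml\">" +
            "<head><title>demo</title></head><body/></html>");

        // Without a namespace manager, the plain name matches nothing.
        Console.WriteLine(doc.DocumentElement.SelectNodes("head").Count); // 0

        // With the namespace bound to a prefix, the same element is found.
        var manager = new XmlNamespaceManager(doc.NameTable);
        manager.AddNamespace("ns", "http://www.w3.org/1999/xhtml");
        Console.WriteLine(doc.DocumentElement.SelectNodes("ns:head", manager).Count); // 1
    }
}
```

This also explains why google.com worked for the asker: pages served without a default xmlns end up with unprefixed element names, so the plain XPath "head" matches.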
You can use HTML Agility Pack instead. It's an open-source HTML parser.
This seems like it should be an easy thing to do, but I am having some major issues with it. I am trying to parse for a specific tag with the HAP. I used Firebug to find the XPath I want and came up with //*[@id="atfResults"]. I believe my issue is with the " characters, since they signal the start and end of a new string. I have tried making it a verbatim string, but I get errors. I have attached the function:
public List<string> GetHtmlPage(string strURL)
{
    // the HTML retrieved from the page
    WebResponse objResponse;
    WebRequest objRequest = System.Net.HttpWebRequest.Create(strURL);
    objResponse = objRequest.GetResponse();

    // the using keyword will automatically dispose the object
    // once complete
    using (StreamReader sr = new StreamReader(objResponse.GetResponseStream()))
    {
        string strContent = sr.ReadToEnd();
        // Close and clean up the StreamReader
        sr.Close();

        /*Regex regex = new Regex("<body>((.|\n)*?)</body>", RegexOptions.IgnoreCase);
        //Here we apply our regular expression to our string using the
        //Match object.
        Match oM = regex.Match(strContent);
        Result = oM.Value;*/

        HtmlDocument doc = new HtmlDocument();
        doc.Load(new StringReader(strContent));
        HtmlNode root = doc.DocumentNode;
        List<string> itemTags = new List<string>();
        string listingtag = "//*[@id="atfResults"]"; // this line does not compile
        foreach (HtmlNode link in root.SelectNodes(listingtag))
        {
            string att = link.OuterHtml;
            itemTags.Add(att);
        }
        return itemTags;
    }
}
You can escape it:
string listingtag = "//*[@id=\"atfResults\"]";
If you wanted to use a verbatim string, it would be:
string listingtag = @"//*[@id=""atfResults""]";
As you can see, verbatim strings don't really provide a benefit here.
However, you can instead use:
HtmlNode link = doc.GetElementById("atfResults");
This will also be slightly faster.
Have you tried this:
string listingtag = "//*[#id='atfResults']";