Below is an example portion of a block of the HTML that I am trying to extract information from:
<a href="https://secure.tibia.com/community/?subtopic=characters&name=Alemao+Golpista" >Alemao Golpista</a></td><td style="width:10%;" >51</td><td style="width:20%;" >Knight</td></tr><tr class="Even" style="text-align:right;" ><td style="width:70%;text-align:left;" >
I am basically grabbing the entire HTML, which is a list of players online, and trying to append each player to a list with their: Name (Alemao Golpista), Level (51), and Vocation (Knight).
Using regex for it is a pain in the ass and pretty slow. How would I go about it using the Agility Pack?
Don't ever use regex to parse HTML files. As has already been stated, you should use whatever HtmlAgilityPack examples you can find, even though they are scarce on their site, and the documentation isn't easy to find.
To get you started, here is how you can load an HtmlDocument and get the anchor tags' href attributes.
using System;
using System.Collections.Generic;
using System.Net;
using System.Text;
using HtmlAgilityPack;

HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
try
{
    // Request the page and load the response stream into the document
    var temp = new Uri(url.Url);
    var request = (HttpWebRequest)WebRequest.Create(temp);
    request.Method = "GET";
    using (var response = (HttpWebResponse)request.GetResponse())
    {
        using (var stream = response.GetResponseStream())
        {
            htmlDoc.Load(stream, Encoding.GetEncoding("iso-8859-9"));
        }
    }
}
catch (WebException ex)
{
    Console.WriteLine(ex.Message);
}

// Select every anchor tag and collect its href attribute
HtmlNodeCollection c = htmlDoc.DocumentNode.SelectNodes("//a");
List<string> urls = new List<string>();
foreach (HtmlNode n in c)
{
    urls.Add(n.GetAttributeValue("href", ""));
}
The above code gets you all the links of a webpage in a list of strings.
You should look into XPath, and you should also get the HAP documentation and read it. I couldn't find the documentation anywhere online, so I uploaded the copy I already had on my computer.
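For the player list in your question, something along these lines should get you the name, level, and vocation (a rough sketch only; it assumes every data row keeps the character link in the first cell and the level and vocation in the next two cells, as in your snippet):
// Sketch: parse name/level/vocation from the online-player table rows.
// Assumes the row layout shown in the question: <td><a>name</a></td><td>level</td><td>vocation</td>
var players = new List<Tuple<string, int, string>>();
var rows = htmlDoc.DocumentNode.SelectNodes("//tr[td/a[contains(@href, 'subtopic=characters')]]");
if (rows != null)
{
    foreach (var row in rows)
    {
        var cells = row.SelectNodes("td");
        if (cells == null || cells.Count < 3) continue;

        var nameNode = cells[0].SelectSingleNode(".//a");
        if (nameNode == null) continue;

        string name = nameNode.InnerText.Trim();
        int level = int.Parse(cells[1].InnerText.Trim());
        string vocation = cells[2].InnerText.Trim();
        players.Add(Tuple.Create(name, level, vocation));
    }
}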
I need to get every single file from a URL so that I can iterate over them.
The idea is to resize each image using ImageMagick, but first I need to be able to get the files and iterate over them.
Here is the code I have so far:
using System;
using System.Net;
using System.IO;
using System.Text.RegularExpressions;

namespace Example
{
    public class MyExample
    {
        public static void Main(String[] args)
        {
            string url = "https://www.paz.cl/imagenes_cotizador/BannerPrincipal/";
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            {
                using (StreamReader reader = new StreamReader(response.GetResponseStream()))
                {
                    string html = reader.ReadToEnd();
                    Console.WriteLine(html);
                }
            }
            Console.ReadLine();
        }
    }
}
This returns the entire HTML of the URL. However, I just need the files (all the images) so I can work with them as I intend.
Any idea how to achieve this?
I looked at that page, and it's a directory/file list. You can use Regex to extract all links to images from the body of that page.
Here's a pattern I could think of: HREF="([^"]+\.(jpg|png))
Build your regex object, iterate over the matches, and download each image:
var regex = new System.Text.RegularExpressions.Regex("HREF=\"([^\"]+\\.(jpg|png))");
var matches = regex.Matches(html); // html is the string you already downloaded
foreach (var match in matches)
{
    var imagePath = match.ToString().Substring("HREF=\"".Length);
    Console.WriteLine(imagePath);
}
Now, concatenate the base url https://www.paz.cl with the image relative path obtained above, issue another request to that url to download the image and process it as you wish.
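A rough sketch of that last step could look like this (it assumes you collected the matches from the loop above into a hypothetical imagePaths list, and that the paths are relative to the site root):
// Sketch only: download each image found by the regex above.
// imagePaths is a hypothetical List<string> filled inside the previous loop.
var baseUrl = "https://www.paz.cl";
using (var client = new WebClient())
{
    foreach (var imagePath in imagePaths)
    {
        var imageUrl = baseUrl + "/" + imagePath.TrimStart('/');
        byte[] imageBytes = client.DownloadData(imageUrl);
        // Save locally so ImageMagick (or anything else) can process the file
        File.WriteAllBytes(Path.GetFileName(imagePath), imageBytes);
    }
}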
You can use the HTML Agility Pack, for example:
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
var htmlNodes = htmlDoc.DocumentNode.SelectNodes("//a");
foreach (var node in htmlNodes)
{
    Console.WriteLine(node.Attributes["href"].Value);
}
You can use AngleSharp to load and parse the html page. Then you can extract all the information you need.
// TODO add a reference to NuGet package AngleSharp
private static async Task Main(string[] args)
{
    var config = Configuration.Default.WithDefaultLoader();
    var address = "https://www.paz.cl/imagenes_cotizador/BannerPrincipal";
    var context = BrowsingContext.New(config);
    var document = await context.OpenAsync(address);
    var images = document.Images.Select(img => img.Source);
}
AngleSharp implements the W3C standard, so it works better than HtmlAgilityPack on real-world web pages.
I have several thousand HTML-generated invoices (ASP.NET, messy HTML) that I'm trying to parse and save into a database.
Basically like:
foreach (var htmlDoc in HtmlFolder)
{
    foreach (var inputBox in htmlDoc)
    {
        // Make a collection of IDs and values, insert into DB
    }
}
From all the other questions I've read, the best tool for this type of problem is the HtmlAgilityPack; however, for the life of me I can't get the documentation .chm file to work. Any ideas on how I could accomplish this with or without the Agility Pack?
Thanks in advance
A newer alternative to HtmlAgilityPack is CsQuery. See this later question on its relative performance merits, but its use of CSS selectors can't be beat:
var doc = CQ.CreateDocumentFromFile(htmldoc); // load and parse the file
var fields = doc["input"]; // get input fields with a CSS selector
var pairs = fields.Select(node => new Tuple<string, string>(node.Id, node.Value())); // get (id, value) pairs
To get the CHM to work, you probably need to view the properties in Windows Explorer and uncheck the "Unblock Content" checkbox.
The HTML Agility Pack is quite easy when you know your way around Linq-to-XML or XPath.
Basics you'll need to know:
// import the HtmlAgilityPack
using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();

// Load your data
// -----------------------------
// Load doc from file:
doc.Load(pathToFile);
// OR
// Load doc from string:
doc.LoadHtml(contentsOfFile);
// -----------------------------

// Find what you're after
// -----------------------------
// Finding things using Linq
var nodes = doc.DocumentNode.DescendantsAndSelf("input")
    .Where(node => !string.IsNullOrWhiteSpace(node.Id)
        && node.Attributes["value"] != null
        && !string.IsNullOrWhiteSpace(node.Attributes["value"].Value));
// OR
// Finding things using XPath
var nodes = doc.DocumentNode
    .SelectNodes("//input[not(@id='') and not(@value='')]");
// -----------------------------

// Looping through the nodes:
// the XPath interface can return null when no nodes are found
if (nodes != null)
{
    foreach (var node in nodes)
    {
        var id = node.Id;
        var value = node.Attributes["value"].Value;
    }
}
The easiest way to add the HtmlAgilityPack is using NuGet:
PM> Install-Package HtmlAgilityPack
Hah, looks like the ideal time to make a shameless plug of a library I wrote!
This should be rather easy to accomplish with this library (which is built on top of HtmlAgilityPack, by the way!): https://github.com/amoerie/htmlbuilders
(You can find the Nuget package here: https://www.nuget.org/packages/HtmlBuilders/ )
Code samples:
const string html = "<div class='invoice'><input type='text' name='abc' value='123'/><input id='ohgood' type='text' name='def' value='456'/></div>";
var htmlDocument = new HtmlDocument { OptionCheckSyntax = false }; // avoid exceptions when html is invalid
htmlDocument.Load(new StringReader(html));
var tag = HtmlTag.Parse(htmlDocument);     // if there is a root tag
var tags = HtmlTag.ParseAll(htmlDocument); // if there is no root tag

// Find looks recursively through the entire DOM tree
var inputFields = tag.Find(t => string.Equals(t.TagName, "input"));
foreach (var inputField in inputFields)
{
    Console.WriteLine(inputField["type"]);
    Console.WriteLine(inputField["value"]);
    if (inputField.HasAttribute("id"))
        Console.WriteLine(inputField["id"]);
}
Note that inputField[attribute] will throw a 'KeyNotFoundException' if that field does not have the specified attribute name. That's because HtmlTag implements and reuses IDictionary logic for its attributes.
Edit: If you're not running this code in a web environment, you'll need to add a reference to System.Web. That's because this library makes use of the HtmlString class which can be found in System.Web. Just choose 'Add reference' and then you can find it under 'Assemblies > Framework'
You can download the HtmlAgilityPack documentation CHM file from here.
If the CHM file contents are not visible, un-check the "Always ask before opening this file" check-box, as shown in the screenshot.
Note: the above dialog appears for unsigned files.
Source: HtmlAgilityPack Documentation
In .NET, what is the best way to scrape HTML web pages?
Is there something open source that runs on .NET Framework 2.0 and puts all the HTML into objects? I have read about the "HTML Agility Pack", but is there anything else?
I think HtmlAgilityPack is the best option, but you can also use:
Fizzler : a CSS selector engine for C# (a short sketch follows this list)
SgmlReader : converts HTML to valid XML
SharpQuery : an alternative to Fizzler
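For example, a minimal Fizzler sketch (assuming the Fizzler and Fizzler.Systems.HtmlAgilityPack NuGet packages, which layer CSS selectors on top of HtmlAgilityPack):
using System;
using HtmlAgilityPack;
using Fizzler.Systems.HtmlAgilityPack; // adds QuerySelectorAll to HtmlNode

var doc = new HtmlDocument();
doc.LoadHtml(html); // html: the page source you already downloaded

// CSS selector instead of XPath
foreach (var link in doc.DocumentNode.QuerySelectorAll("a[href]"))
{
    Console.WriteLine(link.GetAttributeValue("href", ""));
}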
You might use Tidy.NET, a C# wrapper for the Tidy library that converts HTML to XHTML (available here: http://sourceforge.net/projects/tidynet/), so you could get valid XML and process it as such.
I'd make it this way:
// don't forget to import TidyNet and System.Xml.Linq
var t = new Tidy();
TidyMessageCollection messages = new TidyMessageCollection();
t.Options.Xhtml = true;
//extra options if you plan to edit the result by hand
t.Options.IndentContent = true;
t.Options.SmartIndent = true;
t.Options.DropEmptyParas = true;
t.Options.DropFontTags = true;
t.Options.BreakBeforeBR = true;
string sInput = "your html code goes here";
var bytes = System.Text.Encoding.UTF8.GetBytes(sInput);
StringBuilder sbOutput = new StringBuilder();
var msIn = new MemoryStream(bytes);
var msOut = new MemoryStream();
t.Parse(msIn, msOut, messages);
var bytesOut = msOut.ToArray();
string sOut = System.Text.Encoding.UTF8.GetString(bytesOut);
XDocument doc = XDocument.Parse(sOut);
//process XML as you like
Otherwise, HTML Agility pack is ok.
This seems like it should be an easy thing to do, but I am having some major issues with it. I am trying to parse for a specific tag with the HAP. I use Firebug to find the XPath I want and come up with //*[@id="atfResults"]. I believe my issue is with the " characters, since they signal the start and end of a string. I have tried making it a verbatim string, but I get errors. I have attached the function:
public List<string> GetHtmlPage(string strURL)
{
    // the html retrieved from the page
    WebResponse objResponse;
    WebRequest objRequest = System.Net.HttpWebRequest.Create(strURL);
    objResponse = objRequest.GetResponse();
    // the using keyword will automatically dispose the object
    // once complete
    using (StreamReader sr =
        new StreamReader(objResponse.GetResponseStream()))
    { //*[@id="atfResults"]
        string strContent = sr.ReadToEnd();
        // Close and clean up the StreamReader
        sr.Close();
        /*Regex regex = new Regex("<body>((.|\n)*?)</body>", RegexOptions.IgnoreCase);
        //Here we apply our regular expression to our string using the
        //Match object.
        Match oM = regex.Match(strContent);
        Result = oM.Value;*/
        HtmlDocument doc = new HtmlDocument();
        doc.Load(new StringReader(strContent));
        HtmlNode root = doc.DocumentNode;
        List<string> itemTags = new List<string>();
        string listingtag = "//*[@id="atfResults"]"; // <-- this is the line that will not compile
        foreach (HtmlNode link in root.SelectNodes(listingtag))
        {
            string att = link.OuterHtml;
            itemTags.Add(att);
        }
        return itemTags;
    }
}
You can escape it:
string listingtag = "//*[#id=\"atfResults\"]";
If you wanted to use a verbatim string, it would be:
string listingtag = @"//*[@id=""atfResults""]";
As you can see, verbatim strings don't really provide a benefit here.
However, you can instead use:
HtmlNode link = doc.GetElementbyId("atfResults");
This will also be slightly faster.
Have you tried this:
string listingtag = "//*[#id='atfResults']";
I have a page that contains some links to .mp3/.wav files in this format:
File Name
What I need is to make a script that will download all these files instead of downloading them myself.
I know that I can use regular expressions to do something like that, but I don't know how, and what is the best choice for doing it (Java, C#, JavaScript)?
Any help will be appreciated
Thanks in Advance
You could use SgmlReader to parse the DOM and extract all the anchor links and then download the corresponding resources:
using System;
using System.Net;
using System.Xml;
using Sgml;

class Program
{
    static void Main()
    {
        using (var reader = new SgmlReader())
        {
            reader.DocType = "HTML";
            reader.Href = "http://www.example.com";

            // Load the cleaned-up markup into an XmlDocument
            var doc = new XmlDocument();
            doc.Load(reader);

            // Select the href attributes of anchors pointing to mp3/wav files
            var anchors = doc.SelectNodes("//a/@href[contains(., 'mp3') or contains(., 'wav')]");
            foreach (XmlAttribute href in anchors)
            {
                using (var client = new WebClient())
                {
                    var data = client.DownloadData(href.Value);
                    // TODO: do something with the downloaded data
                }
            }
        }
    }
}
Well, if you want to go hard-core, I think parsing the page with DOMDocument ( http://php.net/manual/en/class.domdocument.php ) and retrieving the files with cURL would do it if you're ok with PHP.
How many files are we talking about here?
Python's Beautiful Soup library is well-suited to this task:
http://www.crummy.com/software/BeautifulSoup/
It could be used in this way:
import urllib2, re
from BeautifulSoup import BeautifulSoup
#open the URL
page = urllib2.urlopen("http://www.foo.com")
#parse the page
soup = BeautifulSoup(page)
#get all anchor elements
anchors = soup.findAll("a")
#filter anchors based on their href attribute
filteredAnchors = filter(lambda a : re.search("\.wav",a["href"]) or re.search("\.mp3",a["href"]), anchors)
urlsToDownload = map(lambda a : a["href"],filteredAnchors)
#download each anchor url...
See here for instructions on downloading the mp3's from their URLs: How do I download a file over HTTP using Python?