I'm trying to get a list of PDF links from different websites. First I'm using the Web client class to download the page source. I then use sgmlReader to convert the HTML to XML. So for one particular site, I'll get a tag that looks like this:
<p>1985 to 1997 Board Action Summary</p>
I need to grab all the links that contain ".pdf". Obviously not all websites are laid out the same, so just searching for a <p> tag, wont be dynamic enough. I'd rather not use linq, but I will if I have to. Thanks in advance.
Linq makes this easy...
var hrefs = doc.Root.Descendants("a")
.Where(a => a.Attrib("href").Value.ToUpper().EndsWith(".PDF"))
.Select(a => a.Attrib("href"));
away you go! (note: did this from memory, so you might have to fix it somewhat)
This will break down for <a/> tags that don't have an href (anchors) but you can fix that surely...
I think you have 2 options here. If you need only the links, you can use Regular Expressions to find the matches for strings ending with .pdf. If you need to manipulate the XML structure or get other values from the XML, it would be better to use XmlDocument and use an XPath query to find out the nodes which have a link to a pdf file in it. Using LINQ to XML just reduces the number of lines of code you have to write.
Related
I'm trying to scrape a webpage (Pub Med) to see how many references appear in specific articles(some articles have references, some don't). However, the problem i'm having right now is that the divs are all nested and named the same thing so I haven't been able to figure out what code is required to get the elements.
So far i've tried using contains to see if I could just grab a catch all and dig my way into the node from there but that hasn't worked.
.SelectNodes("//div[contains(#class,'portlet_title')]");
I also have tried copying the XPath but all I would get is null as a result
.SelectNodes("//*[#id="disc_col"]/div[3]/div[1]/div/h3/span");
Any help would be appreciated as I am no master at Xpath.
And for reference, a page that would fit my criteria is:
http://www.ncbi.nlm.nih.gov/pubmed/?term=23489346 (right hand side says Cited by * articles).
I've also browsed some other responses however they all seemed to be for results with differently named Divs ( ie get all the divs ids on a html page using Html Agility Pack). Either I dont understand how to use this correctly, or my problem is different.
Thanks again.
Mike! Try use
var titles = website.DocumentNode.SelectNodes("//div[#class='portlet_title']");
The errors in your XPaths are: 1. attributes are written just in "[]" with "#" symbol like I wrote; 2. in every XPath node you should write an index e.g. "//div[3]/div[1]/div[1]/h3[1]/span[1]".
Good luck!
I have built a little crawler and now when trying it out i found that when crawling certain sites my crawler uses 98-99% CPU.
I used dotTrace to see what the problem could be and it pointed me towards my httpwebrequest method - i optimised it a bit with the help of some previous questions here on stackoverflow.. but the problem was still there.
I then went to see what URLs that were causing the CPU load and found that it was actually sites that are extremely large in size - go figure :)
So, now i am 99% certain it has to do with the following piece of code:
HtmlAgilityPack.HtmlDocument documentt = new HtmlAgilityPack.HtmlDocument();
HtmlAgilityPack.HtmlNodeCollection list;
HtmlAgilityPack.HtmlNodeCollection frameList;
documentt.LoadHtml(_html);
list = documentt.DocumentNode.SelectNodes(".//a[#href]");
All that i want to do is to extract the links on the page, so for large sites.. is there anyway i can get this to not use so much CPU?
I was thinking maybe limit what i fetch? What would be my best option here?
Certainly someone must have run into this problem before :)
Have you tried dropping the XPath and using the LINQ functionality?
var list = documentt.DocumentNode.Descendants("a").Select(n => n.GetAttributeValue("href", string.Empty);
That'll pull a list of the href attribute of all anchor tags as a List<string>.
If you aren't heavily invested in Html Agility Pack, try using CsQuery instead. It builds an index when parsing the documents, and selectors are much faster than HTML Agility Pack. See a comparison.
CsQuery is a .NET jQuery port with a full CSS selector engine; it lets you use CSS selectors as well as the jQuery API to access and manipulate HTML. It's on nuget as CsQuery.
".//a[#href]" is extremely slow XPath. Tried to replace with "//a[#href]" or with code that simply walks whole document and checks all A nodes.
Why this XPath is slow:
"." starting with a node
"//" select all descendent nodes
"a" - pick only "a" nodes
"#href" with href.
Portion 1+2 ends up with "for every node select all its descendant nodes" which is very slow.
I need some advice and possible code examples for parsing an HTML table from a website. I'm using the webclient class to download the html from an address. I then need to find the table I want the data from. So for example if the table id is <table id="cia_list", I want to loop through the <td> tags and get just the text inside them. What would be the best way to approach this?
In the past I have converted the HTML to XML and then used XSLT to parse the results. If this is an approach you want to take I would recommend looking at SGMLReader, which will handle the conversion.
People will often attempt to use regex to do what you are talking about. This is something I typically advise against. Here is an amusing post that goes over some of the reasons not to do this:
RegEx match open tags except XHTML self-contained tags
I've tried this and searched for help but I cannot figure it out. I can get the source for a page but I don't need the whole thing, just one string that is repeated. Think of it like trying to grab only the titles of articles on a page and adding them in order to an array without losing any special characters. Can someone shed some light?
You can use a Regular Expression
to extract the content you want from a string, such as your html string.
Or you can use a DOM parser such as
Html Agility Pack
Hope this helps!
You could use something like this -
var text = "12 hello 45 yes 890 bye 999";
var matches = System.Text.RegularExpressions.Regex.Matches(text,#"\d+").Cast<Match>().Select(m => m.Value).ToList();
The example pulls all numbers in the text variable into a list of strings. But you could change the Regular Expression to do something more suited to your needs.
if the page is well-formed xml, you could use linq to xml by loading the page into an XDocument and using XPath or another way of traversing to the element(s) you desire and loading what you need into the array for which you are looking (or just use the enumerable if all you want to do is enumerate). if the page is not under your control, though, this is a brittle solution that could break at any time when subtle changes could break the well-formedness of the xml. if that's the case, you're probably better off using regular expressions. eiither way, though, the page could be changed under you and your code suddenly won't work anymore.
the best thing you could do would be to get the provider of the page to expose what you need as a webservice rather than trying to scrape their page.
I have something of a a hairy problem, I'd like to generate a couple of paragraphs of "description" of a given url, normally the start of an article. The Meta description field is one way to go but it isn't always good or set properly.
It's fair to say it's a bit problematic to accomplish this from the screenscraped HTML. I had a general idea that perhaps one could scan the HTML for the first "appropriate" segment but it's hard to say what that is, perhaps something like the first paragraph containing a certain amount of text...
Anyone have any good ideas? :) It doesn't have to be foolproof
So, you wanna become a new Google, heh? :-)
Many sites are "SEO friendly" these days. This enables you to go for the headings and then look for paragraphs bellow.
Also, look for lists. There is a lot of content in some sort of tab-like (tabs, accordions...) interfaces that is done using ordered or unordered lists.
If that fails, maybe look for a div with class "content" or "main" or a combination and start from there.
If you use different approaches, make sure you keep statistics of what worked and what didn't (maybe even save a full page), so you can review and tweak your parsing and searching methods.
As a side note, I've used htmlagilitypack to parse and search through html with success. Well, at leasts it beats parsing with regex :-)
Perhaps look for the div element that contains the most p elements, and then grab the first p child. If no div, get the first p from the body element.
This will always have its problems.
You can strip the HTML tags using this regular expression
string stripped = Regex.Replace(textBox1.Text,#"<(.|\n)*?>",string.Empty)
You will them get the content text you can use to generate your paragraphs.